
Biometrika (1994), 81, 3, pp. 425-455
Printed in Great Britain

Ideal spatial adaptation by wavelet shrinkage


BY DAVID L. DONOHO AND IAIN M. JOHNSTONE
Department of Statistics, Stanford University, Stanford, California, 94305-4065, U.S.A.

SUMMARY
With ideal spatial adaptation, an oracle furnishes information about how best to adapt
a spatially variable estimator, whether piecewise constant, piecewise polynomial, variable
knot spline, or variable bandwidth kernel, to the unknown function. Estimation with the
aid of an oracle offers dramatic advantages over traditional linear estimation by nonadapt-
ive kernels; however, it is a priori unclear whether such performance can be obtained by
a procedure relying on the data alone. We describe a new principle for spatially-adaptive
estimation: selective wavelet reconstruction. We show that variable-knot spline fits and
piecewise-polynomial fits, when equipped with an oracle to select the knots, are not dra-
matically more powerful than selective wavelet reconstruction with an oracle. We develop
a practical spatially adaptive method, RiskShrink, which works by shrinkage of empirical
wavelet coefficients. RiskShrink mimics the performance of an oracle for selective wavelet
reconstruction as well as it is possible to do so. A new inequality in multivariate normal
decision theory which we call the oracle inequality shows that attained performance differs
from ideal performance by at most a factor of approximately 2 log n, where n is the sample
size. Moreover no estimator can give a better guarantee than this. Within the class of
spatially adaptive procedures, RiskShrink is essentially optimal. Relying only on the data,
it comes within a factor log² n of the performance of piecewise polynomial and variable-
knot spline methods equipped with an oracle. In contrast, it is unknown how or if piecewise
polynomial methods could be made to function this well when denied access to an oracle
and forced to rely on data alone.
Some key words: Minimax estimation subject to doing well at a point; Orthogonal wavelet bases of compact
support; Piecewise-polynomial fitting; Variable-knot spline.

1. INTRODUCTION

1.1. General
Suppose we are given data
$$ y_i = f(t_i) + e_i \qquad (i = 1, \ldots, n), $$
$t_i = i/n$, where the $e_i$ are independently distributed as $N(0, \sigma^2)$, and $f(\cdot)$ is an unknown function which we would like to recover. We measure performance of an estimate $\hat f(\cdot)$ in terms of quadratic loss at the sample points. In detail, let $f = (f(t_i))_{i=1}^n$ and $\hat f = (\hat f(t_i))_{i=1}^n$ denote the vectors of true and estimated sample values, respectively. Let $\|v\|_{2,n}^2 = \sum_{i=1}^n v_i^2$ denote the usual squared $\ell_n^2$ norm; we measure performance by the risk
$$ R(\hat f, f) = n^{-1} E \|\hat f - f\|_{2,n}^2, \qquad (1) $$
which we would like to make as small as possible. Although the notation $\hat f$ suggests a
function of a real variable $t$, in this paper we work only with the equally spaced sample points $t_i$.

1.2. Spatially adaptive methods


A variety of spatially adaptive methods has been proposed in the statistical literature,
such as CART (Breiman et al., 1983), Turbo (Friedman & Silverman, 1989), MARS
(Friedman, 1991), and variable-bandwidth kernel methods (Müller & Stadtmüller, 1987).
Such methods have presumably been introduced because they were expected to do a better
job in recovery of the functions actually occurring with real data than do traditional
methods based on a fixed spatial scale, such as Fourier series methods, fixed-bandwidth
kernel methods, and linear spline smoothers. Informal conversations with Leo Breiman
and Jerome Friedman have confirmed this assumption.

We now describe a simple framework which encompasses the most important spatially
adaptive methods, and allows us to develop our main theme efficiently. We consider
estimates $\hat f$ defined as
$$ \hat f = T(y, \hat\delta(y)), \qquad (2) $$
where $T(y, \delta)$ is a reconstruction formula with 'spatial smoothing' parameter $\delta$, and $\hat\delta(y)$ is a data-adaptive choice of the spatial smoothing parameter $\delta$. A clearer picture of what we intend emerges from five examples.
Example 1: Piecewise constant reconstruction $T_{PC}(y, \delta)$. Here $\delta$ is a finite list of, say, $L$ real numbers defining a partition $(I_1, \ldots, I_L)$ of $[0, 1]$ via
$$ I_1 = [0, \delta_1), \quad I_2 = [\delta_1, \delta_1 + \delta_2), \ \ldots, \ I_L = [\delta_1 + \cdots + \delta_{L-1},\ \delta_1 + \cdots + \delta_L], $$
so that $\sum_{l=1}^L \delta_l = 1$. Note that $L$ is a variable. The reconstruction formula is
$$ T_{PC}(y, \delta)(t) = \sum_{l=1}^L \mathrm{Ave}(y_i : t_i \in I_l)\, 1_{I_l}(t); $$
piecewise constant reconstruction using the means of the data within each piece to estimate the pieces.
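To make the formula concrete, here is a minimal sketch of ours (not from the paper) of $T_{PC}$ in Python/numpy; the function name and arguments are our own choices.

```python
import numpy as np

def t_pc(y, t, delta):
    """Piecewise-constant reconstruction T_PC(y, delta) at the sample points t_i.

    y     : observed values y_i
    t     : sample points t_i in [0, 1]
    delta : interval widths (delta_1, ..., delta_L) summing to 1
    """
    edges = np.concatenate(([0.0], np.cumsum(delta)))
    fit = np.empty_like(y, dtype=float)
    for l, (left, right) in enumerate(zip(edges[:-1], edges[1:])):
        if l == len(delta) - 1:
            in_piece = (t >= left) & (t <= right)   # last interval is closed on the right
        else:
            in_piece = (t >= left) & (t < right)
        if np.any(in_piece):
            fit[in_piece] = y[in_piece].mean()      # Ave(y_i : t_i in I_l)
    return fit
```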
Example 2: Piecewise polynomials $T_{PP(D)}(y, \delta)$. Here the interpretation of $\delta$ is the same as in Example 1, only the reconstruction uses polynomials of degree $D$:
$$ T_{PP(D)}(y, \delta)(t) = \sum_{l=1}^L \hat p_l(t)\, 1_{I_l}(t), $$
where $\hat p_l(t) = \sum_{k=0}^D a_k t^k$ is determined by applying the least-squares principle to the data arising for interval $I_l$:
$$ \sum_{t_i \in I_l} \{\hat p_l(t_i) - y_i\}^2 = \min. $$
Example 3: Variable-knot splines $T_{SPL(D)}(y, \delta)$. Here $\delta$ defines a partition as above, and on each interval of the partition the reconstruction formula is a polynomial of degree $D$, but now the reconstruction must be continuous and have continuous derivatives up to order $D - 1$. In detail, let $\tau_l$ be the left endpoint of $I_l$ ($l = 1, \ldots, L$). The reconstruction is chosen from among those piecewise polynomials $s(t)$ satisfying
$$ \left(\frac{d^k}{dt^k}\right) s(\tau_l -) = \left(\frac{d^k}{dt^k}\right) s(\tau_l +) $$
for $k = 0, \ldots, D - 1$, $l = 2, \ldots, L$; subject to this constraint, one solves
$$ \sum_i \{ s(t_i) - y_i \}^2 = \min. $$

Example 4: Variable-bandwidth kernel methods $T_{VK}(y, \delta)$. Now $\delta$ is a function on $[0, 1]$; $\delta(t)$ represents the bandwidth of the kernel at $t$; the smoothing kernel $K$ is a $C^2$ function of compact support which is also a probability density, and if $\hat f = T_{VK}(y, \delta)$ then
$$ \hat f(t) = \frac{1}{n\,\delta(t)} \sum_{i=1}^n K\!\left(\frac{t - t_i}{\delta(t)}\right) y_i. \qquad (3) $$
More refined versions of this formula would adjust $K$ for boundary effects near $t = 0$ and $t = 1$.

Example 5: Variable-bandwidth high-order kernels $T_{VK(D)}(y, \delta)$, $D > 2$. Here $\delta$ is again the local bandwidth, and the reconstruction formula is as in (3), only $K(\cdot)$ is a $C^D$ function integrating to 1, with vanishing intermediate moments:
$$ \int t^j K(t)\, dt = 0 \qquad (j = 1, \ldots, D - 1). $$
As $D > 2$, $K(\cdot)$ cannot be nonnegative.


These reconstruction techniques, when equipped with appropriate selectors of the spatial smoothing parameter δ, duplicate essential features of certain well-known methods.
Method 1. The piecewise constant reconstruction formula $T_{PC}$, equipped with choice of partition δ by recursive partitioning and cross-validatory choice of 'pruning constant' as described by Breiman et al. (1983), results in the method CART applied to one-dimensional data.
Method 2. The spline reconstruction formula $T_{SPL(D)}$, equipped with a backwards deletion scheme, models the methods of Friedman & Silverman (1989) and Friedman (1991) applied to one-dimensional data.
Method 3. The kernel method $T_{VK(2)}$, equipped with the variable bandwidth selector described by Brockmann, Gasser & Herrmann (1993), results in the 'Heidelberg' variable bandwidth smoothing method. Compare also Terrell & Scott (1992).
These schemes are computationally feasible and intuitively appealing. However, very little is known about the theoretical performance of these adaptive schemes, at the level of uniformity in f and n that we would like.

1.3. Ideal adaptation with oracles

To avoid messy questions, we abandon the study of specific δ-selectors and instead study ideal adaptation.
For us, ideal adaptation is the performance which can be achieved from smoothing with the aid of an oracle. Such an oracle will not tell us f, but will tell us, for our method $T(y, \delta)$, the 'best' choice of δ for the true underlying f. The oracle's response is conceptually a selection $\Delta(f)$ which satisfies
$$ R(T(y, \Delta(f)), f) = \mathscr{R}_{n,\sigma}(T, f), $$
where $\mathscr{R}_{n,\sigma}$ denotes the ideal risk
$$ \mathscr{R}_{n,\sigma}(T, f) = \inf_{\delta} R(T(y, \delta), f). $$
As $\mathscr{R}_{n,\sigma}$ measures performance with a selection $\Delta(f)$ based on full knowledge of f rather than a data-dependent selection $\hat\delta(y)$, it represents an ideal we cannot expect to attain. Nevertheless it is the target we shall consider.
Ideal adaptation offers, in principle, considerable advantages over traditional nonadaptive linear smoothers. Consider a function f which is a piecewise polynomial of degree D, with a finite number of pieces $I_1, \ldots, I_L$, say:
$$ f = \sum_{l=1}^L p_l\, 1_{I_l}. \qquad (4) $$
Assume that f has discontinuities at some of the break-points $\tau_1, \ldots, \tau_L$.
An oracle could supply the information that one should use $I_1, \ldots, I_L$ rather than some other partition. Least-squares theory says that, for data from the linear model $Y = X\beta + E$, with noise $E_i$ independently distributed as $N(0, \sigma^2)$, the least-squares estimator $\hat\beta$ satisfies
$$ E\|X\beta - X\hat\beta\|_2^2 = (\text{number of parameters in } \beta) \times (\text{variance of noise}). $$
Applying this to our setting, for the risk $R(\hat f, f) = n^{-1} E\|\hat f - f\|_{2,n}^2$ we get ideal risk $L(D + 1)\sigma^2/n$.
On the other hand, the risk of a spatially nonadaptive procedure is far worse. Consider kernel smoothing. Because f has discontinuities, no kernel smoother with fixed, nonspatially varying bandwidth attains a risk $R(\hat f, f)$ tending to zero faster than $C n^{-1/2}$, $C = C(f, \text{kernel})$. The same result holds for estimates in orthogonal series of polynomials or sinusoids, for smoothing splines with knots at the sample points and for least squares smoothing splines with knots equispaced.
Most strikingly, even for piecewise polynomial fits with equal-width pieces, we have that $R(\hat f, f)$ is of size $n^{-1/2}$ unless the breakpoints of $\hat f$ form a subset of the breakpoints of f. But this can happen only for very special n, so in any event
$$ \limsup_{n \to \infty} n^{1/2} R(\hat f, f) > 0. $$
In short, oracles offer an improvement, ideally from risk of order $n^{-1/2}$ to order $n^{-1}$. No better performance than this can be expected, since $n^{-1}$ is the usual 'parametric rate' for estimating finite-dimensional parameters.
Can we approach this ideal performance with estimators using the data alone?

1.4. Selective wavelet reconstruction as a spatially adaptive method


A new principle for spatially adaptive estimation can be based on recently developed
'wavelets' ideas. Introductions, historical accounts and references to much recent work
may be found in the books by Daubechies (1992), Meyer (1990), Chui (1992) and Frazier,
Jawerth & Weiss (1991). Orthonormal bases of compactly supported wavelets provide a
powerful complement to traditional Fourier methods: they permit an analysis of a signal
or image into localised oscillating components. In a statistical regression context, this
spatially varying decomposition can be used to build algorithms that adapt their effective
'window width' to the amount of local oscillation in the data. Since the decomposition is
in terms of an orthogonal basis, analytic study in closed form is possible.
For the purposes of this paper, we discuss a finite, discrete, wavelet transform. This
transform, along with a careful treatment of boundary correction, has been described by
Cohen et al. (1993), with related work by Meyer (1991) and G. Malgouyres in the unpublished report 'Ondelettes sur l'intervalle: algorithmes rapides', prépublications mathématiques, Orsay. To focus attention on our main themes, we employ a simpler periodised version of the finite discrete wavelet transform in the main exposition. This version yields an exactly orthogonal transformation between data and wavelet coefficient domains. Brief comments on the minor changes needed for the boundary corrected version are made in § 4.6.
Suppose we have data $y = (y_i)_{i=1}^n$, with $n = 2^{J+1}$. For various combinations of parameters M, the number of vanishing moments, S, the support width, and $j_0$, the low-resolution cutoff, one may construct an $n \times n$ orthogonal matrix $\mathscr{W}$, the finite wavelet transform matrix. Actually there are many such matrices, depending on special filters: in addition to the original Daubechies wavelets there are the Coiflets and Symmlets of Daubechies (1993). For the figures in this paper we use the Symmlet with parameter N = 8. This has M = 7 vanishing moments and support length S = 15.
This matrix yields a vector w of the wavelet coefficients of y via $w = \mathscr{W} y$, and we have the inversion formula $y = \mathscr{W}^T w$.
The vector w has $n = 2^{J+1}$ elements. It is convenient to index dyadically $n - 1 = 2^{J+1} - 1$ of the elements following the scheme
$$ w = (w_{j,k} : j = 0, \ldots, J;\ k = 0, \ldots, 2^j - 1), $$
and the remaining element we label $w_{-1,0}$. To interpret these coefficients let $W_{jk}$ denote the $(j, k)$th row of $\mathscr{W}$. The inversion formula $y = \mathscr{W}^T w$ becomes
$$ y = \sum_{j,k} w_{jk} W_{jk}, $$
expressing y as a sum of basis elements $W_{jk}$ with coefficients $w_{jk}$. We call the $W_{jk}$ wavelets. The vector $W_{jk}$, plotted as a function of i, looks like a localized wiggle, hence the name 'wavelet'. For j and k bounded away from extreme cases by the conditions $j_0 \le j < J - j_1$ and $S < k < 2^j - S$, we have the approximation
$$ W_{jk}(i) \approx 2^{j/2} n^{-1/2}\, \psi(2^j i/n - k), $$
where ψ is a fixed 'wavelet' in the sense of the usual wavelet transform on ℝ (Meyer, 1990, Ch. 3; Daubechies, 1988). This approximation improves with increasing n and increasing $j_1$. Here ψ is an oscillating function of compact support, usually called the mother wavelet. We therefore speak of $W_{jk}$ as being localized to spatial positions near $t = k 2^{-j}$ and frequencies near $2^j$.
The wavelet ψ can have a smooth visual appearance, if the parameters M and S are chosen sufficiently large, and favourable choices of so-called quadrature mirror filters are made in the construction of the matrix $\mathscr{W}$. Daubechies (1988) described a particular construction with S = 2M + 1 for which the number of derivatives of ψ is proportional to M.
For our purposes, the only details we need are as follows.
Property 1. We have that $W_{jk}$ has vanishing moments up to order M, as long as $j \ge j_0$:
$$ \sum_i i^l\, W_{jk}(i) = 0 \qquad (l = 0, \ldots, M;\ j \ge j_0). $$
Property 2. We have that $W_{jk}$ is supported in $[2^{J-j}(k - S),\ 2^{J-j}(k + S)]$, provided $j \ge j_0$.

Because of the spatial localization of wavelet bases, the wavelet coefficients allow one to easily answer the question 'is there a significant change in the function near t?' by looking at the wavelet coefficients at levels $j = j_0, \ldots, J$ at spatial indices k with $k 2^{-j} \approx t$. If these coefficients are large, the answer is 'yes'.
Figure 1 displays four functions, Bumps, Blocks, HeaviSine and Doppler, which have
been chosen because they caricature spatially variable functions arising in imaging, spec-
troscopy and other scientific signal processing. For all figures in this paper, n = 2048.

Fig. 1. Four spatially variable functions; n = 2048. Formulae, before rescaling as in Fig. 3, are given in Table 1. Panels: (a) Blocks, (b) Bumps, (c) HeaviSine, (d) Doppler.

Table 1. Formulae for test functions

(a) Blocks
f(t) = Σ_j h_j K(t − t_j),  K(t) = {1 + sgn(t)}/2
(t_j) = (0.1, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81)
(h_j) = (4, −5, 3, −4, 5, −4.2, 2.1, 4.3, −3.1, 2.1, −4.2)

(b) Bumps
f(t) = Σ_j h_j K((t − t_j)/w_j),  K(t) = (1 + |t|)^{-4}
(t_j) = as for Blocks
(h_j) = (4, 5, 3, 4, 5, 4.2, 2.1, 4.3, 3.1, 5.1, 4.2)
(w_j) = (0.005, 0.005, 0.006, 0.01, 0.01, 0.03, 0.01, 0.01, 0.005, 0.008, 0.005)

(c) HeaviSine
f(t) = 4 sin(4πt) − sgn(t − 0.3) − sgn(0.72 − t)

(d) Doppler
f(t) = {t(1 − t)}^{1/2} sin{2π(1 + ε)/(t + ε)},  ε = 0.05
Fig. 2. The four functions in the wavelet domain: most nearly symmetric Daubechies wavelet with N = 8. Wavelet coefficients θ_{jk} are depicted for j = 5, 6, ..., 10. Coefficients in one level, with j constant, are plotted as a series against position t = 2^{-j}k. The vast majority of the coefficients are zero or effectively zero. Panels: (a) Blocks, (b) Bumps, (c) HeaviSine, (d) Doppler.

Figure 2 depicts the wavelet transforms of the four functions. The large coefficients occur exclusively near the areas of major spatial activity. This property suggests that a spatially adaptive algorithm could be based on the principle of selective wavelet reconstruction. Given a finite list δ of (j, k) pairs, define $T_{SW}(y, \delta)$ by
$$ T_{SW}(y, \delta) = \hat f = \sum_{(j,k) \in \delta} w_{jk} W_{jk}. \qquad (5) $$
This provides reconstructions by selecting only a subset of the empirical wavelet coefficients.
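To illustrate (5) concretely, the following sketch of ours uses an orthonormal Haar matrix as a simple stand-in for the Symmlet transform matrix $\mathscr{W}$ used in the figures; numpy, the function names and the toy 'keep' rule are our own assumptions, not the paper's.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar wavelet matrix of size n x n (n a power of 2),
    a simple stand-in for the finite wavelet transform matrix W."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                    # coarser-scale rows
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])   # finest-scale wavelets
    w = np.vstack([top, bottom])
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def selective_reconstruction(y, keep):
    """T_SW(y, delta): transform, keep only the listed coefficients, invert."""
    n = len(y)
    W = haar_matrix(n)
    w = W @ y                          # empirical wavelet coefficients
    mask = np.zeros(n)
    mask[list(keep)] = 1.0
    return W.T @ (w * mask)            # sum over kept indices of w_jk W_jk

# toy usage: keep the 8 largest coefficients of a noisy step function
n = 64
f = np.where(np.arange(n) < n // 2, 0.0, 4.0)
y = f + 0.5 * np.random.default_rng(0).standard_normal(n)
keep = np.argsort(np.abs(haar_matrix(n) @ y))[-8:]   # an 'oracle-like' choice, for illustration only
fhat = selective_reconstruction(y, keep)
```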
Our motivation in proposing this principle is twofold. First, for a spatially inhomogeneous function, 'most of the action' is concentrated in a small subset of (j, k)-space. Secondly, under the noise model underlying (1), noise contaminates all wavelet coefficients equally. Indeed, the noise vector $e = (e_i)$ is assumed to be a white noise; so its orthogonal transform $z = \mathscr{W} e$ is also a white noise. Consequently, the empirical wavelet coefficient is
$$ w_{jk} = \theta_{jk} + z_{jk}, $$
where $\theta = \mathscr{W} f$ is the wavelet transform of the noiseless data $f = (f(t_i))_{i=1}^n$.
Every empirical wavelet coefficient therefore contributes noise of variance σ², but only a very few wavelet coefficients contribute signal. This is the heuristic of our method.
Ideal spatial adaptation can be defined for selective wavelet reconstruction in the obvious way. For the risk measure (1) the ideal risk is
$$ \mathscr{R}_{n,\sigma}(\mathrm{SW}, f) = \inf_{\delta} R(T_{SW}(y, \delta), f), $$
with optimal spatial parameter $\delta = \Delta(f)$, namely a list of (j, k) indices attaining the infimum.
Figures 3-6 depict the results of ideal wavelet adaptation for the four functions displayed in Fig. 2. Figure 3 shows noisy versions of the four functions of interest; the signal-to-noise ratio $\|\mathrm{signal}\|_{2,n}/\|\mathrm{noise}\|_{2,n}$ is 7. Figure 4 shows the noisy data in the wavelet domain. Figure 5 shows the reconstruction by selective wavelet reconstruction using an oracle; Fig. 6 shows the situation in the wavelet domain. Because the oracle helps us to select the important wavelet coefficients, the reconstructions are of high quality.

Fig. 3. Four functions with Gaussian white noise, σ = 1, with f rescaled to have signal-to-noise ratio SD(f)/σ = 7. Panels: (a) Noisy Blocks, (b) Noisy Bumps, (c) Noisy HeaviSine, (d) Noisy Doppler.

The theoretical benefits of ideal wavelet selection can again be seen in the case (4) where f is a piecewise polynomial of degree D. Suppose we use a wavelet basis with parameter M ≥ D. Then Properties 1 and 2 imply that the wavelet coefficients $\theta_{j,k}$ of f all vanish except for:
(i) coefficients at the coarse levels $0 \le j < j_0$;
(ii) coefficients at $j_0 \le j \le J$ whose associated interval $[2^{-j}(k - S),\ 2^{-j}(k + S)]$ contains a breakpoint of f.
There is a fixed number $2^{j_0}$ of coefficients satisfying (i), and, in each resolution level j, $(\theta_{j,k} : k = 0, \ldots, 2^j - 1)$, at most (# breakpoints) × (2S + 1) satisfying (ii). Consequently, with L denoting again the number of pieces in (4), we have
$$ \#\{(j, k) : \theta_{j,k} \ne 0\} \le 2^{j_0} + (J + 1 - j_0)(2S + 1)L. $$


Fig. 4. The four noisy functions in the wavelet domain. Compare Fig. 2. Only a small number of coefficients stand out against a noise background. Panels: (a) Noisy Blocks, (b) Noisy Bumps, (c) Noisy HeaviSine, (d) Noisy Doppler.

Fig. 5. Ideal selective wavelet reconstruction, with j_0 = 5. Compare Figs 1, 3. Panels: (a) Ideal[Blocks], (b) Ideal[Bumps], (c) Ideal[HeaviSine], (d) Ideal[Doppler].


Fig. 6. Ideal reconstruction, wavelet domain. Compare Figs 2, 4. Most of the coefficients in Fig. 4 have been set to zero. The others have been retained as they are. Panels: (a) Ideal[Blocks], (b) Ideal[Bumps], (c) Ideal[HeaviSine], (d) Ideal[Doppler].

Let $\delta^* = \{(j, k) : \theta_{j,k} \ne 0\}$. Then, because of the orthogonality of the $(W_{jk})$, $\sum_{(j,k) \in \delta^*} w_{jk} W_{jk}$ is the least-squares estimate of f and
$$ \mathscr{R}_{n,\sigma}(\mathrm{SW}, f) \le (C_1 + C_2 J)\, L \sigma^2/n \qquad (6) $$
for all $n = 2^{J+1}$, with certain constants $C_1$, $C_2$, depending linearly on S, but not on f. Hence
$$ \mathscr{R}_{n,\sigma}(\mathrm{SW}, f) \le C\, L\, \sigma^2 \log n / n \qquad (7) $$
for every piecewise polynomial of degree D ≤ M. This is nearly as good as the bound $\sigma^2 L(D + 1) n^{-1}$ of ideal piecewise polynomial adaptation, and considerably better than the rate $n^{-1/2}$ of usual nonadaptive linear methods.

1.5. Near-ideal spatial adaptation by wavelets


Calculations of ideal risk which point to the benefits of ideal spatial adaptation prompt the question: How nearly can one approach ideal performance when no oracle is available and we must rely on the data only, with no side information about f?
The benefit of the wavelet framework is that we can answer such questions precisely. In § 2 of this paper we develop new inequalities in multivariate decision theory which furnish an estimate $f^*$ which, when presented with data y and knowledge of the noise level σ², obeys
$$ R(f^*, f) \le (2\log n + 1)\left\{ \mathscr{R}_{n,\sigma}(\mathrm{SW}, f) + \frac{\sigma^2}{n} \right\} \qquad (8) $$
for every f, every $n = 2^{J+1}$, and every σ.
Thus, in complete generality, it is possible to come within a 2 log n factor of the performance of ideal wavelet adaptation. In small samples n, the factor (2 log n + 1) can be replaced by a constant which is much smaller: for example, 5 will do if n < 256, and 10 will do if n < 16384. On the other hand, no radically better performance is possible: to get an inequality valid for all f, all σ, and all n, we cannot even change the constant 2 to 2 − ε and still have (8) hold, whether by $f^*$ or by any other measurable estimator sequence.
To illustrate the implications, Figs 7 and 8 show the situation for the four basic examples, with an estimator $f^*$ which has been implemented on the computer, as described in § 2.4 below. The result, while slightly noisier than the ideal estimate, is still of good quality, and requires no oracle.

Fig. 7. RiskShrink reconstruction using soft thresholding and λ = λ*_n. Mimicking an oracle while relying on the data alone. Panels: (a) WaveSelect[Blocks], (b) WaveSelect[Bumps], (c) WaveSelect[HeaviSine], (d) WaveSelect[Doppler].

The theoretical properties are also interesting. Our method has the property that for every piecewise polynomial (4) of degree D ≤ M with ≤ L pieces,
$$ R(f^*, f) \le (C_1 + C_2 \log n)(2\log n + 1)\, L\sigma^2/n, $$
where $C_1$ and $C_2$ are as in (6); this result is merely a combination of (7) and (8). Hence in this special case we have an actual estimator coming within C log² n of ideal piecewise polynomial fits.
Fig. 8. RiskShrink, wavelet domain. Compare Figs 2, 4, 6. Panels: (a) WaveSelect[Blocks], (b) WaveSelect[Bumps], (c) WaveSelect[HeaviSine], (d) WaveSelect[Doppler].

1.6. Universality of wavelets as a spatially adaptive procedure


This last calculation is not essentially limited to piecewise polynomials; something like it holds for all f. In § 3 we show that, for constants $C_i$ not depending on f, n or σ,
$$ \mathscr{R}_{n,\sigma}(\mathrm{SW}, f) \le (C_1 + C_2 J)\, \mathscr{R}_{n,\sigma}(\mathrm{PP}(D), f) $$
for every f, every $n = 2^{J+1}$ and every σ > 0. Thus selective wavelet reconstruction is essentially as powerful as variable-partition piecewise constant fits, variable-knot least-squares splines, or piecewise polynomial fits. Suppose that the function f is such that, furnished with an oracle, piecewise polynomials, piecewise constants, or variable-knot splines would improve the rate of convergence over traditional fixed-bandwidth kernel methods, say from rate of convergence $n^{-r_1}$, with fixed bandwidth, to $n^{-r_2}$, for $r_2 > r_1$. Then, furnished with an oracle, selective wavelet adaptation offers an improvement to $\log_2 n \times n^{-r_2}$; this is essentially the same benefit at the level of rates.
We know of no proof that existing procedures for fitting piecewise polynomials and variable-knot splines, such as those current in the statistical literature, can attain anything like the performance of ideal methods. In contrast, for selective wavelet reconstruction, it is easy to offer performance comparable to that with an oracle, using the estimator $f^*$.
A wavelet selection with an oracle offers the advantages of other spatially-variable methods. From this theoretical perspective, it is thus cleaner and more elegant to abandon the ideal of fitting piecewise polynomials with optimal partitions, and turn instead to RiskShrink, about which we have theoretical results, and an order O(n) algorithm.

1.7. Contents
Section 2 discusses the problem of mimicking ideal wavelet selection; § 3 shows why wavelet selection offers the same advantages as piecewise polynomial fits; § 4 discusses variations and relations to other work. The appendixes contain certain proofs. Related manuscripts by the authors, currently under publication review and available as PostScript files by anonymous ftp from playfair.stanford.edu, are cited in the text by [filename.ps].

2. DECISION THEORY AND SPATIAL ADAPTATION


2.1. General
In this section we solve a new problem in multivariate normal decision theory and
apply it to function estimation.

2.2. Oracles for diagonal linear projection

Consider the following problem from multivariate normal decision theory. We are given observations $w = (w_i)_{i=1}^n$ according to
$$ w_i = \theta_i + \epsilon z_i \qquad (i = 1, \ldots, n), \qquad (9) $$
where the $z_i$ are independent and identically distributed as N(0, 1), ε > 0 is the known noise level, and $\theta = (\theta_i)$ is the object of interest. We wish to estimate with $\ell^2$-loss and so define the risk measure
$$ R(\hat\theta, \theta) = E\|\hat\theta - \theta\|_{2,n}^2 = \sum_{i=1}^n E(\hat\theta_i - \theta_i)^2. \qquad (10) $$
We consider a family of diagonal linear projections:
$$ T_{DP}(w, \delta) = (\delta_i w_i)_{i=1}^n, \qquad \delta_i \in \{0, 1\}. $$
Such estimators 'keep' or 'kill' each coordinate. Suppose we had available an oracle which would supply for us the coefficients $\Delta_{DP}(\theta)$ optimal for use in the diagonal projection scheme. These ideal coefficients are $\delta_i = 1_{\{|\theta_i| > \epsilon\}}$: ideal diagonal projection consists in estimating only those $\theta_i$ larger than the noise level. These yield the ideal risk
$$ \mathscr{R}_\epsilon(\mathrm{DP}, \theta) = \sum_{i=1}^n \rho_T(\theta_i, \epsilon), $$
with $\rho_T(\tau, a) = \min(\tau^2, a^2)$.
In general the ideal risk $\mathscr{R}_\epsilon(\mathrm{DP}, \theta)$ cannot be attained for all θ by any estimator, linear or nonlinear. However surprisingly simple estimates do come remarkably close.
Motivated by the idea that only very few wavelet coefficients contribute signal, we consider threshold rules that retain only observed data that exceed a multiple of the noise level. Define the 'hard' and 'soft' threshold nonlinearities by
$$ \eta_H(w, \lambda) = w\, 1_{\{|w| > \lambda\}}, \qquad (11) $$
$$ \eta_S(w, \lambda) = \mathrm{sgn}(w)\,(|w| - \lambda)_+. \qquad (12) $$
The hard threshold rule is reminiscent of subset selection rules used in model selection and we return to it later. For now, we focus on soft thresholding.
THEOREM 1. Assume model (9)-(10). The estimator
$$ \hat\theta_i = \eta_S\{w_i, \epsilon(2\log n)^{1/2}\} \qquad (i = 1, \ldots, n) $$
satisfies
$$ E\|\hat\theta - \theta\|_{2,n}^2 \le (2\log n + 1)\Big\{\epsilon^2 + \sum_{i=1}^n \min(\theta_i^2, \epsilon^2)\Big\} \qquad (13) $$
for all $\theta \in \mathbb{R}^n$.

In 'oracular' notation, we have
$$ R(\hat\theta, \theta) \le (2\log n + 1)\,\{\epsilon^2 + \mathscr{R}_\epsilon(\mathrm{DP}, \theta)\} \qquad (\theta \in \mathbb{R}^n). $$
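The following sketch of ours illustrates the two threshold nonlinearities (11)-(12), the ideal risk $\sum_i \min(\theta_i^2, \epsilon^2)$, and a Monte Carlo check of the flavour of Theorem 1's bound on a sparse vector; the names and the particular test vector are our own assumptions.

```python
import numpy as np

def eta_hard(w, lam):
    """Hard threshold (11): keep w when |w| > lambda, otherwise 0."""
    return w * (np.abs(w) > lam)

def eta_soft(w, lam):
    """Soft threshold (12): shrink w towards 0 by lambda, killing |w| <= lambda."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def ideal_risk_dp(theta, eps):
    """Ideal risk of the diagonal-projection oracle: sum_i min(theta_i^2, eps^2)."""
    return np.sum(np.minimum(theta**2, eps**2))

# Monte Carlo check of the flavour of (13), with eps = 1 and a sparse theta
rng = np.random.default_rng(1)
n, eps = 1024, 1.0
theta = np.zeros(n); theta[:20] = 5.0
lam = eps * np.sqrt(2 * np.log(n))
losses = [np.sum((eta_soft(theta + eps * rng.standard_normal(n), lam) - theta)**2)
          for _ in range(200)]
bound = (2 * np.log(n) + 1) * (eps**2 + ideal_risk_dp(theta, eps))
print(np.mean(losses), "<=", bound)
```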

Now ε² denotes the mean-squared loss for estimating one parameter unbiasedly, so the inequality says that we can mimic the performance of an oracle plus one extra parameter to within a factor of essentially 2 log n. A short proof appears in Appendix 1. However it is natural and more revealing to look for 'optimal' thresholds $\lambda^*_n$ which yield the smallest possible constant $\Lambda^*_n$ in place of 2 log n + 1 among soft threshold estimators. We give the result here and outline the approach in § 2.5.
THEOREM 2. Assume model (9)-(10). The minimax threshold $\lambda^*_n$ defined at (20) and solving (22) below yields an estimator
$$ \hat\theta^*_i = \eta_S(w_i, \lambda^*_n \epsilon) \qquad (i = 1, \ldots, n) \qquad (14) $$
which satisfies
$$ E\|\hat\theta^* - \theta\|_{2,n}^2 \le \Lambda^*_n\Big\{\epsilon^2 + \sum_{i=1}^n \min(\theta_i^2, \epsilon^2)\Big\} \qquad (15) $$
for all $\theta \in \mathbb{R}^n$. The coefficient $\Lambda^*_n$, defined at (19), satisfies $\Lambda^*_n \le 2\log n + 1$, and the threshold $\lambda^*_n \le (2\log n)^{1/2}$. Asymptotically, as $n \to \infty$,
$$ \Lambda^*_n \sim 2\log n, \qquad \lambda^*_n \sim (2\log n)^{1/2}. $$
Table 2 shows that this constant $\Lambda^*_n$ is much smaller than 2 log n + 1 when n is of the order of a few hundred. For n = 256, we get $\Lambda^*_n \approx 4.44$. For large n, however, the 2 log n upper bound is sharp. This holds even if we extend from soft coordinatewise thresholds to allow completely arbitrary estimator sequences.

Table 2. Coefficient λ*_n and related quantities

      n     λ*_n   (2 log n)^{1/2}    Λ*_n    2 log n
     64    1.474        2.884        3.124     8.3178
    128    1.669        3.115        3.755     9.7040
    256    1.860        3.330        4.442    11.090
    512    2.048        3.532        5.182    12.477
   1024    2.232        3.723        5.976    13.863
   2048    2.414        3.905        6.824    15.249
   4096    2.594        4.079        7.728    16.635
   8192    2.773        4.245        8.691    18.022
  16384    2.952        4.405        9.715    19.408
  32768    3.131        4.560       10.80     20.794
  65536    3.310        4.710       11.95     22.181
THEOREM 3. We have
$$ \inf_{\hat\theta} \sup_{\theta \in \mathbb{R}^n} \frac{E\|\hat\theta - \theta\|_{2,n}^2}{\epsilon^2 + \sum_{i=1}^n \min(\theta_i^2, \epsilon^2)} \sim 2\log n \qquad (n \to \infty). \qquad (16) $$
A proof is given in Appendix 5.

Hence an inequality of the form (13) or (15) cannot be valid for any estimator sequence with $\{2 - \epsilon + o(1)\}\log n$ in place of $\Lambda^*_n$. In this sense, an oracle for diagonal projection cannot be mimicked essentially more faithfully than by $\hat\theta^*$.
The use of soft thresholding rules (12) was suggested to us in prior work on multivariate normal decision theory by Bickel (1983) and ourselves [mrlp.ps]. However it is worth mentioning that a more traditional hard threshold estimator (11) exhibits the same asymptotic performance.
THEOREM 4. With $(l_n)$ a thresholding sequence sufficiently close to $(2\log n)^{1/2}$, the hard threshold estimator
$$ \hat\theta^+_i = \eta_H(w_i, l_n \epsilon) \qquad (i = 1, \ldots, n) $$
satisfies, for an $L_n \sim 2\log n$, the inequality
$$ E\|\hat\theta^+ - \theta\|_{2,n}^2 \le L_n\Big\{\epsilon^2 + \sum_{i=1}^n \min(\theta_i^2, \epsilon^2)\Big\} $$
for all $\theta \in \mathbb{R}^n$. Here, sufficiently close to $(2\log n)^{1/2}$ means
$$ (1 - \gamma)\log\log n \le l_n^2 - 2\log n \le o(\log n) $$
for some γ > 0.

2.3. Adaptive wavelet shrinkage


We now apply the preceding results to function estimation. Let $n = 2^{J+1}$, and let $\mathscr{W}$ denote the wavelet transform mentioned in § 1.4. Then $\mathscr{W}$ is an orthogonal transformation of $\mathbb{R}^n$ onto $\mathbb{R}^n$. In particular, if $f = (f_i)$ and $\tilde f = (\tilde f_i)$ are two n-vectors and $(\theta_{j,k})$ and $(\tilde\theta_{j,k})$ their $\mathscr{W}$ transforms, we have the Parseval relation
$$ \|f - \tilde f\|_{2,n}^2 = \sum_{j,k} (\theta_{j,k} - \tilde\theta_{j,k})^2. \qquad (17) $$
Now let $(y_i)$ be data as in model (1) and let $w = \mathscr{W} y$ be the discrete wavelet transform. Then with ε = σ,
$$ w_{j,k} = \theta_{j,k} + \epsilon z_{j,k} \qquad (j = 0, \ldots, J;\ k = 0, \ldots, 2^j - 1). $$
As in § 1.4, we define selective wavelet reconstruction via $T_{SW}(y, \delta)$, see (5), and observe that
$$ T_{SW} = \mathscr{W}^T \circ T_{DP} \circ \mathscr{W} \qquad (18) $$
in the sense that (5) is realized by wavelet transform, followed by diagonal linear projection or shrinkage, followed by inverse wavelet transform. Because of the Parseval relation (17), we have
$$ \mathscr{R}_{n,\sigma}(\mathrm{SW}, f) = n^{-1}\, \mathscr{R}_\epsilon(\mathrm{DP}, \theta). $$

Also, if $\hat\theta^*$ denotes the nonlinear estimator (14) applied to $w = \mathscr{W} y$, and
$$ f^* = \mathscr{W}^T \hat\theta^*, $$
then again by Parseval $E\|f^* - f\|_{2,n}^2 = E\|\hat\theta^* - \theta\|_{2,n}^2$, and we immediately conclude the following.
COROLLARY 1. For all f and all $n = 2^{J+1}$,
$$ R(f^*, f) \le \Lambda^*_n \left\{ \frac{\sigma^2}{n} + \mathscr{R}_{n,\sigma}(\mathrm{SW}, f) \right\}. $$
Moreover, no estimator can satisfy a better inequality than this for all f and all n, in the sense that for no measurable estimator can such an inequality hold, for all n and f, with $\Lambda^*_n$ replaced by $\{2 - \epsilon + o(1)\}\log n$. The same type of inequality holds for an estimator $\hat f^+ = \mathscr{W}^T \hat\theta^+$ derived from hard thresholding, with $L_n$ in place of $\Lambda^*_n$.
Hence, we have achieved, by very simple means, essentially the best spatial adaptation possible via wavelets.

2.4. Implementation
We have developed a computer software package which runs in the numerical computing environment Matlab. In addition, an implementation by G. P. Nason in the S language is available by anonymous ftp from Statlib at lib.stat.cmu.edu; other implementations are also in development. They implement the following modification of $f^*$.
DEFINITION 1. Let $\hat\theta^*$ denote the estimator in the wavelet domain obtained by
$$ \hat\theta^*_{j,k} = \begin{cases} w_{j,k} & (j < j_0), \\ \eta_S(w_{j,k},\ \lambda^*_n \sigma) & (j_0 \le j \le J). \end{cases} $$
RiskShrink is the estimator
$$ \hat f^* = \mathscr{W}^T \hat\theta^*. $$
The name RiskShrink for the estimator emphasises that shrinkage of wavelet coefficients is performed by soft thresholding, and that a mean squared error or 'risk' approach has been taken to specify the threshold. Alternative choices of threshold lead to the estimators VisuShrink introduced in § 4.2 below, and SureShrink discussed in our report [ausws.ps].
The rationale behind this rule is as follows. The wavelets $W_{jk}$ at levels $j < j_0$ do not have vanishing means, and so the corresponding coefficients $\theta_{j,k}$ should not generally cluster around zero. Hence those coefficients, a fixed number independent of n, should not be shrunken towards zero. Let $\widetilde{\mathrm{SW}}$ denote the selective wavelet reconstruction where the levels below $j_0$ are never shrunk. We have, evidently, the risk bound
$$ R(\hat f^*, f) \le \Lambda^*_n \left\{ \frac{2^{j_0}\sigma^2}{n} + \mathscr{R}_{n,\sigma}(\widetilde{\mathrm{SW}}, f) \right\}, $$
and of course
$$ \mathscr{R}_{n,\sigma}(\widetilde{\mathrm{SW}}, f) \le \mathscr{R}_{n,\sigma}(\mathrm{SW}, f) + 2^{j_0}\sigma^2/n, $$
so RiskShrink is never dramatically worse than $f^*$; it is typically much better on functions having nonzero average values.
Figure 7 shows the reconstructions of the four test functions; Fig. 8 shows the situation
in the wavelet domain. Evidently the methods do a good job of adapting to the spatial
variability of functions.
The reader will note that occasionally these reconstructions exhibit fine scale noise
artifacts. This is to some extent inevitable: no hypothesis of smoothness of the underlying
function is being made.
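For readers who wish to experiment, here is a rough sketch of ours of the RiskShrink recipe. It assumes the PyWavelets package, whose 'sym8' filter and periodized transform serve as a stand-in for the transform $\mathscr{W}$ of § 1.4; the minimax threshold value for n = 2048 is taken from Table 2.

```python
import numpy as np
import pywt  # assumed dependency; any orthogonal discrete wavelet transform would do

def riskshrink(y, sigma, lam, wavelet="sym8", coarse_level=5):
    """RiskShrink-style reconstruction: soft-threshold detail coefficients at lam*sigma,
    leave the coarse block untouched, and invert."""
    n = len(y)
    max_level = int(np.log2(n)) - coarse_level          # keeps 2^coarse_level coarse coefficients
    coeffs = pywt.wavedec(y, wavelet, mode="periodization", level=max_level)
    out = [coeffs[0]]                                    # coarse (approximation) block: untouched
    for detail in coeffs[1:]:
        out.append(np.sign(detail) * np.maximum(np.abs(detail) - lam * sigma, 0.0))
    return pywt.waverec(out, wavelet, mode="periodization")

# usage on a noisy HeaviSine-like signal of length n = 2048
n = 2048
t = np.arange(1, n + 1) / n
f = 4 * np.sin(4 * np.pi * t) - np.sign(t - 0.3) - np.sign(0.72 - t)
y = f + np.random.default_rng(0).standard_normal(n)     # sigma = 1
fstar = riskshrink(y, sigma=1.0, lam=2.414)              # lambda*_2048 = 2.414 from Table 2
```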

2.5. Proof outline for Theorem 2

Suppose we have a single observation $Y \sim N(\mu, 1)$. Define the function $\rho_{ST}(\lambda, \mu) = E\{\eta_S(Y, \lambda) - \mu\}^2$; see e.g. Bickel (1983). Qualitatively, $\rho_{ST}(\lambda, \mu)$ increases in μ from 0 to a maximum of $1 + \lambda^2$ at μ = ∞. Some explicit formulae and properties are given in Appendixes 1 and 2.
Define the minimax quantities
$$ \Lambda^*_n = \inf_{\lambda} \sup_{\mu} \frac{\rho_{ST}(\lambda, \mu)}{n^{-1} + \min(\mu^2, 1)}, \qquad (19) $$
$$ \lambda^*_n = \text{the largest } \lambda \text{ attaining } \Lambda^*_n \text{ above}. \qquad (20) $$

The key inequality (15) follows immediately: first assume ε = 1. Set $\hat\theta^*_i = \eta_S(w_i, \lambda^*_n)$. Then
$$ E\|\hat\theta^* - \theta\|_{2,n}^2 = \sum_{i=1}^n \rho_{ST}(\lambda^*_n, \theta_i) \le \sum_{i=1}^n \Lambda^*_n\{n^{-1} + \min(\theta_i^2, 1)\} = \Lambda^*_n\Big\{1 + \sum_{i=1}^n \min(\theta_i^2, 1)\Big\}. $$
If ε ≠ 1, then for $\hat\theta^*_i = \eta_S(w_i, \lambda^*_n\epsilon)$ we get by rescaling that
$$ E\|\hat\theta^* - \theta\|_{2,n}^2 = \epsilon^2 \sum_{i=1}^n \rho_{ST}(\lambda^*_n, \theta_i/\epsilon), $$
and the inequality (15) follows. Consequently, Theorem 2 follows from asymptotics for $\Lambda^*_n$ and $\lambda^*_n$. To obtain these, consider the analogous quantities where the supremum over the interval [0, ∞) is replaced by the supremum over the endpoints {0, ∞}:
$$ \Lambda^\circ_n = \inf_{\lambda} \sup_{\mu \in \{0, \infty\}} \frac{\rho_{ST}(\lambda, \mu)}{n^{-1} + \min(\mu^2, 1)}, \qquad (21) $$
and $\lambda^\circ_n$ is the largest λ attaining $\Lambda^\circ_n$. In Appendix 4 we show that $\Lambda^*_n = \Lambda^\circ_n$ and $\lambda^*_n = \lambda^\circ_n$.
We remark that $\rho_{ST}(\lambda, \infty)$ is strictly increasing in λ and $\rho_{ST}(\lambda, 0)$ is strictly decreasing in λ, so that at the solution of (21),
$$ (n + 1)\,\rho_{ST}(\lambda, 0) = \rho_{ST}(\lambda, \infty). \qquad (22) $$
Hence this last equation defines $\lambda^\circ_n$ uniquely, and, as is shown in Appendix 3, leads to
$$ (\lambda^\circ_n)^2 = 2\log n - 4\log\log n - \log 2\pi + o(1) \qquad (n \to \infty). \qquad (23) $$


To complete this outline, we note that the balance condition (22) together with $\rho_{ST}(\lambda^\circ_n, \infty) = 1 + (\lambda^\circ_n)^2$ gives
$$ \Lambda^\circ_n \sim 2\log n \qquad (n \to \infty). $$
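As a numerical check of the balance characterization (21)-(22), the following sketch of ours solves $(n+1)\rho_{ST}(\lambda, 0) = 1 + \lambda^2$ by bisection, using the closed form $\rho_{ST}(\lambda, 0) = 2\{(1+\lambda^2)\Phi(-\lambda) - \lambda\phi(\lambda)\}$ that follows from (A2.1) at μ = 0, and reports the resulting $\lambda^\circ_n$ and $\Lambda^\circ_n = n(1 + \lambda^{\circ 2}_n)/(n+1)$; the values should reproduce Table 2.

```python
import math

def phi(x):                      # standard Gaussian density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):                      # standard Gaussian distribution function
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def rho_st_zero(lam):
    """Soft-threshold risk at mu = 0: 2{(1 + lam^2) Phi(-lam) - lam phi(lam)}."""
    return 2.0 * ((1.0 + lam * lam) * Phi(-lam) - lam * phi(lam))

def minimax_threshold(n, tol=1e-10):
    """Solve the balance equation (n + 1) rho_ST(lam, 0) = 1 + lam^2 by bisection."""
    lo, hi = 0.0, math.sqrt(2.0 * math.log(n))      # lambda*_n lies below (2 log n)^{1/2}
    g = lambda lam: (n + 1) * rho_st_zero(lam) - (1.0 + lam * lam)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:                             # g decreases from n to negative values
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, n * (1.0 + lam * lam) / (n + 1)

for n in (256, 2048, 65536):
    lam, big_lambda = minimax_threshold(n)
    print(n, round(lam, 3), round(big_lambda, 3))    # e.g. 1.860, 4.442 for n = 256
```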

3. PIECEWISE POLYNOMIALS ARE NOT MORE POWERFUL THAN WAVELETS


We now show that wavelet selection using an oracle can closely mimic piecewise poly-
nomial fitting using an oracle.
THEOREM 5. Let $D \le M$ and $n = 2^{J+1}$. With constants $C_i$ depending on the wavelet transform alone,
$$ \mathscr{R}_{n,\sigma}(\mathrm{SW}, f) \le (C_1 + C_2 J)\, \mathscr{R}_{n,\sigma}(\mathrm{PP}(D), f) \qquad (24) $$
for all f, for all σ > 0.
Hence for every function, wavelets supplied with an oracle have an ideal risk that differs by at most a logarithmic factor from the ideal risk of the piecewise polynomial estimate. Since variable-knot splines of order D are piecewise polynomials of order D, we also have
$$ \mathscr{R}_{n,\sigma}(\mathrm{SW}, f) \le (C_1 + C_2 J)\, \mathscr{R}_{n,\sigma}(\mathrm{SPL}(D), f). \qquad (25) $$
Note that the constants are not necessarily the same at each appearance: see the proof below. Since piecewise-constant fits are piecewise polynomials of degree D = 0, we also have
$$ \mathscr{R}_{n,\sigma}(\mathrm{SW}, f) \le (C_1 + C_2 J)\, \mathscr{R}_{n,\sigma}(\mathrm{PC}, f). $$
Hence, if one is willing to neglect factors of log n then selective wavelet reconstruction, with an oracle, is as good as these other methods, with their oracles.
We note that one should not expect to get better than a log n worst-case ratio, essentially for the reasons given in § 1.3. If f is a piecewise polynomial, so that it is perfectly suited for piecewise polynomial fits, then wavelets should not be expected to be also perfectly suited: wavelets are not polynomials. On the other hand, if f were precisely a finite wavelet sum, then one could not expect piecewise polynomials to be perfectly suited to reconstructing f; some differences between different spatially adaptive schemes are inevitable.
The theorem only compares ideal risks. Of course, the ideal risk for wavelet selection is nearly attainable. We know of no parallel result for the ideal risk of piecewise polynomials. In any event, we get as a corollary that the estimator $f^*$ satisfies
$$ R(f^*, f) \le (C_1 + C_2 \log_2 n)(2\log n + 1)\, \mathscr{R}_{n,\sigma}(\mathrm{PP}(D), f), $$
so that $f^*$ comes within a factor log² n of ideal piecewise polynomial fits. Thus, there is a way to mimic an oracle for piecewise polynomials: to abandon piecewise-polynomial fits and to use wavelet shrinkage.
Proof of Theorem 5. Let $\Delta(f)$ be the partition supplied by an oracle for piecewise polynomial fits. Suppose that this optimal partition contains L elements. Let s be the least-squares fit, using this partition, to noiseless data. We have the Bias² + Variance decomposition of ideal risk
$$ R(T_{PP(D)}(y, \Delta(f)), f) = n^{-1}\|f - s\|_{2,n}^2 + (D + 1)L\sigma^2/n. \qquad (26) $$
Now let $\theta = \mathscr{W} s$ be the wavelet transform of s. Then, as s is a piecewise polynomial, the argument leading to (6) tells us that most of the wavelet coefficients of s vanish. Let $\delta^* = \{(j, k) : \theta_{j,k} \ne 0\}$.
Consider the use of $\delta^*$ as spatial parameter in selective wavelet reconstruction. We have
$$ R(T_{SW}(y, \delta^*), f) \le n^{-1}\|f - s\|_{2,n}^2 + (C_1 + C_2 J)L\sigma^2/n. \qquad (27) $$
Comparing this with (26), we have
$$ R(T_{SW}(y, \delta^*), f) \le \{1 + (C_1 + C_2 J)/(D + 1)\}\, R(T_{PP(D)}(y, \Delta(f)), f); $$
the theorem now follows from the assumption
$$ \mathscr{R}_{n,\sigma}(\mathrm{PP}(D), f) = R(T_{PP(D)}(y, \Delta(f)), f) $$
and the definition of $\mathscr{R}_{n,\sigma}(\mathrm{SW}, f)$ as an infimum, which gives $\mathscr{R}_{n,\sigma}(\mathrm{SW}, f) \le R(T_{SW}(y, \delta^*), f)$.
Finally, to verify (25) observe that the optimal variable-knot spline $\tilde s$ of order D for noiseless data is certainly a piecewise polynomial, so $\|f - s\|_2 \le \|f - \tilde s\|_2$. It depends on at least L unknown parameters and so for noisy data has variance term at least $1/(D + 1)$ times that of (26). Therefore,
$$ \mathscr{R}_{n,\sigma}(\mathrm{PP}(D), f) \le (D + 1)\, \mathscr{R}_{n,\sigma}(\mathrm{SPL}(D), f), $$
which, together with (24), establishes (25). □

4. DISCUSSION
4.1. Variations on choice of oracle
An alternative family of estimators for the multivariate normal estimation problem (9) is given by diagonal linear shrinkers:
$$ T_{DS}(w, \delta) = (\delta_i w_i)_{i=1}^n, \qquad \delta_i \in [0, 1]. $$
Such estimators shrink each coordinate towards 0, different coordinates being possibly treated differently. An oracle $\Delta_{DS}(\theta)$ for this family of estimators provides the ideal coefficients $(\delta_i) = (\theta_i^2/(\theta_i^2 + \epsilon^2))_{i=1}^n$ and would yield an ideal risk
$$ \mathscr{R}_\epsilon(\mathrm{DS}, \theta) = \sum_{i=1}^n \frac{\theta_i^2 \epsilon^2}{\theta_i^2 + \epsilon^2}, $$
say. There is an oracle inequality for diagonal shrinkage also.
THEOREM 6. (i) The soft thresholding estimator $\hat\theta^*$ with threshold $\lambda^*_n$ satisfies
$$ E\|\hat\theta^* - \theta\|_{2,n}^2 \le \Lambda_n\Big\{\epsilon^2 + \sum_{i=1}^n \frac{\theta_i^2\epsilon^2}{\theta_i^2 + \epsilon^2}\Big\} \qquad (28) $$
for all $\theta \in \mathbb{R}^n$, with $\Lambda_n \sim 2\log n$.

(ii) More generally, the asymptotic inequality (28) continues to hold for soft threshold sequences, $\lambda_n$, and hard threshold estimators with threshold sequences, $l_n$, satisfying respectively
$$ 5\log\log n \le \lambda_n^2 - 2\log n \le o(\log n), \qquad (29) $$
$$ (1 - \epsilon)\log\log n \le l_n^2 - 2\log n \le o(\log n). \qquad (30) $$
(iii) Theorem 3 continues to hold, a fortiori, if the denominator $\epsilon^2 + \sum_{i=1}^n \min(\theta_i^2, \epsilon^2)$ is replaced by $\epsilon^2 + \sum_{i=1}^n \theta_i^2\epsilon^2/(\theta_i^2 + \epsilon^2)$. So oracles for diagonal shrinkage can be mimicked to within a factor of order 2 log n and not more closely.
In Appendix 6 is a proof of Theorem 6 that covers both soft and hard threshold estimators and both DP and DS oracles. Thus the proof also establishes Theorem 4 and an asymptotic version of Theorem 2 for thresholds in the range specified in (29).
These results are carried over to adaptive wavelet shrinkage just as in § 2.3 by defining wavelet shrinkage in this case by the analogue of (18):
$$ T_{SW} = \mathscr{W}^T \circ T_{DS} \circ \mathscr{W}. $$
Corollary 1 extends immediately to this case.
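For intuition about the two oracles, a small sketch of ours compares the projection ideal risk $\sum_i \min(\theta_i^2, \epsilon^2)$ with the shrinkage ideal risk $\sum_i \theta_i^2\epsilon^2/(\theta_i^2 + \epsilon^2)$ on a vector with many coordinates near, but not exactly at, zero; since $\min(a, b) \le 2ab/(a + b)$, the two never differ by more than a factor 2.

```python
import numpy as np

def ideal_risk_dp(theta, eps):
    """Diagonal projection oracle: keep or kill each coordinate."""
    return np.sum(np.minimum(theta**2, eps**2))

def ideal_risk_ds(theta, eps):
    """Diagonal shrinkage oracle: shrink each coordinate by theta^2/(theta^2 + eps^2)."""
    return np.sum(theta**2 * eps**2 / (theta**2 + eps**2))

eps = 1.0
theta = np.concatenate([np.full(10, 5.0),        # a few strong coordinates
                        np.full(1000, 0.5)])     # many coordinates comparable to the noise level
dp, ds = ideal_risk_dp(theta, eps), ideal_risk_ds(theta, eps)
print(dp, ds, dp / ds)                           # the ratio never exceeds 2
```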

4.2. Variations on choice of threshold

Optimal thresholds. In Theorem 2 we have studied $\lambda^*_n$, the minimax threshold for the soft threshold nonlinearity, with comparison to a projection oracle. A total of 4 minimax quantities may be defined, by considering various combinations of threshold type (soft, hard) and oracle type (projection, shrinkage).
We have computer programs for calculating $\lambda^*_n$ which have been used to tabulate $\lambda^*_{2^j}$ for j = 6, 7, ..., 16 (compare Table 2). These have also been embedded as look-up tables in the RiskShrink software mentioned earlier.
Implementation of any of the other optimal thresholds would require a computational effort to tabulate the thresholds for various values of n. However, this computational effort would be far greater in the other three cases than in the case we have studied here, essentially because there is no analogue of the simplification that occurs through replacing (19) with (21).
Remark. A drawback of using optimal thresholds is that the threshold which is precisely optimal for one of the four combinations may not be even asymptotically optimal for another of the four combinations. Comparing (23) with (30) shows that $\lambda^*_n$ used with hard thresholding can only mimic the oracle to within a factor α log n, for some α > 2.
Universal thresholds. As an alternative to the use of minimax thresholds, one could simply employ the universal sequence $\lambda^U_n = (2\log n)^{1/2}$. The sequence is easy to remember; implementation in software requires no costly development of look-up tables; and it is asymptotically optimal for each of the four combinations of threshold nonlinearity and oracle discussed above. In fact, finite-n risk bounds may be developed for this threshold by examining closely the proofs of Theorems 4 and 6.
THEOREM 7. We have
$$ \rho_{ST}\{(2\log n)^{1/2}, \mu\} \le (2\log n + 1)\,\{n^{-1} + \rho_T(\mu, 1)\} \qquad (n = 4, 5, \ldots), $$
$$ \rho_{ST}\{(2\log n)^{1/2}, \mu\} \le (2\log n + 1)\,\{n^{-1} + \rho_L(\mu, 1)\} \qquad (n = 4, 5, \ldots). $$
The drawback of this simple threshold formula is that in samples on the order of dozens or hundreds, the mean squared error performance of minimax thresholds is noticeably better.
VisuShrink. On the other hand $(\lambda^U_n)$ has an important visual advantage: the almost 'noise-free' character of reconstructions. This can be explained as follows. The wavelet transform of many noiseless objects, such as those portrayed in Fig. 1, is very sparse, and filled with essentially zero coefficients. After contamination with noise, these coefficients are all nonzero. If a sample that in the noiseless case ought to be zero is in the noisy case nonzero, and that character is preserved in the reconstruction, the reconstruction will have an annoying visual appearance: it will contain small blips against an otherwise clean background.
The threshold $(2\log n)^{1/2}$ avoids this problem because of the fact that when $(z_i)$ is a white noise sequence independent and identically distributed N(0, 1), then, as $n \to \infty$,
$$ \mathrm{pr}\Big\{ \max_{1 \le i \le n} |z_i| > (2\log n)^{1/2} \Big\} \to 0. \qquad (31) $$
So, with high probability, every sample in the wavelet transform in which the underlying signal is exactly zero will be estimated as zero.
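A quick simulation of ours illustrates (31): the fraction of white-noise vectors whose maximum absolute value exceeds $(2\log n)^{1/2}$ is modest and decreases (slowly) with n.

```python
import numpy as np

rng = np.random.default_rng(2)
for n in (256, 2048, 16384):
    thresh = np.sqrt(2 * np.log(n))
    exceed = np.mean([np.abs(rng.standard_normal(n)).max() > thresh
                      for _ in range(2000)])
    print(n, round(thresh, 3), exceed)   # exceedance probability tends to 0 as n grows
```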
Figure 9 displays the results of using this threshold on the noisy data of Figs 3 and 4. The almost 'noise-free' character of the plots is striking.

Fig. 9. VisuShrink reconstructions using soft thresholding and λ = (2 log n)^{1/2}. Notice the 'noise-free' character; compare Figs 1, 3, 5, 7. Panels: (a) VisuShrink[Blocks], (b) VisuShrink[Bumps], (c) VisuShrink[HeaviSine], (d) VisuShrink[Doppler].

DEFINITION 2. Let $\hat\theta^u$ denote the estimator in the wavelet domain obtained by
$$ \hat\theta^u_{j,k} = \begin{cases} w_{j,k} & (j < j_0), \\ \eta_S\{w_{j,k},\ \sigma(2\log n)^{1/2}\} & (j_0 \le j \le J). \end{cases} $$
VisuShrink is the estimator
$$ \hat f^u = \mathscr{W}^T \hat\theta^u. $$
Not only is the method better in visual quality than RiskShrink, the asymptotic risk bounds are no worse:
$$ R(\hat f^u, f) \le (2\log n + 1)\left\{ \frac{2^{j_0}\sigma^2}{n} + \mathscr{R}_{n,\sigma}(\widetilde{\mathrm{SW}}, f) \right\}. $$
This estimator is discussed further in our report [asymp.ps].


Estimating the noise level. Our software estimates the noise level σ as the median absolute deviation of the wavelet coefficients at the finest level J, divided by 0.6745. In our experience, the empirical wavelet coefficients at the finest scale are, with a small fraction of exceptions, essentially pure noise. Naturally, this is not perfect; we get an estimate that suffers an upward bias due to the presence of some signal at that level. By using the median absolute deviation, this bias is effectively controlled. Incidentally, upward bias is not disastrous; if our estimate is biased upwards by, say, 50%, then the same type of risk bounds hold, but with a 3 log n in place of 2 log n.
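In code, the estimate just described might look as follows (a sketch of ours; 0.6745 is the 0.75 quantile of the standard normal, so the median absolute deviation of pure N(0, σ²) noise divided by it is consistent for σ).

```python
import numpy as np

def estimate_sigma(finest_detail_coeffs):
    """Median-absolute-deviation estimate of the noise level from the finest-scale
    wavelet coefficients, which are treated as (mostly) pure noise."""
    return np.median(np.abs(finest_detail_coeffs)) / 0.6745

# sanity check on pure noise: the estimate should be close to the true sigma
rng = np.random.default_rng(3)
print(estimate_sigma(2.0 * rng.standard_normal(1024)))   # roughly 2.0
```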

4.3. Adaptation in other bases


A considerable amount of Soviet literature in the 1980s, for example Efroimovich &
Pinsker (1984), concerns what in our terms could be called mimicking an oracle in the
Fourier basis. Our work is an improvement in two respects.
(i) For the type of objects considered here, a wavelet oracle is more powerful than a
Fourier oracle. Indeed, a Fourier oracle can never give a rate of convergence faster than
$n^{-1/2}$ on any discontinuous object, while the wavelet oracle can achieve rates as fast as
log n/n on certain discontinuous objects. Figure 10 displays the results of using a Fourier-
domain oracle with our four basic functions; this should be compared with Fig. 5.
Evidently, the wavelet oracle is visually better in every case. It is also better in mean square.

Fig. 10. Ideal selective Fourier reconstruction. Compare Fig. 5. Superiority of the wavelet oracle is evident. Panels: (a) IdealFourier[Blocks], (b) IdealFourier[Bumps], (c) IdealFourier[HeaviSine], (d) IdealFourier[Doppler].
(ii) The Efroimovich-Pinsker work did not have access to the oracle inequality and
used a different approach, not based on thresholding but instead on grouping in blocks
and adaptive linear damping within blocks. Such an approach cannot obey the same risk
bounds as the oracle inequality, and can easily depart from ideal risk by larger than
logarithmic factors. Indeed, from a 'minimax over L2-Sobolev balls' point of view, for
which the Efroimovich-Pinsker work was designed, the adaptive linear damping is essen-
tially optimal; compare comments in our report [ausws.ps, §4]. Actual reconstructions
by RiskShrink and by the Efroimovich-Pinsker method on the data of Fig. 3 show that
RiskShrink is much better for spatial adaptation; see Fig. 4 of [ausws.ps].

4.4. Numerical measures of fit

Table 3 contains the average, over location, squared error of the various estimates from our four test functions for the noise realisation and the reconstructions shown in Figs 2-10. Figures 5 and 10 show ideal estimators, constructed with the aid of an oracle, while Figs 7 and 9 relate to genuine estimators depending on the data alone. It is apparent that the ideal wavelets reconstruction dominates ideal Fourier and that the genuine estimate using soft threshold at λ*_n comes well within the factor 6.824 of the ideal error predicted for n = 2048 by Table 2. Although the (2 log n)^{1/2} threshold is visually preferable in most cases, it has uniformly worse squared error than λ*_n, which reflects the well-known divergence between the usual numerical and visual assessments of quality of fit.
Table 4 shows the results of a very small simulation comparison of the same four
techniques as sample size is varied dyadically from n = 256 through 8192, and using
10 replications in each case. The same features noted in Table 3 extend to the other sample
sizes. In addition, one notes that, as expected, the average squared errors decline more
rapidly with sample size for the smoother signals HeaviSine and Doppler than for the
rougher Blocks and Bumps.

Table 3. Average square errors $\|\hat f - f\|_{2,n}^2/n$ in the Figures

Figure                               Blocks   Bumps   HeaviSine   Doppler
Fig. 1: $\|f\|_{2,n}^2/n$            81.211   57.665    58.893     50.348
Fig. 3: with noise                    1.047    0.937     1.008      0.9998
Fig. 5: ideal wavelets                0.097    0.111     0.028      0.042
Fig. 10: ideal Fourier                0.370    0.375     0.062      0.200
Fig. 7: threshold λ*_n                0.395    0.496     0.059      0.152
Fig. 9: threshold (2 log n)^{1/2}     0.874    1.058     0.076      0.324

4.5. Other adaptive properties

The estimator proposed here has a number of optimality properties in minimax decision theory. In recent work, we consider the problem of estimating f at a single point $f(t_0)$, where we believe that f is in some Hölder class, but we are not sure of the exponent nor the constant of the class. RiskShrink is adaptive in the sense that it achieves, within a logarithmic factor, the best risk bounds that could be had if the class were known; and the logarithmic factor is necessary when the class is unknown, by work of Lepskii (1990) and L. Brown and M. Low in the unpublished report 'A constrained risk inequality with applications to nonparametric functional estimation'. Other near-minimax properties are described in detail in our report [asymp.ps].
Table 4. Average square errors $\|\hat f - f\|_{2,n}^2/n$ from 10 replications

      n   Ideal Fourier   Ideal wavelets   Threshold λ*_n   Threshold (2 log n)^{1/2}
Blocks
    256       0.717           0.367            0.923              2.072
    512       0.587           0.243            0.766              1.673
   1024       0.496           0.168            0.586              1.268
   2048       0.374           0.098            0.427              0.905
   4096       0.288           0.062            0.295              0.621
   8192       0.212           0.035            0.204              0.412
Bumps
    256       0.913           0.411            1.125              2.674
    512       0.784           0.291            0.968              2.310
   1024       0.578           0.177            0.694              1.592
   2048       0.396           0.109            0.499              1.080
   4096       0.233           0.062            0.318              0.683
   8192       0.144           0.037            0.208              0.430
HeaviSine
    256       0.168           0.136            0.222              0.244
    512       0.132           0.079            0.155              0.186
   1024       0.091           0.040            0.089              0.122
   2048       0.065           0.026            0.060              0.083
   4096       0.048           0.016            0.045              0.066
   8192       0.033           0.008            0.030              0.047
Doppler
    256       0.711           0.220            0.473              0.951
    512       0.564           0.146            0.341              0.672
   1024       0.356           0.078            0.249              0.470
   2048       0.208           0.039            0.151              0.318
   4096       0.127           0.023            0.098              0.203
   8192       0.071           0.012            0.055              0.113

4.6. Boundary correction

As described in the Introduction, Cohen et al. (1993) have introduced separate 'boundary filters' to correct the nonorthogonality on [0, 1] of the restriction to [0, 1] of basis functions that intersect $[0, 1]^c$. To preserve the important Property 1 in § 1.4 of orthogonality to polynomials of degree ≤ M, a further 'preconditioning' transformation P of the data y is necessary. Thus, the transform may be represented as $\mathscr{W} = U \circ P$, where U is the orthogonal transformation built from the quadrature mirror filters and their boundary versions via the cascade algorithm. The preconditioning transformation affects only the N = M + 1 left-most and the N right-most elements of y: it has block diagonal structure $P = \mathrm{diag}(P_L \,|\, I \,|\, P_R)$. The key point is that the size and content of the boundary blocks $P_L$ and $P_R$ do not depend on $n = 2^{J+1}$. Thus the Parseval relation (17) is modified to
$$ \gamma_1 \sum_{j,k} (\theta_{j,k} - \tilde\theta_{j,k})^2 \le \|f - \tilde f\|_{2,n}^2 \le \gamma_2 \sum_{j,k} (\theta_{j,k} - \tilde\theta_{j,k})^2, $$
where the constants $\gamma_i$ correspond to the smallest and largest singular values of $P_L$ and $P_R$, and hence do not depend on $n = 2^{J+1}$. Thus all the ideal risk inequalities in the paper remain valid, with only an additional dependence for the constants on $\gamma_1$ and $\gamma_2$. In particular, the conclusions concerning logarithmic mimicking of oracles are unchanged.
4.7. Relation to model selection
RiskShrink may be viewed by statisticians as an automatic model selection method,
which picks a subset of the wavelet vectors and fits a 'model', consisting only of wavelets
in that subset, to the data by ordinary least-squares. Our results show that the method
gives almost the same performance in mean-squared error as one could attain if one knew
in advance which model provided the minimum mean-squared error.
Our results apply equally well in orthogonal regression. Suppose we have $Y = X\beta + E$, with noise $E_i$ independent and identically distributed as $N(0, \sigma^2)$, and X an $n \times p$ matrix. Suppose that the predictor variables are orthogonal: $X^T X = I_p$. Theorem 1 shows that the estimator $\hat\beta^* = \hat\theta^* \circ X^T Y$ achieves a risk not worse than $p^{-1}\sigma^2 + \mathscr{R}_{p,\sigma}(\mathrm{DP}, \beta)$ by more than a factor 2 log p + 1. This point of view has amusing consequences. For example, the hard thresholding estimator $\hat\beta^+ = \hat\theta^+ \circ X^T Y$ amounts to 'backwards-deletion' variable selection; one retains in the final model only variables which had Z-scores larger than λ in the original least-squares fit of the full model. In small dimensions p, this actually corresponds to current practice; the '5% significance' rule λ = 2 is near-minimax, in the sense of Theorem 2, for p = 200.
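A toy sketch of ours of this orthogonal-regression view: with orthonormal columns the least-squares coefficients are $X^T Y$, and hard thresholding them at λσ is exactly 'retain the variables whose Z-scores exceed λ'. The dimensions and coefficients below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma, lam = 400, 50, 1.0, 2.0                 # lambda = 2 is the '5% significance' rule
X, _ = np.linalg.qr(rng.standard_normal((n, p)))     # orthonormal columns: X^T X = I_p
beta = np.zeros(p); beta[:5] = 4.0                   # a few genuine predictors
Y = X @ beta + sigma * rng.standard_normal(n)

z = X.T @ Y                                          # least-squares coefficients = Z-scores here
beta_hard = z * (np.abs(z) > lam * sigma)            # hard threshold: backwards-deletion style
beta_soft = np.sign(z) * np.maximum(np.abs(z) - lam * sigma, 0.0)
print(np.flatnonzero(beta_hard))                     # retained variables; the 5 genuine ones are almost surely among them
```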
For lack of space, we do not pursue the model-selection connection here at length,
except for two comments.
(i) D. P. Foster and E. I. George, in the University of Chicago technical report 'The risk inflation of variable selection in regression', have proved two results about model selection which it is interesting to compare with our Theorem 4. In our language, they show that one can mimic the 'nonzeroness' oracle $\rho_Z(\theta, \epsilon) = \epsilon^2 1_{\{\theta \ne 0\}}$ to within $L_n = 1 + 2\log(n + 1)$ by hard thresholding with $\lambda_n = \{2\log(n + 1)\}^{1/2}$. They also show that for what we call the hard thresholding nonlinearity, no other choice of threshold can give a worst-case performance ratio, which they call a 'Variance Inflation Factor', asymptotically smaller than 2 log n as $n \to \infty$. Compare also Bickel (1983). Our results here differ because we attempt to mimic more powerful oracles, which attain optimal mean-squared errors. The increase in power of our oracles is expressed by $\rho_Z(\mu, 1)/\rho_L(\mu, 1) \to \infty$ as $\mu \to 0$. Intuitively, our oracles achieve significant risk savings over the nonzeroness oracle for the case when the true parameter vector has many coordinates which are nearly, but not precisely, zero. We thank Dean Foster and Ed George for calling our attention to this interesting work, which also describes connections with 'classical' model selection, such as Gideon Schwarz's BIC criterion.
(ii) Alan Miller (1984, 1990) has described a model selection procedure whereby an equal number of 'pure noise variables', namely column vectors independent of Y, are appended to the X matrix. One stops adding terms into the model at the point where the next term to be added would be one of the artificial, pure noise variables. This simulation method sets, implicitly, a threshold at the maximum of a collection of n Gaussian random variables. In the orthogonal regression case, this maximum behaves like $(2\log n)^{1/2}$, that is $(\lambda^U_n)$; compare (31). Hence Miller's method is probably not far from minimaxity with respect to an MSE-oracle.

ACKNOWLEDGEMENT
This paper was completed while D. L. Donoho was on leave from the University of
California, Berkeley, where this work was supported by grants from NSF and NASA.
I. M. Johnstone was supported in part by grants from NSF and NIH. Helpful comments
of a referee are gratefully acknowledged. We are also most grateful to Carl Taswell, who
carried out the simulations reported in Table 4.

APPENDIX 1
Proof of Theorem 1
It is enough to verify the univariate case, for the multivariate case follows by summation.
So, let X ~ N(μ, 1), and η_t(x) = sgn(x)(|x| − t)₊. In fact we show that, for all δ ≤ ½ and with t = (2 log δ⁻¹)^{1/2},

E{η_t(X) − μ}² ≤ (2 log δ⁻¹ + 1)(δ + μ² ∧ 1).

Regard the right-hand side above as the minimum of two functions and note first that

E{η_t(X) − μ}² = 1 − 2 pr_μ(|X| ≤ t) + E_μ(X² ∧ t²)   (A1·1)
  ≤ 1 + t² ≤ (2 log δ⁻¹ + 1)(δ + 1),

where we used X² ∧ t² ≤ t². Using instead X² ∧ t² ≤ X², we get from (A1·1)

E{η_t(X) − μ}² ≤ μ² + 2 pr_μ(|X| > t).   (A1·2)

The proof will be complete if we verify that

g(μ) = 2 pr_μ(|X| ≥ t) ≤ δ(2 log δ⁻¹ + 1) + (2 log δ⁻¹)μ².

Since g is symmetric about 0,

g(μ) ≤ g(0) + ½ μ² sup|g″|.   (A1·3)

Finally, some calculus shows that

g(0) = 4Φ̄(t) ≤ δ(2 log δ⁻¹ + 1),

and that sup|g″| ≤ 4 sup_x |xφ(x)| ≤ 4 log δ⁻¹ for all δ ≤ ½.
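A direct numerical check of the inequality just proved (a sketch of ours, using Monte Carlo rather than exact calculation; the grids and sample size are arbitrary): with t = (2 log δ⁻¹)^{1/2}, the simulated risk E{η_t(X) − μ}² should remain below (2 log δ⁻¹ + 1)(δ + μ² ∧ 1) for every δ and μ examined, up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(200_000)                       # one noise sample reused for all (delta, mu)

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

for delta in (0.5, 0.1, 0.01, 0.001):
    t = np.sqrt(2 * np.log(1.0 / delta))
    factor = 2 * np.log(1.0 / delta) + 1
    worst = 0.0
    for mu in np.linspace(0.0, 5.0, 51):
        risk = np.mean((soft(mu + z, t) - mu) ** 2)    # Monte Carlo estimate of E{eta_t(X) - mu}^2
        bound = factor * (delta + min(mu ** 2, 1.0))
        worst = max(worst, risk / bound)
    # the ratio approaches 1/(1 + delta) for large mu, so values close to one are expected
    print(f"delta={delta:6.3f}   max over mu of (simulated risk)/(bound) = {worst:.3f}")
```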

APPENDIX 2
Mean squared error properties of univariate thresholding
We begin a more systematic summary by recording
ρ_ST(λ, μ) = 1 + λ² + (μ² − λ² − 1){Φ(λ − μ) − Φ(−λ − μ)} − (λ − μ)φ(λ + μ) − (λ + μ)φ(λ − μ),   (A2·1)

ρ_HT(λ, μ) = μ²{Φ(λ − μ) − Φ(−λ − μ)} + Φ̄(λ − μ) + Φ̄(λ + μ) + (λ − μ)φ(λ − μ) + (λ + μ)φ(λ + μ),   (A2·2)

where φ, Φ are the standard Gaussian density and distribution function and Φ̄(x) = 1 − Φ(x).
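The two formulas translate directly into code. The sketch below (ours; it relies on scipy's standard normal density, distribution and survival functions) evaluates ρ_ST and ρ_HT and checks one value of each against Monte Carlo.

```python
import numpy as np
from scipy.stats import norm

def rho_st(lam, mu):
    """Soft-threshold risk (A2.1) for X ~ N(mu, 1)."""
    return (1 + lam**2
            + (mu**2 - lam**2 - 1) * (norm.cdf(lam - mu) - norm.cdf(-lam - mu))
            - (lam - mu) * norm.pdf(lam + mu)
            - (lam + mu) * norm.pdf(lam - mu))

def rho_ht(lam, mu):
    """Hard-threshold risk (A2.2) for X ~ N(mu, 1)."""
    return (mu**2 * (norm.cdf(lam - mu) - norm.cdf(-lam - mu))
            + norm.sf(lam - mu) + norm.sf(lam + mu)
            + (lam - mu) * norm.pdf(lam - mu)
            + (lam + mu) * norm.pdf(lam + mu))

rng = np.random.default_rng(3)
mu, lam = 1.3, 2.0
x = mu + rng.standard_normal(400_000)                      # X ~ N(mu, 1)
mc_st = np.mean((np.sign(x) * np.maximum(np.abs(x) - lam, 0) - mu) ** 2)
mc_ht = np.mean((np.where(np.abs(x) > lam, x, 0.0) - mu) ** 2)
print(f"rho_ST({lam}, {mu}) = {rho_st(lam, mu):.4f}   (Monte Carlo {mc_st:.4f})")
print(f"rho_HT({lam}, {mu}) = {rho_ht(lam, mu):.4f}   (Monte Carlo {mc_ht:.4f})")
```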
LEMMA 1. For both ρ = ρ_ST and ρ = ρ_HT,

ρ(λ, μ) ≤ λ² + 1 for all μ ∈ ℝ, λ ≥ c₁,   (A2·3)
ρ(λ, μ) ≤ μ² + 1 for all μ ∈ ℝ,   (A2·4)
ρ(λ, μ) ≤ ρ(λ, 0) + c₂μ², 0 ≤ μ ≤ c₃.   (A2·5)

For soft thresholding, (c₁, c₂, c₃) may be taken as (0, 1, ∞) and for hard thresholding as (1, 1·2, λ). At
μ = 0, we have the inequalities

ρ_ST(λ, 0) ≤ 4λ⁻³φ(λ)(1 + 1·5λ⁻²),   (A2·6)
ρ_HT(λ, 0) ≤ 2φ(λ)(λ + λ⁻¹)  (λ > 1).   (A2·7)
Proof. For soft thresholding, (A2·3) and (A2·4) follow from (A1·1) and (A1·2) respectively. In
fact μ → ρ_ST(λ, μ) is monotone increasing, as follows from

(∂/∂μ)ρ_ST(λ, μ) = 2μ{Φ(λ − μ) − Φ(−λ − μ)}.   (A2·8)

From (A2·8) it follows that (∂²/∂μ²)ρ_ST(λ, μ) ≤ 2 for μ ≥ 0. Using (A1·3) for g = ρ_ST establishes
(A2·5). The inequality (A2·6) follows from (A2·1) and the alternating series bound for Gaussian
tails: Φ̄(λ) ≤ φ(λ)(λ⁻¹ − λ⁻³ + 3λ⁻⁵).
Turning now to hard thresholding, formula (A2·4), and (A2·3) for μ ≤ λ, follow by taking expectations in the pointwise bound

{η_λ^H(X) − μ}² ≤ (X − μ)² + μ² 1{|X| ≤ λ},

where η_λ^H(x) = x 1{|x| > λ} denotes the hard thresholding rule.
Now consider (A2·3). In the range μ ∈ [λ, ∞), write μ = λ + v with v ≥ 0; then

E_μ{η_λ^H(X) − μ}² ≤ E_μ(X − μ)² + μ² pr_μ(|X| ≤ λ) ≤ 1 + (λ + v)² Φ̄(v).

For λ ≥ 1, we obtain (A2·3) from

λ⁻²(λ + v)² Φ̄(v) ≤ (1 + v)² Φ̄(v) ≤ 1

for all v ≥ 0.
To prove (A2·5) it suffices, as for ρ_ST(λ, ·), to bound (∂²/∂μ²)ρ_HT(λ, μ) ≤ 2·4. Differentiating (A2·2)
twice, using elementary bounds on φ and Φ, and finally substituting s = λ + μ and s = λ − μ, we obtain, for 0 ≤ μ ≤ λ,

(∂²/∂μ²)ρ_HT(λ, μ) ≤ 2 + 2 sup_s {φ(s)(s³ − 2s) − 2Φ(−s)} ≤ 2·4.

Finally (A2·7) follows from (A2·2) and Φ̄(λ) ≤ λ⁻¹φ(λ) for λ > 1. □

APPENDIX 3

Proof of Theorem 2: Asymptotics of λ°_n


The quantity λ°_n is the root of ρ_B(λ) = (n + 1)ρ(λ, 0) − ρ(λ, ∞). Note that ρ_B is a continuous
function, with one zero on [0, ∞). Furthermore, ρ_B(0) = n, and ρ_B(+∞) = −∞. Now

ρ_B(λ) = (1 + λ²){2(n + 1)Φ̄(λ) − 1} − 2(n + 1)λφ(λ).   (A3·1)

Note that, if the term in brackets is negative, the whole expression is negative on [λ, ∞). Using
the standard inequality Φ(−λ) ≤ λ⁻¹φ(λ), one verifies that this happens for λ = (2 log n)^{1/2}, for n ≥ 3.
This implies that the zero λ°_n of ρ_B is less than (2 log n)^{1/2}. For n = 2, the claim has been verified by
direct computation.
For the second half, define λ_{n,η} for all sufficiently large n via

λ²_{n,η} = 2 log(n + 1) − 4 log log(n + 1) − log 2π + η.

By using the standard asymptotic result Φ(−λ) ~ λ⁻¹φ(λ) as λ → +∞, it follows that ρ_B(λ_{n,η}) converges to −∞ or ∞ according as η > 0 or η < 0 respectively. This implies (23).
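The root λ°_n is easy to compute numerically. The sketch below (ours; it takes ρ(λ, 0) to be the soft-threshold risk obtained from (A2·1) at μ = 0, and ρ(λ, ∞) = 1 + λ², its large-μ limit) locates the zero of ρ_B by bisection and confirms that it lies below (2 log n)^{1/2}.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def rho_st0(lam):
    # Soft-threshold risk at mu = 0, obtained from (A2.1): 2(1 + lam^2) Phi_bar(lam) - 2 lam phi(lam).
    return 2 * (1 + lam**2) * norm.sf(lam) - 2 * lam * norm.pdf(lam)

def rho_b(lam, n):
    # rho_B(lam) = (n + 1) rho(lam, 0) - rho(lam, infinity), with rho(lam, infinity) = 1 + lam^2.
    return (n + 1) * rho_st0(lam) - (1 + lam**2)

for n in (64, 256, 1024, 4096, 16384):
    upper = np.sqrt(2 * np.log(n))
    lam0 = brentq(rho_b, 1e-6, upper, args=(n,))   # sign change: rho_B(0+) ~ n > 0, rho_B(upper) < 0
    print(f"n={n:6d}  lambda_0={lam0:.3f}   sqrt(2 log n)={upper:.3f}")
```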

APPENDIX 4
Proof of Theorem 2: Equivalence of Λ*_n = Λ°_n, λ*_n = λ°_n
We must prove that

L(μ) = ρ_ST(λ°_n, μ)/{n⁻¹ + min(1, μ²)}

attains its maximum at either μ = 0 or μ = ∞. For μ ∈ [1, ∞], the numerator ρ_ST(λ°_n, μ) is monotone
increasing in μ, and the denominator is constant. For μ ∈ [0, 1], we apply (39) to ρ_ST(λ°, μ). An
argument similar to that following (A3·1) shows that ρ_B(n^{-1/2}) > 0 for n ≥ 3, so that λ° ≥ n^{-1/2}. By the
equation preceding (22), we conclude that nρ(λ°, 0) = {1 + (λ°)²}/(1 + n⁻¹) ≥ 1. Combining this
with (A2·5),

L(μ) ≤ {ρ(λ°, 0) + μ²}/(n⁻¹ + μ²) ≤ ρ(λ°, 0)/n⁻¹ = L(0),

so that L attains its maximum over μ ∈ [0, 1] at 0, establishing the required equivalence.

APPENDIX 5
Proof of Theorem 3
The main idea is to make θ a random variable, with prior distribution chosen so that a randomly
selected subset of about log n coordinates are each of size roughly (2 log n)^{1/2}, and to derive information from the Bayes risk of such a prior.
Consider the θ-varying loss

L_n(θ̂, θ) = {Σ_{i=1}^n (θ̂_i − θ_i)²} / (1 + Σ_i θ_i² ∧ 1)


and the resulting risk

ρ(θ̂, θ) = E_θ L_n(θ̂(w), θ).

Let π be a prior distribution on θ and let

B(θ̂, π) = E_π ρ(θ̂, θ);

finally, let

B(π) = inf_θ̂ B(θ̂, π)

denote the Bayes risk of the prior π. Call the corresponding Bayes rule δ_π.
The minimax theorem of statistical decision theory applies to the loss L_n(θ̂, θ), and so, if we let
m_n denote the left-hand side of (16), we have

m_n = sup_π B(π).

Consequently, Theorem 3 is proved if we can exhibit a sequence of priors π_n such that

B(π_n) ≥ 2 log n {1 + o(1)}, n → ∞.   (A5·1)

Consider the three-point prior distribution

F_{ε,μ} = (1 − ε)ν_0 + ½ε ν_μ + ½ε ν_{−μ},

where ν_x denotes Dirac mass at x. Fix a ≫ 0. Define μ = μ(ε, a) for all sufficiently small ε > 0 by

φ(μ + a)/φ(a) = ε/{2(1 − ε)}.
Then μ(ε, a) ~ (2 log ε⁻¹)^{1/2} as ε → 0.
Our reports [mrlp.tex, mews.tex, ausws.tex] have considered the use of this prior in the scalar
problem of estimating ξ ~ F_{ε,μ} from data v = ξ + z with z ~ N(0, 1) and usual squared-error loss
E{δ(v) − ξ}². They show that the Bayes risk r(F_{ε,μ}) satisfies

r(F_{ε,μ(ε,a)}) = ε μ²(ε, a) Φ(a){1 + o(1)}, ε → 0.   (A5·2)

To apply these results in our problem, we will select ε = ε_n = log n/n, so that

μ = μ_n = μ(ε_n, a) ~ (2 log n − 2 log log n)^{1/2}.

Consider the prior π_n under which the coordinates θ_i are independent and identically distributed
F_{ε_n,μ_n}. This prior has an easily calculated Bayes risk ρ_n(π_n) for the vector problem w_i = θ_i + z_i
(i = 1, ..., n) when the usual ℓ²_n loss ||θ̂ − θ||²_{2,n} is used. Applying (A5·2),

ρ_n(π_n) = n ε_n μ_n² Φ(a){1 + o(1)} ~ 2(log n)² Φ(a), n → ∞.
We use this fact to get a lower bound for the Bayes risk B(π_n).
Consider the random variable N_n = #{i : θ_i ≠ 0}, which has a binomial distribution with parameters n, ε_n. Set η_n = (log n)^{2/3} and define the event A_n = {N_n ≤ nε_n + η_n}. By Chebyshev's inequality,
α_n = P(A^c_n) ≤ nε_n/η_n² → 0. Let δ_n denote the Bayes rule for π_n with respect to the loss L_n. Then

B(π_n) = E L_n(δ_n, θ) ≥ E{L_n(δ_n, θ); A_n}
 ≥ E{||δ_n − θ||²_{2,n}; A_n} / (1 + nε_n + η_n)
 ≥ {ρ_n(π_n) − E(||δ_n − θ||²_{2,n}; A^c_n)} / (1 + nε_n + η_n)   (*)
 = {1 + o(1)} ρ_n(π_n) / (1 + nε_n + η_n)
 ~ 2 log n Φ(a), n → ∞;

as a can be chosen arbitrarily large, this proves (A5·1).


To justify (*) above, we must verify that

E_{π_n} E_θ(||δ_n − θ||²_{2,n}; A^c_n) = o{ρ_n(π_n)} = o(log² n).

We focus only on the trickier term E(||δ_n||²; A^c_n), where we use simply E to denote the joint distribution of θ and x. Set p(θ) = 1 + N_n(θ). Using by turns the conditional expectation representation
for δ_n(x), the Cauchy-Schwarz and Jensen inequalities, we find

||δ_n||² ≤ E{p(θ)|x} E{||θ||²/p(θ)|x},
E(||δ_n||²; A^c_n) ≤ {E p⁴(θ) pr²(A^c_n) E ||θ||⁸/p⁴(θ)}^{1/4} ≤ C μ_n² pr^{1/2}(A^c_n) log n = o(μ_n² log n),

since ||θ||⁸ = N_n⁴ μ_n⁸ and E N_n⁴ = O(log⁴ n).



APPENDIX 6
Proof of Theorems 4 and 6
We give a proof that covers both soft and hard thresholding, and both DP and DS oracles. In
fact, since ρ_DS ≤ ρ_DP, it is enough to consider the smaller ideal risk ρ_DS. Let

L(λ, μ) = ρ(λ, μ) / {n⁻¹ + μ²/(1 + μ²)},   (A6·1)

where ρ is either ρ_ST or ρ_HT. We show that L(λ, μ) ≤ (2 log n)(1 + δ_n) uniformly in μ so long as

−c log log n ≤ λ² − 2 log n ≤ ε_n log n.

Here δ_n → 0 and depends only on ε_n and c in a way that can be made explicit from the proof. For
ρ_ST, we require that c < 5 and, for ρ_HT, that c < 1.
For μ ∈ [(2 log n)^{1/2}, ∞], the numerator of L is bounded above by 1 + λ², from (A2·3), and the
denominator is bounded below by 2 log n/(2 log n + 1).
For μ ∈ [1, (2 log n)^{1/2}], bound the numerator by (A2·4) to obtain

L(λ, μ) ≤ μ⁻²(1 + μ²)² ≤ (2 log n){1 + o(1)}.

For μ ∈ [0, 1], use (A2·5):

L(λ, μ) ≤ {ρ(λ, 0) + c₂μ²}/{n⁻¹ + μ²/(1 + μ²)} ≤ nρ(λ, 0) + 2c₂.

If λ_n(c) = (2 log n − c log log n)^{1/2}, then nφ(λ_n(c)) = φ(0)(log n)^{c/2}. It follows from (A2·6) and (A2·7)
that nρ(λ, 0), and hence L(λ, μ) on this range, is o(log n) if λ ≥ λ_n(c), where c < 5 for soft thresholding and c < 1
for hard thresholding. The expansion (23) shows that this range includes λ* and hence applies to δ*.
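The bound can also be examined numerically. The sketch below (ours; it approximates the supremum over μ by a finite grid and uses the closed form (A2·1)) evaluates L(λ, μ) of (A6·1) at λ = (2 log n)^{1/2} and checks that its largest value exceeds 2 log n by at most a small additive constant, in line with Appendix 7.

```python
import numpy as np
from scipy.stats import norm

def rho_st(lam, mu):
    # Soft-threshold risk (A2.1) for X ~ N(mu, 1); works elementwise on arrays of mu.
    return (1 + lam**2
            + (mu**2 - lam**2 - 1) * (norm.cdf(lam - mu) - norm.cdf(-lam - mu))
            - (lam - mu) * norm.pdf(lam + mu)
            - (lam + mu) * norm.pdf(lam - mu))

mu = np.concatenate([np.linspace(0.0, 10.0, 2001), [50.0, 200.0]])   # grid plus large-mu points
for n in (64, 1024, 16384):
    lam = np.sqrt(2 * np.log(n))
    L = rho_st(lam, mu) / (1.0 / n + mu**2 / (1 + mu**2))             # ratio to the DS oracle risk
    print(f"n={n:6d}  approximate sup_mu L = {L.max():7.2f}    2 log n = {2 * np.log(n):7.2f}")
```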

APPENDIX 7
Proof of Theorem 7
When λ = (2 log n)^{1/2}, the bounds over [1, (2 log n)^{1/2}] and [(2 log n)^{1/2}, ∞] in Appendix 6 become
simply (1 + 2 log n)²/(2 log n) ≤ 2 log n + 2·4 for n ≥ 4. For μ ∈ [0, 1], the bounds follow by direct
evaluation from (A6·1), (A2·6) and (A2·7). We note that these bounds can be improved slightly by
considering the cases separately.

REFERENCES
BICKEL, P. J. (1983). Minimax estimation of a normal mean subject to doing well at a point. In Recent
Advances in Statistics, Ed. M. H. Rizvi, J. S. Rustagi and D. Siegmund, pp. 511-28. New York: Academic
Press.
BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A. & STONE, C. J. (1983). CART: Classification and Regression
Trees. Belmont, CA: Wadsworth.
BROCKMANN, M., GASSER, T. & HERRMANN, E. (1993). Locally adaptive bandwidth choice for kernel regression
estimators. J. Am. Statist. Assoc. 88, 1302-9.
CHUI, C. K. (1992). An Introduction to Wavelets. Boston, MA: Academic Press.
COHEN, A., DAUBECHIES, I., JAWERTH, B. & VIAL, P. (1993). Multiresolution analysis, wavelets, and fast
algorithms on an interval. Comptes Rendus Acad. Sci. Paris A 316, 417-21.
DAUBECHIES, I. (1988). Orthonormal bases of compactly supported wavelets. Commun. Pure Appl. Math.
41, 909-96.
DAUBECHIES, I. (1992). Ten Lectures on Wavelets. Philadelphia: SIAM.
DAUBECHIES, I. (1993). Orthonormal bases of compactly supported wavelets II: Variations on a theme. SIAM
J. Math. Anal. 24, 499-519.
EFROIMOVICH, S. Y. & PINSKER, M. S. (1984). A learning algorithm for nonparametric filtering (in Russian).
Avtomat. i Telemekh. 11, 58-65.
FRAZIER, M., JAWERTH, B. & WEISS, G. (1991). Littlewood-Paley Theory and the Study of Function Spaces,
NSF-CBMS Regional Conf. Ser. in Mathematics, 79. Providence, RI: American Math. Soc.
FRIEDMAN, J. H. & SILVERMAN, B. W. (1989). Flexible parsimonious smoothing and additive modeling (with
discussion). Technometrics 31, 3-39.
FRIEDMAN, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19, 1-67.
LEPSKII, O. V. (1990). On one problem of adaptive estimation in white Gaussian noise. Teor. Veroyatnost. i
Primenen. 35, 459-70 (in Russian); Theory Prob. Applic. 35, 454-66 (in English).
MEYER, Y. (1990). Ondelettes et Opérateurs: I. Ondelettes. Paris: Hermann et Cie.
MEYER, Y. (1991). Ondelettes sur l'intervalle. Revista Matemática Ibero-Americana 7(2), 115-33.
MILLER, A. J. (1984). Selection of subsets of regression variables (with discussion). J. R. Statist. Soc. A
147, 389-425.
MILLER, A. J. (1990). Subset Selection in Regression. London, New York: Chapman and Hall.
MÜLLER, H.-G. & STADTMÜLLER, U. (1987). Variable bandwidth kernel estimators of regression curves. Ann.
Statist. 15, 182-201.
TERRELL, G. R. & SCOTT, D. W. (1992). Variable kernel density estimation. Ann. Statist. 20, 1236-65.

[Received August 1992. Revised June 1993]
