Ideal spatial adaptation by wavelet shrinkage
BY DAVID L. DONOHO AND IAIN M. JOHNSTONE
SUMMARY
With ideal spatial adaptation, an oracle furnishes information about how best to adapt
a spatially variable estimator, whether piecewise constant, piecewise polynomial, variable
knot spline, or variable bandwidth kernel, to the unknown function. Estimation with the aid of an oracle offers dramatic advantages over traditional linear estimation by nonadaptive kernels.
1. INTRODUCTION
1.1. General
Suppose we are given data
    y_i = f(t_i) + e_i   (i = 1, ..., n),   (1)
t_i = i/n, where the e_i are independently distributed as N(0, σ²), and f(·) is an unknown function which we would like to recover. We measure performance of an estimate f̂(·) in terms of quadratic loss at the sample points. In detail, let f = (f(t_i))_{i=1}^n and f̂ = (f̂(t_i))_{i=1}^n denote the vectors of true and estimated sample values, respectively. Let ||v||²_{2,n} = Σ_{i=1}^n v_i² denote the usual squared l²_n norm; we measure performance by the risk
    R(f̂, f) = n^{-1} E ||f̂ − f||²_{2,n},
which we would like to make as small as possible. Although the notation f̂(·) suggests a function of a real variable t, in this paper we work only with the equally spaced sample points t_i.
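To make the sampling model and risk measure concrete, here is a minimal Python sketch (the test function, sample size and noise level are illustrative choices, not values from the paper) that simulates data on the grid t_i = i/n and evaluates the empirical loss n^{-1}||f̂ − f||²_{2,n} for the trivial estimate f̂ = y.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1024                         # sample size (illustrative choice)
sigma = 1.0                      # noise standard deviation
t = np.arange(1, n + 1) / n      # equally spaced sample points t_i = i/n

f = np.sin(4 * np.pi * t)        # an arbitrary smooth test function
y = f + sigma * rng.standard_normal(n)   # y_i = f(t_i) + e_i,  e_i ~ N(0, sigma^2)

f_hat = y                        # the trivial "estimate" f_hat = y

# empirical version of the loss n^{-1} ||f_hat - f||^2_{2,n}
loss = np.mean((f_hat - f) ** 2)
print(f"observed loss {loss:.3f}  (its expectation is sigma^2 = {sigma**2})")
```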
Example 1: Piecewise constants. Here δ determines a partition of [0, 1] into intervals I_1, ..., I_L, and the reconstruction is piecewise constant, using the mean of the data within each piece to estimate that piece.
Example 2: Piecewise polynomials T_PP(D)(y, δ). Here the interpretation of δ is the same as in Example 1, only the reconstruction uses polynomials of degree D:
    T_PP(D)(y, δ)(t) = Σ_{l=1}^L p̂_l(t) 1_{I_l}(t),
where p̂_l(t) = Σ_{k=0}^D a_k t^k is determined by applying the least squares principle to the data arising for interval I_l:
    Σ_{t_i ∈ I_l} {p̂_l(t_i) − y_i}² = min.
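A minimal numpy sketch of this piecewise least-squares fit (the partition, degree and data below are illustrative assumptions, and piecewise_poly_fit is a hypothetical helper, not the authors' software):

```python
import numpy as np

def piecewise_poly_fit(t, y, breakpoints, degree):
    """Least-squares polynomial of the given degree on each interval of the partition.

    `breakpoints` are interior breakpoints of a partition of [0, 1]; the fit on each
    interval uses only the data falling in that interval."""
    edges = np.concatenate(([0.0], np.sort(breakpoints), [1.0]))
    fit = np.empty_like(y, dtype=float)
    for left, right in zip(edges[:-1], edges[1:]):
        idx = (t >= left) & (t <= right) if right == 1.0 else (t >= left) & (t < right)
        coef = np.polyfit(t[idx], y[idx], degree)   # least-squares fit on this piece
        fit[idx] = np.polyval(coef, t[idx])
    return fit

# illustrative usage
rng = np.random.default_rng(1)
n = 512
t = np.arange(1, n + 1) / n
f = np.where(t < 0.5, 2 * t, 1.0 - t)               # a piecewise-linear truth
y = f + 0.1 * rng.standard_normal(n)
f_hat = piecewise_poly_fit(t, y, breakpoints=[0.5], degree=1)
```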
Example 3: Variable-knot splines T_VS(D)(y, δ). Here δ defines a partition as above, and on each interval of the partition the reconstruction formula is a polynomial of degree D, but now the reconstruction must be continuous and have continuous derivatives up to order D − 1. In detail, let τ_l be the left endpoint of I_l (l = 1, ..., L). The reconstruction is chosen from among those piecewise polynomials s(t) satisfying
    (d^k/dt^k) s(τ_l −) = (d^k/dt^k) s(τ_l +)
for k = 0, ..., D − 1, l = 2, ..., L; subject to this constraint, one solves
    Σ_i {s(t_i) − y_i}² = min.
Example 4: Variable-bandwidth kernels. Here the reconstruction is a kernel smoother whose bandwidth may vary with t, built from a kernel K of order D:
    ∫ t^j K(t) dt = 0   (j = 1, ..., D − 1).
More refined versions of this formula would adjust K for boundary effects near t = 0 and t = 1.
As the ideal risk measures performance with a selection Δ(f) based on full knowledge of f rather than a data-dependent selection δ̂(y), it represents an ideal we cannot expect to attain. Nevertheless it is the target we shall consider.
Ideal adaptation offers, in principle, considerable advantages over traditional nonadaptive linear smoothers. Consider a function f which is a piecewise polynomial of degree D, with a finite number of pieces I_1, ..., I_L, say:
    f(t) = Σ_{l=1}^L p_l(t) 1_{I_l}(t).   (4)
In short, oracles offer an improvement, ideally from risk of order n^{-1/2} to order n^{-1}. No better performance than this can be expected, since n^{-1} is the usual 'parametric rate' for estimating finite-dimensional parameters.
Can we approach this ideal performance with estimators using the data alone?
and the remaining element we label w_{-1,0}. To interpret these coefficients let W_{jk} denote the (j, k)th row of 𝒲. The inversion formula y = 𝒲^T w becomes
    y = Σ_{j,k} w_{jk} W_{jk},
expressing y as a sum of basis elements W_{jk} with coefficients w_{jk}. We call the W_{jk} wavelets. The vector W_{jk}, plotted as a function of i, looks like a localized wiggle, hence the name 'wavelet'. For j and k bounded away from extreme cases by the conditions j_0 ≤ j < J − j_1 and S < k < 2^j − S, we have the approximation
    W_{jk}(i) ≈ 2^{j/2} n^{-1/2} ψ(2^j t_i − k),
where ψ is a fixed 'wavelet' in the sense of the usual wavelet transform on ℝ (Meyer, 1990, Ch. 3; Daubechies, 1988). This approximation improves with increasing n and increasing j_1. Here ψ is an oscillating function of compact support, usually called the mother wavelet. We therefore speak of W_{jk} as being localized to spatial positions near t = k2^{-j} and frequencies near 2^j.
The wavelet \\i can have a smooth visual appearance, if the parameters M and S are
chosen sufficiently large, and favourable choices of so-called quadrature mirror filters are
made in the construction of the matrix 'W. Daubechies (1988) described a particular
construction with S = 2M + 1 for which the number of derivatives of ^ is proportional
to M.
For our purposes, the only details we need are as follows.
Property 1. We have that W_{jk} has vanishing moments up to order M, as long as j ≥ j_0:
    Σ_i i^l W_{jk}(i) = 0   (l = 0, ..., M; j ≥ j_0).
Property 2. We have that W_{jk} is supported in [2^{J−j}(k − S), 2^{J−j}(k + S)], provided j ≥ j_0.
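Both properties can be checked numerically. The sketch below assumes the PyWavelets package and uses the standard discrete-filter analogue of Property 1 for Daubechies filters (a filter with N vanishing moments annihilates the discrete monomials k^l, l = 0, ..., N − 1); it is an illustration, not code from the paper.

```python
import numpy as np
import pywt

w = pywt.Wavelet('db4')          # Daubechies wavelet with 4 vanishing moments
g = np.asarray(w.dec_hi)         # high-pass (wavelet) decomposition filter
k = np.arange(len(g))

# discrete analogue of Property 1: sum_k k^l g_k = 0 for l = 0, ..., N-1
for l in range(w.vanishing_moments_psi):
    print(f"moment l={l}: {np.sum(k**l * g): .2e}")

# analogue of Property 2: the filter, and hence each W_{jk}, has compact support
print("filter length:", w.dec_len)
```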
Because of the spatial localization of wavelet bases, the wavelet coefficients allow one to answer easily the question 'is there a significant change in the function near t?' by looking at the wavelet coefficients at levels j = j_0, ..., J at spatial indices k with k2^{-j} ≈ t. If these coefficients are large, the answer is 'yes'.
Figure 1 displays four functions, Bumps, Blocks, HeaviSine and Doppler, which have
been chosen because they caricature spatially variable functions arising in imaging, spectroscopy and other scientific signal processing. For all figures in this paper, n = 2048.
(b) Bumps
    f(t) = Σ h_j K((t − t_j)/w_j),   K(t) = (1 + |t|)^{-4},
    (t_j) as in Blocks,
    (h_j) = (4, 5, 3, 4, 5, 4.2, 2.1, 4.3, 3.1, 5.1, 4.2),
    (w_j) = (0.005, 0.005, 0.006, 0.01, 0.01, 0.03, 0.01, 0.01, 0.005, 0.008, 0.005).
(c) HeaviSine
    f(t) = 4 sin 4πt − sgn(t − 0.3) − sgn(0.72 − t).
(d) Doppler
    f(t) = {t(1 − t)}^{1/2} sin{2π(1 + ε)/(t + ε)},   ε = 0.05.
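For readers wishing to reproduce the examples, here is a Python sketch of three of the test functions. The breakpoint positions t_j ('as in Blocks') are not reproduced in the excerpt above, so the values used below are the ones commonly circulated with later wavelet software and should be treated as an assumption; the heights h_j and widths w_j are taken from the definitions above.

```python
import numpy as np

n = 2048
t = np.arange(1, n + 1) / n

# Breakpoint positions t_j ("as in Blocks"): not reproduced in the excerpt;
# the values below are those commonly used in later implementations (assumption).
t_j = np.array([0.10, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81])
h_j = np.array([4, 5, 3, 4, 5, 4.2, 2.1, 4.3, 3.1, 5.1, 4.2])            # from the text
w_j = np.array([0.005, 0.005, 0.006, 0.01, 0.01, 0.03,
                0.01, 0.01, 0.005, 0.008, 0.005])                        # from the text

def bumps(t):
    K = lambda u: (1.0 + np.abs(u)) ** (-4)
    return sum(h * K((t - tj) / w) for h, tj, w in zip(h_j, t_j, w_j))

def heavisine(t):
    return 4 * np.sin(4 * np.pi * t) - np.sign(t - 0.3) - np.sign(0.72 - t)

def doppler(t, eps=0.05):
    return np.sqrt(t * (1 - t)) * np.sin(2 * np.pi * (1 + eps) / (t + eps))

signals = {'Bumps': bumps(t), 'HeaviSine': heavisine(t), 'Doppler': doppler(t)}
```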
Fig. 2. The four functions in the wavelet domain: most nearly symmetric Daubechies wavelet with N = 8. Wavelet coefficients θ_{jk} are depicted for j = 5, 6, ..., 10. Coefficients in one level, with j constant, are plotted as a series against position t = 2^{-j}k. The vast majority of the coefficients are zero or effectively zero.
Figure 2 depicts the wavelet transforms of the four functions. The large coefficients occur exclusively near the areas of major spatial activity. This property suggests that a spatially adaptive algorithm could be based on the principle of selective wavelet reconstruction. Given a finite list δ of (j, k) pairs, define T_SW(y, δ) by
    T_SW(y, δ) = f̂ = Σ_{(j,k) ∈ δ} w_{jk} W_{jk}.   (5)
This provides reconstructions by selecting only a subset of the empirical wavelet coefficients.
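A minimal sketch of selective wavelet reconstruction (5) using the PyWavelets package (this is not the authors' software; the wavelet choice, the indexing convention and the index set δ are illustrative assumptions): compute the empirical coefficients, zero every coefficient outside δ, and invert.

```python
import numpy as np
import pywt

def selective_wavelet_reconstruction(y, delta, wavelet='sym8', level=None):
    """T_SW(y, delta): keep only the empirical wavelet coefficients whose
    (band, position) index appears in `delta`; zero out the rest.
    (Bands are indexed by position in the wavedec list, coarse to fine,
    a convention that differs from the paper's dyadic j.)"""
    coeffs = pywt.wavedec(y, wavelet, level=level)       # [cA, cD_coarse, ..., cD_fine]
    kept = [coeffs[0]]                                    # coarse coefficients kept as-is
    for j, d in enumerate(coeffs[1:]):
        mask = np.zeros_like(d)
        for (jj, kk) in delta:
            if jj == j and kk < len(d):
                mask[kk] = 1.0
        kept.append(d * mask)
    return pywt.waverec(kept, wavelet)[:len(y)]

# illustrative usage: keep the ten largest detail coefficients of a noisy signal
rng = np.random.default_rng(2)
y = np.sin(8 * np.pi * np.linspace(0, 1, 1024)) + 0.3 * rng.standard_normal(1024)
coeffs = pywt.wavedec(y, 'sym8')
flat = [(j, k, abs(c)) for j, d in enumerate(coeffs[1:]) for k, c in enumerate(d)]
delta = [(j, k) for j, k, _ in sorted(flat, key=lambda x: -x[2])[:10]]
f_hat = selective_wavelet_reconstruction(y, delta)
```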
Our motivation in proposing this principle is twofold. First, for a spatially inhomogeneous function, 'most of the action' is concentrated in a small subset of (j, k)-space. Secondly, under the noise model underlying (1), noise contaminates all wavelet coefficients equally. Indeed, the noise vector e = (e_i) is assumed to be a white noise; so its orthogonal transform z = 𝒲e is also a white noise. Consequently, the empirical wavelet coefficient is
    w_{jk} = θ_{jk} + z_{jk},
where θ = 𝒲f is the wavelet transform of the noiseless data f = (f(t_i))_{i=1}^n. Every empirical wavelet coefficient therefore contributes noise of variance σ², but only a very few wavelet coefficients contribute signal. This is the heuristic of our method.
Ideal spatial adaptation can be defined for selective wavelet reconstruction in the obvious way. For the risk measure defined in §1.1, the ideal risk is
    ℛ(SW, f) = inf_δ R(T_SW(y, δ), f),
with optimal spatial parameter δ = Δ(f), namely a list of (j, k) indices attaining the infimum.
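Given the true coefficients θ and the per-coefficient noise level σ, the oracle selection and the resulting ideal risk are straightforward to evaluate: keep exactly those coefficients whose squared value exceeds σ². The following numpy sketch is a direct transcription of that rule (the particular θ is an illustrative assumption).

```python
import numpy as np

def ideal_selection_and_risk(theta, sigma, n):
    """Oracle Delta(f): keep (j,k) with theta_{jk}^2 > sigma^2.
    Each kept coefficient contributes variance sigma^2 to E||f_hat - f||^2,
    each dropped coefficient contributes its squared bias theta_{jk}^2,
    so the ideal risk is n^{-1} * sum of min(theta^2, sigma^2)."""
    keep = theta ** 2 > sigma ** 2
    ideal_risk = np.sum(np.minimum(theta ** 2, sigma ** 2)) / n
    return keep, ideal_risk

# illustrative sparse coefficient vector: a few large entries, the rest zero
n = 1024
theta = np.zeros(n)
theta[:8] = [12, -9, 7, 5, -4, 3, 2.5, 2]
keep, risk = ideal_selection_and_risk(theta, sigma=1.0, n=n)
print(f"coefficients kept: {keep.sum()},  ideal risk: {risk:.4f}")
```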
Figures 3-6 depict the results of ideal wavelet adaptation for the four functions displayed in Fig. 2. Figure 3 shows noisy versions of the four functions of interest; the signal-to-noise ratio ||signal||_{2,n}/||noise||_{2,n} is 7. Figure 4 shows the noisy data in the wavelet domain. Figure 5 shows the reconstruction by selective wavelet reconstruction using an oracle; Fig. 6 shows the situation in the wavelet domain. Because the oracle helps us to select the important wavelet coefficients, the reconstructions are of high quality.
Fig. 3. Four functions with Gaussian white noise, σ = 1, with f rescaled to have signal-to-noise ratio SD(f)/σ = 7.
The theoretical benefits of ideal wavelet selection can again be seen in the case (4) where f is a piecewise polynomial of degree D. Suppose we use a wavelet basis with parameter M ≥ D. Then Properties 1 and 2 imply that the wavelet coefficients θ_{jk} of f all vanish except for:
(i) coefficients at the coarse levels 0 ≤ j < j_0;
(ii) coefficients at j_0 ≤ j ≤ J whose associated interval [2^{-j}(k − S), 2^{-j}(k + S)] contains a breakpoint of f.
There is a fixed number 2^{j_0} of coefficients satisfying (i), and, in each resolution level j, (θ_{jk}: k = 0, ..., 2^j − 1), at most (# breakpoints) × (2S + 1) satisfying (ii). Consequently, with L denoting again the number of pieces in (4), we have
    #{(j, k): θ_{jk} ≠ 0} ≤ 2^{j_0} + (J − j_0 + 1)(2S + 1)L.
Fig. 4. The four noisy functions in the wavelet domain. Compare Fig. 2. Only a
small number of coefficients stand out against a noise background.
[Figs 5 and 6: panels Ideal[Blocks], Ideal[Bumps], Ideal[HeaviSine], Ideal[Doppler], showing the oracle reconstructions and the corresponding wavelet-domain coefficients.]
Let δ* = {(j, k): θ_{jk} ≠ 0}. Then, because of the orthogonality of the (W_{jk}), Σ_{(j,k) ∈ δ*} w_{jk} W_{jk} is the least-squares estimate of f, and
    R(T_SW(y, δ*), f) ≤ (C_1 + C_2 J) L σ²/n   (6)
for all n = 2^{J+1}, with certain constants C_1, C_2, depending linearly on S, but not on f.
Hence
    ℛ(SW, f) ≤ (C_1 + C_2 log n) L σ²/n   (7)
for every piecewise polynomial of degree D ≤ M. This is nearly as good as the bound σ²L(D + 1)/n of ideal piecewise polynomial adaptation, and considerably better than the rate n^{-1/2} of usual nonadaptive linear methods.
[Fig. 7: WaveSelect reconstructions of the noisy data; panels include WaveSelect[HeaviSine].]
The theoretical properties are also interesting. Our method has the property that, for every piecewise polynomial (4) of degree D ≤ M with at most L pieces,
    R(f̂*, f) ≤ (C_1 + C_2 log n)(2 log n + 1) L σ²/n,
where C_1 and C_2 are as in (6); this result is merely a combination of (7) and (8). Hence in this special case we have an actual estimator coming within C log² n of ideal piecewise polynomial fits.
[Fig. 8: the WaveSelect reconstructions in the wavelet domain; panels include WaveSelect[HeaviSine] and WaveSelect[Doppler].]
1.7. Contents
Section 2 discusses the problem of mimicking ideal wavelet selection; §3 shows why wavelet selection offers the same advantages as piecewise polynomial fits; §4 discusses variations and relations to other work. The appendixes contain certain proofs. Related manuscripts by the authors, currently under publication review and available as PostScript files by anonymous ftp from playfair.stanford.edu, are cited in the text by [filename.ps].
Define the hard and soft threshold nonlinearities
    η_H(w, λ) = w 1{|w| > λ},   (11)
    η_S(w, λ) = sgn(w)(|w| − λ)_+.   (12)
The hard threshold rule is reminiscent of subset selection rules used in model selection and we return to it later. For now, we focus on soft thresholding.
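In code, the two rules (11) and (12) are one-liners; the following numpy sketch follows the definitions directly (an illustration, not the authors' implementation).

```python
import numpy as np

def hard_threshold(w, lam):
    """eta_H(w, lambda) = w * 1{|w| > lambda}  -- 'keep or kill'."""
    return w * (np.abs(w) > lam)

def soft_threshold(w, lam):
    """eta_S(w, lambda) = sgn(w) * (|w| - lambda)_+  -- shrink towards zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([-3.0, -0.5, 0.2, 1.4, 4.0])
print(hard_threshold(w, 1.0))   # values below the threshold are killed
print(soft_threshold(w, 1.0))   # surviving values are pulled towards zero by lambda
```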
THEOREM 1. Assume model (9)-(10). The estimator
    θ̂_i = η_S(w_i, ε(2 log n)^{1/2})   (i = 1, ..., n)
satisfies
    E||θ̂ − θ||²_{2,n} ≤ (2 log n + 1){ε² + Σ_{i=1}^n min(θ_i², ε²)}.   (13)
Now ε² denotes the mean-squared loss for estimating one parameter unbiasedly, so the inequality says that we can mimic the performance of an oracle plus one extra parameter to within a factor of essentially 2 log n. A short proof appears in Appendix 1. However it is natural and more revealing to look for 'optimal' thresholds λ*_n which yield the smallest possible constant in an inequality of this form.
Now let (y_i) be data as in model (1) and let w = 𝒲y be the discrete wavelet transform. Then with ε = σ,
    w_{jk} = θ_{jk} + ε z_{jk}   (j = 0, ..., J; k = 0, ..., 2^j − 1).
As in §1.4, we define selective wavelet reconstruction via T_SW(y, δ), see (5), and observe that
    ℛ(SW, f) = ℛ(DP, θ)   (18)
in the sense that (5) is realized by wavelet transform, followed by diagonal linear projection or shrinkage, followed by inverse wavelet transform. Because of the Parseval relation (17), we have
then again by Parseval E||f̂* − f||²_{2,n} = E||θ̂* − θ||²_{2,n}, and we immediately conclude the following.
COROLLARY 1. For all f and all n = 2^{J+1},
    R(f̂*, f) ≤ Λ*_n {σ²/n + ℛ(SW, f)}.
Moreover, no estimator can satisfy a better inequality than this for all f and all n, in the sense that for no measurable estimator can such an inequality hold, for all n and f, with Λ*_n replaced by {2 − ε + o(1)} log n. The same type of inequality holds for an estimator based on hard thresholding.
2.4. Implementation
We have developed a computer software package which runs in the numerical computing environment Matlab. In addition, an implementation by G. P. Nason in the S language is available by anonymous ftp from Statlib at lib.stat.cmu.edu; other implementations are also in development. They implement the following modification of f̂*.
DEFINITION 1. Let θ̂* denote the estimator in the wavelet domain obtained by
    θ̂*_{jk} = w_{jk}   (j < j_0),
    θ̂*_{jk} = η_S(w_{jk}, λ*_n σ)   (j_0 ≤ j ≤ J).
The name RiskShrink for the estimator emphasises that shrinkage of wavelet coefficients is performed by soft thresholding, and that a mean squared error or 'risk' approach has been taken to specify the threshold. Alternative choices of threshold lead to the estimators VisuShrink introduced in §4.2 below, and SureShrink discussed in our report [ausws.ps].
The rationale behind this rule is as follows. The wavelets W_{jk} at levels j < j_0 do not have vanishing means, and so the corresponding coefficients θ_{jk} should not generally cluster around zero. Hence, those coefficients, a fixed number, independent of n, should
not be shrunken towards zero. Let S̃W denote selective wavelet reconstruction in which the levels below j_0 are never shrunk. We have, evidently, the risk bound
    ℛ(S̃W, f) ≤ 2^{j_0} σ²/n + ℛ(SW, f),
and of course
    ℛ(SW, f) ≤ ℛ(S̃W, f).
The key inequality (13) follows immediately: first assume ε = 1. Set θ̂*_i = η_S(w_i, λ*_n). Then
    E||θ̂* − θ||²_{2,n} = Σ_i ρ_ST(λ*_n, θ_i) ≤ Λ*_n Σ_i {n^{-1} + min(θ_i², 1)} = Λ*_n {1 + Σ_i min(θ_i², 1)}.
If ε ≠ 1, then for θ̂*_i = η_S(w_i, λ*_n ε) we get by rescaling that
    E||θ̂* − θ||²_{2,n} ≤ Λ*_n {ε² + Σ_i min(θ_i², ε²)},
and the inequality (15) follows. Consequently, Theorem 2 follows from asymptotics for Λ*_n and λ*_n. To obtain these, consider the analogous quantities where the supremum over the interval [0, ∞) is replaced by the supremum over the endpoints {0, ∞}:
    Λ°_n = inf_λ sup_{μ ∈ {0,∞}} ρ_ST(λ, μ) / {n^{-1} + min(μ², 1)},   (21)
and λ°_n is the largest λ attaining Λ°_n. In Appendix 4 we show that Λ*_n = Λ°_n and λ*_n = λ°_n.
We remark that ρ_ST(λ, ∞) is strictly increasing in λ and ρ_ST(λ, 0) is strictly decreasing in λ, so that at the solution of (21),
    (n + 1) ρ_ST(λ, 0) = ρ_ST(λ, ∞).   (22)
Hence this last equation defines λ°_n uniquely, and, as is shown in Appendix 3, leads to
    λ°_n ~ (2 log n)^{1/2}   (n → ∞).
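Equation (22) also makes the threshold easy to compute numerically. Using the μ = 0 and μ = ∞ special cases of the closed form of ρ_ST given in Appendix 2, namely ρ_ST(λ, 0) = 2{(1 + λ²)Φ(−λ) − λφ(λ)} and ρ_ST(λ, ∞) = 1 + λ², the following sketch (scipy; an illustration of the defining equation, not the authors' tabulation) solves (n + 1)ρ_ST(λ, 0) = 1 + λ² for λ°_n.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def rho_st_at_zero(lam):
    """rho_ST(lambda, 0) = 2{(1 + lambda^2) Phi(-lambda) - lambda phi(lambda)},
    the mu = 0 special case of (A2.1)."""
    return 2.0 * ((1.0 + lam ** 2) * norm.cdf(-lam) - lam * norm.pdf(lam))

def threshold_from_eq22(n):
    """Solve (n + 1) rho_ST(lambda, 0) = 1 + lambda^2 = rho_ST(lambda, infinity); cf. (22)."""
    g = lambda lam: (n + 1) * rho_st_at_zero(lam) - (1.0 + lam ** 2)
    # g(0) = n > 0 and g is negative well beyond (2 log n)^{1/2}, so a root is bracketed
    return brentq(g, 0.0, np.sqrt(2 * np.log(n)) + 2.0)

for n in (64, 256, 1024, 4096):
    lam = threshold_from_eq22(n)
    print(f"n = {n:5d}: lambda ~ {lam:.3f}   (2 log n)^(1/2) = {np.sqrt(2*np.log(n)):.3f}")
```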
Finally, to verify (25) observe that the optimal variable-knot spline ŝ of order D for noiseless data is certainly a piecewise polynomial, so ||f − s||_2 ≤ ||f − ŝ||_2. It depends on at least L unknown parameters and so for noisy data has a variance term at least 1/(D + 1) times that of (26). Therefore,
4. DISCUSSION
4.1. Variations on choice of oracle
An alternative family of estimators for the multivariate normal estimation problem (9) is given by diagonal linear shrinkers:
    T_DS(w, δ) = (δ_i w_i)_{i=1}^n,   δ_i ∈ [0, 1].
Such estimators shrink each coordinate towards 0, different coordinates being possibly treated differently. An oracle Δ_DS(θ) for this family of estimators provides the ideal coefficients (δ_i) = (θ_i²/(θ_i² + ε²))_{i=1}^n and would yield an ideal risk
    ℛ(DS, θ) = Σ_{i=1}^n θ_i² ε² / (θ_i² + ε²),
say. There is an oracle inequality for diagonal shrinkage also.
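The two ideal risks are easy to compare numerically; the sketch below (the coefficient vector and ε are illustrative assumptions) evaluates the projection-oracle ideal risk Σ min(θ_i², ε²) and the shrinkage-oracle ideal risk Σ θ_i²ε²/(θ_i² + ε²), the latter never being larger.

```python
import numpy as np

def ideal_risk_projection(theta, eps):
    # DP oracle: keep-or-kill each coordinate -> sum of min(theta_i^2, eps^2)
    return np.sum(np.minimum(theta ** 2, eps ** 2))

def ideal_risk_shrinkage(theta, eps):
    # DS oracle: shrink coordinate i by theta_i^2 / (theta_i^2 + eps^2)
    return np.sum(theta ** 2 * eps ** 2 / (theta ** 2 + eps ** 2))

theta = np.concatenate([np.array([8.0, 5.0, 3.0, 1.0, 0.5]), np.zeros(995)])
eps = 1.0
print("projection-oracle ideal risk:", ideal_risk_projection(theta, eps))
print("shrinkage-oracle ideal risk: ", ideal_risk_shrinkage(theta, eps))  # never larger
```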
THEOREM 6. (i) The soft thresholding estimator θ̂* with threshold λ*_n satisfies
    E||θ̂* − θ||²_{2,n} ≤ Λ_n^{DS} {ε² + ℛ(DS, θ)},   (28)
where, in parallel with
    Λ*_n = inf_λ sup_μ ρ_ST(λ, μ)/{n^{-1} + ρ_T(μ, 1)}   (n = 4, 5, ...),
we set
    Λ_n^{DS} = inf_λ sup_μ ρ_ST(λ, μ)/{n^{-1} + ρ_L(μ, 1)}   (n = 4, 5, ...).
The drawback of this simple threshold formula is that in samples on the order of
dozens or hundreds, the mean squared error performance of minimax thresholds is
noticeably better.
VisuShrink. On the other hand, the threshold (2 log n)^{1/2} has an important visual advantage: the almost 'noise-free' character of reconstructions. This can be explained as follows. The wavelet transform of many noiseless objects, such as those portrayed in Fig. 1, is very sparse, and filled with essentially zero coefficients. After contamination with noise, these coefficients are all nonzero. If a sample that in the noiseless case ought to be zero is in the noisy case nonzero, and that character is preserved in the reconstruction, the reconstruction will have an annoying visual appearance: it will contain small blips against an otherwise clean background.
The threshold (2 log n)^{1/2} avoids this problem because of the fact that when (z_i) is a white noise sequence, independent and identically distributed N(0, 1), then, as n → ∞,
    pr{ max_{1 ≤ i ≤ n} |z_i| > (2 log n)^{1/2} } → 0.
So, with high probability, every sample in the wavelet transform in which the underlying signal is exactly zero will be estimated as zero.
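This probability statement is easy to check by simulation; a quick sketch (the sample sizes and replication count are arbitrary choices) follows. The exceedance probability decreases only slowly with n.

```python
import numpy as np

rng = np.random.default_rng(3)
for n in (256, 4096, 65536):
    lam = np.sqrt(2 * np.log(n))
    # fraction of replications in which max_i |z_i| exceeds (2 log n)^{1/2}
    exceed = np.mean([np.abs(rng.standard_normal(n)).max() > lam for _ in range(200)])
    print(f"n = {n:6d}: estimated P(max |z_i| > (2 log n)^(1/2)) ~ {exceed:.2f}")
```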
Figure 9 displays the results of using this threshold on the noisy data of Figs 3 and 4.
Fig. 9. VisuShrink reconstructions using soft thresholding and λ = (2 log n)^{1/2}. Notice the 'noise-free' character; compare Figs 1, 3, 5, 7.
DEFINITION 2. Let θ̂^V denote the estimator in the wavelet domain obtained by
    θ̂^V_{jk} = w_{jk}   (j < j_0),
    θ̂^V_{jk} = η_S(w_{jk}, σ(2 log n)^{1/2})   (j_0 ≤ j ≤ J).
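A compact sketch of this estimator using the PyWavelets package (not the authors' Matlab or S implementation; the wavelet, the choice of coarse level j_0 and the assumption that σ is known are illustrative):

```python
import numpy as np
import pywt

def visu_shrink(y, sigma, wavelet='sym8', coarse_level=5):
    """Definition 2: keep the coarse coefficients, soft-threshold the rest
    at the universal threshold sigma * sqrt(2 log n)."""
    n = len(y)
    lam = sigma * np.sqrt(2.0 * np.log(n))
    # decompose down to roughly 2^coarse_level coarse coefficients (the paper's j_0)
    level = min(int(np.log2(n)) - coarse_level,
                pywt.dwt_max_level(n, pywt.Wavelet(wavelet).dec_len))
    coeffs = pywt.wavedec(y, wavelet, level=level)
    shrunk = [coeffs[0]]                          # coarse levels j < j_0: untouched
    for d in coeffs[1:]:                          # detail levels: soft thresholding
        shrunk.append(pywt.threshold(d, lam, mode='soft'))
    return pywt.waverec(shrunk, wavelet)[:n]

# illustrative usage on a noisy HeaviSine-type signal
rng = np.random.default_rng(4)
t = np.arange(1, 2049) / 2048
f = 4 * np.sin(4 * np.pi * t) - np.sign(t - 0.3) - np.sign(0.72 - t)
y = f + rng.standard_normal(len(t))               # sigma = 1
f_hat = visu_shrink(y, sigma=1.0)
print("empirical risk:", np.mean((f_hat - f) ** 2))
```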
Not only is the method better in visual quality than RiskShrink, the asymptotic risk bounds are no worse:
    R(f̂^V, f) ≤ (2 log n + 1){σ²/n + ℛ(SW, f)}.
Fig. 10. Ideal selective Fourier reconstruction. Compare Fig. 5. Superiority of wavelet
oracle is evident.
(ii) The Efroimovich-Pinsker work did not have access to the oracle inequality and
used a different approach, not based on thresholding but instead on grouping in blocks
and adaptive linear damping within blocks. Such an approach cannot obey the same risk
bounds as the oracle inequality, and can easily depart from ideal risk by larger than
logarithmic factors. Indeed, from a 'minimax over L2-Sobolev balls' point of view, for
which the Efroimovich-Pinsker work was designed, the adaptive linear damping is essentially optimal; compare comments in our report [ausws.ps, §4]. Actual reconstructions
by RiskShrink and by the Efroimovich-Pinsker method on the data of Fig. 3 show that
RiskShrink is much better for spatial adaptation; see Fig. 4 of [ausws.ps].
where the constants γ_1, γ_2 correspond to the smallest and largest singular values of P_L and P_R, and hence do not depend on n = 2^{J+1}. Thus all the ideal risk inequalities in the paper remain valid, with only an additional dependence of the constants on γ_1 and γ_2. In particular, the conclusions concerning logarithmic mimicking of oracles are unchanged.
4.7. Relation to model selection
RiskShrink may be viewed by statisticians as an automatic model selection method,
which picks a subset of the wavelet vectors and fits a 'model', consisting only of wavelets
in that subset, to the data by ordinary least-squares. Our results show that the method
gives almost the same performance in mean-squared error as one could attain if one knew
in advance which model provided the minimum mean-squared error.
Our results apply equally well in orthogonal regression. Suppose we have Y = Xβ + E, with noise E_i independent and identically distributed as N(0, σ²), and X an n × p matrix. Suppose that the predictor variables are orthogonal: XᵀX = I_p. Theorem 1 shows that the estimator β̂*, obtained by applying the soft threshold rule coordinatewise to XᵀY, achieves a risk not worse than p^{-1}σ² + ℛ(DP, β) by more than a factor 2 log p + 1. This point of view has amusing consequences. For example, the hard thresholding estimator β̂⁺, obtained by applying the hard threshold rule coordinatewise to XᵀY, amounts to 'backwards-deletion' variable selection; one retains in the final model only variables which had Z-scores larger than λ.
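A small sketch of this orthogonal-regression point of view (the design, sparsity pattern and threshold level below are illustrative assumptions, and the universal threshold σ(2 log p)^{1/2} is used rather than the minimax λ*_p): build a design with orthonormal columns, form XᵀY, and threshold coordinatewise.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 256, 64, 1.0

# an n x p design with orthonormal columns (X^T X = I_p), via QR
X, _ = np.linalg.qr(rng.standard_normal((n, p)))

beta = np.zeros(p)
beta[:5] = [6.0, -5.0, 4.0, 3.0, -2.5]          # a sparse coefficient vector
Y = X @ beta + sigma * rng.standard_normal(n)

z = X.T @ Y                                      # per-variable statistics; z_j ~ N(beta_j, sigma^2)
lam = sigma * np.sqrt(2.0 * np.log(p))

beta_soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # soft thresholding
beta_hard = z * (np.abs(z) > lam)                            # 'backwards-deletion' selection

print("selected variables:", np.flatnonzero(beta_hard))
print("soft-threshold squared error:", np.sum((beta_soft - beta) ** 2))
```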
ACKNOWLEDGEMENT
This paper was completed while D. L. Donoho was on leave from the University of
California, Berkeley, where this work was supported by grants from NSF and NASA.
I. M. Johnstone was supported in part by grants from NSF and NIH. Helpful comments
of a referee are gratefully acknowledged. We are also most grateful to Carl Taswell, who
carried out the simulations reported in Table 4.
APPENDIX 1
Proof of Theorem 1
It is enough to verify the univariate case, for the multivariate case follows by summation.
So, let X ~ N(μ, 1), and η_λ(x) = sgn(x)(|x| − λ)_+. In fact we show that, for all
APPENDIX 2
Mean squared error properties of univariate thresholding
We begin a more systematic summary by recording
    ρ_ST(λ, μ) = 1 + λ² + (μ² − λ² − 1){Φ(λ − μ) − Φ(−λ − μ)} − (λ − μ)φ(λ + μ) − (λ + μ)φ(λ − μ),   (A2.1)
    ρ_HT(λ, μ) = μ²{Φ(λ − μ) − Φ(−λ − μ)} + Φ̃(λ − μ) + Φ̃(λ + μ) + (λ − μ)φ(λ − μ) + (λ + μ)φ(λ + μ),   (A2.2)
where φ, Φ are the standard Gaussian density and distribution function and Φ̃(x) = 1 − Φ(x).
LEMMA 1. For both ρ = ρ_ST and ρ = ρ_HT,
    ρ(λ, μ) ≤ λ² + 1   for all μ ∈ ℝ, λ > c_1,   (A2.3)
    ρ(λ, μ) ≤ μ² + 1   for all μ ∈ ℝ,   (A2.4)
    ρ(λ, μ) ≤ ρ(λ, 0) + c_2 μ²   for 0 < μ < c_3.   (A2.5)
For soft thresholding, (c_1, c_2, c_3) may be taken as (0, 1, ∞) and for hard thresholding as (1, 1.2, λ). At μ = 0, we have the inequalities
    ρ_ST(λ, 0) ≤ 4λ^{-3}φ(λ)(1 + 1.5λ^{-2}),   (A2.6)
    ρ_HT(λ, 0) ≤ 2φ(λ)(λ + λ^{-1})   (λ > 1).   (A2.7)
Proof. For soft thresholding, (A2.3) and (A2.4) follow from (A1.1) and (A1.2) respectively. In fact μ → ρ_ST(λ, μ) is monotone increasing, as follows from
    (∂/∂μ) ρ_ST(λ, μ) = 2μ{Φ(λ − μ) − Φ(−λ − μ)}.   (A2.8)
From (A2.8) it follows that (∂/∂μ) ρ_ST(λ, μ) ≤ 2μ for μ ≥ 0. Using (A1.3) for g = ρ_ST establishes (A2.5). The inequality (A2.6) follows from (A2.1) and the alternating series bound for Gaussian tails: Φ̃(λ) ≤ φ(λ)(λ^{-1} − λ^{-3} + 3λ^{-5}).
Turning now to hard thresholding, formula (A2.4), and (A2.3) for μ ≤ λ, follow by taking expectations in the corresponding pointwise inequality. Finally (A2.7) follows from (A2.2) and Φ̃(λ) ≤ λ^{-1}φ(λ) for λ > 1. □
APPENDIX 3
Note that, if the term in brackets is negative, the whole expression is negative on [λ, ∞). Using the standard inequality Φ(−λ) ≤ λ^{-1}φ(λ), one verifies that this happens for λ = (2 log n)^{1/2}, for n ≥ 3. This implies that the zero λ° in question is less than (2 log n)^{1/2}. For n = 2, the claim has been verified by direct computation.
For the second half, define λ_{n,η} for all sufficiently large n via
    λ²_{n,η} = 2 log (n + 1) − 4 log log (n + 1) − log 2π + η.
By using the standard asymptotic result Φ(−λ) ~ λ^{-1}φ(λ) as λ → +∞, it follows that the expression evaluated at λ_{n,η} converges to −∞ or ∞ according as η > 0 or η < 0 respectively. This implies (23).
APPENDIX 4
Proof of Theorem 2: equivalence Λ*_n = Λ°_n, λ*_n = λ°_n
We must prove that
    μ ↦ ρ_ST(λ°_n, μ) / {n^{-1} + min(1, μ²)}
attains its maximum at either μ = 0 or μ = ∞. For μ ∈ [1, ∞], the numerator ρ_ST(λ°_n, μ) is monotone increasing in μ, and the denominator is constant. For μ ∈ [0, 1], we apply (39) to ρ_ST(λ°_n, μ). An argument similar to that following (A3.1) shows that the relevant expression is positive at n^{-1/2} for n ≥ 3, so that λ°_n ≥ n^{-1/2}. By the equation preceding (22), we conclude that nρ_ST(λ°_n, 0) = {1 + (λ°_n)²}/(1 + n^{-1}) ≥ 1. Combining this with (A2.5),
APPENDIX 5
Proof of Theorem 3
The main idea is to make θ a random variable, with prior distribution chosen so that a randomly selected subset of about log n coordinates are each of size roughly (2 log n)^{1/2}, and to derive information from the Bayes risk of such a prior.
Consider the θ-varying loss appearing in (16); finally, let ρ(π) denote the Bayes risk of the prior π, and call the corresponding Bayes rule δ_π. The minimax theorem of statistical decision theory applies to this loss, and so, if we let m_n denote the left-hand side of (16), we have
    m_n = sup_π ρ(π).
where ν_x denotes Dirac mass at x. Fix a ≫ 0, and define μ = μ(ε, a) for all sufficiently small ε > 0.
Our reports [mrlp.tex, mews.tex, ausws.tex] have considered the use of this prior in the scalar problem of estimating ξ ~ F_μ from data v = ξ + z with z ~ N(0, 1) and the usual squared-error loss E{δ(v) − ξ}². They show that the Bayes risk satisfies the bound (A5.2).
To apply these results in our problem, we will select ε = ε_n = log n / n. We use this fact to get a lower bound for the Bayes risk ρ(π_n).
Consider the random variable N_n = #{i: θ_i ≠ 0}, which has a binomial distribution with parameters n, ε_n. Set η_n = (log n)^{2/3} and define the event A_n = {N_n < nε_n + η_n}. By Chebyshev's inequality, α_n = pr(A_n^c) ≤ nε_n/η_n² → 0. Let δ_n denote the Bayes rule for π_n with respect to the loss L_n. The Bayes risk ρ(π_n) may then be bounded below by a main term, with denominator 1 + nε_n + η_n on the event A_n, together with a remainder involving A_n^c.
We focus only on the trickier term E(||δ_π||², A_n^c), where we use simply E to denote expectation under the joint distribution of θ and x. Set p(θ) = 1 + N_n(θ). Using by turns the conditional expectation representation for δ_π(x), the Cauchy-Schwarz and Jensen inequalities, we find
    ||δ_π(x)||² ≤ E{p(θ) | x} E{||θ||²/p(θ) | x},
    E(||δ_π||², A_n^c) ≤ {E p⁴(θ) pr²(A_n^c) E ||θ||⁸/p⁴(θ)}^{1/4} ≤ C μ² pr^{1/4}(A_n^c) log n = o(μ² log n).
APPENDIX 6
Proof of Theorems 4 and 6
We give a proof that covers both soft and hard thresholding, and both DP and DS oracles. In fact, since ρ_L ≤ ρ_T it is enough to consider ρ = ρ_L. Let
    L(λ, μ) = ρ(λ, μ) / {n^{-1} + ρ_L(μ, 1)},
where ρ is either ρ_ST or ρ_HT. We show that L(λ, μ) ≤ (2 log n)(1 + δ_n) uniformly in μ so long as
    −c log log n ≤ λ² − 2 log n ≤ ε_1 log n.
Here δ_n → 0 and depends only on ε_1 and c in a way that can be made explicit from the proof. For ρ_ST, we require that c < 5 and, for ρ_HT, that c < 1.
For μ ∈ [(2 log n)^{1/2}, ∞], the numerator of L is bounded above by 1 + λ², from (A2.3), and the denominator is bounded below by a quantity tending to 1.
If λ_n(c) = (2 log n − c log log n)^{1/2}, then nφ(λ_n(c)) = φ(0)(log n)^{c/2}. It follows from (A2.6) and (A2.7) that nρ(λ, 0), and hence L(λ, 0), is o(log n) if λ ≥ λ_n(c), where c < 5 for soft thresholding and c < 1 for hard thresholding. The expansion (23) shows that this range includes λ*_n and hence θ̂*.
APPENDIX 7
Proof of Theorem 7
When λ = (2 log n)^{1/2}, the bounds over [1, (2 log n)^{1/2}] and [(2 log n)^{1/2}, ∞] in Appendix 6 become simply (1 + 2 log n)²/(2 log n) ≤ 2 log n + 2.4 for n ≥ 4. For μ ∈ [0, 1], the bounds follow by direct evaluation from (A6.1), (A2.6) and (A2.7). We note that these bounds can be improved slightly by considering the cases separately.
REFERENCES
BICKEL, P. J. (1983). Minimax estimation of a normal mean subject to doing well at a point. In Recent Advances in Statistics, Ed. M. H. Rizvi, J. S. Rustagi and D. Siegmund, pp. 511-28. New York: Academic Press.
BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A. & STONE, C. J. (1983). CART: Classification and Regression Trees. Belmont, CA: Wadsworth.
BROCKMANN, M., GASSER, T. & HERRMANN, E. (1993). Locally adaptive bandwidth choice for kernel regression
estimators. J. Am. Statist. Assoc. 88, 1302-9.
CHUI, C. K. (1992). An Introduction to Wavelets. Boston, MA: Academic Press.
COHEN, A., DAUBECHIES, I., JAWERTH, B. & VIAL, P. (1993). Multiresolution analysis, wavelets, and fast
algorithms on an interval. Comptes Rendus Acad. Sci. Paris A 316, 417-21.
DAUBECHIES, I. (1988). Orthonormal bases of compactly supported wavelets. Commun. Pure Appl. Math.
41, 909-96.
DAUBECHIES, I. (1992). Ten Lectures on Wavelets. Philadelphia: SIAM.
DAUBECHIES, I. (1993). Orthonormal bases of compactly supported wavelets II: Variations on a theme. SIAM
J. Math. Anal. 24, 499-519.
EFROIMOVICH, S. Y. & PINSKER, M. S. (1984). A learning algorithm for nonparametric filtering (in Russian). Automat. i Telemeh. 11, 58-65.
FRAZIER, M., JAWERTH, B. & WEISS, G. (1991). Littlewood-Paley Theory and the Study of Function Spaces,
NSF-CBMS Regional Conf. Ser. in Mathematics, 79. Providence, RI: American Math. Soc.
FRIEDMAN, J. H. & SILVERMAN, B. W. (1989). Flexible parsimonious smoothing and additive modeling (with
discussion). Technometrics 31, 3-39.
FRIEDMAN, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19, 1-67.
LEPSKII, O. V. (1990). On one problem of adaptive estimation in white Gaussian noise. Teor. Veroyatnost. i Primenen. 35, 459-70 (in Russian); Theory Prob. Applic. 35, 454-66 (in English).
MEYER, Y. (1990). Ondelettes et Opérateurs: I. Ondelettes. Paris: Hermann et Cie.
MEYER, Y. (1991). Ondelettes sur l'intervalle. Revista Matemática Iberoamericana 7(2), 115-33.
MILLER, A. J. (1984). Selection of subsets of regression variables (with discussion). J. R. Statist. Soc. A
147, 389-425.
MILLER, A. J. (1990). Subset Selection in Regression. London, New York: Chapman and Hall.
MÜLLER, H.-G. & STADTMÜLLER, U. (1987). Variable bandwidth kernel estimators of regression curves. Ann.
Statist. 15, 182-201.
TERRELL, G. R. & SCOTT, D. W. (1992). Variable kernel density estimation. Ann. Statist. 20, 1236-65.