tensively studied in the past under the conditions that α or k are known.
There have been some generalizations to the cases of unknown α and k , but
then the results are typically given in an asymptotic form (as n → +∞).
In this paper, the aim is to introduce an estimation procedure for model
(1.1) which, when applied to some Hölderian function f satisfying (1.2) with
unknown values of α and ‖f‖_α, will perform almost as well as a procedure
based on the knowledge of those two parameters. This is what is usually called
adaptation. In the same way, our procedure will estimate model (1.3) with an
unknown value of k (k ≤ k̄ for some known k̄) almost as well as if k were
known. Moreover, the results will be given in the form of nonasymptotic
bounds for the risk of our estimators. Many other examples can be
treated by the same method. One could, for instance, replace the regularity
conditions (1.2) by more sophisticated ones and model (1.3) by a nonlinear
analogue.
In order to explain the main idea underlying the approach, let us turn back
to the two previous examples. Model (1.3) says that f belongs to some specific
k-dimensional linear space S_k of functions from ℝ^k to ℝ. When k is known,
a classical estimator of f is the least squares estimator over Sk . Dealing with
an unknown k therefore amounts to choosing a “good” value k̂ of k from
the data. By “good,” we mean here that the estimation procedure based on k̂
should perform almost as well as the procedure based on the true value of k .
The treatment of model (1.1) when f satisfies a condition of type (1.2) is
actually quite similar. Let us expand f in some suitable orthonormal basis
(φ_j)_{j≥1} of L²([0, 1], dx) (the Haar basis for instance). Then (1.1) can be writ-
ten as Y_i = Σ_{j=1}^{∞} β_j φ_j(X_i) + ε_i and a classical procedure for estimating f
is as follows: define SJ to be the J-dimensional linear space generated by
φ_1, ..., φ_J and f̂_J to be the least squares estimator on S_J, that is, the least
squares estimator for model (1.1) when f is supposed to belong to SJ . The
problem is to determine from the data set some Ĵ in such a way that the
least squares estimator f̂Ĵ performs almost as well as the best least-squares
estimator of f, that is, the one which achieves the minimum of the risk.
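For illustration only, here is a minimal numerical sketch of this family of estimators (not taken from the paper): it fits the least squares estimator f̂_J on the space spanned by the first J Haar basis functions, for several J, and reports the empirical risk ‖f − f̂_J‖_n². The J minimizing this risk plays the role of the ideal (oracle) choice that a data-driven Ĵ tries to mimic. The function names, the simulated design and the noise level are hypothetical.

```python
import numpy as np

def haar_design(x, J):
    """Design matrix of the first J Haar basis functions of L2([0,1], dx), at points x."""
    cols = [np.ones_like(x)]                       # phi_1 = constant function 1
    j = 0
    while len(cols) < J:
        for k in range(2 ** j):
            lo, mid, hi = k / 2**j, (k + 0.5) / 2**j, (k + 1) / 2**j
            psi = np.where((x >= lo) & (x < mid), 1.0, 0.0) \
                - np.where((x >= mid) & (x < hi), 1.0, 0.0)
            cols.append(2 ** (j / 2) * psi)        # L2-normalized Haar wavelet at level j
            if len(cols) == J:
                break
        j += 1
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n = 500
f = lambda u: np.sin(2 * np.pi * u) + 0.5 * np.sqrt(u)   # toy Hölder-type regression function
X = rng.uniform(0, 1, n)
Y = f(X) + 0.3 * rng.standard_normal(n)

for J in [1, 2, 4, 8, 16, 32, 64]:
    Phi = haar_design(X, J)
    beta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)        # least squares estimator on S_J
    risk = np.mean((f(X) - Phi @ beta) ** 2)              # empirical risk ||f - f_J_hat||_n^2
    print(J, round(risk, 4))
```

Running the loop over J shows the usual bias-variance pattern: the risk first decreases (bias shrinks) and then increases again (variance grows), which is exactly the trade-off that the choice of Ĵ must balance.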
In order to give a further explanation of the procedure, we need to be precise
as to the “risk” we are dealing with. Throughout the paper we consider least-
squares estimators of f, obtained by minimizing over a finite dimensional
linear subspace S ⊂ L²(ℝ^k, dx) the (least squares) contrast function γ_n defined
by
(1.4)    ∀t ∈ L²(ℝ^k, dx),    γ_n(t) = (1/n) Σ_{i=1}^{n} [Y_i − t(X_i)]².
For this reason we consider the risk of f̂_S based on the design points, that is,
E[ (1/n) Σ_{i=1}^{n} ( f(X_i) − f̂_S(X_i) )² ] = E‖f − f̂_S‖_n².
In addition, under suitable assumptions on the design set and the εi ’s, the
risk of f̂S can be decomposed in a classical way into a bias and a variance
term. More precisely, we have
(1.5)    E‖f − f̂_S‖_n² ≤ d_μ²(f, S) + σ₂² dim(S)/n,
where for t, s ∈ L²(ℝ^k, μ), d_μ²(s, t) denotes E[(t(X_1) − s(X_1))²] and d_μ²(f, S) =
inf_{s∈S} d_μ²(f, s). Inequality (1.5) is usually sharp; note that equality occurs
when the X_i's and the ε_i's are independent, for instance.
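To make (1.5) concrete, here is a short sketch of the classical computation in the simplest case where the design is independent of the errors and the ε_i's are centered with variance σ₂² (the textbook situation alluded to above, not the general dependent setting of the paper). Conditionally on the design, f̂_S is the ‖·‖_n-orthogonal projection Π_S of the data onto S, so that

E‖f − f̂_S‖_n² = E‖f − Π_S f‖_n² + E‖Π_S ε‖_n² ≤ d_μ²(f, S) + σ₂² dim(S)/n,

since E‖Π_S ε‖_n² = (1/n) σ₂² E[tr(Π_S)] ≤ σ₂² dim(S)/n, and E‖f − Π_S f‖_n² ≤ E‖f − s‖_n² = d_μ²(f, s) for every fixed s ∈ S, whence the bound by d_μ²(f, S).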
Coming back to model (1.1), we see that the quadratic risk E‖f − f̂_J‖_n² is
of order
(1.6)    d_μ²(f, S_J) + σ₂² J/n
for S_J generated by the Haar basis (φ_j)_{1≤j≤J} as above. Then (1.2) standardly
implies that d_μ(f, S_J) ≤ C‖f‖_α J^{−α}, whatever μ. When α and ‖f‖_α are known,
it is possible to determine the value of J that minimizes (1.6). If α and ‖f‖_α
are unknown, the problem of adaptation, that is doing almost as well as if they
were known, clearly amounts to choosing an estimation procedure Ĵ based on
the data, such that the estimator based on Ĵ is almost as good as the estimator
based on the optimal value of J. The analogy with the study of model (1.3) then
becomes obvious and we have shown that the problem of adaptation to some
unknown smoothness for Hölderian regression functions amounts to what is
generally called a problem of model selection, that is finding a procedure solely
based on the data to choose one statistical model among a (possibly large)
family of such models, the aim being to choose automatically a model which
is close to optimal in the family for the problem at hand. Let us now describe
this procedure.
We start with a finite collection of possible models {S_m, m ∈ ℳ_n} for f, each
S_m being a finite-dimensional linear subspace of L²(ℝ^k, dx). The family of models
usually depends on n and the function f may or may not belong to one of them.
Let us denote by f̂_m the least squares estimator for model (1.1) based on the
model class S_m. We look for a model selection procedure m̂ with values in ℳ_n,
based solely on the data and not on any prior assumption on f, such that the
risk of the resulting procedure f̂m̂ is almost as good as the risk of the best
least squares estimator in the family. Therefore an ideal selection procedure
m̂ should look for an optimal trade-off between the bias term d_μ²(f, S_m) and
the variance term σ₂² dim(S_m)/n. Our aim is to find m̂ such that
(1.7)    E‖f − f̂_m̂‖_n² ≤ C min_{m∈ℳ_n} [ d_μ²(f, S_m) + σ₂² dim(S_m)/n ]
m̂ = arg min_{m∈ℳ_n} [ γ_n(f̂_m) + pen(m) ]
and we set f̃ = f̂m̂ ∈ Sm̂ . The choice of the penalty function is the main
concern of this paper.
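The selection rule just described can be sketched in a few lines of code. The snippet below is only a schematic illustration, not the paper's implementation: given a family of candidate design matrices, it computes each least squares contrast γ_n(f̂_m), adds a penalty proportional to D_m/n of the shape studied later in the paper, and returns the minimizer m̂. The function name, the tuning constant x and the treatment of σ₂² as a known input are all placeholders.

```python
import numpy as np

def select_model(Y, design_mats, sigma2, x=2.0):
    """Penalized least squares model selection (schematic sketch).

    Y           : response vector of length n
    design_mats : dict m -> (n, D_m) design matrix of the model S_m
    sigma2      : error variance sigma_2^2 (assumed known here)
    x           : tuning constant; pen(m) = x**3 * sigma2 * D_m / n
    Returns the selected index m_hat and the fitted values f_tilde.
    """
    n = len(Y)
    best = None
    for m, Phi in design_mats.items():
        beta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)   # least squares estimator f_m_hat
        fit = Phi @ beta
        crit = np.mean((Y - fit) ** 2) + x**3 * sigma2 * Phi.shape[1] / n
        if best is None or crit < best[0]:
            best = (crit, m, fit)
    _, m_hat, f_tilde = best
    return m_hat, f_tilde
```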
The main assumptions used in the paper are listed below. Assumptions (Hε )
and (HXε ) will be weakened in Section 5:
the collection of models is nested, that is, (S_m)_{m∈ℳ_n} is an increasing sequence (for in-
clusion) of linear spaces, and when there exists some Φ₀ such that for each
m ∈ ℳ_n,
(2.2)    ∀t ∈ S_m,    ‖t‖_∞ ≤ Φ₀ (dim S_m)^{1/2} ‖t‖.
This connection between the sup-norm and the L²(A, dx)-norm is satisfied
for a number of collections of models of interest. Birgé and Massart [(1998),
Lemma 6] have shown that for any L²(A, dx)-orthonormal basis (φ_λ)_{λ∈Λ_m} of
S_m,
(2.3)    ‖ Σ_{λ∈Λ_m} φ_λ² ‖_∞^{1/2} = sup_{t∈S_m, t≠0} ‖t‖_∞ / ‖t‖.
Hence (2.2) holds if and only if there exists an orthonormal basis (φ_λ)_{λ∈Λ_m}
of S_m such that
(2.4)    ‖ Σ_{λ∈Λ_m} φ_λ² ‖_∞^{1/2} ≤ Φ₀ (dim S_m)^{1/2}.
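As a quick sanity check of (2.3)-(2.4), added here for illustration with a hypothetical helper, one can evaluate ‖Σ_λ φ_λ²‖_∞^{1/2} for the space of piecewise constant functions on a regular partition of [0, 1) into D pieces: its orthonormal basis φ_k = √D 1_{[(k−1)/D, k/D)} gives Σ_k φ_k²(x) = D for every x, so the quantity equals √D and (2.4) holds with Φ₀ = 1.

```python
import numpy as np

def histogram_basis(x, D):
    """Orthonormal basis of piecewise constant functions on [0,1) with D pieces."""
    k = np.minimum((x * D).astype(int), D - 1)
    Phi = np.zeros((len(x), D))
    Phi[np.arange(len(x)), k] = np.sqrt(D)       # phi_k = sqrt(D) * indicator of the k-th cell
    return Phi

x = np.linspace(0, 1, 10_000, endpoint=False)
for D in [4, 16, 64]:
    Phi = histogram_basis(x, D)
    lhs = np.sqrt((Phi ** 2).sum(axis=1).max())  # sup_x ( sum_k phi_k(x)^2 )^{1/2}
    print(D, lhs, np.sqrt(D))                    # the two values coincide: Phi_0 = 1
```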
(3.2)    Y_i = X_i = f(X_{i−1}) + ε_i,    i = 1, ..., n.
The value 17 as bound for d is certainly not sharp. The model (3.5) for the
X_i's, together with the assumptions on the coefficients (a_j), aims at ensuring that
(HXY) is fulfilled with arithmetically β-mixing variables. Of course, any other
model implying the same property would suit.
We introduce the following collection of models.
Collection of wavelets: For any integer j, let Λ_j = {(j, k) : k = 1, ..., 2^j}
and let
{ φ_{J_0,k} : (J_0, k) ∈ Λ_{J_0} } ∪ ⋃_{j=J_0}^{+∞} { ϕ_{j,k} : (j, k) ∈ Λ_j }
Proposition 2. Assume that ‖f|_A‖_∞ < ∞ and that for all m ∈ ℳ_n, the
constant functions belong to S_m. If (HReg) is satisfied, then (3.1) holds true for
some constant R depending on x, ρ, h₀, h₁, σ₂², C, d, ‖f|_A − ∫ f|_A dx‖_∞.
(3.6)    Y_i = f(X_i) + ε_i,    ε_i = a ε_{i−1} + u_i,    i = 1, ..., n.
We observe the pairs (X_i, Y_i) for i = 1, ..., n.
We assume that:
(HRd) The real number a satisfies 0 ≤ a < 1, and the u_i's are i.i.d. centered
random variables with a common finite variance. The law of the ε_i's is
assumed to be stationary with finite variance σ₂². The sequence of the
X_i's is geometrically β-mixing [i.e., satisfies (6.1)] and the sequences of the
X_i's and the ε_i's are independent.
The main difference between this framework and the previous one lies in
the dependency between the εi ’s. To deal with it, we need to modify the penalty
term:
Proposition 3. Assume that ‖f|_A‖_∞ < ∞, that (HX) and (HRd) hold and
that E|ε₁|^p < ∞ for some p > 6. Let x > 1. If the penalty term pen satisfies
(3.7)    ∀m ∈ ℳ_n,    pen(m) ≥ x³ (1 + 2a/(1 − a)) σ₂² D_m/n,
then, using the collection of piecewise polynomials described in Section 3.1
and applying the estimation procedure given in Section 2, the
estimator f̃ satisfies, for any ρ ∈ (1, x),
(3.8)    E‖f|_A − f̃‖_n² ≤ ((x + ρ)/(x − ρ))² inf_{m∈ℳ_n} { ‖f|_A − f_m‖_μ² + 2 pen(m) } + R/n,
where R depends on a, p, σ_p, ‖f|_A − ∫ f|_A dx‖_∞, x, ρ, h₀, h₁, Γ, θ.
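To see how (3.7) inflates the penalty relative to the independent-error case, the following toy snippet (illustrative only; the values of x and σ₂² are arbitrary) compares the two penalty levels per unit of D_m/n for a few values of the autoregression coefficient a. The factor 1 + 2a/(1 − a) grows quickly as a approaches 1, which matches the remark after Theorem 2 that stronger dependence between the errors calls for a larger penalty.

```python
# Penalty per unit of D_m / n, for toy values x = 1.5 and sigma_2^2 = 1.
x, sigma2 = 1.5, 1.0
for a in [0.0, 0.2, 0.5, 0.8, 0.9]:
    pen_indep = x**3 * sigma2                          # independent errors
    pen_ar1 = x**3 * (1 + 2 * a / (1 - a)) * sigma2    # AR(1) errors, as in (3.7)
    print(a, pen_indep, round(pen_ar1, 2))
```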
Proposition 4. Assume that ‖f|_A‖_∞ < ∞, that the sequence of the (X_i, Y_i)'s
is geometrically β-mixing, that is, satisfies (6.1), and that (HX), (Hε) and
(HXε) are fulfilled. Consider the additive regression framework (3.9) with
the above collection of models. If σ_p^p = E|ε₁|^p < ∞ for some p > 6, then f̃
satisfies (3.1) for some constant R depending on k, p, σ_p, ‖f|_A − ∫ f|_A dx‖_∞,
x, h₀, h₁, Γ, θ.
where w_d(f, y)_l denotes the modulus of smoothness. For a precise definition
of these notions we refer to DeVore and Lorentz [(1993), Chapter 2, Section 7].
Since for l ≥ 2, B^α_{l,∞} ⊂ B^α_{2,∞}, we now restrict ourselves to the case where
l = 2. In the sequel, for any L > 0, B^α_{2,∞}(L) denotes the set of functions
which belong to B^α_{2,∞} and satisfy ‖f‖_{α,2} ≤ L. Then the following result holds.
4. The main result. In this section, we give our main result concerning
the estimation of a regression function from dependent data. Although this
result considers particular collections of models, extensions to very general
collections can be found in the comments following the theorem.
4.1. The main theorem. Let 𝒮_n be some finite dimensional linear subspace
of A-supported functions of L²(ℝ^k, dx). Let (φ_λ)_{λ∈Λ_n} be an orthonormal basis
of 𝒮_n ⊂ L²(A, dx) and set D_n = |Λ_n| = dim(𝒮_n). We assume that there exists
some positive constant Φ₁ ≥ 1 such that for all λ ∈ Λ_n,
(H_{𝒮_n})    ‖φ_λ‖_∞ ≤ Φ₁ √D_n    and    |{ λ′ ∈ Λ_n : ‖φ_λ φ_λ′‖_∞ ≠ 0 }| ≤ Φ₁.
The second condition means that for each λ, the supports of φ_λ and φ_λ′ are
disjoint except for at most Φ₁ functions φ_λ′'s. We shall see in Section 10 that
these conditions imply that (2.2) holds with Φ₀² = Φ₁³. In addition we assume
some constraint on the dimension of 𝒮_n:
(H_{D_n})_{Θ,b}  There exists an increasing function Θ mapping ℝ₊ into ℝ₊, satisfying
for some K > 0 and b ∈ (0, 1/4),
such that
(4.1)    D_n ≤ n / (Θ(n) ln(n))
for some M > 0 and for some constant B given by (7.14). For any x > 1, let
pen be a penalty function such that
∀m ∈ ℳ_n,    pen(m) ≥ x³ σ₂² D_m/n.
Let ρ ∈ (1, x). For any p̄ ∈ (0, 1], if there exists p > p₀ = 2(1 + 2p̄)/(1 − 4b)
such that σ_p^p = E|ε₁|^p < ∞, we have that the PLSE f̃ defined by
(4.3)    f̃ = f̂_m̂,   m̂ = arg min_{m∈ℳ_n} [ γ_n(f̂_m) + pen(m) ],   with γ_n(g) = (1/n) Σ_{i=1}^{n} [Y_i − g(X_i)]²,
satisfies
(4.4)    [ E‖f|_A − f̃‖_n^{2p̄} ]^{1/p̄} ≤ ((x + ρ)/(x − ρ))² inf_{m∈ℳ_n} { ‖f|_A − f_m‖_μ² + 2 pen(m) } + C R_n/n.
Comments
1. The functions Θ of particular interest are either of the form Θ(u) = ln(u)
or Θ(u) = u^c with 0 < c < 1/4. In the first case, (4.2) is equivalent to a
geometric decay of the β-mixing coefficients (then, we say that the variables
are geometrically β-mixing); in the second case (4.2) is equivalent to an
arithmetic decay (the sequence is then arithmetically β-mixing).
2. A choice of D_n small in front of n allows one to deal with stronger depen-
dency between the (X_i, Y_i)'s. In return, choosing D_n too small may lead to
a serious drawback with regard to the performance of the PLSE. Indeed,
in the case of nested models, the smaller Dn the smaller the collection of
models and the poorer the performance of f̃.
3. Assumption (H_{𝒮_n}) is fulfilled when 𝒮_n is generated by piecewise polynomi-
als of degree r on [0, 1] (in that case Φ₁ = 2r + 1 suits) or by wavelets such
as those described in Section 3.2 (a suitable basis is obtained by rescaling
the father wavelets φ_{J_0,k}'s).
4. We shall see in Section 10 that the result of Theorem 1 holds for a larger
class of linear spaces 𝒮_n [i.e., for 𝒮_n's which do not satisfy (H_{𝒮_n})], provided
that (4.1) is replaced by
(4.6)    D_n² ≤ n / (ln(n) Θ(n)).
5. Take p̄ = 1; the main term involved in the right-hand side of (4.4) is usually
inf_{m∈ℳ_n} { ‖f|_A − f_m‖_μ² + 2 pen(m) }.
It is worth noticing that the constant in front of this term, that is,
C₁(x, ρ) = ((x + ρ)/(x − ρ))²,
only depends on x and ρ, and not on unpleasant quantities such as h₀, h₁.
Although Theorem 1 gives no precise recommendation on the choice of x that
optimizes the performance of the PLSE, it does suggest that a choice of x
close to 1 is not a good one, since it makes the constant C₁(x, ρ)
blow up (recall that ρ must belong to (1, x)). For fixed ρ, C₁(x, ρ)
decreases to 1 as x becomes large; the drawback of choosing x large
is that it increases the value of the penalty term.
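This trade-off is easy to visualize numerically (illustration only, with an arbitrary ρ): C₁(x, ρ) = ((x + ρ)/(x − ρ))² blows up as x decreases to ρ and tends to 1 as x grows, while the penalty level grows like x³.

```python
rho = 1.1
for x in [1.2, 1.5, 2.0, 3.0, 5.0, 10.0]:
    C1 = ((x + rho) / (x - rho)) ** 2       # leading constant in (4.4)
    print(x, round(C1, 2), round(x**3, 1))  # constant vs. penalty inflation x^3
```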
6. Why does Theorem 1 give a result for values of p̄ other than 1? By using Markov's
inequality, we can derive from (4.4) a result in probability saying that, for
any τ > 0,
(4.7)    P[ ‖f|_A − f̃‖_n² > τ ( inf_{m∈ℳ_n} { ‖f|_A − f_m‖_μ² + 2 pen(m) } + R_n/n ) ] ≤ C′/τ^{p̄},
where C′ depends on x, ρ, p̄, C. If E|ε₁|^p < ∞ for some p > 2 and if
it is possible to choose Θ(u) of order a power of ln(u) [this is the case
when the (X_i, Y_i)'s are geometrically β-mixing], then one can choose both
b in (H_{D_n})_{Θ,b} and p̄ small enough to ensure that p > 2(1 + 2p̄)/(1 − 4b).
Consequently we get that (4.7) holds true under the weak assumption that
E|ε₁|^p < ∞ for some p > 2. Lastly we mention that an analogue of
(4.7) where ‖f|_A − f̃‖_n² is replaced by ‖f|_A − f̃‖_μ² can be obtained. This
can be derived from the fact that, under the assumptions of Theorem 1,
the (semi)norms ‖·‖_μ and ‖·‖_n are equivalent on 𝒮_n on a set of probability
close to 1 (we refer to the proof of Theorem 1 and, for further details, to
Baraud (2001)).
7. For adequate collections of models, the quantity R_n remains bounded by
some number R not depending on n. In addition, if for all m ∈ ℳ_n the
constants belong to S_m, then the quantity ‖f|_A‖_∞ involved in R_n can be
replaced by the smaller quantity ‖f|_A − ∫ f|_A dx‖_∞.
for any 1 ≤ q ≤ n.
In addition, Assumption (HXε) is replaced by a milder one:
(H′Xε)  For all i ∈ {1, ..., n}, X_i and ε_i are independent.
Then the following result holds.
Comments
1. In the case of i.i.d. ε_i's and under Assumption (HXε) [which clearly implies
(H′Xε)], it is straightforward that (5.1) holds with ϑ = σ₂². Indeed, under
Condition (HXε), for all t ∈ L²(ℝ^k, μ),
E[ ( Σ_{i=1}^{q} ε_i t(X_i) )² ] = Σ_{i=1}^{q} E[ ε_i² t²(X_i) ] + 0 = q σ₂² ‖t‖_μ².
2. Assume that the sequences (X_i)_{i=1,...,n} and (ε_i)_{i=1,...,n} are independent
[which clearly implies (H′Xε)] and that the ε_i's are β-mixing. Then we
know from Viennet (1997) that there exists a function d_β, depending on the
β-mixing coefficients of the ε_i's, such that for all t ∈ L²(ℝ^k, μ),
E[ ( Σ_{i=1}^{q} ε_i t(X_i) )² ] ≤ q E[ ε₁² d_β(ε₁) ] ‖t‖_μ²,
which amounts to taking ϑ = ϑ_β = E[ε₁² d_β(ε₁)] in (5.1). Roughly speak-
ing, ϑ_β is close to σ₂² when the β-mixing coefficients of the ε_i's are close to
0, which corresponds to the independence of the ε_i's. Thus, in this context
the result of Theorem 2 can be understood as a result of robustness, since
ϑ_β is unknown. Indeed, the penalized procedure described in Theorem 1
with a penalty term satisfying, for some κ > 1,
∀m ∈ ℳ_n,    pen(m) ≥ κ σ₂² D_m/n,
still works if ϑ_β < κ σ₂². This also means that if the independence of the
ε_i's is debatable, it is safer to increase the value of the penalty term.
Then we can derive the existence of positive numbers h₁ and h₀ bounding the
density h_X from above and from below on [0, 1], and thus (HX) is true. In addition
we know from Doukhan (1994) that under (3.3) the X_i's are geometrically
β-mixing, that is, there exist two positive constants Γ and θ such that
(6.1)    β_q ≤ Γ e^{−θq}    ∀q ≥ 1.
Since Θ^{-1}(u) = exp(√u), clearly there exists some constant M = M(Γ, θ) > 0
such that
β_q ≤ Γ e^{−θq} ≤ M e^{−3√(Bq)}    ∀q ≥ 1.
Lastly, the ε_i's being independent of the sequence (X_j)_{j<i}, (HXε) is true, and
we know that the β-mixing coefficients of both sequences (X_{i−1}, ε_i)_{i=1,...,n} and
(X_{i−1})_{i=1,...,n} are the same. Consequently, Condition (HXY) holds and (4.2) is
fulfilled. By choosing p̄ = 1, Theorem 1 can be applied if E|ε_i|^p < ∞ for some
p > 6/(1 − 4b). This is true for b small enough, and then (3.1) follows from
(4.4) with
R_n = σ_p² [ Σ_{m∈ℳ_n} D_m^{−p/2+1} + ln(n)/n^{(1/4−b)p − 6/(1−4b)} + ‖f|_{[0,1]}‖_∞²/σ_p² ]
    ≤ σ_p² [ Σ_{m=0}^{+∞} (r 2^m)^{−2} + sup_{n≥1} ln(n)/n^{(1/4−b)p − 6/(1−4b)} + ‖f|_{[0,1]}‖_∞²/σ_p² ] =: R′.
Take R = C R′, where C is the constant involved in (4.4), to complete the proof
of Proposition 1. ✷
Proof of Proposition 2. Conditions (HS) and (H_{𝒮_n}) are fulfilled [we re-
fer to Birgé and Massart (1998)]. Next we check that (HXY) holds true and,
more precisely, that the sequence (ε_i, X_i)_{1≤i≤n} is arithmetically β-mixing with
β-mixing coefficients satisfying
(6.2)    ∀q ∈ {1, ..., n},    β_q ≤ Γ q^{−θ}
for some constants Γ > 0 and θ > 15. For that purpose, simply write (ε_t, X_t)′ =
Σ_{j=0}^{∞} A_j e_{t−j} with e_{t−j} = (ε_{t−2j}, ε_{t−1−2j})′ for j ≥ 0, where A₀ is the 2 × 2
identity matrix and, for j ≥ 1,
A_j = ( 0  0
        0  a_j ).
Then Pham and Tran's (1985) Theorem 2.1 implies, under (HReg), that (ε_t, X_t)
is absolutely regular with coefficients β_n ≤ K Σ_{j=n}^{+∞} Σ_{k≥j} a_k ≤ K C [(d − 1)(d − 2)]^{-1} n^{−d+2}. This implies (6.2) with θ = d − 2 > 15. In addition, it can be
proved that if a_j = j^{−d} then β_n ≥ C(d) n^{−d}, which shows that we do not reach
the geometric rate of mixing.
Clearly the other assumptions of Theorem 1 are satisfied and it remains to
apply it with p = 30 (a moment of order 30 exists since the ε_i's are Gaussian),
Θ(u) = u^{1/5} and p̄ = 1. An upper bound for R_n which does not depend on
n can be established in the same way as in the proof of Proposition 1. ✷
of Theorem 2. Most of them are clearly fulfilled; we only check (HXY) and
(H′ε). We note that the pairs (X_i, Y_i) are geometrically β-mixing (which
shows that (HXY) holds true) since both sequences, the X_i's and the ε_i's, are geomet-
rically β-mixing (the ε_i's being drawn from a "nice" autoregression model;
we refer to Section 3.1) and are independent. Next we show that (H′ε) holds
true with ϑ = (1 + 2a/(1 − a)) σ₂². This will end the proof of Proposition 3. For
all t ∈ L²(ℝ^k, μ),
E[ ( Σ_{i=1}^{q} ε_i t(X_i) )² ] ≤ Σ_{i=1}^{q} ‖t‖_μ² σ₂² + 2 Σ_{i<j} E[ε_i ε_j] E[t(X_i) t(X_j)].
For i < j,
E[ε_i ε_j] = E[ ε_i ( u_j + ⋯ + a^k u_{j−k} + ⋯ + a^{j−i−1} u_{i+1} + a^{j−i} ε_i ) ] = 0 + a^{j−i} σ₂²,
thus
E[ ( Σ_{i=1}^{q} ε_i t(X_i) )² ] ≤ q ‖t‖_μ² σ₂² + 2 Σ_{i<j} a^{j−i} E[t(X_i) t(X_j)] σ₂²
    ≤ ( q + 2 Σ_{1≤i<j≤q} a^{j−i} ) ‖t‖_μ² σ₂².
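The announced value ϑ = (1 + 2a/(1 − a)) σ₂² then follows by bounding the double sum with a geometric series; spelling out this step,

Σ_{1≤i<j≤q} a^{j−i} = Σ_{i=1}^{q−1} Σ_{k=1}^{q−i} a^k ≤ q Σ_{k≥1} a^k = q a/(1 − a),

hence E[ ( Σ_{i=1}^{q} ε_i t(X_i) )² ] ≤ q (1 + 2a/(1 − a)) ‖t‖_μ² σ₂².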
where S_m^i = { t ∈ S_m : t(x₁, ..., x_k) = t_i(x_i) } and 𝟙 denotes the constant
function equal to 1 on [0, 1]^k. Clearly, S_m satisfies (2.2) if and only if each S_m^i does, which
is true. Now the fact that the S_m's satisfy (2.2) is a consequence of the following
lemma.
Lemma 1. Let S¹, ..., S^k be k linear spaces which are pairwise orthog-
onal in L²([0, 1]^k, dx₁ ⋯ dx_k). If for each i = 1, ..., k, S^i satisfies (2.2), then
so does S = S¹ + ⋯ + S^k.
where f_m^i denotes the L²([0, 1], dx) projection of f_i onto S_m^i. Lastly we
use standard results of approximation theory [see Barron, Birgé and Mas-
sart (1999), Lemma 13, or DeVore and Lorentz (1993)] which ensure that
∫_{[0,1]} [f_i(x) − f_{m_i}(x)]² dx ≤ C(α_i, L) D_{m_i}^{−2α_i} (if g_i = 1, this holds true in the
case of piecewise polynomials since r ≥ α_i). We obtain (3.10) by taking, for each
i = 1, ..., k, m_i ∈ ℳ_n such that D_{m_i} is of order n^{1/(2α_i+1)}, which is possible
since α_i > 1/2 and therefore n^{1/(2α_i+1)} ≤ D_n (at least for n large enough). In
the one-dimensional case, by considering the piecewise polynomials described
in Section 3.1, D_n is of order n/ln³(n) (such a choice is possible in this case)
and then a choice of m among ℳ_n such that D_m is of order n^{1/(2α+1)} is possible
for any α > 0. ✷
Claim 1. ∀m ∈ ℳ_n,
(7.1)    ‖f|_A − f̃‖_n² ≤ ‖f|_A − f_m‖_n² + (2/n) Σ_{i=1}^{n} ε_i (f̃ − f_m)(X_i)
             + pen(m) − pen(m̂).
(ii) For ℓ = 1, ..., ℓ_n,
(7.3)    P( U_ℓ^{1} ≠ U_ℓ^{*1} ) ≤ β_{q_n − q_{n,1}}    and    P( U_ℓ^{2} ≠ U_ℓ^{*2} ) ≤ β_{q_{n,1}}.
(iii) For each δ ∈ {1, 2}, the random vectors (U_ℓ^{*δ})_{1≤ℓ≤ℓ_n} are independent.
We set
(7.4)    A₀ = h₀² (1 − 1/ρ)² / (80 Φ₁⁴ h₁),
and we choose q_n = int(A₀ Θ(n)/4) + 1 ≥ 1 (int(u) denotes the integer part of
u) and q_{n,1} = q_{n,1}(x) to satisfy √(q_{n,1}/q_n) + 1 − q_{n,1}/q_n ≤ x; a q_{n,1}
of order ((x − 1)² ∧ 1) q_n/2 works. For the sake of simplicity, we assume q_n
to divide n, that is, n = ℓ_n q_n, and we introduce the sets B* and B_ρ defined as
follows:
B* = { (ε_i, X_i) = (ε_i*, X_i*) for i = 1, ..., n },
and for ρ ≥ 1,
B_ρ = { ‖t‖_μ² ≤ ρ ‖t‖_n²  ∀t ∈ ⋃_{m, m′ ∈ ℳ_n} (S_m + S_{m′}) }.
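The blocking scheme behind B* can be sketched as follows (an illustrative reconstruction, not code from the paper): the indices {1, ..., n} are cut into ℓ_n = n/q_n consecutive blocks of length q_n, and each block is split into a first part of length q_{n,1} and a second part of length q_n − q_{n,1}, as made explicit in Section 8; the coupling argument then replaces the variables on these sub-blocks by copies that are independent across blocks. Indices below are 0-based, unlike the paper's 1-based notation.

```python
def split_blocks(n, q_n, q_n1):
    """Return the index sets (I_l^1, I_l^2), l = 1, ..., n // q_n (0-based indices).

    Assumes q_n divides n and 1 <= q_n1 <= q_n, as in the proof.
    """
    assert n % q_n == 0 and 1 <= q_n1 <= q_n
    blocks = []
    for l in range(n // q_n):
        start = l * q_n
        I1 = list(range(start, start + q_n1))         # first q_{n,1} indices of block l
        I2 = list(range(start + q_n1, start + q_n))   # remaining q_n - q_{n,1} indices
        blocks.append((I1, I2))
    return blocks

print(split_blocks(12, 4, 1))
```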
We denote by B_ρ* the set B* ∩ B_ρ. From now on, the index m denotes a minimizer
of the quantity ‖f|_A − f_m‖_μ² + pen(m) over m ∈ ℳ_n. Therefore m is fixed
and, for the sake of simplicity, the index m′ is omitted in the three following
notations. Let B_{m′}(μ) be the unit ball in S̄_{m′} = S_m + S_{m′} with respect to
‖·‖_μ, that is,
B_{m′}(μ) = { t ∈ S_m + S_{m′} : ‖t‖_μ² = (1/n) Σ_{i=1}^{n} E[t²(X_i)] ≤ 1 },
then
(7.6)    ‖f|_A − f̃‖_n² 1_{B_ρ*} ≤ C₁(x, ρ) [ ‖f|_A − f_m‖_n² + 2 pen(m) ] + ( x(x + ρ)/(x − ρ) ) n^{−2} W_n(m̂),
where W_n(m′) is defined by
W_n(m′) = [ ( sup_{t∈B_{m′}(μ)} Σ_{i=1}^{n} ε_i* t(X_i*) )² − x² n D_{m′} σ₂² ]_+ .
Proof. The following inequalities hold on B_ρ*. Starting from (7.1) we get
‖f|_A − f̃‖_n² ≤ ‖f|_A − f_m‖_n² + (2/n) ‖f̃ − f_m‖_μ Σ_{i=1}^{n} ε_i* (f̃ − f_m)(X_i*)/‖f̃ − f_m‖_μ
                 + pen(m) − pen(m̂)
             ≤ ‖f|_A − f_m‖_n² + (2/n) ‖f̃ − f_m‖_μ sup_{t∈B_{m̂}(μ)} Σ_{i=1}^{n} ε_i* t(X_i*)
                 + pen(m) − pen(m̂).
Using the elementary inequality 2ab ≤ x a² + x⁻¹ b², which holds for any pos-
itive numbers a, b, we have
‖f|_A − f̃‖_n² ≤ ‖f|_A − f_m‖_n² + x⁻¹ ‖f̃ − f_m‖_μ² + n⁻² x ( sup_{t∈B_{m̂}(μ)} Σ_{i=1}^{n} ε_i* t(X_i*) )²
                 + pen(m) − pen(m̂).
On B_ρ* ⊂ B_ρ, we know that for all t ∈ ⋃_{m′∈ℳ_n} (S_m + S_{m′}), ‖t‖_μ² ≤ ρ ‖t‖_n², hence
‖f|_A − f̃‖_n² ≤ ‖f|_A − f_m‖_n² + x⁻¹ ρ ‖f̃ − f_m‖_n² + n⁻² x ( sup_{t∈B_{m̂}(μ)} Σ_{i=1}^{n} ε_i* t(X_i*) )²
                 + pen(m) − pen(m̂)
             ≤ ‖f|_A − f_m‖_n² + x⁻¹ ρ ( ‖f̃ − f|_A‖_n + ‖f|_A − f_m‖_n )²
                 + n⁻² x ( sup_{t∈B_{m̂}(μ)} Σ_{i=1}^{n} ε_i* t(X_i*) )² + pen(m) − pen(m̂)
by the triangle inequality. Since for all y > 0 (y is chosen at the end of the
proof)
( ‖f̃ − f|_A‖_n + ‖f|_A − f_m‖_n )² ≤ (1 + y) ‖f̃ − f|_A‖_n² + (1 + y⁻¹) ‖f|_A − f_m‖_n²,
we obtain
‖f|_A − f̃‖_n² ( 1 − ρ(1 + y)/x )
    ≤ ‖f|_A − f_m‖_n² ( 1 + ρ(1 + y⁻¹)/x ) + n⁻² x ( sup_{t∈B_{m̂}(μ)} Σ_{i=1}^{n} ε_i* t(X_i*) )²
        + pen(m) − pen(m̂)
    ≤ ‖f|_A − f_m‖_n² ( 1 + ρ(1 + y⁻¹)/x ) + pen(m) + x³ (D_m + D_{m̂}) σ₂²/n
        − pen(m̂) + (x/n²) [ ( sup_{t∈B_{m̂}(μ)} Σ_{i=1}^{n} ε_i* t(X_i*) )² − x² n D_{m̂} σ₂² ]_+ ,
using that D_{m̂} ≤ D_{m̂} + D_m. Since the penalty function pen satisfies (7.5) for
all m′ ∈ ℳ_n, we obtain that on B_ρ*,
‖f|_A − f̃‖_n² ( 1 − ρ(1 + y)/x ) ≤ ‖f|_A − f_m‖_n² ( 1 + ρ(1 + y⁻¹)/x ) + 2 pen(m) + x n⁻² W_n(m̂).
Proof. By taking the power p̄ ≤ 1 of the right- and left-hand sides of (7.6),
then taking the expectation on both sides of the inequality and using Jensen's
inequality, we obtain that
(7.7)    E[ ‖f|_A − f̃‖_n^{2p̄} 1_{B_ρ*} ] ≤ C₁^{p̄}(x, ρ) ( ‖f|_A − f_m‖_μ² + 2 pen(m) )^{p̄}
             + ( x(x + ρ)/(n²(x − ρ)) )^{p̄} Σ_{m′∈ℳ_n} E[ W_n^{p̄}(m′) ].
Σ_{m′∈ℳ_n} E[ W_n^{p̄}(m′) ]
    ≤ Σ_{m′∈ℳ_n} E[ ( sup_{t∈B_{m′}(μ)} Σ_{i=1}^{n} ε_i* t(X_i*) )²
        − x ( √(q_{n,1}/q_n) + √(1 − q_{n,1}/q_n) )² n D_{m′} σ₂² ]_+^{p̄}
    ≤ x^{p̄/3} ( x^{1/3} − 1 )^{p̄−p} ( Φ₀ h₀^{−1/2} )^{p} σ_p^{2p̄}
        Σ_{m′∈ℳ_n} [ D_{m′}^{−p/2+p̄} + q_n^{p} n / n^{p(p−2)/(4(p−1)) − p̄} ].
The proof of the second inequality is delayed to Section 8; the first one is a
straightforward consequence of our choice of q_{n,1}.
Using Proposition 6 we derive from (7.7) that
(7.8)    E[ ‖f|_A − f̃‖_n^{2p̄} 1_{B_ρ*} ] ≤ C₁^{p̄}(x, ρ) ( ‖f|_A − f_m‖_μ² + 2 pen(m) )^{p̄}
             + ( C(x, p, p̄)/n^{p̄} ) ( Φ₀ h₀^{−1/2} )^{p} σ_p^{2p̄}
                 Σ_{m′∈ℳ_n} [ D_{m′}^{−p/2+p̄} + q_n^{p} n / n^{p(p−2)/(4(p−1)) − p̄} ].
Note that the power of n, (1/4 − b)p − 2(1 + 2p̄)/(1 − 4b), is positive for
p > 2(1 + 2p̄)/(1 − 4b). The result follows by combining (7.8) and (7.9). ✷
and
(7.11)    E[ ‖f|_A − f̃‖_n^{2p̄} 1_{B_ρ*ᶜ} ] ≤ 2 ( M + e^{16/A₀} )^{1−2p̄/p} ( ‖f|_A‖_∞^{2p̄} + σ_p^{2p̄} ) n^{−p̄}.
Proof. For the proof of (7.11) we refer to Baraud (2000) [see the proof of The-
orem 6.1, (49), with q = p̄ and β = 2], noticing that p ≥ 2(1 + 2p̄)/(1 − 4b) >
4p̄/(2 − p̄) (p̄ ≤ 1). By examining the proof, it is easy to check that if the con-
stants belong to the S_m's, then ‖f|_A‖_∞ can be replaced by ‖f|_A − ∫ f|_A dx‖_∞.
To prove (7.10) we use the following proposition, which is proved in Section 9.
where for ℓ = 1, ..., ℓ_n, I_ℓ¹ = {(ℓ − 1)q_n + 1, ..., (ℓ − 1)q_n + q_{n,1}} and
I_ℓ² = {(ℓ − 1)q_n + q_{n,1} + 1, ..., ℓ q_n} (so that |I_ℓ²| = q_n − q_{n,1}). Denoting
E₁* = √(ℓ_n q_{n,1} D_{m′}) σ₂ and E₂* = √(ℓ_n (q_n − q_{n,1}) D_{m′}) σ₂, we have
E[ sup_{t∈B_{m′}(μ)} Σ_{i=1}^{n} ε_i* t(X_i*) − E₁* − E₂* ]_+^{p}
    ≤ 2^{p−1} E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} Σ_{i∈I_ℓ¹} ε_i* t(X_i*) − E₁* ]_+^{p}
        + 2^{p−1} E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} Σ_{i∈I_ℓ²} ε_i* t(X_i*) − E₂* ]_+^{p}.
Since the two terms can be bounded in the same way, we only show how
to bound the first one. To do so, we use a moment inequality proved in Ba-
raud [(2000), Theorem 5.2, page 478]: consider the sequence of independent
random vectors of (ℝ × ℝ^k)^{q_{n,1}}, U₁*, ..., U_{ℓ_n}*, defined by U_ℓ* = (ε_i*, X_i*)_{i∈I_ℓ¹} for
ℓ = 1, ..., ℓ_n, and consider ℱ_{m′} = { g_t : t ∈ B_{m′}(μ) }, the set of functions g_t
mapping (ℝ × ℝ^k)^{q_{n,1}} into ℝ defined by
g_t( (e_1, x_1), ..., (e_{q_{n,1}}, x_{q_{n,1}}) ) = Σ_{i=1}^{q_{n,1}} e_i t(x_i).
By applying the moment inequality with the U_ℓ*'s and the class of functions
ℱ_{m′}, we find for all p ≥ 2,
(8.2)    C_p^{−1} E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} Σ_{i∈I_ℓ¹} ε_i* t(X_i*) − E₁* ]_+^{p}
             ≤ E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} | Σ_{i∈I_ℓ¹} ε_i* t(X_i*) |^{p} ]
                 + E^{p/2}[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} ( Σ_{i∈I_ℓ¹} ε_i* t(X_i*) )² ]
             = V_p + V₂^{p/2},
provided that
(8.3)    E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} Σ_{i∈I_ℓ¹} ε_i* t(X_i*) ] ≤ E₁* = √(ℓ_n q_{n,1} D_{m′}) σ₂,
the random variables (G_ℓ(φ_j))_{ℓ=1,...,ℓ_n} being independent and centered for each
j = 1, ..., D_{m′}. Now, for each ℓ = 1, ..., ℓ_n, we know that the laws of the
vectors (ε_i*, X_i*)_{i∈I_ℓ¹} and (ε_i, X_i)_{i∈I_ℓ¹} are the same; therefore, under Condition
(HXε),
(8.6)    E[ G_ℓ²(φ_j) ] = E[ ( Σ_{i∈I_ℓ¹} ε_i φ_j(X_i) )² ] ≤ Σ_{i∈I_ℓ¹} E[ε_i²] E[φ_j²(X_i)] = q_{n,1} σ₂²,
and, for all t ∈ B_{m′}(μ),
(8.7)    ‖t‖_∞ ≤ Φ₀ h₀^{−1/2} √D_{m′} × 1.
Thus,
V_p = E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} | Σ_{i∈I_ℓ¹} ε_i* t(X_i*) |^{p} ]
    ≤ |I_ℓ¹|^{p−1} E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} Σ_{i∈I_ℓ¹} |ε_i*|^{p} |t(X_i*)|^{p} ]
    ≤ q_{n,1}^{p−1} ( Φ₀ h₀^{−1/2} √D_{m′} )^{p−2} E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} Σ_{i∈I_ℓ¹} |ε_i*|^{p} t²(X_i*) ]
    ≤ q_{n,1}^{p−1} ( Φ₀ h₀^{−1/2} )^{p} n D_{m′}^{p/2} σ_p^{p},
so that
(8.8)    V_p ≤ q_n^{p} ( Φ₀ h₀^{−1/2} )^{p} σ_p^{p} D_{m′}^{p/2} n^{p/(4(p−1))},
where the ξ_ℓ's are i.i.d. centered random variables, indepen-
dent of the X_i*'s and the ε_i*'s, satisfying P(ξ₁ = ±1) = 1/2. It remains to bound
the last term in the right-hand side of (8.9). To do so, we use a truncation
argument. We set M_ℓ = max_{i∈I_ℓ¹} |ε_i*|. For any c > 0, we have
(8.10)    E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} ξ_ℓ G_ℓ²(t) ]
              ≤ E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} ξ_ℓ G_ℓ²(t) 1_{M_ℓ≤c} ]
                  + E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} ξ_ℓ G_ℓ²(t) 1_{M_ℓ>c} ].
We apply a comparison theorem [Theorem 4.12, page 112, in Ledoux and Tala-
grand (1991)] to bound the first term of the right-hand side of (8.10): we know
that for each t ∈ B_{m′}(μ) the random variables G_ℓ(t)1_{M_ℓ≤c} are bounded by
B = q_{n,1} Φ₀ h₀^{−1/2} √D_{m′} c [using (8.7)] and are independent of the ξ_ℓ's. The
function x ↦ x², defined on the set [−B, B], being Lipschitz with Lipschitz
constant smaller than 2B, we obtain (E_ξ denotes the conditional expectation
with respect to the ε_i*'s and the X_i*'s)
E_ξ[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} ξ_ℓ G_ℓ²(t) 1_{M_ℓ≤c} ] ≤ 4B E_ξ[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} ξ_ℓ G_ℓ(t) 1_{M_ℓ≤c} ]
    ≤ 4B E_ξ[ ( Σ_{j=1}^{D_{m′}} ( Σ_{ℓ=1}^{ℓ_n} ξ_ℓ G_ℓ(φ_j) 1_{M_ℓ≤c} )² )^{1/2} ]
    ≤ 4B ( Σ_{j=1}^{D_{m′}} Σ_{ℓ=1}^{ℓ_n} G_ℓ²(φ_j) )^{1/2}.
We now decondition with respect to the random variables ε_i*'s and X_i*'s and,
using (8.6), we get
(8.11)    E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} ξ_ℓ G_ℓ²(t) 1_{M_ℓ≤c} ] ≤ 4 q_{n,1} Φ₀ h₀^{−1/2} D_{m′} σ₂ √n c
              ≤ 4 q_{n,1}² Φ₀² h₀^{−1} D_{m′} σ_p √n c,
noticing that q_{n,1} and Φ₀ h₀^{−1/2} are both greater than 1.
Now, we bound the second term of the right-hand side of (8.10). We have
E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} ξ_ℓ G_ℓ²(t) 1_{M_ℓ>c} ] ≤ E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} G_ℓ²(t) 1_{M_ℓ>c} ]
    ≤ E[ Σ_{j=1}^{D_{m′}} Σ_{ℓ=1}^{ℓ_n} G_ℓ²(φ_j) 1_{M_ℓ>c} ]
    ≤ q_{n,1} E[ Σ_{j=1}^{D_{m′}} Σ_{ℓ=1}^{ℓ_n} M_ℓ² 1_{M_ℓ>c} Σ_{i∈I_ℓ¹} φ_j²(X_i*) ]
    ≤ q_{n,1} c^{2−p} Σ_{ℓ=1}^{ℓ_n} E[ M_ℓ^{p} Σ_{j=1}^{D_{m′}} Σ_{i∈I_ℓ¹} φ_j²(X_i*) ]
    ≤ q_{n,1}² c^{2−p} Φ₀² h₀^{−1} D_{m′} Σ_{ℓ=1}^{ℓ_n} E[ M_ℓ^{p} ],
using (2.4). Lastly, since M_ℓ^{p} ≤ Σ_{i∈I_ℓ¹} |ε_i*|^{p}, we get
(8.12)    E[ sup_{t∈B_{m′}(μ)} Σ_{ℓ=1}^{ℓ_n} ξ_ℓ G_ℓ²(t) 1_{M_ℓ>c} ] ≤ q_{n,1}² Φ₀² h₀^{−1} n D_{m′} σ_p^{p} c^{2−p},
where L(φ) is a quantity specific to the orthonormal basis (φ_λ)_{λ∈Λ_n}, defined
as follows.
Let (φ_λ)_{λ∈Λ_n} be an L²(dx)-orthonormal basis of 𝒮_n and, as in Baraud (2001),
define the quantities
V = ( ( ∫_A φ_λ²(x) φ_λ′²(x) dx )^{1/2} )_{(λ,λ′)∈Λ_n×Λ_n},    B = ( ‖φ_λ φ_λ′‖_∞ )_{(λ,λ′)∈Λ_n×Λ_n}.
We set
(9.4)    L(φ) = max{ ρ̄²(V), ρ̄(B) }.
Then, to complete the proof of Proposition 7, it remains to check that
(9.5)    L(φ) ≤ K n / (Θ(n) ln(n))
for some constant K independent of n (we shall show the result for K = Φ₁⁴).
Under (H_{𝒮_n}), Lemma 2 in Section 10 ensures that
L(φ) ≤ Φ₁⁴ D_n.
Let x = h₀² (1 − 1/ρ)² / (16 h₁ L(φ)). Then, on the set { ∀(λ, λ′) ∈ Λ_n², |ν_n(φ_λ φ_λ′)| ≤
2 V_{λλ′} √(2 h₁ x) + 2 B_{λλ′} x }, we have
sup_{t∈B_n^μ(0,1)} |ν_n(t²)| ≤ 2 h₀^{−1} [ √(2 h₁ x) ρ̄(V) + x ρ̄(B) ]
    ≤ (1 − 1/ρ) [ (1/√2) ( ρ̄²(V)/L(φ) )^{1/2} + h₀ (1 − 1/ρ) ρ̄(B)/(8 h₁ L(φ)) ]
    ≤ (1 − 1/ρ) [ 1/√2 + 1/8 ] ≤ 1 − 1/ρ.
The proof of inequality (9.3) is then achieved by using the following claim.
Claim 6. Let (φ_λ)_{λ∈Λ_n} be an L²(A, dx)-orthonormal basis of 𝒮_n. Then, for all x ≥ 0 and
all integers q, 1 ≤ q ≤ n,
P*( ∃(λ, λ′) ∈ Λ_n² : |ν_n(φ_λ φ_λ′)| > 2 V_{λλ′} √(2 h₁ x) + 2 B_{λλ′} x ) ≤ 2 D_n² exp( − n x/q_n ).
Proof of Claim 6. Let ν_n*(φ_λ φ_λ′) = ν_{n,1}*(φ_λ φ_λ′) + ν_{n,2}*(φ_λ φ_λ′) be defined by
ν_{n,k}*(φ_λ φ_λ′) = (1/ℓ_n) Σ_{l=0}^{ℓ_n−1} Z_{l,k}*(φ_λ φ_λ′),    k = 1, 2,
where for 0 ≤ l ≤ ℓ_n − 1,
Z_{l,k}*(φ_λ φ_λ′) = (1/q_n) Σ_{i∈I_l^{(k)}} [ φ_λ(X_i*) φ_λ′(X_i*) − E_μ(φ_λ φ_λ′) ],    k = 1, 2.
We have
P( |ν_n(φ_λ φ_λ′)| > 2 V_{λλ′} √(2 h₁ x) + 2 B_{λλ′} x )
    ≤ P*( |ν_{n,1}*(φ_λ φ_λ′)| > V_{λλ′} √(2 h₁ x) + B_{λλ′} x )
        + P*( |ν_{n,2}*(φ_λ φ_λ′)| > V_{λλ′} √(2 h₁ x) + B_{λλ′} x )
    = P₁ + P₂.
We obtain from 1 and 2 that the constraints on Dn given by (4.6) and (4.1)
lead to (9.5).
≤ ‖ Σ_{λ∈Λ_n} φ_λ² ‖_∞ Σ_{λ′∈Λ_n} ∫_A φ_λ′²(x) dx ≤ Φ₀² D_n²,
using (2.4). On the other hand, by (2.2) we know that ‖φ_λ‖_∞ ≤ Φ₀ √D_n × 1.
Thus, using similar arguments one gets
ρ̄(B) ≤ Φ₀² D_n²,
which leads to L(φ) ≤ Φ₀² D_n². ✷
Proof of 2. We now prove that (2.2) holds true in case 2. Note that
for all x,
Σ_{λ∈Λ_n} φ_λ²(x) ≤ Φ₁ max_{λ∈Λ_n} ‖φ_λ‖_∞² ≤ Φ₁³ D_n.
Therefore,
ρ̄(V) = sup_{Σ_λ a_λ²=1} Σ_λ Σ_{λ′∈J_λ} |a_λ| |a_λ′| ( ∫ φ_λ² φ_λ′² dx )^{1/2}
    ≤ Φ₁ √D_n sup_{Σ_λ a_λ²=1} Σ_λ Σ_{λ′∈J_λ} |a_λ| |a_λ′|
    = Φ₁ √D_n W_n.
Finally,
W_n² ≤ sup_{Σ_λ a_λ²=1} Σ_{λ∈Λ_n} ( Σ_{λ′∈J_λ} |a_λ′| )² ≤ Φ₁ sup_{Σ_λ a_λ²=1} Σ_{λ∈Λ_n} Σ_{λ′∈J_λ} a_λ′²
    = Φ₁ sup_{Σ_λ a_λ²=1} Σ_{λ′∈Λ_n} |J_{λ′}| a_λ′² ≤ Φ₁².
In other words, ρ̄(V) ≤ Φ₁² √D_n and ρ̄(B) ≤ Φ₁³ D_n, which gives the bound
L(φ) ≤ Φ₁⁴ D_n since Φ₁ ≥ 1. ✷
REFERENCES
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In
Proceedings of the 2nd International Symposium on Information Theory (B. N. Petrov
and F. Csaki, eds.) 267–281. Akadémiai Kiadó, Budapest.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automatic
Control 19 716–723.
Baraud, Y. (1998). Sélection de modèles et estimation adaptative dans différents cadres de
régression. Ph.D. thesis, Univ. Paris-Sud.
Baraud, Y. (2000). Model selection for regression on a fixed design. Probab. Theory Related Fields
117 467–493.
Baraud, Y. (2001). Model selection for regression on a random design. Preprint 01-10, DMA, Ecole
Normale Supérieure, Paris.
Barron, A. R. (1991). Complexity regularization with application to artificial neural networks.
In Proceedings of the NATO Advanced Study Institute on Nonparametric Functional
Estimation (G. Roussas, ed.) 561–576. Kluwer, Dordrecht.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function.
IEEE Trans. Inform. Theory 39 930–945.
Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization.
Probab. Theory Related Fields 113 301–413.
Barron, A. R. and Cover, T. M. (1991). Minimum complexity density estimation. IEEE Trans.
Inform. Theory 37 1034–1054.
Berbee, H. C. P. (1979). Random walks with stationary increments and renewal theory. Math.
Centre Tract 112. Math. Centrum, Amsterdam.
Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for
Lucien Le Cam: Research Papers in Probability and Statistics (D. Pollard, E. Torgersen
and G. Yang, eds.) 55–87. Springer, New York.
Birgé, L. and Massart, P. (1998). Exponential bounds for minimum contrast estimators on sieves.
Bernoulli 4 329–375.
Cohen, A., Daubechies, I. and Vial, P. (1993). Wavelets on the interval and fast wavelet transforms.
Appl. Comput. Harmon. Anal. 1 54–81.
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia.
DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation. Springer, New York.
Donoho, D. L. and Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. Ann.
Statist. 26 879–921.
Doukhan, P. (1994). Mixing: Properties and Examples. Springer, New York.
Doukhan, P., Massart, P. and Rio, E. (1995). Invariance principle for absolutely regular empirical
processes. Ann. Inst. H. Poincaré Probab. Statist. 31 393–427.
Duflo, M. (1997). Random Iterative Models. Springer, New York.
Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab. 12 929–
989.
Hoffmann, M. (1999). On nonparametric estimation in nonlinear AR(1)-models. Statist. Probab.
Lett. 44 29–45.
Kolmogorov, A. N. and Rozanov, Y. A. (1960). On the strong mixing conditions for stationary
Gaussian sequences. Theor. Probab. Appl. 5 204–207.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces. Springer, New York.
Y. Baraud
Ecole Normale Supérieure, DMA
45 rue d'Ulm
75230 Paris Cedex 05
France
E-mail: [email protected]

F. Comte
Laboratoire de Probabilités et Modèles Aléatoires
Boite 188, Université Paris 6
4, place Jussieu
75252 Paris Cedex 05
France

G. Viennet
Laboratoire de Probabilités et Modèles Aléatoires
Boite 7012, Université Paris 7
2, place Jussieu
75251 Paris Cedex 05
France