
The Annals of Statistics
2001, Vol. 29, No. 3, 839–875

ADAPTIVE ESTIMATION IN AUTOREGRESSION OR β-MIXING REGRESSION VIA MODEL SELECTION

By Y. Baraud, F. Comte and G. Viennet

Ecole Normale Supérieure, Université Paris VI and Université Paris VII
We study the problem of estimating some unknown regression func-
tion in a β-mixing dependent framework. To this end, we consider some
collection of models which are finite dimensional spaces. A penalized least-
squares estimator (PLSE) is built on a data driven selected model among
this collection. We state non asymptotic risk bounds for this PLSE and
give several examples where the procedure can be applied (autoregression,
regression with arithmetically β-mixing design points, regression with mix-
ing errors, estimation in additive frameworks, estimation of the order of the
autoregression). In addition we show that under a weak moment condition
on the errors, our estimator is adaptive in the minimax sense simultane-
ously over some family of Besov balls.

1. Introduction. We consider the problem of estimating the unknown function $f$, from $\mathbb{R}^k$ into $\mathbb{R}$, based on the observation of $n$ (possibly) dependent data $(Y_i, \vec X_i)$, $1 \le i \le n$, arising from the model
(1.1)  $Y_i = f(\vec X_i) + \varepsilon_i.$
We assume that $(\vec X_i)_{1\le i\le n}$ is a stationary sequence of random vectors in $\mathbb{R}^k$ and we denote by $\mu$ the common law of the $\vec X_i$'s. The $\varepsilon_i$'s are unobservable identically distributed centered random variables admitting a finite variance denoted by $\sigma_2^2$. Throughout the paper we assume that $\sigma_2^2$ is a known quantity (or that a bound on it is known). In this introduction, we assume that the $\varepsilon_i$'s are independent random variables. As an example of model (1.1), consider the case of a random design set $\vec X_i = X_i$ with values in $[0,1]$ with a regression function $f$ assumed to satisfy some Hölderian regularity condition
(1.2)  $\sup_{0\le x<y\le 1} \frac{|f(x)-f(y)|}{|y-x|^{\alpha}} = |f|_{\alpha} < +\infty$
for some $\alpha \in (0,1]$. Another possible illustration is a linear autoregressive model
(1.3)  $X_i = \sum_{j=1}^{k'} \beta_j X_{i-j} + \varepsilon_i,$
where $k'$ is an integer smaller than $k$. This means that $Y_i = X_i$, $\vec X_i = (X_{i-1},\ldots,X_{i-k})'$ and $f(u_1,\ldots,u_k) = \sum_{j=1}^{k'} \beta_j u_j$. Such models have been

Received May 1998; revised February 2001.


AMS 2000 subject classifications. Primary 62G08; secondary 62J02.
Key words and phrases. Nonparametric regression, least-squares estimator, model selection,
adaptive estimation, autoregression order, additive framework, time series, mixing processes.

extensively studied in the past under the conditions that $\alpha$ or $k'$ are known. There have been some generalizations to the cases of unknown $\alpha$ and $k'$, but then the results are typically given in an asymptotic form (as $n \to +\infty$).
In this paper, the aim is to introduce an estimation procedure for model
(1.1) which, when applied to some Hölderian function $f$ satisfying (1.2) with unknown values of $\alpha$ and $|f|_\alpha$, will perform almost as well as a procedure based on the knowledge of those two parameters. This is what is usually called adaptation. In the same way, our procedure will result in estimation of model (1.3) with an unknown value of $k'$ ($k' \le k$, $k$ known) which is almost as good as if $k'$ were known. Moreover, the results will be given in the form of non
asymptotic bounds for the risk of our estimators. Many other examples can be
treated by the same method. One could, for instance, replace the regularity
conditions (1.2) by more sophisticated ones and model (1.3) by a nonlinear
analogue.
In order to explain the main idea underlying the approach, let us turn back
to the two previous examples. Model (1.3) says that $f$ belongs to some specific $k'$-dimensional linear space $S_{k'}$ of functions from $\mathbb{R}^k$ to $\mathbb{R}$. When $k'$ is known, a classical estimator of $f$ is the least squares estimator over $S_{k'}$. Dealing with an unknown $k'$ therefore amounts to choosing a "good" value $\hat k'$ of $k'$ from the data. By "good," we mean here that the estimation procedure based on $\hat k'$ should perform almost as well as the procedure based on the true value of $k'$.
The treatment of model (1.1) when f satisfies a condition of type (1.2) is
actually quite similar. Let us expand $f$ in some suitable orthonormal basis $(\phi_j)_{j\ge 1}$ of $\mathbb{L}^2([0,1], dx)$ (the Haar basis for instance). Then (1.1) can be written as $Y_i = \sum_{j=1}^{\infty} \beta_j \phi_j(X_i) + \varepsilon_i$ and a classical procedure for estimating $f$ is as follows: define $S_J$ to be the $J$-dimensional linear space generated by $\phi_1,\ldots,\phi_J$ and $\hat f_J$ to be the least squares estimator on $S_J$, that is the least squares estimator for model (1.1) when $f$ is supposed to belong to $S_J$. The problem is to determine from the data set some $\hat J$ in such a way that the least squares estimator $\hat f_{\hat J}$ performs almost as well as the best least-squares estimator of $f$, that is, the one which achieves the minimum of the risk.
In order to give a further explanation of the procedure, we need to be precise
as to the "risk" we are dealing with. Throughout the paper we consider least-squares estimators of $f$, obtained by minimizing over a finite dimensional linear subspace $S \subset \mathbb{L}^2(\mathbb{R}^k, dx)$ the (least squares) contrast function $\gamma_n$ defined by
(1.4)  $\forall t \in \mathbb{L}^2(\mathbb{R}^k, dx), \quad \gamma_n(t) = \frac{1}{n}\sum_{i=1}^{n}\big[Y_i - t(\vec X_i)\big]^2.$
A minimizer of $\gamma_n$ in $S$, $\hat f_S$, always exists but might not be unique. Indeed, in common situations the minimization of $\gamma_n$ over $S$ leads to an affine space of possible solutions and then it becomes impossible to consider the $\mathbb{L}^2(\mathbb{R}^k, dx)$-quadratic risk of "the least-squares estimator" of $f$ in $S$. In contrast, the (random) $\mathbb{R}^n$-vector $(\hat f_S(\vec X_1),\ldots,\hat f_S(\vec X_n))'$ is always uniquely defined; this is the

reason we consider the risk of $\hat f_S$ based on the design points, that is,
$\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^{n}\big(f(\vec X_i) - \hat f_S(\vec X_i)\big)^2\Big] = \mathbb{E}\big[\|f - \hat f_S\|_n^2\big].$

In addition, under suitable assumptions on the design set and the $\varepsilon_i$'s, the risk of $\hat f_S$ can be decomposed in a classical way into a bias and a variance term. More precisely, we have
(1.5)  $\mathbb{E}\big[\|f - \hat f_S\|_n^2\big] \le d_\mu^2(f, S) + \sigma_2^2 \frac{\dim(S)}{n},$
where for $t, s \in \mathbb{L}^2(\mathbb{R}^k, \mu)$, $d_\mu^2(s,t)$ denotes $\mathbb{E}\big[(t(\vec X_1) - s(\vec X_1))^2\big]$ and $d_\mu^2(f, S) = \inf_{s \in S} d_\mu^2(f, s)$. Inequality (1.5) is usually sharp; note that equality occurs when the $\vec X_i$'s and the $\varepsilon_i$'s are independent for instance.
Coming back to model (1.1) we see that the quadratic risk $\mathbb{E}\big[\|f - \hat f_J\|_n^2\big]$ is of order
(1.6)  $d_\mu^2(f, S_J) + \sigma_2^2 \frac{J}{n},$
for $S_J$ generated by the Haar basis $(\phi_j)_{1\le j\le J}$ as above. Then (1.2) standardly implies that $d_\mu(f, S_J) \le C |f|_\alpha J^{-\alpha}$ whatever $\mu$. When $\alpha$ and $|f|_\alpha$ are known, it is possible to determine the value of $J$ that minimizes (1.6). If $\alpha$ and $|f|_\alpha$ are unknown, the problem of adaptation, that is doing almost as well as if they
were known, clearly amounts to choosing an estimation procedure Ĵ based on
the data, such that the estimator based on Ĵ is almost as good as the estimator
based on the optimal value of J. The analogy with the study of model (1.3) then
becomes obvious and we have shown that the problem of adaptation to some
unknown smoothness for Hölderian regression functions amounts to what is
generally called a problem of model selection, that is finding a procedure solely
based on the data to choose one statistical model among a (possibly large)
family of such models, the aim being to choose automatically a model which
is close to optimal in the family for the problem at hand. Let us now describe
this procedure.
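Before describing the procedure, here is the optimization behind (1.6) worked out, as a brief illustrative aside (it uses only (1.6) and the approximation bound $d_\mu(f,S_J)\le C|f|_\alpha J^{-\alpha}$ implied by (1.2)): minimizing $R(J) = C^2|f|_\alpha^2 J^{-2\alpha} + \sigma_2^2 J/n$ over $J$ gives the optimal dimension $J_\star \asymp \big(C^2|f|_\alpha^2\, n/\sigma_2^2\big)^{1/(2\alpha+1)}$, at which the bias and variance terms are of the same order and $R(J_\star) \asymp \big(C^2|f|_\alpha^2\big)^{1/(2\alpha+1)}\big(\sigma_2^2/n\big)^{2\alpha/(2\alpha+1)}$, that is, the familiar nonparametric rate $n^{-2\alpha/(2\alpha+1)}$. This is the benchmark that an adaptive procedure must recover without knowing $\alpha$ or $|f|_\alpha$.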
We start with a finite collection of possible models $\{S_m,\ m \in \mathcal{M}_n\}$ for $f$, each $S_m$ being a finite-dimensional linear subspace of $\mathbb{L}^2(\mathbb{R}^k)$. The family of models usually depends on $n$ and the function $f$ may or may not belong to one of them. Let us denote by $\hat f_m$ the least squares estimator for model (1.1) based on the model class $S_m$. We look for a model selection procedure $\hat m$ with values in $\mathcal{M}_n$, based solely on the data and not on any prior assumption on $f$, such that the risk of the resulting procedure $\hat f_{\hat m}$ is almost as good as the risk of the best least squares estimator in the family. Therefore an ideal selection procedure $\hat m$ should look for an optimal trade-off between the bias term $d_\mu^2(f, S_m)$ and the variance term $\sigma_2^2 \dim(S_m)/n$. Our aim is to find $\hat m$ such that
(1.7)  $\mathbb{E}\big[\|f - \hat f_{\hat m}\|_n^2\big] \le C \min_{m \in \mathcal{M}_n}\Big\{d_\mu^2(f, S_m) + \sigma_2^2 \frac{\dim(S_m)}{n}\Big\},$

which means that, up to the constant C, our estimator chooses an optimal


model.
It is important to notice that an estimator which satisfies (1.7) has many
interesting properties provided that the family of models Sm has been suit-
ably chosen. In particular this estimator is adaptive in the minimax sense
with respect to many well-known classes of smoothness. The connections be-
tween adaptation and model selection and the nice properties of any estimator
f̂m̂ satisfying (1.7) have been developed at length in Barron, Birgé and Mas-
sart [(1999), Chapter 5] and many illustrations of potential applications of our
results can be found there and in Birgé and Massart (1997). We shall content
ourselves in the sequel with a limited number of applications and we refer the
interested reader to those papers.
Our model selection criterion is closely related to the classical $C_p$ criterion of Mallows (1973). For each model $m$ we compute the normalized residual sum of squares $\gamma_n(\hat f_m) = n^{-1}\sum_{i=1}^{n}\big[Y_i - \hat f_m(\vec X_i)\big]^2$ and we choose $\hat m$ in order to minimize among all models $m \in \mathcal{M}_n$ the penalized residual sum of squares $\gamma_n(\hat f_m) + \mathrm{pen}(m)$. Mallows' $C_p$ criterion corresponds to $\mathrm{pen}(m) = 2\sigma_2^2 \dim(S_m)/n$. In this paper, we want to see how one needs to modify Mallows' $C_p$ when the errors or the covariates are correlated.
There have been many studies concerning model selection based on Mal-
lows’ Cp or related penalized criteria like Akaike’s or the BIC criterion for
regressive and autoregressive models [see Akaike (1973, 1974), Shibata (1976,
1981), Li (1987), Polyak and Tsybakov (1992), among many others]. A com-
mon characteristic of these results is their asymptotic character. Extensions of
these penalized criteria for data-driven model selection procedures have been
done in Barron (1991, 1993), Barron and Cover (1991) and Rissanen (1984).
More recently, a general approach to model selection for various statistical
frameworks including density estimation and regression has been developed
in Birgé and Massart (1997) and Barron, Birgé and Massart (1999), with many
applications to adaptive estimation. An original characteristic of their view-
point is its non asymptotic feature. Unfortunately, their general approach im-
poses restrictions on the regression model (1.1) (e.g., the regression function needs to be bounded by some known quantity), which makes it unattractive for
practical issues. We relax such restrictions and also obtain non asymptotic re-
sults. Our approach is inspired by Baraud’s (2000) work. Although there have
been many results concerning adaptation for the classical regression model
with independent variables, not much is known to our knowledge concerning
general adaptation methods for regression involving dependent variables. It
is not within the scope of this paper to make an historical review for the case
of independent variables.
Concerning dependent variables, it is worth mentioning the work of Modha and Masry (1996), which deals with model (1.1) when the process $(\vec X_i, Y_i)_{i\in\mathbb{Z}}$ is
strongly mixing and when the function f satisfies some Fourier-transform-type
representation. In Modha and Masry (1998), the problem of one step ahead
prediction of real valued stationary exponentially strongly mixing processes

is considered. Minimum complexity regression estimators based on Legendre


polynomials are used to estimate both the model memory and the predictor
function. In the particular case of an autoregressive model their approach does
not lead to optimal rates of convergence. In the case of a one dimensional first
order autoregressive model, Neumann and Kreiss (1998) and Hoffmann (1999)
study the behavior of nonparametric adaptive estimators (local polynomials
and wavelet thresholding estimators) by approximating an AR(1) autoregres-
sion experiment by a regression experiment with independent variables.
Our estimation procedure is the same as that proposed by Baraud (2000) in
the case of a regression framework with deterministic design points and i.i.d.
errors. Thus, we show that the procedure is robust (to a certain extent) to pos-
sible dependency between the data $(\vec X_i, Y_i)$'s. More precisely, we assume that
the data are β-mixing [for a precise definition of β-mixing, see Kolmogorov and
Rozanov (1960)] and we show that under an adequate condition on the decay of
the β-mixing coefficients (for instance arithmetical or geometrical decay) the
estimation procedure is still relevant. Of course, this robustness with respect
to dependency is obvious when the sequences of X  i ’s and εi ’s are independent
and when the εi ’s are i.i.d. Indeed, the result can merely be obtained by argu-
ing as follows. We start from inequality (11) in Baraud [(2000), Corollary 3.1]
which gives the result conditionally on the variables X  i ’s. Then, by integrat-
ing with respect to those, one gets (1.7). We emphasize that the result holds
under mild assumptions on the statistical framework (an adequate moment
condition on the i.i.d. errors and stationarity of the distribution of the X  i ’s).
Consequently, we shall only consider either the case where the sequences of
X i ’s and εi ’s are dependent or the case where the εi ’s are dependent.
The case of β-mixing data is natural in the autoregression context, where,
in addition, the above condition on the β-mixing coefficients is usually met.
This makes the procedure of particular interest in this case.
Our techniques of proof are based on the work of Baraud (2000). Unfortu-
nately, the possible dependency of the X  i ’s prevents us from directly using
classical inequalities on product measures like Talagrand’s (1996) concentra-
tion inequalities. Taking advantage of the β-mixing assumptions, we instead
use coupling techniques derived from Berbee’s (1979) lemma and inspired by
Viennet’s (1997) work in order to approximate the original sequence X  i 1≤i≤n
by a new sequence built on independent blocks.
Lastly, we mention that the results presented in this paper can be extended
to the case where the variance σ22 of the errors is unknown, which is the
practical case, by estimating it by residual least-squares. For further details
we refer to Baraud’s (1998) Ph.D. thesis, where a previous version of this work
is available.
The paper is organized as follows: the estimation procedure and the main
assumptions are given in Section 2. We apply the procedure to various sta-
tistical frameworks in Section 3. In each of these frameworks, we state non
asymptotic risk bounds, the proofs of those results being delayed to Section 6.
Section 4 is devoted to the main result (treating the case of independent er-

rors), Section 5 to an extension to the case of dependent errors. The proof of


those results are given in Sections 7 to 10.

2. The estimation procedure and the assumptions. We observe pairs $(Y_i, \vec X_i)$, $i = 1,\ldots,n$, arising from model (1.1) and our aim is to estimate the unknown function $f$ from $\mathbb{R}^k$ into $\mathbb{R}$, on some (compact) subset $A \subset \mathbb{R}^k$. Our estimation procedure is the following one. We consider a finite family of linear subspaces $(S_m)_{m\in\mathcal{M}_n}$ of $\mathbb{L}^2(A, dx)$. We assume that the $S_m$'s are finite dimensional linear spaces consisting of $A$-compactly supported functions. Hereafter, $D_m$ denotes the dimension of $S_m$ and $f_m$ the $\mathbb{L}^2(\mathbb{R}^k,\mu)$-projection of $f$ onto $S_m$. We associate to each $S_m$ a least-squares estimator $\hat f_m$ of $f$ which minimizes among $t \in S_m$ the empirical least-squares contrast function $\gamma_n$ defined by (1.4). Note that such a minimizer might not be unique as an element of $S_m$ but the $\mathbb{R}^n$-vector $(\hat f_m(\vec X_1),\ldots,\hat f_m(\vec X_n))'$ is uniquely defined. We select our estimator $\tilde f$ among the family of least-squares estimators $(\hat f_m)_{m\in\mathcal{M}_n}$ in the following way: given a nonnegative penalty function $\mathrm{pen}(\cdot)$ on $\mathcal{M}_n$, we define $\hat m$ as the minimizer among $\mathcal{M}_n$ of the penalized criterion
$\gamma_n(\hat f_m) + \mathrm{pen}(m)$
and we set $\tilde f = \hat f_{\hat m} \in S_{\hat m}$. The choice of the penalty function is the main concern of this paper.
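To make the procedure concrete, here is a minimal numerical sketch of the selection rule (not taken from the paper): it assumes real-valued design points in $[0,1]$, uses piecewise-constant functions on dyadic partitions as the collection $(S_m)$, treats the error variance as known, and uses the penalty $\mathrm{pen}(m) = x^3\sigma_2^2 D_m/n$ of Section 3; all function names and constants below are illustrative only.

```python
import numpy as np

def pls_estimate(x, y, sigma2, max_level=None, x_pen=1.1):
    """Minimal sketch of the penalized least-squares selection rule.

    Models S_m, m = 0..max_level: piecewise-constant functions on the dyadic
    partition of [0,1] into 2^m intervals (a histogram-type collection).
    pen(m) = x_pen**3 * sigma2 * D_m / n with D_m = 2^m.
    """
    n = len(y)
    if max_level is None:
        # keep D_m well below n, in the spirit of the dimension constraint (4.1)
        max_level = int(np.log2(max(n // max(int(np.log(n)), 1), 2)))
    best = None
    for m in range(max_level + 1):
        D = 2 ** m
        cells = np.minimum((x * D).astype(int), D - 1)   # dyadic cell of each X_i
        fitted = np.zeros(n)
        for b in range(D):                               # LSE = cell-wise mean
            idx = cells == b
            if idx.any():
                fitted[idx] = y[idx].mean()
        crit = np.mean((y - fitted) ** 2) + x_pen ** 3 * sigma2 * D / n
        if best is None or crit < best[0]:
            best = (crit, m, fitted)
    return best[1], best[2]       # selected level m_hat and fitted values

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1000
    x = rng.uniform(size=n)
    f = lambda u: np.sin(2 * np.pi * u)
    y = f(x) + 0.3 * rng.standard_normal(n)
    m_hat, f_tilde = pls_estimate(x, y, sigma2=0.09)
    print("selected level:", m_hat,
          " empirical risk:", np.mean((f(x) - f_tilde) ** 2))
```

The same scheme applies verbatim to any family of linear spaces; only the computation of the least-squares fit on each $S_m$ changes.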
The main assumptions used in the paper are listed below. Assumptions (H$_\varepsilon$) and (H$_{X\varepsilon}$) will be weakened in Section 5:

(H$_X$) The sequence $(\vec X_i)_{i\ge 0}$ is identically distributed with common law $\mu$ admitting a density $h_X$ w.r.t. the Lebesgue measure which is bounded from below and above, that is, $0 < h_0 \le h_X(u) \le h_1$ for all $u \in A$.
(H$_\varepsilon$) The $\varepsilon_i$'s are i.i.d. centered random variables admitting a finite variance denoted by $\sigma_2^2$.
(H$_{XY}$) The sequence of the $(\vec X_i, Y_i)$'s is $\beta$-mixing.
(H$_{X\varepsilon}$) For all $i \in \{1,\ldots,n\}$, $\varepsilon_i$ is independent of the sequence $(\vec X_j)_{j\le i}$.
(H$_{\mathcal S}$) There exists a constant $\Phi_0$ such that for any pair $(m, m') \in \mathcal{M}_n^2$ and any $t \in S_m + S_{m'}$,
(2.1)  $\|t\|_\infty \le \Phi_0 \sqrt{\dim(S_m + S_{m'})}\, \|t\|.$

Comments. Assumption (H$_{XY}$) is equivalent to the $\beta$-mixing of the sequence of the $(\vec X_i, \varepsilon_i)$'s, which is the property used in the proof. As mentioned in the introduction, if the sequences $(\vec X_i)_{1\le i\le n}$ and $(\varepsilon_i)_{1\le i\le n}$ are independent and the $\varepsilon_i$'s are i.i.d., then the result can be obtained under milder conditions. In particular, except for stationarity, no other assumption on the distribution of the $\vec X_i$'s is required. Condition (H$_{\mathcal S}$) is most easily fulfilled when

the collection of models is nested, that is, is an increasing sequence (for in-
clusion) of linear spaces and when there exists some !0 such that for each
m ∈ n

(2.2) t∞ ≤ !0 dimSm t ∀t ∈ Sm

This connection between the sup-norm and the 2 A dx-norm is satisfied
for numbers of collections of models of interest. Birgé and Massart [(1998),
Lemma 6] have shown that for any 2 A dx-orthonormal basis φλ λ∈#m of
Sm :
1/2
 t∞
(2.3) φ2λ = sup
λ∈#m ∞ t∈Sm t=0 t

Hence (2.2) holds if and only if there exists an orthonormal basis φλ λ∈#m
of Sm such that
1/2

(2.4) φ2λ ≤ !0 dimSm 
λ∈#m ∞

and then the result is true for any orthonormal basis of Sm .

3. Examples. In this section we apply our estimation procedure to various statistical frameworks. In each framework, we give an example of a collection of models $\{S_m,\ m\in\mathcal{M}_n\}$ and, for some $x > 1$, choose the penalty term to be equal to
$\mathrm{pen}(m) = x^3 \frac{D_m}{n}\sigma_2^2 \quad \forall m \in \mathcal{M}_n,$
except in Section 3.3 where the penalty term is chosen in a different way. In each case, we give sufficient conditions for $\tilde f = \hat f_{\hat m}$ to achieve the best trade-off (up to a constant) between the bias and the variance term among the collection of estimators $\{\hat f_m,\ m\in\mathcal{M}_n\}$. Namely, we show that for any $\rho$ in $(1, x)$,
(3.1)  $\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^2\big] \le \Big(\frac{x+\rho}{x-\rho}\Big)^2 \inf_{m\in\mathcal{M}_n}\Big\{\|f_{|A} - f_m\|_\mu^2 + 2x^3\sigma_2^2\frac{D_m}{n}\Big\} + \frac{R}{n},$
for some constant $R = R(\rho)$ to be specified. With no loss of generality we shall assume that $A = [0,1]^k$. Those results, proved in Section 6, derive from our main theorems which are to be found in Section 4 and Section 5.

3.1. Autoregression framework. We deal with a particular feature of the regression framework (1.1), the autoregression framework of order 1 given by
(3.2)  $Y_i = X_i = f(X_{i-1}) + \varepsilon_i, \quad i = 1,\ldots,n.$
The process is initialized with some real valued random variable $X_0$.


We assume the following:

(H$_{AR1}$) The random variable $X_0$ is independent of the $\varepsilon_i$'s. The $\varepsilon_i$'s are i.i.d. centered random variables admitting a density, $h_\varepsilon$, with respect to the Lebesgue measure and satisfying $\sigma_2^2 = \mathbb{E}(\varepsilon_1^2) < \infty$. The density $h_\varepsilon$ is a positive bounded and continuous function and the function $f$ satisfies for some $0 \le a < 1$ and $b \in \mathbb{R}$,
(3.3)  $\forall u \in \mathbb{R}, \quad |f(u)| \le a|u| + b.$
The sequence of the random variables $X_i$'s is stationary of common law $\mu$.

The existence of a stationary law $\mu$ derives from the assumptions on the $\varepsilon_i$'s and $f$.
To estimate $f$ we use the collection of models given below.

Collection of piecewise polynomials. Let $r$ be some positive integer and $m_n$ the largest integer such that $r2^{m_n} \le n/\ln^3(n)$, that is, $m_n = \mathrm{int}\big(\ln(n/(r\ln^3 n))/\ln 2\big)$ ($\mathrm{int}(u)$ denotes the integer part of $u$). Let $\mathcal{M}_n$ be the set of integers $\{0,\ldots,m_n\}$; for each $m \in \mathcal{M}_n$ we define $S_m$ as the linear span of piecewise polynomials of degree less than $r$ based on the dyadic grid $\{j/2^m,\ j = 0,\ldots,2^m - 1\} \subset [0,1]$.
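As an illustration (not part of the paper), these spaces are easy to materialize as design matrices; the following sketch, whose function names are purely illustrative, builds a basis of $S_m$ from local monomials on the dyadic cells and computes the largest admissible level $m_n$:

```python
import numpy as np

def dyadic_piecewise_poly_design(x, m, r):
    """Design matrix of S_m: piecewise polynomials of degree < r on the
    dyadic grid {j/2^m}.  dim(S_m) = r * 2^m, which is what the constraint
    r * 2^{m_n} <= n / ln(n)^3 caps."""
    x = np.asarray(x, dtype=float)
    D = 2 ** m
    cells = np.minimum((x * D).astype(int), D - 1)
    cols = []
    for j in range(D):
        in_cell = (cells == j).astype(float)
        u = x * D - j                     # local coordinate on cell j
        for deg in range(r):              # local monomials 1, u, ..., u^{r-1}
            cols.append(in_cell * u ** deg)
    return np.column_stack(cols)          # shape (n, r * 2^m)

def max_level(n, r):
    """Largest m with r * 2^m <= n / ln(n)^3 (and at least 0)."""
    bound = n / np.log(n) ** 3
    return max(int(np.floor(np.log2(max(bound / r, 1.0)))), 0)
```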
The result on f̃ is the following.

Proposition 1. Consider the autoregression framework (3.2) and assume that (H$_{AR1}$) holds. If $\sigma_p^p = \mathbb{E}(|\varepsilon_i|^p) < \infty$ for some $p > 6$ then (3.1) holds for some constant $R$ that depends on $p, x, \rho, h_\varepsilon, \sigma_p^2, r, \|f_{|A} - \int f_{|A}\,dx\|_\infty$.

To obtain results in probability on $\|f_{|A} - \tilde f\|_n^2$, it is actually enough to assume $\mathbb{E}(|\varepsilon_i|^p) < \infty$ for some $p > 2$; we refer to (4.7) and the comment given there.

3.2. Regression framework. We give an illustration of Theorem 1 in the case of regression with arithmetically $\beta$-mixing design points. Of course the case of autoregression with arithmetically $\beta$-mixing $X_i$'s can be treated similarly. Let us consider the regression model
(3.4)  $Y_i = f(X_i) + \varepsilon_i, \quad i = 1,\ldots,n.$
In this section, we consider a sequence $(\varepsilon_i)_{i\in\mathbb{Z}}$ and we take the $X_i$'s to be generated by a standard time series model:
(3.5)  $X_i = \sum_{k=0}^{+\infty} a_k \varepsilon_{i-1-2k}.$
Then we make the following assumption:

(H$_{Reg}$) The $\varepsilon_i$'s are i.i.d. Gaussian random variables. The $a_j$'s are such that $a_0 = 1$, $\sum_{j=0}^{+\infty} a_j z^{2j} \neq 0$ for all $z$ with $|z| \le 1$ and, for all $j \ge 1$, $|a_j| \le Cj^{-d}$ for some constants $C > 0$ and $d > 17$.

The value 17 as bound for $d$ is certainly not sharp. The model (3.5) for the $X_i$'s together with the assumptions on the coefficients $a_j$ aims at ensuring that (H$_{XY}$) is fulfilled with arithmetically $\beta$-mixing variables. Of course, any other model implying the same property would suit.
We introduce the following collection of models.

Collection of wavelets. For any integer $J$, let $\Lambda(J) = \{(J,k):\ k = 1,\ldots,2^J\}$ and let
$\{\phi_{J_0,k},\ (J_0,k)\in\Lambda(J_0)\} \cup \{\varphi_{j,k},\ (j,k)\in\bigcup_{J=J_0}^{+\infty}\Lambda(J)\}$
be an $\mathbb{L}^2([0,1], dx)$-orthonormal system of compactly supported wavelets of regularity $r$ built by Cohen, Daubechies and Vial (1993). For some positive integer $J_n > J_0$, let $\mathcal{S}_n$ be the space spanned by the $\phi_{J_0,k}$'s for $(J_0,k)\in\Lambda(J_0)$ and by the $\varphi_{j,k}$'s for $(j,k)\in\bigcup_{J=J_0}^{J_n-1}\Lambda(J)$. The integer $J_n$ is chosen in such a way that $\dim(\mathcal{S}_n) = 2^{J_n}$ is of order $n^{4/5}/\ln(n)$. We set $\mathcal{M}_n = \{J_0,\ldots,J_n-1\}$ and for each $m\in\mathcal{M}_n$ we define $S_m$ as the linear span of the $\phi_{J_0,k}$'s for $(J_0,k)\in\Lambda(J_0)$ and the $\varphi_{j,k}$'s for $(j,k)\in\bigcup_{J=J_0}^{m}\Lambda(J)$.
For a precise description and use of these wavelet systems, see Donoho and Johnstone (1998). These functions derive from Daubechies' wavelets (1992) in the interior of $[0,1]$ and are boundary corrected at the "edges."

Proposition 2. Assume that $\|f_{|A}\|_\infty < \infty$ and that for all $m\in\mathcal{M}_n$ the constant functions belong to $S_m$. If (H$_{Reg}$) is satisfied, then (3.1) holds true for some constant $R$ depending on $x, \rho, h_0, h_1, \sigma_2^2, C, d, \|f_{|A} - \int f_{|A}\,dx\|_\infty$.

3.3. Regression with dependent errors. We consider the regression framework
(3.6)  $Y_i = f(\vec X_i) + \varepsilon_i, \quad i = 1,\ldots,n, \qquad \varepsilon_i = a\varepsilon_{i-1} + u_i, \quad i = 1,\ldots,n.$
We observe the pairs $(Y_i, \vec X_i)$ for $i = 1,\ldots,n$. We assume that:

(H$_{Rd}$) The real number $a$ satisfies $0 \le a < 1$, and the $u_i$'s are i.i.d. centered random variables admitting a common finite variance. The law of the $\varepsilon_i$'s is assumed to be stationary admitting a finite variance $\sigma_2^2$. The sequence of the $\vec X_i$'s is geometrically $\beta$-mixing [i.e., satisfying (6.1)] and the sequences of the $\vec X_i$'s and the $\varepsilon_i$'s are independent.

Geometrically $\beta$-mixing $\vec X_i$'s can be generated by an autoregressive model with a regression function $g$ and errors $\eta_i$ satisfying an assumption of the same kind as (H$_{AR1}$) in Section 3.1.

The main difference between this framework and the previous one lies in the dependency between the $\varepsilon_i$'s. To deal with it, we need to modify the penalty term:

Proposition 3. Assume that $\|f_{|A}\|_\infty < \infty$, that (H$_X$) and (H$_{Rd}$) hold and that $\mathbb{E}(|\varepsilon_1|^p) < \infty$ for some $p > 6$. Let $x > 1$. If the penalty term pen satisfies
(3.7)  $\forall m\in\mathcal{M}_n, \quad \mathrm{pen}(m) \ge x^3\Big(1 + \frac{2a}{1-a}\Big)\frac{D_m}{n}\sigma_2^2,$
then by using the collection of piecewise polynomials described in Section 3.1 and applying the estimation procedure given in Section 2 we have that the estimator $\tilde f$ satisfies for any $\rho\in(1,x)$,
(3.8)  $\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^2\big] \le \Big(\frac{x+\rho}{x-\rho}\Big)^2 \inf_{m\in\mathcal{M}_n}\Big\{\|f_{|A} - f_m\|_\mu^2 + 2\,\mathrm{pen}(m)\Big\} + \frac{R}{n},$
where $R$ depends on $a, p, \sigma_p, \|f_{|A} - \int f_{|A}\,dx\|_\infty, x, \rho, h_0, h_1, \Gamma, \theta$.

In contrast with the results of the previous examples, we cannot give a choice of a penalty term which would work for any value of $a$. An unknown lower bound for the choice of the penalty term seems to be the price to pay when the $\varepsilon_i$'s are no longer independent. This example shows how this lower bound varies with respect to the unknown number $a$, this number quantifying in some sense a discrepancy from independence (independence corresponds to $a = 0$). We also see that a choice of the penalty term of the form
$\mathrm{pen}(m) = \kappa\frac{D_m}{n}\sigma_2^2$
with $\kappa$ large is safer than a choice of $\kappa$ close to 1. This should be kept in mind every time the independence of the $\varepsilon_i$'s is debatable (we refer the reader to the comments following Theorem 2).

3.4. Additive models. We consider the additive regression models, widely used in Economics, described by
(3.9)  $Y_i = e_f + f_1(X_i^{(1)}) + f_2(X_i^{(2)}) + \cdots + f_k(X_i^{(k)}) + \varepsilon_i,$
where the $\varepsilon_i$'s are i.i.d. and $e_f$ denotes a constant. Model (3.9) follows from model (1.1) with $\vec X_i = (X_i^{(1)}, X_i^{(2)}, \ldots, X_i^{(k)})'$ and the additive function $f$: $f(x_1,\ldots,x_k) = e_f + f_1(x_1) + \cdots + f_k(x_k)$. For identifiability, we assume that $\int_0^1 f_i(x)\,dx = 0$, for $i = 1,\ldots,k$. Such a model assumes that the effects on $Y$ of the variables $X^{(j)}$ are additive. Our aim is to estimate $f$ on $A = [0,1]^k$. The estimation method allows one to build estimators of $f_1,\ldots,f_k$ in different spaces.
Let $\ell$ be some integer. We define $S_\ell^{(1)}$ as the linear space of piecewise polynomials $t$ of degree less than $r$, $r \ge 1$, based on the dyadic grid $\{j/2^\ell,\ j = 0,\ldots,2^\ell\} \subset [0,1]$ satisfying $\int_0^1 t(x)\,dx = 0$, and $S_\ell^{(2)}$ as the linear span
of the functions $\psi_{2j-1}(x) = \sqrt2\cos(2\pi jx)$ and $\psi_{2j}(x) = \sqrt2\sin(2\pi jx)$ for $j = 1,\ldots,2^\ell$. Now we set $m_1(n)$ [$m_2(n)$ respectively] the largest integer $\ell$ such that $\dim(S_\ell^{(1)})$ [$\dim(S_\ell^{(2)})$ respectively] is smaller than $\sqrt n/\ln^3(n)$. Finally, $\mathcal{M}_n^{(1)}$ and $\mathcal{M}_n^{(2)}$ denote respectively the sets of integers $\{0,\ldots,m_1(n)\}$ and $\{0,\ldots,m_2(n)\}$.
We propose to estimate the $f_i$'s either by piecewise or trigonometric polynomials. To do so, we introduce the choice function $g$ from $\{1,\ldots,k\}$ into $\{1,2\}$ and consider the following collections of models.

Mixed additive collection of models. We set $\mathcal{M}_n = \mathcal{M}_n(k) = \{m = (k, m_1,\ldots,m_k):\ m_j \in \mathcal{M}_n^{(g(j))}\}$ and for each $m = (k, m_1,\ldots,m_k) \in \mathcal{M}_n$ we define
$S_m = \Big\{ t(x_1,\ldots,x_k) = a + \sum_{i=1}^{k} t_i(x_i);\ (a, t_1,\ldots,t_k) \in \mathbb{R} \times \prod_{i=1}^{k} S_{m_i}^{(g(i))} \Big\}.$
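As a concrete illustration (not from the paper), an additive space of this kind is just a column-wise concatenation of one-dimensional bases; the sketch below, restricted for brevity to the trigonometric spaces [$g(i) = 2$ for every coordinate] and with illustrative names, builds the design matrix of such an $S_m$ and fits it by least squares:

```python
import numpy as np

def trig_block(x, level):
    """Columns for S_level^(2): sqrt(2)cos(2*pi*j*x), sqrt(2)sin(2*pi*j*x),
    j = 1..2^level; these functions have zero integral, matching the
    identifiability constraint."""
    cols = []
    for j in range(1, 2 ** level + 1):
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
    return np.column_stack(cols)

def additive_design(X, levels):
    """Design matrix of S_m = {a + t_1(x_1) + ... + t_k(x_k)} with each t_i
    in the trigonometric space of level m_i.  X has shape (n, k)."""
    n, k = X.shape
    blocks = [np.ones((n, 1))]                 # the constant a
    for i in range(k):
        blocks.append(trig_block(X[:, i], levels[i]))
    return np.hstack(blocks)

# example: least-squares fit of an additive surface on [0,1]^2
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(size=(500, 2))
    y = 1.0 + np.sin(2*np.pi*X[:, 0]) + np.cos(2*np.pi*X[:, 1]) \
        + 0.2 * rng.standard_normal(500)
    M = additive_design(X, levels=(1, 1))
    coef, *_ = np.linalg.lstsq(M, y, rcond=None)
    print("D_m =", M.shape[1], " first coefficients:", coef[:3].round(2))
```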

The performance of f̃ is given by the following result

Proposition 4. Assume that f|A ∞ < ∞, that the sequence of the X i Yi 
is geometrically β-mixing, that is, satisfies 6 1 and that (HX ), (Hε ) and
(HXε ) are fulfilled. Consider the additive regression framework 3 9 with
p
the above collection of models. If σp = Ɛ ε p  < ∞ for some p > 6, then f̃
satisfies 3 1 for some constant R depending on k p σp  f|A − f|A dx∞ 
x h0  h1  / θ.

We can deduce from Proposition 4 that our procedure is adaptive in the minimax sense. The point of interest is that the additive framework avoids the curse of dimensionality in the rate of convergence, that is, we can derive similar rates of convergence for $k \ge 2$ as for $k = 1$.
Let $\alpha > 0$ and $l \ge 2$; we recall that a function $f$ from $[0,1]$ into $\mathbb{R}$ belongs to the Besov space $\mathcal{B}_{\alpha,l,\infty}$ if it satisfies
$|f|_{\alpha,l} = \sup_{y>0} y^{-\alpha} w_d(f, y)_l < +\infty, \qquad d = [\alpha] + 1,$
where $w_d(f, y)_l$ denotes the modulus of smoothness. For a precise definition of those notions we refer to DeVore and Lorentz [(1993), Chapter 2, Section 7]. Since for $l \ge 2$, $\mathcal{B}_{\alpha,l,\infty} \subset \mathcal{B}_{\alpha,2,\infty}$, we now restrict ourselves to the case where $l = 2$. In the sequel, for any $L > 0$, $\mathcal{B}_{\alpha,2,\infty}(L)$ denotes the set of functions which belong to $\mathcal{B}_{\alpha,2,\infty}$ and satisfy $|f|_{\alpha,2} \le L$. Then the following result holds.

Proposition 5. Consider model (3.9) with $k \ge 2$. Let $L > 0$, assume that $\|f_{|A}\|_\infty \le L$ and that for all $i = 1,\ldots,k$, $f_i \in \mathcal{B}_{\alpha_i,2,\infty}(L)$ for some $\alpha_i > 1/2$. Assume that for all $i = 1,\ldots,k$ such that $g(i) = 1$, $\alpha_i \le r$. Set $\alpha = \min(\alpha_1,\ldots,\alpha_k)$; if $\mathbb{E}(|\varepsilon_1|^p) < \infty$ for some $p > 6$ then under the assumptions of Proposition 4
(3.10)  $\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^2\big] \le C(k, L, \alpha, R)\, n^{-\frac{2\alpha}{2\alpha+1}}.$

Comments. (i) In the case where $k = 1$, by using the collection of piecewise polynomials described in Section 3.1, (3.10) holds under the weaker assumption that $\alpha > 0$; we refer the reader to the proof of Proposition 5.
(ii) A result of the same flavor can be established in probability; this would require a weaker moment condition on the $\varepsilon_i$'s. Namely, using (4.7) we show similarly that for any $\eta > 0$, there exists a positive constant $C(\eta)$ (also depending on $k$, $L$, $\alpha$ and $R$) such that
$\|f_{|A} - \tilde f\|_n \le C(\eta)\, n^{-\frac{\alpha}{2\alpha+1}},$
with probability greater than or equal to $1 - \eta$, as soon as $\mathbb{E}(|\varepsilon_1|^p) < \infty$ for some $p > 2$.

3.5. Estimation of the order of an additive autoregression. Consider an additive autoregression framework,
(3.11)  $X_i = e_f + f_1(X_{i-1}) + f_2(X_{i-2}) + \cdots + f_k(X_{i-k}) + \varepsilon_i,$
where the $\varepsilon_i$'s are i.i.d. and $e_f$ denotes a constant. Under suitable assumptions ensuring that the $\vec X_i = (X_{i-1},\ldots,X_{i-k})'$'s are stationary and geometrically $\beta$-mixing, the estimation of $f_1,\ldots,f_k$ can be handled in the same way as in the previous section. The aim of this section is to provide an estimator of the order of autoregression, that is, an estimator of the integer $k_0$ ($k_0 \le k$, $k$ being known) satisfying $f_{k_0} \neq 0$ and $f_i = 0$ for all $i > k_0$. To do so, let $\mathcal{M}_n = \bigcup_{j=0}^{k}\mathcal{M}_n(j)$ (we use the notations introduced in Section 3.4) and consider the collection of models $\{S_m,\ m\in\mathcal{M}_n\}$. We estimate $k_0$ by $\hat k_0 = \hat k_0(x)$ defined as the first coordinate of $\hat m$, $\hat m$ being given by
$\hat m = \arg\min_{m\in\mathcal{M}_n}\Big\{\gamma_n(\hat f_m) + x^3\frac{D_m}{n}\sigma_2^2\Big\}.$
We measure the performance of $\hat k_0$ via that of $\tilde f = \hat f_{\hat m}$, the latter being known, under the assumptions of Theorem 1, to achieve the best trade-off (up to a constant) between the bias term and the variance term among the collection of least-squares estimators $\{\hat f_m,\ m\in\mathcal{M}_n\}$.
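A purely illustrative sketch of this order-selection rule follows (it is not the paper's construction: each lag enters through simple global polynomials rather than the spaces of Section 3.4, and all names and constants are assumptions made for the example):

```python
import numpy as np

def select_order(x, k_max, sigma2, x_pen=1.1, deg=4):
    """For each candidate order j, fit an additive model in the lags
    X_{i-1},...,X_{i-j} (polynomials of degree `deg` per lag) and keep the
    order minimizing the penalized least-squares criterion."""
    n = len(x) - k_max
    Y = x[k_max:]
    best = None
    for j in range(k_max + 1):
        cols = [np.ones(n)]
        for lag in range(1, j + 1):
            z = x[k_max - lag: k_max - lag + n]       # X_{i-lag}
            for d in range(1, deg + 1):
                cols.append(z ** d)
        M = np.column_stack(cols)
        coef, *_ = np.linalg.lstsq(M, Y, rcond=None)
        crit = np.mean((Y - M @ coef) ** 2) + x_pen ** 3 * sigma2 * M.shape[1] / n
        if best is None or crit < best[0]:
            best = (crit, j)
    return best[1]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n = 3000
    x = np.zeros(n)
    for i in range(2, n):                              # true order k_0 = 2
        x[i] = 0.5*np.sin(x[i-1]) + 0.3*np.tanh(x[i-2]) + 0.3*rng.standard_normal()
    print("selected order:", select_order(x, k_max=5, sigma2=0.09))
```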

4. The main result. In this section, we give our main result concerning the estimation of a regression function from dependent data. Although this result considers the case of particular collections of models, extensions including very general collections are to be found in the comments following the theorem.

4.1. The main theorem. Let $\mathcal{S}_n$ be some finite dimensional linear subspace of $A$-supported functions of $\mathbb{L}^2(\mathbb{R}^k, dx)$. Let $(\phi_\lambda)_{\lambda\in\Lambda_n}$ be an orthonormal basis of $\mathcal{S}_n \subset \mathbb{L}^2(A, dx)$ and set $D_n = |\Lambda_n| = \dim(\mathcal{S}_n)$. We assume that there exists some positive constant $\Phi_1 \ge 1$ such that for all $\lambda\in\Lambda_n$
(H$_{\mathcal S_n}$)  $\|\phi_\lambda\|_\infty \le \Phi_1\sqrt{D_n}$ and $\#\{\lambda':\ \|\phi_\lambda\phi_{\lambda'}\|_\infty \neq 0\} \le \Phi_1.$
The second condition means that for each $\lambda$, the supports of $\phi_\lambda$ and $\phi_{\lambda'}$ are disjoint except for at most $\Phi_1$ functions $\phi_{\lambda'}$'s. We shall see in Section 10 that those conditions imply that (2.2) holds with $\Phi_0^2 = \Phi_1^3$. In addition we assume some constraint on the dimension of $\mathcal{S}_n$:
(H$_{D_n}$)$_{\Theta,b}$ There exists an increasing function $\Theta$ mapping $\mathbb{R}^+$ into $\mathbb{R}^+$ satisfying, for some $K > 0$ and $b \in (0, 1/4)$,
$\forall u \ge 1, \quad \ln(u\vee 1) \le \Theta(u) \le K u^b,$
such that
(4.1)  $D_n \le \frac{n}{\Theta(n)\ln(n)}.$

Theorem 1. Let us consider model (1.1) with $f$ an unknown function from $\mathbb{R}^k$ into $\mathbb{R}$ such that $\|f_{|A}\|_\infty < \infty$ and where Conditions (H$_X$), (H$_\varepsilon$) and (H$_{X\varepsilon}$) are fulfilled. Consider a family $(S_m)_{m\in\mathcal{M}_n}$ of linear subspaces of $\mathcal{S}_n$. Assume that $(S_m)_{m\in\mathcal{M}_n}$ satisfies (H$_{\mathcal S}$) and that $\mathcal{S}_n$ satisfies (H$_{\mathcal S_n}$) and (H$_{D_n}$)$_{\Theta,b}$. Suppose that (H$_{XY}$) is fulfilled for a sequence of $\beta$-mixing coefficients satisfying
(4.2)  $\forall q \ge 1, \quad \beta_q \le M\big[\Theta^{-1}(Bq)\big]^{-3},$
for some $M > 0$ and for some constant $B$ given by (7.14). For any $x > 1$, let pen be a penalty function such that
$\forall m\in\mathcal{M}_n, \quad \mathrm{pen}(m) \ge x^3\frac{D_m}{n}\sigma_2^2.$
Let $\rho\in(1,x)$. For any $\bar p\in(0,1]$, if there exists $p > p_0 = 2(1+2\bar p)/(1-4b)$ such that $\sigma_p^p = \mathbb{E}(|\varepsilon_1|^p) < \infty$, we have that the PLSE $\tilde f$ defined by
(4.3)  $\tilde f = \arg\min_{m\in\mathcal{M}_n}\big[\gamma_n(\hat f_m) + \mathrm{pen}(m)\big] \quad\text{with}\quad \gamma_n(g) = \frac1n\sum_{i=1}^{n}\big[Y_i - g(\vec X_i)\big]^2$
satisfies
(4.4)  $\Big(\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^{2\bar p}\big]\Big)^{1/\bar p} \le \Big(\frac{x+\rho}{x-\rho}\Big)^2 \inf_{m\in\mathcal{M}_n}\Big\{\|f_{|A} - f_m\|_\mu^2 + 2\,\mathrm{pen}(m)\Big\} + C\frac{R_n}{n},$
where $C$ is a constant depending on $p, x, \rho, \bar p, \Phi_0, h_0, h_1, M, K$ and $R_n$ is given by
(4.5)  $R_n = \sigma_p^{2\bar p}\Big[\sum_{m\in\mathcal{M}_n} D_m^{-(p/2-\bar p)} + \frac{|\mathcal{M}_n|}{n^{(1/4-b)(p-p_0)}}\Big] + \frac{\|f_{|A}\|_\infty^{2\bar p}}{\sigma_p^{2\bar p}}.$

Comments.
1. The functions $\Theta$ of particular interest are either of the form $\Theta(u) = \ln(u)$ or $\Theta(u) = u^c$ with $0 < c < 1/4$. In the first case, (4.2) is equivalent to a geometric decay of the $\beta$-mixing coefficients (then, we say that the variables are geometrically $\beta$-mixing); in the second case (4.2) is equivalent to an arithmetic decay (the sequence is then arithmetically $\beta$-mixing).
2. A choice of $D_n$ small in front of $n$ allows one to deal with stronger dependency between the $(Y_i, \vec X_i)$'s. In return, choosing $D_n$ too small may lead to a serious drawback with regard to the performance of the PLSE. Indeed, in the case of nested models, the smaller $D_n$ the smaller the collection of models and the poorer the performance of $\tilde f$.
3. Assumption (H$_{\mathcal S_n}$) is fulfilled when $\mathcal{S}_n$ is generated by piecewise polynomials of degree $r$ on $[0,1]$ (in that case $\Phi_1 = 2r+1$ suits) or by wavelets such as those described in Section 3.2 (a suitable basis is obtained by rescaling the father wavelets $\phi_{J_0,k}$'s).
4. We shall see in Section 10 that the result of Theorem 1 holds for a larger class of linear spaces $\mathcal{S}_n$ [i.e., for $\mathcal{S}_n$'s which do not satisfy (H$_{\mathcal S_n}$)], provided that (4.1) is replaced by
(4.6)  $D_n^2 \le \frac{n}{\ln(n)\Theta(n)}.$

5. Take $\bar p = 1$; the main term involved in the right-hand side of (4.4) is usually
$\inf_{m\in\mathcal{M}_n}\big\{\|f_{|A} - f_m\|_\mu^2 + 2\,\mathrm{pen}(m)\big\}.$
It is worth noticing that the constant in front of this term, that is,
$C_1(x,\rho) = \Big(\frac{x+\rho}{x-\rho}\Big)^2,$
only depends on $x$ and $\rho$, and not on unpleasant quantities such as $h_0$, $h_1$. If Theorem 1 gives no precise recommendation on the choice of $x$ to optimize the performance of the PLSE, it suggests, in contrast, that a choice of $x$ close to 1 is certainly not a good choice since it makes the constant $C_1(x,\rho)$ blow up (we recall that $\rho$ must belong to $(1,x)$). Fix $\rho$; we see that $C_1(x,\rho)$ decreases to 1 as $x$ becomes large, the negative effect of choosing $x$ large being that it increases the value of the penalty term.
6. Why does Theorem 1 give a result for values of $\bar p \neq 1$? By using Markov's inequality, we can derive from (4.4) a result in probability saying that for any $\tau > 0$,
(4.7)  $\mathbb{P}\Big[\|f_{|A} - \tilde f\|_n^2 > \tau\Big(\inf_{m\in\mathcal{M}_n}\big\{\|f_{|A} - f_m\|_\mu^2 + 2\,\mathrm{pen}(m)\big\} + \frac{R_n}{n}\Big)\Big] \le \frac{C'}{\tau^{\bar p}},$
where $C'$ depends on $x, \rho, \bar p, C$. If $\mathbb{E}(|\varepsilon_1|^p) < \infty$ for some $p > 2$ and if it is possible to choose $\Theta(u)$ of order a power of $\ln(u)$ [this is the case

when the Yi  X  i ’s are geometrically β-mixing] then one can choose both
b in (HDn )8b and p̄ small enough to ensure that p > 21 + 2p̄/1 − 4b.
Consequently we get that (4.7) holds true under the weak assumption that
Ɛ ε1 p  < ∞ for some p > 2. Lastly we mention that an analogue of
(4.7) where f|A − f̃2n is replaced by f|A − f̃2µ can be obtained. This
can be derived from the fact that, under the assumptions of Theorem 1,
the (semi)norms  µ and  n are equivalent on n on a set of probability
close to 1 (we refer to the proof of Theorem 1 and for further details to
Baraud (2001)).
7. For adequate collections of models, the quantity Rn remains bounded by
some number R not depending on n. In addition, if for all m ∈ n , the
constants belong to Sm , then the quantity
 f|A ∞ involved in Rn can be
replaced by the smaller one f|A − f|A ∞ .

5. Generalization of Theorem 1. In this section we give an extension of Theorem 1 by relaxing the independence of the $\varepsilon_i$'s and by weakening Assumption (H$_{X\varepsilon}$). In particular, the next result shows that the procedure is robust to possible dependency (to some extent) of the $\varepsilon_i$'s.
We assume that:
(H'$_\varepsilon$) The $\varepsilon_i$'s satisfy, for some positive number $\vartheta$,
(5.1)  $\sup_{t,\ \|t\|_\mu\le 1} \mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] \le q\vartheta,$
for any $1 \le q \le n$.
In addition, Assumption (H$_{X\varepsilon}$) is replaced by a milder one:
(H'$_{X\varepsilon}$) For all $i\in\{1,\ldots,n\}$, $\vec X_i$ and $\varepsilon_i$ are independent.
Then the following result holds.

Theorem 2. Consider the assumptions of Theorem 1 and replace (H$_\varepsilon$) by (H'$_\varepsilon$) and (H$_{X\varepsilon}$) by (H'$_{X\varepsilon}$). For any $x > 1$, let pen be a penalty function such that
$\forall m\in\mathcal{M}_n, \quad \mathrm{pen}(m) \ge x^3\frac{D_m}{n}\vartheta.$
Then, the result (4.4) of Theorem 1 holds for a constant $C$ that also depends on $\vartheta$.

Comments.
1. In the case of i.i.d. $\varepsilon_i$'s and under Assumption (H$_{X\varepsilon}$) [which clearly implies (H'$_{X\varepsilon}$)], it is straightforward that (5.1) holds with $\vartheta = \sigma_2^2$. Indeed under Condition (H$_{X\varepsilon}$), for all $t\in\mathbb{L}^2(\mathbb{R}^k,\mu)$,
$\mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] = \sum_{i=1}^{q}\mathbb{E}\big[\varepsilon_i^2 t^2(\vec X_i)\big] + 0 = q\sigma_2^2\|t\|_\mu^2.$
Then, we recover Theorem 1.



2. Assume that the sequences $(\vec X_i)_{i=1,\ldots,n}$ and $(\varepsilon_i)_{i=1,\ldots,n}$ are independent [which clearly implies (H'$_{X\varepsilon}$)] and that the $\varepsilon_i$'s are $\beta$-mixing. Then, we know from Viennet (1997) that there exists a function $d_\beta$ depending on the $\beta$-mixing coefficients of the $\varepsilon_i$'s such that for all $t\in\mathbb{L}^2(\mathbb{R}^k,\mu)$
$\mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] \le q\,\mathbb{E}\big[\varepsilon_1^2 d_\beta(\varepsilon_1)\big]\,\|t\|_\mu^2,$
which amounts to taking $\vartheta = \vartheta_\beta = \mathbb{E}[\varepsilon_1^2 d_\beta(\varepsilon_1)]$ in (5.1). Roughly speaking $\vartheta_\beta$ is close to $\sigma_2^2$ when the $\beta$-mixing coefficients of the $\varepsilon_i$'s are close to 0, which corresponds to the independence of the $\varepsilon_i$'s. Thus, in this context the result of Theorem 2 can be understood as a result of robustness, since $\vartheta_\beta$ is unknown. Indeed, the penalized procedure described in Theorem 1 with a penalty term satisfying, for some $\kappa > 1$,
$\forall m\in\mathcal{M}_n, \quad \mathrm{pen}(m) \ge \kappa\frac{D_m}{n}\sigma_2^2,$
still works if $\vartheta_\beta < \kappa\sigma_2^2$. This also means that if the independence of the $\varepsilon_i$'s is debatable, it is safer to increase the value of the penalty term.

6. Proof of the propositions of Section 3.

Proof of Proposition 1. The result is a consequence of Theorem 1. Let us show that under (H$_{AR1}$) the assumptions of Theorem 1 are fulfilled. Condition (H$_\varepsilon$) is direct. Under (3.3) it is clear that $\|f_{|[0,1]}\|_\infty < \infty$ holds true. We now set $\mathcal{S}_n = S_{m_n}$ and $\Theta(x) = \ln^2(x)$. Since
$\dim(\mathcal{S}_n) = D_n \le \frac{n}{\ln^3(n)},$
(H$_{D_n}$)$_{\Theta,b}$ holds for any $b > 0$ and for some constant $K = K(b)$. As to Conditions (H$_{\mathcal S}$) and (H$_{\mathcal S_n}$), they hold with $\Phi_0 = r$ [we refer to Birgé and Massart (1998)]. Under Condition (3.3), we know from Duflo (1997) that the process $(X_i)_{i\ge 0}$ admits a stationary law $\mu$. Furthermore, we know that if the $\varepsilon_i$'s admit a positive bounded continuous density with respect to the Lebesgue measure then so does $\mu$. This can easily be deduced from the connection between $h_X$ and $h_\varepsilon$ given by
$h_X(y) = \int_{\mathbb{R}} h_\varepsilon\big(y - f(x)\big)h_X(x)\,dx \quad \forall y\in\mathbb{R}.$
Then we can derive the existence of positive numbers $h_1$ and $h_0$ bounding the density $h_X$ from above and below on $[0,1]$ and thus (H$_X$) is true. In addition we know from Doukhan (1994) that under (3.3) the $X_i$'s are geometrically $\beta$-mixing, that is, there exist two positive constants $\Gamma$, $\theta$ such that
(6.1)  $\beta_q \le \Gamma e^{-\theta q} \quad \forall q\ge 1.$

Since 8−1 u = exp u, clearly there exists some constant M = M/ θ > 0
such that

βq ≤ /e−θq ≤ Me−3 Bq ∀q ≥ 1
Lastly, the εi ’s being independent of the sequence Xj j<i , (HXε ) is true and
we know that the β-mixing coefficients of both sequences Xi−1  εi i=1 n and
Xi−1 i=1 n are the same. Consequently, Condition (HXY ) holds and (4.2) is
fulfilled. By choosing p̄ = 1, Theorem 1 can be applied if Ɛ εi p  < ∞ for some
p > 6/1 − 4b. This is true for b small enough and then (3.1) follows from
(4.4) with
 
2
 −p/2+1 n f|01 2∞
Rn = σp Dm + 1/4−bp−6/1−4b +
m∈n n σp2
 
+∞ 2
 lnn f| 01  ∞
≤ σp2 r2m −2 + sup 1/4−bp−6/1−4b +
m=0 n≥1 n σp2

=R
Take R = CR where C is the constant involved in (4.4) to complete the proof
of Proposition 1. ✷

Proof of Proposition 2. Conditions (H$_{\mathcal S}$) and (H$_{\mathcal S_n}$) are fulfilled [we refer to Birgé and Massart (1998)]. Next we check that (H$_{XY}$) holds true and more precisely that the sequence $(\varepsilon_i, X_i)_{1\le i\le n}$ is arithmetically $\beta$-mixing with $\beta$-mixing coefficients satisfying
(6.2)  $\forall q\in\{1,\ldots,n\}, \quad \beta_q \le \Gamma q^{-\theta},$
for some constants $\Gamma > 0$ and $\theta > 15$. For that purpose, simply write $(\varepsilon_t, X_t)' = \sum_{j=0}^{\infty}A_j e(t-j)$ with $e(t-j) = (\varepsilon_{t-2j}, \varepsilon_{t-1-2j})'$, where for $j\ge 0$, $A_0$ is the $2\times2$ identity matrix and
$A_j = \begin{pmatrix} 0 & 0 \\ 0 & a_j \end{pmatrix}.$
Then Pham and Tran's (1985) Theorem 2.1 implies, under (H$_{Reg}$), that $(\varepsilon_t, X_t)$ is absolutely regular with coefficients $\beta_n \le K\sum_{j=n}^{+\infty}\big(\sum_{k\ge j}|a_k|\big) \le KC/[(d-1)(d-2)]\, n^{-d+2}$. This implies (6.2) with $\theta = d-2 > 15$. In addition, it can be proved that if $a_j = j^{-d}$ then $\beta_n \ge C(d)n^{-d}$, which shows that we do not reach the geometrical rate of mixing.
Clearly the other assumptions of Theorem 1 are satisfied and it remains to apply it with $p = 30$ (a moment of order 30 exists since the $\varepsilon_i$'s are Gaussian), $\Theta(u) = u^{1/5}$ and $\bar p = 1$. An upper bound for $R_n$ which does not depend on $n$ can be established in the same way as in the proof of Proposition 1. ✷

Proof of Proposition 3. The line of proof is similar to that of Proposition 1, the difference lying in the fact that we need to check the assumptions of Theorem 2. Most of them are clearly fulfilled; we only check (H$_{XY}$) and (H'$_\varepsilon$). We note that the pairs $(\vec X_i, Y_i)$'s are geometrically $\beta$-mixing [which shows that (H$_{XY}$) holds true] since both sequences $(X_i)$'s and $(\varepsilon_i)$'s are geometrically $\beta$-mixing (since the $\varepsilon_i$'s are drawn from a "nice" autoregression model, we refer to Section 3.1) and are independent. Next we show that (H'$_\varepsilon$) holds true with $\vartheta = (1 + 2a/(1-a))\sigma_2^2$. This will end the proof of Proposition 3. For all $t\in\mathbb{L}^2(\mathbb{R}^k,\mu)$,
$\mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] \le \sum_{i=1}^{q}\|t\|_\mu^2\sigma_2^2 + 2\sum_{i<j}\mathbb{E}(\varepsilon_i\varepsilon_j)\,\mathbb{E}\big[t(\vec X_i)t(\vec X_j)\big].$
For $i < j$,
$\mathbb{E}(\varepsilon_i\varepsilon_j) = \mathbb{E}\big[\varepsilon_i\big(u_j + \cdots + a^k u_{j-k} + \cdots + a^{j-i-1}u_{i+1} + a^{j-i}\varepsilon_i\big)\big] = 0 + a^{j-i}\sigma_2^2,$
thus
$\mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] \le q\|t\|_\mu^2\sigma_2^2 + 2\sum_{i<j}a^{j-i}\mathbb{E}\big[t(\vec X_i)t(\vec X_j)\big]\sigma_2^2 \le \Big(q + 2\sum_{1\le i<j\le q}a^{j-i}\Big)\|t\|_\mu^2\sigma_2^2,$
by Cauchy–Schwarz's inequality. Therefore, we obtain
$\mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] \le q\Big(1 + \frac{2a}{1-a}\Big)\|t\|_\mu^2\sigma_2^2,$
which gives the result. ✷
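As a quick numerical sanity check of this bound (not part of the original proof; the test function, the values of $a$ and $q$, and the uniform design are illustrative assumptions), one can simulate AR(1) errors and an independent design and compare the empirical value of $\mathbb{E}[(\sum_i\varepsilon_i t(X_i))^2]$ with $q(1+2a/(1-a))\sigma_2^2\|t\|_\mu^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
a, q, reps = 0.6, 200, 2000
t = lambda u: np.sqrt(2) * np.cos(2 * np.pi * u)   # ||t||_mu = 1 under U(0,1)

vals = np.empty(reps)
for r in range(reps):
    u = rng.standard_normal(q)
    eps = np.empty(q)
    eps[0] = u[0] / np.sqrt(1 - a ** 2)            # stationary start
    for i in range(1, q):
        eps[i] = a * eps[i - 1] + u[i]             # eps_i = a*eps_{i-1} + u_i
    x = rng.uniform(size=q)                        # design independent of eps
    vals[r] = eps @ t(x)

sigma2 = 1.0 / (1 - a ** 2)                        # stationary Var(eps_i)
bound = q * (1 + 2 * a / (1 - a)) * sigma2         # bound with ||t||_mu^2 = 1
print("empirical E[(sum eps_i t(X_i))^2]:", vals.var() + vals.mean() ** 2)
print("bound q(1+2a/(1-a))sigma_2^2     :", bound)
```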

Proof of Proposition 4. This proposition is a consequence of Theorem 1. It is enough to apply it with $\bar p = 1$. In the sequel, we check that the assumptions of the theorem are fulfilled and we bound $R_n$ [given by (4.5)] by some constant that does not depend on $n$. To bound the $\beta$-mixing coefficients of the sequence of the $(Y_i, \vec X_i)$'s, we argue as in the proof of Proposition 1, with $\mathcal{S}_n = S_{(k,\,m_{g(1)}(n),\ldots,m_{g(k)}(n))}$, $\dim(\mathcal{S}_n) \le n/\ln^3(n)$ and $\Theta(n) = \ln^2(n)$. Inequality (4.6) is verified [thus condition (H$_{\mathcal S_n}$) can be omitted]. Let us now check (H$_{\mathcal S}$). Since for all $m, m'\in\mathcal{M}_n$, $S_m + S_{m'}$ and $\mathcal{S}_n$ belong to the collection of models $\{S_m,\ m\in\mathcal{M}_n\}$, the assumption (H$_{\mathcal S}$) holds true if we prove that (2.2) is satisfied for any $S_m$, $m\in\mathcal{M}_n$. Now note that for each $m\in\mathcal{M}_n$, the following decomposition in $\mathbb{L}^2([0,1]^k, dx_1\cdots dx_k)$ holds:
$S_m = \mathbb{R}\,\chi_{[0,1]^k} \overset{\perp}{\oplus} S_m^{(1)} \overset{\perp}{\oplus} \cdots \overset{\perp}{\oplus} S_m^{(k)},$

where $S_m^{(i)} = \{t\in S_m:\ t(x_1,\ldots,x_k) = t_i(x_i)\}$ and $\chi_{[0,1]^k}$ denotes the constant function on $[0,1]^k$. Clearly, $S_m^{(i)}$ satisfies (2.2) if and only if $S_{m_i}^{(g(i))}$ does, which is true. Now the fact that the $S_m$'s satisfy (2.2) is a consequence of the following lemma.

Lemma 1. Let $S^{(1)},\ldots,S^{(k)}$ be $k$ linear spaces which are pairwise orthogonal in $\mathbb{L}^2([0,1]^k, dx_1\cdots dx_k)$. If for each $i = 1,\ldots,k$, $S^{(i)}$ satisfies (2.2), then so does $S = S^{(1)} + \cdots + S^{(k)}$.

Proof. The result follows from a Cauchy–Schwarz argument: for all $t_i\in S^{(i)}$, $i = 1,\ldots,k$,
$\Big\|\sum_{i=1}^{k}t_i\Big\|_\infty \le \sum_{i=1}^{k}\|t_i\|_\infty \le \Phi_0\sum_{i=1}^{k}\sqrt{\dim(S^{(i)})}\,\|t_i\| \le \Phi_0\Big(\sum_{i=1}^{k}\dim(S^{(i)})\Big)^{1/2}\Big(\sum_{i=1}^{k}\|t_i\|^2\Big)^{1/2} = \Phi_0\sqrt{\dim(S)}\,\Big\|\sum_{i=1}^{k}t_i\Big\|. \quad ✷$

To complete the proof of Proposition 4 we bound $R_n$ by some constant $R$ which does not depend on $n$. Note that $|\mathcal{M}_n|$ is of order a power of $\ln(n)$, so the point is to show that $\sum_{m\in\mathcal{M}_n}D_m^{-2}$ (we recall that $\bar p = 1$ and $p > 6$) remains bounded by some quantity which does not depend on $n$. Now for each $m = (k, m_1,\ldots,m_k)\in\mathcal{M}_n$ we have that $D_m$ is of order $2^{m_1} + \cdots + 2^{m_k}$, thus by using the convexity inequality $k^{-1}(a_1+\cdots+a_k) \ge (a_1\cdots a_k)^{1/k}$, which holds for any positive numbers $a_1,\ldots,a_k$, we obtain that $\sum_{m\in\mathcal{M}_n}D_m^{-2}$ is bounded (up to a constant) by
$\sum_{m_1=0}^{\infty}\cdots\sum_{m_k=0}^{\infty}\big(2^{m_1}+\cdots+2^{m_k}\big)^{-2} \le \sum_{m_1=0}^{\infty}\cdots\sum_{m_k=0}^{\infty}2^{-2(m_1+\cdots+m_k)/k} = \Big(\sum_{j=0}^{\infty}2^{-2j/k}\Big)^{k} = R < \infty. \quad ✷$

Proof of Proposition 5. Let $k\ge 2$. We start from (3.1) and bound the bias term. Let $\bar f_m$ be the $\mathbb{L}^2([0,1]^k, dx)$ projection of $f$ onto $S_m$; we have that $\|f_{|A} - f_m\|_\mu^2 \le \|f_{|A} - \bar f_m\|_\mu^2 \le h_1\|f - \bar f_m\|^2$ by (H$_X$) and, for each $m = (k, m_1,\ldots,m_k)$,
$\|f - \bar f_m\|^2 = \sum_{i=1}^{k}\int_{[0,1]}\big[f_i(x) - \bar f_{m_i}(x)\big]^2 dx,$
where $\bar f_{m_i}$ denotes the $\mathbb{L}^2([0,1], dx)$ projection of $f_i$ onto $S_{m_i}^{(g(i))}$. Lastly we use standard results of approximation theory [see Barron, Birgé and Massart (1999), Lemma 13, or DeVore and Lorentz (1993)] which ensure that $\int_{[0,1]}[f_i(x) - \bar f_{m_i}(x)]^2dx \le C(\alpha_i, L)D_{m_i}^{-2\alpha_i}$ [if $g(i) = 1$, this holds true in the case of piecewise polynomials since $r \ge \alpha_i$]. We obtain (3.10) by taking for each $i = 1,\ldots,k$, $m_i\in\mathcal{M}_n^{(g(i))}$ such that $D_{m_i}$ is of order $n^{1/(2\alpha_i+1)}$, which is possible since $\alpha_i > 1/2$ and therefore $n^{1/(2\alpha_i+1)} \le D_n$ (at least for $n$ large enough). In the one dimensional case, by considering the piecewise polynomials described in Section 3.1, $D_n$ is of order $n/\ln^3(n)$ (such a choice is possible in this case) and then a choice of $m$ among $\mathcal{M}_n$ such that $D_m$ is of order $n^{1/(2\alpha+1)}$ is possible for any $\alpha > 0$. ✷

7. Proofs of Theorems 1 and 2. The proof of Theorem 2 is clear from the proof of Theorem 1. Indeed the assumptions (H$_{X\varepsilon}$) and (H$_\varepsilon$) are only needed in (8.6) and (8.10). For the rest of the proof assuming (H'$_{X\varepsilon}$) is enough. It remains to notice that an analogue of (8.6) and (8.10) is easily obtained from Assumption (H'$_\varepsilon$).
Now we prove Theorem 1. The proof is divided into consecutive claims.

Claim 1. For all $m\in\mathcal{M}_n$,
(7.1)  $\|f_{|A} - \tilde f\|_n^2 \le \|f_{|A} - f_m\|_n^2 + \frac2n\sum_{i=1}^{n}\varepsilon_i(\tilde f - f_m)(\vec X_i) + \mathrm{pen}(m) - \mathrm{pen}(\hat m).$

Proof. By definition of $\tilde f$ we know that for all $m\in\mathcal{M}_n$ and $t\in S_m$,
$\gamma_n(\tilde f) + \mathrm{pen}(\hat m) \le \gamma_n(t) + \mathrm{pen}(m).$
In particular this holds for $t = f_m$ and algebraic computations lead to
(7.2)  $\|f - \tilde f\|_n^2 \le \|f - f_m\|_n^2 + \frac2n\sum_{i=1}^{n}\varepsilon_i(\tilde f - f_m)(\vec X_i) + \mathrm{pen}(m) - \mathrm{pen}(\hat m).$
Note that the relation
$\|f - t\|_n^2 = \|f_{|A} - t\|_n^2 + \|f - f_{|A}\|_n^2$
is satisfied for any $A$-supported function $t$. Applying this identity respectively to $t = \tilde f$ and $t = f_m$ (those functions being $A$-supported as elements of $\bigcup_{m'\in\mathcal{M}_n}S_{m'}$), we derive (7.1) from (7.2). ✷

Claim 2. Let $q_n$, $q_{n,1}$ be integers such that $0 \le q_{n,1} \le q_n/2$, $q_n \ge 1$. Set $\vec u_i = (\varepsilon_i, \vec X_i)'$, $i = 1,\ldots,n$; then there exist random variables $\vec u_i^* = (\varepsilon_i^*, \vec X_i^*)'$, $i = 1,\ldots,n$, satisfying the following properties:
(i) For $\ell = 1,\ldots,\ell_n = n/q_n$, the random vectors
$\vec U_{\ell,1} = \big(\vec u_{(\ell-1)q_n+1},\ldots,\vec u_{(\ell-1)q_n+q_{n,1}}\big) \quad\text{and}\quad \vec U_{\ell,1}^* = \big(\vec u_{(\ell-1)q_n+1}^*,\ldots,\vec u_{(\ell-1)q_n+q_{n,1}}^*\big)$
have the same distribution, and so have the random vectors
$\vec U_{\ell,2} = \big(\vec u_{(\ell-1)q_n+q_{n,1}+1},\ldots,\vec u_{\ell q_n}\big) \quad\text{and}\quad \vec U_{\ell,2}^* = \big(\vec u_{(\ell-1)q_n+q_{n,1}+1}^*,\ldots,\vec u_{\ell q_n}^*\big).$
(ii) For $\ell = 1,\ldots,\ell_n$,
(7.3)  $\mathbb{P}\big(\vec U_{\ell,1} \neq \vec U_{\ell,1}^*\big) \le \beta_{q_n-q_{n,1}} \quad\text{and}\quad \mathbb{P}\big(\vec U_{\ell,2} \neq \vec U_{\ell,2}^*\big) \le \beta_{q_{n,1}}.$
(iii) For each $\delta\in\{1,2\}$, the random vectors $\vec U_{1,\delta}^*,\ldots,\vec U_{\ell_n,\delta}^*$ are independent.

Proof. The claim is a corollary of Berbee's coupling lemma (1979) [see Doukhan et al. (1995)] together with (H$_{XY}$). For further details about the construction of the $\vec u_i^*$'s we refer to Viennet (1997); see Proposition 5.1 and its proof, page 484. ✷

We set
(7.4)  $A_0 = h_0^2(1 - 1/\rho)^2/(80\,\Phi_1^4 h_1)$
and we choose $q_n = \mathrm{int}(A_0\Theta(n)/4) + 1 \ge 1$ [$\mathrm{int}(u)$ denotes the integer part of $u$] and $q_{n,1} = q_{n,1}(x)$ to satisfy $\sqrt{q_{n,1}/q_n} + \sqrt{1 - q_{n,1}/q_n} \le \sqrt x$; namely $q_{n,1}$ of order $\big[(\sqrt x-1)^2\wedge 1\big]q_n/2$ works. For the sake of simplicity, we assume $q_n$ to divide $n$, that is, $n = \ell_n q_n$, and we introduce the sets $B^*$ and $B_\rho$ defined as follows:
$B^* = \big\{(\varepsilon_i, \vec X_i) = (\varepsilon_i^*, \vec X_i^*),\ i = 1,\ldots,n\big\}$
and for $\rho \ge 1$,
$B_\rho = \Big\{\|t\|_\mu^2 \le \rho\|t\|_n^2,\ \forall t\in\bigcup_{m,m'\in\mathcal{M}_n}(S_m + S_{m'})\Big\}.$
We denote by $B_\rho^*$ the set $B^*\cap B_\rho$. From now on, the index $m$ denotes a minimizer of the quantity $\|f_{|A} - f_{m'}\|_\mu^2 + \mathrm{pen}(m')$ for $m'\in\mathcal{M}_n$. Therefore, $m$ is fixed and, for the sake of simplicity, the index $m$ is omitted in the three following notations. Let $B_{m'}(\mu)$ be the unit ball in $S'_{m'} = S_m + S_{m'}$ with respect to $\|\cdot\|_\mu$, that is,
$B_{m'}(\mu) = \Big\{t\in S_m + S_{m'}:\ \|t\|_\mu^2 = \mathbb{E}\Big[\frac1n\sum_{i=1}^{n}t^2(\vec X_i)\Big] \le 1\Big\}.$
For each $m'\in\mathcal{M}_n$, we set $D'_{m'} = \dim(S'_{m'})$.

Claim 3. Let $x, \rho$ be numbers satisfying $x > \rho > 1$. If pen is chosen to satisfy
(7.5)  $\mathrm{pen}(m') \ge x^3\frac{D_{m'}}{n}\sigma_2^2 \quad \forall m'\in\mathcal{M}_n,$
then
(7.6)  $\|f_{|A} - \tilde f\|_n^2\,\mathbb{1}_{B_\rho^*} \le C_1(x,\rho)\big[\|f_{|A} - f_m\|_n^2 + 2\,\mathrm{pen}(m)\big] + \frac{x(x+\rho)}{x-\rho}\,n^{-2}W_n(\hat m),$
where $W_n(m')$ is defined by
$W_n(m') = \Big[\Big(\sup_{t\in B_{m'}(\mu)}\Big|\sum_{i=1}^{n}\varepsilon_i^* t(\vec X_i^*)\Big|\Big)^2 - x^2 n D'_{m'}\sigma_2^2\Big]_+$
for $m'\in\mathcal{M}_n$, and where $C_1(x,\rho) = (x+\rho)^2/(x-\rho)^2 > 1$.

Proof. The following inequalities hold on Bρ∗ . Starting from (7.1) we get

2 n
f̃ − fm X ∗
i
f|A − f̃2n ≤ f|A − fm 2n + f̃ − fm µ εi∗
n i=1 f̃ − fm µ
+ penm − penm̂
n
2  ∗i 
≤ f|A − fm 2n + f̃ − fm µ sup εi∗ tX
n t∈Bm̂µ i=1

+ penm − penm̂
Using the elementary inequality 2ab ≤ xa2 + x−1 b2 , which holds for any pos-
itive numbers a b, we have
 2
n
2 2 −1 2 −2
f|A − f̃n ≤ f|A − fm n + x f̃ − fm µ + n x sup ∗  ∗
εi tXi 
t∈Bm̂µ i=1

+penm − penm̂

On Bρ∗ ⊂ Bρ , we know that for all t ∈ m ∈n Sm + Sm , t2µ ≤ ρt2n , hence
 2
n

f|A − f̃2n ≤ f|A − fm 2n −1
+ x ρf̃ − fm 2n −2
+n x sup  ∗i 
εi∗ tX
t∈Bm̂µ i=1

+penm − penm̂
 2
≤ f|A − fm 2n + x−1 ρ f̃ − f|A n + f|A − fm n
 2
n

−2
+n x sup ∗  ∗
εi tXi  + penm − penm̂
t∈Bm̂µ i=1

by the triangular inequality. Since for all y > 0 (y is chosen at the end of the
proof)
 2
f̃ − f|A n + f|A − fm n ≤ 1 + yf̃ − f|A 2n + 1 + y−1 f|A − fm 2n 

we obtain

1+y
f|A − f̃2n 1−ρ
x
 n 2
1 + y−1 
≤ f|A − fm 2n 1+ρ −2
+n x sup  ∗i 
εi∗ tX
x t∈Bm̂µ i=1

+penm − penm̂

1 + y−1 Dm + Dm̂ 2
≤ f|A − fm 2n 1+ρ + penm + x3 σ2
x n
n 2 
x 
−penm̂ + sup  ∗i  − x2 nDm̂σ22 
εi∗ tX
n2 t∈Bm̂µ i=1 +

using that Dm̂ ≤ Dm̂ + Dm . Since the penalty function pen satisfies (7.5) for
all m ∈ n , we obtain that on Bρ∗

 
1+y 1+y−1
f|A − f̃2n 1−ρ 2
≤ f|A − fm n 1+ρ +2penm + xn−2 Wn m̂
x x

which gives the claim by choosing y = x − ρ/x + ρ. ✷

Claim 4. For $p \ge 2(1+2\bar p)/(1-4b)$ we have
$\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^{2\bar p}\,\mathbb{1}_{B_\rho^*}\big] \le C_1^{\bar p}(x,\rho)\big[\|f_{|A} - f_m\|_\mu^2 + 2\,\mathrm{pen}(m)\big]^{\bar p} + \frac{C}{n^{\bar p}}\Big[\big(\Phi_0 h_0^{-1/2}\big)^p\sigma_p^{2\bar p}\sum_{m'\in\mathcal{M}_n}D_{m'}^{-(p/2-\bar p)} + (2K)^p\frac{|\mathcal{M}_n|}{n^{(1/4-b)(p-2(1+2\bar p)/(1-4b))}}\Big],$
where $C$ is a constant that depends on $x, \rho, p, \bar p$.

Proof. By taking the power p̄ ≤ 1 of the right- and left-hand side of (7.6)
we obtain

f|A − f̃2np̄ |Bρ∗


p̄
p̄  p̄ xx + ρ
≤ C1 x ρ f|A − fm 2n + 2penm + Wp̄
n m̂
n2 x − ρ
p̄
p̄  p̄ xx + ρ 
≤ C1 x ρ f|A − fm 2n + 2penm + Wp̄
n m 
n2 x − ρ m ∈n

By taking the expectation on both sides of the inequality and using Jensen’s
inequality we obtain that
 
Ɛ f|A − f̃2np̄ |Bρ∗
p̄  p̄
(7.7) ≤ C1 x ρ f|A − fm 2µ + 2penm

xx + ρ p̄   p̄ 
+ Ɛ Wn m 
n2 x − ρ m ∈
n

We now use the following result,

Proposition 6. Under the assumptions of Theorem 1


  
Cp p̄−1 Ɛ Wp̄
n m 
m ∈n

2
 n

≤ Cp p̄ −1
Ɛ  sup  ∗i 
εi∗ tX
m ∈n t∈Bm µ i=1

* *  p̄ 
qn1 qn1 2
−x + 1− nDm σ22
qn qn
+
$ %p̄−p p̄  
−1/2 p 2p̄
≤ xp̄/3 x1/3 − 1 n ! 0 h0 σp
 p

 −p/2+p̄ qn n
× Dm + pp−2/4p−1−p̄
m ∈n n

The proof of the second inequality is delayed to Section 8, the first one is a
straightforward consequence of our choice of qn1 .
Using Proposition 6 we derive from (7.7) that
 
2p̄
Ɛ f|A − f̃n |Bρ∗
 p̄
p̄ 2 Cx p p̄  
−1/2 p 2p̄
(7.8) ≤ C 1 x ρ f| A − f m  µ + 2penm + ! 0 h 0 σp
np̄
 p 
 −p/2+p̄ qn n
× Dm + pp−2/4p−1−p̄
m ∈n n

Since A0 ≤ 1 and 1 ≤ 8n ≤ Knb we have


qp p p p bp
n ≤ 2 8n ≤ 2K n

hence by using the inequality pp − 2/4p − 1 ≥ p − 2/4 we get


p
q n n n
(7.9) ≤ 2Kp
npp−2/4p−1−p̄ n 1/4−bp−21+2 p̄/1−4b

Note that the power of n, 1/4 − bp − 21 + 2p̄/1 − 4b is positive for
p > 21 + 2p̄/1 − 4b. The result follows by combining (7.8) and (7.9). ✷

Claim 5. Under the assumptions of Theorem 1 we have
(7.10)  $\mathbb{P}\big(B_\rho^{*c}\big) \le \big(2M + e^{16/A_0}\big)n^{-2}$
and
(7.11)  $\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^{2\bar p}\,\mathbb{1}_{B_\rho^{*c}}\big] \le \big(2M + e^{16/A_0}\big)^{1-2\bar p/p}\big(\|f_{|A}\|_\infty^{2\bar p} + \sigma_p^{2\bar p}\big)n^{-\bar p}.$

Proof. For the proof of (7.11) we refer to Baraud (2000) [see proof of Theorem 6.1, (49) with $q = \bar p$ and $\beta = 2$], noticing that $p \ge 2(1+2\bar p)/(1-4b) > 4\bar p/(2-\bar p)$ (since $\bar p \le 1$). By examining the proof, it is easy to check that if the constants belong to the $S_m$'s then $\|f_{|A}\|_\infty$ can be replaced by $\|f_{|A} - \int f_{|A}\,dx\|_\infty$. To prove (7.10) we use the following proposition, which is proved in Section 9.

Proposition 7. Under the assumptions of Theorem 1, for all $\rho > 1$,
(7.12)  $\mathbb{P}\big(B_\rho^{*c}\big) \le 2n^2\exp\Big(-A_0\frac{\Theta(n)\ln(n)}{q_n}\Big) + 2n\beta_{q_{n,1}}.$

Since $q_n = \mathrm{int}(A_0\Theta(n)/4) + 1 \le A_0\Theta(n)/4 + 1$ we have
(7.13)  $2n^2\exp\Big(-A_0\frac{\Theta(n)\ln(n)}{q_n}\Big) \le 2n^2\exp\Big(4\ln(n)\Big(-1 + \frac{4}{A_0\Theta(n)+4}\Big)\Big) \le \frac{2e^{16/A_0}}{n^2},$
$\Theta(n)$ being larger than $\ln(n)$. Now, set
(7.14)  $B = \Big[\frac{A_0\big((\sqrt x-1)^2\wedge1\big)}{8}\Big]^{-1} = \Big[\frac{h_0^2\big((\sqrt x-1)^2\wedge1\big)(1-1/\rho)^2}{640\,\Phi_1^4 h_1}\Big]^{-1}.$
Since $q_n \ge A_0\Theta(n)/4$, under Condition (4.2) we have
(7.15)  $2n\beta_{q_{n,1}} \le 2nM\Big[\Theta^{-1}\Big(\big((\sqrt x-1)^2\wedge1\big)\frac{Bq_n}{2}\Big)\Big]^{-3} \le 2nM\big[\Theta^{-1}(\Theta(n))\big]^{-3} = \frac{2M}{n^2}.$
Claim 5 is proved by combining (7.13) and (7.15). ✷

The proof of Theorem 1 is completed by combining Claim 4 and Claim 5.



8. Proof of Proposition 6. We decompose the proof into two steps:

Step 1. For all m ∈ n ,


 * *  p
n
 qn1 qn1
Ɛ sup  ∗i 
εi∗ tX − + 1− nDm σ2
t∈Bm µ i=1 qn qn
+
(8.1)
 
−1/2 p/2 p2 /4p−1
≤ Cpσpp np/2 + !0 h0 p qp
n Dm  n

Proof. Using the result of Claim 2, we have the following decomposition:


 
n 2n
   
 ∗i  =
εi∗ tX   ∗i  +
εi∗ tX  ∗i 
εi∗ tX
i=1 2=1 1
i∈I2
2
i∈I2

1 2
where for 2 = 1  2n , I2 = 2 − 1qn + 1  2 − 1qn + qn1  and I2 =
2 − 1qn + qn1 + 1  2qn = 2 − 1qn + qn1 + qn − qn1 . Denoting Ɛ∗1 =
 
2n qn1 Dm σ2 and Ɛ∗2 = 2n qn − qn1 Dm σ2 we have

 p
n

Ɛ sup  ∗i 
εi∗ tX − Ɛ∗1 − Ɛ∗2
t∈Bm µ i=1
+

 p 
2n
   ∗
≤ 2p−1 Ɛ  sup  ∗i  − Ɛ∗1  
εi tX 
t∈Bm µ 2=1 1
i∈I2 +

 p 
2n
   ∗
+2p−1 Ɛ  sup  ∗i  − Ɛ∗2  
εi tX 
t∈Bm µ 2=1 2
i∈I2 +

Since the two terms can be bounded in the same way, we only show how
to bound the first one. To do so, we use a moment inequality proved in Ba-
raud [(2000), Theorem 5.2, page 478]: consider the sequence of independent
$ %q
random vectors of  × k n1 , U ∗1   ∗2 defined by U
U  ∗2 = ε∗  X
 ∗  1 for
n i i i∈I2
2 = 1 $ 2n , and
%q consider m = gt / t ∈ Bm  µ the set of functions gt
mapping  × k n1 into  defined by

  q
n1

gt e1  x1   eqn1  xqn1  = ei txi 


i=1

By applying the moment inequality with the U  ∗2 ’s and the class of functions
m we find for all p ≥ 2,
 p 
2n
  
Cp−1 Ɛ  sup εi∗ tX  ∗i  − Ɛ∗1  
t∈Bm µ 2=1 1
i∈I2 +
  p 
2n  

  ∗
≤Ɛ  sup   ∗  
εi tXi 

(8.2) t∈Bm µ 2=1 
i∈I2
1 
  2 
2n
  
+Ɛp/2  sup  εi∗ tX ∗i  
t∈Bm µ 2=1 1
i∈I2

p/2
= Vp + V2
provided that
 
2n
 
(8.3) Ɛ  sup  ∗i  ≤ Ɛ∗1 =
εi∗ tX 2n qn1 Dm σ2
t∈Bm µ 2=1 1
i∈I2

Throughout this section, we denote by G2 t the random process


 ∗
G2 t = εi tX  ∗i 
1
i∈I2

which is repeatedly involved in our computations. It is worth noticing that it


is linear with respect to the argument t.
We first show that (8.3) is true. Let ϕj , j = 1  Dm  be an orthonormal
2
basis of Sm + Sm ⊂  A µ. For each t ∈ Bm  µ we have the following
decomposition
Dm  Dm 
 
(8.4) t= aj ϕ j  a2j ≤ 1
j=1 j=1

By Cauchy–Schwarz’s inequality we know that


   2
1/2
2n Dm  2n Dm  2n
    
G2 t = aj G2 ϕj  ≤  G2 ϕj  
2=1 j=1 2=1 j=1 2=1

Thus, by using Jensen’s inequality we obtain


    2
1/2
2n Dm  2n
  
Ɛ sup G2 t ≤  Ɛ G2 ϕj  
t∈Bm µ 2=1 j=1 2=1
(8.5)
 1/2
Dm  2n
 
= ƐG22 ϕj 
j=1 2=1

the random variables G2 ϕj 2=1 2n being independent and centered for each
j = 1  Dm . Now, for each 2 = 1  2n , we know that the laws of the
∗ ∗ 
vectors εi  Xi i∈I1 and εi  Xi i∈I1 are the same, therefore under Condition
2 2
HXε

 2 
     i  

 i  = qn1 σ22 
(8.6) Ɛ G22 ϕj  = Ɛ  εi ϕj X ≤ Ɛεi2 Ɛϕ2j X
i∈I21 1
i∈I2

which together with (8.5) proves (8.3).


Let us now bound Vp and V2 respectively.
The connection between  ∞ and  µ over Sm + Sm allows to write that
for all t ∈ Bm  µ,

−1/2
(8.7) t∞ ≤ !0 h0 Dm  × 1

Thus,
 
2n  p

  ∗
Vp = Ɛ  sup   ∗i  
εi tX
 
t∈Bm µ 2=1 1
i∈I2

 
2n
1 p−1  
≤ I2 Ɛ  sup εi∗ p  ∗i  p 
tX
t∈Bm µ 2=1 1
i∈I2

 
p−2 2n
p−1 −1/2  
≤ qn1 ! 0 h0 Dm  Ɛ  sup  ∗i 
εi∗ p t2 X
t∈Bm µ 2=1 1
i∈I2

Using (8.4) and Cauchy–Schwarz’s inequality we get


 
p−2 2n Dm 
p−1 −1/2   
Vp ≤ qn1 ! 0 h0 Dm  Ɛ  ∗i 
εi∗ p ϕ2j X
2=1 j=1 1
i∈I2

−1/2 p−2
≤ qp−1
n !0 h0  nDm p/2 σpp 

recalling that 2n qn1 ≤ 2n qn ≤ n. Since for p ≥ 2, p2 /4p − 1 ≥ 1 one also


has

−1/2 p 2
(8.8) Vp ≤ qp
n !0 h0  σpp Dm p/2 np /4p−1

We now bound V2 . A symmetrization argument [see Giné and Zinn (1984)]


gives
 
2n
 2
V2 = Ɛ sup G2 t
t∈Bm µ 2=1

 2n 
2n  
  
(8.9) ≤ sup Ɛ G22 t + 4Ɛ sup  ξ2 G2 t
2
t∈Bm µ 2=1 t∈Bm µ 2=1
  2n 
 
2
≤ nσ2 + 4Ɛ 
sup  ξ2 G2 t 
2
t∈Bm µ 2=1

where the random variables ξ2 ’s are i.i.d. centered random variables indepen-
 ∗ ’s and the ε∗ ’s, satisfying ξ1 = ±1 = 1/2. It remains to bound
dent of the X i i
the last term in the right-hand side of (8.9). To do so, we use a truncation
argument. We set M2 = maxi∈I1 εi∗ . For any c > 0, we have
2
  2n    2n 
   
Ɛ sup  ξ2 G2 t ≤Ɛ
2
sup  ξ2 G2 t|M2 ≤c 
2
t∈Bm µ 2=1 t∈Bm µ 2=1
(8.10)   2n 
 
+Ɛ sup  ξ2 G2 t|M2 >c 
2
t∈Bm µ 2=1

We apply a comparison theorem [Theorem 4.12, page 112 in Ledoux and Talagrand (1991)] to bound the first term of the right-hand side of (8.10): we know that, for each $t\in B_{m'}(\mu)$, the random variables $G_\ell(t)\mathbb{1}_{M_\ell\le c}$ are bounded by $B=q_{n,1}\Phi_0h_0^{-1/2}\sqrt{D_{m'}}\,c$ [using (8.7)] and are independent of the $\xi_\ell$'s. The function $x\mapsto x^2$ defined on the set $[-B,B]$ being Lipschitz with Lipschitz constant smaller than $2B$, we obtain ($\mathbb{E}_\xi$ denotes the conditional expectation with respect to the $\varepsilon_i^*$'s and the $X_i^*$'s)
$$\mathbb{E}_\xi\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell^2(t)\,\mathbb{1}_{M_\ell\le c}\bigg|\bigg]\le4B\,\mathbb{E}_\xi\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell(t)\,\mathbb{1}_{M_\ell\le c}\bigg|\bigg]\le4B\,\mathbb{E}_\xi\bigg[\bigg(\sum_{j=1}^{D_{m'}}\bigg(\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell(\varphi_j)\,\mathbb{1}_{M_\ell\le c}\bigg)^2\bigg)^{1/2}\bigg]\le4B\bigg(\sum_{j=1}^{D_{m'}}\sum_{\ell=1}^{\ell_n}G_\ell^2(\varphi_j)\bigg)^{1/2}.$$
We now decondition with respect to the random variables $\varepsilon_i^*$'s and $X_i^*$'s and, using (8.6), we get
$$\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell^2(t)\,\mathbb{1}_{M_\ell\le c}\bigg|\bigg]\le4q_{n,1}\,\Phi_0h_0^{-1/2}\,D_{m'}\,\sigma_2\,\sqrt{n}\,c\le4q_{n,1}^2\,\Phi_0^2h_0^{-1}\,D_{m'}\,\sigma_p\,\sqrt{n}\,c,\tag{8.11}$$
noticing that $q_{n,1}$ and $\Phi_0h_0^{-1/2}$ are both greater than 1.
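The conditional step above rests on the elementary bound $\mathbb{E}_\xi\big|\sum_\ell\xi_\ell v_\ell\big|\le\big(\sum_\ell v_\ell^2\big)^{1/2}$ for fixed reals $v_\ell$ and i.i.d. Rademacher signs $\xi_\ell$ (Jensen applied conditionally on the $\varepsilon_i^*$'s and $X_i^*$'s). The following Monte Carlo sketch, with arbitrary made-up numbers, simply illustrates that inequality; it is not taken from the paper.

```python
import numpy as np

# Illustrative Monte Carlo check of E_xi | sum_l xi_l v_l | <= ( sum_l v_l^2 )^{1/2}
# for fixed v_1,...,v_L and i.i.d. Rademacher signs xi_l.
rng = np.random.default_rng(0)
v = rng.normal(size=20)                               # plays the role of the fixed G_l(phi_j)'s
xi = rng.choice([-1.0, 1.0], size=(100_000, v.size))  # Rademacher signs
lhs = np.abs(xi @ v).mean()                           # Monte Carlo E_xi | sum_l xi_l v_l |
rhs = np.sqrt((v ** 2).sum())
print(lhs, rhs, lhs <= rhs)
```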
Now, we bound the second term of the right-hand side of (8.10). We have
$$\begin{aligned}
\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell^2(t)\,\mathbb{1}_{M_\ell>c}\bigg|\bigg]
&\le\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\sum_{\ell=1}^{\ell_n}G_\ell^2(t)\,\mathbb{1}_{M_\ell>c}\bigg]\le\mathbb{E}\bigg[\sum_{j=1}^{D_{m'}}\sum_{\ell=1}^{\ell_n}G_\ell^2(\varphi_j)\,\mathbb{1}_{M_\ell>c}\bigg]\\
&\le q_{n,1}\,\mathbb{E}\bigg[\sum_{j=1}^{D_{m'}}\sum_{\ell=1}^{\ell_n}M_\ell^2\,\mathbb{1}_{M_\ell>c}\sum_{i\in I_\ell^1}\varphi_j^2(X_i^*)\bigg]\le q_{n,1}\,c^{2-p}\sum_{\ell=1}^{\ell_n}\mathbb{E}\bigg[M_\ell^p\sum_{i\in I_\ell^1}\sum_{j=1}^{D_{m'}}\varphi_j^2(X_i^*)\bigg]\\
&\le q_{n,1}^2\,c^{2-p}\,\Phi_0^2h_0^{-1}\,D_{m'}\sum_{\ell=1}^{\ell_n}\mathbb{E}\big[M_\ell^p\big],
\end{aligned}$$
using (2.4). Lastly, since $M_\ell^p\le\sum_{i\in I_\ell^1}|\varepsilon_i^*|^p$, we get
$$\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell^2(t)\,\mathbb{1}_{M_\ell>c}\bigg|\bigg]\le q_{n,1}^2\,\Phi_0^2h_0^{-1}\,n\,D_{m'}\,\sigma_p^p\,c^{2-p}.\tag{8.12}$$
t∈Bm µ 2=1

By gathering (8.11) and (8.12) we obtain that, for all $c>0$,
$$\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell^2(t)\bigg|\bigg]\le4q_{n,1}^2\,\Phi_0^2h_0^{-1}\,\sigma_p\,D_{m'}\,\big[\sqrt{n}\,c+n\,\sigma_p^{p-1}\,c^{2-p}\big].$$
We choose $c=\sigma_p\,n^{1/(2(p-1))}$, and thus from (8.9) we get
$$V_2=\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\sum_{\ell=1}^{\ell_n}G_\ell^2(t)\bigg]\le n\sigma_2^2+8q_n^2\,\Phi_0^2h_0^{-1}\,\sigma_p^2\,D_{m'}\,n^{p/(2(p-1))},\tag{8.13}$$
which straightforwardly proves Step 1 by combining (8.2), (8.8) and (8.13). ✷
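The truncation level chosen above equates, up to constants, the two terms of the bracket; the short sketch below (with arbitrary illustrative values of $n$, $p$ and $\sigma_p$, not taken from the paper) verifies that with $c=\sigma_p\,n^{1/(2(p-1))}$ both $\sqrt n\,c$ and $n\,\sigma_p^{p-1}c^{2-p}$ equal $\sigma_p\,n^{p/(2(p-1))}$, which is how (8.13) is obtained.

```python
import numpy as np

# Sanity check of the truncation level: with c = sigma_p * n**(1/(2(p-1))),
# sqrt(n)*c and n*sigma_p**(p-1)*c**(2-p) coincide, both equal to sigma_p * n**(p/(2(p-1))).
for n, p, sigma_p in [(1000, 3.0, 2.0), (10_000, 4.0, 0.5)]:
    c = sigma_p * n ** (1 / (2 * (p - 1)))
    t1 = np.sqrt(n) * c
    t2 = n * sigma_p ** (p - 1) * c ** (2 - p)
    target = sigma_p * n ** (p / (2 * (p - 1)))
    print(np.allclose([t1, t2], target))
```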

Step 2. For all $x>1$, $m'\in\mathcal{M}_n$ and $2\bar p<p$,
$$n^{-\bar p}\,\mathbb{E}\Bigg[\Bigg(\sup_{t\in B_{m'}(\mu)}\bigg(\sum_{i=1}^n\varepsilon_i^*\,t(X_i^*)\bigg)^2-x\bigg(\sqrt{\frac{q_{n,1}}{q_n}}+\sqrt{1-\frac{q_{n,1}}{q_n}}\bigg)^2n\,D_{m'}\,\sigma_2^2\Bigg)_+^{\bar p}\Bigg]\le C(p,x)\,\big(\Phi_0h_0^{-1}\big)^p\,\sigma_p^{2\bar p}\,\Big[D_{m'}^{-(p/2-\bar p)}+q_n^p\,n^{\bar p-p(p-2)/(4(p-1))}\Big].$$

Proof. We set $Z_n(m')=\sup_{t\in B_{m'}(\mu)}\big|\sum_{i=1}^n\varepsilon_i^*\,t(X_i^*)\big|\ge0$ and
$$E^*=\bigg(\sqrt{\frac{q_{n,1}}{q_n}}+\sqrt{1-\frac{q_{n,1}}{q_n}}\bigg)\sqrt{n\,D_{m'}}\,\sigma_2\ge\sqrt{n\,D_{m'}}\,\sigma_2.$$
Since $x>1$, there exists $\eta>0$ such that $x=(1+\eta)^3$ (i.e., $\eta=x^{1/3}-1$). Thus, for all $\tau>0$,
$$\begin{aligned}
\mathbb{P}\Big(Z_n^2(m')\ge(1+\eta)^3E^{*2}+\tau\Big)
&\le\mathbb{P}\bigg(Z_n^2(m')\ge\Big[(1+\eta)E^*+\sqrt{\tau/(1+\eta^{-1})}\Big]^2\bigg)\\
&\le\mathbb{P}\Big(Z_n(m')-E^*\ge\eta E^*+\sqrt{\tau/(1+\eta^{-1})}\Big)\\
&\le\mathbb{P}\bigg(Z_n(m')-E^*\ge\sqrt{\eta^2E^{*2}+\tau/(1+\eta^{-1})}\bigg)\\
&\le\Big[\eta^2E^{*2}+\tau/(1+\eta^{-1})\Big]^{-p/2}\,\mathbb{E}\Big[\big(Z_n(m')-E^*\big)_+^p\Big]\\
&\le\bigg(\frac{x^{1/3}}{x^{1/3}-1}\bigg)^{p/2}\,\frac{\mathbb{E}\big[(Z_n(m')-E^*)_+^p\big]}{\big[(x^{1/3}-1)\,x^{1/3}\,n\,D_{m'}\,\sigma_2^2+\tau\big]^{p/2}},
\end{aligned}$$
using Markov's inequality. Now, for each $\bar p$ such that $2\bar p<p$, the integration with respect to the variable $\tau$ leads to
$$\begin{aligned}
\mathbb{E}\Big[\big(Z_n^2(m')-x\,E^{*2}\big)_+^{\bar p}\Big]
&=\int_0^{+\infty}\bar p\,\tau^{\bar p-1}\,\mathbb{P}\Big(Z_n^2(m')-x\,E^{*2}\ge\tau\Big)\,d\tau\\
&\le\bigg(\frac{x^{1/3}}{x^{1/3}-1}\bigg)^{p/2}\,\mathbb{E}\Big[\big(Z_n(m')-E^*\big)_+^p\Big]\int_0^{+\infty}\frac{\bar p\,\tau^{\bar p-1}}{\big[(x^{1/3}-1)\,x^{1/3}\,n\,D_{m'}\,\sigma_2^2+\tau\big]^{p/2}}\,d\tau\\
&\le\frac{p}{p-2\bar p}\,\frac{\big[x^{1/3}(x^{1/3}-1)\big]^{\bar p}\,\mathbb{E}\big[(Z_n(m')-E^*)_+^p\big]}{\big[(x^{1/3}-1)\,n\,D_{m'}\,\sigma_2^2\big]^{p/2-\bar p}},
\end{aligned}$$

and using Step 1, we get
$$\begin{aligned}
\mathbb{E}\Big[\big(Z_n^2(m')-x\,E^{*2}\big)_+^{\bar p}\Big]
&\le C\,\frac{\big[x^{1/3}(x^{1/3}-1)\big]^{\bar p}}{(x^{1/3}-1)^{p/2-\bar p}}\,\big(\Phi_0h_0^{-1/2}\big)^p\,\sigma_2^{2\bar p-p}\,\sigma_p^p\,n^{\bar p}\,\Big[D_{m'}^{-(p/2-\bar p)}+q_n^p\,D_{m'}^{\bar p}\,n^{-p(p-2)/(4(p-1))}\Big]\\
&\le C\,\frac{\big[x^{1/3}(x^{1/3}-1)\big]^{\bar p}}{(x^{1/3}-1)^{p/2-\bar p}}\,\big(\Phi_0h_0^{-1/2}\big)^p\,\sigma_p^{2\bar p}\,n^{\bar p}\,\Big[D_{m'}^{-(p/2-\bar p)}+q_n^p\,n^{\bar p-p(p-2)/(4(p-1))}\Big],
\end{aligned}$$
since $D_{m'}=\dim(S_m+S_{m'})\le n$. The constant $C$ depends on $p$ and $\bar p$. ✷
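The only analytic ingredient of the integration step is the tail integral bound $\int_0^{+\infty}\bar p\,\tau^{\bar p-1}(a+\tau)^{-p/2}\,d\tau\le\frac{p}{p-2\bar p}\,a^{\bar p-p/2}$ for $2\bar p<p$. The following sketch checks it numerically for one arbitrary choice of $(p,\bar p,a)$; the values are illustrative and not from the paper.

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of the tail integral used in Step 2: for 2*pbar < p and a > 0,
#   int_0^inf pbar * tau**(pbar-1) / (a + tau)**(p/2) dtau <= p/(p - 2*pbar) * a**(pbar - p/2).
p, pbar, a = 6.0, 2.0, 3.5
integral, _ = quad(lambda tau: pbar * tau ** (pbar - 1) / (a + tau) ** (p / 2), 0, np.inf)
bound = p / (p - 2 * pbar) * a ** (pbar - p / 2)
print(integral, bound, integral <= bound)
```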

It is now easy to prove Proposition 6 by summing up over $m'$ in $\mathcal{M}_n$.

9. Proof of Proposition 7. Since $\mathbb{P}\big(B_\rho^{*c}\big)=\mathbb{P}\big(B_\rho^c\cap B^*\big)+\mathbb{P}\big(B^{*c}\big)$ and since it is clear from Claim 2 that
$$\mathbb{P}\big(B^{*c}\big)\le\ell_n\big(\beta_{q_n-q_{n,1}}+\beta_{q_{n,1}}\big)\le2n\,\beta_{q_{n,1}},\tag{9.1}$$
the result holds if we prove


$$\mathbb{P}\big(B_\rho^c\cap B^*\big)\le2n\exp\bigg(-A_0^2\,\frac{\sqrt{n}\,\ln n}{q_n}\bigg).\tag{9.2}$$
In fact, we prove a more general result, namely,

$$\mathbb{P}\big(B_\rho^c\cap B^*\big)\le2D_n^2\exp\bigg(-\frac{h_0^2\,(1-1/\rho)^2\,n}{16\,h_1\,q_n\,L_\phi}\bigg),\tag{9.3}$$
where $L_\phi$ is a quantity specific to the orthonormal basis $(\phi_\lambda)_{\lambda\in\Lambda_n}$, defined as follows.
Let $(\phi_\lambda)_{\lambda\in\Lambda_n}$ be an $\mathbb{L}^2(A,dx)$-orthonormal basis of $\mathcal{S}_n$ and, as in Baraud (2001), define the quantities
$$V=\bigg(\bigg[\int_A\phi_\lambda^2(x)\,\phi_{\lambda'}^2(x)\,dx\bigg]^{1/2}\bigg)_{(\lambda,\lambda')\in\Lambda_n\times\Lambda_n},\qquad B=\big(\|\phi_\lambda\phi_{\lambda'}\|_\infty\big)_{(\lambda,\lambda')\in\Lambda_n\times\Lambda_n},$$
and, for any symmetric matrix $A=(A_{\lambda,\lambda'})$,
$$\bar\rho(A)=\sup_{\{(a_\lambda):\,\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda,\lambda'}|a_\lambda|\,|a_{\lambda'}|\,A_{\lambda,\lambda'}.$$
We set
$$L_\phi=\max\big\{\bar\rho^2(V),\,\bar\rho(B)\big\}.\tag{9.4}$$
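To fix ideas, the sketch below computes $V$, $B$, $\bar\rho$ and $L_\phi$ of (9.4) for the simplest localized basis, the regular histogram basis on $[0,1]$; this example is ours (not taken from the paper) and uses the fact that, for a symmetric matrix with nonnegative entries, the supremum defining $\bar\rho$ is attained at a nonnegative vector and equals the largest eigenvalue.

```python
import numpy as np

# Illustrative computation of L_phi of (9.4) for the regular histogram basis on [0,1]:
# phi_lambda = sqrt(D) * 1_{[lambda/D, (lambda+1)/D)}, lambda = 0, ..., D-1.
def rho_bar(A):
    # for symmetric matrices with nonnegative entries the sup defining rho_bar is the top eigenvalue
    return np.linalg.eigvalsh(A).max()

D = 8
# The supports are disjoint, so all cross products vanish off the diagonal.
V = np.sqrt(D) * np.eye(D)   # V[l, l'] = ( int phi_l^2 phi_l'^2 dx )^{1/2}
B = D * np.eye(D)            # B[l, l'] = || phi_l phi_l' ||_inf
L_phi = max(rho_bar(V) ** 2, rho_bar(B))
print(L_phi)                 # equals D, in line with L_phi <= Phi_1^4 * D_n (one may take Phi_1 = 1 here)
```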
Then, to complete the proof of Proposition 7, it remains to check that
$$L_\phi\le K\,\frac{n}{\sqrt{n}\,\ln n}\tag{9.5}$$
for some constant $K$ independent of $n$ (we shall show the result for $K=\Phi_1^4$). Under (H$_n$), Lemma 2 in Section 10 ensures that
$$L_\phi\le\Phi_1^4\,D_n,$$

which together with (4.1) leads to (9.5).


Now we prove inequality (9.3). First note that, if $\rho>1$,
$$\sup_{t\in\mathcal{S}_n\setminus\{0\}}\frac{\|t\|_\mu^2}{\|t\|_n^2}\ge\rho\iff\sup_{t\in\mathcal{S}_n\setminus\{0\}}\frac{-\nu_n(t^2)}{\|t\|_\mu^2}\ge1-\frac1\rho,$$
where $\nu_n(u)=(1/n)\sum_{i=1}^n\big[u(X_i)-\mathbb{E}_\mu(u)\big]$ denotes the centered empirical process. Then, for $\rho>1$,
$$\mathbb{P}^*\bigg(\sup_{t\in\mathcal{S}_n\setminus\{0\}}\frac{\|t\|_\mu^2}{\|t\|_n^2}\ge\rho\bigg)\le\mathbb{P}^*\bigg(\sup_{t\in B_n^\mu(0,1)}\big|\nu_n(t^2)\big|\ge1-\frac1\rho\bigg),$$
where we denote by $\mathbb{P}^*(A)$ the probability $\mathbb{P}(A\cap B^*)$, and by $B_n^\mu(0,1)=\{t\in\mathcal{S}_n:\ \|t\|_\mu\le1\}$.
For $t\in B_n^\mu(0,1)$, $t=\sum_{\lambda\in\Lambda_n}a_\lambda\phi_\lambda$ with $\sum_{\lambda\in\Lambda_n}a_\lambda^2\le h_0^{-1}$, and we have
$$\sup_{t\in B_n^\mu(0,1)}\big|\nu_n(t^2)\big|\le\sup_{\{\sum_\lambda a_\lambda^2\le1\}}h_0^{-1}\bigg|\sum_{(\lambda,\lambda')\in\Lambda_n^2}a_\lambda a_{\lambda'}\,\nu_n(\phi_\lambda\phi_{\lambda'})\bigg|\le\sup_{\{\sum_\lambda a_\lambda^2\le1\}}h_0^{-1}\sum_{(\lambda,\lambda')\in\Lambda_n^2}|a_\lambda|\,|a_{\lambda'}|\,\big|\nu_n(\phi_\lambda\phi_{\lambda'})\big|.$$
Let $x=h_0^2(1-1/\rho)^2/(16h_1L_\phi)$. Then, on the set $\big\{\forall(\lambda,\lambda')\in\Lambda_n^2\ /\ |\nu_n(\phi_\lambda\phi_{\lambda'})|\le2V_{\lambda,\lambda'}\sqrt{2h_1x}+2B_{\lambda,\lambda'}x\big\}$, we have
$$\begin{aligned}
\sup_{t\in B_n^\mu(0,1)}\big|\nu_n(t^2)\big|&\le2h_0^{-1}\Big[\sqrt{2h_1x}\,\bar\rho(V)+x\,\bar\rho(B)\Big]\\
&\le(1-1/\rho)\bigg[\frac1{\sqrt2}\bigg(\frac{\bar\rho^2(V)}{L_\phi}\bigg)^{1/2}+\frac{h_0(1-1/\rho)\,\bar\rho(B)}{8h_1L_\phi}\bigg]\\
&\le(1-1/\rho)\bigg[\frac1{\sqrt2}+\frac18\bigg]\le1-1/\rho.
\end{aligned}$$
The proof of inequality (9.3) is then achieved by using the following claim.

Claim 6. Let $(\phi_\lambda)_{\lambda\in\Lambda_n}$ be an $\mathbb{L}^2(A,dx)$-orthonormal basis of $\mathcal{S}_n$. Then, for all $x\ge0$ and all integers $q_n$, $1\le q_n\le n$,
$$\mathbb{P}^*\Big(\exists(\lambda,\lambda')\in\Lambda_n^2\ /\ \big|\nu_n(\phi_\lambda\phi_{\lambda'})\big|>2V_{\lambda,\lambda'}\sqrt{2h_1x}+2B_{\lambda,\lambda'}x\Big)\le2D_n^2\exp\bigg(-\frac{nx}{q_n}\bigg).$$

This implies that
$$\mathbb{P}\big(B_\rho^c\cap B^*\big)\le2D_n^2\exp\bigg(-\frac{h_0^2\,(1-1/\rho)^2\,n}{16\,h_1\,q_n\,L_\phi}\bigg),$$
and thus inequality (9.3) holds true.



Proof of Claim 6. Let $\nu_n^*(\phi_\lambda\phi_{\lambda'})=\nu_{n,1}^*(\phi_\lambda\phi_{\lambda'})+\nu_{n,2}^*(\phi_\lambda\phi_{\lambda'})$ be defined by
$$\nu_{n,k}^*(\phi_\lambda\phi_{\lambda'})=\frac1{2\ell_n}\sum_{l=0}^{\ell_n-1}Z_{l,k}^*(\phi_\lambda\phi_{\lambda'}),\qquad k=1,2,$$
where, for $0\le l\le\ell_n-1$,
$$Z_{l,k}^*(\phi_\lambda\phi_{\lambda'})=\frac1{q_n}\sum_{i\in I_l^k}\Big[\phi_\lambda(X_i^*)\,\phi_{\lambda'}(X_i^*)-\mathbb{E}_\mu(\phi_\lambda\phi_{\lambda'})\Big],\qquad k=1,2.$$
We have
$$\begin{aligned}
\mathbb{P}^*\Big(\big|\nu_n(\phi_\lambda\phi_{\lambda'})\big|>2V_{\lambda,\lambda'}\sqrt{2h_1x}+2B_{\lambda,\lambda'}x\Big)
&\le\mathbb{P}\Big(\big|\nu_{n,1}^*(\phi_\lambda\phi_{\lambda'})\big|>V_{\lambda,\lambda'}\sqrt{2h_1x}+B_{\lambda,\lambda'}x\Big)\\
&\quad+\mathbb{P}\Big(\big|\nu_{n,2}^*(\phi_\lambda\phi_{\lambda'})\big|>V_{\lambda,\lambda'}\sqrt{2h_1x}+B_{\lambda,\lambda'}x\Big)\\
&=P_1+P_2.
\end{aligned}$$
Now, we bound $P_1$ and $P_2$ by using Bernstein's inequality [see Lemma 8, page 366 in Birgé and Massart (1998)] applied to the independent variables $Z_{l,k}^*$, which satisfy $\|Z_{l,k}^*\|_\infty\le B_{\lambda,\lambda'}$ and $\mathbb{E}^{1/2}\big[(Z_{l,k}^*)^2\big]\le\sqrt{h_1}\,V_{\lambda,\lambda'}$. Then we obtain $P_1+P_2\le2\exp(-2x\ell_n)$, which proves Claim 6. ✷
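The next sketch makes the block decomposition of the proof concrete: the centered empirical sums are regrouped into $2\ell_n$ consecutive blocks of length $q_n$, the block averages playing the role of the $Z^*_{l,k}$'s. The decomposition and normalization follow the display above; the data are simulated and purely illustrative.

```python
import numpy as np

# Minimal sketch of the odd/even block decomposition behind Claim 6:
# nu_n = nu_n1 + nu_n2, each an average of l_n block sums Z_{l,k} built from q_n observations.
def block_decomposition(values, q_n):
    """values[i] = u(X_i) - E_mu[u]; assumes n = 2 * l_n * q_n for simplicity."""
    n = values.size
    l_n = n // (2 * q_n)
    blocks = values[: 2 * l_n * q_n].reshape(2 * l_n, q_n)
    Z = blocks.mean(axis=1)                 # Z_{l,k} = (1/q_n) * block sum
    nu1 = Z[0::2].sum() / (2 * l_n)         # odd blocks (k = 1)
    nu2 = Z[1::2].sum() / (2 * l_n)         # even blocks (k = 2)
    return nu1, nu2

rng = np.random.default_rng(1)
vals = rng.normal(size=240)                 # centered "observations" u(X_i) - E[u]
nu1, nu2 = block_decomposition(vals, q_n=10)
print(np.isclose(nu1 + nu2, vals.mean()))   # the two halves reconstruct nu_n
```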

10. Constraints on the dimension of $\mathcal{S}_n$. Most elements of the following proof can be found in Baraud (2001), but we recall them for the paper to be self-contained.

Let $\mathcal{S}_n$ be the linear subspace defined at the beginning of Section 4. We recall that $\mathcal{S}_n$ is generated by an orthonormal basis $(\phi_\lambda)_{\lambda\in\Lambda_n}$ and that $D_n=|\Lambda_n|$. In the previous section, the conditions on $\mathcal{S}_n$ [given by (H$_n$)] and on $D_n$ [given by (4.1)] are used to prove (9.5). To obtain (9.5) we proceed in two steps: first, under some particular characteristics of the basis $(\phi_\lambda)_{\lambda\in\Lambda_n}$ [in the case of Theorem 1 these characteristics are given by (H$_n$)], we state an upper bound on $L_\phi$ depending on $\Phi_1$ (or $\Phi_0$) and $D_n$. Second, starting from this bound, we specify a constraint on $D_n$ for (9.5) to hold. In the next lemma we consider various cases of linear spaces $\mathcal{S}_n$ (including those considered in Theorem 1) and provide upper bounds on $L_\phi$ according to the characteristics of one of their orthonormal bases.

Lemma 2. Let $L_\phi$ be the quantity defined by (9.4).

1. If $\mathcal{S}_n$ satisfies (2.2), then $L_\phi\le\Phi_0^2D_n^2$.
2. Under (H$_n$), $L_\phi\le\Phi_1^4D_n$. Moreover, (2.2) holds true with $\Phi_0^2=\Phi_1^3$.

We obtain from 1 and 2 that the constraints on $D_n$ given by (4.6) and (4.1) lead to (9.5).

Proof of 1. On the one hand, by the Cauchy–Schwarz inequality we have that
$$\bar\rho^2(V)\le\sum_{(\lambda,\lambda')\in\Lambda_n^2}\int\phi_\lambda^2\,\phi_{\lambda'}^2\le\sum_{\lambda'\in\Lambda_n}\int\bigg(\sum_{\lambda\in\Lambda_n}\phi_\lambda^2\bigg)\phi_{\lambda'}^2\le\bigg\|\sum_{\lambda\in\Lambda_n}\phi_\lambda^2\bigg\|_\infty\sum_{\lambda'\in\Lambda_n}\int\phi_{\lambda'}^2\le\Phi_0^2\,D_n^2,$$
using (2.4). On the other hand, by (2.2) we know that $\|\phi_\lambda\|_\infty\le\Phi_0\sqrt{D_n}\times1$. Thus, using similar arguments, one gets
$$\bar\rho(B)\le\Phi_0^2\,D_n^2,$$
which leads to $L_\phi\le\Phi_0^2D_n^2$. ✷

Proof of 2. We now prove that (2.2) holds true in case 2. Note that, for all $x$,
$$\sum_{\lambda\in\Lambda_n}\phi_\lambda^2(x)\le\Phi_1\,\max_{\lambda\in\Lambda_n}\|\phi_\lambda\|_\infty^2\le\Phi_1^3\,D_n;$$
thus, (2.4) holds true with $\Phi_0^2=\Phi_1^3$.


Under (H$_n$), $J(\lambda)=\{\lambda'\in\Lambda_n\ /\ \phi_\lambda\phi_{\lambda'}\not\equiv0\}$ satisfies $|J(\lambda)|\le\Phi_1$ and
$$\forall\lambda\in\Lambda_n,\ \forall\lambda'\in J(\lambda),\qquad\int\phi_\lambda^2\,\phi_{\lambda'}^2\le\Phi_1^2\,D_n.$$
Therefore,
$$\bar\rho(V)=\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda}\sum_{\lambda'\in J(\lambda)}|a_\lambda|\,|a_{\lambda'}|\bigg(\int\phi_\lambda^2\,\phi_{\lambda'}^2\bigg)^{1/2}\le\big(\Phi_1^2\,D_n\big)^{1/2}\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda}\sum_{\lambda'\in J(\lambda)}|a_\lambda|\,|a_{\lambda'}|=\big(\Phi_1^2\,D_n\big)^{1/2}\,W_n.$$
Besides, $\forall\lambda\in\Lambda_n,\ \forall\lambda'\in J(\lambda)$, $\|\phi_\lambda\phi_{\lambda'}\|_\infty\le\Phi_1^2D_n$, and thus
$$\bar\rho(B)=\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda}\sum_{\lambda'\in J(\lambda)}|a_\lambda|\,|a_{\lambda'}|\,\|\phi_\lambda\phi_{\lambda'}\|_\infty\le\Phi_1^2\,D_n\,W_n.$$
Finally,
$$W_n^2\le\sup_{\{\sum_\lambda a_\lambda^2=1\}}\bigg(\sum_{\lambda\in\Lambda_n}|a_\lambda|\sum_{\lambda'\in J(\lambda)}|a_{\lambda'}|\bigg)^2\le\Phi_1\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda\in\Lambda_n}\sum_{\lambda'\in J(\lambda)}a_{\lambda'}^2=\Phi_1\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda'\in\Lambda_n}\sum_{\lambda\in J(\lambda')}a_{\lambda'}^2=\Phi_1\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda'\in\Lambda_n}|J(\lambda')|\,a_{\lambda'}^2\le\Phi_1^2.$$

In other words, $\bar\rho(V)\le\Phi_1^2\sqrt{D_n}$ and $\bar\rho(B)\le\Phi_1^3D_n$, which gives the bound $L_\phi\le\Phi_1^4D_n$ since $\Phi_1\ge1$. ✷
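The step $W_n\le\Phi_1$ is a spectral fact: $W_n$ is the largest eigenvalue of the $0/1$ overlap matrix of the basis, which is bounded by its maximal row sum, that is, by the maximal number of overlapping basis functions. The sketch below checks this on a hypothetical banded overlap pattern; the pattern and sizes are made up for illustration only.

```python
import numpy as np

# Illustrative check of W_n <= Phi_1: W_n = largest eigenvalue of the 0/1 overlap
# matrix of the basis, bounded by its maximal row sum (the maximal cardinality of J(lambda)).
D, width = 30, 3                            # hypothetical: each phi_lambda overlaps its 'width' neighbours
J = np.zeros((D, D))
for i in range(D):
    for j in range(max(0, i - width), min(D, i + width + 1)):
        J[i, j] = 1.0                       # lambda' in J(lambda); symmetric by construction
W_n = np.linalg.eigvalsh(J).max()
Phi_1 = J.sum(axis=1).max()                 # maximal number of overlapping basis functions
print(W_n, Phi_1, W_n <= Phi_1 + 1e-10)
```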

Acknowledgments. The authors are deeply grateful to Lucien Birgé for a number of constructive suggestions and thank Pascal Massart for helpful comments.

REFERENCES

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory (B. N. Petrov and F. Csáki, eds.) 267–281. Akadémiai Kiadó, Budapest.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control 19 716–723.
Baraud, Y. (1998). Sélection de modèles et estimation adaptative dans différents cadres de
régression. Ph.D. thesis, Univ. Paris-Sud.
Baraud, Y. (2000). Model selection for regression on a fixed design. Probab. Theory Related Fields
117 467–493.
Baraud, Y. (2001). Model selection for regression on a random design. Preprint 01-10, DMA, Ecole
Normale Supérieure, Paris.
Barron, A. R. (1991). Complexity regularization with application to artificial neural networks.
In Proceedings of the NATO Advanced Study Institute on Nonparametric Functional
Estimation (G. Roussas, ed.) 561–576. Kluwer, Dordrecht.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39 930–945.
Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413.
Barron, A. R. and Cover, T. M. (1991). Minimum complexity density estimation. IEEE Trans.
Inform. Theory 37 1034–1054.
Berbee, H. C. P. (1979). Random walks with stationary increments and renewal theory. Math.
Centre Tract 112. Math. Centrum, Amsterdam.
Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics (D. Pollard, E. Torgersen and G. Yang, eds.) 55–87. Springer, New York.
Birgé, L. and Massart, P. (1998). Exponential bounds for minimum contrast estimators on sieves.
Bernoulli 4 329–375.
Cohen, A., Daubechies, I. and Vial, P. (1993). Wavelets on the interval and fast wavelet transforms. Appl. Comput. Harmon. Anal. 1 54–81.
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia.
DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation. Springer, New York.
Donoho, D. L. and Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. Ann.
Statist. 26 879–921.
Doukhan, P. (1994). Mixing: Properties and Examples. Springer, New York.
Doukhan, P., Massart, P. and Rio, E. (1995). Invariance principle for absolutely regular empirical
processes. Ann. Inst. H. Poincaré Probab. Statist. 31 393–427.
Duflo, M. (1997). Random Iterative Models. Springer, New York.
Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab. 12 929–
989.
Hoffmann, M. (1999). On nonparametric estimation in nonlinear AR(1)-models. Statist. Probab.
Lett. 44 29–45.
Kolmogorov, A. N. and Rozanov, Yu. A. (1960). On the strong mixing conditions for stationary Gaussian sequences. Theory Probab. Appl. 5 204–207.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces. Springer, New York.

Li, K. C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist. 15 958–975.
Mallows, C. L. (1973). Some comments on Cp . Technometrics 15 661–675.
Modha, D. S. and Masry, E. (1996). Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. Inform. Theory 42 2133–2145.
Modha, D. S. and Masry, E. (1998). Memory-universal prediction of stationary random processes. IEEE Trans. Inform. Theory 44 117–133.
Neumann, M. and Kreiss, J.-P. (1998). Regression-type inference in nonparametric autoregres-
sion. Ann. Statist. 26 1570–1613.
Pham, D. T. and Tran, L. T. (1985). Some mixing properties of time series models. Stochastic
Process. Appl. 19 297–303.
Polyak, B. T. and Tsybakov, A. (1992). A family of asymptotically optimal methods for choosing
the order of a projective regression estimate. Theory Probab. Appl. 37 471–481.
Rissanen, J. (1984). Universal coding, information, prediction and estimation. IEEE Trans. In-
form. Theory 30 629–636.
Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63 117–126.
Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68 45–54.
Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math. 126 505–
563.
Viennet, G. (1997). Inequalities for absolutely regular processes: application to density estima-
tion. Probab. Theory Related Fields 107 467–492.

Y. Baraud F. Comte
Ecole Normale Supérieure Laboratoire de Probabilités
DMA et Modèles Aléatoires
45 rue d’Ulm Boite 188
75230 Paris Cedex 05 Université Paris 6
France 4, place Jussieu
E-mail: [email protected] 75252 Paris Cedex 05
France

G. Viennet
Laboratoire de Probabilités
et Modèles Aléatoires
Boite 7012
Université Paris 7
2, place Jussieu
75251 Paris Cedex 05
France
