
The Annals of Statistics
2001, Vol. 29, No. 3, 839–875

ADAPTIVE ESTIMATION IN AUTOREGRESSION OR β-MIXING REGRESSION VIA MODEL SELECTION

By Y. Baraud, F. Comte and G. Viennet

Ecole Normale Supérieure, Université Paris VI and Université Paris VII
We study the problem of estimating some unknown regression func-
tion in a β-mixing dependent framework. To this end, we consider some
collection of models which are finite dimensional spaces. A penalized least-
squares estimator (PLSE) is built on a data driven selected model among
this collection. We state non asymptotic risk bounds for this PLSE and
give several examples where the procedure can be applied (autoregression,
regression with arithmetically β-mixing design points, regression with mix-
ing errors, estimation in additive frameworks, estimation of the order of the
autoregression). In addition we show that under a weak moment condition
on the errors, our estimator is adaptive in the minimax sense simultane-
ously over some family of Besov balls.

1. Introduction. We consider the problem of estimating the unknown function $f$, from $\mathbb{R}^k$ into $\mathbb{R}$, based on the observation of $n$ (possibly) dependent data $(Y_i, \vec X_i)$, $1 \le i \le n$, arising from the model
(1.1)  $Y_i = f(\vec X_i) + \varepsilon_i.$
We assume that $(\vec X_i)_{1\le i\le n}$ is a stationary sequence of random vectors in $\mathbb{R}^k$ and we denote by $\mu$ the common law of the $\vec X_i$'s. The $\varepsilon_i$'s are unobservable identically distributed centered random variables admitting a finite variance denoted by $\sigma_2^2$. Throughout the paper we assume that $\sigma_2^2$ is a known quantity (or that a bound on it is known). In this introduction, we assume that the $\varepsilon_i$'s are independent random variables. As an example of model (1.1), consider the case of a random design set $\vec X_i = X_i$ with values in $[0,1]$ with a regression function $f$ assumed to satisfy some Hölderian regularity condition
(1.2)  $\sup_{0\le x<y\le 1} \frac{|f(x)-f(y)|}{|y-x|^{\alpha}} = |f|_{\alpha} < +\infty$
for some $\alpha \in (0,1]$. Another possible illustration is a linear autoregressive model
(1.3)  $X_i = \sum_{j=1}^{k'} \beta_j X_{i-j} + \varepsilon_i,$
where $k'$ is an integer smaller than $k$. This means that $Y_i = X_i$, $\vec X_i = (X_{i-1},\ldots,X_{i-k})'$ and $f(u_1,\ldots,u_k) = \sum_{j=1}^{k'} \beta_j u_j$. Such models have been

Received May 1998; revised February 2001.


AMS 2000 subject classifications. Primary 62G08; secondary 62J02.
Key words and phrases. Nonparametric regression, least-squares estimator, model selection,
adaptive estimation, autoregression order, additive framework, time series, mixing processes.

extensively studied in the past under the conditions that $\alpha$ or $k'$ are known. There have been some generalizations to the cases of unknown $\alpha$ and $k'$, but then the results are typically given in an asymptotic form (as $n \to +\infty$).
In this paper, the aim is to introduce an estimation procedure for model
(1.1) which, when applied to some Hölderian function $f$ satisfying (1.2) with unknown values of $\alpha$ and $|f|_\alpha$, will perform almost as well as a procedure based on the knowledge of those two parameters. This is what is usually called adaptation. In the same way, our procedure will result in estimation of model (1.3) with an unknown value of $k'$ ($k' \le k$, $k$ known) which is almost as good as if $k'$ were known. Moreover, the results will be given in the form of non
asymptotic bounds for the risk of our estimators. Many other examples can be
treated by the same method. One could, for instance, replace the regularity
conditions (1.2) by more sophisticated ones and model (1.3) by a nonlinear
analogue.
In order to explain the main idea underlying the approach, let us turn back
to the two previous examples. Model (1.3) says that $f$ belongs to some specific $k'$-dimensional linear space $S_{k'}$ of functions from $\mathbb{R}^k$ to $\mathbb{R}$. When $k'$ is known, a classical estimator of $f$ is the least squares estimator over $S_{k'}$. Dealing with an unknown $k'$ therefore amounts to choosing a "good" value $\hat k'$ of $k'$ from the data. By "good," we mean here that the estimation procedure based on $\hat k'$ should perform almost as well as the procedure based on the true value of $k'$.
The treatment of model (1.1) when f satisfies a condition of type (1.2) is
actually quite similar. Let us expand $f$ in some suitable orthonormal basis $(\phi_j)_{j\ge 1}$ of $\mathbb{L}^2([0,1], dx)$ (the Haar basis for instance). Then (1.1) can be written as $Y_i = \sum_{j=1}^{\infty} \beta_j \phi_j(X_i) + \varepsilon_i$ and a classical procedure for estimating $f$ is as follows: define $S_J$ to be the $J$-dimensional linear space generated by $\phi_1,\ldots,\phi_J$ and $\hat f_J$ to be the least squares estimator on $S_J$, that is the least squares estimator for model (1.1) when $f$ is supposed to belong to $S_J$. The problem is to determine from the data set some $\hat J$ in such a way that the least squares estimator $\hat f_{\hat J}$ performs almost as well as the best least-squares estimator of $f$, that is, the one which achieves the minimum of the risk.
In order to give a further explanation of the procedure, we need to be precise
as to the "risk" we are dealing with. Throughout the paper we consider least-squares estimators of $f$, obtained by minimizing over a finite dimensional linear subspace $S \subset \mathbb{L}^2(\mathbb{R}^k, dx)$ the (least squares) contrast function $\gamma_n$ defined by
(1.4)  $\forall t \in \mathbb{L}^2(\mathbb{R}^k, dx), \quad \gamma_n(t) = \frac{1}{n}\sum_{i=1}^{n}\big[Y_i - t(\vec X_i)\big]^2.$
A minimizer of $\gamma_n$ in $S$, $\hat f_S$, always exists but might not be unique. Indeed, in common situations the minimization of $\gamma_n$ over $S$ leads to an affine space of possible solutions and then it becomes impossible to consider the $\mathbb{L}^2(\mathbb{R}^k, dx)$-quadratic risk of "the least-squares estimator" of $f$ in $S$. In contrast, the (random) $\mathbb{R}^n$-vector $(\hat f_S(\vec X_1),\ldots,\hat f_S(\vec X_n))'$ is always uniquely defined; this is the

reason we consider the risk of $\hat f_S$ based on the design points, that is,
$\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^{n}\big(f(\vec X_i) - \hat f_S(\vec X_i)\big)^2\Big] = \mathbb{E}\big[\|f - \hat f_S\|_n^2\big].$

In addition, under suitable assumptions on the design set and the $\varepsilon_i$'s, the risk of $\hat f_S$ can be decomposed in a classical way into a bias and a variance term. More precisely, we have
(1.5)  $\mathbb{E}\big[\|f - \hat f_S\|_n^2\big] \le d_\mu^2(f, S) + \sigma_2^2 \frac{\dim(S)}{n},$
where for $t, s \in \mathbb{L}^2(\mathbb{R}^k, \mu)$, $d_\mu^2(s,t)$ denotes $\mathbb{E}\big[(t(\vec X_1) - s(\vec X_1))^2\big]$ and $d_\mu^2(f, S) = \inf_{s \in S} d_\mu^2(f, s)$. Inequality (1.5) is usually sharp; note that equality occurs when the $\vec X_i$'s and the $\varepsilon_i$'s are independent for instance.
Coming back to model (1.1) we see that the quadratic risk $\mathbb{E}\big[\|f - \hat f_J\|_n^2\big]$ is of order
(1.6)  $d_\mu^2(f, S_J) + \sigma_2^2 \frac{J}{n},$
for $S_J$ generated by the Haar basis $(\phi_j)_{1\le j\le J}$ as above. Then (1.2) standardly implies that $d_\mu(f, S_J) \le C |f|_\alpha J^{-\alpha}$ whatever $\mu$. When $\alpha$ and $|f|_\alpha$ are known, it is possible to determine the value of $J$ that minimizes (1.6). If $\alpha$ and $|f|_\alpha$ are unknown, the problem of adaptation, that is doing almost as well as if they
were known, clearly amounts to choosing an estimation procedure Ĵ based on
the data, such that the estimator based on Ĵ is almost as good as the estimator
based on the optimal value of J. The analogy with the study of model (1.3) then
becomes obvious and we have shown that the problem of adaptation to some
unknown smoothness for Hölderian regression functions amounts to what is
generally called a problem of model selection, that is finding a procedure solely
based on the data to choose one statistical model among a (possibly large)
family of such models, the aim being to choose automatically a model which
is close to optimal in the family for the problem at hand. Let us now describe
this procedure.
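Before describing the procedure, here is the optimization behind (1.6) worked out, as a brief illustrative aside (it uses only (1.6) and the approximation bound $d_\mu(f,S_J)\le C|f|_\alpha J^{-\alpha}$ implied by (1.2)): minimizing $R(J) = C^2|f|_\alpha^2 J^{-2\alpha} + \sigma_2^2 J/n$ over $J$ gives the optimal dimension $J_\star \asymp \big(C^2|f|_\alpha^2\, n/\sigma_2^2\big)^{1/(2\alpha+1)}$, at which the bias and variance terms are of the same order and $R(J_\star) \asymp \big(C^2|f|_\alpha^2\big)^{1/(2\alpha+1)}\big(\sigma_2^2/n\big)^{2\alpha/(2\alpha+1)}$, that is, the familiar nonparametric rate $n^{-2\alpha/(2\alpha+1)}$. This is the benchmark that an adaptive procedure must recover without knowing $\alpha$ or $|f|_\alpha$.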
We start with a finite collection of possible models $\{S_m,\ m \in \mathcal{M}_n\}$ for $f$, each $S_m$ being a finite-dimensional linear subspace of $\mathbb{L}^2(\mathbb{R}^k)$. The family of models usually depends on $n$ and the function $f$ may or may not belong to one of them. Let us denote by $\hat f_m$ the least squares estimator for model (1.1) based on the model class $S_m$. We look for a model selection procedure $\hat m$ with values in $\mathcal{M}_n$, based solely on the data and not on any prior assumption on $f$, such that the risk of the resulting procedure $\hat f_{\hat m}$ is almost as good as the risk of the best least squares estimator in the family. Therefore an ideal selection procedure $\hat m$ should look for an optimal trade-off between the bias term $d_\mu^2(f, S_m)$ and the variance term $\sigma_2^2 \dim(S_m)/n$. Our aim is to find $\hat m$ such that
(1.7)  $\mathbb{E}\big[\|f - \hat f_{\hat m}\|_n^2\big] \le C \min_{m \in \mathcal{M}_n}\Big\{d_\mu^2(f, S_m) + \sigma_2^2 \frac{\dim(S_m)}{n}\Big\},$

which means that, up to the constant C, our estimator chooses an optimal


model.
It is important to notice that an estimator which satisfies (1.7) has many
interesting properties provided that the family of models Sm has been suit-
ably chosen. In particular this estimator is adaptive in the minimax sense
with respect to many well-known classes of smoothness. The connections be-
tween adaptation and model selection and the nice properties of any estimator
f̂m̂ satisfying (1.7) have been developed at length in Barron, Birgé and Mas-
sart [(1999), Chapter 5] and many illustrations of potential applications of our
results can be found there and in Birgé and Massart (1997). We shall content
ourselves in the sequel with a limited number of applications and we refer the
interested reader to those papers.
Our model selection criterion is closely related to the classical $C_p$ criterion of Mallows (1973). For each model $m$ we compute the normalized residual sum of squares $\gamma_n(\hat f_m) = n^{-1}\sum_{i=1}^{n}\big[Y_i - \hat f_m(\vec X_i)\big]^2$ and we choose $\hat m$ in order to minimize among all models $m \in \mathcal{M}_n$ the penalized residual sum of squares $\gamma_n(\hat f_m) + \mathrm{pen}(m)$. Mallows' $C_p$ criterion corresponds to $\mathrm{pen}(m) = 2\sigma_2^2 \dim(S_m)/n$. In this paper, we want to see how one needs to modify Mallows' $C_p$ when the errors or the covariates are correlated.
There have been many studies concerning model selection based on Mal-
lows’ Cp or related penalized criteria like Akaike’s or the BIC criterion for
regressive and autoregressive models [see Akaike (1973, 1974), Shibata (1976,
1981), Li (1987), Polyak and Tsybakov (1992), among many others]. A com-
mon characteristic of these results is their asymptotic character. Extensions of
these penalized criteria for data-driven model selection procedures have been
done in Barron (1991, 1993), Barron and Cover (1991) and Rissanen (1984).
More recently, a general approach to model selection for various statistical
frameworks including density estimation and regression has been developed
in Birgé and Massart (1997) and Barron, Birgé and Massart (1999), with many
applications to adaptive estimation. An original characteristic of their view-
point is its non asymptotic feature. Unfortunately, their general approach im-
poses restrictions on the regression model (1.1) (e.g., the regression function needs to be bounded by some known quantity), which makes it unattractive for
practical issues. We relax such restrictions and also obtain non asymptotic re-
sults. Our approach is inspired by Baraud’s (2000) work. Although there have
been many results concerning adaptation for the classical regression model
with independent variables, not much is known to our knowledge concerning
general adaptation methods for regression involving dependent variables. It
is not within the scope of this paper to make an historical review for the case
of independent variables.
Concerning dependent variables, it is worth mentioning the work of Modha and Masry (1996), which deals with model (1.1) when the process $(\vec X_i, Y_i)_{i\in\mathbb{Z}}$ is
strongly mixing and when the function f satisfies some Fourier-transform-type
representation. In Modha and Masry (1998), the problem of one step ahead
prediction of real valued stationary exponentially strongly mixing processes

is considered. Minimum complexity regression estimators based on Legendre


polynomials are used to estimate both the model memory and the predictor
function. In the particular case of an autoregressive model their approach does
not lead to optimal rates of convergence. In the case of a one dimensional first
order autoregressive model, Neumann and Kreiss (1998) and Hoffmann (1999)
study the behavior of nonparametric adaptive estimators (local polynomials
and wavelet thresholding estimators) by approximating an AR(1) autoregres-
sion experiment by a regression experiment with independent variables.
Our estimation procedure is the same as that proposed by Baraud (2000) in
the case of a regression framework with deterministic design points and i.i.d.
errors. Thus, we show that the procedure is robust (to a certain extent) to pos-
sible dependency between the data $(\vec X_i, Y_i)$'s. More precisely, we assume that
the data are β-mixing [for a precise definition of β-mixing, see Kolmogorov and
Rozanov (1960)] and we show that under an adequate condition on the decay of
the β-mixing coefficients (for instance arithmetical or geometrical decay) the
estimation procedure is still relevant. Of course, this robustness with respect
to dependency is obvious when the sequences of X  i ’s and εi ’s are independent
and when the εi ’s are i.i.d. Indeed, the result can merely be obtained by argu-
ing as follows. We start from inequality (11) in Baraud [(2000), Corollary 3.1]
which gives the result conditionally on the variables X  i ’s. Then, by integrat-
ing with respect to those, one gets (1.7). We emphasize that the result holds
under mild assumptions on the statistical framework (an adequate moment
condition on the i.i.d. errors and stationarity of the distribution of the X  i ’s).
Consequently, we shall only consider either the case where the sequences of
X i ’s and εi ’s are dependent or the case where the εi ’s are dependent.
The case of β-mixing data is natural in the autoregression context, where,
in addition, the above condition on the β-mixing coefficients is usually met.
This makes the procedure of particular interest in this case.
Our techniques of proof are based on the work of Baraud (2000). Unfortu-
nately, the possible dependency of the X  i ’s prevents us from directly using
classical inequalities on product measures like Talagrand’s (1996) concentra-
tion inequalities. Taking advantage of the β-mixing assumptions, we instead
use coupling techniques derived from Berbee’s (1979) lemma and inspired by
Viennet’s (1997) work in order to approximate the original sequence X  i 1≤i≤n
by a new sequence built on independent blocks.
Lastly, we mention that the results presented in this paper can be extended
to the case where the variance σ22 of the errors is unknown, which is the
practical case, by estimating it by residual least-squares. For further details
we refer to Baraud’s (1998) Ph.D. thesis, where a previous version of this work
is available.
The paper is organized as follows: the estimation procedure and the main
assumptions are given in Section 2. We apply the procedure to various sta-
tistical frameworks in Section 3. In each of these frameworks, we state non
asymptotic risk bounds, the proofs of those results being delayed to Section 6.
Section 4 is devoted to the main result (treating the case of independent er-

rors), Section 5 to an extension to the case of dependent errors. The proof of


those results are given in Sections 7 to 10.

2. The estimation procedure and the assumptions. We observe pairs $(Y_i, \vec X_i)$, $i = 1,\ldots,n$, arising from model (1.1) and our aim is to estimate the unknown function $f$ from $\mathbb{R}^k$ into $\mathbb{R}$, on some (compact) subset $A \subset \mathbb{R}^k$. Our estimation procedure is the following one. We consider a finite family of linear subspaces $(S_m)_{m\in\mathcal{M}_n}$ of $\mathbb{L}^2(A, dx)$. We assume that the $S_m$'s are finite dimensional linear spaces consisting of $A$-compactly supported functions. Hereafter, $D_m$ denotes the dimension of $S_m$ and $f_m$ the $\mathbb{L}^2(\mathbb{R}^k,\mu)$-projection of $f$ onto $S_m$. We associate to each $S_m$ a least-squares estimator $\hat f_m$ of $f$ which minimizes among $t \in S_m$ the empirical least-squares contrast function $\gamma_n$ defined by (1.4). Note that such a minimizer might not be unique as an element of $S_m$ but the $\mathbb{R}^n$-vector $(\hat f_m(\vec X_1),\ldots,\hat f_m(\vec X_n))'$ is uniquely defined. We select our estimator $\tilde f$ among the family of least-squares estimators $(\hat f_m)_{m\in\mathcal{M}_n}$ in the following way: given a nonnegative penalty function $\mathrm{pen}(\cdot)$ on $\mathcal{M}_n$, we define $\hat m$ as the minimizer among $\mathcal{M}_n$ of the penalized criterion
$\gamma_n(\hat f_m) + \mathrm{pen}(m)$
and we set $\tilde f = \hat f_{\hat m} \in S_{\hat m}$. The choice of the penalty function is the main concern of this paper.
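To make the procedure concrete, here is a minimal numerical sketch of the selection rule (not taken from the paper): it assumes real-valued design points in $[0,1]$, uses piecewise-constant functions on dyadic partitions as the collection $(S_m)$, treats the error variance as known, and uses the penalty $\mathrm{pen}(m) = x^3\sigma_2^2 D_m/n$ of Section 3; all function names and constants below are illustrative only.

```python
import numpy as np

def pls_estimate(x, y, sigma2, max_level=None, x_pen=1.1):
    """Minimal sketch of the penalized least-squares selection rule.

    Models S_m, m = 0..max_level: piecewise-constant functions on the dyadic
    partition of [0,1] into 2^m intervals (a histogram-type collection).
    pen(m) = x_pen**3 * sigma2 * D_m / n with D_m = 2^m.
    """
    n = len(y)
    if max_level is None:
        # keep D_m well below n, in the spirit of the dimension constraint (4.1)
        max_level = int(np.log2(max(n // max(int(np.log(n)), 1), 2)))
    best = None
    for m in range(max_level + 1):
        D = 2 ** m
        cells = np.minimum((x * D).astype(int), D - 1)   # dyadic cell of each X_i
        fitted = np.zeros(n)
        for b in range(D):                               # LSE = cell-wise mean
            idx = cells == b
            if idx.any():
                fitted[idx] = y[idx].mean()
        crit = np.mean((y - fitted) ** 2) + x_pen ** 3 * sigma2 * D / n
        if best is None or crit < best[0]:
            best = (crit, m, fitted)
    return best[1], best[2]       # selected level m_hat and fitted values

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1000
    x = rng.uniform(size=n)
    f = lambda u: np.sin(2 * np.pi * u)
    y = f(x) + 0.3 * rng.standard_normal(n)
    m_hat, f_tilde = pls_estimate(x, y, sigma2=0.09)
    print("selected level:", m_hat,
          " empirical risk:", np.mean((f(x) - f_tilde) ** 2))
```

The same scheme applies verbatim to any family of linear spaces; only the computation of the least-squares fit on each $S_m$ changes.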
The main assumptions used in the paper are listed below. Assumptions (H$_\varepsilon$) and (H$_{X\varepsilon}$) will be weakened in Section 5:

(H$_X$) The sequence $(\vec X_i)_{i\ge 0}$ is identically distributed with common law $\mu$ admitting a density $h_X$ w.r.t. the Lebesgue measure which is bounded from below and above, that is, $0 < h_0 \le h_X(u) \le h_1$ for all $u \in A$.
(H$_\varepsilon$) The $\varepsilon_i$'s are i.i.d. centered random variables admitting a finite variance denoted by $\sigma_2^2$.
(H$_{XY}$) The sequence of the $(\vec X_i, Y_i)$'s is $\beta$-mixing.
(H$_{X\varepsilon}$) For all $i \in \{1,\ldots,n\}$, $\varepsilon_i$ is independent of the sequence $(\vec X_j)_{j\le i}$.
(H$_{\mathcal S}$) There exists a constant $\Phi_0$ such that for any pair $(m, m') \in \mathcal{M}_n^2$ and any $t \in S_m + S_{m'}$,
(2.1)  $\|t\|_\infty \le \Phi_0 \sqrt{\dim(S_m + S_{m'})}\, \|t\|.$

Comments. Assumption (H$_{XY}$) is equivalent to the $\beta$-mixing of the sequence of the $(\vec X_i, \varepsilon_i)$'s, which is the property used in the proof. As mentioned in the introduction, if the sequences $(\vec X_i)_{1\le i\le n}$ and $(\varepsilon_i)_{1\le i\le n}$ are independent and the $\varepsilon_i$'s are i.i.d., then the result can be obtained under milder conditions. In particular, except for stationarity, no other assumption on the distribution of the $\vec X_i$'s is required. Condition (H$_{\mathcal S}$) is most easily fulfilled when

the collection of models is nested, that is, is an increasing sequence (for in-
clusion) of linear spaces and when there exists some !0 such that for each
m ∈ n

(2.2) t∞ ≤ !0 dimSm t ∀t ∈ Sm

This connection between the sup-norm and the 2 A dx-norm is satisfied
for numbers of collections of models of interest. Birgé and Massart [(1998),
Lemma 6] have shown that for any 2 A dx-orthonormal basis φλ λ∈#m of
Sm :
1/2
 t∞
(2.3) φ2λ = sup
λ∈#m ∞ t∈Sm t=0 t

Hence (2.2) holds if and only if there exists an orthonormal basis φλ λ∈#m
of Sm such that
1/2

(2.4) φ2λ ≤ !0 dimSm 
λ∈#m ∞

and then the result is true for any orthonormal basis of Sm .

3. Examples. In this section we apply our estimation procedure to various statistical frameworks. In each framework, we give an example of a collection of models $\{S_m,\ m\in\mathcal{M}_n\}$ and, for some $x > 1$, choose the penalty term to be equal to
$\mathrm{pen}(m) = x^3 \frac{D_m}{n}\sigma_2^2 \quad \forall m \in \mathcal{M}_n,$
except in Section 3.3 where the penalty term is chosen in a different way. In each case, we give sufficient conditions for $\tilde f = \hat f_{\hat m}$ to achieve the best trade-off (up to a constant) between the bias and the variance term among the collection of estimators $\{\hat f_m,\ m\in\mathcal{M}_n\}$. Namely, we show that for any $\rho$ in $(1, x)$,
(3.1)  $\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^2\big] \le \Big(\frac{x+\rho}{x-\rho}\Big)^2 \inf_{m\in\mathcal{M}_n}\Big\{\|f_{|A} - f_m\|_\mu^2 + 2x^3\sigma_2^2\frac{D_m}{n}\Big\} + \frac{R}{n},$
for some constant $R = R(\rho)$ to be specified. With no loss of generality we shall assume that $A = [0,1]^k$. Those results, proved in Section 6, derive from our main theorems which are to be found in Section 4 and Section 5.

3.1. Autoregression framework. We deal with a particular feature of the regression framework (1.1), the autoregression framework of order 1 given by
(3.2)  $Y_i = X_i = f(X_{i-1}) + \varepsilon_i, \quad i = 1,\ldots,n.$
The process is initialized with some real valued random variable $X_0$.


We assume the following:

(H$_{AR1}$) The random variable $X_0$ is independent of the $\varepsilon_i$'s. The $\varepsilon_i$'s are i.i.d. centered random variables admitting a density, $h_\varepsilon$, with respect to the Lebesgue measure and satisfying $\sigma_2^2 = \mathbb{E}(\varepsilon_1^2) < \infty$. The density $h_\varepsilon$ is a positive bounded and continuous function and the function $f$ satisfies for some $0 \le a < 1$ and $b \in \mathbb{R}$,
(3.3)  $\forall u \in \mathbb{R}, \quad |f(u)| \le a|u| + b.$
The sequence of the random variables $X_i$'s is stationary of common law $\mu$.

The existence of a stationary law $\mu$ derives from the assumptions on the $\varepsilon_i$'s and $f$.
To estimate $f$ we use the collection of models given below.

Collection of piecewise polynomials. Let $r$ be some positive integer and $m_n$ the largest integer such that $r2^{m_n} \le n/\ln^3(n)$, that is, $m_n = \mathrm{int}\big(\ln(n/(r\ln^3 n))/\ln 2\big)$ ($\mathrm{int}(u)$ denotes the integer part of $u$). Let $\mathcal{M}_n$ be the set of integers $\{0,\ldots,m_n\}$; for each $m \in \mathcal{M}_n$ we define $S_m$ as the linear span of piecewise polynomials of degree less than $r$ based on the dyadic grid $\{j/2^m,\ j = 0,\ldots,2^m - 1\} \subset [0,1]$.
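As an illustration (not part of the paper), these spaces are easy to materialize as design matrices; the following sketch, whose function names are purely illustrative, builds a basis of $S_m$ from local monomials on the dyadic cells and computes the largest admissible level $m_n$:

```python
import numpy as np

def dyadic_piecewise_poly_design(x, m, r):
    """Design matrix of S_m: piecewise polynomials of degree < r on the
    dyadic grid {j/2^m}.  dim(S_m) = r * 2^m, which is what the constraint
    r * 2^{m_n} <= n / ln(n)^3 caps."""
    x = np.asarray(x, dtype=float)
    D = 2 ** m
    cells = np.minimum((x * D).astype(int), D - 1)
    cols = []
    for j in range(D):
        in_cell = (cells == j).astype(float)
        u = x * D - j                     # local coordinate on cell j
        for deg in range(r):              # local monomials 1, u, ..., u^{r-1}
            cols.append(in_cell * u ** deg)
    return np.column_stack(cols)          # shape (n, r * 2^m)

def max_level(n, r):
    """Largest m with r * 2^m <= n / ln(n)^3 (and at least 0)."""
    bound = n / np.log(n) ** 3
    return max(int(np.floor(np.log2(max(bound / r, 1.0)))), 0)
```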
The result on f̃ is the following.

Proposition 1. Consider the autoregression framework (3.2) and assume that (H$_{AR1}$) holds. If $\sigma_p^p = \mathbb{E}(|\varepsilon_i|^p) < \infty$ for some $p > 6$ then (3.1) holds for some constant $R$ that depends on $p, x, \rho, h_\varepsilon, \sigma_p^2, r, \|f_{|A} - \int f_{|A}\,dx\|_\infty$.

To obtain results in probability on $\|f_{|A} - \tilde f\|_n^2$, it is actually enough to assume $\mathbb{E}(|\varepsilon_i|^p) < \infty$ for some $p > 2$; we refer to (4.7) and the comment given there.

3.2. Regression framework. We give an illustration of Theorem 1 in the case of regression with arithmetically $\beta$-mixing design points. Of course the case of autoregression with arithmetically $\beta$-mixing $X_i$'s can be treated similarly. Let us consider the regression model
(3.4)  $Y_i = f(X_i) + \varepsilon_i, \quad i = 1,\ldots,n.$
In this section, we consider a sequence $(\varepsilon_i)_{i\in\mathbb{Z}}$ and we take the $X_i$'s to be generated by a standard time series model:
(3.5)  $X_i = \sum_{k=0}^{+\infty} a_k \varepsilon_{i-1-2k}.$
Then we make the following assumption:

(H$_{Reg}$) The $\varepsilon_i$'s are i.i.d. Gaussian random variables. The $a_j$'s are such that $a_0 = 1$, $\sum_{j=0}^{+\infty} a_j z^{2j} \neq 0$ for all $z$ with $|z| \le 1$ and, for all $j \ge 1$, $|a_j| \le Cj^{-d}$ for some constants $C > 0$ and $d > 17$.

The value 17 as bound for $d$ is certainly not sharp. The model (3.5) for the $X_i$'s together with the assumptions on the coefficients $a_j$ aims at ensuring that (H$_{XY}$) is fulfilled with arithmetically $\beta$-mixing variables. Of course, any other model implying the same property would suit.
We introduce the following collection of models.

Collection of wavelets. For any integer $J$, let $\Lambda(J) = \{(J,k):\ k = 1,\ldots,2^J\}$ and let
$\{\phi_{J_0,k},\ (J_0,k)\in\Lambda(J_0)\} \cup \{\varphi_{j,k},\ (j,k)\in\bigcup_{J=J_0}^{+\infty}\Lambda(J)\}$
be an $\mathbb{L}^2([0,1], dx)$-orthonormal system of compactly supported wavelets of regularity $r$ built by Cohen, Daubechies and Vial (1993). For some positive integer $J_n > J_0$, let $\mathcal{S}_n$ be the space spanned by the $\phi_{J_0,k}$'s for $(J_0,k)\in\Lambda(J_0)$ and by the $\varphi_{j,k}$'s for $(j,k)\in\bigcup_{J=J_0}^{J_n-1}\Lambda(J)$. The integer $J_n$ is chosen in such a way that $\dim(\mathcal{S}_n) = 2^{J_n}$ is of order $n^{4/5}/\ln(n)$. We set $\mathcal{M}_n = \{J_0,\ldots,J_n-1\}$ and for each $m\in\mathcal{M}_n$ we define $S_m$ as the linear span of the $\phi_{J_0,k}$'s for $(J_0,k)\in\Lambda(J_0)$ and the $\varphi_{j,k}$'s for $(j,k)\in\bigcup_{J=J_0}^{m}\Lambda(J)$.
For a precise description and use of these wavelet systems, see Donoho and Johnstone (1998). These functions derive from Daubechies' wavelets (1992) in the interior of $[0,1]$ and are boundary corrected at the "edges."

Proposition 2. Assume that $\|f_{|A}\|_\infty < \infty$ and that for all $m\in\mathcal{M}_n$ the constant functions belong to $S_m$. If (H$_{Reg}$) is satisfied, then (3.1) holds true for some constant $R$ depending on $x, \rho, h_0, h_1, \sigma_2^2, C, d, \|f_{|A} - \int f_{|A}\,dx\|_\infty$.

3.3. Regression with dependent errors. We consider the regression framework
(3.6)  $Y_i = f(\vec X_i) + \varepsilon_i, \quad i = 1,\ldots,n, \qquad \varepsilon_i = a\varepsilon_{i-1} + u_i, \quad i = 1,\ldots,n.$
We observe the pairs $(Y_i, \vec X_i)$ for $i = 1,\ldots,n$. We assume that:

(H$_{Rd}$) The real number $a$ satisfies $0 \le a < 1$, and the $u_i$'s are i.i.d. centered random variables admitting a common finite variance. The law of the $\varepsilon_i$'s is assumed to be stationary admitting a finite variance $\sigma_2^2$. The sequence of the $\vec X_i$'s is geometrically $\beta$-mixing [i.e., satisfying (6.1)] and the sequences of the $\vec X_i$'s and the $\varepsilon_i$'s are independent.

Geometrically $\beta$-mixing $\vec X_i$'s can be generated by an autoregressive model with a regression function $g$ and errors $\eta_i$ satisfying an assumption of the same kind as (H$_{AR1}$) in Section 3.1.

The main difference between this framework and the previous one lies in the dependency between the $\varepsilon_i$'s. To deal with it, we need to modify the penalty term:

Proposition 3. Assume that $\|f_{|A}\|_\infty < \infty$, that (H$_X$) and (H$_{Rd}$) hold and that $\mathbb{E}(|\varepsilon_1|^p) < \infty$ for some $p > 6$. Let $x > 1$. If the penalty term pen satisfies
(3.7)  $\forall m\in\mathcal{M}_n, \quad \mathrm{pen}(m) \ge x^3\Big(1 + \frac{2a}{1-a}\Big)\frac{D_m}{n}\sigma_2^2,$
then by using the collection of piecewise polynomials described in Section 3.1 and applying the estimation procedure given in Section 2 we have that the estimator $\tilde f$ satisfies for any $\rho\in(1,x)$,
(3.8)  $\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^2\big] \le \Big(\frac{x+\rho}{x-\rho}\Big)^2 \inf_{m\in\mathcal{M}_n}\Big\{\|f_{|A} - f_m\|_\mu^2 + 2\,\mathrm{pen}(m)\Big\} + \frac{R}{n},$
where $R$ depends on $a, p, \sigma_p, \|f_{|A} - \int f_{|A}\,dx\|_\infty, x, \rho, h_0, h_1, \Gamma, \theta$.

In contrast with the results of the previous examples, we cannot give a choice of a penalty term which would work for any value of $a$. An unknown lower bound for the choice of the penalty term seems to be the price to pay when the $\varepsilon_i$'s are no longer independent. This example shows how this lower bound varies with respect to the unknown number $a$, this number quantifying in some sense a discrepancy from independence (independence corresponds to $a = 0$). We also see that a choice of the penalty term of the form
$\mathrm{pen}(m) = \kappa\frac{D_m}{n}\sigma_2^2$
with $\kappa$ large is safer than a choice of $\kappa$ close to 1. This should be kept in mind every time the independence of the $\varepsilon_i$'s is debatable (we refer the reader to the comments following Theorem 2).

3.4. Additive models. We consider the additive regression models, widely used in Economics, described by
(3.9)  $Y_i = e_f + f_1(X_i^{(1)}) + f_2(X_i^{(2)}) + \cdots + f_k(X_i^{(k)}) + \varepsilon_i,$
where the $\varepsilon_i$'s are i.i.d. and $e_f$ denotes a constant. Model (3.9) follows from model (1.1) with $\vec X_i = (X_i^{(1)}, X_i^{(2)}, \ldots, X_i^{(k)})'$ and the additive function $f$: $f(x_1,\ldots,x_k) = e_f + f_1(x_1) + \cdots + f_k(x_k)$. For identifiability, we assume that $\int_0^1 f_i(x)\,dx = 0$, for $i = 1,\ldots,k$. Such a model assumes that the effects on $Y$ of the variables $X^{(j)}$ are additive. Our aim is to estimate $f$ on $A = [0,1]^k$. The estimation method allows one to build estimators of $f_1,\ldots,f_k$ in different spaces.
Let $\ell$ be some integer. We define $S_\ell^{(1)}$ as the linear space of piecewise polynomials $t$ of degree less than $r$, $r \ge 1$, based on the dyadic grid $\{j/2^\ell,\ j = 0,\ldots,2^\ell\} \subset [0,1]$ satisfying $\int_0^1 t(x)\,dx = 0$, and $S_\ell^{(2)}$ as the linear span
of the functions $\psi_{2j-1}(x) = \sqrt2\cos(2\pi jx)$ and $\psi_{2j}(x) = \sqrt2\sin(2\pi jx)$ for $j = 1,\ldots,2^\ell$. Now we set $m_1(n)$ [$m_2(n)$ respectively] the largest integer $\ell$ such that $\dim(S_\ell^{(1)})$ [$\dim(S_\ell^{(2)})$ respectively] is smaller than $\sqrt n/\ln^3(n)$. Finally, $\mathcal{M}_n^{(1)}$ and $\mathcal{M}_n^{(2)}$ denote respectively the sets of integers $\{0,\ldots,m_1(n)\}$ and $\{0,\ldots,m_2(n)\}$.
We propose to estimate the $f_i$'s either by piecewise or trigonometric polynomials. To do so, we introduce the choice function $g$ from $\{1,\ldots,k\}$ into $\{1,2\}$ and consider the following collections of models.

Mixed additive collection of models. We set $\mathcal{M}_n = \mathcal{M}_n(k) = \{m = (k, m_1,\ldots,m_k):\ m_j \in \mathcal{M}_n^{(g(j))}\}$ and for each $m = (k, m_1,\ldots,m_k) \in \mathcal{M}_n$ we define
$S_m = \Big\{ t(x_1,\ldots,x_k) = a + \sum_{i=1}^{k} t_i(x_i);\ (a, t_1,\ldots,t_k) \in \mathbb{R} \times \prod_{i=1}^{k} S_{m_i}^{(g(i))} \Big\}.$
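As a concrete illustration (not from the paper), an additive space of this kind is just a column-wise concatenation of one-dimensional bases; the sketch below, restricted for brevity to the trigonometric spaces [$g(i) = 2$ for every coordinate] and with illustrative names, builds the design matrix of such an $S_m$ and fits it by least squares:

```python
import numpy as np

def trig_block(x, level):
    """Columns for S_level^(2): sqrt(2)cos(2*pi*j*x), sqrt(2)sin(2*pi*j*x),
    j = 1..2^level; these functions have zero integral, matching the
    identifiability constraint."""
    cols = []
    for j in range(1, 2 ** level + 1):
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
    return np.column_stack(cols)

def additive_design(X, levels):
    """Design matrix of S_m = {a + t_1(x_1) + ... + t_k(x_k)} with each t_i
    in the trigonometric space of level m_i.  X has shape (n, k)."""
    n, k = X.shape
    blocks = [np.ones((n, 1))]                 # the constant a
    for i in range(k):
        blocks.append(trig_block(X[:, i], levels[i]))
    return np.hstack(blocks)

# example: least-squares fit of an additive surface on [0,1]^2
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(size=(500, 2))
    y = 1.0 + np.sin(2*np.pi*X[:, 0]) + np.cos(2*np.pi*X[:, 1]) \
        + 0.2 * rng.standard_normal(500)
    M = additive_design(X, levels=(1, 1))
    coef, *_ = np.linalg.lstsq(M, y, rcond=None)
    print("D_m =", M.shape[1], " first coefficients:", coef[:3].round(2))
```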

The performance of f̃ is given by the following result

Proposition 4. Assume that f|A ∞ < ∞, that the sequence of the X i Yi 
is geometrically β-mixing, that is, satisfies 6 1 and that (HX ), (Hε ) and
(HXε ) are fulfilled. Consider the additive regression framework 3 9 with
p
the above collection of models. If σp = Ɛ ε p  < ∞ for some p > 6, then f̃
satisfies 3 1 for some constant R depending on k p σp  f|A − f|A dx∞ 
x h0  h1  / θ.

We can deduce from Proposition 4 that our procedure is adaptive in the minimax sense. The point of interest is that the additive framework avoids the curse of dimensionality in the rate of convergence, that is, we can derive similar rates of convergence for $k \ge 2$ as for $k = 1$.
Let $\alpha > 0$ and $l \ge 2$; we recall that a function $f$ from $[0,1]$ into $\mathbb{R}$ belongs to the Besov space $\mathcal{B}_{\alpha,l,\infty}$ if it satisfies
$|f|_{\alpha,l} = \sup_{y>0} y^{-\alpha} w_d(f, y)_l < +\infty, \qquad d = [\alpha] + 1,$
where $w_d(f, y)_l$ denotes the modulus of smoothness. For a precise definition of those notions we refer to DeVore and Lorentz [(1993), Chapter 2, Section 7]. Since for $l \ge 2$, $\mathcal{B}_{\alpha,l,\infty} \subset \mathcal{B}_{\alpha,2,\infty}$, we now restrict ourselves to the case where $l = 2$. In the sequel, for any $L > 0$, $\mathcal{B}_{\alpha,2,\infty}(L)$ denotes the set of functions which belong to $\mathcal{B}_{\alpha,2,\infty}$ and satisfy $|f|_{\alpha,2} \le L$. Then the following result holds.

Proposition 5. Consider model (3.9) with $k \ge 2$. Let $L > 0$, assume that $\|f_{|A}\|_\infty \le L$ and that for all $i = 1,\ldots,k$, $f_i \in \mathcal{B}_{\alpha_i,2,\infty}(L)$ for some $\alpha_i > 1/2$. Assume that for all $i = 1,\ldots,k$ such that $g(i) = 1$, $\alpha_i \le r$. Set $\alpha = \min(\alpha_1,\ldots,\alpha_k)$; if $\mathbb{E}(|\varepsilon_1|^p) < \infty$ for some $p > 6$ then under the assumptions of Proposition 4
(3.10)  $\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^2\big] \le C(k, L, \alpha, R)\, n^{-\frac{2\alpha}{2\alpha+1}}.$

Comments. (i) In the case where $k = 1$, by using the collection of piecewise polynomials described in Section 3.1, (3.10) holds under the weaker assumption that $\alpha > 0$; we refer the reader to the proof of Proposition 5.
(ii) A result of the same flavor can be established in probability; this would require a weaker moment condition on the $\varepsilon_i$'s. Namely, using (4.7) we show similarly that for any $\eta > 0$, there exists a positive constant $C(\eta)$ (also depending on $k$, $L$, $\alpha$ and $R$) such that
$\|f_{|A} - \tilde f\|_n \le C(\eta)\, n^{-\frac{\alpha}{2\alpha+1}},$
with probability greater than or equal to $1 - \eta$, as soon as $\mathbb{E}(|\varepsilon_1|^p) < \infty$ for some $p > 2$.

3.5. Estimation of the order of an additive autoregression. Consider an additive autoregression framework,
(3.11)  $X_i = e_f + f_1(X_{i-1}) + f_2(X_{i-2}) + \cdots + f_k(X_{i-k}) + \varepsilon_i,$
where the $\varepsilon_i$'s are i.i.d. and $e_f$ denotes a constant. Under suitable assumptions ensuring that the $\vec X_i = (X_{i-1},\ldots,X_{i-k})'$'s are stationary and geometrically $\beta$-mixing, the estimation of $f_1,\ldots,f_k$ can be handled in the same way as in the previous section. The aim of this section is to provide an estimator of the order of autoregression, that is, an estimator of the integer $k_0$ ($k_0 \le k$, $k$ being known) satisfying $f_{k_0} \neq 0$ and $f_i = 0$ for all $i > k_0$. To do so, let $\mathcal{M}_n = \bigcup_{j=0}^{k}\mathcal{M}_n(j)$ (we use the notations introduced in Section 3.4) and consider the collection of models $\{S_m,\ m\in\mathcal{M}_n\}$. We estimate $k_0$ by $\hat k_0 = \hat k_0(x)$ defined as the first coordinate of $\hat m$, $\hat m$ being given by
$\hat m = \arg\min_{m\in\mathcal{M}_n}\Big\{\gamma_n(\hat f_m) + x^3\frac{D_m}{n}\sigma_2^2\Big\}.$
We measure the performance of $\hat k_0$ via that of $\tilde f = \hat f_{\hat m}$, the latter being known, under the assumptions of Theorem 1, to achieve the best trade-off (up to a constant) between the bias term and the variance term among the collection of least-squares estimators $\{\hat f_m,\ m\in\mathcal{M}_n\}$.
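A purely illustrative sketch of this order-selection rule follows (it is not the paper's construction: each lag enters through simple global polynomials rather than the spaces of Section 3.4, and all names and constants are assumptions made for the example):

```python
import numpy as np

def select_order(x, k_max, sigma2, x_pen=1.1, deg=4):
    """For each candidate order j, fit an additive model in the lags
    X_{i-1},...,X_{i-j} (polynomials of degree `deg` per lag) and keep the
    order minimizing the penalized least-squares criterion."""
    n = len(x) - k_max
    Y = x[k_max:]
    best = None
    for j in range(k_max + 1):
        cols = [np.ones(n)]
        for lag in range(1, j + 1):
            z = x[k_max - lag: k_max - lag + n]       # X_{i-lag}
            for d in range(1, deg + 1):
                cols.append(z ** d)
        M = np.column_stack(cols)
        coef, *_ = np.linalg.lstsq(M, Y, rcond=None)
        crit = np.mean((Y - M @ coef) ** 2) + x_pen ** 3 * sigma2 * M.shape[1] / n
        if best is None or crit < best[0]:
            best = (crit, j)
    return best[1]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n = 3000
    x = np.zeros(n)
    for i in range(2, n):                              # true order k_0 = 2
        x[i] = 0.5*np.sin(x[i-1]) + 0.3*np.tanh(x[i-2]) + 0.3*rng.standard_normal()
    print("selected order:", select_order(x, k_max=5, sigma2=0.09))
```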

4. The main result. In this section, we give our main result concerning the estimation of a regression function from dependent data. Although this result considers the case of particular collections of models, extensions including very general collections are to be found in the comments following the theorem.

4.1. The main theorem. Let $\mathcal{S}_n$ be some finite dimensional linear subspace of $A$-supported functions of $\mathbb{L}^2(\mathbb{R}^k, dx)$. Let $(\phi_\lambda)_{\lambda\in\Lambda_n}$ be an orthonormal basis of $\mathcal{S}_n \subset \mathbb{L}^2(A, dx)$ and set $D_n = |\Lambda_n| = \dim(\mathcal{S}_n)$. We assume that there exists some positive constant $\Phi_1 \ge 1$ such that for all $\lambda\in\Lambda_n$
(H$_{\mathcal S_n}$)  $\|\phi_\lambda\|_\infty \le \Phi_1\sqrt{D_n}$ and $\#\{\lambda':\ \|\phi_\lambda\phi_{\lambda'}\|_\infty \neq 0\} \le \Phi_1.$
The second condition means that for each $\lambda$, the supports of $\phi_\lambda$ and $\phi_{\lambda'}$ are disjoint except for at most $\Phi_1$ functions $\phi_{\lambda'}$'s. We shall see in Section 10 that those conditions imply that (2.2) holds with $\Phi_0^2 = \Phi_1^3$. In addition we assume some constraint on the dimension of $\mathcal{S}_n$:
(H$_{D_n}$)$_{\Theta,b}$ There exists an increasing function $\Theta$ mapping $\mathbb{R}^+$ into $\mathbb{R}^+$ satisfying, for some $K > 0$ and $b \in (0, 1/4)$,
$\forall u \ge 1, \quad \ln(u\vee 1) \le \Theta(u) \le K u^b,$
such that
(4.1)  $D_n \le \frac{n}{\Theta(n)\ln(n)}.$

Theorem 1. Let us consider model (1.1) with $f$ an unknown function from $\mathbb{R}^k$ into $\mathbb{R}$ such that $\|f_{|A}\|_\infty < \infty$ and where Conditions (H$_X$), (H$_\varepsilon$) and (H$_{X\varepsilon}$) are fulfilled. Consider a family $(S_m)_{m\in\mathcal{M}_n}$ of linear subspaces of $\mathcal{S}_n$. Assume that $(S_m)_{m\in\mathcal{M}_n}$ satisfies (H$_{\mathcal S}$) and that $\mathcal{S}_n$ satisfies (H$_{\mathcal S_n}$) and (H$_{D_n}$)$_{\Theta,b}$. Suppose that (H$_{XY}$) is fulfilled for a sequence of $\beta$-mixing coefficients satisfying
(4.2)  $\forall q \ge 1, \quad \beta_q \le M\big[\Theta^{-1}(Bq)\big]^{-3},$
for some $M > 0$ and for some constant $B$ given by (7.14). For any $x > 1$, let pen be a penalty function such that
$\forall m\in\mathcal{M}_n, \quad \mathrm{pen}(m) \ge x^3\frac{D_m}{n}\sigma_2^2.$
Let $\rho\in(1,x)$. For any $\bar p\in(0,1]$, if there exists $p > p_0 = 2(1+2\bar p)/(1-4b)$ such that $\sigma_p^p = \mathbb{E}(|\varepsilon_1|^p) < \infty$, we have that the PLSE $\tilde f$ defined by
(4.3)  $\tilde f = \arg\min_{m\in\mathcal{M}_n}\big[\gamma_n(\hat f_m) + \mathrm{pen}(m)\big] \quad\text{with}\quad \gamma_n(g) = \frac1n\sum_{i=1}^{n}\big[Y_i - g(\vec X_i)\big]^2$
satisfies
(4.4)  $\Big(\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^{2\bar p}\big]\Big)^{1/\bar p} \le \Big(\frac{x+\rho}{x-\rho}\Big)^2 \inf_{m\in\mathcal{M}_n}\Big\{\|f_{|A} - f_m\|_\mu^2 + 2\,\mathrm{pen}(m)\Big\} + C\frac{R_n}{n},$
where $C$ is a constant depending on $p, x, \rho, \bar p, \Phi_0, h_0, h_1, M, K$ and $R_n$ is given by
(4.5)  $R_n = \sigma_p^{2\bar p}\Big[\sum_{m\in\mathcal{M}_n} D_m^{-(p/2-\bar p)} + \frac{|\mathcal{M}_n|}{n^{(1/4-b)(p-p_0)}}\Big] + \frac{\|f_{|A}\|_\infty^{2\bar p}}{\sigma_p^{2\bar p}}.$

Comments.
1. The functions $\Theta$ of particular interest are either of the form $\Theta(u) = \ln(u)$ or $\Theta(u) = u^c$ with $0 < c < 1/4$. In the first case, (4.2) is equivalent to a geometric decay of the $\beta$-mixing coefficients (then, we say that the variables are geometrically $\beta$-mixing); in the second case (4.2) is equivalent to an arithmetic decay (the sequence is then arithmetically $\beta$-mixing).
2. A choice of $D_n$ small in front of $n$ allows one to deal with stronger dependency between the $(Y_i, \vec X_i)$'s. In return, choosing $D_n$ too small may lead to a serious drawback with regard to the performance of the PLSE. Indeed, in the case of nested models, the smaller $D_n$ the smaller the collection of models and the poorer the performance of $\tilde f$.
3. Assumption (H$_{\mathcal S_n}$) is fulfilled when $\mathcal{S}_n$ is generated by piecewise polynomials of degree $r$ on $[0,1]$ (in that case $\Phi_1 = 2r+1$ suits) or by wavelets such as those described in Section 3.2 (a suitable basis is obtained by rescaling the father wavelets $\phi_{J_0,k}$'s).
4. We shall see in Section 10 that the result of Theorem 1 holds for a larger class of linear spaces $\mathcal{S}_n$ [i.e., for $\mathcal{S}_n$'s which do not satisfy (H$_{\mathcal S_n}$)], provided that (4.1) is replaced by
(4.6)  $D_n^2 \le \frac{n}{\ln(n)\Theta(n)}.$

5. Take $\bar p = 1$; the main term involved in the right-hand side of (4.4) is usually
$\inf_{m\in\mathcal{M}_n}\big\{\|f_{|A} - f_m\|_\mu^2 + 2\,\mathrm{pen}(m)\big\}.$
It is worth noticing that the constant in front of this term, that is,
$C_1(x,\rho) = \Big(\frac{x+\rho}{x-\rho}\Big)^2,$
only depends on $x$ and $\rho$, and not on unpleasant quantities such as $h_0$, $h_1$. If Theorem 1 gives no precise recommendation on the choice of $x$ to optimize the performance of the PLSE, it suggests, in contrast, that a choice of $x$ close to 1 is certainly not a good choice since it makes the constant $C_1(x,\rho)$ blow up (we recall that $\rho$ must belong to $(1,x)$). Fix $\rho$; we see that $C_1(x,\rho)$ decreases to 1 as $x$ becomes large, the negative effect of choosing $x$ large being that it increases the value of the penalty term.
6. Why does Theorem 1 give a result for values of $\bar p \neq 1$? By using Markov's inequality, we can derive from (4.4) a result in probability saying that for any $\tau > 0$,
(4.7)  $\mathbb{P}\Big[\|f_{|A} - \tilde f\|_n^2 > \tau\Big(\inf_{m\in\mathcal{M}_n}\big\{\|f_{|A} - f_m\|_\mu^2 + 2\,\mathrm{pen}(m)\big\} + \frac{R_n}{n}\Big)\Big] \le \frac{C'}{\tau^{\bar p}},$
where $C'$ depends on $x, \rho, \bar p, C$. If $\mathbb{E}(|\varepsilon_1|^p) < \infty$ for some $p > 2$ and if it is possible to choose $\Theta(u)$ of order a power of $\ln(u)$ [this is the case

when the Yi  X  i ’s are geometrically β-mixing] then one can choose both
b in (HDn )8b and p̄ small enough to ensure that p > 21 + 2p̄/1 − 4b.
Consequently we get that (4.7) holds true under the weak assumption that
Ɛ ε1 p  < ∞ for some p > 2. Lastly we mention that an analogue of
(4.7) where f|A − f̃2n is replaced by f|A − f̃2µ can be obtained. This
can be derived from the fact that, under the assumptions of Theorem 1,
the (semi)norms  µ and  n are equivalent on n on a set of probability
close to 1 (we refer to the proof of Theorem 1 and for further details to
Baraud (2001)).
7. For adequate collections of models, the quantity Rn remains bounded by
some number R not depending on n. In addition, if for all m ∈ n , the
constants belong to Sm , then the quantity
 f|A ∞ involved in Rn can be
replaced by the smaller one f|A − f|A ∞ .

5. Generalization of Theorem 1. In this section we give an extension of Theorem 1 by relaxing the independence of the $\varepsilon_i$'s and by weakening Assumption (H$_{X\varepsilon}$). In particular, the next result shows that the procedure is robust to possible dependency (to some extent) of the $\varepsilon_i$'s.
We assume that:
(H'$_\varepsilon$) The $\varepsilon_i$'s satisfy, for some positive number $\vartheta$,
(5.1)  $\sup_{t,\ \|t\|_\mu\le 1} \mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] \le q\vartheta,$
for any $1 \le q \le n$.
In addition, Assumption (H$_{X\varepsilon}$) is replaced by a milder one:
(H'$_{X\varepsilon}$) For all $i\in\{1,\ldots,n\}$, $\vec X_i$ and $\varepsilon_i$ are independent.
Then the following result holds.

Theorem 2. Consider the assumptions of Theorem 1 and replace (H$_\varepsilon$) by (H'$_\varepsilon$) and (H$_{X\varepsilon}$) by (H'$_{X\varepsilon}$). For any $x > 1$, let pen be a penalty function such that
$\forall m\in\mathcal{M}_n, \quad \mathrm{pen}(m) \ge x^3\frac{D_m}{n}\vartheta.$
Then, the result (4.4) of Theorem 1 holds for a constant $C$ that also depends on $\vartheta$.

Comments.
1. In the case of i.i.d. $\varepsilon_i$'s and under Assumption (H$_{X\varepsilon}$) [which clearly implies (H'$_{X\varepsilon}$)], it is straightforward that (5.1) holds with $\vartheta = \sigma_2^2$. Indeed under Condition (H$_{X\varepsilon}$), for all $t\in\mathbb{L}^2(\mathbb{R}^k,\mu)$,
$\mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] = \sum_{i=1}^{q}\mathbb{E}\big[\varepsilon_i^2 t^2(\vec X_i)\big] + 0 = q\sigma_2^2\|t\|_\mu^2.$
Then, we recover Theorem 1.



2. Assume that the sequences $(\vec X_i)_{i=1,\ldots,n}$ and $(\varepsilon_i)_{i=1,\ldots,n}$ are independent [which clearly implies (H'$_{X\varepsilon}$)] and that the $\varepsilon_i$'s are $\beta$-mixing. Then, we know from Viennet (1997) that there exists a function $d_\beta$ depending on the $\beta$-mixing coefficients of the $\varepsilon_i$'s such that for all $t\in\mathbb{L}^2(\mathbb{R}^k,\mu)$
$\mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] \le q\,\mathbb{E}\big[\varepsilon_1^2 d_\beta(\varepsilon_1)\big]\,\|t\|_\mu^2,$
which amounts to taking $\vartheta = \vartheta_\beta = \mathbb{E}[\varepsilon_1^2 d_\beta(\varepsilon_1)]$ in (5.1). Roughly speaking $\vartheta_\beta$ is close to $\sigma_2^2$ when the $\beta$-mixing coefficients of the $\varepsilon_i$'s are close to 0, which corresponds to the independence of the $\varepsilon_i$'s. Thus, in this context the result of Theorem 2 can be understood as a result of robustness, since $\vartheta_\beta$ is unknown. Indeed, the penalized procedure described in Theorem 1 with a penalty term satisfying, for some $\kappa > 1$,
$\forall m\in\mathcal{M}_n, \quad \mathrm{pen}(m) \ge \kappa\frac{D_m}{n}\sigma_2^2,$
still works if $\vartheta_\beta < \kappa\sigma_2^2$. This also means that if the independence of the $\varepsilon_i$'s is debatable, it is safer to increase the value of the penalty term.

6. Proof of the propositions of Section 3.

Proof of Proposition 1. The result is a consequence of Theorem 1. Let us show that under (H$_{AR1}$) the assumptions of Theorem 1 are fulfilled. Condition (H$_\varepsilon$) is direct. Under (3.3) it is clear that $\|f_{|[0,1]}\|_\infty < \infty$ holds true. We now set $\mathcal{S}_n = S_{m_n}$ and $\Theta(x) = \ln^2(x)$. Since
$\dim(\mathcal{S}_n) = D_n \le \frac{n}{\ln^3(n)},$
(H$_{D_n}$)$_{\Theta,b}$ holds for any $b > 0$ and for some constant $K = K(b)$. As to Conditions (H$_{\mathcal S}$) and (H$_{\mathcal S_n}$), they hold with $\Phi_0 = r$ [we refer to Birgé and Massart (1998)]. Under Condition (3.3), we know from Duflo (1997) that the process $(X_i)_{i\ge 0}$ admits a stationary law $\mu$. Furthermore, we know that if the $\varepsilon_i$'s admit a positive bounded continuous density with respect to the Lebesgue measure then so does $\mu$. This can easily be deduced from the connection between $h_X$ and $h_\varepsilon$ given by
$h_X(y) = \int_{\mathbb{R}} h_\varepsilon\big(y - f(x)\big)h_X(x)\,dx \quad \forall y\in\mathbb{R}.$
Then we can derive the existence of positive numbers $h_1$ and $h_0$ bounding the density $h_X$ from above and below on $[0,1]$ and thus (H$_X$) is true. In addition we know from Doukhan (1994) that under (3.3) the $X_i$'s are geometrically $\beta$-mixing, that is, there exist two positive constants $\Gamma$, $\theta$ such that
(6.1)  $\beta_q \le \Gamma e^{-\theta q} \quad \forall q\ge 1.$

Since 8−1 u = exp u, clearly there exists some constant M = M/ θ > 0
such that

βq ≤ /e−θq ≤ Me−3 Bq ∀q ≥ 1
Lastly, the εi ’s being independent of the sequence Xj j<i , (HXε ) is true and
we know that the β-mixing coefficients of both sequences Xi−1  εi i=1 n and
Xi−1 i=1 n are the same. Consequently, Condition (HXY ) holds and (4.2) is
fulfilled. By choosing p̄ = 1, Theorem 1 can be applied if Ɛ εi p  < ∞ for some
p > 6/1 − 4b. This is true for b small enough and then (3.1) follows from
(4.4) with
 
2
 −p/2+1 n f|01 2∞
Rn = σp Dm + 1/4−bp−6/1−4b +
m∈n n σp2
 
+∞ 2
 lnn f| 01  ∞
≤ σp2 r2m −2 + sup 1/4−bp−6/1−4b +
m=0 n≥1 n σp2

=R
Take R = CR where C is the constant involved in (4.4) to complete the proof
of Proposition 1. ✷

Proof of Proposition 2. Conditions (H$_{\mathcal S}$) and (H$_{\mathcal S_n}$) are fulfilled [we refer to Birgé and Massart (1998)]. Next we check that (H$_{XY}$) holds true and more precisely that the sequence $(\varepsilon_i, X_i)_{1\le i\le n}$ is arithmetically $\beta$-mixing with $\beta$-mixing coefficients satisfying
(6.2)  $\forall q\in\{1,\ldots,n\}, \quad \beta_q \le \Gamma q^{-\theta},$
for some constants $\Gamma > 0$ and $\theta > 15$. For that purpose, simply write $(\varepsilon_t, X_t)' = \sum_{j=0}^{\infty}A_j e(t-j)$ with $e(t-j) = (\varepsilon_{t-2j}, \varepsilon_{t-1-2j})'$, where for $j\ge 0$, $A_0$ is the $2\times2$ identity matrix and
$A_j = \begin{pmatrix} 0 & 0 \\ 0 & a_j \end{pmatrix}.$
Then Pham and Tran's (1985) Theorem 2.1 implies, under (H$_{Reg}$), that $(\varepsilon_t, X_t)$ is absolutely regular with coefficients $\beta_n \le K\sum_{j=n}^{+\infty}\big(\sum_{k\ge j}|a_k|\big) \le KC/[(d-1)(d-2)]\, n^{-d+2}$. This implies (6.2) with $\theta = d-2 > 15$. In addition, it can be proved that if $a_j = j^{-d}$ then $\beta_n \ge C(d)n^{-d}$, which shows that we do not reach the geometrical rate of mixing.
Clearly the other assumptions of Theorem 1 are satisfied and it remains to apply it with $p = 30$ (a moment of order 30 exists since the $\varepsilon_i$'s are Gaussian), $\Theta(u) = u^{1/5}$ and $\bar p = 1$. An upper bound for $R_n$ which does not depend on $n$ can be established in the same way as in the proof of Proposition 1. ✷

Proof of Proposition 3. The line of proof is similar to that of Proposition 1, the difference lying in the fact that we need to check the assumptions of Theorem 2. Most of them are clearly fulfilled; we only check (H$_{XY}$) and (H'$_\varepsilon$). We note that the pairs $(\vec X_i, Y_i)$'s are geometrically $\beta$-mixing [which shows that (H$_{XY}$) holds true] since both sequences $(X_i)$'s and $(\varepsilon_i)$'s are geometrically $\beta$-mixing (since the $\varepsilon_i$'s are drawn from a "nice" autoregression model, we refer to Section 3.1) and are independent. Next we show that (H'$_\varepsilon$) holds true with $\vartheta = (1 + 2a/(1-a))\sigma_2^2$. This will end the proof of Proposition 3. For all $t\in\mathbb{L}^2(\mathbb{R}^k,\mu)$,
$\mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] \le \sum_{i=1}^{q}\|t\|_\mu^2\sigma_2^2 + 2\sum_{i<j}\mathbb{E}(\varepsilon_i\varepsilon_j)\,\mathbb{E}\big[t(\vec X_i)t(\vec X_j)\big].$
For $i < j$,
$\mathbb{E}(\varepsilon_i\varepsilon_j) = \mathbb{E}\big[\varepsilon_i\big(u_j + \cdots + a^k u_{j-k} + \cdots + a^{j-i-1}u_{i+1} + a^{j-i}\varepsilon_i\big)\big] = 0 + a^{j-i}\sigma_2^2,$
thus
$\mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] \le q\|t\|_\mu^2\sigma_2^2 + 2\sum_{i<j}a^{j-i}\mathbb{E}\big[t(\vec X_i)t(\vec X_j)\big]\sigma_2^2 \le \Big(q + 2\sum_{1\le i<j\le q}a^{j-i}\Big)\|t\|_\mu^2\sigma_2^2,$
by Cauchy–Schwarz's inequality. Therefore, we obtain
$\mathbb{E}\Big[\Big(\sum_{i=1}^{q}\varepsilon_i t(\vec X_i)\Big)^2\Big] \le q\Big(1 + \frac{2a}{1-a}\Big)\|t\|_\mu^2\sigma_2^2,$
which gives the result. ✷
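As a quick numerical sanity check of this bound (not part of the original proof; the test function, the values of $a$ and $q$, and the uniform design are illustrative assumptions), one can simulate AR(1) errors and an independent design and compare the empirical value of $\mathbb{E}[(\sum_i\varepsilon_i t(X_i))^2]$ with $q(1+2a/(1-a))\sigma_2^2\|t\|_\mu^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
a, q, reps = 0.6, 200, 2000
t = lambda u: np.sqrt(2) * np.cos(2 * np.pi * u)   # ||t||_mu = 1 under U(0,1)

vals = np.empty(reps)
for r in range(reps):
    u = rng.standard_normal(q)
    eps = np.empty(q)
    eps[0] = u[0] / np.sqrt(1 - a ** 2)            # stationary start
    for i in range(1, q):
        eps[i] = a * eps[i - 1] + u[i]             # eps_i = a*eps_{i-1} + u_i
    x = rng.uniform(size=q)                        # design independent of eps
    vals[r] = eps @ t(x)

sigma2 = 1.0 / (1 - a ** 2)                        # stationary Var(eps_i)
bound = q * (1 + 2 * a / (1 - a)) * sigma2         # bound with ||t||_mu^2 = 1
print("empirical E[(sum eps_i t(X_i))^2]:", vals.var() + vals.mean() ** 2)
print("bound q(1+2a/(1-a))sigma_2^2     :", bound)
```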

Proof of Proposition 4. This proposition is a consequence of Theorem 1. It is enough to apply it with $\bar p = 1$. In the sequel, we check that the assumptions of the theorem are fulfilled and we bound $R_n$ [given by (4.5)] by some constant that does not depend on $n$. To bound the $\beta$-mixing coefficients of the sequence of the $(Y_i, \vec X_i)$'s, we argue as in the proof of Proposition 1, with $\mathcal{S}_n = S_{(k,\,m_{g(1)}(n),\ldots,m_{g(k)}(n))}$, $\dim(\mathcal{S}_n) \le n/\ln^3(n)$ and $\Theta(n) = \ln^2(n)$. Inequality (4.6) is verified [thus condition (H$_{\mathcal S_n}$) can be omitted]. Let us now check (H$_{\mathcal S}$). Since for all $m, m'\in\mathcal{M}_n$, $S_m + S_{m'}$ and $\mathcal{S}_n$ belong to the collection of models $\{S_m,\ m\in\mathcal{M}_n\}$, the assumption (H$_{\mathcal S}$) holds true if we prove that (2.2) is satisfied for any $S_m$, $m\in\mathcal{M}_n$. Now note that for each $m\in\mathcal{M}_n$, the following decomposition in $\mathbb{L}^2([0,1]^k, dx_1\cdots dx_k)$ holds:
$S_m = \mathbb{R}\,\chi_{[0,1]^k} \overset{\perp}{\oplus} S_m^{(1)} \overset{\perp}{\oplus} \cdots \overset{\perp}{\oplus} S_m^{(k)},$

where $S_m^{(i)} = \{t\in S_m:\ t(x_1,\ldots,x_k) = t_i(x_i)\}$ and $\chi_{[0,1]^k}$ denotes the constant function on $[0,1]^k$. Clearly, $S_m^{(i)}$ satisfies (2.2) if and only if $S_{m_i}^{(g(i))}$ does, which is true. Now the fact that the $S_m$'s satisfy (2.2) is a consequence of the following lemma.

Lemma 1. Let $S^{(1)},\ldots,S^{(k)}$ be $k$ linear spaces which are pairwise orthogonal in $\mathbb{L}^2([0,1]^k, dx_1\cdots dx_k)$. If for each $i = 1,\ldots,k$, $S^{(i)}$ satisfies (2.2), then so does $S = S^{(1)} + \cdots + S^{(k)}$.

Proof. The result follows from a Cauchy–Schwarz argument: for all $t_i\in S^{(i)}$, $i = 1,\ldots,k$,
$\Big\|\sum_{i=1}^{k}t_i\Big\|_\infty \le \sum_{i=1}^{k}\|t_i\|_\infty \le \Phi_0\sum_{i=1}^{k}\sqrt{\dim(S^{(i)})}\,\|t_i\| \le \Phi_0\Big(\sum_{i=1}^{k}\dim(S^{(i)})\Big)^{1/2}\Big(\sum_{i=1}^{k}\|t_i\|^2\Big)^{1/2} = \Phi_0\sqrt{\dim(S)}\,\Big\|\sum_{i=1}^{k}t_i\Big\|. \quad ✷$

To complete the proof of Proposition 4 we bound $R_n$ by some constant $R$ which does not depend on $n$. Note that $|\mathcal{M}_n|$ is of order a power of $\ln(n)$, so the point is to show that $\sum_{m\in\mathcal{M}_n}D_m^{-2}$ (we recall that $\bar p = 1$ and $p > 6$) remains bounded by some quantity which does not depend on $n$. Now for each $m = (k, m_1,\ldots,m_k)\in\mathcal{M}_n$ we have that $D_m$ is of order $2^{m_1} + \cdots + 2^{m_k}$, thus by using the convexity inequality $k^{-1}(a_1+\cdots+a_k) \ge (a_1\cdots a_k)^{1/k}$, which holds for any positive numbers $a_1,\ldots,a_k$, we obtain that $\sum_{m\in\mathcal{M}_n}D_m^{-2}$ is bounded (up to a constant) by
$\sum_{m_1=0}^{\infty}\cdots\sum_{m_k=0}^{\infty}\big(2^{m_1}+\cdots+2^{m_k}\big)^{-2} \le \sum_{m_1=0}^{\infty}\cdots\sum_{m_k=0}^{\infty}2^{-2(m_1+\cdots+m_k)/k} = \Big(\sum_{j=0}^{\infty}2^{-2j/k}\Big)^{k} = R < \infty. \quad ✷$

Proof of Proposition 5. Let $k\ge 2$. We start from (3.1) and bound the bias term. Let $\bar f_m$ be the $\mathbb{L}^2([0,1]^k, dx)$ projection of $f$ onto $S_m$; we have that $\|f_{|A} - f_m\|_\mu^2 \le \|f_{|A} - \bar f_m\|_\mu^2 \le h_1\|f - \bar f_m\|^2$ by (H$_X$) and, for each $m = (k, m_1,\ldots,m_k)$,
$\|f - \bar f_m\|^2 = \sum_{i=1}^{k}\int_{[0,1]}\big[f_i(x) - \bar f_{m_i}(x)\big]^2 dx,$
where $\bar f_{m_i}$ denotes the $\mathbb{L}^2([0,1], dx)$ projection of $f_i$ onto $S_{m_i}^{(g(i))}$. Lastly we use standard results of approximation theory [see Barron, Birgé and Massart (1999), Lemma 13, or DeVore and Lorentz (1993)] which ensure that $\int_{[0,1]}[f_i(x) - \bar f_{m_i}(x)]^2dx \le C(\alpha_i, L)D_{m_i}^{-2\alpha_i}$ [if $g(i) = 1$, this holds true in the case of piecewise polynomials since $r \ge \alpha_i$]. We obtain (3.10) by taking for each $i = 1,\ldots,k$, $m_i\in\mathcal{M}_n^{(g(i))}$ such that $D_{m_i}$ is of order $n^{1/(2\alpha_i+1)}$, which is possible since $\alpha_i > 1/2$ and therefore $n^{1/(2\alpha_i+1)} \le D_n$ (at least for $n$ large enough). In the one dimensional case, by considering the piecewise polynomials described in Section 3.1, $D_n$ is of order $n/\ln^3(n)$ (such a choice is possible in this case) and then a choice of $m$ among $\mathcal{M}_n$ such that $D_m$ is of order $n^{1/(2\alpha+1)}$ is possible for any $\alpha > 0$. ✷

7. Proofs of Theorems 1 and 2. The proof of Theorem 2 is clear from the proof of Theorem 1. Indeed the assumptions (H$_{X\varepsilon}$) and (H$_\varepsilon$) are only needed in (8.6) and (8.10). For the rest of the proof assuming (H'$_{X\varepsilon}$) is enough. It remains to notice that an analogue of (8.6) and (8.10) is easily obtained from Assumption (H'$_\varepsilon$).
Now we prove Theorem 1. The proof is divided into consecutive claims.

Claim 1. For all $m\in\mathcal{M}_n$,
(7.1)  $\|f_{|A} - \tilde f\|_n^2 \le \|f_{|A} - f_m\|_n^2 + \frac2n\sum_{i=1}^{n}\varepsilon_i(\tilde f - f_m)(\vec X_i) + \mathrm{pen}(m) - \mathrm{pen}(\hat m).$

Proof. By definition of $\tilde f$ we know that for all $m\in\mathcal{M}_n$ and $t\in S_m$,
$\gamma_n(\tilde f) + \mathrm{pen}(\hat m) \le \gamma_n(t) + \mathrm{pen}(m).$
In particular this holds for $t = f_m$ and algebraic computations lead to
(7.2)  $\|f - \tilde f\|_n^2 \le \|f - f_m\|_n^2 + \frac2n\sum_{i=1}^{n}\varepsilon_i(\tilde f - f_m)(\vec X_i) + \mathrm{pen}(m) - \mathrm{pen}(\hat m).$
Note that the relation
$\|f - t\|_n^2 = \|f_{|A} - t\|_n^2 + \|f - f_{|A}\|_n^2$
is satisfied for any $A$-supported function $t$. Applying this identity respectively to $t = \tilde f$ and $t = f_m$ (those functions being $A$-supported as elements of $\bigcup_{m'\in\mathcal{M}_n}S_{m'}$), we derive (7.1) from (7.2). ✷

Claim 2. Let $q_n$, $q_{n,1}$ be integers such that $0 \le q_{n,1} \le q_n/2$, $q_n \ge 1$. Set $\vec u_i = (\varepsilon_i, \vec X_i)'$, $i = 1,\ldots,n$; then there exist random variables $\vec u_i^* = (\varepsilon_i^*, \vec X_i^*)'$, $i = 1,\ldots,n$, satisfying the following properties:
(i) For $\ell = 1,\ldots,\ell_n = n/q_n$, the random vectors
$\vec U_{\ell,1} = \big(\vec u_{(\ell-1)q_n+1},\ldots,\vec u_{(\ell-1)q_n+q_{n,1}}\big) \quad\text{and}\quad \vec U_{\ell,1}^* = \big(\vec u_{(\ell-1)q_n+1}^*,\ldots,\vec u_{(\ell-1)q_n+q_{n,1}}^*\big)$
have the same distribution, and so have the random vectors
$\vec U_{\ell,2} = \big(\vec u_{(\ell-1)q_n+q_{n,1}+1},\ldots,\vec u_{\ell q_n}\big) \quad\text{and}\quad \vec U_{\ell,2}^* = \big(\vec u_{(\ell-1)q_n+q_{n,1}+1}^*,\ldots,\vec u_{\ell q_n}^*\big).$
(ii) For $\ell = 1,\ldots,\ell_n$,
(7.3)  $\mathbb{P}\big(\vec U_{\ell,1} \neq \vec U_{\ell,1}^*\big) \le \beta_{q_n-q_{n,1}} \quad\text{and}\quad \mathbb{P}\big(\vec U_{\ell,2} \neq \vec U_{\ell,2}^*\big) \le \beta_{q_{n,1}}.$
(iii) For each $\delta\in\{1,2\}$, the random vectors $\vec U_{1,\delta}^*,\ldots,\vec U_{\ell_n,\delta}^*$ are independent.

Proof. The claim is a corollary of Berbee's coupling lemma (1979) [see Doukhan et al. (1995)] together with (H$_{XY}$). For further details about the construction of the $\vec u_i^*$'s we refer to Viennet (1997); see Proposition 5.1 and its proof, page 484. ✷

We set
(7.4)  $A_0 = h_0^2(1 - 1/\rho)^2/(80\,\Phi_1^4 h_1)$
and we choose $q_n = \mathrm{int}(A_0\Theta(n)/4) + 1 \ge 1$ [$\mathrm{int}(u)$ denotes the integer part of $u$] and $q_{n,1} = q_{n,1}(x)$ to satisfy $\sqrt{q_{n,1}/q_n} + \sqrt{1 - q_{n,1}/q_n} \le \sqrt x$; namely $q_{n,1}$ of order $\big[(\sqrt x-1)^2\wedge 1\big]q_n/2$ works. For the sake of simplicity, we assume $q_n$ to divide $n$, that is, $n = \ell_n q_n$, and we introduce the sets $B^*$ and $B_\rho$ defined as follows:
$B^* = \big\{(\varepsilon_i, \vec X_i) = (\varepsilon_i^*, \vec X_i^*),\ i = 1,\ldots,n\big\}$
and for $\rho \ge 1$,
$B_\rho = \Big\{\|t\|_\mu^2 \le \rho\|t\|_n^2,\ \forall t\in\bigcup_{m,m'\in\mathcal{M}_n}(S_m + S_{m'})\Big\}.$
We denote by $B_\rho^*$ the set $B^*\cap B_\rho$. From now on, the index $m$ denotes a minimizer of the quantity $\|f_{|A} - f_{m'}\|_\mu^2 + \mathrm{pen}(m')$ for $m'\in\mathcal{M}_n$. Therefore, $m$ is fixed and, for the sake of simplicity, the index $m$ is omitted in the three following notations. Let $B_{m'}(\mu)$ be the unit ball in $S'_{m'} = S_m + S_{m'}$ with respect to $\|\cdot\|_\mu$, that is,
$B_{m'}(\mu) = \Big\{t\in S_m + S_{m'}:\ \|t\|_\mu^2 = \mathbb{E}\Big[\frac1n\sum_{i=1}^{n}t^2(\vec X_i)\Big] \le 1\Big\}.$
For each $m'\in\mathcal{M}_n$, we set $D'_{m'} = \dim(S'_{m'})$.

Claim 3. Let $x, \rho$ be numbers satisfying $x > \rho > 1$. If pen is chosen to satisfy
(7.5)  $\mathrm{pen}(m') \ge x^3\frac{D_{m'}}{n}\sigma_2^2 \quad \forall m'\in\mathcal{M}_n,$
then
(7.6)  $\|f_{|A} - \tilde f\|_n^2\,\mathbb{1}_{B_\rho^*} \le C_1(x,\rho)\big[\|f_{|A} - f_m\|_n^2 + 2\,\mathrm{pen}(m)\big] + \frac{x(x+\rho)}{x-\rho}\,n^{-2}W_n(\hat m),$
where $W_n(m')$ is defined by
$W_n(m') = \Big[\Big(\sup_{t\in B_{m'}(\mu)}\Big|\sum_{i=1}^{n}\varepsilon_i^* t(\vec X_i^*)\Big|\Big)^2 - x^2 n D'_{m'}\sigma_2^2\Big]_+$
for $m'\in\mathcal{M}_n$, and where $C_1(x,\rho) = (x+\rho)^2/(x-\rho)^2 > 1$.

Proof. The following inequalities hold on Bρ∗ . Starting from (7.1) we get

2 n
f̃ − fm X ∗
i
f|A − f̃2n ≤ f|A − fm 2n + f̃ − fm µ εi∗
n i=1 f̃ − fm µ
+ penm − penm̂
n
2  ∗i 
≤ f|A − fm 2n + f̃ − fm µ sup εi∗ tX
n t∈Bm̂µ i=1

+ penm − penm̂
Using the elementary inequality 2ab ≤ xa2 + x−1 b2 , which holds for any pos-
itive numbers a b, we have
 2
n
2 2 −1 2 −2
f|A − f̃n ≤ f|A − fm n + x f̃ − fm µ + n x sup ∗  ∗
εi tXi 
t∈Bm̂µ i=1

+penm − penm̂

On Bρ∗ ⊂ Bρ , we know that for all t ∈ m ∈n Sm + Sm , t2µ ≤ ρt2n , hence
 2
n

f|A − f̃2n ≤ f|A − fm 2n −1
+ x ρf̃ − fm 2n −2
+n x sup  ∗i 
εi∗ tX
t∈Bm̂µ i=1

+penm − penm̂
 2
≤ f|A − fm 2n + x−1 ρ f̃ − f|A n + f|A − fm n
 2
n

−2
+n x sup ∗  ∗
εi tXi  + penm − penm̂
t∈Bm̂µ i=1

by the triangular inequality. Since for all y > 0 (y is chosen at the end of the
proof)
 2
f̃ − f|A n + f|A − fm n ≤ 1 + yf̃ − f|A 2n + 1 + y−1 f|A − fm 2n 

we obtain

1+y
f|A − f̃2n 1−ρ
x
 n 2
1 + y−1 
≤ f|A − fm 2n 1+ρ −2
+n x sup  ∗i 
εi∗ tX
x t∈Bm̂µ i=1

+penm − penm̂

1 + y−1 Dm + Dm̂ 2
≤ f|A − fm 2n 1+ρ + penm + x3 σ2
x n
n 2 
x 
−penm̂ + sup  ∗i  − x2 nDm̂σ22 
εi∗ tX
n2 t∈Bm̂µ i=1 +

using that Dm̂ ≤ Dm̂ + Dm . Since the penalty function pen satisfies (7.5) for
all m ∈ n , we obtain that on Bρ∗

 
1+y 1+y−1
f|A − f̃2n 1−ρ 2
≤ f|A − fm n 1+ρ +2penm + xn−2 Wn m̂
x x

which gives the claim by choosing y = x − ρ/x + ρ. ✷

Claim 4. For $p \ge 2(1+2\bar p)/(1-4b)$ we have
$\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^{2\bar p}\,\mathbb{1}_{B_\rho^*}\big] \le C_1^{\bar p}(x,\rho)\big[\|f_{|A} - f_m\|_\mu^2 + 2\,\mathrm{pen}(m)\big]^{\bar p} + \frac{C}{n^{\bar p}}\Big[\big(\Phi_0 h_0^{-1/2}\big)^p\sigma_p^{2\bar p}\sum_{m'\in\mathcal{M}_n}D_{m'}^{-(p/2-\bar p)} + (2K)^p\frac{|\mathcal{M}_n|}{n^{(1/4-b)(p-2(1+2\bar p)/(1-4b))}}\Big],$
where $C$ is a constant that depends on $x, \rho, p, \bar p$.

Proof. By taking the power p̄ ≤ 1 of the right- and left-hand side of (7.6)
we obtain

f|A − f̃2np̄ |Bρ∗


p̄
p̄  p̄ xx + ρ
≤ C1 x ρ f|A − fm 2n + 2penm + Wp̄
n m̂
n2 x − ρ
p̄
p̄  p̄ xx + ρ 
≤ C1 x ρ f|A − fm 2n + 2penm + Wp̄
n m 
n2 x − ρ m ∈n

By taking the expectation on both sides of the inequality and using Jensen’s
inequality we obtain that
 
Ɛ f|A − f̃2np̄ |Bρ∗
p̄  p̄
(7.7) ≤ C1 x ρ f|A − fm 2µ + 2penm

xx + ρ p̄   p̄ 
+ Ɛ Wn m 
n2 x − ρ m ∈
n

We now use the following result,

Proposition 6. Under the assumptions of Theorem 1


  
Cp p̄−1 Ɛ Wp̄
n m 
m ∈n

2
 n

≤ Cp p̄ −1
Ɛ  sup  ∗i 
εi∗ tX
m ∈n t∈Bm µ i=1

* *  p̄ 
qn1 qn1 2
−x + 1− nDm σ22
qn qn
+
$ %p̄−p p̄  
−1/2 p 2p̄
≤ xp̄/3 x1/3 − 1 n ! 0 h0 σp
 p

 −p/2+p̄ qn n
× Dm + pp−2/4p−1−p̄
m ∈n n

The proof of the second inequality is delayed to Section 8, the first one is a
straightforward consequence of our choice of qn1 .
Using Proposition 6 we derive from (7.7) that
 
2p̄
Ɛ f|A − f̃n |Bρ∗
 p̄
p̄ 2 Cx p p̄  
−1/2 p 2p̄
(7.8) ≤ C 1 x ρ f| A − f m  µ + 2penm + ! 0 h 0 σp
np̄
 p 
 −p/2+p̄ qn n
× Dm + pp−2/4p−1−p̄
m ∈n n

Since A0 ≤ 1 and 1 ≤ 8n ≤ Knb we have


qp p p p bp
n ≤ 2 8n ≤ 2K n

hence by using the inequality pp − 2/4p − 1 ≥ p − 2/4 we get


p
q n n n
(7.9) ≤ 2Kp
npp−2/4p−1−p̄ n 1/4−bp−21+2 p̄/1−4b

Note that the power of n, 1/4 − bp − 21 + 2p̄/1 − 4b is positive for
p > 21 + 2p̄/1 − 4b. The result follows by combining (7.8) and (7.9). ✷

Claim 5. Under the assumptions of Theorem 1 we have
(7.10)  $\mathbb{P}\big(B_\rho^{*c}\big) \le \big(2M + e^{16/A_0}\big)n^{-2}$
and
(7.11)  $\mathbb{E}\big[\|f_{|A} - \tilde f\|_n^{2\bar p}\,\mathbb{1}_{B_\rho^{*c}}\big] \le \big(2M + e^{16/A_0}\big)^{1-2\bar p/p}\big(\|f_{|A}\|_\infty^{2\bar p} + \sigma_p^{2\bar p}\big)n^{-\bar p}.$

Proof. For the proof of (7.11) we refer to Baraud (2000) [see proof of Theorem 6.1, (49) with $q = \bar p$ and $\beta = 2$], noticing that $p \ge 2(1+2\bar p)/(1-4b) > 4\bar p/(2-\bar p)$ (since $\bar p \le 1$). By examining the proof, it is easy to check that if the constants belong to the $S_m$'s then $\|f_{|A}\|_\infty$ can be replaced by $\|f_{|A} - \int f_{|A}\,dx\|_\infty$. To prove (7.10) we use the following proposition, which is proved in Section 9.

Proposition 7. Under the assumptions of Theorem 1, for all $\rho > 1$,
(7.12)  $\mathbb{P}\big(B_\rho^{*c}\big) \le 2n^2\exp\Big(-A_0\frac{\Theta(n)\ln(n)}{q_n}\Big) + 2n\beta_{q_{n,1}}.$

Since $q_n = \mathrm{int}(A_0\Theta(n)/4) + 1 \le A_0\Theta(n)/4 + 1$ we have
(7.13)  $2n^2\exp\Big(-A_0\frac{\Theta(n)\ln(n)}{q_n}\Big) \le 2n^2\exp\Big(4\ln(n)\Big(-1 + \frac{4}{A_0\Theta(n)+4}\Big)\Big) \le \frac{2e^{16/A_0}}{n^2},$
$\Theta(n)$ being larger than $\ln(n)$. Now, set
(7.14)  $B = \Big[\frac{A_0\big((\sqrt x-1)^2\wedge1\big)}{8}\Big]^{-1} = \Big[\frac{h_0^2\big((\sqrt x-1)^2\wedge1\big)(1-1/\rho)^2}{640\,\Phi_1^4 h_1}\Big]^{-1}.$
Since $q_n \ge A_0\Theta(n)/4$, under Condition (4.2) we have
(7.15)  $2n\beta_{q_{n,1}} \le 2nM\Big[\Theta^{-1}\Big(\big((\sqrt x-1)^2\wedge1\big)\frac{Bq_n}{2}\Big)\Big]^{-3} \le 2nM\big[\Theta^{-1}(\Theta(n))\big]^{-3} = \frac{2M}{n^2}.$
Claim 5 is proved by combining (7.13) and (7.15). ✷

The proof of Theorem 1 is completed by combining Claim 4 and Claim 5.



8. Proof of Proposition 6. We decompose the proof into two steps:

Step 1. For all m ∈ n ,


 * *  p
n
 qn1 qn1
Ɛ sup  ∗i 
εi∗ tX − + 1− nDm σ2
t∈Bm µ i=1 qn qn
+
(8.1)
 
−1/2 p/2 p2 /4p−1
≤ Cpσpp np/2 + !0 h0 p qp
n Dm  n

Proof. Using the result of Claim 2, we have the following decomposition:


 
n 2n
   
 ∗i  =
εi∗ tX   ∗i  +
εi∗ tX  ∗i 
εi∗ tX
i=1 2=1 1
i∈I2
2
i∈I2

1 2
where for 2 = 1  2n , I2 = 2 − 1qn + 1  2 − 1qn + qn1  and I2 =
2 − 1qn + qn1 + 1  2qn = 2 − 1qn + qn1 + qn − qn1 . Denoting Ɛ∗1 =
 
2n qn1 Dm σ2 and Ɛ∗2 = 2n qn − qn1 Dm σ2 we have

 p
n

Ɛ sup  ∗i 
εi∗ tX − Ɛ∗1 − Ɛ∗2
t∈Bm µ i=1
+

 p 
2n
   ∗
≤ 2p−1 Ɛ  sup  ∗i  − Ɛ∗1  
εi tX 
t∈Bm µ 2=1 1
i∈I2 +

 p 
2n
   ∗
+2p−1 Ɛ  sup  ∗i  − Ɛ∗2  
εi tX 
t∈Bm µ 2=1 2
i∈I2 +

Since the two terms can be bounded in the same way, we only show how
to bound the first one. To do so, we use a moment inequality proved in Ba-
raud [(2000), Theorem 5.2, page 478]: consider the sequence of independent
$ %q
random vectors of  × k n1 , U ∗1   ∗2 defined by U
U  ∗2 = ε∗  X
 ∗  1 for
n i i i∈I2
2 = 1 $ 2n , and
%q consider m = gt / t ∈ Bm  µ the set of functions gt
mapping  × k n1 into  defined by

  q
n1

gt e1  x1   eqn1  xqn1  = ei txi 


i=1

By applying the moment inequality with the U  ∗2 ’s and the class of functions
m we find for all p ≥ 2,
 p 
2n
  
Cp−1 Ɛ  sup εi∗ tX  ∗i  − Ɛ∗1  
t∈Bm µ 2=1 1
i∈I2 +
  p 
2n  

  ∗
≤Ɛ  sup   ∗  
εi tXi 

(8.2) t∈Bm µ 2=1 
i∈I2
1 
  2 
2n
  
+Ɛp/2  sup  εi∗ tX ∗i  
t∈Bm µ 2=1 1
i∈I2

p/2
= Vp + V2
provided that
 
2n
 
(8.3) Ɛ  sup  ∗i  ≤ Ɛ∗1 =
εi∗ tX 2n qn1 Dm σ2
t∈Bm µ 2=1 1
i∈I2

Throughout this section, we denote by G2 t the random process


 ∗
G2 t = εi tX  ∗i 
1
i∈I2

which is repeatedly involved in our computations. It is worth noticing that it


is linear with respect to the argument t.
We first show that (8.3) is true. Let ϕj , j = 1  Dm  be an orthonormal
2
basis of Sm + Sm ⊂  A µ. For each t ∈ Bm  µ we have the following
decomposition
Dm  Dm 
 
(8.4) t= aj ϕ j  a2j ≤ 1
j=1 j=1

By Cauchy–Schwarz’s inequality we know that


   2
1/2
2n Dm  2n Dm  2n
    
G2 t = aj G2 ϕj  ≤  G2 ϕj  
2=1 j=1 2=1 j=1 2=1

Thus, by using Jensen’s inequality we obtain


    2
1/2
2n Dm  2n
  
Ɛ sup G2 t ≤  Ɛ G2 ϕj  
t∈Bm µ 2=1 j=1 2=1
(8.5)
 1/2
Dm  2n
 
= ƐG22 ϕj 
j=1 2=1

the random variables G2 ϕj 2=1 2n being independent and centered for each
j = 1  Dm . Now, for each 2 = 1  2n , we know that the laws of the
∗ ∗ 
vectors εi  Xi i∈I1 and εi  Xi i∈I1 are the same, therefore under Condition
2 2
HXε

 2 
     i  

 i  = qn1 σ22 
(8.6) Ɛ G22 ϕj  = Ɛ  εi ϕj X ≤ Ɛεi2 Ɛϕ2j X
i∈I21 1
i∈I2

which together with (8.5) proves (8.3).


Let us now bound Vp and V2 respectively.
The connection between  ∞ and  µ over Sm + Sm allows to write that
for all t ∈ Bm  µ,

−1/2
(8.7) t∞ ≤ !0 h0 Dm  × 1

Thus,
 
2n  p

  ∗
Vp = Ɛ  sup   ∗i  
εi tX
 
t∈Bm µ 2=1 1
i∈I2

 
2n
1 p−1  
≤ I2 Ɛ  sup εi∗ p  ∗i  p 
tX
t∈Bm µ 2=1 1
i∈I2

 
p−2 2n
p−1 −1/2  
≤ qn1 ! 0 h0 Dm  Ɛ  sup  ∗i 
εi∗ p t2 X
t∈Bm µ 2=1 1
i∈I2

Using (8.4) and Cauchy–Schwarz’s inequality we get


 
p−2 2n Dm 
p−1 −1/2   
Vp ≤ qn1 ! 0 h0 Dm  Ɛ  ∗i 
εi∗ p ϕ2j X
2=1 j=1 1
i∈I2

−1/2 p−2
≤ qp−1
n !0 h0  nDm p/2 σpp 

recalling that 2n qn1 ≤ 2n qn ≤ n. Since for p ≥ 2, p2 /4p − 1 ≥ 1 one also


has

−1/2 p 2
(8.8) Vp ≤ qp
n !0 h0  σpp Dm p/2 np /4p−1

We now bound V2 . A symmetrization argument [see Giné and Zinn (1984)]


gives
 
2n
 2
V2 = Ɛ sup G2 t
t∈Bm µ 2=1

 2n 
2n  
  
(8.9) ≤ sup Ɛ G22 t + 4Ɛ sup  ξ2 G2 t
2
t∈Bm µ 2=1 t∈Bm µ 2=1
  2n 
 
2
≤ nσ2 + 4Ɛ 
sup  ξ2 G2 t 
2
t∈Bm µ 2=1

where the random variables ξ2 ’s are i.i.d. centered random variables indepen-
 ∗ ’s and the ε∗ ’s, satisfying ξ1 = ±1 = 1/2. It remains to bound
dent of the X i i
the last term in the right-hand side of (8.9). To do so, we use a truncation
argument. We set M2 = maxi∈I1 εi∗ . For any c > 0, we have
2
  2n    2n 
   
Ɛ sup  ξ2 G2 t ≤Ɛ
2
sup  ξ2 G2 t|M2 ≤c 
2
t∈Bm µ 2=1 t∈Bm µ 2=1
(8.10)   2n 
 
+Ɛ sup  ξ2 G2 t|M2 >c 
2
t∈Bm µ 2=1

We apply a comparison theorem [Theorem 4.12, page 112 in Ledoux and Talagrand (1991)] to bound the first term of the right-hand side of (8.10): we know that, for each $t\in B_{m'}(\mu)$, the random variables $G_\ell(t)\mathbb{1}_{M_\ell\le c}$ are bounded by $B=q_{n,1}\Phi_0h_0^{-1/2}\sqrt{D_{m'}}\,c$ [using (8.7)] and are independent of the $\xi_\ell$'s. The function $x\mapsto x^2$ defined on the set $[-B,B]$ being Lipschitz with Lipschitz constant smaller than $2B$, we obtain ($\mathbb{E}_\xi$ denotes the conditional expectation with respect to the $\varepsilon_i^*$'s and the $X_i^*$'s)
$$\mathbb{E}_\xi\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell^2(t)\,\mathbb{1}_{M_\ell\le c}\bigg|\bigg]\le4B\,\mathbb{E}_\xi\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell(t)\,\mathbb{1}_{M_\ell\le c}\bigg|\bigg]\le4B\,\mathbb{E}_\xi\bigg[\bigg(\sum_{j=1}^{D_{m'}}\bigg(\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell(\varphi_j)\,\mathbb{1}_{M_\ell\le c}\bigg)^2\bigg)^{1/2}\bigg]\le4B\bigg(\sum_{j=1}^{D_{m'}}\sum_{\ell=1}^{\ell_n}G_\ell^2(\varphi_j)\bigg)^{1/2}.$$
We now decondition with respect to the random variables $\varepsilon_i^*$'s and $X_i^*$'s and, using (8.6), we get
$$\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell^2(t)\,\mathbb{1}_{M_\ell\le c}\bigg|\bigg]\le4q_{n,1}\,\Phi_0h_0^{-1/2}\,D_{m'}\,\sigma_2\,\sqrt{n}\,c\le4q_{n,1}^2\,\Phi_0^2h_0^{-1}\,D_{m'}\,\sigma_p\,\sqrt{n}\,c,\tag{8.11}$$
noticing that $q_{n,1}$ and $\Phi_0h_0^{-1/2}$ are both greater than 1.
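The conditional step above rests on the elementary bound $\mathbb{E}_\xi\big|\sum_\ell\xi_\ell v_\ell\big|\le\big(\sum_\ell v_\ell^2\big)^{1/2}$ for fixed reals $v_\ell$ and i.i.d. Rademacher signs $\xi_\ell$ (Jensen applied conditionally on the $\varepsilon_i^*$'s and $X_i^*$'s). The following Monte Carlo sketch, with arbitrary made-up numbers, simply illustrates that inequality; it is not taken from the paper.

```python
import numpy as np

# Illustrative Monte Carlo check of E_xi | sum_l xi_l v_l | <= ( sum_l v_l^2 )^{1/2}
# for fixed v_1,...,v_L and i.i.d. Rademacher signs xi_l.
rng = np.random.default_rng(0)
v = rng.normal(size=20)                               # plays the role of the fixed G_l(phi_j)'s
xi = rng.choice([-1.0, 1.0], size=(100_000, v.size))  # Rademacher signs
lhs = np.abs(xi @ v).mean()                           # Monte Carlo E_xi | sum_l xi_l v_l |
rhs = np.sqrt((v ** 2).sum())
print(lhs, rhs, lhs <= rhs)
```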
Now, we bound the second term of the right-hand side of (8.10). We have
$$\begin{aligned}
\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell^2(t)\,\mathbb{1}_{M_\ell>c}\bigg|\bigg]
&\le\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\sum_{\ell=1}^{\ell_n}G_\ell^2(t)\,\mathbb{1}_{M_\ell>c}\bigg]\le\mathbb{E}\bigg[\sum_{j=1}^{D_{m'}}\sum_{\ell=1}^{\ell_n}G_\ell^2(\varphi_j)\,\mathbb{1}_{M_\ell>c}\bigg]\\
&\le q_{n,1}\,\mathbb{E}\bigg[\sum_{j=1}^{D_{m'}}\sum_{\ell=1}^{\ell_n}M_\ell^2\,\mathbb{1}_{M_\ell>c}\sum_{i\in I_\ell^1}\varphi_j^2(X_i^*)\bigg]\le q_{n,1}\,c^{2-p}\sum_{\ell=1}^{\ell_n}\mathbb{E}\bigg[M_\ell^p\sum_{i\in I_\ell^1}\sum_{j=1}^{D_{m'}}\varphi_j^2(X_i^*)\bigg]\\
&\le q_{n,1}^2\,c^{2-p}\,\Phi_0^2h_0^{-1}\,D_{m'}\sum_{\ell=1}^{\ell_n}\mathbb{E}\big[M_\ell^p\big],
\end{aligned}$$
using (2.4). Lastly, since $M_\ell^p\le\sum_{i\in I_\ell^1}|\varepsilon_i^*|^p$, we get
$$\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell^2(t)\,\mathbb{1}_{M_\ell>c}\bigg|\bigg]\le q_{n,1}^2\,\Phi_0^2h_0^{-1}\,n\,D_{m'}\,\sigma_p^p\,c^{2-p}.\tag{8.12}$$
t∈Bm µ 2=1

By gathering (8.11) and (8.12) we obtain that, for all $c>0$,
$$\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\bigg|\sum_{\ell=1}^{\ell_n}\xi_\ell\,G_\ell^2(t)\bigg|\bigg]\le4q_{n,1}^2\,\Phi_0^2h_0^{-1}\,\sigma_p\,D_{m'}\,\big[\sqrt{n}\,c+n\,\sigma_p^{p-1}\,c^{2-p}\big].$$
We choose $c=\sigma_p\,n^{1/(2(p-1))}$, and thus from (8.9) we get
$$V_2=\mathbb{E}\bigg[\sup_{t\in B_{m'}(\mu)}\sum_{\ell=1}^{\ell_n}G_\ell^2(t)\bigg]\le n\sigma_2^2+8q_n^2\,\Phi_0^2h_0^{-1}\,\sigma_p^2\,D_{m'}\,n^{p/(2(p-1))},\tag{8.13}$$
which straightforwardly proves Step 1 by combining (8.2), (8.8) and (8.13). ✷
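The truncation level chosen above equates, up to constants, the two terms of the bracket; the short sketch below (with arbitrary illustrative values of $n$, $p$ and $\sigma_p$, not taken from the paper) verifies that with $c=\sigma_p\,n^{1/(2(p-1))}$ both $\sqrt n\,c$ and $n\,\sigma_p^{p-1}c^{2-p}$ equal $\sigma_p\,n^{p/(2(p-1))}$, which is how (8.13) is obtained.

```python
import numpy as np

# Sanity check of the truncation level: with c = sigma_p * n**(1/(2(p-1))),
# sqrt(n)*c and n*sigma_p**(p-1)*c**(2-p) coincide, both equal to sigma_p * n**(p/(2(p-1))).
for n, p, sigma_p in [(1000, 3.0, 2.0), (10_000, 4.0, 0.5)]:
    c = sigma_p * n ** (1 / (2 * (p - 1)))
    t1 = np.sqrt(n) * c
    t2 = n * sigma_p ** (p - 1) * c ** (2 - p)
    target = sigma_p * n ** (p / (2 * (p - 1)))
    print(np.allclose([t1, t2], target))
```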

Step 2. For all $x>1$, $m'\in\mathcal{M}_n$ and $2\bar p<p$,
$$n^{-\bar p}\,\mathbb{E}\Bigg[\Bigg(\sup_{t\in B_{m'}(\mu)}\bigg(\sum_{i=1}^n\varepsilon_i^*\,t(X_i^*)\bigg)^2-x\bigg(\sqrt{\frac{q_{n,1}}{q_n}}+\sqrt{1-\frac{q_{n,1}}{q_n}}\bigg)^2n\,D_{m'}\,\sigma_2^2\Bigg)_+^{\bar p}\Bigg]\le C(p,x)\,\big(\Phi_0h_0^{-1}\big)^p\,\sigma_p^{2\bar p}\,\Big[D_{m'}^{-(p/2-\bar p)}+q_n^p\,n^{\bar p-p(p-2)/(4(p-1))}\Big].$$

Proof. We set $Z_n(m')=\sup_{t\in B_{m'}(\mu)}\big|\sum_{i=1}^n\varepsilon_i^*\,t(X_i^*)\big|\ge0$ and
$$E^*=\bigg(\sqrt{\frac{q_{n,1}}{q_n}}+\sqrt{1-\frac{q_{n,1}}{q_n}}\bigg)\sqrt{n\,D_{m'}}\,\sigma_2\ge\sqrt{n\,D_{m'}}\,\sigma_2.$$
Since $x>1$, there exists $\eta>0$ such that $x=(1+\eta)^3$ (i.e., $\eta=x^{1/3}-1$). Thus, for all $\tau>0$,
$$\begin{aligned}
\mathbb{P}\Big(Z_n^2(m')\ge(1+\eta)^3E^{*2}+\tau\Big)
&\le\mathbb{P}\bigg(Z_n^2(m')\ge\Big[(1+\eta)E^*+\sqrt{\tau/(1+\eta^{-1})}\Big]^2\bigg)\\
&\le\mathbb{P}\Big(Z_n(m')-E^*\ge\eta E^*+\sqrt{\tau/(1+\eta^{-1})}\Big)\\
&\le\mathbb{P}\bigg(Z_n(m')-E^*\ge\sqrt{\eta^2E^{*2}+\tau/(1+\eta^{-1})}\bigg)\\
&\le\Big[\eta^2E^{*2}+\tau/(1+\eta^{-1})\Big]^{-p/2}\,\mathbb{E}\Big[\big(Z_n(m')-E^*\big)_+^p\Big]\\
&\le\bigg(\frac{x^{1/3}}{x^{1/3}-1}\bigg)^{p/2}\,\frac{\mathbb{E}\big[(Z_n(m')-E^*)_+^p\big]}{\big[(x^{1/3}-1)\,x^{1/3}\,n\,D_{m'}\,\sigma_2^2+\tau\big]^{p/2}},
\end{aligned}$$
using Markov's inequality. Now, for each $\bar p$ such that $2\bar p<p$, the integration with respect to the variable $\tau$ leads to
$$\begin{aligned}
\mathbb{E}\Big[\big(Z_n^2(m')-x\,E^{*2}\big)_+^{\bar p}\Big]
&=\int_0^{+\infty}\bar p\,\tau^{\bar p-1}\,\mathbb{P}\Big(Z_n^2(m')-x\,E^{*2}\ge\tau\Big)\,d\tau\\
&\le\bigg(\frac{x^{1/3}}{x^{1/3}-1}\bigg)^{p/2}\,\mathbb{E}\Big[\big(Z_n(m')-E^*\big)_+^p\Big]\int_0^{+\infty}\frac{\bar p\,\tau^{\bar p-1}}{\big[(x^{1/3}-1)\,x^{1/3}\,n\,D_{m'}\,\sigma_2^2+\tau\big]^{p/2}}\,d\tau\\
&\le\frac{p}{p-2\bar p}\,\frac{\big[x^{1/3}(x^{1/3}-1)\big]^{\bar p}\,\mathbb{E}\big[(Z_n(m')-E^*)_+^p\big]}{\big[(x^{1/3}-1)\,n\,D_{m'}\,\sigma_2^2\big]^{p/2-\bar p}},
\end{aligned}$$

and using Step 1, we get
$$\begin{aligned}
\mathbb{E}\Big[\big(Z_n^2(m')-x\,E^{*2}\big)_+^{\bar p}\Big]
&\le C\,\frac{\big[x^{1/3}(x^{1/3}-1)\big]^{\bar p}}{(x^{1/3}-1)^{p/2-\bar p}}\,\big(\Phi_0h_0^{-1/2}\big)^p\,\sigma_2^{2\bar p-p}\,\sigma_p^p\,n^{\bar p}\,\Big[D_{m'}^{-(p/2-\bar p)}+q_n^p\,D_{m'}^{\bar p}\,n^{-p(p-2)/(4(p-1))}\Big]\\
&\le C\,\frac{\big[x^{1/3}(x^{1/3}-1)\big]^{\bar p}}{(x^{1/3}-1)^{p/2-\bar p}}\,\big(\Phi_0h_0^{-1/2}\big)^p\,\sigma_p^{2\bar p}\,n^{\bar p}\,\Big[D_{m'}^{-(p/2-\bar p)}+q_n^p\,n^{\bar p-p(p-2)/(4(p-1))}\Big],
\end{aligned}$$
since $D_{m'}=\dim(S_m+S_{m'})\le n$. The constant $C$ depends on $p$ and $\bar p$. ✷
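The only analytic ingredient of the integration step is the tail integral bound $\int_0^{+\infty}\bar p\,\tau^{\bar p-1}(a+\tau)^{-p/2}\,d\tau\le\frac{p}{p-2\bar p}\,a^{\bar p-p/2}$ for $2\bar p<p$. The following sketch checks it numerically for one arbitrary choice of $(p,\bar p,a)$; the values are illustrative and not from the paper.

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of the tail integral used in Step 2: for 2*pbar < p and a > 0,
#   int_0^inf pbar * tau**(pbar-1) / (a + tau)**(p/2) dtau <= p/(p - 2*pbar) * a**(pbar - p/2).
p, pbar, a = 6.0, 2.0, 3.5
integral, _ = quad(lambda tau: pbar * tau ** (pbar - 1) / (a + tau) ** (p / 2), 0, np.inf)
bound = p / (p - 2 * pbar) * a ** (pbar - p / 2)
print(integral, bound, integral <= bound)
```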

It is now easy to prove Proposition 6 by summing up over $m'$ in $\mathcal{M}_n$.

9. Proof of Proposition 7. Since $\mathbb{P}\big(B_\rho^{*c}\big)=\mathbb{P}\big(B_\rho^c\cap B^*\big)+\mathbb{P}\big(B^{*c}\big)$ and since it is clear from Claim 2 that
$$\mathbb{P}\big(B^{*c}\big)\le\ell_n\big(\beta_{q_n-q_{n,1}}+\beta_{q_{n,1}}\big)\le2n\,\beta_{q_{n,1}},\tag{9.1}$$
the result holds if we prove


$$\mathbb{P}\big(B_\rho^c\cap B^*\big)\le2n\exp\bigg(-A_0^2\,\frac{\sqrt{n}\,\ln n}{q_n}\bigg).\tag{9.2}$$
In fact, we prove a more general result, namely,

$$\mathbb{P}\big(B_\rho^c\cap B^*\big)\le2D_n^2\exp\bigg(-\frac{h_0^2\,(1-1/\rho)^2\,n}{16\,h_1\,q_n\,L_\phi}\bigg),\tag{9.3}$$
where $L_\phi$ is a quantity specific to the orthonormal basis $(\phi_\lambda)_{\lambda\in\Lambda_n}$, defined as follows.
Let $(\phi_\lambda)_{\lambda\in\Lambda_n}$ be an $\mathbb{L}^2(A,dx)$-orthonormal basis of $\mathcal{S}_n$ and, as in Baraud (2001), define the quantities
$$V=\bigg(\bigg[\int_A\phi_\lambda^2(x)\,\phi_{\lambda'}^2(x)\,dx\bigg]^{1/2}\bigg)_{(\lambda,\lambda')\in\Lambda_n\times\Lambda_n},\qquad B=\big(\|\phi_\lambda\phi_{\lambda'}\|_\infty\big)_{(\lambda,\lambda')\in\Lambda_n\times\Lambda_n},$$
and, for any symmetric matrix $A=(A_{\lambda,\lambda'})$,
$$\bar\rho(A)=\sup_{\{(a_\lambda):\,\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda,\lambda'}|a_\lambda|\,|a_{\lambda'}|\,A_{\lambda,\lambda'}.$$
We set
$$L_\phi=\max\big\{\bar\rho^2(V),\,\bar\rho(B)\big\}.\tag{9.4}$$
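To fix ideas, the sketch below computes $V$, $B$, $\bar\rho$ and $L_\phi$ of (9.4) for the simplest localized basis, the regular histogram basis on $[0,1]$; this example is ours (not taken from the paper) and uses the fact that, for a symmetric matrix with nonnegative entries, the supremum defining $\bar\rho$ is attained at a nonnegative vector and equals the largest eigenvalue.

```python
import numpy as np

# Illustrative computation of L_phi of (9.4) for the regular histogram basis on [0,1]:
# phi_lambda = sqrt(D) * 1_{[lambda/D, (lambda+1)/D)}, lambda = 0, ..., D-1.
def rho_bar(A):
    # for symmetric matrices with nonnegative entries the sup defining rho_bar is the top eigenvalue
    return np.linalg.eigvalsh(A).max()

D = 8
# The supports are disjoint, so all cross products vanish off the diagonal.
V = np.sqrt(D) * np.eye(D)   # V[l, l'] = ( int phi_l^2 phi_l'^2 dx )^{1/2}
B = D * np.eye(D)            # B[l, l'] = || phi_l phi_l' ||_inf
L_phi = max(rho_bar(V) ** 2, rho_bar(B))
print(L_phi)                 # equals D, in line with L_phi <= Phi_1^4 * D_n (one may take Phi_1 = 1 here)
```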
Then, to complete the proof of Proposition 7, it remains to check that
$$L_\phi\le K\,\frac{n}{\sqrt{n}\,\ln n}\tag{9.5}$$
for some constant $K$ independent of $n$ (we shall show the result for $K=\Phi_1^4$). Under (H$_n$), Lemma 2 in Section 10 ensures that
$$L_\phi\le\Phi_1^4\,D_n,$$

which together with (4.1) leads to (9.5).


Now we prove inequality (9.3). First note that, if $\rho>1$,
$$\sup_{t\in\mathcal{S}_n\setminus\{0\}}\frac{\|t\|_\mu^2}{\|t\|_n^2}\ge\rho\iff\sup_{t\in\mathcal{S}_n\setminus\{0\}}\frac{-\nu_n(t^2)}{\|t\|_\mu^2}\ge1-\frac1\rho,$$
where $\nu_n(u)=(1/n)\sum_{i=1}^n\big[u(X_i)-\mathbb{E}_\mu(u)\big]$ denotes the centered empirical process. Then, for $\rho>1$,
$$\mathbb{P}^*\bigg(\sup_{t\in\mathcal{S}_n\setminus\{0\}}\frac{\|t\|_\mu^2}{\|t\|_n^2}\ge\rho\bigg)\le\mathbb{P}^*\bigg(\sup_{t\in B_n^\mu(0,1)}\big|\nu_n(t^2)\big|\ge1-\frac1\rho\bigg),$$
where we denote by $\mathbb{P}^*(A)$ the probability $\mathbb{P}(A\cap B^*)$, and by $B_n^\mu(0,1)=\{t\in\mathcal{S}_n:\ \|t\|_\mu\le1\}$.
For $t\in B_n^\mu(0,1)$, $t=\sum_{\lambda\in\Lambda_n}a_\lambda\phi_\lambda$ with $\sum_{\lambda\in\Lambda_n}a_\lambda^2\le h_0^{-1}$, and we have
$$\sup_{t\in B_n^\mu(0,1)}\big|\nu_n(t^2)\big|\le\sup_{\{\sum_\lambda a_\lambda^2\le1\}}h_0^{-1}\bigg|\sum_{(\lambda,\lambda')\in\Lambda_n^2}a_\lambda a_{\lambda'}\,\nu_n(\phi_\lambda\phi_{\lambda'})\bigg|\le\sup_{\{\sum_\lambda a_\lambda^2\le1\}}h_0^{-1}\sum_{(\lambda,\lambda')\in\Lambda_n^2}|a_\lambda|\,|a_{\lambda'}|\,\big|\nu_n(\phi_\lambda\phi_{\lambda'})\big|.$$
Let $x=h_0^2(1-1/\rho)^2/(16h_1L_\phi)$. Then, on the set $\big\{\forall(\lambda,\lambda')\in\Lambda_n^2\ /\ |\nu_n(\phi_\lambda\phi_{\lambda'})|\le2V_{\lambda,\lambda'}\sqrt{2h_1x}+2B_{\lambda,\lambda'}x\big\}$, we have
$$\begin{aligned}
\sup_{t\in B_n^\mu(0,1)}\big|\nu_n(t^2)\big|&\le2h_0^{-1}\Big[\sqrt{2h_1x}\,\bar\rho(V)+x\,\bar\rho(B)\Big]\\
&\le(1-1/\rho)\bigg[\frac1{\sqrt2}\bigg(\frac{\bar\rho^2(V)}{L_\phi}\bigg)^{1/2}+\frac{h_0(1-1/\rho)\,\bar\rho(B)}{8h_1L_\phi}\bigg]\\
&\le(1-1/\rho)\bigg[\frac1{\sqrt2}+\frac18\bigg]\le1-1/\rho.
\end{aligned}$$
The proof of inequality (9.3) is then achieved by using the following claim.

Claim 6. Let $(\phi_\lambda)_{\lambda\in\Lambda_n}$ be an $\mathbb{L}^2(A,dx)$-orthonormal basis of $\mathcal{S}_n$. Then, for all $x\ge0$ and all integers $q_n$, $1\le q_n\le n$,
$$\mathbb{P}^*\Big(\exists(\lambda,\lambda')\in\Lambda_n^2\ /\ \big|\nu_n(\phi_\lambda\phi_{\lambda'})\big|>2V_{\lambda,\lambda'}\sqrt{2h_1x}+2B_{\lambda,\lambda'}x\Big)\le2D_n^2\exp\bigg(-\frac{nx}{q_n}\bigg).$$

This implies that
$$\mathbb{P}\big(B_\rho^c\cap B^*\big)\le2D_n^2\exp\bigg(-\frac{h_0^2\,(1-1/\rho)^2\,n}{16\,h_1\,q_n\,L_\phi}\bigg),$$
and thus inequality (9.3) holds true.



Proof of Claim 6. Let $\nu_n^*(\phi_\lambda\phi_{\lambda'})=\nu_{n,1}^*(\phi_\lambda\phi_{\lambda'})+\nu_{n,2}^*(\phi_\lambda\phi_{\lambda'})$ be defined by
$$\nu_{n,k}^*(\phi_\lambda\phi_{\lambda'})=\frac1{2\ell_n}\sum_{l=0}^{\ell_n-1}Z_{l,k}^*(\phi_\lambda\phi_{\lambda'}),\qquad k=1,2,$$
where, for $0\le l\le\ell_n-1$,
$$Z_{l,k}^*(\phi_\lambda\phi_{\lambda'})=\frac1{q_n}\sum_{i\in I_l^k}\Big[\phi_\lambda(X_i^*)\,\phi_{\lambda'}(X_i^*)-\mathbb{E}_\mu(\phi_\lambda\phi_{\lambda'})\Big],\qquad k=1,2.$$
We have
$$\begin{aligned}
\mathbb{P}^*\Big(\big|\nu_n(\phi_\lambda\phi_{\lambda'})\big|>2V_{\lambda,\lambda'}\sqrt{2h_1x}+2B_{\lambda,\lambda'}x\Big)
&\le\mathbb{P}\Big(\big|\nu_{n,1}^*(\phi_\lambda\phi_{\lambda'})\big|>V_{\lambda,\lambda'}\sqrt{2h_1x}+B_{\lambda,\lambda'}x\Big)\\
&\quad+\mathbb{P}\Big(\big|\nu_{n,2}^*(\phi_\lambda\phi_{\lambda'})\big|>V_{\lambda,\lambda'}\sqrt{2h_1x}+B_{\lambda,\lambda'}x\Big)\\
&=P_1+P_2.
\end{aligned}$$
Now, we bound $P_1$ and $P_2$ by using Bernstein's inequality [see Lemma 8, page 366 in Birgé and Massart (1998)] applied to the independent variables $Z_{l,k}^*$, which satisfy $\|Z_{l,k}^*\|_\infty\le B_{\lambda,\lambda'}$ and $\mathbb{E}^{1/2}\big[(Z_{l,k}^*)^2\big]\le\sqrt{h_1}\,V_{\lambda,\lambda'}$. Then we obtain $P_1+P_2\le2\exp(-2x\ell_n)$, which proves Claim 6. ✷
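The next sketch makes the block decomposition of the proof concrete: the centered empirical sums are regrouped into $2\ell_n$ consecutive blocks of length $q_n$, the block averages playing the role of the $Z^*_{l,k}$'s. The decomposition and normalization follow the display above; the data are simulated and purely illustrative.

```python
import numpy as np

# Minimal sketch of the odd/even block decomposition behind Claim 6:
# nu_n = nu_n1 + nu_n2, each an average of l_n block sums Z_{l,k} built from q_n observations.
def block_decomposition(values, q_n):
    """values[i] = u(X_i) - E_mu[u]; assumes n = 2 * l_n * q_n for simplicity."""
    n = values.size
    l_n = n // (2 * q_n)
    blocks = values[: 2 * l_n * q_n].reshape(2 * l_n, q_n)
    Z = blocks.mean(axis=1)                 # Z_{l,k} = (1/q_n) * block sum
    nu1 = Z[0::2].sum() / (2 * l_n)         # odd blocks (k = 1)
    nu2 = Z[1::2].sum() / (2 * l_n)         # even blocks (k = 2)
    return nu1, nu2

rng = np.random.default_rng(1)
vals = rng.normal(size=240)                 # centered "observations" u(X_i) - E[u]
nu1, nu2 = block_decomposition(vals, q_n=10)
print(np.isclose(nu1 + nu2, vals.mean()))   # the two halves reconstruct nu_n
```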

10. Constraints on the dimension of $\mathcal{S}_n$. Most elements of the following proof can be found in Baraud (2001), but we recall them for the paper to be self-contained.

Let $\mathcal{S}_n$ be the linear subspace defined at the beginning of Section 4. We recall that $\mathcal{S}_n$ is generated by an orthonormal basis $(\phi_\lambda)_{\lambda\in\Lambda_n}$ and that $D_n=|\Lambda_n|$. In the previous section, the conditions on $\mathcal{S}_n$ [given by (H$_n$)] and on $D_n$ [given by (4.1)] are used to prove (9.5). To obtain (9.5) we proceed in two steps: first, under some particular characteristics of the basis $(\phi_\lambda)_{\lambda\in\Lambda_n}$ [in the case of Theorem 1 these characteristics are given by (H$_n$)], we state an upper bound on $L_\phi$ depending on $\Phi_1$ (or $\Phi_0$) and $D_n$. Second, starting from this bound, we specify a constraint on $D_n$ for (9.5) to hold. In the next lemma we consider various cases of linear spaces $\mathcal{S}_n$ (including those considered in Theorem 1) and provide upper bounds on $L_\phi$ according to the characteristics of one of their orthonormal bases.

Lemma 2. Let $L_\phi$ be the quantity defined by (9.4).

1. If $\mathcal{S}_n$ satisfies (2.2), then $L_\phi\le\Phi_0^2D_n^2$.
2. Under (H$_n$), $L_\phi\le\Phi_1^4D_n$. Moreover, (2.2) holds true with $\Phi_0^2=\Phi_1^3$.

We obtain from 1 and 2 that the constraints on $D_n$ given by (4.6) and (4.1) lead to (9.5).

Proof of 1. On the one hand, by the Cauchy–Schwarz inequality we have that
$$\bar\rho^2(V)\le\sum_{(\lambda,\lambda')\in\Lambda_n^2}\int\phi_\lambda^2\,\phi_{\lambda'}^2\le\sum_{\lambda'\in\Lambda_n}\int\bigg(\sum_{\lambda\in\Lambda_n}\phi_\lambda^2\bigg)\phi_{\lambda'}^2\le\bigg\|\sum_{\lambda\in\Lambda_n}\phi_\lambda^2\bigg\|_\infty\sum_{\lambda'\in\Lambda_n}\int\phi_{\lambda'}^2\le\Phi_0^2\,D_n^2,$$
using (2.4). On the other hand, by (2.2) we know that $\|\phi_\lambda\|_\infty\le\Phi_0\sqrt{D_n}\times1$. Thus, using similar arguments, one gets
$$\bar\rho(B)\le\Phi_0^2\,D_n^2,$$
which leads to $L_\phi\le\Phi_0^2D_n^2$. ✷

Proof of 2. We now prove that (2.2) holds true in case 2. Note that, for all $x$,
$$\sum_{\lambda\in\Lambda_n}\phi_\lambda^2(x)\le\Phi_1\,\max_{\lambda\in\Lambda_n}\|\phi_\lambda\|_\infty^2\le\Phi_1^3\,D_n;$$
thus, (2.4) holds true with $\Phi_0^2=\Phi_1^3$.


Under (H$_n$), $J(\lambda)=\{\lambda'\in\Lambda_n\ /\ \phi_\lambda\phi_{\lambda'}\not\equiv0\}$ satisfies $|J(\lambda)|\le\Phi_1$ and
$$\forall\lambda\in\Lambda_n,\ \forall\lambda'\in J(\lambda),\qquad\int\phi_\lambda^2\,\phi_{\lambda'}^2\le\Phi_1^2\,D_n.$$
Therefore,
$$\bar\rho(V)=\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda}\sum_{\lambda'\in J(\lambda)}|a_\lambda|\,|a_{\lambda'}|\bigg(\int\phi_\lambda^2\,\phi_{\lambda'}^2\bigg)^{1/2}\le\big(\Phi_1^2\,D_n\big)^{1/2}\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda}\sum_{\lambda'\in J(\lambda)}|a_\lambda|\,|a_{\lambda'}|=\big(\Phi_1^2\,D_n\big)^{1/2}\,W_n.$$
Besides, $\forall\lambda\in\Lambda_n,\ \forall\lambda'\in J(\lambda)$, $\|\phi_\lambda\phi_{\lambda'}\|_\infty\le\Phi_1^2D_n$, and thus
$$\bar\rho(B)=\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda}\sum_{\lambda'\in J(\lambda)}|a_\lambda|\,|a_{\lambda'}|\,\|\phi_\lambda\phi_{\lambda'}\|_\infty\le\Phi_1^2\,D_n\,W_n.$$
Finally,
$$W_n^2\le\sup_{\{\sum_\lambda a_\lambda^2=1\}}\bigg(\sum_{\lambda\in\Lambda_n}|a_\lambda|\sum_{\lambda'\in J(\lambda)}|a_{\lambda'}|\bigg)^2\le\Phi_1\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda\in\Lambda_n}\sum_{\lambda'\in J(\lambda)}a_{\lambda'}^2=\Phi_1\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda'\in\Lambda_n}\sum_{\lambda\in J(\lambda')}a_{\lambda'}^2=\Phi_1\sup_{\{\sum_\lambda a_\lambda^2=1\}}\sum_{\lambda'\in\Lambda_n}|J(\lambda')|\,a_{\lambda'}^2\le\Phi_1^2.$$

In other words, $\bar\rho(V)\le\Phi_1^2\sqrt{D_n}$ and $\bar\rho(B)\le\Phi_1^3D_n$, which gives the bound $L_\phi\le\Phi_1^4D_n$ since $\Phi_1\ge1$. ✷
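The step $W_n\le\Phi_1$ is a spectral fact: $W_n$ is the largest eigenvalue of the $0/1$ overlap matrix of the basis, which is bounded by its maximal row sum, that is, by the maximal number of overlapping basis functions. The sketch below checks this on a hypothetical banded overlap pattern; the pattern and sizes are made up for illustration only.

```python
import numpy as np

# Illustrative check of W_n <= Phi_1: W_n = largest eigenvalue of the 0/1 overlap
# matrix of the basis, bounded by its maximal row sum (the maximal cardinality of J(lambda)).
D, width = 30, 3                            # hypothetical: each phi_lambda overlaps its 'width' neighbours
J = np.zeros((D, D))
for i in range(D):
    for j in range(max(0, i - width), min(D, i + width + 1)):
        J[i, j] = 1.0                       # lambda' in J(lambda); symmetric by construction
W_n = np.linalg.eigvalsh(J).max()
Phi_1 = J.sum(axis=1).max()                 # maximal number of overlapping basis functions
print(W_n, Phi_1, W_n <= Phi_1 + 1e-10)
```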

Acknowledgments. The authors are deeply grateful to Lucien Birgé for a number of constructive suggestions and thank Pascal Massart for helpful comments.

REFERENCES

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory (B. N. Petrov and F. Csáki, eds.) 267–281. Akadémiai Kiadó, Budapest.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control 19 716–723.
Baraud, Y. (1998). Sélection de modèles et estimation adaptative dans différents cadres de
régression. Ph.D. thesis, Univ. Paris-Sud.
Baraud, Y. (2000). Model selection for regression on a fixed design. Probab. Theory Related Fields
117 467–493.
Baraud, Y. (2001). Model selection for regression on a random design. Preprint 01-10, DMA, Ecole
Normale Supérieure, Paris.
Barron, A. R. (1991). Complexity regularization with application to artificial neural networks.
In Proceedings of the NATO Advanced Study Institute on Nonparametric Functional
Estimation (G. Roussas, ed.) 561–576. Kluwer, Dordrecht.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39 930–945.
Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413.
Barron, A. R. and Cover, T. M. (1991). Minimum complexity density estimation. IEEE Trans.
Inform. Theory 37 1034–1054.
Berbee, H. C. P. (1979). Random walks with stationary increments and renewal theory. Math.
Centre Tract 112. Math. Centrum, Amsterdam.
Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics (D. Pollard, E. Torgersen and G. Yang, eds.) 55–87. Springer, New York.
Birgé, L. and Massart, P. (1998). Exponential bounds for minimum contrast estimators on sieves.
Bernoulli 4 329–375.
Cohen, A., Daubechies, I. and Vial, P. (1993). Wavelets on the interval and fast wavelet transforms. Appl. Comput. Harmon. Anal. 1 54–81.
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia.
DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation. Springer, New York.
Donoho, D. L. and Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. Ann.
Statist. 26 879–921.
Doukhan, P. (1994). Mixing: Properties and Examples. Springer, New York.
Doukhan, P., Massart, P. and Rio, E. (1995). Invariance principle for absolutely regular empirical
processes. Ann. Inst. H. Poincaré Probab. Statist. 31 393–427.
Duflo, M. (1997). Random Iterative Models. Springer, New York.
Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab. 12 929–
989.
Hoffmann, M. (1999). On nonparametric estimation in nonlinear AR(1)-models. Statist. Probab.
Lett. 44 29–45.
Kolmogorov, A. N. and Rozanov, Yu. A. (1960). On the strong mixing conditions for stationary Gaussian sequences. Theory Probab. Appl. 5 204–207.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces. Springer, New York.

Li, K. C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist. 15 958–975.
Mallows, C. L. (1973). Some comments on Cp . Technometrics 15 661–675.
Modha, D. S. and Masry, E. (1996). Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. Inform. Theory 42 2133–2145.
Modha, D. S. and Masry, E. (1998). Memory-universal prediction of stationary random processes. IEEE Trans. Inform. Theory 44 117–133.
Neumann, M. and Kreiss, J.-P. (1998). Regression-type inference in nonparametric autoregres-
sion. Ann. Statist. 26 1570–1613.
Pham, D. T. and Tran, L. T. (1985). Some mixing properties of time series models. Stochastic
Process. Appl. 19 297–303.
Polyak, B. T. and Tsybakov, A. (1992). A family of asymptotically optimal methods for choosing
the order of a projective regression estimate. Theory Probab. Appl. 37 471–481.
Rissanen, J. (1984). Universal coding, information, prediction and estimation. IEEE Trans. In-
form. Theory 30 629–636.
Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63 117–126.
Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68 45–54.
Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math. 126 505–
563.
Viennet, G. (1997). Inequalities for absolutely regular processes: application to density estima-
tion. Probab. Theory Related Fields 107 467–492.

Y. Baraud F. Comte
Ecole Normale Supérieure Laboratoire de Probabilités
DMA et Modèles Aléatoires
45 rue d’Ulm Boite 188
75230 Paris Cedex 05 Université Paris 6
France 4, place Jussieu
E-mail: [email protected] 75252 Paris Cedex 05
France

G. Viennet
Laboratoire de Probabilités
et Modèles Aléatoires
Boite 7012
Université Paris 7
2, place Jussieu
75251 Paris Cedex 05
France
