


Adaptive density estimation for stationary processes
Matthieu Lerasle

To cite this version:
Matthieu Lerasle. Adaptive density estimation for stationary processes. Mathematical Methods of Statistics, 2009, 18 (1), pp. 59-83. DOI: 10.3103/S1066530709010049. HAL Id: hal-00413692, https://hal.archives-ouvertes.fr/hal-00413692. Submitted on 4 Sep 2009.

Adaptive density estimation of stationary β-mixing and τ-mixing processes
Matthieu Lerasle∗

Abstract: We propose an algorithm to estimate the common density $s$ of a stationary process $X_1, \dots, X_n$. We suppose that the process is either β-mixing or τ-mixing. We provide a model selection procedure based on a generalization of Mallows' $C_p$ and we prove oracle inequalities for the selected estimator under a few prior assumptions on the collection of models and on the mixing coefficients. We prove that our estimator is adaptive over a class of Besov spaces, namely, we prove that it achieves the same rates of convergence as in the i.i.d. framework.

Key words: Density estimation, weak dependence, model selection.


2000 Mathematics Subject Classification: 62G07, 62M99.

1 Introduction
We consider the problem of estimating the unknown density s of P , the law of a random
variable X, based on the observation of n (possibly) dependent data X1 , ..., Xn with com-
mon law P . We assume that X is real valued, that s belongs to L2 (µ) where µ denotes the
Lebesgue measure on R and that s is compactly supported, say in [0, 1]. Throughout the
chapter, we consider least-squares estimators ŝm of s on a collection (Sm )m∈Mn of linear
subspaces of L2 (µ). Our final estimator is chosen through a model selection algorithm.
Model selection has received much interest in the last decades. When its final goal is pre-
diction, it can be seen more generally as the question of choosing between the outcomes
of several prediction algorithms. With such a general formulation, a very natural answer
is the following. First, estimate the prediction error for each model, that is $\|s - \hat{s}_m\|_2^2$.
Then, select the model which minimizes this estimate.
It is natural to think of the empirical risk as an estimator of the prediction error. This can
fail dramatically, because it uses the same data for building predictors and for comparing
them, making these estimates strongly biased for models involving a number of parameters
growing with the sample size.
In order to correct this drawback, penalization methods state that a good choice can be made by minimizing the sum of the empirical risk (measuring how well the algorithms fit the data) and some complexity measure of the algorithms (called the penalty). This method was first developed in the works of Akaike ([2], [1]) and Mallows [19].
In the context of density estimation, with independent data, Birgé & Massart [8] used penalties of order $L_n D_m/n$, where $D_m$ denotes the dimension of $S_m$ and $L_n$ is a constant depending on the complexity of the collection $\mathcal{M}_n$. They used Talagrand's inequality (see for example Talagrand [24] for an overview) to prove that this penalization procedure is

∗ Institut de Mathématiques (UMR 5219), INSA de Toulouse, Université de Toulouse, France

efficient i.e. the integrated quadratic risk of the selected estimator is asymptotically equiv-
alent to the risk of the oracle (see Section 2 for a precise definition). They also proved that
the selected estimator achieves adaptive rates of convergence over a large class of Besov
spaces. Moreover, they showed that some methods of adaptive density estimation like the
unbiased cross validation (Rudemo [23]) or the hard thresholded estimator of Donoho et
al. [16] can be viewed as special instances of penalized projection estimators.
More recently, Arlot [5] introduced new measures of the quality of penalized least-squares
estimators (PLSE). He proved pathwise oracle inequalities, that is deviation bounds for
the PLSE that are harder to prove but more informative from a practical point of view
(see also Section 2 for details).
When the process $(X_i)_{i=1,...,n}$ is β-mixing (Rozanov & Volkonskii [26] and Section 2), Talagrand's inequality can not be used directly. Baraud et al. [6] used Berbee's coupling lemma (see Berbee [7]) and Viennet's covariance inequality (Viennet [25]) to overcome this problem and build a model selection procedure in the regression problem. Then Comte
& Merlevède [13] used this algorithm to investigate the problem of density estimation for
a β-mixing process. They proved that under reasonable assumptions on the collection Mn
and on the coefficients β, one can recover the results of Birgé & Massart [8] in the i.i.d.
framework.
The main drawback of those results is that many processes, even simple Markov chains, are not β-mixing. For instance, if $(\epsilon_i)_{i\ge 1}$ is i.i.d. with marginal $\mathcal{B}(1/2)$, then the stationary solution $(X_i)_{i\ge 0}$ of the equation
$$X_n = \frac{1}{2}\left( X_{n-1} + \epsilon_n \right), \quad X_0 \text{ independent of } (\epsilon_i)_{i\ge 1}, \qquad (1)$$
is not β-mixing (Andrews [3]). More recently, Dedecker & Prieur [15] introduced new
mixing-coefficients, in particular the coefficients τ , φ̃ and β̃ and proved that many processes
like (1) happen to be τ , φ̃ and β̃-mixing. They proved a coupling lemma for the coefficient
τ and covariance inequalities for φ̃ and β̃. Gannaz & Wintenberger [18] used the covariance
inequality to extend the result of Donoho et al. [16] for the wavelet thresholded estimator
to the case of φ̃-mixing processes. They recovered (up to a log(n) factor) the adaptive
rates of convergence over Besov spaces.
In this article, we first investigate the case of β-mixing processes. We prove a pathwise
oracle inequality for the PLSE. We extend the result of Comte & Merlevède [13] under
weaker assumptions on the mixing coefficients. Then, we consider τ -mixing processes. The
problem is that the coupling result is weaker for the coefficient τ than for β. Moreover,
in order to control the empirical process we use a covariance inequality that is harder to
handle. Hence, the generalization of the procedure of Baraud et al. [6] to the framework
of τ -mixing processes is not straightforward. We recover the optimal adaptive rates of
convergence over Besov spaces (that is the same as in the independent framework) for
τ -mixing processes, which is new as far as we know.
The chapter is organized as follows. In Section 2, we give the basic material that we will
use throughout the chapter. We recall the definition of some mixing coefficients and we
state their properties. We define the penalized least-squares estimator (PLSE). Sections 3
and 4 are devoted to the statement of the main results, respectively in the β-mixing case
and in the τ -mixing case. In Section 5, we derive the adaptive properties of the PLSE.
Finally, Section 6 is devoted to the proofs. Some additional material has been reported in
the Appendix in Section 7.

2 Preliminaries
2.1 Notation.
Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space. Let $\mu$ be the Lebesgue measure on $\mathbb{R}$, let $\|.\|_p$ be the usual norm on $L^p(\mu)$ for $1 \le p \le \infty$. For all $y \in \mathbb{R}^l$, let $|y|_l = \sum_{i=1}^l |y_i|$. Denote by $\lambda_\kappa$ the set of $\kappa$-Lipschitz functions, i.e. the functions $t$ from $(\mathbb{R}^l, |.|_l)$ to $\mathbb{R}$ such that $\mathrm{Lip}(t) \le \kappa$, where
$$\mathrm{Lip}(t) = \sup\left\{ \frac{|t(x) - t(y)|}{|x - y|_l},\ x, y \in \mathbb{R}^l,\ x \ne y \right\}.$$
Let $BV$ and $BV_1$ be the sets of functions $t$ supported on $\mathbb{R}$ satisfying respectively $\|t\|_{BV} < \infty$ and $\|t\|_{BV} \le 1$, where
$$\|t\|_{BV} = \sup_{n \in \mathbb{N}^*}\ \sup_{-\infty < a_1 < \dots < a_n < \infty}\ \sum_{i=1}^{n-1} |t(a_{i+1}) - t(a_i)|.$$

2.2 Some measures of dependence.


2.2.1 Definitions and assumptions
Let Y = (Y1 , ..., Yl ) be a random variable defined on (Ω, A, P) with values in (Rl , |.|l ).
Let M be a σ-algebra of A. Let PY |M , PY1 |M be conditional distributions of Y and Y1
given M, let PY , PY1 be the distribution of Y and Y1 and let FY1 |M , FY1 be distribution
functions of PY1 |M and PY1 . Let B be the Borel σ-algebra on (Rl , |.|l ). Define now
 
$$\beta(\mathcal{M}, \sigma(Y)) = \mathbb{E}\left( \sup_{A \in \mathcal{B}} |P_{Y|\mathcal{M}}(A) - P_Y(A)| \right),$$
$$\tilde\beta(\mathcal{M}, Y_1) = \mathbb{E}\left( \sup_{x \in \mathbb{R}} |F_{Y_1|\mathcal{M}}(x) - F_{Y_1}(x)| \right),$$
$$\text{and, if } \mathbb{E}(|Y|) < \infty, \qquad \tau(\mathcal{M}, Y) = \mathbb{E}\left( \sup_{t \in \lambda_1} |P_{Y|\mathcal{M}}(t) - P_Y(t)| \right).$$

The coefficient β(M, σ(Y )) is the mixing coefficient introduced by Rozanov & Volkonskii
[26]. The coefficients β̃(M, Y1 ) and τ (M, Y ) have been introduced by Dedecker & Prieur
[15].
Let $(X_k)_{k \in \mathbb{Z}}$ be a stationary sequence of real valued random variables defined on $(\Omega, \mathcal{A}, \mathbb{P})$. For all $k \in \mathbb{N}^*$, the coefficients $\beta_k$, $\tilde\beta_k$ and $\tau_k$ are defined by
$$\beta_k = \beta(\sigma(X_i, i \le 0), \sigma(X_i, i \ge k)), \qquad \tilde\beta_k = \sup_{j \ge k} \tilde\beta(\sigma(X_p, p \le 0), X_j).$$
If $\mathbb{E}(|X_1|) < \infty$, for all $k \in \mathbb{N}^*$ and all $r \in \mathbb{N}^*$, let
$$\tau_{k,r} = \max_{1 \le l \le r} \frac{1}{l} \sup_{k \le i_1 < \dots < i_l} \tau(\sigma(X_p, p \le 0), (X_{i_1}, \dots, X_{i_l})), \qquad \tau_k = \sup_{r \in \mathbb{N}^*} \tau_{k,r}.$$
Moreover, we set $\beta_0 = 1$. In the sequel, the processes of interest are either β-mixing or τ-mixing, meaning that, for $\gamma = \beta$ or $\tau$, the γ-mixing coefficients $\gamma_k \to 0$ as $k \to +\infty$. For $p \in \{1, 2\}$, we define $\kappa_p$ as
$$\kappa_p = p \sum_{l=0}^{\infty} l^{p-1} \beta_l, \qquad (2)$$
where $0^0 = 1$, when the series are convergent. Besides, we consider two kinds of rates of
convergence to 0 of the mixing coefficients, that is for γ = β or τ ,
[AR] arithmetical γ-mixing with rate θ if there exists some $\theta > 0$ such that $\gamma_k \le (1+k)^{-(1+\theta)}$ for all $k$ in $\mathbb{N}$;
[GEO] geometrical γ-mixing with rate θ if there exists some $\theta > 0$ such that $\gamma_k \le e^{-\theta k}$ for all $k$ in $\mathbb{N}$.

2.2.2 Properties
Coupling
Let X be an Rl -valued random variable defined on (Ω, A, P) and let M be a σ-algebra.
Assume that there exists a random variable U uniformly distributed on [0, 1] and indepen-
dent of M ∨ σ(X). There exist two M ∨ σ(X) ∨ σ(U )-measurable random variables X1∗
and X2∗ distributed as X and independent of M such that

$$\beta(\mathcal{M}, \sigma(X)) = \mathbb{P}(X \ne X_1^*) \quad \text{and} \qquad (3)$$
$$\tau(\mathcal{M}, X) = \mathbb{E}\left( |X - X_2^*|_l \right). \qquad (4)$$


Equality (3) has been established by Berbee [7], Equality (4) has been established in
Dedecker & Prieur [15], Section 7.1.
Covariance inequalities
Let X, Y be two real valued random variables and let f, h be two measurable functions
from R to C. Then, there exist two measurable functions b1 : R → R and b2 : R → R with
E (b1 (X)) = E(b2 (Y )) = β(σ(X), σ(Y )) such that, for any conjugate p, q ≥ 1 (see Viennet
[25] Lemma 4.1)

$$|\mathrm{Cov}(f(X), h(Y))| \le 2\, \mathbb{E}^{1/p}\left( |f(X)|^p b_1(X) \right) \mathbb{E}^{1/q}\left( |h(Y)|^q b_2(Y) \right).$$

There exists a random variable b(σ(X), Y ) such that E(b(σ(X), Y )) = β̃(σ(X), Y ) and such
that, for all Lipschitz functions f and all h in BV (Dedecker & Prieur [15] Proposition 1)

$$|\mathrm{Cov}(f(X), h(Y))| \le \|h\|_{BV}\, \mathbb{E}\left( |f(X)|\, b(\sigma(X), Y) \right) \le \|h\|_{BV} \|f\|_\infty \tilde\beta(\sigma(X), Y). \qquad (5)$$

Comparison results
Let (Xk )k∈Z be a sequence of identically distributed real random variables. If the marginal
distribution satisfies a concentration condition $|F_X(x) - F_X(y)| \le K|x-y|^a$ with $a \le 1$, $K > 0$, then (Dedecker et al. [14] Remark 5.1 p 104)
$$\tilde\beta_k \le 2K^{1/(1+a)} \tau_{k,1}^{a/(a+1)} \le 2K^{1/(1+a)} \tau_k^{a/(a+1)}.$$
In particular, if $P_X$ has a density $s$ with respect to the Lebesgue measure $\mu$ and if $s \in L^2(\mu)$, we have from the Cauchy-Schwarz inequality
$$|F_X(x) - F_X(y)| = \left| \int 1_{[x,y]}\, s\, d\mu \right| \le \|s\|_2 \left( \int 1_{[x,y]}\, d\mu \right)^{1/2} = \|s\|_2 |x-y|^{1/2},$$
thus
$$\tilde\beta_k \le 2\|s\|_2^{2/3} \tau_k^{1/3}.$$
In particular, for any arithmetically [AR] τ-mixing process with rate $\theta > 2$, we have
$$\tilde\beta_k \le 2\|s\|_2^{2/3} (1+k)^{-(1+\theta)/3}. \qquad (6)$$

2.2.3 Examples
Examples of β-mixing and τ-mixing sequences are well known; we refer to the books of Doukhan [17] and Bradley [11] for examples of β-mixing processes and to the book of Dedecker et al. [14] or the articles of Dedecker & Prieur [15], Prieur [21], and Comte et al. [12] for examples of τ-mixing sequences. One of the most important examples is the following: a stationary, irreducible, aperiodic and positively recurrent Markov chain $(X_i)_{i \ge 1}$ is β-mixing. However, many simple Markov chains are not β-mixing but are τ-mixing. For instance, it has been known for a long time that if $(\epsilon_i)_{i \ge 1}$ are i.i.d. Bernoulli $\mathcal{B}(1/2)$, then a stationary solution $(X_i)_{i \ge 0}$ of the equation
$$X_n = \frac{1}{2}\left( X_{n-1} + \epsilon_n \right), \quad X_0 \text{ independent of } (\epsilon_i)_{i \ge 1},$$
is not β-mixing since $\beta_k = 1$ for any $k \ge 1$, whereas $\tau_k \le 2^{-k}$ (see Dedecker & Prieur [15] Section 4.1). Another advantage of the coefficient τ is that it is easy to compute in many situations (see Dedecker & Prieur [15] Section 4).

2.3 Collections of models


We observe n identically distributed real valued random variables X1 , ..., Xn with common
density s with respect to the Lebesgue measure µ. We assume that s belongs to the Hilbert
space L2 (µ) endowed with norm k.k2 . We consider an orthonormal system {ψj,k }(j,k)∈Λ
of L2 (µ) and a collection of models (Sm )m∈Mn indexed by subsets m ⊂ Λ for which we
assume that the following assumptions are fulfilled:
[M1] for all $m \in \mathcal{M}_n$, $S_m$ is the linear span of $\{\psi_{j,k}\}_{(j,k) \in m}$ with finite dimension $D_m = |m| \ge 2$, and $N_n = \max_{m \in \mathcal{M}_n} D_m$ satisfies $N_n \le n$;
[M2] there exists a constant $\Phi$ such that
$$\forall m, m' \in \mathcal{M}_n,\ \forall t \in S_m,\ \forall t' \in S_{m'}, \quad \|t + t'\|_\infty \le \Phi \sqrt{\dim(S_m + S_{m'})}\, \|t + t'\|_2;$$
[M3] $D_m \le D_{m'}$ implies that $m \subset m'$ and so $S_m \subset S_{m'}$.
As a consequence of the Cauchy-Schwarz inequality, we have
$$\left\| \sum_{(j,k) \in m \cup m'} \psi_{j,k}^2 \right\|_\infty = \sup_{t \in S_m + S_{m'},\, t \ne 0} \frac{\|t\|_\infty^2}{\|t\|_2^2}, \qquad (7)$$
see Birgé & Massart [8] p 58. Three examples are usually developed as fulfilling this set
of assumptions:
[T] trigonometric spaces: $\psi_{0,0}(x) = 1$ and, for all $j \in \mathbb{N}^*$, $\psi_{j,1}(x) = \cos(2\pi jx)$, $\psi_{j,2}(x) = \sin(2\pi jx)$; $m = \{(0,0), (j,1), (j',2),\ 1 \le j, j' \le J_m\}$ and $D_m = 2J_m + 1$;
[P] regular piecewise polynomial spaces: $S_m$ is generated by $r$ polynomials $\psi_{j,k}$ of degree $k = 0, ..., r-1$ on each subinterval $[(j-1)/J_m, j/J_m]$ for $j = 1, ..., J_m$, $D_m = rJ_m$, $\mathcal{M}_n = \{m = \{(j,k),\ j = 1, ..., J_m,\ k = 0, ..., r-1\},\ 1 \le J_m \le [n/r]\}$;
[W] spaces generated by dyadic wavelets with regularity $r$ as described in Section 4.
For a precise description of those spaces and their properties, we refer to Birgé & Massart
[8].

2.4 The estimator


Let (Xn )n∈Z be a real valued stationary process and let P denote the law of X0 . Assume
that P has a density s with respect to the Lebesgue measure µ and that s ∈ L2 (µ).

Let $(S_m)_{m \in \mathcal{M}_n}$ be a collection of models satisfying assumptions [M1]-[M3]. We define $S_n = \cup_{m \in \mathcal{M}_n} S_m$, and $s_m$ and $s_n$ the orthogonal projections of $s$ onto $S_m$ and $S_n$ respectively; let $\mathbb{P}$ be the joint distribution of the observations $(X_n)_{n \in \mathbb{Z}}$ and let $\mathbb{E}$ be the corresponding expectation. We define the operators $P_n$, $P$ and $\nu_n$ on $L^2(\mu)$ by
$$P_n t = \frac{1}{n} \sum_{i=1}^n t(X_i), \qquad Pt = \int t(x) s(x)\, d\mu(x), \qquad \nu_n(t) = (P_n - P)t.$$

All the real numbers that we shall introduce and which are not indexed by m or n are fixed
constants. In order to define the penalized least-squares estimator, let us consider on $\mathbb{R} \times S_n$ the contrast function $\gamma(x,t) = -2t(x) + \|t\|_2^2$ and its empirical version $\gamma_n(t) = P_n \gamma(., t)$. Minimizing $\gamma_n(t)$ over $S_m$ leads to the classical projection estimator $\hat{s}_m$ on $S_m$. Let $\hat{s}_n$ be the projection estimator on $S_n$. Since $\{\psi_{j,k}\}_{(j,k) \in m}$ is an orthonormal basis of $S_m$, one gets
$$\hat{s}_m = \sum_{(j,k) \in m} (P_n \psi_{j,k})\, \psi_{j,k} \qquad \text{and} \qquad \gamma_n(\hat{s}_m) = -\sum_{(j,k) \in m} (P_n \psi_{j,k})^2.$$
Now, given a penalty function $\mathrm{pen}: \mathcal{M}_n \to \mathbb{R}^+$, we define a selected model $\hat{m}$ as any element
$$\hat{m} \in \arg\min_{m \in \mathcal{M}_n} \left( \gamma_n(\hat{s}_m) + \mathrm{pen}(m) \right) \qquad (8)$$
and a PLSE is defined as any $\tilde{s} \in S_{\hat{m}} \subset S_n$ such that
$$\gamma_n(\tilde{s}) + \mathrm{pen}(\hat{m}) = \inf_{m \in \mathcal{M}_n} \left( \gamma_n(\hat{s}_m) + \mathrm{pen}(m) \right). \qquad (9)$$
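As a complement (ours, not taken from the paper), the following Python sketch implements the selection rule (8)-(9) for the simplest instance of collection [P], regular histograms on [0, 1] (degree-0 piecewise polynomials). The penalty constant c is a placeholder: it stands in for the calibrated constants of Theorems 3.1 and 4.1 and would have to be chosen in practice, for instance by the slope heuristic discussed in Section 3.

```python
import numpy as np

def histogram_plse(x, c=2.0):
    """Penalized least-squares density estimate on [0, 1].

    Models: regular histograms with J bins, J = 2..n (the degree-0 case of
    collection [P]).  With psi_j = sqrt(J) * 1_{[(j-1)/J, j/J)}, one has
    gamma_n(s_hat_m) = -sum_j (P_n psi_j)^2, and we use pen(m) = c * D_m / n,
    where c is an illustrative constant, not the theoretical one."""
    n = len(x)
    best = None
    for J in range(2, n + 1):
        counts, _ = np.histogram(x, bins=J, range=(0.0, 1.0))
        coeffs = np.sqrt(J) * counts / n              # P_n psi_j
        crit = -np.sum(coeffs ** 2) + c * J / n       # gamma_n(s_hat_m) + pen(m)
        if best is None or crit < best[0]:
            best = (crit, J, coeffs)
    _, J, coeffs = best
    grid = np.linspace(0.0, 1.0, 512, endpoint=False)
    bins = np.minimum((grid * J).astype(int), J - 1)
    return J, coeffs[bins] * np.sqrt(J)               # s_tilde = sum_j coeff_j psi_j

# Example: data drawn from a Beta(2, 5) density, supported in [0, 1].
rng = np.random.default_rng(1)
J_hat, s_tilde = histogram_plse(rng.beta(2.0, 5.0, size=1000))
print("selected number of bins:", J_hat)
```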

2.5 Oracle inequalities


An ideal procedure for estimation chooses an oracle

$$m_o \in \arg\min_{m \in \mathcal{M}_n} \left\{ \|s - \hat{s}_m\|_2 \right\}.$$

An oracle depends on the unknown s and on the data so that it is unknown in practice.
In order to validate our procedure, we try to prove:
- non asymptotic oracle inequalities for the PLSE:
$$\mathbb{E}\left( \|s - \tilde{s}\|_2^2 \right) \le L \inf_{m \in \mathcal{M}_n} \left\{ \mathbb{E}\left( \|s - \hat{s}_m\|_2^2 \right) + R(m, n) \right\}, \qquad (10)$$

for some constant $L \ge 1$ (as close to 1 as possible) and a remainder term $R(m,n) \ge 0$, possibly random, and small compared to $\mathbb{E}(\|s - \tilde{s}\|_2^2)$ if possible. This inequality compares the risk of the PLSE with the best deterministic choice of $m$. Since $\hat{m}$ is random, we prefer to prove a stronger form of oracle inequality:
$$\mathbb{E}\left( \|s - \tilde{s}\|_2^2 \right) \le L\, \mathbb{E}\left( \inf_{m \in \mathcal{M}_n} \left\{ \|s - \hat{s}_m\|_2^2 + R(m, n) \right\} \right), \qquad (11)$$

or, when it is possible, deviation bounds for the PLSE:
$$\mathbb{P}\left( \|s - \tilde{s}\|_2^2 > L \inf_{m \in \mathcal{M}_n} \left\{ \|s - \hat{s}_m\|_2^2 + R(m, n) \right\} \right) \le c_n, \qquad (12)$$

where typically $c_n \le C/n^{1+\gamma}$ for some $\gamma > 0$. Inequality (12) proves that, asymptotically, the risk $\|s - \tilde{s}\|_2^2$ is almost surely the one of the oracle. Let
$$\Omega = \left\{ \|s - \tilde{s}\|_2^2 > L \inf_{m \in \mathcal{M}_n} \left\{ \|s - \hat{s}_m\|_2^2 + R(m, n) \right\} \right\}.$$
We have
$$\mathbb{E}\left( \|s - \tilde{s}\|_2^2 \right) = \mathbb{E}\left( \|s - \tilde{s}\|_2^2\, 1_\Omega \right) + \mathbb{E}\left( \|s - \tilde{s}\|_2^2\, 1_{\Omega^c} \right).$$
It is clear that $\mathbb{E}\left( \|s - \tilde{s}\|_2^2\, 1_{\Omega^c} \right) \le L\, \mathbb{E}\left( \inf_{m \in \mathcal{M}_n} \left\{ \|s - \hat{s}_m\|_2^2 + R(m, n) \right\} \right)$. Moreover, we have $\|s - \tilde{s}\|_2^2 = \|s - s_{\hat m}\|_2^2 + \|s_{\hat m} - \tilde{s}\|_2^2 \le \|s\|_2^2 + \Phi^2 D_{\hat m} \le \|s\|_2^2 + \Phi^2 n$; thus, when (12) holds, we have
$$\mathbb{E}\left( \|s - \tilde{s}\|_2^2\, 1_{\Omega} \right) \le (\|s\|_2^2 + \Phi^2 n)\, c_n \le \frac{C}{n^\gamma}.$$
Therefore, inequality (12) implies
$$\mathbb{E}\left( \|s - \tilde{s}\|_2^2 \right) \le L\, \mathbb{E}\left( \inf_{m \in \mathcal{M}_n} \left\{ \|s - \hat{s}_m\|_2^2 + R(m, n) \right\} \right) + \frac{C}{n^\gamma}.$$
We can derive from these inequalities adaptive rates of convergence of the PLSE on Besov
spaces (see Birgé & Massart [8] for example). In order to achieve this goal, we only have
to prove a weaker form of oracle inequality where the remainder term R(m, n) ≤ LDm /n
for some constant L, for all the models m with sufficiently large dimension. This will be
detailed in Section 5.

3 Results for β-mixing processes


From now on, the letters κ, L and K, with various subscripts or superscripts, will denote constants which may vary from line to line. We shall use $L_{\cdot}$ to indicate more precisely the dependence on various quantities, especially those which are related to the unknown $s$.
In this section, we give the following theorem for β-mixing sequences. It can be seen as a
pathwise version of Theorem 3.1 in Comte & Merlevède [13].

Theorem 3.1 Consider a collection of models satisfying [M1 ], [M2 ] and [M3 ]. Assume
that the process (Xn )n∈Z is strictly stationary and arithmetically [AR] β-mixing with mix-
ing rate θ > 2 and that its marginal distribution admits a density s with respect to the
Lebesgue measure µ, with s ∈ L2 (µ).
Let κ1 be the constant defined in (2) and let s̃ be the PLSE defined by (9) with

$$\mathrm{pen}(m) = \frac{K \Phi^2 \kappa_1 D_m}{n}, \quad \text{where } K > 4.$$
Then, for all $\kappa > 2$ there exist $c_0 > 0$, $L_s > 0$, $\gamma_1 > 0$ and a sequence $\epsilon_n \to 0$, such that
$$\mathbb{P}\left( \|\tilde{s} - s\|_2^2 > (1 + \epsilon_n) \inf_{m \in \mathcal{M}_n,\, D_m \ge c_0 (\log n)^{\gamma_1}} \left\{ \|s - s_m\|_2^2 + \mathrm{pen}(m) \right\} \right) \le L_s \frac{(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}. \qquad (13)$$

Remark: The term $K\Phi^2\kappa_1$ is the same as in Theorem 3.1 of Comte & Merlevède [13] but with a constant $K > 4$ instead of 320. The main drawback of this result is that the penalty term involves the constant $\kappa_1$, which is unknown in practice. However, Theorem 3.1 ensures

that penalties proportional to the linear dimension of Sm lead to efficient model selection
procedures. Thus we can use this information to apply the slope heuristic algorithm intro-
duced by Birgé & Massart [9] in a Gaussian regression context and generalized by Arlot
& Massart [4] to more general M-estimation frameworks. This algorithm calibrates the
constant in front of the penalty term when the shape of an ideal penalty is available. The
result of Arlot & Massart is proven for independent sequences, in a regression framework,
but it can be generalized to the density estimation framework, for independent as well as
for β or τ dependent data. This result is beyond the scope of this chapter and will be
proved in chapter 4.
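To make this step concrete, here is a small Python sketch (ours; the function name and the dimension-jump rule are an illustrative variant, not the procedure of Arlot & Massart [4] verbatim). Given the empirical contrasts and the model dimensions, it scans a grid of penalty constants, locates the largest jump in the dimension of the selected model, and doubles the corresponding constant before the final selection.

```python
import numpy as np

def slope_heuristic_select(gamma, dims, c_grid=None):
    """Dimension-jump calibration of a penalty of the form pen(m) = c * D_m / n.

    gamma[i] is the empirical contrast gamma_n(s_hat_m) of model i and dims[i]
    the corresponding D_m / n.  For each candidate constant c we record the
    dimension of the selected model; the minimal constant sits at the largest
    jump of that curve, and the final constant is twice the minimal one."""
    gamma = np.asarray(gamma, dtype=float)
    dims = np.asarray(dims, dtype=float)
    if c_grid is None:
        c_max = 2.0 * np.ptp(gamma) / max(np.ptp(dims), 1e-12)
        c_grid = np.linspace(0.0, c_max, 200)
    selected_dim = np.array([dims[np.argmin(gamma + c * dims)] for c in c_grid])
    jump = np.argmax(np.abs(np.diff(selected_dim)))   # largest dimension jump
    c_hat = 2.0 * c_grid[jump + 1]                    # slope-heuristic constant
    m_hat = int(np.argmin(gamma + c_hat * dims))
    return m_hat, c_hat
```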
We have to consider the infimum in equation (13) over the models with sufficiently large
dimensions. However, as noted by Arlot [5] (Remark 9 p 43), we can take the infimum
over all the models in (13) if we add an extra term in (13). More precisely, we can prove
that, with probability larger than $1 - L_s (\log n)^{(\theta+2)\kappa}/n^{\theta/2}$,
$$\|\tilde{s} - s\|_2^2 \le (1 + \epsilon_n) \inf_{m \in \mathcal{M}_n} \left\{ \|s - \hat{s}_m\|_2^2 + \mathrm{pen}(m) \right\} + L \frac{(\log n)^{\gamma_2}}{n}, \qquad (14)$$
where $L > 0$ and $\gamma_2 > 0$.
Remark : The main improvement of Theorem 3.1 is that it gives an oracle inequality
in probability, with a deviation bound of order o(1/n) as soon as θ > 2 instead of θ > 3
in Comte & Merlevède [13]. Moreover, we do not require s to be bounded to prove our
result.
Remark: When the data are independent, the proof of Theorem 3.1 can be used to obtain that the estimator $\tilde{s}$ chosen with a penalty term of order $K\Phi D_m/n$ satisfies an oracle inequality such as (13). The main difference would be that $\kappa_1 = 1$, thus it can be used without a slope heuristic (even if this algorithm can also be used in this context to optimize the constant $K$), and the control of the probability would be $L_s e^{-\ln(n)^2/C_s}$ for some constants $L_s, C_s$ instead of $L_s (\log n)^{(\theta+2)\kappa} n^{-\theta/2}$ in our theorem.

4 Results for τ -mixing sequences


In order to deal with τ -mixing sequences, we need to specify the basis (ψj,k )(j,k)∈Λ .

4.1 Wavelet basis


Throughout this section, r is a real number, r ≥ 1 and we work with an r-regular or-
thonormal multiresolution analysis of L2 (µ), associated with a compactly supported scal-
ing function φ and a compactly supported mother wavelet ψ. Without loss of generality,
we suppose that the support of the functions φ and ψ is an interval [A1 , A2 ) where A1
and A2 are integers such that A2 − A1 = A ≥ 1. Let us recall that φ and ψ generate an
orthonormal basis by dilatations and translations.

For all $k \in \mathbb{Z}$ and $j \in \mathbb{N}^*$, let $\psi_{0,k}: x \mapsto \sqrt{2}\,\phi(2x - k)$ and $\psi_{j,k}: x \mapsto 2^{j/2} \psi(2^j x - k)$. The family $\{(\psi_{j,k})_{j \ge 0, k \in \mathbb{Z}}\}$ is an orthonormal basis of $L^2(\mu)$. Let us recall the following inequalities: for all $p \ge 1$, let $K_p = (\sqrt{2}\,\|\phi\|_p) \vee \|\psi\|_p$, $K_L = (2\sqrt{2}\,\mathrm{Lip}(\phi)) \vee \mathrm{Lip}(\psi)$, $K_{BV} = A K_L$. Then, for all $j \ge 0$, we have $\|\psi_{j,k}\|_\infty \le K_\infty 2^{j/2}$,
$$\left\| \sum_{k \in \mathbb{Z}} |\psi_{j,k}| \right\|_\infty \le A K_\infty 2^{j/2}, \qquad (15)$$
$$\mathrm{Lip}(\psi_{j,k}) \le K_L 2^{3j/2}, \qquad (16)$$
$$\|\psi_{j,k}\|_{BV} \le K_{BV} 2^{j/2}. \qquad (17)$$

We assume that our collection (Sm )m∈Mn satisfies the following assumption:
[W] dyadic wavelet generated spaces: let $J_n = [\log(n/2(A+1))/\log(2)]$ and, for all $J_m = 1, ..., J_n$, let
$$m = \{(0,k),\ -A_2 < k < 2 - A_1\} \cup \{(j,k),\ 1 \le j \le J_m,\ -A_2 < k < -A_1 + 2^j\}$$
and $S_m$ the linear span of $\{\psi_{j,k}\}_{(j,k) \in m}$. In particular, we have $D_m = (A-1)(J_m+1) + 2^{J_m+1}$ and thus $2^{J_m+1} \le D_m \le (A-1)(J_m+1) + 2^{J_m+1} \le A\, 2^{J_m+1}$.
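For readers who want to see the combinatorics of [W] explicitly, the short Python sketch below (ours) enumerates the index set m for a scaling/wavelet pair supported on [A1, A2) and checks the dimension formula and the bounds $2^{J_m+1} \le D_m \le A\,2^{J_m+1}$ numerically; the support endpoints are illustrative.

```python
def wavelet_model(J_m, A1, A2):
    """Index set m of assumption [W] for resolution J_m and support [A1, A2)."""
    m = [(0, k) for k in range(-A2 + 1, 2 - A1)]              # scaling level
    for j in range(1, J_m + 1):
        m += [(j, k) for k in range(-A2 + 1, -A1 + 2 ** j)]   # detail levels
    return m

A1, A2 = 0, 3                                # e.g. a support of length A = 3
A = A2 - A1
for J_m in range(1, 6):
    D_m = len(wavelet_model(J_m, A1, A2))
    assert D_m == (A - 1) * (J_m + 1) + 2 ** (J_m + 1)
    assert 2 ** (J_m + 1) <= D_m <= A * 2 ** (J_m + 1)
    print(J_m, D_m)
```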

4.2 The τ -mixing case


The following result proves that we keep the same rate of convergence for the PLSE based
on τ -mixing processes.

Theorem 4.1 Consider the collection of models [W]. Assume that (Xn )n∈Z is strictly
stationary and arithmetically [AR] τ -mixing with mixing rate θ > 5 and that its marginal
distribution admits a density s with respect to the Lebesgue measure µ. Let s̃ be the PLSE
defined by (9) with

$$\mathrm{pen}(m) = K A K_\infty K_{BV} \left( \sum_{l=0}^{\infty} \tilde\beta_l \right) \frac{D_m}{n}, \quad \text{where } K \ge 8.$$
Then there exist constants $c_0 > 0$, $\gamma_1 > 0$ and a sequence $\epsilon_n \to 0$ such that
$$\mathbb{E}\left( \|\tilde{s} - s\|_2^2 \right) \le (1 + \epsilon_n) \inf_{m \in \mathcal{M}_n,\, D_m \ge c_0 (\log n)^{\gamma_1}} \left\{ \|s - s_m\|_2^2 + \mathrm{pen}(m) \right\}. \qquad (18)$$

Remark : As in Theorem 3.1, the penalty term involves an unknown constant and we
have a condition on the dimension of the models in (18). However, the slope heuristic can
also be used in this context to calibrate the constant and a careful look at the proof shows
that we can take the infimum over all models m ∈ Mn provided that we increase the
constant K in front of the penalty term. Our result allows to derive rates of convergence
in Besov spaces for the PLSE that correspond to the rates in the i.i.d. framework (see
Proposition 5.2).
Remark : Theorem 4.1 gives an oracle inequality for the PLSE built on τ -mixing se-
quences. This inequality is not pathwise and the constants involved in the penalty term
are not optimal. This is due to technical reasons, mainly because we use the coupling
result (4) instead of (3). However, we recover the same kind of oracle inequality as in
the i.i.d. framework (Birgé and Massart [8]) under weak assumptions on the mixing co-
efficients since we only require arithmetical [AR] τ -mixing assumptions on the process
(Xn )n∈Z . This is the first result for these processes up to our knowledge.
Let us mention here Theorem 4.1 in Comte & Merlevède [13]. They consider α-mixing
processes (for a definition of the coefficient α and its properties, we refer to Rio [22]). They
make geometrical [GEO] α-mixing assumptions on the processes and consider penalties
of order L log(n)Dm /n to get an oracle inequality. This leads to a logarithmic loss in
the rates of convergence. They get the optimal rate under an extra assumption (namely
Assumption [Lip] in Section 3.2). There exist random processes that are τ -mixing and
not α-mixing (see Dedecker & Prieur [15]), however, the comparison of these coefficients
is difficult in general and our method can not be applied in this context.
The constants $c_0$, $\gamma_1$, $n_o$ are given at the end of the proof.

Remark: Inequality (18) can be improved under stronger assumptions on $s$. For example, when $s$ is bounded, we have $\tilde\beta_k \le C\sqrt{\tau_k}$. Under this assumption and $\theta > 3$, we can prove that the estimator $\tilde{s}$ satisfies the inequality
$$\mathbb{E}\left( \|\tilde{s} - s\|_2^2 \right) \le (1 + \epsilon_n) \inf_{m \in \mathcal{M}_n,\, D_m \ge c_0 (\log n)^{\gamma_1}} \left\{ \|s - s_m\|_2^2 + \mathrm{pen}(m) \right\} + \frac{(\log n)^{\kappa(\theta+1)}}{n^{(\theta-3)/2}}.$$
When $\theta < 5$, the extra term $(\log n)^{\kappa(\theta+1)}/n^{(\theta-3)/2}$ may be larger than the main term $\inf_{m \in \mathcal{M}_n,\, D_m \ge c_0 (\log n)^{\gamma_1}} \{\|s - s_m\|_2^2 + \mathrm{pen}(m)\}$. In this case, we do not know if our control remains optimal. On the other hand, Proposition 5.2 ensures that $\tilde{s}$ is adaptive over the class of Besov balls when $\theta \ge 5$.

5 Minimax results
5.1 Approximation results on Besov spaces
Besov balls.
Throughout this section, $\Lambda = \{(j,k),\ j \in \mathbb{N},\ k \in \mathbb{Z}\}$ and $\{\psi_{j,k},\ (j,k) \in \Lambda\}$ denotes an $r$-regular wavelet basis as introduced in Section 4.1. Let $\alpha, p$ be two positive numbers such that $\alpha + 1/2 - 1/p > 0$. For all functions $t \in L^2(\mu)$, $t = \sum_{(j,k) \in \Lambda} t_{j,k} \psi_{j,k}$, we say that $t$ belongs to the Besov ball $\mathcal{B}_{\alpha,p,\infty}(M_1)$ on the real line if $\|t\|_{\alpha,p,\infty} \le M_1$, where
$$\|t\|_{\alpha,p,\infty} = \sup_{j \in \mathbb{N}} 2^{j(\alpha + 1/2 - 1/p)} \left( \sum_{k \in \mathbb{Z}} |t_{j,k}|^p \right)^{1/p}.$$
It is easy to check that if $p \ge 2$, $\mathcal{B}_{\alpha,p,\infty}(M_1) \subset \mathcal{B}_{\alpha,2,\infty}(M_1)$, so that upper bounds on $\mathcal{B}_{\alpha,2,\infty}(M_1)$ yield upper bounds on $\mathcal{B}_{\alpha,p,\infty}(M_1)$.
Approximation results on Besov spaces.
We have the following result (Birgé & Massart [8] Section 4.7.1). Suppose that the support of $s$ equals $[0,1]$ and that $s$ belongs to the Besov ball $\mathcal{B}_{\alpha,2,\infty}(1)$; then, whenever $r > \alpha - 1$,
$$\|s - s_m\|_2^2 \le \frac{\|s\|_{\alpha,2,\infty}^2}{4(4^\alpha - 1)}\, 2^{-2 J_m \alpha} \le \frac{(2A)^{2\alpha} \|s\|_{\alpha,2,\infty}^2}{4(4^\alpha - 1)}\, D_m^{-2\alpha}. \qquad (19)$$

5.2 Minimax rates of convergence for the PLSE


We can derive from Theorems 3.1 and 4.1 adaptation results to unknown smoothness over
Besov Balls.

Proposition 5.1 Assume that the process $(X_n)_{n \in \mathbb{Z}}$ is strictly stationary and arithmetically [AR] β-mixing with mixing rate $\theta > 2$ and that its marginal distribution admits a density $s$ with respect to the Lebesgue measure $\mu$, that $s$ is supported in $[0,1]$ and that $s \in L^2(\mu)$. For all $\alpha, M_1 > 0$, the PLSE $\tilde{s}$ defined in Theorem 3.1 for the collection of models [W] satisfies
$$\forall \kappa > 2, \quad \sup_{s \in \mathcal{B}_{\alpha,2,\infty}(M_1)} \mathbb{P}\left( \|\tilde{s} - s\|_2^2 > L_{M_1,\alpha,\theta}\, n^{-2\alpha/(2\alpha+1)} \right) \le \frac{L_{M_1}(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}.$$

Proposition 5.2 Assume that the process $(X_n)_{n \in \mathbb{Z}}$ is strictly stationary and arithmetically [AR] τ-mixing with mixing rate $\theta > 5$ and that its marginal distribution admits a density $s$ with respect to the Lebesgue measure $\mu$, that $s$ is supported in $[0,1]$ and that $s \in L^2(\mu)$. For all $\alpha, M_1 > 0$, the PLSE $\tilde{s}$ defined in Theorem 4.1 satisfies
$$\sup_{s \in \mathcal{B}_{\alpha,2,\infty}(M_1)} \mathbb{E}\left( \|\tilde{s} - s\|_2^2 \right) \le L_{M_1,\alpha,\theta}\, n^{-2\alpha/(2\alpha+1)}.$$

Remark: Proposition 5.2 can be compared to Theorem 3.1 in Gannaz & Wintenberger
[18]. They prove near minimax results for the thresholded wavelet estimator introduced
by Donoho et al. [16] in a φ̃-dependent setting (for a definition of the coefficient φ̃, we
refer to Dedecker & Prieur [15]). Basically, with our notations, their result can be stated as follows: if $(X_n)_{n \in \mathbb{Z}}$ is $\tilde\phi$-mixing with $\tilde\phi_1(r) \le Ce^{-ar^b}$ for some constants $C, a, b$, then the thresholded wavelet estimator $\hat{s}$ of $s$ satisfies
$$\forall \alpha > 0,\ \forall p > 1, \quad \sup_{s \in \mathcal{B}_{\alpha,p,\infty}(M_1) \cap L^\infty(M)} \mathbb{E}\left( \|\hat{s} - s\|_2^2 \right) \le L_{M,M_1,\alpha,p} \left( \frac{\log n}{n} \right)^{2\alpha/(2\alpha+1)}.$$

The main advantage of their result is that they can deal with Besov balls with regularity
1 < p < 2. However, in the regular case, when p ≥ 2, we have been able to remove the extra
log n factor. Moreover, our result only requires arithmetical [AR] rates of convergence for
the mixing coefficients and we do not have to suppose that s is bounded.

6 Proofs.
6.1 Proofs of the minimax results.
Proof of Proposition 5.1:
Let $\alpha > 0$ and $M_1 > 0$ and assume that $s \in \mathcal{B}_{\alpha,2,\infty}(M_1)$. Let $\tilde{\mathcal{M}}_n = \{m \in \mathcal{M}_n,\ D_m > c_0 (\log n)^{\gamma_1}\}$. By Theorem 3.1, there exists a constant $L_\theta > 0$ such that
$$\mathbb{P}\left( \|\tilde{s} - s\|_2^2 > L_\theta \inf_{m \in \tilde{\mathcal{M}}_n} \left\{ \|s - s_m\|_2^2 + \frac{D_m}{n} \right\} \right) \le \frac{L_s (\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}. \qquad (20)$$
It appears from the proof of Theorem 3.1 that the constant $L_s$ depends only on $\|s\|_2$ and that it is a nondecreasing function of $\|s\|_2$, so that $L_s$ can be uniformly bounded over $\mathcal{B}_{\alpha,2,\infty}(M_1)$ by a constant $L_{M_1}$; hence, by (20),
$$\mathbb{P}\left( \|\tilde{s} - s\|_2^2 > L_\theta \inf_{m \in \tilde{\mathcal{M}}_n} \left\{ \|s - s_m\|_2^2 + \frac{D_m}{n} \right\} \right) \le \frac{L_{M_1} (\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}.$$
In particular, for a model $m$ in $\mathcal{M}_n$ with dimension $D_m$ such that
$$c_0 (\log n)^{\gamma_1} \le L_1 n^{1/(2\alpha+1)} \le D_m \le L_2 n^{1/(2\alpha+1)},$$
we have
$$\mathbb{P}\left( \|\tilde{s} - s\|_2^2 > L_\theta \left\{ \|s - s_m\|_2^2 + \frac{D_m}{n} \right\} \right) \le \frac{L_{M_1} (\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}.$$
Since $s$ belongs to $\mathcal{B}_{\alpha,2,\infty}(M_1)$, we can use Inequality (19) to get
$$\|s - s_m\|_2^2 \le L_{\alpha,M_1} D_m^{-2\alpha}.$$
Thus we obtain
$$\mathbb{P}\left( \|\tilde{s} - s\|_2^2 > L_{M_1,\alpha,\theta}\, n^{-2\alpha/(2\alpha+1)} \right) \le \frac{L_{M_1} (\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}.$$
Proof of Proposition 5.2:
Let $\alpha > 0$ and $M_1 > 0$ and assume that $s \in \mathcal{B}_{\alpha,2,\infty}(M_1)$. By Theorem 4.1, we have
$$\mathbb{E}\left( \|\tilde{s} - s\|_2^2 \right) \le L_\theta \inf_{m \in \tilde{\mathcal{M}}_n} \left\{ \|s - s_m\|_2^2 + \frac{D_m}{n} \right\}.$$
Inequality (19) leads to $\|s - s_m\|_2^2 \le L_{\alpha,M_1} D_m^{-2\alpha}$, so that, for a model $m$ in $\tilde{\mathcal{M}}_n$ with dimension $D_m$ such that
$$c_0 (\log n)^{\gamma_1} \le L_1 n^{1/(2\alpha+1)} \le D_m \le L_2 n^{1/(2\alpha+1)},$$
we find
$$\mathbb{E}\left( \|\tilde{s} - s\|_2^2 \right) \le L_{\theta,\alpha,M_1}\, n^{-2\alpha/(2\alpha+1)}.$$

6.2 Proof of Theorem 3.1:


For all $m_o$ in $\mathcal{M}_n$, we have, by definition of $\hat{m}$,
$$\gamma_n(\tilde{s}) + \mathrm{pen}(\hat{m}) \le \gamma_n(\hat{s}_{m_o}) + \mathrm{pen}(m_o),$$
$$P\gamma(\tilde{s}) + \nu_n\gamma(\tilde{s}) + \mathrm{pen}(\hat{m}) \le P\gamma(\hat{s}_{m_o}) + \nu_n\gamma(\hat{s}_{m_o}) + \mathrm{pen}(m_o),$$
$$P\gamma(\tilde{s}) - P\gamma(s) - 2\nu_n\tilde{s} + \mathrm{pen}(\hat{m}) \le P\gamma(\hat{s}_{m_o}) - P\gamma(s) - 2\nu_n\hat{s}_{m_o} + \mathrm{pen}(m_o).$$
Since, for all $t \in L^2(\mu)$, $P\gamma(t) - P\gamma(s) = \|t - s\|_2^2$, we have
$$\|s - \tilde{s}\|_2^2 \le \|s - \hat{s}_{m_o}\|_2^2 + \mathrm{pen}(m_o) - V(m_o) - (\mathrm{pen}(\hat{m}) - V(\hat{m})) - 2\nu_n(s_{m_o} - s_{\hat{m}}), \qquad (21)$$
where, for all $m \in \mathcal{M}_n$,
$$V(m) = 2\nu_n(\hat{s}_m - s_m) = 2 \sum_{(j,k) \in m} \nu_n^2(\psi_{j,k}).$$

This decomposition is different from the one used in Birgé & Massart [8] and in Comte &
Merlevède [13]. It allows to improve the constant in the oracle inequality in the β-mixing
case. Moreover, we choose to prove an oracle inequality of the form (12) for β-mixing
sequences, which allows to assume only θ > 2 instead of θ > 3. Let us now give a sketch
of the proof:
1. We build an event $\Omega_C$ with $\mathbb{P}(\Omega_C^c) \le p\beta_q$ such that, on $\Omega_C$, $\nu_n = \nu_n^*$, where $\nu_n^*$ is built with independent data. A suitable choice of the integers $p$ and $q$ leads to $p\beta_q \le C(\ln n)^r n^{-\theta/2}$.
2. We use the concentration inequality (7.4) of Birgé & Massart [8] for $\chi^2$-type statistics, derived from Talagrand's inequality. This allows us to find $p_1(m)$ such that, on an event $\Omega_1$ with $\mathbb{P}(\Omega_1^c \cap \Omega_C) \le L_{1,s} c_n$,
$$\sup_{m \in \mathcal{M}_n} \{V(m) - p_1(m)\} \le 0,$$
where $c_n < C(\ln n)^r n^{-\theta/2}$ and $L_{1,s}$ is some constant depending on $s$.
3. From Bernstein's inequality, we prove that, for all $m, m' \in \mathcal{M}_n$, there exists $p_2(m, m')$ such that, for all $\eta > 0$, on an event $\Omega_2$ with $\mathbb{P}(\Omega_2^c \cap \Omega_C) \le L_{2,s} c_n$,
$$\sup_{m, m' \in \mathcal{M}_n} \left\{ \nu_n(s_m - s_{m'}) - \frac{\eta}{2} p_2(m, m') - \frac{\|s_m - s_{m'}\|_2^2}{2\eta} \right\} \le 0.$$
Moreover, for all $m, m' \in \mathcal{M}_n$, $p_2(m, m') \le p_2(m, m) + p_2(m', m')$.

4. We have $\|s_{\hat{m}} - s_{m_o}\|_2^2 \le \|s_{\hat{m}} - s\|_2^2 + \|s - s_{m_o}\|_2^2$ because $s_{\hat{m}} - s_{m_o}$ is either the projection of $s_{\hat{m}} - s$ onto $S_{m_o}$ or the projection of $s - s_{m_o}$ onto $S_{\hat{m}}$. Taking $\mathrm{pen}(m) \ge p_1(m) + \eta p_2(m, m)$, we have, on $\Omega_1 \cap \Omega_2 \cap \Omega_C$,
$$\|s - \tilde{s}\|_2^2 \le \left( \|s - \hat{s}_{m_o}\|_2^2 - \frac{V(m_o)}{2} \right) + \mathrm{pen}(m_o) - \frac{V(m_o)}{2} - (\mathrm{pen}(\hat{m}) - p_1(\hat{m})) - (p_1(\hat{m}) - V(\hat{m})) - 2\nu_n(s_{m_o} - s_{\hat{m}}) \qquad (22)$$
$$\le \|s - s_{m_o}\|_2^2 + \mathrm{pen}(m_o) - \frac{V(m_o)}{2} - \eta p_2(\hat{m}, \hat{m}) + \eta p_2(\hat{m}, m_o) + \frac{\|s_{m_o} - s_{\hat{m}}\|_2^2}{\eta}, \qquad (23)$$
$$\left( 1 - \frac{1}{\eta} \right) \|s - \tilde{s}\|_2^2 \le \left( 1 + \frac{1}{\eta} \right) \|s - s_{m_o}\|_2^2 + \mathrm{pen}(m_o) + \eta p_2(m_o, m_o). \qquad (24)$$
In (23), we used that $V(m_o) = 2\|s_{m_o} - \hat{s}_{m_o}\|_2^2 \ge 0$. In (24), we used that $V(m_o) \ge 0$. Pythagoras' Theorem gives
$$\|s - \hat{s}_{m_o}\|_2^2 - \frac{V(m_o)}{2} = \|s - s_{m_o}\|_2^2 \quad \text{and} \quad \|s - s_{\hat{m}}\|_2^2 \le \|s - \tilde{s}\|_2^2.$$
Finally, we prove that we can choose $\eta = (\log n)^\gamma$, with $\gamma > 0$, such that $\eta p_2(m_o, m_o) = o(\mathrm{pen}(m_o))$, and we conclude the proof of Theorem 3.1 from the previous inequalities.
We decompose the proof into several claims corresponding to the previous steps.
Claim 1: For all $l = 0, ..., p-1$, let us define $A_l = (X_{2lq+1}, ..., X_{(2l+1)q})$ and $B_l = (X_{(2l+1)q+1}, ..., X_{(2l+2)q})$. There exist random vectors $A_l^* = (X_{2lq+1}^*, ..., X_{(2l+1)q}^*)$ and $B_l^* = (X_{(2l+1)q+1}^*, ..., X_{(2l+2)q}^*)$ such that, for all $l = 0, ..., p-1$:
1. $A_l^*$ and $A_l$ have the same law,
2. $A_l^*$ is independent of $A_0, ..., A_{l-1}, A_0^*, ..., A_{l-1}^*$,
3. $\mathbb{P}(A_l \ne A_l^*) \le \beta_q$,

the same being true for the variables Bl .

Proof of Claim 1 :
The proof is derived from Berbee’s lemma, we refer to Proposition 5.1 in Viennet [25]
for further details about this construction.
Hereafter, we assume that, for some $\kappa > 2$, $\sqrt{n}(\log n)^\kappa/2 \le p \le \sqrt{n}(\log n)^\kappa$ and, for the sake of simplicity, that $pq = n/2$, the modifications needed to handle the extra term when $q = [n/(2p)]$ being straightforward. Let $\Omega_C = \{\forall l = 0, ..., p-1,\ A_l = A_l^*,\ B_l = B_l^*\}$. We have
$$\mathbb{P}(\Omega_C^c) \le 2p\beta_q \le 2^{2+\theta} \frac{(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}.$$
Let us first deal with the quadratic term V (m).
Claim 2: Under the assumptions of Theorem 3.1, let $\epsilon > 0$ and $1 < \gamma < \kappa/2$. We define
$$L_1^2 = 2\Phi^2\kappa_1, \qquad L_2^2 = 8\Phi^{3/2}\sqrt{\kappa_2}, \qquad L_3 = 2\Phi\kappa(\epsilon), \quad \text{and}$$
$$L_{1,m} = 4\left( (1+\epsilon)L_1 + L_2\sqrt{\frac{(\log n)^\gamma}{D_m^{1/4}}} + \frac{L_3}{(\log n)^{\kappa-\gamma}} \right)^2. \qquad (25)$$
Then, we have
$$\mathbb{P}\left( \left\{ \sup_{m \in \mathcal{M}_n} \left\{ V(m) - \frac{L_{1,m} D_m}{n} \right\} \ge 0 \right\} \cap \Omega_C \right) \le L_{s,\gamma} \exp\left( -\frac{(\log n)^\gamma}{\sqrt{\|s\|_2}} \right),$$
where $L_{s,\gamma} = 2\sum_{D=1}^{\infty} \exp\left( -(\log D)^\gamma/\sqrt{\|s\|_2} \right)$. In particular, for all $r > 0$, there exists a constant $L'_{s,r}$, depending on $\|s\|_2$, such that
$$\mathbb{P}\left( \left\{ \sup_{m \in \mathcal{M}_n} \left\{ V(m) - \frac{L_{1,m} D_m}{n} \right\} \ge 0 \right\} \cap \Omega_C \right) \le \frac{L'_{s,r}}{n^r}.$$
Remark: When $(L_2/L_1)^8 (\log n)^{4(2\kappa-\gamma)} \le D_m \le n$, we have
$$L_{1,m} \le \left[ 1 + \epsilon + \left( 1 + \frac{\sqrt{2}\,\kappa(\epsilon)}{\sqrt{\kappa_1}} \right)(\log n)^{-(\kappa-\gamma)} \right]^2 4L_1^2.$$

Proof of Claim 2:
Let $P_n^*(t) = \sum_{i=1}^n t(X_i^*)/n$ and $\nu_n^*(t) = (P_n^* - P)t$; we have
$$V(m)\, 1_{\Omega_C} = 2 \sum_{(j,k) \in m} (\nu_n^*)^2(\psi_{j,k})\, 1_{\Omega_C}.$$
Let $B_1(S_m) = \{t \in S_m;\ \|t\|_2 \le 1\}$. For all $t \in B_1(S_m)$, let $\bar{t}(x_1, ..., x_q) = \sum_{i=1}^q t(x_i)/2q$ and, for all functions $g: \mathbb{R}^q \to \mathbb{R}$, let
$$P_{A,p}^*\, g = \frac{1}{p} \sum_{j=0}^{p-1} g(A_j^*), \qquad P_{B,p}^*\, g = \frac{1}{p} \sum_{j=0}^{p-1} g(B_j^*), \qquad \bar{P} g = \int g\, dP_A,$$
and $\bar\nu_{A,p}\, g = (P_{A,p}^* - \bar{P})g$, $\bar\nu_{B,p}\, g = (P_{B,p}^* - \bar{P})g$.
Now we have
$$\sum_{(j,k) \in m} (\nu_n^*)^2(\psi_{j,k}) \le 2 \sum_{(j,k) \in m} \bar\nu_{A,p}^2\, \bar\psi_{j,k} + 2 \sum_{(j,k) \in m} \bar\nu_{B,p}^2\, \bar\psi_{j,k}.$$

In order to handle these terms, we use Proposition 7.4, which is stated in Section 7. Taking
$$B_m^2 = \sum_{(j,k) \in m} \mathrm{Var}(\bar\psi_{j,k}(A_1)), \qquad V_m^2 = \sup_{t \in B_1(S_m)} \mathrm{Var}(\bar{t}(A_1)), \quad \text{and} \quad H_m^2 = \left\| \sum_{(j,k) \in m} (\bar\psi_{j,k})^2 \right\|_\infty,$$
we have
$$\forall x > 0, \quad \mathbb{P}\left( \left( \sum_{(j,k) \in m} \bar\nu_{A,p}^2\, \bar\psi_{j,k} \right)^{1/2} \ge \frac{(1+\epsilon)}{\sqrt{p}} B_m + V_m \sqrt{\frac{2x}{p}} + \kappa(\epsilon) \frac{H_m x}{p} \right) \le e^{-x}. \qquad (26)$$
In order to evaluate $B_m$, $V_m$ and $H_m$, we use Viennet's inequality (54). There exists a function $b$ such that, for all $p = 1, 2$, $P|b|^p \le \kappa_p$, where $\kappa_p$ is defined in (2), and, for all functions $t \in L^2(\bar{P})$,
$$\mathrm{Var}(\bar{t}(A_1)) \le \frac{1}{q} P(bt^2).$$

Thus
$$B_m^2 = \sum_{(j,k) \in m} \mathrm{Var}(\bar\psi_{j,k}(A_1)) \le \frac{1}{q} \sum_{(j,k) \in m} P(b\,\psi_{j,k}^2) \le \frac{\kappa_1}{q} \left\| \sum_{(j,k) \in m} \psi_{j,k}^2 \right\|_\infty.$$
From Assumption [M2], $\left\| \sum_{(j,k) \in m} \psi_{j,k}^2 \right\|_\infty \le \Phi^2 D_m$, thus
$$B_m^2 \le \frac{\Phi^2 \kappa_1 D_m}{q}. \qquad (27)$$
From Viennet's and Cauchy-Schwarz inequalities,
$$V_m^2 = \sup_{t \in B_1(S_m)} \mathrm{Var}(\bar{t}(A_1)) \le \sup_{t \in B_1(S_m)} \frac{P(bt^2)}{q} \le \sup_{t \in B_1(S_m)} \|t\|_\infty \frac{(Pt^2)^{1/2} (Pb^2)^{1/2}}{q}.$$
Since $t \in B_1(S_m)$, we have, by the Cauchy-Schwarz inequality,
$$(Pt^2)^{1/2} \le (\|t\|_\infty \|t\|_2 \|s\|_2)^{1/2} \le (\|t\|_\infty \|s\|_2)^{1/2}.$$
From Assumption [M2], we have $\|t\|_\infty \le \Phi\sqrt{D_m}$, and from Viennet's inequality $Pb^2 \le \kappa_2 < \infty$; thus we obtain
$$V_m^2 \le \Phi^{3/2} (\|s\|_2 \kappa_2)^{1/2} \frac{D_m^{3/4}}{q}. \qquad (28)$$
Finally, from Assumption [M2], we have, using the Cauchy-Schwarz inequality,
$$H_m^2 = \left\| \sum_{(j,k) \in m} \bar\psi_{j,k}^2 \right\|_\infty \le \frac{1}{4} \left\| \sum_{(j,k) \in m} \psi_{j,k}^2 \right\|_\infty \le \frac{\Phi^2 D_m}{4}. \qquad (29)$$

Let $y_n > 0$. We define
$$L_m = \left( (1+\epsilon)L_1 + L_2 \sqrt{\frac{(\log D_m)^\gamma + y_n}{2 D_m^{1/4}}} + L_3 \frac{(\log D_m)^\gamma + y_n}{2(\log n)^\kappa} \right)^2.$$
We apply Inequality (26) with $x = ((\log D_m)^\gamma + y_n)/\|s\|_2^{1/2}$ and the evaluations (27), (28) and (29). Recalling that $1/p \le 2/(\sqrt{n}(\log n)^\kappa)$, this leads to
$$\mathbb{P}\left( \sum_{(j,k) \in m} \bar\nu_{A,p}^2\, \bar\psi_{j,k} \ge \frac{L_m D_m}{n} \right) \le \exp\left( -\frac{(\log D_m)^\gamma}{\sqrt{\|s\|_2}} \right) \exp\left( -\frac{y_n}{\sqrt{\|s\|_2}} \right).$$
In order to give an upper bound on $H_m x$, we used that the support of $s$ is included in $[0,1]$, thus
$$1 = \|s\|_1 \le \|s\|_2.$$
The result follows by taking $y_n = (\log n)^\gamma \ge (\log D_m)^\gamma$.

Claim 3. We keep the notations $\kappa/2 > \gamma > 1$ and $L_2$ of the proof of Claim 2. For all $m, m' \in \mathcal{M}_n$ we take
$$L_{m,m'} = 4\left( L_2 \sqrt{\frac{(\log n)^\gamma}{(D_m \vee D_{m'})^{1/4}}} + \frac{4\Phi}{3(\log n)^{\kappa-\gamma}} \right)^2; \qquad (30)$$
we have, for all $\eta > 0$,
$$\mathbb{P}\left( \sup_{m, m' \in \mathcal{M}_n} \left\{ \nu_n^*(s_m - s_{m'}) - \frac{\|s_m - s_{m'}\|_2^2}{2\eta} - \frac{\eta}{2} \frac{L_{m,m'}(D_m \vee D_{m'})}{n} \right\} > 0 \right) \le L_{s,\gamma}\, e^{-\frac{(\log n)^\gamma}{\|s\|_2^{1/2}}},$$
with $L_{s,\gamma} = 2 \sum_{m,m' \in \mathcal{M}_n} e^{-\frac{(\log(D_m \vee D_{m'}))^\gamma}{\|s\|_2^{1/2}}}$.
Remark: The constant $L_{s,\gamma}$ is finite since, for all $x, y > 0$, $(\log(x \vee y))^\gamma \ge ((\log x)^\gamma + (\log y)^\gamma)/2$.
As in Claim 2, when $(L_2/L_1)^8 (\log n)^{4(2\kappa-\gamma)} \le D_m \le n$, we have
$$L_{m,m'} \le \left( 1 + \frac{2^{3/2}}{3\sqrt{\kappa_1}} \right)^2 (\log n)^{-2(\kappa-\gamma)}\, 4L_1^2.$$

Proof of Claim 3.
We keep the notations of the proof of Claim 2 and, for $m, m' \in \mathcal{M}_n$, let $t_{m,m'} = (s_m - s_{m'})/\|s_m - s_{m'}\|_2$. We use the inequality $2ab \le a^2\eta^{-1} + b^2\eta$, which holds for all $a, b \in \mathbb{R}$, $\eta > 0$. This leads to
$$\nu_n^*(s_m - s_{m'}) = \|s_m - s_{m'}\|_2\, \nu_n^*(t_{m,m'}) \le \frac{\|s_m - s_{m'}\|_2^2}{2\eta} + \frac{\eta}{2} \left( \nu_n^*(t_{m,m'}) \right)^2$$
$$= \frac{\|s_m - s_{m'}\|_2^2}{2\eta} + \frac{\eta}{2} \left( \bar\nu_{A,p}(\bar{t}_{m,m'}) + \bar\nu_{B,p}(\bar{t}_{m,m'}) \right)^2$$
$$\le \frac{\|s_m - s_{m'}\|_2^2}{2\eta} + \eta (\bar\nu_{A,p}(\bar{t}_{m,m'}))^2 + \eta (\bar\nu_{B,p}(\bar{t}_{m,m'}))^2.$$
Now, from Bernstein's inequality (see Section 7), we have
$$\forall x > 0, \quad \mathbb{P}\left( \bar\nu_{A,p}(\bar{t}_{m,m'}) > \sqrt{\frac{2\,\mathrm{Var}(\bar{t}_{m,m'}(A_1))\, x}{p}} + \frac{\|\bar{t}_{m,m'}\|_\infty x}{3p} \right) \le e^{-x}. \qquad (31)$$
From Viennet's and Cauchy-Schwarz inequalities, we have
$$\mathrm{Var}(\bar{t}_{m,m'}(A_1)) \le \frac{P(b\, t_{m,m'}^2)}{q} \le \frac{\|t_{m,m'}\|_\infty \sqrt{Pb^2\; P t_{m,m'}^2}}{q}.$$
Moreover,
$$Pb^2 \le \kappa_2, \qquad P t_{m,m'}^2 \le \|t_{m,m'}\|_\infty \|t_{m,m'}\|_2 \|s\|_2.$$
Since $t_{m,m'} \in S_m \cup S_{m'}$ and $\|t_{m,m'}\|_2 = 1$, we have, from Assumption [M2], $\|t_{m,m'}\|_\infty \le \Phi\sqrt{D_m \vee D_{m'}}$. Let $y_n > 0$. We apply Inequality (31) with $x = [(\log(D_m \vee D_{m'}))^\gamma + y_n]/\|s\|_2^{1/2}$. We define
$$\frac{L'_{m,m'}}{4} = \left( L_2 \sqrt{\frac{(\log(D_m \vee D_{m'}))^\gamma + y_n}{2(D_m \vee D_{m'})^{1/4}}} + \frac{4\Phi\,[(\log(D_m \vee D_{m'}))^\gamma + y_n]}{6(\log n)^\kappa} \right)^2;$$
we have
$$\mathbb{P}\left( \bar\nu_{A,p}(\bar{t}_{m,m'}) > \sqrt{\frac{L'_{m,m'}(D_m \vee D_{m'})}{4n}} \right) \le \exp\left( -\frac{(\log(D_m \vee D_{m'}))^\gamma}{\|s\|_2^{1/2}} \right) e^{-y_n/\|s\|_2^{1/2}}.$$
The result follows by taking $y_n = (\log n)^\gamma$ and using $2 \le D_m \le n$.
Conclusion of the proof:
Let $\eta > 0$ and $\mathrm{pen}'(m) \ge (L_{1,m} + \eta L_{m,m}) D_m/n$, where $L_{1,m}$ and $L_{m,m}$ are defined respectively by (25) and (30). From Claims 1, 2 and 3 and (24), we obtain that, for all $m_o$ and with probability larger than $1 - L_{s,\theta} (\log n)^{(\theta+2)\kappa} n^{-\theta/2}$,
$$\left( 1 - \frac{1}{\eta} \right) \|s - \tilde{s}\|_2^2 \le \left( 1 + \frac{1}{\eta} \right) \|s - s_{m_o}\|_2^2 + \mathrm{pen}'(m_o) + \eta L_{m_o,m_o} \frac{D_{m_o}}{n}. \qquad (32)$$
Assume that $D_m \ge (L_2/L_1)^8 (\log n)^{4(2\kappa-\gamma)}$; then we have, from the remarks following Claims 2 and 3,
$$L_{1,m} \le \left[ 1 + \epsilon + \left( 1 + \frac{\sqrt{2}\,\kappa(\epsilon)}{\sqrt{\kappa_1}} \right) (\log n)^{-(\kappa-\gamma)} \right]^2 4L_1^2 \quad \text{and}$$
$$L_{m,m} \le \left( 1 + \frac{2^{3/2}}{3\sqrt{\kappa_1}} \right)^2 (\log n)^{-2(\kappa-\gamma)}\, 4L_1^2.$$
Take $\eta = (\log n)^{\kappa-\gamma}$; we have $(L_{1,m_o} + \eta L_{m_o,m_o}) D_{m_o}/n \le C\,\mathrm{pen}(m_o)$. Fix $\epsilon > 0$ such that $[1+\epsilon]^2 < K/4$. Since $\kappa > \gamma$, for $n \ge n_o$, we have $L_{1,m} + \eta L_{m,m} \le K L_1^2$, thus inequality (13) follows from (32) as soon as $n > n_o$. We remove the condition $n > n_o$ by improving the constant $L_s$ in (13) if necessary.

6.3 Proof of Theorem 4.1.


The proof follows the previous one, the main difference is that the coupling lemma (Claim
1) as well as the covariance inequalities are much harder to handle in the τ -mixing case.
This leads to more technical computations to recover the results obtained in the β-mixing
case (see Claims 2, 3 and the proof of inequality (45)). We start with the decomposition
(21). As in the previous proof, the decomposition of the risk given in Birgé & Massart [8]
or in Comte & Merlevède [13] could be used. This leads to a loss in the constant in front
of the main term in (18) without avoiding any of the main difficulties. We divide the proof
in four claims.
Claim 1: For all $l = 0, ..., p-1$, let us denote by $A_l = (X_{2lq+1}, ..., X_{(2l+1)q})$ and $B_l = (X_{(2l+1)q+1}, ..., X_{(2l+2)q})$. There exist random vectors $A_l^* = (X_{2lq+1}^*, ..., X_{(2l+1)q}^*)$ and $B_l^* = (X_{(2l+1)q+1}^*, ..., X_{(2l+2)q}^*)$ such that, for all $l = 0, ..., p-1$:
• $A_l^*$ and $A_l$ have the same law,
• $A_l^*$ is independent of $A_0, ..., A_{l-1}, A_0^*, ..., A_{l-1}^*$,
• $\mathbb{E}(|A_l - A_l^*|_q) \le q\tau_q$,


the same being true for the variables Bl .
Proof of Claim 1 :
We use the same recursive construction as Viennet [25].
Let (δj )0≤j≤p−1 be a sequence of independent random variables uniformly distributed over
[0, 1] and independent of the sequence (Aj )0≤j≤p−1 . Let A∗0 = (X1∗ , ..., Xq∗ ) be the random
variable given by equality (4) for M = σ(Xi , i ≤ −q), A0 and δ0 .
Now suppose that we have built the variables A∗l for l < l′ . From equality (4) applied to
the σ-algebra σ(Al , A∗l , l < l′ ), Al′ and δl′ , there exists a random variable A∗l′ satisfying
the hypotheses of Claim 1.
We build in the same way the variables Bl∗ for all l = 0, ..., p − 1. 

We keep the notations $\nu_n^*$, $\bar\nu_{A,p}$, $\bar\nu_{B,p}$, $\bar{t}$ and $B_1(S_m)$ that we introduced in the proof of Theorem 3.1. As in the proof of Theorem 3.1, we assume that, for some $\kappa > 2$, $\sqrt{n}(\log n)^\kappa/2 \le p \le \sqrt{n}(\log n)^\kappa$ and, for the sake of simplicity, that $pq = n/2$, the modifications needed to handle the extra term when $q = [n/(2p)]$ being straightforward. We have
$$V(\hat{m}) = \sum_{(j,k) \in \hat{m}} \nu_n^2(\psi_{j,k}) \le 2 \sum_{(j,k) \in \hat{m}} (P_n - P_n^*)^2(\psi_{j,k}) + 2 \sum_{(j,k) \in \hat{m}} (\nu_n^*)^2(\psi_{j,k}). \qquad (33)$$

Claim 2: There exists a constant $L = L_{A, K_L, K_\infty, \kappa, \theta}$ such that
$$\mathbb{E}\left( \sum_{(j,k) \in \hat{m}} \left( (P_n - P_n^*)(\psi_{j,k}) \right)^2 \right) \le L \frac{(\log n)^{\kappa(\theta+1)}}{n^{(\theta-3)/2}}. \qquad (34)$$

Proof of Claim 2 :

   
X X
E (Pn − Pn∗ )2 (ψj,k ) ≤ E  sup (Pn − Pn∗ )2 (ψj,k )
m∈Mn
(j,k)∈m̂ (j,k)∈m
X X
E (Pn − Pn∗ )2 (ψj,k )


m∈Mn (j,k)∈m
p
2 X X
≤ (gA,m (j, k, l, l′ ) + gB,m (j, k, l, l′ ))
p2 ′
m∈Mn l,l =1

with
 
X
gm,A (j, k, l, l′ ) = E  ψ̄j,k (Al ) − ψ̄j,k (A∗l ) ψ̄j,k (Al′ ) − ψ̄j,k (A∗l′ )  .
 

(j,k)∈m

We develop this last term and we get, since

KL 23j/2 |x − y|q
ψ̄j,k (x) − ψ̄j,k (y) ≤
2q
 
X
gA,m (j, k, l, l′ ) ≤ E  ψ̄j,k (Al ) − ψ̄j,k (A∗l ) ψ̄j,k (Al′ ) − ψ̄j,k (A∗l′ ) 
(j,k)∈m
 
X Al′ − A∗l′ q
≤ E ψ̄j,k (Al ) − ψ̄j,k (A∗l ) KL 23j/2
2q
(j,k)∈m
 
KL τq  X 
≤ sup 23j/2 ψ̄j,k (x) − ψ̄j,k (y)
2 x,y∈Rq  
(j,k)∈m
Jm
( )
KL τq X 3j/2
X
≤ 2 sup |ψj,k (x) − ψj,k (y)|
4 x,y∈R
j=0 k∈Z

2 X
≤ AKL K∞ 22Jm τq since |ψj,k | ≤ AK∞ 2j/2
3
k∈Z ∞

We can do the same computations for the term gB,m (j, k, l, l′ ) and we obtain
 
(log n)κ(θ+1)
((Pn − Pn∗ )(ψj,k ))2  ≤ Lτq
X X
E 22Jm ≤ Lτq 22Jn ≤ L (θ−3)/2
.
j,k∈m̂ m∈M
n
n


The last inequality comes from q ≥ n/(2(log n)κ ) and Assumption [AR], the one before
comes from Assumption [W]. 

Claim 3. Let us keep the notations of Theorem 4.1, let $u = 6/(7+\theta) < 1/2$ and recall that $\kappa > 2$. Let $\gamma$ be a real number in $(1, \kappa/2)$. Let
$$L_1^2 = A K_\infty K_{BV} \sum_{l=0}^{\infty} \tilde\beta_l, \qquad L_2^2 = 2\Phi K_{BV}^u \sum_{k=0}^{\infty} \tilde\beta_k^u, \qquad L_3 = \kappa(\epsilon)\Phi,$$
$$\text{and} \quad L_{1,m} = 4(1+\epsilon)\left( (1+\epsilon)L_1 + L_2 \sqrt{\frac{(\log D_m)^\gamma}{D_m^{1/2-u}}} + L_3 \frac{(\log D_m)^\gamma}{(\log n)^\kappa} \right)^2. \qquad (35)$$
There exists a constant $L_s$ such that
$$\mathbb{E}\left( \sup_{m \in \mathcal{M}_n} \left\{ \sum_{(j,k) \in m} (\nu_n^*)^2(\psi_{j,k}) - \frac{L_{1,m} D_m}{n} \right\} \right) \le \frac{L_s}{n}.$$

Remark: The series $\sum_{l=0}^{\infty} \tilde\beta_l$ and $\sum_{k=0}^{\infty} \tilde\beta_k^u$ are convergent under our hypotheses on the coefficients $\tau$. Since $s \in L^2([0,1])$, we have from Inequality (6), $\tilde\beta_l \le 2\|s\|_2^{2/3}\tau_l^{1/3}$ and thus $\tilde\beta_l \le 2\|s\|_2^{2/3}(1+l)^{-(1+\theta)/3}$. The series $\sum_{k=0}^{\infty} \tilde\beta_k^u$ converges since $\theta > 5$ and
$$\frac{u(1+\theta)}{3} = \frac{2(1+\theta)}{7+\theta} = 1 + \frac{\theta - 5}{\theta + 7} > 1.$$
We use here $\tilde\beta$ instead of $\tau$, which allows us to take $L_1$ not depending on $\|s\|_2$.

Proof of Claim 3 :
As in the previous section we use the following decomposition
X X 2
(νn∗ )2 (ψj,k ) = ν̄A,p (ψ̄j,k ) + ν̄B,p (ψ̄j,k )
(j,k)∈m (j,k)∈m
X 2 X 2
≤ 2 ν̄A,p (ψ̄j,k ) + 2 ν̄B,p (ψ̄j,k )
(j,k)∈m (j,k)∈m

We treat both terms with Proposition 7.4 applied to the random variables (A∗l )0=1,..,p−1
and (Bl∗ )l=0,..,p−1 and to the class of functions (ψ̄j,k )(j,k)∈m . Let


X X
2
Var ψ̄j,k (A1 ) , Vm2 = 2 2

Bm = sup Var(t̄(A1 )), Hm =k ψ̄j,k k∞ .
(j,k)∈m t∈B1 (Sm ) (j,k)∈m

We have, from Proposition 7.4


 
r
(1 + ǫ) 2x Hm x 
s X
∀x > 0, P  (ν̄A,p )2 ψ̄j,k ≥ √ Bm + Vm + κ(ǫ) ≤ e−x . (36)
p p p
(j,k)∈m

Let us now evaluate Bm , Vm and Hm , we have
q
!
2 1 X X
Bm = Var ψj,k (Xi ) .
(2q)2
(j,k)∈m i=1

From (17) and (15) we have ∀j, k kψj,k kBV ≤ KBV 2j/2 and ∀j k k∈Z |ψj,k |k∞ ≤ AK∞ 2j/2 .
P
Thus, from Inequality (5)
q q
!
X X X X
Var ψj,k (Xi ) ≤ 2 (q + 1 − l)|Cov(ψj,k (X1 ), ψj,k (Xl ))|
(j,k)∈m i=1 (j,k)∈m l=1
q
Jm X X
X
≤ 2q kψj,k kBV E (|ψj,k (X1 )|b(σ(X1 ), Xl ))
j=0 k∈Z l=1
Jm
X X q
X
j/2
≤ 2KBV q 2 |ψj,k (X0 )| β̃l−1
j=0 k∈Z ∞ l=1

!
X
≤ 2q AK∞ KBV β̃l Dm .
l=0

The last inequality comes


P from Assumption [W].
Since L21 = AK∞ KBV ∞ l=0 β̃l we have

2 L21 Dm
Bm ≤ . (37)
2q

Let us deal with the term Vm2 . We have


q
2 X
Vm2 ≤ sup Var(t̄(A1 )) ≤ (q + 1 − k) sup |Cov(t(X1 ), t(Xk ))| (38)
t∈B1 (Sm ) (2q)2 t∈B1 (Sm )
k=1

From Inequality (5), we have

|Cov(t(X1 ), t(Xk ))| ≤ ktkBV ktk∞ β̃k−1 .

Since t belongs to B1 (Sm ), we have t = (j,k)∈m aj,k ψj,k , with (j,k)∈m a2j,k ≤ 1. Thus,
P P
by Cauchy-Schwarz inequality
l
X X l
X
|t(xi+1 ) − t(xi )| ≤ |aj,k | |ψj,k (xi+1 ) − ψj,k (xi )|
i=1 (j,k)∈m i=1
 1/2  !2 1/2
X X X
≤  a2j,k   |ψj,k (xi+1 ) − ψj,k (xi )| 
(j,k)∈m (j,k)∈m i
 1/2

kψj,k k2BV 
X
≤  ≤ KBV Dm .
(j,k)∈m

Thus ktkBV ≤ Dm KBV . From Assumption [M2 ], we have ktk∞ ≤ Φ Dm . Thus
3/2
|Cov(t(X1 ), t(Xk ))| ≤ ΦKBV β̃k−1 Dm . (39)

Moreover, we have by Cauchy-Schwarz inequality and [M2 ]
p
|Cov(t(X1 ), t(Xk ))| ≤ ktk∞ ktk2 ksk2 ≤ Φ ksk2 Dm . (40)

We use the inequality a ∧ b ≤ au b1−u with

3/2
p 6 1
a = ΦKBV β̃k−1 Dm , b = Φ ksk2 Dm , u = < .
7+θ 2
From (39) and (40), we derive that
 u
|Cov(t(X1 ), t(Xk ))| ≤ L′k Dm
1/2+u
where L′k = Φ KBV β̃k−1 ksk1−u
2 .

Pluging this inequality in (38), we obtain


1/2+u ∞
L22 ksk21−u Dm X
Vm2 ≤ since L22 = 2ΦKBV
u
β̃ku . (41)
4q
k=0

Finally, we have from hypothesis [M2 ]

2 1 X
2 Φ2 Dm
Hm ≤ ψj,k ≤ . (42)
4 4
(j,k)∈m ∞

1/2+u
Let y > 0 and let us apply Inequality (36) with x = ((log Dm )γ / ksk1−u 2 ) + (y/Dm ).
We have, from (37), (41) and (42)

  s !
L 2D L D (log D )γ y
1 m 3 m m
X
P (ν̄A,p )2 (ψ̄j,k ) > (1 + ǫ) + 1−u + 1/2+u
2pq 2p ksk 2 Dm
(j,k)∈m
!2
v 
1/2+u (log Dm )γ
u L2 ksk1−u Dm
u
(log Dm )γ y − 1−u −(1/2+u)
+ 2 2
+ 1/2+u  ≤ e ksk2 e−Dm y
.
t  
2pq 1−u
ksk2 Dm
√ √ √
Then, we use the inequality α + β ≤ α + β with
(log Dm )γ y
α= 1−u and β = 1/2+u
ksk2 Dm
and the inequality (a + b)2 ≤ (1 + ǫ)a2 + (1 + ǫ−1 )b2 with
s !r
(log Dm )γ L3 (log Dm )γ Dm
a = (1 + ǫ)L1 + L2 1/2−u
+ 1−u
Dm ksk2 (log n) κ n
 q 
1 1−u L3 y
and b = √ L2 ksk2 y + .
n (log n)κ Dm
u

Setting Lm = (1 + ǫ)a2 n/Dm , we obtain


 
−1
 q 2
X Lm Dm (1 + ǫ ) L3 y
P (ν̄A,p )2 (ψ̄j,k ) − > L2 ksk1−u
2 y+ 
n n (log n)κ Dm
u
(j,k)∈m
γ
− (log D1−u
m)
−(1/2+u)
≤e ksk
2 e−Dm y
.

Thus, for all y > 0,
   
γ −(1/2+u)
 X Lm Dm  Ls X − (log D1−u
m)
−Dm y
2 2  ksk
P  sup (ν̄A,p ) (ψ̄j,k ) − > (y + y ) ≤ e 2
m∈Mn  n  n
(j,k)∈m m∈Mn
 q 2
where Ls = 2(1 + ǫ−1 ) (L2 ksk1−u
2 )∨ L3 /((log 2)κ 2u ) . We can integrate this last
inequality to prove Claim 3.

Claim 4: We keep the notations of the previous Claims. Let
$$L_2(m, m') = 4\left( L_2 \sqrt{\frac{(\log(D_m \vee D_{m'}))^\gamma}{(D_m \vee D_{m'})^{1/2-u}}} + \frac{\Phi}{3(\log n)^{\kappa-\gamma}} \right)^2. \qquad (43)$$
Then there exists a constant $L_{s,\theta}$, depending on $\|s\|_2$ and $\theta$, such that, for all $\eta > 0$,
$$\mathbb{E}\left( \sup_{m,m' \in \mathcal{M}_n} \left\{ \nu_n(s_m - s_{m'}) - \frac{\|s_m - s_{m'}\|_2^2}{2\eta} - \eta \frac{L_2(m,m')(D_m \vee D_{m'})}{n} \right\} \right) \le \frac{\eta L_{s,\theta}}{n}.$$

Proof of Claim 4 :

( )!
ksm − sm′ k22 L2 (m, m′ )(Dm ∨ Dm′ )
E sup νn (sm − sm ) −
′ −η
m,m′ ∈Mn 2η n
!
≤E sup (Pn − Pn∗ )(sm − sm′ )
m,m′
( )!
ksm − sm′ k22 L2 (m, m′ )(Dm ∨ Dm′ )
+E sup νn∗ (sm − sm′ ) − −η . (44)
m,m′ 2η n

Since ∀l = 0, ..., p − 1, E (|Al − A∗l |q ) ≤ qτq , we have


!
X
E sup (Pn − Pn∗ )(sm − sm′ ) ≤ 2 E (|(s̄m − s̄m′ )(A1 ) − (s̄m − s̄m′ )(A∗1 )|)
m,m′ m,m′
X
≤ τq Lip(sm − sm′ ).
m,m′

When m ⊂ m′ , we have, for all x, y ∈ R, using Assumption [W],


Jm′ 2jX
−A1
|(sm − sm′ )(x − y)| X |ψj,k (x) − ψj,k (y)|
≤ |P ψj,k |
|x − y| |x − y|
j=Jm +1 k=−A2

Let us fix j ∈ [Jm + 1, Jm′ ], from Assumption [W], there is less than A indexes k ∈ Z
such that ψj,k (x) 6= 0, thus there is less than 2A indexes such that |ψj,k (x) − ψj,k (y)| =
6 0.
Hence
X |ψj,k (x) − ψj,k (y)|
|P ψj,k | ≤ 2A sup |P ψj,k |Lip(ψj,k )
|x − y| k∈Z
k∈Z

≤ 2A ksk2 KL 23j/2 .

√ √
Thus, Lip(sm − sm′ ) ≤ A ksk2 KL 823Jm′ /2 /( 8 − 1) and by Assumptions [W], [AR] and
the value of q,
!
(log n)κ(θ+1)+1
E sup (Pn − Pn∗ )(sm − sm′ ) ≤ Ls n3/2 (log n)τq ≤ Ls . (45)
m,m′ n(θ−2)/2

Let us deal with the other term in (44). We have, ∀η > 0

ksm − sm′ k22 η 2


νn∗ (sm − sm′ ) ≤ + ν̄A,p (t̄m,m′ ) + ν̄B,p (t̄m,m′ )
2η 2
ksm − sm k2

2
≤ + η(ν̄A,p (t̄m,m′ ))2 + η(ν̄B,p (t̄m,m′ ))2 (46)

where, as in the proof of Theorem 3.1, tm,m′ = (sm − sm′ )/ksm − sm′ k2 . We apply
Bernstein’s inequality to the function t̄m,m′ and the variables A∗l , we have
 s 
2Var(t̄m,m (A0 ))x kt̄m,m k∞ x 
′ ′
∀x > 0, P ν̄A,p (t̄m,m′ ) > + ≤ e−x . (47)
p 3p

We proceed as in the proof of Claim 3 to control this variance. We have, by stationarity


of the process (Xn )n∈Z ,
q−1
1 X
Var(t̄m,m′ (A0 )) = 2 (q − k)Cov(tm,m′ (X1 ), tm,m′ (Xk+1 )).
2q
k=0

From Inequality (5), we have

Cov(tm,m′ (X1 ), tm,m′ (Xk+1 )) ≤ tm,m′ BV


tm,m′ ∞
β̃k .

Let m △ m′ be the set of indexes that belong to m ∪ m′ but do not belong to m ∩ m′ . We


use the same computations as in the proof of Claim 3 to get
P
(j,k)∈m′ △m (P ψj,k )ψj,k
s
kψj,k k2BV ≤ KBV (Dm ∨ Dm′ ).
X
BV
tm,m′ BV
≤ ≤
ksm − sm′ k2
(j,k)∈m′ △m


Since tm,m′ ∞
= Φ Dm ∨ Dm′ , we have

Cov(tm,m′ (X1 ), tm,m′ (Xk+1 )) ≤ ΦKBV β̃k (Dm ∨ Dm′ )3/2 . (48)

Moreover, we have
p
Cov(tm,m′ (X1 ), tm,m′ (Xk+1 )) ≤ tm,m′ tm,m′ ksk2 ≤ Φ ksk2 ′ ).
(Dm ∨ Dm (49)
∞ 2

Thus, using a ∧ b ≤ au b1−u with


6 1
a = ΦKBV β̃k (Dm ∨ Dm′ )3/2 , b = Φ ksk2
p
(Dm ∨ Dm′ ), and u = < ,
7+θ 2
we have
u
Cov(tm,m′ (X1 ), tm,m′ (Xk+1 )) ≤ ΦKBV β̃ku ksk1−u
2 (Dm ∨ Dm′ )1/2+u .

Thus

!
(Dm ∨ Dm′ )1/2+u
ksk1−u
X
u
Var(t̄m,m′ (A0 )) ≤ ΦKBV β̃ku 2 . (50)
2q
k=0

Moreover
1 1 p ′ .
kt̄m,m′ k∞ ≤ ktm,m′ k∞ ≤ Φ Dm ∨ Dm (51)
2 2
Now, we use (47) with x = (log(Dm ∨ Dm′ ))γ / ksk1−u 2 + y/(Dm ∨ Dm′ )1/2+u . From (50)
and (51), we have for all y > 0,
 v !
1−u
u
u (Dm ∨ Dm′ )1/2+u ksk y
2
P ν̄A,p (t̄m,m′ ) > L2 t (log(Dm ∨ Dm′ ))γ +
2pq (Dm ∨ Dm′ )1/2+u
p !!
Φ Dm ∨ Dm ′ (log(Dm ∨ Dm′ ))γ y
+ +
6p ksk1−u
2
(Dm ∨ Dm′ )1/2+u
(log(Dm ∨D ′ ))γ y
− m −
ksk1−u (Dm ∨D ) 1/2+u
≤e 2 e m′ .
√ √ √
Now we use the inequality a+b≤ a+ b with

ksk1−u y
a = (log(Dm ∨ Dm′ ))γ and b = 2
(Dm ∨ Dm′ )1/2+u

and we obtain, using Assumption [M1 ]


r !
L2 (m, m′ )(Dm ∨ Dm
′ ) Ls √
P ν̄A,p t̄m,m′ − > √ ( y + y)
n n
(log(Dm ∨D ′ ))γ
− m
−(1/2+u) y
ksk1−u
≤e 2 e−(Dm ∨Dm′ ) ,

with s !2
(log(Dm ∨ Dm′ ))γ Φ(log(Dm ∨ Dm′ ))γ
L2 (m, m′ ) = L2 + ,
(Dm ∨ Dm′ )1/2−u 3(log n)κ
Φ
q
and Ls = L2 ksk1−u
2 ∨ .
3(log 2)κ 2u
Thus, we obain

L2 (m, m′ )(Dm ∨ Dm
′ ) L2s
 
2 2
P (ν̄A,p t̄m,m′ ) > 2 + 4 (y + y )
n n
(log(Dm ∨Dm′ ))γ y
− 1−u −
ksk2 (Dm ∨Dm′ )1/2+u
≤e .

The same result holds for ν̄B,p t̄m,m′ . Thus we obtain from (46)

ksm − sm′ k22 L2 (m, m′ )(Dm ∨ Dm


′ ) L2
 
P νn∗ (sm − sm′ ) ≥ + 4η + 8η s (y + y 2 )
2η n n
(log(Dm ∨D ′ ))γ y
− m −
ksk1−u (Dm ∨D ) 1/2+u
≤ 2e 2 m′ .

We deduce that
ksm − sm′ k22 L2 (m, m′ )(Dm ∨ Dm ′ )

P ∃m, m′ ∈ Mn , νn∗ (sm − sm′ ) − − 4η
2η n
(log(Dm ∨D ′ ))γ
!
y
L2 m

− −
ksk1−u
X 1/2+u
≥ 8η s (y + y 2 ) ≤ 2 e 2 e (Dm ∨Dm′ ) .
n ′m,m ∈Mn

We integrate this last inequality to get Claim 4.

Conclusion of the proof:
Take
$$\mathrm{pen}'(m) \ge (2L_{1,m} + \eta L_2(m,m)) \frac{D_m}{n},$$
where $L_{1,m}$ and $L_2(m,m)$ are defined by (35) and (43) respectively. From Claims 2, 3 and 4, if we take the expectation in (21), we have, for some constant $L_s$,
$$\mathbb{E}\left( \|s - \tilde{s}\|_2^2 \right) \le \mathbb{E}\left( \|s - \hat{s}_{m_o}\|_2^2 + \mathrm{pen}'(m_o) - V(m_o) \right) + 2\eta L_2(m_o, m_o)\frac{D_{m_o}}{n} + \frac{\eta L_s}{n}. \qquad (52)$$
Moreover, if $D_m \ge \left[ (L_2/L_1)(\log n)^{\kappa-\gamma/2} \right]^{2(7+\theta)/(\theta-5)}$, we have
$$\frac{L_{1,m}}{4L_1^2} \le (1+\epsilon)\left[ (1+\epsilon) + \left( 1 + \frac{L_3}{2L_1} \right)(\log n)^{-(\kappa-\gamma)} \right]^2 \le (1+\epsilon)^3 + (1+\epsilon^{-1})(1+\epsilon)\left( 1 + \frac{L_3}{2L_1} \right)^2 (\log n)^{-2(\kappa-\gamma)}. \qquad (53)$$
We use the inequality $(a+b)^2 \le (1+\epsilon)a^2 + (1+\epsilon^{-1})b^2$ to obtain (53). Moreover, we have
$$L_2(m,m) \le 4L_1^2 \left( 1 + \frac{\Phi}{6L_1} \right)^2 (\log n)^{-(\kappa-\gamma)}.$$
As in the proof of Theorem 3.1, we take $\eta = (\log n)^{\kappa-\gamma}$ and we fix $\epsilon$ sufficiently small. For $n \ge n_o$, we have $2L_{1,m} + \eta L_2(m,m) < K L_1^2$. Thus inequality (18) follows from (52).

7 Appendix
This section is devoted to technical lemmas that are needed in the proofs.

7.1 Covariance inequality


Lemma 7.1 (Viennet's inequality) Let $(X_n)_{n \in \mathbb{Z}}$ be a stationary and β-mixing process. There exists a positive function $b$ such that $P(b) \le \sum_{l=0}^{\infty} \beta_l$, $P(b^p) \le p \sum_{l=1}^{\infty} l^{p-1} \beta_l$, and, for all functions $h \in L^2(P)$,
$$\mathrm{Var}\left( \sum_{l=1}^{q} h(X_l) \right) \le 4q\, P(bh^2). \qquad (54)$$

7.2 Concentration inequalities
We sum up in this section the concentration inequalities we used in the proofs. We begin
with Bernstein’s inequality

Proposition 7.2 (Bernstein's inequality)
Let $X_1, ..., X_n$ be i.i.d. random variables valued in a measurable space $(\mathcal{X}, \mathcal{X})$ and let $t$ be a measurable real valued function. Let $v = \mathrm{Var}(t(X_1))$ and $b = \|t\|_\infty$; then, for all $x > 0$, we have
$$\mathbb{P}\left( (P_n - P)t > \sqrt{\frac{2vx}{n}} + \frac{bx}{3n} \right) \le e^{-x}.$$
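As a quick numerical sanity check (ours, not part of the paper), the snippet below estimates the deviation probability in Bernstein's inequality by Monte Carlo for uniform variables and compares it with $e^{-x}$; the sample size and the level $x$ are illustrative.

```python
import numpy as np

# Bernstein's inequality: P((P_n - P)t > sqrt(2vx/n) + bx/(3n)) <= e^{-x}.
rng = np.random.default_rng(2)
n, x, reps = 200, 3.0, 20000
t_X = rng.uniform(0.0, 1.0, size=(reps, n))   # t(X_i) = X_i with X_i ~ U[0, 1]
v, b, mean = 1.0 / 12.0, 1.0, 0.5             # Var(t(X_1)), ||t||_inf, P t
threshold = np.sqrt(2.0 * v * x / n) + b * x / (3.0 * n)
deviations = t_X.mean(axis=1) - mean          # (P_n - P)t for each replicate
print(np.mean(deviations > threshold), "<=", np.exp(-x))
```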

Now we give the most important tool of our proof, it is a concentration’s inequality for
the supremum of the empirical process over a class of function. We give here the version
of Bousquet [10].

Theorem 7.3 (Talagrand's Theorem)
Let $X_1, ..., X_n$ be i.i.d. random variables valued in some measurable space $(\mathcal{X}, \mathcal{X})$. Let $\mathcal{F}$ be a separable class of bounded functions from $\mathcal{X}$ to $\mathbb{R}$ and assume that all functions $t$ in $\mathcal{F}$ are $P$-measurable and satisfy $\mathrm{Var}(t(X_1)) \le \sigma^2$, $\|t\|_\infty \le b$. Then
$$\mathbb{P}\left( \sup_{t \in \mathcal{F}} \nu_n(t) > \mathbb{E}\left( \sup_{t \in \mathcal{F}} \nu_n(t) \right) + \sqrt{\frac{2x\left( \sigma^2 + 2b\,\mathbb{E}\left( \sup_{t \in \mathcal{F}} \nu_n(t) \right) \right)}{n}} + \frac{bx}{3n} \right) \le e^{-x}.$$
In particular, for all $\epsilon > 0$, if $\kappa(\epsilon) = 1/3 + \epsilon^{-1}$, we have
$$\mathbb{P}\left( \sup_{t \in \mathcal{F}} \nu_n(t) > (1+\epsilon)\,\mathbb{E}\left( \sup_{t \in \mathcal{F}} \nu_n(t) \right) + \sigma\sqrt{\frac{2x}{n}} + \kappa(\epsilon)\frac{bx}{n} \right) \le e^{-x}.$$

We can deduce from this Theorem a concentration inequality for $\chi^2$-type statistics. This is Proposition (7.3) of Massart [20].

Proposition 7.4 Let $X_1, ..., X_n$ be independent and identically distributed random variables valued in some measurable space $(\mathcal{X}, \mathcal{X})$. Let $P$ denote their common distribution. Let $(\phi_\lambda)_{\lambda \in \Lambda}$ be a finite family of measurable and bounded functions on $(\mathcal{X}, \mathcal{X})$. Let
$$H_\Lambda^2 = \left\| \sum_{\lambda \in \Lambda} \phi_\lambda^2 \right\|_\infty \quad \text{and} \quad B_\Lambda^2 = \sum_{\lambda \in \Lambda} \mathrm{Var}(\phi_\lambda(X_1)).$$
Moreover, let $S_\Lambda = \left\{ a \in \mathbb{R}^\Lambda : \sum_{\lambda \in \Lambda} a_\lambda^2 = 1 \right\}$ and
$$V_\Lambda^2 = \sup_{a \in S_\Lambda} \mathrm{Var}\left( \sum_{\lambda \in \Lambda} a_\lambda \phi_\lambda(X_1) \right).$$
Then the following inequality holds, for all positive $x$ and $\epsilon$,
$$\mathbb{P}\left( \left( \sum_{\lambda \in \Lambda} (P_n - P)^2 \phi_\lambda \right)^{1/2} \ge \frac{1+\epsilon}{\sqrt{n}} B_\Lambda + V_\Lambda \sqrt{\frac{2x}{n}} + \kappa(\epsilon)\frac{H_\Lambda x}{n} \right) \le e^{-x}, \qquad (55)$$
where $\kappa(\epsilon) = \epsilon^{-1} + 1/3$.

Proof:
Following Massart [20] Proposition 7.3, we remark that, by the Cauchy-Schwarz inequality,
$$\left( \sum_{\lambda \in \Lambda} \nu_n^2 \phi_\lambda \right)^{1/2} = \sup_{a \in S_\Lambda} \sum_{\lambda \in \Lambda} a_\lambda \nu_n \phi_\lambda = \sup_{a \in S_\Lambda} \nu_n\left( \sum_{\lambda \in \Lambda} a_\lambda \phi_\lambda \right).$$
Thus the result follows by applying Talagrand's Theorem to the class of functions
$$\mathcal{F} = \left\{ t = \sum_{\lambda \in \Lambda} a_\lambda \phi_\lambda;\ a \in S_\Lambda \right\}.$$

References
[1] H. Akaike. Information theory and an extension of the maximum likelihood principle.
In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages
267–281. Akadémiai Kiadó, Budapest, 1973.

[2] Hirotugu Akaike. Statistical predictor identification. Ann. Inst. Statist. Math.,
22:203–217, 1970.

[3] Donald W. K. Andrews. Nonstrong mixing autoregressive processes. J. Appl. Probab.,


21(4):930–934, 1984.

[4] S. Arlot and P. Massart. Data-driven calibration of penalties for least squares regres-
sion. Submitted to Journal of Machine learning research, 2008.

[5] Sylvain Arlot. Model selection by resampling penalization. hal-00262478, 2008.

[6] Y. Baraud, F. Comte, and G. Viennet. Adaptive estimation in autoregression or


β-mixing regression via model selection. Ann. Statist., 29(3):839–875, 2001.

[7] Henry C. P. Berbee. Random walks with stationary increments and renewal theory,
volume 112 of Mathematical Centre Tracts. Mathematisch Centrum, Amsterdam,
1979.

[8] Lucien Birgé and Pascal Massart. From model selection to adaptive estimation. In
Festschrift for Lucien Le Cam, pages 55–87. Springer, New York, 1997.

[9] Lucien Birgé and Pascal Massart. Minimal penalties for Gaussian model selection.
Probab. Theory Related Fields, 138(1-2):33–73, 2007.

[10] Olivier Bousquet. A Bennett concentration inequality and its application to suprema
of empirical processes. C. R. Math. Acad. Sci. Paris, 334(6):495–500, 2002.

[11] Richard C. Bradley. Introduction to strong mixing conditions. Vol. 1. Kendrick Press,
Heber City, UT, 2007.

[12] F. Comte, J. Dedecker, and M. L. Taupin. Adaptive density deconvolution with


dependent inputs. Math. Methods Statist., 17(2):87–112, 2008.

[13] Fabienne Comte and Florence Merlevède. Adaptive estimation of the stationary den-
sity of discrete and continuous time mixing processes. ESAIM Probab. Statist., 6:211–
238 (electronic), 2002. New directions in time series analysis (Luminy, 2001).

[14] Jérôme Dedecker, Paul Doukhan, Gabriel Lang, José Rafael León R., Sana Louhichi,
and Clémentine Prieur. Weak dependence: with examples and applications, volume
190 of Lecture Notes in Statistics. Springer, New York, 2007.

[15] Jérôme Dedecker and Clémentine Prieur. New dependence coefficients. Examples and
applications to statistics. Probab. Theory Related Fields, 132(2):203–236, 2005.

[16] David L. Donoho, Iain M. Johnstone, Gérard Kerkyacharian, and Dominique Picard.
Density estimation by wavelet thresholding. Ann. Statist., 24(2):508–539, 1996.

[17] Paul Doukhan. Mixing, volume 85 of Lecture Notes in Statistics. Springer-Verlag,


New York, 1994. Properties and examples.

[18] Irène Gannaz and Olivier Wintenberger. Adaptive density estimation under depen-
dence. forthcoming in ESAIM, Probab. and Statist., 2008.

[19] C.L. Mallows. Some comments on $C_p$. Technometrics, 15:661-675, 1973.

[20] Pascal Massart. Concentration inequalities and model selection, volume 1896 of Lec-
ture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer
School on Probability Theory held in Saint-Flour, July 6–23, 2003, With a foreword
by Jean Picard.

[21] C. Prieur. Change point estimation by local linear smoothing under a weak depen-
dence condition. Math. Methods Statist., 16(1):25–41, 2007.

[22] Emmanuel Rio. Théorie asymptotique des processus aléatoires faiblement dépendants,
volume 31 of Mathématiques & Applications (Berlin) [Mathematics & Applications].
Springer-Verlag, Berlin, 2000.

[23] Mats Rudemo. Empirical choice of histograms and kernel density estimators. Scand.
J. Statist., 9(2):65–78, 1982.

[24] Michel Talagrand. New concentration inequalities in product spaces. Invent. Math.,
126(3):505–563, 1996.

[25] Gabrielle Viennet. Inequalities for absolutely regular sequences: application to density
estimation. Probab. Theory Related Fields, 107(4):467–492, 1997.

[26] V. A. Volkonskiı̆ and Yu. A. Rozanov. Some limit theorems for random functions. I.
Teor. Veroyatnost. i Primenen, 4:186–207, 1959.
