Adaptive Density Estimation for Stationary Processes

Matthieu Lerasle
Institut de Mathématiques (UMR 5219), INSA de Toulouse, Université de Toulouse, France
1 Introduction
We consider the problem of estimating the unknown density s of P , the law of a random
variable X, based on the observation of n (possibly) dependent data X1 , ..., Xn with com-
mon law P . We assume that X is real valued, that s belongs to L2 (µ) where µ denotes the
Lebesgue measure on R and that s is compactly supported, say in [0, 1]. Throughout the
chapter, we consider least-squares estimators ŝm of s on a collection (Sm )m∈Mn of linear
subspaces of L2 (µ). Our final estimator is chosen through a model selection algorithm.
Model selection has received much interest in the last decades. When its final goal is pre-
diction, it can be seen more generally as the question of choosing between the outcomes
of several prediction algorithms. With such a general formulation, a very natural answer
is the following. First, estimate the prediction error for each model, that is, $\|s - \hat s_m\|_2^2$.
Then, select the model which minimizes this estimate.
It is natural to think of the empirical risk as an estimator of the prediction error. This can
fail dramatically, because it uses the same data for building predictors and for comparing
them, making these estimates strongly biased for models involving a number of parameters
growing with the sample size.
In order to correct this drawback, penalization methods state that a good choice can be
made by minimizing the sum of the empirical risk (measuring how well the algorithm fits
the data) and some complexity measure of the algorithm (called the penalty). This method
was first developed in the work of Akaike [1, 2] and Mallows [19].
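The following minimal sketch illustrates this rule, assuming the empirical risks have already been computed; the names `select_model`, `emp_risks` and the penalty shape `K * D_m / n` are illustrative only, not the precise procedure studied below.

```python
# A minimal sketch of penalized model selection: minimize the sum of the
# empirical risk and a penalty proportional to the model dimension.

def select_model(emp_risks, dims, n, K=1.0):
    """Return the model index m minimizing emp_risks[m] + K * dims[m] / n.

    emp_risks: dict model -> empirical risk gamma_n(s_hat_m)
    dims:      dict model -> dimension D_m of the linear space S_m
    n:         sample size; K: penalty constant (unknown in practice).
    """
    crit = {m: emp_risks[m] + K * dims[m] / n for m in emp_risks}
    return min(crit, key=crit.get)

# Example: three nested models fitted on n = 500 observations.
print(select_model({1: -0.80, 2: -0.90, 3: -0.92}, {1: 5, 2: 25, 3: 100}, n=500))
```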
In the context of density estimation, with independent data, Birgé & Massart [8] used
penalties of order Ln Dm /n, where Dm denotes the dimension of Sm and Ln is a constant
depending on the complexity of the collection Mn . They used Talagrand’s inequality (see
for example Talagrand [24] for an overview) to prove that this penalization procedure is
efficient, i.e., the integrated quadratic risk of the selected estimator is asymptotically
equivalent to the risk of the oracle (see Section 2 for a precise definition). They also proved that
the selected estimator achieves adaptive rates of convergence over a large class of Besov
spaces. Moreover, they showed that some methods of adaptive density estimation like the
unbiased cross validation (Rudemo [23]) or the hard thresholded estimator of Donoho et
al. [16] can be viewed as special instances of penalized projection estimators.
More recently, Arlot [5] introduced new measures of the quality of penalized least-squares
estimators (PLSE). He proved pathwise oracle inequalities, that is deviation bounds for
the PLSE that are harder to prove but more informative from a practical point of view
(see also Section 2 for details).
When the process $(X_i)_{i=1,\ldots,n}$ is β-mixing (Rozanov & Volkonskii [26] and Section 2),
Talagrand's inequality cannot be used directly. Baraud et al. [6] used Berbee's coupling
lemma (see Berbee [7]) and Viennet's covariance inequality (Viennet [25]) to overcome
this problem and build a model selection procedure for the regression problem. Comte
& Merlevède [13] then used this algorithm to investigate the problem of density estimation for
a β-mixing process. They proved that, under reasonable assumptions on the collection $\mathcal{M}_n$
and on the coefficients β, one can recover the results of Birgé & Massart [8] obtained in the i.i.d.
framework.
The main drawback of those results is that many processes, even simple Markov chains,
are not β-mixing. For instance, if $(\epsilon_i)_{i\ge 1}$ is i.i.d. with marginal $\mathcal{B}(1/2)$, then the stationary
solution $(X_i)_{i\ge 0}$ of the equation
$$X_n = \frac{1}{2}(X_{n-1} + \epsilon_n), \qquad X_0 \text{ independent of } (\epsilon_i)_{i\ge 1}, \qquad (1)$$
is not β-mixing (Andrews [3]). More recently, Dedecker & Prieur [15] introduced new
mixing-coefficients, in particular the coefficients τ , φ̃ and β̃ and proved that many processes
like (1) happen to be τ , φ̃ and β̃-mixing. They proved a coupling lemma for the coefficient
τ and covariance inequalities for φ̃ and β̃. Gannaz & Wintenberger [18] used the covariance
inequality to extend the result of Donoho et al. [16] for the wavelet thresholded estimator
to the case of φ̃-mixing processes. They recovered (up to a log(n) factor) the adaptive
rates of convergence over Besov spaces.
In this article, we first investigate the case of β-mixing processes. We prove a pathwise
oracle inequality for the PLSE. We extend the result of Comte & Merlevède [13] under
weaker assumptions on the mixing coefficients. Then, we consider τ -mixing processes. The
problem is that the coupling result is weaker for the coefficient τ than for β. Moreover,
in order to control the empirical process we use a covariance inequality that is harder to
handle. Hence, the generalization of the procedure of Baraud et al. [6] to the framework
of τ-mixing processes is not straightforward. We recover the optimal adaptive rates of
convergence over Besov spaces (that is, the same as in the independent framework) for
τ-mixing processes, which, as far as we know, is new.
The chapter is organized as follows. In Section 2, we give the basic material that we will
use throughout the chapter. We recall the definition of some mixing coefficients and we
state their properties. We define the penalized least-squares estimator (PLSE). Sections 3
and 4 are devoted to the statement of the main results, respectively in the β-mixing case
and in the τ -mixing case. In Section 5, we derive the adaptive properties of the PLSE.
Finally, Section 6 is devoted to the proofs. Some additional material is deferred to the
Appendix, Section 7.
2 Preliminaries
2.1 Notation.
Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space. Let $\mu$ be the Lebesgue measure on $\mathbb{R}$ and let $\|\cdot\|_p$ be the
usual norm on $L^p(\mu)$ for $1 \le p \le \infty$. For all $y \in \mathbb{R}^l$, let $|y|_l = \sum_{i=1}^l |y_i|$. Denote by $\lambda_\kappa$ the
set of κ-Lipschitz functions, i.e., the functions $t$ from $(\mathbb{R}^l, |\cdot|_l)$ to $\mathbb{R}$ such that $\mathrm{Lip}(t) \le \kappa$,
where
$$\mathrm{Lip}(t) = \sup\left\{ \frac{|t(x) - t(y)|}{|x - y|_l},\ x, y \in \mathbb{R}^l,\ x \ne y \right\}.$$
Let $BV$ and $BV_1$ be the sets of functions $t$ supported on $\mathbb{R}$ satisfying respectively $\|t\|_{BV} < \infty$
and $\|t\|_{BV} \le 1$, where $\|t\|_{BV}$ denotes the total variation of $t$.
The coefficient β(M, σ(Y )) is the mixing coefficient introduced by Rozanov & Volkonskii
[26]. The coefficients β̃(M, Y1 ) and τ (M, Y ) have been introduced by Dedecker & Prieur
[15].
Let (Xk )k∈Z be a stationary sequence of real valued random variables defined on (Ω, A, P).
For all k ∈ N∗ , the coefficients βk , β̃k and τk are defined by
Moreover, we set $\beta_0 = 1$. In the sequel, the processes of interest are either β-mixing or
τ-mixing, meaning that, for $\gamma = \beta$ or $\tau$, the γ-mixing coefficients satisfy $\gamma_k \to 0$ as $k \to +\infty$. For
$p \in \{1, 2\}$, we define $\kappa_p$ as
$$\kappa_p = p \sum_{l=0}^{\infty} l^{p-1} \beta_l, \qquad (2)$$
where $0^0 = 1$, when the series converge. Besides, we consider two kinds of rates of
convergence to 0 of the mixing coefficients, that is, for $\gamma = \beta$ or $\tau$:
[AR] arithmetical γ-mixing with rate θ if there exists some θ > 0 such that $\gamma_k \le (1+k)^{-(1+\theta)}$ for all $k$ in $\mathbb{N}$;
[GEO] geometrical γ-mixing with rate θ if there exists some θ > 0 such that $\gamma_k \le e^{-\theta k}$ for all $k$ in $\mathbb{N}$.
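As a quick numerical illustration (not part of the original argument), the sketch below bounds the series $\kappa_p$ of (2) under the arithmetic [AR] envelope $\beta_l \le (1+l)^{-(1+\theta)}$; the function name and truncation level are arbitrary.

```python
# Truncated upper bound for kappa_p = p * sum_{l>=0} l^(p-1) * beta_l under
# the [AR] envelope beta_l <= (1 + l)^{-(1+theta)}. Note that Python's
# 0 ** 0 == 1 matches the convention 0^0 = 1 used in (2).

def kappa_p_bound(p, theta, n_terms=10**5):
    return p * sum(l ** (p - 1) * (1 + l) ** (-(1 + theta))
                   for l in range(n_terms))

print(kappa_p_bound(1, theta=2.5))  # kappa_1: finite whenever theta > 0
print(kappa_p_bound(2, theta=2.5))  # kappa_2: finite whenever theta > 1
```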
2.2.2 Properties
Coupling
Let $X$ be an $\mathbb{R}^l$-valued random variable defined on $(\Omega, \mathcal{A}, \mathbb{P})$ and let $\mathcal{M}$ be a σ-algebra.
Assume that there exists a random variable $U$ uniformly distributed on $[0,1]$ and independent
of $\mathcal{M} \vee \sigma(X)$. There exist two $\mathcal{M} \vee \sigma(X) \vee \sigma(U)$-measurable random variables $X_1^*$
and $X_2^*$ distributed as $X$ and independent of $\mathcal{M}$ such that
Covariance inequalities
$$|\mathrm{Cov}(f(X), h(Y))| \le 2\,\mathbb{E}^{1/p}\left(|f(X)|^p\, b_1(X)\right)\, \mathbb{E}^{1/q}\left(|h(Y)|^q\, b_2(Y)\right).$$
There exists a random variable $b(\sigma(X), Y)$ such that $\mathbb{E}(b(\sigma(X), Y)) = \tilde\beta(\sigma(X), Y)$ and such
that, for all Lipschitz functions $f$ and all $h$ in $BV$ (Dedecker & Prieur [15], Proposition 1),
$$|\mathrm{Cov}(f(X), h(Y))| \le \|h\|_{BV}\, \mathbb{E}\left(|f(X)|\, b(\sigma(X), Y)\right) \le \|h\|_{BV}\, \|f\|_\infty\, \tilde\beta(\sigma(X), Y). \qquad (5)$$
Comparison results
Let $(X_k)_{k\in\mathbb{Z}}$ be a sequence of identically distributed real random variables. If the marginal
distribution satisfies a concentration condition $|F_X(x) - F_X(y)| \le K|x-y|^a$ with $a \le 1$ and
$K > 0$, then (Dedecker et al. [14], Remark 5.1, p. 104)
$$\tilde\beta_k \le 2K^{1/(1+a)}\, \tau_{k,1}^{a/(a+1)} \le 2K^{1/(1+a)}\, \tau_k^{a/(a+1)}.$$
In particular, if $P_X$ has a density $s$ with respect to the Lebesgue measure $\mu$ and if $s \in L^2(\mu)$,
we have, from the Cauchy-Schwarz inequality,
$$|F_X(x) - F_X(y)| = \left|\int \mathbf{1}_{[x,y]}\, s\, d\mu\right| \le \|s\|_2 \left(\int \mathbf{1}_{[x,y]}\, d\mu\right)^{1/2} = \|s\|_2\, |x-y|^{1/2},$$
thus
$$\tilde\beta_k \le 2\,\|s\|_2^{2/3}\, \tau_k^{1/3}.$$
In particular, for any arithmetically [AR] τ-mixing process with rate θ > 2, we have
$$\tilde\beta_k \le 2\,\|s\|_2^{2/3}\, (1+k)^{-(1+\theta)/3}. \qquad (6)$$
2.2.3 Examples
Examples of β-mixing and τ -mixing sequences are well known, we refer to the books of
Doukhan [17] and Bradley [11] for examples of β-mixing processes and to the book of
Dedecker et. al [14] or the articles of Dedecker & Prieur [15], Prieur [21], and Comte
et. al [12] for examples of τ -mixing sequences. One of the most important example is
the following: a stationary, irreducible, aperiodic and positively recurent Markov chain
(Xi )i≥1 is β-mixing. However, many simple Markov chains are not β-mixing but are τ -
mixing. For instance, it is known for a long time that if (ǫi )i≥1 are i.i.d Bernoulli B(1/2),
then a stationary solution (Xi )i≥0 of the equation
1
Xn = (Xn−1 + ǫn ), X0 independent of (ǫi )i≥1
2
is not β-mixing since βk = 1 for any k ≥ 1 whereas τk ≤ 2−k (see Dedecker & Prieur [15]
Section 4.1). Another advantage of the coefficient τ is that it is easy to compute in many
situations (see Dedecker & Prieur [15] Section 4).
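For illustration, here is a minimal simulation of this chain; starting $X_0$ uniform on $[0,1]$ makes the trajectory exactly stationary, since the uniform law is invariant for this recursion. The sampler below is a hypothetical sketch, not code from the chapter.

```python
# Simulate X_n = (X_{n-1} + eps_n) / 2 with Bernoulli(1/2) innovations:
# the uniform law on [0, 1] is stationary (X_n encodes a binary expansion),
# yet the chain is tau-mixing and not beta-mixing.
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(n):
    """Draw a stationary trajectory of length n, starting from X_0 ~ U[0, 1]."""
    x = rng.uniform()
    path = np.empty(n)
    for i in range(n):
        x = 0.5 * (x + rng.integers(0, 2))  # eps_n in {0, 1}
        path[i] = x
    return path

X = sample_chain(1000)  # tau-mixing data with uniform marginal density s = 1
```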
$$\Big\|\sum_{(j,k)\in m\cup m'} \psi_{j,k}^2\Big\|_\infty = \sup_{t\in S_m+S_{m'},\ t\ne 0} \frac{\|t\|_\infty^2}{\|t\|_2^2}, \qquad (7)$$
see Birgé & Massart [8] p 58. Three examples are usually developed as fulfilling this set
of assumptions:
[T] trigonometric spaces: $\psi_{0,0}(x) = 1$ and, for all $j \in \mathbb{N}^*$, $\psi_{j,1}(x) = \cos(2\pi jx)$, $\psi_{j,2}(x) = \sin(2\pi jx)$; $m = \{(0,0), (j,1), (j',2),\ 1 \le j, j' \le J_m\}$ and $D_m = 2J_m + 1$;
[P] regular piecewise polynomial spaces: $S_m$ is generated by $r$ polynomials $\psi_{j,k}$ of degree $k = 0, \ldots, r-1$ on each subinterval $[(j-1)/J_m, j/J_m]$ for $j = 1, \ldots, J_m$; $D_m = rJ_m$, $\mathcal{M}_n = \{m = \{(j,k),\ j = 1,\ldots,J_m,\ k = 0,\ldots,r-1\},\ 1 \le J_m \le [n/r]\}$;
[W] spaces generated by dyadic wavelet with regularity r as described in Section 4.
For a precise description of those spaces and their properties, we refer to Birgé & Massart
[8].
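As an illustration of the collection [T], the sketch below evaluates the trigonometric basis; the $\sqrt 2$ normalization, which makes the family orthonormal in $L^2([0,1],\mu)$, is an assumption left implicit in the list above.

```python
# Evaluate the trigonometric basis of the model S_m of dimension 2J+1 on [0, 1].
import numpy as np

def trig_basis(J, x):
    """Return an array of shape (2J+1, len(x)): psi_{0,0}, then
    psi_{j,1} = sqrt(2) cos(2 pi j x) and psi_{j,2} = sqrt(2) sin(2 pi j x)."""
    x = np.asarray(x, dtype=float)
    rows = [np.ones_like(x)]
    for j in range(1, J + 1):
        rows.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
        rows.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
    return np.stack(rows)
```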
Let (Sm )m∈Mn be a collection of models satisfying assumptions [M1 ]-[M3 ]. We define
Sn = ∪m∈Mn Sm , sm and sn the orthogonal projections of s onto Sm and Sn respectively,
let P be the joint distribution of the observations (Xn )n∈Z and let E be the corresponding
expectation. We define the operators $P_n$, $P$ and $\nu_n$ on $L^2(\mu)$ by
$$P_n t = \frac{1}{n}\sum_{i=1}^n t(X_i), \qquad Pt = \int t(x)\, s(x)\, d\mu(x), \qquad \nu_n(t) = (P_n - P)t.$$
All the real numbers that we shall introduce and which are not indexed by m or n are fixed
constants. In order to define the penalized least-squares estimator, let us consider on $\mathbb{R}\times S_n$
the contrast function $\gamma(x,t) = -2t(x) + \|t\|_2^2$ and its empirical version $\gamma_n(t) = P_n\gamma(\cdot, t)$.
Minimizing $\gamma_n(t)$ over $S_m$ leads to the classical projection estimator $\hat s_m$ on $S_m$. Let $\hat s_n$ be
the projection estimator on $S_n$. Since $\{\psi_{j,k}\}_{(j,k)\in m}$ is an orthonormal basis of $S_m$, one gets
$$\hat s_m = \sum_{(j,k)\in m} (P_n\psi_{j,k})\,\psi_{j,k} \qquad\text{and}\qquad \gamma_n(\hat s_m) = -\sum_{(j,k)\in m} (P_n\psi_{j,k})^2.$$
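A minimal sketch of the projection estimator follows, reusing the hypothetical `trig_basis` from the sketch above: the coefficients are the empirical means $P_n\psi_{j,k}$ and the attained contrast is minus the sum of their squares, as in the display above.

```python
# Projection estimator on the trigonometric model of dimension 2J+1.
import numpy as np

def projection_estimator(X, J, grid=None):
    """Return (empirical coefficients, fitted density on the grid, gamma_n)."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 512)
    coeffs = trig_basis(J, X).mean(axis=1)   # P_n(psi_{j,k}) for each basis function
    s_hat = coeffs @ trig_basis(J, grid)     # sum_{(j,k)} P_n(psi_{j,k}) psi_{j,k}
    return coeffs, s_hat, -np.sum(coeffs ** 2)
```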
An oracle depends on the unknown $s$ and on the data, so it is unknown in practice.
In order to validate our procedure, we try to prove:
- non-asymptotic oracle inequalities for the PLSE:
$$\mathbb{E}\|s - \tilde s\|_2^2 \le L \inf_{m\in\mathcal{M}_n}\left\{ \mathbb{E}\|s - \hat s_m\|_2^2 + R(m,n)\right\}, \qquad (10)$$
for some constant $L \ge 1$ (as close to 1 as possible) and a remainder term $R(m,n) \ge 0$,
possibly random, and small compared to $\mathbb{E}\|s-\tilde s\|_2^2$ if possible. This inequality compares
the risk of the PLSE with the best deterministic choice of $m$. Since $\hat m$ is random, we prefer
to prove a stronger form of oracle inequality:
$$\mathbb{E}\|s - \tilde s\|_2^2 \le L\, \mathbb{E}\left(\inf_{m\in\mathcal{M}_n}\left\{\|s - \hat s_m\|_2^2 + R(m,n)\right\}\right); \qquad (11)$$
- pathwise oracle inequalities:
$$\mathbb{P}\left(\|s - \tilde s\|_2^2 > L \inf_{m\in\mathcal{M}_n}\left\{\|s - \hat s_m\|_2^2 + R(m,n)\right\}\right) \le c_n, \qquad (12)$$
where typically $c_n \le C/n^{1+\gamma}$ for some $\gamma > 0$. Inequality (12) proves that, asymptotically,
the risk $\|s - \tilde s\|_2^2$ is almost surely the one of the oracle. Let
$$\Omega = \left\{ \|s - \tilde s\|_2^2 > L \inf_{m\in\mathcal{M}_n}\left\{ \|s - \hat s_m\|_2^2 + R(m,n)\right\}\right\}.$$
We have
$$\mathbb{E}\|s - \tilde s\|_2^2 = \mathbb{E}\left(\|s - \tilde s\|_2^2\, \mathbf{1}_\Omega\right) + \mathbb{E}\left(\|s - \tilde s\|_2^2\, \mathbf{1}_{\Omega^c}\right).$$
It is clear that $\mathbb{E}(\|s-\tilde s\|_2^2\,\mathbf{1}_{\Omega^c}) \le L\,\mathbb{E}\left(\inf_{m\in\mathcal{M}_n}\left\{\|s-\hat s_m\|_2^2 + R(m,n)\right\}\right)$. Moreover, we
have $\|s-\tilde s\|_2^2 = \|s - s_{\hat m}\|_2^2 + \|s_{\hat m} - \tilde s\|_2^2 \le \|s\|_2^2 + \Phi^2 D_{\hat m} \le \|s\|_2^2 + \Phi^2 n$; thus, when (12)
holds, we have
$$\mathbb{E}\left(\|s - \tilde s\|_2^2\,\mathbf{1}_{\Omega}\right) \le (\|s\|_2^2 + \Phi^2 n)\, c_n \le \frac{C}{n^\gamma}.$$
Therefore, inequality (12) implies
$$\mathbb{E}\|s - \tilde s\|_2^2 \le L\,\mathbb{E}\left(\inf_{m\in\mathcal{M}_n}\left\{\|s - \hat s_m\|_2^2 + R(m,n)\right\}\right) + \frac{C}{n^\gamma}.$$
We can derive from these inequalities adaptive rates of convergence of the PLSE on Besov
spaces (see Birgé & Massart [8] for example). In order to achieve this goal, we only have
to prove a weaker form of oracle inequality where the remainder term R(m, n) ≤ LDm /n
for some constant L, for all the models m with sufficiently large dimension. This will be
detailed in Section 5.
Theorem 3.1 Consider a collection of models satisfying [M1], [M2] and [M3]. Assume
that the process $(X_n)_{n\in\mathbb{Z}}$ is strictly stationary and arithmetically [AR] β-mixing with mixing
rate θ > 2 and that its marginal distribution admits a density $s$ with respect to the
Lebesgue measure $\mu$, with $s \in L^2(\mu)$.
Let $\kappa_1$ be the constant defined in (2) and let $\tilde s$ be the PLSE defined by (9) with
$$\mathrm{pen}(m) = \frac{K\Phi^2\kappa_1 D_m}{n}, \qquad\text{where } K > 4.$$
Then, for all κ > 2, there exist $c_0 > 0$, $L_s > 0$, $\gamma_1 > 0$ and a sequence $\epsilon_n \to 0$ such that
$$\mathbb{P}\left( \|\tilde s - s\|_2^2 > (1+\epsilon_n) \inf_{m\in\mathcal{M}_n,\ D_m\ge c_0(\log n)^{\gamma_1}}\left\{ \|s - s_m\|_2^2 + \mathrm{pen}(m)\right\}\right) \le L_s\, \frac{(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}. \qquad (13)$$
Remark: The term $K\Phi^2\kappa_1$ is the same as in Theorem 3.1 of Comte & Merlevède [13], but
with a constant $K > 4$ instead of 320. The main drawback of this result is that the penalty
term involves the constant $\kappa_1$, which is unknown in practice. However, Theorem 3.1 ensures
that penalties proportional to the linear dimension of $S_m$ lead to efficient model selection
procedures. Thus we can use this information to apply the slope heuristic algorithm introduced
by Birgé & Massart [9] in a Gaussian regression context and generalized by Arlot
& Massart [4] to more general M-estimation frameworks. This algorithm calibrates the
constant in front of the penalty term when the shape of an ideal penalty is available. The
result of Arlot & Massart is proven for independent sequences, in a regression framework,
but it can be generalized to the density estimation framework, for independent as well as
for β- or τ-dependent data. This result is beyond the scope of this chapter and will be
proved in Chapter 4.
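The following sketch illustrates one common implementation of the slope heuristic, under the assumption (not made precise here) that the empirical contrast decreases roughly linearly in $D_m/n$ for the largest models; the function and its parameters are illustrative only.

```python
# Slope heuristic: estimate the minimal-penalty slope by regressing the
# empirical risks of the largest models on D_m / n, then select with twice
# that slope as the penalty constant.
import numpy as np

def slope_heuristic_select(emp_risks, dims, n, frac_large=0.5):
    """emp_risks, dims: 1-d arrays indexed by model. Returns the argmin."""
    order = np.argsort(dims)
    large = order[int(len(order) * frac_large):]       # largest models only
    slope = np.polyfit(dims[large] / n, emp_risks[large], 1)[0]
    pen = 2.0 * abs(slope) * dims / n                  # doubled minimal penalty
    return int(np.argmin(emp_risks + pen))
```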
We have to consider the infimum in equation (13) over the models with sufficiently large
dimensions. However, as noted by Arlot [5] (Remark 9, p. 43), we can take the infimum
over all the models in (13) if we add an extra term. More precisely, we can prove
that, with probability larger than $1 - L_s(\log n)^{(\theta+2)\kappa}/n^{\theta/2}$,
$$\|\tilde s - s\|_2^2 \le (1+\epsilon_n)\inf_{m\in\mathcal{M}_n}\left\{\|s - \hat s_m\|_2^2 + \mathrm{pen}(m)\right\} + L\,\frac{(\log n)^{\gamma_2}}{n}, \qquad (14)$$
where $L > 0$ and $\gamma_2 > 0$.
Remark: The main improvement of Theorem 3.1 is that it gives an oracle inequality
in probability, with a deviation bound of order $o(1/n)$ as soon as θ > 2, instead of θ > 3
in Comte & Merlevède [13]. Moreover, we do not require $s$ to be bounded to prove our
result.
Remark: When the data are independent, the proof of Theorem 3.1 can be used to
show that the estimator $\tilde s$ chosen with a penalty term of order $K\Phi^2 D_m/n$ satisfies an
oracle inequality such as (13). The main difference would be that $\kappa_1 = 1$; thus it can be
used without a slope heuristic (even if this algorithm can also be used in this context to
optimize the constant $K$), and the control of the probability would be $L_s e^{-(\ln n)^2/C_s}$ for
some constants $L_s, C_s$, instead of $L_s(\log n)^{(\theta+2)\kappa} n^{-\theta/2}$ in our theorem.
We assume that our collection $(S_m)_{m\in\mathcal{M}_n}$ satisfies the following assumption:
[W] dyadic wavelet generated spaces: let $J_n = [\log(n/2(A+1))/\log 2]$ and, for all $J_m = 1, \ldots, J_n$, let
$$m = \{(0,k),\ -A_2 < k < 2 - A_1\} \cup \{(j,k),\ 1 \le j \le J_m,\ -A_2 < k < -A_1 + 2^j\}$$
and $S_m$ the linear span of $\{\psi_{j,k}\}_{(j,k)\in m}$. In particular, we have $D_m = (A-1)(J_m+1) + 2^{J_m+1}$
and thus $2^{J_m+1} \le D_m \le (A-1)(J_m+1) + 2^{J_m+1} \le A\,2^{J_m+1}$.
Theorem 4.1 Consider the collection of models [W]. Assume that $(X_n)_{n\in\mathbb{Z}}$ is strictly
stationary and arithmetically [AR] τ-mixing with mixing rate θ > 5 and that its marginal
distribution admits a density $s$ with respect to the Lebesgue measure $\mu$. Let $\tilde s$ be the PLSE
defined by (9) with
$$\mathrm{pen}(m) = K A K_\infty K_{BV}\left(\sum_{l=0}^\infty \tilde\beta_l\right)\frac{D_m}{n}, \qquad\text{where } K \ge 8.$$
Then there exist constants $c_0 > 0$, $\gamma_1 > 0$ and a sequence $\epsilon_n \to 0$ such that
$$\mathbb{E}\|\tilde s - s\|_2^2 \le (1+\epsilon_n)\inf_{m\in\mathcal{M}_n,\ D_m\ge c_0(\log n)^{\gamma_1}}\left\{\|s - s_m\|_2^2 + \mathrm{pen}(m)\right\}. \qquad (18)$$
Remark: As in Theorem 3.1, the penalty term involves an unknown constant and we
have a condition on the dimension of the models in (18). However, the slope heuristic can
also be used in this context to calibrate the constant, and a careful look at the proof shows
that we can take the infimum over all models $m \in \mathcal{M}_n$ provided that we increase the
constant $K$ in front of the penalty term. Our result allows us to derive rates of convergence
in Besov spaces for the PLSE that correspond to the rates in the i.i.d. framework (see
Proposition 5.2).
Remark: Theorem 4.1 gives an oracle inequality for the PLSE built on τ-mixing sequences.
This inequality is not pathwise, and the constants involved in the penalty term
are not optimal. This is due to technical reasons, mainly because we use the coupling
result (4) instead of (3). However, we recover the same kind of oracle inequality as in
the i.i.d. framework (Birgé and Massart [8]) under weak assumptions on the mixing coefficients,
since we only require arithmetical [AR] τ-mixing assumptions on the process
$(X_n)_{n\in\mathbb{Z}}$. To the best of our knowledge, this is the first result for these processes.
Let us mention here Theorem 4.1 in Comte & Merlevède [13]. They consider α-mixing
processes (for a definition of the coefficient α and its properties, we refer to Rio [22]). They
make geometrical [GEO] α-mixing assumptions on the processes and consider penalties
of order $L\log(n)D_m/n$ to get an oracle inequality. This leads to a logarithmic loss in
the rates of convergence. They get the optimal rate under an extra assumption (namely
Assumption [Lip] in Section 3.2). There exist random processes that are τ-mixing and
not α-mixing (see Dedecker & Prieur [15]); however, the comparison of these coefficients
is difficult in general and our method cannot be applied in this context.
The constants $c_0$, $\gamma_1$, $n_o$ are given at the end of the proof.
Remark: Inequality (6) can be improved under stronger assumptions on $s$. For example,
when $s$ is bounded, we have $\tilde\beta_k \le C\sqrt{\tau_k}$. Under this assumption and θ > 3, we can
prove that the estimator $\tilde s$ satisfies the inequality
$$\mathbb{E}\|\tilde s - s\|_2^2 \le (1+\epsilon_n)\inf_{m\in\mathcal{M}_n,\ D_m\ge c_0(\log n)^{\gamma_1}}\left\{\|s - s_m\|_2^2 + \mathrm{pen}(m)\right\} + \frac{(\log n)^{\kappa(\theta+1)}}{n^{(\theta-3)/2}}.$$
When θ < 5, the extra term $(\log n)^{\kappa(\theta+1)}/n^{(\theta-3)/2}$ may be larger than the main term
$\inf_{m\in\mathcal{M}_n,\ D_m\ge c_0(\log n)^{\gamma_1}}\left\{\|s-s_m\|_2^2 + \mathrm{pen}(m)\right\}$. In this case, we do not know whether our control
remains optimal. On the other hand, Proposition 5.2 ensures that $\tilde s$ is adaptive over the
class of Besov balls when θ ≥ 5.
5 Minimax results
5.1 Approximation results on Besov spaces
Besov balls.
Throughout this section, $\Lambda = \{(j,k),\ j\in\mathbb{N},\ k\in\mathbb{Z}\}$ and $\{\psi_{j,k},\ (j,k)\in\Lambda\}$ denotes an
$r$-regular wavelet basis as introduced in Section 4.1. Let α, p be two positive numbers such
that $\alpha + 1/2 - 1/p > 0$. For all functions $t \in L^2(\mu)$, $t = \sum_{(j,k)\in\Lambda} t_{j,k}\psi_{j,k}$, we say that $t$
belongs to the Besov ball $B_{\alpha,p,\infty}(M_1)$ on the real line if $\|t\|_{\alpha,p,\infty} \le M_1$, where
$$\|t\|_{\alpha,p,\infty} = \sup_{j\in\mathbb{N}} 2^{j(\alpha+1/2-1/p)}\left(\sum_{k\in\mathbb{Z}} |t_{j,k}|^p\right)^{1/p}.$$
It is easy to check that, if $p \ge 2$, then $B_{\alpha,p,\infty}(M_1) \subset B_{\alpha,2,\infty}(M_1)$, so that upper bounds on
$B_{\alpha,2,\infty}(M_1)$ yield upper bounds on $B_{\alpha,p,\infty}(M_1)$.
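For illustration, the hypothetical sketch below evaluates this norm from a list of per-level wavelet coefficient arrays; the data layout `coeffs[j]` holding the level-$j$ coefficients $(t_{j,k})_k$ is an assumption.

```python
# Compute ||t||_{alpha,p,infty} = sup_j 2^{j(alpha + 1/2 - 1/p)} * ||(t_{j,k})_k||_p
import numpy as np

def besov_norm(coeffs, alpha, p):
    """coeffs: list of 1-d arrays, coeffs[j] = level-j wavelet coefficients."""
    return max(2.0 ** (j * (alpha + 0.5 - 1.0 / p))
               * np.linalg.norm(np.asarray(c, dtype=float), ord=p)
               for j, c in enumerate(coeffs))
```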
Approximation results on Besov spaces.
We have the following result (Birgé & Massart [8], Section 4.7.1). Suppose that the support
of $s$ equals $[0,1]$ and that $s$ belongs to the Besov ball $B_{\alpha,2,\infty}(1)$; then, whenever $r > \alpha - 1$,
$$\|s - s_m\|_2^2 \le C(\alpha)\, D_m^{-2\alpha}. \qquad (19)$$
Proposition 5.1 Assume that the process $(X_n)_{n\in\mathbb{Z}}$ is strictly stationary and arithmetically
[AR] β-mixing with mixing rate θ > 2, that its marginal distribution admits a density
$s$ with respect to the Lebesgue measure $\mu$, that $s$ is supported in $[0,1]$ and that $s \in L^2(\mu)$.
For all $\alpha, M_1 > 0$, the PLSE $\tilde s$ defined in Theorem 3.1 for the collection of models [W]
satisfies
$$\forall \kappa > 2, \qquad \sup_{s\in B_{\alpha,2,\infty}(M_1)} \mathbb{P}\left(\|\tilde s - s\|_2^2 > L_{M_1,\alpha,\theta}\, n^{-2\alpha/(2\alpha+1)}\right) \le \frac{L_{M_1}(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}.$$
Proposition 5.2 Assume that the process $(X_n)_{n\in\mathbb{Z}}$ is strictly stationary and arithmetically
[AR] τ-mixing with mixing rate θ > 5, that its marginal distribution admits a density
$s$ with respect to the Lebesgue measure $\mu$, that $s$ is supported in $[0,1]$ and that $s \in L^2(\mu)$.
For all $\alpha, M_1 > 0$, the PLSE $\tilde s$ defined in Theorem 4.1 satisfies
$$\sup_{s\in B_{\alpha,2,\infty}(M_1)} \mathbb{E}\|\tilde s - s\|_2^2 \le L_{M_1,\alpha,\theta}\, n^{-2\alpha/(2\alpha+1)}.$$
Remark: Proposition 5.2 can be compared to Theorem 3.1 in Gannaz & Wintenberger
[18]. They prove near-minimax results for the thresholded wavelet estimator introduced
by Donoho et al. [16] in a φ̃-dependent setting (for a definition of the coefficient φ̃, we
refer to Dedecker & Prieur [15]). Basically, in our notation, their result can be stated
as follows: if $(X_n)_{n\in\mathbb{Z}}$ is φ̃-mixing with $\tilde\phi_1(r) \le Ce^{-ar^b}$ for some constants $C, a, b$, then
the thresholded wavelet estimator $\hat s$ of $s$ satisfies
$$\forall \alpha > 0,\ \forall p > 1, \qquad \sup_{s\in B_{\alpha,p,\infty}(M_1)\cap L^\infty(M)} \mathbb{E}\|\hat s - s\|_2^2 \le L_{M,M_1,\alpha,p}\left(\frac{\log n}{n}\right)^{2\alpha/(2\alpha+1)}.$$
The main advantage of their result is that they can deal with Besov balls with regularity
$1 < p < 2$. However, in the regular case, when $p \ge 2$, we have been able to remove the extra
$\log n$ factor. Moreover, our result only requires arithmetical [AR] rates of convergence for
the mixing coefficients, and we do not have to suppose that $s$ is bounded.
6 Proofs.
6.1 Proofs of the minimax results.
Proof of Proposition 5.1:
Let $\alpha > 0$ and $M_1 > 0$ and assume that $s \in B_{\alpha,2,\infty}(M_1)$. Let $\tilde{\mathcal{M}}_n = \{m\in\mathcal{M}_n,\ D_m > c_0(\log n)^{\gamma_1}\}$. By Theorem 3.1, there exists a constant $L_\theta > 0$ such that
$$\mathbb{P}\left(\|\tilde s - s\|_2^2 > L_\theta \inf_{m\in\tilde{\mathcal{M}}_n}\left\{\|s - s_m\|_2^2 + \frac{D_m}{n}\right\}\right) \le \frac{L_s(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}. \qquad (20)$$
It appears from the proof of Theorem 3.1 that the constant $L_s$ depends only on $\|s\|_2$ and
that it is a nondecreasing function of $\|s\|_2$, so that $L_s$ can be uniformly bounded over
$B_{\alpha,2,\infty}(M_1)$ by a constant $L_{M_1}$. By (20), we then have, for all $m \in \tilde{\mathcal{M}}_n$,
$$\mathbb{P}\left(\|\tilde s - s\|_2^2 > L_\theta\left\{\|s - s_m\|_2^2 + \frac{D_m}{n}\right\}\right) \le \frac{L_{M_1}(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}.$$
Since $s$ belongs to $B_{\alpha,2,\infty}(M_1)$, we can use Inequality (19) to get
$$\|s - s_m\|_2^2 \le L_{\alpha,M_1}\, D_m^{-2\alpha}.$$
Choosing $m$ such that $D_m$ is of order $n^{1/(2\alpha+1)}$ balances the two terms, and we obtain
$$\mathbb{P}\left(\|\tilde s - s\|_2^2 > L_{M_1,\alpha,\theta}\, n^{-2\alpha/(2\alpha+1)}\right) \le \frac{L_{M_1}(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}.$$
Proof of Proposition 5.2:
Let $\alpha > 0$ and $M_1 > 0$ and assume that $s \in B_{\alpha,2,\infty}(M_1)$. By Theorem 4.1, we have
$$\mathbb{E}\|\tilde s - s\|_2^2 \le L_\theta \inf_{m\in\tilde{\mathcal{M}}_n}\left\{\|s - s_m\|_2^2 + \frac{D_m}{n}\right\}.$$
This decomposition is different from the one used in Birgé & Massart [8] and in Comte &
Merlevède [13]. It allows us to improve the constant in the oracle inequality in the β-mixing
case. Moreover, we choose to prove an oracle inequality of the form (12) for β-mixing
sequences, which allows us to assume only θ > 2 instead of θ > 3. Let us now give a sketch
of the proof:
1. We build an event $\Omega_C$ with $\mathbb{P}(\Omega_C^c) \le p\beta_q$ such that, on $\Omega_C$, $\nu_n = \nu_n^*$, where $\nu_n^*$
is built with independent data. A suitable choice of the integers $p$ and $q$ leads to
$p\beta_q \le C(\log n)^r n^{-\theta/2}$.
2. We use the concentration inequality (7.4) of Birgé & Massart [8] for χ²-type statistics,
derived from Talagrand's inequality. This allows us to find $p_1(m)$ such that, on
an event $\Omega_1$ with $\mathbb{P}(\Omega_1^c \cap \Omega_C) \le L_{1,s}\, c_n$,
$$\sup_{m\in\mathcal{M}_n}\{V(m) - p_1(m)\} \le 0.$$
4. We have $\|s_{\hat m} - s_{m_o}\|_2^2 \le \|s_{\hat m} - s\|_2^2 + \|s - s_{m_o}\|_2^2$, because $s_{\hat m} - s_{m_o}$ is either the
projection of $s_{\hat m} - s$ onto $S_{m_o}$ or the projection of $s - s_{m_o}$ onto $S_{\hat m}$. Taking $\mathrm{pen}(m) \ge p_1(m) + \eta p_2(m,m)$, we have, on $\Omega_1\cap\Omega_2\cap\Omega_C$,
$$\begin{aligned}
\|s - \tilde s\|_2^2 &\le \|s - \hat s_{m_o}\|_2^2 - \frac{V_{m_o}}{2} + \mathrm{pen}(m_o) - \frac{V_{m_o}}{2}\\
&\qquad - (\mathrm{pen}(\hat m) - p_1(\hat m)) - (p_1(\hat m) - V(\hat m)) - 2\nu_n(s_{m_o} - s_{\hat m}) &(22)\\
&\le \|s - s_{m_o}\|_2^2 + \mathrm{pen}(m_o) - \frac{V(m_o)}{2} - \eta p_2(\hat m,\hat m) + \eta p_2(\hat m, m_o) + \frac{\|s_{m_o} - s_{\hat m}\|_2^2}{\eta} &(23)
\end{aligned}$$
$$\left(1 - \frac{1}{\eta}\right)\|s - \tilde s\|_2^2 \le \left(1 + \frac{1}{\eta}\right)\|s - s_{m_o}\|_2^2 + \mathrm{pen}(m_o) + \eta p_2(m_o, m_o). \qquad (24)$$
In (23), we used that $V(m_o) = 2\|s_{m_o} - \hat s_{m_o}\|_2^2 \ge 0$. In (24), we used that $V_{m_o} \ge 0$.
Pythagoras' theorem gives
$$\|s - \hat s_{m_o}\|_2^2 - \frac{V(m_o)}{2} = \|s - s_{m_o}\|_2^2 \qquad\text{and}\qquad \|s - s_{\hat m}\|_2^2 \le \|s - \tilde s\|_2^2.$$
Finally, we prove that we can choose $\eta = (\log n)^\gamma$, with $\gamma > 0$, such that $\eta p_2(m_o, m_o) = o(\mathrm{pen}(m_o))$, and we conclude the proof of Theorem 3.1 from the previous inequalities.
We decompose the proof into several claims corresponding to the previous steps.
Claim 1: For all $l = 0,\ldots,p-1$, let us define $A_l = (X_{2lq+1},\ldots,X_{(2l+1)q})$ and $B_l = (X_{(2l+1)q+1},\ldots,X_{(2l+2)q})$. There exist random vectors $A_l^* = (X^*_{2lq+1},\ldots,X^*_{(2l+1)q})$ and $B_l^* = (X^*_{(2l+1)q+1},\ldots,X^*_{(2l+2)q})$ such that, for all $l = 0,\ldots,p-1$:
3. $\mathbb{P}(A_l \ne A_l^*) \le \beta_q$
Proof of Claim 1:
The proof is derived from Berbee's lemma; we refer to Proposition 5.1 in Viennet [25]
for further details about this construction.
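The blocking device itself is easy to implement; the sketch below (illustrative, not the authors' construction) cuts a trajectory into the alternating blocks $A_l$ and $B_l$, which Berbee's lemma then couples with independent copies up to probability $\beta_q$ per block.

```python
# Split the first 2*p*q observations into alternating blocks of length q:
# A-blocks at even positions, B-blocks at odd positions.
import numpy as np

def make_blocks(X, q):
    """Return (A_blocks, B_blocks), each an array of shape (p, q)."""
    p = len(X) // (2 * q)
    blocks = np.asarray(X)[: 2 * p * q].reshape(2 * p, q)
    return blocks[0::2], blocks[1::2]

A, B = make_blocks(np.arange(20), q=3)  # p = 3 blocks of each kind
```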
Hereafter, we assume that, for some κ > 2, $\sqrt n(\log n)^\kappa/2 \le p \le \sqrt n(\log n)^\kappa$ and, for the
sake of simplicity, that $pq = n/2$, the modifications needed to handle the extra term when
$q = [n/(2p)]$ being straightforward. Let $\Omega_C = \{\forall l = 0,\ldots,p-1:\ A_l = A_l^*,\ B_l = B_l^*\}$. We
have
$$\mathbb{P}(\Omega_C^c) \le 2p\beta_q \le 2^{2+\theta}\,\frac{(\log n)^{(\theta+2)\kappa}}{n^{\theta/2}}.$$
Let us first deal with the quadratic term $V(m)$.
Claim 2: Under the assumptions of Theorem 3.1, let ε > 0 and $1 < \gamma < \kappa/2$. We define
$L_1^2 = 2\Phi^2\kappa_1$, $L_2^2 = 8\Phi^{3/2}\sqrt{\kappa_2}$, $L_3 = 2\Phi\kappa(\epsilon)$ and
$$L_{1,m} = 4\left((1+\epsilon)L_1 + L_2\sqrt{\frac{(\log n)^\gamma}{D_m^{1/4}}} + \frac{L_3}{(\log n)^{\kappa-\gamma}}\right)^2. \qquad (25)$$
Then, we have
$$\mathbb{P}\left(\left\{\sup_{m\in\mathcal{M}_n}\left\{V(m) - \frac{L_{1,m}D_m}{n}\right\} \ge 0\right\}\cap\Omega_C\right) \le L_{s,\gamma}\exp\left(-\frac{(\log n)^\gamma}{\|s\|_2^{1/2}}\right),$$
where $L_{s,\gamma} = 2\sum_{D=1}^\infty \exp(-(\log D)^\gamma/\|s\|_2^{1/2})$. In particular, for all $r > 0$, there exists a
constant $L'_{s,r}$, depending on $\|s\|_2$, such that
$$\mathbb{P}\left(\left\{\sup_{m\in\mathcal{M}_n}\left\{V(m) - \frac{L_{1,m}D_m}{n}\right\} \ge 0\right\}\cap\Omega_C\right) \le \frac{L'_{s,r}}{n^r}.$$
Proof of Claim 2:
Let $P_n^*(t) = \frac{1}{n}\sum_{i=1}^n t(X_i^*)$ and $\nu_n^*(t) = (P_n^* - P)t$. We have
$$V(m)\,\mathbf{1}_{\Omega_C} = 2\sum_{(j,k)\in m} (\nu_n^*)^2(\psi_{j,k})\,\mathbf{1}_{\Omega_C}.$$
Let $B_1(S_m) = \{t\in S_m;\ \|t\|_2 \le 1\}$. For all $t\in B_1(S_m)$, let $\bar t(x_1,\ldots,x_q) = \frac{1}{2q}\sum_{i=1}^q t(x_i)$ and,
for all functions $g: \mathbb{R}^q \to \mathbb{R}$, let
$$P_{A,p}^* g = \frac{1}{p}\sum_{j=0}^{p-1} g(A_j^*), \qquad P_{B,p}^* g = \frac{1}{p}\sum_{j=0}^{p-1} g(B_j^*), \qquad \bar P g = \int g\, dP_A,$$
and $\bar\nu_{A,p}\, g = (P_{A,p}^* - \bar P)g$, $\bar\nu_{B,p}\, g = (P_{B,p}^* - \bar P)g$.
Now we have
$$\sum_{(j,k)\in m} (\nu_n^*)^2(\psi_{j,k}) \le 2\sum_{(j,k)\in m} \bar\nu_{A,p}^2\,\bar\psi_{j,k} + 2\sum_{(j,k)\in m} \bar\nu_{B,p}^2\,\bar\psi_{j,k}.$$
In order to handle these terms, we use Proposition 7.4, which is stated in Section 7. Taking
$$B_m^2 = \sum_{(j,k)\in m} \mathrm{Var}(\bar\psi_{j,k}(A_1)), \qquad V_m^2 = \sup_{t\in B_1(S_m)} \mathrm{Var}(\bar t(A_1)), \qquad H_m^2 = \Big\|\sum_{(j,k)\in m} (\bar\psi_{j,k})^2\Big\|_\infty,$$
we have
$$\forall x > 0, \qquad \mathbb{P}\left(\sqrt{\sum_{(j,k)\in m}\bar\nu_{A,p}^2\,\bar\psi_{j,k}} \ge \frac{(1+\epsilon)}{\sqrt p}B_m + V_m\sqrt{\frac{2x}{p}} + \kappa(\epsilon)\frac{H_m x}{p}\right) \le e^{-x}. \qquad (26)$$
Thus
$$B_m^2 = \sum_{(j,k)\in m}\mathrm{Var}(\bar\psi_{j,k}(A_1)) \le \frac{1}{q}\sum_{(j,k)\in m} P(b\,\psi_{j,k}^2) \le \frac{\kappa_1}{q}\Big\|\sum_{(j,k)\in m}\psi_{j,k}^2\Big\|_\infty.$$
From Assumption [M2], $\Big\|\sum_{(j,k)\in m}\psi_{j,k}^2\Big\|_\infty \le \Phi^2 D_m$; thus
$$B_m^2 \le \frac{\Phi^2\kappa_1 D_m}{q}. \qquad (27)$$
From Viennet's and the Cauchy-Schwarz inequalities,
$$H_m^2 = \Big\|\sum_{(j,k)\in m}\bar\psi_{j,k}^2\Big\|_\infty \le \frac{1}{4}\Big\|\sum_{(j,k)\in m}\psi_{j,k}^2\Big\|_\infty \le \frac{\Phi^2 D_m}{4}. \qquad (29)$$
We apply Inequality (26) with $x = ((\log D_m)^\gamma + y_n)/\|s\|_2^{1/2}$ and the evaluations (27), (28)
and (29). Recalling that $1/p \le 2/(\sqrt n(\log n)^\kappa)$, this leads to
$$\mathbb{P}\left(\sum_{(j,k)\in m}\bar\nu_{A,p}^2\,\bar\psi_{j,k} \ge \frac{L_{1,m}D_m}{n}\right) \le \exp\left(-\frac{(\log D_m)^\gamma}{\|s\|_2^{1/2}}\right)\exp\left(-\frac{y_n}{\|s\|_2^{1/2}}\right).$$
Claim 3: We keep the notations $\kappa/2 > \gamma > 1$ and $L_2$ of the proof of Claim 2. For all
$m, m' \in \mathcal{M}_n$ we take
$$L_{m,m'} = 4\left(L_2\sqrt{\frac{(\log n)^\gamma}{(D_m\vee D_{m'})^{1/4}}} + \frac{4\Phi}{3(\log n)^{\kappa-\gamma}}\right)^2; \qquad (30)$$
we have, for all η > 0,
$$\mathbb{P}\left(\sup_{m,m'\in\mathcal{M}_n}\left\{\nu_n^*(s_m - s_{m'}) - \frac{\|s_m - s_{m'}\|_2^2}{2\eta} - \frac{\eta}{2}\,\frac{L_{m,m'}(D_m\vee D_{m'})}{n}\right\} > 0\right) \le L_{s,\gamma}\, e^{-\frac{(\log n)^\gamma}{\|s\|_2^{1/2}}},$$
with $L_{s,\gamma} = 2\sum_{m,m'\in\mathcal{M}_n} e^{-\frac{(\log(D_m\vee D_{m'}))^\gamma}{\|s\|_2^{1/2}}}$.
Remark: The constant $L_{s,\gamma}$ is finite since, for all $x, y > 0$, $(\log(x\vee y))^\gamma \ge ((\log x)^\gamma + (\log y)^\gamma)/2$.
As in Claim 2, when $(L_2/L_1)^8(\log n)^{4(2\kappa-\gamma)} \le D_m \le n$, we have
$$L_{m,m'} \le \left(1 + \frac{2^{3/2}}{3\sqrt{\kappa_1}}(\log n)^{-2(\kappa-2\gamma)}\right)^2 4L_1^2.$$
Proof of Claim 3:
We keep the notations of the proof of Claim 2 and, for $m, m' \in \mathcal{M}_n$, let $t_{m,m'} = (s_m - s_{m'})/\|s_m - s_{m'}\|_2$. We use the inequality $2ab \le a^2\eta^{-1} + b^2\eta$, which holds for all
$a, b \in \mathbb{R}$, $\eta > 0$. This leads to
we have
$$\mathbb{P}\left(\bar\nu_{A,p}(\bar t_{m,m'}) > \sqrt{\frac{L'_{m,m'}(D_m\vee D_{m'})}{4n}}\right) \le \exp\left(-\frac{(\log(D_m\vee D_{m'}))^\gamma}{\|s\|_2^{1/2}}\right)e^{-y_n/\|s\|_2^{1/2}}.$$
The result follows by taking $y_n = (\log n)^\gamma$ and using $2 \le D_m \le n$.
Conclusion of the proof:
Let η > 0 and $\mathrm{pen}'(m) \ge (L_{1,m} + \eta L_{m,m})D_m/n$, where $L_{1,m}$ and $L_{m,m}$ are defined respectively
by (25) and (30). From Claims 1, 2 and 3 and (24), we obtain that, for all $m_o$ and
with probability larger than $1 - L_{s,\theta}(\log n)^{(\theta+2)\kappa}n^{-\theta/2}$,
$$\left(1-\frac{1}{\eta}\right)\|s-\tilde s\|_2^2 \le \left(1+\frac{1}{\eta}\right)\|s-s_{m_o}\|_2^2 + \mathrm{pen}'(m_o) + \eta L_{m_o,m_o}\frac{D_{m_o}}{n}. \qquad (32)$$
Assume that $D_m \ge (L_2/L_1)^8(\log n)^{4(2\kappa-\gamma)}$; then we have, from the preceding remarks,
$$L_{1,m} \le \left(1+\epsilon+\left(1+\frac{2\kappa(\epsilon)}{\sqrt{\kappa_1}}\right)(\log n)^{-(\kappa-2\gamma)}\right)^2 4L_1^2 \quad\text{and}\quad L_{m,m} \le \left(1+\frac{2^{3/2}}{3\sqrt{\kappa_1}}(\log n)^{-2(\kappa-\gamma)}\right)^2 4L_1^2.$$
Take $\eta = (\log n)^{\kappa-\gamma}$; we have $(L_{1,m_o} + \eta L_{m_o,m_o})D_{m_o}/n \le C\,\mathrm{pen}(m_o)$. Fix ε > 0 such that
$(1+\epsilon)^2 < K/4$. Since $\kappa > \gamma$, for $n \ge n_o$ we have $L_{1,m} + \eta L_{m,m} \le KL_1^2$; thus inequality
(13) follows from (32) as soon as $n > n_o$. We remove the condition $n > n_o$ by
enlarging the constant $L_s$ in (13) if necessary.
We keep the notations $\nu_n^*$, $\bar\nu_{A,p}$, $\bar\nu_{B,p}$, $\bar t$ and $B_1(S_m)$ that we introduced in the proof of Theorem 3.1. As in the proof of Theorem 3.1, we assume that, for some κ > 2, $\sqrt n(\log n)^\kappa/2 \le p \le \sqrt n(\log n)^\kappa$ and, for the sake of simplicity, that $pq = n/2$, the modifications needed to
handle the extra term when $q = [n/(2p)]$ being straightforward. We have
$$V(\hat m) = \sum_{(j,k)\in\hat m}\nu_n^2(\psi_{j,k}) \le 2\sum_{(j,k)\in\hat m}(P_n - P_n^*)^2(\psi_{j,k}) + 2\sum_{(j,k)\in\hat m}(\nu_n^*)^2(\psi_{j,k}). \qquad (33)$$
Proof of Claim 2:
$$\begin{aligned}
\mathbb{E}\left(\sum_{(j,k)\in\hat m}(P_n - P_n^*)^2(\psi_{j,k})\right) &\le \mathbb{E}\left(\sup_{m\in\mathcal{M}_n}\sum_{(j,k)\in m}(P_n - P_n^*)^2(\psi_{j,k})\right)\\
&\le \sum_{m\in\mathcal{M}_n}\mathbb{E}\left(\sum_{(j,k)\in m}(P_n - P_n^*)^2(\psi_{j,k})\right)\\
&\le \frac{2}{p^2}\sum_{m\in\mathcal{M}_n}\sum_{l,l'=1}^p\left(g_{A,m}(j,k,l,l') + g_{B,m}(j,k,l,l')\right)
\end{aligned}$$
with
$$g_{A,m}(j,k,l,l') = \sum_{(j,k)\in m}\mathbb{E}\left[\left(\bar\psi_{j,k}(A_l) - \bar\psi_{j,k}(A_l^*)\right)\left(\bar\psi_{j,k}(A_{l'}) - \bar\psi_{j,k}(A_{l'}^*)\right)\right].$$
Since
$$\left|\bar\psi_{j,k}(x) - \bar\psi_{j,k}(y)\right| \le \frac{K_L\, 2^{3j/2}\,|x-y|_q}{2q},$$
we get
$$\begin{aligned}
g_{A,m}(j,k,l,l') &\le \sum_{(j,k)\in m}\mathbb{E}\left[\left|\bar\psi_{j,k}(A_l) - \bar\psi_{j,k}(A_l^*)\right|\,K_L\, 2^{3j/2}\,\frac{|A_{l'} - A_{l'}^*|_q}{2q}\right]\\
&\le \frac{K_L\tau_q}{2}\sum_{(j,k)\in m}2^{3j/2}\sup_{x,y\in\mathbb{R}^q}\left|\bar\psi_{j,k}(x) - \bar\psi_{j,k}(y)\right|\\
&\le \frac{K_L\tau_q}{4}\sum_{j=0}^{J_m}2^{3j/2}\sup_{x,y\in\mathbb{R}}\left\{\sum_{k\in\mathbb{Z}}|\psi_{j,k}(x) - \psi_{j,k}(y)|\right\}\\
&\le \frac{2}{3}AK_LK_\infty\, 2^{2J_m}\tau_q, \qquad\text{since } \Big\|\sum_{k\in\mathbb{Z}}|\psi_{j,k}|\Big\|_\infty \le AK_\infty 2^{j/2}.
\end{aligned}$$
We can do the same computations for the term $g_{B,m}(j,k,l,l')$ and we obtain
$$\mathbb{E}\left(\sum_{(j,k)\in\hat m}\left((P_n - P_n^*)(\psi_{j,k})\right)^2\right) \le L\tau_q\sum_{m\in\mathcal{M}_n}2^{2J_m} \le L\tau_q\, 2^{2J_n} \le L\,\frac{(\log n)^{\kappa(\theta+1)}}{n^{(\theta-3)/2}}.$$
The last inequality comes from $q \ge \sqrt n/(2(\log n)^\kappa)$ and Assumption [AR]; the one before
comes from Assumption [W].
Claim 3: Let us keep the notations of Theorem 4.1, let $u = 6/(7+\theta) < 1/2$ and recall
that κ > 2. Let γ be a real number in $(1, \kappa/2)$. Let
$$L_1^2 = AK_\infty K_{BV}\sum_{l=0}^\infty\tilde\beta_l, \qquad L_2^2 = 2\Phi K_{BV}^u\sum_{k=0}^\infty\tilde\beta_k^u, \qquad L_3 = \kappa(\epsilon)\Phi$$
and
$$L_{1,m} = 4(1+\epsilon)\left((1+\epsilon)L_1 + L_2\sqrt{\frac{(\log D_m)^\gamma}{D_m^{1/2-u}}} + L_3\frac{(\log D_m)^\gamma}{(\log n)^\kappa}\right)^2. \qquad (35)$$
There exists a constant $L_s$ such that
$$\mathbb{E}\left(\sup_{m\in\mathcal{M}_n}\left\{\sum_{(j,k)\in m}(\nu_n^*)^2(\psi_{j,k}) - \frac{L_{1,m}D_m}{n}\right\}\right) \le \frac{L_s}{n}.$$
Proof of Claim 3:
As in the previous section, we use the following decomposition:
$$\sum_{(j,k)\in m}(\nu_n^*)^2(\psi_{j,k}) = \sum_{(j,k)\in m}\left(\bar\nu_{A,p}(\bar\psi_{j,k}) + \bar\nu_{B,p}(\bar\psi_{j,k})\right)^2 \le 2\sum_{(j,k)\in m}\bar\nu_{A,p}^2(\bar\psi_{j,k}) + 2\sum_{(j,k)\in m}\bar\nu_{B,p}^2(\bar\psi_{j,k}).$$
We treat both terms with Proposition 7.4, applied to the random variables $(A_l^*)_{l=0,\ldots,p-1}$
and $(B_l^*)_{l=0,\ldots,p-1}$ and to the class of functions $(\bar\psi_{j,k})_{(j,k)\in m}$. Let
$$B_m^2 = \sum_{(j,k)\in m}\mathrm{Var}\left(\bar\psi_{j,k}(A_1)\right), \qquad V_m^2 = \sup_{t\in B_1(S_m)}\mathrm{Var}(\bar t(A_1)), \qquad H_m^2 = \Big\|\sum_{(j,k)\in m}\bar\psi_{j,k}^2\Big\|_\infty.$$
Let us now evaluate $B_m$, $V_m$ and $H_m$. We have
$$B_m^2 = \frac{1}{(2q)^2}\sum_{(j,k)\in m}\mathrm{Var}\left(\sum_{i=1}^q\psi_{j,k}(X_i)\right).$$
From (17) and (15), we have, for all $j, k$, $\|\psi_{j,k}\|_{BV} \le K_{BV}2^{j/2}$ and, for all $j$, $\|\sum_{k\in\mathbb{Z}}|\psi_{j,k}|\|_\infty \le AK_\infty 2^{j/2}$.
Thus, from Inequality (5),
$$\begin{aligned}
\sum_{(j,k)\in m}\mathrm{Var}\left(\sum_{i=1}^q\psi_{j,k}(X_i)\right) &\le 2\sum_{(j,k)\in m}\sum_{l=1}^q(q+1-l)\left|\mathrm{Cov}(\psi_{j,k}(X_1), \psi_{j,k}(X_l))\right|\\
&\le 2q\sum_{j=0}^{J_m}\sum_{k\in\mathbb{Z}}\sum_{l=1}^q\|\psi_{j,k}\|_{BV}\,\mathbb{E}\left(|\psi_{j,k}(X_1)|\,b(\sigma(X_1), X_l)\right)\\
&\le 2K_{BV}\,q\sum_{j=0}^{J_m}2^{j/2}\Big\|\sum_{k\in\mathbb{Z}}|\psi_{j,k}(X_0)|\Big\|_\infty\sum_{l=1}^q\tilde\beta_{l-1}\\
&\le 2q\left(AK_\infty K_{BV}\sum_{l=0}^\infty\tilde\beta_l\right)D_m.
\end{aligned}$$
Thus
$$B_m^2 \le \frac{L_1^2 D_m}{2q}. \qquad (37)$$
Since $t$ belongs to $B_1(S_m)$, we have $t = \sum_{(j,k)\in m}a_{j,k}\psi_{j,k}$ with $\sum_{(j,k)\in m}a_{j,k}^2 \le 1$. Thus,
by the Cauchy-Schwarz inequality,
$$\begin{aligned}
\sum_{i=1}^l|t(x_{i+1}) - t(x_i)| &\le \sum_{(j,k)\in m}|a_{j,k}|\sum_{i=1}^l|\psi_{j,k}(x_{i+1}) - \psi_{j,k}(x_i)|\\
&\le \left(\sum_{(j,k)\in m}a_{j,k}^2\right)^{1/2}\left(\sum_{(j,k)\in m}\Big(\sum_i|\psi_{j,k}(x_{i+1}) - \psi_{j,k}(x_i)|\Big)^2\right)^{1/2}\\
&\le \left(\sum_{(j,k)\in m}\|\psi_{j,k}\|_{BV}^2\right)^{1/2} \le K_{BV}D_m.
\end{aligned}$$
Thus $\|t\|_{BV} \le K_{BV}D_m$. From Assumption [M2], we have $\|t\|_\infty \le \Phi\sqrt{D_m}$. Thus
$$|\mathrm{Cov}(t(X_1), t(X_k))| \le \Phi K_{BV}\tilde\beta_{k-1}D_m^{3/2}. \qquad (39)$$
Moreover, we have, by the Cauchy-Schwarz inequality and [M2],
$$|\mathrm{Cov}(t(X_1), t(X_k))| \le \|t\|_\infty\|t\|_2\|s\|_2 \le \Phi\|s\|_2\sqrt{D_m}. \qquad (40)$$
Let
$$a = \Phi K_{BV}\tilde\beta_{k-1}D_m^{3/2}, \qquad b = \Phi\|s\|_2\sqrt{D_m}, \qquad u = \frac{6}{7+\theta} < \frac{1}{2}.$$
From (39) and (40), bounding the covariance by $a^u b^{1-u}$, we derive that
$$|\mathrm{Cov}(t(X_1), t(X_k))| \le L'_k D_m^{1/2+u}, \qquad (41)$$
where $L'_k = \Phi K_{BV}^u\tilde\beta_{k-1}^u\|s\|_2^{1-u}$.
Finally,
$$H_m^2 \le \frac{1}{4}\Big\|\sum_{(j,k)\in m}\psi_{j,k}^2\Big\|_\infty \le \frac{\Phi^2 D_m}{4}. \qquad (42)$$
Let $y > 0$ and let us apply Inequality (36) with $x = \frac{(\log D_m)^\gamma}{\|s\|_2^{1-u}} + \frac{y}{D_m^{1/2+u}}$.
We have, from (37), (41) and (42),
$$\mathbb{P}\Bigg(\sum_{(j,k)\in m}(\bar\nu_{A,p})^2(\bar\psi_{j,k}) > (1+\epsilon)\Bigg(L_1\sqrt{\frac{2D_m}{2pq}} + \sqrt{\frac{L_2^2\|s\|_2^{1-u}D_m^{1/2+u}}{2pq}\left(\frac{(\log D_m)^\gamma}{\|s\|_2^{1-u}} + \frac{y}{D_m^{1/2+u}}\right)}\\ + \frac{L_3\sqrt{D_m}}{2p}\left(\frac{(\log D_m)^\gamma}{\|s\|_2^{1-u}} + \frac{y}{D_m^{1/2+u}}\right)\Bigg)^2\Bigg) \le e^{-\frac{(\log D_m)^\gamma}{\|s\|_2^{1-u}}}\,e^{-D_m^{-(1/2+u)}y}.$$
Then, we use the inequality $\sqrt{\alpha+\beta} \le \sqrt\alpha + \sqrt\beta$ with
$$\alpha = \frac{(\log D_m)^\gamma}{\|s\|_2^{1-u}} \qquad\text{and}\qquad \beta = \frac{y}{D_m^{1/2+u}},$$
and the inequality $(a+b)^2 \le (1+\epsilon)a^2 + (1+\epsilon^{-1})b^2$ with
$$a = \left((1+\epsilon)L_1 + L_2\sqrt{\frac{(\log D_m)^\gamma}{D_m^{1/2-u}}} + L_3\frac{(\log D_m)^\gamma}{\|s\|_2^{1-u}(\log n)^\kappa}\right)\sqrt{\frac{D_m}{n}}$$
and
$$b = \frac{1}{\sqrt n}\left(L_2\sqrt{\|s\|_2^{1-u}\,y} + \frac{L_3\, y}{(\log n)^\kappa D_m^u}\right).$$
Thus, for all $y > 0$,
$$\mathbb{P}\left(\sup_{m\in\mathcal{M}_n}\left\{\sum_{(j,k)\in m}(\bar\nu_{A,p})^2(\bar\psi_{j,k}) - \frac{L_{1,m}D_m}{n}\right\} > \frac{L_s}{n}(y+y^2)\right) \le \sum_{m\in\mathcal{M}_n}e^{-\frac{(\log D_m)^\gamma}{\|s\|_2^{1-u}}}\,e^{-D_m^{-(1/2+u)}y},$$
where $L_s = 2(1+\epsilon^{-1})\left(\left(L_2\sqrt{\|s\|_2^{1-u}}\right)\vee\left(L_3/((\log 2)^\kappa 2^u)\right)\right)^2$. We can integrate this last
inequality to prove Claim 3.
Claim 4: There exists a constant $L_{s,\theta}$, depending on $\|s\|_2$ and θ, such that, for all η > 0,
$$\mathbb{E}\left(\sup_{m,m'\in\mathcal{M}_n}\left\{\nu_n(s_m - s_{m'}) - \frac{\|s_m - s_{m'}\|_2^2}{2\eta} - \eta\frac{L_2(m,m')(D_m\vee D_{m'})}{n}\right\}\right) \le \frac{\eta L_{s,\theta}}{n}.$$
Proof of Claim 4:
$$\begin{aligned}
&\mathbb{E}\left(\sup_{m,m'\in\mathcal{M}_n}\left\{\nu_n(s_m - s_{m'}) - \frac{\|s_m - s_{m'}\|_2^2}{2\eta} - \eta\frac{L_2(m,m')(D_m\vee D_{m'})}{n}\right\}\right)\\
&\qquad \le \mathbb{E}\left(\sup_{m,m'}(P_n - P_n^*)(s_m - s_{m'})\right)\\
&\qquad\quad + \mathbb{E}\left(\sup_{m,m'}\left\{\nu_n^*(s_m - s_{m'}) - \frac{\|s_m - s_{m'}\|_2^2}{2\eta} - \eta\frac{L_2(m,m')(D_m\vee D_{m'})}{n}\right\}\right). \qquad (44)
\end{aligned}$$
Let us fix $j \in [J_m + 1, J_{m'}]$. From Assumption [W], there are fewer than $A$ indices $k \in \mathbb{Z}$
such that $\psi_{j,k}(x) \ne 0$; thus there are fewer than $2A$ indices such that $|\psi_{j,k}(x) - \psi_{j,k}(y)| \ne 0$.
Hence
$$\sum_{k\in\mathbb{Z}}|P\psi_{j,k}|\,\frac{|\psi_{j,k}(x) - \psi_{j,k}(y)|}{|x-y|} \le 2A\sup_{k\in\mathbb{Z}}\left\{|P\psi_{j,k}|\,\mathrm{Lip}(\psi_{j,k})\right\} \le 2A\|s\|_2 K_L\, 2^{3j/2}.$$
Thus $\mathrm{Lip}(s_m - s_{m'}) \le A\|s\|_2 K_L\sqrt 8\, 2^{3J_{m'}/2}/(\sqrt 8 - 1)$ and, by Assumptions [W], [AR] and
the value of $q$,
$$\mathbb{E}\left(\sup_{m,m'}(P_n - P_n^*)(s_m - s_{m'})\right) \le L_s\, n^{3/2}(\log n)\tau_q \le L_s\,\frac{(\log n)^{\kappa(\theta+1)+1}}{n^{(\theta-2)/2}}. \qquad (45)$$
where, as in the proof of Theorem 3.1, $t_{m,m'} = (s_m - s_{m'})/\|s_m - s_{m'}\|_2$. We apply
Bernstein's inequality to the function $\bar t_{m,m'}$ and the variables $A_l^*$; we have
$$\forall x > 0, \qquad \mathbb{P}\left(\bar\nu_{A,p}(\bar t_{m,m'}) > \sqrt{\frac{2\mathrm{Var}(\bar t_{m,m'}(A_0))\,x}{p}} + \frac{\|\bar t_{m,m'}\|_\infty x}{3p}\right) \le e^{-x}. \qquad (47)$$
Since $\|t_{m,m'}\|_\infty \le \Phi\sqrt{D_m\vee D_{m'}}$, we have
$$\mathrm{Cov}(t_{m,m'}(X_1), t_{m,m'}(X_{k+1})) \le \Phi K_{BV}\tilde\beta_k(D_m\vee D_{m'})^{3/2}. \qquad (48)$$
Moreover, we have
$$\mathrm{Cov}(t_{m,m'}(X_1), t_{m,m'}(X_{k+1})) \le \|t_{m,m'}\|_\infty\|t_{m,m'}\|_2\|s\|_2 \le \Phi\|s\|_2\sqrt{D_m\vee D_{m'}}. \qquad (49)$$
Thus
$$\mathrm{Var}(\bar t_{m,m'}(A_0)) \le \left(\Phi K_{BV}^u\sum_{k=0}^\infty\tilde\beta_k^u\right)\|s\|_2^{1-u}\,\frac{(D_m\vee D_{m'})^{1/2+u}}{2q}. \qquad (50)$$
Moreover,
$$\|\bar t_{m,m'}\|_\infty \le \frac{1}{2}\|t_{m,m'}\|_\infty \le \frac{1}{2}\Phi\sqrt{D_m\vee D_{m'}}. \qquad (51)$$
Now, we use (47) with $x = (\log(D_m\vee D_{m'}))^\gamma/\|s\|_2^{1-u} + y/(D_m\vee D_{m'})^{1/2+u}$. From (50)
and (51), we have, for all $y > 0$,
$$\mathbb{P}\Bigg(\bar\nu_{A,p}(\bar t_{m,m'}) > L_2\sqrt{\frac{(D_m\vee D_{m'})^{1/2+u}\|s\|_2^{1-u}}{2pq}\left((\log(D_m\vee D_{m'}))^\gamma + \frac{y}{(D_m\vee D_{m'})^{1/2+u}}\right)}\\ + \frac{\Phi\sqrt{D_m\vee D_{m'}}}{6p}\left(\frac{(\log(D_m\vee D_{m'}))^\gamma}{\|s\|_2^{1-u}} + \frac{y}{(D_m\vee D_{m'})^{1/2+u}}\right)\Bigg) \le e^{-\frac{(\log(D_m\vee D_{m'}))^\gamma}{\|s\|_2^{1-u}}}\,e^{-\frac{y}{(D_m\vee D_{m'})^{1/2+u}}}.$$
Now we use the inequality $\sqrt{a+b} \le \sqrt a + \sqrt b$ with
$$a = (\log(D_m\vee D_{m'}))^\gamma \qquad\text{and}\qquad b = \frac{\|s\|_2^{1-u}\,y}{(D_m\vee D_{m'})^{1/2+u}},$$
with
$$L_2(m,m') = \left(L_2\sqrt{\frac{(\log(D_m\vee D_{m'}))^\gamma}{(D_m\vee D_{m'})^{1/2-u}}} + \frac{\Phi(\log(D_m\vee D_{m'}))^\gamma}{3(\log n)^\kappa}\right)^2$$
and
$$L_s = \left(L_2\sqrt{\|s\|_2^{1-u}}\right)\vee\frac{\Phi}{3(\log 2)^\kappa 2^u}.$$
Thus, we obtain
$$\mathbb{P}\left((\bar\nu_{A,p}\bar t_{m,m'})^2 > 2\,\frac{L_2(m,m')(D_m\vee D_{m'})}{n} + 4\,\frac{L_s^2}{n}(y+y^2)\right) \le e^{-\frac{(\log(D_m\vee D_{m'}))^\gamma}{\|s\|_2^{1-u}}}\,e^{-\frac{y}{(D_m\vee D_{m'})^{1/2+u}}}.$$
The same result holds for $\bar\nu_{B,p}\bar t_{m,m'}$. Thus we obtain from (46)
We deduce that
$$\mathbb{P}\Bigg(\exists m, m'\in\mathcal{M}_n:\ \nu_n^*(s_m - s_{m'}) - \frac{\|s_m - s_{m'}\|_2^2}{2\eta} - 4\eta\,\frac{L_2(m,m')(D_m\vee D_{m'})}{n} \ge 8\eta\,\frac{L_s^2}{n}(y+y^2)\Bigg) \le 2\sum_{m,m'\in\mathcal{M}_n}e^{-\frac{(\log(D_m\vee D_{m'}))^\gamma}{\|s\|_2^{1-u}}}\,e^{-\frac{y}{(D_m\vee D_{m'})^{1/2+u}}}.$$
We use the inequality $(a+b)^2 \le (1+\epsilon)a^2 + (1+\epsilon^{-1})b^2$ to obtain (53). Moreover, we have
$$L_2(m,m) \le 4L_1^2\left(1 + \frac{\Phi}{6L_1}(\log n)^{-(\kappa-\gamma)}\right)^2.$$
As in the proof of Theorem 3.1, we take $\eta = (\log n)^{\kappa-\gamma}$ and we fix ε sufficiently small. For
$n \ge n_o$, we have $2L_{1,m} + \eta L_2(m,m) < KL_1^2$. Thus inequality (18) follows from (52).
7 Appendix
This section is devoted to technical lemmas that are needed in the proofs.
7.2 Concentration inequalities
We sum up in this section the concentration inequalities we used in the proofs. We begin
with Bernstein’s inequality
Now we give the most important tool of our proofs: a concentration inequality for
the supremum of the empirical process over a class of functions. We give here the version
of Bousquet [10].
We can deduce from this theorem a concentration inequality for χ²-type statistics.
This is Proposition 7.3 of Massart [20].
Proposition 7.4 Let $X_1, \ldots, X_n$ be independent and identically distributed random variables
valued in some measurable space $(\mathcal{X}, \mathcal{X})$. Let $P$ denote their common distribution.
Let $\{\phi_\lambda\}_{\lambda\in\Lambda}$ be a finite family of measurable and bounded functions on $(\mathcal{X}, \mathcal{X})$. Let
$$H_\Lambda^2 = \Big\|\sum_{\lambda\in\Lambda}\phi_\lambda^2\Big\|_\infty, \qquad B_\Lambda^2 = \sum_{\lambda\in\Lambda}\mathrm{Var}(\phi_\lambda(X_1)), \qquad V_\Lambda^2 = \sup_{a\in S_\Lambda}\mathrm{Var}\left(\sum_{\lambda\in\Lambda}a_\lambda\phi_\lambda(X_1)\right).$$
Proof:
Following Massart [20], Proposition 7.3, we remark that, by the Cauchy-Schwarz inequality,
$$\left(\sum_{\lambda\in\Lambda}\nu_n^2\phi_\lambda\right)^{1/2} = \sup_{a\in S_\Lambda}\sum_{\lambda\in\Lambda}a_\lambda\nu_n\phi_\lambda = \sup_{a\in S_\Lambda}\nu_n\left(\sum_{\lambda\in\Lambda}a_\lambda\phi_\lambda\right).$$
Thus the result follows by applying Talagrand's theorem to the class of functions
$$\mathcal{F} = \left\{t = \sum_{\lambda\in\Lambda}a_\lambda\phi_\lambda;\ a\in S_\Lambda\right\}.$$
References
[1] H. Akaike. Information theory and an extension of the maximum likelihood principle.
In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages
267–281. Akadémiai Kiadó, Budapest, 1973.
[2] Hirotugu Akaike. Statistical predictor identification. Ann. Inst. Statist. Math.,
22:203–217, 1970.
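[3] Donald W. K. Andrews. Non-strong mixing autoregressive processes. J. Appl. Probab., 21(4):930–934, 1984.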
[4] S. Arlot and P. Massart. Data-driven calibration of penalties for least squares regression. Submitted to Journal of Machine Learning Research, 2008.
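[6] Yannick Baraud, Fabienne Comte, and Gabrielle Viennet. Adaptive estimation in autoregression or β-mixing regression via model selection. Ann. Statist., 29(3):839–875, 2001.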
[7] Henry C. P. Berbee. Random walks with stationary increments and renewal theory,
volume 112 of Mathematical Centre Tracts. Mathematisch Centrum, Amsterdam,
1979.
[8] Lucien Birgé and Pascal Massart. From model selection to adaptive estimation. In
Festschrift for Lucien Le Cam, pages 55–87. Springer, New York, 1997.
[9] Lucien Birgé and Pascal Massart. Minimal penalties for Gaussian model selection.
Probab. Theory Related Fields, 138(1-2):33–73, 2007.
[10] Olivier Bousquet. A Bennett concentration inequality and its application to suprema
of empirical processes. C. R. Math. Acad. Sci. Paris, 334(6):495–500, 2002.
[11] Richard C. Bradley. Introduction to strong mixing conditions. Vol. 1. Kendrick Press,
Heber City, UT, 2007.
[13] Fabienne Comte and Florence Merlevède. Adaptive estimation of the stationary den-
sity of discrete and continuous time mixing processes. ESAIM Probab. Statist., 6:211–
238 (electronic), 2002. New directions in time series analysis (Luminy, 2001).
[14] Jérôme Dedecker, Paul Doukhan, Gabriel Lang, José Rafael León R., Sana Louhichi,
and Clémentine Prieur. Weak dependence: with examples and applications, volume
190 of Lecture Notes in Statistics. Springer, New York, 2007.
[15] Jérôme Dedecker and Clémentine Prieur. New dependence coefficients. Examples and
applications to statistics. Probab. Theory Related Fields, 132(2):203–236, 2005.
[16] David L. Donoho, Iain M. Johnstone, Gérard Kerkyacharian, and Dominique Picard.
Density estimation by wavelet thresholding. Ann. Statist., 24(2):508–539, 1996.
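[17] Paul Doukhan. Mixing: Properties and examples, volume 85 of Lecture Notes in Statistics. Springer-Verlag, New York, 1994.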
[18] Irène Gannaz and Olivier Wintenberger. Adaptive density estimation under depen-
dence. forthcoming in ESAIM, Probab. and Statist., 2008.
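[19] C. L. Mallows. Some comments on Cp. Technometrics, 15:661–675, 1973.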
[20] Pascal Massart. Concentration inequalities and model selection, volume 1896 of Lec-
ture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer
School on Probability Theory held in Saint-Flour, July 6–23, 2003, With a foreword
by Jean Picard.
[21] C. Prieur. Change point estimation by local linear smoothing under a weak depen-
dence condition. Math. Methods Statist., 16(1):25–41, 2007.
[22] Emmanuel Rio. Théorie asymptotique des processus aléatoires faiblement dépendants,
volume 31 of Mathématiques & Applications (Berlin) [Mathematics & Applications].
Springer-Verlag, Berlin, 2000.
[23] Mats Rudemo. Empirical choice of histograms and kernel density estimators. Scand.
J. Statist., 9(2):65–78, 1982.
[24] Michel Talagrand. New concentration inequalities in product spaces. Invent. Math.,
126(3):505–563, 1996.
[25] Gabrielle Viennet. Inequalities for absolutely regular sequences: application to density
estimation. Probab. Theory Related Fields, 107(4):467–492, 1997.
[26] V. A. Volkonskiĭ and Yu. A. Rozanov. Some limit theorems for random functions. I. Teor. Veroyatnost. i Primenen., 4:186–207, 1959.