Mathematical Statistics, Asymptotic Minimax Theory
Statistics
Asymptotic Minimax Theory
Alexander Korostelev
Olga Korosteleva
Graduate Studies
in Mathematics
Volume 119
QA276.8.K667 2011
519.5–dc22 2010037408
Copying and reprinting. Individual readers of this publication, and nonprofit libraries
acting for them, are permitted to make fair use of the material, such as to copy a chapter for use
in teaching or research. Permission is granted to quote brief passages from this publication in
reviews, provided the customary acknowledgment of the source is given.
Republication, systematic copying, or multiple reproduction of any material in this publication
is permitted only under license from the American Mathematical Society. Requests for such
permission should be addressed to the Acquisitions Department, American Mathematical Society,
201 Charles Street, Providence, Rhode Island 02904-2294 USA. Requests can also be made by
e-mail to [email protected].
© 2011 by the American Mathematical Society. All rights reserved.
The American Mathematical Society retains all rights
except those granted to the United States Government.
Printed in the United States of America.
∞ The paper used in this book is acid-free and falls within the guidelines
established to ensure permanence and durability.
Visit the AMS home page at http://www.ams.org/
Contents
Preface ix
Exercises 191
Chapter 14. Dimension and Structure in Nonparametric Regression 193
§14.1. Multiple Regression Model 193
§14.2. Additive Regression 196
§14.3. Single-Index Model 199
§14.4. Proofs of Technical Results 206
Exercises 209
Chapter 15. Adaptive Estimation 211
§15.1. Adaptive Rate at a Point. Lower Bound 211
§15.2. Adaptive Estimator in the Sup-Norm 215
§15.3. Adaptation in the Sequence Space 218
§15.4. Proofs of Lemmas 223
Exercises 225
Chapter 16. Testing of Nonparametric Hypotheses 227
§16.1. Basic Definitions 227
§16.2. Separation Rate in the Sup-Norm 229
§16.3. Sequence Space. Separation Rate in the L2 -Norm 231
Exercises 237
Bibliography 239
Index of Notation 241
Index 243
Preface
This book is based on the lecture notes written for the advanced Ph.D.-level
statistics courses delivered by the first author at Wayne State University
over the last decade. It has been easy to observe how the gap deepens
between applied (computational) and theoretical statistics, and it has become
more difficult to direct and mentor graduate students in the field of
mathematical statistics. The research monographs in this field are extremely
difficult to use as textbooks, and even the best published lecture notes
typically include the intensive material of original studies. On the other
hand, the classical courses in statistics, which cover the traditional parametric
point and interval estimation methods and hypotheses testing, are hardly
sufficient for the teaching goals in modern mathematical statistics.
Most chapters are weakly related to each other and may be covered in
any order. Our suggestion for a two-semester course would be to cover the
parametric part during the first semester, and the nonparametric part and
selected topics during the second.
We are grateful to O. Lepskii for his advice and help with the presenta-
tion of Part 3.
Parametric Models
Chapter 1
The Fisher Efficiency
A random variable
l(Xi , θ) = ln p(Xi , θ)
is referred to as a log-likelihood function related to the observation Xi .
The joint log-likelihood function of a sample of size n (or, simply, the log-
likelihood function) is the sum
L_n(θ) = L_n(θ | X_1, . . . , X_n) = Σ_{i=1}^n l(X_i, θ) = Σ_{i=1}^n ln p(X_i, θ).
The function
b_n(θ) = b_n(θ, θ̂_n) = E_θ[θ̂_n] − θ = E_θ[θ̂_n(X_1, . . . , X_n)] − θ
is called the bias of the estimator θ̂_n.
Setting the derivative equal to zero yields the solution θn∗ = X̄n , where
X̄n = (X1 + · · · + Xn )/n
denotes the sample mean. In this example, the MLE is unbiased since
E_θ[θ*_n] = E_θ[X̄_n] = E_θ[X_1] = θ.
Nonetheless, we should not take the unbiased MLE for granted. Even
for common densities, its expected value may not exist. Consider the next
example.
Example 1.4. For the exponential distribution with the density
p(x, θ) = θ exp{−θx}, x > 0, θ > 0,
the MLE θ*_n = 1/X̄_n has the expected value E_θ[θ*_n] = nθ/(n − 1) (see
Exercise 1.6).
E_θ[ l′(X_i, θ) ] = ∫_R (∂p(x, θ)/∂θ) dx = ∂/∂θ [ ∫_R p(x, θ) dx ] = 0.
The total Fisher score function for a sample X1 , . . . , Xn is defined as the
sum of the score functions for each individual observation,
L′_n(θ) = Σ_{i=1}^n l′(X_i, θ).
Therefore, for any value of θ, the variance of X̄n achieves the Cramér-Rao
lower bound 1/In (θ) = σ 2 /n.
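As a quick numerical sketch of the claim above (an illustration added here, not from the text; the values of θ, σ, and the sample size are arbitrary choices), one can verify by simulation that the variance of X̄_n matches the Cramér-Rao bound σ²/n:

```python
import random

# Monte Carlo check that Var(X-bar) attains the Cramer-Rao bound
# sigma^2/n = 1/I_n(theta) in the N(theta, sigma^2) model.
# theta, sigma, n, and reps are arbitrary illustrative choices.
random.seed(0)
theta, sigma, n, reps = 2.0, 3.0, 50, 20_000

means = [sum(random.gauss(theta, sigma) for _ in range(n)) / n
         for _ in range(reps)]
m = sum(means) / reps
var_hat = sum((x - m) ** 2 for x in means) / (reps - 1)
cramer_rao = sigma ** 2 / n  # = 0.18 here

print(m, var_hat, cramer_rao)  # var_hat should be close to 0.18
```

The sample mean is unbiased, so its variance equals its normalized quadratic risk up to the factor I_n(θ), which is exactly the situation where the Cramér-Rao bound is attained.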
The concept of the Fisher efficiency seems nice and powerful. Indeed,
besides being unbiased, an efficient estimator has the minimum possible
variance uniformly in θ ∈ Θ. Another attractive feature is that this property
holds for any sample size n. Unfortunately, the concept is extremely
restrictive: it works only in a limited number of models. The main pitfalls
of the Fisher efficiency are discussed in the next chapter.
Exercises
Exercise 1.1. Show that the Fisher information can be computed by the
formula
I_n(θ) = −n E_θ[ ∂² ln p(X, θ)/∂θ² ].
Hint: Make use of the representation (show!)
( ∂ ln p(x, θ)/∂θ )² p(x, θ) = ∂²p(x, θ)/∂θ² − ( ∂² ln p(x, θ)/∂θ² ) p(x, θ).
Exercise 1.6. Show that in the exponential model with the density p(x , θ) =
θ exp{−θ x} , x , θ > 0, the MLE θn∗ = 1/X̄n has the expected value Eθ [ θn∗ ] =
n θ/(n − 1). What is the variance of this estimator?
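A small simulation can make the bias in Exercise 1.6 visible (a sketch added here; θ, n, and the number of replications are arbitrary choices):

```python
import random

# Monte Carlo check of E_theta[1/X-bar] = n*theta/(n - 1) for the
# exponential density p(x, theta) = theta*exp(-theta*x), x > 0.
# random.expovariate takes the rate parameter, matching this density.
random.seed(1)
theta, n, reps = 2.0, 10, 100_000

vals = []
for _ in range(reps):
    xbar = sum(random.expovariate(theta) for _ in range(n)) / n
    vals.append(1.0 / xbar)
mle_mean = sum(vals) / reps
exact = n * theta / (n - 1)  # = 20/9 for theta = 2, n = 10

print(mle_mean, exact)  # the two numbers should be close
```

The MLE overshoots θ by the factor n/(n − 1), which vanishes as n grows, illustrating that unbiasedness can fail for finite n even in very common models.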
Exercise 1.7. Show that for the location parameter model with the density
p(x , θ) = p 0 (x − θ), introduced in Example 1.2, the Fisher information is
a constant if it exists.
Exercise 1.8. In Exercise 1.7, find the values of α for which the Fisher
information exists if p_0(x) = C cos^α(x), −π/2 < x < π/2, and p_0(x) = 0
otherwise, where C = C(α) is the normalizing constant. Note that p_0 is a
probability density if α > −1.
Chapter 2
The Bayes and Minimax Estimators
that is,
R_n(θ, θ̂_n, w) = E_θ[ w( √(I_n(θ)) (θ̂_n − θ) ) ]
= ∫_{R^n} w( √(I_n(θ)) (θ̂_n(x_1, . . . , x_n) − θ) ) p(x_1, . . . , x_n, θ) dx_1 . . . dx_n.
Example 2.1. For the quadratic loss function w(u) = u², the normalized
risk (commonly termed the normalized quadratic risk) of an estimator θ̂_n
can be found as
R_n(θ, θ̂_n, u²) = E_θ[ I_n(θ) (θ̂_n − θ)² ] = I_n(θ) E_θ[ ( θ̂_n − E_θ[θ̂_n] + E_θ[θ̂_n] − θ )² ]
(2.1) = I_n(θ) ( Var_θ[θ̂_n] + b_n²(θ, θ̂_n) ).
By (2.1), for any unbiased estimator θ̂n , the normalized quadratic risk
function has the representation Rn (θ, θ̂n , u2 ) = In (θ)Varθ [ θ̂n ]. The Cramér-
Rao inequality (1.2) can thus be written as
(2.2) R_n(θ, θ̂_n, u²) = E_θ[ I_n(θ) (θ̂_n − θ)² ] ≥ 1, θ ∈ Θ,
with the equality attained for the Fisher efficient estimators θ*_n,
(2.3) R_n(θ, θ*_n, u²) = E_θ[ I_n(θ) (θ*_n − θ)² ] = 1, θ ∈ Θ.
[Figure: the normalized quadratic risks R_n(θ, θ̂, u²) = n(θ0 − θ)²/σ² and R_n(θ, θ*_n, u²) = 1 plotted against θ; the estimator θ̂ is more efficient in the interval (θ0 − σ/√n, θ0 + σ/√n).]
In other words, the Bayes estimator minimizes the Bayes risk. Loosely
speaking, we can understand the Bayes estimator as a solution of the mini-
mization problem,
tn = argminθ̂n β(θ̂n , w, π),
though we should keep in mind that the minimum value may not exist or
may be non-unique.
In the case of the quadratic loss w(u) = u2 , the Bayes estimator can
be computed explicitly. Define the posterior density of θ as the conditional
density, given the observations X1 , . . . , Xn ; that is,
f (θ | X1 , . . . , Xn ) = Cn p(X1 , . . . , Xn , θ) π(θ), θ ∈ Θ,
where Cn = Cn (X1 , . . . , Xn ) is the normalizing constant. Assuming that
∫_Θ I_n(θ) f(θ | X_1, . . . , X_n) dθ < ∞,
we can introduce the weighted posterior density as
f̃(θ | X_1, . . . , X_n) = C̃_n I_n(θ) f(θ | X_1, . . . , X_n), θ ∈ Θ,
with the normalizing constant C̃_n = ( ∫_Θ I_n(θ) f(θ | X_1, . . . , X_n) dθ )^{−1}, which
is finite under our assumption.
Theorem 2.3. If w(u) = u², then the Bayes estimator t_n is the weighted
posterior mean
t_n = t_n(X_1, . . . , X_n) = ∫_Θ θ f̃(θ | X_1, . . . , X_n) dθ.
In particular, if the Fisher information is a constant independent of θ, then
the Bayes estimator is the non-weighted posterior mean,
t_n = t_n(X_1, . . . , X_n) = ∫_Θ θ f(θ | X_1, . . . , X_n) dθ.
Proof. The Bayes risk of an estimator θ̂_n with respect to the quadratic loss
can be written in the form
β_n(θ̂_n, π) = ∫_Θ ∫_{R^n} I_n(θ) (θ̂_n − θ)² p(x_1, . . . , x_n, θ) π(θ) dx_1 . . . dx_n dθ
= ∫_{R^n} [ ∫_Θ (θ̂_n − θ)² f̃(θ | x_1, . . . , x_n) dθ ] C̃_n^{−1}(x_1, . . . , x_n) dx_1 . . . dx_n.
Thus, the minimization problem of the Bayes risk is tantamount to
minimization of the integral
∫_Θ (θ̂_n − θ)² f̃(θ | x_1, . . . , x_n) dθ
with respect to θ̂_n for any fixed values x_1, . . . , x_n. Equating to zero the
derivative of this integral with respect to θ̂_n produces a linear equation,
satisfied by the Bayes estimator t_n,
∫_Θ (t_n − θ) f̃(θ | x_1, . . . , x_n) dθ = 0.
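Theorem 2.3 can be checked numerically in the normal model, where the Fisher information is constant and the Bayes estimator is the plain posterior mean; its closed form nb²X̄_n/(nb² + 1) is quoted later in the text (Example 2.7). The sketch below (the data values and b are arbitrary illustrative choices) integrates the posterior on a grid:

```python
import math

# Numerical check of Theorem 2.3 for N(theta, 1) observations with a
# N(0, b^2) prior: constant Fisher information, so the Bayes estimator
# is the non-weighted posterior mean, matching n*b^2*xbar/(n*b^2 + 1).
xs = [0.3, -0.1, 1.2, 0.8, 0.5]   # a fixed illustrative "sample"
n, b = len(xs), 2.0
xbar = sum(xs) / n

def unnorm_posterior(theta):
    loglik = -0.5 * sum((x - theta) ** 2 for x in xs)
    logprior = -0.5 * theta ** 2 / b ** 2
    return math.exp(loglik + logprior)

# posterior mean by a Riemann sum on a wide grid
grid = [-10 + 20 * k / 100_000 for k in range(100_001)]
w = [unnorm_posterior(t) for t in grid]
post_mean = sum(t * wt for t, wt in zip(grid, w)) / sum(w)

closed_form = n * b ** 2 * xbar / (n * b ** 2 + 1)
print(post_mean, closed_form)  # the two values should agree closely
```

The grid integration plays the role of the integral ∫_Θ θ f(θ | X_1, . . . , X_n) dθ in the theorem.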
Finding the infimum over all possible estimators θ̂n = θ̂n (X1 , . . . , Xn ), that
is, over all functions of observations X1 , . . . , Xn , is not an easily tackled
task. Even for the most common distributions, such as normal or binomial,
the direct minimization is a hopeless endeavor. This calls for an alternative
route in finding minimax estimators.
In this section we establish a connection between the Bayes and minimax
estimators that will lead to some advances in computing the latter. The
following theorem shows that if the Bayes estimator has a constant risk,
then it is also minimax.
Proof. Notice that since the risk function of tn is a constant, the Bayes and
maximum normalized risks of tn are the same constants. Indeed, letting
Proof. As in the proof of Theorem 2.5, for any estimator θ̂n , we can write
r_n(θ̂_n, w) = sup_{θ∈Θ} R_n(θ, θ̂_n, w) ≥ ∫_Θ R_n(θ, θ̂_n, w) π_b(θ) dθ
= β_n(θ̂_n, w, π_b) ≥ β_n( t_n(b), w, π_b ).
Now take the limit as b → ∞. Since the left-hand side is independent of b,
the theorem follows.
Example 2.9. Let X1 , . . . , Xn be independent N (θ, 1) observations. We
will show that conditions of Theorem 2.8 are satisfied under the quadratic
loss function w(u) = u2 , and therefore the lower bound for the corresponding
minimax risk holds:
inf_{θ̂_n} r_n(θ̂_n, w) = inf_{θ̂_n} sup_{θ∈R} E_θ[ ( √n (θ̂_n − θ) )² ] ≥ 1.
As shown in Example 2.7, for a N (0, b2 ) prior density, the weighted posterior
mean tn (b) = n b2 X̄n /(n b2 + 1) is the Bayes estimator with respect to the
quadratic loss function. Now we will compute its Bayes risk. This estimator
has the variance
Var_θ[ t_n(b) ] = n²b⁴ Var_θ[X̄_n]/(nb² + 1)² = nb⁴/(nb² + 1)²
and the bias
b_n(θ, t_n(b)) = E_θ[ t_n(b) ] − θ = nb²θ/(nb² + 1) − θ = −θ/(nb² + 1).
Therefore, the normalized quadratic risk of t_n(b) equals
E_θ[ ( √n (t_n(b) − θ) )² ] = n ( Var_θ[t_n(b)] + b_n²(θ, t_n(b)) )
= n²b⁴/(nb² + 1)² + nθ²/(nb² + 1)².
With the remark that ∫_R θ² π_b(θ) dθ = b², the Bayes risk of t_n(b) equals
β_n( t_n(b), w, π_b ) = ∫_R [ n²b⁴/(nb² + 1)² + nθ²/(nb² + 1)² ] π_b(θ) dθ
= n²b⁴/(nb² + 1)² + nb²/(nb² + 1)² → 1 as b → ∞.
Applying Theorem 2.8, we obtain the result with c = 1. Taking a step
further, note that the minimax lower bound is attained for the estimator
X̄_n, which is thus minimax. Indeed, E_θ[ ( √n (X̄_n − θ) )² ] = 1.
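The algebra behind the limit in Example 2.9 is easy to check directly (a small sketch added here; the values of n and b are arbitrary):

```python
# Check of the Bayes risk formula from Example 2.9:
# beta_n(t_n(b)) = n^2 b^4/(n b^2 + 1)^2 + n b^2/(n b^2 + 1)^2,
# which simplifies to n b^2/(n b^2 + 1) and tends to 1 as b grows.
def bayes_risk(n, b):
    d = (n * b ** 2 + 1) ** 2
    return n ** 2 * b ** 4 / d + n * b ** 2 / d

n = 100
risks = [bayes_risk(n, b) for b in (1.0, 10.0, 100.0, 1000.0)]
print(risks)  # increases monotonically toward 1
```

Since the maximum normalized risk always dominates the Bayes risk, this limit is exactly what forces the minimax lower bound to equal 1.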
Exercises
(ii) Show that the non-normalized quadratic risk of θ*_n (with the factor
I_n(θ) omitted) is equal to
E_θ[ (θ*_n − θ)² ] = 1/( 4(1 + √n)² ).
(iii) Verify that Theorem 2.5 is valid for a non-normalized risk function, and
argue that θn∗ is minimax in the appropriate sense.
Exercise 2.13. Refer to the Bernoulli model in Example 2.4. Show that
the prior beta distribution with α = β = 1 + b^{−1} defines the weighted
posterior mean t_n(b), which becomes minimax in the limit as b → ∞.
Chapter 3
Asymptotic Minimaxity
Example 3.2. In Example 1.5, there is no unbiased estimator. The
estimator θ̂_n = √(X/n), however, is asymptotically unbiased (see Exercise
3.14).
Thus, the sequence θ̂n is asymptotically more efficient than any asymptot-
ically Fisher efficient estimator defined by (3.1). In particular, it is better
than the sample mean X̄n . Sometimes the Hodges estimator is called super-
efficient, and the point at which the Cramér-Rao lower bound is violated,
θ = 0, is termed the superefficient point.
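The definition (3.3) of the Hodges estimator is not reproduced in this excerpt; the sketch below assumes the standard construction θ̂_n = X̄_n if |X̄_n| ≥ n^{−1/4} and θ̂_n = 0 otherwise, and illustrates the superefficiency at the point θ = 0:

```python
import random

# Superefficiency of the Hodges estimator at theta = 0, assuming the
# standard construction: hodges = xbar if |xbar| >= n**(-1/4) else 0.
# We draw X-bar directly from its N(0, 1/n) distribution.
random.seed(2)
n, reps = 10_000, 2_000

def hodges(xbar, n):
    return xbar if abs(xbar) >= n ** -0.25 else 0.0

risk0 = 0.0  # normalized risk n * E_0[theta_hat^2] at theta = 0
for _ in range(reps):
    xbar = random.gauss(0.0, 1.0 / n ** 0.5)
    risk0 += n * hodges(xbar, n) ** 2 / reps

print(risk0)  # essentially 0, far below the Cramer-Rao value 1
```

At θ = 0 the threshold n^{−1/4} sits ten standard deviations of X̄_n away from the origin, so the estimator truncates to zero almost surely and its normalized risk collapses, exactly the superefficiency phenomenon described above.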
Suppose we can show that for any estimator θ̂_n the inequality
(3.5) lim inf_{n→∞} r_n(θ̂_n, u²) ≥ r*
Remark 3.5. Under the assumptions of Lemma 3.4, the maximum normal-
ized risk rn (θ̂n , u2 ) admits the asymptotic upper bound r∗ = 1, guaranteed
by the sample mean estimator X̄n .
Proof of Lemma 3.4. Without loss of generality, we can assume that
σ² = 1, hence I(θ) = 1, and that Θ contains the points θ0 = 0 and
θ1 = 1/√n. Introduce the log-likelihood ratio associated with these values
of the parameter θ,
ΔL_n = ΔL_n(θ0, θ1) = L_n(θ1) − L_n(θ0)
= ln [ p(X_1, . . . , X_n, θ1) / p(X_1, . . . , X_n, θ0) ] = Σ_{i=1}^n ln [ p(X_i, 1/√n) / p(X_i, 0) ]
= Σ_{i=1}^n [ −(1/2)(X_i − 1/√n)² + (1/2)X_i² ] = (1/√n) Σ_{i=1}^n X_i − 1/2 = Z − 1/2,
where Z is a N(0, 1) random variable with respect to the distribution P_{θ0}.
Further, by definition, for any random function f (X1 , . . . , Xn ) , and for
any values θ0 and θ1 , the basic likelihood ratio identity relating the two
expectations holds:
E_{θ1}[ f(X_1, . . . , X_n) ] = E_{θ0}[ f(X_1, . . . , X_n) p(X_1, . . . , X_n, θ1)/p(X_1, . . . , X_n, θ0) ]
(3.8) = E_{θ0}[ f(X_1, . . . , X_n) exp{ ΔL_n(θ0, θ1) } ].
Next, for any fixed estimator θ̂n , the supremum over R of the normalized
risk function is not less than the average of the normalized risk over the two
points θ0 and θ1 . Thus, we obtain the inequality
n sup_{θ∈R} E_θ[ (θ̂_n − θ)² ] ≥ n max_{θ∈{θ0, θ1}} E_θ[ (θ̂_n − θ)² ]
≥ (n/2) ( E_{θ0}[ (θ̂_n − θ0)² ] + E_{θ1}[ (θ̂_n − θ1)² ] )
= (n/2) E_{θ0}[ (θ̂_n − θ0)² + (θ̂_n − θ1)² exp{ ΔL_n(θ0, θ1) } ]   (by (3.8))
≥ (n/2) E_{θ0}[ ( (θ̂_n − θ0)² + (θ̂_n − θ1)² ) I( ΔL_n(θ0, θ1) ≥ 0 ) ]
≥ (n/2) ( (θ1 − θ0)²/2 ) P_{θ0}( ΔL_n(θ0, θ1) ≥ 0 )
= (n/4) (1/√n)² P_{θ0}( Z − 1/2 ≥ 0 ) = (1/4) P_{θ0}( Z ≥ 1/2 ).
In the above, if the log-likelihood ratio ΔL_n(θ0, θ1) is non-negative, then its
exponential is at least 1. At the last stage we used the elementary inequality
(x − θ0)² + (x − θ1)² ≥ (1/2)(θ1 − θ0)², x ∈ R.
As shown previously, Z is a standard normal random variable with respect
to the distribution P_{θ0}; therefore, P_{θ0}(Z ≥ 1/2) = 0.3085. Finally, the
maximum normalized risk is bounded from below by 0.3085/4 > 0.077.
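The constant quoted at the end of the proof is a one-line computation (a check added here, using the complementary error function):

```python
import math

# The constant in the proof of Lemma 3.4: P(Z >= 1/2)/4 for a
# standard normal Z, via P(Z >= x) = erfc(x/sqrt(2))/2.
p = 0.5 * math.erfc(0.5 / math.sqrt(2))  # P(Z >= 1/2)
bound = p / 4

print(round(p, 4), round(bound, 4))  # -> 0.3085 0.0771
```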
Remark 3.6. Note that computing the mean value of the normalized risk
over two points is equivalent to finding the Bayes risk with respect to the
prior distribution concentrated at these two points with equal probabilities.
Thus, in the above proof, we could have taken a Bayes prior concentrated
not at two but at three or more points; then the lower bound constant r*
would be different from 0.077.
Proof. As in the proof of Lemma 3.4, we can take σ² = 1. The idea of the
proof is to replace the maximum normalized risk by the Bayes risk with the
uniform prior distribution on the interval [−b/√n, b/√n], where b will be
chosen later. Under the assumption on Θ, it contains this interval for all
sufficiently large n. Proceeding as in the proof of Lemma 3.4, we obtain the
inequalities
sup_{θ∈R} E_θ[ ( √n (θ̂_n − θ) )² ] ≥ (√n/(2b)) ∫_{−b/√n}^{b/√n} E_θ[ ( √n (θ̂_n − θ) )² ] dθ
= (1/(2b)) ∫_{−b}^{b} E_{t/√n}[ ( √n θ̂_n − t )² ] dt   (by substitution t = √n θ)
(3.9) = (1/(2b)) ∫_{−b}^{b} E_0[ ( √n θ̂_n − t )² exp{ ΔL_n(0, t/√n) } ] dt.
Here the same trick is used as in the proof of Lemma 3.4, with the change
of the distribution by means of the log-likelihood ratio, which in this case is
equal to
ΔL_n(0, t/√n) = L_n(t/√n) − L_n(0) = Σ_{i=1}^n [ −(1/2)(X_i − t/√n)² + (1/2)X_i² ]
= (t/√n) Σ_{i=1}^n X_i − t²/2 = tZ − t²/2 = Z²/2 − (t − Z)²/2,
where Z ∼ N(0, 1) under P_0. Thus, the latter expression can be written as
(1/(2b)) E_0[ e^{Z²/2} ∫_{−b}^{b} ( √n θ̂_n − t )² e^{−(t−Z)²/2} dt ]
(3.10) ≥ (1/(2b)) E_0[ e^{Z²/2} I(|Z| ≤ a) ∫_{−b}^{b} ( √n θ̂_n − t )² e^{−(t−Z)²/2} dt ]
where a is a positive constant, a < b. The next step is to change the
variable of integration to u = t − Z. The new limits of integration are
[−b − Z , b − Z ]. For any Z that satisfies |Z| ≤ a, this interval includes the
interval [−(b − a) , b − a ], so that the integral over [−b , b ] with respect to
t can be estimated from below by the integral in u over [−(b − a) , b − a ].
Hence, for |Z| ≤ a,
∫_{−b}^{b} ( √n θ̂_n − t )² e^{−(t−Z)²/2} dt ≥ ∫_{−(b−a)}^{b−a} ( √n θ̂_n − Z − u )² e^{−u²/2} du
(3.11) = ∫_{−(b−a)}^{b−a} [ ( √n θ̂_n − Z )² + u² ] e^{−u²/2} du ≥ ∫_{−(b−a)}^{b−a} u² e^{−u²/2} du.
Here the cross term disappears because ∫_{−(b−a)}^{b−a} u exp{−u²/2} du = 0.
(3.13) = (a/b) E[ Z_0² I( |Z_0| ≤ b − a ) ]
where Z_0 is a standard normal random variable. Choose a and b such that
a/b → 1 and b − a → ∞; for example, put a = b − √b and let b → ∞. Then
the expression in (3.13) can be made arbitrarily close to E[Z_0²] = 1.
The quadratic loss function is not critical in Theorem 3.8. The next
theorem generalizes the result to any loss function.
Theorem 3.9. Under the assumptions of Theorem 3.8, for any loss function
w and any estimator θ̂_n, the following lower bound holds:
lim inf_{n→∞} sup_{θ∈Θ} E_θ[ w( √(n/σ²) (θ̂_n − θ) ) ] ≥ ∫_{−∞}^{∞} w(u) e^{−u²/2}/√(2π) du.
Proof. In the proof of Theorem 3.8, the quadratic loss function was used
only to demonstrate that for any √n θ̂_n − Z, the following inequality holds:
∫_{−(b−a)}^{b−a} ( √n θ̂_n − Z − u )² e^{−u²/2} du ≥ ∫_{−(b−a)}^{b−a} u² e^{−u²/2} du.
We can generalize this inequality to any loss function as follows (see Exercise
3.18). The minimum value of the integral ∫_{−(b−a)}^{b−a} w(c − u) e^{−u²/2} du over
Remark 3.10. Note that in the proof of Theorem 3.8 (respectively,
Theorem 3.9), we considered the values of θ not in the whole parameter set
Θ, but only in the interval [−b/√n, b/√n] of arbitrarily small length.
Therefore, it is possible to formulate a local version of Theorem 3.9, with the
proof remaining the same. For any loss function w, the inequality holds
lim_{δ→0} lim inf_{n→∞} sup_{|θ−θ0|<δ} E_θ[ w( √(n/σ²) (θ̂_n − θ) ) ] ≥ ∫_{−∞}^{∞} w(u) e^{−u²/2}/√(2π) du.
(3.15) = ( t/√(nI(θ)) ) Σ_{i=1}^n l′(X_i, θ) + ( t²/(2nI(θ)) ) Σ_{i=1}^n l′′(X_i, θ) (1 + o_n(1))
where o_n(1) → 0 as n → ∞.
By the computational formula for the Fisher information (see Exercise 1.1),
Σ_{i=1}^n E_θ[ l′′(X_i, θ) ] = −nI(θ),
and therefore, the Law of Large Numbers ensures the convergence of the
second term in (3.15),
( t²/(2nI(θ)) ) Σ_{i=1}^n l′′(X_i, θ) (1 + o_n(1)) → −t²/2 as n → ∞.
Thus, at the heuristic level of understanding, we can expect that for any
t ∈ R, the log-likelihood ratio satisfies
(3.16) ΔL_n( θ, θ + t/√(I_n(θ)) ) = z_n(θ) t − t²/2 + ε_n(θ, t)
where
z_n(θ) = (1/√(nI(θ))) Σ_{i=1}^n l′(X_i, θ)
A family of distributions for which the log-likelihood ratio has the
representation (3.16) under constraint (3.17) is said to satisfy the local asymptotic
normality (LAN) condition. It can actually be derived under less restrictive
assumptions. In particular, we do not need to require the existence of the
second derivative l′′.
To generalize Theorem 3.9 to the distributions satisfying the LAN
condition, we need to justify that the remainder term ε_n(θ, t) may be ignored
in the expression for the likelihood ratio,
exp{ ΔL_n( θ, θ + t/√(I_n(θ)) ) } ≈ exp{ z_n(θ) t − t²/2 }.
Theorem 3.11. Under the LAN conditions (3.16) and (3.17), there exists
a sequence of random variables z̃_n(θ) such that |z_n(θ) − z̃_n(θ)| → 0 in
P_θ-probability as n → ∞, and for any c > 0,
lim_{n→∞} sup_{−c≤t≤c} E_θ| exp{ z_n(θ)t − t²/2 + ε_n(θ, t) } − exp{ z̃_n(θ)t − t²/2 } | = 0.
To ease the proof, we split it into lemmas proved as the technical results
below.
Lemma 3.12. Under the LAN condition (3.16), there exists a truncation
of z_n(θ) defined by
z̃_n(θ) = z_n(θ) I( z_n(θ) ≤ c_n ),
with a properly chosen sequence of constants c_n, such that the following
relations hold:
(3.19) z̃_n(θ) − z_n(θ) → 0 as n → ∞
and
(3.20) lim_{n→∞} sup_{−c≤t≤c} | E_θ[ exp{ z̃_n(θ) t − t²/2 } ] − 1 | = 0.
(3.23) + E_θ[ ξ_n(t) I( ξ_n(t) > A ) ] + E_θ[ ξ̃_n(t) I( ξ̃_n(t) > A ) ].
Due to Lemma 3.13, we can choose A so large that the last two terms do
not exceed an arbitrarily small positive δ. From Lemma 3.12, ξ̃_n(t) − ξ(t) → 0
in P_θ-distribution, and by the LAN condition, ξ_n(t) − ξ(t) → 0; therefore, for
a fixed A, the first two terms on the right-hand side of (3.23) vanish
uniformly over t ∈ [−c, c] as n → ∞.
Finally, we formulate the result analogous to Theorem 3.9 (for the proof
see Exercise 3.20).
Theorem 3.14. If a statistical model satisfies the LAN condition (3.16),
then for any loss function w, the asymptotic lower bound of the minimax
risk holds:
lim inf_{n→∞} inf_{θ̂_n} sup_{θ∈Θ} E_θ[ w( √(I_n(θ)) (θ̂_n − θ) ) ] ≥ ∫_{−∞}^{∞} w(u) e^{−u²/2}/√(2π) du.
where Z1 (θ0 , θ1 ) = p(X, θ1 )/p(X, θ0 ) denotes the likelihood ratio for a single
observation.
where θ̄ = (θ0 + θ1)/2. Being the integral of a probability density, the latter
integral equals 1. Therefore,
H(θ0, θ1) = 2 ( 1 − exp{ −(θ0 − θ1)²/(8σ²) } ).
If Θ is a bounded interval, then (θ0 − θ1)²/(8σ²) ≤ C for some constant
C > 0. In this case,
H(θ0, θ1) ≥ a (θ0 − θ1)², a = (1 − e^{−C})/(4Cσ²),
where we used the inequality 1 − e^{−x} ≥ (1 − e^{−C}) x/C for 0 ≤ x ≤ C.
To find the sharp upper bound for the MLE, we make an additional
assumption that allows us to prove a relatively simple result. As shown in
the next theorem, the normalized deviation of the MLE from the true value
of the parameter, √(nI(θ)) (θ*_n − θ), converges in distribution to a standard
normal random variable. Note that this result is sufficient to claim the
asymptotically sharp minimax property for all bounded loss functions.
Theorem 3.21. Let Assumption 3.17 and the LAN condition (3.16) hold.
Moreover, suppose that for any δ > 0 and any c > 0, the remainder term in
(3.16) satisfies the equality:
(3.25) lim_{n→∞} sup_{θ∈Θ} P_θ( sup_{−c≤t≤c} |ε_n(θ, t)| ≥ δ ) = 0.
Proof. Fix a large c such that c > |x|, and a small δ > 0. Put
t*_n = √(nI(θ)) (θ*_n − θ).
Define two random events
A_n = A_n(c, δ) = { sup_{−2c≤t≤2c} |ε_n(θ, t)| ≥ δ }
and
B_n = B_n(c) = { |t*_n| ≥ c }.
Note that under the condition (3.25), we have that P_θ(A_n) → 0 as
n → ∞. Besides, as follows from the Markov inequality and Theorem 3.20
with w(u) = |u|,
P_θ(B_n) ≤ E_θ|t*_n|/c ≤ r*/c.
Proof of Lemma 3.13. First we will prove (3.21) for ξ̃_n(t). Note that
ξ̃_n(t), n = 1, 2, . . . , are positive random variables. By Lemma 3.12, for any
t ∈ [−c, c], the convergence takes place
(3.30) ξ̃_n(t) → ξ(t) as n → ∞.
Choose an arbitrarily small δ > 0. There exists A(δ) such that uniformly
over t ∈ [−c, c], the following inequality holds:
(3.32) E_θ[ ξ(t) I( ξ(t) > A(δ) ) ] ≤ δ.
Next, we can choose n = n(δ) so large that for any n ≥ n(δ) and all
t ∈ [−c, c], the following inequalities are satisfied:
(3.33) | E_θ[ξ̃_n(t)] − E_θ[ξ(t)] | ≤ δ
and
(3.34) | E_θ[ ξ̃_n(t) I( ξ̃_n(t) ≤ A(δ) ) ] − E_θ[ ξ(t) I( ξ(t) ≤ A(δ) ) ] | ≤ δ.
To see that the latter inequality holds, use the fact that A(δ) is fixed and
ξ̃_n(t) → ξ(t) as n → ∞.
The triangle inequality and the inequalities (3.32)-(3.34) imply that for
any A ≥ A(δ),
E_θ[ ξ̃_n(t) I( ξ̃_n(t) > A ) ] ≤ E_θ[ ξ̃_n(t) I( ξ̃_n(t) > A(δ) ) ]
= E_θ[ξ̃_n(t)] − E_θ[ ξ̃_n(t) I( ξ̃_n(t) ≤ A(δ) ) ]
≤ | E_θ[ξ̃_n(t)] − E_θ[ξ(t)] |
+ | E_θ[ ξ̃_n(t) I( ξ̃_n(t) ≤ A(δ) ) ] − E_θ[ ξ(t) I( ξ(t) ≤ A(δ) ) ] |
(3.35) + E_θ[ ξ(t) I( ξ(t) > A(δ) ) ] ≤ 3δ.
There are finitely many n such that n ≤ n(δ). For each n ≤ n(δ), we can
find A_n so large that for all A ≥ A_n, the following expected value is bounded:
E_θ[ ξ̃_n(t) I( ξ̃_n(t) > A ) ] ≤ 3δ. Put A_0 = max{ A_1, . . . , A_{n(δ)}, A(δ) }. By
definition, for any A ≥ A_0 and all t ∈ [−c, c], we have that
(3.36) sup_{n≥1} E_θ[ ξ̃_n(t) I( ξ̃_n(t) > A ) ] ≤ 3δ.
Result 2. Let Assumption 3.17 be true. Then for any positive constants γ
and c, the following inequality holds:
P_θ( sup_{|t| ≥ c/√n} Z_n(θ, θ + t) ≥ e^γ ) ≤ C e^{−3γ/4} exp{−a c²/4},
where C = 2 + 3√(πI*/a) with I* = sup_{θ∈Θ} I(θ) < ∞.
≤ (3/2)√(πI*/a) exp{−a c²/4} + ( 1 − Φ(c√(a/2)) )
≤ ( 1 + (3/2)√(πI*/a) ) exp{−a c²/4} = (C/2) exp{−a c²/4}.
The same inequality is true for t < 0 (show!),
E_θ[ sup_{t ≤ −c/√n} z_n(t) ] ≤ (C/2) exp{−a c²/4}.
Further,
P_θ( sup_{|t| ≥ c/√n} Z_n(θ, θ + t) ≥ e^γ )
≤ P_θ( sup_{t ≥ c/√n} Z_n(θ, θ + t) ≥ e^γ ) + P_θ( sup_{t ≤ −c/√n} Z_n(θ, θ + t) ≥ e^γ )
= P_θ( sup_{t ≥ c/√n} z_n(t) ≥ e^{3γ/4} ) + P_θ( sup_{t ≤ −c/√n} z_n(t) ≥ e^{3γ/4} ),
Now we are in the position to prove Lemma 3.19. Applying the inclusion
{ √n |θ*_n − θ| ≥ c } = { sup_{|t| ≥ c/√n} Z_n(θ, θ + t) ≥ sup_{|t| < c/√n} Z_n(θ, θ + t) }
⊆ { sup_{|t| ≥ c/√n} Z_n(θ, θ + t) ≥ Z_n(θ, θ) = 1 },
≤ C exp{−a c²/4}.
Exercises
Exercise 3.14. Verify that in Example 1.5, the estimator θ̂_n = √(X/n) is
an asymptotically unbiased estimator of θ. Hint: Note that |√(X/n) − θ| =
|X/n − θ²| / |√(X/n) + θ|, and thus, E_θ|√(X/n) − θ| ≤ θ^{−1} E_θ|X/n − θ²|.
Now use the Cauchy-Schwarz inequality to finish off the proof.
Exercise 3.15. Show that the Hodges estimator defined by (3.3) is asymp-
totically unbiased and satisfies the identities (3.4).
Exercise 3.16. Prove Theorem 3.7.
Exercise 3.17. Suppose the conditions of Theorem 3.7 hold, and a loss
function w is such that w(1/2) > 0. Show that for any estimator θ̂_n the
following lower bound holds:
sup_{θ∈Θ} E_θ[ w( √n (θ̂_n − θ) ) ] ≥ (1/2) w(1/2) p_0 exp{z_0}.
Hint: Use Theorem 3.7 and the inequality (show!)
w( √n (θ̂_n − θ) ) + w( √n (θ̂_n − θ) − 1 ) ≥ w(1/2) for any θ ∈ Θ.
Exercise 3.18. Prove (3.14). Hint: First show this result for bounded loss
functions.
Exercise 3.19. Prove the local asymptotic normality (LAN) for
(i) the exponential model with the density
p(x, θ) = θ exp{−θx}, x, θ > 0;
where a_n = √(nI) t/√(nI(0)), z̃_n(0) is an asymptotically standard normal
random variable, and o_n(1) → 0 as n → ∞. Then follow the lines of
Theorems 3.8 and 3.9 and, finally, let C → ∞.
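For the exponential model in Exercise 3.19(i), the LAN remainder can be computed in closed form. A short calculation (added here as an illustration, not part of the text) shows that with I(θ) = 1/θ², the data-dependent terms of ΔL_n(θ, θ + t/√(nI(θ))) and z_n(θ)t − t²/2 cancel, leaving the deterministic remainder ε_n(θ, t) = n ln(1 + t/√n) − t√n + t²/2:

```python
import math

# LAN remainder for p(x, theta) = theta*exp(-theta*x): the quantity
# eps_n(theta, t) = Delta L_n - (z_n*t - t^2/2) reduces to the
# deterministic n*log(1 + t/sqrt(n)) - t*sqrt(n) + t^2/2,
# which shrinks like t^3/(3*sqrt(n)).
def lan_remainder(n, t):
    return n * math.log(1 + t / math.sqrt(n)) - t * math.sqrt(n) + t ** 2 / 2

for n in (100, 10_000, 1_000_000):
    print(n, lan_remainder(n, 1.0))  # tends to 0 as n grows
```

This makes the constraint (3.17) on the remainder term transparent in this particular model: ε_n does not even depend on the observations.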
Exercise 3.21. Consider a distorted parabola zt − t²/2 + ε(t), where z has
a fixed value and −2c ≤ t ≤ 2c. Assume that the maximum of this function
is attained at a point t* that lies within the interval [−c, c]. Suppose that the
remainder term satisfies sup_{−2c≤t≤2c} |ε(t)| ≤ δ. Show that |t* − z| ≤ 2√δ.
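A quick numerical sanity check of Exercise 3.21 (the distortion ε(t) = δ sin(7t) is a hypothetical choice satisfying the stated bound; z, c, and δ are arbitrary):

```python
import math

# Check the bound |t* - z| <= 2*sqrt(delta) for a distorted parabola
# z*t - t^2/2 + eps(t) with the hypothetical eps(t) = delta*sin(7*t),
# maximizing on a fine grid over [-2c, 2c].
z, c, delta = 0.4, 2.0, 0.01
f = lambda t: z * t - t ** 2 / 2 + delta * math.sin(7 * t)

grid = [-2 * c + 4 * c * k / 100_000 for k in range(100_001)]
t_star = max(grid, key=f)

print(t_star, 2 * math.sqrt(delta))  # |t_star - z| stays within the bound
```

The undistorted parabola peaks at t = z; a distortion of size δ can shift the argmax, but only by at most 2√δ, which is the content of the exercise.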
Chapter 4
Some Irregular Statistical Experiments
θ*_n = ((n + 1)/n) X_{(n)}
is an unbiased estimator of θ with the variance
Var_θ[θ*_n] = θ²/(n(n + 2)) = O(n^{−2}) as n → ∞.
= lim_{Δθ→0} (Δθ)^{−2} ‖ (θ + Δθ)^{−1/2} I_{[0, θ+Δθ]}(·) − θ^{−1/2} I_{[0, θ]}(·) ‖²_2.
A finite limit exists if and only if
‖ (θ + Δθ)^{−1/2} I_{[0, θ+Δθ]}(·) − θ^{−1/2} I_{[0, θ]}(·) ‖²_2 = O( (Δθ)² ) as Δθ → 0.
However, the L2-norm decreases at a lower rate. To see this, assume Δθ is
positive and write
‖ (θ + Δθ)^{−1/2} I_{[0, θ+Δθ]}(·) − θ^{−1/2} I_{[0, θ]}(·) ‖²_2
= ∫_R [ (θ + Δθ)^{−1/2} I( 0 ≤ x ≤ θ + Δθ ) − θ^{−1/2} I( 0 ≤ x ≤ θ ) ]² dx
= ∫_0^θ [ (θ + Δθ)^{−1/2} − θ^{−1/2} ]² dx + ∫_θ^{θ+Δθ} (θ + Δθ)^{−1} dx
= ( √θ − √(θ + Δθ) )²/(θ + Δθ) + Δθ/(θ + Δθ)
= 2 ( 1 − (1 + Δθ/θ)^{−1/2} ) = Δθ/θ + o(Δθ/θ) ≠ O( (Δθ)² ) as Δθ → 0.
Hence, in this example, p(·, θ) is not differentiable as a function of θ, and
the finite Fisher information does not exist.
or
(ii) Z_n(θ, θ + t/n) = exp{ λ(θ) t } I( t ≤ T_n ) + o_n(1)
≥ (n²/2) E_{θ0}[ (θ̂_n − θ0)² + (θ̂_n − θ1)² e^{λ(θ0) + o_n(1)} I( 1 ≤ T_n ) ]
≥ (n²/2) E_{θ0}[ ( (θ̂_n − θ0)² + (θ̂_n − θ1)² ) I( T_n ≥ 1 ) ],
since λ(θ0) + o_n(1) ≥ 0,
≥ (n²/2) ( (θ1 − θ0)²/2 ) P_{θ0}( T_n ≥ 1 )
= (1/4) P_{θ0}( T_n ≥ 1 ) → (1/4) exp{ −λ(θ0) } as n → ∞.
Remark 4.8. The rate of convergence may be different from O(n−1 ) for
some other irregular statistical experiments, but those models are not asymp-
totically exponential. For instance, the model described in Exercise 1.8 is
not regular (the Fisher information does not exist) if −1 < α ≤ 1. The
rate of convergence in this model depends on α and is, generally speaking,
different from O(n−1 ).
(4.1) = E_{θ0}[ (1/b) ∫_0^b w_C( n(θ̂_n − θ0) − u ) ( e^{λ0 u} I(u ≤ T_n) + o_n(1) ) du ].
Here we applied the change of measure formula. Now, since wC is a bounded
function,
E_{θ0}[ (1/b) ∫_0^b w_C( n(θ̂_n − θ0) − u ) o_n(1) du ] = o_n(1),
and, continuing from (4.1), we obtain
= E_{θ0}[ (1/b) ∫_0^b w_C( n(θ̂_n − θ0) − u ) e^{λ0 u} I(u ≤ T_n) du ] + o_n(1)
≥ (1/b) E_{θ0}[ I( √b ≤ T_n ≤ b ) ∫_0^{T_n} w_C( n(θ̂_n − θ0) − u ) e^{λ0 u} du ] + o_n(1),
which, after the substitution u = t + T_n, takes the form
= (1/b) E_{θ0}[ e^{λ0 T_n} I( √b ≤ T_n ≤ b ) ∫_{−T_n}^0 w_C( n(θ̂_n − θ0) − T_n − t ) e^{λ0 t} dt ] + o_n(1).
Note that
lim_{n→∞} (1/b) E_{θ0}[ e^{λ0 T_n} I( √b ≤ T_n ≤ b ) ] = (1/b) ∫_{√b}^b λ0 e^{−λ0 t + λ0 t} dt
= λ0 (b − √b)/b = λ0 ( 1 − 1/√b ).
For the quadratic risk function, the lower bound in Theorem 4.9 can be
found explicitly.
Example 4.10. If w(u) = u², then
λ0 min_{y∈R} ∫_0^∞ (u − y)² e^{−λ0 u} du = min_{y∈R} ( y² − 2y/λ0 + 2/λ0² )
= min_{y∈R} ( (y − 1/λ0)² + 1/λ0² ) = 1/λ0².
By Proposition 4.5, for the uniform model, λ0 = 1/θ0; hence, the exact
lower bound equals θ0². For the shifted exponential experiment, according
to Proposition 4.6, λ0 = 1, and thus the lower bound is 1.
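The minimization in Example 4.10 can be confirmed numerically by a grid search over y with the integral computed by a Riemann sum (a sketch; λ0 and the grids are arbitrary choices):

```python
import math

# Check of Example 4.10: lambda0 * min_y int_0^inf (u-y)^2 e^{-lambda0*u} du
# equals 1/lambda0^2, attained at y = 1/lambda0.
lam = 2.0
h, upper = 0.01, 15.0               # integration step and truncation point
us = [k * h for k in range(int(upper / h) + 1)]

def obj(y):
    return lam * sum((u - y) ** 2 * math.exp(-lam * u) for u in us) * h

ys = [k * 0.01 for k in range(0, 151)]   # search y in [0, 1.5]
best_y = min(ys, key=obj)

print(best_y, obj(best_y), 1 / lam ** 2)  # minimum near y = 0.5, value near 0.25
```

The minimizer y = 1/λ0 is just the mean of the Exp(λ0) distribution, and the minimum value 1/λ0² is its variance, which is what the algebra in the example expresses.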
Remark 4.11. In Exercise 4.27 we ask the reader to show that the lower
bound limiting constant of Theorem 4.9 is attainable in the uniform and
shifted exponential models under the quadratic risk. The sharpness of the
bound holds in general, for all asymptotically exponential models, but under
some additional conditions.
Exercises
Exercise 4.27. Show that (i) in the uniform model (see Exercise 4.22),
lim_{n→∞} E_{θ0}[ ( n(θ*_n − θ0) )² ] = θ0².
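The limit in Exercise 4.27(i) can be checked by simulation, assuming θ*_n is the unbiased estimator ((n + 1)/n) X_{(n)} introduced earlier in this chapter (θ0, n, and the replication count below are arbitrary choices):

```python
import random

# Monte Carlo check that E[(n*(theta_n^* - theta0))^2] -> theta0^2
# in the uniform(0, theta0) model with theta_n^* = (n+1)/n * X_(n).
random.seed(4)
theta0, n, reps = 1.0, 200, 20_000

acc = 0.0
for _ in range(reps):
    x_max = max(random.uniform(0, theta0) for _ in range(n))
    est = (n + 1) / n * x_max
    acc += (n * (est - theta0)) ** 2 / reps

print(acc)  # close to theta0^2 = 1
```

This illustrates that the lower bound θ0² from Example 4.10 is attainable in the uniform model, as Remark 4.11 indicates.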
Exercise 4.28. Compute explicitly the lower bound in Theorem 4.9 for the
absolute loss function w(u) = |u|.
(i) Take a uniform(0, b) prior density and let Y = min(X_{(1)}, b). Verify that
the posterior density is defined only if X_{(1)} > 0, and is given by the formula
f_b(θ | X_1, . . . , X_n) = n exp{nθ} / ( exp{nY} − 1 ), 0 ≤ θ ≤ Y.
(ii) Check that the posterior mean is equal to
θ*_n(b) = Y − 1/n + Y/( exp{nY} − 1 ).
(iii) Argue that for any θ, √b ≤ θ ≤ b − √b, the normalized quadratic risk
of the estimator θ*_n(b) has the limit
lim_{b→∞} E_θ[ ( n(θ*_n(b) − θ) )² ] = 1.
(iv) Show that
sup_{θ∈R} E_θ[ ( n(θ̂_n − θ) )² ] ≥ ( (b − 2√b)/b ) inf_{√b ≤ θ ≤ b−√b} E_θ[ ( n(θ*_n(b) − θ) )² ],
where the right-hand side is arbitrarily close to 1 for sufficiently large b.
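The closed-form posterior mean in part (ii) can be verified numerically for fixed values of n and Y (arbitrary illustrative choices below):

```python
import math

# Numerical check of the posterior-mean formula in part (ii): for the
# density f_b(theta | X) = n*exp(n*theta)/(exp(n*Y) - 1) on [0, Y],
# the mean should equal Y - 1/n + Y/(exp(n*Y) - 1).
n, Y = 5.0, 0.7
norm = math.exp(n * Y) - 1

# posterior mean by a midpoint Riemann sum on [0, Y]
m = 50_000
h = Y / m
num = sum((k + 0.5) * h * n * math.exp(n * (k + 0.5) * h) * h for k in range(m))
post_mean = num / norm

closed = Y - 1 / n + Y / norm
print(post_mean, closed)  # the two values agree
```

Integrating θ · ne^{nθ} by parts over [0, Y] reproduces the closed form, and the midpoint rule confirms it to high accuracy.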
Chapter 5
Change-Point Problem
Θ = Θ_α = { θ : αn ≤ θ ≤ (1 − α)n, θ ∈ Z_+ },
where α is a given number, 0 < α < 1/2. We assume that the standard
deviation σ and the expectation μ are known. Put c = μ/σ. This ratio is
called the signal-to-noise ratio.
The objective is to estimate θ from observations X_1, . . . , X_n. The
parameter θ is called the change point, and the problem of its estimation is
termed the change-point problem. Note that it is assumed that there are at
least αn observations obtained before and after the change point θ, that is,
the numbers of observations of both kinds are of the same order O(n).
In the context of the change-point problem, the index i may be associated
with the time at which the observation Xi becomes available. This statistical
model differs from the models of the previous chapters in the respect that
it deals with non-homogeneous observations since the expected value of the
observations suffers a jump at θ.
The joint probability density of the observations has the form
p(x_1, . . . , x_n, θ) = Π_{i=1}^{θ} [ exp{−x_i²/(2σ²)}/√(2πσ²) ] Π_{i=θ+1}^{n} [ exp{−(x_i − μ)²/(2σ²)}/√(2πσ²) ].
W(j) = −Σ_{i=θ0+1}^{θ0+j} ε_i if j ≥ 1, W(0) = 0, and W(j) = Σ_{i=θ0+j+1}^{θ0} ε_i if j ≤ −1,
where the ε_i's are as in (5.1) for 1 ≤ i ≤ n; and for i outside of the
interval [1, n], the ε_i's are understood as supplemental independent standard
normal random variables. The process W(j) is called the two-sided Gaussian
random walk (see Figure 2).
[Figure 2: a trajectory of the two-sided Gaussian random walk W(j) plotted for 1 − θ0 ≤ j ≤ n − θ0.]
For θ > θ0,
L_n(θ) − L_n(θ0) = − Σ_{i=1}^{θ} X_i²/(2σ²) − Σ_{i=θ+1}^{n} (X_i − μ)²/(2σ²) + Σ_{i=1}^{θ0} X_i²/(2σ²) + Σ_{i=θ0+1}^{n} (X_i − μ)²/(2σ²)
= Σ_{i=θ0+1}^{θ} [ −X_i²/(2σ²) + (X_i − μ)²/(2σ²) ]
= − Σ_{i=θ0+1}^{θ} (μ/σ) ( X_i/σ − μ/(2σ) ) = − Σ_{i=θ0+1}^{θ} (μ/σ) ( (X_i − μ)/σ + μ/(2σ) )
= − (μ/σ) Σ_{i=θ0+1}^{θ} ε_i − (μ²/(2σ²)) (θ − θ0) = c W(θ − θ0) − (c²/2)(θ − θ0)
with c = μ/σ. For θ < θ0, we get a similar formula,
L_n(θ) − L_n(θ0) = Σ_{i=θ+1}^{θ0} (μ/σ) ( X_i/σ − μ/(2σ) )
= c Σ_{i=θ+1}^{θ0} ε_i − (c²/2)(θ0 − θ) = c W(θ − θ0) − (c²/2) |θ − θ0|.
Remark 5.2. The two-sided Gaussian random walk W plays a role similar
to that of a standard normal random variable Z in the regular statistical
models under the LAN condition (see Section 3.4). The essential difference
is that the dimension of W grows as n → ∞.
Remark 5.4. Lemma 5.3 is very intuitive. Any estimator θ̂n misses the
true change point θ0 by at least 1 with a positive probability, which is not a
surprise due to the stochastic nature of observations. Thus, the anticipated
minimax rate of convergence in the change-point problem should be O(1) as
n → ∞.
Remark 5.5. We can define a change-point problem on the interval [0, 1]
by Xi ∼ N (0, σ 2 ) if i/n ≤ θ, and Xi ∼ N (μ, σ 2 ) if i/n > θ. On this scale,
the anticipated minimax rate of convergence is O(n−1 ) for n → ∞. Note
that the convergence in this model is faster than that in regular models, and,
though unrelated, is on the order of that in the asymptotically exponential
experiments.
The goal of this section is to describe the exact large sample performance
of the MLE.
Introduce a stochastic process

L∞(j) = c W(j) − (c²/2) | j | ,   j ∈ Z,

where the subscript “∞” indicates that j is unbounded in both directions.
Define j*∞ as the point of maximum of the process L∞(j),

j*∞ = argmax_{j ∈ Z} L∞(j),

and put

pj = Pθ0( j*∞ = j ),   j ∈ Z.

Note that the distribution of j*∞ is independent of θ0 .
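The limiting distribution { pj } admits no simple closed form, but it is easily approximated by Monte Carlo. The following sketch (Python; the signal-to-noise ratio c, the truncation level J standing in for the infinite range of j, and the replication count are our own choices) simulates the two-sided Gaussian random walk, maximizes L∞(j), and tabulates the frequencies of j*∞.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 1.0          # signal-to-noise ratio mu/sigma (assumed value)
J = 200          # truncation level standing in for the infinite range of j
n_rep = 10_000
j = np.arange(-J, J + 1)

counts = {}
for _ in range(n_rep):
    # two independent one-sided Gaussian walks glued together at W(0) = 0
    right = np.concatenate(([0.0], np.cumsum(rng.standard_normal(J))))
    left = np.concatenate(([0.0], np.cumsum(rng.standard_normal(J))))
    W = np.concatenate((left[:0:-1], right))
    L = c * W - 0.5 * c**2 * np.abs(j)      # L_infinity(j) = c W(j) - (c^2/2)|j|
    j_star = int(j[np.argmax(L)])
    counts[j_star] = counts.get(j_star, 0) + 1

p = {k: v / n_rep for k, v in sorted(counts.items())}
```

The drift term −(c²/2)|j| makes large values of |j*∞| exponentially unlikely (cf. Lemma 5.8), so the truncation at J has a negligible effect.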
Theorem 5.6. For any θ0 ∈ Θα and any loss function w, the risk of the
MLE θ̃n has a limit as n → ∞, independent of θ0 ,

lim_{n→∞} Eθ0[ w( θ̃n − θ0 ) ] = Σ_{j ∈ Z} w(j) pj .
pj ≤ b3 e^{−b4 |j|} ,   j ∈ Z.
= Σ_{j ∈ Z} w(j) pj − Σ_{j ∈ Z \ [1−θ0, n−θ0]} w(j) pj
where the latter sum is taken over integers j that do not belong to the
set 1 ≤ θ0 + j ≤ n. As a loss function, w(j) does not increase faster
than a polynomial in | j |, while pj , in accordance with Lemma 5.8,
decreases exponentially fast. Moreover, the absolute value of any j ∈
Z \ [1 − θ0 , n − θ0 ] is at least α n. Thus,
lim_{n→∞} Σ_{j ∈ Z \ [1−θ0, n−θ0]} w(j) pj = 0

and

lim inf_{n→∞} Eθ0[ w( θ̃n − θ0 ) ] ≥ Σ_{j ∈ Z} w(j) pj .
≤ Σ_{j ∈ Z} w(j) pj + b1 e^{−b2 n} max_{j : 1 ≤ θ0 + j ≤ n} w(j).

The maximum of w(j) does not grow faster than a polynomial in n.
That is why the latter term vanishes as n → ∞, and

lim sup_{n→∞} Eθ0[ w( θ̃n − θ0 ) ] ≤ Σ_{j ∈ Z} w(j) pj .
Now we can show that the minimax quadratic risk of any estimator of
θ0 is bounded from below by r∗ .
Theorem 5.11. Let r* be the constant defined in Lemma 5.10. For any
estimator θ̂n , the following inequality takes place:

lim inf_{n→∞} max_{θ0 ∈ Θα} Eθ0( θ̂n − θ0 )² ≥ r*.
where θ̃N is the Bayes estimator with respect to the uniform prior distribu-
tion πN (θ) = 1/N, αn ≤ θ ≤ (1 − α)n.
Substituting this limit into (5.4), we obtain that for any estimator θ̂n ,

lim inf_{n→∞} max_{θ0 ∈ Θα} Eθ0( θ̂n − θ0 )² ≥ lim inf_{n→∞} N^{−1} Σ_{θ0 ∈ Θα} Eθ0( θ̃N − θ0 )²

≥ lim inf_{n→∞} N^{−1} Σ_{θ0 ∈ Θα,β} Eθ0( θ̃N − θ0 )²

≥ lim inf_{n→∞} ( (N − 2βN)/N ) min_{θ0 ∈ Θα,β} Eθ0( θ̃N − θ0 )² = (1 − 2β) r*.
where for 1 ≤ i ≤ n, εi ’s are the random variables from Lemma 5.13 that
have mean zero with respect to the Pθ0 -distribution. For all other values of
i, the random variables εi ’s are assumed independent with the zero expected
value. Note that W (j) is a two-sided random walk, which in general is not
symmetric. Indeed, the distributions of εi ’s may be different for i ≤ θ0 and
i > θ0 .
Define a constant Ksgn(i) as K+ for i > 0, and K− for i < 0.
Theorem 5.14. For any integer θ, 1 ≤ θ ≤ n, and any true change point
θ0 ∈ Θα , the log-likelihood ratio has the form

Ln(θ) − Ln(θ0) = ln [ p(X1 , . . . , Xn , θ) / p(X1 , . . . , Xn , θ0) ] = W(θ − θ0) − K_{sgn(θ−θ0)} | θ − θ0 |.
From Theorem 5.14 we can expect that the MLE of θ0 in the non-
Gaussian case possesses properties similar to those in the normal case. This
is true under some restrictions on p0 (see Exercise 5.33).
Σ_{k ≥ j} ( 1 − Φ( c√k / 2 ) ) ≤ Σ_{k ≥ j} exp{ −c²k/8 } ≤ b3 exp{ −b4 j }

and

(5.8)   lim_{n→∞} Pθ0( Σ_{j : j+θ0 ∉ [1, n]} exp{ c W(j) − c² |j| / 2 } > ε ) = 0.

Introduce the notation for the denominator in the formula (5.3) for the
random variable ξ,

D = Σ_{j ∈ Z} exp{ c W(j) − c² |j| / 2 }.
(5.10)   ≤ ζ² + 2 Σ_{j ≥ 1} j exp{ −c²j/4 } = ζ² + a3

with a3 = 2 Σ_{j ≥ 1} j exp{ −c²j/4 }. Because the tail probabilities of ζ
decrease exponentially fast, any power moment of ξ is finite; in particular,
r* = Eθ0[ ξ² ] < ∞.
Finally, we verify that θ*n − θ0 converges to ξ in the L2 sense, that is,
uniformly in θ0 ∈ Θα ,

lim_{n→∞} Eθ0( θ*n − θ0 − ξ )² = 0.

Apply the representation for the difference θ*n − θ0 from Lemma 5.9. Simi-
larly to the argument used to derive (5.10), we obtain that

| θ*n − θ0 | ≤ ζ² + a3

with the same definitions of the entries on the right-hand side. Thus, the
difference

| θ*n − θ0 − ξ | ≤ | θ*n − θ0 | + | ξ | ≤ 2( ζ² + a3 )
Exercises
Exercise 5.33. Suppose that Eθ0 | li |^{5+δ} < ∞ for a small δ > 0, where the
li ’s are the log-likelihood ratios defined in (5.5), i = 1, . . . , n, and θ0 denotes
the true value of the change point. Show that uniformly in θ0 ∈ Θα , the
quadratic risk of the MLE θ̃n of θ0 is finite for any n, that is,

Eθ0( θ̃n − θ0 )² < ∞.
Sequential Estimators
F1 ⊆ F 2 ⊆ · · · ⊆ F
where F denotes the σ-algebra that contains all the σ-algebras Ft . The set
of the ordered σ-algebras { Ft , t ≥ 1 } is called a filter.
Example 6.2. The following are examples of the Markov stopping times
(for the proof see Exercise 6.37):
(i) A non-random variable τ = T where T is a given positive integer number.
(ii) The first time when the sequence Xi hits a given interval [a, b], that is,
τ = min{ i : Xi ∈ [a, b] }.
(iii) The minimum or maximum of two given Markov stopping times τ1 and
τ2 , τ = min(τ1 , τ2 ) or τ = max(τ1 , τ2 ).
(iv) The time τ = τ1 + s for any positive integer s, where τ1 is a given
Markov stopping time.
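To illustrate the defining property — the decision to stop at time i uses only X1 , . . . , Xi — here is a minimal sketch of example (ii), the first hitting time of an interval (the function names and the convention for a never-hit interval are our own):

```python
def first_hit(xs, a, b):
    """First index i (1-based) with X_i in [a, b] -- a Markov stopping time:
    the event {tau = i} is decided by X_1, ..., X_i alone."""
    for i, x in enumerate(xs, start=1):
        if a <= x <= b:
            return i
    return len(xs)  # convention: stop at the last observation if [a, b] is never hit


def min_stop(tau1, tau2):
    # the minimum of two Markov stopping times is again a stopping time (example (iii))
    return min(tau1, tau2)
```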
Example 6.3. Some random times are not examples of Markov stopping
times (for the proof see Exercise 6.38):
(i) The last time when the sequence Xi , 1 ≤ i ≤ n, hits a given interval
[a, b], that is, τ = max{ i : Xi ∈ [a, b], 1 ≤ i ≤ n}.
(ii) The time τ = τ1 − s for any positive integer s, where τ1 is a given
stopping time.
Proof. By definition,

E[ X1 + · · · + Xτ ] = Σ_{t=1}^{∞} E[ (X1 + · · · + Xt) I(τ = t) ]

= E[ X1 I(τ ≥ 1) + X2 I(τ ≥ 2) + · · · + Xt I(τ ≥ t) + . . . ].
For a Markov stopping time τ, the random event {τ ≥ t} is Ft−1 -measurable
by Lemma 6.4, that is, it is predictable from the observations up to time t−1,
X1 , . . . , Xt−1 , and is independent of the future observations Xt , Xt+1 , . . . . In
particular, Xt and I(τ ≥ t) are independent, and hence E[ Xt I(τ ≥ t) ] =
E[ X1 ] P( τ ≥ t ). Consequently,

E[ X1 + · · · + Xτ ] = E[ X1 ] Σ_{t=1}^{∞} P( τ ≥ t ) = E[ X1 ] E[ τ ].
Here we used the straightforward fact that

Σ_{t=1}^{∞} P( τ ≥ t ) = P(τ = 1) + 2 P(τ = 2) + 3 P(τ = 3) + · · · = Σ_{t=1}^{∞} t P(τ = t) = E[ τ ].
= {τ = t} ∩ A1 ∩ {τ = t} ∈ Ft ,
as an intersection of two Ft -measurable events.
Proof. For any positive integer s, put A = {τ = s}. We need to show that
A ∈ Fτ . For all t we find that
A ∩ {τ = t} = {τ = s} ∩ {τ = t} = {τ = t} if s = t,
and is the empty set, otherwise. The set {τ = t} belongs to Ft by the
definition of a stopping time. The empty set is Ft-measurable as well
(refer to Exercise 6.36). Thus, by the definition of Fτ , the event A belongs
to Fτ .
Proof. Take any interval [a, b] and define A = { Xτ ∈ [a, b] }. Note that

A = ∪_{s=1}^{∞} ( { Xs ∈ [a, b] } ∩ {τ = s} ),

and therefore

A ∩ {τ = t} = { Xt ∈ [a, b] } ∩ {τ = t}.
The latter intersection belongs to Ft because both random events belong to
Ft . Hence A is Fτ -measurable.
Remark 6.9. The concept of the σ-algebra Fτ is essential in the sequential
analysis. All parameter estimators constructed from sequential observations
are Fτ -measurable, that is, are based on observations X1 , . . . , Xτ obtained
up to a random stopping time τ.
The crucial difference between the minimax risk rn in the previous chap-
ters and rnD consists of restrictions on the set of admissible estimators. In
the on-line detection, we cannot use an arbitrary function of observations.
Below we find the rate of convergence for the minimax quadratic risk of
detection rnD for the Gaussian model, and define the rate-optimal detectors.
Assume that Xi ∼ N (0, σ 2 ) if 1 ≤ i ≤ θ0 , and Xi ∼ N (μ, σ 2 ) if θ0 < i ≤
n, where μ > 0 is known. Our goal is to show that there exists a Markov
stopping time τn∗ such that its deviation away from the true value of θ0 has
the magnitude O(ln n) as n → ∞. It indicates a slower rate of convergence
for the on-line detection as opposed to the estimation based on the entire
sample. Recall that in the latter case, the rate is O(1).
Remark 6.11. Note that on the integer scale, the convergence with the
rate O(ln n) is not a convergence at all. This should not be surprising since
the convergence rate of O(1) means no convergence as well. If we compress
the scale and consider the on-line detection problem on the unit interval
[0, 1] with the frequency of observations n (see Remark 5.5), then the rate
of convergence guaranteed by the Markov stopping time detectors becomes
(ln n)/n.
Theorem 6.12. In the on-line detection problem with n Gaussian observa-
tions, there exists a Markov stopping time τn∗ and a constant r∗ independent
of n such that the following upper bound holds:
max_{θ0 ∈ Θα} Eθ0[ ( (τ*n − θ0) / ln n )² ] ≤ r*.
Proof. The construction of the stopping time τn∗ is based on the idea of
averaging. Roughly speaking, we partition the interval [1, n] and compute
the sample means in each of the subintervals. At the lower end of the
interval, the averages are close to zero. At the upper end, they are close to
the known number μ, while in the subinterval that captures the true change
point, the sample mean is something in-between.
Put N = b ln n where b is a positive constant independent of n that will
be chosen later. Define M = n/N . Without loss of generality, we assume
that N and M are integer numbers. Introduce the normalized mean values
of observations in subintervals of length N by
X̄m = (1/(μN)) Σ_{i=1}^{N} X_{(m−1)N+i} ,   m = 1, . . . , M.
X̄m = (1/N) Σ_{i=1}^{N} I( (m − 1)N + i > θ0 ) + (1/(cN)) Σ_{i=1}^{N} ε_{(m−1)N+i}

= (1/N) Σ_{i=1}^{N} I( (m − 1)N + i > θ0 ) + (1/(c√N)) Zm ,   m = 1, . . . , M,
i=1
2
= 2 M ( 1 − Φ(y) ) ≤ M exp{−y 2 /2} ≤ M exp{−y 2 /2}
2πy 2
where Φ(y) denotes the cumulative distribution function of a N (0, 1) ran-
dom variable.In the above we used the standard inequality 1 − Φ(y) ≤
exp{−y 2 /2}/ 2πy 2 if y > 1. Thus, we have
√
Pθ0 max | Zm | ≥ 10 ln M ≤ M exp{−10 ln M/2}
1≤m≤M
1
= 10 (ln n − ln ln n − ln b) / (b ln n) ≤ 10/(b c2 ) = 0.1.
c
Now we can finalize the description of the averaged observations: X̄m =
Bm + ξm , where the Bm ’s are deterministic with the property that Bm = 0
if m < m0 , and Bm = 1 if m > m0 . The random variables | ξm | =
| Zm / (c√N) | do not exceed 0.1 if the random event A holds.
We are ready to define the Markov stopping time that estimates the
change point θ0 . Define an integer-valued random variable m* by

m* = min{ m : X̄m ≥ 0.9, 1 ≤ m ≤ M },

and formally put m* = M if X̄m < 0.9 for all m. Under the random event
A, the minimal m* exists and is equal to either m0 or m0 + 1.
Introduce a random variable
(6.4) τn∗ = m∗ N.
= max_{θ0 ∈ Θα} ( Eθ0[ ( (τ*n − θ0)/ln n )² I(A) ] + Eθ0[ ( (τ*n − θ0)/ln n )² I(Ā) ] )

≤ max_{θ0 ∈ Θα} ( ( 2N/ln n )² + ( n/ln n )² Pθ0( Ā ) )

≤ ( 2N/ln n )² + ( n/ln n )² n^{−3} ≤ 4b² + 2

where at the final stage we have applied (6.3) and the trivial inequality
1/(n ln² n) < 2, n ≥ 2. Thus, the statement of the theorem follows with
r* = 4b² + 2.
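The averaging construction of the proof can be sketched in a few lines (Python; the values of n, θ0, μ, σ, and the constant b are our own choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
n, mu, sigma = 100_000, 1.0, 1.0
theta0 = 40_000                     # true change point (unknown to the detector)
b = 40.0                            # constant in the block length N = b ln n
N = int(b * np.log(n))
M = n // N

# observations: N(0, sigma^2) before theta0, N(mu, sigma^2) after it
x = np.concatenate((sigma * rng.standard_normal(theta0),
                    mu + sigma * rng.standard_normal(n - theta0)))

# normalized block means: close to 0 before the change, close to 1 after it
xbar = np.array([x[m * N:(m + 1) * N].mean() / mu for m in range(M)])

above = np.nonzero(xbar >= 0.9)[0]
m_star = (above[0] + 1) if above.size else M   # first block whose mean exceeds 0.9
tau = m_star * N                               # the detector (6.4): tau* = m* N
```

On typical runs τ*n misses θ0 by at most a few block lengths, i.e. by O(ln n), in agreement with Theorem 6.12.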
= Σ_{j=0}^{M−1} P_{tM}( | τ̃n − tj | ≤ b ln n )

(6.6)   = Σ_{j=0}^{M−1} E_{tj}[ ( dP_{tM} / dP_{tj} ) I( | τ̃n − tj | ≤ b ln n ) ].
dP_{tM} / dP_{tj} = exp{ c Σ_{i=tj+1}^{tM} εi − (c²/2)( tM − tj ) }
where c = μ/σ is the signal-to-noise ratio, and εi = −(Xi − μ)/σ have the
standard normal distribution with respect to the Ptj -probability. Note that
the number of terms in the sum from tj + 1 to tM can be as large as O(n).
Further, let

Bj = { | τ̃n − tj | ≤ b ln n }.

Thus, each expectation in (6.6) can be written as

E_{tj}[ ( dP_{tM} / dP_{tj} ) I( | τ̃n − tj | ≤ b ln n ) ] = E_{tj}[ exp{ c Σ_{i=tj+1}^{tM} εi − (c²/2)( tM − tj ) } I( Bj ) ].
We write

E_{tj}[ ( dP_{tM} / dP_{tj} ) I( | τ̃n − tj | ≤ b ln n ) ]

= E_{tj}[ exp{ c Σ_{i=tj+1}^{uj} εi − (c²/2)( uj − tj ) } I( Bj ) ]

= E_{tj}[ exp{ c √(b ln n) Zj − (c²/2) b ln n } I( Bj ) ]

where Zj = Σ_{i=tj+1}^{uj} εi / √(b ln n) is a standard normal random variable with
respect to the P_{tj}-probability,

≥ E_{tj}[ exp{ c √(b ln n) Zj − (c²/2) b ln n } I( Bj ) I( Zj ≥ 0 ) ]
≥ exp{ −(c²/2) b ln n } P_{tj}( Bj ∩ {Zj ≥ 0} ).

Further, the probability of the intersection satisfies

P_{tj}( Bj ∩ {Zj ≥ 0} ) = P_{tj}( Bj ) + P_{tj}( Zj ≥ 0 ) − P_{tj}( Bj ∪ {Zj ≥ 0} )

≥ P_{tj}( Bj ) + P_{tj}( Zj ≥ 0 ) − 1 ≥ 3/4 + 1/2 − 1 = 1/4.

In the last step we used the inequality (6.5) and the fact that P_{tj}( Zj ≥ 0 ) =
1/2. Hence each expectation in (6.6) is bounded from below by

(1/4) exp{ −(c²/2) b ln n } = 1/(4√n)

(the last equality uses the choice b = 1/c²).
Substituting this inequality into (6.6), we arrive at a contradiction,

1 ≥ Σ_{j=0}^{M−1} 1/(4√n) = M/(4√n) = n(1 − 2α)/( 4√n · 3b ln n ) = (1 − 2α)√n/(12 b ln n) → ∞ as n → ∞.
(ii) The random variable Xi is normal with the zero mean and variance

σi² = Var[ Xi ] = σ² (1 − θ^{2i}) / (1 − θ²).

(iii) The variance of Xi has the limit

lim_{i→∞} σi² = σ∞² = σ² / (1 − θ²).

(iv) The covariance between Xi and X_{i+j} , j ≥ 0, is equal to

Cov[ Xi , X_{i+j} ] = σ² θ^j (1 − θ^{2i}) / (1 − θ²).
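Properties (ii) and (iii) can be confirmed by a short simulation (the values of θ, σ, and the index i are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma = 0.6, 1.0
i_check, n_rep = 5, 200_000

# simulate n_rep independent paths of X_i = theta X_{i-1} + sigma eps_i, X_0 = 0
x = np.zeros(n_rep)
for _ in range(i_check):
    x = theta * x + sigma * rng.standard_normal(n_rep)

emp_var = x.var()
thy_var = sigma**2 * (1 - theta**(2 * i_check)) / (1 - theta**2)   # property (ii)
lim_var = sigma**2 / (1 - theta**2)                                 # property (iii)
```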
Our objective is to find an on-line estimator of the parameter θ. Before
we do this, we first study the maximum likelihood estimator (MLE).
(6.8)   = θ0 + ( X1 ε2 + · · · + X_{n−1} εn ) / ( X1² + · · · + X²_{n−1} ).
By Lemma 6.14 (iv), since |θ| < 1, the covariance between two remote
terms Xi and Xi+j decays exponentially fast as j → ∞. It can be shown
that the Law of Large Numbers (LLN) applies to this process exactly as in
the case of independent random variables. By the LLN, for all large n, we
can substitute the denominator in the latter formula by its expectation
E[ X1² + · · · + X²_{n−1} ] = Σ_{i=1}^{n−1} Var[ Xi ] ∼ n σ∞² = n σ² / (1 − θ0²),

so that

√n ( θ*n − θ0 ) ≈ ( (1 − θ0²)/σ² ) ( X1 ε2 + · · · + X_{n−1} εn ) / √n.
If the Xi ’s were independent, then ( X1 ε2 + · · · + X_{n−1} εn ) / √n would
satisfy the Central Limit Theorem (CLT). It turns out, and it is far from
being trivial, that we can work with the Xi ’s as if they were independent,
and the CLT still applies. Thus, the limiting distribution of this quotient is
normal with mean zero and the limiting variance
normal with mean zero and the limiting variance
lim_{n→∞} Var[ ( X1 ε2 + · · · + X_{n−1} εn ) / √n ] = lim_{n→∞} (1/n) E[ ( Σ_{i=1}^{n−1} Xi ε_{i+1} )² ]

= lim_{n→∞} (1/n) Σ_{i=1}^{n−1} E[ Xi² ] E[ ε²_{i+1} ] = lim_{n→∞} ( σ⁴ / ( n(1 − θ0²) ) ) Σ_{i=1}^{n−1} ( 1 − θ0^{2i} )

= lim_{n→∞} ( σ⁴ / ( n(1 − θ0²) ) ) ( n − (1 − θ0^{2n}) / (1 − θ0²) ) = σ⁴ / (1 − θ0²).
It partially explains why the difference √n ( θ*n − θ0 ) is asymptotically normal
with mean zero and variance

( (1 − θ0²)/σ² )² · σ⁴ / (1 − θ0²) = 1 − θ0²,

that is,

√n ( θ*n − θ0 ) → N( 0, 1 − θ0² ) as n → ∞.
Note that the limiting variance is independent of σ 2 , the variance of the
noise.
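The asymptotic normality, and the fact that the limiting variance 1 − θ0² does not involve σ², can be checked by simulation. A sketch (assumed parameter values; σ is deliberately set different from 1, and θ*n is computed from (6.8) as the ratio Σ X_{i−1} Xi / Σ X²_{i−1}):

```python
import numpy as np

rng = np.random.default_rng(4)
theta0, sigma, n, n_rep = 0.4, 2.0, 1000, 1000

def ar1_mle(x):
    # theta*_n = (X_1 X_2 + ... + X_{n-1} X_n) / (X_1^2 + ... + X_{n-1}^2)
    return np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)

devs = []
for _ in range(n_rep):
    x = np.empty(n + 1)
    x[0] = 0.0
    eps = rng.standard_normal(n)
    for i in range(n):
        x[i + 1] = theta0 * x[i] + sigma * eps[i]
    devs.append(np.sqrt(n) * (ar1_mle(x) - theta0))

emp_var = np.var(devs)   # should be close to 1 - theta0^2 = 0.84, not to sigma^2 = 4
```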
6.4.2. On-Line Estimator. After obtaining a general idea about the MLE
and its asymptotic performance, we are ready to try a sequential estimation
procedure, termed on-line estimation.
Note that from (6.8) the difference θ*n − θ0 can be presented in the form
θ*n − θ0 = Σ_{i=2}^{n} υ_{n,i} εi with the weights υ_{n,i} = X_{i−1} / ( X1² + · · · + X²_{n−1} ).
If the υ_{n,i} ’s were deterministic, then the variance of the difference θ*n − θ0
would be

σ² Σ_{i=2}^{n} υ²_{n,i} = σ² / ( X1² + · · · + X²_{n−1} ).
In the discrete case with normal noise, the overshoot X12 + · · · + Xt2 − H
is positive with probability 1. The stopping time τ is a random sample
size, and the level H controls the magnitude of its expected value, Eθ0 [ τ ]
increases as H grows (see Exercise 6.39). Put
ΔH = H − ( X1² + · · · + X²_{τ−1} )   and   η = ΔH / Xτ .
The definition of η makes sense because the random variable Xτ differs from
zero with probability 1.
Define an on-line estimator of θ0 by

(6.9)   θ̂τ = (1/H) ( Σ_{i=1}^{τ} X_{i−1} Xi + η X_{τ+1} ).
i=1
= (1/H) ( θ0 Σ_{i=1}^{τ} X²_{i−1} + Σ_{i=1}^{τ} X_{i−1} εi + θ0 η Xτ + η ε_{τ+1} ).
By definition, η Xτ = ΔH and ΔH + Σ_{i=1}^{τ} X²_{i−1} = H, hence

(6.10)   θ̂τ = θ0 + (1/H) ( Σ_{i=1}^{τ} X_{i−1} εi + η ε_{τ+1} ).
It follows that either sum in (6.11) is equal to zero, which means that the
estimator θ̂τ is unbiased.
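The sequential procedure is short to implement. In the sketch below (assumed values of θ0, σ, and H; the initialization X0 = 0 is our own convention) the empirical mean of θ̂τ over repeated runs is close to θ0, illustrating the unbiasedness, and the empirical variance is of order σ²/H:

```python
import numpy as np

rng = np.random.default_rng(5)
theta0, sigma, H = 0.5, 1.0, 500.0

def online_estimate():
    xs = [0.0]                       # X_0 = 0 (assumed initialization)
    s = 0.0
    while True:
        xs.append(theta0 * xs[-1] + sigma * rng.standard_normal())
        s += xs[-1] ** 2             # running sum X_1^2 + ... + X_t^2
        if s > H:                    # stop at tau = min{t : sum > H}
            break
    tau = len(xs) - 1
    delta = H - (s - xs[-1] ** 2)    # Delta H = H - (X_1^2 + ... + X_{tau-1}^2)
    eta = delta / xs[-1]
    x_next = theta0 * xs[-1] + sigma * rng.standard_normal()   # X_{tau+1}
    # estimator (6.9): the sum of X_{i-1} X_i up to tau plus the correction, over H
    return (sum(xs[i - 1] * xs[i] for i in range(1, tau + 1)) + eta * x_next) / H

ests = [online_estimate() for _ in range(2000)]
```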
Next, we want to estimate the variance of θ̂τ . Using the representation
(6.10) of θ̂τ , we need to verify that
Eθ0[ ( Σ_{i=1}^{τ} X_{i−1} εi + η ε_{τ+1} )² ] ≤ σ² H.
The left-hand side of this inequality is equal to

(6.12)   Eθ0[ ( Σ_{i=1}^{τ} X_{i−1} εi )² + 2 ( Σ_{i=1}^{τ} X_{i−1} εi ) η ε_{τ+1} + η² ε²_{τ+1} ].
Consider the last term. We know that η is Fτ-measurable. Hence

Eθ0[ η² ε²_{τ+1} ] = Σ_{t=1}^{∞} Eθ0[ η² ε²_{t+1} I(τ = t) ]

= Σ_{t=1}^{∞} Eθ0[ η² I(τ = t) ] Eθ0[ ε²_{t+1} ]

= σ² Σ_{t=1}^{∞} Eθ0[ η² I(τ = t) ] = σ² Eθ0[ η² ].
In a similar way, we can show that the expectation of the cross-term in
(6.12) is zero. The analysis of the first term, however, takes more steps. It
can be written as
Eθ0[ ( Σ_{i=1}^{τ} X_{i−1} εi )² ] = Eθ0[ (X1 ε2)² I(τ = 2) + (X1 ε2 + X2 ε3)² I(τ = 3) + (X1 ε2 + X2 ε3 + X3 ε4)² I(τ = 4) + . . . ]

= Eθ0[ X1² ε2² I(τ = 2) + ( X1² ε2² + X2² ε3² ) I(τ = 3) + ( X1² ε2² + X2² ε3² + X3² ε4² ) I(τ = 4) + · · · ]

+ 2 Eθ0[ (X1 ε2)(X2 ε3) I(τ ≥ 3) + (X1 ε2 + X2 ε3)(X3 ε4) I(τ ≥ 4) + · · · ]

= E1 + 2 E2

where

E1 = Eθ0[ X1² ε2² I(τ = 2) + ( X1² ε2² + X2² ε3² ) I(τ = 3) + ( X1² ε2² + X2² ε3² + X3² ε4² ) I(τ = 4) + · · · ]

= σ² Eθ0[ X1² I(τ = 2) + (X1² + X2²) I(τ = 3) + (X1² + X2² + X3²) I(τ = 4) + · · · ]

= σ² Eθ0[ Σ_{i=1}^{τ} X²_{i−1} ]
and

E2 = Eθ0[ (X1 ε2)(X2 ε3) I(τ ≥ 3) + (X1 ε2 + X2 ε3)(X3 ε4) I(τ ≥ 4) + · · · ]

= Eθ0[ (X1 ε2) X2 I(τ ≥ 3) ] Eθ0[ ε3 ] + Eθ0[ (X1 ε2 + X2 ε3) X3 I(τ ≥ 4) ] Eθ0[ ε4 ] + · · · = 0.
Combining all these estimates, we find that the expectation in (6.12) is equal
to

Eθ0[ ( Σ_{i=1}^{τ} X_{i−1} εi )² + 2 ( Σ_{i=1}^{τ} X_{i−1} εi ) η ε_{τ+1} + η² ε²_{τ+1} ]

= σ² Eθ0[ Σ_{i=1}^{τ} X²_{i−1} ] + σ² Eθ0[ η² ].
From the definition of ΔH, Σ_{i=1}^{τ} X²_{i−1} = H − ΔH. Also, recall that η =
ΔH/Xτ . Thus, we continue

= σ² Eθ0[ H − ΔH + η² ] = σ²( H − Eθ0[ ΔH − η² ] )

= σ²( H − Eθ0[ ΔH − ( ΔH/Xτ )² ] )

= σ²( H − Eθ0[ ΔH ( 1 − ΔH/Xτ² ) ] ).
Note that at the time τ − 1, the value of the sum X12 + · · · + Xτ2−1 does
not exceed H, which yields the inequality ΔH ≥ 0. In addition, by the
definition of τ , Σ_{i=1}^{τ} X²_{i−1} + Xτ² > H, which implies that

ΔH = H − Σ_{i=1}^{τ} X²_{i−1} < Xτ².
Varθ0( θ̂τ ) ≥ σ²/H − σ² C0² / ( 4H² ( 1 − | θ0 | )² ).

Varθ0( θ̂τ ) = (σ²/H²) ( H − Eθ0[ ΔH ( 1 − ΔH/Xτ² ) ] )
Remark 6.17. Note that the bound for the variance of θ̂τ in Theorem 6.16
is pointwise, that is, the lower bound depends on θ0 . To obtain a uniform
bound for all θ0 ∈ Θα = { θ0 : | θ0 | ≤ 1 − α }, we take the minimum of both
sides:

inf_{θ0 ∈ Θα} Varθ0( θ̂τ ) ≥ σ²/H − σ² C0² / ( 4H² α² ).

Combining this result with the uniform upper bound in Lemma 6.15, we get
that as H → ∞,

inf_{θ0 ∈ Θα} Varθ0( θ̂τ ) = (σ²/H) ( 1 + O(H^{−1}) ).
Exercises
Exercise 6.38. Show that the random times τ specified in Example 6.3 are
not Markov stopping times.
Linear Parametric
Regression
(7.1) Y = f (X) + ε
Remark 7.1. In this book we study only simple regressions where there is
only one predictor X.
(7.2) f = θ0 g0 + θ1 g1 + · · · + θk gk .
[Figure: a scatter plot of the data points (xi , yi) around the regression curve f(X); the random error εi is the vertical deviation of yi from f(xi).]
that is, when ε = 0, we have ŷ = y which implies that θ̂j = θj for all
j = 0, . . . , k.
[Figure: the geometry of least squares in the data space: the response vector y is the sum of Gθ ∈ S and the error vector ε; ŷ is the projection of y on the span-space S.]
To ease the presentation, we study the regression on the interval [0, 1],
that is, we assume that the regression function f (x) and all the components
in the linear regression model, g0 (x), . . . , gk (x), are defined for x ∈ [0, 1].
The design points xi , i = 1, . . . , n, also belong to this interval.
Define the least-squares estimator of the regression function f (x) in (7.2),
at any point x ∈ [0, 1], by
(7.14) fˆn (x) = θ̂0 g0 (x) + · · · + θ̂k gk (x).
Here the subscript n indicates that the estimation is based on n pairs of
observations (xi , yi ), i = 1, . . . , n.
A legitimate question is: how close is fˆn(x) to f(x)? We try to answer
this question using two different loss functions. The first one is the quadratic
loss function computed at a fixed point x ∈ [0, 1],

(7.15)   w( fˆn − f ) = ( fˆn(x) − f(x) )².
The risk with respect to this loss is called the mean squared risk at a point
or mean squared error (MSE).
The second loss function that we consider is the mean squared difference
over the design points
(7.16)   w( fˆn − f ) = (1/n) Σ_{i=1}^{n} ( fˆn(xi) − f(xi) )².
Note that this loss function is a discrete version of the integral L2-norm,

‖ fˆn − f ‖2² = ∫_0^1 ( fˆn(x) − f(x) )² dx.
The respective risk is a discrete version of the mean integrated squared error
(MISE).
In this section, we study the conditional risk Eθ w(fˆn − f ) | X , given
the design X . The next two lemmas provide computational formulas for the
MSE and discrete MISE, respectively.
Introduce the matrix D = σ² (G′G)^{−1} , called the covariance matrix.
Note that D depends on the design X , and this dependence can be sophisti-
cated. In particular, if the design X is random, this matrix is random
as well.
Lemma 7.6. For a fixed design X , the estimator fˆn(x) is an unbiased es-
timator of f(x) at any x ∈ [0, 1], so that its MSE equals the variance of
fˆn(x),

Varθ[ fˆn(x) | X ] = Eθ[ ( fˆn(x) − f(x) )² | X ] = Σ_{l,m=0}^{k} D_{l,m} gl(x) gm(x),

where D_{l,m} denotes the (l, m)-th entry of the covariance matrix D.

= Σ_{l,m=0}^{k} Eθ[ (θ̂l − θl)(θ̂m − θm) | X ] gl(x) gm(x) = Σ_{l,m=0}^{k} D_{l,m} gl(x) gm(x).
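Lemma 7.6 can be verified directly by Monte Carlo. The sketch below uses a quadratic basis g0 = 1, g1 = x, g2 = x² (our own choice) and compares the formula Σ_{l,m} D_{l,m} gl(x0) gm(x0) with the simulated variance of fˆn(x0):

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 50, 0.3
x = np.linspace(0.0, 1.0, n)
G = np.vander(x, 3, increasing=True)    # columns: g_0(x) = 1, g_1(x) = x, g_2(x) = x^2
theta = np.array([1.0, -2.0, 0.5])
f0 = G @ theta                          # true regression values at the design points
D = sigma**2 * np.linalg.inv(G.T @ G)   # covariance matrix D = sigma^2 (G'G)^{-1}

x0 = 0.3
g_x0 = np.array([1.0, x0, x0**2])
var_formula = g_x0 @ D @ g_x0           # sum_{l,m} D_{l,m} g_l(x0) g_m(x0)

P = np.linalg.pinv(G)                   # maps y to the least-squares estimate theta-hat
fits = np.array([g_x0 @ (P @ (f0 + sigma * rng.standard_normal(n)))
                 for _ in range(20_000)])
```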
Proof. Applying the facts that G′G = σ² D^{−1} , and that the matrix D is
symmetric and positive definite (therefore, D^{1/2} exists), we have the equa-
tions

(1/n) Σ_{i=1}^{n} ( fˆn(xi) − f(xi) )² = (1/n) ‖ G( θ̂ − θ ) ‖²

= (1/n) ( G( θ̂ − θ ) )′ G( θ̂ − θ ) = (1/n) ( θ̂ − θ )′ G′G ( θ̂ − θ )

= (σ²/n) ( θ̂ − θ )′ D^{−1} ( θ̂ − θ ) = (σ²/n) ‖ D^{−1/2} ( θ̂ − θ ) ‖²,

where by ‖ · ‖ we mean the Euclidean norm in the Rn space of observations.
Note that the vector with the components fˆn (xi ) coincides with ŷ, the
projection of y on the span-space S, that is,
In other words, residuals are deviations of the observed responses from the
predicted ones evaluated at the design points.
Graphically, residuals can be visualized in the data space Rn . The vector
of residuals r, plotted in Figure 4, is orthogonal to the span-space S. Also,
the residuals ri ’s can be depicted on a scatter plot (see Figure 5).
[Figure 5. A scatter plot of the data with the fitted curve fˆn(X); the residual ri is the vertical distance between the observed point (xi , yi) and the predicted point (xi , ŷi).]
Lemma 7.8. For a given design X , the sum of squares of the residuals satisfies

r1² + · · · + rn² = ‖ r ‖² = ‖ y − ŷ ‖² = σ² χ²_{n−k−1} ,
Proof. The squared Euclidean norm of the vector of random errors admits
the partition
‖ ε ‖² = ‖ y − G θ ‖² = ‖ y − ŷ + ŷ − G θ ‖²

= ‖ y − ŷ ‖² + ‖ ŷ − G θ ‖² = ‖ r ‖² + ‖ ŷ − G θ ‖².
Here the cross term is zero, because it is a dot product of the residual vector
r and the vector ŷ − G θ that lies in the span-space S. Moreover, these two
vectors are independent (see Exercise 7.46).
The random vector ε has Nn(0, σ² In) distribution, implying that ‖ ε ‖² =
σ² χ²n , where χ²n denotes a chi-squared random variable with n degrees of
freedom. Also, by Lemma 7.7,

‖ ŷ − G θ ‖² = Σ_{i=1}^{n} ( fˆn(xi) − f(xi) )² = σ² χ²_{k+1}
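In particular, ‖r‖²/σ² should have mean n − k − 1. A quick Monte Carlo sketch (quadratic basis with k = 2; all numerical values are our own choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, sigma = 40, 2, 1.0
x = np.linspace(0.0, 1.0, n)
G = np.vander(x, k + 1, increasing=True)
Hat = G @ np.linalg.inv(G.T @ G) @ G.T   # hat matrix H = G (G'G)^{-1} G'
theta = np.array([0.5, 1.0, -1.0])
f0 = G @ theta

ss = []
for _ in range(20_000):
    y = f0 + sigma * rng.standard_normal(n)
    r = y - Hat @ y                      # residual vector r = y - y-hat
    ss.append(r @ r)
mean_ss = np.mean(ss) / sigma**2         # should be close to n - k - 1 = 37
```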
too close to each other (concentrated around one point, or even coincide),
or have big gaps between each other, or both.
For simplicity we suppress the dependence on n of the regular design
points, that is, we write xi instead of xn,i , i = 1, . . . , n.
Example 7.9. The data points that are spread equidistantly on the unit
interval, xi = i/n, i = 1, . . . , n, constitute a regular design, called uniform
design, since these points are (i/n)-th quantiles of the standard uniform
distribution.
It can be shown (see Exercise 7.48) that in the case of a regular design
corresponding to a probability density p(x), for any continuous function
g(x), the Riemann sum converges to the integral

(7.19)   (1/n) Σ_{i=1}^{n} g(xi) → ∫_0^1 g(x) p(x) dx as n → ∞.
= lim_{n→∞} (σ²/n) ( gl(x1) gm(x1) + · · · + gl(xn) gm(xn) )

(7.20)   = σ² ∫_0^1 gl(x) gm(x) p(x) dx,

where (D∞)_{l,m} are the elements of the limiting covariance matrix D∞ .

(7.22)   = σ² ∫_0^1 gl(x) gm(x) p(x) dx.
Exercises
Exercise 7.46. Show that (i) the vector of residuals r has a multivariate
normal distribution with mean zero and covariance matrix σ² (In − H),
where H = G (G′G)^{−1} G′ is called the hat matrix because of the identity
ŷ = H y.
(ii) Argue that the vectors r and ŷ − G θ are independent.
Exercise 7.49. Show that the matrix with the elements given by (7.21) is
invertible.
Nonparametric
Regression
Chapter 8
Estimation in
Nonparametric
Regression
(8.1) Y = f (X) + ε
with the random error ε ∼ N (0, σ 2 ). However, unlike that in the parametric
regression model, here the algebraic form of the regression function f is
assumed unknown and must be evaluated from the data. The goal of the
nonparametric regression analysis is to estimate the function f as a curve,
rather than to estimate parameters of a guessed function.
A set of n pairs of observations (x1 , y1 ), . . . , (xn , yn ) satisfy the relation
(8.2) yi = f (xi ) + εi , i = 1, . . . , n,
Note that in the nonparametric case, the loss functions are, in fact,
functionals since they depend on f . For simplicity, we will continue calling
them functions. We denote the risk function by

Rn(f, fˆn) = Ef[ w( fˆn − f ) ]
where the subscript f in the expectation refers to a fixed regression function
f . If the design X is random, we use the conditional expectation Ef [ · | X ]
to emphasize averaging over the distribution of the random error ε.
When working with the difference fˆn − f , it is technically more conve-
nient to consider separately the bias bn(x) = Ef[ fˆn(x) ] − f(x) and the
stochastic part ξn(x) = fˆn(x) − Ef[ fˆn(x) ]. Then the MSE or discrete
MISE is split into a sum (see Exercise 8.55),

(8.3)   Rn( fˆn , f ) = Ef[ w( fˆn − f ) ] = Ef[ w(ξn) ] + w(bn).
To deal with random designs, we consider the conditional bias and sto-
chastic part of an estimator fˆn , given the design X ,

bn(x, X ) = Ef[ fˆn(x) | X ] − f(x)

and

ξn(x, X ) = fˆn(x) − Ef[ fˆn(x) | X ].
Note that fn* is not a single estimator but rather a sequence of estimators defined
for all sufficiently large n.
Note that the linear estimator fˆn is a linear function of the response values
y1 , . . . , yn . The weight υn,i(x) determines the influence of the observation
yi on the estimator fˆn(x) at the point x.
An advantage of the linear estimator (8.8) is that for a given design X ,
the conditional bias and variance are easily computable (see Exercise 8.56),

(8.9)   bn(x, X ) = Σ_{i=1}^{n} υn,i(x) f(xi) − f(x)

and

(8.10)   Ef[ ξn²(x, X ) | X ] = σ² Σ_{i=1}^{n} υ²n,i(x).
If for any x ∈ [0, 1], the linear estimator (8.8) depends on all the design
points x1 , . . . , xn , it is called a global linear estimator of the regression func-
tion. We study global estimators later in this book.
For a chosen kernel and a bandwidth, define the weights υn,i(x) by

(8.12)   υn,i(x) = K( (xi − x)/hn ) / Σ_{j=1}^{n} K( (xj − x)/hn ).
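With these weights, the linear estimator (8.8) is the classical kernel (Nadaraya–Watson) estimator. A minimal sketch using the Epanechnikov kernel K(u) = 0.75(1 − u²) on [−1, 1] (the regression function, noise level, and bandwidth are our own choices):

```python
import numpy as np

def kernel_estimate(x0, xs, ys, h):
    # weights (8.12): v_{n,i}(x0) = K((x_i - x0)/h) / sum_j K((x_j - x0)/h)
    u = (xs - x0) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)  # Epanechnikov kernel
    return np.sum(k * ys) / np.sum(k)

rng = np.random.default_rng(8)
n, h = 2000, 0.1
xs = rng.uniform(0.0, 1.0, n)
ys = np.sin(2.0 * np.pi * xs) + 0.2 * rng.standard_normal(n)
est = kernel_estimate(0.5, xs, ys, h)   # true value f(0.5) = sin(pi) = 0
```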
where f^{(m)} denotes the m-th derivative of f . Also, for any xi and x such
that |xi − x| ≤ hn , the remainder term ρ(xi , x) satisfies the inequality

(8.15)   | ρ(xi , x) | ≤ L hn^β / (β − 1)!.
It turns out that for linear estimators, regular random designs have an
advantage over deterministic ones. As we demonstrate in this section, when
computing the risk, averaging over the distribution of a random design helps
to eliminate a significant portion of the bias.
Next we introduce a linear estimator that guarantees the zero bias for
any polynomial regression function up to degree β − 1 (see Exercise 8.59).
To ease the presentation, we assume that a regular random design is uniform
with the probability density p(x) = 1, x ∈ [0, 1]. The extension to a more
general case is given in Remark 8.6.
A smoothing kernel estimator fˆn(x) of degree β − 1 is given by the formula

(8.16)   fˆn(x) = (1/(n hn)) Σ_{i=1}^{n} yi K( (xi − x)/hn ),   0 < x < 1,
Remark 8.6. For a general density p(x) of the design points, the smoothing
kernel estimator is defined as

(8.18)   fˆn(x) = (1/(n hn)) Σ_{i=1}^{n} ( yi / p(xi) ) K( (xi − x)/hn )
The next lemma gives upper bounds for the bias and variance of the
smoothing kernel estimator (8.16). The proof of the lemma can be found at
the end of this section.
Lemma 8.8. For any regression function f ∈ Θ(β, L, L1), at any point
x ∈ (0, 1), the bias and variance of the smoothing kernel estimator (8.16)
admit the upper bounds for all large enough n,

| bn(x) | ≤ Ab hn^β   and   Varf[ fˆn(x) ] ≤ Av / (n hn)

with the constants

Ab = L K1 / (β − 1)!   and   Av = ( L1² + σ² ) ‖K‖2²

where K1 = ∫_{−1}^{1} |K(u)| du and ‖K‖2² = ∫_{−1}^{1} K²(u) du.
Remark 8.9. The above lemma clearly indicates that as hn increases, the
upper bound for the bias increases, while that for the variance decreases.
Applying this lemma, we can bound the mean squared risk of fˆn(x) at
a point x ∈ (0, 1) by

(8.19)   Ef( fˆn(x) − f(x) )² = bn²(x) + Varf[ fˆn(x) ] ≤ Ab² hn^{2β} + Av / (n hn).

It is easily seen that the value of hn that minimizes the right-hand side
of (8.19) satisfies the equation

(8.20)   hn^{2β} = A / (n hn)

with a constant factor A independent of n. This equation is called the
balance equation since it reflects the idea of balancing the squared bias and
variance terms.

Next, we neglect the constant in the balance equation (8.20), and label
the respective optimal bandwidth by a superscript (*). It is a solution of the
equation

hn^{2β} = 1 / (n hn),
and is equal to

hn* = n^{−1/(2β+1)}.

Denote by fn*(x) the smoothing kernel estimator (8.16) corresponding to the
optimal bandwidth hn*,

(8.21)   fn*(x) = (1/(n hn*)) Σ_{i=1}^{n} yi K( (xi − x)/hn* ).
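The balance equation can be solved by taking logarithms: 2β ln h = −ln n − ln h gives h = n^{−1/(2β+1)}. A two-line sketch makes the bandwidth and the resulting risk order explicit:

```python
def optimal_bandwidth(n, beta):
    # solution of the balance equation h^{2 beta} = 1/(n h)
    return n ** (-1.0 / (2 * beta + 1))

def risk_order(n, beta):
    # squared bias and variance are balanced at h*, each of order n^{-2 beta/(2 beta + 1)}
    return n ** (-2.0 * beta / (2 * beta + 1))
```

For Lipschitz functions (β = 1) the mean squared risk is of order n^{−2/3}; as β grows, the rate approaches the parametric order n^{−1}.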
Finally, we give the proofs of two technical lemmas stated in this section.
Proof of Lemma 8.5. We need to prove that the bound (8.15) for the
remainder term is valid. For β = 1, the bound follows from the definition of
the Lipschitz class of functions Θ(1, L, L1),

| ρ(xi , x) | = | f(xi) − f(x) | ≤ L |xi − x| ≤ L hn .

If β ≥ 2, then the Taylor expansion with the Lagrange remainder term has
the form

(8.22)   f(xi) = Σ_{m=0}^{β−2} ( f^{(m)}(x)/m! ) (xi − x)^m + ( f^{(β−1)}(x*)/(β − 1)! ) (xi − x)^{β−1}

where the new remainder term ρ(xi , x) satisfies the inequality for any xi
and x such that |xi − x| ≤ hn ,

| ρ(xi , x) | = ( | f^{(β−1)}(x*) − f^{(β−1)}(x) | / (β − 1)! ) |xi − x|^{β−1}

≤ ( L |x* − x| / (β − 1)! ) |xi − x|^{β−1} ≤ ( L hn / (β − 1)! ) hn^{β−1} = L hn^β / (β − 1)!.

In the above, the definition of the Hölder class Θ(β, L, L1) has been applied.
Proof of Lemma 8.8. Using the definition of the bias and the regression
equation yi = f(xi) + εi , we write

bn(x) = Ef[ (1/(n hn)) Σ_{i=1}^{n} yi K( (xi − x)/hn ) ] − f(x)

(8.23)   = Ef[ (1/(n hn)) Σ_{i=1}^{n} ( f(xi) + εi ) K( (xi − x)/hn ) ] − f(x).

Now since εi has mean zero and is independent of xi ,

Ef[ Σ_{i=1}^{n} εi K( (xi − x)/hn ) ] = 0.

Also, by the normalization condition,

Ef[ (1/hn) K( (xi − x)/hn ) ] = (1/hn) ∫_{x−hn}^{x+hn} K( (xi − x)/hn ) dxi = ∫_{−1}^{1} K(u) du = 1.

Consequently, continuing from (8.23), we can write

(8.24)   bn(x) = (1/(n hn)) Σ_{i=1}^{n} Ef[ ( f(xi) − f(x) ) K( (xi − x)/hn ) ].
Substituting Taylor’s expansion (8.14) of the function f (xi ) into (8.24), we
get that for any β > 1,
this sum equals zero as well, which can be seen from the orthogonality
conditions. For m = 1, . . . , β − 1,

∫_{x−hn}^{x+hn} (x1 − x)^m K( (x1 − x)/hn ) dx1 = hn^{m+1} ∫_{−1}^{1} u^m K(u) du = 0.
Thus, using the inequality (8.15) for the remainder term ρ(xi , x), we obtain
that for any β ≥ 1, the absolute value of the bias is bounded by

| bn(x) | ≤ max_{z : |z−x| ≤ hn} | ρ(z, x) | (1/hn) ∫_{x−hn}^{x+hn} | K( (x1 − x)/hn ) | dx1

≤ ( L hn^β / (β − 1)! ) ∫_{−1}^{1} | K(u) | du = L K1 hn^β / (β − 1)! = Ab hn^β.
Further, to find a bound for the variance of fˆn (x), we use the indepen-
dence of the data points to write
1 n x − x
i
Varf fˆn (x) = Varf yi K
nhn hn
i=1
1
n x − x
i
= Var f yi K .
(nhn )2 hn
i=1
Now we bound the variance by the second moment, and plug in the regres-
sion equation yi = f (xi ) + εi ,
n
1 2 xi − x
≤ Ef yi
2
K
(nhn )2 hn
i=1
1
n 2 2 xi − x
= Ef f (x i ) + ε i K
(nhn )2 hn
i=1
1
n 2 xi − x
= Ef f 2
(x i ) + ε 2
i K .
(nhn )2 hn
i=1
Here the cross term disappears because of independence of $\varepsilon_i$ and $x_i$, and
the fact that the expected value of $\varepsilon_i$ is zero. Finally, using the facts that
$|f(x_i)| \le L_1$ and $\mathbb{E}_f[\varepsilon_i^2] = \sigma^2$, we find
$$
\le \frac{1}{(nh_n)^2}\, n\big(L_1^2 + \sigma^2\big)\int_{x-h_n}^{x+h_n} K^2\Big(\frac{x_1 - x}{h_n}\Big)\, dx_1
$$
$$
= \frac{1}{nh_n}\big(L_1^2 + \sigma^2\big)\int_{-1}^{1} K^2(u)\, du
= \frac{1}{nh_n}\big(L_1^2 + \sigma^2\big)\|K\|_2^2 = \frac{A_v}{nh_n}.
$$
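The kernel estimator and its bias-variance behavior can be illustrated numerically. The following sketch (illustrative only, with an assumed Epanechnikov kernel, uniform random design on $[0,1]$, Gaussian noise, and test function $f(t) = \sin(2\pi t)$) evaluates $\hat f_n(x) = (nh_n)^{-1}\sum_i y_i K((x_i - x)/h_n)$ at a point:

```python
import math, random

random.seed(0)

def epanechnikov(u):
    # kernel supported on [-1, 1] that integrates to one
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def kernel_estimate(x, xs, ys, h):
    # f_hat(x) = (1/(n h)) * sum_i y_i K((x_i - x)/h)
    n = len(xs)
    return sum(y * epanechnikov((xi - x) / h) for xi, y in zip(xs, ys)) / (n * h)

n = 2000
f = lambda t: math.sin(2 * math.pi * t)            # assumed true regression function
xs = [random.random() for _ in range(n)]           # uniform design on [0, 1]
ys = [f(xi) + random.gauss(0, 0.1) for xi in xs]   # y_i = f(x_i) + eps_i

beta = 2
h = n ** (-1 / (2 * beta + 1))                     # optimal bandwidth n^{-1/(2 beta + 1)}
err = abs(kernel_estimate(0.5, xs, ys, h) - f(0.5))
```

With this sample size the pointwise error is small, in line with the rate $n^{-\beta/(2\beta+1)}$.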
112 8. Estimation in Nonparametric Regression
Exercises
Exercise 8.55. Prove (8.3) for: (i) the quadratic loss at a point
$$
w\big(\hat f_n - f\big) = \big(\hat f_n(x) - f(x)\big)^2,
$$
and (ii) the mean squared difference
$$
w\big(\hat f_n - f\big) = \frac{1}{n}\sum_{i=1}^n \big(\hat f_n(x_i) - f(x_i)\big)^2.
$$
Exercise 8.57. Show that the kernels introduced in Example 8.2 integrate
to one.
Exercise 8.59. Prove that the smoothing kernel estimator (8.16) is unbiased
if the regression function f is a polynomial up to order $\beta - 1$.
Exercise 8.60. Find the normalizing constant C such that the tri-cube
kernel function
$$
K(u) = C\big(1 - |u|^3\big)^3\, \mathbb{I}(|u| \le 1)
$$
integrates to one. What is its degree? Hint: Use (8.17).
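As a numerical cross-check (a sketch, not the intended solution of the exercise), the constant can be recovered by midpoint-rule quadrature; exactly, $\int_{-1}^{1}(1 - |u|^3)^3\,du = 81/70$, so $C = 70/81$:

```python
# Midpoint rule for int_{-1}^{1} (1 - |u|^3)^3 du; the exact value is 81/70.
N = 100000
total = 0.0
for i in range(N):
    u = -1 + (2 * i + 1) / N       # midpoint of the i-th subinterval of [-1, 1]
    total += (1 - abs(u) ** 3) ** 3 * (2 / N)
C = 1 / total                       # normalizing constant, exactly 70/81
```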
Chapter 9. Local Polynomial Approximation of the Regression Function
The system of normal equations (9.2) has a unique solution if the matrix
$G^\top G$ is invertible. We always make this assumption. It suffices to require
that the design points are distinct and that $N \ge \beta$.
Applying Lemma 8.5, we can present each observation yi as the sum of
the three components: a polynomial of degree β − 1, a remainder term, and
a random error,
$$
(9.3) \qquad y_i = \sum_{m=0}^{\beta-1} \frac{f^{(m)}(x)}{m!}\,(x_i - x)^m + \rho(x_i, x) + \varepsilon_i
$$
where
$$
|\rho(x_i, x)| \le \frac{L h_n^\beta}{(\beta-1)!} = O(h_n^\beta), \qquad i = 1, \dots, N.
$$
The system of normal equations (9.2) is linear in y, hence each compo-
nent of yi in (9.3) can be treated separately. The next lemma provides the
information about the first polynomial component.
9.1. Preliminary Results and Definition 117
The next lemma presents the results on the remainder and stochastic
terms in (9.3).
Lemma 9.3. Suppose Assumption 9.2 holds. Then the following is valid.
(i) If $y_i = \rho(x_i, x)$, then the solution $\hat\theta$ of the system of normal equations
(9.2) has the elements $\hat\theta_m$, $m = 0, \dots, \beta-1$, bounded by
$$
|\hat\theta_m| \le C_b\, h_n^\beta \quad \text{where} \quad C_b = \frac{\gamma_0\, \beta L}{(\beta-1)!}.
$$
(ii) If $y_i = \varepsilon_i$, then the solution $\hat\theta$ of the system of normal equations (9.2)
has the zero-mean normal elements $\hat\theta_m$, $m = 0, \dots, \beta-1$, the variances of
which are bounded by
$$
\mathrm{Var}_f\big[\hat\theta_m \,\big|\, \mathcal{X}\big] \le \frac{C_v}{N} \quad \text{where} \quad C_v = (\sigma\gamma_0\beta)^2.
$$
Proof. (i) As the solution of the normal equations (9.2), $\hat\theta = (G^\top G)^{-1} G^\top y$.
All the elements of the matrix $G$ are of the form $\big((x_i - x)/h_n\big)^m$, and thus
are bounded by one. Therefore, using Assumption 9.2, we conclude that
the entries of the $\beta \times N$ matrix $(G^\top G)^{-1} G^\top$ are bounded by $\gamma_0\beta/N$. Also,
from (8.15), the absolute values of the entries of the vector $y$ are bounded
by $L h_n^\beta/(\beta-1)!$ since they are the remainder terms. After we compute the
dot product, $N$ cancels, and we obtain the answer.
(ii) The element $\hat\theta_m$ is the dot product of the $m$-th row of the matrix
$(G^\top G)^{-1} G^\top$ and the random vector $(\varepsilon_1, \dots, \varepsilon_N)^\top$. Therefore, $\hat\theta_m$ is the
sum of independent $\mathcal{N}(0, \sigma^2)$ random variables with the weights that do
not exceed $\gamma_0\beta/N$. This sum has mean zero and the variance bounded by
$N\sigma^2(\gamma_0\beta/N)^2 = (\sigma\gamma_0\beta)^2/N$.
Combining the results of Lemmas 8.5, 9.1, and 9.3, we arrive at the
following conclusion.
Proposition 9.4. Suppose Assumption 9.2 holds. Then the estimate $\hat\theta_m$,
which is the m-th element of the solution of the system of normal equations
(9.2), admits the expansion
$$
\hat\theta_m = \frac{f^{(m)}(x)}{m!}\, h_n^m + b_m + N_m, \qquad m = 0, \dots, \beta-1,
$$
where the deterministic term $b_m$ is the conditional bias satisfying
$$
|b_m| \le C_b\, h_n^\beta,
$$
and the stochastic term $N_m$ has a normal distribution with mean zero and
variance bounded by
$$
\mathrm{Var}_f\big[N_m \,\big|\, \mathcal{X}\big] \le C_v/N.
$$
Finally, we are ready to introduce the local polynomial estimator $\hat f_n(t)$,
which is defined for all t such that $x - h_n \le t \le x + h_n$ by
$$
(9.4) \qquad \hat f_n(t) = \hat\theta_0 + \hat\theta_1\,\frac{t - x}{h_n} + \dots + \hat\theta_{\beta-1}\Big(\frac{t - x}{h_n}\Big)^{\beta-1}
$$
where the least-squares estimators $\hat\theta_0, \dots, \hat\theta_{\beta-1}$ are as described in
Proposition 9.4.
The local polynomial estimator (9.4) corresponding to the bandwidth
$h_n^* = n^{-1/(2\beta+1)}$ will be denoted by $f_n^*(t)$. Recall from Section 8.4 that $h_n^*$
is called the optimal bandwidth, and it solves the equation $(h_n^*)^{2\beta} = (nh_n^*)^{-1}$.
The formula (9.4) is significantly simplified if t = x. In this case the
local polynomial estimator is just the estimate of the intercept, fˆn (x) = θ̂0 .
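In code, the local polynomial estimator amounts to a windowed least-squares fit: build the design matrix $G$ with entries $((x_i - x)/h_n)^m$ over the window $[x - h_n, x + h_n]$ and read off $\hat f_n(x) = \hat\theta_0$. The simulation setup below (test function, noise level, sample size) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def local_poly_estimate(x, xs, ys, h, beta):
    # keep the design points falling in the window [x - h, x + h]
    mask = np.abs(xs - x) <= h
    u = (xs[mask] - x) / h
    # design matrix G with entries ((x_i - x)/h)^m, m = 0, ..., beta - 1
    G = np.vander(u, beta, increasing=True)
    # least-squares solution of the normal equations G'G theta = G'y
    theta, *_ = np.linalg.lstsq(G, ys[mask], rcond=None)
    return theta[0]                  # f_hat(x) = theta_0

n, beta = 4000, 2
xs = rng.uniform(0, 1, n)
f = lambda t: t ** 2 + np.cos(3 * t)               # assumed regression function
ys = f(xs) + rng.normal(0, 0.1, n)

h = n ** (-1 / (2 * beta + 1))                      # optimal bandwidth
err = abs(local_poly_estimate(0.5, xs, ys, h, beta) - f(0.5))
```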
Up to this point there was no connection between the number of the
design points N in the hn -neighborhood of x and the bandwidth hn . Such
a connection is necessary if we want to balance the bias and the variance
terms in Proposition 9.4.
Assumption 9.5. There exists a positive constant γ1 , independent of n,
such that for all large enough n the inequality N ≥ γ1 nhn holds.
9.2. Polynomial Approximation and Regularity of Design 119
Now we will prove the result on the conditional quadratic risk at a point
of the local polynomial estimator.
Theorem 9.6. Suppose Assumptions 9.2 and 9.5 hold with $h_n = h_n^* =
n^{-1/(2\beta+1)}$. Consider the local polynomial estimator $f_n^*(x)$ corresponding to
$h_n^*$. Then for a given design $\mathcal{X}$, the conditional quadratic risk of $f_n^*(x)$ at
the point $x \in (0, 1)$ admits the upper bound
$$
\sup_{f \in \Theta(\beta)} \mathbb{E}_f\Big[\big(f_n^*(x) - f(x)\big)^2 \,\Big|\, \mathcal{X}\Big] \le r^*\, n^{-2\beta/(2\beta+1)}.
$$
Proof. By Proposition 9.4, for any $f \in \Theta(\beta)$, the conditional quadratic risk
of the local polynomial estimator $f_n^*$ is equal to
$$
\mathbb{E}_f\big[(f_n^*(x) - f(x))^2 \,\big|\, \mathcal{X}\big] = \mathbb{E}_f\big[(\hat\theta_0 - f(x))^2 \,\big|\, \mathcal{X}\big]
= \mathbb{E}_f\big[(f(x) + b_0 + N_0 - f(x))^2 \,\big|\, \mathcal{X}\big]
$$
$$
= b_0^2 + \mathbb{E}_f\big[N_0^2 \,\big|\, \mathcal{X}\big]
= b_0^2 + \mathrm{Var}_f\big[N_0 \,\big|\, \mathcal{X}\big] \le C_b^2\, (h_n^*)^{2\beta} + C_v/N.
$$
Applying Assumption 9.5 and the fact that $h_n^*$ satisfies the identity $(h_n^*)^{2\beta} =
(n h_n^*)^{-1} = n^{-2\beta/(2\beta+1)}$, we obtain that
$$
\mathbb{E}_f\big[(f_n^*(x) - f(x))^2 \,\big|\, \mathcal{X}\big] \le C_b^2\, (h_n^*)^{2\beta} + \frac{C_v}{\gamma_1 n h_n^*} = r^*\, n^{-2\beta/(2\beta+1)}
$$
with $r^* = C_b^2 + C_v/\gamma_1$.
Remark 9.7. Proposition 9.4 also opens a way to estimate the derivatives
$f^{(m)}(t)$ of the regression function $f$. The estimator is especially elegant if
$t = x$,
$$
(9.5) \qquad \hat f_n^{(m)}(x) = \frac{m!\, \hat\theta_m}{h_n^m}, \qquad m = 1, \dots, \beta-1.
$$
Now, we are in the position to prove the main result for the quadratic
risk under the random uniform design.
Theorem 9.15. Take the optimal bandwidth $h_n^* = n^{-1/(2\beta+1)}$. Let the
design $\mathcal{X}$ be random and uniform on $[0, 1]$. Then the quadratic risk of the local
polynomial estimator $f_n^*(x)$ at $x$ defined by (9.7) satisfies the upper bound
$$
\sup_{f \in \Theta(\beta)} \mathbb{E}_f\big[(f_n^*(x) - f(x))^2\big] \le r^{**}\, n^{-2\beta/(2\beta+1)}.
$$
Proof. Note that in the statement of Theorem 9.6, the constant r∗ de-
pends on the design X only through the constants γ0 and γ1 that appear
in Assumptions 9.2 and 9.5. Thus, if the assumptions hold, then r∗ is non-
random, and averaging over the distribution of the design points does not
affect the upper bound. Hence,
$$
\mathbb{E}_f\Big[\big(f_n^*(x) - f(x)\big)^2\, \mathbb{I}\big(A \cap B\big)\Big] \le r^*\, n^{-2\beta/(2\beta+1)}.
$$
Applying this inequality and Lemmas 9.12 and 9.13, we have that for all
sufficiently large n and for any $f \in \Theta(\beta, L, L_1)$,
$$
\mathbb{E}_f\big[(f_n^*(x) - f(x))^2\big] \le \mathbb{E}_f\big[(f_n^*(x) - f(x))^2\, \mathbb{I}(A \cap B)\big]
+ \mathbb{E}_f\big[(f_n^*(x) - f(x))^2\, \mathbb{I}(\bar A)\big] + \mathbb{E}_f\big[(f_n^*(x) - f(x))^2\, \mathbb{I}(\bar B)\big]
$$
$$
\le r^*\, n^{-2\beta/(2\beta+1)} + L_1^2\big(\mathbb{P}_f(\bar A) + \mathbb{P}_f(\bar B)\big)
\le \big(r^* + 2L_1^2\alpha^{-2} + C L_1^2\big)\, n^{-2\beta/(2\beta+1)}.
$$
Finally, we choose $r^{**} = r^* + 2L_1^2\alpha^{-2} + CL_1^2$, and the result follows.
Clearly, the inequality (9.8) does not hold for any design X . For example,
if all the design points are concentrated at one point x1 = · · · = xn = x, then
our observations (xi , yi ) are actually observations in the parametric model
yi = f (x) + εi , i = 1, . . . , n,
with a real-valued parameter $\theta = f(x)$. This parameter can be estimated
$1/\sqrt{n}$-consistently by the simple averaging of the response values $y_i$. On the
other hand, if the design points x1 , . . . , xn are regular, then the lower bound
(9.8) turns out to be true.
Proof. To prove the lower bound in (9.8), we use the same trick as in
the parametric case (refer to the proof of Lemma 3.4). We substitute the
supremum over Θ(β) by the Bayes prior distribution concentrated at two
points. This time, however, the two points are represented by two regression
functions, called the test functions,
$$
f_0 = f_0(x) = 0 \quad \text{and} \quad f_1 = f_1(x) \ne 0, \qquad f_1 \in \Theta(\beta), \ x \in [0, 1].
$$
Note that for any estimator $\hat f_n = \hat f_n(x)$, the supremum exceeds the mean
value,
$$
\sup_{f \in \Theta(\beta)} \mathbb{E}_f\big[(\hat f_n(x) - f(x))^2\big]
$$
$$
(9.9) \qquad \ge \frac{1}{2}\,\mathbb{E}_{f_0}\big[\hat f_n^2(x)\big] + \frac{1}{2}\,\mathbb{E}_{f_1}\big[(\hat f_n(x) - f_1(x))^2\big].
$$
The expected values $\mathbb{E}_{f_0}$ and $\mathbb{E}_{f_1}$ denote the integration with respect to
the distribution of $y_i$, given the corresponding regression function. Under the
hypothesis $f = f_0 = 0$, the response $y_i = \varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, while under the
alternative $f = f_1$, $y_i \sim \mathcal{N}\big(f_1(x_i), \sigma^2\big)$. Changing the probability measure
of integration, we can write the expectation $\mathbb{E}_{f_1}$ in terms of $\mathbb{E}_{f_0}$,
$$
\mathbb{E}_{f_1}\big[(\hat f_n(x) - f_1(x))^2\big]
= \mathbb{E}_{f_0}\Big[\big(\hat f_n(x) - f_1(x)\big)^2 \prod_{i=1}^n \frac{\exp\{-(y_i - f_1(x_i))^2/(2\sigma^2)\}}{\exp\{-y_i^2/(2\sigma^2)\}}\Big]
$$
$$
(9.10) \qquad = \mathbb{E}_{f_0}\Big[\big(\hat f_n(x) - f_1(x)\big)^2 \exp\Big\{\sum_{i=1}^n \frac{y_i f_1(x_i)}{\sigma^2} - \sum_{i=1}^n \frac{f_1^2(x_i)}{2\sigma^2}\Big\}\Big].
$$
[Figure: the bump function $\varphi(u)$, $-1 \le u \le 1$, with peak value $\varphi(0)$; and the test function $f_1(t)$, supported on $[x - h_n^*,\, x + h_n^*]$, with peak value $(h_n^*)^\beta \varphi(0)$.]
$$
\ge \frac{1}{2}\,\mathbb{E}_{f_0}\Big[\hat f_n^2(x) + \big(\hat f_n(x) - f_1(x)\big)^2 \exp\Big\{\sum_{i=1}^n \frac{y_i f_1(x_i)}{\sigma^2} - \sum_{i=1}^n \frac{f_1^2(x_i)}{2\sigma^2}\Big\}\Big]
$$
$$
\ge \frac{1}{2}\,\mathbb{E}_{f_0}\Big[\Big(\hat f_n^2(x) + \big(\hat f_n(x) - f_1(x)\big)^2\Big)\, \mathbb{I}(E)\Big].
$$
Here we bound the exponent from below by one, which is true under the
event $E$. Next, by the elementary inequality $a^2 + (a - b)^2 \ge b^2/2$ with
$a = \hat f_n(x)$ and $b = f_1(x) = (h_n^*)^\beta \varphi(0)$, we get the following bound:
$$
(9.11) \qquad \sup_{f \in \Theta(\beta)} \mathbb{E}_f\big[(\hat f_n(x) - f(x))^2\big] \ge \frac{1}{4}\,(h_n^*)^{2\beta} \varphi^2(0)\, \mathbb{P}_{f_0}(E).
$$
What is left to show is that the probability $\mathbb{P}_{f_0}(E)$ is separated away from
zero,
$$
(9.12) \qquad \mathbb{P}_{f_0}(E) \ge p_0
$$
where $p_0$ is a positive constant independent of n. In this case, (9.8) holds,
$$
(9.13) \qquad \sup_{f \in \Theta(\beta)} \mathbb{E}_f\big[(\hat f_n(x) - f(x))^2\big] \ge \frac{1}{4}\,(h_n^*)^{2\beta} \varphi^2(0)\, p_0 = r_*\, n^{-2\beta/(2\beta+1)}
$$
with $r_* = (1/4)\, \varphi^2(0)\, p_0$. To verify (9.12), note that under the hypothesis
$f = f_0 = 0$, the random variable
$$
Z = \Big(\sigma^2 \sum_{i=1}^n f_1^2(x_i)\Big)^{-1/2} \sum_{i=1}^n y_i\, f_1(x_i)
$$
has the standard normal distribution. Thus,
$$
\lim_{n\to\infty} \mathbb{P}_{f_0}(E) = \lim_{n\to\infty} \mathbb{P}_{f_0}\Big(\sum_{i=1}^n y_i f_1(x_i) \ge \frac{1}{2}\sum_{i=1}^n f_1^2(x_i)\Big)
$$
$$
= \lim_{n\to\infty} \mathbb{P}_{f_0}\Big(Z \ge \frac{1}{2\sigma}\Big(\sum_{i=1}^n f_1^2(x_i)\Big)^{1/2}\Big)
= 1 - \lim_{n\to\infty} \Phi\Big(\frac{1}{2\sigma}\Big(\sum_{i=1}^n f_1^2(x_i)\Big)^{1/2}\Big).
$$
Finally, we will show that
$$
(9.14) \qquad \lim_{n\to\infty} \sum_{i=1}^n f_1^2(x_i) = p(x)\, \|\varphi\|_2^2 = p(x) \int_{-1}^{1} \varphi^2(u)\, du > 0.
$$
Indeed, recall that the optimal bandwidth $h_n^*$ satisfies the identity $(h_n^*)^{2\beta} =
1/(n h_n^*)$. Using this fact and the assertion of part (iii) of Lemma 9.8, we
have that
$$
\sum_{i=1}^n f_1^2(x_i) = (h_n^*)^{2\beta} \sum_{i=1}^n \varphi^2\Big(\frac{x_i - x}{h_n^*}\Big)
$$
$$
(9.15) \qquad = \frac{1}{n h_n^*}\sum_{i=1}^n \varphi^2\Big(\frac{x_i - x}{h_n^*}\Big) \to p(x) \int_{-1}^{1} \varphi^2(u)\, du \quad \text{as } n \to \infty.
$$
Hence (9.14) is true, and the probability $\mathbb{P}_{f_0}(E)$ has a strictly positive limit,
$$
\lim_{n\to\infty} \mathbb{P}_{f_0}(E) = 1 - \Phi\Big(\frac{1}{2\sigma}\big(p(x)\,\|\varphi\|_2^2\big)^{1/2}\Big) > 0.
$$
This completes the proof of the theorem.
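The convergence (9.15) is easy to verify numerically. The sketch below takes the equidistant design $x_i = i/n$ (design density $p \equiv 1$) and a hypothetical bump $\varphi(u) = (1 - u^2)^2$ on $[-1, 1]$, which need not be the $\varphi$ used in the proof:

```python
def phi(u):
    # hypothetical bump function supported on [-1, 1]
    return (1 - u * u) ** 2 if abs(u) <= 1 else 0.0

beta, n = 2, 10 ** 5
h = n ** (-1 / (2 * beta + 1))      # optimal bandwidth: (h^beta)^2 = 1/(n h)
x = 0.5

# sum_i f1(x_i)^2 with f1(t) = h^beta * phi((t - x)/h); by the balance
# equation this equals (1/(n h)) * sum_i phi((x_i - x)/h)^2, a Riemann sum.
s = sum(phi((i / n - x) / h) ** 2 for i in range(1, n + 1)) / (n * h)

# limit p(x) * int_{-1}^{1} phi(u)^2 du with p = 1, phi^2(u) = (1 - u^2)^4
limit = 2 * (1 - 4 / 3 + 6 / 5 - 4 / 7 + 1 / 9)
gap = abs(s - limit)
```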
$$
\le \frac{p(x)}{1 - \alpha_n} \sum_{i=1}^{N} \varphi_0\Big(\frac{x_i - x}{h_n}\Big)\, \Delta u_i,
$$
and the desired convergence follows,
$$
\frac{1}{n h_n}\sum_{i=1}^{N} \varphi_0\Big(\frac{x_i - x}{h_n}\Big) \to p(x) \int_{-1}^{1} \varphi_0(u)\, du.
$$
$$
\frac{1}{2 + \tilde\beta_n}\sum_{i=1}^{N} u_i^{l+m}\, \Delta u_i
\;\le\; \frac{1}{N}\big(G^\top G\big)_{l,m}
\;\le\; \frac{1}{2 + \beta_n}\sum_{i=1}^{N} u_i^{l+m}\, \Delta u_i.
$$
Both bounds converge as $n \to \infty$ to the integral in the definition (9.6)
of $D_\infty^{-1}$. The proof that this matrix is invertible is left as an exercise (see
Exercise 9.66).
Before we turn to Lemmas 9.12 and 9.13, we prove the following result.
Let g(u) be a continuous function such that |g(u)| ≤ 1 for all u ∈ [−1, 1].
Let x1 , . . . , xn be independent random variables with a common uniform
distribution on [0, 1]. Introduce the independent random variables ηi , i =
1, . . . , n, by
$$
(9.19) \qquad \eta_i = \begin{cases} g\big((x_i - x)/h_n^*\big), & \text{if } x_i \in [x - h_n^*,\, x + h_n^*],\\ 0, & \text{otherwise.} \end{cases}
$$
Denote by $\mu_n$ the expected value of $\eta_i$,
$$
\mu_n = \mathbb{E}[\eta_i] = \int_{x-h_n^*}^{x+h_n^*} g\Big(\frac{t - x}{h_n^*}\Big)\, dt = h_n^* \int_{-1}^{1} g(u)\, du.
$$
The inequality (9.20) provides the upper bound $2\delta^{-2} n^{-2\beta/(2\beta+1)}$ for the
probability of each event in the union $\bar C$. This proves (9.21).
Next, recall that we denoted by $C_*$ a constant that exceeds the absolute
values of all elements of the matrix $D_\infty$. Due to the continuity of a matrix
inversion, for any $\varepsilon \le C_*$, there exists a number $\delta = \delta(\varepsilon)$ such that
$$
C = \bigcap_{l,m=0}^{\beta-1}\Big\{\Big|\frac{1}{2nh_n^*}\big(G^\top G\big)_{l,m} - \big(D_\infty^{-1}\big)_{l,m}\Big| \le \delta(\varepsilon)\Big\}
$$
$$
\subseteq \bigcap_{l,m=0}^{\beta-1}\Big\{\Big|(2nh_n^*)\big(G^\top G\big)^{-1}_{l,m} - \big(D_\infty\big)_{l,m}\Big| \le \varepsilon\Big\}
$$
$$
\subseteq \Big\{\big|\big(G^\top G\big)^{-1}_{l,m}\big| \le \frac{2C_*}{nh_n^*} \ \text{ for all } l, m = 0, \dots, \beta-1\Big\} = B.
$$
The latter inclusion follows from the fact that if $\big|(G^\top G)^{-1}_{l,m}\big| \le (C_* + \varepsilon)/(2nh_n^*)$
and $\varepsilon \le C_*$, it implies that $\big|(G^\top G)^{-1}_{l,m}\big| \le C_*/(nh_n^*) \le 2C_*/(nh_n^*)$. Thus,
from (9.21), we obtain $\mathbb{P}_f(\bar B) \le \mathbb{P}_f(\bar C) \le C n^{-2\beta/(2\beta+1)}$ with $C = 2\beta^2\delta^{-2}$.
Exercises
Exercise 9.65. Prove an analogue of Theorem 9.6 for the derivative esti-
mator (9.5),
$$
\sup_{f \in \Theta(\beta)} \mathbb{E}_f\Big[\Big(\frac{m!\, \hat\theta_m}{(h_n^*)^m} - f^{(m)}(x)\Big)^2 \,\Big|\, \mathcal{X}\Big] \le r^*\, n^{-2(\beta-m)/(2\beta+1)}.
$$
Chapter 10. Estimation of Regression in Global Norms
10.1. Regressogram
In Chapters 8 and 9, we gave a detailed analysis of the kernel and local
polynomial estimators at a fixed point x inside the interval (0, 1). The
asymptotic minimax rate of convergence was found to be $\psi_n = n^{-\beta/(2\beta+1)}$,
which strongly depends on the smoothness parameter $\beta$ of the regression
function.
What if our objective is different? What if we want to estimate the
regression function f (x) as a curve in the interval [0, 1]? The global norms
serve this purpose. In this chapter, we discuss the regression estimation
problems with regard to the continuous and discrete L2 -norms, and sup-
norm.
In the current section, we introduce an estimator fˆn , called a regresso-
gram. A formal definition will be given at the end of the section.
When it comes to the regression estimation in the interval [0, 1], we can
extend a smoothing kernel estimator (8.16) to be defined in the entire unit
interval. However, the estimation at the endpoints x = 0 and x = 1 would
cause difficulties. It is more convenient to introduce an estimator defined
everywhere in [0, 1] based on the local polynomial estimator (9.4).
Consider a partition of the interval [0, 1] into small subintervals of the
equal length $2h_n$. To ease the presentation assume that $Q = 1/(2h_n)$ is
an integer. When the following assumption holds, the systems of normal
equations (10.2) have unique solutions for all q = 1, . . . , Q.
Assumption 10.1. There exist positive constants γ0 and γ1 , independent
of n and q, such that for all q = 1, . . . , Q,
(i) the absolute values of the elements of the matrix $(G_q^\top G_q)^{-1}$ are bounded
from above by $\gamma_0/N_q$,
(ii) the number of observations Nq in the q-th bin is bounded from below,
Nq ≥ γ1 nhn .
Now we are ready to define the piecewise polynomial estimator fˆn (x)
in the entire interval [0, 1]. This estimator is called a regressogram, and is
computed according to the formula
$$
(10.3) \qquad \hat f_n(x) = \hat\theta_{0,q} + \hat\theta_{1,q}\,\frac{x - c_q}{h_n} + \dots + \hat\theta_{\beta-1,q}\Big(\frac{x - c_q}{h_n}\Big)^{\beta-1} \quad \text{if } x \in B_q,
$$
where the estimates $\hat\theta_{0,q}, \dots, \hat\theta_{\beta-1,q}$ satisfy the normal equations (10.2),
q = 1, . . . , Q.
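A minimal regressogram sketch (the number of bins, test function, and noise level are illustrative choices): a polynomial of degree $\beta - 1$ is fitted by least squares separately in each of the $Q$ bins and evaluated piecewise:

```python
import numpy as np

rng = np.random.default_rng(2)

def regressogram(x_eval, xs, ys, h, beta):
    # Q = 1/(2h) bins of width 2h; fit a polynomial of degree beta - 1
    # by least squares separately in each bin
    Q = int(round(1 / (2 * h)))
    out = np.empty_like(x_eval)
    for q in range(Q):
        lo, hi = 2 * q * h, 2 * (q + 1) * h
        c = lo + h                                   # bin center c_q
        in_bin = (xs >= lo) & (xs < hi)
        G = np.vander((xs[in_bin] - c) / h, beta, increasing=True)
        theta, *_ = np.linalg.lstsq(G, ys[in_bin], rcond=None)
        ev = (x_eval >= lo) & (x_eval < hi)
        out[ev] = np.vander((x_eval[ev] - c) / h, beta, increasing=True) @ theta
    return out

n, beta, Q = 5000, 2, 16
h = 1 / (2 * Q)                                      # so that Q = 1/(2h) is an integer
xs = rng.uniform(0, 1, n)
f = lambda t: np.sin(2 * np.pi * t)
ys = f(xs) + rng.normal(0, 0.1, n)

grid = np.linspace(0.01, 0.99, 99)
mise = float(np.mean((regressogram(grid, xs, ys, h, beta) - f(grid)) ** 2))
```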
The next theorem answers the question about the integral L2 -norm risk
for the regressogram.
Theorem 10.3. Let a design $\mathcal{X}$ be such that Assumption 10.1 holds with
the bandwidth $h_n^* = n^{-1/(2\beta+1)}$. Then the mean integrated quadratic risk of
the regressogram $\hat f_n$ satisfies the upper bound
$$
\sup_{f \in \Theta(\beta)} \mathbb{E}_f\Big[\int_0^1 \big(\hat f_n(x) - f(x)\big)^2\, dx \,\Big|\, \mathcal{X}\Big] \le r^*\, n^{-2\beta/(2\beta+1)}.
$$
Proof. From Lemma 8.5, for any $f \in \Theta(\beta, L, L_1)$, and for any bin $B_q$
centered at $c_q$, the Taylor expansion is valid
$$
f(x) = f(c_q) + f^{(1)}(c_q)(x - c_q) + \dots + \frac{f^{(\beta-1)}(c_q)}{(\beta-1)!}\,(x - c_q)^{\beta-1} + \rho(x, c_q)
$$
$$
= \sum_{m=0}^{\beta-1} \frac{f^{(m)}(c_q)}{m!}\,(h_n^*)^m \Big(\frac{x - c_q}{h_n^*}\Big)^m + \rho(x, c_q)
$$
where the remainder term $\rho(x, c_q)$ satisfies the inequality $|\rho(x, c_q)| \le C_\rho (h_n^*)^\beta$
with $C_\rho = L/(\beta-1)!$.
Applying Proposition 10.2, the Taylor expansion of f , and the definition
of the regressogram (10.3), we get the expression for the quadratic risk
$$
\mathbb{E}_f\Big[\int_0^1 \big(\hat f_n(x) - f(x)\big)^2 dx \,\Big|\, \mathcal{X}\Big]
= \sum_{q=1}^{Q} \mathbb{E}_f\Big[\int_{B_q} \big(\hat f_n(x) - f(x)\big)^2 dx \,\Big|\, \mathcal{X}\Big]
$$
$$
= \sum_{q=1}^{Q} \mathbb{E}_f\Big[\int_{B_q} \Big(\sum_{m=0}^{\beta-1} \big(b_{m,q} + N_{m,q}\big)\Big(\frac{x - c_q}{h_n^*}\Big)^m - \rho(x, c_q)\Big)^2 dx \,\Big|\, \mathcal{X}\Big].
$$
Using the fact that the random variables $N_{m,q}$ have mean zero, we can
write the latter expectation of the integral over $B_q$ as the sum of deterministic
and stochastic terms,
$$
\int_{B_q} \Big(\sum_{m=0}^{\beta-1} b_{m,q}\Big(\frac{x - c_q}{h_n^*}\Big)^m - \rho(x, c_q)\Big)^2 dx
+ \int_{B_q} \mathbb{E}_f\Big[\Big(\sum_{m=0}^{\beta-1} N_{m,q}\Big(\frac{x - c_q}{h_n^*}\Big)^m\Big)^2 \,\Big|\, \mathcal{X}\Big]\, dx.
$$
From the bounds for the bias and the remainder term, the first integrand
can be estimated from above by a constant,
$$
\Big(\sum_{m=0}^{\beta-1} b_{m,q}\Big(\frac{x - c_q}{h_n^*}\Big)^m - \rho(x, c_q)\Big)^2
\le \Big(\sum_{m=0}^{\beta-1} |b_{m,q}| + |\rho(x, c_q)|\Big)^2
$$
$$
\le \big(\beta C_b (h_n^*)^\beta + C_\rho (h_n^*)^\beta\big)^2 = C_D\, (h_n^*)^{2\beta} = C_D\, n^{-2\beta/(2\beta+1)}
$$
where $C_D = (\beta C_b + C_\rho)^2$.
10.2. Integral L2 -Norm Risk for the Regressogram 135
Note that the random variables $N_{m,q}$ may be correlated for a fixed q
and different m's. Using a special case of the Cauchy-Schwarz inequality
$(a_0 + \dots + a_{\beta-1})^2 \le \beta\,(a_0^2 + \dots + a_{\beta-1}^2)$, Proposition 10.2, and Assumption
10.1, we bound the second integrand from above by
$$
\mathbb{E}_f\Big[\Big(\sum_{m=0}^{\beta-1} N_{m,q}\Big(\frac{x - c_q}{h_n^*}\Big)^m\Big)^2 \,\Big|\, \mathcal{X}\Big]
\le \beta \sum_{m=0}^{\beta-1} \mathrm{Var}_f\big[N_{m,q} \,\big|\, \mathcal{X}\big]
$$
$$
\le \frac{\beta^2 C_v}{N_q} \le \frac{\beta^2 C_v}{\gamma_1 n h_n^*} = \frac{C_S}{n h_n^*} = C_S\, n^{-2\beta/(2\beta+1)} \quad \text{where } C_S = \beta^2 C_v/\gamma_1.
$$
Thus, combining the deterministic and stochastic terms, we arrive at the
upper bound
$$
\mathbb{E}_f\Big[\int_0^1 \big(\hat f_n(x) - f(x)\big)^2 dx \,\Big|\, \mathcal{X}\Big]
\le \sum_{q=1}^{Q}\int_{B_q} \big(C_D + C_S\big)\, n^{-2\beta/(2\beta+1)}\, dx
= r^*\, n^{-2\beta/(2\beta+1)}
$$
with $r^* = C_D + C_S$.
Remark 10.4. Under Assumption 10.1, the results of Lemmas 9.8 and 9.9
stay valid uniformly over the bins $B_q$, q = 1, . . . , Q. Therefore, we can
extend the statement of Corollary 9.11 to the integral $L_2$-norm. For the
regular deterministic design, the unconditional quadratic risk in the integral
$L_2$-norm of the regressogram $\hat f_n$ with the bandwidth $h_n^* = n^{-1/(2\beta+1)}$ admits
the upper bound
$$
\sup_{f \in \Theta(\beta)} \mathbb{E}_f\Big[\int_0^1 \big(\hat f_n(x) - f(x)\big)^2\, dx\Big] \le r^*\, n^{-2\beta/(2\beta+1)}.
$$
Under the same choice of the bandwidth, $h_n^* = n^{-1/(2\beta+1)}$, this estimator
admits the upper bound similar to the one in Theorem 10.3 with the rate
$n^{-(\beta-m)/(2\beta+1)}$, that is,
$$
(10.4) \qquad \sup_{f \in \Theta(\beta)} \mathbb{E}_f\Big[\int_0^1 \Big(\frac{d^m \hat f_n(x)}{dx^m} - \frac{d^m f(x)}{dx^m}\Big)^2 dx \,\Big|\, \mathcal{X}\Big] \le r^*\, n^{-2(\beta-m)/(2\beta+1)}.
$$
For the proof see Exercise 10.69.
Our starting point is Proposition 10.2. It is a very powerful result that allows
us to control the risk under any loss function. We use this proposition to
prove the following theorem.
Theorem 10.6. Let a design $\mathcal{X}$ be such that Assumption 10.1 holds with
the bandwidth $h_n = \big((\ln n)/n\big)^{1/(2\beta+1)}$. Let $\hat f_n$ be the regressogram that
corresponds to this bandwidth. Then the conditional sup-norm risk (10.5)
admits the upper bound
$$
(10.6) \qquad \sup_{f \in \Theta(\beta)} \mathbb{E}_f\big[\|\hat f_n - f\|_\infty \,\big|\, \mathcal{X}\big] \le r^* \Big(\frac{\ln n}{n}\Big)^{\beta/(2\beta+1)}
$$
where r∗ is a positive constant independent of n and f .
Proof. Applying Lemma 8.5, we can write the sup-norm of the difference
$\hat f_n - f$ as
$$
\|\hat f_n - f\|_\infty = \max_{1\le q\le Q}\, \sup_{x \in B_q} \Big|\sum_{m=0}^{\beta-1} \hat\theta_{m,q}\Big(\frac{x - c_q}{h_n}\Big)^m
$$
$$
(10.7) \qquad - \sum_{m=0}^{\beta-1} \frac{f^{(m)}(c_q)\, h_n^m}{m!}\Big(\frac{x - c_q}{h_n}\Big)^m - \rho(x, c_q)\Big|
$$
where $Q = 1/(2h_n)$ is the number of bins, and the q-th bin is the interval
$B_q = [c_q - h_n,\, c_q + h_n]$ centered at $x = c_q$, q = 1, . . . , Q. The remainder
term $\rho(x, c_q)$ satisfies the inequality $|\rho(x, c_q)| \le C_\rho h_n^\beta$ with the constant
$C_\rho = L/(\beta-1)!$. Applying the formula for $\hat\theta_{m,q}$ from Proposition 10.2 and
the fact that $|x - c_q|/h_n \le 1$, we obtain that
$$
\|\hat f_n - f\|_\infty \le \max_{1\le q\le Q}\, \sup_{x \in B_q} \Big|\sum_{m=0}^{\beta-1} \big(b_{m,q} + N_{m,q}\big)\Big(\frac{x - c_q}{h_n}\Big)^m\Big| + C_\rho h_n^\beta
$$
$$
(10.8) \qquad \le \beta C_b h_n^\beta + C_\rho h_n^\beta + \max_{1\le q\le Q} \sum_{m=0}^{\beta-1} |N_{m,q}|.
$$
Introduce the standard normal random variables
$$
Z_{m,q} = \big(\mathrm{Var}_f[N_{m,q} \,|\, \mathcal{X}]\big)^{-1/2}\, N_{m,q}.
$$
10.3. Estimation in the Sup-Norm 137
From the upper bound on the variance in Proposition 10.2, we find that
$$
(10.9) \qquad \max_{1\le q\le Q} \sum_{m=0}^{\beta-1} |N_{m,q}|
\le \max_{1\le q\le Q} \sum_{m=0}^{\beta-1} \sqrt{\frac{C_v}{N_q}}\, |Z_{m,q}|
\le \sqrt{\frac{C_v}{\gamma_1 n h_n}}\; Z^*
$$
where
$$
Z^* = \max_{1\le q\le Q} \sum_{m=0}^{\beta-1} |Z_{m,q}|.
$$
Note that the random variables Zm,q are independent for different bins,
but may be correlated for different values of m within the same bin.
Putting together (10.8) and (10.9), we get the upper bound for the sup-norm
loss,
$$
(10.10) \qquad \|\hat f_n - f\|_\infty \le \big(\beta C_b + C_\rho\big) h_n^\beta + \sqrt{\frac{C_v}{\gamma_1 n h_n}}\; Z^*.
$$
Taking the expectation, applying the bound $\mathbb{E}_f[Z^* \,|\, \mathcal{X}] \le C_z \sqrt{\ln n}$ of
inequality (10.11), and using the balance equation $h_n^\beta = \sqrt{(\ln n)/(n h_n)}$,
we obtain
$$
\mathbb{E}_f\big[\|\hat f_n - f\|_\infty \,\big|\, \mathcal{X}\big] \le \big(\beta C_b + C_\rho + C_z \sqrt{C_v/\gamma_1}\,\big)\, h_n^\beta = r^*\, h_n^\beta.
$$
Remark 10.7. As Theorem 10.6 shows, the upper bound of the risk in
the sup-norm contains an extra log-factor as compared to the case of the
$L_2$-norm. The source of this additional factor becomes clear from (10.11).
Indeed, the maximum of the random noise has the magnitude $O(\sqrt{\ln n}\,)$ as
$n \to \infty$. That is why the optimum choice of the bandwidth comes from the
balance equation $h_n^\beta = \sqrt{(n h_n)^{-1} \ln n}$.
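The $\sqrt{\ln n}$ magnitude of the maximal noise term is easy to see in simulation; here $Q$ plays the role of the number of bins (an illustrative sketch, not a statement from the text):

```python
import math, random

random.seed(3)

# maximum of Q independent |N(0,1)| variables is about sqrt(2 ln Q)
Q = 20000
m = max(abs(random.gauss(0.0, 1.0)) for _ in range(Q))
predicted = math.sqrt(2 * math.log(Q))
ratio = m / predicted      # close to 1 for large Q
```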
where the q-th bin $B_q = [\,2(q-1)h_n,\, 2qh_n)$, and $c_q = (2q-1)h_n$ is its
center. Here $Q = 1/(2h_n)$ is an integer that represents the number of bins.
The rates of convergence in the $L_2$-norm and sup-norm found in the previous
sections were partially based on the fact that the bias of the regressogram
has the magnitude $O(h_n^\beta)$ uniformly in $f \in \Theta(\beta)$ at any point $x \in [0, 1]$.
Indeed, from Proposition 10.2, we get
$$
\sup_{f \in \Theta(\beta)}\, \sup_{0\le x\le 1} \big|\mathbb{E}_f \hat f_n(x) - f(x)\big|
\le \sup_{1\le q\le Q}\, \sup_{x \in B_q} \Big|\sum_{m=0}^{\beta-1} b_{m,q}\Big(\frac{x - c_q}{h_n}\Big)^m\Big| \le C_b\, \beta\, h_n^\beta.
$$
In turn, this upper bound for the bias is an immediate consequence of
Taylor's approximation in Lemma 8.5.
In this section, we take a different approach. Before we proceed, we
need to introduce some notation. Define a set of $\beta Q$ piecewise monomial
functions,
$$
(10.12) \qquad \gamma_{m,q}(x) = \mathbb{I}(x \in B_q)\Big(\frac{x - c_q}{h_n}\Big)^m, \quad q = 1, \dots, Q, \ m = 0, \dots, \beta-1.
$$
The regressogram $\hat f_n(x)$ is a linear combination of these monomials,
$$
(10.13) \qquad \hat f_n(x) = \sum_{q=1}^{Q}\sum_{m=0}^{\beta-1} \hat\theta_{m,q}\, \gamma_{m,q}(x), \quad 0 \le x \le 1.
$$
From this definition, the matrix $\Gamma$ has the dimensions $n \times K$. The vector
$\hat\vartheta = (\hat\theta_1, \dots, \hat\theta_K)^\top$ of estimates in (10.15) satisfies the system of normal
equations
$$
(10.17) \qquad \Gamma^\top \Gamma\, \hat\vartheta = \Gamma^\top y
$$
where $y = (y_1, \dots, y_n)^\top$.
Depending on the design X , the normal equations (10.17) may have
a unique or multiple solutions. If this system has a unique solution, then
the estimate fˆn (x) can be restored at any point x by (10.14). But even
when (10.17) does not have a unique solution, we can still approximate the
regression function f (x) at the design points, relying on the geometry of the
problem.
In the n-dimensional space of observations $\mathbb{R}^n$, define a linear span-space
$\mathcal{S}$ generated by the columns $\gamma_k$ of matrix $\Gamma$. With a minor abuse of notation,
we also denote by $S$ the operator in $\mathbb{R}^n$ of the orthogonal projection on the
span-space $\mathcal{S}$. Introduce a vector consisting of the values of the regression
function at the design points,
$$
\mathbf{f} = \big(f(x_1), \dots, f(x_n)\big)^\top, \qquad \mathbf{f} \in \mathbb{R}^n,
$$
and a vector of estimates at these points,
$$
\hat{\mathbf{f}}_n = S y = \big(\hat f(x_1), \dots, \hat f(x_n)\big)^\top.
$$
Note that this projection is correctly defined regardless of whether (10.17)
has a unique solution or not.
$$
(10.19) \qquad \le \frac{2}{n}\,\|S\mathbf{f} - \mathbf{f}\|^2 + \frac{2}{n}\,\mathbb{E}_f\big[\|S\varepsilon\|^2 \,\big|\, \mathcal{X}\big]
$$
where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^n$. Here we used the inequality $(a+b)^2 \le
2(a^2 + b^2)$.
Denote by $\dim(\mathcal{S})$ the dimension of the span-space $\mathcal{S}$. Note that necessarily
$\dim(\mathcal{S}) \le K$. In many special cases, this inequality turns into equality,
$\dim(\mathcal{S}) = K$. For example, it is true for the regressogram under Assumption
10.1 (see Exercise 10.72).
Assumption 10.8. There exists $\delta_n$, $\delta_n \to 0$ as $n \to \infty$, such that for any
$f \in \Theta(\beta)$, the inequality is fulfilled
$$
\frac{1}{n}\,\|S\mathbf{f} - \mathbf{f}\|^2 \le \delta_n^2.
$$
Proposition 10.9. Let Assumption 10.8 hold. Then the following upper
bound on the discrete MISE holds:
$$
(10.20) \qquad \frac{1}{n}\sum_{i=1}^n \mathbb{E}_f\Big[\big(\hat f_n(x_i) - f(x_i)\big)^2 \,\Big|\, \mathcal{X}\Big] \le 2\delta_n^2 + \frac{2\sigma^2 \dim(\mathcal{S})}{n}.
$$
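The variance term in (10.20) rests on the identity $\mathbb{E}\|S\varepsilon\|^2 = \sigma^2 \dim(\mathcal{S})$ for an orthogonal projection $S$, which the following Monte Carlo sketch illustrates (the design matrix and dimensions are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

n, K, sigma = 400, 7, 0.5
Gamma = rng.normal(size=(n, K))          # generic design matrix of full rank K
Qmat, _ = np.linalg.qr(Gamma)            # orthonormal basis of the column span
reps = 2000
acc = 0.0
for _ in range(reps):
    eps = rng.normal(0, sigma, n)
    Se = Qmat @ (Qmat.T @ eps)           # projection S eps onto the span
    acc += float(Se @ Se)
avg = acc / reps                         # approximately sigma^2 * K
```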
and
$$
b_k = \int_0^1 f(x)\,\sqrt{2}\,\sin(2\pi k x)\, dx, \qquad k = 1, 2, \dots.
$$
The trigonometric basis is complete in the sense that if $\|f\|_2 < \infty$, and
$$
f_m(x) = a_0 + \sum_{k=1}^{m} a_k \sqrt{2}\cos(2\pi k x) + \sum_{k=1}^{m} b_k \sqrt{2}\sin(2\pi k x), \qquad 0 \le x \le 1,
$$
then
$$
\lim_{m\to\infty} \|f_m(\cdot) - f(\cdot)\|_2 = 0.
$$
Thus, a function f with a finite $L_2$-norm is equivalent to its Fourier series
$$
f(x) = a_0 + \sum_{k=1}^{\infty} a_k \sqrt{2}\cos(2\pi k x) + \sum_{k=1}^{\infty} b_k \sqrt{2}\sin(2\pi k x).
$$
Thus,
$$
\|f'\|_2^2 = (2\pi)^2 \sum_{k=1}^{\infty} k^2\big(a_k^2 + b_k^2\big) \le (2\pi)^2 L.
$$
10.5. Orthogonal Series Regression Estimator 143
Clearly, the values at the design points for each function in (10.25) represent
a vector in $\mathbb{R}^n$. Therefore, there cannot be more than n orthonormal
functions with respect to the discrete dot product. As shown in the lemma
below, the functions in (10.25) corresponding to $k = 1, \dots, n_0$, form an
orthonormal basis with respect to this dot product.
Lemma 10.14. Fix $n = 2n_0 + 1$ for some $n_0 \ge 1$. For i = 1, . . . , n, the
system of functions
$$
1,\ \sqrt{2}\sin\Big(\frac{2\pi i}{n}\Big),\ \sqrt{2}\cos\Big(\frac{2\pi i}{n}\Big),\ \dots,\ \sqrt{2}\sin\Big(\frac{2\pi n_0 i}{n}\Big),\ \sqrt{2}\cos\Big(\frac{2\pi n_0 i}{n}\Big)
$$
is orthonormal with respect to the discrete dot product.
Now fix $k \ne l$ such that $k, l \le n_0$. Note that then $k \pm l \ne 0 \pmod n$. Letting
$m = k \pm l$ in (10.28), and applying (10.26)-(10.28), we obtain that
$$
\sum_{i=1}^n \sin\Big(\frac{2\pi k i}{n}\Big)\sin\Big(\frac{2\pi l i}{n}\Big)
= \frac{1}{2}\sum_{i=1}^n \cos\Big(\frac{2\pi(k-l)i}{n}\Big) - \frac{1}{2}\sum_{i=1}^n \cos\Big(\frac{2\pi(k+l)i}{n}\Big) = 0
$$
and
$$
\sum_{i=1}^n \cos\Big(\frac{2\pi k i}{n}\Big)\cos\Big(\frac{2\pi l i}{n}\Big)
= \frac{1}{2}\sum_{i=1}^n \cos\Big(\frac{2\pi(k-l)i}{n}\Big) + \frac{1}{2}\sum_{i=1}^n \cos\Big(\frac{2\pi(k+l)i}{n}\Big) = 0,
$$
and
$$
\sum_{i=1}^n \cos^2\Big(\frac{2\pi k i}{n}\Big) = \frac{1}{2}\sum_{i=1}^n \Big(1 + \cos\Big(\frac{4\pi k i}{n}\Big)\Big) = n/2.
$$
$$
+\ \frac{1}{2}\sum_{i=1}^n \sin\Big(\frac{2\pi(k-l)i}{n}\Big) = 0.
$$
Thus far, we have worked with the functions of sine and cosine separately.
It is convenient to combine them in a single notation. Put $\varphi_0(i/n) = 1$,
and $c_0 = a_0$. For $m = 1, \dots, n_0$, take
$$
\varphi_{2m}(i/n) = \sqrt{2}\cos\Big(\frac{2\pi m i}{n}\Big), \qquad c_{2m} = a_m,
$$
and
$$
\varphi_{2m-1}(i/n) = \sqrt{2}\sin\Big(\frac{2\pi m i}{n}\Big), \qquad c_{2m-1} = b_m.
$$
Note that altogether we have n basis functions $\varphi_k$, $k = 0, \dots, n-1$. They
satisfy the orthonormality conditions
$$
(10.29) \qquad \big\langle \varphi_k(\cdot), \varphi_l(\cdot) \big\rangle_{2,n} = \frac{1}{n}\sum_{i=1}^n \varphi_k(i/n)\,\varphi_l(i/n) = 0 \ \text{ for } k \ne l,
$$
$$
\text{and} \qquad \big\|\varphi_k(\cdot)\big\|_{2,n}^2 = \frac{1}{n}\sum_{i=1}^n \varphi_k^2(i/n) = 1.
$$
The regression function f at the design points $x_i = i/n$ can be written as
$$
f(i/n) = \sum_{k=0}^{n-1} c_k\, \varphi_k(i/n).
$$
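The discrete orthonormality (10.29) can be verified directly for a small $n = 2n_0 + 1$; the indexing of the combined basis below follows the definitions above:

```python
import math

n0 = 5
n = 2 * n0 + 1

def phi(k, i):
    # combined basis: phi_0 = 1, phi_{2m} = sqrt(2) cos(2 pi m i/n),
    # phi_{2m-1} = sqrt(2) sin(2 pi m i/n)
    if k == 0:
        return 1.0
    m = (k + 1) // 2
    ang = 2 * math.pi * m * i / n
    return math.sqrt(2) * (math.cos(ang) if k % 2 == 0 else math.sin(ang))

def dot(k, l):
    # discrete dot product (1/n) sum_i phi_k(i/n) phi_l(i/n)
    return sum(phi(k, i) * phi(l, i) for i in range(1, n + 1)) / n

max_off = max(abs(dot(k, l)) for k in range(n) for l in range(n) if k != l)
max_diag_err = max(abs(dot(k, k) - 1.0) for k in range(n))
```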
Lemma 10.15. The discrete MISE of the estimator $\hat f_n$ in (10.30) can be
presented as
$$
R_n(\hat f_n, f) = \mathbb{E}_f\big\|\hat f_n(\cdot) - f(\cdot)\big\|_{2,n}^2 = \sum_{k=0}^{n-1} \mathbb{E}_f\big[(\hat c_k - c_k)^2\big].
$$
$$
= \sum_{k,l=0}^{n-1} \mathbb{E}_f\big[(\hat c_k - c_k)(\hat c_l - c_l)\big]\,\big\langle \varphi_k(\cdot), \varphi_l(\cdot)\big\rangle_{2,n}
= \sum_{k=0}^{n-1} \mathbb{E}_f\big[(\hat c_k - c_k)^2\big]
$$
where we used the fact that the basis functions are orthonormal.
$$
\xi_k = \Big(\frac{\sigma^2}{n}\,\frac{1}{n}\sum_{i=1}^n \varphi_k^2(i/n)\Big)^{-1/2}\, \frac{1}{n}\sum_{i=1}^n \varepsilon_i\, \varphi_k(i/n) \sim \mathcal{N}(0, 1).
$$
As a result,
$$
z_k = c_k + \sigma \xi_k/\sqrt{n}.
$$
It remains to show that the $\xi$'s are independent. Since they are normally
distributed, it suffices to show that they are uncorrelated. This in turn
follows from independence of the $\varepsilon$'s and orthogonality of the $\varphi$'s. Indeed,
we have that for any $k \ne l$ such that $k, l = 0, \dots, n-1$,
$$
\mathrm{Cov}(\xi_k, \xi_l) = \frac{1}{\sigma^2 n}\,\mathbb{E}\Big[\Big(\sum_{i=1}^n \varepsilon_i\,\varphi_k(i/n)\Big)\Big(\sum_{i=1}^n \varepsilon_i\,\varphi_l(i/n)\Big)\Big]
$$
$$
= \frac{1}{\sigma^2}\,\frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\varepsilon_i^2\big]\,\varphi_k(i/n)\,\varphi_l(i/n)
= \frac{1}{n}\sum_{i=1}^n \varphi_k(i/n)\,\varphi_l(i/n) = 0.
$$
Proof. Consider the orthogonal series estimator $\hat f_n(i/n)$ specified by (10.32)
with $M = n^{1/(2\beta+1)}$. Comparing this definition to a general form (10.30)
of an estimator given by a Fourier series, we see that in this instance, the
estimators of the Fourier coefficients $c_k$, $k = 0, \dots, n-1$, have the form
$$
\hat c_k = \begin{cases} z_k, & \text{if } k = 0, \dots, M,\\ 0, & \text{if } k = M+1, \dots, n-1. \end{cases}
$$
Now applying Lemmas 10.15 and 10.16, we get
$$
\mathbb{E}_f\big\|\hat f_n - f\big\|_{2,n}^2 = \sum_{k=0}^{M} \mathbb{E}_f\big[(z_k - c_k)^2\big] + \sum_{k=M+1}^{n-1} c_k^2
$$
$$
(10.33) \qquad = \mathbb{E}_f\Big[\sum_{k=0}^{M} \Big(\frac{\sigma\xi_k}{\sqrt n}\Big)^2\Big] + \sum_{k=M+1}^{n-1} c_k^2 = \frac{\sigma^2 (M+1)}{n} + \sum_{k=M+1}^{n-1} c_k^2.
$$
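The projection step (10.33) can be sketched numerically: estimate the coefficients by $z_k = n^{-1}\sum_i y_i \varphi_k(i/n)$ and keep only $k \le M$. The test function below is an assumption chosen so that its only nonzero coefficients are $c_0 = c_1 = 1$, making the tail sum vanish:

```python
import math, random

random.seed(4)

n0 = 250
n = 2 * n0 + 1

def phi(k, i):
    # combined trigonometric basis at the design point i/n
    if k == 0:
        return 1.0
    m = (k + 1) // 2
    ang = 2 * math.pi * m * i / n
    return math.sqrt(2) * (math.cos(ang) if k % 2 == 0 else math.sin(ang))

f = lambda t: 1 + math.sqrt(2) * math.sin(2 * math.pi * t)   # c_0 = c_1 = 1
ys = [f(i / n) + random.gauss(0, 0.1) for i in range(1, n + 1)]

beta = 2
M = round(n ** (1 / (2 * beta + 1)))          # truncation level n^{1/(2 beta + 1)}
z = [sum(ys[i - 1] * phi(k, i) for i in range(1, n + 1)) / n for k in range(M + 1)]

c = [1.0, 1.0] + [0.0] * (M - 1)              # true coefficients c_0, ..., c_M
mise_head = sum((zk - ck) ** 2 for zk, ck in zip(z, c))   # ~ sigma^2 (M+1)/n
```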
Exercises
Exercise 10.72. Prove that for the regressogram under Assumption 10.1,
dim(S) = K = β Q.
Chapter 11. Estimation by Splines
smoother of $\hat f_n$,
$$
\frac{1}{h_n}\int_{x-h_n}^{x+h_n} K\Big(\frac{t - x}{h_n}\Big)\, \hat f_n(t)\, dt.
$$
(iv) For any $m \ge 1$ and for any $u \in \mathbb{R}$, the equation (called the partition
of unity) holds:
$$
(11.4) \qquad \sum_{j=-\infty}^{\infty} S_m(u - j) = 1.
$$
(ii) This part follows immediately from the definition of the standard
B-spline as a probability density.
Here we used part (ii) once again, and the fact that the variable of integration
u belongs to the unit interval. Continuing, we obtain
$$
c = \sum_{j=0}^{m-1}\int_{j}^{j+1} S_m(u)\, du = \int_0^{m} S_m(u)\, du = 1,
$$
for $S_m(u)$ is the probability density for u in the interval $[0, m]$.
Now we try to answer the question: How smooth is the standard B-spline
Sm (u)? The answer can be found in the following lemma.
Lemma 11.2. For any $m \ge 2$, the standard B-spline $S_m(u)$, $u \in \mathbb{R}$, is a
piecewise polynomial of order $m-1$. It has continuous derivatives up to the
order $m-2$, and its derivative of order $m-1$ is a piecewise constant function
given by the sum
$$
(11.5) \qquad S_m^{(m-1)}(u) = \sum_{j=0}^{m-1} (-1)^j \binom{m-1}{j}\, \mathbb{I}_{[j,\, j+1)}(u).
$$
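A standard B-spline is the Irwin-Hall density (the $m$-fold convolution of $\mathbb{I}_{[0,1)}$), which has the closed form $S_m(u) = \frac{1}{(m-1)!}\sum_{j=0}^{m}(-1)^j\binom{m}{j}(u-j)_+^{m-1}$. The sketch below uses this formula to check the partition of unity (11.4) and the unit total mass numerically:

```python
import math

def S(m, u):
    # standard B-spline S_m via the Irwin-Hall closed form
    if m == 1:
        return 1.0 if 0 <= u < 1 else 0.0
    s = 0.0
    for j in range(m + 1):
        s += (-1) ** j * math.comb(m, j) * max(u - j, 0.0) ** (m - 1)
    return s / math.factorial(m - 1)

m = 4
# partition of unity (11.4): sum_j S_m(u - j) = 1 for every u
pu = [sum(S(m, u - j) for j in range(-m, m + 1)) for u in (0.3, 1.7, 2.25)]
# S_m integrates to one over [0, m] (midpoint rule)
N = 20000
mass = sum(S(m, (i + 0.5) * m / N) * (m / N) for i in range(N))
```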
11.3. Shifted B-splines and Power Splines 155
Proof. We start by stating the following result. For any $m \ge 2$, the k-th
derivative of $S_m(u)$ can be written in the form
$$
(11.6) \qquad S_m^{(k)}(u) = \sum_{j=0}^{k} (-1)^j \binom{k}{j}\, S_{m-k}(u - j), \qquad k \le m-1.
$$
The shortest way to verify this identity is to use induction on k starting
with (11.2). We leave it as an exercise (see Exercise 11.76).
If $k \le m-2$, then the function $S_{m-k}(u - j)$ is continuous for any j.
Indeed, all the functions $S_m(u)$, $m \ge 2$, are continuous as the convolutions
in (11.1). Thus, by (11.6), as a linear combination of continuous functions,
$S_m^{(k)}(u)$, $k \le m-2$, is continuous in $u \in \mathbb{R}$. Also, for $k = m-1$, the formula
(11.6) yields (11.5).
It remains to show that $S_m(u)$, $u \in \mathbb{R}$, is a piecewise polynomial of order
$m-1$. From (11.2), we obtain
$$
(11.7) \qquad S_m(u) = \int_0^{u} \big(S_{m-1}(z) - S_{m-1}(z - 1)\big)\, dz.
$$
Note that by definition, S1 (u) = I[0, 1) (u) is a piecewise polynomial of order
zero. By induction, if Sm−1 (u) is a piecewise polynomial of order at most
m − 2, then so is the integrand in the above formula. Therefore, Sm (u) is
a piecewise polynomial of order not exceeding m − 1. However, from (11.5),
the (m − 1)-st derivative of Sm (u) is non-zero, which proves that Sm (u) has
order m − 1.
Remark 11.3. To restore a standard B-spline $S_m(u)$, it suffices to look at
(11.5) as a differential equation
$$
(11.8) \qquad \frac{d^{m-1} S_m(u)}{d u^{m-1}} = \sum_{j=0}^{m-1} \lambda_j\, \mathbb{I}_{[j,\, j+1)}(u)
$$
with the constants $\lambda_j$ defined by the right-hand side of (11.5), and to solve
it with the zero initial conditions,
$$
S_m(0) = S_m'(0) = \dots = S_m^{(m-2)}(0) = 0.
$$
$$
\le \max\big(|a_0|, \dots, |a_{m-2}|\big) \sum_{j=0}^{m-2} S_m(u - j)
\le \max\big(|a_0|, \dots, |a_{m-2}|\big) \sum_{j=-\infty}^{\infty} S_m(u - j)
$$
$$
\le \max\big(|a_0|, \dots, |a_{m-2}|\big) \le C(m)\, \max\big(|\nu_0|, \dots, |\nu_{m-2}|\big)
$$
where we applied the partition of unity (11.4).
Lemma 11.8. For any polynomial $g(u)$ of order $m-1$, $m-1 \le u < m$,
there exists a unique linear combination of the shifted standard B-splines
$$
LS^*(u) = a_0 S_m(u) + \dots + a_{m-2} S_m(u - (m-2)) + a_{m-1} S_m(u - (m-1))
$$
such that $LS^*(u) = g(u)$, if $m-1 \le u < m$.
158 11. Estimation by Splines
[Figure: the linear combination $LS^*(u)$ plotted for $0 \le u \le m$, and the spline basis function $\gamma_k(x)$.]
The proof of the next lemma is postponed to the end of this chapter.
Lemma 11.9. The set of functions $\{\gamma_k(x),\ k = -\beta+1, \dots, Q-1\}$ forms
a basis in the linear sub-space of the smooth piecewise polynomials of order
$\beta-1$ that are defined in bins $B_q$, and have continuous derivatives up to
order $\beta-2$. That is, any $\gamma(x)$ in this space admits a unique representation
$$
(11.11) \qquad \gamma(x) = \sum_{k=-\beta+1}^{Q-1} \hat\theta_k\, \gamma_k(x), \qquad x \in [0, 1],
$$
with some real coefficients $\hat\theta_k$. Hence this vector can also be associated with
the function defined by (11.11),
$$
f_n^*(x) = \sum_{k=-\beta+1}^{Q-1} \hat\theta_k\, \gamma_k(x), \qquad x \in [0, 1].
$$
We are ready to extend the result stated in Theorem 10.11 for the
regressogram to the approximation by splines.
11.5. Proofs of Technical Lemmas 161
In the above, the constants C1 and r∗ are positive and independent of n and
f ∈ Θ(β, L, L1 ).
Proof. The result follows immediately from the bound (11.12) on the ap-
proximation error by splines (cf. the proof of Theorem 10.11.)
Remark 11.14. With the splines γk (x) of this section, we could introduce
the design matrix with column vectors (10.16) as well as the system of
normal equations (10.17). In the case of B-splines, however, the system
of normal equations does not partition into sub-systems as was the case of
the regressogram. It makes the asymptotic analysis of spline approximation
technically more challenging as compared to the one of the regressogram.
In particular, an analogue of Proposition 10.2 with explicit control over the
bias and the stochastic terms goes beyond the scope of this book.
On the other hand, any power spline $LP \in L_p$ also has the piecewise constant
$(m-1)$-st derivative
$$
(11.17) \qquad LP^{(m-1)}(u) = \lambda_j, \qquad \text{if } j \le u < j+1,
$$
with
$$
(11.18) \qquad \lambda_j = b_0 + \dots + b_j, \qquad j = 0, \dots, m-2.
$$
In (11.15) and (11.17), we have deliberately denoted the $(m-1)$-st derivative
by the same $\lambda_j$'s because we mean them to be identical. Introduce a vector
$\lambda = (\lambda_0, \dots, \lambda_{m-2})^\top$. If we look at (11.16) and (11.18) as the systems of
linear equations for $a$ and $b$, respectively, we find that the matrices of these
systems are lower triangular with non-zero diagonal elements. Hence, these
systems establish the linear one-to-one correspondence between $a$ and $\lambda$, on
the one hand, and between $\lambda$ and $b$, on the other hand. Thus, there exists
a linear one-to-one correspondence between $a$ and $b$.
Proof of Lemma 11.5. Applying Lemma 11.4, we can find a linear combination
of the power splines such that
$$
LS(u) = LP(u) = b_0 P_0(u) + b_1 P_1(u) + \dots + b_{m-2} P_{m-2}(u)
$$
$$
= b_0\,\frac{u^{m-1}}{(m-1)!}\,\mathbb{I}_{[0,\,m-1)}(u) + b_1\,\frac{(u-1)^{m-1}}{(m-1)!}\,\mathbb{I}_{[1,\,m-1)}(u)
+ \dots + b_{m-2}\,\frac{(u-(m-2))^{m-1}}{(m-1)!}\,\mathbb{I}_{[m-2,\,m-1)}(u).
$$
The derivatives of the latter combination, $\nu_j = LP^{(j)}(u)$, at the right-most
point $u = m-1$ are computable explicitly (see Exercise 11.78),
$$
(11.19) \qquad \nu_j = b_0\,\frac{(m-1)^{m-j-1}}{(m-j-1)!} + b_1\,\frac{(m-2)^{m-j-1}}{(m-j-1)!} + \dots + b_{m-2}\,\frac{1^{m-j-1}}{(m-j-1)!}.
$$
If we manage to restore the coefficients b from the derivatives ν , then by
Lemma 11.4, we would prove the claim. Consider (11.19) as the system of
linear equations. Then the matrix M of this system is an (m − 1) × (m − 1)
matrix with the elements
$$(11.20)\qquad M_{j,k} = \frac{(m-k-1)^{m-j-1}}{(m-j-1)!}, \qquad j, k = 0, \dots, m-2.$$
The matrix M is invertible because its determinant is non-zero (see Exercise
11.79). Thus, the lemma follows.
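The invertibility of M can also be checked numerically for small m. The following sketch is not from the book; the helper name `spline_matrix` is ours, and for the values of m tried the determinant in fact evaluates to ±1:

```python
import math
import numpy as np

def spline_matrix(m):
    """Matrix M from (11.20): M[j, k] = (m - k - 1)**(m - j - 1) / (m - j - 1)!,
    for j, k = 0, ..., m - 2."""
    return np.array([[(m - k - 1) ** (m - j - 1) / math.factorial(m - j - 1)
                      for k in range(m - 1)]
                     for j in range(m - 1)])

# Determinants for several small values of m; all are bounded away from zero.
dets = {m: np.linalg.det(spline_matrix(m)) for m in range(2, 8)}
```

Non-vanishing of these determinants is exactly what Exercise 11.79 asks to prove via the generalized Vandermonde matrix.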
Proof of Lemma 11.11. Put q(l) = 1+βl. Consider all the bins Bq(l) with
l = 0, . . . , (Q − 1)/β. Without loss of generality, we assume that (Q − 1)/β
is an integer, so that the last bin BQ belongs to this subsequence. Note that
the indices of the bins in the subsequence Bq(l) are equal to 1 modulo β,
and that any two consecutive bins Bq(l) and Bq(l+1) in this subsequence are
separated by (β − 1) original bins.
Let x_l = 2(q(l) − 1)h_n denote the left endpoint of the bin B_{q(l)}, l =
0, 1, . . . . For any regression function f ∈ Θ(β, L, L_1), introduce the Taylor
expansion of f(x) around x = x_l,
$$(11.21)\qquad \pi_l(x) = f(x_l) + \frac{f^{(1)}(x_l)}{1!}\,(x - x_l) + \dots + \frac{f^{(\beta-1)}(x_l)}{(\beta-1)!}\,(x - x_l)^{\beta-1}, \qquad x \in B_{q(l)}.$$
In accordance with Lemma 11.8, for any l, there exists a linear combination
of the splines that coincides with π_l(x) in B_{q(l)}. This implies that
$$\gamma(x) = \sum_{k=-\beta+1}^{Q-1} a_k\, \gamma_k(x), \qquad 0 \le x \le 1.$$
This function γ(x) defines a piecewise polynomial of order at most β − 1
that coincides with the Taylor polynomial (11.21) in all the bins Bq(l) (see
Figure 10). Hence, in the union of these bins ∪_l B_{q(l)}, the function γ(x)
does not deviate from f(x) by more than O(h_n^β), this magnitude being
preserved uniformly over f ∈ Θ(β, L, L_1).
Next, how close is γ(x) to f (x) in the rest of the unit interval? We want
to show that the same magnitude holds for all x ∈ [0, 1], that is,
$$(11.22)\qquad \max_{0 \le x \le 1} | \gamma(x) - f(x) | \le C_1\, h_n^\beta.$$
[Figure 10. Schematic graphs of the functions γ(x) and Δγ_1(x) for x lying
in bins B_1 through B_{1+β}.]
We want to rescale Δπl (x) to bring it to the scale of the integer bins of
unit length. Put
g(u) = h−β
n Δπl (xl + 2hn (u + 1)) with 0 ≤ u ≤ β − 1,
dj −β j d
j
νi = g(β − 1) = h n (2h n ) Δπl (xl+1 )
duj dxj
f (j+1) (xl )
= 2j hj−β
n f (j) (xl+1 ) − f (j) (xl ) + (xl+1 − xl ) + · · ·
1!
f (β−1) (xl )
··· + (xl+1 − xl )β−1−j .
(β − 1 − j)!
Note that the expression in the brackets on the right-hand side is the
remainder term of the Taylor expansion of the j-th derivative f (j) (xl+1 )
around xl . If f ∈ Θ(β, L, L1 ) , then f (j) belongs to the Hölder class Θ(β −
j, L, L_2) with some positive constant L_2 (see Exercise 11.81). Similarly to
Lemma 10.2, this remainder term has the magnitude O(|x_{l+1} − x_l|^{β−j}) =
O(h_n^{β−j}).
Thus, in the notation of Lemma 11.7, max(|ν_0|, . . . , |ν_{β−1}|) ≤ C_1, where
the constant C_1 depends neither on n nor on l. By Lemma 11.7, the unique
spline of order β with zero derivatives at u = 0 and the given derivatives ν_j
at u = β − 1 is uniformly bounded for 0 ≤ u ≤ β − 1. Since this is true for any
l, we can conclude that |g(u)| ≤ C_2 = C(β) C_1 at all u where this function
is defined. The constant C(β) is introduced in Lemma 11.7 with m = β. So, we
have proved that max_{0≤x≤1} |γ(x) − f(x)| = O(h_n^β), which implies (11.12).
166 11. Estimation by Splines
Exercises
Exercise 11.79. Show that the determinant det M of the matrix M with
the elements defined by (11.20) is non-zero. Hint: Show that this deter-
minant is proportional to the determinant of the generalized Vandermonde
matrix
$$V_m = \begin{bmatrix} x_1 & x_1^2 & \dots & x_1^m \\ x_2 & x_2^2 & \dots & x_2^m \\ \vdots & \vdots & & \vdots \\ x_m & x_m^2 & \dots & x_m^m \end{bmatrix}.$$
Exercise 11.80. Prove the statement similar to Lemma 11.8 for the power
splines. Show that for any polynomial g(u) of order m − 1 in the interval
m − 1 ≤ u < m, there exists a unique linear combination LP ∗ (u) of power
splines
LP^∗(u) = b_0 P_0(u) + b_1 P_1(u) + · · · + b_{m−2} P_{m−2}(u) + b_{m−1} P_{m−1}(u)
such that LP^∗(u) = g(u) if m − 1 ≤ u < m. Apply this result to represent
the function g(u) = 2 − u2 in the interval [2, 3) by the power splines.
Asymptotic Optimality
in Global Norms
In Chapter 10, we studied the regression estimation problem for the integral
L_2-norm and the sup-norm risks. The upper bounds in Theorems 10.3 and
10.6 guarantee the rates of convergence in these norms that are n^{−β/(2β+1)}
and (n/ln n)^{−β/(2β+1)}, respectively. These rates hold for any function f
in the Hölder class Θ(β, L, L1 ), and they are attained by the regressogram
with the properly chosen bandwidths.
The question that we address in this chapter is whether these rates can
be improved by any other estimators. The answer turns out to be negative.
We will prove the lower bounds that show the minimax optimality of the
regressogram.
Proof. As in the proof of Theorem 9.16, take any “bump” function ϕ(t), t ∈
R, such that ϕ(0) > 0 , ϕ(t) = 0 if |t| > 1, and | ϕ(β) (t) | ≤ L. Clearly, this
function has a finite L_2-norm, ‖ϕ‖_2 < ∞. We want this norm to be small;
therefore, below we make an appropriate choice of ϕ(t). Take the bandwidth
$$h_n^* = \big( (\ln n)/n \big)^{1/(2\beta+1)},$$
and consider the bins
$$B_q = \big[\, 2h_n^*(q-1),\; 2h_n^*\, q \,\big), \qquad q = 1, \dots, Q,$$
where we assume without loss of generality that Q = 1/(2h_n^*) is an integer.
Introduce the test functions f_0(t) = 0 and
$$(12.3)\qquad f_q(t) = (h_n^*)^\beta\, \varphi\Big( \frac{t - c_q}{h_n^*} \Big), \qquad t \in [0,1], \quad q = 1, \dots, Q,$$
where cq is the center of the bin Bq . Note that each function fq (t) takes
non-zero values only within the respective bin Bq . For any small enough
h∗n , the function fq belongs to the Hölder class Θ(β, L, L1 ). This fact was
explained in the proof of Theorem 9.16.
Recall that under the hypothesis f = fq , the observations yi in the
nonparametric regression model satisfy the equation
yi = fq (xi ) + εi , i = 1, . . . , n,
where the xi ’s are the design points, the εi ’s are independent N (0, σ 2 )-
random variables. Put
$$(12.4)\qquad d_0 = \frac{1}{2}\, (h_n^*)^\beta\, \| \varphi \|_\infty > 0.$$
Note that by definition,
$$(12.5)\qquad \| f_l - f_q \|_\infty = 2 d_0, \qquad 1 \le l < q \le Q,$$
and
$$(12.6)\qquad \| f_q \|_\infty = \| f_q - f_0 \|_\infty = 2 d_0, \qquad q = 1, \dots, Q.$$
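The separation properties (12.5) and (12.6) come from the disjoint supports of the bumps and can be verified on a grid. A minimal numerical sketch, assuming a particular smooth bump (our choice; any ϕ with ϕ(0) > 0 vanishing off (−1, 1) would do) and β = 2:

```python
import numpy as np

beta, n = 2, 10_000
h = (np.log(n) / n) ** (1 / (2 * beta + 1))   # bandwidth h_n^* from the proof
Q = int(1 / (2 * h))                           # number of bins of width 2h

def phi(t):
    """A smooth bump supported on (-1, 1); its maximum is phi(0) = exp(-1)."""
    t = np.asarray(t, dtype=float)
    inside = np.abs(t) < 1
    out = np.zeros_like(t)
    out[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
    return out

t_grid = np.linspace(0, 1, 400_001)
centers = 2 * h * np.arange(Q) + h             # bin centers c_q
f = [h ** beta * phi((t_grid - c) / h) for c in centers]  # test functions f_q
d0 = 0.5 * h ** beta * np.exp(-1.0)            # (12.4): d0 = (1/2) h^beta ||phi||_inf
```

Since the supports of distinct f_q are disjoint, the sup-norm distance between any two of them equals the common peak height 2d_0, as in (12.5).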
Further, we will need the following lemma. We postpone its proof to the
end of the section.
Lemma 12.2. Under the assumptions of Theorem 12.1, for any small δ > 0,
there exists a constant c_0 > 0 such that if ‖ϕ‖_2^2 ≤ c_0, then for all large n,
$$\max_{0 \le q \le Q} P_{f_q}\big( D_q \big) \ge \frac{1}{2}\,(1 - \delta).$$
Now, we apply Lemma 12.2 to find that for all n large enough, the
following inequalities hold:
$$\sup_{f \in \Theta(\beta)} E_f \| \hat f_n - f \|_\infty \ge \max_{0 \le q \le Q} E_{f_q} \| \hat f_n - f_q \|_\infty \ge d_0 \max_{0 \le q \le Q} P_{f_q}\big( \| \hat f_n - f_q \|_\infty \ge d_0 \big)$$
$$= d_0 \max_{0 \le q \le Q} P_{f_q}\big( D_q \big) \ge \frac{1}{2}\, d_0\, (1 - \delta) = \frac{1}{4}\, (h_n^*)^\beta\, \| \varphi \|_\infty\, (1 - \delta),$$
and we can choose r∗ = (1/4) ϕ ∞ (1 − δ).
Remark 12.3. Contrast the proof of Theorem 12.1 with that of Theorem
9.16. The proof of the latter theorem was based on two hypotheses, f = f0
or f = f1 , with the likelihood ratio that stayed finite as n → ∞. In the sup-
norm, however, the proof of the rate of convergence is complicated by the
extra log-factor, which prohibits using the same idea. The likelihood ratios
in the proof of Theorem 12.1 are vanishing as n → ∞. To counterbalance that
fact, a growing number of hypotheses is selected. Note that the number of
hypotheses Q + 1 ≤ n1/(2β+1) has the polynomial rate of growth as n goes
to infinity.
The next theorem handles the case of a random design. It shows that if
the random design is regular, then the rate of convergence of the sup-norm
risk is the same as that in the deterministic case. Since the random design
can be “very bad” with a positive probability, the conditional risk for given
design points does not guarantee even the consistency of estimators. That
is why we study the unconditional risks. The proof of the theorem below is
left as an exercise (see Exercise 12.83).
Theorem 12.4. Let X be a random design such that the design points
x_i are independent with a continuous and strictly positive density p(x) on
[0, 1]. Then for all sufficiently large n, and for any estimator f̂_n(x) of the
regression function f(x), the following inequality holds:
$$(12.7)\qquad \sup_{f \in \Theta(\beta)} E_f\big[\, \psi_n^{-1} \| \hat f_n - f \|_\infty \big] \ge c_0.$$
Proof of Lemma 12.2. From the inclusion D̄_0 ⊆ D_q, which holds for any
q = 1, . . . , Q, we have that
$$\max_{0 \le q \le Q} P_q(D_q) = \max\Big( P_0(D_0),\; \max_{1 \le q \le Q} P_q(D_q) \Big) \ge \frac{1}{2}\Big[ P_0(D_0) + \max_{1 \le q \le Q} P_q(\bar D_0) \Big]$$
$$\ge \frac{1}{2}\Big[ P_0(D_0) + \frac{1}{Q} \sum_{q=1}^{Q} P_q(\bar D_0) \Big]$$
$$(12.8)\qquad = \frac{1}{2}\Big[ P_0(D_0) + \frac{1}{Q} \sum_{q=1}^{Q} E_0\Big( \mathbb{I}(\bar D_0)\, \exp\{ L_{n,q} \} \Big) \Big].$$
have that E_0 ξ_n = 1. Applying the independence of the random variables
N_{n,q} for different q, we find that
$$E_0\, \xi_n^2 = Q^{-2} \sum_{q=1}^{Q} E_0 \exp\{ 2 L_{n,q} \} = Q^{-2} \sum_{q=1}^{Q} E_0 \exp\big\{ 2 \sigma_{n,q} N_{n,q} - \sigma_{n,q}^2 \big\}$$
$$= Q^{-2} \sum_{q=1}^{Q} \exp\big\{ \sigma_{n,q}^2 \big\} \le Q^{-1} e^{c_1 \ln n} = Q^{-1} n^{c_1} = 2 h_n^*\, n^{c_1}.$$
q=1
Lemma 12.5. For any estimator f̂_n of the regression function f(t, ω), the
following inequality holds:
$$E_\omega \big\| \hat f_n(\cdot) - f(\cdot, \omega) \big\|_{2, B_q}^2 \ge E_\omega \big\| \hat f_{n,q}(\cdot) - f(\cdot, \omega) \big\|_{2, B_q}^2$$
$$(12.13)\qquad = E_\omega \int_{B_q} \Big( \hat f_{n,q}(t) - \omega_q\, (h_n^*)^\beta\, \varphi\Big( \frac{t - c_q}{h_n^*} \Big) \Big)^2 dt.$$
At the last step we used the definition of the function f (t, ω) in the bin
Bq .
In (12.13), the function fˆn, q (t) depends only on the regression observa-
tions with the design points in Bq . We will denote the expectation relative
to these observations by Eωq . We know that Eωq is computed with respect
to one of the two probability measures P{ωq =0} or P{ωq =1} . These measures
are controlled entirely by the performance of the test function f (·, ω) in the
bin Bq .
Lemma 12.6. There exists a constant r0 , which depends only on the design
density p and the chosen function ϕ, such that for any q, 1 ≤ q ≤ Q, and
for any Yq -measurable estimator fˆn, q , the following inequality holds:
$$\max_{\omega_q \in \{0,1\}} E_{\omega_q} \int_{B_q} \Big( \hat f_{n,q}(t) - \omega_q\, (h_n^*)^\beta\, \varphi\Big( \frac{t - c_q}{h_n^*} \Big) \Big)^2 dt \ge r_0 / n.$$
$$\ge \frac{1}{2} \int_{B_q} (h_n^*)^{2\beta}\, \varphi^2\Big( \frac{t - c_q}{h_n^*} \Big)\, dt = \frac{1}{2}\, (h_n^*)^{2\beta+1}\, \| \varphi \|_2^2 = \frac{1}{2n}\, \| \varphi \|_2^2.$$
Finally, combining these estimates, we obtain that
$$\max_{\omega_q \in \{0,1\}} E_{\omega_q} \int_{B_q} \Big( \hat f_{n,q}(t) - \omega_q\, (h_n^*)^\beta\, \varphi\Big( \frac{t - c_q}{h_n^*} \Big) \Big)^2 dt \ge p_0\, \| \varphi \|_2^2 / (2n) = r_0 / n$$
with r_0 = p_0 ‖ϕ‖_2^2 / 2.
Proof. We use the notation introduced in Lemmas 12.5 and 12.6. Applying
the former lemma, we obtain the inequalities
$$\sup_{f \in \Theta(\beta, L)} E_f \| \hat f_n - f \|_2^2 \ge \max_{\omega \in \Omega_Q} E_\omega \big\| \hat f_n(\cdot) - f(\cdot, \omega) \big\|_2^2$$
$$\ge \max_{\omega \in \Omega_Q} \sum_{q=1}^{Q} E_\omega \int_{B_q} \Big( \hat f_{n,q}(t) - \omega_q\, (h_n^*)^\beta\, \varphi\Big( \frac{t - c_q}{h_n^*} \Big) \Big)^2 dt.$$
Note that each term in the latter sum depends only on a single component
ωq . This is true for the expectation and the integrand. That is why the
maximum over the binary vector ω can be split into the sum of maxima. In
view of Lemma 12.6, we can write
$$\sum_{q=1}^{Q} \max_{\omega_q \in \{0,1\}} E_{\omega_q} \int_{B_q} \Big( \hat f_{n,q}(t) - \omega_q\, (h_n^*)^\beta\, \varphi\Big( \frac{t - c_q}{h_n^*} \Big) \Big)^2 dt$$
to apply it to the sup-norm because the sup-norm does not split into the
sum of the sup-norms over the bins.
In this section, we will suggest a more general lower bound that covers
both of these norms as special cases. As above, we consider a nonparamet-
ric regression function f (x), x ∈ [0, 1], of a given smoothness β ≥ 1. We
introduce a norm f of functions in the interval [0, 1]. This norm will be
specified later in each particular case.
As in the sections above, we must care about two things: a proper
set of the test functions, and the asymptotic performance of the respective
likelihood ratios.
Assume that there exists a positive number d0 and a set of M + 1 test
functions f0 (x), . . . , fM (x), x ∈ [0, 1], such that any two functions fl and
fm are separated by at least 2d0 , that is,
(12.15) ‖f_l − f_m‖ ≥ 2d_0 for any l ≠ m, l, m = 0, . . . , M.
The constant d_0 depends on n, decreases as n → ∞, and controls the rate
of convergence. The number M typically goes to infinity as n → ∞. For
example, in the case of the sup-norm, we had d_0 = O((h_n^*)^β) in (12.4), and
M = Q = O(1/h_n^*) where h_n^* = ((ln n)/n)^{1/(2β+1)}.
In this section, we consider the regression with the regular deterministic
design X . Denote by Pm ( · ) = Pfm ( · | X ) m = 0, . . . , M , the probability
distributions corresponding to a fixed design, and by Em the respective
expectations associated with the test function fm , m = 0, . . . , M.
Fix one of the test functions, for instance, f0 . Consider all log-likelihood
ratios for m = 1, . . . , M ,
$$\ln \frac{dP_0}{dP_m} = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big[ y_i^2 - \big( y_i - f_m(x_i) \big)^2 \Big]$$
$$= \frac{1}{\sigma} \sum_{i=1}^{n} f_m(x_i)\, \big( -\varepsilon_i / \sigma \big) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} f_m^2(x_i) = \sigma_{m,n}\, N_{m,n} - \frac{1}{2}\, \sigma_{m,n}^2,$$
where
$$\varepsilon_i = y_i - f_m(x_i) \qquad \text{and} \qquad \sigma_{m,n}^2 = \sigma^{-2} \sum_{i=1}^{n} f_m^2(x_i).$$
The random variables ε_i/σ and N_{m,n} are standard normal with respect to the
distribution P_m.
We need assumptions on the likelihood ratios to guarantee that they are
not too small as n → ∞. Introduce the random events
A_m = { N_{m,n} > 0 } with P_m(A_m) = 1/2, m = 1, . . . , M.
Assume that there exists a constant α, 0 < α < 1, such that all the variances
σ²_{m,n} are bounded from above,
$$(12.16)\qquad \max_{1 \le m \le M} \sigma_{m,n}^2 \le 2\alpha \ln M.$$
If the random event A_m takes place and the inequality (12.16) holds, then
$$(12.17)\qquad \frac{dP_0}{dP_m} \ge \exp\big\{ -\sigma_{m,n}^2 / 2 \big\} \ge \exp\{ -\alpha \ln M \} = 1 / M^{\alpha}.$$
Let fˆn be an arbitrary estimator of the regression function f. Define the
random events
Dm = { fˆn − fm ≥ d0 }, m = 0, . . . , M.
The following lemma plays the same fundamental role in the proof of
the lower bound as Lemma 12.2 in the case of the sup-norm.
Lemma 12.8. If the conditions (12.15) and (12.16) are satisfied, then the
following lower bound is true:
max Pm Dm ≥ 1/4.
0≤m≤M
where the random events D̄_m are mutually exclusive. Indeed, if the norm
of the difference f̂_n − f_m is strictly less than d_0 for some m, then by the
triangle inequality and (12.15), the norm ‖f̂_n − f_l‖ is not smaller than d_0
for any l ≠ m. The inclusion (12.19) makes use of this fact for l = 0.
It immediately follows that
$$P_0(D_0) \ge P_0\Big( \bigcup_{m=1}^{M} \bar D_m \Big) = \sum_{m=1}^{M} P_0(\bar D_m) = \sum_{m=1}^{M} E_m\Big[ \mathbb{I}(\bar D_m)\, \frac{dP_0}{dP_m} \Big]$$
$$\ge \sum_{m=1}^{M} E_m\Big[ \mathbb{I}(\bar D_m \cap A_m)\, \frac{dP_0}{dP_m} \Big] \ge \frac{1}{M^{\alpha}} \sum_{m=1}^{M} \big( P_m(\bar D_m) - 1/2 \big).$$
In the latter inequality, we used (12.17).
12.4. Examples and Extensions 177
$$\max_{0 \le m \le M} P_m(D_m) \ge \frac{1}{2}\Big[ \frac{1}{M^{\alpha}} \sum_{m=1}^{M} \big( P_m(\bar D_m) - 1/2 \big) + \frac{1}{M} \sum_{m=1}^{M} P_m(D_m) \Big]$$
$$\ge \frac{1}{2M} \sum_{m=1}^{M} \big( P_m(\bar D_m) + P_m(D_m) - 1/2 \big) = 1/4.$$
≥ d0 max Pm Dm ≥ d0 /4.
0≤m≤M
Example 12.10. The sup-norm risk. In the case of the sup-norm, the test
functions are defined by (12.3) with M = Q. The condition (12.15) follows
from (12.5) and (12.6) with d_0 = (1/2)(h_n^*)^β ‖ϕ‖_∞. Note that for all large n
the following inequality holds:
$$\ln Q = \frac{1}{2\beta+1}\, \big( \ln n - \ln \ln n \big) - \ln 2 \ge \frac{1}{2(2\beta+1)}\, \ln n.$$
In view of (12.11), the variance σ²_{q,n} in the expansion (12.9) of ln dP_q/dP_0
is bounded from above uniformly in q = 1, . . . , Q,
$$\sigma_{q,n}^2 \le c_1 \ln n \le 2(2\beta+1)\, c_1 \ln Q \le 2\alpha \ln Q = 2\alpha \ln M.$$
The latter inequality holds if the constant c_1 = 2σ^{−2} p^* c_0 is so small that
(2β + 1) c_1 < α. Such a choice of c_1 is guaranteed because c_0 can be taken arbitrarily
small. Thus, the condition (12.16) is also fulfilled. Applying Theorem 12.9,
we get the lower bound
$$\sup_{f \in \Theta(\beta)} E_f \| \hat f_n - f \|_\infty \ge \frac{1}{8}\, (h_n^*)^\beta\, \| \varphi \|_\infty = r_*\, \psi_n$$
with the constant r_* = (1/8)‖ϕ‖_∞, and the rate of convergence ψ_n defined
in (12.1).
Unlike the case of the upper bounds in Chapter 9, “bad” designs do not
create a problem in obtaining the lower bound in the sup-norm. Intuitively it
is understandable because when we concentrate more design points in some
bins, we lose them in the other bins. This process reduces the precision of
the uniform estimation of the regression function. In a sense, the uniform
design is optimal if we estimate the regression in the sup-norm. We will
prove some results in support of these considerations.
Let a design X be of any kind, not necessarily regular. Assume that there
exists a subset M = M(X ) ⊆ { 1, . . . , M } such that for some α ∈ (0, 1) the
following inequality holds:
$$(12.21)\qquad \max_{m \in \mathcal{M}} \sigma_{m,n}^2 \le 2\alpha \ln M.$$
m∈M
Let |M| denote the number of elements in M. It turns out that Lemma 12.8
remains valid in the following modification.
Lemma 12.11. If the conditions (12.15) and (12.21) are satisfied, then the
following lower bound holds:
$$\max_{0 \le m \le M} P_m(D_m) \ge \frac{|\mathcal{M}|}{4M}.$$
Proof. Repeating the proof of Lemma 12.8, we find that
$$P_0(D_0) \ge P_0\Big( \bigcup_{m=1}^{M} \bar D_m \Big) = \sum_{m=1}^{M} P_0(\bar D_m) = \sum_{m=1}^{M} E_m\Big[ \mathbb{I}(\bar D_m)\, \frac{dP_0}{dP_m} \Big]$$
$$\ge \sum_{m \in \mathcal{M}} E_m\Big[ \mathbb{I}(\bar D_m \cap A_m)\, \frac{dP_0}{dP_m} \Big] \ge \frac{1}{M^{\alpha}} \sum_{m \in \mathcal{M}} \big( P_m(\bar D_m) - 1/2 \big),$$
where we have used the inequality (12.17). Under (12.21), this inequality
applies only to the indices m ∈ M. Continuing as in Lemma 12.8, we obtain
the bound
$$\max_{0 \le m \le M} P_m(D_m) \ge \frac{1}{2}\Big[ P_0(D_0) + \frac{1}{M} \sum_{m=1}^{M} P_m(D_m) \Big]$$
$$\ge \frac{1}{2M} \sum_{m \in \mathcal{M}} \big( P_m(\bar D_m) + P_m(D_m) - 1/2 \big) = \frac{|\mathcal{M}|}{4M}.$$
From (12.22) and Lemma 12.11, analogously to the proof of Theorem 12.9,
we derive the lower bound for any design X,
$$(12.23)\qquad \sup_{f \in \Theta(\beta)} E_f\big[\, \psi_n^{-1} \| \hat f_n - f \|_\infty \big] \ge d_0\, \frac{|\mathcal{M}|}{4Q} \ge \frac{d_0}{8} = \frac{1}{16}\, (h_n^*)^\beta\, \| \varphi \|_\infty.$$
where the xi ’s are the design points (see Exercise 12.84). From the definition
of the test functions, the variance σ_n² can be bounded from above by
$$\sigma_n^2 = \sigma^{-2}\, (h_n^*)^{2\beta} \sum_{q=1}^{Q} | \omega_q' - \omega_q'' | \sum_{x_i \in B_q} \varphi^2\Big( \frac{x_i - c_q}{h_n^*} \Big)$$
$$= \sigma^{-2}\, \| \varphi \|_2^2 \sum_{q=1}^{Q} | \omega_q' - \omega_q'' |\, p(c_q)\, \big( 1 + o_{q,n}(1) \big)$$
$$(12.25)\qquad \le \sigma^{-2}\, \| \varphi \|_2^2\, Q\, \big( 1 + o_n(1) \big) \le 2\, \sigma^{-2}\, \| \varphi \|_2^2\, Q.$$
In the above, o_{q,n}(1) → 0 as n → ∞ uniformly in q, 1 ≤ q ≤ Q. Also, we
bounded |ω_q′ − ω_q″| by 1, and used the fact that the Riemann sum of the
design density approximates the integral
$$Q^{-1} \sum_{q=1}^{Q} p(c_q) = \int_0^1 p(x)\, dx + o_n(1) = 1 + o_n(1).$$
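The Riemann-sum approximation of the design density used above is easy to check numerically. A sketch with an arbitrary strictly positive density of our choosing (not from the book):

```python
import numpy as np

Q = 500
c = (np.arange(1, Q + 1) - 0.5) / Q            # bin centers c_q in [0, 1]

def p(x):
    """A continuous, strictly positive density on [0, 1] integrating to 1."""
    return 1 + 0.5 * np.cos(2 * np.pi * x)

riemann = np.mean(p(c))                        # Q^{-1} * sum_q p(c_q)
```

For this density the Riemann sum over the bin centers is already within o(1) of the integral, which equals 1.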
Next, we have to discuss the separation condition (12.15). For any two
test functions, the L_2-norm of the difference is easy to find,
$$(12.26)\qquad \big\| f(\cdot, \omega') - f(\cdot, \omega'') \big\|_2^2 = \frac{1}{n}\, \| \varphi \|_2^2 \sum_{q=1}^{Q} | \omega_q' - \omega_q'' |.$$
At this point, we need a result which will be proved at the end of this section.
Lemma 12.14 (Varshamov-Gilbert). For all Q large enough, there exists
a subset Ω_0 ⊂ Ω with the number of elements no less than 1 + e^{Q/8},
and such that for any ω′, ω″ ∈ Ω_0, the following inequality holds:
$$\sum_{q=1}^{Q} | \omega_q' - \omega_q'' | \ge Q/8.$$
Continuing with the example, let M = e^{Q/8}. From Lemma 12.14 and
(12.26), we see that there exist M + 1 test functions such that for any two
of them,
$$\big\| f(\cdot, \omega') - f(\cdot, \omega'') \big\|_2^2 \ge \frac{Q}{8n}\, \| \varphi \|_2^2 = (2 d_0)^2$$
where
$$d_0 = \frac{1}{2} \sqrt{\frac{Q}{8n}}\; \| \varphi \|_2 = \frac{1}{2} \sqrt{\frac{1}{16\, h_n^*\, n}}\; \| \varphi \|_2 = \frac{1}{8}\, \| \varphi \|_2\, (h_n^*)^\beta.$$
Hence the condition (12.15) is fulfilled with this d_0. We arbitrarily choose
f_0 = f(t, ω^0) for some ω^0 ∈ Ω_0, and take M as the set of the remaining
functions with ω ∈ Ω_0. In this case, |M| = M = e^{Q/8}.
$$P\Big( \bigcap_{0 \le l < m \le M} \Big\{ \sum_{q=1}^{Q} | \omega_{q,l} - \omega_{q,m} | \ge Q/8 \Big\} \Big) = 1 - P\Big( \bigcup_{0 \le l < m \le M} \Big\{ \sum_{q=1}^{Q} | \omega_{q,l} - \omega_{q,m} | < Q/8 \Big\} \Big)$$
$$\ge 1 - \frac{M(M+1)}{2}\, P\Big( \sum_{q=1}^{Q} \xi_q < Q/8 \Big) \ge 1 - e^{Q/4}\, P\Big( \sum_{q=1}^{Q} \xi_q < Q/8 \Big) \ge 1 - e^{Q/4}\, P\Big( \sum_{q=1}^{Q} \bar\xi_q > (3/8)\, Q \Big)$$
where we denoted by ξ̄_q = 1/2 − ξ_q the random variables that take on values
±1/2 with equal probabilities.
Further, Chernoff's inequality P(X ≥ a) ≤ e^{−za} E[e^{zX}], a, z > 0, ensures
that for any positive z,
$$P\Big( \sum_{q=1}^{Q} \bar\xi_q > (3/8)\, Q \Big) \le \Big( E \exp\{ z\, \bar\xi_q \} \Big)^{Q} \exp\big\{ -(3/8)\, z\, Q \big\}.$$
The moment generating function of ξ̄_q satisfies the inequality (see Exercise
12.85)
$$(12.27)\qquad E \exp\{ z\, \bar\xi_q \} = \frac{1}{2}\big( \exp\{z/2\} + \exp\{-z/2\} \big) \le \exp\{ z^2/8 \}.$$
Take z = 3/2. Then
$$P\Big( \sum_{q=1}^{Q} \bar\xi_q > (3/8)\, Q \Big) \le \exp\big\{ -(9/32)\, Q \big\}.$$
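The moment generating function bound (12.27) and the exponent obtained with z = 3/2 can both be verified numerically; a small sketch (the grid is our choice):

```python
import numpy as np

z = np.linspace(-20, 20, 4001)
mgf = 0.5 * (np.exp(z / 2) + np.exp(-z / 2))   # E exp{z * xi_bar} = cosh(z/2)
bound = np.exp(z ** 2 / 8)                      # the Hoeffding-type bound in (12.27)
gap = bound - mgf                               # should be nonnegative everywhere

# Chernoff exponent at z = 3/2: z^2/8 - (3/8) z = -9/32.
exponent = (1.5 ** 2) / 8 - (3 / 8) * 1.5
```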
Hence,
$$P\Big( \bigcap_{0 \le l < m \le M} \Big\{ \sum_{q=1}^{Q} | \omega_{q,l} - \omega_{q,m} | \ge Q/8 \Big\} \Big) \ge 1 - \exp\Big\{ \Big( \frac{1}{4} - \frac{9}{32} \Big) Q \Big\} > 0.$$
This proves the lemma, because an event that occurs with a positive probability
is non-empty.
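The probabilistic argument behind the Varshamov-Gilbert construction can be illustrated by direct sampling. A minimal sketch: the values Q = 64 and 20 vectors are our choices for illustration (far fewer than the e^{Q/8} the lemma guarantees), and with random ±-coordinates the minimum pairwise distance comfortably exceeds Q/8:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, M = 64, 19                       # Q coordinates; M + 1 = 20 random binary vectors
omega = rng.integers(0, 2, size=(M + 1, Q))

# Pairwise Hamming distances sum_q |omega_{q,l} - omega_{q,m}|.
ham = np.array([[np.sum(np.abs(omega[l] - omega[m]))
                 for m in range(M + 1)] for l in range(M + 1)])
iu = np.triu_indices(M + 1, k=1)
min_dist = ham[iu].min()
```

By the lemma's union bound, the chance that any pair falls below Q/8 is exponentially small in Q, so a random draw "exists" as required.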
Exercises
Exercise 12.86. Let the design X be equidistant, that is, with the design
points xi = i/n. Show by giving an example that the following lower bound
is false. For any large c there exists a positive p0 independent of n such that
for all large n, the following inequality holds:
$$\inf_{\hat f_n}\, \sup_{f \in \Theta(\beta)} P_f\Big( \| \hat f_n - f \|_2 \ge c\, n^{-\beta/(2\beta+1)} \,\Big|\, X \Big) \ge p_0.$$
Hint: Consider the case β = 1, and let fn∗ be a piecewise constant estimator
in the bins. Show that the above probability goes to zero as n increases.
Part 3
Estimation in
Nonparametric Models
Chapter 13
Estimation of
Functionals
where w(x) is a given Lipschitz function, called the weight function, and f =
f (x) is an unknown regression observed with noise as in (13.1). Along with
the integral notation, we will use the dot product notation, Ψ(f) = (w, f),
and write ‖w‖²₂ = ∫₀¹ w²(x) dx.
Note that Ψ(f ) defined by (13.2) is a linear functional, that is, for any
f1 and f2 , and any constants k1 and k2 , the following identity holds:
$$\Psi(k_1 f_1 + k_2 f_2) = \int_0^1 w(x)\, \big[ k_1 f_1(x) + k_2 f_2(x) \big]\, dx$$
$$(13.3)\qquad = k_1 \int_0^1 w(x) f_1(x)\, dx + k_2 \int_0^1 w(x) f_2(x)\, dx = k_1 \Psi(f_1) + k_2 \Psi(f_2).$$
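The linearity (13.3) is immediate from the linearity of the integral, and a quadrature check makes it concrete. A minimal sketch with a weight function and test functions of our own choosing:

```python
import numpy as np

x = np.linspace(0, 1, 100_001)
dx = x[1] - x[0]

def integral(y):
    """Trapezoid rule on [0, 1]."""
    return dx * (y.sum() - 0.5 * (y[0] + y[-1]))

w = 1 + x                               # a Lipschitz weight function (our choice)
Psi = lambda f: integral(w * f)         # Psi(f) = (w, f)

f1, f2 = np.sin(np.pi * x), x ** 2
k1, k2 = 2.5, -0.7
lhs = Psi(k1 * f1 + k2 * f2)
rhs = k1 * Psi(f1) + k2 * Psi(f2)
```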
Proof. The statement of the lemma is the Hajek-LeCam lower bound. Note
that the Fisher information in the non-homogeneous Gaussian model is
equal to I_n = Σ_{i=1}^n (μ_i/σ_i)², and the log-likelihood ratio is normal non-
asymptotically,
$$L_n(\theta) - L_n(0) = \ln \prod_{i=1}^{n} \frac{ f(X_i, \theta) }{ f(X_i, 0) } = \sqrt{I_n}\, Z_{0,1}\, \theta - \frac{1}{2}\, I_n\, \theta^2,$$
where Z_{0,1} is a standard normal random variable with respect to the true
distribution P_0. Thus, as in the Hajek-LeCam case, we have the lower bound
$$\liminf_{n \to \infty}\, \sup_{\theta \in \mathbb{R}}\, E_\theta\big[ I_n\, (\hat\theta_n - \theta)^2 \big] \ge 1.$$
Hence for this family of regression functions, the estimation of Ψ(f) is
equivalent to estimation of θ from the following observations:
$$(13.5)\qquad y_i = \theta\, w(i/n) / \| w \|_2^2 + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \quad i = 1, \dots, n.$$
Theorem 13.5. For any estimator Ψ̃_n from the observations (13.5), the
following asymptotic lower bound holds:
$$(13.6)\qquad \liminf_{n \to \infty}\, \sup_{f(\cdot,\theta) \in \Theta} E_{f(\cdot,\theta)} \Big[ \Big( \sqrt{n}\, \big( \tilde\Psi_n - \Psi(f(\cdot,\theta)) \big) \Big)^2 \Big] \ge \sigma^2\, \| w \|_2^2.$$
$$I_n = \frac{n}{\sigma^2 \| w \|_2^2} \cdot \frac{1}{n} \sum_{i=1}^{n} \frac{ w^2(i/n) }{ \| w \|_2^2 } = \frac{ n\, (1 + o_n(1)) }{ \sigma^2\, \| w \|_2^2 }.$$
Here we used the fact that the latter sum is the Riemann sum for the integral
∫₀¹ w²(x)/‖w‖²₂ dx = 1. Thus, from Lemma 13.4, the lower bound follows:
$$\liminf_{n \to \infty}\, \sup_{f(\cdot,\theta) \in \Theta} E_{f(\cdot,\theta)} \Big[ I_n \big( \tilde\Psi_n - \Psi(f(\cdot,\theta)) \big)^2 \Big] = \liminf_{n \to \infty}\, \sup_{\theta \in \mathbb{R}}\, \frac{ n\, (1 + o_n(1)) }{ \sigma^2\, \| w \|_2^2 }\, E_\theta\big[ ( \tilde\Psi_n - \theta )^2 \big] \ge 1,$$
which is equivalent to (13.6).
The normalized difference of the sum and the integral in (13.12) is normal
with the expectation going to zero as n → ∞, and the variance that,
conditionally on f_n^*, is equal to
$$\frac{\sigma^2}{n} \sum_{i \in J_2} w^2(i/n, f_n^*) = \frac{\sigma^2}{n} \sum_{i=1}^{n} w^2(i/n, f_n^*) + O(m/n) = \sigma^2 \int_0^1 w^2(x, f_n^*)\, dx + o_n(1).$$
Here we used the fact that m/n = n^{α−1} = o_n(1) → 0 as n → ∞. By the
assumption, the weight function w(·, f_0) is continuous in f_0, and
$$\int_0^1 w^2(x, f_n^*)\, dx \;\to\; \int_0^1 w^2(x, f)\, dx = \| w(\cdot, f) \|_2^2$$
as ‖f_n^* − f‖_2 → 0. Hence, the result of the theorem follows.
Remark 13.9. If the preliminary estimator f_n^* in (13.11) satisfies the
condition
$$n\, E_f \| f_n^* - f \|_4^4 = n\, E_f \int_0^1 \big( f_n^*(x) - f(x) \big)^4\, dx \;\to\; 0$$
as n → ∞, then the estimator Ψ_n^* converges in the mean square sense as
well (see Exercise 13.90),
$$(13.13)\qquad \lim_{n \to \infty} E_f \Big[ \Big( \sqrt{n}\, \big( \Psi_n^* - \Psi(f) \big) \Big)^2 \Big] = \sigma^2\, \| w(\cdot, f) \|_2^2.$$
Exercises
Exercise 13.89. Show that Ψ(f) = ∫₀¹ f⁴(x) dx is a differentiable functional
of the regression function f ∈ Θ(β, L, L₁), β ≥ 1.
Dimension and
Structure in
Nonparametric
Regression
(14.1) yi = f (xi ) + εi , i = 1, . . . , n,
located strictly inside the unit cube [0, 1]d . The asymptotically minimax rate
of convergence ψn , defined by (8.4) and (8.5), is attached to a Hölder class
Θ(β, L, L1 ) of regression functions. Thus, we have to extend the definition
of the Hölder class to the multivariate case. The direct extension via the
derivatives as in the one-dimensional case is less convenient since we would
have to deal with all mixed derivatives of f up to a certain order. A more
fruitful approach is to use the formula (8.14) from Lemma 8.5 as a guideline.
Let β be an integer, β ≥ 1, and let · denote the Euclidean norm
in Rd . A function f (x), x ∈ [0, 1]d , is said to belong to a Hölder class of
functions Θ(β) = Θ(β, L, L1 ) if: (i) there exists a constant L1 > 0 such that
maxx∈[0,1]d |f (x)| ≤ L1 , and (ii) for any x0 ∈ [0, 1]d there exists a polynomial
p(x) = p(x, x0 , f ) of degree β − 1 such that
$$| f(x) - p(x, x_0, f) | \le L\, \| x - x_0 \|^\beta.$$
For any x in the hypercube H_n centered at x_0, we rescale this polynomial
to get π((x − x_0)/h_n, θ). Suppose there are N pairs of observations (x_i, y_i)
such that the design points belong to Hn . Without loss of generality, we may
assume that these are the first N observations, x1 , . . . , xN ∈ Hn . The vector
θ can be estimated by the method of least squares. The estimator θ̂ is the
solution of the minimization problem (cf. (9.1)),
$$(14.2)\qquad \sum_{i=1}^{N} \Big( y_i - \pi\big( (x_i - x_0)/h_n,\, \hat\theta \big) \Big)^2 \;\to\; \min_{\hat\theta}.$$
Proof. From the definition of the Hölder class Θ(β, L, L1 ), we obtain (cf.
(9.3)),
yi = p(xi , x0 , f ) + ρ(xi , x0 , f ) + εi , i = 1, . . . , n,
where the remainder term satisfies
$$| \rho(x_i, x_0, f) | \le L\, \| x_i - x_0 \|^\beta \le L\, \big( \sqrt{d}\, h_n \big)^\beta.$$
Repeating the proofs of Lemmas 9.1 and 9.3, we find that the least-
squares estimator θ̂ actually estimates the vector of coefficients of the poly-
nomial p(x, x0 , f ) in the above approximation. The deterministic error here
does not exceed Cb hβn , and the zero-mean normal stochastic term has
the variance that is not larger than Cv /N. By definition, the zero-order term
of the approximation polynomial θ0 is equal to f (x0 ). Hence, the estimate
of the intercept
θ̂0 = θ0 + b0 + N0 = f (x0 ) + b0 + N0
Finally, we have arrived at the point where the influence of the higher
dimension shows up. In Section 9.1, to obtain the minimax rate of con-
vergence, we needed Assumption 9.5 which helped to control the stochastic
term. This assumption required that the number N of the design points in
the hn -neighborhood of the given point x0 is proportional to nhn . Clearly,
this assumption was meant to meet the needs of regular designs, determin-
istic or random. So, the question arises: How many design points can we
anticipate in the regular cases in the d-dimensional Hn -neighborhood of x0 ?
Simple geometric considerations show that at best we can rely on a number
proportional to the volume of Hn .
Assumption 14.3. There exists a positive constant γ1 , independent of n,
such that for all large enough n, the inequality N ≥ γ1 nhdn holds.
Proof. Analogously to the proof of Theorem 9.6, from Proposition 14.2, the
upper bound of the quadratic risk holds uniformly in f ∈ Θ(β, L, L1 ),
$$E_f\Big[ \big( \pi(0, \hat\theta) - f(x_0) \big)^2 \,\Big|\, X \Big] \le C_b^2\, h_n^{2\beta} + \frac{C_v}{N} \le C_b^2\, h_n^{2\beta} + \frac{C_v}{\gamma_1\, n\, h_n^d}.$$
The balance equation in the d-dimensional case has the form (hn )2β =
1/(nhdn ). The optimal choice of the bandwidth is hn = h∗n = n−1/(2β+d) , and
the respective rate of convergence is (h∗n )β = n−β/(2β+d) .
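The balance equation and the resulting curse of dimensionality can be verified with a few lines. A minimal sketch (the specific n and β are arbitrary choices of ours):

```python
n = 10_000
beta = 2

def optimal_bandwidth(n, beta, d):
    """h_n^* = n^{-1/(2*beta + d)}, balancing squared bias and variance."""
    return n ** (-1 / (2 * beta + d))

def rate(n, beta, d):
    """Minimax rate n^{-beta/(2*beta + d)}; it degrades as the dimension d grows."""
    return n ** (-beta / (2 * beta + d))

h = optimal_bandwidth(n, beta, d=3)
balance_lhs = h ** (2 * beta)        # squared-bias order h^{2*beta}
balance_rhs = 1 / (n * h ** 3)       # variance order 1/(n h^d) for d = 3
```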
Let x0 be a fixed point strictly inside the unit square. Without loss of
generality, we will assume that this point coincides with one of the design
knots, x0 = (i0 /m, j0 /m). Clearly, we could treat the model of observations
(14.4) as a two-dimensional regression. The value of the regression function
f (x0 ) at x0 can be estimated with the rate n−β/(2β+2) suggested by Theorem
14.4 for d = 2. A legitimate question at this point is whether it is possible
to estimate f (x0 ) with a faster rate exploiting the specific structure of the
model. In particular, is it possible to attain the one-dimensional rate of
convergence n−β/(2β+1) ? As the following proposition shows, the answer to
this question is affirmative.
Proposition 14.5. In the additive regression model (14.4)-(14.6) at any
point x0 = (i0 /m, j0 /m), there exists an estimator fˆn (x0 ) such that
$$\sup_{f_1, f_2 \in \Theta(\beta, L, L_1)} E_f\Big[ \big( \hat f_n(x_0) - f(x_0) \big)^2 \Big] \le r^*\, n^{-2\beta/(2\beta+1)}.$$
Proof. Select the bandwidth h∗n = n−1/(2β+1) as if the model were one-
dimensional. Consider the set of indices
In = In (i0 /m) = i : |i/m − i0 /m| ≤ h∗n .
The number N of indices in the set I_n is equal to N = |I_n| = 2⌊m h_n^*⌋ + 1.
Note that m h_n^* = √n · n^{−1/(2β+1)} → ∞, and hence N ∼ 2 m h_n^* as n → ∞.
To estimate f_1 at i_0/m, consider the means
$$\bar y_{i\,\cdot} = \frac{1}{m} \sum_{j=1}^{m} y_{ij} = f_1(i/m) + \frac{1}{m} \sum_{j=1}^{m} f_2(j/m) + \frac{1}{m} \sum_{j=1}^{m} \varepsilon_{ij}$$
$$(14.7)\qquad = f_1(i/m) + \delta_n + \frac{1}{\sqrt{m}}\, \bar\varepsilon_i, \qquad i \in I_n,$$
where the deterministic error
$$\delta_n = \frac{1}{m} \sum_{j=1}^{m} f_2(j/m)$$
is the Riemann sum for the integral ∫₀¹ f₂(t) dt = 0, and the random variables
$$\bar\varepsilon_i = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$$
Now we can replace the observations yij in (14.4) by yij − fˆ0 , and use these
shifted observations to estimate f1 and f2 as done above. Then the statement
of Proposition 14.5 would stay unchanged (see Exercise 14.93).
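The row-averaging step (14.7) is easy to watch in a simulation. A sketch under assumptions of our own: specific components f₁, f₂ (the latter integrating to zero over the grid), σ = 1, and a fixed random seed:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200                                    # grid size; n = m^2 observations
sigma = 1.0
x = np.arange(1, m + 1) / m

f1 = np.sin(2 * np.pi * x)                 # component of interest
f2 = np.cos(2 * np.pi * x)                 # sums to zero over the grid j/m
y = f1[:, None] + f2[None, :] + sigma * rng.standard_normal((m, m))

ybar = y.mean(axis=1)                      # row means, as in (14.7)
err = ybar - f1                            # = delta_n + noise of sd sigma/sqrt(m)
```

Averaging over j reduces the noise standard deviation from σ to σ/√m, which is what lets the one-dimensional rate survive.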
This class is well defined because the variable t = x^{(1)} cos α + x^{(2)} sin α
belongs to the interval 0 ≤ t ≤ √2. The functions in H, if rotated at a
proper angle, depend on a single variable, and are monotonically increasing
in the corresponding direction. The point
(14.12) tij = (i/m) cos α + (j/m) sin α
is the projection (show!) of xij = (i/m, j/m) onto the straight line pass-
ing through the origin at the angle α (see Figure 11). If we knew α, we
could compute the projections tij , and the problem would become one-
dimensional.
[Figure 11. Projection of the design knot x_ij on the line passing through
the origin at the angle α.]
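That t_ij in (14.12) is indeed the orthogonal projection of x_ij onto the line at angle α (the "show!" above) can be checked directly: the residual x_ij − t_ij·e must be orthogonal to the line's unit vector e. A minimal sketch with arbitrary values of our choosing:

```python
import numpy as np

alpha = np.pi / 6
e = np.array([np.cos(alpha), np.sin(alpha)])   # unit vector of the line

xij = np.array([0.3, 0.8])                     # a design knot (i/m, j/m)
tij = xij @ e                                  # t_ij = (i/m) cos a + (j/m) sin a
proj = tij * e                                 # projected point on the line

residual = xij - proj                          # should be orthogonal to e
```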
14.3. Single-Index Model 201
$$w_l(x^{(1)}, x^{(2)}) = -\, \frac{ \partial \varphi(x^{(1)}, x^{(2)}) }{ \partial x^{(l)} }, \qquad l = 1 \text{ or } 2.$$
The outside-of-the-integral term in (14.17) vanishes due to the boundary
condition (14.15).
Thus, (14.13) and (14.14) along with (14.17) yield the equations
with
$$\Phi_0 = \Phi_0(f) = \int_0^1 \!\! \int_0^1 \varphi(x^{(1)}, x^{(2)})\, g'\big( x^{(1)} \cos\alpha + x^{(2)} \sin\alpha \big)\, dx^{(1)} dx^{(2)}.$$
(14.19) Φ0 (f ) ≥ Φ∗ > 0.
$$(14.20)\qquad \hat\Phi_n^{(l)} = \frac{1}{n} \sum_{i,j=1}^{m} w_l(i/m, j/m)\, y_{ij}, \qquad l = 1 \text{ or } 2.$$
Note that the ratio Φ̂_n^{(2)}/Φ̂_n^{(1)} can be arbitrarily large, positive or negative.
Thus, the range of α̂_n runs from −π/2 to π/2, whereas the range of the true α
is [0, π/4]. Next, we want to show that the values of α̂_n outside of the interval
[0, π/4] are possible only due to large deviations, and the probability of
this event is negligible if n is large. As the following proposition shows, the
estimator α̂_n is √n-consistent with rapidly decreasing probabilities of large
deviations. The proof of this proposition is postponed to the next section.
and
ûij = −(i/m) sin α̂n + (j/m) cos α̂n .
Let the respective projections of the fixed point x0 = (i0 /m, j0 /m) be de-
noted by t̂0 and û0 . Introduce T , a rectangle in the new coordinates (see
Figure 12),
T = { (t, u) : |t − t̂_0| ≤ h_n^*, |u − û_0| ≤ H }.
[Figure 12. Rectangle T in the coordinate system rotated by the angle α̂_n.]
Proposition 14.9. For any design knot x_ij = (i/m, j/m) ∈ T, the observation
ỹ_ij in (14.10) admits the representation
$$\tilde y_{ij} = g(\hat t_{ij}) + \rho_{ij} + \tilde\varepsilon_{ij},$$
with the remainder term ρ_ij being independent of the random variable ε̃_ij,
and satisfying the inequality |ρ_ij| ≤ 2 L_0 |α̂_n − α|.
Proof. Put ρij = g(tij ) − g(t̂ij ). By definition, ρij depends only on the first
sub-sample yij of observations in (14.10), and hence is independent of ε̃ij .
We have ỹij = g(t̂ij ) + ρij + ε̃ij . For any knot (i/m, j/m), we obtain
$$| \rho_{ij} | = | g(t_{ij}) - g(\hat t_{ij}) | = \big| g\big( (i/m) \cos\alpha + (j/m) \sin\alpha \big) - g\big( (i/m) \cos\hat\alpha_n + (j/m) \sin\hat\alpha_n \big) \big|$$
$$\le L_0 \big( (i/m)\, | \cos\hat\alpha_n - \cos\alpha | + (j/m)\, | \sin\hat\alpha_n - \sin\alpha | \big) \le L_0\, (i/m + j/m)\, | \hat\alpha_n - \alpha | \le 2 L_0\, | \hat\alpha_n - \alpha |.$$
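The elementary Lipschitz facts used here, |cos b − cos a| ≤ |b − a| and |sin b − sin a| ≤ |b − a|, can be checked over a random sample of angle pairs; a sketch with parameters of our choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.uniform(0, np.pi / 4, size=1000)       # true angle alpha
b = a + rng.uniform(-0.1, 0.1, size=1000)      # estimate alpha_hat

lhs = np.abs(np.cos(b) - np.cos(a)) + np.abs(np.sin(b) - np.sin(a))
rhs = 2 * np.abs(b - a)                        # the bound 2*L0*|a_hat - a| with L0 = 1
```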
Note that the number of design knots N(T), and the elements of the matrix
G′G, are random because they depend on the estimator α̂_n.
Lemma 14.10. (i) Uniformly in α̂_n, the number of design knots N(T) satisfies
$$\lim_{n \to \infty} \frac{ N(T) }{ 4 H n h_n^* } = 1.$$
(ii) The normalized elements of the matrix G′G have the limits
$$\lim_{n \to \infty} \frac{1}{N(T)}\, \big( G' G \big)_{k,l} = \frac{ 1 - (-1)^{k+l+1} }{ 2(k+l+1) },$$
and the limiting matrix is invertible.
Proof. Proposition 14.9 and Lemma 14.10 allow us to apply the expansion
similar to Proposition 9.4. In the case under consideration, this expansion
takes the form
(14.21) θ̂0 = f (x0 ) + b0 + N0 .
Conditionally on the first sub-sample of observations in (14.10), the bias
term b0 admits the bound
| b0 | ≤ Cb (h∗n )β + Ca max |ρij | ≤ Cb (h∗n )β + 2L0 Ca |α̂n − α|
1≤i, j≤m
where
$$\rho_n^{(l)} = \sqrt{n} \sum_{i,j=1}^{m} \int_{(i-1)/m}^{i/m} \!\! \int_{(j-1)/m}^{j/m} \Big[ w_l(i/m, j/m)\, f(i/m, j/m) - w_l(x_1, x_2)\, f(x_1, x_2) \Big]\, dx_2\, dx_1$$
and
$$\eta_n^{(l)} = \frac{1}{\sqrt{n}} \sum_{i,j=1}^{m} w_l(i/m, j/m)\, \varepsilon_{ij}.$$
The variance of the normal random variable η_n^{(l)} is equal to
$$\mathrm{Var}\big[ \eta_n^{(l)} \big] = \frac{\sigma^2}{n} \sum_{i,j=1}^{m} w_l^2(i/m, j/m) \;\xrightarrow[n \to \infty]{}\; \sigma^2 \int_0^1 \!\! \int_0^1 w_l^2(x_1, x_2)\, dx_2\, dx_1 < C_v^2 < \infty.$$
The deterministic remainder term ρ_n^{(l)} admits the upper bound
$$\big| \rho_n^{(l)} \big| \le L_0\, m \sum_{i,j=1}^{m} \int_{(i-1)/m}^{i/m} \!\! \int_{(j-1)/m}^{j/m} \Big( | x_1 - i/m | + | x_2 - j/m | \Big)\, dx_2\, dx_1 = L_0\, m \sum_{i,j=1}^{m} \frac{1}{m^3} = L_0.$$
Proof. From Lemma 14.12, the random variable η_n^{(l)}, l = 1 or 2, is a zero-mean
normal with the variance satisfying Var_f[η_n^{(l)}] ≤ C_v². Therefore, if
y ≥ C_v, then uniformly in f ∈ H, we have
$$P_f\big( | \eta_n^{(l)} | > y \big) \le 2 \exp\big\{ -y^2/(2 C_v^2) \big\}, \qquad l = 1 \text{ or } 2.$$
Hence, with probability at least 1 − 4 exp{−y²/(2C_v²)}, we can assume
that |η_n^{(l)}| ≤ y for both l = 1 and 2 simultaneously. Under these conditions,
in view of (14.18) and Lemma 14.12, we obtain that
(2) √ (1) √
(2) (1) cos α(ρ(2) + η )/ n − sin α(ρ
(1)
+ η )/ n
Φ̂n /Φ̂n − tan α = n
n n n
(1) (1) √
cos α Φ0 (f ) cos α + (ρn + ηn )/ n
√
2(cos α + sin α)(Cρ + y) 2(C + y)
≤ ≤
ρ
Φ∗ n/2 − (Cρ + y) Φ∗ n/2 − (Cρ + y)
√
where we used the facts that cos √ α ≥ 1/ 2 since 0 ≤ α ≤ π/4, and, at the
last step, that sin α + cos α ≤ 2.
Further, by our assumption, Cρ ≤ y and 2y ≤ (1/2)Φ∗ n/2, therefore,
(2) (1)
Φ̂n /Φ̂n − tan α ≤ 4y 8y 12 y
Φ n/2 − 2y ≤ Φ n/2 ≤ Φ∗ √n .
∗ ∗
where we applied the mean value theorem with some α_* between α̂_n and α. Thus,

P_f( |α̂_n − α| > x/√n ) ≤ P_f( | Φ̂_n^{(2)} / Φ̂_n^{(1)} − tan α | > x/√n ) ≤ 4 exp{ −a_0 x² }.
Proof of Lemma 14.10. (i) For every design knot (i/m, j/m), consider a square, which we call a pixel, [(i−1)/m, i/m] × [(j−1)/m, j/m]. Let T_* be the union of the pixels that lie strictly inside T, and let T* be the minimum union of the pixels that contains T, that is, the union of pixels whose intersections with T are non-empty. The diameter of each pixel is √2/m = √(2/n), and its area is equal to 1/n. That is why the number N(T_*) of the pixels in T_* is no less than 4n( H − √(2/n) )( h*_n − √(2/n) ), and the number N(T*) of the pixels in T* does not exceed 4n( H + √(2/n) )( h*_n + √(2/n) ). Since 1/√n = o(h*_n), we find that

1 ≤ lim inf_{n→∞} N(T_*) / (4 H n h*_n) ≤ lim sup_{n→∞} N(T*) / (4 H n h*_n) ≤ 1.

Since

N(T_*) ≤ N(T) ≤ N(T*),

we can apply the squeezing theorem to conclude that the variable N(T) also has the same limit,

lim_{n→∞} N(T) / (4 H n h*_n) = 1.
(ii)

(1/N(T)) ( Gᵀ G )_{k,l} ∼ ( 1/(4 H n h*_n) ) Σ_{(i/m, j/m) ∈ T} ( ( t̂_ij − t̂_0 )/h*_n )^{k+l}.
Exercises
Adaptive Estimation
Consider the nonparametric regression observations

y_i = f(i/n) + ε_i,  i = 1, …, n,
where εi are standard normal random variables. Since the design is not the
focus of our current interest, we work with the simplest equidistant design.
We assume that the smoothness parameter can take on only two values
β1 and β2 such that 1 ≤ β1 < β2 . Thus, we assume that the regression func-
tion f belongs to one of the two Hölder classes, either Θ(β1 ) = Θ(β1 , L, L1 )
or Θ(β2 ) = Θ(β2 , L, L1 ).
Thus, u_n E_{f_0}[ h_n^{−2β_1} f̃_n² ] − δ < 0. Put T_n = h_n^{−β_1} f̃_n. We obtain

sup_{f ∈ Θ(β_1)} E_f[ h_n^{−2β_1} ( f̃_n − f(x_0) )² ] ≥ E_{f_1}[ h_n^{−2β_1} ( f̃_n − f_1(x_0) )² ]
 ≥ E_{f_1}[ h_n^{−2β_1} ( f̃_n − f_1(x_0) )² ] + u_n E_{f_0}[ h_n^{−2β_1} f̃_n² ] − δ
 = E_{f_1}[ (T_n − ϕ(0))² ] + u_n E_{f_0}[ T_n² ] − δ.
Finally, we want to show that the right-hand side is separated away from zero by a positive constant independent of n. Introduce the likelihood ratio

Λ_n = dP_{f_0}/dP_{f_1} = exp{ − Σ_{i=1}^n f_1(i/n) ξ_i − (1/2) Σ_{i=1}^n f_1²(i/n) },

where ξ_i = y_i − f_1(i/n), i = 1, …, n, are independent standard normal random variables with respect to the P_{f_1}-distribution. As in the proof of Theorem 9.16, we get

σ_n² = Σ_{i=1}^n f_1²(i/n) = ‖ϕ‖_2² n h_n^{2β_1+1} ( 1 + o_n(1) ) = ‖ϕ‖_2² (c ln n) ( 1 + o_n(1) ),
where o_n(1) → 0 as n → ∞. Introduce a standard normal random variable Z_n = σ_n^{−1} Σ_{i=1}^n f_1(i/n) ξ_i. Then the likelihood ratio takes the form

Λ_n = exp{ − σ_n Z_n − (1/2) ‖ϕ‖_2² (c ln n) ( 1 + o_n(1) ) }.
Note that if the random event { Z_n ≤ 0 } holds, then

Λ_n ≥ exp{ − (1/2) ‖ϕ‖_2² (c ln n) ( 1 + o_n(1) ) } ≥ n^{−c_1}

for all large n, where c_1 = c ‖ϕ‖_2². From the definition of the likelihood ratio, we obtain the lower bound
sup_{f ∈ Θ(β_1)} E_f[ h_n^{−2β_1} ( f̃_n − f(x_0) )² ] ≥ E_{f_1}[ (T_n − ϕ(0))² + u_n Λ_n T_n² ] − δ
 ≥ E_{f_1}[ (T_n − ϕ(0))² + u_n n^{−c_1} T_n² I(Z_n ≤ 0) ] − δ.

Now we choose c so small that c_1 = c ‖ϕ‖_2² < 2( a_0 − β_1/(2β_1 + 1) ). Then, by (15.4), u_n n^{−c_1} increases and exceeds 1 if n is sufficiently large. Hence,

sup_{f ∈ Θ(β_1)} E_f[ h_n^{−2β_1} ( f̃_n − f(x_0) )² ] ≥ E_{f_1}[ (T_n − ϕ(0))² + T_n² I(Z_n ≤ 0) ] − δ
 ≥ E_{f_1}[ I(Z_n ≤ 0) ( (T_n − ϕ(0))² + T_n² ) ] − δ
 ≥ E_{f_1}[ I(Z_n ≤ 0) ( (−ϕ(0)/2)² + (ϕ(0)/2)² ) ] − δ
 ≥ (1/2) ϕ²(0) P_{f_1}( Z_n ≤ 0 ) − δ = (1/4) ϕ²(0) − δ = r_*,
15.2. Adaptive Estimator in the Sup-Norm
where Z ∼ N(0, 1). Now, since P( |Z| ≥ x ) ≤ exp{ −x²/2 } for any x ≥ 1, we arrive at the upper bound

(15.8)  P( Z*_β ≥ y √(2β² ln n) ) ≤ Q β n^{−y²} ≤ β n^{−(y²−1)}.

Here we have used the fact that the number of bins Q = 1/(2 h*_{n,β}) ≤ n for all large enough n.
Theorem 15.4. There exists a constant C in the definition (15.6) of the
adaptive estimator f˜n such that the adaptive risk AR∞ (f˜n ) specified by (15.5)
satisfies the upper bound
(15.9) AR∞ (f˜n ) ≤ r∗
with a constant r∗ independent of n.
Proof. Denote the random event in the definition of the adaptive estimator f̃_n by

C = { ‖ f*_{n,β_1} − f*_{n,β_2} ‖_∞ ≥ C ( h*_{n,β_1} )^{β_1} }.
If f ∈ Θ(β_1), then

(15.10)  ( n/(ln n) )^{β_1/(2β_1+1)} E_f[ ‖ f̃_n − f ‖_∞ ]
 ≤ ( h*_{n,β_1} )^{−β_1} E_f[ ‖ f*_{n,β_1} − f ‖_∞ I(C) ] + ( h*_{n,β_1} )^{−β_1} E_f[ ‖ f*_{n,β_2} − f ‖_∞ I(C̄) ]
‖ f ‖_{2,n}² = n^{−1} Σ_{i=1}^n f²(i/n).
Next, we take two integers β_1 and β_2 such that 1 ≤ β_1 < β_2, and consider two sets in the sequence space

Θ_{2,n}(β) = Θ_{2,n}(β, L) = { (c_0, …, c_{n−1}) : Σ_{k=0}^{n−1} c_k² k^{2β} ≤ L },  β ∈ { β_1, β_2 }.
We associate Θ2,n (β) with the smoothness parameter β because the decrease
rate of Fourier coefficients controls the smoothness of the original regression
function (cf. Lemma 10.13.)
As shown in Theorem 10.17, for a known β, uniformly in c ∈ Θ2,n (β),
the risk Rn (ĉ, c) = O(n−2β/(2β+1)) as n → ∞. The rate-optimal estimator
is the projection estimator, which can be defined as

ĉ = ( z_0, …, z_M, 0, …, 0 ),

where M = M_β = ⌊ n^{1/(2β+1)} ⌋. In other words, ĉ_k = z_k if k = 0, …, M, and ĉ_k = 0 for k ≥ M + 1.
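As a numerical aside, which is not part of the text, the projection estimator is essentially one line of code. In the Python sketch below the coefficient decay, the noise level σ, and the seed are illustrative assumptions; the observations z_k = c_k + σ ξ_k/√n are simulated directly rather than computed from a regression sample.

```python
import math, random

# Illustrative sketch of the projection estimator; the toy coefficient decay,
# noise level, and seed are assumptions for this example, not from the text.
def projection_estimate(z, beta):
    # Keep the first M + 1 empirical coefficients, M = floor(n^{1/(2 beta + 1)}),
    # and zero out the rest (rate-optimal truncation for Theta_{2,n}(beta)).
    n = len(z)
    M = int(n ** (1.0 / (2 * beta + 1)))
    return [z[k] if k <= M else 0.0 for k in range(n)]

random.seed(0)
n, beta, sigma = 1000, 2, 1.0
c = [0.0] + [k ** -(beta + 1.0) for k in range(1, n)]   # smooth toy coefficients
z = [ck + sigma * random.gauss(0.0, 1.0) / math.sqrt(n) for ck in c]
c_hat = projection_estimate(z, beta)
M = int(n ** (1.0 / (2 * beta + 1)))
risk = sum((a - b) ** 2 for a, b in zip(c_hat, c))      # squared-error loss
```

Keeping only the first M + 1 coefficients trades a small bias (the discarded tail) for a large variance reduction; the resulting loss is far below that of the untruncated estimator ĉ = z.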
Now, suppose that we do not know the true smoothness of the regression function, or, equivalently, suppose that the true Fourier coefficients may belong to either class, Θ_{2,n}(β_1) or Θ_{2,n}(β_2). Can we estimate these coefficients so that the optimal rate is preserved over either class of smoothness? To make this statement more precise, we redefine the adaptive risk for the sequence space. For any estimator c̃ = (c̃_0, …, c̃_{n−1}), introduce the adaptive quadratic risk by

(15.15)  AR(c̃) = max_{β ∈ {β_1, β_2}} sup_{c ∈ Θ_{2,n}(β)} ( M_β )^{2β} E_c[ Σ_{k=0}^{n−1} ( c̃_k − c_k )² ].
We give the proof of the following theorem at the end of the present
section after we formulate some important auxiliary results.
Theorem 15.6. There exists a constant r∗ independent of n and such that
the adaptive risk (15.15) is bounded from above,
AR(c̃) ≤ r∗ .
We have to emphasize that Remark 15.5 stays valid in this case as well: we need to understand the performance of the adaptive estimator if the correct selection fails. As we will show, this performance is governed by the large-deviation probabilities of the stochastic terms. Before we prove the theorem, let us analyze the structure of the difference ΔR = R_{β_1} − R_{β_2} that controls the choice of the adaptive estimator. Put

M = { k : M_{β_2} + 1 ≤ k ≤ M_{β_1} }  and  ΔM = M_{β_1} − M_{β_2} = M_{β_1} ( 1 + o_n(1) ).
The following technical lemmas are proved in the next section.
Lemma 15.7. The difference of the risk estimates satisfies the equation

ΔR = R_{β_1} − R_{β_2} = −S_n + M_{β_1} ( 1 + o_n(1) )/n + U_n^{(1)}/n − 2 U_n^{(2)}/√n

with S_n = S_n(c) = Σ_{k∈M} c_k², and the random variables

U_n^{(1)} = Σ_{k∈M} ( ξ_k² − 1 )  and  U_n^{(2)} = Σ_{k∈M} z_k ξ̃_k.
= ( M_{β_1} )^{2β_1} E_c[ I(ΔR > 0) ( (1/n) Σ_{k=0}^{M_{β_2}} ξ_k² + S_n + Σ_{k=M_{β_1}+1}^{n−1} c_k² ) ],

where S_n is defined in Lemma 15.7. Note that

E_c[ (1/n) Σ_{k=0}^{M_{β_2}} ξ_k² ] = ( M_{β_2} + 1 )/n ≤ M_{β_1}/n = ( M_{β_1} )^{−2β_1},

and since c ∈ Θ_{2,n}(β_1),

Σ_{k=M_{β_1}+1}^{n−1} c_k² ≤ L ( M_{β_1} )^{−2β_1}.
Thus, even multiplied by (Mβ1 )2β1 , the respective terms in the risk stay
bounded as n → ∞.
It remains to verify that the term S_n ( M_{β_1} )^{2β_1} P_c(ΔR > 0) also stays finite as n → ∞. It suffices to study the case when S_n > 4 ( M_{β_1} )^{−2β_1} = 4 M_{β_1}/n, because otherwise this term would be bounded by 4. From Lemma 15.7, the random event

{ ΔR > 0 } = { −S_n + M_{β_1} ( 1 + o_n(1) )/n + U_n^{(1)}/n − 2 U_n^{(2)}/√n > 0 }

can occur only if at least one of the two random events A_1 = { U_n^{(1)}/n ≥ M_{β_1}/n } or A_2 = { −2 U_n^{(2)}/√n ≥ S_n/4 } occurs. Indeed, otherwise we would have the inequality

ΔR < −(3/4) S_n + 2 M_{β_1} ( 1 + o_n(1) )/n < 0,

since by our assumption, S_n > 4 M_{β_1}/n. Lemma 15.8 part (i) claims that the probabilities of the random events A_1 and A_2 decrease faster than exp{ −A n^{1/(2β_1+1)} } as n → ∞, which implies that S_n ( M_{β_1} )^{2β_1} P_c(ΔR > 0) stays finite.
The other case, when c ∈ Θ_{2,n}(β_2) and c̃ = ĉ_{β_1}, is treated in a similar fashion, though some calculations change. We write

(15.16)  ( M_{β_2} )^{2β_2} E_c[ I(ΔR ≤ 0) Σ_{k=0}^{n−1} ( ĉ_{k,β_1} − c_k )² ]
 = ( M_{β_2} )^{2β_2} E_c[ I(ΔR ≤ 0) ( (1/n) Σ_{k=0}^{M_{β_1}} ξ_k² + Σ_{k=M_{β_1}+1}^{n−1} c_k² ) ].

Since c ∈ Θ_{2,n}(β_2),

( M_{β_2} )^{2β_2} Σ_{k=M_{β_1}+1}^{n−1} c_k² ≤ L ( M_{β_2}/M_{β_1} )^{2β_2} → 0  as n → ∞.
( M_{β_2} )^{2β_2} E_c[ I(ΔR ≤ 0) (1/n) Σ_{k=0}^{M_{β_1}} ξ_k² ]
 ≤ ( M_{β_2} )^{2β_2} (1/n) ( E_c[ ( Σ_{k=0}^{M_{β_1}} ξ_k² )² ] )^{1/2} ( P_c(ΔR ≤ 0) )^{1/2}
 ≤ ( M_{β_2} )^{2β_2} ( 2 M_{β_1}/n ) ( P_c(ΔR ≤ 0) )^{1/2} = 2 n^γ ( P_c(ΔR ≤ 0) )^{1/2}.

Here we applied the Cauchy–Schwarz inequality and the elementary calculation E_c[ ξ_k⁴ ] = 3; hence,

E_c[ ( Σ_{k=0}^{M_{β_1}} ξ_k² )² ] = Σ_{k=0}^{M_{β_1}} E_c[ ξ_k⁴ ] + Σ_{k,l=0, k≠l}^{M_{β_1}} E_c[ ξ_k² ξ_l² ] ≤ ( 2 M_{β_1} )².
Next, the moment generating function of U_n^{(2)} can be expressed as

G_2(t) = E exp{ t U_n^{(2)} } = E exp{ t Σ_{k∈M} ( c_k + ξ_k/√n ) ξ̃_k }
 = E[ E[ exp{ t Σ_{k∈M} ( c_k + ξ_k/√n ) ξ̃_k } | ξ_k, k ∈ M ] ]
 = E exp{ (t²/2n) Σ_{k∈M} ( c_k √n + ξ_k )² }.
Now, for any real a < 1 and any b, we have the formula

E exp{ (a/2)(b + ξ)² } = (1 − a)^{−1/2} exp{ a b²/(2(1 − a)) }.
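This Gaussian identity can be checked by Monte Carlo. The following Python fragment, an illustrative sketch that is not part of the text (the values of a, b, the seed, and the number of draws are arbitrary choices), compares the closed form with a sample average; a < 1/2 keeps the simulation variance finite.

```python
import math, random

# Monte Carlo check of E exp{(a/2)(b + xi)^2} = (1-a)^{-1/2} exp{a b^2/(2(1-a))}
# for standard normal xi and a < 1; parameters below are illustrative choices.
def gaussian_quadratic_mgf(a, b):
    return (1.0 - a) ** -0.5 * math.exp(a * b * b / (2.0 * (1.0 - a)))

random.seed(1)
a, b, n_draws = 0.3, 1.0, 200_000
mc = sum(math.exp((a / 2.0) * (b + random.gauss(0.0, 1.0)) ** 2)
         for _ in range(n_draws)) / n_draws
closed = gaussian_quadratic_mgf(a, b)
```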
Applying this formula with a = t²/n and b = c_k √n, we obtain

G_2(t) = exp{ Σ_{k∈M} ( ( n t²/(2(n − t²)) ) c_k² − (1/2) ln(1 − t²/n) ) },

which completes the proof because S_n = Σ_{k∈M} c_k².
Proof of Lemma 15.8. All the inequalities in this lemma follow from the exponential Chebyshev inequality (also known as Chernoff's inequality),

P( U ≥ x ) ≤ G(t) exp{ −t x },

where G(t) = E exp{ t U } is the moment generating function of a random variable U. It is essential that the moment generating functions of the random variables U_n^{(1)} and U_n^{(2)} in Proposition 15.9 are quadratic at the origin, G_i(t) = O(t²), i = 1, 2, as t → 0. A choice of a sufficiently small t guarantees the desired bounds. In the four stated inequalities, the choices of t differ.
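The exponential Chebyshev inequality is easy to verify numerically. The Python sketch below (sample sizes and the seed are illustrative choices, not from the text) bounds the tail of U = Σ_{k≤m}(ξ_k² − 1), whose per-term moment generating function is e^{−t}(1 − 2t)^{−1/2} for t < 1/2, and compares the bound at t = 1/4 (the same choice used for A_1 below) with an empirical frequency.

```python
import math, random

# Chernoff bound for U = sum_{k=1}^{m} (xi_k^2 - 1): each term has MGF
# e^{-t}(1-2t)^{-1/2}, t < 1/2, so P(U >= x) <= exp{m(-t - (1/2)ln(1-2t)) - t x}.
# m, x, the seed, and the draw count are illustrative choices.
def chernoff_bound_chi2(m, x, t):
    log_g = m * (-t - 0.5 * math.log(1.0 - 2.0 * t))  # log MGF of U at t
    return math.exp(log_g - t * x)

random.seed(2)
m, x, t = 40, 40.0, 0.25      # threshold x = m and t = 1/4
bound = chernoff_bound_chi2(m, x, t)

n_draws = 20_000
hits = sum(sum(random.gauss(0.0, 1.0) ** 2 - 1.0 for _ in range(m)) >= x
           for _ in range(n_draws))
freq = hits / n_draws          # empirical tail frequency
```

At the threshold x = m the exponent per unit of m equals −t − (1/2)ln(1 − 2t) − t, which at t = 1/4 is exactly −(1/2)(1 − ln 2), the rate that appears in the bound for A_1.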
We start with the random event A_1 = { U_n^{(1)} ≥ M_{β_1} }:

P_c( A_1 ) ≤ G_1(t) exp{ −t M_{β_1} } = exp{ −M_{β_1} ( 1 + o_n(1) ) ( t + (1/2) ln(1 − 2t) ) − t M_{β_1} }.

We choose t = 1/4 and obtain

P_c( A_1 ) ≤ exp{ −(1/2)(1 − ln 2) M_{β_1} ( 1 + o_n(1) ) } ≤ exp{ −0.15 M_{β_1} }.
Similarly, if we apply Chernoff's inequality to the random variable −U_n^{(2)} with t = √n/10, and use the fact that ΔM < M_{β_1} ≤ n S_n/4, we get

P_c( A_2 ) = P_c( −U_n^{(2)} ≥ √n S_n/8 )
 ≤ exp{ ( n t²/(2(n − t²)) ) S_n − (ΔM/2) ln(1 − t²/n) − t √n S_n/8 }
 = exp{ n S_n/198 − (ΔM/2) ln(99/100) − n S_n/80 }
 ≤ exp{ n S_n/198 − ( n S_n/8 ) ln(99/100) − n S_n/80 }
 ≤ exp{ −A n S_n } ≤ exp{ −4 A M_{β_1} },

where A = −1/198 + (1/8) ln(99/100) + 1/80 > 0.
To prove the upper bound for the probability of A_3, take t = 1/8. Then

P_c( A_3 ) = P_c( −U_n^{(1)} ≥ M_{β_1}/4 )
 ≤ exp{ −M_{β_1} ( 1 + o_n(1) ) ( −t + (1/2) ln(1 + 2t) ) − t M_{β_1}/4 }
 = exp{ −A M_{β_1} ( 1 + o_n(1) ) },

where A = −1/8 + (1/2) ln(5/4) + 1/32 > 0.
Finally, if n S_n = o( M_{β_1} ), then

G_2(t) = exp{ −(1/2) M_{β_1} ( 1 + o_n(1) ) ln(1 − t²/n) }.

Put t = √n/8. Then

P_c( A_4 ) = P_c( U_n^{(2)} ≥ M_{β_1}/(8√n) )
 ≤ exp{ −(1/2) M_{β_1} ( 1 + o_n(1) ) ln(1 − t²/n) − t M_{β_1}/(8√n) }
 = exp{ −A M_{β_1} ( 1 + o_n(1) ) },

where A = (1/2) ln(63/64) + 1/64 > 0.
Exercises
Testing of Nonparametric Hypotheses
H1 : θ = θ1 .
Accepting a false null results in a type II error. The respective probabilities are denoted by P_0( Δ_n = 1 ) and P_{θ_1}( Δ_n = 0 ).

The classical optimization problem in hypothesis testing consists of finding a decision rule that minimizes the type II error, provided the type I error does not exceed a given positive number α:

P_{θ_1}( Δ_n = 0 ) → inf_{Δ_n}  subject to  P_0( Δ_n = 1 ) ≤ α.
If n is large, then a reasonable anticipation is that α can be chosen small,
that is, α → 0 as n → ∞. This criterion of optimality is popular because of
its elegant solution suggested by the fundamental Neyman-Pearson lemma
(see Exercise 16.97).
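As a toy numerical aside (a hedged illustration of the simple-vs-simple setting, not the construction studied in this chapter): for n i.i.d. N(θ, 1) observations, the Neyman–Pearson likelihood-ratio test of H_0: θ = 0 against H_1: θ = θ_1 > 0 rejects for large sample means, with the threshold z_{1−α}/√n giving level α. All parameter values, names, and the seed below are assumptions made for this sketch.

```python
import math, random

# Toy Neyman-Pearson illustration (Gaussian shift, unit variance); the
# likelihood-ratio test rejects when the sample mean exceeds z_{0.95}/sqrt(n),
# which has exact level 0.05. All parameters are illustrative assumptions.
def lr_test_errors(theta1, n, n_sim=20_000, seed=3):
    rng = random.Random(seed)
    z95 = 1.6448536269514722      # 0.95-quantile of N(0, 1), alpha = 0.05
    thr = z95 / math.sqrt(n)
    type1 = type2 = 0
    for _ in range(n_sim):
        xbar0 = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n     # under H0
        xbar1 = sum(rng.gauss(theta1, 1.0) for _ in range(n)) / n  # under H1
        type1 += xbar0 > thr      # rejecting a true null
        type2 += xbar1 <= thr     # accepting a false null
    return type1 / n_sim, type2 / n_sim

t1, t2 = lr_test_errors(theta1=0.5, n=50)
```

The simulated type I error frequency stays near the nominal α = 0.05, while the type II error shrinks as θ_1√n grows.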
A more sophisticated problem is to test the null hypothesis against a composite alternative,

H_1 : θ ∈ Λ_n,

where Λ_n is a known set of parameter values that does not include the origin, that is, 0 ∉ Λ_n. In our asymptotic studies, different criteria for finding the decision rule are possible. One reasonable criterion, which we choose, is minimization of the sum of the type I error probability and the supremum over θ ∈ Λ_n of the type II error probability,

r_n( Δ_n ) = P_0( Δ_n = 1 ) + sup_{θ ∈ Λ_n} P_θ( Δ_n = 0 ) → inf_{Δ_n}.
with a positive constant C. Denote the corresponding sum of the error prob-
abilities by
rn (Δn , β, C, ψn ) = P0 Δn = 1 + sup Pf Δn = 0 .
f ∈ Λn (β,C,ψn )
We call the sequence ψn a minimax separation rate if (i) for any small
positive γ, there exist a constant C ∗ and a decision rule Δ∗n such that
(16.2) lim sup rn (Δ∗n , β, C ∗ , ψn ) ≤ γ,
n→∞
and (ii) there exist positive constants C∗ and r∗ , independent of n and such
that for any decision rule Δn ,
(16.3) lim inf rn (Δn , β, C∗ , ψn ) ≥ r∗ .
n→∞
Proof. First, we prove the existence of a decision rule Δ*_n with the claimed separation rate such that (16.2) holds. Let f*_n be the regressogram with the rate-optimal bandwidth h*_n = ( (ln n)/n )^{1/(2β+1)}. Our starting point is the inequalities (15.7) and (15.8). For any C > A_b + 2β A_v, uniformly over f ∈ Θ(β), these inequalities yield

P_f( ‖ f*_n − f ‖_∞ ≥ C ψ_n ) ≤ P_f( A_b (h*_n)^β + A_v (n h*_n)^{−1/2} Z*_β ≥ C (h*_n)^β )

(16.4)  = P_f( A_b + A_v Z*_β/√(ln n) ≥ C ) ≤ P_f( Z*_β ≥ 2β √(ln n) ) ≤ β n^{−1},

where we have applied (15.8) with y = √2. Put C* = 2C, and define the set of alternatives by

Λ_n(β, C*, ψ_n) = { f : f ∈ Θ(β) and ‖ f ‖_∞ ≥ C* ψ_n }.
≥ P_0( D ) + (1 − δ) P_0( D̄ ∩ { ξ_n > 1 − δ } ) ≥ (1 − δ) P_0( ξ_n > 1 − δ ),

where

ξ_n = (1/Q) Σ_{q=1}^Q dP_{f_q}/dP_0.

Note that if C_* < ‖ϕ‖_∞, then all the test functions f_q, q = 1, …, Q, belong to the set of alternatives

Λ_n(β, C_*, ψ_n) = { f : f ∈ Θ(β) and ‖ f ‖_∞ ≥ C_* ψ_n }.
We want to test the null hypothesis that all the Fourier coefficients are equal to zero versus the alternative that their L_2-norm ‖ c ‖_2 = ( Σ c_k² )^{1/2} is larger than a constant that may depend on n. Our goal is to find the minimax separation rate ψ_n. Formally, we study the problem of testing H_0 : c = 0 against the composite alternative

H_1 : c ∈ Λ_n = Λ_n(β, C, ψ_n),

where

(16.5)  Λ_n = { c : c ∈ Θ_{2,n}(β) and ‖ c ‖_2 ≥ C ψ_n }.
sequence space as well. The proof in the sequence space is especially simple. Indeed, the sum of the centered z_k²'s admits the representation

Σ_{k=0}^{n−1} ( z_k² − σ²/n ) = ‖ c ‖_2² − (2σ/√n) Σ_{k=0}^{n−1} c_k ξ_k + (σ²/n) Σ_{k=0}^{n−1} ( ξ_k² − 1 )

(16.6)  = ‖ c ‖_2² − (2σ/√n) N + (σ²/√n) Y_n,

where N denotes the zero-mean normal random variable with the variance ‖ c ‖_2². The variable Y_n is a centered chi-squared random variable that is asymptotically normal,

Y_n = Σ_{k=0}^{n−1} ( ξ_k² − 1 )/√n → N(0, 2).

The convergence rate 1/√n in estimation of ‖ c ‖_2² follows immediately from (16.6).
Now we continue with testing the null hypothesis against the composite alternative (16.5). We will show that the separation rate of testing in the quadratic norm is equal to ψ_n = n^{−2β/(4β+1)}. Note that this separation rate is faster than the minimax estimation rate in the L_2-norm, n^{−β/(2β+1)}. The proof of this fact is split between the upper and lower bounds in the theorems below.

We introduce the rate-optimal decision rule, proceeding similarly to (16.6). We take M_n = n^{2/(4β+1)}, so that the separation rate ψ_n = M_n^{−β}, and estimate the norm of the Fourier coefficients by

Ŝ_n = Σ_{k=0}^{M_n} ( z_k² − σ²/n ).
= P_0( (σ²/n) Σ_{k=0}^{M_n} ( ξ_k² − 1 ) > b ψ_n² ) = P_0( σ² n^{−1} √(2(M_n + 1)) Y_n > b ψ_n² ),

where Y_n = Σ_{k=0}^{M_n} ( ξ_k² − 1 )/√(2(M_n + 1)) is an asymptotically standard normal random variable. Under our choice of M_n, we have that, as n → ∞,

n^{−1} √(M_n + 1) ∼ n^{1/(4β+1) − 1} = n^{−4β/(4β+1)} = ψ_n².

Consequently,

P_0( Δ*_n = 1 ) = P_0( √2 σ² Y_n ≥ b ( 1 + o_n(1) ) ) → 1 − Φ( b/(√2 σ²) )

as n → ∞, where Φ denotes the cumulative distribution function of a standard normal random variable. If we choose b = √2 σ² q_{1−γ/2} with q_{1−γ/2} standing for the (1 − γ/2)-quantile of the standard normal distribution, then the inequality (16.8) follows.
To verify the inequality (16.9), note that

P_c( Δ*_n = 0 ) = P_c( Ŝ_n ≤ b ψ_n² ) = P_c( Σ_{k=0}^{M_n} ( z_k² − σ²/n ) ≤ b ψ_n² )
 = P_c( ‖ c ‖_2² − Σ_{k=M_n+1}^{n−1} c_k² − (2σ/√n) Σ_{k=0}^{M_n} c_k ξ_k + (σ²/n) Σ_{k=0}^{M_n} ( ξ_k² − 1 ) ≤ b ψ_n² ).
Observe that for any c ∈ Λ_n(β, C*, ψ_n), the variance of the following normalized random sum vanishes as n → ∞:

Var_c[ ( 2σ/(√n ‖ c ‖_2²) ) Σ_{k=0}^{M_n} c_k ξ_k ] ≤ 4σ²/( n ‖ c ‖_2² ) ≤ 4σ²/( n (C* ψ_n)² ) = ( 4σ²/(C*)² ) n^{−1/(4β+1)} → 0,

which implies that

‖ c ‖_2² − (2σ/√n) Σ_{k=0}^{M_n} c_k ξ_k = ‖ c ‖_2² ( 1 + o_n(1) )  as n → ∞,
where o_n(1) → 0 in P_c-probability. Thus,

P_c( Δ*_n = 0 ) = P_c( (σ²/n) Σ_{k=0}^{M_n} ( ξ_k² − 1 ) ≤ −‖ c ‖_2² ( 1 + o_n(1) ) + Σ_{k=M_n+1}^{n−1} c_k² + b ψ_n² ).

Put Y_n = Σ_{k=0}^{M_n} ( ξ_k² − 1 )/√(2(M_n + 1)). Note that

Σ_{k=M_n+1}^{n−1} c_k² < Σ_{k=M_n+1}^{n−1} ( k/M_n )^{2β} c_k² ≤ M_n^{−2β} L.
Therefore,

P_c( Δ*_n = 0 ) ≤ P_c( (σ²/n) √(2(M_n + 1)) Y_n ≤ −(C* ψ_n)² ( 1 + o_n(1) ) + M_n^{−2β} L + b ψ_n² ),

where Y_n is asymptotically standard normal. Note that here every term has the magnitude ψ_n² = M_n^{−2β}. If we cancel ψ_n², the latter probability becomes

P_c( √2 σ² Y_n ≤ ( −(C*)² + L + b )( 1 + o_n(1) ) ) → Φ( ( −(C*)² + L + b )/(√2 σ²) )

as n → ∞. Choose C* = √(2b + L) and recall that b = √2 σ² q_{1−γ/2}. We obtain

( −(C*)² + L + b )/(√2 σ²) = −b/(√2 σ²) = −q_{1−γ/2} = q_{γ/2},

where q_{γ/2} denotes the γ/2-quantile of Φ. Thus, the inequality (16.9) is valid.
Remark 16.3. In the case of the sup-norm, we can find a single constant
C ∗ to guarantee the upper bound for any γ. In the case of the L2 -norm, it
is not possible. Every γ requires its own constants C ∗ and b.
Theorem 16.4. For any constant r_*, 0 < r_* < 1, there exists C = C_* > 0 in the definition (16.5) of the set of alternatives Λ_n such that for any decision rule Δ_n, the sum of the error probabilities

r_n( Δ_n ) = P_0( Δ_n = 1 ) + sup_{c ∈ Λ_n(β, C_*, ψ_n)} P_c( Δ_n = 0 )

satisfies lim inf_{n→∞} r_n( Δ_n ) ≥ r_*.
Put

α_n² = ( C_* ψ_n/σ )² ( n/M_n ) = ( C_*/σ )² n^{−1/(4β+1)} → 0  as n → ∞.
Further, we substitute the maximum by the mean value to obtain

r_n( Δ_n ) ≥ P_0( Δ_n = 1 ) + max_{ω ∈ Ω_n} P_{c(ω)}( Δ_n = 0 )
 ≥ P_0( Δ_n = 1 ) + 2^{−M_n} Σ_{ω ∈ Ω_n} P_{c(ω)}( Δ_n = 0 )
 = E_0[ I( Δ_n = 1 ) + I( Δ_n = 0 ) 2^{−M_n} Σ_{ω ∈ Ω_n} exp{ L_n(ω) } ]
 = E_0[ I( Δ_n = 1 ) + I( Δ_n = 0 ) η_n ],

where

L_n(ω) = ln( dP_{c(ω)}/dP_0 )  and  η_n = 2^{−M_n} Σ_{ω ∈ Ω_n} exp{ L_n(ω) }.
Now, the log-likelihood ratio is

L_n(ω) = (n/σ²) Σ_{k=1}^{M_n} ( c_k z_k − c_k²/2 ) = Σ_{k=1}^{M_n} ( α_n ω_k ξ_k − α_n²/2 ).

Here we have used the fact that, under P_0, z_k = σ ξ_k/√n. In addition, the identities ω_k² = 1 and

√n c_k/σ = ( C_* ψ_n/σ ) √( n/M_n ) ω_k = α_n ω_k
were employed. The random variable η_n admits the representation, which will be derived below,

(16.12)  η_n = exp{ −α_n² M_n/2 } Π_{k=1}^{M_n} ( (1/2) e^{α_n ξ_k} + (1/2) e^{−α_n ξ_k} ).
Even though this expression is purely deterministic and can be shown algebraically, the easiest way to prove it is by looking at the ω_k's as independent random variables such that

P_{(ω)}( ω_k = ±1 ) = 1/2.

Using this definition, the random variable η_n can be computed as the expected value, denoted by E_{(ω)}, with respect to the distribution P_{(ω)}:

η_n = E_{(ω)}[ exp{ L_n(ω) } ] = E_{(ω)}[ exp{ Σ_{k=1}^{M_n} ( α_n ξ_k ω_k − α_n²/2 ) } ]
 = exp{ −α_n² M_n/2 } Π_{k=1}^{M_n} E_{(ω)}[ exp{ α_n ξ_k ω_k } ],

so that the representation (16.12) for η_n follows.
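Since (1/2)e^{αx} + (1/2)e^{−αx} = cosh(αx), the claim E_0[η_n] = 1 used next can also be checked by simulation. The Python fragment below is an illustrative sketch (the values of α_n, M_n, the seed, and the number of draws are assumptions for the example) that averages the representation (16.12) over draws of ξ_1, …, ξ_{M_n} under P_0.

```python
import math, random

# Monte Carlo check of E_0[eta_n] = 1 for the representation (16.12); since
# (1/2)e^{a x} + (1/2)e^{-a x} = cosh(a x), eta_n factors into cosh terms.
# alpha, M, the seed, and the number of draws are illustrative choices.
random.seed(5)
alpha, M, n_draws = 0.3, 5, 100_000
total = 0.0
for _ in range(n_draws):
    eta = math.exp(-alpha * alpha * M / 2.0)
    for _ in range(M):
        eta *= math.cosh(alpha * random.gauss(0.0, 1.0))
    total += eta
mean_eta = total / n_draws  # each factor e^{-a^2/2} cosh(a xi) has mean 1
```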
Recall that the ξ_k are independent standard normal random variables with respect to the P_0-distribution; hence, E_0[ η_n ] = 1. To compute the second moment of η_n, we write

E_0[ η_n² ] = exp{ −α_n² M_n } ( E_0[ (1/4) e^{2α_n ξ_1} + 1/2 + (1/4) e^{−2α_n ξ_1} ] )^{M_n}.
Exercises