Mathematical Statistics, Asymptotic Minimax Theory


Mathematical

Statistics
Asymptotic Minimax Theory

Alexander Korostelev
Olga Korosteleva

Graduate Studies
in Mathematics
Volume 119

American Mathematical Society


Providence, Rhode Island
EDITORIAL COMMITTEE
David Cox (Chair)
Rafe Mazzeo
Martin Scharlemann
Gigliola Staffilani

2010 Mathematics Subject Classification. Primary 62F12, 62G08; Secondary 62F10, 62G05, 62G10, 62G20.

For additional information and updates on this book, visit


www.ams.org/bookpages/gsm-119

Library of Congress Cataloging-in-Publication Data


Korostelev, A. P. (Aleksandr Petrovich)
Mathematical statistics : asymptotic minimax theory / Alexander Korostelev, Olga Korosteleva.
p. cm. — (Graduate studies in mathematics ; v. 119)
Includes bibliographical references and index.
ISBN 978-0-8218-5283-5 (alk. paper)
1. Estimation theory. 2. Asymptotic efficiencies (Statistics) 3. Statistical hypothesis testing.
I. Korostelev, Olga. II. Title.

QA276.8.K667 2011
519.5–dc22 2010037408

Copying and reprinting. Individual readers of this publication, and nonprofit libraries
acting for them, are permitted to make fair use of the material, such as to copy a chapter for use
in teaching or research. Permission is granted to quote brief passages from this publication in
reviews, provided the customary acknowledgment of the source is given.
Republication, systematic copying, or multiple reproduction of any material in this publication
is permitted only under license from the American Mathematical Society. Requests for such
permission should be addressed to the Acquisitions Department, American Mathematical Society,
201 Charles Street, Providence, Rhode Island 02904-2294 USA. Requests can also be made by
e-mail to [email protected].

© 2011 by the American Mathematical Society. All rights reserved.
The American Mathematical Society retains all rights
except those granted to the United States Government.
Printed in the United States of America.

∞ The paper used in this book is acid-free and falls within the guidelines
established to ensure permanence and durability.
Visit the AMS home page at http://www.ams.org/
Contents

Preface ix

Part 1. Parametric Models


Chapter 1. The Fisher Efficiency 3
§1.1. Statistical Experiment 3
§1.2. The Fisher Information 6
§1.3. The Cramér-Rao Lower Bound 7
§1.4. Efficiency of Estimators 8
Exercises 9
Chapter 2. The Bayes and Minimax Estimators 11
§2.1. Pitfalls of the Fisher Efficiency 11
§2.2. The Bayes Estimator 13
§2.3. Minimax Estimator. Connection Between Estimators 16
§2.4. Limit of the Bayes Estimator and Minimaxity 18
Exercises 19
Chapter 3. Asymptotic Minimaxity 21
§3.1. The Hodges Example 21
§3.2. Asymptotic Minimax Lower Bound 22
§3.3. Sharp Lower Bound. Normal Observations 26
§3.4. Local Asymptotic Normality (LAN) 28
§3.5. The Hellinger Distance 31
§3.6. Maximum Likelihood Estimator 33


§3.7. Proofs of Technical Lemmas 35


Exercises 40
Chapter 4. Some Irregular Statistical Experiments 43
§4.1. Irregular Models: Two Examples 43
§4.2. Criterion for Existence of the Fisher Information 44
§4.3. Asymptotically Exponential Statistical Experiment 45
§4.4. Minimax Rate of Convergence 47
§4.5. Sharp Lower Bound 47
Exercises 49
Chapter 5. Change-Point Problem 51
§5.1. Model of Normal Observations 51
§5.2. Maximum Likelihood Estimator of Change Point 54
§5.3. Minimax Limiting Constant 56
§5.4. Model of Non-Gaussian Observations 57
§5.5. Proofs of Lemmas 59
Exercises 62
Chapter 6. Sequential Estimators 65
§6.1. The Markov Stopping Time 65
§6.2. Change-Point Problem. Rate of Detection 69
§6.3. Minimax Limit in the Detection Problem. 73
§6.4. Sequential Estimation in the Autoregressive Model 75
Exercises 83
Chapter 7. Linear Parametric Regression 85
§7.1. Definitions and Notations 85
§7.2. Least-Squares Estimator 87
§7.3. Properties of the Least-Squares Estimator 89
§7.4. Asymptotic Analysis of the Least-Squares Estimator 93
Exercises 96

Part 2. Nonparametric Regression


Chapter 8. Estimation in Nonparametric Regression 101
§8.1. Setup and Notations 101
§8.2. Asymptotically Minimax Rate of Convergence. Definition 103
§8.3. Linear Estimator 104

§8.4. Smoothing Kernel Estimator 106


Exercises 112
Chapter 9. Local Polynomial Approximation of the Regression
Function 115
§9.1. Preliminary Results and Definition 115
§9.2. Polynomial Approximation and Regularity of Design 119
§9.3. Asymptotically Minimax Lower Bound 122
§9.4. Proofs of Auxiliary Results 126
Exercises 130
Chapter 10. Estimation of Regression in Global Norms 131
§10.1. Regressogram 131
§10.2. Integral L2 -Norm Risk for the Regressogram 133
§10.3. Estimation in the Sup-Norm 136
§10.4. Projection on Span-Space and Discrete MISE 138
§10.5. Orthogonal Series Regression Estimator 141
Exercises 148
Chapter 11. Estimation by Splines 151
§11.1. In Search of Smooth Approximation 151
§11.2. Standard B-splines 152
§11.3. Shifted B-splines and Power Splines 155
§11.4. Estimation of Regression by Splines 158
§11.5. Proofs of Technical Lemmas 161
Exercises 166
Chapter 12. Asymptotic Optimality in Global Norms 167
§12.1. Lower Bound in the Sup-Norm 167
§12.2. Bound in L2 -Norm. Assouad’s Lemma 171
§12.3. General Lower Bound 174
§12.4. Examples and Extensions 177
Exercises 182

Part 3. Estimation in Nonparametric Models


Chapter 13. Estimation of Functionals 185
§13.1. Linear Integral Functionals 185
§13.2. Non-Linear Functionals 188

Exercises 191
Chapter 14. Dimension and Structure in Nonparametric Regression 193
§14.1. Multiple Regression Model 193
§14.2. Additive regression 196
§14.3. Single-Index Model 199
§14.4. Proofs of Technical Results 206
Exercises 209
Chapter 15. Adaptive Estimation 211
§15.1. Adaptive Rate at a Point. Lower Bound 211
§15.2. Adaptive Estimator in the Sup-Norm 215
§15.3. Adaptation in the Sequence Space 218
§15.4. Proofs of Lemmas 223
Exercises 225
Chapter 16. Testing of Nonparametric Hypotheses 227
§16.1. Basic Definitions 227
§16.2. Separation Rate in the Sup-Norm 229
§16.3. Sequence Space. Separation Rate in the L2 -Norm 231
Exercises 237
Bibliography 239
Index of Notation 241
Index 243
Preface

This book is based on the lecture notes written for the advanced Ph.D.-level
statistics courses delivered by the first author at Wayne State University
over the last decade. It has been easy to observe how the gap between applied
(computational) and theoretical statistics keeps deepening, and it has become
more difficult to direct and mentor graduate students in the field of
mathematical statistics. The research monographs in this field are extremely
difficult to use as textbooks, and even the best published lecture notes
typically include the dense material of original studies. On the other hand,
the classical courses in statistics that cover the traditional parametric
point and interval estimation methods and hypothesis testing are hardly
sufficient for the teaching goals of modern mathematical statistics.

In this book, we tried to give a general overview of the key statistical
topics, parametric and nonparametric, as a set of very special optimization
problems. As a criterion for optimality of estimators we chose minimax
risks, and we focused on asymptotically minimax rates of convergence for
large samples. The selection of models presented in this book certainly
follows our preferences, and many very important problems and examples are
not included. The simplest models were deliberately selected for presentation,
and we consciously concentrated on detailed proofs of all propositions.
We believe that mathematics students should be trained in proof-writing to
be better prepared for applications in statistics.

This textbook can form a reasonable basis for a two-semester course in
mathematical statistics. Every chapter is followed by a collection of
exercises, consisting partly of verifications of technical results and partly
of important illustrative examples. In our opinion, a sufficient prerequisite
is a standard course in advanced probability supported by undergraduate
statistics and real analysis. We hope that students who successfully pass
this course are prepared for reading original papers and monographs in
minimax estimation theory and can be easily introduced to research studies
in this field.

This book is organized into three parts. Part 1 comprises Chapters 1-7,
which contain the fundamental topics of local asymptotic normality as well
as irregular statistical models, the change-point problem, and sequential
estimation. For convenience of reference, we also included a chapter on
classical parametric linear regression with a focus on the asymptotic
properties of least-squares estimators. Part 2 (Chapters 8-12) focuses on
estimation of nonparametric regression functions. We restrict the
presentation to estimation at a point and in the quadratic and uniform
norms, and consider deterministic as well as random designs. The last part
of the book, Chapters 13-16, is devoted to more specialized modern topics
such as the influence of higher dimension and structure in nonparametric
regression models, problems of adaptive estimation, and testing of
nonparametric hypotheses. We present the ideas through simple examples with
the equidistant design.

Most chapters are weakly related to each other and may be covered in
any order. Our suggestion for a two-semester course would be to cover the
parametric part during the first semester and to cover the nonparametric
part and selected topics in the second half of the course.

We are grateful to O. Lepskii for his advice and help with the presenta-
tion of Part 3.

The authors, October 2010


Part 1

Parametric Models
Chapter 1

The Fisher Efficiency

1.1. Statistical Experiment

A classical statistical experiment ( X1, . . . , Xn; p(x, θ); θ ∈ Θ ) is composed of
the following three elements: (i) a set of independent observations X1, . . . , Xn,
where n is the sample size, (ii) a family of probability densities p(x, θ) defined
by a parameter θ, and (iii) a parameter set Θ of all possible values of θ.
Unless otherwise stated, we always assume that θ is one-dimensional,
that is, Θ ⊆ R. For discrete distributions, p (x, θ) is the probability mass
function. In this chapter we formulate results only for continuous distri-
butions. Analogous results hold for discrete distributions if integration is
replaced by summation. Some discrete distributions are used in examples.

Example 1.1. (a) If n independent observations X1, . . . , Xn have a normal
distribution with an unknown mean θ and a known variance σ², that is,
Xi ∼ N(θ, σ²), then the density is
\[ p(x, \theta) = (2\pi\sigma^2)^{-1/2} \exp\big\{ -(x - \theta)^2/(2\sigma^2) \big\}, \quad -\infty < x, \theta < \infty, \]
and the parameter set is the whole real line Θ = R.

(b) If n independent observations have a normal distribution with a known
mean μ and an unknown variance θ, that is, Xi ∼ N(μ, θ), then the density is
\[ p(x, \theta) = (2\pi\theta)^{-1/2} \exp\big\{ -(x - \mu)^2/(2\theta) \big\}, \quad -\infty < x < \infty, \ \theta > 0, \]
and the parameter set is the positive half-axis Θ = { θ ∈ R : θ > 0 }. 

Example 1.2. Suppose n independent observations X1 , . . . , Xn come from


a distribution with density
p(x, θ) = p 0 (x − θ), −∞ < x, θ < ∞,
where p 0 is a fixed probability density function. Here θ determines the shift
of the distribution, and therefore is termed the location parameter. The
location parameter model can be written as Xi = θ + εi , i = 1, . . . , n, where
ε1 , . . . , εn are independent random variables with a given density p 0 , and
θ ∈ Θ = R. 

The independence of the observations implies that the joint density of the Xi's equals
\[ p(x_1, \dots, x_n, \theta) = \prod_{i=1}^{n} p(x_i, \theta). \]
We denote the respective expectation by Eθ [ · ] and variance by Varθ [ · ].
In a statistical experiment, all observations are obtained under the same
value of an unknown parameter θ. The goal of the parametric statistical
estimation is to assess the true value of θ from the observations X1 , . . . , Xn .
An arbitrary function of observations, denoted by θ̂ = θ̂n = θ̂n (X1 , . . . , Xn ),
is called an estimator (or a point estimator) of θ.

A random variable
l(Xi , θ) = ln p(Xi , θ)
is referred to as a log-likelihood function related to the observation Xi .
The joint log-likelihood function of a sample of size n (or, simply, the log-
likelihood function) is the sum

\[ L_n(\theta) = L_n(\theta \mid X_1, \dots, X_n) = \sum_{i=1}^{n} l(X_i, \theta) = \sum_{i=1}^{n} \ln p(X_i, \theta). \]

In the above notation, we emphasize the dependence of the log-likelihood


function on the parameter θ, keeping in mind that it is actually a random
function that depends on the entire set of observations X1 , . . . , Xn .
The parameter θ may be evaluated by the method of maximum likelihood
estimation. An estimator θn∗ is called the maximum likelihood estimator
(MLE), if for any θ ∈ Θ the following inequality holds:
Ln (θn∗ ) ≥ Ln (θ).
If the log-likelihood function attains its unique maximum, then the MLE
reduces to
\[ \theta_n^* = \operatorname*{argmax}_{\theta \in \Theta} L_n(\theta). \]

If the function L is differentiable at its attainable maximum, then θn∗ is a


solution of the equation
\[ \frac{\partial L_n(\theta)}{\partial \theta} = 0. \]
Note that if the maximum is not unique, this equation has multiple solutions.

The function
\[ b_n(\theta) = b_n(\theta, \hat\theta_n) = E_\theta\big[ \hat\theta_n \big] - \theta = E_\theta\big[ \hat\theta_n(X_1, \dots, X_n) \big] - \theta \]
is called the bias of θ̂n. An estimator θ̂n(X1, . . . , Xn) is called an unbiased
estimator of θ if its bias equals zero, or equivalently, Eθ[θ̂n] = θ for all θ ∈ Θ.
Example 1.3. Assume that the underlying distribution of the random sample
X1, . . . , Xn is Poisson with mean θ. The probability mass function is given by
\[ p(x, \theta) = \frac{\theta^x}{x!}\, e^{-\theta}, \quad \theta > 0, \ x \in \{0, 1, 2, \dots\}. \]
Then the log-likelihood function has the form
\[ L_n(\theta) = \sum_{i=1}^{n} X_i \ln\theta \,-\, n\theta \,-\, \sum_{i=1}^{n} \ln(X_i!). \]
Setting the derivative equal to zero yields the solution θn∗ = X̄n, where
X̄n = (X1 + · · · + Xn)/n denotes the sample mean. In this example, the MLE is
unbiased since Eθ[θn∗] = Eθ[X̄n] = Eθ[X1] = θ. 
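As a quick numerical illustration (our own sketch, not part of the text; the variable names, seed, and sample size are arbitrary), one can check on simulated Poisson data that the maximizer of Ln(θ) coincides with the sample mean:

# Illustrative sketch: numerically maximize the Poisson log-likelihood
# L_n(theta) and compare the maximizer with the sample mean X_bar_n.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(0)
theta_true = 3.0
x = rng.poisson(theta_true, size=500)          # X_1, ..., X_n

def neg_log_lik(theta):
    # -L_n(theta) = -( sum(X_i) ln(theta) - n theta - sum(ln X_i!) )
    return -(x.sum() * np.log(theta) - x.size * theta - gammaln(x + 1).sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20.0), method="bounded")
print(res.x, x.mean())   # the two values agree up to numerical tolerance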

Nonetheless, we should not take the unbiasedness of the MLE for granted. Even
for common densities, its expected value may not exist. Consider the next
example.
Example 1.4. For the exponential distribution with the density
\[ p(x, \theta) = \theta \exp\{-\theta x\}, \quad x > 0, \ \theta > 0, \]
the MLE θn∗ = 1/X̄n has the expected value Eθ[θn∗] = nθ/(n − 1) (see
Exercise 1.6). In particular, for n = 1, the expectation does not exist since
\[ \int_0^\infty x^{-1}\, \theta \exp\{-\theta x\}\, dx = \infty. \qquad \square \]

In this example, however, an unbiased estimator may be found for n > 1.


Indeed, the estimator (n − 1)θn∗ /n is unbiased. As the next example shows,
an unbiased estimator may not exist at all.
Example 1.5. Let X be a Binomial(n, θ²) observation, that is, a random
number of successes in n independent Bernoulli trials with the probability of
a success p = θ², 0 < θ < 1. An unbiased estimator of the parameter θ does
not exist. In fact, if θ̂ = θ̂(X) were such an estimator, then its expectation
would be an even polynomial of θ,
\[ E_\theta\big[ \hat\theta(X) \big] = \sum_{k=0}^{n} \hat\theta(k) \binom{n}{k}\, \theta^{2k} (1 - \theta^2)^{n-k}, \]
which cannot be identically equal to θ. 

1.2. The Fisher Information


Introduce the Fisher score function as the derivative of the log-likelihood
function with respect to θ,
\[ l'(X_i, \theta) = \frac{\partial \ln p(X_i, \theta)}{\partial \theta} = \frac{\partial p(X_i, \theta)/\partial \theta}{p(X_i, \theta)}. \]
Note that the expected value of the Fisher score function is zero. Indeed,
\[ E_\theta\big[ l'(X_i, \theta) \big] = \int_{\mathbb{R}} \frac{\partial p(x, \theta)}{\partial \theta}\, dx = \frac{\partial \int_{\mathbb{R}} p(x, \theta)\, dx}{\partial \theta} = 0. \]
The total Fisher score function for a sample X1, . . . , Xn is defined as the
sum of the score functions for each individual observation,
\[ L_n'(\theta) = \sum_{i=1}^{n} l'(X_i, \theta). \]

The Fisher information of one observation Xi is the variance of the Fisher
score function l'(Xi, θ),
\[ I(\theta) = \mathrm{Var}_\theta\big[ l'(X_i, \theta) \big] = E_\theta\big[ \big( l'(X_i, \theta) \big)^2 \big] = E_\theta\Big[ \Big( \frac{\partial \ln p(X, \theta)}{\partial \theta} \Big)^2 \Big] \]
\[ = \int_{\mathbb{R}} \Big( \frac{\partial \ln p(x, \theta)}{\partial \theta} \Big)^2 p(x, \theta)\, dx = \int_{\mathbb{R}} \frac{\big( \partial p(x, \theta)/\partial \theta \big)^2}{p(x, \theta)}\, dx. \]
Remark 1.6. In the above definition of the Fisher information, the density
appears in the denominator. Thus, it is problematic to calculate the Fisher
information for distributions with densities that may be equal to zero for
some values of x; even more so, if the density turns into zero as a function of
x on sets that vary depending on the value of θ. A more general approach to
the concept of information that overcomes this difficulty will be suggested
in Section 4.2. 
The Fisher information for a statistical experiment of size n is the variance
of the total Fisher score function,
\[ I_n(\theta) = \mathrm{Var}_\theta\big[ L_n'(\theta) \big] = E_\theta\big[ \big( L_n'(\theta) \big)^2 \big] = E_\theta\Big[ \Big( \frac{\partial \ln p(X_1, \dots, X_n, \theta)}{\partial \theta} \Big)^2 \Big] \]
\[ = \int_{\mathbb{R}^n} \frac{\big( \partial p(x_1, \dots, x_n, \theta)/\partial \theta \big)^2}{p(x_1, \dots, x_n, \theta)}\, dx_1 \dots dx_n. \]

Lemma 1.7. For independent observations, the Fisher information is additive.
In particular, for any θ ∈ Θ, the equality In(θ) = n I(θ) holds.

Proof. As the variance of the sum of n independent random variables,
\[ I_n(\theta) = \mathrm{Var}_\theta\big[ L_n'(\theta) \big] = \mathrm{Var}_\theta\big[ l'(X_1, \theta) + \dots + l'(X_n, \theta) \big] = n\, \mathrm{Var}_\theta\big[ l'(X_1, \theta) \big] = n I(\theta). \qquad \square \]

In view of this lemma, we use the following definition of the Fisher
information for a random sample of size n:
\[ I_n(\theta) = n\, E_\theta\Big[ \Big( \frac{\partial \ln p(X, \theta)}{\partial \theta} \Big)^2 \Big]. \]
Another way of computing the Fisher information is presented in Exercise
1.1.
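For a concrete family, the definition above can be checked numerically. The following sketch (our own illustration, not from the text; function name, step size, and parameter values are arbitrary) approximates I(θ) for the N(θ, σ²) density by numerical differentiation and integration, recovering 1/σ²; multiplying by n gives In(θ) = n/σ², in agreement with Example 1.10 below.

# Illustrative sketch: compute I(theta) = E_theta[(d/dtheta ln p(X, theta))^2]
# by a central-difference score and numerical integration, for N(theta, sigma^2).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def fisher_info(theta, sigma=2.0, h=1e-5):
    def integrand(x):
        # central difference approximation of the score d/dtheta ln p(x, theta)
        score = (norm.logpdf(x, theta + h, sigma) - norm.logpdf(x, theta - h, sigma)) / (2 * h)
        return score**2 * norm.pdf(x, theta, sigma)
    val, _ = quad(integrand, -np.inf, np.inf)
    return val

print(fisher_info(1.0), 1 / 2.0**2)   # both close to 1/sigma^2 = 0.25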

1.3. The Cramér-Rao Lower Bound


A statistical experiment is called regular if its Fisher information is con-
tinuous, strictly positive, and bounded for all θ ∈ Θ . Next we present an
inequality for the variance of any estimator of θ in a regular experiment.
This inequality is termed the Cramér-Rao inequality, and the lower bound
is known as the Cramér-Rao lower bound.

Theorem 1.8. Consider an estimator θ̂n = θ̂n(X1, . . . , Xn) of the parameter θ
in a regular experiment. Suppose its bias bn(θ) = Eθ[θ̂n] − θ is continuously
differentiable. Let b′n(θ) denote the derivative of the bias. Then the variance
of θ̂n satisfies the inequality
\[ (1.1) \qquad \mathrm{Var}_\theta\big[ \hat\theta_n \big] \ \ge\ \frac{\big( 1 + b_n'(\theta) \big)^2}{I_n(\theta)}, \quad \theta \in \Theta. \]

Proof. By the definition of the bias, we have that
\[ \theta + b_n(\theta) = E_\theta\big[ \hat\theta_n \big] = \int_{\mathbb{R}^n} \hat\theta_n(x_1, \dots, x_n)\, p(x_1, \dots, x_n, \theta)\, dx_1 \dots dx_n. \]
In the regular case, differentiation and integration are interchangeable;
hence, differentiating in θ, we get the equation
\[ 1 + b_n'(\theta) = \int_{\mathbb{R}^n} \hat\theta_n(x_1, \dots, x_n)\, \frac{\partial p(x_1, \dots, x_n, \theta)}{\partial \theta}\, dx_1 \dots dx_n \]
\[ = \int_{\mathbb{R}^n} \hat\theta_n(x_1, \dots, x_n)\, \frac{\partial p(x_1, \dots, x_n, \theta)/\partial \theta}{p(x_1, \dots, x_n, \theta)}\, p(x_1, \dots, x_n, \theta)\, dx_1 \dots dx_n = E_\theta\big[ \hat\theta_n\, L_n'(\theta) \big] = \mathrm{Cov}_\theta\big( \hat\theta_n, L_n'(\theta) \big), \]
where we use the fact that Eθ[L′n(θ)] = 0. The correlation coefficient ρn of
θ̂n and L′n(θ) does not exceed 1 in absolute value, so that
\[ 1 \ \ge\ \rho_n^2 = \frac{\mathrm{Cov}_\theta\big( \hat\theta_n, L_n'(\theta) \big)^2}{\mathrm{Var}_\theta[\hat\theta_n]\, \mathrm{Var}_\theta[L_n'(\theta)]} = \frac{\big( 1 + b_n'(\theta) \big)^2}{\mathrm{Var}_\theta[\hat\theta_n]\, I_n(\theta)}. \qquad \square \]

1.4. Efficiency of Estimators


An immediate consequence of Theorem 1.8 is the formula for unbiased esti-
mators.

Corollary 1.9. For an unbiased estimator θ̂n, the Cramér-Rao inequality (1.1)
takes the form
\[ (1.2) \qquad \mathrm{Var}_\theta\big[ \hat\theta_n \big] \ \ge\ \frac{1}{I_n(\theta)}, \quad \theta \in \Theta. \qquad \square \]

An unbiased estimator θn∗ = θn∗(X1, . . . , Xn) in a regular statistical
experiment is called Fisher efficient (or, simply, efficient) if, for any θ ∈ Θ,
the variance of θn∗ reaches the Cramér-Rao lower bound, that is, the equality
in (1.2) holds:
\[ \mathrm{Var}_\theta\big[ \theta_n^* \big] = \frac{1}{I_n(\theta)}, \quad \theta \in \Theta. \]
Example 1.10. Suppose, as in Example 1.1(a), the observations X1, . . . , Xn
are independent N(θ, σ²) where σ² is assumed known. We show that the sample
mean X̄n = (X1 + · · · + Xn)/n is an efficient estimator of θ. Indeed, X̄n is
unbiased and Varθ[X̄n] = σ²/n. On the other hand,
\[ \ln p(X, \theta) = -\tfrac{1}{2} \ln(2\pi\sigma^2) - \frac{(X - \theta)^2}{2\sigma^2} \]
and
\[ l'(X, \theta) = \frac{\partial \ln p(X, \theta)}{\partial \theta} = \frac{X - \theta}{\sigma^2}. \]
Thus, the Fisher information for the statistical experiment is
\[ I_n(\theta) = n\, E_\theta\big[ \big( l'(X, \theta) \big)^2 \big] = \frac{n}{\sigma^4}\, E_\theta\big[ (X - \theta)^2 \big] = \frac{n\sigma^2}{\sigma^4} = \frac{n}{\sigma^2}. \]
Therefore, for any value of θ, the variance of X̄n achieves the Cramér-Rao
lower bound 1/In(θ) = σ²/n. 
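A quick Monte Carlo check of this example (our own sketch; seed and parameter values are arbitrary, not from the book) confirms that the empirical variance of X̄n matches the Cramér-Rao bound 1/In(θ) = σ²/n:

# Illustrative sketch: Var_theta[Xbar_n] versus 1/I_n(theta) = sigma^2/n.
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n = 0.7, 1.5, 50
xbars = rng.normal(theta, sigma, size=(100_000, n)).mean(axis=1)
print(xbars.var(), sigma**2 / n)   # empirical variance vs. the Cramer-Rao bound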

The concept of the Fisher efficiency seems to be nice and powerful. In-
deed, besides being unbiased, an efficient estimator has the minimum pos-
sible variance uniformly in θ ∈ Θ. Another feature is that it applies to any
sample size n. Unfortunately, this concept is extremely restrictive. It works
only in a limited number of models. The main pitfalls of the Fisher efficiency
are discussed in the next chapter.

Exercises

Exercise 1.1. Show that the Fisher information can be computed by the formula
\[ I_n(\theta) = -\, n\, E_\theta\Big[ \frac{\partial^2 \ln p(X, \theta)}{\partial \theta^2} \Big]. \]
Hint: Make use of the representation (show!)
\[ \Big( \frac{\partial \ln p(x, \theta)}{\partial \theta} \Big)^2 p(x, \theta) = \frac{\partial^2 p(x, \theta)}{\partial \theta^2} - \Big( \frac{\partial^2 \ln p(x, \theta)}{\partial \theta^2} \Big) p(x, \theta). \]

Exercise 1.2. Let X1, . . . , Xn be independent observations with the N(μ, θ)
distribution, where μ has a known value (refer to Example 1.1(b)). Prove that
\[ \theta_n^* = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 \]
is an efficient estimator of θ. Hint: Use Exercise 1.1 to show that
In(θ) = n/(2θ²). When computing the variance of θn∗, first notice that the
variable ∑_{i=1}^{n} (Xi − μ)²/θ has a chi-squared distribution with n degrees
of freedom, and, thus, its variance equals 2n.

Exercise 1.3. Suppose that independent observations X1, . . . , Xn have a
Bernoulli distribution with the probability mass function
\[ p(x, \theta) = \theta^x (1 - \theta)^{1-x}, \quad x \in \{0, 1\}, \ 0 < \theta < 1. \]
Show that the Fisher information is of the form
\[ I_n(\theta) = \frac{n}{\theta(1 - \theta)}, \]
and verify that the estimator θn∗ = X̄n is efficient.
Exercise 1.4. Assume that X1, . . . , Xn are independent observations from
a Poisson distribution with the probability mass function
\[ p(x, \theta) = \frac{\theta^x}{x!}\, e^{-\theta}, \quad x \in \{0, 1, \dots\}, \ \theta > 0. \]
Prove that the Fisher information in this case is In(θ) = n/θ, and show that
X̄n is an efficient estimator of θ.

Exercise 1.5. Let X1, . . . , Xn be a random sample from an exponential
distribution with the density
\[ p(x, \theta) = \frac{1}{\theta}\, e^{-x/\theta}, \quad x > 0, \ \theta > 0. \]
Verify that In(θ) = n/θ², and prove that X̄n is efficient.

Exercise 1.6. Show that in the exponential model with the density p(x , θ) =
θ exp{−θ x} , x , θ > 0, the MLE θn∗ = 1/X̄n has the expected value Eθ [ θn∗ ] =
n θ/(n − 1). What is the variance of this estimator?

Exercise 1.7. Show that for the location parameter model with the density
p(x , θ) = p 0 (x − θ), introduced in Example 1.2, the Fisher information is
a constant if it exists.

Exercise 1.8. In Exercise 1.7, find the values of α for which the Fisher
information exists if p0(x) = C cos^α(x), −π/2 < x < π/2, and p0(x) = 0
otherwise, where C = C(α) is the normalizing constant. Note that p0 is a
probability density if α > −1.
Chapter 2

The Bayes and Minimax Estimators

2.1. Pitfalls of the Fisher Efficiency


Fisher efficient estimators defined in the previous chapter possess two ma-
jor unattractive properties, which prevent the Fisher efficiency from being
widely used in statistical theory. First, the Fisher efficient estimators rarely
exist, and second, they need to be unbiased. In effect, the Fisher efficiency
does not provide an answer to how to compare biased estimators with dif-
ferent bias functions. A lesser issue is that the comparison of estimators is
based on their variances alone.
Before we proceed to an illustrative example, we need several notions
defined below. A function w(u), u ∈ R, is called a loss function if: (i) w(0) =
0, (ii) it is symmetric, w(u) = w(−u), (iii) it is non-decreasing for u > 0,
and (iv) it is not identically equal to zero. Besides, we require that w is
bounded from above by a power function, that is, (v) w(u) ≤ k(1 + |u|a ) for
all u with some constants k > 0 and a > 0.
The loss function w(θ̂n − θ) measures the deviation of the estimator
θ̂n = θ̂n (X1 , . . . , Xn ) from the true parameter θ. In this book, we do not
go far beyond: (i) quadratic loss function, w(u) = u2 , (ii) absolute loss
function, w(u) = |u|, or (iii) bounded loss function, w(u) = I( |u| > c ) with
a given positive c, where I(·) denotes the indicator function.
The normalized risk function (or simply, the normalized risk) Rn(θ, θ̂n, w)
is the expected value of the loss function w evaluated at √In(θ) (θ̂n − θ),
that is,
\[ R_n(\theta, \hat\theta_n, w) = E_\theta\Big[ w\Big( \sqrt{I_n(\theta)}\, (\hat\theta_n - \theta) \Big) \Big] = \int_{\mathbb{R}^n} w\Big( \sqrt{I_n(\theta)}\, \big( \hat\theta_n(x_1, \dots, x_n) - \theta \big) \Big)\, p(x_1, \dots, x_n, \theta)\, dx_1 \dots dx_n. \]

Example 2.1. For the quadratic loss function w(u) = u², the normalized risk
(commonly termed the normalized quadratic risk) of an estimator θ̂n can be
found as
\[ R_n(\theta, \hat\theta_n, u^2) = E_\theta\Big[ I_n(\theta) \big( \hat\theta_n - \theta \big)^2 \Big] = I_n(\theta)\, E_\theta\Big[ \big( \hat\theta_n - E_\theta[\hat\theta_n] + E_\theta[\hat\theta_n] - \theta \big)^2 \Big] \]
\[ (2.1) \qquad = I_n(\theta)\Big( \mathrm{Var}_\theta[\hat\theta_n] + b_n^2(\theta, \hat\theta_n) \Big), \]
where bn(θ, θ̂n) = Eθ[θ̂n] − θ is the bias of θ̂n. 

By (2.1), for any unbiased estimator θ̂n, the normalized quadratic risk
function has the representation Rn(θ, θ̂n, u²) = In(θ) Varθ[θ̂n]. The
Cramér-Rao inequality (1.2) can thus be written as
\[ (2.2) \qquad R_n(\theta, \hat\theta_n, u^2) = E_\theta\Big[ I_n(\theta) \big( \hat\theta_n - \theta \big)^2 \Big] \ \ge\ 1, \quad \theta \in \Theta, \]
with the equality attained for the Fisher efficient estimators θn∗,
\[ (2.3) \qquad R_n(\theta, \theta_n^*, u^2) = E_\theta\Big[ I_n(\theta) \big( \theta_n^* - \theta \big)^2 \Big] = 1, \quad \theta \in \Theta. \]

Next, we present an example of a biased estimator that in a certain in-


terval performs more efficiently than the Fisher efficient unbiased estimator,
if we define a more efficient estimator as the one with a smaller normalized
quadratic risk.

Example 2.2. Let X1, . . . , Xn be independent observations from the N(θ, σ²)
distribution, where σ² is known. Consider two estimators: (i) θn∗ = X̄n,
which is efficient by Example 1.10, and (ii) a constant-value estimator
θ̂ = θ0, where θ0 is a fixed point. The normalized quadratic risk of θn∗
equals unity by (2.3), while that of θ̂ is
\[ R_n(\theta, \hat\theta, u^2) = E_\theta\big[ I_n(\theta)\, (\hat\theta - \theta)^2 \big] = \frac{n}{\sigma^2}\, (\theta_0 - \theta)^2. \]
Note that θ̂ is a biased estimator with the bias bn(θ) = θ0 − θ.
It is impossible to determine which of the two normalized quadratic risks is
smaller (refer to Figure 1). If θ is within θ0 ± σ/√n, then θ̂ is more
efficient, whereas for all other values of θ, θn∗ is a more efficient
estimator. 
[Figure 1. The normalized quadratic risk functions in Example 2.2: the parabola Rn(θ, θ̂, u²) = n(θ0 − θ)²/σ² crosses the constant level Rn(θ, θn∗, u²) = 1 at θ = θ0 ± σ/√n; θ̂ is more efficient inside this interval.]
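Since only the caption and annotations of Figure 1 are recoverable here, the following sketch (our own illustration, not from the book; parameter values are arbitrary) recomputes the two normalized quadratic risks of Example 2.2 and locates the interval θ0 ± σ/√n in which the constant estimator wins:

# Illustrative sketch: compare R_n(theta, hat_theta) = n (theta_0 - theta)^2 / sigma^2
# with the constant risk R_n(theta, theta_n^*) = 1 of the sample mean.
import numpy as np

theta0, sigma, n = 0.0, 1.0, 100
theta = np.linspace(-0.5, 0.5, 11)
risk_constant_estimator = n * (theta0 - theta) ** 2 / sigma ** 2
better = np.abs(theta - theta0) < sigma / np.sqrt(n)   # where theta_hat = theta_0 wins
for t, rc, b in zip(theta, risk_constant_estimator, better):
    print(f"theta={t:+.2f}  n(theta0-theta)^2/sigma^2={rc:6.2f}  constant estimator better: {b}")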

This example illustrates the difficulty in comparing normalized risks of


two estimators as functions of θ ∈ Θ. To overcome it, we could try to
represent each risk function by a positive number. In statistics, there are
two major ways to implement this idea. One approach is to integrate the
normalized risk over the parameter set Θ, whereas the other one is to take
the maximum value of the normalized risk function over Θ. These are called
the Bayes and the minimax approaches, respectively. They are explored in
the next three sections.

2.2. The Bayes Estimator


In what follows, we study only regular statistical models, which by definition
have a strictly positive, continuous Fisher information.
Assume that there is a probability density π(θ) defined on the parameter
set Θ. The density π(θ) is called a prior density of θ. It reflects the judgement
of how likely values of θ are before the data are obtained. The Bayes risk
of θ̂n is the integrated value of the normalized risk function,

\[ (2.4) \qquad \beta_n(\hat\theta_n, w, \pi) = \int_\Theta R_n(\theta, \hat\theta_n, w)\, \pi(\theta)\, d\theta. \]

An estimator tn = tn (X1 , . . . , Xn ) is called the Bayes estimator of θ, if for


any other estimator θ̂n , the following inequality holds:
βn (tn , w, π) ≤ βn (θ̂n , w, π).

In other words, the Bayes estimator minimizes the Bayes risk. Loosely
speaking, we can understand the Bayes estimator as a solution of the mini-
mization problem,
tn = argminθ̂n β(θ̂n , w, π),
though we should keep in mind that the minimum value may not exist or
may be non-unique.
In the case of the quadratic loss w(u) = u2 , the Bayes estimator can
be computed explicitly. Define the posterior density of θ as the conditional
density, given the observations X1 , . . . , Xn ; that is,
f (θ | X1 , . . . , Xn ) = Cn p(X1 , . . . , Xn , θ) π(θ), θ ∈ Θ,
where Cn = Cn(X1, . . . , Xn) is the normalizing constant. Assuming that
\[ \int_\Theta I_n(\theta)\, f(\theta \mid X_1, \dots, X_n)\, d\theta < \infty, \]
we can introduce the weighted posterior density as
\[ \tilde f(\theta \mid X_1, \dots, X_n) = \tilde C_n\, I_n(\theta)\, f(\theta \mid X_1, \dots, X_n), \quad \theta \in \Theta, \]
with the normalizing constant \( \tilde C_n = \big( \int_\Theta I_n(\theta)\, f(\theta \mid X_1, \dots, X_n)\, d\theta \big)^{-1} \), which
is finite under our assumption.
Theorem 2.3. If w(u) = u², then the Bayes estimator tn is the weighted
posterior mean
\[ t_n = t_n(X_1, \dots, X_n) = \int_\Theta \theta\, \tilde f(\theta \mid X_1, \dots, X_n)\, d\theta. \]
In particular, if the Fisher information is a constant independent of θ, then
the Bayes estimator is the non-weighted posterior mean,
\[ t_n = t_n(X_1, \dots, X_n) = \int_\Theta \theta\, f(\theta \mid X_1, \dots, X_n)\, d\theta. \]

Proof. The Bayes risk of an estimator θ̂n with respect to the quadratic loss
can be written in the form
\[ \beta_n(\hat\theta_n, \pi) = \int_\Theta \int_{\mathbb{R}^n} I_n(\theta)\, (\hat\theta_n - \theta)^2\, p(x_1, \dots, x_n, \theta)\, \pi(\theta)\, dx_1 \dots dx_n\, d\theta \]
\[ = \int_{\mathbb{R}^n} \Big[ \int_\Theta (\hat\theta_n - \theta)^2\, \tilde f(\theta \mid x_1, \dots, x_n)\, d\theta \Big]\, \tilde C_n^{-1}(x_1, \dots, x_n)\, dx_1 \dots dx_n. \]
Thus, the minimization problem of the Bayes risk is tantamount to
minimization of the integral
\[ \int_\Theta (\hat\theta_n - \theta)^2\, \tilde f(\theta \mid x_1, \dots, x_n)\, d\theta \]
with respect to θ̂n for any fixed values x1, . . . , xn. Equating to zero the
derivative of this integral with respect to θ̂n produces a linear equation,
satisfied by the Bayes estimator tn,
\[ \int_\Theta (t_n - \theta)\, \tilde f(\theta \mid x_1, \dots, x_n)\, d\theta = 0. \]
Recalling that ∫Θ f̃(θ | x1, . . . , xn) dθ = 1, we obtain the result,
\[ t_n = \int_\Theta \theta\, \tilde f(\theta \mid x_1, \dots, x_n)\, d\theta. \qquad \square \]

In many examples, the weighted posterior mean tn is easily computable


if we choose a prior density π(θ) from a conjugate family of distributions.
A conjugate prior distribution π(θ) is such that the posterior distribution
belongs to the same family of distributions for any sample X1 , . . . , Xn . If
the posterior distribution allows a closed-form expression of expectations,
then tn can be found without integration. The following example illustrates
the idea.
Example 2.4. Consider independent Bernoulli observations X1, . . . , Xn with
the probability mass function
\[ p(x, \theta) = \theta^x (1 - \theta)^{1-x}, \quad x \in \{0, 1\}, \ 0 < \theta < 1, \]
where θ is assumed to be a random variable. The joint distribution function
of the sample is
\[ p(X_1, \dots, X_n, \theta) = \theta^{\sum X_i} (1 - \theta)^{\, n - \sum X_i}. \]
As a function of θ, it has the algebraic form of a beta distribution. Thus, we
select a beta density as a prior density,
\[ \pi(\theta) = C(\alpha, \beta)\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}, \quad 0 < \theta < 1, \]
where α and β are positive parameters, and C(α, β) is the normalizing
constant. The posterior density is then also a beta density,
\[ f(\theta \mid X_1, \dots, X_n) = C\big( \textstyle\sum X_i + \alpha,\; n - \sum X_i + \beta \big)\, \theta^{\sum X_i + \alpha - 1} (1 - \theta)^{\, n - \sum X_i + \beta - 1}, \quad 0 < \theta < 1. \]
By Exercise 1.3, the Fisher information is equal to In(θ) = n/[θ(1 − θ)].
Thus, the weighted posterior density is a beta density as well,
\[ \tilde f(\theta \mid X_1, \dots, X_n) = \tilde C_n\, \theta^{\sum X_i + \alpha - 2} (1 - \theta)^{\, n - \sum X_i + \beta - 2}, \quad 0 < \theta < 1, \]
where α > 1 and β > 1. The weighted posterior mean therefore is equal to
\[ t_n = \frac{\sum X_i + \alpha - 1}{\big( \sum X_i + \alpha - 1 \big) + \big( n - \sum X_i + \beta - 1 \big)} = \frac{\sum X_i + \alpha - 1}{n + \alpha + \beta - 2}. \]

More examples of the conjugate families are in the exercises.
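As a numerical illustration of Example 2.4 (our own sketch, not from the text; data, seed, and hyperparameter values are arbitrary), the weighted posterior mean can be obtained both by direct integration of θ f̃(θ | X) and from the closed form above:

# Illustrative sketch: weighted posterior mean for Bernoulli data with a beta prior,
# computed by numerical integration and by (sum(X_i)+alpha-1)/(n+alpha+beta-2).
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(2)
theta_true, n, alpha, beta = 0.3, 40, 2.0, 3.0
x = rng.binomial(1, theta_true, size=n)
s = x.sum()

def unnormalized_f_tilde(theta):
    # I_n(theta) * p(X, theta) * pi(theta), up to factors not depending on theta
    return theta ** (s + alpha - 2) * (1 - theta) ** (n - s + beta - 2)

norm_const, _ = quad(unnormalized_f_tilde, 0, 1)
t_n, _ = quad(lambda th: th * unnormalized_f_tilde(th) / norm_const, 0, 1)
print(t_n, (s + alpha - 1) / (n + alpha + beta - 2))   # the two values coincide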



2.3. Minimax Estimator. Connection Between Estimators


Define a maximum normalized risk of an estimator θ̂n = θ̂n (X1 , . . . , Xn )
with respect to a loss function w by
\[ r_n(\hat\theta_n, w) = \sup_{\theta \in \Theta} R_n(\theta, \hat\theta_n, w) = \sup_{\theta \in \Theta} E_\theta\Big[ w\Big( \sqrt{I_n(\theta)}\, (\hat\theta_n - \theta) \Big) \Big]. \]

An estimator θn∗ = θn∗ (X1 , . . . , Xn ) is called minimax if its maximum


normalized risk does not exceed that of any other estimator θ̂n . That is, for
any estimator θ̂n ,
rn (θn∗ , w) ≤ rn (θ̂n , w).
The maximum normalized risk of a minimax estimator, rn (θn∗ , w), is called
the minimax risk.
In contrast with the Bayes estimator, the minimax estimator represents
a different concept of the statistical optimality. The Bayes estimator is
optimal in the averaged (integrated) sense, whereas the minimax one takes
into account the “worst-case scenario”.
It follows from the above definition that a minimax estimator θn∗ solves
the optimization problem
\[ \sup_{\theta \in \Theta} E_\theta\Big[ w\Big( \sqrt{I_n(\theta)}\, (\hat\theta_n - \theta) \Big) \Big] \ \to\ \inf_{\hat\theta_n}. \]

Finding the infimum over all possible estimators θ̂n = θ̂n (X1 , . . . , Xn ), that
is, over all functions of observations X1 , . . . , Xn , is not an easily tackled
task. Even for the most common distributions, such as normal or binomial,
the direct minimization is a hopeless endeavor. This calls for an alternative
route in finding minimax estimators.
In this section we establish a connection between the Bayes and minimax
estimators that will lead to some advances in computing the latter. The
following theorem shows that if the Bayes estimator has a constant risk,
then it is also minimax.

Theorem 2.5. Let tn = tn (X1 , . . . , Xn ) be a Bayes estimator with respect


to a loss function w. Suppose that the normalized risk function of the Bayes
estimator is a constant for any θ ∈ Θ, that is,
\[ R_n(\theta, t_n, w) = E_\theta\Big[ w\Big( \sqrt{I_n(\theta)}\, (t_n - \theta) \Big) \Big] = c \]

for some c > 0. Then tn is also a minimax estimator.

Proof. Notice that since the risk function of tn is a constant, the Bayes and
maximum normalized risks of tn are the same constant. Indeed, letting π(θ)
denote the corresponding prior density, we write
\[ \beta_n(t_n, w, \pi) = \int_\Theta R_n(\theta, t_n, w)\, \pi(\theta)\, d\theta = c \int_\Theta \pi(\theta)\, d\theta = c \]
and
\[ r_n(t_n, w) = \sup_{\theta \in \Theta} R_n(\theta, t_n, w) = \sup_{\theta \in \Theta} c = c. \]
Further, for any estimator θ̂n,
\[ r_n(\hat\theta_n, w) = \sup_{\theta \in \Theta} R_n(\theta, \hat\theta_n, w) \ \ge\ \int_\Theta R_n(\theta, \hat\theta_n, w)\, \pi(\theta)\, d\theta = \beta_n(\hat\theta_n, w, \pi) \ \ge\ \beta_n(t_n, w, \pi) = c = r_n(t_n, w). \qquad \square \]

Unfortunately, Theorem 2.5 does not provide a recipe for choosing a


prior density for which the normalized risk function is a constant on Θ .
Moreover, constant-risk priors rarely exist. Below we give two examples
where we try to explain why it happens.
Example 2.6. Consider independent Bernoulli observations X1, . . . , Xn with
parameter θ. As shown in Example 2.4, the weighted posterior mean of θ is
\[ t_n = \frac{\sum X_i + \alpha - 1}{n + \alpha + \beta - 2}. \]
If we now select α = β = 1, then tn becomes the sample mean X̄n. From
Exercise 1.3 we know that X̄n is an efficient estimator of θ, and therefore its
weighted quadratic risk is equal to 1, a constant. However, α = β = 1 is not
a legitimate choice in this instance, because the weighted posterior density
\[ \tilde f(\theta \mid X_1, \dots, X_n) = \tilde C_n\, \theta^{\sum X_i - 1} (1 - \theta)^{\, n - \sum X_i - 1} \]
does not exist for ∑Xi = 0. Indeed, θ^{−1}(1 − θ)^{n−1} is not integrable at
zero, and therefore the normalizing constant C̃n does not exist. 
Example 2.7. Let X1, . . . , Xn be independent observations from the N(θ, 1)
distribution. If we choose the prior density of θ to be N(0, b²) for some
positive real b, then, by Exercise 2.10, the weighted posterior distribution is
also normal,
\[ N\Big( \frac{n b^2 \bar X_n}{n b^2 + 1},\ \frac{b^2}{n b^2 + 1} \Big). \]
Here the weighted posterior mean tn = n b² X̄n/(n b² + 1) is the Bayes
estimator with respect to the quadratic loss function. If we let b → ∞, then
tn equals X̄n, which is Fisher efficient (see Example 1.10) and thus has a
constant normalized quadratic risk. The flaw in this argument is that no
normal prior density exists with infinite b. 

2.4. Limit of the Bayes Estimator and Minimaxity


Assume that we can find a family of prior distributions with the densities
πb (θ) indexed by a positive real number b. If the Bayes risks of the respective
Bayes estimators have a limit as b goes to infinity, then this limit guarantees
a minimax lower bound. A rigorous statement is presented in the following
theorem.
Theorem 2.8. Let πb(θ) be a family of prior densities on Θ that depend on a
positive real parameter b, and let tn(b) = tn(X1, . . . , Xn, b) be the
respective Bayes estimators for a loss function w. Suppose that the Bayes
risk βn(tn(b), w, πb) has a limit,
\[ \lim_{b \to \infty} \beta_n\big( t_n(b), w, \pi_b \big) = c > 0. \]
Then the minimax lower bound holds for any n,
\[ \inf_{\hat\theta_n} r_n(\hat\theta_n, w) = \inf_{\hat\theta_n} \sup_{\theta \in \Theta} E_\theta\Big[ w\Big( \sqrt{I_n(\theta)}\, (\hat\theta_n - \theta) \Big) \Big] \ \ge\ c. \]

Proof. As in the proof of Theorem 2.5, for any estimator θ̂n, we can write
\[ r_n(\hat\theta_n, w) = \sup_{\theta \in \Theta} R_n(\theta, \hat\theta_n, w) \ \ge\ \int_\Theta R_n(\theta, \hat\theta_n, w)\, \pi_b(\theta)\, d\theta = \beta_n(\hat\theta_n, w, \pi_b) \ \ge\ \beta_n\big( t_n(b), w, \pi_b \big). \]
Now take the limit as b → ∞. Since the left-hand side is independent of b,
the theorem follows. 
Example 2.9. Let X1, . . . , Xn be independent N(θ, 1) observations. We will
show that the conditions of Theorem 2.8 are satisfied under the quadratic
loss function w(u) = u², and therefore the lower bound for the corresponding
minimax risk holds:
\[ \inf_{\hat\theta_n} r_n(\hat\theta_n, w) = \inf_{\hat\theta_n} \sup_{\theta \in \mathbb{R}} E_\theta\Big[ \big( \sqrt{n}\,(\hat\theta_n - \theta) \big)^2 \Big] \ \ge\ 1. \]
As shown in Example 2.7, for a N(0, b²) prior density, the weighted posterior
mean tn(b) = n b² X̄n/(n b² + 1) is the Bayes estimator with respect to the
quadratic loss function. Now we will compute its Bayes risk. This estimator
has the variance
\[ \mathrm{Var}_\theta\big[ t_n(b) \big] = \frac{n^2 b^4\, \mathrm{Var}_\theta[\bar X_n]}{(n b^2 + 1)^2} = \frac{n b^4}{(n b^2 + 1)^2} \]
and the bias
\[ b_n\big( \theta, t_n(b) \big) = E_\theta\big[ t_n(b) \big] - \theta = \frac{n b^2 \theta}{n b^2 + 1} - \theta = -\, \frac{\theta}{n b^2 + 1}. \]
Therefore, its normalized quadratic risk is expressed as
\[ R_n\big( \theta, t_n(b), w \big) = E_\theta\Big[ \big( \sqrt{n}\,(t_n(b) - \theta) \big)^2 \Big] = n\Big( \mathrm{Var}_\theta\big[ t_n(b) \big] + b_n^2\big( \theta, t_n(b) \big) \Big) = \frac{n^2 b^4}{(n b^2 + 1)^2} + \frac{n\, \theta^2}{(n b^2 + 1)^2}. \]
With the remark that ∫R θ² πb(θ) dθ = b², the Bayes risk of tn(b) equals
\[ \beta_n\big( t_n(b), w, \pi_b \big) = \int_{\mathbb{R}} \Big( \frac{n^2 b^4}{(n b^2 + 1)^2} + \frac{n\, \theta^2}{(n b^2 + 1)^2} \Big)\, \pi_b(\theta)\, d\theta = \frac{n^2 b^4}{(n b^2 + 1)^2} + \frac{n b^2}{(n b^2 + 1)^2} \ \to\ 1 \ \text{ as } b \to \infty. \]
Applying Theorem 2.8, we obtain the result with c = 1. Taking a step
further, note that the minimax lower bound is attained for the estimator
X̄n, which is thus minimax. Indeed, Eθ[(√n (X̄n − θ))²] = 1. 
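A quick numerical check (our own, not from the book; the values of n and b are arbitrary) of the limit computed in Example 2.9:

# Illustrative sketch: the Bayes risk of t_n(b),
# n^2 b^4/(n b^2+1)^2 + n b^2/(n b^2+1)^2, approaches 1 as b grows.
n = 25
for b in [1.0, 10.0, 100.0, 1000.0]:
    bayes_risk = (n**2 * b**4 + n * b**2) / (n * b**2 + 1) ** 2
    print(b, bayes_risk)   # tends to 1 as b -> infinity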

In subsequent chapters we present additional useful applications of


Theorem 2.8.

Exercises

Exercise 2.9. Suppose the random observations X1 , . . . , Xn come from a


Poisson distribution with the probability mass function
\[ p(x, \theta) = \frac{\theta^x e^{-\theta}}{x!}, \quad x \in \{0, 1, \dots\}, \]
where θ is a random variable. Show that the conjugate prior density of θ is
a gamma density, π(θ) = C(α, β) θ α−1 e− β θ , θ > 0, for some positive pa-
rameters α and β, and the normalizing constant C(α, β). Find the weighted
posterior mean of θ.

Exercise 2.10. Consider a set of independent observations X1 , . . . , Xn ∼


N(θ, σ²), where θ is assumed random with the prior density N(μ, σθ²). Show
that the weighted posterior distribution of θ is also normal with the mean
(n σθ² X̄n + μ σ²)/(n σθ² + σ²) and variance σ² σθ²/(n σθ² + σ²). Note that the
family of normal distributions is self-conjugate.

Exercise 2.11. Find a conjugate distribution and the corresponding Bayes


estimator for the parameter θ in the exponential model with p(x, θ) =
θ exp{− θ x}, x , θ > 0.
Exercise 2.12. Consider n independent Bernoulli observations X1, . . . , Xn
with p(x, θ) = θ^x (1 − θ)^{1−x}, x ∈ {0, 1}, and Θ = (0, 1). Define the
estimator
\[ \theta_n^* = \frac{\sum X_i + \sqrt{n}/2}{n + \sqrt{n}}. \]
(i) Verify that θn∗ is the non-weighted posterior mean with respect to the
conjugate prior density π(θ) = C [θ(1 − θ)]^{√n/2 − 1}, 0 < θ < 1.

(ii) Show that the non-normalized quadratic risk of θn∗ (with the factor
In(θ) omitted) is equal to
\[ E_\theta\big[ (\theta_n^* - \theta)^2 \big] = \frac{1}{4\,(1 + \sqrt{n}\,)^2}. \]
(iii) Verify that Theorem 2.5 is valid for a non-normalized risk function, and
argue that θn∗ is minimax in the appropriate sense.

Exercise 2.13. Refer to the Bernoulli model in Example 2.4. Show that
the prior beta distribution with α = β = 1 + b−1 defines the weighted
posterior mean tn (b) which is minimax for b = ∞.
Chapter 3

Asymptotic
Minimaxity

In this chapter we study the asymptotic minimaxity of estimators as the


sample size n increases.

3.1. The Hodges Example


An estimator θ̂n is called asymptotically unbiased if it satisfies the limiting
condition
\[ \lim_{n \to \infty} E_\theta\big[ \hat\theta_n \big] = \theta, \quad \theta \in \Theta. \]

In many cases when an unbiased estimator of θ does not exist, an asymp-


totically unbiased estimator is easy to construct.
Example 3.1. In Example 1.4, the MLE θn∗ = 1/X̄n, though biased for any
n > 1, is asymptotically unbiased. Indeed,
\[ \lim_{n \to \infty} E_\theta\big[ \theta_n^* \big] = \lim_{n \to \infty} \frac{n\theta}{n - 1} = \theta. \qquad \square \]

Example 3.2. In Example 1.5, there is no unbiased estimator. The estimator
θ̂n = √(X/n), however, is asymptotically unbiased (see Exercise 3.14). 

In the previous chapter, we explained why the Fisher approach fails


as a criterion for finding the most efficient estimators. Now we are plan-
ning to undertake another desperate, though futile, task of rescuing the
concept of Fisher efficiency at least in an asymptotic form. The question

is: Can we define a sequence of asymptotically Fisher efficient estimators
θn∗ = θn∗(X1, . . . , Xn) by requiring that they: (i) are asymptotically
unbiased, and (ii) satisfy the equation (compare to (2.3)):
\[ (3.1) \qquad \lim_{n \to \infty} E_\theta\Big[ I_n(\theta) \big( \hat\theta_n - \theta \big)^2 \Big] = 1, \quad \theta \in \Theta\,? \]
The answer to this question would be positive if, for any sequence of
asymptotically unbiased estimators θ̂n, the following analogue of the
Cramér-Rao lower bound (2.2) were true:
\[ (3.2) \qquad \lim_{n \to \infty} E_\theta\Big[ I_n(\theta) \big( \hat\theta_n - \theta \big)^2 \Big] \ \ge\ 1, \quad \theta \in \Theta. \]
Indeed, if (3.2) held, then the estimator that satisfies (3.1) would be asymp-
totically the most efficient one. However, it turns out that this inequality
is not valid even for N (θ, 1) observations. A famous Hodges example is
presented below.

Example 3.3. Consider independent observations X1, . . . , Xn from the N(θ, 1)
distribution, θ ∈ R. Define the sequence of estimators
\[ (3.3) \qquad \hat\theta_n = \begin{cases} \bar X_n & \text{if } |\bar X_n| \ge n^{-1/4}, \\ 0 & \text{otherwise.} \end{cases} \]
Note that in this example, In(θ) = n. It can be shown (see Exercise 3.15)
that this sequence is asymptotically unbiased, and that the following
equalities hold:
\[ (3.4) \qquad \lim_{n \to \infty} E_\theta\Big[ n \big( \hat\theta_n - \theta \big)^2 \Big] = 1 \ \text{ if } \theta \ne 0, \qquad \lim_{n \to \infty} E_\theta\Big[ n\, \hat\theta_n^{\,2} \Big] = 0 \ \text{ if } \theta = 0. \]

Thus, the sequence θ̂n is asymptotically more efficient than any asymptot-
ically Fisher efficient estimator defined by (3.1). In particular, it is better
than the sample mean X̄n . Sometimes the Hodges estimator is called super-
efficient, and the point at which the Cramér-Rao lower bound is violated,
θ = 0, is termed the superefficient point. 
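A small simulation (our own sketch; Exercise 3.15 asks for the formal proof, and the sample sizes and seed here are arbitrary) makes the superefficiency at θ = 0 visible:

# Illustrative sketch: Monte Carlo for the Hodges estimator (3.3).
# Since Xbar_n ~ N(theta, 1/n), we sample it directly. At theta = 0 the normalized
# risk n*E[hat_theta_n^2] collapses toward 0, while at theta != 0 it stays near 1,
# in line with (3.4).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 10_000, 200_000
for theta in [0.0, 1.0]:
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
    hodges = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)
    print(theta, n * np.mean((hodges - theta) ** 2))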

The above example explains why the asymptotic theory of parameter


estimation should be based on methods other than the pointwise asymptotic
Fisher efficiency. We start introducing these methods in the next section.

3.2. Asymptotic Minimax Lower Bound


Recall from Section 2.3 that a minimax estimator corresponding to the qua-
dratic loss function solves the minimization problem
  2 
sup In (θ) θ̂n (x1 , . . . , xn )−θ p(x1 , . . . , xn , θ) dx1 . . . dxn → inf .
θ∈Θ Rn θ̂n
3.2. Asymptotic Minimax Lower Bound 23

The minimization is carried out over all arbitrary functions
θ̂n = θ̂n(x1, . . . , xn). As discussed earlier, this problem is impenetrable
from the point of view of standard analytic methods of calculus. In this
section we will learn a bypassing approach based on the asymptotically
minimax lower bound. Consider the maximum normalized risk of an estimator
θ̂n with respect to the quadratic loss function
\[ r_n(\hat\theta_n, u^2) = \sup_{\theta \in \Theta} R_n(\theta, \hat\theta_n, u^2) = \sup_{\theta \in \Theta} E_\theta\Big[ I_n(\theta)\, \big( \hat\theta_n - \theta \big)^2 \Big] = n\, \sup_{\theta \in \Theta} I(\theta)\, E_\theta\Big[ \big( \hat\theta_n - \theta \big)^2 \Big]. \]

Suppose we can show that for any estimator θ̂n the inequality
\[ (3.5) \qquad \liminf_{n \to \infty}\, r_n(\hat\theta_n, u^2) \ \ge\ r_* \]
holds with a positive constant r_* independent of n. This inequality implies
that for any estimator θ̂n and for all large enough n, the maximum of the
quadratic risk is bounded from below,
\[ \sup_{\theta \in \Theta} I(\theta)\, E_\theta\Big[ \big( \hat\theta_n - \theta \big)^2 \Big] \ \ge\ \frac{r_* - \varepsilon}{n} \]
with arbitrarily small ε > 0. We call the inequality (3.5) the asymptotically
minimax lower bound. If, in addition, we can find an estimator θn∗ which for
all large n satisfies the upper bound
\[ \sup_{\theta \in \Theta} I(\theta)\, E_\theta\Big[ \big( \theta_n^* - \theta \big)^2 \Big] \ \le\ \frac{r^*}{n} \]
with a positive constant r^*, then for all large enough n, the minimax risk
is sandwiched between two positive constants,
\[ (3.6) \qquad r_* \ \le\ \inf_{\hat\theta_n} \sup_{\theta \in \Theta} E_\theta\Big[ n I(\theta)\, \big( \hat\theta_n - \theta \big)^2 \Big] \ \le\ r^*. \]

In this special case of the quadratic loss function w(u) = u², we define the
asymptotically minimax rate of convergence as 1/√n (or, equivalently,
O(1/√n) as n → ∞). This is the fastest possible rate of decrease of θ̂n − θ
in the mean-squared sense as n → ∞. This rate is not improvable by any
estimator.
More generally, we call a deterministic sequence ψn the asymptotically
minimax rate of convergence if, for some positive constants r_* and r^*, and
for all sufficiently large n, the following inequalities hold:
\[ (3.7) \qquad r_* \ \le\ \inf_{\hat\theta_n} \sup_{\theta \in \Theta} E_\theta\Big[ w\Big( \frac{\hat\theta_n - \theta}{\psi_n} \Big) \Big] \ \le\ r^* < \infty. \]
If r_* = r^*, these bounds are called asymptotically sharp.

In the following lemma we explain the idea of how the asymptotically


minimax lower bound (3.5) may be proved. We consider only normally
distributed observations, and leave some technical details out of the proof.
Lemma 3.4. Take independent observations X1, . . . , Xn ∼ N(θ, σ²) where σ²
is known. Let θ ∈ Θ where Θ is an open interval containing the origin θ = 0.
Then for any estimator θ̂n, the following inequality holds:
\[ \liminf_{n \to \infty}\, r_n(\hat\theta_n, u^2) = \liminf_{n \to \infty}\ \frac{n}{\sigma^2}\, \sup_{\theta \in \Theta} E_\theta\Big[ \big( \hat\theta_n - \theta \big)^2 \Big] \ \ge\ r_* = 0.077. \]

Remark 3.5. Under the assumptions of Lemma 3.4, the maximum normalized risk
rn(θ̂n, u²) admits the asymptotic upper bound r^* = 1, guaranteed by the
sample mean estimator X̄n. 

Proof of Lemma 3.4. Without loss of generality, we can assume that σ² = 1,
hence I(θ) = 1, and that Θ contains the points θ0 = 0 and θ1 = 1/√n.
Introduce the log-likelihood ratio associated with these values of the
parameter θ,
\[ \Delta L_n = \Delta L_n(\theta_0, \theta_1) = L_n(\theta_1) - L_n(\theta_0) = \ln \frac{p(X_1, \dots, X_n, \theta_1)}{p(X_1, \dots, X_n, \theta_0)} = \sum_{i=1}^{n} \ln \frac{p(X_i, 1/\sqrt{n})}{p(X_i, 0)} \]
\[ = \sum_{i=1}^{n} \Big[ -\frac{1}{2}\Big( X_i - \frac{1}{\sqrt{n}} \Big)^2 + \frac{1}{2} X_i^2 \Big] = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i - \frac{1}{2} = Z - \frac{1}{2}, \]
where Z is a N(0, 1) random variable with respect to the distribution Pθ0.
Further, by definition, for any random function f(X1, . . . , Xn), and for any
values θ0 and θ1, the basic likelihood ratio identity relating the two
expectations holds:
\[ E_{\theta_1}\big[ f(X_1, \dots, X_n) \big] = E_{\theta_0}\Big[ f(X_1, \dots, X_n)\, \frac{p(X_1, \dots, X_n, \theta_1)}{p(X_1, \dots, X_n, \theta_0)} \Big] \]
\[ (3.8) \qquad = E_{\theta_0}\Big[ f(X_1, \dots, X_n)\, \exp\big\{ \Delta L_n(\theta_0, \theta_1) \big\} \Big]. \]
Next, for any fixed estimator θ̂n, the supremum over R of the normalized risk
function is not less than the average of the normalized risk over the two
points θ0 and θ1. Thus, we obtain the inequality
\[ n\, \sup_{\theta \in \mathbb{R}} E_\theta\big[ (\hat\theta_n - \theta)^2 \big] \ \ge\ n \max_{\theta \in \{\theta_0, \theta_1\}} E_\theta\big[ (\hat\theta_n - \theta)^2 \big] \ \ge\ \frac{n}{2} \Big( E_{\theta_0}\big[ (\hat\theta_n - \theta_0)^2 \big] + E_{\theta_1}\big[ (\hat\theta_n - \theta_1)^2 \big] \Big) \]
\[ = \frac{n}{2}\, E_{\theta_0}\Big[ (\hat\theta_n - \theta_0)^2 + (\hat\theta_n - \theta_1)^2 \exp\big\{ \Delta L_n(\theta_0, \theta_1) \big\} \Big] \quad \text{by (3.8)} \]
\[ \ge\ \frac{n}{2}\, E_{\theta_0}\Big[ \big( (\hat\theta_n - \theta_0)^2 + (\hat\theta_n - \theta_1)^2 \big)\, \mathbb{I}\big( \Delta L_n(\theta_0, \theta_1) \ge 0 \big) \Big] \ \ge\ \frac{n}{2}\, \frac{(\theta_1 - \theta_0)^2}{2}\, P_{\theta_0}\big( \Delta L_n(\theta_0, \theta_1) \ge 0 \big) \]
\[ = \frac{n}{4} \Big( \frac{1}{\sqrt{n}} \Big)^2 P_{\theta_0}\big( Z - 1/2 \ge 0 \big) = \frac{1}{4}\, P_{\theta_0}\big( Z \ge 1/2 \big). \]
In the above, if the log-likelihood ratio ΔLn(θ0, θ1) is non-negative, then
its exponent is at least 1. At the last stage we used the elementary
inequality
\[ (x - \theta_0)^2 + (x - \theta_1)^2 \ \ge\ \frac{1}{2}\, (\theta_1 - \theta_0)^2, \quad x \in \mathbb{R}. \]
As shown previously, Z is a standard normal random variable with respect to
the distribution Pθ0; therefore, Pθ0(Z ≥ 1/2) = 0.3085. Finally, the maximum
normalized risk is bounded from below by 0.3085/4 > 0.077. 
Remark 3.6. Note that computing the mean value of the normalized risk
over two points is equivalent to finding the Bayes risk with respect to the
prior distribution that is equally likely concentrated at these points. Thus,
in the above proof, we could have taken a Bayes prior concentrated not at
two but at three or more points, then the lower bound constant r∗ would be
different from 0.077. 
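For reference, the two numerical constants appearing at the end of the proof of Lemma 3.4 can be reproduced directly (a small check of our own):

# Illustrative sketch: under P_{theta_0}, Z ~ N(0,1), so
# P(Delta L_n >= 0) = P(Z >= 1/2) ~= 0.3085, and the lower bound is this value over 4.
from scipy.stats import norm

p = norm.sf(0.5)         # P(Z >= 1/2)
print(p, p / 4)          # ~0.3085 and ~0.0771 > 0.077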

The normal distribution of the observations in Lemma 3.4 is used only


in the explicit formula for the log-likelihood ratio ΔLn (θ0 , θ1 ). A generaliza-
tion of this lemma to the case of a statistical experiment with an arbitrary
distribution is stated in the theorem below. The proof of the theorem is
analogous to that of the lemma, and therefore is left as an exercise (see
Exercise 3.16).
Theorem 3.7. Assume that an experiment (X1, . . . , Xn; p(x, θ); Θ) is such
that for some points θ0 and θ1 = θ0 + 1/√n in Θ, the log-likelihood ratio
\[ \Delta L_n(\theta_0, \theta_1) = \sum_{i=1}^{n} \ln \frac{p(X_i, \theta_1)}{p(X_i, \theta_0)} \]
satisfies the condition
\[ P_{\theta_0}\big( \Delta L_n(\theta_0, \theta_1) \ge z_0 \big) \ \ge\ p_0 \]
with the constants z0 and p0 independent of n. Assume that z0 ≤ 0. Then for
any estimator θ̂n, the lower bound of the minimax risk holds:
\[ \liminf_{n \to \infty}\ \sup_{\theta \in \mathbb{R}} E_\theta\Big[ I_n(\theta)\, \big( \hat\theta_n - \theta \big)^2 \Big] \ \ge\ \frac{1}{4}\, I_*\, p_0 \exp\{ z_0 \}, \]
where I_* = min( I(θ0), I(θ1) ) > 0.

3.3. Sharp Lower Bound. Normal Observations


Lemma 3.4 leaves a significant gap between the lower and upper constants
in (3.6). Indeed, r_* = 0.077, while r^* = 1 by Remark 3.5. It should not
come as a surprise that in such a regular case as normal observations it can
be shown that r_* = r^*. In this section, we prove the sharp lower bound
with r_* = r^* = 1 for normal observations. To do this, we have to overcome
the same technical difficulties, and we will need the same ideas, as in the
case of more general observations discussed in the next section.

Theorem 3.8. Under the assumptions of Lemma 3.4, for any estimator θ̂n,
the following lower bound holds:
\[ \liminf_{n \to \infty}\ \frac{n}{\sigma^2}\, \sup_{\theta \in \Theta} E_\theta\Big[ \big( \hat\theta_n - \theta \big)^2 \Big] \ \ge\ 1. \]

Proof. As in the proof of Lemma 3.4, we can take σ² = 1. The idea of the
proof is based on the substitution of the maximum normalized risk by the
Bayes risk with the uniform prior distribution in an interval
[−b/√n, b/√n], where b will be chosen later. Under the assumption on Θ, it
contains this interval for all sufficiently large n. Proceeding as in the
proof of Lemma 3.4, we obtain the inequalities
\[ \sup_{\theta \in \mathbb{R}} E_\theta\Big[ \big( \sqrt{n}\,(\hat\theta_n - \theta) \big)^2 \Big] \ \ge\ \frac{\sqrt{n}}{2b} \int_{-b/\sqrt{n}}^{b/\sqrt{n}} E_\theta\Big[ \big( \sqrt{n}\,(\hat\theta_n - \theta) \big)^2 \Big]\, d\theta \]
\[ = \frac{1}{2b} \int_{-b}^{b} E_{t/\sqrt{n}}\Big[ \big( \sqrt{n}\, \hat\theta_n - t \big)^2 \Big]\, dt \quad \text{(by the substitution } t = \sqrt{n}\,\theta) \]
\[ (3.9) \qquad = \frac{1}{2b} \int_{-b}^{b} E_0\Big[ \big( \sqrt{n}\, \hat\theta_n - t \big)^2 \exp\Big\{ \Delta L_n\Big( 0, \frac{t}{\sqrt{n}} \Big) \Big\} \Big]\, dt. \]
Here the same trick is used as in the proof of Lemma 3.4 with the change of
the distribution by means of the log-likelihood ratio, which in this case is
equal to
\[ \Delta L_n\Big( 0, \frac{t}{\sqrt{n}} \Big) = L_n\Big( \frac{t}{\sqrt{n}} \Big) - L_n(0) = \sum_{i=1}^{n} \Big[ -\frac{1}{2}\Big( X_i - \frac{t}{\sqrt{n}} \Big)^2 + \frac{1}{2} X_i^2 \Big] \]
\[ = \frac{t}{\sqrt{n}} \sum_{i=1}^{n} X_i - \frac{t^2}{2} = tZ - \frac{t^2}{2} = \frac{Z^2}{2} - \frac{(t - Z)^2}{2}, \]
where Z ∼ N(0, 1) under P0. Thus, the latter expression can be written as
\[ \frac{1}{2b}\, E_0\Big[ e^{Z^2/2} \int_{-b}^{b} \big( \sqrt{n}\, \hat\theta_n - t \big)^2 e^{-(t - Z)^2/2}\, dt \Big] \]
\[ (3.10) \qquad \ge\ \frac{1}{2b}\, E_0\Big[ e^{Z^2/2}\, \mathbb{I}\big( |Z| \le a \big) \int_{-b}^{b} \big( \sqrt{n}\, \hat\theta_n - t \big)^2 e^{-(t - Z)^2/2}\, dt \Big], \]
2b −b
where a is a positive constant, a < b. The next step is to change the
variable of integration to u = t − Z. The new limits of integration are
[−b − Z , b − Z ]. For any Z that satisfies |Z| ≤ a, this interval includes the
interval [−(b − a) , b − a ], so that the integral over [−b , b ] with respect to
t can be estimated from below by the integral in u over [−(b − a) , b − a ].
Hence, for |Z| ≤ a,

b √ 2 b−a √ 2
e−u
2 /2 2 /2
n θ̂n − t e(t−Z) dt ≥ n θ̂n − Z − u du
−b −(b−a)

b−a  √ 2 b−a
n θ̂n − Z + u2 e−u /2 du ≥ u2 e−u
2 2 /2
(3.11) = du.
−(b−a) −(b−a)

b−a
Here the cross term disappears because −(b−a) u exp{−u2 /2} du = 0.

Further, we compute the expected value


 2  a
 1 2a
ez /2 √ e−z /2 dz = √ .
2 2
(3.12) E0 e Z /2
I |Z| ≤ a =
−a 2π 2π
Putting together (3.11) and (3.12), and continuing from (3.10), we arrive at
the lower bound
  b−a
2  2a 1
u2 e−u /2 du
2
sup Eθ n θ̂n − θ ≥ √
θ∈R 2π 2b −(b−a)

a   
(3.13) = E Z02 I |Z0 | ≤ b − a
b
where Z0 is a standard normal random variable. Choose
√ a and b such that
a/b → 1 and b − a → ∞, for example, put a = b − b and  let b → ∞. Then
the expression in (3.13) can be made however close to E Z0 = 1.
2 

The quadratic loss function is not critical in Theorem 3.8. The next
theorem generalizes the result to any loss function.
Theorem 3.9. Under the assumptions of Theorem 3.8, for any loss function
w and any estimator θ̂n, the following lower bound holds:
\[ \liminf_{n \to \infty}\ \sup_{\theta \in \Theta} E_\theta\Big[ w\Big( \sqrt{\tfrac{n}{\sigma^2}}\, (\hat\theta_n - \theta) \Big) \Big] \ \ge\ \int_{-\infty}^{\infty} \frac{w(u)}{\sqrt{2\pi}}\, e^{-u^2/2}\, du. \]
Proof. In the proof of Theorem 3.8, the quadratic loss function was used only
to demonstrate that for any √n θ̂n − Z, the following inequality holds:
\[ \int_{-(b-a)}^{b-a} \big( \sqrt{n}\, \hat\theta_n - Z - u \big)^2 e^{-u^2/2}\, du \ \ge\ \int_{-(b-a)}^{b-a} u^2 e^{-u^2/2}\, du. \]
We can generalize this inequality to any loss function as follows (see
Exercise 3.18). The minimum value of the integral
\( \int_{-(b-a)}^{b-a} w(c - u)\, e^{-u^2/2}\, du \) over c ∈ R is attained at
c = 0, that is,
\[ (3.14) \qquad \int_{-(b-a)}^{b-a} w(c - u)\, e^{-u^2/2}\, du \ \ge\ \int_{-(b-a)}^{b-a} w(u)\, e^{-u^2/2}\, du. \qquad \square \]

Remark 3.10. Note that in the proof of Theorem 3.8 (respectively, Theorem
3.9), we considered the values of θ not in the whole parameter set Θ, but
only in the interval (−b/√n, b/√n) of however small a length. Therefore, it
is possible to formulate a local version of Theorem 3.9 with the proof
remaining the same. For any loss function w, the inequality holds
\[ \lim_{\delta \to 0}\ \liminf_{n \to \infty}\ \sup_{|\theta - \theta_0| < \delta} E_\theta\Big[ w\Big( \sqrt{\tfrac{n}{\sigma^2}}\, (\hat\theta_n - \theta) \Big) \Big] \ \ge\ \int_{-\infty}^{\infty} \frac{w(u)}{\sqrt{2\pi}}\, e^{-u^2/2}\, du. \qquad \square \]
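The limiting constant ∫ w(u) φ(u) du appearing in Theorems 3.9 and 3.14 can be evaluated numerically for the three loss functions introduced in Section 2.1; the sketch below is our own illustration (with c = 1 chosen arbitrarily for the bounded loss).

# Illustrative sketch: the sharp constant is the integral of w(u) against the
# standard normal density; tails beyond |u| = 10 are negligible.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

losses = {
    "quadratic w(u)=u^2": lambda u: u**2,
    "absolute  w(u)=|u|": np.abs,
    "indicator w(u)=I(|u|>1)": lambda u: float(abs(u) > 1.0),
}
for name, w in losses.items():
    val, _ = quad(lambda u: w(u) * norm.pdf(u), -10, 10)
    print(name, val)   # approximately 1.0, 0.798, 0.317 respectively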

3.4. Local Asymptotic Normality (LAN)


The sharp lower bounds of the previous section were proved under the re-
strictive assumption of normal observations. How far can these results be
extended? What is to be required from a statistical experiment to ensure
an asymptotically sharp lower bound of the minimax risk similar to the one
in Theorem 3.9? The answers to these and related questions comprise an
essential part of modern mathematical statistics. Below we present some
ideas that stay within the scope of this book.
Assume that a statistical experiment is regular in the sense that the Fisher
information I(θ) is positive, continuous, and bounded in Θ. Let θ and
θ + t/√n belong to Θ for all t in some compact set. If the Taylor expansion
below is legitimate, we obtain that
\[ \Delta L_n\big( \theta,\ \theta + t/\sqrt{I_n(\theta)} \big) = L_n\big( \theta + t/\sqrt{I_n(\theta)} \big) - L_n(\theta) = \sum_{i=1}^{n} \Big[ l\big( X_i,\ \theta + t/\sqrt{n I(\theta)} \big) - l(X_i, \theta) \Big] \]
\[ (3.15) \qquad = \frac{t}{\sqrt{n I(\theta)}} \sum_{i=1}^{n} l'(X_i, \theta) + \frac{t^2}{2 n I(\theta)} \sum_{i=1}^{n} l''(X_i, \theta)\, \big( 1 + o_n(1) \big), \]
where on(1) → 0 as n → ∞.
By the computational formula for the Fisher information (see Exercise 1.1),
\[ \sum_{i=1}^{n} E_\theta\big[ l''(X_i, \theta) \big] = -\, n I(\theta), \]
and therefore, the Law of Large Numbers ensures the convergence of the
second term in (3.15),
\[ \frac{t^2}{2 n I(\theta)} \sum_{i=1}^{n} l''(X_i, \theta)\, \big( 1 + o_n(1) \big) \ \to\ -\, \frac{t^2}{2} \quad \text{as } n \to \infty. \]

Thus, at the heuristic level of understanding, we can expect that for any
t ∈ R, the log-likelihood ratio satisfies
\[ (3.16) \qquad \Delta L_n\big( \theta,\ \theta + t/\sqrt{I_n(\theta)} \big) = z_n(\theta)\, t - t^2/2 + \varepsilon_n(\theta, t), \]
where, by the Central Limit Theorem, the random variable
\[ z_n(\theta) = \frac{1}{\sqrt{n I(\theta)}} \sum_{i=1}^{n} l'(X_i, \theta) \]
converges in Pθ-distribution to a standard normal random variable z(θ), and
for any δ > 0,
\[ (3.17) \qquad \lim_{n \to \infty} P_\theta\big( |\varepsilon_n(\theta, t)| \ge \delta \big) = 0. \]

A family of distributions for which the log-likelihood ratio has the
representation (3.16) under the constraint (3.17) is said to satisfy the
local asymptotic normality (LAN) condition. It can actually be derived under
less restrictive assumptions. In particular, we do not need to require the
existence of the second derivative l''.
To generalize Theorem 3.9 to distributions satisfying the LAN condition, we
need to justify that the remainder term εn(θ, t) may be ignored in the
expression for the likelihood ratio,
\[ \exp\big\{ \Delta L_n\big( \theta,\ \theta + t/\sqrt{I_n(\theta)} \big) \big\} \ \approx\ \exp\big\{ z_n(\theta)\, t - t^2/2 \big\}. \]
To do this, we have to guarantee that the following approximation holds as
n → ∞:
\[ (3.18) \qquad E_\theta\Big| \exp\big\{ z_n(\theta)\, t - t^2/2 + \varepsilon_n(\theta, t) \big\} - \exp\big\{ z_n(\theta)\, t - t^2/2 \big\} \Big| \ \to\ 0. \]

Unfortunately, the condition (3.17) that warrants that εn (θ, t) vanishes in


probability does not imply (3.18). The remedy comes in the form of LeCam’s
theorem stated below.
Theorem 3.11. Under the LAN conditions (3.16) and (3.17), there exists a
sequence of random variables z̃n(θ) such that |zn(θ) − z̃n(θ)| → 0 in
Pθ-probability as n → ∞, and for any c > 0,
\[ \lim_{n \to \infty}\ \sup_{-c \le t \le c} E_\theta\Big| \exp\big\{ z_n(\theta)\, t - t^2/2 + \varepsilon_n(\theta, t) \big\} - \exp\big\{ \tilde z_n(\theta)\, t - t^2/2 \big\} \Big| = 0. \]

To ease the proof, we split it into lemmas proved as the technical results
below.
Lemma 3.12. Under the LAN condition (3.16), there exists a truncation of
zn(θ) defined by
\[ \tilde z_n(\theta) = z_n(\theta)\, \mathbb{I}\big( z_n(\theta) \le c_n \big), \]
with a properly chosen sequence of constants cn, such that the following
equations hold:
\[ (3.19) \qquad \tilde z_n(\theta) - z_n(\theta) \ \to\ 0 \quad \text{as } n \to \infty \]
and
\[ (3.20) \qquad \lim_{n \to \infty}\ \sup_{-c \le t \le c} \Big|\, E_\theta\big[ \exp\big\{ \tilde z_n(\theta)\, t - t^2/2 \big\} \big] - 1 \Big| = 0. \]

Introduce the notations
\[ \xi_n(t) = \exp\big\{ z_n(\theta)\, t - t^2/2 + \varepsilon_n(\theta, t) \big\}, \qquad \tilde\xi_n(t) = \exp\big\{ \tilde z_n(\theta)\, t - t^2/2 \big\}, \qquad \xi(t) = \exp\big\{ z(\theta)\, t - t^2/2 \big\}, \]
where z̃n(θ) is as defined in Lemma 3.12, and z(θ) is a standard normal
random variable.
Lemma 3.13. Under the LAN condition (3.16), the tails of ξn(t) and ξ̃n(t)
are small, uniformly in n and t ∈ [−c, c], in the sense that
\[ (3.21) \qquad \lim_{A \to \infty}\ \sup_{n \ge 1}\ \sup_{-c \le t \le c} E_\theta\big[ \xi_n(t)\, \mathbb{I}( \xi_n(t) > A ) \big] = \lim_{A \to \infty}\ \sup_{n \ge 1}\ \sup_{-c \le t \le c} E_\theta\big[ \tilde\xi_n(t)\, \mathbb{I}( \tilde\xi_n(t) > A ) \big] = 0. \]

Now we are in the position to prove the LeCam theorem.


Proof of Theorem 3.11. We have to show that for any t ∈ [−c, c], the
convergence takes place:
(3.22) lim_{n→∞} Eθ | ξn(t) − ξ̃n(t) | = 0.
From the triangle inequality, we obtain that
Eθ | ξn(t) − ξ̃n(t) | ≤ Eθ | ξn(t) I( ξn(t) ≤ A ) − ξ(t) I( ξ(t) ≤ A ) |
+ Eθ | ξ̃n(t) I( ξ̃n(t) ≤ A ) − ξ(t) I( ξ(t) ≤ A ) |
(3.23) + Eθ[ ξn(t) I( ξn(t) > A ) ] + Eθ[ ξ̃n(t) I( ξ̃n(t) > A ) ].
Due to Lemma 3.13, we can choose A so large that the last two terms do
not exceed a however small positive δ. From Lemma 3.12, ξ˜n (t) − ξ(t) → 0
in Pθ -distribution, and by the LAN condition, ξn (t)−ξ(t) → 0, therefore, for
a fixed A, the first two terms on the right-hand side of (3.23) are vanishing
uniformly over t ∈ [−c, c] as n → ∞. 
Finally, we formulate the result analogous to Theorem 3.9 (for the proof
see Exercise 3.20).
Theorem 3.14. If a statistical model satisfies the LAN condition (3.16),
then for any loss function w, the asymptotic lower bound of the minimax
risk holds:
lim inf_{n→∞} inf_{θ̂n} sup_{θ∈Θ} Eθ[ w( √(In(θ)) (θ̂n − θ) ) ] ≥ ∫_{−∞}^{∞} ( w(u)/√(2π) ) e^{−u²/2} du.

3.5. The Hellinger Distance


Though this section may seem rather technical, it answers an important
statistical question. Suppose that the statistical experiment with the family
of densities p(x, θ) is such that p(x, θ0) = p(x, θ1) for some θ0 ≠ θ1, where
θ0 , θ1 ∈ Θ, and all x ∈ R. Clearly, no statistical observations can distinguish
between θ0 and θ1 in this case.
 
Thus, we have to require that the family of probability densities p(x, θ)
is such that for θ0 ≠ θ1, the densities p( · , θ0) and p( · , θ1) are essentially
different in some sense. How can the difference between these two densities
be measured?
   
For any family of densities { p(x, θ) }, the set { √p( · , θ), θ ∈ Θ } presents
a parametric curve on the surface of the unit sphere in L2-space, the space of
square integrable functions in the x variable. Indeed, for any θ, the square of
the L2-norm is
‖ √p( · , θ) ‖₂² = ∫_R ( √p(x, θ) )² dx = ∫_R p(x, θ) dx = 1.

The Hellinger distance H(θ0, θ1) between p( · , θ0) and p( · , θ1) is defined as
(3.24) H(θ0, θ1) = ‖ √p( · , θ0) − √p( · , θ1) ‖₂², θ0, θ1 ∈ Θ.
Lemma 3.15. For the Hellinger distance (3.24), the following identities hold:
(i) H(θ0, θ1) = 2 ( 1 − ∫_R √( p(x, θ0) p(x, θ1) ) dx )
and
(ii) Eθ0[ √( Z1(θ0, θ1) ) ] = 1 − (1/2) H(θ0, θ1)
where Z1 (θ0 , θ1 ) = p(X, θ1 )/p(X, θ0 ) denotes the likelihood ratio for a single
observation.

Proof. (i) We write by definition
H(θ0, θ1) = ∫_R ( √p(x, θ0) − √p(x, θ1) )² dx
= ∫_R p(x, θ0) dx + ∫_R p(x, θ1) dx − 2 ∫_R √( p(x, θ0) p(x, θ1) ) dx
= 2 ( 1 − ∫_R √( p(x, θ0) p(x, θ1) ) dx ).
(ii) By definition of Z1(θ0, θ1), we have
Eθ0[ √( Z1(θ0, θ1) ) ] = ∫_R √( p(x, θ1)/p(x, θ0) ) p(x, θ0) dx
= ∫_R √( p(x, θ0) p(x, θ1) ) dx = 1 − (1/2) H(θ0, θ1)
where the result of part (i) is applied. □
Lemma 3.16. Let Zn(θ0, θ1) be the likelihood ratio for a sample of size n,
Zn(θ0, θ1) = ∏_{i=1}^{n} p(Xi, θ1)/p(Xi, θ0), θ0, θ1 ∈ Θ.
Then the following identity is true:
Eθ0[ √( Zn(θ0, θ1) ) ] = ( 1 − (1/2) H(θ0, θ1) )^n.
Proof. In view of independence of observations and Lemma 3.15 (ii), we have
Eθ0[ √( Zn(θ0, θ1) ) ] = ∏_{i=1}^{n} Eθ0[ √( p(Xi, θ1)/p(Xi, θ0) ) ]
= ( 1 − (1/2) H(θ0, θ1) )^n. □
Assumption 3.17. There exists a constant a > 0 such that for any θ0 , θ1 ∈
Θ ⊆ R , the inequality H(θ0 , θ1 ) ≥ a (θ0 − θ1 )2 holds. 
 
Example 3.18. If Xi's are independent N(θ, σ²) random variables, then
by Lemma 3.15 (i),
H(θ0, θ1) = 2 ( 1 − ( 1/√(2πσ²) ) ∫_{−∞}^{∞} exp{ −(x − θ0)²/(4σ²) − (x − θ1)²/(4σ²) } dx )
= 2 ( 1 − exp{ −(θ0 − θ1)²/(8σ²) } ( 1/√(2πσ²) ) ∫_{−∞}^{∞} exp{ −(x − θ̄)²/(2σ²) } dx )
where θ̄ = (θ0 + θ1)/2. The latter integral equals 1 as the integral of a probability density. Therefore,
H(θ0, θ1) = 2 ( 1 − exp{ −(θ0 − θ1)²/(8σ²) } ).
If Θ is a bounded interval, then (θ0 − θ1)²/(8σ²) ≤ C with some constant
C > 0. In this case,
H(θ0, θ1) ≥ a (θ0 − θ1)², a = (1 − e^{−C})/(4Cσ²),
where we used the inequality 1 − e^{−x} ≥ (1 − e^{−C}) x/C if 0 ≤ x ≤ C. □
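The closed-form expression in Example 3.18 is easy to verify numerically. The sketch below is our own illustration (function names and constants are arbitrary): it evaluates the Hellinger distance (3.24) for two normal densities by numerical integration and compares it with 2(1 − exp{−(θ0 − θ1)²/(8σ²)}).

import numpy as np
from scipy.integrate import quad

# Numerical check of Example 3.18: Hellinger distance between
# N(theta0, sigma^2) and N(theta1, sigma^2).
def normal_pdf(x, mean, sigma):
    return np.exp(-(x - mean) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def hellinger(theta0, theta1, sigma):
    integrand = lambda x: (np.sqrt(normal_pdf(x, theta0, sigma))
                           - np.sqrt(normal_pdf(x, theta1, sigma))) ** 2
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

theta0, theta1, sigma = 0.0, 1.0, 1.5
closed_form = 2 * (1 - np.exp(-(theta0 - theta1) ** 2 / (8 * sigma**2)))
print(hellinger(theta0, theta1, sigma), closed_form)  # the two values agree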

3.6. Maximum Likelihood Estimator


In this section we study regular statistical experiments, which have contin-
uous, bounded, and strictly positive Fisher information I(θ).
We call θn∗ an asymptotically minimax estimator, if for any loss function
w, and all sufficiently large n, the following inequality holds:
sup_{θ∈Θ} Eθ[ w( (θn∗ − θ)/ψn ) ] ≤ r∗ < ∞
where ψn and r∗ are as in (3.7).
Recall from Section 1.1 that an estimator θn∗ is called the maximum
likelihood estimator (MLE), if for any θ ∈ Θ,
Ln (θn∗ ) ≥ Ln (θ).
It turns out that Assumption 3.17 guarantees the asymptotic minimax
property of the MLE with ψn = 1/√(nI(θ)). This result is proved in Theorem
3.20. We start with a lemma, the proof of which is postponed until the next
section.
Lemma 3.19. Under Assumption 3.17, for any θ ∈ Θ and any c > 0, the
MLE θn∗ satisfies the inequality
Pθ( √n |θn∗ − θ| ≥ c ) ≤ C exp{ −a c²/4 }
where the constant C = 2 + 3 √( π I∗/a ) with I∗ = sup_{θ∈Θ} I(θ) < ∞.

At this point we are ready to prove the asymptotic minimaxity of the


MLE.

Theorem 3.20. Under Assumption 3.17, the MLE is asymptotically minimax.
That is, for any loss function w and for any θ ∈ Θ, the normalized
risk function of the MLE is finite,
lim sup_{n→∞} Eθ[ w( √(nI(θ)) (θn∗ − θ) ) ] = r∗ < ∞.
Proof. Since w(u) is an increasing function for u ≥ 0, we have
Eθ[ w( √(nI(θ)) (θn∗ − θ) ) ]
≤ Σ_{m=0}^{∞} w(m + 1) Pθ( m ≤ √(nI(θ)) |θn∗ − θ| ≤ m + 1 )
≤ Σ_{m=0}^{∞} w(m + 1) Pθ( √n |θn∗ − θ| ≥ m/√(I(θ)) ).
By definition, the loss w is bounded from above by a power function, while
the probabilities decrease exponentially fast by Lemma 3.19. Therefore, the
latter sum is finite. 

To find the sharp upper bound for the MLE, we make an additional
assumption that allows us to prove a relatively simple result. As shown in
the next theorem, the normalized deviation of the MLE from the true value
of the parameter, √(nI(θ)) (θn∗ − θ), converges in distribution to a standard
normal random variable. Note that this result is sufficient to claim the
asymptotically sharp minimax property for all bounded loss functions.
Theorem 3.21. Let Assumption 3.17 and the LAN condition (3.16) hold.
Moreover, suppose that for any δ > 0 and any c > 0, the remainder term in
(3.16) satisfies the equality:
(3.25) lim_{n→∞} sup_{θ∈Θ} Pθ( sup_{−c ≤ t ≤ c} | εn(θ, t) | ≥ δ ) = 0.

Then for any x ∈ R, uniformly in θ ∈ Θ, the MLE satisfies the limiting
equation:
lim_{n→∞} Pθ( √(nI(θ)) (θn∗ − θ) ≤ x ) = Φ(x)
where Φ denotes the standard normal cumulative distribution function.

Proof. Fix a large c such that c > | x |, and a small δ > 0. Put
t∗n = √(nI(θ)) (θn∗ − θ).
Define two random events
An = An(c, δ) = { sup_{−2c ≤ t ≤ 2c} | εn(θ, t) | ≥ δ }
and
Bn = Bn(c) = { | t∗n | ≥ c }.
Note that under the condition (3.25), we have that Pθ (An ) → 0 as
n → ∞. Besides, as follows from the Markov inequality and Theorem 3.20
with w(u) = |u|,

Pθ (Bn ) ≤ Eθ | t∗n | /c ≤ r∗ /c.
Let Ān and B̄n denote the complements of the events An and Bn, respectively.
We will use the following inclusion (for the proof see Exercise 3.21)
(3.26) Ān ∩ B̄n ⊆ { | t∗n − zn(θ) | ≤ 2√δ }
or, equivalently,
(3.27) { | t∗n − zn(θ) | > 2√δ } ⊆ An ∪ Bn
where zn(θ) is defined in (3.16). Elementary inequalities and (3.26) imply that
Pθ( t∗n ≤ x ) ≤ Pθ( {t∗n ≤ x} ∩ Ān ∩ B̄n ) + Pθ( An ) + Pθ( Bn )
≤ Pθ( zn(θ) ≤ x + 2√δ ) + Pθ( An ) + Pθ( Bn ).
Taking the limit as n → ∞, we obtain that
(3.28) lim sup_{n→∞} Pθ( t∗n ≤ x ) ≤ Φ( x + 2√δ ) + r∗/c
where we use the fact that zn(θ) is asymptotically standard normal. Next,
Pθ( t∗n ≤ x ) ≥ Pθ( {t∗n ≤ x} ∩ { | t∗n − zn(θ) | ≤ 2√δ } )
≥ Pθ( { zn(θ) ≤ x − 2√δ } ∩ { | t∗n − zn(θ) | ≤ 2√δ } )
≥ Pθ( zn(θ) ≤ x − 2√δ ) − Pθ( | t∗n − zn(θ) | > 2√δ )
≥ Pθ( zn(θ) ≤ x − 2√δ ) − Pθ( An ) − Pθ( Bn )
where at the last stage we have applied (3.27). Again, taking n → ∞, we
have that
(3.29) lim inf_{n→∞} Pθ( t∗n ≤ x ) ≥ Φ( x − 2√δ ) − r∗/c.
Now we combine (3.28) and (3.29) and take into account that c can be chosen
arbitrarily large and δ arbitrarily small. Thus, the theorem follows. □
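Theorem 3.21 can be illustrated by a small simulation. The sketch below is our own example (not part of the text): it uses the exponential model p(x, θ) = θ exp{−θx}, for which the MLE equals the reciprocal of the sample mean and I(θ) = 1/θ², and checks that √(nI(θ))(θn∗ − θ) is approximately standard normal.

import numpy as np

# Monte Carlo illustration of Theorem 3.21 for the exponential model:
# the MLE is 1 / sample mean, and I(theta) = 1 / theta^2.
rng = np.random.default_rng(1)
theta, n, n_rep = 2.0, 400, 20_000
X = rng.exponential(scale=1.0 / theta, size=(n_rep, n))
mle = 1.0 / X.mean(axis=1)
normalized = np.sqrt(n / theta**2) * (mle - theta)

print(normalized.mean(), normalized.std())   # close to 0 and 1
print(np.mean(normalized <= 1.0))            # close to Phi(1) ~ 0.8413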

3.7. Proofs of Technical Lemmas


Proof of Lemma 3.12. Define z̃n(θ, A) = zn(θ) I( zn(θ) ≤ A ) where A is
a large positive constant. Note that z̃n(θ, A) converges in distribution, as n
increases, to z(θ, A) = z(θ) I( z(θ) ≤ A ) with a standard normal z(θ). Thus,
for any k, k = 1, 2, . . . , we can find a constant Ak and an integer nk so large
that for all n ≥ nk,
sup_{−c ≤ t ≤ c} Eθ | exp{ z̃n(θ, Ak) t − t²/2 } − 1 | ≤ 1/k.

Without loss of generality, we can assume that nk is an increasing sequence,


nk → ∞ as k → ∞. Finally, put cn = Ak if nk ≤ n < nk+1 . From this
definition, (3.19) and (3.20) follow. 
Proof of Lemma 3.13. First we will prove (3.21) for ξ̃n (t). Note that
ξ̃n (t), n = 1, . . . , are positive random variables. By Lemma 3.12, for any
t ∈ [−c, c], the convergence takes place
(3.30) ξ̃n (t) → ξ(t) as n → ∞.

Hence, the expected value of ξ̃n(t) converges as n → ∞,
(3.31) sup_{−c ≤ t ≤ c} | Eθ[ ξ̃n(t) ] − Eθ[ ξ(t) ] | = sup_{−c ≤ t ≤ c} | Eθ[ ξ̃n(t) ] − 1 | → 0.

Choose an arbitrarily small δ > 0. There exists A(δ) such that uniformly
over t ∈ [−c, c], the following inequality holds:
(3.32) Eθ[ ξ(t) I( ξ(t) > A(δ) ) ] ≤ δ.
Next, we can choose n = n(δ) so large that for any n ≥ n(δ) and all
t ∈ [−c, c], the following inequalities are satisfied:
(3.33) | Eθ[ ξ̃n(t) ] − Eθ[ ξ(t) ] | ≤ δ
and
(3.34) | Eθ[ ξ̃n(t) I( ξ̃n(t) ≤ A(δ) ) ] − Eθ[ ξ(t) I( ξ(t) ≤ A(δ) ) ] | ≤ δ.
To see that the latter inequality holds, use the fact that A(δ) is fixed and
ξ̃n (t) → ξ(t) as n → ∞.

The triangle inequality and the inequalities (3.32)–(3.34) imply that for
any A ≥ A(δ),
Eθ[ ξ̃n(t) I( ξ̃n(t) > A ) ] ≤ Eθ[ ξ̃n(t) I( ξ̃n(t) > A(δ) ) ]
= Eθ[ ξ̃n(t) − ξ̃n(t) I( ξ̃n(t) ≤ A(δ) ) ]
− Eθ[ ξ(t) − ξ(t) I( ξ(t) ≤ A(δ) ) − ξ(t) I( ξ(t) > A(δ) ) ]
≤ | Eθ[ ξ̃n(t) ] − Eθ[ ξ(t) ] |
+ | Eθ[ ξ̃n(t) I( ξ̃n(t) ≤ A(δ) ) ] − Eθ[ ξ(t) I( ξ(t) ≤ A(δ) ) ] |
(3.35) + Eθ[ ξ(t) I( ξ(t) > A(δ) ) ] ≤ 3δ.

There are finitely many n such that n ≤ n(δ). For each n ≤ n(δ), we can
find An so large that for all A ≥ An, the following expected value is bounded:
Eθ[ ξ̃n(t) I( ξ̃n(t) > A ) ] ≤ 3δ. Put A0 = max{ A1, . . . , An(δ), A(δ) }. By
definition, for any A ≥ A0, and all t ∈ [−c, c], we have that
(3.36) sup_{n ≥ 1} Eθ[ ξ̃n(t) I( ξ̃n(t) > A ) ] ≤ 3δ.
Thus, the lemma follows for ξ˜n (t).


The proof for ξn(t) is simpler. Similarly to ξ̃n(t), the random variables
ξn(t), n = 1, . . . , are positive, and the convergence analogous to (3.30)
is valid from the LAN condition, ξn(t) → ξ(t) as n → ∞. But since
Eθ[ ξn(t) ] = 1, the convergence (3.31) of the expected values is replaced by
the exact equality, | Eθ[ ξn(t) ] − Eθ[ ξ(t) ] | = 0. Therefore, (3.35) and (3.36)
hold for ξn(t) with the upper bound replaced by 2δ, and the result of the
lemma follows. □
Proof of Lemma 3.19. The proof of this lemma (and Theorem 3.20) is
due to A.I. Sakhanenko (cf. Borovkov [Bor99], Chapter 2, §23). The proof
is based on two results which we state and prove below.
Introduce the likelihood ratio
Zn(θ, θ + t) = ∏_{i=1}^{n} p(Xi, θ + t)/p(Xi, θ).
Result 1. Put zn(t) = ( Zn(θ, θ + t) )^{3/4}. Under Assumption 3.17, for any
θ, θ + t ∈ Θ, the following inequalities hold:
(3.37) Eθ[ √( Zn(θ, θ + t) ) ] ≤ exp{ −a n t²/2 },
(3.38) Eθ[ zn(t) ] ≤ exp{ −a n t²/4 },
and
(3.39) Eθ[ | z′n(t) | ] ≤ (3/4) √( In(θ + t) ) exp{ −a n t²/4 }
where z′n(t) = dzn(t)/dt.

Proof. From Lemma 3.16 and Assumption 3.17, we obtain (3.37),
Eθ[ √( Zn(θ, θ + t) ) ] = ( 1 − (1/2) H(θ, θ + t) )^n
≤ exp{ −(n/2) H(θ, θ + t) } ≤ exp{ −a n t²/2 }.
To prove (3.38), we use the Cauchy-Schwarz inequality and (3.37),
Eθ[ zn(t) ] = Eθ[ ( Zn(θ, θ + t) )^{3/4} ]
= Eθ[ ( Zn(θ, θ + t) )^{1/2} ( Zn(θ, θ + t) )^{1/4} ]
≤ ( Eθ[ Zn(θ, θ + t) ] )^{1/2} ( Eθ[ ( Zn(θ, θ + t) )^{1/2} ] )^{1/2}
= ( Eθ[ √( Zn(θ, θ + t) ) ] )^{1/2} ≤ exp{ −a n t²/4 }.
Here we used the identity (show!) Eθ[ Zn(θ, θ + t) ] = 1.
The proof of (3.39) requires more calculations. We write
Eθ[ | z′n(t) | ] = Eθ[ | (d/dt) exp{ (3/4) ( Ln(θ + t) − Ln(θ) ) } | ]
= (3/4) Eθ[ | L′n(θ + t) | ( Zn(θ, θ + t) )^{3/4} ]
= (3/4) Eθ[ | L′n(θ + t) | ( Zn(θ, θ + t) )^{1/2} ( Zn(θ, θ + t) )^{1/4} ]
≤ (3/4) ( Eθ[ ( L′n(θ + t) )² Zn(θ, θ + t) ] )^{1/2} ( Eθ[ ( Zn(θ, θ + t) )^{1/2} ] )^{1/2}
= (3/4) ( Eθ+t[ ( L′n(θ + t) )² ] )^{1/2} ( Eθ[ √( Zn(θ, θ + t) ) ] )^{1/2}   ( by (3.8) )
≤ (3/4) √( In(θ + t) ) exp{ −a n t²/4 }.
The last inequality sign is justified by the definition of the Fisher information, and (3.37). □

Result 2. Let Assumption 3.17 be true. Then for any positive constants γ
and c, the following inequality holds:
Pθ( sup_{| t | ≥ c/√n} Zn(θ, θ + t) ≥ e^γ ) ≤ C e^{−3γ/4} exp{ −a c²/4 }
where C = 2 + 3 √( π I∗/a ) with I∗ = sup_{θ∈Θ} I(θ) < ∞.

Proof. Consider the case t > 0. Note that
sup_{t ≥ c/√n} zn(t) = zn(c/√n) + sup_{t > c/√n} ∫_{c/√n}^{t} z′n(u) du
≤ zn(c/√n) + sup_{t > c/√n} ∫_{c/√n}^{t} | z′n(u) | du
≤ zn(c/√n) + ∫_{c/√n}^{∞} | z′n(u) | du.
Applying Result 1, we find that
Eθ[ sup_{t ≥ c/√n} zn(t) ] ≤ exp{ −a c²/4 } + (3/4) √( n I∗ ) ∫_{c/√n}^{∞} exp{ −a n u²/4 } du
= exp{ −a c²/4 } + (3/4) √( n I∗ ) √( 4π/(a n) ) ∫_{c√(a/2)}^{∞} ( 1/√(2π) ) e^{−u²/2} du
= exp{ −a c²/4 } + (3/2) √( π I∗/a ) ( 1 − Φ( c √(a/2) ) )
where Φ denotes the cumulative distribution function of the standard normal
random variable. The inequality 1 − Φ(x) ≤ e^{−x²/2} yields
exp{ −a c²/4 } + (3/2) √( π I∗/a ) ( 1 − Φ( c √(a/2) ) )
≤ ( 1 + (3/2) √( π I∗/a ) ) exp{ −a c²/4 } = (1/2) C exp{ −a c²/4 }.
The same inequality is true for t < 0 (show!),
Eθ[ sup_{t ≤ −c/√n} zn(t) ] ≤ (1/2) C exp{ −a c²/4 }.

Further,
Pθ( sup_{| t | ≥ c/√n} Zn(θ, θ + t) ≥ e^γ )
≤ Pθ( sup_{t ≥ c/√n} Zn(θ, θ + t) ≥ e^γ ) + Pθ( sup_{t ≤ −c/√n} Zn(θ, θ + t) ≥ e^γ )
= Pθ( sup_{t ≥ c/√n} zn(t) ≥ e^{3γ/4} ) + Pθ( sup_{t ≤ −c/√n} zn(t) ≥ e^{3γ/4} ),
and the Markov inequality P(X ≥ x) ≤ E[ X ]/x completes the proof,
≤ (1/2) C e^{−3γ/4} exp{ −a c²/4 } + (1/2) C e^{−3γ/4} exp{ −a c²/4 }
= C e^{−3γ/4} exp{ −a c²/4 }. □

Now we are in a position to prove Lemma 3.19. Applying the inclusion
{ √n |θn∗ − θ| ≥ c } = { sup_{|t| ≥ c/√n} Zn(θ, θ + t) ≥ sup_{|t| < c/√n} Zn(θ, θ + t) }
⊆ { sup_{|t| ≥ c/√n} Zn(θ, θ + t) ≥ Zn(θ, θ) = 1 },
and using Result 2 with γ = 0, we obtain
Pθ( √n |θn∗ − θ| ≥ c ) ≤ Pθ( sup_{| t | ≥ c/√n} Zn(θ, θ + t) ≥ 1 )
≤ C exp{ −a c²/4 }. □
Exercises


Exercise 3.14. Verify that in Example 1.5, the estimator θ̂n = √(X/n) is
an asymptotically unbiased estimator of θ. Hint: Note that | √(X/n) − θ | =
| X/n − θ² | / | √(X/n) + θ |, and thus, Eθ[ | √(X/n) − θ | ] ≤ θ^{−1} Eθ[ | X/n − θ² | ].
Now use the Cauchy-Schwarz inequality to finish off the proof.

Exercise 3.15. Show that the Hodges estimator defined by (3.3) is asymp-
totically unbiased and satisfies the identities (3.4).
Exercise 3.16. Prove Theorem 3.7.

Exercise 3.17. Suppose the conditions of Theorem 3.7 hold, and a loss
function w is such that w(1/2) > 0. Show that for any estimator θ̂n the
following lower bound holds:
sup_{θ∈Θ} Eθ[ w( √n (θ̂n − θ) ) ] ≥ (1/2) w(1/2) p0 exp{ z0 }.
Hint: Use Theorem 3.7 and the inequality (show!)
w( √n (θ̂n − θ) ) + w( √n (θ̂n − θ) − 1 ) ≥ w(1/2), for any θ ∈ Θ.

Exercise 3.18. Prove (3.14). Hint: First show this result for bounded loss
functions.
Exercise 3.19. Prove the local asymptotic normality (LAN) for
(i) the exponential model with the density
p(x, θ) = θ exp{ −θx }, x, θ > 0;
(ii) the Poisson model with the probability mass function
p(x, θ) = ( θ^x / x! ) exp{ −θ }, θ > 0, x ∈ { 0, 1, . . . }.
Exercise 3.20. Prove Theorem 3.14. Hint: Start with a truncated loss
function wC (u) = min(w(u), C) for some C > 0. Applying Theorem 3.11,
obtain an analogue of (3.9) of the form
sup_{θ∈R} Eθ[ wC( √(nI(θ)) (θ̂n − θ) ) ]
≥ (1/(2b)) ∫_{−b}^{b} E0[ wC( an θ̂n − t ) exp{ z̃n(0) t − t²/2 } ] dt + on(1)
  
where an = √( nI( t/√(nI(0)) ) ), z̃n(0) is an asymptotically standard normal
random variable, and on (1) → 0 as n → ∞ . Then follow the lines of
Theorems 3.8 and 3.9, and, finally, let C → ∞ .
Exercise 3.21. Consider a distorted parabola zt − t2 /2 + ε(t) where z has
a fixed value and −2c ≤ t ≤ 2c. Assume that the maximum of this function
is attained at a point t∗ that lies within the interval [−c, c]. Suppose that the
remainder term satisfies sup_{−2c ≤ t ≤ 2c} | ε(t) | ≤ δ. Show that | t∗ − z | ≤ 2√δ.
Chapter 4

Some Irregular
Statistical Experiments

4.1. Irregular Models: Two Examples


As shown in the previous chapters, in regular models, for any estimator θ̂n,
the normalized deviation √(nI(θ)) (θ̂n − θ) either grows or stays bounded in
the minimax sense, as n increases. In particular, we have shown that the
quadratic risk Eθ[ (θ̂n − θ)² ] decreases not faster than at the rate O(n^{−1})
as n → ∞. This result has been obtained under some regularity conditions.
The easiest way to understand their importance is to look at some irregular
experiments commonly used in statistics, for which the regularity conditions
are violated and the quadratic risk converges faster than O(n−1 ). We present
two examples below.

Example 4.1. Suppose the observations X1, . . . , Xn come from the uniform
distribution on [0, θ]. The family of probability densities can be defined
as p(x, θ) = θ^{−1} I( 0 ≤ x ≤ θ ). In this case, the MLE of θ is the
maximum of all observations (see Exercise 4.22), that is, θ̂n = X(n) =
max( X1, . . . , Xn ). The estimator
θn∗ = ( (n + 1)/n ) X(n)
is an unbiased estimator of θ with the variance
Varθ( θn∗ ) = θ²/( n(n + 2) ) = O( n^{−2} ) as n → ∞. □

Example 4.2. Consider a model with observations X1, . . . , Xn which have
a shifted exponential distribution with the density
p(x, θ) = e^{−(x−θ)} I( x ≥ θ ), θ ∈ R.
It can be shown (see Exercise 4.23) that the MLE of θ is θ̂n = X(1) =
min( X1, . . . , Xn ), and that θn∗ = X(1) − n^{−1} is an unbiased estimator of θ
with the variance Varθ( θn∗ ) = n^{−2}. □
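The O(n^{−2}) behavior of the variances in Examples 4.1 and 4.2 is easy to observe in simulation. The following Python sketch is our own illustration (constants and seed are arbitrary); it estimates Varθ(θn∗) for both models by Monte Carlo and compares it with the exact values θ²/(n(n+2)) and 1/n².

import numpy as np

# Monte Carlo check of the O(n^-2) variances in Examples 4.1 and 4.2.
rng = np.random.default_rng(2)
theta, n, n_rep = 3.0, 50, 200_000

# Uniform(0, theta): theta* = (n + 1) X_(n) / n
U = rng.uniform(0.0, theta, size=(n_rep, n))
unif_est = (n + 1) / n * U.max(axis=1)
print(unif_est.var(), theta**2 / (n * (n + 2)))

# Shifted exponential with location theta: theta* = X_(1) - 1/n
E = theta + rng.exponential(1.0, size=(n_rep, n))
exp_est = E.min(axis=1) - 1.0 / n
print(exp_est.var(), 1.0 / n**2)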

The unbiased estimators in the above examples violate the Cramér-Rao
lower bound (1.2) since their variances decrease faster than O( n^{−1} ). Why
does it happen? In the next section we explain that in these examples the
Fisher information does not exist, and therefore, the Cramér-Rao inequality
is not applicable.

4.2. Criterion for Existence of the Fisher Information


 
For any probability density p(x, θ), consider the set { √p( · , θ), θ ∈ Θ }.
It has been shown in Section 3.5 that for any fixed θ, √p( · , θ) has a unit
L2-norm, that is,
‖ √p( · , θ) ‖₂² = ∫_R ( √p(x, θ) )² dx = 1.

The existence of the Fisher information is equivalent to the smoothness


of this curve as a function of θ. We show that the Fisher information exists
if this curve is differentiable with respect to θ in the L2 -space.
Theorem 4.3. The Fisher information is finite if and only if the L2-norm
of the derivative ‖ ∂√p( · , θ)/∂θ ‖₂ is finite. The Fisher information is
computed according to the formula
I(θ) = 4 ‖ (∂/∂θ) √p( · , θ) ‖₂².
Proof. The proof is straightforward:
‖ (∂/∂θ) √p( · , θ) ‖₂² = ∫_R ( (∂/∂θ) √p(x, θ) )² dx
= ∫_R ( ( ∂p(x, θ)/∂θ ) / ( 2 √p(x, θ) ) )² dx = (1/4) ∫_R ( ( ∂p(x, θ)/∂θ ) / p(x, θ) )² p(x, θ) dx
= (1/4) ∫_R ( ∂ ln p(x, θ)/∂θ )² p(x, θ) dx = (1/4) I(θ). □
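For a regular family, the formula of Theorem 4.3 can be verified numerically. The sketch below is our own illustration: it approximates 4‖∂√p(·,θ)/∂θ‖₂² by a finite difference for the N(θ, 1) location family, whose Fisher information equals 1. The step size and function names are arbitrary choices.

import numpy as np
from scipy.integrate import quad

# Numerical check of Theorem 4.3 for the N(theta, 1) family, I(theta) = 1.
def sqrt_density(x, theta):
    return (2 * np.pi) ** (-0.25) * np.exp(-(x - theta) ** 2 / 4)

def fisher_info(theta, h=1e-4):
    integrand = lambda x: ((sqrt_density(x, theta + h) - sqrt_density(x, theta - h))
                           / (2 * h)) ** 2
    value, _ = quad(integrand, -np.inf, np.inf)
    return 4 * value

print(fisher_info(0.7))   # close to 1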
Example 4.4. The family of the uniform densities in Example 4.1 is not
differentiable in the sense of Theorem 4.3. By definition,
‖ (∂/∂θ) √p( · , θ) ‖₂² = lim_{Δθ→0} (Δθ)^{−2} ‖ √p( · , θ + Δθ) − √p( · , θ) ‖₂²
= lim_{Δθ→0} (Δθ)^{−2} ‖ ( 1/√(θ + Δθ) ) I_{[0, θ+Δθ]}( · ) − ( 1/√θ ) I_{[0, θ]}( · ) ‖₂².
A finite limit exists if and only if
‖ ( 1/√(θ + Δθ) ) I_{[0, θ+Δθ]}( · ) − ( 1/√θ ) I_{[0, θ]}( · ) ‖₂² = O( (Δθ)² ) as Δθ → 0.
However, the L2-norm decreases at a lower rate. To see this, assume Δθ is
positive and write
‖ ( 1/√(θ + Δθ) ) I_{[0, θ+Δθ]}( · ) − ( 1/√θ ) I_{[0, θ]}( · ) ‖₂²
= ∫_R ( ( 1/√(θ + Δθ) ) I( 0 ≤ x ≤ θ + Δθ ) − ( 1/√θ ) I( 0 ≤ x ≤ θ ) )² dx
= ∫_0^θ ( 1/√(θ + Δθ) − 1/√θ )² dx + ∫_θ^{θ+Δθ} ( 1/√(θ + Δθ) )² dx
= ( √θ − √(θ + Δθ) )²/(θ + Δθ) + Δθ/(θ + Δθ)
= 2 ( 1 − ( 1 + Δθ/θ )^{−1/2} ) = Δθ/θ + o(Δθ/θ), which is not O( (Δθ)² ) as Δθ → 0.
Hence, in this example, √p( · , θ) is not differentiable as a function of θ, and
the finite Fisher information does not exist. □

A similar result is true for the shifted exponential model introduced in


Example 4.2 (see Exercise 4.24).
If we formally write the Fisher information as I(θ) = ∞, then the right-
hand side of the Cramér-Rao inequality (1.2) becomes zero, and there is no
contradiction with the faster rate of convergence.

4.3. Asymptotically Exponential Statistical Experiment


What do the two irregular models considered in the previous sections have in
common? First of all, √p( · , θ) is not differentiable in the sense of Theorem
4.3, and ‖ √p( · , θ + Δθ) − √p( · , θ) ‖₂² = O(Δθ) as Δθ → 0. For the
uniform model, this fact is verified in Example 4.4, while for the shifted
exponential distribution, it is assigned as Exercise 4.24.
Another feature that these models share is the limiting structure of the
likelihood ratio
Zn( θ0, θ1 ) = exp{ Ln(θ1) − Ln(θ0) } = ∏_{i=1}^{n} p(Xi, θ1)/p(Xi, θ0), θ0, θ1 ∈ Θ.
A statistical experiment is called asymptotically exponential if for any
θ ∈ Θ, there exists an asymptotically exponential random variable Tn such that
lim_{n→∞} P( Tn ≥ τ ) = exp{ −λ(θ) τ }, τ > 0,
and either
(i) Zn( θ, θ + t/n ) = exp{ −λ(θ) t } I( t ≥ −Tn ) + on(1)
or
(ii) Zn( θ, θ + t/n ) = exp{ λ(θ) t } I( t ≤ Tn ) + on(1)
where λ(θ) is a continuous positive function of θ, θ ∈ Θ, and on(1) → 0
in Pθ-probability as n → ∞.
Both uniform and shifted exponential models are special cases of the
asymptotically exponential statistical experiment, as stated in Propositions
4.5 and 4.6 below.
Proposition 4.5. The uniform statistical experiment defined in Example
4.1 is asymptotically exponential with λ(θ) = 1/θ.

Proof. The likelihood ratio for the uniform distribution is
Zn( θ, θ + t/n ) = ( θ/(θ + t/n) )^n ∏_{i=1}^{n} I( Xi ≤ θ + t/n ) / I( Xi ≤ θ )
= ( θ/(θ + t/n) )^n I( X(n) ≤ θ + t/n ) / I( X(n) ≤ θ ).
Note that the event { X(n) ≤ θ } holds with probability 1. Also,
( θ/(θ + t/n) )^n = ( 1 + (t/θ)/n )^{−n} = exp{ −t/θ } + on(1) as n → ∞
and
I( X(n) ≤ θ + t/n ) = I( t ≥ −Tn ) where Tn = n( θ − X(n) ).
It remains to show that Tn has a limiting exponential distribution. Indeed,
lim_{n→∞} Pθ( Tn ≥ τ ) = lim_{n→∞} Pθ( n( θ − X(n) ) ≥ τ )
= lim_{n→∞} Pθ( X(n) ≤ θ − τ/n ) = lim_{n→∞} ( (θ − τ/n)/θ )^n = e^{−τ/θ}. □
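Proposition 4.5 lends itself to a quick simulation. The sketch below is our own illustration: it simulates Tn = n(θ − X(n)) for the uniform model and compares its tail probabilities with the exponential law with rate λ(θ) = 1/θ.

import numpy as np

# Illustration of Proposition 4.5: T_n = n (theta - X_(n)) is approximately
# exponential with rate lambda(theta) = 1 / theta.
rng = np.random.default_rng(3)
theta, n, n_rep = 2.0, 500, 20_000
X = rng.uniform(0.0, theta, size=(n_rep, n))
T = n * (theta - X.max(axis=1))

for tau in (0.5, 1.0, 2.0):
    print(np.mean(T >= tau), np.exp(-tau / theta))   # empirical vs exp(-tau/theta)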
A similar argument proves the next proposition (see Exercise 4.26).
Proposition 4.6. The shifted exponential statistical experiment defined in
Example 4.2 is asymptotically exponential with λ(θ) = 1.
4.4. Minimax Rate of Convergence


In accordance with the definition (3.7), the estimators in Examples 4.1 and
4.2 have guaranteed rate of convergence ψn = O(n−1 ). Can this rate be
improved? That is, are there estimators that converge with faster rates?
The answer is negative, and the proof is relatively easy.

Lemma 4.7. In an asymptotically exponential statistical experiment, there


exists a constant r∗ > 0 not depending on n such that for any estimator θ̂n ,
the following lower bound holds:
lim inf_{n→∞} sup_{θ∈Θ} Eθ[ ( n (θ̂n − θ) )² ] ≥ r∗.

Proof. Take θ0 ∈ Θ and θ1 = θ0 + 1/n ∈ Θ. Assume that property (ii)


in the definition of an asymptotically exponential model holds. Then, as in
the proof of Lemma 3.4, we have
sup_{θ∈Θ} Eθ[ ( n (θ̂n − θ) )² ] ≥ max_{θ∈{θ0, θ1}} Eθ[ ( n (θ̂n − θ) )² ]
≥ (n²/2) Eθ0[ ( (θ̂n − θ0)² + (θ̂n − θ1)² ( e^{λ(θ0)} + on(1) ) ) I( 1 ≤ Tn ) ]
≥ (n²/2) Eθ0[ ( (θ̂n − θ0)² + (θ̂n − θ1)² ) I( Tn ≥ 1 ) ],
since λ(θ0) + on(1) ≥ 0,
≥ (n²/2) ( (θ1 − θ0)²/2 ) Pθ0( Tn ≥ 1 )
= (1/4) Pθ0( Tn ≥ 1 ) → (1/4) exp{ −λ(θ0) } as n → ∞. □
Remark 4.8. The rate of convergence may be different from O(n−1 ) for
some other irregular statistical experiments, but those models are not asymp-
totically exponential. For instance, the model described in Exercise 1.8 is
not regular (the Fisher information does not exist) if −1 < α ≤ 1. The
rate of convergence in this model depends on α and is, generally speaking,
different from O(n−1 ). 

4.5. Sharp Lower Bound


The constant r∗ = (1/4) exp{ −λ(θ0) } in the proof of Lemma 4.7 is far from
being sharp. In the theorem that follows, we state a local version of the lower
bound with an exact constant for an asymptotically exponential experiment.
Theorem 4.9. Consider an asymptotically exponential statistical exper-


iment. Assume that it satisfies property (ii) of the definition, and put
λ0 = λ(θ0 ). Then for any θ0 ∈ Θ, any loss function w, and any estimator
θ̂n, the following lower bound holds:
lim_{δ→0} lim inf_{n→∞} sup_{θ : |θ−θ0|<δ} Eθ[ w( n (θ̂n − θ) ) ] ≥ λ0 min_{y∈R} ∫_0^∞ w(u − y) e^{−λ0 u} du.

Proof. Choose a large positive number b and assume that n is so large


that b < δ n. Put wC (u) = min(w(u), C) where C is an arbitrarily large
constant. For any θ̂n , we estimate the supremum over θ of the normalized
risk by the integral
sup_{θ : |θ−θ0|<δ} Eθ[ wC( n(θ̂n − θ) ) ] ≥ (1/b) ∫_0^b Eθ0+u/n[ wC( n(θ̂n − θ0 − u/n) ) ] du
(4.1) = (1/b) ∫_0^b Eθ0[ wC( n(θ̂n − θ0) − u ) ( e^{λ0 u} I( u ≤ Tn ) + on(1) ) ] du.
Here we applied the change of measure formula. Now, since wC is a bounded
function,
(1/b) ∫_0^b Eθ0[ wC( n(θ̂n − θ0) − u ) on(1) ] du = on(1),
and, continuing from (4.1), we obtain
= (1/b) ∫_0^b Eθ0[ wC( n(θ̂n − θ0) − u ) e^{λ0 u} I( u ≤ Tn ) ] du + on(1)
≥ (1/b) Eθ0[ I( √b ≤ Tn ≤ b ) ∫_0^{Tn} wC( n(θ̂n − θ0) − u ) e^{λ0 u} du ] + on(1),
which, after the substitution u = t + Tn, takes the form
= (1/b) Eθ0[ e^{λ0 Tn} I( √b ≤ Tn ≤ b ) ∫_{−Tn}^0 wC( n(θ̂n − θ0) − Tn − t ) e^{λ0 t} dt ] + on(1).
Put y = n(θ̂n − θ0) − Tn, and continue:
(4.2)
≥ (1/b) Eθ0[ e^{λ0 Tn} I( √b ≤ Tn ≤ b ) ] min_{y∈R} ∫_{−√b}^0 wC( y − t ) e^{λ0 t} dt + on(1).

Note that
lim_{n→∞} (1/b) Eθ0[ e^{λ0 Tn} I( √b ≤ Tn ≤ b ) ] = (1/b) ∫_{√b}^b λ0 e^{−λ0 t + λ0 t} dt
= λ0 ( b − √b )/b = λ0 ( 1 − 1/√b ).
This provides the asymptotic lower bound for (4.2),
λ0 ( 1 − 1/√b ) min_{y∈R} ∫_{−√b}^0 wC( y − u ) e^{λ0 u} du
= λ0 ( 1 − 1/√b ) min_{y∈R} ∫_0^{√b} wC( u − y ) e^{−λ0 u} du
where b and C can be taken however large, which proves the theorem. □

For the quadratic risk function, the lower bound in Theorem 4.9 can be
found explicitly.
Example 4.10. If w(u) = u², then
λ0 min_{y∈R} ∫_0^∞ (u − y)² e^{−λ0 u} du = min_{y∈R} ( y² − 2y/λ0 + 2/λ0² )
= min_{y∈R} ( (y − 1/λ0)² + 1/λ0² ) = 1/λ0².

By Proposition 4.5, for the uniform model, λ 0 = 1/θ0 , hence, the exact
lower bound equals θ02 . For the shifted exponential experiment, according
to Proposition 4.6, λ0 = 1 and thus, the lower bound is 1. 
Remark 4.11. In Exercise 4.27 we ask the reader to show that the lower
bound limiting constant of Theorem 4.9 is attainable in the uniform and
shifted exponential models under the quadratic risk. The sharpness of the
bound holds in general, for all asymptotically exponential models, but under
some additional conditions. 

Exercises

Exercise 4.22. Show that if X1 , . . . , Xn are independent uniform(0, θ) ran-


dom variables, then (i) the MLE of θ is θ̂n = X(n) = max(X1 , . . . , Xn ),

(ii) θn∗ = (n + 1)X(n) /n is an unbiased estimator of θ, and (iii) Varθ θn∗ =
θ2 /[ n(n + 2)].
Exercise 4.23. Consider independent observations X1 , . . . , Xn from a shifted
exponential distribution with the density
p (x , θ) = exp{ −(x − θ) }, x ≥ θ, θ ∈ R .
Verify that (i) θ̂n = X(1) = min(X1 , . . . , Xn ) is the MLE of θ, (ii) θn∗ =

X(1) − 1/n is an unbiased estimator of θ, and (iii) Varθ θn∗ = 1/n2 .
Exercise 4.24. Refer to Exercise 4.23. Prove that in the shifted exponential
model the Fisher information does not exist.
Exercise 4.25. Let p(x, θ) = c− if −1 < x < 0 and p(x, θ) = c+ if


0 < x < 1. Assume that p(x, θ) = 0 if x is outside of the interval (−1, 1),
and that the jump of the density at the origin equals θ , i.e., c+ − c− =
θ , θ ∈ Θ ⊂ (0, 1). Use the formula in Theorem 4.3 to compute the Fisher
information.

Exercise 4.26. Prove Proposition 4.6.

Exercise 4.27. Show that (i) in the uniform model (see Exercise 4.22),
lim_{n→∞} Eθ0[ ( n(θn∗ − θ0) )² ] = θ0².
(ii) in the shifted exponential model (see Exercise 4.23),
Eθ0[ ( n(θn∗ − θ0) )² ] = 1.

Exercise 4.28. Compute explicitly the lower bound in Theorem 4.9 for the
absolute loss function w(u) = |u|.

Exercise 4.29. Suppose n independent observations have the shifted ex-


ponential distribution with the location parameter θ. Using an argument
involving the Bayes risk, show that for any estimator θ̂n , the quadratic min-
imax risk is bounded from below (cf. Example 4.10),
inf_{θ̂n} sup_{θ∈R} Eθ[ ( n (θ̂n − θ) )² ] ≥ 1.
(i) Take a uniform(0, b) prior density and let Y = min( X(1), b ). Verify that
the posterior density is defined only if X(1) > 0, and is given by the formula
fb( θ | X1, . . . , Xn ) = n exp{ n θ } / ( exp{ n Y } − 1 ), 0 ≤ θ ≤ Y.
(ii) Check that the posterior mean is equal to
θn∗(b) = Y − 1/n + Y/( exp{ n Y } − 1 ).
(iii) Argue that for any θ, √b ≤ θ ≤ b − √b, the normalized quadratic risk
of the estimator θn∗(b) has the limit
lim_{b→∞} Eθ[ ( n (θn∗(b) − θ) )² ] = 1.
(iv) Show that
sup_{θ∈R} Eθ[ ( n (θ̂n − θ) )² ] ≥ ( (b − 2√b)/b ) inf_{√b ≤ θ ≤ b−√b} Eθ[ ( n (θn∗(b) − θ) )² ]
where the right-hand side is arbitrarily close to 1 for sufficiently large b.
Chapter 5

Change-Point Problem

5.1. Model of Normal Observations


Consider a statistical model with normal observations X1 , . . . , Xn where
Xi ∼ N (0 , σ 2 ) if i = 1 , . . . , θ, and Xi ∼ N (μ , σ 2 ) if i = θ + 1 , . . . , n. An
integer parameter θ belongs to a subset Θ of all positive integers Z+,
Θ = Θα = { θ : α n ≤ θ ≤ (1 − α) n, θ ∈ Z+ }
where α is a given number, 0 < α < 1/2. We assume that the standard
deviation σ and the expectation μ are known. Put c = μ/σ. This ratio is
called a signal-to-noise ratio.
The objective is to estimate θ from observations X1 , . . . , Xn . The pa-
rameter θ is called the change point, and the problem of its estimation is
termed the change-point problem. Note that it is assumed that there are at
least α n observations obtained before and after the change point θ, that is,
the number of observations of both kinds are of the same order O(n).
In the context of the change-point problem, the index i may be associated
with the time at which the observation Xi becomes available. This statistical
model differs from the models of the previous chapters in the respect that
it deals with non-homogeneous observations since the expected value of the
observations suffers a jump at θ.
The joint probability density of the observations has the form
p( x1, . . . , xn, θ ) = ∏_{i=1}^{θ} exp{ −xi²/(2σ²) } / √(2πσ²) ∏_{i=θ+1}^{n} exp{ −(xi − μ)²/(2σ²) } / √(2πσ²).
Denote by θ0 the true value of the parameter θ. We want to study the


log-likelihood ratio
Ln(θ) − Ln(θ0) = ln [ p(X1, . . . , Xn, θ) / p(X1, . . . , Xn, θ0) ] = ln ∏_{i=1}^{n} p(Xi, θ)/p(Xi, θ0).

Introduce a set of random variables
(5.1) εi = Xi/σ if 1 ≤ i ≤ θ0, and εi = −(Xi − μ)/σ if θ0 < i ≤ n.
These random variables are independent and have N (0, 1) distribution with
respect to Pθ0 -probability.
Define a stochastic process W(j) for integer-valued j's by
(5.2) W(j) = Σ_{θ0 < i ≤ θ0+j} εi if j > 0, W(j) = Σ_{θ0+j < i ≤ θ0} εi if j < 0, and W(j) = 0 if j = 0,
where the εi ’s are as in (5.1) for 1 ≤ i ≤ n; and for i outside of the
interval [1, n] , the εi ’s are understood as supplemental independent standard
normal random variables. The process W (j) is called the two-sided Gaussian
random walk (see Figure 2).
Figure 2. A sample path of the two-sided Gaussian random walk W .

The next theorem suggests an explicit form of the likelihood ratio in


terms of the process W and the signal-to-noise ratio c.
Theorem 5.1. For any θ, 1 ≤ θ ≤ n, and any θ0 ∈ Θα , the log-likelihood
ratio has the representation
Ln(θ) − Ln(θ0) = c W(θ − θ0) − (c²/2) | θ − θ0 |
where W (θ − θ0 ) is the two-sided Gaussian random walk in Pθ0 -probability


defined by (5.2).

Proof. By definition, for any θ > θ0, the log-likelihood ratio is expressed as
Ln(θ) − Ln(θ0) = − Σ_{i=1}^{θ} Xi²/(2σ²) − Σ_{i=θ+1}^{n} (Xi − μ)²/(2σ²)
+ Σ_{i=1}^{θ0} Xi²/(2σ²) + Σ_{i=θ0+1}^{n} (Xi − μ)²/(2σ²) = Σ_{i=θ0+1}^{θ} ( − Xi²/(2σ²) + (Xi − μ)²/(2σ²) )
= (μ/σ) Σ_{i=θ0+1}^{θ} ( − Xi/σ + μ/(2σ) ) = (μ/σ) Σ_{i=θ0+1}^{θ} ( − (Xi − μ)/σ − μ/(2σ) )
= (μ/σ) Σ_{i=θ0+1}^{θ} εi − ( μ²/(2σ²) ) (θ − θ0) = c W(θ − θ0) − (c²/2) (θ − θ0)
with c = μ/σ. For θ < θ0, we get a similar formula,
Ln(θ) − Ln(θ0) = (μ/σ) Σ_{i=θ+1}^{θ0} ( Xi/σ − μ/(2σ) )
= c Σ_{i=θ+1}^{θ0} εi − (c²/2) (θ0 − θ) = c W(θ − θ0) − (c²/2) |θ − θ0|. □

Remark 5.2. The two-sided Gaussian random walk W plays a similar role
as a standard normal random variable Z in the regular statistical models
under the LAN condition (see Section 3.4.) The essential difference is that
the dimension of W grows as n → ∞. 

The next result establishes a rough minimax lower bound.


Lemma 5.3. There exists a positive constant r∗ independent of n such that
inf_{θ̂n} max_{θ∈Θα} Eθ[ (θ̂n − θ)² ] ≥ r∗.

Proof. Take θ0 , θ1 ∈ Θα such that θ1 − θ0 = 1. From Theorem 5.1, we


have that
Ln(θ1) − Ln(θ0) = c W(1) − c²/2
where W(1) is a standard normal random variable in Pθ0-probability. Thus,
Pθ0( Ln(θ1) − Ln(θ0) ≥ 0 ) = Pθ0( W(1) ≥ c/2 ) = p0
with a positive constant p0 independent of n . Taking the same steps as in
the proof of Lemma 3.4, we obtain the result with r∗ = p0 /4. 
Remark 5.4. Lemma 5.3 is very intuitive. Any estimator θ̂n misses the
true change point θ0 by at least 1 with a positive probability, which is not a
surprise due to the stochastic nature of observations. Thus, the anticipated
minimax rate of convergence in the change-point problem should be O(1) as
n → ∞. 
Remark 5.5. We can define a change-point problem on the interval [0, 1]
by Xi ∼ N (0, σ 2 ) if i/n ≤ θ, and Xi ∼ N (μ, σ 2 ) if i/n > θ. On this scale,
the anticipated minimax rate of convergence is O(n−1 ) for n → ∞. Note
that the convergence in this model is faster than that in regular models, and,
though unrelated, is on the order of that in the asymptotically exponential
experiments. 

5.2. Maximum Likelihood Estimator of Change Point


The log-likelihood function Ln (θ) in the change-point problem can be writ-
ten as
Ln(θ) = Σ_{i=1}^{n} ( − ( 1/(2σ²) ) ( Xi − μ I( i > θ ) )² − (1/2) ln(2πσ²) ), 1 ≤ θ ≤ n.

The maximum likelihood estimator (MLE) of θ exists and is unique with


probability 1 for any true value θ0 . We denote the MLE of θ by
θ̃n = argmax_{1 ≤ θ ≤ n} Ln(θ).

The goal of this section is to describe the exact large sample performance
of the MLE.
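For simulated Gaussian data, θ̃n can be found by direct enumeration of Ln(θ) over 1 ≤ θ ≤ n. The following sketch is our own illustration of this computation; the sample size, change point, and signal-to-noise level are arbitrary choices.

import numpy as np

# Compute the change-point MLE by maximizing L_n(theta) over 1 <= theta <= n
# for simulated Gaussian observations (known mu and sigma).
rng = np.random.default_rng(4)
n, theta0, mu, sigma = 200, 80, 1.0, 1.0
X = np.concatenate([rng.normal(0.0, sigma, theta0),
                    rng.normal(mu, sigma, n - theta0)])

def log_lik(theta, x):
    # L_n(theta) up to a constant: mean is mu after the change point theta
    means = np.where(np.arange(1, len(x) + 1) > theta, mu, 0.0)
    return -np.sum((x - means) ** 2) / (2 * sigma**2)

mle = max(range(1, n + 1), key=lambda th: log_lik(th, X))
print(mle, theta0)   # the MLE is typically within a few indices of theta0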
Introduce a stochastic process
L∞(j) = c W(j) − (c²/2) | j |, j ∈ Z,
where the subscript "∞" indicates that j is unbounded in both directions.
Define j∞∗ as the point of maximum of the process L∞(j),
j∞∗ = argmax_{j∈Z} L∞(j),
and put
pj = Pθ0( j∞∗ = j ), j ∈ Z.
Note that the distribution of j∞∗ is independent of θ0.

Theorem 5.6. For any θ0 ∈ Θα , and any loss function w, the risk of the
MLE θ̃n has a limit as n → ∞, independent of θ0,
lim_{n→∞} Eθ0[ w( θ̃n − θ0 ) ] = Σ_{j∈Z} w(j) pj.
Before we turn to proving the theorem, we state two technical lemmas,


the proofs of which are postponed until the final section of this chapter.
 
Lemma 5.7. Let p̃j = Pθ0( θ̃n − θ0 = j ). There exist positive constants
b1 and b2 such that for any θ0 ∈ Θα, and any j satisfying 1 ≤ θ0 + j ≤ n,
the following bounds hold:
pj ≤ p̃j ≤ pj + b1 e^{−b2 n}.
Lemma 5.8. There exist positive constants b3 and b4 such that
pj ≤ b3 e^{−b4 | j |}, j ∈ Z.

Proof of Theorem 5.6. Applying Lemma 5.7, we find that
Eθ0[ w(θ̃n − θ0) ] = Σ_{j : 1 ≤ θ0+j ≤ n} w(j) p̃j ≥ Σ_{j : 1 ≤ θ0+j ≤ n} w(j) pj
= Σ_{j∈Z} w(j) pj − Σ_{j ∈ Z \ [1−θ0, n−θ0]} w(j) pj
where the latter sum is taken over integers j that do not belong to the
set 1 ≤ θ0 + j ≤ n. As a loss function, w(j) does not increase faster
than a polynomial function in | j |, while pj, in accordance with Lemma
5.8, decreases exponentially fast. Besides, the absolute value of any j ∈
Z \ [1 − θ0, n − θ0] is no less than α n. Thus,
lim_{n→∞} Σ_{j ∈ Z \ [1−θ0, n−θ0]} w(j) pj = 0
and
lim inf_{n→∞} Eθ0[ w(θ̃n − θ0) ] ≥ Σ_{j∈Z} w(j) pj.

Similarly, we get the upper bound
Eθ0[ w(θ̃n − θ0) ] = Σ_{j : 1 ≤ θ0+j ≤ n} w(j) p̃j ≤ Σ_{j : 1 ≤ θ0+j ≤ n} w(j) ( pj + b1 e^{−b2 n} )
≤ Σ_{j∈Z} w(j) pj + b1 e^{−b2 n} max_{j : 1 ≤ θ0+j ≤ n} w(j).
The maximum of w(j) does not grow faster than a polynomial function in
n. That is why the latter term is vanishing as n → ∞, and
lim sup_{n→∞} Eθ0[ w(θ̃n − θ0) ] ≤ Σ_{j∈Z} w(j) pj. □
5.3. Minimax Limiting Constant


In this section we will find the minimax limiting constant for the quadratic
risk function.
Lemma 5.9. For any θ0 , the Bayes estimator θn∗ with respect to the uniform
prior πn(θ) = 1/n, 1 ≤ θ ≤ n, satisfies the formula
θn∗ = θ0 + Σ_{j : 1 ≤ j+θ0 ≤ n} j exp{ c W(j) − c² | j |/2 } / Σ_{j : 1 ≤ j+θ0 ≤ n} exp{ c W(j) − c² | j |/2 }.

Proof. The proof is left as an exercise (see Exercise 5.30). 
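The formula of Lemma 5.9 says, in effect, that θn∗ is the posterior mean computed by exponentiating the log-likelihood. The sketch below is our own numerically stabilized illustration for simulated Gaussian data (all names and constants are arbitrary); the maximum of the log-likelihood is subtracted before exponentiating.

import numpy as np

# Posterior mean of the change point under the uniform prior on {1, ..., n};
# by Lemma 5.9 this is the Bayes estimator for known mu and sigma.
def bayes_change_point(x, mu, sigma):
    n = len(x)
    thetas = np.arange(1, n + 1)
    ll = np.array([-np.sum((x - np.where(np.arange(1, n + 1) > th, mu, 0.0)) ** 2)
                   / (2 * sigma**2) for th in thetas])
    weights = np.exp(ll - ll.max())       # stabilize before exponentiating
    return np.sum(thetas * weights) / np.sum(weights)

rng = np.random.default_rng(5)
n, theta0, mu, sigma = 200, 80, 1.0, 1.0
X = np.concatenate([rng.normal(0.0, sigma, theta0), rng.normal(mu, sigma, n - theta0)])
print(bayes_change_point(X, mu, sigma), theta0)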

Introduce a new random variable,
(5.3) ξ = Σ_{j∈Z} j exp{ c W(j) − c² | j |/2 } / Σ_{j∈Z} exp{ c W(j) − c² | j |/2 }.

The next lemma asserts that θn∗ − θ0 converges to ξ in the quadratic


sense. The proof of the lemma is deferred until Section 5.5.

Lemma 5.10. There exists a finite second moment r∗ = Eθ0[ ξ² ] and,
uniformly in θ0 ∈ Θα, the Bayes estimator θn∗ satisfies the identity
lim_{n→∞} Eθ0[ ( θn∗ − θ0 )² ] = r∗.

Now we can show that the minimax quadratic risk of any estimator of
θ0 is bounded from below by r∗ .
Theorem 5.11. Let r∗ be the constant defined in Lemma 5.10. For all large
enough n, and for any estimator θ̂n, the following inequality takes place:
lim inf_{n→∞} max_{θ0∈Θα} Eθ0[ ( θ̂n − θ0 )² ] ≥ r∗.

Proof. Assume, without loss of generality, that αn is an integer, and so is


N = n − 2αn where N is the number of points in the parameter set Θα .
As we typically deal with lower bounds, we estimate the maximum over Θα
from below by the mean value over the same set,
max_{θ0∈Θα} Eθ0[ (θ̂n − θ0)² ] ≥ N^{−1} Σ_{θ0∈Θα} Eθ0[ (θ̂n − θ0)² ]
(5.4) ≥ N^{−1} Σ_{θ0∈Θα} Eθ0[ (θ̃N − θ0)² ]

where θ̃N is the Bayes estimator with respect to the uniform prior distribu-
tion πN (θ) = 1/N, αn ≤ θ ≤ (1 − α)n.
Take an arbitrarily small positive β and define a set of integers Θα,β by
Θα,β = { θ : αn + βN ≤ θ ≤ n − (αn + βN) }.
The set Θα,β contains no less than n − 2(αn + βN) = N − 2βN points.
This set plays the same role of the "inner points" for Θα as Θα does for the
original set { θ : 1 ≤ θ ≤ n }. Applying Lemma 5.10 to θ̃N, we have that,
uniformly in θ0 ∈ Θα,β,
lim_{n→∞} Eθ0[ ( θ̃N − θ0 )² ] = r∗.
Substituting this limit into (5.4), we obtain that for any estimator θ̂n,
lim inf_{n→∞} max_{θ0∈Θα} Eθ0[ ( θ̂n − θ0 )² ] ≥ lim inf_{n→∞} N^{−1} Σ_{θ0∈Θα} Eθ0[ ( θ̃N − θ0 )² ]
≥ lim inf_{n→∞} N^{−1} Σ_{θ0∈Θα,β} Eθ0[ ( θ̃N − θ0 )² ]
≥ lim inf_{n→∞} ( (N − 2βN)/N ) min_{θ0∈Θα,β} Eθ0[ ( θ̃N − θ0 )² ] = (1 − 2β) r∗.

Since β is arbitrarily small, the result follows. 


Remark 5.12. Combining the results of Lemma 5.10 and Theorem 5.11, we
obtain the sharp limit of the quadratic minimax risk. The Bayes estimator
corresponding to the uniform prior distribution is asymptotically minimax.
To the best of our knowledge, the exact value of r∗ is unknown. Nevertheless,
it is possible to show that the MLE estimator discussed in Section 5.2 is not
asymptotically minimax, since the limit of its quadratic risk determined in
Theorem 5.6 is strictly larger than r∗ . 

5.4. Model of Non-Gaussian Observations


Let p0 (x) be a probability density which is positive on the whole real line,
x ∈ R, and let μ ≠ 0 be a fixed number. Assume that observations Xi's
have a distribution with the density p0 (x) if 1 ≤ i ≤ θ , and p0 (x − μ) if
θ < i ≤ n. As in the previous sections, an integer θ is an unknown change
point which belongs to Θα . In this section, we will describe the structure
of the likelihood ratio to understand the difference from the case of normal
observations.
Denote by li the log-likelihood ratio associated with a single observation
Xi,
(5.5) li = ln [ p0(Xi − μ) / p0(Xi) ], 1 ≤ i ≤ n.
The two quantities
K± = − ∫_{−∞}^{∞} ln [ p0(x ± μ) / p0(x) ] p0(x) dx
are called the Kullback-Leibler information numbers.
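For a concrete density p0 the numbers K± can be computed explicitly or numerically; for the standard normal density both equal μ²/2. The sketch below is our own illustration of this computation.

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Numerical computation of the Kullback-Leibler numbers K_+ and K_- for the
# standard normal density p0; both equal mu^2 / 2 in this case.
mu = 0.8

def K(sign):
    integrand = lambda x: -np.log(norm.pdf(x + sign * mu) / norm.pdf(x)) * norm.pdf(x)
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

print(K(+1), K(-1), mu**2 / 2)   # all three values agree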
Let θ0 denote the true value of the parameter θ. Consider the random
variables
(5.6) εi = li + K− if 1 ≤ i ≤ θ0, and εi = −li + K+ if θ0 < i ≤ n.
Lemma 5.13. For any integer i, 1 ≤ i ≤ n, the random variables εi defined
in (5.6) have an expected value of zero under Pθ0 -distribution.

Proof. If 1 ≤ i ≤ θ0 , then by definition of K− ,


Eθ0 [ εi ] = Eθ0 [ li ] + K− = (− K− ) + K− = 0 .
If θ0 < i ≤ n, then
Eθ0[ εi ] = − ∫_{−∞}^{∞} ln [ p0(x − μ) / p0(x) ] p0(x − μ) dx + K+
= − ∫_{−∞}^{∞} ln [ p0(x) / p0(x + μ) ] p0(x) dx + K+
= ∫_{−∞}^{∞} ln [ p0(x + μ) / p0(x) ] p0(x) dx + K+ = −K+ + K+ = 0. □

For any j ∈ Z, analogously to the Gaussian case (cf. (5.2)), introduce a


stochastic process W(j) by W(0) = 0,
W(j) = Σ_{θ0 < i ≤ θ0+j} εi if j > 0, and W(j) = Σ_{θ0+j < i ≤ θ0} εi if j < 0

where for 1 ≤ i ≤ n, εi ’s are the random variables from Lemma 5.13 that
have mean zero with respect to the Pθ0 -distribution. For all other values of
i, the random variables εi ’s are assumed independent with the zero expected
value. Note that W (j) is a two-sided random walk, which in general is not
symmetric. Indeed, the distributions of εi ’s may be different for i ≤ θ0 and
i > θ0 .
Define a constant Ksgn(i) as K+ for i > 0, and K− for i < 0.
Theorem 5.14. For any integer θ, 1 ≤ θ ≤ n, and any true change point
θ0 ∈ Θα, the log-likelihood ratio has the form
Ln(θ) − Ln(θ0) = ln [ p(X1, . . . , Xn, θ) / p(X1, . . . , Xn, θ0) ] = W(θ − θ0) − K_{sgn(θ−θ0)} | θ − θ0 |.
Proof. The joint density is computed according to the formula
p(X1, . . . , Xn, θ) = ∏_{1 ≤ i ≤ θ} p0(Xi) ∏_{θ < i ≤ n} p0(Xi − μ).
Therefore, if θ > θ0, the likelihood ratio satisfies
Ln(θ) − Ln(θ0) = Σ_{θ0 < i ≤ θ} ln [ p0(Xi) / p0(Xi − μ) ] = Σ_{θ0 < i ≤ θ} (−li)
= Σ_{θ0 < i ≤ θ} ( εi − K+ ) = W(θ − θ0) − K+ (θ − θ0).
In the case θ < θ0, we write
Ln(θ) − Ln(θ0) = Σ_{θ < i ≤ θ0} ln [ p0(Xi − μ) / p0(Xi) ] = Σ_{θ < i ≤ θ0} li
= Σ_{θ < i ≤ θ0} ( εi − K− ) = W(θ − θ0) − K− |θ − θ0|. □

From Theorem 5.14 we can expect that the MLE of θ0 in the non-
Gaussian case possesses properties similar to that in the normal case. This
is true under some restrictions on p0 (see Exercise 5.33).

5.5. Proofs of Lemmas


Proof of Lemma 5.7. Note that if j∞∗ = j, and j is such that 1 ≤ θ0 + j ≤ n,
then θ̃n − θ0 = j. Therefore, for any j satisfying 1 ≤ θ0 + j ≤ n, we have
pj = Pθ0( j∞∗ = j ) ≤ Pθ0( θ̃n − θ0 = j ) = p̃j
= Pθ0( θ̃n − θ0 = j, 1 ≤ j∞∗ + θ0 ≤ n ) + Pθ0( θ̃n − θ0 = j, j∞∗ + θ0 ∉ [1, n] )
≤ pj + Pθ0( j∞∗ + θ0 ∉ [1, n] ) = pj + Pθ0( j∞∗ ≤ −θ0 ) + Pθ0( j∞∗ ≥ n + 1 − θ0 )
≤ pj + Pθ0( j∞∗ ≤ −αn ) + Pθ0( j∞∗ ≥ αn ) ≤ pj + Σ_{k ≤ −αn} pk + Σ_{k ≥ αn} pk
= pj + 2 Σ_{k ≥ αn} pk ≤ pj + 2 Σ_{k ≥ αn} b3 e^{−b4 k} = pj + b1 e^{−b2 n}
where we have applied Lemma 5.8. The constants b1 and b2 are independent
of n, b1 = 2 b3/(1 − exp{−b4}) and b2 = b4 α. □
Proof of Lemma 5.8. For a positive integer j we can write
pj = Pθ0( j∞∗ = j ) ≤ Pθ0( j∞∗ ≥ j ) ≤ Pθ0( max_{k ≥ j} ( c W(k) − c² k/2 ) ≥ 0 )
≤ Σ_{k ≥ j} Pθ0( c W(k) − c² k/2 ≥ 0 ) = Σ_{k ≥ j} Pθ0( W(k)/√k ≥ c √k/2 )
= Σ_{k ≥ j} ( 1 − Φ( c √k/2 ) ) ≤ Σ_{k ≥ j} exp{ −c² k/8 } ≤ b3 exp{ −b4 j }
with the positive constants b3 = 1/(1 − exp{−c²/8}) and b4 = c²/8.
In the above, we have estimated the probability Pθ0( j∞∗ ≥ j ) using the
fact that c W(j) − c² j/2 = 0 at j = 0, which implies that at the point
of the maximum c W(j∞∗) − c² j∞∗/2 ≥ 0. Next we applied the fact that
W(k)/√k has the standard normal distribution with the c.d.f. Φ(x). In the
last step, the inequality 1 − Φ(x) ≤ exp{−x²/2} is used.
A similar upper bound for the probabilities pj holds for negative j, so
that
pj ≤ b3 exp{ − b4 | j | }, j ∈ Z . 
Proof of Lemma 5.10. First we show that uniformly in θ0 ∈ Θα , there
exists a limit in Pθ0-probability
lim_{n→∞} ( θn∗ − θ0 ) = ξ.

To see that ξ is well-defined, it suffices to show that the denominator


in the definition (5.3) of ξ is separated away from zero. Indeed, at j = 0
the contributing term equals 1. We want to demonstrate that the infinite
sums in the numerator and denominator converge in probability. That is,
for however small the positive ε, uniformly in Pθ0 -probability, the following
limits exist:
(5.7) lim_{n→∞} Pθ0( Σ_{j : j+θ0 ∉ [1,n]} | j | exp{ c W(j) − c² | j |/2 } > ε ) = 0
and
(5.8) lim_{n→∞} Pθ0( Σ_{j : j+θ0 ∉ [1,n]} exp{ c W(j) − c² | j |/2 } > ε ) = 0.

Introduce a random variable:
(5.9) ζ = min{ k : W(j) ≤ c | j |/4 for all j, | j | ≥ k }.
Starting from ζ, the random walk W (j) does not exceed c | j | / 4. First of
all, note that the tail probabilities for ζ are decreasing exponentially fast.
In fact,
Pθ0( ζ ≥ k ) ≤ Σ_{j ≥ k} Pθ0( W(j) ≥ c j/4 ) + Σ_{j ≤ −k} Pθ0( W(j) ≥ c |j|/4 )
≤ 2 Σ_{j ≥ k} Pθ0( W(j) ≥ c j/4 ) = 2 Σ_{j ≥ k} Pθ0( W(j)/√j ≥ c √j/4 )
≤ 2 Σ_{j ≥ k} exp{ −c² j/32 } ≤ a1 exp{ −c² k/32 } = a1 exp{ −a2 k }
with a1 = 2/(1 − exp{−c²/32}) and a2 = c²/32.

Next we verify (5.7). The limit (5.8) is shown analogously. We have
Pθ0( Σ_{j : j+θ0 ∉ [1,n]} |j| exp{ c W(j) − c² |j|/2 } > ε )
≤ Pθ0( Σ_{|j| ≥ αn} |j| exp{ c W(j) − c² |j|/2 } > ε )
≤ Pθ0( Σ_{|j| ≥ αn} |j| exp{ c W(j) − c² |j|/2 } > ε; ζ ≤ αn ) + Pθ0( ζ > αn )
≤ Pθ0( Σ_{|j| ≥ αn} |j| exp{ −c² |j|/4 } > ε; ζ ≤ αn ) + Pθ0( ζ > αn )
≤ Pθ0( Σ_{|j| ≥ αn} |j| exp{ −c² |j|/4 } > ε ) + Pθ0( ζ > αn )
where we have applied the definition of the random variable ζ. In the latter
sum, the first probability is zero for all sufficiently large n as the probability
of a non-random event. The second probability is decreasing exponentially
fast as n → ∞, which proves (5.7).
Further, we check that the second moment of ξ is finite despite the fact
that neither the numerator nor denominator is integrable (see Exercise 5.31).
Thus, we prove that there exists a finite second moment
r ∗ = Eθ0 [ ξ 2 ] < ∞.

Introduce the notation for the denominator in the formula (5.3) for the
random variable ξ,
D = Σ_{j∈Z} exp{ c W(j) − c² | j |/2 }.

Involving the random variable ζ defined in (5.9), we write
|ξ| ≤ Σ_{|j| ≤ ζ} | j | D^{−1} exp{ c W(j) − c² | j |/2 } + Σ_{|j| > ζ} | j | D^{−1} exp{ c W(j) − c² | j |/2 }.
Note that for any j, D^{−1} exp{ c W(j) − c² | j |/2 } ≤ 1. We substitute this
inequality into the first sum. In the second sum, we use the obvious fact
that D > exp{ c W(0) } = 1. Hence, we arrive at the following inequalities:
|ξ| ≤ ζ² + Σ_{|j| > ζ} | j | exp{ c W(j) − c² | j |/2 }.
If | j | is larger than ζ, then we can bound W(j) from above by c | j |/4 and
find that
|ξ| ≤ ζ² + 2 Σ_{j > ζ} j exp{ −c² j/4 }
(5.10) ≤ ζ² + 2 Σ_{j ≥ 1} j exp{ −c² j/4 } = ζ² + a3
with a3 = 2 Σ_{j ≥ 1} j exp{ −c² j/4 }. Because the tail probabilities of ζ decrease
exponentially fast, any power moment of ξ is finite, in particular, r∗ = Eθ0[ ξ² ] < ∞.
Finally, we verify that θn∗ − θ0 converges to ξ in the L2 sense, that is,
uniformly in θ0 ∈ Θα,
lim_{n→∞} Eθ0[ ( θn∗ − θ0 − ξ )² ] = 0.
Apply the representation for the difference θn∗ − θ0 from Lemma 5.9. Similarly
to the argument used to derive (5.10), we obtain that
| θn∗ − θ0 | ≤ ζ² + a3
with the same definitions of the entries on the right-hand side. Thus, the
difference
| θn∗ − θ0 − ξ | ≤ | θn∗ − θ0 | + | ξ | ≤ 2 ( ζ² + a3 )
where the random variable ζ² + a3 is square integrable and independent
of n. As shown above, the difference θn∗ − θ0 − ξ converges to 0 in Pθ0-
probability as n → ∞. By the dominated convergence theorem, this difference
converges to zero in the quadratic sense. □

Exercises

Exercise 5.30. Prove Lemma 5.9.


  
Exercise 5.31. Show that Eθ0[ exp{ c W(j) − c² | j |/2 } ] = 1, for any
integer j. Deduce from here that the numerator and denominator in (5.3)
have infinite expected values.

Exercise 5.32. Show that the Kullback-Leibler information numbers K±


are positive. Hint: Check that − K± < 0. Use the inequality ln(1 + x) < x,
for any x ≠ 0.

Exercise 5.33. Suppose that Eθ0[ | li |^{5+δ} ] < ∞ for a small δ > 0, where
li's are the log-likelihood ratios defined in (5.5), i = 1, . . . , n, and θ0 denotes
the true value of the change point. Show that uniformly in θ0 ∈ Θα, the
quadratic risk of the MLE θ̃n of θ0 is finite for any n, that is,
Eθ0[ ( θ̃n − θ0 )² ] < ∞.
Hint: Argue that if θ̃n > θ0,
Pθ0( θ̃n − θ0 = m ) ≤ Σ_{l=m}^{∞} Pθ0( Σ_{i=1}^{l} εi ≥ K+ l )
where the εi's are as in (5.6). Now use the fact that if Eθ0[ | εi |^{5+δ} ] < ∞, then
there exists a positive constant C with the property
Pθ0( Σ_{i=1}^{l} εi ≥ K+ l ) ≤ C l^{−(4+δ)}.
This fact can be found, for example, in Petrov [Pet75], Chapter IX, Theo-
rem 28.

Exercise 5.34. Consider the Bernoulli model with 30 independent obser-


vations Xi that take on values 1 or 0 with the respective probabilities p i and
1 − p i , i = 1 , . . . , 30. Suppose that p i = 0.4 if 1 ≤ i ≤ θ, and p i = 0.7
if θ < i ≤ 30, where θ is an integer change point, 1 ≤ θ ≤ 30. Estimate θ
from the following set of data:
i Xi i Xi i Xi
1 0 11 0 21 1
2 0 12 1 22 1
3 1 13 1 23 1
4 0 14 0 24 0
5 1 15 1 25 1
6 0 16 0 26 0
7 0 17 0 27 1
8 1 18 1 28 1
9 1 19 1 29 1
10 0 20 0 30 0

Exercise 5.35. Suppose that in the change-point problem, observations


Xi have a known c.d.f. F1 (x), x ∈ R, for 1 ≤ i ≤ θ, and another known
c.d.f. F2 (x), x ∈ R, for θ < i ≤ n. Assume that the two c.d.f.’s are not
identically equal. Suggest an estimator of the true change point θ0 ∈ Θα.
Hint: Consider a set X such that ∫_X dF1(x) ≠ ∫_X dF2(x), and introduce
indicators I(Xi ∈ X).
Chapter 6

Sequential Estimators

6.1. The Markov Stopping Time


Sequential estimation is a method in which the size of the sample is not
predetermined, but instead parameters are estimated as new observations
become available. The data collection is terminated in accordance with a
predefined stopping rule.
In this chapter we consider only the model of Gaussian observations. We
address two statistical problems, using the sequential estimation approach.
First, we revisit the change-point problem discussed in Chapter 5, and,
second, we study the parameter estimation problem from a sample of a
random size in an autoregressive model. Solutions to both problems are
based on the concept of the Markov stopping time.
Let Xi , i = 1, 2, . . . , be a sequence of real-valued random variables. For
any integer t ≥ 1, and any real numbers ai and bi such that ai ≤ bi , consider
the random events
 
{ Xi ∈ [ai, bi] }, i = 1, . . . , t.

All countable intersections, unions and complements of these random


events form a σ-algebra generated by the random variables X1 , . . . , Xt . De-
note this σ-algebra by Ft , that is,
 
Ft = σ( { Xi ∈ [ai, bi] }, i = 1, . . . , t ).

All the random events that belong to Ft are called Ft -measurable. We


interpret the integer t as time, and we call Ft a σ-algebra generated by the
observations Xi up to time t.


It is easily seen that these inclusions are true:

F1 ⊆ F2 ⊆ · · · ⊆ F

where F denotes the σ-algebra that contains all the σ-algebras Ft . The set
of the ordered σ-algebras { Ft , t ≥ 1 } is called a filter.

Example 6.1. The random event { X1² + X2² < 1 } is F2-measurable.
Indeed, this random event can be presented as the union of intersections:
∪_{m=1}^{∞} ∪_{i²+j² < m²} ( { |X1| < i/m } ∩ { |X2| < j/m } ). □

An integer-valued random variable τ , τ ∈ {1, 2, . . . }, is called a Markov


stopping time (or, simply, stopping time) with respect to the filter {Ft , t ≥ 1 }
if for any integer t, the random event { τ = t } is Ft -measurable.

Example 6.2. The following are examples of the Markov stopping times
(for the proof see Exercise 6.37):
(i) A non-random variable τ = T where T is a given positive integer number.
(ii) The first time when the sequence Xi hits a given interval [a, b], that is,
τ = min{ i : Xi ∈ [a, b] }.
(iii) The minimum or maximum of two given Markov stopping times τ1 and
τ2 , τ = min(τ1 , τ2 ) or τ = max(τ1 , τ2 ).
(iv) The time τ = τ1 + s for any positive integer s, where τ1 is a given
Markov stopping time. 
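Stopping times of type (ii) in Example 6.2 are straightforward to implement: the decision to stop at time t uses only X1, . . . , Xt. The sketch below is our own illustration of the first hitting time of an interval; the convention of returning the last index when the interval is never hit is ours.

import numpy as np

# First time the sequence X_i hits a given interval [a, b]: a Markov stopping
# time, since the event {tau = t} depends only on X_1, ..., X_t.
def first_hitting_time(x, a, b):
    for t, value in enumerate(x, start=1):
        if a <= value <= b:
            return t
    return len(x)     # convention: stop at the last time if the interval is never hit

rng = np.random.default_rng(6)
X = rng.normal(0.0, 1.0, size=1_000)
print(first_hitting_time(X, 2.0, 3.0))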

Example 6.3. Some random times are not examples of Markov stopping
times (for the proof see Exercise 6.38):
(i) The last time when the sequence Xi , 1 ≤ i ≤ n, hits a given interval
[a, b], that is, τ = max{ i : Xi ∈ [a, b], 1 ≤ i ≤ n}.
(ii) The time τ = τ1 − s for any positive integer s, where τ1 is a given
stopping time. 

Lemma 6.4. If τ is a stopping time, then the random events: (i) {τ ≤ t}


is Ft -measurable, and (ii) {τ ≥ t} is Ft−1 -measurable.

Proof. (i) We write {τ ≤ t} = ∪_{s=1}^{t} {τ = s}, where each event {τ = s}


is Fs -measurable, and since Fs ⊆ Ft , it is Ft -measurable as well. Thus,
{τ ≤ t} is Ft -measurable as the union of Ft -measurable events.
(ii) The random event {τ ≥ t} is Ft−1 -measurable as the complement of
{τ < t} = {τ ≤ t − 1}, an Ft−1 -measurable event. 

The next important result is known as Wald’s first identity.


Theorem 6.5. Let X1 , X2 , . . . , be a sequence of independent identically dis-


tributed random variables with E[ X1 ] < ∞. Then for any Markov stopping
time τ such that E[ τ ] < ∞, the following identity holds:

E[ X1 + · · · + Xτ ] = E[ X1 ] E[ τ ].

Proof. By definition,
E[ X1 + · · · + Xτ ] = Σ_{t=1}^{∞} E[ (X1 + · · · + Xt) I(τ = t) ]
= E[ X1 I(τ ≥ 1) + X2 I(τ ≥ 2) + · · · + Xt I(τ ≥ t) + · · · ].
For a Markov stopping time τ, the random event {τ ≥ t} is Ft−1 -measurable
by Lemma 6.4, that is, it is predictable from the observations up to time t−1,
X1 , . . . , Xt−1 , and is independent of the future observations Xt , Xt+1 , . . . . In
particular, Xt and I(τ ≥ t) are independent, and hence, E[ Xt I(τ ≥ t) ] =
E[ X1 ] P( τ ≥ t ). Consequently,
E[ X1 + · · · + Xτ ] = E[ X1 ] Σ_{t=1}^{∞} P(τ ≥ t) = E[ X1 ] E[ τ ].
Here we used the straightforward fact that
Σ_{t=1}^{∞} P( τ ≥ t ) = P(τ = 1) + 2 P(τ = 2) + 3 P(τ = 3) + · · ·
= Σ_{t=1}^{∞} t P(τ = t) = E[ τ ]. □
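Wald's first identity is easy to test by simulation. In the sketch below (our own illustration), τ is the first time the running sum of standard exponential observations exceeds a threshold; this is a Markov stopping time with E[τ] < ∞, and the two printed averages should nearly coincide.

import numpy as np

# Monte Carlo check of Wald's identity E[X_1 + ... + X_tau] = E[X_1] E[tau]
# with tau = first time the running sum exceeds a threshold.
rng = np.random.default_rng(7)
threshold, n_rep = 10.0, 50_000
sums, taus = [], []
for _ in range(n_rep):
    total, t = 0.0, 0
    while total <= threshold:
        total += rng.exponential(1.0)    # E[X_1] = 1
        t += 1
    sums.append(total)
    taus.append(t)

print(np.mean(sums), 1.0 * np.mean(taus))   # the two averages are close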

Let τ be a Markov stopping time. Introduce a set of random events:
Fτ = { A ∈ F : A ∩ {τ = t} ∈ Ft for all t, t ≥ 1 }.
Lemma 6.6. The set Fτ is a σ-algebra, that is, this set is closed under
countable intersections, unions, and complements.

Proof. Suppose events A1 and A2 belong to Fτ . To show that Fτ is a σ-


algebra, it suffices to show that the intersection A1 ∩ A2 , union A1 ∪ A2 , and
complement Ā1 belong to Fτ. The same proof extends to countably many
random events.
Denote by B1 = A1 ∩ {τ = t} ∈ Ft and B2 = A2 ∩ {τ = t} ∈ Ft . The
intersection A1 ∩ A2 satisfies
(A1 ∩ A2 ) ∩ {τ = t} = B1 ∩ B2 ∈ Ft ,
as the intersection of two Ft -measurable events. Also, the union A1 ∪ A2 is
such that
(A1 ∪ A2 ) ∩ {τ = t} = B1 ∪ B2 ∈ Ft ,
as the union of two Ft-measurable events. As for the complement Ā1, note
first that both events {τ = t} and A1 ∩ {τ = t} belong to Ft, therefore,
Ā1 ∩ {τ = t} = {τ = t} \ ( A1 ∩ {τ = t} )
= {τ = t} ∩ ( A1 ∩ {τ = t} )ᶜ ∈ Ft,
as an intersection of two Ft-measurable events. □

The σ-algebra Fτ is referred to as a σ-algebra of random events measur-


able up to the random time τ.
Lemma 6.7. The Markov stopping time τ is Fτ -measurable.

Proof. For any positive integer s, put A = {τ = s}. We need to show that
A ∈ Fτ . For all t we find that
A ∩ {τ = t} = {τ = s} ∩ {τ = t} = {τ = t} if s = t,
and is the empty set, otherwise. The set {τ = t} belongs to Ft by the
definition of a stopping time. The empty set is Ft - measurable as well
(refer to Exercise 6.36). Thus, by the definition of Fτ , the event A belongs
to Fτ . 

Recall that we defined Ft as a σ-algebra generated by the random vari-


ables Xi up to time t.
Lemma 6.8. The random variable Xτ is Fτ -measurable.

Proof. Take any interval [a, b] and define A = { Xτ ∈ [a, b] }. Note that
A = ∪_{s=1}^{∞} ( { Xs ∈ [a, b] } ∩ {τ = s} ).
Then for all t, we have that
A ∩ {τ = t} = ( ∪_{s=1}^{∞} ( { Xs ∈ [a, b] } ∩ {τ = s} ) ) ∩ {τ = t}
= { Xt ∈ [a, b] } ∩ {τ = t}.
The latter intersection belongs to Ft because both random events belong to
Ft . Hence A is Fτ -measurable. 
Remark 6.9. The concept of the σ-algebra Fτ is essential in the sequential
analysis. All parameter estimators constructed from sequential observations
are Fτ -measurable, that is, are based on observations X1 , . . . , Xτ obtained
up to a random stopping time τ. 
6.2. Change-Point Problem. Rate of Detection


In this section we return to the change-point problem studied in Chapter 5,
and look at it from the sequential estimation point of view. The statistical
setting of the problem is modified. If previously all n observations were
available for estimation of the true change point θ0 ∈ Θα , in this section we
assume that observations Xi ’s arrive sequentially one at a time at moments
ti = i where i = 1, . . . , n.
 
Define a filter { Ft, 1 ≤ t ≤ n } of σ-algebras Ft generated by the
observations X1 , . . . , Xt up to time t, 1 ≤ t ≤ n. Introduce T as a set of all
Markov stopping times with respect to this filter.
If a sequential estimator τ̂n of the change point θ0 belongs to T , that
is, if τ̂n is a Markov stopping time, then we call this estimator an on-line
detector (or just detector) and the estimation problem itself, the on-line
detection problem (or, simply, detection).
In the on-line detection problem, we use the same loss functions as in the
regular estimation problems studied so far. For example, for the quadratic
loss, the minimax risk of detection is defined as

    r_n^D = inf_{ τ̂n ∈ T } max_{ θ0 ∈ Θα } Eθ0 [ ( τ̂n − θ0 )^2 ].

The crucial difference between the minimax risk rn in the previous chap-
ters and r_n^D lies in the restrictions on the set of admissible estimators. In
the on-line detection problem, we cannot use an arbitrary function of the observations.

Remark 6.10. In this section we focus our attention on the quadratic


loss, even though, sometimes in practice, other loss functions are used. For
instance, we can restrict the class of admissible detectors to a class Tγ defined
by

(6.1)    Tγ = { τ̂n : τ̂n ∈ T and max_{ θ0 ∈ Θα } Pθ0 ( τ̂n ≤ θ0 ) ≤ γ }
 
where γ is a given small positive number. The probability Pθ0 τ̂n ≤ θ0 is
called the false alarm probability. The name is inherited from the military air
defense problems where θ is associated with the time of a target appearance,
so that any detection of the target before it actually appears on the radar
screen is, indeed, a false alarm. The condition on detectors in Tγ requires
that the false alarm probability is small, uniformly in θ0 ∈ Θα . Another
natural criterion in detection, also rooted in the military concerns, is a so-
called expected detection delay,
 
(6.2)    Eθ0 [ ( τ̂n − θ0 )+ ] = Eθ0 [ ( τ̂n − θ0 ) I( τ̂n > θ0 ) ].

The expected detection delay generates the minimax risk

    inf_{ τ̂n ∈ Tγ } max_{ θ0 ∈ Θα } Eθ0 [ ( τ̂n − θ0 )+ ].

Clearly, any additional constraints on the admissible detectors increase


the value of the minimax risk. And every additional restriction makes the
problem more difficult. 

Below we find the rate of convergence for the minimax quadratic risk of
detection r_n^D for the Gaussian model, and define the rate-optimal detectors.
Assume that Xi ∼ N (0, σ 2 ) if 1 ≤ i ≤ θ0 , and Xi ∼ N (μ, σ 2 ) if θ0 < i ≤
n, where μ > 0 is known. Our goal is to show that there exists a Markov
stopping time τn∗ such that its deviation away from the true value of θ0 has
the magnitude O(ln n) as n → ∞. It indicates a slower rate of convergence
for the on-line detection as opposed to the estimation based on the entire
sample. Recall that in the latter case, the rate is O(1).
Remark 6.11. Note that on the integer scale, the convergence with the
rate O(ln n) is not a convergence at all. This should not be surprising since
the convergence rate of O(1) means no convergence as well. If we compress
the scale and consider the on-line detection problem on the unit interval
[0, 1] with the frequency of observations n (see Remark 5.5), then the rate
of convergence guaranteed by the Markov stopping time detectors becomes
(ln n)/n. 
Theorem 6.12. In the on-line detection problem with n Gaussian observa-
tions, there exists a Markov stopping time τn∗ and a constant r∗ independent
of n such that the following upper bound holds:

    max_{ θ0 ∈ Θα } Eθ0 [ ( ( τn∗ − θ0 ) / ln n )^2 ] ≤ r∗ .
Proof. The construction of the stopping time τn∗ is based on the idea of
averaging. Roughly speaking, we partition the interval [1, n] and compute
the sample means in each of the subintervals. At the lower end of the
interval, the averages are close to zero. At the upper end, they are close to
the known number μ, while in the subinterval that captures the true change
point, the sample mean is something in-between.
Put N = b ln n where b is a positive constant independent of n that will
be chosen later. Define M = n/N . Without loss of generality, we assume
that N and M are integer numbers. Introduce the normalized mean values
of observations in subintervals of length N by
    X̄m = ( 1/(μN) ) Σ_{ i=1 }^{ N } X_{(m−1)N+i} ,    m = 1, . . . , M.

Let m0 be an integer such that (m0 − 1)N + 1 ≤ θ0 ≤ m0 N . Note that

    X̄m = (1/N) Σ_{ i=1 }^{ N } I( (m − 1) N + i > θ0 ) + ( 1/(cN) ) Σ_{ i=1 }^{ N } ε_{(m−1)N+i}

where εi ’s are independent standard normal random variables and c is the


signal-to-noise ratio, c = μ/σ. For any m < m0 the first sum equals zero,
and for m > m0 this sum is 1. The value of the first sum at m = m0 is a
number between 0 and 1, and it depends on the specific location of the true
change point θ0 . The second sum can be shown to be

    ( 1/(cN) ) Σ_{ i=1 }^{ N } ε_{(m−1)N+i} = Zm / ( c √N ) ,    m = 1, . . . , M,

where the Zm ’s are independent standard normal random variables under


Pθ0 -probability.
Next we show that for sufficiently large n,

    Pθ0 ( max_{ 1≤m≤M } | Zm | ≥ √(10 ln M) ) ≤ n^{−3} .

Put y = √(10 ln M) > 1. Notice that the probability that the maximum is not
less than y equals the probability that at least one of the random variables
is not less than y; therefore, we estimate

    P( max_{ 1≤m≤M } | Zm | ≥ y ) = P( ∪_{ m=1 }^{ M } { | Zm | ≥ y } ) ≤ Σ_{ m=1 }^{ M } P( | Zm | ≥ y )

    = 2 M ( 1 − Φ(y) ) ≤ ( 2/√(2πy^2) ) M exp{ −y^2 /2 } ≤ M exp{ −y^2 /2 },

where Φ(y) denotes the cumulative distribution function of a N (0, 1) ran-
dom variable. In the above we used the standard inequality 1 − Φ(y) ≤
exp{ −y^2 /2 } / √(2πy^2) if y > 1. Thus, we have

    Pθ0 ( max_{ 1≤m≤M } | Zm | ≥ √(10 ln M) ) ≤ M exp{ −10 ln M / 2 }

    = M^{−4} = ( n/(b ln n) )^{−4} ≤ n^{−3} .


Consider the random event

    A = { max_{ 1≤m≤M } | Zm | < √(10 ln M) } .

We have just shown that, uniformly in θ0 ∈ Θα , the probability of Ā, the
complement of A, is bounded from above,

(6.3)    Pθ0 ( Ā ) ≤ n^{−3} .

Choose b = 10^3 c^{−2} . If the event A occurs, we have the inequalities

    max_{ 1≤m≤M } | Zm / ( c √N ) | < (1/c) √( 10 ln M / N )

    = (1/c) √( 10 ( ln n − ln ln n − ln b ) / ( b ln n ) ) ≤ √( 10 / ( b c^2 ) ) = 0.1 .
Now we can finalize the description of the averaged observations X̄m =
Bm + ξm where the Bm 's are deterministic with the property that Bm = 0
if m < m0 , and Bm = 1 if m > m0 . The random variables | ξm | =
| Zm / ( c √N ) | do not exceed 0.1 if the random event A holds.
We are ready to define the Markov stopping time that estimates the
change point θ0 . Define an integer-valued random variable m∗ by

    m∗ = min{ m : X̄m ≥ 0.9 , 1 ≤ m ≤ M } ,

and formally put m∗ = M if X̄m < 0.9 for all m. Under the random event
A, the minimal m∗ exists and is equal to either m0 or m0 + 1.
Introduce a random variable

(6.4) τn∗ = m∗ N.

If t is an integer divisible by N , then the random event { τn∗ = t } is defined
in terms of X̄1 , . . . , X̄t/N , that is, in terms of X1 , . . . , Xt , which means that
{ τn∗ = t } is Ft -measurable. Thus, τn∗ is a Markov stopping time. We take τn∗
as the on-line detector. The next step is to estimate its quadratic risk.
As shown above, the inclusion A ⊆ { 0 ≤ m∗ − m0 ≤ 1 } is true. The
definition of m0 implies the inequalities 0 ≤ τn∗ − θ0 ≤ 2 N. We write

    max_{ θ0 ∈ Θα } Eθ0 [ ( ( τn∗ − θ0 ) / ln n )^2 ]

    = max_{ θ0 ∈ Θα } ( Eθ0 [ ( ( τn∗ − θ0 ) / ln n )^2 I(A) ] + Eθ0 [ ( ( τn∗ − θ0 ) / ln n )^2 I(Ā) ] )

    ≤ max_{ θ0 ∈ Θα } ( Eθ0 [ ( 2 N / ln n )^2 I(A) ] + Eθ0 [ ( n / ln n )^2 I(Ā) ] )

    ≤ ( 2 N / ln n )^2 + ( n / ln n )^2 n^{−3} ≤ 4 b^2 + 2 ,

where at the final stage we have applied (6.3) and the trivial inequality
1/( n ln^2 n ) < 2 , n ≥ 2. Thus, the statement of the theorem follows with
r∗ = 4 b^2 + 2.  □
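The detector constructed in the proof is easy to simulate. The sketch below is illustrative only: the values of n, μ, σ, and θ0 are arbitrary choices, while the block length N = b ln n with b = 10^3 c^{−2} and the threshold 0.9 are taken from the proof.

```python
# Simulation sketch of the on-line change-point detector tau* = m* N from Theorem 6.12.
# Assumed illustrative values of n, mu, sigma, theta0; b and the 0.9 threshold follow the proof.
import numpy as np

rng = np.random.default_rng(1)
n, mu, sigma, theta0 = 100_000, 1.0, 1.0, 40_000
c = mu / sigma                     # signal-to-noise ratio
b = 1e3 / c**2                     # constant chosen in the proof
N = int(b * np.log(n))             # block length N = b ln n
M = n // N                         # number of blocks

# X_i ~ N(0, sigma^2) for i <= theta0 and X_i ~ N(mu, sigma^2) afterwards
X = rng.normal(0.0, sigma, n)
X[theta0:] += mu

# normalized block means: Xbar_m = (mu N)^{-1} * (sum of the m-th block)
blocks = X[: M * N].reshape(M, N)
Xbar = blocks.sum(axis=1) / (mu * N)

# m* = first block whose normalized mean reaches 0.9; detector tau* = m* N
above = np.nonzero(Xbar >= 0.9)[0]
m_star = (above[0] + 1) if above.size else M
tau_star = m_star * N

print("N =", N, " tau* =", tau_star, " theta0 =", theta0,
      " (tau* - theta0)/ln n =", (tau_star - theta0) / np.log(n))
```

In a typical run the detector overshoots θ0 by at most 2N observations, in agreement with the inclusion A ⊆ { 0 ≤ m∗ − m0 ≤ 1 } used in the proof.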

6.3. Minimax Limit in the Detection Problem.


The rate ln n in the on-line detection which is guaranteed by Theorem 6.12
is the minimax rate. We show in this section that it cannot be improved by
any other detector.
Recall that T denotes the class of all Markov stopping times with respect
to the filter generated by the observations.
Theorem 6.13. In the on-line detection problem with n Gaussian observa-
tions, there exists a positive constant r∗ independent of n such that

    lim inf_{ n→∞ } inf_{ τ̂n ∈ T } max_{ θ0 ∈ Θα } Eθ0 [ ( ( τ̂n − θ0 ) / ln n )^2 ] ≥ r∗ .
Proof. Choose points t0 , . . . , tM in the parameter set Θα such that tj −
tj−1 = 3 b ln n, j = 1, . . . , M, with a positive constant b independent of n.
The exact value of b will be selected later. Here the number of points M is
equal to M = n (1 − 2 α)/(3 b ln n). We assume, without loss of generality,
that M and b ln n are integers.
We proceed by contradiction and assume that the claim of the theorem
is false. Then there exists a detector τ̃n such that

    lim_{ n→∞ } max_{ 0 ≤ j ≤ M } Etj [ ( ( τ̃n − tj ) / ln n )^2 ] = 0,

which implies that

    lim_{ n→∞ } max_{ 0 ≤ j ≤ M } Ptj ( | τ̃n − tj | > b ln n ) = 0.

Indeed, by the Markov inequality,

    Ptj ( | τ̃n − tj | > b ln n ) ≤ b^{−2} Etj [ ( ( τ̃n − tj ) / ln n )^2 ].

Hence for all large enough n, the following inequalities hold:

(6.5)    Ptj ( | τ̃n − tj | ≤ b ln n ) ≥ 3/4 ,    j = 0 , . . . , M.
Consider the inequality for j = M. Then

    1/4 ≥ PtM ( | τ̃n − tM | > b ln n ) ≥ PtM ( ∪_{ j=0 }^{ M−1 } { | τ̃n − tj | ≤ b ln n } )

    = Σ_{ j=0 }^{ M−1 } PtM ( | τ̃n − tj | ≤ b ln n )

(6.6)    = Σ_{ j=0 }^{ M−1 } Etj [ ( dPtM / dPtj ) I( | τ̃n − tj | ≤ b ln n ) ].

Indeed, if τ̃n is close to one of the tj , j = 0, . . . , M − 1, then τ̃n is distant from
tM , and the random events { | τ̃n − tj | ≤ b ln n } are mutually exclusive.
The likelihood ratio has the form

    dPtM / dPtj = exp{ (μ/σ) Σ_{ i = tj+1 }^{ tM } ( − ( Xi − μ )/σ − μ/(2σ) ) }

    = exp{ c Σ_{ i = tj+1 }^{ tM } εi − ( c^2 /2 ) ( tM − tj ) }

where c = μ/σ is the signal-to-noise ratio, and εi = −(Xi − μ)/σ have the
standard normal distribution with respect to the Ptj -probability. Note that
the number of terms in the sum from tj + 1 to tM can be as large as O(n).
Further, let

    Bj = { | τ̃n − tj | ≤ b ln n } .

Thus, each expectation in (6.6) can be written as

    Etj [ ( dPtM / dPtj ) I( | τ̃n − tj | ≤ b ln n ) ]

    = Etj [ exp{ c Σ_{ i = tj+1 }^{ tM } εi − ( c^2 /2 ) ( tM − tj ) } I( Bj ) ].

Put uj = tj + b ln n. The event Bj is Fuj -measurable because τ̃n is
a Markov stopping time. Hence Bj is independent of the observations
Xuj+1 , . . . , XtM . Equivalently, I(Bj ) is independent of εi for i = uj + 1, . . . , tM .
Note also that

    Etj [ exp{ c Σ_{ i = uj+1 }^{ tM } εi − ( c^2 /2 ) ( tM − uj ) } ] = exp{ ( c^2 /2 )( tM − uj ) − ( c^2 /2 )( tM − uj ) } = 1.

We write

    Etj [ ( dPtM / dPtj ) I( | τ̃n − tj | ≤ b ln n ) ]

    = Etj [ exp{ c Σ_{ i = tj+1 }^{ uj } εi − ( c^2 /2 ) ( uj − tj ) } I( Bj ) ]

    = Etj [ exp{ c √( b ln n ) Zj − ( c^2 /2 ) b ln n } I( Bj ) ] ,

where Zj = Σ_{ i = tj+1 }^{ uj } εi / √( b ln n ) is a standard normal random variable with
respect to the Ptj -probability,

    ≥ Etj [ exp{ c √( b ln n ) Zj − ( c^2 /2 ) b ln n } I( Bj ) I( Zj ≥ 0 ) ]

    ≥ exp{ − ( c^2 /2 ) b ln n } Ptj ( Bj ∩ { Zj ≥ 0 } ) .
Further, the probability of the intersection satisfies

    Ptj ( Bj ∩ { Zj ≥ 0 } ) = Ptj ( Bj ) + Ptj ( Zj ≥ 0 ) − Ptj ( Bj ∪ { Zj ≥ 0 } )

    ≥ Ptj ( Bj ) + Ptj ( Zj ≥ 0 ) − 1 ≥ 3/4 + 1/2 − 1 = 1/4 .

In the last step we used the inequality (6.5) and the fact that Ptj ( Zj ≥ 0 ) = 1/2.

Thus, if we choose b = c^{−2} , then the following lower bound holds:

    Etj [ ( dPtM / dPtj ) I( | τ̃n − tj | ≤ b ln n ) ]

    ≥ (1/4) exp{ − ( c^2 /2 ) b ln n } = 1 / ( 4 √n ) .
Substituting this inequality into (6.6), we arrive at a contradiction,

    1/4 ≥ Σ_{ j=0 }^{ M−1 } 1/( 4 √n ) = M /( 4 √n ) = n(1 − 2α) / ( 4 √n · 3b ln n ) = (1 − 2α) √n / ( 12 b ln n ) → ∞  as n → ∞.

This implies that the statement of the theorem is true. 

6.4. Sequential Estimation in the Autoregressive Model


In the previous two sections we applied the sequential estimation method to
the on-line detection problem. In this section, we demonstrate this technique
with another example, the first-order autoregressive model (also, termed
autoregression). Assume that the observations Xi satisfy the equation
(6.7) Xi = θ Xi−1 + εi , i = 1, 2, . . .
with the zero initial condition, X0 = 0. Here εi ’s are independent normal
random variables with mean zero and variance σ 2 . The autoregression co-
efficient θ is assumed bounded, −1 < θ < 1. Moreover, the true value of
this parameter is strictly less than 1 in absolute value, θ0 ∈ Θα = { θ : | θ | ≤ 1 − α } with a
given small positive number α.
The following lemma describes the asymptotic behavior of autoregres-
sion. The proof of the lemma is moved to Exercise 6.42.
Lemma 6.14. (i) The autoregressive model admits the representation
    Xi = εi + θ εi−1 + θ^2 εi−2 + · · · + θ^{i−2} ε2 + θ^{i−1} ε1 ,    i = 1, 2, . . . .

(ii) The random variable Xi is normal with the zero mean and variance

    σi^2 = Var[ Xi ] = σ^2 ( 1 − θ^{2i} ) / ( 1 − θ^2 ).

(iii) The variance of Xi has the limit

    lim_{ i→∞ } σi^2 = σ∞^2 = σ^2 / ( 1 − θ^2 ).

(iv) The covariance between Xi and Xi+j , j ≥ 0, is equal to

    Cov[ Xi , Xi+j ] = σ^2 θ^j ( 1 − θ^{2i} ) / ( 1 − θ^2 ).
1 − θ2
Our objective is to find an on-line estimator of the parameter θ. Before
we do this, we first study the maximum likelihood estimator (MLE).

6.4.1. Heuristic Remarks on MLE. Assume that only n observations


are available, X1 , . . . , Xn . Then the log-likelihood function has the form

    Ln (θ) = Σ_{ i=1 }^{ n } ( − ( Xi − θ Xi−1 )^2 / ( 2σ^2 ) − (1/2) ln( 2πσ^2 ) ) .

Differentiating with respect to θ, we find the classical MLE θn∗ of the au-
toregression coefficient θ:

    θn∗ = Σ_{ i=1 }^{ n } Xi−1 Xi / Σ_{ i=1 }^{ n } Xi−1^2 .
The MLE does not have a normal distribution, which is easy to show for
n = 2,
    θ2∗ = ( X0 X1 + X1 X2 ) / ( X0^2 + X1^2 ) = X1 X2 / X1^2 = ε1 ( θ0 ε1 + ε2 ) / ε1^2 = θ0 + ε2 / ε1
where θ0 is the true value of θ. The ratio ε2 /ε1 has the Cauchy distribution
(show!). Therefore, the expectation of the difference θ2∗ − θ0 does not exist.
For n > 2, the expectation of θn∗ − θ0 exists but is not zero, so that
the MLE is biased. We skip the proofs of these technical and less important
facts. What is more important is that θn∗ is asymptotically normal as n →
∞. We will try to explain this fact at the intuitive level. Note that

    θn∗ = Σ_{ i=1 }^{ n } Xi−1 Xi / Σ_{ i=1 }^{ n } Xi−1^2

    = ( X0 (θ0 X0 + ε1 ) + X1 (θ0 X1 + ε2 ) + · · · + Xn−1 (θ0 Xn−1 + εn ) ) / ( X0^2 + X1^2 + · · · + Xn−1^2 )

(6.8)    = θ0 + ( X1 ε2 + · · · + Xn−1 εn ) / ( X1^2 + · · · + Xn−1^2 ) .

By Lemma 6.14 (iv), since |θ| < 1, the covariance between two remote
terms Xi and Xi+j decays exponentially fast as j → ∞. It can be shown
that the Law of Large Numbers (LLN) applies to this process exactly as in
the case of independent random variables. By the LLN, for all large n, we
can substitute the denominator in the latter formula by its expectation


    E[ X1^2 + · · · + Xn−1^2 ] = Σ_{ i=1 }^{ n−1 } Var[ Xi ] ∼ n σ∞^2 = n σ^2 / ( 1 − θ0^2 ) .

Thus, on a heuristic level, we may say that

    √n ( θn∗ − θ0 ) ∼ √n ( X1 ε2 + · · · + Xn−1 εn ) / ( n σ^2 /(1 − θ0^2 ) )

    = ( (1 − θ0^2 )/σ^2 ) ( X1 ε2 + · · · + Xn−1 εn ) / √n .

If the Xi 's were independent, then ( X1 ε2 + · · · + Xn−1 εn ) / √n would
satisfy the Central Limit Theorem (CLT). It turns out, and it is far from
being trivial, that we can work with the Xi ’s as if they were independent,
and the CLT still applies. Thus, the limiting distribution of this quotient is
normal with mean zero and the limiting variance
    lim_{ n→∞ } Var[ ( X1 ε2 + · · · + Xn−1 εn ) / √n ] = lim_{ n→∞ } (1/n) Σ_{ i=1 }^{ n−1 } E[ ( Xi εi+1 )^2 ]

    = lim_{ n→∞ } (1/n) Σ_{ i=1 }^{ n−1 } E[ Xi^2 ] E[ εi+1^2 ] = lim_{ n→∞ } ( σ^4 / ( n(1 − θ0^2 ) ) ) Σ_{ i=1 }^{ n−1 } ( 1 − θ0^{2i} )

    = lim_{ n→∞ } ( σ^4 / ( n(1 − θ0^2 ) ) ) ( n − ( 1 − θ0^{2n} ) / ( 1 − θ0^2 ) ) = σ^4 / ( 1 − θ0^2 ) .

It partially explains why the difference √n ( θn∗ − θ0 ) is asymptotically normal
with mean zero and variance

    ( (1 − θ0^2 )/σ^2 )^2 · σ^4 / ( 1 − θ0^2 ) = 1 − θ0^2 ,

that is,

    √n ( θn∗ − θ0 ) → N( 0, 1 − θ0^2 )    as n → ∞.

Note that the limiting variance is independent of σ^2 , the variance of the
noise.
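The heuristic conclusion √n ( θn∗ − θ0 ) → N( 0, 1 − θ0^2 ) can be examined numerically. The sketch below is an illustration only (the values of θ0, σ, n, and the number of replications are arbitrary choices): it simulates the autoregression (6.7), computes the MLE θn∗ = Σ Xi−1 Xi / Σ Xi−1^2 , and compares the sample variance of √n ( θn∗ − θ0 ) with 1 − θ0^2 .

```python
# Monte Carlo check that sqrt(n)(theta_n* - theta_0) has limiting variance 1 - theta_0^2.
# Illustrative sketch; theta0, sigma, n, and n_rep are arbitrary choices.
import numpy as np

rng = np.random.default_rng(2)
theta0, sigma, n, n_rep = 0.6, 2.0, 2000, 2000

est = np.empty(n_rep)
for r in range(n_rep):
    eps = rng.normal(0.0, sigma, n)
    X = np.empty(n)
    X[0] = eps[0]                            # X_1 = theta0 * X_0 + eps_1 with X_0 = 0
    for i in range(1, n):
        X[i] = theta0 * X[i - 1] + eps[i]
    X_prev = np.concatenate(([0.0], X[:-1]))  # X_0, X_1, ..., X_{n-1}
    est[r] = np.sum(X_prev * X) / np.sum(X_prev**2)   # MLE theta_n*

z = np.sqrt(n) * (est - theta0)
print("sample variance of sqrt(n)(theta* - theta0):", z.var())
print("theoretical limit 1 - theta0^2:             ", 1 - theta0**2)
```

The two printed values are close, and, as claimed, the answer does not depend on σ.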

6.4.2. On-Line Estimator. After obtaining a general idea about the MLE
and its asymptotic performance, we are ready to try a sequential estimation
procedure, termed an on-line estimation.

Note that from (6.8) the difference θn∗ − θ0 can be presented in the form
θn∗ − θ0 = Σ_{ i=2 }^{ n } υn,i εi with the weights υn,i = Xi−1 /( X1^2 + · · · + Xn−1^2 ).
If the υn,i 's were deterministic, then the variance of the difference θn∗ − θ0
would be

    σ^2 Σ_{ i=2 }^{ n } υn,i^2 = σ^2 /( X1^2 + · · · + Xn−1^2 ).

In a sense, the sum X1^2 + · · · + Xn−1^2 plays the role of the information number:
the larger it is, the smaller the variance.
The above argument brings us to an understanding of how to construct a
sequential estimator of θ, called an on-line estimator. Let us stop collecting
data at a random time τ when the sum X1^2 + · · · + Xτ^2 reaches a prescribed
level H > 0, that is, define the Markov stopping time τ by (see Exercise 6.39)

    τ = min{ t : X1^2 + · · · + Xt^2 > H } .

In the discrete case with normal noise, the overshoot X1^2 + · · · + Xτ^2 − H
is positive with probability 1. The stopping time τ is a random sample
size, and the level H controls the magnitude of its expected value: Eθ0 [ τ ]
increases as H grows (see Exercise 6.39). Put

    ΔH = H − ( X1^2 + · · · + Xτ−1^2 )    and    η = ΔH / Xτ .

The definition of η makes sense because the random variable Xτ differs from
zero with probability 1.
Define an on-line estimator of θ0 by

(6.9)    θ̂τ = (1/H) ( Σ_{ i=1 }^{ τ } Xi−1 Xi + η Xτ+1 ) .

This is a sequential version of the MLE (6.8). Clearly, if ΔH (and,
respectively, η) were negligible, then θ̂τ would be the MLE with n substituted
by τ. Note that θ̂τ is not Fτ -measurable because it depends on one extra
observation, Xτ+1 . This is a tribute to the discrete nature of the model.
As shown below, due to this extra term, the estimator (6.9) is unbiased.
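The following sketch implements the stopping rule and the estimator (6.9); it is an illustration only, with arbitrary choices of θ0, σ, and the level H, and it can be used to check empirically the claims of Lemma 6.15 below (unbiasedness and a variance not exceeding σ^2 /H).

```python
# Sequential (on-line) estimator (6.9) of the AR(1) coefficient.
# Illustrative sketch: theta0, sigma, H, and the number of replications are arbitrary choices.
import numpy as np

rng = np.random.default_rng(3)
theta0, sigma, H, n_rep = 0.5, 1.0, 500.0, 2000

est = np.empty(n_rep)
for r in range(n_rep):
    xs = [0.0]                 # X_0, X_1, ... (X_0 = 0)
    s = 0.0                    # running sum X_1^2 + ... + X_t^2
    while s <= H:              # tau = min{ t : X_1^2 + ... + X_t^2 > H }
        x_new = theta0 * xs[-1] + rng.normal(0.0, sigma)
        xs.append(x_new)
        s += x_new**2
    tau = len(xs) - 1
    delta_H = H - (s - xs[tau]**2)            # Delta H = H - (X_1^2 + ... + X_{tau-1}^2)
    eta = delta_H / xs[tau]                   # eta = Delta H / X_tau
    x_next = theta0 * xs[tau] + rng.normal(0.0, sigma)   # the extra observation X_{tau+1}
    cross = sum(xs[i - 1] * xs[i] for i in range(1, tau + 1))
    est[r] = (cross + eta * x_next) / H       # estimator (6.9)

print("mean of estimates:", est.mean(), " (true theta0 =", theta0, ")")
print("variance of estimates:", est.var(), " bound sigma^2/H =", sigma**2 / H)
```

In simulations the sample mean is close to θ0 and the sample variance stays below σ^2 /H, in line with the lemma.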

Lemma 6.15. The estimator θ̂τ given by (6.9) is an unbiased estimator of


θ0 , and uniformly over θ0 ∈ Θα , its variance does not exceed σ 2 /H.

Proof. First, we show that the estimator is unbiased. Note that

    θ̂τ = (1/H) ( Σ_{ i=1 }^{ τ } Xi−1 ( θ0 Xi−1 + εi ) + η ( θ0 Xτ + ετ+1 ) )

    = (1/H) ( θ0 Σ_{ i=1 }^{ τ } Xi−1^2 + Σ_{ i=1 }^{ τ } Xi−1 εi + θ0 η Xτ + η ετ+1 ) .

By definition, η Xτ = ΔH and ΔH + Σ_{ i=1 }^{ τ } Xi−1^2 = H, hence

(6.10)    θ̂τ = θ0 + (1/H) ( Σ_{ i=1 }^{ τ } Xi−1 εi + η ετ+1 ) .

Therefore, the bias of θ̂τ is equal to

(6.11)    Eθ0 [ θ̂τ − θ0 ] = (1/H) ( Eθ0 [ Σ_{ i=1 }^{ τ } Xi−1 εi ] + Eθ0 [ η ετ+1 ] ) ,
and it suffices to show that both expectations are equal to zero. Start with
the first one:

    Eθ0 [ Σ_{ i=1 }^{ τ } Xi−1 εi ] = Eθ0 [ X1 ε2 I(τ = 2) + ( X1 ε2 + X2 ε3 ) I(τ = 3) + · · · ]

    = Eθ0 [ Σ_{ i=1 }^{ ∞ } Xi−1 εi I(τ ≥ i) ] = Σ_{ i=1 }^{ ∞ } Eθ0 [ Xi−1 εi I(τ ≥ i) ] .
We already know that the random variable I(τ ≥ i) is Fi−1 -measurable, and
so is Xi−1 . The random variable εi is independent of Fi−1 , which yields that
each term in this infinite sum is equal to zero,

    Eθ0 [ Xi−1 εi I(τ ≥ i) ] = Eθ0 [ Xi−1 I(τ ≥ i) ] Eθ0 [ εi ] = 0.

The second expectation Eθ0 [ η ετ+1 ] requires more attention. Note that η is
Fτ -measurable. Indeed, for any integer t and for any a ≤ b, the intersection
of the random events

    { a ≤ η ≤ b } ∩ { τ = t } = { aXt ≤ H − ( X1^2 + · · · + Xt−1^2 ) ≤ bXt } ∩ { τ = t }

is Ft -measurable, because both random events on the right-hand side are
Ft -measurable. Hence for any t, the random variable η I(τ = t) is Ft -
measurable. The variable εt+1 , on the other hand, is independent of Ft .
Thus,

    Eθ0 [ η ετ+1 ] = Σ_{ t=1 }^{ ∞ } Eθ0 [ η εt+1 I(τ = t) ]

    = Σ_{ t=1 }^{ ∞ } Eθ0 [ η I(τ = t) ] Eθ0 [ εt+1 ] = 0.

It follows that both sums in (6.11) are equal to zero, which means that the
estimator θ̂τ is unbiased.

Next, we want to estimate the variance of θ̂τ . Using the representation
(6.10) of θ̂τ , we need to verify that

    Eθ0 [ ( Σ_{ i=1 }^{ τ } Xi−1 εi + η ετ+1 )^2 ] ≤ σ^2 H.

The left-hand side of this inequality is equal to

(6.12)    Eθ0 [ ( Σ_{ i=1 }^{ τ } Xi−1 εi )^2 + 2 Σ_{ i=1 }^{ τ } Xi−1 εi η ετ+1 + η^2 ετ+1^2 ] .
Consider the last term. We know that η is Fτ -measurable. Hence

    Eθ0 [ η^2 ετ+1^2 ] = Σ_{ t=1 }^{ ∞ } Eθ0 [ η^2 εt+1^2 I(τ = t) ]

    = Σ_{ t=1 }^{ ∞ } Eθ0 [ η^2 I(τ = t) ] Eθ0 [ εt+1^2 ]

    = σ^2 Σ_{ t=1 }^{ ∞ } Eθ0 [ η^2 I(τ = t) ] = σ^2 Eθ0 [ η^2 ] .
In a similar way, we can show that the expectation of the cross-term in
(6.12) is zero. The analysis of the first term, however, takes more steps. It
can be written as

    Eθ0 [ ( Σ_{ i=1 }^{ τ } Xi−1 εi )^2 ] = Eθ0 [ ( X1 ε2 )^2 I(τ = 2) + ( X1 ε2 + X2 ε3 )^2 I(τ = 3)

    + ( X1 ε2 + X2 ε3 + X3 ε4 )^2 I(τ = 4) + · · · ] = Eθ0 [ X1^2 ε2^2 I(τ = 2)

    + ( X1^2 ε2^2 + X2^2 ε3^2 ) I(τ = 3) + ( X1^2 ε2^2 + X2^2 ε3^2 + X3^2 ε4^2 ) I(τ = 4) + · · · ]

    + 2 Eθ0 [ ( X1 ε2 )( X2 ε3 ) I(τ ≥ 3) + ( X1 ε2 + X2 ε3 )( X3 ε4 ) I(τ ≥ 4) + · · · ]

    = E1 + 2 E2 ,

where

    E1 = Eθ0 [ X1^2 ε2^2 I(τ = 2) + ( X1^2 ε2^2 + X2^2 ε3^2 ) I(τ = 3)

    + ( X1^2 ε2^2 + X2^2 ε3^2 + X3^2 ε4^2 ) I(τ = 4) + · · · ]

    = σ^2 Eθ0 [ X1^2 I(τ = 2) + ( X1^2 + X2^2 ) I(τ = 3) + ( X1^2 + X2^2 + X3^2 ) I(τ = 4) + · · · ]

    = σ^2 Eθ0 [ Σ_{ i=1 }^{ τ } Xi−1^2 ]

and

    E2 = Eθ0 [ ( X1 ε2 )( X2 ε3 ) I(τ ≥ 3) + ( X1 ε2 + X2 ε3 )( X3 ε4 ) I(τ ≥ 4) + · · · ]

    = Eθ0 [ ( X1 ε2 )( X2 ) I(τ ≥ 3) ] Eθ0 [ ε3 ]

    + Eθ0 [ ( X1 ε2 + X2 ε3 )( X3 ) I(τ ≥ 4) ] Eθ0 [ ε4 ] + · · · = 0.
Combining all these estimates, we find that the expectation in (6.12) is equal
to

    Eθ0 [ ( Σ_{ i=1 }^{ τ } Xi−1 εi )^2 + 2 Σ_{ i=1 }^{ τ } Xi−1 εi η ετ+1 + η^2 ετ+1^2 ]

    = σ^2 Eθ0 [ Σ_{ i=1 }^{ τ } Xi−1^2 ] + σ^2 Eθ0 [ η^2 ] .

From the definition of ΔH, Σ_{ i=1 }^{ τ } Xi−1^2 = H − ΔH. Also, recall that η =
ΔH/Xτ . Thus, we continue

    = σ^2 Eθ0 [ H − ΔH + η^2 ] = σ^2 ( H − Eθ0 [ ΔH − η^2 ] )

    = σ^2 ( H − Eθ0 [ ΔH − ( ΔH/Xτ )^2 ] )

    = σ^2 ( H − Eθ0 [ ΔH ( 1 − ΔH/Xτ^2 ) ] ) .
Note that at the time τ − 1, the value of the sum X1^2 + · · · + Xτ−1^2 does
not exceed H, which yields the inequality ΔH ≥ 0. In addition, by the
definition of τ , Σ_{ i=1 }^{ τ } Xi−1^2 + Xτ^2 > H, which implies that

    ΔH = H − Σ_{ i=1 }^{ τ } Xi−1^2 < Xτ^2 .

Hence, ΔH/Xτ^2 < 1. Thus, ΔH and ( 1 − ΔH/Xτ^2 ) are positive random
variables with probability 1, and therefore,

    Eθ0 [ ( Σ_{ i=1 }^{ τ } Xi−1 εi + η ετ+1 )^2 ]

(6.13)    = σ^2 ( H − Eθ0 [ ΔH ( 1 − ΔH/Xτ^2 ) ] ) ≤ σ^2 H.  □

The statement of Lemma 6.15 is true for any continuous distribution of


the noise εi , if it has the zero mean and variance σ 2 . The continuity of the
noise guarantees that the distribution of Xi is also continuous, and therefore
η = ΔH/Xτ is properly defined. If we assume additionally that the noise
has a bounded distribution, that is, | εi | ≤ C0 for some positive constant
C0 , then for any i the random variables | Xi |’s turn out to be bounded as
well. Under this additional assumption, we can get a lower bound on the
variance of θ̂τ .

Theorem 6.16. If | εi | ≤ C0 , E[ εi ] = 0, and Var[ εi ] = σ 2 , then

    Varθ0 [ θ̂τ ] ≥ σ^2 / H − σ^2 C0^2 / ( 4H^2 ( 1 − | θ0 | )^2 ) .

Proof. From Lemma 6.14 (i), we find that

    | Xi | ≤ | εi | + | θ0 | | εi−1 | + | θ0 |^2 | εi−2 | + · · · + | θ0 |^{i−2} | ε2 | + | θ0 |^{i−1} | ε1 |

    ≤ C0 ( 1 + | θ0 | + | θ0 |^2 + · · · + | θ0 |^{i−2} + | θ0 |^{i−1} ) ≤ C0 / ( 1 − | θ0 | ).

In the proof of Lemma 6.15 we have shown (see (6.9)-(6.13)) that

    Varθ0 [ θ̂τ ] = ( σ^2 / H^2 ) ( H − Eθ0 [ ΔH ( 1 − ΔH/Xτ^2 ) ] )

where 0 ≤ ΔH ≤ Xτ^2 . Now, the parabola ΔH ( 1 − ΔH/Xτ^2 ) is maximized
at ΔH = Xτ^2 /2, and therefore ΔH ( 1 − ΔH/Xτ^2 ) ≤ Xτ^2 /4. Finally, we have
that

    Eθ0 [ ΔH ( 1 − ΔH/Xτ^2 ) ] ≤ (1/4) Eθ0 [ Xτ^2 ] ≤ C0^2 / ( 4 ( 1 − |θ0 | )^2 ) .

The result of the theorem follows. 

Remark 6.17. Note that the bound for the variance of θ̂τ in Theorem 6.16
is pointwise, that is, the lower bound depends on θ0 . To obtain a uniform
bound for all θ0 ∈ Θα = { θ : | θ | ≤ 1 − α }, we take the minimum of both
sides:

    inf_{ θ0 ∈ Θα } Varθ0 [ θ̂τ ] ≥ σ^2 / H − σ^2 C0^2 / ( 4H^2 α^2 ) .

Combining this result with the uniform upper bound in Lemma 6.15, we get
that as H → ∞,

    inf_{ θ0 ∈ Θα } Varθ0 [ θ̂τ ] = ( σ^2 / H ) ( 1 + O( H^{−1} ) ) .  □

Exercises

Exercise 6.36. Show that an empty set is F -measurable.


Exercise 6.37. Check that the random variables τ defined in Example 6.2
are stopping times.

Exercise 6.38. Show that the variables τ specified in Example 6.3 are
not stopping times.

Exercise 6.39. Let Xi ’s be independent identically distributed random


variables, and let τ be defined as the first time when the sum of squared
observations hits a given positive level H,
τ = min{ i : X12 + · · · + Xi2 > H }.
(i) Show that τ is a Markov stopping time.
(ii) Suppose E[ X12 ] = σ 2 . Prove that E[ τ ] > H/σ 2 . Hint: Use Wald’s first
identity.

Exercise 6.40. Prove Wald’s second identity formulated as follows. Sup-


pose X1 , X2 , . . . are independent identically distributed random variables
with finite mean and variance. Then
 
    Var[ X1 + · · · + Xτ − E[ X1 ] τ ] = Var[ X1 ] E[ τ ].

Exercise 6.41. Suppose that Xi ’s are independent random variables, Xi ∼


N (θ, σ 2 ). Let τ be a stopping time such that Eθ [ τ ] = h, where h is a
deterministic constant.
(i) Show that θ̂τ = (X1 + · · · + Xτ )/h is an unbiased estimator of θ. Hint:
Apply Wald’s first identity.
(ii) Show that

    Varθ [ θ̂τ ] ≤ 2σ^2 / h + 2θ^2 Varθ [ τ ] / h^2 .
Hint: Apply Wald’s second identity.

Exercise 6.42. Prove Lemma 6.14.


Chapter 7

Linear Parametric
Regression

7.1. Definitions and Notations


An important research area in many scientific fields is to find a functional
relation between two variables, say X and Y , based on the experimental
data. The variable Y is called a response variable (or, simply, response),
while X is termed an explanatory variable or a predictor variable.
The relation between X and Y can be described by a regression equation

(7.1) Y = f (X) + ε

where f is a regression function, and ε is a N (0, σ 2 ) random error indepen-


dent of X. In this chapter we consider only parametric regression models
for which the algebraic form of the function f is assumed to be known.

Remark 7.1. In this book we study only simple regressions where there is
only one predictor X. 

Let f be a sum of known functions g0 , . . . , gk with unknown regression


coefficients θ0 , . . . , θk ,

(7.2) f = θ0 g0 + θ1 g1 + · · · + θk gk .

It is convenient to have a constant intercept θ0 in the model, thus, without


loss of generality, we assume that g0 = 1. Note that the function f is linear
in parameters θ0 , . . . , θk .


Plugging (7.2) into the regression equation (7.1), we obtain a general


form of a linear parametric regression model
(7.3) Y = θ0 g0 (X) + θ1 g1 (X) + · · · + θk gk (X) + ε
where the random error ε has a N (0, σ 2 ) distribution and is independent of
X.

Example 7.2. Consider a polynomial regression, for which g0 (X) = 1,


g1 (X) = X, . . . , gk (X) = X k . Here the response variable Y is a polynomial
function of X corrupted by a random error ε ∼ N (0, σ 2 ),
Y = θ0 + θ1 X + θ2 X 2 + · · · + θk X k + ε. 

Suppose the observed data consist of n pairs of observations (xi , yi ), i =


1, . . . , n. The collection of the observations of the explanatory variable X,
denoted by X = {x1 , . . . , xn }, is called a design. According to (7.1), the
data points satisfy the equations
(7.4) yi = f (xi ) + εi , i = 1, . . . , n,
where the εi ’s are independent N (0, σ 2 ) random variables. In particular,
the linear parametric regression model (7.3) for the observations takes the
form
(7.5) yi = θ0 g0 (xi ) + θ1 g1 (xi ) + · · · + θk gk (xi ) + εi , i = 1, . . . , n,
where the εi ’s are independent N (0, σ 2 ).
A scatter plot is the collection of data points with the coordinates (xi , yi ),
for i = 1, . . . , n. A typical scatter plot for a polynomial regression is shown
in Figure 3.

Figure 3. A scatter plot with a fitted polynomial regression function.



It is convenient to write (7.5) using vectors. To this end, introduce
column vectors

    y = ( y1 , . . . , yn )′ ,    ε = ( ε1 , . . . , εn )′

and

    gj = ( gj (x1 ), . . . , gj (xn ) )′ ,    j = 0, . . . , k.
Here the prime indicates the operation of vector transposition. In this
notation, the equations (7.5) turn into
(7.6) y = θ0 g0 + θ1 g1 + · · · + θk gk + ε
where ε ∼ Nn (0, σ^2 In ). That is, ε has an n-variate normal distribution
with mean 0 = (0, . . . , 0)′ and covariance matrix E[ ε ε′ ] = σ^2 In , where In
is the n × n identity matrix.
Denote the linear span-space generated by the vectors g0 , . . . , gk by

    S = span{ g0 , . . . , gk } ⊆ Rn .
The vectors g0 , . . . , gk are assumed to be linearly independent, so that the
dimension of the span-space dim(S) is equal to k + 1. Obviously, it may
happen only if n ≥ k + 1. Typically, n is much larger than k.
Example 7.3. For the polynomial regression, the span-space S is generated
by the vectors g0 = (1, . . . , 1)′ , g1 = (x1 , . . . , xn )′ , . . . , gk = (x1^k , . . . , xn^k )′ .
For distinct values x1 , . . . , xn , n ≥ k + 1, these vectors are linearly indepen-
dent, and the assumption dim(S) = k + 1 is fulfilled (see Exercise 11.79).  □

Define an n × (k + 1) matrix G = ( g0 , . . . , gk ), called a design matrix,
and let θ = ( θ0 , . . . , θk )′ denote the vector of the regression coefficients.
The linear regression (7.6) can be written in the matrix form

(7.7)    y = G θ + ε,    ε ∼ Nn (0, σ^2 In ).

7.2. Least-Squares Estimator


In the system of equations (7.5) (or, equivalently, in its vector form (7.6)),
the parameters θ0 , . . . , θk have unknown values, which should be estimated
from the observations (xi , yi ), i = 1, . . . , n.

Let ŷ = ( ŷ1 , . . . , ŷn )′ denote the orthogonal projection of y on the span-
space S (see Figure 4). This vector is called a fitted (or predicted) response
vector. As any vector in S, this projection is a linear combination of vectors
g0 , g1 , . . . , gk , that is, there exist some constants θ̂0 , θ̂1 , . . . , θ̂k such that
(7.8) ŷ = θ̂0 g0 + θ̂1 g1 + · · · + θ̂k gk .
These coefficients θ̂0 , θ̂1 , . . . , θ̂k may serve as estimates of the unknown pa-
rameters θ0 , θ1 , . . . , θk . Indeed, in the absence of the random error in (7.6),

that is, when ε = 0, we have ŷ = y which implies that θ̂j = θj for all
j = 0, . . . , k.

Figure 4. Geometric interpretation of the linear parametric regression.


The problem of finding the estimators θ̂0 , θ̂1 , . . . , θ̂k can be looked at as
the minimization problem

(7.9)    ‖ y − ŷ ‖^2 = ‖ y − ( θ̂0 g0 + · · · + θ̂k gk ) ‖^2 → min over θ̂0 , . . . , θ̂k .

Here ‖ · ‖ denotes the Euclidean norm of a vector in Rn ,

    ‖ y − ŷ ‖^2 = ( y1 − ŷ1 )^2 + · · · + ( yn − ŷn )^2 .
The estimation procedure consists of finding the minimum of the sum of
squares of the coordinates, thus, the estimators θ̂0 , . . . , θ̂k are referred to as
the least-squares estimators.
The easiest way to solve the minimization problem is through the geo-
metric interpretation of linear regression. In fact, by the definition of a
projection, the vector y − ŷ is orthogonal to every vector in the span-space
S. In particular, its dot product with any basis vector in S must be equal
to zero,
 
(7.10)    ( y − ŷ, gj ) = 0,    j = 0, . . . , k.
Substituting ŷ in (7.10) by its expression from (7.8), we arrive at the system
of k + 1 linear equations with respect to the estimators θ̂0 , . . . , θ̂k ,

    ( y, gj ) − θ̂0 ( g0 , gj ) − · · · − θ̂k ( gk , gj ) = 0,    j = 0, . . . , k.

These equations can be rewritten in a standard form, known as the
system of normal equations,

(7.11)    θ̂0 ( g0 , gj ) + · · · + θ̂k ( gk , gj ) = ( y, gj ),    j = 0, . . . , k.
 
Let θ̂ = ( θ̂0 , . . . , θ̂k )′ be the vector of estimated regression coefficients.
Then we can write equations (7.11) in the matrix form

(7.12)    G′ G θ̂ = G′ y.

By our assumption, the (k + 1) × (k + 1) matrix G′ G has full rank k + 1,
and therefore is invertible. Thus, the least-squares estimator of θ is the
unique solution of the normal equations (7.12),

(7.13)    θ̂ = ( G′ G )^{−1} G′ y.
Remark 7.4. Three Euclidean spaces are involved in the linear regression.
The primary space is the (X, Y )-plane where observed as well as fitted values
may be depicted. Another is the space of observations Rn that includes the
linear subspace S. And the third space is the space Rk+1 that contains the
vector of the regression coefficients θ as well as its least-squares estimator θ̂.
Though the latter two spaces play an auxiliary role in practical regression
analysis, they are important from the mathematical point of view. 
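In computations, the normal equations (7.12) are solved numerically. The sketch below is an illustration only: it takes the polynomial regression of Example 7.2 with arbitrary choices of the true coefficients, the noise level, and the sample size, builds the design matrix G, and obtains θ̂ = ( G′ G )^{−1} G′ y by a linear solve.

```python
# Least-squares estimator (7.13) for a polynomial regression (Example 7.2).
# Illustrative sketch: the true coefficients, sigma, and n are arbitrary choices.
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma = 200, 3, 0.5
theta = np.array([1.0, -2.0, 0.5, 3.0])      # theta_0, ..., theta_k

x = rng.uniform(0.0, 1.0, n)                 # design points in [0, 1]
G = np.vander(x, k + 1, increasing=True)     # columns g_j(x_i) = x_i^j, j = 0, ..., k
y = G @ theta + rng.normal(0.0, sigma, n)    # model (7.7)

# Solve the normal equations G'G theta_hat = G'y, i.e., (7.12)-(7.13)
theta_hat = np.linalg.solve(G.T @ G, G.T @ y)
print("theta_hat =", theta_hat)

# Conditional covariance matrix sigma^2 (G'G)^{-1} (cf. Theorem 7.5 in the next section)
D = sigma**2 * np.linalg.inv(G.T @ G)
print("diagonal of the covariance matrix D:", np.diag(D))
```

The printed estimates are close to the assumed coefficients, and the diagonal of D indicates their conditional variances given the design.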

7.3. Properties of the Least-Squares Estimator


 
Consider the least-squares estimator θ̂ = ( θ̂0 , . . . , θ̂k )′ of the vector of the
true regression coefficients θ = ( θ0 , . . . , θk )′ computed by formula (7.13).
In this section, we study the properties of this estimator.
Recall that we denoted by X = (x1 , . . . , xn ) the design in the regres-
sion model. The explanatory variable X may be assumed deterministic, or
random with a certain distribution. In what follows, we use the notation
Eθ [ · | X ] and Varθ [ · | X ] for the conditional expectation and variance with
respect to the distribution of the random error ε, given the design X . Aver-
aging over both distributions, ε’s and X ’s, will be designated by Eθ [ · ]. For
the deterministic designs, we use the notation Eθ [ · | X ] only if we want to
emphasize the dependence on the design X .
Theorem 7.5. For a fixed design X , the least-squares estimator θ̂ has a
(k + 1)-variate normal distribution with mean θ (it is unbiased) and covariance
matrix Eθ [ (θ̂ − θ)(θ̂ − θ)′ | X ] = σ^2 ( G′ G )^{−1} .

Proof. According to the matrix form of the linear regression (7.7), the
conditional mean of y, given the design X , is Eθ [ y | X ] = G θ, and the
conditional covariance matrix of y is equal to

    Eθ [ ( y − G θ )( y − G θ )′ | X ] = Eθ [ ε ε′ | X ] = σ^2 In .

Thus, the conditional mean of θ̂, given the design X , is calculated as

    Eθ [ θ̂ | X ] = Eθ [ ( G′ G )^{−1} G′ y | X ]

    = ( G′ G )^{−1} G′ Eθ [ y | X ] = ( G′ G )^{−1} G′ G θ = θ.

To find an expression for the conditional covariance matrix of θ̂, notice first
that θ̂ − θ = ( G′ G )^{−1} G′ ( y − Gθ ). Thus,

    Eθ [ (θ̂ − θ)(θ̂ − θ)′ | X ]

    = Eθ [ ( G′ G )^{−1} G′ ( y − Gθ ) ( ( G′ G )^{−1} G′ ( y − Gθ ) )′ | X ]

    = ( G′ G )^{−1} G′ Eθ [ ( y − Gθ )( y − Gθ )′ | X ] G ( G′ G )^{−1}

    = ( G′ G )^{−1} G′ ( σ^2 In ) G ( G′ G )^{−1} = σ^2 ( G′ G )^{−1} .  □

To ease the presentation, we study the regression on the interval [0, 1],
that is, we assume that the regression function f (x) and all the components
in the linear regression model, g0 (x), . . . , gk (x), are defined for x ∈ [0, 1].
The design points xi , i = 1, . . . , n, also belong to this interval.
Define the least-squares estimator of the regression function f (x) in (7.2),
at any point x ∈ [0, 1], by
(7.14) fˆn (x) = θ̂0 g0 (x) + · · · + θ̂k gk (x).
Here the subscript n indicates that the estimation is based on n pairs of
observations (xi , yi ), i = 1, . . . , n.
A legitimate question is how close fˆn (x) is to f (x)? We try to answer
this question using two different loss functions. The first one is the quadratic
loss function computed at a fixed point x ∈ [0, 1],

(7.15)    w( fˆn − f ) = ( fˆn (x) − f (x) )^2 .
The risk with respect to this loss is called the mean squared risk at a point
or mean squared error (MSE).
The second loss function that we consider is the mean squared difference
over the design points

(7.16)    w( fˆn − f ) = (1/n) Σ_{ i=1 }^{ n } ( fˆn (xi ) − f (xi ) )^2 .

Note that this loss function is a discrete version of the integral L2 -norm,

    ‖ fˆn − f ‖2^2 = ∫_0^1 ( fˆn (x) − f (x) )^2 dx.
The respective risk is a discrete version of the mean integrated squared error
(MISE).

In this section, we study the conditional risk Eθ w(fˆn − f ) | X , given
the design X . The next two lemmas provide computational formulas for the
MSE and discrete MISE, respectively.
Introduce the matrix D = σ 2 (G G)−1 called the covariance matrix.
Note that D depends on the design X , and this dependence can be sophisti-
cated. In particular, that if the design X is random, this matrix is random
as well.

Lemma 7.6. For a fixed design X , the estimator fˆn (x) is an unbiased es-
timator of f (x) at any x ∈ [0, 1], so that its MSE equals the variance of
fˆn (x),

    Varθ [ fˆn (x) | X ] = Eθ [ ( fˆn (x) − f (x) )^2 | X ] = Σ_{ l, m = 0 }^{ k } Dl, m gl (x) gm (x),

where Dl, m denotes the (l, m)-th entry of the covariance matrix D.

Proof. By Theorem 7.5, the least-squares estimator θ̂ is unbiased. This
implies the unbiasedness of the estimator fˆn (x). To see that, write

    Eθ [ fˆn (x) | X ] = Eθ [ θ̂0 | X ] g0 (x) + · · · + Eθ [ θ̂k | X ] gk (x)

    = θ0 g0 (x) + · · · + θk gk (x) = f (x).

Also, the covariance matrix of θ̂ is D, and therefore the variance of fˆn (x)
can be written as

    Eθ [ ( fˆn (x) − f (x) )^2 | X ] = Eθ [ ( (θ̂0 − θ0 ) g0 (x) + · · · + (θ̂k − θk ) gk (x) )^2 | X ]

    = Σ_{ l,m=0 }^{ k } Eθ [ (θ̂l − θl )(θ̂m − θm ) | X ] gl (x) gm (x) = Σ_{ l,m=0 }^{ k } Dl,m gl (x) gm (x).  □

Lemma 7.7. For a fixed design X , the mean squared difference

    (1/n) Σ_{ i=1 }^{ n } ( fˆn (xi ) − f (xi ) )^2 = ( σ^2 /n ) χ^2_{k+1}

where χ^2_{k+1} denotes a chi-squared random variable with k + 1 degrees of
freedom. In particular, the discrete MISE is equal to

    (1/n) Σ_{ i=1 }^{ n } Eθ [ ( fˆn (xi ) − f (xi ) )^2 | X ] = σ^2 (k + 1) / n .

Proof. Applying the facts that σ^{−2} G′ G = D^{−1} , and that the matrix D is
symmetric and positive definite (therefore, D^{1/2} exists), we have the equa-
tions

    (1/n) Σ_{ i=1 }^{ n } ( fˆn (xi ) − f (xi ) )^2 = (1/n) ‖ G ( θ̂ − θ ) ‖^2

    = (1/n) ( G ( θ̂ − θ ) )′ G ( θ̂ − θ ) = (1/n) ( θ̂ − θ )′ G′ G ( θ̂ − θ )

    = (1/n) σ^2 ( θ̂ − θ )′ D^{−1} ( θ̂ − θ ) = ( σ^2 /n ) ‖ D^{−1/2} ( θ̂ − θ ) ‖^2 ,

where by ‖ · ‖ we mean the Euclidean norm in the Rn space of observations.
where by  ·  we mean the Euclidean norm in the Rn space of observations.

By Theorem 7.5, the (k + 1)-dimensional vector D−1/2 (θ̂ − θ) has inde-


pendent standard normal coordinates. The result of the proposition follows
from the definition of the chi-squared distribution. 

Note that the vector with the components fˆn (xi ) coincides with ŷ, the
projection of y on the span-space S, that is,

ŷi = fˆn (xi ), i = 1, . . . , n.

Introduce the vector r = y − ŷ. The coordinates of this vector, called


residuals, are the differences

ri = yi − ŷi = yi − fˆn (xi ), i = 1, . . . , n.

In other words, residuals are deviations of the observed responses from the
predicted ones evaluated at the design points.
Graphically, residuals can be visualized in the data space Rn . The vector
of residuals r, plotted in Figure 4, is orthogonal to the span-space S. Also,
the residuals ri ’s can be depicted on a scatter plot (see Figure 5).

Figure 5. Residuals shown on a schematic scatter plot.

In the next lemma, we obtain the distribution of the squared norm of


the residual vector r for a fixed design X .

Lemma 7.8. For a given design X , the sum of squares of the residuals satisfies

    r1^2 + · · · + rn^2 = ‖ r ‖^2 = ‖ y − ŷ ‖^2 = σ^2 χ^2_{n−k−1} ,

where χ^2_{n−k−1} denotes a chi-squared random variable with n − k − 1 degrees
of freedom.

Proof. The squared Euclidean norm of the vector of random errors admits
the partition

    ‖ ε ‖^2 = ‖ y − G θ ‖^2 = ‖ y − ŷ + ŷ − G θ ‖^2

    = ‖ y − ŷ ‖^2 + ‖ ŷ − G θ ‖^2 = ‖ r ‖^2 + ‖ ŷ − G θ ‖^2 .

Here the cross term is zero, because it is a dot product of the residual vector
r and the vector ŷ − G θ that lies in the span-space S. Moreover, these two
vectors are independent (see Exercise 7.46).

The random vector ε has the Nn (0, σ^2 In ) distribution, implying that ‖ ε ‖^2 =
σ^2 χ^2_n , where χ^2_n denotes a chi-squared random variable with n degrees of
freedom. Also, by Lemma 7.7,

    ‖ ŷ − G θ ‖^2 = Σ_{ i=1 }^{ n } ( fˆn (xi ) − f (xi ) )^2 = σ^2 χ^2_{k+1}

where χ^2_{k+1} has a chi-squared distribution with k + 1 degrees of freedom.
Taking into account that the vectors r and ŷ − Gθ are independent, it can
be shown (see Exercise 7.47) that ‖ r ‖^2 / σ^2 has a chi-squared distribution with
n − (k + 1) degrees of freedom.  □

7.4. Asymptotic Analysis of the Least-Squares Estimator


In this section we focus on describing asymptotic behavior of the least-
squares estimator θ̂ as the sample size n goes to infinity. This task is com-
plicated by the fact that θ̂ depends on the design X = {x1 , . . . , xn }. Thus,
we can expect the existence of a limiting distribution only if the design is
governed by some regularity conditions.

7.4.1. Regular Deterministic Design. Take a continuous strictly posi-
tive probability density p(x), 0 ≤ x ≤ 1, and consider the cumulative dis-
tribution function FX (x) = ∫_0^x p(t) dt. Define a sequence of regular deter-
ministic designs Xn = { xn,1 , . . . , xn,n } where xn,i is the (i/n)-th quantile
of this distribution,

(7.17)    FX (xn,i ) = i/n ,    i = 1, . . . , n.

Equivalently, the xn,i 's satisfy the recursive equations

(7.18)    ∫_{ xn,i−1 }^{ xn,i } p(x) dx = 1/n ,    i = 1, . . . , n,    xn,0 = 0.
It is important to emphasize that the distances between consecutive points
in a regular design have magnitude O(1/n) as n → ∞. Typical irregular
designs that are avoided in asymptotic analysis have data points that are

too close to each other (concentrated around one point, or even coincide),
or have big gaps between each other, or both.
For simplicity we suppress the dependence on n of the regular design
points, that is, we write xi instead of xn,i , i = 1, . . . , n.
Example 7.9. The data points that are spread equidistantly on the unit
interval, xi = i/n, i = 1, . . . , n, constitute a regular design, called uniform
design, since these points are (i/n)-th quantiles of the standard uniform
distribution. 
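The defining relation (7.17) says that a regular design is obtained by inverting the cumulative distribution function at the points i/n. The short sketch below is illustrative only; it assumes the density p(x) = 2x on [0, 1], for which FX(x) = x^2 and the quantiles are explicit.

```python
# Regular deterministic design (7.17): x_{n,i} = F_X^{-1}(i/n).
# Illustrative sketch with the assumed density p(x) = 2x on [0, 1],
# so F_X(x) = x^2 and F_X^{-1}(u) = sqrt(u).
import numpy as np

n = 10
u = np.arange(1, n + 1) / n        # the points i/n, i = 1, ..., n
x_regular = np.sqrt(u)             # quantiles of F_X(x) = x^2
print("regular design for p(x) = 2x:", x_regular)

# The uniform design of Example 7.9 corresponds to p(x) = 1 and F_X^{-1}(u) = u:
print("uniform design:              ", u)
```

Note how the points of the first design are denser near 1, where the density p puts more mass, while the spacing in both cases is of order O(1/n).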

It can be shown (see Exercise 7.48) that in the case of a regular design
corresponding to a probability density p(x), for any continuous function
g(x), the Riemann sum converges to the integral

(7.19)    (1/n) Σ_{ i=1 }^{ n } g(xi ) → ∫_0^1 g(x) p(x) dx    as n → ∞.

If the functions g0 , g1 , . . . , gk in the linear regression model (7.5) are
continuous, and the design points are regular, then the convergence in (7.19)
implies the existence of the entrywise limits of the matrix (1/n)D^{−1} as
n → ∞, that is, for any l and m such that 0 ≤ l ≤ m ≤ k,

    lim_{ n→∞ } (1/n) D^{−1}_{l, m} = lim_{ n→∞ } ( σ^{−2} /n ) ( G′ G )_{l, m}

    = lim_{ n→∞ } ( σ^{−2} /n ) ( gl (x1 ) gm (x1 ) + · · · + gl (xn ) gm (xn ) )

(7.20)    = σ^{−2} ∫_0^1 gl (x) gm (x) p(x) dx.

Denote by D^{−1}_∞ the matrix with the elements σ^{−2} ∫_0^1 gl (x) gm (x) p(x) dx.
Assume that this matrix is positive definite. Then its inverse D∞ , called the
limiting covariance matrix, exists, and the convergence n D → D∞ takes place.
Example 7.10. Consider a polynomial regression model with the uniform
design on [0, 1], that is, the regular design with the constant probability
density p(x) = 1, 0 ≤ x ≤ 1. The matrix D^{−1}_∞ has the entries

(7.21)    σ^{−2} ∫_0^1 x^l x^m dx = σ^{−2} / ( 1 + l + m ) ,    0 ≤ l, m ≤ k.

This is a positive definite matrix, and hence the limiting covariance matrix
D∞ is well defined (see Exercise 7.49).  □
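Up to the scalar factor σ^{−2}, the matrix in Example 7.10 has the Hilbert-type entries 1/(1 + l + m), so its positive definiteness and its inverse can be checked numerically. The sketch below is illustrative only, assuming k = 3 and σ = 1.

```python
# Limiting covariance matrix for polynomial regression under the uniform design (Example 7.10).
# Illustrative sketch for the assumed values k = 3, sigma = 1, so D_inf^{-1} has entries 1/(1+l+m).
import numpy as np

k, sigma = 3, 1.0
l, m = np.meshgrid(np.arange(k + 1), np.arange(k + 1), indexing="ij")
D_inf_inv = (1.0 / sigma**2) / (1.0 + l + m)

eig = np.linalg.eigvalsh(D_inf_inv)
print("smallest eigenvalue (positive => positive definite):", eig.min())

D_inf = np.linalg.inv(D_inf_inv)      # the limiting covariance matrix D_inf
print("D_inf =\n", D_inf)
```

The smallest eigenvalue is strictly positive, confirming that D∞ is well defined for this example.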

We are ready to summarize our findings in the following theorem.



Theorem 7.11. If X is a regular deterministic design, and D∞ exists, then

    √n ( θ̂ − θ ) → Nk+1 ( 0, D∞ )    as n → ∞.

Next we study the limiting behavior of the least-squares estimator fˆn


defined by (7.14). The lemma below shows that in the mean squared sense,

fˆn converges pointwise to the true regression function f at the rate O(1/√n)
as n → ∞. The proof of this lemma is assigned as an exercise (see Exercise
7.50).

Lemma 7.12. Suppose X is a regular deterministic design such that D∞
exists. Then at any fixed point x ∈ [0, 1], the estimator fˆn of the regression
function f is unbiased and its normalized quadratic risk satisfies the limiting
equation

    lim_{ n→∞ } Eθ [ ( √n ( fˆn (x) − f (x) ) )^2 ] = Σ_{ l, m = 0 }^{ k } (D∞ )l, m gl (x) gm (x),

where (D∞ )l, m are the elements of the limiting covariance matrix D∞ .

7.4.2. Regular Random Design. We call a random design regular, if


its points are independent with a common continuous and strictly positive
probability density function p(x), x ∈ [0, 1].
Suppose the functions g0 , . . . , gk are continuous on [0, 1]. By the Law of
Large Numbers, for any element of the matrix D−1 = σ 2 G G, we have that
with probability 1 (with respect to the distribution of the random design),
σ2 σ2  
lim ( G G )l, m = lim gl (x1 ) gm (x1 ) + · · · + gl (xn ) gm (xn )
n→∞ n n→∞ n

1
2
(7.22) = σ gl (x) gm (x) p(x) dx.
0

Again, as in the case of a regular deterministic design, we assume that the
matrix D^{−1}_∞ with the elements σ^{−2} ∫_0^1 gl (x) gm (x) p(x) dx is positive definite,
so that its inverse matrix D∞ exists.

The essential difference between the random and deterministic designs is


that even in the case of a regular random design, for any given n, the matrix
G′ G can be degenerate with a positive probability (see Exercise 7.51). If
it happens, then for the sake of definiteness, we put θ̂ = 0. Fortunately, if
the functions g0 , . . . , gk are continuous in [0, 1], then the probability of this
“non-existence” is exponentially small in n as n → ∞. For the proofs of the
following lemma and theorem refer to Exercises 7.52 and 7.53.

Lemma 7.13. Assume that |g0 |, . . . , |gk | ≤ C0 , and that X = {x1 , . . . , xn }
is a regular random design. Then for any n, for however small δ > 0, and
for all l and m such that 0 ≤ l, m ≤ k, the following inequality holds:

    P( | (1/n) Σ_{ i=1 }^{ n } gl (xi ) gm (xi ) − ∫_0^1 gl (x) gm (x) p(x) dx | > δ ) ≤ 2 exp{ − δ^2 n / ( 2C0^4 ) } .

Assume that for a regular random design X , the estimator θ̂ is properly
defined with probability 1. Then, as the next theorem shows, the distribu-
tion of the normalized estimator √n ( θ̂ − θ ) is asymptotically normal.

Theorem 7.14. If X is a regular random design and D∞ exists, then as
n → ∞, √n ( θ̂ − θ ) converges in distribution to a Nk+1 (0, D∞ ) random
variable.
Remark 7.15. An important conclusion is that the parametric least-squares
estimator fˆn is unbiased, and its typical rate of convergence under various
norms and under regular designs is equal to O(1/√n) as n → ∞.  □

Exercises

Exercise 7.43. Consider the observations (xi , yi ) in a simple linear regres-


sion model,
yi = θ0 + θ1 xi + εi , i = 1, . . . , n,
where the εi ’s are independent N (0 , σ 2 ) random variables. Write down the
system of normal equations (7.11) and solve it explicitly.
Exercise 7.44. Show that in a simple linear regression model (see Exercise
7.43), the minimum of the variance Varθ [ fˆn (x) | X ] in Lemma 7.6 is attained
at x = x̄ = Σ_{ i=1 }^{ n } xi /n.
Exercise 7.45. (i) Prove that in a simple linear regression model (see
Exercise 7.43), the sum of the residuals is equal to zero, that is, Σ_{ i=1 }^{ n } ri =
Σ_{ i=1 }^{ n } ( yi − ŷi ) = 0.
(ii) Consider a simple linear regression through the origin,
yi = θ1 xi + εi , i = 1, . . . , n
where the εi ’s are independent N (0, σ 2 ) random variables. Show by giving
an example that the sum of residuals is not necessarily equal to zero.

Exercise 7.46. Show that (i) the vector of residuals r has a multivariate
normal distribution with mean zero and covariance matrix σ^2 ( In − H ),
where H = G ( G′ G )^{−1} G′ is called the hat matrix because of the identity
ŷ = Hy.
(ii) Argue that the vectors r and ŷ − G θ are independent.

Exercise 7.47. Let Z = X + Y where X and Y are independent. Suppose


Z and X have chi-squared distributions with n and m degrees of freedom,
respectively, where m < n. Prove that Y also has a chi-squared distribution
with n − m degrees of freedom.

Exercise 7.48. Show the convergence of the Riemann sum in (7.19).

Exercise 7.49. Show that the matrix with the elements given by (7.21) is
invertible.

Exercise 7.50. Prove Lemma 7.12.

Exercise 7.51. Let k = 1, and let g0 = 1; g1 (x) = x if 0 ≤ x ≤ 1/2, and


g1 (x) = 1/2 if 1/2 < x ≤ 1. Assume that X is the uniform random design
governed by the density p(x) = 1. Show that the system of normal equations
does not have a unique solution with probability 1/2n .

Exercise 7.52. Prove Lemma 7.13.

Exercise 7.53. Prove Theorem 7.14.

Exercise 7.54. For the regression function f = θ0 g0 + · · · + θk gk , show
that the conditional expectation of the squared L2 -norm of the difference
fˆn − f , given the design X , admits the upper bound

    Eθ [ ‖ fˆn − f ‖2^2 | X ] ≤ tr(D) ‖ g ‖2^2

where the trace tr(D) = Eθ [ Σ_{ i=0 }^{ k } ( θ̂i − θi )^2 | X ] is the sum of the diagonal
elements of the covariance matrix D, and

    ‖ g ‖2^2 = Σ_{ i=0 }^{ k } ‖ gi ‖2^2 = Σ_{ i=0 }^{ k } ∫_0^1 gi^2 (x) dx

is the squared L2 -norm of the vector g = ( g0 , . . . , gk )′ .
Part 2

Nonparametric
Regression
Chapter 8

Estimation in
Nonparametric
Regression

8.1. Setup and Notations


In a nonparametric regression model the response variable Y and the ex-
planatory variable X are related by the same regression equation (7.1) as in
a parametric regression model,

(8.1) Y = f (X) + ε

with the random error ε ∼ N (0, σ 2 ). However, unlike that in the parametric
regression model, here the algebraic form of the regression function f is
assumed unknown and must be evaluated from the data. The goal of the
nonparametric regression analysis is to estimate the function f as a curve,
rather than to estimate parameters of a guessed function.
A set of n pairs of observations (x1 , y1 ), . . . , (xn , yn ) satisfy the relation

(8.2) yi = f (xi ) + εi , i = 1, . . . , n,

where the εi ’s are independent N (0, σ 2 ) random errors. For simplicity we


assume that the design X = {x1 , . . . , xn } is concentrated on [0, 1].
In nonparametric regression analysis, some assumptions are made a pri-
ori on the smoothness of the regression function f . Let β ≥ 1 be an integer.
We assume that f belongs to a Hölder class of functions of smoothness β,
denoted by Θ(β, L, L1 ). That is, we assume that (i) its derivative f (β−1) of


order β − 1 satisfies the Lipschitz condition with a given constant L,


| f (β−1) (x2 ) − f (β−1) (x1 ) | ≤ L | x2 − x1 |, x1 , x2 ∈ [0, 1],
and (ii) there exists a constant L1 > 0 such that

    max_{ 0≤x≤1 } | f (x) | ≤ L1 .

Example 8.1. If β = 1, the class Θ(1, L, L1 ) is a set of bounded Lipschitz


functions. Recall that a Lipschitz function f satisfies the inequality
| f (x2 ) − f (x1 ) | ≤ L | x2 − x1 |
where L is a constant independent of x1 and x2 . 

Sometimes we write Θ(β), suppressing the constants L and L1 in the


notation of the Hölder class Θ(β, L, L1 ).
Denote by fˆn the nonparametric estimator of the regression function f .
Since f is a function of x ∈ [0, 1], so should be the estimator. The latter,
however, also depends on the data points. This dependence is frequently
omitted in the notation,
 
fˆn (x) = fˆn x ; (x1 , y1 ), . . . , (xn , yn ) , 0 ≤ x ≤ 1.

To measure how close fˆn is to f , we consider the same loss functions as


in Chapter 7, the quadratic loss function computed at a fixed point x ∈ [0, 1]
specified in (7.15), and the mean squared difference over the design points
given by (7.16). In addition, to illustrate particular effects in nonparametric
estimation, we use the sup-norm loss function
 
w(fˆn − f ) =  fˆn − f ∞ = sup  fˆn (x) − f (x) .
0≤x≤1

Note that in the nonparametric case, the loss functions are, in fact,
functionals since they depend of f . For simplicity, we will continue calling
them functions. We denote the risk function by

    Rn (fˆn , f ) = Ef [ w( fˆn − f ) ]

where the subscript f in the expectation refers to a fixed regression function
f . If the design X is random, we use the conditional expectation Ef [ · | X ]
to emphasize averaging over the distribution of the random error ε.

When working with the difference fˆn − f , it is technically more conve-
nient to consider separately the bias bn (x) = Ef [ fˆn (x) ] − f (x), and the
stochastic part ξn (x) = fˆn (x) − Ef [ fˆn (x) ]. Then the MSE or the discrete
MISE is split into a sum (see Exercise 8.55),

(8.3)    Rn (fˆn , f ) = Ef [ w( fˆn − f ) ] = Ef [ w(ξn ) ] + w(bn ).

For the sup-norm loss function, the triangle inequality applies,

    Rn (fˆn , f ) = Ef [ ‖ fˆn − f ‖∞ ] ≤ Ef [ ‖ ξn ‖∞ ] + ‖ bn ‖∞ .

To deal with random designs, we consider the conditional bias and sto-
chastic part of an estimator fˆn , given the design X ,

    bn (x, X ) = Ef [ fˆn (x) | X ] − f (x)

and

    ξn (x, X ) = fˆn (x) − Ef [ fˆn (x) | X ] .

8.2. Asymptotically Minimax Rate of Convergence. Definition
We want to estimate the regression function in the most efficient way. As
a criterion of optimality we choose the asymptotically minimax rate of con-
vergence of the estimator.
Consider a deterministic sequence of positive numbers ψn → 0 as n →
∞. Introduce the maximum normalized risk of an estimator fˆn with respect
to a loss function w by

(8.4)    rn (fˆn , w, ψn ) = sup_{ f ∈ Θ(β) } Ef [ w( ( fˆn − f ) / ψn ) ] .

A sequence of positive numbers ψn is called an asymptotically minimax
rate of convergence if there exist two positive constants r_* and r^* such that
for any estimator fˆn , the maximum normalized risk rn (fˆn , w, ψn ) is bounded
from above and below,

(8.5)    r_* ≤ lim inf_{ n→∞ } rn (fˆn , w, ψn ) ≤ lim sup_{ n→∞ } rn (fˆn , w, ψn ) ≤ r^* .

This very formal definition has a transparent interpretation. It implies
that for any estimator fˆn and for all n large enough, the maximum of the
risk is bounded from below,

(8.6)    sup_{ f ∈ Θ(β) } Ef [ w( ( fˆn − f ) / ψn ) ] ≥ r_* − ε,

where ε is an arbitrarily small positive number. On the other hand, there
exists an estimator fn∗ , called the asymptotically minimax estimator, the
maximum risk of which is bounded from above,

(8.7)    sup_{ f ∈ Θ(β) } Ef [ w( ( fn∗ − f ) / ψn ) ] ≤ r^* + ε.

Note that fn∗ is not a single estimator, rather a sequence of estimators defined
for all sufficiently large n.

It is worth mentioning that the asymptotically minimax rate of conver-


gence ψn is not uniquely defined but admits any bounded and separated
away from zero multiplier. As we have shown in Chapter 7, a typical rate of

convergence in the parametric regression model is O(1/√n). In nonparametric
regression, on the other hand, the rates depend on a particular loss func-
tion and on the smoothness parameter β of the Hölder class of regression
functions. We study these rates in the next chapters.

8.3. Linear Estimator


8.3.1. Definition. An estimator fˆn is called a linear estimator of f , if
for any x ∈ [0, 1], there exist weights υn, i (x) that may also depend on the
design points, υn, i (x) = υn, i (x, X ), i = 1, . . . , n, such that

(8.8)    fˆn (x) = Σ_{ i=1 }^{ n } υn, i (x) yi .

Note that the linear estimator fˆn is a linear function of the response values
y1 , . . . , yn . The weight υn, i (x) determines the influence of the observation
yi on the estimator fˆn (x) at the point x.

An advantage of the linear estimator (8.8) is that for a given design X ,
the conditional bias and variance are easily computable (see Exercise 8.56),

(8.9)    bn (x, X ) = Σ_{ i=1 }^{ n } υn, i (x) f (xi ) − f (x)

and

(8.10)    Ef [ ξn^2 (x, X ) | X ] = σ^2 Σ_{ i=1 }^{ n } υn, i^2 (x).

These formulas are useful when either the design X is deterministic or


integration over the distribution of a random design is not too difficult. Since
the weights υn, i (x) may depend on the design points in a very intricate way,
in general, averaging over the distribution of x1 , . . . , xn is a complicated
task.
The linear estimator (8.8) is not guaranteed to be unbiased. Even in
the simplest case of a constant regression function f (x) = θ0 , the linear
estimator is unbiased if and only if the weights sum up to one,

    bn (x, X ) = Σ_{ i=1 }^{ n } υn,i (x) θ0 − θ0 = θ0 ( Σ_{ i=1 }^{ n } υn,i (x) − 1 ) = 0.

For a linear regression function f (x) = θ0 + θ1 x, the linear estimator is
unbiased if and only if the following identity holds:

    ( Σ_{ i=1 }^{ n } υn,i (x) − 1 ) θ0 + ( Σ_{ i=1 }^{ n } υn,i (x) xi − x ) θ1 = 0 ,

which under the condition that Σ_{ i=1 }^{ n } υn,i (x) = 1 is tantamount to the
identity

    Σ_{ i=1 }^{ n } υn,i (x) xi = x,    uniformly in x ∈ [0, 1].

If for any x ∈ [0, 1], the linear estimator (8.8) depends on all the design
points x1 , . . . , xn , it is called a global linear estimator of the regression func-
tion. We study global estimators later in this book.

An estimator (8.8) is called a local linear estimator of the regression


function if the weights υn, i (x) differ from zero only for those i’s for which the
design points xi ’s belong to a small neighborhood of x, that is, | xi −x | ≤ hn ,
where hn is called a bandwidth. We always assume that
(8.11) hn > 0, hn → 0 , and nhn → ∞ as n → ∞.
In what follows we consider only designs in which for any x ∈ [0, 1], the
number of the design points in the hn -neighborhood of x has the magnitude
O(nhn ) as n → ∞.

8.3.2. The Nadaraya-Watson Kernel Estimator. Consider a smooth


or piecewise smooth function K = K(u), u ∈ R. Assume that the support
of K is the interval [−1, 1], that is, K(u) = 0 if |u| > 1. The function K is
called a kernel function or simply, a kernel.
Example 8.2. Some classical kernel functions frequently used in practice
are:
(i) uniform, K(u) = (1/2) I( |u| ≤ 1 ),
(ii) triangular, K(u) = ( 1 − |u| ) I( |u| ≤ 1 ),
(iii) bi-square, K(u) = (15/16) ( 1 − u2 )2 I( |u| ≤ 1 ),
(iv) the Epanechnikov kernel, K(u) = (3/4) ( 1 − u2 ) I( |u| ≤ 1 ). 
Remark 8.3. Typically, kernels are normalized in such a way that they
integrate to one. It can be shown (see Exercise 8.57) that all the kernels
introduced above are normalized in such a way. 
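The normalization claim of Remark 8.3 is easy to verify numerically. The sketch below is illustrative only: it defines the four kernels of Example 8.2 and approximates their integrals over [−1, 1] by a Riemann sum on a fine grid.

```python
# Numerical check that the kernels of Example 8.2 integrate to one (Remark 8.3).
import numpy as np

kernels = {
    "uniform":      lambda u: 0.5 * (np.abs(u) <= 1),
    "triangular":   lambda u: (1 - np.abs(u)) * (np.abs(u) <= 1),
    "bi-square":    lambda u: (15 / 16) * (1 - u**2) ** 2 * (np.abs(u) <= 1),
    "Epanechnikov": lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1),
}

u = np.linspace(-1.0, 1.0, 200_001)
du = u[1] - u[0]
for name, K in kernels.items():
    print(name, np.sum(K(u)) * du)   # each value should be close to 1
```

Each printed value is close to 1, confirming the normalization.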

For a chosen kernel and a bandwidth, define the weights υn, i (x) by

(8.12)    υn, i (x) = K( ( xi − x ) / hn ) / Σ_{ j=1 }^{ n } K( ( xj − x ) / hn ) .

The Nadaraya-Watson kernel estimator fˆn of the regression function f
at a given point x ∈ [0, 1] is the linear estimator with the weights defined
by (8.12),

(8.13)    fˆn (x) = Σ_{ i=1 }^{ n } yi K( ( xi − x ) / hn ) / Σ_{ j=1 }^{ n } K( ( xj − x ) / hn ) .

Note that the Nadaraya-Watson estimator is an example of a local linear


estimator, since outside of the interval [x − hn , x + hn ], the weights are equal
to zero.
Example 8.4. Consider the uniform kernel defined in Example 8.2 (i). Let
N (x, hn ) denote the number of the design points in the hn -neighborhood of
x. Then the weights in (8.12) have the form

\upsilon_{n,i}(x) = \frac{1}{N(x, h_n)} \, I\big( x - h_n < x_i < x + h_n \big).

Thus, in this case, the Nadaraya-Watson estimator is the average of
the observed responses that correspond to the design points in the hn -
neighborhood of x,

\hat{f}_n(x) = \frac{1}{N(x, h_n)} \sum_{i=1}^{n} y_i \, I\big( x - h_n < x_i < x + h_n \big). 
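The computation in Example 8.4 generalizes directly to any kernel. The following minimal sketch is our illustration of the Nadaraya-Watson estimator (8.13); the function names, the simulated data, and the choice of the Epanechnikov kernel are ours and serve only as an example.

import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = (3/4)(1 - u^2) on |u| <= 1, zero outside."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def nadaraya_watson(x0, x, y, h, kernel=epanechnikov):
    """Nadaraya-Watson estimate of f(x0), as in (8.13): a weighted average
    of the responses y_i with kernel weights centered at x0."""
    w = kernel((x - x0) / h)
    s = w.sum()
    return np.dot(w, y) / s if s > 0 else np.nan   # no design points in the window

# Illustration on simulated data y_i = f(x_i) + eps_i
rng = np.random.default_rng(0)
n = 500
x = np.sort(rng.uniform(0, 1, n))
f = lambda t: np.sin(2 * np.pi * t)
y = f(x) + 0.2 * rng.standard_normal(n)

h = n ** (-1 / 5)                        # bandwidth of the optimal order for beta = 2
grid = np.linspace(0.1, 0.9, 9)
fhat = np.array([nadaraya_watson(t, x, y, h) for t in grid])
print(np.round(fhat - f(grid), 3))       # pointwise errors shrink as n grows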

8.4. Smoothing Kernel Estimator


In Section 8.3, we explained the challenge to control the conditional bias of
a linear estimator even in the case of a linear regression function. The linear
regression function is important as the first step because, as the following
lemma shows, any regression function from a Hölder class is essentially a
polynomial. The proof of this auxiliary lemma is postponed until the end
of this section.
Lemma 8.5. For any function f ∈ Θ(β, L, L1 ), the following Taylor expan-
sion holds:

(8.14)    f(x_i) = \sum_{m=0}^{\beta - 1} \frac{f^{(m)}(x)}{m!} (x_i - x)^m + \rho(x_i, x), \quad 0 \le x, x_i \le 1,

where f^{(m)} denotes the m-th derivative of f . Also, for any xi and x such
that |xi − x| ≤ hn , the remainder term ρ(xi , x) satisfies the inequality

(8.15)    | \rho(x_i, x) | \le \frac{L h_n^{\beta}}{(\beta - 1)!}.
It turns out that for linear estimators, regular random designs have an
advantage over deterministic ones. As we demonstrate in this section, when
computing the risk, averaging over the distribution of a random design helps
to eliminate a significant portion of the bias.
Next we introduce a linear estimator that guarantees the zero bias for
any polynomial regression function up to degree β − 1 (see Exercise 8.59).
To ease the presentation, we assume that a regular random design is uniform
with the probability density p(x) = 1, x ∈ [0, 1]. The extension to a more
general case is given in Remark 8.6.
A smoothing kernel estimator fˆn (x) of degree β − 1 is given by the formula

(8.16)    \hat{f}_n(x) = \frac{1}{n h_n} \sum_{i=1}^{n} y_i K\Big( \frac{x_i - x}{h_n} \Big), \quad 0 < x < 1,

where the smoothing kernel K = K(u), |u| ≤ 1, is bounded, piecewise
continuous, and satisfies the normalization and orthogonality conditions

(8.17)    \int_{-1}^{1} K(u)\, du = 1 \quad \text{and} \quad \int_{-1}^{1} u^m K(u)\, du = 0 \ \ \text{for } m = 1, \ldots, \beta - 1.

Note that the smoothing kernel is orthogonal to all monomials up to degree
β − 1.

Remark 8.6. For a general density p(x) of the design points, the smoothing
kernel estimator is defined as

(8.18)    \hat{f}_n(x) = \frac{1}{n h_n} \sum_{i=1}^{n} \frac{y_i}{p(x_i)} K\Big( \frac{x_i - x}{h_n} \Big)

where the kernel K(u) satisfies the same conditions as in (8.17). 
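For a uniform design (p(x) = 1), formula (8.16) is one line of code. The sketch below is our illustration for β = 2, in which case any bounded symmetric kernel that integrates to one, such as the Epanechnikov kernel, satisfies the conditions (8.17); the data and names are hypothetical.

import numpy as np

def smoothing_kernel_estimate(x0, x, y, h, K):
    """Smoothing kernel estimator (8.16) at a point x0 in (0, 1),
    assuming a uniform random design on [0, 1] (density p = 1)."""
    n = len(x)
    return np.sum(y * K((x - x0) / h)) / (n * h)

# Epanechnikov kernel: integrates to one and is orthogonal to u,
# hence a smoothing kernel of degree 1 (suitable for beta = 2).
K = lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0, 1, n)                       # uniform design, p(x) = 1
f = lambda t: 1 + t + np.cos(2 * np.pi * t)
y = f(x) + 0.3 * rng.standard_normal(n)

h = n ** (-1 / 5)                              # h*_n = n^{-1/(2*beta+1)} with beta = 2
for x0 in (0.25, 0.5, 0.75):
    print(x0, round(smoothing_kernel_estimate(x0, x, y, h, K), 3), round(f(x0), 3))

Unlike the Nadaraya-Watson estimator, this estimator divides by nh rather than by the sum of the kernel weights; the random design itself supplies the averaging over p(x).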

Remark 8.7. A smoothing kernel estimator (8.16) requires that x lies
strictly inside the unit interval. In fact, the definition of fˆn (x) is valid
for any x such that hn ≤ x ≤ 1 − hn . On the other hand, a linear estimator
(8.8) is defined for any x ∈ [0, 1], including the endpoints. Why does the
smoothing kernel estimator fail if x coincides with either of the endpoints?
If, for instance, x = 0, then for any symmetric kernel K(u), the expected
value

E_f\Big[ \frac{1}{h_n} K\Big( \frac{x_i}{h_n} \Big) \Big] = \frac{1}{h_n} \int_0^{h_n} K\Big( \frac{x_i}{h_n} \Big)\, dx_i = \int_0^{1} K(u)\, du = \frac{1}{2}.

For example, in the situation when the regression function is identically
equal to 1, the responses are yi = 1 + εi , where εi are N (0, σ 2 ) random
variables independent of xi ’s for all i = 1, . . . , n. The average value of the
smoothing kernel estimator at zero is

E_f\big[ \hat{f}_n(0) \big] = E_f\Big[ \frac{1}{n h_n} \sum_{i=1}^{n} (1 + \varepsilon_i) K\Big( \frac{x_i}{h_n} \Big) \Big] = \frac{1}{2},

which is certainly not satisfactory.
A remedy for the endpoints is to define a one-sided kernel to preserve
the normalization and orthogonality conditions (8.17). In Exercises 8.61 and
8.62 we formulate some examples related to this topic. 

The next lemma gives upper bounds for the bias and variance of the
smoothing kernel estimator (8.16). The proof of the lemma can be found at
the end of this section.
Lemma 8.8. For any regression function f ∈ Θ(β, L, L1 ), at any point
x ∈ (0, 1), the bias and variance of the smoothing kernel estimator (8.16)
admit the upper bounds for all large enough n,

| b_n(x) | \le A_b h_n^{\beta} \quad \text{and} \quad \mathrm{Var}_f\big[ \hat{f}_n(x) \big] \le \frac{A_v}{n h_n}

with the constants

A_b = \frac{L \|K\|_1}{(\beta - 1)!} \quad \text{and} \quad A_v = (L_1^2 + \sigma^2) \|K\|_2^2,

where \|K\|_1 = \int_{-1}^{1} |K(u)|\, du and \|K\|_2^2 = \int_{-1}^{1} K^2(u)\, du.
Remark 8.9. The above lemma clearly indicates that as hn increases, the
upper bound for the bias increases, while that for the variance decreases. 

Applying this lemma, we can bound the mean squared risk of fˆn (x) at
a point x ∈ (0, 1) by

(8.19)    E_f\big[ ( \hat{f}_n(x) - f(x) )^2 \big] = b_n^2(x) + \mathrm{Var}_f\big[ \hat{f}_n(x) \big] \le A_b^2 h_n^{2\beta} + \frac{A_v}{n h_n}.

It is easily seen that the value of hn that minimizes the right-hand side
of (8.19) satisfies the equation

(8.20)    h_n^{2\beta} = \frac{A}{n h_n}
with a constant factor A independent of n. This equation is called the
balance equation since it reflects the idea of balancing the squared bias and
variance terms.
Next, we neglect the constant in the balance equation (8.20), and label
the respective optimal bandwidth by a superscript (*). It is a solution of the
equation

h_n^{2\beta} = \frac{1}{n h_n},

and is equal to

h_n^* = n^{-1/(2\beta + 1)}.
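Indeed, h^{2β} = (nh)^{-1} is equivalent to h^{2β+1} = n^{-1}. As a quick numerical sanity check (our illustration, with all constants set to one), the squared-bias and variance terms of (8.19) coincide in order of magnitude at this bandwidth:

import numpy as np

beta, n = 2, 10_000
h_star = n ** (-1 / (2 * beta + 1))
squared_bias = h_star ** (2 * beta)        # order of A_b^2 h^(2*beta)
variance = 1 / (n * h_star)                # order of A_v / (n*h)
print(h_star, squared_bias, variance)      # the last two are both n^(-2*beta/(2*beta+1))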
Denote by fn∗ (x) the smoothing kernel estimator (8.16) corresponding to the
optimal bandwidth h∗n ,

(8.21)    f_n^*(x) = \frac{1}{n h_n^*} \sum_{i=1}^{n} y_i K\Big( \frac{x_i - x}{h_n^*} \Big).

We call this estimator the optimal smoothing kernel estimator.

For the convenience of reference, we formulate the proposition below.
Its proof follows directly from the expression (8.19), and the definition of
the estimator fn∗ (x).

Proposition 8.10. For all large enough n, and any f ∈ Θ(β), the quadratic
risk of the optimal smoothing kernel estimator (8.21) at a given point x, 0 <
x < 1, is bounded from above by

E_f\big[ ( f_n^*(x) - f(x) )^2 \big] \le (A_b^2 + A_v)\, n^{-2\beta/(2\beta+1)}.

Remark 8.11. Suppose the loss function is the absolute difference at a
given point x ∈ (0, 1). Then the supremum over f ∈ Θ(β) of the risk of the
estimator fn∗ (x) is bounded from above by

\sup_{f \in \Theta(\beta)} E_f | f_n^*(x) - f(x) | \le (A_b^2 + A_v)^{1/2}\, n^{-\beta/(2\beta+1)}.

This follows immediately from Proposition 8.10 and the Cauchy-Schwarz
inequality. 

Finally, we give the proofs of two technical lemmas stated in this section.
Proof of Lemma 8.5. We need to prove that the bound (8.15) for the
remainder term is valid. For β = 1, the bound follows from the definition of
the Lipschitz class of functions Θ(1, L, L1 ),
| ρ(xi , x) | = | f (xi ) − f (x) | ≤ L|xi − x| ≤ Lhn .
If β ≥ 2, then the Taylor expansion with the Lagrange remainder term has
the form

β−2
f (m) (x) f (β−1) (x∗ )
(8.22) f (xi ) = (xi − x)m + (xi − x)β−1
m! (β − 1)!
m=0

where x∗ is an intermediate point between x and xi , so that | x∗ − x | ≤ hn .


This remainder can be transformed into
f (β−1) (x∗ ) f (β−1) (x)
(xi − x)β−1 = (xi − x)β−1 + ρ(xi , x)
(β − 1)! (β − 1)!
where the new remainder term ρ(xi , x), satisfies the inequality for any xi
and x such that |xi − x| ≤ hn ,
| f (β−1) (x∗ ) − f (β−1) (x) |
| ρ(xi , x) | = | xi − x|β−1
(β − 1)!
L|x∗ − x| Lhn Lhβn
≤ |xi − x|β−1 ≤ hβ−1 = .
(β − 1)! (β − 1)! n (β − 1)!
In the above, the definition of the Hölder class Θ(β, L, L1 ) has been applied.

Proof of Lemma 8.8. Using the definition of the bias and the regression
equation yi = f (xi ) + εi , we write
1  n  x − x 
i
bn (x) = Ef yi K − f (x)
nhn hn
i=1

1 
n  x − x 
i
(8.23) = Ef (f (xi ) + εi ) K − f (x).
nhn hn
i=1
Now since εi has mean zero and is independent of xi ,
 n  x − x 
i
Ef εi K = 0.
hn
i=1
Also, by the normalization condition,
  x − x  x+hn  1
1 i 1 xi − x 
Ef K = K dxi = K(u) du = 1.
hn hn hn x−hn hn −1
Consequently, continuing from (8.23), we can write
1 n
   xi − x  
(8.24) bn (x) = Ef f (xi ) − f (x) K .
nhn hn
i=1
Substituting Taylor’s expansion (8.14) of the function f (xi ) into (8.24), we
get that for any β > 1,

1      f (m) (x)(xi − x)m   x − x 


n β−1
i 
|bn (x)| =  Ef + ρ(xi , x) K 
nhn m! hn
i=1 m=1

1   f (m) (x) (x1 − x)m  x1 − x 
β−1 x+hn
≤  K dx1
hn m! hn
m = 1 x−hn
x+hn  
1  x1 − x  
+ max |ρ(z, x)| K  dx1 .
hn z:|z−x|≤hn x−hn hn
In the above, we replaced xi by x1 due to the independence of the design
points. If β = 1, we agree to define the sum over m as zero. For any β > 1,
this sum equals zero as well, which can be seen from the orthogonality
conditions. For m = 1, . . . , β − 1,
x+hn x −x 1
1
(x1 − x) K
m m+1
dx1 = hn um K(u) du = 0.
x−hn h n −1

Thus, using the inequality (8.15) for the remainder term ρ(xi , x), we obtain
that for any β ≥ 1, the absolute value of the bias is bounded by
x+hn  
1  x1 − x  
|bn (x)| ≤ max |ρ(z, x)| K  dx1
hn z:|z−x|≤hn x−hn hn

Lhβn 1
LK1 hβn
≤ | K(u) | du = = Ab hβn .
(β − 1)! −1 (β − 1)!

Further, to find a bound for the variance of fˆn (x), we use the indepen-
dence of the data points to write
  1 n  x − x 
i
Varf fˆn (x) = Varf yi K
nhn hn
i=1

1 
n   x − x 
i
= Var f yi K .
(nhn )2 hn
i=1
Now we bound the variance by the second moment, and plug in the regres-
sion equation yi = f (xi ) + εi ,

n   
1 2 xi − x
≤ Ef yi
2
K
(nhn )2 hn
i=1

1 
n  2 2  xi − x  
= Ef f (x i ) + ε i K
(nhn )2 hn
i=1

1 
n   2  xi − x  
= Ef f 2
(x i ) + ε 2
i K .
(nhn )2 hn
i=1
Here the cross term disappears because of independence of εi and xi , and
the fact that the expected
 2 value of εi is zero. Finally, using the facts that
|f (xi )| ≤ L1 and Ef εi = σ , we find
2


1  2  x+hn 2  x1 − x 
≤ n L 1 + σ 2
K dx1
(nhn )2 x−hn hn

1  2  1
1  2  Av
= L1 + σ 2 K 2 (u) du = L1 + σ 2 K22 = . 
nhn −1 nhn nhn
Exercises

Exercise 8.55. Prove (8.3) for: (i) the quadratic loss at a point

w( \hat{f}_n - f ) = \big( \hat{f}_n(x) - f(x) \big)^2,

and (ii) the mean squared difference

w( \hat{f}_n - f ) = \frac{1}{n} \sum_{i=1}^{n} \big( \hat{f}_n(x_i) - f(x_i) \big)^2.

Exercise 8.56. Prove (8.9) and (8.10).

Exercise 8.57. Show that the kernels introduced in Example 8.2 integrate
to one.

Exercise 8.58. Consider the Nadaraya-Watson estimator defined by (8.13).


Show that conditional on the design X , its bias
(i) is equal to zero, for any constant regression function f (x) = θ0 ,
(ii) does not exceed L hn in absolute value, for any regression function f ∈
Θ(1, L, L1 ).

Exercise 8.59. Prove that the smoothing kernel estimator (8.16) is unbi-
ased if the regression function f is a polynomial up to order β − 1.

Exercise 8.60. Find the normalizing constant C such that the tri-cube
kernel function
K(u) = C( 1 − |u|3 )3 I( |u| ≤ 1 )
integrates to one. What is its degree? Hint: Use (8.17).

Exercise 8.61. To define a smoothing kernel estimator at either endpoint


of the unit interval, we can use formula (8.16), with K(u) being a one-sided
kernel function (see Remark 8.7).
(i) Show that to estimate the regression function at x = 0, the kernel
K(u) = 4 − 6u, 0 ≤ u ≤ 1,
may be applied, that satisfies the normalization and orthogonality conditions
1 1
K(u) du = 1 and uK(u) du = 0.
0 0

(ii) Show that at x = 1, the kernel


K(u) = 4 + 6u, −1 ≤ u ≤ 0,
may be used, which satisfies the normalization and orthogonality conditions


0 0
K(u) du = 1 and uK(u) du = 0.
−1 −1

Exercise 8.62. Refer to Exercise 8.61. We can apply a one-sided smoothing


kernel to estimate the regression function f at x where 0 ≤ x ≤ hn . For
example, we can take K(u) = 4 − 6u, 0 ≤ u ≤ 1. However, this kernel
function does not use the observations located between 0 and x.
To deal with this drawback, we can introduce a family of smoothing
kernels Kθ (u) that utilize all the observations to estimate the regression
function for any x such that 0 ≤ x ≤ hn .
(i) Let x = xθ = θhn , 0 ≤ θ ≤ 1. Find a family of smoothing kernels
Kθ (u) with the support [−θ, 1], satisfying the normalization and orthogonal
conditions
1 1
Kθ (u) du = 1 and uKθ (u) du = 0 .
−θ −θ
Hint: Search for Kθ (u) in the class of linear functions.
(ii) Let x = xθ = 1 − θhn , 0 ≤ θ ≤ 1. Show that the family of smoothing
kernels Kθ (−u), −1 ≤ u ≤ θ, can be applied to estimate f (x) for any x
such that 1 − hn ≤ x ≤ 1.
Chapter 9

Local Polynomial Approximation of the Regression Function

9.1. Preliminary Results and Definition


In a small neighborhood of a fixed point x ∈ [0, 1], an unknown nonpara-
metric regression function f (x) can be approximated by a polynomial. This
method, called the local polynomial approximation, is introduced in this sec-
tion. Below we treat the case of the point x lying strictly inside the unit
interval, 0 < x < 1. The case of x being one of the endpoints is left as an
exercise (see Exercise 9.64.)

Choose a bandwidth hn that satisfies the standard conditions (8.11),


hn > 0 , hn → 0 , and nhn → ∞ as n → ∞.

Let n be so large that the interval [x − hn , x + hn ] ⊆ [0, 1]. Denote by


N the number of observations in the interval [x − hn , x + hn ],
 
N = # i : xi ∈ [x − hn , x + hn ] .

Without loss of generality, we can assume that the observations (xi , yi )


are distinct and numbered so that the first N design points belong to this
interval,
x − h n ≤ x1 < · · · < xN ≤ x + h n .
Consider the restriction of the original nonparametric Hölder regression
function f ∈ Θ(β) = Θ(β, L, L1 ) to the interval [x − hn , x + hn ]. That is,

consider f = f (t) where x − hn ≤ t ≤ x + hn . Recall that every function f in


Θ(β) is essentially a polynomial of degree β − 1 with a small remainder term
described in Lemma 8.5. Let us forget for a moment about the remainder
term, and let us try to approximate the nonparametric regression function
by a parametric polynomial regression of degree β − 1. The least-squares
estimator in the parametric polynomial regression is defined via the solution
of the minimization problem with respect to the estimates of the regression
coefficients θ̂0 , . . . , θ̂β−1 ,
(9.1)    \sum_{i=1}^{N} \Big( y_i - \Big[ \hat{\theta}_0 + \hat{\theta}_1 \frac{x_i - x}{h_n} + \cdots + \hat{\theta}_{\beta-1} \Big( \frac{x_i - x}{h_n} \Big)^{\beta-1} \Big] \Big)^2 \to \min_{\hat{\theta}_0, \ldots, \hat{\theta}_{\beta-1}}.

In each monomial, it is convenient to subtract x as the midpoint of the
interval [x − hn , x + hn ], and to scale by hn so that the monomials do not
vanish as hn shrinks.
Recall from Chapter 7 that solving the minimization problem (9.1) is
equivalent to solving the system of normal equations
  
(9.2)    G^\top G\, \hat{\theta} = G^\top y

where \hat{\theta} = ( \hat{\theta}_0, \ldots, \hat{\theta}_{\beta-1} )^\top and G = ( g_0, \ldots, g_{\beta-1} ) is the design matrix.
Its m-th column has the form

g_m = \Big( \Big( \frac{x_1 - x}{h_n} \Big)^m, \ldots, \Big( \frac{x_N - x}{h_n} \Big)^m \Big)^\top, \quad m = 0, \ldots, \beta - 1.

The system of normal equations (9.2) has a unique solution if the matrix
G^\top G is invertible. We always make this assumption. It suffices to require
that the design points are distinct and that N ≥ β.
Applying Lemma 8.5, we can present each observation yi as the sum of
the three components: a polynomial of degree β − 1, a remainder term, and
a random error,


(9.3)    y_i = \sum_{m=0}^{\beta-1} \frac{f^{(m)}(x)}{m!} (x_i - x)^m + \rho(x_i, x) + \varepsilon_i

where

| \rho(x_i, x) | \le \frac{L h_n^{\beta}}{(\beta - 1)!} = O(h_n^{\beta}), \quad i = 1, \ldots, N.
The system of normal equations (9.2) is linear in y, hence each compo-
nent of yi in (9.3) can be treated separately. The next lemma provides the
information about the first polynomial component.
Lemma 9.1. If each entry of y = (y1 , . . . , yN )^\top has only the polynomial
component, that is,

y_i = \sum_{m=0}^{\beta-1} \frac{f^{(m)}(x)}{m!} (x_i - x)^m = \sum_{m=0}^{\beta-1} \frac{f^{(m)}(x)}{m!}\, h_n^m \Big( \frac{x_i - x}{h_n} \Big)^m, \quad i = 1, \ldots, N,

then the least-squares estimates in (9.1) are equal to

\hat{\theta}_m = \frac{f^{(m)}(x)}{m!}\, h_n^m, \quad m = 0, \ldots, \beta - 1.
Proof. The proof follows immediately if we apply the results of Section 7.1.
Indeed, the vector y belongs to the span-space S, so it stays unchanged after
projecting on this space. 

To establish results concerning the remainder ρ(xi , x) and the random
error term εi in (9.3), some technical preliminaries are needed. In view of the
fact that | (xi − x)/hn | ≤ 1, all elements of matrix G have a magnitude O(1)
as n increases. That is why, generally speaking, the elements of the matrix
G^\top G have a magnitude O(N ), assuming that the number of points N may
grow with n. These considerations shed light on the following assumption,
which plays an essential role in this chapter.

Assumption 9.2. For a given design X , the absolute values of the elements
of the covariance matrix ( G^\top G )^{-1} are bounded from above by γ0 N^{-1} with
a constant γ0 independent of n. 

The next lemma presents the results on the remainder and stochastic
terms in (9.3).
Lemma 9.3. Suppose Assumption 9.2 holds. Then the following is valid.
(i) If yi = ρ (xi , x), then the solution θ̂ of the system of normal equations
(9.2) has the elements θ̂m , m = 0, . . . , β − 1, bounded by
| \hat{\theta}_m | \le C_b h_n^{\beta} \quad \text{where } C_b = \frac{\gamma_0 \beta L}{(\beta - 1)!}.

(ii) If yi = εi , then the solution θ̂ of the system of normal equations (9.2)
has the zero-mean normal elements θ̂m , m = 0, . . . , β − 1, the variances of
which are bounded by

\mathrm{Var}_f\big[ \hat{\theta}_m \,\big|\, \mathcal{X} \big] \le \frac{C_v}{N} \quad \text{where } C_v = (\sigma \gamma_0 \beta)^2.
 −1 
Proof. (i) As the solution of the normal equations (9.2), θ̂ = G G G y.

 m
All the elements of the matrix G are of the form (xi − x)/hn , and thus
are bounded by one. Therefore, using Assumption 9.2, we conclude that
 −1 
the entries of the β × N matrix G G G are bounded by γ0 β/N . Also,
from (8.15), the absolute values of the entries of the vector y are bounded
by Lhβn /(β − 1)! since they are the remainder terms. After we compute the
dot product, N cancels, and we obtain the answer.
(ii) The element θ̂m is the dot product of the m-th row of the matrix
  −1 
GG G and the random vector (ε1 , . . . , εN ) . Therefore, θ̂m is the
sum of independent N (0, σ 2 ) random variables with the weights that do
not exceed γ0 β/N . This sum has mean zero and the variance bounded by
N σ 2 (γ0 β/N )2 = (σγ0 β)2 /N. 

Combining the results of Lemmas 8.5, 9.1, and 9.3, we arrive at the
following conclusion.
Proposition 9.4. Suppose Assumption 9.2 holds. Then the estimate θ̂m ,
which is the m-th element of the solution of the system of normal equations
(9.2), admits the expansion
\hat{\theta}_m = \frac{f^{(m)}(x)}{m!}\, h_n^m + b_m + N_m, \quad m = 0, \ldots, \beta - 1,

where the deterministic term bm is the conditional bias satisfying

| b_m | \le C_b h_n^{\beta},

and the stochastic term Nm has a normal distribution with mean zero and
variance bounded by

\mathrm{Var}_f\big[ N_m \,\big|\, \mathcal{X} \big] \le C_v / N.

Finally, we are ready to introduce the local polynomial estimator fˆn (t),
which is defined for all t such that x − hn ≤ t ≤ x + hn by
(9.4)    \hat{f}_n(t) = \hat{\theta}_0 + \hat{\theta}_1 \frac{t - x}{h_n} + \cdots + \hat{\theta}_{\beta-1} \Big( \frac{t - x}{h_n} \Big)^{\beta-1}
where the least-squares estimators θ̂0 , . . . , θ̂β−1 are as described in Proposi-
tion 9.4.
The local polynomial estimator (9.4) corresponding to the bandwidth
h∗n = n−1/(2β+1) will be denoted by fn∗ (t). Recall from Section 8.4 that h∗n
is called the optimal bandwidth, and it solves the equation (h∗n )2β = (nh∗n )−1 .
The formula (9.4) is significantly simplified if t = x. In this case the
local polynomial estimator is just the estimate of the intercept, fˆn (x) = θ̂0 .
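In code, the estimator (9.4) at t = x is a small least-squares problem. The following sketch is our illustration (the data and function names are hypothetical); it fits the rescaled monomials of (9.1) with numpy and returns the intercept θ̂0.

import numpy as np

def local_poly_estimate(x0, x, y, h, beta):
    """Local polynomial estimator at t = x0: fit a polynomial of degree
    beta-1 in (x_i - x0)/h by least squares over the window |x_i - x0| <= h
    and return the intercept, cf. (9.1) and (9.4)."""
    mask = np.abs(x - x0) <= h
    u = (x[mask] - x0) / h
    if u.size < beta:                      # need at least beta distinct points
        return np.nan
    G = np.vander(u, N=beta, increasing=True)    # columns u^0, ..., u^(beta-1)
    theta, *_ = np.linalg.lstsq(G, y[mask], rcond=None)
    return theta[0]                        # \hat f_n(x0) = \hat\theta_0

# Illustration with beta = 2 (local linear fit)
rng = np.random.default_rng(2)
n = 1000
x = np.sort(rng.uniform(0, 1, n))
f = lambda t: np.exp(t) * np.sin(3 * t)
y = f(x) + 0.1 * rng.standard_normal(n)
h = n ** (-1 / 5)
print(round(local_poly_estimate(0.5, x, y, h, beta=2), 3), round(f(0.5), 3))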
Up to this point there was no connection between the number of the
design points N in the hn -neighborhood of x and the bandwidth hn . Such
a connection is necessary if we want to balance the bias and the variance
terms in Proposition 9.4.
Assumption 9.5. There exists a positive constant γ1 , independent of n,
such that for all large enough n the inequality N ≥ γ1 nhn holds. 
Now we will prove the result on the conditional quadratic risk at a point
of the local polynomial estimator.

Theorem 9.6. Suppose Assumptions 9.2 and 9.5 hold with hn = h∗n =
n−1/(2β+1) . Consider the local polynomial estimator fn∗ (x) corresponding to
h∗n . Then for a given design X , the conditional quadratic risk of fn∗ (x) at
the point x ∈ (0, 1) admits the upper bound

\sup_{f \in \Theta(\beta)} E_f\big[ ( f_n^*(x) - f(x) )^2 \,\big|\, \mathcal{X} \big] \le r^* n^{-2\beta/(2\beta+1)}

where a positive constant r∗ is independent of n.

Proof. By Proposition 9.4, for any f ∈ Θ(β), the conditional quadratic risk
of the local polynomial estimator fn∗ is equal to

E_f\big[ ( f_n^*(x) - f(x) )^2 \,\big|\, \mathcal{X} \big] = E_f\big[ ( \hat{\theta}_0 - f(x) )^2 \,\big|\, \mathcal{X} \big]
= E_f\big[ ( f(x) + b_0 + N_0 - f(x) )^2 \,\big|\, \mathcal{X} \big] = b_0^2 + E_f\big[ N_0^2 \,\big|\, \mathcal{X} \big]
= b_0^2 + \mathrm{Var}_f\big[ N_0 \,\big|\, \mathcal{X} \big] \le C_b^2 (h_n^*)^{2\beta} + C_v / N.

Applying Assumption 9.5 and the fact that h∗n satisfies the identity (h∗n )2β =
(nh∗n )−1 = n−2β/(2β+1) , we obtain that

E_f\big[ ( f_n^*(x) - f(x) )^2 \,\big|\, \mathcal{X} \big] \le C_b^2 (h_n^*)^{2\beta} + \frac{C_v}{\gamma_1 n h_n^*} = r^* n^{-2\beta/(2\beta+1)}
with r∗ = Cb2 + Cv /γ1 . 

Remark 9.7. Proposition 9.4 also opens a way to estimate the derivatives
f (m) (t) of the regression function f. The estimator is especially elegant if
t = x,

(9.5)    \hat{f}_n^{(m)}(x) = \frac{m!\, \hat{\theta}_m}{h_n^m}, \quad m = 1, \ldots, \beta - 1.

The rate of convergence becomes slower as m increases. In Exercise 9.65,
an analogue of Theorem 9.6 is stated with the rate n−(β−m)/(2β+1) . 

9.2. Polynomial Approximation and Regularity of Design


In a further study of the local polynomial approximation, we introduce some
regularity rules for a design to guarantee Assumptions 9.2 and 9.5. The
lemmas that we state in this section will be proved in Section 9.4.
9.2.1. Regular Deterministic Design. Recall that according to (7.18),
the design points are defined on the interval [0, 1] as the quantiles of a
distribution with a continuous strictly positive probability density p(x).

Lemma 9.8. Let the regular deterministic design be defined by (7.18), and
suppose the bandwidth hn satisfies the conditions hn → 0 and nhn → ∞
as n → ∞. Let N denote the number of the design points in the interval
[x − hn , x + hn ]. Then:
(i) xi+1 − xi = (1 + αi, n )/(np(x)) where max1 ≤ i ≤ N |αi, n | → 0 as
n → ∞.
(ii) limn→∞ N/(nhn ) = 2p(x).
(iii) For any continuous function ϕ0 (u), u ∈ [−1, 1],

\lim_{n \to \infty} \frac{1}{n h_n} \sum_{i=1}^{N} \varphi_0\Big( \frac{x_i - x}{h_n} \Big) = p(x) \int_{-1}^{1} \varphi_0(u)\, du.

Define a matrix D_\infty^{-1} with the (l, m)-th element given by

(9.6)    (D_\infty^{-1})_{l,m} = \frac{1}{2} \int_{-1}^{1} u^{l+m}\, du = \begin{cases} \dfrac{1}{l+m+1}, & \text{if } l + m \text{ is even}, \\ 0, & \text{if } l + m \text{ is odd}. \end{cases}

The matrix D_\infty^{-1} has the inverse D_\infty (for a proof see Exercise 9.66). The
matrix D_\infty is a limiting covariance matrix introduced in Chapter 7.
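For a concrete feel for (9.6), the snippet below (our own illustration) builds D_\infty^{-1} for β = 3 using the evaluated integral, verifies that its determinant is nonzero, and prints the inverse D_\infty.

import numpy as np

def d_inf_inverse(beta):
    """Matrix D_infinity^{-1} from (9.6): the (l, m)-entry is
    (1/2) * integral of u^(l+m) over [-1, 1], i.e. 1/(l+m+1) when l+m is even
    and 0 otherwise, for l, m = 0, ..., beta-1."""
    D_inv = np.zeros((beta, beta))
    for l in range(beta):
        for m in range(beta):
            if (l + m) % 2 == 0:
                D_inv[l, m] = 1.0 / (l + m + 1)
    return D_inv

D_inv = d_inf_inverse(3)             # beta = 3
print(D_inv)
print(np.linalg.det(D_inv))          # nonzero, so D_infinity exists
print(np.linalg.inv(D_inv))          # the limiting covariance matrix D_infinity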
Lemma 9.9. Suppose the assumptions of Lemma 9.8 hold. Then the fol-
lowing limit exists:

\lim_{n \to \infty} N^{-1} \big( G^\top G \big) = D_\infty^{-1},
and the limiting matrix is invertible.
Corollary 9.10. Under the conditions of Lemma 9.8, Assumption 9.2 is
fulfilled for all sufficiently large n, and Assumption 9.5 holds with any con-
stant γ1 < 2p(x).
Corollary 9.11. For the regular deterministic design, the local polynomial
estimator fn∗ (x) with the bandwidth h∗n = n−1/(2β+1) has the quadratic risk
at x ∈ (0, 1) bounded by r∗ n−2β/(2β+1) where a positive constant r∗ is inde-
pendent of n.

9.2.2. Random Uniform Design. To understand the key difficulties with


the random design, it suffices to look at the case of the uniformly distributed
design points xi on the interval [0, 1] . For this design the regularity in the
deterministic sense does not hold. That is, it cannot be guaranteed with
probability 1 that the distances between two consecutive points are O(1/n)
as n → ∞. With a positive probability there may be no design points in the
interval [x − hn , x + hn ], or it may contain some points but the system of


the normal equations (9.2) may be singular (see Exercise 7.51).
In what follows, we concentrate on the case of the optimal bandwidth
h∗n = n−1/(2β+1) . Take a small fixed positive number α < 1, and introduce
the random event

\mathcal{A} = \Big\{ \Big| \frac{N}{n h_n^*} - 2 \Big| \le \alpha \Big\}.

As in the case of the deterministic design, introduce the same matrix
D_\infty^{-1} = \lim_{n\to\infty} N^{-1} ( G^\top G ) and its inverse D_\infty . Denote by C_* a constant
that exceeds the absolute values of all elements of D_\infty . Define another ran-
dom event

\mathcal{B} = \Big\{ \big| \big( G^\top G \big)^{-1}_{l,m} \big| \le \frac{2 C_*}{n h_n^*} \ \ \text{for all } l, m = 0, \ldots, \beta - 1 \Big\}.
Note that these random events depend on n, but this fact is suppressed in
the notation.
Recall that the local polynomial estimator (9.4) at t = x is the inter-
cept θ̂0 . In the case of the random uniform design, we redefine the local
polynomial estimator as

(9.7)    f_n^*(x) = \begin{cases} \hat{\theta}_0, & \text{if } \mathcal{A} \cap \mathcal{B} \text{ occurs}, \\ 0, & \text{otherwise}. \end{cases}
If the random event A occurs, then Assumption 9.5 holds with γ1 = 2 − α.
If also the event B occurs, then Assumption 9.2 holds with γ0 = 2(2 + α)C∗ .
Thus, if both events take place, we can anticipate an upper bound for the
quadratic risk similar to the one in Theorem 9.6. If fn∗ (x) = 0, this estimator
does not estimate the regression function at all. Fortunately, as follows from
the two lemmas below (see Remark 9.14), the probability that either A or
B fails is negligible as n → ∞. Proofs of these lemmas can be found in the
last section.
Lemma 9.12. Let \bar{\mathcal{A}} be the complement of the event \mathcal{A}. Then

P_f\big( \bar{\mathcal{A}} \big) \le 2 \alpha^{-2} n^{-2\beta/(2\beta+1)}.

Lemma 9.13. Let \bar{\mathcal{B}} denote the complement of the event \mathcal{B}. Then there
exists a positive number C, independent of n, such that

P_f\big( \bar{\mathcal{B}} \big) \le C n^{-2\beta/(2\beta+1)}.

Remark 9.14. Applying Lemmas 9.12 and 9.13, we see that

P_f\big( f_n^*(x) = 0 \big) = P_f\big( \overline{\mathcal{A} \cap \mathcal{B}} \big)
= P_f\big( \bar{\mathcal{A}} \cup \bar{\mathcal{B}} \big) \le P_f\big( \bar{\mathcal{A}} \big) + P_f\big( \bar{\mathcal{B}} \big)
\le 2 \alpha^{-2} n^{-2\beta/(2\beta+1)} + C n^{-2\beta/(2\beta+1)} \to 0 \ \text{as } n \to \infty. 
Now, we are in the position to prove the main result for the quadratic
risk under the random uniform design.
Theorem 9.15. Take the optimal bandwidth h∗n = n−1/(2β+1) . Let the de-
sign X be random and uniform on [0, 1]. Then the quadratic risk of the local
polynomial estimator fn∗ (x) at x defined by (9.7) satisfies the upper bound

\sup_{f \in \Theta(\beta)} E_f\big[ ( f_n^*(x) - f(x) )^2 \big] \le r^{**} n^{-2\beta/(2\beta+1)}

where a positive constant r∗∗ is independent of n.

Proof. Note that in the statement of Theorem 9.6, the constant r∗ de-
pends on the design X only through the constants γ0 and γ1 that appear
in Assumptions 9.2 and 9.5. Thus, if the assumptions hold, then r∗ is non-
random, and averaging over the distribution of the design points does not
affect the upper bound. Hence,
 2  
Ef fn∗ (x) − f (x) I A ∩ B ≤ r∗ n−2β/(2β+1) .

Applying this inequality and Lemmas 9.12 and 9.13, we have that for all
sufficiently large n and for any f ∈ Θ(β, L, L1 ),
 2   2  
Ef fn∗ (x) − f (x) ≤ Ef fn∗ (x) − f (x) I A ∩ B
 2     2   
+ Ef fn∗ (x) − f (x) I A + Ef fn∗ (x) − f (x) I B
    
≤ r∗ n−2β/(2β+1) + L21 Pf A + Pf B

≤ r∗ + 2L21 α−2 + CL21 n−2β/(2β+1) .
Finally, we choose r∗∗ = r∗ + 2L21 α−2 + CL21 , and the result follows. 

9.3. Asymptotically Minimax Lower Bound


For the quadratic risk at a point, the results of the previous sections confirm
the existence of estimators with the asymptotic rate of convergence ψn =
n−β/(2β+1) in the sense of the definition (8.5). This rate is uniform over the
Hölder class of regression functions Θ(β). To make sure that we do not miss
any better estimator with a faster rate of convergence, we have to prove the
lower bound for the minimax risk. In this section, we show that for all large
n, and for any estimator fˆn of the regression function f , the inequality

(9.8)    \sup_{f \in \Theta(\beta)} E_f\big[ ( \hat{f}_n(x) - f(x) )^2 \big] \ge r_* n^{-2\beta/(2\beta+1)}

holds with a positive constant r∗ independent of n.


Clearly, the inequality (9.8) does not hold for any design X . For example,
if all the design points are concentrated at one point x1 = · · · = xn = x, then
our observations (xi , yi ) are actually observations in the parametric model
yi = f (x) + εi , i = 1, . . . , n,
with a real-valued parameter θ = f (x). This parameter can be estimated

1/√n-consistently by the simple averaging of the response values yi . On the
other hand, if the design points x1 , . . . , xn are regular, then the lower bound
(9.8) turns out to be true.

9.3.1. Regular Deterministic Design. We start with the case of a de-


terministic regular design, and prove the following theorem.
Theorem 9.16. Let the deterministic design points be defined by (7.18)
with a continuous and strictly positive density p(x), x ∈ [0, 1]. Then for any
fixed x, the inequality (9.8) holds.

Proof. To prove the lower bound in (9.8), we use the same trick as in
the parametric case (refer to the proof of Lemma 3.4). We substitute the
supremum over Θ(β) by the Bayes prior distribution concentrated at two
points. This time, however, the two points are represented by two regression
functions, called the test functions,

f_0 = f_0(x) = 0 \quad \text{and} \quad f_1 = f_1(x) \ne 0, \quad f_1 \in \Theta(\beta), \ x \in [0, 1].

Note that for any estimator fˆn = fˆn (x), the supremum exceeds the mean
value,

(9.9)    \sup_{f \in \Theta(\beta)} E_f\big[ ( \hat{f}_n(x) - f(x) )^2 \big] \ge \frac{1}{2} E_{f_0}\big[ \hat{f}_n^2(x) \big] + \frac{1}{2} E_{f_1}\big[ ( \hat{f}_n(x) - f_1(x) )^2 \big].

The expected values Ef0 and Ef1 denote the integration with respect to
the distribution of yi , given the corresponding regression function. Under the
hypothesis f = f0 = 0, the response yi = εi ∼ N (0, σ 2 ), while under the
alternative f = f1 , yi ∼ N( f1 (xi ), σ 2 ). Changing the probability measure
of integration, we can write the expectation Ef1 in terms of Ef0 ,

E_{f_1}\big[ ( \hat{f}_n(x) - f_1(x) )^2 \big]
= E_{f_0}\Big[ \big( \hat{f}_n(x) - f_1(x) \big)^2 \prod_{i=1}^{n} \frac{ \exp\{ - (y_i - f_1(x_i))^2 / (2\sigma^2) \} }{ \exp\{ - y_i^2 / (2\sigma^2) \} } \Big]

(9.10)    = E_{f_0}\Big[ \big( \hat{f}_n(x) - f_1(x) \big)^2 \exp\Big\{ \sum_{i=1}^{n} \frac{y_i f_1(x_i)}{\sigma^2} - \sum_{i=1}^{n} \frac{f_1^2(x_i)}{2\sigma^2} \Big\} \Big].
Now, for a given Hölder class Θ(β, L, L1 ), we will explicitly introduce a
function f1 that belongs to this class. Take a continuous function ϕ(u), u ∈
R. We assume that it is supported on the interval [−1, 1], is positive at the
origin, and its β-th derivative is bounded by L. That is, we assume that
ϕ(u) = 0 if | u | > 1, ϕ(0) > 0, and | ϕ(β) (u) | ≤ L. Choose the bandwidth
h∗n = n−1/(2β+1) , and put

f_1(t) = (h_n^*)^{\beta}\, \varphi\Big( \frac{t - x}{h_n^*} \Big), \quad t \in [0, 1].
Schematic graphs of the functions ϕ and f1 are given in Figures 6 and 7.
These graphs reflect a natural choice of the function ϕ as a “bump”. Notice
that f1 is a rescaling of ϕ. Indeed, since the support of ϕ is [−1, 1], the
function f1 is non-zero only for t such that |(t − x)/h∗n | ≤ 1 or, equivalently,
for t ∈ [x − h∗n , x + h∗n ], and the value of f1 at x is small, f1 (x) = (h∗n )β ϕ(0).

6
ϕ(u)

ϕ(0)

-
−1 0 1 u

Figure 6. A graph of a “bump” function ϕ.

f1 (t)
6

(h∗n )β ϕ(0)
-
0 x− h∗n x x+ h∗n 1 t

Figure 7. A graph of the function f1 .

For any n sufficiently large, the function f1 belongs to Θ(β, L, L1 ). In-


deed, since the function ϕ is bounded and h∗n is small, |f1 | is bounded by
L1 . Also, for any t1 , t2 ∈ [0, 1],
    
(β−1) (β−1)  ∗ (β−1) t1 − x ∗ (β−1) t2 − x 
| f1 (t1 ) − f1 (t2 ) | =  hn ϕ − hn ϕ 
h∗n h∗n
   t1 − x t2 − x 
≤ h∗n max  ϕ(β) (u)   ∗ −  ≤ L | t1 − t2 |.
−1 ≤ u ≤ 1 hn h∗n
Introduce a random event

\mathcal{E} = \Big\{ \sum_{i=1}^{n} \frac{y_i f_1(x_i)}{\sigma^2} - \sum_{i=1}^{n} \frac{f_1^2(x_i)}{2\sigma^2} \ge 0 \Big\}.

From (9.9) and (9.10), we find that

\sup_{f \in \Theta(\beta)} E_f\big[ ( \hat{f}_n(x) - f(x) )^2 \big]
\ge \frac{1}{2} E_{f_0}\Big[ \hat{f}_n^2(x) + \big( \hat{f}_n(x) - f_1(x) \big)^2 \exp\Big\{ \sum_{i=1}^{n} \frac{y_i f_1(x_i)}{\sigma^2} - \sum_{i=1}^{n} \frac{f_1^2(x_i)}{2\sigma^2} \Big\} \Big]
\ge \frac{1}{2} E_{f_0}\Big[ \hat{f}_n^2(x) + \big( \hat{f}_n(x) - f_1(x) \big)^2 I(\mathcal{E}) \Big].

Here we bound the exponent from below by one, which is true under the
event E. Next, by the elementary inequality a^2 + (a − b)^2 ≥ b^2/2 with
a = fˆn (x) and b = f1 (x) = (h∗n )β ϕ(0), we get the following bound:

(9.11)    \sup_{f \in \Theta(\beta)} E_f\big[ ( \hat{f}_n(x) - f(x) )^2 \big] \ge \frac{1}{4} (h_n^*)^{2\beta} \varphi^2(0)\, P_{f_0}( \mathcal{E} ).
What is left to show is that the probability Pf0 ( E ) is separated away from
zero,
(9.12)    P_{f_0}( \mathcal{E} ) \ge p_0

where p0 is a positive constant independent of n. In this case, (9.8) holds,

(9.13)    \sup_{f \in \Theta(\beta)} E_f\big[ ( \hat{f}_n(x) - f(x) )^2 \big] \ge \frac{1}{4} (h_n^*)^{2\beta} \varphi^2(0)\, p_0 = r_* n^{-2\beta/(2\beta+1)}

with r_* = (1/4)\, \varphi^2(0)\, p_0 . To verify (9.12), note that under the hypothesis
f = f0 = 0, the random variable

Z = \Big( \sigma^2 \sum_{i=1}^{n} f_1^2(x_i) \Big)^{-1/2} \sum_{i=1}^{n} y_i f_1(x_i)

has the standard normal distribution. Thus,

\lim_{n \to \infty} P_{f_0}( \mathcal{E} ) = \lim_{n \to \infty} P_{f_0}\Big( \sum_{i=1}^{n} y_i f_1(x_i) \ge \frac{1}{2} \sum_{i=1}^{n} f_1^2(x_i) \Big)
= \lim_{n \to \infty} P_{f_0}\Big( Z \ge \frac{1}{2\sigma} \Big( \sum_{i=1}^{n} f_1^2(x_i) \Big)^{1/2} \Big) = 1 - \lim_{n \to \infty} \Phi\Big( \frac{1}{2\sigma} \Big( \sum_{i=1}^{n} f_1^2(x_i) \Big)^{1/2} \Big).

Finally, we will show that

(9.14)    \lim_{n \to \infty} \sum_{i=1}^{n} f_1^2(x_i) = p(x) \|\varphi\|_2^2 = p(x) \int_{-1}^{1} \varphi^2(u)\, du > 0.
Indeed, recall that the optimal bandwidth h∗n satisfies the identity (h∗n )2β =
1/(nh∗n ). Using this fact and the assertion of part (iii) of Lemma 9.8, we
have that

\sum_{i=1}^{n} f_1^2(x_i) = \sum_{i=1}^{n} (h_n^*)^{2\beta} \varphi^2\Big( \frac{x_i - x}{h_n^*} \Big)

(9.15)    = \frac{1}{n h_n^*} \sum_{i=1}^{n} \varphi^2\Big( \frac{x_i - x}{h_n^*} \Big) \to p(x) \int_{-1}^{1} \varphi^2(u)\, du \quad \text{as } n \to \infty.

Hence (9.14) is true, and the probability Pf0 ( E ) has a strictly positive limit,

\lim_{n \to \infty} P_{f_0}( \mathcal{E} ) = 1 - \Phi\Big( \frac{1}{2\sigma} \big( p(x) \|\varphi\|_2^2 \big)^{1/2} \Big) > 0.
This completes the proof of the theorem. 

9.3.2. Regular Random Design. Do random designs represent all the


points of the interval [0, 1] “fairly” to ensure the lower bound (9.8)? It
seems plausible, provided the probability density of the design is strictly
positive. The following theorem supports this view.
Theorem 9.17. Let the design points x1 , . . . , xn be independent identically
distributed random variables with the common probability density p(x) which
is continuous and strictly positive on [0, 1]. Then at any fixed x ∈ (0, 1), the
inequality (9.8) holds.

Proof. See Exercise 9.68. 

9.4. Proofs of Auxiliary Results


Proof of Lemma 9.8. (i) Consider the design points in the hn -neighborhood
of x. By the definition (7.17) of the regular deterministic design points, we
have
1 i+1 i
= − = FX (xi+1 ) − FX (xi ) = p(x∗i )(xi+1 − xi )
n n n
where x∗i ∈ (xi , xi+1 ) . Hence,
1
xi+1 − xi = .
np(x∗i )
From the continuity of the density p(x), we have that p(x∗i )(1+αi, n ) = p(x)
where αi, n = o(1) → 0 as n → ∞. Therefore,
1 + αi, n
xi+1 − xi = .
np(x)
The quantity |αi, n | can be bounded by a small constant uniformly over
i = 1, . . . , n, so that αn = max1 ≤i ≤N |αi, n | → 0 as n → ∞.
(ii) Note that by definition, the number N of observations in the interval


[x − hn , x + hn ] can be bounded by
2hn 2hn
−1 ≤ N ≤ + 1.
max (xi+1 − xi ) min (xi+1 − xi )
1≤i≤N 1≤i≤N

From part (i),


1 − αn 1 + αn
≤ xi+1 − xi ≤ ,
np(x) np(x)
and, therefore, N is bounded by
2hn np(x) 2hn np(x)
(9.16) −1 ≤ N ≤ + 1.
1 + αn 1 − αn
Hence, limn→∞ N/(nhn ) = 2p(x).

(iii) Put ui = (xi+1 − x)/hn . From part (i), we have that


xi+1 − xi 1 + αi, n
(9.17) Δui = ui+1 − ui = = ,
hn nhn p(x)
or, equivalently,
1 Δui
= p(x) .
nhn 1 + αi, n
Hence, the bounds take place:
Δui 1 Δui
p(x) ≤ ≤ p(x) .
1 + αn nhn 1 − αn
Consequently,
p(x)   xi − x  1   xi − x 
N N
ϕ0 Δui ≤ ϕ0
1 + αn hn nhn hn
i=1 i=1

p(x)   xi − x 
N
≤ ϕ0 Δui ,
1 − αn hn
i=1
and the desired convergence follows,
1
1   xi − x 
N
ϕ0 → p(x) ϕ0 (u) du. 
nhn hn −1
i=1

Proof of Lemma 9.9. By the definition of the matrix G, we can write


1   1   xi − x l  xi − x m 1  l+m
N N
(9.18) G G l, m = = ui .
N N hn hn N
i=1 i=1
Next, we want to find bounds for 1/N . From (9.17), we have
1 − αn 1 + αn
≤ Δui ≤ .
nhn p(x) nhn p(x)
Combining this result with (9.16), we obtain


 2h np(x)  1 − α   2h np(x)  1 + α 
n n n n
−1 ≤ N Δui ≤ +1 .
1 + αn nhn p(x) 1 − αn nhn p(x)
Put
1 − αn 1 − αn 1 + αn 1 − αn
βn = − − 1 and β̃n = + − 1.
1 + αn nhn p(x) 1 − αn nhn p(x)
Thus, we have shown that 2 + βn ≤ N Δui ≤ 2 + β̃n , or, equivalently,
Δui 1 Δui
≤ ≤
2 + β̃n N 2 + βn
where βn and β̃n vanish as
 n goes
 to infinity. Therefore, using the expression
1
(9.18), we can bound N G G l, m by

1  N
1   1 N
ui Δui ≤
l+m
G G l, m ≤ ul+m
i Δui .
2 + β̃n i = 1 N 2 + βn
i=1
Both bounds converge as n → ∞ to the integral in the definition (9.6)
of D−1
∞ . The proof that this matrix is invertible is left as an exercise (see
Exercise 9.66). 
Before we turn to Lemmas 9.12 and 9.13, we prove the following result.
Let g(u) be a continuous function such that |g(u)| ≤ 1 for all u ∈ [−1, 1].
Let x1 , . . . , xn be independent random variables with a common uniform
distribution on [0, 1]. Introduce the independent random variables ηi , i =
1, . . . , n, by
⎧  
⎨g xi − x , if xi ∈ [x − h∗ , x + h∗ ],
(9.19) ηi = h∗n n n
⎩0, otherwise.
Denote by μn the expected value of ηi ,
x+h∗n  1
t−x ∗
μn = E[ ηi ] = g dt = hn g(u) du.
x−h∗n h∗n −1

Result. For any positive number α,


 1 μn  
 2
(9.20) P  ∗ (η1 + · · · + ηn ) − ∗  > α ≤ 2 ∗ .
nhn hn α nhn
Proof. Note that
1
Var[ ηi ] ≤ E[ ηi2 ] = h∗n g 2 (u) du ≤ 2h∗n .
−1
Thus, the Chebyshev inequality yields
 1 μn  
 nVar[ ηi ] 2
P  ∗ (η1 + · · · + ηn ) − ∗  > α ≤ ∗ 2
≤ 2 ∗. 
nhn hn (αnhn ) α nhn
Proof of Lemma 9.12. Apply the definition (9.19) of ηi with g = 1. In


this case, N = η1 + · · · + ηn and μn /h∗n = 2. Thus, from (9.20) we obtain
  N  
 
Pf A) = Pf  ∗ − 2  > α
nhn
 1  
  2
= Pf  ∗ (η1 + · · · + ηn ) − 2  > α ≤ 2 ∗ .
nhn α nhn
Finally, note that nh∗n = n2β/(2β+1) . 
Proof of Lemma 9.13. For an arbitrarily small δ > 0, define a random
event
!   1
β−1 


 −1
C =  (G G) l, m − (D∞ ) l, m  ≤ δ .
2nh∗n
l, m = 0
First, we want to show that the probability of the complement event

β−1   
 1  −1 
C =  (G G) l, m − (D )
∞ l, m  > δ
2nh∗n
l, m = 0

is bounded from above,


 
(9.21) Pf C ≤ 2β 2 δ −2 n−2β/(2β+1) .
To see this, put g(u) = (1/2)u l + m in (9.19). Then
1 η1 + . . . + ηn

(G G)l, m =
2nhn nh∗n
and
x+h∗n t−x 1
μn 1 1
= ∗ g dt = ul + m du = (D−1
∞ )l, m .
h∗n hn x−h∗n h∗n 2 −1

The inequality (9.20) provides the upper bound 2 δ −2 n−2 β/(2 β+1) for the
probability of each event in the union C. This proves (9.21).
Next, recall that we denoted by C∗ a constant that exceeds the absolute
values of all elements of the matrix D∞ . Due to the continuity of a matrix
inversion, for any ε ≤ C∗ , there exists a number δ = δ(ε) such that
!
β−1   
 1  −1 
C =  (G G)l, m − (D∞ )l, m  ≤ δ(ε)
2nh∗n
l, m = 0

!
β−1   
 
⊆  (2nh∗n )(G G)−1
l, m − (D∞ ) l, m  ≤ ε
l, m = 0
 −1  
 2C∗
⊆  G G l, m  ≤ for all l, m = 0, . . . , β − 1 = B.
nh∗n
The latter inclusion follows from the fact that if (G G)−1 ∗
l, m ≤ (C∗ +ε)/(2nhn )
and ε ≤ C∗ , it implies that (G G)−1 ∗ ∗
l, m ≤ C∗ /(nhn ) ≤ 2C∗ /(nhn ). Thus,
from (9.21), we obtain Pf (B) ≤ Pf (C) ≤ Cn−2β/(2β+1) with C = 2β 2 δ −2 .


Exercises

Exercise 9.63. Explain what happens to the local polynomial estimator


(9.4) if one of the conditions hn → 0 or nhn → ∞ is violated.

Exercise 9.64. Take x = hn , and consider the local polynomial approx-


imation in the interval [0, 2hn ]. Let the estimate of the regression coef-
ficient be defined as the solution of the respective minimization problem
(9.1). Define the estimator of the regression function at the origin by
β−1
fˆn (0) = m
m = 0 (−1) θ̂m . Find upper bounds for the conditional bias and
ˆ
variance of fn (0) for a fixed design X .

Exercise 9.65. Prove an analogue of Theorem 9.6 for the derivative esti-
mator (9.5),
  m! θ̂ 2  

 X ≤ r∗ n−2(β−m)/(2β+1)
m
sup Ef ∗ )m
− f (m)
(x)
f ∈ Θ(β) (h n

where h∗n = n−1/(2β+1) .

Exercise 9.66. Show that the matrix D−1


∞ in Lemma 9.9 is invertible.

Exercise 9.67. Let f1 be as defined in the proof of Theorem 9.16, and


let x1 , . . . , xn be a random design with the probability density p(x) on the
interval [0, 1].
(i) Show that the random variable
 1  2  xi − x 
n n
f12 (xi ) = ϕ
n h∗n h∗n
i=1 i=1
has the expected value that converges to p(x)  ϕ 22 as n → ∞.
(ii) Prove that the variance of this random variable is O(1/(nhn )) as n → ∞.
(iii) Derive from parts (i) and (ii) that for all sufficiently large n,
 n 
Pf0 f12 (xi ) ≤ 2p(x)  ϕ 22 ≥ 1/2.
i=1

Exercise 9.68. Apply Exercise 9.67 to prove Theorem 9.17.


Chapter 10

Estimation of Regression in Global Norms

10.1. Regressogram
In Chapters 8 and 9, we gave a detailed analysis of the kernel and local
polynomial estimators at a fixed point x inside the interval (0, 1). The as-
ymptotic minimax rate of convergence was found to be ψn = n−β/(2β+1) ,
which strongly depends on the smoothness parameter β of the regression
function.
What if our objective is different? What if we want to estimate the
regression function f (x) as a curve in the interval [0, 1]? The global norms
serve this purpose. In this chapter, we discuss the regression estimation
problems with regard to the continuous and discrete L2 -norms, and sup-
norm.
In the current section, we introduce an estimator fˆn , called a regresso-
gram. A formal definition will be given at the end of the section.
When it comes to the regression estimation in the interval [0, 1], we can
extend a smoothing kernel estimator (8.16) to be defined in the entire unit
interval. However, the estimation at the endpoints x = 0 and x = 1 would
cause difficulties. It is more convenient to introduce an estimator defined
everywhere in [0, 1] based on the local polynomial estimator (9.4).
Consider a partition of the interval [0, 1] into small subintervals of the
equal length 2hn . To ease the presentation assume that Q = 1/(2hn ) is

an integer. The number Q represents the total number of intervals in the
partition. Each small interval

B_q = \big[\, 2(q-1) h_n,\; 2 q h_n \,\big), \quad q = 1, \ldots, Q,
is called a bin. It is convenient to introduce notation for the midpoint of the
bin Bq . We denote it by cq = (2q − 1)hn , q = 1, . . . , Q.
The local polynomial estimator (9.4) is defined separately for each bin.
If we want to estimate the regression function at every point x ∈ [0, 1], we
must consider a collection of the local polynomial estimators. Introduce Q
minimization problems, one for each bin,

(10.1)    \sum_{i=1}^{n} \Big( y_i - \Big[ \hat{\theta}_{0,q} + \hat{\theta}_{1,q} \frac{x_i - c_q}{h_n} + \cdots + \hat{\theta}_{\beta-1,q} \Big( \frac{x_i - c_q}{h_n} \Big)^{\beta-1} \Big] \Big)^2 I\big( x_i \in B_q \big)
\to \min_{\hat{\theta}_{0,q}, \ldots, \hat{\theta}_{\beta-1,q}} \quad \text{for } x \in B_q, \ q = 1, \ldots, Q.

Note that these minimization problems are totally disconnected. Each of


them involves only the observations the design points of which belong to the
respective bin Bq . The estimates of the regression coefficients are marked
by the double subscript, representing the coefficient number and the bin
number. There should also be a subscript “n”, which we omit to avoid too
cumbersome a notation.
As in Section 9.1, it is easier to interpret the minimization problems
(10.1) if they are written in the vector notation. Denote by N1 , . . . , NQ the
number of the design points in each bin, N1 + · · · + NQ = n. For a fixed
q = 1, . . . , Q, let
x1,q < · · · < xNq ,q
be the design points in the bin Bq , and let the corresponding response values
have matching indices y1,q , . . . , yNq ,q . Denote by

\hat{\theta}_q = \big( \hat{\theta}_{0,q}, \ldots, \hat{\theta}_{\beta-1,q} \big)^\top

the vector of the estimates of the regression coefficients in the q-th bin. The
vectors θ̂ q satisfy the systems of normal equations

(10.2)    G_q^\top G_q\, \hat{\theta}_q = G_q^\top y_q, \quad q = 1, \ldots, Q,

where y_q = (y_{1,q}, \ldots, y_{N_q,q})^\top , and the matrix G_q = ( g_{0,q}, \ldots, g_{\beta-1,q} ) has
the columns

g_{m,q} = \Big( \Big( \frac{x_{1,q} - c_q}{h_n} \Big)^m, \ldots, \Big( \frac{x_{N_q,q} - c_q}{h_n} \Big)^m \Big)^\top, \quad m = 0, \ldots, \beta - 1.
The results of Section 9.1 were based on Assumptions 9.2 and 9.5. In
this section, we combine their analogues into one assumption. Provided this
assumption holds, the systems of normal equations (10.2) have the unique
solutions for all q = 1, . . . , Q.
Assumption 10.1. There exist positive constants γ0 and γ1 , independent
of n and q, such that for all q = 1, . . . , Q,
(i) the absolute values of the elements of the matrix ( G_q^\top G_q )^{-1} are bounded
from above by γ0 /Nq ,
(ii) the number of observations Nq in the q-th bin is bounded from below,
Nq ≥ γ1 nhn . 

Now we are ready to define the piecewise polynomial estimator fˆn (x)
in the entire interval [0, 1]. This estimator is called a regressogram, and is
computed according to the formula
(10.3)    \hat{f}_n(x) = \hat{\theta}_{0,q} + \hat{\theta}_{1,q} \frac{x - c_q}{h_n} + \cdots + \hat{\theta}_{\beta-1,q} \Big( \frac{x - c_q}{h_n} \Big)^{\beta-1} \quad \text{if } x \in B_q,

where the estimates θ̂0, q , . . . , θ̂β−1, q satisfy the normal equations (10.2), q =
1, . . . , Q.
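The regressogram is one least-squares fit per bin. The sketch below is our illustration of (10.1)-(10.3) (the names and simulated data are hypothetical), assuming for simplicity that 1/(2h) is an integer.

import numpy as np

def regressogram(x_eval, x, y, h, beta):
    """Regressogram (10.3): on each bin B_q = [2(q-1)h, 2qh) fit a polynomial
    of degree beta-1 in (x - c_q)/h by least squares, then evaluate at x_eval."""
    Q = int(round(1.0 / (2 * h)))                # number of bins
    fhat = np.full_like(x_eval, np.nan, dtype=float)
    for q in range(Q):
        left, right = 2 * q * h, 2 * (q + 1) * h
        c_q = (2 * q + 1) * h                    # bin center
        in_bin = (x >= left) & (x < right)
        at_eval = (x_eval >= left) & (x_eval < right)
        if in_bin.sum() < beta:
            continue                             # Assumption 10.1 fails in this bin
        G = np.vander((x[in_bin] - c_q) / h, N=beta, increasing=True)
        theta, *_ = np.linalg.lstsq(G, y[in_bin], rcond=None)
        fhat[at_eval] = np.vander((x_eval[at_eval] - c_q) / h, N=beta,
                                  increasing=True) @ theta
    return fhat

rng = np.random.default_rng(4)
n, beta = 2000, 2
x = rng.uniform(0, 1, n)
f = lambda t: np.sin(2 * np.pi * t)
y = f(x) + 0.3 * rng.standard_normal(n)
h = 0.05                                         # Q = 10 bins
grid = np.linspace(0.01, 0.99, 5)
print(np.round(regressogram(grid, x, y, h, beta) - f(grid), 3))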

10.2. Integral L2 -Norm Risk for the Regressogram


Consider the regressogram fˆn (x), x ∈ [0, 1], defined by (10.3). The following
statement is an adaptation of Proposition 9.4 about the components of θ̂ q .
We omit its proof.
Proposition 10.2. Suppose that for a given design X , Assumption 10.1
holds. Assume that the regression function f belongs to a Hölder class
Θ(β, L, L1 ). Then the m-th element θ̂m,q of the vector θ̂ q , which satisfies
the system of normal equations (10.2), admits the expansion
\hat{\theta}_{m,q} = \frac{f^{(m)}(c_q)}{m!}\, h_n^m + b_{m,q} + N_{m,q}, \quad m = 0, \ldots, \beta - 1, \ q = 1, \ldots, Q,

where the conditional bias bm,q is bounded from above, | b_{m,q} | \le C_b h_n^{\beta} , and
the stochastic term Nm,q has the normal distribution with mean zero. Its
variance is bounded from above, \mathrm{Var}_f\big[ N_{m,q} \,\big|\, \mathcal{X} \big] \le C_v / N_q . Here the con-
stants Cb and Cv are independent of n. Conditionally, given the design X ,
the random variables Nm,q are independent for different values of q.

The next theorem answers the question about the integral L2 -norm risk
for the regressogram.
Theorem 10.3. Let a design X be such that Assumption 10.1 holds with
the bandwidth h∗n = n−1/(2β+1) . Then the mean integrated quadratic risk of
the regressogram fˆn (x) admits the upper bound

\sup_{f \in \Theta(\beta)} E_f\Big[ \int_0^1 \big( \hat{f}_n(x) - f(x) \big)^2 dx \,\Big|\, \mathcal{X} \Big] \le r^* n^{-2\beta/(2\beta+1)}

for some positive constant r∗ independent of n.

Proof. From Lemma 8.5, for any f ∈ Θ(β, L, L1 ), and for any bin Bq
centered at cq , the Taylor expansion is valid
f (β−1) (cq )
f (x) = f (cq ) + f (1) (cq )(x − cq ) + · · · + (x − cq )β−1 + ρ(x, cq )
(β − 1)!

f (m) (cq ) ∗ m  x − cq m

β−1
= (hn ) + ρ(x, cq )
m! h∗n
m=0
 
where the remainder term ρ(x, cq ) satisfies the inequality ρ(x, cq ) ≤ Cρ (h∗n )β
with Cρ = L/(β − 1)! .
Applying Proposition 10.2, the Taylor expansion of f , and the definition
of the regressogram (10.3), we get the expression for the quadratic risk
 1 2   Q   2  
Ef fˆn (x) − f (x) dx  X = Ef fˆn (x) − f (x) dx  X
0 q=1 Bq


Q   β−1
    x − cq m 2  

= Ef bm,q + Nm,q − ρ(x, c q ) dx  X .
Bq h∗n
q=1 m=0

Using the fact that the random variables Nm,q have mean zero, we can
write the latter expectation of the integral over Bq as the sum of a deter-
ministic and stochastic terms,
 β−1
  x − c m 2
q
bm,q − ρ(x, c q ) dx
Bq h∗n
m=0
 β−1
  x − c m 2  
q 
+ Ef Nm,q ∗ X dx.
Bq hn
m=0

From the bounds for the bias and the remainder term, the first integrand
can be estimated from above by a constant,
 β−1
  x − c m 2  β−1
 2
q
bm,q − ρ(x, cq ) ≤ |bm,q | + |ρ(x, cq )|
h∗n
m=0 m=0
 
≤ βCb (h∗n )β + Cρ (h∗n ) β 2
= CD (h∗n )2β = CD n−2β/(2β+1)
where CD = (βCb + Cρ )2 .
Note that the random variables Nm,q may be correlated for a fixed q
and different m’s. Using a special case of the Cauchy-Schwarz inequality
(a0 + · · · + aβ−1 )2 ≤ β (a20 + · · · + a2β−1 ), Proposition 10.2, and Assumption
10.1, we bound the second integrand from above by
  β−1
  x − c m 2   
β−1

q 
Ef Nm,q ∗  X ≤ β Varf Nm,q | X
hn
m=0 m=0

β 2 Cv β 2 Cv CS
≤ ≤ ∗
= = CS n−2β/(2β+1) where CS = β 2 Cv /γ1 .
Nq γ1 nhn nh∗n
Thus, combining the deterministic and stochastic terms, we arrive at the
upper bound
 1 2   Q
  
Ef ˆ 
fn (x) − f (x) dx X ≤ CD + CS n−2β/(2β+1) dx
0 q = 1 Bq

= r∗ n−2β/(2β+1) with r∗ = CD + CS . 
Remark 10.4. Under Assumption 10.1, the results of Lemmas 9.8 and 9.9
stay valid uniformly over the bins Bq , q = 1, . . . , Q. Therefore, we can
extend the statement of Corollary 9.11 to the integral L2 -norm. For the
regular deterministic design, the unconditional quadratic risk in the integral
L2 -norm of the regressogram fˆn with the bandwidth h∗n = n−1/(2β+1) admits
the upper bound

\sup_{f \in \Theta(\beta)} E_f\Big[ \int_0^1 \big( \hat{f}_n(x) - f(x) \big)^2 dx \Big] \le r^* n^{-2\beta/(2\beta+1)}

where a positive constant r∗ is independent of n. A similar result is also true


for the regular random design (cf. Theorem 9.15). Unfortunately, it is too
technical, and we skip its proof. 
Remark 10.5. The m-th derivative of the regressogram (10.3) has the form

\frac{d^m \hat{f}_n(x)}{d x^m} = \sum_{i=m}^{\beta-1} \frac{i!}{(i-m)!} \frac{\hat{\theta}_{i,q}}{h_n^m} \Big( \frac{x - c_q}{h_n} \Big)^{i-m} \quad \text{if } x \in B_q, \ 0 \le m \le \beta - 1.

Under the same choice of the bandwidth, h∗n = n−1/(2β+1) , this estimator
admits the upper bound similar to the one in Theorem 10.3 with the rate
n−(β−m)/(2β+1) , that is,

(10.4)    \sup_{f \in \Theta(\beta)} E_f\Big[ \int_0^1 \Big( \frac{d^m \hat{f}_n(x)}{d x^m} - \frac{d^m f(x)}{d x^m} \Big)^2 dx \,\Big|\, \mathcal{X} \Big] \le r^* n^{-2(\beta-m)/(2\beta+1)}.
For the proof see Exercise 10.69. 
10.3. Estimation in the Sup-Norm


In this section we study the asymptotic performance of the sup-norm risk
of the regressogram fˆn (x) defined in (10.3). The sup-norm risk for a fixed
design X is given by

(10.5)    E_f\big[ \| \hat{f}_n - f \|_\infty \,\big|\, \mathcal{X} \big] = E_f\Big[ \sup_{0 \le x \le 1} | \hat{f}_n(x) - f(x) | \,\Big|\, \mathcal{X} \Big].

Our starting point is Proposition 10.2. It is a very powerful result that allows
us to control the risk under any loss function. We use this proposition to
prove the following theorem.
Theorem 10.6. Let a design X be such that Assumption 10.1 holds with
the bandwidth h_n = \big( (\ln n)/n \big)^{1/(2\beta+1)} . Let fˆn be the regressogram that
corresponds to this bandwidth. Then the conditional sup-norm risk (10.5)
admits the upper bound

(10.6)    \sup_{f \in \Theta(\beta)} E_f\big[ \| \hat{f}_n - f \|_\infty \,\big|\, \mathcal{X} \big] \le r^* \Big( \frac{\ln n}{n} \Big)^{\beta/(2\beta+1)}
where r∗ is a positive constant independent of n and f .

Proof. Applying Lemma 8.5, we can write the sup-norm of the difference
fˆn − f as
 β−1  x − c m
  q
 fˆn − f ∞ = max sup  θ̂m, q
1≤q≤Q x ∈ Bq h n
m=0


β−1
f (m) (cq ) (hn )m  x − cq m 

(10.7) − − ρ(x, cq ) 
m! hn
m=0
where Q = 1/(2hn ) is the number of bins, and the q-th bin is the interval
Bq = cq − hn , cq + hn centered at x = cq , q = 1, . . . , Q. The remainder
term ρ(x, cq ) satisfies the inequality | ρ(x, cq ) | ≤ Cρ hβn with the constant
Cρ = L / (β − 1)!. Applying the formula for θ̂m, q from Proposition 10.2 and
the fact that |x − cq |/hn ≤ 1, we obtain that
 β−1  x − cq m 
 
 fn − f  ≤   

max sup  bm,q + Nm,q  + Cρ hβn
1 ≤ q ≤ Q x ∈ Bq hn
m=0

  
β−1
 
(10.8) ≤ βCb hβn + Cρ hβn + max  Nm,q .
1≤q≤Q
m=0
Introduce the standard normal random variables
   −1/2
Zm,q = Varf Nm,q  X Nm,q .
From the upper bound on the variance in Proposition 10.2, we find that
 
 
β−1 
 Cv  
β−1
 Cv
(10.9) max  Nm,q  ≤ max Zm,q  ≤ Z∗
1≤q≤Q 1≤q≤Q Nq γ1 nhn
m=0 m=0

where

β−1
 

Z = max Zm,q .
1≤q≤Q
m=0

Note that the random variables Zm,q are independent for different bins,
but may be correlated for different values of m within the same bin.
Putting together (10.8) and (10.9), we get the upper bound for the sup-
norm loss,

   β Cv
(10.10)  fn − f  ≤ βCb + Cρ h + Z ∗.
∞ n
γ1 nhn

To continue, we need the following technical result, which we ask to be


proved in Exercise 10.70.
Result. There exists a constant Cz > 0 such that
  √
(10.11) E Z ∗  X ≤ Cz ln n.

Finally, under our choice of hn , it is easily seen that


 n 2β/(2β+1)
n hn = n2β/(2β+1) (ln n)1/(2β+1) = (ln n) = (ln n) hn−2β .
ln n
These results along with (10.10) yield

    Cv ln n
Ef  fˆn − f ∞  X ≤ βCb + Cρ hβn + Cz
γ1 nhn

 
≤ βCb + Cρ + Cz Cv /γ1 hβn = r∗ hβn . 

Remark 10.7. As Theorem 10.6 shows, the upper bound of the risk in
the sup-norm contains an extra log-factor as compared to the case of the
L2 -norm. The source of this additional factor becomes clear from (10.11).
Indeed, the maximum of the random noise has the magnitude O(\sqrt{\ln n}) as
n → ∞. That is why the optimum choice of the bandwidth comes from the
balance equation h_n^{\beta} = \sqrt{ (n h_n)^{-1} \ln n }. 
10.4. Projection on Span-Space and Discrete MISE


The objective of this section is to study the discrete mean integrated squared
error (MISE) of the regressogram (10.3). The regressogram is a piecewise
polynomial estimator that can be written in the form

\hat{f}_n(x) = \sum_{q=1}^{Q} \sum_{m=0}^{\beta-1} I(x \in B_q)\, \hat{\theta}_{m,q} \Big( \frac{x - c_q}{h_n} \Big)^m

where the q-th bin Bq = [ 2(q − 1)hn , 2qhn ), and cq = (2q − 1)hn is its
center. Here Q = 1/(2hn ) is an integer that represents the number of bins.
The rates of convergence in the L2 -norm and sup-norm found in the previous
sections were partially based on the fact that the bias of the regressogram
has the magnitude O(hβn ) uniformly in f ∈ Θ(β) at any point x ∈ [0, 1].
Indeed, from Proposition 10.2, we get

\sup_{f \in \Theta(\beta)} \sup_{0 \le x \le 1} \big| E_f \hat{f}_n(x) - f(x) \big|
\le \sup_{1 \le q \le Q} \sup_{x \in B_q} \sum_{m=0}^{\beta-1} | b_{m,q} | \Big| \frac{x - c_q}{h_n} \Big|^m \le C_b \beta h_n^{\beta}.
In turn, this upper bound for the bias is the immediate consequence of the
Taylor’s approximation in Lemma 8.5.
In this section, we take a different approach. Before we proceed, we
need to introduce some notation. Define a set of βQ piecewise monomial
functions,
(10.12)    \gamma_{m,q}(x) = I(x \in B_q) \Big( \frac{x - c_q}{h_n} \Big)^m, \quad q = 1, \ldots, Q, \ m = 0, \ldots, \beta - 1.

The regressogram fˆn (x) is a linear combination of these monomials,

(10.13)    \hat{f}_n(x) = \sum_{q=1}^{Q} \sum_{m=0}^{\beta-1} \hat{\theta}_{m,q}\, \gamma_{m,q}(x), \quad 0 \le x \le 1,
where θ̂m,q are the estimates of the regression coefficients in bins.


In what we used above, it was important that any function f (x) ∈ Θ(β)
admits an approximation by a linear combination of {γm,q (x)} with the error
not exceeding O(hβn ). This property does not exclusively belong to the set
of piecewise monomials. We will prove results in the generalized setting for
which the regressogram is a special case.
What changes should be made if instead of the piecewise monomials
(10.12) we use some other functions? In place of the indices m and q in
the monomials (10.12) we will use a single index k for a set of functions
γk (x), k = 1, . . . , K . The number of functions K = K(n) → ∞ as n → ∞.


In the case of the monomials (10.12), the number K = βQ.
Consider the regression observations yi = f (xi ) + εi , εi ∼ N (0, σ 2 ),
with a Hölder regression function f (x) ∈ Θ(β). As before, we assume that
the design points are distinct and ordered in the interval [0, 1],
0 ≤ x1 < x2 < · · · < xn ≤ 1.
We want to estimate the regression function f (x) by a linear combination
(10.14) fˆn (x) = θ̂1 γ1 (x) + · · · + θ̂K γK (x), x ∈ [0, 1],
in the least-squares sense over the design points. To do this, we have to
solve the following minimization problem:

(10.15)    \sum_{i=1}^{n} \Big( y_i - \big[ \hat{\theta}_1 \gamma_1(x_i) + \cdots + \hat{\theta}_K \gamma_K(x_i) \big] \Big)^2 \to \min_{\hat{\theta}_1, \ldots, \hat{\theta}_K}.

Define the design matrix Γ with columns

(10.16)    \gamma_k = \big( \gamma_k(x_1), \ldots, \gamma_k(x_n) \big)^\top, \quad k = 1, \ldots, K.

From this definition, the matrix Γ has the dimensions n × K. The vector
ϑ̂ = ( θ̂1 , . . . , θ̂K )^\top of estimates in (10.15) satisfies the system of normal
equations

(10.17)    \Gamma^\top \Gamma\, \hat{\vartheta} = \Gamma^\top y

where y = (y1 , . . . , yn )^\top .
Depending on the design X , the normal equations (10.17) may have
a unique or multiple solutions. If this system has a unique solution, then
the estimate fˆn (x) can be restored at any point x by (10.14). But even
when (10.17) does not have a unique solution, we can still approximate the
regression function f (x) at the design points, relying on the geometry of the
problem.
In the n-dimensional space of observations Rn , define a linear span-space
S generated by the columns γ k of matrix Γ. With a minor abuse of notation,
we also denote by S the operator in Rn of the orthogonal projection on the
span-space S . Introduce a vector consisting of the values of the regression
function at the design points,
f = ( f (x1 ), . . . , f (xn ) ) , f ∈ Rn ,
and a vector of estimates at these points,
f̂n = Sy = ( fˆ(x1 ), . . . , fˆ(xn ) ) .
Note that this projection is correctly defined regardless of whether (10.17)
has a unique solution or not.
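In code, the projection Sy is an ordinary least-squares fit of y on the columns of Γ, which is well defined even when (10.17) has multiple solutions. The sketch below is our illustration with a simple polynomial choice of the functions γ_k; the basis, data, and names are hypothetical.

import numpy as np

def span_projection(gamma_funcs, x, y):
    """Project the observation vector y onto the span of the columns
    gamma_k(x_1), ..., gamma_k(x_n), cf. (10.16)-(10.18); returns S y."""
    Gamma = np.column_stack([g(x) for g in gamma_funcs])   # n x K design matrix
    theta, *_ = np.linalg.lstsq(Gamma, y, rcond=None)      # least-squares solution of (10.17)
    return Gamma @ theta                                   # fitted values at the design points

# Illustration: K monomial basis functions gamma_k(x) = x^k
K = 9
basis = [lambda t, k=k: t**k for k in range(K)]

rng = np.random.default_rng(3)
n = 301
x = np.arange(1, n + 1) / n
f = lambda t: np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)
y = f(x) + 0.3 * rng.standard_normal(n)
fhat = span_projection(basis, x, y)
print(round(float(np.mean((fhat - f(x)) ** 2)), 4))        # discrete MISE, cf. (10.19)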
Denote by ε = ( ε1 , . . . , εn )^\top the vector of independent N (0, σ 2 ) random
errors. In this notation, we can interpret f̂n as a vector sum of two
projections,

(10.18)    \hat{\mathbf{f}}_n = S y = S \mathbf{f} + S \varepsilon.

Our goal is to find an upper bound on the discrete MISE. Conditionally
on the design X , the discrete MISE has the form

\frac{1}{n} \sum_{i=1}^{n} E_f\big[ ( \hat{f}_n(x_i) - f(x_i) )^2 \,\big|\, \mathcal{X} \big] = \frac{1}{n} E_f\big[ \| S\mathbf{f} + S\varepsilon - \mathbf{f} \|^2 \,\big|\, \mathcal{X} \big]

(10.19)    \le \frac{2}{n} \| S\mathbf{f} - \mathbf{f} \|^2 + \frac{2}{n} E_f\big[ \| S\varepsilon \|^2 \,\big|\, \mathcal{X} \big],

where \| \cdot \| is the Euclidean norm in R^n . Here we used the inequality (a + b)^2 ≤
2 (a^2 + b^2 ).
Denote by dim(S) the dimension of the span-space S. Note that neces-
sarily dim(S) ≤ K. In many special cases, this inequality turns into equality,
dim(S) = K. For example, it is true for the regressogram under Assumption
10.1 (see Exercise 10.72).
Assumption 10.8. There exists δn , δn → 0 as n → ∞, such that for any
f ∈ Θ(β), the inequality is fulfilled

\frac{1}{n} \| S\mathbf{f} - \mathbf{f} \|^2 \le \delta_n^2. 
Proposition 10.9. Let Assumption 10.8 hold. Then the following upper
bound on the discrete MISE holds:

(10.20)    \frac{1}{n} \sum_{i=1}^{n} E_f\big[ ( \hat{f}_n(x_i) - f(x_i) )^2 \,\big|\, \mathcal{X} \big] \le 2\delta_n^2 + \frac{2\sigma^2 \dim(S)}{n}.

Proof. The normal random errors εi , i = 1, . . . , n, are conditionally in-


dependent, given the design points. Therefore, the square of the Euclidean
norm  S ε 2 has a σ 2 χ2 -distribution with dim(S) degrees of freedom. Thus,

Ef  Sε 2 | X = σ 2 dim(S).
From Assumption 10.8, we find that the right-hand side of (10.19) is bounded
from above by 2δn2 + 2σ 2 dim(S)/n. 
Proposition 10.10. Assume that for any regression function f ∈ Θ(β),
there exists a linear combination a1 γ1 (x) + · · · + aK γK (x) such that at any
design point xi , the following inequality holds:

(10.21)    \big| a_1 \gamma_1(x_i) + \cdots + a_K \gamma_K(x_i) - f(x_i) \big| \le \delta_n, \quad i = 1, \ldots, n.
Then the upper bound (10.20) is valid.
Proof. Recall that S is an operator of orthogonal projection, and, therefore,


Sf is the vector in S closest to f . Applying (10.21), we see that
1 
n
 
 Sf − f 2 ≤ 1  a1 γ1 (xi ) + · · · + aK γK (xi ) − f (xi ) 2 ≤ δn2 ,
n n
i=1
and Assumption 10.8 holds. 

In the next theorem, we describe the asymptotic performance of the discrete MISE for the regressogram.
Theorem 10.11. For any fixed design X , the discrete MISE of the regressogram f̂n(x) given by (10.3) satisfies the inequality
(10.22)  $\frac{1}{n} \sum_{i=1}^{n} E_f\big[ \big( \hat f_n(x_i) - f(x_i) \big)^2 \,\big|\, X \big] \le 2 C_\rho^2 h_n^{2\beta} + \frac{2\sigma^2 \beta Q}{n}$
where Cρ = L/(β − 1)!. Moreover, under the optimal choice of the bandwidth h*n = n^{−1/(2β+1)}, there exists a positive constant r∗, independent of n and f ∈ Θ(β), such that the following upper bound holds:
$\frac{1}{n} \sum_{i=1}^{n} E_f\big[ \big( \hat f_n(x_i) - f(x_i) \big)^2 \,\big|\, X \big] \le r^* n^{-2\beta/(2\beta+1)}.$
Proof. In the case of the regressogram, dim(S) ≤ K = βQ. The Taylor approximation of f(x) in Lemma 8.5 within each bin guarantees the inequality (10.21) with δn = Cρ h_n^β. Hence Proposition 10.10 yields the upper bound (10.22). Now recall that Q/n = 1/(2nhn). Under the optimal choice of the bandwidth, both terms on the right-hand side of (10.22) have the same magnitude O(n^{−2β/(2β+1)}). □
10.5. Orthogonal Series Regression Estimator
10.5.1. Preliminaries. A set of functions B is called an orthonormal basis in L2[0, 1] if: (1) the L2-norm in the interval [0, 1] of any function in this set is equal to one, that is, $\int_0^1 g^2(x)\, dx = 1$ for any g ∈ B, and (2) the dot product of any two functions in B is zero, that is, $\int_0^1 g_1(x) g_2(x)\, dx = 0$ for any g1 , g2 ∈ B.
Consider the following set of functions defined for all x in [0, 1]:
(10.23)  $\big\{ 1, \sqrt{2} \sin(2\pi x), \sqrt{2} \cos(2\pi x), \ldots, \sqrt{2} \sin(2\pi k x), \sqrt{2} \cos(2\pi k x), \ldots \big\}.$
This set is referred to as a trigonometric basis. The next proposition is a standard result from analysis. We omit its proof.

Proposition 10.12. The trigonometric basis in (10.23) is an orthonormal basis.
Choose the trigonometric basis as a working basis in the space L2[0, 1]. For any function f(x), 0 ≤ x ≤ 1, introduce its Fourier coefficients by
$a_0 = \int_0^1 f(x)\, dx, \qquad a_k = \int_0^1 f(x) \sqrt{2} \cos(2\pi k x)\, dx,$
and
$b_k = \int_0^1 f(x) \sqrt{2} \sin(2\pi k x)\, dx, \qquad k = 1, 2, \ldots.$
The trigonometric basis is complete in the sense that if ‖ f ‖2 < ∞, and
$f_m(x) = a_0 + \sum_{k=1}^{m} a_k \sqrt{2} \cos(2\pi k x) + \sum_{k=1}^{m} b_k \sqrt{2} \sin(2\pi k x), \quad 0 \le x \le 1,$
then
$\lim_{m \to \infty} \| f_m(\cdot) - f(\cdot) \|_2 = 0.$
Thus, a function f with a finite L2-norm is equivalent to its Fourier series
$f(x) = a_0 + \sum_{k=1}^{\infty} a_k \sqrt{2} \cos(2\pi k x) + \sum_{k=1}^{\infty} b_k \sqrt{2} \sin(2\pi k x),$
though they may differ at the points of discontinuity.
The next lemma links the decrease rate of the Fourier coefficients with β, β ≥ 1, the smoothness parameter of f.

Lemma 10.13. If $\sum_{k=1}^{\infty} (a_k^2 + b_k^2) k^{2\beta} \le L$ for some constant L, then $\| f^{(\beta)} \|_2^2 \le (2\pi)^{2\beta} L$.
Proof. We restrict the calculations to the case β = 1. See Exercise 10.73 for the proof of the general case. For β = 1, we have
$f'(x) = \sum_{k=1}^{\infty} \big[ a_k (-2\pi k) \sqrt{2} \sin(2\pi k x) + b_k (2\pi k) \sqrt{2} \cos(2\pi k x) \big].$
Thus,
$\| f' \|_2^2 = (2\pi)^2 \sum_{k=1}^{\infty} k^2 (a_k^2 + b_k^2) \le (2\pi)^2 L.$ □
10.5.2. Discrete Fourier Series and Regression. Consider the observations
(10.24)  $y_i = f( i/n ) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n.$
To ease presentation, it is convenient to assume that n = 2n0 + 1 is an odd
number, n0 ≥ 1.
Our goal is to estimate the regression function f in the integral sense
in L2 [0, 1] . Since our observations are discrete and available only at the
equidistant design points xi = i/n, we want to restore the regression function
exclusively at these points.
Consider the values of the trigonometric basis functions in (10.23) at the
design points xi = i/n, i = 1, . . . , n,
(10.25)  $\Big\{ 1, \sqrt{2} \sin\Big(\frac{2\pi i}{n}\Big), \sqrt{2} \cos\Big(\frac{2\pi i}{n}\Big), \ldots, \sqrt{2} \sin\Big(\frac{2\pi k i}{n}\Big), \sqrt{2} \cos\Big(\frac{2\pi k i}{n}\Big), \ldots \Big\}.$
For any functions g, g1 and g2 ∈ L2[0, 1], define the discrete dot product and the respective squared L2-norm by the Riemann sums
$\big( g_1(\cdot), g_2(\cdot) \big)_{2,n} = \frac{1}{n} \sum_{i=1}^{n} g_1( i/n )\, g_2( i/n ) \quad \text{and} \quad \| g(\cdot) \|_{2,n}^2 = \frac{1}{n} \sum_{i=1}^{n} g^2( i/n ).$
Clearly, the values at the design points for each function in (10.25) rep-
resent a vector in Rn . Therefore, there cannot be more than n orthonormal
functions with respect to the discrete dot product. As shown in the lemma
below, the functions in (10.25) corresponding to k = 1, . . . , n0 , form an
orthonormal basis with respect to this dot product.
Lemma 10.14. Fix n = 2n0 + 1 for some n0 ≥ 1. For i = 1, . . . , n, the system of functions
$\Big\{ 1, \sqrt{2} \sin\Big(\frac{2\pi i}{n}\Big), \sqrt{2} \cos\Big(\frac{2\pi i}{n}\Big), \ldots, \sqrt{2} \sin\Big(\frac{2\pi n_0 i}{n}\Big), \sqrt{2} \cos\Big(\frac{2\pi n_0 i}{n}\Big) \Big\}$
is orthonormal with respect to the discrete dot product.

Proof. For any k and l, the elementary trigonometric identities hold:
(10.26)  $\sin\Big(\frac{2\pi k i}{n}\Big) \sin\Big(\frac{2\pi l i}{n}\Big) = \frac{1}{2} \Big[ \cos\Big(\frac{2\pi (k-l) i}{n}\Big) - \cos\Big(\frac{2\pi (k+l) i}{n}\Big) \Big]$
and
(10.27)  $\cos\Big(\frac{2\pi k i}{n}\Big) \cos\Big(\frac{2\pi l i}{n}\Big) = \frac{1}{2} \Big[ \cos\Big(\frac{2\pi (k-l) i}{n}\Big) + \cos\Big(\frac{2\pi (k+l) i}{n}\Big) \Big].$
Also, as shown in Exercise 10.74, for any integer m ≢ 0 (mod n),
(10.28)  $\sum_{i=1}^{n} \cos\Big(\frac{2\pi m i}{n}\Big) = \sum_{i=1}^{n} \sin\Big(\frac{2\pi m i}{n}\Big) = 0.$
Now fix k ≠ l such that k, l ≤ n0 . Note that then k ± l ≢ 0 (mod n). Letting m = k ± l in (10.28), and applying (10.26)–(10.28), we obtain that
$\sum_{i=1}^{n} \sin\Big(\frac{2\pi k i}{n}\Big) \sin\Big(\frac{2\pi l i}{n}\Big) = \frac{1}{2} \sum_{i=1}^{n} \cos\Big(\frac{2\pi (k-l) i}{n}\Big) - \frac{1}{2} \sum_{i=1}^{n} \cos\Big(\frac{2\pi (k+l) i}{n}\Big) = 0$
and
$\sum_{i=1}^{n} \cos\Big(\frac{2\pi k i}{n}\Big) \cos\Big(\frac{2\pi l i}{n}\Big) = \frac{1}{2} \sum_{i=1}^{n} \cos\Big(\frac{2\pi (k-l) i}{n}\Big) + \frac{1}{2} \sum_{i=1}^{n} \cos\Big(\frac{2\pi (k+l) i}{n}\Big) = 0,$
which yields that the respective dot products are zeros.
If k = l ≤ n0 , then from (10.26)–(10.28), we have
$\sum_{i=1}^{n} \sin^2\Big(\frac{2\pi k i}{n}\Big) = \frac{1}{2} \sum_{i=1}^{n} \cos(0) = n/2$
and
$\sum_{i=1}^{n} \cos^2\Big(\frac{2\pi k i}{n}\Big) = \frac{1}{2} \sum_{i=1}^{n} \cos(0) = n/2.$
These imply the normalization condition,
$\Big\| \sqrt{2} \sin\Big(\frac{2\pi k i}{n}\Big) \Big\|_{2,n}^2 = \Big\| \sqrt{2} \cos\Big(\frac{2\pi k i}{n}\Big) \Big\|_{2,n}^2 = 1.$
Finally, for any k, l ≤ n0 , from the identity
$\sin\Big(\frac{2\pi k i}{n}\Big) \cos\Big(\frac{2\pi l i}{n}\Big) = \frac{1}{2} \Big[ \sin\Big(\frac{2\pi (k+l) i}{n}\Big) + \sin\Big(\frac{2\pi (k-l) i}{n}\Big) \Big]$
and (10.28), we have
$\sum_{i=1}^{n} \sin\Big(\frac{2\pi k i}{n}\Big) \cos\Big(\frac{2\pi l i}{n}\Big) = \frac{1}{2} \sum_{i=1}^{n} \sin\Big(\frac{2\pi (k+l) i}{n}\Big) + \frac{1}{2} \sum_{i=1}^{n} \sin\Big(\frac{2\pi (k-l) i}{n}\Big) = 0.$ □
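Lemma 10.14 is also easy to confirm numerically: the Gram matrix of the system above in the discrete dot product should be the identity. The short check below is our own illustration (the value of n0 is arbitrary), assuming NumPy.

import numpy as np

n0 = 7
n = 2 * n0 + 1
i = np.arange(1, n + 1)

columns = [np.ones(n)]
for k in range(1, n0 + 1):
    columns.append(np.sqrt(2) * np.sin(2 * np.pi * k * i / n))
    columns.append(np.sqrt(2) * np.cos(2 * np.pi * k * i / n))
Phi = np.column_stack(columns)        # n x n matrix of the basis values at the design points i/n

gram = Phi.T @ Phi / n                # entries are the discrete dot products
print(np.max(np.abs(gram - np.eye(n))))   # numerically zero, confirming orthonormality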
For a given integer β and a positive constant L, introduce a class of functions Θ2,n = Θ2,n(β, L) defined at the design points xi = i/n, i = 1, . . . , n. We say that f ∈ Θ2,n(β, L) if
$f( i/n ) = a_0 + \sum_{k=1}^{n_0} \Big[ a_k \sqrt{2} \cos\Big(\frac{2\pi k i}{n}\Big) + b_k \sqrt{2} \sin\Big(\frac{2\pi k i}{n}\Big) \Big], \quad n = 2 n_0 + 1,$
where the Fourier coefficients ak and bk satisfy the condition
$\sum_{k=1}^{n_0} \big( a_k^2 + b_k^2 \big) k^{2\beta} \le L.$
Note that there are a total of n Fourier coefficients, a0 , a1 , . . . , an0 , b1 , . . . , bn0 , that define any function f in the class Θ2,n(β, L). This is to be expected, because any such function is equivalent to a vector in Rⁿ.
The class Θ2,n (β, L) replaces the Hölder class Θ(β, L, L1 ) in our earlier
studies. However, in view of Lemma 10.13, the parameter β still represents
the smoothness of regression functions.
We want to estimate the regression function f in the discrete MISE. The
quadratic risk, for which we preserve the notation Rn(f̂n , f ), has the form
$R_n(\hat f_n, f) = E_f\big[ \| \hat f_n(\cdot) - f(\cdot) \|_{2,n}^2 \big] = \frac{1}{n} \sum_{i=1}^{n} E_f\big[ \big( \hat f_n( i/n ) - f( i/n ) \big)^2 \big].$
Thus far, we have worked with the sine and cosine functions separately. It is convenient to combine them in a single notation. Put ϕ0( i/n ) = 1 and c0 = a0 . For m = 1, . . . , n0 , take
$\varphi_{2m}( i/n ) = \sqrt{2} \cos\Big(\frac{2\pi m i}{n}\Big), \quad c_{2m} = a_m,$
and
$\varphi_{2m-1}( i/n ) = \sqrt{2} \sin\Big(\frac{2\pi m i}{n}\Big), \quad c_{2m-1} = b_m.$
Note that altogether we have n basis functions ϕk , k = 0, . . . , n − 1. They
satisfy the orthonormality conditions
$\big( \varphi_k(\cdot), \varphi_l(\cdot) \big)_{2,n} = \frac{1}{n} \sum_{i=1}^{n} \varphi_k( i/n )\, \varphi_l( i/n ) = 0 \quad \text{for } k \ne l,$
(10.29)  $\text{and} \quad \| \varphi_k(\cdot) \|_{2,n}^2 = \frac{1}{n} \sum_{i=1}^{n} \varphi_k^2( i/n ) = 1.$
The regression function f at the design points xi = i/n can be written as
$f( i/n ) = \sum_{k=0}^{n-1} c_k \varphi_k( i/n ).$
It is easier to study the estimation problem with respect to the discrete L2-norm in the space of the Fourier coefficients ck , called a sequence space.
Assume that these Fourier coefficients are estimated by ĉk , k = 0, . . . , n − 1.
Then the estimator of the regression function in the original space can be
expressed by the sum
(10.30)  $\hat f_n( i/n ) = \sum_{k=0}^{n-1} \hat c_k \varphi_k( i/n ).$
Lemma 10.15. The discrete MISE of the estimator f̂n in (10.30) can be represented as
$R_n(\hat f_n, f) = E_f\big[ \| \hat f_n(\cdot) - f(\cdot) \|_{2,n}^2 \big] = \sum_{k=0}^{n-1} E_f\big[ ( \hat c_k - c_k )^2 \big].$
Proof. By the definition of the risk function Rn(f̂n , f ), we have
$R_n(\hat f_n, f) = E_f\Big[ \Big\| \sum_{k=0}^{n-1} ( \hat c_k - c_k ) \varphi_k(\cdot) \Big\|_{2,n}^2 \Big]$
$= E_f\Big[ \sum_{k,l=0}^{n-1} ( \hat c_k - c_k )( \hat c_l - c_l ) \big( \varphi_k(\cdot), \varphi_l(\cdot) \big)_{2,n} \Big] = \sum_{k=0}^{n-1} E_f\big[ ( \hat c_k - c_k )^2 \big],$
where we used the fact that the basis functions are orthonormal. □
To switch from the original observations yi to the corresponding observations in the sequence space, consider the following transformation:
(10.31)  $z_k = \big( y(\cdot), \varphi_k(\cdot) \big)_{2,n} = \frac{1}{n} \sum_{i=1}^{n} y_i\, \varphi_k( i/n ), \quad k = 0, \ldots, n-1.$
Lemma 10.16. The random variables zk , k = 0, . . . , n − 1, defined by (10.31) satisfy the equations
$z_k = c_k + \sigma \xi_k / \sqrt{n}, \quad k = 0, \ldots, n-1,$
for some independent standard normal random variables ξk .

Proof. First, observe that
$y_i = c_0 \varphi_0( i/n ) + \cdots + c_{n-1} \varphi_{n-1}( i/n ) + \varepsilon_i$
where the error terms εi are independent N(0, σ²)-random variables, i = 1, . . . , n. Thus, for any k = 0, . . . , n − 1, we can write
$z_k = \frac{1}{n} \sum_{i=1}^{n} \big[ c_0 \varphi_0( i/n ) + \cdots + c_{n-1} \varphi_{n-1}( i/n ) \big] \varphi_k( i/n ) + \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \varphi_k( i/n ).$
By the orthonormality conditions (10.29), the first sum is equal to ck , and the second one can be written as $\sigma \xi_k / \sqrt{n}$, where
$\xi_k = \frac{\sqrt{n}}{\sigma} \cdot \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \varphi_k( i/n ) = \Big( \frac{\sigma^2}{n} \cdot \frac{1}{n} \sum_{i=1}^{n} \varphi_k^2( i/n ) \Big)^{-1/2} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \varphi_k( i/n ) \sim N(0, 1).$
As a result,
$z_k = c_k + \sigma \xi_k / \sqrt{n}.$
It remains to show that the ξ’s are independent. Since they are normally
distributed, it suffices to show that they are uncorrelated. This in turn
follows from independence of the ε’s and orthogonality of the ϕ’s. Indeed,
we have that for any k ≠ l such that k, l = 0, . . . , n − 1,
$\mathrm{Cov}(\xi_k, \xi_l) = \frac{1}{\sigma^2 n} E\Big[ \Big( \sum_{i=1}^{n} \varepsilon_i \varphi_k( i/n ) \Big) \Big( \sum_{i=1}^{n} \varepsilon_i \varphi_l( i/n ) \Big) \Big]$
$= \frac{1}{\sigma^2 n} \sum_{i=1}^{n} E\big[ \varepsilon_i^2 \big] \varphi_k( i/n )\, \varphi_l( i/n ) = \frac{1}{n} \sum_{i=1}^{n} \varphi_k( i/n )\, \varphi_l( i/n ) = 0.$ □
The orthogonal series (or projection) estimator of the regression function f is defined by
(10.32)  $\hat f_n( i/n ) = \sum_{k=0}^{M} z_k \varphi_k( i/n ), \quad i = 1, \ldots, n,$
where M = M(n) is an integer parameter of the estimation procedure. Note that f̂n( i/n ) is indeed an estimator, because it is computable from the original observed responses y1 , . . . , yn . The parameter M serves to balance the bias and variance errors of the estimation. The choice
$M = M(n) = \lfloor (h_n^*)^{-1} \rfloor = \lfloor n^{1/(2\beta+1)} \rfloor,$
where ⌊·⌋ denotes the integer part of a number, turns out to be optimal in
the minimax sense. We will prove only the upper bound.
Theorem 10.17. Assume that the regression function f belongs to the class
Θ2,n (β, L). Then, uniformly over this class, the quadratic risk in the discrete
MISE of the orthogonal series estimator f̂n given by (10.32) with M = ⌊n^{1/(2β+1)}⌋ is bounded from above,
$R_n(\hat f_n, f) = E_f\big[ \| \hat f_n - f \|_{2,n}^2 \big] \le \big( \sigma^2 + 4^{\beta} L \big)\, n^{-2\beta/(2\beta+1)}.$
Proof. Consider the orthogonal series estimator f̂n(i/n) specified by (10.32) with M = ⌊n^{1/(2β+1)}⌋. Comparing this definition to the general form (10.30) of an estimator given by a Fourier series, we see that in this instance, the estimators of the Fourier coefficients ck , k = 0, . . . , n − 1, have the form
$\hat c_k = z_k \ \text{ if } k = 0, \ldots, M, \qquad \text{and} \qquad \hat c_k = 0 \ \text{ if } k = M+1, \ldots, n-1.$
Now applying Lemmas 10.15 and 10.16, we get
$E_f\big[ \| \hat f_n - f \|_{2,n}^2 \big] = \sum_{k=0}^{M} E_f\big[ ( z_k - c_k )^2 \big] + \sum_{k=M+1}^{n-1} c_k^2$
(10.33)  $= \frac{\sigma^2}{n} \sum_{k=0}^{M} E_f\big[ \xi_k^2 \big] + \sum_{k=M+1}^{n-1} c_k^2 = \frac{\sigma^2 M}{n} + \sum_{k=M+1}^{n-1} c_k^2.$
Next, let M0 = (M + 1)/2. By the definitions of the functional space Θ2,n and the basis functions ϕk , the following inequalities hold:
$\sum_{k=M+1}^{n-1} c_k^2 \le \sum_{k=M_0}^{n_0} \big( a_k^2 + b_k^2 \big) \le M_0^{-2\beta} \sum_{k=M_0}^{n_0} \big( a_k^2 + b_k^2 \big) k^{2\beta} \le L M_0^{-2\beta}.$
Substituting this estimate into (10.33), and noticing that M0 ≥ M/2, we obtain that
$R_n(\hat f_n, f) \le \frac{\sigma^2 M}{n} + L M_0^{-2\beta} \le \frac{\sigma^2 M}{n} + 2^{2\beta} L M^{-2\beta} \le \big( \sigma^2 + 4^{\beta} L \big)\, n^{-2\beta/(2\beta+1)}.$ □
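For readers who like to see the estimator in action, here is a small numerical sketch of (10.31)-(10.32). It is our own code, not part of the text: the simulated regression function, the noise level and the value of β are illustrative only, and NumPy is assumed.

import numpy as np

def orthogonal_series_estimator(y, beta):
    # y contains the observations y_1, ..., y_n at the points i/n with n = 2*n0 + 1 odd
    n = len(y)
    i = np.arange(1, n + 1)
    # rows of Phi are phi_0, phi_1, ..., phi_{n-1} in the combined sine/cosine notation
    rows = [np.ones(n)]
    for m in range(1, (n - 1) // 2 + 1):
        rows.append(np.sqrt(2) * np.sin(2 * np.pi * m * i / n))   # phi_{2m-1}
        rows.append(np.sqrt(2) * np.cos(2 * np.pi * m * i / n))   # phi_{2m}
    Phi = np.array(rows)
    z = Phi @ y / n                                   # z_k = (1/n) sum_i y_i phi_k(i/n), see (10.31)
    M = int(np.floor(n ** (1.0 / (2 * beta + 1))))    # cutoff M = floor(n^(1/(2*beta+1)))
    c_hat = np.where(np.arange(n) <= M, z, 0.0)       # keep z_0, ..., z_M, set the rest to zero
    return c_hat @ Phi                                # estimates of f(i/n), i = 1, ..., n

rng = np.random.default_rng(2)
n = 2 * 250 + 1
i = np.arange(1, n + 1)
f_true = np.sin(2 * np.pi * i / n) + 0.5 * np.cos(4 * np.pi * i / n)
y = f_true + 0.5 * rng.standard_normal(n)
f_hat = orthogonal_series_estimator(y, beta=2)
print(np.mean((f_hat - f_true) ** 2))                 # empirical discrete squared error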

Exercises

Exercise 10.69. Verify the rate of convergence in (10.4).

Exercise 10.70. Prove the inequality (10.11).

Exercise 10.71. Prove a more accurate bound in (10.11),
$E\big[ Z^* \,\big|\, X \big] \le C_z \sqrt{\ln Q}, \quad \text{where } Q = 1/(2 h_n).$
Also show that the respective balance equation for the sup-norm in Remark 10.7 is $h_n^{\beta} = \sqrt{ (n h_n)^{-1} \ln h_n^{-1} }$. Solve this equation assuming that n is large.

Exercise 10.72. Prove that for the regressogram under Assumption 10.1,
dim(S) = K = β Q.

Exercise 10.73. Prove Lemma 10.13 for any integer β > 1.

Exercise 10.74. Prove formula (10.28).


Chapter 11

Estimation by Splines

11.1. In Search of Smooth Approximation
The regressogram approach to estimation of a regression function f has
several advantages. First of all, it is a piecewise polynomial approximation
that requires the computation and storage of only β Q coefficients. Second,
the computational process for these coefficients is divided into Q subproblems, each of dimension β, which does not increase with n. Third, along with the regression function f , its derivatives f^(m) up to the order β − 1 can be estimated (see Remark 10.5). The fourth advantage is that the
regressogram works in the whole interval [0, 1], and the endpoints do not
need special treatment.
A big disadvantage of the regressogram is that it suggests a discontinu-
ous function as an approximation of a smooth regression function f (x). An
immediate idea is to smooth the regressogram, substituting it by a convolu-
tion with some kernel K,
$\text{smoother of } \hat f_n(x) = \int_{x - h_n}^{x + h_n} \frac{1}{h_n} K\Big( \frac{t - x}{h_n} \Big) \hat f_n(t)\, dt.$

Unfortunately, the convolution smoother has shortcomings as well. The endpoint effect is still present, hence at the endpoints the estimator should be defined separately. Besides, unless the kernel itself is a piecewise polynomial
function, the smoother is no longer piecewise polynomial with ensuing com-
putational difficulties.
A natural question arises: Is it possible to find a piecewise polynomial
estimator fˆn of the regression function f in the interval [0, 1] that would be

a smooth function up to a certain order? It turns out that the answer to this question is positive.
Suppose that we still have Q bins, and the regression function f is ap-
proximated in each bin by a polynomial. If the regression function belongs
to the Hölder class Θ(β, L, L1 ), then the justifiable order of each polynomial
is β − 1. Indeed, beyond this order, we do not have any control over the de-
terministic remainder term in Proposition 10.2. However, it is impossible to
have a smooth polynomial estimator fˆn with the continuous derivatives up
to the order β − 1. It would impose β constraints at each knot (breakpoint)
between the bins 2hn q, q = 1, . . . , Q − 1. Thus, the polynomial coefficients
in each next bin would be identical to those in the previous one. It makes
fˆn (x) a single polynomial of order β − 1. Clearly, we cannot approximate a
Hölder regression function by a single polynomial in the whole interval [0, 1].
Is it possible to define a piecewise polynomial fˆn that has β−2 continuous
derivatives in [0, 1]? This question makes sense if β ≥ 2 (for β = 2 it
means that the function itself is continuous). The answer to this question
is affirmative. In Q bins we have βQ polynomial coefficients. At the Q − 1
inner knots between the bins we impose (β − 1)(Q − 1) = βQ − (Q + β − 1)
gluing conditions to guarantee the continuity of fˆn with the derivatives up
to the order β − 2. Still Q + β − 1 degrees of freedom are left, at least one per
bin, that can be used to ensure some approximation quality of the estimator.
In the spirit of formula (10.14), we can try to define a smooth piecewise
polynomial approximation by

fˆn (x) = θ̂1 γ1 (x) + · · · + θ̂K γK (x), x ∈ [0, 1],

where γ1(x), . . . , γK(x) are piecewise polynomials. We require these functions to be linearly independent in order to form a basis in [0, 1]. We will show that there exists a theoretically and computationally convenient basis of piecewise polynomial functions called B-splines. To introduce the
B-splines, we need some auxiliary results presented in the next section.

11.2. Standard B-splines


Consider the linearly independent piecewise polynomial functions in Q bins,
each of order β − 1 with β − 2 continuous derivatives. Let us address the
question: What is the maximum number of such functions? As we argued in
the previous section, the answer is Q + β − 1. We can rephrase our argument
in the following way. In the first bin we can have β linearly independent
polynomials (for example, all the monomials of order m = 0, . . . , β − 1).
At each of the Q − 1 inner knots, the β − 1 constraints are imposed on the
continuity of derivatives. This leaves just one degree of freedom in each of the other Q − 1 bins. Thus, the number of piecewise polynomials in the basis equals β + (Q − 1).
First, we give the definition of a standard B-spline. Here “B” is short
for “basis” spline. It is defined for infinitely many bins with unit length and
integer knots. A standard B-spline serves as a building block for a basis of
B-splines in the interval [0, 1].
A standard B-spline of order m, denoted by Sm (u), u ∈ R, is a function
satisfying the recurrent convolution
(11.1)  $S_m(u) = \int_{-\infty}^{\infty} S_{m-1}(z)\, I_{[0,1)}(u - z)\, dz, \quad m = 2, 3, \ldots,$
with the initial standard spline S1 (u) = I[0,1) (u). Note that, by the convo-
lution formula, the function Sm (u) is the probability density function of a
sum of m independent random variables uniformly distributed on [0, 1].
Since the splines are piecewise continuous functions, their higher deriva-
tives can have discontinuities at the knots. We make an agreement to define
the derivatives as right-continuous functions. This is the reason to use the
semi-open interval in (11.1).
It is far from being obvious that a standard B-spline meets all the
requirements of a piecewise polynomial function of the certain degree of
smoothness. Nevertheless, it turns out to be true. The lemmas below de-
scribe some analytical properties of standard B-splines.
Lemma 11.1. (i) For any m ≥ 2,
(11.2)  $S_m'(u) = S_{m-1}(u) - S_{m-1}(u - 1), \quad u \in \mathbb{R}.$

(ii) For any m ≥ 2, Sm(u) is strictly positive in (0, m) and is equal to zero outside of this interval.

(iii) Sm(u) is symmetric with respect to the center of the interval (0, m), that is,
(11.3)  $S_m(u) = S_m(m - u), \quad u \in \mathbb{R}.$

(iv) For any m ≥ 1 and for any u ∈ R, the equation (called the partition of unity) holds:
(11.4)  $\sum_{j=-\infty}^{\infty} S_m(u - j) = 1.$

Proof. (i) Differentiating formally (11.1) with respect to u, we obtain
$S_m'(u) = \int_{-\infty}^{\infty} S_{m-1}(z) \big[ \delta_{\{0\}}(u - z) - \delta_{\{1\}}(u - z) \big]\, dz = S_{m-1}(u) - S_{m-1}(u - 1)$
where $\delta_{\{a\}}$ is the Dirac delta-function concentrated at a.

(ii) This part follows immediately from the definition of the standard
B-spline as a probability density.

(iii) If Uj is a uniformly distributed in [0, 1] random variable, then 1 − Uj


has the same distribution. Hence the probability density of U1 + · · · + Um
is the same as that of m − (U1 + · · · + Um ).

(iv) In view of part (ii), for a fixed u ∈ R, the sum $\sum_{j=-\infty}^{\infty} S_m(u - j)$ has only a finite number of non-zero terms. Using this fact and (11.2), we have
$\Big( \sum_{j=-\infty}^{\infty} S_m(u - j) \Big)' = \sum_{j=-\infty}^{\infty} S_m'(u - j) = \sum_{j=-\infty}^{\infty} \big[ S_{m-1}(u - j) - S_{m-1}(u - j - 1) \big]$
$= \sum_{j=-\infty}^{\infty} S_{m-1}(u - j) - \sum_{j=-\infty}^{\infty} S_{m-1}(u - j - 1) = 0.$
Indeed, the last two sums are both finite, hence they have identical values. Consequently, the sum $\sum_{j=-\infty}^{\infty} S_m(u - j)$ is a constant, c say. We write
$c = \int_0^1 c\, du = \int_0^1 \sum_{j=-\infty}^{\infty} S_m(u - j)\, du = \sum_{j=0}^{m-1} \int_0^1 S_m(u + j)\, du.$
Here we used part (ii) once again, and the fact that the variable of integration u belongs to the unit interval. Continuing, we obtain
$c = \sum_{j=0}^{m-1} \int_j^{j+1} S_m(u)\, du = \int_0^m S_m(u)\, du = 1,$
for Sm(u) is the probability density for u in the interval [0, m]. □

Now we try to answer the question: How smooth is the standard B-spline
Sm (u)? The answer can be found in the following lemma.
Lemma 11.2. For any m ≥ 2, the standard B-spline Sm (u), u ∈ R, is a
piecewise polynomial of order m − 1. It has continuous derivatives up to the
order m−2, and its derivative of order m−1 is a piecewise constant function
given by the sum
(11.5)  $S_m^{(m-1)}(u) = \sum_{j=0}^{m-1} (-1)^j \binom{m-1}{j}\, I_{[j,\, j+1)}(u).$
Proof. We start by stating the following result. For any m ≥ 2, the k-th derivative of Sm(u) can be written in the form
(11.6)  $S_m^{(k)}(u) = \sum_{j=0}^{k} (-1)^j \binom{k}{j}\, S_{m-k}(u - j), \quad k \le m - 1.$
The shortest way to verify this identity is to use induction on k starting
with (11.2). We leave it as an exercise (see Exercise 11.76).
If k ≤ m − 2, then the function Sm−k (u − j) is continuous for any j.
Indeed, all the functions Sm (u), m ≥ 2, are continuous as the convolutions
in (11.1). Thus, by (11.6), as a linear combination of continuous functions, $S_m^{(k)}(u)$, k ≤ m − 2, is continuous in u ∈ R. Also, for k = m − 1, the formula
(11.6) yields (11.5).
It remains to show that Sm (u), u ∈ R, is a piecewise polynomial of order
m − 1. From (11.2), we obtain
(11.7)  $S_m(u) = \int_0^u \big[ S_{m-1}(z) - S_{m-1}(z - 1) \big]\, dz.$
Note that by definition, S1 (u) = I[0, 1) (u) is a piecewise polynomial of order
zero. By induction, if Sm−1 (u) is a piecewise polynomial of order at most
m − 2, then so is the integrand in the above formula. Therefore, Sm (u) is
a piecewise polynomial of order not exceeding m − 1. However, from (11.5),
the (m − 1)-st derivative of Sm (u) is non-zero, which proves that Sm (u) has
order m − 1. 
Remark 11.3. To restore a standard B-spline Sm(u), it suffices to look at (11.5) as a differential equation
(11.8)  $\frac{d^{m-1} S_m(u)}{d u^{m-1}} = \sum_{j=0}^{m-1} \lambda_j\, I_{[j,\, j+1)}(u)$
with the constants λj defined by the right-hand side of (11.5), and to solve it with the zero initial conditions,
$S_m(0) = S_m'(0) = \cdots = S_m^{(m-2)}(0) = 0.$ □
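Since Sm is just an iterated convolution of indicators, both the recursion (11.1) and the partition of unity (11.4) can be checked numerically. The sketch below is our own illustration (the grid step, the spline order and the test point are arbitrary choices), assuming NumPy.

import numpy as np

du = 1e-3
m_max = 4
u = np.arange(0, m_max + du, du)
S = [np.where((u >= 0) & (u < 1), 1.0, 0.0)]          # S_1 = I_[0,1) on the grid
for m in range(2, m_max + 1):
    conv = np.convolve(S[-1], S[0]) * du              # numerical version of the convolution (11.1)
    S.append(conv[: len(u)])                          # restrict back to the grid over [0, m_max]

S4 = S[-1]
print(np.trapz(S4, u))                                # close to 1: S_4 is a probability density on (0, 4)
u0 = 1.3                                              # partition of unity (11.4) checked at one point
print(sum(np.interp(u0 - j, u, S4, left=0.0, right=0.0) for j in range(-5, 6)))   # close to 1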

11.3. Shifted B-splines and Power Splines


The results of this section play a central role in our approach to spline
approximation. They may look somewhat technical. From now on, we
assume that the order m of all the splines under consideration is given and
fixed, m ≥ 2. We start with the definition of the shifted B-splines. Consider
the shifted B-splines in the interval [0, m − 1),
Sm (u), Sm (u − 1), . . . , Sm (u − (m − 2)), 0 ≤ u < m − 1.
Let Ls be the linear space generated by the shifted B-splines,
$L_s = \big\{ LS(u) : LS(u) = a_0 S_m(u) + a_1 S_m(u-1) + \cdots + a_{m-2} S_m(u - (m-2)) \big\}$
where a0 , . . . , am−2 are real coefficients. Put a = (a0 , . . . , am−2 ) for the
vector of these coefficients.
We need more definitions. Consider the piecewise polynomial functions,
called the power splines,
(11.9)  $P_k(u) = \frac{1}{(m-1)!}\, (u - k)^{m-1}\, I(u \ge k), \quad k = 0, \ldots, m-2.$
Note that we define the power splines on the whole real axis. In what fol-
lows, however, we restrict our attention to the interval [0, m − 1).

Similar to Ls , introduce the linear space Lp of functions generated by the power splines,
$L_p = \big\{ LP(u) : LP(u) = b_0 P_0(u) + b_1 P_1(u) + \cdots + b_{m-2} P_{m-2}(u) \big\}$
with the vector of coefficients b = (b0 , . . . , bm−2 ) .
Lemma 11.4. In the interval [0, m − 1), the linear spaces Ls and Lp are
identical. Moreover, there exists a linear one-to-one correspondence between
the coefficients a and b in the linear combinations of shifted B-splines and
power splines.

Proof. The proof is postponed until Section 11.5. 

For any particular linear combination LS ∈ Ls , consider its derivatives


at the right-most point u = m − 1,
ν0 = LS (0) (m − 1) , ν1 = LS (1) (m − 1), . . . , νm−2 = LS (m−2) (m − 1),
and put ν = ( ν0 , . . . , νm−2 ) . Is it possible to restore the function LS(u) =
a0 Sm (u) + a1 Sm (u − 1) + · · · + am−2 Sm (u − (m − 2)) from these deriva-
tives? In other words, is it possible to restore in a unique way the vector of
coefficients a from ν? As the following lemma shows, the answer is affirmative.
Lemma 11.5. There exists a linear one-to-one correspondence between a
and ν.

Proof. Can be found in the last section of the present chapter. 


Remark 11.6. Though our principal interest lies in the shifted B-splines,
we had to involve the power splines for the following reason. The derivatives
of the power splines at the right-most point provide the explicit formula
(11.19), while for the shifted B-splines this relation is more complex. So,
the power splines are just a technical tool for our consideration. 
Next, let us discuss the following problem. Consider the shifted B-splines from Ls augmented by one more spline, Sm(u − (m − 1)). All these shifted
B-splines have a non-trivial effect in the interval [m − 1, m) . Assume that
a polynomial g(u) of degree m − 1 or less is given in the interval [m − 1, m).
Can we guarantee a representation of this polynomial in [m−1, m) as a linear
combination of the shifted standard B-splines? The answer to this question
is revealed in the next two lemmas. The first one explains that all these
splines, except the last one, can ensure the approximation of the derivatives
of g(u) at u = m − 1. At this important step, we rely on Lemma 11.5. The
last spline Sm (u − (m − 1)) is used to fit the leading coefficient of g(u). This
is done in the second lemma. It is remarkable that the polynomial g(u) not
only coincides with the linear combination of B-splines in [m − 1, m), but it
also controls the maximum of this linear combination for u ∈ [0, m − 1).
Lemma 11.7. Denote by ν0 = g (0) (m −1) , . . . , νm−2 = g (m−2) (m −1) the
derivatives of the polynomial g(u) at u = m−1. There exists a unique linear
combination LS(u) = a0 Sm (u) + a1 Sm (u−1) + · · · + am−2 Sm (u−(m −2))
that solves the boundary value problem
LS (0) (m − 1) = ν0 , . . . , LS (m−2) (m − 1) = νm−2 .
Moreover, there exists a constant C(m) such that
$\max_{0 \le u \le m-1} | LS(u) | \le C(m) \max\big\{ | \nu_0 |, \ldots, | \nu_{m-2} | \big\}.$

Proof. In accordance with Lemma 11.5, there exists a one-to-one linear correspondence between a and ν, which implies the inequality
$\max\big\{ |a_0|, \ldots, |a_{m-2}| \big\} \le C(m) \max\big\{ |\nu_0|, \ldots, |\nu_{m-2}| \big\}$
with a positive constant C(m). Also, we can write
$| LS(u) | = \big| a_0 S_m(u) + \cdots + a_{m-2} S_m(u - (m-2)) \big|$
$\le \max\big\{ |a_0|, \ldots, |a_{m-2}| \big\} \sum_{j=0}^{m-2} S_m(u - j)$
$\le \max\big\{ |a_0|, \ldots, |a_{m-2}| \big\} \sum_{j=-\infty}^{\infty} S_m(u - j)$
$\le \max\big\{ |a_0|, \ldots, |a_{m-2}| \big\} \le C(m) \max\big\{ |\nu_0|, \ldots, |\nu_{m-2}| \big\},$
where we applied the partition of unity (11.4). □
Lemma 11.8. For any polynomial g(u) of order m − 1, m − 1 ≤ u < m,
there exists a unique linear combination of the shifted standard B-splines
LS ∗ (u) = a0 Sm (u) + · · · + am−2 Sm (u − (m − 2)) + am−1 Sm (u − (m − 1))
such that LS ∗ (u) = g(u), if m − 1 ≤ u < m.
Proof. Find LS(u) = a0 Sm (u) + · · · + am−2 Sm (u − (m − 2)) such that all
the derivatives up to the order m − 2 of g(u) and LS(u) are identical at
u = m − 1. By Lemma 11.7 such a linear combination exists and is unique.
Note that (m−1)-st derivatives of LS(u) and g(u) are constants in [m−1, m).
If these constants are different, we can add another B-spline, LS ∗ (u) =
LS(u) + am−1 Sm (u − (m − 1)). The newly added spline Sm (u − (m − 1)) does
not change LS(u) in [0, m − 1). By choosing the coefficient am−1 properly,
we can make the (m−1)-st derivatives of LS ∗ (u) and g(u) identical while all
the derivatives of LS(u) of the smaller orders stay unchanged at u = m − 1,
because Sm (u − (m − 1)) has all derivatives up to the order m − 2 equal to
zero at u = m − 1. Figure 8 illustrates the statement of this lemma. 
Figure 8. The linear combination LS∗(u) coincides with the polynomial g(u) in [m − 1, m).
11.4. Estimation of Regression by Splines


For a chosen bandwidth hn , consider an integer number Q = 1/(2hn ) of the
bins
$B_q = \big[\, 2(q-1) h_n,\; 2 q h_n \big), \quad q = 1, \ldots, Q.$
We are supposed to work with a regression function f (x) that belongs to
a fixed Hölder class of functions Θ(β, L, L1 ). Let Sβ (u) be the standard B-
spline of order m = β. This parameter β will determine the order of all the
splines that follow. Sometimes, it will be suppressed in the notation.
The number of bins Q increases as the bandwidth hn → 0, while the
order β stays constant as n → ∞. That is why it is not restrictive to
assume that Q exceeds β. We make this assumption in this section. In the
interval [0, 1], we define a set of functions
(11.10)  $\gamma_k(x) = h_n^{\beta}\, S_\beta\Big( \frac{x - 2 h_n k}{2 h_n} \Big)\, I_{[0,1]}(x), \quad k = -\beta + 1, \ldots, Q - 1.$
We call these functions scaled splines or, simply, splines of order β . To
visualize the behavior of the splines, notice that, if we disregard the indi-
cator function I[0,1] (x), then the identity γk (x) = γk+1 (x + 2hn ) holds. It
means that as k ranges from −β + 1 to Q − 1, the functions γk (x) move
from left to right, every time shifting by 2hn , the size of one bin. Now we
have to restrict the picture to the unit interval, truncating the functions
γ−β+1 , . . . , γ−1 below zero and γQ−β+1 , . . . , γQ−1 above one. An analogy
can be drawn between the performance of γk (x) as k increases and taking
periodic snapshots of a hunchback beast who gradually crawls into the pic-
ture from the left, crosses the space, and slowly disappears on the right still
dragging its tail away. Figure 9 contains an illustration of the case β = 3.

Figure 9. Graphs of functions γk(x), k = −2, . . . , Q − 1, when β = 3.

From properties of the standard B-splines, it immediately follows that


within each bin Bq , the function γk (x) is a polynomial of order β − 1 with
continuous derivatives up to order β − 2. The knots are the endpoints of the
bins. Under the assumption that Q is greater than or equal to β, there is at
least one full-sized spline γk the support of which contains all β bins. For
instance, γ0 is such a spline.

The proof of the next lemma is postponed to the end of this chapter.
Lemma 11.9. The set of functions { γk (x), k = −β + 1, . . . , Q − 1 } forms
a basis in the linear sub-space of the smooth piecewise polynomials of order
β − 1 that are defined in bins Bq , and have continuous derivatives up to
order β − 2. That is, any γ(x) in this space admits a unique representation
(11.11)  $\gamma(x) = \sum_{k=-\beta+1}^{Q-1} \hat\theta_k\, \gamma_k(x), \quad x \in [0, 1],$
with some real coefficients θ̂−β+1 , . . . , θ̂Q−1 .

Now we return to the regression observations yi = f (xi ) + εi , i =


1, . . . , n, where f (x) is a Hölder function in Θ(β, L, L1 ). In this section, we
pursue the modest goal of the asymptotic analysis of the discrete MISE
for the spline approximation of the regression function. We want to prove a result similar to Theorem 10.11, and in particular, the analogue of inequality (10.22). Note that the proof of Theorem 10.11 is heavily based on
equality (10.22). Note that the proof of Theorem 10.11 is heavily based on
the relation between the approximation error δn (bias of the regressogram)
in Proposition 10.10 and the bandwidth hn . We need a similar result for
approximation by splines.
In the spirit of the general approximation by functions γk (x) in the space
of observations Rn , we introduce the span-space S as a linear sub-space
generated by the vectors
$\gamma_k = \big( \gamma_k(x_1), \ldots, \gamma_k(x_n) \big), \quad k = -\beta + 1, \ldots, Q - 1.$
Following the agreement of Section 10.4, by S we also denote the operator
of the orthogonal projection on this span-space.
Remark 11.10. Note that S is a linear sub-space of Rⁿ whose dimension does not exceed K = Q + β − 1. For regular designs and sufficiently
large n, this dimension should be K. But for a particular design, generally
speaking, this dimension can be strictly less than K. 

The following lemma is a version of Proposition 10.10 with δn = O(hβn ).


Its proof can be found at the end of the present chapter.
Lemma 11.11. There exists a constant C0 independent of n such that for
any regression function f ∈ Θ(β, L, L1) and for any design X = { x1 , . . . , xn }, we can find $\mathbf{f}_n^* = \big( f_n^*(x_1), \ldots, f_n^*(x_n) \big)$ that belongs to S, for which at any design point xi the following inequality holds:
(11.12)  $| f_n^*(x_i) - f(x_i) | \le C_0 h_n^{\beta}, \quad i = 1, \ldots, n.$
Remark 11.12. The vector f∗n , like any vector in S, admits the representation
$\mathbf{f}_n^* = \sum_{k=-\beta+1}^{Q-1} \hat\theta_k\, \gamma_k$
with some real coefficients θ̂k . Hence this vector can also be associated with the function defined by (11.11),
$f_n^*(x) = \sum_{k=-\beta+1}^{Q-1} \hat\theta_k\, \gamma_k(x), \quad x \in [0, 1].$

This representation, which is not necessarily unique, defines a spline approx-


imation of the function f (x). 

We are ready to extend the result stated in Theorem 10.11 for the re-
gressogram to the approximation by splines.
Theorem 11.13. For any design X , the projection
$\hat{\mathbf{f}}_n = S y = \big( \hat f_n(x_1), \ldots, \hat f_n(x_n) \big)$
of the regression observations y = ( y1 , . . . , yn ) on the span-space S generated by the splines of order β admits the upper bound of the discrete L2-norm risk
(11.13)  $\frac{1}{n} \sum_{i=1}^{n} E_f\big[ \big( \hat f_n(x_i) - f(x_i) \big)^2 \,\big|\, X \big] \le C_1 h_n^{2\beta} + \frac{\sigma^2 (Q + \beta - 1)}{n}.$
Moreover, under the optimal choice of the bandwidth $h_n = h_n^* = n^{-1/(2\beta+1)}$, the following upper bound holds:
(11.14)  $\frac{1}{n} \sum_{i=1}^{n} E_f\big[ \big( \hat f_n(x_i) - f(x_i) \big)^2 \,\big|\, X \big] \le r^* n^{-2\beta/(2\beta+1)}.$

In the above, the constants C1 and r∗ are positive and independent of n and
f ∈ Θ(β, L, L1 ).

Proof. The result follows immediately from the bound (11.12) on the approximation error by splines (cf. the proof of Theorem 10.11). □
Remark 11.14. With the splines γk (x) of this section, we could introduce
the design matrix with column vectors (10.16) as well as the system of
normal equations (10.17). In the case of B-splines, however, the system
of normal equations does not partition into sub-systems as was the case for the regressogram. This makes the asymptotic analysis of spline approximation
technically more challenging as compared to the one of the regressogram.
In particular, an analogue of Proposition 10.2 with explicit control over the
bias and the stochastic terms goes beyond the scope of this book. 
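As a complement to Remark 11.14, the projection on the spline span-space can still be computed by solving the full least-squares problem directly, even though it does not decouple bin by bin. The sketch below is our own illustration, not the book's procedure: it evaluates the scaled splines (11.10) by reusing the convolution recursion (11.1), the bandwidth and the simulated data are arbitrary choices, and NumPy is assumed.

import numpy as np

def standard_bspline_on_grid(beta, du=1e-3):
    # S_beta on a grid over [0, beta], computed by the recursion (11.1)
    u = np.arange(0, beta + du, du)
    box = np.where((u >= 0) & (u < 1), 1.0, 0.0)
    s = box.copy()
    for _ in range(beta - 1):
        s = (np.convolve(s, box) * du)[: len(u)]
    return u, s

def spline_regression(x, y, h, beta):
    # project the observations on the span of the scaled splines (11.10)
    u_grid, S_beta = standard_bspline_on_grid(beta)
    Q = int(round(1 / (2 * h)))
    ks = np.arange(-beta + 1, Q)                      # k = -beta+1, ..., Q-1
    Gamma = np.column_stack([
        h**beta * np.interp((x - 2 * h * k) / (2 * h), u_grid, S_beta, left=0.0, right=0.0)
        for k in ks
    ])
    theta_hat, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
    return Gamma @ theta_hat                          # fitted values S y at the design points

rng = np.random.default_rng(3)
n = 400
x = np.sort(rng.uniform(0, 1, n))
f = lambda t: np.sin(2 * np.pi * t) + t**2
y = f(x) + 0.3 * rng.standard_normal(n)
f_hat = spline_regression(x, y, h=0.05, beta=2)       # h chosen for illustration only
print(np.mean((f_hat - f(x)) ** 2))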

11.5. Proofs of Technical Lemmas


Proof of Lemma 11.4. It is easy to show (see Exercise 11.77) that, accord-
ing to (11.5), the (m − 1)-st derivative of LS ∈ Ls is a piecewise constant
function
(11.15) LS (m−1) (u) = λj , if j ≤ u < j + 1,
where for j = 0, . . . , m − 2,
(11.16)  $\lambda_j = \sum_{i=0}^{j} a_i (-1)^{j-i} \binom{m-1}{j-i} = a_0 (-1)^{j} \binom{m-1}{j} + a_1 (-1)^{j-1} \binom{m-1}{j-1} + \cdots + a_{j-1} (-1) \binom{m-1}{1} + a_j \binom{m-1}{0}.$
On the other hand, any power spline LP ∈ Lp also has the piecewise constant
(m − 1)-st derivative
(11.17) LP (m−1) (u) = λj , if j ≤ u < j + 1,
with
(11.18) λj = b0 + · · · + bj , j = 0, . . . , m − 2.
In (11.15) and (11.17), we have deliberately denoted the (m−1)-st derivative
by the same λj ’s because we mean them to be identical. Introduce the vector λ = ( λ0 , . . . , λm−2 ). If we look at (11.16) and (11.18) as the systems of
linear equations for a and b , respectively, we find that the matrices of these
systems are lower triangular with non-zero diagonal elements. Hence, these
systems establish the linear one-to-one correspondence between a and λ, on
the one hand, and between λ and b, on the other hand. Thus, there exists
a linear one-to-one correspondence between a and b. 
Proof of Lemma 11.5. Applying Lemma 11.4, we can find a linear com-
bination of the power splines such that
$LS(u) = LP(u) = b_0 P_0(u) + b_1 P_1(u) + \cdots + b_{m-2} P_{m-2}(u)$
$= \frac{b_0}{(m-1)!}\, u^{m-1} I_{[0,\, m-1)}(u) + \frac{b_1}{(m-1)!}\, (u-1)^{m-1} I_{[1,\, m-1)}(u) + \cdots + \frac{b_{m-2}}{(m-1)!}\, (u-(m-2))^{m-1} I_{[m-2,\, m-1)}(u).$
The derivatives of the latter combination, $\nu_j = LP^{(j)}(m-1)$, at the right-most point u = m − 1 are computable explicitly (see Exercise 11.78),
(11.19)  $\nu_j = b_0\, \frac{(m-1)^{m-j-1}}{(m-j-1)!} + b_1\, \frac{(m-2)^{m-j-1}}{(m-j-1)!} + \cdots + b_{m-2}\, \frac{1^{m-j-1}}{(m-j-1)!}.$
If we manage to restore the coefficients b from the derivatives ν , then by
Lemma 11.4, we would prove the claim. Consider (11.19) as the system of
linear equations. Then the matrix M of this system is an (m − 1) × (m − 1)
matrix with the elements
(11.20)  $M_{j,k} = \frac{(m-k-1)^{m-j-1}}{(m-j-1)!}, \quad j, k = 0, \ldots, m-2.$
The matrix M is invertible because its determinant is non-zero (see Exercise
11.79). Thus, the lemma follows. 

Proof of Lemma 11.9. As shown above, the dimension of the space of


smooth piecewise polynomials of order no greater than β − 1 equals Q +
β − 1 which matches the number of functions γk (x). Thus, the only question
is about linear independence of functions γk (x), x ∈ [0, 1]. Consider the
functions γk (x) for k = −β + 1, . . . , Q − β. In this set, each consecutive


function has a support that contains a new bin not included in the union of
all the previous supports. That is why the γk (x)’s are linearly independent
for k = −β + 1, . . . , Q − β. Hence, the linear combinations
$L_1 = \big\{ a_{-\beta+1}\, \gamma_{-\beta+1}(x) + \cdots + a_{Q-\beta}\, \gamma_{Q-\beta}(x) \;\big|\; a_{-\beta+1}, \ldots, a_{Q-\beta} \in \mathbb{R} \big\}$
form a linear space of functions of dimension Q. A similar argument shows


that the linear combinations of the remaining splines
$L_2 = \big\{ a_{Q-\beta+1}\, \gamma_{Q-\beta+1}(x) + \cdots + a_{Q-1}\, \gamma_{Q-1}(x) \;\big|\; a_{Q-\beta+1}, \ldots, a_{Q-1} \in \mathbb{R} \big\}$
form a linear space of dimension β − 1.


Since the supports of the functions from L1 cover the whole semi-open
interval [0, 1), the “support” argument does not prove independence of L1
and L2 . We have to show that these spaces intersect only at the origin. In-
deed, by the definition of the standard B-splines, the first (β − 2) derivatives
of any function from L1 at x = 1 are zeros. On the other hand, by Lemma
11.7, a function from L2 has all its first β − 2 derivatives equal to zero, if
and only if all its coefficients are zeros, aQ−β+1 = · · · = aQ−1 = 0. Thus,
a zero function is the only one that simultaneously belongs to L1 and L2 . 

Proof of Lemma 11.11. Put q(l) = 1+βl. Consider all the bins Bq(l) with
l = 0, . . . , (Q − 1)/β. Without loss of generality, we assume that (Q − 1)/β
is an integer, so that the last bin BQ belongs to this subsequence. Note that
the indices of the bins in the subsequence Bq(l) are equal to 1 modulo β,
and that any two consecutive bins Bq(l) and Bq(l+1) in this subsequence are
separated by (β − 1) original bins.
Let xl = 2(q(l) − 1)hn denote the left endpoint of the bin Bq(l) , l =
0, 1, . . . . For any regression function f ∈ Θ(β, L, L1 ), introduce the Taylor
expansion of f (x) around x = xl ,
(11.21)  $\pi_l(x) = f(x_l) + \frac{f^{(1)}(x_l)}{1!} (x - x_l) + \cdots + \frac{f^{(\beta-1)}(x_l)}{(\beta-1)!} (x - x_l)^{\beta-1}, \quad x \in B_{q(l)}.$

In accordance with Lemma 11.8, for any l, there exists a linear combination of the splines that coincides with πl(x) in Bq(l). This implies that
$\pi_l(x) = a_{q(l)-\beta}\, \gamma_{q(l)-\beta}(x) + \cdots + a_{q(l)-1}\, \gamma_{q(l)-1}(x), \quad x \in B_{q(l)},$
with some uniquely defined real coefficients ak , k = q(l) − β, . . . , q(l) − 1. Note that as l runs from 0 to (Q − 1)/β, each of the splines γk(x), k =
Note that as l runs from 1 to (Q − 1)/β, each of the splines γk (x), k =
−β + 1, . . . , Q − 1, participates exactly once in these linear combinations.
Consider the sum
$\gamma(x) = \sum_{0 \le l \le (Q-1)/\beta} \big[ a_{q(l)-\beta}\, \gamma_{q(l)-\beta}(x) + \cdots + a_{q(l)-1}\, \gamma_{q(l)-1}(x) \big] = \sum_{k=-\beta+1}^{Q-1} a_k\, \gamma_k(x), \quad 0 \le x \le 1.$
This function γ(x) defines a piecewise polynomial of order at most β − 1 that coincides with the Taylor polynomial (11.21) in all the bins Bq(l) (see Figure 10). Hence, in the union $\cup_l B_{q(l)}$ of these bins, the function γ(x)
does not deviate away from f (x) by more than O(hβn ), this magnitude being
preserved uniformly over f ∈ Θ(β, L, L1 ).
Next, how close is γ(x) to f (x) in the rest of the unit interval? We want
to show that the same magnitude holds for all x ∈ [0, 1], that is,
(11.22)  $\max_{0 \le x \le 1} | \gamma(x) - f(x) | \le C_1 h_n^{\beta}$

with a constant C1 independent of f ∈ Θ(β, L, L1 ).

Figure 10. Schematic graphs of the functions γ(x) and Δγ1(x) for x lying in bins B1 through B1+β .

Clearly, it is sufficient to estimate the absolute value | γ(x) − f (x) | in


the gap between two consecutive bins Bq(l) and Bq(l+1) . Consider the in-
terval [xl , xl+1 + 2hn ). It covers all the bins from Bq(l) to Bq(l+1) , inclu-
sively. The length of this interval is 2hn (β + 1). Hence, the regression
function f (x) does not deviate away from its Taylor approximation πl (x) in
this interval by more than O(hβn ) uniformly over the Hölder class. Thus,
to verify (11.22), it is enough to check the magnitude of the difference
Δγl (x) = γ(x) − πl (x), x ∈ [xl , xl+1 + 2hn ). Note that this difference is
a piecewise polynomial of order at most β − 1 in the bins. In particular, it
is a zero function for x ∈ Bq(l) , and is equal to Δπl (x) = πl+1 (x) − πl (x) for
x ∈ Bq(l+1) (see Figure 10).

We want to rescale Δπl (x) to bring it to the scale of the integer bins of
unit length. Put
$g(u) = h_n^{-\beta}\, \Delta\pi_l\big( x_l + 2 h_n (u + 1) \big) \quad \text{with } 0 \le u \le \beta - 1,$
so that u = 0 corresponds to the left endpoint of Bq(l) and u = β − 1 corresponds to the left endpoint of Bq(l+1). Next, we compute the derivatives of g(u) at u = β − 1,
$\nu_j = \frac{d^j}{du^j} g(\beta - 1) = h_n^{-\beta} (2 h_n)^j \frac{d^j}{dx^j} \Delta\pi_l(x_{l+1})$
$= 2^j h_n^{j-\beta} \Big[ f^{(j)}(x_{l+1}) - \Big( f^{(j)}(x_l) + \frac{f^{(j+1)}(x_l)}{1!} (x_{l+1} - x_l) + \cdots + \frac{f^{(\beta-1)}(x_l)}{(\beta-1-j)!} (x_{l+1} - x_l)^{\beta-1-j} \Big) \Big].$

Note that the expression in the brackets on the right-hand side is the
remainder term of the Taylor expansion of the j-th derivative f (j) (xl+1 )
around xl . If f ∈ Θ(β, L, L1 ) , then f (j) belongs to the Hölder class Θ(β −
j, L, L2 ) with some positive constant L2 (see Exercise 11.81). Similar to
Lemma 10.2, this remainder term has the magnitude $O( | x_{l+1} - x_l |^{\beta-j} ) = O( h_n^{\beta-j} )$.

Thus, in the notation of Lemma 11.7, max |ν0 |, . . . , |νβ−1 | ≤ C1 where
the constant C1 does not depend on n nor l. From Lemma 11.7, the unique
spline of order β with zero derivatives at u = 0 and the given derivatives νj
at u = β −1 is uniformly bounded for 0 ≤ u ≤ β −1. Since this is true for any
l, we can conclude that |g(u)| ≤ C2 = C(β) C1 at all u where this function
is defined. The constant C(β) is introduced in Lemma 11.7, m = β. So, we
proved that max0≤x≤1 | γ(x) − f (x) | = O(hβn ) , which implies (11.12). 
Exercises

Exercise 11.75. Find explicitly the standard B-splines S2 and S3 . Graph


these functions.

Exercise 11.76. Prove (11.6).

Exercise 11.77. Prove (11.16).

Exercise 11.78. Prove (11.19).

Exercise 11.79. Show that the determinant det M of the matrix M with
the elements defined by (11.20) is non-zero. Hint: Show that this deter-
minant is proportional to the determinant of the generalized Vandermonde
matrix
$V_m = \begin{bmatrix} x_1 & x_1^2 & \cdots & x_1^m \\ x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & & \vdots \\ x_m & x_m^2 & \cdots & x_m^m \end{bmatrix}$
with distinct x1 , . . . , xm . Look at det Vm as a function of xm . If xm equals


either x1 , or x2 , . . . , xm−1 , then the determinant is zero. Consequently,
$\det V_m = v(x_1, x_2, \ldots, x_m)\, (x_m - x_1)(x_m - x_2) \cdots (x_m - x_{m-1})$
for some function v. Now expand along the last row to see that the deter-
minant is a polynomial in xm of order m with the highest coefficient equal
to det Vm−1 . Thus, the recursive formula holds:
det Vm = det Vm−1 xm (xm − x1 )(xm − x2 ) . . . (xm − xm−1 ), det V1 = 1.
From here deduce that $\det V_m = x_1 x_2 \cdots x_m \prod_{i<j} (x_j - x_i) \ne 0$.

Exercise 11.80. Prove the statement similar to Lemma 11.8 for the power
splines. Show that for any polynomial g(u) of order m − 1 in the interval
m − 1 ≤ u < m, there exists a unique linear combination LP ∗ (u) of power
splines
LP ∗ (u) = b0 P0 (u) + b1 P1 (u) + · · · + bm−2 Pm−2 (u) + bm−1 Pm−1 (u)
such that LP∗(u) = g(u) if m − 1 ≤ u < m. Apply this result to represent
the function g(u) = 2 − u2 in the interval [2, 3) by the power splines.

Exercise 11.81. Show that if f ∈ Θ(β, L, L1 ), then f (j) belongs to the


Hölder class Θ(β − j, L, L2 ) with some positive constant L2 .
Chapter 12

Asymptotic Optimality
in Global Norms

In Chapter 10, we studied the regression estimation problem for the integral
L2 -norm and the sup-norm risks. The upper bounds in Theorems 10.3 and
10.6 guarantee the rates of convergence in these norms that are $n^{-\beta/(2\beta+1)}$ and $\big( n/\ln n \big)^{-\beta/(2\beta+1)}$, respectively. These rates hold for any function f
in the Hölder class Θ(β, L, L1 ), and they are attained by the regressogram
with the properly chosen bandwidths.
The question that we address in this chapter is whether these rates can
be improved by any other estimators. The answer turns out to be negative.
We will prove the lower bounds that show the minimax optimality of the
regressogram.

12.1. Lower Bound in the Sup-Norm


In this section we prove that the rate
(12.1)  $\psi_n = \Big( \frac{\ln n}{n} \Big)^{\beta/(2\beta+1)}$
is the minimax rate of convergence in the Hölder class of regression functions
Θ(β) = Θ(β, L, L1 ) if the losses are measured in the sup-norm. The theorem
below is another example of a “lower bound” (see Theorems 9.16 and 9.17).
As explained in Section 9.4, the lower bound may not hold for any design.
We use the definitions of regular deterministic and random designs as in
Sections 7.4 and 9.3.

Theorem 12.1. Let the deterministic design X be defined by (7.17) with a continuous and strictly positive density p(x) on [0, 1]. Then for all large n, and for any estimator f̂n of the regression function f , the following inequality holds:
(12.2)  $\sup_{f \in \Theta(\beta)} E_f\big[ \psi_n^{-1} \| \hat f_n - f \|_{\infty} \big] \ge r^*$
where ψn is defined in (12.1), and a positive constant r∗ is independent of n.

Proof. As in the proof of Theorem 9.16, take any “bump” function ϕ(t), t ∈
R, such that ϕ(0) > 0 , ϕ(t) = 0 if |t| > 1, and | ϕ(β) (t) | ≤ L. Clearly, this
function has a finite L2-norm, ‖ϕ‖2 < ∞. We want this norm to be small; therefore, below we make an appropriate choice of ϕ(t). Take a bandwidth $h_n^* = \big( (\ln n)/n \big)^{1/(2\beta+1)}$, and consider the bins
$B_q = \big[\, 2 h_n^* (q-1),\; 2 h_n^* q \big), \quad q = 1, \ldots, Q,$
where we assume without loss of generality that Q = 1/(2h∗n) is an integer. Introduce the test functions f0(t) = 0 and
(12.3)  $f_q(t) = (h_n^*)^{\beta}\, \varphi\Big( \frac{t - c_q}{h_n^*} \Big), \quad t \in [0, 1], \quad q = 1, \ldots, Q,$
where cq is the center of the bin Bq . Note that each function fq (t) takes
non-zero values only within the respective bin Bq . For any small enough
h∗n , the function fq belongs to the Hölder class Θ(β, L, L1 ). This fact was
explained in the proof of Theorem 9.16.
Recall that under the hypothesis f = fq , the observations yi in the
nonparametric regression model satisfy the equation
yi = fq (xi ) + εi , i = 1, . . . , n,
where the xi ’s are the design points, the εi ’s are independent N (0, σ 2 )-
random variables. Put
(12.4)  $d_0 = \frac{1}{2} (h_n^*)^{\beta} \| \varphi \|_{\infty} > 0.$
Note that by definition,
(12.5)  $\| f_l - f_q \|_{\infty} = 2 d_0, \quad 1 \le l < q \le Q,$
and
(12.6)  $\| f_q \|_{\infty} = \| f_q - f_0 \|_{\infty} = 2 d_0, \quad q = 1, \ldots, Q.$

Introduce the random events
$D_q = \big\{ \| \hat f_n - f_q \|_{\infty} \ge d_0 \big\}, \quad q = 0, \ldots, Q.$
Observe that for any q, 1 ≤ q ≤ Q, the inclusion $\bar{D}_0 \subseteq D_q$ takes place, where $\bar{D}_0$ is the complement of D0 . Indeed, by the triangle inequality, if f̂n is closer to f0 = 0 than d0 , that is, if ‖f̂n‖∞ < d0 , then it deviates away from any fq by no less than d0 ,
$\| \hat f_n - f_q \|_{\infty} \ge \| f_q \|_{\infty} - \| \hat f_n \|_{\infty} = 2 d_0 - \| \hat f_n \|_{\infty} \ge d_0, \quad q = 1, \ldots, Q.$

Further, we will need the following lemma. We postpone its proof to the
end of the section.

Lemma 12.2. Under the assumptions of Theorem 12.1, for any small δ > 0,
there exists a constant c0 > 0 such that if ‖ϕ‖2² ≤ c0 , then for all large n,
$\max_{0 \le q \le Q} P_{f_q}( D_q ) \ge \frac{1}{2} (1 - \delta).$
Now, we apply Lemma 12.2 to find that for all n large enough, the
following inequalities hold:
$\sup_{f \in \Theta(\beta)} E_f\big[ \| \hat f_n - f \|_{\infty} \big] \ge \max_{0 \le q \le Q} E_{f_q}\big[ \| \hat f_n - f_q \|_{\infty} \big]$
$\ge d_0 \max_{0 \le q \le Q} P_{f_q}\big( \| \hat f_n - f_q \|_{\infty} \ge d_0 \big) = d_0 \max_{0 \le q \le Q} P_{f_q}( D_q )$
$\ge \frac{1}{2}\, d_0 (1 - \delta) = \frac{1}{4} (h_n^*)^{\beta} \| \varphi \|_{\infty} (1 - \delta),$
and we can choose $r^* = \frac{1}{4} \| \varphi \|_{\infty} (1 - \delta)$. □
Remark 12.3. Contrast the proof of Theorem 12.1 with that of Theorem
9.16. The proof of the latter theorem was based on two hypotheses, f = f0
or f = f1 , with the likelihood ratio that stayed finite as n → ∞. In the sup-
norm, however, the proof of the rate of convergence is complicated by the
extra log-factor, which prohibits using the same idea. The likelihood ratios
in the proof of Theorem 12.1 are vanishing as n → ∞. To counterweigh that
fact, a growing number of hypotheses is selected. Note that the number of
hypotheses Q + 1 ≤ n1/(2β+1) has the polynomial rate of growth as n goes
to infinity. 

The next theorem handles the case of a random design. It shows that if
the random design is regular, then the rate of convergence of the sup-norm
risk is the same as that in the deterministic case. Since the random design
can be “very bad” with a positive probability, the conditional risk for given
design points does not guarantee even the consistency of estimators. That
is why we study the unconditional risks. The proof of the theorem below is
left as an exercise (see Exercise 12.83).
Theorem 12.4. Let X be a random design such that the design points
xi are independent with a continuous and strictly positive density p(x) on
[0, 1]. Then for all sufficiently large n, and for any estimator f̂n(x) of the regression function f(x), the following inequality holds:
(12.7)  $\sup_{f \in \Theta(\beta)} E_f\big[ \psi_n^{-1} \| \hat f_n - f \|_{\infty} \big] \ge c_0$
with a positive constant c0 independent of n.

Proof of Lemma 12.2. From the inclusion $\bar{D}_0 \subseteq D_q$, which holds for any q = 1, . . . , Q, we have that
$\max_{0 \le q \le Q} P_q( D_q ) = \max\Big\{ P_0( D_0 ),\; \max_{1 \le q \le Q} P_q( D_q ) \Big\}$
$\ge \frac{1}{2}\Big[ P_0( D_0 ) + \max_{1 \le q \le Q} P_q( \bar{D}_0 ) \Big]$
$\ge \frac{1}{2}\Big[ P_0( D_0 ) + \frac{1}{Q} \sum_{q=1}^{Q} E_0\big[ I( \bar{D}_0 ) \exp( L_{n,q} ) \big] \Big]$
(12.8)  $= \frac{1}{2}\Big[ P_0( D_0 ) + E_0\Big( I( \bar{D}_0 )\, \frac{1}{Q} \sum_{q=1}^{Q} \exp( L_{n,q} ) \Big) \Big].$

In the above, by Ln,q we denoted the log-likelihood ratios
$L_{n,q} = \ln \frac{dP_q}{dP_0}, \quad q = 1, \ldots, Q.$
They admit the asymptotic representation
(12.9)  $L_{n,q} = \sigma_{n,q} N_{n,q} - \frac{1}{2} \sigma_{n,q}^2$
where for every q = 1, . . . , Q,
(12.10)  $\sigma_{n,q}^2 = \sigma^{-2} \sum_{i=1}^{n} f_q^2(x_i) = n (h_n^*)^{2\beta+1} \sigma^{-2} p(c_q) \| \varphi \|_2^2 \big( 1 + o_n(1) \big)$
with on(1) vanishing as n → ∞ uniformly in q. The random variables Nn,q in (12.9) are standard normal and independent for different q.
Let p∗ denote the maximum of the density p(x), p∗ = max0 ≤ x ≤ 1 p(x).
Recall that $(h_n^*)^{2\beta+1} = (\ln n)/n$. Thus, if n is large enough, then
(12.11)  $\sigma_{n,q}^2 \le 2 \sigma^{-2} p^* c_0 \ln n = c_1 \ln n$
where the constant c1 = 2σ −2 p∗ c0 is small if c0 is small. Note that the
constant c1 is independent of q.
Put $\xi_n = Q^{-1} \sum_{q=1}^{Q} \exp( L_{n,q} )$. The first and the second moments of ξn are easily computable. Indeed, since by definition $E_0[ \exp( L_{n,q} ) ] = 1$, we
have that E0[ξn] = 1. Applying the independence of the random variables Nn,q for different q, we find that
$E_0\big[ ( \xi_n - 1 )^2 \big] = Q^{-2} \sum_{q=1}^{Q} E_0\big[ \big( \exp( L_{n,q} ) - 1 \big)^2 \big] \le Q^{-2} \sum_{q=1}^{Q} E_0\big[ \exp( 2 L_{n,q} ) \big]$
$= Q^{-2} \sum_{q=1}^{Q} E_0\big[ \exp\big( 2 \sigma_{n,q} N_{n,q} - \sigma_{n,q}^2 \big) \big] = Q^{-2} \sum_{q=1}^{Q} \exp\{ \sigma_{n,q}^2 \} \le Q^{-1} e^{c_1 \ln n} = Q^{-1} n^{c_1} = 2 h_n^* n^{c_1}.$

If we now take c0 so small that $c_1 = 2\sigma^{-2} p^* c_0 = 1/(2\beta+1) - \varepsilon$ for some small ε > 0, then $E_0[ ( \xi_n - 1 )^2 ] \le 2 h_n^* n^{c_1} = 2 (\ln n)^{1/(2\beta+1)} n^{-\varepsilon}$, which becomes arbitrarily small for sufficiently large n.
Next, by the Chebyshev inequality, we have
$P_0( D_0 ) + E_0\big[ I( \bar{D}_0 )\, \xi_n \big] \ge (1 - \delta_0)\, P_0\big( \xi_n \ge 1 - \delta_0 \big)$
$\ge (1 - \delta_0)\, P_0\big( | \xi_n - 1 | \le \delta_0 \big) = (1 - \delta_0) \big( 1 - P_0( | \xi_n - 1 | > \delta_0 ) \big)$
$\ge (1 - \delta_0) \big( 1 - \delta_0^{-2} E_0[ ( \xi_n - 1 )^2 ] \big) \to 1 - \delta_0 \ge 1 - \delta \quad \text{if } \delta_0 < \delta.$
Plugging this expression into (12.8), we obtain the result of the lemma. □

12.2. Bound in L2 -Norm. Assouad’s Lemma


To prove the lower bound in the L2 -norm, a more elaborate construction is
required as compared to estimation at a point (Section 9.3) or in the sup-
norm (Section 12.1). The method we use here is a modified version of what
is known in nonparametric statistics as Assouad’s Lemma. This method can
be relatively easily explained if we start with the definitions similar to those
given for the result at a fixed point in Theorem 9.16.
We will proceed under the assumptions that the design points are deter-
ministic, regular and controlled by a density p(x) which is continuous and
strictly positive in [0, 1]. As in Section 9.3, take a function ϕ(u) ≥ 0, u ∈ R,
that satisfies all the properties mentioned in that section. The key prop-
erties are that this function is smooth and its support is [−1, 1].
In the proof of Theorem 9.16, we defined the two test functions f0 (t)
and f1 (t), t ∈ [0, 1]. To extend this definition to the L2 -norm, consider Q
bins B1 , . . . , BQ , centered at cq , q = 1, . . . , Q, each of the length 2h∗n where
h∗n = n−1/(2β+1) . Without loss of generality, Q = 1/(2h∗n ) is an integer.
Denote by ΩQ a set of Q-dimensional binary vectors
$\Omega_Q = \big\{ \omega : \omega = (\omega_1, \ldots, \omega_Q),\ \omega_q \in \{0, 1\},\ q = 1, \ldots, Q \big\}.$
The number of elements in ΩQ is equal to 2^Q. To study the lower bound in the L2-norm, define 2^Q test functions by
(12.12)  $f(t, \omega) = \omega_1 (h_n^*)^{\beta} \varphi\Big( \frac{t - c_1}{h_n^*} \Big) + \cdots + \omega_Q (h_n^*)^{\beta} \varphi\Big( \frac{t - c_Q}{h_n^*} \Big)$
where the variable t belongs to the interval [0, 1] and ω ∈ ΩQ . A proper
choice of ϕ(u) guarantees that each function f (t, ω) belongs to the Hölder
class Θ(β, L, L1 ).
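To make the construction (12.12) concrete, the following sketch builds one test function f(·, ω) from a compactly supported bump. It is our own illustration rather than part of the proof: the particular bump ϕ, the fixed bandwidth and the grid are arbitrary choices (in the proof h_n^* = n^{−1/(2β+1)}), and NumPy is assumed.

import numpy as np

def phi(u):
    # a smooth bump supported on (-1, 1); the guard avoids division by zero at |u| = 1
    inside = np.abs(u) < 1
    return np.where(inside, np.exp(-1.0 / (1.0 - np.minimum(u**2, 1 - 1e-12))), 0.0)

beta = 2
h = 0.05                                       # illustrative bandwidth, so that Q = 10 bins
Q = int(1 / (2 * h))
centers = 2 * h * np.arange(1, Q + 1) - h      # centers c_q of the bins B_q
t = np.linspace(0, 1, 2001)

def f_omega(omega):
    # f(t, omega) = sum_q omega_q (h^beta) phi((t - c_q)/h) on the grid t
    return sum(w * h**beta * phi((t - c) / h) for w, c in zip(omega, centers))

omega = np.random.default_rng(4).integers(0, 2, size=Q)   # one of the 2^Q binary vectors
print(Q, np.max(np.abs(f_omega(omega))))                   # sup-norm is at most h^beta * max|phi|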
Before continuing we introduce more notation. Define by Yq the σ-
algebra generated by the regression observations yi = f (xi ) + εi with the
design points xi ∈ Bq , q = 1, . . . , Q. For any estimator fˆn of the regression
function, we define the conditional expectation
$\hat f_{n,q} = E_f\big[ \hat f_n \,\big|\, Y_q \big].$
Note that fˆn, q = fˆn, q (t) depends only on the observations within the bin
Bq .
For the sake of brevity, below we denote the conditional expectation
Ef (·, ω) [ · | X ] by Eω [ · ], suppressing dependence on the test function and the
design.

By the definition of the L2-norm, we obtain that
$E_\omega\big[ \| \hat f_n(\cdot) - f(\cdot, \omega) \|_2^2 \big] = \sum_{q=1}^{Q} E_\omega\big[ \| \hat f_n(\cdot) - f(\cdot, \omega) \|_{2, B_q}^2 \big].$

Lemma 12.5. For any estimator fˆn of the regression function f (t, ω) the
following inequality holds:
 
Eω  fˆn (·) − f (·, ω) 2 ≥ Eω  fˆn, q (·) − f (·, ω) 2
2, Bq 2, Bq

   β  t − cq  2 
(12.13) = Eω fˆn, q (t) − ωq h∗n ϕ dt .
Bq h∗n

Proof. First conditioning on the σ-algebra Yq , and then applying Jensen’s


inequality to the convex quadratic function, we obtain that for any q =
1, . . . , Q,
    
Eω  fˆn (·) − f (·, ω) 22, Bq = Eω Eω  fˆn (·) − f (·, ω) 22, Bq  Yq
   
≥ Eω  Eω fˆn (·)  Yq − f (·, ω) 22, Bq

= Eω  fˆn, q (·) − f (·, ω) 22, Bq (by definition of fˆn, q )
   β  t − cq  2 
= Eω fˆn, q (t) − ωq h∗n ϕ dt .
Bq h∗n
At the last step we used the definition of the function f (t, ω) in the bin
Bq . 

In (12.13), the function fˆn, q (t) depends only on the regression observa-
tions with the design points in Bq . We will denote the expectation relative
to these observations by Eωq . We know that Eωq is computed with respect
to one of the two probability measures P{ωq =0} or P{ωq =1} . These measures
are controlled entirely by the performance of the test function f (·, ω) in the
bin Bq .
Lemma 12.6. There exists a constant r0 , which depends only on the design
density p and the chosen function ϕ, such that for any q, 1 ≤ q ≤ Q, and
for any Yq-measurable estimator f̂n,q , the following inequality holds:
$\max_{\omega_q \in \{0, 1\}} E_{\omega_q}\Big[ \int_{B_q} \Big( \hat f_{n,q}(t) - \omega_q (h_n^*)^{\beta} \varphi\Big( \frac{t - c_q}{h_n^*} \Big) \Big)^2 dt \Big] \ge r_0 / n.$

Proof. We proceed as in the proof of Theorem 9.16. At any fixed t, t ∈ Bq , we obtain that
$\max_{\omega_q \in \{0,1\}} E_{\omega_q}\Big[ \int_{B_q} \Big( \hat f_{n,q}(t) - \omega_q (h_n^*)^{\beta} \varphi\Big( \frac{t - c_q}{h_n^*} \Big) \Big)^2 dt \Big]$
$\ge \frac{1}{2} E_{\{\omega_q=0\}}\Big[ \int_{B_q} \hat f_{n,q}^2(t)\, dt \Big] + \frac{1}{2} E_{\{\omega_q=1\}}\Big[ \int_{B_q} \Big( \hat f_{n,q}(t) - (h_n^*)^{\beta} \varphi\Big( \frac{t - c_q}{h_n^*} \Big) \Big)^2 dt \Big]$
(12.14)  $= \frac{1}{2} E_{\{\omega_q=0\}}\Big[ \int_{B_q} \hat f_{n,q}^2(t)\, dt + \frac{dP_{\{\omega_q=1\}}}{dP_{\{\omega_q=0\}}} \int_{B_q} \Big( \hat f_{n,q}(t) - (h_n^*)^{\beta} \varphi\Big( \frac{t - c_q}{h_n^*} \Big) \Big)^2 dt \Big]$
where
$\ln \frac{dP_{\{\omega_q=1\}}}{dP_{\{\omega_q=0\}}} = \sigma_{n,q} N_q - \frac{1}{2} \sigma_{n,q}^2$
with a standard normal random variable Nq and
$\lim_{n \to \infty} \sigma_{n,q}^2 = \sigma^{-2} p(c_q) \| \varphi \|_2^2.$
For all large n and any q = 1, . . . , Q , the standard deviation σn, q is separated
away from zero and infinity. Hence,
$P_{\{\omega_q=0\}}\big( N_q > \sigma_{n,q}/2 \big) \ge p_0$
for a positive constant p0 independent of n and q. If the random event
{ Nq > σn, q /2 } holds, then we can estimate the likelihood ratio on the
right-hand side of (12.14) from below by 1.
Next, note that for any functions f̂n and g, the inequality $\| \hat f_n \|_2^2 + \| \hat f_n - g \|_2^2 \ge \frac{1}{2} \| g \|_2^2$ holds. Applied to f̂n,q , it provides the lower bound
$\int_{B_q} \Big[ \hat f_{n,q}^2(t) + \Big( \hat f_{n,q}(t) - (h_n^*)^{\beta} \varphi\Big( \frac{t - c_q}{h_n^*} \Big) \Big)^2 \Big] dt$
$\ge \frac{1}{2} (h_n^*)^{2\beta} \int_{B_q} \varphi^2\Big( \frac{t - c_q}{h_n^*} \Big) dt = \frac{1}{2} (h_n^*)^{2\beta+1} \| \varphi \|_2^2 = \frac{1}{2n} \| \varphi \|_2^2.$
Finally, combining these estimates, we obtain that
$\max_{\omega_q \in \{0,1\}} E_{\omega_q}\Big[ \int_{B_q} \Big( \hat f_{n,q}(t) - \omega_q (h_n^*)^{\beta} \varphi\Big( \frac{t - c_q}{h_n^*} \Big) \Big)^2 dt \Big] \ge p_0 \| \varphi \|_2^2 / (2n) = r_0 / n$
with $r_0 = p_0 \| \varphi \|_2^2 / 2$. □

After these technical preparations, we are ready to formulate the minimax lower bound for estimation of the Hölder class functions in the L2-norm.

Theorem 12.7. Let the deterministic design X be defined by (7.17) with a continuous and strictly positive density p(x) in [0, 1]. There exists a positive constant r∗ such that for any estimator f̂n(t), the following asymptotic lower bound holds:
$\liminf_{n \to \infty}\ \sup_{f \in \Theta(\beta, L)} n^{2\beta/(2\beta+1)} E_f\big[ \| \hat f_n - f \|_2^2 \big] \ge r_*.$

Proof. We use the notation introduced in Lemmas 12.5 and 12.6. Applying the former lemma, we obtain the inequalities
\[
\sup_{f\in\Theta(\beta,L)} E_f\big[\,\|\hat f_n - f\|_2^2\,\big]
\ge \max_{\omega\in\Omega_Q} E_\omega\big[\,\|\hat f_n(\cdot)-f(\cdot,\omega)\|_2^2\,\big]
\ge \max_{\omega\in\Omega_Q}\sum_{q=1}^{Q} E_{\omega_q}\Big[\int_{B_q}\Big(\hat f_{n,q}(t)-\omega_q(h_n^*)^\beta\varphi\Big(\frac{t-c_q}{h_n^*}\Big)\Big)^2 dt\Big].
\]
Note that each term in the latter sum depends only on a single component $\omega_q$. This is true for the expectation and the integrand. That is why the maximum over the binary vector $\omega$ can be split into the sum of maxima. In view of Lemma 12.6, we can write
\[
\sum_{q=1}^{Q}\ \max_{\omega_q\in\{0,1\}} E_{\omega_q}\Big[\int_{B_q}\Big(\hat f_{n,q}(t)-\omega_q(h_n^*)^\beta\varphi\Big(\frac{t-c_q}{h_n^*}\Big)\Big)^2 dt\Big]
\ge \frac{r_0 Q}{n} = \frac{r_0}{2h_n^* n} = \frac{r_0}{2}\,n^{-2\beta/(2\beta+1)},
\]
and the theorem follows with $r_* = r_0/2$. $\square$

12.3. General Lower Bound


The proof of the lower bound in the previous sections explored the charac-
teristics of the sup-norm and the L2 -norm, which do not extend very far.
In particular, in the proof of the lower bound in the sup-norm, we relied on
the independence of the random variables Nq,n in (12.9). A similar indepen-
dence does not hold for the test functions (12.12) since their supports are
overlapping. On the other hand, the idea of Assouad’s lemma fails if we try
to apply it to the sup-norm because the sup-norm does not split into the
sum of the sup-norms over the bins.
In this section, we will suggest a more general lower bound that covers
both of these norms as special cases. As above, we consider a nonparamet-
ric regression function f (x), x ∈ [0, 1], of a given smoothness β ≥ 1. We
introduce a norm f  of functions in the interval [0, 1]. This norm will be
specified later in each particular case.
As in the sections above, we must care about two things: a proper
set of the test functions, and the asymptotic performance of the respective
likelihood ratios.
Assume that there exists a positive number d0 and a set of M + 1 test
functions f0 (x), . . . , fM (x), x ∈ [0, 1], such that any two functions fl and
$f_m$ are separated by at least $2d_0$, that is,
\[
(12.15)\qquad \|f_l - f_m\| \ge 2d_0 \quad\text{for any } l \ne m,\ l,m = 0,\dots,M.
\]
The constant $d_0$ depends on $n$, decreases as $n \to \infty$, and controls the rate of convergence. The number $M$ typically goes to infinity as $n \to \infty$. For example, in the case of the sup-norm, we had $d_0 = O\big((h_n^*)^\beta\big)$ in (12.4), and $M = Q = O(1/h_n^*)$ where $h_n^* = \big((\ln n)/n\big)^{1/(2\beta+1)}$.
In this section, we consider the regression with the regular deterministic design $\mathcal{X}$. Denote by $P_m(\cdot) = P_{f_m}(\cdot\,|\,\mathcal{X})$, $m = 0,\dots,M$, the probability distributions corresponding to a fixed design, and by $E_m$ the respective expectations associated with the test function $f_m$, $m = 0,\dots,M$.
Fix one of the test functions, for instance, $f_0$. Consider all log-likelihood ratios for $m = 1,\dots,M$,
\[
\ln\frac{dP_0}{dP_m} = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\Big[\,y_i^2 - \big(y_i - f_m(x_i)\big)^2\,\Big]
= \frac{1}{\sigma}\sum_{i=1}^{n} f_m(x_i)\,\big(-\varepsilon_i/\sigma\big) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} f_m^2(x_i)
= \sigma_{m,n} N_{m,n} - \frac{1}{2}\sigma_{m,n}^2
\]
where
\[
\varepsilon_i = y_i - f_m(x_i) \quad\text{and}\quad \sigma_{m,n}^2 = \sigma^{-2}\sum_{i=1}^{n} f_m^2(x_i).
\]
The random variables $\varepsilon_i/\sigma$ and $N_{m,n}$ are standard normal with respect to the distribution $P_m$.
We need assumptions on the likelihood ratios to guarantee that they are not too small as $n \to \infty$. Introduce the random events
\[
A_m = \{\, N_{m,n} > 0 \,\} \quad\text{with}\quad P_m(A_m) = 1/2, \quad m = 1,\dots,M.
\]

Assume that there exists a constant $\alpha$, $0<\alpha<1$, such that all the variances $\sigma_{m,n}^2$ are bounded from above,
\[
(12.16)\qquad \max_{1\le m\le M}\sigma_{m,n}^2 \le 2\alpha\ln M.
\]
If the random event $A_m$ takes place and the inequality (12.16) holds, then
\[
(12.17)\qquad \frac{dP_0}{dP_m} \ge \exp\big\{-\sigma_{m,n}^2/2\big\} \ge \exp\{-\alpha\ln M\} = 1/M^{\alpha}.
\]
Let $\hat f_n$ be an arbitrary estimator of the regression function $f$. Define the random events
\[
D_m = \{\, \|\hat f_n - f_m\| \ge d_0 \,\}, \quad m = 0,\dots,M.
\]
The following lemma plays the same fundamental role in the proof of
the lower bound as Lemma 12.2 in the case of the sup-norm.
Lemma 12.8. If the conditions (12.15) and (12.16) are satisfied, then the following lower bound is true:
\[
\max_{0\le m\le M} P_m(D_m) \ge 1/4.
\]

Proof. To start with, note that
\[
P_m(\overline{D}_m) = P_m(\overline{D}_m \cap A_m) + P_m(\overline{D}_m \cap \overline{A}_m)
\le P_m(\overline{D}_m \cap A_m) + P_m(\overline{A}_m) = P_m(\overline{D}_m \cap A_m) + 1/2,
\]
which implies the inequality
\[
(12.18)\qquad P_m(\overline{D}_m \cap A_m) \ge P_m(\overline{D}_m) - 1/2.
\]
Next, the following inclusion is true:
\[
(12.19)\qquad \bigcup_{m=1}^{M}\overline{D}_m \subseteq D_0
\]
where the random events $\overline{D}_m$ are mutually exclusive. Indeed, if the norm of the difference $\|\hat f_n - f_m\|$ is strictly less than $d_0$ for some $m$, then by the triangle inequality and (12.15), the norm $\|\hat f_n - f_l\|$ is not smaller than $d_0$ for any $l \ne m$. The inclusion (12.19) makes use of this fact for $l = 0$.
It immediately follows that
\[
P_0(D_0) \ge P_0\Big(\bigcup_{m=1}^{M}\overline{D}_m\Big) = \sum_{m=1}^{M} P_0(\overline{D}_m)
= \sum_{m=1}^{M} E_m\Big[\frac{dP_0}{dP_m}\,I\big(\overline{D}_m\big)\Big]
\ge \sum_{m=1}^{M} E_m\Big[\frac{dP_0}{dP_m}\,I\big(\overline{D}_m \cap A_m\big)\Big]
\ge \frac{1}{M^{\alpha}}\sum_{m=1}^{M}\big(P_m(\overline{D}_m) - 1/2\big).
\]
In the latter inequality, we used (12.17).

The final step of the proof is straightforward. The maximum is estimated from below by a mean value,
\[
\max_{0\le m\le M} P_m(D_m) \ge \frac{1}{2}\Big[\,P_0(D_0) + \frac{1}{M}\sum_{m=1}^{M}P_m(D_m)\Big]
\ge \frac{1}{2}\Big[\,\frac{1}{M^{\alpha}}\sum_{m=1}^{M}\big(P_m(\overline{D}_m)-1/2\big) + \frac{1}{M}\sum_{m=1}^{M}P_m(D_m)\Big]
\]
\[
\ge \frac{1}{2M}\sum_{m=1}^{M}\Big[\,P_m(\overline{D}_m) + P_m(D_m) - 1/2\,\Big] = 1/4. \qquad\square
\]
m=1

As a consequence of Lemma 12.8, we obtain a general lower bound.


Theorem 12.9. If the conditions (12.15) and (12.16) are satisfied, then for any estimator $\hat f_n$ and for all $n$ large enough, the following lower bound holds:
\[
(12.20)\qquad \sup_{f\in\Theta(\beta)} E_f\big[\,\|\hat f_n - f\|\,\big] \ge d_0/4.
\]

Proof. Applying Lemma 12.8, we obtain that
\[
\sup_{f\in\Theta(\beta)} E_f\big[\,\|\hat f_n - f\|\,\big] \ge \max_{0\le m\le M} E_m\big[\,\|\hat f_n - f_m\|\,\big]
\ge d_0 \max_{0\le m\le M} P_m(D_m) \ge d_0/4. \qquad\square
\]

12.4. Examples and Extensions

Example 12.10. The sup-norm risk. In the case of the sup-norm, the test functions are defined by (12.3) with $M = Q$. The condition (12.15) follows from (12.5) and (12.6) with $d_0 = \frac{1}{2}(h_n^*)^\beta\|\varphi\|_\infty$. Note that for all large $n$ the following inequality holds:
\[
\ln Q = \frac{1}{2\beta+1}\big(\ln n - \ln\ln n\big) - \ln 2 \;\ge\; \frac{\ln n}{2(2\beta+1)}.
\]
In view of (12.11), the variance $\sigma_{q,n}^2$ in the expansion (12.9) of $\ln\big(dP_q/dP_0\big)$ is bounded from above uniformly in $q = 1,\dots,Q$,
\[
\sigma_{q,n}^2 \le c_1\ln n \le 2(2\beta+1)c_1\ln Q \le 2\alpha\ln Q = 2\alpha\ln M.
\]
The latter inequality holds if the constant $c_1 = 2\sigma^{-2}p^*c_0$ is so small that $(2\beta+1)c_1 < \alpha$. Such a choice of $c_1$ is guaranteed because $c_0$ can be taken arbitrarily small. Thus, the condition (12.16) is also fulfilled. Applying Theorem 12.9, we get the lower bound
\[
\sup_{f\in\Theta(\beta)} E_f\big[\,\|\hat f_n - f\|_\infty\big] \ge \frac{1}{8}(h_n^*)^\beta\|\varphi\|_\infty = r_*\psi_n
\]
with the constant $r_* = \frac{1}{8}\|\varphi\|_\infty$, and the rate of convergence $\psi_n$ defined in (12.1). $\square$

Unlike the case of the upper bounds in Chapter 9, “bad” designs do not create a problem in obtaining the lower bound in the sup-norm. Intuitively, this is understandable: when we concentrate more design points in some bins, we lose them in the other bins, which reduces the precision of the uniform estimation of the regression function. In a sense, the uniform design is optimal if we estimate the regression in the sup-norm. We will prove some results in support of these considerations.
Let a design $\mathcal{X}$ be of any kind, not necessarily regular. Assume that there exists a subset $\mathcal{M} = \mathcal{M}(\mathcal{X}) \subseteq \{1,\dots,M\}$ such that for some $\alpha\in(0,1)$ the following inequality holds:
\[
(12.21)\qquad \max_{m\in\mathcal{M}}\sigma_{m,n}^2 \le 2\alpha\ln M.
\]
Let |M| denote the number of elements in M. It turns out that Lemma 12.8
remains valid in the following modification.
Lemma 12.11. If the conditions (12.15) and (12.21) are satisfied, then the following lower bound holds:
\[
\max_{0\le m\le M} P_m(D_m) \ge \frac{|\mathcal{M}|}{4M}.
\]
Proof. Repeating the proof of Lemma 12.8, we find that
\[
P_0(D_0) \ge P_0\Big(\bigcup_{m=1}^{M}\overline{D}_m\Big) = \sum_{m=1}^{M}P_0(\overline{D}_m)
= \sum_{m=1}^{M}E_m\Big[\frac{dP_0}{dP_m}\,I\big(\overline{D}_m\big)\Big]
\ge \sum_{m\in\mathcal{M}}E_m\Big[\frac{dP_0}{dP_m}\,I\big(\overline{D}_m\cap A_m\big)\Big]
\ge \frac{1}{M^{\alpha}}\sum_{m\in\mathcal{M}}\big(P_m(\overline{D}_m)-1/2\big)
\]
where we have used the inequality (12.17). Under (12.21), this inequality applies only to the indices $m\in\mathcal{M}$. Continuing as in Lemma 12.8, we obtain the bound
\[
\max_{0\le m\le M}P_m(D_m) \ge \frac{1}{2}\Big[\,P_0(D_0) + \frac{1}{M}\sum_{m=1}^{M}P_m(D_m)\Big]
\ge \frac{1}{2M}\sum_{m\in\mathcal{M}}\Big[\,P_m(\overline{D}_m)+P_m(D_m)-1/2\,\Big] = \frac{|\mathcal{M}|}{4M}. \qquad\square
\]
Example 12.12. The sup-norm risk (cont'd). For an arbitrary design $\mathcal{X}$, the bound (12.11) is no longer true. But it turns out (see Exercise 12.82) that for any design $\mathcal{X}$ and for any $\alpha\in(0,1)$, there exists a “bump” function $\varphi$ and a subset $\mathcal{M}=\mathcal{M}(\mathcal{X})\subseteq\{1,\dots,Q\}$ such that
\[
(12.22)\qquad |\mathcal{M}| \ge Q/2 \quad\text{and}\quad \max_{q\in\mathcal{M}}\sigma_{q,n}^2 \le 2\alpha\ln Q.
\]
From (12.22) and Lemma 12.11, analogously to the proof of Theorem 12.9, we derive the lower bound for any design $\mathcal{X}$,
\[
(12.23)\qquad \sup_{f\in\Theta(\beta)} E_f\big[\,\|\hat f_n - f\|_\infty\big] \ge \frac{|\mathcal{M}|}{4Q}\,d_0 \ge \frac{d_0}{8} = \frac{1}{16}(h_n^*)^\beta\|\varphi\|_\infty. \qquad\square
\]

Next, we will study the case of the $L_2$-norm risk.
Example 12.13. The $L_2$-norm risk. Consider the test functions $f(t,\omega)$, $\omega\in\Omega$, defined in (12.12). For any two functions $f(t,\omega')$ and $f(t,\omega'')$, the log-likelihood function has the representation
\[
(12.24)\qquad \ln\frac{dP_{f(\cdot,\omega')}}{dP_{f(\cdot,\omega'')}} = \sigma_n N_n - \frac{1}{2}\sigma_n^2
\]
where $N_n = N_n(\omega',\omega'')$ is a standard normal random variable with respect to the distribution controlled by the test function $f(\cdot,\omega'')$, and
\[
\sigma_n^2 = \sigma_n^2(\omega',\omega'') = \sigma^{-2}\sum_{i=1}^{n}\big(f(x_i,\omega') - f(x_i,\omega'')\big)^2
\]
where the $x_i$'s are the design points (see Exercise 12.84). From the definition of the test functions, the variance $\sigma_n^2$ can be bounded from above by
\[
\sigma_n^2 = \sigma^{-2}(h_n^*)^{2\beta}\sum_{q=1}^{Q}|\omega_q' - \omega_q''|\sum_{x_i\in B_q}\varphi^2\Big(\frac{x_i-c_q}{h_n^*}\Big)
= \sigma^{-2}\|\varphi\|_2^2\sum_{q=1}^{Q}|\omega_q'-\omega_q''|\,p(c_q)\big(1+o_{q,n}(1)\big)
\]
\[
(12.25)\qquad \le \sigma^{-2}\|\varphi\|_2^2\,Q\big(1+o_n(1)\big) \le 2\sigma^{-2}\|\varphi\|_2^2\,Q.
\]
In the above, $o_{q,n}(1)\to 0$ as $n\to\infty$ uniformly in $q$, $1\le q\le Q$. Also, we bounded $|\omega_q'-\omega_q''|$ by 1, and used the fact that the Riemann sum of the design density approximates the integral
\[
Q^{-1}\sum_{q=1}^{Q}p(c_q) = \int_0^1 p(x)\,dx + o_n(1) = 1 + o_n(1).
\]
Next, we have to discuss the separation condition (12.15). For any test functions, the $L_2$-norm of the difference is easy to find,
\[
(12.26)\qquad \|f(\cdot,\omega') - f(\cdot,\omega'')\|_2^2 = \frac{1}{n}\|\varphi\|_2^2\sum_{q=1}^{Q}|\omega_q'-\omega_q''|.
\]
At this point, we need a result which will be proved at the end of this section.
Lemma 12.14. (Varshamov-Gilbert) For all $Q$ large enough, there exists a subset $\Omega_0$, $\Omega_0\subset\Omega$, with the number of elements no less than $1 + e^{Q/8}$ and such that for any $\omega',\omega''\in\Omega_0$, the following inequality holds:
\[
\sum_{q=1}^{Q}|\omega_q'-\omega_q''| \ge Q/8.
\]
Continuing with the example, let $M = e^{Q/8}$. From Lemma 12.14 and (12.26), we see that there exist $M+1$ test functions such that for any two of them,
\[
\|f(\cdot,\omega') - f(\cdot,\omega'')\|_2^2 \ge \frac{Q}{8n}\|\varphi\|_2^2 = (2d_0)^2
\]
where
\[
d_0 = \frac{1}{2}\|\varphi\|_2\sqrt{\frac{Q}{8n}} = \frac{1}{2}\|\varphi\|_2\sqrt{\frac{1}{16\,h_n^* n}} = \frac{1}{8}\|\varphi\|_2\,(h_n^*)^\beta.
\]
Hence the condition (12.15) is fulfilled with this $d_0$. We arbitrarily choose $f_0 = f(t,\omega^0)$ for some $\omega^0\in\Omega_0$, and take $\mathcal{M}$ as the set of the rest of the functions with $\omega\in\Omega_0$. In this case, $|\mathcal{M}| = M = e^{Q/8}$.

Finally, we have to verify the condition (12.16). If we choose a “bump” function $\varphi$ such that $\|\varphi\|_2^2 = \sigma^2\alpha/8$ where $\alpha$ is any number, $0<\alpha<1$, then it follows from (12.25) that
\[
\sigma_n^2 \le 2\sigma^{-2}\|\varphi\|_2^2\,Q = 2\alpha\ln\big(e^{Q/8}\big) = 2\alpha\ln M.
\]
Theorem 12.9 applies, and the lower bound of the $L_2$-norm risk follows for all large $n$,
\[
\sup_{f\in\Theta(\beta)} E_f\big[\,\|\hat f_n - f\|_2\big] \ge \frac{1}{4}d_0 = \frac{1}{32}\|\varphi\|_2\,(h_n^*)^\beta = r_*\,n^{-\beta/(2\beta+1)}. \qquad\square
\]

Proof of Lemma 12.14. Define the binary vectors
\[
\omega_m = \big(\omega_{1,m},\dots,\omega_{Q,m}\big), \quad m = 0,\dots,M,
\]
with independent Bernoulli(1/2) random components $\omega_{q,m}$. Note that for any $l \ne m$, the random variables $\xi_q = |\omega_{q,l} - \omega_{q,m}|$ are also Bernoulli(1/2), and are independent for different $q$.
Next, the elementary inequalities and the choice of $M$ yield that
\[
P\Big(\bigcap_{0\le l<m\le M}\Big\{\sum_{q=1}^{Q}|\omega_{q,l}-\omega_{q,m}| \ge Q/8\Big\}\Big)
= 1 - P\Big(\bigcup_{0\le l<m\le M}\Big\{\sum_{q=1}^{Q}|\omega_{q,l}-\omega_{q,m}| < Q/8\Big\}\Big)
\]
\[
\ge 1 - \frac{M(M+1)}{2}\,P\Big(\sum_{q=1}^{Q}\xi_q < Q/8\Big)
\ge 1 - e^{Q/4}\,P\Big(\sum_{q=1}^{Q}\xi_q < Q/8\Big)
\ge 1 - e^{Q/4}\,P\Big(\sum_{q=1}^{Q}\bar\xi_q > (3/8)Q\Big)
\]
where we denoted by $\bar\xi_q = 1/2 - \xi_q$ the random variables that take on values $\pm 1/2$ with equal probabilities.
Further, Chernoff's inequality $P(X\ge a)\le e^{-za}\,E\big[e^{zX}\big]$, $a,z>0$, ensures that for any positive $z$,
\[
P\Big(\sum_{q=1}^{Q}\bar\xi_q > (3/8)Q\Big) \le \Big(E\big[\exp\{z\bar\xi_q\}\big]\Big)^{Q}\exp\big\{-(3/8)zQ\big\}.
\]
The moment generating function of $\bar\xi_q$ satisfies the inequality (see Exercise 12.85)
\[
(12.27)\qquad E\big[\exp\{z\bar\xi_q\}\big] = \frac{1}{2}\big(\exp\{z/2\}+\exp\{-z/2\}\big) \le \exp\{z^2/8\}.
\]
Take $z = 3/2$. Then
\[
P\Big(\sum_{q=1}^{Q}\bar\xi_q > (3/8)Q\Big) \le \exp\big\{-(9/32)Q\big\}.
\]

Hence,
\[
P\Big(\bigcap_{0\le l<m\le M}\Big\{\sum_{q=1}^{Q}|\omega_{q,l}-\omega_{q,m}| \ge Q/8\Big\}\Big)
\ge 1 - \exp\Big\{\Big(\frac{1}{4}-\frac{9}{32}\Big)Q\Big\} = 1 - \exp\{-Q/32\} > 0.
\]
This proves the lemma: an event that occurs with positive probability is non-empty, so a set $\Omega_0$ with the required properties exists. $\square$
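The combinatorial separation claimed in Lemma 12.14 is easy to probe numerically. The following minimal sketch (illustrative code, not from the text; the value of $Q$, the random seed, and the cap on $M$ are arbitrary assumptions made so the demo runs quickly) draws random binary vectors with Bernoulli(1/2) components and checks that all pairwise Hamming distances exceed $Q/8$.

```python
import numpy as np

# Monte Carlo check of the separation in Lemma 12.14: draw M + 1 random
# binary vectors of length Q and verify that every pairwise Hamming distance
#   sum_q |omega_{q,l} - omega_{q,m}|
# is at least Q/8.  Q and the cap on M are illustrative choices.
rng = np.random.default_rng(0)
Q = 120
M = min(int(np.exp(Q / 8)), 1000)    # exp(Q/8) is huge; cap it for the demo

omega = rng.integers(0, 2, size=(M + 1, Q)).astype(np.int64)
gram = omega @ omega.T                             # pairwise inner products
ones = omega.sum(axis=1)
ham = ones[:, None] + ones[None, :] - 2 * gram     # Hamming distances
np.fill_diagonal(ham, Q)                           # ignore the zero diagonal
print("minimal pairwise distance:", ham.min(), "  threshold Q/8 =", Q / 8)
```

With these illustrative parameters the minimal distance is comfortably above the threshold, in line with the probabilistic argument of the proof.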
Exercises

Exercise 12.82. Prove (12.22).

Exercise 12.83. Use (12.22) to prove Theorem 12.4.

Exercise 12.84. Verify (12.24).

Exercise 12.85. Prove (12.27).

Exercise 12.86. Let the design X be equidistant, that is, with the design
points xi = i/n. Show by giving an example that the following lower bound
is false. For any large c there exists a positive p0 independent of n such that
for all large $n$, the following inequality holds:
\[
\inf_{\hat f_n}\ \sup_{f\in\Theta(\beta)} P_f\Big(\,\|\hat f_n - f\|_2 \ge c\,n^{-\beta/(2\beta+1)} \,\Big|\, \mathcal{X}\Big) \ge p_0.
\]

Hint: Consider the case β = 1, and let fn∗ be a piecewise constant estimator
in the bins. Show that the above probability goes to zero as n increases.
Part 3

Estimation in
Nonparametric Models
Chapter 13

Estimation of
Functionals

13.1. Linear Integral Functionals


As in the previous chapters, here we consider the observations of a regres-
sion function f in the presence of the Gaussian random noise. To ease the
presentation, we concentrate on the case of the equidistant design,

(13.1) yi = f (i/n) + εi , εi ∼ N (0, σ 2 ), i = 1, . . . , n.

So far we have studied the estimation problem of the regression function. We found that the typical parametric $\sqrt{n}$-rate of convergence is not attainable in the nonparametric setup. The typical minimax rate under smoothness parameter $\beta$ equals $\psi_n = n^{-\beta/(2\beta+1)}$. Note that the exponent $\beta/(2\beta+1)$ approaches $1/2$ as $\beta$ goes to infinity. Thus, for a very smooth nonparametric regression, the rate of convergence is close to the typical parametric rate.
In this section we focus on estimating an integral functional of the regression function, for example, $\Psi(f) = \int_0^1 f(x)\,dx$. We address the question: What is the minimax rate of convergence in this estimation problem? We will show that $\sqrt{n}$ is a very common rate of convergence.
We start with the easiest problem of a linear integral functional
\[
(13.2)\qquad \Psi(f) = \int_0^1 w(x)f(x)\,dx
\]
where $w(x)$ is a given Lipschitz function, called the weight function, and $f = f(x)$ is an unknown regression observed with noise as in (13.1).


the integral notation, we will use the dot product notation, $\Psi(f) = (w,f)$ and $\|w\|_2^2 = \int_0^1 w^2(x)\,dx$.
Note that $\Psi(f)$ defined by (13.2) is a linear functional, that is, for any $f_1$ and $f_2$, and any constants $k_1$ and $k_2$, the following identity holds:
\[
\Psi(k_1 f_1 + k_2 f_2) = \int_0^1 w(x)\big[k_1 f_1(x) + k_2 f_2(x)\big]\,dx
\]
\[
(13.3)\qquad = k_1\int_0^1 w(x)f_1(x)\,dx + k_2\int_0^1 w(x)f_2(x)\,dx = k_1\Psi(f_1) + k_2\Psi(f_2).
\]

Define an estimator of $\Psi(f)$ by
\[
(13.4)\qquad \hat\Psi_n = \frac{1}{n}\sum_{i=1}^{n} w(i/n)\,y_i.
\]

Example 13.1. If $w(t) = 1$, then $\Psi(f) = \int_0^1 f(x)\,dx$. Assume that $f \in \Theta(\beta,L,L_1)$ with some $\beta \ge 1$, which yields that $f$ is a Lipschitz function. In this case, the trivial estimator, the sample mean, turns out to be $\sqrt{n}$-consistent,
\[
\hat\Psi_n = \big(y_1+\dots+y_n\big)/n = \int_0^1 f(x)\,dx + O(n^{-1}) + \sigma Z_0/\sqrt{n}
\]
where $Z_0$ is a standard normal random variable, and $O(n^{-1})$ represents the deterministic error of the Riemann sum approximation. Note that this deterministic error is uniform over $f\in\Theta(\beta,L,L_1)$. $\square$
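The $\sqrt{n}$-consistency in Example 13.1 is easy to see in a small simulation. The sketch below is illustrative only: the Lipschitz test function, the noise level, and the sample sizes are arbitrary assumptions, not part of the text.

```python
import numpy as np

# Simulation for Example 13.1: y_i = f(i/n) + eps_i; the sample mean
# estimates Psi(f) = int_0^1 f(x) dx with error of order 1/sqrt(n).
rng = np.random.default_rng(1)
f = lambda x: 1.0 + 0.5 * np.sin(2 * np.pi * x)   # int_0^1 f(x) dx = 1
sigma, psi_true = 0.5, 1.0

for n in [100, 1000, 10000, 100000]:
    x = np.arange(1, n + 1) / n
    reps = 200
    errors = np.empty(reps)
    for r in range(reps):
        y = f(x) + sigma * rng.standard_normal(n)
        errors[r] = y.mean() - psi_true            # Psi_hat - Psi(f)
    # the root-mean-square error should scale like sigma / sqrt(n)
    print(n, np.sqrt(np.mean(errors ** 2)), sigma / np.sqrt(n))
```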
Next, we state a proposition the proof of which is straightforward (see
Exercise 13.87).
Proposition 13.2. For all $\beta\ge 1$ and any $f\in\Theta(\beta,L,L_1)$, the bias and the variance of the estimator (13.4) are respectively equal to
\[
b_n = E_f[\hat\Psi_n] - \Psi(f) = O(n^{-1})
\quad\text{and}\quad
\mathrm{Var}_f[\hat\Psi_n] = \frac{\sigma^2}{n}\int_0^1 w^2(x)\,dx + O(n^{-2}).
\]
Corollary 13.3. It immediately follows from Proposition 13.2 that for any $f\in\Theta(\beta,L,L_1)$, the following limit exists:
\[
\lim_{n\to\infty} E_f\Big[\big(\sqrt{n}\,(\hat\Psi_n - \Psi(f))\big)^2\Big] = \sigma^2\int_0^1 w^2(x)\,dx = \sigma^2\|w\|_2^2.
\]

A legitimate question is whether it is possible to improve the result of


Proposition 13.2 and to find another estimator with an asymptotic variance
smaller than $\sigma^2\|w\|_2^2$. As we could anticipate, the answer to this question is
negative. To prove the lower bound, we need the following auxiliary result.
13.1. Linear Integral Functionals 187

Lemma 13.4. Let the $Y_i$'s be independent observations of a location parameter $\theta$ in the non-homogeneous Gaussian model
\[
Y_i = \theta\mu_i + \varepsilon_i, \quad \varepsilon_i\sim\mathcal{N}(0,\sigma_i^2), \quad i = 1,\dots,n,
\]
with some constant $\mu_i$'s. Assume that there exists a strictly positive limit $I_\infty = \lim_{n\to\infty} n^{-1}I_n > 0$ where $I_n = \sum_{i=1}^{n}(\mu_i/\sigma_i)^2$ is the Fisher information. Then for any estimator $\hat\theta_n$ of the location parameter $\theta$, the following lower bound holds:
\[
\liminf_{n\to\infty}\ \sup_{\theta\in\mathbb{R}} E_\theta\big[\,nI_\infty(\hat\theta_n-\theta)^2\,\big]
= \liminf_{n\to\infty}\ \sup_{\theta\in\mathbb{R}} E_\theta\big[\,I_n(\hat\theta_n-\theta)^2\,\big] \ge 1.
\]

Proof. The statement of the lemma is the Hájek-LeCam lower bound. Note that the Fisher information in the non-homogeneous Gaussian model is equal to $I_n = \sum_{i=1}^{n}(\mu_i/\sigma_i)^2$, and the log-likelihood ratio is normal non-asymptotically,
\[
L_n(\theta) - L_n(0) = \sum_{i=1}^{n}\ln\big[f(X_i,\theta)/f(X_i,0)\big] = \sqrt{I_n}\,Z_{0,1}\,\theta - \frac{1}{2}I_n\theta^2
\]
where $Z_{0,1}$ is a standard normal random variable with respect to the true distribution $P_0$. Thus, as in the Hájek-LeCam case, we have the lower bound
\[
\liminf_{n\to\infty}\ \sup_{\theta\in\mathbb{R}} E_\theta\big[\,I_n(\hat\theta_n-\theta)^2\,\big] \ge 1. \qquad\square
\]
n→∞ θ ∈ R

Now we return to the functional estimation problem. Consider a one-


parameter family of regression functions
 
Θ = f (x, θ) = θw(x)/w22 , θ ∈ R, w22 > 0
where w(x) is the weight function specified
 in (13.2). For this family of
regression functions, the functional Ψ f ( · , θ) coincides with θ identically,
1 1
  w(x)
Ψ f ( · , θ) = w(x)f (x, θ) dx = w(x) θ 2 dx = θ.
0 0 w 2

Hence for this family of regression functions, the estimation of Ψ(f ) is equiv-
alent to estimation of θ from the following observations:
(13.5) yi = θ w(i/n)/w22 + εi , εi ∼ N (0 σ 2 ), i = 1, . . . , n.

Theorem 13.5. For any estimator Ψ̃n from the observations (13.5), the
following asymptotic lower bound holds:
√     2
(13.6) lim inf sup Ef (·, θ) n Ψ̃n − Ψ f ( · , θ) ≥ σ 2 w22 .
n→∞ f (·, θ) ∈ Θ
Proof. Applying Lemma 13.4 with $\mu_i = w(i/n)/\|w\|_2^2$ and $Y_i = y_i$, we find that the Fisher information in this case is expressed as
\[
I_n = \sum_{i=1}^{n}\big(\mu_i/\sigma\big)^2 = \sum_{i=1}^{n}\frac{w^2(i/n)}{\sigma^2\|w\|_2^4}
= \frac{n}{\sigma^2\|w\|_2^2}\cdot\frac{1}{n}\sum_{i=1}^{n}\frac{w^2(i/n)}{\|w\|_2^2}
= \frac{n\big(1+o_n(1)\big)}{\sigma^2\|w\|_2^2}.
\]
Here we used the fact that the latter sum is the Riemann sum for the integral $\int_0^1 w^2(x)/\|w\|_2^2\,dx = 1$. Thus, from Lemma 13.4, the lower bound follows
\[
\liminf_{n\to\infty}\ \sup_{f(\cdot,\theta)\in\Theta} E_{f(\cdot,\theta)}\Big[\,I_n\big(\tilde\Psi_n - \Psi(f(\cdot,\theta))\big)^2\Big]
= \liminf_{n\to\infty}\ \sup_{\theta\in\mathbb{R}} E_\theta\Big[\,\frac{n(1+o_n(1))}{\sigma^2\|w\|_2^2}\,\big(\tilde\Psi_n-\theta\big)^2\Big] \ge 1,
\]
which is equivalent to (13.6). $\square$

13.2. Non-Linear Functionals


As an example, suppose we want to estimate the square of the $L_2$-norm of the regression function $f$, that is, we want to estimate the integral quadratic functional
\[
(13.7)\qquad \Psi(f) = \|f\|_2^2 = \int_0^1 f^2(x)\,dx.
\]
Clearly, this is a non-linear functional of $f$ since it does not satisfy the linearity property (13.3), though it is very smooth. Can we estimate it $\sqrt{n}$-consistently? The answer is positive. The efficient estimator of the functional (13.7) is discussed in Example 13.6 below. Now we turn to general smooth functionals.
A functional $\Psi(f)$ is called differentiable on a set of functions $\Theta$ if for any fixed function $f_0\in\Theta$ and for any other function $f$ in $\Theta$, the following approximation holds:
\[
(13.8)\qquad \Psi(f) = \Psi(f_0) + \Psi'_{f_0}(f-f_0) + \rho(f,f_0)
\]
where $\Psi'_{f_0}(f-f_0)$ is the first derivative of $\Psi$ applied to the difference $f-f_0$. The functional $\Psi'_{f_0}$ is a linear functional that depends on $f_0$. Moreover, we assume that for any function $g = g(x)$,
\[
\Psi'_{f_0}(g) = \int_0^1 w(x,f_0)\,g(x)\,dx,
\]

where $w(x,f_0)$ is a Lipschitz function of $x$ and a continuous functional of $f_0$, that is,
\[
|w(x_1,f_0) - w(x_2,f_0)| \le L|x_1 - x_2|
\]
with a Lipschitz constant $L$ independent of $f_0$, and
\[
\|w(\cdot,f) - w(\cdot,f_0)\|_2 \to 0 \quad\text{as}\quad \|f - f_0\|_2 \to 0.
\]
The remainder term $\rho(f,f_0)$ in (13.8) satisfies the inequality
\[
(13.9)\qquad |\rho(f,f_0)| \le C_\rho\,\|f-f_0\|_2^2
\]
with some positive constant $C_\rho$ independent of $f$ and $f_0$. Since the functional $\Psi(f)$ is known, the weight function of its derivative $w(\cdot,f_0)$ is also known for all $f_0$.
Example 13.6. Consider the quadratic functional (13.7). From the identity
\[
\|f\|_2^2 = \|f_0\|_2^2 + 2(f_0, f-f_0) + \|f-f_0\|_2^2,
\]
we have the explicit formula for the derivative
\[
\Psi'_{f_0}(g) = 2(f_0,g) = 2\int_0^1 f_0(x)g(x)\,dx.
\]
This formula implies that the weight function $w(x,f_0) = 2f_0(x)$. The weight function is a Lipschitz function if $f_0$ belongs to a class of Lipschitz functions. The remainder term in this example, $\rho(f,f_0) = \|f-f_0\|_2^2$, meets the condition (13.9) with $C_\rho = 1$. $\square$

The next proposition claims that a differentiable functional can be estimated $\sqrt{n}$-consistently and describes the asymptotic distribution.
Theorem 13.7. Assume that the regression function $f\in\Theta(\beta,L,L_1)$ with some $\beta\ge 1$. Let $\Psi(f)$ be a differentiable functional on $\Theta(\beta,L,L_1)$. There exists an estimator $\Psi_n^*$ such that the $P_f$-distribution of its normalized error is asymptotically normal as $n\to\infty$,
\[
\sqrt{n}\,\big(\Psi_n^* - \Psi(f)\big) \to \mathcal{N}\big(0,\ \sigma^2\|w(\cdot,f)\|_2^2\big).
\]

Proof. Split the $n$ observations into two sub-samples of sizes $m$ and $n-m$, respectively, with $m = n^{\alpha}$, $5/6 \le \alpha < 1$. Assume that $n/m$ is an integer. Define the first sub-sample by the equidistant design points, $J_1 = \{1/m, 2/m,\dots,m/m\}$. Let the second sub-sample $J_2$ be composed of the rest of the design points. Note that $J_2$ is not necessarily regular and it contains almost all the points, $|J_2| = n(1 - n^{\alpha-1}) = n\big(1+o_n(1)\big)$. From Theorem 10.3, even in the case of the smallest smoothness parameter, $\beta = 1$, we can choose an estimator $f_n^*$ of $f$ so that uniformly in $f$,
\[
E_f\big[\|f_n^* - f\|_2^2\big] \le r^* m^{-2/3} = r^* n^{-2\alpha/3} \le r^* n^{-5/9} = o\big(1/\sqrt{n}\big).
\]
190 13. Estimation of Functionals

If $f_n^*$ is not a Lipschitz function, we replace it by its projection onto the set $\Theta(\beta,L,L_1)$, which is a convex set. Thus, we may assume that $f_n^*$ is a Lipschitz function and
\[
(13.10)\qquad \lim_{n\to\infty}\sqrt{n}\,E_f\big[\|f_n^*-f\|_2^2\big] = 0.
\]

Introduce an estimator of the functional $\Psi$ by
\[
(13.11)\qquad \Psi_n^* = \Psi(f_n^*) + \frac{1}{n}\sum_{i\in J_2} w(i/n,f_n^*)\,y_i - \int_0^1 w(x,f_n^*)f_n^*(x)\,dx.
\]
Note that $\Psi_n^*$ is computable from the data. The smaller portion consisting of the $m$ observations is used in the preliminary approximation of $f$ by $f_n^*$, while the larger portion of the $n-m$ observations is used in estimation of the derivative $\Psi'_{f_n^*}$, which is a linear functional. This linear functional is estimated similarly to (13.4) by $n^{-1}\sum_{i\in J_2} w(i/n,f_n^*)\,y_i$. From (13.11), by the definition of a differentiable functional, we obtain that
\[
\sqrt{n}\,\big(\Psi_n^*-\Psi(f)\big) = \sqrt{n}\,\Big[\,\Psi(f_n^*) + \frac{1}{n}\sum_{i\in J_2} w(i/n,f_n^*)\,y_i - \int_0^1 w(x,f_n^*)f_n^*(x)\,dx
\]
\[
- \Big(\Psi(f_n^*) + \int_0^1 w(x,f_n^*)\big(f(x)-f_n^*(x)\big)\,dx + \rho(f,f_n^*)\Big)\Big]
\]
\[
(13.12)\qquad = \sqrt{n}\,\Big[\frac{1}{n}\sum_{i\in J_2} w(i/n,f_n^*)\,y_i - \int_0^1 w(x,f_n^*)f(x)\,dx\Big] - \sqrt{n}\,\rho(f,f_n^*).
\]

In view of (13.10), the remainder term in (13.12) is vanishing as $n\to\infty$,
\[
\sqrt{n}\,E_f\big[|\rho(f,f_n^*)|\big] \le C_\rho\sqrt{n}\,E_f\big[\|f_n^*-f\|_2^2\big] \to 0.
\]
The normalized difference of the sum and the integral in (13.12) is normal with the expectation going to zero as $n\to\infty$, and the variance that, conditionally on $f_n^*$, is equal to
\[
\frac{\sigma^2}{n}\sum_{i\in J_2} w^2(i/n,f_n^*) = \frac{\sigma^2}{n}\sum_{i=1}^{n} w^2(i/n,f_n^*) + O(m/n)
= \sigma^2\int_0^1 w^2(x,f_n^*)\,dx + o_n(1).
\]
Here we used the fact that $m/n = n^{\alpha-1} = o_n(1) \to 0$ as $n\to\infty$. By the assumption, the weight function $w(\cdot,f_0)$ is continuous in $f_0$, and
\[
\int_0^1 w^2(x,f_n^*)\,dx \to \int_0^1 w^2(x,f)\,dx = \|w(\cdot,f)\|_2^2
\]
as $\|f_n^*-f\|_2 \to 0$. Hence, the result of the theorem follows. $\square$
Example 13.8. Consider again the integral quadratic functional $\Psi$ defined by (13.7). We apply (13.11) to get an explicit expression for the estimator $\Psi_n^*$ of this functional. From Example 13.6, the weight function $w(i/n,f_n^*) = 2f_n^*(i/n)$, therefore,
\[
\Psi_n^* = \|f_n^*\|_2^2 + \frac{2}{n}\sum_{i\in J_2} f_n^*(i/n)\,y_i - 2\int_0^1\big(f_n^*(x)\big)^2\,dx
= \frac{2}{n}\sum_{i\in J_2} f_n^*(i/n)\,y_i - \|f_n^*\|_2^2. \qquad\square
\]
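For concreteness, here is a minimal numerical sketch of the estimator in Example 13.8 (illustrative code, not from the text). The pilot estimate $f_n^*$ below is a crude piecewise-constant (bin-averaging) fit on a random sub-sample rather than the estimator of Theorem 10.3, and the test function, noise level, and bin count are assumptions of the demo.

```python
import numpy as np

# Two-stage estimator of Psi(f) = int_0^1 f^2(x) dx, following (13.11) and
# Example 13.8, with a crude piecewise-constant pilot fit f*.
rng = np.random.default_rng(2)
n, sigma = 20000, 0.3
f = lambda x: 1.0 + np.sin(2 * np.pi * x)            # ||f||_2^2 = 1.5

x = np.arange(1, n + 1) / n
y = f(x) + sigma * rng.standard_normal(n)

m = int(round(n ** (5.0 / 6.0)))                      # size of the pilot sub-sample
idx1 = rng.choice(n, size=m, replace=False)           # J1 (random here, for simplicity)
mask2 = np.ones(n, dtype=bool)
mask2[idx1] = False                                   # J2 = the remaining points

bins = 50
edges = np.linspace(0.0, 1.0, bins + 1)
lab1 = np.clip(np.digitize(x[idx1], edges) - 1, 0, bins - 1)
bin_means = np.array([y[idx1][lab1 == b].mean() for b in range(bins)])
f_star = lambda t: bin_means[np.clip(np.digitize(t, edges) - 1, 0, bins - 1)]

# Psi*_n = (2/|J2|) sum_{i in J2} f*(x_i) y_i - ||f*||_2^2 ; normalizing by |J2|
# instead of n removes the finite-sample factor (n - m)/n, which is 1 + o(1).
grid = (np.arange(100000) + 0.5) / 100000
norm2 = np.mean(f_star(grid) ** 2)                    # Riemann sum for ||f*||_2^2
psi_hat = 2.0 * np.mean(f_star(x[mask2]) * y[mask2]) - norm2
print("estimate:", round(psi_hat, 4), "  true value:", 1.5)
```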

Remark 13.9. If the preliminary estimator $f_n^*$ in (13.11) satisfies the condition
\[
n\,E_f\big[\|f_n^*-f\|_4^4\big] = n\,E_f\Big[\int_0^1\big(f_n^*(x)-f(x)\big)^4\,dx\Big] \to 0
\]
as $n\to\infty$, then the estimator $\Psi_n^*$ converges in the mean square sense as well (see Exercise 13.90),
\[
(13.13)\qquad \lim_{n\to\infty} E_f\Big[\big(\sqrt{n}\,(\Psi_n^*-\Psi(f))\big)^2\Big] = \sigma^2\|w(\cdot,f)\|_2^2. \qquad\square
\]

Exercises

Exercise 13.87. Verify the statement of Proposition 13.2.

Exercise 13.88. Let a regression function $f\in\Theta(\beta,L,L_1)$, $\beta\ge 1$, serve as the right-hand side of the differential equation
\[
\Psi'(x) + \Psi(x) = f(x), \quad x\ge 0,
\]
with the initial condition $\Psi(0) = 0$. Assume that the observations $y_i$, $i=1,\dots,n$, of the regression function $f$ satisfy (13.1). Estimate the solution $\Psi(x)$ at $x=1$. Find the asymptotics of the estimate's bias and variance as $n\to\infty$. Hint: Check that $\Psi(x) = e^{-x}\int_0^x e^t f(t)\,dt$.


Exercise 13.89. Show that $\Psi(f) = \int_0^1 f^4(x)\,dx$ is a differentiable functional of the regression function $f\in\Theta(\beta,L,L_1)$, $\beta\ge 1$.

Exercise 13.90. Prove (13.13).

Exercise 13.91. Consider the observations yi = f (xi ) + εi , i = 1, . . . , n,


with a regular design governed by a density p(x). Show that the sample mean
$\bar y = (y_1+\dots+y_n)/n$ is the minimax efficient estimator of the functional $\Psi(f) = \int_0^1 f(x)p(x)\,dx$.
Chapter 14

Dimension and
Structure in
Nonparametric
Regression

14.1. Multiple Regression Model


In this chapter, we revise the material of Chapters 8 and 9, and extend it to
the multiple regression model. As in (8.1), our starting point is the regression
equation $Y = f(X) + \varepsilon$, $\varepsilon\sim\mathcal{N}(0,\sigma^2)$. The difference is that this time the explanatory variable is a $d$-dimensional vector, $X = \big(X^{(1)},\dots,X^{(d)}\big)\in\mathbb{R}^d$. We use the upper index to label the components of this vector. A set of $n$ observations has the form $\{(x_1,y_1),\dots,(x_n,y_n)\}$ where each regressor $x_i = \big(x_i^{(1)},\dots,x_i^{(d)}\big)$ is $d$-dimensional. The regression equation looks similar to (8.2),
to (8.2),

(14.1) yi = f (xi ) + εi , i = 1, . . . , n,

where εi ∼N (0, σ 2 ) are independent normal random errors. The regression


function f : Rd → R is a real-valued function of d variables. This function
is unknown and has to be estimated from the observations (14.1). Assume
that the regressors belong to the unit cube in Rd , xi ∈ [0, 1]d , i = 1, . . . , n.
The design X = { x1 , . . . , xn } can be deterministic or stochastic, though in
this chapter we prefer to deal with the regular deterministic designs.
Our principal objective is to explain the influence of the dimension d
on the asymptotically minimax rate of convergence. We restrict ourselves


to the estimation problem of $f(x_0)$ at a fixed point $x_0 = \big(x_0^{(1)},\dots,x_0^{(d)}\big)$
located strictly inside the unit cube [0, 1]d . The asymptotically minimax rate
of convergence ψn , defined by (8.4) and (8.5), is attached to a Hölder class
Θ(β, L, L1 ) of regression functions. Thus, we have to extend the definition
of the Hölder class to the multivariate case. The direct extension via the
derivatives as in the one-dimensional case is less convenient since we would
have to deal with all mixed derivatives of f up to a certain order. A more
fruitful approach is to use the formula (8.14) from Lemma 8.5 as a guideline.
Let β be an integer, β ≥ 1, and let  ·  denote the Euclidean norm
in Rd . A function f (x), x ∈ [0, 1]d , is said to belong to a Hölder class of
functions Θ(β) = Θ(β, L, L1 ) if: (i) there exists a constant L1 > 0 such that
maxx∈[0,1]d |f (x)| ≤ L1 , and (ii) for any x0 ∈ [0, 1]d there exists a polynomial
$p(x) = p(x,x_0,f)$ of degree $\beta-1$ such that
\[
\big|f(x) - p(x,x_0,f)\big| \le L\|x-x_0\|^{\beta}
\]

for any x ∈ [0, 1]d , with a constant L independent of f and x0 .


To estimate the regression function at a given point x0 we can apply
any of the methods developed in the previous chapters. Let us consider the
local polynomial approximation first. Take a hypercube $H_n$ centered at $x_0$,
\[
H_n = \big\{\, x = (x^{(1)},\dots,x^{(d)})\ :\ |x^{(1)}-x_0^{(1)}|\le h_n,\ \dots,\ |x^{(d)}-x_0^{(d)}|\le h_n \,\big\}
\]

where the bandwidth hn → 0 as n → ∞. If x0 belongs to an open cube


(0, 1)d , then for any large n, the hypercube Hn is a subset of [0, 1]d .
Consider a polynomial $\pi(x)$, $x\in[0,1]^d$, of degree $\beta-1$. Note that in the case of $d$ predictor variables, there are exactly $\binom{i+d-1}{i}$ monomials of degree $i\ge 0$ (see Exercise 14.92). That is why there are $k = k(\beta,d)$ coefficients that define the polynomial $\pi(x)$ where
\[
k = k(\beta,d) = \sum_{i=0}^{\beta-1}\binom{i+d-1}{i} = \frac{1}{(d-1)!}\sum_{i=0}^{\beta-1}(i+1)\cdots(i+d-1).
\]

We denote the vector of these coefficients by θ ∈ Rk , and explicitly mention


it in the notation for the polynomial π(x) = π(x, θ).
Example 14.1. Let $d = 2$ and $\beta = 3$. A polynomial of degree $\beta-1 = 2$ has a general form
\[
\pi(x,\theta) = \theta_0 + \theta_1 x^{(1)} + \theta_2 x^{(2)} + \theta_3\big(x^{(1)}\big)^2 + \theta_4 x^{(1)}x^{(2)} + \theta_5\big(x^{(2)}\big)^2
\]
with the vector of unknown coefficients $\theta = (\theta_0,\theta_1,\theta_2,\theta_3,\theta_4,\theta_5)'$. To verify the dimension of $\theta$, compute
\[
k = k(3,2) = \sum_{i=0}^{2}\binom{i+1}{i} = 1 + 2 + 3 = 6. \qquad\square
\]
i=0
For any $x$ in the hypercube $H_n$ centered at $x_0$, we rescale this polynomial to get $\pi\big((x-x_0)/h_n,\theta\big)$. Suppose there are $N$ pairs of observations $(x_i,y_i)$ such that the design points belong to $H_n$. Without loss of generality, we may assume that these are the first $N$ observations, $x_1,\dots,x_N\in H_n$. The vector $\theta$ can be estimated by the method of least squares. The estimator $\hat\theta$ is the solution of the minimization problem (cf. (9.1)),
\[
(14.2)\qquad \sum_{i=1}^{N}\Big[\,y_i - \pi\big((x_i-x_0)/h_n,\hat\theta\big)\Big]^2 \to \min_{\hat\theta}.
\]

As in Section 9.1, we can define a system of normal equations (9.2)


where G is the design matrix with dimensions N× k. The columns
 of G are
composed of the monomials in the polynomial π (x − x0 )/hn , θ evaluated
at the N design points.
Let Assumption 9.2 hold, that is, we assume that the elements of the matrix $\big(G'G\big)^{-1}$ are bounded from above by $\gamma_0 N^{-1}$ with a constant $\gamma_0$ independent of $n$. Clearly, this assumption is a restriction on the design $\mathcal{X}$.
The next proposition is a simplified version of Proposition 9.4.
Proposition 14.2. Suppose Assumption 9.2 holds. Let $\hat\theta_0 = \pi(0,\hat\theta)$ be the estimate of the intercept, that is, the first component of the solution $\hat\theta$ of (14.2). Then uniformly in $f\in\Theta(\beta,L,L_1)$, we have that
\[
\hat\theta_0 - f(x_0) = b_0 + N_0
\]
where $|b_0| \le C_b h_n^{\beta}$, and $N_0$ is a zero-mean normal random variable with the variance $\mathrm{Var}_f\big[N_0\,|\,\mathcal{X}\big] \le C_v/N$. The positive constants $C_b$ and $C_v$ are independent of $n$.
Proof. From the definition of the Hölder class $\Theta(\beta,L,L_1)$, we obtain (cf. (9.3)),
\[
y_i = p(x_i,x_0,f) + \rho(x_i,x_0,f) + \varepsilon_i, \quad i = 1,\dots,n,
\]
where the remainder term satisfies
\[
|\rho(x_i,x_0,f)| \le L\|x_i-x_0\|^{\beta} \le L\big(\sqrt{d}\,h_n\big)^{\beta}.
\]
Repeating the proofs of Lemmas 9.1 and 9.3, we find that the least-squares estimator $\hat\theta$ actually estimates the vector of coefficients of the polynomial $p(x,x_0,f)$ in the above approximation. The deterministic error here does not exceed $C_b h_n^{\beta}$, and the zero-mean normal stochastic term has a variance not larger than $C_v/N$. By definition, the zero-order term of the approximation polynomial $\theta_0$ is equal to $f(x_0)$. Hence, the estimate of the intercept
\[
\hat\theta_0 = \theta_0 + b_0 + N_0 = f(x_0) + b_0 + N_0
\]
satisfies the claim of the proposition. $\square$

Finally, we have arrived at the point where the influence of the higher
dimension shows up. In Section 9.1, to obtain the minimax rate of con-
vergence, we needed Assumption 9.5 which helped to control the stochastic
term. This assumption required that the number N of the design points in
the hn -neighborhood of the given point x0 is proportional to nhn . Clearly,
this assumption was meant to meet the needs of regular designs, determin-
istic or random. So, the question arises: How many design points can we
anticipate in the regular cases in the d-dimensional Hn -neighborhood of x0 ?
Simple geometric considerations show that at best we can rely on a number
proportional to the volume of Hn .
Assumption 14.3. There exists a positive constant $\gamma_1$, independent of $n$, such that for all large enough $n$, the inequality $N \ge \gamma_1 n h_n^d$ holds. $\square$

Now we are in the position to formulate the upper bound result.
Theorem 14.4. Suppose that the design $\mathcal{X}$ satisfies the conditions of Assumptions 9.2 and 14.3 with $h_n = h_n^* = n^{-1/(2\beta+d)}$. Given $\mathcal{X}$, the quadratic risk of the local polynomial estimator $\hat\theta_0 = \pi(0,\hat\theta)$ described in Proposition 14.2 admits the upper bound
\[
\sup_{f\in\Theta(\beta,L,L_1)} E_f\Big[\big(\pi(0,\hat\theta) - f(x_0)\big)^2\,\Big|\,\mathcal{X}\Big] \le r^*\,n^{-2\beta/(2\beta+d)}
\]
where a positive constant $r^*$ is independent of $n$.
Proof. Analogously to the proof of Theorem 9.6, from Proposition 14.2, the upper bound of the quadratic risk holds uniformly in $f\in\Theta(\beta,L,L_1)$,
\[
E_f\Big[\big(\pi(0,\hat\theta)-f(x_0)\big)^2\,\Big|\,\mathcal{X}\Big] \le C_b^2 h_n^{2\beta} + \frac{C_v}{N} \le C_b^2 h_n^{2\beta} + \frac{C_v}{\gamma_1 n h_n^d}.
\]
The balance equation in the $d$-dimensional case has the form $h_n^{2\beta} = 1/(n h_n^d)$. The optimal choice of the bandwidth is $h_n = h_n^* = n^{-1/(2\beta+d)}$, and the respective rate of convergence is $(h_n^*)^{\beta} = n^{-\beta/(2\beta+d)}$. $\square$
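A minimal numerical sketch of the local polynomial estimator analyzed above, for $d=2$ and $\beta=2$ (a local linear fit); the random uniform design, the regression function, the noise level, and the point $x_0$ are illustrative assumptions, not part of the text.

```python
import numpy as np

# Local linear (beta = 2) estimator of f(x0) in dimension d = 2, with the
# balance-equation bandwidth h = n^(-1/(2*beta + d)) from Theorem 14.4.
rng = np.random.default_rng(3)
d, beta = 2, 2
n, sigma = 40000, 0.5
f = lambda x: np.sin(2 * np.pi * x[:, 0]) + x[:, 1] ** 2
x0 = np.array([0.5, 0.5])

X = rng.uniform(size=(n, d))
y = f(X) + sigma * rng.standard_normal(n)

h = n ** (-1.0 / (2 * beta + d))                   # optimal bandwidth
inside = np.all(np.abs(X - x0) <= h, axis=1)       # hypercube H_n around x0
Z = (X[inside] - x0) / h                           # rescaled local coordinates
G = np.column_stack([np.ones(Z.shape[0]), Z])      # monomials 1, z1, z2 (k = 3)
theta, *_ = np.linalg.lstsq(G, y[inside], rcond=None)
print("N =", inside.sum(), " estimate:", theta[0], " true:", f(x0[None, :])[0])
```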

14.2. Additive regression


The minimax rate of convergence ψn = n−β/(2β+d) in a d-dimensional Hölder
regression rapidly slows down as d increases. One way to overcome this
“curse of dimensionality” is to assume that the smoothness also grows with
d. Indeed, if β = dβ1 , then the exponent in the rate of convergence β/(2β +
d) = β1 /(2β1 + 1) matches the one-dimensional case with the smoothness
parameter β1 . However, this assumption is very restrictive.
Another approach is to impose some constraints on the structure of


the regression model. Here we consider one example, a so-called additive
regression model. To understand the role of a higher dimension, it suffices
to look at the case d = 2 and a very basic regular design. Suppose that
in the two-dimensional regression model, the design $\mathcal{X}$ is equidistant. Let $m = \sqrt{n}$ be an integer. The design points, responses, and random errors are
all labeled by two integer indices i and j, where i, j = 1, . . . , m. We denote
the design points by
(14.3) xij = (i/m, j/m).
Thus, the regression relation takes the form
(14.4) yij = f (i/m, j/m) + εij
where εij are independent N (0, σ 2 ) random variables.
In the additive regression model, the regression function is the sum of
two functions, both of which depend on one variable,

(14.5) f (x) = f (x(1) , x(2) ) = f1 (x(1) ) + f2 (x(2) )


where f1 and f2 are the Hölder functions of a single variable, f1 , f2 ∈
Θ(β, L, L1 ).
This definition of the additive model is not complete, since we can always
add a constant to one term and subtract it from the other one. To make
the terms identifiable, we impose the following conditions:
\[
(14.6)\qquad \int_0^1 f_1(t)\,dt = \int_0^1 f_2(t)\,dt = 0.
\]

Let x0 be a fixed point strictly inside the unit square. Without loss of
generality, we will assume that this point coincides with one of the design
knots, x0 = (i0 /m, j0 /m). Clearly, we could treat the model of observations
(14.4) as a two-dimensional regression. The value of the regression function
f (x0 ) at x0 can be estimated with the rate n−β/(2β+2) suggested by Theorem
14.4 for d = 2. A legitimate question at this point is whether it is possible
to estimate f (x0 ) with a faster rate exploiting the specific structure of the
model. In particular, is it possible to attain the one-dimensional rate of
convergence n−β/(2β+1) ? As the following proposition shows, the answer to
this question is affirmative.
Proposition 14.5. In the additive regression model (14.4)-(14.6) at any point $x_0 = (i_0/m, j_0/m)$, there exists an estimator $\hat f_n(x_0)$ such that
\[
\sup_{f_1,f_2\in\Theta(\beta,L,L_1)} E_f\Big[\big(\hat f_n(x_0)-f(x_0)\big)^2\Big] \le r^*\,n^{-2\beta/(2\beta+1)}
\]
for all large enough n. Here a constant r∗ is independent of n.

Proof. Select the bandwidth h∗n = n−1/(2β+1) as if the model were one-
dimensional. Consider the set of indices
 
In = In (i0 /m) = i : |i/m − i0 /m| ≤ h∗n .
The number N of indices in the set In is equal to N = |In | = 2mh∗n  + 1 .

Note that mh∗n = nn−1/(2β+1) → ∞ , and hence N ∼ 2 m h∗n as n → ∞.
To estimate f1 at i0 /m, consider the means
1  1   1 
m m m
ȳi · = yij = f1 (i/m) + f2 (j/m) + εij
m m m
j=1 j =1 j =1

1
(14.7) = f1 (i/m) + δn + √ ε̄i , i ∈ In ,
m
where the deterministic error
1 
m
δn = f2 (j/m)
m
j=1

1
is the Riemann sum for the integral 0 f2 (t) dt = 0, and the random variables
1 
m
ε̄i = √ εij ∼ N (0, σ 2 )
m
j =1

are independent for different i ∈ In .


Applying (14.6), we find that
\[
|\delta_n| \le L_0/m = L_0/\sqrt{n} = o\big((h_n^*)^{\beta}\big) \quad\text{as } n\to\infty
\]
with a Lipschitz constant $L_0 = \max|f_2'|$ of any function $f_2\in\Theta(\beta,L,L_1)$. Thus, we have a one-dimensional regression problem with $N$ observations in the bin centered at $i_0/m$. Applying the one-dimensional local polynomial approximation (see Lemma 9.3) to the means (14.7), we can estimate $f_1$ at the point $i_0/m$ with the deterministic error not exceeding
\[
C_b(h_n^*)^{\beta} + |\delta_n| = C_b(h_n^*)^{\beta}\big(1+o(1)\big).
\]
The stochastic error of this estimator is normal with zero expectation and the variance which is not larger than
\[
\frac{C_v}{N}\Big(\frac{\sigma}{\sqrt{m}}\Big)^2 \sim \frac{C_v}{2mh_n^*}\Big(\frac{\sigma}{\sqrt{m}}\Big)^2 = \frac{C_v\,\sigma^2}{2nh_n^*}.
\]
The constants $C_b$ and $C_v$ are independent of $n$. Here $(\sigma/\sqrt{m})^2 = \sigma^2/m = \sigma^2/\sqrt{n}$ represents the variance of the stochastic error of the means (14.7). So, the one-dimensional balance between the deterministic and stochastic errors holds, and the one-dimensional rate of convergence $(h_n^*)^{\beta}$ is guaranteed.
Similarly, we can estimate f2 at j0 /m with the same one-dimensional


rate. 
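The row-averaging step in the proof is easy to implement. The sketch below is only an illustration: the functions $f_1$, $f_2$, the noise level, and the use of a simple local average in place of the local polynomial of Lemma 9.3 are assumptions of the demo, not part of the text.

```python
import numpy as np

# Additive model (14.4)-(14.6): y_ij = f1(i/m) + f2(j/m) + eps_ij.
# Estimate f1 at t0 = i0/m by averaging each row over j (which removes f2
# up to O(1/m)) and then smoothing the row means over the bin |i/m - t0| <= h.
rng = np.random.default_rng(4)
m, sigma, beta = 200, 1.0, 1                  # n = m^2 = 40000 observations
f1 = lambda t: np.sin(2 * np.pi * t)          # int_0^1 f1 = 0
f2 = lambda t: np.cos(2 * np.pi * t)          # int_0^1 f2 = 0

i = np.arange(1, m + 1) / m
Y = f1(i)[:, None] + f2(i)[None, :] + sigma * rng.standard_normal((m, m))

n = m * m
h = n ** (-1.0 / (2 * beta + 1))              # one-dimensional bandwidth
t0 = i[m // 4]                                # estimate f1 at this design knot

row_means = Y.mean(axis=1)                    # ybar_{i.} = f1(i/m) + delta_n + noise
window = np.abs(i - t0) <= h
f1_hat = row_means[window].mean()             # simple local average in the bin
print("estimate:", f1_hat, "  true f1(t0):", f1(t0))
```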
Remark 14.6. In this section, we considered the simplest version of the
additive regression model. In more general settings, the design may be de-
terministic or random, and the dimension may be any positive integer. The
effect, however, is still the same: the one-dimensional rate of convergence
is attainable. Clearly, this rate is minimax since in any higher dimension,
this rate cannot be improved for the subset of one-dimensional regression
functions. 
Remark 14.7. Traditionally the additive regression model includes a con-
stant intercept f0 . That is, the regression function has the form
(14.8) f (x(1) , x(2) ) = f0 + f1 (x(1) ) + f2 (x(2) ).
For simplicity, it was omitted in the model considered in this section. To
estimate f0 , we could split the sample of observations into two sub-samples
of sizes n/2 each, but this would destroy the regularity of the design. It
is more convenient to consider the model with two independent repeated
observations yij and ỹij at the design knots (i/m , j/m). Then we can use
the second set of observations to estimate the intercept,
1 
m
(14.9) ˆ
f0 = ỹij .
n
i,j = 1

Now we can replace the observations yij in (14.4) by yij − fˆ0 , and use these
shifted observations to estimate f1 and f2 as done above. Then the statement
of Proposition 14.5 would stay unchanged (see Exercise 14.93). 

14.3. Single-Index Model


14.3.1. Definition. The additive regression model of the previous section
provides an example of a specific structure of the nonparametric regression
function. In this section we give another example, known as a single-index
model. This name unites a variety of models. We present here a version
that is less technical.
Consider a two-dimensional regression model with the equidistant design
in the unit square [0, 1]2 . It is convenient to study the model with two
independent repeated observations at every design knot xij = (i/m, j/m),
(14.10) yij = f (i/m, j/m) + εij and ỹij = f (i/m, j/m) + ε̃ij

where 1 ≤ i, j ≤ m , and m = n is assumed to be an integer (cf. Remark
14.7). The random variables εij and ε̃ij are independent N (0, σ 2 ).
200 14. Dimension and Structure in Nonparametric Regression

The structure of the regression function f is imposed by the assumption


that there exist a Hölder function g = g(t) and an angle α such that
 
(14.11) f (i/m, j/m) = g (i/m) cos α + (j/m) sin α , 1 ≤ i, j ≤ m.
We will suppose that 0 ≤ α ≤ π/4. The restrictions on the function g are
more elaborate and are discussed below.
Let $\beta\ge 2$ be an integer, and let $g_*$ be a positive number. Assume that $\Theta(\beta,L,L_1)$ is the class of Hölder functions $g = g(t)$ in the interval $0\le t\le\sqrt{2}$, and let
\[
\Theta(\beta,L,L_1,g_*) = \Theta(\beta,L,L_1)\cap\{\,g'(t)\ge g_*\,\}
\]
be the sub-class of functions whose first derivative exceeds $g_*$.
Introduce a class of functions $\mathcal{H} = \mathcal{H}(\beta,L,L_1,g_*)$ in the unit square $0\le x^{(1)},x^{(2)}\le 1$ by
\[
\mathcal{H} = \Big\{\, f = f(x^{(1)},x^{(2)}) = g\big(x^{(1)}\cos\alpha + x^{(2)}\sin\alpha\big),\ \ 0\le\alpha\le\pi/4,\ \ g\in\Theta(\beta,L,L_1,g_*) \Big\}.
\]
This class is well defined because the variable $t = x^{(1)}\cos\alpha + x^{(2)}\sin\alpha$ belongs to the interval $0\le t\le\sqrt{2}$.
proper angle, depend on a single variable, and are monotonically increasing
in the corresponding direction. The point
(14.12) tij = (i/m) cos α + (j/m) sin α
is the projection (show!) of xij = (i/m, j/m) onto the straight line pass-
ing through the origin at the angle α (see Figure 11). If we knew α, we
could compute the projections tij , and the problem would become one-
dimensional.

Figure 11. Projection of the design knot $x_{ij}$ on the line passing through the origin at the angle $\alpha$.
14.3. Single-Index Model 201

Let x0 = (i0 /m, j0 /m) be a fixed point. Our objective is to estimate


the value f (x0 ) of the regression function at this point. Clearly, we can
look at the observations (14.10) as the observations of the two-dimensional
regression. The results of Section 14.1 would guarantee the minimax rate of
estimation n−2β/(2β+2) . Can this rate be improved to the one-dimensional
rate n−2β/(2β+1) ? The answer is positive, and the algorithm is simple. First,
estimate α by α̂n , and then plug α̂n into the projection formula (14.12) for
the one-dimensional design points,
t̂ij = (i/m) cos α̂n + (j/m) sin α̂n .
The two-sample model of observations (14.10) is convenient, because the
first sample is used to estimate α, while the second one serves to estimate
the regression function itself. We could work with one sample of size n,
and split it into two independent sub-samples, but this would result in less
regular designs.

14.3.2. Estimation of Angle. To estimate $\alpha$, note that at any point $(x^{(1)},x^{(2)})\in[0,1]^2$, the partial derivatives of the regression function are proportional to $\cos\alpha$ and $\sin\alpha$, respectively,
\[
(14.13)\qquad \frac{\partial f(x^{(1)},x^{(2)})}{\partial x^{(1)}} = g'\big(x^{(1)}\cos\alpha + x^{(2)}\sin\alpha\big)\cos\alpha
\]
and
\[
(14.14)\qquad \frac{\partial f(x^{(1)},x^{(2)})}{\partial x^{(2)}} = g'\big(x^{(1)}\cos\alpha + x^{(2)}\sin\alpha\big)\sin\alpha.
\]
If we integrate the left-hand sides of (14.13) and (14.14) over the square $[0,1]^2$, we obtain integral functionals that we may try to estimate. Unfortunately, these are functionals of partial derivatives of $f$, not of $f$ itself. However, we can turn these functionals into functionals of $f$ if we integrate by parts.
Choose any function $\varphi = \varphi(x^{(1)},x^{(2)})$, $(x^{(1)},x^{(2)})\in[0,1]^2$. Assume that $\varphi$ is non-negative and very smooth, for example, infinitely differentiable. Assume also that it is equal to zero identically on the boundary of the unit square,
\[
(14.15)\qquad \varphi(x^{(1)},x^{(2)}) = 0 \quad\text{for } (x^{(1)},x^{(2)})\in\partial[0,1]^2.
\]
Multiplying the left-hand sides of (14.13) and (14.14) by $\varphi$ and integrating by parts over $[0,1]^2$, we obtain an integral functional of $f$,
\[
(14.16)\qquad \Phi_l(f) = \int_0^1\!\!\int_0^1 \varphi(x^{(1)},x^{(2)})\,\frac{\partial f(x^{(1)},x^{(2)})}{\partial x^{(l)}}\,dx^{(1)}dx^{(2)}
\]
\[
(14.17)\qquad = \int_0^1\!\!\int_0^1 w_l(x^{(1)},x^{(2)})\,f(x^{(1)},x^{(2)})\,dx^{(1)}dx^{(2)}
\]
where $w_l$ are the weight functions
\[
w_l(x^{(1)},x^{(2)}) = -\frac{\partial\varphi(x^{(1)},x^{(2)})}{\partial x^{(l)}}, \quad l = 1 \text{ or } 2.
\]
The outside-of-the-integral term in (14.17) vanishes due to the boundary condition (14.15).
Thus, (14.13) and (14.14) along with (14.17) yield the equations
\[
(14.18)\qquad \Phi_1 = \Phi_1(f) = \Phi_0\cos\alpha \quad\text{and}\quad \Phi_2 = \Phi_2(f) = \Phi_0\sin\alpha
\]
with
\[
\Phi_0 = \Phi_0(f) = \int_0^1\!\!\int_0^1 \varphi(x^{(1)},x^{(2)})\,g'\big(x^{(1)}\cos\alpha + x^{(2)}\sin\alpha\big)\,dx^{(1)}dx^{(2)}.
\]

Under our assumptions, uniformly in $f\in\mathcal{H}$, the values of $\Phi_0(f)$ are separated away from zero by some strictly positive constant,
\[
(14.19)\qquad \Phi_0(f) \ge \Phi_* > 0.
\]
Now, given the values of the functionals $\Phi_1$ and $\Phi_2$, we can restore the angle $\alpha$ from the equation $\alpha = \arctan\big(\Phi_2/\Phi_1\big)$. Define the estimators of these functionals by
\[
(14.20)\qquad \hat\Phi_n^{(l)} = \frac{1}{n}\sum_{i,j=1}^{m} w_l(i/m,j/m)\,y_{ij}, \quad l = 1 \text{ or } 2.
\]
Then we can estimate the angle $\alpha$ by
\[
\hat\alpha_n = \arctan\big(\hat\Phi_n^{(2)}/\hat\Phi_n^{(1)}\big).
\]
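A minimal numerical sketch of this angle estimator (illustrative code, not from the text): the bump function $\varphi$, the link $g$, the true angle, the noise level, and the grid size are all assumptions of the demo.

```python
import numpy as np

# Estimate the angle alpha in the single-index model (14.11) via the weighted
# sums (14.20) with weights w_l = -d(phi)/dx^(l), where phi vanishes on the
# boundary of the unit square.
rng = np.random.default_rng(5)
m, sigma = 200, 0.5
alpha = np.pi / 6                                  # true angle, within [0, pi/4]
g = lambda t: 2.0 * t + 0.3 * t ** 2               # increasing link, g' >= 2 > 0

i = np.arange(1, m + 1) / m
X1, X2 = np.meshgrid(i, i, indexing="ij")
Y = g(X1 * np.cos(alpha) + X2 * np.sin(alpha)) + sigma * rng.standard_normal((m, m))

# bump phi(x1, x2) = sin(pi x1)^2 sin(pi x2)^2, equal to zero on the boundary
w1 = -2 * np.pi * np.sin(np.pi * X1) * np.cos(np.pi * X1) * np.sin(np.pi * X2) ** 2
w2 = -2 * np.pi * np.sin(np.pi * X2) * np.cos(np.pi * X2) * np.sin(np.pi * X1) ** 2

n = m * m
Phi1 = np.sum(w1 * Y) / n                          # estimator (14.20), l = 1
Phi2 = np.sum(w2 * Y) / n                          # estimator (14.20), l = 2
alpha_hat = np.arctan2(Phi2, Phi1)
print("alpha_hat:", alpha_hat, "  true alpha:", alpha)
```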

Note that the ratio $\hat\Phi_n^{(2)}/\hat\Phi_n^{(1)}$ can be arbitrarily large, positive or negative. Thus, the range of $\hat\alpha_n$ runs from $-\pi/2$ to $\pi/2$, whereas the range of the true $\alpha$ is $[0,\pi/4]$. Next, we want to show that the values of $\hat\alpha_n$ outside of the interval $[0,\pi/4]$ are possible only due to large deviations, and the probability of this event is negligible if $n$ is large. As the following proposition shows, the estimator $\hat\alpha_n$ is $\sqrt{n}$-consistent with rapidly decreasing probabilities of large deviations. The proof of this proposition is postponed to the next section.

Proposition 14.8. There exist positive constants $a_0$, $c_0$, and $c_1$, independent of $f$ and $n$, such that for any $x$, $c_0\le x\le c_1\sqrt{n}$, the following inequality holds:
\[
P_f\big(\,|\hat\alpha_n - \alpha| > x/\sqrt{n}\,\big) \le 4\exp\{-a_0 x^2\}.
\]
14.3.3. Estimation of Regression Function. We use the second sample


ỹij of the observations in (14.10) to estimate the regression function f (x0 ) at
the given knot x0 = (i0 /m, j0 /m). Recall that tij , as introduced in (14.12),
is the projection of (i/m, j/m) onto the line determined by the true angle α .
Denote by t̂ij the projection of (i/m, j/m) onto the line determined by the
estimated angle α̂n , and let ûij be the projection in the orthogonal direction
given by the angle α̂n + π/2, that is,

t̂ij = (i/m) cos α̂n + (j/m) sin α̂n ,

and
ûij = −(i/m) sin α̂n + (j/m) cos α̂n .

Let the respective projections of the fixed point x0 = (i0 /m, j0 /m) be de-
noted by $\hat t_0$ and $\hat u_0$. Introduce $T$, a rectangle in the new coordinates (see Figure 12),
\[
T = \big\{\,(t,u)\ :\ |t-\hat t_0|\le h_n^*,\ |u-\hat u_0|\le H\,\big\}
\]
where h∗n = n−1/(2β+1) and H is a constant independent of n and so small


that T ⊂ [0, 1]2 .

Figure 12. Rectangle $T$ in the coordinate system rotated by the angle $\hat\alpha_n$.

Proposition 14.9. For any design knot xij = (i/m, j/m) ∈ T , the obser-
vation ỹij in (14.10) admits the representation

ỹij = g(t̂ij ) + ρij + ε̃ij , 1 ≤ i, j ≤ m,

with the remainder term ρij being independent of the random variable ε̃ij ,
and satisfying the inequality
\[
\max_{1\le i,j\le m}|\rho_{ij}| \le 2L_0\,|\hat\alpha_n - \alpha|
\]
where $L_0 = \max|g'|$ is the Lipschitz constant of any $g\in\Theta(\beta,L,L_1,g_*)$.


204 14. Dimension and Structure in Nonparametric Regression

Proof. Put $\rho_{ij} = g(t_{ij}) - g(\hat t_{ij})$. By definition, $\rho_{ij}$ depends only on the first sub-sample $y_{ij}$ of observations in (14.10), and hence is independent of $\tilde\varepsilon_{ij}$. We have $\tilde y_{ij} = g(\hat t_{ij}) + \rho_{ij} + \tilde\varepsilon_{ij}$. For any knot $(i/m,j/m)$, we obtain
\[
|\rho_{ij}| = |g(t_{ij}) - g(\hat t_{ij})|
= \Big|\,g\big((i/m)\cos\alpha + (j/m)\sin\alpha\big) - g\big((i/m)\cos\hat\alpha_n + (j/m)\sin\hat\alpha_n\big)\Big|
\]
\[
\le L_0\Big[(i/m)\big|\cos\hat\alpha_n - \cos\alpha\big| + (j/m)\big|\sin\hat\alpha_n - \sin\alpha\big|\Big]
\le L_0\big(i/m + j/m\big)\,|\hat\alpha_n - \alpha| \le 2L_0\,|\hat\alpha_n - \alpha|. \qquad\square
\]

Further, let $\hat f(x_0)$ be the local polynomial approximation obtained from the observations $\tilde y_{ij}$ at the design points $\hat t_{ij}$ where $(i/m,j/m)\in T$. It means that $\hat f(x_0) = \hat\theta_0$ where $\hat\theta_0$ is the least-squares estimator of the intercept. It can be obtained as a partial solution of the minimization problem
\[
\sum_{(i/m,\,j/m)\in T}\Big[\,\tilde y_{ij} - \Big(\hat\theta_0 + \hat\theta_1\frac{\hat t_{ij}-\hat t_0}{h_n^*} + \dots + \hat\theta_{\beta-1}\Big(\frac{\hat t_{ij}-\hat t_0}{h_n^*}\Big)^{\beta-1}\Big)\Big]^2 \to \min_{\hat\theta_0,\dots,\hat\theta_{\beta-1}}.
\]

To analyze this minimization problem, we have to verify Assumptions 9.2 and 9.5 on the system of normal equations associated with it. Denote by $N(T)$ the number of design knots in $T$, and let $G'G$ be the matrix of the system of normal equations with the elements
\[
\big(G'G\big)_{k,l} = \sum_{(i/m,\,j/m)\in T}\Big(\frac{\hat t_{ij}-\hat t_0}{h_n^*}\Big)^{k+l}, \quad k,l = 0,\dots,\beta-1.
\]
Note that the number of design knots $N(T)$ and the elements of the matrix $G'G$ are random because they depend on the estimator $\hat\alpha_n$.
Lemma 14.10. (i) Uniformly in $\hat\alpha_n$, the number of design knots $N(T)$ satisfies
\[
\lim_{n\to\infty}\frac{N(T)}{4Hnh_n^*} = 1.
\]
(ii) The normalized elements of the matrix $G'G$ have the limits
\[
\lim_{n\to\infty}\frac{1}{N(T)}\big(G'G\big)_{k,l} = \frac{1-(-1)^{k+l+1}}{2(k+l+1)},
\]
and the limiting matrix is invertible.
Proof. See the next section. $\square$


Theorem 14.11. The estimator $\hat f(x_0) = \hat\theta_0$ has the one-dimensional rate of convergence on $\mathcal{H}$,
\[
\sup_{f\in\mathcal{H}} n^{2\beta/(2\beta+1)}\,E_f\Big[\big(\hat f(x_0)-f(x_0)\big)^2\Big] \le r^*
\]
for all sufficiently large $n$ with a constant $r^*$ independent of $n$.


Proof. Proposition 14.9 and Lemma 14.10 allow us to apply an expansion similar to Proposition 9.4. In the case under consideration, this expansion takes the form
\[
(14.21)\qquad \hat\theta_0 = f(x_0) + b_0 + N_0.
\]
Conditionally on the first sub-sample of observations in (14.10), the bias term $b_0$ admits the bound
\[
|b_0| \le C_b(h_n^*)^{\beta} + C_a\max_{1\le i,j\le m}|\rho_{ij}| \le C_b(h_n^*)^{\beta} + 2L_0C_a\,|\hat\alpha_n-\alpha|
\]
where the constants $C_a$ and $C_b$ are independent of $n$. The stochastic term $N_0$ on the right-hand side of (14.21) is a zero-mean normal random variable with the conditional variance that does not exceed $C_v/N(T) \le C_v/(2Hnh_n^*)$. Hence, uniformly in $f\in\mathcal{H}$, we have that
\[
E_f\Big[\big(\hat f(x_0)-f(x_0)\big)^2\Big] = E_f\big[(b_0+N_0)^2\big] \le 2E_f\big[b_0^2\big] + 2E_f\big[N_0^2\big]
\le 2E_f\Big[\big(C_b(h_n^*)^{\beta} + 2L_0C_a|\hat\alpha_n-\alpha|\big)^2\Big] + 2E_f\big[N_0^2\big]
\]
\[
\le 4C_b^2(h_n^*)^{2\beta} + 16L_0^2C_a^2\,E_f\big[(\hat\alpha_n-\alpha)^2\big] + \frac{2C_v}{2Hnh_n^*}.
\]
Note that with probability 1, $|\hat\alpha_n-\alpha| \le \pi$. From Proposition 14.8, we can estimate the latter expectation by
\[
E_f\big[(\hat\alpha_n-\alpha)^2\big] = \int_0^{\pi} z^2\,dP_f\big(|\hat\alpha_n-\alpha|\le z\big)
\le \Big(\frac{c_0}{\sqrt{n}}\Big)^2 + \int_{c_0/\sqrt{n}}^{c_1} z^2\,dP_f\big(|\hat\alpha_n-\alpha|\le z\big) + \int_{c_1}^{\pi} z^2\,dP_f\big(|\hat\alpha_n-\alpha|\le z\big)
\]
\[
\le \frac{c_0^2}{n} + \Big[\frac{c_0^2}{n} + 2\int_{c_0/\sqrt{n}}^{c_1} z\,P_f\big(|\hat\alpha_n-\alpha|>z\big)\,dz\Big] + \pi^2\,P_f\big(|\hat\alpha_n-\alpha|>c_1\big)
\]
\[
\le \frac{2c_0^2}{n} + 4\int_0^{\infty}\exp\{-a_0nz^2\}\,d(z^2) + 4\pi^2\exp\{-a_0nc_1^2\}
\le \frac{2c_0^2}{n} + \frac{4}{a_0n} + 4\pi^2\exp\{-a_0nc_1^2\} \le \frac{C_1}{n}
\]
for some positive constant $C_1$ and for all large enough $n$. Thus,
\[
E_f\Big[\big(\hat f(x_0)-f(x_0)\big)^2\Big] \le 4C_b^2(h_n^*)^{2\beta} + \frac{16L_0^2C_a^2C_1}{n} + \frac{2C_v}{2Hnh_n^*} = O\big((h_n^*)^{2\beta}\big).
\]
Here we used the facts that $(h_n^*)^{2\beta} = (nh_n^*)^{-1}$, and $C_1/n = o\big((h_n^*)^{2\beta}\big)$ as $n\to\infty$. $\square$
14.4. Proofs of Technical Results


To prove Proposition 14.8, we need two lemmas.
Lemma 14.12. For any $n$, the estimator $\hat\Phi_n^{(l)}$ given by (14.20) of the functionals $\Phi_l(f)$ defined in (14.16) admits the representation
\[
\hat\Phi_n^{(l)} = \Phi_l(f) + \rho_n^{(l)}(f)/\sqrt{n} + \eta_n^{(l)}/\sqrt{n}, \quad l = 1 \text{ or } 2,
\]
where the deterministic remainder term is bounded by a constant, $|\rho_n^{(l)}(f)| \le C_\rho$, and the random variables $\eta_n^{(l)}$ are zero-mean normal with the variances bounded from above, $\mathrm{Var}_f\big[\eta_n^{(l)}\big] \le C_v^2$. The constants $C_\rho$ and $C_v$ are independent of $n$ and $f$.
Proof. Recall that $m = \sqrt{n}$ is assumed to be an integer. Note that
\[
\hat\Phi_n^{(l)} = \frac{1}{n}\sum_{i,j=1}^{m} w_l(i/m,j/m)\,f(i/m,j/m) + \frac{1}{n}\sum_{i,j=1}^{m} w_l(i/m,j/m)\,\varepsilon_{ij}
\]
\[
= \sum_{i,j=1}^{m}\int_{(i-1)/m}^{i/m}\!\!\int_{(j-1)/m}^{j/m} w_l(x_1,x_2)\,f(x_1,x_2)\,dx_2\,dx_1 + \frac{\rho_n^{(l)}}{\sqrt{n}} + \frac{\eta_n^{(l)}}{\sqrt{n}}
\]
where
\[
\rho_n^{(l)} = \sqrt{n}\sum_{i,j=1}^{m}\int_{(i-1)/m}^{i/m}\!\!\int_{(j-1)/m}^{j/m}\Big[\,w_l(i/m,j/m)\,f(i/m,j/m) - w_l(x_1,x_2)\,f(x_1,x_2)\,\Big]dx_2\,dx_1
\]
and
\[
\eta_n^{(l)} = \frac{1}{\sqrt{n}}\sum_{i,j=1}^{m} w_l(i/m,j/m)\,\varepsilon_{ij}.
\]
The variance of the normal random variable $\eta_n^{(l)}$ is equal to
\[
\mathrm{Var}\big[\eta_n^{(l)}\big] = \frac{\sigma^2}{n}\sum_{i,j=1}^{m} w_l^2(i/m,j/m)
\;\underset{n\to\infty}{\longrightarrow}\; \sigma^2\int_0^1\!\!\int_0^1 w_l^2(x_1,x_2)\,dx_2\,dx_1 < C_v^2 < \infty.
\]
The deterministic remainder term $\rho_n^{(l)}$ admits the upper bound
\[
\big|\rho_n^{(l)}\big| \le L_0\,m\sum_{i,j=1}^{m}\int_{(i-1)/m}^{i/m}\!\!\int_{(j-1)/m}^{j/m}\Big(\,\big|x_1 - i/m\big| + \big|x_2 - j/m\big|\,\Big)dx_2\,dx_1
= L_0\,m\sum_{i,j=1}^{m}\frac{1}{m^3} = L_0
\]
where $L_0$ is the Lipschitz constant of the product $w_l f$. $\square$


(1) (2)
Lemma 14.13. Let Φ∗ , Φ̂n and Φ̂n be as defined in (14.19) and (14.20).
If y satisfies the conditions
√ √
max(Cρ , Cv ) ≤ y ≤ (4 2)−1 Φ∗ n,
then for all sufficiently large n, and any f ∈ H , the following inequality
holds:
  12 y   y2 
Pf  Φ̂(2)
n /Φ̂n − tan α ≤
(1)  √ ≥ 1 − 4 exp − .
Φ∗ n 2Cv2

(l)
Proof. From Lemma 14.12, the random variable ηn , l = 1 or 2, is a zero-
 (l)
mean normal with the variance satisfying Varf ηn ≤ Cv . Therefore, if
y ≥ Cv , then uniformly in f ∈ H, we have
   
Pf |ηn(l) | > y ≤ 2 exp − y 2 /(2 Cv2 ) , l = 1 or 2.
 
Hence, with the probability at least 1 − 4 exp −y 2 /(2Cv2 ) , we can assume
(l)
that |ηn | ≤ y for both l = 1 and 2 simultaneously. Under these conditions,
in view of (14.18) and Lemma 14.12, we obtain that
 (2) √ (1) √ 
 (2) (1)   cos α(ρ(2) + η )/ n − sin α(ρ
(1)
+ η )/ n 
 Φ̂n /Φ̂n − tan α  =  n

n n n
 (1) (1) √  
cos α Φ0 (f ) cos α + (ρn + ηn )/ n
√   
 2(cos α + sin α)(Cρ + y)   2(C + y) 
≤   ≤   
ρ
  
Φ∗ n/2 − (Cρ + y) Φ∗ n/2 − (Cρ + y) 

where we used the facts that cos √ α ≥ 1/ 2 since 0 ≤ α ≤ π/4, and, at the
last step, that sin α + cos α ≤ 2.

Further, by our assumption, Cρ ≤ y and 2y ≤ (1/2)Φ∗ n/2, therefore,
 
 (2) (1)   
 Φ̂n /Φ̂n − tan α  ≤   4y  8y 12 y
 Φ n/2 − 2y  ≤ Φ n/2 ≤ Φ∗ √n . 
∗ ∗

Proof of Proposition 14.8. In Lemma 14.13, put y = (Φ∗ /12) x. Then


√ for x, c0 ≤ x ≤
the restrictions on y in this lemma turn into the bounds

c1 n, where c0 = max(Cρ , Cv )(12/Φ∗ ) and c1 = 2 2. If we take a0 =
(Φ∗ /12)2 /(2Cv2 ), then
  √   
Pf  Φ̂(2)
n / Φ̂(1)
n − tan α  > x/ n ≤ 4 exp − a0 x2 .

Note that if |α̂n − α| > x/ n, then

| tan α̂n − tan α| = (cos α∗ )−2 |α̂n − α| ≥ |α̂n − α| > x/ n
208 14. Dimension and Structure in Nonparametric Regression

where we applied the mean value theorem with some α∗ between α̂n and α.
Thus,
 √ 
Pf |α̂n − α| > x/ n

  √   
≤ Pf  Φ̂(2)
n / Φ̂(1)
n − tan α  > x/ n ≤ 4 exp − a0 x2 . 

Proof of Lemma 14.10. (i) For every design knot $(i/m,j/m)$, consider a square, which we call a pixel, $[(i-1)/m,\,i/m]\times[(j-1)/m,\,j/m]$. Let $T_*$ be the union of the pixels that lie strictly inside $T$, and let $T^*$ be the minimal union of the pixels that contains $T$, that is, the union of pixels whose intersections with $T$ are non-empty. The diameter of each pixel is $\sqrt{2}/m = \sqrt{2/n}$, and its area is equal to $1/n$. That is why the number $N(T_*)$ of the pixels in $T_*$ is no less than $4n\big(H-\sqrt{2/n}\big)\big(h_n^*-\sqrt{2/n}\big)$, and the number $N(T^*)$ of the pixels in $T^*$ does not exceed $4n\big(H+\sqrt{2/n}\big)\big(h_n^*+\sqrt{2/n}\big)$. Since $1/\sqrt{n} = o(h_n^*)$, we find that
\[
1 \le \liminf_{n\to\infty}\frac{N(T_*)}{4Hnh_n^*} \le \limsup_{n\to\infty}\frac{N(T^*)}{4Hnh_n^*} \le 1.
\]
Due to the inequalities
\[
N(T_*) \le N(T) \le N(T^*),
\]
we can apply the squeezing theorem to conclude that the variable $N(T)$ also has the same limit,
\[
\lim_{n\to\infty}\frac{N(T)}{4Hnh_n^*} = 1.
\]
(ii) As $n\to\infty$, for any $k,l = 0,\dots,\beta-1$, we have that
\[
\frac{1}{N(T)}\big(G'G\big)_{k,l} \sim \frac{1}{4Hnh_n^*}\sum_{(i/m,\,j/m)\in T}\Big(\frac{\hat t_{ij}-\hat t_0}{h_n^*}\Big)^{k+l}
\sim \frac{1}{4Hh_n^*}\int_{-H}^{H}\!\!\int_{\hat t_0-h_n^*}^{\hat t_0+h_n^*}\Big(\frac{t-\hat t_0}{h_n^*}\Big)^{k+l}\,dt\,du
= \frac{1}{2}\int_{-1}^{1} z^{k+l}\,dz = \frac{1-(-1)^{k+l+1}}{2(k+l+1)}.
\]
The respective limiting matrix is invertible (see Exercise 9.66). $\square$


Exercises

Exercise 14.92. Prove that the number of monomials of degree $i$ in $d$ variables is equal to $\binom{i+d-1}{i}$.
Exercise 14.93. Show that in the additive regression model with the in-
tercept (14.8), the preliminary estimator (14.9) and the shifted observations
yij − fˆ0 allow us to prove the one-dimensional rate of convergence of the
nonparametric components f1 and f2 as in Proposition 14.5.
Exercise 14.94. Let $\beta_1$ and $\beta_2$ be two positive integers, $\beta_1\ne\beta_2$. A function in two variables $f(x)$, $x = (x^{(1)},x^{(2)})$, is said to belong to the anisotropic Hölder class of functions $\Theta(\beta_1,\beta_2) = \Theta(\beta_1,\beta_2,L,L_1)$, if $f$ is bounded by $L_1$, and if for any point $x_0 = (x_0^{(1)},x_0^{(2)})$, there exists a polynomial $p(x) = p(x,x_0,f)$ such that
\[
|f(x) - p(x,x_0,f)| \le L\Big(\,\big|x^{(1)}-x_0^{(1)}\big|^{\beta_1} + \big|x^{(2)}-x_0^{(2)}\big|^{\beta_2}\,\Big)
\]
where we assume that $x$ and $x_0$ belong to the unit square.
Suppose we want to estimate the value of the regression function $f$ at a given design knot $(i_0/m, j_0/m)$ from the observations
\[
y_{ij} = f(i/m,j/m) + \varepsilon_{ij}, \quad i,j = 1,\dots,m, \quad m = \sqrt{n}.
\]
Show that if the regression function belongs to the anisotropic class $\Theta(\beta_1,\beta_2)$, then there exists an estimator with the convergence rate $n^{-\tilde\beta/(2\tilde\beta+1)}$ where $\tilde\beta^{-1} = \beta_1^{-1} + \beta_2^{-1}$. Hint: Consider a local polynomial estimator in the bin with the sides $h_1$ and $h_2$ satisfying $h_1^{\beta_1} = h_2^{\beta_2}$. Show that the bias is $O\big(h_1^{\beta_1}\big)$, and the variance is $O\big((nh_1h_2)^{-1}\big)$. Now use the balance equation to find the rate of convergence.
Chapter 15

Adaptive Estimation

In Chapters 8-11, we studied a variety of nonparametric regression estima-


tion problems and found the minimax rates of convergence for different loss
functions. These rates of convergence depend essentially on the parameter
of smoothness β. This parameter determines the choice of the optimal band-
width. An estimator which is minimax optimal for one smoothness does not
preserve this property for another smoothness. The problem of adaptation
consists of finding, if possible, an adaptive estimator which is independent of
a particular β and is simultaneously minimax optimal over different non-
parametric classes.
In this chapter, we will give examples of problems where the adaptive es-
timators exist in the sense that over each class of smoothness, the regression
function can be estimated as if the smoothness parameter were known. We
start, however, with a counterexample of an estimation problem in which
the minimax rates are not attainable.

15.1. Adaptive Rate at a Point. Lower Bound


Consider regression observations on [0, 1],

yi = f (i/n) + εi , i = 1, . . . , n,

where εi are standard normal random variables. Since the design is not the
focus of our current interest, we work with the simplest equidistant design.
We assume that the smoothness parameter can take on only two values
β1 and β2 such that 1 ≤ β1 < β2 . Thus, we assume that the regression func-
tion f belongs to one of the two Hölder classes, either Θ(β1 ) = Θ(β1 , L, L1 )
or Θ(β2 ) = Θ(β2 , L, L1 ).


Let x0 = i0 /n be a given point in (0, 1) which coincides with a design


knot. We want to estimate f (x0 ) by a single estimator f˜n with the property
that if f ∈ Θ(β1 ), then the rate of convergence is n−β1 /(2β1 +1) , while if
f ∈ Θ(β2 ), then the rate of convergence is n−β2 /(2β2 +1) . Whether the true
value of the smoothness parameter is β1 or β2 is unknown. The estimator
f˜n may depend on both β1 and β2 but the knowledge of the true β cannot
be assumed.
Formally, we introduce an adaptive risk of an estimator $\tilde f_n$ by
$$
(15.1) \qquad AR(\tilde f_n) = \max_{\beta \in \{\beta_1, \beta_2\}}\; \sup_{f \in \Theta(\beta)} \mathbf{E}_f\big[\, n^{2\beta/(2\beta+1)} \big( \tilde f_n - f(x_0) \big)^2 \big].
$$

The question we want to answer is whether there exists an estimator f˜n


such that AR(f˜n ) ≤ r∗ for some constant r∗ < ∞ independent of n. The
objective of this section is to demonstrate that such an estimator does not
exist.
First, we define a class of estimators. For two given constants, $A > 0$ and $a$ such that $\beta_1/(2\beta_1+1) < a \le \beta_2/(2\beta_2+1)$, we introduce a class of estimators that are minimax optimal or sub-optimal over the regression functions of the higher smoothness,
$$
\mathcal{F} = \mathcal{F}(A, a) = \Big\{\, \tilde f_n \,:\, \sup_{f \in \Theta(\beta_2)} \mathbf{E}_f\big[\, n^{2a} \big( \tilde f_n - f(x_0) \big)^2 \big] \le A \Big\}.
$$

As the following proposition claims, the estimators that belong to $\mathcal{F}$ cannot attain the minimax rate of convergence on the lesser smoothness of regression functions.

Proposition 15.1. There exists a constant $r_* = r_*(A, a)$ independent of $n$ such that for any estimator $\tilde f_n \in \mathcal{F}(A, a)$, the following lower bound holds:
$$
(15.2) \qquad \sup_{f \in \Theta(\beta_1)} \mathbf{E}_f\Big[ \big( n/\ln n \big)^{2\beta_1/(2\beta_1+1)} \big( \tilde f_n - f(x_0) \big)^2 \Big] \ge r_* > 0.
$$

Two important corollaries follow immediately from this result.

Corollary 15.2. The adaptive risk $AR(\tilde f_n)$ in (15.1) is unbounded for any estimator $\tilde f_n$. Indeed, take $a = \beta_2/(2\beta_2+1)$ in the definition of the class $\mathcal{F}(A, a)$. Then we have that
$$
\sup_{f \in \Theta(\beta_2)} \mathbf{E}_f\big[\, n^{2\beta_2/(2\beta_2+1)} \big( \tilde f_n - f(x_0) \big)^2 \big] \le A.
$$
From Proposition 15.1, however, for all large $n$,
$$
\sup_{f \in \Theta(\beta_1)} \mathbf{E}_f\big[\, n^{2\beta_1/(2\beta_1+1)} \big( \tilde f_n - f(x_0) \big)^2 \big] \ge r_* \big( \ln n \big)^{2\beta_1/(2\beta_1+1)}
$$
with the right-hand side growing unboundedly as $n \to \infty$. Thus, the adaptive risk, being the maximum of the two supremums, is unbounded as well.

Corollary 15.3. The contrapositive statement of Proposition 15.1 is valid. It can be formulated as follows. Assume that there exists an estimator $\tilde f_n$ that guarantees the minimax rate over the Hölder class of the lesser smoothness,
$$
\sup_{f \in \Theta(\beta_1)} \mathbf{E}_f\big[\, n^{2\beta_1/(2\beta_1+1)} \big( \tilde f_n - f(x_0) \big)^2 \big] \le r^*
$$
with a constant $r^*$ independent of $n$. Then this estimator does not belong to $\mathcal{F}(A, a)$ for any $a$ and $A$. As a consequence, from the definition of $\mathcal{F}(A, a)$ with $a = \beta_2/(2\beta_2+1)$, we find that
$$
\sup_{f \in \Theta(\beta_2)} \mathbf{E}_f\big[\, n^{2\beta_2/(2\beta_2+1)} \big( \tilde f_n - f(x_0) \big)^2 \big] \to \infty \quad \text{as } n \to \infty.
$$

As Corollaries 15.2 and 15.3 explain, there is no adaptive estimator of a regression at a point. By this we mean that we cannot estimate a regression function at a point as if its smoothness were known.

Define a sequence
$$
(15.3) \qquad \psi_n = \psi_n(f) = \begin{cases} \big( n/\ln n \big)^{-\beta_1/(2\beta_1+1)} & \text{if } f \in \Theta(\beta_1), \\ n^{-\beta_2/(2\beta_2+1)} & \text{if } f \in \Theta(\beta_2). \end{cases}
$$

The next question we ask about the adaptive estimation of f (x0 ) is


whether it can be estimated with the rate ψn (f ). The answer to this question
is positive. We leave it as an exercise (see Exercise 15.95).
The rest of this section is devoted to the proof of Proposition 15.1.
Proof of Proposition 15.1. Define two test functions $f_0(x) = 0$ and
$$
f_1(x) = h_n^{\beta_1}\, \varphi\Big( \frac{x - x_0}{h_n} \Big) \quad \text{with} \quad h_n = \Big( \frac{c \ln n}{n} \Big)^{1/(2\beta_1+1)}.
$$
The choice of the constant $c$ will be made below. This definition is explained in detail in the proof of Theorem 9.16. In particular, $f_1 \in \Theta(\beta_1, L, L_1)$ for some constants $L$ and $L_1$.

Choose a constant $a_0$ such that $\beta_1/(2\beta_1+1) < a_0 < a$. Define a sequence
$$
u_n = n^{2a_0} h_n^{2\beta_1} = n^{2a_0} \Big( \frac{c \ln n}{n} \Big)^{2\beta_1/(2\beta_1+1)}
$$
$$
(15.4) \qquad = \big( c \ln n \big)^{2\beta_1/(2\beta_1+1)}\, n^{2[a_0 - \beta_1/(2\beta_1+1)]} \ge n^{2[a_0 - \beta_1/(2\beta_1+1)]}
$$
for any fixed $c$ and all large enough $n$, so that $u_n \to \infty$ at a power rate as $n \to \infty$. Take an arbitrarily small $\delta$ such that $\delta < \varphi^2(0)/4$. Note that if $\tilde f_n \in \mathcal{F}(A, a)$, then for all sufficiently large $n$, we have
$$
u_n \mathbf{E}_{f_0}\big[ h_n^{-2\beta_1} \tilde f_n^2 \big] = u_n \mathbf{E}_{f_0}\big[ u_n^{-1} n^{2a_0} \tilde f_n^2 \big]
= n^{-2(a - a_0)} \mathbf{E}_{f_0}\big[ n^{2a} \tilde f_n^2 \big] \le n^{-2(a - a_0)} A < \delta.
$$


Thus, $u_n \mathbf{E}_{f_0}\big[ h_n^{-2\beta_1} \tilde f_n^2 \big] - \delta < 0$. Put $T_n = h_n^{-\beta_1} \tilde f_n$. We obtain
$$
\sup_{f \in \Theta(\beta_1)} \mathbf{E}_f\big[ h_n^{-2\beta_1} \big( \tilde f_n - f(x_0) \big)^2 \big] \ge \mathbf{E}_{f_1}\big[ h_n^{-2\beta_1} \big( \tilde f_n - f_1(x_0) \big)^2 \big]
$$
$$
\ge \mathbf{E}_{f_1}\big[ h_n^{-2\beta_1} \big( \tilde f_n - f_1(x_0) \big)^2 \big] + u_n \mathbf{E}_{f_0}\big[ h_n^{-2\beta_1} \tilde f_n^2 \big] - \delta
= \mathbf{E}_{f_1}\big[ (T_n - \varphi(0))^2 \big] + u_n \mathbf{E}_{f_0}\big[ T_n^2 \big] - \delta.
$$
Finally, we want to show that the right-hand side is separated away from zero by a positive constant independent of $n$. Introduce the likelihood ratio
$$
\Lambda_n = \frac{d\mathbf{P}_{f_0}}{d\mathbf{P}_{f_1}} = \exp\Big\{ -\sum_{i=1}^{n} f_1(i/n)\, \xi_i - \frac{1}{2} \sum_{i=1}^{n} f_1^2(i/n) \Big\}
$$
where $\xi_i = y_i - f_1(i/n)$, $i = 1, \dots, n$, are independent standard normal random variables with respect to the $\mathbf{P}_{f_1}$-distribution. As in the proof of Theorem 9.16, we get
$$
\sigma_n^2 = \sum_{i=1}^{n} f_1^2(i/n) = \|\varphi\|_2^2\, n\, h_n^{2\beta_1+1} \big( 1 + o_n(1) \big) = \|\varphi\|_2^2\, (c \ln n) \big( 1 + o_n(1) \big)
$$
where $o_n(1) \to 0$ as $n \to \infty$. Introduce a standard normal random variable $Z_n = \sigma_n^{-1} \sum_{i=1}^{n} f_1(i/n)\, \xi_i$. Then the likelihood ratio takes the form
$$
\Lambda_n = \exp\Big\{ -\sigma_n Z_n - \frac{1}{2} \|\varphi\|_2^2\, (c \ln n) \big( 1 + o_n(1) \big) \Big\}.
$$
Note that if the random event $\{ Z_n \le 0 \}$ holds, then
$$
\Lambda_n \ge \exp\Big\{ -\frac{1}{2} \|\varphi\|_2^2\, (c \ln n) \big( 1 + o_n(1) \big) \Big\} \ge n^{-c_1},
$$
for all large $n$, where $c_1 = c\, \|\varphi\|_2^2$. From the definition of the likelihood ratio, we obtain the lower bound
$$
\sup_{f \in \Theta(\beta_1)} \mathbf{E}_f\big[ h_n^{-2\beta_1} \big( \tilde f_n - f(x_0) \big)^2 \big] \ge \mathbf{E}_{f_1}\big[ (T_n - \varphi(0))^2 + u_n \Lambda_n T_n^2 \big] - \delta
$$
$$
\ge \mathbf{E}_{f_1}\big[ (T_n - \varphi(0))^2 + u_n n^{-c_1} T_n^2\, \mathbb{I}(Z_n \le 0) \big] - \delta.
$$
Now we choose $c$ so small that $c_1 = c\, \|\varphi\|_2^2 < 2\big( a_0 - \beta_1/(2\beta_1+1) \big)$. Then, by (15.4), $u_n n^{-c_1}$ increases and exceeds $1$ if $n$ is sufficiently large. Hence,
$$
\sup_{f \in \Theta(\beta_1)} \mathbf{E}_f\big[ h_n^{-2\beta_1} \big( \tilde f_n - f(x_0) \big)^2 \big] \ge \mathbf{E}_{f_1}\big[ (T_n - \varphi(0))^2 + T_n^2\, \mathbb{I}(Z_n \le 0) \big] - \delta
$$
$$
\ge \mathbf{E}_{f_1}\big[ \mathbb{I}(Z_n \le 0) \big( (T_n - \varphi(0))^2 + T_n^2 \big) \big] - \delta
\ge \mathbf{E}_{f_1}\big[ \mathbb{I}(Z_n \le 0) \big( (-\varphi(0)/2)^2 + (\varphi(0)/2)^2 \big) \big] - \delta
$$
$$
\ge \frac{1}{2} \varphi^2(0)\, \mathbf{P}_{f_1}\big( Z_n \le 0 \big) - \delta = \frac{1}{4} \varphi^2(0) - \delta = r_*
$$

where $r_*$ is strictly positive because, under our choice, $\delta < \varphi^2(0)/4$. $\square$

15.2. Adaptive Estimator in the Sup-Norm


In this section we present a result on adaptive estimation in the sup-norm.
We will show that for the sup-norm losses, the adaptation is possible in the
straightforward sense: the minimax rates are attainable as if the smoothness
parameter were known.
First, we modify the definition (15.1) of the adaptive risk to reflect the sup-norm loss function,
$$
(15.5) \qquad AR_\infty(\tilde f_n) = \max_{\beta \in \{\beta_1, \beta_2\}}\; \sup_{f \in \Theta(\beta)} \mathbf{E}_f\Big[ \Big( \frac{n}{\ln n} \Big)^{\beta/(2\beta+1)} \big\| \tilde f_n - f \big\|_\infty \Big].
$$

Since the log-factor is intrinsic to the sup-norm rates of convergence, there is no need to prove the lower bound. All we have to do is to define an adaptive estimator. As in the previous section, we proceed with the equidistant design and the standard normal errors in the regression model,
$$
y_i = f(i/n) + \varepsilon_i, \quad \varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, 1), \quad i = 1, \dots, n.
$$
Define $f^*_{n,\beta_1}$ and $f^*_{n,\beta_2}$ as the rate-optimal estimators in the sup-norm over the classes $\Theta(\beta_1)$ and $\Theta(\beta_2)$, respectively. Each estimator is based on the regressogram with the bandwidth $h^*_{n,\beta} = \big( (\ln n)/n \big)^{1/(2\beta+1)}$, $\beta \in \{\beta_1, \beta_2\}$ (see Section 10.3).

Now we introduce an adaptive estimator,
$$
(15.6) \qquad \tilde f_n = \begin{cases} f^*_{n,\beta_1} & \text{if } \big\| f^*_{n,\beta_1} - f^*_{n,\beta_2} \big\|_\infty \ge C\, \big( h^*_{n,\beta_1} \big)^{\beta_1}, \\[4pt] f^*_{n,\beta_2} & \text{otherwise}, \end{cases}
$$
where $C$ is a sufficiently large constant which depends only on $\beta_1$ and $\beta_2$. The final choice of $C$ is made below.
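To make the selection rule (15.6) concrete, here is a small numerical sketch in Python. It is an illustration under simplifying assumptions only: the two candidate estimators are piecewise-constant regressograms (bin averages) rather than the local polynomial regressograms of Section 10.3, the threshold constant C is set by hand, and the true regression function is a toy Lipschitz example.

```python
import numpy as np

def regressogram(y, h):
    """Piecewise-constant regressogram on [0, 1]: average of y within bins of width about h."""
    n = len(y)
    x = np.arange(1, n + 1) / n
    n_bins = max(1, int(np.ceil(1.0 / h)))
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    fhat = np.empty(n)
    for a, b in zip(edges[:-1], edges[1:]):
        mask = (x > a) & (x <= b)
        if mask.any():
            fhat[mask] = y[mask].mean()
    return fhat

def adaptive_sup_norm(y, beta1, beta2, C=2.0):
    """Selection rule (15.6): keep the rough estimator if it differs from
    the smooth one by more than C * (h_{n,beta1})^{beta1} in the sup-norm."""
    n = len(y)
    h1 = (np.log(n) / n) ** (1.0 / (2 * beta1 + 1))
    h2 = (np.log(n) / n) ** (1.0 / (2 * beta2 + 1))
    f1, f2 = regressogram(y, h1), regressogram(y, h2)
    return f1 if np.max(np.abs(f1 - f2)) >= C * h1 ** beta1 else f2

rng = np.random.default_rng(0)
n = 1000
x = np.arange(1, n + 1) / n
f_true = np.abs(x - 0.5)                      # a Lipschitz (beta = 1) regression function
y = f_true + rng.standard_normal(n)
fhat = adaptive_sup_norm(y, beta1=1, beta2=2)
print("sup-norm error:", np.max(np.abs(fhat - f_true)))
```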
Our starting point is the inequality (10.10). Since the notations of the current section are a little bit different, we rewrite this inequality in the form convenient for reference,
$$
(15.7) \qquad \big\| f^*_{n,\beta} - f \big\|_\infty \le A_b\, \big( h^*_{n,\beta} \big)^{\beta} + A_v\, \big( n h^*_{n,\beta} \big)^{-1/2} Z^*_{\beta}, \quad \beta \in \{\beta_1, \beta_2\},
$$
where $A_b$ and $A_v$ are constants independent of $n$. Using the notation defined in Section 10.3, we show that the distribution of $Z^*_\beta$ has fast-decreasing tail probabilities,
$$
\mathbf{P}\Big( Z^*_\beta \ge y\sqrt{2\beta^2 \ln n} \Big) \le \sum_{q=1}^{Q} \sum_{m=0}^{\beta-1} \mathbf{P}\Big( \big| Z_{m,q} \big| \ge y\sqrt{2\ln n} \Big) \le Q\beta\, \mathbf{P}\Big( |Z| \ge y\sqrt{2\ln n} \Big)
$$

where $Z \sim \mathcal{N}(0, 1)$. Now, since $\mathbf{P}(|Z| \ge x) \le \exp\{-x^2/2\}$ for any $x \ge 1$, we arrive at the upper bound
$$
(15.8) \qquad \mathbf{P}\Big( Z^*_\beta \ge y\sqrt{2\beta^2 \ln n} \Big) \le Q\beta\, n^{-y^2} \le \beta\, n^{-(y^2 - 1)}.
$$
Here we have used the fact that the number of bins $Q = 1/(2h^*_{n,\beta}) \le n$ for all large enough $n$.
Theorem 15.4. There exists a constant $C$ in the definition (15.6) of the adaptive estimator $\tilde f_n$ such that the adaptive risk $AR_\infty(\tilde f_n)$ specified by (15.5) satisfies the upper bound
$$
(15.9) \qquad AR_\infty(\tilde f_n) \le r^*
$$
with a constant $r^*$ independent of $n$.

Proof. Denote the random event in the definition of the adaptive estimator $\tilde f_n$ by
$$
\mathcal{C} = \Big\{ \big\| f^*_{n,\beta_1} - f^*_{n,\beta_2} \big\|_\infty \ge C\, \big( h^*_{n,\beta_1} \big)^{\beta_1} \Big\}.
$$
If $f \in \Theta(\beta_1)$, then
$$
\big( n/\ln n \big)^{\beta_1/(2\beta_1+1)}\, \mathbf{E}_f\big[ \| \tilde f_n - f \|_\infty \big]
$$
$$
(15.10) \qquad \le \big( h^*_{n,\beta_1} \big)^{-\beta_1} \mathbf{E}_f\big[ \| f^*_{n,\beta_1} - f \|_\infty\, \mathbb{I}(\mathcal{C}) \big] + \big( h^*_{n,\beta_1} \big)^{-\beta_1} \mathbf{E}_f\big[ \| f^*_{n,\beta_2} - f \|_\infty\, \mathbb{I}(\bar{\mathcal{C}}) \big]
$$
where $\bar{\mathcal{C}}$ is the complementary random event to $\mathcal{C}$. The first term on the right-hand side of (15.10) is bounded from above uniformly in $f \in \Theta(\beta_1)$ because $f^*_{n,\beta_1}$ is the minimax rate-optimal estimator over this class. If the complementary random event $\bar{\mathcal{C}}$ holds, then by the triangle inequality, the second term does not exceed
$$
(15.11) \qquad \big( h^*_{n,\beta_1} \big)^{-\beta_1} \Big( \mathbf{E}_f\big[ \| f^*_{n,\beta_1} - f \|_\infty \big] + C\, \big( h^*_{n,\beta_1} \big)^{\beta_1} \Big)
$$
which is also bounded from above by a constant.
Next, we turn to the case $f \in \Theta(\beta_2)$. As above,
$$
\big( n/\ln n \big)^{\beta_2/(2\beta_2+1)}\, \mathbf{E}_f\big[ \| \tilde f_n - f \|_\infty \big]
\le \big( h^*_{n,\beta_2} \big)^{-\beta_2} \mathbf{E}_f\big[ \| f^*_{n,\beta_2} - f \|_\infty\, \mathbb{I}(\bar{\mathcal{C}}) \big] + \big( h^*_{n,\beta_2} \big)^{-\beta_2} \mathbf{E}_f\big[ \| f^*_{n,\beta_1} - f \|_\infty\, \mathbb{I}(\mathcal{C}) \big].
$$
Once again, it suffices to study the case when the estimator does not match the true class of functions,
$$
(15.12) \qquad \big( h^*_{n,\beta_2} \big)^{-\beta_2} \mathbf{E}_f\big[ \| f^*_{n,\beta_1} - f \|_\infty\, \mathbb{I}(\mathcal{C}) \big]
\le v_n \Big( \big( h^*_{n,\beta_1} \big)^{-2\beta_1} \mathbf{E}_f\big[ \| f^*_{n,\beta_1} - f \|_\infty^2 \big] \Big)^{1/2} \Big( \mathbf{P}_f\big( \mathcal{C} \big) \Big)^{1/2}
$$

where the Cauchy-Schwarz inequality was applied. The deterministic sequence $v_n$ is defined by
$$
v_n = \big( h^*_{n,\beta_2} \big)^{-\beta_2} \big( h^*_{n,\beta_1} \big)^{\beta_1} = \Big( \frac{n}{\ln n} \Big)^{\gamma}, \quad \gamma = \frac{\beta_2}{2\beta_2+1} - \frac{\beta_1}{2\beta_1+1} > 0.
$$
The normalized expected value on the right-hand side of (15.12) is bounded from above uniformly over $f \in \Theta(\beta_2)$. Indeed, over a smoother class of functions $\Theta(\beta_2)$, a coarser estimator $f^*_{n,\beta_1}$ preserves its slower rate of convergence. Formally, this bound does not follow from Theorem 10.6 because of the squared sup-norm which is not covered by this theorem. However, the proper extension is elementary if we use (15.7) directly (see Exercise 15.96).
 
Hence, it remains to show that the probability $\mathbf{P}_f(\mathcal{C})$ in (15.12) vanishes fast enough to compensate the growth of $v_n$. From the definition of the random event $\mathcal{C}$ and the triangle inequality, we have
$$
\mathcal{C} \subseteq \Big\{ \| f^*_{n,\beta_1} - f \|_\infty \ge \frac{1}{2} C\, \big( h^*_{n,\beta_1} \big)^{\beta_1} \Big\} \cup \Big\{ \| f^*_{n,\beta_2} - f \|_\infty \ge \frac{1}{2} C\, \big( h^*_{n,\beta_1} \big)^{\beta_1} \Big\}.
$$
Now, note that the bias terms in (15.7) are relatively small,
$$
A_b\, \big( h^*_{n,\beta_2} \big)^{\beta_2} < A_b\, \big( h^*_{n,\beta_1} \big)^{\beta_1} < \frac{1}{4} C\, \big( h^*_{n,\beta_1} \big)^{\beta_1}
$$
if the constant $C$ in the definition of the adaptive estimator $\tilde f_n$ satisfies the condition $C > 4A_b$. Under this choice of $C$, the random event $\mathcal{C}$ may occur only due to the large deviations of the stochastic terms. It implies that $\mathcal{C} \subseteq \mathcal{A}_1 \cup \mathcal{A}_2$ with
$$
\mathcal{A}_1 = \Big\{ A_v\, \big( n h^*_{n,\beta_1} \big)^{-1/2} Z^*_{\beta_1} \ge \frac{1}{4} C\, \big( h^*_{n,\beta_1} \big)^{\beta_1} \Big\} = \Big\{ Z^*_{\beta_1} \ge \frac{C}{4A_v} \sqrt{\ln n} \Big\}
$$
and
$$
\mathcal{A}_2 = \Big\{ A_v\, \big( n h^*_{n,\beta_2} \big)^{-1/2} Z^*_{\beta_2} \ge \frac{1}{4} C\, \big( h^*_{n,\beta_1} \big)^{\beta_1} \Big\}
\subseteq \Big\{ A_v\, \big( n h^*_{n,\beta_1} \big)^{-1/2} Z^*_{\beta_2} \ge \frac{1}{4} C\, \big( h^*_{n,\beta_1} \big)^{\beta_1} \Big\} = \Big\{ Z^*_{\beta_2} \ge \frac{C}{4A_v} \sqrt{\ln n} \Big\}.
$$
From the inequality (15.8), it follows that for a large $C$, the probabilities of the random events $\mathcal{A}_1$ and $\mathcal{A}_2$ decrease faster than any power of $n$. Choosing $C > 4A_v \sqrt{2\beta_2^2 (1 + 2\gamma)}$, we can guarantee that the right-hand side of (15.12) vanishes as $n \to \infty$. $\square$
Remark 15.5. The definition of the adaptive estimator $\tilde f_n$ and the proof of Theorem 15.4 contain a few ideas common to selection of adaptive estimators in different nonparametric models. First, we choose an estimator from minimax optimal estimators over each class of functions. Second, we focus on the performance of the chosen adaptive estimator over the alien class, provided the choice has been made incorrectly. Third, we use the fact that this performance is always controlled by the probabilities of large deviations that vanish faster than their normalization factors that are growing at a power rate. $\square$

15.3. Adaptation in the Sequence Space


Another relatively less technical example of adaptation concerns the adaptive estimation problem in the sequence space. Recall that the sequence space, as defined in Section 10.5, is the $n$-dimensional space of the Fourier coefficients of the regression function. We assume that each regression function $f(x)$, $0 \le x \le 1$, is observed at the equidistant design points $x = i/n$. This function is defined in terms of its Fourier coefficients $c_k$ and the basis functions $\varphi_k$ by the formula
$$
f(i/n) = \sum_{k=0}^{n-1} c_k\, \varphi_k(i/n), \quad i = 1, \dots, n.
$$
The transition from the original observations of the regression function $f$ to the sequence space is explained in Lemma 10.16 (see formula (10.31)).

To ease the presentation, we will consider the following model of observations directly in the sequence space,
$$
(15.13) \qquad z_k = c_k + \xi_k/\sqrt{n} \quad \text{and} \quad \tilde z_k = c_k + \tilde\xi_k/\sqrt{n}, \quad k = 0, \dots, n-1,
$$
where $\xi_k$ and $\tilde\xi_k$ are independent standard normal random variables. That is, we assume that each Fourier coefficient $c_k$ is observed twice and the repeated observations are independent.
By Lemma 10.15, for any estimator $\hat c = (\hat c_0, \dots, \hat c_{n-1})$ of the Fourier coefficients $c = (c_0, \dots, c_{n-1})$, the quadratic risk $R_n(\hat c, c)$ in the sequence space is equivalent to the quadratic risk of regression, that is,
$$
(15.14) \qquad R_n(\hat c, c) = \mathbf{E}_c\Big[ \sum_{k=0}^{n-1} (\hat c_k - c_k)^2 \Big] = \mathbf{E}_f\Big[ \Big\| \sum_{k=0}^{n-1} \hat c_k \varphi_k - f \Big\|_{2,n}^2 \Big]
$$
where $\mathbf{E}_c$ refers to the expectation for the true Fourier coefficients $c = (c_0, \dots, c_{n-1})$, and the discrete $L_2$-norm is defined as
$$
\| f \|_{2,n}^2 = n^{-1} \sum_{i=1}^{n} f^2(i/n).
$$

Next, we take two integers $\beta_1$ and $\beta_2$ such that $1 \le \beta_1 < \beta_2$, and consider two sets in the sequence space
$$
\Theta_{2,n}(\beta) = \Theta_{2,n}(\beta, L) = \Big\{ (c_0, \dots, c_{n-1}) \,:\, \sum_{k=0}^{n-1} c_k^2\, k^{2\beta} \le L \Big\}, \quad \beta \in \{\beta_1, \beta_2\}.
$$

We associate $\Theta_{2,n}(\beta)$ with the smoothness parameter $\beta$ because the decrease rate of the Fourier coefficients controls the smoothness of the original regression function (cf. Lemma 10.13).

As shown in Theorem 10.17, for a known $\beta$, uniformly in $c \in \Theta_{2,n}(\beta)$, the risk $R_n(\hat c, c) = O\big( n^{-2\beta/(2\beta+1)} \big)$ as $n \to \infty$. The rate-optimal estimator is the projection estimator which can be defined as
$$
\hat c = \big( z_0, \dots, z_M, 0, \dots, 0 \big)
$$
where $M = M_\beta = n^{1/(2\beta+1)}$. In other words, $\hat c_k = z_k$ if $k = 0, \dots, M$, and $\hat c_k = 0$ for $k \ge M + 1$.
Now, suppose that we do not know the true smoothness of the regression function, or, equivalently, suppose that the true Fourier coefficients may belong to either class, $\Theta_{2,n}(\beta_1)$ or $\Theta_{2,n}(\beta_2)$. Can we estimate these coefficients so that the optimal rate would be preserved over either class of smoothness? To make this statement more precise, we redefine the adaptive risk for the sequence space. For any estimator $\tilde c = (\tilde c_0, \dots, \tilde c_{n-1})$ introduce the adaptive quadratic risk by
$$
(15.15) \qquad AR(\tilde c) = \max_{\beta \in \{\beta_1, \beta_2\}}\; \sup_{c \in \Theta_{2,n}(\beta)} (M_\beta)^{2\beta}\, \mathbf{E}_c\Big[ \sum_{k=0}^{n-1} (\tilde c_k - c_k)^2 \Big]
$$
where $M_\beta = n^{1/(2\beta+1)}$. The objective is to find an adaptive estimator $\tilde c$ that keeps the risk $AR(\tilde c)$ bounded from above by a constant independent of $n$.

To this end, introduce two estimators, each optimal over its own class,
$$
\hat c_\beta = \big( \hat c_{0,\beta}, \dots, \hat c_{n-1,\beta} \big) = \big( z_0, \dots, z_{M_\beta}, 0, \dots, 0 \big), \quad \beta \in \{\beta_1, \beta_2\}.
$$
Further, define two statistics designed to mimic the quadratic risks of the respective estimators $\hat c_\beta$,
$$
R_\beta = \sum_{k=0}^{n-1} \big( \tilde z_k - \hat c_{k,\beta} \big)^2, \quad \beta \in \{\beta_1, \beta_2\}.
$$
These statistics are based on the second set of the repeated observations $\tilde z_k$ in (15.13). From the definition of the quadratic risk (15.14), we have
$$
\mathbf{E}_c\big[ R_\beta \big] = \mathbf{E}_c\Big[ \sum_{k=0}^{n-1} \big( c_k + \tilde\xi_k/\sqrt{n} - \hat c_{k,\beta} \big)^2 \Big] = R_n(\hat c_\beta, c) + 1.
$$
k=0

Next, we give a natural definition of the adaptive estimator in our set-


ting. The adaptive estimator is the estimator ĉβ that minimizes the risk
Rβ , that is,

ĉβ1 if Rβ1 ≤ Rβ2 ,
c̃ =
ĉβ2 if Rβ1 > Rβ2 .
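The whole construction is easy to simulate. The sketch below is an illustration only: the Fourier coefficients, the noise level, and the two smoothness values are chosen arbitrarily. It generates the repeated observations (15.13), builds the two projection estimators, computes the risk estimates $R_\beta$, and applies the selection rule.

```python
import numpy as np

def adaptive_projection(z, z_tilde, beta1, beta2):
    """Select between two projection estimators by the risk estimates R_beta."""
    n = len(z)
    c_hat, R = {}, {}
    for beta in (beta1, beta2):
        M = int(n ** (1.0 / (2 * beta + 1)))
        c_hat[beta] = np.where(np.arange(n) <= M, z, 0.0)   # projection estimator
        R[beta] = np.sum((z_tilde - c_hat[beta]) ** 2)       # risk estimate from 2nd sample
    return c_hat[beta1] if R[beta1] <= R[beta2] else c_hat[beta2]

rng = np.random.default_rng(1)
n = 2048
k = np.arange(n)
c = np.zeros(n)
c[1:] = 1.0 / k[1:] ** 3                     # toy signal with fast-decaying coefficients
z = c + rng.standard_normal(n) / np.sqrt(n)
z_tilde = c + rng.standard_normal(n) / np.sqrt(n)

c_adapt = adaptive_projection(z, z_tilde, beta1=1, beta2=2)
print("squared error of adaptive estimator:", np.sum((c_adapt - c) ** 2))
```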

We give the proof of the following theorem at the end of the present section after we formulate some important auxiliary results.

Theorem 15.6. There exists a constant $r^*$ independent of $n$ such that the adaptive risk (15.15) is bounded from above,
$$
AR(\tilde c) \le r^*.
$$

We have to emphasize that Remark 15.5 stays valid in this case as well. We have to understand the performance of the adaptive estimator if the correct selection fails. As we will show, this performance is governed by the large deviations probabilities of the stochastic terms. Before we prove the theorem, let us analyze the structure of the difference $\Delta R = R_{\beta_1} - R_{\beta_2}$ that controls the choice of the adaptive estimator. Put
$$
\mathcal{M} = \big\{ k \,:\, M_{\beta_2} + 1 \le k \le M_{\beta_1} \big\} \quad \text{and} \quad \Delta M = M_{\beta_1} - M_{\beta_2} = M_{\beta_1} \big( 1 + o_n(1) \big).
$$
The following technical lemmas are proved in the next section.

Lemma 15.7. The difference of the risk estimates satisfies the equation
$$
\Delta R = R_{\beta_1} - R_{\beta_2} = -S_n + M_{\beta_1} \big( 1 + o_n(1) \big)/n + U_n^{(1)}/n - 2 U_n^{(2)}/\sqrt{n}
$$
with $S_n = S_n(c) = \sum_{k \in \mathcal{M}} c_k^2$, and the random variables
$$
U_n^{(1)} = \sum_{k \in \mathcal{M}} \big( \xi_k^2 - 1 \big) \quad \text{and} \quad U_n^{(2)} = \sum_{k \in \mathcal{M}} z_k\, \tilde\xi_k.
$$

The following random events help to control the adaptive risk:
$$
\mathcal{A}_1 = \big\{ U_n^{(1)} \ge M_{\beta_1} \big\}, \qquad \mathcal{A}_2 = \big\{ U_n^{(2)} \le -\sqrt{n}\, S_n/8 \big\},
$$
$$
\mathcal{A}_3 = \big\{ U_n^{(1)} \le -M_{\beta_1}/4 \big\}, \quad \text{and} \quad \mathcal{A}_4 = \big\{ U_n^{(2)} \ge M_{\beta_1}/(8\sqrt{n}) \big\}.
$$

Lemma 15.8. (i) If the inequality $S_n > 4 M_{\beta_1}/n$ holds, then
$$
\mathbf{P}_c\big( \mathcal{A}_i \big) \le \exp\big\{ -A_i M_{\beta_1} \big\}, \quad i = 1 \text{ or } 2,
$$
where $A_1$ and $A_2$ are positive constants independent of $n$.

(ii) If $S_n = o\big( M_{\beta_1}/n \big)$ as $n \to \infty$, then
$$
\mathbf{P}_c\big( \mathcal{A}_i \big) \le \exp\big\{ -A_i M_{\beta_1} \big\}, \quad i = 3 \text{ or } 4,
$$
with some positive constants $A_3$ and $A_4$.

Proof of Theorem 15.6. As explained in the proof of Proposition 15.1, we


have to understand what happens with the risk, if the adaptive estimator is
chosen incorrectly, that is, if it does not coincide with the optimal estimator
over the respective class. Let us start with the case c ∈ Θ2,n (β1 ), while

$\tilde c = \hat c_{\beta_2}$. The contribution of this instance to the adaptive risk (15.15) is equal to
$$
(M_{\beta_1})^{2\beta_1}\, \mathbf{E}_c\Big[ \mathbb{I}(\Delta R > 0) \sum_{k=0}^{n-1} \big( \hat c_{k,\beta_2} - c_k \big)^2 \Big]
= (M_{\beta_1})^{2\beta_1}\, \mathbf{E}_c\Big[ \mathbb{I}(\Delta R > 0) \Big( \sum_{k=0}^{M_{\beta_2}} (z_k - c_k)^2 + \sum_{k=M_{\beta_2}+1}^{n-1} c_k^2 \Big) \Big]
$$
$$
= (M_{\beta_1})^{2\beta_1}\, \mathbf{E}_c\Big[ \mathbb{I}(\Delta R > 0) \Big( \frac{1}{n} \sum_{k=0}^{M_{\beta_2}} \xi_k^2 + S_n + \sum_{k=M_{\beta_1}+1}^{n-1} c_k^2 \Big) \Big]
$$
where $S_n$ is defined in Lemma 15.7. Note that
$$
\mathbf{E}_c\Big[ \frac{1}{n} \sum_{k=0}^{M_{\beta_2}} \xi_k^2 \Big] \sim \frac{M_{\beta_2}}{n} \le \frac{M_{\beta_1}}{n} = \big( M_{\beta_1} \big)^{-2\beta_1},
$$
and since $c \in \Theta_{2,n}(\beta_1)$,
$$
\sum_{k=M_{\beta_1}+1}^{n-1} c_k^2 \le L\, \big( M_{\beta_1} \big)^{-2\beta_1}.
$$

Thus, even multiplied by (Mβ1 )2β1 , the respective terms in the risk stay
bounded as n → ∞.
It remains to verify that the term Sn (Mβ1 )2β1 Pc (ΔR > 0) also stays
finite as n → ∞. It suffices to study the case when Sn > 4 (Mβ1 )−2β1 =
4 Mβ1 /n, because otherwise this term would be bounded by 4. From Lemma
15.7,
   √ 
{ΔR > 0} = − Sn + Mβ1 1 + on (1) /n + Un(1) /n − 2 Un(2) / n > 0
 (1) 
can occur if at least one of the two random events A1 = Un /n ≥ Mβ1 /n
 (2) √ 
or A2 = − 2 Un / n ≥ Sn /4 occurs. Indeed, otherwise we would have
the inequality
ΔR < −(3/4)Sn + 2Mβ1 (1 + on (1))/n < 0,
since by our assumption, Sn > 4 Mβ1 /n.
Lemma 15.8 part (i) claims that the probabilities of
 the random events
A1 and A2 decrease faster than exp − An1/(2β1 +1) as n → ∞, which
implies that Sn (Mβ1 )2β1 Pc (ΔR > 0) is finite.
The other case, when $c \in \Theta_{2,n}(\beta_2)$ and $\tilde c = \hat c_{\beta_1}$, is treated in a similar fashion, though some calculations change. We write
$$
(M_{\beta_2})^{2\beta_2}\, \mathbf{E}_c\Big[ \mathbb{I}(\Delta R \le 0) \sum_{k=0}^{n-1} \big( \hat c_{k,\beta_1} - c_k \big)^2 \Big]
$$
$$
(15.16) \qquad = (M_{\beta_2})^{2\beta_2}\, \mathbf{E}_c\Big[ \mathbb{I}(\Delta R \le 0) \Big( \frac{1}{n} \sum_{k=0}^{M_{\beta_1}} \xi_k^2 + \sum_{k=M_{\beta_1}+1}^{n-1} c_k^2 \Big) \Big].
$$

Since $c \in \Theta_{2,n}(\beta_2)$,
$$
(M_{\beta_2})^{2\beta_2} \sum_{k=M_{\beta_1}+1}^{n-1} c_k^2 \le L\, \big( M_{\beta_2}/M_{\beta_1} \big)^{2\beta_2} \to 0, \quad \text{as } n \to \infty.
$$
It remains to verify that the first term in (15.16) is bounded. We obtain
$$
(M_{\beta_2})^{2\beta_2}\, \mathbf{E}_c\Big[ \mathbb{I}(\Delta R \le 0)\, \frac{1}{n} \sum_{k=0}^{M_{\beta_1}} \xi_k^2 \Big]
\le (M_{\beta_2})^{2\beta_2} \Big( \mathbf{E}_c\Big[ \Big( \frac{1}{n} \sum_{k=0}^{M_{\beta_1}} \xi_k^2 \Big)^2 \Big] \Big)^{1/2} \Big( \mathbf{P}_c\big( \Delta R \le 0 \big) \Big)^{1/2}
$$
$$
\le (M_{\beta_2})^{2\beta_2}\, \frac{2 M_{\beta_1}}{n}\, \Big( \mathbf{P}_c\big( \Delta R \le 0 \big) \Big)^{1/2} = 2\, n^{\gamma}\, \Big( \mathbf{P}_c\big( \Delta R \le 0 \big) \Big)^{1/2}.
$$
Here we applied the Cauchy-Schwarz inequality and the elementary calculation $\mathbf{E}_c\big[ \xi_k^4 \big] = 3$; hence,
$$
\mathbf{E}_c\Big[ \Big( \sum_{k=0}^{M_{\beta_1}} \xi_k^2 \Big)^2 \Big]
= \sum_{k=0}^{M_{\beta_1}} \mathbf{E}_c\big[ \xi_k^4 \big] + \sum_{\substack{k,\, l = 0 \\ k \ne l}}^{M_{\beta_1}} \mathbf{E}_c\big[ \xi_k^2\, \xi_l^2 \big]
= 3 M_{\beta_1} + M_{\beta_1} \big( M_{\beta_1} - 1 \big) \le 4 M_{\beta_1}^2.
$$
The constant $\gamma$ in the exponent above is equal to
$$
\gamma = \frac{2\beta_2}{2\beta_2+1} + \frac{1}{2\beta_1+1} - 1 = \frac{2\beta_2}{2\beta_2+1} - \frac{2\beta_1}{2\beta_1+1} > 0.
$$
 
If $c \in \Theta_{2,n}(\beta_2)$, then $S_n \le L\, (M_{\beta_2})^{-2\beta_2} = L\, M_{\beta_2}/n = o\big( M_{\beta_1}/n \big)$ as $n \to \infty$. Note that the random event
$$
\{ \Delta R \le 0 \} = \Big\{ -S_n + M_{\beta_1} \big( 1 + o_n(1) \big)/n + U_n^{(1)}/n - 2 U_n^{(2)}/\sqrt{n} \le 0 \Big\}
= \Big\{ -U_n^{(1)}/n + 2 U_n^{(2)}/\sqrt{n} \ge M_{\beta_1} \big( 1 + o_n(1) \big)/n \Big\}
$$
occurs only if either $\mathcal{A}_3 = \big\{ -U_n^{(1)} \ge M_{\beta_1}/4 \big\}$ or $\mathcal{A}_4 = \big\{ U_n^{(2)} \ge M_{\beta_1}/(8\sqrt{n}) \big\}$ occurs. Again, as Lemma 15.8 (ii) shows, the probabilities of these random events decrease faster than $n^{-2\gamma}$, so that $n^{\gamma}\, \big( \mathbf{P}_c(\Delta R \le 0) \big)^{1/2} \to 0$ as $n \to \infty$, and the statement of the theorem follows. $\square$

15.4. Proofs of Lemmas


Proof of Lemma 15.7. By straightforward calculations, we obtain
$$
\Delta R = R_{\beta_1} - R_{\beta_2} = \sum_{k \in \mathcal{M}} \big( \tilde z_k - z_k \big)^2 - \sum_{k \in \mathcal{M}} \tilde z_k^2
$$
$$
= \frac{1}{n} \sum_{k \in \mathcal{M}} \big( \tilde\xi_k - \xi_k \big)^2 - \sum_{k \in \mathcal{M}} c_k^2 - \frac{2}{\sqrt{n}} \sum_{k \in \mathcal{M}} c_k \tilde\xi_k - \frac{1}{n} \sum_{k \in \mathcal{M}} \tilde\xi_k^2
$$
$$
= -\frac{2}{n} \sum_{k \in \mathcal{M}} \tilde\xi_k \xi_k + \frac{1}{n} \sum_{k \in \mathcal{M}} \xi_k^2 - \sum_{k \in \mathcal{M}} c_k^2 - \frac{2}{\sqrt{n}} \sum_{k \in \mathcal{M}} c_k \tilde\xi_k
$$
$$
= -\sum_{k \in \mathcal{M}} c_k^2 + \frac{1}{n} \Big( \Delta M + \sum_{k \in \mathcal{M}} (\xi_k^2 - 1) \Big) - \frac{2}{\sqrt{n}} \sum_{k \in \mathcal{M}} \Big( c_k + \frac{\xi_k}{\sqrt{n}} \Big) \tilde\xi_k
$$
where $\Delta M = M_{\beta_1} - M_{\beta_2} = M_{\beta_1} \big( 1 + o_n(1) \big)$ is the number of elements in $\mathcal{M}$. So, the lemma follows. $\square$
To prove Lemma 15.8 we need the following result.

Proposition 15.9. The moment generating functions of the random variables $U_n^{(1)}$ and $U_n^{(2)}$ are respectively equal to
$$
G_1(t) = \mathbf{E}\big[ \exp\big\{ t\, U_n^{(1)} \big\} \big] = \exp\Big\{ -M_{\beta_1} \big( 1 + o_n(1) \big) \Big( t + \frac{1}{2} \ln(1 - 2t) \Big) \Big\}
$$
and
$$
G_2(t) = \mathbf{E}\big[ \exp\big\{ t\, U_n^{(2)} \big\} \big] = \exp\Big\{ \frac{n t^2}{2(n - t^2)}\, S_n - \frac{\Delta M}{2} \ln\big( 1 - t^2/n \big) \Big\}.
$$

Proof. Note that $\mathbf{E}\big[ \exp\{ t \xi^2 \} \big] = (1 - 2t)^{-1/2}$, $t < 1/2$, where $\xi \sim \mathcal{N}(0, 1)$. Therefore,
$$
\mathbf{E}\big[ \exp\{ t(\xi^2 - 1) \} \big] = \exp\big\{ -t - (1/2) \ln(1 - 2t) \big\},
$$
and the expression for $G_1(t)$ follows from the independence of the random variables $\xi_k^2$.

Next, the moment generating function of $U_n^{(2)}$ can be expressed as
$$
G_2(t) = \mathbf{E}\Big[ \exp\big\{ t\, U_n^{(2)} \big\} \Big] = \mathbf{E}\Big[ \exp\Big\{ t \sum_{k \in \mathcal{M}} \Big( c_k + \frac{\xi_k}{\sqrt{n}} \Big) \tilde\xi_k \Big\} \Big]
$$
$$
= \mathbf{E}\Big[ \mathbf{E}\Big[ \exp\Big\{ t \sum_{k \in \mathcal{M}} \Big( c_k + \frac{\xi_k}{\sqrt{n}} \Big) \tilde\xi_k \Big\} \,\Big|\, \xi_k,\, k \in \mathcal{M} \Big] \Big]
= \mathbf{E}\Big[ \exp\Big\{ \frac{t^2}{2n} \sum_{k \in \mathcal{M}} \big( c_k \sqrt{n} + \xi_k \big)^2 \Big\} \Big].
$$
Now, for any real $a < 1$ and any $b$, we have the formula
$$
\mathbf{E}\Big[ \exp\big\{ (a/2)(b + \xi)^2 \big\} \Big] = (1 - a)^{-1/2} \exp\big\{ a b^2 / (2(1 - a)) \big\}.
$$
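This identity is a standard Gaussian computation; for completeness, here is a short derivation by completing the square in the Gaussian integral (not part of the original proof):

```latex
\[
\mathbf{E}\Big[e^{\frac{a}{2}(b+\xi)^2}\Big]
= \int_{-\infty}^{\infty} e^{\frac{a}{2}(b+x)^2}\,\frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx
= \frac{e^{a b^2/2}}{\sqrt{2\pi}} \int_{-\infty}^{\infty}
   \exp\Big\{-\frac{1-a}{2}\Big(x - \frac{ab}{1-a}\Big)^2 + \frac{a^2b^2}{2(1-a)}\Big\}\,dx
= (1-a)^{-1/2}\, e^{\frac{a b^2}{2(1-a)}}.
\]
```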


Applying this formula with $a = t^2/n$ and $b = c_k \sqrt{n}$, we obtain
$$
G_2(t) = \exp\Big\{ \frac{n t^2}{2(n - t^2)} \sum_{k \in \mathcal{M}} c_k^2 - \frac{1}{2} \sum_{k \in \mathcal{M}} \ln\big( 1 - t^2/n \big) \Big\}
$$
which completes the proof because $S_n = \sum_{k \in \mathcal{M}} c_k^2$ and the second sum contains $\Delta M$ identical terms. $\square$

Proof of Lemma 15.8. All the inequalities in this lemma follow from the exponential Chebyshev inequality (also known as Chernoff's inequality),
$$
\mathbf{P}(U \ge x) \le G(t)\, \exp\{ -t x \}, \quad t > 0,
$$
where $G(t) = \mathbf{E}\big[ \exp\{ t U \} \big]$ is the moment generating function of a random variable $U$; it follows from Markov's inequality applied to $\exp\{ t U \}$.

It is essential that the logarithms of the moment generating functions of the random variables $U_n^{(1)}$ and $U_n^{(2)}$ in Proposition 15.9 are quadratic at the origin, $\ln G_i(t) = O(t^2)$, $i = 1, 2$, as $t \to 0$. A choice of a sufficiently small $t$ would guarantee the desired bounds. In the four stated inequalities, the choices of $t$ differ.
We start with the random event $\mathcal{A}_1 = \big\{ U_n^{(1)} \ge M_{\beta_1} \big\}$,
$$
\mathbf{P}_c\big( \mathcal{A}_1 \big) \le G_1(t)\, \exp\big\{ -t M_{\beta_1} \big\}
= \exp\Big\{ -M_{\beta_1} \big( 1 + o_n(1) \big) \Big( t + \frac{1}{2} \ln(1 - 2t) \Big) - t M_{\beta_1} \Big\}.
$$
We choose $t = 1/4$ and obtain
$$
\mathbf{P}_c\big( \mathcal{A}_1 \big) \le \exp\Big\{ -\frac{1}{2} \big( 1 - \ln 2 \big) M_{\beta_1} \big( 1 + o_n(1) \big) \Big\} \le \exp\big\{ -0.15\, M_{\beta_1} \big\}.
$$
Similarly, if we apply Chernoff's inequality to the random variable $-U_n^{(2)}$ with $t = \sqrt{n}/10$, and use the fact that $\Delta M < M_{\beta_1} \le n S_n/4$, we get
$$
\mathbf{P}_c\big( \mathcal{A}_2 \big) = \mathbf{P}_c\big( -U_n^{(2)} \ge \sqrt{n}\, S_n/8 \big)
\le \exp\Big\{ \frac{n t^2}{2(n - t^2)}\, S_n - \frac{\Delta M}{2} \ln\big( 1 - t^2/n \big) - t \sqrt{n}\, S_n/8 \Big\}
$$
$$
= \exp\Big\{ \frac{n S_n}{198} - \frac{\Delta M}{2} \ln(99/100) - \frac{n S_n}{80} \Big\}
\le \exp\Big\{ \frac{n S_n}{198} - \frac{n S_n}{8} \ln(99/100) - \frac{n S_n}{80} \Big\}
$$
$$
\le \exp\big\{ -A\, n S_n \big\} \le \exp\big\{ -4A\, M_{\beta_1} \big\}
$$
where $A = -1/198 + (1/8) \ln(99/100) + 1/80 > 0$.

To prove the upper bound for the probability of $\mathcal{A}_3$, take $t = 1/8$. Then
$$
\mathbf{P}_c\big( \mathcal{A}_3 \big) = \mathbf{P}_c\big( -U_n^{(1)} \ge M_{\beta_1}/4 \big)
\le \exp\Big\{ -M_{\beta_1} \big( 1 + o_n(1) \big) \Big( -t + \frac{1}{2} \ln(1 + 2t) \Big) - t M_{\beta_1}/4 \Big\}
= \exp\big\{ -A\, M_{\beta_1} \big( 1 + o_n(1) \big) \big\}
$$
where $A = -1/8 + (1/2) \ln(5/4) + 1/32 > 0$.

Finally, if $n S_n = o\big( M_{\beta_1} \big)$, then
$$
G_2(t) = \exp\Big\{ -\frac{1}{2} M_{\beta_1} \big( 1 + o_n(1) \big) \ln\big( 1 - t^2/n \big) \Big\}.
$$
Put $t = \sqrt{n}/8$. Then
$$
\mathbf{P}_c\big( \mathcal{A}_4 \big) = \mathbf{P}_c\big( U_n^{(2)} \ge M_{\beta_1}/(8\sqrt{n}) \big)
\le \exp\Big\{ -\frac{1}{2} M_{\beta_1} \big( 1 + o_n(1) \big) \ln\big( 1 - t^2/n \big) - t M_{\beta_1}/(8\sqrt{n}) \Big\}
= \exp\big\{ -A\, M_{\beta_1} \big( 1 + o_n(1) \big) \big\}
$$
where $A = (1/2) \ln(63/64) + 1/64 > 0$. $\square$

Exercises

Exercise 15.95. Let $\psi_n = \psi_n(f)$ be the rate defined by (15.3). Show that there exists an estimator $\tilde f_n$ and a constant $r^*$ independent of $n$ such that
$$
\max_{\beta \in \{\beta_1, \beta_2\}}\; \sup_{f \in \Theta(\beta)} \mathbf{E}_f\Big[ \Big| \frac{\tilde f_n - f(x_0)}{\psi_n(f)} \Big| \Big] \le r^*.
$$

Exercise 15.96. Use (15.7) to prove that the expectation in (15.12) is bounded, that is, show that uniformly in $f \in \Theta(\beta_2)$, the following inequality holds:
$$
\big( h^*_{n,\beta_1} \big)^{-2\beta_1}\, \mathbf{E}_f\big[ \| f^*_{n,\beta_1} - f \|_\infty^2 \big] \le r^*
$$
where the constant $r^*$ is independent of $n$.


Chapter 16

Testing of Nonparametric Hypotheses

16.1. Basic Definitions

16.1.1. Parametric Case. First of all, we introduce the notion of parametric hypotheses testing. Suppose that in a classical statistical model with observations $X_1, \dots, X_n$ that obey a probability density $p(x, \theta)$, $\theta \in \Theta \subseteq \mathbb{R}$, we have to choose between two values of the parameter $\theta$. That is, we want to decide whether $\theta = \theta_0$ or $\theta = \theta_1$, where $\theta_0$ and $\theta_1$ are known. For simplicity we assume that $\theta_0 = 0$ and $\theta_1 \ne 0$. Our primary hypothesis, called the null hypothesis, is written as
$$
H_0 : \theta = 0,
$$
while the simple alternative hypothesis has the form
$$
H_1 : \theta = \theta_1.
$$

In testing the null hypothesis against the alternative, we do not estimate


the parameter θ . A substitution for an estimator is a decision rule Δn =
Δn (X1 , . . . , Xn ) that takes on only two values, for example, 0 or 1. The case
Δn = 0 is interpreted as acceptance of the null hypothesis, whereas the case
Δn = 1 means rejection of the null hypothesis in favor of the alternative.
The appropriate substitution for the risk function is the error probability.
Actually, there are probabilities of two types of errors. Type I error is
committed when a true null hypothesis is rejected, whereas acceptance of a


false null results in type II error. The respective probabilities are denoted by $\mathbf{P}_0\big( \Delta_n = 1 \big)$ and $\mathbf{P}_{\theta_1}\big( \Delta_n = 0 \big)$.

The classical optimization problem in hypotheses testing consists of finding a decision rule that minimizes the type II error, provided the type I error does not exceed a given positive number $\alpha$,
$$
\mathbf{P}_{\theta_1}\big( \Delta_n = 0 \big) \to \inf_{\Delta_n} \quad \text{subject to} \quad \mathbf{P}_0\big( \Delta_n = 1 \big) \le \alpha.
$$
If $n$ is large, then a reasonable anticipation is that $\alpha$ can be chosen small, that is, $\alpha \to 0$ as $n \to \infty$. This criterion of optimality is popular because of its elegant solution suggested by the fundamental Neyman-Pearson lemma (see Exercise 16.97).
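As a concrete illustration of this parametric testing problem (a numerical sketch only, for the Gaussian location model with density $p(x, \theta)$ that of $\mathcal{N}(\theta, 1)$, which is one example covered by the general setup), the code below builds the likelihood ratio test of $H_0: \theta = 0$ against $H_1: \theta = \theta_1$ at level $\alpha$ and estimates its two error probabilities by Monte Carlo.

```python
import numpy as np
from statistics import NormalDist

def lr_test_errors(theta1, n, alpha, n_rep=20000, seed=0):
    """Monte Carlo type I / type II error probabilities of the likelihood
    ratio test for H0: N(0,1) vs H1: N(theta1,1), based on n iid observations."""
    rng = np.random.default_rng(seed)
    # log-likelihood ratio: L_n = theta1 * sum(X_i) - n * theta1**2 / 2;
    # under H0 it is N(-n*theta1^2/2, n*theta1^2), so pick c with P_0(L_n >= c) = alpha
    mu0, sd = -n * theta1 ** 2 / 2, abs(theta1) * np.sqrt(n)
    c = NormalDist(mu0, sd).inv_cdf(1 - alpha)

    L0 = theta1 * rng.standard_normal((n_rep, n)).sum(axis=1) - n * theta1 ** 2 / 2
    L1 = theta1 * (theta1 + rng.standard_normal((n_rep, n))).sum(axis=1) - n * theta1 ** 2 / 2
    return np.mean(L0 >= c), np.mean(L1 < c)   # (type I error, type II error)

print(lr_test_errors(theta1=0.5, n=30, alpha=0.05))
```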
A more sophisticated problem is to test the null hypothesis against a composite alternative,
$$
H_1 : \theta \in \Lambda_n
$$
where $\Lambda_n$ is a known set of the parameter values that does not include the origin, that is, $0 \notin \Lambda_n$. In our asymptotic studies, different criteria for finding the decision rule are possible. One reasonable criterion that we choose is minimization of the sum of the type I error probability and the supremum over $\theta \in \Lambda_n$ of the type II error probability,
$$
r_n(\Delta_n) = \mathbf{P}_0\big( \Delta_n = 1 \big) + \sup_{\theta \in \Lambda_n} \mathbf{P}_{\theta}\big( \Delta_n = 0 \big) \to \inf_{\Delta_n}.
$$

The key question in asymptotic studies is: How distant should Λn be


from the origin, so that it is still possible to separate H0 from the alternative
H1 with a high probability? By separation between hypotheses we mean that
there exists a decision rule Δ∗n such that the sum of the error probabilities
rn (Δ∗n ) is vanishing, limn→∞ rn (Δ∗n ) = 0.

16.1.2. Nonparametric Case. Our objective here is to extend the parametric hypotheses testing to the nonparametric setup. We replace the parameter $\theta$ by a regression function $f$ from a Hölder class $\Theta(\beta) = \Theta(\beta, L, L_1)$, and consider the model of observations,
$$
y_i = f(i/n) + \varepsilon_i \quad \text{where} \quad \varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2).
$$
Suppose that we want to test $H_0 : f = 0$ against the composite alternative $H_1 : f \in \Lambda_n$ where the set of regression functions $\Lambda_n$ is specified.

A general approach to the nonparametric hypotheses testing is as follows. Assume that a norm $\| f \|$ of the regression function is chosen. Let $\psi_n$ be a deterministic sequence, $\psi_n \to 0$ as $n \to \infty$, which plays the same role as the rate of convergence in estimation problems. Define the set of alternative hypotheses
$$
(16.1) \qquad \Lambda_n = \Lambda_n(\beta, C, \psi_n) = \big\{ f \,:\, f \in \Theta(\beta) \ \text{and} \ \| f \| \ge C \psi_n \big\}
$$

with a positive constant $C$. Denote the corresponding sum of the error probabilities by
$$
r_n(\Delta_n, \beta, C, \psi_n) = \mathbf{P}_0\big( \Delta_n = 1 \big) + \sup_{f \in \Lambda_n(\beta, C, \psi_n)} \mathbf{P}_f\big( \Delta_n = 0 \big).
$$
We call the sequence $\psi_n$ a minimax separation rate if (i) for any small positive $\gamma$, there exist a constant $C^*$ and a decision rule $\Delta^*_n$ such that
$$
(16.2) \qquad \limsup_{n \to \infty} r_n(\Delta^*_n, \beta, C^*, \psi_n) \le \gamma,
$$
and (ii) there exist positive constants $C_*$ and $r_*$, independent of $n$ and such that for any decision rule $\Delta_n$,
$$
(16.3) \qquad \liminf_{n \to \infty} r_n(\Delta_n, \beta, C_*, \psi_n) \ge r_*.
$$
n→∞

The meaning of this definition is transparent. The regression functions with the norm satisfying $\| f \| \ge C^* \psi_n$ can be tested against the zero regression function with however small prescribed error probabilities. On the other hand, the reduction of the constant below $C_*$ holds the sum of error probabilities above $r_*$ for any sample size $n$.

16.2. Separation Rate in the Sup-Norm

In general, estimation of a regression function and testing of hypotheses (in the same norm) are two different problems. The minimax rate of convergence is not necessarily equal to the minimax separation rate. We will demonstrate this fact in the next section. For some norms, however, they are the same. In particular, it happens in the sup-norm.

The following result is not difficult to prove because all the preliminary work is already done in Section 12.1.

Theorem 16.1. Assume that the norm in the definition of $\Lambda_n$ is the sup-norm,
$$
\Lambda_n = \Lambda_n(\beta, C, \psi_n) = \big\{ f \,:\, f \in \Theta(\beta) \ \text{and} \ \| f \|_\infty \ge C \psi_n \big\}.
$$
Then the minimax separation rate coincides with the minimax rate of convergence in the sup-norm, $\psi_n = \big( (\ln n)/n \big)^{\beta/(2\beta+1)}$.

Proof. First, we prove the existence of the decision rule $\Delta^*_n$ with the claimed separation rate such that (16.2) holds. Let $f^*_n$ be the regressogram with the rate-optimal bandwidth $h^*_n = \big( (\ln n)/n \big)^{1/(2\beta+1)}$. Our starting point is the inequalities (15.7) and (15.8). For any $C > A_b + 2\beta A_v$, uniformly over $f \in \Theta(\beta)$, these inequalities yield
$$
\mathbf{P}_f\big( \| f^*_n - f \|_\infty \ge C \psi_n \big) \le \mathbf{P}_f\big( A_b (h^*_n)^{\beta} + A_v (n h^*_n)^{-1/2} Z^*_\beta \ge C (h^*_n)^{\beta} \big)
$$
$$
(16.4) \qquad = \mathbf{P}_f\big( A_b + A_v Z^*_\beta/\sqrt{\ln n} \ge C \big) \le \mathbf{P}_f\big( Z^*_\beta \ge 2\beta \sqrt{\ln n} \big) \le \beta\, n^{-1}
$$

where we have applied (15.8) with $y = \sqrt{2}$. Put $C^* = 2C$, and define the set of alternatives by
$$
\Lambda_n(\beta, C^*, \psi_n) = \big\{ f \,:\, f \in \Theta(\beta) \ \text{and} \ \| f \|_\infty \ge C^* \psi_n \big\}.
$$
Introduce a rate-optimal decision rule
$$
\Delta^*_n = \begin{cases} 0, & \text{if } \| f^*_n \|_\infty < \frac{1}{2}\, C^* \psi_n, \\ 1, & \text{otherwise}. \end{cases}
$$
Then, from (16.4), we obtain that as $n \to \infty$,
$$
\mathbf{P}_0\big( \Delta^*_n = 1 \big) = \mathbf{P}_0\Big( \| f^*_n \|_\infty \ge \frac{1}{2}\, C^* \psi_n \Big) = \mathbf{P}_0\big( \| f^*_n \|_\infty \ge C \psi_n \big) \to 0.
$$
Next, for any $f \in \Lambda_n(\beta, C^*, \psi_n)$, by the triangle inequality, as $n \to \infty$,
$$
\mathbf{P}_f\big( \Delta^*_n = 0 \big) = \mathbf{P}_f\Big( \| f^*_n \|_\infty < \frac{1}{2}\, C^* \psi_n \Big) \le \mathbf{P}_f\big( \| f^*_n - f \|_\infty \ge C \psi_n \big) \to 0.
$$
Hence (16.2) is fulfilled for any $\gamma > 0$.
The proof of the lower bound (16.3) is similar to that in Lemma 12.2. We repeat the construction of the $Q$ test functions $f_q$, $q = 1, \dots, Q$, in (12.3) based on a common "bump" function $\varphi$. For any test $\Delta_n$, introduce the random event $\mathcal{D} = \{ \Delta_n = 1 \}$. Then for any $\delta > 0$,
$$
\mathbf{P}_0\big( \mathcal{D} \big) + \max_{1 \le q \le Q} \mathbf{P}_{f_q}\big( \bar{\mathcal{D}} \big) \ge \mathbf{P}_0\big( \mathcal{D} \big) + \mathbf{E}_0\big[ \mathbb{I}\big( \bar{\mathcal{D}} \big)\, \xi_n \big]
$$
$$
\ge \mathbf{P}_0\big( \mathcal{D} \big) + (1 - \delta)\, \mathbf{P}_0\big( \bar{\mathcal{D}} \cap \{ \xi_n > 1 - \delta \} \big) \ge (1 - \delta)\, \mathbf{P}_0\big( \xi_n > 1 - \delta \big)
$$
where
$$
\xi_n = \frac{1}{Q} \sum_{q=1}^{Q} \exp\big\{ \ln\big( d\mathbf{P}_{f_q}/d\mathbf{P}_0 \big) \big\}.
$$
As shown in Lemma 12.2, the random variable $\xi_n$ converges to $1$ as $n \to \infty$. Hence,
$$
\liminf_{n \to \infty} \Big( \mathbf{P}_0\big( \Delta_n = 1 \big) + \max_{1 \le q \le Q} \mathbf{P}_{f_q}\big( \Delta_n = 0 \big) \Big) \ge 1 - \delta.
$$
Note that if $C_* < \| \varphi \|_\infty$, then all the test functions $f_q$, $q = 1, \dots, Q$, belong to the set of alternatives
$$
\Lambda_n(\beta, C_*, \psi_n) = \big\{ f \,:\, f \in \Theta(\beta) \ \text{and} \ \| f \|_\infty \ge C_* \psi_n \big\}.
$$
Thus the lower bound (16.3) holds with $r_* = 1 - \delta$ however close to $1$. $\square$



16.3. Sequence Space. Separation Rate in the $L_2$-Norm

Analyzing the proof of Theorem 16.1, we find the two remarkable properties of the hypotheses testing in the sup-norm. First, the separation rate $\psi_n$ coincides with the minimax rate of estimation, and the minimax optimal decision rule is very simple: the null hypothesis is accepted if the sup-norm of the estimator is small enough. Second, the choice of a sufficiently large constant $C^*$ in the definition of the alternative hypothesis $\Lambda_n(\beta, C^*, \psi_n)$ guarantees the upper bound for an arbitrarily small error probability $\gamma$. Moreover, $C^*$ does not depend on the value of $\gamma$. It happens because the sup-norm is a very special type of norm. The distribution of $\| f^*_n - f \|_\infty$ is degenerate in the limit: if $C$ is large enough, then $\mathbf{P}_f\big( \| f^*_n - f \|_\infty \ge C \psi_n \big) \to 0$ as $n \to \infty$.
In this section, we turn to the quadratic norm. To ease the presentation, we consider the problem of hypotheses testing in the sequence space. All the necessary definitions are given in Section 10.5. The observations $z_k$ in the sequence space are
$$
z_k = c_k + \sigma\, \xi_k/\sqrt{n}, \quad k = 0, \dots, n-1,
$$
where the $c_k$'s are the Fourier coefficients of the regression function $f$, that is, $f = \sum_{k=0}^{n-1} c_k \varphi_k$. Here the $\varphi_k$'s form an orthonormal basis in the discrete $L_2$-norm. The errors $\xi_k$ are independent standard normal random variables, and $\sigma > 0$ represents the standard deviation of the observations in the original regression model.

We use $c = \big( c_0, \dots, c_{n-1} \big)$ to denote the whole set of the Fourier coefficients. As in Section 15.3, it is convenient to work directly in the sequence space of the Fourier coefficients. For ease of reference, we repeat here the definition of the following class:
$$
\Theta_{2,n}(\beta) = \Theta_{2,n}(\beta, L) = \Big\{ \big( c_0, \dots, c_{n-1} \big) \,:\, \sum_{k=0}^{n-1} c_k^2\, k^{2\beta} \le L \Big\}.
$$

We want to test the null hypothesis that all the Fourier coefficients are equal to zero versus the alternative that their $L_2$-norm $\| c \|_2 = \big( \sum c_k^2 \big)^{1/2}$ is larger than a constant that may depend on $n$. Our goal is to find the minimax separation rate $\psi_n$. Formally, we study the problem of testing $H_0 : c = 0$ against the composite alternative $H_1 : c \in \Lambda_n = \Lambda_n(\beta, C, \psi_n)$ where
$$
(16.5) \qquad \Lambda_n = \big\{ c \,:\, c \in \Theta_{2,n}(\beta) \ \text{and} \ \| c \|_2 \ge C \psi_n \big\}.
$$

In Section 13.1, it was shown that the squared $L_2$-norm of a regression function on $[0, 1]$ can be estimated with the minimax rate $1/\sqrt{n}$. This is true in the sequence space as well. The proof in the sequence space is especially simple. Indeed, the sum of the centered $z_k^2$'s admits the representation
$$
\sum_{k=0}^{n-1} \Big( z_k^2 - \frac{\sigma^2}{n} \Big) = \| c \|_2^2 + \frac{2\sigma}{\sqrt{n}} \sum_{k=0}^{n-1} c_k \xi_k + \frac{\sigma^2}{n} \sum_{k=0}^{n-1} \big( \xi_k^2 - 1 \big)
$$
$$
(16.6) \qquad = \| c \|_2^2 + \frac{2\sigma}{\sqrt{n}}\, N + \frac{\sigma^2}{\sqrt{n}}\, Y_n
$$
where $N$ denotes a zero-mean normal random variable with the variance $\| c \|_2^2$. The variable $Y_n$ is a centered chi-squared random variable that is asymptotically normal,
$$
Y_n = \sum_{k=0}^{n-1} \big( \xi_k^2 - 1 \big)/\sqrt{n} \to \mathcal{N}(0, 2).
$$
The convergence rate $1/\sqrt{n}$ in estimation of $\| c \|_2^2$ follows immediately from (16.6).
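A quick simulation (an illustration with arbitrarily chosen coefficients and σ) confirms the $1/\sqrt{n}$ behavior of the estimator $\sum_k (z_k^2 - \sigma^2/n)$ of $\| c \|_2^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0

def norm_sq_rmse(c, n_rep=5000):
    """Monte Carlo root-mean-squared error of sum_k(z_k^2 - sigma^2/n) for ||c||_2^2."""
    n = len(c)
    z = c + sigma * rng.standard_normal((n_rep, n)) / np.sqrt(n)
    est = (z ** 2 - sigma ** 2 / n).sum(axis=1)
    return np.sqrt(np.mean((est - np.sum(c ** 2)) ** 2))

for n in (100, 400, 1600):
    k = np.arange(1, n)
    c = np.concatenate(([0.0], 1.0 / k ** 2))   # toy coefficient sequence
    # sqrt(n) * RMSE should stay roughly constant as n grows
    print(n, norm_sq_rmse(c) * np.sqrt(n))
```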
Now we continue with testing the null hypothesis against the composite alternative (16.5). We will show that the separation rate of testing in the quadratic norm is equal to $\psi_n = n^{-2\beta/(4\beta+1)}$. Note that this separation rate is faster than the minimax estimation rate in the $L_2$-norm, $n^{-\beta/(2\beta+1)}$. The proof of this fact is split between the upper and lower bounds in the theorems below.

We introduce the rate-optimal decision rule, proceeding similarly to (16.6). We take $M_n = n^{2/(4\beta+1)}$, so that the separation rate is $\psi_n = M_n^{-\beta}$, and estimate the norm of the Fourier coefficients by
$$
\hat S_n = \sum_{k=0}^{M_n} \big( z_k^2 - \sigma^2/n \big).
$$
Consider a class of decision rules $\Delta_n$ that depends on a constant $b$,
$$
(16.7) \qquad \Delta_n = \Delta_n(\beta, b) = \begin{cases} 0, & \text{if } \hat S_n < b\, \psi_n^2 = b\, n^{-4\beta/(4\beta+1)}, \\ 1, & \text{otherwise}. \end{cases}
$$
The following theorem claims that by choosing properly the constants in the definitions of the set of alternatives and the decision rule, we can make the error probabilities less than any prescribed number in the sense of the upper bound (16.2).
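Here is a small numerical sketch of the decision rule (16.7) (for illustration only; σ, β, the constant b, and the particular alternative are set by hand rather than by the calibration of Theorem 16.2). It estimates the two error probabilities by Monte Carlo for the zero signal and for one alternative with $\| c \|_2$ of the order $\psi_n$.

```python
import numpy as np

def l2_test(z, sigma, beta, b):
    """Decision rule (16.7): reject H0 when S_hat >= b * n^{-4*beta/(4*beta+1)}."""
    n = len(z)
    M = int(n ** (2.0 / (4 * beta + 1)))
    S_hat = np.sum(z[: M + 1] ** 2 - sigma ** 2 / n)
    return int(S_hat >= b * n ** (-4.0 * beta / (4 * beta + 1)))

rng = np.random.default_rng(3)
n, sigma, beta, b = 4000, 1.0, 2, 3.0
psi_n = n ** (-2.0 * beta / (4 * beta + 1))
n_rep = 2000

# alternative: spread a total L2-norm of 4*psi_n over a few low-frequency coefficients
c_alt = np.zeros(n)
c_alt[1:6] = 4.0 * psi_n / np.sqrt(5)

type1 = np.mean([l2_test(sigma * rng.standard_normal(n) / np.sqrt(n), sigma, beta, b)
                 for _ in range(n_rep)])
type2 = np.mean([1 - l2_test(c_alt + sigma * rng.standard_normal(n) / np.sqrt(n), sigma, beta, b)
                 for _ in range(n_rep)])
print("type I error ~", type1, "  type II error ~", type2)
```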

Theorem 16.2. For any small positive $\gamma$, there exist a constant $C = C^* = C^*(\gamma)$ in the definition (16.5) of the set of alternatives $\Lambda_n = \Lambda_n(\beta, C^*, \psi_n)$, and a constant $b = b(\gamma)$ in the definition (16.7) of the decision rule $\Delta_n = \Delta^*_n = \Delta^*_n(\beta, b)$ such that
$$
\limsup_{n \to \infty} r_n(\Delta^*_n) \le \gamma
$$
where
$$
r_n(\Delta^*_n) = \mathbf{P}_0\big( \Delta^*_n = 1 \big) + \sup_{c \in \Lambda_n(\beta, C^*, \psi_n)} \mathbf{P}_c\big( \Delta^*_n = 0 \big).
$$

Proof. It suffices to show that for all sufficiently large $n$, the probabilities of type I and type II errors are bounded from above by $\gamma/2$, that is, it suffices to show that
$$
(16.8) \qquad \limsup_{n \to \infty} \mathbf{P}_0\big( \Delta^*_n = 1 \big) \le \gamma/2
$$
and
$$
(16.9) \qquad \limsup_{n \to \infty}\; \sup_{c \in \Lambda_n(\beta, C^*, \psi_n)} \mathbf{P}_c\big( \Delta^*_n = 0 \big) \le \gamma/2.
$$
Starting with the first inequality, we write
$$
\mathbf{P}_0\big( \Delta^*_n = 1 \big) = \mathbf{P}_0\big( \hat S_n \ge b\, \psi_n^2 \big) = \mathbf{P}_0\Big( \sum_{k=0}^{M_n} \big( z_k^2 - \sigma^2/n \big) \ge b\, \psi_n^2 \Big)
$$
$$
= \mathbf{P}_0\Big( \frac{\sigma^2}{n} \sum_{k=0}^{M_n} \big( \xi_k^2 - 1 \big) \ge b\, \psi_n^2 \Big) = \mathbf{P}_0\Big( \sigma^2 n^{-1} \sqrt{2(M_n + 1)}\; Y_n \ge b\, \psi_n^2 \Big)
$$
where $Y_n = \sum_{k=0}^{M_n} \big( \xi_k^2 - 1 \big)/\sqrt{2(M_n+1)}$ is an asymptotically standard normal random variable. Under our choice of $M_n$, we have that as $n \to \infty$,
$$
n^{-1} \sqrt{M_n + 1} \sim n^{1/(4\beta+1) - 1} = n^{-4\beta/(4\beta+1)} = \psi_n^2.
$$
Consequently,
$$
\mathbf{P}_0\big( \Delta^*_n = 1 \big) = \mathbf{P}_0\Big( \sqrt{2}\, \sigma^2\, Y_n \ge b\, \big( 1 + o_n(1) \big) \Big) \to 1 - \Phi\Big( \frac{b}{\sqrt{2}\, \sigma^2} \Big),
$$
as $n \to \infty$, where $\Phi$ denotes the cumulative distribution function of a standard normal random variable. If we choose $b = \sqrt{2}\, \sigma^2\, q_{1-\gamma/2}$ with $q_{1-\gamma/2}$ standing for the $(1 - \gamma/2)$-quantile of the standard normal distribution, then the inequality (16.8) follows.
To verify the inequality (16.9), note that
$$
\mathbf{P}_c\big( \Delta^*_n = 0 \big) = \mathbf{P}_c\big( \hat S_n \le b\, \psi_n^2 \big) = \mathbf{P}_c\Big( \sum_{k=0}^{M_n} \big( z_k^2 - \sigma^2/n \big) \le b\, \psi_n^2 \Big)
$$
$$
= \mathbf{P}_c\Big( \| c \|_2^2 - \sum_{k=M_n+1}^{n-1} c_k^2 + \frac{2\sigma}{\sqrt{n}} \sum_{k=0}^{M_n} c_k \xi_k + \frac{\sigma^2}{n} \sum_{k=0}^{M_n} \big( \xi_k^2 - 1 \big) \le b\, \psi_n^2 \Big).
$$
Observe that for any $c \in \Lambda_n(\beta, C^*, \psi_n)$, the variance of the following normalized random sum vanishes as $n \to \infty$,
$$
\mathrm{Var}_c\Big[ \frac{2\sigma}{\sqrt{n}\, \| c \|_2^2} \sum_{k=0}^{M_n} c_k \xi_k \Big] \le \frac{4\sigma^2}{n \| c \|_2^2} \le \frac{4\sigma^2}{n (C^* \psi_n)^2} = \Big( \frac{2\sigma}{C^*} \Big)^2 n^{-1/(4\beta+1)} \to 0,
$$
which implies that
$$
\| c \|_2^2 + \frac{2\sigma}{\sqrt{n}} \sum_{k=0}^{M_n} c_k \xi_k = \| c \|_2^2\, \big( 1 + o_n(1) \big) \quad \text{as } n \to \infty,
$$
where $o_n(1) \to 0$ in $\mathbf{P}_c$-probability. Thus,
$$
\mathbf{P}_c\big( \Delta^*_n = 0 \big) = \mathbf{P}_c\Big( \frac{\sigma^2}{n} \sum_{k=0}^{M_n} \big( \xi_k^2 - 1 \big) \le -\| c \|_2^2\, \big( 1 + o_n(1) \big) + \sum_{k=M_n+1}^{n-1} c_k^2 + b\, \psi_n^2 \Big).
$$
Put $Y_n = \sum_{k=0}^{M_n} \big( \xi_k^2 - 1 \big)/\sqrt{2(M_n+1)}$. Note that
$$
\sum_{k=M_n+1}^{n-1} c_k^2 < \sum_{k=M_n+1}^{n-1} c_k^2 \Big( \frac{k}{M_n} \Big)^{2\beta} \le M_n^{-2\beta} L.
$$
Therefore,
$$
\mathbf{P}_c\big( \Delta^*_n = 0 \big) \le \mathbf{P}_c\Big( \frac{\sigma^2}{n} \sqrt{2(M_n+1)}\; Y_n \le -\big( C^* \psi_n \big)^2 \big( 1 + o_n(1) \big) + M_n^{-2\beta} L + b\, \psi_n^2 \Big)
$$
where $Y_n$ is asymptotically standard normal. Note that here every term has the magnitude $\psi_n^2 = M_n^{-2\beta}$. If we cancel $\psi_n^2$, the latter probability becomes
$$
\mathbf{P}_c\Big( \sqrt{2}\, \sigma^2\, Y_n \le \big( -(C^*)^2 + L + b \big) \big( 1 + o_n(1) \big) \Big) \to \Phi\Big( \frac{-(C^*)^2 + L + b}{\sqrt{2}\, \sigma^2} \Big),
$$
as $n \to \infty$. Choose $C^* = \sqrt{2b + L}$ and recall that $b = \sqrt{2}\, \sigma^2\, q_{1-\gamma/2}$. We obtain
$$
\frac{-(C^*)^2 + L + b}{\sqrt{2}\, \sigma^2} = \frac{-b}{\sqrt{2}\, \sigma^2} = -q_{1-\gamma/2} = q_{\gamma/2}
$$
where $q_{\gamma/2}$ denotes the $\gamma/2$-quantile of $\Phi$. Thus, the inequality (16.9) is valid. $\square$
Remark 16.3. In the case of the sup-norm, we can find a single constant
C ∗ to guarantee the upper bound for any γ. In the case of the L2 -norm, it
is not possible. Every γ requires its own constants C ∗ and b. 

As the next theorem shows, the separation rate ψn = n−2β/(4β+1) can-


not be improved. If the constant C in the definition (16.5) of the set of
alternatives Λn (β, C, ψn ) is small, then there is no decision rule that would
guarantee arbitrarily small error probabilities.

Theorem 16.4. For any constant $r_*$, $0 < r_* < 1$, there exists $C = C_* > 0$ in the definition (16.5) of the set of alternatives $\Lambda_n$ such that for any decision rule $\Delta_n$, the sum of the error probabilities
$$
r_n(\Delta_n) = \mathbf{P}_0\big( \Delta_n = 1 \big) + \sup_{c \in \Lambda_n(\beta, C_*, \psi_n)} \mathbf{P}_c\big( \Delta_n = 0 \big)
$$
satisfies the inequality $\liminf_{n \to \infty} r_n(\Delta_n) \ge r_*$.


Proof. Let $M_n = n^{2/(4\beta+1)} = \psi_n^{-1/\beta}$. Introduce a set of $2^{M_n}$ binary sequences
$$
\Omega_n = \Big\{ \omega = \big( \omega_1, \dots, \omega_{M_n} \big),\ \omega_k \in \{-1, 1\},\ k = 1, \dots, M_n \Big\}.
$$
Define a set of alternatives $\Lambda_n^{(0)}$ with the same number of elements $2^{M_n}$,
$$
\Lambda_n^{(0)} = \Big\{ c = c(\omega) \,:\, c_k = C_* \psi_n\, \omega_k/\sqrt{M_n} \ \text{if} \ k = 1, \dots, M_n, \ \text{and} \ c_k = 0 \ \text{otherwise},\ \omega \in \Omega_n \Big\}
$$
where a positive constant $C_*$ will be chosen later. Note that if $C_*$ is small enough, $C_*^2 < (2\beta + 1) L$, then
$$
(16.10) \qquad \Lambda_n^{(0)} \subset \Lambda_n(\beta, C_*, \psi_n).
$$
Indeed, if $c \in \Lambda_n^{(0)}$, then
$$
\sum_{k=0}^{n-1} c_k^2\, k^{2\beta} = \frac{(C_* \psi_n)^2}{M_n} \sum_{k=1}^{M_n} k^{2\beta} \sim \frac{(C_* \psi_n)^2\, M_n^{2\beta+1}}{M_n\, (2\beta+1)} = \frac{C_*^2}{2\beta+1} < L.
$$
Thus, every $c \in \Lambda_n^{(0)}$ belongs to $\Theta_{2,n}(\beta, L)$. Also, the following identity takes place:
$$
\| c \|_2 = \Big( \sum_{k=1}^{M_n} (C_* \psi_n\, \omega_k)^2/M_n \Big)^{1/2} = C_* \psi_n,
$$
which implies (16.10).
Next, we want to show that for any decision rule $\Delta_n$, the following inequality holds:
$$
(16.11) \qquad \liminf_{n \to \infty} \Big( \mathbf{P}_0\big( \Delta_n = 1 \big) + \max_{\omega \in \Omega_n} \mathbf{P}_{c(\omega)}\big( \Delta_n = 0 \big) \Big) \ge r_*.
$$
Put
$$
\alpha_n^2 = \Big( \frac{C_* \psi_n}{\sigma} \Big)^2 \frac{n}{M_n} = \Big( \frac{C_*}{\sigma} \Big)^2 n^{-1/(4\beta+1)} \to 0, \quad \text{as } n \to \infty.
$$
Further, we substitute the maximum by the mean value to obtain
$$
r_n(\Delta_n) \ge \mathbf{P}_0\big( \Delta_n = 1 \big) + \max_{\omega \in \Omega_n} \mathbf{P}_{c(\omega)}\big( \Delta_n = 0 \big)
\ge \mathbf{P}_0\big( \Delta_n = 1 \big) + 2^{-M_n} \sum_{\omega \in \Omega_n} \mathbf{P}_{c(\omega)}\big( \Delta_n = 0 \big)
$$
$$
= \mathbf{E}_0\Big[ \mathbb{I}\big( \Delta_n = 1 \big) + \mathbb{I}\big( \Delta_n = 0 \big)\, 2^{-M_n} \sum_{\omega \in \Omega_n} \exp\big\{ L_n(\omega) \big\} \Big]
= \mathbf{E}_0\Big[ \mathbb{I}\big( \Delta_n = 1 \big) + \mathbb{I}\big( \Delta_n = 0 \big)\, \eta_n \Big]
$$
where
$$
L_n(\omega) = \ln \frac{d\mathbf{P}_{c(\omega)}}{d\mathbf{P}_0} \quad \text{and} \quad \eta_n = 2^{-M_n} \sum_{\omega \in \Omega_n} \exp\big\{ L_n(\omega) \big\}.
$$
Now, the log-likelihood ratio
$$
L_n(\omega) = \frac{n}{\sigma^2} \sum_{k=1}^{M_n} \big( c_k z_k - c_k^2/2 \big) = \sum_{k=1}^{M_n} \big( \alpha_n \omega_k \xi_k - \alpha_n^2/2 \big).
$$
Here we have used the fact that, under $\mathbf{P}_0$, $z_k = \sigma \xi_k/\sqrt{n}$. In addition, the identities $\omega_k^2 = 1$ and
$$
\sqrt{n}\, c_k/\sigma = \big( C_* \psi_n/\sigma \big) \sqrt{n/M_n}\; \omega_k = \alpha_n \omega_k
$$

Here we have used the fact that, under P0 , zk = σ ξk / n. In addition, the
identities ωk2 = 1 and
√ 
n ck /σ = (C∗ ψn /σ) n/Mn ωk = αn ωk
were employed. The random variable ηn admits the representation, which
will be derived below,
 1 2 Mn 
1 αn ξk 1 
(16.12) ηn = exp − αn Mn e + e− αn ξk .
2 2 2
k=1
Even though this expression is purely deterministic and can be shown alge-
braically, the easiest way to prove it is by looking at the ωk ’s as independent
random variables such that
 
P(ω) ωk = ± 1 = 1/2.
Using this definition, the random variable ηn can be computed as the ex-
pected value, denoted by E(ω) , with respect to the distribution P(ω) ,
   
Mn   1 
ηn = E(ω) exp Ln (ω) = E(ω) exp αn ξk ωk − αn2
2
k=1

  
Mn   
= exp − αn2 Mn /2 E(ω) exp αn ξk ωk
k=1
so that the representation (16.12) for ηn follows.
Recall that the $\xi_k$'s are independent standard normal random variables with respect to the $\mathbf{P}_0$-distribution; hence, $\mathbf{E}_0\big[ \eta_n \big] = 1$. To compute the second moment of $\eta_n$, we write
$$
\mathbf{E}_0\big[ \eta_n^2 \big] = \exp\big\{ -\alpha_n^2 M_n \big\}\, \Big( \mathbf{E}_0\Big[ \frac{1}{4} e^{2\alpha_n \xi_1} + \frac{1}{2} + \frac{1}{4} e^{-2\alpha_n \xi_1} \Big] \Big)^{M_n}
$$
$$
= \exp\big\{ -\alpha_n^2 M_n \big\} \Big( \frac{1}{2} e^{2\alpha_n^2} + \frac{1}{2} \Big)^{M_n}
= \exp\big\{ -\alpha_n^2 M_n \big\} \Big( 1 + \alpha_n^2 + \alpha_n^4 + o(\alpha_n^4) \Big)^{M_n}
$$
$$
= \exp\Big\{ -\alpha_n^2 M_n + \big( \alpha_n^2 + \alpha_n^4/2 + o(\alpha_n^4) \big) M_n \Big\}
= \exp\Big\{ \alpha_n^4 M_n/2 + o\big( \alpha_n^4 M_n \big) \Big\}.
$$
From the definition of $\alpha_n$, we have
$$
\alpha_n^4 M_n = \Big( (C_*/\sigma)^2\, n^{-1/(4\beta+1)} \Big)^2 M_n = (C_*/\sigma)^4.
$$
Thus, as $n \to \infty$, we find that $o\big( \alpha_n^4 M_n \big) \to 0$ and $\mathbf{E}_0\big[ \eta_n^2 \big] \to \exp\big\{ C_*^4/(2\sigma^4) \big\}$. Then the variance $\mathrm{Var}_0\big[ \eta_n \big] \sim \exp\big\{ C_*^4/(2\sigma^4) \big\} - 1$ for large $n$. For any $\delta > 0$, by the Chebyshev inequality,
$$
\liminf_{n \to \infty} \mathbf{P}_0\big( \eta_n \ge 1 - \delta \big) \ge 1 - \delta^{-2}\Big( \exp\big\{ C_*^4/(2\sigma^4) \big\} - 1 \Big).
$$
The right-hand side can be made arbitrarily close to $1$ if we choose a sufficiently small $C_*$. Finally, we obtain that
$$
\liminf_{n \to \infty} r_n(\Delta_n) \ge \liminf_{n \to \infty} \mathbf{E}_0\Big[ \mathbb{I}\big( \Delta_n = 1 \big) + \mathbb{I}\big( \Delta_n = 0 \big)\, \eta_n \Big]
$$
$$
\ge \big( 1 - \delta \big) \liminf_{n \to \infty} \mathbf{P}_0\big( \eta_n \ge 1 - \delta \big) \ge \big( 1 - \delta \big)\Big( 1 - \delta^{-2}\Big( \exp\big\{ C_*^4/(2\sigma^4) \big\} - 1 \Big) \Big).
$$
By choosing a small positive $\delta$ and then a sufficiently small $C_*$, we can make the right-hand side larger than any $r_* < 1$, which proves the lower bound (16.11). $\square$
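The key moment calculations in this proof are easy to check numerically. The sketch below (an illustration only; β, σ, and C_* are chosen arbitrarily) simulates η_n via the representation (16.12) and compares its first two moments with the values used above.

```python
import numpy as np

rng = np.random.default_rng(4)
beta, sigma, C_star = 1, 1.0, 0.8
n = 10000
M_n = int(n ** (2.0 / (4 * beta + 1)))
alpha_n = (C_star / sigma) * n ** (-0.5 / (4 * beta + 1))

xi = rng.standard_normal((20000, M_n))
# representation (16.12): eta_n = exp(-alpha^2 M/2) * prod_k cosh(alpha * xi_k)
eta = np.exp(-alpha_n ** 2 * M_n / 2) * np.prod(np.cosh(alpha_n * xi), axis=1)
print("E0[eta_n]   ~", eta.mean(), "(should be close to 1)")
print("E0[eta_n^2] ~", (eta ** 2).mean(),
      "vs exp(C_*^4/(2 sigma^4)) =", np.exp(C_star ** 4 / (2 * sigma ** 4)))
```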

Exercises

Exercise 16.97. (Fundamental Neyman-Pearson Lemma) Assume that for a given $\alpha > 0$, there exists a constant $c > 0$ such that $\mathbf{P}_0\big( L_n \ge c \big) = \alpha$ where $L_n = \sum_{i=1}^{n} \ln\big( p(X_i, \theta_1)/p(X_i, 0) \big)$. Put $\Delta^*_n = \mathbb{I}\big( L_n \ge c \big)$. Let $\Delta_n$ be a decision rule whose probability of type I error satisfies $\mathbf{P}_0\big( \Delta_n = 1 \big) \le \alpha$. Show that the probability of type II error of $\Delta_n$ is no smaller than the type II error of $\Delta^*_n$, that is, $\mathbf{P}_{\theta_1}\big( \Delta_n = 0 \big) \ge \mathbf{P}_{\theta_1}\big( \Delta^*_n = 0 \big)$.
Index of Notation

(D−1∞ )l, m , 120 Zn (θ0 , θ1 ), 32, 45


AR(f˜n ), 212 ΔH, 78
Bq , 132, 158 ΔLn , 24
Cb , 117 ΔLn (θ0 , θ1 ), 24
Cn , 14 Δθ, 45
Cn (X1 , . . . , Xn ), 14 Δn , 227
Cv , 117 Φl (f ), 201
H, 78 Ψ (f ), 188
H(θ0 , θ1 ), 31 Ψ(f ), 185, 188
I(θ), 6 Ψ∗n , 190
In (θ), 7 Θ, 3
K(u), 105 Θ(β), 102
K± , 58 Θ(β, L, L1 ), 101
Ksgn(i) , 58 Θ(β, L, L1 , g∗ ), 200
LP (u), 156 Θα , 51, 75
LS(u), 156 Θ2,n , 145
Ln (θ), 6 Θ2,n (β, L), 145
Ln (θ), 4 X̄n , 5
Ln (θ | X1 , . . . , Xn ), 4 βn (θ̂n , w, π), 13
N , 115 θ̂, 88
N (x, hn ), 106 θ, 87
Nq , 133 ε, 87
Pk (u), 156 η, 78
Q, 131 γ0 , 117
Rβ , 219 γ1 , 118
Rn (θ, θ̂n , w), 11 γk (x), 139
Rn (f, fˆn ), 102 γm,q (x), 138
Sm (u), 153 fˆn (t), 118
Tn , 46 fˆn (x), 90, 102, 104
(l)
W (j), 52 Φ̂n , 202
X1 , . . . , Xn , 3 Ψ̂n , 186
Z1 (θ0 , θ1 ), 32 τ̂n , 69


θ̂, 4 ξn (x, X ), 103


θ̂τ , 78 ak , 142
θ̂n , 4 bk , 142
θ̂n (X1 , . . . , Xn ), 4 bm , 118
θ̂m,q , 133 bn (θ), 7
λ(θ), 46 bn (θ), 5
λ0 , 48  bn (θ , θ̂n ), 5
Eθ θ̂n , 5 bn (x), 102
Ef [ · ], 102 bn (x, X ), 103, 104
Ef [ · | X ], 102 bm,q , 133
Eθ [ · | X ], 89 c, 51
I(·), 11 cq , 132
  f (θ | X1 , . . . , Xn ), 14
Varθ θ̂n , 7
Varθ [ · | X ], 89 fn∗ (x), 109
D, 90 fn∗ (t), 118
D∞ , 95 hn , 105
G, 87 h∗n , 109
H0 , 227 l (Xi , θ), 28
l (Xi , θ), 6
H1 , 227
l(Xi , θ), 4
In , 87
li , 57
ŷ, 87
p(x, θ), 3
gj , 87
p0 (x − θ), 4
r, 92
r∗ , 23
y, 87
rnD , 69
F , 66
r∗ , 23
F (A, a), 212
rn (Δn ), 228
Fτ , 67
rn (θ̂n , w), 16
Ft , 65
tn , 13
H, 200
tn (X1 , . . . , Xn ), 13
Lp , 156
w(u), 11
Ls , 156
wl (x(1) , x(2) ), 202
Nm , 118
zn (θ), 29
Nm,q , 133
S, 87
T , 69
Tγ , 69
X , 86
π(θ), 13
ψn , 23, 103
ρ(xi , x), 106
τ , 66
τn∗ , 70
θn∗ , 16, 76
θn∗ (X1 , . . . , Xn ), 16
C̃n , 14
f˜(θ | X1 , . . . , Xn ), 14
υn, i (x), 104
υn,i , 78
εi , 52
ξ, 56
ξn (x), 102
Index

B-spline balance equation, 108


standard, 153 bandwidth, 105
B-splines, 152 optimal, 108
shifted, 155 basis
Ft -measurable event, 65 complete, 142
σ-algebra, 65 orthonormal, 141
trigonometric, 142
absolute loss function, 11 Bayes estimator, 13
acceptance of null hypothesis, 227 Bayes risk, 13
adaptation, 211 bi-square kernel function, 105
adaptive estimator, 211 bias, 5
adaptive risk, 212 bin, 132
bounded loss function, 11
additive regression model, 197
anisotropic Hölder class of functions,
209 change point, 51
asymptotically exponential change-point problem, 51
statistical experiment, 46 complete basis, 142
asymptotically Fisher composite alternative hypothesis, 228
efficient estimator, 22 conjugate family of distributions, 15
asymptotically minimax conjugate prior distribution, 15
estimator, 103 covariance matrix, 90
lower bound, 23 limiting, 94
rate of convergence, 23, 103 Cramér-Rao inequality, 7
estimator, 33 Cramér-Rao lower bound, 7
asymptotically sharp minimax
bounds, 23 decision rule, 227
asymptotically unbiased design, 86
estimator, 21 regular deterministic, 93
autoregression coefficient, 75 regular random, 95
autoregression, see autoregressive uniform, 94
model, 75 design matrix, 87
coefficient, 75 detection, see on-line detection
autoregressive model, 75 problem, 69


detector, see on-line detector, 69 Lipschitz condition, 102


deterministic regular design, 93 smoothness, 101
differentiable functional, 188 Hellinger distance, 31
Hodges’ example, 22
efficient estimator, 8 hypotheses testing
Epanechnikov kernel function, 105 parametric, 227
estimator, 4 acceptance of null hypothesis, 227
asymptotically unbiased, 21 composite alternative hypothesis, 228
maximum likelihood (MLE), 4, 33 decision rule, 227
projection, 147 minimax separation rate, 229
adaptive, 211 nonparametric, 228
asymptotically Fisher null hypothesis, 227
efficient, 22 rejection of null hypothesis, 227
Bayes, 13 separation between hypotheses, 228
efficient, 8 simple alternative hypothesis, 227
Fisher efficient, see efficient, 8 type I error, 227
global linear, 105 type II error, 228
linear, 104 hypothesis
local linear, 105 simple alternative, 227
minimax, 16 composite alternative, 228
more efficient, 12 null, 227
on-line, 78
orthogonal series, 147 integral functional, 185
sequential, 69, 78 integral quadratic functional, 188
smoothing kernel, 107 irregular statistical experiment, 43
superefficient, 22
unbiased, 5 kernel estimator
expected detection delay, 69 Nadaraya-Watson, 106
explanatory variable, 85 optimal smoothing, 109
smoothing, 107
false alarm probability, 69
kernel function, 105
filter, 66
Epanechnikov, 105
first-order autoregressive
bi-square, 105
model, 75
tri-cube, 112
Fisher efficient estimator, see
triangular, 105
efficient estimator, 8
uniform, 105
Fisher information, 6, 7
kernel, see kernel function, 105
Fisher score function, 6
Kullback-Leibler information
total, 6
number, 58
fitted response vector, 87
functional LAN, see local asymptotic
differentiable, 188 normality condition, 29
integral quadratic, 188 least-squares estimator
linear, 186 of regression coefficient, 88
linear integral, 185 of regression function, 90
non-linear, 188 of vector of regression coefficients, 89
non-linear integral, 188 likelihood ratio, 45
global linear estimator of regression limiting covariance matrix, 94
function, 105 linear estimator, 104
linear functional, 186
Hölder class of functions, 101 linear integral functional, 185
anisotropic, 209 linear parametric regression

model, 86 normalized quadratic risk, 12


linear span-space, 87 normalized risk, 11
Lipschitz condition, 102 maximum, 16
Lipschitz function, 102 normalized risk function, see
local asymptotic normality normalized risk, 11
(LAN) condition, 29 null hypothesis, 227
local linear estimator of regression
function, 105 on-line detection problem, 69
local polynomial approximation, 115 on-line detector, 69
local polynomial estimator, 118 on-line estimation, 78
location parameter, 4 on-line estimator, 78
log-likelihood function, 4 optimal bandwidth, 108, 118
log-likelihood ratio, 24 optimal smoothing kernel
loss function, 11 estimator, 109
absolute, 11 orthogonal series, see projection
bounded, 11 estimator, 147
quadratic, 11 orthonormal basis, 141
sup-norm, 102 parametric hypotheses testing, 227
lower bound parametric regression model, 85
asymptotically minimax, 23 linear, 86
Cramér-Rao, 7 random error in, 86
minimax, 18 partition of unity, 153
pixel, 208
Markov stopping time, see
point estimator, see estimator, 4
stopping time, 66
polynomial regression, 86
maximum likelihood estimator
posterior density, 14
(MLE), 4, 33
weighted, 14
maximum normalized risk, 16, 103
posterior mean, 14
mean integrated squared error (MISE),
non-weighted, 14
90
weighted, 14
mean squared error (MSE), 90
power spline, 156
mean squared risk at a point, see mean
predicted, see fitted response
squared error (MSE), 90
vector, 87
measurable event, see
predictor variable, see explanatory
Ft -measurable event, 65
variable, 85
up to random time, 68
prior density, 13
minimax estimator, 16
prior distribution, 13
minimax lower bound, 18
conjugate, 15
minimax risk, 16
projection, see orthogonal series
of detection, 69
estimator, 147
minimax risk of detection, 69
minimax separation rate, 229 quadratic loss function, 11
more efficient estimator, 12
multiple regression model, 193 random error, 85
random regular design, 95
Nadaraya-Watson kernel random time, 68
estimator, 106 random walk,
non-linear functional, 188 two-sided Gaussian, 52
non-linear integral functional, 188 rate of convergence, 23
nonparametric hypotheses testing, 228 regression coefficient, 85
nonparametric regression model, 101 least-squares estimator of, 88
normal equations, 88 regression equation, 85, 101

regression function, 85 scaled, 158


global linear estimator of, 105 shifted B-spline, 155
least-squares estimator of, 90 standard B-spline, 153
linear estimator of, 104 standard B-spline, 153
local linear estimator of, 105 statistical experiment, 3
regression model regular, 7
simple linear, 96 asymptotically exponential, 46
simple linear through the origin, 96 irregular, 43
additive, 197 stopping time, 66
multiple, 193 sup-norm loss function, 102
nonparametric, 101 superefficient estimator, 22
parametric, 85 superefficient point, 22
simple, 85
single-index, 199 test function, 123, 168
regressogram, 133 time, 65
regular deterministic design, 93 random, 68
regular random design, 95 total Fisher score function, 6
regular statistical experiment, 7 tri-cube kernel function, 112
rejection of null hypothesis, 227 triangular kernel function, 105
residual, 92 trigonometric basis, 142
response variable, 85 two-sided Gaussian random
response, see response variable, 85 walk, 52
risk, 11 type I error, 227
risk function, 11 type II error, 228
normalized quadratic, 12 unbiased estimator, 5
normalized, 11 uniform design, 94
uniform kernel function, 105
sample mean, 5
scaled spline, 158 vector of regression coefficients, 87
scatter plot, 86 vector of regression coefficients
separation between hypotheses, 228 least-squares estimator of, 89
sequence space, 146
sequential estimation, 65, 69, 78 Wald’s first identity, 66
sequential estimator, 69, 78 Wald’s second identity, 83
shifted B-splines, 155 weight function, 185
sigma-algebra, see weighted posterior density, 14
σ-algebra, 65 weighted posterior mean, 14
signal-to-noise ratio, 51
simple alternative hypothesis, 227
simple linear regression model, 96
simple linear regression through the
origin, 96
simple regression model, 85
single-index regression model, 199
smoothing kernel, 107
smoothing kernel estimator, 107
optimal, 109
smoothness of Hölder class
of functions, 101
spline
B-spline, 152
power, 156
This book is designed to bridge the gap between traditional textbooks in statistics
and more advanced books that include the sophisticated nonparametric techniques.
It covers topics in parametric and nonparametric large-sample estimation theory.
The exposition is based on a collection of relatively simple statistical models. It
gives a thorough mathematical analysis for each of them with all the rigorous
proofs and explanations. The book also includes a number of helpful exercises.
Prerequisites for the book include senior undergraduate/beginning graduate-level
courses in probability and statistics.

For additional information and updates on this book, visit www.ams.org/bookpages/gsm-119
