0% found this document useful (0 votes)
3 views

Intro&NP Stat

This document is an introduction to nonparametric statistics, covering various topics such as kernel density estimation, nonparametric regression, projection estimators, and minimax lower bounds. It also discusses reproducing kernel Hilbert spaces, bootstrap methods, multiple hypothesis testing, and high-dimensional linear regression. The content is based on lectures given to PhD students and incorporates influences from various statistical texts and courses.

Uploaded by

yuzukieba
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views

Intro&NP Stat

This document is an introduction to nonparametric statistics, covering various topics such as kernel density estimation, nonparametric regression, projection estimators, and minimax lower bounds. It also discusses reproducing kernel Hilbert spaces, bootstrap methods, multiple hypothesis testing, and high-dimensional linear regression. The content is based on lectures given to PhD students and incorporates influences from various statistical texts and courses.

Uploaded by

yuzukieba
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 122

Introduction to Nonparametric Statistics

Bodhisattva Sen

March 24, 2020

Contents
1 Kernel density estimation 5
1.1 The choice of the bandwidth and the kernel . . . . . . . . . . . . . . 7
1.2 Mean squared error of kernel estimators . . . . . . . . . . . . . . . . 8
1.3 Pointwise asymptotic distribution . . . . . . . . . . . . . . . . . . . . 13
1.4 Integrated squared risk of kernel estimators . . . . . . . . . . . . . . . 15
1.5 Unbiased risk estimation: cross-validation . . . . . . . . . . . . . . . 18

2 Nonparametric regression 20
2.1 Local polynomial estimators . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Pointwise and integrated risk of local polynomial estimators . . . . . 23
2.2.1 Assumption (LP1) . . . . . . . . . . . . . . . . . . . . . . . . 26

3 Projection estimators 27
3.1 Risk bounds for projection estimators . . . . . . . . . . . . . . . . . . 29
3.1.1 Projection estimator with trigonometric basis in L2 [0, 1] . . . . 31

4 Minimax lower bounds 34


4.1 Distances between probability measures . . . . . . . . . . . . . . . . . 34
4.2 Lower Bounds on the risk of density estimators at a point . . . . . . . 37
4.3 Lower bounds on many hypotheses . . . . . . . . . . . . . . . . . . . 41
4.3.1 Assouad’s lemma . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.2 Estimation of a monotone function . . . . . . . . . . . . . . . 45
4.4 A general reduction scheme . . . . . . . . . . . . . . . . . . . . . . . 46
4.5 Fano’s lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.1 Estimation of a regression function under the supremum loss . 52
4.6 Covering and packing numbers and metric entropy . . . . . . . . . . . 52

1
4.6.1 Two examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.7 Global Fano method: Bounding I(M ) based on metric entropy . . . . 58
4.7.1 A general scheme for proving minimax bounds using global
packings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.7.2 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5 Reproducing kernel Hilbert spaces 61


5.1 Hilbert spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Reproducing Kernel Hilbert Spaces . . . . . . . . . . . . . . . . . . . 65
5.2.1 The Representer theorem . . . . . . . . . . . . . . . . . . . . . 67
5.2.2 Feature map and kernels . . . . . . . . . . . . . . . . . . . . . 69
5.3 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4 Classification and Support Vector Machines . . . . . . . . . . . . . . 72
5.4.1 The problem of classification . . . . . . . . . . . . . . . . . . . 72
5.4.2 Minimum empirical risk classifiers . . . . . . . . . . . . . . . . 75
5.4.3 Convexifying the ERM classifier . . . . . . . . . . . . . . . . . 76
5.4.4 Support vector machine (SVM): definition . . . . . . . . . . . 79
5.4.5 Analysis of the SVM minimization problem . . . . . . . . . . 79
5.5 Kernel ridge regression . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Kernel principal component analysis (PCA) . . . . . . . . . . . . . . 82

6 Bootstrap 85
6.1 Parametric bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 The nonparametric bootstrap . . . . . . . . . . . . . . . . . . . . . . 88
6.3 Consistency of the bootstrap . . . . . . . . . . . . . . . . . . . . . . . 89
6.4 Second-order accuracy of the bootstrap . . . . . . . . . . . . . . . . . 92
6.5 Failure of the bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.6 Subsampling: a remedy to the bootstrap . . . . . . . . . . . . . . . . 93
6.7 Bootstrapping regression models . . . . . . . . . . . . . . . . . . . . . 94
6.8 Bootstrapping a nonparametric function: the Grenander estimator . . 96

7 Multiple hypothesis testing 97


7.1 Global testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.2 Bonferroni procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.2.1 Power of the Bonferroni procedure . . . . . . . . . . . . . . . 99
7.3 Chi-squared test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.4 Fisher’s combination test . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.5 Multiple testing/comparison problem: false discovery rate . . . . . . . 103

2
7.6 The Bayesian approach: connection to empirical Bayes . . . . . . . . 106
7.6.1 Global versus local FDR . . . . . . . . . . . . . . . . . . . . . 107
7.6.2 Empirical Bayes interpretation of BH(q) . . . . . . . . . . . . 108

8 High dimensional linear regression 110


8.1 Strong convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.2 Restricted strong convexity and `2 -error kβ̂ − β ∗ k2 . . . . . . . . . . . 112
8.3 Bounds on prediction error . . . . . . . . . . . . . . . . . . . . . . . . 115
8.4 Equivalence between `0 and `1 -recovery . . . . . . . . . . . . . . . . . 117
8.4.1 Sufficient conditions for restricted nullspace . . . . . . . . . . 120

3
Abstract

This lecture note arose from a class I taught in Spring 2016 to our 2nd year
PhD students (in Statistics) at Columbia University. The choice of topics is
very eclectic and mostly reflect: (a) my background and research interests, and
(b) some of the topics I wanted to learn more systematically in 2016. The first
part of this lecture notes is on nonparametric function estimation — density
and regression — and I borrow heavily from the book Tsybakov [14] and the
course he taught at Yale in 2014. The second part of the course is a medley
of different topics: (i) reproducing kernel Hilbert spaces (RKHSs; Section 5),
(ii) bootstrap methods (Section 6), (iii) multiple hypothesis testing (Section 7),
and (iv) an introduction to high dimensional linear regression (Section 8).
The content of Section 5 is greatly influenced by Arthur Gretton’s lectures
and slides on RKHSs and its applications in Machine Learning (see e.g., http:
//www.gatsby.ucl.ac.uk/~gretton/coursefiles/rkhscourse.html for a more
detailed course). I have borrowed the material in Section 7 from Emmanuel
Candes’s lectures on ‘Theory of Statistics’ (Stats 300C, Stanford), while the
content of Section 8 is taken from Hastie et al. [5].

4
1 Kernel density estimation

Let X1 , . . . , Xn be i.i.d. random variables having a probability density p with respect


to the Lebesgue measure on R. The corresponding distribution function is F (x) :=
Rx
−∞
p(t)dt.

1.0 ecdf(x)


0.8


0.6


Fn(x)


0.4


0.2


0.0

−2.0 −1.5 −1.0 −0.5 0.0 0.5 1.0

n=10

A natural estimator of F is the empirical distribution function:


1X 1X
n n
Fn (x) = I{Xi ≤ x} = I(−∞,x] (Xi ), (1)
n i=1 n i=1

where I(·) denotes the indicator function. The Glivenko-Cantelli theorem shows that
a.s.
sup |Fn (x) − F (x)| → 0,
x∈R

as n → ∞ (Exercise (HW1)). Further we know that for every x ∈ R,


√ d
n(Fn (x) − F (x)) → N (0, F (x)(1 − F (x))).

ecdf(x) ecdf(x)
1.0

1.0

● ●
●●
●●

●●●
● ●

● ●●

●●
0.8

0.8

●●

●●

●●



●●
●●●
● ●●

0.6

0.6



●●
●●
●●
Fn(x)

Fn(x)

●●●
●●
●●●
●●


0.4

0.4

●●
●●


●●



●●
●●

● ●

● ●
0.2

0.2


●●
●●
● ●●
●●

●●
● ●

●●
●●
0.0

0.0

−2 −1 0 1 2 −4 −2 0 2

n=100 n=1000

5
Exercise (HW1): Consider testing F = F0 where F0 is a known continuous strictly
increasing distribution function (e.g., standard normal) when we observe i.i.d. data
X1 , . . . , Xn from F . The Kolmogorov-Smirnov test statistic is to consider

Dn := sup |Fn (x) − F0 (x)|,


x∈R

and reject H0 when Dn > cα , for a suitable cα > 0 (where α is the level of the test).
Show that, under H0 , Dn is distribution-free, i.e., the distribution of Dn does not
depend on F0 (as long as it is continuous and strictly increasing). How would you
compute (approximate/simulate) the critical value cα , for every n.

Let us come back to the estimation of p. As p is the derivative of F , for small h > 0,
we can write the approximation
F (x + h) − F (x − h)
p(x) ≈ .
2h
As Fn is a natural estimator of F , it is intuitive to define the following (Rosenblatt)
estimator of p:
Fn (x + h) − Fn (x − h)
p̂R
n (x) = .
2h
We can rewrite p̂R n as
 
1 X 1 X
n n
R Xi − x
p̂n (x) = I(x − h < Xi ≤ x + h) = K0 ,
2nh i=1 nh i=1 h

where K0 (u) = 21 I(−1,1] (u). A simple generalization of the Rosenblatt estimator is


given by
 
1 X
n
Xi − x
p̂n (x) := K , (2)
nh i=1 h
R
where K : R → R is an integrable function satisfying K(u)du = 1. Such a function
K is called a kernel and the parameter h is called the bandwidth of the estimator (2).
The function p̂n is called the kernel density estimator (KDE) or the Parzen-Rosenblatt
estimator. Some classical examples of kernels are the following:
1
K(u) = 2
I(|u| ≤ 1) (the rectangular kernel)
1
K(u) = √ exp(−u2 /2) (the Gaussian kernel)

3
K(u) = 4 (1 − u2 )I(|u| ≤ 1) (the Epanechnikov kernel).

Note that if the kernel K takes only nonnegative values and if X1 , . . . , Xn are fixed,
then p̂n is a probability density.

6
Figure 1: KDE with different bandwidths of a random sample of 100 points from a stan-
dard normal distribution. Grey: true density (standard normal). Red: KDE
with h=0.05. Black: KDE with h=0.337. Green: KDE with h=2.

The Parzen-Rosenblatt estimator can be generalized to the multidimensional case


easily. Suppose that (X1 , Y1 ), . . . , (Xn , Yn ) are i.i.d. with (joint) density p(·, ·). A
kernel estimator of p is then given by
   
1 X
n
Xi − x Yi − y
p̂n (x, y) := K K , (3)
nh2 i=1 h h
where K : R → R is a kernel defined as above and h > 0 is the bandwidth.

1.1 The choice of the bandwidth and the kernel

It turns out that the choice of the bandwidth h is far more crucial for the quality
of p̂n as an estimator of p than the choice of the kernel K. We can view the KDE
(for unimodal, nonnegative kernels) as the sum of n small “mountains” given by the
functions  
1 Xi − x
x 7→ K .
nh h
Every small mountain is centered around an observation Xi and has area 1/n under
it, for any bandwidth h. For a small bandwidth the mountain is very concentrated
(peaked), while for a large bandwidth the mountain is low and fat. If the bandwidth
is small, then the mountains remain separated and their sum is peaky. On the other
hand, if the bandwidth is large, then the sum of the individual mountains is too flat.
Intermediate values of the bandwidth should give the best results.

For a fixed h, the KDE p̂n (x0 ) is not consistent in estimating p(x0 ), where x0 ∈ R.
However, if the bandwidth decreases with sample size at an appropriate rate, then it
is, regardless of which kernel is used.

7
Exercise (HW1): Suppose that p is continuous at x0 , that hn → 0, and that nhn → ∞
p
as n → ∞. Then, p̂n (x0 ) → p(x0 ) [Hint: Study the bias and variance of the estimator
separately].

1.2 Mean squared error of kernel estimators

A basic measure of the accuracy of p̂n is its mean squared risk (or mean squared error)
at an arbitrary fixed point x0 ∈ R:
h i
MSE = MSE(x0 ) := Ep (p̂n (x0 ) − p(x0 )) .2

Here Ep denotes the expectation with respect to the distribution of (X1 , . . . , Xn ):


Z Z  2 h Y
n i
MSE(x0 ) := ··· p̂n (x0 ; z1 , . . . , zn ) − p(x0 ) p(zi ) dz1 . . . dzn .
i=1

Of course,
MSE(x0 ) = b2 (x0 ) + σ 2 (x0 )
where
b(x0 ) := Ep [p̂n (x0 )] − p(x0 ), (bias)
and h 2 i
σ 2 (x0 ) := Ep p̂n (x0 ) − Ep [p̂n (x0 )] (variance).

To evaluate the mean squared risk of p̂n we will analyze separately its variance and
bias.
Proposition 1.1 (Variance of p̂n ). Suppose that the density p satisfies p(x) ≤ pmax <
∞ for all x ∈ R. Let K : R → R be the kernel function such that
Z
K 2 (u)du < ∞.

Then for any x0 ∈ R, h > 0, and n ≥ 1 we have


C1
σ 2 (x0 ) ≤ ,
nh
R
where C1 = pmax K 2 (u)du.

Proof. Observe that p̂n (x0 ) is an average of n i.i.d. random variables and so
     
1 1 X 1 − x0 1 X 1 − x0
2
σ (x0 ) = Var(p̂n (x0 )) = Var K ≤ Ep K 2
n h h nh2 h

8
Now, observe that
   Z   Z
X 1 − x0 z − x0
Ep K 2
= K 2
p(z)dz ≤ pmax h K 2 (u)du.
h h
Combining the above two displays we get the desired result.

Thus, we conclude that if the bandwidth h ≡ hn is such that nh → ∞ as n → ∞,


then the variance of σ 2 (x0 ) goes to 0 as n → ∞.

To analyze the bias of the KDE (as a function of h) we need certain conditions on
the density p and on the kernel K.
Definition 1.2. Let T be an interval in R and let β and L be two positive numbers.
The Hölder class Σ(β, L) on T is defined as the set of ` = bβc times differentiable
functions f : T → R whose derivative f (`) satisfies

|f (`) (x) − f (`) (x0 )| ≤ L|x − x0 |β−` , for all x, x0 ∈ T.

Definition 1.3. Let ` ≥ 1 be an integer. We say that K : R → R is a kernel of order


` if the functions u 7→ uj K(u), j = 0, 1, . . . , `, are integrable and satisfy
Z Z
K(u)du = 1, uj K(u)du = 0, j = 1, . . . , `.

Does bounded kernels of order ` exist? See Section 1.2.2 of [14] for constructing such
kernels.

Observe that when ` ≥ 2 then the kernel has to take negative values which may lead
to negative values of p̂n . This is sometimes mentioned as a drawback of using higher
order kernels (` ≥ 2). However, observe that we can always define the estimator

p̂+
n (x) = max{0, p̂n (x)}

whose risk is smaller than or equal to the risk of p̂n (x):


h i h i
Ep (p̂+
n (x 0 ) − p(x 0 ))2
≤ Ep (p̂ (x
n 0 ) − p(x 0 ))2
, ∀x ∈ R.

Suppose now that p belong to a class of densities P = P(β, L) defined as follows:


 Z 
P(β, L) := p : p ≥ 0, p(x)dx = 1, and p ∈ Σ(β, L) on R .

9
Proposition 1.4 (Bias of p̂n ). Assume that p ∈ P(β, L) and let K be a kernel of
order ` = bβc satisfying Z
|u|β |K(u)|du < ∞.

Then for any x0 ∈ R, h > 0, and n ≥ 1 we have

|b(x0 )| ≤ C2 hβ , (4)

L
R
where C2 = `!
|u|β |K(u)|du.

Proof. We have
Z  
1 z−x
b(x0 ) = K p(z)dz − p(x0 )
h h
Z h i
= K(u) p(x0 + uh) − p(x0 ) du.

Next, using Taylor theorem1 , we get

0 (uh)` (`)
p(x0 + uh) = p(x0 ) + p (x0 )uh + . . . + p (x0 + τ uh),
`!
where 0 ≤ τ ≤ 1. Since K has order ` = bβc, we obtain
Z
(uh)` (`)
b(x0 ) = K(u) p (x0 + τ uh)du
`!
Z
(uh)` (`)
= K(u) (p (x0 + τ uh) − p(`) (x0 ))du,
`!
1
Taylor’s theorem: Let k ≥ 1 be an integer and let the function f : R → R be k times
differentiable at the point a ∈ R. Then there exists a function Rk : R → R such that

f 00 (a) f (k) (a)


f (x) = f (a) + f 0 (a)(x − a) + (x − a)2 + · · · + (x − a)k + Rk (x),
2! k!
where Rk (x) = o(|x − a|k ) as x → a.
Mean-value forms of the remainder: Let f : R → R be k + 1 times differentiable on the open
interval with f (k) continuous on the closed interval between a and x. Then

f (k+1) (ξL )
Rk (x) = (x − a)k+1
(k + 1)!

for some real number ξL between a and x. This is the Lagrange form of the remainder.
Integral form of the remainder: Let f (k) be absolutely continuous on the closed interval between
a and x. Then Z x (k+1)
f (t)
Rk (x) = (x − t)k dt. (5)
a k!
Due to absolute continuity of f (k) , on the closed interval between a and x, f (k+1) exists a.e.

10
and
Z
|uh|` (`)
|b(x0 )| ≤ |K(u)| p (x0 + τ uh) − p(`) (x0 ) du
`!
Z
|uh|`
≤ L |K(u)| |τ uh|β−` du ≤ C2 hβ .
`!

From Propositions 1.1 and 1.4, we see that the upper bounds on the bias and variance
behave in opposite ways as the bandwidth h varies. The variance decreases as h grows,
whereas the bound on the bias increases. The choice of a small h corresponding to a
large variance leads to undersmoothing. Alternatively, with a large h the bias cannot
be reasonably controlled, which leads to oversmoothing. An optimal value of h that
balances bias and variance is located between these two extremes. To get an insight
into the optimal choice of h, we can minimize in h the upper bound on the MSE
obtained from the above results.

If p and K satisfy the assumptions of Propositions 1.1 and 1.4, we obtain


C1
MSE ≤ C22 h2β + . (6)
nh
The minimum with respect to h of the right hand side of the above display is attained
at  1/(2β+1)
∗ C1
hn = n−1/(2β+1) .
2βC22
Therefore, the choice h = h∗n gives
 2β

MSE(x0 ) = O n− 2β+1 , as n → ∞,

uniformly in x0 . Thus, we have the following result.


R
Theorem 1.5. Assume that the conditions of Proposition 1.4 hold and that K 2 (u)du <
∞. Fix α > 0 and take h = αn−1/(2β+1) . Then for n ≥ 1, the KDE p̂n satisfies
h i 2β
sup sup Ep (p̂n (x0 ) − p(x0 ))2 ≤ Cn− 2β+1 ,
x0 ∈R p∈P(β,L)

where C > 0 is a constant depending only on β, L, α and on the kernel K.

Proof. We apply (14) to derive the result. To justify the application of Proposi-
tion 1.1, it remains to prove that there exists a constant pmax < ∞ satisfying

sup sup p(x) ≤ pmax . (7)


x∈R p∈P(β,L)

11
To show that (7) holds, consider K ∗ which is a bounded kernel of order ` (not neces-
sarily equal to K). Applying Proposition 1.4 with h = 1 we get that, for any x ∈ R
and any p ∈ P(β, L),
Z Z
∗ L
K (z − x) p(z)dz − p(x) ≤ C2 := |u|β |K ∗ (u)|du.
`!

Therefore, for any x ∈ R and any p ∈ P(β, L),


Z
p(x) ≤ C2 + |K ∗ (z − x)| p(z)dz ≤ C2∗ + Kmax
∗ ∗
,


where Kmax ∗
= supu∈R |K ∗ (u)|. Thus, we get (7) with pmax = C2∗ + Kmax .

Under the assumptions of Theorem 1.5, the rate of convergence of the estimator
β
p̂n (x0 ) is ψn = n− 2β+1 , which means that for a finite constant C and for all n ≥ 1 we
have h i
sup Ep (p̂n (x0 ) − p(x0 ))2 ≤ Cψn2 .
p∈P(β,L)

Now the following two questions arise. Can we improve the rate ψn by using other
density estimators? What is the best possible rate of convergence? To answer these
questions it is useful to consider the minimax risk Rn∗ associated to the class P(β, L):
h i
Rn (P(β, L)) = inf sup Ep (Tn (x0 ) − p(x0 )) ,
∗ 2
Tn p∈P(β,L)

where the infimum is over all estimators. One can prove a lower bound on the minimax

risk of the form Rn∗ (P(β, L)) ≥ C 0 ψn2 = C 0 n− 2β+1 with some constant C 0 > 0. This
implies that under the assumptions of Theorem 1.5 the KDE attains the optimal rate
β
of convergence n− 2β+1 associated with the class of densities P(β, L). Exact definitions
and discussions of the notion of optimal rate of convergence will be given later.
Remark 1.1. Quite often in practice it is assumed that β = 2 and that p00 is con-
tinuous at x0 . Also, the kernel is taken to be of order one and symmetric around 0.
Then it can be shown that (Exercise (HW1))
Z Z 2
1 1
MSE(x0 ) = K (u)dup(x0 ) + h4
2
u K(u)du p00 (x0 )2 + o((nh)−1 + h4 ).
2
nh 4

Remark 1.2. Since 2β/(2β+1) approaches 1 as k becomes large, Theorem 1.5 implies
that, for sufficiently smooth densities, the convergence rate can be made arbitrarily
close to the parametric n−1 convergence rate. The fact that higher-order kernels

12
can achieve improved rates of convergence means that they will eventually dominate
first-order kernel estimators for large n. However, this does not mean that a higher-
order kernel will necessarily improve the error for sample sizes usually encountered in
practice, and in many cases, unless the sample size is very large there may actually
be an increase in the error due to using a higher-order kernel.

1.3 Pointwise asymptotic distribution

Whereas the results from the previous sub-section have shown us that p̂n (x0 ) converges
to p(x0 ) in probability under certain assumptions, we cannot straightforwardly use
this for statistical inference. Ideally, if we want to estimate p(x0 ) at the point x0 , we
would like to have exact confidence statements of the form

P (p(x0 ) ∈ [p̂n (x0 ) − c(n, α, x0 , K), p̂n (x0 ) − c(n, α, x0 , K)]) ≥ 1 − α,

where α is the significance level and c(n, α, x0 , K) sequence of constants that one
would like to be as small as possible (given α).
Theorem 1.6. Assume that p ∈ P(β, L) and let K be a kernel of order ` = bβc
satisfying Z
|u|β |K(u)|du < ∞.

Suppose that p also satisfies p(x) ≤ pmax < ∞ for all x ∈ R. Let K further satisfy
R
(a) kKk22 := K 2 (u)du < ∞, (b) kKk∞ := supu∈R K(u) < ∞. Suppose that the
1/2 β+1/2
sequence of bandwidths {hn }∞ n=1 satisfy hn → 0, nhn → ∞, and n hn → 0 as
n → ∞. Then, as n → ∞,
√  
d
 
nh p̂n (x0 ) − p(x0 ) → N 0, p(x0 )kKk22 .

Proof. We first find the limit for the ‘variance term’. We use the Lindeberg-Feller
central limit theorem for triangular arrays of independent random variables2 with
  r  
√ 1 Xi − x 0 1 X i − x0
Yni := nh K = K , i = 1, . . . , n,
nh h nh h
2
Lindeberg-Feller CLT (see e.g., [15, p.20]): For each n let Yn1 , . . . , Ynn be independent random
Pn
variables with finite variances. If, as n → ∞, (i) i=1 E[Yni
2
I(|Yni | > )] → 0, for every  > 0, and
Pn
(ii) i=1 E[(Yni − E(Yni )) ] → σ , then
2 2

n
X d
(Yni − E(Yni )) → N (0, σ 2 ), as n → ∞.
i=1

13
so that Yn1 , . . . , Ynn are i.i.d. and we have
√   Xn
nh p̂n (x0 ) − Ep [p̂n (x0 )] = (Yni − E(Yni )).
i=1

Thus, we only need to show that the two conditions in the Lindeberg-Feller CLT hold.
Clearly,
Z  
2 1 2 z − x0
nE(Yni ) = K p(z)dz
h h
Z Z
= K (u) p(x0 + uh)du → p(x0 ) K 2 (u)du,
2
as n → ∞,

by the dominated convergence theorem (DCT), since p(·) is continuous at x0 and


bounded on R. Now,
Z   2 Z 2
2 1 z − x0
nE(Yni ) = K p(z)dz = h K(u)p(x0 + uh)du
h h
≤ hkKk22 pmax → 0, as n → ∞,
P R
which shows that ni=1 E[(Yni − E(Yni ))2 ] → p(x0 ) K 2 (u)du. Furthermore,
1
|Yni | ≤ √ kKk∞ → 0, as n → ∞,
nh
by the assumption on the sequence of bandwidths. Thus, I(|Yni | > ) → 0, for every
 > 0 and by the DCT
X
n
E[Yni2 I(|Yni | > )] = E[nYn1
2
I(|Yn1 | > )] → 0.
i=

By (4) we see that the bias term can be bounded above as


√ √
nh|b(x0 )| ≤ nhhβ → 0, as n → ∞.
Therefore, we have the desired result.

Exercise (HW1): Suppose that you are given an i.i.d. sample from a bounded density
p with bounded derivatives at x0 . Suppose that c(α, x0 ) is such that P(−c(α, x0 ) ≤
Z ≤ c(α, x0 )) = 1 − α where Z ∼ N (0, p(x0 )). Use a kernel density estimator (with
a suitable kernel) to obtain a 95 percent confidence interval (CI) for p(x0 ) in such a

way that the size of the interval shrinks at rate 1/ nhn as n → ∞, and that hn can
be chosen so that this rate is ‘almost’ (say, up to a log n term) of order n−1/3 .

Exercise (HW1): Under the setup of Remark 1.1 and the assumption that h = αn−1/5 ,

where α > 0, find the asymptotic distribution of nh(p̂n (x0 ) − p(x0 )). Can this be
used to construct a CI for p(x0 )? What are the advantages/disadvantages of using
this result versus the setup of Theorem 1.6 with β = 2 to construct a CI for p(x0 )?

14
1.4 Integrated squared risk of kernel estimators

In Section 1.2 we have studied the behavior of the KDE p̂n at an arbitrary fixed point
x0 . It is also interesting to analyze the global risk of p̂n . An important global criterion
is the mean integrated squared error (MISE):
Z
MISE := Ep [(p̂n (x) − p(x))2 ]dx.

By Fubini’s theorem,
Z Z Z
2
MISE = MSE(x)dx = b (x)dx + σ 2 (x)dx. (8)
R
Thus, the MISE is represented as a sum of the bias term b2 (x)dx and the variance
R
term σ 2 (x)dx. To obtain bounds on these terms, we proceed in the same manner
as for the analogous terms of the MSE. Let us study first the variance term.
Proposition 1.7 (Variance of p̂n ). Let K : R → R be the kernel function such that
Z
K 2 (u)du < ∞.

Then for any h > 0, and n ≥ 1 and any probability density p we have
Z Z
2 1
σ (x)dx ≤ K 2 (u)du.
nh

Proof. As in the proof of Proposition 1.1,


  
1 1 X1 − x
2
σ (x) = Ep [η1 (x)] ≤
2
Ep K 2
nh2 nh2 h
for all x ∈ R. Therefore,
Z Z Z   
1 z−x
σ 2 (x)dx ≤ K 2
p(z)dz dx
nh2 h
Z Z   
1 2 z−x
= p(z) K dx dz
nh2 h
Z
1
= K 2 (u)du.
nh

The upper bound for the variance term in Proposition 1.7 does not require any con-
dition on p: The result holds for any density. For the bias term in (8) the situation
is different: We can only control it on a restricted subset of densities. As above, we

15
specifically assume that p is smooth enough. Since the MISE is a risk corresponding
to the L2 (R)-norm, it is natural to assume that p is smooth with respect to this norm.
Sobolev classes provide a popular way to describe smoothness in L2 (R).

Definition 1.8. Let β ≥ 1 be an integer and L > 0. The Sobolev class S(β, L) is
defined as the set of all β − 1 differentiable functions f : R → R having absolutely
continuous derivative f (β−1) and satisfying
Z
(f (β) (x))2 dx ≤ L2 .

Theorem 1.9. Suppose that, for an integer β ≥ 1:

(i) the function K is a kernel of order β − 1 satisfying the conditions


Z Z
2
K (u)du < ∞, |u|β |K(u)|du < ∞;

(ii) the density p ∈ S(β, L) for some β ≥ 1 and L > 0.

Then for all n ≥ 1 and all h > 0 the mean integrated squared error of the KDE p̂n
satisfies Z Z 2
1 2 L2 h2β β
MISE ≤ K (u)du + |u| |K(u)|du .
nh (`!)2

Proof. We bound the variance term as in Proposition 1.7. Let ` = β − 1. For the bias
term, first note that using the integral form of the remainder term in the Taylor’s
theorem (see (5) and make the transformation t 7→ t−x
uh
),
Z 1
0 (uh)`
p(x + uh) = p(x) + p (x)uh + . . . + (1 − τ )`−1 p(`) (x + τ uh)dτ.
(` − 1)! 0

Since the kernel K is of order β − 1, we obtain


Z Z 1 
(uh)` `−1 (`)
b(x) = K(u) (1 − τ ) p (x + τ uh)dτ du
(` − 1)! 0
Z Z 1   
(uh)` `−1 (`) (`)
= K(u) (1 − τ ) p (x + τ uh) − p (x) dτ du
(` − 1)! 0

Applying the generalized Minkowski inequality3 twice and using the given assump-
3
Generalized Minkowski inequality:

16
R
tions on p, we get the following upper bound for the bias term b2 (x)dx:
Z Z Z 1 2
|uh|` `−1 (`) (`)
|K(u)| (1 − τ ) p (x + τ uh) − p (x) dτ du dx
(` − 1)! 0
 "Z  Z 2
Z ` 1 2 #1/2
|uh|
≤  |K(u)| (1 − τ )`−1 p(`) (x + τ uh) − p(`) (x) dτ dx du
(` − 1)! 0
"Z Z  # !2
Z
|uh|` 1 2 1/2
≤ |K(u)| (1 − τ )`−1 p(`) (x + τ uh) − p(`) (x) dx dτ du .
(` − 1)! 0

Now, for t := τ uh,


Z  2
p(`) (x + t) − p(`) (x) dx
Z  Z 1 2
(`+1)
= t p (x + θt)dθ dx
0
Z Z Z 1/2 !2 Z
1 2 2
2 (`+1) 2
≤ t p (x + θt) dx dθ =t p(β) (x) dx
0

in view of the generalized Minskowski inequality. Therefore,


Z Z Z 1  2
2 |uh|` `−1
b (x)dx ≤ |K(u)| (1 − τ ) |τ uh|Ldτ du
(` − 1)! 0
Z 2 Z 1 2
L2 h2(`+1) `+1 `−1
≤ |K(u)||u| du (1 − τ ) dτ
[(` − 1)!]2 0
Z 2
L2 h2β β
≤ |u| |K(u)| du
(`!)2

Exercise (HW1): Assume that:

(i) the function K is a kernel of order 1 satisfying the conditions


Z Z Z
2 2
K (u)du < ∞, u |K(u)|du < ∞, SK := u2 K(u)du 6= 0;
Lemma 1.10. For any Borel function g on R × R, we have
Z Z 2 "Z Z 1/2 #2
2
g(u, x)du dx ≤ g (u, x)dx du .

17
(ii) The density p is differentiable on R, the first derivative p0 is absolutely contin-
R
uous on R and the second derivative satisfies (p00 (x))2 dx < ∞.

Then for all n ≥ 1 the mean integrated squared error of the kernel estimator p̂n
satisfies  Z Z 
1 2 h4 2 00 2
MISE = K (u)du + SK (p (x)) dx (1 + o(1)),
nh 4
where the term o(1) is independent of n (but depends on p) and tends to 0 as h → 0.

1.5 Unbiased risk estimation: cross-validation

Let p̂n be the KDE and let the kernel K be fixed. We already know that the bandwidth
h is crucial to determine the behavior of the estimator. How to choose h in practice?

Consider the risk Z


MISE(h) := Ep (p̂(h) 2
n − p) (x)dx.

The optimal value of h is the one that minimizes the MISE, i.e.,

h∗ = argmin MISE(h).
h>0

This ideal bandwidth h depends on the true density p, so it is not available in practice.
It is called the oracle bandwidth, and the estimator p̂n with bandwidth h = h∗ is called
the oracle. We would like to “mimic the oracle”, i.e., to find a bandwidth ĥn that
only depends on the data X1 , . . . , Xn , such that its risk is close to the risk of the
oracle: Z
Ep (p̂(nĥn ) − p)2 (x)dx ≈ min MISE(h),
h>0

It turns out that this task can be achieved. The idea is to first estimate the MISE(·),
and then to minimize in h the obtained estimator of MISE(·).

Note that the MISE can be written as


Z Z Z  Z
MISE(h) = E (p̂n − p) = E
2
p̂n − 2 p̂n p + p2 .
2

Only the expression in the square brackets depends on h; the last term is constant in
h. Let Z Z 
J(h) := Ep 2
p̂n − 2 p̂n p .

Since we are minimizing over h, minimizing MISE(h) is equivalent to minimizing J(h).


ˆ
Therefore, it is enough to look for an estimator of J(h), denoted by J(h), because

18
MISE(h) and J(h) have the same minimizers. A first idea is to take an unbiased
estimator of J(h).

[9] suggested the following estimator:


Z
2X
n
ˆ
J(h) ≡ CV(h) = p̂2n − p̂n,−i (Xi ),
n i=1

where CV stands for cross-validation, and


X  
1 Xj − x
p̂n,−i (x) = K .
(n − 1)h j6=i h

Now we prove that CV(h) is an unbiased estimator of J(h), i.e., we show that
Z " n #
1X
Ep p̂n p = Ep p̂n,−i (Xi ) . (9)
n i=1

Since X1 , . . . , Xn are i.i.d., the right hand side of (9) is equal to


"   #
1 XZ Xj − z
Ep [p̂n,−1 (X1 )] = Ep K p(z)dz
(n − 1)h j6=1 h
Z Z  
1 x−z
= p(x) K p(z)dz dx
h h
This integral is finite if K is bounded. The left hand side of (9) is equal to
" n Z   #
1 X Xi − x
Ep K p(x)dx = RHS of (9).
nh i=1 h

Define the cross-validated bandwidth and the cross-validated KDE:

ĥCV = argmin CV(h),


h>0
X
n  
1 Xi − x
p̃CV
n (x) = K .
nĥCV i=1 ĥCV

[12] was the first to investigate the issue of optimality in connection with cross-
validation. He proved that the integrated squared error of the estimator p̃CVn is
asymptotically equivalent to that of some oracle estimator:
R CV
(p̃n − p)2 a.s.
R (h) → 1, n → ∞,
minh>0 (p̂n − p)2
under some assumptions (the density p is bounded, the kernel is compactly supported,
essentially nonnegative, and satisfies the Hölder condition).

19
2 Nonparametric regression

Let (X, Y ) be a pair of real-valued random variables such that E|Y | < ∞. The
regression function f : R → R of Y on X is defined as

f (x) = E(Y |X = x).

Suppose that we have a sample (X1 , Y1 ), . . . , (Xn , Yn ) of n i.i.d. pairs of random


variables having the same distribution as (X, Y ). We would like to estimate the
regression function f from the data. The nonparametric approach only assumes that
f ∈ F, where F is a given nonparametric class of functions. The set of values
{X1 , . . . , Xn } is called the design. Here the design is random.

The conditional residual ξ := Y − E(Y |X) has mean zero, E(ξ) = 0, and we may
write
Yi = f (Xi ) + ξi , i = 1, . . . , n, (10)
where ξi are i.i.d. random variables with the same distribution as ξ. In particular,
E(ξi ) = 0 for all i = 1, . . . , n. The variables ξi can therefore be interpreted as “errors”.

The key idea we use in estimating f nonparametrically in this section is called “local
averaging”. Given a kernel K and a bandwidth h, one can construct kernel estimators
for nonparametric regression. There exist different types of kernel estimators of the
regression function f . The most celebrated one is the Nadaraya-Watson estimator
defined as follows:
Pn   
Y i K Xi −x Xn
Xi − x
NW i=1 h
fn (x) = Pn Xi −x
 , if K 6= 0,
i=1 K h i=1
h

and fnN W (x) = 0 otherwise. This estimator was proposed separately in two papers
by Nadaraya and Watson in the year 1964.

Example: If we choose K(u) = 21 I(|u| ≤ 1), then fnN W (x) is the average of Yi such
that Xi ∈ [x − h, x + h]. Thus, for estimating f (x) we define the “local” neighborhood
as [x − h, x + h] and consider the average of the observations in that neighborhood.
For fixed n, the two extreme cases for the bandwidth are:
P
(i) h → ∞. Then fnN W (x) tends to n−1 ni=1 Yi which is a constant independent
of x. The systematic error (bias) can be too large. This is a situation of
oversmoothing.

(ii) h → 0. Then fnN W (Xi ) = Yi whenever h < mini,j |Xi −Xj | and limh→0 fnN W (x) =
0, if x 6= Xi . The estimator fnN W is therefore too oscillating: it reproduces the

20
data Yi at the points Xi and vanishes elsewhere. This makes the stochastic
error (variance) too large. In other words, undersmoothing occurs.

Thus, the bandwidth h defines the “width” of the local neighborhood and the kernel K
defines the “weights” used in averaging the response values in the local neighborhood.
As we saw in density estimation, an appropriate choice of the bandwidth h is more
important than the choice of the kernel K.

The Nadaraya-Watson estimator can be represented as a weighted sum of the Yi :


X n
NW
fn (x) = Yi WiN W (x)
i=1

where the weights are


   !
K Xi −x X
n
Xj − x
WiN W (x) := Pn h  I K 6= 0 .
K
Xj −x
j=1
h
j=1 h

Definition 2.1. An estimator fˆn (x) of f (x) is called a linear nonparametric regres-
sion estimator if it can be written in the form
Xn
ˆ
fn (x) = Yi Wni (x)
i=1

where the weights Wni (x) = Wni (x, X1 , . . . , Xn ) depend only on n, i, x and the values
X1 , . . . , X n .

Typically, the weights Wni (x) of linear regression estimators satisfy the equality
Pn
i=1 Wni (x) = 1 for all x (or for almost all x with respect to the Lebesgue mea-
sure).

Another intuitive motivation of fnN W is given below. Suppose that the distribution
of (X, Y ) has density p(x, y) with respect to the Lebesgue measure and pX (x) =
R
p(x, y)dy > 0. Then,
R
yp(x, y)dy
f (x) = E(Y |X = x) = .
pX (x)
If we replace here p(x, y) by the KDE p̂n (x, y) of the density of (X, Y ) defined by (3)
and use the corresponding KDE p̂X X ˆN W in view of
n (x) to estimate p (x), we obtain fn
the following result.

Exercise (HW1): Let p̂Xn (x) and p̂n (x, y) be the KDEs defined in (2) and (3) respec-
tively, with a kernel K of order 1. Then
R
NW y p̂n (x, y)dy
fn (x) =
p̂X
n (x)

21
if p̂X
n (x) 6= 0.

2.1 Local polynomial estimators

If the kernel K takes only nonnegative values, the Nadaraya-Watson estimator fnN W
satisfies  
Xn
Xi − x
NW 2
fn (x) = argmin (Yi − θ) K . (11)
θ∈R i=1
h

Thus, fnN W is obtained by a local constant least squares approximation of the response
values, i.e., Yi ’s. The locality is determined by the bandwidth h and the kernel K
which downweighs all the Xi that are not close to x whereas θ plays the role of a local
constant to be fitted. More generally, we may define a local polynomial least squares
approximation, replacing in (11) the constant θ by a polynomial of given degree `. If
f ∈ Σ(β, L), β > 1, ` = β, then for z sufficiently close to x we may write
 
0 f (`) (x) ` > z−x
f (z) ≈ f (x) + f (x)(z − x) + . . . + (z − x) = θ (x)U ,
`! h

where

U (u) = (1, u, u2 /2!, . . . , u` /`!),


>
θ(x) = f (x), f 0 (x)h, f 00 (x)h2 , . . . , f (`) (x)h` .

Definition 2.2. Let K : R → R be a kernel, h > 0 be a bandwidth, and ` ≥ 0 be an


integer. A vector θ̂n (x) ∈ R`+1 defined by
n 
X  2  
> Xi − x Xi − x
θ̂n (x) = argmin Yi − θ U K (12)
θ∈R`+1 i=1 h h

is called a local polynomial estimator of order ` of θ(x) or LP(`) estimator of θ(x) for
short. The statistic
fˆn (x) = U > (0)θ̂n (x)
is called a local polynomial estimator of order ` of f(x) or LP(`) estimator of f (x) for
short.

Note that fˆn (x) is simply the first coordinate of the vector θ̂n (x). Comparing (11)
and (12) we see that the Nadaraya-Watson estimator fnN W with kernel K ≥ 0 is
the LP (0) estimator. Furthermore, properly normalized coordinates of θ̂n (x) provide
estimators of the derivatives f 0 (x), . . . , f (`) (x).

22
For a fixed x the estimator (12) is a weighted least squares estimator. Indeed, we can
write θ̂n (x) as follows:

θ̂n (x) = argmin(−2θ> anx + θ> Bnx θ),


θ∈R`+1

where the matrix Bnx and the vector anx are defined by the formulas:
     
1 X
n
Xi − x > Xi − x Xi − x
Bnx = U U K ,
nh i=1 h h h
   
1 X
n
Xi − x Xi − x
anx = Yi U K .
nh i=1 h h

Exercise (HW1): If the matrix Bnx is positive definite, show that the local polynomial
estimator fˆn (x) of f (x) is a linear estimator. Also, in this case, find an expression for
fˆn (x).

The local polynomial estimator of order ` has a remarkable property: It reproduces


polynomials of degree ≤ `. This is stated in the next proposition (Exercise (HW1)).
Proposition 2.3. Let x ∈ R be such that Bnx > 0 (i.e., Bnx is positive definite) and

let Q be a polynomial of degree ≤ `. Then the LP(`) weights Wni are such that

X
n

Q(Xi )Wni (x) = Q(x),
i=1

for any sample (X1 , . . . , Xn ). In particular,

X
n X
n
∗ ∗
Wni (x) = 1, and (Xi − x)k Wni (x) = 0 for k = 1, . . . , `.
i=1 i=1

2.2 Pointwise and integrated risk of local polynomial estima-


tors

In this section we study statistical properties of the LP(`) estimator constructed from
observations (Xi , Yi ), i = 1, . . . , n, such that

Yi = f (Xi ) + ξi , i = 1, . . . , n, (13)

where ξi are independent zero mean random variables (E(ξi ) = 0), the Xi are deter-
ministic values belonging to [0, 1], and f is a function from [0, 1] to R.

23
Let fˆn (x0 ) be an LP(`) estimator of f (x0 ) at point x0 ∈ [0, 1]. The bias and the
variance of fˆn (x0 ) are given by the formulas
h i h i  h i2
ˆ
b(x0 ) = Ef fn (x0 ) − f (x0 ), ˆ ˆ
σ (x0 ) = Ef fn (x0 ) − Ef fn (x0 ) ,
2 2

respectively, where Ef denotes expectation with respect to the distribution of the


random vector (Y1 , . . . , Yn ). We will sometimes write for brevity E instead of Ef .

We will study separately the bias and the variance terms in this representation of the
risk. First, we introduce the following assumptions.

Assumptions (LP)

(LP1) There exist a real number λ0 > 0 and a positive integer n0 such that the
smallest eigenvalue λmin (Bnx ) of Bnx satisfies λmin (Bnx ) ≥ λ0 for all n ≥ n0 and
any x ∈ [0, 1].

(LP2) There exists a real number a0 > 0 such that for any interval A ⊂ [0, 1] and all
n ≥ 1,
1X
n
I(Xi ∈ A) ≤ a0 max(Leb(A), 1/n)
n i=1
where Leb(A) denotes the Lebesgue measure of A.

(LP3) The kernel K has compact support belonging to [−1, 1] and there exists a num-
ber Kmax < ∞ such that |K(u)| ≤ Kmax , ∀ u ∈ R.

Assumption (LP1) is stronger than the condition Bnx > 0 introduced before since it
is uniform with respect to n and x. We will see that this assumption is natural in the
case where the matrix Bnx converges to a limit as n → ∞. Assumption (LP2) means
that the points Xi are dense enough in the interval [0, 1]. It holds for a sufficiently
wide range of designs. An important example is given by the regular design: Xi = i/n,
for which (LP2) is satisfied with a0 = 2. Finally, assumption (LP3) is not restrictive
since the choice of K belongs to the statistician.

Exercise (HW1): Show that assumption (LP1) implies that, for all n ≥ n0 , x ∈ [0, 1],
and v ∈ R`+1 ,
−1
kBnx vk ≤ kvk/λ0 ,
where k · k denotes the Euclidean norm in R`+1 . Hint: Use the fact that Bnx is
−1 −2
symmetric and relate the eigenvalues of Bnx to that of Bnx and Bnx (note that for a
>
square matrix A ∈ R , λmax (A) = kvk2 , where v 6= 0 ∈ R ).
r×r v Av r

24
We have the following result (Exercise (HW1)) which gives us some useful bounds on

the weights Wni (x).
Lemma 2.4. Under assumptions (LP1)–(LP3), for all n ≥ n0 , h ≥ 1/(2n), and

x ∈ [0, 1], the weights Wni (x) of the LP(`) estimator are such that:

(i) supi,x |Wni (x)| ≤ Cnh∗ ;
Pn ∗
(ii) i=1 |Wni (x)| ≤ C∗ ;


(iii) Wni (x) = 0 if |Xi − x| > h,

where the constant C∗ depends only on λ0 , a0 , and Kmax .

We are now ready to find upper bounds on the MSE of the LP(`) estimator.
Proposition 2.5. Suppose that f ∈ Σ(β, L) on [0, 1], with β > 0 and L > 0. Let fˆn
be the LP(`) estimator of f with ` = bβc. Assume also that:

(i) the design points X1 , . . . , Xn are deterministic;

(ii) assumptions (LP1)–(LP3) hold;

(iii) the random variables ξi are independent and such that for all i = 1, . . . , n,

E(ξi ) = 0, E(ξi2 ) ≤ σmax


2
< ∞.

Then for all x0 ∈ [0, 1], n ≥ n0 , and h ≥ 1/(2n) the following upper bounds hold:
q2
|b(x0 ) ≤ q1 hβ , σ 2 (x0 ) ≤ ,
nh
2
where q1 := C∗ L/`! and q2 := σmax C∗2 .

Thus, Proposition 2.5 implies that


q2
MSE ≤ q12 h2β + . (14)
nh
The minimum with respect to h of the right hand side of the above upper bound is
attained at  1/(2β+1)
∗ q2
hn = n−1/(2β+1) .
2βq22
Therefore, the choice h = h∗n gives
 2β

− 2β+1
MSE(x0 ) = O n , as n → ∞,

uniformly in x0 . Thus, we have the following result.

25
Theorem 2.6. Assume that the assumptions of Proposition 2.5 hold. Suppose that
for a fixed α > 0 the bandwidth is chosen as h = hn = αn−1/(2β+1) . Then the following
holds: h i
lim sup sup sup Ef ψn−2 (fˆn (x0 ) − f (x0 ))2 ≤ C < ∞,
n→∞ f ∈Σ(β,L) x0 ∈[0,1]
β
where ψn := n− 2β+1 is the rate of convergence and C > 0 is a constant depending
2
only on β, L, a0 , σmax , Kmax and α.

As the above upper bound holds for every x0 ∈ [0, 1] we immediately get the following
result on the integrated risk.
Corollary 2.7. Under the assumptions of Theorem 2.6 the following holds:
h i
lim sup sup Ef ψ −2 kfˆn (x0 ) − f (x0 )k2 ≤ C < ∞,
n 2
n→∞ f ∈Σ(β,L)

R1 β
where kf k22 = 0 f 2 (x)dx, ψn := n− 2β+1 is the rate of convergence and C > 0 is a
2
constant depending only on β, L, a0 , σmax , Kmax and α.

2.2.1 Assumption (LP1)

We now discuss assumption (LP1) in more detail. If the design is regular and n is large
R
enough, Bnx is close to the matrix B := U (u)U > (u)K(u)du, which is independent
of n and x. Therefore, for Assumption (LP1) to hold we only need to assure that B
is positive definite. This is indeed true, except for pathological cases, as the following
lemma states.
Lemma 2.8. Let K : R → [0, ∞) be a function such that the Lebesgue measure
Leb({u : K(u) > 0}) > 0. Then the matrix
Z
B = U (u)U > (u)K(u)du

is positive definite.

Proof. It is sufficient to prove that for all v ∈ R`+1 satisfying v 6= 0, we have v > Bv > 0.
Clearly, Z
v > Bv > 0 = (v > U (u))2 K(u)du ≥ 0.
R
If there exists v 6= 0 such that (v > U (u))2 K(u)du = 0, then v > U (u) = 0 for almost
all u on the set {u : K(u) > 0}, which has a positive Lebesgue measure by the
assumption of the lemma. But the function v 7→ v > U (u) is a polynomial of degree

26
≤ ` which cannot be equal to zero except for a finite number of points. Thus, we come
R
to a contradiction showing that (v > U (u))2 K(u)du = 0 is impossible for v 6= 0.
Lemma 2.9. Suppose that there exist Kmin > 0 and ∆ > 0 such that

K(u) ≥ Kmin I(|u| ≤ ∆), ∀u ∈ R,

and that Xi = i/n for i = 1, . . . , n. Let h = hn be a sequence satisfying hn → 0 and


nhn → ∞, as n → ∞. Then assumption (LP1) holds.

3 Projection estimators

Consider data (Xi , Yi ), i = 1, . . . , n, from a nonparametric regression model where

Yi = f (Xi ) + ξi , i = 1, . . . , n, (15)

with Xi ∈ X, a metric space, and E(ξi ) = 0. The goal is to estimate the function f
based on the data. In what follows, we will also use the vector notation, writing the
model as
y = f + ξ,
where y = (Y1 , . . . , Yn )> , f = (f (X1 ), . . . , f (Xn ))> and ξ = (ξ1 , . . . , ξn )> .

The idea here is to approximate f by fθ , a linear combination of N given functions


ϕ1 , . . . , ϕN where ϕj : X → R, so that

X
N
fθ (x) := θj ϕj (x).
j=1

Then we look for a suitable estimator θ̂ = (θ̂1 , . . . , θ̂N ) of θ based on the sample
(Xi , Yi ), i = 1, . . . , n, and construct an estimator of f having the form

X
N
fˆ(x) = fθ̂ (x) = θ̂j ϕj (x). (16)
j=1

Example 3.1. If X = [0, 1] and f ∈ L2 [0, 1], then a popular choice of {ϕj }N j=1
corresponds to the first N functions of an orthonormal basis in L2 [0, 1]. For example,
{ϕj }∞ ∞
j=1 can be the trigonometric basis or the Legendre basis on [0, 1]. Let {θj }j=1 be
the Fourier coefficients of f with respect to the orthonormal basis {ϕj }∞j=1 of L2 [0, 1],
i.e., Z 1
θj = f (x)ϕ(x)dx.
0

27
Assume that f can be represented as

X
f (x) = θj ϕj (x), (17)
j=1

where the series converges for all x ∈ [0, 1]. Observe that if Xi are scattered over
[0, 1] in a sufficiently uniform way, which happens, e.g., in the case Xi = i/n, the
P
coefficients θj are well approximated by the sums n−1 N i=1 f (Xi )ϕj (Xi ). Replacing
in these sums the unknown quantities f (Xi ) by the observations Yi we obtain the
following estimator of θj :

1X
θ̂j = Yi ϕj (Xi ). (18)
n i=1
Remark 3.1. The parameter N (called the order of the estimator) plays the same
role as the bandwidth h for kernel estimators: similar to h it is a smoothing parameter,
i.e., a parameter whose choice is crucial for establishing the balance between bias and
variance. The choice of very large N leads to undersmoothing, whereas for small
values of N oversmoothing occurs.

An important class of estimators of the form (16) are projection estimators. Define
the empirical norm k · k as:
X
n X
n
2 2 2
kf k := f (Xi ), kyk := Yi2 .
i=1 i=1

The projection estimator is defined as follows:

X
N
fˆLS (x) = fθ̂LS (x) = θ̂jLS ϕj (x) (19)
j=1

LS
where θ̂ is the classical least squares estimator (LSE):
LS
θ̂ := argmin ky − fθ k2 ,
θ∈RN

where fθ = (fθ (X1 ), . . . , fθ (Xn ))> . Equivalently, we can write


LS
θ̂ = argmin ky − Xθk2 ,
θ∈RN

where X := (ϕj (Xi ))i,j where i = 1, . . . , n and j = 1, . . . , N . In other words, we con-


struct a ‘nonparametric’ estimator based on a purely parametric idea. The question
is whether such an estimator is good. We will see that this is indeed the case under

28
appropriate conditions on the functions {ϕj }, the function f , and N . Recall that,
under the assumption that X> X > 0 (note that X> X is an N × N matrix), we have
LS LS
θ̂ = (X> X)−1 X> y and f̂ LS = Xθ̂ = Ay

where A := X(X> X)−1 X> is the so-called hat matrix. The hat matrix is the or-
thogonal projection matrix (in Rn ) onto the column-space of X, i.e., the subspace
of Rn spanned by the N columns of X. Note that we can have X> X > 0 only if
N ≤ n. However, even if X> X is not invertible f̂ LS is uniquely defined by the Hilbert
projection theorem4 and can be expressed as Ay where now A = X(X> X)+ X> ; here
A+ stands for the Moore-Penrose pseudoinverse.

Indeed, rank(X> X) = rank(X) ≤ min(N, n). Under the assumption that X> X > 0,
the projection estimator is unique and has the form

LS X
n
fˆLS (x) = ϕ(x)> θ̂ = ϕ(x)> (X> X)−1 X> y = Wni (x)Yi ,
i=1

where ϕ(x) = (ϕ1 (x), . . . , ϕN (x))> and Wni (x) is the i-th component of the vector
ϕ(x)> (X> X)−1 X> .

3.1 Risk bounds for projection estimators

Assume now that we have the regression model (48), where the points Xi are deter-
ministic elements in the space X. Let us measure the accuracy of an estimator fˆ of
f by the following squared risk:
" n #
1 X
R(f , f̂ ) := Ekf − f̂ k2 = E (fˆ(Xi ) − f (Xi ))2 .
n i=1

This choice of a loss function is quite natural and it measures the prediction accuracy
of the estimator at the observed design points. Further, if the Xi are “equi-spaced”
points then R(f , f̂ ) is approximately equal to the MISE.
P
Let fˆ(x) be a linear estimator, i.e., fˆ(x) = ni=1 Wni (x)Yi . Then we can write f̂ = Sy
where S := (Wnj (Xi ))n×n is a deterministic matrix. Note that S does not depend on
4
The Hilbert projection theorem is a famous result of convex analysis that says that for every
point u in a Hilbert space H and every nonempty closed convex C ⊂ H, there exists a unique point
v ∈ C for which kx − yk is minimized over C. This is, in particular, true for any closed subspace M
of C. In that case, a necessary and sufficient condition for v is that the vector u − v be orthogonal
to M .

29
y; it depends only on the Xi ’s. As particular cases, we can think of fˆ as the LP(`)
estimator or the projection estimator in (16).
Proposition 3.2. Let ξi be random variables such that E(ξi ) = 0 and E(ξi ξj ) = σ 2 δij
for i, j = 1, . . . , n, where δij is the Kronecker delta function. Let S be any n×n matrix.
Then the risk of linear estimator f = Sy is given by
σ2
R(f , f̂ ) = kSf − f k2 + tr(S > S).
n
Proof. By definition of the norm k · k and of the model,

kf̂ − f k2 = kSf + Sξ − f k2
2
= kSf − f k2 + (Sf − f )> Sξ + kSξk2 .
n
Taking expectations and using that E(ξ) = 0 we obtain
1
Ekf̂ − f k2 = kSf − f k2 + E(ξ > S > Sξ).
n
Set V = S > S and denote the elements of this matrix by vij . We have
!
Xn Xn
> >
E(ξ S Sξ) = E ξi vij ξj = σ 2 vii = σ 2 tr(V ).
i,j i=1

In particular, if f̂ is a projection estimator then S is an orthogonal projection matrix


and S > = S and thus, V = S (as S 2 = S) which shows that

tr(V ) = tr(S) = rank(S) ≤ min(n, N ).

Thus, we have
σ2
R(f , f̂ ) ≤ kSf − f k2 + min(n, N )
n
σ2
= min kfθ − f k2 + min(n, N ). (20)
θ∈RN n
In fact, a close inspection of the proof of Proposition 3.2 shows that for the above
inequality to hold it is enough to assume that E(ξi2 ) ≤ σ 2 , and E(ξi ξj ) = 0 for i 6= j,
where i, j = 1, . . . , n.

In order to control this bias term and to analyze the rate of convergence of projection
estimator, we need to impose some assumptions on the underlying function f and on
the basis {ϕj }∞
j=1 .

30
3.1.1 Projection estimator with trigonometric basis in L2 [0, 1]

Here we continue to consider the nonparametric regression model (48) and we will
assume that X = [0, 1]. We will mainly focus on a particular case, Xi = i/n.
Definition 3.3. The trigonometric basis is the orthonormal basis of L2 [0, 1] defined
by
√ √
ϕ1 (x) = 1, ϕ2k (x) = 2 cos(2πkx), ϕ2k+1 (x) = 2 sin(2πkx), k = 1, 2 . . . ,

for x ∈ [0, 1].

We will assume that the regression function f is sufficiently smooth, or more specif-
ically, that it belongs to a periodic Sobolev class of functions. First, we define the
periodic Sobolev class for integer smoothness β.
Definition 3.4. Let β ≥ 1 be an integer and let L > 0. The periodic Sobelev class
W (β, L) is defined as
n
W (β, L) := f : [0, 1] → R : f (β−1) is absolutely continuous and
Z 1 o
(β) 2 2 (j) (j)
(f (x)) dx ≤ L , f (0) = f (1), j = 0, 1, . . . , β − 1
0

Any function f belonging to such a class is continuous and periodic (f (0) = f (1))
and thus admits the representation

X
f (x) = θ1 ϕ1 (x) + (θ2k ϕ2k (x) + θ2k+1 ϕ2k+1 (x)) (21)
k=1

where {ϕj }∞j=1 is the trigonometric basis given in Definition 3.3. The above infinite
series converges pointwise, and the sequence θ = {θj }∞ j=1 of Fourier coefficients of f
belongs to the space ( )
X∞
`2 (N) := θ : θj2 < ∞ .
j=1

We now state a necessary and sufficient condition on θ under which the function (21)
belongs to the class W (β, L). Define
(
jβ , for even j,
aj = β
(22)
(j − 1) , for odd j.

31
Proposition 3.5. Let β ∈ {1, 2, . . .}, L > 0, and let {ϕj }∞ j=1 be the trigonometric
basis. A function f ∈ L2 [0, 1] belong to W (β, L) if and only if the vector θ of the
Fourier coefficients of f belongs to the following ellipsoid in `2 (N):
( ∞
)
X
Θ(β, Q) := θ ∈ `2 (N) : a2j θj2 ≤ Q (23)
j=1

where Q = L2 /π 2β and aj is given by (22).

See [14, Lemma A.3] for a proof of the above result.

The set Θ(β, Q) defined by (23) with β > 0 (not necessarily an integer), Q > 0, and
aj satisfying (22) is called a Sobolev ellipsoid. We mention the following properties of
these ellipsoids.

• The monotonicity with respect to inclusion:

0 < β 0 ≤ β implies Θ(β, Q) ⊂ Θ(β 0 , Q).

P
• If β > 1/2, the function f = ∞ ∞
j=1 θj ϕj with the trigonometric basis {ϕj }j=1
and θ ∈ Θ(β, Q) is continuous (check this as an exercise). In what follows, we
will basically consider this case.

The ellipsoid Θ(β, Q) is well-defined for all β > 0. In this sense Θ(β, Q) is a more
general object than the periodic Sobolev class W (β, L), where β can only be an
integer. Proposition 3.5 establishes an isomorphism between Θ(β, Q) and W (β, L)
for integer β. It can be extended to all β > 0 by generalizing the definition of W (β, L)
in the following way.
Definition 3.6. For any β > 0 and L > 0 the Sobolev class W (β, L) is defined as:
n o
W (β, L) = f ∈ L2 [0, 1] : θ = {θj }∞
j=1 ∈ Θ(β, Q)
R1
where θj = 0 f ϕj and {ϕj }∞ j=1 is the trigonometric basis. Here Θ(β, Q) is the Sobolev
ellipsoid defined in (23), where Q = L2 /π 2β and {aj }∞j=1 is given by (22).

For all β > 1/2, the functions belonging to W (β, L) are continuous. On the contrary,
they are not always continuous for β < 1/2; an example is given by the function
f (x) = sign(x − 1/2), whose Fourier coefficients θj are of order 1/j.
Lemma 3.7. Let {ϕj }∞
j=1 be the trigonometric basis. Then,

1X
n
ϕj (s/n)ϕk (s/n) = δjk , 1 ≤ j, k ≤ n − 1, (24)
n s=1

32
where δjk is the Kronecker delta.

See [14, Lemma 1.7] for a proof of the above result.

We are now ready to establish an upper bound on the bias of the projection estimator.
Proposition 3.8. Let f ∈ W (β, L), β ≥ 1, L > 0. Assume that {ϕj }∞ j=1 is the
trigonometric basis and Xi = i/n, i = 1, . . . , n. Then, for all n ≥ 1, N ≥ 1,
 
1 1
inf Ekfθ − f k ≤ C(β, L)
2
+ ,
θ∈RN N 2β n

where C(β, L) is a constant that depends only on β and L.

The proof of the above result was given in class.


Theorem 3.9. Let f ∈ W (β, L), β ≥ 1, L > 0 and N = dαn1/(2β+1) e for α > 0.
Assume that {ϕj }∞ j=1 is the trigonometric basis and Xi = i/n, i = 1, . . . , n. Let ξi
be random variables such that E(ξi ) = 0, E(ξi2 ) ≤ σ 2 and E(ξi ξj ) = 0 for i 6= j ∈
{1, . . . , n}. Then, for all n ≥ 1,

sup Ekf̂ LS − f k2 ≤ Cn− 2β+1 ,
f ∈W (β,L)

where C is a constant that depends only on σ 2 , β, L and α.

Proof. In view of (20) and Proposition 3.8,


 
1 1 σ2N 2β
Ekf̂ − f k ≤ C(β, L)
LS 2

+ + = O(n− 2β+1 ).
N n n

33
4 Minimax lower bounds

We have a family {Pθ : θ ∈ Θ} of probability measures, indexed by Θ, on a measurable


space (X , A) associated with the data. Usually, in nonparametric statistics, Θ is a
nonparametric class of functions (e.g., Θ = Σ(β, L) or Θ = W (β, L)). Thus, in the
density estimation model, Pθ is the probability measure associated with a sample
X = (X1 , . . . , Xn ) of size n when the density of Xi is p(·) ≡ θ.

Given a semi-distance5 the performance of an estimator θ̂n of θ is measured by the


maximum risk of this estimator on Θ:
h i
r(θ̂n ) := sup Eθ d2 (θ̂n , θ) .
θ∈Θ
h i
The aim of this section is to complement the upper bound results (i.e., supθ∈Θ Eθ d2 (θ̂n , θ) ≤
Cψn2 for certain estimator θ̂n ) by the corresponding lower bounds:
h i
∀ θ̂n : sup Eθ d (θ̂n , θ) ≥ cψn2
2
θ∈Θ

for sufficiently large n, where c is a positive constant.

The minimax risk associated with a statistical model {Pθ : θ ∈ Θ} and with a semi-
distance d is defined as h i
R∗n := inf sup Eθ d2 (θ̂n , θ) ,
θ̂n θ∈Θ

where the infimum is over all estimators. The upper bounds established previously
imply that there exists a constant C < ∞ such that

lim sup ψn−2 R∗n ≤ C


n→∞

for a sequence {ψn }n≥1 converging to zero. The corresponding lower bounds claim
that there exists a constant c > 0 such that, for the same sequence {ψn }n≥1 ,

lim inf ψn−2 R∗n ≥ c. (25)


n→∞

4.1 Distances between probability measures

Let (X , A) be a measurable space and let P and Q be two probability measures on


(X , A). Suppose that ν is a σ-finite measure on (X , A) satisfying P  ν and Q  ν.
5
We will call the semi-distance d : Θ × Θ → [0, +∞) on Θ as a function that satisfies d(θ, θ0 ) =
d(θ0 , θ), d(θ, θ00 ) ≤ d(θ, θ0 ) + d(θ0 , θ00 ), and d(θ, θ) = 0, where θ, θ0 , θ00 ∈ Θ. The following are a few
common examples of d: d(f, g) = |f (x0 ) − g(x0 )| (for some fixed x0 ), d(f, g) = kf − gk2 , etc.

34
Define p = dP/dν, q = dQ/dν. Observe that such a measure ν always exists since we
can take, for example, ν = P + Q.
Definition 4.1. The Hellinger distance between P and Q is defined as follows:
Z  Z 
2 √ √ 2 √
H (P, Q) := ( p − q) dν = 2 1 − pq dν .

Exercise (HW2): The following are some properties of the Hellinger distance:

1. H(P, Q) does not depend on the choice of the dominating measure ν.

2. H(P, Q) satisfies the axioms of distance.

3. 0 ≤ H 2 (P, Q) ≤ 2.

4. If P and Q are product measures, P = ⊗ni=1 Pi , Q = ⊗ni=1 Qi , then


" n  #
Y H 2
(P i , Qi )
H 2 (P, Q) = 2 1 − 1− .
i=1
2

Definition 4.2. The total variation distance between P and Q is defined as follows:
Z
V (P, Q) := sup |P (A) − Q(A)| = sup (p − q)dν .
A∈A A∈A A

Note that 0 ≤ V (P, Q) ≤ 1 and V (P, Q) satisfies the axioms of distance.


Lemma 4.3 (Scheffé’s theorem).
Z Z
1
V (P, Q) = |p − q|dν = 1 − min(p, q)dν.
2

Lemma 4.4 (Le Cam’s inequalities).


Z Z 2
 2
1 √ 1 H 2 (P, Q)
min(p, q)dν ≥ pq dν= 1− ,
2 2 2
r
1 2 H 2 (P, Q)
H (P, Q) ≤ V (P, Q) ≤ H(P, Q) 1 − .
2 2

Exercise (HW2): Prove the above two lemmas.


Definition 4.5. The Kullback divergence between P and Q is defined by:
( R  
log pq p dν, if P  Q,
K(P, Q) :=
+∞, otherwise.

35
It can be shown that the above definition always makes sense if P  Q. Here are
some properties of the Kullback divergence:

1. K(P, Q) ≥ 0. Indeed, by Jensen’s inequality,


Z   Z   Z 
p q
log p dν = − p log dν ≥ − log qdν ≥ 0.
pq>0 q pq>0 p

2. K(P, Q) is not a distance (for example, it is not symmetric).

3. [Show this (Exercise (HW2))] If P and Q are product measures, P = ⊗ni=1 Pi ,


Q = ⊗ni=1 Qi , then
X
n
K(P, Q) = K(Pi , Qi ).
i=1

The next lemma links the Hellinger distance with the Kullback divergence.
Lemma 4.6.
H 2 (P, Q) ≤ K(P, Q).

The following lemma links the total variation distance with the Kullback divergence.
Lemma 4.7 (Pinsker’s inequality).
p
V (P, Q) ≤ K(P, Q)/2.

Exercise (HW2): Prove the above two lemmas.


Definition 4.8. The χ2 divergence between P and Q is defined by:
( R
(p−q)2
2 p
dν, if P  Q,
χ (P, Q) :=
+∞, otherwise.
Lemma 4.9.
Z Z
1 1p 2
min(p, q)dν = 1 − |p − q|dν ≥ 1 − χ (P, Q).
2 2

Proof. Since p and q are probability densities,


Z Z Z Z
2 = p dν + q dν = 2 min(p, q)dν + |p − q|dν

which shows the first equality. To show the inequality, we use Cauchy-Schwarz in-
equality to obtain
Z Z
1 √
|p − q|dν = √ |p − q| p dν ≤ χ2 (P, Q).
p

36
Lemma 4.10. If P and Q are product measures, P = ⊗ni=1 Pi and Q = ⊗ni=1 Qi , then
Y
n
2
χ (P, Q) = (χ2 (Pi , Qi ) + 1) − 1.
i=1

The proof is left as an exercise (HW2).

4.2 Lower Bounds on the risk of density estimators at a point

Our aim is to obtain a lower bound for the minimax risk on (Θ, d) where Θ is a
Sobolev density:
Θ = P(β, L), β > 0, L > 0,
and where d is a distance at a fixed point x0 ∈ R:
d(f, g) = |f (x0 ) − g(x0 )|.
β
The rate that we would like to obtain is ψn = n− 2β+1 . Indeed, this is the same rate
as in the upper bounds which will enable us to conclude that ψn is optimal on (Θ, d).

Thus, we want to show that


  2β
inf sup Ep (Tn (x0 ) − p(x0 ))2 ≥ cn− 2β+1 , (26)
Tn p∈P(β,L)

for all n sufficiently large, where Tn ranges over all density estimators and c > 0 is a
constant. For brevity we write Tn = Tn (x0 ). For any p0 , p1 ∈ P(β, L), we may write

sup Ep [(Tn − p(x0 ))2 ] ≥ max Ep0 [(Tn − p0 (x0 ))2 ], Ep1 [(Tn − p1 (x0 ))2 ]
p∈P(β,L)
1
≥ Ep0 [(Tn − p0 (x0 ))2 ] + Ep1 [(Tn − p1 (x0 ))2 ] . (27)
2
Note that
Z Z !
Y
n
Ep [(Tn − p(x0 ))2 ] = ... [Tn (x1 , . . . , xn ) − p(x0 )]2 p(xi )dxi .
i=1
Q
Let x := (x1 , . . . , xn ) and πn (x) = ni=1 p(xi ). Also, let π0,n , π1,n be the joint densities
corresponding to the chosen densities p0 and p1 . The expression in (27) is then equal
to
Z Z 
1 2 2
(Tn (x) − p0 (x0 )) π0,n (x)dx + (Tn (x) − p1 (x0 )) π1,n (x)dx
2
Z 
1 2 2
≥ (Tn (x) − p0 (x0 )) + (Tn (x) − p1 (x0 )) min{π0,n (x), π1,n (x)} dx
2
Z
1 2
≥ (p0 (x0 ) − p1 (x0 )) min{π0,n (x), π1,n (x)} dx,
4

37
where we have used the fact that u2 + v 2 ≥ (u − v)2 /2, for u, v ∈ R.

In view of the above, to prove (26) it suffices to find densities p0 and p1 such that

(i) p0 , p1 ∈ P(β, L),


β
(ii) |p0 (x0 ) − p1 (x0 )| ≥ c1 n− 2β+1 ,
R
(iii) min{π0,n (x), π1,n (x)} dx ≥ c2 , where the constant c2 does not depend on n.

We take p0 to be a density on R such that p0 ∈ Σ(β, L/2) and p0 (x0 ) > 0; e.g., p0 can
be the N (0, σ 2 ) density with σ 2 chosen is such a suitable way. Obviously p0 ∈ Σ(β, L).
Construct p1 by adding a small perturbation to p0 :
 
β x − x0
p1 (x) := p0 (x) + h K ,
h

where h = αn−1/(2β+1) (for α > 0), the support of K is [− 12 , 32 ], K is infinitely


R
differentiable (i.e., K ∈ C ∞ (R)), K(0) > 0 and K(u)du = 0. Thus, p1 is a density
for h small enough.

Figure 2: Graphs of K0 and g.


0.3

0.2
0.2
K_0(x)

g(x)
0.0
0.1

−0.2
0.0

−2 −1 0 1 2 −0.5 0.0 0.5 1.0 1.5


x x

Lemma 4.11. p1 ∈ Σ(β, L) is a density for h > 0 small enough.

Proof. Let
1

K0 (u) := e 1−u2 I[−1,1] (u).
Then, K0 ∈ C ∞ (R) and the support of K0 is [−1, 1]. Let g : [− 12 , 32 ] → R be defined
as
g(u) := K0 (2u) − K0 (2(u − 1)).
Observe that

38
1. g(0) 6= 0,
R
2. g(u)du = 0,

3. g ∈ C ∞ (R), which implies that g ∈ Σ(β, L0 ) for a certain L0 > 0.

Define K : [− 12 , 23 ] → R such that K(u) := ag(u) for a > 0 small enough so that
K ∈ Σ(β, L/2).
R R
Using the fact that g(u)du = 0 it is easy to see that p1 (x)dx = 1. Next we show
that p1 ≥ 0 for h > 0 small enough. For x ∈ [x0 − h2 , x0 + 3h 2
],
 
β x − x0
p1 (x) ≥ min p0 (t) − sup h K
t∈[x0 − h
2
,x 0 + 3h
2
] t∈[x 0 − h
,x 0 + 3h
] h
2 2
β
≥ min p0 (t) − h sup |K (t)| .
t∈[x0 − h ,x + 3h
2 0 2
] t∈R

Since p0 is continuous, p0 (x0 ) > 0, we obtain that p1 (x) > 0 for all x ∈ [x0 − h2 , x0 + 3h
2
],
h 3h
if h is smaller than some constant h0 > 0. Note that for x ∈ / [x0 − 2 , x0 + 2 ],
p1 (x) = p0 (x) ≥ 0. Thus, p1 is a density.

We now have to show that p1 ∈ Σ(β, L). Set ` := bβc. Clearly, p1 is ` times
differentiable. Further,
 
(`) (`) β−` (`) x − x0
p1 (x) = p0 (x) + h K .
h

Hence,
   
(`) (`) (`) (`) x − x0 x0 − x0
|p1 (x) − p1 (x0 )| ≤ |p0 (x) − p0 (x)| +h β−`
K (`)
−K (`)
h h
β−`
L L x − x0
≤ |x − x0 |β−` + hβ−` ≤ L|x − x0 |β−` ,
2 2 h

where we have used the fact that both p0 , K ∈ Σ(β, L/2).

Thus,
β
|p0 (x0 ) − p1 (x0 )| = hβ K(0) = K(0)n− 2β+1 .

Next we will try to show that (iii) holds. In view of Lemma 4.9, it suffices to bound

39
χ2 (π0,n , π1,n ) from above by a constant strictly less than 1. First write χ2 (p0 , p1 ) as
Z Z x0 +3h/2  2β 2 
(p0 − p1 )2 h K ((x − x0 )/h)
= dx
p0 x0 −h/2 p0 (x)
Z
h2β+1
≤ K 2 (u)du
minx∈[x0 −h/2,x0 +3h/2] p0 (x)
Z
h2β+1
≤ K 2 (u)du
minx∈[x0 −1/2,x0 +3/2] p0 (x)
where we have assumed that h ≤ α and α ≤ 1. Plugging the choice of h we obtain

χ2 (p0 , p1 ) ≤ c∗ α2β+1 n−1

where the constant c∗ depends only on p0 and K. Therefore, applying Lemma 4.10
we find
χ2 (π0,n , π1,n ) ≤ (1 + c∗ α2β+1 n−1 )n − 1 ≤ exp(c∗ α2β+1 ) − 1,
where we have used the fact that 1 + v < ev , for v ∈ R. Now, we choose α small
enough so that exp(c∗ α2β+1 ) − 1 < 1. Then,
Z
1 1
min(π0 , π1 ) ≥ 1 − = ,
2 2
and thus, condition (iii) is satisfied.
Theorem 4.12. Let β > 0, L > 0. There exists a constant c > 0 that only depends
on β and L such that, for all x0 ∈ R, n ≥ 1,
  2β
inf sup Ep (Tn (x0 ) − p(x0 ))2 ≥ cn− 2β+1 ,
Tn p∈P(β,L)

where Tn ranges over all density estimators.

Since the choice of x0 is arbitrary, we can equivalently put inf x0 ∈R before the minimax
risk.
Definition 4.13. Let x0 be fixed, and let P be a class of densities on R. A sequence
{ψn }n≥1 , ψn > 0, is called an optimal rate of convergence of mean squared error (risk)
on the class P if the following two conditions are satisfied:

(i) inf Tn supp∈P Ep [(Tn (x0 ) − p(x0 ))2 ] ≥ cψn2 , where c > 0 is a constant independent
of n.

(ii) There exist an estimator pn (·), and a constant C > 0 independent of n such
that
sup Ep [(pn (x0 ) − p(x0 ))2 ] ≤ Cψn2 .
p∈P

40
If (i) and (ii) hold, then pn is called a rate optimal estimator for the risk on the class
P.
Corollary 4.14. Let β > 0, L > 0. The KDE with bandwidth h = αn−1/(2β+1) ,
α > 0, and kernel of order ` = bβc is rate optimal for the mean squared error on the
Hölder class P(β, L), and ψn = n−β/(2β+1) is the corresponding optimal rate.

Summary: We have seen that the following issues play the key role in nonparametric
estimation.

• Bias-variance trade-off: For nonparametric estimation, the bias is not negligible,


which brings in the problem of optimal choice of the smoothing parameter. For
the KDE, the smoothing parameter is the bandwidth.

• Optimality in a minimax sense: Is the upper bound obtained from bias-variance


trade-off indeed optimal? We need minimax lower bounds to answer this ques-
tion.

• Adaptation: What is the optimal data-driven choice of the smoothing param-


eter? An adaptive estimator is an estimator which is rate optimal on a large
scale of classes without any knowledge about the parameters of the classes.
Cross-validation is an example of a successful adaptation procedure.

4.3 Lower bounds on many hypotheses

The lower bounds based on two hypotheses turn out to be inconvenient when we deal
with estimation in Lp distances; see e.g., the start of Section 2.6 of [14].

Let us consider the nonparametric density estimation problem under the L2 risk.
Then,
Z 1/2
2
d(f, g) = kf − gk2 = (f (x) − g(x)) dx .

Our aim is to prove an optimal lower bound on the minimax risk for the Sobolev class
of densities Θ = S(β, L) (where β ≥ 1 is an integer and L > 0) and the above L2
distance with the rate ψn = n−β/(2β+1) .

The proof is based on a construction of subsets Fn ⊂ S(β, L), consisting of 2rn


functions, where rn = bn1/(2β+1) c, and on bounding the supremum over S(β, L) by
the average over Fn .

The subset Fn is indexed by the set of all vectors θ ∈ {0, 1}rn consisting of sequences

41
of rn zeros and ones. For h = n−1/(2β+1) , let xn,1 < xn,2 < . . . < xn,n be a regular grid
of mesh width 2h (i.e., xn,i − xn,i−1 > 2h, for i = 2, . . . , n).

For a fixed probability density p ∈ S(β, L/2) (e.g., let p be the density of N (0, σ 2 )
where σ 2 is such that p ∈ S(β, L/2)). Consider a fixed function K ∈ S(β, L0 ) with
support (−1, 1), and define, for every θ ∈ {0, 1}rn ,
Xrn  
β x − xn,j
pn,θ (x) := p(x) + h θj K . (28)
j=1
h

If p is bounded away from zero on a interval containing the grid, |K| is bounded, and
R
K(x)dx = 0, then pn,θ is a p.d.f, at least for large n. Furthermore,
Z Z Z
(β)
2
(β) 2 2 L2 L2
pn,θ (x) dx ≤ 2 p (x) dx + 2hrn K (β) (x) dx ≤ 2 + 2 ≤ L2 .
4 4
Observe that in the above we have used the fact that the mesh width is more than
2h so that for j 6= k,
Z    
x − xn,j x − xn,k
K K dx = 0.
h h

Thus, pn,θ ∈ S(β, L) for every θ.

Of course, there exists many choices of p and K such that pn,θ ∈ S(β, L) for every θ.
Theorem 4.15. There exists a constant cβ,L such that for any density estimator p̂n ,
Z 
sup Ep (p̂n − p) ≥ cβ,L n−2β/(2β+1) .
2
p∈S(β,L)

We will use the following result crucially to prove the above theorem.

4.3.1 Assouad’s lemma

The following lemma gives a lower bound for the maximum risk over the parameter
set {0, 1}r , in an abstract form, applicable to the problem of estimating an arbitrary
quantity ψ(θ) belonging to a semi-metric space (with semi-distance d). Let
X
r
0
H(θ, θ ) := |θi − θi0 |
i=1

be the Hamming distance on {0, 1}r , which counts the number of positions at which
θ and θ0 differ.

42
R
For two probability measures P and Q with densities p and q let kP ∧Qk := p∧q dν.
Before we state and prove Assouad’s lemma we give a simple result which will be useful
later.
Lemma 4.16 (Lemma from hypothesis testing). Suppose that we are given two
models Pθ0 and Pθ1 on a measurable space (X , A) with densities p0 and p1 with
respect to a σ-finite measure ν. Consider testing the hypothesis

H0 : θ = θ0 versus H0 : θ = θ1 .

The power function πφ of any test φ satisfies

1
πφ (θ1 ) − πφ (θ0 ) ≤ kPθ1 − Pθ0 k.
2
R
Proof. The difference on the left hand side can be written as φ(p1 − p0 )dν. The
expression is maximized for the test function I{p1 > p0 } (Exercise (HW2): Show
this). Thus,
Z Z Z
1
φ (p1 − p0 ) dν ≤ (p1 − p0 ) dν = |p1 − p0 |dν,
p1 >p0 2
as
Z Z Z
|p1 − p0 | dν = (p1 − p0 ) dν + (p0 − p1 ) dν
p1 >p0 p0 >p1
Z Z Z 
= (p1 − p0 ) dν + (p0 − p1 ) dν − (p0 − p1 ) dν
p1 >p0 p1 >p0
Z
= 2 (p1 − p0 ) dν.
p1 >p0

Lemma 4.17 (Assouad’s lemma). For any estimator T of ψ(θ) based on an obser-
vation in the experiment {Pθ : θ ∈ {0, 1}r }, and any p > 0,

dp (ψ(θ), ψ(θ0 )) r
max 2 Eθ [d (T, ψ(θ))] ≥ min
p p
min kPθ ∧ Pθ0 k. (29)
θ H(θ,θ 0 )≥1 H(θ, θ0 ) 2 H(θ,θ0 )=1

Proof. Define an estimator S, taking values in Θ = {0, 1}r , by letting S = θ is


θ0 7→ d(T, ψ(θ0 )) is minimal over Θ at θ0 = θ (if the minimum is not unique, choose a
point of minimum in any consistent way), i.e.,

S = argmin d(T, ψ(θ)).


θ∈Θ

43
By the triangle inequality, for any θ,

d(ψ(S), ψ(θ)) ≤ d(ψ(S), T ) + d(ψ(θ), T ),

which is (upper) bounded by 2d(ψ(θ), T ), by the definition of S (as d(ψ(S), T ) ≤


d(ψ(θ), T )). If
dp (ψ(θ), ψ(θ0 )) ≥ γH(θ, θ0 ) (30)
for all pairs θ, θ0 ∈ Θ (for some γ to be defined later), then

2p Eθ [dp (T, ψ(θ))] ≥ Eθ [dp (ψ(S), ψ(θ))] ≥ γEθ [H(S, θ)].

The maximum of this expression over Θ is bounded below by the average, which,
apart from the factor γ, can be written as
 
X Xr Xr X Z X Z
1 1  1 1
r
Eθ |Sj − θj | = r−1
Sj dPθ + r−1 (1 − Sj )dPθ 
2 θ j=1 2 j=1 2 θ:θj =0
2 θ:θj =1
Z Z 
1X
r
= Sj dP̄0,j + (1 − Sj )dP̄1,j ,
2 j=1

where
1 X 1 X
P̄0,j = Pθ and P̄1,j = Pθ .
2r−1 θ:θj =0
2r−1 θ:θj =1

This is minimized over S by choosing Sj for each j separately to minimize the j-th
term in the sum. The expression within brackets is the sum of the error probabilities
of a test of
H0 : P = P̄0,j versus H1 : P = P̄1,j .
Equivalently, it is equal to 1 minus the difference of power and level. By Lemma 4.16
it can be shown that this is at least 1 − 21 kP̄0,j − P̄1,j k = kP̄0,j ∧ P̄1,j k (by Lemma 4.9).
Hence, the preceding display is bounded below by
1X
r
kP̄0,j ∧ P̄1,j k.
2 j=1

Note that for two sequences {ai }m m


i=1 and {bi }i=1 ,
!
1 X 1 X 1 X
m m m
min ai , bi ≥ min(ai , bi ) ≥ min min(ai , bi ).
m i=1 m i=1 m i=1 i=1,...,m

The 2r−1 terms Pθ and Pθ0 in the averages P̄0,j and P̄1,j can be ordered and matched
such that each pair θ and θ0 differ only in their j-th coordinate. Conclude that
1X 1X
r r
r
kP̄0,j ∧ P̄1,j k ≥ min kPθ ∧ Pθ0 k ≥ min kPθ ∧ Pθ0 k.
2 j=1 2 j=1 H(θ,θ )=1,θj 6=θj
0 0 2 H(θ,θ0 )=1

44
dp (ψ(θ),ψ(θ0 ))
Observing that the γ in (30) can always be taken as minH(θ,θ0 )≥1 H(θ,θ 0 )
, we obtain
the desired result.

Exercise (HW2): Complete the proof of Theorem 4.15.

4.3.2 Estimation of a monotone function

Consider data (Xi , Yi ), i = 1, . . . , n, from a nonparametric regression model where

Yi = f (Xi ) + ξi , i = 1, . . . , n, (31)

with 0 ≤ X1 < X2 < . . . < Xn ≤ 1 being deterministic design points, f : [0, 1] → R


is nondecreasing and the (unobserved) errors ξ1 , . . . , ξn are i.i.d. N (0, σ 2 ). In what
follows, we will also use the vector notation, writing the model as

y = f + ξ,

where y = (Y1 , . . . , Yn )> , f = (f (X1 ), . . . , f (Xn ))> and ξ = (ξ1 , . . . , ξn )> . The goal of
this section is to find the (optimal) lower bound on the rate of convergence for any
estimator of f based on the loss

1X
n
1
2
d (f , g) := kf − gk22 = (f (Xi ) − g(Xi ))2 (32)
n n i=1

where f, g are real valued functions defined on [0, 1].

For every V > 0 we define by MV the class of nondecreasing functions f : [0, 1] → R


such that f (1) − f (0) ≤ V .
Theorem 4.18. For any V > 0, there exists a constant cV > 0, only depending on
σ 2 and V , such that for any estimator fˆn , and for all n ≥ n0 ∈ N,

sup Ef [d2 (f̂n , f )] ≥ cV n−2/3 ,


f ∈MV

where f̂n := (fˆn (X1 ), . . . , fˆn (Xn ))> .

Proof. We will use Assouad’s lemma to prove the desired result. Fix an integer
1 ≤ k ≤ n (be chosen later) and let rn := bn/kc, where bxc denotes the largest
integer smaller than or equal to x. Let us define f ∈ Rn as
(
V (j−1)
rn
, if (j − 1)k < i ≤ jk;
fi = (rn −1)
V rn , if rn k < i ≤ n.

45
Take f to be any nondecreasing function on [0, 1] such that f (Xi ) = fi , for i = 1, . . . , n.
Also, it can be assumed that f ∈ MV . Let Θ = {0, 1}rn and let ψ(θ) ∈ Rn , for θ ∈ Θ,
be defined as:
V X
rn
ψ(θ)i = fi + (2θj − 1)I{(j − 1)k < i ≤ jk}. (33)
2rn j=1

Note that ψ(θ) induces a nondecreasing function that belongs to MV .

For θ, θ0 ∈ Θ, we have
1X X
rn

d2 (ψ(θ), ψ(θ0 )) = [ψ(θ)i − ψ(θ0 )i ]2


n j=1
(j−1)k<i≤jk

1 X
rn 2
V 2k
0 2V
= k|θj − θj | 2 = 2 H(θ, θ0 ).
n j=1
rn rn n

Therefore, this implies that for θ, θ0 ∈ Θ,


d2 (ψ(θ), ψ(θ0 )) V 2k
min = .
0
H(θ,θ )≥1 H(θ, θ0 ) rn2 n
Further, by Pinsker’s inequality (see Lemma 4.7), and using the fact that the Kullback-
Leibler divergence K(Pθ , Pθ0 ) has a simple expression in terms of d2 (ψ(θ), ψ(θ0 )) [Show
this (Exercise (HW2))]:
1 n 2 V 2k
2
V (Pθ , P ) ≤ K(Pθ , Pθ ) = 2 d (ψ(θ), ψ(θ )) = 2 2 H(θ, θ0 ).
θ0 0
0
2 4σ 4σ rn
2/3 R
Let k := bn2/3 Vσ c. As, min(pθ , pθ0 )dν = 1 − V (Pθ , Pθ0 ),

V k
min kPθ ∧ Pθ0 k ≥ 1 − ≥ c > 0,
H(θ,θ0 )=1 2σrn
for c > 0 and n sufficiently large (in fact c can be taken to be close to 1/2). Therefore,
using Assouad’s lemma, we get the following lower bound:
V 2 k rn
inf sup Ef [d2 (f̂n , f )] ≥ 2
c ≥ cV n−2/3 ,
fˆn θ∈MV rn n 8
where cV is a constant that depends only on σ and V .

4.4 A general reduction scheme

We can consider a more general framework where the goal is to find lower bounds of
the following form:
h i
lim inf inf sup Eθ w(ψ −1 d(θ̂n , θ)) ≥ c > 0,
n→∞ θ̂n θ∈Θ

46
where w : [0, ∞) → [0, ∞) is nondecreasing, w(0) = 0 and w 6= 0 (e.g., w(u) =
up , p > 0). A general scheme for obtaining lower bounds is based on the following
three remarks:

(a) Reduction to bounds in probability. For any A > 0 satisfying w(A) > 0 we have
h i h i
Eθ w(ψn−1 d(θ̂n , θ)) ≥ w(A) Pθ ψn−1 d(θ̂n , θ) ≥ A . (34)

We will usually take s ≡ sn = Aψn . Therefore, instead of searching for a lower


bound on the minimax risk R∗n , it is sufficient to find a lower bound on the
minimax probabilities of the form
 
inf sup Pθ d(θ̂n , θ) ≥ s
θ̂n θ∈Θ

where s ≡ sn = Aψn .

(b) Reduction to a finite number of hypotheses. It is clear that


   
inf sup Pθ d(θ̂n , θ) ≥ s ≥ inf max Pθ d(θ̂n , θ) ≥ s (35)
θ̂n θ∈Θ θ̂n θ∈{θ0 ,...,θM }

for any finite set {θ0 , . . . , θM } contained in Θ. In the examples we have already
seen that the finite set {θ0 , . . . , θM } has to be chosen appropriately. We call the
M + 1 elements θ0 , . . . , θM as hypotheses. We will call a test any A-measurable
function Ψ : X → {0, 1, . . . , M }.

(c) Choice of 2s-separated hypotheses. If

d(θj , θk ) ≥ 2s, k 6= j, (36)

then for any estimator θ̂n ,


 
Pθ d(θ̂n , θ) ≥ s ≥ Pθ (Ψ∗ 6= j) , j = 0, 1, . . . , M,

where Ψ∗ : X → {0, 1, . . . , M } is the minimum distance test defined by

Ψ∗ = argmin d(θ̂n , θk ).
0≤k≤M

Therefore,
 
inf max Pθ d(θ̂n , θ) ≥ s ≥ inf max Pj (Ψ 6= j) =: pe,M , (37)
θ̂n θ∈{θ0 ,...,θM } Ψ 0≤j≤M

where Pj ≡ Pθj and inf Ψ denotes the infimum over all tests.

47
Thus, in order to obtain lower bounds it is sufficient to check that

pe,M ≥ c0 ,

where the hypotheses θj satisfy (36) with s = Aψn and where the constant c0 > 0 is
independent of n. The quantity pe,M is called the minimum probability of error for
the problem of testing M + 1 hypotheses θ0 , θ1 , . . . , θM .
Remark 4.1. Let P0 , P1 , . . . , PM be probability measures on a measurable space
(X , A). For a test Ψ : X → {0, 1, . . . , M }, define the average probability of error and
the minimum average probability of error by

1 X
M
p̄e,M (Ψ) := Pj (Ψ 6= j), and p̄e,M := inf p̄e,M (Ψ).
M + 1 j=0 Ψ

Note that as
pe,M ≥ p̄e,M ,
we can then use tools (from multiple hypotheses testing) to lower bound p̄e,M .
Example 4.19. Let Θ = [0, 1]. Consider data X1 , . . . , Xn i.i.d. Bernoulli(θ), where
θ ∈ Θ. Thus, here Pθ is the joint distribution of X = (X1 , . . . , Xn ). The goal is to find
the minimax lower bound for the estimation of θ under the loss d(θ̂n , θ) := |θ̂n − θ|.
We want to show that there exists c > 0 such that
h i
lim inf inf sup Eθ nd2 (θ̂n , θ) ≥ c > 0.
n→∞ θ̂n θ∈Θ

Consider M = 1 and let θ0 = 21 − s and θ1 = 12 + s, where s ∈ [0, 1/4]. Using


Lemma 4.16 we can show that
 
2
inf max Pθ d (θ̂n , θ) ≥ s ≥ pe,M ≥ p̄e,M ≥ 1 − V (P0 , P1 ).
θ̂n θ∈{θ0 ,θ1 }

We can bound V (P0 , P1 ) using Pinsker’s inequality (see Lemma 4.7) and then use
Property (3) of the Kullback divergence to show that
 
2 1 1 + 2s
V (P0 , P1 ) ≤ K(P0 , P1 ) ≤ nK(Ber(θ0 ), Ber(θ1 )) = 2s log .
2 1 − 2s

Using the fact that x log 1+x
1−x
≤ 3x2 for x ∈ [0, 21 ], we can now show the desired
1
result for c = 48 .

48
Figure 3: Graphs of H and g with M = 10.

0.7
0.6

2.0
0.5

1.5
0.4
H(x)

g(x)
1.0
0.3
0.2

0.5
0.1

0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
x x

4.5 Fano’s lemma

Lemma 4.20 (Fano’s lemma). Let P0 , P1 , . . . , PM be probability measures on a mea-


surable space (X , A), M ≥ 1. Then, p̄e,M ≤ M/(M + 1), and

1 X
M
g(p̄e,M ) ≥ log(1 + M ) − K(Pj , P ), (38)
M + 1 j=0

where
1 X
M
P = Pj ,
M + 1 j=0
and, for x ∈ [0, 1],

g(x) = x log M + H(x), H(x) = −x log x − (1 − x) log(1 − x).

Proof. We have
"M # "M #
1 X dPj X
p̄e,M (Ψ) = E I(Aj ) = EP bj p j (39)
M + 1 P j=0 dP j=0

where
dPj
pj := (M + 1)−1
, Aj := {Ψ 6= j}, bj = I(Aj )
dP
and EP denotes the expectation with respect to P . The random variables bj and pj
satisfy P -a.s. the following conditions:
X
M X
M
bj = M, bj ∈ {0, 1}, and pj = 1, pj ≥ 0.
j=0 j=0

49
Then we have that, P -a.s.,
X
M X
bj p j = pj , (40)
j=0 j6=j0

where j0 is a random number, 0 ≤ j0 ≤ M . We now apply the following lemma


(see [14, Lemma 2.11] for a proof).
Lemma 4.21. For all j0 ∈ {0, 1, . . . , M } and all real numbers p0 , p1 , . . . , pM , such
P
that M
j=0 pj = 1, pj ≥ 0, we have
!
X X
M
g pj ≥− pj log pj , (41)
j6=j0 j=0

where 0 log 0 := 0.

Note that the function g is concave on 0 ≤ x ≤ 1. Using (39), Jensen’s inequality,


and (40) and (41), we obtain that, for any test Ψ,
"M #! " !#
X X
M
g(p̄e,M (Ψ)) = g EP bj p j ≥ EP g bj p j
j=0 j=0
" #
X
M
≥ EP − pj log pj
j=0

1 X
M
= log(1 + M ) − K(Pj , P ).
M + 1 j=0

Since there exists a sequence of tests {Ψk }k≥1 such that p̄e,M (Ψk ) → p̄e,M as k → ∞,
we obtain, by the continuity of g,

1 X
M
g(p̄e,M ) = lim g(p̄e,M (Ψk )) ≥ log(1 + M ) − K(Pj , P ).
k→∞ M + 1 j=0

It remains to show that p̄e,M ≤ M/(M + 1). For this purpose, we define a degenerate
test Ψ∗ ≡ 1, and observe that

1 X
M
∗ M
inf p̄e,M (Ψ) ≤ p̄e,M (Ψ ) = 6 1) =
Pj (j = .
Ψ M + 1 j=0 M +1

50
Using Fano’s lemma we can bound from below the minimax probability of error pe,M
in the following way:

pe,M = inf max Pj (Ψ 6= j) ≥ inf p̄e,M


Ψ 0≤j≤M Ψ
!
1 X
M
≥ g −1 log(M + 1) − K(Pj , P ) , (42)
M +1 j=0

where g −1 (t) := 0 for t < 0, and for 0 < t < log(M + 1), g −1 (t) is a solution of the
equation g(x) = t with respect to x ∈ [0, M/(M + 1)] — this solution exists as g is
continuous and strictly increasing on [0, M/(M + 1)] and g(0) = 0, g(M/(M + 1)) =
log(M + 1).

The following corollary gives a more workable lower bound on pe,M .


Corollary 4.22. Let P0 , P1 , . . . , PM be probability measures on a measurable space
(X , A), M ≥ 2. Let
1 X
M
I(M ) := K(Pj , P ). (43)
M + 1 j=0
Then,
I(M ) + log 2
pe,M ≥ p̄e,M ≥ 1 − . (44)
log(M + 1)

Proof. As H(x) ≤ log 2 for all x ∈ [0, 1], and g(x) = x log M + H(x), we have,
from (38),

p̄e,M log(M + 1) ≥ p̄e,M log M ≥ log(M + 1) − I(M ) − log 2

which yields the desired result.

Determining I(M ) exactly is usually intractable however and one typically works with
appropriate bounds on I(M ). In fact, (42) is going to be useful if we can show that
log(M + 1) − I(M ) > 0. The following corollary gives a sufficient condition for this
and gives a non-trivial lower bound on pe,M .
Corollary 4.23. Let P0 , P1 , . . . , PM be probability measures on a measurable space
(X , A), M ≥ 2. If
1 X
M
K(Pj , P0 ) ≤ α log(M + 1) (45)
M + 1 j=0
with 0 < α < 1, then
log 2
pe,M ≥ p̄e,M ≥ 1 − − α. (46)
log(M + 1)

51
Proof. We will use the elementary fact (show this; Exercise (HW2)):

1 X 1 X
M M
K(Pj , P0 ) = K(Pj , P ) + K(P , P0 ). (47)
M + 1 j=0 M + 1 j=0

Thus, using the above display, (38) and the fact that K(P , P0 ) ≥ 0, we get

1 X
M
g(p̄e,M ) ≥ log(M + 1) − K(Pj , P ),
M + 1 j=0

1 X
M
≥ log(M + 1) − K(Pj , P0 )
M + 1 j=0
≥ log(M + 1) − α log(M + 1).

A similar calculation as in the proof of Corollary 4.22 now yields the desired result.

4.5.1 Estimation of a regression function under the supremum loss

Consider data (Xi , Yi ), i = 1, . . . , n, from a nonparametric regression model where

Yi = f (Xi ) + ξi , i = 1, . . . , n, (48)

with f : [0, 1] → R, the ξi ’s being i.i.d. N (0, σ 2 ), and the Xi ’s are arbitrary random
variables taking values in [0, 1] such that (X1 , . . . , Xn ) is independent of (ξ1 , . . . , ξn ).
Theorem 4.24. Let β > 0 and L > 0. Consider data from the above model where
f ∈ Σ(β, L). Let
 β/(2β+1)
log n
ψn = .
n
Then,
lim inf inf sup Ef [ψn−2 kTn − f k2∞ ] ≥ c
n→∞ Tn f ∈Σ(β,L)

where inf Tn denotes the infimum over all estimators and where the constant c > 0
depends only on β, L and σ 2 .

Proof. The proof was mostly done in class; also see [14, Theorem 2.11].

4.6 Covering and packing numbers and metric entropy

In Section 4.4 we described a general scheme for proving lower bounds. In step (c)
of the scheme it is important to choose the hypotheses θj ’s in Θ such that they are

52
2s-separated. Further, the choice of the number of such points M depends on how
large the space Θ is. In this section we define a concept that has been successfully
employed in many fields of mathematics to capture the size of the an underlying set
(with a semi-metric). We also give a few examples from parametric models to show
how this concept can be used in conjunction with Fano’s lemma (as discussed in the
last section) to yield useful lower bounds that do not need specification of the exact
θj ’s (the perturbation functions).

Let (Θ, d) be an arbitrary semi-metric space.


Definition 4.25 (Covering number). A δ-cover of the set Θ with respect to the semi-
metric d is a set {θ1 , . . . , θN } ⊂ Θ such that for any point θ ∈ Θ, there exists some
v ∈ {1, . . . , N } such that d(θ, θv ) < δ.

The δ-covering number of Θ is


N (δ, Θ, d) := inf{N ∈ N : ∃ a δ-cover θ1 , . . . , θN of Θ}.
Equivalently, the δ-covering number N (δ, Θ, d) is the minimal number of balls B(x; δ) :=
{y ∈ Θ : d(x, y) < δ} of radius δ needed to cover the set Θ.

A semi-metric space (Θ, d) is said to be totally bounded if the δ-covering number is


finite for every δ > 0.

The metric entropy of the set Θ is the logarithm of its covering number: log N (δ, Θ, d).

We can define a related measure — more useful for constructing our lower bounds —
of size that relates to the number of disjoint balls of radius δ > 0 that can be placed
into the set Θ.
Definition 4.26 (Packing number). A δ-packing of the set Θ with respect to the
semi-metric d is a set {θ1 , . . . , θD } such that for all distinct v, v 0 ∈ {1, . . . , D}, we
have d(θv , θv0 ) ≥ δ.

The δ-packing number of Θ is


D(δ, Θ, d) := inf{D ∈ N : ∃ a δ-packing θ1 , . . . , θD of Θ}.

Equivalently, call a collection of points δ-separated if the distance between each pair
of points is larger than δ. Thus, the packing number D(δ, Θ, d) is the maximum
number of δ-separated points in Θ.

Exercise (HW2): Show that


D(2δ, Θ, d) ≤ N (δ, Θ, d) ≤ D(δ, Θ, d), for every δ > 0.

53
Thus, packing and covering numbers have the same scaling in the radius δ.
Remark 4.2. As shown in the preceding exercise, covering and packing numbers are
closely related, and we can use both in the following. Clearly, they become bigger as
δ → 0.

We can now provide a few more complex examples of packing and covering numbers,
presenting two standard results that will be useful for constructing the packing sets
used in our lower bounds to come.

Our first bound shows that there are (exponentially) large packings of the d-dimensional
hypercube of points that are O(d)-separated in the Hamming metric.
Lemma 4.27 (Varshamov-Gilbert Lemma). Fix k ≥ 1. There exists a subset V of
P
{0, 1}k with |V| ≥ exp(k/8) such that the Hamming distance, H(τ, τ 0 ) := ki=1 I{τi 6=
τi0 } > k/4 for all τ, τ 0 ∈ V with τ 6= τ 0 .

Proof. Consider a maximal subset V of {0, 1}k that satisfies:

H(τ, τ 0 ) ≥ k/4 for all τ, τ 0 ∈ V with τ 6= τ 0 . (49)

The meaning of maximal here is that if one tries to expand V by adding one more
element, then the constraint (49) will be violated. In other words, if we define the
closed ball, B(τ, k/4) := {θ ∈ {0, 1}k : H(θ, τ ) ≤ k/4} for τ ∈ {0, 1}k , then we must
have [
B(τ, k/4) = {0, 1}k .
τ ∈V

This implies that X


|B(τ, k/4)| ≥ 2k . (50)
τ ∈V

Let T1 , . . . , Tk denote i.i.d. Bernoulli random variables with probability of success 1/2.
For every A ⊆ {0, 1}k , we have P ((T1 , . . . , Tk ) ∈ A) = |A|2−k . Therefore, for each
τ ∈ V, we can write
!
  Xk
2−k |B(τ, k/4)| = P (T1 , . . . , Tk ) ∈ B(τ, k/4) = P {Ti 6= τi } ≤ k/4 .
i=1

If Si := {Ti 6= τi }, then it is easy to see that S1 , . . . , Sk are also i.i.d. Bernoulli random

54
variables with probability of success 1/2. Thus,

2−k |B(τ, k/4)| = P (S1 + · · · + Sk ≤ k/4)


= P (S1 + · · · + Sk ≥ 3k/4)
≤ inf exp(−3λk/4) (E exp(λS1 ))k
λ>0

= inf exp(−3λk/4)2−k (1 + eλ )k .
λ>0

Taking λ = log 3, we get

|B(τ, k/4)| ≤ 3−3k/4 4k for every τ ∈ V.

Finally, from (50), we obtain

33k/4 
|V| ≥ k = exp k log(33/4 /2) ≥ exp (k/8) .
2

Given the relationships between packing, covering, and size of the set Θ, we would
expect there to be relationships between volume, packing, and covering numbers.
This is indeed the case, as we now demonstrate for arbitrary norm balls in finite
dimensions.
Lemma 4.28. Let B := {θ ∈ Rd : kθk2 ≤ 1} denote the unit Euclidean ball in Rd .
Then  d  d
1 2
≤ N (δ, B, k · k2 ) ≤ 1 + . (51)
δ δ

As a consequence of Lemma 4.28, we see that for any δ < 1, there is a packing V
of B such that kθ − θ0 k2 ≥ δ for all distinct θ, θ0 ∈ V and |V| ≥ (1/δ)d , because we
know D(δ, B, k · k2 ) ≥ N (δ, B, k · k2 ). In particular, the lemma shows that any norm
ball has a 1/2-packing in its own norm with cardinality at least 2d . We can also
construct exponentially large packings of arbitrary norm-balls (in finite dimensions)
where points are of constant distance apart.

Smoothly parameterized functions: Let F be a parameterized class of functions,


i.e.,
F := {fθ : θ ∈ Θ}.
Let k · kΘ be a norm on Θ, and let k · kF be a norm on F. Suppose that the mapping
θ 7→ fθ is L-Lipschitz, i.e.,

kfθ − fθ0 kF ≤ Lkθ − θ0 kΘ .

55
Lemma 4.29 (Exercise (HW2)). N (δ, F, k · kF ) ≤ N (δ/L, Θ, k · kΘ ) for all δ > 0.

A Lipschitz parameterization allows us to translates a cover of the parameter space


Θ into a cover of the function space F. For example, if F is smoothly parameterized
by (compact set of) d parameters, then N (δ, F, k · kF ) = O(δ −d ).

Exercise (HW2): Let F be the set of L-Lipschitz functions mapping from [0, 1] to
[0, 1]. Then in the supremum norm kf k∞ := supx∈[0,1] |f (x)|,

log N (δ, F, k · k∞ )  L/δ.

Hint: (Proof idea) Form an δ grid of the y-axis, and an δ/L grid of the x-axis, and
consider all functions that are piecewise linear on this grid, where all pieces have
slopes +L or −L. There are 1/δ starting points, and for each starting point there are
2L/δ slope choices. Show that this set is an O(δ) packing and an O(δ) cover.

4.6.1 Two examples

Example 4.30 (Normal mean estimation). Consider the d-dimensional normal lo-
cation family Nd := {N (θ, σ 2 Id ) : θ ∈ Rd }, where σ 2 > 0 and d ≥ 2. We wish to
estimate the mean θ in the squared error loss, i.e., d2 (θ̂n , θ) = kθ̂n − θk22 , given n
i.i.d. observations X1 , . . . , Xn from a member in Nd with mean θ. Let Pθ denote the
joint distribution of the data.

Let V be a 1/2-packing of the unit k·k2 -ball with cardinality at least 2d , as guaranteed
by Lemma 4.28. Now we construct our local packing. Fix δ > 0, and for each v ∈ V,
set θv = δv ∈ Rd . Then we have
δ
kθv − θv0 k2 = δkv − v 0 k2 ≥ =: 2s
2
for each distinct pair v, v 0 ∈ V, and moreover, we note that kθv − θv0 k2 ≤ 2δ for such
pairs as well. Thus, {θv }v∈V is a 2s-separated set with cardinality at least 2d . Let
θv0 , θv1 , . . . , θvM be an enumeration of the 2s-separated points, and we take Pj ≡ Pθvj ,
for j = 0, 1, . . . , M . Note that for j ∈ {0, . . . , M } such that Pj ≡ Pv , for some v ∈ V,
n 2 2nδ 2
K(Pj , P0 ) = kθv − θ v k
0 2 ≤ .
2σ 2 σ2
Therefore, taking δ 2 := dσ 2 log 2/(8n),

1 X
M
2nδ 2
K(Pj , P0 ) ≤ 2 · d log 2 ≤ α log(M + 1)
M + 1 j=0 σ d log 2

56
where α := 1/4. This shows that (45) holds. Hence, by (34), (35), (37) and Corol-
lary 4.23 we have

inf sup Eθ [d2 (θ̂n , θ)] ≥ inf sup s2 Pθ [d(θ̂n , θ) ≥ s]


θ̂n θ∈Θ θ̂n θ∈Θ
 
2 log 2
≥ s 1− −α
log(M + 1)
 
1 dσ 2 log 2 1 1
≥ 2· 1− − .
4 8n d 4

As d ≥ 2, the above inequality implies the minimax lower bound

1 dσ 2 log 2 dσ 2
inf sup Eθ [d2 (θ̂n , θ)] ≥ · =c ,
θ̂n θ∈Θ 64 8n n

where c > 0. While the constant c is not sharp, we do obtain the right scaling in d,
n and the variance σ 2 . The sample mean attains the same risk.
Example 4.31 (Linear regression). In this example, we show how local packings can
give (up to some constant factors) sharp minimax rates for standard linear regression
problems. In particular, for fixed matrix X ∈ Rn×d , we observe

Y = Xθ + ,

where  ∈ Rn consists of independent random variables i with variance bounded by


Var(i ) ≤ σ 2 , and θ ∈ Rd is allowed to vary over Rd . For the purposes of our lower
bound, we may assume that  ∼ N (0, σ 2 In ). Let P := {N (Xθ, σ 2 In ) : θ ∈ Rd } denote
the family of such normally distributed linear regression problems, and assume for
simplicity that d ≥ 32.

In this case, we use the Varshamov-Gilbert bound (Lemma 4.27) to construct a local
packing and attain minimax rates. Indeed, let V be a packing of {0, 1}d such that
kv − v 0 k1 ≥ d/4 for distinct elements of V, and let |V| ≥ exp(d/8) as guaranteed by
the Varshamov-Gilbert bound. For fixed δ > 0, if we set θv = δv, then we have the
packing guarantee for distinct elements v, v 0 that

kθv − θv0 k22 = δ 2 kv − v 0 k22 = δ 2 kv − v 0 k1 ≥ dδ 2 /4.

Moreover, we have the upper bound

1 2 δ2 > 2 dδ 2
K(Pθv , Pθv0 ) = 2
kX(θv − θ v 0 )k ≤
2 2
Λ max (X X)kθv − θv 0k ≤
2 2
Λmax (X > X),
2σ 2σ 2σ
where Λmax (X > X) denotes the maximum singular value of X > X.

57
σ2
Consequently, taking δ 2 := 16Λmax (X > X)
, we obtain that
 
dσ 2 8 1 dσ 2
inf sup Eθ [d (θ̂n , θ)] ≥
2
1 − log 2 − ≥c ,
θ̂n θ∈Θ (16)2 nΛmax (X > X/n) d 4 nΛmax (X > X/n)

for some c > 0 if d > (32/3) log 2. Thus, the convergence rate is (roughly) σ 2 d/n
after rescaling the singular values of X > X by n−1/2 . This bound is sharp in terms
of the dimension d, dependence on n, and the variance σ 2 , but it does not fully
capture the dependence on X > X, as it depends only on the maximum singular value.
An exact calculation can show that the minimax value of the problem is exactly
σ 2 tr((X > X)−1 ).

4.7 Global Fano method: Bounding I(M ) based on metric


entropy

Observe that, from (47), it follows that

1 X M
I(M ) ≤ inf K(Pj , Q). (52)
M + 1 Q j=0

Different choices of Q in (52) yield different upper bounds on I(M ). One gets, for
example,
PM PM
j=0 K(Pj , Pk ) j,k=0 K(Pj , Pk )
I(M ) ≤ min ≤ ≤ max K(Pj , Pk ). (53)
k=0,1...,M M +1 (M + 1)2 j,k∈{0,1...,M }

These bounds are very frequently used in conjunction with Fano’s inequality; see e.g.,
the two examples in Section 4.6.1. The last bound maxj,k∈{0,1...,M } K(Pj , Pk ) is called
the Kullback-Leibler diameter of {Pj }M
j=0 .

We will see that quite often (in nonparametric problems) the bounds in (53) are, in
general, quite inaccurate and describe an improved bounds due to [17].

Let P be a collection of distributions. In analogy with Definition 4.25, we say that


the collection of distributions {Qi }N
i=1 form an -cover of P in KL-divergence if for
all P ∈ P, there exists some i such that K(P, Qi ) ≤ 2 . With this, we may define the
KL-covering number of the set P as
 
N (, P, K) := inf N ∈ N : ∃ Qi , i = 1, . . . , N, such that sup min K(P, Qi ) ≤  ,
2
P ∈P i

where N (, P, K) = +∞ if no such cover exists.

58
Let P0 , P1 , . . . , PM be probability measures on a measurable space (X , A). Recall
that
1 X
M
I(M ) := K(Pj , P ).
M + 1 j=0

Let P be a collection of distributions such that Pj ∈ P, for all j = 0, 1, . . . , M .


Proposition 4.32. I(M ) ≤ inf >0 {2 + log N (, P, K)}.

Proof. By carefully choosing the distribution Q in the upper bound in (52) above,
we will obtain the desired. Now, assume that the distributions {Qi }N i=1 , form an
2
-cover of the family P, meaning that mini K(P, Qi ) ≤  , for all P ∈ P. Let pj
and qi denote the densities of Pj and Qi with respect to some fixed base measure ν
on X (the choice of based measure does not matter). Then defining the distribution
P
Q := (1/N ) N i=1 Qi (with density q with respect to ν), we obtain for any j,

Z   Z !
pj pj
K(Pj , Q) = log pj dν = log P pj dν
q N −1 N i=1 q i
Z ! Z  
pj pj
= log N + log PN pj dν ≤ log N + log pj dν
i=1 q i max i q i
Z  
pj
≤ log N + min log pj dν = log N + min K(Pj , Qi ).
i qi i

By our assumption that the Qi ’s form a cover which gives the desired result, as  > 0
was arbitrary (as was our choice of the cover).

4.7.1 A general scheme for proving minimax bounds using global packings

There is now a four step process to proving minimax lower bounds using the global
Fano method. Our starting point is to recall the Fano minimax lower bound in (46)
of Corollary 4.22 and (37), which begins with the construction of a set of points
{θ(Pj )}M
j=0 that form a 2s-packing of a set Θ in the semi-metric d. With this in mind,
we perform the following four steps:

(i) Bound the packing entropy. Give a lower bound on the packing number of the
set Θ with 2s-separation (call this lower bound D(s) ≡ M + 1).

(ii) Bound the metric entropy. Give an upper bound on the KL-metric entropy
of the class P of distributions containing all the distributions {Pj }M
j=0 , i.e., an
upper bound on log N (, P, K).

59
(iii) Find the critical radius. Using Proposition 4.32 we can now balance I(M ) and
the packing entropy log D(s). To that end, we choose n and sn > 0 at the
critical radius, defined as follows: choose any n such that

2n ≥ log N (n , P, K), (54)

and choose the largest sn > 0 such that

log D(sn ) ≥ 42n + 2 log 2. (55)

Then,

log D(sn ) ≥ 2 log N (n , P, K) + 22n + 2 log 2 ≥ 2(I(M ) + log 2).

(iv) Apply the Fano minimax bound (46). Having chosen sn and n as above, we
immediately obtain that
I(M ) + log 2 1 1
pe,M ≥ 1 − ≥1− = ,
log D(sn ) 2 2
and thus, we obtain
h i 1
inf sup Eθ w(s−1
n d( θ̂n , θ)) ≥ w(sn ).
θ̂n θ∈Θ 2

4.7.2 An example

Example 4.33 (Lipschitz regression). Consider data (Xi , Yi ), i = 1, . . . , n, from a


nonparametric regression model 31 with Xi = i/n, f : [0, 1] → [0, 1] is L-Lipschitz
and the (unobserved) errors ξ1 , . . . , ξn are i.i.d. N (0, σ 2 ). The goal of this section is
to find the (optimal) lower bound on the rate of convergence for any estimator of f
based on the discrete L2 -loss d(·, ·) defined in (32). Let

F := {f : [0, 1] → [0, 1]| f is L-Lipschitz}.

Result: Note that for δ > 0,


L L
c1 ≤ log D(δ, F, k · k∞ ) ≤ c2 ,
δ δ
where c2 ≥ c1 > 0.
p
Exercise (HW2): Show that log(, P, K) ≤ c2 2σn2 L−1 . This completes step (ii).
 √ 1/3
Further show that (54) holds for n ≥ c√
2L n
2σ 2
and (55) holds for sn = c(σ 2 L/n)1/3 ,
for some c > 0. Hence, show that the lower bound on the minimax rate is (σ 2 L/n)1/3
which involves the right scaling in n, L and the variance σ 2 .

60
5 Reproducing kernel Hilbert spaces

5.1 Hilbert spaces

A vector space in Rn can be spanned by a finite set of vectors. Classes of functions


may also form vector spaces over R, but these spaces are rarely spanned by a finite set
of functions. In this chapter we study a special class of functions that form a Hilbert
space (a generalization of the notion of Euclidean space) and admit expansions like
that as in a finite dimensional vector space.
Definition 5.1 (Hilbert space). Let H be a (real) vector space together with a
function h·, ·i : H × H → R (the inner product) for which

hx, yi = hy, xi, ∀ x, y ∈ H (symmetric),


hx, ay + bzi = ahx, yi + bhx, zi, ∀ x, y, z ∈ H, α, β ∈ R (bilinear),
hx, xi ≥ 0, x ∈ H, with equality if and only if x = 0.

Suppose that the norm in H is defined by


p
kxk := hx, xi

and H is complete6 in the metric d(x, y) := kx − yk. Then H forms a Hilbert space
equipped with the inner product h·, ·i.
P
Example 5.2 (Euclidean space). Let H = Rm and hx, yi := m i=1 xi yi (where x =
(x1 , . . . , xm ) ∈ R ); or more generally hx, yi = x Ay where A is a symmetric positive
m >

definite matrix.
Example 5.3 (Euclidean matrices). Let H = Rm×m be the set of all m×m matrices.
Define hx, yi := tr(xy > ). Then h·, ·i defines a Hilbert space over m × m matrices.
Example 5.4 (L2 space). Let (Ω, A, µ) be a measure space and let L2 (Ω, A, µ) be
the set (of equivalence classes) of all square integrable functions with
Z
hf, gi := f g dµ.

Example 5.5 (Sobolev space). The Sobolev space Wm [0, 1] is the collection of all
functions f : [0, 1] → R with m − 1 continuous derivatives, f (m−1) absolutely contin-
uous, and kf (m) k < ∞. With an inner product h·, ·i defined by
X
m−1 Z 1
(k) (k)
hf, gi := f (0)g (0) + f (m) (x)g (m) (x)dx, f, g ∈ Wm [0, 1], (56)
k=0 0

6
A metric space H is said to be complete if every Cauchy sequence in H has a limit in H.

61
Wm [0, 1] is a Hilbert space.

Here are some properties of any Hilbert space H with inner product h·, ·i:

• The Cauchy-Schwarz inequality holds:

|hx, yi| ≤ kxkkyk, ∀ x, y ∈ H.

• The Parallelogram laws assert that

kx+yk2 +kx−yk2 = 2(kxk2 +kyk2 ) and kx+yk2 −kx−yk2 = 4hx, yi ∀ x, y ∈ H.

• (Linear functional) A function ϕ : H → R is said to be a linear functional if


ϕ(αx + βy) = αϕ(x) + βϕ(y) whenever x, y ∈ H and α, β ∈ R. For example,
for a fixed y ∈ H,
ϕy (x) := hx, yi, ∀ x ∈ H, (57)
defines a continuous linear functional, a linear functional that is continuous with
respect to the metric induced by the inner product.

• (Dual space) The dual space H∗ (of H) is the space of all continuous linear
functions from H into R. It carries a natural norm7 , defined by

kϕkH∗ = sup |ϕ(x)|, ϕ ∈ H∗ .


kxk=1,x∈H

This norm satisfies the parallelogram laws.

Result: The Riesz representation theorem gives a convenient description of the


dual. It states that any continuous linear functional can be represented in the
form (57) for some y ∈ H depending on the linear functional.
7
Exercise (HW3): Let X and Y be normed vector spaces over R. A function T : X → Y is called
a linear operator if

T (cx1 + x2 ) = cT (x1 ) + T (x2 ), ∀ x1 , x2 ∈ X , c ∈ R.

The operator norm (or spectral norm) of T is defined as

kT k := sup{kT (x)k : kxk ≤ 1},

and T is called bounded if kT k < ∞.


(a) Show that a bounded operator T is continuous: If kxn − xk → 0, then kT (xn ) − T (x)k → 0.
(b) Show that a continuous linear operator T is bounded.
(c) Let X = Rm and Y = Rn , with the usual Euclidean norms. Let A be an n × m matrix, and
define a linear operator T by T (x) = Ax. Relate the operator norm kT k to the eigenvalues of
A> A.

62
Thus to every element ϕ of the dual H∗ there exists one and only one uϕ ∈ H
such that hx, uϕ i = ϕ(x), for all x ∈ H. The inner product on the dual space
H∗ satisfies
hϕ, ψiH∗ := huψ , uϕ iH .
So the dual space is also an inner product space. The dual space is also complete,
and so it is a Hilbert space in its own right.

• (Convex sets) Recall that a subset H0 ⊂ H is called a linear subspace if it is


closed under addition and scalar multiplication; i.e., αx + βy ∈ H0 whenever
x, y ∈ H0 and α, β ∈ R.

A subset C ⊂ H is said to be convex if it contains the line joining any two of


its elements, i.e., αx + (1 − α)y ∈ C whenever x, y ∈ C and 0 ≤ α ≤ 1.

A set C ⊂ H is said to be a cone if αx ∈ C whenever x ∈ C and α ≥ 0. Thus, C


is a convex cone if αx+βy ∈ C whenever x, y ∈ C and 0 ≤ α, β < ∞. Any linear
subspace is, by definition, also a convex cone. Any ball, B = {x ∈ H : kxk ≤ c},
c > 0, is a convex set, but not a convex cone.

• (Projection theorem) If C ⊂ H is a closed convex set and z ∈ H, then there is


a unique x ∈ C for which

kx − zk = inf ky − zk.
z∈C

In fact, x ∈ C satisfies the condition

hz − x, y − xi ≤ 0, ∀ y ∈ C. (58)

The element x ∈ C is called the projection of z onto C and denoted by ΠC (z).


Prove the projection theorem. (Exercise (HW3))

In particular, if C is a convex cone, setting y = x/2 and y = 2x in (58) shows


that hz − x, xi = 0. Thus, x is the unique element of C for which

hz − x, xi = 0 and hz − x, yi ≤ 0 ∀ y ∈ C.

If C is a linear subspace, then z − x is orthogonal to C, i.e.,

hz − x, yi = 0 ∀ y ∈ C.

• (Orthogonal complement) Suppose that H0 ⊂ H. The orthogonal complement


of H0 is
H0⊥ := {x ∈ H : hx, yi = 0, ∀ y ∈ H0 }.

63
Result: The orthogonal complement of a subset of a Hilbert space is a closed
linear subspace.

The projection theorem states that if C ⊂ H is a closed subspace, then any


z ∈ C may be uniquely represented as z = x + y, where x ∈ C is the best
approximation to z, and y ∈ C ⊥ .

Result: If C ⊂ H is a closed subspace, then H = C ⊕ C ⊥ , where

A ⊕ B := {x + y : x ∈ A, y ∈ B}.

Thus, every closed subspace C of H has a closed complementary subspace C ⊥ .

• (Orthonormal basis) A collection {et : t ∈ T } ⊂ H (where T is any index set)


is said to be orthonormal if es ⊥ et (i.e., hes , et i = 0) for all s 6= t and ket k = 1,
for all t ∈ T .

As in the finite-dimensional case, we would like to represent elements in our


Hilbert space as linear combinations of elements in an orthonormal collection,
but extra care is necessary because some infinite linear combinations may not
make sense.

The linear span of S ⊂ H, denoted span(S), is the collection of all finite linear
combinations α1 x1 + · · · + αn xn with α1 , . . . , αn ∈ R and x1 , . . . , xn ∈ S. The
closure of this set is denoted by span(S).

An orthonormal collection {et , t ∈ T }, is called an orthonormal basis for the


Hilbert space H if het , xi =
6 0 for some t ∈ T , for every nonzero x ∈ H.

Result: Every Hilbert space has an orthonormal basis.

When H is separable8 , a basis can be found by applying the Gram-Schmidt


algorithm to a countable dense set, and in this case the basis will be countable.

Result: If {en }n≥1 , is an orthonormal basis of H, then each x ∈ H may be


P
written as x = ∞ k=1 hx, ek iek . Show this. (Exercise (HW3))
8
A topological space is called separable if it contains a countable, dense subset; i.e., there exists
a sequence {xn }∞n=1 of elements of the space such that every nonempty open subset of the space
contains at least one element of the sequence.

64
5.2 Reproducing Kernel Hilbert Spaces

Definition 5.6 (Reproducing kernel Hilbert space). Let X be an arbitrary set and
H a Hilbert space of real-valued functions on X . The evaluation functional over the
Hilbert space of functions H is a linear functional that evaluates each function at a
point x ∈ X ,
Lx : f 7→ f (x) ∀f ∈ H.

We say that H is a reproducing kernel Hilbert space (RKHS) if Lx is continuous at


any f in H, for all x ∈ X (equivalently, if for all x ∈ X , Lx is a bounded9 operator
on H).

Thus, a RKHS is a Hilbert space of functions in which point evaluation is a continuous


linear functional. Roughly speaking, this means that if two functions f and g in the
RKHS are close in norm, i.e., kf − gk is small, then f and g are also pointwise close,
i.e., |f (x) − g(x)| is small for all x ∈ X .

The Riesz representation theorem implies that for all x ∈ X there exists a unique
element Kx of H with the reproducing property:

f (x) = Lx (f ) = hf, Kx i ∀ f ∈ H. (59)

Since Ky is itself a function in H we have that for each y ∈ X ,

Ky (x) = hKy , Kx i.

This allows us to define the reproducing kernel of H as a function K : X × X → R by

K(x, y) = hKx , Ky i.

From this definition it is easy to see (Exercise (HW3)) that K : X × X → R is both


symmetric and positive definite, i.e.,
X
n
αi αj K(xi , xj ) ≥ 0, (60)
i,j=1

for any n ∈ N, x1 , . . . , xn ∈ X , and α1 , . . . , αn ∈ R. Thus, the “Gram Matrix”


K = ((Kij ))n×n defined by Kij = k(xi , xj ) is positive semi-definite.
9
A functional λ : H → R is bounded if there is a finite real constant B so that, for all f ∈ H,
|λ(f )| ≤ Bkf kH . It can be shown that the continuity of the functional λ is equivalent to boundedness.

65
Example 5.7 (Linear kernel). Let X = Rd and let K(x, y) := x> y, for any x, y ∈ Rd ,
be the usual inner product in Rd . Then the linear kernel K is symmetric and positive
definite.
Example 5.8 (RKHS of the linear kernel). Let X = Rd . Consider the space H of
all linear forms on Rd : H := {f (x) = w> x : w ∈ Rd }. Define the inner product by
hf, giH = v > w for f (x) = v > x and g(x) = w> x. Then, the linear kernel K(x, y) :=
x> y is a reproducing kernel for H.
Example 5.9 (Gaussian and Laplace kernels). When X = Rd , the Gaussian and
Laplace kernels are defined as
   
kx − yk22 kx − yk2
K(x, y) := exp − , K(x, y) := exp − ,
2σ 2 2σ 2
respectively, where x, y ∈ Rd , σ 2 > 0. Both kernels are positive definite, but the proof
of this fact is more involved than for the linear kernel.

The Moore-Aronszajn theorem (see below) is a sort of converse to (60): if a function


K satisfies these conditions (symmetric and positive definite) then there is a Hilbert
space of functions on X for which it is a reproducing kernel.
Proposition 5.10 (Moore-Aronszajn theorem). Suppose that K is a symmetric,
positive definite kernel on a set X . Then there is a unique Hilbert space H of functions
on X for which K is a reproducing kernel.

Proof. The complete proof of this result is rather long. We give a sketch of the proof
here. For all x in X , define Kx := K(x, ·). Let H0 be the linear span of {Kx : x ∈ X }.
Define an inner product on H0 by
* n +
X X
m Xm Xn
βj Kyj , αi Kxi := αi βj K(yj , xi ),
j=1 i=1 H0 i=1 j=1

i=1 , {βj }j=1 ⊂ R and {xi }i=1 , {yj }j=1 ⊂ X . The symmetry of this inner
where {αi }m n m n

product follows from the symmetry of K and the non-degeneracy follows from the
fact that K is positive definite. We can show that

1. the point evaluation functionals Lx are continuous on H0 ,

2. any Cauchy sequence fn in H0 which converges pointwise to 0 also converges in


in H0 -norm to 0.

Let H be the completion of H0 with respect to this inner product. We define an inner
product in H as: suppose that {fn }n≥1 and {gn }n≥1 are sequences in H0 converging

66
to f and g respectively. Then {hfn , gn iH0 }n≥1 is convergent and its limit depends
only on f and g (see [2, Lemma 5] for a proof of the above). Thus we define

hf, giH := lim hfn , gn iH0 .


n→∞

Next we have to show that H is indeed a Hilbert space with the inner product h·, ·iH
(see [2, Lemma 6] for a proof of this; we will have to further show that H is complete).
Further we can show that H0 is dense in H (see see [2, Lemma 7] for a proof of this)
and that the point evaluation map is continuous on H (see see [2, Lemma 8] for a
proof of this).

Now we can check the reproducing property (59), i.e., hf, Kx iH = f (x), for all f ∈ H,
for all x ∈ X . To prove uniqueness, let G be another Hilbert space of functions for
which K is a reproducing kernel. For any x and y in X , (59) implies that

hKx , Ky iH = K(x, y) = hKx , Ky iG .

By linearity, h·, ·iH = h·, ·iG on the span of {Kx : x ∈ X }. Then G = H by the
uniqueness of the completion. See
http://www.gatsby.ucl.ac.uk/∼gretton/coursefiles/RKHS Notes1.pdf for a more
detailed discussion on this proof.

This proposition allows one to construct reproducing kernels on complicated spaces


X (such as graphs, images) only by checking that the proposed kernel is positive
definite and without explicitly defining the Hilbert space H.

5.2.1 The Representer theorem

The representer theorem [7] shows that solutions of a large class of optimization
problems can be expressed as kernel expansions over the sample points. We present
a slightly more general version of the theorem with a simple proof [10].

Let X be an arbitrary set and let HK be a RKHS of real valued functions on X with
reproducing kernel K(·, ·). Let {(Yi , Xi ) : i = 1, . . . , n} be given data (the “training
set”) with Xi ∈ X (the “attribute vector”), and Yi ∈ Y being the “response”.
Theorem 5.11. Denote by Ω : [0, ∞) → R a strictly increasing function. Let
` : (X × Y × Y)n → R ∪ {∞} be an arbitrary loss function. Then each minimizer
f ∈ HK of the regularized risk functional

`({(Xi , Yi , f (Xi )}ni=1 ) + Ω(kf k2HK ) (61)

67
admits a representation of the form
X
n
fˆ(x) = αi K(Xi , x), ∀x ∈ X, where α1 , . . . , αn ∈ R. (62)
i=1

Proof. Take any f ∈ HK . As HK is a Hilbert space there is a unique decomposition


of f as the sum of fn ∈ S := span({K(X1 , ·), . . . , K(Xn , ·)}) and f⊥ ∈ S ⊥ ⊂ HK , the
orthogonal complement of S (in HK ):
X
n
f (x) = fn (x) + f⊥ (x) := αi K(Xi , x) + f⊥ (x).
i=1

Here αi ∈ R and hf⊥ , K(Xi , ·)i = 0 for all i = 1, . . . , n. By the reproducing property,

f (Xi ) = hf, K(Xi , ·)i = hfn , K(Xi , ·)i + hf⊥ , K(Xi , ·)i = hfn , K(Xi , ·)i = fn (Xi ),

which implies that

`({(Xi , Yi , f (Xi )}ni=1 ) = `({(Xi , Yi , fn (Xi )}ni=1 ).

Secondly, for all f⊥ ∈ S ⊥ ,


! !
X
n
2 X
n
2
Ω(kf k2HK ) = Ω αi K(Xi , ·) + kf⊥ k2 ≥Ω αi K(Xi , ·) .
i=1 i=1

Hence, `(· · · ) depends only on the component of f lying in the subspace S and Ω(·)
is minimized if f lies in that subspace. Hence, the criterion function is minimized if
f lies in that subspace, and we can express the minimizer as in (62).

Note that as Ω(·) is strictly non-decreasing, kf⊥ k must necessarily be zero for f to
be the minimizer of (61), implying that fˆ must necessarily lie in the subspace S.

Monotonicity of Ω does not prevent the regularized loss functional (61) from hav-
ing multiple local minima. To ensure a global minimum, we would need to require
convexity. If we discard the strictness of the monotonicity, then it no longer follows
that each minimizer of the regularized loss admits an expansion (62); it still follows,
however, that there is always another solution that is as good, and that does admit
the expansion.

The significance of the representer theorem is that although we might be trying to


solve an optimization problem in an infinite-dimensional space HK , containing linear
combinations of kernels centered on arbitrary points of X , it states that the solution
lies in the span of n particular kernels — those centered on the training points. For
suitable choices of loss functions, many of the αi ’s often equal 0.

68
5.2.2 Feature map and kernels

A kernel can be thought of as a notion of similarity measure between two points in


the “input points” in X . For example, if X = Rd , then the canonical dot product
X
d
0 0
K(x, x ) = hx, x i = xi x0i ; x, x0 ∈ Rd .
i=1

can be taken as the kernel.

If X is a more complicated space, then we can still define a kernel as follows.


Definition 5.12 (Kernel). Let X be a non-empty set. The function K : X × X → R
is said to be a kernel if there exists a real Hilbert space E (not necessarily a RKHS),
and a map ϕ : X → E such that for all x, y ∈ X ,

K(x, x0 ) = hϕ(x), ϕ(x0 )iE . (63)

Such map ϕ : X → E is referred to as the feature map, and space E as the feature
space. Thus kernels are functions that can be written as an inner product in a feature
space.

Exercise (HW3): Show that K(·, ·) defined in (63) is a positive definite function.

Thus, we can think of the patterns as ϕ(x), ϕ(x0 ), and carry out geometric algorithms
in the Hilbert space (feature space) E. Usually, dim(E)  dim(X ) (if dim(X ) is
defined).

Note that for a given kernel, there may be more than one feature map, as demon-
strated by the following example: take X = R, and K(x, y) = xy = [ √x2 √x2 ][ √y2 √y2 ]> ,
where we defined the feature maps ϕ(x) = x and ϕ̃(x) = [ √x2 √x2 ], and where the
feature spaces are respectively, E = R and Ẽ = R2 .

Exercise (HW3): For every x ∈ X , assume that the sequence {fn (x)}n≥1 ∈ `2 (N),
P
where fn : X → R, for all n ∈ N. Then K(x1 , x2 ) := ∞n=1 fn (x1 )fn (x2 ) is a kernel.

As k(·, ·) defined in (63) is symmetric and positive definite it induces a unique RKHS.
Thus, to construct reproducing kernels on complicated spaces we only need to find a
feature map ϕ.

Another way to characterize a symmetric positive definite kernel K is via the Mercer’s
Theorem.

69
Figure 4: (Feature space and feature map) On the left, the points are plotted in the
original space. There is no linear classifier that can separate the red crosses
from the blue circles. Mapping the points to a higher dimensional feature space
(x 7→ ϕ(x) := (x1 , x2 , x1 x2 ) ∈ R3 ), we obtain linearly separable classes. A
possible decision boundary is shown as a gray plane.

2
5

4
1

3
x2

0
2

−1 1
x2

0
−2
−1

−3 −2

−3
−4
−4

−5
−5 −4 −3 −2 −5 −1 0 1 2 3 4 5
−5 −4 −3 −2 −1 0 1 2 3 4 5
x x1
1

Figure 2.1: XOR example. On the left, the points are plotted in the original
Figure 2.1: XOR
space. example. On classifier
There is no linear the left,that the pointstheare
can separate red plotted inthethe original
crosses from
space. There blue
is no linear
circles. classifier
Mapping that
the points to acan separate
higher the
dimensional redspace,
feature crosses
we from the
Definition 5.13 (Integral
blue circles. operator).
obtain linearly
Mapping
gray plane.
separable classes.Let
the points to K decision
A possible
a higher bedimensional
aboundary
continuous
is shown as a kernel on compact met-
feature space, we
ric space X , obtain
and letlinearly separable classes. A possible decision boundary is shown as a
gray plane.
ν be a finite Borel measure on X . Let TK : L2 (X , ν) → C(X )
(C(X ) being the space of all continuous real-valued functions on X thought of as a
subset of L2 (X , ν)) be the linear map defined as:
Z
2
(TK f )(·) = K(x, ·)f (x) dν(x), f ∈ L2 (X , ν).
X

Such a TK is called an integral operator.2

Exercise (HW3): Show that TK is a continuous function for all f ∈ L2 (X , ν).


Theorem 5.14 (Mercer’s Theorem). Suppose that K is a continuous positive definite
kernel on a compact set X , and let ν be a finite Borel measure on X with supp(ν) = X .
Then there is an orthonormal basis {ψi }i∈J of L2 (X , ν) consisting of eigenfunctions of
TK such that the corresponding sequence of eigenvalues {λi } are non-negative. The
eigenfunctions corresponding to non-zero eigenvalues are continuous on X and K(·, ·)
has the representation

K(u, v) = Σ_{i∈J} λi ψi (u)ψi (v),   u, v ∈ X ,

where the convergence is absolute and uniform, i.e.,

lim_{n→∞} sup_{u,v∈X} | K(u, v) − Σ_{i∈J: i≤n} λi ψi (u)ψi (v) | = 0.

Example 5.15. To take an analogue in the finite case, let X = {x1 , . . . , xn }. Let
Kij = K(xi , xj ), and f : X → Rn with fi = f (xi ) and let ν be the counting measure.
Then,

TK f = Σ_{i=1}^n K(xi , ·)fi

and

f ⊤Kf ≥ 0 for all f   ⇒   K is p.s.d.   ⇒   K = Σ_{i=1}^n λi vi vi⊤ .

Hence,

K(xi , xj ) = Kij = (V ΛV ⊤ )ij = Σ_{k=1}^n λk vki vkj .

Note that Mercer’s theorem gives us another feature map for the kernel K, since:
K(u, v) = Σ_{i∈J} λi ψi (u)ψi (v) = ⟨ϕ(u), ϕ(v)⟩ℓ2 (J) ,

so we can take ℓ2 (J) as a feature space, and the corresponding feature map is ϕ : X → ℓ2 (J), where

ϕ : x ↦ {√λi ψi (x)}i∈J .

This map is well defined as Σ_{i∈J} |√λi ψi (x)|2 = K(x, x) < ∞.
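As a numerical illustration of this construction in the finite case of Example 5.15 above (the Gaussian kernel and the sample points below are illustrative choices, not part of the notes), the following Python sketch eigendecomposes the Gram matrix and recovers a feature map ϕ(xi ) = (√λk vki )k with ⟨ϕ(xi ), ϕ(xj )⟩ = Kij :

# A sketch of the finite analogue of Mercer's theorem (Example 5.15): for a
# Gaussian kernel on a finite set {x_1,...,x_n} with counting measure, the Gram
# matrix K is symmetric p.s.d., so K = V diag(lambda) V^T and
# K_ij = sum_k lambda_k v_ki v_kj, giving the feature map
# phi(x_i) = (sqrt(lambda_k) v_ki)_k.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=10)                               # points in X = [0, 1]
Kmat = np.exp(-(x[:, None] - x[None, :]) ** 2)          # Gaussian kernel (illustrative choice)

lam, V = np.linalg.eigh(Kmat)                           # columns of V are eigenvectors
lam = np.clip(lam, 0.0, None)                           # numerical cleanup of tiny negatives

Phi = V * np.sqrt(lam)                                  # row i is phi(x_i) in R^n
assert np.allclose(Phi @ Phi.T, Kmat, atol=1e-8)        # <phi(x_i), phi(x_j)> = K_ij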

Apart from the representation of the kernel function, Mercer’s theorem also leads to
a construction of RKHS using the eigenfunctions of the integral operator TK .

5.3 Smoothing Splines

Let us consider again our nonparametric regression model

Yi = f (xi ) + εi ,   i = 1, . . . , n,

where ε1 , . . . , εn are mean zero, uncorrelated random variables with a common vari-
ance σ 2 . As with the kernel approach, there is a presumption that f is smooth. The
smoothing spline approach tries to take direct advantage of this smoothness by aug-
menting the usual least squares criteria with a penalty for roughness. For instance,
if the xi ’s lie in [0, 1], the estimator fˆ might be chosen to minimize (over g)
Σ_{i=1}^n (Yi − g(xi ))2 + λ‖g (m) ‖22 ,

where k · k2 is the L2 -norm of functions on [0, 1] under the Lebesgue measure, i.e.,
‖g‖22 = ∫_0^1 g 2 (x) dx.

The constant λ is called the smoothing parameter. Larger values for λ will lead to a
smoother fˆ, smaller values will lead to an estimate fˆ that follows the observed data
more closely (i.e., fˆ(xi ) will be closer to Yi ).

We can use the RKHS approach to solve the above optimization problem using the
representer theorem. Please read Chapter 18.3 from [6] for the details (this was done
in class).
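As a rough illustration of the smoothing idea (and not the exact natural-spline solution discussed in class and in [6]), the following Python sketch uses a discrete surrogate for m = 2: the penalty ‖g′′‖22 is replaced by the sum of squared second differences of g at equally spaced design points, which leads to a simple ridge-type linear system. The data-generating function, noise level and λ are illustrative choices.

# A rough sketch of the smoothing idea for m = 2 (not the exact natural cubic
# spline solution): on equally spaced x_i's we replace the penalty ||g''||_2^2
# by the squared second differences of g at the design points and solve the
# resulting ridge-type system (I + lam * D^T D) g = Y.
import numpy as np

n, lam = 100, 50.0
x = np.linspace(0.0, 1.0, n)
rng = np.random.default_rng(2)
Y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)    # toy data (illustrative)

# Second-difference matrix D: (Dg)_i = g_i - 2 g_{i+1} + g_{i+2}
D = np.zeros((n - 2, n))
for i in range(n - 2):
    D[i, i:i + 3] = [1.0, -2.0, 1.0]

g_hat = np.linalg.solve(np.eye(n) + lam * D.T @ D, Y)   # fitted values at the x_i's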

Exercise (HW3): (Semiparametric models — partially linear regression model.) Consider
a regression model with two explanatory variables x and w in which

Yi = f (xi ) + βwi + εi ,   i = 1, . . . , n,

with 0 < x1 < . . . < xn < 1, f ∈ Wm [0, 1], β ∈ R, and the εi ’s are i.i.d. from
N (0, σ 2 ). This might be called a semiparametric model because the dependence on
w is modeled parametrically, but the dependence on x is nonparametric. Following a
penalized least squares approach, consider choosing fˆ and β̂ to minimize
Σ_{i=1}^n (Yi − g(xi ) − αwi )2 + λ‖g (m) ‖22 .

(a) Show that the estimator fˆ will still be a natural spline of order 2m.

(b) Derive explicit formulas based on linear algebra to compute β̂ and fˆ.

5.4 Classification and Support Vector Machines

5.4.1 The problem of classification

We observe the data


Dn = {(X1 , Y1 ), . . . , (Xn , Yn )}
where (Xi , Yi )’s are i.i.d. random pairs, Xi takes values in the measurable space (X , A)
and Yi ∈ {−1, +1} is a label. A new observation X arrives and the aim of classification
is to predict the corresponding Y . We can interpret this task as a classification of X
into one of the two groups labeled with −1 or +1.

Example 5.16 (Spam-filter). We have a sample of n e-mail messages. For each
message i, we count the percentages of 50 selected words characteristic for spam,
such as the words money, credit, Viagra and so on. This constitutes the vectors of
measurements Xi ∈ R50 . Then, an expert provides the values Yi = +1 if e-mail
i is spam and Yi = −1 otherwise. When a new message arrives, we would like to
decide whether it is spam or not. For this purpose, we measure the corresponding
percentages X ∈ R50 in this message, and based on X and on the training data Dn ,
we have find a decision Y . The problem is usually solved by separating R50 in two
parts (corresponding to spam and non-spam) the via a hyperplane depending on the
training data Dn . This is called a linear classifier.

At first sight, the observations are of the same form as in the problem of regression
with random design. However, the important feature is that Yi ’s are now binary.
Even more important, in the classification context our final aim is different. We are
not interested in estimation of the regression function f ∗ (x) := E(Y |X = x) but
rather in predicting the value of the label Y . Note that the regression function has
now the form

f ∗ (x) = P(Y = 1|X = x) − P(Y = −1|X = x) = 2η(x) − 1

where
η(x) := P(Y = 1|X = x).
We define a classifier h as any measurable function from X to {−1, 1}. We predict
the label for an observed X as h(X). In practice, h depends on the observed data Dn
but, in this section, we will assume that the observed data is fixed and thus h is just
a function of X.

The performance of a classifier is characterized by the probability of error (also called
the risk of classification), which is defined as:

R(h) := P(Y 6= h(X)).

Our aim is to find the best classifier, i.e., a classifier which minimizes this risk:

h∗ = argmin R(h).
h

We will call h∗ the Bayes classifier and we call the minimal possible risk R∗ the Bayes
risk, i.e.,
R∗ := min R(h) = R(h∗ ).
h
The next theorem shows that such a classifier always exists.

Theorem 5.17. (i) The Bayes classifier has the form

h∗ (x) = 1 if η(x) > 1/2,   and   h∗ (x) = −1 if η(x) ≤ 1/2.

(ii) For any classifier h we have

R(h) − R(h∗ ) = ∫_{x: h(x)≠h∗ (x)} |2η(x) − 1| dPX (x),

where PX is the probability distribution of X.

(iii) The Bayes risk is bounded by 1/2:

R∗ = E[min{η(X), 1 − η(X)}] ≤ 1/2.
Example 5.18. Let X ∈ Rd admit a density p(·) with respect to the Lebesgue
measure on Rd . Then show that (Exercise (HW3))
h∗ (x) = 1 if πp1 (x) > (1 − π)p−1 (x),   and   h∗ (x) = −1 otherwise,

where π = P(Y = 1) and pi (x) = p(x|Y = i) are the conditional densities of X given
Y = i, for i = −1, 1. This is the maximum likelihood classifier if and only if π = 1/2.

Parametric approach to classification: Assume that p−1 , p1 in the example above
are known, up to some parameters in Rk . If we estimate these parameters then we
can use the “plug-in classifier”:

p̂1 (X)π̂ > p̂−1 (X)(1 − π̂) ⇒ X is classified with Y = 1,
p̂1 (X)π̂ ≤ p̂−1 (X)(1 − π̂) ⇒ X is classified with Y = −1,

where p̂1 , p̂−1 and π̂ are parametric estimators of p1 , p−1 and π. If the pi ’s are Gaussian
densities N (θi , Σ), i = −1, 1, then the decision rule is linear, which means that X is
labeled 1 if and only if X ⊤ a + b > 0 for some a ∈ Rd , b ∈ R. Show this (Exercise (HW3)).

Nonparametric plug-in approach: We can also estimate the regression function f ∗
and then calculate an estimator η̂n (x) of η(x) = (f ∗ (x) + 1)/2. Using this as the
plug-in estimator, we derive the classifier

ĥn (x) = 1 if η̂n (x) > 1/2,   and   ĥn (x) = −1 otherwise.

However, for this method to work we need η̂n to be close to η, which is typically
guaranteed if the function η has some smoothness properties. This is not always
reasonable to assume.

Machine learning approach: This is also a fully nonparametric approach. Except
for assuming that Y ∈ {−1, 1}, we do not make any other assumption on the joint
distribution of (X, Y ). The aim is to mimic the oracle h∗ based on the data Dn . But
this is typically not possible. A more modest and achievable task is to mimic the
oracle hH within some reasonable restricted collection H of classifiers (also called the
dictionary),
hH = argmin_{h∈H} R(h).

An important example is given by the class of all linear classifiers:

H = {h : Rd → {−1, 1} : h(x) = 2I(x> a + b > 0) − 1, a ∈ Rd , b ∈ R}.

5.4.2 Minimum empirical risk classifiers

How to construct good classifiers based on the data? A first idea is to use the
principle of unbiased risk estimation. We need to find an unbiased estimator for the
risk R(h) = P(Y 6= h(X)) and then to minimize this estimator in h over a given class
H. Note that the empirical risk

Rn (h) = (1/n) Σ_{i=1}^n I{Yi ≠ h(Xi )}

is an unbiased estimator for R(h) for all h. Minimizing Rn (h) can be used to obtain
a classifier.
Definition 5.19. Let H be a fixed collection of classifiers. The empirical risk mini-
mization (ERM) classifier on H is defined by

ĥn := argmin Rn (h).


h∈H

The ERM classifier always exists since the function Rn takes only a finite number of
values, whatever is the class H. Note that I{Yi 6= h(Xi )} = (Yi − h(Xi ))2 /4 and thus

Rn (h) = (1/(4n)) Σ_{i=1}^n (Yi − h(Xi ))2 .

Therefore, ĥn is the least squares estimator based on binary variables.

We expect the ERM classifier to have the risk close to that of the oracle classifier hH .
Let us emphasize that we are not interested in accurate estimation of hH and moreover
there is no guarantee that hH is unique. Mimicking the oracle means constructing a
classifier ĥn such that its risk R(ĥn ) is close to the oracle risk minh∈H R(h).

Computational considerations: To find ĥn , we should minimize on H the non-convex
function

(1/n) Σ_{i=1}^n I{Yi ≠ h(Xi )},
and H is not a convex set because a convex combination of classifiers is not necessarily
a classifier. Thus, the only possibility is to use combinatorial search. Even in the case
where H is the class of linear rules, the computational complexity of combinatorial
search on
A(Dn ) := {b = (b1 , . . . , bn ) : bi = I{h(Xi ) 6= Yi }, h ∈ H}
will be of order O(n^{d+1} ) where d is the dimension of the Xi ’s. This is prohibitive
already for moderately large d.

A remedy is convexification: we replace the indicator function by a convex function
and the class H by a convex class of functions, then solve a convex minimization
problem and classify according to the sign of the solution. This approach was probably
first used by [16] to define the method initially called the generalized portrait and
renamed in the 1990’s as the support vector machine.

5.4.3 Convexifying the ERM classifier

Let us first rewrite R(h) in another form:


R(h) = P(Y ≠ h(X)) = P(−Y h(X) ≥ 0) = E[ I{−Y h(X) ≥ 0} ] = E[ϕ0 (−Y h(X))],

where ϕ0 (u) := I(u ≥ 0). We now replace ϕ0 by a convex function ϕ : R → R
(sometimes called a convex surrogate loss) and define

Rϕ (f ) := E[ϕ(−Y f (X))],

fϕ∗ := argmin_{f : X →R} Rϕ (f ),

Rn,ϕ (f ) := (1/n) Σ_{i=1}^n ϕ(−Yi f (Xi )),

fˆn,ϕ := argmin_{f ∈F} Rn,ϕ (f ),                (64)

where F is a convex class of functions f : X → R. The question is whether there are
convex functions ϕ such that h∗ = sign(fϕ∗ ), where h∗ is defined in Theorem 5.17?

Natural requirements to ϕ are: (i) convexity, (ii) ϕ should penalize more for wrong
classification than for correct classification. Note that ϕ0 does not penalize at all for
correct classification, because ϕ0 (−1) = 0, but it penalizes for wrong classification
since ϕ0 (1) = 1. However ϕ0 is not convex. The first historical example of convex
surrogate loss ϕ is the hinge loss:

ϕH (x) := (1 + x)+ .

It satisfies both the requirements (i) and (ii) above. The corresponding risk and its
minimizer are

RϕH (f ) := E[(1 − Y f (X))+ ],      fϕ∗H := argmin_{f : X →R} RϕH (f ).

Proposition 5.20. Let h∗ be the Bayes classifier, i.e., h∗ (x) := sign(η(x) − 1/2).
Then fϕ∗H = h∗ .

Proof. Recall that η(x) = P(Y = 1|X = x) and h∗ (x) = sign(f ∗ (x)) with f ∗ (x) =
E(Y |X = x) = 2η(x) − 1. We can write
RϕH (f ) = ∫_X E[(1 − Y f (X))+ | X = x] dPX (x),

where

E[(1 − Y f (X))+ | X = x] = P(Y = 1|X = x)(1 − f (x))+ + P(Y = −1|X = x)(1 + f (x))+
                          = η(x)(1 − f (x))+ + (1 − η(x))(1 + f (x))+ .

Fix an arbitrary x and define

g(u) := η(x)(1 − u)+ + (1 − η(x))(1 + u)+ .

We claim that
fϕ∗H (x) = argmin_{u∈R} g(u).

Next, observe that g is a piecewise affine function. Let u∗ = argmin_{u∈R} g(u). We can
see that:

g(u) = 1 + (1 − 2η(x))u = η(x)(1 − u) + (1 − η(x))(1 + u),   if |u| ≤ 1;
g(u) = (1 − η(x))(1 + u),   if u > 1;
g(u) = η(x)(1 − u),   if u < −1.

As g is affine for u > 1 (and u < −1) with nonnegative slope we see that u∗ must
belong to [−1, 1]. However, for u ∈ [−1, 1], g is minimized at
u∗ = −1 if η(x) ≤ 1/2,   and   u∗ = +1 if η(x) > 1/2.

Thus, fϕ∗H (x) = u∗ = sign(η(x) − 1/2) = h∗ (x) for all x ∈ X .

Classical examples of functions ϕ are the following: (i) ϕ(x) = (1 + x)+ (hinge loss);
(ii) ϕ(x) = exp(x) (exponential loss); (iii) ϕ(x) = log2 (1 + exp(x)) (logistic loss).
Proposition 5.21. Let ϕ′ be positive and strictly increasing. Then h∗ = sign(fϕ∗ ).

Proof. Exercise (HW3).

Given a solution fˆn,ϕ to the minimization problem (64), we define the classifier ĥn,ϕ :=
sign(fˆn,ϕ ). A popular choice for the set F is
F = { Σ_{j=1}^M θj hj : θ ∈ Θ }

where {h1 , . . . , hM } is a dictionary of classifiers independent of the data {(Xi , Yi )}_{i=1}^n
(the hj ’s are often called the “weak learners”), and Θ ⊂ RM is a set of coefficients,
where Θ = RM or Θ is an ℓ1 -body or an ℓ2 -body, defined as follows.

• An ℓ2 -body is a set of the form

  Θ = {θ ∈ RM : θ⊤ Kθ ≤ r}

  for some symmetric positive semi-definite matrix K and a positive scalar r.

• An ℓ1 -body is either an ℓ1 -ball Θ = {θ ∈ RM : |θ|1 ≤ r}, or the simplex

  Θ = ΛM = { θ ∈ RM : Σ_{j=1}^M θj = 1, θj ≥ 0 }.

The hinge loss with an `2 -body yields support vector machines (SVM). The exponen-
tial and logit loss with an `1 -body leads to boosting.

5.4.4 Support vector machine (SVM): definition

Suppose that H is a RKHS of functions on X with kernel K.

Consider the classification problem described in the previous subsections. A popular
example of a convex set of functions F used in (64) is a ball in the RKHS H:

F = {f ∈ H : kf kH ≤ r}, r > 0.

Then (64) becomes

min_{f ∈H: ‖f ‖H ≤ r} Rn,ϕ (f ).

The support vector machine is, by definition, a classifier obtained from solving this
problem when ϕ(x) = (1 + x)+ (the Hinge loss):
min_{f ∈H} ( (1/n) Σ_{i=1}^n (1 − Yi f (Xi ))+ + λ‖f ‖2H ).                (65)

Thus, by the representer theorem (see Theorem 5.11), it is enough to look for a
solution of (65) in the finite dimensional space S (see the proof of Theorem 5.11)
of dimension less than or equal to n. Solving the problem reduces to finding the
coefficients θj in the representation (62).

Let Kij = K(Xi , Xj ) and denote by K the symmetric matrix (Kij )i,j=1,...,n . Then for
any f ∈ S ⊂ H,

‖f ‖2H = Σ_{i,j=1}^n θi θj Kij = θ⊤ Kθ.

Thus, the SVM minimization problem (65) reduces to

min_{θ∈Rn} [ (1/n) Σ_{i=1}^n (1 − Yi (Kθ)i )+ + λθ⊤ Kθ ],                (66)

where (Kθ)i is the i’th component of Kθ. Given the solution θ̂ of (66), the SVM
classifier ĥn,ϕ is determined as:
ĥn,ϕ (x) = sign(fˆn,ϕ (x)),   where   fˆn,ϕ (x) = Σ_{i=1}^n θ̂i K(Xi , x).                (67)
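In practice (66) is solved by a quadratic programming solver; as a hedged illustration, the following Python sketch instead runs plain (normalized) subgradient descent on the objective in (66), using the subgradient derived in the next subsection. The Gaussian kernel, the toy data and the step sizes are illustrative choices, not part of the notes.

# A minimal sketch of solving (66) by subgradient descent (illustrative only;
# in practice one would use a QP or a dedicated SVM solver).
import numpy as np

rng = np.random.default_rng(3)
n, lam = 60, 0.1
X = rng.normal(size=(n, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)             # toy labels

K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # Gaussian Gram matrix

theta = np.zeros(n)
best_theta, best_obj = theta.copy(), np.inf
for t in range(1, 3001):
    Ktheta = K @ theta
    margins = Y * Ktheta
    obj = np.mean(np.maximum(1.0 - margins, 0.0)) + lam * theta @ Ktheta
    if obj < best_obj:
        best_theta, best_obj = theta.copy(), obj
    alpha = (margins < 1).astype(float)                     # from the subdifferential of the hinge part
    g = -K @ (alpha * Y) / n + 2 * lam * Ktheta             # an element of the subdifferential of (66)
    theta = theta - (0.5 / np.sqrt(t)) * g / (np.linalg.norm(g) + 1e-12)

def f_hat(x_new):
    k_new = np.exp(-np.sum((X - x_new) ** 2, axis=1))       # (K(X_1, x), ..., K(X_n, x))
    return best_theta @ k_new

y_pred = np.sign(f_hat(np.array([0.5, 0.5])))               # the SVM classifier (67)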

5.4.5 Analysis of the SVM minimization problem

Traditionally, the SVM minimization problem (66) is solved by reduction to a quadratic
program after introducing some additional slack variables. Here, we choose to treat
the problem differently, using subdifferential calculus. For any convex objective func-
tion G, we have the equivalence

θ̂ ∈ argmin_{θ∈RM} G(θ)   ⇐⇒   0 ∈ ∂G(θ̂)                (68)

where ∂G(θ) is the subdifferential of G at θ. (In the particular case where G is
differentiable at θ, the subdifferential reduces to the gradient of G at θ: ∂G(θ) =
{∇G(θ)}.)
Proposition 5.22. The solution of the SVM optimization problem (66) has the form

fˆ(x) = Σ_{i=1}^n θ̂i K(Xi , x),

where the coefficients θ̂i satisfy

θ̂i = 0,   if Yi fˆ(Xi ) > 1,
θ̂i = Yi /(2λn),   if Yi fˆ(Xi ) < 1,
θ̂i = αi Yi /(2λn), with αi ∈ [0, 1],   if Yi fˆ(Xi ) = 1.

The points Xi with θ̂i 6= 0 are called the support vectors.

In practice, there are often not too many support vectors since only the points Xi
that are misclassified or close to the decision boundary satisfy the condition θ̂i 6= 0.

Proof. We will derive the expression for the subdifferential of the objective function
in (66) by analyzing each term in the sum separately. Fix some index i and consider
the function

θ ↦ ( 1 − Yi Σ_{j=1}^n Kij θj )+ = (1 − Yi (Kθ)i )+ .

Let gi (θ) be a subgradient of this function and denote by gij (θ) its j’th component.
There are three cases that follow immediately from the form of the subdifferential of
the function (1 − x)+ :

• if Yi (Kθ)i > 1 then gij (θ) = 0,

• if Yi (Kθ)i < 1 then gij (θ) = −Yi Kij ,

• if Yi (Kθ)i = 1 then gij (θ) = −αi Yi Kij , for some αi ∈ [0, 1].

We can wrap these three cases as gij (θ) = −αi Yi Kij , with

(i) αi = 0 if Yi (Kθ)i > 1,

(ii) αi = 1 if Yi (Kθ)i < 1,

(iii) αi ∈ [0, 1] if Yi (Kθ)i = 1.

Consequently, the subdifferential ∂( (1/n) Σ_{i=1}^n (1 − Yi (Kθ)i )+ ) is composed of vectors
of the form −Kβ, where β = ( αi Yi /n )_{i=1}^n with the αi satisfying the above conditions (i)–
(iii). Next, the function λθ⊤ Kθ is differentiable and its gradient is 2λKθ. Thus, the
subdifferential of the objective function in (66) is composed of vectors of the form

−Kβ + 2λKθ.

Now, by (68), a vector θ̂ is a solution of (66) if and only if 0 belongs to the subdiffer-
ential of the objective function at θ̂, which can be written as 2λθ̂ − β = ε for some ε
satisfying Kε = 0. It remains to note that we can always take ε = 0 since the choice
of ε in the null space of K does not modify the value of the objective function. This
completes the proof.

Observe that the SVM solution can be written as (67). Thus, if we consider the
functions ϕi (·) = K(Xi , ·), we have

fˆ = Σ_{i=1}^n θ̂i ϕi ,

so that fˆ can be viewed as a linear classifier in “transformed coordinates”. The
functions ϕi (·) = K(Xi , ·) can be interpreted as “weak learners” but in this case they
are not classifiers.

The strength of the RKHS approach is that the space X can be any arbitrary space
(such as a graph or a semi-group, for example) but we transform each point Xi ∈
X into a finite-dimensional vector Zi = (ϕ1 (Xi ), . . . , ϕn (Xi ))⊤ ∈ Rn , and then
use a linear classifier fˆ(X) = θ> Z in the finite-dimensional space Rn where Z :=
(ϕ1 (X), . . . , ϕn (X))> ∈ Rn . The classification rule for a new point Z is
Ŷ := 1 if θ̂⊤ Z > 0,   and   Ŷ := −1 otherwise.

For any learning point Zi , if Zi is correctly classified we have Yi θ̂> Zi > 0, and if Zi is
wrongly classified we have Yi θ̂> Zi ≤ 0. By Proposition 5.22 a solution θ̂ of the SVM
minimization problem has the coordinates θ̂i , i = 1, . . . , n, satisfying:

• θ̂i = 0 if Yi θ̂> Zi > 1. Interpretation: The point (Zi , Yi ) does not affect the
classification rule if Zi is correctly classified with high margin (larger than 1),
where the margin of the i’th observation is defined as Yi θ̂> Zi = Yi fˆ(Xi ).

• θ̂i 6= 0 if Yi θ̂> Zi ≤ 1. The last inequality means that the point Zi is wrongly
classified or correctly classified with small margin (smaller than 1). If θ̂i 6= 0,
the point Zi is called a support vector.

5.5 Kernel ridge regression

Consider the regression model

Yi = f ∗ (xi ) + εi ,   i = 1, . . . , n,

where f ∗ is the true regression function, the xi ’s take values in X (an arbitrary metric
space), and ε1 , . . . , εn are mean zero, uncorrelated random variables. We want to estimate
f ∗ by minimizing the criterion function
fˆ = argmin_{f ∈H} Σ_{i=1}^n (Yi − f (xi ))2 + λ‖f ‖2H .                (69)

By the representer theorem, we can claim that any solution to (69) is of the form
fˆ(·) = Σ_{i=1}^n αi K(xi , ·) for some weight vector (α1 , . . . , αn ) ∈ Rn . Thus, the above
optimization problem can be equivalently expressed as:

α̂ := argmin_{α∈Rn} ‖Y − Kα‖2 + λα⊤ Kα,

where K = ((Kij )) with Kij = K(xi , xj ) and Y = (Y1 , . . . , Yn ). Here we have used that,
for f in the span of {K(xi , ·)}_{i=1}^n , f (xi ) = (Kα)i and ‖f ‖2H = α⊤ Kα. We can
solve the above finite dimensional optimization problem to yield

K(K + λI)α = KY,

which shows that α̂ = (K + λI)−1 Y is a solution.
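A minimal Python sketch of kernel ridge regression, implementing α̂ = (K + λI)−1 Y and the fitted function fˆ(·) = Σi α̂i K(xi , ·); the Gaussian kernel, its bandwidth, the data and λ below are illustrative choices.

# A short sketch of kernel ridge regression with a Gaussian kernel.
import numpy as np

rng = np.random.default_rng(4)
n, lam = 80, 0.5
x = np.sort(rng.uniform(size=n))
Y = np.cos(4 * x) + 0.2 * rng.normal(size=n)             # toy data

def gauss_kernel(a, b, h=0.2):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h ** 2))

K = gauss_kernel(x, x)
alpha_hat = np.linalg.solve(K + lam * np.eye(n), Y)      # solves (K + lam I) alpha = Y

x_grid = np.linspace(0, 1, 200)
f_hat = gauss_kernel(x_grid, x) @ alpha_hat              # fitted regression function on a grid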

5.6 Kernel principal component analysis (PCA)

Given a random vector X ∼ P in Rd , PCA solves the following eigenvector (eigenvalue) problem:

Σv = λv,

where Σ is the covariance matrix of X and the eigenvector corresponding to the i’th
largest eigenvalue is the i’th principal component, for i = 1, . . . , d. Another way to
view the PCA problem is to first consider the first principal component, which is a
solution to the following optimization problem:

v1 = argmax_{v∈Rd : ‖v‖≤1} Var(v ⊤ X) = argmax_{v∈Rd : ‖v‖≤1} v ⊤ Σv.

The second principal component is defined as the unit vector that maximizes Var(v > X)
over all vectors v that are orthogonal to v1 , and so on.

Given i.i.d. samples {xi }ni=1 from P , the sample principal components are obtained
by solving the corresponding sample analogue:

Σ̂v = λv,
where Σ̂ := (1/n) Σ_{i=1}^n (xi − x̄)(xi − x̄)⊤ (here x̄ = Σ_{i=1}^n xi /n is the sample mean) is the
sample covariance matrix of X. Similarly,

v̂1 = argmax_{v∈Rd : ‖v‖≤1} v ⊤ Σ̂v = argmax_{v∈Rd : ‖v‖≤1} (1/n) Σ_{i=1}^n ( v ⊤ (xi − x̄) )2 .                (70)

Now suppose that X ∼ P takes values in an arbitrary metric space X . Suppose that
H is a RKHS (of functions) on X with reproducing kernel K. We can use the kernel
method to extend classical PCA to capture non-linear principal components. The first
principal component can now be defined as

f1 := argmax Var(f (X)) = argmax Var(hf, K(X, ·)iH ).


f ∈H:kf kH ≤1 f ∈H:kf kH ≤1

Let ϕ(x) := K(x, ·) for all x ∈ X (note that here ϕ is not exactly the feature map, as
ϕ : X → H). Given a sample {xi }ni=1 from P , the sample first principal component
(function) can be defined analogously (as in (70)) as
* + 2
1 X
n X
n
fˆ1 = argmax d Var(hf, K(X, ·)iH ) = argmax  f, ϕ(xi ) − 1 ϕ(xj )  .
f ∈H:kf kH ≤1 kf kH ≤1 n i=1
n j=1 H

We define the empirical covariance operator as

Σ̂ := (1/n) Σ_{i=1}^n ϕ̃(xi ) ⊗ ϕ̃(xi ),   where   ϕ̃(xi ) := ϕ(xi ) − (1/n) Σ_{j=1}^n ϕ(xj ).

We would like to find eigenfunctions fˆ (the principal components) (Why? Exercise
(HW3)) such that
Σ̂(fˆ) = λfˆ. (71)
The question now is, how do we express the above equation in terms of kernels, i.e.,
how do we “kernelize” it? Towards this end, we make the following claim.
P
Claim: Any solution to (71) is of the form fˆ = ni=1 αi ϕ̃(xi ) for some weight vector
(α1 , . . . , αn ) ∈ Rn .

Proof: First, we observe that any solution to (71) lies in Range(Σ̂). Linearity, and
the nature of ϕ̃(xi ) ⊗ ϕ̃(xi ) (by definition (a ⊗ b)(c) := hb, ciH a) tell us that
Σ̂(fˆ) = (1/n) Σ_{i=1}^n ⟨ϕ̃(xi ), fˆ⟩H ϕ̃(xi ).

Therefore, (71) is equivalent to the following system of equations in α ∈ Rn :


Σ̂( Σ_{i=1}^n αi ϕ̃(xi ) ) = λ Σ_{i=1}^n αi ϕ̃(xi ).

For the above set of equations, the left-hand side equals

(1/n) Σ_{j=1}^n Σ_{i=1}^n αi ⟨ϕ̃(xi ), ϕ̃(xj )⟩H ϕ̃(xj ).

Using the fact that ⟨ϕ̃(xi ), ϕ̃(xj )⟩H = K̃ij , where K̃ = HKH (show this; Exer-
cise (HW3); here H = In − 1n×n /n and K is the Gram matrix, i.e., Kij = K(xi , xj )),
the above system of equations may be written as

(1/n) Σ_{j=1}^n Σ_{i=1}^n αi K̃ij ϕ̃(xj ) = λ Σ_{i=1}^n αi ϕ̃(xi ).

Taking inner products with ϕ̃(xl ), for l = 1, . . . , n, we get

(1/n) Σ_{j=1}^n Σ_{i=1}^n αi K̃ij K̃jl = λ Σ_{i=1}^n αi K̃il .

We now have a set of n linear equations in the vector α ∈ Rn . In matrix-vector form,
it can be written very simply as

K̃ 2 α = λnK̃α.

The only solutions of this equation that are of interest to us are those that satisfy

K̃α = λnα.

This is simply an eigenvalue/eigenvector problem in the matrix K̃.
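A short Python sketch of kernel PCA as derived above: form K̃ = HKH, solve the eigenvalue problem K̃α = nλα, and compute the scores of the leading components. The Gaussian kernel and the toy data below are illustrative choices.

# A sketch of kernel PCA: center the Gram matrix, eigendecompose it, and
# read off the principal component scores.
import numpy as np

rng = np.random.default_rng(5)
n = 100
t = rng.uniform(0, 2 * np.pi, n)
X = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(n, 2))   # toy data on a circle

K = np.exp(-2.0 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
H = np.eye(n) - np.ones((n, n)) / n
K_tilde = H @ K @ H                                      # centered Gram matrix

eigval, A = np.linalg.eigh(K_tilde)                      # K_tilde alpha = (n lambda) alpha
eigval, A = eigval[::-1], A[:, ::-1]                     # sort eigenvalues in decreasing order

# Scores of the first two kernel principal components for the training points;
# eigenvectors are rescaled so that the corresponding functions f have unit H-norm.
scores = K_tilde @ (A[:, :2] / np.sqrt(np.clip(eigval[:2], 1e-12, None)))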

6 Bootstrap

Suppose that we have data X ∼ P , and θ ≡ θ(P ) is a parameter of interest. Let
θ̂ ≡ θ̂(X) be an estimator of θ. Suppose that we want to construct a level-
(1 − 2α) confidence interval for θ, i.e., find κα and κ1−α such that

P(θ̂ − κα ≤ θ ≤ θ̂ + κ1−α ) ≥ 1 − 2α. (72)

How do we find (estimate) κα and κ1−α in such a general setting?

Problem: The distribution of θ̂ − θ depends on P and might be unknown. Even


if we know the asymptotics (e.g., asymptotically normal), we may want more accu-
rate quantiles for a fixed sample size. In some situations, the asymptotic limiting
distribution can depend on nuisance parameters that can be hard to estimate.

In these situations we can use the bootstrap.

To motivate the bootstrap method, let us consider the following simple scenario.
Suppose that we model our data X = (X1 , . . . , Xn ) as a random sample from some
distribution P ∈ P, where P is a class of probability distributions. Let η(X, P )
be a root, i.e., a random variable that possibly depends on both the distribution

P and the sample X drawn from P (e.g., think of η(X, P ) as n(X̄n − µ), where
P
X̄n = ni=1 Xi /n and µ = E(X1 )). In fact, θ̂ − θ (as described above) is a root.

In general, we may wish to estimate the mean or a quantile or some other probabilistic
feature or the entire distribution of η(X, P ). As mentioned above, the distribution of
θ̂ − θ depends on P and is thus unknown. Let Hn (x, P ) denote the c.d.f. of η(X, P ),
i.e.,
Hn (x, P ) := PP (η(X, P ) ≤ x). (73)
Of course, if we can estimate Hn (·, P ) then we can use this to construct CIs, test
hypotheses; e.g., if η(X, P ) = (θ̂−θ) then being able to estimate Hn (·, P ) immediately
yields estimates of κα and κ1−α as defined in (72).

Idea: What if we knew P and could draw unlimited replicated samples from P ?

In that case we could approximate Hn (x, P ) as follows: draw repeated samples from
P resulting in a series of values for the root η(X, P ), then we could form an estimate
of Hn (x, P ) by counting how many of the η(X, P )’s are ≤ x.

But, of course, we do not know P . However we can estimate P by P̂n and use the
above idea. This is the notion of bootstrap.

Definition 6.1 (Bootstrap). The bootstrap is a method of replacing (plug-in) the
unknown distribution P with a known distribution P̂n (estimated from the data) in
probability/expectation calculations.

The bootstrap approximation of Hn (·, P ) is Ĥn (·, P̂n ), where P̂n is an estimator of P
obtained from the observed data (that we think is close to P ), i.e.,

Ĥn (x, P̂n ) := P∗P̂n (η(X∗ , P̂n ) ≤ x|X). (74)

where P∗_{P̂n} (·|X) is the conditional probability given the observed data X (under the
estimated P̂n ). Thus, bootstrap estimates the distribution of η(X, P ) by that of


η(X∗ , P̂n ), where X∗ is a random sample (conditional on the data) drawn from the
distribution P̂n . The idea is that

if P̂n ≈ P, then Ĥn (·, P̂n ) ≈ Hn (·, P ).

Question: How do we find Ĥn (·, P̂n ), the distribution of η(X∗ , P̂n )?

Answer: In most cases, the distribution of η(X∗ , P̂n ) is difficult to compute analytically,
but it can almost always be approximated easily by Monte Carlo simulation.

The bootstrap can be broken down in the following simple steps:

• Find a “good” estimator P̂n of P .

• Draw a large number (say, B) of random samples X∗(1) , . . . , X∗(B) from the
distribution P̂n and then compute T ∗(j) := η(X∗(j) , P̂n ), for j = 1, . . . , B.

• Finally, compute the desired feature of η(X∗ , P̂n ) using the empirical c.d.f. H̃nB (·, P̂n )
of the values T ∗(1) , . . . , T ∗(B) , i.e.,

H̃nB (x, P̂n ) := (1/B) Σ_{j=1}^B I{T ∗(j) ≤ x},   for x ∈ R.

Intuitively,
H̃nB (·, P̂n ) ≈ Ĥn (·, P̂n ) ≈ Hn (·, P ),
where the first approximation is from Monte Carlo error (and can be as small as we
would like, by taking B as large as we want) and the second approximation is due to
the bootstrap method. If P̂n is a good approximation of P , then bootstrap can be
successful.

6.1 Parametric bootstrap

In parametric models it is more natural to take P̂n as the fitted parametric model.
Example 6.2 (Estimating the standard deviation of a statistic). Suppose that X1 , . . . , Xn
is a random sample from N (µ, σ 2 ). Suppose that we are interested in the parameter

θ = P(X ≤ c) = Φ( (c − µ)/σ ),

where c is a given known constant. A natural estimator of θ is its MLE θ̂:


 
c − X̄
θ̂ = Φ ,
σ̂
P P
where σ̂ 2 = n1 ni=1 (Xi − X̄)2 and X̄ = n1 ni=1 Xi .

Question: How do we estimate the standard deviation of θ̂? There is no easy closed
form expression for this.

Solution: We can bootstrap!

Draw many (say B) bootstrap samples of size n from

N (X̄, σ̂ 2 ) ≡ P̂n .

For the j’th bootstrap sample we compute a sample average X̄ ∗(j) and a sample standard
deviation σ̂ ∗(j) . Finally, we compute

θ̂ ∗(j) = Φ( (c − X̄ ∗(j) )/σ̂ ∗(j) ).
We can estimate the mean of θ̂ by θ̄∗ = (1/B) Σ_{j=1}^B θ̂ ∗(j) . The standard deviation of θ̂ can
then be estimated by the bootstrap standard deviation of the θ̂ ∗(j) values, i.e.,

[ (1/B) Σ_{j=1}^B (θ̂ ∗(j) − θ̄∗ )2 ]^{1/2} .
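A minimal Python sketch of this parametric bootstrap; the simulated data set, the constant c and B are illustrative choices.

# A sketch of the parametric bootstrap of Example 6.2: resample from
# N(X_bar, sigma_hat^2) and take the standard deviation of the bootstrapped
# theta_hat values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n, c, B = 50, 1.0, 2000
X = rng.normal(loc=0.5, scale=2.0, size=n)               # observed data (toy)

xbar, sig = X.mean(), X.std()                             # MLEs (std uses 1/n)
theta_hat = norm.cdf((c - xbar) / sig)

theta_star = np.empty(B)
for j in range(B):
    Xs = rng.normal(loc=xbar, scale=sig, size=n)          # bootstrap sample from P_hat_n
    theta_star[j] = norm.cdf((c - Xs.mean()) / Xs.std())

sd_boot = theta_star.std()                                # bootstrap estimate of sd(theta_hat)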

Example 6.3 (Comparing means when variances are unequal). Suppose that we
have two independent samples X1 , . . . , Xm and Y1 , . . . , Yn from two possibly different
normal populations. Suppose that

X1 , . . . , Xm are i.i.d. N (µ1 , σ12 ) and Y1 , . . . , Yn are i.i.d. N (µ2 , σ22 ).

Suppose that we want to test

H0 : µ1 = µ2 versus H1 : µ1 6= µ2 .

We can use the test statistic

U = (m + n − 2)^{1/2} (X̄m − Ȳn ) / { (1/m + 1/n)^{1/2} (S_X^2 + S_Y^2 )^{1/2} },

where X̄m = (1/m) Σ_{i=1}^m Xi , Ȳn = (1/n) Σ_{i=1}^n Yi , S_X^2 = Σ_{i=1}^m (Xi − X̄m )2 and S_Y^2 = Σ_{i=1}^n (Yi −
Ȳn )2 . Note that as σ1^2 ≠ σ2^2 , U does not necessarily follow a t-distribution.

Question: How do we find the critical value of this test?

The parametric bootstrap can proceed as follows:

First choose a large number B, and for j = 1, . . . , B, simulate (X̄m^{∗(j)} , Ȳn^{∗(j)} , S_X^{2∗(j)} , S_Y^{2∗(j)} ),
where all four random variables are independent with the following distributions:

• X̄m^{∗(j)} ∼ N (0, σ̂_X^2 /m),

• Ȳn^{∗(j)} ∼ N (0, σ̂_Y^2 /n),

• S_X^{2∗(j)} ∼ σ̂_X^2 χ2_{m−1} ,

• S_Y^{2∗(j)} ∼ σ̂_Y^2 χ2_{n−1} ,

where σ̂_X^2 = S_X^2 /(m − 1) and σ̂_Y^2 = S_Y^2 /(n − 1). Then we compute

U ∗(j) = (m + n − 2)^{1/2} (X̄m^{∗(j)} − Ȳn^{∗(j)} ) / { (1/m + 1/n)^{1/2} (S_X^{2∗(j)} + S_Y^{2∗(j)} )^{1/2} }
for each j. We approximate the null distribution of U by the empirical distribution
of the {U ∗(j) }_{j=1}^B . Let c∗n be the (1 − α/2)-quantile of the empirical distribution of
{U ∗(j) }_{j=1}^B . Then, we can reject H0 if

|U | > c∗n .

6.2 The nonparametric bootstrap

In problems where the distribution P is not indexed by a parametric family, a natural


estimator of P is the empirical distribution P̂n given by the distribution that puts
1/n-mass at each of the observed data points.
Example 6.4. Let X = (X1 , . . . , Xn ) be an i.i.d. sample from a distribution F on R.
Suppose that we want a CI for the median θ of F . We can base a CI on the sample
median M .

We want to estimate the distribution of M − θ. Let η(X, F ) := M − θ. We may
choose F̂ = Fn , the empirical distribution function of the observed data. Thus, our
method can be broken in the following steps:

• Choose a large number B and simulate many samples X∗(j) , for j = 1, . . . , B,


(conditionally i.i.d. given the data) from Fn . This reduces to sampling with
replacement from X.

• For each bootstrap sample we compute the sample median M ∗(j) and then find
  the appropriate sample quantiles of {M ∗(j) − M }_{j=1}^B . Observe that η(X∗ , Fn ) =
  M ∗ − M.
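A minimal Python sketch of this nonparametric bootstrap CI for the median; the data-generating distribution, B and α are illustrative choices.

# A sketch of the nonparametric bootstrap for the median (Example 6.4):
# resample with replacement from the data, bootstrap the root M* - M, and
# invert its quantiles to obtain a confidence interval.
import numpy as np

rng = np.random.default_rng(7)
n, B, alpha = 100, 2000, 0.05
X = rng.exponential(size=n)                               # observed data (toy)
M = np.median(X)

roots = np.empty(B)
for j in range(B):
    Xs = rng.choice(X, size=n, replace=True)              # a draw from F_n
    roots[j] = np.median(Xs) - M                           # eta(X*, F_n) = M* - M

lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
ci = (M - hi, M - lo)                                      # level-(1 - alpha) CI for the median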

6.3 Consistency of the bootstrap

Suppose that F̂n and F are the corresponding c.d.f.’s for P̂n and P respectively.
Suppose that P̂n is a consistent estimator of P . This means that at each x in the
support of X1 where F (x) is continuous, F̂n (x) → F (x) in probability or a.s. as
n → ∞10 . If, in addition, Ĥn (x, P ), considered as a functional of P , is continuous
in an appropriate sense, it can be expected that Ĥn (x, P̂n ) will be close to Hn (x, P ),
when n is large.

Observe that Ĥn (x, P̂n ) is a random distribution function (as it depends on the ob-
served data). Let ρ be any notion of distance between two probability distributions
that metrizes weak convergence, i.e., for any sequence of c.d.f.’s {Gn }n≥1 , we have
Gn →d G if and only if ρ(Gn , G) → 0 as n → ∞.

In particular, we can take ρ to be the Prohorov metric or the Lévy metric. For
simplicity, we can also use the uniform distance (Kolmogorov metric) between Gn
and G (which metrizes weak convergence if G is a continuous c.d.f.).
Definition 6.5. We say that the bootstrap is weakly consistent under ρ for η(Xn , P )
if

ρ(Hn , Ĥn ) →p 0 as n → ∞,

where Hn and Ĥn are defined in (73) and (74) respectively. We say that the bootstrap
is strongly consistent under ρ for η(Xn , P ) if

ρ(Hn , Ĥn ) →a.s. 0 as n → ∞.

10
If F is a continuous c.d.f., then it follows from Polya’s theorem that F̂n → F in probability or
a.s. uniformly over x. Thus, F̂n and F are uniformly close to one another if n is large.

In many problems, it can be shown that Hn (·, P ) converges in distribution to a limit
H(·, P ). In such situations, it is much easier to prove that the bootstrap is consistent
by showing that

ρ(Ĥn , H) → 0 a.s. (or in probability) as n → ∞.

In applications, e.g., for construction of CIs, we are quite often interested in approx-
imating the quantiles of Hn by that of Ĥn (as opposed to the actual c.d.f.). The
following simple result shows that weak convergence, under some mild conditions,
implies the convergence of the quantiles.

Exercise (HW4): Let {Gn }n≥1 be a sequence of distribution functions on the real line
converging weakly to a distribution function G, i.e., Gn (x) → G(x) at all continuity
points x of G. Assume that G is continuous and strictly increasing at y = G−1 (1 − α).
Then,

G_n^{−1} (1 − α) := inf{x ∈ R : Gn (x) ≥ 1 − α} → y = G^{−1} (1 − α).

The following theorem, although quite obvious, gives us a general strategy to prove
the consistency of the bootstrap in many problems.
Theorem 6.6. Let CP be a set of sequences {Pn ∈ P}n≥1 containing the sequence
{P, P, . . .}. Suppose that, for every sequence {Pn } ∈ CP , Hn (·, Pn ) converges weakly
to a common limit H(·, P ). Let Xn be a sample of size n from P . Assume that P̂n
is an estimator of P based on Xn such that {P̂n } falls in CP with probability one.
Then,

ρ(Hn (·, P ), Ĥn (·, P̂n )) →a.s. 0 as n → ∞.

If H(·, P ) is continuous and strictly increasing at H −1 (1 − α, P ) (0 < α < 1), then

Ĥn−1 (1 − α, P̂n ) →a.s. H −1 (1 − α, P ) as n → ∞.

Further, if H(x, P ) is continuous in x, then

K(Ĥn , Hn ) := sup_{x∈R} |Ĥn (x, P̂n ) − Hn (x, P )| →a.s. 0 as n → ∞.

The proof of the above theorem is also left as an exercise (HW4).

Remark 6.1. Often, the set of sequences CP can be described as the set of sequences
{Pn }n≥1 such that d(Pn , P ) → 0, where d is an appropriate “metric” on the space
of probabilities. Indeed, one should think of CP as a set of sequences {Pn } that
are converging to P in an appropriate sense. Thus, the convergence of Hn (·, Pn ) to
H(·, P ) is locally uniform in a specified sense. Unfortunately, the appropriate metric
d will depend on the precise nature of the problem and the choice of the root.

Theorem 6.6 essentially says that to prove the consistency of the bootstrap it is enough
to try to understand the limiting behavior of Hn (·, Pn ), where Pn is any sequence
of distributions “converging” (in some appropriate sense) to P . Thus, quite often,
showing the consistency of the bootstrap boils down to showing the weak convergence
of η(Xn , Pn ) under a triangular array setup, as Xn is now an i.i.d. sample from Pn . For
example, if the CLT plays a crucial role in proving that Hn (·, P ) converges weakly to
a limit H(·, P ), the Lindeberg-Feller CLT theorem can be used to show that Hn (·, Pn )
converges weakly to H(·, P ).
Theorem 6.7 (Bootstrapping the sample mean). Suppose X1 , X2 , . . . , Xn are i.i.d. F
and that σ 2 := VarF (X1 ) < ∞. Let η(X, F ) := √n(X̄n − µ), where µ := EF (X1 ) and
X̄n := Σ_{i=1}^n Xi /n. Then,

K(Ĥn , Hn ) = sup_{x∈R} |Hn (x) − Ĥn (x)| →p 0 as n → ∞,

where Ĥn (x) ≡ Ĥn (x, Fn ) and Fn is the empirical c.d.f. of the sample X1 , X2 , . . . , Xn .

Exercise (HW4): Show that for almost all sequences X = {X1 , X2 , . . .}, the con-
ditional distribution of √n(X̄n∗ − X̄n ), given X, converges in law to N (0, σ 2 ) by the
triangular array CLT (Lindeberg CLT).

Exercise (HW4): Show the following joint (unconditional) asymptotic distribution:


( √n(X̄n − µ), √n(X̄n∗ − X̄n ) ) →d (Z1 , Z2 ),

where Z1 , Z2 are i.i.d. N (0, σ 2 ). In fact, a more general version of the result is true.
Suppose that (Un , Vn ) is a sequence of random vectors such that Un →d Z ∼ H (some
Z) and Vn |Un →d Z (the same Z) a.s. Then (Un , Vn ) →d (Z1 , Z2 ), where Z1 , Z2 are
i.i.d. H.

Exercise (HW4): What do you think would be the limiting behavior of √n(X̄n∗ − µ),
conditional on the data X?

6.4 Second-order accuracy of the bootstrap

One philosophical question about the use of the bootstrap is whether the bootstrap
has any advantages at all when a CLT is already available. To be specific, suppose
that η(X, F ) = √n(X̄n − µ). If σ 2 := VarF (X1 ) < ∞, then

√n(X̄n − µ) →d N (0, σ 2 ) and K(Ĥn , Hn ) →p 0 as n → ∞.

So two competitive approximations to Hn (x) are Φ(x/σ̂n ) (where σ̂n2 := (1/n) Σ_{i=1}^n (Xi −
X̄n )2 ) and Ĥn (x, Fn ). It turns out that, for certain types of statistics, the bootstrap
approximation is (theoretically) more accurate than the approximation provided by
the CLT. Because any normal distribution is symmetric, the CLT cannot capture
information about the skewness in the finite sample distribution of η(X, F ). The
bootstrap approximation does so. So the bootstrap succeeds in correcting for skew-
ness, just as an Edgeworth expansion13 would do. This is called Edgeworth correction
by the bootstrap, and the property is called second-order accuracy of the bootstrap.
Theorem 6.8 (Second-order accuracy). Suppose X1 , X2 , . . . , Xn are i.i.d. F and
that σ 2 := VarF (X1 ) < ∞. Let η(X, F ) := √n(X̄n − µ)/σ, where µ := EF (X1 ) and
X̄n := Σ_{i=1}^n Xi /n. If EF |X1 |3 < ∞ and F is continuous, then,

K(Ĥn , Hn ) = op (n−1/2 ) as n → ∞,

where Ĥn (x) ≡ Ĥn (x; Fn ) is the c.d.f. of η(X∗ , Fn ) := √n(X̄n∗ − X̄n )/σ̂ (with σ̂ 2 =
(1/n) Σ_{i=1}^n (Xi − X̄n )2 ) and Fn is the empirical c.d.f. of the sample X1 , X2 , . . . , Xn .

Remark 6.2 (Rule of thumb). Let X1 , X2 , . . . , Xn be i.i.d. F and η(X, F ) be a root.
If η(X, F ) →d N (0, τ 2 ), where τ does not depend on F , then second-order accuracy
is likely. Proving it will depend on the availability of an Edgeworth expansion for
η(X, F ). If τ depends on F (i.e., τ = τ (F )), then the bootstrap should be just
first-order accurate.

13
We note that T := √n(X̄n − µ)/σ admits the following Edgeworth expansion:

P(T ≤ x) = Φ(x) + p1 (x|F )φ(x)/√n + p2 (x|F )φ(x)/n + smaller order terms,

where p1 (x|F ) and p2 (x|F ) are polynomials in x with coefficients depending on F .

6.5 Failure of the bootstrap

In spite of the many consistency theorems in the previous sections, there are instances
where the ordinary bootstrap based on sampling with replacement from Fn actually
does not work. Typically, these are instances where the root η(X, F ) fails to admit
a CLT. Before seeing a few examples, we list a few situations where the ordinary
bootstrap fails to estimate the c.d.f. of η(X, F ) consistently:

(a) η(X, F ) = √n(X̄n − µ) when VarF (X1 ) = ∞.

(b) η(X, F ) = √n(g(X̄n ) − g(µ)) and ∇g(µ) = 0.

(c) η(X, F ) = √n(g(X̄n ) − g(µ)) and g is not differentiable at µ.

(d) The underlying population Fθ is indexed by a parameter θ, and the support of
Fθ depends on the value of θ.

(e) The underlying population Fθ is indexed by a parameter θ, and the true value
θ0 belongs to the boundary of the parameter space Θ.

Exercise (HW4): Let X = (X1 , X2 , . . . , Xn ) be an i.i.d. sample from F and σ 2 =
VarF (X1 ) = 1. Let g(x) = |x| and let η(X, F ) = √n(g(X̄n ) − g(µ)). If the true value
of µ is 0, then by the CLT for X̄n and the continuous mapping theorem, η(X, F ) →d |Z|
with Z ∼ N (0, σ 2 ). Show that the bootstrap does not work in this case.

6.6 Subsampling: a remedy to the bootstrap

The basic idea of subsampling is to approximate the sampling distribution of a statistic


based on the values of the statistic computed over smaller subsets of the data. For
example, in the case where the data are n observations that are i.i.d., a statistic is

computed based on the entire data set and is recomputed over all nb data sets of size
b. These recomputed values of the statistic are suitably normalized to approximate
the true sampling distribution.

Suppose that X1 , . . . , Xn is a sample of n i.i.d. random variables having a common


probability measure denoted by P . Suppose that the goal is to construct a confidence
region for some parameter θ(P ) ∈ R.

Let θ̂n ≡ θn (X1 , . . . , Xn ) be an estimator of θ(P ). It is desired to estimate or ap-


proximate the true sampling distribution of θ̂n in order to make inferences about
θ(P ).

Let Hn (·, P ) be the sampling c.d.f. of τn (θ̂n − θ) based on a sample of size n from P ,
where τn is a normalizing constant. Essentially, the only assumption that we will need
to construct asymptotically valid confidence intervals for θ(P ) is the following: there

exists a limiting non-degenerate c.d.f. H(·, P ) such that Hn (·, P ) converges weakly to
H(·, P ) as n → ∞.

To describe the method let Y1 , . . . , YNn be equal to the Nn := (n choose b) subsets of size b of
{X1 , . . . , Xn }, ordered in any fashion. Of course, the Yi ’s depend on b and n, but this
notation has been suppressed. Only a very weak assumption on b will be required.
In typical situations, it will be assumed that b/n → 0 and b → ∞ as n → ∞.

Now, let θ̂n,b,j be equal to the statistic θ̂b evaluated at the data set Yj . The approxi-
mation to Hn (x, P ) we study is defined by

Ln,b (x) = (1/Nn ) Σ_{j=1}^{Nn} I{τb (θ̂n,b,j − θ̂n ) ≤ x}.

The motivation behind the method is the following. For any j, Yj is a random sample
of size b from P . Hence, the exact distribution of τb (θ̂n,b,j − θ(P )) is Hb (·, P ). The
empirical distribution of the Nn values of τb (θ̂n,b,j − θ̂n ) should then serve as a good
approximation to Hb (·, P ) ≈ Hn (·, P ). Of course, θ(P ) is unknown, so we replace θ(P )
by θ̂n , which is asymptotically permissible because τb (θ̂n − θ(P )) is of order τb /τn → 0.
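A Python sketch of Ln,b is given below; instead of enumerating all (n choose b) subsets it uses B randomly chosen subsets, which is how the method is typically implemented in practice. The statistic (the sample mean), the rate τn = √n and the data are illustrative choices.

# A sketch of the subsampling distribution L_{n,b} using B random subsets of
# size b (rather than all n-choose-b of them).
import numpy as np

rng = np.random.default_rng(8)
n, b, B = 500, 50, 2000
X = rng.standard_t(df=5, size=n)                          # observed data (toy)

theta_hat = X.mean()                                       # statistic: the sample mean
tau = np.sqrt                                              # tau_n = sqrt(n)

vals = np.empty(B)
for j in range(B):
    sub = rng.choice(X, size=b, replace=False)             # a subset of size b
    vals[j] = tau(b) * (sub.mean() - theta_hat)            # tau_b (theta_hat_{n,b,j} - theta_hat_n)

# L_{n,b}(x) on a grid of points, approximating H_n(x, P)
xs = np.linspace(-4, 4, 9)
L_nb = np.array([(vals <= x).mean() for x in xs])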
Theorem 6.9. Assume that there exists a limiting non-degenerate c.d.f. H(·, P ) such
that Hn (·, P ) converges weakly to H(·, P ) as n → ∞. Also assume τb /τn → 0, b → ∞,
and b/n → 0 as n → ∞.
(i) If x is a continuity point of H(·, P ), then Ln,b (x) →p H(x, P ).

(ii) If H(·, P ) is continuous, then sup_x |Ln,b (x) − Hn (x, P )| →p 0.

(iii) Assume τb (θ̂n − θ(P )) → 0 almost surely and, for every d > 0,

Σ_n exp{−d(n/b)} < ∞.

Then, the convergences in (i) and (ii) hold with probability one.

Proof. See the proof of Theorem 2.2.1 in [8].

6.7 Bootstrapping regression models

Regression models are among the key ones that differ from the i.i.d. setup and are
also among the most widely used. Bootstrap for regression cannot be model-free; the
particular choice of the bootstrap scheme depends on whether the errors are i.i.d. or

not. We will only talk about the linear model with deterministic x’s and i.i.d. errors.
Additional moment conditions will be necessary depending on the specific problem to
which the bootstrap will be applied; see e.g., [4]. First let us introduce some notation.

We consider the model


yi = β > xi + i ,
where β is a p × 1 (p < n) vector and so is xi , and i ’s are i.i.d. F with mean 0 and
variance σ 2 < ∞.

Let X be the n × p design matrix with the i’th row equal to xi and let Y :=
(y1 , . . . , yn ) ∈ Rn . The least squares estimator of β is defined as

β̂n := argmin_{β∈Rp} Σ_{i=1}^n (yi − xi⊤ β)2 = (X ⊤ X)−1 X ⊤ Y,

where we assume that X ⊤ X is nonsingular.

We may be interested in the sampling distribution of

(X > X)−1 (β̂n − β) ∼ Hn (F ).

First observe that Hn only depends on F . The residual bootstrap scheme is described
below.

Compute the residual vector

ε̂ = (ε̂1 , . . . , ε̂n )⊤ := Y − X β̂n .

We consider the centered residuals:

ε̃i = yi − xi⊤ β̂n − (1/n) Σ_{j=1}^n ε̂j ,   for i = 1, . . . , n.

The bootstrap estimator of the distribution Hn (F ) is Hn (F̃n ), where F̃n is the empir-
ical c.d.f. of ε̃1 , . . . , ε̃n .

We proved in class that an application of the Lindeberg-Feller CLT shows that the
above bootstrap scheme is consistent, under the conditions:

(i) p is fixed (as n grows);


(ii) (1/n) Xn⊤ Xn → Σ, where Σ is positive definite;

(iii) (1/√n) max_{i,j} |xij,n | → 0 as n → ∞, where Xn = ((xij,n )).
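A minimal Python sketch of this residual bootstrap scheme; the design matrix, β and the error distribution below are illustrative choices.

# A sketch of the residual bootstrap for the linear model: resample centered
# residuals, rebuild Y*, and refit least squares.
import numpy as np

rng = np.random.default_rng(9)
n, p, B = 200, 3, 1000
X = np.c_[np.ones(n), rng.normal(size=(n, p - 1))]
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
res = y - X @ beta_hat
res_c = res - res.mean()                                   # centered residuals

beta_star = np.empty((B, p))
for j in range(B):
    y_star = X @ beta_hat + rng.choice(res_c, size=n, replace=True)
    beta_star[j] = np.linalg.lstsq(X, y_star, rcond=None)[0]

# Bootstrap approximation of the sampling variability of beta_hat,
# e.g. componentwise standard errors:
se_boot = beta_star.std(axis=0)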

6.8 Bootstrapping a nonparametric function: the Grenander
estimator

Consider X1 , . . . , Xn i.i.d. from a nonincreasing density f0 on [0, ∞). The goal is to es-
timate f0 nonparametrically. In particular, we consider the nonparametric maximum
likelihood estimator (NPMLE) of f0 , defined as

f˜n := argmax_{f ↓} Π_{i=1}^n f (Xi ),

where the maximization is over all nonincreasing densities on [0, ∞). It can be shown
that
f˜n = LCM0 [Fn ],
where Fn is the empirical c.d.f. of the data, and LCM0 [Fn ] denotes the right-hand
slope of the least concave majorant of Fn ; see e.g.,
http://www.math.yorku.ca/~hkj/Teaching/Bristol/notes.pdf
for the characterization, computation and theoretical properties of f˜n .
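As a hedged illustration of the computation, the following Python sketch computes f̃n as the left-hand slopes of the least concave majorant of the empirical c.d.f., obtained from the upper convex hull of the points (0, 0), (X(1) , 1/n), . . . , (X(n) , 1); the data-generating density is an illustrative choice.

# A sketch of computing the Grenander estimator f_tilde_n = LCM_0[F_n].
import numpy as np

rng = np.random.default_rng(10)
n = 200
X = np.sort(rng.exponential(size=n))                       # i.i.d. from a nonincreasing density

pts = np.vstack([[0.0, 0.0], np.c_[X, np.arange(1, n + 1) / n]])

# Upper (concave) hull of the empirical c.d.f., scanned left to right:
hull = []
for p in pts:
    while len(hull) >= 2:
        a, b = hull[-2], hull[-1]
        # pop b if it lies on or below the chord from a to p (non-concave turn)
        if (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0:
            hull.pop()
        else:
            break
    hull.append(p)
hull = np.array(hull)

slopes = np.diff(hull[:, 1]) / np.diff(hull[:, 0])          # f_tilde_n on each hull segment

def f_tilde(t):
    """Grenander estimator (right-hand slope of the LCM) at t in (0, X_(n)]."""
    k = np.searchsorted(hull[1:, 0], t, side="left")
    return slopes[min(k, len(slopes) - 1)]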

In class we considered bootstrapping the Grenander estimator f˜n , the NPMLE of f0 ,


at a fixed point t0 > 0, in the interior of the support of f0 . We sketched a proof
of the inconsistency of bootstrapping from Fn or LCM0 [Fn ]; see [11] for the details.
We also derived sufficient conditions for the consistency of any bootstrap scheme in
this problem. Furthermore, we showed that we can consistently bootstrap from a
smoothed version of f˜n .

7 Multiple hypothesis testing

In the multiple hypothesis testing14,15 problem we wish to test many hypotheses si-
multaneously. The null hypotheses are denoted by H0,i , i = 1, . . . , n, where n denotes
the total number of hypotheses.

Consider a prototypical example: we test n = 1000 null hypotheses at level 0.05 (say).
Suppose that everything is null (i.e., all the null hypotheses are true) — even then
on an average we expect 50 rejections.

In general, the problem is how do we detect the true non-null effects (hypotheses
where the null is not true) when a majority of the null hypotheses are true? This
question has received a lot of attention in the statistical literature, particularly in
genomic experiments. Consider the following example.
Example 7.1 (Prostate cancer study). DNA microarrays measure expression levels
of tens of thousands of genes. The data consist of levels of mRNA, which are thought
to measure how much of a protein the gene produces. A larger number implies a more
active gene.

Suppose that we have n genes and data on the expression levels for each gene among
healthy individuals and those with prostate cancer. In the example considered in [3],
n = 6033 genes were measured on 50 control patients and 52 patients with prostate
cancer. The data obtained are (Xij ) where

Xij = gene expression level on gene i for the j’th individual.

We want to test the effect of the i’th gene. For the i’th gene, we use the following
test statistic:
(X̄i·P − X̄i·C )/sd(. . .) ∼ t100 ,   under H0,i ,
where X̄i·P denotes the average expression level for the i’th gene for the 52 cancer
patients and X̄i·C denotes the corresponding value for the control patients and sd(. . .)
denotes the standard error of the difference. We reject the null H0,i for gene i if the
test statistic exceeds the critical value t_{100}^{−1} (1 − α), for α ∈ (0, 1).

There are two main questions that we will address on this topic:
14
Many thanks to Jimmy K Duong for scribing the lecture notes based on which this section is
adapted.
15
Most of the material here can be found in the lecture notes by Emmanuel Candes; see
http://statweb.stanford.edu/~candes/stats300c/lectures.html.

• Global testing. In global testing, our primary interest is not on the n hypotheses
H0,i , but instead on the global hypothesis H0 : ∩ni=1 H0,i , the intersection of the
H0,i ’s.

• Multiple testing. In this scenario we are concerned with the individual hypothe-
ses H0,i and want to say something about each hypothesis.

7.1 Global testing

Consider the following prototypical (Gaussian sequence model) example:

yi = µi + zi , for i = 1, . . . , n, (75)

where zi ’s are i.i.d. N (0, 1), the µi ’s are unknown constants and we only observe the
yi ’s. We want to test

H0,i : µi = 0 versus H1,i : µi 6= 0 (or µi > 0).

In global testing, the goal is to test the hypothesis:

H0 : µi = 0, for all i (no signal), versus H1 : at least one µi is non-zero.

The complication is that if we do each of these tests H0,i at level α, and then want
to combine them, the global null hypothesis H0 might not have level α. This is the
first hurdle.

Data: p1 , p2 , . . . , pn : p-values for the n hypotheses.

We will assume that under H0,i , pi ∼ Unif(0, 1). (We are not assuming independence
among the pi ’s yet.)

7.2 Bonferroni procedure

Suppose that α ∈ (0, 1) is given. The Bonferroni procedure can be described as:

• Test H0,i at level α/n, for all i = 1, . . . , n.

• Reject the global null hypothesis H0 if we reject H0,i for some i.

This can be succinctly expressed as looking at the minimum of the p-values, i.e.,
α
Reject H0 if min pi ≤ .
i=1,...,n n

Question: Is this a valid level-α test, i.e., is PH0 (Type I error) ≤ α? Answer: Yes.
Observe that

PH0 (Rejecting H0 ) = PH0 ( min_{i=1,...,n} pi ≤ α/n )
                    = PH0 ( ∪_{i=1}^n {pi ≤ α/n} )
                    ≤ Σ_{i=1}^n PH0,i (pi ≤ α/n)   (crude upper bound)
                    = n · α/n,   since pi ∼ Unif([0, 1]) under the null
                    = α.

So this is a valid level-α test, whatever the pi ’s are (the pi ’s could be dependent).

Question: Are we being too conservative (the above is an upper bound)? As we are
testing each hypothesis using a very small level α/n most of the p-values would fail
to be significant. The feeling is that we need a very strong signal for some i to detect
the global null using the Bonferroni method.

Answer: We are not doing something very crude, if all the p-values are independent.

Question: What is the exact level of the test?

Answer: If the pi ’s are independent, then observe that


 
PH0 min pi ≤ α/n = 1 − PH0 (∩ni=1 {pi > α/n})
i
Y
n
=1− PH0,i (pi > α/n) (using independence)
i=1
 α n
=1− 1−
n
as n→∞
−−−−−→ 1 − e−α
≈α (for α small).

Thus, the Bonferroni approach is not a bad thing to do, especially when we have
independent p-values.

7.2.1 Power of the Bonferroni procedure

Let us now focus on the power of the Bonferroni method. To discuss power we need
a model for the alternative.

Question: Consider the example of the Gaussian sequence model mentioned pre-
viously. Under what scenario for the µi ’s do we expect the Bonferroni test to do
well?

Answer: If we have (a few) strong signals, then the Bonferroni procedure is good.
We will try to formalize this now.

In the Gaussian sequence model the Bonferroni procedure reduces to: Reject H0,i
(H0,i : µi = 0 vs. H1,i : µi > 0) if
yi > zα/n ,
where zα/n is the (1 − α/n)’th quantile of the standard normal distribution.

Question: How does zα/n behave? Do we know its order (when α is fixed and n is
large)?

Answer: As a first approximation, zα/n is like √(2 log n) (an important number for
Gaussian random variables).

Fact 1. Here is a fact from extreme value theory about the order of the maximum of
the zi ’s, i.e., maxi=1,...,n zi :
max_{i=1,...,n} zi / √(2 log n) →a.s. 1,

i.e., if we have a bunch of n independent standard normals, the maximum is like
√(2 log n) (Exercise: show this).

Fact 2. Bound on 1 − Φ(t):

(φ(t)/t)(1 − 1/t2 ) ≤ 1 − Φ(t) ≤ φ(t)/t,

which implies that

1 − Φ(t) ≈ φ(t)/t   for t large.

Here is a heuristic proof of the fact that zα/n ≈ √(2 log n): set

1 − Φ(t) ≈ φ(t)/t = α/n   ⇔   e^{−t²/2} /(√(2π) t) = α/n
⇔   −t²/2 = log(√(2π) t) + log(α/n) ≈ log(α/n)   (as log(√(2π) t) is a smaller order term)
⇒   t² ≈ −2 log(α/n) = 2 log n − 2 log α ≈ 2 log n.

The mean of max_{i=1,...,n} zi is like √(2 log n) and the fluctuations around the mean are of
order Op (1).

Exercise: Use the Gaussian concentration inequality to derive this result. Note that
the maximum is a Lipschitz function.

To study the power of the Bonferroni procedure, we consider the following stylized
regimes (in the following the superscript (n) is to allow the variables to vary with n):

(i) µ1^{(n)} = (1 + ε)√(2 log n) and µ2 = . . . = µn = 0,

(ii) µ1^{(n)} = (1 − ε)√(2 log n) and µ2 = . . . = µn = 0,

where ε > 0. So, in both settings, we have one strong signal, and everything else is
0.

In case (i), the signal is slightly stronger than √(2 log n); and in case (ii), the signal
is slightly weaker than √(2 log n). We will show that Bonferroni actually works for
case (i) (by that we mean the power of the test actually goes to 1). Meanwhile, the
Bonferroni procedure fails for case (ii) — the power of the test converges to α.

This is not only a problem with the Bonferroni procedure — it can be shown that no
test can detect the signal in case (ii).

Case (i):

P( max_i yi > zα/n ) = P( {y1 > zα/n } ∪ {max_{i=2,...,n} yi > zα/n } )
                     ≥ P( y1 > zα/n )
                     ≈ P( z1 > √(2 log n) − (1 + ε)√(2 log n) ) → 1.

In this regime, just by looking at y1 , we will be able to detect that H0 is not true.

Case (ii):

P( max_i yi > zα/n ) ≤ P( y1 > zα/n ) + P( max_{i=2,...,n} yi > zα/n ).

Note that the first term is equal to P( z1 > ε√(2 log n) ) → 0 as n → ∞; whereas the
second term converges to 1 − e−α . Hence, we have shown that in this case the power
of the test is less than or equal to the level of the test. So the test does as well as
just plain guesswork.

This shows the dichotomy in the Bonferroni procedure; that by just changing the
signal strength you can always recover or you can fail (1 − α) of the time.

Whenever we have a hypothesis testing procedure, there has to be an effort in trying to
understand the power of the procedure. And it is quite often the case that different
tests (using different test statistics) are usually geared towards detecting different
kinds of departures from the null. Here, the Bonferroni procedure is geared towards
detecting sparse, strong signals.

7.3 Chi-squared test

Consider the Gaussian sequence model described in (75) and suppose that we want
to test the global null hypothesis:

H0 : µi = 0, for all i, (no signal) versus H1 : at least one µi is non-zero.

Letting Y = (y1 , . . . , yn ), the chi-squared test can be expressed as:

Reject H0 if T := kY k2 > χ2n (1 − α).

Note that under H0 ,


T ∼ χ2n ,
and under H1 ,
T ∼ χ2n (kµk2 ),
where µ = (µ1 , . . . , µn ) ∈ Rn and χ2n (kµk2 ) denotes the non-central χ2n distribution
with non-centrality parameter kµk2 .

This test is going to have high power when kµk2 is large. So, this test would have
high power when there are many weak signals (even if each µi is slightly different
from zero as we square it and add these up we can get a substantially large kµk2 ).
The Bonferroni procedure may not be able to detect a scenario like this — given α/n
to each hypothesis if the signal strengths are weak all of the p-values (for the different
hypotheses) might be considerably large.

7.4 Fisher’s combination test

Suppose that p1 , . . . , pn are the n p-values obtained from the n hypotheses tests. We
assume that the pi ’s are independent. The Fisher’s combination test rejects the global
null hypothesis if
T := −2 Σ_{i=1}^n log pi
is large. Observe that, under H0 ,
T := −2 Σ_{i=1}^n log pi ∼ χ2_{2n} .

This follows from the fact that under H0,i ,

− log pi ∼ Exp(1) ≡ Gamma(1, 1).

Again, as this test is aggregating the p-values, it will hopefully be able to detect the
presence of many weak signals.
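The following Python sketch compares the three global tests discussed so far (Bonferroni, chi-squared and Fisher's combination) on one draw from the Gaussian sequence model (75); the signal configuration and the level are illustrative choices.

# A sketch of the three global tests on simulated data from model (75).
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(11)
n, alpha = 1000, 0.05
mu = np.zeros(n)
mu[0] = 1.2 * np.sqrt(2 * np.log(n))                      # one strong (sparse) signal
y = mu + rng.normal(size=n)
p = 1 - norm.cdf(y)                                        # one-sided p-values

bonferroni = p.min() <= alpha / n                          # global Bonferroni test
chi_squared = np.sum(y ** 2) > chi2.ppf(1 - alpha, df=n)   # chi-squared test
fisher = -2 * np.sum(np.log(p)) > chi2.ppf(1 - alpha, df=2 * n)  # Fisher's combination test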

7.5 Multiple testing/comparison problem: false discovery rate

Until now, we have been considering tests of the global null H0 = ∩i H0,i . For some
testing problems, however, our goal is to accept or reject each individual H0,i . Given
n hypotheses, we have four types of outcomes in multiple testing:
               Accept H0,i   Reject H0,i   Total
H0,i true           U             V          n0
H0,i false          T             S        n − n0
Total             n − R           R           n
where R = number of rejections is an observed random variable; U, V, S, T are unob-
served random variables. Note that

V = number of false discoveries.

Suppose that the hypotheses indexed by I0 ⊆ {1, . . . , n} are truly null with |I0 | = n0
and the remaining hypotheses are non-null.

Ideally, we would not like to make false discoveries. But if you are not willing to make
any false discoveries, which basically translates to our threshold/cutoff being really
large for each test, then we will not be able to make any discoveries at all.

Traditionally, statisticians want to control the family-wise error rate (FWER) :

FWER = P(V ≥ 1).

It is very easy to design a test whose FWER is controlled by a predetermined level
α: reject or accept each hypothesis H0,i according to a test whose type I error is at
most α/n. Indeed, this is the Bonferroni method. By the union bound, one then has

FWER = P( ∪_{i∈I0} {Reject H0,i } ) ≤ Σ_{i∈I0} P(Reject H0,i ) ≤ αn0 /n ≤ α.

In modern theory of hypothesis testing, control of the FWER is considered too strin-
gent mainly because it leads to tests that fail to reject many non-null hypotheses as
well.

The false discovery rate (FDR) is an error control criterion developed in the 1990’s as
an alternative to the FWER. When the number of tests is in the tens of thousands or
even higher, FWER control is so stringent a criterion that individual departures from
the null have little chance of being detected. In such cases, it may be unreasonable
to control the probability of having any false rejections. Attempting to do so would
leave us with virtually no power to reject individual non-nulls. Sometimes, control of
FWER is even not quite needed.

A new point of view advanced by [1] proposes controlling the expected proportion
of errors among the rejected hypotheses. The false discovery proportion (FDP) is
defined as
V
FDP := .
max(R, 1)

FDP is an unobserved random variable, so the criterion we propose to control is its


expectation, which we refer to as the false discovery rate:

FDR := E(FDP).

The Benjamini-Hochberg (BH) procedure controls FDR at any desired level (e.g.,
suppose we take q = 0.2), i.e.,

FDR ≤ q = 0.2;

thus out of all of the rejections we make we are willing to have 20% of them be false,
on an average.

The BH procedure can be described as: suppose that p1 , . . . , pn are the p-values from
the n hypotheses tests. Let

p(1) ≤ p(2) ≤ . . . ≤ p(n)

be the sorted p-values. Let


 
i
i0 := max i ≤ n : p(i) ≤q , 0 < q < 1.
n

We reject all the hypotheses H0,(i) for 1 ≤ i ≤ i0 (reject those hypotheses with p-
values from p(1) to p(i0 ) ). Pictorially this can be easily expressed as: draw the line

104
with slope q passing through the origin and plot the ordered p-values, and reject all
the hypotheses whose p-values lie above the line after the last time it was below the
line.

Another way to view the BH procedure is via the following sequential description:
start with {i = n} and keep accepting the hypothesis corresponding to p(i) as long as
p(i) > qi/n. As soon as p(i) ≤ iq/n, stop and reject all the hypotheses corresponding
to p(j) for j ≤ i.
Theorem 7.2. Suppose that the p-values p1 , . . . , pn are independent. Then
 
V
FDR = E ≤ q.
max(R, 1)

Remark 7.1. Note that the above result states that the BH procedure controls FDR
for all configurations of {H0,i }ni=1 .

Proof. Without loss of generality suppose that H0,1 , . . . , H0,n0 are true. Observe that
n r s o
{R = r} = p(r) ≤ q, p(s) > q, ∀s > r .
n n
P 0
Further, under {R = r}, V = ni=1 1{pi ≤ nr q}. Thus,
n r o
p1 ≤ q, R = r
n n o
r r s
= p1 ≤ q, p(r) ≤ q, p(s) > q, ∀s > r
 n n n 
r (−1) r (−1) s+1
= p1 ≤ q, pr−1 ≤ q, ps > q, ∀s ≥ r
n n n
n r o
(−1)
= p1 ≤ q, R̃(p ) = r − 1 ,
n
(−1)
where p(−1) = (p2 , . . . , pn ) and R̃ = sup{1 ≤ i ≤ n − 1 : p(i) ≤ i+1
n
q}. Finally we can

105
show that
 
V
FDR = E 1{R 6= 0}
R
X n
V
= E( 1{R = r})
r=1
r
X
n
1
= E(V 1{R = r})
r=1
r
X
n
1X
n0  r 
= P pi ≤ q, R = r
r=1
r i=1 n
Xn
1  r 
= n0 P p1 ≤ q, R = r (by exchangeability)
r=1
r n
X n0  
n
r
= P p1 ≤ q)P(R̃(p(−1) ) = r − 1 (by independence)
r=1
r n
X
n
n0 r
= qP(R̃(p(−1) ) = r − 1)
r=1
r n
n0
= q ≤ q.
n

7.6 The Bayesian approach: connection to empirical Bayes

By formulating the multiple testing problem in a simple Bayesian framework, we are


able to construct procedures that control a quantity closely related to the FDR as we
have previously defined.

We assume that we have n hypotheses, which are null (H = 0) with probability π0


and non-null (H = 1) with probability 1 − π0 . Our observations {Xi }ni=1 (p-values/z-
values) are thus assumed to come from the mixture distribution

f (x) = π0 f0 (x) + (1 − π0 )f1 (x)

where f0 is the density of Xi if null is true (with c.d.f. F0 ; e.g., U [0, 1] or N (0, 1))
and f1 is the density of Xi otherwise (with c.d.f. F1 ). Let H denote the unobserved
variable that takes the value 0 or 1 depending on whether the null hypothesis is true
or not.

In this setup, we observe X ∈ A and wonder whether it is null or not. By Bayes’

106
rule, we can evaluate this probability to be
φ(A) := P(H = 0|X ∈ A) (posterior probability of the null hypothesis)
P(X ∈ A|H = 0)
=
P(X ∈ A)
R
π0 P0 (A) π0 A f0 (x)dx
= = ,
P (A) P (A)
where P0 (A) denotes the probability of a set A under the null distribution.

We can call the quantity φ(A) the Bayes false discovery rate (BFDR). If we report
x ∈ A as non-null, φ(A) is the probability that we have made a false discovery. What
should be A? If we reject H0,i if Xi > xc (e.g., if we are testing H0,i : µi = 0 vs.
H1,i : µi > 0) then A = [xc , ∞). In practice, we will have some critical value xc and
A will take one of the following forms:
[xc , ∞) (−∞, xc ] (−∞, −xc ] ∪ [xc , ∞). (76)

In order to make use of the above machinery, we need to have knowledge of π0 , f0


and f1 . It is extremely unlikely that we would know these quantities in practice. By
using empirical Bayes techniques, we are able to accurately estimate these quantities
based on our data, as explained below.

We proceed by assuming the following: (i) usually f0 is known (assumed N (0, 1) or


Unif(0, 1)); (ii) π0 is ‘almost known’, in the sense that it’s a fraction close to 1 in
many applications; (iii) f1 is unknown.

Without knowing P (A), the BFDR cannot be computed. However, we can estimate
this quantity by
1X
n
[
P (A) = 1A (Xi ).
n i=1
This yields the BFDR estimate:
[=π
\ = φ(A) b0 P0 (A)
BFDR .
P[ (A)
If n is large, then P[ \ may be a good estimate
(A) will be close to P (A), and thus BFDR
of BFDR.

7.6.1 Global versus local FDR

Classical BH theory only lets us discuss false discovery rates for tail sets of the
form (76). An advantage of the Bayesian theory is that we can now compute and

107
bound the FDR for generic measurable sets A. [3] likes to distinguish between the
“local” and “global” FDR rates:

Global FDR : FDR(xc ) = φ([xc , ∞)), Local FDR : FDR(xc ) = φ({xc }),

where FDR(xc ) will in general be well-defined provided all distributions have contin-
uous densities, i.e.,
π0 f0 (x0 )
φ({x0 }) = .
f (x0 )
These two quantities can be very different.
Example 7.3. Suppose that F0 = N (0, 1) and F1 = Unif(−10, 10), π0 = 1/2. In
other words, under the null hypotheses the test statistics are standard Gaussian,
whereas under the alternatives they have a uniform distribution over a medium-size
interval, and on average half the hypotheses are null. In this case:
1 − Φ(2) φ(2)
φ([2, ∞)) = ≈ 0.054, φ({2}) = ≈ 0.52.
8/20 + (1 − Φ(2)) 1/20 + φ(2)

Thus, a global FDR analysis suggests that x ≥ 2 is strong evidence for the alternative,
whereas a local FDR analysis tells us that in fact x = 2 is mild evidence for the null.
(There is no contradiction here — under the data generating distribution, given that
x ≥ 2 you would expect that x >> 2, and so the expected global FDR is small.)

The beauty of local FDR theory is that it can tell us the probability that any given
hypothesis is null, instead of just giving us the expected proportion of nulls among
all rejections. It’s down side, of course, is that it relies on more complex Bayesian
machinery. Standard BH theory (which is what people mostly use in practice) gives
us weaker global FDR type results, but requires much less assumptions to go through.
For more on this topic see [3, Chapter 5].

7.6.2 Empirical Bayes interpretation of BH(q)

How does the BH procedure relate to the empirical Bayes procedures we are dis-
cussing? First, we note that z-values map to p-values using the relation

pi = F0 (Xi ), (Xi is the test statistic).

Using this we observe that


i
p(i) = F0 (X(i) ), and = Fbn (X(i) ) ≈ F (X(i) ).
n
108
Thus,
i F0 (X(i) ) F0 (X(i) )
i : p(i) ≤ q ⇔ ≤q ≈ ≤ q.
n Fbn (X(i) ) F (X(i) )

\ was computed with π0 = 1, we observe that


Thus, assuming that BFDR

F0 (X(i) ) b
≤q ≈ φ((−∞, X(i) ]) ≤ q.
Fbn (X(i) )

The claim below then follows.

Claim: The empirical Bayes formulation of BH(q) is to reject H0,(i) for all i ≤ i0
where i0 is the largest index such that

\
BFDR((−∞, x(i0 ) ]) ≤ q.

Assuming independence of the test statistics, the FDR is at most q.

Note that π0 is usually unknown. However, usually we set π0 = 1, which results in a


conservative estimate of the FDR.

109
8 High dimensional linear regression

Consider the standard linear regression model

y = Xβ ∗ + w,

where X ∈ RN ×p is the design matrix, w ∈ RN is the vector of noise variables (i.e.,


E(w) = 0), and β ∗ ∈ Rp is the unknown coefficient vector. We are interested in
estimating β ∗ from the observed response y. In this section we consider the situation
where p  N (or p is comparable to N ) and study the performance of the lasso
estimator16 (least absolute shrinkage and selection operator; see e.g., [13]):

β̂ := argmin ky − Xβk22 , (77)


β∈Rp :kβk1 ≤R

where R > 0 is a tuning parameter. The above is sometimes called as the constrained
form of the lasso solution. An equivalent form (due to Lagrangian duality) is the
penalized version  
1 2
min ky − Xβk2 + λN kβk1 , (78)
β∈Rp 2N

where λN > 0 is the Lagrange multiplier associated with the constraint kβk1 ≤ R.

The lasso estimator performs both variable selection and regularization simultane-
ously; it has good prediction accuracy and offers interpretability to the statistical
model it produces. Figure 5 shows a simple illustration of the performance of the
constrained lasso estimator (and ridge regression17 ) and gives some intuition as to
why it can also perform variable selection.

Given a lasso estimate β̂ ∈ Rp , we can assess its quality in various ways. In some
settings, we are interested in the predictive performance of β̂, so that we might
compute a prediction loss function of the form
1
L(β̂, β ∗ ) := kXβ̂ − Xβ ∗ k22 ,
N

corresponding to the mean-squared error of β̂ over the given samples of X. If the


unknown vector β ∗ is of primary interest then a more appropriate loss function to
consider would be the `2 -error

L2 (β̂, β ∗ ) := kβ̂ − β ∗ k22 .


16
This material is mostly taken from [5].
17
In ridge regression we consider the problem: minβ∈Rp :kβk22 ≤R2 ky − Xβk22 .

110
Figure 2.1 Left: Coefficient path for the lasso, plotted versus the ¸1 norm of the
˜
coefficient vector, relative to the norm of the unrestricted least-squares estimate —.
Right: Same for ridge regression, plotted against the relative ¸2 norm.

β2 ^ . β2 ^ .
β β

β1 β1

Figure 2.2 Estimation picture for the lasso (left) and ridge regression (right). The
Estimation
Figure 5:solid blue areas picture for the lasso
are the constraint (left)
regions and2 | ridge
|—1 |+|— regression
Æ t and —12 +—22 Æ t(right).
2 The solid blue
, respectively,
while theare
red the
ellipses are the contours
regionsof|βthe
1| +residual-sum-of-squares
|β2 | ≤ t and β1 + β2 function.
2 2 The
areas constraint 2 ≤ t , respectively,
point —‚ depicts the usual (unconstrained) least-squares estimate.
while the red ellipses are the contours of the residual-sum-of-squares function.
The point β̂ depicts the usual (unconstrained) least-squares estimate.

8.1 Strong convexity

1
The lasso minimizes the least-squares loss fN (β) := 2N ky − Xβk22 subject to an `1 -
constraint. Let us suppose that the difference in function values ∆fN = |fN (β̂) −
fN (β ∗ )| converges to zero as the sample size N increases. The key question is the
following: what additional conditions are needed to ensure that the `2 -norm of the
parameter vector difference ∆β = kβ̂−β ∗ k2 also converges to zero? Figure 6 illustrates
two scenarios that suggest that the function fN has to be suitably “curved”.

A natural way to specify that a function is suitably “curved” is via the notion of
strong convexity. More specifically, given a differentiable function f : Rp → R, we
say that it is strongly convex with parameter γ > 0 at θ∗ ∈ Rp if the inequality
γ
f (θ) − f (θ∗ ) ≥ ∇f (θ∗ )> (θ − θ∗ ) + kθ − θ∗ k22
2
holds for all θ ∈ Rp . Note that this notion is a strengthening of ordinary convexity,
which corresponds to the case γ = 0. When the function f is twice continuously
differentiable, an alternative characterization of strong convexity is in terms of the
Hessian ∇2 f : in particular, the function f is strongly convex with parameter γ around
θ∗ ∈ Rp if and only if the minimum eigenvalue of the Hessian matrix ∇2 f (θ) is at
least γ for all vectors θ in a neighborhood of θ∗ .

111
the least-squares loss fN (—) = N1 Îy ≠ X—Î22 subject to an ¸1 -constraint.) Let
us suppose that the difference in function values fN = |fN (—) ‚ ≠ fN (— ú )|
converges to zero as the sample size N increases. The key question is the
following: what additional conditions are needed to ensure that the ¸2 -norm
of the parameter vector difference — = Η‚ ≠ — ú Î2 also converges to zero?

fN
fN

⇤ b ⇤ b

FigureFigure 11.2 Relation


6: Relation betweenbetween differences
differences in objective
in objective function
function values
values and and differ- in pa-
differences
ences in parameter values. Left: the function fN is relatively “flat” around its opti-
rameter values. Left: the function f is relatively “flat” around its optimum
mum —‚, so that a small function difference fNN = |fN (—‚) ≠ fN (— ú )| does not imply
function difference ∆f = |f strongly ∗ does not imply
(β̂) − fcurved
that β̂,— so
= Îthat
—‚ ≠ —aú Î
small
2 is small. Right: the functionNfN is N N (β )|around its
optimum, so that a small∗ difference
that ∆β = kβ̂ − β k2 is small. Right: fN in function values translates into a small
the function fN is strongly curved around
difference in parameter values.
its optimum, so that a small difference ∆fN in function values translates into a
To understand
small difference the
in issues involved,
parameter suppose that for some N , the objec-
values.
tive function fN takes the form shown in Figure 11.2(a). Due to the relative
8.2 “flatness” of the objective function around its minimum —, ‚ we see that
Restricted strong ‚ convexity
ú and ` 2 -error kβ̂ − β ∗ k the
difference fN = |fN (—) ≠ fN (— )| in function values is quite small while2 at
the same time the difference — = Η‚ ≠ — ú Î2 in parameter values is relatively
Let us large. In contrast,
now return to theFigure 11.2(b) shows
high-dimensional a moreindesirable
setting, which the situation,
numberinofwhich
parameters
the objective function has a high degree of curvature around
p might be larger than N . It is clear that the least-squares objective function its minimum f (β)
‚ In this case, a bound on the function difference fN = |fN (—)
—. ‚ ≠ fN (— ú )| N
is always convex; under what additional conditions is it also strongly convex? A
translates directly into a bound on — = Η‚ ≠ — ú Î2 .
How docalculation
straightforward we formalize the intuition
yields that ∇2 fcaptured
(β) = Xby> X/NFigurefor
11.2?
all βA ∈ Rp . Thus,
natural
way to specify
the least-squares that
loss is astrongly
functionconvex
is suitably “curved”
if and is via
only if the the notion of strong
eigenvalues of the p × p
convexity. More specifically, > given a differentiable function f : R æ R, we
p
positive semidefinite matrix X X are uniformly bounded away from zero. However,
say that it is strongly convex with parameter “ > 0 at ◊ œ Rp if the inequality
it is easy to see that any matrix of the form X> X has rank at most min{N, p}, so it is
always rank-deficient — and hence not strongly convex “ — whenever N < p. Figure 7
f (◊Õ ) ≠ f (◊) Ø Òf (◊)T (◊Õ ≠ ◊) + Î◊Õ ≠ ◊Î22 (11.8)
illustrates the situation. 2
hold for all ◊Õ œ Rp . Note that this notion is a strengthening of ordinary
For this reason, we need to relax our notion of strong convexity. It turns out, as will
convexity, which corresponds to the case “ = 0. When the function f is twice
continuously
be clarified by the differentiable,
analysis below, an that
alternative characterization
it is only necessary toofimpose
strong aconvexity
type of strong
convexity condition for some subset C ⊂ R of possible perturbation vectors ν ∈ Rp .
p

Definition 8.1 (Restricted strong convexity). We say that a function f : Rp → R


satisfies restricted strong convexity at θ∗ ∈ Rp with respect to C ⊂ Rp if there is a
constant γ > 0 such that

ν > ∇2 f (θ)ν
≥γ for all nonzero ν ∈ C,
kνk22

and for all θ ∈ Rp in a neighborhood of θ∗ .

112
conditions is it also strongly convex? A straightforward calculation yields that
Ò2 f (—) = XT X/N for all — œ Rp . Thus, the least-squares loss is strongly
convex if and only if the eigenvalues of the p ◊ p positive semidefinite matrix
XT X are uniformly bounded away from zero. However, it is easy to see that
any matrix of the form XT X has rank at most min{N, p}, so it is always
rank-deficient—and hence not strongly convex—whenever N < p. Figure 11.3
illustrates the situation.

⌫good

⌫bad

Figure 7: A convex
Figure 11.3lossAfunction in high-dimensional
convex loss function in high-dimensional settings (with
settings (with  N ) cannot be
p ∫ Np) can-
not be strongly convex; rather, it will be curved in some directions but flat in others.
strongly convex; rather, it will be‚curved
As shown in Lemma 11.1, the lasso error ‹
in some directions but flat in others.
= —‚ ≠ — ú must lie in a restricted subset C
As will be shown in later, the lasso error ν̂function
of . For this reason, it is only necessary that the loss be curved
β ∗ mustin certain
p
R = β̂ − lie in a restricted
directions of space.
subset C of R . For this reason, it is only necessary that the loss function be
p

curvedFor in this reason,


certain we need to of
directions relax our notion of strong convexity. It turns
space.
out, as will be clarified by the analysis below, that it is only necessary to
impose a type of strong convexity condition for some subset C µ Rp of possible
In the specific case of linear regression, this notion is equivalent to lower bounding
the restricted eigenvalues of the design matrix — in particular, requiring that
1 > 2 >
N
ν ∇ X Xν
≥γ for all nonzero ν ∈ C. (79)
kνk22
This is referred to as the γ-RE condition.

So, what constraint sets C are relevant? Suppose that the parameter vector β ∗ is
sparse — say supported on the subset S = S(β ∗ ). Defining the lasso error ν̂ = β̂ − β ∗ ,
let ν̂S ∈ R|S| denote the subvector indexed by elements of S, with ν̂S c defined in an
analogous manner. For appropriate choices of the `1 -ball radius — or equivalently, of
the regularization parameter λN — it turns out that the lasso error satisfies a cone
constraint of the form
kν̂S c k1 ≤ αkν̂S k1 ,
for some constant α ≥ 1. Thus, we consider a restricted set of the form

C(S, α) := {ν ∈ Rp : kνS c k1 ≤ αkνS k1 },

for some parameter α ≥ 1.


Theorem 8.2. Suppose that the design matrix X satisfies the restricted eigenvalue
bound (79) with parameter γ > 0 over C(S, 1). Then any estimate β̂ based on the
constrained lasso (77) with R = kβ ∗ k1 satisfies the bound
r
4 k kX> wk∞
kβ̂ − β ∗ k2 ≤ √ .
γ N N

113
Before proving this result, let us discuss the different factors in the above bound.
First, it is important to note that this result is deterministic, and apply to any set of
linear regression equations with a given observed noise vector w. Based on our earlier
discussion of the role of strong convexity, it is natural that lasso `2 -error is inversely
proportional to the restricted eigenvalue constant γ > 0. The second term k/N is
also to be expected, since we are trying to estimate an unknown regression vector
with k unknown entries based on N samples. As we have discussed, the final term
in both bounds, involving either kX> wk∞ , reflects the interaction of the observation
noise w with the design matrix X.
Example 8.3 (Classical linear Gaussian model). We begin with the classical linear
Gaussian model for which the noise w ∈ RN is Gaussian with i.i.d. N (0, σ 2 ) entries.
Let us view the design matrix X as fixed, with columns {x1 , . . . , xp }. For any given
column j ∈ {1, . . . , p}, a simple calculation shows that the random variable x> j w/N
2 kx k2
is distributed as N (0, σN Nj 2 ). Consequently, if the columns of the design matrix X
are normalized (meaning kxj k22 /N = 1 for all j = 1, . . . , p), then this variable has
2
N (0, σN ) distribution, so that we have the Gaussian tail bound
!
|x>
j w| N t2
P ≥ t ≤ 2e− 2σ2 for t > 0.
N

Since kX> wk∞ /N corresponds to the maximum over p such variables, the union
bound yields  > 
kX wk∞ N t2 1
P ≥ t ≤ 2e− 2σ2 +log p = 2e− 2 (τ −2) log p ,
N
q
where the second equality follows by setting t = σ τ log N
p
for some τ > 2. Conse-
quently, we conclude that the lasso error satisfies the bound
r
4σ τ k log p
kβ̂ − β ∗ k2 ≤
γ N
1
with probability at least 1 − 2e− 2 (τ −2) log p .

Proof of Theorem 8.2. In this case, since β ∗ is feasible and β̂ is optimal, we have
the inequality ky − Xβ̂k22 ≤ ky − Xβ ∗ k22 . Defining the error vector ν̂ := β̂ − β ∗ ,
substituting in the relation y = Xβ ∗ + w, and performing some algebra yields the
basic inequality
kXν̂k22 w> Xν̂
≤ . (80)
2N N
Applying a version of Hölder’s inequality to the right-hand side yields the upper
bound N1 |w> Xν̂| ≤ N1 kX> wk∞ kν̂k1 .

114
Next, we claim that the inequality kβ̂k1 ≤ R = kβ ∗ k1 implies that ν̂ ∈ C(S, 1).
Observe that

R = kβS∗ k1 ≥ kβ ∗ + ν̂k1
= kβS∗ + ν̂S k1 + kν̂S c k1
≥ kβS∗ k − kν̂S k1 + kν̂S c k1 .

Rearranging this inequality, we see that kν̂S c k1 ≤ kν̂S k1 , which shows that ν̂ ∈ C(S, 1).

Thus, we have

kν̂k1 = kν̂S k1 + kν̂S c k1 ≤ 2kν̂S k1 ≤ 2 kkν̂k2 ,
where we have used the Cauchy-Schwarz inequality in the last step.

On the other hand, applying the restricted eigenvalue condition to the left-hand side
of the inequality (80) yields
kν̂k22 kXν̂k22 w> Xν̂ 1 1 √
γ ≤ ≤ ≤ kX> wk∞ kν̂k1 ≤ kX> wk∞ 2 kkν̂k2 .
2 2N N N N
Putting together the pieces yields the claimed bound.

Exercise (HW 4): Suppose that the design matrix X satisfies the restricted eigenvalue
bound (79) with parameter γ > 0 over C(S, 3). Given a regularization parameter
λN ≥ 2kX> wk∞ /N > 0, show that any estimate β̂ from the regularized lasso (78)
satisfies the bound r
3 k√
kβ̂ − β ∗ k2 ≤ N λN .
γ N

8.3 Bounds on prediction error

In this section we focus on the Lagrangian lasso (78) and develop some theoretical
guarantees for the prediction error L(β̂, β) := N1 kXβ̂ − Xβ ∗ k22 .
Theorem 8.4. Consider the Lagrangian lasso with a regularization parameter λN ≥
2
N
kX> wk∞ .

(a) Any optimal solution β̂ satisfies


1
kXβ̂ − Xβ ∗ k22 ≤ 6kβ ∗ k1 λN .
N

(b) If β ∗ is supported on a subset S, and the design matrix X satisfies the γ-RE
condition (79) over C(S, 3), then any optimal solution β̂ satisfies
1 9
kXβ̂ − Xβ ∗ k22 ≤ |S|λ2N .
N γ

115
q
As we have discussed, for various statistical models, the choice λN = cσ log
N
p
is valid
for Theorem 8.4 with high probability, so the two bounds take the form
r
1 ∗ 2 log p
kXβ̂ − Xβ k2 ≤ c1 σR1 , and
N N
1 σ |S| log p
kXβ̂ − Xβ ∗ k22 ≤ c2 ,
N γ N
for suitable constants c1 , c2 . The first bound, which depends on the `1 -ball radius R1 ,
is known as the “slow rate” for the lasso, since the squared prediction error decays

as 1/ N . On the other hand, the second bound is known as the “fast rate” since it
decays as 1/N . Note that the latter is based on much stronger assumptions: namely,
the hard sparsity condition that β ∗ is supported on a small subset S, and more
disconcertingly, the γ-RE condition on the design matrix X. In principle, prediction
performance should not require an RE condition, so that one might suspect that this
requirement is an artifact of our proof technique. However, this dependence turns
out to be unavoidable for any polynomial-time method; see e.g., [18] where, under a
standard assumption in complexity theory, the authors prove that no polynomial-time
algorithm can achieve the fast rate without imposing an RE condition.

Proof of Theorem 8.4. Define the function


1
G(ν) := ky − X(β ∗ + ν)k22 + λN kβ ∗ + νk1 .
2N

Noting that ν̂ := β̂ − β ∗ minimizes G by construction, we have G(ν̂) ≤ G(0). Some


algebra yields the modified basic inequality:

kXν̂k22 w> Xν̂


≤ + λN {kβ ∗ k1 − kβ ∗ + ν̂k1 }. (81)
2N N
Thus,

kX> wk∞
0 ≤ kν̂k1 + λN {kβ ∗ k1 − kβ ∗ + ν̂k1 }
 > N 
kX wk∞
≤ − λN kν̂k1 + 2λN kβ ∗ k1
N
1
≤ λN {−kν̂k1 + 4kβ ∗ k1 } ,
2
where the last step uses the fact that N1 kX> wk∞ ≤ λN /2 (by assumption). Therefore,
kν̂k1 ≤ 4kβ ∗ k1 . Returning again to the modified basic inequality (81), we have

kXν̂k22 kX> wk∞ λN


≤ kν̂k1 + λN kβ ∗ k1 ≤ · 4kβ ∗ k1 + λN kβ ∗ k1 ≤ 3λN kβ ∗ k1 ,
2N N 2
116
which establishes (a).

To prove (b), observe that as βS∗ c = 0, we have kβ ∗ k1 = kβS∗ k1 , and

kβ ∗ + ν̂k1 = kβS∗ + ν̂S k1 + kν̂S c k1 ≥ kβS∗ k1 − kν̂S k1 + kν̂S c k1 .

Substituting this relation into the modified basic inequality (81) yields

kXν̂k22 w> Xν̂


≤ + λN {kν̂S k1 − kν̂S c k1 }.
2N N
kX> wk∞
≤ kν̂k1 + λN {kν̂S k1 − kν̂S c k1 }. (82)
N
Given the stated choice of λN , the above inequality yields

kXν̂k22 λN
≤ {kν̂S k1 + kν̂S c k1 } + λN {kν̂S k1 − kν̂S c k1 }
2N 2
3 3 √
≤ λN kν̂S k1 ≤ λN kkν̂k2 , (83)
2 2
where k := |S|.

Next we claim that the error vector ν̂ associated with any lasso solution β̂ belongs to
>
the cone C(S, 3). Since kX Nwk∞ ≤ λ2N , inequality (82) implies that

λN
0≤ kν̂k1 + λN {kν̂S k1 − kν̂S c k1 }.
2
Rearranging and then dividing out by λN > 0 yields that kν̂S c k1 ≤ 3kν̂S k1 as claimed.

As the error vector ν̂ belongs to the cone C(S, 3), the γ-RE condition guarantees that
kν̂k22 ≤ N1γ kXν̂k22 . Therefore, using (82) gives
s s
1 √ k 1 k
kXν̂k22 ≤ 3λN kkν̂k2 ≤ 3λN kXν̂k2 ⇒ √ kXν̂k2 ≤ 3λN .
N Nγ N γ

This completes the proof.

Exercise (HW 4): State and prove the analogous theorem for the constrained form of
the lasso (given in (77)) where you take R = kβ ∗ k1 .

8.4 Equivalence between `0 and `1 -recovery

As seen in Theorem 8.4(b), the `1 -constraint yields a bound on the prediction error
that is almost optimal — if we knew the set S then using linear regression would yield

117
a bound of the order σ|S|/N ; using lasso, we just pay an additional multiplicative
factor of log p. As S is obviously unknown, we can think of fitting all possible linear
regression models with k := |S| predictors and then choosing the best one. This
would be equivalent to solving the following `0 -problem:

min
p
ky − Xβk22 ,
β∈R :kβk0 ≤k

where kβk0 denotes the number of non-zero components of β. Obviously, this proce-
dure is computationally infeasible and possibly NP hard.

In this subsection we compare the `0 and `1 -problems in the noiseless setup. This
would shed light on when we can expect the `1 relaxation to perform as well as
solving the `0 -problem. More precisely, given an observation vector y ∈ RN and a
design matrix X ∈ RN ×p , let us consider the two problems

min kβk0 (84)


β∈Rp :Xβ=y

and
min kβk1 . (85)
β∈Rp :Xβ=y

The above linear program (LP) (85) is also known as the basis pursuit LP. Suppose
that the `0 -based problem (84) has a unique optimal solution, say β ∗ ∈ Rp . Our
interest is in understanding when β ∗ is also the unique optimal solution of the `1 -
based problem (85), in which case we say that the basis pursuit LP is equivalent to
`0 -recovery. Remarkably, there exists a very simple necessary and sufficient condition
on the design matrix X for this equivalence to hold.
Definition 8.5 (Exact recovery property). An N × p design matrix X is said to
satisfy the exact recovery property for S ⊂ {1, . . . , p} (or S-ERP) if every β ∗ ∈ Rp
supported on S uniquely minimizes kβk1 subject to Xβ = Xβ ∗ .

For a given subset S ⊂ {1, 2, . . . , p}, let us define the following set:

C(S) := {β ∈ Rp : kβS c k1 ≤ kβS k1 }.

The set C(S) is a cone but is not convex (Exercise (HW4): show this), containing all
vectors that are supported on S, and other vectors as well. Roughly, it corresponds
to the cone of vectors that have most of their mass allocated to S. Recall that we
have already seen the importance of the set C(S) in the recovery of β ∗ and Xβ ∗ using
the lasso estimator.

Given a matrix X ∈ RN ×p , its nullspace is given by

null(X) = {β ∈ Rp : Xβ = 0}.

118
Definition 8.6 (Restricted nullspace property). For a given subset S ⊂ {1, 2, . . . , p},
we say that the design matrix X ∈ RN ×p satisfies the restricted nullspace property
over S, denoted by RN(S), if

null(X) ∩ C(S) = {0}.

In words, the RN(S) property holds when the only element of the cone C(S) that lies
within the nullspace of X is the all-zeroes vector. The following theorem highlights the
connection between the exact recovery property and the restricted nullspace property.
Theorem 8.7. The matrix X is S-ERP if and only if it is RN(S).

Since the subset S is not known in advance — indeed, it is usually what we are trying
to determine — it is natural to seek matrices that satisfy a uniform version of the
restricted nullspace property. For instance, we say that the uniform RN property of
order k holds if RN(S) holds for all subsets of {1, . . . , p} of size at most k. In this
case, we are guaranteed that the `1 -relaxation succeeds for any vector supported on
any subset of size at most k.

Proof of Theorem 8.7. First, suppose that X satisfies the RN(S) property. Let
β ∗ ∈ Rp be supported on S and let y = Xβ ∗ . Let β̂ ∈ Rp be any optimal solution
to the basis pursuit LP (85), and define the error vector ν̂ := β̂ − β ∗ . Our goal is to
show that ν̂ = 0, and in order to do so, it suffices to show that ν̂ ∈ null(X) ∩ C(S).
On the one hand, since β ∗ and β̂ are optimal (and hence feasible) solutions to the `0
and `1 -problems, respectively, we are guaranteed that Xβ ∗ = y = Xβ̂, showing that
X ν̂ = 0. On the other hand, since β ∗ is also feasible for the `1 -based problem (85),
the optimality of β̂ implies that kβ̂k1 ≤ kβ ∗ k1 = kβS∗ k1 . Writing β̂ = β ∗ + ν̂, we have

kβS∗ k1 ≥ kβ̂k1 = kβS∗ + ν̂S k1 + kν̂S c k1 ≥ kβS∗ k1 − kν̂S k1 + kν̂S c k1 .

Rearranging terms, we find that ν̂ ∈ C(S). Since X satisfies the RN(S) condition by
assumption, we conclude that ν̂ = 0, as required.

Suppose now that X is S-ERP. We will use the method of contradiction here to
show that X is RN(S). Thus, assume that X is not RN(S). Then there exists
h 6= 0 ∈ null(X) such that
khS k1 ≥ khS c k1 . (86)
Set β ∗ ∈ Rp such that βS∗ = hS and βS∗ c = 0. Then β ∗ is supported on S. Thus, by
the S-ERP β ∗ uniquely minimizes kβk1 subject to Xβ = Xβ ∗ := y.

119
Set β + ∈ Rp such that βS+ = 0 and βS+c = −hS c . Then observe that Xβ ∗ = Xβ + as

Xβ ∗ = XS hS = −XS c hS c = Xβ +

(recall that Xh = 0). Thus, β + ∈ Rp is a feasible solution to the optimization


problem: min kβk1 subject to Xβ = Xβ ∗ = y. Thus, by the uniqueness of β ∗ ,
kβ ∗ k1 < kβ + k1 which is equivalent to khS k1 < khS c k1 — a contradiction to (86).
This completes the proof.

8.4.1 Sufficient conditions for restricted nullspace

Of course, in order for Theorem 8.7 to be useful in practice, we need to verify the RN
property. A line of work has developed various conditions for certifying the uniform
RN property. The simplest and historically earliest condition is based on the pairwise
incoherence
|hxj , xk i|
r(X) := max .
j6=k∈{1,...,p} kxj k2 kxk k2

For centered xj this is the maximal absolute pairwise correlation. When X is rescaled
to have unit-norm columns, an equivalent representation is given by r(X) = maxj6=k |hxj , xk i|,
which illustrates that pairwise incoherence measures how close the Gram matrix X> X
is to the p-dimensional identity matrix in an element-wise sense.

The following result shows that having a low pairwise incoherence is sufficient to
guarantee exactness of the basis pursuit LP.
Proposition 8.8 (Pairwise incoherence implies RN). Suppose that for some integer
1
k ∈ {1, 2, . . . , p}, the pairwise incoherence satisfies the bound r(X) < 3k . Then X
satisfies the uniform RN property of order k, and hence, the basis pursuit LP is exact
for all vectors with support at most k.

Proof. See [5][Section 10.4.3] for a proof of this claim.

An attractive feature of pairwise incoherence is that it is easily computed; in partic-


ular, in O(N p2 ) time. A disadvantage is that it provides very conservative bounds
that do not always capture the actual performance of `1 -relaxation in practice.
Definition 8.9 (Restricted isometry property). For a tolerance δ ∈ (0, 1) and integer
k ∈ {1, 2, . . . , p}, we say that the restricted isometry property RIP(k, δ) holds if

kX> X − Ik kop ≤ δ

120
for all subsets S ⊂ {1, 2, . . . , p} of cardinality k. We recall here that k · kop denotes
the operator norm, or maximal singular value of a matrix.

Thus, we see that RIP(k, δ) holds if and only if for all subsets S of cardinality k, we
have
kXS uk22
2
∈ [1 − δ, 1 + δ], for all u 6= 0 ∈ Rk ;
kuk2
hence the terminology of restricted isometry. The following result, which we state
without any proof, shows that the RIP is a sufficient condition for the RN property
to hold.
Proposition 8.10 (RIP implies RNP). If RIP(2k, δ) holds with δ < 1/3, then the
uniform RN property of order k holds, and hence the `1 -relaxation is exact for all
vectors supported on at most k elements.

References
[1] Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a
practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser.
B 57 (1), 289–300.

[2] Berlinet, A. and C. Thomas-Agnan (2004). Reproducing kernel Hilbert spaces


in probability and statistics. Kluwer Academic Publishers, Boston, MA. With a
preface by Persi Diaconis.

[3] Efron, B. (2010). Large-scale inference, Volume 1 of Institute of Mathematical


Statistics (IMS) Monographs. Cambridge University Press, Cambridge. Empirical
Bayes methods for estimation, testing, and prediction.

[4] Freedman, D. A. (1981). Bootstrapping regression models. Ann. Statist. 9 (6),


1218–1228.

[5] Hastie, T., R. Tibshirani, and M. Wainwright (2015). Statistical Learning with
Sparsity.

[6] Keener, R. W. (2010). Theoretical statistics. Springer Texts in Statistics. Springer,


New York. Topics for a core course.

[7] Kimeldorf, G. and G. Wahba (1971). Some results on Tchebycheffian spline func-
tions. J. Math. Anal. Appl. 33, 82–95.

121
[8] Politis, D. N., J. P. Romano, and M. Wolf (1999). Subsampling. Springer Series
in Statistics. Springer-Verlag, New York.

[9] Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators.
Scand. J. Statist. 9 (2), 65–78.

[10] Schölkopf, B., R. Herbrich, and A. J. Smola (2001). A generalized representer


theorem. In Computational learning theory, pp. 416–426. Springer.

[11] Sen, B., M. Banerjee, and M. Woodroofe (2010). Inconsistency of bootstrap: the
Grenander estimator. Ann. Statist. 38 (4), 1953–1977.

[12] Stone, C. J. (1984). An asymptotically optimal window selection rule for kernel
density estimates. Ann. Statist. 12 (4), 1285–1297.

[13] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy.
Statist. Soc. Ser. B 58 (1), 267–288.

[14] Tsybakov, A. B. (2009). Introduction to nonparametric estimation. Springer


Series in Statistics. Springer, New York. Revised and extended from the 2004
French original, Translated by Vladimir Zaiats.

[15] van der Vaart, A. W. (1998). Asymptotic statistics, Volume 3 of Cambridge


Series in Statistical and Probabilistic Mathematics. Cambridge University Press,
Cambridge.

[16] Vapnik, V. and A. Lerner (1963). Pattern recognition using generalized portrait
method. Automation and remote control 24, 774–780.

[17] Yang, Y. and A. Barron (1999). Information-theoretic determination of minimax


rates of convergence. Ann. Statist. 27 (5), 1564–1599.

[18] Zhang, Y., M. J. Wainwright, and M. I. Jordan (2014). Lower bounds on the per-
formance of polynomial-time algorithms for sparse linear regression. arXiv preprint
arXiv:1402.1918 .

122

You might also like