Intro&NP Stat
Intro&NP Stat
Bodhisattva Sen
Contents
1 Kernel density estimation 5
1.1 The choice of the bandwidth and the kernel . . . . . . . . . . . . . . 7
1.2 Mean squared error of kernel estimators . . . . . . . . . . . . . . . . 8
1.3 Pointwise asymptotic distribution . . . . . . . . . . . . . . . . . . . . 13
1.4 Integrated squared risk of kernel estimators . . . . . . . . . . . . . . . 15
1.5 Unbiased risk estimation: cross-validation . . . . . . . . . . . . . . . 18
2 Nonparametric regression 20
2.1 Local polynomial estimators . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Pointwise and integrated risk of local polynomial estimators . . . . . 23
2.2.1 Assumption (LP1) . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Projection estimators 27
3.1 Risk bounds for projection estimators . . . . . . . . . . . . . . . . . . 29
3.1.1 Projection estimator with trigonometric basis in L2 [0, 1] . . . . 31
1
4.6.1 Two examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.7 Global Fano method: Bounding I(M ) based on metric entropy . . . . 58
4.7.1 A general scheme for proving minimax bounds using global
packings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.7.2 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6 Bootstrap 85
6.1 Parametric bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 The nonparametric bootstrap . . . . . . . . . . . . . . . . . . . . . . 88
6.3 Consistency of the bootstrap . . . . . . . . . . . . . . . . . . . . . . . 89
6.4 Second-order accuracy of the bootstrap . . . . . . . . . . . . . . . . . 92
6.5 Failure of the bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.6 Subsampling: a remedy to the bootstrap . . . . . . . . . . . . . . . . 93
6.7 Bootstrapping regression models . . . . . . . . . . . . . . . . . . . . . 94
6.8 Bootstrapping a nonparametric function: the Grenander estimator . . 96
2
7.6 The Bayesian approach: connection to empirical Bayes . . . . . . . . 106
7.6.1 Global versus local FDR . . . . . . . . . . . . . . . . . . . . . 107
7.6.2 Empirical Bayes interpretation of BH(q) . . . . . . . . . . . . 108
3
Abstract
This lecture note arose from a class I taught in Spring 2016 to our 2nd year
PhD students (in Statistics) at Columbia University. The choice of topics is
very eclectic and mostly reflect: (a) my background and research interests, and
(b) some of the topics I wanted to learn more systematically in 2016. The first
part of this lecture notes is on nonparametric function estimation — density
and regression — and I borrow heavily from the book Tsybakov [14] and the
course he taught at Yale in 2014. The second part of the course is a medley
of different topics: (i) reproducing kernel Hilbert spaces (RKHSs; Section 5),
(ii) bootstrap methods (Section 6), (iii) multiple hypothesis testing (Section 7),
and (iv) an introduction to high dimensional linear regression (Section 8).
The content of Section 5 is greatly influenced by Arthur Gretton’s lectures
and slides on RKHSs and its applications in Machine Learning (see e.g., http:
//www.gatsby.ucl.ac.uk/~gretton/coursefiles/rkhscourse.html for a more
detailed course). I have borrowed the material in Section 7 from Emmanuel
Candes’s lectures on ‘Theory of Statistics’ (Stats 300C, Stanford), while the
content of Section 8 is taken from Hastie et al. [5].
4
1 Kernel density estimation
1.0 ecdf(x)
●
0.8
●
0.6
●
Fn(x)
●
0.4
●
0.2
●
0.0
n=10
where I(·) denotes the indicator function. The Glivenko-Cantelli theorem shows that
a.s.
sup |Fn (x) − F (x)| → 0,
x∈R
ecdf(x) ecdf(x)
1.0
1.0
● ●
●●
●●
●
●●●
● ●
●
● ●●
●
●●
0.8
0.8
●●
●
●●
●
●●
●
●
●
●●
●●●
● ●●
●
0.6
0.6
●
●
●●
●●
●●
Fn(x)
Fn(x)
●●●
●●
●●●
●●
●
●
0.4
0.4
●●
●●
●
●
●●
●
●
●
●●
●●
●
● ●
●
● ●
0.2
0.2
●
●●
●●
● ●●
●●
●
●●
● ●
●
●●
●●
0.0
0.0
−2 −1 0 1 2 −4 −2 0 2
n=100 n=1000
5
Exercise (HW1): Consider testing F = F0 where F0 is a known continuous strictly
increasing distribution function (e.g., standard normal) when we observe i.i.d. data
X1 , . . . , Xn from F . The Kolmogorov-Smirnov test statistic is to consider
and reject H0 when Dn > cα , for a suitable cα > 0 (where α is the level of the test).
Show that, under H0 , Dn is distribution-free, i.e., the distribution of Dn does not
depend on F0 (as long as it is continuous and strictly increasing). How would you
compute (approximate/simulate) the critical value cα , for every n.
Let us come back to the estimation of p. As p is the derivative of F , for small h > 0,
we can write the approximation
F (x + h) − F (x − h)
p(x) ≈ .
2h
As Fn is a natural estimator of F , it is intuitive to define the following (Rosenblatt)
estimator of p:
Fn (x + h) − Fn (x − h)
p̂R
n (x) = .
2h
We can rewrite p̂R n as
1 X 1 X
n n
R Xi − x
p̂n (x) = I(x − h < Xi ≤ x + h) = K0 ,
2nh i=1 nh i=1 h
Note that if the kernel K takes only nonnegative values and if X1 , . . . , Xn are fixed,
then p̂n is a probability density.
6
Figure 1: KDE with different bandwidths of a random sample of 100 points from a stan-
dard normal distribution. Grey: true density (standard normal). Red: KDE
with h=0.05. Black: KDE with h=0.337. Green: KDE with h=2.
It turns out that the choice of the bandwidth h is far more crucial for the quality
of p̂n as an estimator of p than the choice of the kernel K. We can view the KDE
(for unimodal, nonnegative kernels) as the sum of n small “mountains” given by the
functions
1 Xi − x
x 7→ K .
nh h
Every small mountain is centered around an observation Xi and has area 1/n under
it, for any bandwidth h. For a small bandwidth the mountain is very concentrated
(peaked), while for a large bandwidth the mountain is low and fat. If the bandwidth
is small, then the mountains remain separated and their sum is peaky. On the other
hand, if the bandwidth is large, then the sum of the individual mountains is too flat.
Intermediate values of the bandwidth should give the best results.
For a fixed h, the KDE p̂n (x0 ) is not consistent in estimating p(x0 ), where x0 ∈ R.
However, if the bandwidth decreases with sample size at an appropriate rate, then it
is, regardless of which kernel is used.
7
Exercise (HW1): Suppose that p is continuous at x0 , that hn → 0, and that nhn → ∞
p
as n → ∞. Then, p̂n (x0 ) → p(x0 ) [Hint: Study the bias and variance of the estimator
separately].
A basic measure of the accuracy of p̂n is its mean squared risk (or mean squared error)
at an arbitrary fixed point x0 ∈ R:
h i
MSE = MSE(x0 ) := Ep (p̂n (x0 ) − p(x0 )) .2
Of course,
MSE(x0 ) = b2 (x0 ) + σ 2 (x0 )
where
b(x0 ) := Ep [p̂n (x0 )] − p(x0 ), (bias)
and h 2 i
σ 2 (x0 ) := Ep p̂n (x0 ) − Ep [p̂n (x0 )] (variance).
To evaluate the mean squared risk of p̂n we will analyze separately its variance and
bias.
Proposition 1.1 (Variance of p̂n ). Suppose that the density p satisfies p(x) ≤ pmax <
∞ for all x ∈ R. Let K : R → R be the kernel function such that
Z
K 2 (u)du < ∞.
Proof. Observe that p̂n (x0 ) is an average of n i.i.d. random variables and so
1 1 X 1 − x0 1 X 1 − x0
2
σ (x0 ) = Var(p̂n (x0 )) = Var K ≤ Ep K 2
n h h nh2 h
8
Now, observe that
Z Z
X 1 − x0 z − x0
Ep K 2
= K 2
p(z)dz ≤ pmax h K 2 (u)du.
h h
Combining the above two displays we get the desired result.
To analyze the bias of the KDE (as a function of h) we need certain conditions on
the density p and on the kernel K.
Definition 1.2. Let T be an interval in R and let β and L be two positive numbers.
The Hölder class Σ(β, L) on T is defined as the set of ` = bβc times differentiable
functions f : T → R whose derivative f (`) satisfies
Does bounded kernels of order ` exist? See Section 1.2.2 of [14] for constructing such
kernels.
Observe that when ` ≥ 2 then the kernel has to take negative values which may lead
to negative values of p̂n . This is sometimes mentioned as a drawback of using higher
order kernels (` ≥ 2). However, observe that we can always define the estimator
p̂+
n (x) = max{0, p̂n (x)}
9
Proposition 1.4 (Bias of p̂n ). Assume that p ∈ P(β, L) and let K be a kernel of
order ` = bβc satisfying Z
|u|β |K(u)|du < ∞.
|b(x0 )| ≤ C2 hβ , (4)
L
R
where C2 = `!
|u|β |K(u)|du.
Proof. We have
Z
1 z−x
b(x0 ) = K p(z)dz − p(x0 )
h h
Z h i
= K(u) p(x0 + uh) − p(x0 ) du.
0 (uh)` (`)
p(x0 + uh) = p(x0 ) + p (x0 )uh + . . . + p (x0 + τ uh),
`!
where 0 ≤ τ ≤ 1. Since K has order ` = bβc, we obtain
Z
(uh)` (`)
b(x0 ) = K(u) p (x0 + τ uh)du
`!
Z
(uh)` (`)
= K(u) (p (x0 + τ uh) − p(`) (x0 ))du,
`!
1
Taylor’s theorem: Let k ≥ 1 be an integer and let the function f : R → R be k times
differentiable at the point a ∈ R. Then there exists a function Rk : R → R such that
f (k+1) (ξL )
Rk (x) = (x − a)k+1
(k + 1)!
for some real number ξL between a and x. This is the Lagrange form of the remainder.
Integral form of the remainder: Let f (k) be absolutely continuous on the closed interval between
a and x. Then Z x (k+1)
f (t)
Rk (x) = (x − t)k dt. (5)
a k!
Due to absolute continuity of f (k) , on the closed interval between a and x, f (k+1) exists a.e.
10
and
Z
|uh|` (`)
|b(x0 )| ≤ |K(u)| p (x0 + τ uh) − p(`) (x0 ) du
`!
Z
|uh|`
≤ L |K(u)| |τ uh|β−` du ≤ C2 hβ .
`!
From Propositions 1.1 and 1.4, we see that the upper bounds on the bias and variance
behave in opposite ways as the bandwidth h varies. The variance decreases as h grows,
whereas the bound on the bias increases. The choice of a small h corresponding to a
large variance leads to undersmoothing. Alternatively, with a large h the bias cannot
be reasonably controlled, which leads to oversmoothing. An optimal value of h that
balances bias and variance is located between these two extremes. To get an insight
into the optimal choice of h, we can minimize in h the upper bound on the MSE
obtained from the above results.
Proof. We apply (14) to derive the result. To justify the application of Proposi-
tion 1.1, it remains to prove that there exists a constant pmax < ∞ satisfying
11
To show that (7) holds, consider K ∗ which is a bounded kernel of order ` (not neces-
sarily equal to K). Applying Proposition 1.4 with h = 1 we get that, for any x ∈ R
and any p ∈ P(β, L),
Z Z
∗ L
K (z − x) p(z)dz − p(x) ≤ C2 := |u|β |K ∗ (u)|du.
`!
∗
where Kmax ∗
= supu∈R |K ∗ (u)|. Thus, we get (7) with pmax = C2∗ + Kmax .
Under the assumptions of Theorem 1.5, the rate of convergence of the estimator
β
p̂n (x0 ) is ψn = n− 2β+1 , which means that for a finite constant C and for all n ≥ 1 we
have h i
sup Ep (p̂n (x0 ) − p(x0 ))2 ≤ Cψn2 .
p∈P(β,L)
Now the following two questions arise. Can we improve the rate ψn by using other
density estimators? What is the best possible rate of convergence? To answer these
questions it is useful to consider the minimax risk Rn∗ associated to the class P(β, L):
h i
Rn (P(β, L)) = inf sup Ep (Tn (x0 ) − p(x0 )) ,
∗ 2
Tn p∈P(β,L)
where the infimum is over all estimators. One can prove a lower bound on the minimax
2β
risk of the form Rn∗ (P(β, L)) ≥ C 0 ψn2 = C 0 n− 2β+1 with some constant C 0 > 0. This
implies that under the assumptions of Theorem 1.5 the KDE attains the optimal rate
β
of convergence n− 2β+1 associated with the class of densities P(β, L). Exact definitions
and discussions of the notion of optimal rate of convergence will be given later.
Remark 1.1. Quite often in practice it is assumed that β = 2 and that p00 is con-
tinuous at x0 . Also, the kernel is taken to be of order one and symmetric around 0.
Then it can be shown that (Exercise (HW1))
Z Z 2
1 1
MSE(x0 ) = K (u)dup(x0 ) + h4
2
u K(u)du p00 (x0 )2 + o((nh)−1 + h4 ).
2
nh 4
Remark 1.2. Since 2β/(2β+1) approaches 1 as k becomes large, Theorem 1.5 implies
that, for sufficiently smooth densities, the convergence rate can be made arbitrarily
close to the parametric n−1 convergence rate. The fact that higher-order kernels
12
can achieve improved rates of convergence means that they will eventually dominate
first-order kernel estimators for large n. However, this does not mean that a higher-
order kernel will necessarily improve the error for sample sizes usually encountered in
practice, and in many cases, unless the sample size is very large there may actually
be an increase in the error due to using a higher-order kernel.
Whereas the results from the previous sub-section have shown us that p̂n (x0 ) converges
to p(x0 ) in probability under certain assumptions, we cannot straightforwardly use
this for statistical inference. Ideally, if we want to estimate p(x0 ) at the point x0 , we
would like to have exact confidence statements of the form
where α is the significance level and c(n, α, x0 , K) sequence of constants that one
would like to be as small as possible (given α).
Theorem 1.6. Assume that p ∈ P(β, L) and let K be a kernel of order ` = bβc
satisfying Z
|u|β |K(u)|du < ∞.
Suppose that p also satisfies p(x) ≤ pmax < ∞ for all x ∈ R. Let K further satisfy
R
(a) kKk22 := K 2 (u)du < ∞, (b) kKk∞ := supu∈R K(u) < ∞. Suppose that the
1/2 β+1/2
sequence of bandwidths {hn }∞ n=1 satisfy hn → 0, nhn → ∞, and n hn → 0 as
n → ∞. Then, as n → ∞,
√
d
nh p̂n (x0 ) − p(x0 ) → N 0, p(x0 )kKk22 .
Proof. We first find the limit for the ‘variance term’. We use the Lindeberg-Feller
central limit theorem for triangular arrays of independent random variables2 with
r
√ 1 Xi − x 0 1 X i − x0
Yni := nh K = K , i = 1, . . . , n,
nh h nh h
2
Lindeberg-Feller CLT (see e.g., [15, p.20]): For each n let Yn1 , . . . , Ynn be independent random
Pn
variables with finite variances. If, as n → ∞, (i) i=1 E[Yni
2
I(|Yni | > )] → 0, for every > 0, and
Pn
(ii) i=1 E[(Yni − E(Yni )) ] → σ , then
2 2
n
X d
(Yni − E(Yni )) → N (0, σ 2 ), as n → ∞.
i=1
13
so that Yn1 , . . . , Ynn are i.i.d. and we have
√ Xn
nh p̂n (x0 ) − Ep [p̂n (x0 )] = (Yni − E(Yni )).
i=1
Thus, we only need to show that the two conditions in the Lindeberg-Feller CLT hold.
Clearly,
Z
2 1 2 z − x0
nE(Yni ) = K p(z)dz
h h
Z Z
= K (u) p(x0 + uh)du → p(x0 ) K 2 (u)du,
2
as n → ∞,
Exercise (HW1): Suppose that you are given an i.i.d. sample from a bounded density
p with bounded derivatives at x0 . Suppose that c(α, x0 ) is such that P(−c(α, x0 ) ≤
Z ≤ c(α, x0 )) = 1 − α where Z ∼ N (0, p(x0 )). Use a kernel density estimator (with
a suitable kernel) to obtain a 95 percent confidence interval (CI) for p(x0 ) in such a
√
way that the size of the interval shrinks at rate 1/ nhn as n → ∞, and that hn can
be chosen so that this rate is ‘almost’ (say, up to a log n term) of order n−1/3 .
Exercise (HW1): Under the setup of Remark 1.1 and the assumption that h = αn−1/5 ,
√
where α > 0, find the asymptotic distribution of nh(p̂n (x0 ) − p(x0 )). Can this be
used to construct a CI for p(x0 )? What are the advantages/disadvantages of using
this result versus the setup of Theorem 1.6 with β = 2 to construct a CI for p(x0 )?
14
1.4 Integrated squared risk of kernel estimators
In Section 1.2 we have studied the behavior of the KDE p̂n at an arbitrary fixed point
x0 . It is also interesting to analyze the global risk of p̂n . An important global criterion
is the mean integrated squared error (MISE):
Z
MISE := Ep [(p̂n (x) − p(x))2 ]dx.
By Fubini’s theorem,
Z Z Z
2
MISE = MSE(x)dx = b (x)dx + σ 2 (x)dx. (8)
R
Thus, the MISE is represented as a sum of the bias term b2 (x)dx and the variance
R
term σ 2 (x)dx. To obtain bounds on these terms, we proceed in the same manner
as for the analogous terms of the MSE. Let us study first the variance term.
Proposition 1.7 (Variance of p̂n ). Let K : R → R be the kernel function such that
Z
K 2 (u)du < ∞.
Then for any h > 0, and n ≥ 1 and any probability density p we have
Z Z
2 1
σ (x)dx ≤ K 2 (u)du.
nh
The upper bound for the variance term in Proposition 1.7 does not require any con-
dition on p: The result holds for any density. For the bias term in (8) the situation
is different: We can only control it on a restricted subset of densities. As above, we
15
specifically assume that p is smooth enough. Since the MISE is a risk corresponding
to the L2 (R)-norm, it is natural to assume that p is smooth with respect to this norm.
Sobolev classes provide a popular way to describe smoothness in L2 (R).
Definition 1.8. Let β ≥ 1 be an integer and L > 0. The Sobolev class S(β, L) is
defined as the set of all β − 1 differentiable functions f : R → R having absolutely
continuous derivative f (β−1) and satisfying
Z
(f (β) (x))2 dx ≤ L2 .
Then for all n ≥ 1 and all h > 0 the mean integrated squared error of the KDE p̂n
satisfies Z Z 2
1 2 L2 h2β β
MISE ≤ K (u)du + |u| |K(u)|du .
nh (`!)2
Proof. We bound the variance term as in Proposition 1.7. Let ` = β − 1. For the bias
term, first note that using the integral form of the remainder term in the Taylor’s
theorem (see (5) and make the transformation t 7→ t−x
uh
),
Z 1
0 (uh)`
p(x + uh) = p(x) + p (x)uh + . . . + (1 − τ )`−1 p(`) (x + τ uh)dτ.
(` − 1)! 0
Applying the generalized Minkowski inequality3 twice and using the given assump-
3
Generalized Minkowski inequality:
16
R
tions on p, we get the following upper bound for the bias term b2 (x)dx:
Z Z Z 1 2
|uh|` `−1 (`) (`)
|K(u)| (1 − τ ) p (x + τ uh) − p (x) dτ du dx
(` − 1)! 0
"Z Z 2
Z ` 1 2 #1/2
|uh|
≤ |K(u)| (1 − τ )`−1 p(`) (x + τ uh) − p(`) (x) dτ dx du
(` − 1)! 0
"Z Z # !2
Z
|uh|` 1 2 1/2
≤ |K(u)| (1 − τ )`−1 p(`) (x + τ uh) − p(`) (x) dx dτ du .
(` − 1)! 0
17
(ii) The density p is differentiable on R, the first derivative p0 is absolutely contin-
R
uous on R and the second derivative satisfies (p00 (x))2 dx < ∞.
Then for all n ≥ 1 the mean integrated squared error of the kernel estimator p̂n
satisfies Z Z
1 2 h4 2 00 2
MISE = K (u)du + SK (p (x)) dx (1 + o(1)),
nh 4
where the term o(1) is independent of n (but depends on p) and tends to 0 as h → 0.
Let p̂n be the KDE and let the kernel K be fixed. We already know that the bandwidth
h is crucial to determine the behavior of the estimator. How to choose h in practice?
The optimal value of h is the one that minimizes the MISE, i.e.,
h∗ = argmin MISE(h).
h>0
This ideal bandwidth h depends on the true density p, so it is not available in practice.
It is called the oracle bandwidth, and the estimator p̂n with bandwidth h = h∗ is called
the oracle. We would like to “mimic the oracle”, i.e., to find a bandwidth ĥn that
only depends on the data X1 , . . . , Xn , such that its risk is close to the risk of the
oracle: Z
Ep (p̂(nĥn ) − p)2 (x)dx ≈ min MISE(h),
h>0
It turns out that this task can be achieved. The idea is to first estimate the MISE(·),
and then to minimize in h the obtained estimator of MISE(·).
Only the expression in the square brackets depends on h; the last term is constant in
h. Let Z Z
J(h) := Ep 2
p̂n − 2 p̂n p .
18
MISE(h) and J(h) have the same minimizers. A first idea is to take an unbiased
estimator of J(h).
Now we prove that CV(h) is an unbiased estimator of J(h), i.e., we show that
Z " n #
1X
Ep p̂n p = Ep p̂n,−i (Xi ) . (9)
n i=1
[12] was the first to investigate the issue of optimality in connection with cross-
validation. He proved that the integrated squared error of the estimator p̃CVn is
asymptotically equivalent to that of some oracle estimator:
R CV
(p̃n − p)2 a.s.
R (h) → 1, n → ∞,
minh>0 (p̂n − p)2
under some assumptions (the density p is bounded, the kernel is compactly supported,
essentially nonnegative, and satisfies the Hölder condition).
19
2 Nonparametric regression
Let (X, Y ) be a pair of real-valued random variables such that E|Y | < ∞. The
regression function f : R → R of Y on X is defined as
The conditional residual ξ := Y − E(Y |X) has mean zero, E(ξ) = 0, and we may
write
Yi = f (Xi ) + ξi , i = 1, . . . , n, (10)
where ξi are i.i.d. random variables with the same distribution as ξ. In particular,
E(ξi ) = 0 for all i = 1, . . . , n. The variables ξi can therefore be interpreted as “errors”.
The key idea we use in estimating f nonparametrically in this section is called “local
averaging”. Given a kernel K and a bandwidth h, one can construct kernel estimators
for nonparametric regression. There exist different types of kernel estimators of the
regression function f . The most celebrated one is the Nadaraya-Watson estimator
defined as follows:
Pn
Y i K Xi −x Xn
Xi − x
NW i=1 h
fn (x) = Pn Xi −x
, if K 6= 0,
i=1 K h i=1
h
and fnN W (x) = 0 otherwise. This estimator was proposed separately in two papers
by Nadaraya and Watson in the year 1964.
Example: If we choose K(u) = 21 I(|u| ≤ 1), then fnN W (x) is the average of Yi such
that Xi ∈ [x − h, x + h]. Thus, for estimating f (x) we define the “local” neighborhood
as [x − h, x + h] and consider the average of the observations in that neighborhood.
For fixed n, the two extreme cases for the bandwidth are:
P
(i) h → ∞. Then fnN W (x) tends to n−1 ni=1 Yi which is a constant independent
of x. The systematic error (bias) can be too large. This is a situation of
oversmoothing.
(ii) h → 0. Then fnN W (Xi ) = Yi whenever h < mini,j |Xi −Xj | and limh→0 fnN W (x) =
0, if x 6= Xi . The estimator fnN W is therefore too oscillating: it reproduces the
20
data Yi at the points Xi and vanishes elsewhere. This makes the stochastic
error (variance) too large. In other words, undersmoothing occurs.
Thus, the bandwidth h defines the “width” of the local neighborhood and the kernel K
defines the “weights” used in averaging the response values in the local neighborhood.
As we saw in density estimation, an appropriate choice of the bandwidth h is more
important than the choice of the kernel K.
Definition 2.1. An estimator fˆn (x) of f (x) is called a linear nonparametric regres-
sion estimator if it can be written in the form
Xn
ˆ
fn (x) = Yi Wni (x)
i=1
where the weights Wni (x) = Wni (x, X1 , . . . , Xn ) depend only on n, i, x and the values
X1 , . . . , X n .
Typically, the weights Wni (x) of linear regression estimators satisfy the equality
Pn
i=1 Wni (x) = 1 for all x (or for almost all x with respect to the Lebesgue mea-
sure).
Another intuitive motivation of fnN W is given below. Suppose that the distribution
of (X, Y ) has density p(x, y) with respect to the Lebesgue measure and pX (x) =
R
p(x, y)dy > 0. Then,
R
yp(x, y)dy
f (x) = E(Y |X = x) = .
pX (x)
If we replace here p(x, y) by the KDE p̂n (x, y) of the density of (X, Y ) defined by (3)
and use the corresponding KDE p̂X X ˆN W in view of
n (x) to estimate p (x), we obtain fn
the following result.
Exercise (HW1): Let p̂Xn (x) and p̂n (x, y) be the KDEs defined in (2) and (3) respec-
tively, with a kernel K of order 1. Then
R
NW y p̂n (x, y)dy
fn (x) =
p̂X
n (x)
21
if p̂X
n (x) 6= 0.
If the kernel K takes only nonnegative values, the Nadaraya-Watson estimator fnN W
satisfies
Xn
Xi − x
NW 2
fn (x) = argmin (Yi − θ) K . (11)
θ∈R i=1
h
Thus, fnN W is obtained by a local constant least squares approximation of the response
values, i.e., Yi ’s. The locality is determined by the bandwidth h and the kernel K
which downweighs all the Xi that are not close to x whereas θ plays the role of a local
constant to be fitted. More generally, we may define a local polynomial least squares
approximation, replacing in (11) the constant θ by a polynomial of given degree `. If
f ∈ Σ(β, L), β > 1, ` = β, then for z sufficiently close to x we may write
0 f (`) (x) ` > z−x
f (z) ≈ f (x) + f (x)(z − x) + . . . + (z − x) = θ (x)U ,
`! h
where
is called a local polynomial estimator of order ` of θ(x) or LP(`) estimator of θ(x) for
short. The statistic
fˆn (x) = U > (0)θ̂n (x)
is called a local polynomial estimator of order ` of f(x) or LP(`) estimator of f (x) for
short.
Note that fˆn (x) is simply the first coordinate of the vector θ̂n (x). Comparing (11)
and (12) we see that the Nadaraya-Watson estimator fnN W with kernel K ≥ 0 is
the LP (0) estimator. Furthermore, properly normalized coordinates of θ̂n (x) provide
estimators of the derivatives f 0 (x), . . . , f (`) (x).
22
For a fixed x the estimator (12) is a weighted least squares estimator. Indeed, we can
write θ̂n (x) as follows:
where the matrix Bnx and the vector anx are defined by the formulas:
1 X
n
Xi − x > Xi − x Xi − x
Bnx = U U K ,
nh i=1 h h h
1 X
n
Xi − x Xi − x
anx = Yi U K .
nh i=1 h h
Exercise (HW1): If the matrix Bnx is positive definite, show that the local polynomial
estimator fˆn (x) of f (x) is a linear estimator. Also, in this case, find an expression for
fˆn (x).
X
n
∗
Q(Xi )Wni (x) = Q(x),
i=1
X
n X
n
∗ ∗
Wni (x) = 1, and (Xi − x)k Wni (x) = 0 for k = 1, . . . , `.
i=1 i=1
In this section we study statistical properties of the LP(`) estimator constructed from
observations (Xi , Yi ), i = 1, . . . , n, such that
Yi = f (Xi ) + ξi , i = 1, . . . , n, (13)
where ξi are independent zero mean random variables (E(ξi ) = 0), the Xi are deter-
ministic values belonging to [0, 1], and f is a function from [0, 1] to R.
23
Let fˆn (x0 ) be an LP(`) estimator of f (x0 ) at point x0 ∈ [0, 1]. The bias and the
variance of fˆn (x0 ) are given by the formulas
h i h i h i2
ˆ
b(x0 ) = Ef fn (x0 ) − f (x0 ), ˆ ˆ
σ (x0 ) = Ef fn (x0 ) − Ef fn (x0 ) ,
2 2
We will study separately the bias and the variance terms in this representation of the
risk. First, we introduce the following assumptions.
Assumptions (LP)
(LP1) There exist a real number λ0 > 0 and a positive integer n0 such that the
smallest eigenvalue λmin (Bnx ) of Bnx satisfies λmin (Bnx ) ≥ λ0 for all n ≥ n0 and
any x ∈ [0, 1].
(LP2) There exists a real number a0 > 0 such that for any interval A ⊂ [0, 1] and all
n ≥ 1,
1X
n
I(Xi ∈ A) ≤ a0 max(Leb(A), 1/n)
n i=1
where Leb(A) denotes the Lebesgue measure of A.
(LP3) The kernel K has compact support belonging to [−1, 1] and there exists a num-
ber Kmax < ∞ such that |K(u)| ≤ Kmax , ∀ u ∈ R.
Assumption (LP1) is stronger than the condition Bnx > 0 introduced before since it
is uniform with respect to n and x. We will see that this assumption is natural in the
case where the matrix Bnx converges to a limit as n → ∞. Assumption (LP2) means
that the points Xi are dense enough in the interval [0, 1]. It holds for a sufficiently
wide range of designs. An important example is given by the regular design: Xi = i/n,
for which (LP2) is satisfied with a0 = 2. Finally, assumption (LP3) is not restrictive
since the choice of K belongs to the statistician.
Exercise (HW1): Show that assumption (LP1) implies that, for all n ≥ n0 , x ∈ [0, 1],
and v ∈ R`+1 ,
−1
kBnx vk ≤ kvk/λ0 ,
where k · k denotes the Euclidean norm in R`+1 . Hint: Use the fact that Bnx is
−1 −2
symmetric and relate the eigenvalues of Bnx to that of Bnx and Bnx (note that for a
>
square matrix A ∈ R , λmax (A) = kvk2 , where v 6= 0 ∈ R ).
r×r v Av r
24
We have the following result (Exercise (HW1)) which gives us some useful bounds on
∗
the weights Wni (x).
Lemma 2.4. Under assumptions (LP1)–(LP3), for all n ≥ n0 , h ≥ 1/(2n), and
∗
x ∈ [0, 1], the weights Wni (x) of the LP(`) estimator are such that:
∗
(i) supi,x |Wni (x)| ≤ Cnh∗ ;
Pn ∗
(ii) i=1 |Wni (x)| ≤ C∗ ;
∗
(iii) Wni (x) = 0 if |Xi − x| > h,
We are now ready to find upper bounds on the MSE of the LP(`) estimator.
Proposition 2.5. Suppose that f ∈ Σ(β, L) on [0, 1], with β > 0 and L > 0. Let fˆn
be the LP(`) estimator of f with ` = bβc. Assume also that:
(iii) the random variables ξi are independent and such that for all i = 1, . . . , n,
Then for all x0 ∈ [0, 1], n ≥ n0 , and h ≥ 1/(2n) the following upper bounds hold:
q2
|b(x0 ) ≤ q1 hβ , σ 2 (x0 ) ≤ ,
nh
2
where q1 := C∗ L/`! and q2 := σmax C∗2 .
25
Theorem 2.6. Assume that the assumptions of Proposition 2.5 hold. Suppose that
for a fixed α > 0 the bandwidth is chosen as h = hn = αn−1/(2β+1) . Then the following
holds: h i
lim sup sup sup Ef ψn−2 (fˆn (x0 ) − f (x0 ))2 ≤ C < ∞,
n→∞ f ∈Σ(β,L) x0 ∈[0,1]
β
where ψn := n− 2β+1 is the rate of convergence and C > 0 is a constant depending
2
only on β, L, a0 , σmax , Kmax and α.
As the above upper bound holds for every x0 ∈ [0, 1] we immediately get the following
result on the integrated risk.
Corollary 2.7. Under the assumptions of Theorem 2.6 the following holds:
h i
lim sup sup Ef ψ −2 kfˆn (x0 ) − f (x0 )k2 ≤ C < ∞,
n 2
n→∞ f ∈Σ(β,L)
R1 β
where kf k22 = 0 f 2 (x)dx, ψn := n− 2β+1 is the rate of convergence and C > 0 is a
2
constant depending only on β, L, a0 , σmax , Kmax and α.
We now discuss assumption (LP1) in more detail. If the design is regular and n is large
R
enough, Bnx is close to the matrix B := U (u)U > (u)K(u)du, which is independent
of n and x. Therefore, for Assumption (LP1) to hold we only need to assure that B
is positive definite. This is indeed true, except for pathological cases, as the following
lemma states.
Lemma 2.8. Let K : R → [0, ∞) be a function such that the Lebesgue measure
Leb({u : K(u) > 0}) > 0. Then the matrix
Z
B = U (u)U > (u)K(u)du
is positive definite.
Proof. It is sufficient to prove that for all v ∈ R`+1 satisfying v 6= 0, we have v > Bv > 0.
Clearly, Z
v > Bv > 0 = (v > U (u))2 K(u)du ≥ 0.
R
If there exists v 6= 0 such that (v > U (u))2 K(u)du = 0, then v > U (u) = 0 for almost
all u on the set {u : K(u) > 0}, which has a positive Lebesgue measure by the
assumption of the lemma. But the function v 7→ v > U (u) is a polynomial of degree
26
≤ ` which cannot be equal to zero except for a finite number of points. Thus, we come
R
to a contradiction showing that (v > U (u))2 K(u)du = 0 is impossible for v 6= 0.
Lemma 2.9. Suppose that there exist Kmin > 0 and ∆ > 0 such that
3 Projection estimators
Yi = f (Xi ) + ξi , i = 1, . . . , n, (15)
with Xi ∈ X, a metric space, and E(ξi ) = 0. The goal is to estimate the function f
based on the data. In what follows, we will also use the vector notation, writing the
model as
y = f + ξ,
where y = (Y1 , . . . , Yn )> , f = (f (X1 ), . . . , f (Xn ))> and ξ = (ξ1 , . . . , ξn )> .
X
N
fθ (x) := θj ϕj (x).
j=1
Then we look for a suitable estimator θ̂ = (θ̂1 , . . . , θ̂N ) of θ based on the sample
(Xi , Yi ), i = 1, . . . , n, and construct an estimator of f having the form
X
N
fˆ(x) = fθ̂ (x) = θ̂j ϕj (x). (16)
j=1
Example 3.1. If X = [0, 1] and f ∈ L2 [0, 1], then a popular choice of {ϕj }N j=1
corresponds to the first N functions of an orthonormal basis in L2 [0, 1]. For example,
{ϕj }∞ ∞
j=1 can be the trigonometric basis or the Legendre basis on [0, 1]. Let {θj }j=1 be
the Fourier coefficients of f with respect to the orthonormal basis {ϕj }∞j=1 of L2 [0, 1],
i.e., Z 1
θj = f (x)ϕ(x)dx.
0
27
Assume that f can be represented as
∞
X
f (x) = θj ϕj (x), (17)
j=1
where the series converges for all x ∈ [0, 1]. Observe that if Xi are scattered over
[0, 1] in a sufficiently uniform way, which happens, e.g., in the case Xi = i/n, the
P
coefficients θj are well approximated by the sums n−1 N i=1 f (Xi )ϕj (Xi ). Replacing
in these sums the unknown quantities f (Xi ) by the observations Yi we obtain the
following estimator of θj :
∞
1X
θ̂j = Yi ϕj (Xi ). (18)
n i=1
Remark 3.1. The parameter N (called the order of the estimator) plays the same
role as the bandwidth h for kernel estimators: similar to h it is a smoothing parameter,
i.e., a parameter whose choice is crucial for establishing the balance between bias and
variance. The choice of very large N leads to undersmoothing, whereas for small
values of N oversmoothing occurs.
An important class of estimators of the form (16) are projection estimators. Define
the empirical norm k · k as:
X
n X
n
2 2 2
kf k := f (Xi ), kyk := Yi2 .
i=1 i=1
X
N
fˆLS (x) = fθ̂LS (x) = θ̂jLS ϕj (x) (19)
j=1
LS
where θ̂ is the classical least squares estimator (LSE):
LS
θ̂ := argmin ky − fθ k2 ,
θ∈RN
28
appropriate conditions on the functions {ϕj }, the function f , and N . Recall that,
under the assumption that X> X > 0 (note that X> X is an N × N matrix), we have
LS LS
θ̂ = (X> X)−1 X> y and f̂ LS = Xθ̂ = Ay
where A := X(X> X)−1 X> is the so-called hat matrix. The hat matrix is the or-
thogonal projection matrix (in Rn ) onto the column-space of X, i.e., the subspace
of Rn spanned by the N columns of X. Note that we can have X> X > 0 only if
N ≤ n. However, even if X> X is not invertible f̂ LS is uniquely defined by the Hilbert
projection theorem4 and can be expressed as Ay where now A = X(X> X)+ X> ; here
A+ stands for the Moore-Penrose pseudoinverse.
Indeed, rank(X> X) = rank(X) ≤ min(N, n). Under the assumption that X> X > 0,
the projection estimator is unique and has the form
LS X
n
fˆLS (x) = ϕ(x)> θ̂ = ϕ(x)> (X> X)−1 X> y = Wni (x)Yi ,
i=1
where ϕ(x) = (ϕ1 (x), . . . , ϕN (x))> and Wni (x) is the i-th component of the vector
ϕ(x)> (X> X)−1 X> .
Assume now that we have the regression model (48), where the points Xi are deter-
ministic elements in the space X. Let us measure the accuracy of an estimator fˆ of
f by the following squared risk:
" n #
1 X
R(f , f̂ ) := Ekf − f̂ k2 = E (fˆ(Xi ) − f (Xi ))2 .
n i=1
This choice of a loss function is quite natural and it measures the prediction accuracy
of the estimator at the observed design points. Further, if the Xi are “equi-spaced”
points then R(f , f̂ ) is approximately equal to the MISE.
P
Let fˆ(x) be a linear estimator, i.e., fˆ(x) = ni=1 Wni (x)Yi . Then we can write f̂ = Sy
where S := (Wnj (Xi ))n×n is a deterministic matrix. Note that S does not depend on
4
The Hilbert projection theorem is a famous result of convex analysis that says that for every
point u in a Hilbert space H and every nonempty closed convex C ⊂ H, there exists a unique point
v ∈ C for which kx − yk is minimized over C. This is, in particular, true for any closed subspace M
of C. In that case, a necessary and sufficient condition for v is that the vector u − v be orthogonal
to M .
29
y; it depends only on the Xi ’s. As particular cases, we can think of fˆ as the LP(`)
estimator or the projection estimator in (16).
Proposition 3.2. Let ξi be random variables such that E(ξi ) = 0 and E(ξi ξj ) = σ 2 δij
for i, j = 1, . . . , n, where δij is the Kronecker delta function. Let S be any n×n matrix.
Then the risk of linear estimator f = Sy is given by
σ2
R(f , f̂ ) = kSf − f k2 + tr(S > S).
n
Proof. By definition of the norm k · k and of the model,
kf̂ − f k2 = kSf + Sξ − f k2
2
= kSf − f k2 + (Sf − f )> Sξ + kSξk2 .
n
Taking expectations and using that E(ξ) = 0 we obtain
1
Ekf̂ − f k2 = kSf − f k2 + E(ξ > S > Sξ).
n
Set V = S > S and denote the elements of this matrix by vij . We have
!
Xn Xn
> >
E(ξ S Sξ) = E ξi vij ξj = σ 2 vii = σ 2 tr(V ).
i,j i=1
Thus, we have
σ2
R(f , f̂ ) ≤ kSf − f k2 + min(n, N )
n
σ2
= min kfθ − f k2 + min(n, N ). (20)
θ∈RN n
In fact, a close inspection of the proof of Proposition 3.2 shows that for the above
inequality to hold it is enough to assume that E(ξi2 ) ≤ σ 2 , and E(ξi ξj ) = 0 for i 6= j,
where i, j = 1, . . . , n.
In order to control this bias term and to analyze the rate of convergence of projection
estimator, we need to impose some assumptions on the underlying function f and on
the basis {ϕj }∞
j=1 .
30
3.1.1 Projection estimator with trigonometric basis in L2 [0, 1]
Here we continue to consider the nonparametric regression model (48) and we will
assume that X = [0, 1]. We will mainly focus on a particular case, Xi = i/n.
Definition 3.3. The trigonometric basis is the orthonormal basis of L2 [0, 1] defined
by
√ √
ϕ1 (x) = 1, ϕ2k (x) = 2 cos(2πkx), ϕ2k+1 (x) = 2 sin(2πkx), k = 1, 2 . . . ,
We will assume that the regression function f is sufficiently smooth, or more specif-
ically, that it belongs to a periodic Sobolev class of functions. First, we define the
periodic Sobolev class for integer smoothness β.
Definition 3.4. Let β ≥ 1 be an integer and let L > 0. The periodic Sobelev class
W (β, L) is defined as
n
W (β, L) := f : [0, 1] → R : f (β−1) is absolutely continuous and
Z 1 o
(β) 2 2 (j) (j)
(f (x)) dx ≤ L , f (0) = f (1), j = 0, 1, . . . , β − 1
0
Any function f belonging to such a class is continuous and periodic (f (0) = f (1))
and thus admits the representation
∞
X
f (x) = θ1 ϕ1 (x) + (θ2k ϕ2k (x) + θ2k+1 ϕ2k+1 (x)) (21)
k=1
where {ϕj }∞j=1 is the trigonometric basis given in Definition 3.3. The above infinite
series converges pointwise, and the sequence θ = {θj }∞ j=1 of Fourier coefficients of f
belongs to the space ( )
X∞
`2 (N) := θ : θj2 < ∞ .
j=1
We now state a necessary and sufficient condition on θ under which the function (21)
belongs to the class W (β, L). Define
(
jβ , for even j,
aj = β
(22)
(j − 1) , for odd j.
31
Proposition 3.5. Let β ∈ {1, 2, . . .}, L > 0, and let {ϕj }∞ j=1 be the trigonometric
basis. A function f ∈ L2 [0, 1] belong to W (β, L) if and only if the vector θ of the
Fourier coefficients of f belongs to the following ellipsoid in `2 (N):
( ∞
)
X
Θ(β, Q) := θ ∈ `2 (N) : a2j θj2 ≤ Q (23)
j=1
The set Θ(β, Q) defined by (23) with β > 0 (not necessarily an integer), Q > 0, and
aj satisfying (22) is called a Sobolev ellipsoid. We mention the following properties of
these ellipsoids.
P
• If β > 1/2, the function f = ∞ ∞
j=1 θj ϕj with the trigonometric basis {ϕj }j=1
and θ ∈ Θ(β, Q) is continuous (check this as an exercise). In what follows, we
will basically consider this case.
The ellipsoid Θ(β, Q) is well-defined for all β > 0. In this sense Θ(β, Q) is a more
general object than the periodic Sobolev class W (β, L), where β can only be an
integer. Proposition 3.5 establishes an isomorphism between Θ(β, Q) and W (β, L)
for integer β. It can be extended to all β > 0 by generalizing the definition of W (β, L)
in the following way.
Definition 3.6. For any β > 0 and L > 0 the Sobolev class W (β, L) is defined as:
n o
W (β, L) = f ∈ L2 [0, 1] : θ = {θj }∞
j=1 ∈ Θ(β, Q)
R1
where θj = 0 f ϕj and {ϕj }∞ j=1 is the trigonometric basis. Here Θ(β, Q) is the Sobolev
ellipsoid defined in (23), where Q = L2 /π 2β and {aj }∞j=1 is given by (22).
For all β > 1/2, the functions belonging to W (β, L) are continuous. On the contrary,
they are not always continuous for β < 1/2; an example is given by the function
f (x) = sign(x − 1/2), whose Fourier coefficients θj are of order 1/j.
Lemma 3.7. Let {ϕj }∞
j=1 be the trigonometric basis. Then,
1X
n
ϕj (s/n)ϕk (s/n) = δjk , 1 ≤ j, k ≤ n − 1, (24)
n s=1
32
where δjk is the Kronecker delta.
We are now ready to establish an upper bound on the bias of the projection estimator.
Proposition 3.8. Let f ∈ W (β, L), β ≥ 1, L > 0. Assume that {ϕj }∞ j=1 is the
trigonometric basis and Xi = i/n, i = 1, . . . , n. Then, for all n ≥ 1, N ≥ 1,
1 1
inf Ekfθ − f k ≤ C(β, L)
2
+ ,
θ∈RN N 2β n
33
4 Minimax lower bounds
The minimax risk associated with a statistical model {Pθ : θ ∈ Θ} and with a semi-
distance d is defined as h i
R∗n := inf sup Eθ d2 (θ̂n , θ) ,
θ̂n θ∈Θ
where the infimum is over all estimators. The upper bounds established previously
imply that there exists a constant C < ∞ such that
for a sequence {ψn }n≥1 converging to zero. The corresponding lower bounds claim
that there exists a constant c > 0 such that, for the same sequence {ψn }n≥1 ,
34
Define p = dP/dν, q = dQ/dν. Observe that such a measure ν always exists since we
can take, for example, ν = P + Q.
Definition 4.1. The Hellinger distance between P and Q is defined as follows:
Z Z
2 √ √ 2 √
H (P, Q) := ( p − q) dν = 2 1 − pq dν .
Exercise (HW2): The following are some properties of the Hellinger distance:
3. 0 ≤ H 2 (P, Q) ≤ 2.
Definition 4.2. The total variation distance between P and Q is defined as follows:
Z
V (P, Q) := sup |P (A) − Q(A)| = sup (p − q)dν .
A∈A A∈A A
35
It can be shown that the above definition always makes sense if P Q. Here are
some properties of the Kullback divergence:
The next lemma links the Hellinger distance with the Kullback divergence.
Lemma 4.6.
H 2 (P, Q) ≤ K(P, Q).
The following lemma links the total variation distance with the Kullback divergence.
Lemma 4.7 (Pinsker’s inequality).
p
V (P, Q) ≤ K(P, Q)/2.
which shows the first equality. To show the inequality, we use Cauchy-Schwarz in-
equality to obtain
Z Z
1 √
|p − q|dν = √ |p − q| p dν ≤ χ2 (P, Q).
p
36
Lemma 4.10. If P and Q are product measures, P = ⊗ni=1 Pi and Q = ⊗ni=1 Qi , then
Y
n
2
χ (P, Q) = (χ2 (Pi , Qi ) + 1) − 1.
i=1
Our aim is to obtain a lower bound for the minimax risk on (Θ, d) where Θ is a
Sobolev density:
Θ = P(β, L), β > 0, L > 0,
and where d is a distance at a fixed point x0 ∈ R:
d(f, g) = |f (x0 ) − g(x0 )|.
β
The rate that we would like to obtain is ψn = n− 2β+1 . Indeed, this is the same rate
as in the upper bounds which will enable us to conclude that ψn is optimal on (Θ, d).
for all n sufficiently large, where Tn ranges over all density estimators and c > 0 is a
constant. For brevity we write Tn = Tn (x0 ). For any p0 , p1 ∈ P(β, L), we may write
sup Ep [(Tn − p(x0 ))2 ] ≥ max Ep0 [(Tn − p0 (x0 ))2 ], Ep1 [(Tn − p1 (x0 ))2 ]
p∈P(β,L)
1
≥ Ep0 [(Tn − p0 (x0 ))2 ] + Ep1 [(Tn − p1 (x0 ))2 ] . (27)
2
Note that
Z Z !
Y
n
Ep [(Tn − p(x0 ))2 ] = ... [Tn (x1 , . . . , xn ) − p(x0 )]2 p(xi )dxi .
i=1
Q
Let x := (x1 , . . . , xn ) and πn (x) = ni=1 p(xi ). Also, let π0,n , π1,n be the joint densities
corresponding to the chosen densities p0 and p1 . The expression in (27) is then equal
to
Z Z
1 2 2
(Tn (x) − p0 (x0 )) π0,n (x)dx + (Tn (x) − p1 (x0 )) π1,n (x)dx
2
Z
1 2 2
≥ (Tn (x) − p0 (x0 )) + (Tn (x) − p1 (x0 )) min{π0,n (x), π1,n (x)} dx
2
Z
1 2
≥ (p0 (x0 ) − p1 (x0 )) min{π0,n (x), π1,n (x)} dx,
4
37
where we have used the fact that u2 + v 2 ≥ (u − v)2 /2, for u, v ∈ R.
In view of the above, to prove (26) it suffices to find densities p0 and p1 such that
We take p0 to be a density on R such that p0 ∈ Σ(β, L/2) and p0 (x0 ) > 0; e.g., p0 can
be the N (0, σ 2 ) density with σ 2 chosen is such a suitable way. Obviously p0 ∈ Σ(β, L).
Construct p1 by adding a small perturbation to p0 :
β x − x0
p1 (x) := p0 (x) + h K ,
h
0.2
0.2
K_0(x)
g(x)
0.0
0.1
−0.2
0.0
Proof. Let
1
−
K0 (u) := e 1−u2 I[−1,1] (u).
Then, K0 ∈ C ∞ (R) and the support of K0 is [−1, 1]. Let g : [− 12 , 32 ] → R be defined
as
g(u) := K0 (2u) − K0 (2(u − 1)).
Observe that
38
1. g(0) 6= 0,
R
2. g(u)du = 0,
Define K : [− 12 , 23 ] → R such that K(u) := ag(u) for a > 0 small enough so that
K ∈ Σ(β, L/2).
R R
Using the fact that g(u)du = 0 it is easy to see that p1 (x)dx = 1. Next we show
that p1 ≥ 0 for h > 0 small enough. For x ∈ [x0 − h2 , x0 + 3h 2
],
β x − x0
p1 (x) ≥ min p0 (t) − sup h K
t∈[x0 − h
2
,x 0 + 3h
2
] t∈[x 0 − h
,x 0 + 3h
] h
2 2
β
≥ min p0 (t) − h sup |K (t)| .
t∈[x0 − h ,x + 3h
2 0 2
] t∈R
Since p0 is continuous, p0 (x0 ) > 0, we obtain that p1 (x) > 0 for all x ∈ [x0 − h2 , x0 + 3h
2
],
h 3h
if h is smaller than some constant h0 > 0. Note that for x ∈ / [x0 − 2 , x0 + 2 ],
p1 (x) = p0 (x) ≥ 0. Thus, p1 is a density.
We now have to show that p1 ∈ Σ(β, L). Set ` := bβc. Clearly, p1 is ` times
differentiable. Further,
(`) (`) β−` (`) x − x0
p1 (x) = p0 (x) + h K .
h
Hence,
(`) (`) (`) (`) x − x0 x0 − x0
|p1 (x) − p1 (x0 )| ≤ |p0 (x) − p0 (x)| +h β−`
K (`)
−K (`)
h h
β−`
L L x − x0
≤ |x − x0 |β−` + hβ−` ≤ L|x − x0 |β−` ,
2 2 h
Thus,
β
|p0 (x0 ) − p1 (x0 )| = hβ K(0) = K(0)n− 2β+1 .
Next we will try to show that (iii) holds. In view of Lemma 4.9, it suffices to bound
39
χ2 (π0,n , π1,n ) from above by a constant strictly less than 1. First write χ2 (p0 , p1 ) as
Z Z x0 +3h/2 2β 2
(p0 − p1 )2 h K ((x − x0 )/h)
= dx
p0 x0 −h/2 p0 (x)
Z
h2β+1
≤ K 2 (u)du
minx∈[x0 −h/2,x0 +3h/2] p0 (x)
Z
h2β+1
≤ K 2 (u)du
minx∈[x0 −1/2,x0 +3/2] p0 (x)
where we have assumed that h ≤ α and α ≤ 1. Plugging the choice of h we obtain
where the constant c∗ depends only on p0 and K. Therefore, applying Lemma 4.10
we find
χ2 (π0,n , π1,n ) ≤ (1 + c∗ α2β+1 n−1 )n − 1 ≤ exp(c∗ α2β+1 ) − 1,
where we have used the fact that 1 + v < ev , for v ∈ R. Now, we choose α small
enough so that exp(c∗ α2β+1 ) − 1 < 1. Then,
Z
1 1
min(π0 , π1 ) ≥ 1 − = ,
2 2
and thus, condition (iii) is satisfied.
Theorem 4.12. Let β > 0, L > 0. There exists a constant c > 0 that only depends
on β and L such that, for all x0 ∈ R, n ≥ 1,
2β
inf sup Ep (Tn (x0 ) − p(x0 ))2 ≥ cn− 2β+1 ,
Tn p∈P(β,L)
Since the choice of x0 is arbitrary, we can equivalently put inf x0 ∈R before the minimax
risk.
Definition 4.13. Let x0 be fixed, and let P be a class of densities on R. A sequence
{ψn }n≥1 , ψn > 0, is called an optimal rate of convergence of mean squared error (risk)
on the class P if the following two conditions are satisfied:
(i) inf Tn supp∈P Ep [(Tn (x0 ) − p(x0 ))2 ] ≥ cψn2 , where c > 0 is a constant independent
of n.
(ii) There exist an estimator pn (·), and a constant C > 0 independent of n such
that
sup Ep [(pn (x0 ) − p(x0 ))2 ] ≤ Cψn2 .
p∈P
40
If (i) and (ii) hold, then pn is called a rate optimal estimator for the risk on the class
P.
Corollary 4.14. Let β > 0, L > 0. The KDE with bandwidth h = αn−1/(2β+1) ,
α > 0, and kernel of order ` = bβc is rate optimal for the mean squared error on the
Hölder class P(β, L), and ψn = n−β/(2β+1) is the corresponding optimal rate.
Summary: We have seen that the following issues play the key role in nonparametric
estimation.
The lower bounds based on two hypotheses turn out to be inconvenient when we deal
with estimation in Lp distances; see e.g., the start of Section 2.6 of [14].
Let us consider the nonparametric density estimation problem under the L2 risk.
Then,
Z 1/2
2
d(f, g) = kf − gk2 = (f (x) − g(x)) dx .
Our aim is to prove an optimal lower bound on the minimax risk for the Sobolev class
of densities Θ = S(β, L) (where β ≥ 1 is an integer and L > 0) and the above L2
distance with the rate ψn = n−β/(2β+1) .
The subset Fn is indexed by the set of all vectors θ ∈ {0, 1}rn consisting of sequences
41
of rn zeros and ones. For h = n−1/(2β+1) , let xn,1 < xn,2 < . . . < xn,n be a regular grid
of mesh width 2h (i.e., xn,i − xn,i−1 > 2h, for i = 2, . . . , n).
For a fixed probability density p ∈ S(β, L/2) (e.g., let p be the density of N (0, σ 2 )
where σ 2 is such that p ∈ S(β, L/2)). Consider a fixed function K ∈ S(β, L0 ) with
support (−1, 1), and define, for every θ ∈ {0, 1}rn ,
Xrn
β x − xn,j
pn,θ (x) := p(x) + h θj K . (28)
j=1
h
If p is bounded away from zero on a interval containing the grid, |K| is bounded, and
R
K(x)dx = 0, then pn,θ is a p.d.f, at least for large n. Furthermore,
Z Z Z
(β)
2
(β) 2 2 L2 L2
pn,θ (x) dx ≤ 2 p (x) dx + 2hrn K (β) (x) dx ≤ 2 + 2 ≤ L2 .
4 4
Observe that in the above we have used the fact that the mesh width is more than
2h so that for j 6= k,
Z
x − xn,j x − xn,k
K K dx = 0.
h h
Of course, there exists many choices of p and K such that pn,θ ∈ S(β, L) for every θ.
Theorem 4.15. There exists a constant cβ,L such that for any density estimator p̂n ,
Z
sup Ep (p̂n − p) ≥ cβ,L n−2β/(2β+1) .
2
p∈S(β,L)
We will use the following result crucially to prove the above theorem.
The following lemma gives a lower bound for the maximum risk over the parameter
set {0, 1}r , in an abstract form, applicable to the problem of estimating an arbitrary
quantity ψ(θ) belonging to a semi-metric space (with semi-distance d). Let
X
r
0
H(θ, θ ) := |θi − θi0 |
i=1
be the Hamming distance on {0, 1}r , which counts the number of positions at which
θ and θ0 differ.
42
R
For two probability measures P and Q with densities p and q let kP ∧Qk := p∧q dν.
Before we state and prove Assouad’s lemma we give a simple result which will be useful
later.
Lemma 4.16 (Lemma from hypothesis testing). Suppose that we are given two
models Pθ0 and Pθ1 on a measurable space (X , A) with densities p0 and p1 with
respect to a σ-finite measure ν. Consider testing the hypothesis
H0 : θ = θ0 versus H0 : θ = θ1 .
1
πφ (θ1 ) − πφ (θ0 ) ≤ kPθ1 − Pθ0 k.
2
R
Proof. The difference on the left hand side can be written as φ(p1 − p0 )dν. The
expression is maximized for the test function I{p1 > p0 } (Exercise (HW2): Show
this). Thus,
Z Z Z
1
φ (p1 − p0 ) dν ≤ (p1 − p0 ) dν = |p1 − p0 |dν,
p1 >p0 2
as
Z Z Z
|p1 − p0 | dν = (p1 − p0 ) dν + (p0 − p1 ) dν
p1 >p0 p0 >p1
Z Z Z
= (p1 − p0 ) dν + (p0 − p1 ) dν − (p0 − p1 ) dν
p1 >p0 p1 >p0
Z
= 2 (p1 − p0 ) dν.
p1 >p0
Lemma 4.17 (Assouad’s lemma). For any estimator T of ψ(θ) based on an obser-
vation in the experiment {Pθ : θ ∈ {0, 1}r }, and any p > 0,
dp (ψ(θ), ψ(θ0 )) r
max 2 Eθ [d (T, ψ(θ))] ≥ min
p p
min kPθ ∧ Pθ0 k. (29)
θ H(θ,θ 0 )≥1 H(θ, θ0 ) 2 H(θ,θ0 )=1
43
By the triangle inequality, for any θ,
The maximum of this expression over Θ is bounded below by the average, which,
apart from the factor γ, can be written as
X Xr Xr X Z X Z
1 1 1 1
r
Eθ |Sj − θj | = r−1
Sj dPθ + r−1 (1 − Sj )dPθ
2 θ j=1 2 j=1 2 θ:θj =0
2 θ:θj =1
Z Z
1X
r
= Sj dP̄0,j + (1 − Sj )dP̄1,j ,
2 j=1
where
1 X 1 X
P̄0,j = Pθ and P̄1,j = Pθ .
2r−1 θ:θj =0
2r−1 θ:θj =1
This is minimized over S by choosing Sj for each j separately to minimize the j-th
term in the sum. The expression within brackets is the sum of the error probabilities
of a test of
H0 : P = P̄0,j versus H1 : P = P̄1,j .
Equivalently, it is equal to 1 minus the difference of power and level. By Lemma 4.16
it can be shown that this is at least 1 − 21 kP̄0,j − P̄1,j k = kP̄0,j ∧ P̄1,j k (by Lemma 4.9).
Hence, the preceding display is bounded below by
1X
r
kP̄0,j ∧ P̄1,j k.
2 j=1
The 2r−1 terms Pθ and Pθ0 in the averages P̄0,j and P̄1,j can be ordered and matched
such that each pair θ and θ0 differ only in their j-th coordinate. Conclude that
1X 1X
r r
r
kP̄0,j ∧ P̄1,j k ≥ min kPθ ∧ Pθ0 k ≥ min kPθ ∧ Pθ0 k.
2 j=1 2 j=1 H(θ,θ )=1,θj 6=θj
0 0 2 H(θ,θ0 )=1
44
dp (ψ(θ),ψ(θ0 ))
Observing that the γ in (30) can always be taken as minH(θ,θ0 )≥1 H(θ,θ 0 )
, we obtain
the desired result.
Yi = f (Xi ) + ξi , i = 1, . . . , n, (31)
y = f + ξ,
where y = (Y1 , . . . , Yn )> , f = (f (X1 ), . . . , f (Xn ))> and ξ = (ξ1 , . . . , ξn )> . The goal of
this section is to find the (optimal) lower bound on the rate of convergence for any
estimator of f based on the loss
1X
n
1
2
d (f , g) := kf − gk22 = (f (Xi ) − g(Xi ))2 (32)
n n i=1
Proof. We will use Assouad’s lemma to prove the desired result. Fix an integer
1 ≤ k ≤ n (be chosen later) and let rn := bn/kc, where bxc denotes the largest
integer smaller than or equal to x. Let us define f ∈ Rn as
(
V (j−1)
rn
, if (j − 1)k < i ≤ jk;
fi = (rn −1)
V rn , if rn k < i ≤ n.
45
Take f to be any nondecreasing function on [0, 1] such that f (Xi ) = fi , for i = 1, . . . , n.
Also, it can be assumed that f ∈ MV . Let Θ = {0, 1}rn and let ψ(θ) ∈ Rn , for θ ∈ Θ,
be defined as:
V X
rn
ψ(θ)i = fi + (2θj − 1)I{(j − 1)k < i ≤ jk}. (33)
2rn j=1
For θ, θ0 ∈ Θ, we have
1X X
rn
1 X
rn 2
V 2k
0 2V
= k|θj − θj | 2 = 2 H(θ, θ0 ).
n j=1
rn rn n
We can consider a more general framework where the goal is to find lower bounds of
the following form:
h i
lim inf inf sup Eθ w(ψ −1 d(θ̂n , θ)) ≥ c > 0,
n→∞ θ̂n θ∈Θ
46
where w : [0, ∞) → [0, ∞) is nondecreasing, w(0) = 0 and w 6= 0 (e.g., w(u) =
up , p > 0). A general scheme for obtaining lower bounds is based on the following
three remarks:
(a) Reduction to bounds in probability. For any A > 0 satisfying w(A) > 0 we have
h i h i
Eθ w(ψn−1 d(θ̂n , θ)) ≥ w(A) Pθ ψn−1 d(θ̂n , θ) ≥ A . (34)
where s ≡ sn = Aψn .
for any finite set {θ0 , . . . , θM } contained in Θ. In the examples we have already
seen that the finite set {θ0 , . . . , θM } has to be chosen appropriately. We call the
M + 1 elements θ0 , . . . , θM as hypotheses. We will call a test any A-measurable
function Ψ : X → {0, 1, . . . , M }.
Ψ∗ = argmin d(θ̂n , θk ).
0≤k≤M
Therefore,
inf max Pθ d(θ̂n , θ) ≥ s ≥ inf max Pj (Ψ 6= j) =: pe,M , (37)
θ̂n θ∈{θ0 ,...,θM } Ψ 0≤j≤M
where Pj ≡ Pθj and inf Ψ denotes the infimum over all tests.
47
Thus, in order to obtain lower bounds it is sufficient to check that
pe,M ≥ c0 ,
where the hypotheses θj satisfy (36) with s = Aψn and where the constant c0 > 0 is
independent of n. The quantity pe,M is called the minimum probability of error for
the problem of testing M + 1 hypotheses θ0 , θ1 , . . . , θM .
Remark 4.1. Let P0 , P1 , . . . , PM be probability measures on a measurable space
(X , A). For a test Ψ : X → {0, 1, . . . , M }, define the average probability of error and
the minimum average probability of error by
1 X
M
p̄e,M (Ψ) := Pj (Ψ 6= j), and p̄e,M := inf p̄e,M (Ψ).
M + 1 j=0 Ψ
Note that as
pe,M ≥ p̄e,M ,
we can then use tools (from multiple hypotheses testing) to lower bound p̄e,M .
Example 4.19. Let Θ = [0, 1]. Consider data X1 , . . . , Xn i.i.d. Bernoulli(θ), where
θ ∈ Θ. Thus, here Pθ is the joint distribution of X = (X1 , . . . , Xn ). The goal is to find
the minimax lower bound for the estimation of θ under the loss d(θ̂n , θ) := |θ̂n − θ|.
We want to show that there exists c > 0 such that
h i
lim inf inf sup Eθ nd2 (θ̂n , θ) ≥ c > 0.
n→∞ θ̂n θ∈Θ
We can bound V (P0 , P1 ) using Pinsker’s inequality (see Lemma 4.7) and then use
Property (3) of the Kullback divergence to show that
2 1 1 + 2s
V (P0 , P1 ) ≤ K(P0 , P1 ) ≤ nK(Ber(θ0 ), Ber(θ1 )) = 2s log .
2 1 − 2s
Using the fact that x log 1+x
1−x
≤ 3x2 for x ∈ [0, 21 ], we can now show the desired
1
result for c = 48 .
48
Figure 3: Graphs of H and g with M = 10.
0.7
0.6
2.0
0.5
1.5
0.4
H(x)
g(x)
1.0
0.3
0.2
0.5
0.1
0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
x x
1 X
M
g(p̄e,M ) ≥ log(1 + M ) − K(Pj , P ), (38)
M + 1 j=0
where
1 X
M
P = Pj ,
M + 1 j=0
and, for x ∈ [0, 1],
Proof. We have
"M # "M #
1 X dPj X
p̄e,M (Ψ) = E I(Aj ) = EP bj p j (39)
M + 1 P j=0 dP j=0
where
dPj
pj := (M + 1)−1
, Aj := {Ψ 6= j}, bj = I(Aj )
dP
and EP denotes the expectation with respect to P . The random variables bj and pj
satisfy P -a.s. the following conditions:
X
M X
M
bj = M, bj ∈ {0, 1}, and pj = 1, pj ≥ 0.
j=0 j=0
49
Then we have that, P -a.s.,
X
M X
bj p j = pj , (40)
j=0 j6=j0
where 0 log 0 := 0.
1 X
M
= log(1 + M ) − K(Pj , P ).
M + 1 j=0
Since there exists a sequence of tests {Ψk }k≥1 such that p̄e,M (Ψk ) → p̄e,M as k → ∞,
we obtain, by the continuity of g,
1 X
M
g(p̄e,M ) = lim g(p̄e,M (Ψk )) ≥ log(1 + M ) − K(Pj , P ).
k→∞ M + 1 j=0
It remains to show that p̄e,M ≤ M/(M + 1). For this purpose, we define a degenerate
test Ψ∗ ≡ 1, and observe that
1 X
M
∗ M
inf p̄e,M (Ψ) ≤ p̄e,M (Ψ ) = 6 1) =
Pj (j = .
Ψ M + 1 j=0 M +1
50
Using Fano’s lemma we can bound from below the minimax probability of error pe,M
in the following way:
where g −1 (t) := 0 for t < 0, and for 0 < t < log(M + 1), g −1 (t) is a solution of the
equation g(x) = t with respect to x ∈ [0, M/(M + 1)] — this solution exists as g is
continuous and strictly increasing on [0, M/(M + 1)] and g(0) = 0, g(M/(M + 1)) =
log(M + 1).
Proof. As H(x) ≤ log 2 for all x ∈ [0, 1], and g(x) = x log M + H(x), we have,
from (38),
Determining I(M ) exactly is usually intractable however and one typically works with
appropriate bounds on I(M ). In fact, (42) is going to be useful if we can show that
log(M + 1) − I(M ) > 0. The following corollary gives a sufficient condition for this
and gives a non-trivial lower bound on pe,M .
Corollary 4.23. Let P0 , P1 , . . . , PM be probability measures on a measurable space
(X , A), M ≥ 2. If
1 X
M
K(Pj , P0 ) ≤ α log(M + 1) (45)
M + 1 j=0
with 0 < α < 1, then
log 2
pe,M ≥ p̄e,M ≥ 1 − − α. (46)
log(M + 1)
51
Proof. We will use the elementary fact (show this; Exercise (HW2)):
1 X 1 X
M M
K(Pj , P0 ) = K(Pj , P ) + K(P , P0 ). (47)
M + 1 j=0 M + 1 j=0
Thus, using the above display, (38) and the fact that K(P , P0 ) ≥ 0, we get
1 X
M
g(p̄e,M ) ≥ log(M + 1) − K(Pj , P ),
M + 1 j=0
1 X
M
≥ log(M + 1) − K(Pj , P0 )
M + 1 j=0
≥ log(M + 1) − α log(M + 1).
A similar calculation as in the proof of Corollary 4.22 now yields the desired result.
Yi = f (Xi ) + ξi , i = 1, . . . , n, (48)
with f : [0, 1] → R, the ξi ’s being i.i.d. N (0, σ 2 ), and the Xi ’s are arbitrary random
variables taking values in [0, 1] such that (X1 , . . . , Xn ) is independent of (ξ1 , . . . , ξn ).
Theorem 4.24. Let β > 0 and L > 0. Consider data from the above model where
f ∈ Σ(β, L). Let
β/(2β+1)
log n
ψn = .
n
Then,
lim inf inf sup Ef [ψn−2 kTn − f k2∞ ] ≥ c
n→∞ Tn f ∈Σ(β,L)
where inf Tn denotes the infimum over all estimators and where the constant c > 0
depends only on β, L and σ 2 .
Proof. The proof was mostly done in class; also see [14, Theorem 2.11].
In Section 4.4 we described a general scheme for proving lower bounds. In step (c)
of the scheme it is important to choose the hypotheses θj ’s in Θ such that they are
52
2s-separated. Further, the choice of the number of such points M depends on how
large the space Θ is. In this section we define a concept that has been successfully
employed in many fields of mathematics to capture the size of the an underlying set
(with a semi-metric). We also give a few examples from parametric models to show
how this concept can be used in conjunction with Fano’s lemma (as discussed in the
last section) to yield useful lower bounds that do not need specification of the exact
θj ’s (the perturbation functions).
The metric entropy of the set Θ is the logarithm of its covering number: log N (δ, Θ, d).
We can define a related measure — more useful for constructing our lower bounds —
of size that relates to the number of disjoint balls of radius δ > 0 that can be placed
into the set Θ.
Definition 4.26 (Packing number). A δ-packing of the set Θ with respect to the
semi-metric d is a set {θ1 , . . . , θD } such that for all distinct v, v 0 ∈ {1, . . . , D}, we
have d(θv , θv0 ) ≥ δ.
Equivalently, call a collection of points δ-separated if the distance between each pair
of points is larger than δ. Thus, the packing number D(δ, Θ, d) is the maximum
number of δ-separated points in Θ.
Thus, packing and covering numbers have the same scaling in the radius δ.
Remark 4.2. As shown in the preceding exercise, covering and packing numbers are
closely related, and we can use both in the following. Clearly, they become bigger as
δ → 0.
We can now provide a few more complex examples of packing and covering numbers,
presenting two standard results that will be useful for constructing the packing sets
used in our lower bounds to come.
Our first bound shows that there are (exponentially) large packings of the d-dimensional
hypercube of points that are O(d)-separated in the Hamming metric.
Lemma 4.27 (Varshamov-Gilbert Lemma). Fix k ≥ 1. There exists a subset V of {0, 1}^k with |V| ≥ exp(k/8) such that the Hamming distance H(τ, τ′) := Σ_{i=1}^{k} I{τi ≠ τ′i} satisfies H(τ, τ′) > k/4 for all τ, τ′ ∈ V with τ ≠ τ′.
The meaning of maximal here is that if one tries to expand V by adding one more
element, then the constraint (49) will be violated. In other words, if we define the
closed ball B(τ, k/4) := {θ ∈ {0, 1}^k : H(θ, τ) ≤ k/4} for τ ∈ {0, 1}^k, then we must have
∪_{τ∈V} B(τ, k/4) = {0, 1}^k.
Let T1 , . . . , Tk denote i.i.d. Bernoulli random variables with probability of success 1/2.
For every A ⊆ {0, 1}k , we have P ((T1 , . . . , Tk ) ∈ A) = |A|2−k . Therefore, for each
τ ∈ V, we can write
2^{−k} |B(τ, k/4)| = P( (T1, . . . , Tk) ∈ B(τ, k/4) ) = P( Σ_{i=1}^{k} I{Ti ≠ τi} ≤ k/4 ).
If Si := I{Ti ≠ τi}, then it is easy to see that S1, . . . , Sk are also i.i.d. Bernoulli random variables with probability of success 1/2. Thus, by a Chernoff bound,
P( Σ_{i=1}^{k} Si ≤ k/4 ) = P( Σ_{i=1}^{k} (1 − Si) ≥ 3k/4 ) ≤ inf_{λ>0} exp(−3λk/4) E[ exp( λ Σ_{i=1}^{k} (1 − Si) ) ] = inf_{λ>0} exp(−3λk/4) 2^{−k} (1 + e^λ)^k.
Taking e^λ = 3 gives P( Σ_{i=1}^{k} Si ≤ k/4 ) ≤ 2^k 3^{−3k/4}, so that |B(τ, k/4)| ≤ 4^k 3^{−3k/4}. Since the balls B(τ, k/4), τ ∈ V, cover {0, 1}^k,
|V| ≥ 2^k / |B(τ, k/4)| ≥ 3^{3k/4} / 2^k = exp( k log(3^{3/4}/2) ) ≥ exp(k/8).
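As a quick numerical illustration of the lemma (a minimal sketch, not part of the proof; the function name greedy_packing, the greedy construction and the choice k = 12 are ours), one can build a maximal k/4-separated subset of {0, 1}^k greedily and check that its size exceeds exp(k/8):

```python
import itertools
import math

import numpy as np


def greedy_packing(k, min_dist):
    """Greedily collect binary strings whose pairwise Hamming distance exceeds min_dist."""
    V = []
    for tau in itertools.product([0, 1], repeat=k):
        tau = np.array(tau)
        if all(np.sum(tau != v) > min_dist for v in V):
            V.append(tau)
    return V


k = 12
V = greedy_packing(k, k / 4)
# a maximal k/4-packing is at least as large as exp(k/8), by the covering argument above
print(len(V), math.exp(k / 8))
```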
Given the relationships between packing, covering, and size of the set Θ, we would
expect there to be relationships between volume, packing, and covering numbers.
This is indeed the case, as we now demonstrate for arbitrary norm balls in finite
dimensions.
Lemma 4.28. Let B := {θ ∈ R^d : ∥θ∥2 ≤ 1} denote the unit Euclidean ball in R^d. Then
(1/δ)^d ≤ N(δ, B, ∥·∥2) ≤ (1 + 2/δ)^d.    (51)
As a consequence of Lemma 4.28, we see that for any δ < 1, there is a packing V
of B such that kθ − θ0 k2 ≥ δ for all distinct θ, θ0 ∈ V and |V| ≥ (1/δ)d , because we
know D(δ, B, k · k2 ) ≥ N (δ, B, k · k2 ). In particular, the lemma shows that any norm
ball has a 1/2-packing in its own norm with cardinality at least 2d . We can also
construct exponentially large packings of arbitrary norm-balls (in finite dimensions)
where points are of constant distance apart.
Lemma 4.29 (Exercise (HW2)). Suppose F = {fθ : θ ∈ Θ}, where the map θ ↦ fθ is L-Lipschitz, i.e., ∥fθ − fθ′∥F ≤ L ∥θ − θ′∥Θ. Then N(δ, F, ∥·∥F) ≤ N(δ/L, Θ, ∥·∥Θ) for all δ > 0.
Exercise (HW2): Let F be the set of L-Lipschitz functions mapping [0, 1] to [0, 1]. Show that, in the supremum norm ∥f∥∞ := sup_{x∈[0,1]} |f(x)|, the metric entropy satisfies log N(δ, F, ∥·∥∞) ≍ L/δ (up to constants).
Hint (proof idea): Form a δ grid of the y-axis and a δ/L grid of the x-axis, and consider all functions that are piecewise linear on this grid, where all pieces have slopes +L or −L. There are of order 1/δ starting points, and for each starting point there are 2^{L/δ} slope choices. Show that this set is an O(δ) packing and an O(δ) cover.
Example 4.30 (Normal mean estimation). Consider the d-dimensional normal lo-
cation family Nd := {N (θ, σ 2 Id ) : θ ∈ Rd }, where σ 2 > 0 and d ≥ 2. We wish to
estimate the mean θ in the squared error loss, i.e., d2 (θ̂n , θ) = kθ̂n − θk22 , given n
i.i.d. observations X1 , . . . , Xn from a member in Nd with mean θ. Let Pθ denote the
joint distribution of the data.
Let V be a 1/2-packing of the unit k·k2 -ball with cardinality at least 2d , as guaranteed
by Lemma 4.28. Now we construct our local packing. Fix δ > 0, and for each v ∈ V,
set θv = δv ∈ Rd . Then we have
∥θv − θv′∥2 = δ ∥v − v′∥2 ≥ δ/2 =: 2s
for each distinct pair v, v 0 ∈ V, and moreover, we note that kθv − θv0 k2 ≤ 2δ for such
pairs as well. Thus, {θv }v∈V is a 2s-separated set with cardinality at least 2d . Let
θv0 , θv1 , . . . , θvM be an enumeration of the 2s-separated points, and we take Pj ≡ Pθvj ,
for j = 0, 1, . . . , M . Note that for j ∈ {0, . . . , M } such that Pj ≡ Pv , for some v ∈ V,
K(Pj, P0) = (n/(2σ²)) ∥θv − θ_{v0}∥²_2 ≤ 2nδ²/σ².
Therefore, taking δ 2 := dσ 2 log 2/(8n),
(1/(M+1)) Σ_{j=0}^{M} K(Pj, P0) ≤ 2nδ²/σ² = (d log 2)/4 ≤ α log(M + 1),
where α := 1/4. This shows that (45) holds. Hence, by (34), (35), (37) and Corol-
lary 4.23 we have
inf_{θ̂n} sup_{θ∈Θ} Eθ[ d²(θ̂n, θ) ] ≥ (1/64) · (dσ² log 2)/(8n) = c dσ²/n,
where c > 0. While the constant c is not sharp, we do obtain the right scaling in d,
n and the variance σ 2 . The sample mean attains the same risk.
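As a sanity check (a minimal simulation sketch; the values d = 5, σ = 1, n = 100 and the number of replications are ours), one can verify numerically that the sample mean's risk matches dσ²/n:

```python
import numpy as np

# Monte Carlo check that E||X_bar - theta||^2 = d*sigma^2/n (here theta = 0)
d, sigma, n, reps = 5, 1.0, 100, 10_000
rng = np.random.default_rng(0)
X = rng.normal(0.0, sigma, size=(reps, n, d))    # reps independent data sets
risk = (X.mean(axis=1) ** 2).sum(axis=1).mean()  # average of ||X_bar||^2 over replications
print(risk, d * sigma**2 / n)                    # the two numbers should agree closely
```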
Example 4.31 (Linear regression). In this example, we show how local packings can
give (up to some constant factors) sharp minimax rates for standard linear regression
problems. In particular, for fixed matrix X ∈ Rn×d , we observe
Y = Xθ + ε,
where ε ∼ N(0, σ² In) and θ ∈ R^d is the unknown parameter.
In this case, we use the Varshamov-Gilbert bound (Lemma 4.27) to construct a local
packing and attain minimax rates. Indeed, let V be a packing of {0, 1}d such that
kv − v 0 k1 ≥ d/4 for distinct elements of V, and let |V| ≥ exp(d/8) as guaranteed by
the Varshamov-Gilbert bound. For fixed δ > 0, if we set θv = δv, then we have the
packing guarantee for distinct elements v, v′ that
∥θv − θv′∥²_2 = δ² ∥v − v′∥²_2 = δ² H(v, v′) ≥ dδ²/4.
Moreover,
K(Pθv, Pθv′) = (1/(2σ²)) ∥X(θv − θv′)∥²_2 ≤ (1/(2σ²)) Λmax(X⊤X) ∥θv − θv′∥²_2 ≤ (dδ²/(2σ²)) Λmax(X⊤X),
where Λmax (X > X) denotes the maximum singular value of X > X.
Consequently, taking δ² := σ²/(16 Λmax(X⊤X)), we obtain that
inf_{θ̂n} sup_{θ∈Θ} Eθ[ d²(θ̂n, θ) ] ≥ ( dσ² / ((16)² n Λmax(X⊤X/n)) ) ( 1 − (8/d) log 2 − 1/4 ) ≥ c dσ² / ( n Λmax(X⊤X/n) ),
for some c > 0 if d > (32/3) log 2. Thus, the convergence rate is (roughly) σ 2 d/n
after rescaling the singular values of X > X by n−1/2 . This bound is sharp in terms
of the dimension d, dependence on n, and the variance σ 2 , but it does not fully
capture the dependence on X > X, as it depends only on the maximum singular value.
An exact calculation can show that the minimax value of the problem is exactly
σ 2 tr((X > X)−1 ).
One can show that
I(M) ≤ inf_Q (1/(M+1)) Σ_{j=0}^{M} K(Pj, Q).    (52)
Different choices of Q in (52) yield different upper bounds on I(M ). One gets, for
example,
I(M) ≤ min_{k=0,1,...,M} (1/(M+1)) Σ_{j=0}^{M} K(Pj, Pk) ≤ (1/(M+1)²) Σ_{j,k=0}^{M} K(Pj, Pk) ≤ max_{j,k∈{0,1,...,M}} K(Pj, Pk).    (53)
These bounds are very frequently used in conjunction with Fano’s inequality; see e.g.,
the two examples in Section 4.6.1. The last bound, max_{j,k∈{0,1,...,M}} K(Pj, Pk), is called the Kullback-Leibler diameter of {Pj}_{j=0}^{M}.
We will see that quite often (in nonparametric problems) the bounds in (53) are quite inaccurate, and we describe an improved bound due to [17].
Let P0 , P1 , . . . , PM be probability measures on a measurable space (X , A). Recall
that
I(M) := (1/(M+1)) Σ_{j=0}^{M} K(Pj, P̄).
Proof. By carefully choosing the distribution Q in the upper bound in (52) above, we will obtain the desired result. Now, assume that the distributions {Qi}_{i=1}^{N} form an ε²-cover of the family P, meaning that min_i K(P, Qi) ≤ ε² for all P ∈ P. Let pj and qi denote the densities of Pj and Qi with respect to some fixed base measure ν on X (the choice of base measure does not matter). Then, defining the distribution Q := (1/N) Σ_{i=1}^{N} Qi (with density q with respect to ν), we obtain for any j,
K(Pj, Q) = ∫ log(pj/q) pj dν = ∫ log( pj / (N^{−1} Σ_{i=1}^{N} qi) ) pj dν
         = log N + ∫ log( pj / Σ_{i=1}^{N} qi ) pj dν ≤ log N + ∫ log( pj / max_i qi ) pj dν
         ≤ log N + min_i ∫ log(pj/qi) pj dν = log N + min_i K(Pj, Qi).
By our assumption that the Qi's form an ε²-cover, min_i K(Pj, Qi) ≤ ε², which gives the desired result, as ε > 0 was arbitrary (as was our choice of the cover).
4.7.1 A general scheme for proving minimax bounds using global packings
There is now a four step process to proving minimax lower bounds using the global
Fano method. Our starting point is to recall the Fano minimax lower bound in (46)
of Corollary 4.22 and (37), which begins with the construction of a set of points
{θ(Pj)}_{j=0}^{M} that form a 2s-packing of a set Θ in the semi-metric d. With this in mind,
we perform the following four steps:
(i) Bound the packing entropy. Give a lower bound on the packing number of the
set Θ with 2s-separation (call this lower bound D(s) ≡ M + 1).
(ii) Bound the metric entropy. Give an upper bound on the KL-metric entropy of the class P of distributions containing all the distributions {Pj}_{j=0}^{M}, i.e., an upper bound on log N(ε, P, K).
(iii) Find the critical radius. Using Proposition 4.32 we can now balance I(M) and the packing entropy log D(s). To that end, we choose εn and sn > 0 at the critical radius, defined as follows: choose any εn such that εn² ≥ log N(εn, P, K), which by Proposition 4.32 guarantees I(M) ≤ 2εn²; then choose sn such that log D(sn) ≥ 4εn² + 2 log 2 ≥ 2(I(M) + log 2). Then,
(iv) Apply the Fano minimax bound (46). Having chosen sn and εn as above, we immediately obtain that
pe,M ≥ 1 − (I(M) + log 2)/log D(sn) ≥ 1 − 1/2 = 1/2,
and thus, we obtain
inf_{θ̂n} sup_{θ∈Θ} Eθ[ w(sn^{−1} d(θ̂n, θ)) ] ≥ (1/2) w(sn).
θ̂n θ∈Θ 2
4.7.2 An example
5 Reproducing kernel Hilbert spaces
and H is complete6 in the metric d(x, y) := kx − yk. Then H forms a Hilbert space
equipped with the inner product h·, ·i.
Example 5.2 (Euclidean space). Let H = R^m and ⟨x, y⟩ := Σ_{i=1}^{m} xi yi (where x = (x1, . . . , xm) ∈ R^m); or more generally ⟨x, y⟩ = x⊤Ay, where A is a symmetric positive definite matrix.
Example 5.3 (Euclidean matrices). Let H = Rm×m be the set of all m×m matrices.
Define hx, yi := tr(xy > ). Then h·, ·i defines a Hilbert space over m × m matrices.
Example 5.4 (L2 space). Let (Ω, A, µ) be a measure space and let L2 (Ω, A, µ) be
the set (of equivalence classes) of all square integrable functions with
⟨f, g⟩ := ∫ f g dµ.
Example 5.5 (Sobolev space). The Sobolev space Wm [0, 1] is the collection of all
functions f : [0, 1] → R with m − 1 continuous derivatives, f (m−1) absolutely contin-
uous, and kf (m) k < ∞. With an inner product h·, ·i defined by
⟨f, g⟩ := Σ_{k=0}^{m−1} f^{(k)}(0) g^{(k)}(0) + ∫_0^1 f^{(m)}(x) g^{(m)}(x) dx,  f, g ∈ Wm[0, 1],    (56)
6
A metric space H is said to be complete if every Cauchy sequence in H has a limit in H.
Wm [0, 1] is a Hilbert space.
Here are some properties of any Hilbert space H with inner product h·, ·i:
• (Dual space) The dual space H∗ (of H) is the space of all continuous linear functions from H into R. It carries a natural norm⁷, defined by ∥ϕ∥_{H∗} := sup_{x∈H: ∥x∥≤1} |ϕ(x)|.
Thus to every element ϕ of the dual H∗ there exists one and only one uϕ ∈ H
such that hx, uϕ i = ϕ(x), for all x ∈ H. The inner product on the dual space
H∗ satisfies
hϕ, ψiH∗ := huψ , uϕ iH .
So the dual space is also an inner product space. The dual space is also complete,
and so it is a Hilbert space in its own right.
• (Projection) For z ∈ H and a nonempty closed convex set C ⊆ H, there is a unique x ∈ C satisfying
∥x − z∥ = inf_{y∈C} ∥y − z∥,
and x is characterized by
⟨z − x, y − x⟩ ≤ 0, ∀ y ∈ C.    (58)
If C is a closed convex cone, this is equivalent to ⟨z − x, x⟩ = 0 and ⟨z − x, y⟩ ≤ 0 for all y ∈ C; if C is a closed linear subspace, it is equivalent to ⟨z − x, y⟩ = 0 for all y ∈ C.
Result: The orthogonal complement of a subset of a Hilbert space is a closed
linear subspace.
• (Direct sum) For A, B ⊆ H, define A ⊕ B := {x + y : x ∈ A, y ∈ B}.
The linear span of S ⊂ H, denoted span(S), is the collection of all finite linear
combinations α1 x1 + · · · + αn xn with α1 , . . . , αn ∈ R and x1 , . . . , xn ∈ S. The
closure of this set is denoted by span(S).
5.2 Reproducing Kernel Hilbert Spaces
Definition 5.6 (Reproducing kernel Hilbert space). Let X be an arbitrary set and
H a Hilbert space of real-valued functions on X . The evaluation functional over the
Hilbert space of functions H is a linear functional that evaluates each function at a
point x ∈ X ,
Lx : f ↦ f(x),  ∀ f ∈ H.
We say that H is a reproducing kernel Hilbert space (RKHS) if the evaluation functional Lx is continuous (equivalently, bounded) for every x ∈ X.
The Riesz representation theorem then implies that for all x ∈ X there exists a unique element Kx of H with the reproducing property:
f(x) = Lx(f) = ⟨f, Kx⟩,  ∀ f ∈ H.    (59)
Since Ky ∈ H for every y ∈ X, the reproducing property gives Ky(x) = ⟨Ky, Kx⟩. The reproducing kernel of H is the function K : X × X → R defined by K(x, y) = ⟨Kx, Ky⟩.
Example 5.7 (Linear kernel). Let X = Rd and let K(x, y) := x> y, for any x, y ∈ Rd ,
be the usual inner product in Rd . Then the linear kernel K is symmetric and positive
definite.
Example 5.8 (RKHS of the linear kernel). Let X = Rd . Consider the space H of
all linear forms on Rd : H := {f (x) = w> x : w ∈ Rd }. Define the inner product by
hf, giH = v > w for f (x) = v > x and g(x) = w> x. Then, the linear kernel K(x, y) :=
x> y is a reproducing kernel for H.
Example 5.9 (Gaussian and Laplace kernels). When X = Rd , the Gaussian and
Laplace kernels are defined as
K(x, y) := exp( −∥x − y∥²₂/(2σ²) ),   K(x, y) := exp( −∥x − y∥₂/(2σ²) ),
respectively, where x, y ∈ Rd , σ 2 > 0. Both kernels are positive definite, but the proof
of this fact is more involved than for the linear kernel.
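As an aside (a minimal numerical sketch; the function name, the toy data and σ = 1 are our own choices), one can compute the Gaussian-kernel Gram matrix of a sample and check that its eigenvalues are non-negative, which is what positive definiteness requires:

```python
import numpy as np


def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma**2))


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # 20 points in R^3
K = gaussian_kernel(X, X)
eigvals = np.linalg.eigvalsh(K)     # all eigenvalues should be >= 0 (up to round-off)
print(eigvals.min() >= -1e-10)
```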
Proof. The complete proof of this result is rather long. We give a sketch of the proof
here. For all x in X , define Kx := K(x, ·). Let H0 be the linear span of {Kx : x ∈ X }.
Define an inner product on H0 by
⟨ Σ_{j=1}^{n} βj K_{yj}, Σ_{i=1}^{m} αi K_{xi} ⟩_{H0} := Σ_{i=1}^{m} Σ_{j=1}^{n} αi βj K(yj, xi),
where {αi}_{i=1}^{m}, {βj}_{j=1}^{n} ⊂ R and {xi}_{i=1}^{m}, {yj}_{j=1}^{n} ⊂ X. The symmetry of this inner
product follows from the symmetry of K and the non-degeneracy follows from the
fact that K is positive definite. We can show that
Let H be the completion of H0 with respect to this inner product. We define an inner
product in H as: suppose that {fn }n≥1 and {gn }n≥1 are sequences in H0 converging
to f and g respectively. Then {⟨fn, gn⟩_{H0}}n≥1 is convergent and its limit depends only on f and g (see [2, Lemma 5] for a proof of the above). Thus we define
⟨f, g⟩_H := lim_{n→∞} ⟨fn, gn⟩_{H0}.
Next we have to show that H is indeed a Hilbert space with the inner product h·, ·iH
(see [2, Lemma 6] for a proof of this; we will have to further show that H is complete).
Further we can show that H0 is dense in H (see [2, Lemma 7] for a proof of this) and that the point evaluation map is continuous on H (see [2, Lemma 8] for a proof of this).
Now we can check the reproducing property (59), i.e., hf, Kx iH = f (x), for all f ∈ H,
for all x ∈ X . To prove uniqueness, let G be another Hilbert space of functions for
which K is a reproducing kernel. For any x and y in X, (59) implies that
⟨Kx, Ky⟩_H = K(x, y) = ⟨Kx, Ky⟩_G.
By linearity, ⟨·, ·⟩_H = ⟨·, ·⟩_G on the span of {Kx : x ∈ X}. Then G = H by the uniqueness of the completion. See
http://www.gatsby.ucl.ac.uk/∼gretton/coursefiles/RKHS Notes1.pdf for a more
detailed discussion on this proof.
The representer theorem [7] shows that solutions of a large class of optimization
problems can be expressed as kernel expansions over the sample points. We present
a slightly more general version of the theorem with a simple proof [10].
Let X be an arbitrary set and let HK be a RKHS of real valued functions on X with
reproducing kernel K(·, ·). Let {(Yi , Xi ) : i = 1, . . . , n} be given data (the “training
set”) with Xi ∈ X (the “attribute vector”), and Yi ∈ Y being the “response”.
Theorem 5.11. Denote by Ω : [0, ∞) → R a strictly increasing function. Let
` : (X × Y × Y)n → R ∪ {∞} be an arbitrary loss function. Then each minimizer
f ∈ HK of the regularized risk functional
ℓ( (X1, Y1, f(X1)), . . . , (Xn, Yn, f(Xn)) ) + Ω( ∥f∥_{HK} )    (61)
admits a representation of the form
f̂(x) = Σ_{i=1}^{n} αi K(Xi, x),  ∀ x ∈ X, where α1, . . . , αn ∈ R.    (62)
Proof. Decompose f ∈ HK as f = fn + f⊥, where fn := Σ_{i=1}^{n} αi K(Xi, ·) lies in the subspace S := span{K(Xi, ·) : i = 1, . . . , n} and f⊥ is orthogonal to S. Here αi ∈ R and ⟨f⊥, K(Xi, ·)⟩ = 0 for all i = 1, . . . , n. By the reproducing property,
f (Xi ) = hf, K(Xi , ·)i = hfn , K(Xi , ·)i + hf⊥ , K(Xi , ·)i = hfn , K(Xi , ·)i = fn (Xi ),
Hence ℓ(· · ·) depends only on the component of f lying in the subspace S, while Ω(·) is minimized when f⊥ = 0. Therefore the criterion function is minimized by an f lying in that subspace, and we can express the minimizer as in (62).
Note that as Ω(·) is strictly non-decreasing, kf⊥ k must necessarily be zero for f to
be the minimizer of (61), implying that fˆ must necessarily lie in the subspace S.
Monotonicity of Ω does not prevent the regularized loss functional (61) from hav-
ing multiple local minima. To ensure a global minimum, we would need to require
convexity. If we discard the strictness of the monotonicity, then it no longer follows
that each minimizer of the regularized loss admits an expansion (62); it still follows,
however, that there is always another solution that is as good, and that does admit
the expansion.
5.2.2 Feature map and kernels
A function K : X × X → R is called a kernel if there exist a Hilbert space E and a map ϕ : X → E such that
K(x, y) = ⟨ϕ(x), ϕ(y)⟩_E,  for all x, y ∈ X.    (63)
Such a map ϕ : X → E is referred to as the feature map, and the space E as the feature space. Thus kernels are functions that can be written as an inner product in a feature space.
Exercise (HW3): Show that K(·, ·) defined in (63) is a positive definite function.
Thus, we can think of the patterns as ϕ(x), ϕ(x′), and carry out geometric algorithms in the Hilbert space (feature space) E. Usually, dim(E) ≫ dim(X) (if dim(X) is defined).
Note that for a given kernel, there may be more than one feature map, as demonstrated by the following example: take X = R and K(x, y) = xy = [x/√2, x/√2] [y/√2, y/√2]⊤, where we can take the feature maps ϕ(x) = x and ϕ̃(x) = [x/√2, x/√2], with feature spaces E = R and Ẽ = R², respectively.
Exercise (HW3): For every x ∈ X, assume that the sequence {fn(x)}n≥1 ∈ ℓ²(N), where fn : X → R, for all n ∈ N. Then K(x1, x2) := Σ_{n=1}^{∞} fn(x1) fn(x2) is a kernel.
As K(·, ·) defined in (63) is symmetric and positive definite, it induces a unique RKHS. Thus, to construct reproducing kernels on complicated spaces we only need to find a feature map ϕ.
Another way to characterize a symmetric positive definite kernel K is via Mercer's theorem.
Figure 4: (Feature space and feature map) On the left, the points are plotted in the
original space. There is no linear classifier that can separate the red crosses
from the blue circles. Mapping the points to a higher dimensional feature space
(x 7→ ϕ(x) := (x1 , x2 , x1 x2 ) ∈ R3 ), we obtain linearly separable classes. A
possible decision boundary is shown as a gray plane.
Definition 5.13 (Integral operator). Let K be a continuous kernel on a compact metric space X, and let ν be a finite Borel measure on X. Let TK : L2(X, ν) → C(X) (C(X) being the space of all continuous real-valued functions on X, thought of as a subset of L2(X, ν)) be the linear map defined as:
(TK f)(·) := ∫_X K(x, ·) f(x) dν(x),  f ∈ L2(X, ν).
Mercer's theorem then states that there exist an at most countable index set J, non-negative numbers {λi}_{i∈J} (the eigenvalues of TK) and functions {ψi}_{i∈J} (the corresponding eigenfunctions, forming an orthonormal system in L2(X, ν)) such that
K(u, v) = Σ_{i∈J} λi ψi(u) ψi(v),  for all u, v ∈ X,
where the series converges absolutely and uniformly, i.e.,
lim_{n→∞} sup_{u,v∈X} | K(u, v) − Σ_{i∈J: i≤n} λi ψi(u) ψi(v) | = 0.
Example 5.15. To take an analogue in the finite case, let X = {x1 , . . . , xn }. Let
Kij = K(xi , xj ), and f : X → Rn with fi = f (xi ) and let ν be the counting measure.
Then,
TK f = Σ_{i=1}^{n} K(xi, ·) fi
and
f⊤Kf ≥ 0 for all f  ⇒  K is p.s.d.  ⇒  K = Σ_{i=1}^{n} λi vi vi⊤.
Hence,
K(xi, xj) = Kij = (V⊤ΛV)ij = Σ_{k=1}^{n} λk vki vkj.
Note that Mercer’s theorem gives us another feature map for the kernel K, since:
K(u, v) = Σ_{i∈J} λi ψi(u) ψi(v) = ⟨ϕ(u), ϕ(v)⟩_{ℓ²(J)},
so we can take ℓ²(J) as a feature space, and the corresponding feature map is ϕ : X → ℓ²(J), where
ϕ : x ↦ { √λi ψi(x) }_{i∈J}.
This map is well defined as Σ_{i∈J} |√λi ψi(x)|² = K(x, x) < ∞.
Apart from the representation of the kernel function, Mercer’s theorem also leads to
a construction of RKHS using the eigenfunctions of the integral operator TK .
Yi = f(xi) + εi,  i = 1, . . . , n,
where ε1, . . . , εn are mean zero, uncorrelated random variables with a common variance σ². As with the kernel approach, there is a presumption that f is smooth. The
smoothing spline approach tries to take direct advantage of this smoothness by aug-
menting the usual least squares criteria with a penalty for roughness. For instance,
if the xi ’s lie in [0, 1], the estimator fˆ might be chosen to minimize (over g)
Σ_{i=1}^{n} (Yi − g(xi))² + λ ∥g^{(m)}∥²₂,
where ∥·∥2 is the L2-norm of functions on [0, 1] under the Lebesgue measure, i.e., ∥g∥²₂ = ∫₀¹ g²(x) dx.
The constant λ is called the smoothing parameter. Larger values of λ lead to a smoother f̂; smaller values lead to an estimate f̂ that follows the observed data more closely (i.e., f̂(xi) will be closer to Yi).
We can use the RKHS approach to solve the above optimization problem using the
representer theorem. Please read Chapter 18.3 from [6] for the details (this was done
in class).
Yi = f(xi) + βwi + εi,  i = 1, . . . , n,
with 0 < x1 < . . . < xn < 1, f ∈ Wm[0, 1], β ∈ R, and the εi's i.i.d. from
N (0, σ 2 ). This might be called a semiparametric model because the dependence on
w is modeled parametrically, but the dependence on x is nonparametric. Following a
penalized least squares approach, consider choosing fˆ and β̂ to minimize
Σ_{i=1}^{n} (Yi − g(xi) − αwi)² + λ ∥g^{(m)}∥²₂.
(a) Show that the estimator fˆ will still be a natural spline of order 2m.
(b) Derive explicit formulas based on linear algebra to compute β̂ and fˆ.
Example 5.16 (Spam-filter). We have a sample of n e-mail messages. For each
message i, we count the percentages of 50 selected words characteristic for spam,
such as the words money, credit, Viagra and so on. This constitutes the vectors of
measurements Xi ∈ R50 . Then, an expert provides the values Yi = +1 if e-mail
i is spam and Yi = −1 otherwise. When a new message arrives, we would like to
decide whether it is spam or not. For this purpose, we measure the corresponding
percentages X ∈ R^50 in this message, and based on X and on the training data Dn, we have to find a decision Ŷ. The problem is usually solved by separating R^50 into two parts (corresponding to spam and non-spam) via a hyperplane depending on the training data Dn. This is called a linear classifier.
At first sight, the observations are of the same form as in the problem of regression
with random design. However, the important feature is that Yi ’s are now binary.
Even more important, in the classification context our final aim is different. We are
not interested in estimation of the regression function f ∗ (x) := E(Y |X = x) but
rather in predicting the value of the label Y. Note that the regression function now has the form
f*(x) = 2η(x) − 1,
where
η(x) := P(Y = 1 | X = x).
We define a classifier h as any measurable function from X to {−1, 1}. We predict
the label for an observed X as h(X). In practice, h depends on the observed data Dn
but, in this section, we will assume that the observed data is fixed and thus h is just
a function of X.
Our aim is to find the best classifier, i.e., a classifier which minimizes the risk R(h) := P(Y ≠ h(X)):
h* = argmin_h R(h).
We will call h∗ the Bayes classifier and we call the minimal possible risk R∗ the Bayes
risk, i.e.,
R∗ := min R(h) = R(h∗ ).
h
The next theorem shows that such a classifier always exists.
Theorem 5.17. (i) The Bayes classifier has the form
h*(x) = 1 if η(x) > 1/2, and h*(x) = −1 if η(x) ≤ 1/2.
For instance, one may estimate η(x) = π p1(x)/(π p1(x) + (1 − π) p−1(x)) by plugging in estimators, η̂n(x) = π̂ p̂1(x)/(π̂ p̂1(x) + (1 − π̂) p̂−1(x)), and classifying according to sign(2η̂n(x) − 1), where p̂1, p̂−1 and π̂ are parametric estimators of the class-conditional densities p1, p−1 and of π := P(Y = 1). If the pi's are Gaussian densities N(θi, Σ), i = −1, 1, then the decision rule is linear, which means that X is labeled 1 if and only if X⊤a + b > 0 for some a ∈ R^d, b ∈ R. Show this (Exercise (HW3)).
However, for this method to work we need η̂n to be close to η, which is typically
guaranteed if the function η has some smoothness properties. This is not always
reasonable to assume.
How to construct good classifiers based on the data? A first idea is to use the
principle of unbiased risk estimation. We need to find an unbiased estimator for the
risk R(h) = P(Y 6= h(X)) and then to minimize this estimator in h over a given class
H. Note that the empirical risk
Rn(h) := (1/n) Σ_{i=1}^{n} I{Yi ≠ h(Xi)}
is an unbiased estimator of R(h) for every h. Minimizing Rn(h) over h ∈ H can be used to obtain a classifier.
Definition 5.19. Let H be a fixed collection of classifiers. The empirical risk minimization (ERM) classifier over H is defined by
ĥn := argmin_{h∈H} Rn(h).
The ERM classifier always exists since the function Rn takes only a finite number of values, whatever the class H. Note that I{Yi ≠ h(Xi)} = (Yi − h(Xi))²/4 and thus
Rn(h) = (1/(4n)) Σ_{i=1}^{n} (Yi − h(Xi))².
We expect the ERM classifier to have the risk close to that of the oracle classifier hH .
Let us emphasize that we are not interested in accurate estimation of hH and moreover
there is no guarantee that hH is unique. Mimicking the oracle means constructing a
classifier ĥn such that its risk R(ĥn ) is close to the oracle risk minh∈H R(h).
Rn,ϕ(h) := (1/n) Σ_{i=1}^{n} ϕ(−Yi h(Xi)),
where F is a convex class of functions f : X → R. The question is whether there are
convex functions ϕ such that h∗ = sign(fϕ∗ ), where h∗ is defined in Theorem 5.17?
Natural requirements to ϕ are: (i) convexity, (ii) ϕ should penalize more for wrong
classification than for correct classification. Note that ϕ0 does not penalize at all for
correct classification, because ϕ0 (−1) = 0, but it penalizes for wrong classification
since ϕ0 (1) = 1. However ϕ0 is not convex. The first historical example of convex
surrogate loss ϕ is the hinge loss:
ϕH (x) := (1 + x)+ .
It satisfies both the requirements (i) and (ii) above. The corresponding risk and its
minimizer are
Proposition 5.20. Let h∗ be the Bayes classifier, i.e., h∗ (x) := sign(η(x) − 1/2).
Then fϕ∗H = h∗ .
Proof. Recall that η(x) = P(Y = 1|X = x) and h∗ (x) = sign(f ∗ (x)) with f ∗ (x) =
E(Y |X = x) = 2η(x) − 1. We can write
RϕH(f) = ∫_X E[ (1 − Y f(X))+ | X = x ] dPX(x),
where
E[ (1 − Y f(X))+ | X = x ] = P(Y = 1 | X = x)(1 − f(x))+ + P(Y = −1 | X = x)(1 + f(x))+
= η(x)(1 − f(x))+ + (1 − η(x))(1 + f(x))+.
We claim that
fϕH*(x) = argmin_{u∈R} g(u), where g(u) := η(x)(1 − u)+ + (1 − η(x))(1 + u)+.
Next, observe that g is a piecewise affine function. Let u* = argmin_{u∈R} g(u). We can see that:
g(u) = η(x)(1 − u) + (1 − η(x))(1 + u) = 1 + (1 − 2η(x))u,  if |u| ≤ 1;
g(u) = (1 − η(x))(1 + u),  if u > 1;
g(u) = η(x)(1 − u),  if u < −1.
As g is affine and nondecreasing on [1, ∞), and affine and nonincreasing on (−∞, −1], we see that u* must belong to [−1, 1]. For u ∈ [−1, 1], g is affine with slope 1 − 2η(x) and is therefore minimized at
u* = −1 if η(x) ≤ 1/2, and u* = +1 if η(x) > 1/2.
Hence fϕH*(x) = u* = sign(η(x) − 1/2) = h*(x), which completes the proof.
Classical examples of functions ϕ are the following: (i) ϕ(x) = (1 + x)+ (hinge loss);
(ii) ϕ(x) = exp(x) (exponential loss); (iii) ϕ(x) = log2 (1 + exp(x)) (logistic loss).
Proposition 5.21. Let ϕ0 be positive and strictly increasing. Then h∗ = sign(fϕ∗ ).
Given a solution fˆn,ϕ to the minimization problem (64), we define the classifier ĥn,ϕ :=
sign(fˆn,ϕ ). A popular choice for the set F is
F = { Σ_{j=1}^{M} θj hj : θ ∈ Θ },
where h1, . . . , hM are given base functions and Θ ⊆ R^M. The hinge loss with an ℓ2-body (Θ an ℓ2-ball) yields support vector machines (SVM); the exponential and logistic losses with an ℓ1-body lead to boosting.
5.4.4 Support vector machine (SVM): definition
F = {f ∈ H : ∥f∥H ≤ r},  r > 0.
The support vector machine is, by definition, a classifier obtained from solving this
problem when ϕ(x) = (1 + x)+ (the Hinge loss):
min_{f∈H} { (1/n) Σ_{i=1}^{n} (1 − Yi f(Xi))+ + λ ∥f∥²_H }.    (65)
Thus, by the representer theorem (see Theorem 5.11), it is enough to look for a
solution of (65) in the finite dimensional space S (see the proof of Theorem 5.11)
of dimension less than or equal to n. Solving the problem reduces to finding the
coefficients θj in the representation (62).
Let Kij = K(Xi , Xj ) and denote by K the symmetric matrix (Kij )i,j=1,...,n . Then for
any f ∈ S ⊂ H,
∥f∥²_H = Σ_{i,j=1,...,n} θi θj Kij = θ⊤Kθ.
Thus (65) reduces to the finite-dimensional convex problem
min_{θ∈R^n} { (1/n) Σ_{i=1}^{n} (1 − Yi (Kθ)i)+ + λ θ⊤Kθ },    (66)
where (Kθ)i is the i'th component of Kθ. Given the solution θ̂ of (66), the SVM classifier ĥn,ϕ is determined as:
ĥn,ϕ(x) = sign(f̂n,ϕ(x)), where f̂n,ϕ(x) = Σ_{i=1}^{n} θ̂i K(Xi, x).    (67)
the problem differently, using subdifferential calculus. For any convex objective function G, we have the equivalence
θ̂ ∈ argmin_θ G(θ)  ⟺  0 ∈ ∂G(θ̂),    (68)
where ∂G(θ̂) denotes the subdifferential of G at θ̂.
f̂(x) = Σ_{i=1}^{n} θ̂i K(Xi, x),
In practice, there are often not too many support vectors since only the points Xi
that are misclassified or close to the decision boundary satisfy the condition θ̂i 6= 0.
Proof. We will derive the expression for the subdifferential of the objective function
in (66) by analyzing each term in the sum separately. Fix some index i and consider
the function
θ ↦ ( 1 − Yi Σ_{j=1}^{n} Kij θj )_+ = (1 − Yi (Kθ)i)_+.
Let gi (θ) be a subgradient of this function and denote by gij (θ) its j’th component.
There are three cases that follow immediately from the form of the subdifferential of
the function (1 − x)+ :
• if Yi (Kθ)i = 1 then gij (θ) = −αi Yi Kij , for some αi ∈ [0, 1].
We can wrap these three cases as gij (θ) = −αi Yi Kij , with
(i) αi = 0 if Yi(Kθ)i > 1, (ii) αi = 1 if Yi(Kθ)i < 1, and (iii) αi ∈ [0, 1] if Yi(Kθ)i = 1.
Summing these subgradients over i (and dividing by n), and adding the gradient 2λKθ of the penalty term λθ⊤Kθ, we obtain that a subgradient of the objective function in (66) takes the form
−Kβ + 2λKθ,
where β ∈ R^n has coordinates βi := αi Yi/n.
Now, by (68), a vector θ̂ is a solution of (66) if and only if 0 belongs to the subdifferential of the objective function at θ̂, i.e., K(2λθ̂ − β) = 0, which can be written as 2λθ̂ − β = ε for some ε satisfying Kε = 0. It remains to note that we can always take ε = 0, since a choice of ε in the null space of K does not modify the value of the objective function. This completes the proof.
Observe that the SVM solution can be written as (67). Thus, if we consider the
functions ϕi (·) = K(Xi , ·), we have
f̂ = Σ_{i=1}^{n} θ̂i ϕi,
The strength of the RKHS approach is that the space X can be any arbitrary space
(such as a graph or a semi-group, for example) but we transform each point Xi ∈
X into a finite-dimensional vector Zi = (ϕ1(Xi), . . . , ϕn(Xi))⊤ ∈ R^n, and then
use a linear classifier fˆ(X) = θ> Z in the finite-dimensional space Rn where Z :=
(ϕ1 (X), . . . , ϕn (X))> ∈ Rn . The classification rule for a new point Z is
Ŷ := +1 if θ̂⊤Z > 0, and Ŷ := −1 otherwise.
For any learning point Zi , if Zi is correctly classified we have Yi θ̂> Zi > 0, and if Zi is
wrongly classified we have Yi θ̂> Zi ≤ 0. By Proposition 5.22 a solution θ̂ of the SVM
minimization problem has the coordinates θ̂i , i = 1, . . . , n, satisfying:
• θ̂i = 0 if Yi θ̂> Zi > 1. Interpretation: The point (Zi , Yi ) does not affect the
classification rule if Zi is correctly classified with high margin (larger than 1),
where the margin of the i’th observation is defined as Yi θ̂> Zi = Yi fˆ(Xi ).
• θ̂i 6= 0 if Yi θ̂> Zi ≤ 1. The last inequality means that the point Zi is wrongly
classified or correctly classified with small margin (smaller than 1). If θ̂i 6= 0,
the point Zi is called a support vector.
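To make (66)-(67) and the support-vector interpretation concrete, here is a minimal numerical sketch (assumptions ours: a Gaussian kernel, a toy two-class data set, and a plain subgradient-descent solver rather than the usual quadratic-programming formulation):

```python
import numpy as np


def gaussian_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))


def svm_subgradient(K, y, lam=0.05, n_iter=2000, lr=0.01):
    """Minimize (1/n) sum_i (1 - y_i (K theta)_i)_+ + lam * theta^T K theta by subgradient descent."""
    n = len(y)
    theta = np.zeros(n)
    for _ in range(n_iter):
        margins = y * (K @ theta)
        active = (margins < 1).astype(float)      # points with nonzero hinge loss
        grad = -(K @ (y * active)) / n + 2 * lam * (K @ theta)
        theta -= lr * grad
    return theta


# toy data: two Gaussian blobs in R^2, labels in {-1, +1}
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(+1, 1, (30, 2))])
y = np.concatenate([-np.ones(30), np.ones(30)])
K = gaussian_kernel(X, X)
theta_hat = svm_subgradient(K, y)
f_hat = K @ theta_hat                              # fitted values f_hat(X_i), as in (67)
print(np.mean(np.sign(f_hat) == y))                # training accuracy
```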
Yi = f*(xi) + εi,  i = 1, . . . , n,
where f* is the true regression function, the xi's take values in X (an arbitrary metric space), and ε1, . . . , εn are mean zero, uncorrelated random variables. We want to estimate
f ∗ by minimizing the criterion function
f̂ = argmin_{f∈H} { Σ_{i=1}^{n} (Yi − f(xi))² + λ ∥f∥²_H }.    (69)
By the representer theorem, we can claim that any solution to (69) is of the form f̂(·) = Σ_{i=1}^{n} αi K(xi, ·) for some weight vector (α1, . . . , αn) ∈ R^n. Thus, the above optimization problem can be equivalently expressed as:
α̂ = argmin_{α∈R^n} { ∥Y − Kα∥²₂ + λ α⊤Kα },
where K = ((Kij)) with Kij = K(xi, xj) and Y = (Y1, . . . , Yn). Here we have used that, for f in the span of {K(xi, ·)}_{i=1}^{n}, f(xi) = (Kα)i and ∥f∥²_H = α⊤Kα. We can solve the above finite dimensional optimization problem to yield
α̂ = (K + λ In)^{−1} Y.
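A minimal numerical sketch of this closed form (the Gaussian kernel, the toy data and all tuning values are our own illustrative choices):

```python
import numpy as np


def gaussian_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))


def kernel_ridge_fit(X, y, lam=0.1, sigma=1.0):
    """Return alpha_hat = (K + lam I)^{-1} y, the weights of the representer expansion."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)


def kernel_ridge_predict(X_train, alpha, X_new, sigma=1.0):
    """f_hat(x) = sum_i alpha_i K(x_i, x)."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha


# toy example: noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (50, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=50)
alpha = kernel_ridge_fit(x, y, lam=0.05, sigma=0.2)
grid = np.linspace(0, 1, 200)[:, None]
f_hat = kernel_ridge_predict(x, alpha, grid, sigma=0.2)
```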
Classical principal component analysis (PCA) looks for solutions of the eigen-equation Σv = λv, where Σ is the covariance matrix of X and the eigenvector corresponding to the i'th
largest eigenvalue is the i’th principal component, for i = 1, . . . , n. Another way to
view the PCA problem is to first consider the first principal component, which is a
solution to the following optimization problem:
v1 = argmax_{v∈R^d: ∥v∥≤1} Var(v⊤X) = argmax_{v∈R^d: ∥v∥≤1} v⊤Σv.
The second principal component is defined as the unit vector that maximizes Var(v⊤X) over all vectors v that are orthogonal to v1, and so on.
Given i.i.d. samples {xi }ni=1 from P , the sample principal components are obtained
by solving the corresponding sample analogue:
Σ̂v = λv,
where Σ̂ := (1/n) Σ_{i=1}^{n} (xi − x̄)(xi − x̄)⊤ (here x̄ = Σ_{i=1}^{n} xi/n is the sample mean) is the sample covariance matrix of X. Similarly,
v̂1 = argmax_{v∈R^d: ∥v∥≤1} v⊤Σ̂v = argmax_{v∈R^d: ∥v∥≤1} (1/n) Σ_{i=1}^{n} ( v⊤(xi − x̄) )².    (70)
Now suppose that X ∼ P takes values in an arbitrary metric space X . Suppose that
H is a RKHS (of functions) on X with reproducing kernel K. We can use the kernel method to extend classical PCA to capture non-linear principal components. The first principal component can now be defined as
f1 = argmax_{f∈H: ∥f∥_H≤1} Var( ⟨f, K(X, ·)⟩_H ).
Let ϕ(x) := K(x, ·) for all x ∈ X (note that here ϕ is not exactly the feature map, as
ϕ : X → H). Given a sample {xi }ni=1 from P , the sample first principal component
(function) can be defined analogously (as in (70)) as
f̂1 = argmax_{f∈H: ∥f∥_H≤1} V̂ar( ⟨f, K(X, ·)⟩_H ) = argmax_{f∈H: ∥f∥_H≤1} (1/n) Σ_{i=1}^{n} ⟨ f, ϕ(xi) − (1/n) Σ_{j=1}^{n} ϕ(xj) ⟩²_H.
Define the empirical covariance operator
Σ̂ := (1/n) Σ_{i=1}^{n} ϕ̃(xi) ⊗ ϕ̃(xi),  where ϕ̃(xi) := ϕ(xi) − (1/n) Σ_{j=1}^{n} ϕ(xj).
We would like to find eigenfunctions fˆ (the principal components) (Why? Exercise
(HW3)) such that
Σ̂(fˆ) = λfˆ. (71)
The question now is, how do we express the above equation in terms of kernels, i.e.,
how do we “kernelize” it? Towards this end, we make the following claim.
Claim: Any solution to (71) is of the form f̂ = Σ_{i=1}^{n} αi ϕ̃(xi) for some weight vector (α1, . . . , αn) ∈ R^n.
Proof: First, we observe that any solution to (71) lies in Range(Σ̂). Linearity, and
the nature of ϕ̃(xi ) ⊗ ϕ̃(xi ) (by definition (a ⊗ b)(c) := hb, ciH a) tell us that
Σ̂(f̂) = (1/n) Σ_{i=1}^{n} ⟨ϕ̃(xi), f̂⟩_H ϕ̃(xi).
Using the fact that ⟨ϕ̃(xi), ϕ̃(xj)⟩_H = K̃(xi, xj), where K̃ = HKH (show this; Exercise (HW3); here H = In − 1_{n×n}/n and K is the Gram matrix, i.e., Kij = K(xi, xj)), the above system of equations may be written as
(1/n) Σ_{j=1}^{n} Σ_{i=1}^{n} αi K̃ij ϕ̃(xj) = λ Σ_{i=1}^{n} αi ϕ̃(xi).
K̃ 2 α = λnK̃α.
The only solutions of this equation that are of interest to us are those that satisfy
K̃α = λnα.
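The resulting kernel PCA recipe — center the Gram matrix and take its leading eigenvectors — can be sketched as follows (a minimal sketch; the function name, the linear-kernel toy example and the omission of the usual rescaling of α to make ∥f̂∥_H = 1 are our simplifications):

```python
import numpy as np


def kernel_pca(K, n_components=2):
    """Kernel PCA from a Gram matrix K: center K_tilde = H K H, then solve K_tilde a = n*lambda a."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K_tilde = H @ K @ H
    eigvals, eigvecs = np.linalg.eigh(K_tilde)           # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
    scores = K_tilde @ eigvecs[:, :n_components]         # values <f_hat, phi_tilde(x_k)> at the data
    return eigvals[:n_components] / n, scores


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
K = X @ X.T                                              # linear-kernel Gram matrix, for illustration
lambdas, scores = kernel_pca(K, n_components=2)
print(scores.shape)
```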
6 Bootstrap
To motivate the bootstrap method, let us consider the following simple scenario.
Suppose that we model our data X = (X1 , . . . , Xn ) as a random sample from some
distribution P ∈ P, where P is a class of probability distributions. Let η(X, P )
be a root, i.e., a random variable that possibly depends on both the distribution
P and the sample X drawn from P (e.g., think of η(X, P) as √n(X̄n − µ), where X̄n = Σ_{i=1}^{n} Xi/n and µ = E(X1)). In fact, θ̂ − θ (as described above) is a root.
In general, we may wish to estimate the mean or a quantile or some other probabilistic
feature or the entire distribution of η(X, P ). As mentioned above, the distribution of
θ̂ − θ depends on P and is thus unknown. Let Hn (x, P ) denote the c.d.f. of η(X, P ),
i.e.,
Hn (x, P ) := PP (η(X, P ) ≤ x). (73)
Of course, if we can estimate Hn (·, P ) then we can use this to construct CIs, test
hypotheses; e.g., if η(X, P ) = (θ̂−θ) then being able to estimate Hn (·, P ) immediately
yields estimates of κα and κ1−α as defined in (72).
Idea: What if we knew P and could draw unlimited replicated samples from P ?
In that case we could approximate Hn (x, P ) as follows: draw repeated samples from
P resulting in a series of values for the root η(X, P ), then we could form an estimate
of Hn (x, P ) by counting how many of the η(X, P )’s are ≤ x.
But, of course, we do not know P . However we can estimate P by P̂n and use the
above idea. This is the notion of bootstrap.
Definition 6.1 (Bootstrap). The bootstrap is a method of replacing (plug-in) the
unknown distribution P with a known distribution P̂n (estimated from the data) in
probability/expectation calculations.
The bootstrap approximation of Hn(·, P) is Ĥn(·, P̂n), where P̂n is an estimator of P obtained from the observed data (that we think is close to P), i.e.,
Ĥn(x, P̂n) := P*_{P̂n}( η(X*, P̂n) ≤ x | X ),    (74)
where P*_{P̂n}(· | X) is the conditional probability given the observed data X (under the distribution P̂n), and X* = (X1*, . . . , Xn*) denotes an i.i.d. sample of size n from P̂n.
Question: How do we find Ĥn(·, P̂n), the distribution of η(X*, P̂n)?
Answer: In most cases, the distribution of η(X*, P̂n) is difficult to compute analytically, but it can always be approximated easily by Monte Carlo simulation.
• Draw a large number (say, B) of random samples X∗(1) , . . . , X∗(B) from the
distribution P̂n and then compute T ∗(j) := η(X∗(j) , P̂n ), for j = 1, . . . , B.
• Finally, compute the desired feature of η(X∗ , P̂n ) using the empirical c.d.f. H̃nB (·, P̂n )
of the values T ∗(1) , . . . , T ∗(B) , i.e.,
H̃nB(x, P̂n) := (1/B) Σ_{j=1}^{B} I{T*(j) ≤ x},  for x ∈ R.
Intuitively,
H̃nB (·, P̂n ) ≈ Ĥn (·, P̂n ) ≈ Hn (·, P ),
where the first approximation is from Monte Carlo error (and can be as small as we
would like, by taking B as large as we want) and the second approximation is due to
the bootstrap method. If P̂n is a good approximation of P , then bootstrap can be
successful.
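As a concrete sketch of this Monte Carlo recipe for the nonparametric bootstrap, using the root √n(X̄n − µ) from above (the function name and the toy exponential data are ours):

```python
import numpy as np


def bootstrap_root_samples(x, B=2000, seed=None):
    """Monte Carlo approximation of the bootstrap distribution of sqrt(n)(mean* - mean)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    roots = np.empty(B)
    for b in range(B):
        x_star = rng.choice(x, size=n, replace=True)  # draw X* from P_hat_n = empirical distribution
        roots[b] = np.sqrt(n) * (x_star.mean() - x.mean())
    return np.sort(roots)                              # empirical c.d.f. of T*(1), ..., T*(B)


rng = np.random.default_rng(0)
x = rng.exponential(size=100)                          # observed data
roots = bootstrap_root_samples(x)
# bootstrap estimates of the 2.5% and 97.5% quantiles of sqrt(n)(X_bar - mu):
print(np.quantile(roots, [0.025, 0.975]))
```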
6.1 Parametric bootstrap
In parametric models it is more natural to take P̂n as the fitted parametric model.
Example 6.2 (Estimating the standard deviation of a statistic). Suppose that X1, . . . , Xn is a random sample from N(µ, σ²). Suppose that we are interested in the parameter
θ = P(X ≤ c) = Φ( (c − µ)/σ ),
for a given constant c, which we estimate by the plug-in estimator θ̂ = Φ((c − X̄)/σ̂), where X̄ and σ̂² denote the sample mean and variance.
Question: How do we estimate the standard deviation of θ̂? There is no easy closed
form expression for this.
We draw B bootstrap samples, each of size n, from the fitted parametric model
N(X̄, σ̂²) ≡ P̂n.
For the j’th bootstrap sample we compute a sample average X̄ ∗(j) , a sample standard
deviation σ̂ ∗(j) . Finally, we compute
θ̂*(j) = Φ( (c − X̄*(j)) / σ̂*(j) ).
We can estimate the mean of θ̂ by θ̄* = (1/B) Σ_{j=1}^{B} θ̂*(j). The standard deviation of θ̂ can then be estimated by the bootstrap standard deviation of the θ̂*(j) values, i.e.,
[ (1/B) Σ_{j=1}^{B} ( θ̂*(j) − θ̄* )² ]^{1/2}.
Example 6.3 (Comparing means when variances are unequal). Suppose that we
have two independent samples X1 , . . . , Xm and Y1 , . . . , Yn from two possibly different
normal populations. Suppose that
H0 : µ1 = µ2 versus H1 : µ1 6= µ2 .
We can use the test statistic
U = (m + n − 2)^{1/2} (X̄m − Ȳn) / [ (1/m + 1/n)^{1/2} (S²_X + S²_Y)^{1/2} ],
where X̄m = (1/m) Σ_{i=1}^{m} Xi, Ȳn = (1/n) Σ_{i=1}^{n} Yi, S²_X = Σ_{i=1}^{m} (Xi − X̄m)² and S²_Y = Σ_{i=1}^{n} (Yi − Ȳn)². Note that as σ²₁ ≠ σ²₂, U does not necessarily follow a t-distribution. We reject H0 if
|U| > c*_n,
where c*_n is a critical value obtained from the (parametric) bootstrap distribution of |U|.
We want to estimate the distribution of M − θ, where M denotes the sample median and θ the population median. Let η(X, F) := M − θ. We may choose F̂ = Fn, the empirical distribution function of the observed data. Thus, our method can be broken into the following steps:
• Draw B bootstrap samples X*(1), . . . , X*(B), each of size n, from Fn (i.e., sample with replacement from the observed data).
• For each bootstrap sample we compute the sample median M*(j) and then find the appropriate sample quantiles of {M*(j) − M}_{j=1}^{B}. Observe that η(X*, Fn) = M* − M.
Suppose that F̂n and F are the corresponding c.d.f.’s for P̂n and P respectively.
Suppose that P̂n is a consistent estimator of P . This means that at each x in the
support of X1 where F (x) is continuous, F̂n (x) → F (x) in probability or a.s. as
n → ∞10 . If, in addition, Ĥn (x, P ), considered as a functional of P , is continuous
in an appropriate sense, it can be expected that Ĥn (x, P̂n ) will be close to Hn (x, P ),
when n is large.
Observe that Ĥn (x, P̂n ) is a random distribution function (as it depends on the ob-
served data). Let ρ be any notion of distance between two probability distributions
that metrizes weak convergence, i.e., for any sequence of c.d.f.'s {Gn}n≥1, we have Gn →d G if and only if ρ(Gn, G) → 0 as n → ∞.
In particular, we can take ρ to be the Prohorov metric11 or the Levy metric12 . For
simplicity, we can also use the uniform distance (Kolmogorov metric) between Gn
and G (which metrizes weak convergence if G is a continuous c.d.f.).
Definition 6.5. We say that the bootstrap is weakly consistent under ρ for η(Xn , P )
if
ρ(Hn, Ĥn) →p 0 as n → ∞,
where Hn and Ĥn are defined in (73) and (74) respectively. We say that the bootstrap
10
If F is a continuous c.d.f., then it follows from Polya’s theorem that F̂n → F in probability or
a.s. uniformly over x. Thus, F̂n and F are uniformly close to one another if n is large.
is strongly consistent under ρ for η(Xn , P ) if
ρ(Hn, Ĥn) →a.s. 0 as n → ∞.
In applications, e.g., for construction of CIs, we are quite often interested in approx-
imating the quantiles of Hn by that of Ĥn (as opposed to the actual c.d.f.). The
following simple result shows that weak convergence, under some mild conditions,
implies the convergence of the quantiles.
Exercise (HW4): Let {Gn }n≥1 be a sequence of distribution functions on the real line
converging weakly to a distribution function G, i.e., Gn (x) → G(x) at all continuity
points x of G. Assume that G is continuous and strictly increasing at y = G−1 (1 − α).
Then,
Gn^{−1}(1 − α) := inf{x ∈ R : Gn(x) ≥ 1 − α} → y = G^{−1}(1 − α).
The following theorem, although quite obvious, gives us a general strategy to prove
the consistency of the bootstrap in many problems.
Theorem 6.6. Let CP be a set of sequences {Pn ∈ P}n≥1 containing the sequence
{P, P, . . .}. Suppose that, for every sequence {Pn } ∈ CP , Hn (·, Pn ) converges weakly
to a common limit H(·, P ). Let Xn be a sample of size n from P . Assume that P̂n
is an estimator of P based on Xn such that {P̂n } falls in CP with probability one.
Then,
ρ(Hn(·, P), Ĥn(·, P̂n)) →a.s. 0 as n → ∞.
If H(·, P) is continuous and strictly increasing at H^{−1}(1 − α, P) (0 < α < 1), then
Ĥn^{−1}(1 − α, P̂n) →a.s. H^{−1}(1 − α, P) as n → ∞.
Remark 6.1. Often, the set of sequences CP can be described as the set of sequences
{Pn }n≥1 such that d(Pn , P ) → 0, where d is an appropriate “metric” on the space
of probabilities. Indeed, one should think of CP as a set of sequences {Pn } that
are converging to P in an appropriate sense. Thus, the convergence of Hn (·, Pn ) to
H(·, P ) is locally uniform in a specified sense. Unfortunately, the appropriate metric
d will depend on the precise nature of the problem and the choice of the root.
Theorem 6.6 essentially says that to prove the consistency of the bootstrap it is enough
to try to understand the limiting behavior of Hn (·, Pn ), where Pn is any sequence
of distributions “converging” (in some appropriate sense) to P . Thus, quite often,
showing the consistency of the bootstrap boils down to showing the weak convergence
of η(Xn , Pn ) under a triangular array setup, as Xn is now an i.i.d. sample from Pn . For
example, if the CLT plays a crucial role in proving that Hn (·, P ) converges weakly to
a limit H(·, P ), the Lindeberg-Feller CLT theorem can be used to show that Hn (·, Pn )
converges weakly to H(·, P ).
Theorem 6.7 (Bootstrapping the sample mean). Suppose X1, X2, . . . , Xn are i.i.d. F and that σ² := VarF(X1) < ∞. Let η(X, F) := √n(X̄n − µ), where µ := EF(X1) and X̄n := Σ_{i=1}^{n} Xi/n. Then,
K(Ĥn, Hn) = sup_{x∈R} |Hn(x) − Ĥn(x)| →p 0 as n → ∞,
where Ĥn (x) ≡ Ĥn (x, Fn ) and Fn is the empirical c.d.f. of the sample X1 , X2 , . . . , Xn .
Exercise (HW4): Show that for almost all sequences X = {X1, X2, . . .}, the conditional distribution of √n(X̄n* − X̄n), given X, converges in law to N(0, σ²) by the triangular array CLT (Lindeberg CLT).
It follows that (√n(X̄n − µ), √n(X̄n* − X̄n)) converges in distribution to (Z1, Z2), where Z1, Z2 are i.i.d. N(0, σ²). In fact, a more general version of the result is true. Suppose that (Un, Vn) is a sequence of random vectors such that Un →d Z ∼ H (some Z) and Vn | Un →d Z (the same Z) a.s. Then (Un, Vn) →d (Z1, Z2), where Z1, Z2 are i.i.d. H.
Exercise (HW4): What do you think would be the limiting behavior of √n(X̄n* − µ), conditional on the data X?
6.4 Second-order accuracy of the bootstrap
One philosophical question about the use of the bootstrap is whether the bootstrap
has any advantages at all when a CLT is already available. To be specific, suppose
that η(X, F) = √n(X̄n − µ). If σ² := VarF(X1) < ∞, then
√n(X̄n − µ) →d N(0, σ²) and K(Ĥn, Hn) →p 0 as n → ∞.
So two competing approximations to Hn(x) are Φ(x/σ̂n) (where σ̂n² := (1/n) Σ_{i=1}^{n} (Xi − X̄n)²) and Ĥn(x, Fn). It turns out that, for certain types of statistics, the bootstrap
approximation is (theoretically) more accurate than the approximation provided by
the CLT. Because any normal distribution is symmetric, the CLT cannot capture
information about the skewness in the finite sample distribution of η(X, F ). The
bootstrap approximation does so. So the bootstrap succeeds in correcting for skew-
ness, just as an Edgeworth expansion13 would do. This is called Edgeworth correction
by the bootstrap, and the property is called second-order accuracy of the bootstrap.
Theorem 6.8 (Second-order accuracy). Suppose X1, X2, . . . , Xn are i.i.d. F and that σ² := VarF(X1) < ∞. Let η(X, F) := √n(X̄n − µ)/σ, where µ := EF(X1) and X̄n := Σ_{i=1}^{n} Xi/n. If EF|X1|³ < ∞ and F is continuous, then
K(Ĥn, Hn) = op(n^{−1/2}) as n → ∞,
where Ĥn(x) ≡ Ĥn(x; Fn) is the c.d.f. of η(X*, Fn) := √n(X̄n* − X̄n)/σ̂ (with σ̂² = (1/n) Σ_{i=1}^{n} (Xi − X̄n)²) and Fn is the empirical c.d.f. of the sample X1, X2, . . . , Xn.
Remark 6.2 (Rule of thumb). Let X1, X2, . . . , Xn be i.i.d. F and let η(X, F) be a root. If η(X, F) →d N(0, τ²), where τ does not depend on F, then second-order accuracy is likely. Proving it will depend on the availability of an Edgeworth expansion for η(X, F). If τ depends on F (i.e., τ = τ(F)), then the bootstrap is typically just first-order accurate.
In spite of the many consistency theorems in the previous sections, there are instances
where the ordinary bootstrap based on sampling with replacement from Fn actually
¹³We note that T := √n(X̄n − µ)/σ admits the following Edgeworth expansion:
P(T ≤ x) = Φ(x) + p1(x|F) φ(x)/√n + p2(x|F) φ(x)/n + smaller order terms,
where p1(x|F) and p2(x|F) are polynomials in x with coefficients depending on F.
does not work. Typically, these are instances where the root η(X, F ) fails to admit
a CLT. Before seeing a few examples, we list a few situations where the ordinary
bootstrap fails to estimate the c.d.f. of η(X, F ) consistently:
(a) η(X, F) = √n(X̄n − µ) when VarF(X1) = ∞.
(b) η(X, F) = √n(g(X̄n) − g(µ)) and ∇g(µ) = 0.
(c) η(X, F) = √n(g(X̄n) − g(µ)) and g is not differentiable at µ.
(e) The underlying population Fθ is indexed by a parameter θ, and the true value
θ0 belongs to the boundary of the parameter space Θ.
Let Hn (·, P ) be the sampling c.d.f. of τn (θ̂n − θ) based on a sample of size n from P ,
where τn is a normalizing constant. Essentially, the only assumption that we will need
to construct asymptotically valid confidence intervals for θ(P ) is the following: there
exists a limiting non-degenerate c.d.f. H(·, P ) such that Hn (·, P ) converges weakly to
H(·, P ) as n → ∞.
To describe the method, let Y1, . . . , YNn be equal to the Nn := C(n, b) subsets of size b of {X1, . . . , Xn}, ordered in any fashion. Of course, the Yi's depend on b and n, but this notation has been suppressed. Only a very weak assumption on b will be required.
In typical situations, it will be assumed that b/n → 0 and b → ∞ as n → ∞.
Now, let θ̂n,b,j be equal to the statistic θ̂b evaluated at the data set Yj . The approxi-
mation to Hn (x, P ) we study is defined by
Ln,b(x) = (1/Nn) Σ_{j=1}^{Nn} I{ τb(θ̂n,b,j − θ̂n) ≤ x }.
The motivation behind the method is the following. For any j, Yj is a random sample
of size b from P . Hence, the exact distribution of τb (θ̂n,b,i − θ(P )) is Hb (·, P ). The
empirical distribution of the Nn values of τb (θ̂n,b,j − θ̂n ) should then serve as a good
approximation to Hb (P ) ≈ Hn (P ). Of course, θ(P ) is unknown, so we replace θ(P )
by θ̂n , which is asymptotically permissible because τb (θ̂n −θ(P )) is of order τb /τn → 0.
Theorem 6.9. Assume that there exists a limiting non-degenerate c.d.f. H(·, P ) such
that Hn (·, P ) converges weakly to H(·, P ) as n → ∞. Also assume τb /τn → 0, b → ∞,
and b/n → 0 as n → ∞.
(i) If x is a continuity point of H(·, P), then Ln,b(x) →p H(x, P).
(ii) If H(·, P) is continuous, then sup_x |Ln,b(x) − Hn(x, P)| →p 0.
(iii) Assume τb(θ̂n − θ(P)) → 0 almost surely and, for every d > 0, Σ_n exp{−d(n/b)} < ∞.
Then, the convergences in (i) and (ii) hold with probability one.
Regression models are among the key ones that differ from the i.i.d. setup and are
also among the most widely used. Bootstrap for regression cannot be model-free; the
particular choice of the bootstrap scheme depends on whether the errors are i.i.d. or
not. We will only talk about the linear model with deterministic x’s and i.i.d. errors.
Additional moment conditions will be necessary depending on the specific problem to
which the bootstrap will be applied; see e.g., [4]. First let us introduce some notation.
Let X be the n × p design matrix with the i'th row equal to xi and let Y := (y1, . . . , yn) ∈ R^n. The least squares estimator of β is defined as
β̂n := argmin_{β∈R^p} Σ_{i=1}^{n} (yi − xi⊤β)² = (X⊤X)^{−1} X⊤Y,
First observe that Hn only depends on F . The residual bootstrap scheme is described
below.
ε̃i = yi − xi⊤β̂n − (1/n) Σ_{j=1}^{n} ε̂j,  for i = 1, . . . , n,
where ε̂j := yj − xj⊤β̂n denotes the j'th residual.
The bootstrap estimator of the distribution Hn(F) is Hn(F̃n), where F̃n is the empirical c.d.f. of ε̃1, . . . , ε̃n.
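A minimal sketch of the residual bootstrap for β̂n (the design, the t-distributed errors and the function name are our own illustrative choices):

```python
import numpy as np


def residual_bootstrap(X, y, B=1000, seed=None):
    """Residual bootstrap for the least squares estimator beta_hat = (X'X)^{-1} X'y."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    resid_centered = resid - resid.mean()           # the centered residuals eps_tilde
    beta_star = np.empty((B, p))
    for b in range(B):
        eps_star = rng.choice(resid_centered, size=n, replace=True)
        y_star = X @ beta_hat + eps_star            # regenerate responses from the fitted model
        beta_star[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
    return beta_hat, beta_star                      # replicates approximate the law of beta_hat


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.standard_t(df=5, size=100)
beta_hat, beta_star = residual_bootstrap(X, y)
print(beta_star.std(axis=0))                        # bootstrap standard errors
```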
We proved in class that an application of the Lindeberg-Feller CLT shows that the
above bootstrap scheme is consistent, under the conditions:
6.8 Bootstrapping a nonparametric function: the Grenander
estimator
Consider X1 , . . . , Xn i.i.d. from a nonincreasing density f0 on [0, ∞). The goal is to es-
timate f0 nonparametrically. In particular, we consider the nonparametric maximum
likelihood estimator (NPMLE) of f0 , defined as
f̃n := argmax_{f↓} Π_{i=1}^{n} f(Xi),
where the maximization is over all nonincreasing densities on [0, ∞). It can be shown that
that
f˜n = LCM0 [Fn ],
where Fn is the empirical c.d.f. of the data, and LCM0 [Fn ] denotes the right-hand
slope of the least concave majorant of Fn ; see e.g.,
http://www.math.yorku.ca/∼hkj/Teaching/Bristol/notes.pdf
for the characterization, computation and theoretical properties of f˜n .
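A minimal computational sketch (ours, assuming no ties in the data): f̃n can be obtained from the slopes of the least concave majorant of the points (X(i), i/n) of the empirical c.d.f., built here with a simple stack-based hull routine.

```python
import numpy as np


def grenander(x):
    """Grenander estimator: slopes of the least concave majorant (LCM) of the empirical c.d.f."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    px = np.concatenate(([0.0], x))       # x-coordinates of the c.d.f. points
    py = np.arange(n + 1) / n             # (0, 0), (x_(1), 1/n), ..., (x_(n), 1)
    hull = [0]                            # indices of the LCM (upper hull) vertices
    for i in range(1, n + 1):
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            # pop k if it lies on or below the chord from j to i (concavity violated)
            if (py[k] - py[j]) * (px[i] - px[j]) <= (py[i] - py[j]) * (px[k] - px[j]):
                hull.pop()
            else:
                break
        hull.append(i)
    f_hat = np.empty(n)
    for seg in range(len(hull) - 1):
        j, k = hull[seg], hull[seg + 1]
        f_hat[j:k] = (py[k] - py[j]) / (px[k] - px[j])   # slope over (x_(j), x_(k)]
    return x, f_hat


rng = np.random.default_rng(0)
x_sorted, f_hat = grenander(rng.exponential(size=200))
print(f_hat[:5])   # estimated (nonincreasing) density values near zero
```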
7 Multiple hypothesis testing
In the multiple hypothesis testing14,15 problem we wish to test many hypotheses si-
multaneously. The null hypotheses are denoted by H0,i , i = 1, . . . , n, where n denotes
the total number of hypotheses.
Consider a prototypical example: we test n = 1000 null hypotheses at level 0.05 (say).
Suppose that everything is null (i.e., all the null hypotheses are true) — even then
on an average we expect 50 rejections.
In general, the problem is how do we detect the true non-null effects (hypotheses
where the null is not true) when a majority of the null hypotheses are true? This
question has received a lot of attention in the statistical literature, particularly in
genomic experiments. Consider the following example.
Example 7.1 (Prostate cancer study). DNA microarrays measure expression levels
of tens of thousands of genes. The data consist of levels of mRNA, which are thought
to measure how much of a protein the gene produces. A larger number implies a more
active gene.
Suppose that we have n genes and data on the expression levels for each gene among
healthy individuals and those with prostate cancer. In the example considered in [3],
n = 6033 genes were measured on 50 control patients and 52 patients with prostate
cancer. The data obtained are (Xij), where Xij denotes the expression level of the i'th gene for the j'th individual.
We want to test the effect of the i’th gene. For the i’th gene, we use the following
test statistic:
(X̄i·P − X̄i·C) / sd(· · ·)  ∼ t100,  under H0,i,
where X̄i·P denotes the average expression level for the i’th gene for the 52 cancer
patients and X̄i·C denotes the corresponding value for the control patients and sd(. . .)
denotes the standard error of the difference. We reject the null H0,i for gene i if the
test statistic exceeds the critical value t100^{−1}(1 − α), for α ∈ (0, 1).
There are two main questions that we will address on this topic:
14
Many thanks to Jimmy K Duong for scribing the lecture notes based on which this section is
adapted.
15
Most of the material here can be found in the lecture notes by Emmanuel Candes; see
http://statweb.stanford.edu/∼candes/stats300c/lectures.html.
• Global testing. In global testing, our primary interest is not on the n hypotheses
H0,i, but instead on the global hypothesis H0 := ∩_{i=1}^{n} H0,i, the intersection of the
H0,i ’s.
• Multiple testing. In this scenario we are concerned with the individual hypothe-
ses H0,i and want to say something about each hypothesis.
yi = µi + zi , for i = 1, . . . , n, (75)
where zi ’s are i.i.d. N (0, 1), the µi ’s are unknown constants and we only observe the
yi's. We want to test
H0,i : µi = 0 versus H1,i : µi > 0,  for i = 1, . . . , n.
The complication is that if we do each of these tests H0,i at level α, and then want
to combine them, the global null hypothesis H0 might not have level α. This is the
first hurdle.
Each hypothesis H0,i is tested with a p-value pi, and we will assume that under H0,i, pi ∼ Unif(0, 1) (we are not assuming independence among the pi's yet).
Suppose that α ∈ (0, 1) is given. The Bonferroni procedure can be described as: test each H0,i at level α/n and reject the global null H0 if at least one of these n tests rejects. This can be succinctly expressed as looking at the minimum of the p-values, i.e.,
Reject H0 if min_{i=1,...,n} pi ≤ α/n.
Question: Is this a valid level-α test, i.e., is PH0(Type I error) ≤ α? Answer: Yes. Observe that, by the union bound,
PH0(Rejecting H0) = PH0( min_{i=1,...,n} pi ≤ α/n ) ≤ Σ_{i=1}^{n} PH0( pi ≤ α/n ) ≤ n · (α/n) = α.
So this is a valid level-α test, whatever the pi ’s are (the pi ’s could be dependent).
Question: Are we being too conservative (the above is an upper bound)? As we are testing each hypothesis at the very small level α/n, most of the p-values would fail to be significant. The feeling is that we need a very strong signal for some i to reject the global null using the Bonferroni method.
Answer: We are not doing something very crude if all the p-values are independent: in that case PH0( min_i pi ≤ α/n ) = 1 − (1 − α/n)^n → 1 − e^{−α} ≈ α, so the bound above is nearly attained. Thus, the Bonferroni approach is not a bad thing to do, especially when we have independent p-values.
Let us now focus on the power of the Bonferroni method. To discuss power we need
a model for the alternative.
Question: Consider the example of the Gaussian sequence model mentioned pre-
viously. Under what scenario for the µi ’s do we expect the Bonferroni test to do
well?
Answer: If we have (a few) strong signals, then the Bonferroni procedure is good.
We will try to formalize this now.
In the Gaussian sequence model the Bonferroni procedure reduces to: Reject H0,i
(H0,i : µi = 0 vs. H1,i : µi > 0) if
yi > zα/n ,
where zα/n is the (1 − α/n)’th quantile of the standard normal distribution.
Question: How does zα/n behave? Do we know its order (when α is fixed and n is
large)?
Answer: As a first approximation, zα/n is like √(2 log n) (an important number for Gaussian random variables).
Fact 1. Here is a fact from extreme value theory about the order of the maximum of
the zi ’s, i.e., maxi=1,...,n zi :
( max_{i=1,...,n} zi ) / √(2 log n) →a.s. 1,
i.e., if we have a bunch of n independent standard normals, the maximum is like √(2 log n) (Exercise: show this).
To see the claim about zα/n, note that t = zα/n solves 1 − Φ(t) = α/n; using the approximation 1 − Φ(t) ≈ φ(t)/t,
1 − Φ(t) ≈ φ(t)/t = α/n
⇔ e^{−t²/2} / (√(2π) t) = α/n
⇔ −t²/2 = log(√(2π) t) + log(α/n)   (and log(√(2π) t) is a smaller order term)
⇒ t² ≈ −2 log(α/n) = 2 log n − 2 log α ≈ 2 log n.
The mean of max_{i=1,...,n} zi is like √(2 log n) and the fluctuations around the mean are of order Op(1).
Exercise: Use the Gaussian concentration inequality to derive this result. Note that
the maximum is a Lipschitz function.
To study the power of the Bonferroni procedure, we consider the following stylized regimes (in the following the superscript (n) is to allow the variables to vary with n):
(i) µ1^{(n)} = (1 + ε) √(2 log n) and µ2 = · · · = µn = 0,
(ii) µ1^{(n)} = (1 − ε) √(2 log n) and µ2 = · · · = µn = 0,
where ε > 0. So, in both settings, we have one strong signal, and everything else is 0.
In case (i), the signal is slightly stronger than √(2 log n); and in case (ii), the signal is slightly weaker than √(2 log n). We will show that Bonferroni actually works for
case (i) (by that we mean the power of the test actually goes to 1). Meanwhile, the
Bonferroni procedure fails for case (ii) — the power of the test converges to α.
This is not only a problem with the Bonferroni procedure — it can be shown that no
test can detect the signal in case (ii).
Case (i):
P( max_i yi > zα/n ) = P( {y1 > zα/n} ∪ { max_{i=2,...,n} yi > zα/n } ) ≥ P( y1 > zα/n ) = P( z1 > zα/n − (1 + ε)√(2 log n) ) → 1,
since zα/n − (1 + ε)√(2 log n) ≈ −ε√(2 log n) → −∞. In this regime, just by looking at y1, we will be able to detect that H0 is not true.
Case (ii):
P( max_i yi > zα/n ) ≤ P( y1 > zα/n ) + P( max_{i=2,...,n} yi > zα/n ).
Note that the first term is equal to P( z1 > zα/n − (1 − ε)√(2 log n) ) ≈ P( z1 > ε√(2 log n) ) → 0 as n → ∞; whereas the second term converges to 1 − e^{−α}. Hence, we have shown that in this case the power
of the test is less than or equal to the level of the test. So the test does as well as
just plain guesswork.
This shows the dichotomy in the Bonferroni procedure; that by just changing the
signal strength you can always recover or you can fail (1 − α) of the time.
Whenever we have a hypothesis testing procedure, there has to be an effort in trying to
understand the power of the procedure. And it is quite often the case that different
tests (using different test statistics) are usually geared towards detecting different
kinds of departures from the null. Here, the Bonferroni procedure is geared towards
detecting sparse, strong signals.
Consider the Gaussian sequence model described in (75) and suppose that we want to test the global null hypothesis H0 : µ1 = · · · = µn = 0 using a test that rejects for large values of ∥y∥²₂ = Σ_{i=1}^{n} yi² (a chi-squared test, since ∥y∥²₂ ∼ χ²_n under H0).
This test is going to have high power when ∥µ∥2 is large. So, this test would have
high power when there are many weak signals (even if each µi is slightly different
from zero as we square it and add these up we can get a substantially large kµk2 ).
The Bonferroni procedure may not be able to detect a scenario like this: since each hypothesis is tested at level $\alpha/n$, if the signal strengths are weak then all of the p-values (for the different hypotheses) might be considerably larger than $\alpha/n$.
Suppose that $p_1, \ldots, p_n$ are the $n$ p-values obtained from the $n$ hypothesis tests, and assume that the $p_i$'s are independent. Fisher's combination test rejects the global null hypothesis if
$$T := -2\sum_{i=1}^n \log p_i$$
is large. Observe that, under $H_0$,
$$T = -2\sum_{i=1}^n \log p_i \sim \chi^2_{2n}.$$
Again, as this test is aggregating the p-values, it will hopefully be able to detect the
presence of many weak signals.
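A minimal sketch of Fisher's combination test follows (the function name and the mixture used to generate the example p-values are my own illustrative choices, not from the notes):

```python
import numpy as np
from scipy.stats import chi2, norm

def fisher_combination(pvals):
    """Combined statistic T = -2 * sum(log p_i) and its chi^2_{2n} p-value under H0."""
    pvals = np.asarray(pvals)
    T = -2.0 * np.sum(np.log(pvals))
    return T, chi2.sf(T, df=2 * len(pvals))

# Example: many weak one-sided signals, y_i = mu_i + z_i with mu_i = 0.3.
rng = np.random.default_rng(2)
y = rng.standard_normal(1000) + 0.3
pvals = norm.sf(y)                               # one-sided p-values
print(fisher_combination(pvals))
```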
Until now, we have been considering tests of the global null H0 = ∩i H0,i . For some
testing problems, however, our goal is to accept or reject each individual H0,i . Given
n hypotheses, we have four types of outcomes in multiple testing:
                 Accept H_{0,i}   Reject H_{0,i}    Total
H_{0,i} true           U                V            n_0
H_{0,i} false          T                S          n - n_0
Total                n - R              R              n
where $R$ = number of rejections is an observed random variable, while $U, V, S, T$ are unobserved random variables. Note that $V$ is the number of false discoveries and $R = V + S$.
Suppose that the hypotheses indexed by I0 ⊆ {1, . . . , n} are truly null with |I0 | = n0
and the remaining hypotheses are non-null.
Ideally, we would like to make no false discoveries. But if we are not willing to make any false discoveries, which essentially translates to a very large threshold/cutoff for each test, then we will not be able to make any discoveries at all.
In the modern theory of hypothesis testing, control of the FWER is considered too stringent, mainly because it leads to tests that fail to reject many non-null hypotheses.
The false discovery rate (FDR) is an error control criterion developed in the 1990’s as
an alternative to the FWER. When the number of tests is in the tens of thousands or
even higher, FWER control is so stringent a criterion that individual departures from
the null have little chance of being detected. In such cases, it may be unreasonable
to control the probability of having any false rejections. Attempting to do so would
leave us with virtually no power to reject individual non-nulls. Sometimes, control of the FWER is not even needed.
A new point of view advanced by [1] proposes controlling the expected proportion
of errors among the rejected hypotheses. The false discovery proportion (FDP) is
defined as
$$\mathrm{FDP} := \frac{V}{\max(R, 1)},$$
and the false discovery rate (FDR) is its expectation,
$$\mathrm{FDR} := E(\mathrm{FDP}).$$
The Benjamini-Hochberg (BH) procedure controls FDR at any desired level (e.g.,
suppose we take q = 0.2), i.e.,
$$\mathrm{FDR} \le q = 0.2;$$
thus, out of all the rejections we make, we are willing to have 20% of them be false, on average.
The BH procedure can be described as follows. Suppose that $p_1, \ldots, p_n$ are the p-values from the $n$ hypothesis tests, and let $p_{(1)} \le \ldots \le p_{(n)}$ denote the ordered p-values. Let
$$i_0 := \max\{i : p_{(i)} \le qi/n\}.$$
We reject all the hypotheses $H_{0,(i)}$ for $1 \le i \le i_0$ (i.e., we reject those hypotheses with p-values from $p_{(1)}$ to $p_{(i_0)}$). Pictorially this can be expressed as follows: plot the ordered p-values $p_{(i)}$ against $i/n$ and draw the line through the origin with slope $q$; then reject all the hypotheses whose ordered p-values come before (and including) the last point that falls below this line.
Another way to view the BH procedure is via the following sequential description:
start with {i = n} and keep accepting the hypothesis corresponding to p(i) as long as
p(i) > qi/n. As soon as p(i) ≤ iq/n, stop and reject all the hypotheses corresponding
to p(j) for j ≤ i.
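For concreteness, here is a short sketch of the BH($q$) step-up rule just described (the function name and default $q$ are mine, not from the notes):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.2):
    """Return a boolean array with True where H_{0,(i)} is rejected by BH(q)."""
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)                        # indices giving p_(1) <= ... <= p_(n)
    below = pvals[order] <= q * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        i0 = np.max(np.nonzero(below)[0])            # largest i with p_(i) <= q i / n
        reject[order[:i0 + 1]] = True                # reject H_{0,(1)}, ..., H_{0,(i0)}
    return reject
```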
Theorem 7.2. Suppose that the p-values p1 , . . . , pn are independent. Then
$$\mathrm{FDR} = E\left[\frac{V}{\max(R, 1)}\right] \le q.$$
Remark 7.1. Note that the above result states that the BH procedure controls FDR
for all configurations of {H0,i }ni=1 .
Proof. Without loss of generality suppose that $H_{0,1}, \ldots, H_{0,n_0}$ are true. Observe that
$$\{R = r\} = \Big\{p_{(r)} \le \frac{r}{n}q, \;\; p_{(s)} > \frac{s}{n}q \;\; \forall s > r\Big\}.$$
Further, under $\{R = r\}$, $V = \sum_{i=1}^{n_0} 1\{p_i \le \frac{r}{n}q\}$. Thus,
\begin{align*}
\Big\{p_1 \le \frac{r}{n}q, \; R = r\Big\}
&= \Big\{p_1 \le \frac{r}{n}q, \;\; p_{(r)} \le \frac{r}{n}q, \;\; p_{(s)} > \frac{s}{n}q \;\; \forall s > r\Big\} \\
&= \Big\{p_1 \le \frac{r}{n}q, \;\; p^{(-1)}_{(r-1)} \le \frac{r}{n}q, \;\; p^{(-1)}_{(s)} > \frac{s+1}{n}q \;\; \forall s \ge r\Big\} \\
&= \Big\{p_1 \le \frac{r}{n}q, \;\; \tilde{R}(p^{(-1)}) = r - 1\Big\},
\end{align*}
where $p^{(-1)} = (p_2, \ldots, p_n)$ and $\tilde{R}(p^{(-1)}) = \sup\{1 \le i \le n-1 : p^{(-1)}_{(i)} \le \frac{i+1}{n}q\}$. Finally, we can show that
\begin{align*}
\mathrm{FDR} &= E\Big[\frac{V}{R}\,1\{R \ne 0\}\Big] = \sum_{r=1}^n E\Big[\frac{V}{r}\,1\{R = r\}\Big] = \sum_{r=1}^n \frac{1}{r}\,E\big[V\,1\{R = r\}\big] \\
&= \sum_{r=1}^n \frac{1}{r}\sum_{i=1}^{n_0} P\Big(p_i \le \frac{r}{n}q,\; R = r\Big) \\
&= \sum_{r=1}^n \frac{n_0}{r}\,P\Big(p_1 \le \frac{r}{n}q,\; R = r\Big) \qquad \text{(by exchangeability)} \\
&= \sum_{r=1}^n \frac{n_0}{r}\,P\Big(p_1 \le \frac{r}{n}q\Big)\,P\big(\tilde{R}(p^{(-1)}) = r-1\big) \qquad \text{(by independence)} \\
&= \sum_{r=1}^n \frac{n_0}{r}\cdot\frac{r}{n}\,q\,P\big(\tilde{R}(p^{(-1)}) = r-1\big) = \frac{n_0}{n}\,q \le q.
\end{align*}
We now take a Bayesian (two-group) point of view. Suppose that each test statistic $X_i$ is drawn from the mixture density
$$f := \pi_0 f_0 + (1 - \pi_0) f_1,$$
where $f_0$ is the density of $X_i$ if the null is true (with c.d.f. $F_0$; e.g., $U[0,1]$ or $N(0,1)$), $f_1$ is the density of $X_i$ otherwise (with c.d.f. $F_1$), and $\pi_0$ denotes the prior probability that a hypothesis is null. Let $H$ denote the unobserved variable that takes the value 0 or 1 depending on whether the null hypothesis is true or not.
Consider reporting $X \in A$ as non-null whenever the observed statistic falls in a region $A$. Using Bayes' rule, we can evaluate the posterior probability of the null hypothesis:
$$\phi(A) := P(H = 0 \mid X \in A) = \frac{P(H = 0)\,P(X \in A \mid H = 0)}{P(X \in A)} = \frac{\pi_0 P_0(A)}{P(A)} = \frac{\pi_0 \int_A f_0(x)\,dx}{P(A)},$$
where P0 (A) denotes the probability of a set A under the null distribution.
We can call the quantity φ(A) the Bayes false discovery rate (BFDR). If we report
x ∈ A as non-null, φ(A) is the probability that we have made a false discovery. What
should $A$ be? If we reject $H_{0,i}$ whenever $X_i > x_c$ (e.g., if we are testing $H_{0,i}: \mu_i = 0$ vs. $H_{1,i}: \mu_i > 0$), then $A = [x_c, \infty)$. In practice, we will have some critical value $x_c$ and
A will take one of the following forms:
$$[x_c, \infty), \qquad (-\infty, x_c], \qquad (-\infty, -x_c] \cup [x_c, \infty). \tag{76}$$
Without knowing P (A), the BFDR cannot be computed. However, we can estimate
this quantity by
$$\widehat{P}(A) = \frac{1}{n}\sum_{i=1}^n 1_A(X_i).$$
This yields the BFDR estimate:
$$\widehat{\mathrm{BFDR}} = \widehat{\phi}(A) = \frac{\widehat{\pi}_0\,P_0(A)}{\widehat{P}(A)}.$$
If $n$ is large, then $\widehat{P}(A)$ will be close to $P(A)$, and thus $\widehat{\mathrm{BFDR}}$ may be a good estimate of the BFDR.
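A plug-in sketch of this estimate for a one-sided tail set $A = [x_c, \infty)$ with a standard normal null is given below; the choice $\widehat{\pi}_0 = 1$ is a conservative default that I assume here for illustration, not something prescribed in the notes.

```python
import numpy as np
from scipy.stats import norm

def bfdr_hat(x, xc, pi0_hat=1.0):
    """Estimate BFDR(A) for A = [xc, inf): pi0_hat * P0(A) / Phat(A)."""
    x = np.asarray(x)
    P0_A = norm.sf(xc)                               # P0([xc, inf)) under the N(0,1) null
    Phat_A = max(np.mean(x >= xc), 1.0 / len(x))     # empirical P(A), kept away from 0
    return pi0_hat * P0_A / Phat_A
```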
Classical BH theory only lets us discuss false discovery rates for tail sets of the
form (76). An advantage of the Bayesian theory is that we can now compute and
bound the FDR for generic measurable sets A. [3] likes to distinguish between the
“local” and “global” FDR rates:
$$\text{Global FDR:}\;\; \mathrm{FDR}(x_c) = \phi([x_c, \infty)), \qquad \text{Local FDR:}\;\; \mathrm{fdr}(x_c) = \phi(\{x_c\}),$$
where the local FDR is well-defined in general provided all distributions have continuous densities, i.e.,
$$\phi(\{x_0\}) = \frac{\pi_0 f_0(x_0)}{f(x_0)}.$$
These two quantities can be very different.
Example 7.3. Suppose that F0 = N (0, 1) and F1 = Unif(−10, 10), π0 = 1/2. In
other words, under the null hypotheses the test statistics are standard Gaussian,
whereas under the alternatives they have a uniform distribution over a medium-size
interval, and on average half the hypotheses are null. In this case:
$$\phi([2, \infty)) = \frac{1 - \Phi(2)}{8/20 + (1 - \Phi(2))} \approx 0.054, \qquad \phi(\{2\}) = \frac{\phi(2)}{1/20 + \phi(2)} \approx 0.52.$$
Thus, a global FDR analysis suggests that x ≥ 2 is strong evidence for the alternative,
whereas a local FDR analysis tells us that in fact x = 2 is mild evidence for the null.
(There is no contradiction here — under the data generating distribution, given that
x ≥ 2 you would expect that x >> 2, and so the expected global FDR is small.)
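The two numbers in Example 7.3 can be reproduced directly; a small numerical check, using scipy for $\Phi$ and the normal density:

```python
from scipy.stats import norm

pi0 = 0.5
global_fdr = pi0 * norm.sf(2) / (pi0 * norm.sf(2) + (1 - pi0) * (8 / 20))
local_fdr = pi0 * norm.pdf(2) / (pi0 * norm.pdf(2) + (1 - pi0) * (1 / 20))
print(round(global_fdr, 3), round(local_fdr, 3))     # approximately 0.054 and 0.52
```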
The beauty of local FDR theory is that it can tell us the probability that any given
hypothesis is null, instead of just giving us the expected proportion of nulls among
all rejections. It’s down side, of course, is that it relies on more complex Bayesian
machinery. Standard BH theory (which is what people mostly use in practice) gives
us weaker global FDR type results, but requires much less assumptions to go through.
For more on this topic see [3, Chapter 5].
How does the BH procedure relate to the empirical Bayes procedures we are discussing? First, we note that z-values map to p-values via $p_i = F_0(X_i)$ (taking small values of $X_i$ as evidence against the null, so that $A = (-\infty, x_c]$). Since $i/n = \widehat{F}_n(X_{(i)})$, the BH condition $p_{(i)} \le qi/n$ can be rewritten as
$$\frac{F_0(X_{(i)})}{\widehat{F}_n(X_{(i)})} \le q \;\;\approx\;\; \widehat{\phi}\big((-\infty, X_{(i)}]\big) \le q,$$
where the approximation takes $\widehat{\pi}_0 \approx 1$.
Claim: The empirical Bayes formulation of BH(q) is to reject H0,(i) for all i ≤ i0
where i0 is the largest index such that
$$\widehat{\mathrm{BFDR}}\big((-\infty, x_{(i_0)}]\big) \le q.$$
8 High dimensional linear regression

Consider the linear regression model
$$y = X\beta^* + w,$$
where $X \in \mathbb{R}^{N \times p}$ is the design matrix, $\beta^* \in \mathbb{R}^p$ is the unknown parameter vector, and $w \in \mathbb{R}^N$ is a noise vector. The lasso estimator, in its constrained form, solves
$$\min_{\beta \in \mathbb{R}^p: \|\beta\|_1 \le R} \frac{1}{2N}\|y - X\beta\|_2^2, \tag{77}$$
where $R > 0$ is a tuning parameter. The above is sometimes called the constrained form of the lasso. An equivalent form (due to Lagrangian duality) is the penalized version
$$\min_{\beta \in \mathbb{R}^p} \left\{\frac{1}{2N}\|y - X\beta\|_2^2 + \lambda_N \|\beta\|_1\right\}, \tag{78}$$
where $\lambda_N > 0$ is the Lagrange multiplier associated with the constraint $\|\beta\|_1 \le R$.
The lasso estimator performs both variable selection and regularization simultane-
ously; it has good prediction accuracy and offers interpretability to the statistical
model it produces. Figure 5 shows a simple illustration of the performance of the constrained lasso estimator (and ridge regression) and gives some intuition as to why it can also perform variable selection.
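As an aside (not part of the notes), the penalized form (78) is easy to try out numerically. Below is a sketch using scikit-learn, whose Lasso objective $\frac{1}{2N}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$ matches (78) with $\alpha$ playing the role of $\lambda_N$; the dimensions, sparsity level, noise level and seed are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
N, p, k, sigma = 100, 200, 5, 0.5
X = rng.standard_normal((N, p))
beta_star = np.zeros(p)
beta_star[:k] = 1.0                                  # hard-sparse truth, |S| = k
y = X @ beta_star + sigma * rng.standard_normal(N)

lam = 2 * sigma * np.sqrt(np.log(p) / N)             # order of the theoretical lambda_N
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
print("nonzeros:", int(np.sum(beta_hat != 0)),
      "l2 error:", float(np.linalg.norm(beta_hat - beta_star)))
```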
Given a lasso estimate β̂ ∈ Rp , we can assess its quality in various ways. In some
settings, we are interested in the predictive performance of β̂, so that we might
compute a prediction loss function of the form
$$L(\widehat{\beta}, \beta^*) := \frac{1}{N}\|X\widehat{\beta} - X\beta^*\|_2^2.$$
Figure 5: Estimation picture for the lasso (left) and ridge regression (right). The solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$, respectively, while the red ellipses are the contours of the residual-sum-of-squares function. The point $\widehat{\beta}$ depicts the usual (unconstrained) least-squares estimate.
1
The lasso minimizes the least-squares loss fN (β) := 2N ky − Xβk22 subject to an `1 -
constraint. Let us suppose that the difference in function values ∆fN = |fN (β̂) −
fN (β ∗ )| converges to zero as the sample size N increases. The key question is the
following: what additional conditions are needed to ensure that the `2 -norm of the
parameter vector difference ∆β = kβ̂−β ∗ k2 also converges to zero? Figure 6 illustrates
two scenarios that suggest that the function fN has to be suitably “curved”.
A natural way to specify that a function is suitably “curved” is via the notion of
strong convexity. More specifically, given a differentiable function f : Rp → R, we
say that it is strongly convex with parameter γ > 0 at θ∗ ∈ Rp if the inequality
$$f(\theta) - f(\theta^*) \ge \nabla f(\theta^*)^\top (\theta - \theta^*) + \frac{\gamma}{2}\|\theta - \theta^*\|_2^2$$
holds for all θ ∈ Rp . Note that this notion is a strengthening of ordinary convexity,
which corresponds to the case γ = 0. When the function f is twice continuously
differentiable, an alternative characterization of strong convexity is in terms of the
Hessian ∇2 f : in particular, the function f is strongly convex with parameter γ around
θ∗ ∈ Rp if and only if the minimum eigenvalue of the Hessian matrix ∇2 f (θ) is at
least γ for all vectors θ in a neighborhood of θ∗ .
In the high-dimensional setting it will be enough to require curvature only over a restricted set $C$ of directions, namely
$$\frac{\nu^\top \nabla^2 f(\theta)\,\nu}{\|\nu\|_2^2} \ge \gamma \qquad \text{for all nonzero } \nu \in C.$$
Under what conditions is the least-squares loss strongly convex? A straightforward calculation yields that $\nabla^2 f_N(\beta) = X^\top X / N$ for all $\beta \in \mathbb{R}^p$. Thus, the least-squares loss is strongly convex if and only if the eigenvalues of the $p \times p$ positive semidefinite matrix $X^\top X$ are uniformly bounded away from zero. However, it is easy to see that any matrix of the form $X^\top X$ has rank at most $\min\{N, p\}$, so it is always rank-deficient, and hence not strongly convex, whenever $N < p$. Figure 7 illustrates the situation.
Figure 7: A convex loss function in high-dimensional settings (with $p \gg N$) cannot be strongly convex; rather, it will be curved in some directions but flat in others. As will be shown later, the lasso error $\widehat{\nu} = \widehat{\beta} - \beta^*$ must lie in a restricted subset $C$ of $\mathbb{R}^p$. For this reason, it is only necessary that the loss function be curved in certain directions of space.
So, what constraint sets C are relevant? Suppose that the parameter vector β ∗ is
sparse — say supported on the subset S = S(β ∗ ). Defining the lasso error ν̂ = β̂ − β ∗ ,
let ν̂S ∈ R|S| denote the subvector indexed by elements of S, with ν̂S c defined in an
analogous manner. For appropriate choices of the `1 -ball radius — or equivalently, of
the regularization parameter λN — it turns out that the lasso error satisfies a cone
constraint of the form
$$\|\widehat{\nu}_{S^c}\|_1 \le \alpha\|\widehat{\nu}_S\|_1,$$
for some constant $\alpha \ge 1$. Thus, we consider a restricted set of the form
$$C(S, \alpha) := \{\nu \in \mathbb{R}^p : \|\nu_{S^c}\|_1 \le \alpha\|\nu_S\|_1\}.$$
Before proving this result, let us discuss the different factors in the above bound. First, it is important to note that this result is deterministic, and applies to any set of linear regression equations with a given observed noise vector $w$. Based on our earlier discussion of the role of strong convexity, it is natural that the lasso $\ell_2$-error is inversely proportional to the restricted eigenvalue constant $\gamma > 0$. The second term $\sqrt{k/N}$ is also to be expected, since we are trying to estimate an unknown regression vector with $k$ unknown entries based on $N$ samples. As we have discussed, the final term in both bounds, involving $\|X^\top w\|_\infty$, reflects the interaction of the observation noise $w$ with the design matrix $X$.
Example 8.3 (Classical linear Gaussian model). We begin with the classical linear
Gaussian model for which the noise w ∈ RN is Gaussian with i.i.d. N (0, σ 2 ) entries.
Let us view the design matrix X as fixed, with columns {x1 , . . . , xp }. For any given
column $j \in \{1, \ldots, p\}$, a simple calculation shows that the random variable $x_j^\top w / N$ is distributed as $N\big(0, \frac{\sigma^2}{N}\frac{\|x_j\|_2^2}{N}\big)$. Consequently, if the columns of the design matrix $X$ are normalized (meaning $\|x_j\|_2^2 / N = 1$ for all $j = 1, \ldots, p$), then this variable has a $N(0, \sigma^2/N)$ distribution, so that we have the Gaussian tail bound
$$P\left(\frac{|x_j^\top w|}{N} \ge t\right) \le 2 e^{-\frac{N t^2}{2\sigma^2}} \qquad \text{for } t > 0.$$
Since $\|X^\top w\|_\infty / N$ corresponds to the maximum over $p$ such variables, the union bound yields
$$P\left(\frac{\|X^\top w\|_\infty}{N} \ge t\right) \le 2 e^{-\frac{N t^2}{2\sigma^2} + \log p} = 2 e^{-\frac{1}{2}(\tau - 2)\log p},$$
where the second equality follows by setting $t = \sigma\sqrt{\frac{\tau \log p}{N}}$ for some $\tau > 2$. Consequently, we conclude that the lasso error satisfies the bound
$$\|\widehat{\beta} - \beta^*\|_2 \le \frac{4\sigma}{\gamma}\sqrt{\frac{\tau k \log p}{N}}$$
with probability at least $1 - 2e^{-\frac{1}{2}(\tau - 2)\log p}$.
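A short simulation sketch of the quantity driving this example: with i.i.d. $N(0, \sigma^2)$ noise and (approximately) normalized Gaussian design columns, $\|X^\top w\|_\infty / N$ is on the order of $\sigma\sqrt{2\log p / N}$, which is what motivates the choice of $\lambda_N$. The sizes and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, sigma = 500, 2000, 1.0
X = rng.standard_normal((N, p))                      # columns have ||x_j||_2^2 / N ~ 1
w = sigma * rng.standard_normal(N)
print(np.max(np.abs(X.T @ w)) / N,                   # observed ||X^T w||_inf / N
      sigma * np.sqrt(2 * np.log(p) / N))            # theoretical order
```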
Proof of Theorem 8.2. In this case, since $\beta^*$ is feasible and $\widehat{\beta}$ is optimal, we have the inequality $\|y - X\widehat{\beta}\|_2^2 \le \|y - X\beta^*\|_2^2$. Defining the error vector $\widehat{\nu} := \widehat{\beta} - \beta^*$, substituting in the relation $y = X\beta^* + w$, and performing some algebra yields the basic inequality
$$\frac{\|X\widehat{\nu}\|_2^2}{2N} \le \frac{w^\top X\widehat{\nu}}{N}. \tag{80}$$
Applying a version of Hölder's inequality to the right-hand side yields the upper bound $\frac{1}{N}|w^\top X\widehat{\nu}| \le \frac{1}{N}\|X^\top w\|_\infty \|\widehat{\nu}\|_1$.
Next, we claim that the inequality $\|\widehat{\beta}\|_1 \le R = \|\beta^*\|_1$ implies that $\widehat{\nu} \in C(S, 1)$. Observe that
\begin{align*}
R = \|\beta^*_S\|_1 \ge \|\beta^* + \widehat{\nu}\|_1
&= \|\beta^*_S + \widehat{\nu}_S\|_1 + \|\widehat{\nu}_{S^c}\|_1 \\
&\ge \|\beta^*_S\|_1 - \|\widehat{\nu}_S\|_1 + \|\widehat{\nu}_{S^c}\|_1.
\end{align*}
Rearranging this inequality, we see that $\|\widehat{\nu}_{S^c}\|_1 \le \|\widehat{\nu}_S\|_1$, which shows that $\widehat{\nu} \in C(S, 1)$. Thus, we have
$$\|\widehat{\nu}\|_1 = \|\widehat{\nu}_S\|_1 + \|\widehat{\nu}_{S^c}\|_1 \le 2\|\widehat{\nu}_S\|_1 \le 2\sqrt{k}\,\|\widehat{\nu}\|_2,$$
where we have used the Cauchy-Schwarz inequality in the last step.
On the other hand, applying the restricted eigenvalue condition to the left-hand side
of the inequality (80) yields
$$\gamma\frac{\|\widehat{\nu}\|_2^2}{2} \le \frac{\|X\widehat{\nu}\|_2^2}{2N} \le \frac{w^\top X\widehat{\nu}}{N} \le \frac{1}{N}\|X^\top w\|_\infty\|\widehat{\nu}\|_1 \le \frac{1}{N}\|X^\top w\|_\infty\, 2\sqrt{k}\,\|\widehat{\nu}\|_2.$$
Putting together the pieces yields the claimed bound.
Exercise (HW 4): Suppose that the design matrix $X$ satisfies the restricted eigenvalue bound (79) with parameter $\gamma > 0$ over $C(S, 3)$. Given a regularization parameter $\lambda_N \ge 2\|X^\top w\|_\infty / N > 0$, show that any estimate $\widehat{\beta}$ from the regularized lasso (78) satisfies the bound
$$\|\widehat{\beta} - \beta^*\|_2 \le \frac{3}{\gamma}\sqrt{\frac{k}{N}}\,\sqrt{N}\,\lambda_N.$$
In this section we focus on the Lagrangian lasso (78) and develop some theoretical guarantees for the prediction error $L(\widehat{\beta}, \beta^*) := \frac{1}{N}\|X\widehat{\beta} - X\beta^*\|_2^2$.

Theorem 8.4. Consider the Lagrangian lasso with a regularization parameter $\lambda_N \ge \frac{2}{N}\|X^\top w\|_\infty$.

(b) If $\beta^*$ is supported on a subset $S$, and the design matrix $X$ satisfies the $\gamma$-RE condition (79) over $C(S, 3)$, then any optimal solution $\widehat{\beta}$ satisfies
$$\frac{1}{N}\|X\widehat{\beta} - X\beta^*\|_2^2 \le \frac{9}{\gamma}|S|\,\lambda_N^2.$$
115
q
As we have discussed, for various statistical models, the choice λN = cσ log
N
p
is valid
for Theorem 8.4 with high probability, so the two bounds take the form
r
1 ∗ 2 log p
kXβ̂ − Xβ k2 ≤ c1 σR1 , and
N N
1 σ |S| log p
kXβ̂ − Xβ ∗ k22 ≤ c2 ,
N γ N
for suitable constants $c_1, c_2$. The first bound, which depends on the $\ell_1$-ball radius $R_1$, is known as the "slow rate" for the lasso, since the squared prediction error decays as $1/\sqrt{N}$. On the other hand, the second bound is known as the "fast rate" since it
decays as 1/N . Note that the latter is based on much stronger assumptions: namely,
the hard sparsity condition that β ∗ is supported on a small subset S, and more
disconcertingly, the γ-RE condition on the design matrix X. In principle, prediction
performance should not require an RE condition, so that one might suspect that this
requirement is an artifact of our proof technique. However, this dependence turns
out to be unavoidable for any polynomial-time method; see e.g., [18] where, under a
standard assumption in complexity theory, the authors prove that no polynomial-time
algorithm can achieve the fast rate without imposing an RE condition.
To prove Theorem 8.4, write $\widehat{\nu} := \widehat{\beta} - \beta^*$ as before; using the optimality of $\widehat{\beta}$ for (78) together with Hölder's inequality, one obtains
\begin{align*}
0 &\le \frac{\|X^\top w\|_\infty}{N}\|\widehat{\nu}\|_1 + \lambda_N\{\|\beta^*\|_1 - \|\beta^* + \widehat{\nu}\|_1\} \\
&\le \Big(\frac{\|X^\top w\|_\infty}{N} - \lambda_N\Big)\|\widehat{\nu}\|_1 + 2\lambda_N\|\beta^*\|_1 \\
&\le \frac{1}{2}\lambda_N\{-\|\widehat{\nu}\|_1 + 4\|\beta^*\|_1\},
\end{align*}
where the last step uses the fact that $\frac{1}{N}\|X^\top w\|_\infty \le \lambda_N/2$ (by assumption). Therefore, $\|\widehat{\nu}\|_1 \le 4\|\beta^*\|_1$. Returning again to the modified basic inequality (81) and substituting this relation into it yields
$$\frac{\|X\widehat{\nu}\|_2^2}{2N} \le \frac{\lambda_N}{2}\{\|\widehat{\nu}_S\|_1 + \|\widehat{\nu}_{S^c}\|_1\} + \lambda_N\{\|\widehat{\nu}_S\|_1 - \|\widehat{\nu}_{S^c}\|_1\} \le \frac{3}{2}\lambda_N\|\widehat{\nu}_S\|_1 \le \frac{3}{2}\lambda_N\sqrt{k}\,\|\widehat{\nu}\|_2, \tag{83}$$
where $k := |S|$.
Next we claim that the error vector $\widehat{\nu}$ associated with any lasso solution $\widehat{\beta}$ belongs to the cone $C(S, 3)$. Since $\frac{\|X^\top w\|_\infty}{N} \le \frac{\lambda_N}{2}$, inequality (82) implies that
$$0 \le \frac{\lambda_N}{2}\|\widehat{\nu}\|_1 + \lambda_N\{\|\widehat{\nu}_S\|_1 - \|\widehat{\nu}_{S^c}\|_1\}.$$
Rearranging and then dividing out by $\lambda_N > 0$ yields that $\|\widehat{\nu}_{S^c}\|_1 \le 3\|\widehat{\nu}_S\|_1$, as claimed.
As the error vector $\widehat{\nu}$ belongs to the cone $C(S, 3)$, the $\gamma$-RE condition guarantees that $\|\widehat{\nu}\|_2^2 \le \frac{1}{N\gamma}\|X\widehat{\nu}\|_2^2$. Therefore, using (83) gives
$$\frac{1}{N}\|X\widehat{\nu}\|_2^2 \le 3\lambda_N\sqrt{k}\,\|\widehat{\nu}\|_2 \le 3\lambda_N\sqrt{\frac{k}{N\gamma}}\,\|X\widehat{\nu}\|_2 \;\;\Rightarrow\;\; \frac{1}{\sqrt{N}}\|X\widehat{\nu}\|_2 \le 3\lambda_N\sqrt{\frac{k}{\gamma}}.$$
Exercise (HW 4): State and prove the analogous theorem for the constrained form of the lasso (given in (77)), where you take $R = \|\beta^*\|_1$.
As seen in Theorem 8.4(b), the $\ell_1$-constraint yields a bound on the prediction error that is almost optimal: if we knew the set $S$, then running linear regression on $S$ would yield a bound of the order $\sigma^2|S|/N$; using the lasso, we just pay an additional multiplicative factor of $\log p$. As $S$ is obviously unknown, we can think of fitting all possible linear regression models with $k := |S|$ predictors and then choosing the best one. This would be equivalent to solving the following $\ell_0$-problem:
$$\min_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le k} \|y - X\beta\|_2^2,$$
where $\|\beta\|_0$ denotes the number of non-zero components of $\beta$. Obviously, this procedure is computationally infeasible and possibly NP-hard.
In this subsection we compare the $\ell_0$- and $\ell_1$-problems in the noiseless setup. This would shed light on when we can expect the $\ell_1$-relaxation to perform as well as solving the $\ell_0$-problem. More precisely, given an observation vector $y \in \mathbb{R}^N$ and a
design matrix $X \in \mathbb{R}^{N \times p}$, let us consider the two problems
$$\min_{\beta \in \mathbb{R}^p: X\beta = y} \|\beta\|_0 \tag{84}$$
and
$$\min_{\beta \in \mathbb{R}^p: X\beta = y} \|\beta\|_1. \tag{85}$$
The above linear program (LP) (85) is also known as the basis pursuit LP. Suppose
that the `0 -based problem (84) has a unique optimal solution, say β ∗ ∈ Rp . Our
interest is in understanding when β ∗ is also the unique optimal solution of the `1 -
based problem (85), in which case we say that the basis pursuit LP is equivalent to
`0 -recovery. Remarkably, there exists a very simple necessary and sufficient condition
on the design matrix X for this equivalence to hold.
Definition 8.5 (Exact recovery property). An N × p design matrix X is said to
satisfy the exact recovery property for S ⊂ {1, . . . , p} (or S-ERP) if every β ∗ ∈ Rp
supported on S uniquely minimizes kβk1 subject to Xβ = Xβ ∗ .
For a given subset $S \subset \{1, 2, \ldots, p\}$, let us define the following set:
$$C(S) := \{\beta \in \mathbb{R}^p : \|\beta_{S^c}\|_1 \le \|\beta_S\|_1\}.$$
The set $C(S)$ is a cone but is not convex (Exercise (HW4): show this), containing all vectors that are supported on $S$, and other vectors as well. Roughly, it corresponds
to the cone of vectors that have most of their mass allocated to S. Recall that we
have already seen the importance of the set C(S) in the recovery of β ∗ and Xβ ∗ using
the lasso estimator.
Let $\mathrm{null}(X)$ denote the nullspace of $X$, i.e.,
$$\mathrm{null}(X) = \{\beta \in \mathbb{R}^p : X\beta = 0\}.$$
Definition 8.6 (Restricted nullspace property). For a given subset $S \subset \{1, 2, \ldots, p\}$, we say that the design matrix $X \in \mathbb{R}^{N \times p}$ satisfies the restricted nullspace property over $S$, denoted by RN($S$), if
$$C(S) \cap \mathrm{null}(X) = \{0\}.$$
In words, the RN(S) property holds when the only element of the cone C(S) that lies
within the nullspace of X is the all-zeroes vector. The following theorem highlights the
connection between the exact recovery property and the restricted nullspace property.
Theorem 8.7. The matrix X is S-ERP if and only if it is RN(S).
Since the subset S is not known in advance — indeed, it is usually what we are trying
to determine — it is natural to seek matrices that satisfy a uniform version of the
restricted nullspace property. For instance, we say that the uniform RN property of
order k holds if RN(S) holds for all subsets of {1, . . . , p} of size at most k. In this
case, we are guaranteed that the `1 -relaxation succeeds for any vector supported on
any subset of size at most k.
Proof of Theorem 8.7. First, suppose that X satisfies the RN(S) property. Let
β ∗ ∈ Rp be supported on S and let y = Xβ ∗ . Let β̂ ∈ Rp be any optimal solution
to the basis pursuit LP (85), and define the error vector ν̂ := β̂ − β ∗ . Our goal is to
show that ν̂ = 0, and in order to do so, it suffices to show that ν̂ ∈ null(X) ∩ C(S).
On the one hand, since β ∗ and β̂ are optimal (and hence feasible) solutions to the `0
and `1 -problems, respectively, we are guaranteed that Xβ ∗ = y = Xβ̂, showing that
X ν̂ = 0. On the other hand, since β ∗ is also feasible for the `1 -based problem (85),
the optimality of $\widehat{\beta}$ implies that $\|\widehat{\beta}\|_1 \le \|\beta^*\|_1 = \|\beta^*_S\|_1$. Writing $\widehat{\beta} = \beta^* + \widehat{\nu}$, we have
$$\|\beta^*_S\|_1 \ge \|\beta^* + \widehat{\nu}\|_1 = \|\beta^*_S + \widehat{\nu}_S\|_1 + \|\widehat{\nu}_{S^c}\|_1 \ge \|\beta^*_S\|_1 - \|\widehat{\nu}_S\|_1 + \|\widehat{\nu}_{S^c}\|_1.$$
Rearranging terms, we find that $\widehat{\nu} \in C(S)$. Since $X$ satisfies the RN($S$) condition by
assumption, we conclude that ν̂ = 0, as required.
Suppose now that X is S-ERP. We will use the method of contradiction here to
show that X is RN(S). Thus, assume that X is not RN(S). Then there exists
$h \ne 0 \in \mathrm{null}(X)$ such that
$$\|h_S\|_1 \ge \|h_{S^c}\|_1. \tag{86}$$
Set β ∗ ∈ Rp such that βS∗ = hS and βS∗ c = 0. Then β ∗ is supported on S. Thus, by
the S-ERP β ∗ uniquely minimizes kβk1 subject to Xβ = Xβ ∗ := y.
Set $\beta^+ \in \mathbb{R}^p$ such that $\beta^+_S = 0$ and $\beta^+_{S^c} = -h_{S^c}$. Then observe that $X\beta^* = X\beta^+$, since
$$X\beta^* = X_S h_S = -X_{S^c} h_{S^c} = X\beta^+$$
(using $Xh = X_S h_S + X_{S^c} h_{S^c} = 0$). But $\|\beta^+\|_1 = \|h_{S^c}\|_1 \le \|h_S\|_1 = \|\beta^*\|_1$ by (86), and $\beta^+ \ne \beta^*$ as $h \ne 0$; this contradicts the fact that $\beta^*$ uniquely minimizes $\|\beta\|_1$ subject to $X\beta = y$.
Of course, in order for Theorem 8.7 to be useful in practice, we need to verify the RN
property. A line of work has developed various conditions for certifying the uniform
RN property. The simplest and historically earliest condition is based on the pairwise
incoherence
$$r(X) := \max_{j \ne k \in \{1,\ldots,p\}} \frac{|\langle x_j, x_k\rangle|}{\|x_j\|_2\,\|x_k\|_2}.$$
For centered xj this is the maximal absolute pairwise correlation. When X is rescaled
to have unit-norm columns, an equivalent representation is given by $r(X) = \max_{j \ne k} |\langle x_j, x_k\rangle|$, which illustrates that pairwise incoherence measures how close the Gram matrix $X^\top X$
is to the p-dimensional identity matrix in an element-wise sense.
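Computing $r(X)$ is straightforward; here is a minimal sketch (the function name and the random design are mine, just for illustration):

```python
import numpy as np

def pairwise_incoherence(X):
    """r(X) = max_{j != k} |<x_j, x_k>| / (||x_j|| ||x_k||)."""
    Xn = X / np.linalg.norm(X, axis=0)               # rescale columns to unit norm
    G = Xn.T @ Xn                                    # Gram matrix of normalized columns
    np.fill_diagonal(G, 0.0)
    return float(np.max(np.abs(G)))

rng = np.random.default_rng(5)
print(pairwise_incoherence(rng.standard_normal((200, 50))))
```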
The following result shows that having a low pairwise incoherence is sufficient to
guarantee exactness of the basis pursuit LP.
Proposition 8.8 (Pairwise incoherence implies RN). Suppose that for some integer $k \in \{1, 2, \ldots, p\}$, the pairwise incoherence satisfies the bound $r(X) < \frac{1}{3k}$. Then $X$ satisfies the uniform RN property of order $k$, and hence the basis pursuit LP is exact for all vectors with support size at most $k$.
A related and widely used condition is the restricted isometry property: we say that RIP($k, \delta$) holds if
$$\|X_S^\top X_S - I_k\|_{op} \le \delta$$
for all subsets $S \subset \{1, 2, \ldots, p\}$ of cardinality $k$. We recall here that $\|\cdot\|_{op}$ denotes the operator norm, i.e., the maximal singular value of a matrix. Thus, we see that RIP($k, \delta$) holds if and only if, for all subsets $S$ of cardinality $k$, we have
$$\frac{\|X_S u\|_2^2}{\|u\|_2^2} \in [1 - \delta, 1 + \delta] \qquad \text{for all } u \ne 0 \in \mathbb{R}^k;$$
hence the terminology of restricted isometry.
hence the terminology of restricted isometry. The following result, which we state
without any proof, shows that the RIP is a sufficient condition for the RN property
to hold.
Proposition 8.10 (RIP implies RNP). If RIP(2k, δ) holds with δ < 1/3, then the
uniform RN property of order k holds, and hence the `1 -relaxation is exact for all
vectors supported on at most k elements.
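For small problems, the RIP constant can be checked by brute force over all subsets of size $k$, using the operator-norm characterization above. The sketch below rescales the columns to unit norm, an assumption I make for illustration; the search is exponential in $k$ and only feasible for tiny $p$.

```python
import numpy as np
from itertools import combinations

def rip_constant(X, k):
    """Smallest delta with ||X_S^T X_S - I_k||_op <= delta over all subsets |S| = k."""
    X = X / np.linalg.norm(X, axis=0)                # unit-norm columns
    delta = 0.0
    for S in combinations(range(X.shape[1]), k):
        G = X[:, list(S)].T @ X[:, list(S)]
        delta = max(delta, float(np.max(np.abs(np.linalg.eigvalsh(G) - 1.0))))
    return delta

rng = np.random.default_rng(6)
print(rip_constant(rng.standard_normal((100, 12)), k=3))
```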
References
[1] Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a
practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser.
B 57 (1), 289–300.
[5] Hastie, T., R. Tibshirani, and M. Wainwright (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.
[7] Kimeldorf, G. and G. Wahba (1971). Some results on Tchebycheffian spline func-
tions. J. Math. Anal. Appl. 33, 82–95.
[8] Politis, D. N., J. P. Romano, and M. Wolf (1999). Subsampling. Springer Series
in Statistics. Springer-Verlag, New York.
[9] Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators.
Scand. J. Statist. 9 (2), 65–78.
[11] Sen, B., M. Banerjee, and M. Woodroofe (2010). Inconsistency of bootstrap: the
Grenander estimator. Ann. Statist. 38 (4), 1953–1977.
[12] Stone, C. J. (1984). An asymptotically optimal window selection rule for kernel
density estimates. Ann. Statist. 12 (4), 1285–1297.
[13] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy.
Statist. Soc. Ser. B 58 (1), 267–288.
[16] Vapnik, V. and A. Lerner (1963). Pattern recognition using generalized portrait method. Automation and Remote Control 24, 774–780.
[18] Zhang, Y., M. J. Wainwright, and M. I. Jordan (2014). Lower bounds on the per-
formance of polynomial-time algorithms for sparse linear regression. arXiv preprint
arXiv:1402.1918 .