Copyright © 2008–2010 John Lafferty, Han Liu, and Larry Wasserman. Do Not Distribute.

Chapter 7

Concentration of Measure

Often we want to show that some random quantity is close to its mean with high
probability. Results of this kind are known as concentration inequalities. In
this chapter we consider some important concentration results such as Hoeffding's
inequality, Bernstein's inequality and McDiarmid's inequality. Then we
consider uniform bounds that guarantee that a set of random quantities are
simultaneously close to their means with high probability.

7.1 Introduction
Often we need to show that a random quantity is close to its mean. For example, later we
will prove Hoeffding's inequality, which implies that if $Z_1, \dots, Z_n$ are Bernoulli random
variables with mean $\mu$ then
$$P(|\bar{Z}_n - \mu| > \epsilon) \le 2 e^{-2n\epsilon^2}$$
where $\bar{Z}_n = \frac{1}{n}\sum_{i=1}^n Z_i$.

More generally, we want a result of the form
$$P\Big( \big| f(Z_1, \dots, Z_n) - \mu_n(f) \big| > \epsilon \Big) < \delta_n \qquad (7.1)$$
where $\mu_n(f) = E(f(Z_1, \dots, Z_n))$ and $\delta_n \to 0$ as $n \to \infty$. Such results are known
as concentration inequalities, and the phenomenon that many random quantities are close
to their mean with high probability is called concentration of measure. These results are
fundamental for establishing performance guarantees of many algorithms. For statistical
learning theory, we will need uniform bounds of the form
$$P\Big( \sup_{f \in \mathcal{F}} \big| f(Z_1, \dots, Z_n) - \mu_n(f) \big| > \epsilon \Big) < \delta_n \qquad (7.2)$$
over a class of functions $\mathcal{F}$.

7.3 Example. To motivate the need for such results, consider empirical risk minimization
in classification. Suppose we have data $(X_1, Y_1), \dots, (X_n, Y_n)$ where $Y_i \in \{0,1\}$ and
$X_i \in \mathbb{R}^d$. Let $h : \mathbb{R}^d \to \{0,1\}$ be a classifier. The training error is
$$\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^n I(Y_i \ne h(X_i))$$
and the true classification error is
$$R(h) = P(Y \ne h(X)).$$
We would like to know if $\hat{R}_n(h)$ is close to $R(h)$ with high probability. This is precisely of
the form (7.1) with $Z_i = (X_i, Y_i)$ and $f(Z_1, \dots, Z_n) = \frac{1}{n}\sum_{i=1}^n I(Y_i \ne h(X_i))$.
Now let $\mathcal{H}$ be a set of classifiers. Let $\hat{h}$ minimize the training error $\hat{R}_n(h)$ over $\mathcal{H}$ and
let $h^*$ minimize the true error $R(h)$ over $\mathcal{H}$. Can we guarantee that the risk $R(\hat{h})$ of the
selected classifier is close to the risk $R(h^*)$ of the best classifier? Let $E$ denote the event
that $\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| \le \epsilon$. When the event $E$ holds, we have that
$$R(h^*) \le R(\hat{h}) \le \hat{R}_n(\hat{h}) + \epsilon \le \hat{R}_n(h^*) + \epsilon \le R(h^*) + 2\epsilon$$
where the four inequalities use, in order, the following facts: $h^*$ minimizes $R$, $E$ holds, $\hat{h}$ minimizes $\hat{R}_n$, and $E$ holds again. It follows that, when $E$ holds, $|R(\hat{h}) - R(h^*)| \le 2\epsilon$. Concentration of
measure is used to prove that $E$ holds with high probability. $\square$
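To make Example 7.3 concrete, here is a small simulation sketch (not from the original text). It assumes NumPy, a toy one-dimensional problem, and a finite class $\mathcal{H}$ of threshold classifiers $h_t(x) = I(x > t)$; the true risk is approximated by a large independent test sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(t, X, Y):
    # classification error of the threshold classifier h_t(x) = I(x > t)
    return np.mean((X > t).astype(int) != Y)

def sample(n):
    # toy data: Y = I(X > 0.6) with 10% label noise (an arbitrary choice)
    X = rng.uniform(0, 1, n)
    Y = (X > 0.6).astype(int)
    flip = rng.uniform(size=n) < 0.1
    Y[flip] = 1 - Y[flip]
    return X, Y

X_train, Y_train = sample(200)
X_test, Y_test = sample(100_000)      # large sample approximates the true risk R(h)

H = np.linspace(0, 1, 51)             # finite class of threshold classifiers
train_err = np.array([risk(t, X_train, Y_train) for t in H])
true_err = np.array([risk(t, X_test, Y_test) for t in H])

h_hat = H[np.argmin(train_err)]       # empirical risk minimizer
h_star = H[np.argmin(true_err)]       # best classifier in H
print("sup_h |R_n(h) - R(h)|:", np.max(np.abs(train_err - true_err)))
print("R(h_hat) - R(h_star):", risk(h_hat, X_test, Y_test) - risk(h_star, X_test, Y_test))
```

On the event $E$ the second printed quantity is at most twice the first, which is exactly the argument above.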

Besides classification, concentration inequalities are used for studying many other meth-
ods such as clustering, random projections and density estimation.

Notation

If $P$ is a probability measure and $f$ is a function then we write
$$Pf = P(f) = \int f(z)\, dP(z) = E(f(Z)).$$
Given $Z_1, \dots, Z_n$, let $P_n$ denote the empirical measure that puts mass $1/n$ at each
data point:
$$P_n(A) = \frac{1}{n} \sum_{i=1}^n I(Z_i \in A)$$
where $I(Z_i \in A) = 1$ if $Z_i \in A$ and $I(Z_i \in A) = 0$ otherwise. Then we write
$$P_n f = P_n(f) = \int f(z)\, dP_n(z) = \frac{1}{n} \sum_{i=1}^n f(Z_i).$$

7.2 Basic Inequalities


7.2.1 Hoeffding’s Inequality

Suppose that $Z$ has a finite mean and that $P(Z \ge 0) = 1$. Then, for any $\epsilon > 0$,
$$E(Z) = \int_0^\infty z\, dP(z) \ge \int_\epsilon^\infty z\, dP(z) \ge \epsilon \int_\epsilon^\infty dP(z) = \epsilon\, P(Z > \epsilon) \qquad (7.4)$$
which yields Markov's inequality:
$$P(Z > \epsilon) \le \frac{E(Z)}{\epsilon}. \qquad (7.5)$$
An immediate consequence of Markov's inequality is Chebyshev's inequality:
$$P(|Z - \mu| > \epsilon) = P(|Z - \mu|^2 > \epsilon^2) \le \frac{E(Z - \mu)^2}{\epsilon^2} = \frac{\sigma^2}{\epsilon^2} \qquad (7.6)$$
where $\mu = E(Z)$ and $\sigma^2 = \mathrm{Var}(Z)$. If $Z_1, \dots, Z_n$ are iid with mean $\mu$ and variance $\sigma^2$
then, since $\mathrm{Var}(\bar{Z}_n) = \sigma^2/n$, Chebyshev's inequality yields
$$P(|\bar{Z}_n - \mu| > \epsilon) \le \frac{\sigma^2}{n\epsilon^2}. \qquad (7.7)$$
While this inequality is useful, it does not decay exponentially fast as $n$ increases. To
improve the inequality, we use Chernoff's method: for any $t > 0$,
$$P(Z > \epsilon) = P(e^{Z} > e^{\epsilon}) = P(e^{tZ} > e^{t\epsilon}) \le e^{-t\epsilon}\, E(e^{tZ}). \qquad (7.8)$$
We then minimize over $t$ and conclude that
$$P(Z > \epsilon) \le \inf_{t \ge 0} e^{-t\epsilon}\, E(e^{tZ}). \qquad (7.9)$$
To use the above result we need to bound the moment generating function $E(e^{tZ})$.

7.10 Lemma. Let $Z$ be a mean $\mu$ random variable such that $a \le Z \le b$. Then, for any $t$,
$$E(e^{tZ}) \le e^{t\mu + t^2(b-a)^2/8}. \qquad (7.11)$$

Proof. For simplicity, assume that $\mu = 0$. Since $a \le Z \le b$, we can write $Z$ as a convex
combination of $a$ and $b$, namely, $Z = \alpha b + (1 - \alpha)a$ where $\alpha = (Z - a)/(b - a)$. By the
convexity of the function $y \mapsto e^{ty}$ we have
$$e^{tZ} \le \alpha e^{tb} + (1 - \alpha)e^{ta} = \frac{Z - a}{b - a}\, e^{tb} + \frac{b - Z}{b - a}\, e^{ta}.$$
Take expectations of both sides and use the fact that $E(Z) = 0$ to get
$$E(e^{tZ}) \le -\frac{a}{b - a}\, e^{tb} + \frac{b}{b - a}\, e^{ta} = e^{g(u)} \qquad (7.12)$$
where $u = t(b - a)$, $g(u) = -\gamma u + \log(1 - \gamma + \gamma e^{u})$ and $\gamma = -a/(b - a)$. Note that
$g(0) = g'(0) = 0$. Also, $g''(u) \le 1/4$ for all $u > 0$. By Taylor's theorem, there is a
$\xi \in (0, u)$ such that
$$g(u) = g(0) + u g'(0) + \frac{u^2}{2} g''(\xi) = \frac{u^2}{2} g''(\xi) \le \frac{u^2}{8} = \frac{t^2 (b - a)^2}{8}.$$
Hence, $E(e^{tZ}) \le e^{g(u)} \le e^{t^2(b-a)^2/8}$. $\square$

7.13 Theorem (Hoeffding). If $Z_1, Z_2, \dots, Z_n$ are independent with $P(a \le Z_i \le b) = 1$ and common mean $\mu$ then, for any $\epsilon > 0$,
$$P(|\bar{Z}_n - \mu| > \epsilon) \le 2 e^{-2n\epsilon^2/(b-a)^2} \qquad (7.14)$$
where $\bar{Z}_n = \frac{1}{n}\sum_{i=1}^n Z_i$.

Proof. For simplicity assume that $E(Z_i) = 0$. Now we use the Chernoff method. For any
$t > 0$, we have, from Markov's inequality, that
$$P\Big( \frac{1}{n}\sum_{i=1}^n Z_i \ge \epsilon \Big) = P\Big( \frac{t}{n}\sum_{i=1}^n Z_i \ge t\epsilon \Big) = P\Big( e^{(t/n)\sum_{i=1}^n Z_i} \ge e^{t\epsilon} \Big)$$
$$\le e^{-t\epsilon}\, E\Big( e^{(t/n)\sum_{i=1}^n Z_i} \Big) = e^{-t\epsilon} \prod_i E\big( e^{(t/n)Z_i} \big) \qquad (7.15)$$
$$\le e^{-t\epsilon}\, e^{(t^2/n^2)\sum_{i=1}^n (b_i - a_i)^2/8} \qquad (7.16)$$
where the last inequality follows from Lemma 7.10. Now we minimize the right hand side
over $t$. In particular, we set $t = 4\epsilon n^2 / \sum_{i=1}^n (b_i - a_i)^2$ and get $P(\bar{Z}_n \ge \epsilon) \le e^{-2n\epsilon^2/c}$,
where $c = \frac{1}{n}\sum_{i=1}^n (b_i - a_i)^2$. By
a similar argument, $P(\bar{Z}_n \le -\epsilon) \le e^{-2n\epsilon^2/c}$, and the result follows. $\square$

7.17 Corollary. If $Z_1, Z_2, \dots, Z_n$ are independent with $P(a_i \le Z_i \le b_i) = 1$ and common
mean $\mu$, then, with probability at least $1 - \delta$,
$$|\bar{Z}_n - \mu| \le \sqrt{\frac{c}{2n} \log\Big( \frac{2}{\delta} \Big)} \qquad (7.18)$$
where $c = \frac{1}{n}\sum_{i=1}^n (b_i - a_i)^2$.

7.19 Corollary. If $Z_1, Z_2, \dots, Z_n$ are independent Bernoulli random variables with $P(Z_i = 1) = p$
then, for any $\epsilon > 0$, $P(|\bar{Z}_n - p| > \epsilon) \le 2 e^{-2n\epsilon^2}$. Hence, with probability at least
$1 - \delta$ we have that $|\bar{Z}_n - p| \le \sqrt{\frac{1}{2n}\log\big(\frac{2}{\delta}\big)}$.
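The Bernoulli case is easy to check by simulation. The following sketch (not from the original text; it assumes NumPy) compares a Monte Carlo estimate of the tail probability with the Hoeffding bound $2e^{-2n\epsilon^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps, reps = 100, 0.3, 0.1, 100_000

Zbar = rng.binomial(n, p, size=reps) / n       # sample means of n Bernoulli(p) draws
empirical = np.mean(np.abs(Zbar - p) > eps)    # Monte Carlo tail probability
hoeffding = 2 * np.exp(-2 * n * eps**2)
print(f"P(|Zbar - p| > {eps}) ~ {empirical:.4f}  vs  Hoeffding bound {hoeffding:.4f}")
```

The bound holds but is typically conservative, since it ignores the variance $p(1-p)$.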

7.20 Example (Classification). Returning to the classification problem, let $h$ be a classifier
and let $f(z) = I(y \ne h(x))$ where $z = (x, y)$. Then Hoeffding's inequality implies that
$|R(h) - \hat{R}_n(h)| \le \sqrt{\frac{1}{2n}\log\big(\frac{2}{\delta}\big)}$ with probability at least $1 - \delta$. $\square$

The following result extends Hoeffding's inequality to more general functions $f(z_1, \dots, z_n)$.

7.21 Theorem (McDiarmid). Let $Z_1, \dots, Z_n$ be independent random variables.
Suppose that
$$\sup_{z_1, \dots, z_n, z_i'} \Big| f(z_1, \dots, z_{i-1}, z_i, z_{i+1}, \dots, z_n) - f(z_1, \dots, z_{i-1}, z_i', z_{i+1}, \dots, z_n) \Big| \le c_i \qquad (7.22)$$
for $i = 1, \dots, n$. Then
$$P\Big( \big| f(Z_1, \dots, Z_n) - E(f(Z_1, \dots, Z_n)) \big| \ge \epsilon \Big) \le 2 \exp\Big( -\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2} \Big). \qquad (7.23)$$

Proof. Let $Y = f(Z_1, \dots, Z_n)$ and $\mu = E(f(Z_1, \dots, Z_n))$. Then
$$P\big( |Y - \mu| \ge \epsilon \big) = P\big( Y - \mu \ge \epsilon \big) + P\big( Y - \mu \le -\epsilon \big).$$
We will bound the first quantity. The second follows similarly. Let $V_i = E(Y \mid Z_1, \dots, Z_i) - E(Y \mid Z_1, \dots, Z_{i-1})$. Then
$$f(Z_1, \dots, Z_n) - E(f(Z_1, \dots, Z_n)) = \sum_{i=1}^n V_i$$
and $E(V_i \mid Z_1, \dots, Z_{i-1}) = 0$. Using a similar argument as in Lemma 7.10, we have
$$E(e^{tV_i} \mid Z_1, \dots, Z_{i-1}) \le e^{t^2 c_i^2/8}. \qquad (7.24)$$
Now, for any $t > 0$,
$$P(Y - \mu \ge \epsilon) = P\Big( \sum_{i=1}^n V_i \ge \epsilon \Big) = P\Big( e^{t\sum_{i=1}^n V_i} \ge e^{t\epsilon} \Big) \le e^{-t\epsilon}\, E\Big( e^{t\sum_{i=1}^n V_i} \Big)$$
$$= e^{-t\epsilon}\, E\Big( e^{t\sum_{i=1}^{n-1} V_i}\, E\big( e^{tV_n} \mid Z_1, \dots, Z_{n-1} \big) \Big)$$
$$\le e^{-t\epsilon}\, e^{t^2 c_n^2/8}\, E\Big( e^{t\sum_{i=1}^{n-1} V_i} \Big) \le \cdots \le e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^n c_i^2/8}.$$
The result follows by taking $t = 4\epsilon/\sum_{i=1}^n c_i^2$. $\square$

Remark: If $f(z_1, \dots, z_n) = \frac{1}{n}\sum_{i=1}^n z_i$ then we get back Hoeffding's inequality.
7.25 Example. Let $X_1, \dots, X_n \sim P$ and let $P_n(A) = \frac{1}{n}\sum_{i=1}^n I(X_i \in A)$. Define $\Delta_n \equiv f(X_1, \dots, X_n) = \sup_A |P_n(A) - P(A)|$. Changing one observation changes $f$ by at most
$1/n$. Hence,
$$P\Big( |\Delta_n - E(\Delta_n)| > \epsilon \Big) \le 2 e^{-2n\epsilon^2}. \qquad \square$$

7.2.2 Sharper Inequalities

Hoeffding’s inequality does not use any information about the random variables except
the fact that they are bounded. If the variance of Xi is small, then we can get a sharper
inequality from Bernstein’s inequality. We begin with a preliminary result.

7.26 Lemma. Suppose that $|X| \le c$ and $E(X) = 0$. For any $t > 0$,
$$E(e^{tX}) \le \exp\Big\{ t^2 \sigma^2 \Big( \frac{e^{tc} - 1 - tc}{(tc)^2} \Big) \Big\} \qquad (7.27)$$
where $\sigma^2 = \mathrm{Var}(X)$.

Proof. Let $F = \sum_{r=2}^\infty \frac{t^{r-2} E(X^r)}{r!\, \sigma^2}$. Then
$$E(e^{tX}) = E\Big( 1 + tX + \sum_{r=2}^\infty \frac{t^r X^r}{r!} \Big) = 1 + t^2 \sigma^2 F \le e^{t^2 \sigma^2 F}. \qquad (7.28)$$
For $r \ge 2$, $E(X^r) = E(X^{r-2} X^2) \le c^{r-2} \sigma^2$ and so
$$F \le \sum_{r=2}^\infty \frac{t^{r-2} c^{r-2} \sigma^2}{r!\, \sigma^2} = \frac{1}{(tc)^2} \sum_{r=2}^\infty \frac{(tc)^r}{r!} = \frac{e^{tc} - 1 - tc}{(tc)^2}. \qquad (7.29)$$
Hence, $E(e^{tX}) \le \exp\Big\{ t^2 \sigma^2 \Big( \frac{e^{tc} - 1 - tc}{(tc)^2} \Big) \Big\}$. $\square$

7.30 Theorem (Bernstein). If $P(|X_i| \le c) = 1$ and $E(X_i) = \mu$ then, for any $\epsilon > 0$,
$$P(|\bar{X}_n - \mu| > \epsilon) \le 2 \exp\Big\{ -\frac{n\epsilon^2}{2\sigma^2 + 2c\epsilon/3} \Big\} \qquad (7.31)$$
where $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \mathrm{Var}(X_i)$.

Proof. For simplicity, assume that $\mu = 0$. From Lemma 7.26,
$$E(e^{tX_i}) \le \exp\Big\{ t^2 \sigma_i^2 \Big( \frac{e^{tc} - 1 - tc}{(tc)^2} \Big) \Big\} \qquad (7.32)$$
where $\sigma_i^2 = E(X_i^2)$. Now,
$$P(\bar{X}_n > \epsilon) = P\Big( \sum_{i=1}^n X_i > n\epsilon \Big) = P\Big( e^{t\sum_{i=1}^n X_i} > e^{tn\epsilon} \Big) \qquad (7.33)$$
$$\le e^{-tn\epsilon}\, E\big( e^{t\sum_{i=1}^n X_i} \big) = e^{-tn\epsilon} \prod_{i=1}^n E(e^{tX_i}) \qquad (7.34)$$
$$\le e^{-tn\epsilon} \exp\Big\{ n t^2 \sigma^2 \Big( \frac{e^{tc} - 1 - tc}{(tc)^2} \Big) \Big\}. \qquad (7.35)$$
Take $t = (1/c) \log(1 + \epsilon c/\sigma^2)$ to get
$$P(\bar{X}_n > \epsilon) \le \exp\Big\{ -\frac{n\sigma^2}{c^2}\, h\Big( \frac{c\epsilon}{\sigma^2} \Big) \Big\} \qquad (7.36)$$
where $h(u) = (1+u)\log(1+u) - u$. The result follows by noting that $h(u) \ge u^2/(2 + 2u/3)$ for $u \ge 0$. $\square$

A useful corollary is the following.

7.37 Lemma. Let $X_1, \dots, X_n$ be iid and suppose that $|X_i| \le c$ and $E(X_i) = \mu$. With
probability at least $1 - \delta$,
$$|\bar{X}_n - \mu| \le \sqrt{\frac{2\sigma^2 \log(1/\delta)}{n}} + \frac{2c \log(1/\delta)}{3n}. \qquad (7.38)$$
In particular, if $\sigma \le \sqrt{2c^2\log(1/\delta)/(9n)}$, then with probability at least $1 - \delta$,
$$|\bar{X}_n - \mu| \le \frac{C}{n} \qquad (7.39)$$
where $C = 4c\log(1/\delta)/3$.
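The point of Bernstein's inequality is that when the variance $\sigma^2$ is small, the confidence radius in (7.38) is much smaller than the Hoeffding radius from Corollary 7.17. A quick numerical comparison (a sketch, not from the original text) with $|X_i| \le c$, so $b_i - a_i = 2c$:

```python
import numpy as np

def hoeffding_width(n, c, delta):
    # radius from Corollary 7.17 with a_i = -c, b_i = c
    return np.sqrt((2 * c) ** 2 / (2 * n) * np.log(2 / delta))

def bernstein_width(n, c, sigma2, delta):
    # radius from Lemma 7.37
    return np.sqrt(2 * sigma2 * np.log(1 / delta) / n) + 2 * c * np.log(1 / delta) / (3 * n)

n, c, delta = 1000, 1.0, 0.05
for sigma2 in [1.0, 0.1, 0.01]:
    print(sigma2, hoeffding_width(n, c, delta), bernstein_width(n, c, sigma2, delta))
```

As $\sigma^2$ decreases, the Bernstein radius shrinks toward the $O(1/n)$ term while the Hoeffding radius stays $O(1/\sqrt{n})$.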

We also get a very specific inequality in the special case that X is Gaussian.

7.40 Theorem. Suppose that $X_1, \dots, X_n \sim N(\mu, \sigma^2)$. Then, for any $\epsilon > 0$,
$$P(|\bar{X}_n - \mu| > \epsilon) \le \exp\Big( -\frac{n\epsilon^2}{2\sigma^2} \Big). \qquad (7.41)$$

Proof. Let $X \sim N(0,1)$ with density $\phi(x) = (2\pi)^{-1/2} e^{-x^2/2}$ and distribution function
$\Phi(x) = \int_{-\infty}^x \phi(s)\, ds$. For any $\epsilon > 0$,
$$P(X > \epsilon) = \int_\epsilon^\infty \phi(s)\, ds \le \frac{1}{\epsilon}\int_\epsilon^\infty s\, \phi(s)\, ds = -\frac{1}{\epsilon}\int_\epsilon^\infty \phi'(s)\, ds = \frac{\phi(\epsilon)}{\epsilon} \le \frac{1}{\epsilon}\, e^{-\epsilon^2/2}. \qquad (7.42)$$
By symmetry we have that
$$P(|X| > \epsilon) \le \frac{2}{\epsilon}\, e^{-\epsilon^2/2}.$$
Now suppose that $X_1, \dots, X_n \sim N(\mu, \sigma^2)$. Then $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i \sim N(\mu, \sigma^2/n)$.
Let $Z \sim N(0,1)$. Then,
$$P(|\bar{X}_n - \mu| > \epsilon) = P\big( \sqrt{n}\,|\bar{X}_n - \mu|/\sigma > \sqrt{n}\,\epsilon/\sigma \big) = P\big( |Z| > \sqrt{n}\,\epsilon/\sigma \big) \qquad (7.43)$$
$$\le \frac{2\sigma}{\epsilon\sqrt{n}}\, e^{-n\epsilon^2/(2\sigma^2)} \le e^{-n\epsilon^2/(2\sigma^2)} \qquad (7.44)$$
for all large $n$. $\square$

7.2.3 Bounds on Expected Values

Suppose we have an exponential bound on $P(X_n > \epsilon)$. In that case we can bound $E(X_n)$
as follows.

7.45 Theorem. Suppose that $X_n \ge 0$ and that for every $\epsilon > 0$,
$$P(X_n > \epsilon) \le c_1 e^{-c_2 n \epsilon^2} \qquad (7.46)$$
for some $c_2 > 0$ and $c_1 > 1/e$. Then $E(X_n) \le \sqrt{\frac{C}{n}}$ where $C = (1 + \log(c_1))/c_2$.

Proof. Recall that for any nonnegative random variable $Y$, $E(Y) = \int_0^\infty P(Y \ge t)\, dt$.
Hence, for any $a > 0$,
$$E(X_n^2) = \int_0^\infty P(X_n^2 \ge t)\, dt = \int_0^a P(X_n^2 \ge t)\, dt + \int_a^\infty P(X_n^2 \ge t)\, dt \le a + \int_a^\infty P(X_n^2 \ge t)\, dt.$$
Equation (7.46) implies that $P(X_n > \sqrt{t}) \le c_1 e^{-c_2 n t}$. Hence,
$$E(X_n^2) \le a + \int_a^\infty P(X_n^2 \ge t)\, dt = a + \int_a^\infty P(X_n \ge \sqrt{t})\, dt \le a + c_1 \int_a^\infty e^{-c_2 n t}\, dt = a + \frac{c_1 e^{-c_2 n a}}{c_2 n}.$$
Set $a = \log(c_1)/(n c_2)$ and conclude that
$$E(X_n^2) \le \frac{\log(c_1)}{n c_2} + \frac{1}{n c_2} = \frac{1 + \log(c_1)}{n c_2}.$$
Finally, we have $E(X_n) \le \sqrt{E(X_n^2)} \le \sqrt{\frac{1 + \log(c_1)}{n c_2}}$. $\square$

Now we consider bounding the maximum of a set of random variables.

7.47 Theorem. Let $X_1, \dots, X_n$ be random variables. Suppose there exists $\sigma > 0$ such that
$E(e^{tX_i}) \le e^{t^2\sigma^2/2}$ for all $t > 0$. Then
$$E\Big( \max_{1 \le i \le n} X_i \Big) \le \sigma \sqrt{2 \log n}. \qquad (7.48)$$

Proof. By Jensen's inequality,
$$\exp\Big\{ t\, E\Big( \max_{1 \le i \le n} X_i \Big) \Big\} \le E\Big( \exp\Big\{ t \max_{1 \le i \le n} X_i \Big\} \Big) = E\Big( \max_{1 \le i \le n} \exp\{t X_i\} \Big) \le \sum_{i=1}^n E\big( \exp\{t X_i\} \big) \le n\, e^{t^2\sigma^2/2}.$$
Thus, $E\big( \max_{1 \le i \le n} X_i \big) \le \frac{\log n}{t} + \frac{t\sigma^2}{2}$. The result follows by setting $t = \sqrt{2\log n}/\sigma$. $\square$

7.3 Uniform Bounds


7.3.1 Binary Functions

A binary function on a space $\mathcal{Z}$ is a function $f : \mathcal{Z} \to \{0,1\}$. Let $\mathcal{F}$ be a class of binary
functions on $\mathcal{Z}$. For any $z_1, \dots, z_n$ define
$$\mathcal{F}_{z_1, \dots, z_n} = \Big\{ (f(z_1), \dots, f(z_n)) : f \in \mathcal{F} \Big\}. \qquad (7.49)$$
$\mathcal{F}_{z_1, \dots, z_n}$ is a finite collection of binary vectors and $|\mathcal{F}_{z_1, \dots, z_n}| \le 2^n$. The set $\mathcal{F}_{z_1, \dots, z_n}$ is
called the projection of $\mathcal{F}$ onto $z_1, \dots, z_n$.

7.50 Example. Let $\mathcal{F} = \{f_t : t \in \mathbb{R}\}$ where $f_t(z) = 1$ if $z > t$ and $f_t(z) = 0$ if $z \le t$.
Consider three real numbers $z_1 < z_2 < z_3$. Then
$$\mathcal{F}_{z_1, z_2, z_3} = \Big\{ (0,0,0),\ (0,0,1),\ (0,1,1),\ (1,1,1) \Big\}.$$
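A short sketch (not from the original text; it assumes NumPy) that computes this projection directly for the threshold class:

```python
import numpy as np

def projection(points):
    # distinct binary vectors (I(z_1 > t), ..., I(z_n > t)) as t ranges over R;
    # it suffices to try t below the minimum and between consecutive sorted points
    z = np.asarray(points, dtype=float)
    zs = np.sort(z)
    cuts = np.concatenate(([zs[0] - 1.0], (zs[:-1] + zs[1:]) / 2.0, [zs[-1] + 1.0]))
    return sorted({tuple((z > t).astype(int)) for t in cuts})

print(projection([1.0, 2.0, 3.0]))   # [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
```

For $n$ distinct points the projection always has exactly $n + 1$ elements, so $s(\mathcal{F}, n) = n + 1$ for this class.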

Define the growth function or shattering number by
$$s(\mathcal{F}, n) = \sup_{z_1, \dots, z_n} \big| \mathcal{F}_{z_1, \dots, z_n} \big|. \qquad (7.51)$$

A binary function $f$ can be thought of as an indicator function for a set, namely, $A = \{z : f(z) = 1\}$. Conversely, any set can be thought of as a binary function, namely, its
indicator function $I_A(z)$. We can therefore re-express the growth function in terms of sets.
If $\mathcal{A}$ is a class of subsets of $\mathbb{R}^d$ then $s(\mathcal{A}, n)$ is defined to be $s(\mathcal{F}, n)$ where $\mathcal{F} = \{I_A : A \in \mathcal{A}\}$ is the set of indicator functions; $s(\mathcal{A}, n)$ is again called the shattering
number. It follows that
$$s(\mathcal{A}, n) = \max_F s(\mathcal{A}, F)$$
where the maximum is over all finite sets $F$ of size $n$ and $s(\mathcal{A}, F) = |\{A \cap F : A \in \mathcal{A}\}|$
denotes the number of subsets of $F$ picked out by $\mathcal{A}$. We say that a finite set $F$ of size $n$ is
shattered by $\mathcal{A}$ if $s(\mathcal{A}, F) = 2^n$.

7.52 Theorem. Let $\mathcal{A}$ and $\mathcal{B}$ be classes of subsets of $\mathbb{R}^d$. Then:

1. $s(\mathcal{A}, n+m) \le s(\mathcal{A}, n)\, s(\mathcal{A}, m)$.

2. If $\mathcal{C} = \mathcal{A} \cup \mathcal{B}$ then $s(\mathcal{C}, n) \le s(\mathcal{A}, n) + s(\mathcal{B}, n)$.

3. If $\mathcal{C} = \{A \cup B : A \in \mathcal{A},\ B \in \mathcal{B}\}$ then $s(\mathcal{C}, n) \le s(\mathcal{A}, n)\, s(\mathcal{B}, n)$.

4. If $\mathcal{C} = \{A \cap B : A \in \mathcal{A},\ B \in \mathcal{B}\}$ then $s(\mathcal{C}, n) \le s(\mathcal{A}, n)\, s(\mathcal{B}, n)$.

Proof. See Exercise 10. $\square$

VC Dimension. Recall that a finite set $F$ of size $n$ is shattered by $\mathcal{A}$ if $s(\mathcal{A}, F) = 2^n$. The
VC dimension (named after Vapnik and Chervonenkis) of $\mathcal{A}$ is the size of the largest set
that can be shattered by $\mathcal{A}$:
$$\mathrm{VC}(\mathcal{A}) = \sup\Big\{ n : s(\mathcal{A}, n) = 2^n \Big\}. \qquad (7.53)$$
The VC dimension of a class of binary functions $\mathcal{F}$ is defined analogously:
$$\mathrm{VC}(\mathcal{F}) = \sup\Big\{ n : s(\mathcal{F}, n) = 2^n \Big\}. \qquad (7.54)$$
If the VC dimension is finite, then the growth function cannot grow too quickly. In
fact, there is a phase transition: $s(\mathcal{F}, n) = 2^n$ for $n \le d$ and then the growth switches to
polynomial.

7.55 Theorem (Sauer's Theorem). Suppose that $\mathcal{F}$ has finite VC dimension $d$. Then
$$s(\mathcal{F}, n) \le \sum_{i=0}^d \binom{n}{i} \qquad (7.56)$$
and for all $n \ge d$,
$$s(\mathcal{F}, n) \le \Big( \frac{en}{d} \Big)^d. \qquad (7.57)$$

Proof. When $n = d = 1$, (7.56) clearly holds. We proceed by induction. Suppose that
(7.56) holds for $n-1$ and $d-1$ and also that it holds for $n-1$ and $d$. We will show
that it holds for $n$ and $d$. Let $h(n, d) = \sum_{i=0}^d \binom{n}{i}$. We need to show that $\mathrm{VC}(\mathcal{F}) \le d$
implies that $s(\mathcal{F}, n) \le h(n, d)$. Let $F_1 = \{z_1, \dots, z_n\}$ and $F_2 = \{z_2, \dots, z_n\}$. Let
$\mathcal{F}_1 = \{(f(z_1), \dots, f(z_n)) : f \in \mathcal{F}\}$ and $\mathcal{F}_2 = \{(f(z_2), \dots, f(z_n)) : f \in \mathcal{F}\}$. For
$f, g \in \mathcal{F}$, write $f \sim g$ if $g(z_1) = 1 - f(z_1)$ and $g(z_j) = f(z_j)$ for $j = 2, \dots, n$. Let
$$\mathcal{G} = \Big\{ f \in \mathcal{F} : \text{there exists } g \in \mathcal{F} \text{ such that } g \sim f \Big\}.$$
Define $\mathcal{F}_3 = \{(f(z_2), \dots, f(z_n)) : f \in \mathcal{G}\}$. Then $|\mathcal{F}_1| = |\mathcal{F}_2| + |\mathcal{F}_3|$. Note that
$\mathrm{VC}(\mathcal{F}_2) \le d$ and $\mathrm{VC}(\mathcal{F}_3) \le d - 1$. The latter follows since, if $\mathcal{F}_3$ shatters a set, then we
can add $z_1$ to create a set that is shattered by $\mathcal{F}_1$. By assumption, $|\mathcal{F}_2| \le h(n-1, d)$ and
$|\mathcal{F}_3| \le h(n-1, d-1)$. Hence,
$$|\mathcal{F}_1| \le h(n-1, d) + h(n-1, d-1) = h(n, d).$$
Thus, $s(\mathcal{F}, n) \le h(n, d)$, which proves (7.56).

To prove (7.57), we use the fact that $n \ge d$ and so:
$$\sum_{i=0}^d \binom{n}{i} \le \Big( \frac{n}{d} \Big)^d \sum_{i=0}^d \binom{n}{i} \Big( \frac{d}{n} \Big)^i \le \Big( \frac{n}{d} \Big)^d \sum_{i=0}^n \binom{n}{i} \Big( \frac{d}{n} \Big)^i = \Big( \frac{n}{d} \Big)^d \Big( 1 + \frac{d}{n} \Big)^n \le \Big( \frac{n}{d} \Big)^d e^d. \qquad \square$$
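A small numerical illustration of Sauer's theorem (a sketch, not from the original text): for $n$ much larger than $d$, the bound $h(n, d) = \sum_{i=0}^d \binom{n}{i}$ is polynomial in $n$ and far smaller than $2^n$.

```python
from math import comb, e

def h(n, d):
    # Sauer's bound: sum_{i=0}^{d} C(n, i)
    return sum(comb(n, i) for i in range(d + 1))

n, d = 100, 5
print(h(n, d))            # 79375496 -- polynomial in n
print((e * n / d) ** d)   # the cruder bound (en/d)^d from (7.57)
print(2 ** n)             # the trivial bound 2^n, astronomically larger
```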

The VC dimensions of some common examples are summarized in Table 7.1.


Now we can extend the concentration inequalities to hold uniformly over sets of binary
functions. We start with finite collections.

7.58 Theorem. Suppose that $\mathcal{F} = \{f_1, \dots, f_N\}$ is a finite set of binary functions. Then,
with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{2}{n} \log\Big( \frac{2N}{\delta} \Big)}. \qquad (7.59)$$

Class $\mathcal{A}$                                      VC dimension $V_{\mathcal{A}}$
$\mathcal{A} = \{A_1, \dots, A_N\}$                      $\log_2 N$
Intervals $[a, b]$ on the real line                      2
Discs in $\mathbb{R}^2$                                  3
Closed balls in $\mathbb{R}^d$                           $d + 2$
Rectangles in $\mathbb{R}^d$                             $2d$
Half-spaces in $\mathbb{R}^d$                            $d + 1$
Convex polygons in $\mathbb{R}^2$                        $\infty$

Table 7.1. The VC dimension of some classes $\mathcal{A}$.

Proof. It follows from Hoeffding's inequality that, for each $f \in \mathcal{F}$, $P(|P_n(f) - P(f)| > \epsilon) \le 2e^{-n\epsilon^2/2}$. Hence,
$$P\Big( \max_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon \Big) = P\big( |P_n(f) - P(f)| > \epsilon \text{ for some } f \in \mathcal{F} \big) \le \sum_{j=1}^N P\big( |P_n(f_j) - P(f_j)| > \epsilon \big) \le 2N e^{-n\epsilon^2/2}.$$
The conclusion follows. $\square$
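For a finite class the bound is easy to check empirically. The sketch below (not from the original text; it assumes NumPy) uses $N$ threshold functions $f_j(z) = I(z > t_j)$ and $Z_i \sim \mathrm{Uniform}(0,1)$, so that $P(f_j) = 1 - t_j$ is known exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, delta = 500, 20, 0.05
thresholds = np.linspace(0.05, 0.95, N)   # f_j(z) = I(z > t_j), so P(f_j) = 1 - t_j

Z = rng.uniform(0, 1, n)
sup_dev = max(abs(np.mean(Z > t) - (1 - t)) for t in thresholds)
bound = np.sqrt(2 / n * np.log(2 * N / delta))
print(sup_dev, bound)   # with probability at least 1 - delta, sup_dev <= bound
```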

Now we consider results for the case where $\mathcal{F}$ is infinite. We begin with an important
result due to Vapnik and Chervonenkis.

7.60 Theorem (Vapnik and Chervonenkis). Let $\mathcal{F}$ be a class of binary functions.
For any $t > \sqrt{2/n}$,
$$P\Big( \sup_{f \in \mathcal{F}} |(P_n - P)f| > t \Big) \le 4\, s(\mathcal{F}, 2n)\, e^{-nt^2/8} \qquad (7.61)$$
and hence, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{8}{n} \log\Big( \frac{4\, s(\mathcal{F}, 2n)}{\delta} \Big)}. \qquad (7.62)$$

Before proving the theorem, we need the symmetrization lemma. Let $Z_1', \dots, Z_n'$ denote
a second independent sample from $P$. Let $P_n'$ denote the empirical distribution of this
second sample. The variables $Z_1', \dots, Z_n'$ are called a ghost sample.

7.63 Lemma (Symmetrization). For all $t > \sqrt{2/n}$,
$$P\Big( \sup_{f \in \mathcal{F}} |(P_n - P)f| > t \Big) \le 2\, P\Big( \sup_{f \in \mathcal{F}} |(P_n - P_n')f| > t/2 \Big). \qquad (7.64)$$

Proof. Let $f_n \in \mathcal{F}$ maximize $|(P_n - P)f|$. Note that $f_n$ is a random function as it depends
on $Z_1, \dots, Z_n$. We claim that if $|(P_n - P)f_n| > t$ and $|(P - P_n')f_n| \le t/2$ then $|(P_n' - P_n)f_n| > t/2$. This follows since
$$t < |(P_n - P)f_n| = |(P_n - P_n' + P_n' - P)f_n| \le |(P_n - P_n')f_n| + |(P_n' - P)f_n| \le |(P_n - P_n')f_n| + \frac{t}{2}$$
and hence $|(P_n' - P_n)f_n| > t/2$. So
$$I(|(P_n - P)f_n| > t)\, I(|(P - P_n')f_n| \le t/2) = I\big( |(P_n - P)f_n| > t,\ |(P - P_n')f_n| \le t/2 \big) \le I(|(P_n' - P_n)f_n| > t/2).$$
Now take the expected value over $Z_1', \dots, Z_n'$ and conclude that
$$I(|(P_n - P)f_n| > t)\, P'\big( |(P - P_n')f_n| \le t/2 \big) \le P'\big( |(P_n' - P_n)f_n| > t/2 \big). \qquad (7.65)$$
By Chebyshev's inequality,
$$P'\big( |(P - P_n')f_n| \le t/2 \big) \ge 1 - \frac{4\,\mathrm{Var}'(f_n)}{n t^2} \ge 1 - \frac{1}{n t^2} \ge \frac{1}{2}.$$
(Here we used the fact that $W \in [0,1]$ implies that $\mathrm{Var}(W) \le 1/4$, and that $t > \sqrt{2/n}$.) Inserting this into
(7.65) we have that
$$I(|(P_n - P)f_n| > t) \le 2\, P'\big( |(P_n' - P_n)f_n| > t/2 \big).$$
Thus,
$$I\Big( \sup_{f \in \mathcal{F}} |(P_n - P)f| > t \Big) \le 2\, P'\Big( \sup_{f \in \mathcal{F}} |(P_n' - P_n)f| > t/2 \Big).$$
Now take the expectation over $Z_1, \dots, Z_n$ to conclude that
$$P\Big( \sup_{f \in \mathcal{F}} |(P_n - P)f| > t \Big) \le 2\, P\Big( \sup_{f \in \mathcal{F}} |(P_n' - P_n)f| > t/2 \Big). \qquad \square$$

The importance of symmetrization is that we have replaced $(P_n - P)f$, which can take
any real value, with $(P_n - P_n')f$, which can take only finitely many values. Now we prove
the Vapnik-Chervonenkis theorem.

Proof. Let $V = \mathcal{F}_{Z_1', \dots, Z_n', Z_1, \dots, Z_n}$. For any $v \in V$ write $(P_n' - P_n)v$ to mean $\frac{1}{n}\big( \sum_{i=1}^n v_i - \sum_{i=n+1}^{2n} v_i \big)$. Using the symmetrization lemma and Hoeffding's inequality,
$$P\Big( \sup_{f \in \mathcal{F}} |(P_n - P)f| > t \Big) \le 2\, P\Big( \sup_{f \in \mathcal{F}} |(P_n' - P_n)f| > t/2 \Big) = 2\, P\Big( \max_{v \in V} |(P_n' - P_n)v| > t/2 \Big)$$
$$\le 2 \sum_{v \in V} P\big( |(P_n' - P_n)v| > t/2 \big) \le 2 \sum_{v \in V} 2 e^{-nt^2/8} \le 4\, s(\mathcal{F}, 2n)\, e^{-nt^2/8}. \qquad \square$$
Recall that, for a class with finite VC dimension $d$, $s(\mathcal{F}, n) \le (en/d)^d$. Hence we have:

7.66 Corollary. If $\mathcal{F}$ has finite VC dimension $d$, then, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{8}{n} \Big( \log\Big( \frac{4}{\delta} \Big) + d \log\Big( \frac{ne}{d} \Big) \Big)}. \qquad (7.67)$$

7.3.2 Rademacher Complexity

A more general way to develop uniform bounds is to use a quantity called Rademacher
complexity. In this section we assume that $\mathcal{F}$ is a class of functions $f$ such that $0 \le f(z) \le 1$.

Random variables $\sigma_1, \dots, \sigma_n$ are called Rademacher random variables if they are
independent, identically distributed and $P(\sigma_i = 1) = P(\sigma_i = -1) = 1/2$. Define
the Rademacher complexity of $\mathcal{F}$ by
$$\mathrm{Rad}_n(\mathcal{F}) = E\Big( \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i f(Z_i) \Big| \Big). \qquad (7.68)$$
Define the empirical Rademacher complexity of $\mathcal{F}$ by
$$\mathrm{Rad}_n(\mathcal{F}, Z^n) = E_\sigma\Big( \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i f(Z_i) \Big| \Big) \qquad (7.69)$$
where $Z^n = (Z_1, \dots, Z_n)$ and the expectation is over $\sigma_1, \dots, \sigma_n$ only.
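The empirical Rademacher complexity is easy to approximate by Monte Carlo once the projection $\mathcal{F}_{Z_1, \dots, Z_n}$ is finite. The sketch below (not from the original text; it assumes NumPy) does this for the threshold class of Example 7.50 and compares the result with $\sqrt{2\log(n+1)/n}$, the shattering-number bound proved below (Theorem 7.75).

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(Z, n_draws=2000):
    # Monte Carlo estimate of Rad_n(F, Z^n) for the threshold class f_t(z) = I(z > t)
    n = len(Z)
    thresholds = np.concatenate(([-np.inf], np.sort(Z)))   # enough cuts to realize every projection vector
    F = (Z[None, :] > thresholds[:, None]).astype(float)   # one row per distinct function restriction
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))     # Rademacher draws
    sups = np.abs(sigma @ F.T / n).max(axis=1)             # sup_f |(1/n) sum_i sigma_i f(Z_i)| per draw
    return sups.mean()

Z = rng.uniform(0, 1, 200)
print(empirical_rademacher(Z), np.sqrt(2 * np.log(len(Z) + 1) / len(Z)))
```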



Intuitively, $\mathrm{Rad}_n(\mathcal{F})$ is large if we can find functions $f \in \mathcal{F}$ that "look like" random
noise, that is, they are highly correlated with $\sigma_1, \dots, \sigma_n$. Here are some properties of the
Rademacher complexity.

7.70 Lemma.

1. If $\mathcal{F} \subset \mathcal{G}$ then $\mathrm{Rad}_n(\mathcal{F}, Z^n) \le \mathrm{Rad}_n(\mathcal{G}, Z^n)$.

2. Let $\mathrm{conv}(\mathcal{F})$ denote the convex hull of $\mathcal{F}$. Then $\mathrm{Rad}_n(\mathcal{F}, Z^n) = \mathrm{Rad}_n(\mathrm{conv}(\mathcal{F}), Z^n)$.

3. For any $c \in \mathbb{R}$, $\mathrm{Rad}_n(c\mathcal{F}, Z^n) = |c|\, \mathrm{Rad}_n(\mathcal{F}, Z^n)$.

4. Let $g : \mathbb{R} \to \mathbb{R}$ be such that $g(0) = 0$ and $|g(y) - g(x)| \le L|x - y|$ for all $x, y$.
Then $\mathrm{Rad}_n(g \circ \mathcal{F}, Z^n) \le 2L\, \mathrm{Rad}_n(\mathcal{F}, Z^n)$.

Proof. See Exercise 8. $\square$

7.71 Theorem. With probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2\, \mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{1}{2n} \log\Big( \frac{2}{\delta} \Big)} \qquad (7.72)$$
and
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2\, \mathrm{Rad}_n(\mathcal{F}, Z^n) + \sqrt{\frac{4}{n} \log\Big( \frac{2}{\delta} \Big)}. \qquad (7.73)$$

Proof. The proof has two steps. First we show that $\sup_{f \in \mathcal{F}} |P_n(f) - P(f)|$ is close to its
mean. Then we bound the mean.

Step 1: Let $g(Z_1, \dots, Z_n) = \sup_{f \in \mathcal{F}} |P_n(f) - P(f)|$. If we change $Z_i$ to some other value
$Z_i'$ then $|g(Z_1, \dots, Z_n) - g(Z_1, \dots, Z_i', \dots, Z_n)| \le \frac{1}{n}$. By McDiarmid's inequality,
$$P\big( |g(Z_1, \dots, Z_n) - E[g(Z_1, \dots, Z_n)]| > \epsilon \big) \le 2 e^{-2n\epsilon^2}.$$
Hence, with probability at least $1 - \delta$,
$$g(Z_1, \dots, Z_n) \le E[g(Z_1, \dots, Z_n)] + \sqrt{\frac{1}{2n} \log\Big( \frac{2}{\delta} \Big)}. \qquad (7.74)$$

Step 2: Now we bound $E[g(Z_1, \dots, Z_n)]$. Once again we introduce a ghost sample $Z_1', \dots, Z_n'$
and Rademacher variables $\sigma_1, \dots, \sigma_n$. Note that $P(f) = E'\, P_n'(f)$. Also note that
$$\frac{1}{n} \sum_{i=1}^n \big( f(Z_i') - f(Z_i) \big) \ \stackrel{d}{=}\ \frac{1}{n} \sum_{i=1}^n \sigma_i \big( f(Z_i') - f(Z_i) \big)$$
where $\stackrel{d}{=}$ means "equal in distribution." Hence,
$$E[g(Z_1, \dots, Z_n)] = E\Big[ \sup_{f \in \mathcal{F}} |P(f) - P_n(f)| \Big] = E\Big[ \sup_{f \in \mathcal{F}} \big| E'\big( P_n'(f) - P_n(f) \big) \big| \Big]$$
$$\le E E'\Big[ \sup_{f \in \mathcal{F}} |P_n'(f) - P_n(f)| \Big] = E E'\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \big( f(Z_i') - f(Z_i) \big) \Big| \Big]$$
$$= E E'\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i \big( f(Z_i') - f(Z_i) \big) \Big| \Big]$$
$$\le E'\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i f(Z_i') \Big| \Big] + E\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i f(Z_i) \Big| \Big] = 2\, \mathrm{Rad}_n(\mathcal{F}).$$
Combining this bound with (7.74) proves the first result.

To prove the second result, let $a(Z_1, \dots, Z_n) = \mathrm{Rad}_n(\mathcal{F}, Z^n)$ and note that $a(Z_1, \dots, Z_n)$
changes by at most $1/n$ if we change one observation. McDiarmid's inequality implies that
$|\mathrm{Rad}_n(\mathcal{F}, Z^n) - \mathrm{Rad}_n(\mathcal{F})| \le \sqrt{\frac{1}{2n}\log\big(\frac{2}{\delta}\big)}$ with probability at least $1 - \delta$. Combining this
with the first result yields the second result. $\square$

In the special case where $\mathcal{F}$ is a class of binary functions, we can relate $\mathrm{Rad}_n(\mathcal{F})$ to
shattering numbers.

7.75 Theorem. Let $\mathcal{F}$ be a set of binary functions. Then, for all $n$,
$$\mathrm{Rad}_n(\mathcal{F}) \le \sqrt{\frac{2 \log s(\mathcal{F}, n)}{n}}. \qquad (7.76)$$

Proof. Let $D = \{Z_1, \dots, Z_n\}$. Define $S(f, \sigma) = \big| n^{-1} \sum_{i=1}^n \sigma_i f(Z_i) \big|$ and $S(v, \sigma) = \big| n^{-1} \sum_{i=1}^n \sigma_i v_i \big|$. Note that $-1 \le \sigma_i f(Z_i) \le 1$ and that
$$\mathrm{Rad}_n(\mathcal{F}) = E\Big( \sup_{f \in \mathcal{F}} S(f, \sigma) \Big) = E\Big( E\Big( \sup_{f \in \mathcal{F}} S(f, \sigma) \,\Big|\, D \Big) \Big) = E\Big( E\Big( \max_{v \in \mathcal{F}_{Z_1, \dots, Z_n}} S(v, \sigma) \,\Big|\, D \Big) \Big).$$
Now, $\sigma_i v_i/n$ has mean $0$ and $-1/n \le \sigma_i v_i/n \le 1/n$ so, by Lemma 7.10, $E(e^{t\sigma_i v_i/n}) \le e^{t^2/(2n^2)}$ for any $t > 0$. From Theorem 7.47,
$$E\Big( \max_{v \in \mathcal{F}_{Z_1, \dots, Z_n}} S(v, \sigma) \,\Big|\, D \Big) \le \sqrt{\frac{2 \log |\mathcal{F}_{Z_1, \dots, Z_n}|}{n}} \le \sqrt{\frac{2 \log s(\mathcal{F}, n)}{n}}$$
and the result follows. $\square$

In fact, there is a sharper relationship between $\mathrm{Rad}_n(\mathcal{F})$ and the VC dimension.

7.77 Theorem. Suppose that $\mathcal{F}$ has finite VC dimension $d$. Then there exists a universal constant
$C > 0$ such that $\mathrm{Rad}_n(\mathcal{F}) \le C\sqrt{d/n}$.

For a proof, see, for example, Devroye and Lugosi (2001).


Combining these results with Theorem 7.75 and Theorem 7.77 we get the following
result.

7.78 Corollary. With probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{8 \log s(\mathcal{F}, n)}{n}} + \sqrt{\frac{1}{2n} \log\Big( \frac{2}{\delta} \Big)}. \qquad (7.79)$$
If $\mathcal{F}$ has finite VC dimension $d$ then, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2C\sqrt{\frac{d}{n}} + \sqrt{\frac{1}{2n} \log\Big( \frac{2}{\delta} \Big)}. \qquad (7.80)$$

7.3.3 Bounds For Classes of Real Valued Functions

Suppose now that $\mathcal{F}$ is a class of real-valued functions. There are various methods to
obtain uniform bounds. We consider two such methods: covering numbers and bracketing
numbers.

If $Q$ is a measure and $p \ge 1$, define
$$\|f\|_{L_p(Q)} = \Big( \int |f(x)|^p\, dQ(x) \Big)^{1/p}.$$
When $Q$ is Lebesgue measure we simply write $\|f\|_p$. We also define
$$\|f\|_\infty = \sup_x |f(x)|.$$
A set $C = \{f_1, \dots, f_N\}$ is an $\epsilon$-cover of $\mathcal{F}$ (or an $\epsilon$-net) if, for every $f \in \mathcal{F}$, there exists an
$f_j \in C$ such that $\|f - f_j\|_{L_p(Q)} < \epsilon$.

7.81 Definition. The size of the smallest $\epsilon$-cover is called the covering number and
is denoted by $N_p(\epsilon, \mathcal{F}, Q)$. The uniform covering number is defined by
$$N_p(\epsilon, \mathcal{F}) = \sup_Q N_p(\epsilon, \mathcal{F}, Q)$$
where the supremum is over all probability measures $Q$.

Now we show how covering numbers can be used to obtain bounds.

7.82 Theorem. Suppose that $\|f\|_\infty \le B$ for all $f \in \mathcal{F}$. Then
$$P\Big( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon \Big) \le 2\, N(\epsilon/3, \mathcal{F}, L_\infty)\, e^{-n\epsilon^2/(18 B^2)}.$$

Proof. Let $N = N(\epsilon/3, \mathcal{F}, L_\infty)$ and let $C = \{f_1, \dots, f_N\}$ be an $\epsilon/3$ cover. For any $f \in \mathcal{F}$
there is an $f_j \in C$ such that $\|f - f_j\|_\infty \le \epsilon/3$. So
$$|P_n(f) - P(f)| \le |P_n(f) - P_n(f_j)| + |P_n(f_j) - P(f_j)| + |P(f_j) - P(f)| \le |P_n(f_j) - P(f_j)| + \frac{2\epsilon}{3}.$$
Hence,
$$P\Big( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon \Big) \le P\Big( \max_{f_j \in C} |P_n(f_j) - P(f_j)| + \frac{2\epsilon}{3} > \epsilon \Big) = P\Big( \max_{f_j \in C} |P_n(f_j) - P(f_j)| > \frac{\epsilon}{3} \Big)$$
$$\le \sum_{j=1}^N P\Big( |P_n(f_j) - P(f_j)| > \frac{\epsilon}{3} \Big) \le 2\, N(\epsilon/3, \mathcal{F}, L_\infty)\, e^{-n\epsilon^2/(18B^2)}$$
from the union bound and Hoeffding's inequality. $\square$

The VC dimension can be used to bound covering numbers.

7.83 Theorem. Let $\mathcal{F}$ be a class of functions $f : \mathbb{R}^d \to [0, B]$ with VC dimension $d$ such
that $2 \le d < \infty$. Let $p \ge 1$ and $0 < \epsilon < B/4$. Then
$$N_p(\epsilon, \mathcal{F}) \le 3\left( \frac{2eB^p}{\epsilon^p} \log\Big( \frac{3eB^p}{\epsilon^p} \Big) \right)^d.$$

(For a proof, see Devroye, Gyorfi and Lugosi (1996).) However, there are cases where
the covering numbers are finite and yet the VC dimension is infinite.

Bracketing Numbers. Another measure of complexity is the bracketing number. Given a
pair of functions $\ell$ and $u$ with $\ell \le u$, we define the bracket
$$[\ell, u] = \Big\{ h : \ell(x) \le h(x) \le u(x) \text{ for all } x \Big\}.$$
A collection of pairs of functions $(\ell_1, u_1), \dots, (\ell_N, u_N)$ is a bracketing of $\mathcal{F}$ if
$$\mathcal{F} \subset \bigcup_{j=1}^N [\ell_j, u_j].$$
The collection is an $\epsilon$-$L_q(P)$-bracketing if it is a bracketing and if
$$\Big( \int |u_j(x) - \ell_j(x)|^q\, dP(x) \Big)^{1/q} \le \epsilon$$
for $j = 1, \dots, N$. The bracketing number $N_{[\,]}(\epsilon, \mathcal{F}, L_q(P))$ is the size of the smallest $\epsilon$-bracketing.
Bracketing numbers are a little larger than covering numbers but provide stronger
control of the class $\mathcal{F}$.

7.84 Theorem.

1. $N_p(\epsilon, \mathcal{F}, P) \le N_{[\,]}(2\epsilon, \mathcal{F}, L_p(P))$.

2. Let $X_1, \dots, X_n \sim P$. Suppose that $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$ for all $\epsilon > 0$. Then,
for every $\delta > 0$,
$$P\Big( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \delta \Big) \to 0 \qquad (7.85)$$
as $n \to \infty$.

Proof. See Exercise 11. $\square$

7.86 Theorem. Let $A = \sup_{f \in \mathcal{F}} \int |f|\, dP$ and $B = \sup_{f \in \mathcal{F}} \|f\|_\infty$. Then
$$P^n\Big( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon \Big) \le 2\, N_{[\,]}(\epsilon/8, \mathcal{F}, L_1(P)) \exp\Big( -\frac{3n\epsilon^2}{4B[6A + \epsilon]} \Big) + 2\, N_{[\,]}(\epsilon/8, \mathcal{F}, L_1(P)) \exp\Big( -\frac{3n\epsilon}{40B} \Big).$$
Hence, if $\epsilon \le 2A/3$,
$$P^n\Big( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon \Big) \le 4\, N_{[\,]}(\epsilon/8, \mathcal{F}, L_1(P)) \exp\Big( -\frac{3n\epsilon^2}{28AB} \Big). \qquad (7.87)$$

Proof. (This proof follows Yukich (1985).) For notational simplicity, write $N(\epsilon) \equiv N_{[\,]}(\epsilon, \mathcal{F}, L_1(P))$. Define $z_n(f) = \int f\, (dP_n - dP)$. Let $[\ell_1, u_1], \dots, [\ell_N, u_N]$
be a minimal $\epsilon/8$ bracketing. We may assume that for each $j$, $\|u_j\|_\infty \le B$ and $\|\ell_j\|_\infty \le B$.
(Otherwise, we simply truncate the brackets.) For each $j$, choose some $f_j \in [\ell_j, u_j]$.
Consider any $f \in \mathcal{F}$ and let $[\ell_j, u_j]$ denote a bracket containing $f$. Then
$$|z_n(f)| \le |z_n(f_j)| + |z_n(f - f_j)|.$$
Furthermore,
$$|z_n(f - f_j)| = \Big| \int (f - f_j)(dP_n - dP) \Big| \le \int |f - f_j|\, (dP_n + dP) \le \int |u_j - \ell_j|\, (dP_n + dP)$$
$$= \int |u_j - \ell_j|\, (dP_n - dP) + 2\int |u_j - \ell_j|\, dP \le z_n(|u_j - \ell_j|) + 2\Big( \frac{\epsilon}{8} \Big) = z_n(|u_j - \ell_j|) + \frac{\epsilon}{4}.$$
Hence,
$$|z_n(f)| \le |z_n(f_j)| + z_n(|u_j - \ell_j|) + \frac{\epsilon}{4}.$$
Thus,
$$P^n\Big( \sup_{f \in \mathcal{F}} |z_n(f)| > \epsilon \Big) \le P^n\Big( \max_j |z_n(f_j)| > \epsilon/2 \Big) + P^n\Big( \max_j |z_n(|u_j - \ell_j|)| + \epsilon/4 > \epsilon/2 \Big)$$
$$\le P^n\Big( \max_j |z_n(f_j)| > \epsilon/2 \Big) + P^n\Big( \max_j |z_n(|u_j - \ell_j|)| > \epsilon/4 \Big).$$
Now
$$\mathrm{Var}(f_j) \le \int f_j^2\, dP = \int |f_j|\, |f_j|\, dP \le \|f_j\|_\infty \int |f_j|\, dP \le AB.$$
Hence, by Bernstein's inequality,
$$P^n\Big( \max_j |z_n(f_j)| > \epsilon/2 \Big) \le \sum_{j=1}^N 2 \exp\Big( -\frac{1}{2}\, \frac{n(\epsilon/2)^2}{AB + B\epsilon/6} \Big) \le 2\, N(\epsilon/8) \exp\Big( -\frac{3}{4}\, \frac{n\epsilon^2}{B(6A + \epsilon)} \Big).$$
Similarly,
$$\mathrm{Var}(|u_j - \ell_j|) \le \int (u_j - \ell_j)^2\, dP \le \int |u_j - \ell_j|\, |u_j - \ell_j|\, dP \le \|u_j - \ell_j\|_\infty \int |u_j - \ell_j|\, dP \le 2B\, \frac{\epsilon}{8} = \frac{B\epsilon}{4}.$$
Also, $\|u_j - \ell_j\|_\infty \le 2B$. Hence, by Bernstein's inequality,
$$P^n\Big( \max_j z_n(|u_j - \ell_j|) > \epsilon/4 \Big) \le \sum_{j=1}^N 2 \exp\Big( -\frac{1}{2}\, \frac{n(\epsilon/4)^2}{B\epsilon/4 + 2B(\epsilon/4)/3} \Big) \le 2\, N(\epsilon/8) \exp\Big( -\frac{3n\epsilon}{40B} \Big). \qquad \square$$

The following result is from Example 19.7 of van der Vaart (1998).

7.88 Lemma. Let $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ where $\Theta$ is a bounded subset of $\mathbb{R}^d$. Suppose there
exists a function $m$ such that, for every $\theta_1, \theta_2$,
$$|f_{\theta_1}(x) - f_{\theta_2}(x)| \le m(x)\, \|\theta_1 - \theta_2\|.$$
Then
$$N_{[\,]}(\epsilon, \mathcal{F}, L_q(P)) \le \left( \frac{4\sqrt{d}\, \mathrm{diam}(\Theta)\, \big( \int |m(x)|^q\, dP(x) \big)^{1/q}}{\epsilon} \right)^d.$$

Proof. Let $\|m\|_q = \big( \int |m(x)|^q\, dP(x) \big)^{1/q}$ and let
$$\delta = \frac{\epsilon}{4\sqrt{d}\, \|m\|_q}.$$
We can cover $\Theta$ with (at most) $N = (\mathrm{diam}(\Theta)/\delta)^d$ cubes $C_1, \dots, C_N$ of size $\delta$. Let
$c_1, \dots, c_N$ denote the centers of the cubes. Note that $C_j \subset B(c_j, \sqrt{d}\,\delta)$ where $B(x, r)$
denotes a ball of radius $r$ centered at $x$. Hence, $\bigcup_j B(c_j, \sqrt{d}\,\delta)$ covers $\Theta$. Let $\theta_j$ be the
projection of $c_j$ onto $\Theta$. Then $\bigcup_j B(\theta_j, 2\sqrt{d}\,\delta)$ covers $\Theta$. In summary, for every $\theta \in \Theta$
there is a $\theta_j \in \{\theta_1, \dots, \theta_N\}$ such that
$$\|\theta - \theta_j\| \le 2\sqrt{d}\,\delta = \frac{\epsilon}{2\|m\|_q}.$$
Define $\ell_j = f_{\theta_j} - \epsilon\, m(x)/(2\|m\|_q)$ and $u_j = f_{\theta_j} + \epsilon\, m(x)/(2\|m\|_q)$. We claim that the brackets
$[\ell_1, u_1], \dots, [\ell_N, u_N]$ cover $\mathcal{F}$. To see this, choose any $f_\theta \in \mathcal{F}$. Let $\theta_j$ be the closest
element of $\{\theta_1, \dots, \theta_N\}$ to $\theta$. Then
$$f_\theta(x) = f_{\theta_j}(x) + f_\theta(x) - f_{\theta_j}(x) \le f_{\theta_j}(x) + |f_\theta(x) - f_{\theta_j}(x)| \le f_{\theta_j}(x) + m(x)\|\theta - \theta_j\| \le f_{\theta_j}(x) + \frac{m(x)\,\epsilon}{2\|m\|_q} = u_j(x).$$
By a similar argument, $f_\theta(x) \ge \ell_j(x)$. Also, $\int (u_j - \ell_j)^q\, dP \le \epsilon^q$. Finally, note that the
number of brackets is
$$N = \left( \frac{\mathrm{diam}(\Theta)}{\delta} \right)^d = \left( \frac{4\sqrt{d}\, \mathrm{diam}(\Theta)\, \|m\|_q}{\epsilon} \right)^d. \qquad \square$$


7.89 Example (Density Estimation). Let $X_1, \dots, X_n \sim P$ where $P$ has support on a
compact set $\mathcal{X} \subset \mathbb{R}^d$. Consider the kernel density estimator
$$\hat{p}_h(x) = \frac{1}{n h^d} \sum_{i=1}^n K\big( \|x - X_i\|/h \big)$$
where $K$ is a smooth symmetric function and $h > 0$ is a bandwidth. We study $\hat{p}_h$
in detail in the chapter on nonparametric density estimation. Here we bound the sup norm
distance between $\hat{p}_h(x)$ and its mean $p_h(x) = E(\hat{p}_h(x))$.

7.90 Theorem. Suppose that $K(x) \le K(0)$ for all $x$ and that
$$|K(y) - K(x)| \le L\|x - y\|$$
for all $x, y$. Then
$$P^n\Big( \sup_x |\hat{p}_h(x) - p_h(x)| > \epsilon \Big) \le 2\left( \frac{32 L \sqrt{d}\, \mathrm{diam}(\mathcal{X})}{h^{d+1}\epsilon} \right)^d \left[ \exp\Big( -\frac{3n\epsilon^2 h^d}{4K(0)(6 + \epsilon)} \Big) + \exp\Big( -\frac{3n\epsilon h^d}{40K(0)} \Big) \right].$$
Hence, if $\epsilon \le 2/3$ then
$$P^n\Big( \sup_x |\hat{p}_h(x) - p_h(x)| > \epsilon \Big) \le 4\left( \frac{32 L \sqrt{d}\, \mathrm{diam}(\mathcal{X})}{h^{d+1}\epsilon} \right)^d \exp\Big( -\frac{3n\epsilon^2 h^d}{28 K(0)} \Big).$$

Proof. Let $\mathcal{F} = \{f_x : x \in \mathcal{X}\}$ where $f_x(u) = h^{-d} K(\|x - u\|/h)$. We apply Theorem 7.86
with $A = 1$ and $B = K(0)/h^d$. We need to bound $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P))$. Now
$$|f_x(u) - f_y(u)| = \frac{1}{h^d} \Big| K\Big( \frac{\|x - u\|}{h} \Big) - K\Big( \frac{\|y - u\|}{h} \Big) \Big| \le \frac{L}{h^{d+1}} \Big| \|x - u\| - \|y - u\| \Big| \le \frac{L}{h^{d+1}} \|x - y\|.$$
Apply Lemma 7.88 with $m(x) = L/h^{d+1}$. This implies that
$$N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) \le \left( \frac{4L\sqrt{d}\, \mathrm{diam}(\mathcal{X})}{h^{d+1}\epsilon} \right)^d.$$
Hence, Theorem 7.86 yields the stated bound. $\square$

7.91 Corollary. Suppose that $h = h_n = (C_n/n)^{\xi}$ where $\xi \le 1/d$ and $C_n = (\log n)^a$ for
some $a \ge 0$. Then
$$P^n\Big( \sup_x |\hat{p}_h(x) - p_h(x)| > \epsilon \Big) \le 4\left( \frac{32 L \sqrt{d}\, \mathrm{diam}(\mathcal{X})}{\epsilon} \right)^d \left( \frac{n}{C_n} \right)^{\xi(d+1)} \exp\Big( -\frac{3\epsilon^2 C_n^{\xi d}\, n^{1 - d\xi}}{28 K(0)} \Big).$$
Hence, for sufficiently large $n$,
$$P^n\Big( \sup_x |\hat{p}_h(x) - p_h(x)| > \epsilon \Big) \le c_1 \exp\big( -c_2\, \epsilon^2 C_n^{\xi d}\, n^{1 - d\xi} \big).$$

Note that the proofs of the last two results did not depend on $P$. Hence, if $\mathcal{P}$ is the set
of distributions with support on $\mathcal{X}$, we have that
$$\sup_{P \in \mathcal{P}} P^n\Big( \sup_x |\hat{p}_h(x) - p_h(x)| > \epsilon \Big) \le c_1 \exp\big( -c_2\, \epsilon^2 C_n^{\xi d}\, n^{1 - d\xi} \big).$$
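The sup-norm concentration of $\hat{p}_h$ around its mean $p_h$ can also be checked by simulation. The sketch below (not from the original text) assumes NumPy, a Gaussian kernel, and a Beta(2,5) sampling distribution (both arbitrary choices); $p_h$ is approximated by averaging the estimator over many independent samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def kde(x_grid, X, h):
    # Gaussian-kernel density estimate evaluated on a grid (d = 1)
    u = (x_grid[:, None] - X[None, :]) / h
    return np.exp(-u ** 2 / 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def sample(n):
    return rng.beta(2, 5, n)          # any compactly supported P would do here

grid = np.linspace(0, 1, 200)
n, h, reps = 2000, 0.1, 200

p_h = np.mean([kde(grid, sample(n), h) for _ in range(reps)], axis=0)   # approximates p_h = E[p_hat_h]
sup_dev = np.max(np.abs(kde(grid, sample(n), h) - p_h))
print("sup_x |p_hat_h(x) - p_h(x)| ~", sup_dev)
```

Repeating the last two lines for larger $n$ (with $h$ fixed) shows the sup-norm deviation shrinking at the rate the bound suggests.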

7.92 Example. Here are some further examples. In Exercise 12 you are asked to prove
these results.

1. Let $\mathcal{F}$ be the set of cdf's on $\mathbb{R}$. Then $N_{[\,]}(\epsilon, \mathcal{F}, L_2(P)) \le 2/\epsilon^2$.

2. (Sobolev Spaces.) Let $\mathcal{F}$ be the set of functions $f$ on $[0,1]$ such that $\|f\|_\infty \le 1$, the $(k-1)$st
derivative is absolutely continuous and $\int (f^{(k)}(x))^2\, dx \le 1$. Then, there is a constant
$C > 0$ such that
$$N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) \le \exp\left[ C\Big( \frac{1}{\epsilon} \Big)^{1/k} \right].$$

3. Let $\mathcal{F}$ be the set of monotone functions $f$ on $\mathbb{R}$ such that $\|f\|_\infty \le 1$. Then, there is a
constant $C > 0$ such that
$$N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) \le \exp\left[ C\Big( \frac{1}{\epsilon} \Big) \right].$$

7.4 Additional results


7.4.1 Talagrand’s Inequality

One of the most important developments in concentration of measure is Talagrand's inequality
(Talagrand 1994, 1996), which can be thought of as a uniform version of Bernstein's
inequality. Let $\mathcal{F}$ be a class of functions and define $Z_n = \sup_{f \in \mathcal{F}} |P_n(f)|$.

7.93 Theorem. Let $v \ge E\big( \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n f^2(X_i) \big)$ and $U \ge \sup_{f \in \mathcal{F}} \|f\|_\infty$. There exists a
universal constant $K$ such that
$$P\Big( \sup_{f \in \mathcal{F}} |P_n(f)| - E\Big( \sup_{f \in \mathcal{F}} |P_n(f)| \Big) > t \Big) \le K \exp\Big\{ -\frac{nt}{KU} \log\Big( 1 + \frac{tU}{v} \Big) \Big\}. \qquad (7.94)$$

To make use of Talagrand's inequality, we need to estimate $E\big( \sup_{f \in \mathcal{F}} |P_n(f)| \big)$.

7.95 Theorem (Giné and Guillou, 2001). Suppose that there exist $A$ and $d$ such that
$$\sup_P N_2(\epsilon a, \mathcal{F}, P) \le \Big( \frac{A}{\epsilon} \Big)^d$$
where $a = \|F\|_{L_2(P)}$ and $F(x) = \sup_{f \in \mathcal{F}} |f(x)|$. Then
$$E\Big( \sup_{f \in \mathcal{F}} |P_n(f)| \Big) \le C\left( dU \log\Big( \frac{AU}{\sigma} \Big) + \sqrt{d\, n\, \sigma^2 \log\Big( \frac{AU}{\sigma} \Big)} \right)$$
for some $C > 0$.

Combining these results gives Giné and Guillou's version of Talagrand's inequality:

7.96 Theorem. Let $v \ge E\big( \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n f^2(X_i) \big)$ and $U \ge \sup_{f \in \mathcal{F}} \|f\|_\infty$. There exists a
universal constant $K$ such that
$$P\Big( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > t \Big) \le K \exp\left\{ -\frac{nt}{KU} \log\left( 1 + \frac{tU}{K\big( \sqrt{n}\,\sigma + U\sqrt{\log(AU/\sigma)} \big)^2} \right) \right\} \qquad (7.97)$$
whenever
$$t \ge \frac{C}{n}\left( U \log\Big( \frac{AU}{\sigma} \Big) + \sqrt{n\, \sigma^2 \log\Big( \frac{AU}{\sigma} \Big)} \right).$$

7.98 Example (Density Estimation). Giné and Guillou (2002) apply Talagrand's inequality
to get bounds on density estimators. Let $X_1, \dots, X_n \sim P$ where $X_i \in \mathbb{R}^d$ and suppose that
$P$ has density $p$. The kernel density estimator of $p$ with bandwidth $h$ is
$$\hat{p}_h(x) = \frac{1}{n h^d} \sum_{i=1}^n K\Big( \frac{\|x - X_i\|}{h} \Big).$$
Applying the results above to $\hat{p}_h(x)$ we see that (under very weak conditions on $K$) for all
small $\epsilon$ and large $n$,
$$P\Big( \sup_{x \in \mathbb{R}^d} |\hat{p}_h(x) - p_h(x)| > \epsilon \Big) \le c_1 e^{-c_2 n h^d \epsilon^2} \qquad (7.99)$$
where $p_h(x) = E(\hat{p}_h(x))$ and $c_1, c_2$ are positive constants. This agrees with the earlier
result, Theorem 7.90. $\square$

7.4.2 A Bound on Expected Values

Now we consider bounding the expected value of the maximum of an infinite set of random
variables. Let $\{X_f : f \in \mathcal{F}\}$ be a collection of mean $0$ random variables indexed by $f \in \mathcal{F}$
and let $d$ be a metric on $\mathcal{F}$. Let $N(\mathcal{F}, r)$ be the covering number of $\mathcal{F}$, that is, the smallest
number of balls of radius $r$ required to cover $\mathcal{F}$. Say that $\{X_f : f \in \mathcal{F}\}$ is sub-Gaussian
if, for every $t > 0$ and every $f, g \in \mathcal{F}$,
$$E\big( e^{t(X_f - X_g)} \big) \le e^{t^2 d^2(f,g)/2}.$$
We say that $\{X_f : f \in \mathcal{F}\}$ is sample continuous if, for every sequence $f_1, f_2, \dots \in \mathcal{F}$ such
that $d(f_i, f) \to 0$ for some $f \in \mathcal{F}$, we have that $X_{f_i} \to X_f$ a.s. The following theorem is
from Cesa-Bianchi and Lugosi (2006) and is a variation of a theorem due to Dudley (1978).

7.100 Theorem. Suppose that $\{X_f : f \in \mathcal{F}\}$ is sub-Gaussian and sample continuous.
Then
$$E\Big( \sup_{f \in \mathcal{F}} X_f \Big) \le 12 \int_0^{D/2} \sqrt{\log N(\mathcal{F}, \epsilon)}\, d\epsilon \qquad (7.101)$$
where $D = \sup_{f,g \in \mathcal{F}} d(f, g)$.

Proof. The proof uses Dudley's chaining technique. We follow the version in Theorem
8.3 of Cesa-Bianchi and Lugosi (2006). Let $\mathcal{F}_k$ be a minimal cover of $\mathcal{F}$ of radius $D2^{-k}$.
Thus $|\mathcal{F}_k| = N(\mathcal{F}, D2^{-k})$. Let $f_0$ denote the unique element in $\mathcal{F}_0$. Each $X_f$ is a random
variable and hence is a mapping from some sample space $\mathcal{S}$ to the reals. Fix $s \in \mathcal{S}$ and let
$f_*$ be such that $\sup_{f \in \mathcal{F}} X_f(s) = X_{f_*}(s)$. (If an exact maximizer does not exist, we can
choose an approximate maximizer, but we shall assume an exact maximizer.) Let $f_k \in \mathcal{F}_k$
minimize the distance to $f_*$. Hence,
$$d(f_{k-1}, f_k) \le d(f_*, f_k) + d(f_*, f_{k-1}) \le 3D 2^{-k}.$$
Now $\lim_{k \to \infty} f_k = f_*$ and by sample continuity
$$\sup_f X_f(s) = X_{f_*}(s) = X_{f_0}(s) + \sum_{k=1}^\infty \big( X_{f_k}(s) - X_{f_{k-1}}(s) \big).$$
Recall that $E(X_{f_0}) = 0$. Therefore,
$$E\Big( \sup_f X_f \Big) \le \sum_{k=1}^\infty E\Big( \max_{f,g} (X_f - X_g) \Big)$$
where the max is over all $f \in \mathcal{F}_k$ and $g \in \mathcal{F}_{k-1}$ such that $d(f, g) \le 3D2^{-k}$. There are at
most $N(\mathcal{F}, D2^{-k})^2$ such pairs. By Theorem 7.47,
$$E\Big( \max_{f,g} (X_f - X_g) \Big) \le 3D2^{-k} \sqrt{2 \log N(\mathcal{F}, D2^{-k})^2}.$$
Summing over $k$ we have
$$E\Big( \sup_f X_f \Big) \le \sum_{k=1}^\infty 3D2^{-k} \sqrt{2 \log N(\mathcal{F}, D2^{-k})^2} = 12 \sum_{k=1}^\infty D2^{-(k+1)} \sqrt{\log N(\mathcal{F}, D2^{-k})} \le 12 \int_0^{D/2} \sqrt{\log N(\mathcal{F}, \epsilon)}\, d\epsilon. \qquad \square$$

7.102 Example. Let $Y_1, \dots, Y_n$ be a sample from a continuous cdf $F$ on $[0,1]$ with bounded
density. Let $X_s = \sqrt{n}\,(F_n(s) - F(s))$ where $F_n$ is the empirical distribution function. The
collection $\{X_s : s \in [0,1]\}$ can be shown to be sub-Gaussian and sample continuous
with respect to the Euclidean metric on $[0,1]$. The covering number is $N([0,1], r) = 1/r$.
Hence,
$$E\Big( \sup_{0 \le s \le 1} \sqrt{n}\,(F_n(s) - F(s)) \Big) \le 12 \int_0^{1/2} \sqrt{\log(1/\epsilon)}\, d\epsilon \le C$$
for some $C > 0$. Hence,
$$E\Big( \sup_{0 \le s \le 1} (F_n(s) - F(s)) \Big) \le \frac{C}{\sqrt{n}}. \qquad \square$$
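A quick numerical check of Example 7.102 (a sketch, not from the original text; it assumes NumPy): for uniform data, the scaled statistic $\sqrt{n}\,\sup_s |F_n(s) - F(s)|$ has an expected value that stays bounded as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 500
for n in [100, 1000, 10000]:
    sups = []
    for _ in range(reps):
        Y = np.sort(rng.uniform(0, 1, n))        # F is the Uniform(0,1) cdf, so F(Y_(i)) = Y_(i)
        i = np.arange(1, n + 1)
        # sup_s |F_n(s) - F(s)| for the empirical cdf of the sorted sample
        sups.append(np.max(np.maximum(i / n - Y, Y - (i - 1) / n)))
    print(n, np.sqrt(n) * np.mean(sups))         # roughly constant in n, as the bound predicts
```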

7.5 Summary
The most important results in this chapter are Hoeffding's inequality,
$$P(|\bar{X}_n - \mu| > \epsilon) \le 2 e^{-2n\epsilon^2/c},$$
Bernstein's inequality,
$$P(|\bar{X}_n - \mu| > \epsilon) \le 2 \exp\Big\{ -\frac{n\epsilon^2}{2\sigma^2 + 2c\epsilon/3} \Big\},$$
the Vapnik-Chervonenkis bound,
$$P\Big( \sup_{f \in \mathcal{F}} |(P_n - P)f| > t \Big) \le 4\, s(\mathcal{F}, 2n)\, e^{-nt^2/8},$$
and the Rademacher bound: with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2\, \mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{1}{2n} \log\Big( \frac{2}{\delta} \Big)}.$$

These, and similar results, provide the theoretical basis for many statistical machine learning
methods. The literature contains many refinements and extensions of these results.

7.6 Bibliographic Remarks


Concentration of measure is a vast and still growing area. Some good references are Devroye,
Gyorfi and Lugosi (1996), van der Vaart and Wellner (1996), Chapter 19 of van der
Vaart (1998), Dubhashi and Panconesi (2009), and Ledoux (2005).

Exercises

7.1 Suppose that $X \ge 0$ and $E(X) < \infty$. Show that $E(X) = \int_0^\infty P(X \ge t)\, dt$.

7.2 Show that $h(u) \ge u^2/(2 + 2u/3)$ for $u \ge 0$ where $h(u) = (1+u)\log(1+u) - u$.

7.3 In the proof of McDiarmid's inequality, verify that $E(V_i \mid Z_1, \dots, Z_{i-1}) = 0$.

7.4 Prove Lemma 7.37.

7.5 Prove equation (7.24).

7.6 Prove the results in Table 7.1.

7.7 Derive Hoeffding's inequality from McDiarmid's inequality.

7.8 Prove Lemma 7.70.

7.9 Consider Example 7.102. Show that $\{X_s : s \in [0,1]\}$ is sub-Gaussian. Show that $\int_0^{1/2} \sqrt{\log(1/\epsilon)}\, d\epsilon \le C$ for some $C > 0$.

7.10 Prove Theorem 7.52.

7.11 Prove Theorem 7.84.

7.12 Prove the results in Example 7.92.
