Chapter 7
Concentration of Measure
Often we want to show that some random quantity is close to its mean with high probability. Results of this kind are known as concentration inequalities. In this chapter we consider some important concentration results such as Hoeffding's inequality, Bernstein's inequality and McDiarmid's inequality. Then we consider uniform bounds that guarantee that a set of random quantities are simultaneously close to their means with high probability.
7.1 Introduction
Often we need to show that a random quantity is close to its mean. For example, later we will prove Hoeffding's inequality which implies that, if $Z_1, \ldots, Z_n$ are Bernoulli random variables with mean $\mu$, then

    $\mathbb{P}(|\bar{Z} - \mu| > \epsilon) \le 2 e^{-2n\epsilon^2}$

where $\bar{Z} = n^{-1} \sum_{i=1}^n Z_i$.
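As a quick numerical sanity check (not part of the text; the parameters below are arbitrary), we can compare the empirical deviation probability for Bernoulli samples with the Hoeffding bound:

```python
import numpy as np

# Monte Carlo check of Hoeffding's inequality for Bernoulli(mu) variables.
# mu, n, eps and the trial count are illustrative choices.
rng = np.random.default_rng(0)
mu, n, eps, trials = 0.3, 500, 0.05, 20000

Z = rng.binomial(1, mu, size=(trials, n))
Zbar = Z.mean(axis=1)
empirical = np.mean(np.abs(Zbar - mu) > eps)   # estimate of P(|Zbar - mu| > eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)        # the bound 2 exp(-2 n eps^2)
print(f"empirical: {empirical:.4f}   Hoeffding bound: {hoeffding:.4f}")
```

The empirical frequency is well below the bound, as it must be.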
More generally, we want results of the form

    $\mathbb{P}\Big( |f(Z_1, \ldots, Z_n) - \mu_n(f)| > \epsilon \Big) < \delta_n$   (7.1)

and

    $\mathbb{P}\Big( \sup_{f \in \mathcal{F}} |f(Z_1, \ldots, Z_n) - \mu_n(f)| > \epsilon \Big) < \delta_n$   (7.2)

where $\mu_n(f) = \mathbb{E}(f(Z_1, \ldots, Z_n))$, $\mathcal{F}$ is a class of functions, and $\delta_n$ is small.
7.3 Example. To motivate the need for such results, consider empirical risk minimization in classification. Suppose we have data $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $Y_i \in \{0,1\}$ and $X_i \in \mathbb{R}^d$. Let $h : \mathbb{R}^d \to \{0,1\}$ be a classifier. The training error is

    $\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^n I(Y_i \neq h(X_i)).$

We would like to know if $\hat{R}_n(h)$ is close to the true error $R(h) = \mathbb{P}(Y \neq h(X))$ with high probability. This is precisely of the form (7.1) with $Z_i = (X_i, Y_i)$ and $f(Z_1, \ldots, Z_n) = \frac{1}{n}\sum_{i=1}^n I(Y_i \neq h(X_i))$.
Now let $\mathcal{H}$ be a set of classifiers. Let $\hat{h}$ minimize the training error $\hat{R}_n(h)$ over $\mathcal{H}$ and let $h^*$ minimize the true error $R(h)$ over $\mathcal{H}$. Can we guarantee that the risk $R(\hat{h})$ of the selected classifier is close to the risk $R(h^*)$ of the best classifier? Let $E$ denote the event that $\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| \le \epsilon$. When the event $E$ holds, we have that

    $R(h^*) \le R(\hat{h}) \le \hat{R}_n(\hat{h}) + \epsilon \le \hat{R}_n(h^*) + \epsilon \le R(h^*) + 2\epsilon.$

Thus, bounding the probability of $E^c$, which is of the form (7.2), guarantees that $R(\hat{h})$ is within $2\epsilon$ of $R(h^*)$.
Besides classification, concentration inequalities are used for studying many other meth-
ods such as clustering, random projections and density estimation.
7.2 Basic Inequalities
Notation. Given $Z_1, \ldots, Z_n$, let $P_n$ denote the empirical measure that puts mass $1/n$ at each data point:

    $P_n(A) = \frac{1}{n} \sum_{i=1}^n I(Z_i \in A).$
We begin with Markov's inequality. Suppose that $Z$ has a finite mean and that $\mathbb{P}(Z \ge 0) = 1$. Then, for any $\epsilon > 0$,

    $\mathbb{E}(Z) = \int_0^\infty z \, dP(z) \ge \int_\epsilon^\infty z \, dP(z) \ge \epsilon \int_\epsilon^\infty dP(z) = \epsilon\, \mathbb{P}(Z \ge \epsilon)$   (7.4)

so that $\mathbb{P}(Z \ge \epsilon) \le \mathbb{E}(Z)/\epsilon$.
Applying this to $(\bar{Z}_n - \mu)^2$, where $\bar{Z}_n$ is the mean of $n$ independent random variables each with mean $\mu$ and variance $\sigma^2$, yields Chebyshev's inequality

    $\mathbb{P}(|\bar{Z}_n - \mu| > \epsilon) \le \frac{\sigma^2}{n\epsilon^2}.$   (7.7)
While this inequality is useful, it does not decay exponentially fast as $n$ increases. To improve the inequality, we use Chernoff's method: for any $t > 0$, $\mathbb{P}(Z > \epsilon) = \mathbb{P}(e^{tZ} > e^{t\epsilon}) \le e^{-t\epsilon}\, \mathbb{E}(e^{tZ})$ and, since $t > 0$ is arbitrary,

    $\mathbb{P}(Z > \epsilon) \le \inf_{t \ge 0} e^{-t\epsilon}\, \mathbb{E}(e^{tZ}).$   (7.9)
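The infimum in (7.9) can be computed numerically. As an illustration (a sketch, with arbitrary parameters), here is Chernoff's method for a Bernoulli sample mean, with the infimum taken by grid search over $t$:

```python
import numpy as np

# Chernoff's method for P(Zbar - mu >= eps) with Bernoulli(mu) variables:
# bound = inf_t exp(-t eps) E[exp(t (Zbar - mu))], by grid search over t.
mu, n, eps = 0.3, 500, 0.05
t = np.linspace(0.01, 400.0, 8000)
# E exp(t (Zbar - mu)) = (E exp((t/n)(Z_1 - mu)))^n for iid Z_i
mgf = ((1 - mu) * np.exp(-(t / n) * mu) + mu * np.exp((t / n) * (1 - mu))) ** n
chernoff = np.min(np.exp(-t * eps) * mgf)
print(f"Chernoff bound:      {chernoff:.4f}")
print(f"one-sided Hoeffding: {np.exp(-2 * n * eps**2):.4f}")
```

Optimizing over $t$ gives a bound at least as sharp as the one from any fixed $t$; here it improves on the Hoeffding-type bound.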
To use the above result we need to bound the moment generating function $\mathbb{E}(e^{tZ})$.
7.10 Lemma. Let $Z$ be a mean $\mu$ random variable such that $a \le Z \le b$. Then, for any $t$,

    $\mathbb{E}(e^{tZ}) \le e^{t\mu + t^2(b-a)^2/8}.$   (7.11)

Proof. For simplicity, assume that $\mu = 0$. Since $e^{tz}$ is convex and $a \le Z \le b$, writing $Z = \alpha b + (1-\alpha)a$ with $\alpha = (Z-a)/(b-a)$ gives

    $e^{tZ} \le \alpha e^{tb} + (1-\alpha)e^{ta} = \frac{Z-a}{b-a}\, e^{tb} + \frac{b-Z}{b-a}\, e^{ta}.$
Take expectations of both sides and use the fact that $\mathbb{E}(Z) = 0$ to get

    $\mathbb{E}(e^{tZ}) \le \frac{-a}{b-a}\, e^{tb} + \frac{b}{b-a}\, e^{ta} = e^{g(u)}$   (7.12)
where $u = t(b-a)$, $g(u) = -\gamma u + \log(1 - \gamma + \gamma e^u)$ and $\gamma = -a/(b-a)$. Note that $g(0) = g'(0) = 0$. Also, $g''(u) \le 1/4$ for all $u > 0$. By Taylor's theorem, there is a $\xi \in (0, u)$ such that

    $g(u) = g(0) + u g'(0) + \frac{u^2}{2} g''(\xi) = \frac{u^2}{2} g''(\xi) \le \frac{u^2}{8} = \frac{t^2(b-a)^2}{8}.$
Hence, $\mathbb{E}(e^{tZ}) \le e^{g(u)} \le e^{t^2(b-a)^2/8}$.
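A small numerical check of the lemma (not in the text; the two-point distribution is an arbitrary choice):

```python
import numpy as np

# Verify E[e^{tZ}] <= exp(t^2 (b-a)^2 / 8) for a centered two-point variable
# Z in {lo - mean, hi - mean} on a grid of t values.  Parameters illustrative.
p, lo, hi = 0.25, -1.0, 2.0
mean = p * hi + (1 - p) * lo
a, b = lo - mean, hi - mean          # support of the centered variable
t = np.linspace(-5, 5, 1001)
mgf = p * np.exp(t * (hi - mean)) + (1 - p) * np.exp(t * (lo - mean))
bound = np.exp(t**2 * (b - a) ** 2 / 8)
assert np.all(mgf <= bound + 1e-12)
print("E[e^{tZ}] <= exp(t^2 (b-a)^2 / 8) holds on the grid")
```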
We can now prove Hoeffding's inequality.

Theorem (Hoeffding). Let $Z_1, \ldots, Z_n$ be independent random variables with mean $\mu$ such that $\mathbb{P}(a_i \le Z_i \le b_i) = 1$ for $i = 1, \ldots, n$. Then, for any $\epsilon > 0$,

    $\mathbb{P}(|\bar{Z}_n - \mu| > \epsilon) \le 2 e^{-2n\epsilon^2/c}$

where $c = n^{-1}\sum_{i=1}^n (b_i - a_i)^2$ and $\bar{Z}_n = n^{-1}\sum_{i=1}^n Z_i$.
Proof. For simplicity assume that $\mathbb{E}(Z_i) = 0$. Now we use the Chernoff method. For any $t > 0$, we have, from Markov's inequality, that

    $\mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n Z_i \ge \epsilon \right) = \mathbb{P}\left( \frac{t}{n}\sum_{i=1}^n Z_i \ge t\epsilon \right) = \mathbb{P}\left( e^{(t/n)\sum_{i=1}^n Z_i} \ge e^{t\epsilon} \right)$
    $\le e^{-t\epsilon}\, \mathbb{E}\left( e^{(t/n)\sum_{i=1}^n Z_i} \right) = e^{-t\epsilon} \prod_i \mathbb{E}\left( e^{(t/n)Z_i} \right)$   (7.15)
    $\le e^{-t\epsilon}\, e^{(t^2/n^2)\sum_{i=1}^n (b_i - a_i)^2/8}$   (7.16)

where the last inequality follows from Lemma 7.10. Now we minimize the right hand side over $t$. In particular, we set $t = 4\epsilon n^2/\sum_{i=1}^n (b_i - a_i)^2$ and get $\mathbb{P}(\bar{Z}_n \ge \epsilon) \le e^{-2n\epsilon^2/c}$. By a similar argument, $\mathbb{P}(\bar{Z}_n \le -\epsilon) \le e^{-2n\epsilon^2/c}$ and the result follows.
It follows that, with probability at least $1 - \delta$,

    $|\bar{Z}_n - \mu| \le \sqrt{\frac{c}{2n} \log\left( \frac{2}{\delta} \right)}$   (7.18)

where $c = n^{-1}\sum_{i=1}^n (b_i - a_i)^2$.
The following result extends Hoeffding's inequality to more general functions $f(z_1, \ldots, z_n)$.
Theorem (McDiarmid). Let $Z_1, \ldots, Z_n$ be independent random variables. Suppose that

    $\sup_{z_1,\ldots,z_n,\,z_i'} \left| f(z_1, \ldots, z_{i-1}, z_i, z_{i+1}, \ldots, z_n) - f(z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_n) \right| \le c_i$   (7.22)

for $i = 1, \ldots, n$. Then

    $\mathbb{P}\left( \left| f(Z_1, \ldots, Z_n) - \mathbb{E}(f(Z_1, \ldots, Z_n)) \right| \ge \epsilon \right) \le 2\exp\left( -\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2} \right).$   (7.23)
Proof. Let $Y = f(Z_1, \ldots, Z_n)$. We will bound $\mathbb{P}(Y - \mathbb{E}(Y) \ge \epsilon)$; the bound on $\mathbb{P}(Y - \mathbb{E}(Y) \le -\epsilon)$ follows similarly. Let $V_i = \mathbb{E}(Y \mid Z_1, \ldots, Z_i) - \mathbb{E}(Y \mid Z_1, \ldots, Z_{i-1})$. Then

    $f(Z_1, \ldots, Z_n) - \mathbb{E}(f(Z_1, \ldots, Z_n)) = \sum_{i=1}^n V_i$

and, by Chernoff's method together with the fact that, conditionally, $V_i$ lies in an interval of length $c_i$ (so Lemma 7.10 applies),

    $\mathbb{P}\left( \sum_{i=1}^n V_i \ge \epsilon \right) \le e^{-t\epsilon}\, e^{t^2 c_n^2/8}\, \mathbb{E}\left( e^{t\sum_{i=1}^{n-1} V_i} \right) \le \cdots \le e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^n c_i^2/8}.$

The result follows by taking $t = 4\epsilon/\sum_{i=1}^n c_i^2$.
Remark. If $f(z_1, \ldots, z_n) = n^{-1}\sum_{i=1}^n z_i$ then we get back Hoeffding's inequality.
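To see the remark concretely: for the sample mean of $[a,b]$-valued variables, each $c_i = (b-a)/n$, so McDiarmid's bound equals Hoeffding's. A two-line check (values arbitrary):

```python
import numpy as np

# McDiarmid with f = sample mean of [a,b]-valued variables: c_i = (b-a)/n, so
# 2 exp(-2 eps^2 / sum_i c_i^2) = 2 exp(-2 n eps^2 / (b-a)^2), i.e. Hoeffding.
a, b, n, eps = 0.0, 1.0, 500, 0.05
c = np.full(n, (b - a) / n)
mcdiarmid = 2 * np.exp(-2 * eps**2 / np.sum(c**2))
hoeffding = 2 * np.exp(-2 * n * eps**2 / (b - a) ** 2)
print(mcdiarmid, hoeffding)   # identical
```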
7.25 Example. Let $X_1, \ldots, X_n \sim P$ and let $P_n(A) = n^{-1}\sum_{i=1}^n I(X_i \in A)$. Define $\Delta_n \equiv f(X_1, \ldots, X_n) = \sup_A |P_n(A) - P(A)|$. Changing one observation changes $f$ by at most $1/n$. Hence,

    $\mathbb{P}\left( |\Delta_n - \mathbb{E}(\Delta_n)| > \epsilon \right) \le 2 e^{-2n\epsilon^2}.$
Hoeffding's inequality does not use any information about the random variables except the fact that they are bounded. If the variance of $X_i$ is small, then we can get a sharper inequality from Bernstein's inequality. We begin with a preliminary result.
7.26 Lemma. Suppose that $|X| \le c$ and $\mathbb{E}(X) = 0$. For any $t > 0$,

    $\mathbb{E}(e^{tX}) \le \exp\left\{ t^2 \sigma^2 \left( \frac{e^{tc} - 1 - tc}{(tc)^2} \right) \right\}$   (7.27)

where $\sigma^2 = \mathrm{Var}(X)$.
Proof. Let $F = \sum_{r=2}^\infty \frac{t^{r-2}\, \mathbb{E}(X^r)}{r!\, \sigma^2}$. Then,

    $\mathbb{E}(e^{tX}) = \mathbb{E}\left( 1 + tX + \sum_{r=2}^\infty \frac{t^r X^r}{r!} \right) = 1 + t^2 \sigma^2 F \le e^{t^2 \sigma^2 F}.$   (7.28)

Since $|\mathbb{E}(X^r)| \le c^{r-2}\sigma^2$ for $r \ge 2$,

    $F \le \sum_{r=2}^\infty \frac{(tc)^{r-2}}{r!} = \frac{e^{tc} - 1 - tc}{(tc)^2},$

which gives (7.27).
7.30 Theorem (Bernstein). If $\mathbb{P}(|X_i| \le c) = 1$ and $\mathbb{E}(X_i) = \mu$ then, for any $\epsilon > 0$,

    $\mathbb{P}(|\bar{X}_n - \mu| > \epsilon) \le 2\exp\left\{ -\frac{n\epsilon^2}{2\sigma^2 + 2c\epsilon/3} \right\}$   (7.31)

where $\sigma^2 = n^{-1}\sum_{i=1}^n \mathrm{Var}(X_i)$.
Proof. For simplicity, assume that $\mu = 0$ and let $\sigma_i^2 = \mathbb{E}(X_i^2)$. Now,
    $\mathbb{P}(\bar{X}_n > \epsilon) = \mathbb{P}\left( \sum_{i=1}^n X_i > n\epsilon \right) = \mathbb{P}\left( e^{t\sum_{i=1}^n X_i} > e^{tn\epsilon} \right)$   (7.33)
    $\le e^{-tn\epsilon}\, \mathbb{E}\left( e^{t\sum_{i=1}^n X_i} \right) = e^{-tn\epsilon} \prod_{i=1}^n \mathbb{E}(e^{tX_i})$   (7.34)
    $\le e^{-tn\epsilon} \exp\left\{ n t^2 \sigma^2 \left( \frac{e^{tc} - 1 - tc}{(tc)^2} \right) \right\}$   (7.35)

where the last step uses Lemma 7.26 and $\sum_i \sigma_i^2 = n\sigma^2$.
Take $t = (1/c)\log(1 + \epsilon c/\sigma^2)$ to get

    $\mathbb{P}(\bar{X}_n > \epsilon) \le \exp\left\{ -\frac{n\sigma^2}{c^2}\, h\!\left( \frac{c\epsilon}{\sigma^2} \right) \right\}$   (7.36)

where $h(u) = (1+u)\log(1+u) - u$. The result follows by noting that $h(u) \ge u^2/(2 + 2u/3)$ for $u \ge 0$.
7.37 Lemma. Let $X_1, \ldots, X_n$ be iid and suppose that $|X_i| \le c$ and $\mathbb{E}(X_i) = \mu$. With probability at least $1 - \delta$,

    $|\bar{X}_n - \mu| \le \sqrt{\frac{2\sigma^2 \log(1/\delta)}{n}} + \frac{2c\log(1/\delta)}{3n}.$   (7.38)

In particular, if $\sigma^2 \le 2c^2\log(1/\delta)/(9n)$, then with probability at least $1 - \delta$,

    $|\bar{X}_n - \mu| \le \frac{C}{n}$   (7.39)

where $C = 4c\log(1/\delta)/3$.
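The point of Lemma 7.37 is that when $\sigma^2$ is much smaller than the range, the Bernstein width beats the Hoeffding width. A quick comparison (a sketch; $\sigma^2$, $c$, $n$, $\delta$ are arbitrary):

```python
import numpy as np

# Compare the Bernstein width (7.38) with the Hoeffding width from (7.18)
# (taking b_i - a_i = 2c) when the variance is small.  Values illustrative.
sigma2, c, n, delta = 0.01, 1.0, 1000, 0.05
bernstein = (np.sqrt(2 * sigma2 * np.log(1 / delta) / n)
             + 2 * c * np.log(1 / delta) / (3 * n))
hoeffding = np.sqrt(2 * c**2 * np.log(2 / delta) / n)
print(f"Bernstein width: {bernstein:.4f}   Hoeffding width: {hoeffding:.4f}")
```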
We also get a very specific inequality in the special case that $X$ is Gaussian: if $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ then, for any $\epsilon > 0$,

    $\mathbb{P}(|\bar{X}_n - \mu| > \epsilon) \le \exp\left( -\frac{n\epsilon^2}{2\sigma^2} \right).$   (7.41)

Proof. Let $Z \sim N(0,1)$ with density $\phi(x) = (2\pi)^{-1/2} e^{-x^2/2}$ and distribution function $\Phi$. Since $\bar{X}_n - \mu \sim N(0, \sigma^2/n)$,

    $\mathbb{P}(|\bar{X}_n - \mu| > \epsilon) = 2\left( 1 - \Phi\left( \frac{\sqrt{n}\,\epsilon}{\sigma} \right) \right) \le \exp\left( -\frac{n\epsilon^2}{2\sigma^2} \right)$

by the standard Gaussian tail bound $1 - \Phi(t) \le \frac{1}{2} e^{-t^2/2}$ for $t \ge 0$.
Suppose we have an exponential bound on $\mathbb{P}(X_n > \epsilon)$. In that case we can bound $\mathbb{E}(X_n)$ as follows. Suppose that $X_n \ge 0$ and that, for every $\epsilon > 0$,

    $\mathbb{P}(X_n > \epsilon) \le c_1 e^{-c_2 n \epsilon^2}$   (7.46)

for some $c_2 > 0$ and $c_1 > 1/e$. Then, $\mathbb{E}(X_n) \le \sqrt{C/n}$ where $C = (1 + \log(c_1))/c_2$.
Proof. Recall that for any nonnegative random variable $Y$, $\mathbb{E}(Y) = \int_0^\infty \mathbb{P}(Y \ge t)\, dt$. Hence, for any $a > 0$,

    $\mathbb{E}(X_n^2) = \int_0^\infty \mathbb{P}(X_n^2 \ge t)\, dt = \int_0^a \mathbb{P}(X_n^2 \ge t)\, dt + \int_a^\infty \mathbb{P}(X_n^2 \ge t)\, dt \le a + \int_a^\infty \mathbb{P}(X_n^2 \ge t)\, dt.$

Equation (7.46) implies that $\mathbb{P}(X_n > \sqrt{t}) \le c_1 e^{-c_2 n t}$. Hence,

    $\mathbb{E}(X_n^2) \le a + \int_a^\infty \mathbb{P}(X_n \ge \sqrt{t})\, dt \le a + c_1 \int_a^\infty e^{-c_2 n t}\, dt = a + \frac{c_1 e^{-c_2 n a}}{c_2 n}.$

Setting $a = \log(c_1)/(c_2 n)$ gives

    $\mathbb{E}(X_n^2) \le \frac{\log(c_1)}{n c_2} + \frac{1}{n c_2} = \frac{1 + \log(c_1)}{n c_2}.$

Finally, by the Cauchy–Schwarz inequality, $\mathbb{E}(X_n) \le \sqrt{\mathbb{E}(X_n^2)} \le \sqrt{\frac{1 + \log(c_1)}{n c_2}}$.
7.47 Theorem. Let $X_1, \ldots, X_n$ be random variables. Suppose there exists $\sigma > 0$ such that $\mathbb{E}(e^{tX_i}) \le e^{t^2\sigma^2/2}$ for all $t > 0$. Then

    $\mathbb{E}\left( \max_{1\le i\le n} X_i \right) \le \sigma\sqrt{2\log n}.$   (7.48)
Proof. By Jensen's inequality, for any $t > 0$,

    $\exp\left\{ t\, \mathbb{E}\left( \max_{1\le i\le n} X_i \right) \right\} \le \mathbb{E}\left( e^{t\max_i X_i} \right) = \mathbb{E}\left( \max_i e^{tX_i} \right) \le \sum_{i=1}^n \mathbb{E}(e^{tX_i}) \le n e^{t^2\sigma^2/2}.$

Thus, $\mathbb{E}\left( \max_{1\le i\le n} X_i \right) \le \frac{\log n}{t} + \frac{t\sigma^2}{2}$. The result follows by setting $t = \sqrt{2\log n}/\sigma$.
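For standard normals the bound is easy to check by simulation (illustrative sizes; not from the text):

```python
import numpy as np

# Theorem 7.47 with sigma = 1: E[max_i X_i] <= sqrt(2 log n) for standard
# normals (which satisfy E e^{tX} = e^{t^2/2}).
rng = np.random.default_rng(1)
for n in (10, 100, 1000):
    X = rng.standard_normal((5000, n))
    print(n, X.max(axis=1).mean().round(3), np.sqrt(2 * np.log(n)).round(3))
```

The $\sqrt{2\log n}$ rate is sharp up to lower-order terms.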
7.3 Uniform Bounds

Let $\mathcal{F}$ be a class of binary functions and let $z_1, \ldots, z_n$ be given points. Let $\mathcal{F}_{z_1,\ldots,z_n} = \{(f(z_1), \ldots, f(z_n)) : f \in \mathcal{F}\}$. Then $\mathcal{F}_{z_1,\ldots,z_n}$ is a finite collection of binary vectors and $|\mathcal{F}_{z_1,\ldots,z_n}| \le 2^n$. The set $\mathcal{F}_{z_1,\ldots,z_n}$ is called the projection of $\mathcal{F}$ onto $z_1, \ldots, z_n$.

Similarly, for a class of sets $\mathcal{A}$, define the growth function $s(\mathcal{A}, n) = \max_F s(\mathcal{A}, F)$, where the maximum is over all finite sets $F$ of size $n$ and $s(\mathcal{A}, F) = |\{A \cap F : A \in \mathcal{A}\}|$ denotes the number of subsets of $F$ picked out by $\mathcal{A}$. We say that a finite set $F$ of size $n$ is shattered by $\mathcal{A}$ if $s(\mathcal{A}, F) = 2^n$. The VC dimension of $\mathcal{A}$ is the largest $n$ such that $s(\mathcal{A}, n) = 2^n$. The same definitions apply to a class of binary functions $\mathcal{F}$ via the sets $\{z : f(z) = 1\}$, and we write $s(\mathcal{F}, n)$ for its growth function.
If the VC dimension $d$ is finite, then the growth function cannot grow too quickly. In fact, there is a phase transition: $s(\mathcal{F}, n) = 2^n$ for $n \le d$ and then the growth switches to polynomial.
7.55 Theorem (Sauer's Theorem). Suppose that $\mathcal{F}$ has finite VC dimension $d$. Then,

    $s(\mathcal{F}, n) \le \sum_{i=0}^d \binom{n}{i}$   (7.56)

and, for all $n \ge d$,

    $s(\mathcal{F}, n) \le \left( \frac{en}{d} \right)^d.$   (7.57)
Proof. Let $h(n,d) = \sum_{i=0}^d \binom{n}{i}$ and recall Pascal's identity $h(n,d) = h(n-1,d) + h(n-1,d-1)$; we show by induction on $n$ that $|\mathcal{F}_{z_1,\ldots,z_n}| \le h(n,d)$ for every class of VC dimension $d$. Fix $z_1, \ldots, z_n$ and let $\mathcal{F}_1 = \mathcal{F}_{z_1,\ldots,z_n}$ and $\mathcal{F}_2 = \mathcal{F}_{z_2,\ldots,z_n}$. Let $\mathcal{G} = \{f \in \mathcal{F} :$ there exists $g \in \mathcal{F}$ with $g(z_1) \ne f(z_1)$ and $g(z_i) = f(z_i)$ for $i = 2, \ldots, n\}$. Define $\mathcal{F}_3 = \{(f(z_2), \ldots, f(z_n)) : f \in \mathcal{G}\}$. Then $|\mathcal{F}_1| = |\mathcal{F}_2| + |\mathcal{F}_3|$. Note that $VC(\mathcal{F}_2) \le d$ and $VC(\mathcal{F}_3) \le d - 1$. The latter follows since, if $\mathcal{F}_3$ shatters a set, then we can add $z_1$ to create a set that is shattered by $\mathcal{F}_1$. By the inductive hypothesis, $|\mathcal{F}_2| \le h(n-1, d)$ and $|\mathcal{F}_3| \le h(n-1, d-1)$. Hence,

    $|\mathcal{F}_1| \le h(n-1, d) + h(n-1, d-1) = h(n, d).$

Thus, $s(\mathcal{F}, n) \le h(n, d)$ which proves (7.56).
To prove (7.57), we use the fact that $n \ge d$ and so:

    $\sum_{i=0}^d \binom{n}{i} \le \left( \frac{n}{d} \right)^d \sum_{i=0}^d \binom{n}{i}\left( \frac{d}{n} \right)^i \le \left( \frac{n}{d} \right)^d \sum_{i=0}^n \binom{n}{i}\left( \frac{d}{n} \right)^i = \left( \frac{n}{d} \right)^d \left( 1 + \frac{d}{n} \right)^n \le \left( \frac{n}{d} \right)^d e^d.$
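Sauer's bound can be checked by brute force for a simple class. For intervals on the real line (VC dimension 2), the growth function can be enumerated directly (a sketch; the helper interval_growth is ours, not from the text):

```python
from itertools import combinations
from math import comb

# Brute-force growth function of F = {indicators of intervals [a,b]} on n
# points, versus Sauer's bound sum_{i<=d} C(n,i) with d = 2.
def interval_growth(n):
    labelings = {(0,) * n}                      # the empty intersection
    for i, j in combinations(range(n + 1), 2):  # interval picking out points i..j-1
        labelings.add(tuple(1 if i <= k < j else 0 for k in range(n)))
    return len(labelings)

d = 2
for n in (1, 2, 3, 5, 8):
    print(n, interval_growth(n), sum(comb(n, i) for i in range(d + 1)))
```

For this class the bound (7.56) is attained exactly.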
7.58 Theorem. Suppose that $\mathcal{F} = \{f_1, \ldots, f_N\}$ is a finite set of binary functions, and write $P_n(f) = n^{-1}\sum_{i=1}^n f(Z_i)$ and $P(f) = \mathbb{E}(f(Z))$. Then, with probability at least $1 - \delta$,

    $\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{2}{n} \log\left( \frac{2N}{\delta} \right)}.$   (7.59)
Table 7.1. The VC dimension $V_{\mathcal{A}}$ of some classes $\mathcal{A}$.

    Class $\mathcal{A}$                                    VC dimension $V_{\mathcal{A}}$
    Finite class $\mathcal{A} = \{A_1, \ldots, A_N\}$      $\le \log_2 N$
    Intervals $[a,b]$ on the real line                     $2$
    Discs in $\mathbb{R}^2$                                $3$
    Closed balls in $\mathbb{R}^d$                         $d+2$
    Rectangles in $\mathbb{R}^d$                           $2d$
    Half-spaces in $\mathbb{R}^d$                          $d+1$
    Convex polygons in $\mathbb{R}^2$                      $\infty$
Proof. It follows from Hoeffding's inequality that, for each $f \in \mathcal{F}$, $\mathbb{P}(|P_n(f) - P(f)| > \epsilon) \le 2e^{-n\epsilon^2/2}$. Hence,

    $\mathbb{P}\left( \max_{f\in\mathcal{F}} |P_n(f) - P(f)| > \epsilon \right) = \mathbb{P}\left( |P_n(f) - P(f)| > \epsilon \text{ for some } f \in \mathcal{F} \right) \le \sum_{j=1}^N \mathbb{P}(|P_n(f_j) - P(f_j)| > \epsilon) \le 2Ne^{-n\epsilon^2/2}.$

Setting the right hand side equal to $\delta$ and solving for $\epsilon$ gives the result.
Now we consider results for the case where $\mathcal{F}$ is infinite. We begin with an important result due to Vapnik and Chervonenkis.
7.60 Theorem (Vapnik and Chervonenkis). Let $\mathcal{F}$ be a class of binary functions. For any $t > \sqrt{2/n}$,

    $\mathbb{P}\left( \sup_{f\in\mathcal{F}} |(P_n - P)f| > t \right) \le 4\, s(\mathcal{F}, 2n)\, e^{-nt^2/8}.$   (7.61)
Before proving the theorem, we need the symmetrization lemma. Let $Z_1', \ldots, Z_n'$ denote a second independent sample from $P$ and let $P_n'$ denote the empirical distribution of this second sample. The variables $Z_1', \ldots, Z_n'$ are called a ghost sample.
7.63 Lemma (Symmetrization). For all $t > \sqrt{2/n}$,

    $\mathbb{P}\left( \sup_{f\in\mathcal{F}} |(P_n - P)f| > t \right) \le 2\, \mathbb{P}\left( \sup_{f\in\mathcal{F}} |(P_n - P_n')f| > t/2 \right).$   (7.64)
Proof. Suppose that $\sup_{f\in\mathcal{F}} |(P_n - P)f| > t$ and let $f_n$ be a function for which $|(P_n - P)f_n| > t$; note that $f_n$ depends only on $Z_1, \ldots, Z_n$. On the event $\{|(P_n' - P)f_n| \le t/2\}$,

    $t < |(P_n - P)f_n| = |(P_n - P_n' + P_n' - P)f_n| \le |(P_n - P_n')f_n| + |(P_n' - P)f_n| \le |(P_n - P_n')f_n| + \frac{t}{2}$

and hence $|(P_n' - P_n)f_n| > t/2$. So

    $I(|(P_n - P)f_n| > t)\, I(|(P - P_n')f_n| \le t/2) = I\left( |(P_n - P)f_n| > t,\ |(P - P_n')f_n| \le t/2 \right) \le I(|(P_n' - P_n)f_n| > t/2).$
Now take the expected value over $Z_1', \ldots, Z_n'$ and conclude that

    $I(|(P_n - P)f_n| > t)\, \mathbb{P}'(|(P - P_n')f_n| \le t/2) \le \mathbb{P}'(|(P_n' - P_n)f_n| > t/2).$   (7.65)

By Chebyshev's inequality,

    $\mathbb{P}'(|(P - P_n')f_n| \le t/2) \ge 1 - \frac{4\,\mathrm{Var}'(f_n)}{nt^2} \ge 1 - \frac{1}{nt^2} \ge \frac{1}{2}.$

(Here we used the fact that $W \in [0,1]$ implies that $\mathrm{Var}(W) \le 1/4$, together with $t > \sqrt{2/n}$.) Inserting this into (7.65) we have that
    $I\left( \sup_{f\in\mathcal{F}} |(P_n - P)f| > t \right) \le 2\, \mathbb{P}'\left( \sup_{f\in\mathcal{F}} |(P_n' - P_n)f| > t/2 \right).$

Taking the expectation over $Z_1, \ldots, Z_n$ completes the proof.
The importance of symmetrization is that we have replaced $(P_n - P)f$, which can take any real value, with $(P_n - P_n')f$, which can take only finitely many values. Now we prove the Vapnik–Chervonenkis theorem.
Proof. Let $V = \mathcal{F}_{Z_1,\ldots,Z_n,Z_1',\ldots,Z_n'}$. For any $v \in V$ write $(P_n' - P_n)v$ to mean $(1/n)\left( \sum_{i=1}^n v_i - \sum_{i=n+1}^{2n} v_i \right)$. Using the symmetrization lemma, the union bound over the at most $s(\mathcal{F}, 2n)$ elements of $V$, and Hoeffding's inequality applied to $(P_n' - P_n)v$ (a sum of $n$ independent terms, each lying in $[-1/n, 1/n]$),

    $\mathbb{P}\left( \sup_{f\in\mathcal{F}} |(P_n - P)f| > t \right) \le 2\, s(\mathcal{F}, 2n) \cdot 2 e^{-nt^2/8} = 4\, s(\mathcal{F}, 2n)\, e^{-nt^2/8}.$
Recall that, for a class with finite VC dimension $d$, $s(\mathcal{F}, n) \le (en/d)^d$ when $n \ge d$. Hence we have: for $t > \sqrt{2/n}$,

    $\mathbb{P}\left( \sup_{f\in\mathcal{F}} |(P_n - P)f| > t \right) \le 4\left( \frac{2en}{d} \right)^d e^{-nt^2/8}.$
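To get a feel for the sample sizes this bound requires, one can simply evaluate it (a sketch; $d$, $t$ and the target level are arbitrary):

```python
import numpy as np

# Evaluate the VC bound 4 (2 e n / d)^d exp(-n t^2 / 8) and double n until the
# uniform deviation exceeds t with probability at most 0.05.
def vc_bound(n, d, t):
    return 4 * (2 * np.e * n / d) ** d * np.exp(-n * t**2 / 8)

d, t = 3, 0.1
n = 1000
while vc_bound(n, d, t) > 0.05:
    n *= 2
print(f"d={d}, t={t}: bound <= 0.05 once n = {n}")
```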
A more general way to develop uniform bounds is to use a quantity called Rademacher complexity. In this section we assume that $\mathcal{F}$ is a class of functions $f$ such that $0 \le f(z) \le 1$. Let $\sigma_1, \ldots, \sigma_n$ be independent Rademacher random variables, that is, $\mathbb{P}(\sigma_i = 1) = \mathbb{P}(\sigma_i = -1) = 1/2$, independent of the data $Z^n = (Z_1, \ldots, Z_n)$. Define

    $\mathrm{Rad}_n(\mathcal{F}, Z^n) = \mathbb{E}_\sigma\left( \sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i) \right| \right)$

and $\mathrm{Rad}_n(\mathcal{F}) = \mathbb{E}\left( \mathrm{Rad}_n(\mathcal{F}, Z^n) \right)$. Intuitively, $\mathrm{Rad}_n(\mathcal{F})$ is large if we can find functions $f \in \mathcal{F}$ that "look like" random noise, that is, they are highly correlated with $\sigma_1, \ldots, \sigma_n$. Here are some properties of the Rademacher complexity.
7.70 Lemma.
2. Let $\mathrm{conv}(\mathcal{F})$ denote the convex hull of $\mathcal{F}$. Then $\mathrm{Rad}_n(\mathcal{F}, Z^n) = \mathrm{Rad}_n(\mathrm{conv}(\mathcal{F}), Z^n)$.
4. Let $g : \mathbb{R} \to \mathbb{R}$ be such that $g(0) = 0$ and $|g(y) - g(x)| \le L|x - y|$ for all $x, y$. Then $\mathrm{Rad}_n(g \circ \mathcal{F}, Z^n) \le 2L\, \mathrm{Rad}_n(\mathcal{F}, Z^n)$.
The main result about Rademacher complexity is the following.

Theorem. With probability at least $1 - \delta$,

    $\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le 2\, \mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{1}{2n}\log\left( \frac{2}{\delta} \right)}$

and

    $\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le 2\, \mathrm{Rad}_n(\mathcal{F}, Z^n) + \sqrt{\frac{4}{n}\log\left( \frac{2}{\delta} \right)}.$   (7.73)
Proof. The proof has two steps. First we show that $\sup_{f\in\mathcal{F}} |P_n(f) - P(f)|$ is close to its mean. Then we bound the mean.

Step 1: Let $g(Z_1, \ldots, Z_n) = \sup_{f\in\mathcal{F}} |P_n(f) - P(f)|$. If we change $Z_i$ to some other value $Z_i'$ then $|g(Z_1, \ldots, Z_n) - g(Z_1, \ldots, Z_i', \ldots, Z_n)| \le \frac{1}{n}$. By McDiarmid's inequality,

    $\mathbb{P}\left( |g(Z_1, \ldots, Z_n) - \mathbb{E}(g(Z_1, \ldots, Z_n))| > \epsilon \right) \le 2 e^{-2n\epsilon^2}.$   (7.74)
Step 2: Now we bound $\mathbb{E}(g(Z_1, \ldots, Z_n))$. Once again we introduce a ghost sample $Z_1', \ldots, Z_n'$ and Rademacher variables $\sigma_1, \ldots, \sigma_n$. Note that $P(f) = \mathbb{E}'(P_n'(f))$. Also note that

    $\frac{1}{n}\sum_{i=1}^n \left( f(Z_i') - f(Z_i) \right) \stackrel{d}{=} \frac{1}{n}\sum_{i=1}^n \sigma_i \left( f(Z_i') - f(Z_i) \right)$

where $\stackrel{d}{=}$ means "equal in distribution." Hence,

    $\mathbb{E}(g(Z_1, \ldots, Z_n)) = \mathbb{E}\left( \sup_{f\in\mathcal{F}} |P(f) - P_n(f)| \right) = \mathbb{E}\left( \sup_{f\in\mathcal{F}} \left| \mathbb{E}'(P_n'(f) - P_n(f)) \right| \right)$
    $\le \mathbb{E}\,\mathbb{E}'\left( \sup_{f\in\mathcal{F}} |P_n'(f) - P_n(f)| \right) = \mathbb{E}\,\mathbb{E}'\left( \sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n (f(Z_i') - f(Z_i)) \right| \right)$
    $= \mathbb{E}\,\mathbb{E}'\left( \sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i (f(Z_i') - f(Z_i)) \right| \right)$
    $\le \mathbb{E}'\left( \sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i') \right| \right) + \mathbb{E}\left( \sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i) \right| \right)$
    $= 2\, \mathrm{Rad}_n(\mathcal{F}).$
Combining this bound with (7.74) proves the first result.
To prove the second result, let $a(Z_1, \ldots, Z_n) = \mathrm{Rad}_n(\mathcal{F}, Z^n)$ and note that $a(Z_1, \ldots, Z_n)$ changes by at most $1/n$ if we change one observation. McDiarmid's inequality implies that $|\mathrm{Rad}_n(\mathcal{F}, Z^n) - \mathrm{Rad}_n(\mathcal{F})| \le \sqrt{\frac{1}{2n}\log\frac{2}{\delta}}$ with probability at least $1 - \delta$. Combining this with the first result yields the second result.
In the special case where $\mathcal{F}$ is a class of binary functions, we can relate $\mathrm{Rad}_n(\mathcal{F})$ to shattering numbers.

Theorem. Let $\mathcal{F}$ be a class of binary functions. Then

    $\mathrm{Rad}_n(\mathcal{F}) \le \sqrt{\frac{2\log s(\mathcal{F}, n)}{n}}.$
Proof. Let $D = \{Z_1, \ldots, Z_n\}$. Define $S(f, \sigma) = |n^{-1}\sum_{i=1}^n \sigma_i f(Z_i)|$ and $S(v, \sigma) = |n^{-1}\sum_{i=1}^n \sigma_i v_i|$. Note that $-1 \le \sigma_i f(Z_i) \le 1$ and that

    $\mathrm{Rad}_n(\mathcal{F}) = \mathbb{E}\left( \sup_{f\in\mathcal{F}} S(f, \sigma) \right) = \mathbb{E}\left( \mathbb{E}\left( \sup_{f\in\mathcal{F}} S(f, \sigma) \,\Big|\, D \right) \right) = \mathbb{E}\left( \mathbb{E}\left( \max_{v\in\mathcal{F}_{Z_1,\ldots,Z_n}} S(v, \sigma) \,\Big|\, D \right) \right).$
Now, $\sigma_i v_i/n$ has mean 0 and $-1/n \le \sigma_i v_i/n \le 1/n$ so, by Lemma 7.10, $\mathbb{E}(e^{t\sigma_i v_i/n}) \le e^{t^2/(2n^2)}$ for any $t > 0$. From Theorem 7.47,

    $\mathbb{E}\left( \max_{v\in\mathcal{F}_{Z_1,\ldots,Z_n}} S(v, \sigma) \,\Big|\, D \right) \le \sqrt{\frac{2\log|\mathcal{F}_{Z_1,\ldots,Z_n}|}{n}} \le \sqrt{\frac{2\log s(\mathcal{F}, n)}{n}}$

and the result follows.
Hence, with probability at least $1 - \delta$,

    $\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{8\log s(\mathcal{F}, n)}{n}} + \sqrt{\frac{1}{2n}\log\left( \frac{2}{\delta} \right)}.$   (7.79)

If $\mathcal{F}$ has finite VC dimension $d$ then, using a sharper bound on the Rademacher complexity ($\mathrm{Rad}_n(\mathcal{F}) \le C\sqrt{d/n}$ for a universal constant $C$), with probability at least $1 - \delta$,

    $\sup_{f\in\mathcal{F}} |P_n(f) - P(f)| \le 2C\sqrt{\frac{d}{n}} + \sqrt{\frac{1}{2n}\log\left( \frac{2}{\delta} \right)}.$   (7.80)
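The empirical Rademacher complexity is easy to estimate by Monte Carlo. For threshold functions $f_t(z) = I(z \le t)$ on the line, the projection onto any $n$ distinct points consists of the $n+1$ "prefix" patterns (after sorting), so $s(\mathcal{F}, n) = n+1$. A sketch (sizes arbitrary):

```python
import numpy as np

# Monte Carlo estimate of Rad_n for threshold classifiers versus the bound
# sqrt(2 log s(F,n) / n) with s(F,n) = n + 1.
rng = np.random.default_rng(2)
n, reps = 200, 2000
# projection onto the (sorted) sample: rows are the n+1 prefix patterns
V = (np.arange(n)[None, :] < np.arange(n + 1)[:, None]).astype(float)
sigma = rng.choice([-1.0, 1.0], size=(reps, n))
rad = np.mean(np.max(np.abs(sigma @ V.T) / n, axis=1))
print(f"estimated Rad_n: {rad:.4f}   bound: {np.sqrt(2 * np.log(n + 1) / n):.4f}")
```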
Suppose now that $\mathcal{F}$ is a class of real-valued functions. There are various methods to obtain uniform bounds. We consider two such methods: covering numbers and bracketing numbers.
If $Q$ is a measure and $p \ge 1$, define

    $\|f\|_{L_p(Q)} = \left( \int |f(x)|^p \, dQ(x) \right)^{1/p}$

and let $\|f\|_\infty = \sup_x |f(x)|$. A set $C = \{f_1, \ldots, f_N\}$ is an $\epsilon$-cover of $\mathcal{F}$ (or an $\epsilon$-net) if, for every $f \in \mathcal{F}$, there exists an $f_j \in C$ such that $\|f - f_j\|_{L_p(Q)} < \epsilon$.
7.81 Definition. The size of the smallest $\epsilon$-cover is called the covering number and is denoted by $N_p(\epsilon, \mathcal{F}, Q)$. The uniform covering number is defined by $N_p(\epsilon, \mathcal{F}) = \sup_Q N_p(\epsilon, \mathcal{F}, Q)$, where the supremum is over all probability measures $Q$. For the $\|\cdot\|_\infty$ norm we write $N(\epsilon, \mathcal{F}, L_\infty)$.
Theorem. Suppose that $\|f\|_\infty \le B$ for every $f \in \mathcal{F}$. Then

    $\mathbb{P}\left( \sup_{f\in\mathcal{F}} |P_n(f) - P(f)| > \epsilon \right) \le 2\, N(\epsilon/3, \mathcal{F}, L_\infty)\, e^{-n\epsilon^2/(18B^2)}.$

Proof. Let $N = N(\epsilon/3, \mathcal{F}, L_\infty)$ and let $C = \{f_1, \ldots, f_N\}$ be an $\epsilon/3$ cover. For any $f \in \mathcal{F}$ there is an $f_j \in C$ such that $\|f - f_j\|_\infty \le \epsilon/3$. So

    $|P_n(f) - P(f)| \le |P_n(f) - P_n(f_j)| + |P_n(f_j) - P(f_j)| + |P(f_j) - P(f)| \le |P_n(f_j) - P(f_j)| + \frac{2\epsilon}{3}.$

Hence,

    $\mathbb{P}\left( \sup_{f\in\mathcal{F}} |P_n(f) - P(f)| > \epsilon \right) \le \mathbb{P}\left( \max_{f_j\in C} |P_n(f_j) - P(f_j)| + \frac{2\epsilon}{3} > \epsilon \right)$
    $= \mathbb{P}\left( \max_{f_j\in C} |P_n(f_j) - P(f_j)| > \frac{\epsilon}{3} \right) \le \sum_{j=1}^N \mathbb{P}\left( |P_n(f_j) - P(f_j)| > \frac{\epsilon}{3} \right) \le 2\, N(\epsilon/3, \mathcal{F}, L_\infty)\, e^{-n\epsilon^2/(18B^2)}$

from the union bound and Hoeffding's inequality.
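As an illustration of how the covering-number bound is used (a sketch; the class, its Lipschitz constant and all sizes are our arbitrary choices): for $\mathcal{F} = \{x \mapsto \cos(\theta x) : \theta \in [0, 10]\}$ on $x \in [0,1]$ we have $\|f_\theta - f_{\theta'}\|_\infty \le |\theta - \theta'|$, so a $\theta$-grid with covering radius $\epsilon/3$ yields an $\epsilon/3$-cover of $\mathcal{F}$, and $B = 1$:

```python
import numpy as np

# Evaluate 2 N(eps/3, F, L_inf) exp(-n eps^2 / (18 B^2)) for a parametric
# Lipschitz class: F = {cos(theta x): theta in [0,10]}, B = 1.
def uniform_bound(n, eps, theta_range=10.0, B=1.0):
    n_cover = int(np.ceil(theta_range / (2 * eps / 3)))  # size of the eps/3 cover
    return 2 * n_cover * np.exp(-n * eps**2 / (18 * B**2))

for n in (10_000, 50_000, 200_000):
    print(n, uniform_bound(n, eps=0.1))
```

The polynomial covering number is quickly overwhelmed by the exponential term as $n$ grows.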
(For a proof, see Devroye, Györfi and Lugosi (1996).) However, there are cases where the covering numbers are finite and yet the VC dimension is infinite.
7.84 Theorem.
Recall that a pair of functions $[\ell, u]$ with $\ell \le u$ is called a bracket, that $f \in [\ell, u]$ means $\ell \le f \le u$, and that the bracketing number $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P))$ is the smallest number of brackets with $\int (u - \ell)\, dP \le \epsilon$ needed to cover $\mathcal{F}$.

7.86 Theorem. Let $A = \sup_{f\in\mathcal{F}} \int |f|\, dP$ and $B = \sup_{f\in\mathcal{F}} \|f\|_\infty$. Then

    $\mathbb{P}^n\left( \sup_{f\in\mathcal{F}} |P_n(f) - P(f)| > \epsilon \right) \le 2\, N_{[\,]}(\epsilon/8, \mathcal{F}, L_1(P)) \exp\left( -\frac{3n\epsilon^2}{4B[6A + \epsilon]} \right) + 2\, N_{[\,]}(\epsilon/8, \mathcal{F}, L_1(P)) \exp\left( -\frac{3n\epsilon}{40B} \right).$

Hence, if $\epsilon \le 2A/3$,

    $\mathbb{P}^n\left( \sup_{f\in\mathcal{F}} |P_n(f) - P(f)| > \epsilon \right) \le 4\, N_{[\,]}(\epsilon/8, \mathcal{F}, L_1(P)) \exp\left( -\frac{9n\epsilon^2}{76AB} \right).$   (7.87)
Proof. (This proof follows Yukich (1985).) For notational simplicity in the proof, let us write $N(\epsilon) \equiv N_{[\,]}(\epsilon, \mathcal{F}, L_1(P))$. Define $z_n(f) = \int f\,(dP_n - dP)$. Let $[\ell_1, u_1], \ldots, [\ell_N, u_N]$ be a minimal $\epsilon/8$ bracketing. We may assume that for each $j$, $\|u_j\|_\infty \le B$ and $\|\ell_j\|_\infty \le B$. (Otherwise, we simply truncate the brackets.) For each $j$, choose some $f_j \in [\ell_j, u_j]$. Consider any $f \in \mathcal{F}$ and let $[\ell_j, u_j]$ denote a bracket containing $f$. Then

    $|z_n(f)| \le |z_n(f_j)| + |z_n(f - f_j)|.$

Furthermore,

    $|z_n(f - f_j)| = \left| \int (f - f_j)(dP_n - dP) \right| \le \int |f - f_j|\,(dP_n + dP) \le \int |u_j - \ell_j|\,(dP_n + dP)$
    $= \int |u_j - \ell_j|\,(dP_n - dP) + 2\int |u_j - \ell_j|\, dP \le \int |u_j - \ell_j|\,(dP_n - dP) + 2\left( \frac{\epsilon}{8} \right) = z_n(|u_j - \ell_j|) + \frac{\epsilon}{4}.$

Hence,

    $|z_n(f)| \le |z_n(f_j)| + z_n(|u_j - \ell_j|) + \frac{\epsilon}{4}.$

Thus,

    $\mathbb{P}^n\left( \sup_{f\in\mathcal{F}} |z_n(f)| > \epsilon \right) \le \mathbb{P}^n\left( \max_j |z_n(f_j)| > \epsilon/2 \right) + \mathbb{P}^n\left( \max_j |z_n(|u_j - \ell_j|)| + \epsilon/4 > \epsilon/2 \right)$
    $\le \mathbb{P}^n\left( \max_j |z_n(f_j)| > \epsilon/2 \right) + \mathbb{P}^n\left( \max_j |z_n(|u_j - \ell_j|)| > \epsilon/4 \right).$
Now

    $\mathrm{Var}(f_j) \le \int f_j^2\, dP = \int |f_j|\,|f_j|\, dP \le \|f_j\|_\infty \int |f_j|\, dP \le AB.$

Similarly,

    $\mathrm{Var}(|u_j - \ell_j|) \le \int (u_j - \ell_j)^2\, dP \le \int |u_j - \ell_j|\,|u_j - \ell_j|\, dP \le \|u_j - \ell_j\|_\infty \int |u_j - \ell_j|\, dP \le 2B\left( \frac{\epsilon}{8} \right) = \frac{B\epsilon}{4}.$
Also, $\|u_j - \ell_j\|_\infty \le 2B$. Hence, by Bernstein's inequality,

    $\mathbb{P}^n\left( \max_j |z_n(|u_j - \ell_j|)| > \epsilon/4 \right) \le \sum_{j=1}^N 2\exp\left\{ -\frac{n(\epsilon/4)^2}{2(B\epsilon/4) + 2(2B)(\epsilon/4)/3} \right\} \le 2N(\epsilon/8)\exp\left( -\frac{3n\epsilon}{40B} \right).$

A similar application of Bernstein's inequality, using $\mathrm{Var}(f_j) \le AB$ and $\|f_j\|_\infty \le B$, bounds $\mathbb{P}^n(\max_j |z_n(f_j)| > \epsilon/2)$ by the first term.
The following result is from Example 19.7 of van der Vaart (1998).

Theorem. Let $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ where $\Theta \subset \mathbb{R}^d$, and suppose there exists a function $m$ with

    $|f_{\theta_1}(x) - f_{\theta_2}(x)| \le m(x)\, \|\theta_1 - \theta_2\|$

for all $\theta_1, \theta_2 \in \Theta$. Then,

    $N_{[\,]}(\epsilon, \mathcal{F}, L_q(P)) \le \left( \frac{4\sqrt{d}\, \mathrm{diam}(\Theta) \left( \int |m(x)|^q\, dP(x) \right)^{1/q}}{\epsilon} \right)^d.$
Proof. Let $M = \left( \int |m(x)|^q\, dP(x) \right)^{1/q}$ and

    $\delta = \frac{\epsilon}{4\sqrt{d}\, M}.$

Let $\theta_1, \ldots, \theta_N$ be a grid of spacing $\delta$ covering $\Theta$, so that every $\theta \in \Theta$ is within distance $\sqrt{d}\,\delta = \epsilon/(4M)$ of some $\theta_j$. Define $\ell_j = f_{\theta_j} - \epsilon m(x)/(2M)$ and $u_j = f_{\theta_j} + \epsilon m(x)/(2M)$. We claim that the brackets $[\ell_1, u_1], \ldots, [\ell_N, u_N]$ cover $\mathcal{F}$. To see this, choose any $f_\theta \in \mathcal{F}$. Let $\theta_j$ be the closest element of $\{\theta_1, \ldots, \theta_N\}$ to $\theta$. Then

    $f_\theta(x) = f_{\theta_j}(x) + f_\theta(x) - f_{\theta_j}(x) \le f_{\theta_j}(x) + |f_\theta(x) - f_{\theta_j}(x)| \le f_{\theta_j}(x) + m(x)\|\theta - \theta_j\| \le f_{\theta_j}(x) + \frac{\epsilon\, m(x)}{2M} = u_j(x).$

By a similar argument, $f_\theta(x) \ge \ell_j(x)$. Also, $\int (u_j - \ell_j)^q\, dP = \epsilon^q$. Finally, note that the number of brackets is

    $N = \left( \frac{\mathrm{diam}(\Theta)}{\delta} \right)^d = \left( \frac{4\sqrt{d}\, \mathrm{diam}(\Theta)\, M}{\epsilon} \right)^d.$
Example (kernel density estimation). Let $X_1, \ldots, X_n \sim P$ where $X_i \in \mathcal{X} \subset \mathbb{R}^d$. The kernel density estimator is

    $\hat{p}_h(x) = \frac{1}{n h^d} \sum_{i=1}^n K\left( \frac{\|x - X_i\|}{h} \right)$

where $K$ is a smooth symmetric function and $h > 0$ is a bandwidth. We study $\hat{p}_h$ in detail in the chapter on nonparametric density estimation. Here we bound the sup norm distance between $\hat{p}_h(x)$ and its mean $p_h(x) = \mathbb{E}(\hat{p}_h(x))$.
7.90 Theorem. Suppose that $K(x) \le K(0)$ for all $x$ and that $|K(x) - K(y)| \le L\|x - y\|$ for all $x, y$. Then, for all small $\epsilon > 0$,

    $\mathbb{P}^n\left( \sup_{x\in\mathcal{X}} |\hat{p}_h(x) - p_h(x)| > \epsilon \right) \le 4\left( \frac{32 L \sqrt{d}\, \mathrm{diam}(\mathcal{X})}{\epsilon\, h^{d+1}} \right)^d \exp\left( -\frac{3 n \epsilon^2 h^d}{28 K(0)} \right).$
Proof. Let $\mathcal{F} = \{f_x : x \in \mathcal{X}\}$ where $f_x(u) = h^{-d} K(\|x - u\|/h)$, so that $\hat{p}_h(x) = P_n(f_x)$ and $p_h(x) = P(f_x)$. We apply Theorem 7.86 with $A = 1$ and $B = K(0)/h^d$. We need to bound $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P))$. Now,

    $|f_x(u) - f_y(u)| = \frac{1}{h^d}\left| K\left( \frac{\|x - u\|}{h} \right) - K\left( \frac{\|y - u\|}{h} \right) \right| \le \frac{L}{h^{d+1}} \Big| \|x - u\| - \|y - u\| \Big| \le \frac{L}{h^{d+1}} \|x - y\|.$

Hence, by the previous theorem (with $m \equiv L/h^{d+1}$ and $q = 1$),

    $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) \le \left( \frac{4 L \sqrt{d}\, \mathrm{diam}(\mathcal{X})}{\epsilon\, h^{d+1}} \right)^d$

and the result follows from Theorem 7.86.
7.91 Corollary. Suppose that $h = h_n = (C_n/n)^\xi$ where $\xi \le 1/d$ and $C_n = (\log n)^a$ for some $a \ge 0$. Then

    $\mathbb{P}^n\left( \sup_x |\hat{p}_h(x) - p_h(x)| > \epsilon \right) \le 4\left( \frac{32 L \sqrt{d}\, \mathrm{diam}(\mathcal{X})}{\epsilon} \left( \frac{n}{C_n} \right)^{\xi(d+1)} \right)^d \exp\left( -\frac{3\epsilon^2 C_n^{\xi d}\, n^{1 - d\xi}}{28 K(0)} \right).$
Note that the proofs of the last two results did not depend on $P$. Hence, if $\mathcal{P}$ is the set of distributions with support on $\mathcal{X}$, we have that

    $\sup_{P\in\mathcal{P}} \mathbb{P}^n\left( \sup_x |\hat{p}_h(x) - p_h(x)| > \epsilon \right) \le c_1 \exp\left( -c_2\, \epsilon^2\, C_n^{\xi d}\, n^{1 - d\xi} \right).$
7.92 Example. Here are some further examples. In Exercise 7.12 you are asked to prove these results.
7.95 Theorem (Giné and Guillou, 2001). Suppose that there exist $A$ and $d$ such that

    $\sup_P N(\epsilon a, \mathcal{F}, L_2(P)) \le \left( \frac{A}{\epsilon} \right)^d.$
Combining these results gives Giné and Guillou's version of Talagrand's inequality:

7.96 Theorem. Let $v \ge \mathbb{E}\left( \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n f^2(X_i) \right)$ and $U \ge \sup_{f\in\mathcal{F}} \|f\|_\infty$. There exists a universal constant $K$ such that

    $\mathbb{P}\left( \sup_{f\in\mathcal{F}} |P_n(f) - P(f)| > t \right) \le K \exp\left\{ -\frac{nt}{KU} \log\left( 1 + \frac{tU}{K\left( \sqrt{nv} + U\sqrt{\log(AU/\sqrt{v})} \right)^2} \right) \right\}$   (7.97)

whenever

    $t \ge \frac{C}{n}\left( U \log\left( \frac{AU}{\sqrt{v}} \right) + \sqrt{nv}\, \sqrt{\log\left( \frac{AU}{\sqrt{v}} \right)} \right).$
7.98 Example (Density Estimation). Giné and Guillou (2002) apply Talagrand's inequality to get bounds on density estimators. Let $X_1, \ldots, X_n \sim P$ where $X_i \in \mathbb{R}^d$ and suppose that $P$ has density $p$. The kernel density estimator of $p$ with bandwidth $h$ is

    $\hat{p}_h(x) = \frac{1}{n h^d} \sum_{i=1}^n K\left( \frac{\|x - X_i\|}{h} \right).$

Applying the results above to $\hat{p}_h(x)$ we see that (under very weak conditions on $K$) for all small $\epsilon$ and large $n$,

    $\mathbb{P}\left( \sup_{x\in\mathbb{R}^d} |\hat{p}_h(x) - p_h(x)| > \epsilon \right) \le c_1 e^{-c_2 n h^d \epsilon^2}$   (7.99)

where $p_h(x) = \mathbb{E}(\hat{p}_h(x))$ and $c_1, c_2$ are positive constants. This agrees with the earlier result, Theorem 7.90.
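A simulation matching the flavor of (7.99) (a sketch; Gaussian data and kernel in $d = 1$ so that $p_h$ is available in closed form, all settings arbitrary):

```python
import numpy as np

# Sup-norm deviation sup_x |p_hat_h(x) - p_h(x)| over a grid, for a Gaussian
# kernel and N(0,1) data; here p_h is exactly the N(0, 1 + h^2) density.
rng = np.random.default_rng(3)
grid = np.linspace(-3.0, 3.0, 100)
h = 0.3
p_h = np.exp(-grid**2 / (2 * (1 + h**2))) / np.sqrt(2 * np.pi * (1 + h**2))

def kde(X):
    u = (grid[:, None] - X[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

for n in (200, 2000, 20000):
    sup_dev = [np.abs(kde(rng.standard_normal(n)) - p_h).max() for _ in range(100)]
    print(n, round(float(np.mean(sup_dev)), 4))   # shrinks roughly like 1/sqrt(n h)
```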
7.4 Expected Maxima

Now we consider bounding the expected value of the maximum of an infinite set of random variables. Let $\{X_f : f \in \mathcal{F}\}$ be a collection of mean 0 random variables indexed by $f \in \mathcal{F}$ and let $d$ be a metric on $\mathcal{F}$. Let $N(\mathcal{F}, r)$ be the covering number of $\mathcal{F}$, that is, the smallest number of balls of radius $r$ required to cover $\mathcal{F}$. Say that $\{X_f : f \in \mathcal{F}\}$ is sub-Gaussian if, for every $t > 0$ and every $f, g \in \mathcal{F}$,

    $\mathbb{E}\left( e^{t(X_f - X_g)} \right) \le e^{t^2 d^2(f,g)/2}.$
We say that $\{X_f : f \in \mathcal{F}\}$ is sample continuous if, for every sequence $f_1, f_2, \ldots \in \mathcal{F}$ such that $d(f_i, f) \to 0$ for some $f \in \mathcal{F}$, we have that $X_{f_i} \to X_f$ a.s. The following theorem is from Cesa-Bianchi and Lugosi (2006) and is a variation of a theorem due to Dudley (1978).

Theorem. Suppose that $\{X_f : f \in \mathcal{F}\}$ is sub-Gaussian and sample continuous, and let $D = \sup_{f,g\in\mathcal{F}} d(f,g)$ denote the diameter of $\mathcal{F}$. Then

    $\mathbb{E}\left( \sup_{f\in\mathcal{F}} X_f \right) \le 12 \int_0^{D/2} \sqrt{\log N(\mathcal{F}, \epsilon)}\, d\epsilon.$
Proof. The proof uses Dudley's chaining technique. We follow the version in Theorem 8.3 of Cesa-Bianchi and Lugosi (2006). Let $\mathcal{F}_k$ be a minimal cover of $\mathcal{F}$ of radius $D2^{-k}$. Thus $|\mathcal{F}_k| = N(\mathcal{F}, D2^{-k})$. Let $f_0$ denote the unique element in $\mathcal{F}_0$. Each $X_f$ is a random variable and hence is a mapping from some sample space $\mathcal{S}$ to the reals. Fix $s \in \mathcal{S}$ and let $f^*$ be such that $\sup_{f\in\mathcal{F}} X_f(s) = X_{f^*}(s)$. (If an exact maximizer does not exist, we can choose an approximate maximizer but we shall assume an exact maximizer.) Let $f_k \in \mathcal{F}_k$ minimize the distance to $f^*$. Hence,

    $d(f_{k-1}, f_k) \le d(f^*, f_k) + d(f^*, f_{k-1}) \le 3D2^{-k}.$

By sample continuity, $X_{f^*}(s) = X_{f_0}(s) + \sum_{k=1}^\infty \left( X_{f_k}(s) - X_{f_{k-1}}(s) \right)$, and since $\mathbb{E}(X_{f_0}) = 0$,

    $\mathbb{E}\left( \sup_{f\in\mathcal{F}} X_f \right) \le \sum_{k=1}^\infty \mathbb{E}\left( \max_{f,g} (X_f - X_g) \right)$

where the max is over all $f \in \mathcal{F}_k$ and $g \in \mathcal{F}_{k-1}$ such that $d(f,g) \le 3D2^{-k}$. There are at most $N(\mathcal{F}, D2^{-k})^2$ such pairs. By Theorem 7.47,

    $\mathbb{E}\left( \max_{f,g} (X_f - X_g) \right) \le 3D2^{-k} \sqrt{2\log N(\mathcal{F}, D2^{-k})^2} = 6D2^{-k}\sqrt{\log N(\mathcal{F}, D2^{-k})}.$

Summing over $k$ and comparing the sum with an integral,

    $\mathbb{E}\left( \sup_{f\in\mathcal{F}} X_f \right) \le 12 \sum_{k=1}^\infty D2^{-(k+1)} \sqrt{\log N(\mathcal{F}, D2^{-k})} \le 12 \int_0^{D/2} \sqrt{\log N(\mathcal{F}, \epsilon)}\, d\epsilon$

since $N(\mathcal{F}, \epsilon) \ge N(\mathcal{F}, D2^{-k})$ for $\epsilon \le D2^{-k}$.
7.102 Example. Let $Y_1, \ldots, Y_n$ be a sample from a continuous cdf $F$ on $[0,1]$ with bounded density. Let $X_s = \sqrt{n}\,(F_n(s) - F(s))$ where $F_n$ is the empirical distribution function. The collection $\{X_s : s \in [0,1]\}$ can be shown to be sub-Gaussian and sample continuous with respect to the Euclidean metric on $[0,1]$. The covering number is $N([0,1], r) \le 1/r$. Hence,

    $\mathbb{E}\left( \sup_{0\le s\le 1} \sqrt{n}\,(F_n(s) - F(s)) \right) \le 12 \int_0^{1/2} \sqrt{\log(1/\epsilon)}\, d\epsilon \le C$

for a constant $C$.
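Both sides of the example can be computed (a sketch; the quadrature grid and simulation sizes are arbitrary):

```python
import numpy as np

# Compare 12 * int_0^{1/2} sqrt(log(1/eps)) d eps with a Monte Carlo estimate
# of E sup_s sqrt(n) |F_n(s) - F(s)| for Uniform(0,1) data (the KS statistic).
rng = np.random.default_rng(4)
eps = np.linspace(1e-6, 0.5, 100001)
y = np.sqrt(np.log(1 / eps))
dudley = 12 * np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(eps))  # trapezoid rule

n, reps = 2000, 500
U = np.sort(rng.uniform(size=(reps, n)), axis=1)
i = np.arange(1, n + 1)
ks = np.maximum(np.abs(i / n - U), np.abs((i - 1) / n - U)).max(axis=1)
print(f"Dudley bound: {dudley:.3f}   MC estimate of E sup: {(np.sqrt(n) * ks).mean():.3f}")
```

The entropy bound is quite conservative here, but it is dimension-free and of the right order.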
7.5 Summary

The most important results in this chapter are Hoeffding's inequality,

    $\mathbb{P}(|\bar{X}_n - \mu| > \epsilon) \le 2 e^{-2n\epsilon^2/c},$

Bernstein's inequality,

    $\mathbb{P}(|\bar{X}_n - \mu| > \epsilon) \le 2\exp\left\{ -\frac{n\epsilon^2}{2\sigma^2 + 2c\epsilon/3} \right\},$

and the Vapnik–Chervonenkis bound,

    $\mathbb{P}\left( \sup_{f\in\mathcal{F}} |(P_n - P)f| > t \right) \le 4\, s(\mathcal{F}, 2n)\, e^{-nt^2/8}.$

These, and similar results, provide the theoretical basis for many statistical machine learning methods. The literature contains many refinements and extensions of these results.
Exercises

7.1 Suppose that $X \ge 0$ and $\mathbb{E}(X) < \infty$. Show that $\mathbb{E}(X) = \int_0^\infty \mathbb{P}(X \ge t)\, dt$.

7.2 Show that $h(u) \ge u^2/(2 + 2u/3)$ for $u \ge 0$, where $h(u) = (1+u)\log(1+u) - u$.

7.3 In the proof of McDiarmid's inequality, verify that $\mathbb{E}(V_i \mid Z_1, \ldots, Z_{i-1}) = 0$.

7.4 Prove Lemma 7.37.

7.5 Prove equation (7.24).

7.6 Prove the results in Table 7.1.

7.7 Derive Hoeffding's inequality from McDiarmid's inequality.

7.8 Prove Lemma 7.70.

7.9 Consider Example 7.102. Show that $\{X_s : s \in [0,1]\}$ is sub-Gaussian. Show that $\int_0^{1/2} \sqrt{\log(1/\epsilon)}\, d\epsilon \le C$ for some $C > 0$.

7.10 Prove Theorem 7.52.

7.11 Prove Theorem 7.84.

7.12 Prove the results in Example 7.92.