Probability Bounds

John Duchi

This document starts from simple probabilistic inequalities (Markov's Inequality) and builds up through several stronger concentration results, developing a few ideas about Rademacher complexity, until we give proofs of the main Vapnik-Chervonenkis complexity results for learning theory. Many of these proofs are based on Peter Bartlett's lectures for CS281b at Berkeley or Rob Schapire's lectures at Princeton. The aim is to have one self-contained document containing some of the standard uniform convergence results for learning theory.

1 Preliminaries
We begin this document with a few (nearly trivial) preliminaries which will allow us to make very strong
claims on distributions of sums of random variables.
Theorem 1 (Markov's Inequality). For a nonnegative random variable X and t > 0,

P[X ≥ t] ≤ E[X] / t.
Proof For t > 0,

E[X] = ∫_X x P(dx) ≥ ∫_t^∞ x P(dx) ≥ ∫_t^∞ t P(dx) = t P[X ≥ t].
One very powerful consequence of Markov's Inequality is the Chernoff method, which uses the fact that for any s ≥ 0,

P(X ≥ t) = P(e^{sX} ≥ e^{st}) ≤ E[e^{sX}] / e^{st}.    (1)

The inequality above is a simple consequence of Markov's Inequality applied to the nonnegative random variable e^{sX}, using the fact that e^z > 0 for all z ∈ R.
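As a quick numerical illustration, the following Python sketch compares the Markov bound, the Chernoff bound optimized over s, and the empirical tail probability for an Exp(1) random variable; the distribution, thresholds, and sample size are arbitrary choices made only for this example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)   # Exp(1): E[X] = 1 and E[e^{sX}] = 1/(1 - s) for s < 1

for t in [2.0, 4.0, 8.0]:
    empirical = (X >= t).mean()
    markov = 1.0 / t                              # E[X] / t
    s = 1.0 - 1.0 / t                             # minimizer of e^{-st} / (1 - s) over 0 < s < 1
    chernoff = np.exp(-s * t) / (1.0 - s)
    print(f"t={t}: empirical={empirical:.4g}, Markov={markov:.4g}, Chernoff={chernoff:.4g}")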

2 Hoeffding’s Bounds
Lemma 2.1 (Hoeffding's Lemma). Given a random variable X with a ≤ X ≤ b and E[X] = 0, then for any s > 0, we have

E[e^{sX}] ≤ e^{s²(b−a)²/8}.
Proof Given any x such that a ≤ x ≤ b, we can define λ ∈ [0, 1] as

λ = (b − x)/(b − a).

Thus, we see that (b − a)λ = b − x, so that x = b − λ(b − a) = λa + (1 − λ)b. As such, we know that sx = sλa + s(1 − λ)b. So the convexity of exp(·) implies

e^{sx} = e^{λsa+(1−λ)sb} ≤ λe^{sa} + (1 − λ)e^{sb} = ((b − x)/(b − a)) e^{sa} + ((x − a)/(b − a)) e^{sb}.
Using the above and the fact that E[X] = 0,

E[e^{sX}] ≤ E[ ((b − X)/(b − a)) e^{sa} + ((X − a)/(b − a)) e^{sb} ] = (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb}.    (2)
Now, we let p = −a/(b − a) (noting that a ≤ 0 as E[X] = 0 and hence p ∈ [0, 1]), and we have 1 − p = b/(b − a), giving

(b/(b − a)) e^{sa} − (a/(b − a)) e^{sb} = (1 − p)e^{sa} + pe^{sb} = (1 − p + pe^{sb−sa})e^{sa}.
Solving for a in p = −a/(b − a), we find that a = −p(b − a), so

(1 − p + pe^{sb−sa})e^{sa} = (1 − p + pe^{s(b−a)})e^{−ps(b−a)}.

Defining u = s(b − a) and

φ(u) := −ps(b − a) + log(1 − p + pe^{s(b−a)}) = −pu + log(1 − p + pe^u),

and using equation (2), we have that

E[e^{sX}] ≤ e^{φ(u)}.

If we can upper bound φ(u), then we are done. Of course, by Taylor's theorem, there is some z ∈ [0, u] such that

φ(u) = φ(0) + uφ′(0) + (1/2)u²φ′′(z) ≤ φ(0) + uφ′(0) + (1/2)u² sup_z φ′′(z).    (3)
Taking derivatives,

φ′(u) = −p + pe^u/(1 − p + pe^u),    φ′′(u) = pe^u/(1 − p + pe^u) − p²e^{2u}/(1 − p + pe^u)² = p(1 − p)e^u/(1 − p + pe^u)².
Since φ′(0) = −p + p = 0 and φ(0) = 0, it remains to maximize φ′′(u). Substituting z for e^u, note that φ′′ is a ratio of a linear function of z to a quadratic in z; it is nonnegative, vanishes at z = 0, and tends to 0 as z → ∞, so its maximum over z > 0 occurs at an interior critical point. Differentiating with respect to z,

d/dz [ p(1 − p)z/(1 − p + pz)² ] = p(1 − p)/(1 − p + pz)² − 2p²(1 − p)z/(1 − p + pz)³
                                 = [ p(1 − p)(1 − p + pz) − 2p²(1 − p)z ] / (1 − p + pz)³
                                 = p(1 − p)(1 − p − pz) / (1 − p + pz)³,

so that the critical point is at z = e^u = (1 − p)/p. Substituting,

φ′′(u) ≤ p(1 − p)·((1 − p)/p) / (1 − p + p·((1 − p)/p))² = (1 − p)² / (4(1 − p)²) = 1/4.
Using equation (3), it is evident that φ(u) ≤ (1/2)u² · (1/4) = (1/8)s²(b − a)². This completes the proof of the lemma, as we have

E[e^{sX}] ≤ e^{φ(u)} ≤ e^{s²(b−a)²/8}.

Now we prove a slightly more general result than the standard Hoeffding bound using Chernoff’s method.

Theorem 2 (Hoeffding's Inequality). Suppose that X1, . . . , Xm are independent random variables with ai ≤ Xi ≤ bi. Then

P( (1/m) Σ_{i=1}^m Xi − (1/m) Σ_{i=1}^m E[Xi] ≥ ε ) ≤ exp( −2ε²m² / Σ_{i=1}^m (bi − ai)² ).
Proof The proof is fairly straightforward using lemma 2.1. First, define Zi = Xi −E[Xi ], so that E[Zi ] = 0
(and we can assume without loss of generality that the bound on Zi is still [ai , bi ]). Then
P( Σ_{i=1}^m Zi ≥ t ) = P( exp( s Σ_{i=1}^m Zi ) ≥ exp(st) ) ≤ E[ Π_{i=1}^m exp(sZi) ] / e^{st}
    = Π_{i=1}^m E[exp(sZi)] / e^{st} ≤ e^{−st} Π_{i=1}^m e^{s²(bi−ai)²/8}
    = exp( (s²/8) Σ_{i=1}^m (bi − ai)² − st ).
The first line is an application of the Chernoff method, and the second follows from lemma 2.1 and the fact that the Zi are independent.
If we substitute s = 4t/Σ_{i=1}^m (bi − ai)², which is evidently > 0, we find that

P( Σ_{i=1}^m Zi ≥ t ) ≤ exp( −2t² / Σ_{i=1}^m (bi − ai)² ).

Finally, letting t = εm gives our result.

We note that the above proof can be extended using the union bound and reproving the same bound for the lower tail (setting Zi′ = −Zi) to give

P( | Σ_{i=1}^m Xi − Σ_{i=1}^m E[Xi] | ≥ mε ) ≤ 2 exp( −2ε²m² / Σ_{i=1}^m (bi − ai)² ).
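As a sanity check, the Python sketch below (with arbitrarily chosen m, ε, and number of trials) estimates the two-sided deviation probability for the mean of m Bernoulli(1/2) variables and compares it with the Hoeffding bound, which here reduces to 2 exp(−2ε²m) since bi − ai = 1.

import numpy as np

rng = np.random.default_rng(0)
m, eps, trials = 200, 0.1, 100_000

# Each row is one experiment of m Bernoulli(1/2) draws, so ai = 0 and bi = 1.
X = rng.integers(0, 2, size=(trials, m))
deviation = np.abs(X.mean(axis=1) - 0.5)

empirical = (deviation >= eps).mean()
hoeffding = 2 * np.exp(-2 * eps**2 * m)   # 2 exp(-2 eps^2 m^2 / sum_i (b_i - a_i)^2) with b_i - a_i = 1
print(f"empirical = {empirical:.4f}, Hoeffding bound = {hoeffding:.4f}")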

3 McDiarmid’s Inequality
This more general result, of which Hoeffding’s Inequality can be seen as a special case, is very useful in
learning theory and other domains. The statement of the theorem is this:

Theorem 3 (McDiarmid's Inequality). Let X = X1, . . . , Xm be m independent random variables taking values in some set A, and assume that f : A^m → R satisfies the following boundedness condition (bounded differences):

sup_{x1,...,xm,x̂i} | f(x1, x2, . . . , xi, . . . , xm) − f(x1, x2, . . . , x̂i, . . . , xm) | ≤ ci

for all i ∈ {1, . . . , m}. Then for any ε > 0, we have

P[ f(X1, . . . , Xm) − E[f(X1, . . . , Xm)] ≥ ε ] ≤ exp( −2ε² / Σ_{i=1}^m ci² ).

Proof The proof of this result begins by introducing some notation. First, let X = {X1 , . . . , Xm } and
Xi:j = {Xi , . . . , Xj }. Further, let Z0 = E[f (X)], Zi = E[f (X) | X1 , . . . , Xi ], and (naturally) Zm = f (X).
We now prove the following claim:

Claim 3.1.

E[ exp(s(Zk − Zk−1)) | X1, . . . , Xk−1 ] ≤ exp( s²ck²/8 ).
Proof of Claim 3.1 First, let

Uk = sup_u { E[f(X) | X1, . . . , Xk−1, u] − E[f(X) | X1, . . . , Xk−1] },
Lk = inf_l { E[f(X) | X1, . . . , Xk−1, l] − E[f(X) | X1, . . . , Xk−1] }

and note that

Uk − Lk ≤ sup_{l,u} { E[f(X) | X1, . . . , Xk−1, u] − E[f(X) | X1, . . . , Xk−1, l] }
        ≤ sup_{l,u} ∫_{y_{k+1:m}} [ f(X1:k−1, u, y_{k+1:m}) − f(X1:k−1, l, y_{k+1:m}) ] Π_{j=k+1}^m p(Xj = yj)
        ≤ ck ∫_{y_{k+1:m}} Π_{j=k+1}^m p(Xj = yj) = ck.

The second line follows because X1 , . . . , Xm are independent, and the last line follows by Jensen’s inequality
and the boundedness condition on f . Thus, Lk ≤ Zk − Zk−1 ≤ Uk , so Zk − Zk−1 ≤ ck . By lemma 2.1, as
E[Zk − Zk−1 | X1:k−1] = E[ E[f(X) | X1:k] | X1:k−1 ] − E[f(X) | X1:k−1]
                      = E[f(X) | X1:k−1] − E[f(X) | X1:k−1] = 0,

our claim follows.
Now we simply proceed through a series of inequalities.


P[f(X) − E[f(X)] ≥ ε] ≤ e^{−sε} E[ exp( s(f(X) − E[f(X)]) ) ]
    = e^{−sε} E[ exp( s(Zm − Zm−1 + Zm−1 − ··· + Z1 − Z0) ) ] = e^{−sε} E[ exp( s Σ_{i=1}^m (Zi − Zi−1) ) ]
    = e^{−sε} E[ E[ exp( s Σ_{i=1}^m (Zi − Zi−1) ) | X1:m−1 ] ]
    = e^{−sε} E[ exp( s Σ_{i=1}^{m−1} (Zi − Zi−1) ) E[ e^{s(Zm − Zm−1)} | X1:m−1 ] ]
    ≤ e^{−sε} E[ exp( s Σ_{i=1}^{m−1} (Zi − Zi−1) ) ] e^{s²cm²/8}
    ≤ e^{−sε} Π_{i=1}^m exp( s²ci²/8 ) = exp( −sε + s² Σ_{i=1}^m ci²/8 ).

The third line follows because of the properties of expectation (that is, that E[g(X, Y )] = EX [EY [g(X, Y ) |
X]]), the fourth because of our independence assumptions, and the fifth and sixth by repeated applications
of claim 3.1.
Minimizing the last equation with respect to s, we take the derivative and find that
(d/ds) exp( −sε + s² Σ_{i=1}^m ci²/8 ) = exp( −sε + s² Σ_{i=1}^m ci²/8 ) ( −ε + 2s Σ_{i=1}^m ci²/8 ),

which has a critical point at s = 4ε/Σ_{i=1}^m ci². Substituting, we see that

exp( −sε + s² Σ_{i=1}^m ci²/8 ) = exp( −4ε²/Σ_{i=1}^m ci² + 16ε² Σ_{i=1}^m ci² / (8 (Σ_{i=1}^m ci²)²) ) = exp( −2ε² / Σ_{i=1}^m ci² ).

This completes our proof.

As a quick note, it is worth mentioning that Hoeffding's inequality follows by applying McDiarmid's inequality to the function f(x1, . . . , xm) = Σ_{i=1}^m xi.
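For a bounded-differences example that is not a sum, the Python sketch below (my own illustration, with arbitrary parameters) takes f to be the number of distinct values among m uniform draws from {1, . . . , k}. Changing a single coordinate changes f by at most 1, so with ci = 1 McDiarmid gives P(f − E[f] ≥ ε) ≤ exp(−2ε²/m).

import numpy as np

rng = np.random.default_rng(0)
m, k, trials, eps = 100, 100, 50_000, 5

# f(x_1, ..., x_m) = number of distinct values; changing one coordinate moves f by at most c_i = 1.
samples = rng.integers(0, k, size=(trials, m))
f_vals = np.array([len(np.unique(row)) for row in samples])

mean_f = f_vals.mean()                     # stand-in for E[f] (estimated, not exact)
empirical = (f_vals - mean_f >= eps).mean()
mcdiarmid = np.exp(-2 * eps**2 / m)        # exp(-2 eps^2 / sum_i c_i^2) with c_i = 1
print(f"empirical = {empirical:.4f}, McDiarmid bound = {mcdiarmid:.4f}")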

4 Glivenko-Cantelli Theorem
In this section, we give a proof of the Glivenko-Cantelli theorem, which gives uniform convergence of the empirical distribution function of a random variable X to the true distribution function. There are many ways
of proving this; for another example, see [1, Theorem 20.6]. Our proof makes use of Rademacher random
variables and gives a convergence rate as well. First, though, we need a Borel-Cantelli lemma.
Theorem 4 (Borel-Cantelli Lemma I). Let An be a sequence of subsets of some probability space Ω. If Σ_{n=1}^∞ P(An) < ∞, then P(An i.o.) = 0.

Proof Let N = Σ_{n=1}^∞ 1{An}, the number of events that occur. By Fubini's theorem, we have E[N] = Σ_{n=1}^∞ P(An) < ∞, so N < ∞ almost surely.

Now for the real proof. Let Fn (x) be the empirical distribution function of a sequence of i.i.d. random
variables X1 , . . . , Xn , that is,
Fn(x) := (1/n) Σ_{i=1}^n 1{Xi ≤ x},

and let F be the true distribution. We have the following theorem.
Theorem 5 (Glivenko-Cantelli). As n → ∞, sup_x |Fn(x) − F(x)| → 0 almost surely.
Proof Our proof has three main parts. First is McDiarmid’s concentration inequality, then we symmetrize
using Rademacher random variables, and lastly we show the class of functions we use is “small” by ordering
the data we see.
To begin, define the function class

G := { g : x ↦ 1{x ≤ θ}, θ ∈ R }

and note that there is a one-to-one mapping between G and R. Now, define En g = (1/n) Σ_{i=1}^n g(Xi). The theorem is equivalent to sup_{g∈G} |En g − Eg| → 0 almost surely for any probability measure P, and the rates of convergence will be identical.
We begin with the concentration result. Let f(X1, . . . , Xn) = sup_{g∈G} |En g − Eg|, and note that changing any one of the n data points arbitrarily changes En g, and hence f, by at most 1/n. Thus, McDiarmid's inequality (theorem 3) implies that

sup_{g∈G} |En g − Eg| ≤ E[ sup_{g∈G} |En g − Eg| ] + ε    (4)

with probability at least 1 − e^{−2ε²n}.
We now use Rademacher random variables and symmetrization to get a handle on the term

E[ sup_{g∈G} |En g − Eg| ].    (5)

It is hard to directly show that this converges to zero, but we can use symmetrization to upper bound Eq. (5).
To this end, let Y1 , . . . , Yn be n independent copies of X1 , . . . , Xn . We have
E[ sup_{g∈G} |En g − Eg| ] = E[ sup_{g∈G} | (1/n) Σ_{i=1}^n g(Xi) − (1/n) Σ_{i=1}^n E[g(Yi)] | ]
    = E[ sup_{g∈G} | (1/n) Σ_{i=1}^n ( g(Xi) − E[g(Yi) | X1, . . . , Xn] ) | ]
    = E[ sup_{g∈G} | E[ (1/n) Σ_{i=1}^n ( g(Xi) − g(Yi) ) | X1, . . . , Xn ] | ]
    ≤ E[ E[ sup_{g∈G} | (1/n) Σ_{i=1}^n ( g(Xi) − g(Yi) ) | | X1, . . . , Xn ] ] = E[ sup_{g∈G} | (1/n) Σ_{i=1}^n ( g(Xi) − g(Yi) ) | ].
The second and third lines are almost sure and follow by properties of conditional expectation, and the last
inequality follows via convexity of | · | and sup.
We now proceed to remove dependence on g(Yi ) by the following steps. First, note that g(Xi ) − g(Yi ) is
symmetric around 0, so if σi ∈ {−1, 1}, σi (g(Xi ) − g(Yi )) has identical distribution. Thus, we can continue
our inequalities:
E[ sup_{g∈G} | (1/n) Σ_{i=1}^n ( g(Xi) − g(Yi) ) | ] = E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi ( g(Xi) − g(Yi) ) | ]
    ≤ E[ sup_{g∈G} ( | (1/n) Σ_{i=1}^n σi g(Xi) | + | (1/n) Σ_{i=1}^n σi g(Yi) | ) ]
    ≤ E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Xi) | + sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Yi) | ]
    = 2 E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Xi) | ].    (6)

The last expectation involves the maximum inner product between the vectors [σ1 ··· σn]⊤ and [g(X1) ··· g(Xn)]⊤. This is an indication of how well the class of vectors {[g(X1) ··· g(Xn)]⊤ : g ∈ G} can align with random directions σ1, . . . , σn, which are uniformly distributed on the corners of the n-cube.
Now what remains is to bound E[ sup_{g∈G} | Σ_i σi g(Xi) | ], which we do by noting that G is in a sense simple.
Fix the data set as (x1, . . . , xn) and consider the order statistics x(1), . . . , x(n). Note that x(i) ≤ x(i+1) implies that g(x(i)) = 1{x(i) ≤ θ} ≥ 1{x(i+1) ≤ θ} = g(x(i+1)), so

[g(x(1)) ··· g(x(n))]⊤ ∈ { [0 ··· 0]⊤, [1 0 ··· 0]⊤, . . . , [1 ··· 1]⊤ }.
The bijection between (x1 , . . . , xn ) and (x(1) , . . . , x(n) ) implies that the cardinality of the set {[g(x1 ), . . . , g(xn )]⊤ :
g ∈ G} is at most n + 1. We can thus use bounds relating the size of the class of functions to bound the
earlier expectations.
Lemma 4.1 (Massart's finite class lemma). Let A ⊂ R^n have |A| < ∞, R = max_{a∈A} ‖a‖, and σi be independent Rademacher variables. Then

E[ max_{a∈A} (1/n) Σ_{i=1}^n σi ai ] ≤ R√(2 log |A|) / n.
Proof of Lemma Let s > 0 and define Za := Σ_{i=1}^n σi ai. Because exp is convex, increasing, and positive,

exp( s E[ max_{a∈A} Za ] ) ≤ E[ exp( s max_{a∈A} Za ) ] = E[ max_{a∈A} exp(sZa) ] ≤ Σ_{a∈A} E[exp(sZa)].

Now we apply Hoeffding's lemma (lemma 2.1) by noting that σi ai ∈ [−|ai|, |ai|] with mean zero, and have

Σ_{a∈A} E[exp(sZa)] ≤ Σ_{a∈A} exp( s² Σ_{i=1}^n ai²/2 ) ≤ Σ_{a∈A} exp(s²R²/2) = |A| exp(s²R²/2).

Combining the above bound with the first string of inequalities, we have exp( s E[max_{a∈A} Za] ) ≤ |A| exp(s²R²/2), or

E[ max_{a∈A} Za ] ≤ inf_{s>0} ( log|A|/s + sR²/2 ).

Setting s = √(2 log|A| / R²), we have

E[ max_{a∈A} Za ] ≤ R log|A| / √(2 log|A|) + R√(2 log|A|)/2 = R√(2 log|A|)

and dividing by n gives the final bound.


Now, letting A = {[g(X1), . . . , g(Xn)]⊤ : g ∈ G}, we note that |A| ≤ n + 1 and R = max_{a∈A} ‖a‖ ≤ √n. Thus,

E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Xi) | ] = E[ E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Xi) | | X1, . . . , Xn ] ] ≤ √( 2 log(n + 1) / n ).
By the above equation, Eq. (6), and the McDiarmid inequality application in Eq. (4),

P( sup_{g∈G} |En g − Eg| > ε + 2√(2 log(n + 1)/n) ) ≤ P( sup_{g∈G} |En g − Eg| > ε + 2 E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Xi) | ] )
    ≤ P( sup_{g∈G} |En g − Eg| > ε + E[ sup_{g∈G} |En g − Eg| ] )
    ≤ 2 exp(−2ε²n).
This implies that sup_{g∈G} |En g − Eg| → 0 in probability. To get almost sure convergence, choose n large enough that 2√(2 log(n + 1)/n) < ε and let An = { sup_{g∈G} |En g − Eg| > 2ε }. Then Σ_n P(An) < ∞, so An happens only finitely many times.
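To see this rate in action, the Python sketch below (my own, with arbitrary sample sizes) draws Uniform(0, 1) samples, computes sup_x |Fn(x) − F(x)| exactly from the order statistics, and compares its average with the quantity 2√(2 log(n + 1)/n) appearing in the proof.

import numpy as np

rng = np.random.default_rng(0)

def sup_deviation(n):
    # sup_x |F_n(x) - F(x)| for n Uniform(0,1) samples, where F(x) = x on [0, 1].
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # The supremum is attained just before or at an order statistic.
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

for n in [100, 1_000, 10_000]:
    avg_dev = np.mean([sup_deviation(n) for _ in range(200)])
    rate = 2 * np.sqrt(2 * np.log(n + 1) / n)
    print(f"n={n}: average sup|F_n - F| = {avg_dev:.4f}, 2*sqrt(2 log(n+1)/n) = {rate:.4f}")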

5 Rademacher Averages
Now we explore uses of Rademacher random variables to measure the complexity of a class of functions. We also use them to measure (to some extent) how well a function learned from data generalizes to the true distribution.
Definition 5.1 (Rademacher complexity). Let F be a function class with domain X, i.e. F ⊆ {f : X → R}, and let S = {X1, . . . , Xn} be a set of samples generated by a distribution P on X. The empirical Rademacher complexity of F is

R̂n(F) = E[ sup_{f∈F} | (1/n) Σ_{i=1}^n σi f(Xi) | | X1, . . . , Xn ],

where the σi are i.i.d. uniform random variables (Rademacher variables) on {−1, +1}. The Rademacher complexity of F is

Rn(F) = E[R̂n(F)] = E[ sup_{f∈F} | (1/n) Σ_{i=1}^n σi f(Xi) | ].
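As an illustration, the Python sketch below (my own, with an arbitrary sample size) estimates the empirical Rademacher complexity of the threshold class G = {x ↦ 1{x ≤ θ}} from Section 4 by Monte Carlo over the signs; for a fixed sample, the supremum over θ only needs to be evaluated at the n + 1 distinct dichotomies, which become prefix sums once the data are sorted.

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.sort(rng.uniform(size=n))     # fixed sample; sorting makes the thresholds easy to enumerate

def empirical_rademacher(X, n_sigma=2000):
    n = len(X)
    total = 0.0
    for _ in range(n_sigma):
        sigma = rng.choice([-1.0, 1.0], size=n)
        # For g_theta(x) = 1{x <= theta} and sorted X, sum_i sigma_i g_theta(X_i) is a prefix sum.
        prefix = np.concatenate(([0.0], np.cumsum(sigma)))   # the n + 1 distinct dichotomies
        total += np.max(np.abs(prefix)) / n
    return total / n_sigma

estimate = empirical_rademacher(X)
bound = np.sqrt(2 * np.log(2 * (n + 1)) / n)   # Massart-type bound, cf. Lemma 6.1 in Section 6
print(f"estimated empirical Rademacher complexity = {estimate:.4f}, bound = {bound:.4f}")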

Lemma 5.1.

E[ sup_{f∈F} | Ef − (1/n) Σ_{i=1}^n f(Xi) | ] ≤ 2Rn(F).

Proof As we did for the Glivenko-Cantelli theorem, introduce i.i.d. random variables Yi, i ∈ {1, . . . , n}, distributed as the Xi and independent of them. Letting EY denote expectation with respect to the Yi and σi be Rademacher variables,

E[ sup_{f∈F} | Ef − (1/n) Σ_{i=1}^n f(Xi) | ] = E[ sup_{f∈F} | (1/n) Σ_{i=1}^n E[f(Yi)] − (1/n) Σ_{i=1}^n f(Xi) | ]
    ≤ EX EY[ sup_{f∈F} | (1/n) Σ_{i=1}^n ( f(Xi) − f(Yi) ) | ]
    = E[ sup_{f∈F} | (1/n) Σ_{i=1}^n σi ( f(Xi) − f(Yi) ) | ]
    ≤ E[ sup_{f∈F} | (1/n) Σ_{i=1}^n σi f(Xi) | + sup_{f∈F} | (1/n) Σ_{i=1}^n σi f(Yi) | ] = 2Rn(F).
The first inequality follows from the convexity of | · | and sup, the second by the triangle inequality.

We can also use Rademacher complexity to bound the expected value of certain functions, which is often used in conjunction with loss functions or expected risks. For example, we have the following theorem dealing with bounded functions. Recall that En f(X) = (1/n) Σ_{i=1}^n f(Xi), where the Xi are given as a sample.
Theorem 6. Let δ ∈ (0, 1) and F be a class of functions mapping X to [0, 1]. Then with probability at least 1 − δ, all f ∈ F satisfy

Ef(X) ≤ En f(X) + 2Rn(F) + √( log(1/δ) / (2n) ).

Also with probability at least 1 − δ, all f ∈ F satisfy

Ef(X) ≤ En f(X) + 2R̂n(F) + 5√( log(2/δ) / (2n) ).
Proof Fix f ∈ F. Then we clearly have (by choosing f in the sup)

Ef(X) ≤ En f(X) + sup_{g∈F} ( Eg(X) − En g(X) ).

As each g(Xi) ∈ [0, 1], modifying one of the Xi can change En g(X), and hence the supremum, by at most 1/n. McDiarmid's inequality (Theorem 3) thus implies that

P( sup_{g∈F} (Eg(X) − En g(X)) − E[ sup_{g∈F} (Eg(X) − En g(X)) ] ≥ ε ) ≤ exp(−2ε²n).

Setting the right hand side bound equal to δ and solving for ε, we have

−2ε²n = log δ,  or  ε² = log(1/δ)/(2n),  so  ε = √( log(1/δ) / (2n) ).

That is, with probability at least 1 − δ, we have

sup_{g∈F} ( Eg(X) − En g(X) ) ≤ E[ sup_{g∈F} ( Eg(X) − En g(X) ) ] + √( log(1/δ) / (2n) ).

Applying lemma 5.1, we immediately see that the right hand expectation is bounded by 2Rn(F). This completes the proof of the first inequality in the theorem.
Now we need to bound Rn(F) with high probability using R̂n(F). First note that the above reasoning could have been done using probability δ/2, giving a bound with 2Rn(F) and √(log(2/δ)/(2n)) instead. Now we note that changing one example Xi changes R̂n(F) by at most 2/n (because one sign inside of R̂n(F) can change). Letting ci = 2/n in McDiarmid's inequality, we have

P( Rn(F) − R̂n(F) ≥ ε ) ≤ exp( −(1/2)ε²n ).

Again setting this equal to δ/2 and solving, we have (1/2)ε² = log(2/δ)/n, so that ε = 2√( log(2/δ) / (2n) ). We thus have with probability ≥ 1 − δ/2,

Rn(F) ≤ R̂n(F) + 2√( log(2/δ) / (2n) ).
Using the union bound with two events of probability at least 1 − δ/2 gives the desired second inequality.
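In practice, once R̂n(F) has been estimated (for instance by Monte Carlo, as in the earlier sketch), the second bound of Theorem 6 is a one-line computation; the inputs below are placeholder values chosen only for illustration.

import numpy as np

def rademacher_generalization_bound(emp_mean, emp_rademacher, n, delta=0.05):
    # Second bound of Theorem 6: E f <= E_n f + 2 R_hat_n(F) + 5 sqrt(log(2/delta) / (2n)).
    return emp_mean + 2 * emp_rademacher + 5 * np.sqrt(np.log(2 / delta) / (2 * n))

# Placeholder numbers, not taken from any real experiment.
print(rademacher_generalization_bound(emp_mean=0.12, emp_rademacher=0.09, n=200))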

Theorem 7 (Ledoux-Talagrand contraction). Let f : R+ → R+ be convex and increasing. Let φi : R → R satisfy φi(0) = 0 and be Lipschitz with constant L, i.e., |φi(a) − φi(b)| ≤ L|a − b|. Let σi be independent Rademacher random variables. For any T ⊆ R^n,

E[ f( (1/2) sup_{t∈T} | Σ_{i=1}^n σi φi(ti) | ) ] ≤ E[ f( L · sup_{t∈T} | Σ_{i=1}^n σi ti | ) ].

Proof First, note that if T is unbounded, there will be some setting of the σi for which sup_{t∈T} | Σ_{i=1}^n σi ti | = ∞. This event does not have probability zero, and since f is increasing and convex, the right-hand expectation will be infinite. We can thus focus on bounded T.
We begin by showing a statement similar to the theorem, namely that if g : R → R is convex and increasing, then

E[ g( sup_{t∈T} Σ_{i=1}^n σi φi(ti) ) ] ≤ E[ g( L sup_{t∈T} Σ_{i=1}^n σi ti ) ].    (7)

By conditioning, we note that if we prove, for T ⊆ R²,

E[ g( sup_{t∈T} ( t1 + σ2 φ2(t2) ) ) ] ≤ E[ g( sup_{t∈T} ( t1 + Lσ2 t2 ) ) ],    (8)

we are done. This follows because we will almost surely have

E[ g( sup_{t∈T} ( σ1 φ1(t1) + σ2 φ2(t2) ) ) | σ1 ] ≤ E[ g( sup_{t∈T} ( σ1 φ1(t1) + Lσ2 t2 ) ) | σ1 ],

as σ1 φ1(t1) simply transforms T (and is still bounded). By conditioning, this implies that

E[ g( sup_{t∈T} ( σ1 φ1(t1) + σ2 φ2(t2) ) ) ] ≤ E[ g( sup_{t∈T} ( σ1 φ1(t1) + Lσ2 t2 ) ) ],

and we can iteratively apply this.


Thus we focus on proving Eq. (8). Define I(t, s) := (1/2)g(t1 + φ(t2)) + (1/2)g(s1 − φ(s2)); if we show that the right side of Eq. (8) is larger than I(t, s) for all t, s ∈ T, clearly we are done (as the left side of Eq. (8) is exactly the expectation of these terms with
respect to the Rademacher random variable σ2 ). Noting that we are taking a supremum over t and s in I,
we can assume w.l.o.g. that
t1 + φ(t2 ) ≥ s1 + φ(s2 ) and s1 − φ(s2 ) ≥ t1 − φ(t2 ). (9)
We define four quantities and then proceed through four cases to prove Eq. (8):
a = s1 − φ(s2 ), b = s1 − Ls2 , a′ = t1 + Lt2 , b′ = t1 + φ(t2 ).
We would like to show that 2I(t, s) = g(a) + g(b′ ) ≤ g(a′ ) + g(b).
Case I. Let t2 ≥ 0 and s2 ≥ 0. We know that, as φ(0) = 0, |φ(s2 )| ≤ Ls2 . This implies that a ≥ b and
Eq. (9) implies that b′ = t1 + φ(t2 ) ≥ s1 + φ(s2 ) ≥ s1 − Ls2 = b. Now assume that t2 ≥ s2 . In this case,
b′ + a − b = t1 + φ(t2) + s1 − φ(s2) − s1 + Ls2 = t1 + φ(t2) − φ(s2) + Ls2 ≤ t1 + L(t2 − s2) + Ls2 = t1 + Lt2 = a′,
since |φ(t2) − φ(s2)| ≤ L|t2 − s2| = L(t2 − s2). Thus a − b ≤ a′ − b′. Note that g(y + x) − g(y) is increasing in y if x ≥ 0.¹ Letting x = a − b ≥ 0 and noting that b′ ≥ b,
in y if x ≥ 0.1 Letting x = a − b ≥ 0 and noting that b′ ≥ b,
g(a) − g(b) = g(b + x) − g(b) ≤ g(b′ + x) − g(b′ ) = g(b′ + a − b) − g(b′ ) ≤ g(a′ ) − g(b′ )
so that g(a) + g(b′ ) ≤ g(a′ ) + g(b). If s2 ≥ t2 , then we use −φ instead of φ and switch the roles of s and t,
giving a similar proof.
Case II. Let t2 ≤ 0 and s2 ≤ 0. This is similar to the above case, switching signs as necessary, so we
omit it.
Case III. Let t2 ≥ 0 and s2 ≤ 0. We have φ(t2 ) ≤ Lt2 and −φ(s2 ) ≤ −Ls2 by the Lipschitz condition
on φ. This implies that
2I(t, s) = g(t1 + φ(t2 )) + g(s1 − φ(s2 )) ≤ g(t1 + Lt2 ) + g(s1 − Ls2 ).
Case IV. Let t2 ≤ 0 and s2 ≥ 0. Similar to above, we have −φ(s2) ≤ Ls2 and φ(t2) ≤ −Lt2, so 2I(t, s) ≤ g(t1 − Lt2) + g(s1 + Ls2), which is symmetric to the above. We have thus proved Eq. (7).
We now conclude the proof. Denoting [x]+ = x if x ≥ 0 (and 0 otherwise) and [x]− = −x if x ≤ 0 (and 0 otherwise), we note that, since f is increasing and convex,

f( (1/2) sup_{x∈X} |x| ) = f( (1/2) sup_{x∈X} ( [x]+ + [x]− ) ) ≤ f( (1/2) sup_{x∈X} [x]+ + (1/2) sup_{x∈X} [x]− ) ≤ (1/2) f( sup_{x∈X} [x]+ ) + (1/2) f( sup_{x∈X} [x]− ).

The above equation implies

E[ f( (1/2) sup_{t∈T} | Σ_{i=1}^n σi φi(ti) | ) ] ≤ (1/2) E[ f( [ sup_{t∈T} Σ_{i=1}^n σi φi(ti) ]+ ) ] + (1/2) E[ f( [ sup_{t∈T} Σ_{i=1}^n σi φi(ti) ]− ) ]
    ≤ E[ f( [ sup_{t∈T} Σ_{i=1}^n σi φi(ti) ]+ ) ].

The last step uses the symmetry of σi and the fact that [−x]− = [x]+ .
Finally, note that f ([·]+ ) is convex, increasing on R, and f (supx [x]+ ) = f ([supx x]+ ). Applying Eq. (7),
we have
E[ f( [ sup_{t∈T} Σ_{i=1}^n σi φi(ti) ]+ ) ] ≤ E[ f( [ L sup_{t∈T} Σ_{i=1}^n σi ti ]+ ) ] ≤ E[ f( L sup_{t∈T} | Σ_{i=1}^n σi ti | ) ].

This is a simple extension of Theorem 4.13 of [2], but I include the entire theorem here because its proof
is somewhat interesting, and it is often cited. For instance, it gives the following corollary:
¹ To see this, note that the slope of g (the right or left derivative or the subgradient set) is always increasing, so for x, d > 0, we have g(y + d + x) − g(y + x) ≥ g(y + d) − g(y).

Corollary 5.1. Let φ be an L-Lipschitz map from R to R with φ(0) = 0 and F be a function class with
domain X. Let φ ◦ F = {φ ◦ f : f ∈ F} denote their composition. Then

Rn (φ ◦ F) ≤ 2LRn (F).

The corollary follows by taking the convex increasing function in Theorem 7 to be the identity and letting the space T ⊆ R^n be {(f(X1), . . . , f(Xn)) : f ∈ F}.

6 Growth Functions, VC Theory, and Rademacher Complexity


In this section, we will be using a sample set S = (x1, . . . , xn) and a hypothesis class H of functions mapping the sample space X to {−1, 1}. We focus on what is known as the growth function ΠH. We define ΠH(S) to be the set of dichotomies of H on the set S, that is,

ΠH(S) := { ⟨h(x1), . . . , h(xn)⟩ : h ∈ H }.

With this, we make the following definition.

Definition 6.1. The growth function ΠH(n) of a hypothesis class H is the maximum number of dichotomies of the hypothesis class H over samples of size n, that is,

ΠH(n) := max_{S : |S| = n} |ΠH(S)|.

Clearly, we have ΠH(n) ≤ |H| and ΠH(n) ≤ 2^n. With the growth function in mind, we can bound the Rademacher complexity of certain function classes.
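To make the definition concrete, the Python sketch below (my own example, not from the notes) enumerates ΠH(S) by brute force for a small class of one-dimensional thresholds with both orientations; on distinct points this class realizes 2n dichotomies out of the 2^n possible labelings.

import numpy as np

def dichotomies(S, hypotheses):
    # The set Pi_H(S): distinct label vectors realized on the sample S.
    return {tuple(h(x) for x in S) for h in hypotheses}

# Illustrative hypothesis class: h_{theta,s}(x) = s if x > theta else -s, labels in {-1, +1}.
S = np.array([0.3, 1.1, 2.7, 3.5, 4.2])
thetas = np.concatenate(([-np.inf], (S[:-1] + S[1:]) / 2, [np.inf]))  # one threshold per gap
hypotheses = [
    (lambda x, t=t, s=s: s if x > t else -s)
    for t in thetas for s in (-1, 1)
]

print(len(dichotomies(S, hypotheses)), "dichotomies out of", 2 ** len(S), "possible labelings")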
Lemma 6.1. Let H be a class of functions mapping from X to {−1, 1}. If H satisfies h ∈ H ⇒ −h ∈ H, then

Rn(H) ≤ √( 2 log ΠH(n) / n ).

If H does not satisfy h ∈ H ⇒ −h ∈ H, then

Rn(H) ≤ √( 2 log(2ΠH(n)) / n ).

Proof Note that if we let A = {[h(X1) ··· h(Xn)]⊤ : h ∈ H} and −A = {−a : a ∈ A}, then

sup_{h∈H} | Σ_{i=1}^n σi h(Xi) | = max_{a∈A} | Σ_{i=1}^n σi ai | = max_{a∈A∪−A} Σ_{i=1}^n σi ai,

so that ‖a‖ = √n for a ∈ A and Massart's finite class lemma (lemma 4.1) imply

E[ sup_{h∈H} | (1/n) Σ_{i=1}^n σi h(Xi) | | X1, . . . , Xn ] ≤ (√n/n) √( 2 log( 2 |{[h(X1) ··· h(Xn)]⊤ : h ∈ H}| ) ) ≤ √( 2 log(2ΠH(n)) / n ).

Thus E[R̂n(H)] = Rn(H) implies the theorem. (When h ∈ H ⇒ −h ∈ H, we have A ∪ −A = A, so the factor of 2 inside the logarithm is unnecessary, giving the first bound.)

Definition 6.2. A hypothesis class H shatters a finite set S ⊆ X if |ΠH(S)| = 2^{|S|}.

Intuitively, H shatters a set S if for every labeling of S, there is an h ∈ H that realizes that labeling.
This notion of shattering leads us to a new notion of the complexity of a hypothesis class.

Definition 6.3. The Vapnik-Chervonenkis dimension, or VC dimension, of a hypothesis class H on a set X is the cardinality of the largest set shattered by H, that is, the largest n such that there exists a set S ⊆ X with |S| = n that H shatters.

As a shorthand, we will use dVC (H) to denote the VC-dimension of a class of functions.

Theorem 8 (Sauer's lemma). Let H be a class of functions mapping X to {−1, 1} and let dVC(H) = d. Then

ΠH(n) ≤ Σ_{i=0}^d (n choose i),

and for n ≥ d,

ΠH(n) ≤ (en/d)^d.
Proof The proof of Sauer's lemma is a completely combinatorial argument. We prove the lemma by induction on the sum n + d, beginning from n = 0 or d = 0 as our base cases. For notational convenience, we first define Φd(n) = Σ_{i=0}^d (n choose i).
Suppose that n = 0. Then ΠH(n) = ΠH(0) = 1, the degenerate labeling of the empty set, and Φd(0) = (0 choose 0) = 1. When d = 0, not even a single point can be shattered, so every h ∈ H induces the same labeling on any S, and ΠH(n) = 1.
Now we assume that for any n′, d′ with n′ + d′ < n + d, the first inequality holds. We want to construct hypothesis spaces Hi that are smaller than H so that we can use our inductive hypothesis. To this end, we represent the labelings of H as a table and perform operations on said table. So let S = {x1, . . . , xn} be the dataset, and let S1 = {x1, . . . , xn−1} be S shrunk by removing xn. Now let H1 be the set of hypotheses restricted to S1, as in Fig. 1. We see that dVC(H1) ≤ dVC(H), because any set that H1 shatters, H must also be able to shatter. Thus, by induction, we have |ΠH1(S1)| ≤ Φd(n − 1).
Now let H2 be the collection of hypotheses that were "collapsed" going from H to H1. In the example of Fig. 1, H2 consists of h1 and h4, as they were collapsed. In particular, for each collapsed hypothesis there was an h ∈ H with h(xn) = 1 and another h ∈ H with h(xn) = −1 agreeing with it on S1, whereas un-collapsed hypotheses do not have this. The hypotheses in H2 are also restricted to S2 = S1, and |ΠH2(S2)| = |H2|. Since the original H had hypotheses labeling xn as both ±1, any dataset T that H2 shatters will also have T ∪ {xn} shattered by H. In other words, the VC-dimension of H is strictly greater than that of H2, so that dVC(H2) ≤ d − 1. By the inductive hypothesis, |ΠH2(S2)| ≤ Φd−1(n − 1).

        H (x1 . . . xn)        H1 (x1 . . . xn−1)        H2 (x1 . . . xn−1)
  h1:  −1  1  1 −1 −1          −1  1  1 −1               −1  1  1 −1
  h2:  −1  1  1 −1  1          (collapses into h1)
  h3:  −1  1  1  1 −1          −1  1  1  1
  h4:   1 −1 −1  1 −1           1 −1 −1  1                1 −1 −1  1
  h5:   1 −1 −1  1  1          (collapses into h4)
  h6:   1  1 −1 −1  1           1  1 −1 −1

Figure 1: Hypothesis tables for the proof of Sauer's Lemma

Combining the previous two paragraphs and noting that by construction the number of labelings |ΠH(S)| of H on S is simply the sum of the sizes of H1 and H2,

|ΠH(S)| = |H1| + |H2| ≤ Φd(n − 1) + Φd−1(n − 1)
        = Σ_{i=0}^d (n−1 choose i) + Σ_{i=0}^{d−1} (n−1 choose i) = Σ_{i=0}^d (n−1 choose i) + Σ_{i=0}^d (n−1 choose i−1)
        = Σ_{i=0}^d (n choose i) = Φd(n).

The equality in the second line follows because (n−1 choose −1) = 0, and the third line follows via the combinatorial identity (n−1 choose i) + (n−1 choose i−1) = (n choose i). As S was arbitrary, this completes the proof of the first part of the theorem.
Now suppose that n ≥ d ≥ 1. Then Φd(n)(d/n)^d = Σ_{i=0}^d (n choose i)(d/n)^d and (n choose i)(d/n)^d ≤ (n choose i)(d/n)^i since d ≤ n. Thus,

Φd(n)(d/n)^d ≤ Σ_{i=0}^d (n choose i)(d/n)^i ≤ Σ_{i=0}^n (n choose i)(d/n)^i
             = (1 + d/n)^n ≤ (e^{d/n})^n = e^d.

The second line follows via an application of the binomial theorem, and its inequality is a consequence of 1 + x ≤ e^x for all x. Multiplying both sides by (n/d)^d gives the desired result.
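As a quick numerical check (my own, with arbitrary values of d and n), the sketch below computes Φd(n) exactly and compares it with the (en/d)^d bound.

import math

def phi(d, n):
    # Phi_d(n) = sum_{i=0}^{d} C(n, i), the bound on the growth function in Sauer's lemma.
    return sum(math.comb(n, i) for i in range(d + 1))

for d, n in [(3, 10), (3, 100), (10, 1000)]:
    print(f"d={d}, n={n}: Phi_d(n) = {phi(d, n)}, (en/d)^d = {(math.e * n / d) ** d:.3e}")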

By combining Sauer's lemma and a simple application of the Ledoux-Talagrand contraction (via Corollary 5.1), we can derive bounds on the expected loss of a classifier. Let H be a class of {−1, 1}-valued functions, and let examples be drawn from a distribution P(X, Y) where Y ∈ {−1, 1} are labels for X. Then a classifier h ∈ H makes a mistake if and only if Yh(X) = −1. As such, the function [1 − Yh(X)]+ ≥ 1{Y ≠ h(X)}, and [·]+ has Lipschitz constant 1. Thus, we have

P(h(X) ≠ Y) = E[1{Y ≠ h(X)}] ≤ E[1 − Yh(X)]+.    (10)
Further, for a Rademacher random variable σi, σiYih(Xi) has the same distribution as σih(Xi) (since Yi ∈ {−1, 1} and σi is symmetric around 0). Thus, Rn(Y · H) = Rn(H). Further, φ(x) = [1 − x]+ − 1 is 1-Lipschitz and satisfies φ(0) = 0, so Corollary 5.1 implies

Rn(φ ◦ (Y · H)) ≤ 2Rn(Y · H) = 2Rn(H).
Combining this with Eq. (10), P(h(X) ≠ Y) − 1 = E[1{Y ≠ h(X)}] − 1 ≤ E[φ(Yh(X))], which by Theorem 6 gives that with probability at least 1 − δ,

P(h(X) ≠ Y) − 1 ≤ En[1 − Yh(X)]+ − 1 + 2Rn(φ ◦ (Y · H)) + √( log(1/δ) / (2n) ).

Clearly, we can add 1 to both sides of the above equation, and the empirical probability of a mistake P̂(h(X) ≠ Y) = En[1 − Yh(X)]+. Combining Sauer's lemma and the above two equations, we have proved the following theorem.
Theorem 9. Let H be a class of hypotheses on a space X with labels Y drawn according to a joint distribution P(X, Y). Then for any h ∈ H and given any sample S = {⟨x1, y1⟩, . . . , ⟨xn, yn⟩} drawn i.i.d. according to P, with probability at least 1 − δ over the sample S drawn,

P(h(X) ≠ Y) ≤ P̂(h(xi) ≠ yi) + 4Rn(H) + √( log(1/δ) / (2n) )
            ≤ P̂(h(xi) ≠ yi) + 4√( (2d log(en) − 2d log d) / n ) + √( log(1/δ) / (2n) ).

In short, we have everyone's favorite result that the estimated probability is close to the true probability:

P(h(X) ≠ Y) = P̂(h(X) ≠ Y) + O( √( (d log n + log(1/δ)) / n ) ).
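To get a feel for the numbers, the sketch below evaluates the second bound of Theorem 9 for a hypothetical classifier; the empirical error, VC dimension, and confidence level are placeholder values.

import numpy as np

def vc_bound(emp_err, d, n, delta=0.05):
    # Theorem 9: emp_err + 4*sqrt((2d log(en) - 2d log d)/n) + sqrt(log(1/delta)/(2n)).
    complexity = 4 * np.sqrt((2 * d * np.log(np.e * n) - 2 * d * np.log(d)) / n)
    confidence = np.sqrt(np.log(1 / delta) / (2 * n))
    return emp_err + complexity + confidence

for n in [1_000, 100_000, 10_000_000]:
    print(f"n={n}: bound on P(h(X) != Y) <= {vc_bound(emp_err=0.05, d=10, n=n):.4f}")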

References
[1] P. Billingsley, Probability and Measure, Third Edition, Wiley 1995.

[2] M. Ledoux and M. Talagrand, Probability in Banach Spaces, Springer Verlag 1991.
