Probability Bounds

John Duchi

This document starts from simple probabilistic inequalities (Markov's Inequality) and builds up through several stronger concentration results, developing a few ideas about Rademacher complexity, until we give proofs of the main Vapnik-Chervonenkis complexity results for learning theory. Many of these proofs are based on Peter Bartlett's lectures for CS281b at Berkeley or Rob Schapire's lectures at Princeton. The aim is to have one self-contained document containing some of the standard uniform convergence results for learning theory.

1 Preliminaries
We begin this document with a few (nearly trivial) preliminaries which will allow us to make very strong
claims on distributions of sums of random variables.
Theorem 1 (Markov's Inequality). For a nonnegative random variable X and t > 0,

P[X ≥ t] ≤ E[X] / t.
Proof For t > 0,

E[X] = ∫_X x P(dx) ≥ ∫_t^∞ x P(dx) ≥ ∫_t^∞ t P(dx) = t P[X ≥ t].
One very powerful consequence of Markov's Inequality is the Chernoff method, which uses the fact that for any s ≥ 0,

P(X ≥ t) = P(e^{sX} ≥ e^{st}) ≤ E[e^{sX}] / e^{st}.    (1)

The inequality above is a simple consequence of Markov's Inequality applied to the nonnegative random variable e^{sX}, using the fact that e^z > 0 for all z ∈ R.
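As a quick numerical illustration, the following Python sketch compares the Markov bound, the Chernoff bound optimized over s, and the empirical tail probability for an Exp(1) random variable; the distribution, thresholds, and sample size are arbitrary choices made only for this example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)   # Exp(1): E[X] = 1 and E[e^{sX}] = 1/(1 - s) for s < 1

for t in [2.0, 4.0, 8.0]:
    empirical = (X >= t).mean()
    markov = 1.0 / t                              # E[X] / t
    s = 1.0 - 1.0 / t                             # minimizer of e^{-st} / (1 - s) over 0 < s < 1
    chernoff = np.exp(-s * t) / (1.0 - s)
    print(f"t={t}: empirical={empirical:.4g}, Markov={markov:.4g}, Chernoff={chernoff:.4g}")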

2 Hoeffding’s Bounds
Lemma 2.1 (Hoeffding's Lemma). Given a random variable X with a ≤ X ≤ b and E[X] = 0, then for any s > 0, we have

E[e^{sX}] ≤ e^{s²(b−a)²/8}.
Proof Given any x such that a ≤ x ≤ b, we can define λ ∈ [0, 1] as

λ = (b − x)/(b − a).

Thus, we see that (b − a)λ = b − x, so that x = b − λ(b − a) = λa + (1 − λ)b. As such, we know that sx = sλa + s(1 − λ)b. So the convexity of exp(·) implies

e^{sx} = e^{λsa+(1−λ)sb} ≤ λe^{sa} + (1 − λ)e^{sb} = ((b − x)/(b − a)) e^{sa} + ((x − a)/(b − a)) e^{sb}.
Using the above and the fact that E[X] = 0,

E[e^{sX}] ≤ E[ ((b − X)/(b − a)) e^{sa} + ((X − a)/(b − a)) e^{sb} ] = (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb}.    (2)
Now, we let p = −a/(b − a) (noting that a ≤ 0 as E[X] = 0 and hence p ∈ [0, 1]), and we have 1 − p = b/(b − a), giving

(b/(b − a)) e^{sa} − (a/(b − a)) e^{sb} = (1 − p)e^{sa} + pe^{sb} = (1 − p + pe^{sb−sa})e^{sa}.
Solving for a in p = −a/(b − a), we find that a = −p(b − a), so

(1 − p + pe^{sb−sa})e^{sa} = (1 − p + pe^{s(b−a)})e^{−ps(b−a)}.

Defining u = s(b − a) and

φ(u) := −ps(b − a) + log(1 − p + pe^{s(b−a)}) = −pu + log(1 − p + pe^u),

and using equation (2), we have that

E[e^{sX}] ≤ e^{φ(u)}.

If we can upper bound φ(u), then we are done. Of course, by Taylor's theorem, there is some z ∈ [0, u] such that

φ(u) = φ(0) + uφ′(0) + (1/2)u²φ′′(z) ≤ φ(0) + uφ′(0) + (1/2)u² sup_z φ′′(z).    (3)
Taking derivatives,

φ′(u) = −p + pe^u/(1 − p + pe^u),    φ′′(u) = pe^u/(1 − p + pe^u) − p²e^{2u}/(1 − p + pe^u)² = p(1 − p)e^u/(1 − p + pe^u)².
Since φ′(0) = −p + p = 0 and φ(0) = 0, it remains to maximize φ′′(u). Substituting z for e^u, note that φ′′ is a ratio of a linear function of z to a quadratic in z; it is nonnegative, vanishes at z = 0, and tends to 0 as z → ∞, so its maximum over z > 0 occurs at an interior critical point. Differentiating with respect to z,

d/dz [ p(1 − p)z/(1 − p + pz)² ] = p(1 − p)/(1 − p + pz)² − 2p²(1 − p)z/(1 − p + pz)³
                                 = [ p(1 − p)(1 − p + pz) − 2p²(1 − p)z ] / (1 − p + pz)³
                                 = p(1 − p)(1 − p − pz) / (1 − p + pz)³,

so that the critical point is at z = e^u = (1 − p)/p. Substituting,

φ′′(u) ≤ p(1 − p)·((1 − p)/p) / (1 − p + p·((1 − p)/p))² = (1 − p)² / (4(1 − p)²) = 1/4.
Using equation (3), it is evident that φ(u) ≤ (1/2)u² · (1/4) = (1/8)s²(b − a)². This completes the proof of the lemma, as we have

E[e^{sX}] ≤ e^{φ(u)} ≤ e^{s²(b−a)²/8}.

Now we prove a slightly more general result than the standard Hoeffding bound using Chernoff’s method.

Theorem 2 (Hoeffding's Inequality). Suppose that X1, . . . , Xm are independent random variables with ai ≤ Xi ≤ bi. Then

P( (1/m) Σ_{i=1}^m Xi − (1/m) Σ_{i=1}^m E[Xi] ≥ ε ) ≤ exp( −2ε²m² / Σ_{i=1}^m (bi − ai)² ).
Proof The proof is fairly straightforward using lemma 2.1. First, define Zi = Xi −E[Xi ], so that E[Zi ] = 0
(and we can assume without loss of generality that the bound on Zi is still [ai , bi ]). Then
P( Σ_{i=1}^m Zi ≥ t ) = P( exp( s Σ_{i=1}^m Zi ) ≥ exp(st) ) ≤ E[ Π_{i=1}^m exp(sZi) ] / e^{st}
    = Π_{i=1}^m E[exp(sZi)] / e^{st} ≤ e^{−st} Π_{i=1}^m e^{s²(bi−ai)²/8}
    = exp( (s²/8) Σ_{i=1}^m (bi − ai)² − st ).
The first line is an application of the Chernoff method, and the second follows from lemma 2.1 and the fact that the Zi are independent.
If we substitute s = 4t/Σ_{i=1}^m (bi − ai)², which is evidently > 0, we find that

P( Σ_{i=1}^m Zi ≥ t ) ≤ exp( −2t² / Σ_{i=1}^m (bi − ai)² ).

Finally, letting t = εm gives our result.

We note that the above proof can be extended using the union bound and reproving the same bound for the lower tail (setting Zi′ = −Zi) to give

P( | Σ_{i=1}^m Xi − Σ_{i=1}^m E[Xi] | ≥ mε ) ≤ 2 exp( −2ε²m² / Σ_{i=1}^m (bi − ai)² ).
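As a sanity check, the Python sketch below (with arbitrarily chosen m, ε, and number of trials) estimates the two-sided deviation probability for the mean of m Bernoulli(1/2) variables and compares it with the Hoeffding bound, which here reduces to 2 exp(−2ε²m) since bi − ai = 1.

import numpy as np

rng = np.random.default_rng(0)
m, eps, trials = 200, 0.1, 100_000

# Each row is one experiment of m Bernoulli(1/2) draws, so ai = 0 and bi = 1.
X = rng.integers(0, 2, size=(trials, m))
deviation = np.abs(X.mean(axis=1) - 0.5)

empirical = (deviation >= eps).mean()
hoeffding = 2 * np.exp(-2 * eps**2 * m)   # 2 exp(-2 eps^2 m^2 / sum_i (b_i - a_i)^2) with b_i - a_i = 1
print(f"empirical = {empirical:.4f}, Hoeffding bound = {hoeffding:.4f}")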

3 McDiarmid’s Inequality
This more general result, of which Hoeffding’s Inequality can be seen as a special case, is very useful in
learning theory and other domains. The statement of the theorem is this:

Theorem 3 (McDiarmid's Inequality). Let X = X1, . . . , Xm be m independent random variables taking values in some set A, and assume that f : A^m → R satisfies the following boundedness condition (bounded differences):

sup_{x1,...,xm,x̂i} | f(x1, x2, . . . , xi, . . . , xm) − f(x1, x2, . . . , x̂i, . . . , xm) | ≤ ci

for all i ∈ {1, . . . , m}. Then for any ε > 0, we have

P[ f(X1, . . . , Xm) − E[f(X1, . . . , Xm)] ≥ ε ] ≤ exp( −2ε² / Σ_{i=1}^m ci² ).

Proof The proof of this result begins by introducing some notation. First, let X = {X1 , . . . , Xm } and
Xi:j = {Xi , . . . , Xj }. Further, let Z0 = E[f (X)], Zi = E[f (X) | X1 , . . . , Xi ], and (naturally) Zm = f (X).
We now prove the following claim:

Claim 3.1.

E[ exp(s(Zk − Zk−1)) | X1, . . . , Xk−1 ] ≤ exp( s²ck²/8 ).
Proof of Claim 3.1 First, let

Uk = sup_u { E[f(X) | X1, . . . , Xk−1, u] − E[f(X) | X1, . . . , Xk−1] },
Lk = inf_l { E[f(X) | X1, . . . , Xk−1, l] − E[f(X) | X1, . . . , Xk−1] }

and note that

Uk − Lk ≤ sup_{l,u} { E[f(X) | X1, . . . , Xk−1, u] − E[f(X) | X1, . . . , Xk−1, l] }
        ≤ sup_{l,u} ∫_{y_{k+1:m}} [ f(X1:k−1, u, y_{k+1:m}) − f(X1:k−1, l, y_{k+1:m}) ] Π_{j=k+1}^m p(Xj = yj)
        ≤ ck ∫_{y_{k+1:m}} Π_{j=k+1}^m p(Xj = yj) = ck.

The second line follows because X1 , . . . , Xm are independent, and the last line follows by Jensen’s inequality
and the boundedness condition on f . Thus, Lk ≤ Zk − Zk−1 ≤ Uk , so Zk − Zk−1 ≤ ck . By lemma 2.1, as
E[Zk − Zk−1 | X1:k−1] = E[ E[f(X) | X1:k] | X1:k−1 ] − E[f(X) | X1:k−1]
                      = E[f(X) | X1:k−1] − E[f(X) | X1:k−1] = 0,

our claim follows.
Now we simply proceed through a series of inequalities.


P[f(X) − E[f(X)] ≥ ε] ≤ e^{−sε} E[ exp( s(f(X) − E[f(X)]) ) ]
    = e^{−sε} E[ exp( s(Zm − Zm−1 + Zm−1 − ··· + Z1 − Z0) ) ] = e^{−sε} E[ exp( s Σ_{i=1}^m (Zi − Zi−1) ) ]
    = e^{−sε} E[ E[ exp( s Σ_{i=1}^m (Zi − Zi−1) ) | X1:m−1 ] ]
    = e^{−sε} E[ exp( s Σ_{i=1}^{m−1} (Zi − Zi−1) ) E[ e^{s(Zm − Zm−1)} | X1:m−1 ] ]
    ≤ e^{−sε} E[ exp( s Σ_{i=1}^{m−1} (Zi − Zi−1) ) ] e^{s²cm²/8}
    ≤ e^{−sε} Π_{i=1}^m exp( s²ci²/8 ) = exp( −sε + s² Σ_{i=1}^m ci²/8 ).

The third line follows because of the properties of expectation (that is, that E[g(X, Y )] = EX [EY [g(X, Y ) |
X]]), the fourth because of our independence assumptions, and the fifth and sixth by repeated applications
of claim 3.1.
Minimizing the last equation with respect to s, we take the derivative and find that
(d/ds) exp( −sε + s² Σ_{i=1}^m ci²/8 ) = exp( −sε + s² Σ_{i=1}^m ci²/8 ) ( −ε + 2s Σ_{i=1}^m ci²/8 ),

which has a critical point at s = 4ε/Σ_{i=1}^m ci². Substituting, we see that

exp( −sε + s² Σ_{i=1}^m ci²/8 ) = exp( −4ε²/Σ_{i=1}^m ci² + 16ε² Σ_{i=1}^m ci² / (8 (Σ_{i=1}^m ci²)²) ) = exp( −2ε² / Σ_{i=1}^m ci² ).

This completes our proof.

As a quick note, it is worth mentioning that Hoeffding's inequality follows by applying McDiarmid's inequality to the function f(x1, . . . , xm) = Σ_{i=1}^m xi.
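For a bounded-differences example that is not a sum, the Python sketch below (my own illustration, with arbitrary parameters) takes f to be the number of distinct values among m uniform draws from {1, . . . , k}. Changing a single coordinate changes f by at most 1, so with ci = 1 McDiarmid gives P(f − E[f] ≥ ε) ≤ exp(−2ε²/m).

import numpy as np

rng = np.random.default_rng(0)
m, k, trials, eps = 100, 100, 50_000, 5

# f(x_1, ..., x_m) = number of distinct values; changing one coordinate moves f by at most c_i = 1.
samples = rng.integers(0, k, size=(trials, m))
f_vals = np.array([len(np.unique(row)) for row in samples])

mean_f = f_vals.mean()                     # stand-in for E[f] (estimated, not exact)
empirical = (f_vals - mean_f >= eps).mean()
mcdiarmid = np.exp(-2 * eps**2 / m)        # exp(-2 eps^2 / sum_i c_i^2) with c_i = 1
print(f"empirical = {empirical:.4f}, McDiarmid bound = {mcdiarmid:.4f}")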

4 Glivenko-Cantelli Theorem
In this section, we give a proof of the Glivenko-Cantelli theorem, which gives uniform convergence of the empirical distribution function of a random variable X to the true distribution function. There are many ways
of proving this; for another example, see [1, Theorem 20.6]. Our proof makes use of Rademacher random
variables and gives a convergence rate as well. First, though, we need a Borel-Cantelli lemma.
Theorem 4 (Borel-Cantelli Lemma I). Let An be a sequence of subsets of some probability space Ω. If Σ_{n=1}^∞ P(An) < ∞, then P(An i.o.) = 0.

Proof Let N = Σ_{n=1}^∞ 1{An}, the number of events that occur. By Fubini's theorem, we have E[N] = Σ_{n=1}^∞ P(An) < ∞, so N < ∞ almost surely.

Now for the real proof. Let Fn (x) be the empirical distribution function of a sequence of i.i.d. random
variables X1 , . . . , Xn , that is,
Fn(x) := (1/n) Σ_{i=1}^n 1{Xi ≤ x},

and let F be the true distribution. We have the following theorem.
Theorem 5 (Glivenko-Cantelli). As n → ∞, sup_x |Fn(x) − F(x)| → 0 almost surely.
Proof Our proof has three main parts. First is McDiarmid’s concentration inequality, then we symmetrize
using Rademacher random variables, and lastly we show the class of functions we use is “small” by ordering
the data we see.
To begin, define the function class

G := { g : x ↦ 1{x ≤ θ}, θ ∈ R }

and note that there is a one-to-one mapping between G and R. Now, define En g = (1/n) Σ_{i=1}^n g(Xi). The theorem is equivalent to sup_{g∈G} |En g − Eg| → 0 almost surely for any probability measure P, and the rates of convergence will be identical.
We begin with the concentration result. Let f(X1, . . . , Xn) = sup_{g∈G} |En g − Eg|, and note that changing any one of the n data points arbitrarily changes En g, and hence f, by at most 1/n. Thus, McDiarmid's inequality (theorem 3) implies that

sup_{g∈G} |En g − Eg| ≤ E[ sup_{g∈G} |En g − Eg| ] + ε    (4)

with probability at least 1 − e^{−2ε²n}.
We now use Rademacher random variables and symmetrization to get a handle on the term

E[ sup_{g∈G} |En g − Eg| ].    (5)

It is hard to directly show that this converges to zero, but we can use symmetrization to upper bound Eq. (5).
To this end, let Y1 , . . . , Yn be n independent copies of X1 , . . . , Xn . We have
E[ sup_{g∈G} |En g − Eg| ] = E[ sup_{g∈G} | (1/n) Σ_{i=1}^n g(Xi) − (1/n) Σ_{i=1}^n E[g(Yi)] | ]
    = E[ sup_{g∈G} | (1/n) Σ_{i=1}^n ( g(Xi) − E[g(Yi) | X1, . . . , Xn] ) | ]
    = E[ sup_{g∈G} | E[ (1/n) Σ_{i=1}^n ( g(Xi) − g(Yi) ) | X1, . . . , Xn ] | ]
    ≤ E[ E[ sup_{g∈G} | (1/n) Σ_{i=1}^n ( g(Xi) − g(Yi) ) | | X1, . . . , Xn ] ] = E[ sup_{g∈G} | (1/n) Σ_{i=1}^n ( g(Xi) − g(Yi) ) | ].
The second and third lines are almost sure and follow by properties of conditional expectation, and the last
inequality follows via convexity of | · | and sup.
We now proceed to remove dependence on g(Yi ) by the following steps. First, note that g(Xi ) − g(Yi ) is
symmetric around 0, so if σi ∈ {−1, 1}, σi (g(Xi ) − g(Yi )) has identical distribution. Thus, we can continue
our inequalities:
E[ sup_{g∈G} | (1/n) Σ_{i=1}^n ( g(Xi) − g(Yi) ) | ] = E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi ( g(Xi) − g(Yi) ) | ]
    ≤ E[ sup_{g∈G} ( | (1/n) Σ_{i=1}^n σi g(Xi) | + | (1/n) Σ_{i=1}^n σi g(Yi) | ) ]
    ≤ E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Xi) | + sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Yi) | ]
    = 2 E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Xi) | ].    (6)

The last expectation involves the maximum inner product between the vectors [σ1 ··· σn]⊤ and [g(X1) ··· g(Xn)]⊤. This is an indication of how well the class of vectors {[g(X1) ··· g(Xn)]⊤ : g ∈ G} can align with random directions σ1, . . . , σn, which are uniformly distributed on the corners of the n-cube.
Now what remains is to bound E[ sup_{g∈G} | Σ_i σi g(Xi) | ], which we do by noting that G is in a sense simple.
Fix the data set as (x1, . . . , xn) and consider the order statistics x(1), . . . , x(n). Note that x(i) ≤ x(i+1) implies that g(x(i)) = 1{x(i) ≤ θ} ≥ 1{x(i+1) ≤ θ} = g(x(i+1)), so

[g(x(1)) ··· g(x(n))]⊤ ∈ { [0 ··· 0]⊤, [1 0 ··· 0]⊤, . . . , [1 ··· 1]⊤ }.
The bijection between (x1 , . . . , xn ) and (x(1) , . . . , x(n) ) implies that the cardinality of the set {[g(x1 ), . . . , g(xn )]⊤ :
g ∈ G} is at most n + 1. We can thus use bounds relating the size of the class of functions to bound the
earlier expectations.
Lemma 4.1 (Massart's finite class lemma). Let A ⊂ R^n have |A| < ∞, R = max_{a∈A} ‖a‖, and σi be independent Rademacher variables. Then

E[ max_{a∈A} (1/n) Σ_{i=1}^n σi ai ] ≤ R√(2 log |A|) / n.
Proof of Lemma Let s > 0 and define Za := Σ_{i=1}^n σi ai. Because exp is convex, increasing, and positive,

exp( s E[ max_{a∈A} Za ] ) ≤ E[ exp( s max_{a∈A} Za ) ] = E[ max_{a∈A} exp(sZa) ] ≤ Σ_{a∈A} E[exp(sZa)].

Now we apply Hoeffding's lemma (lemma 2.1) by noting that σi ai ∈ [−|ai|, |ai|] with mean zero, and have

Σ_{a∈A} E[exp(sZa)] ≤ Σ_{a∈A} exp( s² Σ_{i=1}^n ai²/2 ) ≤ Σ_{a∈A} exp(s²R²/2) = |A| exp(s²R²/2).

Combining the above bound with the first string of inequalities, we have exp( s E[max_{a∈A} Za] ) ≤ |A| exp(s²R²/2), or

E[ max_{a∈A} Za ] ≤ inf_{s>0} ( log|A|/s + sR²/2 ).

Setting s = √(2 log|A| / R²), we have

E[ max_{a∈A} Za ] ≤ R log|A| / √(2 log|A|) + R√(2 log|A|)/2 = R√(2 log|A|)

and dividing by n gives the final bound.


Now, letting A = {[g(X1), . . . , g(Xn)]⊤ : g ∈ G}, we note that |A| ≤ n + 1 and R = max_{a∈A} ‖a‖ ≤ √n. Thus,

E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Xi) | ] = E[ E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Xi) | | X1, . . . , Xn ] ] ≤ √( 2 log(n + 1) / n ).
By the above equation, Eq. (6), and the McDiarmid inequality application in Eq. (4),

P( sup_{g∈G} |En g − Eg| > ε + 2√(2 log(n + 1)/n) ) ≤ P( sup_{g∈G} |En g − Eg| > ε + 2 E[ sup_{g∈G} | (1/n) Σ_{i=1}^n σi g(Xi) | ] )
    ≤ P( sup_{g∈G} |En g − Eg| > ε + E[ sup_{g∈G} |En g − Eg| ] )
    ≤ 2 exp(−2ε²n).
This implies that sup_{g∈G} |En g − Eg| → 0 in probability. To get almost sure convergence, choose n large enough that 2√(2 log(n + 1)/n) < ε and let An = { sup_{g∈G} |En g − Eg| > 2ε }. Then Σ_n P(An) < ∞, so An happens only finitely many times.
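To see this rate in action, the Python sketch below (my own, with arbitrary sample sizes) draws Uniform(0, 1) samples, computes sup_x |Fn(x) − F(x)| exactly from the order statistics, and compares its average with the quantity 2√(2 log(n + 1)/n) appearing in the proof.

import numpy as np

rng = np.random.default_rng(0)

def sup_deviation(n):
    # sup_x |F_n(x) - F(x)| for n Uniform(0,1) samples, where F(x) = x on [0, 1].
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # The supremum is attained just before or at an order statistic.
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

for n in [100, 1_000, 10_000]:
    avg_dev = np.mean([sup_deviation(n) for _ in range(200)])
    rate = 2 * np.sqrt(2 * np.log(n + 1) / n)
    print(f"n={n}: average sup|F_n - F| = {avg_dev:.4f}, 2*sqrt(2 log(n+1)/n) = {rate:.4f}")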

5 Rademacher Averages
Now we explore uses of Rademacher random variables to measure the complexity of a class of functions. We also use them to measure (to some extent) how well a function learned from data generalizes to the true distribution.
Definition 5.1 (Rademacher complexity). Let F be a function class with domain X, i.e. F ⊆ {f : X → R}, and let S = {X1, . . . , Xn} be a set of samples generated by a distribution P on X. The empirical Rademacher complexity of F is

R̂n(F) = E[ sup_{f∈F} | (1/n) Σ_{i=1}^n σi f(Xi) | | X1, . . . , Xn ],

where the σi are i.i.d. uniform random variables (Rademacher variables) on {−1, +1}. The Rademacher complexity of F is

Rn(F) = E[R̂n(F)] = E[ sup_{f∈F} | (1/n) Σ_{i=1}^n σi f(Xi) | ].
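As an illustration, the Python sketch below (my own, with an arbitrary sample size) estimates the empirical Rademacher complexity of the threshold class G = {x ↦ 1{x ≤ θ}} from Section 4 by Monte Carlo over the signs; for a fixed sample, the supremum over θ only needs to be evaluated at the n + 1 distinct dichotomies, which become prefix sums once the data are sorted.

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.sort(rng.uniform(size=n))     # fixed sample; sorting makes the thresholds easy to enumerate

def empirical_rademacher(X, n_sigma=2000):
    n = len(X)
    total = 0.0
    for _ in range(n_sigma):
        sigma = rng.choice([-1.0, 1.0], size=n)
        # For g_theta(x) = 1{x <= theta} and sorted X, sum_i sigma_i g_theta(X_i) is a prefix sum.
        prefix = np.concatenate(([0.0], np.cumsum(sigma)))   # the n + 1 distinct dichotomies
        total += np.max(np.abs(prefix)) / n
    return total / n_sigma

estimate = empirical_rademacher(X)
bound = np.sqrt(2 * np.log(2 * (n + 1)) / n)   # Massart-type bound, cf. Lemma 6.1 in Section 6
print(f"estimated empirical Rademacher complexity = {estimate:.4f}, bound = {bound:.4f}")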

Lemma 5.1.

E[ sup_{f∈F} | Ef − (1/n) Σ_{i=1}^n f(Xi) | ] ≤ 2Rn(F).

Proof As we did for the Glivenko-Cantelli theorem, introduce i.i.d. random variables Yi, i ∈ {1, . . . , n}, distributed as the Xi and independent of them. Letting EY denote expectation with respect to the Yi and σi be Rademacher variables,

E[ sup_{f∈F} | Ef − (1/n) Σ_{i=1}^n f(Xi) | ] = E[ sup_{f∈F} | (1/n) Σ_{i=1}^n E[f(Yi)] − (1/n) Σ_{i=1}^n f(Xi) | ]
    ≤ EX EY[ sup_{f∈F} | (1/n) Σ_{i=1}^n ( f(Xi) − f(Yi) ) | ]
    = E[ sup_{f∈F} | (1/n) Σ_{i=1}^n σi ( f(Xi) − f(Yi) ) | ]
    ≤ E[ sup_{f∈F} | (1/n) Σ_{i=1}^n σi f(Xi) | + sup_{f∈F} | (1/n) Σ_{i=1}^n σi f(Yi) | ] = 2Rn(F).
The first inequality follows from the convexity of | · | and sup, the second by the triangle inequality.

We can also use Rademacher complexity to bound the expected value of certain functions, which is often used in conjunction with loss functions or expected risks. For example, we have the following theorem dealing with bounded functions. Recall that En f(X) = (1/n) Σ_{i=1}^n f(Xi), where the Xi are given as a sample.
Theorem 6. Let δ ∈ (0, 1) and F be a class of functions mapping X to [0, 1]. Then with probability at least 1 − δ, all f ∈ F satisfy

Ef(X) ≤ En f(X) + 2Rn(F) + √( log(1/δ) / (2n) ).

Also with probability at least 1 − δ, all f ∈ F satisfy

Ef(X) ≤ En f(X) + 2R̂n(F) + 5√( log(2/δ) / (2n) ).
Proof Fix f ∈ F. Then we clearly have (by choosing f in the sup)

Ef(X) ≤ En f(X) + sup_{g∈F} ( Eg(X) − En g(X) ).

As each g(Xi) ∈ [0, 1], modifying one of the Xi can change En g(X), and hence the supremum, by at most 1/n. McDiarmid's inequality (Theorem 3) thus implies that

P( sup_{g∈F} (Eg(X) − En g(X)) − E[ sup_{g∈F} (Eg(X) − En g(X)) ] ≥ ε ) ≤ exp(−2ε²n).

Setting the right hand side bound equal to δ and solving for ε, we have

−2ε²n = log δ,  or  ε² = log(1/δ)/(2n),  so  ε = √( log(1/δ) / (2n) ).

That is, with probability at least 1 − δ, we have

sup_{g∈F} ( Eg(X) − En g(X) ) ≤ E[ sup_{g∈F} ( Eg(X) − En g(X) ) ] + √( log(1/δ) / (2n) ).

Applying lemma 5.1, we immediately see that the right hand expectation is bounded by 2Rn(F). This completes the proof of the first inequality in the theorem.
Now we need to bound Rn(F) with high probability using R̂n(F). First note that the above reasoning could have been done using probability δ/2, giving a bound with 2Rn(F) and √(log(2/δ)/(2n)) instead. Now we note that changing one example Xi changes R̂n(F) by at most 2/n (because one sign inside of R̂n(F) can change). Letting ci = 2/n in McDiarmid's inequality, we have

P( Rn(F) − R̂n(F) ≥ ε ) ≤ exp( −(1/2)ε²n ).

Again setting this equal to δ/2 and solving, we have (1/2)ε² = log(2/δ)/n, so that ε = 2√( log(2/δ) / (2n) ). We thus have with probability ≥ 1 − δ/2,

Rn(F) ≤ R̂n(F) + 2√( log(2/δ) / (2n) ).
Using the union bound with two events of probability at least 1 − δ/2 gives the desired second inequality.
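In practice, once R̂n(F) has been estimated (for instance by Monte Carlo, as in the earlier sketch), the second bound of Theorem 6 is a one-line computation; the inputs below are placeholder values chosen only for illustration.

import numpy as np

def rademacher_generalization_bound(emp_mean, emp_rademacher, n, delta=0.05):
    # Second bound of Theorem 6: E f <= E_n f + 2 R_hat_n(F) + 5 sqrt(log(2/delta) / (2n)).
    return emp_mean + 2 * emp_rademacher + 5 * np.sqrt(np.log(2 / delta) / (2 * n))

# Placeholder numbers, not taken from any real experiment.
print(rademacher_generalization_bound(emp_mean=0.12, emp_rademacher=0.09, n=200))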

Theorem 7 (Ledoux-Talagrand contraction). Let f : R+ → R+ be convex and increasing. Let φi : R → R satisfy φi(0) = 0 and be Lipschitz with constant L, i.e., |φi(a) − φi(b)| ≤ L|a − b|. Let σi be independent Rademacher random variables. For any T ⊆ R^n,

E[ f( (1/2) sup_{t∈T} | Σ_{i=1}^n σi φi(ti) | ) ] ≤ E[ f( L · sup_{t∈T} | Σ_{i=1}^n σi ti | ) ].

Proof First, note that if T is unbounded, there will be some setting of the σi for which sup_{t∈T} | Σ_{i=1}^n σi ti | = ∞. This event does not have probability zero, and since f is increasing and convex, the right-hand expectation will be infinite. We can thus focus on bounded T.
We begin by showing a statement similar to the theorem, namely that if g : R → R is convex and increasing, then

E[ g( sup_{t∈T} Σ_{i=1}^n σi φi(ti) ) ] ≤ E[ g( L sup_{t∈T} Σ_{i=1}^n σi ti ) ].    (7)

By conditioning, we note that if we prove, for T ⊆ R²,

E[ g( sup_{t∈T} ( t1 + σ2 φ2(t2) ) ) ] ≤ E[ g( sup_{t∈T} ( t1 + Lσ2 t2 ) ) ],    (8)

we are done. This follows because we will almost surely have

E[ g( sup_{t∈T} ( σ1 φ1(t1) + σ2 φ2(t2) ) ) | σ1 ] ≤ E[ g( sup_{t∈T} ( σ1 φ1(t1) + Lσ2 t2 ) ) | σ1 ],

as σ1 φ1(t1) simply transforms T (and is still bounded). By conditioning, this implies that

E[ g( sup_{t∈T} ( σ1 φ1(t1) + σ2 φ2(t2) ) ) ] ≤ E[ g( sup_{t∈T} ( σ1 φ1(t1) + Lσ2 t2 ) ) ],

and we can iteratively apply this.


Thus we focus on proving Eq. (8). Define I(t, s) := (1/2)g(t1 + φ(t2)) + (1/2)g(s1 − φ(s2)); if we show that the right side of Eq. (8) is larger than I(t, s) for all t, s ∈ T, clearly we are done (as the left side of Eq. (8) is exactly the expectation of these terms with
respect to the Rademacher random variable σ2 ). Noting that we are taking a supremum over t and s in I,
we can assume w.l.o.g. that
t1 + φ(t2 ) ≥ s1 + φ(s2 ) and s1 − φ(s2 ) ≥ t1 − φ(t2 ). (9)
We define four quantities and then proceed through four cases to prove Eq. (8):
a = s1 − φ(s2 ), b = s1 − Ls2 , a′ = t1 + Lt2 , b′ = t1 + φ(t2 ).
We would like to show that 2I(t, s) = g(a) + g(b′ ) ≤ g(a′ ) + g(b).
Case I. Let t2 ≥ 0 and s2 ≥ 0. We know that, as φ(0) = 0, |φ(s2 )| ≤ Ls2 . This implies that a ≥ b and
Eq. (9) implies that b′ = t1 + φ(t2 ) ≥ s1 + φ(s2 ) ≥ s1 − Ls2 = b. Now assume that t2 ≥ s2 . In this case,
b′ + a − b = t1 + φ(t2) + s1 − φ(s2) − s1 + Ls2 = t1 + φ(t2) − φ(s2) + Ls2 ≤ t1 + L(t2 − s2) + Ls2 = t1 + Lt2 = a′,
since |φ(t2) − φ(s2)| ≤ L|t2 − s2| = L(t2 − s2). Thus a − b ≤ a′ − b′. Note that g(y + x) − g(y) is increasing in y if x ≥ 0.¹ Letting x = a − b ≥ 0 and noting that b′ ≥ b,
in y if x ≥ 0.1 Letting x = a − b ≥ 0 and noting that b′ ≥ b,
g(a) − g(b) = g(b + x) − g(b) ≤ g(b′ + x) − g(b′ ) = g(b′ + a − b) − g(b′ ) ≤ g(a′ ) − g(b′ )
so that g(a) + g(b′ ) ≤ g(a′ ) + g(b). If s2 ≥ t2 , then we use −φ instead of φ and switch the roles of s and t,
giving a similar proof.
Case II. Let t2 ≤ 0 and s2 ≤ 0. This is similar to the above case, switching signs as necessary, so we
omit it.
Case III. Let t2 ≥ 0 and s2 ≤ 0. We have φ(t2 ) ≤ Lt2 and −φ(s2 ) ≤ −Ls2 by the Lipschitz condition
on φ. This implies that
2I(t, s) = g(t1 + φ(t2 )) + g(s1 − φ(s2 )) ≤ g(t1 + Lt2 ) + g(s1 − Ls2 ).
Case IV. Let t2 ≤ 0 and s2 ≥ 0. Similar to above, we have −φ(s2) ≤ Ls2 and φ(t2) ≤ −Lt2, so 2I(t, s) ≤ g(t1 − Lt2) + g(s1 + Ls2), which is symmetric to the above. We have thus proved Eq. (7).
We now conclude the proof. Denoting [x]+ = x if x ≥ 0 (and 0 otherwise) and [x]− = −x if x ≤ 0 (and 0 otherwise), we note that, since f is increasing and convex,

f( (1/2) sup_{x∈X} |x| ) = f( (1/2) sup_{x∈X} ( [x]+ + [x]− ) ) ≤ f( (1/2) sup_{x∈X} [x]+ + (1/2) sup_{x∈X} [x]− ) ≤ (1/2) f( sup_{x∈X} [x]+ ) + (1/2) f( sup_{x∈X} [x]− ).

The above equation implies

E[ f( (1/2) sup_{t∈T} | Σ_{i=1}^n σi φi(ti) | ) ] ≤ (1/2) E[ f( [ sup_{t∈T} Σ_{i=1}^n σi φi(ti) ]+ ) ] + (1/2) E[ f( [ sup_{t∈T} Σ_{i=1}^n σi φi(ti) ]− ) ]
    ≤ E[ f( [ sup_{t∈T} Σ_{i=1}^n σi φi(ti) ]+ ) ].

The last step uses the symmetry of σi and the fact that [−x]− = [x]+ .
Finally, note that f ([·]+ ) is convex, increasing on R, and f (supx [x]+ ) = f ([supx x]+ ). Applying Eq. (7),
we have
E[ f( [ sup_{t∈T} Σ_{i=1}^n σi φi(ti) ]+ ) ] ≤ E[ f( [ L sup_{t∈T} Σ_{i=1}^n σi ti ]+ ) ] ≤ E[ f( L sup_{t∈T} | Σ_{i=1}^n σi ti | ) ].

This is a simple extension of Theorem 4.13 of [2], but I include the entire theorem here because its proof
is somewhat interesting, and it is often cited. For instance, it gives the following corollary:
¹ To see this, note that the slope of g (the right or left derivative or the subgradient set) is always increasing, so for x, d > 0, we have g(y + d + x) − g(y + x) ≥ g(y + d) − g(y).

Corollary 5.1. Let φ be an L-Lipschitz map from R to R with φ(0) = 0 and F be a function class with
domain X. Let φ ◦ F = {φ ◦ f : f ∈ F} denote their composition. Then

Rn (φ ◦ F) ≤ 2LRn (F).

The corollary follows by taking the convex increasing function in Theorem 7 to be the identity and letting the space T ⊆ R^n be {(f(X1), . . . , f(Xn)) : f ∈ F}.

6 Growth Functions, VC Theory, and Rademacher Complexity


In this section, we will be using a sample set S = (x1, . . . , xn) and a hypothesis class H of functions mapping the sample space X to {−1, 1}. We focus on what is known as the growth function ΠH. We define ΠH(S) to be the set of dichotomies of H on the set S, that is,

ΠH(S) := { ⟨h(x1), . . . , h(xn)⟩ : h ∈ H }.

With this, we make the following definition.

Definition 6.1. The growth function ΠH(n) of a hypothesis class H is the maximum number of dichotomies of the hypothesis class H over samples of size n, that is,

ΠH(n) := max_{S : |S| = n} |ΠH(S)|.

Clearly, we have ΠH(n) ≤ |H| and ΠH(n) ≤ 2^n. With the growth function in mind, we can bound the Rademacher complexity of certain function classes.
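To make the definition concrete, the Python sketch below (my own example, not from the notes) enumerates ΠH(S) by brute force for a small class of one-dimensional thresholds with both orientations; on distinct points this class realizes 2n dichotomies out of the 2^n possible labelings.

import numpy as np

def dichotomies(S, hypotheses):
    # The set Pi_H(S): distinct label vectors realized on the sample S.
    return {tuple(h(x) for x in S) for h in hypotheses}

# Illustrative hypothesis class: h_{theta,s}(x) = s if x > theta else -s, labels in {-1, +1}.
S = np.array([0.3, 1.1, 2.7, 3.5, 4.2])
thetas = np.concatenate(([-np.inf], (S[:-1] + S[1:]) / 2, [np.inf]))  # one threshold per gap
hypotheses = [
    (lambda x, t=t, s=s: s if x > t else -s)
    for t in thetas for s in (-1, 1)
]

print(len(dichotomies(S, hypotheses)), "dichotomies out of", 2 ** len(S), "possible labelings")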
Lemma 6.1. Let H be a class of functions mapping from X to {−1, 1}. If H satisfies h ∈ H ⇒ −h ∈ H, then

Rn(H) ≤ √( 2 log ΠH(n) / n ).

If H does not satisfy h ∈ H ⇒ −h ∈ H, then

Rn(H) ≤ √( 2 log(2ΠH(n)) / n ).

Proof Note that if we let A = {[h(X1) ··· h(Xn)]⊤ : h ∈ H} and −A = {−a : a ∈ A}, then

sup_{h∈H} | Σ_{i=1}^n σi h(Xi) | = max_{a∈A} | Σ_{i=1}^n σi ai | = max_{a∈A∪−A} Σ_{i=1}^n σi ai,

so that ‖a‖ = √n for a ∈ A and Massart's finite class lemma (lemma 4.1) imply

E[ sup_{h∈H} | (1/n) Σ_{i=1}^n σi h(Xi) | | X1, . . . , Xn ] ≤ (√n/n) √( 2 log( 2 |{[h(X1) ··· h(Xn)]⊤ : h ∈ H}| ) ) ≤ √( 2 log(2ΠH(n)) / n ).

Thus E[R̂n(H)] = Rn(H) implies the theorem. (When h ∈ H ⇒ −h ∈ H, we have A ∪ −A = A, so the factor of 2 inside the logarithm is unnecessary, giving the first bound.)

Definition 6.2. A hypothesis class H shatters a finite set S ⊆ X if |ΠH(S)| = 2^{|S|}.

Intuitively, H shatters a set S if for every labeling of S, there is an h ∈ H that realizes that labeling.
This notion of shattering leads us to a new notion of the complexity of a hypothesis class.

Definition 6.3. The Vapnik-Chervonenkis dimension, or VC dimension, of a hypothesis class H on a set X is the cardinality of the largest set shattered by H, that is, the largest n such that there exists a set S ⊆ X with |S| = n that H shatters.

As a shorthand, we will use dVC (H) to denote the VC-dimension of a class of functions.

Theorem 8 (Sauer's lemma). Let H be a class of functions mapping X to {−1, 1} and let dVC(H) = d. Then

ΠH(n) ≤ Σ_{i=0}^d (n choose i),

and for n ≥ d,

ΠH(n) ≤ (en/d)^d.
Proof The proof of Sauer's lemma is a completely combinatorial argument. We prove the lemma by induction on the sum n + d, beginning from n = 0 or d = 0 as our base cases. For notational convenience, we first define Φd(n) = Σ_{i=0}^d (n choose i).
Suppose that n = 0. Then ΠH(n) = ΠH(0) = 1, the degenerate labeling of the empty set, and Φd(0) = (0 choose 0) = 1. When d = 0, not even a single point can be shattered, so every h ∈ H induces the same labeling on any S, and ΠH(n) = 1.
Now we assume that for any n′, d′ with n′ + d′ < n + d, the first inequality holds. We want to construct hypothesis spaces Hi that are smaller than H so that we can use our inductive hypothesis. To this end, we represent the labelings of H as a table and perform operations on said table. So let S = {x1, . . . , xn} be the dataset, and let S1 = {x1, . . . , xn−1} be S shrunk by removing xn. Now let H1 be the set of hypotheses restricted to S1, as in Fig. 1. We see that dVC(H1) ≤ dVC(H), because any set that H1 shatters, H must also be able to shatter. Thus, by induction, we have |ΠH1(S1)| ≤ Φd(n − 1).
Now let H2 be the collection of hypotheses that were "collapsed" going from H to H1. In the example of Fig. 1, H2 consists of h1 and h4, as they were collapsed. In particular, for each collapsed hypothesis there was an h ∈ H with h(xn) = 1 and another h ∈ H with h(xn) = −1 agreeing with it on S1, whereas un-collapsed hypotheses do not have this. The hypotheses in H2 are also restricted to S2 = S1, and |ΠH2(S2)| = |H2|. Since the original H had hypotheses labeling xn as both ±1, any dataset T that H2 shatters will also have T ∪ {xn} shattered by H. In other words, the VC-dimension of H is strictly greater than that of H2, so that dVC(H2) ≤ d − 1. By the inductive hypothesis, |ΠH2(S2)| ≤ Φd−1(n − 1).

        H (x1 . . . xn)        H1 (x1 . . . xn−1)        H2 (x1 . . . xn−1)
  h1:  −1  1  1 −1 −1          −1  1  1 −1               −1  1  1 −1
  h2:  −1  1  1 −1  1          (collapses into h1)
  h3:  −1  1  1  1 −1          −1  1  1  1
  h4:   1 −1 −1  1 −1           1 −1 −1  1                1 −1 −1  1
  h5:   1 −1 −1  1  1          (collapses into h4)
  h6:   1  1 −1 −1  1           1  1 −1 −1

Figure 1: Hypothesis tables for the proof of Sauer's Lemma

Combining the previous two paragraphs and noting that by construction the number of labelings |ΠH(S)| of H on S is simply the sum of the sizes of H1 and H2,

|ΠH(S)| = |H1| + |H2| ≤ Φd(n − 1) + Φd−1(n − 1)
        = Σ_{i=0}^d (n−1 choose i) + Σ_{i=0}^{d−1} (n−1 choose i) = Σ_{i=0}^d (n−1 choose i) + Σ_{i=0}^d (n−1 choose i−1)
        = Σ_{i=0}^d (n choose i) = Φd(n).

The equality in the second line follows because (n−1 choose −1) = 0, and the third line follows via the combinatorial identity (n−1 choose i) + (n−1 choose i−1) = (n choose i). As S was arbitrary, this completes the proof of the first part of the theorem.
Now suppose that n ≥ d ≥ 1. Then Φd(n)(d/n)^d = Σ_{i=0}^d (n choose i)(d/n)^d and (n choose i)(d/n)^d ≤ (n choose i)(d/n)^i since d ≤ n. Thus,

Φd(n)(d/n)^d ≤ Σ_{i=0}^d (n choose i)(d/n)^i ≤ Σ_{i=0}^n (n choose i)(d/n)^i
             = (1 + d/n)^n ≤ (e^{d/n})^n = e^d.

The second line follows via an application of the binomial theorem, and its inequality is a consequence of 1 + x ≤ e^x for all x. Multiplying both sides by (n/d)^d gives the desired result.
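As a quick numerical check (my own, with arbitrary values of d and n), the sketch below computes Φd(n) exactly and compares it with the (en/d)^d bound.

import math

def phi(d, n):
    # Phi_d(n) = sum_{i=0}^{d} C(n, i), the bound on the growth function in Sauer's lemma.
    return sum(math.comb(n, i) for i in range(d + 1))

for d, n in [(3, 10), (3, 100), (10, 1000)]:
    print(f"d={d}, n={n}: Phi_d(n) = {phi(d, n)}, (en/d)^d = {(math.e * n / d) ** d:.3e}")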

By combining Sauer's lemma and a simple application of the Ledoux-Talagrand contraction (via Corollary 5.1), we can derive bounds on the expected loss of a classifier. Let H be a class of {−1, 1}-valued functions, and let examples be drawn from a distribution P(X, Y) where Y ∈ {−1, 1} are labels for X. Then a classifier h ∈ H makes a mistake if and only if Yh(X) = −1. As such, the function [1 − Yh(X)]+ ≥ 1{Y ≠ h(X)}, and [·]+ has Lipschitz constant 1. Thus, we have

P(h(X) ≠ Y) = E[1{Y ≠ h(X)}] ≤ E[1 − Yh(X)]+.    (10)
Further, for a Rademacher random variable σi, σiYih(Xi) has the same distribution as σih(Xi) (since Yi ∈ {−1, 1} and σi is symmetric around 0). Thus, Rn(Y · H) = Rn(H). Further, φ(x) = [1 − x]+ − 1 is 1-Lipschitz and satisfies φ(0) = 0, so Corollary 5.1 implies

Rn(φ ◦ (Y · H)) ≤ 2Rn(Y · H) = 2Rn(H).
Combining this with Eq. (10), P(h(X) ≠ Y) − 1 = E[1{Y ≠ h(X)}] − 1 ≤ E[φ(Yh(X))], which by Theorem 6 gives that with probability at least 1 − δ,

P(h(X) ≠ Y) − 1 ≤ En[1 − Yh(X)]+ − 1 + 2Rn(φ ◦ (Y · H)) + √( log(1/δ) / (2n) ).

Clearly, we can add 1 to both sides of the above equation, and the empirical probability of a mistake P̂(h(X) ≠ Y) = En[1 − Yh(X)]+. Combining Sauer's lemma and the above two equations, we have proved the following theorem.
Theorem 9. Let H be a class of hypotheses on a space X with labels Y drawn according to a joint distribution P(X, Y). Then for any h ∈ H and given any sample S = {⟨x1, y1⟩, . . . , ⟨xn, yn⟩} drawn i.i.d. according to P, with probability at least 1 − δ over the sample S drawn,

P(h(X) ≠ Y) ≤ P̂(h(xi) ≠ yi) + 4Rn(H) + √( log(1/δ) / (2n) )
            ≤ P̂(h(xi) ≠ yi) + 4√( (2d log(en) − 2d log d) / n ) + √( log(1/δ) / (2n) ).

In short, we have everyone's favorite result that the estimated probability is close to the true probability:

P(h(X) ≠ Y) = P̂(h(X) ≠ Y) + O( √( (d log n + log(1/δ)) / n ) ).
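To get a feel for the numbers, the sketch below evaluates the second bound of Theorem 9 for a hypothetical classifier; the empirical error, VC dimension, and confidence level are placeholder values.

import numpy as np

def vc_bound(emp_err, d, n, delta=0.05):
    # Theorem 9: emp_err + 4*sqrt((2d log(en) - 2d log d)/n) + sqrt(log(1/delta)/(2n)).
    complexity = 4 * np.sqrt((2 * d * np.log(np.e * n) - 2 * d * np.log(d)) / n)
    confidence = np.sqrt(np.log(1 / delta) / (2 * n))
    return emp_err + complexity + confidence

for n in [1_000, 100_000, 10_000_000]:
    print(f"n={n}: bound on P(h(X) != Y) <= {vc_bound(emp_err=0.05, d=10, n=n):.4f}")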

References
[1] P. Billingsley, Probability and Measure, Third Edition, Wiley 1995.

[2] M. Ledoux and M. Talagrand, Probability in Banach Spaces, Springer Verlag 1991.
