
Lecture Notes 2

1 Probability Inequalities
Inequalities are useful for bounding quantities that might otherwise be hard to compute.
They will also be used in the theory of convergence.

Theorem 1 (The Gaussian Tail Inequality) Let X ∼ N(0, 1). Then
$$P(|X| > \epsilon) \le \frac{2 e^{-\epsilon^2/2}}{\epsilon}.$$
If X_1, …, X_n ∼ N(0, 1) then
$$P(|\bar{X}_n| > \epsilon) \le \frac{2}{\sqrt{n}\,\epsilon}\, e^{-n\epsilon^2/2} \;\overset{\text{large } n}{\le}\; e^{-n\epsilon^2/2}.$$
Proof. The density of X is φ(x) = (2π)^{−1/2} e^{−x²/2}. Hence,
$$P(X > \epsilon) = \int_\epsilon^\infty \phi(s)\,ds = \int_\epsilon^\infty \frac{s}{s}\,\phi(s)\,ds \le \frac{1}{\epsilon}\int_\epsilon^\infty s\,\phi(s)\,ds = -\frac{1}{\epsilon}\int_\epsilon^\infty \phi'(s)\,ds = \frac{\phi(\epsilon)}{\epsilon} \le \frac{e^{-\epsilon^2/2}}{\epsilon},$$
where we used φ′(s) = −s φ(s). By symmetry,
$$P(|X| > \epsilon) \le \frac{2 e^{-\epsilon^2/2}}{\epsilon}.$$
Now let X_1, …, X_n ∼ N(0, 1). Then X̄_n = n^{−1} ∑_{i=1}^n X_i ∼ N(0, 1/n). Thus X̄_n has the same distribution as n^{−1/2} Z where Z ∼ N(0, 1), and
$$P(|\bar{X}_n| > \epsilon) = P(n^{-1/2}|Z| > \epsilon) = P(|Z| > \sqrt{n}\,\epsilon) \le \frac{2}{\sqrt{n}\,\epsilon}\, e^{-n\epsilon^2/2}. \qquad \square$$
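The bound is easy to check by simulation. Here is a quick Monte Carlo sketch (added for illustration, not part of the original notes; it assumes NumPy, and the values of n, ε and the seed are arbitrary):

```python
# Monte Carlo check of the Gaussian tail inequality for the sample mean.
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed
n, eps, reps = 25, 0.5, 200_000           # arbitrary illustrative values

# Sample mean of n standard normals, repeated reps times.
xbar = rng.standard_normal((reps, n)).mean(axis=1)
empirical = np.mean(np.abs(xbar) > eps)

# Theorem 1: P(|Xbar_n| > eps) <= 2/(sqrt(n)*eps) * exp(-n*eps^2/2).
bound = 2.0 / (np.sqrt(n) * eps) * np.exp(-n * eps**2 / 2)
print(f"empirical {empirical:.5f} <= bound {bound:.5f}")
```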
Theorem 2 (Markov’s inequality) Let X be a non-negative random variable and
suppose that E(X) exists. For any t > 0,

$$P(X > t) \le \frac{E(X)}{t}. \tag{1}$$

Proof. Since X ≥ 0,
$$E(X) = \int_0^\infty x\,p(x)\,dx = \int_0^t x\,p(x)\,dx + \int_t^\infty x\,p(x)\,dx \ge \int_t^\infty x\,p(x)\,dx \ge t \int_t^\infty p(x)\,dx = t\,P(X > t). \qquad \square$$
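As a quick numerical illustration of (1) (an added sketch, assuming NumPy; the exponential distribution is just one convenient non-negative example):

```python
# Markov's inequality: P(X > t) <= E(X)/t for non-negative X.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=1_000_000)   # non-negative, E(X) = 2

for t in (2.0, 5.0, 10.0):
    empirical = np.mean(x > t)
    bound = x.mean() / t
    print(f"t = {t:4.1f}: P(X > t) = {empirical:.4f} <= E(X)/t = {bound:.4f}")
```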

Theorem 3 (Chebyshev's inequality) Let µ = E(X) and σ² = Var(X). Then,
$$P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2} \quad \text{and} \quad P(|Z| \ge k) \le \frac{1}{k^2} \tag{2}$$
where Z = (X − µ)/σ. In particular, P(|Z| > 2) ≤ 1/4 and P(|Z| > 3) ≤ 1/9.

Proof. We use Markov’s inequality to conclude that


$$P(|X - \mu| \ge t) = P(|X - \mu|^2 \ge t^2) \le \frac{E(X - \mu)^2}{t^2} = \frac{\sigma^2}{t^2}.$$
The second part follows by setting t = kσ. □

If X_1, …, X_n ∼ Bernoulli(p) and X̄_n = n^{−1} ∑_{i=1}^n X_i, then Var(X̄_n) = Var(X_1)/n = p(1 − p)/n and
$$P(|\bar{X}_n - p| > \epsilon) \le \frac{\mathrm{Var}(\bar{X}_n)}{\epsilon^2} = \frac{p(1-p)}{n\epsilon^2} \le \frac{1}{4 n \epsilon^2}$$
since p(1 − p) ≤ 1/4 for all p.
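A simulation check of this bound (added sketch, assuming NumPy; n, p and ε are arbitrary):

```python
# Chebyshev bound for the Bernoulli sample mean: P(|Xbar_n - p| > eps) <= 1/(4 n eps^2).
import numpy as np

rng = np.random.default_rng(2)
n, p, eps, reps = 100, 0.3, 0.1, 100_000

# Simulate the sample mean via Binomial counts divided by n.
xbar = rng.binomial(n, p, size=reps) / n
empirical = np.mean(np.abs(xbar - p) > eps)
bound = 1.0 / (4 * n * eps**2)
print(f"empirical {empirical:.4f} <= Chebyshev bound {bound:.4f}")
```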

2 Hoeffding’s Inequality
Hoeffding's inequality is similar in spirit to Markov's inequality, but it is sharper.
We begin with the following important result.

Lemma 4 Suppose that a ≤ X ≤ b. Then


$$E(e^{tX}) \le e^{t\mu}\, e^{t^2 (b-a)^2/8}$$

where µ = E[X].

Before we start the proof, recall that a function g is convex if for each x, y and each
α ∈ [0, 1],
g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y).
Proof. We will assume that µ = 0. Since a ≤ X ≤ b, we can write X as a convex
combination of a and b, namely, X = αb + (1 − α)a where α = (X − a)/(b − a). By the
convexity of the function y ↦ e^{ty} we have
$$e^{tX} \le \alpha e^{tb} + (1-\alpha) e^{ta} = \frac{X-a}{b-a}\, e^{tb} + \frac{b-X}{b-a}\, e^{ta}.$$
Take expectations of both sides and use the fact that E(X) = 0 to get
$$E e^{tX} \le -\frac{a}{b-a}\, e^{tb} + \frac{b}{b-a}\, e^{ta} = e^{g(u)} \tag{3}$$
where u = t(b − a), g(u) = −γu + log(1 − γ + γe^u) and γ = −a/(b − a). Note that
g(0) = g′(0) = 0. Also, g″(u) ≤ 1/4 for all u > 0. By Taylor's theorem, there is a ξ ∈ (0, u)
such that
$$g(u) = g(0) + u g'(0) + \frac{u^2}{2} g''(\xi) = \frac{u^2}{2} g''(\xi) \le \frac{u^2}{8} = \frac{t^2(b-a)^2}{8}.$$
Hence, E e^{tX} ≤ e^{g(u)} ≤ e^{t²(b−a)²/8}. □
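A numerical check of Lemma 4 (an added sketch, assuming NumPy): here X is taken uniform on [a, b], so µ = (a + b)/2, and E(e^{tX}) is estimated by Monte Carlo:

```python
# Hoeffding's lemma: E(e^{tX}) <= e^{t*mu} * e^{t^2 (b-a)^2 / 8} for a <= X <= b.
import numpy as np

rng = np.random.default_rng(3)
a, b = -1.0, 2.0
mu = (a + b) / 2                       # exact mean of Uniform(a, b)
x = rng.uniform(a, b, size=1_000_000)

for t in (0.5, 1.0, 2.0):
    mgf = np.mean(np.exp(t * x))                     # Monte Carlo estimate of E(e^{tX})
    bound = np.exp(t * mu + t**2 * (b - a)**2 / 8)
    print(f"t = {t}: E(e^(tX)) ~ {mgf:.3f} <= bound {bound:.3f}")
```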

Next, we need to use Chernoff's method.

Lemma 5 Let X be a random variable. Then
$$P(X > \epsilon) \le \inf_{t \ge 0} e^{-t\epsilon}\, E(e^{tX}).$$

Proof. For any t > 0,
$$P(X > \epsilon) = P(e^{tX} > e^{t\epsilon}) \le e^{-t\epsilon}\, E(e^{tX})$$
by Markov's inequality. Since this is true for every t ≥ 0, the result follows. □

Theorem 6 (Hoeffding's Inequality) Let Y_1, …, Y_n be iid observations such that
E(Y_i) = µ and a ≤ Y_i ≤ b. Then, for any ε > 0,
$$P\left(|\bar{Y}_n - \mu| \ge \epsilon\right) \le 2 e^{-2n\epsilon^2/(b-a)^2}. \tag{4}$$

Corollary 7 If X1 , X2 , . . . , Xn are independent with P(a ≤ Xi ≤ b) = 1 and common
mean µ, then, with probability at least 1 − δ,
$$|\bar{X}_n - \mu| \le \sqrt{\frac{(b-a)^2}{2n} \log\left(\frac{2}{\delta}\right)}. \tag{5}$$

Proof. Without loss of generality, we assume that µ = 0. First we have
$$P(|\bar{Y}_n| \ge \epsilon) = P(\bar{Y}_n \ge \epsilon) + P(\bar{Y}_n \le -\epsilon) = P(\bar{Y}_n \ge \epsilon) + P(-\bar{Y}_n \ge \epsilon).$$
Next we use Chernoff's method. For any t > 0, we have, from Markov's inequality, that
$$P(\bar{Y}_n \ge \epsilon) = P\left(\sum_{i=1}^n Y_i \ge n\epsilon\right) = P\left(e^{t\sum_{i=1}^n Y_i} \ge e^{tn\epsilon}\right) \le e^{-tn\epsilon}\, E\left(e^{t\sum_{i=1}^n Y_i}\right) = e^{-tn\epsilon} \prod_i E(e^{tY_i}) = e^{-tn\epsilon} \left(E(e^{tY_1})\right)^n.$$
From Lemma 4, E(e^{tY_i}) ≤ e^{t²(b−a)²/8}. So
$$P(\bar{Y}_n \ge \epsilon) \le e^{-tn\epsilon}\, e^{t^2 n (b-a)^2/8}.$$
This is minimized by setting t = 4ε/(b − a)², giving
$$P(\bar{Y}_n \ge \epsilon) \le e^{-2n\epsilon^2/(b-a)^2}.$$
Applying the same argument to P(−Ȳ_n ≥ ε) yields the result. □

Example 8 Let X_1, …, X_n ∼ Bernoulli(p). From Hoeffding's inequality (with a = 0, b = 1),
$$P(|\bar{X}_n - p| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$
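A simulation check of the Bernoulli case, together with the Corollary 7 interval (an added sketch, assuming NumPy; n, p, ε and δ are arbitrary):

```python
# Hoeffding bound for Bernoulli(p) data (b - a = 1) and the Corollary 7 interval.
import numpy as np

rng = np.random.default_rng(4)
n, p, eps, reps = 100, 0.3, 0.1, 100_000

xbar = rng.binomial(n, p, size=reps) / n
empirical = np.mean(np.abs(xbar - p) > eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)
print(f"P(|Xbar - p| > {eps}): empirical {empirical:.4f} <= Hoeffding {hoeffding:.4f}")

# Corollary 7: with prob. >= 1 - delta, |Xbar - mu| <= sqrt((b-a)^2 log(2/delta) / (2n)).
delta = 0.05
half_width = np.sqrt(np.log(2 / delta) / (2 * n))
coverage = np.mean(np.abs(xbar - p) <= half_width)
print(f"half-width {half_width:.4f}, empirical coverage {coverage:.4f} >= {1 - delta}")
```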

3 The Bounded Difference Inequality


So far we have focused on sums of random variables. The following result extends Hoeffding’s
inequality to more general functions g(x1 , . . . , xn ). Here we consider McDiarmid’s inequality,
also known as the Bounded Difference inequality.

Theorem 9 (McDiarmid) Let X_1, …, X_n be independent random variables. Suppose that
$$\sup_{x_1,\ldots,x_n,\, x_i'} \Big| g(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - g(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \Big| \le c_i \tag{6}$$
for i = 1, …, n. Then
$$P\Big(g(X_1, \ldots, X_n) - E(g(X_1, \ldots, X_n)) \ge \epsilon\Big) \le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2}\right). \tag{7}$$

Proof. Let V_i = E(g|X_1, …, X_i) − E(g|X_1, …, X_{i−1}). Then g(X_1, …, X_n) − E(g(X_1, …, X_n)) = ∑_{i=1}^n V_i and E(V_i|X_1, …, X_{i−1}) = 0. Using a similar argument as in Hoeffding's Lemma we have
$$E(e^{tV_i} \mid X_1, \ldots, X_{i-1}) \le e^{t^2 c_i^2/8}. \tag{8}$$
Now, for any t > 0,
$$\begin{aligned}
P\Big(g(X_1, \ldots, X_n) - E(g(X_1, \ldots, X_n)) \ge \epsilon\Big) &= P\left(\sum_{i=1}^n V_i \ge \epsilon\right) = P\left(e^{t\sum_{i=1}^n V_i} \ge e^{t\epsilon}\right) \le e^{-t\epsilon}\, E\left(e^{t\sum_{i=1}^n V_i}\right) \\
&= e^{-t\epsilon}\, E\left(e^{t\sum_{i=1}^{n-1} V_i}\, E\big(e^{tV_n} \mid X_1, \ldots, X_{n-1}\big)\right) \\
&\le e^{-t\epsilon}\, e^{t^2 c_n^2/8}\, E\left(e^{t\sum_{i=1}^{n-1} V_i}\right) \\
&\;\;\vdots \\
&\le e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^n c_i^2/8}.
\end{aligned}$$
The result follows by taking t = 4ε/∑_{i=1}^n c_i². □

Example 10 If we take g(x_1, …, x_n) = n^{−1} ∑_{i=1}^n x_i, then we get back Hoeffding's inequality.
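A small check of Example 10 (added here, assuming NumPy): for the sample mean of values in [a, b], changing one coordinate changes g by at most c_i = (b − a)/n, and the exponent in (7) then matches the one-sided Hoeffding exponent from the proof of Theorem 6:

```python
# For g = sample mean of values in [a, b], c_i = (b - a)/n, so
# 2*eps^2 / sum(c_i^2) = 2*n*eps^2 / (b - a)^2: McDiarmid recovers Hoeffding.
import numpy as np

n, a, b, eps = 50, 0.0, 1.0, 0.1
c = np.full(n, (b - a) / n)                       # bounded differences of the mean
mcdiarmid = np.exp(-2 * eps**2 / np.sum(c**2))    # bound (7)
hoeffding = np.exp(-2 * n * eps**2 / (b - a)**2)  # one-sided bound from Theorem 6
print(mcdiarmid, hoeffding)                       # identical
```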

4 Bounds on Expected Values

Theorem 11 (Cauchy-Schwarz inequality) If X and Y have finite variances then
$$E|XY| \le \sqrt{E(X^2)\, E(Y^2)}. \tag{9}$$

The Cauchy-Schwarz inequality can be written as

$$\mathrm{Cov}^2(X, Y) \le \sigma_X^2\, \sigma_Y^2.$$
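A quick numerical illustration of the covariance form (an added sketch, assuming NumPy; the particular pair of correlated variables is arbitrary):

```python
# Cov^2(X, Y) <= Var(X) Var(Y), checked on simulated correlated data.
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(1_000_000)
y = 0.5 * x + rng.standard_normal(1_000_000)   # correlated with x

cov_xy = np.cov(x, y)[0, 1]
print(f"Cov^2 = {cov_xy**2:.4f} <= Var(X)*Var(Y) = {x.var() * y.var():.4f}")
```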

Recall that a function g is convex if for each x, y and each α ∈ [0, 1],

g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y).

If g is twice differentiable and g″(x) ≥ 0 for all x, then g is convex. It can be shown that if
g is convex, then g lies above any line that touches g at some point, called a tangent line.
A function g is concave if −g is convex. Examples of convex functions are g(x) = x2 and
g(x) = ex . Examples of concave functions are g(x) = −x2 and g(x) = log x.

Theorem 12 (Jensen’s inequality) If g is convex, then

Eg(X) ≥ g(EX). (10)

If g is concave, then
Eg(X) ≤ g(EX). (11)

Proof. Let L(x) = a + bx be a line, tangent to g(x) at the point E(X). Since g is convex,
it lies above the line L(x). So,

Eg(X) ≥ EL(X) = E(a + bX) = a + bE(X) = L(E(X)) = g(EX).

Example 13 From Jensen's inequality we see that E(X²) ≥ (EX)².

Example 14 (Kullback-Leibler Distance) Define the Kullback-Leibler distance between two densities p and q by
$$D(p, q) = \int p(x) \log\left(\frac{p(x)}{q(x)}\right) dx.$$
Note that D(p, p) = 0. We will use Jensen to show that D(p, q) ≥ 0. Let X ∼ p. Then
$$-D(p, q) = E\left(\log \frac{q(X)}{p(X)}\right) \le \log E\left(\frac{q(X)}{p(X)}\right) = \log \int \frac{q(x)}{p(x)}\, p(x)\, dx = \log \int q(x)\, dx = \log(1) = 0.$$
So, −D(p, q) ≤ 0 and hence D(p, q) ≥ 0.
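A numerical companion to Example 14 (added here, assuming NumPy), using two arbitrary discrete distributions in place of densities, with the integral replaced by a sum:

```python
# D(p, q) = sum_x p(x) log(p(x)/q(x)) >= 0, with D(p, p) = 0.
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

D_pq = np.sum(p * np.log(p / q))
D_pp = np.sum(p * np.log(p / p))
print(f"D(p, q) = {D_pq:.4f} >= 0,  D(p, p) = {D_pp:.4f}")
```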

Suppose we have an exponential bound on P(X_n > ε). In that case we can bound E(X_n) as
follows.

Theorem 15 Suppose that X_n ≥ 0 and that for every ε > 0,
$$P(X_n > \epsilon) \le c_1 e^{-c_2 n \epsilon^2} \tag{12}$$
for some c_2 > 0 and c_1 > 1/e. Then,
$$E(X_n) \le \sqrt{\frac{C}{n}}, \tag{13}$$
where C = (1 + log(c_1))/c_2.
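Before turning to the proof, here is a quick numerical illustration (an added sketch, assuming NumPy): take X_n = |p̂_n − p| for Bernoulli data, so Example 8 gives P(X_n > ε) ≤ 2e^{−2nε²}, i.e. c_1 = 2 and c_2 = 2:

```python
# Theorem 15 with X_n = |phat_n - p|: c1 = 2, c2 = 2 from the Hoeffding bound,
# so E|phat_n - p| <= sqrt((1 + log 2)/(2 n)).
import numpy as np

rng = np.random.default_rng(6)
n, p, reps = 200, 0.3, 100_000
c1, c2 = 2.0, 2.0

phat = rng.binomial(n, p, size=reps) / n
mean_abs_dev = np.mean(np.abs(phat - p))
bound = np.sqrt((1 + np.log(c1)) / (c2 * n))
print(f"E|phat - p| ~ {mean_abs_dev:.4f} <= {bound:.4f}")
```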
Proof. Recall that for any nonnegative random variable Y, E(Y) = ∫_0^∞ P(Y ≥ t) dt. Hence,
for any a > 0,
$$E(X_n^2) = \int_0^\infty P(X_n^2 \ge t)\, dt = \int_0^a P(X_n^2 \ge t)\, dt + \int_a^\infty P(X_n^2 \ge t)\, dt \le a + \int_a^\infty P(X_n^2 \ge t)\, dt.$$
Equation (12) implies that P(X_n² > t) ≤ c_1 e^{−c_2 n t}. Hence,
$$E(X_n^2) \le a + \int_a^\infty P(X_n^2 \ge t)\, dt = a + \int_a^\infty P(X_n \ge \sqrt{t})\, dt \le a + \int_a^\infty c_1 e^{-c_2 n t}\, dt = a + \frac{c_1 e^{-c_2 n a}}{c_2 n}.$$
Set a = log(c_1)/(n c_2) and conclude that
$$E(X_n^2) \le \frac{\log(c_1)}{n c_2} + \frac{1}{n c_2} = \frac{1 + \log(c_1)}{n c_2}.$$
Finally, we have
$$E(X_n) \le \sqrt{E(X_n^2)} \le \sqrt{\frac{1 + \log(c_1)}{n c_2}}. \qquad \square$$
Now we consider bounding the maximum of a set of random variables.

Theorem 16 Let X_1, …, X_n be random variables. Suppose there exists σ > 0 such
that E(e^{tX_i}) ≤ e^{t²σ²/2} for all t > 0. Then
$$E\left(\max_{1 \le i \le n} X_i\right) \le \sigma \sqrt{2 \log n}. \tag{14}$$

Proof. By Jensen's inequality,
$$\exp\left\{ t\, E\left(\max_{1 \le i \le n} X_i\right) \right\} \le E\left( \exp\left\{ t \max_{1 \le i \le n} X_i \right\} \right) = E\left( \max_{1 \le i \le n} \exp\{tX_i\} \right) \le \sum_{i=1}^n E\left(\exp\{tX_i\}\right) \le n\, e^{t^2\sigma^2/2}.$$

Thus,
$$E\left(\max_{1 \le i \le n} X_i\right) \le \frac{\log n}{t} + \frac{t\sigma^2}{2}.$$
The result follows by setting t = √(2 log n)/σ. □
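A simulation comparison with (14) (an added sketch, assuming NumPy): independent N(0, σ²) variables satisfy E(e^{tX_i}) = e^{t²σ²/2} exactly, so the theorem applies:

```python
# E(max_i X_i) versus the sigma*sqrt(2 log n) bound for N(0, sigma^2) variables.
import numpy as np

rng = np.random.default_rng(7)
sigma, reps = 1.0, 10_000

for n in (10, 100, 1000):
    max_vals = rng.normal(0.0, sigma, size=(reps, n)).max(axis=1)
    bound = sigma * np.sqrt(2 * np.log(n))
    print(f"n = {n:4d}: E(max) ~ {max_vals.mean():.3f} <= {bound:.3f}")
```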

5 O_P and o_P
In statistics, probability and machine learning, we make use of o_P and O_P notation.
Recall first that a_n = o(1) means that a_n → 0 as n → ∞, and a_n = o(b_n) means that a_n/b_n = o(1).
a_n = O(1) means that a_n is eventually bounded, that is, for all large n, |a_n| ≤ C for some C > 0. a_n = O(b_n) means that a_n/b_n = O(1).
We write a_n ∼ b_n if both a_n/b_n and b_n/a_n are eventually bounded. In computer science this is written as a_n = Θ(b_n), but we prefer using a_n ∼ b_n since, in statistics, Θ often denotes a parameter space.
Now we move on to the probabilistic versions. Say that Y_n = o_P(1) if, for every ε > 0, P(|Y_n| > ε) → 0.
Say that Y_n = o_P(a_n) if Y_n/a_n = o_P(1).
Say that Y_n = O_P(1) if, for every ε > 0, there is a C > 0 such that P(|Y_n| > C) ≤ ε.
Say that Y_n = O_P(a_n) if Y_n/a_n = O_P(1).

Let's use Hoeffding's inequality to show that sample proportions are within O_P(1/√n) of
the true mean. Let Y_1, …, Y_n be coin flips, i.e. Y_i ∈ {0, 1}. Let p = P(Y_i = 1). Let
$$\hat{p}_n = \frac{1}{n} \sum_{i=1}^n Y_i.$$
We will show that p̂_n − p = o_P(1) and p̂_n − p = O_P(1/√n).
We have that
$$P(|\hat{p}_n - p| > \epsilon) \le 2 e^{-2n\epsilon^2} \to 0$$
and so p̂_n − p = o_P(1). Also,
$$P(\sqrt{n}\,|\hat{p}_n - p| > C) = P\left(|\hat{p}_n - p| > \frac{C}{\sqrt{n}}\right) \le 2 e^{-2C^2} < \delta$$
if we pick C large enough. Hence, √n (p̂_n − p) = O_P(1) and so
$$\hat{p}_n - p = O_P\left(\frac{1}{\sqrt{n}}\right).$$
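The same argument can be seen in simulation (an added sketch, assuming NumPy): the probability that √n |p̂_n − p| exceeds a fixed C stays below the Hoeffding bound 2e^{−2C²}, uniformly in n:

```python
# sqrt(n) * (phat_n - p) is O_P(1): the tail probability is bounded uniformly in n.
import numpy as np

rng = np.random.default_rng(8)
p, C, reps = 0.3, 1.0, 100_000
hoeffding = 2 * np.exp(-2 * C**2)

for n in (100, 1_000, 10_000):
    phat = rng.binomial(n, p, size=reps) / n
    prob = np.mean(np.sqrt(n) * np.abs(phat - p) > C)
    print(f"n = {n:6d}: P(sqrt(n)|phat - p| > {C}) = {prob:.4f} <= {hoeffding:.4f}")
```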

Make sure you can prove the following:

O_P(1) o_P(1) = o_P(1)
O_P(1) O_P(1) = O_P(1)
o_P(1) + O_P(1) = O_P(1)
O_P(a_n) o_P(b_n) = o_P(a_n b_n)
O_P(a_n) O_P(b_n) = O_P(a_n b_n)
