Lecture Notes 2
1 Probability Inequalities
Inequalities are useful for bounding quantities that might otherwise be hard to compute.
They will also be used in the theory of convergence.
Theorem 1 (Gaussian Tail Inequality) Let X ∼ N (0, 1). Then, for any ε > 0,

P(|X| > ε) ≤ (2/ε) e^{−ε²/2}.

If X1 , . . . , Xn ∼ N (0, 1) and X̄n = n^{−1} ∑_{i=1}^n Xi , then

P(|X̄n | > ε) ≤ (2/(√n ε)) e^{−nε²/2} ≤ e^{−nε²/2}

for large n.
Proof. The density of X is φ(x) = (2π)^{−1/2} e^{−x²/2}, so φ′(s) = −s φ(s). Hence,

P(X > ε) = ∫_ε^∞ φ(s) ds = ∫_ε^∞ (s/s) φ(s) ds ≤ (1/ε) ∫_ε^∞ s φ(s) ds
= −(1/ε) ∫_ε^∞ φ′(s) ds = φ(ε)/ε ≤ e^{−ε²/2}/ε.
By symmetry,

P(|X| > ε) ≤ (2/ε) e^{−ε²/2}.
Now let X1 , . . . , Xn ∼ N (0, 1). Then X̄n = n^{−1} ∑_{i=1}^n Xi ∼ N (0, 1/n). Thus X̄n has the same distribution as n^{−1/2} Z where Z ∼ N (0, 1), and

P(|X̄n | > ε) = P(n^{−1/2} |Z| > ε) = P(|Z| > √n ε) ≤ (2/(√n ε)) e^{−nε²/2}. □
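As a quick sanity check, here is a small simulation of the bound (a sketch only; the sample size n, the value of ε, and the number of replications are arbitrary choices, and NumPy is assumed):

```python
import numpy as np

# Monte Carlo check of the Gaussian tail bound on the sample mean.
rng = np.random.default_rng(0)
n, eps, reps = 100, 0.3, 200_000

xbar = rng.standard_normal((reps, n)).mean(axis=1)        # draws of X̄n
empirical = np.mean(np.abs(xbar) > eps)                   # estimate of P(|X̄n| > ε)
bound = 2 / (np.sqrt(n) * eps) * np.exp(-n * eps**2 / 2)

print(f"empirical tail probability: {empirical:.5f}")
print(f"Gaussian tail bound:        {bound:.5f}")          # should exceed the estimate
```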
Theorem 2 (Markov’s inequality) Let X be a non-negative random variable and
suppose that E(X) exists. For any t > 0,
P(X > t) ≤ E(X)/t. (1)
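For instance, here is a minimal numerical illustration of Markov’s inequality (a sketch assuming NumPy; the Exponential(1) distribution and the threshold t = 3 are arbitrary choices):

```python
import numpy as np

# Markov's inequality for X ~ Exponential(1), where E(X) = 1.
rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=500_000)
t = 3.0
print(np.mean(x > t))   # empirical P(X > t); about e^{-3} ≈ 0.050
print(1.0 / t)          # Markov bound E(X)/t ≈ 0.333, much larger but valid
```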
2 Hoeffding’s Inequality
Hoeffding’s inequality is similar in spirit to Markov’s inequality, but it is sharper. We begin with the following important result.

Lemma 4 (Hoeffding’s Lemma) Let X be a random variable with a ≤ X ≤ b. Then, for any t > 0,

E(e^{tX}) ≤ e^{tµ} e^{t²(b−a)²/8} (2)

where µ = E[X].
Before we start the proof, recall that a function g is convex if for each x, y and each
α ∈ [0, 1],
g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y).
Proof. We will assume that µ = 0. Since a ≤ X ≤ b, we can write X as a convex
combination of a and b, namely, X = αb + (1 − α)a where α = (X − a)/(b − a). By the
convexity of the function y → e^{ty} we have

e^{tX} ≤ α e^{tb} + (1 − α) e^{ta} = ((X − a)/(b − a)) e^{tb} + ((b − X)/(b − a)) e^{ta}.
Take expectations of both sides and use the fact that E(X) = 0 to get
E(e^{tX}) ≤ −(a/(b − a)) e^{tb} + (b/(b − a)) e^{ta} = e^{g(u)} (3)
where u = t(b − a), g(u) = −γu + log(1 − γ + γe^u) and γ = −a/(b − a). Note that g(0) = g′(0) = 0. Also, g″(u) ≤ 1/4 for all u > 0. By Taylor’s theorem, there is a ξ ∈ (0, u) such that

g(u) = g(0) + u g′(0) + (u²/2) g″(ξ) = (u²/2) g″(ξ) ≤ u²/8 = t²(b − a)²/8.
Hence, E(e^{tX}) ≤ e^{g(u)} ≤ e^{t²(b−a)²/8}. □
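As an illustration of the lemma, the following sketch checks the moment generating function bound for a centered Bernoulli variable (the choice of p and the grid of t values are arbitrary; NumPy is assumed):

```python
import numpy as np

# Hoeffding's lemma for X - p with X ~ Bernoulli(p): values lie in [-p, 1 - p],
# so b - a = 1 and the lemma says E[e^{t(X - p)}] <= e^{t^2 / 8}.
p = 0.3
t = np.linspace(0.1, 5.0, 50)

mgf = p * np.exp(t * (1 - p)) + (1 - p) * np.exp(-t * p)   # exact E[e^{t(X - p)}]
bound = np.exp(t**2 / 8)

assert np.all(mgf <= bound)   # the bound holds on the whole grid
print(mgf[-1], bound[-1])
```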
Corollary 7 If X1 , X2 , . . . , Xn are independent with P(a ≤ Xi ≤ b) = 1 and common
mean µ, then, with probability at least 1 − δ,
|X̄n − µ| ≤ √( ((b − a)²/(2n)) log(2/δ) ). (5)
Next we use Chernoff’s method. Let Yi = Xi − µ, so that E(Yi) = 0 and Yi takes values in an interval of length b − a. For any t > 0, we have, from Markov’s inequality, that

P(Ȳn ≥ ε) = P(∑_{i=1}^n Yi ≥ nε) = P(e^{∑_{i=1}^n Yi} ≥ e^{nε})
= P(e^{t ∑_{i=1}^n Yi} ≥ e^{tnε}) ≤ e^{−tnε} E(e^{t ∑_{i=1}^n Yi})
= e^{−tnε} ∏_i E(e^{tYi}) = e^{−tnε} (E(e^{tYi}))^n.

From Lemma 4, E(e^{tYi}) ≤ e^{t²(b−a)²/8}. So

P(Ȳn ≥ ε) ≤ e^{−tnε} e^{t²n(b−a)²/8}.

This holds for every t > 0. The exponent −tnε + t²n(b − a)²/8 is minimized at t = 4ε/(b − a)², which gives P(Ȳn ≥ ε) ≤ e^{−2nε²/(b−a)²}. Applying the same argument to −Yi bounds P(Ȳn ≤ −ε), so P(|X̄n − µ| ≥ ε) ≤ 2e^{−2nε²/(b−a)²}; setting the right-hand side equal to δ and solving for ε gives Corollary 7. □
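Corollary 7 is what gives finite-sample confidence intervals. The sketch below computes the half-width of the interval for coin flips, where a = 0 and b = 1 (δ and the sample sizes are arbitrary illustrative choices; NumPy assumed):

```python
import numpy as np

# Half-width of the Hoeffding interval |X̄n - µ| <= sqrt((b-a)^2 log(2/δ) / (2n))
# for coin flips (b - a = 1).
delta = 0.05
for n in (100, 1_000, 10_000):
    half_width = np.sqrt(np.log(2 / delta) / (2 * n))
    print(n, round(half_width, 4))   # shrinks at the 1/sqrt(n) rate
```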
Theorem 9 (McDiarmid) Let X1 , . . . , Xn be independent random variables. Sup-
pose that
sup_{x1 ,...,xn ,x′i} |g(x1 , . . . , xi−1 , xi , xi+1 , . . . , xn ) − g(x1 , . . . , xi−1 , x′i , xi+1 , . . . , xn )| ≤ ci (6)

for i = 1, . . . , n. Then

P( g(X1 , . . . , Xn ) − E(g(X1 , . . . , Xn )) ≥ ε ) ≤ exp( −2ε² / ∑_{i=1}^n ci² ). (7)
Proof. Let Vi = E(g | X1 , . . . , Xi ) − E(g | X1 , . . . , Xi−1 ). Then g(X1 , . . . , Xn ) − E(g(X1 , . . . , Xn )) = ∑_{i=1}^n Vi and E(Vi | X1 , . . . , Xi−1 ) = 0. Using a similar argument as in Hoeffding’s Lemma we have

E(e^{tVi} | X1 , . . . , Xi−1 ) ≤ e^{t²ci²/8}. (8)
Now, for any t > 0,

P( g(X1 , . . . , Xn ) − E(g(X1 , . . . , Xn )) ≥ ε ) = P( ∑_{i=1}^n Vi ≥ ε )
= P( e^{t ∑_{i=1}^n Vi} ≥ e^{tε} ) ≤ e^{−tε} E( e^{t ∑_{i=1}^n Vi} )
= e^{−tε} E( e^{t ∑_{i=1}^{n−1} Vi} E( e^{tVn} | X1 , . . . , Xn−1 ) )
≤ e^{−tε} e^{t²cn²/8} E( e^{t ∑_{i=1}^{n−1} Vi} )
⋮
≤ e^{−tε} e^{t² ∑_{i=1}^n ci²/8}.

Minimizing the exponent over t (take t = 4ε/∑_{i=1}^n ci²) gives (7). □
Example 10 If we take g(x1 , . . . , xn ) = n^{−1} ∑_{i=1}^n xi then we get back Hoeffding’s inequality.
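To spell the example out (a short check, assuming each xi lies in [a, b] as in Hoeffding’s setting): changing one coordinate changes the sample mean by at most (b − a)/n, so we can take ci = (b − a)/n, and McDiarmid’s bound becomes

exp( −2ε² / ∑_{i=1}^n ci² ) = exp( −2ε² / (n (b − a)²/n²) ) = exp( −2nε²/(b − a)² ),

which is the one-sided Hoeffding bound for X̄n − µ.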
The Cauchy-Schwarz inequality can be written as

Cov²(X, Y ) ≤ σX² σY².
Recall that a function g is convex if for each x, y and each α ∈ [0, 1],

g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y).

If g is twice differentiable and g″(x) ≥ 0 for all x, then g is convex. It can be shown that if g is convex, then g lies above any line that touches g at some point, called a tangent line. A function g is concave if −g is convex. Examples of convex functions are g(x) = x² and g(x) = e^x. Examples of concave functions are g(x) = −x² and g(x) = log x.

Jensen’s inequality states that if g is convex, then

Eg(X) ≥ g(EX). (10)

If g is concave, then

Eg(X) ≤ g(EX). (11)
Proof. Let L(x) = a + bx be a line, tangent to g(x) at the point E(X), so that L(E(X)) = g(E(X)). Since g is convex, it lies above the line L(x). So,

Eg(X) ≥ EL(X) = a + bE(X) = L(E(X)) = g(E(X)). □
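A quick numerical illustration (a sketch assuming NumPy; the convex function e^x and X ∼ N (0, 1) are arbitrary choices):

```python
import numpy as np

# Jensen's inequality with g(x) = e^x (convex) and X ~ N(0, 1).
rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000)
print(np.mean(np.exp(x)))   # E[e^X] ≈ e^{1/2} ≈ 1.649
print(np.exp(np.mean(x)))   # e^{E[X]} ≈ 1, smaller, as Jensen predicts
```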
Suppose we have an exponential bound on P(Xn > ε). In that case we can bound E(Xn ) as
follows.
Theorem 15 Suppose that Xn ≥ 0 and that for every ε > 0,

P(Xn > ε) ≤ c1 e^{−c2 nε²}. (12)
Thus,

E( max_{1≤i≤n} Xi ) ≤ log n / t + tσ²/2.

The result follows by setting t = √(2 log n)/σ, which gives E( max_{1≤i≤n} Xi ) ≤ σ √(2 log n). □
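The following sketch compares the average maximum of n standard normals with the bound σ√(2 log n) (n, σ, and the number of replications are arbitrary choices; NumPy assumed):

```python
import numpy as np

# Expected maximum of n independent N(0, σ²) variables versus σ·sqrt(2 log n).
rng = np.random.default_rng(3)
n, reps, sigma = 1_000, 5_000, 1.0

max_draws = (sigma * rng.standard_normal((reps, n))).max(axis=1)
print(max_draws.mean())                  # empirically around 3.2 for n = 1000
print(sigma * np.sqrt(2 * np.log(n)))    # bound ≈ 3.72
```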
5 OP and oP
In statistics, probability and machine learning, we make use of oP and OP notation.
Recall first that an = o(1) means that an → 0 as n → ∞. an = o(bn ) means that
an /bn = o(1).
an = O(1) means that an is eventually bounded, that is, for all large n, |an | ≤ C for some
C > 0. an = O(bn ) means that an /bn = O(1).
We write an ∼ bn if both an /bn and bn /an are eventually bounded. In computer science this is written as an = Θ(bn ), but we prefer an ∼ bn since, in statistics, Θ often denotes a parameter space.
Now we move on to the probabilistic versions. Say that Yn = oP (1) if, for every ε > 0,

P(|Yn | > ε) → 0.

Say that Yn = oP (an ) if Yn /an = oP (1).

Say that Yn = OP (1) if, for every ε > 0, there is a C > 0 such that

P(|Yn | > C) ≤ ε.
Say that Yn = OP (an ) if Yn /an = OP (1).
Let’s use Hoeffding’s inequality to show that sample proportions are within OP (1/√n) of the true mean. Let Y1 , . . . , Yn be coin flips, i.e. Yi ∈ {0, 1}. Let p = P(Yi = 1). Let
p̂n = (1/n) ∑_{i=1}^n Yi .
We will show that p̂n − p = oP (1) and p̂n − p = OP (1/√n). We have that

P(|p̂n − p| > ε) ≤ 2e^{−2nε²} → 0
and so p̂n − p = oP (1). Also,

P( √n |p̂n − p| > C ) = P( |p̂n − p| > C/√n ) ≤ 2e^{−2C²} < δ

if we pick C large enough. Hence, √n (p̂n − p) = OP (1) and so

p̂n − p = OP (1/√n).
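Here is a small simulation of this fact (a sketch assuming NumPy; p, the sample sizes, and the number of replications are arbitrary): the fluctuations of p̂n − p shrink to zero, while those of √n (p̂n − p) stay of constant order.

```python
import numpy as np

# Illustrating p̂n − p = oP(1) and sqrt(n)(p̂n − p) = OP(1) for coin flips.
rng = np.random.default_rng(4)
p, reps = 0.4, 20_000

for n in (100, 1_000, 10_000):
    p_hat = rng.binomial(n, p, size=reps) / n
    raw = np.quantile(np.abs(p_hat - p), 0.99)                   # shrinks with n
    scaled = np.quantile(np.sqrt(n) * np.abs(p_hat - p), 0.99)   # roughly constant
    print(n, round(raw, 4), round(scaled, 3))
```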
Make sure you can prove the following: