
36-705: Intermediate Statistics Fall 2018

Lecture 4: September 4
Lecturer: Siva Balakrishnan

We will first continue our discussion of exponential concentration inequalities.

4.1 Levy’s inequality


There is a similar concentration inequality that applies to functions of Gaussian random
variables that are sufficiently smooth. In this case, the assumption is quite different: we
assume that f is L-Lipschitz with respect to the Euclidean norm, i.e.
|f(X_1, \ldots, X_n) - f(Y_1, \ldots, Y_n)| \le L \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2},

for all X1 , . . . , Xn , Y1 , . . . , Yn ∈ R.
For such functions we have that if X_1, \ldots, X_n are i.i.d. N(0, 1) then,

P(|f(X_1, \ldots, X_n) - E[f(X_1, \ldots, X_n)]| \ge t) \le 2 \exp\left( -\frac{t^2}{2L^2} \right).
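As a quick numerical sanity check, here is a minimal sketch (not part of the original notes; the choice of f, the dimension, the deviation level t, and the sample sizes are all arbitrary). It estimates the deviation probability for the Euclidean norm, a 1-Lipschitz function of Gaussians, and compares it with the bound above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 50, 100_000, 1.5            # dimension, Monte Carlo repetitions, deviation level

X = rng.standard_normal((trials, n))
f = np.linalg.norm(X, axis=1)              # f(x) = ||x||_2 is 1-Lipschitz, so L = 1

emp = np.mean(np.abs(f - f.mean()) >= t)   # empirical deviation probability (sample mean stands in for E[f])
bound = 2 * np.exp(-t**2 / 2)              # 2 exp(-t^2 / (2 L^2)) with L = 1
print(f"empirical: {emp:.5f}   bound: {bound:.5f}")
```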

4.2 χ2 tail bounds


A χ2 random variable with n degrees of freedom, denoted by Y ∼ χ2_n, is a RV that is a sum
of the squares of n i.i.d. standard Gaussian RVs, i.e. Y = \sum_{i=1}^{n} X_i^2 where each X_i ∼ N(0, 1). Suppose that
Z_1, \ldots, Z_n ∼ N(0, 1), then the expected value E[Z_i^2] = 1, and we have the χ2 tail bound:
P\left( \left| \frac{1}{n} \sum_{k=1}^{n} Z_k^2 - 1 \right| \ge t \right) \le 2 \exp(-n t^2 / 8) \quad \text{for all } t \in (0, 1).
You will derive this in your HW using the Chernoff method. Analogous to the class of sub-
Gaussian RVs, χ2 random variables belong to a class of what are known as sub-exponential
random variables. The main note-worthy difference is that the Gaussian-type behaviour of
the tail only holds for small values of the deviation t.
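Before the detour, here is a small illustrative simulation (the values of n, t, and the number of repetitions are arbitrary choices) comparing the empirical tail of the normalized χ2 average with the bound 2 exp(−nt²/8).

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, t = 100, 50_000, 0.3                # degrees of freedom, repetitions, deviation t in (0, 1)

Z = rng.standard_normal((trials, n))
avg = (Z**2).mean(axis=1)                      # (1/n) * sum_k Z_k^2, which has expectation 1

emp = np.mean(np.abs(avg - 1) >= t)            # empirical tail probability
bound = 2 * np.exp(-n * t**2 / 8)              # the chi-squared tail bound from above
print(f"empirical: {emp:.5f}   bound: {bound:.5f}")
```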
Detour: The union bound. This is also known as Boole’s inequality. It says that if we
have events A1 , . . . , An then
P\left( \bigcup_{i=1}^{n} A_i \right) \le \sum_{i=1}^{n} P(A_i).


In particular, if we consider a case when each event Ai is a failure of some type, then the
above inequality says that the probability that even a single failure occurs is at most the
sum of the probabilities of each failure.
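As a tiny illustration (the events below are invented for this sketch and are deliberately dependent), a Monte Carlo estimate shows the probability of the union staying below the sum of the individual probabilities, with no independence assumption needed.

```python
import numpy as np

rng = np.random.default_rng(2)
trials, n = 200_000, 5

# n dependent "failure" events A_i = {|G + eps_i| >= 2}, all driven by a shared Gaussian G
G = rng.standard_normal((trials, 1))
eps = 0.5 * rng.standard_normal((trials, n))
events = np.abs(G + eps) >= 2.0

p_union = np.mean(events.any(axis=1))          # estimate of P(A_1 ∪ ... ∪ A_n)
p_sum = events.mean(axis=0).sum()              # estimate of sum_i P(A_i)
print(f"P(union) ≈ {p_union:.4f} <= sum of P(A_i) ≈ {p_sum:.4f}")
```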
Example: The Johnson-Lindenstrauss Lemma. One very nice application of χ2 tail
bounds is in the analysis of what are known as “random projections”. Suppose we have a
data set X1 , . . . , Xn ∈ Rd where d is quite large. Storing such a dataset might be expensive
and as a result we often resort to “sketching” or “random projection” where the goal is
to create a map F : \mathbb{R}^d \to \mathbb{R}^m, with m \ll d. We then instead store the mapped dataset
{F (X1 ), . . . , F (Xn )}. The challenge is to design this map F in a way that preserves essential
features of the original dataset. In particular, we would like that for every pair (Xi , Xj ) we
have that,
(1 - \epsilon)\|X_i - X_j\|_2^2 \le \|F(X_i) - F(X_j)\|_2^2 \le (1 + \epsilon)\|X_i - X_j\|_2^2,

i.e. the map preserves all the pair-wise distances up to a (1 ± \epsilon) factor. Of course, if m is
large we might expect this is not too difficult.
The Johnson-Lindenstrauss lemma is quite stunning: it says that a simple randomized
construction will produce such a map with probability at least 1 − δ provided that,

m \ge \frac{16 \log(n/\delta)}{\epsilon^2}.
Notice that this is completely independent of the original dimension d and depends only
logarithmically on the number of points n. This map can result in huge savings in storage
cost while still essentially preserving all the pairwise distances.
The map itself is quite simple: we construct a matrix Z ∈ Rm×d , where each entry of Z is
i.i.d N (0, 1). We then define the map as:
F(X_i) = \frac{Z X_i}{\sqrt{m}}.
Now let us fix a pair (Xj , Xk ) and consider,
\frac{\|F(X_j) - F(X_k)\|_2^2}{\|X_j - X_k\|_2^2}
= \left\| \frac{Z(X_j - X_k)}{\sqrt{m}\, \|X_j - X_k\|_2} \right\|_2^2
= \frac{1}{m} \sum_{i=1}^{m} \underbrace{\left\langle Z_i, \frac{X_j - X_k}{\|X_j - X_k\|_2} \right\rangle^2}_{T_i},

where Z_i denotes the i-th row of Z.
Now, for any fixed numbers a_j, the distribution of \sum_{j=1}^{d} a_j Z_{ij} is Gaussian with mean 0 and
variance \sum_{j=1}^{d} a_j^2. So each term T_i is an independent \chi^2_1 random variable. Now applying the
\chi^2 tail bound (with m in place of n), we obtain that,


P\left( \left| \frac{\|F(X_j) - F(X_k)\|_2^2}{\|X_j - X_k\|_2^2} - 1 \right| \ge \epsilon \right) \le 2 \exp(-m \epsilon^2 / 8).

Thus for the fixed pair (X_j, X_k) the probability that our map fails to preserve the distance
is exponentially small, i.e. is at most 2\exp(-m\epsilon^2/8). Now, to find the probability that our
map fails to preserve any of our \binom{n}{2} pairwise distances we simply apply the union bound to
conclude that the probability of any failure is at most:

P(\text{failure}) \le \binom{n}{2} \cdot 2 \exp(-m \epsilon^2 / 8).

Now, it is straightforward to verify that if

m \ge \frac{16 \log(n/\delta)}{\epsilon^2},

then this probability is at most δ as desired. An important point to note is that the exponential
concentration is what leads to such a small value for m (i.e. it only needs to grow
logarithmically with the sample size).
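Here is a minimal sketch of this construction (illustrative only: the data is synthetic and the choices of n, d, ε, and δ are arbitrary). It draws the Gaussian matrix Z, applies F(x) = Zx/√m with m set by the bound above, and reports the worst pairwise distortion.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, d, eps, delta = 50, 5_000, 0.25, 0.05

m = int(np.ceil(16 * np.log(n / delta) / eps**2))   # projection dimension from the JL lemma
X = rng.standard_normal((n, d))                     # synthetic high-dimensional data points

Z = rng.standard_normal((m, d))                     # i.i.d. N(0, 1) entries
F = X @ Z.T / np.sqrt(m)                            # row i holds F(X_i) = Z X_i / sqrt(m)

# largest relative distortion of squared pairwise distances over all pairs
worst = max(
    abs(np.sum((F[i] - F[j]) ** 2) / np.sum((X[i] - X[j]) ** 2) - 1)
    for i, j in combinations(range(n), 2)
)
print(f"m = {m}, worst distortion = {worst:.3f} (target eps = {eps})")
```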
In the rest of this lecture we discuss the convergence of random variables. At a high level,
our first few lectures focused on non-asymptotic properties of averages, i.e. the tail bounds we
derived applied for any fixed sample size n. For the next few lectures we focus on asymptotic
properties, i.e. we ask the question: what happens to the average of n i.i.d. random variables
as n → ∞?

Roughly, from a theoretical perspective the idea is that many expressions simplify considerably
in the asymptotic regime. Rather than having many different tail bounds, we will
derive simple “universal results” that hold under extremely weak conditions.

From a slightly more practical perspective, asymptotic theory is often useful for obtaining
approximate confidence intervals (and p-values and other useful quantities) that, although
approximate, are typically more useful in practice. We will follow Section 5.5 of Casella and
Berger quite closely.

4.3 Reminder: convergence of sequences


When we think of convergence of deterministic real numbers the corresponding notions are
classical.
Formally, we say that a sequence of real numbers a1 , a2 , . . . converges to a fixed real number a
if, for every positive number , there exists a natural number N () such that for all n ≥ N (),
|an − a| < . We call a the limit of the sequence and write limn→∞ an = a.
Our focus today will be on developing analogues of this notion that apply to sequences
of random variables. We will first give some definitions and then circle back to relate
the definitions and discuss some examples.

Throughout, we will focus on the setting where we have a sequence of random variables
X1, . . . , Xn and another random variable X, and would like to define what it means for the
sequence to converge to X. In each case, to simplify things you should also think about the
case when X is deterministic, i.e. when X = c with probability 1 (for some constant c).
Importantly, we will not assume that the RVs X1 , . . . , Xn are independent.

4.4 Almost sure convergence

We will not use almost sure convergence in this course so you should feel free to ignore this
section. A natural analogue of the usual convergence would be to hope that,

\lim_{n \to \infty} X_n = X.

These are, however, both random variables, so one has to at least specify on what event we
are hoping for this statement to be true.
The correct analogue turns out to be to require:

P\left( \lim_{n \to \infty} X_n = X \right) = 1.

There are measure-theoretic subtleties to be aware of here. In particular, the event inside
the probability statement involves the entire infinite sequence of random variables, and it
requires some machinery to make this precise.
There are other equivalent (this is somewhat difficult to see) ways to define almost sure
convergence. Equivalently, we say that X_n converges almost surely to X if there is a set Ω
of probability mass 1, i.e. P(Ω) = 1, such that for every ω ∈ Ω and every ε > 0, there is
some N(ω, ε) such that for all n ≥ N(ω, ε):

|X_n(ω) − X(ω)| ≤ ε.

Roughly, the way to think about this type of convergence is to imagine that there is some
set of exceptional outcomes on which the sequence can fail to converge, but this exceptional
set has probability 0. Barring these exceptional events, the sequence converges just like
sequences of real numbers do. The exceptional events are where the “almost” in almost sure
arises.

4.5 Convergence in probability


A sequence of random variables X1 , . . . , Xn converges in probability to a random variable X
if for every ε > 0 we have that,

\lim_{n \to \infty} P(|X_n - X| \ge \epsilon) = 0.

To build intuition it is perhaps useful to consider the case when X is deterministic, i.e.
X = c with probability 1. Then convergence in probability is saying that as n gets large the
distribution of Xn gets more peaked around the value c.
Again somewhat roughly, convergence in probability can be viewed as a statement about the
convergence of probabilities, while almost sure convergence is a convergence of the values of
a sequence of random variables.
We will not prove this statement but convergence in probability is implied by almost sure
convergence. The notes contain a counterexample to the reverse implication but we most
likely will not cover this in lecture.
Example: Weak Law of Large Numbers. Suppose that Y1, . . . , Yn are i.i.d. with E[Yi] =
µ and Var(Yi ) = σ 2 < ∞. Define, for i ∈ {1, . . . , n},
X_i = \frac{1}{i} \sum_{j=1}^{i} Y_j.

The WLLN says that the sequence X1 , X2 , . . . converges in probability to µ.


Proof: The proof is simply an application of Chebyshev’s inequality. Since E[X_n] = µ,
Chebyshev’s inequality gives:

P(|X_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n \epsilon^2}.

This in turn implies that,

\lim_{n \to \infty} P(|X_n - \mu| \ge \epsilon) = 0,

as desired.
Notes:

1. Strictly speaking the WLLN is true even without the assumption of finite variance, as
long as the first absolute moment is finite. This proof is a bit more difficult.
2. There is a statement that says that under similar assumptions the average converges
almost surely to the expectation. This is known as the strong law of large numbers.
This is actually quite a bit more difficult to prove.
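A quick simulation sketch of the WLLN (illustrative; the Bernoulli(0.3) distribution, the value of ε, and the sample sizes are arbitrary choices): the estimated probability P(|X_n − µ| ≥ ε) should shrink toward 0 as n grows.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, eps, trials = 0.3, 0.05, 5_000

for n in [10, 100, 1_000, 10_000]:
    Y = rng.random((trials, n)) < mu              # i.i.d. Bernoulli(mu) samples, E[Y_j] = mu
    Xn = Y.mean(axis=1)                           # the sample average X_n for each repetition
    print(n, np.mean(np.abs(Xn - mu) >= eps))     # Monte Carlo estimate of P(|X_n - mu| >= eps)
```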

Consistency: Convergence in probability will frequently recur in this course. Usually we
will construct an estimator \hat{\theta}_n for some quantity \theta^*. We will then say that the estimator is
consistent if the sequence of RVs \hat{\theta}_n converges in probability to \theta^*.
The WLLN/Chebyshev can already be used to prove some rudimentary consistency guarantees.
For instance, if we consider the sample variance (where \hat{\mu}_n denotes the sample mean):

\hat{S}_n = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \hat{\mu}_n)^2,

then, since E[\hat{S}_n] = \sigma^2, Chebyshev’s inequality gives,

P(|\hat{S}_n - \sigma^2| \ge \epsilon) \le \frac{\mathrm{Var}(\hat{S}_n)}{\epsilon^2},

so a sufficient condition for consistency is that \mathrm{Var}(\hat{S}_n) → 0 as n → ∞.
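As a quick check of this argument (a sketch with arbitrary choices of distribution, ε, and sample sizes), both Var(Ŝ_n) and the probability P(|Ŝ_n − σ²| ≥ ε) should shrink as n grows.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, eps, trials = 4.0, 0.5, 5_000

for n in [10, 100, 1_000]:
    X = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
    S = X.var(axis=1, ddof=1)                          # sample variance with the 1/(n-1) normalization
    print(n, round(S.var(), 4),                        # empirical Var(S_n), shrinking with n
          np.mean(np.abs(S - sigma2) >= eps))          # estimate of P(|S_n - sigma^2| >= eps)
```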


Convergence in probability does not imply almost sure convergence: This example
is from Casella and Berger. Suppose we have a sample space S = [0, 1] with the uniform
distribution; we draw s ∼ U[0, 1] and define X(s) = s.
We define the sequence as:

X_1(s) = s + I_{[0,1]}(s), \quad X_2(s) = s + I_{[0,1/2]}(s), \quad X_3(s) = s + I_{[1/2,1]}(s),

X_4(s) = s + I_{[0,1/3]}(s), \quad X_5(s) = s + I_{[1/3,2/3]}(s), \quad X_6(s) = s + I_{[2/3,1]}(s), \quad \text{and so on.}

Now one can check that this sequence converges in probability to X but not almost surely.
Roughly, for any fixed s the “+1 spike” occurs less and less frequently along the sequence
(which gives convergence in probability), but it never stops occurring, so the limit of X_n(s)
does not exist: X_n(s) keeps alternating between s and 1 + s.
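The construction can be made concrete with a short sketch (illustrative only; the mapping from the index n to its indicator interval is the obvious block-by-block one): the interval length, which equals P(|X_n − X| ≥ ε) for small ε, shrinks to 0, yet for a fixed s the value X_n(s) keeps spiking to 1 + s.

```python
def interval(n):
    """Indicator interval used for X_n: block k = 1, 2, 3, ... is split into k equal pieces."""
    k = 1
    while k * (k + 1) // 2 < n:        # find the block containing index n
        k += 1
    j = n - k * (k - 1) // 2           # 1-based position of n within block k
    return (j - 1) / k, j / k

s = 0.37                               # a fixed outcome s in [0, 1]
for n in [1, 10, 100, 1_000, 5_000]:
    a, b = interval(n)
    X_n = s + (a <= s <= b)            # X_n(s) = s + I_{[a, b]}(s)
    print(n, round(b - a, 4), X_n)     # interval length and the value X_n(s)
```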

4.6 Convergence in quadratic mean

An often useful way to show convergence in probability is to show something stronger known
as convergence in quadratic mean. We say that a sequence converges to X in quadratic mean
if:

E[(X_n - X)^2] \to 0,

as n → ∞. We will return to this one when we discuss some examples.



4.7 Convergence in distribution


The other commonly encountered mode of convergence is convergence in distribution. We
say that a sequence converges to X in distribution if:
\lim_{n \to \infty} F_{X_n}(t) = F_X(t),

for all points t where the CDF FX is continuous. We will see why the exception matters in
a little while but for now it is worth noting that convergence in distribution is the weakest
form of convergence.
For instance, a sequence of i.i.d. N(0, 1) RVs converges in distribution to an independent
N(0, 1) RV, even though the values of the random variables are not close in any meaningful
sense (their distributions are, however, identical). A famous example that we will spend a
chunk of the next lecture on is the central limit theorem. The central limit theorem says that
an average of i.i.d. random variables (appropriately normalized) converges in distribution to
a N (0, 1) random variable.
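As a preview, here is a small sketch (not the formal statement, which comes next lecture; the Exp(1) summands and sample sizes are arbitrary) comparing quantiles of the standardized average √n(X̄_n − µ)/σ with those of N(0, 1).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, trials = 500, 20_000

Y = rng.exponential(1.0, size=(trials, n))        # i.i.d. Exp(1): mean 1, variance 1
T = np.sqrt(n) * (Y.mean(axis=1) - 1.0)           # standardized average (sigma = 1 here)

for q in [0.05, 0.5, 0.95]:                       # compare empirical quantiles with N(0, 1)
    print(q, round(float(np.quantile(T, q)), 3), round(float(stats.norm.ppf(q)), 3))
```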
The picture to keep in mind to understand the relationships is the following: convergence in quadratic mean implies convergence in probability, almost sure convergence implies convergence in probability, and convergence in probability implies convergence in distribution (and in general none of the reverse implications hold).

We will re-visit this in the next lecture and perhaps try to prove some of the implications
(or disprove some of the non-implications).

4.8 More Examples


Example 1: Suppose we consider a sequence of random variables X_n ∼ N(0, 1/n). Intuitively, it seems like
this sequence converges to 0. Let us first consider what happens in distribution.
The CDF of the RV that is deterministically 0 is simply FX (x) = 0, for x < 0 and FX (x) = 1
for x ≥ 0. Now, let us consider,

F_{X_n}(x) = P(X_n \le x) = P(Z \le \sqrt{n}\, x),

where Z ∼ N(0, 1). If x > 0 this tends to 1, and if x < 0 this tends to 0. Interestingly, at
x = 0, F_{X_n}(0) = 1/2, which does not converge to F_X(0) = 1. Remember, however, that we
had an exception at points of discontinuity, and x = 0 is exactly such a point, so X_n does
converge in distribution to 0.
Example 2: Let us consider the same example and consider convergence in probability. By
Chebyshev’s inequality,

P(|X_n - 0| \ge \epsilon) \le \frac{E[X_n^2]}{\epsilon^2} = \frac{1}{n \epsilon^2} \to 0,

so the sequence converges to 0 in probability.
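Both conclusions are easy to check numerically. A minimal sketch (ε and the sample sizes are arbitrary): the CDF value at 0 stays near 1/2 for every n, while P(|X_n| ≥ ε) vanishes.

```python
import numpy as np

rng = np.random.default_rng(7)
eps, trials = 0.1, 200_000

for n in [10, 100, 1_000]:
    Xn = rng.normal(0.0, 1.0 / np.sqrt(n), size=trials)   # X_n ~ N(0, 1/n)
    print(n,
          round(np.mean(Xn <= 0.0), 3),                   # estimate of F_{X_n}(0), stays around 1/2
          round(np.mean(np.abs(Xn) >= eps), 4))           # estimate of P(|X_n - 0| >= eps), shrinking to 0
```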


Example 3: Let us consider another example from the Casella and Berger book. Suppose
X_1, X_2, \ldots are i.i.d. U[0, 1]. Let us define X_{(n)} = \max_{1 \le i \le n} X_i. Now, we verify two things:

1. X_{(n)} converges in probability to 1. To see this observe that, for any 0 < \epsilon < 1,

   P(|X_{(n)} - 1| \ge \epsilon) = P(X_{(n)} \le 1 - \epsilon) = \prod_{i=1}^{n} P(X_i \le 1 - \epsilon) = (1 - \epsilon)^n \to 0.

2. The random variable n(1 − X_{(n)}) converges in distribution to an Exp(1) RV. To see
   this we compute:

   F_{n(1 - X_{(n)})}(t) = P(n(1 - X_{(n)}) \le t) = 1 - P(X_{(n)} \le 1 - t/n) = 1 - (1 - t/n)^n \to 1 - \exp(-t) = F_X(t).
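A brief simulation (illustrative; the values of n, ε, and the evaluation points are arbitrary) confirms both claims: X_{(n)} concentrates near 1, and the empirical CDF of n(1 − X_{(n)}) matches 1 − e^{−t}.

```python
import numpy as np

rng = np.random.default_rng(8)
n, trials = 500, 20_000

X = rng.uniform(0.0, 1.0, size=(trials, n))
Xmax = X.max(axis=1)                                      # X_(n), the maximum of n uniforms

print("P(|X_(n) - 1| >= 0.01) ≈", np.mean(np.abs(Xmax - 1) >= 0.01))

T = n * (1.0 - Xmax)                                      # n (1 - X_(n)), approximately Exp(1)
for t in [0.5, 1.0, 2.0]:
    print(t, round(float(np.mean(T <= t)), 3), round(1 - np.exp(-t), 3))   # empirical CDF vs 1 - e^{-t}
```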
