Lecture 4: September 4
Lecturer: Siva Balakrishnan
for all X1 , . . . , Xn , Y1 , . . . , Yn ∈ R.
For such functions we have that if X1 , . . . , Xn ∼ N (0, 1) then,

P(|f (X1 , . . . , Xn ) − E[f (X1 , . . . , Xn )]| ≥ t) ≤ 2 exp(−t²/(2L²)).
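As a quick sanity check, this bound is easy to simulate for a simple Lipschitz function. The choice f(x) = max_i x_i below (which is 1-Lipschitz with respect to the Euclidean norm, so L = 1) and all numerical values are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 50, 20_000, 1.5

# f(x) = max_i x_i is 1-Lipschitz w.r.t. the Euclidean norm, so L = 1
samples = rng.standard_normal((trials, n))
vals = samples.max(axis=1)

# empirical tail probability vs. the concentration bound 2 exp(-t^2 / (2 L^2))
emp = np.mean(np.abs(vals - vals.mean()) >= t)
bound = 2 * np.exp(-t**2 / 2)
print(emp, bound)
```

The empirical tail comes out far below the bound, which is typical: these concentration inequalities are worst-case over all Lipschitz f.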
In particular, if we consider a case when each event Ai is a failure of some type, then the
above inequality says that the probability that even a single failure occurs is at most the
sum of the probabilities of each failure.
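Because the indicator of a union is at most the sum of the indicators, the union bound holds sample-by-sample, even for highly dependent events, and is easy to verify numerically. The correlated "failure" events below are an invented illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
trials, k = 100_000, 5

# k correlated failure events A_i = {Z + noise_i > 2}, sharing one Gaussian Z
Z = rng.standard_normal(trials)
events = np.stack([Z + 0.5 * rng.standard_normal(trials) > 2 for _ in range(k)])

p_any = np.mean(events.any(axis=0))   # P(at least one failure occurs)
p_sum = events.mean(axis=1).sum()     # sum of the individual failure probabilities
print(p_any, p_sum)
```

Here p_any is noticeably smaller than p_sum because the events overlap heavily; the union bound is tight only when the events are nearly disjoint.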
Example: The Johnson-Lindenstrauss Lemma. One very nice application of χ2 tail
bounds is in the analysis of what are known as “random projections”. Suppose we have a
data set X1 , . . . , Xn ∈ Rd where d is quite large. Storing such a dataset might be expensive
and as a result we often resort to “sketching” or “random projection” where the goal is
to create a map F : Rd → Rm , with m ≪ d. We then instead store the mapped dataset
{F (X1 ), . . . , F (Xn )}. The challenge is to design this map F in a way that preserves essential
features of the original dataset. In particular, we would like that for every pair (Xi , Xj ) we
have that,
(1 − ε)||Xi − Xj||₂² ≤ ||F (Xi ) − F (Xj )||₂² ≤ (1 + ε)||Xi − Xj||₂²,

i.e. the map preserves all the pair-wise distances up to a (1 ± ε) factor. Of course, if m is
large we might expect this is not too difficult.
The Johnson-Lindenstrauss lemma is quite stunning: it says that a simple randomized con-
struction will produce such a map with probability at least 1 − δ provided that,
m ≥ 16 log(n/δ)/ε².
Notice that this is completely independent of the original dimension d and depends only
logarithmically on the number of points n. This map can result in huge savings in storage
cost while still essentially preserving all the pairwise distances.
The map itself is quite simple: we construct a matrix Z ∈ Rm×d , where each entry of Z is
i.i.d N (0, 1). We then define the map as:
F (Xi ) = Z Xi /√m.
Now let us fix a pair (Xj , Xk ) and consider,
||F (Xj ) − F (Xk )||₂² / ||Xj − Xk ||₂² = ||Z(Xj − Xk )||₂² / (m ||Xj − Xk ||₂²)

= (1/m) Σ_{i=1}^m ⟨Zi , (Xj − Xk )/||Xj − Xk ||₂⟩²,

where Zi denotes the i-th row of Z, and we write Ti = ⟨Zi , (Xj − Xk )/||Xj − Xk ||₂⟩² for the i-th summand.
Now, for fixed numbers a1 , . . . , ad the distribution of Σ_{j=1}^d aj Zij is Gaussian with mean 0 and
variance Σ_{j=1}^d aj². Since the vector (Xj − Xk )/||Xj − Xk ||₂ has unit norm, each term Ti is an
independent χ² random variable with 1 degree of freedom. Applying the χ² tail bound to the
average of the Ti 's yields the following.
Thus for the fixed pair (Xj , Xk ) the probability that our map fails to preserve the distance
is exponentially small, i.e. is at most 2 exp(−mε²/8). Now, to find the probability that our
map fails to preserve any of our (n choose 2) pairwise distances we simply apply the union bound to
conclude that the probability of any failure is at most:

P(failure) ≤ (n choose 2) · 2 exp(−mε²/8).
If we choose

m ≥ 16 log(n/δ)/ε²,
then this probability is at most δ as desired. An important point to note is that the expo-
nential concentration is what leads to such a small value for m (i.e. it only needs to grow
logarithmically with the sample size).
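As a sanity check on the whole construction, here is a minimal numerical sketch. The dataset is synthetic and the values of n, d, ε and δ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps, delta = 50, 2000, 0.5, 0.1
m = int(np.ceil(16 * np.log(n / delta) / eps**2))  # note: independent of d

X = rng.standard_normal((n, d))     # synthetic dataset of n points in R^d
Z = rng.standard_normal((m, d))     # projection matrix with i.i.d. N(0,1) entries
F = X @ Z.T / np.sqrt(m)            # rows are F(X_i) = Z X_i / sqrt(m)

# verify every pairwise squared distance is preserved up to a (1 +/- eps) factor
ok = True
for i in range(n):
    for j in range(i + 1, n):
        orig = np.sum((X[i] - X[j]) ** 2)
        proj = np.sum((F[i] - F[j]) ** 2)
        ok = ok and (1 - eps) * orig <= proj <= (1 + eps) * orig
print(m, ok)
```

Note that m ≈ 400 here regardless of d, so taking d much larger changes the storage savings but not the projection dimension.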
In the rest of this lecture we discuss the convergence of random variables. At a high-level,
our first few lectures focused on non-asymptotic properties of averages i.e. the tail bounds we
derived applied for any fixed sample size n. For the next few lectures we focus on asymptotic
properties, i.e. we ask the question: what happens to the average of n i.i.d. random variables
as n → ∞.
Roughly, from a theoretical perspective the idea is that many expressions will considerably
simplify in the asymptotic regime. Rather than having many different tail bounds, we will
derive simple “universal results” that hold under extremely weak conditions.
From a slightly more practical perspective, asymptotic theory is often useful to obtain
approximate confidence intervals (and p-values and other useful things) that, although
approximate, are typically more useful in practice. We will follow quite closely Section 5.5 of Casella and
Berger.
Throughout, we will focus on the setting where we have a sequence of random variables
X1 , . . . , Xn and another random variable X, and would like to define what it means for the
sequence to converge to X. In each case, to simplify things you should also think about the
case when X is deterministic, i.e. when X = c with probability 1 (for some constant c).
Importantly, we will not assume that the RVs X1 , . . . , Xn are independent.
We will not use almost sure convergence in this course so you should feel free to ignore this
section. A natural analogue of the usual convergence would be to hope that,
lim_{n→∞} Xn = X.
These are both however random variables so one has to at least specify on what event we
are hoping for this statement to be true.
The correct analogue turns out to be to require:
P( lim_{n→∞} Xn = X ) = 1.
There are measure theoretic subtleties to be aware of here. In particular, the event
inside the probability statement involves the entire infinite sequence, and it requires some
machinery to make this precise.
There are other equivalent (this is somewhat difficult to see) ways to define almost sure
convergence. Equivalently, we say that Xn converges almost surely to X if there is a set Ω
of probability mass 1, i.e. P(Ω) = 1, such that for every ω ∈ Ω and every ε > 0, there is
some N (ω, ε) such that for all n ≥ N (ω, ε):

|Xn (ω) − X(ω)| ≤ ε.
Roughly, the way to think about this type of convergence is to imagine that there is some
set of exceptional events on which the random variables can disagree, but these exceptional
events have probability 0 as n → ∞. Barring these exceptional events, the sequence con-
verges just like sequences of real numbers do. The exceptional events are where the "almost"
in almost sure arises.
We say that Xn converges in probability to X if for every ε > 0,

lim_{n→∞} P(|Xn − X| ≥ ε) = 0.
To build intuition it is perhaps useful to consider the case when X is deterministic, i.e.
X = c with probability 1. Then convergence in probability is saying that as n gets large the
distribution of Xn gets more peaked around the value c.
Again somewhat roughly, convergence in probability can be viewed as a statement about the
convergence of probabilities, while almost sure convergence is a convergence of the values of
a sequence of random variables.
We will not prove this statement but convergence in probability is implied by almost sure
convergence. The notes contain a counterexample to the reverse implication but we most
likely will not cover this in lecture.
Example: Weak Law of Large Numbers Suppose that Y1 , . . . , Yn are i.i.d. with E[Yi ] =
µ and Var(Yi ) = σ² < ∞. Define, for i ∈ {1, . . . , n},

Xi = (1/i) Σ_{j=1}^i Yj .
Then by Chebyshev's inequality, for any ε > 0,

P(|Xn − µ| ≥ ε) ≤ Var(Xn )/ε² = σ²/(nε²) → 0,

as desired.
Notes:
1. Strictly speaking the WLLN is true even without the assumption of finite variance, as
long as the first absolute moment is finite. This proof is a bit more difficult.
2. There is a statement that says that under similar assumptions the average converges
almost surely to the expectation. This is known as the strong law of large numbers.
This is actually quite a bit more difficult to prove.
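The WLLN is easy to see numerically. A minimal sketch (all numerical values are illustrative assumptions) estimating P(|X̄n − µ| ≥ ε) for growing n, alongside the Chebyshev bound from the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, eps = 2.0, 3.0, 0.5

# estimate P(|Xbar_n - mu| >= eps) over many repetitions as n grows
probs = {}
for n in (10, 100, 1000):
    Xbar = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1)
    probs[n] = np.mean(np.abs(Xbar - mu) >= eps)
    cheb = min(sigma**2 / (n * eps**2), 1.0)  # Chebyshev bound from the proof
    print(n, probs[n], cheb)
```

The empirical probabilities decay well below the Chebyshev bound, which is quite loose here; the bound's only job in the proof is to tend to 0.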
Ŝn = (1/(n − 1)) Σ_{i=1}^n (Xi − µ̂n )²,

and by Chebyshev's inequality,

P(|Ŝn − σ²| ≥ ε) ≤ Var(Ŝn )/ε²,
Now one can check that this sequence converges in probability but not almost surely.
Roughly, the "1 + s" spike becomes less frequent down the sequence (allowing convergence
in probability), but the limit is not well defined: for any s, Xn (s) alternates between 1 and
1 + s forever.
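The notes' exact construction is omitted here, but a classical example with this behavior puts "spikes" on intervals that sweep [0, 1) with shrinking length; the specific intervals below are an assumption for illustration, not necessarily the notes' example:

```python
import numpy as np

# Intervals I_n of length 2^(-k) sweep [0, 1): write n = 2^k + j with 0 <= j < 2^k
def interval(n):
    k = int(np.floor(np.log2(n)))
    j = n - 2**k
    return j / 2**k, (j + 1) / 2**k

# X_n(s) spikes to 1 + s only when s falls in I_n, and equals 1 otherwise
def X(n, s):
    lo, hi = interval(n)
    return 1 + s if lo <= s < hi else 1.0

s = 0.3
hits = [n for n in range(1, 200) if X(n, s) != 1.0]   # spike indices for this fixed s
lengths = [interval(n)[1] - interval(n)[0] for n in (10, 100, 1000)]
print(hits, lengths)
```

Since P(Xn ≠ 1) equals the interval length, it tends to 0 (convergence in probability), yet every fixed s is hit by a spike in each sweep, so Xn (s) never settles (no almost sure convergence).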
An often useful way to show convergence in probability is to show something stronger known
as convergence in quadratic mean. We say that a sequence converges to X in quadratic mean
if:
E(Xn − X)² → 0,

as n → ∞. Finally, we say that Xn converges in distribution to X if

lim_{n→∞} F_{Xn}(t) = F_X (t),

for all points t where the CDF F_X is continuous. We will see why the exception matters in
a little while but for now it is worth noting that convergence in distribution is the weakest
form of convergence.
For instance, a sequence of i.i.d. N (0, 1) RVs converges in distribution to an independent
N (0, 1) RV, even though the values of the random variables are not close in any meaningful
sense (their distributions are however, identical). A famous example that we will spend a
chunk of the next lecture on is the central limit theorem. The central limit theorem says that
an average of i.i.d. random variables (appropriately normalized) converges in distribution to
a N (0, 1) random variable.
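A preview of the CLT in simulation: the CDF of the normalized average is compared to the standard normal CDF at a few points. The Bernoulli choice and all numerical values are illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n, trials, p = 400, 20_000, 0.3

# normalized averages sqrt(n) * (Xbar - mu) / sigma for Bernoulli(p) samples
Y = rng.random((trials, n)) < p
Z = np.sqrt(n) * (Y.mean(axis=1) - p) / sqrt(p * (1 - p))

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))  # standard normal CDF
for t in (-1.0, 0.0, 1.0):
    print(t, np.mean(Z <= t), Phi(t))
```

Even though each Yi is far from Gaussian, the empirical CDF of Z is already close to Φ at n = 400, which is exactly the sense of convergence in distribution.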
The picture to keep in mind to understand the relationships is the following one:
We will re-visit this in the next lecture and perhaps try to prove some of the implications
(or disprove some of the non-implications).
P(|X(n) − 1| ≥ ε) = P(X(n) ≤ 1 − ε)

= Π_{i=1}^n P(Xi ≤ 1 − ε) = (1 − ε)^n → 0.
2. The random variable n(1 − X(n) ) converges in distribution to an Exp(1) RV. To see
this we compute: