
18.175: Probability Theory

Brice Huang

Spring 2018

These are my lecture notes for the Spring 2018 iteration of 18.175, Probability
Theory, taught by Prof. Vadim Gorin.
These notes are written in LaTeX during lectures in real time, and may contain
errors. If you find an error, or would otherwise like to suggest improvements,
please contact me at [email protected].
Special thanks to Evan Chen and Tony Zhang for the help with formatting,
without which this project would not be possible.
Special thanks to Ryan Alweiss for proofreading these notes and catching
my errors.
These notes were last updated 2018-03-23. The permalink to these notes is
http://web.mit.edu/bmhuang/www/notes/18175-notes.pdf.

Brice Huang Contents

Contents

1 February 6, 2018: Probability Theory, and Why We Care 5


1.1 Administrivia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Biased Coin, Law of Large Numbers . . . . . . . . . . . . . . . . 5
1.3 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Poisson Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Continuous Probability Distributions . . . . . . . . . . . . . . . . 8

2 February 8, 2018: Measure Theory 10


2.1 σ-Algebras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Constructing a Probability Space . . . . . . . . . . . . . . . . . . 11
2.4 Examples of Probability Spaces . . . . . . . . . . . . . . . . . . . 11
2.4.1 N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.2 Uniform Lebesgue measure on the unit square . . . . . . . 12

3 February 13, 2018: CDFs and measurable functions 14


3.1 Cumulative Distribution Functions . . . . . . . . . . . . . . . 14
3.1.1 Example: Discrete/Atomic Measure . . . . . . . . . . . . 15
3.1.2 Example: Measure with Density . . . . . . . . . . . . . . 15
3.1.3 Example: Cantor Set . . . . . . . . . . . . . . . . . . . . . 15
3.2 CDFs on R^k . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 Example: Discrete/Atomic Measure . . . . . . . . . . . . 16
3.2.2 Example: Measure with Density . . . . . . . . . . . . . . 16
3.3 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4 February 15, 2018: Lebesgue Integration 19


4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 The Lebesgue Integral . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1 Indicator Functions . . . . . . . . . . . . . . . . . . . . . . 19
4.2.2 Elementary Functions . . . . . . . . . . . . . . . . . . . . 19
4.2.3 Measurable Functions . . . . . . . . . . . . . . . . . . . . 19
4.3 Integrating Measurable Functions . . . . . . . . . . . . . . . . . . 20
4.4 Riemann and Lebesgue Integrals . . . . . . . . . . . . . . . . . . 21
4.5 Properties of the Lebesgue Integral . . . . . . . . . . . . . . . . . 22


5 February 22, 2018: Lebesgue Integral Computations 24


5.1 Indicator Functions, Then Simple Functions, Then Everything . 24
5.2 Computations with the Lebesgue integral . . . . . . . . . . . . . 26
5.2.1 Example: Gaussian Random Variable . . . . . . . . . . . 27
5.2.2 Example: Gaussian Random Variable Squared . . . . . . 28
5.3 Convergence of Random Variables . . . . . . . . . . . . . . . . . 28

6 February 27, 2018: Convergence of Random Variables 30


6.1 Convergence in distribution . . . . . . . . . . . . . . . . . . . . . 30
6.2 L1 convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.3 Expectation Convergence Theorems . . . . . . . . . . . . . . . . 32

7 March 1, 2018: Product Measures 35


7.1 Product Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.2 Independence of Random Variables . . . . . . . . . . . . . . . . . 36
7.2.1 Properties of Independent Random Variables . . . . . . . 36
7.3 Computations on Independent Random Variables . . . . . . . . . 38

8 March 6, 2018: Sequences of Random Variables 40


8.1 Tikhonov Topology . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.2 Weak Law of Large Numbers . . . . . . . . . . . . . . . . . . . . 42
8.3 Weierstrass Approximation . . . . . . . . . . . . . . . . . . . . . 43

9 March 8, 2018: Strong Law of Large Numbers 46


9.1 Borel-Cantelli and Kolmogorov . . . . . . . . . . . . . . . . . . . 46
9.2 Toeplitz and Kronecker . . . . . . . . . . . . . . . . . . . . . . . 48
9.3 Proof of Strong LLN . . . . . . . . . . . . . . . . . . . . . . . . . 49

10 March 13, 2018: Snow Day 52

11 March 15, 2018: Characteristic Functions 53


11.1 Kolmogorov 0-1 Law . . . . . . . . . . . . . . . . . . . . . . . . . 53
11.2 Characteristic Functions . . . . . . . . . . . . . . . . . . . . . . . 54
11.2.1 Computation With Characteristic Functions . . . . . . . . 55
11.3 Levy Inversion Formula . . . . . . . . . . . . . . . . . . . . . . . 56

12 March 20, 2018: Limits of Characteristic Functions 58


12.1 Characteristic Functions . . . . . . . . . . . . . . . . . . . . . . . 58
12.1.1 Higher-Dimensional Characteristic Functions . . . . . . . 59
12.2 Gaussian vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
12.3 Characteristic Functions and Limits . . . . . . . . . . . . . . . . 61


13 March 22, 2018: Central Limit Theorem and Variations 64


13.1 Characteristic Functions and Limits . . . . . . . . . . . . . . . . 64
13.2 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . 65
13.3 Multidimensional CLT . . . . . . . . . . . . . . . . . . . . . . . . 66
13.4 Lyapunov CLT . . . . . . . . . . . . . . . . . . . . . . . . . . . 67


1 February 6, 2018: Probability Theory, and Why We Care

1.1 Administrivia
Lectures are 9:30-11am Tuesdays and Thursdays.
Office hours are immediately after lecture, or by appointment.
There are 4 psets (40%) and 2 exams (60%). Collaboration on homework is
OK, but acknowledge sources. Exams are closed book.

1.2 Biased Coin, Law of Large Numbers


Consider a coin that outputs heads (1) with probability p, and tails (0) with probability $1 - p$.
If we flip the coin N times, there are $2^N$ possible sequences of results, which we can notate e.g. 10100. We assign a number to each such sequence w, equal to
\[
P(w) = p^{\#\text{1s}} (1-p)^{\#\text{0s}}. \tag{1.1}
\]
This will be our definition of probability. For now, this is some abstract numerical notion.

Lemma 1.1
The following identity holds:
\[
\sum_{w \in \{0,1\}^N} P(w) = 1.
\]

Proof. Binomial Theorem.

We can relate this to experiments by the following theorem. Let $S_N(w) = \#\text{1s in } w$.

Theorem 1.2 (Law of Large Numbers, Bernoulli 1713)
Fix any $\varepsilon > 0$. Then, as $N \to \infty$,
\[
P\left( p - \varepsilon < \frac{S_N(w)}{N} < p + \varepsilon \right) \to 1.
\]

Here, the probability of an event $e(w)$ is defined by
\[
P(e(w)) = \sum_{\substack{w \in \{0,1\}^N \\ e(w) \text{ holds}}} P(w).
\]

We will prove this law with the celebrated Stirling’s Formula.


Lemma 1.3 (Stirling's Formula)
As $n \to \infty$,
\[
n! \sim \sqrt{2\pi n}\left(\frac{n}{e}\right)^n. \tag{1.2}
\]

Proof. “Who has seen this before?” [Everyone raises hand in unison.]
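Before using the formula, it is worth seeing how accurate it already is for modest $n$. The following quick numerical check (an illustration, not part of the lecture) compares $n!$ with Stirling's approximation:

```python
import math

def stirling(n: int) -> float:
    """Stirling's approximation: sqrt(2*pi*n) * (n/e)^n."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

for n in (5, 10, 50, 100):
    ratio = math.factorial(n) / stirling(n)
    print(n, ratio)  # ratio -> 1; the relative error behaves like 1/(12n)
```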

Proof of LLN. Observe that
\[
P[S_N(w) = M] = \binom{N}{M} p^M (1-p)^{N-M}.
\]
By Stirling's Formula, for large N, M:
\begin{align*}
\binom{N}{M} p^M (1-p)^{N-M}
&= \frac{N!}{M!(N-M)!}\, p^M (1-p)^{N-M} \\
&= \frac{\sqrt{2\pi N}\,(N/e)^N}{\sqrt{2\pi M}\,(M/e)^M \sqrt{2\pi(N-M)}\,\big((N-M)/e\big)^{N-M}}\, p^M (1-p)^{N-M} \\
&= \sqrt{\frac{N}{2\pi M(N-M)}} \exp\big[N\log N - M\log M - (N-M)\log(N-M) \\
&\qquad + M\log p + (N-M)\log(1-p)\big] \\
&= \frac{1}{\sqrt{2\pi N}} \frac{1}{\sqrt{x(1-x)}} \exp\big[N\big(-x\log x - (1-x)\log(1-x) + x\log p \\
&\qquad + (1-x)\log(1-p)\big)\big],
\end{align*}
where $x = \frac{M}{N}$. Define

f(x) = -x\log x - (1-x)\log(1-x) + x\log p + (1-x)\log(1-p).

Observe that the probability of interest has a sharp peak when f(x) is maximized. By omitted computation:
\[
f'(x) = \log\left(\frac{1-x}{x} \cdot \frac{p}{1-p}\right).
\]
So, f(x) has a unique max, of value 0, at x = p.


This lets us bound the tail probabilities of $\frac{S_N(w)}{N}$:
\[
P\left(\frac{S_N(w)}{N} < p - \varepsilon\right) \approx \sum_{x = M/N \le p - \varepsilon} \frac{1}{\sqrt{2\pi N}} \frac{1}{\sqrt{x(1-x)}} \exp(N f(x)).
\]
Note that
\[
\frac{1}{\sqrt{2\pi N}} \frac{1}{\sqrt{x(1-x)}} \le \frac{2}{\sqrt{2\pi}}
\]
when each of $x, 1-x \ge \frac{1}{N}$. Because the sum has at most N terms,
\[
P\left(\frac{S_N(w)}{N} < p - \varepsilon\right) \le \frac{2N}{\sqrt{2\pi}} \exp(N f(p - \varepsilon)) \to 0,
\]
since $f(p - \varepsilon) < 0$.

By a similar argument,
\[
P\left(\frac{S_N(w)}{N} > p + \varepsilon\right) \to 0.
\]

We used the large deviations principle here – we considered the probability that a variable deviated significantly from its expectation, and showed that it is small. We will see this technique again and again.
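The law is also easy to watch in simulation. The sketch below (the values of p and N are arbitrary choices for illustration, not from the lecture) flips a biased coin and prints the empirical fraction of heads:

```python
import random

def fraction_heads(p: float, n_flips: int, rng: random.Random) -> float:
    """Flip a p-biased coin n_flips times; return S_N / N."""
    heads = sum(rng.random() < p for _ in range(n_flips))
    return heads / n_flips

rng = random.Random(0)  # fixed seed for reproducibility
for n in (10, 100, 10_000):
    print(n, fraction_heads(0.3, n, rng))  # concentrates near p = 0.3
```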

1.3 Central Limit Theorem


So now we know $S_N(w) \approx Np$. The next natural question is: what does the distribution of $S_N(w) - Np$ look like?
The Central Limit Theorem answers this question.

Theorem 1.4 (Central Limit Theorem, De Moivre 1738, Laplace 1812)
Let $0 < p < 1$. Then
\[
\lim_{N\to\infty} P\left[A < \frac{S_N - Np}{\sqrt{p(1-p)}\sqrt{N}} < B\right] = \frac{1}{\sqrt{2\pi}} \int_A^B e^{-x^2/2}\, dx.
\]

In other words, we expect $|S_N - Np|$ to be on the order of $\sqrt{N}$.

Proof. We analyze f(x) on a neighborhood of p by Taylor expansion:
\[
f(x) \approx f(p) + f'(p)(x-p) + \frac{1}{2} f''(p)(x-p)^2 + O\big((x-p)^3\big).
\]
Note that $f(p) = f'(p) = 0$. By omitted computation, $f''(x) = -\frac{1}{x(1-x)}$.

Let $x = \frac{M}{N} = p + u\sqrt{\frac{p(1-p)}{N}}$. Then,
\begin{align*}
P[S_N(w) = M]
&= \frac{1}{\sqrt{2\pi N}} \frac{1}{\sqrt{p(1-p)}} \exp\left(-\frac{N}{p(1-p)} \cdot \frac{p(1-p)}{2N}\, u^2 + N\, O\left(\left(\frac{u}{\sqrt{N}}\right)^3\right)\right) \\
&= \frac{1}{\sqrt{2\pi N}} \frac{1}{\sqrt{p(1-p)}} \exp\left(-u^2/2\right).
\end{align*}
Summing this from A to B gives us a Riemann sum that converges to the desired integral.

While CLT holds much more generally, this proof only works for coin flips.
Replace the coin with a die, and this method doesn’t work. One of the goals of
this class is to develop a framework that will give us results like CLT for free.
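Even though this proof is specific to coin flips, the statement itself is easy to test numerically: compare the exact binomial probability of the window (A, B) with the Gaussian integral, evaluated via the error function (a quick check, not from the lecture; n and p here are arbitrary choices):

```python
import math

def binom_window(n: int, p: float, a: float, b: float) -> float:
    """Exact P[a < (S_n - n p) / sqrt(p (1-p) n) < b] for S_n ~ Binomial(n, p)."""
    sigma = math.sqrt(p * (1 - p) * n)
    return sum(
        math.comb(n, m) * p**m * (1 - p) ** (n - m)
        for m in range(n + 1)
        if a < (m - n * p) / sigma < b
    )

def gauss_window(a: float, b: float) -> float:
    """(1 / sqrt(2 pi)) * integral_a^b exp(-x^2 / 2) dx, computed via erf."""
    return 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))

for n in (10, 100, 1000):
    print(n, binom_window(n, 0.3, -1.0, 1.0), gauss_window(-1.0, 1.0))
```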


1.4 Poisson Limit Theorem


Now, we let the probability p also vary with N. In particular, let
\[
p = p(N) = \frac{\lambda}{N}.
\]

Theorem 1.5 (Poisson Limit Theorem)
The following limit holds:
\[
P[S_N(w) = K] \to e^{-\lambda} \frac{\lambda^K}{K!}.
\]

Remark 1.6. This result implies the identity $\sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} = 1$.

Proof of PLT. We compute:
\begin{align*}
P[S_N(w) = K] &= \frac{N!}{K!(N-K)!} \left(\frac{\lambda}{N}\right)^K \left(1 - \frac{\lambda}{N}\right)^{N-K} \\
&= \frac{\lambda^K}{K!} \cdot \frac{N(N-1)\cdots(N-K+1)}{N^K} \left(1 - \frac{\lambda}{N}\right)^N \left(1 - \frac{\lambda}{N}\right)^{-K} \\
&\to \frac{\lambda^K}{K!} e^{-\lambda},
\end{align*}
where we use $\left(1 - \frac{\lambda}{N}\right)^N \to e^{-\lambda}$ by L'Hopital.

As with CLT, there are forms of this result that hold much more generally,
which we’ll need to develop machinery to talk about.
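A direct numerical comparison makes the convergence visible (λ and k below are arbitrary illustrative choices):

```python
import math

def binom_pmf(n: int, p: float, k: int) -> float:
    """P[S_n = k] for S_n ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(lam: float, k: int) -> float:
    """e^{-lam} * lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, k = 2.0, 3
for n in (10, 100, 10_000):
    print(n, binom_pmf(n, lam / n, k))  # approaches poisson_pmf(lam, k)
print(poisson_pmf(lam, k))
```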

1.5 Continuous Probability Distributions


So far we’ve only seen discrete probability distributions.
In a continuous distribution the probability of each point is zero, so it’s not
even clear what a continuous probability distribution means.
Suppose we have a uniform random variable on [0, 1). We’d like to assign
a probability for each set A ⊂ [0, 1). We expect this probability to have some
natural properties:

• For sets $A_1, A_2, \dots$ disjoint, $P\left(x \in \bigcup_i A_i\right) = \sum_i P(x \in A_i)$;
• Cyclic symmetry, i.e. $P(A_r) = P(A)$ when $A_r = (A + r) \pmod 1$.

Turns out, this isn’t possible.

Proposition 1.7
There is no way to define P (x ∈ A) for all sets A such that the above
properties are satisfied.


Proof. Define an equivalence relation on [0, 1), where $a \sim b$ if $a - b \in \mathbb{Q}$.
Define A by picking one element from each equivalence class. Let $p = P(A)$. We expect that
\[
1 = P([0,1)) = \sum_{r \in [0,1) \cap \mathbb{Q}} P(A_r) = \sum_{r \in [0,1) \cap \mathbb{Q}} p.
\]
This doesn't hold for any p, because the right-hand sum is either 0 or infinite.

So, it’s still not clear what a continuous probability distribution means.
Overcoming this problem requires a notion of measure, which we will develop
over the next few lectures.


2 February 8, 2018: Measure Theory


A probability space is a triple (Ω, A, P ). Here, Ω is some set of elementary
events, A is a σ-algebra, a collection of subsets of Ω, and P is a probability
measure.

2.1 σ-Algebras
Definition 2.1. A σ-algebra $\mathcal{A}$ is a collection of subsets of Ω, obeying the following axioms:

1. $\emptyset \in \mathcal{A}$;
2. If $A \in \mathcal{A}$, then $\bar{A} = \Omega \setminus A \in \mathcal{A}$;
3. If $A_1, A_2, A_3, \dots \in \mathcal{A}$, then $\bigcup_n A_n \in \mathcal{A}$.

This implies that if $A, B \in \mathcal{A}$, then
\[
A \cap B = \overline{\bar{A} \cup \bar{B}} \in \mathcal{A}
\]
and
\[
A \setminus B = A \cap \bar{B} \in \mathcal{A}.
\]

Example 2.2
If Ω is countable, $2^\Omega$ is a σ-algebra. The singletons being in $\mathcal{A}$ implies everything is in $\mathcal{A}$.

Example 2.3
If Ω has some topology (e.g. $\mathbb{R}$), the Borel algebra on Ω is
\[
\mathcal{B}(\Omega) = \{\text{minimal } \sigma\text{-algebra containing all open sets}\}.
\]
Most σ-algebras we'll encounter will be Borel.

When $\Omega = \mathbb{R}$, this σ-algebra is generated by the open intervals.
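On a finite Ω, countable unions reduce to finite ones, so a σ-algebra can be computed mechanically: start from a generating family and close under complement and union until nothing new appears. A small illustrative sketch (the generating family is an arbitrary choice):

```python
from itertools import combinations

def generated_sigma_algebra(omega, generators):
    """Close a family of subsets of a finite set omega under complement and
    pairwise union; on a finite space this yields the generated sigma-algebra."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        for a in list(family):
            if omega - a not in family:  # close under complement
                family.add(omega - a)
                changed = True
        for a, b in combinations(list(family), 2):
            if a | b not in family:      # close under union
                family.add(a | b)
                changed = True
    return family

sigma = generated_sigma_algebra({1, 2, 3, 4}, [{1}])
print(sorted(tuple(sorted(s)) for s in sigma))  # [(), (1,), (1, 2, 3, 4), (2, 3, 4)]
```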

2.2 Measure
Definition 2.4. A probability measure is a function $p : \mathcal{A} \to \mathbb{R}_{\ge 0}$, obeying these axioms:

1. $p(\emptyset) = 0$;
2. If $A \cap B = \emptyset$, then $p(A \cup B) = p(A) + p(B)$;
3. If $A_1 \subset A_2 \subset \dots$, then $\lim_{n\to\infty} p(A_n) = p\left(\bigcup_i A_i\right)$.


As an alternative to axiom (3), we can have axiom (3*): if $A_1, A_2, \dots$ are pairwise disjoint, then
\[
p\left(\bigcup_i A_i\right) = \sum_i p(A_i).
\]
Axiom (3*) is called σ-additivity.

Lemma 2.5
Given axioms (1) and (2), axioms (3) and (3*) are equivalent.

Proof. We give just a sketch.
(3) ⇒ (3*): apply (3) to the increasing partial unions $\bigcup_{i=1}^N A_i$.
(3*) ⇒ (3): apply (3*) to the disjoint differences $A_{i+1} \setminus A_i$.

2.3 Constructing a Probability Space


We first construct a probability space for an algebra of sets. This is the defined
the same way as a σ-algebra, except only finite unions are allowed.

Example 2.6
Finite unions of open/closed intervals in R are an algebra of sets.

Explicitly, the defining axioms for an algebra of sets A are axioms (1) and
S of a σ-algebra, and the followingSmodification of (3): if A1 ⊂ A2 ⊂ . . . and
(2)
i Ai ∈ A, then limn→∞ p(Ai ) = p ( i Ai ).
S
Analogously, we also modify (3*) to add the hypothesiss i Ai ∈ A. The
modified (3) and (3*) are still equivalent.
We care about algebras of sets because:

Theorem 2.7 (Caratheodory's Theorem)
If P is a probability measure on an algebra of sets $\mathcal{A}$, then it has a unique extension to $\sigma(\mathcal{A})$, the smallest σ-algebra containing $\mathcal{A}$.

We will take this theorem for granted. Caratheodory gives us a general


strategy for constructing a probability measure: construct it on an algebra of
sets, and then invoke Caratheodory.

2.4 Examples of Probability Spaces

2.4.1 N
P
Let $\Omega = \mathbb{N} = \{1, 2, 3, \dots\}$, and let $\{p_i\}_{i \in \mathbb{N}}$ be a sequence with $p_i \ge 0$, $\sum p_i = 1$. Then we can take $\mathcal{A} = 2^\Omega$ and, for any $A \in \mathcal{A}$, $p(A) = \sum_{i \in A} p_i$.


2.4.2 Uniform Lebesgue measure on the unit square

We first define the probability measure on rectangles:
\[
p([a,b] \times [c,d]) = (b-a)(d-c).
\]

Let the algebra $\mathcal{A}$ be finite disjoint unions of rectangles (called simple sets), with probability measure
\[
p(A) = \sum_{\text{rect} \in A} p(\text{rect}).
\]

We should check that this is well-defined and σ-additive on A.

Proposition 2.8
p(A) is well-defined.

Proof. Suppose A can be written as a finite disjoint union in two ways:
\[
A = \bigcup_i [a_i, b_i] \times [c_i, d_i] = \bigcup_i [a_i', b_i'] \times [c_i', d_i'],
\]
giving rise to two sums $\sum p_i$ and $\sum p_i'$ for p(A). We need to show these sums are equal.
Decompose both sums further by cutting A along all lines that are boundaries of any rectangle. This clearly works; details omitted.

Proposition 2.9
p(A) is σ-additive on the algebra of sets A.

Proof. Let $A = \bigcup_{n=1}^{\infty} A_n$, where A and the $A_n$ are simple sets. We want $p(A) = \sum_n p(A_n)$.
By writing
\[
p(A) = \sum_{n=1}^{k} p(A_n) + p\left(\bigcup_{n=k+1}^{\infty} A_n\right)
\]
and taking $k \to \infty$, we get $p(A) \ge \sum_{n=1}^{\infty} p(A_n)$.
We can find a compact (closed, bounded) $A^- \subset A$ such that
\[
p(A^-) \ge p(A) - \varepsilon,
\]
and open sets $A_n^+ \supset A_n$ such that
\[
p(A_n^+) \le p(A_n) + \frac{\varepsilon}{2^n}.
\]
Now, $\bigcup_n A_n^+$ is an open cover of the compact set $A^-$, so there is a finite subcover. Thus,
\[
p(A^-) \le \sum_{n=1}^{N} p(A_n^+).
\]
This gives us
\[
p(A) \le \sum_n p(A_n) + 2\varepsilon
\]
for any $\varepsilon$. Thus $p(A) = \sum_n p(A_n)$.

By Caratheodory's Theorem, this extends uniquely to a probability measure on the Borel σ-algebra $\mathcal{B}([0,1]^2)$. This procedure works, more generally, for the k-dimensional Borel σ-algebra $\mathcal{B}([0,1]^k)$.
But, this isn't that satisfying because this construction is abstract, and because the resulting measure isn't complete.
Definition 2.10. A probability measure µ on $\mathcal{A}$ is complete if, whenever $\mu(A) = 0$ and $B \subset A$, we have $B \in \mathcal{A}$ and $\mu(B) = 0$.

We can make a complete and more explicit construction using the Lebesgue σ-algebra $\mathcal{L}([0,1]^2)$. We define the Lebesgue outer measure by
\[
\mu^*(A) = \inf_{\text{simple sets } A' \text{ covering } A} p(A').
\]
We say $A \in \mathcal{L}([0,1]^2)$ if for all $\varepsilon > 0$, there is a simple set B such that $\mu^*(A \triangle B) < \varepsilon$.
We will take the following theorem for granted.

Theorem 2.11
L([0, 1]2 ) is a σ-algebra, and µ∗ is a σ-additive probability measure on it.

Remark 2.12. We have the (strict) inclusion chain
\[
\text{Simple sets} \subset \mathcal{B}([0,1]^k) \subset \mathcal{L}([0,1]^k) \subset 2^{[0,1]^k}.
\]


3 February 13, 2018: CDFs and measurable functions

3.1 Cumulative Distribution Functions


Definition 3.1. Given a probability measure P on $\mathbb{R}$, define its cumulative distribution function (CDF) by
\[
F_p(x) = P((-\infty, x]).
\]

CDFs have the following properties:

1. Monotonically increasing. This is obvious.

2. Right-continuous, i.e.
\[
\lim_{x \to y^+} F_p(x) = F_p(y).
\]
This is because
\[
P((-\infty, y]) = P\left(\bigcap_{x > y} (-\infty, x]\right) = \lim_{x \to y^+} P((-\infty, x]).
\]

3. The following limits hold:
\[
\lim_{x \to -\infty} F_p(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} F_p(x) = 1.
\]

Note that left-continuity does not hold in general. In particular,
\[
\lim_{x \to y^-} F_p(x) \ne F_p(y)
\]
if $F_p$ jumps upward at y.

Theorem 3.2
Measures on B(R) are in one-to-one correspondence with distribution
functions (i.e. functions satisfying 1-3).

Proof. The map $p \mapsto F_p$ obviously yields a distribution function.
Therefore, it suffices to back out a measure P on $\mathcal{B}(\mathbb{R})$ given a distribution function F(x). Recall from last lecture that it's enough to define P on intervals. We define:
\begin{align*}
P((a, b]) &= F(b) - F(a) \\
P((a, b)) &= \lim_{x \to b^-} F(x) - F(a) \\
P([a, b]) &= F(b) - \lim_{x \to a^-} F(x) \\
P([a, b)) &= \lim_{x \to b^-} F(x) - \lim_{y \to a^-} F(y).
\end{align*}
We'll skip verifying that this works because it isn't interesting.


Lemma 3.3
The measure P on B(R) corresponding to a distribution function F (x) is a
σ-additive measure.

Proof. Same as the proof for uniform measure in the previous lecture, using the
compact sets trick.

By applying Caratheodory’s Theorem, we now get that

Theorem 3.4
B(R) is a σ-algebra spanned by intervals.

3.1.1 Example: Discrete/Atomic Measure


This measure is defined by pairs $(x_i \in \mathbb{R}, p_i \in (0,1])$, where $\sum p_i = 1$. The corresponding CDF is
\[
F_p(x) = \sum_{i : x_i \le x} p_i.
\]
The graph of $F_p$ looks like a step function.

In physicist notation (gasp!) this measure can be written as $P = \sum p_i \delta_{x_i}$.
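The step-function behavior, including right-continuity, is easy to see by computing $F_p$ directly; the atoms below are an arbitrary illustrative choice:

```python
def atomic_cdf(atoms, x):
    """F_p(x) = sum of p_i over all atoms with x_i <= x."""
    return sum(p for xi, p in atoms if xi <= x)

atoms = [(0.0, 0.25), (1.0, 0.5), (2.0, 0.25)]  # pairs (x_i, p_i) with sum p_i = 1
print(atomic_cdf(atoms, 0.5))    # 0.25
print(atomic_cdf(atoms, 0.999))  # 0.25 -- the left limit at 1 misses the jump
print(atomic_cdf(atoms, 1.0))    # 0.75 -- the jump at x_i = 1 is included
```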

3.1.2 Example: Measure with Density


Let $p(x) : \mathbb{R} \to [0, \infty)$ be Riemann integrable, with $\int_{-\infty}^{\infty} p(x)\,dx = 1$. The corresponding CDF is
\[
F_p(x) = \int_{-\infty}^{x} p(y)\,dy.
\]
Since the CDF grows continuously, $p = F_p'(x)$.


In some sense these two examples are opposites – in one the CDF grows
discretely, and in the other it grows continuously. A natural question is: is there
anything else, that isn’t just a mixture of these examples? (Yes.)

3.1.3 Example: Cantor Set

We define the Cantor set as follows. Set $C_0 = [0,1]$, and define $C_{n+1}$ by taking out the middle third of each interval in $C_n$. So, for example:
\begin{align*}
C_0 &= [0, 1] \\
C_1 &= [0, \tfrac{1}{3}] \cup [\tfrac{2}{3}, 1] \\
C_2 &= [0, \tfrac{1}{9}] \cup [\tfrac{2}{9}, \tfrac{1}{3}] \cup [\tfrac{2}{3}, \tfrac{7}{9}] \cup [\tfrac{8}{9}, 1],
\end{align*}
and define the Cantor set $C = \bigcap_i C_i$. We will define a distribution with support C.
Since $\mathrm{Leb}(C_k) = \left(\frac{2}{3}\right)^k$, the uniform measure on $C_k$ has PDF
\[
p_k(x) = \left(\frac{3}{2}\right)^k I_{C_k}(x).
\]


Each such PDF has a corresponding CDF $F_k(x)$. Then, we define the CDF
\[
F(x) = \lim_{k \to \infty} F_k(x).
\]
This probability distribution has support on the Cantor set.


Remark 3.5. An alternative formulation: let F(1) = 1. For $x \in [0,1)$, let the ternary representation of x be $0.c_1c_2c_3\dots$. If $x \in C$, all the $c_i$ are 0 or 2. Then, let $F(x) = 0.d_1d_2d_3\dots$ in binary, where
\[
d_i = \begin{cases} 0 & c_i = 0 \\ 1 & c_i = 2 \end{cases}.
\]
If $x \notin C$, let $y = \min\{y' \in C : y' > x\}$, and set F(x) = F(y).
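The digit construction in Remark 3.5 translates directly into code. The sketch below extracts ternary digits in floating point (so the digits, and hence the last few bits of the answer, are approximate):

```python
def cantor_cdf(x: float, iters: int = 40) -> float:
    """Cantor CDF F(x) on [0, 1]: ternary digits 0/2 become binary bits 0/1;
    the first ternary digit equal to 1 contributes a final bit 1, since F is
    flat on the removed middle-third interval containing x."""
    if x >= 1.0:
        return 1.0
    value, bit = 0.0, 0.5
    for _ in range(iters):
        x *= 3.0
        d = int(x)          # next ternary digit of x
        x -= d
        if d == 1:          # x lies in a removed middle third
            return value + bit
        value += bit * (d // 2)
        bit /= 2.0
    return value

print(cantor_cdf(0.0), cantor_cdf(1 / 9), cantor_cdf(1 / 3), cantor_cdf(1.0))
```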

3.2 CDFs on R^k
Let P be a probability measure on $\mathcal{B}(\mathbb{R}^k)$. We can define a CDF
\[
F_p(x_1, \dots, x_k) = P((-\infty, x_1] \times \cdots \times (-\infty, x_k]).
\]
The same three properties from above hold.
The examples above also generalize:

3.2.1 Example: Discrete/Atomic Measure

This measure is defined by pairs $(x_i \in \mathbb{R}^k, p_i \in (0,1])$, where $\sum p_i = 1$. For any set A, define
\[
P(A) = \sum_{i : x_i \in A} p_i.
\]

3.2.2 Example: Measure with Density

Let $p(x) : \mathbb{R}^k \to [0, \infty)$ be Riemann integrable, with
\[
\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(x)\, dx_1 \cdots dx_k = 1.
\]
The corresponding CDF is
\[
F_p(x_1, \dots, x_k) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_k} p(y_1, \dots, y_k)\, dy_1 \cdots dy_k.
\]

3.3 Random Variables


Fix a probability space $(\Omega, \mathcal{A}, P)$.
Definition 3.6. A function $f : (\Omega, \mathcal{A}) \to (\Omega', \mathcal{A}')$ is measurable if for all $A' \in \mathcal{A}'$, $f^{-1}(A') \in \mathcal{A}$.
Definition 3.7. A random variable is a measurable function f : (Ω, A) →
(R, B(R)).

We want this definition of measurable because we want the preimage of any


Borel set to be in A, so that it has a probability.


Example 3.8
$f = I_M$ is measurable iff $M \in \mathcal{A}$. This is because any preimage under f is either $\emptyset$, $M$, $\bar{M}$, or $\Omega$.

In practice, testing measurability by checking the preimage of every Borel


set under f : (Ω, A) → (R, B(R)) is hard. Fortunately, the following lemma says
we don’t have to check everything.

Lemma 3.9
Suppose for all $x \in \mathbb{R}$, $f^{-1}((-\infty, x])$ is measurable (i.e. in $\mathcal{A}$). Then f is measurable.

Proof. Let
\[
\mathcal{H} = \{H \in \mathcal{B}(\mathbb{R}) \mid f^{-1}(H) \in \mathcal{A}\}.
\]
We claim $\mathcal{H}$ is a σ-algebra. We verify, for $H, H_i \in \mathcal{H}$:
\[
f^{-1}\left(\bigcup_i H_i\right) = \bigcup_i f^{-1}(H_i) \in \mathcal{A},
\]
and
\[
f^{-1}(\bar{H}) = f^{-1}(\mathbb{R}) \setminus f^{-1}(H) \in \mathcal{A}.
\]
Since $(-\infty, x] \in \mathcal{H}$, and $\mathcal{B}(\mathbb{R})$ is the minimal σ-algebra containing all intervals $(-\infty, x]$, we must have $\mathcal{H} = \mathcal{B}(\mathbb{R})$. Therefore f is measurable.

Lemma 3.10
Let Ω be a topological space with Borel σ-algebra $\mathcal{B}(\Omega)$. If $f : \Omega \to \mathbb{R}$ is continuous, then f is measurable.

Proof. Preimages of open sets are open.

We now establish results that let us do computations on measurable functions.

Proposition 3.11
The pointwise limit of measurable functions is measurable.

Proof. Suppose $f_n(w) \to f(w)$ for all $w \in \Omega$. Then $f^{-1}$ has the explicit form:
\[
f^{-1}((-\infty, x]) = \bigcap_{m \in \mathbb{N}} \bigcup_{N \in \mathbb{N}} \bigcap_{n \ge N} f_n^{-1}\left(\left(-\infty, x + \tfrac{1}{m}\right]\right).
\]
This is because
\[
f(w) \le x \iff \forall \varepsilon > 0,\ f_n(w) \le x + \varepsilon \text{ for all large enough } n.
\]


Proposition 3.12
If f, g are measurable, then f + g is measurable.

Proof. Same as before, with the observation that
\[
\{w \mid f(w) + g(w) \le x\} = \bigcap_{n \in \mathbb{N}} \bigcup_{p \in \mathbb{Q}} \left(\{w \mid f(w) \le p\} \cap \left\{w \,\middle|\, g(w) \le x - p + \tfrac{1}{n}\right\}\right).
\]

The same result holds for products and quotients of measurable functions.
Proofs are left as an exercise.
Next time, we’ll develop a notion of expected value, which will require
Lebesgue integration theory.


4 February 15, 2018: Lebesgue Integration

4.1 Motivation
Given a measurable function $f : (\Omega, \mathcal{A}, P) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, we would like a notion of expected value. This will be defined via an integral $\int$, such that $\mathbb{E}f = \int f\, dP$.

4.2 The Lebesgue Integral


The Lebesgue Integral for measurable functions f will be defined as follows.

4.2.1 Indicator Functions

Suppose $A \in \mathcal{A}$, and
\[
f(x) = I_A = \begin{cases} 1 & x \in A \\ 0 & x \notin A \end{cases}.
\]
Then we define
\[
\int I_A\, dP = P(A).
\]

4.2.2 Elementary Functions

Next, we define $\int f$ for elementary functions, functions that are constant on each of at most countably many sets $A_i \in \mathcal{A}$ partitioning Ω.
Suppose f takes values $f_1, \dots, f_k \in \mathbb{R}$ on finitely many measurable sets $A_1, \dots, A_k \in \mathcal{A}$ partitioning Ω, i.e. $f = \sum_{i=1}^k f_i I_{A_i}$. Then we define
\[
\int f\, dP = \sum_{i=1}^k f_i \int I_{A_i}\, dP = \sum_{i=1}^k f_i P(A_i).
\]
In the infinite case, suppose f takes values $f_1, f_2, \dots$ on measurable sets $A_1, A_2, \dots \in \mathcal{A}$ partitioning Ω. Then we define
\[
\int f\, dP = \sum_i f_i P(A_i)
\]
if $\sum_i |f_i| P(A_i) < \infty$.
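For a finite elementary function the definition is just a finite weighted sum, as the following sketch computes directly (the pieces chosen are arbitrary):

```python
def integrate_elementary(pieces):
    """Integral of an elementary function given as pairs (f_i, P(A_i)),
    where the sets A_i partition Omega, i.e. the probabilities sum to 1."""
    assert abs(sum(p for _, p in pieces) - 1.0) < 1e-12, "not a partition"
    return sum(f * p for f, p in pieces)

# f = 1 on a set of measure 1/2, f = -3 on measure 1/4, f = 0 on measure 1/4
print(integrate_elementary([(1.0, 0.5), (-3.0, 0.25), (0.0, 0.25)]))  # -0.25
```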

4.2.3 Measurable Functions

Finally, let f(x) be an arbitrary measurable function. Let $f^1, f^2, \dots$ be elementary functions uniformly converging to f, i.e.
\[
\forall \varepsilon > 0\ \exists N\ \forall n > N\ \forall x \in \Omega : |f(x) - f^n(x)| < \varepsilon.
\]
We will denote this as $f^n \rightrightarrows f$. Then, define
\[
\int f(x)\, dP = \lim_{n \to \infty} \int f^n(x)\, dP.
\]


Proposition 4.1
This is well-defined; i.e. the limit exists and does not depend on the choice of $f^n$.

Proof. To check the limit exists we use the Cauchy criterion:
\begin{align*}
\left| \int f^n(x)\, dP - \int f^m(x)\, dP \right| &\le \int |f^n(x) - f^m(x)|\, dP \\
&\le \sup_x |f^n(x) - f^m(x)| \\
&\le \sup_x |f^n(x) - f(x)| + \sup_x |f^m(x) - f(x)|.
\end{align*}
It remains to show this definition doesn't depend on $f^n$. Suppose $f^n \rightrightarrows f$ and $g^n \rightrightarrows f$. Then,
\[
h^n(x) = \begin{cases} f^{n/2}(x) & n \text{ even} \\ g^{(n+1)/2}(x) & n \text{ odd} \end{cases}
\]
also converges uniformly to f. The sequence $\int h^n\, dP$ has a well-defined limit, and $\int f^n\, dP$ and $\int g^n\, dP$ are subsequences of this sequence, so they converge to the same limit.

4.3 Integrating Measurable Functions


The above exposition leads to the natural question: which measurable functions
are integrable? We’ll start with bounded functions.

Proposition 4.2
Any bounded measurable function f is Lebesgue integrable.

Proof. Define $f^n(x) = \frac{1}{n} \lfloor n f(x) \rfloor$. It's clear that $f^n \rightrightarrows f$.
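The staircase approximation in this proof is concrete enough to check numerically: $f^n$ stays within $1/n$ of f everywhere, which is exactly uniform convergence (a sketch; the function f and the sample grid are arbitrary choices):

```python
import math

def staircase(f, n):
    """The elementary approximation f^n(x) = floor(n * f(x)) / n."""
    return lambda x: math.floor(n * f(x)) / n

f = lambda x: x * x
grid = [i / 1000 for i in range(1001)]
for n in (1, 10, 100, 1000):
    fn = staircase(f, n)
    gap = max(abs(f(x) - fn(x)) for x in grid)
    print(n, gap)  # the gap is always < 1/n
```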

The following theorem gives a characterization for unbounded functions.

Proposition 4.3
Let $f \ge 0$ be an unbounded measurable function. f is Lebesgue integrable if and only if
\[
\sum_{n \ge 0} P(f \ge n) < \infty.
\]

Proof. Write
\[
\int f\, dP = \sum_{n \ge 0} \int f(x) I_{n \le f(x) \le n+1}\, dP.
\]
The functions $f(x) I_{n \le f(x) \le n+1}$ are bounded and measurable.
From the definition of Lebesgue integration, the only thing to check is that this sum does not blow up to infinity. Note the two-sided bound
\[
n P(n \le f(x) \le n+1) \le \int f(x) I_{n \le f(x) \le n+1}\, dP \le (n+1) P(n \le f(x) \le n+1).
\]


Summing this, with the observation that
\[
\sum_{k \ge n} P(k \le f(x) \le k+1) = P(f(x) \ge n),
\]
yields
\[
\sum_{n \ge 1} P(f \ge n) \le \int f\, dP \le 1 + \sum_{n \ge 0} P(f \ge n).
\]

Finally, for measurable, not necessarily positive f:
\[
\int f\, dP = \int f I_{f \ge 0}\, dP - \int (-f) I_{f \le 0}\, dP.
\]
This is well-defined if the two integrals on the right are both well-defined.
In other words, f is Lebesgue integrable iff its tails are small; that is, if
\[
\sum_{n \ge 0} P(f \ge n) < \infty \quad \text{and} \quad \sum_{n \ge 0} P(f \le -n) < \infty.
\]
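For a nonnegative integer-valued f, the tail sums are more than a finiteness test: summing $P(f \ge n)$ over $n \ge 1$ reproduces the integral itself, since each value k is counted k times. A quick numerical check with a truncated geometric distribution, chosen only for illustration:

```python
def mean_and_tail_sum(pmf, support):
    """For integer-valued f >= 0: compare sum_k k*P(f=k) with sum_{n>=1} P(f>=n)."""
    mean = sum(k * pmf(k) for k in support)
    tails = sum(
        sum(pmf(k) for k in support if k >= n) for n in range(1, max(support) + 1)
    )
    return mean, tails

pmf = lambda k: 0.5**k  # P(f = k) = (1/2)^k on k = 1..50 (total mass ~ 1)
mean, tails = mean_and_tail_sum(pmf, range(1, 51))
print(mean, tails)  # both approximately 2
```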

4.4 Riemann and Lebesgue Integrals


Proposition 4.4
If f is continuous, the Riemann and Lebesgue integrals agree: with P the uniform measure on [a, b],
\[
\frac{1}{b-a} \int_a^b f(x)\, dx = \int_{[a,b]} f\, dP.
\]

Proof. Take a Lebesgue approximation by simple functions $f^n(x) \rightrightarrows f(x)$.
Each $f^n$ is a step function, so the area under the graph of $f^n$ is a union of rectangles, which equals a Riemann sum $\sum f(x_i) \Delta x_i$.

As an exercise, one can prove the following more general result.

Proposition 4.5
If both the Riemann and Lebesgue integrals exist, then they are equal.

A picture of what’s going on: the Riemann integral takes vertical rectangular
slices of the function, while the Lebesgue integral takes horizontal slices. The
advantage of horizontal slices is that it no longer depends on the structure of
the real line. But, the absolute-summability requirement of Lebesgue integrals
means that we cannot have the Riemann notion of improper integrals.


4.5 Properties of the Lebesgue Integral


The following “obvious” properties hold for all Lebesgue integrable functions f, g:

1. (Linearity) For any $\lambda \in \mathbb{R}$,
\[
\int \lambda f(x)\, dP = \lambda \int f(x)\, dP.
\]

2. (Additivity)
\[
\int (f + g)\, dP = \int f\, dP + \int g\, dP.
\]

3. (Positivity, part 1) If $f \ge 0$, then
\[
\int f\, dP \ge 0.
\]

4. (Positivity, part 2) If $f(x) \ge g(x)$, then
\[
\int f(x)\, dP \ge \int g(x)\, dP.
\]

5. (Lebesgue Dominated Convergence) If $|f(x)| \le g(x)$ and $\int g(x)\, dP$ exists, then $\int f(x)\, dP$ exists, and
\[
\left| \int f(x)\, dP \right| \le \int g\, dP.
\]

6. (Continuity) Suppose $\int f\, dP$ exists. Then, for all $\varepsilon > 0$, there exists $\delta > 0$ such that for all $A \in \mathcal{A}$ with $P(A) < \delta$,
\[
\left| \int f I_A\, dP \right| \le \varepsilon.
\]

We will only prove the last two.

Proof of Lebesgue Dominated Convergence. We just need to show the tails of f are small:
\[
\sum_{n \ge 0} P(f \ge n) \le \sum_{n \ge 0} P(g \ge n) < \infty,
\]
and likewise for $\sum_{n \ge 0} P(f \le -n)$. The bound
\[
\left| \int f(x)\, dP \right| \le \int |f(x)|\, dP \le \int g\, dP
\]
follows from positivity.

Proof of Continuity. Split $f = f^+ - f^-$. We consider only $f^+$, since handling $f^-$ is analogous.
We write the convergent sum
\[
\int f^+\, dP = \sum_{n \ge 0} \int f^+ I_{n \le f \le n+1}\, dP.
\]
Pick N such that the tail of this sum is small:
\[
\sum_{n \ge N} \int f^+ I_{n \le f \le n+1}\, dP \le \frac{\varepsilon}{2},
\]
and pick $\delta = \frac{\varepsilon}{2N}$. Then,
\begin{align*}
\int f^+ I_A\, dP &= \int f^+ I_A I_{f^+ \le N}\, dP + \int f^+ I_A I_{f^+ \ge N}\, dP \\
&\le N \cdot P(A) + \int f^+ I_{f^+ \ge N}\, dP \\
&\le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.
\end{align*}


5 February 22, 2018: Lebesgue Integral Computations

Recall that last time we defined the Lebesgue integral by defining it for indicators $I_A$ with A measurable, then for sums $\sum f_i I_{A_i}$ for $A_i$ disjoint and measurable (so-called simple functions), and finally for all measurable functions by taking a uniform limit.
Moreover, we showed that f is Lebesgue integrable if it is bounded and measurable, or if it is measurable and satisfies
\[
\sum_{n \ge 0} P(|f| \ge n) < \infty.
\]

5.1 Indicator Functions, Then Simple Functions, Then Everything
Proposition 5.1
Let $A_1, A_2, \dots$ partition Ω. Then $\int f\, dP = \sum_n \int f I_{A_n}\, dP$.

Proof. We'll first show this when f is an indicator function, then when f is a simple function, and finally for general f.¹
For indicator functions $f = I_A$, we have
\[
\int f\, dP = P(A) = \sum_n P(A \cap A_n) = \sum_n \int f I_{A_n}\, dP
\]
by σ-additivity.
For simple functions $f = \sum_i f_i I_{A_i'}$, use linearity of the integral.
For general f, let $f_m \rightrightarrows f$ uniformly, where the $f_m$ are simple functions. Fix $\varepsilon > 0$, and set N such that $\sup |f_m - f| < \varepsilon$ for all $m > N$. Then, for all $m > N$,
\[
\left| \int f_m\, dP - \int f\, dP \right| = \left| \int (f_m - f)\, dP \right| \le \varepsilon
\]
and similarly
\[
\left| \int f_m I_{A_n}\, dP - \int f I_{A_n}\, dP \right| \le \varepsilon P(A_n).
\]
Summing the second inequality over n gives the bound
\[
\left| \sum_n \int f_m I_{A_n}\, dP - \sum_n \int f I_{A_n}\, dP \right| \le \varepsilon.
\]
Since $\int f_m\, dP = \sum_n \int f_m I_{A_n}\, dP$, this implies
\[
\left| \int f\, dP - \sum_n \int f I_{A_n}\, dP \right| \le 2\varepsilon.
\]
Take ε to 0 and we are done.


1 As we’ll see, this is a strategy that we will use often.


A lot of other claims are proved by this technique of showing something first for indicator functions, then for simple functions, then for general functions.

Proposition 5.2
If $A_1, A_2, \dots$ partition Ω and the bounds
\[
\int |f| I_{A_n}\, dP < \infty \text{ for each } n, \qquad \sum_{n \ge 1} \int |f| I_{A_n}\, dP < \infty
\]
hold, then f is integrable.

Proof. Same technique.

Definition 5.3. We say f (x) is equivalent to g(x) if they are equal almost
surely, i.e.
Pr(f (x) 6= g(x)) = 0.

Proposition 5.4
If f, g are equivalent, then $\int f(x)\, dP = \int g(x)\, dP$.

Proof. Same technique.

Proposition 5.5
If for all $A \in \mathcal{A}$, $\int f I_A\, dP = \int g I_A\, dP$, then f = g almost surely.

Proof. Suppose for contradiction that P(f ≠ g) > 0. Then, we claim there exists $\varepsilon > 0$ such that either
\[
P(f - g > \varepsilon) > 0
\]
or
\[
P(f - g < -\varepsilon) > 0.
\]
This is true by σ-additivity, because
\[
\{f - g \ne 0\} = \bigcup_n \left\{f - g > \tfrac{1}{n}\right\} \cup \bigcup_n \left\{f - g < -\tfrac{1}{n}\right\}.
\]
So, suppose (WLOG) that
\[
A = \{f - g > \varepsilon\} \in \mathcal{A}
\]
has positive measure. Then,
\[
\int (f - g) I_A\, dP > \varepsilon\, P(A) > 0,
\]
a contradiction.


5.2 Computations with the Lebesgue integral


Definition 5.6. The expectation of a random variable f is E(f) = ∫_Ω f dP.

Let φ : (Ω, A, P) → (Ω′, A′) be measurable, where the second space's
probability measure φ_∗P (called the push-forward of P) is induced by φ.
Formally, it is

    φ_∗P(A′) = P(φ^{−1}(A′)).
Remark 5.7. As an exercise, verify that φ∗ P is σ-additive.

The change of variables formula allows us to convert one random variable
to another.

Theorem 5.8 (Change of Variables Formula)
Let f : Ω′ → R be a random variable. Then

    ∫_{Ω′} f(x) d(φ_∗P) = ∫_Ω f(φ(x)) dP.

Proof. For simple functions, the two integrals evaluate to the same sum, so
there is nothing to prove.
For general functions: if fn are simple functions uniformly approximating f ,
then fn (φ(x)) are simple functions uniformly approximating f (φ(x)). So we are
done.

Suppose f is a random variable. Then,

    E[f] = ∫ f dP = ∫ x dP_f,

where P_f = f_∗P is the probability measure on R corresponding to the
distribution of f. Explicitly,

    P_f(A) = f_∗P(A) = P(f^{−1}(A))

for A ∈ B(R).
Great. Now we’re back to a real-valued integral, so we can compute things.
The only wrinkle is we need to know what the measure Pf looks like.
If P_f is discrete, we directly compute

    E(f) = ∑_i x_i P(f = x_i).

If P_f is continuous with density p, then P_f((−∞, x]) = ∫_{−∞}^x p(t) dt.
The following formula lets us explicitly compute random variables'
expectations:


Proposition 5.9
The following equality of Lebesgue and Riemann integrals holds, provided
∫_{−∞}^∞ |x| p(x) dx < ∞:

    ∫ x dP_f = ∫_{−∞}^∞ x p(x) dx.

Proof of Proposition 5.9. We'll first show that one integral exists iff
the other does. Note the two-sided bound
    n P(n ≤ |f| ≤ n + 1) = n (∫_{−n−1}^{−n} p(x) dx + ∫_n^{n+1} p(x) dx)
                         ≤ ∫_{−n−1}^{−n} |x| p(x) dx + ∫_n^{n+1} |x| p(x) dx
                         ≤ (n + 1) (∫_{−n−1}^{−n} p(x) dx + ∫_n^{n+1} p(x) dx)
                         = (n + 1) P(n ≤ |f| ≤ n + 1).

This implies

    ∫_{−∞}^∞ |x| p(x) dx < ∞  ⇔  ∑_{n≥0} n P(n ≤ |f| ≤ n + 1) < ∞.

The second inequality holds iff the Lebesgue integral is defined, as desired.
Next, we prove the integrals are equal. We can show that

    ∫ x I_{|x|≤A} dP_f = ∫_{−A}^A x p(x) dx

by approximating x with (1/n)⌊nx⌋.
Then, we take A → ∞; since

    |∫ x dP_f − ∫ x I_{|x|≤A} dP_f| = |∫ x I_{|x|>A} dP_f|

goes to 0 by absolute summability of the Lebesgue integral, the integrals are
equal.

5.2.1 Example: Gaussian Random Variable

For simplicity, let’s take Ω = R, and f (x) = x. Consider the Gaussian random
variable N(µ, σ²), with probability density

    p(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.

First, we'll verify that ∫_{−∞}^∞ p(x) dx = 1. By change of variables, this is
equivalent to ∫_{−∞}^∞ e^{−x²/2} dx = √(2π). This is because

    ∫_{−∞}^∞ ∫_{−∞}^∞ e^{−x²/2 − y²/2} dx dy = ∫_0^{2π} ∫_0^∞ r e^{−r²/2} dr dθ = 2π.


Now let’s compute the Gaussian’s expectation.


Z ∞
EN (µ, σ 2 ) = xp(x) dx
−∞
Z ∞
1 (x−µ)2
= √ xe− 2σ2 dx
σ 2π −∞
Z ∞ Z ∞
1 (x−µ)2 1 (x−µ)2
= √ (x − µ)e− 2σ2 dx + √ µe− 2σ2 dx
σ 2π −∞ σ 2π −∞
=0+µ
= µ.
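As a quick sanity check (an aside, not from the lecture), the identity E N(µ, σ²) = µ can be verified by Monte Carlo. The sketch below uses only the Python standard library; the parameter values µ = 2, σ = 3 are arbitrary choices.

```python
import random

random.seed(0)

mu, sigma = 2.0, 3.0   # X ~ N(2, 9); arbitrary illustrative parameters
n = 200_000

# Monte Carlo estimate of E[X]; by the computation above it should be mu.
mean_est = sum(random.gauss(mu, sigma) for _ in range(n)) / n
print(mean_est)  # close to mu = 2
```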

5.2.2 Example: Gaussian Random Variable Squared

By two different uses of change of variables, we know that for all f :

    E(f²) = ∫ x dP_{f²} = ∫ x² dP_f.

In practice, the second is usually easier because we don't have to deal with P_{f²}.
The squared-Gaussian’s expectation is
Z ∞
2 2
EN (µ, σ ) = x2 p(x) dx
−∞
Z ∞
1 (x−µ)2
= √ x2 e− 2σ2 dx
σ 2π −∞
Z ∞
1 x2
= √ (x + µ)2 e− 2σ2 dx
σ 2π −∞
Z ∞
1 x2
2
=µ + √ x2 e− 2σ2 dx
σ 2π −∞
Z ∞
2 1 x2
2
=µ +σ √ x2 e− 2 dx
2π −∞
2 2
=µ +σ
where the final integral is computed by parts.
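The identity E N(µ, σ²)² = µ² + σ² can likewise be checked numerically (an illustrative aside; the parameter values are arbitrary):

```python
import random

random.seed(1)

mu, sigma = 1.5, 2.0   # X ~ N(1.5, 4); arbitrary illustrative parameters
n = 400_000

# Monte Carlo estimate of E[X^2]; the computation above gives mu^2 + sigma^2.
second_moment = sum(random.gauss(mu, sigma) ** 2 for _ in range(n)) / n
print(second_moment)  # close to mu**2 + sigma**2 = 6.25
```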

5.3 Convergence of Random Variables


Let’s define a few notions of a sequence of random variables ξn converging to a
random variable ξ. We’ll develop this theory in more detail next lecture.
Definition 5.10. We say ξ_n converges to ξ almost surely (denoted ξ_n →a.s. ξ)
if the set

    A = {x ∈ Ω | ξ_n(x) → ξ(x)}

obeys P(A) = 1.
Definition 5.11. We say ξ_n converges to ξ in probability (denoted ξ_n →P ξ)
if for all ε > 0,

    P({x ∈ Ω : |ξ_n(x) − ξ(x)| < ε}) → 1    as n → ∞.

This is the sense of convergence for which, for example, the Law of Large
Numbers (see Lecture 1) holds.


Lemma 5.12
If ξn →a.s. ξ, then ξn →P ξ.

Proof. Define the set

    A_{ε,n} = {x ∈ Ω : |ξ_n(x) − ξ(x)| < ε},

and

    A = ⋂_{ε>0} ⋃_N ⋂_{n≥N} A_{ε,n}.

Convince yourself by staring at this that A = {x ∈ Ω | ξ_n(x) → ξ(x)}, so by
hypothesis P(A) = 1. So, for each ε > 0, ⋃_N ⋂_{n>N} A_{ε,n} has measure 1.
Fix ε > 0. Then, ⋂_{n>N} A_{ε,n} has measure tending to 1 as N → ∞. So, for
any δ, P(⋂_{n>N} A_{ε,n}) ≥ 1 − δ for large enough N, and so ξ_n →P ξ.


6 February 27, 2018: Convergence of Random Variables
Recall that last time we talked about two modes of convergence:

• ξ_n → ξ almost surely (denoted ξ_n →a.s. ξ) if

      P(lim_{n→∞} ξ_n(x) = ξ(x)) = 1.

• ξ_n → ξ in probability (denoted ξ_n →P ξ) if for all ε > 0,

      lim_{n→∞} P(|ξ_n(x) − ξ(x)| < ε) = 1.

In some sense, these just differ in the order that we take limits.
Last time we showed that (1) implies (2). Today we will define two more
modes of convergence.

6.1 Convergence in distribution


Definition 6.1. ξn → ξ in distribution (denoted ξn →d ξ) if the distribution
function Fξn (x) = P (ξn ≤ x) obeys

Fξn (x) → Fξ (x)

at every point where Fξ is continuous.

Proposition 6.2
ξn →d ξ if and only if for each continuous bounded f (x),

E(f (ξn )) → E(f (ξ)).

This condition is sometimes called weak convergence of probability measures.

Proof. First, we show weak convergence implies convergence in distribution. Let

    g_{x,m}(y) = 1            if y ≤ x − 1/m,
                 −m(y − x)    if x − 1/m < y < x,
                 0            if y ≥ x.

This is a continuous approximation of f_x(y) = I_{y<x}(y). For fixed x, m, we have

    lim_{n→∞} E f_x(ξ_n) ≥ lim_{n→∞} E g_{x,m}(ξ_n) = E g_{x,m}(ξ) ≥ E f_{x−1/m}(ξ).

Similarly, we can take an approximator of f_x from the other side:

    h_{x,m}(y) = 1              if y ≤ x,
                 1 − m(y − x)   if x < y < x + 1/m,
                 0              if y ≥ x + 1/m,

with the bound

    lim_{n→∞} E f_x(ξ_n) ≤ lim_{n→∞} E h_{x,m}(ξ_n) = E h_{x,m}(ξ) ≤ E f_{x+1/m}(ξ).

This gives the bound

    E f_{x−1/m}(ξ) ≤ lim_{n→∞} E f_x(ξ_n) ≤ E f_{x+1/m}(ξ),

whence

    F_ξ(x − 1/m) ≤ lim_{n→∞} F_{ξ_n}(x) ≤ F_ξ(x + 1/m).
Taking m → ∞ gives Fξn (x) → Fξ (x).
Conversely, suppose ξ_n → ξ in distribution. Note that F_ξ is
monotonic, so it has only countably many discontinuities.² Take a continuous f
with |f| ≤ C. We want to show E f(ξ_n) → E f(ξ).
Pick a point of continuity A > 0 of F_ξ, big enough so that F_{ξ_n}(−A) < δ and
1 − F_{ξ_n}(A) < δ for all n. This implies the bound

    |E f(ξ_n) I_{|ξ_n|>A}| ≤ 2δC,

whence

    |E f(ξ) I_{|ξ|>A}| ≤ 2δC.

Pick points −A = x_0 < x_1 < · · · < x_N = A, such that all x_i are points of
continuity of F_ξ and |f(x) − f(y)| < δ inside each [x_i, x_{i+1}]. Thus we have
the bound

    |∫ f(ξ_n) I_{ξ_n ∈ [x_{i−1}, x_i]} dP − f(x_i)(F_{ξ_n}(x_i) − F_{ξ_n}(x_{i−1}))|
        ≤ δ (F_{ξ_n}(x_i) − F_{ξ_n}(x_{i−1})),

whence

    |E f(ξ_n) I_{|ξ_n|≤A} − ∑_i f(x_i)(F_{ξ_n}(x_i) − F_{ξ_n}(x_{i−1}))|
        ≤ δ (F_{ξ_n}(x_N) − F_{ξ_n}(x_0)) ≤ δ.

Similarly we have

    |E f(ξ) I_{|ξ|≤A} − ∑_i f(x_i)(F_ξ(x_i) − F_ξ(x_{i−1}))| ≤ δ.

As n → ∞, we get that

    lim_{n→∞} |E f(ξ_n) I_{|ξ_n|≤A} − E f(ξ) I_{|ξ|≤A}| ≤ 2δ.

Along with the bounds on E f(ξ_n) I_{|ξ_n|>A} and E f(ξ) I_{|ξ|>A}, this implies

    lim_{n→∞} |E f(ξ_n) − E f(ξ)| ≤ 2δ + 4δC.

Take δ → 0 to conclude the result.
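A standard concrete illustration of this equivalence (not from the lecture): if ξ_n is uniform on {1/n, 2/n, . . . , 1}, then ξ_n →d Uniform[0, 1], and for a bounded continuous test function f the expectations E f(ξ_n) are Riemann sums converging to E f(ξ). With f = cos:

```python
import math

# xi_n uniform on {1/n, 2/n, ..., 1} converges in distribution to
# xi ~ Uniform[0, 1]. For the bounded continuous f = cos, E f(xi_n) is a
# Riemann sum for E f(xi) = integral of cos over [0, 1] = sin(1).
def Ef(n):
    return sum(math.cos(k / n) for k in range(1, n + 1)) / n

limit = math.sin(1.0)  # E cos(Uniform[0, 1])
errors = [abs(Ef(n) - limit) for n in (10, 100, 10_000)]
print(errors)  # shrinking toward 0
```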

6.2 L1 convergence
Definition 6.3. ξn → ξ in expectation (denoted ξn →L1 ξ) if

E |ξn − ξ| → 0.
2 Proof: take one rational number skipped by each discontinuity.


6.3 Expectation Convergence Theorems


There are three classical theorems that guarantee Eξn → Eξ.

Theorem 6.4 (Dominated Convergence, due to Lebesgue)


Suppose ξn → ξ almost surely, and that |ξn | ≤ g for some integrable g.
Then Eξn → Eξ.

Before producing a proof, let's produce a counterexample that shows the
condition |ξ_n| ≤ g is necessary. On Ω = [0, 1], let ξ_n be the random variable

    ξ_n(x) = n    if x ∈ [0, 1/n],
             0    if x ∈ (1/n, 1].

Then E ξ_n = 1 for all n, but ξ_n →a.s. 0, while E 0 = 0.
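Concretely (an illustrative sketch of the counterexample, not from the lecture):

```python
# The counterexample: xi_n(x) = n on [0, 1/n] and 0 elsewhere,
# on Omega = [0, 1] with Lebesgue measure.
def xi(n, x):
    return float(n) if x <= 1.0 / n else 0.0

# E[xi_n] = n * P([0, 1/n]) = 1 for every n, computed exactly:
expectations = [n * (1.0 / n) for n in (1, 10, 100, 1000)]

# ...but at any fixed x > 0 the sequence is eventually 0, so xi_n -> 0 a.s.:
x = 0.01
tail_values = [xi(n, x) for n in (1000, 2000, 4000)]
print(expectations, tail_values)
```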

Lemma 6.5 (Egorov’s Theorem)


If f_n → f almost surely, then for all ε > 0, there exists a set A such that
P(A) > 1 − ε, and f_n → f uniformly on A.

Proof. We introduce the sets

    A_n^m = ⋂_{i≥n} {x : |f_i(x) − f(x)| < 1/m}

and

    A^m = ⋃_n A_n^m.

Observe that P(A^m) = 1. Moreover, because the measures of the A_n^m are
increasing in n, we can find n_0 = n_0(m) such that

    P(A^m \ A^m_{n_0(m)}) < ε/2^m.

Then, take

    A = ⋂_{m=1}^∞ A^m_{n_0(m)}.

As an exercise, verify that A has the properties we want.

Now we’re ready to prove Dominated Convergence.

Proof of Dominated Convergence. Fix ε > 0. There exists δ > 0 such that
P(S) < δ implies |∫ g I_S dP| < ε. By Egorov's theorem, we can take A such
that P(A) > 1 − δ. So, ξ_n → ξ uniformly on A, and

    ∫ (ξ_n − ξ) I_A dP → 0

as n → ∞. Moreover,

    |∫ ξ_n I_{A^c} dP| ≤ ∫ g I_{A^c} dP ≤ ε

and similarly

    |∫ ξ I_{A^c} dP| ≤ ε.

This shows that

    lim_{n→∞} |∫ ξ_n dP − ∫ ξ dP| ≤ 2ε.

As this holds for all ε, the theorem is proved.

The second classical theorem is:

Theorem 6.6 (Monotone Convergence, by B. Levi)
Suppose f_n(x) is monotonically increasing as n → ∞, and ∫ f_n(x) dP ≤ K.
Then:

• There exists f(x) such that f_n → f almost surely;

• The expectations converge: E f_n → E f.

Proof. First, we may assume f_n ≥ 0 for all n, by adding −f_1 to all our functions.
By Markov's Inequality,

    P(f_n(x) > M) ≤ (1/M) ∫ f_n I_{f_n>M} dP ≤ (1/M) ∫ f_n dP ≤ K/M.

Let

    A = {x | f_n(x) is unbounded as n → ∞}.

Then,

    P(A) ≤ sup_n P(f_n(x) > M) ≤ K/M

for all M, so in fact P(A) = 0. This shows that f_n(x) → f(x) almost surely.
It remains to show that E f_n → E f. We will show ∫ f dP < ∞, which will
allow us to use Dominated Convergence.
For each N, note the bound

    ∑_{n=1}^N n P(n ≤ f ≤ n + 1) ≤ ∫ f I_{f<N+1} dP = lim_{m→∞} ∫ f_m I_{f_m<N+1} dP,

where the last equality follows from Dominated Convergence with g = N + 1.
The last quantity is bounded by K, so ∫ f dP < ∞.

One last criterion for convergence of expectations.

Theorem 6.7 (Fatou's Theorem)
Suppose f_n(x) ≥ 0 and f_n(x) → f(x) almost surely, and ∫ f_n dP ≤ K.
Then f(x) is integrable and ∫ f(x) dP ≤ K.


Proof. Let

    ϕ_n(x) = inf_{m≥n} f_m(x).

We'll leave it as an exercise that ϕ_n(x) is measurable. Moreover,

    0 ≤ ϕ_n(x) ≤ f_n(x),    so    ∫ ϕ_n dP ≤ K.

Moreover, ϕ_n(x) must grow as n grows. Then, by Monotone Convergence,
ϕ_n(x) → f(x) almost surely and ∫ f(x) dP ≤ K.


7 March 1, 2018: Product Measures

7.1 Product Measures


Let (Ω1 , A1 , P1 ) and (Ω2 , A2 , P2 ) be probability spaces. The product measure
(Ω1 × Ω2 , σ(A1 × A2 ), P1 ⊗ P2 ) is defined as follows. Its elements are the product
set Ω1 × Ω2 , and its σ-algebra is

σ(A1 × A2 ) = σ(A1 × A2 |A1 ∈ A1 , A2 ∈ A2 ),

the smallest σ-algebra containing all products A1 × A2 . Its probability measure


is defined on A1 × A2 by

(P1 ⊗ P2 )(A1 × A2 ) = P1 (A1 ) · P2 (A2 ).

As we’ll see, P1 ⊗ P2 is σ-additive on A1 × A2 , and thus extends to all of


σ(A1 × A2 ) by σ-additivity.

Lemma 7.1
P1 ⊗ P2 is σ-additive on A1 × A2 .

Proof. Suppose A × B = ⋃_i A_i × B_i as a disjoint union. We want to show

    P_1(A) · P_2(B) = ∑_i P_1(A_i) · P_2(B_i).

For finite unions, we subdivide the sets Ai × Bi more finely into atoms, like we
did in Lecture 2. For countably-infinite unions, define

    f_i = I_{A_i} P_2(B_i).

This implies

    ∑_i f_i = I_A · P_2(B).

The integrals of partial sums of the LHS are increasing, so the Monotone
Convergence Theorem applies, and
    ∑_{i=1}^N (P_1 ⊗ P_2)(A_i × B_i) = ∑_{i=1}^N ∫ f_i = ∫ ∑_{i=1}^N f_i

has limit

    ∫ I_A P_2(B) dP_1 = P_1(A) · P_2(B)

as N → ∞.

Corollary 7.2
By Caratheodory’s Theorem this probability measure extends to a proba-
bility measure on σ(A1 × A2 ).

By iterating this construction, we can define a product of n probability


spaces.


7.2 Independence of Random Variables


Definition 7.3. Random variables ξ1 , . . . , ξn on (Ω, A, P ) are independent if
for any Borel sets B1 , B2 , . . . , Bn , their joint probability factorizes:

P (ξ1 ∈ B1 , ξ2 ∈ B2 , . . . , ξn ∈ Bn ) = P (ξ1 ∈ B1 ) · P (ξ2 ∈ B2 ) · · · P (ξn ∈ Bn )

Similarly, we say n sets are independent if their indicator functions are:

P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) · P (A2 ) · · · P (An ).

This is related to product measures as follows. The random variables
(ξ_1, . . . , ξ_n) induce a probability measure on B = B(Rⁿ), given by

    P_{ξ_1,...,ξ_n}(B) = P((ξ_1, . . . , ξ_n) ∈ B).

Then, ξ1 , . . . , ξn are independent if and only if

Pξ1 ,...,ξn = Pξ1 ⊗ · · · ⊗ Pξn . (7.1)

Corollary 7.4
To show (7.1) it is enough to check

P ((ξ1 , . . . , ξn ) ∈ B1 × · · · × Bn ) = P (ξ1 ∈ B1 ) · P (ξ2 ∈ B2 ) · · · P (ξn ∈ Bn )

for sets Bi = (−∞, bi ].

Proof. The intervals of the form Bi = (−∞, bi ] span the Borel algebra B(R).

Example 7.5
Let N(0, 1) be the standard Gaussian random variable, with pdf p(x) =
(1/√(2π)) e^{−x²/2}. Then, a pair of independent Gaussian random variables
(N(0, 1), N(0, 1)) has pdf

    p(x, y) = (1/(2π)) · e^{−x²/2} · e^{−y²/2}.

7.2.1 Properties of Independent Random Variables

Theorem 7.6
If X, Y are independent and integrable random variables, then

E(XY ) = E(X) · E(Y ).

Proof. If X, Y are indicator functions, say X = I_A, Y = I_B, the result is clear
because

    E(XY) = P(A ∩ B) = P(A) · P(B) = E(X) · E(Y),

where the middle equality is the definition of independence.


Note that both sides are bilinear in X and Y , so the result holds for simple
functions X and Y .
Now suppose X, Y ≥ 0. We consider

    X_n = (1/2ⁿ)⌊2ⁿX⌋,    Y_n = (1/2ⁿ)⌊2ⁿY⌋.

These are independent too, because

    P(X_n ≤ x, Y_n ≤ y) = P(X < (⌊2ⁿx⌋ + 1)/2ⁿ, Y < (⌊2ⁿy⌋ + 1)/2ⁿ)
factorizes. So, by the monotone convergence theorem, E(Xn Yn ) → E(XY ) and
E(Xn ) · E(Yn ) → E(X) · E(Y ).

Proposition 7.7
If X, Y are independent, then f (X), g(Y ) are as well, for any measurable
functions f, g.

Proof. Just use the definition of independence:

P (f (X) ∈ B1 , g(Y ) ∈ B2 ) = P (X ∈ f −1 (B1 ), Y ∈ g −1 (B2 ))


= P (X ∈ f −1 (B1 )) · P (Y ∈ g −1 (B2 ))
= P (f (X) ∈ B1 ) · P (g(Y ) ∈ B2 ).

This is true for sets of variables, as well.

Proposition 7.8
If X1 , . . . , Xm , Xm+1 , . . . , Xn are independent, then

f (X1 , . . . , Xm ), Xm+1 , . . . , Xn

are independent.

Proof. For simplicity, we’ll show this for f (X, Y ) and Z. The general argument
is analogous. We use the identity

    P(f(X, Y) ∈ B_1, Z ∈ B_2) = P((X, Y) ∈ f^{−1}(B_1), Z ∈ B_2)
                              = P((X, Y) ∈ f^{−1}(B_1)) · P(Z ∈ B_2)
                              = P(f(X, Y) ∈ B_1) · P(Z ∈ B_2).

The crucial step is the second equality, which relies on the construction PX ⊗
PY ⊗ PZ being associative. That is, it doesn’t matter whether we construct
this product measure as (PX ⊗ PY ) ⊗ PZ or as PX ⊗ (PY ⊗ PZ ), because in
both cases our base sets are boxes SX × SY × SZ and Caratheodory’s Theorem
guarantees uniqueness of the resulting probability measure. Thus the second
equality is valid and we are done.


Remark 7.9. The factorization of a product measure is not canonical. For
example, take (N(0, 1), N(0, 1)). Its pdf factors as

    p(x, y) = (1/(2π)) exp(−x²/2) exp(−y²/2)
            = (1/(2π)) exp(−(x + y)²/4) exp(−(x − y)²/4).

7.3 Computations on Independent Random Variables


Theorem 7.10
Let A ∈ σ(A_1 × A_2). Then,

    P(A) = ∫_{Ω_1} P_2(A_x) dP_1 = ∫_{Ω_2} P_1(A^y) dP_2,

where A_x = {y ∈ Ω_2 | (x, y) ∈ A}, and similarly for A^y.

Proof. We’ll first show Ax is measurable. If A is a rectangle, this is clearly true.


If AxSis measurable for countably many A = A1 , A2 , . . . , then Ax is measure for
A = i Ai as well. So,

{A ∈ σ(A1 × A2 )|Ax is measurable}

is a σ-algebra, and Ax is measurable.


The rest of this proof is technical; see e.g. Durett for details.

Theorem 7.11 (Fubini's Theorem)
For random variables X, Y, the following identity holds:

    ∬_{Ω_1×Ω_2} f(X, Y) d(P_1 ⊗ P_2) = ∫_{Ω_2} (∫_{Ω_1} f(X, Y) dP_1) dP_2
                                     = ∫_{Ω_1} (∫_{Ω_2} f(X, Y) dP_2) dP_1.

Proof Sketch. For elementary functions, this is true by Theorem 7.10. Following
the usual strategy, we extend this first to simple functions, then to arbitrary
functions.

Theorem 7.12
Let X, Y be independent random variables. Then

    F_{X+Y}(t) = ∫_{u∈R} F_X(t − u) dP_Y(u).


Proof. We compute:

    F_{X+Y}(t) = P(X + Y ≤ t)
              = ∬ I_{x+y≤t} dP_X ⊗ dP_Y
              = ∫_y (∫_x I_{x≤t−y} dP_X) dP_Y
              = ∫_y F_X(t − y) dP_Y.

For continuous random variables, we can compute the probability density
p_{X+Y} by

    ∫_{−∞}^t p_{X+Y}(u) du = ∫_{−∞}^∞ ∫_{−∞}^{t−u} p_X(v) p_Y(u) dv du,

whence

    p_{X+Y}(t) = ∫_{−∞}^∞ p_X(t − u) p_Y(u) du.
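As an aside, the discrete analogue of this convolution formula, P(X + Y = t) = ∑_u P(X = t − u) P(Y = u) for independent integer-valued X, Y, is easy to check by machine. Here it is for two fair dice (an illustration, not from the lecture):

```python
from collections import Counter
from fractions import Fraction

# Discrete convolution: for independent X, Y,
# P(X + Y = t) = sum over u of P(X = t - u) * P(Y = u).
# Here X, Y are fair six-sided dice, with exact rational probabilities.
die = {k: Fraction(1, 6) for k in range(1, 7)}

pmf_sum = Counter()
for u, pu in die.items():
    for v, pv in die.items():
        pmf_sum[u + v] += pu * pv

print(pmf_sum[2], pmf_sum[7])  # 1/36 and 1/6
```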

Joke. “Now let’s compute something! Well, the only thing I know how to
integrate is a Gaussian random variable, so...” - Gorin

Example 7.13
Suppose X ∼ N(µ, σ²), Y ∼ N(ν, c²) are independent. Then, we will show

    X + Y ∼ N(µ + ν, σ² + c²).

By shifting and rescaling, it's enough to consider µ = ν = 0, c = 1. Thus

    p_X(u) = (1/(σ√(2π))) exp(−u²/(2σ²)),
    p_Y(u) = (1/√(2π)) exp(−u²/2).

We compute:

    p_{X+Y}(t) = ∫_{−∞}^∞ p_X(t − u) p_Y(u) du
              = (1/(2πσ)) ∫_{−∞}^∞ exp(−(t − u)²/(2σ²) − u²/2) du
              = (1/(2πσ)) exp(−(1/2) · t²/(1 + σ²))
                  · ∫_{−∞}^∞ exp(−(1/2) · ((σ² + 1)/σ²) · (u − t/(1 + σ²))²) du
              = (1/√(2π(σ² + 1))) exp(−(1/2) · t²/(1 + σ²)).

This is the PDF for N(0, σ² + 1).
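A Monte Carlo check of this example (an aside; the parameter values are arbitrary):

```python
import random

random.seed(2)

mu, sigma = 1.0, 2.0   # X ~ N(1, 4)
nu, c = -0.5, 1.5      # Y ~ N(-0.5, 2.25)
n = 300_000

# Sample X + Y and compare its mean and variance with
# the claimed law N(mu + nu, sigma^2 + c^2) = N(0.5, 6.25).
s = [random.gauss(mu, sigma) + random.gauss(nu, c) for _ in range(n)]
mean = sum(s) / n
var = sum((x - mean) ** 2 for x in s) / n

print(mean, var)  # close to 0.5 and 6.25
```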


8 March 6, 2018: Sequences of Random Variables


Today we will develop machinery for handling sequences of i.i.d. random
variables. We will prove the Weak Law of Large Numbers.

8.1 Tikhonov Topology


We first show how to construct a probability space of sequences of i.i.d. random
variables. Set
    Ω = [0, 1]^∞ = {(x_1, x_2, . . . ) | x_i ∈ [0, 1]},

equipped with the Tikhonov topology: x^{(n)} → x if for each i, x_i^{(n)} → x_i.
Equivalently, this is the topology defined by the metric

    d(x, y) = ∑_{m=1}^∞ |x_m − y_m| / 2^m.

As an exercise: show that this metric defines the Tikhonov topology.

Lemma 8.1
Ω is compact in the Tikhonov topology.

Proof. Let x^{(1)}, x^{(2)}, x^{(3)}, . . . be a sequence in Ω. We will find a convergent
subsequence.
Since [0, 1] is compact, we can pick a subsequence i_1^1, i_2^1, i_3^1, . . . of N such that

    x_1^{(i_1^1)}, x_1^{(i_2^1)}, x_1^{(i_3^1)}, . . .

converges. Then, we can pick a subsequence i_1^2, i_2^2, i_3^2, . . . of i_1^1, i_2^1, i_3^1, . . .
such that

    x_2^{(i_1^2)}, x_2^{(i_2^2)}, x_2^{(i_3^2)}, . . .

converges. Continue this process, so

    x^{(i_1^n)}, x^{(i_2^n)}, x^{(i_3^n)}, . . .

converges in the first n coordinates.
Then, define j_n = i_n^n. The sequence j_1, j_2, j_3, . . . is eventually a subsequence
of i_1^n, i_2^n, i_3^n, . . . for all n, so it converges in all coordinates. Therefore

    x^{(j_1)}, x^{(j_2)}, x^{(j_3)}, . . .

is a convergent subsequence of x^{(1)}, x^{(2)}, x^{(3)}, . . . in the Tikhonov topology.

A picture of what’s happening:

i11 i12 i13 ...


i21 i22 i23 ...
i31 i32 i33 ...
.. .. .. ..
. . . .

The sequence j1 , j2 , j3 , . . . we take is the diagonal entries.


Proposition 8.2
The product of the coordinate (Lebesgue) measures defines a probability
measure on Ω.

Proof. By Caratheodory, we need only to check σ-additivity on the set-algebra
of boxes |a_1, b_1| × |a_2, b_2| × · · · , where |a_k, b_k| denotes an interval whose
endpoints can be closed or open. So, suppose

    |a_1, b_1| × |a_2, b_2| × · · · = ⋃_{k=1}^∞ |a_1^k, b_1^k| × |a_2^k, b_2^k| × · · · ,

where the union on the right is disjoint. If this union is finite, the boxes on the
right differ in finitely many dimensions, so we can reduce to a finite-dimensional
setting. This is handled by the product-measures machinery from last lecture.
Otherwise, finite additivity implies

    p(|a_1, b_1| × |a_2, b_2| × · · ·) ≥ ∑_{k=1}^N p(|a_1^k, b_1^k| × |a_2^k, b_2^k| × · · ·),

whence

    p(|a_1, b_1| × |a_2, b_2| × · · ·) ≥ ∑_{k=1}^∞ p(|a_1^k, b_1^k| × |a_2^k, b_2^k| × · · ·).

We need to show equality holds. Shrink the box on the left a bit to get a closed
box C with measure

    p(C) ≥ LHS − ε,

and grow each box on the right a bit to get an open box C_k with measure

    p(C_k) ≤ p(k-th box) + ε/2^k.

By compactness, the C_k are an open cover of C, which has a finite subcover
⋃_{k=1}^M C_k ⊃ C. Then, by finite additivity,

    RHS + ε ≥ ∑_{k=1}^∞ p(C_k) ≥ ∑_{k=1}^M p(C_k) ≥ p(C) ≥ LHS − ε,

whence

    RHS ≥ LHS − 2ε.

Take ε → 0 to get LHS = RHS, as desired.

Cool. This lets us construct sequences of i.i.d. uniform variables. Now, how
do we get arbitrary distributions?
Suppose ξ has density p(x) > 0. Then F_ξ(x) : R → (0, 1) is strictly monotone
and continuous. We claim we can generate the variable ξ from a uniform random
variable as follows.

Proposition 8.3
Let u ∼ Uniform(0, 1). Then F_ξ^{−1}(u) has the same distribution as ξ.


Proof. Just compute:

    P(F_ξ^{−1}(u) ≤ x) = P(u ≤ F_ξ(x))
                       = P(u ≤ P(ξ ≤ x))
                       = P(ξ ≤ x).

Therefore:

Corollary 8.4
Independent random variables with arbitrary distributions exist.
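Proposition 8.3 is exactly the inverse-CDF sampling method used in practice. As an illustrative sketch (not from the lecture, and applied here to a distribution supported on (0, ∞)), take the exponential distribution: its CDF F(x) = 1 − e^{−λx} inverts to F^{−1}(u) = −log(1 − u)/λ.

```python
import math
import random

random.seed(3)

# Inverse-CDF sampling for Exponential(lam):
# F(x) = 1 - exp(-lam * x)  =>  F^{-1}(u) = -log(1 - u) / lam.
lam = 2.0

def sample_exponential():
    u = random.random()  # u ~ Uniform(0, 1)
    return -math.log(1.0 - u) / lam

n = 200_000
xs = [sample_exponential() for _ in range(n)]

mean = sum(xs) / n                        # should approach 1 / lam = 0.5
cdf_at_1 = sum(x <= 1.0 for x in xs) / n  # should approach F(1) = 1 - e^{-2}
print(mean, cdf_at_1)
```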

8.2 Weak Law of Large Numbers


Theorem 8.5 (Weak Law of Large Numbers)
Let ξ_n be i.i.d. random variables satisfying E ξ_n = m and E ξ_n² < ∞. Then,
the averages of the ξ_n converge in probability:

    (1/n) ∑_{k=1}^n ξ_k →P m.

To prove this, we will first develop the concept of variance.

Definition 8.6. The variance of the random variable ξ is

    Var(ξ) = E[ξ²] − E[ξ]² = E[(ξ − E[ξ])²].

The two quantities on the right are equal by expansion.

Lemma 8.7
For independent ξ and µ, the following hold:

• E(ξ + µ) = E(ξ) + E(µ).

• Var(ξ + µ) = Var(ξ) + Var(µ).

• Var(c · ξ) = c² Var(ξ) for any scalar c.

Proof. The first identity follows from linearity of the Lebesgue integral. For the
second identity, compute:

    Var(ξ + µ) = E[(ξ + µ)²] − (E[ξ + µ])²
               = E[ξ²] + E[µ²] + 2E[ξµ] − (E[ξ])² − (E[µ])² − 2E[ξ]E[µ]
               = Var(ξ) + Var(µ),

where for the last step we use E[ξµ] = Eξ Eµ, by independence.
The third identity should be clear.


Lemma 8.8
If random variables ηn obey E[ηn2 ] → 0, then ηn →P 0.

Proof. Use Markov’s Inequality:


1 1
E ηn2 I|ηn |> ≤ 2 Eηn2 → 0.

P (|ηn | > ) ≤
2 

Proof of Weak Law of Large Numbers. Let η_n = (1/n) ∑_{k=1}^n ξ_k − m. Then,

    E[η_n] = 0

and

    E[η_n²] = Var(η_n) = (1/n²) ∑_{k=1}^n Var(ξ_k) = (1/n) Var(ξ_1) → 0.

Then, by Lemma 8.8, η_n →P 0 and we are done.
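An empirical illustration of the theorem (an aside), for fair coin flips with m = 1/2: the deviation probability P(|average − 1/2| ≥ ε) should shrink as n grows.

```python
import random

random.seed(4)

# Weak LLN for fair coin flips (xi_k in {0, 1}, m = 1/2): estimate the
# deviation probability P(|average - 1/2| >= eps) for growing n.
def deviation_prob(n, eps=0.05, trials=2000):
    bad = 0
    for _ in range(trials):
        avg = sum(random.random() < 0.5 for _ in range(n)) / n
        bad += abs(avg - 0.5) >= eps
    return bad / trials

probs = [deviation_prob(n) for n in (10, 100, 1000)]
print(probs)  # decreasing toward 0
```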

As a generalization:

Corollary 8.9
If ξ_i are independent and Var(ξ_i) = σ_i < ∞, and

    (1/n²) ∑_{k=1}^n σ_k → 0,

then

    (1/n) ∑_{k=1}^n (ξ_k − E[ξ_k]) →P 0.

8.3 Weierstrass Approximation


We take a brief digression to prove a theorem in analysis.

Theorem 8.10 (Weierstrass Approximation Theorem)
Let f(x) be continuous on [a, b]. For all ε > 0, there exists a polynomial
g(x) such that |f(x) − g(x)| < ε on [a, b].

Proof. By rescaling, WLOG [a, b] = [0, 1]. Let x_1, x_2, . . . , x_n be i.i.d. Bernoulli,
such that P(x_i = 1) = p. Define

    B_n(p) = E[f((x_1 + · · · + x_n)/n)].


First, note that B_n(p) is a polynomial, as

    B_n(p) = ∑_{k=0}^n f(k/n) (n choose k) p^k (1 − p)^{n−k}.

Note that E[x_i] = E[x_i²] = p, so Var(x_i) = p(1 − p). Therefore, by the Weak
LLN,

    (1/n) ∑_i x_i →P p

and by Markov's Inequality

    P(|(1/n) ∑_k x_k − p| > ε) < Var(x_1)/(n ε²) = p(1 − p)/(n ε²) ≤ 1/(4n ε²).
It remains to show B_n → f uniformly, i.e.

    lim_{n→∞} sup_{p∈[0,1]} |f(p) − B_n(p)| = 0.

Pick δ > 0 such that |f(x) − f(y)| < ε/2 whenever |x − y| < δ. This is possible
because continuous functions on closed intervals are uniformly continuous. Then,
write

    B_n(p) = ∑_{k : |k/n − p| < δ} f(k/n) (n choose k) p^k (1 − p)^{n−k}    (∗)
           + ∑_{k : |k/n − p| ≥ δ} f(k/n) (n choose k) p^k (1 − p)^{n−k}.   (∗∗)

The sum (∗) obeys, with error at most ε/2,

    (∗) ≈ f(p) ∑_{k : |k/n − p| < δ} (n choose k) p^k (1 − p)^{n−k}.

Moreover, we have the two-sided bound

    f(p) (1 − 1/(4nδ²)) ≤ f(p) ∑_{k : |k/n − p| < δ} (n choose k) p^k (1 − p)^{n−k} ≤ f(p).

Since f is continuous on a closed domain, it has a maximum C. So,

    |(∗) − f(p)| ≤ ε/2 + C/(4nδ²).

The sum (∗∗) obeys

    (∗∗) ≤ C ∑_{k : |k/n − p| ≥ δ} (n choose k) p^k (1 − p)^{n−k} ≤ C/(4nδ²),


whence

    |B_n(p) − f(p)| ≤ ε/2 + C/(2nδ²)

for all p. Take n large enough that C/(2nδ²) < ε/2, so for all sufficiently large n,
|B_n(p) − f(p)| < ε for all p. Thus B_n → f uniformly.
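The polynomials B_n appearing in this proof are the Bernstein polynomials, and the uniform convergence can be watched numerically. A minimal sketch (the test function and evaluation grid are arbitrary choices):

```python
from math import comb

# Bernstein polynomial from the proof:
# B_n(p) = sum over k of f(k/n) * C(n, k) * p^k * (1-p)^(n-k).
def bernstein(f, n, p):
    return sum(f(k / n) * comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)  # continuous, but not smooth at 1/2
grid = [i / 100 for i in range(101)]

# Sup-norm error over the grid, for increasing n: it should decrease.
errs = [max(abs(bernstein(f, n, p) - f(p)) for p in grid) for n in (5, 20, 80)]
print(errs)  # decreasing
```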

Corollary 8.11
Suppose random variables ξ and η satisfy |ξ| < C, |η| < C, and E[ξ k ] =
E[η k ] for all k. Then ξ = η, in the sense that their distributions are equal.

Proof. The condition E[ξ k ] = E[η k ] for all k implies E[g(ξ)] = E[g(η)] for all
polynomials g. By Weierstrass Approximation, this implies E[f (ξ)] = E[f (η)]
for all continuous f .
The distribution functions F_ξ, F_η are expectations of step functions,
which can be approximated by arbitrarily steep continuous functions, so we are
done.


9 March 8, 2018: Strong Law of Large Numbers


Today’s lecture will be devoted to proving the celebrated Strong Law of Large
Numbers:

Theorem 9.1 (Strong LLN)
Let ξ_1, ξ_2, ξ_3, . . . be a sequence of i.i.d. random variables. Then,

    (1/n) ∑_{i=1}^n ξ_i →a.s. m

if and only if E[ξ_i] exists and E[ξ_i] = m.

Proving this theorem will require building a considerable amount of machinery.

9.1 Borel-Cantelli and Kolmogorov


We begin with two important lemmas.

Lemma 9.2 (Borel–Cantelli Lemma)
In a probability space (Ω, A, P), let A_1, A_2, · · · ∈ A be a sequence of events,
and

    B = {x ∈ Ω | x ∈ infinitely many A_i}.

Then:

• If ∑_{n=1}^∞ P(A_n) < ∞, then P(B) = 0.

• If ∑_{n=1}^∞ P(A_n) = ∞ and the A_i are independent, then P(B) = 1.

Proof. Let’s prove the first assertion first. Note the set equality
∞ [
\ ∞
B= An ,
k=1 n=k
S∞
and that n=k An is a decreasing function of k. Therefore, for any k,

!
[ X
P (B) ≤ P An ≤ P (An ).
n=k n≥k

Take k large to get P (B) = 0, as desired.


S∞
For the second assertion, we need to show P ( n=k An ) = 1 for all k. We


compute:

    1 − P(⋃_{n=k}^∞ A_n) = P(⋂_{n=k}^∞ A_n^c)
                         = ∏_{n≥k} P(A_n^c)        (by independence)
                         = ∏_{n≥k} (1 − P(A_n))
                         = 0,

where the last step follows from ∑ P(A_i) = ∞.³

Lemma 9.3 (Kolmogorov's Inequality)
Let X_1, X_2, X_3, . . . be independent random variables, with E X_i = 0 and
E X_i² = σ_i² < ∞. Then,

    P(max_{1≤k≤n} |X_1 + · · · + X_k| ≥ a) ≤ (1/a²) ∑_{k=1}^n σ_k².

The difference between this bound and bounds like Markov's is that
Kolmogorov's Inequality bounds the random walk X_1 + · · · + X_k at all timesteps k,
whereas Markov bounds only the value of the walk at the last timestep.

Proof of Kolmogorov’s Inequality. Set Sk = X1 + · · · + Xk . Set the events


A = { max |Si | ≥ a}
1≤i≤n

and
Ak = { max |Si | < a and |Sk | ≥ a}.
1≤i≤k−1
Sn
That is, Ak is the event that |Sk | ≥ a first at timestep k, and A = k=1 Ak is
the event that |Sk | ≥ a at some point.
Then, we can compute:
n
X
σk2 = ESn2
k=1
≥ E[Sn2 IA ]
Xn
= E[Sn2 IAk ]
k=1
Xn
E[Sk2 IAk ] + 2E[Sk (Sn − Sk )IAk ] + E(Sn − Sk )2 IAk
 
=
k=1
Xn
 2 
≥ a P (Ak ) + 0 + 0 .
k=1
³ Details: use the bound

    ∏_{n≥k} (1 − P(A_n)) ≤ ∏_{n≥k} exp(−P(A_n)) = exp(−∑_{n≥k} P(A_n)).


where the second term is 0 because S_k and S_n − S_k are independent, and
E(S_n − S_k) = 0.
Therefore

    P(A) = ∑_{k=1}^n P(A_k) ≤ (1/a²) ∑_{k=1}^n σ_k²,

as desired.
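An empirical check of the inequality (an aside), for i.i.d. ±1 steps with mean 0 and variance 1, so that the bound reads P(max_k |S_k| ≥ a) ≤ n/a²:

```python
import random

random.seed(5)

# Kolmogorov's inequality for X_i = +-1: P(max |S_k| >= a) <= n / a^2.
n, a, trials = 100, 25.0, 5000

hits = 0
for _ in range(trials):
    s, running_max = 0, 0
    for _ in range(n):
        s += 1 if random.random() < 0.5 else -1
        running_max = max(running_max, abs(s))
    hits += running_max >= a

freq = hits / trials
bound = n / a**2
print(freq, bound)  # the empirical frequency stays below the bound 0.16
```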

Using Kolmogorov, we’ll prove another bound. This isn’t quite Strong LLN
yet, but it’s useful.

Theorem 9.4
Let X_1, X_2, . . . be independent with mean 0 and variance σ_k². If ∑_k σ_k² < ∞,
then ∑_{n=1}^∞ X_n converges almost surely.

Proof. As before, set S_k = X_1 + · · · + X_k. We will take the Cauchy approach of
bounding |S_i − S_j| for i, j both large.
By Kolmogorov,

    P(max_{1≤k≤m} |S_{n+k} − S_n| ≥ ε) ≤ (σ_{n+1}² + · · · + σ_{n+m}²)/ε².

We claim that for all δ > 0, there is n such that

    P(|S_j − S_i| ≥ ε for some i, j > n) ≤ δ.

This is because we can take n such that (1/ε²) ∑_{k≥n} σ_k² < δ.
Therefore, for any ε, almost surely:

    ∃n : ∀i, j > n, |S_j − S_i| < ε.    (∗)

Take ε = 1/M, for M = 1, 2, 3, . . . . For each ε, (∗) occurs with probability 1.
So, their countable intersection occurs with probability 1, and on it the partial
sums are Cauchy, hence convergent.

9.2 Toeplitz and Kronecker


Lemma 9.5 (Toeplitz)
Let a_n ≥ 0 be a sequence of constants, and b_n = ∑_{i=1}^n a_i, with
lim_{n→∞} b_n = ∞. Suppose x_n is a sequence with lim_{n→∞} x_n = x. Then, in fact

    lim_{n→∞} (1/b_n) ∑_{i=1}^n a_i x_i = x.

Proof. Left as an exercise. This isn't interesting – just ε–δ bounding.


Lemma 9.6 (Kronecker)
Let b_n ≥ 0 be an increasing sequence with b_n → ∞, and x_n be a sequence
where ∑_{n=1}^∞ x_n converges. Then,

    (1/b_n) ∑_{k=1}^n b_k x_k → 0.

Here's one application of this. Suppose b_n = n and x_n = y_n/n. Then this
theorem says that if ∑_n y_n/n is convergent, then (1/n) ∑_{i=1}^n y_i → 0.

Proof of Kronecker's Lemma. Let S_k = ∑_{i=1}^k x_i. Using summation by parts,
we compute:

    (1/b_n) ∑_{k=1}^n b_k x_k = (1/b_n) ∑_{k=1}^n b_k (S_k − S_{k−1})
                             = (1/b_n) b_n S_n − (1/b_n) b_0 S_0
                               − (1/b_n) ∑_{k=1}^n S_{k−1}(b_k − b_{k−1}).

The first term is ∑_{i=1}^n x_i, which converges to ∑_{i=1}^∞ x_i. The second is 0
because S_0 = 0, and the third converges to ∑_{i=1}^∞ x_i by Toeplitz. So the whole
expression tends to 0.

Proposition 9.7
Let X_i be independent random variables with mean µ_i and variance σ_i². If
∑_{n=1}^∞ (1/n²) σ_n² < ∞, then

    (1/n) ∑_{k=1}^n (X_k − µ_k) →a.s. 0.

Note that this implies the Strong LLN under the extra assumption that the
values E ξ_i² exist. So we're almost there!

Proof. The random variable (X_n − µ_n)/n has mean 0 and variance (1/n²) σ_n².
By Theorem 9.4, ∑_n (X_n − µ_n)/n converges almost surely. By Kronecker, this
implies

    (1/n) ∑_{k=1}^n (X_k − µ_k) →a.s. 0.

9.3 Proof of Strong LLN


Finally, we’re ready to prove the Strong LLN. Throughout this proof, we define

Yn = ξn I|ξn |≤n .


Lemma 9.8
Let ξ_1, ξ_2, ξ_3, . . . be i.i.d. random variables. If E|ξ_n| < ∞ for all n, then
∑_n Var(Y_n)/n² < ∞.

Proof. Let a_n = P(n − 1 ≤ |ξ_i| ≤ n).⁴ Then,

    ∑_{n=1}^∞ Var(Y_n)/n² ≤ ∑_{n=1}^∞ (1/n²) ∑_{k=1}^n k² a_k
                          = ∑_{k=1}^∞ (k² a_k) · ∑_{n=k}^∞ 1/n²
                          < ∑_{k=1}^∞ (k² a_k) · C/k
                          = C ∑_{k=1}^∞ k a_k
                          = C ∑_{k=1}^∞ P(|ξ_i| ≥ k − 1)
                          < ∞,

for C satisfying ∑_{n=k}^∞ 1/n² < C/k. The last step follows from existence of E|ξ_i|.

And finally, the proof:

Proof of Strong LLN. Suppose E[ξi ] = m < ∞. By Proposition 9.7,


n
1X
(Yk − E[Yk ]) →a.s. 0.
n
k=1

1
Pn
As E[Yk ] → Eξi = m by continuity of the Lebesgue integral, we have n k=1 E[Yk ] →
m. Therefore,
n n
1X 1X
Yk →a.s. E[Yk ] →a.s. m.
n n
k=1 k=1

Now define the random variable Zk = ξk − Yk . Then,


X X
P (Zk 6= 0) = P (|ξk | > k) < ∞,
k k

since E[ξi ] exists. By Borel-Cantelli, with probability 1 only finitely many Zk


are 6= 0. Therefore,
∞ n n
1X 1X 1X
ξk = Zk + Yk .
n n n
k=1 k=1 k=1

The first term goes to 0 and the second goes to m, as desired.


4 The ξi are i.i.d. so this is the same for all i.

For the converse, suppose for contradiction that (1/n) ∑_{k=1}^n ξ_k → m and
E|ξ_i| = ∞. Then, by the Lebesgue integrability condition, for any C we have

    ∑_{n≥1} P(|ξ_n|/C > n) = ∞.

By Borel–Cantelli, this means that with probability 1, the event |ξ_n| > nC
happens for infinitely many ξ_n. At each ξ_n where this happens, we have

    |(1/n) ∑_{k=1}^n ξ_k − m| > C/3    or    |(1/(n−1)) ∑_{k=1}^{n−1} ξ_k − m| > C/3.

Therefore, (1/n) ∑_{k=1}^n ξ_k cannot converge.
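The integrability hypothesis is visible numerically. In this illustrative sketch (not from the lecture), running averages of standard Gaussians settle down near 0, while those of standard Cauchy samples — a distribution with no expectation, sampled here via its inverse CDF tan(π(u − 1/2)) — keep fluctuating:

```python
import math
import random

random.seed(6)

# Running averages (1/k) * (X_1 + ... + X_k) for two sources of samples.
def running_averages(draw, n):
    s, out = 0.0, []
    for k in range(1, n + 1):
        s += draw()
        out.append(s / k)
    return out

n = 100_000
normal_avg = running_averages(lambda: random.gauss(0.0, 1.0), n)
cauchy_avg = running_averages(
    lambda: math.tan(math.pi * (random.random() - 0.5)), n)

print(abs(normal_avg[-1]))                       # small
print(max(abs(v) for v in cauchy_avg[n // 2:]))  # stays macroscopic
```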


10 March 13, 2018: Snow Day


Cancelled due to snow. Boston why you like this :(


11 March 15, 2018: Characteristic Functions

11.1 Kolmogorov 0-1 Law


Let ξ1 , ξ2 , . . . be independent random variables. Let An be the σ-algebra spanned
by the ξk for k > n.5 Moreover, define the tail σ-algebra

\
A∞ = An .
n=1

Note that this definition does not depend at all on the first finitely many terms
of An .

Example 11.1
The event

    {ω ∈ Ω | ∑_n ξ_n converges}

is a tail event, i.e. it lies in A_∞.

Theorem 11.2 (Kolmogorov 0-1 Law)


If A ∈ A∞ , then P (A) = 0 or P (A) = 1.

In the previous example, this means that a series of independent random variables either converges with probability 1, or diverges with probability 1. There is no in between.

Proof Sketch. If A ∈ An , then A is independent of the first finitely many


random variables ξ1 , . . . , ξn . Since this is true for all n, A is independent of all
ξi . Therefore A is independent of itself, and

\[ P(A \cap A) = P(A) \cdot P(A), \]
whence $P(A) = P(A)^2$, so $P(A) \in \{0, 1\}$.

Example 11.3
Let $\xi_1, \xi_2, \ldots$ be independent random variables. Then the radius of convergence of
\[ f(t) = \sum_{i=1}^{\infty} \xi_i t^i \]
is almost surely a constant (which may possibly be $\infty$).

$^5$i.e. the σ-algebra generated by the sets $\{\xi_k^{-1}(B) \mid B \in \mathcal{B}(\mathbb{R}),\ k > n\}$. It may be helpful to think of this as the σ-algebra generated by events of the form $\xi_k \in B$, for $B \in \mathcal{B}(\mathbb{R})$ and $k > n$; so, for example, events like $\xi_k \le c$.


Example 11.4
Let's be more concrete. Let $\xi_i$ in the previous example be independent Boolean variables:
\[ \xi_i = \begin{cases} 1 & p = \frac12 \\ -1 & p = \frac12. \end{cases} \]
$f(t)$ clearly blows up when $|t| > 1$ and converges when $|t| < 1$, so $R = 1$.
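This example can be checked numerically via Cauchy-Hadamard, $R = 1/\limsup_n |\xi_n|^{1/n}$. The sketch below is an illustration I am adding (sample size and seed arbitrary): for $\pm 1$ coefficients every root $|\xi_n|^{1/n}$ equals 1, so $R = 1$ in every realization, consistent with the 0-1 law.

```python
import random

# Cauchy-Hadamard check for the random series with xi_n = ±1 (illustrative).
random.seed(1)
N = 10_000
xi = [random.choice([-1, 1]) for _ in range(N)]

# |xi_n|^{1/n} = 1 for every n, so the limsup (and hence 1/R) is exactly 1.
roots = [abs(x) ** (1.0 / n) for n, x in enumerate(xi, start=1)]
limsup_estimate = max(roots[N // 2:])  # crude proxy for the limsup via the tail
R = 1.0 / limsup_estimate
```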

Example 11.5
$R = 1$ when $\xi_i \sim N(0,1)$ as well. In fact we can note that $\xi_i t^i \sim N(0, t^{2i})$, and use the theorem from last lecture that convergence of the sum of variances implies convergence of the sum of the random variables.

...lol Gorin what are you doing.

11.2 Characteristic Functions


Definition 11.6. For a random variable ξ, define the characteristic function
ϕξ (t) = E exp(itξ).

Note that exp(itξ) is on the unit circle, so it is a bounded random variable.


Therefore its expectation exists.

Proposition 11.7
The characteristic function has the following properties:

1. $|\varphi_\xi(t)| \le 1$.

2. $\varphi_\xi(0) = 1$.

3. $\varphi_\xi(t)$ is uniformly continuous over $t \in \mathbb{R}$.

4. $\varphi_\xi(\cdot)$ is positive-definite, i.e. for any $t_1, \ldots, t_k \in \mathbb{R}$, the matrix
\[ \left(\varphi_\xi(t_i - t_j)\right)_{i,j=1}^{k} \]
is nonnegative-definite. Equivalently, for all $t_1, \ldots, t_k \in \mathbb{R}$ and $x_1, \ldots, x_k \in \mathbb{C}$,
\[ \sum_{i,j=1}^{k} x_i \overline{x_j}\, \varphi_\xi(t_i - t_j) \ge 0. \]

Proof. (1) and (2) are obvious.

(3): We can obtain the uniform bound:
\begin{align*}
|\varphi_\xi(x) - \varphi_\xi(y)| &= \left|E\left(e^{i\xi x} - e^{i\xi y}\right)\right| \le E\left|1 - e^{i\xi(x-y)}\right| \\
&= E\left[\left|1 - e^{i\xi(x-y)}\right| I_{|\xi|\le C}\right] + E\left[\left|1 - e^{i\xi(x-y)}\right| I_{|\xi|>C}\right] \\
&\le E\left[\left|1 - e^{i\xi(x-y)}\right| I_{|\xi|\le C}\right] + 2P(|\xi| > C).
\end{align*}
For any $C$, the first term goes to 0 uniformly as $|x - y| \to 0$, and the second can be made small by choosing $C$ large.


(4): Just compute:
\[ 0 \le E\left|x_1 e^{it_1\xi} + \cdots + x_k e^{it_k\xi}\right|^2 = E\left(\sum_i x_i e^{it_i\xi}\right)\left(\sum_j \overline{x_j}\, e^{-it_j\xi}\right) = \sum_{i,j} x_i \overline{x_j}\, \varphi_\xi(t_i - t_j). \]

We care about characteristic functions because of the following result:

Theorem 11.8
Any function satisfying (1) through (4) is a characteristic function.

In fact, we’ll later show how to recover a random variable from its character-
istic function.

11.2.1 Computation With Characteristic Functions

Proposition 11.9
The characteristic function has the following properties:

1. If $A, B \in \mathbb{R}$, then $\varphi_{A\xi+B}(t) = e^{iBt}\varphi_\xi(At)$.

2. If $\xi_1, \ldots, \xi_k$ are independent, then $\varphi_{\xi_1+\cdots+\xi_k}(t) = \varphi_{\xi_1}(t)\cdots\varphi_{\xi_k}(t)$.

Proof. (1): Just write out the definition of $\varphi_\xi$.

(2): Write it out. Expectation is multiplicative over independent variables.

Example 11.10
Let $\xi \sim N(\mu, \sigma^2)$. We'll first compute this for $\mu = 0$, $\sigma = 1$:
\[ \varphi_\xi(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{itx - \frac{x^2}{2}}\, dx = \frac{1}{\sqrt{2\pi}}\, e^{-\frac12 t^2}\int_{-\infty}^{\infty} e^{-\frac12(x-it)^2}\, dx = e^{-\frac12 t^2}, \]
where we evaluate $\int_{-\infty}^{\infty} e^{-\frac12(x-it)^2}\, dx = \sqrt{2\pi}$ by complex analysis.$^a$

So, applying (1) above, we get $\varphi_\xi(t) = \exp\left(i\mu t - \frac{\sigma^2 t^2}{2}\right)$ in the general case.

$^a$Details: contour integrate $e^{-\frac12 z^2}$ along the box with vertices $R, -R, -R - it, R - it$, to get $\int_{-R}^{R} e^{-\frac12(x-it)^2}\, dx \approx \int_{-R}^{R} e^{-\frac12 x^2}\, dx$ with error going to 0 as $R \to \infty$.
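The Gaussian formula above is easy to check by simulation. The following Monte Carlo sketch is an illustration I am adding (sample size, seed, and the test point $t$ are arbitrary): the empirical characteristic function of standard normal samples should match $e^{-t^2/2}$.

```python
import cmath
import math
import random

# Monte Carlo check of the standard Gaussian characteristic function
# (illustrative; the sample size and evaluation point are arbitrary).
random.seed(2)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

def empirical_cf(t):
    """Sample average of exp(i t xi), approximating E exp(i t xi)."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

t = 1.5
approx = empirical_cf(t)
exact = math.exp(-t * t / 2)  # phi_xi(t) = exp(-t^2/2) for xi ~ N(0, 1)
```

The Monte Carlo error is of order $1/\sqrt{100{,}000} \approx 0.003$.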


Proposition 11.11
Assume $E[\xi^k]$ exists. Then $\varphi_\xi(t)$ is $k$ times differentiable, and
\[ \left.\frac{\partial^k}{\partial t^k}\varphi_\xi(t)\right|_{t=0} = i^k E[\xi^k]. \]

Proof. We will induct on $k$. Suppose $E[\xi^{k+1}]$ exists and is finite. Then $E[\xi^k]$ exists and is finite.$^6$ Now compute:
\[ \frac{\varphi^{(k)}(t+\Delta) - \varphi^{(k)}(t)}{\Delta} - i^{k+1} E\left[e^{it\xi}\xi^{k+1}\right] = i^k\, E\left[e^{it\xi}\xi^k\left(\frac{e^{i\Delta\xi} - 1}{\Delta} - i\xi\right)\right]. \]
Note the identity
\[ \left|\frac{e^{i\Delta\xi} - 1}{\Delta}\right| = \frac{2\left|\sin\frac{\Delta\xi}{2}\right|}{|\Delta|} \le |\xi|, \]
and $\frac{e^{i\Delta\xi}-1}{\Delta} - i\xi \to 0$ as $\Delta \to 0$. Therefore the above integrand is, in magnitude, at most $2|\xi|^{k+1}$, and by dominated convergence its expectation goes to 0. Thus,
\[ \lim_{\Delta\to 0}\frac{\varphi^{(k)}(t+\Delta) - \varphi^{(k)}(t)}{\Delta} = i^{k+1} E\left[e^{it\xi}\xi^{k+1}\right], \]
as desired.

As a corollary, we can expand a characteristic function by a power series:

Corollary 11.12
Whenever $E\xi^m$ exists,
\[ \varphi_\xi(t) = \sum_{k=0}^{m} \frac{(it)^k}{k!}\, E\xi^k + o(t^m). \]
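Proposition 11.11 can be sanity-checked with finite differences. This sketch is an illustration I am adding (the step size $h$ is an arbitrary choice): for $\xi \sim N(0,1)$ we know $\varphi(t) = e^{-t^2/2}$ exactly, so the proposition predicts $\varphi'(0) = iE[\xi] = 0$ and $\varphi''(0) = i^2 E[\xi^2] = -1$.

```python
import math

# Finite-difference check of the moment/derivative relation for xi ~ N(0, 1),
# where phi(t) = exp(-t^2/2) is known in closed form (illustrative).
def phi(t):
    return math.exp(-t * t / 2)

h = 1e-4
d1 = (phi(h) - phi(-h)) / (2 * h)              # approximates phi'(0) = i E[xi] = 0
d2 = (phi(h) - 2 * phi(0) + phi(-h)) / h ** 2  # approximates phi''(0) = -E[xi^2] = -1
```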

11.3 Lévy Inversion Formula

This formula lets us recover the distribution $F_\xi(x)$ from the characteristic function $\varphi_\xi(t)$.

Theorem 11.13 (Lévy Inversion Formula)
Let $x_1, x_2$ be points of continuity of $F_\xi(x)$. Then,
\[ F_\xi(x_2) - F_\xi(x_1) = \lim_{\delta\to 0}\frac{1}{2\pi}\int_{-\infty}^{\infty} \varphi(t)\, e^{-\frac{\delta^2 t^2}{2}}\left(\frac{e^{-itx_2} - e^{-itx_1}}{-it}\right) dt. \]

The point of the $\delta$ is to add a small Gaussian noise, to smooth over any discontinuities in the original random variable.

$^6$Details: write $E[\xi^k] = E[\xi^k I_{|\xi|\le 1}] + E[\xi^k I_{|\xi|>1}]$. The first term is clearly bounded, and the second is bounded by $E|\xi|^{k+1}$.


Proof. Consider the random variables $\xi + Y_\delta$, where the $Y_\delta \sim N(0, \delta^2)$ are independent of $\xi$. Note that its characteristic function is $\varphi(t)e^{-\frac{\delta^2 t^2}{2}}$; call this $\varphi_\delta(t)$. And, its density is
\[ P_\delta(x) := P_{\xi+Y_\delta}(x) = \int_{-\infty}^{\infty} \frac{1}{\delta\sqrt{2\pi}}\, e^{-\frac{(x-y)^2}{2\delta^2}}\, dF_\xi(y). \]
Let's invert this random variable first. Observe that
\begin{align*}
\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi_\delta(t)\, dt &= \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\, e^{-\frac{\delta^2 t^2}{2}}\left(\int_{-\infty}^{\infty} e^{ity}\, dF_\xi(y)\right) dt \\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} e^{-\frac{\delta^2 t^2}{2}}\, e^{-it(y-x)}\, dt\right) dF_\xi(y) \quad\text{(by Fubini)} \\
&= \frac{1}{\sqrt{2\pi}\,\delta}\int_{-\infty}^{\infty} e^{-\frac{(y-x)^2}{2\delta^2}}\, dF_\xi(y) \\
&= P_\delta(x).
\end{align*}
So, we backed out the density $P_\delta(x)$ from the characteristic function $\varphi_\delta(t)$. Now, we can invert $\xi$. Let $F_\delta$ be the distribution function of $\xi + Y_\delta$. Then:
\[ F_\delta(x_2) - F_\delta(x_1) = \int_{x_1}^{x_2} P_\delta(x)\, dx = \frac{1}{2\pi}\int_{-\infty}^{\infty} \varphi(t)\, e^{-\frac{\delta^2 t^2}{2}}\left(\frac{e^{-itx_2} - e^{-itx_1}}{-it}\right) dt. \]
As $\delta \to 0$, $\xi + Y_\delta \to \xi$ in probability, and therefore in distribution. Therefore, $F_\delta(x_1) \to F(x_1)$ and $F_\delta(x_2) \to F(x_2)$, where we use the fact that $x_1, x_2$ are points of continuity.

Therefore, $\varphi_\xi(t)$ uniquely defines $F_\xi(x)$.

Corollary 11.14
If
\[ \sum_k \frac{1}{k!}\, E|\xi^k|\, t^k < \infty \]
for all $t > 0$, then the sequence of moments $E\xi^k$ uniquely defines $F_\xi$.


12 March 20, 2018: Limits of Characteristic Functions

12.1 Characteristic Functions

Last time we said that the characteristic function of a random variable $\xi$ is
\[ \varphi_\xi(t) = E\exp(it\xi). \]
Characteristic functions have the nice property that $\varphi_{\xi+\eta}(t) = \varphi_\xi(t)\varphi_\eta(t)$ for independent variables $\xi$ and $\eta$.
We also derived the Lévy Inversion Formula, which lets us recover a random variable's distribution from its characteristic function:
\[ F_\xi(x_2) - F_\xi(x_1) = \lim_{\delta\to 0}\frac{1}{2\pi}\int_{-\infty}^{\infty} \varphi_\xi(t)\, e^{-\frac{\delta^2 t^2}{2}}\left(\frac{e^{-itx_2} - e^{-itx_1}}{-it}\right) dt. \]
Recall that the point of the $\delta$ is to add a small Gaussian noise, in order to force the distribution to be continuous.
For certain well-behaved $\xi$, it's possible to pass to the limit explicitly - that is, we can just take $\delta$ to 0.

Corollary 12.1
If $\int_{-\infty}^{\infty} |\varphi_\xi(t)|\, dt < \infty$, then $\xi$ has a well-defined density, which is
\[ p(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi_\xi(t)\, dt. \]

Proof. We want to take $\delta \to 0$ in the Lévy inversion formula. We observe that
\[ \left|\frac{e^{-itx_2} - e^{-itx_1}}{-it}\right| = \frac{2\left|\sin\frac{t(x_2 - x_1)}{2}\right|}{|t|} \le |x_2 - x_1|, \]
so by dominated convergence we can move the $\lim_{\delta\to 0}$ under the integral sign. Thus,
\begin{align*}
F_\xi(x_2) - F_\xi(x_1) &= \frac{1}{2\pi}\int_{-\infty}^{\infty}\lim_{\delta\to 0}\varphi_\xi(t)\, e^{-\frac{\delta^2 t^2}{2}}\left(\frac{e^{-itx_2} - e^{-itx_1}}{-it}\right) dt \\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty}\varphi_\xi(t)\left(\frac{e^{-itx_2} - e^{-itx_1}}{-it}\right) dt \\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty}\varphi_\xi(t)\int_{x_1}^{x_2} e^{-itx}\, dx\, dt \\
&= \int_{x_1}^{x_2}\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi_\xi(t)\, dt\, dx \quad\text{(by Fubini)}.
\end{align*}
Therefore
\[ p(x) = \frac{d}{dx}F_\xi(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi_\xi(t)\, dt. \]
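This inversion formula can be carried out numerically. The sketch below is an illustration I am adding (the truncation $T$, step count, and test point are arbitrary choices): for $\xi \sim N(0,1)$, $\varphi(t) = e^{-t^2/2}$ is absolutely integrable, and a midpoint Riemann sum for $\frac{1}{2\pi}\int e^{-itx}\varphi(t)\, dt$ recovers the standard normal density.

```python
import math

# Numerical Fourier inversion of phi(t) = exp(-t^2/2) (illustrative).
# The imaginary part of e^{-itx} phi(t) is odd in t and integrates to 0,
# so only the cosine term is summed.
def inverted_density(x, T=12.0, steps=4000):
    dt = 2 * T / steps
    total = 0.0
    for k in range(steps):
        t = -T + (k + 0.5) * dt  # midpoint rule on [-T, T]
        total += math.cos(t * x) * math.exp(-t * t / 2) * dt
    return total / (2 * math.pi)

x = 0.7
recovered = inverted_density(x)
exact = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # N(0,1) density at x
```

Because the integrand decays like $e^{-t^2/2}$, truncating at $T = 12$ and using a modest step count already gives many digits of accuracy.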


Remark 12.2. The characteristic function really just takes a Fourier transform, which Lévy inversion inverts. This is especially apparent when the above corollary holds.

Example 12.3
Let's see a non-example of this corollary: what happens when $\xi$ doesn't have a density?
Consider the Bernoulli variable $\xi \sim \mathrm{Ber}(p)$. Then,
\[ \varphi_\xi(t) = E e^{it\xi} = p e^{it} + (1 - p), \]
so certainly
\[ \int_{-\infty}^{\infty} |\varphi_\xi(t)|\, dt = \infty. \]

12.1.1 Higher-Dimensional Characteristic Functions

Let $\xi_1, \ldots, \xi_k$ be arbitrary (not necessarily independent) random variables. The multivariate characteristic function is
\[ \varphi_{\xi_1,\ldots,\xi_k}(t_1,\ldots,t_k) = E\exp\left(i\sum_{j=1}^{k}\xi_j t_j\right). \]
As an exercise: properties of multivariate characteristic functions are completely analogous to the 1-dimensional case.

12.2 Gaussian vectors

Definition 12.4 (Gaussian Vector, Definition 1). $[\xi_1, \ldots, \xi_k]$ is a Gaussian vector if for all deterministic constants $c_1, \ldots, c_k$,
\[ \sum_{j=1}^{k} c_j \xi_j \]
is a Gaussian random variable.

Definition 12.5 (Gaussian Vector, Definition 2). $[\xi_1, \ldots, \xi_k]$ is a Gaussian vector if there exist a vector $\vec m$ and matrix $C$ such that
\[ \varphi_{\vec\xi}(\vec t\,) = e^{i(\vec m, \vec t\,) - \frac12(C\vec t, \vec t\,)}, \]
where $(\cdot, \cdot)$ is the standard dot product.

Here's how we should interpret $\vec m$ and $C$. When the above equation holds, $\vec m = (m_1, \ldots, m_k)$ must be the vector of means given by $m_j = E\xi_j$, and $C$ must be the nonnegative-definite matrix, called the covariance matrix, whose entries are given by
\[ C_{ab} = E\left[(\xi_a - m_a)(\xi_b - m_b)\right] = \mathrm{Cov}(\xi_a, \xi_b). \]


Definition 12.6 (Gaussian Vector, Definition 3). $[\xi_1, \ldots, \xi_k]$ is a Gaussian vector if there exists a deterministic matrix $A$ such that
\[ \vec\xi = A\vec\xi_0 + \vec m, \]
where $\vec\xi_0$ is the Gaussian vector whose coordinates are i.i.d. unit Gaussians $N(0,1)$.

When the covariance matrix $C$ is nonsingular, there is yet another characterization.

Definition 12.7 (Gaussian Vector, Definition 4, kind of). Suppose the covariance matrix $C$ of $[\xi_1, \ldots, \xi_k]$ is nonsingular. Then $[\xi_1, \ldots, \xi_k]$ is a Gaussian vector if its density is
\[ P_{\vec\xi}(x_1, \ldots, x_k) = \frac{1}{\sqrt{(2\pi)^k \det C}}\exp\left(-\frac12(\vec x - \vec m)^T C^{-1}(\vec x - \vec m)\right). \]

Theorem 12.8
These four definitions are equivalent.

Proof that Definition 2 ⇒ Definition 1. Take $\vec t = t\vec c = (tc_1, \ldots, tc_k)$ in the definition of the multivariate characteristic function; the result, as a function of $t$, is the characteristic function of a one-dimensional Gaussian.

Proof that Definition 1 ⇒ Definition 2. Since $\sum_{j=1}^{k} c_j\xi_j$ is Gaussian,
\[ E\exp\left(it\sum_{j=1}^{k} c_j\xi_j\right) = e^{itm(\vec c, \vec\xi\,)}\cdot e^{-\frac{t^2}{2}\left[\sigma(\vec c, \vec\xi\,)\right]^2}, \]
where
\[ m\left(\vec c, \vec\xi\,\right) = E\left[\sum_{j=1}^{k} c_j\xi_j\right] = \sum_{j=1}^{k} c_j m_j = (\vec m, \vec c) \]
and
\[ \left[\sigma\left(\vec c, \vec\xi\,\right)\right]^2 = \mathrm{Var}\left(\sum_{j=1}^{k} c_j\xi_j\right) = \sum_{a,b} c_a c_b\, E(\xi_a - m_a)(\xi_b - m_b) = (C\vec c, \vec c). \]
Thus if we take $\vec t = t\vec c$, we have
\[ E\exp\left(i\sum_{j=1}^{k} t_j\xi_j\right) = e^{i(\vec m, \vec t\,)}\cdot e^{-\frac12(C\vec t, \vec t\,)}. \]
This matches the form of (2), as desired.

Remark 12.9. The fact that (1) implies (2) should be surprising - this relies crucially on $\vec\xi$ being Gaussian, and is false for general collections of random variables.


Proof that Definition 1 ⇒ Definition 3. This is just diagonalization. By the equivalence of (1) and (2), $\vec\xi$ has a covariance matrix $C$, which is symmetric and nonnegative-definite, so it has a decomposition
\[ C = B^T\,\mathrm{diag}(\lambda_1, \ldots, \lambda_k)\, B = A^T A, \qquad A = B^T\,\mathrm{diag}\left(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_k}\right) B, \]
where the symmetric matrix $A$ absorbs square roots of the eigenvalues (so $AA^T = A^TA = C$). Check that $\vec\xi = A\vec\xi_0 + \vec m$ works.

Proof that Definition 3 ⇒ Definition 2. This is immediate because we know the characteristic function of unit Gaussian variables.

Proof that Definition 3 ⇔ Definition 4. Same idea; perform diagonalization on $C$ to reduce to the standard-Gaussian case.
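Definition 3 is also how Gaussian vectors are sampled in practice. The sketch below is an illustration I am adding (the 2×2 matrix $C$, mean $\vec m$, sample size, and seed are arbitrary choices): factor $C = AA^T$ with $A$ lower triangular (a Cholesky factorization), set $\vec\xi = A\vec\xi_0 + \vec m$, and check that the empirical means and covariance approach $\vec m$ and $C$.

```python
import math
import random

# Sampling a 2D Gaussian vector via xi = A xi_0 + m (illustrative parameters).
random.seed(3)
m = [1.0, -2.0]
C = [[2.0, 0.6],
     [0.6, 1.0]]

# Cholesky factor A of C (lower triangular), so that C = A A^T.
a11 = math.sqrt(C[0][0])
a21 = C[1][0] / a11
a22 = math.sqrt(C[1][1] - a21 ** 2)

n = 100_000
xs, ys = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)  # coordinates of xi_0
    xs.append(m[0] + a11 * z1)
    ys.append(m[1] + a21 * z1 + a22 * z2)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
```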

Corollary 12.10
A Gaussian vector is uniquely specified by its mean and covariance matrix.

Corollary 12.11
For a Gaussian vector $\vec\xi$, two coordinates $\xi_a, \xi_b$ are uncorrelated (i.e. $\mathrm{Cov}(\xi_a, \xi_b) = 0$) iff they are independent.

Corollary 12.12
For a Gaussian vector $\vec\xi$, pairwise independence of coordinates is equivalent to joint independence of coordinates.

It's worth noting that pairwise independence is generally far weaker than joint independence.

Example 12.13 (Pairwise Independence ⇏ Joint Independence)
Let $\xi_1, \xi_2 \sim \mathrm{Ber}(\frac12)$ be i.i.d. and $\xi_3 = \xi_1 + \xi_2 \bmod 2$. Then $\xi_3 \sim \mathrm{Ber}(\frac12)$ as well. The variables $\xi_1, \xi_2, \xi_3$ are pairwise but not jointly independent.
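This example is small enough to verify exhaustively. The check below is an illustration I am adding; it enumerates the four equally likely outcomes of $(\xi_1, \xi_2)$ and computes exact probabilities.

```python
from itertools import product

# Exhaustive verification of the example: xi1, xi2 ~ Ber(1/2) i.i.d. and
# xi3 = (xi1 + xi2) mod 2, over the four equally likely outcomes.
outcomes = [(a, b, (a + b) % 2) for a, b in product([0, 1], repeat=2)]

def prob(event):
    """Exact probability of an event under the uniform measure on outcomes."""
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

# Pairwise independence: P(xi_a = 1, xi_b = 1) = P(xi_a = 1) * P(xi_b = 1)
pairwise = all(
    prob(lambda w: w[a] == 1 and w[b] == 1)
    == prob(lambda w: w[a] == 1) * prob(lambda w: w[b] == 1)
    for a, b in [(0, 1), (0, 2), (1, 2)]
)

# Joint independence fails: P(xi1 = xi2 = xi3 = 1) = 0, not (1/2)^3 = 1/8.
joint = prob(lambda w: w == (1, 1, 1)) == prob(lambda w: w[0] == 1) ** 3
```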

We want to use this machinery to prove asymptotic theorems like the central
limit theorem. We introduced characteristic functions as our tool for studying
central tendency, so let’s look next at how they interact with limits.

12.3 Characteristic Functions and Limits


Theorem 12.14
Let ξn be random variables with characteristic functions ϕn (t). If the
pointwise limit ϕn (t) → ϕ(t) exists and ϕ(t) is continuous at t = 0, then:

• ϕ(t) is the characteristic function ϕξ (t) of some variable ξ; and


• ξn converges in distribution to ξ.


We’ll follow a commonly used scheme in probability. To show a sequence


converges, we first show “tightness,” that we can take subsequential limits. Then,
we’ll show that all the subsequential limits are the same.

Proposition 12.15
Let $\{\xi_n\}$ be a sequence of random variables, such that
\[ \forall \varepsilon > 0\ \exists C:\ \sup_n P(|\xi_n| > C) < \varepsilon. \quad (*) \]
Then, $\{\xi_n\}$ has a subsequence converging in distribution.

Proof. We look at the distribution functions $F_{\xi_n}(x) \in [0, 1]$. At a fixed point $x$, the values of $F_{\xi_n}(x)$ are just points in $[0, 1]$, which must have a convergent subsequence.
We can't pick a subsequence such that $F_{\xi_n}(x)$ converges for all $x$ simultaneously, but we can pick a subsequence $\{n_k\}$ such that $F_{\xi_{n_k}}(x) \to H(x)$ for all rational $x$.$^7$
$H(x)$ is monotone. Define $F_\xi(x)$ as the right-limit of $H$ along rationals. Note that this doesn't depend on the approximation scheme, because $F_\xi(x)$ is just the infimum of $H(y)$ over rational $y > x$. Also note that $F_\xi(x)$ may not equal $H(x)$, even for rationals $x$.
Let's check that $F_\xi$ is a distribution function, by verifying axioms.
Monotonicity. By construction.
Limits at 0, 1. By condition $(*)$,
\[ F_\xi(-\infty) = 0 \quad\text{and}\quad F_\xi(+\infty) = 1. \]

Right-continuity. Let $x_n \to x$ from the right. Replace each $x_n$ by a rational $r_n$ such that
\[ r_n \in \left(x_n,\ x + \frac{1}{n}\right) \quad\text{and}\quad H(r_n) - F_\xi(x_n) \le \frac{1}{n}. \]
The points $r_n$ converge to $x$ from the right, so $H(r_n)$ converges to $F_\xi(x)$. Since $F_\xi(x_n) \le H(r_n)$, we get
\[ \lim_{n\to\infty} F_\xi(x_n) \le \lim_{n\to\infty} H(r_n) = F_\xi(x). \]
But by monotonicity of $F_\xi$ the reverse inequality also holds, so in fact
\[ \lim_{n\to\infty} F_\xi(x_n) = F_\xi(x). \]
Therefore $F_\xi$ is indeed a distribution function.


Now we need to show that $F_{\xi_{n_k}}(x) \to F_\xi(x)$ at each point $x$ of continuity. Fix some $\varepsilon > 0$. Select rationals $r_1, r_2$ such that $r_1 \le x \le r_2$ and $H(r_2) - H(r_1) < \varepsilon$. We have the sandwich bounds
\[ H(r_1) \le F_\xi(x) \le H(r_2) \]

$^7$Details: enumerate the rationals $q_1, q_2, \ldots$. Find a subsequence $\{n_k^{(1)}\}$ such that the values $F_{\xi_{n_k^{(1)}}}(q_1)$ converge. Pick a subsequence $\{n_k^{(2)}\}$ of this sequence such that $F_{\xi_{n_k^{(2)}}}(q_2)$ converges, and so on. Then the diagonal sequence $\{n_k\}$ given by $n_k = n_k^{(k)}$ works.


and
\[ F_{\xi_{n_k}}(r_1) \le F_{\xi_{n_k}}(x) \le F_{\xi_{n_k}}(r_2). \]
As $k \to \infty$, we have $F_{\xi_{n_k}}(r_1) \to H(r_1)$ and $F_{\xi_{n_k}}(r_2) \to H(r_2)$, so for all sufficiently large $k$,
\[ H(r_1) - \varepsilon \le F_{\xi_{n_k}}(x) \le H(r_2) + \varepsilon. \]
By taking $\varepsilon \to 0$, we conclude $F_{\xi_{n_k}}(x) \to F_\xi(x)$.

Next time we’ll finish the proof of Theorem 12.14. We’ll show that for
random variables ξn whose characteristic functions converge pointwise to a limit
continuous at t = 0, the condition (∗) holds. Then we’ll show full convergence,
by showing that all subsequential limits are equal.


13 March 22, 2018: Central Limit Theorem and Variations

13.1 Characteristic Functions and Limits


We will finish proving Theorem 12.14 from last time, and then use this machinery
to prove the Central Limit Theorem.

Lemma 13.1
Let $X$ be any random variable. For any $u > 0$,
\[ P\left(|X| \ge \frac{2}{u}\right) \le \frac{1}{u}\int_{-u}^{u}\left(1 - \varphi_X(t)\right) dt. \]

Proof. Just compute:
\begin{align*}
\frac{1}{2u}\int_{-u}^{u}\left(1 - \varphi_X(t)\right) dt &= E\left[\frac{1}{2u}\int_{-u}^{u}\left(1 - e^{itX}\right) dt\right] \quad\text{(by Fubini)} \\
&= 1 - E\left[\frac{1}{2u}\int_{-u}^{u} e^{itX}\, dt\right] \\
&= 1 - E\left[\frac{e^{iuX} - e^{-iuX}}{2iuX}\right] \\
&= E\left[1 - \frac{\sin uX}{uX}\right].
\end{align*}
Now, just by looking at the graph of $\frac{\sin x}{x}$, we get the bound
\[ 1 - \frac{\sin uX}{uX} \ge \begin{cases} 0 & |uX| < 2 \\ \frac12 & |uX| \ge 2. \end{cases} \]
Therefore,
\[ E\left[1 - \frac{\sin uX}{uX}\right] \ge \frac12 P(|uX| \ge 2) \]
and we are done.

Let {ξn } be a sequence of random variables with characteristic functions


ϕn (t), with well-defined pointwise limit ϕn (t) → ϕ(t) continuous at t = 0.

Lemma 13.2
The hypothesis of Proposition 12.15 holds. That is,
\[ \forall \varepsilon > 0\ \exists C:\ \sup_n P(|\xi_n| > C) < \varepsilon. \]


Proof. By Lemma 13.1 with $u = 1/C$, for all sufficiently large $n$,
\begin{align*}
P(|\xi_n| \ge 2C) &\le C\int_{-1/C}^{1/C}\left(1 - \varphi_{\xi_n}(t)\right) dt \\
&\le 1.001\, C\int_{-1/C}^{1/C}\left(1 - \varphi(t)\right) dt \quad\text{(by Dominated Convergence)} \\
&\le 2.002\sup_{t\in[-1/C,\, 1/C]}\left|1 - \varphi(t)\right|.
\end{align*}
Since $\varphi(t) \to 1$ as $t \to 0$, we can pick $C$ large enough that $P(|\xi_n| \ge 2C) < \varepsilon$ for all sufficiently large $n$. Increase $C$ as necessary to make this bound hold for all $n$.

By Proposition 12.15, the sequence {ξn } has a subsequence that converges in


distribution. Any subsequence $\{\xi_{n_k}\} \subset \{\xi_n\}$ also has characteristic functions converging pointwise to $\varphi(t)$, so by the same reasoning it has a sub-subsequence converging in distribution. Therefore the sequence $\{\xi_n\}$ is precompact.$^8$
Moreover, if ξnk →d ξ for a convergent subsequence {ξnk }, then ϕξnk (t) →
ϕξ (t) for all t. But ϕξnk (t) → ϕ(t) for all t, so in fact ϕξ (t) = ϕ(t). Therefore,
every convergent subsequence converges to the same limit!
It remains to prove an analysis statement: if a sequence {ξn } is precompact,
and all convergent subsequences converge to the same limit, then the entire
sequence {ξn } converges to this limit.
This part isn’t interesting, and we skip it.

13.2 Central Limit Theorem


We now have the machinery to prove the celebrated Central Limit Theo-
rem.

Theorem 13.3 (CLT)
Let $\xi_n$ be i.i.d. random variables with $E[\xi_i^2] < \infty$. Set $m = E[\xi_i]$ and $\sigma^2 = \mathrm{Var}(\xi_i)$. Then,
\[ \frac{\sum_{i=1}^{n}\xi_i - mn}{\sigma\sqrt{n}} \to_d N(0, 1). \]

Recall that the Strong Law of Large Numbers says that
\[ \frac{\sum_{i=1}^{n}\xi_i - mn}{n} \to_{a.s.} 0. \]
CLT is therefore a refinement of this result.

Proof. Let $\varphi(t)$ be the characteristic function of $\eta_i = \frac{\xi_i - m}{\sigma}$. Since $E[\xi_i^2] < \infty$, $\varphi(t)$ is twice differentiable. We claim the Taylor expansion of $\varphi(t)$ is
\[ \varphi(t) = 1 - \frac{t^2}{2} + o(t^2). \]
8 That is, any subsequence has a converging sub-subsequence.


Indeed, the constant, linear, and quadratic terms are ϕ(0), ϕ0 (0), and 12 ϕ00 (0).
By Proposition 11.11, these are:
ϕ(0) = E[ηi0 ] = 1
0
ϕ (0) = iE[ηi ] = 0
1 00 1 2 1
ϕ (0) = i E[ηi2 ] = − .
2 2 2
Now, set
Pn n
i=1 ξi− mn 1 X
Xn = √ =√ ηi .
σ n n i=1
By properties of characteristic functions,
n   2 n
t2

t t
ϕXn (t) = ϕ √ = 1− +o → exp(−t2 /2) = ϕN (0,1) (t).
n 2n n
where the pointwise convergence is by the definition of the exp function. By
Theorem 12.14, we are done.
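The theorem is easy to watch in action. The following Monte Carlo sketch is an illustration I am adding (the distribution, $n$, trial count, and seed are arbitrary choices): normalized sums of i.i.d. $\mathrm{Exp}(1)$ variables, which have $m = \sigma = 1$, should be approximately $N(0,1)$ in distribution.

```python
import math
import random

# Monte Carlo sketch of the CLT (illustrative parameters).
random.seed(4)
n, trials = 100, 10_000

def normalized_sum():
    """One sample of X_n = (sum of n i.i.d. Exp(1) - m n) / (sigma sqrt(n))."""
    s = sum(random.expovariate(1.0) for _ in range(n))
    return (s - n * 1.0) / (1.0 * math.sqrt(n))

xs = sorted(normalized_sum() for _ in range(trials))

def empirical_cdf(y):
    return sum(1 for x in xs if x <= y) / trials

def Phi(y):  # standard normal CDF
    return 0.5 * (1 + math.erf(y / math.sqrt(2)))

err_at_0 = abs(empirical_cdf(0.0) - Phi(0.0))
err_at_1 = abs(empirical_cdf(1.0) - Phi(1.0))
```

The empirical CDF of $X_n$ should agree with $\Phi$ up to Monte Carlo noise plus a small $O(1/\sqrt{n})$ bias of the kind the next theorem quantifies.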

This leaves open the question: in practice, we can't get as many data points $\xi_n$ as we want. How many data points do we need before CLT is useful? The Berry-Esseen Inequality answers this question.

Theorem 13.4 (Berry-Esseen)
Assume the hypotheses of CLT, and that $E|\xi_i|^3 = M < \infty$. Then,
\[ \left|P(X_n \le y) - F_{N(0,1)}(y)\right| \le \frac{C \cdot M}{\sigma^3 \sqrt{n}} \]
for a fixed constant $C$.

This is a deep theorem, and we skip the proof.


Remark 13.5. Esseen (1942) showed that $C = 7.59$ works. Since then, this has been improved, with the best result $C = 0.4748$ due to Shevtsova (2011). Esseen (1956) also showed the lower bound $C \ge \frac{1}{\sqrt{2\pi}} + 0.001079 \approx 0.4$. The optimal $C$ is still unknown!
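The bound can be tested against an exactly computable case. The check below is an illustration I am adding (with $n$ an arbitrary choice), using the constant $C = 0.4748$ quoted above: for fair signs $\xi_i = \pm 1$ we have $m = 0$, $\sigma = 1$, $M = E|\xi_i|^3 = 1$, and the CDF of $X_n = \frac{1}{\sqrt{n}}\sum_i \xi_i$ is an explicit binomial sum.

```python
import math

# Exact sup-distance between the CDF of X_n (fair ±1 signs) and Phi,
# compared against the Berry-Esseen bound C M / (sigma^3 sqrt(n)).
n = 100

def Phi(y):  # standard normal CDF
    return 0.5 * (1 + math.erf(y / math.sqrt(2)))

# k heads out of n gives X_n = (2k - n)/sqrt(n), with binomial probability.
pmf = [math.comb(n, k) / 2 ** n for k in range(n + 1)]
sup_err = 0.0
cdf = 0.0
for k in range(n + 1):
    y = (2 * k - n) / math.sqrt(n)
    # The sup of |F_n(y) - Phi(y)| is attained at the atoms, approached
    # from below (value cdf) or exactly at them (value cdf + pmf[k]).
    sup_err = max(sup_err, abs(cdf - Phi(y)), abs(cdf + pmf[k] - Phi(y)))
    cdf += pmf[k]

bound = 0.4748 * 1.0 / (1.0 * math.sqrt(n))  # C M / (sigma^3 sqrt(n))
```

For $n = 100$ the exact sup-distance is about $0.04$, comfortably below the bound $0.04748$; since the true distance decays like $1/\sqrt{n}$ here, the inequality is nearly tight for this lattice example.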

13.3 Multidimensional CLT

Theorem 13.6
Let $\vec\xi_n = (\xi_n^1, \xi_n^2, \ldots, \xi_n^k)$ be i.i.d. in $n$. Let $\vec m = E[\vec\xi_n]$ be the mean, and
\[ C = E\left[\left(\vec\xi_n - \vec m\right)\left(\vec\xi_n - \vec m\right)^T\right] \]
be the covariance matrix of $\vec\xi_n$. Then,
\[ \frac{\sum_{i=1}^{n}\vec\xi_i - n\cdot\vec m}{\sqrt{n}} \to_d N(0, C). \]

Proof. Literally the same proof verbatim.


13.4 Lyapunov CLT

It turns out we can drop the condition that the $\xi_n$ are identically distributed, as long as we have moments of slightly larger power than 2.

Theorem 13.7 (Lyapunov CLT)
Let $\xi_i$ be independent variables, with $E[\xi_i^2] < \infty$. Set $S_n^2 = \sum_{i=1}^{n}\mathrm{Var}(\xi_i)$. If for some $\delta > 0$,
\[ \lim_{n\to\infty}\frac{1}{S_n^{2+\delta}}\sum_{i=1}^{n} E|\xi_i - E[\xi_i]|^{2+\delta} = 0, \]
then
\[ \frac{1}{S_n}\sum_{i=1}^{n}\left(\xi_i - E[\xi_i]\right) \to_d N(0, 1). \]

Let's look at an example to see what this crazy-looking condition is doing.

Example 13.8
Suppose the $\xi_i$ have third moments, with $E[\xi_i] = 0$, $E[\xi_i^2] = \sigma^2$, and $E|\xi_i^3| < C$. Then $S_n^2 = n\sigma^2$, so
\[ \frac{1}{S_n^3}\sum_{i=1}^{n} E|\xi_i|^3 \le \frac{Cn}{n^{3/2}\sigma^3} \to 0. \]

The Lindeberg CLT weakens the Lyapunov CLT condition.

Theorem 13.9 (Lindeberg CLT)
Let $\xi_i$ be independent variables, with $E[\xi_i^2] < \infty$. Set $\sigma_i^2 = \mathrm{Var}(\xi_i)$ and $S_n^2 = \sum_{i=1}^{n}\sigma_i^2$. If for all $\varepsilon > 0$,
\[ \lim_{n\to\infty}\frac{1}{S_n^2}\sum_{i=1}^{n} E\left[\xi_i^2\, I_{|\xi_i|\ge\varepsilon S_n}\right] = 0, \]
then
\[ \frac{1}{S_n}\sum_{i=1}^{n}\left(\xi_i - E[\xi_i]\right) \to_d N(0, 1). \]

Remark 13.10. The Lindeberg CLT condition says the $\xi_i$ don't have fluctuations on the order where we expect CLT to hold. As an exercise, show that the Lyapunov CLT condition implies the Lindeberg CLT condition.
Remark 13.11. In some sense, the Lindeberg CLT is the strongest form of CLT possible, because a result by Feller says that the Lindeberg CLT condition is necessary for CLT.

To show how to check the Lindeberg condition, let's check it for i.i.d. random variables.

Example 13.12
Let $\xi_i$ be i.i.d. random variables with mean 0 and variance $\sigma^2$, so $S_n^2 = n\sigma^2$. We need that
\[ \frac{1}{S_n^2}\sum_{i=1}^{n} E\left[\xi_i^2\, I_{|\xi_i|\ge\varepsilon S_n}\right] = \frac{1}{\sigma^2} E\left[\xi_i^2\, I_{|\xi_i|\ge\varepsilon\sigma\sqrt{n}}\right] \]
converges to 0 for each $\varepsilon > 0$. Since $E[\xi_i^2] < \infty$, this follows from continuity of the integral.

The proof of Lindeberg CLT is similar to the proof of CLT – we just have to
be more careful with the o(t2 ) error term. The proof is in Durrett.
