Stat 210 Notes
Eric K. Zhang
[email protected]
Fall 2020
Abstract
These are notes for Harvard’s Statistics 210, a graduate-level probability class providing
foundational material for statistics PhD students, as taught by Joe Blitzstein in Fall 2020. The
course has a long history as a statistics requirement at Harvard. We will focus on probability
topics applicable to statistics, with a lesser focus on measure theory.
Course description: Random variables, measure theory, reasoning by representation.
Families of distributions: Multivariate Normal, conjugate, marginals, mixtures. Conditional
distributions and expectation. Convergence, laws of large numbers, central limit theorems, and
martingales.
Contents
1 September 3rd, 2020
1.1 Course Logistics
1.2 Breakout Puzzle: Random Walks
1.3 Representations of Distributions
7 September 24th, 2020
7.1 The Beta-Gamma Calculus
7.2 The Normal Distribution and Box-Muller
7.3 Order Statistics
18 November 5th, 2020
18.1 Major Tools in Asymptotics
18.2 Natural Exponential Families
1 September 3rd, 2020
We start with an overview of the course. The class has roughly 80 students, ranging from first-year
PhD students in statistics and other areas to advanced undergraduates. We will cover many aspects
of probability from a rigorous standpoint. Topics include:
• Central limit theorem, including variants that don’t assume i.i.d. variables.
The course material is drawn from Joe Blitzstein and Carl Morris’s forthcoming textbook, Prob-
ability for Statistical Science [BM20]. Key philosophies of the course: reasoning with conditional
distributions, and balancing “coin-flipping” intuition against analysis.
Exercise 1.1 (Simple symmetric random walk). Suppose that you have a simple, symmetric ran-
dom walk on the real line, moving either +1 or −1 on each step with independent probability 1/2. If
you start at 0, what is the expected number of times you reach 10^100 before returning to 0?
Proof. The answer is 1. Let b = 10^100, and we can proceed in either of a couple of ways:
• Let p be the probability that we reach b at least once before returning to 0.² Then, the
distribution of the number of visits N to b before returning is
$$[N \mid N \ge 1] \sim \mathrm{FS}(p),$$
the first success distribution, which has mean 1/p. Therefore E [N ] = P (N ≥ 1) · E [N | N ≥ 1] =
p · (1/p) = 1.
• Imagine that during your random walk, you decide to write down an infinite sequence of
letters: ‘A’ whenever you hit the number 0, and ‘B’ whenever you hit the number b. This
²We can actually compute that p = 1/(2 · 10^100) with martingales, but this is irrelevant.
creates some long string AAAAAABBBBAA . . .. By symmetry, we can start from the point
b/2 and generate this string. Since the random walk is memoryless, we simply want to know
the expected number of B’s we hit between any two adjacent A’s.
By symmetry, the expected number of A’s and B’s in any finite subsegment is equal. Since
all but finitely many B’s lie between a pair of A’s with probability 1, we have that
$$E[N] = \lim_{n \to \infty} \frac{\#(\text{B's among the first } n \text{ letters})}{\#(\text{A's among the first } n \text{ letters})} = 1.$$
Definition 1.1 (Weibull distribution). The Weibull distribution is given by the power X^c of an
exponential random variable X.
Joe notes that entire books have been written on the Weibull distribution. Here’s another
interesting distribution.
Definition 1.2 (Cauchy distribution). The Cauchy distribution has probability density function
$$C : p(x) = \frac{1}{\pi(1 + x^2)}.$$
Example 1.3. There are several interesting properties of the Cauchy distribution in terms of
representations by other distributions: for instance, C ∼ Z1 /Z2 where Z1 , Z2 are i.i.d. ∼ N (0, 1),
and C ∼ tan(2πU ) for U ∼ Unif (see Example 7.5).
Another neat fact is that if X, Y are i.i.d. ∼ N (0, 1), then X + Y is independent from X − Y !³
Finally, we’ll give some intuition for our forays into measure theory, starting next lecture.
Example 1.4 (Banach-Tarski Paradox). Assuming the axiom of choice, there exists a way to
decompose a 3-ball B³ into finitely many pieces that can be reassembled into two balls, each
congruent to the original. However, at least one of the pieces must not be measurable.
In some sense, measure theory allows you to rigorously define an intuitive concept of mass.
This can also help axiomatize concepts to get at the core of problems. We’ll see that measure
theory lets us unify many proofs for different distributions into a single general proof.
³This is a special property of the normal distribution, not a general fact.
2 September 8th, 2020
Today is our first real lecture, where we introduce measure theory and its applications to continuous
distributions.
Example 2.1. Suppose that we had a continuous random variable X varying uniformly on [0, 1].
Then, how can we calculate
Pr(X = 0 | X ∈ {0, 1})?
We would expect, intuitively, for the answer to be 1/2. However, if we naively apply the definition
of conditional probability, we get something like
$$\Pr(X = 0 \mid X \in \{0, 1\}) = \frac{\Pr(X = 0)}{\Pr(X = 0 \cup X = 1)} = \frac{0}{0}.$$
Similarly, we want “fundamental” laws of probability like Bayes’ Rule to be formalized over
continuous probability distributions like this. The core concept that will allow this to be possible
is called a σ-algebra.
Definition 2.2 (σ-algebra). Given a set X, a σ-algebra on X is a collection Σ ⊆ 2^X of subsets
satisfying:
• X ∈ Σ,
• If A ∈ Σ, then Ac = X \ A ∈ Σ,
• If A1 , A2 , . . . ∈ Σ, then A1 ∪ A2 ∪ · · · ∈ Σ.
Unlike a typical set algebra, which is a collection of subsets that is closed under finite unions
and intersections, a σ-algebra is closed under countable unions and intersections (hence the letter
σ). The important takeaway from this definition is that it’s fine enough to talk about probability
in a reasonable way, but coarse enough so that we don’t have Banach-Tarski and friends.
Now we can define the core concept of a probability measure.
Definition 2.3 (Probability measure). Let Ω be a set of samples, and F a σ-algebra on Ω, called
the events. A function P : F → [0, 1] is called a probability measure if it satisfies the following
axioms:
• (Countable additivity) If A1 , A2 , . . . ∈ F are pairwise disjoint, then P (A1 ∪ A2 ∪ · · · ) =
P (A1 ) + P (A2 ) + · · · .
• P (Ω) = 1.
Note. Since this isn’t a measure theory course (Math 114), we don’t usually care about measures
in general. A general measure space is defined the same way as a probability space, except we call it
(X, Σ, µ) instead of (Ω, F, P ) by convention, and we also do not require the last axiom P (Ω) = 1.
Indeed, probability measures are the special case where the total measure is 1.
Let’s do a couple of examples to visualize σ-algebras.
Example 2.4 (Finite σ-algebra). Suppose that you partition Ω into four disjoint subsets,
$$\Omega = A \sqcup B \sqcup C \sqcup D.$$
Then, the σ-algebra generated by {A, B, C, D} has 16 elements, and can be written as
F = {∅, A, B, C, D,
A ∪ B, A ∪ C, A ∪ D, B ∪ C, B ∪ D, C ∪ D,
A ∪ B ∪ C, A ∪ B ∪ D, A ∪ C ∪ D, B ∪ C ∪ D,
A ∪ B ∪ C ∪ D}.
It turns out that all finite σ-algebras basically look like this. They all have a power-of-two size,
and they consist of all subcollections of some finite collection of events.
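To make this concrete, here is a small Python sketch (our own illustration) that generates the σ-algebra from a four-part partition by taking all unions of parts, confirming the power-of-two count.

from itertools import combinations

# Parts of a partition of Omega = {1, 2, 3, 4}.
parts = [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4})]

sigma = set()
for r in range(len(parts) + 1):
    for combo in combinations(parts, r):
        sigma.add(frozenset().union(*combo))  # union of a subcollection of parts

print(len(sigma))  # 16 = 2**4 events, as claimed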
Essentially, finite σ-algebras are uninteresting because they’re too coarse, but it does help lend
some intuition for the general case. We can think of σ-algebras as offering some kind of information
about the events that we have observed so far. This lends itself to the following definition:
Definition 2.5 (Filtration). Given a probability space (Ω, F, P ), a filtration is a sequence of sub-
σ-algebras F1 , F2 , . . . where for all k ≤ ℓ,
$$\mathcal{F}_k \subseteq \mathcal{F}_\ell \subseteq \mathcal{F}.$$
Proposition 2.6. There is no countably infinite σ-algebra; every σ-algebra is either finite or
uncountable.
Proof. Suppose for contradiction that F is a countably infinite σ-algebra on X. For each x ∈ X,
define the atom Bx to be the intersection of all measurable sets containing x, which is measurable
since F is countable. In other words, Bx is the smallest measurable set containing x. We claim
that all distinct atoms are disjoint. Indeed, if Bx ∩ By ≠ ∅, then there exists some z ∈ Bx ∩ By ,
so Bz ⊆ Bx ∩ By . Assume for the sake of contradiction that x ∉ Bz . Then, Bx \ Bz is a measurable
set containing x, so by minimality Bx ⊆ Bx \ Bz , which implies z ∉ Bx , a contradiction. Therefore
x ∈ Bz , so Bx ⊆ Bz ⊆ Bx and hence Bx = Bz ; by symmetry y ∈ Bz gives By = Bz as well, so
Bx = By = Bz .
Finally, consider the set of all atoms {Bx }x∈X . If this set is finite, then F must be finite as
well, which is a contradiction. Therefore there must be at least countably many distinct atoms
B1 , B2 , . . .. We can define an injective map f : 2^ℕ → F by
$$f(\{n_1, n_2, \ldots\}) = B_{n_1} \cup B_{n_2} \cup \cdots,$$
so #(F) ≥ #(2^ℕ) = 2^{ℵ₀}, contradicting countability.
Now we know the axioms of probability, and everything starts from here!
3 September 10th, 2020
Last lecture, we broke off after defining the foundations of probability: two axioms that define
everything from basics to modern research. We won’t cover too much more about this, as that is
the topic of measure theory classes (Math 114, Statistics 212). Instead we’ll shift gears and start
defining higher-level concepts.
Proposition 3.1. If {Fi } is any collection of σ-algebras on Ω, then the intersection ∩i Fi is also
a σ-algebra on Ω.
Be careful! The above proposition does not work for unions of σ-algebras.
Definition 3.2 (σ-algebra generated by subsets). Given a collection of subsets A ⊆ 2^Ω , we define
the σ-algebra generated by A to be the smallest σ-algebra containing A, i.e.,
$$\sigma(\mathcal{A}) = \bigcap_{\substack{\mathcal{A} \subseteq \mathcal{F} \\ \mathcal{F} \text{ is a } \sigma\text{-algebra}}} \mathcal{F}.$$
With this machinery, we can now define the Borel σ-algebra on the real numbers.
Definition 3.3 (Borel sets). Consider the set of closed intervals [a, b] ⊂ R. The Borel sets are
members of the σ-algebra generated by closed intervals.
Note. We can actually construct a stratified Borel hierarchy as follows. Start from F0 , the set
of closed intervals in R. Then, let F1 be the collection of all sets formed as countable unions or
intersections of sets in F0 , or their complements. This is already very complex, but we can similarly
let F2 be the collection of all sets formed as countable unions, intersections, or complements of sets
in F1 . It turns out that F0 ⊊ F1 ⊊ F2 ⊊ · · · , and even the limit Fω is not a σ-algebra. You have
to keep going up to the first uncountable ordinal, and then you reach the Borel σ-algebra B = Fω₁ .
Definition 3.4 (Lebesgue measurable sets). These exist and are more general than the Borel sets,
but we won’t talk too much more about them.
Note. These definitions are really general, which raises the question: are there sets that are not
measurable? The answer is yes (assuming the axiom of choice), for example, the Vitali sets.
Definition 3.5 (Measurable function). Let (X, Σ) be a measurable space. A function f : X → R
is called measurable if, for every Borel set⁴ B ⊆ R,
$$f^{-1}(B) \in \Sigma.$$
⁴Technically, we can define this more generally for sets in the Lebesgue measure, but the difference is unimportant.
Note. For the rest of this course, we may implicitly assume that functions are measurable if not
specified. It is extremely difficult to construct a non-measurable function, and they almost never
occur in practice.
Definition 3.6 (Random variable). A random variable X is a measurable function X : Ω → R.
Random variables are so useful that we give them special notation. In particular, suppose that
you have a random variable X, and you want to know the probability that its value lies between 1
and 3. We could write this rigorously in terms of events, i.e.,
$$P(\{\omega \in \Omega : 1 \le X(\omega) \le 3\}) = P(X^{-1}([1, 3])).$$
However, this is a bit cumbersome, so we use the notation “X ∈ B” to mean the same thing as
X⁻¹(B). We can then write the above as
$$P(X \in [1, 3]).$$
Proposition 3.9 (Dynkin’s π-λ theorem). Call a collection of subsets P ⊆ 2Ω a π-system if it is
closed under set intersection. If two measures P , Q agree on a π-system P , then they also agree
on all subsets in σ(P ), the σ-algebra generated by P .
Proof. This involves some complicated analysis wizardry. See Section 2.10 of the book.
Corollary 3.9.1 (CDFs are all you need). Any distribution is uniquely determined by its cumulative
distribution function F (x) = P (X ≤ x).
4 September 15th, 2020
Today is our final day focused primarily on measure theory, before we move on to random variables
and representations.
Note. Also, Joe mentions that we should not be intimidated by his use of the words trivial or
obvious in class. These words indicate that the ideas are simple enough to not require further
justification once you understand them, not that you should feel bad if you don’t immediately see
the justification.
Proposition 4.1. A function X : Ω → R is a random variable (i.e., measurable) if and only if
X⁻¹((−∞, x]) ∈ F for every x ∈ R.
Proof. Let X : Ω → R be a function satisfying this condition. Let A be the collection of all Borel
sets whose preimages are measurable,
$$\mathcal{A} = \{B \in \mathcal{B} \mid X^{-1}(B) \in \mathcal{F}\}.$$
We know that (−∞, x] ∈ A for all x ∈ R. The key observation is that A is a σ-algebra, which we
can directly verify by checking the three properties and mapping them back to properties of F.
Since the intervals (−∞, x] generate B, we conclude that A = B.
Definition 4.2 (Random vector). A random vector is a collection of n random variables, which
may or may not be independent. You can also see it as a measurable function X : Ω → Rn , which
defines the joint distribution of these variables. The marginal distribution of each variable is simply
the composition of X with the projection map.
Marginal distributions give us some information, but this is lossy. Only the joint distribution
gives us the full story of a random vector.
Now let’s go back to talking about π-λ. What does the letter λ mean?
Definition 4.3 (λ-system). A collection of subsets L ⊆ 2^Ω is called a λ-system if it satisfies:
• (Whole set) Ω ∈ L,
• (Complements) If A ∈ L, then Ac ∈ L,
• (Disjoint unions) If A1 , A2 , . . . ∈ L are pairwise disjoint, then A1 ∪ A2 ∪ · · · ∈ L.
It turns out that there is only one example of a λ-system that we really care about.
Example 4.4. Let P1 , P2 be probability measures on (Ω, F). Let
L = {A ∈ F | P1 (A) = P2 (A)}.
Then, L is a λ-system.
The above example can be easily checked by verifying the axioms; Joe skips this justification
in class. Anyway, with this context, we can provide the general statement of Dynkin’s theorem.
Proposition 4.6 (Dynkin’s π-λ, full form). If S is a π-system and L is a λ-system, and S ⊆ L,
then σ(S) ⊆ L.
Proof. Once again, the same tricky proof. We’ll outline it in the next lecture.
Some intuition for π-λ is that you can take a finite non π-system such as S = {{1, 2}, {2, 3}},
and this is not enough to guarantee uniqueness on the σ-algebra generated by S, which includes
sets like {2}, {1, 2, 3}. But, at least in the countable case, you can use the π-system property to do
disjointification/partitioning on Ω, which finishes the proof.
5 September 17th, 2020
We’ll first go through the proof of π-λ, then finally begin talking about distributions.
Proof of Proposition 4.6. Without loss of generality, let L be the smallest λ-system containing S.⁶
The key idea will be to show that L is a σ-algebra, by showing that it is a π-system. In other
words, it suffices to show that for all A, B ∈ L, we have A ∩ B ∈ L.
To prove this result, we will rely on the following key claim. For some fixed A0 ∈ L, we define
a collection of sets L(A0 ) = {B ∈ L | A0 ∩ B ∈ L}. Then L(A0 ) is a λ-system for any A0 .
The proof of the above claim is completely mechanical; just verify the axioms. Then, by the
assumption that S is a π-system, we know that S ⊆ L(A0 ) whenever A0 ∈ S, and since L is the
smallest λ-system containing S, we in fact have L(A0 ) = L. This means that whenever A ∈ S and
B ∈ L, we can conclude A ∩ B ∈ L.
With this stronger fact, we can apply the lemma once again to get the stronger result that
S ⊆ L(A0 ) whenever A0 ∈ L. Then, applying the same logic again, this means that L(A0 ) = L for
any A0 ∈ L, as desired.
Lemma 5.1. Recall that the definition of two random variables X, Y being independent is that for
all Borel sets A, B ∈ B, you have P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B). You can show that
this is equivalent to, for all x, y ∈ R,
P (X ≤ x, Y ≤ y) = P (X ≤ x)P (Y ≤ y).
Proof. Apply π-λ twice, judiciously. This can be generalized to n random variables.
Definition 5.3 (Bernoulli distribution). The simplest distribution is the Bernoulli, which models
a weighted coin toss. If 0 ≤ p ≤ 1 and Y ∼ Bern(p), then P (Y = 1) = p and P (Y = 0) = 1 − p.
The expected value of the Bernoulli distribution is p, while the variance is p(1 − p).
⁶This is valid because λ-systems are closed under intersection.
Definition 5.4 (Rademacher distribution). The Rademacher distribution takes values {−1, +1}
with equal probabilities 1/2 each.
Example 5.5. The position of a random walk on the real line, after n steps, can be modeled as a
sum of n i.i.d. Rademacher random variables.
Definition 5.6 (Binomial distribution). The binomial distribution Bin(n, p) is the sum of n inde-
pendent and identically distributed Bern(p) random variables.
The mean of a binomial distribution is np, while the variance is np(1 − p).
Definition 5.7 (Uniform distribution). The uniform distribution, written as U ∼ Unif, is the
distribution of equal density on the unit interval [0, 1]. It has the property that P (U ∈ [a, b]) = b−a
whenever 0 ≤ a ≤ b ≤ 1. It can be represented in terms of i.i.d. Y1 , Y2 , . . . ∼ Bern(1/2) by
$$U = \sum_{k=1}^{\infty} \frac{Y_k}{2^k}.$$
For brevity, we omit the measure theoretic details that the above dyadic construction is valid.
Note that many sources represent uniform distributions on intervals as Unif(a, b) instead, but Joe
prefers to write (b − a)U + a. The uniform distribution satisfies E [U ] = 1/2 and Var [U ] = 1/12.
Definition 5.8 (Exponential distribution). The exponential distribution, written Expo, is the
distribution of − log U for U ∼ Unif; equivalently, it has density f (x) = e^{−x} on [0, ∞).
The mean and variance are both 1. Note that in the above definition, we’re using the syntac-
tic convention of doing arithmetic on a distribution. This actually means that we draw random
variables from that distribution, and do arithmetic on the values. Although unambiguous in most
cases, Joe mentions that we should not write things like Expo + Expo, where the joint distribution
is unclear.
Definition 5.9 (Gamma distribution). The gamma distribution is the sum of independent expo-
nentially distributed random variables. We call r the (integer) shape parameter. Then, Gamma(r)
is the distribution of
$$X_1 + X_2 + \cdots + X_r,$$
where Xj are i.i.d. and drawn from Expo. The probability density function can be written as
$$f(x) = \frac{1}{\Gamma(r)} x^{r-1} e^{-x},$$
for x > 0.
6 September 22nd, 2020
Today we discuss reasoning by representation in more depth, and we introduce a fair number of
useful, common distributions.
Definition 6.1 (Quantile function). The quantile function of a distribution with CDF F is
$$F^{-1}(u) = \inf\{x \in \mathbb{R} : F(x) \ge u\}.$$
When F is continuous and strictly increasing, F −1 is identical to the inverse. Otherwise, it serves
as a sort of proxy that skips over regions with zero probability.
Proposition 6.2 (Probability integral transform). Let F be any CDF, with quantile function F −1 .
If we sample U ∼ Unif, then it follows that F −1 (U ) ∼ F .⁷
Proof. Note that u ≤ F (y) is the same as F −1 (u) ≤ y, since F is a non-decreasing function.
Therefore, the events U ≤ F (y) and F −1 (U ) ≤ y are the same for any y ∈ R, so
P (F −1 (U ) ≤ y) = P (U ≤ F (y)) = F (y).
Notice that this reminds us of the exponential distribution, which is in fact defined in a manner
similar to the probability integral transform, as a function of a uniform random variable.
Example 6.3. To generate a Bernoulli random variable with probability p, we can generate a
uniform random variable and pass it through the quantile function
$$F^{-1}(u) = \begin{cases} 0 & \text{if } u \le 1 - p, \\ 1 & \text{if } u > 1 - p. \end{cases}$$
It’s worth mentioning that the uniform distribution is not necessarily special. Using a variant
of the probability integral transform, we can generate a uniform from any continuous probability
distribution, and by extension, we can generate any probability distribution from any continuous
probability distribution.
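Here is a minimal Python sketch (our own illustration, not from the lecture) of the probability integral transform in action, generating Expo and Bern(p) samples by passing uniforms through quantile functions.

import math
import random

def sample_expo():
    # Quantile function of Expo is F^{-1}(u) = -log(1 - u); we feed it
    # 1 - U, which lies in (0, 1] and so avoids log(0).
    return -math.log(1.0 - random.random())

def sample_bern(p):
    # Quantile function from Example 6.3: 0 if u <= 1 - p, else 1.
    return 0 if random.random() <= 1 - p else 1

print(sum(sample_expo() for _ in range(100_000)) / 100_000)     # approx 1
print(sum(sample_bern(0.3) for _ in range(100_000)) / 100_000)  # approx 0.3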
Note. We can generate a normal distribution this way as well, by taking Φ⁻¹(U ) for U ∼ Unif,
where Φ is the standard normal CDF (closely related to the error function erf). However, this is
not terribly appealing because Φ⁻¹ is not expressible in terms of elementary functions.
⁷This is also sometimes called universality of the uniform.
[Figure: a web of representations radiating from the Uniform distribution: log(U/(1 − U )) gives
the Logistic, − log U gives the Exponential, and further transformations (floors, powers, and ratios
such as Z1 /Z2 , e^X , and N (0, 1)/√(χ²ₙ/n)) lead to the Arcsine, Normal, χ²ₙ, and related
distributions.]
Example 6.4 (Logistic distribution). The logistic distribution has representation log(U/(1 − U )),
where
we sample U ∼ Unif. The quantile function of the distribution is called the logit function, which is
$$\operatorname{logit}(p) = \log \frac{p}{1 - p}.$$
This maps a probability in (0, 1) to R, and it can be thought of as the log-odds of a probability. For
example, you can imagine predicting logits with a linear model (logistic regression), or a neural
network (softmax and cross entropy). The CDF is the sigmoid function,
$$\sigma(y) = \operatorname{logit}^{-1}(y) = \frac{e^y}{1 + e^y},$$
which can also be used as a nonlinearity in neural networks!
Finally, in a somewhat roundabout manner, we finally arrive at a definition of the normal
distribution from the χ2 distribution!
Definition 6.7 (Normal distribution). The celebrated standard normal distribution is defined by
N (0, 1) ∼ S · χ1 , where S ∼ Rad. We can scale this standard normal distribution to define a family
of distributions with various means and variances, which we denote N (µ, σ 2 ) ∼ σN (0, 1) + µ.
Example 6.8. $\chi_2^2 \sim Z_1^2 + Z_2^2$, where Z1 , Z2 are i.i.d. ∼ N (0, 1). Also, $\chi_2^2 \sim 2\,\mathrm{Expo}$.
Finally, here’s our last collector’s item today, which is often used in hypothesis testing.
Definition 6.9 (Student’s t-distribution). The Student t-distribution with n degrees of freedom has
the representation
$$t_n \sim \frac{N(0, 1)}{\sqrt{\chi_n^2 / n}},$$
where the numerator and denominator are independent.
We now work on an exercise in breakout rooms. Joe mentions that this exercise is very difficult
to solve with calculus, involving messy integrals, but it is surprisingly elegant when you attack it
by means of representations!
Exercise 6.10. For T ∼ tn , find E [|T |].
Solution. By the representation above,
$$|T| \sim \frac{|N(0, 1)|}{\sqrt{\chi_n^2 / n}} \sim \frac{\chi_1}{\chi_n} \sqrt{n}.$$
From here, since the numerator and denominator are independent (sorry for the sloppy notation),
we end up with a simpler expression:
$$E[|T|] = E[\chi_1] \cdot E\left[\frac{1}{\chi_n}\right] \cdot \sqrt{n}.$$
This is still kind of messy, but it’s broken down into much more manageable parts. For example,
we can find E [χ1 ] by a quick search on Wikipedia.
Definition 6.12 (Beta distribution). The beta distribution with shape parameters α, β > 0 is
supported on [0, 1]. Its representation is
$$\mathrm{Beta}(\alpha, \beta) \sim \frac{G_\alpha}{G_\alpha + G_\beta},$$
where Gα ∼ Gamma(α) and Gβ ∼ Gamma(β) are independent.
The beta distribution is often used as a conjugate prior for an unknown probability parameter.
Its probability density function is proportional to xα−1 (1−x)β−1 , and we’ll see some nice properties
connecting it to the gamma distribution next lecture.
7 September 24th, 2020
We continue where we left off, with the beta distribution, and we also talk about basic properties
of the normal distribution.
7.1 The Beta-Gamma Calculus
Proposition 7.1 (Beta-Gamma). Suppose that you have independent random variables Gα ∼
Gamma(α) and Gβ ∼ Gamma(β). Then by representations, we have that Gα +Gβ ∼ Gamma(α+β),
and Gα /(Gα + Gβ ) ∼ Beta(α, β). The interesting fact is that
$$\frac{G_\alpha}{G_\alpha + G_\beta} \;\perp\!\!\!\perp\; G_\alpha + G_\beta.$$
Proof. This fact comes from a straightforward calculation with Jacobians. Alternatively, you can
also reason about this by relating both variables to a Poisson process and order statistics, which
might help provide additional intuition.
Note. Surprisingly, this fact actually completely characterizes the beta and gamma distributions,
though proving this is nontrivial. It was formalized and proven in a theorem of Lukacs.
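As a quick numerical probe of Proposition 7.1 (our own sketch; the shape parameters are arbitrary choices), we can check both the independence claim and the mean of the ratio with NumPy.

import numpy as np

rng = np.random.default_rng(0)
alpha, beta, n = 2.0, 3.0, 200_000
ga = rng.gamma(alpha, 1.0, n)  # G_alpha ~ Gamma(alpha)
gb = rng.gamma(beta, 1.0, n)   # G_beta ~ Gamma(beta), independent of ga
ratio, total = ga / (ga + gb), ga + gb
print(np.corrcoef(ratio, total)[0, 1])       # approx 0 (uncorrelated)
print(ratio.mean(), alpha / (alpha + beta))  # both approx 0.4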
Joe remarks that he wants to emphasize the “Choose Your Favorite (CYF)” methodology.
Whenever you start working on a new problem about distributions, try to pick whichever repre-
sentation of said distributions gives you the most salient properties.
Example 7.2. Let B1 ∼ Beta(α, β) and B2 ∼ Beta(α + β, δ), such that B1 ⊥⊥ B2 . Using Proposi-
tion 7.1, we can choose the following construction for B1 and B2 :
$$B_1 = \frac{G_\alpha}{G_\alpha + G_\beta}, \qquad B_2 = \frac{G_\alpha + G_\beta}{G_\alpha + G_\beta + G_\delta},$$
where Gα , Gβ , Gδ are independent gamma random variables. We can verify in the above repre-
sentation that indeed, B1 ⊥⊥ B2 . Therefore the fractions cancel, and we can write
$$B_1 B_2 = \frac{G_\alpha}{G_\alpha + G_\beta + G_\delta} \sim \mathrm{Beta}(\alpha, \beta + \delta).$$
Example 7.3. Let’s try to compute the mean of the beta distribution. Note that independent
random variables are uncorrelated, so by definition
$$\frac{G_\alpha}{G_\alpha + G_\beta} \perp\!\!\!\perp G_\alpha + G_\beta \implies E\left[\frac{G_\alpha}{G_\alpha + G_\beta}\right] E[G_\alpha + G_\beta] = E[G_\alpha].$$
Therefore, the mean of Beta(α, β) is E [Gα ] /E [Gα + Gβ ] = α/(α + β).
7.2 The Normal Distribution and Box-Muller
Recall some basic properties of the normal distribution. If Z1 ∼ N (µ1 , σ1²) and Z2 ∼ N (µ2 , σ2²),
then Z1 ⊥⊥ Z2 =⇒ Z1 + Z2 ∼ N (µ1 + µ2 , σ1² + σ2²). Other useful properties are that the normal
distribution is invariant under rotations (e.g., Z1 + Z2 ⊥⊥ Z1 − Z2 ), and it is symmetric.
Proposition 7.4 (Box-Muller transform). If U1 , U2 ∼ Unif and U1 ⊥⊥ U2 , then define
$$Z_1 = \sqrt{-2 \log U_1} \cos(2\pi U_2),$$
$$Z_2 = \sqrt{-2 \log U_1} \sin(2\pi U_2).$$
It follows that Z1 , Z2 are i.i.d. ∼ N (0, 1).
Proof. Note that (Z1 , Z2 ) has support on R². Since the multivariate normal distribution is centrally
symmetric, we can sample the angle θ ∼ 2π Unif, which is what U2 is used for. Meanwhile, to get
the radius, observe that Z1² + Z2² ∼ χ²₂ ∼ 2 Gamma(1), which motivates the use of $\sqrt{-2 \log U_1}$.
This transformation gives us an efficient way to sample i.i.d. normal random variables in the
special case of a parallel processor (SIMD or GPU). However, the Ziggurat algorithm, a variant of
rejection sampling, is more efficient on common processors. Taking NumPy’s implementation as
an example, see the current Ziggurat version, or the old Box-Muller version.
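For reference, here is a short, vectorized Box-Muller sketch in NumPy (our own illustration, not NumPy’s internal code), following Proposition 7.4 directly.

import numpy as np

def box_muller(n, seed=0):
    # Turn 2n i.i.d. uniforms into 2n i.i.d. N(0, 1) samples.
    rng = np.random.default_rng(seed)
    u1 = 1.0 - rng.random(n)  # shift into (0, 1] so log(u1) is finite
    u2 = rng.random(n)
    r = np.sqrt(-2 * np.log(u1))     # radius: Z1^2 + Z2^2 ~ 2 Gamma(1)
    z1 = r * np.cos(2 * np.pi * u2)  # angle supplied by u2
    z2 = r * np.sin(2 * np.pi * u2)
    return np.concatenate([z1, z2])

z = box_muller(100_000)
print(z.mean(), z.var())  # approx 0 and 1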
In addition to being computationally nice in some cases (avoiding code branches), the Box-
Muller transform is also useful as a representation, which transforms many problems about the
normal distribution into ones about trigonometric functions.
Example 7.5. If U ∼ Unif, then tan(2πU ) ∼ Cauchy.
7.3 Order Statistics
Given random variables X1 , . . . , Xn , their order statistics are the sorted values X(1) ≤ X(2) ≤
· · · ≤ X(n) . The nicest case is the exponential distribution.
Proposition 7.7 (Rényi representation). If X1 , . . . , Xn are i.i.d. ∼ Expo, then their order statistics
can be jointly represented as
$$X_{(j)} \sim \frac{Y_1}{n} + \frac{Y_2}{n - 1} + \cdots + \frac{Y_j}{n - j + 1},$$
where Y1 , Y2 , . . . , Yn are i.i.d. ∼ Expo.
Proof. This follows from induction and the memoryless property of the exponential distribution.
One other interesting case to consider is when the distributions are uniform. In some sense,
the exponential and the uniform are the two nicest distributions whose order statistics we can
work with.
Proposition 7.8 (Uniform order statistics). If U1 , . . . , Un are i.i.d. ∼ Unif, then their order statis-
tics are jointly distributed as
$$U_{(j)} = \frac{X_1 + \cdots + X_j}{X_1 + \cdots + X_{n+1}},$$
where X1 , . . . , Xn+1 are i.i.d. ∼ Expo. It immediately follows that the marginal distributions of the
order statistics are U(j) ∼ Beta(j, n + 1 − j).
Proof. Joe notes that there’s a nice proof of this due to Franklyn Wang, when viewed as related
to the Rényi representation. Essentially, you map this to a transformed Poisson process. See the
textbook for details.
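The representation is easy to test empirically; below is a short NumPy sketch (our own illustration) comparing the mean of U_(j) built from exponential gaps against the Beta(j, n + 1 − j) mean.

import numpy as np

rng = np.random.default_rng(1)
n, j, trials = 5, 2, 100_000
x = rng.exponential(1.0, (trials, n + 1))   # X_1, ..., X_{n+1} i.i.d. Expo
u_j = x[:, :j].sum(axis=1) / x.sum(axis=1)  # (X_1+...+X_j)/(X_1+...+X_{n+1})
print(u_j.mean(), j / (n + 1))  # Beta(j, n+1-j) has mean j/(n+1)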
When dealing with order statistics for exponential distributions with different rates λ1 , . . . , λn ,
the first order statistic is nice.9 However, all of the other order statistics are unfortunately messy.
⁹It’s not hard to show that this is distributed according to $\frac{1}{\lambda_1 + \cdots + \lambda_n}$ Expo.
8 September 29th, 2020
Today we formally introduce the Poisson distribution and the related Poisson process.
Definition 8.1 (Poisson distribution). The Poisson distribution Pois(λ) with rate parameter λ,
supported on {0, 1, 2, . . .}, is defined by the probability mass function
$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}.$$
This distribution is deeply connected to the exponential and gamma distributions.
Definition 8.2 (Poisson process). The Poisson process refers to the sequence of arrival times
T1 , T2 , . . . ≥ 0, where the successive time differences X1 = T1 , X2 = T2 − T1 , X3 = T3 − T2 , . . . are
i.i.d. ∼ λ−1 Expo. The marginal distribution of arrival times is
Tn = X1 + X2 + · · · + Xn ∼ λ−1 Gamma(n).
Furthermore, if Nt = #(arrivals in [0, t]), then the two events {Nt ≥ n} = {Tn ≤ t} are equivalent.
This holds for general arrival processes, and we sometimes call this count-time duality.10
Proposition 8.3. If Nt is the number of arrivals in [0, t] of a Poisson process with rate λ, then
Nt ∼ Pois(λt).
Proof. By count-time duality, P (Nt = k) = P (Nt ≥ k) − P (Nt ≥ k + 1) = P (Tk ≤ t) − P (Tk+1 ≤ t).
Both of these latter probabilities can be expressed as a CDF of the gamma distribution. Although
the incomplete gamma function is messy, applying integration by parts cracks the problem:
$$P(T_k \le t) - P(T_{k+1} \le t) = \frac{1}{\Gamma(k)} \int_0^{\lambda t} e^{-x} x^{k-1}\,dx - \frac{1}{\Gamma(k+1)} \int_0^{\lambda t} e^{-x} x^k\,dx$$
$$= \frac{1}{\Gamma(k)} \int_0^{\lambda t} e^{-x} x^{k-1}\,dx + \frac{1}{\Gamma(k+1)} e^{-\lambda t} (\lambda t)^k - \frac{k}{\Gamma(k+1)} \int_0^{\lambda t} e^{-x} x^{k-1}\,dx$$
$$= \frac{e^{-\lambda t} (\lambda t)^k}{k!}.$$
Corollary 8.3.1. Given any fixed time interval of length t, the number of Poisson arrival events
in that interval is distributed ∼ Pois(λt). Furthermore, given two disjoint time intervals of any
lengths, the number of Poisson arrival events in those intervals are independent.
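Here is a small simulation sketch (our own illustration) that builds arrival times from i.i.d. exponential gaps and checks that the count in [0, t] behaves like Pois(λt), whose mean and variance are both λt.

import numpy as np

rng = np.random.default_rng(2)
lam, t, trials = 3.0, 2.0, 50_000
gaps = rng.exponential(1.0 / lam, (trials, 100))  # 100 gaps easily covers [0, t]
arrivals = gaps.cumsum(axis=1)                    # T_n = X_1 + ... + X_n
counts = (arrivals <= t).sum(axis=1)              # N_t via count-time duality
print(counts.mean(), counts.var())  # both approx lam * t = 6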
We previously mentioned Poisson processes through a connection with the order statistics of
the uniform distribution. We formalize this below.
¹⁰This is Joe’s invented name for the fact.
Proposition 8.4 (Conditional arrival times). If Tn+1 = t, then the conditional joint distribution
of (T1 , T2 , . . . , Tn ) is that of the order statistics of i.i.d. uniform random variables multiplied by t,
i.e.,
$$(T_1, \ldots, T_n) \mid T_{n+1} = t \;\sim\; t \cdot (U_{(1)}, \ldots, U_{(n)}),$$
where U1 , . . . , Un ∼ Unif.
Proof. This stems from distribution representations and the Beta-Gamma calculus. Observe that
$$\frac{T_k}{T_{n+1}} = \frac{X_1 + X_2 + \cdots + X_k}{X_1 + X_2 + \cdots + X_{n+1}}.$$
The right-hand side is precisely the representation from Proposition 7.8 for the joint distribution
of uniform order statistics U(k) .
The above fact also yields a nice proof that Beta(1, 1) ∼ Unif.
Exercise (Broken stick). If we cut the unit interval [0, 1] at n i.i.d. uniform points, what is the
expected length of the shortest of the n + 1 resulting segments?
Proof. Use the order statistics of the uniform distribution. This tells us that in the joint distribu-
tion, the k-th cut point can be represented as
$$\frac{X_1 + \cdots + X_k}{X_1 + \cdots + X_{n+1}},$$
where X1 , . . . , Xn+1 are i.i.d. ∼ Expo. Then, apply the Rényi representation of the exponential
distribution, which tells us that
$$X_{(1)} \sim \frac{Y_1}{n+1}; \quad X_{(2)} - X_{(1)} \sim \frac{Y_2}{n}; \quad \ldots; \quad X_{(n+1)} - X_{(n)} \sim Y_{n+1};$$
where Y1 , . . . , Yn+1 are also i.i.d. ∼ Expo. Finally, we can conclude that the length of the shortest
segment is simply distributed as
$$\frac{X_{(1)}}{X_{(1)} + \cdots + X_{(n+1)}} = \frac{\frac{1}{n+1} Y_1}{Y_1 + \cdots + Y_{n+1}} = \frac{1}{n+1} \mathrm{Beta}(1, n).$$
This has mean $\frac{1}{(n+1)^2}$.
Note. By a slight modification of the above argument, using linearity of expectation, we can see
that the expected value of the length of the k-th largest segment is simply
$$\frac{1}{n+1} \left( \frac{1}{k} + \frac{1}{k+1} + \cdots + \frac{1}{n+1} \right).$$
In the next lecture, we will begin discussing expected value through Lebesgue integration!
9 October 1st, 2020
Today we will continue discussing Poisson processes and some of their nice properties. Then, we
introduce the notion of expected value, which is defined in Chapter 4 of the textbook.
Definition 9.1 (Poisson point process). A Poisson point process on a measure space (X, µ)¹¹ with
rate λ has the property that the number of points in a bounded region U ⊂ X is distributed
according to a Poisson random variable with parameter λµ(U ).
Note that with this definition, we lose the interpretation of a Poisson process as having expo-
nential arrival times. This only works for Poisson point processes on R+ , which is what we have
been working with so far. When we take Poisson processes over other measure spaces, there is no
longer any notion of arrival time.
Example 9.2 (Poisson process on a circle). We can define a Poisson process with rate λ on the
unit circle S¹. For any arc subtending an angle θ, written in radians, the number of points in that
arc is distributed according to Pois(λθ). The expected total number of points on the circle is 2πλ.
Example 9.3 (2D Poisson process). Consider the special case where X is a compact subset of R2 ,
and µ is the Lebesgue (or Borel) measure. Then, we call this a 2D Poisson process, and it has the
property that the number of points in any two separate regions are independent, and the mean is
proportional to the area of those regions.
We can arrive at an approximation for a 2D Poisson point process by subdividing our region
into many small squares, then giving each square an independent Bernoulli chance of containing
a point. As the number of squares gets larger, and each square gets smaller, our approximation
gets closer to a true Poisson process.
Lemma 9.4 (#1). If X ∼ Pois(λ1 ), Y ∼ Pois(λ2 ), and X ⊥⊥ Y , then X + Y ∼ Pois(λ1 + λ2 ).
Proof. Consider a Poisson point process with rate λ1 , and another Poisson point process with rate
λ2 . Then we can simply superimpose these processes together into a single process, combining the
arrival times from both. It’s easy to see that X is the number of arrivals in [0, 1] for the first
process, Y is the number of arrivals in [0, 1] for the second process, and X + Y becomes the number
of arrivals in the superimposed process, which has rate λ1 + λ2 .
Lemma 9.5 (#2). If X ∼ Pois(λ1 ), Y ∼ Pois(λ2 ), and X ⊥⊥ Y , then the conditional distribution
of X given X + Y = n is Bin(n, λ1 /(λ1 + λ2 )).
¹¹Technically, these need to be Radon measures for mathematical reasons.
¹²Joe calls these his “favorite” properties, particularly #3.
Proof. This is equivalent to the following fact about a Poisson process. Given a Poisson process
with rate λ, condition on the number of arrivals in the interval [0, t] being N (where N ∼ Pois(λt)).
Then the arrival times T1 , . . . , TN are jointly distributed as the order statistics of N i.i.d. uniform
random variables multiplied by t.
Lemma 9.6 (#3). Consider the chicken-egg story, where a chicken lays N ∼ Pois(λ) eggs, which
each hatch independently with probability p. Let X be the number of eggs that hatch, and let Y be
the number of eggs that do not hatch. Then, X ⊥⊥ Y , X ∼ Pois(λp), and Y ∼ Pois(λ(1 − p)).
Proof. The proof of this result comes from LOTP, where we compute
P (X = x, Y = y) = P (X = x, Y = y | N = x + y) · P (N = x + y).
After some algebraic manipulation, this eventually shows independence of X ⊥⊥ Y . As an inter-
pretation in the corresponding Poisson point process, you can imagine starting with a process of
rate λ, then thinning the process by coloring each point independently with probability p. The
colored and uncolored points then form their own, independent, Poisson processes with rates λp
and λ(1 − p). This can be seen as the reverse of superposition.
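Thinning is also easy to see numerically; this short sketch (our own illustration) hatches each of N ∼ Pois(λ) eggs independently with probability p and checks the claimed marginals and independence.

import numpy as np

rng = np.random.default_rng(3)
lam, p, trials = 10.0, 0.3, 200_000
n = rng.poisson(lam, trials)
x = rng.binomial(n, p)  # hatched eggs, given N = n
y = n - x               # unhatched eggs
print(x.mean(), x.var())        # both approx lam * p = 3
print(y.mean(), y.var())        # both approx lam * (1 - p) = 7
print(np.corrcoef(x, y)[0, 1])  # approx 0, consistent with independence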
Definition 9.7 (Riemann integral). The Riemann integral of a function f : [a, b] → R is
$$\int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=0}^{n-1} f(t_i)(x_{i+1} - x_i),$$
where a = x0 < x1 < x2 < · · · < xn = b, and for each i, ti ∈ [xi , xi+1 ].
This definition of Riemann integral clearly does not work for computing the expected value of a
discrete distribution, which does not have a PDF. The Riemann sums will not converge in this
case, so we need something slightly more powerful.
Definition 9.8 (Riemann-Stieltjes integral). The Riemann-Stieltjes integral of f : [a, b] → R with
respect to a non-decreasing integrator function g : [a, b] → R is
$$\int_a^b f(x)\,dg(x) = \lim_{n \to \infty} \sum_{i=0}^{n-1} f(t_i)(g(x_{i+1}) - g(x_i)),$$
where a = x0 < x1 < x2 < · · · < xn = b, and for each i, ti ∈ [xi , xi+1 ].
Note that this coincides with the ordinary Riemann integral when g(x) = x. This integral has
the property that it works for computing the expected value of discrete distributions, since you can
simply plug in the CDF (which is well-defined) as the integrator. It’s also fairly easy to compute by
hand, which makes it useful in practice. However, the strongest and most general integral, which
is often used in proofs, is as follows.13
¹³We will only define the Lebesgue integral for random variables in probability spaces here, but you can generalize
the definition to other functions on measure spaces.
Definition 9.9 (Lebesgue integral). Let (Ω, F, P ) be a probability space, and let X : Ω → R be a
random variable. Then the expected value of X, denoted E [X] is defined by the following three-step
construction:14
1. For indicator random variables, which are simply 1 on a measurable set S ∈ F and 0
otherwise: their expectation is defined to be the measure P (S).
2. Extending to non-negative weighted sums of indicator random variables, called simple random
variables. We do this by linearity of expectation.
3. Defining for non-negative random variables by taking the supremum over all dominated simple
random variables X ∗ ,
$$E[X] = \sup_{X^* \le X} E[X^*].$$
We omit the remaining details (in particular, a general X is handled by writing X = X⁺ − X⁻ and
setting E [X] = E [X⁺] − E [X⁻]), but these are the key ideas of the Lebesgue integral construction.
¹⁴Joe refers to this as InSiPoD, short for Indicator-Simple-Positive-Difference.
10 October 6th, 2020
Last week we introduced the Riemann-Stieltjes and Lebesgue integrals, for the purpose of defining
the expected value of a random variable. Today we’ll continue by discussing them in more detail.
The first is a mild generalization of our familiar Riemann integral from high school Calculus, while
the second is the venerable Lebesgue integral, which is general enough to work on any measurable
domain (not necessarily just the reals!). In general, when an integral is written, you can choose
whichever definition as they are consistent where defined.
Example 10.1 (Indicator of Q). Consider the indicator function IQ : R → {0, 1}, which is 1 on all
the rationals and 0 everywhere else. This function is not Riemann integrable (non-convergent) on
any nonzero interval of the reals, yet it is Lebesgue integrable. In fact, because the rationals are
countable,
$$\int_{\mathbb{R}} I_{\mathbb{Q}}(x)\,\lambda(dx) = 0.$$
Proposition 10.2 (Linearity). For random variables X and Y on the same probability space,
$$E[X + Y] = E[X] + E[Y].$$
This is not an obvious statement, and proving it requires some work. We can also generalize to
countably infinite sums of random variables, as linearity still holds under some mild regularity
assumptions. Joe uses this as an example of the difference between the Riemann and Lebesgue
definitions of expected value. Compare the statement of linearity in both senses:
$$\int_{-\infty}^{\infty} t f_{X+Y}(t)\,dt = \int_{-\infty}^{\infty} x f_X(x)\,dx + \int_{-\infty}^{\infty} y f_Y(y)\,dy,$$
$$\int_\Omega (X + Y)(\omega)\,P(d\omega) = \int_\Omega X(\omega)\,P(d\omega) + \int_\Omega Y(\omega)\,P(d\omega).$$
Either statement requires a formal mathematical proof, but the second statement (in terms of the
Lebesgue integral) is much more intuitive to read, as the variable of integration ω is the same.
Example 10.3 (Simple random variables). Consider a simple random variable $X = \sum_{j=1}^{n} a_j I_{A_j}$.
We can usually write such a variable in canonical form by assuming that the subsets Aj are pairwise
disjoint, which makes X essentially a collection of disjoint positive rectangles over the sample
space Ω.
¹⁵Joe notes that we won’t care much about weird cases like Q, as they don’t come up in practice. For example, in
the real world, all of your measurements will be in Q due to finite precision. Here’s a pun: “In this course, we care
about hard work, not IQ .”
For an additional clarification about Definition 9.9, consider the following equivalent description
of the nonnegative case. This isn’t written in the book yet, but there’s a really clean formula for
the simple random variables approximating any nonnegative random variable X. We can just take
a monotone sequence of random variables:
$$X_n = \min\left( \frac{\lfloor 2^n X \rfloor}{2^n},\; n \right).$$
It’s not hard to show that this is equivalent to the step in the definition of the Lebesgue integral
that uses a supremum over simple random variables. Basically, all this does is cut off the values of
X at n, then quantize it to the first n digits of its binary representation. However, this definition
can be much easier to use in an actual computation.
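As a quick illustration (our own sketch), we can watch this quantized sequence converge to the mean of a standard exponential from below.

import numpy as np

def quantized_mean(samples, n):
    # X_n = min(floor(2^n X) / 2^n, n): cut off at n, keep n binary digits.
    xn = np.minimum(np.floor((2.0 ** n) * samples) / (2.0 ** n), n)
    return xn.mean()

rng = np.random.default_rng(4)
x = rng.exponential(1.0, 100_000)
for n in (1, 2, 4, 8):
    print(n, quantized_mean(x, n))  # increases toward E[X] = 1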
Example 10.4 (Darth Vader rule). For any nonnegative random variable Y , the following formula
for the expectation holds:
$$E[Y] = \int_0^\infty P(Y > y)\,dy.$$
Proof. First we will show this for the Lebesgue definition of expected value. If Y is an indicator
random variable IA , then the right-hand side integral just becomes P (A), which follows immediately.
Next, if Y is simple, then we proceed simply by breaking up the variable into its canonical form
and writing a double sum. After some manipulation (swapping the order of sums), this works.
Finally, we can generalize to all nonnegative random variables by using the monotone convergence
theorem, which lets us swap the order of lim and E.
For completeness, we also sketch the proof when the left-hand side has E [Y ] defined according
to the Riemann-Stieltjes definition. Recall by definition that
$$E[Y] = \int_{-\infty}^{\infty} y\,dF(y) = \int_{-\infty}^{\infty} \int_0^y dx\,dF(y).$$
Writing it in this form, it’s clear that this statement just becomes a consequence of Fubini’s theorem.
Swapping the order of the integrals yields our desired result:
$$\int_{-\infty}^{\infty} \int_0^y dx\,dF(y) = \int_0^\infty \int_x^\infty dF(y)\,dx = \int_0^\infty P(Y > x)\,dx.$$
A natural question in analysis is when taking limits commutes with taking expectations:
[Diagram: a commuting square — limj→∞ sends the sequence {Xj } to X, applying E to each side
gives {E[Xj ]} and E[X], and we ask whether the square commutes.]
In other words, when does limj→∞ E [Xj ] = E [limj→∞ Xj ]? It turns out that this intuitive state-
ment holds most of the time, but there are counterexamples where it fails to hold.
Example 10.5 (Failure of convergence). Consider the sequence of discrete random variables Xn ,
where Xn is n² with probability 1/n² and 0 otherwise. By the Borel-Cantelli lemma, note that
$$\sum_{n=1}^{\infty} P(X_n > 0) = \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty,$$
so with probability 1, at most finitely many of the Xn will be nonzero. This means that Xn → X al-
most surely, where X = 0 is a point mass at zero. However, now we have a counterexample,
as E [Xn ] = 1 for all n, yet E [X] = 0.
Despite this pessimistic example, under some mild assumptions we can prove that expected
values and limits commute, using three so-called convergence theorems.
Proposition 10.7 (Dominated convergence theorem). If there exists a random variable W such
that |Xn | ≤ W for all n, E [W ] < ∞, and X1 , X2 , . . . → X in probability, then limn→∞ E [Xn ] =
E [X].
See Section 4.6 of the book for proofs of each of these theorems.
11 October 8th, 2020
Today we review convergence theorems a bit, for the purpose of providing us some analysis intuition.
Definition 11.2 (Almost sure convergence). If X1 , X2 , . . . and X are random variables, then we
say that X1 , X2 , . . . → X strongly, or almost surely converges, if
$$P\left( \lim_{n \to \infty} X_n = X \right) = 1.$$
Proof of Corollary 10.7.1. In the statement of the bounded convergence theorem, we assumed that
|Xn | ≤ c for all n. Let’s first try to bound X as well. To do this, we will take a strategy of adding
some slack to the variable.16 For any n and ε > 0, note that by a union bound,
$$P(|X| > c + \varepsilon) \le P(|X_n| > c) + P(|X_n - X| > \varepsilon) = P(|X_n - X| > \varepsilon).$$
However, as n → ∞, this probability just goes immediately to zero. We can view this as “taking
the limit” on both sides, except the left-hand side doesn’t actually contain the variable n in it.
Therefore, by the squeeze theorem,
$$P(|X| > c + \varepsilon) = 0 \quad \text{for all } \varepsilon > 0,$$
where the last step follows from the definition of convergence in probability; letting ε → 0 shows
|X| ≤ c almost surely. For the next part of our proof, consider E [|Xn − X|] for varying n. By the
triangle inequality, since |Xn |, |X| ≤ c, we must have |Xn − X| ≤ 2c. Then,
$$E[|X_n - X|] \le 2c \cdot P(|X_n - X| > \varepsilon) + \varepsilon.$$
For any ε, as n → ∞, the first term on the right-hand side approaches zero. Therefore,
$$\lim_{n \to \infty} |E[X_n] - E[X]| \le \lim_{n \to \infty} E[|X_n - X|] \le \varepsilon$$
for every ε > 0, and so limn→∞ E [Xn ] = E [X].
Note that we choose to present the proof above, instead of the more general dominated conver-
gence theorem, as that proof requires using machinery such as Fatou’s lemma.
Exercise 11.1. Does the bounded convergence theorem still hold if we replace “converges in
probability” with “converges in distribution” for X1 , X2 , . . .? Joe mentions that he’s not sure if
this is true, but he can’t think of an easy counterexample at the moment.
Definition (Conditional expectation). Given random variables X and Y , let g(x) = E [Y | X = x].
Then the conditional expectation of Y given X is the random variable
E [Y | X] = g(X).
Note. To be very rigorous about definitions, assume that Y : Ω → R is a random variable, and
G ⊆ F is a σ-subalgebra. Then E [Y | G] is also a function Ω → R, defined in terms of an averaging
operator across all atomic sets in G. In other words, we already have that Y is a F-measurable
function, but by applying a certain averaging map, we can make it G-measurable, which is a stronger
condition because G is coarser than F. Mathematically, we have for all G ∈ G that
$$\int_G E[Y \mid \mathcal{G}]\,dP = \int_G Y\,dP.$$
Therefore, the equation that E [Y | X] = g(X) is actually somewhat of an abuse of notation ac-
cording to this σ-algebra definition, but it makes the definition much easier to think about!
For more intuition about conditional expectation, you can also think of it as a form of projection.
This is reflected in the (albeit nonconstructive) definition below.
Definition (Conditional expectation as projection). E [Y | X] is the random variable g(X) such
that for every (bounded, measurable) function h,
$$E[(Y - g(X))\, h(X)] = 0.$$
This makes sense intuitively, as you can pushforward the Lebesgue integral from the underlying
σ-algebra F to the σ-subalgebra σ(X) ⊂ F, on which h(X) is locally constant and Y − g(X)
averages to zero.
Anyway, this is the property we’d really like for conditional expectation to have. Let’s now see
if this definition is actually valid, i.e., showing existence and uniqueness! For what follows, let’s
assume (for the sake of convenience) that all r.v.s have finite variance.
¹⁷Therefore, you may see some papers writing this with the notation E [Y | σ(X)].
Definition 11.4 (Hilbert space). A Hilbert space is a real or complex inner product space that is
also a complete metric space with respect to its norm.
We care about this completeness condition because in function spaces, which are infinite-
dimensional real vector spaces, we don’t actually get completeness for free. Anyway, we could
spend more time talking about Hilbert and Banach spaces, but that’s the content of Math 114.
Instead, we’ll just state the theorem.
Proposition 11.5. Zero-centered random variables, i.e., such that E [X] = 0, form a Hilbert space
under the covariance inner product
$$\langle X, Y \rangle = \operatorname{Cov}(X, Y) = E[XY].$$
This assumes that we consider two random variables to be equivalent if they are almost surely equal.
It’s a well-known fact that quotient Hilbert spaces exist. Using some kind of argument along
this form, you can essentially show with relative ease that conditional expectations exist and are
unique. The details are omitted here in the lecture, as it’s all measure theory.
Proposition 11.6 (Adam’s law). For any random variables X and Y ,
E [E [Y | X]] = E [Y ] .
Proof. This follows immediately from the conditional expectation property written above. In par-
ticular, if we set h(X) = 1, then the property reduces to
E [Y − E [Y | X]] = 0,
which rearranges to the desired result by linearity. For instance, if E [Y ] = 0, this machinery gives
$$\operatorname{Var}[Y] = E[Y^2] = E[E[Y^2 \mid X]].$$
That concludes a very brief foray into conditional expectation and some of its properties.
12 October 13th, 2020
First, let’s talk about the midterm exam. The test-taking window will start on October 22nd, and
it will last for 60 hours. As a result, there will be no class next Thursday. The test can be taken
in any 3-hour block, although it has been written to be “reasonable” as a 75-80 minute exam to
somewhat alleviate time pressure. The exam is open-book, open-note, and open-internet, but you
may not ask questions or consult any other students. Submissions are in PDF format and can be
handwritten or in LaTeX.
Proof. We will show this by simply applying Adam’s law and linearity of expectation.
Alternatively, note that the above argument could be slightly simplified by assuming, without
essential loss of generality, that E [X] = E [Y ] = 0.
As an aside, note that everything stated above has been about conditional expectations. This
is because conditional expectations (which are just random variables) are much easier to rigorously
talk about than conditional distributions. When we write X | Z ∼ N (Z, 1), this is a statement
about the conditional distribution of X, not a random variable called “X | Z” (which does not make
sense). Defining conditional distributions rigorously requires some measure theory machinery,18
which is not the focus of this course.
An interesting generalization of Adam’s law, Eve’s law, and ECCE is the law of total cumulance.
This is not included in the textbook at the moment, but Joe might add it later. Anyway, it’s beyond
the scope of this course, as the laws written above cover 95% of cases.
Note. Borel’s paradox, as mentioned in the book, is an issue when trying to define conditional
probability when conditioning on events of probability zero. It happens with continuous random
variables, for example, conditioning on the event X = Y through the equivalent formulations
X − Y = 0 and X/Y = 1. The issue is that we are conditioning on an event of measure zero in
both cases. Because of the obvious issues, conditioning on events is outside the scope of this course,
and we will only condition on r.v.s and σ-algebras, which are well-defined.
¹⁸For more info and some juicy pushforwards, see regular conditional probability.
12.2 Moment Generating Functions
In your typical undergraduate-level probability class, you probably talked about moment generating
functions, but perhaps not about generating functions in general.¹⁹ We’ll talk about this briefly.
Example 12.3 (Making change). Suppose that you wanted to know how many ways you can make
change for 50 cents, given coins of denominations 1, 5, 10, 25, 50 cents. You could do this by writing
out the possibilities, but this is tedious. You could also use dynamic programming (Knapsack).
One formalism from combinatorics that might help here though, is a generating function. Write
$$p(t) = \frac{1}{(1-t)(1-t^5)(1-t^{10})(1-t^{25})(1-t^{50})}.$$
Then, in the formal power series expansion for p(t), the coefficient of t^k is precisely the number of
ways to make k cents, using these types of coins. Mathematically, this is also $\frac{1}{k!} p^{(k)}(0)$. Here t is
just a “bookkeeping device” for the sequence of values in the coefficients.
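Multiplying out the generating function as a truncated polynomial is exactly the dynamic programming solution; here is a minimal sketch (our own illustration).

target = 50
coeffs = [1] + [0] * target  # start with the constant polynomial 1
for coin in (1, 5, 10, 25, 50):
    # Multiply by 1/(1 - t^coin) = 1 + t^coin + t^(2*coin) + ...,
    # truncating at degree `target`.
    for k in range(coin, target + 1):
        coeffs[k] += coeffs[k - coin]
print(coeffs[target])  # 50 ways to make 50 cents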
With that informative example, which also shows the two-pronged interpretation of generating
functions as either formal power series or convergent functions, we are now ready to define the
notion of a generating function.
Definition 12.4 (Generating function). Given a sequence (a0 , a1 , a2 , . . .), a generating function
for the sequence is a power series containing this sequence in its coefficients. There are two kinds:
the ordinary generating function $\sum_{n \ge 0} a_n t^n$, and the exponential generating function
$\sum_{n \ge 0} a_n t^n / n!$.
Generally we prefer working with the exponential kind in statistics, as it is more likely to converge.
Definition 12.5 (Moment generating function). The moment generating function of a random
variable X is the exponential generating function of the moments. We denote this by
$$M_X(t) = E\left[e^{tX}\right] = E\left[\sum_{n=0}^{\infty} \frac{(tX)^n}{n!}\right] = \sum_{n=0}^{\infty} \frac{E[X^n]\, t^n}{n!}.$$
This function only exists when MX (t) < ∞ for all t in some open neighborhood of 0. Under this
assumption, we are allowed to swap the expectation and sum in the last step, due to dominated
convergence. This is because
$$\left| \sum_{n=0}^{m} \frac{X^n t^n}{n!} \right| \le \sum_{n=0}^{m} \frac{|X|^n |t|^n}{n!} \le e^{|tX|} \le e^{tX} + e^{-tX}.$$
The last expression above has finite expectation, by our assumption, so it meets the necessary
requirements for dominated convergence.
Now that we’ve rigorously defined MGFs, let’s see some useful properties.
¹⁹For the definitive text on generating functions, see Herbert Wilf’s generatingfunctionology.
Proposition 12.6 (MGF of independent sum). If X, Y are r.v.s and X ⊥⊥ Y , then
$$M_{X+Y}(t) = M_X(t)\, M_Y(t).$$
Proposition 12.7 (Uniqueness of MGFs). If X, Y are r.v.s with moment generating functions
and MX (t) = MY (t) on some open neighborhood of the origin, then X ∼ Y .
Example 12.8 (MGF of normal). If X ∼ N (µ, σ²), then $M_X(t) = e^{\mu t + \sigma^2 t^2/2}$, which
exists for all t.
Combined with the previous two propositions, this immediately implies that the sum of independent
normals is also normally distributed.
Example 12.9 (No MGF for log-normal distribution). If Y = e^Z where Z ∼ N (0, 1), then all of
the moments of Y are defined, as
$$E[Y^n] = E[e^{nZ}] = M_Z(n) = e^{n^2/2}.$$
However, the MGF of Y does not exist, since E [e^{tY}] = ∞ for every t > 0.
Definition 12.10 (Joint moment generating function). Given two random variables X and Y , or
in general n variables, the joint MGF of X and Y is defined as the function
$$M_{X,Y}(s, t) = E\left[e^{sX + tY}\right].$$
In addition to the usual moments, the joint MGF also generates joint moments such as the
covariance when X and Y are centered. It also fully describes the joint distribution of (X, Y ),
meaning that X ⊥⊥ Y if and only if their joint MGF factors into the marginal variants.
Finally, one problem with moment generating functions illustrated by an earlier example is
convergence. To fix this issue somewhat, there is a variant of MGFs based on the Fourier transform
rather than the Laplace transform, which is guaranteed to always exist.
Definition 12.11 (Characteristic function). The characteristic function of a random variable X is
$$\varphi_X(t) = E[e^{itX}],$$
which is defined for every t ∈ R and every distribution of X.
This also has uniqueness properties, as we will describe next lecture, but there is also the nice
fact that its value always lies within the unit disk (as it’s really a convex combination of points on
the unit circle).
13 October 15th, 2020
Today, class is starting 15 minutes early due to a last-minute schedule conflict for Joe. Therefore,
we will have a brief self-contained topic (cumulants) for the first 15 minutes, before going back to
the main topic.
13.1 Cumulants
As a brief aside, recall the definition of the characteristic function. This has the nice property that
it always exists, but it is not always smooth. The moment generating function has a redeeming
property that it is infinitely differentiable, i.e., C∞ at 0, and its existence implies that all the
moments of the distribution exist. This makes it very useful in many cases.
Proposition 13.1. If a random variable X has moment generating function MX (t), then the k-th
moment of X exists for all k, and
$$E[X^k] = M_X^{(k)}(0).$$
Proof. This can be justified by dominated convergence, which can be used to do differentiation
under the integral sign.²⁰ Essentially,
$$M_X'(t) = \frac{d}{dt} E\left[e^{Xt}\right] = E\left[X e^{Xt}\right].$$
Doing this repeatedly lets you illustrate that
$$M_X^{(k)}(t) = \frac{d^k}{dt^k} E\left[e^{Xt}\right] = E\left[X^k e^{Xt}\right],$$
and the result follows. In general, the MGF is always infinitely differentiable, arguing from domi-
nated convergence once again (i.e., MGF existence is stronger than existence of all moments), so
this proposition is valid.
Recall that if two random variables X and Y are independent, then the moment generating
function of their sum X + Y is simply the product of their individual moment generating functions.
In other words, MX+Y (t) = MX (t)MY (t). What if we wanted to turn this product back into a
sum?
Definition 13.2 (Cumulant generating function (CGF)). The cumulant generating function of a
random variable X, defined whenever the MGF exists, is defined by
$$K_X(t) = \log M_X(t) = \sum_{r=1}^{\infty} \frac{\kappa_r}{r!} t^r.$$
You can derive formulas for the first few cumulants by using the power series expansion for log(1+x),
since the moment generating function satisfies MX (0) = 1. This power series looks like
$$\log(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} \pm \cdots.$$
²⁰Joe calls this by the acronym DUThIS.
Example 13.3 (Cumulants). The first four cumulants are:
1. κ1 = E [X] = µ, the mean.
2. κ2 = Var [X], the variance.
3. κ3 = E [(X − µ)³], the third central moment.
4. κ4 = E [(X − µ)⁴] − 3 Var [X]². This is the excess kurtosis multiplied by Var [X]².
We’ll come back to cumulants and generating functions later, when we discuss the central limit
theorem. However, we can still state some basic facts. One nice property of cumulants is that they
are easy to compute, partially due to the following fact.
Proposition 13.4 (Cumulants are additive). If X ⊥⊥ Y , then KX+Y (t) = KX (t) + KY (t).
This is a vast generalization of the fact that variances are additive for independent random
variables, as the variance is just the second cumulant. Cumulants also give you a good way to find
central moments of distributions like the Poisson, as their generating functions are simple.
Example 13.5 (Cumulants of the Poisson). The MGF of the Poisson distribution is $e^{\lambda(e^t - 1)}$, and
therefore the CGF is
$$K(t) = \lambda(e^t - 1) = \lambda\left(t + \frac{t^2}{2!} + \frac{t^3}{3!} + \cdots\right).$$
This means that all of the cumulants of the Poisson distribution are equal to λ.
Example 13.6 (Cumulants of the Normal). The MGF of the normal distribution is $e^{\mu t + \sigma^2 t^2/2}$, and
therefore the CGF is
$$K(t) = \mu t + \frac{\sigma^2 t^2}{2}.$$
Therefore, the first two cumulants are nonzero, equal to the mean µ and variance σ². The rest of
the cumulants are all zero.
Note. As an aside, it turns out that the normal distribution is the only nontrivial distribution
that has a finite number of nonzero cumulants.
In particular, any characteristic function satisfies $|\varphi_X(t)| = |E[e^{itX}]| \le E[|e^{itX}|] = 1$,
as the complex magnitude function $|x| = \sqrt{x \bar{x}}$ is convex. This should be consistent with your
intuitions about the Fourier transform, if you are familiar with that operator. We will not compute
many characteristic functions by hand in this course, as the integrals can involve complex analysis
machinery like the residue theorem. Still, it can be useful to see some examples.
Example 13.7 (Characteristic of Cauchy). The characteristic function of the Cauchy distribution,
with PDF f (x) ∝ 1/(1 + x²), is
$$\varphi_X(t) = e^{-|t|}.$$
This should remind you of the PDF of the Laplace distribution. Indeed, the characteristic function
of the Laplace distribution is also a scaled version of the Cauchy PDF; this is a consequence of the
inversion property of the Fourier transform.
Example 13.8 (Characteristic of Normal). The characteristic function of the normal distribution,
where X ∼ N (µ, σ²), is
$$\varphi_X(t) = M_X(it) = e^{i\mu t - \sigma^2 t^2/2}.$$
Note that the above is a slight abuse of notation, as the moment generating function has real
domain, but it works out anyway if we pretend that it’s a Laplace transform and extend to C.
Definition 13.10 (Covariance matrix). Given a random vector Y, we define the covariance matrix
to be the n × n matrix of variances and covariances between pairwise components. In other words,
$$\operatorname{Cov}(\mathbf{Y}, \mathbf{Y}) = E\left[(\mathbf{Y} - E[\mathbf{Y}])(\mathbf{Y} - E[\mathbf{Y}])^T\right] = \begin{pmatrix} \operatorname{Var}[Y_1] & \operatorname{Cov}(Y_1, Y_2) & \cdots & \operatorname{Cov}(Y_1, Y_n) \\ \operatorname{Cov}(Y_2, Y_1) & \operatorname{Var}[Y_2] & \cdots & \operatorname{Cov}(Y_2, Y_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}(Y_n, Y_1) & \operatorname{Cov}(Y_n, Y_2) & \cdots & \operatorname{Var}[Y_n] \end{pmatrix}.$$
Proposition 13.11 (Semidefinite covariance matrix). The covariance matrix is always positive
semidefinite. In other words, it is symmetric, and all eigenvalues are nonnegative.
We can also define Cov(X, Y) similarly. One can show that this is a bilinear operator.
Proposition 13.12 (Affine transformation). If Y is a random vector, A is a fixed matrix, and b is
a fixed vector, then
$$E[A\mathbf{Y} + b] = A\,E[\mathbf{Y}] + b, \qquad \operatorname{Cov}(A\mathbf{Y} + b,\, A\mathbf{Y} + b) = A \operatorname{Cov}(\mathbf{Y}, \mathbf{Y})\, A^T.$$
The proof of this is left as an exercise. Generally, you want to stick to vector and matrix
notation whenever possible when proving facts about random vectors, as it will make arguments
much cleaner (and more natural). You should avoid explicit sums over bases whenever possible.
With all of this machinery for talking about multivariate distributions, it’s time to actually
create some instances. The nicest multivariate distribution is unambiguously the multivariate
normal.21 It turns out that there are many ways you might attempt to construct a multivariate
normal, but all of them will end up generating the same distribution.
Definition 13.14 (Matrix square root). If Σ ⪰ 0, then there exists at least one matrix A such
that Σ = Aᵀ A = A Aᵀ. In general, there can exist many A, but they will all be equivalent up to
multiplication by an orthogonal matrix.
One can construct matrix square roots explicitly by using the Cholesky decomposition algorithm.
Definition 13.15 (Multivariate normal). Given µ ∈ Rⁿ and Σ ⪰ 0 with matrix square root A, the
multivariate normal distribution N (µ, Σ) is the distribution of
X = AZ + µ,
where Z = (Z1 , . . . , Zn ) has i.i.d. N (0, 1) entries.
Note. Observe that since the standard multivariate normal is rotationally symmetric, multiplying
Z by any orthogonal rotation matrix does not affect the joint distribution. This means that the
above definition is unambiguous with respect to the matrix square root, and multivariate normals
are indeed characterized by their covariance matrices.
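Concretely, sampling from N (µ, Σ) via a Cholesky square root looks like the following sketch (our own illustration; µ and Σ are arbitrary choices):

import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])
a = np.linalg.cholesky(cov)            # lower-triangular A with A @ A.T == cov
z = rng.standard_normal((100_000, 2))  # rows of i.i.d. N(0, 1) entries
x = z @ a.T + mu                       # each row is A z + mu
print(np.cov(x, rowvar=False))         # approx cov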
Another interesting fact is that within a multivariate normal distribution, uncorrelated implies
independent. This can be most easily shown through representations.
²¹The second-nicest one is the multinomial distribution, but this isn’t too much different from the binomial.
38
14 October 21st, 2020
Today we will finish discussing some useful properties of the multivariate normal. The midterm is
on Thursday (so no class then), and it will be timed. Generally the problems will require some tricky
thinking, but there should always be a clever solution that does not require tedious calculation.
In particular, tᵀX occurs in the exponent, so the values of the characteristic function are completely determined by the marginal distributions of projections of X.
Example 14.2 (MGF of multivariate normal). Recall that the moment generating function of a univariate N(µ, σ²) normal distribution is

e^{tµ + σ²t²/2}.

Each projection of a multivariate normal is also normal, so we can compute the joint MGF from the univariate MGF.
Example 14.3 (Closure properties of MVN). The multivariate normal distribution has many nice
closure properties, such as:
• If you take a linear combination or shift of multivariate normals, it is also multivariate normal.
It turns out that these closure properties are really useful for applications like Kalman filtering
(Branislav’s favorite!), where we can exactly compute posteriors due to closure.
Here’s a really important fact about multivariate normal distributions.
Proposition 14.4. Within a multivariate normal distribution, consider any two (possibly vector)
projections Y1 and Y2 . Then, if Y1 and Y2 are uncorrelated, they are also independent.
Proof. Consider the multivariate normal random vector Y = (Y1, Y2)ᵀ ∼ N(µ, V). We have the block matrix

V = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix},

where V11 and V22 are the covariance matrices of Y1 and Y2, respectively. Now we can simply observe that V12 = Cov(Y1, Y2) = 0 and V21 = Cov(Y2, Y1) = 0, which is the assumption we made about the vectors being uncorrelated. Then, the matrix is block diagonal and factorizes into a direct sum of invariant subspaces, as desired.
Proposition 14.5. Suppose that Y = (Y1, Y2)ᵀ is multivariate normal with covariance matrix

V = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix}.

Also assume that E[Y1] = µ1 and E[Y2] = µ2. Then,

Y2 | Y1 ∼ N(µ2 + V21 V11⁻¹ (Y1 − µ1), V22 − V21 V11⁻¹ V12).

In particular, the conditional distribution is still normal, its mean is linear with respect to Y1, and its variance is constant! This is related to the formulas from linear regression.
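As a quick numerical sketch (assuming NumPy; the helper function mvn_conditional and its parameter names are hypothetical, not from the book):

```python
import numpy as np

def mvn_conditional(mu1, mu2, V11, V12, V21, V22, y1):
    """Mean and covariance of Y2 | Y1 = y1, per Proposition 14.5."""
    # Solve V11 x = (y1 - mu1) rather than forming V11^{-1} explicitly.
    cond_mean = mu2 + V21 @ np.linalg.solve(V11, y1 - mu1)
    cond_cov = V22 - V21 @ np.linalg.solve(V11, V12)
    return cond_mean, cond_cov

# Tiny example with 1x1 blocks: a correlation-rho bivariate standard normal.
rho = 0.8
m, C = mvn_conditional(np.zeros(1), np.zeros(1),
                       np.eye(1), rho * np.eye(1), rho * np.eye(1), np.eye(1),
                       np.array([1.0]))
print(m, C)  # mean rho * y1 = 0.8, variance 1 - rho^2 = 0.36
```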
We’re going to find T in a slightly different way. For each color i ∈ {1, . . . , n}, let Ti be the time
when we complete the pair of socks with color i, so that T = min(T1 , . . . , Tn ). Note that each
individual sock is independent, so all of the Ti are i.i.d. distributed
√ according to the maximum of
two independent uniforms. Recall that √ this is ∼
√ Beta(2,p 1) ∼ Unif.
We can then write that T = min( U1 , . . .p , Un ) = min(U1 , . . . , Un ), where the Ui are jointly
distributed as i.i.d. uniform. Therefore, T ∼ Beta(1, n). We can verify by LOTUS that
E[T] = ∫₀¹ x^{1/2}(1 − x)^{n−1} / B(1, n) dx
     = B(3/2, n) / B(1, n)
     = Γ(3/2)Γ(n + 1) / (Γ(1)Γ(n + 3/2))
     = n! / ((3/2)(5/2) · · · ((2n + 1)/2))
     = 4^n (n!)² / ((2n)!(2n + 1)).
Therefore, matching up our two expressions for E[T], we get E[T] = 4^n (n!)² / ((2n)!(2n + 1)).
Note that the above problem does not appear to be related to continuous distributions at all (it is very combinatorial), yet we found a very natural solution by using a continuous embedding!
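A quick Monte Carlo sanity check of the closed form (a sketch assuming NumPy; n = 5 is an arbitrary choice):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 5

# T = sqrt(min of n i.i.d. uniforms), i.e., T ~ sqrt(Beta(1, n)).
samples = np.sqrt(rng.uniform(size=(1_000_000, n)).min(axis=1))
print(samples.mean())

# Closed form: 4^n (n!)^2 / ((2n)! (2n + 1)).
print(4**n * math.factorial(n)**2 / (math.factorial(2 * n) * (2 * n + 1)))
```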
15 October 27th, 2020
Today marks a little over the halfway point of the course, as we’re done with the midterm. Grading
will take some time due to logistics. We’ll begin Chapter 9 today, on inequalities, and Chapters 10–
14 are all posted on Canvas.24 Overall, the chapters are fairly short, but they’re also analysis-heavy
and packed with information.
In many problems, approximation can be useful. For example, we can make a “linear” or “quadratic” approximation, but these asymptotic bounds don't tell us exactly how close we are to the answer. Convergence in the n → ∞ case isn't immediately applicable when discussing what a distribution looks like when n = 5, or even n = 30.
We are going to develop a few inequalities, which we can apply to make statements like “p is within ε of the true value.” Let's get started with one of the most famous inequalities in math.[25]

[24] This includes things like exponential families and natural exponential families, convergence theorems, the central limit theorem, and martingales. If we had more time, Joe mentions that he would have also liked to discuss Markov chains. Those are covered in Stat 212 and Stat 171.
[25] For more on this, Joe recommends J. Michael Steele's book The Cauchy-Schwarz Master Class.

Proposition 15.1 (Cauchy–Schwarz). For any random variables X and Y with finite second moments,

E[XY]² ≤ E[X²] E[Y²].

Proof. This proof is particularly nice, though slightly algebra-heavy. The key idea is that variances are sums of squares, so for any value of β ∈ ℝ,

E[(Y − βX)²] ≥ 0.

This is an infinite family of inequalities. We can take the derivative to find the value of β that gives us the strongest bound. This is a neat problem-solving idea because we added complexity with this
additional variable, but it actually makes the solution easier. In any case, the optimal value of β
is given by the projection of Y onto X in the Hilbert space, which is
β = ⟨X, Y⟩/⟨X, X⟩ = E[XY]/E[X²].
Substituting this into the inequality and expanding yields

E[Y²] + β² E[X²] ≥ 2β E[XY],
E[X²] E[Y²] / E[XY] + E[XY] ≥ 2 E[XY],
E[X²] E[Y²] ≥ E[XY]².

The result follows after assuming, without loss of generality, that X and Y are nonnegative.
In the probability setting, this means that we can bound E[XY], a two-variable expectation, by the product of the second moments of the marginal distributions. Marginal distributions are often easier to calculate.
Corollary 15.1.1 (Covariance inequality). For any random variables X and Y, we have

|Corr(X, Y)| = |Cov(X, Y)| / √(Var[X] Var[Y]) ≤ 1.

Proof. This corollary is almost equivalent to Cauchy-Schwarz, but it admits a particularly elegant direct proof. Assume without loss of generality that X and Y are standardized to have mean 0 and variance 1, and let ρ = Corr(X, Y). Since variances are nonnegative,

0 ≤ Var[X ± Y] = Var[X] + Var[Y] ± 2 Cov(X, Y) = 2 ± 2ρ,

so −1 ≤ ρ ≤ 1.

Proposition 15.2 (Markov's inequality). For any random variable Y and any a > 0,

P(Y ≥ a) ≤ E[|Y|] / a.
Proof. This is an extremely crude bound. Observe that

a · I_{Y ≥ a} ≤ |Y|.

This is obviously true: a · I_{Y ≥ a} is simply equal to a when Y ≥ a and to 0 when Y < a. Now we can take expectations on both sides, and the result immediately follows.
Although Markov’s inequality seems really obvious, it’s the starting point for pretty much all
concentration bounds, as it makes very few assumptions about the random variable Y . If we
additionally assume that Y has a second moment, then we can extend our bound slightly.
Proposition 15.3 (Chebyshev's inequality). If Y is a random variable with finite variance, then

P(|Y − E[Y]| ≥ c) ≤ Var[Y] / c².
Proof. Apply Markov’s inequality to the random variable X = (Y − E [Y ])2 .
Note that even though Chebyshev’s inequality is a trivial extension of Markov, you often get
much better tail bounds using it, as they are quadratic in the deviation. This idea of applying an
increasing function (such as x 7→ x2 ) to both sides of Markov’s inequality can be used to get even
better tail bounds in general, such as the celebrated Chernoff bound.26
Proposition 15.4 (Chernoff bound). Let Y be a nonnegative random variable, and let t > 0 be a constant. Then,

P(Y ≥ a) = P(e^{tY} ≥ e^{ta}) ≤ E[e^{tY}] / e^{ta},

where the last step follows by Markov's inequality.
Notice the coincidental appearance of the moment generating function M_Y(t) = E[e^{tY}] above.
This means that for Chernoff bounds to be applied, you essentially need all of the moments to be
defined. Intuitively, it is the limit case of many concentration inequalities based on moments, as
it makes a strong assumption of the MGF existing. The Chernoff bound is also intuitively useful
because it lets you optimize for any value of t by taking the derivative.
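Here is a small numerical sketch of that optimization (assuming NumPy; the standard normal and a = 3 are our example choices): for Y ∼ N(0, 1), M_Y(t) = e^{t²/2}, and minimizing the bound over t gives t = a and the bound e^{−a²/2}.

```python
import numpy as np

a = 3.0
ts = np.linspace(0.01, 10, 2000)

# Chernoff bound exp(t^2/2 - t*a) for a standard normal, as a function of t.
bounds = np.exp(ts**2 / 2 - ts * a)

print(ts[bounds.argmin()])   # optimal t, approximately a = 3
print(bounds.min())          # approximately exp(-a^2 / 2)
print(np.exp(-a**2 / 2))
```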
Proof. Jensen’s inequality is interesting because it does not rely on smoothness properties of g,
and it also works in any number of dimensions. A proof is given in the book using the supporting
hyperplane theorem for convex sets.
Definition 15.6 (p-norms of random variables). The Lᵖ norm of a random variable X, for some fixed p ≥ 1, is defined by

‖X‖_p = E[|X|ᵖ]^{1/p}.

This is a valid norm for two reasons. First, if ‖X‖_p = 0, then X is almost surely zero. Second, the norm satisfies the triangle inequality, which is a fact called Minkowski's inequality.
Now let’s ask the question of how the r-norm compares to the s-norm, when 1 ≤ r < s. The
following result actually holds for any values of r and s, including negative values and zero (in the
limit, which is called the geometric mean).
[26] Named after Herman Chernoff, who is faculty emeritus at Harvard.
Proposition 15.7 (Monotonicity of norms). If 1 ≤ r < s, then

‖X‖_r ≤ ‖X‖_s.

Proof. This follows from Jensen's inequality on the convex function x ↦ x^{s/r}. Assume without loss of generality that X is nonnegative. Then X^r is also nonnegative, so

E[(X^r)^{s/r}] ≥ (E[X^r])^{s/r} ⟹ E[X^s]^{1/s} ≥ E[X^r]^{1/r}.
Finally, we write down one of the most famous and classical inequalities, which is a special case of the power-mean inequality mentioned above!

Proposition 15.8 (Weighted AM-GM). For nonnegative reals x1, . . . , xn and weights w1, . . . , wn ≥ 0 with w1 + · · · + wn = 1,

w1 x1 + · · · + wn xn ≥ x1^{w1} · · · xn^{wn}.

The left-hand side is called the (weighted) arithmetic mean, and the right-hand side is called the (weighted) geometric mean. Equality holds if and only if x1 = x2 = · · · = xn.

Proof. Assume without loss of generality that the xi are distinct. Let W be a random variable supported on {x1, . . . , xn}, with P(W = xi) = wi for each i. By Jensen's inequality on the concave function log,

∑_{i=1}^n wi log xi = E[log W] ≤ log E[W] = log(∑_{i=1}^n wi xi).

Exponentiating both sides gives the result.
Corollary 15.8.1 (Young's inequality). In the special n = 2 case of weighted AM-GM, we have

a^p b^q ≤ pa + qb,

where a, b, p, q ≥ 0 and p + q = 1.
16 October 29th, 2020
Today we continue our discussion of inequalities and norms.
Hopefully that was an inspiring, short proof of a classic inequality in analysis. Hoping to outdo
himself, Joe will now attempt to present an even more inspiring proof of another inequality.
Proposition 16.2 (Nonnegative covariance27 ). If g and h are non-decreasing functions, then
Cov(g(X), h(X)) ≥ 0.
Proof. The key idea is to choose i.i.d. X1 , X2 ∼ X. Then, observe that
(g(X1 ) − g(X2 ))(h(X1 ) − h(X2 )) ≥ 0.
What happens when we take the expectation of the above expression? Well,

E[(g(X1) − g(X2))(h(X1) − h(X2))]
= E[g(X1)h(X1)] − E[g(X1)h(X2)] − E[g(X2)h(X1)] + E[g(X2)h(X2)]
= 2 E[g(X)h(X)] − 2 E[g(X)] E[h(X)]
= 2 Cov(g(X), h(X)).

The result follows from the nonnegativity of that expression.
[27] This is a special case of the FKG inequality in correlation theory. Amusingly enough, it's also a continuous version of Chebyshev's sum inequality from olympiad mathematics — even having essentially the same proof!
16.2 Convergence and the Borel-Cantelli Lemma
Recall in an earlier lecture that we introduced the notions of almost-sure convergence (Defini-
tion 11.2) and convergence in probability (Definition 11.1). The first is stronger than the second.
Let’s rigorously introduce one more useful notion of convergence, the weakest so far.
Definition 16.3 (Convergence in distribution). We say Xn → X in distribution if F_{Xn}(x) → F_X(x) as n → ∞, at every point x where F_X is continuous.
In some sense, convergence in distribution is much weaker than the other two, as it only talks about the marginal distributions of the random variables. Meanwhile, convergence in probability is only slightly weaker than almost sure convergence.
Example 16.4 (Convergence in distribution but not in probability). Consider an infinite sequence of i.i.d. U, U1, U2, . . . ∼ Unif. Then clearly Un → U in distribution, as all of the marginal distributions are the same (uniform). However, Un does not converge to U in probability, since for any 0 < ε < 1/2,

P(|Un − U| > ε) ≥ 1 − 2ε.
Example 16.5 (Convergence in probability but not almost surely). Let Xn ∼ Bern(1/n), and assume that all of them are independent. Then X1, X2, . . . → 0 in probability, since for any 0 < ε < 1,

P(|Xn| > ε) = P(Xn = 1) = 1/n → 0.
How can we show that the above example does not converge almost surely? It’s true that in
the sequence X1 , X2 , . . ., the 1 values get rarer and rarer as n → ∞. If there is a finite number of
1’s, then we have almost sure convergence, but if there are an infinite number of 1’s, then we do
not have convergence.
With this motivation in mind, we will now deliver a two-part lemma that very elegantly describes the above as a dichotomy — useful both for proving and disproving almost sure convergence.

Proposition 16.6 (Borel–Cantelli). Let A1, A2, . . . be events, and let p = P(lim sup_{n→∞} An) be the probability that infinitely many of the An occur.
1. (Borel–Cantelli lemma).[28] If ∑_{n=1}^∞ P(An) < ∞, then p = 0.
2. (Second Borel–Cantelli lemma). If the An are independent and ∑_{n=1}^∞ P(An) = ∞, then p = 1.
The second lemma immediately shows why Example 16.5 does not have almost sure convergence, as the harmonic series diverges. However, if we had changed it slightly to Xn ∼ Bern(1/n^{1.001}) instead, it would converge almost surely by the first Borel–Cantelli lemma.
[28] This first version of the lemma also holds when the An are not necessarily independent.
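To make the dichotomy concrete, here is a finite-horizon simulation sketch (assuming NumPy; we swap in Bern(1/n²) as the convergent example, since the partial sums of 1/n^{1.001} converge far too slowly to distinguish at any practical horizon):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000_000
n = np.arange(1, N + 1)

# X_n ~ Bern(1/n): the probabilities sum to infinity, so infinitely many 1's occur.
print((rng.random(N) < 1 / n).sum())      # roughly log(N), and grows with N

# X_n ~ Bern(1/n^2): the probabilities sum to pi^2/6, so only finitely many 1's occur.
print((rng.random(N) < 1 / n**2).sum())   # typically 1 or 2, regardless of N
```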
17 November 3rd, 2020
Last time, we were talking about convergence. Let’s pick up where we left off.
Exercise 17.1. Construct an infinite, random sequence of coin tosses such that for any n consecutive coin tosses, the probability of all n tosses coming up heads is 1/(n + 1).
Now let’s prove the Borel-Cantelli lemma. Joe mentions that this is interesting not just for
completeness, but also because it illustrates many useful ideas in analysis — short yet instructive.
Proof of Proposition 16.6. We’ll prove each of the two parts separately.
1. Assume that ∑_{n=1}^∞ P(An) < ∞. Then, by the definition of lim sup and a union bound,

P(lim sup_{n→∞} An) = P(⋂_{n≥1} ⋃_{m≥n} Am) ≤ P(⋃_{m≥n} Am) ≤ ∑_{m≥n} P(Am).
This is the tail of the series, but by the definition of convergence of an infinite series, its partial
sums must converge. Therefore, the tail of the series P (A1 ) + P (A2 ) + · · · must converge to
zero, so we conclude.
2. Our strategy in this case will be slightly different. Instead of trying to directly prove that
something will happen infinitely often, we’re going to show that the complement (event hap-
pens finitely often) has zero probability. In other words, we want
P(⋃_{n≥1} ⋂_{m≥n} Am^C) = 0.
A useful fact from measure theory is that the countable union of measure-zero sets also has
measure zero. Therefore, it’s equivalent to show that the inner intersection has measure zero
for any n. Since the Am are independent, we have
P(⋂_{m≥n} Am^C) = ∏_{m=n}^∞ P(Am^C) = ∏_{m=n}^∞ (1 − P(Am)) ≤ e^{−∑_{m=n}^∞ P(Am)} = 0,

where the inequality uses 1 − x ≤ e^{−x}, and the final expression is zero because the series ∑ P(Am) diverges. This completes the proof.
The next topic is an example of a zero-one law similar to the Borel-Cantelli lemma.
Proposition 17.1 (Kolmogorov zero-one law). Let A1 , A2 , . . . be independent events. Recall that
we can generate a σ-algebra from a collection of sets (i.e., events) by taking the smallest σ-algebra
containing those events. Then, the “tail field” of the Ai is
A = ⋂_{n=1}^∞ σ(An, An+1, An+2, . . .).
You can think of the tail field as the set of events that depend only on the limiting tail of the event sequence. Then, for any A ∈ A, we have P(A) ∈ {0, 1}.
Proof. Omitted, but the key idea in this proof is very “cute” — it is to show that A ⊥⊥ A.
This generalizes part of the Borel–Cantelli lemma, since lim sup_{n→∞} An is an example of something that depends only on the limiting values of the An, so it is in the tail field.
This is zero precisely when Xn → X converges almost surely. By our first inequality above, we
have proven the proposition.
Next on our menu is a theorem that Joe calls both “beautiful and useful,” which lets you go
from convergence in distribution back to convergence in probability. However, there has to be a
catch, since convergence in distribution is obviously weaker. We will need to move to a different
probability space.
Proposition 17.3 (Skorokhod’s representation theorem). Suppose that Xn → X in distribution.
Then, there exists a new probability space (Ω∗ , F ∗ , P ∗ ), with random variables Xn∗ , X ∗ : Ω∗ → R,
such that Xn∗ ∼ Xn , X ∗ ∼ X, and Xn∗ → X ∗ almost surely.
Proof. The proof is omitted because of hard technical details. However, in principle, the key intuition is that you can apply the probability integral transform (PIT) to all of the Xn variables, which couples them to the same uniform. This fixes the issue where the Xn may be totally independent of each other in a sequence that converges in distribution.
Skorokhod’s theorem is somewhat of a useful hammer. One neat application is that you can
really easily prove the “in distribution” case of the continuous mapping theorem, by reducing it to
the “almost sure” case using Skorokhod.
18 November 5th, 2020
Today we will start talking about asymptotics. In other words, how do distributions change in
some limit where their parameters go to infinity? Some of these theorems are quite beautiful,29
but we will specifically focus on facts that have practical applications.
• Taylor’s theorem: Taylor approximations are also called the Delta method in statistics.
For the rest of the semester, we will focus on a few high-level goals. One topic is natural exponential
families, which unify a lot of distributions that we’ve seen this semester.30 This includes the special
NEF-QVF families. We’ll also talk about martingales, which are useful for concentration bounds
and for modeling financial markets.
Proposition 18.2 (Slutsky's theorem). Assume that we have two sequences of random variables X1, X2, . . . and Y1, Y2, . . ., not necessarily independent, such that Xn → X and Yn → c in distribution, where c is a constant. Then,
• Xn + Yn converges in distribution to X + c,
• Xn − Yn converges in distribution to X − c,
Proof. This is a somewhat technical fact from analysis, so we omit the proof.
Proposition 18.3 (Delta method). Assume that you have a sequence of random variables T1, T2, . . . such that √n(Tn − θ0) → Z in distribution, where θ0 is some constant. If g is a real function that is continuously differentiable (C¹) at θ0, then

√n(g(Tn) − g(θ0)) → g′(θ0) Z

in distribution. In particular, the special case Z ∼ N(0, 1) is particularly nice because of connections with the central limit theorem.
[29] Joe cites the law of the iterated logarithm as an example.
[30] Around this time, Carl Morris walked into our class and said hello. He is the “originator” of the NEF.
Proof. The proof uses the mean value theorem, which tells us that g(Tn) − g(θ0) = g′(θ̃n)(Tn − θ0) for some θ̃n between Tn and θ0; one then shows that g′(θ̃n) → g′(θ0) in probability and applies Slutsky's theorem.
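As a quick simulation sketch (assuming NumPy; the choice of Tn as the sample mean of Exp(1) draws and g(x) = x² is ours): here √n(Tn − 1) → N(0, 1), so the delta method predicts √n(g(Tn) − g(1)) → N(0, g′(1)²) = N(0, 4).

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 10_000, 100_000

# T_n = mean of n i.i.d. Exp(1) draws; equivalently Gamma(n, 1) / n.
Tn = rng.gamma(shape=n, scale=1.0, size=reps) / n

# theta_0 = 1, g(x) = x^2, g'(1) = 2, so sqrt(n)(T_n^2 - 1) is approximately N(0, 4).
W = np.sqrt(n) * (Tn**2 - 1.0)
print(W.mean(), W.var())  # approximately 0 and 4
```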
Definition 18.4 (Natural exponential family). A natural exponential family with natural parameter η is a family of distributions with CDFs Fη taking the form

dFη(y) = e^{ηy − ψ(η)} dF0(y).

We give the condition that F0(y) does not depend on η; in particular, it's just the η = 0 case.
Here, we can see that ψ(η) is a normalizing factor for the rest of the density. The rough idea is
that we just shift probabilities by weighting with pointwise multiplication by some exponential of
the value. In particular,
∫ dFη(y) = ∫ e^{ηy − ψ(η)} dF0(y) = 1 ⟹ e^{ψ(η)} = ∫ e^{ηy} dF0(y) = E_{Y∼F0}[e^{ηY}].
The last step above follows from LOTUS. In particular, this means that ψ(t) is just the cumulant
generating function of Y ∼ F0 . It’s also easy to show that the cumulant generating function of Fη
for any η is ψ(t + η) − ψ(η). For this reason, we call ψ the cumulant function.
Example 18.5. The binomial distribution Bin(n, p) is a natural exponential family for any fixed value of n, where we vary p. In this case, the natural parameter is given by the logit function logit(p) = log(p/(1 − p)).
Example 18.6. The normal distribution with unit variance, N (µ, 1), is a natural exponential
family with natural parameter µ and cumulant function µ2 /2. Notice how this aligns with the
cumulant generating function of the standard normal N (0, 1), which is t2 /2.
Another useful fact, which falls out of the cumulant function, is that if we let ψ′(η) = µ and ψ″(η) = σ², then Fη ∼ [µ, σ²], i.e., Fη has mean µ and variance σ². Note that since variances are positive (except in the degenerate case), this tells us that ψ′(η) = µ is a strictly increasing function of η, so we can invert it.
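As a concrete numeric sketch (assuming NumPy; the base choice F0 = Pois(1) and η = 0.7 are ours): tilting Pois(1) by η gives Pois(e^η), with ψ(η) = e^η − 1 and ψ′(η) = ψ″(η) = e^η, so the mean and variance of Fη are both e^η.

```python
import math
import numpy as np

eta = 0.7
y = np.arange(60)
p0 = np.array([math.exp(-1) / math.factorial(k) for k in y])  # Pois(1) pmf

# Exponential tilting: p_eta(y) is proportional to e^{eta y} p0(y),
# and the normalizing constant is e^{psi(eta)}.
w = np.exp(eta * y) * p0
psi = math.log(w.sum())
p_eta = w / w.sum()

mean = (y * p_eta).sum()
var = ((y - mean) ** 2 * p_eta).sum()
print(psi, math.exp(eta) - 1)       # cumulant function: e^eta - 1
print(mean, var, math.exp(eta))     # mean = variance = e^eta, so V(mu) = mu here
```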
Definition 18.7 (Variance function). The variance function of a natural exponential family Fη is V(µ) = σ². In other words, for any η we have V(ψ′(η)) = ψ″(η).
Definition 18.8 (NEF-QVF). An NEF-QVF is a natural exponential family with a quadratic variance function, i.e., of the form V(µ) = v0 + v1µ + v2µ².
It is a theorem from Carl Morris that there are only six NEF-QVF distribution families.
19 November 10th, 2020
Today we will continue talking about NEF-QVFs and asymptotics (delta method), in preparation
for the law of large numbers and central limit theorem.
• Weak laws of large numbers deal with convergence of the sample mean in probability.
• Strong laws of large numbers deal with convergence of the sample mean almost surely.
This terminology is unique to the law of large numbers, as the central limit theorem only applies to convergence in distribution. Generally, we will see that weak laws of large numbers have a slightly weaker result, but also require fewer assumptions.[31]
[31] For practical purposes, Joe suggests using simulation to find the smallest value of n for which the distribution of the mean becomes approximately normal, i.e., X̄n ≈ N(µ, σ²/n). This is not useful for rigorous proofs, though.
Proposition 19.3 (Weak LLN, basic version). Suppose that X̄n is the mean of n i.i.d. random variables with mean µ and finite variance σ². Then, as n → ∞, X̄n → µ in probability.

Proof. By Chebyshev's inequality, for any ε > 0,

P(|X̄n − µ| ≥ ε) ≤ Var[X̄n]/ε² = σ²/(nε²).
This goes to zero as n → ∞, so we’re done.
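Here is a small simulation sketch of the statement (assuming NumPy; Unif(0, 1) samples, so µ = 1/2, are an arbitrary example choice):

```python
import numpy as np

rng = np.random.default_rng(4)

# Empirical P(|X_bar_n - mu| >= 0.01) for Unif(0, 1) samples (mu = 1/2).
for n in [100, 1_000, 10_000]:
    means = rng.uniform(size=(2_000, n)).mean(axis=1)
    print(n, (np.abs(means - 0.5) >= 0.01).mean())  # tends to zero as n grows
```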
The proof above was really simple. Let's see how we can relax the assumptions a bit, by being a little more sophisticated in our argument. In particular, what if the random samples were not independent? One strategy to deal with this is to consider

Var[X̄n] = (1/n²) ∑_{i,j} Cov(Xi, Xj).
If the above value goes to zero as n → ∞, then we have an equivalent to the weak law of large
numbers. However, sometimes this strategy does not work, and we need to try something else.
For example, consider the characteristic function, which always exists and can be approximated by
derivatives. It turns out that with a linear approximation (first term) of the characteristic function,
we get LLN, and with a quadratic approximation, we get CLT.
In the case when random variables are guaranteed to be independent, we can prove incredibly
strong LLNs and CLTs due to the multiplicativity of the characteristic function. However, the
dependent case is harder, and we might give a couple of examples, later on, where we relax the
independence assumptions.
Proposition 19.4 (Strong LLN). Assume that X1, X2, . . . are i.i.d., with E[Xj] = µ and a finite first absolute moment, i.e., E[|Xj|] < ∞. Then X̄n → µ almost surely.
The above version of the strong LLN is hard to prove and fairly technical. This is because
it only assumes first moments. For now, we will prove an easier version with a different set of
assumptions — more moments, but also not necessarily i.i.d. this time.
Proposition 19.5 (Strong LLN, fourth moments). Assume that the Xj are independent with mean zero, and that E[Xj⁴] ≤ b < ∞ for some bound b. Then X̄n → 0 almost surely.
Proof. By the Borel–Cantelli lemma (Proposition 16.6), it suffices to check that for any ε > 0,

∑_{n=1}^∞ P(|X̄n| > ε) < ∞.

However, we have P(|X̄n| > ε) = P(X̄n⁴ > ε⁴), so applying Markov's inequality tells us that

∑_{n=1}^∞ P(|X̄n| > ε) ≤ (1/ε⁴) ∑_{n=1}^∞ E[X̄n⁴].
This is fair enough, but how do we get rid of the fourth moment of the mean? One way to deal with this is by brute-forcing through the multinomial theorem on X̄n = (1/n)(X1 + · · · + Xn). However, a nicer approach is to use cumulants, which are additive. Note that

E[X̄n⁴] = κ4(X̄n) + 3 Var[X̄n]² = (1/n⁴)(κ4(X1) + · · · + κ4(Xn) + 3(Var[X1] + · · · + Var[Xn])²).
Now we just need to bound κ4(Xj) and Var[Xj] for each j. This turns out to be very simple. First, the fourth cumulant is at most the fourth moment (smaller by three times the squared variance), so κ4(Xj) ≤ E[Xj⁴] ≤ b. Also, by Jensen's inequality, Var[Xj]² ≤ E[Xj⁴] ≤ b, so Var[Xj] ≤ √b. Therefore, the last summation above is bounded by

(1/ε⁴) ∑_{n=1}^∞ E[X̄n⁴] ≤ (1/ε⁴) ∑_{n=1}^∞ (b/n³ + 3b/n²).

This summation converges because sums of 1/n^s are finite for s > 1, so we are done.
Note. Even when b is not bounded by a constant, we can still use the argument above as long as
the summation converges, i.e., when b = o(n). For example, b = n0.999 would work just as well.
It’s instructive to ask why we need finite fourth moments in the law of large numbers above,
versus two or six or some other number. This is because only having finite variances gives you
linear falloff similar to the above, and the harmonic series diverges, so we end up on the wrong side
of Borel-Cantelli for almost sure convergence.
20 November 12th, 2020
Today we will discuss the central limit theorem.
1. Cumulants: Suppose that X1, X2, . . . are i.i.d. with mean 0 and variance 1. Then, the r-th cumulant of the sum of these variables, divided by √n, is

κ_r((X1 + · · · + Xn)/√n) = (n/n^{r/2}) κ_r(X1).

This just follows from the additivity of cumulants. Notice that because of the exponent, this fraction approaches 0 as n → ∞ for any r > 2. Therefore, we should expect the limiting distribution to have all cumulants beyond the first two equal to zero, which makes it a normal distribution!
2. Entropy: The normal distribution maximizes entropy for a given mean and variance. When
you add independent random variables together, their entropy increases, which is a statistical
analogue of the second law of thermodynamics. Andrew Barron has a paper in The Annals
of Probability where he proves CLT using an entropy-type argument.
3. Stability: Let Sn = X1 + · · · + Xn, and suppose that Sn/√n converges in distribution to some distribution Z. Why must Z be normal? Well, note that in convergence, we can replace n by 2n, so

S_{2n}/√(2n) = (X1 + · · · + Xn)/√(2n) + (X_{n+1} + · · · + X_{2n})/√(2n) → Z in distribution.

However, we can also write the second expression above as converging in distribution to Z1/√2 + Z2/√2, where Z1 and Z2 are i.i.d. ∼ Z. Since a sequence can't converge to two different distributions, these must be the same, so

Z ∼ (Z1 + Z2)/√2.
This is a stable law, and the only stable law with finite variance is the normal distribution.
Actually, although we only promised to give some intuition above, we can also formalize the third
point to produce a rigorous proof as well. First, a quick lemma.
Lemma 20.1 (Taylor approximation for characteristic function). If X is a random variable with finite m-th moment E[|X|^m] < ∞, and X has characteristic function ϕ, then as t → 0,

ϕ(t) = ∑_{k=0}^m (it)^k E[X^k]/k! + o(|t|^m).
Proof. This follows almost immediately from the Peano form of the Taylor series remainder for ex .
The only slight hiccup is that we need to apply dominated convergence, due to the expected value.
This is also why we need to assume finite m-th moment.
Proposition 20.2 (Stable law with finite variance). Let Z1, Z2 be i.i.d. random variables with mean 0 and variance 1. If Z1 + Z2 ∼ √2 Z1, then Z1 ∼ N(0, 1).

Proof. We'll turn the condition into a functional equation of the characteristic function. Let ϕ be the characteristic function of Z1. Then, the characteristic function of (Z1 + Z2)/√2 ∼ Z1 is

E[e^{it(Z1 + Z2)/√2}] = ϕ(t/√2)² = ϕ(t).
By iterating this functional equation, we get

ϕ(t) = ϕ(t/√2)² = ϕ(t/(√2)²)^{2²} = · · · = ϕ(t/2^{n/2})^{2^n}.

Since κ1(Z1) = 0 and κ2(Z1) = 1, we have by Lemma 20.1 that

lim_{n→∞} ϕ(t/2^{n/2})^{2^n} = lim_{n→∞} (1 − t²/(2 · 2^n) + o(t²/2^n))^{2^n} = e^{−t²/2}.
Therefore, by uniqueness of characteristic functions (Fourier transform), we conclude.
Therefore, from the intuition at the start of this section, this stable law immediately implies
the basic result of the central limit theorem itself.
Proposition 20.3 (Classical CLT (Lindeberg–Lévy)). If X1, X2, . . . is a sequence of i.i.d. random variables with mean µ and variance σ² < ∞, then as n → ∞, we have in distribution that

((X1 + X2 + · · · + Xn) − nµ)/√n → N(0, σ²).
Definition 20.4 (UAN). The uniform asymptotic negligibility condition is that none of the n terms in Sn has a large asymptotic variance in comparison to the total variance s²n. In general, UAN holds when

u_n := max_{1≤j≤n} σ_j / s_n → 0.

We can interpret u²n as the largest fraction of the variance contributed by any single term Xj of the entire sum Sn.
It turns out that the UAN is almost a sufficient condition to prove a central limit theorem.32
As a consequence, we will primarily focus on the setting in which UAN holds, which allows us to
prove two central limit theorems that turn out to be equivalent. The first is due to Morris and
Blitzstein, while the second is very famous.
Proposition 20.5 (Fundamental bound). Define the “fundamental bound” FBn by

FB_n = ∑_{j=1}^n E[(X_j/s_n)² min(1, |X_j|/s_n)].
Proposition 20.8 (Fourth cumulant CLT). If the UAN condition holds and |κ4 (Zn )| → 0, then
Zn → N (0, 1).
Proof. This turns out to be equivalent to the r = 4 case of Lyapunov's CLT. Observe that

Lyap_{4,n} = ∑_{j=1}^n κ4(Xj/sn) + 3 ∑_{j=1}^n Var[Xj/sn]² = κ4(Zn) + 3 ∑_{j=1}^n (σj/sn)⁴.

If we assume the UAN condition, then the latter term definitely tends to zero, so we are done.
[32] In fact, the UAN condition is also almost necessary, in the sense that if any term contributes an asymptotically nontrivial portion to the variance, then the CLT only holds when that term is already normally distributed itself.
The proofs of all of these CLTs are given in the textbook. They are all pretty technical arguments
involving analysis on the characteristic function. Anyway, we can now do an illustrative example.
Example 20.9. Assume that Y1, . . . , Yn are i.i.d. ∼ [0, 1] (mean 0, variance 1), and let Sn = c1Y1 + · · · + cnYn. Then, the UAN condition can be written as

max_{1≤j≤n} c²j / ∑_{j=1}^n c²j → 0,

and it turns out that this condition alone is enough to prove that Sn/√(c²1 + · · · + c²n) converges in distribution to N(0, 1). Let's see how to do this with the κ4 method. Observe that

|κ4(Zn)| = (∑_{j=1}^n c⁴j) |κ4(Y1)| / (∑_{j=1}^n c²j)².

To bound this, we use a really neat trick. Note that c⁴j = c²j · c²j. So, we can write

∑_{j=1}^n c⁴j / (∑_{j=1}^n c²j)² ≤ (max_{1≤j≤n} c²j)(∑_{j=1}^n c²j) / (∑_{j=1}^n c²j)² = max_{1≤j≤n} c²j / ∑_{j=1}^n c²j → 0,

which is simply the UAN condition. Therefore, assuming that the fourth cumulant of the Yi exists, we are done by Proposition 20.8.
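A simulation sketch of this example (assuming NumPy; the weights cj = √j and Rademacher ±1 summands are our choices, both of which satisfy the conditions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 2_000, 10_000

c = np.sqrt(np.arange(1, n + 1))             # max c_j^2 / sum c_j^2 = 2/(n+1) -> 0
Y = rng.choice([-1.0, 1.0], size=(reps, n))  # i.i.d. with mean 0, variance 1

Z = (Y * c).sum(axis=1) / np.sqrt((c**2).sum())
print(Z.mean(), Z.var())   # approximately 0 and 1
print((Z <= 1.0).mean())   # approximately Phi(1) = 0.8413
```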
21 November 17th, 2020
First, some administrative information. The midterm grades will be posted tonight, and information
about the final project will be added shortly. We’re getting close to the end of the course, and today
we’ll continue discussing the central limit theorem. The last topic after this will be martingales.
Future courses to consider include Stat 171 and Stat 212.
Therefore, we’ve shown that CLT holds. The only remaining things to check are that E [Rn ] =
log n + O(1) and Var [Rn ] = s2n = log n + O(1), so we conclude by Slutsky’s theorem.
[33] This leads to some interesting behavior. For example, what's the expected number of variables before the first value greater than X1, i.e., the first record? This actually turns out to be ∞!
21.2 The Replacement Method
We’re going to now discuss an interesting method used by Lindeberg and Lyapunov in the past.
We will use this strategy to prove an i.i.d. version of the central limit theorem, but it can also be
used more generally to prove Lindeberg’s CLT.
Let X1, X2, . . . be i.i.d. random variables with mean 0 and variance 1, and let Sn = X1 + · · · + Xn. Also, suppose that we have i.i.d. standard normals Z1, Z2, . . . ∼ N(0, 1). The idea of the replacement method is to simply “install” Zj by replacing Xj in the sum, swapping in the normals one by one.[34] It turns out that each of these steps has a negligible effect, both individually and in total, on the final distribution, which implies X1 + · · · + Xn ∼ Z1 + · · · + Zn ∼ N(0, n) asymptotically.
Let’s go over each step of this argument in detail. We initially start by letting T0 = Sn , and
define Tj = Z1 + · · · + Zj + Xj+1 + · · · + Xn for each 1 ≤ j ≤ n. To show convergence in distribution, we will use an equivalent definition of convergence in terms of expectations: Xn → X in distribution if and only if E[g(Xn)] → E[g(X)] for all bounded continuous functions g.
In this equivalence, using indicator test functions recovers the original definition of convergence in distribution, but indicators create big issues at discontinuities. Therefore, we prefer to apply smoother test functions, which are continuous and simpler to use in an argument. Motivated by this, we'll show for all C³ functions g such that g, g′, g″, and g‴ are bounded that

E[g(Sn/√n) − g(Tn/√n)] → 0.
Each of the differences in this telescoping sum can be thought of as replacing Xj by Zj. Anyway, using a third-order Taylor series expansion with error term, we can show that each of these differences is bounded by O(n^{−3/2}); since there are n of them, the entire sum is O(n^{−1/2}) and vanishes as n → ∞.
[34] Joe compares this argument to the ship of Theseus thought experiment.
22 November 19th, 2020
Today we cover some central limit theorems on sums of dependent random variables, and we begin
our discussion of martingales.
The motivation for the m-dependence definition is that it can be really useful for time series
data. For example, when m = 0, we get ordinary independence. For larger values of m, we can
imagine a “horizon” of observations in the past, which can influence our future observations.
Proposition 22.3 (m-dependent CLT). Let (Xn)n be a stationary, m-dependent sequence of random variables, such that E[Xj] = µ and Var[Xj] = σ² < ∞ for all j. Then,

√n(X̄n − µ) → N(0, ν) in distribution,

where ν = σ² + 2 ∑_{k=1}^m Cov(X1, X1+k).
Proof. The full proof of this theorem is very technical. However, we will provide an outline of the
key idea, which involves an argument where we split up the series into two parts. This is called a
“big block-little block” strategy.
Choose some value k > 2m, and divide the sequence of variables X1, . . . , Xn into alternating blocks of length k − m (big blocks) and m (little blocks). The idea is that, because of m-dependence, the sums of the big blocks are i.i.d. random variables. Meanwhile, as k grows larger, the little blocks contribute a negligible amount to the total sum of the series. Therefore, after many technical details, one can show convergence by piggybacking off of the standard CLT (Proposition 20.3) for the big blocks, and concentration-bounding the little blocks.
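A quick simulation sketch of the theorem (assuming NumPy; the 1-dependent moving-average sequence Xn = (Un + Un+1)/2 with Un i.i.d. Unif(0, 1) is our example): here µ = 1/2, and ν = Var(X1) + 2 Cov(X1, X2) = 1/24 + 2/48 = 1/12.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 2_000, 5_000

U = rng.uniform(size=(reps, n + 1))
X = (U[:, :-1] + U[:, 1:]) / 2   # stationary, 1-dependent sequence

W = np.sqrt(n) * (X.mean(axis=1) - 0.5)
print(W.var())                   # approximately nu = 1/12 = 0.0833
```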
Another common case of dependent random variables is when we sample n elements without replacement from a finite population of size N. Since we don't have replacement, the samples Y1, . . . , Yn must be dependent. We can prove finite population central limit theorems in this case, showing that Ȳ is approximately normal as N, n → ∞ while maintaining n ≪ N.
One interesting duality in the finite population case is between Ȳn and ȲN−n. The distributions of these two means, for samples of size n and N − n, have the exact same shape (just reversed). This leads to the interesting observation that as you increase the sample size in a finite population, the normal approximation gets more accurate, but once you increase it too far, the normal approximation once again decreases in accuracy, until finally you get a full census when n = N.
Other CLTs of interest are Markov chain CLTs, which are useful for proving facts about MCMC
algorithms, and martingale CLTs, which we may cover at the end of the course if time permits.
Anyway, that concludes our unit on central limit theorems!
22.2 Martingales
Now we discuss discrete-time martingales, which are a useful model of stochastic processes that
maintain a certain “fairness” property.
Definition 22.4 (Discrete-time martingale). We say that X1 , X2 , X3 , . . . is a martingale with
respect to another sequence Y1 , Y2 , Y3 , . . . if for all n,
1. (Regularity) E [|Xn |] < ∞,
2. (Measurability) Xn ∈ σ(Y1 , . . . , Yn ),
3. (Fairness) E [Xn+1 | Y1 , . . . , Yn ] = Xn .
You can more generally think of (Xn )n as a martingale with respect to the filtration F1 ⊆ F2 ⊆ · · · ,
where Fn = σ(Y1 , . . . , Yn ). This is a more general definition, but it’s also more abstract.
Note. The etymology of the word “martingale” is complicated. In the context of gambling, the martingale was a risky betting strategy where you double your bet after each loss. This would theoretically lead to a +$1 payoff if you had infinite money, but in practice, you will eventually run out of money after enough consecutive losses.
Note. There are other interesting models of stochastic processes like Brownian motion, which we won't cover in this course. Brownian motion is a special kind of continuous-time martingale that is Markov with multivariate normal increments. This gives it particularly nice properties, but it also has some weird properties like being continuous everywhere yet differentiable nowhere.
Many stochastic processes will be both Markov chains and martingales. However, in general
the Markov property (memorylessness) and martingale property are different, as martingales are
allowed to depend on all previous events.
Definition 22.5 (Submartingale and supermartingale). We call a sequence of random variables a
submartingale if the third property above is replaced by E [Xn+1 | Y1 , . . . , Yn ] ≥ Xn . On the other
hand, it is a supermartingale if E [Xn+1 | Y1 , . . . , Yn ] ≤ Xn .
Oftentimes we will just write “Xn is a martingale” without specifying the sequence Yn . The
following proposition justifies why this is unambiguous.
Proposition 22.6. If (Xn )n is a martingale with respect to (Yn )n , then (Xn )n is also a martingale
with respect to (Xn )n .
Proof. We can simply check the properties. The first two properties are trivial, while the third
property can be verified by using Adam’s law, since σ(X1 , . . . , Xn ) ⊆ σ(Y1 , . . . , Yn ).
This is a relatively simple definition, and we’ll see how it can be used to prove really nice facts
about various processes, with machinery like Doob’s optional stopping theorem, Azuma’s inequality,
Kolmogorov’s inequality, and others.
Example 22.7 (Random walks are martingales). If X1, X2, . . . is a sequence of independent random variables with mean 0, then Sn = X1 + · · · + Xn is a martingale. Similarly, if E[Xj] ≥ 0, then Sn is a submartingale, and if E[Xj] ≤ 0, then Sn is a supermartingale.
23 November 24th, 2020
Today we continue to discuss martingales and their applications. The two key theorems in this
area are the martingale convergence theorem and the optional stopping theorem.
For example, the fraction of white balls in a Pólya urn started with a white balls and b black balls is a martingale Mn, and its almost sure limit satisfies M∞ ∼ Beta(a, b).
Example 23.6 (Branching process). Suppose that you have a process that spreads to many in-
dividuals through a tree structure, such as a viral disease or a family tree. We can write down
the number of members in the process at time t as a stochastic process. Although this is not a
martingale (it’s increasing), it becomes a martingale after an appropriate rescaling.
Example 23.7 (Doob martingale). Suppose that we have a random variable Y with E [|Y |] < ∞.
Then, Zt = E [Y | Ft ] is a martingale with respect to filtration {F0 , F1 , F2 , . . .}.
Proposition 23.8 (Martingale convergence theorem). Suppose that (Mn)n is a submartingale with

sup_n E[|Mn|] ≤ c

for some constant c < ∞. Then, there exists a random variable M∞ such that Mn → M∞ almost surely, and moreover, E[|M∞|] < ∞.
Proof. This is technical but interesting, and we’ll try to cover it in the next lecture.
The intuition behind this theorem is that submartingales are similar to a monotone increasing
sequence in their convergence properties. Bounded monotone sequences must converge. Although
submartingales are not strictly increasing because they have bumpiness, these bumps are not enough
to significantly impact the convergence properties.
The next theorem is very useful for generalizing the intuition that martingales do not drift in expectation, to the case where our stopping time may be unbounded. If Mn is a martingale, then it's easy to show that E[Mt] = E[M0] for any fixed positive integer t. However, what if our stopping time is a random variable instead?

Proposition 23.9 (Optional stopping theorem). Let (Mn)n be a martingale and let T be a stopping time.[35] If either (1) T is almost surely bounded, i.e., T ≤ n almost surely for some fixed n, or (2) T is almost surely finite and |Mn| ≤ c almost surely for all n, then E[MT] = E[M0].
Example 23.10. To show that at least one of these conditions is necessary, consider the SSRW
Sn with stopping time T = inf{n ∈ N | Sn = 1}. This stopping time is almost surely finite by the
martingale convergence theorem, so we have E [ST ] = 1, but S0 = 0. An interesting conclusion is
that by the contrapositive of the optional stopping theorem, E [T ] = ∞.
[35] In other words, you can't use “psychic powers” to see into the future when deciding whether to stop at time t.
24 December 1st, 2020
This is the last week of classes. Today, we continue discussing martingales, and we’ll prove the
optional stopping theorem. This will give us some reusable tools that we can use more generally in
martingale problems.
Proof of Proposition 23.9. First consider the bounded time condition. If T ≤ n almost surely, then
we can write MT in terms of a telescoping series with indicators,
MT = M0 + ∑_{j=1}^T (Mj − Mj−1) = M0 + ∑_{j=1}^n (Mj − Mj−1) I_{T≥j},
j=1 j=1
where the second equality holds almost surely. If we show that the summation above has expectation
zero, then we’re done. To do this, we use Adam’s law to get
We can factor I_{T≥j} out of the conditional expectation because T is a stopping time, so the event {T ≥ j} = {T ≤ j − 1}^C lies in the sigma-algebra generated by events up to time j − 1, and the last equality is just the definition of a martingale. This finishes the proof for the bounded-time case.
What about the second condition, where |Mn | ≤ c almost surely for all n? For this, we will use
a technique called truncation. Let Tn = min(T, n) for all n, so Tn is clearly a bounded stopping
time. Furthermore, as n → ∞, we have Tn → T almost surely. The idea of truncation is that
we have the result E [MTn ] = E [M0 ] in the truncated case, and we use a convergence theorem to
deduce the same result in the general case. In this case, the bounded convergence theorem yields

E[MT] = lim_{n→∞} E[M_{Tn}] = E[M0].
Note that with minor modifications to the above proof, we can get variants of the optional
stopping theorem for submartingales and supermartingales, where E [MT ] ≥ E [M0 ] and E [MT ] ≤
E [M0 ] respectively.
Example 24.1 (Gambler’s ruin). Consider a simple symmetric random walk on Z, starting at 0,
with absorbing barriers at −a and b. The position St at time t is a bounded martingale. Therefore,
by the optional stopping theorem, the probability of being absorbed at a is b/(a + b), while the
probability of being absorbed at b is a/(a + b).
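A simulation sketch checking those absorption probabilities (assuming NumPy; a = 3 and b = 7 are arbitrary example barriers):

```python
import numpy as np

rng = np.random.default_rng(7)
a, b, reps = 3, 7, 20_000

hits_b = 0
for _ in range(reps):
    s = 0
    while -a < s < b:                     # walk until absorbed at -a or b
        s += 1 if rng.random() < 0.5 else -1
    hits_b += (s == b)

print(hits_b / reps, a / (a + b))         # both approximately 0.3
```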
Example 24.2 (Asymmetric random walk). Consider the same problem as the previous example, but we instead have an unfair game where we win with probability p ≠ 1/2 and lose with probability q = 1 − p. Then, (q/p)^{St} is a martingale.
Example 24.3 (“Say red”). Consider a deck of cards in random order, with 26 red cards and 26 black cards. A dealer is flipping over cards, one at a time, and after each step they give you the option to stop. When you stop, the next card in the deck is revealed, and you win if that card is red. It turns out that no strategy for this game achieves a success probability different from 50%. This is because the fraction of red cards Mn left in the deck after n draws is a martingale, and it is also your success probability when stopping.
Joe mentions that the above example has shown up in many job interviews. Indeed, I remember Paul Christiano giving us this exercise as a brain teaser at SPARC. Another slick argument is to use the exchangeability of the cards, which says that choosing the top card of the deck is completely interchangeable with choosing the last card of the deck, and therefore none of your actions matter!
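A simulation sketch of this (assuming NumPy; the greedy rule of stopping as soon as the remaining fraction of red exceeds 1/2 is our example strategy, and it still wins exactly half the time):

```python
import numpy as np

rng = np.random.default_rng(8)
reps, wins = 20_000, 0

for _ in range(reps):
    deck = rng.permutation(np.array([1] * 26 + [0] * 26))  # 1 = red
    for i in range(52):
        red_left = 26 - deck[:i].sum()                     # reds not yet flipped
        if red_left / (52 - i) > 0.5 or i == 51:           # stop (forced at the end)
            wins += deck[i]                                # next card is revealed
            break

print(wins / reps)  # approximately 0.5, regardless of strategy
```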
25 December 3rd, 2020
Today is the last lecture of the course, before reading period! We will talk about a selection of
topics, as requested by the students.
When η = 0, this means that ∫ h⁺(y) f0(y) dy = ∫ h⁻(y) f0(y) dy = c, so we can divide both sides of the above equation by c to turn this into a probability distribution. Then, the above integrals are precisely the moment generating functions of two distributions (according to LOTUS), so

h⁺(y) f0(y) = h⁻(y) f0(y)

almost everywhere, by the uniqueness of moment generating functions. Therefore, h⁺(y) = h⁻(y) almost everywhere on the support of y, where f0(y) ≠ 0, and therefore h(y) = 0.
as n → ∞, for any fixed value of ε. However, note that since |Xj| ≤ c, the indicator random variable must be zero for all sufficiently large values of n. Therefore, the Lindeberg condition converges to zero for any choice of ε, as desired.
This is a generalization of the so-called coupon collector's problem, where b = 1 and the expected amount of time is simply equal to the n-th harmonic number Hn = ∑_{j=1}^n 1/j. However, while the basic coupon collector's problem is easy, this problem is significantly more difficult. The really interesting thing is that although the problem is ostensibly discrete, adding the continuous-time Poisson process greatly simplifies the solution, as this distribution has nice properties.
Probability is a vast subject, and there are plenty of courses like Stat 212 that go further. Joe mentions that his book is quite different from the standard courses on the material, emphasizing probabilistic thinking and not just measure theory. That's it for the semester!
References
[BM20] J.K. Blitzstein and C. Morris. Probability for Statistical Science. Unpublished draft, 2020.