
Advanced Probability and Statistics

MAFS5020
HKUST

Kani Chen (Instructor)



PART I

PROBABILITY THEORY

Chapter 0. Review of classical probability calculations through examples.

We first review some basic concepts about probability space through examples. Then we summarize
the structure of probability space and present axioms and theory.

Example 0.1. De Méré’s Problem. (for the purpose of reviewing the discrete probability space.)
The Chevalier de Méré was a French nobleman and gambler of the 17th century. He was puzzled
about whether two events have equal probability: at least one ace turns up in four rolls of a die,
and at least one double-ace turns up in 24 rolls of two dice. His reasoning:
In one roll of a die there is a 1/6 chance of getting an ace, so in 4 rolls there is a 4 × (1/6) = 2/3
chance of getting at least one ace.
In one roll of two dice there is a 1/36 chance of getting a double-ace, so in 24 rolls there is a
24 × (1/36) = 2/3 chance of getting at least one double-ace.
De Méré turned to Blaise Pascal (1623-1662) and Pierre de Fermat (1601-1665) for help. And the
two mathematicians/physicists gave the right answer:
P (getting at least one ace in 4 rolls of a die) = 1 − (1 − 1/6)^4 ≈ 0.518.
Four rolls make a favorable bet; three rolls do not.
P (getting at least one double-ace in 24 rolls of two dice) = 1 − (1 − 1/36)^24 ≈ 0.491.
Twenty-five rolls make a favorable bet; 24 rolls still make an unfavorable bet.
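The calculation can also be checked numerically. The following is a small Monte Carlo sketch (an added illustration, not part of the original notes); the function names are made up for this example.

import random

def at_least_one_ace(n_rolls=4):
    # at least one ace in n_rolls rolls of a single die
    return any(random.randint(1, 6) == 1 for _ in range(n_rolls))

def at_least_one_double_ace(n_rolls=24):
    # at least one double-ace in n_rolls rolls of two dice
    return any(random.randint(1, 6) == 1 and random.randint(1, 6) == 1 for _ in range(n_rolls))

def estimate(event, n_trials=100_000):
    return sum(event() for _ in range(n_trials)) / n_trials

print(estimate(at_least_one_ace))         # close to 1 - (5/6)**4  ≈ 0.518
print(estimate(at_least_one_double_ace))  # close to 1 - (35/36)**24 ≈ 0.491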

Historical remark: (Cardano's mistake) Probability theory has an infamous birthplace: the gambling
room. The earliest publications on probability date back to "Liber de Ludo Aleae" (the book on
games of chance) by Gerolamo Cardano (1501-1576), an Italian mathematician/physician/
astrologer/gambler of the time. Cardano made an important discovery of the product law of
independent events, and it is believed that the word "probability" was first coined and used by Cardano. Even
though he made several serious mistakes that seem elementary nowadays, he is considered a
pioneer who first systematically computed probabilities of events. Pascal, Fermat, Bernoulli and
de Moivre are among many other prominent developers of probability theory. The axioms of
probability were first formally and rigorously established by Kolmogorov in the 20th century.
Exercise 0.1'. Galileo's problem. Galileo (1564-1642), the famous physicist and astronomer, was
also involved in a calculation of probabilities of a similar nature. Italian gamblers used to bet on
the total number of spots in a roll of three dice. The question: is the chance of a total of 9 dots the
same as that of a total of 10 dots? There are altogether 6 combinations totaling 9 dots (126, 135, 144,
234, 225, 333) and 6 combinations totaling 10 dots (145, 136, 226, 235, 244, 334). This can give
the false impression that the two chances are equal. Galileo gave the correct answer: 25/216 for
one and 27/216 for the other. The key point here is to lay out all 6^3 = 216 outcomes/elements of
the probability space and realize that each of these 216 outcomes/elements has the same chance
1/216. Then the chance of an event is the sum of the probabilities of the outcomes in the event.
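A brute-force enumeration (an added sketch, not in the original notes) confirms the counts 25 and 27 out of 216:

from itertools import product

counts = {9: 0, 10: 0}
for dice in product(range(1, 7), repeat=3):    # all 6^3 = 216 equally likely outcomes
    total = sum(dice)
    if total in counts:
        counts[total] += 1
print(counts)                                  # {9: 25, 10: 27}
print(counts[9] / 216, counts[10] / 216)       # 25/216 vs 27/216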

Example 0.2. The St. Petersburg Paradox. (for the purpose of reviewing the concept of Expectation)
A gambler pays an entry fee of M dollars to play the following game: a fair coin is tossed repeatedly until the first
head occurs, and the gambler wins 2^{n−1} dollars, where n is the total number of tosses. Question:
what is the "fair" amount of M?
n is a random number with P(n = k) = 2^{−k} for k = 1, 2, .... Therefore the "Expected Winning" is

E(2^{n−1}) = Σ_{k=1}^∞ 2^{k−1} × (1/2^k) = Σ_{k=1}^∞ 1/2 = ∞.

Notice that here we have used the expectation of a function of a random variable. It appears that a
"fair", but indeed naive, M should be ∞. However, by common sense, this game, despite its infinite
expected payoff, should not be worth an infinite amount.
Daniel Bernoulli (1700-1782), a Dutch-born Swiss mathematician, provided one solution in 1738. In
his own words: The determination of the value of an item must not be based on the price, but rather
on the utility it yields. There is no doubt that a gain of one thousand ducats is more significant to
the pauper than to a rich man though both gain the same amount. Using a utility function, e.g., as
suggested by Bernoulli himself, the logarithmic function u(x) = log(x) (known as log utility), the
expected utility of the payoff (for simplicity assuming an initial wealth of zero) becomes finite:

E(U) = Σ_{k=1}^∞ u(2^{k−1}) P(n = k) = Σ_{k=1}^∞ log(2^{k−1})/2^k = log(2) = u(2) < ∞.

(This particular utility function suggests that the game is as useful as 2 dollars.)
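As a numerical check (an added sketch, not part of the original notes), the expected-utility series converges quickly to log 2:

import math

expected_utility = sum(math.log(2 ** (k - 1)) / 2 ** k for k in range(1, 60))
print(expected_utility, math.log(2))   # both approximately 0.6931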


Before Bernoulli's publication in 1738, another Swiss mathematician, Gabriel Cramer, had already found
parts of this idea (also motivated by the St. Petersburg Paradox), stating that "the mathematicians
estimate money in proportion to its quantity, and men of good sense in proportion to the usage that
they may make of it."

Example 0.3. The dice game called "craps". (This is to review conditional probability.)
Two dice are rolled repeatedly; let the total number of dots on the n-th roll be Zn. If Z1 is 2, 3 or 12,
it is an immediate loss of the game. If Z1 is 7 or 11, it is an immediate win. Otherwise, continue
rolling the two dice until either Z1 occurs again, meaning a win, or 7 occurs, meaning a loss. What is
the chance of winning this game?
Solution. Write
P(Win) = Σ_{k=2}^{12} P(Win and Z1 = k) = Σ_{k=2}^{12} P(Win | Z1 = k) P(Z1 = k)
= P(Z1 = 7) + P(Z1 = 11) + Σ_{k=4,5,6,8,9,10} P(Win | Z1 = k) P(Z1 = k)
= 6/36 + 2/36 + Σ_{k=4,5,6,8,9,10} P(Ak | Z1 = k) P(Z1 = k)
= 2/9 + Σ_{k=4,5,6,8,9,10} P(Ak) P(Z1 = k),

where Ak is the event that, starting from the second roll, k dots occur before 7 dots. Now,
P(Ak) = Σ_{j=2}^{12} P(Ak ∩ {Z2 = j}) = Σ_{j=2}^{12} P(Ak | Z2 = j) P(Z2 = j)
= P(Z2 = k) + Σ_{j=2, j≠k, j≠7}^{12} P(starting from the 3rd roll, k occurs before 7) P(Z2 = j)
= P(Z2 = k) + P(Ak)(1 − P(Z2 = k) − P(Z2 = 7)).

As a result,
P(Ak) = P(Z1 = k)/[P(Z1 = k) + P(Z1 = 7)].
And,
P(Win) = 2/9 + Σ_{k=4,5,6,8,9,10} P(Z1 = k)^2/[P(Z1 = k) + P(Z1 = 7)] = 244/495 ≈ 0.492929.
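The answer can be verified exactly with a short computation (an added sketch, not from the original notes), using the formula for P(Ak) derived above:

from fractions import Fraction

# distribution of the total of two fair dice: P(Z = k), k = 2, ..., 12
p = {k: Fraction(6 - abs(k - 7), 36) for k in range(2, 13)}

win = p[7] + p[11]                        # immediate win on the first roll
for k in (4, 5, 6, 8, 9, 10):
    # P(A_k) = P(Z = k) / (P(Z = k) + P(Z = 7)): the point k recurs before a 7
    win += p[k] * p[k] / (p[k] + p[7])
print(win, float(win))                    # 244/495 ≈ 0.492929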

Example 0.4. Jailer’s reasoning. (Bayes Probabilities)


Three men, A, B and C, are in jail; one is to be executed and the other two are to be freed. C, being
anxious, asked the jailer to tell him which of A and B would be freed. The jailer, pondering for a
while, answered "for your own interest, I will not tell you, because, if I do, your chance of being
executed would rise from 1/3 to 1/2." What is wrong with the jailer's reasoning?
Solution. Let AF (BF) be the event that the jailer says A (B) is to be freed. Let AE, BE and CE be the
events that A, B and C, respectively, are to be executed. Then P(CE) = 1/3, but, by the Bayes formula,

P(CE | AF) = P(AF | CE) P(CE) / [P(AF | AE) P(AE) + P(AF | BE) P(BE) + P(AF | CE) P(CE)]
= (0.5 × 1/3) / (0 × 1/3 + 1 × 1/3 + 0.5 × 1/3)
= 1/3
= P(CE).

Likewise P (CE|BF ) = P (CE). So the “rise of probability” is false. 
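A small simulation sketch (added for illustration, not in the original notes) reproduces P(CE | AF) = 1/3: generate who is executed, let the jailer name a freed prisoner among A and B, and condition on the jailer naming A.

import random

def trial():
    executed = random.choice("ABC")
    if executed == "A":
        named = "B"                      # the jailer must name B as freed
    elif executed == "B":
        named = "A"                      # the jailer must name A as freed
    else:
        named = random.choice("AB")      # C executed: A or B named with chance 1/2 each
    return executed, named

count_af, count_ce_and_af = 0, 0
for _ in range(200_000):
    executed, named = trial()
    if named == "A":                     # condition on the event AF
        count_af += 1
        count_ce_and_af += (executed == "C")
print(count_ce_and_af / count_af)        # close to 1/3, not 1/2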

Example 0.5. Buffon’s needle (Continuous random variables).


Randomly drop a needle of 1 cm onto a surface with many parallel straight lines that are 1 cm apart.
What is the chance that the needle touches one of the lines?
Solution. Let x be the distance from the center of the needle to the nearest line. Let θ be the
smaller angle of the needle with the nearest line. Then the needle crosses a line if and only if
x/sin(θ) ≤ 0.5.
It follows from the randomness of the drop that θ and x are independent, following the uniform
distributions on [0, π/2] and [0, 1/2] respectively. Therefore,
P(x/sin(θ) ≤ 0.5) = (4/π) ∫_0^{π/2} ∫_0^{sin(θ)/2} dx dθ = (2/π) ∫_0^{π/2} sin(θ) dθ = 2/π.

The chance is 2/π. 
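A Monte Carlo sketch (an added illustration, not part of the original notes) that drops random needles and estimates the crossing probability, and hence π:

import math
import random

def crosses():
    x = random.uniform(0, 0.5)                 # distance from the needle's center to the nearest line
    theta = random.uniform(0, math.pi / 2)     # acute angle between the needle and the lines
    return x <= 0.5 * math.sin(theta)

n = 1_000_000
p_hat = sum(crosses() for _ in range(n)) / n
print(p_hat, 2 / math.pi)                      # both about 0.6366
print(2 / p_hat)                               # a crude Monte Carlo estimate of π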



Chapter 1. σ-algebra, measure, probability space and random variables.

This section lays the necessary rigorous foundation for probability as a mathematical theory. It
begins with sets, relations among sets, measurement of sets and functions defined on the sets.

Example 1.1. (A prototype of probability space.) Drop a needle blindly on the interval
[0, 1]. The needle hits interval [a, b], a sub-interval of [0, 1] with chance b − a. Suppose A is any
subset of [0, 1]. What’s the chance or length of A?
Here, we might interpret the largest set Ω = [0, 1] as the “universe”. Note that not all subsets
are “nice” in the sense that their volume/length can be properly assigned. So we first focus our
attention on certain class of “nice” subsets.
To begin with, the "basic" subsets are all the sub-intervals of [0, 1], which may be denoted as [a, b],
with 0 ≤ a ≤ b ≤ 1. Denote by B the collection of all subsets of [0, 1] that are generated by the
basic sets after finite set operations. B is called an algebra of Ω.
It can be proved that any set in B is a finite union of disjoint intervals (closed, open or half-closed).
Still, B is not rich enough. For example, it does not contain the set of all rational numbers. More
importantly, the limits of sets in B are often not in B. This is a serious restriction for mathematical
analysis.
Let A be the collection of all subsets of [0, 1], which are generated by all “basic” sets after countably
many set operations. A is called Borel σ-algebra of Ω. Sets in A are called Borel sets. Limits of
sets in A are still in A. (Ω, A) is a measurable space.
Borel measure: any set A in A can be assigned a volume, denoted as µ(A), such that
(i). µ([a, b]) = b − a.
(ii). µ(A) = lim µ(An ) for any sequence of Borel sets An ↑ A.
Lebesgue measure (1901): Completion of Borel σ-algebra by adding all subsets of Borel measure
0 sets, denoted as F . Sets with measure 0 are called null sets.
Why should Borel measure or Lebesgue measure exist in general?
Caratheodory’s extension theorem: extending a (σ-finite) measure on an algebra B to the σ-algebra
A = σ(B).
Ω = [0, 1] (the universe).
B: an algebra (finite set operations) generated by subintervals.
A: the Borel σ-algebra, is a σ-algebra, generated by subintervals.
F : completion of A, a σ-algebra, generated by A and null sets.

(Ω, B, µ) does not form a probability space,


(Ω, A, µ) forms a probability space.
(Ω, F , µ) forms a probability space.

Sets and set operations:


Consider Ω as the "universe" (beyond which is nothing). Write Ω = {ω}, where ω denotes a member of
the set, called an element. Let A and B be two subsets of Ω, called "events".
The set operations are:
intersection: ∩, A ∩ B: both A and B (happens).
union: ∪, A ∪ B: either A or B (happens).
complement: Ac = Ω \ A: everything except for A, or A does not happen.

minus: A \ B = A ∩ B c : A but not B.


An elementary theorem about set operation is
DeMorgan's identity:

(∪_{j=1}^∞ Aj)^c = ∩_{j=1}^∞ Aj^c,   (∩_{j=1}^∞ Aj)^c = ∪_{j=1}^∞ Aj^c.

In particular, (A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c.


Remark. Intersection can be generated by complement and union; and union can be generated by
complement and intersection.

Relation: A ⊂ B, if ω ∈ A ensures ω ∈ B.
A sequence of sets {An : n ≥ 1} is called increasing (decreasing) if An ⊂ An+1 (An ⊃ An+1 .)
A = B if and only if A ⊂ B and B ⊂ A.

Indicator functions. (A very useful tool to translate set operation into numerical operation)
The relation and operation of sets are equivalent to the indication set functions. For any subset
A ⊂ Ω, define its indicator function as
1_A(ω) = 1 if ω ∈ A, and 1_A(ω) = 0 otherwise.
The indicator function is a function defined on Ω.
Set operations vs. function operations:

A⊂B ⇐⇒ 1A ≤ 1B .
A∩B ⇐⇒ 1A × 1B = 1A∩B = min(1A , 1B ).
Ac = Ω \ A ⇐⇒ 1 − 1A = 1Ac .
A∪B ⇐⇒ 1A∪B = 1A + 1B , if A ∩ B = ∅
⇐⇒ 1A∪B = max(1A , 1B ).
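The dictionary between set operations and indicator arithmetic can be checked mechanically. The sketch below (an addition, not in the original notes) encodes subsets of a small finite universe as 0/1 arrays; the particular sets A and B are arbitrary choices for illustration.

import numpy as np

omega = np.arange(10)                      # a small finite universe {0, ..., 9}
A = (omega < 6).astype(int)                # indicator of A = {0, ..., 5}
B = (omega % 2 == 0).astype(int)           # indicator of B = even numbers

assert np.array_equal(A * B, np.minimum(A, B))                    # 1_{A∩B} = 1_A 1_B = min(1_A, 1_B)
assert np.array_equal(np.maximum(A, B), 1 - (1 - A) * (1 - B))    # 1_{A∪B} = max(1_A, 1_B)
assert np.array_equal(1 - np.maximum(A, B), (1 - A) * (1 - B))    # DeMorgan: 1_{(A∪B)^c} = 1_{A^c} 1_{B^c}
print("indicator identities verified")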

Set limits.
There are two limits of sets: upper limit and low limit.

lim sup An ≡ ∩_{n=1}^∞ ∪_{k=n}^∞ Ak = {An occurs infinitely often},
1_{lim sup An} = lim sup 1_{An}.

ω ∈ lim sup An if and only if ω belongs to infinitely many An .


Lower limit.

lim inf An ≡ ∪_{n=1}^∞ ∩_{k=n}^∞ Ak = {An occurs for all but a finite number of n},
1_{lim inf An} = lim inf 1_{An}.

ω ∈ lim inf An if and only if ω belongs to all but finitely many An .


We say the set limit of A1 , A2 , ... exists if their lower limit is the same as the upper limit.

Algebra and σ-algebra



A is a non-empty collection (set) of subsets of Ω.


Definition. A is called an algebra if
(i). Ac ∈ A if A ∈ A;
(ii). A ∪ B ∈ A if A, B ∈ A.
A is called a σ-algebra if (ii) is strengthened to
(iii). ∪_{n=1}^∞ An ∈ A if An ∈ A for n ≥ 1.

An algebra is closed for (finite) set operations. Ω ∈ A and ∅ ∈ A.


A σ-algebra is closed for countable operations.
(Ω, A) is called a measurable space, if A is a σ-algebra of Ω.

Measure, measure space and probability space.


A, containing ∅, is a non-empty collection (set) of subsets of Ω. µ is a nonnegative set function on
A.
µ is called a measure, if
(i). µ(∅) = 0.
(ii). µ(A) = Σ_{n=1}^∞ µ(An) if A = ∪_{n=1}^∞ An, where A, A1, A2, ... are all in A and A1, A2, ... are disjoint.
(Ω, A, µ) is called a measure space, if µ is a measure on A and A is a σ-algebra of Ω.
(Ω, A, P ) is called a probability space if (Ω, A, P ) is a measure space and P (Ω) = 1.
For probability space (Ω, A, P ), Ω is called sample space, every A in A is an event, and P (A) is the
probability of the event, the chance that it happens.

Random variable (r.v.).


Loosely speaking, given a probability space (Ω, F , P ), a random variable (r.v.) X is defined as
a real-valued function of Ω, satisfying certain measurability condition. Loosely speaking, viewing
X = X(ω) as a mapping from Ω to R, the real line, then X −1 (B) must be in F for all Borel sets B.
(Borel sets on real line are the σ-algebra generated by intervals, i.e., the sets generated by countable
operations on intervals).
A random variable X defined on a probability space (Ω, A, P ) is a function defined on Ω, such that
X −1 (B) ∈ A for every interval B on [−∞, ∞], where X −1 (B) = {ω : X(ω) ∈ B}. (We need to
identify its probability.)
X −1 (B) is called the inverse image of B.
X = X(·) can be viewed as a map or transformation from (Ω, A) to (R, B), where R = [−∞, ∞]
and B is the σ-algebra generated by the intervals in R.
X is a measurable map/transformation since X −1 (B) ∈ A for every B ∈ B (DIY.)
Because A is a σ-algebra, the upper and lower limits of Xn are r.v.s if the Xn are r.v.s, and the
algebraic operations +, −, ×, / of r.v.s are still r.v.s.

Measurable map and random vectors.


f (·) is called a measurable map/transformation/function from a measurable space (Ω, A) to another
measurable space (S, S), if f −1 (B) ∈ A for every B ∈ S. i.e. {w : f (w) ∈ B} ∈ A.
X is called a random vector of p dimension if it is a measurable map from a probability space
(Ω, A, P ) to (Rp , B p ), where B p is the Borel σ-algebra in p dimensional real space, Rp = [−∞, ∞]p .
Proposition 1.1 If X = (X1 , ..., Xp ) is a random vector of p dimension on a probability space
(Ω, A, P ), and f (·) is measurable function from (Rp , B p ) to (R, B), then f (X) is a random variable.

Proof. For any Borel set B ∈ B,

{ω : f (X(ω)) ∈ B} = {ω : X(ω) ∈ f −1 (B)} ∈ A

since f −1 (B) ∈ B p . 
Proposition 1.2 If X1, X2, ... are r.v.s, so are

inf_n Xn,  sup_n Xn,  lim inf_n Xn  and  lim sup_n Xn.

Proof. Let the probability space be (Ω, A, P). For any x,

{ω : inf_n Xn(ω) ≥ x} = ∩_n {ω : Xn(ω) ≥ x} ∈ A;

{ω : sup_n Xn(ω) ≤ x} = ∩_n {ω : Xn(ω) ≤ x} ∈ A;

{lim inf_n Xn > x} = ∪_n {inf_{k≥n} Xk > x} ∈ A;

{lim sup_n Xn < x} = ∪_n {sup_{k≥n} Xk < x} ∈ A.

Therefore, inf_n Xn, sup_n Xn, lim inf_n Xn and lim sup_n Xn are r.v.s.
Proposition 1.3 Suppose X is a map from a measurable space (Ω, A) to another measurable
space (S, S). If X −1 (C) ∈ A for every C ∈ C and S = σ(C). Then, X is a measurable map, i.e.,
X −1 (S) ∈ A for every S ∈ S. In particular, when (S, S) = ([−∞, ∞], B), X −1 ([−∞, x]) ∈ A for
every x is enough to ensure X is a r.v..
Proof. Note that σ(C), the σ-algebra generated by C, is defined mathematically as the smallest
σ-algebra containing C.
Set B ∗ = {B ∈ S : X −1 (B) ∈ A}.
We first show B ∗ is a σ-algebra. Observe that
(i). for any B ∈ B ∗ , X −1 (B) ∈ A and, therefore, X −1 (B c ) = (X −1 (B))c ∈ A;
(ii). for any Bn ∈ B ∗ , X −1 (Bn ) ∈ A and X −1 (∪n Bn ) = ∪n X −1 (Bn ) ∈ A.
Consequently, B∗ is a σ-algebra. Since C ⊂ B∗ ⊂ S and S = σ(C) is the smallest σ-algebra containing C, it follows that B∗ = S.


DIY Exercises:
Exercise 1.1 ⋆⋆ Show 1_{lim inf An} = lim inf 1_{An} and DeMorgan's identity.
Exercise 1.2 ⋆⋆ Show that the so-called "countable additivity" or "σ-additivity" (P(∪_n An) = Σ_n P(An)
for countably many disjoint An ∈ A) is equivalent to "finite additivity" plus "continuity" (if
An ↓ ∅, then P(An) → 0).
Exercise 1.3 ⋆ ⋆ ⋆ (Completion of a Probability space) Let (Ω, F , P ) be a probability space.
Define
F̄ = {A : P(A \ B) + P(B \ A) = 0 for some B ∈ F},
and for each A ∈ F̄, P(A) is defined as P(B) for the B given above. Prove that (Ω, F̄, P) is also a
probability space. (Hint: one needs to show that F̄ is a σ-algebra and that P is a probability measure on it.)
Exercise 1.4 ⋆ ⋆ ⋆ If X1 and X2 are two r.v.s, so is X1 + X2 . (Hint: cite Propositions 1.1 and
1.3)

Chapter 2. Distribution, expectation and inequalities.

Expectation, also called mean, of a random variable is often referred to as the location or center of
the random variable or its distribution. To avoid some non-essential trivialities, unless otherwise
stated, the random variables will usually be assumed to take finite values and those taking values
−∞ and ∞ are considered as r.v.s in extended sense.
(i). Distribution.
Recall that, given a probability space (Ω, F , P ), a random variable (r.v.) X is defined as a real-
valued function of Ω, satisfying certain measurability condition. The cumulative distribution func-
tion of X is then

F (t) = P (X ≤ t) = P ({w ∈ Ω : X(w) ≤ t}) = P (X −1 ((−∞, t])), t ∈ (−∞, ∞).

F (·) is then a right-continuous function defined on the real line (−∞, ∞).
Remark. The distribution function of a single r.v. may be considered as complete profile/description
of the r.v.. The distribution function F (·) defines a probability measure on (−∞, ∞). This is the
induced measure, induced by the random variable as a map/function from the probability mea-
sure P on (Ω, F , P ) to ((−∞, ∞), B, F ). In this sense, the original probability space is often left
unspecified or seemingly irrelevant when dealing with one single random variable.
We often call a random variable a discrete random variable if it takes a countable number of values, and
call a random variable a continuous random variable if the chance that it takes any particular value is 0.
In statistics, a continuous random variable is often, by default, given a density function. In general,
a continuous random variable may not have a density function (with respect to Lebesgue measure).
An example is the Cantor measure.
For two random variables X and Y , their joint c.d.f. is

FX,Y (t, s) = P (X ≤ t and Y ≤ s) = P (X −1 ((−∞, t]) ∩ Y −1 ((−∞, s])), t, s ∈ (−∞, ∞).

Joint c.d.f can be extended for finite number of variables in a straightforward fashion. If the (joint)
c.d.f. is differentiable, the derivative is then called (joint) density.

(ii). Expectation.

Definitions. For a nonnegative random variable X with c.d.f. F, its expectation is defined as

E(X) ≡ ∫_0^∞ x dF(x).

In general, let X + = X1{X≥0} , X − = −X1{X≤0} ,

E(X) ≡ E(X + ) − E(X − ).

If E(X + ) = ∞ = E(X − ), E(X) does not exist.


A more original definition of the expectation is through that of the Lebesgue integral: for nonnegative
X,

E(X) ≡ ∫ X(ω) dP(ω), formally

≡ lim_{m→∞} Σ_{k=0}^∞ (k/2^m) P(k/2^m < X ≤ (k+1)/2^m).

If X takes ∞ with positive probability, then E(X+) = ∞. Note that X having a finite mean is equivalent to
E|X| < ∞, and the mean of X not existing is the same as E(X+) = E(X−) = ∞.

The expectation defined above is mathematically an integral or summation with respect to certain
probability measure induced by the random variable. In layman’s words, it is the weighted ”average”
of the values taken by the r.v., weighted by chances which sum up to 1.

Some basic properties of expectation:


(1). E(f(X)) = ∫ f(x) dF(x), where F is the c.d.f. of X.
(2). If P (X ≤ Y ) = 1, then E(X) ≤ E(Y ). If P (X = Y ) = 1 then E(X) = E(Y ).
(3). E(X) is finite if and only if E(|X|) is finite.
(4). (Linearity) E(aX + bY ) = aE(X) + bE(Y ).
(5). If a ≤ X ≤ b, then a ≤ E(X) ≤ b.

(iii). Some typical distributions of random variables.

(1.) Commonly used discrete distributions:

Bernoulli: X ∼ Bin(1, p). P (X = 1) = p = 1 − P (X = 0). E(X) = p and var(X) = p(1 − p).


Binomial: X ∼ Bin(n, p). X = Σ_{i=1}^n x_i where the x_i are iid Bin(1, p) (the number of successes in n
Bernoulli trials).

P(X = k) = (n choose k) p^k (1 − p)^{n−k},  k = 0, 1, ..., n.

E(X) = np. var(X) = np(1 − p).

Poisson: X ∼ P(λ). E(X) = var(X) = λ.


P(X = k) = (λ^k/k!) e^{−λ},  k = 0, 1, 2, ...
Key fact: B(n, p) → P(λ) if n → ∞, np → λ. (Law of rare events.)
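A quick numerical illustration of the law of rare events (an added sketch, not part of the notes): Bin(n, λ/n) probabilities are already close to Poisson(λ) probabilities for moderately large n.

import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam, n = 2.0, 1000
for k in range(6):
    print(k, round(binom_pmf(k, n, lam / n), 5), round(poisson_pmf(k, lam), 5))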

Geometric: X ∼ G(p): time to the first success in a series of Bernoulli trials.

P (X = k) = (1 − p)k−1 p, k = 1, 2, ...

E(X) = 1/p, var(X) = (1 − p)/p2 .

Negative binomial: X ∼ NB(p, r): time to the r-th success in a series of Bernoulli trials.
Therefore X = Σ_{j=1}^r ξ_j where the ξ_j are iid ∼ G(p).

P(X = k) = (k−1 choose r−1) p^r (1 − p)^{k−r},  k = r, r + 1, ...

E(X) = r/p and var(X) = r(1 − p)/p^2.

Hyper-geometric: X ∼ HG(r, n, m): the number of black balls when r balls are taken without
replacement from an urn containing n black balls and m white balls.

P(X = k) = (n choose k)(m choose r−k) / (n+m choose r),  k = 0 ∨ (r − m), ..., r ∧ n.

E(X) = rn/(m + n) and var(X) = rnm(n + m − r)/[(n + m)2 (n + m − 1)].

(2) Commonly used continuous distributions:

Uniform: X ∼ Unif[a, b]. Density:

f(x) = (1/(b − a)) 1_{x∈[a,b]}.

E(X) = (a + b)/2 and var(X) = (b − a)^2/12.

Normal: X ∼ N(µ, σ^2), E(X) = µ and var(X) = σ^2. Central limit theorem.

f(x) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)},  x ∈ (−∞, ∞).

Exponential: X ∼ E(λ). Density:

f (x) = e−x/λ /λ, x>0

E(X) = λ and var(X) = λ2 . No memory: (X − t) | {X ≥ t} ∼ E(λ).

Gamma: Γ(α, γ). Density:

f(x) = (1/(Γ(α) γ^α)) x^{α−1} e^{−x/γ},  x > 0.

E(λ) = Γ(1, λ), χ^2_n = Γ(n/2, 2). A sum of independent Γ(α_i, γ) follows Γ(Σ_i α_i, γ).

Beta: B(α, β). Density:

f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1},  x ∈ [0, 1].

ξ/(ξ + η) ∼ B(α, β) where ξ ∼ Γ(α, γ) and η ∼ Γ(β, γ) are independent. X_(k) ∼ B(k, n − k + 1),
where X_(k) is the k-th smallest of X1, ..., Xn iid ∼ Unif[0, 1].

Cauchy: density f(x) = 1/[π(1 + x^2)]. Symmetric about 0, but the expectation and variance do not exist.

χ^2_n (with d.f. n): sum of squares of n i.i.d. standard normal r.v.s. χ^2_2 is E(2).

t_n (with d.f. n): ξ/√(η/n) where ξ ∼ N(0, 1), η ∼ χ^2_n, and ξ and η are independent.

F_{m,n} (with d.f. (m, n)): (ξ/m)/(η/n) where ξ ∼ χ^2_m, η ∼ χ^2_n, and ξ and η are independent.

(iv). Some basic inequalities:

Inequalities are extremely useful tools in theoretical development of probability theory. For sim-
plicity of notation, we use kXkp , which is also called Lp norm if p ≥ 1, to denote [E(|X|p )]1/p for
a r.v. X. In what follows, X and Y are two random variables.

(1) the Jensen inequality: Suppose ψ(·) is a convex function and X and ψ(X) have finite expectation.
Then ψ(E(X)) ≤ E(ψ(X)).
Proof. Convexity implies that for every a there exists a constant c such that ψ(x) − ψ(a) ≥ c(x − a).
Taking a = E(X) and x = X, the right-hand side has mean 0. So Jensen's inequality follows.

(2). the Markov inequality: For any a > 0, P(|X| ≥ a) ≤ E(|X|)/a.

Proof. aP(|X| ≥ a) = E(a 1_{|X|≥a}) ≤ E(|X| 1_{|X|≥a}) ≤ E(|X|).

(3). the Chebyshev (Tchebychev) inequality: for a > 0,

P (|X − E(X)| ≥ a) ≤ var(X)/a2

Proof. The inequality holds if var(X) = ∞. Assume var(X) < ∞, then E(X) is finite and
Y ≡ (X − E(X))2 is well defined. It follows from the Markov inequality that

P (|X − E(X)| ≥ a) = P (Y ≥ a2 ) ≤ E(Y )/a2 = var(X)/a2 .
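As a sanity check (an added sketch, not part of the notes), the Markov and Chebyshev bounds can be compared with simulated tail probabilities for an exponential r.v. with mean 1 and variance 1; the threshold a = 3 is an arbitrary choice.

import random

n, a = 200_000, 3.0
xs = [random.expovariate(1.0) for _ in range(n)]        # E(X) = 1, var(X) = 1
tail = sum(x >= a for x in xs) / n
centered_tail = sum(abs(x - 1.0) >= a for x in xs) / n
print(tail, "<=", 1.0 / a)                # Markov:    P(X >= a) <= E(X)/a
print(centered_tail, "<=", 1.0 / a ** 2)  # Chebyshev: P(|X - E(X)| >= a) <= var(X)/a^2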



(4). the Hölder inequality: for 1/p + 1/q = 1 with p > 0 and q > 0,

E|XY | ≤ kXkpkY kq

Proof. Observe that for any two nonnegative numbers a and b, ab ≤ ap /p + bq /q. (This is a result
of the concavity of the log-function. please DIY.) Let a = |X|/kXkp and b = |Y |/kY kq and take
expectation on both sides. The Hölder inequality follows. 

(5). the Schwarz inequality:


E(|XY |) ≤ [E(X 2 )E(Y 2 )]1/2 .

Proof. A special case of the Hölder inequality. 

(6). the Minkowski inequality: for p ≥ 1,

kX + Y kp ≤ kXkp + kY kp .

Proof. If p = 1, the inequality is trivial. Assume p > 1. Let q = p/(p − 1). Then 1/p + 1/q = 1.
By the Hölder inequality,

E[|X||X +Y |p−1 ] ≤ kXkp k|X +Y |p−1 kq = kXkp {E[|X +Y |(p−1)q ]}1/q = kXkp {E[|X +Y |p ]}(p−1)/p .

Likewise,
E[|Y ||X + Y |p−1 ] ≤ kY kp {E[|X + Y |p ]}(p−1)/p .
Summing up the above two inequalities leads to

E(|X + Y |p ) ≤ (kXkp + kY kp ){E[|X + Y |p ]}(p−1)/p ,

and the Minkowski inequality follows. 

Remark. Jensen’s inequality is a powerful tool. For example, straightforward applications include

[E(|X|)]^p ≤ E(|X|^p), for p ≥ 1,

which implies
kXkp ≤ kXkq , for 0 < p < q.
Moreover,
E(log(|X|)) ≤ log(E(|X|)).
If E(X) exists,
E(e^X) ≥ e^{E(X)}.
These inequalities are all very commonly used. For example, the validity of maximum likelihood
estimation essentially rests on the fact that

E[log(fθ(X)/fθ0(X))] ≤ log E[fθ(X)/fθ0(X)] = log ∫ (fθ(x)/fθ0(x)) fθ0(x) dx = log ∫ fθ(x) dx = log(1) = 0,

which is a result of Jensen’s inequality. Here fθ (·) is a parametric family of density of X with θ0
being the true value of θ.
The Markov inequality, despite its simplicity, is frequently used to control the order of a sequence
of random variables, especially when coupled with the technique of truncation. The Chebyshev
inequality is so mighty that, as an example, it directly proves the weak law of large numbers.

The Schwarz inequality shows that covariance is an inner product, and, furthermore, the space of
mean-0 r.v.s with finite variances forms a Hilbert space. The Minkowski inequality is the triangle
inequality for the Lp norm, without which k · kp could not be a norm.

DIY Exercises.
Exercise 2.1. ⋆ Suppose X is a r.v. taking values on all rational numbers on [0, 1], Specifically,
P (X = qi ) = pi > 0 where q1 , q2 , ... denotes all rational numbers on [0, 1]. Then, the c.d.f of X is
continuous at irrational numbers and discontinuous at rational numbers.
Exercise 2.2. ⋆ ⋆ ⋆ Show var(X + ) ≤ var(X) and var(min(X, c)) ≤ var(X) where c is any constant.
Exercise 2.3. ⋆ ⋆ ⋆ (Generalizing Jensen’s inequality). Suppose g(·) is a convex function and X is
a random variable with finite mean. Then, for any constant c,

Eg(X − E(X) + c) ≥ g(c).

Exercise 2.4. ⋆⋆⋆ Lyapunov (Liapounov) : Show that the function log E(|X|p ) is a convex function
of p on [0, ∞). Or, equivalently, for any 0 < s < m < l, show

E(|X|m ) ≤ [E(|X|s )]r [E(|X|l )]1−r

where r = (l − m)/(l − s). (Hint: use the Hölder inequality on

E(|X|λp1 +(1−λ)p2 ) ≤ [E(|X|p1 )]λ [E(|X|p2 )]1−λ

for positive p1 , p2 and 0 < λ < 1.)



Chapter 3. Convergence

Unlike the convergence of a sequence of numbers, the convergence of a sequence of r.v.s has at least
four commonly used modes: almost sure convergence, convergence in probability, Lp convergence
and convergence in distribution. The first is sometimes called convergence almost everywhere or
almost certain convergence, and the last convergence in law.

(i). Definitions
In what follows, we give definitions. Suppose X1 , X2 , ... are a sequence of r.v.s.
Xn → X almost surely, (a.s.) if P ({ω : Xn (ω) → X(ω)}) = P (Xn → X) = 1. Namely, a.s.
convergence is a point-wise convergence “everywhere” except for a null set.
Xn → X in probability, if P (|Xn − X| > ǫ) → 0 for any ǫ > 0.
Xn → X in Lp , if E(|Xn − X|p ) → 0.
Xn → X in distribution. There are four equivalent definitions:
1). For every continuity point t of F , Fn (t) → F (t), where Fn and F are c.d.f of Xn and X.
2). For every closed set B, lim supn P (Xn ∈ B) ≤ P (X ∈ B).
3). For every open set B, lim inf n P (Xn ∈ B) ≥ P (X ∈ B).
4). For every continuous bounded function g(·), E(g(Xn )) → E(g(X)).

Remark. The Lp convergence precludes the limit X taking values of infinity with positive chance.
Sometimes in some textbooks, a sequence of numbers going to infinity is called convergence to
infinity rather than divergence to infinity. If this is the case, the limit X can be ∞ or −∞, for a.s.
convergence and, by slightly modifying the definition, for in probability convergence. For example,
Xn → ∞ in probability is naturally defined as, for any M > 0, P (Xn > M ) → 1. Convergence in
distribution only has to do with distributions.

(ii). Convergence theorems.


The following three theorems/lemmas, tantamount to their analogues in real analysis, play an important
role in the technical development of probability theory.
(1). Monotone convergence theorem. If Xn ≥ 0, and Xn ↑ X, then E(Xn ) ↑ E(X).
Proof. E(Xn) ≤ E(X). For any a < E(X), there exist N and m such that
Σ_{i=0}^N (i/2^m) P(i/2^m < X(ω) ≤ (i+1)/2^m) > a. But P(i/2^m < Xn(ω) ≤ (i+1)/2^m) → P(i/2^m < X(ω) ≤ (i+1)/2^m)
(why?). Therefore, lim E(Xn) ≥ a. Hence, E(Xn) → E(X).

(2). Fatou’s lemma. If Xn ≥ 0, a.s., then

E(lim inf Xn ) ≤ lim inf E(Xn )

Proof. Let Xn∗ = inf(Xk : k ≥ n); then Xn∗ ↑ lim inf Xn, so by the monotone convergence theorem,
E(Xn∗) ↑ E(lim inf Xn). On the other hand, Xn∗ ≤ Xn, so E(Xn∗) ≤ E(Xn). As a result,
E(lim inf Xn) ≤ lim inf E(Xn).

(3). Dominated convergence theorem. If |Xn | ≤ Y , E(Y ) < ∞, and Xn → X a.s., then E(Xn ) →
E(X).
Proof. Observe that Y − Xn ≥ 0 and Y + Xn ≥ 0. By Fatou's lemma, E(Y − lim Xn) ≤ lim inf E(Y − Xn),
leading to E(X) ≥ lim sup E(Xn). Likewise E(Y + lim Xn) ≤ lim inf E(Y + Xn), leading to
E(X) ≤ lim inf E(Xn). Consequently, E(Xn) → E(X).

The essence of the above convergence theorems is to use a bound, upper or lower, to ensure the
desired convergence in expectation. These bounds, lower bounds as 0 in the monotone convergence
theorem and the Fatou lemma, and both lower and upper bounds in the dominated convergence
theorem, can actually be relaxed; see the DIY exercises. The most general extension is through the
concept of uniformly integrable r.v.s, which shall be introduced later if necessary.

(iii). Relations between convergence modes.


The relations are partly illustrated by the following implications:

a.s. conv. =⇒ in prob. conv.;  Lp conv. =⇒ in prob. conv.;  in prob. conv. =⇒ in dist. conv.;
in prob. conv. =⇒ a.s. conv. along a subsequence (†);  in prob. conv. =⇒ Lp conv. under domination (‡).

†: there exists a subsequence that converges a.s. ‡: if |Xn| ≤ Y where Y ∈ Lp.

(iv) Some examples.


We use the following examples to clarify the above diagram.
a). in prob. conv. but not a.e. conv.
Let ξ ∼ Unif[0, 1]. Set X_{2^j+k} = 1 if ξ ∈ [k/2^j, (k+1)/2^j] and 0 otherwise, for all 0 ≤ k ≤ 2^j − 1 and
j = 0, 1, 2, .... Then Xn → 0 in probability as n → ∞, but Xn does not → 0 a.e.; in fact, P(Xn → 0) = 0.
Alternatively, let ξn be i.i.d. ∼ Unif[0, 1] and let Xn = 1 if ξn ≤ 1/n and 0 otherwise. Then Xn → 0 in probability,
but Xn does not → 0 a.e., by the Borel-Cantelli lemma.

b). in distribution conv. but not in probability conv..


This is in fact quite trivial. Any sequence of (non-constant) i.i.d. random variables converge in
distribution, but not in probability. Observe that convergence in distribution only concerns the
distribution. The variables even do not have to be in the same probability space.
c). a.s. but not Lp conv.
Let ξ ∼ Unif[0, 1]. Let Xn = e^n if ξ ≤ 1/n and 0 otherwise. Then Xn → 0 a.s. but E(|Xn|^p) =
e^{np}/n → ∞.
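To visualize case a) (an added sketch, not from the notes): with Xn = 1 if ξn ≤ 1/n and 0 otherwise for independent ξn ∼ Unif[0, 1], the probability P(Xn = 1) = 1/n tends to 0, yet along a single realization the value 1 keeps recurring at arbitrarily large indices, so the path itself does not converge to 0.

import random

N = 2_000_000
hits = [n for n in range(1, N + 1) if random.random() <= 1.0 / n]
print("number of indices with X_n = 1:", len(hits))
print("largest such indices:", hits[-5:])     # very large indices still occur
print("P(X_N = 1) =", 1.0 / N)                # tends to 0: convergence in probability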

(v). Technical proofs.


1 . a.s. convergence =⇒ in probability convergence.
Proof. Let An = {|Xn − X| > ǫ}. a.s. convergence implies P(An, i.o.) = 0. But {An, i.o.} =
∩_{n=1}^∞ ∪_{k=n}^∞ Ak. So 0 = P(An, i.o.) = lim_n P(∪_{k=n}^∞ Ak) ≥ lim sup_n P(An).

2 . Lp convergence =⇒ in prob convergence.


Proof. 0 ← E(|Xn − X|p ) ≥ E(|Xn − X|p 1{|Xn −X|>ǫ} ) ≥ ǫp P (|Xn − X| > ǫ).

3 . in prob convergence =⇒ in distribution convergence.


Proof. For any t, and any ǫ > 0, lim sup P (Xn ≤ t) ≤ lim sup P ({Xn ≤ t} ∩ {X ≤ Xn + ǫ}) ≤
P (X ≤ t + ǫ). Let ǫ ↓ 0, we have lim sup P (Xn ≤ t) ≤ P (X ≤ t). Likewise lim sup P (−Xn ≤ −t) ≤
P (−X ≤ −t). (Why?) Then lim inf P (Xn < t) ≥ P (X < t). Suppose now, t is a continuity point
of X. Then P (X < t) = P (X ≤ t). As a result, limn P (Xn ≤ t) = P (X ≤ t).

4 . in prob convergence =⇒ existence of a subsequence that converges a.s.


Proof. Let ǫk ↓ 0. Since P(|Xn − X| > ǫk) → 0 as n → ∞, there exists an nk such that
P(|Xnk − X| > ǫk) < 2^{−k}. Therefore Σ_{k=1}^∞ P(|Xnk − X| > ǫk) < ∞, which implies, by the Borel-
Cantelli lemma introduced in the next section, that P(|Xnk − X| > ǫk, i.o.) = 0. This
means that, with probability 1, |Xnk − X| ≤ ǫk for all large k. This is tantamount to Xnk → X
a.s..

5 . Lp convergence =⇒ Lq convergence for p > q > 0.


Proof. Let Yn = |Xn − X|. For any ǫ > 0, E(Yn^q) ≤ ǫ^q + E(Yn^q 1_{Yn≥ǫ}) ≤ ǫ^q + E(Yn^q 1_{Yn≥1}) + P(ǫ ≤
Yn ≤ 1) ≤ ǫ^q + E(Yn^p 1_{Yn≥1}) + P(Yn ≥ ǫ) → ǫ^q as n → ∞. Since ǫ > 0 is arbitrary, it follows that
Xn → X in Lq.

6 . Suppose |Xn | ≤ c > 0 a.s., then, in probability convergence ⇐⇒ Lp convergence for all (any)
p > 0.
Proof. ⇐= follows from 2 . And =⇒ follows from the dominated convergence theorem.

7 . The four equivalent definitions of in distribution convergence.


Proof. 2) ⇐⇒ 3). The complement of any closed set is open. Likewise, the complement of any
open set is closed.
1) =⇒ 3). Continuity points of F are dense (why?). Consider the interval (−∞, t); there exist continuity
points tk ↑ t. Then,

lim inf_n P(Xn ∈ (−∞, t)) ≥ lim inf_n P(Xn ∈ (−∞, tk]) = P(X ∈ (−∞, tk]) → P(X ∈ (−∞, t)).

The result can be extended for general open sets. We omit the proof.
3) =⇒ 1). Suppose t is a continuity point. Then lim supn Fn (t) ≤ F (t) by 2) and the equivalency
of 2) and 3). lim inf n Fn (t) ≥ lim inf n P (Xn < t) ≥ P (X < t) = F (t) as t is a continuity point. So
1) follows.
4) =⇒ 1). Let t be a continuity point of F . For any small ǫ > 0, choose a non-increasing continuous
function f of x which is 1 for x < t, and is 0 for x > t + ǫ. Then, P (Xn ≤ t) ≤ E(f (Xn )) →
E(f (X)) ≤ P (X ≤ t + ǫ). Therefore the lim sup P (Xn ≤ t) ≤ P (X ≤ t). Likewise (how?), one can
show lim inf P (Xn ≤ t) ≥ P (X ≤ t). The desired convergence follows.
1) =⇒ 4). Continuity points of the cdf of X are dense (why?). Suppose |f(t)| < c. Choose
continuity points −∞ = t0 < t1 < ... < tK < tK+1 = ∞ such that F(t1) < ǫ, 1 − F(tK) < ǫ, and
|f(t) − f(s)| < ǫ for any t, s ∈ [tj, tj+1] for j = 1, ..., K − 1. Then,

|E(f(Xn)) − E(f(X))| = |∫ f(t) dFn(t) − ∫ f(t) dF(t)|
≤ Σ_{j=0}^{K} |∫_{tj}^{tj+1} f(t)[dFn(t) − dF(t)]|
≤ 2cǫ + Σ_{j=1}^{K−1} |∫_{tj}^{tj+1} f(t)[dFn(t) − dF(t)]|
≤ 2cǫ + Σ_{j=1}^{K−1} |∫_{tj}^{tj+1} f(tj)[dFn(t) − dF(t)]| + Σ_{j=1}^{K−1} |∫_{tj}^{tj+1} [f(t) − f(tj)][dFn(t) − dF(t)]|
≤ 2cǫ + c Σ_{j=1}^{K−1} |∫_{tj}^{tj+1} [dFn(t) − dF(t)]| + ǫ Σ_{j=1}^{K−1} ∫_{tj}^{tj+1} [dFn(t) + dF(t)]
→ 2cǫ + 2ǫ ∫_{t1}^{tK} dF(t) as n → ∞
≤ (2c + 2)ǫ,

which can be arbitrarily small.

DIY Exercises.
Exercise 3.1 ⋆⋆ Suppose Xn ≥ η, with E(η − ) < ∞. Show E(lim inf Xn ) ≤ lim inf E(Xn ).
Exercise 3.2 ⋆⋆ Show the dominated convergence theorem still holds if Xn → X in probability or
in distribution.
Exercise 3.3 ⋆ ⋆ ⋆ Let Sn = Σ_{i=1}^n Xi. Give a counterexample to show that Sn/n does not → 0 in probability
even though Xn → 0 in probability.
Exercise 3.4 ⋆ ⋆ ⋆ Let Sn = Σ_{i=1}^n Xi. Show that Sn/n → 0 a.s. if Xn → 0 a.s., and Sn/n → 0 in
Lp if Xn → 0 in Lp for p ≥ 1.

Chapter 4. Independence, conditional expectation, Borel-Cantelli lemma


and Kolmogorov 0-1 laws.

(i). Conditional probability and independence of events.


For any two events, say A and B, the conditional probability of A given B is defined as

P (A|B) = P (A ∩ B)/P (B), if P (B) 6= 0.

This is the chance of A to happen, given B has happened.


In common sense, independence between events A and B should mean that information about whether
B happens or not does not change the chance of A happening, and vice versa. In other
words, whether B (A) happens or not does not contain any information about whether A (B)
happens. Therefore the definition of independence should be P(A|B) = P(A) or P(B|A) = P(B).
But to include the case of P(A) = 0 or P(B) = 0, the mathematical definition of independence is
P(A ∩ B) = P(A)P(B), which is equivalent to P(Ac ∩ B) = P(Ac)P(B) or P(A ∩ Bc) = P(A)P(Bc)
or P(Ac ∩ Bc) = P(Ac)P(Bc). The definition is extended in the following to independence between
n events.
n events.
Definition Events A1, ..., An are called independent if P(∩_{i=1}^n Bi) = Π_{i=1}^n P(Bi), where Bi is Ai
or Ai^c. Events A1, ..., An are called pairwise independent if any pair of the events are independent.
The above definition implies, if A1 , ..., An are independent (pairwise independent), then Ai1 , ..., Aik
are independent (pairwise independent). (Please DIY).
The σ-algebra generated by a single set A, denoted as σ(A) is {∅, A, Ac , Ω}. Independence between
A1 , ..., An can be interpreted as independence between the σ-algebras: σ(Ai ), i = 1, ..., n.

(ii). Borel-Cantelli Lemma.


The Borel-Cantelli Lemma is considered a sine qua non of probability theory and is instrumental
in proving the law of large numbers. Please note, in the proof below, the technique of using
indicator functions to handle probabilities of sets.
Theorem 4.1. (Borel-Cantelli Lemma) For events A1, A2, ...,

(1) Σ_{n=1}^∞ P(An) < ∞ =⇒ P(An, i.o.) = 0;

(2) If the An are independent, Σ_{n=1}^∞ P(An) = ∞ =⇒ P(An, i.o.) = 1.

Here An, i.o. means An happens infinitely often, i.e., ∩_{n=1}^∞ ∪_{k=n}^∞ Ak.
Proof. (1): Let 1_{An} be the indicator function of An. Then {An, i.o.} is the same as {Σ_{n=1}^∞ 1_{An} = ∞}.
Hence,

E(Σ_{n=1}^∞ 1_{An}) = Σ_{n=1}^∞ E(1_{An}) = Σ_{n=1}^∞ P(An) < ∞.

It implies Σ_{n=1}^∞ 1_{An} < ∞ with probability 1. This is equivalent to P(An, i.o.) = 0.
(2). Σ_{n=1}^∞ P(An) = ∞ implies Π_{k=n}^∞ (1 − P(Ak)) = 0 for all n ≥ 1, since log(1 − x) ≤ −x for x ∈ [0, 1].
By the dominated convergence theorem,

E(1_{lim inf_n An^c}) = E(lim_n Π_{k=n}^∞ 1_{Ak^c}) = lim_n E(Π_{k=n}^∞ 1_{Ak^c}) = lim_n Π_{k=n}^∞ (1 − P(Ak)) = 0.

Then, P(lim inf_n An^c) = 0 and hence P(lim sup_n An) = 1.


As an immediate consequence,
Corollary (Borel's 0-1 law) If A1, ..., An, ... are independent, then P(An, i.o.) = 1 or 0 according
as Σ_n P(An) = ∞ or < ∞.
Even though the above 0-1 law appears to be simple, its impact and implications are profound. More
generally, suppose A ∈ ∩_{n=1}^∞ σ(Aj, j ≥ n), the so-called tail σ-algebra; A is called a tail event. Then
the independence of A1, ..., An, ... implies P(A) = 0 or 1. The key fact here is that A is independent
of An for any n ≥ 1. Examples of tail events are {An, i.o.} and {Σ_{i=1}^n 1_{Ai}/log(n) → ∞}. A more general
result involving independent random variables is Kolmogorov's 0-1 law, to be introduced later.
The following example can be viewed as a strengthening of the Borel-Cantelli lemma.
Example 4.1 Suppose A1, ..., An, ... are independent events with Σ_n pn = ∞, where pn = P(An).
Then,

Xn ≡ (Σ_{i=1}^n 1_{Ai}) / (Σ_{i=1}^n pi) → 1  a.s..

Proof. Since

E(Xn − 1)^2 = (Σ_{i=1}^n pi(1 − pi)) / (Σ_{i=1}^n pi)^2 ≤ 1/(Σ_{i=1}^n pi) → 0,

it follows that Xn → 1 in L2 and therefore also in probability by the Chebyshev inequality:

P(|Xn − 1| > ǫ) ≤ E(Xn − 1)^2/ǫ^2 ≤ 1/(ǫ^2 Σ_{i=1}^n pi) → 0.

Consider nk ↑ ∞ as k → ∞, such that

Σ_{k=1}^∞ 1/(Σ_{i=1}^{nk} pi) < ∞   and   (Σ_{i=1}^{n_{k+1}} pi)/(Σ_{i=1}^{nk} pi) → 1.

Then,

Σ_{k=1}^∞ P(|X_{nk} − 1| > ǫ) < ∞.

The Borel-Cantelli lemma implies X_{nk} → 1 a.s.. Observe that, for nk ≤ n ≤ n_{k+1},

1 ← (Σ_{i=1}^{nk} 1_{Ai})/(Σ_{i=1}^{n_{k+1}} pi) ≤ Xn = (Σ_{i=1}^n 1_{Ai})/(Σ_{i=1}^n pi) ≤ (Σ_{i=1}^{n_{k+1}} 1_{Ai})/(Σ_{i=1}^{nk} pi) → 1,  a.s..

The desired convergence holds.


Remark. The trick of bracketing Xn by the two quantities in the above inequality is also used in
proving the uniform convergence of the empirical distribution to the population distribution:

sup_x |Fn(x) − F(x)| → 0, a.s.,

where Fn(x) = (1/n) Σ_{i=1}^n 1_{ξi ≤ x} and the ξi are iid with cdf F. The idea is further elaborated in the
context of empirical approximation in terms of bracketing/packing numbers.
Example 4.2. Repeatedly toss a coin, which has probability p to be head and q = 1 − p to be tail
on each toss. Let Xn = H or T when n-th toss is a head or tail. Let

ln = max{m ≥ 0 : Xn = H, Xn+1 = H, ..., Xn+m−1 = H, Xn+m = T }

be the length of run of heads starting from n-th toss. Then,

lim sup_n ln / log n = 1/ log(1/p).

Proof. ln follows a geometric distribution, i.e.,

P(ln = k) = q p^k,   P(ln ≥ k) = P(Xn = H, ..., X_{n+k−1} = H) = p^k,   k = 0, 1, 2, ...

For any ǫ > 0,

Σ_{n=1}^∞ P(ln > (1 + ǫ) log n / log(1/p)) ≤ Σ_{n=1}^∞ p^{(1+ǫ) log n / log(1/p)} = Σ_{n=1}^∞ e^{−(1+ǫ) log n} = Σ_{n=1}^∞ n^{−(1+ǫ)} < ∞.

By the Borel-Cantelli lemma,

lim sup_n ln / (log n / log(1/p)) ≤ 1.

We next try to find a subsequence with limit as large as 1. Let dn be the integer part of
log n / log(1/p) and let rn = Σ_{i=1}^n di. Then rn ≈ n log n / log(1/p) and log(rn) ≈ log(n). Set

An = {X_{rn} = H, X_{rn+1} = H, ..., X_{rn+dn−1} = H}.

Then An, n ≥ 1, are independent, and

P(An) = p^{dn} = e^{dn log p} ≈ 1/n.

Therefore, Σ_n P(An) = ∞. It then follows from the Borel-Cantelli lemma that P(An, i.o.) = 1.
Since An = {l_{rn} ≥ dn}, we have

lim sup_n ln / (log n / log(1/p)) ≥ lim sup_n l_{rn} / (log(rn) / log(1/p)) = lim sup_n l_{rn} / dn ≥ 1.
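A simulation sketch of Example 4.2 (an added illustration, not in the notes): for a fair coin, p = 1/2, the longest run of heads observed up to time n is of the order log n / log(1/p) = log_2 n.

import math
import random

p, N = 0.5, 1_000_000
longest = run = 0
for _ in range(N):
    if random.random() < p:              # head
        run += 1
        longest = max(longest, run)
    else:                                # tail: the current run is broken
        run = 0
print("longest head run observed:", longest)
print("log N / log(1/p)          :", round(math.log(N) / math.log(1 / p), 1))   # both around 20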

Remark. An analogous problem occurs in the setting of Poisson processes. Consider a Poisson
process with intensity λ > 0. The sojourn times (times between two consecutive events) ξ0, ξ1, ... are
iid ∼ exponential distribution with mean 1/λ. Then lim sup_{x→∞} lx/log x = 1/λ, where lx is the time
between x and the time of the event right after x.

(iii). Independence between σ-algebras and between random variables.


Definitions. Let A1, ..., An be σ-algebras. They are called independent if the events A1, ..., An are independent
for any choice of Aj ∈ Aj, j = 1, ..., n. Random variables X1, ..., Xn are called independent if the σ-algebras
generated by the Xj, 1 ≤ j ≤ n, are independent, i.e.,

P(∩_{j=1}^n Xj^{−1}(Bj)) = Π_{j=1}^n P(Xj^{−1}(Bj)),   or   P(X1 ∈ B1, ..., Xn ∈ Bn) = Π_{j=1}^n P(Xj ∈ Bj),

for any Borel sets B1, ..., Bn in (−∞, ∞).


There are several equivalent definitions of the independence of random variables.
Two r.v.s X and Y are called independent if E(g(X)f(Y)) = E(g(X))E(f(Y)) for all bounded
(measurable) functions g and f, or, equivalently, if

P(X ≤ t and Y ≤ s) = P(X ≤ t)P(Y ≤ s) for all t, s ∈ (−∞, ∞),

i.e., in terms of cumulative distribution functions,

FX,Y (t, s) = FX(t)FY (s) for all t, s.

If the joint density exists, this is the same as fX,Y (x, y) = fX(x)fY (y).
Roughly speaking, independence between two r.v.s X and Y is interpreted as X taking any value
"has nothing to do with" Y taking any value, and vice versa.

(iv). Conditional expectation.



(1). Conditional distribution and conditional expectation with respect to a set A.


Suppose A is a set with P (A) > 0, and X is a random variable. Then, the conditional expectation
is
E(X|A) ≡ E(X1A )/P (A).
The conditional distribution of X given A is

P (X ≤ t|A) = P ({X ≤ t} ∩ A)/P (A)


Then, E(X|A) = ∫ t dP(X ≤ t|A), if it exists.
As a simple example, let X ∼ Unif[0, 1]. Let Ai = {(i − 1)/n < X ≤ i/n} for i = 1, ..., n. Then

E(X|Ai) ≡ E(X 1_{Ai})/P(Ai) = (i − 1/2)/n.

Similarly E(X|Aci ) ≡ E(X1Aci )/P (Aci ).


Interpretation: E(X|A) is the weighted “average” (expected value) of X over the set A.
(2). Conditional expectation with respect to a r.v..
For two random variables X, Y , E(X|Y ) is a function of Y , i.e., measurable to σ(Y ), such that, for
any A ∈ σ(Y ),
E(X1A ) = E[E(X|Y )1A ].

Interpretation: E(X|Y ) is the weighted “average” (expected value) of X over the set {Y = y} for
all y. It is a function of Y and therefore is a r.v. measurable to σ(Y ).
If their joint density f(x, y) exists, then the conditional density of X given Y = y is fX|Y (x|y) ≡
f(x, y)/fY (y), and

E(X|Y = y) ≡ ∫ x fX|Y (x|y) dx.

(3). Conditional expectation with respect to a σ-algebra A.


Conditional expectation w.r.t. a σ-algebra is the most fundamental concept in probability theory,
especially in martingale theory in which the very definition of martingale depends on conditional
expectation.
Recall that a random variable, say X, is measurable with respect to a σ-algebra A if for any interval (a, b),
{ω : X(ω) ∈ (a, b)} ∈ A. In other words, σ(X) ⊆ A is interpreted as: all information about X
(which is σ(X)) is contained in A.
If A = σ(A1, ..., An) where Ai ∩ Aj = ∅, then X measurable to A implies X must be constant over
each Ai. If A is generated by a r.v. Y, then X measurable to A implies X must be a function of Y.
A heuristic understanding is that if Y is known, then there is no uncertainty about X, or if Y assumes
one value, X cannot assume more than one value.
Definition For a random variable X and a completed σ-algebra A, E(X|A) is defined as an A-
measurable random variable such that, for any A ∈ A,

E(X1A ) = E(E(X|A)1A ),

i.e., the conditional expectation of X given the event A equals that of E(X|A) given A, for every A ∈ A with P(A) > 0.


If A = σ(A1, ..., An) where Ai ∩ Aj = ∅, then

E(X|A) = Σ_{i=1}^n E(X|Ai) 1_{Ai},

which is a r.v. that, on each Ai, takes the conditional average of X, i.e., E(X|Ai), as its value.
Motivated by this simple case, we may obtain an important understanding of the conditional
expectation of X w.r.t. a σ-algebra A: a new r.v. given by the "average" of the r.v. X on each
"un-splittable" or "smallest" set of the σ-algebra A.
Conditional mean/expectation with respect to σ algebra shares many properties just like the ordi-
nary expectation.
Properties:
(1). E(aX + bY |A) = aE(X|A) + bE(Y |A)
(2). If X ∈ A, then E(X|A) = X.
(4). E(E(X|F)|A) = E(X|A) for two σ-algebras A ⊆ F .
Further properties, such as the dominated convergence theorem, Fatou’s lemma and monotone
convergence theorem also hold for conditional mean w.r.t. a σ-algebra. (See DIY exercises.)

(v). Kolmogorov’s 0-1 law.


One of the most important theorems in probability theory is the martingale convergence theorem.
In the following, we provide a simplified version, without a rigorous introduction of martingales and
without giving a proof.
Theorem 1.2 (simplified version of the martingale convergence theorem) Suppose Fn ⊆
Fn+1 for n ≥ 1. Let F = σ(∪_{n=1}^∞ Fn). For any random variable X with E(|X|) < ∞,

E(X|Fn ) → E(X|F), a.s.

The martingale convergence theorem, even in this simplified version, has broad applications. For
example, one of the most basic 0-1 laws, the Kolmogorov 0-1 law, can be established upon it.
Corollary (Kolmogorov 0-1 law) Suppose X1, ..., Xn, ... are a sequence of independent r.v.s.
Then all tail events have probability 0 or 1.
Proof. Suppose A is a tail event. Then A is independent of X1, ..., Xn for any fixed n. Therefore
E(1A|Fn) = P(A), where Fn is the σ-algebra generated by X1, ..., Xn. But, by Theorem 1.2,
E(1A|Fn) → 1A a.s.. Hence 1A = P(A) a.s., and P(A) can only be 0 or 1.
A heuristic interpretation of Kolmogorov's 0-1 law could be from the perspective of information. When
σ-algebras A1, ..., An, ... are independent, the information carried by each Ai is independent, or
unrelated, or non-overlapping. Then the information carried by {An, An+1, ...} shall shrink to 0 as
n → ∞, since otherwise An, An+1, ... would have something in common.
As straightforward applications of Kolmogorov’s 0-1 law:
Corollary Suppose X1, ..., Xn, ... are a sequence of independent random variables. Then,

lim inf_n Xn,  lim sup_n Xn,  lim sup_n Sn/an  and  lim inf_n Sn/an

must each be either a constant or ∞ or −∞, a.s., where Sn = Σ_{i=1}^n Xi and an ↑ ∞.
Proof. Consider A = {ω : lim inf n Xn (ω) > a}. Try to show A is a tail event. (DIY). 
Remark. Without invoking martingale convergence theorem, Kolmogorov’s 0-1 law can be shown
through π − λ theorem, which we do not plan to cover.

DIY Exercises.
Exercise 4.1 ⋆⋆ Suppose Xn are iid random variables. Then Xn /n1/p → 0 a.s. if and only if
E(|Xn |p ) < ∞ for p > 0. Hint: Borel-Cantelli lemma.
Exercise 4.2 ⋆ ⋆ ⋆ Let Xn be iid r.v.s with E(Xn ) = ∞. Show that lim supn |Sn |/n = ∞ a.s. where
Sn = X 1 + · · · + X n .
Exercise 4.3 ⋆ ⋆ ⋆ Suppose Xn are iid nonnegative random variables such that Σ_{k=1}^∞ k P(X1 >
ak) < ∞ for ak ↑ ∞. Show that lim sup_n max_{1≤i≤n} Xi/an ≤ 1 a.s.

Exercise 4.4 ⋆ ⋆ ⋆⋆ (Empirical Approximation) For every fixed t ∈ [0, 1], Sn (t) is a sequence
of random variables such that, with probability 1 for some p > 0,

|Sn (t) − Sn (s)| ≤ n|t − s|p ,

for all n ≥ 1 and all t, s ∈ [0, 1]. Suppose that for every constant C > 0 there exists a c > 0 such that

P(|Sn(t)| > C(n log n)^{1/2}) ≤ e^{−cn}  for all n ≥ 1 and t ∈ [0, 1].

Show that, for any p > 0,

max{|Sn(t)| : t ∈ [0, 1]} / (n log n)^{1/2} → 0  a.s..

Hint: Borel-Cantelli lemma.

Chapter 5. Weak law of large numbers.

For a sequence of independent r.v.s X1, X2, ..., the classical law of large numbers is typically about the
convergence of the partial sums:

(Sn − E(Sn))/n = Σ_{i=1}^n [Xi − E(Xi)]/n,

where Sn = Σ_{i=1}^n Xi here and throughout this chapter. A more general form is the convergence of

(Sn − an)/bn

for some constants an and bn. The weak law is convergence in probability and the strong law is convergence
a.s..

The following proposition may be called L2 weak law of large numbers which implies the weak law
of large numbers.
Proposition Suppose X1 , ..., Xn , ... are iid with mean µ and finite variance σ 2 . Then,

Sn /n → µ in probability and in L2 .

Proof. Write
E(Sn /n − µ)2 = (1/n)σ 2 → 0.
Therefore L2 convergence holds. And convergence in probability is implied by the Chebyshev
inequality. 

The above proposition implies that the classical weak law of large numbers holds quite trivially in a
standard setup with the r.v.s being iid with finite variance. In fact, in such a standard setup the strong
law of large numbers also holds, as will be shown in Chapter 6. However, the fact that convergence
in probability is implied by L2 convergence plays a central role in establishing weak laws of large
numbers. For example, a straightforward extension of the above proposition is:
For independent r.v.s X1, ..., (Sn − E(Sn))/bn → 0 in probability if (1/bn^2) Σ_{i=1}^n var(Xi) → 0 for
some bn ↑ ∞.
The following theorem about general weak law of large numbers is a combination of the above
extension and the technique of truncation.

Theorem 5.1. Weak Law of Large Numbers Suppose X1, X2, ... are independent. Assume
(1). Σ_{i=1}^n P(|Xi| > bn) → 0,
(2). bn^{−2} Σ_{i=1}^n E(Xi^2 1_{|Xi|≤bn}) → 0,
where 0 < bn ↑ ∞. Then (Sn − an)/bn → 0 in probability, where an = Σ_{j=1}^n E(Xj 1_{|Xj|≤bn}).
Proof. Let Yj = Xj 1_{|Xj|≤bn}. Consider

(Σ_{j=1}^n Yj − an)/bn = Σ_{j=1}^n [Yj − E(Yj)]/bn,

which has mean 0 and converges to 0 in L2 by (2). Therefore it also converges to 0 in probability.

Notice that

P((Σ_{j=1}^n Yj − an)/bn = (Sn − an)/bn) = P(Sn = Σ_{j=1}^n Yj)
≥ P(Xj = Yj for all 1 ≤ j ≤ n) = Π_{j=1}^n P(Xj = Yj)   (by independence)
= Π_{j=1}^n P(|Xj| ≤ bn) = Π_{j=1}^n [1 − P(|Xj| > bn)] = e^{Σ_{j=1}^n log[1 − P(|Xj| > bn)]}
≈ e^{−Σ_{j=1}^n P(|Xj| > bn)} → 1 by (1).

Hence (Sn − an)/bn → 0 in probability.

Theorem 5.2. Suppose X, X1 , X2 , ... are iid. Then, Sn /n − µn → 0 in probability for some µn ,
if and only if
xP (|X1 | > x) → 0 as x → ∞.
in which case µn = E(X1{|X|≤n} ) + o(1).
Proof. "⇐=" Let an = nµn and bn = n in Theorem 5.1. Condition (1) follows. To check
Condition (2), write, as n → ∞,

bn^{−2} Σ_{i=1}^n E(Xi^2 1_{|Xi|≤bn}) = (1/n) E(X^2 1_{|X|≤n}) ≤ (1/n) E(min(|X|, n)^2)
= (1/n) ∫_0^∞ 2x P(min(|X|, n) > x) dx = (1/n) ∫_0^n 2x P(|X| > x) dx
= (1/n) ∫_M^n 2x P(|X| > x) dx + o(1)   for any fixed M > 0
= (2/n) ∫_M^n x P(|X| > x) dx + o(1) ≤ 2 sup_{x≥M} x P(|X| > x) + o(1),

as n → ∞. Since M is arbitrary, Condition (2) holds. And the WLLN follows from Theorem 5.1.
"=⇒" Let X*, X1*, ... be iid following the same distribution as X and independent of X, X1, ....
Set ξi = Xi − Xi* (symmetrization) and S̃n = Σ_{i=1}^n ξi. Then S̃n/n → 0 in probability. The Levy
inequality in Exercise 5.1 implies max{|S̃j| : 1 ≤ j ≤ n}/n → 0 in probability, which further ensures
max{|ξj| : 1 ≤ j ≤ n}/n → 0 in probability. For any ǫ > 0,

nP(|X| ≥ nǫ)P(|X*| ≤ .5nǫ) = nP(|X| ≥ nǫ, |X*| ≤ .5nǫ) ≤ nP(|X − X*| ≥ .5nǫ)
≈ 1 − [1 − P(|X − X*| ≥ .5nǫ)]^n = P(max_{1≤j≤n} |ξj| > .5nǫ) → 0.

As a result, for any ǫ > 0,

nP(|X| ≥ nǫ) ≈ nP(|X| ≥ nǫ)[1 − P(|X| ≥ .5nǫ)] → 0,

which is equivalent to xP(|X| > x) → 0 as x → ∞.

Example 5.1. Suppose X1, X2, ... are i.i.d. with common density f symmetric about 0 and c.d.f.
such that 1 − F(t) = 1/(t log t) for t > 3. Then Sn/n → 0 in probability, but Sn/n does not → 0 a.s..
The convergence in probability is a consequence of Theorem 5.2 with µn = 0, after checking the
condition xP(|X| > x) → 0 as x → ∞. The a.s. convergence fails because Xn/n does not → 0 a.s., by the
Borel-Cantelli lemma.

Corollary. Suppose X1, ..., Xn, ... are i.i.d. with E(|Xi|) < ∞. Then Sn/n → E(X1) in probability.

Proof. Since, as x → ∞,

xP(|Xi| > x) = o(1) ∫_0^x P(|Xi| > t) dt = o(1) ∫_0^∞ P(|Xi| > t) dt = o(1) E(|Xi|),

the WLLN follows from Theorem 5.2.


Example 5.2. The St. Petersburg Paradox. Let X, X1, ..., Xn, ... be iid with P(X = 2^k) =
2^{−k}, k = 1, 2, .... Then E(X) = ∞ and

Sn/(n log n) → 1/log 2 in probability.

Proof. Notice that P(X ≥ 2^k) = 2^{−k+1}. Let kn ≈ log log n/log 2, mn = log n/log 2 + kn and
bn = 2^{mn} = 2^{kn} n ≈ n log n, with mn an integer. Then,

nP(X ≥ bn) = n 2^{−mn+1} ≈ (2n/n) 2^{−kn} → 0.

And

E(X^2 1_{|X|≤bn}) = Σ_{k=1}^{mn} 2^{2k} 2^{−k} = Σ_{k=1}^{mn} 2^k ≤ 2 × 2^{mn} = 2bn.

Then,

nE(X^2 1_{|X|≤bn})/bn^2 ≤ 2nbn/bn^2 = 2n/bn = 2n/(2^{kn} n) = 2 · 2^{−kn} → 0.

Let an = nE(X 1_{|X|≤bn}). Then

an = n Σ_{k=1}^{mn} 2^k 2^{−k} = n mn = n log n/log 2 + n kn ≈ bn/log 2.

The desired convergence then follows from Theorem 5.1, since an/bn → 1/log 2 and bn ≈ n log n.
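A simulation sketch of Example 5.2 (an added illustration, not part of the notes): sample sums of St. Petersburg payoffs, scaled by n log n, are typically near 1/log 2 ≈ 1.44, although the convergence is slow and occasional huge payoffs push the ratio up.

import math
import random

def petersburg():
    # returns 2^k with probability 2^{-k}, k = 1, 2, ...
    k = 1
    while random.random() < 0.5:
        k += 1
    return 2 ** k

n = 500_000
s = sum(petersburg() for _ in range(n))
print(s / (n * math.log(n)), 1 / math.log(2))   # the ratio is typically of this size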


Example 5.3. "Unfair fair game". You pay one dollar to buy a lottery ticket. The lottery has
an infinite number of numbered balls. If the ball numbered k occurs, you are paid 2^k dollars. The number-k
ball occurs with probability

pk ≡ 1/(2^k k(k + 1)).

Is this a fair game?
In a sense, it is fair. Let X be the gain/loss of one play. Then P(X = 2^k − 1) = pk, k = 1, 2, ...,
and P(X = −1) = 1 − Σ_k pk. Then E(X) = 0.
Suppose one buys the lottery on a daily basis, one ticket every day. Let Xn be the gain/loss of day n and Sn be
the cumulative gain/loss up to day n. Then,

Sn/(n/log n) → −log 2 in probability,

meaning that in the long run, he/she is almost certainly in the red.

Example 5.4. Compute the limit of

∫_0^1 ··· ∫_0^1 (x1^2 + ··· + xn^2)/(x1 + ··· + xn) dx1 ··· dxn.

Solution. The above integral is the same as

E[(X1^2 + ··· + Xn^2)/(X1 + ··· + Xn)],

where X1, ..., Xn, ... are iid ∼ Unif[0, 1]. Since, by the WLLN,

(1/n) Σ_{i=1}^n Xi^2 → E(X1^2) = ∫_0^1 x^2 dx = 1/3   and   (1/n) Σ_{i=1}^n Xi → E(X1) = 1/2,

with the convergence being convergence in probability, we have

(X1^2 + ··· + Xn^2)/(X1 + ··· + Xn) → 2/3 in probability.

The r.v. on the left-hand side is bounded by 1. By dominated convergence, its mean also
converges to 2/3. Then the limit of the integral is 2/3.
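A Monte Carlo sketch of Example 5.4 (added, not in the notes): even for moderate dimension n, the integrand averages out to roughly 2/3.

import random

def ratio(n):
    xs = [random.random() for _ in range(n)]
    return sum(x * x for x in xs) / sum(xs)

n, trials = 200, 20_000
estimate = sum(ratio(n) for _ in range(trials)) / trials
print(estimate)          # close to 2/3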
Remark. The following WLLN for arrays of r.v.s is a slight generalization of Theorem 5.1.
Suppose Xn,1, ..., Xn,n are independent r.v.s. If

Σ_{i=1}^n P(|Xn,i| > bn) → 0   and   (1/bn^2) Σ_{i=1}^n E(Xn,i^2 1_{|Xn,i|≤bn}) → 0,

then

(Σ_{i=1}^n Xn,i − an)/bn → 0   in probability,

where an = Σ_{i=1}^n E(Xn,i 1_{|Xn,i|≤bn}).

DIY Exercises.
Exercise 5.1 (Levy's Inequality) Suppose X1, X2, ... are independent and symmetric about 0.
Then,

P(max_{1≤j≤n} |Sj| ≥ ǫ) ≤ 2P(|Sn| ≥ ǫ).

Exercise 5.2 Show Sn/(n/log n) → −log 2 in probability in Example 5.3. Hint: Choose bn = 2^{mn}
with mn = min{k : 2^{−k} k^{−3/2} ≤ 1/n} and proceed as in Example 5.2.
Exercise 5.3 For Example 1.4, prove that Sn /bn → 0 in probability, if bn /(n/ log n) ↑ ∞.
Exercise 5.4 (Marcinkiewicz-Zygmund weak law of large numbers) Suppose x^p P(|X| > x) → 0 as x → ∞ for some 0 < p < 2. Prove that

[S_n − nE(X 1_{|X|≤n^{1/p}})]/n^{1/p} → 0  in probability.

Chapter 6. Strong law of large numbers.


Pn
For r.v.s X_1, X_2, ..., convergence of the series means the convergence of its partial sums S_n = Σ_{i=1}^n X_i, as n → ∞. We shall denote the a.s. convergence of S_n simply as Σ_{n=1}^∞ X_n < ∞ a.s.. The following Kolmogorov inequality is the key to establishing a.s. convergence of series for independent r.v.s.
(i). Kolmogorov inequality.
Theorem 6.1. Kolmogorov inequality Suppose X_1, X_2, ..., X_n are independent with E(X_i) = 0 and var(X_i) < ∞. Let S_j = X_1 + ... + X_j. Then,

P(max_{1≤j≤n} |S_j| ≥ ǫ) ≤ var(S_n)/ǫ².

Proof. Let T = min{j ≤ n : |S_j| ≥ ǫ}, with the minimum of the empty set being ∞, i.e., T = ∞ if |S_j| < ǫ for all 1 ≤ j ≤ n. Then, {T ≤ j} or {T = j} only depends on X_1, ..., X_j. As a result,

{T ≥ j} = {T ≤ j − 1}^c = {|S_i| < ǫ, 1 ≤ i ≤ j − 1}

only depends on X_1, ..., X_{j−1} and therefore is independent of X_j, X_{j+1}, .... Write

P(max_{1≤j≤n} |S_j| ≥ ǫ) = P(T ≤ n) ≤ ǫ^{−2} E(|S_T|² 1_{T≤n}) ≤ ǫ^{−2} E(|S_{T∧n}|²)

= ǫ^{−2} E(|Σ_{j=1}^{T∧n} X_j|²) = ǫ^{−2} E(|Σ_{j=1}^{n} X_j 1_{T≥j}|²)

= ǫ^{−2} {Σ_{j=1}^n E(X_j² 1_{T≥j}) + 2 Σ_{1≤i<j≤n} E(X_j X_i 1_{T≥j} 1_{T≥i})}

= ǫ^{−2} {Σ_{j=1}^n E(X_j²) P(T ≥ j) + 2 Σ_{1≤i<j≤n} E(X_j) E(X_i 1_{T≥j} 1_{T≥i})}

= ǫ^{−2} Σ_{j=1}^n E(X_j²) P(T ≥ j) + 0

≤ var(S_n)/ǫ². □
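As a quick sanity check of the inequality, the sketch below (Python/numpy; the parameters are arbitrary choices of ours) estimates P(max_{j≤n} |S_j| ≥ ǫ) for independent ±1 steps and compares it with the bound var(S_n)/ǫ².

    import numpy as np

    rng = np.random.default_rng(3)

    n, eps, reps = 200, 20.0, 20000
    steps = rng.choice([-1.0, 1.0], size=(reps, n))     # independent, mean 0, variance 1
    s = np.cumsum(steps, axis=1)                        # partial sums S_1, ..., S_n
    lhs = (np.abs(s).max(axis=1) >= eps).mean()         # estimated P(max_j |S_j| >= eps)
    rhs = n / eps ** 2                                  # var(S_n)/eps^2
    print("P(max |S_j| >= eps) ~", round(lhs, 3), "<= bound", rhs)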


Example 6.1. (Extension to continuous time process.) Suppose {S_t : t ∈ [0, ∞)} is a process with increments that are independent, of zero mean and finite variance. If the path of S_t is right continuous, then

P(max_{t∈[0,τ]} |S_t| > ǫ) ≤ var(S_τ)/ǫ².
The examples of such processes are, e.g., compensated Poisson process and Brownian Motion.
Kolmogorov’s inequality will later on be seen as a special case of martingale inequality. In the proof
of Kolmogorov inequality, we have used a stopping time T , which is a r.v. associated with a process
Sn or, more generally, a filtration, such that T = k only depends on past and current values of the
process: S1 , ..., Sk . Stopping time is one of the most important concepts and tools in martingale
theory or stochastic processes.

(ii). Khintchine-Kolmogorov convergence theorem.


Theorem 6.2. (Khintchine-Kolmogorov Convergence Theorem) Suppose X_1, X_2, ... are independent with mean 0 such that Σ_n var(X_n) < ∞. Then, Σ_n X_n < ∞ a.s., i.e., S_n converges a.s., as well as in L², to Σ_{n=1}^∞ X_n.

Proof. Define A_{m,ǫ} = {max_{j>m} |S_j − S_m| ≤ ǫ}. Then, {Σ_{n=1}^∞ X_n < ∞} = ∩_{ǫ>0} ∪_m A_{m,ǫ}. By Kolmogorov's inequality,

P(max_{m<j≤n} |S_j − S_m| > ǫ) ≤ var(S_n − S_m)/ǫ² = (1/ǫ²) Σ_{i=m+1}^n var(X_i) ≤ (1/ǫ²) Σ_{i=m+1}^∞ var(X_i).

By letting n → ∞ first and then m → ∞, we have

lim_{m→∞} P(sup_{j>m} |S_j − S_m| > ǫ) = 0.

Then lim_m P(A_{m,ǫ}) = 1. So P(∪_{m≥1} A_{m,ǫ}) = 1 for every ǫ > 0. Hence,

P(Σ_n X_n < ∞) = P(∩_{ǫ>0} ∪_m A_{m,ǫ}) = 1.

And a.s. convergence of Sn holds. Denote the a.s. limit as S∞ .


To show convergence of S_n in L², write

E[(S_n − S_∞)²] = E[(S_n − lim_k S_k)²] = E[lim_k (S_n − S_k)²]

≤ lim inf_k E[(S_n − S_k)²]   by Fatou's lemma

= lim inf_k Σ_{j=n+1}^k var(X_j) = Σ_{j=n+1}^∞ var(X_j),

which tends to 0 as n → ∞. Therefore convergence in L² holds. □


Example 6.2. Suppose X_1, ... are iid with zero mean and finite variance. Then Σ_n a_n X_n < ∞ a.s. if and only if Σ_n a_n² < ∞.

"⇐=" is a direct consequence of Theorem 6.2. "=⇒" follows from the central limit theorem to be shown in Chapter 8.
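A concrete instance is the random signed harmonic series Σ_n ±1/n with independent fair signs, where a_n = 1/n and Σ_n a_n² < ∞. The sketch below (Python/numpy, illustrative only) follows one sample path of the partial sums, which should settle down to a finite, but random, limit; each run gives a different limit.

    import numpy as np

    rng = np.random.default_rng(4)

    n = 10**6
    signs = rng.choice([-1.0, 1.0], size=n)      # iid, mean 0, variance 1
    a = 1.0 / np.arange(1, n + 1)                # sum of a_n^2 is finite
    partial = np.cumsum(signs * a)
    print(partial[[10**3 - 1, 10**4 - 1, 10**5 - 1, n - 1]])   # partial sums stabilize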

(iii). Kolmogorov three series theorem


For independent random variables, Kolmogorov three series theorem is the ultimate result in pro-
viding sufficient and necessary conditions for the convergence of series a.s..
Theorem 6.3. (Kolmogorov Three Series Theorem) Suppose X_1, X_2, ... are independent. Let Y_n = X_n 1_{|X_n|≤1}. Then, Σ_n X_n < ∞ a.s. if and only if (1). Σ_n P(|X_n| > 1) < ∞; (2). Σ_n E(Y_n) < ∞; and (3). Σ_n var(Y_n) < ∞.
Proof. "⇐=": The convergence of Σ_n (Y_n − E(Y_n)) is implied by (3) and Theorem 6.2. Together with (2), it ensures Σ_n Y_n < ∞ a.s.. On the other hand, Condition (1) and the Borel-Cantelli lemma imply P(X_n ≠ Y_n, i.o.) = 0. Consequently, Σ_n X_n converges.
P
"=⇒" (An unconventional proof). It is straightforward that Condition (1) holds. Then Σ_n Y_n < ∞ a.s., since P(X_n ≠ Y_n, i.o.) = 0. If condition (3) does not hold, then by the central limit theorem to be shown in the next chapter,

(1/√(Σ_{i=1}^n var(Y_i))) Σ_{i=1}^n [Y_i − E(Y_i)] → N(0, 1)  in distribution.

Hence P(|Σ_{i=1}^n Y_i| ≤ M) → 0 as n → ∞ for any fixed M > 0, which contradicts Σ_n Y_n < ∞ a.s.. Hence condition (3) holds. Theorem 6.2 then ensures Σ_n (Y_n − E(Y_n)) < ∞ a.s.. As a result, Σ_n E(Y_n) < ∞ and condition (2) also holds. □
Remark. If X_n is truncated at any constant ǫ > 0 rather than at 1 in Theorem 6.3, the theorem still holds.

Corollary. Suppose X, X_1, X_2, ... are iid with E(|X|^p) < ∞ for some 0 < p < 2. Then, Σ_{n=1}^∞ [X_n − E(X)]/n^{1/p} < ∞ a.s. for 1 < p < 2; and Σ_{n=1}^∞ X_n/n^{1/p} < ∞ a.s. for 0 < p < 1.
We leave the proof as Exercise 6.2.

The strong law of large numbers (SLLN) is a central result in classical probability theory. The convergence of series established in the preceding sections paves the way towards proving the SLLN using the Kronecker lemma.

(iv). Kronecker lemma and Kolmogorov’s criterion of SLLN.


Kronecker Lemma. Suppose a_n > 0 and a_n ↑ ∞. Then Σ_n x_n/a_n < ∞ implies (1/a_n) Σ_{j=1}^n x_j → 0.
Proof. Set b_n = Σ_{i=1}^n x_i/a_i and a_0 = b_0 = 0. Then, b_n → b_∞ < ∞ and x_n = a_n(b_n − b_{n−1}). Write

(1/a_n) Σ_{j=1}^n x_j = (1/a_n) Σ_{j=1}^n a_j(b_j − b_{j−1}) = (1/a_n)[Σ_{j=1}^n a_j b_j − Σ_{j=1}^n a_j b_{j−1}]

= b_n + (1/a_n)[Σ_{j=1}^{n−1} a_j b_j − Σ_{j=1}^n a_j b_{j−1}] = b_n + (1/a_n)[Σ_{j=1}^n a_{j−1} b_{j−1} − Σ_{j=1}^n a_j b_{j−1}]

= b_n − (1/a_n) Σ_{j=1}^n b_{j−1}(a_j − a_{j−1})

→ b_∞ − b_∞ = 0. □


The following proposition is an immediate application of the Kronecker lemma and the Khintchine-
Kolmogorov convergence of series.
Proposition (Kolmogorov’s criterion of SLLN). Suppose X1 , X2 ..., are independent such that
E(Xn ) = 0 and n var(Xn )/n2 < ∞. Then, Sn /n → 0 a.e..
P

Proof. Consider the series ni=1 Xi /i < ∞, n ≥ 1. Then Theorem 6.2 implies n Xn /n < ∞ a.s..
P P
And the above Kronecker Lemma ensures Sn /n → 0 a.s.. 
Obviously, if X, X1 , X2 , ... are iid with finite variance, the above proposition implies the SLLN:
Sn /n → E(X) a.s.. In fact, a stronger result than the above SLLN is also straightforward:
Corollary. If X_1, X_2, ... are iid with mean µ and finite variance, then

(S_n − nµ)/√(n (log n)^δ) → 0  a.s.

for any δ > 1.


We leave the proof as an exercise.
The corollary gives a rate of a.s. convergence of sample mean Sn /n to population mean µ at a rate
n−1/2 (log n)δ with δ > 1/2. This is, although not the sharpest rate, close to the sharpest rate of
a.s. convergence at the rate n^{−1/2}(log log n)^{1/2} given in Kolmogorov's law of iterated logarithm:

lim sup_n (S_n − nµ)/√(2σ² n log log n) = 1  a.s.,
lim inf_n (S_n − nµ)/√(2σ² n log log n) = −1  a.s.,

for iid r.v.s with mean µ and finite variance σ 2 . We do not intend to cover the proofs of Kolmogorov’s
law of iterated logarithm.

(v) Kolmogorov’s strong law of large numbers.



The above SLLN requires finite second moments of the series. The most standard classical SLLN for iid r.v.s, established by Kolmogorov, holds as long as the population mean exists. From a statistical point of view, the sample mean shall always converge to the population mean as long as the population mean exists, without any further moment condition. In fact, the sample mean converges to a finite limit if and only if the population mean is finite, in which case the limit is the population mean.
Theorem 6.4. Kolmogorov’s strong law of large numbers. Suppose X, X1 , X2 , ... are iid and
E(X) exists. Then,
Sn /n → E(X), a.s..
Conversely, if Sn /n → µ which is finite, then µ = E(X).
Proof. Suppose first E(X1 ) = 0. We shall utilize the above proposition of Kolmogorov’s criterion
of SLLN. Consider
Yn = Xn 1{|Xn |≤n} − E(Xn 1{|Xn |≤n} ).
Write
Σ_{n=1}^∞ var(Y_n)/n² ≤ Σ_{n=1}^∞ (1/n²) E(X² 1_{|X|≤n}) = E(X² Σ_{n=1}^∞ (1/n²) 1_{|X|≤n})

≤ E(X² Σ_{n≥|X|∨1} 2/(n(n+1))) ≤ 2E(|X| + 1) < ∞.

It then follows from Kolmogorov's criterion of SLLN that

(1/n) Σ_{i=1}^n Y_i → 0  a.s..

Next, since E(X_n 1_{|X_n|≤n}) → E(X) = 0, we have (1/n) Σ_{i=1}^n E(X_i 1_{|X_i|≤i}) → 0. Hence,

(1/n) Σ_{i=1}^n X_i 1_{|X_i|≤i} → 0  a.s..

Observe that E(X) = 0 implies E|X| < ∞, and

E|X| < ∞ ⟺ Σ_n P(|X| > n) < ∞ ⟺ P(|X_n| > n, i.o.) = 0 ⟺ X_n/n → 0 a.s..

Therefore, (1/n) Σ_{i=1}^n X_i 1_{|X_i|>i} → 0 a.s.. As a result, the SLLN holds.
Suppose E(X) is finite but not necessarily 0. Then the SLLN holds by considering X_i − E(X), which has mean 0.
Suppose E(X) = ∞. Then, (1/n) Σ_{i=1}^n X_i ∧ C → E(X_1 ∧ C) a.s., which ↑ ∞ when C ↑ ∞. Since S_n ≥ Σ_{i=1}^n X_i ∧ C, the SLLN holds. Likewise for the case E(X) = −∞.
Conversely, if Sn /n → µ a.s. where µ is finite, Xn /n → 0 a.s.. Hence, E|X| < ∞ and µ = E(X)
by the SLLN just proved. 

Remark Kolmogorov’s SLLN also holds for r.v.s that are pairwise independent following the same
distribution, which is slightly more general. We have chosen to follow the historic development of
the classical probability theory.
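The dichotomy in Theorem 6.4 is easy to see in simulation. In the sketch below (Python/numpy; an illustration we add, not part of the notes), the running averages of exponential variables settle near the mean 2, while for Cauchy variables, whose mean does not exist, S_n/n keeps fluctuating along a single path.

    import numpy as np

    rng = np.random.default_rng(5)

    n = 10**6
    checkpoints = np.array([10**3, 10**4, 10**5, 10**6]) - 1
    denom = np.arange(1, n + 1)

    expo = rng.exponential(2.0, size=n)          # E(X) = 2: S_n/n should converge to 2 a.s.
    cauchy = rng.standard_cauchy(size=n)         # E(X) does not exist: S_n/n does not settle
    print("exponential:", np.round((np.cumsum(expo) / denom)[checkpoints], 3))
    print("cauchy     :", np.round((np.cumsum(cauchy) / denom)[checkpoints], 3))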

(vi). Strong law of large numbers when E(X) does not exist.
Kolmogorov’s SLLN in Theorem 6.4 already shows that the classical SLLN does not hold if E(X)
does not exist, i.e., E(X + ) = E(X − ) = ∞. The SLLN becomes quite complicated. We introduce
the theorem proved by W. Feller:

Proposition Suppose X, X_1, ... are iid with E|X| = ∞. Suppose a_n > 0 and a_n/n is nondecreasing. Then,

lim sup |S_n|/a_n = 0  a.s.  if Σ_n P(|X| ≥ a_n) < ∞;
lim sup |S_n|/a_n = ∞  a.s.  if Σ_n P(|X| ≥ a_n) = ∞.

The proof is somewhat technical but still along the same lines as that of Kolmogorov's SLLN. Interested students may refer to the textbook by Durrett (page 67). We omit the details.
Example 6.2. (The St. Petersburg Paradox) See Example 5.2, in which we have shown

S_n/(n log n) → 1/log 2  in probability.

Analogous to the calculation therein,

Σ_{n=2}^∞ P(X ≥ n log n) = Σ_{n=2}^∞ P(X ≥ 2^{log(n log n)/log 2}) ≥ Σ_{n=2}^∞ 2^{−log(n log n)/log 2} = Σ_{n=2}^∞ 1/(n log n) = ∞.

By the above proposition,

lim sup S_n/(n log n) = ∞  a.s..

On the other hand, one can also show with the same calculation that, for δ > 1,

lim sup S_n/(n (log n)^δ) = 0  a.s..

The following Marcinkiewicz-Zygmund SLLN is useful in connecting the rate of convergence with
the moments of the iid r.v.s.
Theorem 6.5. (Marcinkiewicz-Zygmund strong law of large numbers). Suppose X, X_1, X_2, ... are iid and E(|X|^p) < ∞ for some 0 < p < 2. Then,

[S_n − nE(X)]/n^{1/p} → 0  a.s.  for 1 ≤ p < 2;
S_n/n^{1/p} → 0  a.s.  for 0 < p < 1.

Proof. The case with p = 1 is Kolmogorov's SLLN. The cases with 0 < p < 1 and 1 < p < 2 are consequences of the corollary following Theorem 6.3 and the Kronecker lemma. □

Example 6.3 Suppose X, X1 , X2 , ... are iid and X is symmetric with P (X > t) = t−α for some
α > 0 and all large t.
(1). α > 2: Then, E(X 2 ) < ∞, Sn /n → 0 a.s. and, moreover, Kolmogorov’s law of iterated
logarithm gives the sharp rate of the a.s. convergence.
(2). 1 < α ≤ 2: for any 0 < p < α,

S_n/n^{1/p} → 0  a.s..
It implies that Sn /n converges to 0 a.s. at a rate faster than n−1+1/p , but not at the rate of
n−1+1/α . In particular, if α = 2, Sn /n converges to E(X) a.s. at a rate faster than n−β with any
0 < β < 1/2, but not at the rate of n−1/2 .
(3). 0 < α ≤ 1: E(X) does not exist. For any 0 < p < α,

S_n/n^{1/p} → 0  a.s..

Moreover, the above proposition implies

lim sup |S_n|/n^{1/α} = ∞  a.s.  and  S_n/(n^{1/α} (log n)^{δ/α}) → 0  a.s.

for any δ > 1.


Remark. In the above example, for 0 < α < 2, S_n/n^{1/α} converges in distribution to a nondegenerate distribution called a stable law. In particular, if α = 1, S_n/n converges in distribution to a Cauchy distribution. For α = 2, S_n/(n log n)^{1/2} converges to a normal distribution, and for α > 2, S_n/n^{1/2} converges to a normal distribution.


DIY Exercises.
Exercise 6.1. ⋆ ⋆ ⋆ Suppose S0 ≡ 0, S1 , S2 , ... form a square integrable martingale, i.e., for
k = 0, 1, ..., n, E(Sk2 ) < ∞ and E(Sk+1 |Fk ) = Sk where Fk is the σ-algebra generated by S1 , ..., Sk .
Show that Kolmogorov’s inequality still holds.
Exercise 6.2. ⋆ ⋆ ⋆ Prove the Corollary following Theorem 6.2.
Exercise 6.3. ⋆⋆⋆⋆ For positive independent r.v.s X_1, X_2, ..., show that the following three statements are equivalent: (a). Σ_n X_n < ∞ a.s.; (b). Σ_n E(X_n ∧ 1) < ∞; (c). Σ_n E(X_n/(1 + X_n)) < ∞.
Exercise 6.4. ⋆⋆⋆⋆ Give a counterexample to show that there exist X_1, X_2, ... iid with E(X) = 0 but Σ_n X_n/n not converging a.s..
Exercise 6.5 ⋆⋆⋆⋆ If X_1, ... are iid with mean µ and finite variance, then

(S_n − nµ)/√(n (log n)^δ) → 0  a.s.

for any δ > 1.


Exercise 6.6 ⋆⋆⋆ Suppose X, X_1, ... are iid. Then, (S_n − C_n)/n → 0 a.s. for some constants C_n if and only if E(|X|) < ∞.
Exercise 6.7 ⋆ ⋆ ⋆⋆ Suppose X, X1 , ... are iid with E(|X|p ) = ∞ for some 0 < p < ∞. Then,
lim sup |Sn |/n1/p = ∞ a.s..
Exercise 6.8 ⋆⋆⋆⋆ Suppose X_n, n ≥ 1 are independent with mean µ_n and variance σ_n² such that µ_n → 0 and Σ_{j=1}^n σ_j^{−2} → ∞. Show that

(Σ_{j=1}^n X_j/σ_j²)/(Σ_{j=1}^n σ_j^{−2}) → 0  a.s..

Hint: Consider the series Σ_{j=1}^n (X_j − µ_j)/(σ_j² Σ_{k=1}^j σ_k^{−2}).

Chapter 7. Convergence in distribution and characteristic functions.


Convergence in distribution, which can be generalized slightly to weak convergence of measures,
has been introduced in Chapter 3. This section provides a more detailed description.

(i). Definition, basic properties and examples.


Recall that in Chapter 3 we have already defined convergence in distribution for a sequence of random variables. Here we present the same definition in terms of weak convergence of their distributions. We first note that a function F is a cdf if and only if it is right continuous, nondecreasing, with F(t) → 1 and 0 as t → ∞ and −∞, respectively.
Definition. A sequence of distribution functions F_n is said to converge to another distribution function F_∞ weakly, if
(1) F_n(t) → F_∞(t) for every continuity point of F_∞; or
(2) lim inf_n F_n(B) ≥ F_∞(B) for every open set B in (−∞, ∞); or
(3) lim sup_n F_n(C) ≤ F_∞(C) for every closed set C in (−∞, ∞); or
(4) ∫ g(x) dF_n(x) → ∫ g(x) dF_∞(x) for every bounded continuous function g.
Here F_n(A) is defined as ∫_A dF_n(x) = ∫ 1_{x∈A} dF_n(x) for any Borel set A. The above four claims are equivalent to each other, as proved in Chapter 3.
Remark. If F∞ is continuous, the inequalities in (2) and (3) are actually equalities. On the other
hand, if Xn all takes integer values, then Xn → X in distribution is equivalent to P (Xn = k) →
P (X = k) for all integer values k.
Remark. (Sheffe’s Theorem) Suppose Xn has density function fn (·) and fn (t) → f (t) for every
finite t and f is a density function. Then, Xn → X in distribution, where X has density f . This
can be shown quite straightforwardly as follows:
Z Z
2 = lim inf (fn + f − |fn (x) − f (x)|)dx ≤ lim inf (fn (x) + f (x) − |fn (x) − f (x)|)dx
n n
 Z  Z
= lim inf 2 − |fn (x) − f (x)|dx = 2 − lim sup |fn (x) − f (x)|dx.
n n

Certainly, for any Borel set B,


Z Z
P (Xn ∈ B) − P (X ∈ B) = (fn (x) − f (x))dx ≤ |fn (x) − f (x)|dx → 0.
B


In the above proof, we have used Fatou lemma with Lebesgue measure. In fact, the monotone
convergence theorem, Fatou lemma and dominated convergence theorem that we have established
with probability measure all hold with σ-finite measures, including Lebesgue measure.
Remark. (Slutsky’s Theorem) Suppose Xn → X∞ in distribution and Yn → c in probability.
Then, Xn Yn → cX∞ in distribution and Xn + Yn → Xn − c in distribution.
We leave the proof as an exercise.

In the following, we provide some classical examples about convergence in distribution, only to show
that there are a variety of important limiting distributions besides the normal distribution as the
limiting distribution in CLT.
Example 7.1. (Convergence of maxima and extreme value distributions) Let Mn =
max1≤i≤n Xi where Xi are iid r.v.s with c.d.f. F (·). Then,

P (Mn ≤ t) = P (X1 ≤ t)n = F (t)n .



As n → ∞, the limiting distribution of properly scaled M_n, should it converge, should only be related to the right tail of the distribution F(·), i.e., F(x) for x large. The following are some examples.
(a). F(x) = 1 − x^{−α} for some α > 0 and all large x. Then, for any t > 0,

P(M_n/n^{1/α} ≤ t) = (1 − n^{−1} t^{−α})^n → e^{−t^{−α}}.

(b). F(x) = 1 − |x|^β for x ∈ [−1, 0] and some β > 0. Then, for any t < 0,

P(n^{1/β} M_n ≤ t) = (1 − n^{−1} |t|^β)^n → e^{−|t|^β}.

(c). F(x) = 1 − e^{−x} for x > 0, i.e., X_i follows the exponential distribution. Then for all t,

P(M_n − log n ≤ t) → e^{−e^{−t}}.
These limiting distributions are called extreme value distributions.
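Case (c) is simple to check numerically. The sketch below (Python/numpy, our illustration) simulates M_n − log n for exponential samples and compares the empirical distribution with the Gumbel limit exp(−e^{−t}).

    import numpy as np

    rng = np.random.default_rng(6)

    n, reps = 2000, 5000
    m = rng.exponential(1.0, size=(reps, n)).max(axis=1) - np.log(n)   # M_n - log n
    for t in [-1.0, 0.0, 1.0, 2.0]:
        print(t, round((m <= t).mean(), 3), "limit:", round(np.exp(-np.exp(-t)), 3))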
Example 7.2. (Birthday problem) Suppose X_1, X_2, ... are iid with uniform distribution on the integers {1, 2, ..., N}. Let
T_N = min{k : there exists a j < k such that X_j = X_k}.
Then, for k ≤ N,

P(T_N > k) = P(X_1, ..., X_k all take different values)

= Π_{j=2}^k [1 − P(X_j takes one of the values of X_1, ..., X_{j−1})]

= Π_{j=2}^k (1 − (j−1)/N) = exp{Σ_{j=1}^{k−1} log(1 − j/N)}.

Then, for any fixed x > 0, as N → ∞,

P(T_N/N^{1/2} > x) = P(T_N > N^{1/2} x) ≈ exp{Σ_{1≤j<N^{1/2}x} log(1 − j/N)}

≈ exp{−Σ_{1≤j<N^{1/2}x} j/N} ≈ exp{−(1/N) N^{1/2}x (N^{1/2}x + 1)/2} ≈ exp{−x²/2}.

In other words, T_N/N^{1/2} converges in distribution to the distribution F(t) = 1 − exp(−t²/2) for t ≥ 0. Suppose now N = 365. By this approximation, we have P(T_365 > 22) ≈ .5153 and P(T_365 > 50) ≈ .0326, meaning that with 22 (resp. 50) people there is about a half (resp. 3%) probability that all of them have different birthdays.
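The approximation can be compared with a direct simulation. The following sketch (Python/numpy; the helper name is ours) draws birthdays uniformly from {1, ..., 365} until the first repetition and compares the empirical tail of T_365 with exp(−k²/(2N)).

    import numpy as np

    rng = np.random.default_rng(7)
    N, reps = 365, 20000

    def first_repeat_time():
        seen = set()
        while True:
            b = int(rng.integers(1, N + 1))
            if b in seen:
                return len(seen) + 1        # index of the first repeated draw
            seen.add(b)

    t = np.array([first_repeat_time() for _ in range(reps)])
    for k in [22, 50]:
        print(k, round((t > k).mean(), 4), "approx:", round(np.exp(-k**2 / (2 * N)), 4))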
Example 7.3. (Law of rare events) Suppose there are in total n flights worldwide each year, and each flight has chance p_n of having an accident, independently of the rest. There are on average λ accidents a year worldwide. The distribution of the number of accidents is Bin(n, p_n) with np_n close to λ. Then this distribution approximates the Poisson distribution with mean λ, namely,
Bin(n, p_n) → P(λ)  if n → ∞ and np_n → λ > 0.

Proof. For any fixed k ≥ 0, and n ≥ k,

P(Bin(n, p_n) = k) = \binom{n}{k} p_n^k (1 − p_n)^{n−k} = [n!/(k!(n−k)!)] (np_n)^k (1 − p_n)^n / [n^k (1 − p_n)^k]

= (1/k!) [n(n−1)···(n−k+1)/n^k] (np_n)^k e^{n log(1−p_n)} / (1 − p_n)^k

→ λ^k e^{−λ}/k!,  as n → ∞. □
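Numerically, the approximation is already quite accurate for moderate n. The sketch below (plain Python, our illustration) compares the Bin(n, λ/n) and Poisson(λ) probability functions for n = 500 and λ = 2.

    from math import comb, exp, factorial

    n, lam = 500, 2.0
    p = lam / n
    for k in range(6):
        binom = comb(n, k) * p**k * (1 - p)**(n - k)
        poisson = lam**k * exp(-lam) / factorial(k)
        print(k, round(binom, 5), round(poisson, 5))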


Example 7.4. (The secretary/marriage problem) Suppose there are n secretaries to be interviewed one by one and, right after each interview, you must make an immediate decision to "hire or fire" the interviewee. You observe only the relative ranks of the interviewed candidates. What is the optimal strategy to maximize the chance of hiring the best of the n candidates? (Assume no ties of performance.)
One type of strategy is to give up the first m candidates, whatever their performance in the interview. Afterwards, the first one that outperforms all previous candidates is hired. In other words, starting from the (m + 1)-th interview, the first candidate that outperforms the first m candidates is hired. Or else you settle with the last candidate. The chance that the k-th best among all n candidates is hired is
P_k = Σ_{j=m+1}^n P(the k-th best is the j-th interviewee and is hired)

= Σ_{j=m+1}^n (1/n) P(the best among the first j − 1 appears in the first m, the j-th candidate is the k-th best, and the k − 1 better candidates all appear after the j-th candidate)

≈ Σ_{j=m+1}^n [m/(j − 1)] × (1/n) × ((n − j)/n)^{k−1}.

Let n → ∞, and m ≈ nc where c is the percentage of the interviews to be given up. Then the probability of hiring the k-th best is

P_k ≈ Σ_{j=m}^n c (1/j)(1 − j/n)^{k−1} ≈ c ∫_c^1 [(1 − x)^{k−1}/x] dx = c A_k, say.

Since A_{k+1} = A_k − (1 − c)^k/k, for k ≥ 1, and A_1 = −log c, it follows that

P_k → c (−log c − Σ_{j=1}^{k−1} (1 − c)^j/j),  as n → ∞.

In particular, P_1 → −c log c. The function −c log c is maximized at c = 1/e ≈ 0.368. The best strategy is to give up the first 36.8% of the interviews and then hire the first candidate who is the best to date. The chance of hiring the best overall is also about 36.8%. The chance of ending up with the last person is also c. This phenomenon is also called the 1/e law. □
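The 1/e strategy is also easy to test by simulation. The following sketch (Python/numpy; the function name is ours) applies the rule with m ≈ n/e to random orderings and estimates the chance of ending up with the overall best, which should be close to 1/e ≈ 0.368.

    import numpy as np

    rng = np.random.default_rng(8)

    def hires_best(n, m):
        quality = rng.permutation(n)             # quality[i] of candidate i; larger is better
        threshold = quality[:m].max()
        for j in range(m, n):
            if quality[j] > threshold:           # first candidate beating all previous ones
                return quality[j] == n - 1       # did we hire the overall best?
        return quality[-1] == n - 1              # otherwise we settle with the last candidate

    n, reps = 100, 20000
    m = int(n / np.e)
    hit_rate = np.mean([hires_best(n, m) for _ in range(reps)])
    print(m, round(hit_rate, 3), "theory ~ 1/e =", round(1 / np.e, 3))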
You may try to formulate this problem in terms of a sequence of random variables.

(ii). Some theoretical results about convergence in distribution.


(a). Fatou Lemma Suppose X_n ≥ 0 and X_n → X_∞ in distribution. Then E(X_∞) ≤ lim inf_n E(X_n).
Proof. Write

E(X_∞) = ∫_0^∞ P(X_∞ ≥ t) dt ≤ ∫_0^∞ lim inf_n P(X_n ≥ t) dt ≤ lim inf_n ∫_0^∞ P(X_n ≥ t) dt ≤ lim inf_n E(X_n). □


The dominated convergence theorem also holds with convergence in distribution, which is left as
an exercise.
(b). Continuous mapping theorem: Xn → X∞ in distribution and g(·) is a continuous function.
Then, g(Xn ) → g(X∞ ) in distribution.

Proof. For any bounded continuous function f , f (g(·)) is still bounded continuous function. Hence
E(f (g(Xn ))) → E(f (g(X∞ ))), proving that g(Xn ) → g(X∞ ) in distribution. 
(c). Tightness and convergent subsequences.
In studying the convergence of a sequence of numbers, it is very useful that boundedness of the
sequence, guarantees a convergent subsequence. The same is true for uniformly bounded monotone
functions, such as, for example, distribution functions. This is the following Helly’s Selection
theorem, which is useful in studying weak convergence of distributions.
Helly’s Selection Theorem. A sequence of cumulative distribution functions Fn always con-
tains a subsequence, say Fnk , that converges to a function, say F∞ , which is nondecreasing and
right continuous, at every continuity point of F∞ . If F∞ (−∞) = 0 and F∞ (∞) = 1. Then, F∞ is
a distribution function and Fnk converges to F weakly.
Proof Let t_1, t_2, ... be an enumeration of all rational numbers. In the sequence F_n(t_1), n ≥ 1, there is always a convergent subsequence; denote its indices as n_k^{(1)}, k = 1, 2, .... Among this subsequence there is again a further subsequence, denoted as n_k^{(2)}, k = 1, 2, ..., with n_1^{(2)} > n_1^{(1)}, such that F_{n_k^{(2)}}(t_2) is convergent. Repeat this process of selection indefinitely. Let n_k = n_1^{(k)} be the first element of the k-th sub-subsequence. Then, for any fixed m, {n_k : k ≥ m} is always a subsequence of {n_k^{(l)} : k ≥ 1} for all l ≤ m. Hence F_{n_k} is convergent at every rational number. Denote the limit as F*(t_l) at every rational t_l. Monotonicity of F_{n_k} implies the monotonicity of F* on the rational numbers. Define, for all t, F_∞(t) = inf{F*(t_l) : t_l > t, t_l rational}. Then, F_∞ is right continuous and nondecreasing. This construction ensures that, if s is a continuity point of F_∞, F_{n_k}(s) → F_∞(s). □

Not every sequence of distributions F_n converges weakly to a distribution function. The easiest example is F_n({n}) = F_n(n) − F_n(n−) = 1, i.e., P(X_n = n) = 1. Then, F_n(t) → 0 for all t ∈ (−∞, ∞). If the F_n all have little probability mass near ∞ or −∞, then convergence to a function which is not a distribution function can be avoided. A sequence of distribution functions F_n is called tight if, for any ǫ > 0, there exists an M > 0 such that lim sup_{n→∞} (1 − F_n(M) + F_n(−M)) < ǫ; or, in other words,

sup_n (1 − F_n(x) + F_n(−x)) → 0  as x → ∞.

Proposition. Every tight sequence of distribution functions contains a subsequence that weakly converges to a distribution function.
Proof Repeat the proof of Helly's Selection Theorem. The tightness ensures the limit is a distribution function. □

(iii). Characteristic functions.

Characteristic function is one of the most useful tools in developing theory about convergence in
distribution. The technical details of characteristic functions involve some knowledge of complex
analysis. We shall view them as only a tool and try not to elaborate the technicalities.
1◦ . Definition and examples.
For a r.v. X with distribution F, its characteristic function is

ψ(t) = E(e^{itX}) = E(cos(tX) + i sin(tX)) = ∫ e^{itx} dF(x),  t ∈ (−∞, ∞),

where i = √−1.
Some basic properties are:

ψ(0) = 1; |ψ(·)| ≤ 1; ψ(·) is continuous on (−∞, ∞)

If ψ is characteristic function of X, then eitb ψ(at) is characteristic function of aX + b.



Product of characteristic functions is still a characteristic function. And the characteristic function
of X1 + ... + Xn is the product of those of X1 , ..., Xn .
The following table lists the characteristic functions of some commonly used distributions:

Distribution | Density/Probability function | Characteristic function (of t)
Degenerate | P(X = a) = 1 | e^{iat}
Binomial Bin(n, p) | P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k}, k = 0, 1, ..., n | (pe^{it} + 1 − p)^n
Poisson P(λ) | P(X = k) = λ^k e^{−λ}/k!, k = 0, 1, ... | exp(λ(e^{it} − 1))
Normal N(µ, σ²) | f(x) = e^{−(x−µ)²/(2σ²)}/√(2πσ²), x ∈ (−∞, ∞) | e^{iµt − σ²t²/2}
Uniform Unif[0, 1] | f(x) = 1, x ∈ [0, 1] | (e^{it} − 1)/(it)
Gamma(α, λ) | f(x) = λ^α x^{α−1} e^{−λx}/Γ(α), x > 0 | (1 − it/λ)^{−α}
Cauchy | f(x) = 1/[π(1 + x²)], x ∈ (−∞, ∞) | e^{−|t|}

2◦ . Levy’s inversion formula.


Proposition Suppose X is r.v. with characteristic function ψ(·). Then, for all a < b,
T
1 e−ita − e−itb 1
Z
lim ψ(t)dt = P (a < X < b) + (P (X = a) + P (X = b)).
n→∞ 2π −T it 2

Proof. The proof R∞uses Fubini’s theorem to interchange the the expectation with the integration
and the fact that 0 sin(x)/xdx = π/2. We omit the proof.
The above theorem clearly implies that two different distribution cannot have same characteristic
function, as formally presented in the following corollary.
Corollary. There is one-to-one correspondence between distribution functions and characteristic
functions.

3◦ . Levy’s continuity theorem.


Theorem 7.1 Levy's continuity theorem. Let F_n, F_∞ be cdfs with characteristic functions ψ_n, ψ_∞. Then,
(a). If F_n → F_∞ weakly, then ψ_n(t) → ψ_∞(t) for every t.
(b). If ψ_n(t) → ψ(t) for every t, and ψ(·) is continuous at 0, then F_n → F weakly, where F is a cdf with characteristic function ψ.
Proof. Part (a) directly follows from the definition of convergence in distribution since e^{itx} is a bounded continuous function of x for every t. The proof of part (b) uses the Levy inversion formula. We omit the details.
Remark. Levy’s continuity theorem enables us to show convergence of distribution through point-
wise convergence of characteristic functions. This shall be our approach to establish the central
limit theorem.

DIY Exercises:
Exercise 7.1. ⋆ ⋆ ⋆ Prove Slutsky’s Theorem.
Exercise 7.2. ⋆ ⋆ ⋆ (Dominated convergence theorem) Suppose Xn → X∞ in distribution
and |Xn | ≤ Y with E(Y ) < ∞. Show that E(Xn ) → E(X∞ ).
Exercise 7.3. ⋆⋆ Suppose X_n is independent of Y_n, and X is independent of Y. Use characteristic functions to show that, if X_n converges to X in distribution and Y_n converges to Y in distribution, then X_n + Y_n converges in distribution to X + Y.

Chapter 8. Central limit theorem.

The most ideal case of the CLT is that the random variables are iid with finite variance. Although
it is a special case of the more general Lindeberg-Feller CLT, it is most standard and its proof
contains the essential ingredients to establish more general CLT. Throughout the chapter, Φ(·) is
the cdf of standard normal distribution N (0, 1).
(i). Central limit theorem (CLT) for iid r.v.s.
The following lemma plays a key role in the proof of CLT.
Lemma 8.1 For any real x and n ≥ 1,

|e^{ix} − Σ_{j=0}^n (ix)^j/j!| ≤ min(|x|^{n+1}/(n+1)!, 2|x|^n/n!).

Consequently, for any r.v. X with characteristic function ψ and finite second moment,

|ψ(t) − [1 + itE(X) − (t²/2)E(X²)]| ≤ (|t|²/6) E(min(|t||X|³, 6|X|²)).   (8.1)

Proof. The proof relies on the identity

e^{ix} − Σ_{j=0}^n (ix)^j/j! = (i^{n+1}/n!) ∫_0^x (x − s)^n e^{is} ds = (i^n/(n−1)!) ∫_0^x (x − s)^{n−1}(e^{is} − 1) ds,

which can be shown by induction and by taking derivatives. The middle term is bounded by |x|^{n+1}/(n+1)!, and the last by 2|x|^n/n!. □

Theorem 8.2 Suppose X, X_1, ..., X_n, ... are iid with mean µ and finite variance σ² > 0. Then,

(S_n − nµ)/√(nσ²) → N(0, 1)  in distribution.

Proof. Without loss of generality, let µ = 0. Let ψ be the common characteristic function of the X_i. Observe that, by dominated convergence,

E(min(|t_n||X|³, 6|X|²)) → 0  as |t_n| → 0.

The characteristic function of S_n/√(nσ²) is, by applying the above lemma,

E(e^{itS_n/√(nσ²)}) = Π_{j=1}^n E(e^{itX_j/√(nσ²)}) = ψ^n(t/√(nσ²))

= [1 + (it/√(nσ²)) E(X) − (t²/(2nσ²)) E(X²) + o(1/n)]^n = [1 − t²/(2n) + o(1/n)]^n

→ e^{−t²/2},

which is the characteristic function of N (0, 1). Then, Levy’s continuity theorem implies the above
CLT. 
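The CLT can be illustrated numerically even for a skewed distribution. The sketch below (Python/numpy; sample sizes are arbitrary choices of ours) standardizes sums of exponential variables and compares the empirical cdf with Φ at a few points.

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(9)

    n, reps = 500, 20000
    x = rng.exponential(1.0, size=(reps, n))            # mean 1, variance 1
    z = (x.sum(axis=1) - n) / np.sqrt(n)                # (S_n - n*mu)/sqrt(n*sigma^2)
    for t in [-2.0, -1.0, 0.0, 1.0, 2.0]:
        phi = 0.5 * (1 + erf(t / sqrt(2)))              # standard normal cdf at t
        print(t, round((z <= t).mean(), 3), "Phi(t) =", round(phi, 3))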

In the case the common variance is not finite, the partial sum, after proper normalization, may or
may not converge to a normal distribution. The following theorem provides sufficient and necessary
condition. The key point here is whether there exists appropriate truncation, which is a trick that
we have used so many times before.

Theorem 8.3 Suppose X, X_1, X_2, ... are iid and nondegenerate. Then, (S_n − a_n)/b_n converges to a normal distribution for some constants a_n and 0 < b_n → ∞, if and only if

x² P(|X| > x)/E(X² 1_{|X|≤x}) → 0,  as x → ∞.   (8.2)

The proof is omitted. We note that (8.2) holds if X_i has finite variance σ² > 0, in which case the CLT of Theorem 8.2 holds with a_n = nE(X) and b_n = √n σ. Theorem 8.3 is of interest when E(X²) = ∞. In this case, one can choose to truncate the X_i's at

c_n = sup{c : nE(|X|² 1_{|X|≤c})/c² ≥ 1}.

With some calculation, condition (8.2) ensures

nP(|X| > c_n) → 0  and  nE(|X|² 1_{|X|≤c_n})/c_n² → 1.

Separate S_n into two parts, one with the X_i beyond ±c_n and the other with the X_i bounded by ±c_n. The former takes value 0 with chance going to 1. The latter, when standardized by

a_n = nE(X 1_{|X|≤c_n})  and  b_n = √(nE(X² 1_{|X|≤c_n})) ≈ c_n,

converges to N(0, 1), which can be shown by repeating the proof of Theorem 8.2 or by citing the Lindeberg-Feller CLT. We note that b_n ≈ √(n var(X 1_{|X|≤c_n})) by (8.2).

Example 8.1 Recall Example 6.3, in which X, X_1, X_2, ... are iid symmetric such that P(|X| > x) = x^{−α} for some α > 0 and all large x. Then, Theorem 8.3 implies (S_n − a_n)/b_n → N(0, 1) if and only if α ≥ 2. Indeed, when α > 2, the common variance is finite and the CLT applies. When α = 2,

S_n/(n log n)^{1/2} → N(0, σ²)

for some σ².
When α < 2, the condition in Theorem 8.3 cannot hold. In fact, S_n, when properly normalized, converges to a non-normal distribution.

(ii). The Lindeberg-Feller CLT.


Theorem 8.4 Lindeberg-Feller CLT. Suppose X_1, ..., X_n, ... are independent r.v.s with mean 0 and variances σ_1², σ_2², .... Let s_n² = Σ_{j=1}^n σ_j² denote the variance of the partial sum S_n = X_1 + · · · + X_n. If, for every ǫ > 0,

(1/s_n²) Σ_{j=1}^n E(X_j² 1_{|X_j|>ǫs_n}) → 0,   (8.3)

then S_n/s_n → N(0, 1). Conversely, if max_{j≤n} σ_j²/s_n² → 0 and S_n/s_n → N(0, 1), then (8.3) holds.
Proof. "⇐=" The Lindeberg condition (8.3) implies

max_{1≤j≤n} (σ_j²/s_n²) ≤ ǫ² + (1/s_n²) max_{1≤j≤n} E(X_j² 1_{|X_j|>ǫs_n}) → 0,   (8.4)

by letting n → ∞ and then ǫ ↓ 0. Observe that for every real x > 0, |e^{−x} − 1 + x| ≤ x²/2. Moreover, for complex z_j and w_j with |z_j| ≤ 1 and |w_j| ≤ 1,

|Π_{j=1}^n z_j − Π_{j=1}^n w_j| ≤ Σ_{j=1}^n |z_j − w_j|,   (8.5)

which can be proved by induction. With Lemma 8.1, it follows that, for any ǫ > 0,

|E(e^{itX_j/s_n}) − e^{−t²σ_j²/(2s_n²)}|

≤ |E(1 + itX_j/s_n − (tX_j)²/(2s_n²)) − (1 − t²σ_j²/(2s_n²))| + E[min(t²X_j²/s_n², |tX_j|³/(6s_n³))] + t⁴σ_j⁴/(8s_n⁴)

≤ E((t²X_j²/s_n²) 1_{|X_j|>ǫs_n}) + E((|tX_j|³/(6s_n³)) 1_{|X_j|≤ǫs_n}) + t⁴σ_j⁴/(8s_n⁴)

≤ (t²/s_n²) E(X_j² 1_{|X_j|>ǫs_n}) + (|t|³ǫ/s_n²) E(X_j²) + (t⁴σ_j²/s_n²) max_{1≤k≤n} (σ_k²/s_n²).

Then, for any fixed t,

|E(e^{itS_n/s_n}) − e^{−t²/2}|

= |Π_{j=1}^n E(e^{itX_j/s_n}) − Π_{j=1}^n e^{−t²σ_j²/(2s_n²)}|

≤ Σ_{j=1}^n |E(e^{itX_j/s_n}) − e^{−t²σ_j²/(2s_n²)}|   by (8.5)

≤ Σ_{j=1}^n [(t²/s_n²) E(X_j² 1_{|X_j|>ǫs_n}) + (|t|³ǫ/s_n²) E(X_j²) + (t⁴σ_j²/s_n²) max_{1≤j≤n} (σ_j²/s_n²)]

≤ (t²/s_n²) Σ_{j=1}^n E(X_j² 1_{|X_j|>ǫs_n}) + ǫ|t|³ + t⁴ max_{1≤j≤n} (σ_j²/s_n²)

→ ǫ|t|³,  as n → ∞, by (8.3) and (8.4).

Since ǫ > 0 is arbitrary, it follows that E(e^{itS_n/s_n}) → e^{−t²/2} for all t. Levy's continuity theorem implies S_n/s_n → N(0, 1).
"=⇒" Let ψ_j be the characteristic function of X_j. The asymptotic normality is equivalent to Π_{j=1}^n ψ_j(t/s_n) → e^{−t²/2}. Notice that (8.1) implies

|ψ_j(t/s_n) − 1| ≤ t²σ_j²/s_n².   (8.6)

Write, as n → ∞,

Σ_{j=1}^n [ψ_j(t/s_n) − 1] + t²/2

= Σ_{j=1}^n [ψ_j(t/s_n) − 1 − log ψ_j(t/s_n)] + Σ_{j=1}^n log ψ_j(t/s_n) + t²/2

≤ Σ_{j=1}^n |ψ_j(t/s_n) − 1 − log ψ_j(t/s_n)| + o(1)

≤ Σ_{j=1}^n |ψ_j(t/s_n) − 1|² + o(1)

≤ max_{1≤k≤n} |ψ_k(t/s_n) − 1| × Σ_{j=1}^n |ψ_j(t/s_n) − 1| + o(1)

≤ 4 max_{1≤k≤n} (t²σ_k²/s_n²) × Σ_{j=1}^n (t²σ_j²/s_n²) + o(1)   by (8.6)

= o(1),  by the assumption max_{j≤n} σ_j²/s_n² → 0.



On the other hand, by the definition of the characteristic function, the above expression is, as n → ∞,

o(1) = Σ_{j=1}^n [ψ_j(t/s_n) − 1] + t²/2

= Σ_{j=1}^n E(e^{itX_j/s_n} − 1) + t²/2 = Σ_{j=1}^n E(cos(tX_j/s_n) − 1) + t²/2 + i Σ_{j=1}^n E(sin(tX_j/s_n))

= Σ_{j=1}^n E{(cos(tX_j/s_n) − 1) 1_{|X_j|>ǫs_n}} + Σ_{j=1}^n E{(cos(tX_j/s_n) − 1) 1_{|X_j|≤ǫs_n}} + t²/2

  + imaginary part (immaterial).

Since cos(x) − 1 ≥ −x²/2 for all real x,

(1/s_n²) Σ_{j=1}^n E(X_j² 1_{|X_j|>ǫs_n}) = 1 − (2/t²) Σ_{j=1}^n E((t²X_j²/(2s_n²)) 1_{|X_j|≤ǫs_n})

≤ (2/t²) [t²/2 + Σ_{j=1}^n E{(cos(tX_j/s_n) − 1) 1_{|X_j|≤ǫs_n}}]

≤ (2/t²) |Σ_{j=1}^n E{(cos(tX_j/s_n) − 1) 1_{|X_j|>ǫs_n}}| + o(1)

≤ (2/t²) Σ_{j=1}^n 2P(|X_j| > ǫs_n) + o(1)

≤ (4/t²) Σ_{j=1}^n σ_j²/(ǫs_n)² + o(1)   by the Chebyshev inequality

= 4/(t²ǫ²) + o(1).
Since t can be chosen arbitrarily large, Lindeberg condition holds. 

Remark. Sufficiency is proved by Lindeberg in 1922 and necessity by Feller in 1935. Lindeberg-
Feller CLT is one of the most far-reaching results in probability theory. Nearly all generalizations
of various types of central limit theorems spin off from the Lindeberg-Feller CLT, such as, for example, CLTs for martingales, for renewal processes, or for weakly dependent processes. The insights of the
Lindeberg condition (8.3) are that the “wild” values of the random variables, compared with sn ,
the standard deviation of Sn as the normalizing constant, are insignificant and can be truncated
off without affecting the general behavior of the partial sum Sn .
Example 8.2. Suppose the X_n are independent and

P(X_n = n) = P(X_n = −n) = n^{−α}/4  and  P(X_n = 0) = 1 − n^{−α}/2,

with 0 < α < 3. Then, σ_n² = E(X_n²) = n^{2−α}/2 and s_n² = Σ_{j=1}^n j^{2−α}/2, which increases to ∞ at the order of n^{3−α}. Note that the Lindeberg condition (8.3) is equivalent to n²/n^{3−α} → 0, i.e., 0 < α < 1. On the other hand, max_{1≤j≤n} σ_j²/s_n² → 0. Therefore, it follows from Theorem 8.4 that S_n/s_n → N(0, 1) if and only if 0 < α < 1. □

Example 8.3 Suppose the X_n are independent and P(X_n = 1) = 1/n = 1 − P(X_n = 0). Then,

[S_n − log(n)]/√(log(n)) → N(0, 1)  in distribution.

It's clear that E(X_n) = 1/n and var(X_n) = (1 − 1/n)/n. So, E(S_n) = Σ_{i=1}^n 1/i, and var(S_n) = Σ_{i=1}^n (1 − 1/i)/i ≈ log(n). As the X_n are all bounded by 1 and var(S_n) ↑ ∞, the Lindeberg condition is satisfied. Therefore, by the CLT,

[S_n − Σ_{i=1}^n 1/i]/[Σ_{i=1}^n (1 − 1/i)/i]^{1/2} → N(0, 1),  in distribution.

Then, [S_n − log(n)]/√(log(n)) → N(0, 1) in distribution, since |log(n) − Σ_{i=1}^n 1/i| ≤ 1 and var(S_n)/log(n) → 1. □
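A quick simulation sketch of this example follows (Python/numpy, illustrative; the truncation at n = 10^5 is our choice). Each replication draws independent Bernoulli(1/k) indicators and the standardized count is compared with the standard normal cdf.

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(10)

    n, reps = 10**5, 3000
    p = 1.0 / np.arange(1, n + 1)
    s = np.array([(rng.random(n) < p).sum() for _ in range(reps)])   # S_n with P(X_k = 1) = 1/k
    z = (s - np.log(n)) / np.sqrt(np.log(n))
    for t in [-1.0, 0.0, 1.0]:
        print(t, round((z <= t).mean(), 3), "Phi(t) =", round(0.5 * (1 + erf(t / sqrt(2))), 3))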

Theorem 8.2 as well as the following Lyapunov CLT are both special cases of the Lindeberg-Feller CLT. Nevertheless they are convenient for applications.
Corollary (Lyapunov CLT) Suppose the X_n are independent with mean 0 and Σ_{j=1}^n E(|X_j|^δ)/s_n^δ → 0 for some δ > 2. Then S_n/s_n → N(0, 1).
Proof. For any ǫ > 0, as n → ∞,

(1/s_n²) Σ_{j=1}^n E(X_j² 1_{|X_j|>ǫs_n}) = Σ_{j=1}^n E((X_j²/s_n²) 1_{|X_j|/s_n>ǫ}) ≤ (1/ǫ^{δ−2}) Σ_{j=1}^n E(|X_j|^δ/s_n^δ) → 0.

The Lindeberg condition (8.3) holds and hence the CLT holds. □


Pn Pn
In Example 8.2, for any δ > 2, j=1 E|Xj |δ = j=1 j δ j −α /2 which increasing at the order nδ−α+1 ,
while sδn increases at the order of n(3−α)δ/2 . Simple calculation shows, when 0 < α < 1, Lypunov
CLT holds.

(iii). CLT for arrays of random variables.


Very often Lindeberg-Feller CLT is presented in the form of arrays of random variables as given in
the textbook.
Theorem 8.5 (CLT for arrays of r.v.s) Let X_{n,1}, ..., X_{n,n} be n independent random variables with mean 0 such that, as n → ∞,

Σ_{j=1}^n var(X_{n,j}) → 1  and  Σ_{j=1}^n E(X_{n,j}² 1_{|X_{n,j}|>ǫ}) → 0, for any ǫ > 0.

Then, S_n ≡ X_{n,1} + · · · + X_{n,n} → N(0, 1).


This theorem is slightly more general than the Lindeberg-Feller CLT, although the proof is identical to that of the first part of Theorem 8.4. Theorem 8.4 is a special case of Theorem 8.5, obtained by letting X_{n,i} = X_i/s_n. Thus the X_{n,k} are understood as the usual r.v.s normalized by the standard deviation of the partial sums, and S_n in this theorem is already standardized.

DIY Exercises
Exercise 8.1 ⋆ ⋆ ⋆ Suppse Xn are independent with
1 1
P (Xn = nα ) = P (Xn = −nα ) = and P (Xn = 0) = 1 −
2nβ nβ
with 2α > β − 1. Show that the Lindeberg condition holds if and only if 0 ≤ β < 1.
Exercise 8.2 ⋆⋆⋆ Suppose the X_n are iid with mean 0 and variance 1. Let a_n > 0 be such that s_n² = Σ_{j=1}^n a_j² → ∞ and a_n/s_n → 0. Show that Σ_{i=1}^n a_i X_i/s_n → N(0, 1).
Exercise 8.3 ⋆⋆⋆ Suppose X_1, X_2, ... are independent and X_n = Y_n + Z_n, where Y_n takes values 1 and −1 with chance 1/2 each, and P(Z_n = ±n) = 1/(2n²) = (1 − P(Z_n = 0))/2. Show that the Lindeberg condition does not hold, yet S_n/√n → N(0, 1).
Exercise 8.4 ⋆⋆⋆ Suppose X_1, X_2, ... are iid nonnegative r.v.s with mean 1 and finite variance σ² > 0. Show that 2(√S_n − √n) → N(0, σ²).

Review of Probability Theory
1. Probability Calculation.
Calculation of probabilities of events for discrete outcomes (e.g., coin tossing, roll of dice, etc.)
Calculation of the probability of certain events for given density functions of 1 or 2 dimension.
2. Probability space.
(1). Set operations: ∪, ∩, complement.
(2). σ-algebra. Definition and implications (the collection of sets which is closed on set operations).
(3). Kolmogorov’s trio of probability space.
(4). Independence of events and conditional probabilities of events.
(5). Borel-Cantelli lemma.
(6). Sets and set-index functions. 1A∩B = 1A 1B and 1Ac = 1 − 1A . 1{An ,i.o.} = lim supn 1An . etc.

{A_n, i.o.} = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k = lim_{n→∞} ∪_{k=n}^∞ A_k = lim sup_n A_n  (lim sup_n 1_{A_n}(ω)).

ω ∈ {A_n, i.o.} means ω ∈ A_n for infinitely many A_n. (Mathematically precisely, there exists a subsequence n_k → ∞ such that ω ∈ A_{n_k} for all n_k.)

∪_{n=1}^∞ ∩_{k=n}^∞ A_k = lim_{n→∞} ∩_{k=n}^∞ A_k = lim inf_n A_n  (lim inf_n 1_{A_n}(ω)).

ω ∈ lim inf_n A_n means ω ∈ A_n for all large n. (Mathematically precisely, there exists an N such that ω ∈ A_n for all n ≥ N.)
(7). ⋆ completion of probability space.
3. Random variables.
(1). Definitions.
(2). c.d.f., density or probability functions.
(3). Expectation: definition, interpretation as weighted (by chance) average.
(4). Properties:
(i). Dominated convergence. (Proof and Application).
(ii). ⋆ Fatou’s lemma and monotone convergence.
(iii). Jensen’s inequalities
(iv) Chebyshev inequalities.
(5). Independence of r.v.s.
(6). ⋆ Conditional distribution and expectation given a σ-algebra (Definition and simple properties.)
(7). Commonly used distributions and r.v.s.
4. Convergence.
(1). Definition of a.e., in prob, Lp and in distribution convergence.
(2). ⋆ Equivalence of four definitions of in distribution convergence.
(3). The diagram about the relation among convergence modes.
(4). The technique of truncating off r.v.s
5. LLN.
(1). WLLN. Application of the theorem. (⋆ the proof)
(2). Kolmogorov’s inequality (⋆ the proof).
(3). Corollary (⋆ the proof)

(4). ⋆ Kolmogorov’s 3-series theorem


(5). Kolmogorov’s criterion for SLLN. (⋆ the proof)
(6) Kronecker lemma. (⋆ the proof)
(7). SLLN for iid r.v.s (⋆ the proof)
(8). Application.
6. CLT.
(1) Characteristic function. (Definition and simple properties.)
(i) ψ continuous. |ψ| ≤ 1 and ψ(0) = 1.
(ii) FX = FY =⇒ ψX = ψY . (⋆ proof of ⇐=.)
(iii) Fn → F =⇒ ψn → ψ. (⋆ proof of ⇐=.)
(iv) X and Y are independent =⇒ ψ_{X+Y} = ψ_X ψ_Y.
(2). CLT for iid r.v.s
Theorem, application and the proof for the case of bounded r.v.s.
(3). Lindeberg condition.
Application and Heuristic interpretation. (⋆ proof.)
(4). Application.
Remark. ⋆ means not required in the midterm exam.
