
NOTES ON PROBABILITY

Greg Lawler
Last Updated: March 21, 2016

Overview
This is an introduction to the mathematical foundations of probability theory. It is intended as a supplement
or follow-up to a graduate course in real analysis. The first two sections assume the knowledge of measure
spaces, measurable functions, Lebesgue integral, and notions of convergence of functions; the third assumes
Fubini’s Theorem; the fifth assumes knowledge of Fourier transform of nice (Schwartz) functions on R; and
Section 6 uses the Radon-Nikodym Theorem.
The mathematical foundations of probability theory are exactly the same as those of Lebesgue integration.
However, probability adds much intuition and leads to different developments of the area. These notes are
only intended to be a brief introduction — this might be considered what every graduate student should
know about the theory of probability.
Probability uses some different terminology than that of Lebesgue integration in R. These notes will
introduce the terminology and will also relate these ideas to those that would be encountered in an elementary
(by which we will mean pre-measure theory) course in probability or statistics. Graduate students
encountering probability for the first time might want to also read an undergraduate book in probability.

1 Probability spaces
Definition A probability space is a measure space with total measure one. The standard notation is (Ω, F, P)
where:

• Ω is a set (sometimes called a sample space in elementary probability). Elements of Ω are denoted ω
and are sometimes called outcomes.
• F is a σ-algebra (or σ-field, we will use these terms synonymously) of subsets of Ω. Sets in F are called
events.

• P is a function from F to [0, 1] with P(Ω) = 1 and such that if E1, E2, . . . ∈ F are disjoint,
\[ P\Big[\bigcup_{j=1}^{\infty} E_j\Big] = \sum_{j=1}^{\infty} P[E_j]. \]

We say “probability of E” for P(E).

A discrete probability space is a probability space such that Ω is finite or countably infinite. In this case
we usually choose F to be all the subsets of Ω (this can be written F = 2^Ω), and the probability measure P
is given by a function p : Ω → [0, 1] with ∑_{ω∈Ω} p(ω) = 1. Then,
\[ P(E) = \sum_{\omega \in E} p(\omega). \]
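To make the definition concrete, here is a minimal Python sketch of a finite discrete probability space. The helper name `make_measure` and the fair-die example are our own illustrative choices, not part of the notes; exact rational arithmetic is used so that the normalization ∑ p(ω) = 1 can be checked exactly.

```python
from fractions import Fraction


def make_measure(p):
    """Build P(E) = sum of p(w) over w in E from point masses p (a dict on Omega)."""
    if sum(p.values()) != 1:
        raise ValueError("point masses must sum to 1")

    def P(event):
        # An event is any subset of Omega, since F = 2^Omega here.
        return sum(p[w] for w in event)

    return P


# Illustrative example (an assumption): a fair six-sided die.
p_die = {w: Fraction(1, 6) for w in range(1, 7)}
P_die = make_measure(p_die)
even = {2, 4, 6}
print(P_die(even))
```

Because every subset of a countable Ω is an event, no measurability questions arise in this setting.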

We will consider another important example here, the probability space associated to an infinite number
of flips of a coin. Let
Ω = {ω = (ω1 , ω2 , . . .) : ωj = 0 or 1}.

We think of 0 as “tails” and 1 as “heads”. For each positive integer n, let

Ωn = {(ω1, ω2, . . . , ωn) : ωj = 0 or 1}.

Each Ωn is a finite set with 2^n elements. We can consider Ωn as a probability space with σ-algebra 2^{Ωn} and
probability Pn induced by
pn(ω) = 2^{−n}, ω ∈ Ωn.
Let Fn be the σ-algebra on Ω consisting of all events that depend only on the first n flips. More formally,
we define Fn to be the collection of all subsets A of Ω such that there is an E ∈ 2^{Ωn} with

A = {(ω1 , ω2 , . . .) : (ω1 , ω2 , . . . , ωn ) ∈ E}. (1)

Note that Fn is a finite σ-algebra (containing 2^{2^n} subsets) and

F1 ⊂ F2 ⊂ F3 ⊂ · · · .

If A is of the form (1), we let

P(A) = Pn (E).
One can easily check that this definition is consistent. This gives a function P on
\[ \mathcal{F}^0 := \bigcup_{j=1}^{\infty} \mathcal{F}_j. \]
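The consistency claim can be checked by brute force for small n. The following Python sketch (the helper names `P_n` and `lift` are ours, introduced only for illustration) takes an event determined by the first n flips, rewrites it as an event determined by the first m ≥ n flips, and confirms that Pn and Pm assign it the same probability.

```python
from fractions import Fraction
from itertools import product


def P_n(n, E):
    """Uniform probability on Omega_n = {0,1}^n: each point has mass 2^(-n)."""
    return Fraction(len(E), 2 ** n)


def lift(E, n, m):
    """View E (a set of n-tuples) as the set of m-tuples, m >= n,
    whose first n coordinates lie in E."""
    return {w + tail for w in E for tail in product((0, 1), repeat=m - n)}


# "Second flip is heads": an event that depends only on the first 2 flips.
E = {(0, 1), (1, 1)}
E3 = lift(E, 2, 3)
E5 = lift(E, 2, 5)
print(P_n(2, E), P_n(3, E3), P_n(5, E5))
```

Lifting multiplies the number of points by 2^(m−n) and the total number of outcomes by the same factor, which is exactly why the ratio is unchanged.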

Recall that an algebra of subsets of Ω is a collection of subsets containing Ω, closed under complementation,
and closed under finite unions. To show closure under finite unions, it suffices to show that if E1, E2 ∈ F⁰,
then E1 ∪ E2 ∈ F⁰.

Proposition 1.1. F⁰ is an algebra but not a σ-algebra.

Proof. Ω ∈ F⁰ since Ω ∈ F1. Suppose E ∈ F⁰. Then E ∈ Fn for some n, and hence E^c ∈ Fn and E^c ∈ F⁰.
Also, suppose E1, E2 ∈ F⁰. Then there exist j, k with E1 ∈ Fj, E2 ∈ Fk. Let n = max{j, k}. Then since
the σ-algebras are increasing, E1, E2 ∈ Fn and hence E1 ∪ E2 ∈ Fn. Therefore, E1 ∪ E2 ∈ F⁰ and F⁰ is an
algebra.
To see that F⁰ is not a σ-algebra consider the singleton set

E = {(1, 1, 1, . . .)}.

E is not in F⁰ but E can be written as a countable intersection of events in F⁰,
\[ E = \bigcap_{j=1}^{\infty} \{(\omega_1, \omega_2, \ldots) : \omega_1 = \omega_2 = \cdots = \omega_j = 1\}. \]

Proposition 1.2. The function P is a (countably additive) measure on F⁰, i.e., it satisfies P[∅] = 0, and if
E1, E2, . . . ∈ F⁰ are disjoint with ∪_{n=1}^∞ En ∈ F⁰, then
\[ P\Big[\bigcup_{n=1}^{\infty} E_n\Big] = \sum_{n=1}^{\infty} P(E_n). \]

Proof. P[∅] = 0 is immediate. Also it is easy to see that P is finitely additive, i.e., if E1, E2, . . . , En ∈ F⁰ are
disjoint then
\[ P\Big[\bigcup_{j=1}^{n} E_j\Big] = \sum_{j=1}^{n} P(E_j). \]

(To see this, note that there must be an N such that E1, . . . , En ∈ FN and then we can use the additivity
of PN.)
Showing countable additivity is harder. In fact, the following stronger fact holds: suppose E1, E2, . . . ∈
F⁰ are disjoint and E = ∪_{n=1}^∞ En ∈ F⁰. Then there is an N such that Ej = ∅ for j > N. Once we establish
this, countable additivity follows from finite additivity.
To establish this, we consider Ω as the topological space

{0, 1} × {0, 1} × · · ·

with the product topology where we have given each {0, 1} the discrete topology (all four subsets are open).
The product topology is the smallest topology such that all the sets in F⁰ are open. Note also that all sets
in F⁰ are closed since they are complements of sets in F⁰. It follows from Tychonoff’s Theorem that Ω is
a compact topological space under this topology. Suppose E1, E2, . . . are as above with E = ∪_{n=1}^∞ En ∈ F⁰.
Then E1, E2, . . . is an open cover of the closed (and hence compact) set E. Therefore there is a finite
subcover, E1, E2, . . . , EN. Since the E1, E2, . . . are disjoint, this must imply that Ej = ∅ for j > N.
Let F be the smallest σ-algebra containing F⁰. Then the Carathéodory Extension Theorem tells us that
P can be extended uniquely to a complete measure space (Ω, F̄, P) where F ⊂ F̄.

The astute reader will note that the construction we just did is exactly the same as the construction of
Lebesgue measure on [0, 1]. Here we denote a real number x ∈ [0, 1] by its dyadic expansion
\[ x = .\omega_1 \omega_2 \omega_3 \cdots = \sum_{j=1}^{\infty} \frac{\omega_j}{2^j}. \]

(There is a slight nuisance with the fact that .011111 · · · = .100000 · · · , but this can be handled.) The
σ-algebra F above corresponds to the Borel subsets of [0, 1] and the completion F̄ corresponds to the
Lebesgue measurable sets. If the most complicated probability space we were interested in were the space
above, then we could just use Lebesgue measure on [0, 1]. In fact, for almost all important applications
of probability, one could choose the measure space to be [0, 1] with Lebesgue measure (see Exercise 3).
However, this choice is not always the most convenient or natural.

Continuing on the last remark, one generally does not care what probability space one is working on.
What one observes are “random variables” which are discussed in the next section.

2 Random Variables and Expectation
Definition A random variable X is a measurable function from a probability space (Ω, F, P) to the reals¹,
i.e., it is a function
X : Ω −→ (−∞, ∞)
such that for every Borel set B,
X⁻¹(B) = {X ∈ B} ∈ F.
Here we use the shorthand notation
{X ∈ B} = {ω ∈ Ω : X(ω) ∈ B}.
If X is a random variable, then for every Borel subset B of R, X⁻¹(B) ∈ F. We can define a function
on Borel sets by
µX(B) = P{X ∈ B} = P[X⁻¹(B)].
This function is in fact a measure, and (R, B, µX ) is a probability space. The measure µX is called the
distribution of the random variable. If µX gives measure one to a countable set of reals, then X is called
a discrete random variable. If µX gives zero measure to every singleton set, and hence to every countable
set, X is called a continuous random variable. The distribution of every random variable can be written as
a convex combination of the distribution of a discrete random variable and the distribution of a continuous
random variable. All random variables defined on a discrete probability
space are discrete.
The distribution µX is often given in terms of the distribution function² defined by
FX(x) = P{X ≤ x} = µX(−∞, x].
Note that F = FX satisfies the following:
• limx→−∞ F (x) = 0.
• limx→∞ F (x) = 1.
• F is a nondecreasing function.
• F is right continuous, i.e., for every x,
\[ F(x+) := \lim_{\epsilon \downarrow 0} F(x + \epsilon) = F(x). \]

Conversely, any F satisfying the conditions above is the distribution function of a random variable. The
distribution can be obtained from the distribution function by setting
µX (−∞, x] = FX (x),
and extending uniquely to the Borel sets.
For some continuous random variables X, there is a function f = fX : R → [0, ∞) such that
\[ P\{a \le X \le b\} = \int_a^b f(x)\, dx. \]

Such a function, if it exists, is called the density³ of the random variable. If the density exists, then
\[ F(x) = \int_{-\infty}^{x} f(t)\, dt. \]
¹ or, more generally, to any topological space
² often called cumulative distribution function (cdf) in elementary courses
³ More precisely, it is the density or Radon-Nikodym derivative with respect to Lebesgue measure. In elementary courses,
the term probability density function (pdf) is often used.

If f is continuous at x, then the fundamental theorem of calculus implies that

f(x) = F′(x).

A density f satisfies
\[ \int_{-\infty}^{\infty} f(x)\, dx = 1. \]

Conversely, any nonnegative function that integrates to one is the density of a random variable.

Example If E is an event, the indicator function of E is the random variable
\[ 1_E(\omega) = \begin{cases} 1, & \omega \in E, \\ 0, & \omega \notin E. \end{cases} \]

(The corresponding function in analysis is often called the characteristic function and denoted χE.
Probabilists never use the term characteristic function for the indicator function because the term characteristic
function has another meaning. The term indicator function has no ambiguity.)

Example Let (Ω, F, P) be the probability space for infinite tossings of a coin as in the previous section. Let
\[ X_n(\omega_1, \omega_2, \ldots) = \omega_n = \begin{cases} 1, & \text{if the $n$th flip is heads}, \\ 0, & \text{if the $n$th flip is tails}, \end{cases} \tag{2} \]
\[ S_n = X_1 + \cdots + X_n = \#\,\text{heads on first $n$ flips}. \tag{3} \]

Then X1 , X2 , . . . , and S1 , S2 , . . . are discrete random variables. If Fn denotes the σ-algebra of events that
depend only on the first n flips, then Sn is also a random variable on the probability space (Ω, Fn , P).
However, Sn+1 is not a random variable on (Ω, Fn , P).

Example Let µ be any probability measure on (R, B). Consider the trivial random variable

X = x,

defined on the probability space (R, B, µ). Then X is a random variable and µX = µ. Hence every probability
measure on R is the distribution of a random variable.

Example A random variable X has a normal distribution with mean µ and variance σ² if it has density
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}, \qquad -\infty < x < \infty. \]
If µ = 0 and σ² = 1, X is said to have a standard normal distribution. The distribution function of the
standard normal is often denoted Φ,
\[ \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt. \]
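For numerical work, Φ has no elementary closed form but can be expressed through the error function, Φ(x) = (1 + erf(x/√2))/2. The Python sketch below uses this standard identity and, as a sanity check, compares it with a midpoint-rule approximation of the integral of the density. The function names and the truncation of the lower limit at −10 are our own choices for illustration.

```python
import math


def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density of the normal distribution with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def Phi(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def Phi_numeric(x, lower=-10.0, steps=20000):
    """Midpoint-rule approximation of the integral of the standard normal
    density over [lower, x]; the mass below `lower` is negligible."""
    h = (x - lower) / steps
    return h * sum(normal_pdf(lower + (i + 0.5) * h) for i in range(steps))


print(Phi(0.0), Phi(1.96), Phi_numeric(1.0))
```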
Example If X is a random variable and
g : (R, B) → R
is a Borel measurable function, then Y = g(X) is also a random variable.

Example Recall that the Cantor function is a continuous function F : [0, 1] → [0, 1] with F (0) = 0, F (1) = 1
and such that F′(x) = 0 for all x ∈ [0, 1] \ A where A denotes the “middle thirds” Cantor set. Extend F to R
by setting F (x) = 0 for x ≤ 0 and F (x) = 1 for x ≥ 1. Then F is a distribution function. A random variable
with this distribution function is continuous, since F is continuous. However, such a random variable has no
density.

The term random variable is a little misleading but it is standard. It is perhaps easier to think of a random
variable as a “random number”. For example, in the case of coin-flips we get an infinite sequence of
random numbers corresponding to the results of the flips.

Definition If X is a nonnegative random variable, the expectation of X, denoted E(X), is
\[ E(X) = \int X \, dP, \]
where the integral is the Lebesgue integral. If X is a random variable with E(|X|) < ∞, then we also define
the expectation by
\[ E(X) = \int X \, dP. \]

If X takes on positive and negative values, and E(|X|) = ∞, the expectation is not defined.

Other terms used for expectation are expected value and mean (the term “expectation value” can be
found in the physics literature but it is not good English and sometimes refers to integrals with respect to
measures that are not probability measures). The letter µ is often used for mean. If X is a discrete random
variable taking on the values a1, a2, a3, . . . we have the standard formula from elementary courses:
\[ E(X) = \sum_{j=1}^{\infty} a_j\, P\{X = a_j\}, \tag{4} \]

provided the sum is absolutely convergent.
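Formula (4) is easy to evaluate exactly for small examples. In the Python sketch below, the function name `expectation`, the fair die, and the distribution P{X = j} = 2^(−j) are illustrative assumptions of ours. For the second example the partial sums of (4) have the closed form 2 − (N + 2)/2^N, so they increase to E(X) = 2.

```python
from fractions import Fraction


def expectation(pmf):
    """E(X) = sum_j a_j P{X = a_j} for a pmf given as {value: probability}."""
    return sum(a * p for a, p in pmf.items())


# Fair die: values 1..6, each with probability 1/6.
die = {a: Fraction(1, 6) for a in range(1, 7)}


def geometric_partial(N):
    """Partial sum of (4) up to j = N when P{X = j} = 2^(-j), j = 1, 2, ..."""
    return expectation({j: Fraction(1, 2 ** j) for j in range(1, N + 1)})


print(expectation(die), geometric_partial(20))
```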

Lemma 2.1. Suppose X is a random variable with distribution µX. Then,
\[ E(X) = \int_{\mathbb{R}} x \, d\mu_X. \]

(Either side exists if and only if the other side exists.)

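Proof. First assume that X ≥ 0. If n, k are positive integers let
\[ A_{k,n} = \Big\{ \omega : \frac{k-1}{n} \le X(\omega) < \frac{k}{n} \Big\}. \]
For every n < ∞, consider the discrete random variable Xn taking values in {k/n : k ∈ Z},
\[ X_n = \sum_{k=1}^{\infty} \frac{k}{n}\, 1_{A_{k,n}}. \]
Then,
\[ X_n - \frac{1}{n} \le X \le X_n. \]
Hence
\[ E[X_n] - \frac{1}{n} \le E[X] \le E[X_n]. \]
But,
\[ E\Big[X_n - \frac{1}{n}\Big] = \sum_{k=1}^{\infty} \frac{k-1}{n}\, P(A_{k,n}), \]
\[ E[X_n] = \sum_{k=1}^{\infty} \frac{k}{n}\, P(A_{k,n}), \]
and
\[ \frac{k-1}{n}\, \mu_X\Big[\frac{k-1}{n}, \frac{k}{n}\Big) \le \int_{[\frac{k-1}{n}, \frac{k}{n})} x \, d\mu_X \le \frac{k}{n}\, \mu_X\Big[\frac{k-1}{n}, \frac{k}{n}\Big). \]
By summing we get
\[ E\Big[X_n - \frac{1}{n}\Big] \le \int_{[0,\infty)} x \, d\mu_X \le E[X_n]. \]

By letting n go to infinity we get the result. The general case can be done by writing X = X⁺ − X⁻. We
omit the details.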
In particular, the expectation of a random variable depends only on its distribution and not on the
probability space on which it is defined. If X has a density f , then the measure µX is the same as f (x) dx
so we can write (as in elementary courses)
\[ E[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx, \]
where again the expectation exists if and only if
\[ \int_{-\infty}^{\infty} |x|\, f(x)\, dx < \infty. \]

Since expectation is defined using the Lebesgue integral, all results about the Lebesgue integral (e.g.,
linearity, Monotone Convergence Theorem, Fatou’s Lemma, Dominated Convergence Theorem) also hold for
expectation.
Exercise 2.2. Use Hölder’s inequality to show that if X is a random variable and q ≥ 1, then E[|X|^q] ≥
(E[|X|])^q. For which X does equality hold?

Example Consider the coin flipping random variables (2) and (3). Clearly
\[ E[X_n] = 1 \cdot \frac{1}{2} + 0 \cdot \frac{1}{2} = \frac{1}{2}. \]
Hence
\[ E[S_n] = E[X_1 + \cdots + X_n] = E[X_1] + \cdots + E[X_n] = \frac{n}{2}. \]
Also,
\[ E[X_1 + X_1 + \cdots + X_1] = E[X_1] + E[X_1] + \cdots + E[X_1] = \frac{n}{2}. \]
Note that in the above example, X1 + · · · + Xn and X1 + · · · + X1 do not have the same distribution,
yet the linearity rule for expectation shows that they have the same expectation. A very powerful (although
trivial!) tool in probability is the linearity of the expectation even for random variables that are “correlated”.
If X1, X2, . . . are nonnegative we can even take countable sums,
\[ E\Big[\sum_{n=1}^{\infty} X_n\Big] = \sum_{n=1}^{\infty} E[X_n]. \]

This can be derived from the rule for finite sums and the monotone convergence theorem by considering the
increasing sequence of random variables
\[ Y_n = \sum_{j=1}^{n} X_j. \]
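The contrast between X1 + · · · + Xn and X1 + · · · + X1 in the coin-flip example can be seen by exact enumeration. The following Python sketch (with helper names `distribution` and `mean` of our own choosing, and n = 4 picked arbitrarily) computes both distributions over the 2^n equally likely outcomes of the first n flips.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def distribution(n, f):
    """Exact distribution of f(omega) when omega is uniform on {0,1}^n."""
    dist = Counter()
    for omega in product((0, 1), repeat=n):
        dist[f(omega)] += Fraction(1, 2 ** n)
    return dict(dist)


def mean(dist):
    """Expectation of a discrete distribution given as {value: probability}."""
    return sum(v * p for v, p in dist.items())


n = 4
S_n = distribution(n, lambda w: sum(w))     # X_1 + ... + X_n
nX1 = distribution(n, lambda w: n * w[0])   # X_1 + ... + X_1 (n copies)
print(S_n, nX1, mean(S_n), mean(nX1))
```

The first distribution is binomial and spread over 0, 1, . . . , n, while the second puts mass 1/2 on each of 0 and n; only their means agree.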

Lemma 2.3. Suppose X is a random variable with distribution µX, and g is a Borel measurable function.
Then,
\[ E[g(X)] = \int_{\mathbb{R}} g(x)\, d\mu_X. \]

(Either side exists if and only if the other side exists.)

Proof. Assume first that g is a nonnegative function. Then there exists an increasing sequence of nonnegative
simple functions gn approaching g. Note that gn(X) is then a sequence of nonnegative simple random
variables approaching g(X). For the simple functions gn it is immediate from the definition that
\[ E[g_n(X)] = \int_{\mathbb{R}} g_n(x)\, d\mu_X, \]
and hence by the monotone convergence theorem,
\[ E[g(X)] = \int_{\mathbb{R}} g(x)\, d\mu_X. \]

The general case can be done by writing g = g⁺ − g⁻ and g(X) = g⁺(X) − g⁻(X).
If X has a density f, then we get the following formula found in elementary courses
\[ E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx. \]

Definition The variance of a random variable X is defined by

Var(X) = E[(X − E(X))²],

provided this expectation exists.

Note that
\[ \begin{aligned} \mathrm{Var}(X) &= E[(X - E(X))^2] \\ &= E[X^2 - 2X E(X) + (E(X))^2] \\ &= E[X^2] - (E(X))^2. \end{aligned} \]

The variance is often denoted σ² = σ²(X) and the square root of the variance, σ, is called the standard
deviation.

There is a slightly different use of the term standard deviation in statistics; see any textbook on statistics
for a discussion.

Lemma 2.4. Let X be a random variable.

(Markov’s Inequality) If a > 0,
\[ P\{|X| \ge a\} \le \frac{E[|X|]}{a}. \]
(Chebyshev’s Inequality) If a > 0,
\[ P\{|X - E(X)| \ge a\} \le \frac{\mathrm{Var}(X)}{a^2}. \]

Proof. Let Xa = a 1{|X|≥a}. Then Xa ≤ |X| and hence
\[ a\, P\{|X| \ge a\} = E[X_a] \le E[|X|]. \]

This gives Markov’s inequality. For Chebyshev’s inequality, we apply Markov’s inequality to the random
variable [X − E(X)]² to get
\[ P\{[X - E(X)]^2 \ge a^2\} \le \frac{E([X - E(X)]^2)}{a^2}. \]
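Both inequalities can be checked exactly on a small example. The Python sketch below takes X to be the outcome of a fair die (our choice of example; the helper names are also ours) and compares each side of Markov’s and Chebyshev’s inequalities for a given threshold a, using rational arithmetic so that no rounding is involved.

```python
from fractions import Fraction

# Illustrative assumption: X is uniform on {1, ..., 6}.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}


def E(g):
    """Expectation of g(X)."""
    return sum(g(x) * p for x, p in pmf.items())


def prob(pred):
    """Probability of the event {pred(X)}."""
    return sum(p for x, p in pmf.items() if pred(x))


mean = E(lambda x: x)
var = E(lambda x: (x - mean) ** 2)


def markov_holds(a):
    """Check P{|X| >= a} <= E[|X|] / a."""
    return prob(lambda x: abs(x) >= a) <= E(abs) / a


def chebyshev_holds(a):
    """Check P{|X - E(X)| >= a} <= Var(X) / a^2."""
    return prob(lambda x: abs(x - mean) >= a) <= var / a ** 2


print(mean, var, markov_holds(4), chebyshev_holds(2))
```

For small a the bounds exceed 1 and carry no information; they become useful only when a is large compared with E[|X|] or with the standard deviation.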

Exercise 2.5 (Generalized Chebyshev Inequality). Let f : [0, ∞) → [0, ∞) be a nondecreasing Borel function
and let X be a nonnegative random variable. Then for all a > 0,
\[ P\{X \ge a\} \le \frac{E[f(X)]}{f(a)}. \]

3 Independence
Throughout this section we will assume that there is a probability space (Ω, A, P). All σ-algebras will be
sub-σ-algebras of A, and all random variables will be defined on this space.

Definition

• Two events A, B are called independent if P(A ∩ B) = P(A)P(B).

• A collection of events {Aα } is called independent if for every distinct Aα1 , . . . , Aαn ,

P(Aα1 ∩ · · · ∩ Aαn ) = P(Aα1 ) · · · P(Aαn ).

• A collection of events {Aα } is called pairwise independent if for each distinct Aα1 , Aα2 ,

P(Aα1 ∩ Aα2 ) = P(Aα1 )P(Aα2 ).

Clearly independence implies pairwise independence but the converse is false. An easy example can be
seen by rolling two dice and letting A1 = {sum of rolls is 7}, A2 = {first roll is 1}, A3 = {second roll is
6}. Then
\[ P(A_1) = P(A_2) = P(A_3) = \frac{1}{6}, \]
\[ P(A_1 \cap A_2) = P(A_1 \cap A_3) = P(A_2 \cap A_3) = \frac{1}{36}, \]
but P(A1 ∩ A2 ∩ A3) = 1/36 ≠ P(A1)P(A2)P(A3).
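The dice example can be verified by enumerating the 36 equally likely outcomes. This Python sketch is only a check of the arithmetic above; the function names are ours.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs (first roll, second roll), all equally likely.
omega = list(product(range(1, 7), repeat=2))


def P(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def A1(w):
    return w[0] + w[1] == 7   # sum of rolls is 7


def A2(w):
    return w[0] == 1          # first roll is 1


def A3(w):
    return w[1] == 6          # second roll is 6


def both(*events):
    """Intersection of the given events."""
    return lambda w: all(e(w) for e in events)


print(P(both(A1, A2)), P(both(A1, A2, A3)), P(A1) * P(A2) * P(A3))
```

The triple intersection coincides with each pairwise intersection, since any two of the three events force the outcome (1, 6).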

Definition A finite collection of σ-algebras F1 , . . . , Fn is called independent if for any A1 ∈ F1 , . . . , An ∈ Fn ,

P(A1 ∩ · · · ∩ An ) = P(A1 ) · · · P(An ).

An infinite collection of σ-algebras is called independent if every finite subcollection is independent.

Definition If X is a random variable and F is a σ-algebra, we say that X is F-measurable if X⁻¹(B) ∈ F
for every Borel set B, i.e., if X is a random variable on the probability space (Ω, F, P). We define F(X) to
be the smallest σ-algebra F for which X is F-measurable.

In fact,
F(X) = {X⁻¹(B) : B Borel}.
(Clearly F(X) contains the right hand side, and we can check that the right hand side is a σ-algebra.) If
{Xα } is a collection of random variables, then F({Xα }) is the smallest σ-algebra F such that each Xα is
F-measurable.

Probabilists think of a σ-algebra as “information”. Roughly speaking, we have the information in the
σ-algebra F if for every event E ∈ F we know whether or not the event E has occurred. A random
variable X is F-measurable if we can determine the value of X by knowing whether or not E has occurred
for every F-measurable event E. For example suppose we roll two dice and X is the sum of the two dice.
To determine the value of X it suffices to know which of the following events occurred:

Ej,1 = {first roll is j}, j = 1, . . . , 6,

Ej,2 = {second roll is j}, j = 1, . . . , 6.

More generally, if X is any random variable, we can determine X by knowing which of the events
Ea := {X < a} have occurred.

Exercise 3.1. Suppose F1 ⊂ F2 ⊂ · · · is an increasing sequence of σ-algebras and
\[ \mathcal{F}^0 = \bigcup_{n=1}^{\infty} \mathcal{F}_n. \]

Then F⁰ is an algebra.
Exercise 3.2. Suppose X is a random variable and g is a Borel measurable function. Let Y = g(X). Then
F(Y ) ⊂ F(X). The inclusion can be a strict inclusion.
The following is a very useful way to check independence.
Lemma 3.3. Suppose F⁰ and G⁰ are two algebras of events that are independent, i.e., P(A ∩ B) = P(A) P(B)
for A ∈ F⁰, B ∈ G⁰. Then F := σ(F⁰) and G := σ(G⁰) are independent σ-algebras.
Proof. Let B ∈ G⁰ and define the measures

µB(A) = P(A ∩ B), µ̃B(A) = P(A) P(B).

Note that µB, µ̃B are finite measures and they agree on F⁰. Hence (using Carathéodory extension) they must
agree on F. Hence P(A ∩ B) = P(A) P(B) for A ∈ F, B ∈ G⁰. Now take A ∈ F and consider the measures

νA(B) = P(A ∩ B), ν̃A(B) = P(A) P(B).

By the previous step these finite measures agree on G⁰, and hence, by the same argument, they agree on G.

If X1 , . . . , Xn are random variables, we can consider them as a random vector (X1 , . . . , Xn ) and hence
as a random variable taking values in Rn . The joint distribution of the random variables is the measure
µ = µX1 ,...,Xn on Borel sets in Rn given by

µ(B) = P{(X1 , . . . , Xn ) ∈ B}.

Each random variable X1, . . . , Xn also has a distribution on R, µXi; these are called the marginal
distributions. The joint distribution function F = FX1,...,Xn is defined by

F (t1 , . . . , tn ) = P{X1 ≤ t1 , . . . , Xn ≤ tn } = µ[ (−∞, t1 ] × · · · × (−∞, tn ] ].

Definition Random variables X1, X2, . . . , Xn are said to be independent if any of these (equivalent)
conditions hold:

• For all Borel sets B1 , . . . , Bn ,

P{X1 ∈ B1 , . . . , Xn ∈ Bn } = P{X1 ∈ B1 } P{X2 ∈ B2 } · · · P{Xn ∈ Bn }.

• The σ-algebras F(X1 ), . . . , F(Xn ) are independent.

• µ = µX1 × µX2 × · · · × µXn .

• For all real t1 , . . . , tn ,

F (t1 , . . . , tn ) = FX1 (t1 ) FX2 (t2 ) · · · FXn (tn ).

The random variables X1, . . . , Xn have a joint density (joint probability density function) f(x1, . . . , xn)
if for all Borel sets B ⊂ R^n,
\[ P\{(X_1, \ldots, X_n) \in B\} = \int_B f(x_1, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n. \]

Random variables with a joint density are independent if and only if
\[ f(x_1, \ldots, x_n) = f_{X_1}(x_1) \cdots f_{X_n}(x_n), \]
where fXj(xj) denotes the density of Xj.

An infinite collection of random variables is called independent if each finite subcollection is independent.
The following follows immediately from the definition and Exercise 3.2.
Proposition 3.4. If X1 , . . . , Xn are independent random variables and g1 , . . . , gn are Borel measurable
functions, then g1 (X1 ), . . . , gn (Xn ) are independent.
Proposition 3.5. If X, Y are independent random variables with E|X| < ∞, E|Y | < ∞, then E[|XY |] < ∞
and
E[XY ] = E[X] E[Y ]. (5)
Proof. First consider simple random variables
\[ S = \sum_{i=1}^{m} a_i 1_{A_i}, \qquad T = \sum_{j=1}^{n} b_j 1_{B_j}, \]

where S is F(X) measurable and T is F(Y) measurable. Then each Ai is independent of each Bj and (5) can
be derived immediately using P(Ai ∩ Bj) = P(Ai) P(Bj). Suppose X, Y are nonnegative random variables.
Then there exist simple random variables Sn, measurable with respect to F(X), that increase to X and
simple random variables Tn, measurable with respect to F(Y), increasing to Y. Note that Sn Tn increases
to XY. Hence by the monotone convergence theorem,
\[ E[XY] = \lim_{n\to\infty} E[S_n T_n] = \lim_{n\to\infty} E[S_n]\, E[T_n] = E[X]\, E[Y]. \]

Finally for general X, Y, write X = X⁺ − X⁻, Y = Y⁺ − Y⁻. Note that X⁺, X⁻ are independent of Y⁺, Y⁻
(Proposition 3.4) and hence the result follows from linearity of the integral.

The multiplication rule for expectation can also be viewed as an application of Fubini’s Theorem. Let
µ, ν denote the distributions of X and Y. If they are independent, then the distribution of (X, Y) is µ × ν.
Hence
\[ E[XY] = \int_{\mathbb{R}^2} xy \, d(\mu \times \nu) = \int\!\!\int xy \, d\mu(x)\, d\nu(y) = \int E[X]\, y \, d\nu(y) = E[X]\, E[Y]. \]

Random variables X, Y that satisfy E[XY ] = E[X]E[Y ] are called orthogonal. The last proposition shows
that independent integrable random variables are orthogonal.
Exercise 3.6. Give an example of orthogonal random variables that are not independent.

Proposition 3.7. Suppose X1 , . . . , Xn are random variables with finite variance. If X1 , . . . , Xn are pairwise
orthogonal, then
Var[X1 + · · · + Xn ] = Var[X1 ] + · · · + Var[Xn ].
Proof. Without loss of generality we may assume that all the random variables have zero mean (otherwise,
we subtract the mean without changing any of the variances). Then
\[ \begin{aligned} \mathrm{Var}\Big[\sum_{j=1}^{n} X_j\Big] &= E\Big[\Big(\sum_{j=1}^{n} X_j\Big)^2\Big] \\ &= \sum_{j=1}^{n} E[X_j^2] + \sum_{i \ne j} E[X_i X_j] \\ &= \sum_{j=1}^{n} \mathrm{Var}(X_j) + \sum_{i \ne j} E[X_i X_j]. \end{aligned} \]

But by orthogonality, if i ≠ j, E[Xi Xj] = E[Xi] E[Xj] = 0.

This proposition is most often used when the random variables are independent. It is not true that
Var(X + Y ) = Var(X) + Var(Y ) for all random variables. Two extreme cases are Var(X + X) = Var(2X) =
4Var(X) and Var(X − X) = 0.

The last proposition is really just a generalization of the Pythagorean Theorem. A natural framework
for discussing random variables with zero mean and finite variance is the Hilbert space L² with the inner
product ⟨X, Y⟩ = E[XY]. In this case, X, Y are orthogonal iff ⟨X, Y⟩ = 0.

Exercise 3.8. Let (Ω, F, P) be the unit interval with Lebesgue measure on the Borel subsets. Follow this
outline to show that one can find independent random variables X1 , X2 , X3 , . . . defined on (Ω, F, P), each
normal mean zero, variance 1. Let us write x ∈ [0, 1] in its dyadic expansion

x = .ω1 ω2 ω3 · · · , ωj ∈ {0, 1}.

as in Section 1.
(i) Show that the random variable X(x) = x is a uniform random variable on [0, 1], i.e., has density
f (x) = 1[0,1] .
(ii) Use a diagonalization process to define U1 , U2 , . . ., independent random variables each with a uniform
distribution on [0, 1].
(iii) Show that Xj = Φ⁻¹(Uj) has a normal distribution with mean zero, variance one. Here Φ is the
normal distribution function.

4 Sums of independent random variables
Definition

• A sequence of random variables Yn converges to Y in probability if for every ε > 0,
\[ \lim_{n\to\infty} P\{|Y_n - Y| > \epsilon\} = 0. \]

• A sequence of random variables Yn converges to Y almost surely (a.s.) or with probability one (w.p.1)
if there is an event E with P(E) = 1 and such that for ω ∈ E, Yn(ω) → Y(ω).

In other words, “in probability” is the same as “in measure” and “almost surely” is the analogue of “almost
everywhere”. The term almost surely is not very accurate (something that happens with probability one
happens surely!) so many people prefer to say “with probability one”.

Exercise 4.1. Suppose Y1 , Y2 , . . . is a sequence of random variables with E[Yn ] → µ and Var[Yn ] → 0. Show
that Yn → µ in probability. Give an example to show that it is not necessarily true that Yn → µ w.p.1.
Much of the focus of classical probability is on sums of independent random variables. Let (Ω, A, P) be
a probability space and assume that
X1 , X2 , . . .
are independent random variables on this probability space. We say that the random variables are identically
distributed if they all have the same distribution, and we write i.i.d. for “independent and identically
distributed.”
If X1, X2, . . . are independent (or, more generally, orthogonal) random variables each with mean µ and
variance σ², then
\[ E\Big[\frac{X_1 + \cdots + X_n}{n}\Big] = \frac{1}{n}\, [E(X_1) + \cdots + E(X_n)] = \mu, \]
\[ \mathrm{Var}\Big[\frac{X_1 + \cdots + X_n}{n}\Big] = \frac{1}{n^2}\, [\mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n)] = \frac{\sigma^2}{n}. \]
The equality for the expectations does not require the random variables to be orthogonal, but the expression
for the variance requires it. The mean of the weighted average is the same as the mean of one of the random
variables, but the standard deviation of the weighted average is σ/√n which tends to zero.

Theorem 4.2 (Weak Law of Large Numbers). If X1, X2, . . . are independent random variables such that
E[Xn] = µ and Var[Xn] ≤ σ² for each n, then
\[ \frac{X_1 + \cdots + X_n}{n} \longrightarrow \mu \]
in probability.

Proof. We have
\[ E\Big[\frac{X_1 + \cdots + X_n}{n}\Big] = \mu, \qquad \mathrm{Var}\Big[\frac{X_1 + \cdots + X_n}{n}\Big] \le \frac{\sigma^2}{n} \to 0. \]
We now use Exercise 4.1.
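For fair coin flips the probabilities in the weak law can be computed exactly from binomial coefficients rather than simulated. The Python sketch below (the function names and the choice ε = 1/10 are ours) compares the exact value of P{|Sn/n − 1/2| > ε} with the Chebyshev bound Var(Sn/n)/ε² = 1/(4nε²) that drives the proof.

```python
from fractions import Fraction
from math import comb


def deviation_prob(n, eps):
    """Exact P{|S_n/n - 1/2| > eps} for n independent fair coin flips."""
    half = Fraction(1, 2)
    count = sum(comb(n, k) for k in range(n + 1)
                if abs(Fraction(k, n) - half) > eps)
    return Fraction(count, 2 ** n)


def chebyshev_bound(n, eps):
    """Var(S_n/n) / eps^2, using Var(X_j) = 1/4 for a fair coin."""
    return Fraction(1, 4 * n) / eps ** 2


eps = Fraction(1, 10)
for n in (10, 100, 1000):
    print(n, float(deviation_prob(n, eps)), float(chebyshev_bound(n, eps)))
```

In this example the exact probabilities decay much faster than the 1/n rate of the bound; Chebyshev’s inequality is enough for convergence in probability but is far from sharp here.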

The Strong Law of Large Numbers (SLLN) is a statement about almost sure convergence of the weighted
sums. We will prove a version of the strong law here. We start with an easy, but very useful lemma. Recall
that the limsup of events is defined by
\[ \limsup A_n = \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m. \]

Probabilists often write {An i.o.} for lim sup An ; here “i.o.” stands for “infinitely often”.

Lemma 4.3 (Borel-Cantelli Lemma). Suppose A1, A2, . . . is a sequence of events.

(i) If ∑ P(An) < ∞, then
P{An i.o.} = P(lim sup An) = 0.

(ii) If ∑ P(An) = ∞, and the A1, A2, . . . are independent, then

P{An i.o.} = P(lim sup An) = 1.

The conclusion in (ii) is not necessarily true for dependent events. For example, if A is an event of
probability 1/2, and A1 = A2 = · · · = A, then lim sup An = A which has probability 1/2.

Proof.
(i) If ∑ P(An) < ∞,
\[ P(\limsup A_n) = \lim_{n\to\infty} P\Big(\bigcup_{m=n}^{\infty} A_m\Big) \le \lim_{n\to\infty} \sum_{m=n}^{\infty} P(A_m) = 0. \]

(ii) Assume ∑ P(An) = ∞ and A1, A2, . . . are independent. We will show that
\[ P[(\limsup A_n)^c] = P\Big[\bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} A_m^c\Big] = 0. \]

To show this it suffices to show that for each n
\[ P\Big(\bigcap_{m=n}^{\infty} A_m^c\Big) = 0, \]
and for this we need to show for each n,
\[ \lim_{M\to\infty} P\Big(\bigcap_{m=n}^{M} A_m^c\Big) = 0. \]

But by independence,
\[ P\Big(\bigcap_{m=n}^{M} A_m^c\Big) = \prod_{m=n}^{M} [1 - P(A_m)] \le \exp\Big\{-\sum_{m=n}^{M} P(A_m)\Big\} \to 0. \]
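The last display rests on the elementary inequality 1 − x ≤ e^(−x). As a numerical illustration, the Python sketch below takes the hypothetical choice P(Am) = 1/m for independent events, for which ∑ P(Am) diverges and the product telescopes to (n − 1)/M. The function names and this choice of probabilities are assumptions made only for the example.

```python
import math


def prob_none_occur(p):
    """P(no A_m occurs) = product of (1 - p_m) for independent events
    with P(A_m) = p_m."""
    return math.prod(1.0 - q for q in p)


def exp_bound(p):
    """The upper bound exp(-sum p_m) used in the proof."""
    return math.exp(-sum(p))


def harmonic_probs(n, M):
    """Hypothetical probabilities p_m = 1/m for m = n, ..., M."""
    return [1.0 / m for m in range(n, M + 1)]


for M in (10, 100, 1000):
    p = harmonic_probs(2, M)
    print(M, prob_none_occur(p), exp_bound(p))
```

Both columns tend to zero as M grows, as part (ii) of the lemma requires, even though each individual P(Am) also tends to zero.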

Theorem 4.4 (Strong Law of Large Numbers). Let X1, X2, . . . be independent random variables each with
mean µ. Suppose there exists an M < ∞ such that E[Xn⁴] ≤ M for each n (this holds, for example, when
the sequence is i.i.d. and E[X1⁴] < ∞). Then w.p.1,
\[ \frac{X_1 + X_2 + \cdots + X_n}{n} \longrightarrow \mu. \]
Proof. Without loss of generality we may assume that µ = 0 for otherwise we can consider the random
variables X1 − µ, X2 − µ, . . .. Our first goal is to show that
\[ E[(X_1 + X_2 + \cdots + X_n)^4] \le 3 M n^2. \tag{6} \]

To see this, we first note that if i ∉ {j, k, l},

E[Xi Xj Xk Xl] = E[Xi] E[Xj Xk Xl] = 0. (7)

When we expand (X1 + X2 + · · · + Xn)⁴ and then use the sum rule for expectation we get n⁴ terms. We get
n terms of the form E[Xi⁴] and 3n(n − 1) terms of the form E[Xi² Xj²] with i ≠ j. There are also terms of the
form E[Xi Xj³], E[Xi Xj Xk²] and E[Xi Xj Xk Xl] for distinct i, j, k, l; all of these expectations are zero by (7) so
we will not bother to count the exact number. We know that E[Xi⁴] ≤ M. Also (see Exercise 2.2), if i ≠ j,
\[ E[X_i^2 X_j^2] \le E[X_i^2]\, E[X_j^2] \le \big(E[X_i^4]\, E[X_j^4]\big)^{1/2} \le M. \]

Since the sum consists of at most 3n² terms each bounded by M, we get (6).
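The term counts in this argument can be confirmed by enumeration for small n. The Python sketch below (the function name is ours) runs over all n⁴ index tuples (i, j, k, l) and counts those in which no index appears exactly once, since by (7) these are the only terms whose expectation can be nonzero.

```python
from collections import Counter
from itertools import product


def count_surviving_terms(n):
    """Count index tuples (i, j, k, l) in {0, ..., n-1}^4 by pattern:
    all four indices equal, or two distinct indices each appearing twice."""
    all_equal = 0
    two_pairs = 0
    for idx in product(range(n), repeat=4):
        multiplicities = sorted(Counter(idx).values())
        if multiplicities == [4]:
            all_equal += 1
        elif multiplicities == [2, 2]:
            two_pairs += 1
    return all_equal, two_pairs


for n in range(1, 6):
    print(n, count_surviving_terms(n), (n, 3 * n * (n - 1)))
```

The total n + 3n(n − 1) = 3n² − 2n never exceeds 3n², which is the count used to obtain (6).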
Let An be the event
\[ A_n = \Big\{ \Big| \frac{X_1 + \cdots + X_n}{n} \Big| \ge \frac{1}{n^{1/8}} \Big\} = \{ |X_1 + X_2 + \cdots + X_n| \ge n^{7/8} \}. \]

By the generalized Chebyshev inequality (Exercise 2.5),
\[ P(A_n) \le \frac{E[(X_1 + \cdots + X_n)^4]}{(n^{7/8})^4} \le \frac{3M}{n^{3/2}}. \]
In particular, ∑ P(An) < ∞. Hence by the Borel-Cantelli Lemma,

P(lim sup An) = 0.

Suppose for some ω, (X1(ω) + · · · + Xn(ω))/n does not converge to zero. Then there is an ε > 0 such that

|X1(ω) + X2(ω) + · · · + Xn(ω)| ≥ εn,

for infinitely many values of n. Since εn ≥ n^{7/8} for all sufficiently large n, this implies that
ω ∈ lim sup An. Hence the set of ω at which we do not
have convergence has probability zero.

The strong law of large numbers holds in greater generality than under the conditions given here. In
fact, if X1, X2, . . . is any i.i.d. sequence of random variables with E(X1) = µ, then the strong law holds
\[ \frac{X_1 + \cdots + X_n}{n} \to \mu, \quad \text{w.p.1}. \]
We will not prove this here. However, as the next exercise shows, the strong law does not hold for all
independent sequences of random variables with the same mean.

Exercise 4.5. Let X1, X2, . . . be independent random variables. Suppose that

P{Xn = 2^n} = 1 − P{Xn = 0} = 2^{−n}.

Show that E[Xn] = 1 for each n and that w.p.1,
\[ \frac{X_1 + X_2 + \cdots + X_n}{n} \longrightarrow 0. \]
Verify that the hypotheses of Theorem 4.4 do not hold in this case.
Exercise 4.6.
a. Let X be a random variable. Show that E[|X|] < ∞ if and only if
\[ \sum_{n=1}^{\infty} P\{|X| \ge n\} < \infty. \]

b. Let X1 , X2 , . . . be i.i.d. random variables. Show that

P{|Xn | ≥ n i.o.} = 0

if and only if E[|X1 |] < ∞.

We now establish a result known as the Kolmogorov Zero-One Law. Assume X1 , X2 , . . . are independent.
Let Fn be the σ-algebra generated by X1 , . . . , Xn , and let Gn be the σ-algebra generated by Xn+1 , Xn+2 , . . ..
Note that Fn and Gn are independent σ-algebras, and

F1 ⊂ F2 ⊂ · · · , G1 ⊃ G2 ⊃ · · · .

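Let $\mathcal{F}$ be the σ-algebra generated by $X_1, X_2, \ldots$, i.e., the smallest σ-algebra containing the algebra
$$\mathcal{F}^0 := \bigcup_{n=1}^{\infty} \mathcal{F}_n.$$

The tail σ-algebra $\mathcal{T}$ is defined by
$$\mathcal{T} = \bigcap_{n=1}^{\infty} \mathcal{G}_n.$$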

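(The intersection of σ-algebras is a σ-algebra, so $\mathcal{T}$ is a σ-algebra.) An example of an event in $\mathcal{T}$ is
$$\left\{ \lim_{m \to \infty} \frac{X_1 + \cdots + X_m}{m} = 0 \right\}.$$
It is easy to see that this event is in $\mathcal{G}_n$ for each $n$ since
$$\left\{ \lim_{m \to \infty} \frac{X_1 + \cdots + X_m}{m} = 0 \right\} = \left\{ \lim_{m \to \infty} \frac{X_{n+1} + \cdots + X_m}{m} = 0 \right\}.$$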

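We write $\triangle$ for symmetric difference, $A \triangle B = (A \setminus B) \cup (B \setminus A)$. The next lemma says roughly that $\mathcal{F}^0$ is
“dense” in $\mathcal{F}$.
Lemma 4.7. Suppose $\mathcal{F}^0$ is an algebra of events and $\mathcal{F} = \sigma(\mathcal{F}^0)$. If $A \in \mathcal{F}$, then for every $\epsilon > 0$ there is
an $A_0 \in \mathcal{F}^0$ with
$$\mathbb{P}(A \triangle A_0) < \epsilon.$$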

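Proof. Let $\mathcal{B}$ be the collection of all events $A$ such that for every $\epsilon > 0$ there is an $A_0 \in \mathcal{F}^0$ such that
$\mathbb{P}(A \triangle A_0) < \epsilon$. Trivially, $\mathcal{F}^0 \subset \mathcal{B}$. We will show that $\mathcal{B}$ is a σ-algebra and this will imply that $\mathcal{F} \subset \mathcal{B}$.
Obviously, $\Omega \in \mathcal{B}$.
Suppose $A \in \mathcal{B}$ and let $\epsilon > 0$. Find $A_0 \in \mathcal{F}^0$ such that $\mathbb{P}(A \triangle A_0) < \epsilon$. Then $A_0^c \in \mathcal{F}^0$ and
$$\mathbb{P}(A^c \triangle A_0^c) = \mathbb{P}(A \triangle A_0) < \epsilon.$$
Hence $A^c \in \mathcal{B}$.
Suppose $A_1, A_2, \ldots \in \mathcal{B}$ and let $A = \cup A_j$. Let $\epsilon > 0$, and find $n$ such that
$$\mathbb{P}\Big[ \bigcup_{j=1}^{n} A_j \Big] \ge \mathbb{P}[A] - \frac{\epsilon}{2}.$$
For $j = 1, \ldots, n$, let $A_{j,0} \in \mathcal{F}^0$ be such that
$$\mathbb{P}[A_j \triangle A_{j,0}] \le \epsilon\, 2^{-j-1}.$$
Let $A_0 = A_{1,0} \cup \cdots \cup A_{n,0}$ and note that
$$A \triangle A_0 \subset \Big[ \bigcup_{j=1}^{n} A_j \triangle A_{j,0} \Big] \cup \Big[ A \setminus \bigcup_{j=1}^{n} A_j \Big],$$
and hence $\mathbb{P}(A \triangle A_0) < \epsilon$ and $A \in \mathcal{B}$.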

Theorem 4.8 (Kolmogorov Zero-One Law). If A ∈ T then P(A) = 0 or P(A) = 1.

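Proof. Let $A \in \mathcal{T}$ and let $\epsilon > 0$. Find a set $A_0 \in \mathcal{F}^0$ with $\mathbb{P}(A \triangle A_0) < \epsilon$. Then $A_0 \in \mathcal{F}_n$ for some $n$. Since
$\mathcal{T} \subset \mathcal{G}_n$ and $\mathcal{F}_n$ is independent of $\mathcal{G}_n$, $A$ and $A_0$ are independent. Therefore $\mathbb{P}(A \cap A_0) = \mathbb{P}(A)\mathbb{P}(A_0)$. Note
that $\mathbb{P}(A \cap A_0) \ge \mathbb{P}(A) - \mathbb{P}(A \triangle A_0) \ge \mathbb{P}(A) - \epsilon$ and $|\mathbb{P}(A_0) - \mathbb{P}(A)| \le \mathbb{P}(A \triangle A_0) < \epsilon$. Letting $\epsilon \to 0$, we get $\mathbb{P}(A) = \mathbb{P}(A)\mathbb{P}(A)$. This implies
$\mathbb{P}(A) = 0$ or $\mathbb{P}(A) = 1$.
As an example, suppose $X_1, X_2, \ldots$ are independent random variables. Then the probability of the event
$$\left\{ \omega : \lim_{n \to \infty} \frac{X_1(\omega) + \cdots + X_n(\omega)}{n} = 0 \right\}$$
is zero or one.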

5 Central Limit Theorem

In this section, we will use some knowledge of the Fourier transform that we review here. If $g : \mathbb{R} \to \mathbb{R}$, the
Fourier transform of $g$ is defined by
$$\hat{g}(y) = \int_{-\infty}^{\infty} e^{-ixy} g(x)\, dx.$$

We will call $g$ a Schwartz function if it is $C^\infty$ and it and all of its derivatives decay at $\pm\infty$ faster than the reciprocal of every polynomial.
If $g$ is Schwartz, then $\hat{g}$ is Schwartz. (Roughly speaking, the Fourier transform sends smooth functions to
bounded functions and bounded functions to smooth functions. Very smooth and very bounded functions
are sent to very smooth and very bounded functions.) The inversion formula
$$g(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{ixy} \hat{g}(y)\, dy \qquad (8)$$
holds for Schwartz functions. A straightforward computation shows that
$$\widehat{e^{-x^2/2}} = \sqrt{2\pi}\, e^{-y^2/2}.$$

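We also need
$$\begin{aligned}
\int_{-\infty}^{\infty} f(x)\, \hat{g}(x)\, dx &= \int_{-\infty}^{\infty} f(x) \left[ \int_{-\infty}^{\infty} g(y)\, e^{-ixy}\, dy \right] dx \\
&= \int_{-\infty}^{\infty} g(y) \left[ \int_{-\infty}^{\infty} f(x)\, e^{-ixy}\, dx \right] dy \\
&= \int_{-\infty}^{\infty} \hat{f}(y)\, g(y)\, dy.
\end{aligned}$$

This is valid provided that the interchange of integrals can be justified; in particular, it holds for Schwartz
$f, g$. The characteristic function is a version of the Fourier transform.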

Definition The characteristic function of a random variable $X$ is the function $\phi = \phi_X : \mathbb{R} \to \mathbb{C}$ defined
by
$$\phi(t) = \mathbb{E}[e^{iXt}].$$

Properties of Characteristic Functions

• $\phi(0) = 1$ and for all $t$
$$|\phi(t)| = \left| \mathbb{E}[e^{iXt}] \right| \le \mathbb{E}\left[ |e^{iXt}| \right] = 1.$$

• $\phi$ is a continuous function of $t$. To see this note that the collection of random variables $Y_s = e^{isX}$ is
dominated by the random variable 1, which has finite expectation. Hence by the dominated convergence
theorem,
$$\lim_{s \to t} \phi(s) = \lim_{s \to t} \mathbb{E}[e^{isX}] = \mathbb{E}\left[ \lim_{s \to t} e^{isX} \right] = \phi(t).$$
In fact, it is uniformly continuous (Exercise 5.1).

• If $Y = aX + b$ where $a, b$ are constants, then
$$\phi_Y(t) = \mathbb{E}[e^{i(aX+b)t}] = e^{ibt}\, \mathbb{E}[e^{iX(at)}] = e^{ibt} \phi_X(at).$$

• The function $M_X(t) = \mathbb{E}[e^{tX}]$ is often called the moment generating function of $X$. Unlike the characteristic
function, the moment generating function does not always exist for all values of $t$. When it does
exist, however, we can use the formal relation $\phi(t) = M(it)$.
• If a random variable $X$ has a density $f$ then
$$\phi_X(t) = \int_{-\infty}^{\infty} e^{itx} f(x)\, dx = \hat{f}(-t).$$

• If two random variables have the same characteristic function, then they have the same distribution.
This fact is not obvious — this will follow from Proposition 5.6 which gives an inversion formula for
the characteristic function similar to (8).
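• If $X_1, \ldots, X_n$ are independent random variables, then
$$\begin{aligned}
\phi_{X_1 + \cdots + X_n}(t) &= \mathbb{E}[e^{it(X_1 + \cdots + X_n)}] \\
&= \mathbb{E}[e^{itX_1}]\, \mathbb{E}[e^{itX_2}] \cdots \mathbb{E}[e^{itX_n}] \\
&= \phi_{X_1}(t)\, \phi_{X_2}(t) \cdots \phi_{X_n}(t).
\end{aligned}$$

• If $X$ is a normal random variable, mean zero, variance 1, then by completing the square in the exponential
one can compute
$$\phi(t) = \int_{-\infty}^{\infty} e^{ixt}\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = e^{-t^2/2}.$$
If $Y$ is normal mean $\mu$, variance $\sigma^2$, then $Y$ has the same distribution as $\sigma X + \mu$, hence
$$\phi_Y(t) = \phi_{\sigma X + \mu}(t) = e^{i\mu t} \phi_X(\sigma t) = e^{i\mu t}\, e^{-\sigma^2 t^2/2}.$$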
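Several of the properties listed above can be checked numerically for discrete distributions. The sketch below is only an illustration: the two example distributions (a fair $\pm 1$ coin and a fair die) and the helper names are assumptions made here, not part of the notes. It checks $|\phi(t)| \le 1$, the rule for $aX + b$, and the product rule for independent sums.

```python
import cmath

def char_fn(pmf, t):
    """phi(t) = E[exp(i t X)] for a discrete distribution given as {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

def convolve(pmf_x, pmf_y):
    """Distribution of X + Y when X and Y are independent."""
    out = {}
    for x, p in pmf_x.items():
        for y, q in pmf_y.items():
            out[x + y] = out.get(x + y, 0.0) + p * q
    return out

def affine(pmf, a, b):
    """Distribution of aX + b (a != 0)."""
    return {a * x + b: p for x, p in pmf.items()}

coin = {1: 0.5, -1: 0.5}               # hypothetical example distributions
die = {k: 1 / 6 for k in range(1, 7)}

for t in (0.0, 0.7, 2.5):
    product_gap = abs(char_fn(convolve(coin, die), t) - char_fn(coin, t) * char_fn(die, t))
    affine_gap = abs(char_fn(affine(die, 2, 3), t) - cmath.exp(3j * t) * char_fn(die, 2 * t))
    print(t, abs(char_fn(die, t)), product_gap, affine_gap)
```

Both gaps should be at the level of floating-point rounding, and the modulus column should never exceed 1.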

Exercise 5.1. Show that φ is a uniformly continuous function of t.

Proposition 5.2. Suppose $X$ is a random variable with characteristic function $\phi$. Suppose $\mathbb{E}[|X|] < \infty$.
Then $\phi$ is continuously differentiable and
$$\phi'(0) = i\, \mathbb{E}(X).$$
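Proof. Note that for real $\theta$
$$|e^{i\theta} - 1| \le |\theta|.$$
This can be checked geometrically or by noting that, for $\theta \ge 0$,
$$|e^{i\theta} - 1| = \left| \int_0^{\theta} i e^{is}\, ds \right| \le \int_0^{\theta} |i e^{is}|\, ds = \theta$$
(the case $\theta < 0$ is similar). We write the difference quotient, where $\mu$ is the distribution of $X$,
$$\frac{\phi(t + \delta) - \phi(t)}{\delta} = \int_{-\infty}^{\infty} \frac{e^{i(t+\delta)x} - e^{itx}}{\delta}\, d\mu(x).$$

Note that
$$\left| \frac{e^{i(t+\delta)x} - e^{itx}}{\delta} \right| \le \frac{|\delta x|}{|\delta|} = |x|.$$
Saying $\mathbb{E}[|X|] < \infty$ is equivalent to saying that $|x|$ is integrable with respect to the measure $\mu$. By the
dominated convergence theorem,
$$\begin{aligned}
\lim_{\delta \to 0} \int_{-\infty}^{\infty} \frac{e^{i(t+\delta)x} - e^{itx}}{\delta}\, d\mu(x) &= \int_{-\infty}^{\infty} \lim_{\delta \to 0} \frac{e^{i(t+\delta)x} - e^{itx}}{\delta}\, d\mu(x) \\
&= \int_{-\infty}^{\infty} i x e^{itx}\, d\mu(x).
\end{aligned}$$

Hence,
$$\phi'(t) = \int_{-\infty}^{\infty} i x e^{itx}\, d\mu(x),$$
and, in particular,
$$\phi'(0) = i\, \mathbb{E}[X].$$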

The following proposition can be proved in the same way.

Proposition 5.3. Suppose $X$ is a random variable with characteristic function $\phi$. Suppose $\mathbb{E}[|X|^k] < \infty$ for
some positive integer $k$. Then $\phi$ has $k$ continuous derivatives and
$$\phi^{(j)}(0) = i^j\, \mathbb{E}[X^j], \quad j = 0, 1, \ldots, k. \qquad (9)$$

Formally, (9) can be derived by writing
$$\mathbb{E}[e^{iXt}] = \mathbb{E}\left[ \sum_{j=0}^{\infty} \frac{(iX)^j t^j}{j!} \right] = \sum_{j=0}^{\infty} \frac{i^j\, \mathbb{E}(X^j)}{j!}\, t^j,$$
differentiating both sides, and evaluating at $t = 0$. Proposition 5.3 tells us that the existence of $\mathbb{E}[|X|^k]$
allows one to justify this rigorously up to the $k$th derivative.
Proposition 5.4. Let $X_1, X_2, \ldots$ be independent, identically distributed random variables with mean $\mu$ and
variance $\sigma^2$ and characteristic function $\phi$. Let
$$Z_n = \frac{X_1 + \cdots + X_n - n\mu}{\sqrt{\sigma^2 n}},$$
and let $\phi_n$ be the characteristic function of $Z_n$. Then for each $t$,
$$\lim_{n \to \infty} \phi_n(t) = e^{-t^2/2}.$$

Recall that $e^{-t^2/2}$ is the characteristic function of a normal mean zero, variance 1 random variable.
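Proof. Without loss of generality we will assume $\mu = 0$, $\sigma^2 = 1$, for otherwise we can consider $Y_j = (X_j - \mu)/\sigma$.
By Proposition 5.3, we have
$$\phi(t) = 1 + \phi'(0)\, t + \frac{1}{2} \phi''(0)\, t^2 + \epsilon_t\, t^2 = 1 - \frac{t^2}{2} + \epsilon_t\, t^2,$$
where $\epsilon_t \to 0$ as $t \to 0$. Note that
$$\phi_n(t) = \left[ \phi\!\left( \frac{t}{\sqrt{n}} \right) \right]^n,$$
and hence for fixed $t$,
$$\lim_{n \to \infty} \phi_n(t) = \lim_{n \to \infty} \left[ \phi\!\left( \frac{t}{\sqrt{n}} \right) \right]^n = \lim_{n \to \infty} \left[ 1 - \frac{t^2}{2n} + \frac{\delta_n}{n} \right]^n,$$
where $\delta_n \to 0$ as $n \to \infty$. Once $n$ is sufficiently large that $|\delta_n - (t^2/2)| < n$ we can take logarithms and
expand in a Taylor series,
$$\log\left[ 1 - \frac{t^2}{2n} + \frac{\delta_n}{n} \right] = -\frac{t^2}{2n} + \frac{\rho_n}{n}, \quad n \to \infty,$$
where $\rho_n \to 0$. (Since $\phi$ can take on complex values, we need to take a complex logarithm with $\log 1 = 0$,
but there is no problem provided that $|\delta_n - (t^2/2)| < n$.) Therefore
$$\lim_{n \to \infty} \log \phi_n(t) = -\frac{t^2}{2},$$
and hence $\phi_n(t) \to e^{-t^2/2}$.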
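The convergence in Proposition 5.4 can be watched numerically in a case where $\phi$ is explicit. The sketch below assumes fair $\pm 1$ steps (mean 0, variance 1), for which $\phi(t) = \cos t$ and therefore $\phi_n(t) = \cos(t/\sqrt{n})^n$; this choice of distribution is an assumption made for illustration only.

```python
import math

def phi_n(t, n):
    """Characteristic function of (X_1 + ... + X_n)/sqrt(n) for fair +-1 steps."""
    return math.cos(t / math.sqrt(n)) ** n

t = 1.5
limit = math.exp(-t * t / 2)  # characteristic function of a standard normal at t
for n in (10, 100, 10_000, 1_000_000):
    print(n, phi_n(t, n), abs(phi_n(t, n) - limit))
```

The printed gap shrinks as $n$ grows, in line with the Taylor expansion used in the proof.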

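Theorem 5.5 (Central Limit Theorem). Let $X_1, X_2, \ldots$ be independent, identically distributed random
variables with mean $\mu$ and finite variance $\sigma^2 > 0$. If $-\infty < a < b < \infty$, then
$$\lim_{n \to \infty} \mathbb{P}\left\{ a \le \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma \sqrt{n}} \le b \right\} = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx.$$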

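Proof. Let $\phi_n$ be the characteristic function of $n^{-1/2} \sigma^{-1} (X_1 + \cdots + X_n - n\mu)$. We have already shown that
for each $t$, $\phi_n(t) \to e^{-t^2/2}$. It suffices to show that if $\mu_n$ is any sequence of distributions such that their
characteristic functions $\phi_n$ converge pointwise to $e^{-t^2/2}$, then for every $a < b$,
$$\lim_{n \to \infty} \mu_n[a, b] = \Phi(b) - \Phi(a),$$
where
$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy.$$
Fix $a, b$ and let $\epsilon > 0$. We can find a $C^\infty$ function $g = g_{a,b,\epsilon}$ such that
$$0 \le g(x) \le 1, \quad -\infty < x < \infty,$$
$$g(x) = 1, \quad a \le x \le b,$$
$$g(x) = 0, \quad x \notin [a - \epsilon, b + \epsilon].$$
We will show that
$$\lim_{n \to \infty} \int_{-\infty}^{\infty} g(x)\, d\mu_n(x) = \int_{-\infty}^{\infty} g(x)\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx. \qquad (10)$$
Since for any distribution $\mu$,
$$\mu[a, b] \le \int_{-\infty}^{\infty} g(x)\, d\mu(x) \le \mu[a - \epsilon, b + \epsilon],$$
we then get the theorem by letting $\epsilon \to 0$.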

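Since $g$ is a $C^\infty$ function with compact support, it is a Schwartz function. Let
$$\hat{g}(y) = \int_{-\infty}^{\infty} e^{-ixy} g(x)\, dx$$
be the Fourier transform. Then, using the inversion formula (8),
$$\begin{aligned}
\int_{-\infty}^{\infty} g(x)\, d\mu_n(x) &= \int_{-\infty}^{\infty} \left[ \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{ixy}\, \hat{g}(y)\, dy \right] d\mu_n(x) \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} e^{ixy}\, d\mu_n(x) \right] \hat{g}(y)\, dy \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_n(y)\, \hat{g}(y)\, dy.
\end{aligned}$$

Since $|\phi_n \hat{g}| \le |\hat{g}|$ and $\hat{g}$ is $L^1$, we can use the dominated convergence theorem to conclude
$$\lim_{n \to \infty} \int_{-\infty}^{\infty} g(x)\, d\mu_n(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-y^2/2}\, \hat{g}(y)\, dy.$$
Recalling that $\widehat{e^{-x^2/2}} = \sqrt{2\pi}\, e^{-y^2/2}$, and that
$$\int \hat{f} g = \int f \hat{g},$$
we get (10).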
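To see the theorem at work in a case where the left-hand side can be computed exactly, the sketch below takes fair $\pm 1$ steps (so $\mu = 0$, $\sigma = 1$; an assumption of this example) and sums binomial probabilities. The names `Phi` and `clt_probability` are illustrative.

```python
import math

def Phi(x):
    """Standard normal distribution function, written in terms of math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clt_probability(n, a, b):
    """Exact P{a <= S_n / sqrt(n) <= b} for S_n a sum of n independent fair +-1 steps."""
    total = 0.0
    root_n = math.sqrt(n)
    for plus in range(n + 1):
        z = (2 * plus - n) / root_n          # normalized value of the sum
        if a <= z <= b:
            total += math.comb(n, plus) / 2 ** n
    return total

a, b = -1.0, 2.0
for n in (100, 1_000, 10_000):
    print(n, clt_probability(n, a, b), Phi(b) - Phi(a))
```

Because $S_n$ lives on a lattice, the exact probability differs from the normal value by a term of order $n^{-1/2}$, which is visible in the output.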

The next proposition establishes the uniqueness of characteristic functions by giving an inversion formula.
Note that in order to specify a distribution µ, it suffices to give µ[a, b] at all a, b with µ{a, b} = 0.

Proposition 5.6. Let $X$ be a random variable with distribution $\mu$, distribution function $F$, and characteristic
function $\phi$. Then for every $a < b$ such that $F$ is continuous at $a$ and $b$,
$$\mu[a, b] = F(b) - F(a) = \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-iya} - e^{-iyb}}{iy}\, \phi(y)\, dy. \qquad (11)$$

Note that it is tempting to write the right hand side of (11) as
$$\frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{-iya} - e^{-iyb}}{iy}\, \phi(y)\, dy, \qquad (12)$$
but this integrand is not necessarily integrable on $(-\infty, \infty)$. We can write (12) if we interpret this as
the improper Riemann integral and not as the Lebesgue integral on $(-\infty, \infty)$.

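Proof. We first fix $T < \infty$. Note that
$$\left| \frac{e^{-iya} - e^{-iyb}}{iy}\, \phi(y) \right| \le (b - a), \qquad (13)$$
and hence the integral on the right-hand side of (11) is well defined. The bound (13) allows us to use Fubini’s
Theorem:
$$\begin{aligned}
\frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-iya} - e^{-iyb}}{iy}\, \phi(y)\, dy &= \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-iya} - e^{-iyb}}{iy} \left[ \int_{-\infty}^{\infty} e^{iyx}\, d\mu(x) \right] dy \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \left[ \int_{-T}^{T} \frac{e^{iy(x-a)} - e^{iy(x-b)}}{iy}\, dy \right] d\mu(x).
\end{aligned}$$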

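It is easy to check that for any real $c \ne 0$,
$$\int_{-T}^{T} \frac{e^{icx}}{ix}\, dx = 2\, \mathrm{sgn}(c) \int_0^{|c|T} \frac{\sin x}{x}\, dx,$$
where the integral on the left is understood as a principal value at $x = 0$ (the cosine part of the integrand is odd and cancels). We will need the fact that
$$\lim_{T \to \infty} \int_0^{T} \frac{\sin x}{x}\, dx = \frac{\pi}{2}.$$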
There are two relevant cases.
• If $x - a$ and $x - b$ have the same sign (i.e., $x < a$ or $x > b$),
$$\lim_{T \to \infty} \int_{-T}^{T} \frac{e^{iy(x-a)} - e^{iy(x-b)}}{iy}\, dy = 0.$$

• If $x - a > 0$ and $x - b < 0$,
$$\lim_{T \to \infty} \int_{-T}^{T} \frac{e^{iy(x-a)} - e^{iy(x-b)}}{iy}\, dy = 4 \lim_{T \to \infty} \int_0^{T} \frac{\sin x}{x}\, dx = 2\pi.$$

We do not need to worry about what happens at $x = a$ or $x = b$ since $\mu$ gives zero measure to these points.
Since the function
$$g_T(a, b, x) = \int_{-T}^{T} \frac{e^{iy(x-a)} - e^{iy(x-b)}}{iy}\, dy$$
is uniformly bounded in $T, a, b, x$, we can use the dominated convergence theorem to conclude that
$$\lim_{T \to \infty} \int_{-\infty}^{\infty} \left[ \int_{-T}^{T} \frac{e^{iy(x-a)} - e^{iy(x-b)}}{iy}\, dy \right] d\mu(x) = \int 2\pi\, 1_{(a,b)}\, d\mu(x) = 2\pi \mu[a, b].$$

Therefore,
$$\lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-iya} - e^{-iyb}}{iy}\, \phi(y)\, dy = \mu[a, b].$$
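The inversion formula (11) can be tested numerically when $\phi$ is known in closed form. The sketch below uses the standard normal, $\phi(y) = e^{-y^2/2}$, and a plain midpoint rule on $[-T, T]$; the quadrature scheme, step count, and function names are choices made for this illustration only.

```python
import cmath
import math

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def inversion_integral(phi, a, b, T, steps=20_000):
    """Midpoint-rule value of (1/2pi) * int_{-T}^{T} (e^{-iya} - e^{-iyb})/(iy) * phi(y) dy.

    With an even number of steps no midpoint falls on y = 0, where the
    kernel has a removable singularity.
    """
    h = 2.0 * T / steps
    total = 0.0
    for k in range(steps):
        y = -T + (k + 0.5) * h
        kernel = (cmath.exp(-1j * y * a) - cmath.exp(-1j * y * b)) / (1j * y)
        total += (kernel * phi(y)).real   # imaginary parts cancel by symmetry
    return total * h / (2.0 * math.pi)

normal_phi = lambda y: math.exp(-y * y / 2.0)
a, b = -1.0, 1.0
print(inversion_integral(normal_phi, a, b, T=10.0), Phi(b) - Phi(a))
```

For this rapidly decaying $\phi$ the truncation at $T = 10$ is harmless; for a characteristic function that is not integrable, the limit $T \to \infty$ in (11) has to be taken with more care.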

Exercise 5.7. A random variable $X$ has a Poisson distribution with parameter $\lambda > 0$ if $X$ takes on only
nonnegative integer values and
$$\mathbb{P}\{X = k\} = e^{-\lambda}\, \frac{\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots.$$
In this problem, suppose that X, Y are independent Poisson random variables on a probability space with
parameters λX , λY , respectively.
(i) Find E[X], Var[X], E[Y ], Var[Y ].
(ii) Show that X + Y has a Poisson distribution with parameter λX + λY by computing directly

P{X + Y = k}.

(iii) Find the characteristic function for $X$, $Y$ and use this to find an alternative derivation of (ii).
(iv) Suppose that for each integer $n > \lambda$ we have independent random variables $Z_{1,n}, Z_{2,n}, \ldots, Z_{n,n}$ with
$$\mathbb{P}\{Z_{j,n} = 1\} = 1 - \mathbb{P}\{Z_{j,n} = 0\} = \frac{\lambda}{n}.$$
Find
$$\lim_{n \to \infty} \mathbb{P}\{Z_{1,n} + Z_{2,n} + \cdots + Z_{n,n} = k\}.$$

Exercise 5.8. A random variable $X$ is infinitely divisible if for each positive integer $n$, we can find i.i.d. random
variables $X_1, \ldots, X_n$ such that $X_1 + \cdots + X_n$ has the same distribution as $X$.
(i) Show that Poisson random variables and normal random variables are infinitely divisible.
(ii) Suppose that X has a uniform distribution on [0, 1]. Show that X is not infinitely divisible.
(iii) Find all infinitely divisible distributions that take on only a finite number of values.

6 Conditional Expectation
6.1 Definition and properties
Suppose X is a random variable with E[|X|] < ∞. We can think of E[X] as the best guess for the random
variable given no information about the outcome of the random experiment which produces the number X.
Conditional expectation deals with the case where one has some, but not all, information about an event.
Consider the example of flipping coins, i.e., the probability space

Ω = {(ω1 , ω2 , . . .) : ωj = 0 or 1}.

Let $\mathcal{F}_n$ be the σ-algebra of all events that depend on the first $n$ flips of the coins. Intuitively we think of the
σ-algebra Fn as the “information” contained in viewing the first n flips. Let Xn (ω) = ωn be the indicator
function of the event “the nth flip is heads”. Let

Sn = X1 + · · · + Xn

be the total number of heads in the first n flips. Clearly E[Sn ] = n/2. Suppose we are interested in S3 but
we get to see the flip of the first coin. Then our “best guess” for S3 depends on the value of the first flip.
Using the notation of elementary probability courses we can write

E[S3 | X1 = 1] = 2, E[S3 | X1 = 0] = 1.

More formally we can write E[S3 | F1 ] for the random variable that is measurable with respect to the
σ-algebra F1 that takes on the value 2 on the event {X1 = 1} and 1 on the event {X1 = 0}.
Let (Ω, F, P) be a probability space and G ⊂ F another σ-algebra. Let X be an integrable random
variable.
Let us first consider the case when $\mathcal{G}$ is a finite σ-algebra. In this case there exist disjoint, nonempty
events $A_1, A_2, \ldots, A_k$ that generate $\mathcal{G}$ in the sense that $\mathcal{G}$ consists precisely of those sets obtained from unions
of the events $A_1, \ldots, A_k$. If $\mathbb{P}(A_j) > 0$, we define $\mathbb{E}[X \mid \mathcal{G}]$ on $A_j$ to be the average of $X$ on $A_j$,
$$\mathbb{E}[X \mid \mathcal{G}] = \frac{\int_{A_j} X\, d\mathbb{P}}{\mathbb{P}(A_j)} = \frac{\mathbb{E}[X 1_{A_j}]}{\mathbb{P}(A_j)}, \quad \text{on } A_j. \qquad (14)$$

If P(Aj ) = 0, this definition does not make sense, and we just let E[X | G] take on an arbitrary value, say
cj , on Aj . Note that E[X | G] is a G-measurable random variable and that its definition is unique except
perhaps on a set of total probability zero.
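Proposition 6.1. If $\mathcal{G}$ is a finite σ-algebra and $\mathbb{E}[X \mid \mathcal{G}]$ is defined as in (14), then for every event $A \in \mathcal{G}$,
$$\int_A \mathbb{E}[X \mid \mathcal{G}]\, d\mathbb{P} = \int_A X\, d\mathbb{P}.$$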

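Proof. Let $\mathcal{G}$ be generated by $A_1, \ldots, A_k$ and assume that $A = A_1 \cup \cdots \cup A_l$. Then
$$\begin{aligned}
\int_A \mathbb{E}[X \mid \mathcal{G}]\, d\mathbb{P} &= \sum_{j=1}^{l} \int_{A_j} \mathbb{E}[X \mid \mathcal{G}]\, d\mathbb{P} \\
&= \sum_{j=1}^{l} \mathbb{P}(A_j)\, \frac{\int_{A_j} X\, d\mathbb{P}}{\mathbb{P}(A_j)} \\
&= \sum_{j=1}^{l} \int_{A_j} X\, d\mathbb{P} \\
&= \int_A X\, d\mathbb{P}.
\end{aligned}$$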
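For a finite σ-algebra, definition (14) is just block-by-block averaging, which is easy to carry out by hand or by machine. The sketch below does this for the coin-flip example with three fair flips and $\mathcal{G} = \mathcal{F}_1$; the representation of the σ-algebra by a dictionary of blocks and the function name `cond_exp` are assumptions of this illustration, and every block is assumed to have positive probability.

```python
from fractions import Fraction
from itertools import product

omega = list(product((0, 1), repeat=3))          # outcomes of three flips
prob = {w: Fraction(1, 8) for w in omega}        # fair, independent flips

def S3(w):
    """Number of heads in the three flips."""
    return sum(w)

# Partition generating F_1: group outcomes by the value of the first flip.
blocks = {}
for w in omega:
    blocks.setdefault(w[0], []).append(w)

def cond_exp(X, blocks, prob):
    """E[X | G] as a dict outcome -> value, where G is generated by the given blocks."""
    result = {}
    for block in blocks.values():
        p_block = sum(prob[w] for w in block)
        average = sum(X(w) * prob[w] for w in block) / p_block   # formula (14)
        for w in block:
            result[w] = average
    return result

E_S3_given_F1 = cond_exp(S3, blocks, prob)
for first_flip, block in blocks.items():
    print(first_flip, E_S3_given_F1[block[0]])
```

The two printed values agree with $\mathbb{E}[S_3 \mid X_1 = 1] = 2$ and $\mathbb{E}[S_3 \mid X_1 = 0] = 1$ from the discussion above, and integrating the result over any block reproduces the integral of $S_3$ over that block, as Proposition 6.1 asserts.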

This proposition gives us a clue as to how to define $\mathbb{E}[X \mid \mathcal{G}]$ for infinite $\mathcal{G}$.

Proposition 6.2. Suppose X is an integrable random variable on (Ω, F, P), and G ⊂ F is a σ-algebra. Then
there exists a $\mathcal{G}$-measurable random variable $Y$ such that if $A \in \mathcal{G}$,
$$\int_A Y\, d\mathbb{P} = \int_A X\, d\mathbb{P}. \qquad (15)$$

Moreover, if Ỹ is another G-measurable random variable satisfying (15), then Ỹ = Y a.s.

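Proof. The uniqueness is immediate since if $Y, \tilde{Y}$ are $\mathcal{G}$-measurable random variables satisfying (15), then
$$\int_A (Y - \tilde{Y})\, d\mathbb{P} = 0$$
for all $A \in \mathcal{G}$. Applying this to the events $A = \{Y - \tilde{Y} > 0\}$ and $A = \{Y - \tilde{Y} < 0\}$, which are in $\mathcal{G}$, shows that $Y - \tilde{Y} = 0$ almost surely.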

To show existence consider the probability space $(\Omega, \mathcal{G}, \mathbb{P})$. Consider the (signed) measure on $(\Omega, \mathcal{G})$,
$$\nu(A) = \int_A X\, d\mathbb{P}.$$
Note that $\nu \ll \mathbb{P}$. Hence by the Radon-Nikodym theorem there is a $\mathcal{G}$-measurable random variable $Y$ with
$$\nu(A) = \int_A Y\, d\mathbb{P}.$$

Using the proposition, we can define the conditional expectation E[X | G] to be the random variable Y
in the proposition. It is characterized by the properties
• E[X | G] is G-measurable.
• For every $A \in \mathcal{G}$,
$$\int_A \mathbb{E}[X \mid \mathcal{G}]\, d\mathbb{P} = \int_A X\, d\mathbb{P}. \qquad (16)$$

The random variable E[X | G] is only defined up to a set of probability zero.

Exercise 6.3. This exercise gives an alternative proof of the existence of E[X | G] using Hilbert spaces (but
not using Radon-Nikodym theorem). Let H denote the real Hilbert space of all square-integrable random
variables on (Ω, F, P).
1. Show that the space of $\mathcal{G}$-measurable, square-integrable random variables is a closed subspace of $H$.

2. For square-integrable X, define E(X | G) to be the Hilbert space projection onto the space of G-
measurable random variables. Show that E(X | G) satisfies (16).
3. Show that $\mathbb{E}(X \mid \mathcal{G})$ exists for integrable $X$. (Hint: if $X \ge 0$, we can approximate $X$ from below by
simple random variables $X_n$. Define
$$\mathbb{E}(X \mid \mathcal{G}) = \lim_{n \to \infty} \mathbb{E}(X_n \mid \mathcal{G}).$$
For general $X$, write $X = X^+ - X^-$.)

Properties of Conditional Expectation

• E[E[X | G]] = E[X].

Proof. This is a special case of (16) where A = Ω.
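• If $a, b \in \mathbb{R}$, $\mathbb{E}[aX + bY \mid \mathcal{G}] = a\, \mathbb{E}[X \mid \mathcal{G}] + b\, \mathbb{E}[Y \mid \mathcal{G}]$. (This equality and others below are really
equalities up to an event of probability zero.)

Proof. The right-hand side is $\mathcal{G}$-measurable, and if $A \in \mathcal{G}$,
$$\begin{aligned}
\int_A \left( a\, \mathbb{E}[X \mid \mathcal{G}] + b\, \mathbb{E}[Y \mid \mathcal{G}] \right) d\mathbb{P} &= a \int_A \mathbb{E}[X \mid \mathcal{G}]\, d\mathbb{P} + b \int_A \mathbb{E}[Y \mid \mathcal{G}]\, d\mathbb{P} \\
&= a \int_A X\, d\mathbb{P} + b \int_A Y\, d\mathbb{P} \\
&= \int_A (aX + bY)\, d\mathbb{P}.
\end{aligned}$$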

The result follows by uniqueness of the conditional expectation.

• If X is G-measurable, then E[X | G] = X.

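• If $\mathcal{G}$ is independent of $X$, then $\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X]$. ($\mathcal{G}$ is independent of $X$ if $\mathcal{G}$ is independent of the
σ-algebra generated by $X$. Equivalently, they are independent if for every $A \in \mathcal{G}$, $1_A$ is independent
of $X$.)
Proof. The constant random variable $\mathbb{E}[X]$ is certainly $\mathcal{G}$-measurable. Also, if $A \in \mathcal{G}$,
$$\int_A \mathbb{E}[X]\, d\mathbb{P} = \mathbb{P}(A)\, \mathbb{E}[X] = \mathbb{E}[1_A]\, \mathbb{E}[X] = \mathbb{E}[1_A X] = \int_A X\, d\mathbb{P}.$$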

• If H ⊂ G ⊂ F are σ-algebras, then

E[X | H] = E[E[X | G] | H]. (17)

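Proof. Clearly, $\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}]$ is $\mathcal{H}$-measurable. If $A \in \mathcal{H}$, then $A \in \mathcal{G}$, and hence
$$\int_A \mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}]\, d\mathbb{P} = \int_A \mathbb{E}[X \mid \mathcal{G}]\, d\mathbb{P} = \int_A X\, d\mathbb{P}.$$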

• If Y is G-measurable, then
E[XY | G] = Y E[X | G]. (18)

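Proof. Clearly $Y\, \mathbb{E}[X \mid \mathcal{G}]$ is $\mathcal{G}$-measurable. We need to show that for every $A \in \mathcal{G}$,
$$\int_A Y\, \mathbb{E}[X \mid \mathcal{G}]\, d\mathbb{P} = \int_A Y X\, d\mathbb{P}. \qquad (19)$$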

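We will first consider the case where $X, Y \ge 0$. Note that this implies that $\mathbb{E}[X \mid \mathcal{G}] \ge 0$ a.s.,
since $\int_A \mathbb{E}[X \mid \mathcal{G}]\, d\mathbb{P} = \int_A X\, d\mathbb{P} \ge 0$ for every $A \in \mathcal{G}$. Find simple $\mathcal{G}$-measurable random variables
$0 \le Y_1 \le Y_2 \le \cdots$ such that $Y_n \to Y$. Then $Y_n X$ converges monotonically to $YX$ and $Y_n\, \mathbb{E}[X \mid \mathcal{G}]$
converges monotonically to $Y\, \mathbb{E}[X \mid \mathcal{G}]$. If $A \in \mathcal{G}$ and
$$Z = \sum_{j=1}^{n} c_j 1_{B_j}, \quad B_j \in \mathcal{G},$$
is a $\mathcal{G}$-measurable simple random variable,
$$\begin{aligned}
\int_A Z\, \mathbb{E}[X \mid \mathcal{G}]\, d\mathbb{P} &= \sum_{j=1}^{n} c_j \int_A 1_{B_j}\, \mathbb{E}[X \mid \mathcal{G}]\, d\mathbb{P} \\
&= \sum_{j=1}^{n} c_j \int_{A \cap B_j} \mathbb{E}[X \mid \mathcal{G}]\, d\mathbb{P} \\
&= \sum_{j=1}^{n} c_j \int_{A \cap B_j} X\, d\mathbb{P} \\
&= \int_A \Big[ \sum_{j=1}^{n} c_j 1_{B_j} \Big] X\, d\mathbb{P} \\
&= \int_A Z X\, d\mathbb{P}.
\end{aligned}$$

Hence (19) holds for simple nonnegative $Y$ and nonnegative $X$, and hence by the monotone convergence
theorem it holds for all nonnegative $Y$ and nonnegative $X$. For general $Y, X$, we write $Y = Y^+ - Y^-$,
$X = X^+ - X^-$ and use linearity of expectation and conditional expectation.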

If $X, Y$ are random variables and $X$ is integrable we write $\mathbb{E}[X \mid Y]$ for $\mathbb{E}[X \mid \sigma(Y)]$. Here $\sigma(Y)$ denotes
the σ-algebra generated by $Y$. Intuitively, $\mathbb{E}[X \mid Y]$ is the best guess of $X$ given the value of $Y$. Note that
$\mathbb{E}[X \mid Y]$ can be written as $\phi(Y)$ for some function $\phi$, i.e., for each possible value of $Y$ there is a value of
$\mathbb{E}[X \mid Y]$. Elementary texts often write this as $\mathbb{E}[X \mid Y = y]$ to indicate that for each value of $y$ there is a
value of $\mathbb{E}[X \mid Y]$. Similarly, we can define $\mathbb{E}[X \mid Y_1, \ldots, Y_n]$.

Example Consider the coin-tossing random variables discussed earlier in the section. What is $\mathbb{E}[X_1 \mid S_n]$?
Symmetry tells us that
$$\mathbb{E}[X_1 \mid S_n] = \mathbb{E}[X_2 \mid S_n] = \cdots = \mathbb{E}[X_n \mid S_n].$$
Also, linearity gives
$$\mathbb{E}[X_1 \mid S_n] + \mathbb{E}[X_2 \mid S_n] + \cdots + \mathbb{E}[X_n \mid S_n] = \mathbb{E}[X_1 + \cdots + X_n \mid S_n] = \mathbb{E}[S_n \mid S_n] = S_n.$$
Hence,
$$\mathbb{E}[X_1 \mid S_n] = \frac{S_n}{n}.$$
Note that $\mathbb{E}[X_1 \mid S_n]$ is not the same thing as $\mathbb{E}[X_1 \mid X_1, \ldots, X_n] = \mathbb{E}[X_1 \mid \mathcal{F}_n]$, which would be just $X_1$.
The first conditioning is given only the number of heads on the first $n$ flips while the second conditioning is
given all the outcomes of the $n$ flips.
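The identity $\mathbb{E}[X_1 \mid S_n] = S_n/n$ can also be confirmed by brute force for a small $n$: on the event $\{S_n = k\}$ all outcomes are equally likely for fair coins, so the conditional expectation is the fraction of those outcomes whose first flip is a head. The sketch below assumes $n = 4$ and uses illustrative names.

```python
from fractions import Fraction
from itertools import product

n = 4
outcomes = list(product((0, 1), repeat=n))   # all 2^n equally likely flip sequences

def cond_exp_X1_given_Sn(k):
    """Value of E[X_1 | S_n] on the event {S_n = k}, by averaging over that event."""
    block = [w for w in outcomes if sum(w) == k]
    return Fraction(sum(w[0] for w in block), len(block))

values = {k: cond_exp_X1_given_Sn(k) for k in range(n + 1)}
for k, v in values.items():
    print(k, v, Fraction(k, n))
```

The two printed fractions coincide for every $k$, matching the symmetry argument.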

Example Suppose $X, Y$ are two random variables with joint density function $f(x, y)$. Let us take our
probability space to be $\mathbb{R}^2$, let $\mathcal{F}$ be the Borel sets, and let $\mathbb{P}$ be the measure
$$\mathbb{P}(A) = \int_A f(x, y)\, dx\, dy.$$

Then the random variables $X(x, y) = x$ and $Y(x, y) = y$ have the joint density $f(x, y)$. The conditional distribution of $Y$ given $X$ has
density
$$f(y \mid X) = \frac{f(X, y)}{\int_{-\infty}^{\infty} f(X, z)\, dz},$$
and, when $Y$ is integrable, $\mathbb{E}[Y \mid X] = \int_{-\infty}^{\infty} y\, f(y \mid X)\, dy$.
This formula is not valid when the denominator is zero. However, the set of $x$ such that
$$\int_{-\infty}^{\infty} f(x, z)\, dz = 0$$
is a null set under the measure $\mathbb{P}$ and hence the density $f(y \mid X)$ is well defined almost surely. This density
is often called the conditional density of $Y$ given $X$.

6.2 Martingales
A martingale is a model of a fair game. Suppose we have a probability space (Ω, F, P) and an increasing
sequence of σ-algebras
F0 ⊂ F1 ⊂ · · ·
Such a sequence is called a filtration, and we think of Fn as representing the amount of knowledge we have
at time n. The inclusion of the σ-algebras implies that we do not lose any knowledge that we have gained.

Definition A martingale (with respect to the filtration {Fn }) is a sequence of integrable random variables
Mn such that Mn is Fn -measurable for each n and for all n < m,

E[Mm | Fn ] = Mn . (20)

We note that to prove (20), it suffices to show that for all n,

E[Mn+1 | Fn ] = Mn .

This follows from successive applications of (17). If the filtration is not specified explicitly, then one assumes
that one uses the filtration generated by Mn , i.e., Fn is the σ-algebra generated by M1 , . . . , Mn .
Examples
We list some examples. We leave it as exercises to show that they are martingales.

• If X1 , X2 , . . . are independent, mean-zero random variables and Sn = X1 + · · · + Xn , then Sn is a

martingale with respect to the filtration generated by X1 , X2 , . . . .
• If $X_1, X_2, \ldots$ are as above, $\mathbb{E}[X_j^2] = \sigma_j^2 < \infty$ for each $j$, then
$$M_n = S_n^2 - \sum_{j=1}^{n} \sigma_j^2$$

is a martingale with respect to the filtration generated by X1 , X2 , . . . .

• Suppose X1 , X2 , . . . are as in the first example and Fn is the filtration generated by X1 , X2 , . . . . We
choose F0 to be the trivial σ-algebra. For each n, let Bn be a bounded random variable that is
measurable with respect to Fn−1 . We think of Bn as being the “bet” on the game Xn ; we can see the
results of $X_1, \ldots, X_{n-1}$ before choosing a bet but one cannot see $X_n$. The total fortune by time $n$ is
given by $W_0 = 0$ and
$$W_n = \sum_{j=1}^{n} B_j X_j.$$

Then Wn is a martingale.

• Suppose $X_1, X_2, \ldots$ are independent coin-flipping random variables, i.e., with distribution
$$\mathbb{P}\{X_j = 1\} = \mathbb{P}\{X_j = -1\} = \frac{1}{2}.$$
A particular case of the last example is where $B_1 = 1$ and $B_n = 2^{n-1}$ if $X_1 = X_2 = \cdots = X_{n-1} = -1$ and
$B_n = 0$ otherwise. In other words, one keeps doubling one’s bet until one wins. Note that
$$\mathbb{P}\{W_n = 1\} = 1 - 2^{-n}, \qquad \mathbb{P}\{W_n = 1 - 2^n\} = 2^{-n}. \qquad (21)$$

This betting strategy is sometimes called the martingale betting strategy.
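The distribution (21) can be verified by enumerating all $2^n$ flip sequences. The sketch below takes the bet sizes to be the ones consistent with (21): a first bet of 1 that doubles after each loss and drops to 0 after the first win. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def fortune(xs):
    """W_n for the doubling strategy applied to the flips xs (each +1 or -1)."""
    w = 0
    for j, x in enumerate(xs, start=1):
        all_lost_so_far = all(v == -1 for v in xs[:j - 1])
        bet = 2 ** (j - 1) if all_lost_so_far else 0   # the bet depends only on earlier flips
        w += bet * x
    return w

def distribution(n):
    """Exact distribution of W_n as {value: probability}."""
    dist = {}
    for xs in product((1, -1), repeat=n):
        w = fortune(xs)
        dist[w] = dist.get(w, Fraction(0)) + Fraction(1, 2 ** n)
    return dist

for n in (1, 3, 6):
    d = distribution(n)
    print(n, d, sum(w * p for w, p in d.items()))
```

The last printed number is $\mathbb{E}[W_n]$, which stays at 0 for every $n$ even though $W_n = 1$ with probability close to one.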

We want to prove a version of the optional stopping theorem which is a mathematical way of saying that
one cannot beat a fair game. The last example above shows that we need to be careful — if I play a fair
game and keep doubling my bet, I will eventually win and come out ahead (provided that I always have
enough money to put down the bet when I lose!)

Definition A stopping time with respect to a filtration $\{\mathcal{F}_n\}$ is a random variable $T$ taking values in $\{0, 1, 2, \ldots\} \cup \{\infty\}$
such that for each $n$, the event
$$\{T \le n\} \in \mathcal{F}_n.$$

In other words, we only need the information at time $n$ to know whether we have stopped by time $n$.
Examples of stopping times are:
• Constant random variables are stopping times.
• If Mn is a martingale with respect to {Fn } and V ⊂ R is a Borel set, then

T = min{j : Mj ∈ V }

is a stopping time.
• If T1 , T2 are stopping times, then so are T1 ∧ T2 and T1 ∨ T2 .
• In particular, if T is a stopping time then so are Tn = T ∧ n. Note that T0 ≤ T1 ≤ . . . and T = lim Tn .
Proposition 6.4. If $M_n$ is a martingale and $T$ is a stopping time with respect to $\{\mathcal{F}_n\}$, then $Y_n = M_{T \wedge n}$
is a martingale with respect to $\{\mathcal{F}_n\}$. In particular, $\mathbb{E}[Y_n] = \mathbb{E}[Y_0] = \mathbb{E}[M_0]$.
Proof. Note that
$$Y_n = M_n 1\{T \ge n\} + \sum_{j=0}^{n-1} M_j 1\{T = j\}.$$

From this it is easy to see that $Y_n$ is $\mathcal{F}_n$-measurable and $\mathbb{E}[|Y_n|] < \infty$. Also note that the event $\{T \ge n + 1\}$
is the complement of the event $\{T \le n\}$ and hence is in $\mathcal{F}_n$. Therefore, (18) implies
$$\mathbb{E}(M_{n+1} 1\{T \ge n + 1\} \mid \mathcal{F}_n) = 1\{T \ge n + 1\}\, \mathbb{E}(M_{n+1} \mid \mathcal{F}_n) = 1\{T \ge n + 1\}\, M_n.$$

Therefore, using linearity,
$$\begin{aligned}
\mathbb{E}(Y_{n+1} \mid \mathcal{F}_n) &= 1\{T \ge n + 1\}\, M_n + \sum_{j=0}^{n} M_j 1\{T = j\} \\
&= 1\{T \ge n\}\, M_n + \sum_{j=0}^{n-1} M_j 1\{T = j\} = Y_n.
\end{aligned}$$

Examples

• Let $X_1, X_2, \ldots$ be independent, each with $\mathbb{P}\{X_j = 1\} = \mathbb{P}\{X_j = -1\} = 1/2$. Let $M_n = 1 + X_1 + \cdots + X_n$,
$N$ an integer greater than 1, and
$$T = \min\{j : M_j = 0 \text{ or } M_j = N\}.$$
$T$ is the first time that a “random walker” starting at 1 reaches 0 or $N$. Then $M_{n \wedge T}$ is a martingale.
In particular,
$$1 = \mathbb{E}[M_0] = \mathbb{E}[M_{n \wedge T}].$$
It is easy to check that w.p.1 $T < \infty$. Since $M_{n \wedge T}$ is a sequence of bounded random variables
approaching $M_T$, the dominated convergence theorem implies
$$\mathbb{E}[M_T] = 1.$$
Note that
$$\mathbb{E}[M_T] = N\, \mathbb{P}\{M_T = N\},$$
and hence we have computed a probability,
$$\mathbb{P}\{M_T = N\} = \frac{1}{N}.$$

• Let $X_1, X_2, \ldots$ be as above and let
$$S_n = X_1 + \cdots + X_n.$$
We have noted that $M_n = S_n^2 - n$ is a martingale. Let $N$ be a positive integer and
$$T = \min\{n : |S_n| = N\}.$$
Then $M_{n \wedge T}$ is a martingale and hence
$$0 = \mathbb{E}[M_0] = \mathbb{E}[M_{n \wedge T}] = \mathbb{E}[S_{n \wedge T}^2 - (n \wedge T)].$$
With probability one,
$$\lim_{n \to \infty} M_{n \wedge T} = M_T.$$
We would like to say that
$$\mathbb{E}[M_T] = \lim_{n \to \infty} \mathbb{E}[M_{n \wedge T}] = 0. \qquad (22)$$
Indeed, we can justify this using the dominated convergence theorem. It is not difficult to show (Exercise 6.5)
that $\mathbb{E}[T] < \infty$, and hence $M_{n \wedge T}$ is dominated by the integrable random variable $N^2 + T$. This justifies
(22). Since $S_T^2 = N^2$, this implies that
$$\mathbb{E}[T] = N^2.$$

• Consider the martingale betting strategy as above and let
$$T = \min\{n : X_n = 1\} = \min\{n : W_n = 1\}.$$
Then $W_{T \wedge n}$ is a martingale, and hence $\mathbb{E}[W_{T \wedge n}] = \mathbb{E}[W_0] = 0$. This could be checked easily using (21).
Note that $\mathbb{P}\{T < \infty\} = 1$ and
$$\lim_{n \to \infty} W_{T \wedge n} = W_T = 1.$$
Hence,
$$1 = \mathbb{E}[W_T] \ne \lim_{n \to \infty} \mathbb{E}[W_{T \wedge n}] = 0.$$
In this case there is no integrable random variable $Y$ such that $|W_{T \wedge n}| \le Y$ for all $n$.
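The first two examples can be checked numerically by evolving the exact distribution of the stopped walk step by step. The sketch below is one possible way to do this; the function name, the choice $N = 5$, and the number of steps are assumptions of the illustration. It uses $\mathbb{E}[T] = \sum_{n \ge 0} \mathbb{P}\{T > n\}$, truncated after many steps.

```python
def stopped_walk(start, lo, hi, steps):
    """Evolve the fair +-1 walk from `start`, frozen once it hits lo or hi.

    Returns the distribution after `steps` steps and the partial sum
    sum_{n < steps} P{T > n}, which approximates E[T].
    """
    dist = {start: 1.0}
    expected_T = 0.0
    for _ in range(steps):
        expected_T += sum(p for x, p in dist.items() if lo < x < hi)  # P{T > n}
        new = {}
        for x, p in dist.items():
            if lo < x < hi:                       # still moving
                new[x - 1] = new.get(x - 1, 0.0) + p / 2
                new[x + 1] = new.get(x + 1, 0.0) + p / 2
            else:                                 # already stopped
                new[x] = new.get(x, 0.0) + p
        dist = new
    return dist, expected_T

N = 5
dist, _ = stopped_walk(start=1, lo=0, hi=N, steps=2000)
print(dist.get(N, 0.0), 1 / N)                    # P{M_T = N} against 1/N
print(sum(x * p for x, p in dist.items()))        # E[M_{n ^ T}] stays at 1

_, mean_T = stopped_walk(start=0, lo=-N, hi=N, steps=5000)
print(mean_T, N * N)                              # E[T] against N^2
```

The probability that the walk is still moving decays geometrically (this is the content of Exercise 6.5), so a few thousand steps are more than enough here.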

Exercise 6.5. Consider the second example above.
• Show that there exist $c < \infty$, $\rho < 1$ (depending on $N$) such that for all $n$,
$$\mathbb{P}\{T > n\} \le c\, \rho^n.$$
Use this to conclude that $\mathbb{E}[T] < \infty$.

• Suppose we changed the problem by assuming that $M_0 = k$ where $k < N$ is a positive integer. In this
case, what is $\mathbb{E}[T]$?

7 Brownian motion
Brownian motion is a model for random continuous motion. We will consider only the one-dimensional
version. Imagine that $B_t = B_t(\omega)$ denotes the position of a random particle at time $t$. For fixed $t$, $B_t(\omega)$ is a
random variable. For fixed $\omega$, we have a function $t \mapsto B_t(\omega)$. Hence, we can consider the Brownian motion
either as a collection of random variables $B_t$ parametrized by time or as a random function from $[0, \infty)$ into
$\mathbb{R}$.

Definition A collection of random variables Bt , t ∈ [0, ∞) defined on a probability space (Ω, F, P) is called
a Brownian motion if it satisfies the following.
(i) B0 = 0 w.p.1.
(ii) For each $0 \le s_1 \le t_1 \le s_2 \le t_2 \le \cdots \le s_n \le t_n$, the random variables
$$B_{t_1} - B_{s_1}, \ldots, B_{t_n} - B_{s_n}$$
are independent.
(iii) For each $0 \le s \le t$, the random variable $B_t - B_s$ has the same distribution as $B_{t-s}$.
(iv) W.p.1, $t \mapsto B_t(\omega)$ is a continuous function of $t$.

It is a nontrivial theorem, which we will not prove here, that if $B_t$ is a process satisfying
(i)–(iv) above, then the distribution of $B_t - B_s$ must be normal. (This fact is closely related to but does not
immediately follow from the central limit theorem.) Note that
$$B_1 = \sum_{j=1}^{n} \left[ B_{j/n} - B_{(j-1)/n} \right],$$
so a normal distribution is reasonable. However, one cannot conclude normality only from assumptions
(i)–(iii), see Exercise 5.8. Since we know Brownian motion must have normal increments (but we do not
want to prove it here!), we will modify our definition to include this fact.

Definition A collection of random variables Bt , t ∈ [0, ∞) defined on a probability space (Ω, F, P) is called
a Brownian motion with drift µ and variance parameter σ 2 , if it satisfies the following.
(i) B0 = 0 w.p.1.
(ii) For each $0 \le s_1 \le t_1 \le s_2 \le t_2 \le \cdots \le s_n \le t_n$, the random variables
$$B_{t_1} - B_{s_1}, \ldots, B_{t_n} - B_{s_n}$$
are independent.
(iii) For each $0 \le s \le t$, the random variable $B_t - B_s$ has a normal distribution with mean $\mu(t - s)$ and
variance $\sigma^2 (t - s)$.
(iv) W.p.1, $t \mapsto B_t(\omega)$ is a continuous function of $t$.
$B_t$ is called a standard Brownian motion if $\mu = 0$, $\sigma^2 = 1$.

Exercise 7.1. Suppose $B_t$ is a standard Brownian motion. Show that $\hat{B}_t := \sigma B_t + \mu t$ is a Brownian motion
with drift $\mu$ and variance parameter $\sigma^2$.
In this section, we will show that Brownian motion exists. We will take any probability space $(\Omega, \mathcal{F}, \mathbb{P})$
sufficiently large that there exists a countable sequence of independent standard normal random variables $N_1, N_2, \ldots$
defined on the space. Exercise 3.8 shows that the unit interval with Lebesgue measure is large enough. For
ease, we will show how to construct a standard Brownian motion $B_t$, $0 \le t \le 1$. Let
$$D_n = \{j/2^n : j = 1, \ldots, 2^n\}, \qquad D = \bigcup_{n=0}^{\infty} D_n,$$

and suppose that the random variables {Nt : t ∈ D} have been relabeled so they are indexed by the countable
set D. The basic strategy is as follows:
• Step 1. Use the random variables Nt , t ∈ D to define random variables Bt , t ∈ D that satisfy (i)–(iii),
restricted to s, t, sj , tj ∈ D.
• Step 2. Show that w.p.1 the function $t \mapsto B_t$, $t \in D$ is uniformly continuous.
• Step 3. On the event that $t \mapsto B_t$, $t \in D$ is uniformly continuous, define $B_t$, $t \in [0, 1]$ by continuity.
Show this satisfies the definition.

7.1 Step 1.
We will recursively define $B_t$ for $t \in D_n$. In our construction we will see that at each stage the random
variables
$$\Delta(j, n) := B_{j 2^{-n}} - B_{(j-1) 2^{-n}}, \quad j = 1, \ldots, 2^n,$$
are independent, normal, mean zero, variance $2^{-n}$. We start with $n = 0$, and set $B_0 = 0$, $B_1 = N_1$. To do the next step,
we use the following lemma.
Lemma 7.2. Suppose $X, Y$ are independent normal random variables with mean zero, variance $\sigma^2$, and let
$Z = X + Y$, $W = X - Y$. Then $Z, W$ are independent normal random variables with mean zero and variance $2\sigma^2$.
Proof. For ease, assume $\sigma^2 = 1$. If $V \subset \mathbb{R}^2$ is a Borel set, let $\tilde{V} = \{(x, y) : (x + y, x - y) \in V\}$. Then
$$\mathbb{P}\{(Z, W) \in V\} = \mathbb{P}\{(X, Y) \in \tilde{V}\} = \int_{\tilde{V}} \frac{1}{2\pi}\, e^{-(x^2 + y^2)/2}\, dx\, dy = \int_{V} \frac{1}{4\pi}\, e^{-(z^2 + w^2)/4}\, dz\, dw.$$

The last equality uses the change of variables $x = (z + w)/2$, $y = (z - w)/2$.

We define
$$B_{1/2} = \frac{1}{2} B_1 + \frac{1}{2} N_{1/2} = \frac{1}{2} N_1 + \frac{1}{2} N_{1/2},$$
and hence
$$B_1 - B_{1/2} = \frac{1}{2} N_1 - \frac{1}{2} N_{1/2}.$$
Since $N_{1/2}, N_1$ are independent, standard normal random variables, the lemma tells us that
$$\Delta(1, 1), \ \Delta(2, 1)$$
are independent, mean-zero normals with variance 1/2. Recursively, if $B_t$ is defined for $t \in D_n$, we define, for $j = 1, \ldots, 2^n$,
$$\Delta(2j - 1, n + 1) = \frac{1}{2} \Delta(j, n) + 2^{-(n+2)/2} N_{(2j-1) 2^{-(n+1)}},$$
and hence
$$\Delta(2j, n + 1) = \frac{1}{2} \Delta(j, n) - 2^{-(n+2)/2} N_{(2j-1) 2^{-(n+1)}}.$$
Then continual use of the lemma shows that this works.
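The recursion of Step 1 translates directly into a short program: each dyadic interval is split at its midpoint, and the midpoint value is half the interval's increment plus an independent normal of standard deviation $2^{-(n+2)/2}$. The sketch below is a minimal illustration under that reading; the function name and the use of Python's `random.Random.gauss` as the source of the normals $N_t$ are assumptions.

```python
import random

def dyadic_brownian(levels, rng):
    """Values B_t at t = j / 2^levels, j = 0, ..., 2^levels, built by midpoint refinement."""
    b = [0.0, rng.gauss(0.0, 1.0)]                 # B_0 = 0, B_1 = N_1
    for n in range(levels):
        refined = []
        for left, right in zip(b, b[1:]):
            increment = right - left               # an increment at level n, variance 2^-n
            noise = 2 ** (-(n + 2) / 2) * rng.gauss(0.0, 1.0)
            refined.append(left)
            refined.append(left + 0.5 * increment + noise)   # value at the new midpoint
        refined.append(b[-1])
        b = refined
    return b

path = dyadic_brownian(levels=10, rng=random.Random(0))
print(len(path), path[0], path[-1])
```

Values already assigned at coarser dyadic points are never changed by later refinements, which is what makes the limit over $D$ meaningful.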

7.2 Step 2.
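Let
$$M_n = \sup\{|B_s - B_t| : s, t \in D, \ |s - t| \le 2^{-n}\}.$$
Showing continuity is equivalent to showing that w.p.1, $M_n \to 0$ as $n \to \infty$, or, equivalently, for each $\epsilon > 0$,
$$\mathbb{P}\{M_n > \epsilon \text{ i.o.}\} = 0.$$
It is easier to work with a different quantity:
$$Z(j, n) = \sup\left\{ \left| B_t - B_{(j-1) 2^{-n}} \right| : t \in D, \ (j - 1) 2^{-n} \le t \le j 2^{-n} \right\},$$
$$Z_n = \max\{Z(j, n) : j = 1, \ldots, 2^n\}.$$
Note that $M_n \le 3 Z_n$. Hence, for any $\epsilon > 0$,
$$\mathbb{P}\{M_n > 3\epsilon\} \le \mathbb{P}\{Z_n > \epsilon\} \le \sum_{j=1}^{2^n} \mathbb{P}\{Z(j, n) > \epsilon\} = 2^n\, \mathbb{P}\{Z(1, n) > \epsilon\}.$$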
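Since $B_t$ has the same distribution as $\sqrt{t}\, B_1$, a little thought will show that the distribution of $Z(1, n)$ is the
same as the distribution of
$$2^{-n/2} Z(1, 0) = 2^{-n/2} \sup\{|B_t| : t \in D\} = 2^{-n/2} \lim_{m \to \infty} \max\{|B_t| : t \in D_m\}.$$

For any $\delta > 0$, let us consider
$$\mathbb{P}\{\max\{|B_t| : t \in D_m\} > \delta\} \le 2\, \mathbb{P}\{\max\{B_t : t \in D_m\} > \delta\}.$$
We now write
$$\{\max\{B_t : t \in D_m\} > \delta\} = \bigcup_{k=1}^{2^m} A_k,$$
where
$$A_k = A_{k,m} = \left\{ B_{k 2^{-m}} > \delta, \ B_{j 2^{-m}} \le \delta, \ j < k \right\}.$$
Note that
$$\mathbb{P}(A_k \cap \{B_1 > \delta\}) \ge \mathbb{P}(A_k \cap \{B_1 - B_{k 2^{-m}} \ge 0\}) = \mathbb{P}(A_k)\, \mathbb{P}\{B_1 - B_{k 2^{-m}} \ge 0\} \ge \frac{1}{2}\, \mathbb{P}(A_k).$$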
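Since the $A_k$ are disjoint, we get
$$\mathbb{P}\{B_1 > \delta\} = \sum_{k=1}^{2^m} \mathbb{P}(A_k \cap \{B_1 > \delta\}) \ge \frac{1}{2} \sum_{k=1}^{2^m} \mathbb{P}(A_k) = \frac{1}{2}\, \mathbb{P}\{\max\{B_t : t \in D_m\} > \delta\},$$
or, for $\delta \ge 2\sqrt{2/\pi}$,
$$\mathbb{P}\{\max\{B_t : t \in D_m\} > \delta\} \le 2\, \mathbb{P}\{B_1 > \delta\} = \sqrt{\frac{2}{\pi}} \int_{\delta}^{\infty} e^{-x^2/2}\, dx \le \sqrt{\frac{2}{\pi}} \int_{\delta}^{\infty} e^{-x\delta/2}\, dx \le e^{-\delta^2/2}.$$

Since this holds for all $m$, $\mathbb{P}\{\sup\{|B_t| : t \in D\} > \delta\} \le 2 e^{-\delta^2/2}$. Putting this all together, for $n$ large enough that $\epsilon\, 2^{n/2} \ge 2\sqrt{2/\pi}$, we have
$$\mathbb{P}\{M_n > 3\epsilon\} \le 2^n\, \mathbb{P}\{Z(1, n) > \epsilon\} = 2^n\, \mathbb{P}\{Z(1, 0) > \epsilon\, 2^{n/2}\} \le 2^{n+1} \exp\{-\epsilon^2 2^{n-1}\}.$$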

Therefore,
$$\sum_{n=1}^{\infty} \mathbb{P}\{M_n > 3\epsilon\} < \infty,$$
and by the Borel-Cantelli lemma,
$$\mathbb{P}\{M_n > 3\epsilon \text{ i.o.}\} = 0.$$
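Two numerical facts underlie this step: the Gaussian tail bound $2\,\mathbb{P}\{B_1 > \delta\} \le e^{-\delta^2/2}$ for large enough $\delta$, and the summability of terms of the form $2^{n+1} \exp\{-\epsilon^2 2^{n-1}\}$ (constant factors do not affect summability). The sketch below checks both for a few illustrative values; the function names and the sample values of $\delta$ and $\epsilon$ are assumptions.

```python
import math

def two_sided_tail(delta):
    """2 * P{B_1 > delta} for a standard normal B_1, via the complementary error function."""
    return math.erfc(delta / math.sqrt(2.0))

def borel_cantelli_terms(eps, n_max):
    """The terms 2^(n+1) * exp(-eps^2 * 2^(n-1)) for n = 1, ..., n_max."""
    return [2 ** (n + 1) * math.exp(-eps ** 2 * 2 ** (n - 1)) for n in range(1, n_max + 1)]

for delta in (1.6, 2.0, 3.0, 5.0):
    print(delta, two_sided_tail(delta), math.exp(-delta ** 2 / 2))

terms = borel_cantelli_terms(eps=0.5, n_max=40)
print(sum(terms), terms[-1])
```

The terms first grow with the factor $2^{n+1}$ and then collapse doubly exponentially, so the partial sums stabilize after a handful of terms.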

Exercise 7.3. Let $\alpha < 1/2$ and
$$Y = \sup\left\{ \frac{|B_t - B_s|}{(t - s)^{\alpha}} : 0 \le s < t \le 1 \right\}.$$
Show that w.p.1, $Y < \infty$. Show that the result is not true for $\alpha = 1/2$.

7.3 Step 3.
This step is straightforward and left as an exercise. One important fact that is used is that if $X = \lim X_n$
and $X_n$ is normal mean zero, variance $\sigma_n^2$ and $\sigma_n \to \sigma$, then $X$ is normal mean zero, variance $\sigma^2$.

7.4 Wiener measure

Brownian motion $B_t$, $0 \le t \le 1$, can be considered as a random variable taking values in the metric space $C[0,1]$, the set of continuous functions with the usual sup norm. The distribution of this random variable is a probability measure on $C[0,1]$ and is called Wiener measure. If $f \in C[0,1]$ and $\epsilon > 0$, then the Wiener measure of the open ball of radius $\epsilon$ about $f$ is
$$P\{|B_t - f(t)| < \epsilon,\ 0 \le t \le 1\}.$$
Note that $C[0,1]$ with Wiener measure is a probability space. In fact, if we start with this probability space, then it is trivial to define a Brownian motion: we just let $B_t(f) = f(t)$.

8 Elementary probability
In this section we discuss some results that would usually be covered in an undergraduate calculus-based
course in probability.

8.1 Discrete distributions

A discrete distribution on $\mathbb{R}$ can be considered as a function $q : \mathbb{R} \to [0,1]$ such that $S_q := \{x : q(x) > 0\}$ is finite or countably infinite, and
$$\sum_{x \in S_q} q(x) = 1.$$

We call $S_q$ the support of $q$. If $X$ is a random variable with distribution $q$, then

$$P\{X = x\} = q(x).$$

8.1.1 Bernoulli and Binomial

Let $p \in (0,1)$. Independent repetitions of an experiment with probability $p$ of success are called Bernoulli trials. The results of the trials can be represented by independent Bernoulli random variables
$$X_1, X_2, X_3, \ldots$$
with $P\{X_j = 1\} = 1 - P\{X_j = 0\} = p$. If $n$ is a positive integer, then the random variable
$$Y = X_1 + \cdots + X_n$$
is said to have a binomial distribution with parameters $n$ and $p$. $Y$ represents the number of successes in $n$ Bernoulli trials, each with probability $p$ of success. Note that
$$P\{Y = k\} = \binom{n}{k}\, p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n.$$

The binomial coefficient $\binom{n}{k}$ represents the number of ordered $n$-tuples of 0s and 1s that have exactly $k$ 1s. Clearly $E[X_j] = p$ and
$$\mathrm{Var}[X_j] = E[X_j^2] - (E[X_j])^2 = p - p^2 = p\,(1-p).$$
Hence, if $Y$ has a binomial distribution with parameters $n, p$,
$$E[Y] = \sum_{j=1}^n E[X_j] = np, \qquad \mathrm{Var}[Y] = \sum_{j=1}^n \mathrm{Var}[X_j] = np(1-p).$$
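The formulas for the binomial mean and variance can be checked numerically from the probability mass function. This is a small sketch with hypothetical helper names and an arbitrary choice of n and p; it uses only the standard library.

```python
import math

def binomial_pmf(n, p):
    """List of P{Y = k} for k = 0, ..., n when Y is binomial with parameters n, p."""
    return [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

def mean_and_variance(pmf):
    """Mean and variance of a distribution on {0, 1, ..., len(pmf) - 1}."""
    mean = sum(k * q for k, q in enumerate(pmf))
    second_moment = sum(k * k * q for k, q in enumerate(pmf))
    return mean, second_moment - mean ** 2

n, p = 12, 0.3
pmf = binomial_pmf(n, p)
mean, var = mean_and_variance(pmf)
print(sum(pmf), mean, n * p, var, n * p * (1 - p))
```

The printed mean and variance should agree with np and np(1 − p) up to floating-point rounding.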

8.1.2 Poisson distribution

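A random variable $X$ has a Poisson distribution with parameter $\lambda > 0$ if it has distribution
$$P\{X = k\} = e^{-\lambda}\, \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots$$
The Taylor series for the exponential shows that
$$\sum_{k=0}^\infty P\{X = k\} = 1.$$

The Poisson distribution arises as a limit of the binomial distribution as $n \to \infty$ when the expected number of successes stays fixed at $\lambda$. In other words, it is the limit of binomial random variables $Y_n = Y_{n,\lambda}$ with parameters $n$ and $\lambda/n$. This limit is easily derived; if $k$ is a nonnegative integer,
$$\begin{aligned}
\lim_{n\to\infty} P\{Y_n = k\} &= \lim_{n\to\infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} \\
&= \lim_{n\to\infty} \frac{n(n-1)\cdots(n-k+1)}{k!} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^n \left(1 - \frac{\lambda}{n}\right)^{-k} \\
&= \frac{\lambda^k e^{-\lambda}}{k!}.
\end{aligned}$$
It is easy to check that

$$E[X] = \lambda, \qquad E[X(X-1)] = \lambda^2, \qquad \mathrm{Var}[X] = E[X^2] - E[X]^2 = \lambda.$$

The expectation and variance could be derived from the corresponding quantities for $Y_n$,

$$E[Y_n] = \lambda, \qquad \mathrm{Var}[Y_n] = \lambda \left(1 - \frac{\lambda}{n}\right).$$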
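The convergence of the binomial distribution with parameters n and λ/n to the Poisson distribution can be observed numerically. The sketch below uses hypothetical function names and an arbitrary λ; it compares the two probability mass functions on k = 0, ..., 10 and reports the largest pointwise gap for several values of n.

```python
import math

def binomial_prob(n, p, k):
    """P{Y = k} for Y binomial with parameters n and p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_prob(lam, k):
    """P{X = k} for X Poisson with parameter lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def max_gap(n, lam, k_max=10):
    """Largest |binomial - Poisson| probability over k = 0, ..., k_max."""
    return max(
        abs(binomial_prob(n, lam / n, k) - poisson_prob(lam, k))
        for k in range(k_max + 1)
    )

lam = 2.0
gaps = [max_gap(n, lam) for n in (10, 100, 1000, 10000)]
for n, gap in zip((10, 100, 1000, 10000), gaps):
    print(n, gap)
```

The gap shrinks as n grows, consistent with the limit derived above; the example makes no claim about the exact rate.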

8.1.3 Geometric distribution

If $p \in (0,1)$, then $X$ has a geometric distribution with parameter $p$ if

$$P\{X = k\} = (1-p)^{k-1}\, p, \qquad k = 1, 2, \ldots \tag{23}$$

$X$ represents the number of Bernoulli trials needed until the first success, assuming the probability of success on each trial is $p$. (The event that the $k$th trial is the first success is the event that the first $k-1$ trials are failures and the last one is a success. From this, (23) follows immediately.) Note that
$$E[X] = \sum_{k=1}^\infty k\,(1-p)^{k-1}\, p = -p\, \frac{d}{dp} \sum_{k=1}^\infty (1-p)^k = -p\, \frac{d}{dp}\left[\frac{1-p}{p}\right] = \frac{1}{p},$$
$$E[X(X-1)] = \sum_{k=1}^\infty k\,(k-1)\,(1-p)^{k-1}\, p = p\,(1-p)\, \frac{d^2}{dp^2}\left[\frac{1-p}{p}\right] = \frac{2(1-p)}{p^2},$$
$$E[X^2] = \frac{2}{p^2} - \frac{1}{p}, \qquad \mathrm{Var}[X] = E[X^2] - E[X]^2 = \frac{1-p}{p^2}.$$
The geometric distributions can be seen to be the only discrete distributions with support $\{1, 2, \ldots\}$ having the loss of memory property: for all positive integers $j, k$,

$$P\{X > j + k \mid X > k\} = P\{X > j\}.$$

The term geometric distribution is sometimes also used for the distribution of $Z = X - 1$,

$$P\{Z = k\} = (1-p)^k\, p, \qquad k = 0, 1, 2, \ldots$$

We can think of $Z$ as the number of failures before the first success in Bernoulli trials. Note that
$$E[Z] = E[X] - 1 = \frac{1-p}{p}, \qquad \mathrm{Var}[Z] = \mathrm{Var}[X] = \frac{1-p}{p^2}.$$
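A short numerical check of the geometric moments and of the loss of memory property, written in the form P{X > j + k | X > k} = P{X > j}, is given below. The helper names are hypothetical, and the infinite sums are truncated at a large number of terms, which is an approximation that is harmless here because the tail decays geometrically.

```python
def geometric_tail(p, m):
    """P{X > m} = (1 - p)**m for X geometric on {1, 2, ...}: the first m trials all fail."""
    return (1 - p) ** m

def geometric_moments(p, terms=2000):
    """Mean and variance of X from the truncated series over k = 1, ..., terms."""
    pmf = [(1 - p) ** (k - 1) * p for k in range(1, terms + 1)]
    mean = sum(k * q for k, q in enumerate(pmf, start=1))
    second_moment = sum(k * k * q for k, q in enumerate(pmf, start=1))
    return mean, second_moment - mean ** 2

p = 0.25
mean, var = geometric_moments(p)
print(mean, 1 / p, var, (1 - p) / p ** 2)

# Conditional tail P{X > j + k | X > k} compared with the unconditional tail P{X > j}.
j, k = 3, 5
conditional = geometric_tail(p, j + k) / geometric_tail(p, k)
print(conditional, geometric_tail(p, j))
```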

8.2 Continuous distributions

The continuous distributions from elementary probability courses are given by specifying the density $f$ or the distribution function $F$. If $X$ is a random variable with this distribution and $a < b$,
$$P\{a < X < b\} = P\{a \le X \le b\} = \int_a^b f(x)\, dx = F(b) - F(a).$$

Note that
$$F(b) = \int_{-\infty}^b f(x)\, dx,$$

and the fundamental theorem of calculus implies

$$F'(x) = f(x),$$

if $f$ is continuous at $x$. Hence, it suffices to specify either the density or the distribution function.

8.2.1 Normal distribution

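A random variable $Z$ has a standard normal distribution if it has density
$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad -\infty < x < \infty.$$
To check that this is a density, we can compute
$$\left[\int_{-\infty}^\infty e^{-x^2/2}\, dx\right]^2 = \int_{-\infty}^\infty \int_{-\infty}^\infty e^{-x^2/2}\, e^{-y^2/2}\, dx\, dy = \int_0^\infty \int_0^{2\pi} r\, e^{-r^2/2}\, d\theta\, dr = 2\pi.$$

It is easy to check that

$$E[Z] = 0, \qquad \mathrm{Var}[Z] = E[Z^2] = 1.$$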

The distribution function of $Z$ is often denoted by $\Phi$,
$$\Phi(x) = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt.$$

The normal distribution with mean $\mu$ and variance $\sigma^2$, often denoted $N(\mu, \sigma^2)$, is the distribution of $X = \sigma Z + \mu$. It is easily checked that the density is
$$\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}.$$
Indeed, if $F$ denotes the distribution function for $X$,

$$F(x) = P\{\sigma Z + \mu \le x\} = P\left\{Z \le \frac{x-\mu}{\sigma}\right\} = \Phi\left(\frac{x-\mu}{\sigma}\right).$$

Hence,
$$F'(x) = \frac{1}{\sigma}\, \Phi'\left(\frac{x-\mu}{\sigma}\right) = \frac{1}{\sigma}\, \frac{1}{\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}.$$
An important property of normal distributions is the following.
• If $X_1, \ldots, X_n$ are independent random variables where $X_j$ has a $N(\mu_j, \sigma_j^2)$ distribution, then $X_1 + \cdots + X_n$ has a $N(\mu_1 + \cdots + \mu_n, \sigma_1^2 + \cdots + \sigma_n^2)$ distribution.
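The relation between the distribution function of X = σZ + μ and its density can be checked numerically. The following sketch uses hypothetical function names and arbitrary values of μ and σ; it evaluates Φ through the error function and compares a central finite difference of F with the stated density. The finite difference is an approximation, so agreement is only up to a small tolerance.

```python
import math

def std_normal_cdf(x):
    """Phi(x), the standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_density(x, mu, sigma):
    """Density of N(mu, sigma**2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def cdf_slope(x, mu, sigma, h=1e-5):
    """Central difference approximation to F'(x), where F(x) = Phi((x - mu) / sigma)."""
    def F(y):
        return std_normal_cdf((y - mu) / sigma)
    return (F(x + h) - F(x - h)) / (2 * h)

mu, sigma = 1.0, 2.0
for x in (-1.0, 0.5, 3.0):
    print(x, cdf_slope(x, mu, sigma), normal_density(x, mu, sigma))
```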

8.2.2 Exponential distribution

A random variable $X$ has an exponential distribution with parameter $\lambda > 0$ if it has density

$$f(x) = \lambda\, e^{-\lambda x}, \qquad x > 0,$$

or, equivalently, if
$$P\{X \ge t\} = e^{-\lambda t}, \qquad t \ge 0.$$
Note that
$$E[X] = \int_0^\infty \lambda x\, e^{-\lambda x}\, dx = \frac{1}{\lambda}, \qquad E[X^2] = \int_0^\infty \lambda x^2\, e^{-\lambda x}\, dx = \frac{2}{\lambda^2},$$
$$\mathrm{Var}[X] = E[X^2] - E[X]^2 = \frac{1}{\lambda^2}.$$
The exponential distribution has the loss of memory property, i.e.,

$$P\{X \ge s + t \mid X \ge s\} = P\{X \ge t\}, \qquad s, t \ge 0.$$

It is not difficult to show that this property characterizes the exponential distribution. If $G(t) = P\{X \ge t\}$, then the above equality translates to
$$G(s+t) = G(s)\, G(t),$$
and the only continuous functions on $[0, \infty)$ satisfying the condition above with $G(0) = 1$ and $G(\infty) = 0$ are of the form $G(t) = e^{-\lambda t}$ for some $\lambda > 0$.
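One way to see the characterization is sketched below. This is a standard argument supplied for illustration, not taken from the notes; it assumes only that G is continuous, satisfies the functional equation, and has G(0) = 1 and G(∞) = 0.

```latex
% Sketch: continuous solutions of G(s+t) = G(s) G(t) with G(0) = 1, G(\infty) = 0.
\begin{align*}
G(1) &= G(1/n)^n
  && \text{(apply the equation $n-1$ times)}, \\
G(m/n) &= G(1/n)^m = G(1)^{m/n}
  && \text{($m, n$ positive integers, using $G(1/n) \ge 0$)}, \\
G(t) &= G(1)^t = e^{-\lambda t}, \quad \lambda := -\log G(1)
  && \text{(all $t \ge 0$, by continuity)}.
\end{align*}
```

Here G(1/n) = G(1/2n)² ≥ 0, so the n-th root is unambiguous. Moreover 0 < G(1) < 1: if G(1) = 0 then G(1/n) = 0 for every n, contradicting continuity at 0 with G(0) = 1, and if G(1) ≥ 1 then G(n) = G(1)^n ≥ 1 for every n, contradicting G(∞) = 0. Hence λ > 0.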

8.2.3 Gamma distribution
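A random variable $X$ has a Gamma distribution with parameters $\lambda > 0$ and $\alpha > 0$ if it has density
$$f(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1}\, e^{-\lambda x}, \qquad x > 0.$$

Here
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1}\, e^{-x}\, dx$$
is the usual Gamma function.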

Example

• If α = 1, then the Gamma distribution is the exponential distribution.

• If α = n is a positive integer, and X1 , . . . , Xn are independent random variables, each exponential with
parameter λ, then X1 + · · · + Xn has a Gamma distribution with parameters λ and α. We omit the
derivation of this.

• More generally, if X1 , . . . , Xn are independent random variables, each Gamma with parameters λ and
αj , then X1 + · · · + Xn has a Gamma distribution with parameters λ and α1 + · · · + αn .
• If $Z$ has a standard normal distribution, then $Z^2$ has a Gamma distribution with parameters $\lambda = 1/2$, $\alpha = 1/2$. To see this, let $F$ denote the distribution function of $Z^2$. Then, for $x > 0$,
$$F(x) = P\{Z^2 \le x\} = P\{|Z| \le \sqrt{x}\} = 2 \int_0^{\sqrt{x}} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt.$$
Hence,
$$f(x) = F'(x) = \frac{1}{\sqrt{2\pi x}}\, e^{-x/2}.$$
If $n$ is a positive integer, then the Gamma distribution with parameters $\lambda = 1/2$ and $\alpha = n/2$ is called the $\chi^2$-distribution with $n$ degrees of freedom. It is the distribution of

$$Z_1^2 + \cdots + Z_n^2,$$

where $Z_1, \ldots, Z_n$ are independent standard normal random variables. This distribution is very important in statistics.
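Two of the identities above can be confirmed numerically: the Gamma density with α = 1 is the exponential density, and the density of Z² coincides with the Gamma density with λ = 1/2, α = 1/2 (which uses Γ(1/2) = √π). The function names in this sketch are hypothetical.

```python
import math

def gamma_density(x, lam, alpha):
    """Density of the Gamma distribution with parameters lam and alpha at x > 0."""
    return lam ** alpha * x ** (alpha - 1) * math.exp(-lam * x) / math.gamma(alpha)

def z_squared_density(x):
    """Density of Z**2 for standard normal Z, as derived from F'(x)."""
    return math.exp(-x / 2) / math.sqrt(2 * math.pi * x)

def exponential_density(x, lam):
    """Density of the exponential distribution with parameter lam at x > 0."""
    return lam * math.exp(-lam * x)

for x in (0.1, 1.0, 4.5):
    print(x, gamma_density(x, 0.5, 0.5), z_squared_density(x))
    print(x, gamma_density(x, 3.0, 1.0), exponential_density(x, 3.0))
```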

8.3 Poisson process

Let $T_1, T_2, \ldots$ be independent random variables, each with an exponential distribution with parameter $\lambda$. Let $\tau_0 = 0$ and for $k > 0$, $\tau_k = T_1 + \cdots + T_k$. Define the random variables $N_t$ by

$$N_t = k \quad \text{if } \tau_k \le t < \tau_{k+1}.$$

Then $N_t$ is a Poisson process with parameter $\lambda$. We think of $\tau_k$ as the time until $k$ events have occurred, where events happen “randomly” with an average rate of $\lambda$ events per time unit. The random variable $N_t$ denotes the number of events that have occurred by time $t$.
The process $N_t$ has the following properties.
• For each $0 \le s < t$, $N_t - N_s$ has a Poisson distribution with mean $\lambda(t-s)$.
• If $0 < t_1 < t_2 < \ldots < t_n$, then the random variables

$$N_{t_1},\ N_{t_2} - N_{t_1},\ \ldots,\ N_{t_n} - N_{t_{n-1}}$$

are independent.
• With probability one, $t \mapsto N_t$ is nondecreasing, right continuous, and all jumps are of size one. For every $t > 0$,
$$P\{N_{t+\Delta t} = k+1 \mid N_t = k\} = \lambda\, \Delta t + o(\Delta t), \qquad \Delta t \to 0+.$$
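The definition of the process translates directly into a simulation. The sketch below is illustrative, with hypothetical function names: it generates the arrival times τ_k as partial sums of independent exponential waiting times and evaluates N_t as the number of arrival times that are at most t. It is only valid for t up to the chosen horizon, since arrivals beyond the first one past the horizon are not generated.

```python
import bisect
import random

def arrival_times(lam, horizon, rng):
    """tau_1 < tau_2 < ..., generated until the first arrival beyond `horizon`."""
    times, t = [], 0.0
    while t <= horizon:
        t += rng.expovariate(lam)    # T_k is exponential with parameter lam
        times.append(t)
    return times

def count(times, t):
    """N_t = k when tau_k <= t < tau_{k+1}, i.e. the number of arrival times <= t."""
    return bisect.bisect_right(times, t)

rng = random.Random(7)
times = arrival_times(lam=2.0, horizon=5.0, rng=rng)
for t in (0.0, 1.0, 2.5, 5.0):
    print(t, count(times, t))
```

Using `bisect_right` makes the sample path right continuous: at a jump time t = τ_k the count already includes the k-th event.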
