
A First Course In Probability Notes

Lou Yi

Last Edited: December 1, 2023


Contents

1 Combinatorial Analysis

2 Axioms of Probability
2.1 Axioms of Probability
2.2 Inclusion-Exclusion Identity
2.3 Equally Likely Outcomes
2.4 Limit of Probability
2.5 Probability as a Measure of Belief

3 Conditional Probability
3.1 Conditional Probability
3.2 Bayes' Formula
3.3 Independent Events
3.4 Conditional Probability is a Probability

4 Discrete Random Variables
4.1 Definition Involving Discrete Random Variables
4.2 Expected Value
4.3 Moments And Variance
4.4 Bernoulli and Binomial Random Variables
4.5 Poisson Random Variable
4.6 Other Discrete Random Variables
4.6.1 Geometric Random Variable
4.6.2 Negative Binomial Random Variable
4.6.3 Hypergeometric Random Variable
4.7 Expected Value of Sums of Random Variables
4.8 Some Interesting Results

5 Continuous Random Variables
5.1 Continuous Random Variable
5.2 Expectation and Variance
5.3 Uniform Random Variable
5.4 Normal Random Variables
5.5 Exponential Random Variables
5.6 Other Continuous Distributions
5.6.1 Gamma Distribution
5.6.2 Beta Distribution
5.7 Information on the Gamma and Beta Function
5.8 Distribution of a Function of a Random Variable

6 Jointly Distributed Random Variables
6.1 Joint Cumulative Distribution Function
6.2 Joint Distribution of Random Variables
6.3 Independent Random Variables
6.4 Sums of Independent Random Variables
6.4.1 Sum of Binomial
6.4.2 Sum of Poisson
6.4.3 Sum of Uniform
6.4.4 Sum of Gamma
6.4.5 Sum of Normals
6.4.6 Sum of Exponential
6.5 Conditional Distribution
6.6 Joint Distribution of Functions

7 Expectation of Random Variables
7.1 Extra Properties of Expectation
7.2 Sum of Random Variables
7.3 Moments of Number of Events
7.4 Covariance and Correlations
7.5 Multinomial and Multivariate Normal Distribution
7.6 Conditional Expectation
7.7 Conditional Variance
7.8 Conditional Expectation and Prediction
7.9 Moment Generating Function
7.10 Moment Generating Function for Sum of Independent Random Variables
7.11 Joint Moment Generating Function
7.12 Summary on Random Variables

8 Limit Theorems
8.1 Inequalities
8.2 Limit Theorems
1 Combinatorial Analysis
The basic principle of counting:
Suppose that two experiments are to be performed. If the first experiment can result in any one of m possible
outcomes and if for each outcome of experiment 1, there are n possible outcomes of experiment 2, then together
there are mn possible outcomes of the two experiments.
Proof: Basically we can list the outcomes in an m × n matrix. The matrix has mn entries, hence there are mn
possible outcomes. □

The generalized basic principle of counting:


If r experiments that are to be performed are such that the first one may result in any of n1 possible outcomes; and if, for each of these n1 possible outcomes, there are n2 possible outcomes of the second experiment; and if, for each of the possible outcomes of the first two experiments, there are n3 possible outcomes of the third experiment; and so on, then there is a total of n1 · n2 · · · nr possible outcomes of the r experiments.
Proof: Apply induction on the basic principle of counting. □
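The principle is easy to check concretely. Below is a minimal Python sketch (the outcome labels are invented purely for illustration): enumerating the combined outcomes of r experiments with itertools.product yields exactly n1 · n2 · · · nr tuples.

```python
# Illustrative sketch of the generalized counting principle: the number of
# combined outcomes equals the product of the per-experiment outcome counts.
from itertools import product
from math import prod

experiments = [["a", "b", "c"], [0, 1], ["x", "y", "z", "w"]]  # n1=3, n2=2, n3=4
outcomes = list(product(*experiments))                          # all combined outcomes
assert len(outcomes) == prod(len(e) for e in experiments)       # 3 * 2 * 4 = 24
```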

Permutation:
If we want to arrange n items, then there are a total of n! such different orderings.
Proof: Note that there are n different choices for the first position of the arrangement, then followed by n − 1
choices on the second position, n − 2 choices on the third and so on. Hence there are n × (n − 1) × · · · × 1 = n!
different orderings. □

Permutation with identical items:

Suppose there are n objects, of which n1 are alike, n2 are alike, · · · , nr are alike. Then there are a total of
$$\frac{n!}{n_1!\,n_2!\cdots n_r!}$$
different permutations.
Proof: Firstly there are n! different permutations if all the objects are distinct. Then for each group of nk identical items there are nk! ways to arrange them among themselves, so each distinct ordering is counted n1! n2! · · · nr! times among the n! permutations. Thus the formula follows. □

Combination:
We define $\binom{n}{r}$, for r ≤ n, by
$$\binom{n}{r} = \frac{n!}{(n-r)!\,r!}$$
and say that $\binom{n}{r}$ represents the number of possible combinations of n objects taken r at a time.
Proof: Firstly, there are n!/(n − r)! permutations of length r choosing from n items. Since order does not matter, each combination is counted r! times, hence the total number of combinations is
$$\frac{n!}{(n-r)!\,r!}. \qquad \square$$
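As a sanity check, the formula agrees with Python's built-in combinatorics helpers; this sketch is illustrative and not part of the notes' derivation.

```python
# Sketch: compare the combination formula with math.comb and math.perm.
from math import comb, factorial, perm

n, r = 10, 4
assert perm(n, r) == factorial(n) // factorial(n - r)   # ordered selections of length r
assert comb(n, r) == perm(n, r) // factorial(r)         # each combination counted r! times
assert comb(n, r) == factorial(n) // (factorial(n - r) * factorial(r))
```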


Lemma 1.1 Suppose n ∈ N and r ∈ N with 1 ≤ r ≤ n. Then
$$\binom{n}{r} = \binom{n-1}{r-1} + \binom{n-1}{r}.$$

Proof: Note that in order to choose r items from a list of n objects, we can either include the first item and choose r − 1 items from the remaining n − 1 objects, or we can exclude the first item and choose r items from the remaining n − 1 objects. Hence we have the formula. □

Pascal’s Triangle:
Pascal's Triangle is a triangular array of numbers in which the element in the ith row and jth column has the value $\binom{i-1}{j-1}$.

Theorem 1.2 (The Binomial Theorem) Suppose n is a positive integer, then
$$(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$$
Hence, the $\binom{n}{k}$ are often known as binomial coefficients.

Proof: The proof is done by induction, or one can note that the coefficient of $x^k y^{n-k}$ counts the $\binom{n}{k}$ ways to choose which k of the n factors contribute an x. □

Corollary 1.2.1 There are exactly 2ⁿ subsets of a set consisting of n elements.

Proof: Since there are $\binom{n}{k}$ subsets of size k, there are a total of
$$\sum_{k=0}^{n} \binom{n}{k} = (1+1)^n = 2^n$$
subsets of the original set. □
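A one-line numerical check of the corollary (illustrative only):

```python
# Sketch verifying Corollary 1.2.1 for a small n: summing binomial
# coefficients over all subset sizes k gives 2^n subsets in total.
from math import comb

n = 6
assert sum(comb(n, k) for k in range(n + 1)) == 2 ** n
```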

Permutations of length r choosing from n items:
If we want to arrange items in a list of length r chosen from n objects, with r ≤ n, then there are
$$\frac{n!}{(n-r)!}$$
such permutations.
Proof: Similar to the above reasoning, there are n × (n − 1) × · · · × (n − r + 1) different orderings, which is equal to
$$\frac{n!}{(n-r)!}. \qquad \square$$

Multinomial coefficients:
Let n = n1 + n2 + · · · + nr. The multinomial coefficient $\binom{n}{n_1, n_2, \cdots, n_r}$ is defined by
$$\binom{n}{n_1, n_2, \cdots, n_r} = \frac{n!}{n_1!\, n_2! \cdots n_r!}.$$
$\binom{n}{n_1, n_2, \cdots, n_r}$ represents the number of possible divisions of n distinct objects into r distinct groups of sizes n1, · · · , nr respectively. It also equals the number of possible permutations of n objects of which n1 are alike, n2 are alike, · · · , nr are alike.

Note: Let r = 2, so n = n1 + n2; then
$$\binom{n}{n_1, n_2} = \binom{n}{n_1} = \binom{n}{n_2}.$$

Theorem 1.3 (Multinomial Theorem) Suppose n ∈ N, r ∈ N*; then
$$(x_1 + x_2 + \cdots + x_r)^n = \sum \binom{n}{n_1, n_2, \cdots, n_r} x_1^{n_1} x_2^{n_2} \cdots x_r^{n_r},$$
where the sum is taken over all nonnegative integers n1, · · · , nr so that
$$n_1 + n_2 + \cdots + n_r = n.$$

Proof: The theorem is clear from the combinatorial point of view: the coefficient of $x_1^{n_1}\cdots x_r^{n_r}$ counts the ways to divide the n factors into groups contributing each xᵢ. □

Proposition 1.4 The number of integer solutions to the equation
$$x_1 + x_2 + \cdots + x_r = n$$
where xᵢ ≥ kᵢ, i ∈ {1, 2, · · · , r}, is given by
$$\binom{n + r - 1 - \sum_{i=1}^{r} k_i}{r-1}.$$
In particular, the number of positive integer solutions is given by $\binom{n-1}{r-1}$ and the number of nonnegative integer solutions is given by $\binom{n+r-1}{r-1}$.

Proof: We can transform the problem into inserting r − 1 plates into the $n + r - 1 - \sum_{i=1}^{r} k_i$ gaps between stars; this is done by considering yᵢ = xᵢ + (1 − kᵢ), so that each yᵢ ≥ 1 and
$$y_1 + y_2 + \cdots + y_r = n + r - \sum_{i=1}^{r} k_i.$$
So the number of ways to do this is given by
$$\binom{n + r - 1 - \sum_{i=1}^{r} k_i}{r-1}. \qquad \square$$

Corollary 1.4.1 The number of integer solutions to the inequality
$$x_1 + x_2 + \cdots + x_r \le n$$
where xᵢ ≥ kᵢ, i ∈ {1, 2, · · · , r}, is given by
$$\binom{n + r - \sum_{i=1}^{r} k_i}{r}.$$

Proof: Introducing a nonnegative slack variable, it is equivalent to calculate the number of positive integer solutions to
$$y_1 + y_2 + \cdots + y_r + y_{r+1} = n + r + 1 - \sum_{i=1}^{r} k_i,$$
and by Proposition 1.4 the number of solutions of this is given by
$$\binom{n + r - \sum_{i=1}^{r} k_i}{r}. \qquad \square$$

2 Axioms of Probability
2.1 Axioms of Probability
Sample Space:
Consider an experiment whose outcome is not predictable. The set of all possible outcomes of the experiment is
called the sample space. It is usually denoted by S.

Events:
Any subset E of the sample space S is an event. If the outcome is contained in E, then we say E occurs.
Note that S itself is an event, which is also known as the sure event. ∅ is also an event, which is known as the null
event.

Operations on Events:
Operations on Events are precisely operations on sets. Let E and F be two events of a sample space S, then

• The union E ∪ F consists of all outcomes in E or in F or in both.

• The intersection E ∩ F, commonly denoted EF, consists of all outcomes in both E and F.

• If EF = ∅, then E and F are called mutually exclusive.

• The complement Eᶜ consists of all outcomes in S not in E.

Similarly, let E1, E2, · · · be a sequence of events of a sample space S.

• Their union, denoted by $\bigcup_{n=1}^{\infty} E_n$, consists of all outcomes which are in at least one of the En.

• Their intersection, denoted by $\bigcap_{n=1}^{\infty} E_n$, consists of all outcomes which are in every En.

Theorem 2.1 Let E, F and G be events of a sample space S. Then

• (Commutative Laws) E ∪ F = F ∪ E, EF = FE.

• (Associative Laws) (E ∪ F) ∪ G = E ∪ (F ∪ G), (EF)G = E(FG).

• (Distributive Laws) (E ∪ F)G = EG ∪ FG, (EF) ∪ G = (E ∪ G)(F ∪ G).

• (De Morgan's Laws) (E ∪ F)ᶜ = EᶜFᶜ, (EF)ᶜ = Eᶜ ∪ Fᶜ.

Proof: Simple element chasing between sets. □

Theorem 2.2 Let E1 , E2 , · · · and F be events of a sample space S. Then

• General Distributive Laws:

  – $\left(\bigcup_{n=1}^{\infty} E_n\right) \cap F = \bigcup_{n=1}^{\infty}(E_n \cap F)$;
  – $\left(\bigcap_{n=1}^{\infty} E_n\right) \cup F = \bigcap_{n=1}^{\infty}(E_n \cup F)$.

• General De Morgan's Laws:

  – $\left(\bigcup_{n=1}^{\infty} E_n\right)^c = \bigcap_{n=1}^{\infty} E_n^c$;
  – $\left(\bigcap_{n=1}^{\infty} E_n\right)^c = \bigcup_{n=1}^{\infty} E_n^c$.

Proof: Element chase. □

Probability:
Let E be any event of an experiment. Let n(E) be the number of times that E occurs in the first n repetitions of
the experiment. The probability of E is
$$P(E) = \lim_{n\to\infty} \frac{n(E)}{n}$$

if this limit exists.

Axioms of Probability:
Let S be the sample space of an experiment. Suppose that a number P (E) is defined for every event E of S, s.t.,

• 0 ≤ P(E) ≤ 1.

• P(S) = 1.

• For any sequence of mutually exclusive events E1, E2, · · · ,
$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i).$$

Then P (E) is called the probability of the event E.

Proposition 2.3 P (∅) = 0.

Proof: Let ∅ = E1 = E2 = E3 = · · · . Then E1, E2, · · · are mutually exclusive, and by the third axiom of probability $P(\varnothing) = \sum_{i=1}^{\infty} P(\varnothing)$, which forces P(∅) = 0. □

Proposition 2.4 Let E1, · · · , En be mutually exclusive; then
$$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i).$$

Proof: Let ∅ = E_{n+1} = E_{n+2} = · · · . Then one can show that E1, E2, · · · are mutually exclusive. Hence
$$P\left(\bigcup_{i=1}^{n} E_i\right) = P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i) = \sum_{i=1}^{n} P(E_i). \qquad \square$$

Proposition 2.5 Suppose E is an event of an experiment, then P (E c ) = 1 − P (E).

Proof: By the previous proposition, P (E) + P (E c ) = P (E ∪ E C ) = P (S) = 1. □

Proposition 2.6 Suppose that E ⊆ F are events of an experiment, then P (E) ≤ P (F ).

Proof: It is clear that E and F E c are mutually exclusive, and their union is F . Then P (F ) = P (E) + P (E c F ) ≥
P (E) + 0 = P (E). □

Proposition 2.7 Let E and F be any two events of an experiment, then

P (E ∪ F ) = P (E) + P (F ) − P (EF ),

which is known as the Inclusion-Exclusion Identity. And

P (E ∪ F ) ≤ P (E) + P (F ),

which is known as Boole's Inequality.

Proof: It is clear that EF c , EF and F E c are mutually exclusive. Then

P (E ∪ F ) = P (EF c ) + P (EF ) + P (F E c )
= [P (EF c ) + P (EF )] + [P (EF ) + P (F E c )] − P (EF )
= P (E) + P (F ) − P (EF ).

Then Boole's Inequality follows immediately. □

2.2 Inclusion-Exclusion Identity
Theorem 2.8 Let E1, E2, · · · , En be events; then
$$P(E_1 \cup E_2 \cup \cdots \cup E_n) = \sum_{i=1}^{n} P(E_i) - \sum_{i_1 < i_2} P(E_{i_1}E_{i_2}) + \cdots + (-1)^{r+1}\!\!\sum_{i_1<i_2<\cdots<i_r}\!\! P(E_{i_1}E_{i_2}\cdots E_{i_r}) + \cdots + (-1)^{n+1}P(E_1E_2\cdots E_n).$$
The following is a succinct way of writing the inclusion-exclusion identity:
$$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{r=1}^{n} (-1)^{r+1} \sum_{i_1 < \cdots < i_r} P(E_{i_1}\cdots E_{i_r}).$$

Proof: We present two proofs of the theorem.


We first prove it using induction. The case where n = 2 is trivial. Suppose that the formula is true for n, we show
it for n + 1. First apply the n = 2 case, then distributivity of intersections:

$$P(E_1 \cup \cdots \cup E_n \cup E_{n+1}) = P(E_1 \cup \cdots \cup E_n) + P(E_{n+1}) - P((E_1 \cup \cdots \cup E_n) \cap E_{n+1})$$
$$= P(E_1 \cup \cdots \cup E_n) + P(E_{n+1}) - P((E_1 \cap E_{n+1}) \cup (E_2 \cap E_{n+1}) \cup \cdots \cup (E_n \cap E_{n+1})).$$
The first and last terms in brackets are unions of n events, for which we assumed the formula to hold (applying the inductive hypothesis). Therefore
$$P(E_1 \cup \cdots \cup E_{n+1}) = \sum_{1\le i \le n} P(E_i) - \sum_{1 \le i_1 < i_2 \le n} P(E_{i_1} \cap E_{i_2}) + \sum_{1\le i_1<i_2<i_3\le n} P(E_{i_1}\cap E_{i_2}\cap E_{i_3}) - \cdots + (-1)^{n+1} P(E_1 \cap \cdots \cap E_n)$$
$$+\, P(E_{n+1}) - \sum_{1\le i\le n} P(E_i \cap E_{n+1}) + \sum_{1\le i_1<i_2\le n} P(E_{i_1}\cap E_{i_2}\cap E_{n+1}) - \cdots - (-1)^n\!\!\sum_{1\le i_1<\cdots<i_{n-1}\le n}\!\! P(E_{i_1}\cap\cdots\cap E_{i_{n-1}}\cap E_{n+1}) - (-1)^{n+1} P(E_1\cap\cdots\cap E_{n+1}).$$
Then by rearranging terms, we get the required result.

Next we argue using combinatorics:

If an outcome of the sample space is not a member of any of the sets Eᵢ, then its probability does not contribute anything to either side of the equality. Now, suppose that an outcome is in exactly m of the events Eᵢ, where m > 0. Then, since it is in $\bigcup_i E_i$, its probability is counted once in $P(\bigcup_i E_i)$; also, as this outcome is contained in $\binom{m}{k}$ subsets of the type $E_{i_1}E_{i_2}\cdots E_{i_k}$, its probability is counted
$$\binom{m}{1} - \binom{m}{2} + \binom{m}{3} - \cdots \pm \binom{m}{m}$$
times on the right of the equality sign in the right-hand side of the theorem. Thus, for m > 0, we must show that
$$1 = \binom{m}{1} - \binom{m}{2} + \binom{m}{3} - \cdots \pm \binom{m}{m}.$$
However, since $1 = \binom{m}{0}$, the preceding equation is equivalent to
$$\sum_{i=0}^{m} (-1)^i \binom{m}{i} = 0,$$
and the latter equation follows from the binomial theorem, since
$$0 = (-1+1)^m = \sum_{i=0}^{m} (-1)^i \binom{m}{i} (1)^{m-i}.$$
Since every element is counted the same number of times on both sides of the equality, the respective probabilities are equal. □
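To see the identity in action, here is a small illustrative sketch on an equally likely sample space with three arbitrarily chosen events; exact arithmetic with Fraction avoids rounding error.

```python
# Sketch checking Theorem 2.8: the inclusion-exclusion sum equals the
# directly computed probability of the union on a finite sample space.
from itertools import combinations
from fractions import Fraction

S = set(range(20))                                   # equally likely outcomes
events = [set(range(0, 10)), set(range(5, 15)), {2, 3, 17}]
P = lambda E: Fraction(len(E), len(S))

direct = P(set().union(*events))
ie = sum((-1) ** (r + 1) * sum(P(set.intersection(*c))
                               for c in combinations(events, r))
         for r in range(1, len(events) + 1))
assert direct == ie
```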

2.3 Equally Likely Outcomes


Equally Likely Experiments:
Suppose the sample space S of an experiment is finite and all outcomes in S are equally likely to occur. Then for any event E, $P(E) = \frac{|E|}{|S|}$, where |E| is the number of outcomes in E.

Proposition 2.9 The probability of drawing a specific card from a normal deck of 52 cards after burning any number (less than 52) of cards is 1/52.

Proof: Suppose the card drawn is at the ith position in the deck; there are 51! orderings for which this is possible and 52! total permutations, thus the probability is 51!/52! = 1/52 no matter how many cards are burned. □

Corollary 2.9.1 Suppose there are n specific cards in a deck of 52 cards; then the probability of drawing any of them after burning any number (less than 52) of cards is n/52.

Proof: Each of the n cards is equally likely to be drawn. □

2.4 Limit of Probability


Definition: a sequence of events {En, n ≥ 1} is said to be an increasing sequence if
$$E_1 \subset E_2 \subset \cdots \subset E_n \subset E_{n+1} \subset \cdots,$$
whereas it is said to be a decreasing sequence if
$$E_1 \supset E_2 \supset \cdots \supset E_n \supset E_{n+1} \supset \cdots.$$

Definition: if {En, n ≥ 1} is an increasing sequence of events, then we define a new event, denoted by $\lim_{n\to\infty} E_n$, by
$$\lim_{n\to\infty} E_n = \bigcup_{i=1}^{\infty} E_i.$$
Similarly, if {En, n ≥ 1} is a decreasing sequence of events, we define $\lim_{n\to\infty} E_n$ by
$$\lim_{n\to\infty} E_n = \bigcap_{i=1}^{\infty} E_i.$$

Proposition 2.10 If {En, n ≥ 1} is either an increasing or a decreasing sequence of events, then
$$\lim_{n\to\infty} P(E_n) = P\left(\lim_{n\to\infty} E_n\right).$$

Proof: Suppose first that {En, n ≥ 1} is an increasing sequence, and define the events Fn, n ≥ 1, by F1 = E1 and
$$F_n = E_n \left( \bigcup_{i=1}^{n-1} E_i \right)^{\!c} = E_n E_{n-1}^c, \quad n > 1.$$
Then it is clear that the Fn's are mutually exclusive, and
$$\bigcup_{i=1}^{\infty} F_i = \bigcup_{i=1}^{\infty} E_i \quad \text{and} \quad \bigcup_{i=1}^{n} F_i = \bigcup_{i=1}^{n} E_i \ \text{ for all } n \ge 1.$$
Thus,
$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = P\left(\bigcup_{i=1}^{\infty} F_i\right) = \sum_{i=1}^{\infty} P(F_i) \quad \text{(by Axiom 3 of a Probability Function)}$$
$$= \lim_{n\to\infty} \sum_{i=1}^{n} P(F_i) = \lim_{n\to\infty} P\left(\bigcup_{i=1}^{n} F_i\right) = \lim_{n\to\infty} P\left(\bigcup_{i=1}^{n} E_i\right) = \lim_{n\to\infty} P(E_n),$$
which proves the result when {En, n ≥ 1} is increasing.
If {En, n ≥ 1} is a decreasing sequence, then {Eₙᶜ, n ≥ 1} is an increasing sequence. Hence, from the preceding equations,
$$P\left(\bigcup_{i=1}^{\infty} E_i^c\right) = \lim_{n\to\infty} P(E_n^c).$$
However, because $\bigcup_{i=1}^{\infty} E_i^c = \left(\bigcap_{i=1}^{\infty} E_i\right)^{\!c}$, it follows that
$$P\left(\left(\bigcap_{i=1}^{\infty} E_i\right)^{\!c}\right) = \lim_{n\to\infty} P(E_n^c).$$
Thus
$$1 - P\left(\bigcap_{i=1}^{\infty} E_i\right) = \lim_{n\to\infty} [1 - P(E_n)] = 1 - \lim_{n\to\infty} P(E_n).$$
Therefore, we conclude
$$P\left(\bigcap_{i=1}^{\infty} E_i\right) = \lim_{n\to\infty} P(E_n). \qquad \square$$
i=1

2.5 Probability as a Measure of Belief


Measure of Belief:
If n(E) is the number of times that E occurs in n repetitions of an experiment, then $P(E) = \lim_{n\to\infty} n(E)/n$.
If we believe that a coin is fair, then P({H}) = P({T}) = 1/2; if we believe that a die is fair, then P({1}) = P({2}) = · · · = P({6}) = 1/6.
It is logical to suppose that a measure of the degree of one's belief should satisfy the axioms of probability.
3 Conditional Probability
3.1 Conditional Probability
Definition: suppose P(F) > 0; the conditional probability that E occurs given that F has occurred is given by
$$P(E|F) = \frac{P(EF)}{P(F)}.$$
Note that P(E|F) = P(EF|F) and P(EF) = P(E|F)P(F).

Lemma 3.1 Suppose E ⊆ F; then $P(E|F) = \frac{P(E)}{P(F)}$.

Proof: Since E ⊆ F, we have EF = E, so $P(E|F) = \frac{P(EF)}{P(F)} = \frac{P(E)}{P(F)}$. □

Proposition 3.2 (Multiplication Rule) P (EF ) = P (E|F )P (F ).

Proof: Suppose P (F ) > 0, then P (E|F ) = P (EF )/P (F ). Hence P (EF ) = P (E|F )P (F ). Now suppose P (F ) =
0, then P (E|F ) = 0, as the probability of E happening given F happens is zero (F can never happen). So again
we have P (EF ) = P (E|F )P (F ). □

Corollary 3.2.1 (General Multiplication Rule) P (E1 · · · En ) = P (E1 )P (E2 |E1 ) · · · P (En |E1 · · · En−1 ).

Proof: Suppose P (E1 · · · En−1 ) = 0, then the statement is trivial as both sides equal to 0. Otherwise, we can use
induction to prove the general statement. □

3.2 Bayes’ Formula

Proposition 3.3 (Law of Total Probability) P (E) = P (EF )+P (EF C ) = P (E|F )P (F )+P (E|F C )[1−P (F )].

Proof: First, note that EF ∪ EFᶜ = E and EF ∩ EFᶜ = ∅. Hence we have P(E) = P(EF ∪ EFᶜ) = P(EF) + P(EFᶜ).
Next we apply the formula for conditional probability and get the second equality. □

Theorem 3.4 (Bayes' Formula) Let E and F be events with P(E) > 0; then
$$P(F|E) = \frac{P(E|F)P(F)}{P(E|F)P(F) + P(E|F^c)P(F^c)}.$$

Proof: By the definition of conditional probability and the law of total probability,
$$P(F|E) = \frac{P(FE)}{P(E)} = \frac{P(EF)}{P(E|F)P(F) + P(E|F^c)P(F^c)} = \frac{P(E|F)P(F)}{P(E|F)P(F) + P(E|F^c)P(F^c)}. \qquad \square$$
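A worked numerical sketch of Bayes' formula, with hypothetical numbers for a diagnostic test (the prevalence, sensitivity, and false-positive rate below are invented for illustration):

```python
# Sketch: P(condition | positive test) via Bayes' formula.
p_F = 0.01           # P(F): prevalence of the condition (hypothetical)
p_E_given_F = 0.99   # P(E|F): test sensitivity (hypothetical)
p_E_given_Fc = 0.05  # P(E|F^c): false-positive rate (hypothetical)

posterior = (p_E_given_F * p_F) / (p_E_given_F * p_F + p_E_given_Fc * (1 - p_F))
print(posterior)     # about 0.167: a positive result is still mostly a false alarm
```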

Proposition 3.5 Let E and F be two events; then
$$P(F|E) \ge P(F) \iff P(E|F) \ge P(E|F^c).$$
That is to say, if E and F are positively associated, then E and Fᶜ are negatively associated.

Proof: Note $P(F|E) = \frac{P(E|F)P(F)}{P(E|F)P(F) + P(E|F^c)P(F^c)}$; then
$$P(F|E) \ge P(F) \iff \frac{P(E|F)P(F)}{P(E|F)P(F) + P(E|F^c)P(F^c)} \ge P(F)$$
$$\iff P(E|F) \ge P(E|F)P(F) + P(E|F^c)P(F^c) \iff P(E|F)P(F^c) \ge P(E|F^c)P(F^c) \iff P(E|F) \ge P(E|F^c). \qquad \square$$

Proposition 3.6 (Generalized Law of Total Probability) Let F1, F2, · · · be mutually exclusive with $\bigcup_{n=1}^{\infty} F_n = S$. For any event E, we have
$$P(E) = \sum_{n=1}^{\infty} P(E|F_n)P(F_n).$$

Proof: Suppose $\bigcup_{n=1}^{\infty} F_n = S$; then
$$E = E \cap S = E \cap \bigcup_{n=1}^{\infty} F_n = \bigcup_{n=1}^{\infty} (E \cap F_n).$$
Hence
$$P(E) = P\left(\bigcup_{n=1}^{\infty} E \cap F_n\right) = \sum_{n=1}^{\infty} P(EF_n) = \sum_{n=1}^{\infty} P(E|F_n)P(F_n). \qquad \square$$

Theorem 3.7 (Generalized Bayes' Formula) Let F1, F2, · · · be mutually exclusive with $\bigcup_{n=1}^{\infty} F_n = S$. For any event E, we have
$$P(F_j|E) = \frac{P(E|F_j)P(F_j)}{\sum_{n=1}^{\infty} P(E|F_n)P(F_n)}.$$

Proof: Note that
$$P(F_j|E) = \frac{P(F_jE)}{P(E)} = \frac{P(EF_j)}{P(E)} = \frac{P(E|F_j)P(F_j)}{\sum_{n=1}^{\infty} P(E|F_n)P(F_n)}. \qquad \square$$

Definition: the odds of an event A are defined by
$$\frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)}.$$

Lemma 3.8 Suppose E and F are events; then the odds of the event F given E are
$$\frac{P(F|E)}{P(F^c|E)} = \frac{P(F)P(E|F)}{P(F^c)P(E|F^c)}.$$

Proof:
$$\frac{P(F|E)}{P(F^c|E)} = \frac{P(FE)}{P(F^cE)} = \frac{P(F)P(E|F)}{P(F^c)P(E|F^c)}. \qquad \square$$

3.3 Independent Events


Definition: events E and F are said to be independent if P (EF ) = P (E)P (F ), and dependent otherwise.
Note that if P (F ) = 0, then P (EF ) = 0 = P (E)P (F ); so E, F are automatically independent.

Lemma 3.9 If P (F ) > 0, then E and F are independent if and only if P (E|F ) = P (E). If P (E) > 0, then E
and F are independent if and only if P (F |E) = P (F ).

Proof: Suppose P(F) > 0; then $P(E|F) = \frac{P(EF)}{P(F)} = P(E)$ if and only if P(EF) = P(E)P(F). Similarly we have the second assertion. □

Proposition 3.10 (Property of Independent Events) Let E and F be events, then the following are equiva-
lent:

1. E and F are independent;

2. E and F C are independent;

3. E C and F are independent;

4. E C and F C are independent.

Proof: Suppose E and F are independent, i.e., P (EF ) = P (E)P (F ). Notice that EF and EF C are mutually
exclusive with union E. Then

P (E) = P (EF ) + P (EF C ) = P (E)P (F ) + P (EF C )

This implies that P (EF C ) = P (E) − P (E)P (F ) = P (E)[1 − P (F )] = P (E)P (F C ).


Reversing roles, we get that all four statements are equivalent. □

Definition: let E, F , G be events. They are independent if

1. P (EF G) = P (E)P (F )P (G), and

2. P (EF ) = P (E)P (F ), P (EG) = P (E)P (G), P (F G) = P (F )P (G).

Proposition 3.11 Suppose E, F, G are independent. Then E is independent of any event formed from F and G.

Proof: We just need to consider the cases FG, F ∪ G, FᶜG, FGᶜ, FᶜGᶜ, Fᶜ ∪ G, F ∪ Gᶜ, and Fᶜ ∪ Gᶜ. □

Definition: a sequence of events E1 , E2 , · · · are independent if

P (Ei1 · · · Eir ) = P (Ei1 ) · · · P (Eir ) for any r ≥ 2, i1 < · · · < ir .

Definition: suppose an experiment consists of a sequence of subexperiments. Let Eᵢ be the outcome of the ith subexperiment. If E1, E2, · · · are independent and have the same set of possible outcomes, then they are often called trials.

Proposition 3.12 Let E and F be mutually exclusive events. Suppose independent trials are performed. Then the probability that E occurs before F is
$$\frac{P(E)}{P(E) + P(F)}.$$

Proof: Let S be the event that E occurs before F, and let K be the event that neither E nor F happens on the first trial. Conditioning on the first trial,
$$P(S) = P(S|E)P(E) + P(S|F)P(F) + P(S|K)P(K) = P(E) + 0 + P(S)\,(1 - P(E) - P(F)),$$
which gives
$$P(S) = \frac{P(E)}{P(E) + P(F)}. \qquad \square$$

Proposition 3.13 Suppose that a man is gambling against an infinitely rich adversary and at each stage he either wins or loses 1 unit with respective probabilities p and 1 − p. If the man starts with i units, then the probability that he will eventually go broke is
$$\begin{cases} 1 & \text{if } p \le \frac{1}{2}, \\[2pt] \left(\frac{q}{p}\right)^{i} & \text{if } p > \frac{1}{2}, \end{cases}$$
where q = 1 − p.

Proof: Let P(n) denote the probability that the man starts with n units and goes broke. Then P(0) = 1 and P(n) = P(1)ⁿ for n ≥ 1, since to go broke from n units he must drop from n to n − 1, then from n − 1 to n − 2, and so on, each with probability P(1). Also, conditioning on the first bet, P(1) = (1 − p)P(0) + pP(2) = (1 − p) + p[P(1)]². Then solve this quadratic for P(1) for the desired result. □
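The ruin probability can be checked by simulation; the sketch below truncates the adversary's "infinite" wealth at a large cap, which is an approximation of the proposition's setting.

```python
# Monte Carlo sketch of Proposition 3.13 with p > 1/2: the ruin probability
# starting from i units should be close to (q/p)^i.
import random

p, i, cap, trials = 0.6, 3, 200, 20000
ruined = 0
for _ in range(trials):
    w = i
    while 0 < w < cap:                      # stop at ruin or at the wealth cap
        w += 1 if random.random() < p else -1
    ruined += (w == 0)
print(ruined / trials, ((1 - p) / p) ** i)  # both about 0.296
```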

3.4 Conditional Probability is a Probability

Proposition 3.14 Suppose F is an event with P(F) > 0; then P(·|F) is a probability.

Proof: We verify that the conditional probability satisfies the 3 axioms of probability.

1. 0 ≤ P(E|F) ≤ 1.
   Proof: EF ⊆ F, so 0 ≤ P(EF) ≤ P(F) and $0 \le P(E|F) = \frac{P(EF)}{P(F)} \le 1$.

2. P(S|F) = 1.
   Proof: $P(S|F) = \frac{P(SF)}{P(F)} = \frac{P(F)}{P(F)} = 1$.

3. We claim that if E1, E2, · · · are mutually exclusive events, then
$$P\left(\bigcup_{i=1}^{\infty} E_i \,\Big|\, F\right) = \sum_{i=1}^{\infty} P(E_i|F).$$
This is because
$$P\left(\bigcup_{i=1}^{\infty} E_i \,\Big|\, F\right) = \frac{P\left(\left(\bigcup_{i=1}^{\infty} E_i\right) F\right)}{P(F)} = \frac{P\left(\bigcup_{i=1}^{\infty} E_i F\right)}{P(F)} = \frac{\sum_{i=1}^{\infty} P(E_i F)}{P(F)} = \sum_{i=1}^{\infty} \frac{P(E_i F)}{P(F)} = \sum_{i=1}^{\infty} P(E_i|F).$$

Hence conditional probability is a probability. □

Proposition 3.15 (Properties of Conditional Probability As A Probability) Fix any event F with P (F ) >
0, then

• Inclusion-Exclusion Identity:
$$P(E_1 \cup E_2 \mid F) = P(E_1|F) + P(E_2|F) - P(E_1E_2|F).$$

• Law of Total Probability:
$$P(E|F) = P(E|FG)P(G|F) + P(E|FG^c)P(G^c|F).$$

Proof: Define Q(E) = P (E|F ), since Q is a probability function, then

P ((E1 ∪ E2 )|F ) = Q(E1 ∪ E2 )


= Q(E1 ) + Q(E2 ) − Q(E1 E2 )
= P (E1 |F ) + P (E2 |F ) − P (E1 E2 |F )

And
$$P(E|F) = Q(E) = Q(E|G)Q(G) + Q(E|G^c)Q(G^c) = P(E|FG)P(G|F) + P(E|FG^c)P(G^c|F),$$
since $Q(E|G) = \frac{Q(EG)}{Q(G)} = \frac{P(EG|F)}{P(G|F)} = \frac{P(EGF)}{P(FG)} = P(E|FG)$ and Q(G) = P(G|F). □

4 Discrete Random Variables
4.1 Definition Involving Discrete Random Variables
Definition: on the sample space of an experiment, the quantities of interest, or real-valued functions on the sample
space are called random variables.

Definition: suppose a random variable X can take on at most a countable number of possible values (finite {a1, · · · , an} or enumerable {a1, a2, · · · }). Then X is said to be discrete. pX(a) = P(X = a) is the probability mass function of X.
Suppose X only assumes values in {a1, a2, · · · , an}, with the aᵢ all distinct; then $\sum_{i=1}^{n} p_X(a_i) = 1$ and pX(a) = 0 for a ≠ a1, a2, · · · , an.
Suppose X only assumes values in {a1, a2, · · · }, with the aᵢ all distinct; then $\sum_{i=1}^{\infty} p_X(a_i) = 1$ and pX(a) = 0 for a ≠ a1, a2, · · · .
Note that for a discrete random variable X, if pX(a) = 0, we may assume that X does not take the value a.

Definition: we define the cumulative distribution function of a random variable X to be
$$F_X(a) = \sum_{x \le a} p(x).$$
Suppose a discrete random variable X takes values a1 < a2 < · · · . Then the distribution function FX is a non-decreasing step function. If aᵢ ≤ a < aᵢ₊₁, then
$$F_X(a) = \sum_{x\le a} p(x) = p(a_1) + \cdots + p(a_i).$$

There is a jump of size pX (ai ) = P (X = ai ) occurring at ai .

Proposition 4.1 (Properties of the Cumulative Distribution Function)

1. F is a nondecreasing function.

2. lim F (b) = 1.
b→∞

3. lim F (b) = 0.
b→−∞

4. F is right continuous.

4.2 Expected Value


Definition: let X be a discrete random variable with values x1, · · · , xm. The expected value / mean of X is $E[X] = \sum_{i=1}^{m} x_i p_X(x_i)$.
Definition: let X be a discrete random variable with values x1, x2, · · · . The expected value / mean of X is $E[X] = \sum_{i=1}^{\infty} x_i p_X(x_i)$.
Definition: we define the indicator variable of an event E to be
$$I = \begin{cases} 1 & \text{if } E \text{ occurs}, \\ 0 & \text{if } E \text{ does not occur}. \end{cases}$$

Lemma 4.2 E[I] = P (E).

Proof: Note I can only take values 0 and 1, and pI (1) = P (E) and pI (0) = P (E c ). Hence E[I] = 1 · P (E) + 0 ·
P (E c ) = P (E). □

Note: suppose X is a discrete random variable, then for any function g, Y = g(X) is again a discrete random
variable.
Proposition 4.3 Let X be a discrete random variable with values x1, x2, · · · . Then $E[g(X)] = \sum_{i=1}^{\infty} g(x_i)\,p_X(x_i)$ for any function g.

Proof: Let Y = g(X); then $E[g(X)] = E[Y] = \sum_y y\,P(Y = y)$. Fix y such that g(x) = y for some x, and let $E_y = \{x \mid g(x) = y\}$. Then $P(Y = y) = \sum_{x \in E_y} P(X = x)$. Therefore
$$E[g(X)] = \sum_y y \sum_{x\in E_y} P(X=x) = \sum_y \sum_{x \in E_y} y\, P(X=x) = \sum_y \sum_{x\in E_y} g(x)P(X=x) = \sum_x g(x)P(X=x). \qquad \square$$

Lemma 4.4 Suppose c ∈ R, then E[c] = c.

Proof: $E[c] = \sum_x c\, p_X(x) = c \sum_x p_X(x) = c$. □

Lemma 4.5 Suppose X is a discrete random variable with value x1 , x2 , · · · . Then if a, b ∈ R, E[aX+b] = aE[X]+b.

Proof: If a, b ∈ R, then
$$E[aX+b] = \sum_{i=1}^{\infty}(ax_i + b)\,p_X(x_i) = a\sum_{i=1}^{\infty} x_i\, p_X(x_i) + b\sum_{i=1}^{\infty} p_X(x_i) = aE[X] + b. \qquad \square$$

Lemma 4.6 Suppose X1 , X2 , · · · , Xn are random variables, c1 , c2 , · · · , cn ∈ R, then E(c1 X1 + · · · + cn Xn ) =


c1 E(X1 ) + · · · + cn E(Xn ).

Proof: By induction, using the two-variable case. □

4.3 Moments And Variance


Definition: let X be a random variable. E[Xⁿ] is called the nth moment of X. Suppose X is a discrete random variable with values x1, x2, · · · ; then $E[X^n] = \sum_{i=1}^{\infty} x_i^n\, p_X(x_i)$.

Lemma 4.7 Let I be the indicator of an event E. Then E[Iⁿ] = P(E).

Proof: E[Iⁿ] = 1ⁿ·P(I = 1) + 0ⁿ·P(I = 0) = P(E). □

Definition: let X be a random variable; we denote the mean E[X] by µX. Then the variance of X is Var(X) = E[(X − µX)²].

Proposition 4.8 Var(X) = E[X 2 ] − (E[X])2 .

Proof:

Var(X) = E[(X − µ)2 ]


= E[X 2 − 2µX + µ2 ]
= E[X 2 ] − 2µE[X] + µ2
= E[X 2 ] − 2µ2 + µ2
= E[X 2 ] − µ2
= E[X 2 ] − (E[X])2 .

Corollary 4.8.1 For any discrete random variable X, E[X²] ≥ (E[X])² and, provided E[X] ≠ 0, E[X²]/|E[X]| ≥ |E[X]|.

Proof: It is clear that E[(X − µ)²] ≥ 0, and E[X²] − (E[X])² = E[(X − µ)²] ≥ 0. Dividing by |E[X]| yields the second claim. □

Proposition 4.9 Suppose X is a random variable, a, b ∈ R, then Var(aX + b) = a2 Var(X).

Proof: Let µ = E[X], then

Var(aX + b) = E[(aX + b)2 ] − (E[aX + b])2


= E[a2 X 2 + 2abX + b2 ] − (aµ + b)2
= (a2 E[X 2 ] + 2abµ + b2 ) − (a2 µ2 + 2abµ + b2 )
= a2 (E[X 2 ] − µ2 )
= a2 Var(X).


Definition: we define the standard deviation of X to be the principal square root of Var(X), i.e., $SD(X) = \sqrt{\mathrm{Var}(X)}$.
Usually we write Var(X) = σ², where σ ≥ 0 is the standard deviation. Then SD(aX + b) = |a| SD(X).

4.4 Bernoulli and Binomial Random Variables


Definition: a random variable X is a Bernoulli random variable if pX(1) = p and pX(0) = 1 − p for some 0 < p < 1.
Definition: a random variable X is a binomial random variable if $p_X(i) = \binom{n}{i}p^i(1-p)^{n-i}$ for some 0 < p < 1, i = 0, 1, · · · , n.
Suppose $p_X(i) = \binom{n}{i}p^i(1-p)^{n-i}$, 0 < p < 1, i = 0, 1, · · · , n; then X is said to have a binomial distribution with parameters (n, p). In particular, a Bernoulli random variable is binomial with parameters (1, p).

Proposition 4.10 Let X be a binomial random variable with parameters (n, p). Then E[X] = np and Var(X) = np(1 − p).

Proof: Notice that if we let Xᵢ denote the random variable where Xᵢ = 1 if the ith trial is a success and Xᵢ = 0 if the ith trial is a failure, then
$$E[X] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} p = np.$$
Now note that Var(X) = E[X²] − (E[X])² and (E[X])² = n²p², hence we calculate E[X²]; but first we compute E[X(X − 1)]:
$$E[X(X-1)] = \sum_{i=0}^{n} i(i-1)\binom{n}{i}p^i(1-p)^{n-i} = n(n-1)p^2\sum_{i=2}^{n}\frac{(n-2)!}{(i-2)!(n-i)!}\,p^{i-2}(1-p)^{n-i}$$
$$= n(n-1)p^2\sum_{k=0}^{n-2}\binom{n-2}{k}p^k(1-p)^{(n-2)-k} = n(n-1)p^2 \cdot 1 = n(n-1)p^2.$$
Then it follows that E[X²] = n(n − 1)p² + np, so Var(X) = n(n − 1)p² + np − n²p² = np(1 − p). □
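A simulation sketch of the proposition (the parameters are chosen arbitrarily):

```python
# Sketch: sample mean and variance of binomial draws vs np and np(1-p).
import random

n, p, trials = 20, 0.3, 50000
xs = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
mean = sum(xs) / trials
var = sum((x - mean) ** 2 for x in xs) / trials
print(mean, n * p)            # both about 6.0
print(var, n * p * (1 - p))   # both about 4.2
```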

Lemma 4.11 Suppose X is a binomial random variable with parameters (n, p). Then
$$\frac{p_X(i)}{p_X(i-1)} = \frac{n-i+1}{i}\cdot\frac{p}{1-p}$$
and
$$\frac{p_X(i+1)}{p_X(i)} = \frac{n-i}{i+1}\cdot\frac{p}{1-p}.$$
In addition, we have
$$p_X(i+1) = \frac{(n-i)p}{(i+1)(1-p)}\,p_X(i).$$

Proof: Direct computation yields this result. □

Proposition 4.12 Suppose X is a binomial random variable with parameters (n, p) and (n + 1)p is not an integer; then pX(i) first increases monotonically, reaches its largest value when i is the largest integer ≤ (n + 1)p, and then decreases monotonically. If (n + 1)p is an integer, then pX(i) takes its maximum value at both (n + 1)p and (n + 1)p − 1.

Proof: By the previous lemma, we have that pX(i) ≥ pX(i − 1) if and only if (n − i + 1)p ≥ i(1 − p), which happens if and only if i ≤ (n + 1)p. Hence we have the desired result. □

Proposition 4.13 Suppose X is a binomial random variable with parameters (n, p); then E[Xᵏ] = npE[(Y + 1)ᵏ⁻¹], where Y is a binomial random variable with parameters (n − 1, p).

Proof: Recall the identity
$$i\binom{n}{i} = n\binom{n-1}{i-1};$$
then
$$E[X^k] = \sum_{i=0}^{n} i^k \binom{n}{i} p^i (1-p)^{n-i} = np\sum_{i=1}^{n} i^{k-1}\binom{n-1}{i-1} p^{i-1}(1-p)^{n-i} = np\sum_{j=0}^{n-1}(j+1)^{k-1}\binom{n-1}{j} p^j (1-p)^{n-1-j} = npE[(Y+1)^{k-1}]. \qquad \square$$
4.5 Poisson Random Variable


Definition: a random variable X is Poisson with parameter λ > 0 if
$$p_X(i) = P(X = i) = e^{-\lambda}\frac{\lambda^i}{i!}, \quad i = 0, 1, 2, \cdots.$$

A Poisson random variable with parameter λ > 0 is an approximation of binomial random variable with parameters
(n, p) such that λ = np with n very large and p very small.

Examples of Poisson Random Variable:

• The number of misprints on a page of a book.

• The number of people in a community who survive to age 100.

• The number of wrong telephone numbers dialed in a day.

• The number of packages of biscuits sold in a store in a day.

• The number of customers entering a post office on a given day.

Proposition 4.14 Let X be a Poisson random variable with parameter λ > 0. Then E[X] = λ and Var(X) = λ.

Proof: Let X be a Poisson random variable with parameter λ > 0. Then $p_X(i) = P(X=i) = e^{-\lambda}\frac{\lambda^i}{i!}$, so
$$E[X] = \sum_{i=0}^{\infty} i\, e^{-\lambda}\frac{\lambda^i}{i!} = e^{-\lambda}\sum_{i=1}^{\infty}\frac{\lambda^i}{(i-1)!} = \lambda e^{-\lambda}\sum_{j=0}^{\infty}\frac{\lambda^j}{j!} = \lambda e^{-\lambda} e^{\lambda} = \lambda.$$
Similarly, one computes E[X(X − 1)] = λ²; hence E[X²] = λ² + λ, so Var(X) = E[X²] − (E[X])² = λ. □

Poisson Approximation:
One can approximate the probability mass function of a binomial random variable with parameter (n, p) using
Poisson distribution with parameter λ = np, when n is large enough and p is very small, so λ is moderate.

Poisson Paradigm:
Let pᵢ be the probability that event i occurs, i = 1, · · · , n. If the pᵢ are small, and the trials are independent or weakly dependent, then the total number of these events that occur can be approximated by a Poisson random variable with parameter $\lambda = \sum_{i=1}^{n} p_i$.

Definition: we say events E and F are weakly dependent if P (E) ≈ P (E|F ).

Poisson Process:
Suppose events occur at random points of time. Let λ > 0. Assume

1. The probability that exactly 1 event occurs in an interval of length h is approximately λh.

2. The probability that 2 or more events occur in an interval of length h is much smaller than λh.

3. The number of events occurring in non-overlapping intervals are independent.

Let N(t) denote the number of events occurring in an interval of length t. Then N(t) is a Poisson random variable with parameter λt and
$$P(N(t) = k) = e^{-\lambda t}\frac{(\lambda t)^k}{k!}, \quad k = 0, 1, 2, \cdots.$$

Lemma 4.15 Let X be a Poisson random variable with parameter λ. Then P (X = i) increases monotonically and
then decreases monotonically as i increases, reaching its maximum when i is the largest integer not exceeding λ.

Proof: Note that
$$\frac{P(X=i)}{P(X=i-1)} = \frac{\lambda}{i},$$
which is ≥ 1 if and only if i ≤ λ. Hence P(X = i) increases for i ≤ λ and decreases thereafter, attaining its maximum at the largest integer not exceeding λ. □

Proposition 4.16 Let X be a Poisson random variable with parameter λ. Then
$$E[X^n] = \lambda E[(X+1)^{n-1}].$$

Proof:
$$E[X^n] = \sum_{i=1}^{\infty} i^n e^{-\lambda}\frac{\lambda^i}{i!} = \sum_{i=1}^{\infty} i^{n-1} e^{-\lambda}\frac{\lambda^i}{(i-1)!} = \lambda \sum_{j=0}^{\infty} (j+1)^{n-1} e^{-\lambda}\frac{\lambda^j}{j!} = \lambda E[(X+1)^{n-1}]. \qquad \square$$

Proposition 4.17 Suppose that the number of events occurring in a given time period is a Poisson random variable with parameter λ, and that each event is independently classified as a type i event with probability pᵢ, i = 1, · · · , n, with Σpᵢ = 1. Then the numbers of type i events that occur are independent Poisson random variables with respective parameters λpᵢ.

Proof: Let X denote the Poisson random variable with parameter λ, and let Xᵢ denote the number of type i events that occur. Given X = k, the number of type i events is a binomial random variable with parameters (k, pᵢ). Hence by the law of total probability, we have
$$P(X_i = n) = \sum_{k=n}^{\infty}\binom{k}{n}p_i^n(1-p_i)^{k-n}\, e^{-\lambda}\frac{\lambda^k}{k!} = \frac{e^{-\lambda}p_i^n\lambda^n}{n!}\sum_{k=n}^{\infty}\frac{[\lambda(1-p_i)]^{k-n}}{(k-n)!} = e^{-\lambda}\frac{(p_i\lambda)^n}{n!}\,e^{\lambda(1-p_i)} = e^{-p_i\lambda}\frac{(p_i\lambda)^n}{n!}.$$
This holds for all nonnegative integers n; hence Xᵢ is a Poisson random variable with parameter pᵢλ. One can then verify that the Xᵢ's are independent using the multinomial distribution. □

4.6 Other Discrete Random Variables
4.6.1 Geometric Random Variable

Definition: independent trials are performed until a success occurs. Suppose the probability of success is p, where 0 < p < 1. Then if we let X denote the number of trials needed until a success occurs, we have P(X = n) = (1 − p)ⁿ⁻¹p, n = 1, 2, · · · . In this way, we define X to be the Geometric Random Variable with parameter p.
Note that $\sum_{n=1}^{\infty} p_X(n) = \frac{p}{1-(1-p)} = 1$. So P(X = ∞) = 0 and we may say that the event X = ∞ does not occur.

Proposition 4.18 Let X be a geometric random variable with parameter p, 0 < p < 1. Then E[X] = 1/p and Var(X) = (1 − p)/p².

Proof: Let X be a geometric random variable with parameter p, 0 < p < 1; then
$$E[X] = \sum_{i=1}^{\infty} i(1-p)^{i-1}p.$$
Note that
$$\sum_{i=1}^{\infty}(1-x)^i = \frac{1-x}{x};$$
by differentiating both sides with respect to x and multiplying by −x, we have
$$\frac{1}{x} = \sum_{i=1}^{\infty} i(1-x)^{i-1}x.$$
Letting x = p, we have E[X] = 1/p. Similarly, differentiating twice gives
$$\frac{2}{x^3} = \sum_{i=2}^{\infty} i(i-1)(1-x)^{i-2},$$
so, multiplying by x(1 − x),
$$\frac{2(1-x)}{x^2} = \sum_{i=2}^{\infty} i(i-1)(1-x)^{i-1}x.$$
Hence E[X(X − 1)] = 2(1 − p)/p², so E[X²] = 2(1 − p)/p² + 1/p and Var(X) = E[X²] − (E[X])² = (1 − p)/p². □

Lemma 4.19 Suppose n ∈ N and k ∈ N⁺, and X is a geometric random variable with parameter p; then
$$P\{X = n + k \mid X > n\} = P\{X = k\}.$$

Proof: P{X = n + k | X > n} is the probability that, given the first n trials are failures, the first success occurs k trials after them. This is the same as getting the first success at the kth trial of a fresh sequence, which has probability P{X = k}. □

4.6.2 Negative Binomial Random Variable

Definition: suppose independent trials are performed with success probability p. Let X be the number of trials needed for r successes. Suppose X = n, where n ≥ r: in the first n − 1 trials there are r − 1 successes and n − r failures, and the nth trial is a success, so $P(X = n) = \binom{n-1}{r-1}p^r(1-p)^{n-r}$. Hence we define X to be the Negative Binomial Random Variable with parameters (r, p), 0 < p < 1, if
$$p_X(n) = \binom{n-1}{r-1}p^r(1-p)^{n-r}, \quad n = r, r+1, \cdots.$$
Note that a geometric random variable is negative binomial with parameters (1, p).
Intuitively, one can see negative binomial random variable as the reverse of binomial random variable. Thus let X
be a negative binomial random variable with parameters r and p, and let Y be a binomial random variable with
parameters n and p. Then P (X > n) = P (Y < r).

Proposition 4.20 Let X be negative binomial with parameters (r, p). Then E[X] = r/p and Var(X) = r(1 − p)/p².

Proof: Let X be negative binomial with parameters (r, p). Then
$$E[X^k] = \sum_{n=r}^{\infty} n^k\binom{n-1}{r-1}p^r(1-p)^{n-r} = \frac{r}{p}\sum_{n=r}^{\infty} n^{k-1}\binom{n}{r}p^{r+1}(1-p)^{n-r} \qquad\left(\text{since } n\binom{n-1}{r-1} = r\binom{n}{r}\right)$$
$$= \frac{r}{p}\sum_{m=r+1}^{\infty}(m-1)^{k-1}\binom{m-1}{r}p^{r+1}(1-p)^{m-(r+1)} = \frac{r}{p}E[(Y-1)^{k-1}],$$
where Y is a negative binomial random variable with parameters (r + 1, p). Setting k = 1 in the preceding equation yields
$$E[X] = \frac{r}{p}.$$
Setting k = 2 in the equation for E[Xᵏ] and using the formula for the expected value of a negative binomial random variable gives
$$E[X^2] = \frac{r}{p}E[Y-1] = \frac{r}{p}\left(\frac{r+1}{p} - 1\right).$$
Therefore
$$\mathrm{Var}(X) = \frac{r}{p}\left(\frac{r+1}{p}-1\right) - \left(\frac{r}{p}\right)^2 = \frac{r(1-p)}{p^2}. \qquad \square$$

Lemma 4.21 Suppose X is a negative binomial random variable with parameters (r, p), and Y is a binomial
random variable with parameters (n, p). Then

P {X > n} = P {Y < r}.

More explicitly, we have
$$\sum_{i=n+1}^{\infty}\binom{i-1}{r-1}p^r(1-p)^{i-r} = \sum_{i=0}^{r-1}\binom{n}{i}p^i(1-p)^{n-i}.$$

Proof: The event X > n occurs exactly when there are fewer than r successes in the first n trials. Hence P{X > n} = P{Y < r}. □
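The identity can be checked numerically; in this sketch the infinite left-hand tail is computed as 1 minus a finite sum.

```python
# Sketch checking Lemma 4.21: P(X > n) for negative binomial (r, p) equals
# P(Y < r) for binomial (n, p).
from math import comb

r, p, n = 3, 0.4, 10
left = 1 - sum(comb(k - 1, r - 1) * p ** r * (1 - p) ** (k - r)
               for k in range(r, n + 1))
right = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(r))
assert abs(left - right) < 1e-12
```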

4.6.3 Hypergeometric Random Variable

Definition: n balls are randomly chosen without replacement from an urn containing m white and N − m black balls. Let X be the number of white balls chosen; then
$$P(X = i) = \frac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}}, \quad i = 0, 1, \cdots, n.$$
We define X to be the Hypergeometric Random Variable with parameters (n, N, m) if
$$p_X(i) = \frac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}}, \quad i = 0, 1, \cdots, n.$$
Note that 0 ≤ X ≤ n and 0 ≤ X ≤ m. If m < i ≤ n, then $\binom{m}{i} = 0$, so P(X = i) = 0.

Proposition 4.22 Let X be hypergeometric with parameters (n, N, m). Then E[X] = nm/N and
$$\mathrm{Var}(X) = np(1-p)\left(1 - \frac{n-1}{N-1}\right),$$
where p = m/N.

Proof: Let X be hypergeometric with parameters (n, N, m). Then
$$E[X^k] = \sum_{i=0}^{n} i^k P\{X=i\} = \sum_{i=1}^{n} i^k\,\frac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}}.$$
Using the identities
$$i\binom{m}{i} = m\binom{m-1}{i-1} \quad\text{and}\quad n\binom{N}{n} = N\binom{N-1}{n-1},$$
we obtain
$$E[X^k] = \frac{nm}{N}\sum_{i=1}^{n} i^{k-1}\,\frac{\binom{m-1}{i-1}\binom{N-m}{n-i}}{\binom{N-1}{n-1}} = \frac{nm}{N}\sum_{j=0}^{n-1}(j+1)^{k-1}\,\frac{\binom{m-1}{j}\binom{N-m}{n-1-j}}{\binom{N-1}{n-1}} = \frac{nm}{N}E[(Y+1)^{k-1}],$$
where Y is a hypergeometric random variable with parameters n − 1, N − 1 and m − 1. Hence upon setting k = 1, we have
$$E[X] = \frac{nm}{N}.$$
Upon setting k = 2 in the equation for E[Xᵏ], we obtain
$$E[X^2] = \frac{nm}{N}E[Y+1] = \frac{nm}{N}\left(\frac{(n-1)(m-1)}{N-1} + 1\right).$$
Hence
$$\mathrm{Var}(X) = \frac{nm}{N}\left(\frac{(n-1)(m-1)}{N-1} + 1 - \frac{nm}{N}\right).$$
Letting p = m/N and using the identity
$$\frac{m-1}{N-1} = \frac{Np-1}{N-1} = p - \frac{1-p}{N-1},$$
we have
$$\mathrm{Var}(X) = np\left[(n-1)p - (n-1)\frac{1-p}{N-1} + 1 - np\right] = np(1-p)\left(1 - \frac{n-1}{N-1}\right). \qquad \square$$

Approximating Hypergeometric Using Binomial:

Note that when N and m are large enough, and n is small compared to N and m, the probability of drawing a white ball with or without replacement does not change much; in both cases it is almost equal to m/N. Hence we can approximate the distribution of the number of white balls drawn using the binomial distribution with parameters (n, m/N).
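The following sketch compares the hypergeometric pmf with its binomial approximation, for arbitrarily chosen large N and m and small n.

```python
# Sketch: hypergeometric (n, N, m) pmf vs binomial (n, m/N) pmf.
from math import comb

N, m, n = 10000, 4000, 10
p = m / N
for i in range(n + 1):
    hyper = comb(m, i) * comb(N - m, n - i) / comb(N, n)
    binom = comb(n, i) * p ** i * (1 - p) ** (n - i)
    print(i, round(hyper, 6), round(binom, 6))   # columns agree closely
```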

Proposition 4.23 Let X be a hypergeometric random variable with parameters (n, N, m); then
$$P(X = i+1) = \frac{(m-i)(n-i)}{(i+1)(N-m-n+i+1)}\,P(X=i), \quad i = 0, 1, \cdots, n-1.$$
P(X = i) is maximized when i is the greatest integer smaller than $p = \frac{(m+1)(n+1)}{N+2}$ if p is not an integer; P(X = i) is maximized at both i = p and i = p − 1 if p is an integer.

Proof: According to the definition of the hypergeometric random variable, we have
$$\frac{P(X=i+1)}{P(X=i)} = \frac{\binom{m}{i+1}\binom{N-m}{n-i-1}}{\binom{m}{i}\binom{N-m}{n-i}} = \frac{(m-i)(n-i)}{(i+1)(N-m-n+i+1)}.$$
P(X = i + 1) ≥ P(X = i) if and only if
$$(i+1)(N-m-n+i+1) \le (m-i)(n-i).$$
Expanding both sides and cancelling the common terms i² − im − in, this becomes
$$i(N+2) \le (m+1)(n+1) - (N+2), \quad\text{i.e.,}\quad i+1 \le \frac{(m+1)(n+1)}{N+2}.$$
Hence P(X = i) is maximized when i is the greatest integer smaller than p = (m + 1)(n + 1)/(N + 2) if p is not an integer, and at both i = p and i = p − 1 if p is an integer. □

4.7 Expected Value of Sums of Random Variables

Proposition 4.24 Suppose the sample space S is countable. For any discrete random variable X on S,
$$E[X] = \sum_{s\in S} X(s)p(s).$$

Proof: This is clear intuitively, but we give a rigorous proof. Let Eᵢ be the event that X = xᵢ, i.e., s ∈ Eᵢ ⇔ X(s) = xᵢ. So $P(X = x_i) = P(E_i) = \sum_{s\in E_i} p(s)$. Hence
$$E[X] = \sum_{i=1}^{\infty} x_i P(X=x_i) = \sum_{i=1}^{\infty} x_i \sum_{s\in E_i} p(s) = \sum_{i=1}^{\infty}\sum_{s\in E_i} X(s)p(s) = \sum_{s\in S} X(s)p(s). \qquad \square$$

Proposition 4.25 Let X1 , X2 , · · · , Xn be discrete random variables. Then E[X1 +· · ·+Xn ] = E[X1 ]+· · ·+E[Xn ].

Proof: We prove the case for two random variables, and we can apply induction for the general result.
Let X and Y be discrete random variables on S. Then
X X
E[X + Y ] = (X + Y )(s)p(s) = [X(s) + Y (s)]p(s)
s∈S s∈S
X X
= X(s)p(s) + Y (s)p(s)
s∈S s∈S

= E[X] + E[Y ].

Proposition 4.26 Let X1, · · · , Xn be discrete random variables and X = X1 + · · · + Xn. Then
$$E[X^2] = \sum_{i=1}^{n}E[X_i^2] + \sum_{i\ne j}E[X_iX_j].$$

Proof: Notice
E[X 2 ] = E[(X1 + · · · + Xn )(X1 + · · · + Xn )].

Then using the previous proposition, we get the desired result. □

4.8 Some Interesting Results

Proposition 4.27 Suppose a positive integer is chosen at random; then the probability that it contains no repeated prime factor is 6/π².

Proof: The probability that the chosen integer is not divisible by the square of the ith prime pᵢ is $1 - \frac{1}{p_i^2}$. Hence the probability that the number is not divisible by the square of any prime is
$$\prod_{i=1}^{\infty}\left(1 - \frac{1}{p_i^2}\right) = \prod_{i=1}^{\infty}\frac{p_i^2-1}{p_i^2} = \frac{1}{\zeta(2)} = \frac{6}{\pi^2}. \qquad \square$$

5 Continuous Random Variables
5.1 Continuous Random Variable
Definition: a random variable X is continuous if there exists a nonnegative function f such that
$$P(X \in B) = \int_B f(x)\,dx$$
for any set B of real numbers.


In this case, FX (the cumulative distribution function of X) is continuous everywhere. In particular, FX has no jump at any a ∈ R, so P(X = a) = 0 for every a ∈ R.

Definition: suppose $P(X \in B) = \int_B f_X(x)\,dx$ for any set B of real numbers. Then fX is called the probability density function of X. Then:

1. $P(X = a) = \int_a^a f_X(x)\,dx = 0$.

2. $P(a \le X \le b) = P(a < X < b) = \int_a^b f_X(x)\,dx$ for any a < b.

3. $F_X(a) = P(X \le a) = \int_{-\infty}^{a} f_X(x)\,dx$. It then also follows that $F_X'(a) = f_X(a)$ if fX is continuous at a.

4. $P(X \ge a) = \int_a^{\infty} f_X(x)\,dx$.

5. Assume that f is continuous at a; then P(a < X < a + ϵ) ≈ ϵf(a).

5.2 Expectation and Variance


Definition: if X is a continuous random variable with density function fX, the expected value E[X] is defined by
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx.$$

Definition: the median m of a continuous random variable X is the value $m = \frac{a+b}{2}$, where
$$a = \inf\{x : F(x) \ge \tfrac{1}{2}\} \quad\text{and}\quad b = \sup\{x : F(x) \le \tfrac{1}{2}\}.$$

Definition: the mode m of a continuous random variable X is the value m such that fX(m) is maximal.

Lemma 5.1 Suppose X is a continuous random variable whose probability density function fX (x) is even, then
E[X] = 0.

Proof: Since fX(x) is even, xfX(x) is odd, so
$$\int_{-\infty}^{\infty} x f_X(x)\,dx = 0,$$
provided that the integral exists. □

Lemma 5.2 Let Y be a nonnegative continuous random variable with probability density function f. Then
$$\int_0^{\infty} P(Y > y)\,dy = E[Y].$$

Proof:
$$\int_0^{\infty} P(Y>y)\,dy = \int_0^{\infty}\left(\int_y^{\infty} f(x)\,dx\right)dy = \int_0^{\infty}\left(\int_0^{x} dy\right)f(x)\,dx = \int_0^{\infty} x f(x)\,dx = E[Y]. \qquad \square$$

Lemma 5.3 Suppose Y is an arbitrary continuous random variable with probability density function f. Then
$$E[Y] = \int_0^{\infty} P(Y > y)\,dy - \int_0^{\infty} P(Y < -y)\,dy.$$

Proof:
$$E[Y] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^{\infty} x f(x)\,dx + \int_{-\infty}^{0} x f(x)\,dx = \int_0^{\infty} P(Y>y)\,dy - \int_0^{\infty} y f(-y)\,dy$$
$$= \int_0^{\infty} P(Y > y)\,dy - \int_0^{\infty} P(Y < -y)\,dy. \qquad \square$$

Proposition 5.4 Suppose a continuous random variable X has probability density function f. Then for any function g,
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx.$$

Proof: First we prove the statement in the special case that g(x) ≥ 0. Then
$$E[g(X)] = \int_0^{\infty} P(g(X) > y)\,dy = \int_0^{\infty} \int_{x:\, g(x)>y} f(x)\,dx\,dy = \int_{x:\, g(x)>0} \int_0^{g(x)} dy\, f(x)\,dx = \int_{x:\, g(x)>0} f(x)g(x)\,dx.$$
Hence the statement holds for g ≥ 0.

Now, for the general case, consider g = g⁺ − g⁻, where
$$g^+(x) = \max\{0, g(x)\}, \qquad g^-(x) = |\min\{0, g(x)\}|.$$
Then it is clear that g⁺ ≥ 0 and g⁻ ≥ 0. If g(x) ≥ 0, then g(x) = g(x) − 0 = g⁺(x) − g⁻(x); if g(x) ≤ 0, then g(x) = 0 − |g(x)| = g⁺(x) − g⁻(x). Hence by the linearity of expectation and integration, we have
$$E[g(X)] = E[g^+(X)] - E[g^-(X)] = \int_{-\infty}^{\infty} [g^+(x) - g^-(x)]f(x)\,dx = \int_{-\infty}^{\infty} g(x)f(x)\,dx. \qquad \square$$

Lemma 5.5 Let X be a continuous random variable. For a, b ∈ R,
$$E[aX + b] = aE[X] + b.$$

Proof:
$$E[aX+b] = \int_{-\infty}^{\infty}(ax+b)f_X(x)\,dx = a\int_{-\infty}^{\infty} x f_X(x)\,dx + b\int_{-\infty}^{\infty} f_X(x)\,dx = aE[X] + b. \qquad \square$$

Definition: let X be a continuous random variable. Then the variance of X is defined to be

Var(X) = E[(X − µX )2 ]

where µX = E[X]. Equivalently, we have

Var(X) = E[X 2 ] − (E[X])2 .

Notice the same proof used for discrete random variables applies to continuous random variables as well. Similarly, we have
$$\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X).$$

5.3 Uniform Random Variable


Definition: a random variable X is uniformly distributed over (0, 1) if the probability density function is
$$f(x) = \begin{cases} 1 & 0 < x < 1, \\ 0 & \text{otherwise}. \end{cases}$$

Lemma 5.6 One can easily verify the following if X is uniformly distributed over (0, 1):

• If 0 < a < b < 1, then $P(a \le X \le b) = \int_a^b f(x)\,dx = b - a$.

• $E[X] = \int_0^1 x f(x)\,dx = \frac{1}{2}$.

• $E[X^2] = \int_0^1 x^2 f(x)\,dx = \frac{1}{3}$.

• $\mathrm{Var}(X) = E[X^2] - (E[X])^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}$.

Definition: a random variable X is uniformly distributed over (α, β) if the density function is
$$f(x) = \begin{cases} \frac{1}{\beta - \alpha} & \alpha < x < \beta, \\ 0 & \text{otherwise}. \end{cases}$$
Note that Y = aX + b (a > 0) is uniform if X is uniform. In particular, if X is uniform over (0, 1), then Y = aX + b (a > 0) is uniform over (b, a + b). Hence Y = (β − α)X + α is uniform over (α, β). Then

1. $E[Y] = (\beta-\alpha)E[X] + \alpha = (\beta-\alpha)\tfrac{1}{2} + \alpha = \tfrac{1}{2}(\alpha+\beta)$.

2. $\mathrm{Var}(Y) = (\beta-\alpha)^2\,\mathrm{Var}(X) = \tfrac{1}{12}(\beta-\alpha)^2$.

3. Moreover, the cumulative distribution function of Y is
$$F_Y(y) = \begin{cases} 0 & y \le \alpha, \\ (y-\alpha)/(\beta-\alpha) & \alpha < y < \beta, \\ 1 & y \ge \beta. \end{cases}$$

Bertrand's Paradox:
Bertrand's paradox is a probability problem that is not well posed. The problem states: consider a random chord of a circle; what is the probability that the length of the chord is greater than the side of the equilateral triangle inscribed in that circle? The problem cannot be solved as stated because we do not know what is meant by a random chord of a circle: different ways of selecting the chord at random yield different probabilities.

5.4 Normal Random Variables


Definition: Z is the standard normal random variable if the probability density function is

$$f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}.$$

Lemma 5.7
$$\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}.$$

Proof: Recall that
$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}.$$
We can prove this formula by calculating the volume of the solid under $z = e^{-x^2-y^2}$ using multivariable calculus and the shell method. Then, using integration by substitution ($x = \sqrt{2}\,u$), we have the desired result. This lemma also implies that the density of Z indeed integrates to 1, so Z is a genuine random variable. □

Lemma 5.8 E[Z] = 0.

Proof: $E[Z] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x e^{-x^2/2}\,dx = 0$ since $xe^{-x^2/2}$ is an odd function. □

Definition: the cumulative distribution function P(Z ≤ x) is given by
$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt.$$

Lemma 5.9 Φ(−x) = 1 − Φ(x).

Proof: Since the probability density function is even, then

Φ(−x) = P (Z ≤ −x) = P (Z ≥ x) = 1 − P (Z < x) = 1 − Φ(x).

Corollary 5.9.1 Suppose Z is the standard normal random variable and x > 0; then

• P(Z > x) = P(Z < −x);

• P(|Z| > x) = 2P(Z > x);

• P(|Z| < x) = 2P(Z < x) − 1.

Proposition 5.10 Suppose Z has probability density function $f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$. Then Var(Z) = E[Z²] = 1.

Proof: By definition, we have
$$E[Z^2] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^2 e^{-x^2/2}\,dx.$$
Let u = x and $dv/dx = xe^{-x^2/2}$. Then, integrating by parts,
$$E[Z^2] = \frac{1}{\sqrt{2\pi}}\left(\left[-xe^{-x^2/2}\right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-x^2/2}\,dx\right) = \frac{1}{\sqrt{2\pi}}\left(0 + \sqrt{2\pi}\right) = 1.$$
Since E[Z] = 0, we have Var(Z) = E[Z²] = 1. □

Definition: X is a normal variable with parameters (µ, σ²) if the probability density function of X is
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/(2\sigma^2)}.$$
One can verify that X = σZ + µ, so E[X] = µ and Var(X) = σ².


Conversely, if X is a normal variable with parameters (µ, σ 2 ), then Z = (X −µ)/σ is standard normally distributed.

Lemma 5.11 Suppose X is a normal variable with parameters (µ, σ 2 ), then Y = aX + b is a normal variable with
parameters (aµ + b, a2 σ 2 ).

Proof: Suppose Y = aX + b; since X is a normal variable with parameters (µ, σ²), X = σZ + µ. Hence Y = aσZ + aµ + b, thus Y is a normal variable with parameters (aµ + b, a²σ²). □

Proposition 5.12 Suppose X is a normal variable with parameters (µ, σ 2 ), then

P (X > µ + kσ) = 1 − Φ(k).

And if k > 0, then


P (µ − kσ < X < µ + kσ) = 2Φ(k) − 1.

Proof: Let Z = (X − µ)/σ; then Z is standard normally distributed, and
$$P(X > \mu + k\sigma) = P\left(\frac{X-\mu}{\sigma} > \frac{(\mu+k\sigma)-\mu}{\sigma}\right) = P(Z > k) = 1 - \Phi(k).$$
Next, notice that P(X < µ − kσ) = P(X > µ + kσ) = 1 − Φ(k), so
$$P(\mu - k\sigma < X < \mu + k\sigma) = 1 - 2[1 - \Phi(k)] = 2\Phi(k) - 1. \qquad \square$$

Theorem 5.13 (The De Moivre-Laplace Limit Theorem) Let Sn be a binomial random variable with parameters (n, p). Then
$$P\left(a \le \frac{S_n - np}{\sqrt{np(1-p)}} \le b\right) \to \Phi(b) - \Phi(a)$$
as n → ∞, where Φ is the standard normal cumulative distribution function. In approximations, P(Sn = i) is written as P(i − ½ < Sn < i + ½) (the continuity correction for integer-valued Sn).

Proof: The theorem follows from the fact that
$$\frac{S_n - np}{\sqrt{np(1-p)}}$$
is approximately standard normal, since Sn has mean np and variance np(1 − p). □
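A numerical sketch of the approximation with the continuity correction, using math.erf to evaluate Φ (the parameters are chosen for illustration):

```python
# Sketch: P(Sn = i) for Sn ~ binomial(n, p) vs the normal approximation
# over the corrected interval (i - 1/2, i + 1/2).
from math import comb, erf, sqrt

def Phi(x):  # standard normal cdf via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p, i = 100, 0.5, 52
mu, sigma = n * p, sqrt(n * p * (1 - p))
exact = comb(n, i) * p ** i * (1 - p) ** (n - i)
approx = Phi((i + 0.5 - mu) / sigma) - Phi((i - 0.5 - mu) / sigma)
print(exact, approx)   # both about 0.0735
```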

5.5 Exponential Random Variables


Definition: a random variable X is exponential with parameter λ > 0 if the probability density function is
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$
Then the cumulative distribution function of this random variable is
$$F(x) = \begin{cases} 1 - e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$

Lemma 5.14 Suppose X is an exponential random variable with parameter λ; if c > 0, then cX is an exponential random variable with parameter λ/c.

Proof: Let Y = cX; then clearly Y takes nonnegative values, so for y ≥ 0 we have
$$F_Y(y) = P(Y < y) = P(cX < y) = P\left(X < \frac{y}{c}\right) = F_X\left(\frac{y}{c}\right).$$
Hence
$$f_Y(y) = \frac{d}{dy} F_X\left(\frac{y}{c}\right) = \frac{\lambda}{c}\, e^{-\lambda y/c}.$$
So, indeed, Y is exponential with parameter λ/c. □

Proposition 5.15 Suppose X is an exponential random variable with parameter λ > 0 and Y = λX. Then

• E[X] = E[Y]/λ = 1/λ.

• Var(X) = Var(Y)/λ² = 1/λ².

Proof: P(Y > y) = P(X > y/λ) = e^{−λ(y/λ)} = e^{−y}, so Y is exponential with parameter 1. Then
$$E[Y] = \int_0^{\infty} P(Y>y)\,dy = \int_0^{\infty} e^{-y}\,dy = 1,$$
$$E[Y^2] = \int_0^{\infty} P(Y^2>y)\,dy = \int_0^{\infty} e^{-\sqrt{y}}\,dy = 2,$$
$$\mathrm{Var}(Y) = E[Y^2] - (E[Y])^2 = 2 - 1^2 = 1.$$
Hence the proposition follows. □

Proposition 5.16 If X is an exponential random variable with mean 1/λ, then
$$E[X^k] = \frac{k!}{\lambda^k}, \quad k = 1, 2, 3, \cdots.$$

Proof: Since X is exponential with parameter λ,
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$
Then, integrating by parts,
$$E[X^k] = \int_0^{\infty} x^k \lambda e^{-\lambda x}\,dx = \left[-x^k e^{-\lambda x}\right]_0^{\infty} + \int_0^{\infty} k x^{k-1} e^{-\lambda x}\,dx = \frac{k}{\lambda}\int_0^{\infty} x^{k-1}\lambda e^{-\lambda x}\,dx = \frac{k}{\lambda}\, E[X^{k-1}].$$
Then by induction we can show that $E[X^k] = \frac{k!}{\lambda^k}$. □

Definition: if P(X > s + t | X > t) = P(X > s) for all s, t ≥ 0, then X is said to be memoryless.
Equivalently, X is memoryless if and only if P(X > s + t) = P(X > s)P(X > t) for all s, t ≥ 0.

Lemma 5.17 Exponential random variables are memoryless.

Proof: Suppose X is the lifetime of a component that has survived for t hours. The probability that it survives at least another s hours is

P(X > s + t | X > t) = P(X > s + t)/P(X > t) = e^{−λ(s+t)}/e^{−λt} = e^{−λs} = P(X > s). □
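A small simulation can illustrate memorylessness (a Python/NumPy sketch; the rate λ = 0.5, the times s, t, and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 0.5, size=1_000_000)  # lambda = 0.5, so scale = 1/lambda

s, t = 1.0, 2.0
lhs = np.mean(x > s + t) / np.mean(x > t)   # estimates P(X > s+t | X > t)
rhs = np.mean(x > s)                        # estimates P(X > s)
print(lhs, rhs)  # both should be near exp(-0.5 * 1.0) ~ 0.6065
```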

Definition: X is a double exponential random variable with parameter λ > 0 if its probability density function is

f(x) = (1/2) λ e^{−λ|x|}, −∞ < x < ∞.

Lemma 5.18 Suppose F_X(x) is the cumulative distribution function of a double exponential random variable X with parameter λ > 0. Then

F_X(x) = (1/2) e^{λx} for x < 0, and F_X(x) = 1 − (1/2) e^{−λx} for x ≥ 0.

Proof: One can integrate and verify directly. □

Proposition 5.19 Suppose X is a double exponential random variable with parameter λ > 0, and let Y be exponentially distributed with parameter λ. Then

• E[X] = 0;

• E[X²] = E[Y²] = 2/λ²;

• Var(X) = 2/λ².

Proof: Let Y = |X|. Then for y ≥ 0,

P(Y > y) = P(X > y) + P(X < −y) = 2P(X > y) = 2 ∫_y^∞ (1/2) λe^{−λx} dx = e^{−λy}.

So Y is exponentially distributed with parameter λ, hence E[Y²] = 2/λ². Since Y² = |X|² = X², it is clear that E[X²] = E[Y²] = 2/λ². By simple integration we also get E[X] = 0 (the p.d.f. is even). Then Var(X) = E[X²] − (E[X])² = 2/λ². □

Definition: let X be a positive continuous random variable with density f and distribution function F, and let F̄(t) = 1 − F(t). Then λ(t) = f(t)/F̄(t) is called the hazard (failure) rate function. The interpretation is that if an object has functioned for time t, then λ(t) dt approximates the probability that it fails in the next instant of length dt.

Proposition 5.20 Let λ(s), s > 0, be the hazard rate function of a positive random variable X. Then

F(t) = 1 − exp(−∫₀^t λ(s) ds).

Proof:

∫₀^t λ(s) ds = ∫₀^t f(s)/(1 − F(s)) ds = [−ln(1 − F(s))]₀^t = −ln(1 − F(t)) + ln(1 − F(0)) = −ln(1 − F(t)),

which is equivalent to F(t) = 1 − exp(−∫₀^t λ(s) ds). □

Proposition 5.21 Suppose λ(t) is the hazard failure rate function of a random variable X. Then X is an expo-
nential random variable with parameter λ if and only if λ(t) = λ.

Proof: Suppose X is an exponential random variable with parameter λ. Then

λ(t) = f(t)/F̄(t) = λe^{−λt}/(1 − (1 − e^{−λt})) = λ.

Next, suppose λ(t) = λ. Then

F_X(t) = 1 − exp(−∫₀^t λ ds) = 1 − exp(−λt).

Hence X is an exponential random variable with rate λ. □

5.6 Other Continuous Distributions


5.6.1 Gamma Distribution

Definition: a random variable X is gamma with parameters (n, λ), λ > 0, if the density function is

f(t) = λe^{−λt} (λt)^{n−1}/(n − 1)!, t ≥ 0.

It is the distribution of the time at which the nth event of a Poisson process of rate λ occurs.
Definition: a random variable X is gamma with parameters (α, λ), α, λ > 0, if the density function is

f(t) = λe^{−λt} (λt)^{α−1}/Γ(α), t ≥ 0,

where

Γ(α) = ∫₀^∞ λe^{−λt} (λt)^{α−1} dt = ∫₀^∞ e^{−y} y^{α−1} dy

is the gamma function.

Recall that Γ(α + 1) = αΓ(α), for α > 0. And Γ(n) = (n − 1)! for positive integer n.

Suppose n is a positive integer. Then a gamma random variable X with parameters (n, λ) represents the waiting time until the nth event of a Poisson process with rate λ.

Lemma 5.22 Let X be gamma with parameters (α, λ). Then E[X] = α/λ.

Proof:

E[X] = (1/Γ(α)) ∫₀^∞ t · λe^{−λt} (λt)^{α−1} dt
     = (1/Γ(α)) ∫₀^∞ e^{−λt} (λt)^α dt
     = (1/(λΓ(α))) ∫₀^∞ λe^{−λt} (λt)^{(α+1)−1} dt
     = Γ(α + 1)/(λΓ(α))
     = α/λ. □

Lemma 5.23 Let X be gamma with parameters (α, λ). Then E[X²] = α(α + 1)/λ² and Var(X) = α/λ².

Proof:

E[X²] = (1/Γ(α)) ∫₀^∞ t² · λe^{−λt} (λt)^{α−1} dt
      = (1/(λΓ(α))) ∫₀^∞ e^{−λt} (λt)^{α+1} dt
      = (1/(λ²Γ(α))) ∫₀^∞ λe^{−λt} (λt)^{(α+2)−1} dt
      = Γ(α + 2)/(λ²Γ(α))
      = α(α + 1)/λ².

Hence it follows that

Var(X) = E[X²] − (E[X])² = α(α + 1)/λ² − α²/λ² = α/λ². □
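These moment formulas are easy to sanity-check by simulation (a Python/NumPy sketch; note that NumPy parametrizes its gamma sampler by shape and scale = 1/λ, and the parameter values below are arbitrary choices):

```python
import numpy as np

alpha, lam = 3.5, 2.0
rng = np.random.default_rng(1)
x = rng.gamma(shape=alpha, scale=1 / lam, size=1_000_000)

print(x.mean(), alpha / lam)     # E[X] = alpha/lambda
print(x.var(), alpha / lam**2)   # Var(X) = alpha/lambda^2
```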

5.6.2 Beta Distribution

Definition: a random variable X is beta with parameters (a, b), a, b > 0, if the density function is

f(x) = x^{a−1}(1 − x)^{b−1}/B(a, b), 0 < x < 1,

where

B(a, b) = ∫₀^1 x^{a−1}(1 − x)^{b−1} dx = Γ(a)Γ(b)/Γ(a + b).
It can be interpreted as the distribution of the success probability p of a trial, given that a successes and b failures were observed in the first a + b trials. For example, a beta distribution with parameters (40, 60) gives the distribution of the success rate p of a trial, given that performing the trial 100 times yielded 40 successes and 60 failures.

Lemma 5.24 If a = b = 1, then the beta random variable X with parameter (a, b) is uniform on (0, 1).

Proof: If a = b = 1, then f(x) = 1/B(1, 1) = 1 for 0 < x < 1. Hence X is uniform on (0, 1). □

Proposition 5.25 Suppose n, m ∈ ℕ, then

∫₀^1 xⁿ (1 − x)ᵐ dx = n! m!/(n + m + 1)!.

Proof: Let C(n, m) = ∫₀^1 xⁿ (1 − x)ᵐ dx. Using integration by parts, we have

C(n, m) = (m/(n + 1)) C(n + 1, m − 1).

Note that C(n, 0) = 1/(n + 1). Then using induction on m we can prove the identity. □

Lemma 5.26 B(a + 1, b) = (a/(a + b)) B(a, b).

Proof: Integration by parts. □

Proposition 5.27 Suppose X is a beta random variable with parameters (a, b), then

• E[X] = a/(a + b);

• E[X²] = (a + 1)a/((a + b + 1)(a + b));

• Var(X) = ab/((a + b)²(a + b + 1)).

Proof:

E[X] = ∫₀^1 x · x^{a−1}(1 − x)^{b−1}/B(a, b) dx = B(a + 1, b)/B(a, b) = a/(a + b).

E[X²] = ∫₀^1 x² · x^{a−1}(1 − x)^{b−1}/B(a, b) dx = B(a + 2, b)/B(a, b) = (a + 1)a/((a + b + 1)(a + b)).

Then the value of Var(X) follows from the previous two values. □
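Again these moments can be sanity-checked numerically (a Python/NumPy sketch; the parameters (40, 60) echo the example above, and the sample size is arbitrary):

```python
import numpy as np

a, b = 40, 60
rng = np.random.default_rng(2)
x = rng.beta(a, b, size=1_000_000)

print(x.mean(), a / (a + b))                          # E[X]
print(x.var(), a * b / ((a + b) ** 2 * (a + b + 1)))  # Var(X)
```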

5.7 Information on the Gamma and Beta Function


Definition: the Γ function is defined by

Γ(s) = ∫₀^{+∞} e^{−x} x^{s−1} dx,

where 0 < s < ∞.

Proposition 5.28 ∫₀^{+∞} e^{−x} x^{s−1} dx converges for s > 0; hence Γ(s) is well-defined.

Proof: We decompose the improper integral into two parts:

I₁ = ∫₀^1 e^{−x} x^{s−1} dx,  I₂ = ∫₁^∞ e^{−x} x^{s−1} dx.

First consider I₁. When s ≥ 1, I₁ is a definite integral, hence it converges; when 0 < s < 1,

e^{−x} x^{s−1} = x^{s−1}/eˣ < 1/x^{1−s},

and since 1 − s < 1, the integral comparison theorem shows that I₁ converges.


Now consider I₂. Since

lim_{x→∞} x² (e^{−x} x^{s−1}) = lim_{x→∞} x^{s+1}/eˣ = 0,

the limit comparison theorem shows that I₂ converges as well. Thus ∫₀^∞ e^{−x} x^{s−1} dx converges for all s > 0. □

The comparison test and limit comparison test for improper integrals:

• Suppose f(x), g(x) are continuous on [a, ∞) and 0 ≤ f(x) ≤ g(x) for x ≥ a. If ∫_a^∞ g(x) dx converges, then ∫_a^∞ f(x) dx converges; if ∫_a^∞ f(x) dx diverges, then ∫_a^∞ g(x) dx diverges.

• Suppose f(x) is continuous on [a, ∞) and f(x) ≥ 0. If there exists a constant p > 1 such that lim_{x→∞} xᵖ f(x) = c < ∞, then ∫_a^∞ f(x) dx converges; if lim_{x→∞} x f(x) = d > 0, then ∫_a^∞ f(x) dx diverges.

Lemma 5.29 Γ(s + 1) = sΓ(s).

Proof: Using integration by parts, one has

Γ(s + 1) = ∫₀^∞ e^{−x} xˢ dx = [−e^{−x} xˢ]₀^∞ + s ∫₀^∞ e^{−x} x^{s−1} dx = [0 − 0] + sΓ(s) = sΓ(s). □

Lemma 5.30 Γ(1) = 1.

Proof: Γ(1) = ∫₀^∞ e^{−t} dt = 1. □

Proposition 5.31 Γ(n + 1) = n!, for n ∈ N.

Proof: Using induction, we easily get the result. □

Lemma 5.32 As s → 0+ , we have Γ(s) → ∞.

Proof: The Γ function is continuous for all positive values of s (being a convergent integral); this will be used without proof. Since Γ(s) = Γ(s + 1)/s and Γ(s + 1) → Γ(1) = 1 as s → 0⁺, we get

lim_{s→0⁺} Γ(s) = lim_{s→0⁺} Γ(s + 1)/s = ∞. □

Lemma 5.33 Suppose f, g are convex function with the same domain, then f + g is convex.

Proof: Suppose f, g are convex functions on D. Let x, y ∈ D and λ ∈ [0, 1]. Then

(f + g)(λx + (1 − λ)y) = f(λx + (1 − λ)y) + g(λx + (1 − λ)y)
                       ≤ λf(x) + (1 − λ)f(y) + λg(x) + (1 − λ)g(y)
                       = λ(f + g)(x) + (1 − λ)(f + g)(y).

Hence f + g is convex. □

Proposition 5.34 ln Γ is convex on (0, ∞).

Proof: Let 1 < p < ∞ and 1/p + 1/q = 1. Applying Hölder's inequality, we obtain

Γ(x/p + y/q) = ∫₀^∞ t^{x/p + y/q − 1} e^{−t} dt
             = ∫₀^∞ (t^{(x−1)/p} e^{−t/p}) (t^{(y−1)/q} e^{−t/q}) dt
             ≤ (∫₀^∞ t^{x−1} e^{−t} dt)^{1/p} (∫₀^∞ t^{y−1} e^{−t} dt)^{1/q}
             = Γ(x)^{1/p} Γ(y)^{1/q}.

Hence

ln Γ(x/p + y/q) ≤ ln(Γ(x)^{1/p} Γ(y)^{1/q}) = (1/p) ln Γ(x) + (1/q) ln Γ(y).

This implies ln Γ is convex. □

Theorem 5.35 If f is a positive function on (0, ∞) such that

1. f (x + 1) = xf (x),

2. f (1) = 1,

3. ln f is convex,

then f (x) = Γ(x).

Proof: Since Γ satisfies 1, 2, 3, it is enough to prove that f(x) is uniquely determined by 1, 2, 3 for all x > 0. By 1, it is enough to do this for x ∈ (0, 1), as the remaining values are determined by the values of f on (0, 1).
Put φ = ln f. Then

φ(x + 1) = φ(x) + ln x (0 < x < ∞),

φ(1) = 0, and φ is convex. Suppose 0 < x < 1 and n is a positive integer; then φ(n + 1) = ln(n!). Consider the difference quotients of φ on the intervals [n, n + 1], [n + 1, n + 1 + x], [n + 1, n + 2]. Since φ is convex,

ln n ≤ (φ(n + 1 + x) − φ(n + 1))/x ≤ ln(n + 1).

Repeated application of φ(x + 1) = φ(x) + ln x gives

φ(n + 1 + x) = φ(x) + ln[x(x + 1) · · · (x + n)].

Thus

ln n ≤ (φ(x) + ln[x(x + 1) · · · (x + n)] − φ(n + 1))/x ≤ ln(n + 1).

Then by some algebraic manipulation, we have

0 ≤ φ(x) − ln( n! nˣ / (x(x + 1) · · · (x + n)) ) ≤ x ln(1 + 1/n).

The expression on the right tends to 0 as n → ∞; hence φ(x) is uniquely determined, and the proof is complete. □

Corollary 5.35.1 Suppose 0 < x < 1, then

Γ(x) = lim_{n→∞} n! nˣ / (x(x + 1) · · · (x + n)).

Proof: This is clear from the proof of the above theorem. □


Proposition 5.36 Γ(1/2) = √π.

Proof: By definition, Γ(s) = ∫₀^{+∞} e^{−x} x^{s−1} dx. Substituting x = u², dx = 2u du, we get

Γ(s) = 2 ∫₀^∞ e^{−u²} u^{2s−1} du.

Let t = 2s − 1, i.e., s = (1 + t)/2; then

∫₀^∞ e^{−u²} uᵗ du = (1/2) Γ((1 + t)/2).

When s = 1/2, t = 0, so

Γ(1/2) = 2 ∫₀^∞ e^{−u²} du = √π. □

Proposition 5.37 (Euler's reflection formula)

Γ(s)Γ(1 − s) = π/sin(πs)

for 0 < s < 1.

Proof: to read the proof of this or more related readings on the Γ function, visit https://en.wikipedia.org/
wiki/Gamma_function □

Theorem 5.38 If x > 0 and y > 0, then

∫₀^1 t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y).

This integral is also known as the beta function B(x, y).

Proof: Note that B(1, y) = 1/y, and using Hölder's inequality, one can show that ln B(x, y) is a convex function of x for each fixed y. We show that

B(x + 1, y) = (x/(x + y)) B(x, y).

Indeed, integrating by parts,

B(x + 1, y) = ∫₀^1 (t/(1 − t))ˣ (1 − t)^{x+y−1} dt
            = [−(t/(1 − t))ˣ · (1 − t)^{x+y}/(x + y)]₀^1 + (x/(x + y)) ∫₀^1 t^{x−1}(1 − t)^{y−1} dt
            = (x/(x + y)) B(x, y).

Then for each y, consider the function

f(x) = Γ(x + y) B(x, y)/Γ(y).

Then f(1) = 1, f(x + 1) = x f(x), and ln f(x) = ln B(x, y) + ln Γ(x + y) − ln Γ(y) is convex. Hence f(x) = Γ(x) by Theorem 5.35. This implies

B(x, y) = Γ(x)Γ(y)/Γ(x + y). □

Corollary 5.38.1 (Legendre's duplication formula)

Γ(x) = (2^{x−1}/√π) Γ(x/2) Γ((x + 1)/2).

Proof: Let

f(x) = (2^{x−1}/√π) Γ(x/2) Γ((x + 1)/2).

Note f(1) = (1/√π)Γ(1/2)Γ(1) = 1, f(x + 1) = 2 · (x/2) f(x) = x f(x), and

ln f(x) = (x − 1) ln 2 − ln √π + ln Γ(x/2) + ln Γ((x + 1)/2)

is convex, being a sum of an affine function and compositions of the convex function ln Γ with affine maps. This implies f(x) = Γ(x) on (0, ∞), completing the proof. □

Theorem 5.39 (Stirling's Formula) Stirling's formula provides a simple approximation to Γ(x + 1) when x is large:

lim_{x→∞} Γ(x + 1)/((x/e)ˣ √(2πx)) = 1.

Proof: Apply the change of variable t = x(1 + u) in the definition

Γ(x + 1) = ∫₀^∞ tˣ e^{−t} dt.

Then we get

Γ(x + 1) = x^{x+1} e^{−x} ∫_{−1}^∞ [(1 + u)e^{−u}]ˣ du.

Define h(u) so that h(0) = 1 and

(1 + u)e^{−u} = exp(−(u²/2) h(u))

for −1 < u < ∞, u ≠ 0. One can check that h(u) is indeed well defined, as

h(u) = (2/u²)[u − ln(1 + u)], u ≠ 0.

Since h(0) = 1, one can verify that h is continuous. It is also clear that h(u) decreases monotonically from ∞ to 0 as u increases from −1 to ∞.

Substituting u = s√(2/x), we get

Γ(x + 1) = xˣ e^{−x} √(2x) ∫_{−∞}^∞ ψ_x(s) ds,

where ψ_x(s) = exp(−s² h(s√(2/x))) for −√(x/2) < s < ∞, and ψ_x(s) = 0 for s ≤ −√(x/2).

Next one can verify the following facts:

1. For every s, ψ_x(s) → e^{−s²} as x → ∞.

2. The convergence in 1 is uniform on [−A, A] for every A < ∞.

3. When s < 0, then 0 < ψ_x(s) < e^{−s²}.

4. When s > 0 and x > 1, then 0 < ψ_x(s) < ψ₁(s).

5. ∫₀^∞ ψ₁(s) ds < ∞.

Then by uniform convergence, together with the bounds in 3–5, the integral converges to the integral of the limit. Since

∫_{−∞}^∞ e^{−s²} ds = √π,

we have

lim_{x→∞} Γ(x + 1)/((x/e)ˣ √(2πx)) = 1. □
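One can observe the convergence numerically (a Python sketch using the standard library's math.gamma; the sample points are arbitrary choices):

```python
import math

# The ratio Gamma(x+1) / ((x/e)^x * sqrt(2*pi*x)) should tend to 1 as x grows.
for x in [1, 5, 10, 50, 100]:
    ratio = math.gamma(x + 1) / ((x / math.e) ** x * math.sqrt(2 * math.pi * x))
    print(x, ratio)
```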

5.8 Distribution of a Function of a Random Variable


Recall that if X is a random variable then for any function g, g(X) is also a random variable.

Theorem 5.40 Let X be a continuous random variable and let Y = g(X), where g is strictly monotonic and differentiable. Then for y in the range of g,

f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)|.

Proof: We consider two cases. First, suppose g(x) is strictly increasing and differentiable. Then

F_Y(y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)).

So

f_Y(y) = F_Y′(y) = f_X(g^{−1}(y)) (d/dy) g^{−1}(y).

Now suppose g(x) is strictly decreasing and differentiable. Then

F_Y(y) = P(g(X) ≤ y) = P(X ≥ g^{−1}(y)) = 1 − F_X(g^{−1}(y)).

So

f_Y(y) = F_Y′(y) = −f_X(g^{−1}(y)) (d/dy) g^{−1}(y).

In both cases f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)|, since g^{−1} is increasing in the first case and decreasing in the second. □

Corollary 5.40.1 Suppose g is strictly increasing, and X, Y are continuous random variables such that Y = g(X); then F_Y(y) = F_X(g^{−1}(y)). Suppose g is strictly decreasing and Y = g(X); then F_Y(y) = 1 − F_X(g^{−1}(y)).

Proof: This follows from the proof of the theorem. □

Definition: let X be a normal random variable with parameters (µ, σ 2 ). Then Y = eX is called lognormal with
parameters (µ, σ 2 ).

Proposition 5.41 Let X be a normal random variable with parameters (µ, σ²). Then the probability density function of Y = e^X is given by

f_Y(y) = (1/(√(2π) σ y)) exp{−(ln y − µ)²/(2σ²)}, y > 0.

Proof: The density of X is

f_X(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}.

Let g(x) = eˣ. Then g^{−1}(y) = ln y and (d/dy) g^{−1}(y) = 1/y. For y > 0,

f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)| = f_X(ln y) (1/y) = (1/(√(2π) σ y)) exp{−(ln y − µ)²/(2σ²)}. □
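The transformed density can be checked against an empirical histogram (a Python/NumPy sketch; the parameters µ, σ, the test point y₀, and the bin width h are arbitrary choices):

```python
import math
import numpy as np

mu, sigma = 0.5, 0.8
rng = np.random.default_rng(3)
y = np.exp(rng.normal(mu, sigma, size=1_000_000))  # Y = e^X with X normal

def f_Y(v):
    # Lognormal density derived above.
    return math.exp(-(math.log(v) - mu) ** 2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma * v)

y0, h = 2.0, 0.05
# Fraction of samples in a small bin, divided by its width, estimates the density.
print(np.mean((y > y0) & (y < y0 + h)) / h, f_Y(y0))
```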

Lemma 5.42 Suppose Y is lognormal with parameters (µ, σ 2 ), if c > 0, then cY is lognormal with parameters
(µ + ln c, σ 2 ).

Proof: Suppose Y = eX , where X is normal with parameters (µ, σ 2 ), then cY = eX+ln c , and clearly X + ln c is
normal with parameters (µ + ln c, σ 2 ). □

Proposition 5.43 Let Z be a standard normal random variable, and let g be a differentiable function with derivative g′ such that

lim_{x→±∞} g(x) e^{−x²/2} = 0.

Then

1. E[g ′ (Z)] = E[Zg(Z)];

2. E[Z n+1 ] = nE[Z n−1 ].

Proof: Let f(z) denote the probability density function of Z, so

f(z) = (1/√(2π)) e^{−z²/2}.

1. Integrating by parts,

E[g′(Z)] = ∫_{−∞}^∞ g′(z) f(z) dz
         = (1/√(2π)) ∫_{−∞}^∞ e^{−z²/2} g′(z) dz
         = (1/√(2π)) ( [e^{−z²/2} g(z)]_{−∞}^∞ − ∫_{−∞}^∞ (−z e^{−z²/2}) g(z) dz )
         = (1/√(2π)) ∫_{−∞}^∞ z e^{−z²/2} g(z) dz
         = E[Zg(Z)],

where the boundary term vanishes by the hypothesis on g.

2. Let g(x) = xⁿ, so g′(x) = n x^{n−1}. Then

E[Z^{n+1}] = E[Zg(Z)] = E[g′(Z)] = E[nZ^{n−1}] = nE[Z^{n−1}]. □

Proposition 5.44 Let X be a nonnegative continuous random variable. Then

E[Xⁿ] = ∫₀^∞ n t^{n−1} P(X > t) dt.

Proof: Note that xⁿ = ∫₀^x n t^{n−1} dt. Then

E[Xⁿ] = ∫₀^∞ xⁿ f(x) dx
      = ∫₀^∞ ∫₀^x n t^{n−1} f(x) dt dx
      = ∫₀^∞ ∫_t^∞ n t^{n−1} f(x) dx dt   (changing the order of integration)
      = ∫₀^∞ n t^{n−1} P(X > t) dt. □

Corollary 5.44.1 If X is a nonnegative continuous random variable, then P(X > a) ≤ E[Xⁿ]/aⁿ for any a > 0 and positive integer n.

Proof: It suffices to show that aⁿ P(X > a) ≤ E[Xⁿ]. Using the same argument as before,

aⁿ P(X > a) = ∫₀^a n t^{n−1} P(X > a) dt ≤ ∫₀^a n t^{n−1} P(X > t) dt ≤ E[Xⁿ],

since P(X > a) ≤ P(X > t) for t ≤ a. □
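The moment bound is easy to see in a simulation (a Python/NumPy sketch; the exponential distribution, a = 3, and n = 2 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=1_000_000)  # nonnegative, with E[X^n] = n!

a, n = 3.0, 2
print(np.mean(x > a))         # empirical P(X > a), about exp(-3) ~ 0.0498
print(np.mean(x**n) / a**n)   # bound E[X^n]/a^n ~ 2/9; it should dominate
```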

6 Jointly Distributed Random Variables
6.1 Joint Cumulative Distribution Function
Definition: let X and Y be two random variables, their joint cumulative probability distribution function is
FX,Y (a, b) = P (X ≤ a, Y ≤ b).

Proposition 6.1 Let F (a, b) = P (X ≤ a, Y ≤ b) be the joint cumulative distribution function of X and Y . Then
it can be used to generate all probability involving X and Y :

• P(a₁ < X ≤ a₂, Y ≤ b) = P(X ≤ a₂, Y ≤ b) − P(X ≤ a₁, Y ≤ b) = F(a₂, b) − F(a₁, b).

• P(X ≤ a, b₁ < Y ≤ b₂) = P(X ≤ a, Y ≤ b₂) − P(X ≤ a, Y ≤ b₁) = F(a, b₂) − F(a, b₁).

• P(a₁ < X ≤ a₂, b₁ < Y ≤ b₂) = P(X ≤ a₂, b₁ < Y ≤ b₂) − P(X ≤ a₁, b₁ < Y ≤ b₂) = F(a₂, b₂) − F(a₂, b₁) − F(a₁, b₂) + F(a₁, b₁).

• F_X(a) = P(X ≤ a, Y ∈ ℝ) = F(a, ∞).

• F_Y(b) = P(X ∈ ℝ, Y ≤ b) = F(∞, b).

• P(X > a, Y > b) = 1 − F_X(a) − F_Y(b) + F(a, b).

Thus F(a, b) determines P((X, Y) ∈ C) for any Borel set C ⊆ ℝ².

Proof: This is clear from the definition of the joint cumulative probability distribution function. □

Definition: suppose X and Y are discrete random variables, then we use their joint probability mass function:

pX,Y (x, y) = P (X = x, Y = y).

Suppose Y takes values y₁, y₂, · · · . Then

P(X = x) = Σ_j P(X = x, Y = y_j)

is the mass function of X.

Suppose X takes values x₁, x₂, · · · . Then

P(Y = y) = Σ_i P(X = x_i, Y = y)

is the mass function of Y.


Definition: the probability mass functions of X and Y obtained in this way are called marginal probability mass functions:

p_X(x) = Σ_j p_{X,Y}(x, y_j),  p_Y(y) = Σ_i p_{X,Y}(x_i, y).

Definition: suppose X and Y are continuous random variables. We say X and Y are jointly continuous if there is a function f_{X,Y}(x, y) ≥ 0, called their joint probability density function, such that

P((X, Y) ∈ C) = ∫∫_C f_{X,Y}(x, y) dx dy

for any C ⊆ ℝ².


Then

F_{X,Y}(a, b) = P(X ≤ a, Y ≤ b) = ∫_{−∞}^a ∫_{−∞}^b f_{X,Y}(x, y) dy dx,

and

f_{X,Y}(a, b) = ∂²/∂a∂b F_{X,Y}(a, b).

Lemma 6.2 Suppose X and Y are jointly continuous with joint probability density function f(x, y). Then

P(X ∈ A, Y ∈ B) = ∫_A ∫_B f(x, y) dy dx = ∫_B ∫_A f(x, y) dx dy.

For small ϵ₁, ϵ₂,

P(a ≤ X ≤ a + ϵ₁, b ≤ Y ≤ b + ϵ₂) ≈ f(a, b) ϵ₁ ϵ₂.

Proof: This is by definition of joint probability distribution function and two variable integration. □

Definition: suppose X and Y are jointly continuous with joint probability density function f(x, y). Let A ⊆ ℝ; then

P(X ∈ A) = P(X ∈ A, Y ∈ ℝ) = ∫_A ∫_{−∞}^∞ f(x, y) dy dx.

So X is a continuous random variable with probability density function

f_X(x) = ∫_{−∞}^∞ f(x, y) dy,

which is known as the marginal probability density function. Similarly, for B ⊆ ℝ we have

P(Y ∈ B) = P(X ∈ ℝ, Y ∈ B) = ∫_B ∫_{−∞}^∞ f(x, y) dx dy,

so Y is a continuous random variable with probability density function

f_Y(y) = ∫_{−∞}^∞ f(x, y) dx.

6.2 Joint Distribution of Random Variables


Definition: let X1 , X2 , · · · , Xn be random variables, then their joint cumulative probability distribution function is

F (a1 , a2 , · · · , an ) = P (X1 ≤ a1 , X2 ≤ a2 , · · · , Xn ≤ an ).

Definition: if X1 , X2 , · · · , Xn are discrete random variables, their joint probability mass function is

p(a1 , a2 , · · · , an ) = P (X1 = a1 , X2 = a2 , · · · , Xn = an ).

Definition: X₁, X₂, · · · , Xₙ are said to be jointly continuous if there is a joint probability density function f such that

P((X₁, · · · , Xₙ) ∈ C) = ∫ · · · ∫_C f(x₁, · · · , xₙ) dx₁ · · · dxₙ

for any C ⊆ ℝⁿ.

Definition: suppose that n independent and identical experiments are performed, where each experiment results in exactly one of r possible outcomes with respective probabilities p₁, · · · , p_r, Σ_{i=1}^r p_i = 1. Let X_i be the number of experiments that result in the ith outcome. Then X₁, · · · , X_r have the multinomial distribution with joint mass function

p(n₁, n₂, · · · , n_r) = (n choose n₁, · · · , n_r) p₁^{n₁} · · · p_r^{n_r},  n = Σ_{i=1}^r n_i.
i=1

6.3 Independent Random Variables


Definition: random variables X and Y are independent if the events X ∈ A and Y ∈ B are independent for any
A, B ⊆ R, i.e.,
P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B).

In particular
P (X ≤ a, Y ≤ b) = P (X ≤ a)P (Y ≤ b).

Theorem 6.3 Random variables X and Y are independent if and only if for any a, b ∈ R, we have

FX,Y (a, b) = FX (a)FY (b).

Theorem 6.4 Discrete random variables X and Y are independent if and only if p_{X,Y}(a, b) = p_X(a)p_Y(b) for all a, b. If this is the case, then

E[XY] = E[X]E[Y] and Var(X + Y) = Var(X) + Var(Y).

Proof: Since X and Y are independent, for any a, b, let A = {a} and B = {b}; we have P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B), so p_{X,Y}(a, b) = p_X(a)p_Y(b). On the other hand, if p_{X,Y}(a, b) = p_X(a)p_Y(b), then for any A, B, we have

P(X ∈ A, Y ∈ B) = Σ_{x∈A} Σ_{y∈B} p_{X,Y}(x, y)
               = Σ_{x∈A} Σ_{y∈B} p_X(x)p_Y(y)
               = Σ_{x∈A} p_X(x) Σ_{y∈B} p_Y(y)
               = P(X ∈ A)P(Y ∈ B).

Then we have the following:


XX
E[XY ] = abpX,Y (a, b)
a b
X X
= a b · pX (a)pY (b)
a b
X X
= apX (a) · bpY (b)
a b

= E[X]E[Y ]

On the other hand,

Var(X + Y) = E[(X + Y)²] − (E[X + Y])²
           = E[X²] + E[Y²] + 2E[XY] − (E[X])² − (E[Y])² − 2E[X]E[Y]
           = E[X²] − (E[X])² + E[Y²] − (E[Y])²
           = Var(X) + Var(Y). □

Theorem 6.5 Jointly continuous random variables X, Y are independent if and only if

fX,Y (x, y) = fX (x)fY (y)

for all x, y.

Proof: ⇒: if X and Y are independent, then FX,Y (x, y) = FX (x)FY (y), so

∂2
fX,Y (x, y) = FX,Y (x, y) = FX′ (x)FY′ (y) = fX (x)fY (y).
∂x∂y

⇐: if f_{X,Y}(x, y) = f_X(x)f_Y(y), then

F_{X,Y}(a, b) = ∫_{−∞}^a ∫_{−∞}^b f_{X,Y}(x, y) dy dx
             = ∫_{−∞}^a ∫_{−∞}^b f_X(x)f_Y(y) dy dx
             = ∫_{−∞}^a f_X(x) dx · ∫_{−∞}^b f_Y(y) dy
             = F_X(a)F_Y(b). □

Definition: random variables X1 , · · · , Xn are independent if for any Ai ⊆ R, we have

P (X1 ∈ A1 , · · · , Xn ∈ An ) = P (X1 ∈ A1 ) · · · P (Xn ∈ An ).

Theorem 6.6 X1 , · · · , Xn are independent if and only if for any ai ∈ R,

F (a1 , · · · , an ) = FX1 (a1 ) · · · FXn (an ).

That is, if X1 , · · · , Xn are discrete random variables, then they are independent if and only if for any ai ∈ R,

p(a1 , · · · , an ) = pX1 (a1 ) · · · pXn (an ).

If X1 , · · · , Xn are jointly continuous, then they are independent if and only if for any ai ∈ R,

f (a1 , · · · , an ) = fX1 (a1 ) · · · fXn (an ).

Proposition 6.7 Suppose X₁, X₂, · · · , Xₙ are jointly continuous or discrete random variables. Then X₁, X₂, · · · , Xₙ are independent if and only if their joint probability density (or mass) function f(x₁, · · · , xₙ) can be written as

f(x₁, · · · , xₙ) = Π_{i=1}^n g_i(x_i)

for nonnegative functions g_i(x), i = 1, · · · , n.

Proof: The case of jointly discrete random variables is straightforward; we consider the jointly continuous case.
Suppose X₁, · · · , Xₙ are independent, and let f_{X_i}(t) be their respective probability density functions. Then

f(x₁, · · · , xₙ) = Π_{i=1}^n f_{X_i}(x_i),

which has the required form. Conversely, suppose f(x₁, · · · , xₙ) = Π_{i=1}^n g_i(x_i). Integrating out all variables except x_i gives

f_{X_i}(x_i) = g_i(x_i) · Π_{j≠i} ∫_{−∞}^∞ g_j(x_j) dx_j = C_i g_i(x_i).

Since

∫ · · · ∫ Π_{i=1}^n g_i(x_i) dx₁ · · · dxₙ = 1 = ∫ · · · ∫ Π_{i=1}^n f_{X_i}(x_i) dx₁ · · · dxₙ,

we conclude that the product of the C_i's is one; hence f(x₁, · · · , xₙ) = Π_{i=1}^n f_{X_i}(x_i) and X₁, · · · , Xₙ are independent. □

6.4 Sums of Independent Random Variables

Proposition 6.8 Let X and Y be independent integer-valued random variables. Then

P(X + Y = n) = Σ_{i+j=n} P(X = i)P(Y = j).

Proof:

P(X + Y = n) = Σ_i P(X = i, Y = n − i) = Σ_i P(X = i)P(Y = n − i) = Σ_{i+j=n} P(X = i)P(Y = j). □

Corollary 6.8.1 Let X and Y be independent integer-valued random variables. Let g(z) = Σ_{i=−∞}^∞ p_X(i) zⁱ and h(z) = Σ_{j=−∞}^∞ p_Y(j) zʲ. Then

g(z)h(z) = Σ_{n=−∞}^∞ p_{X+Y}(n) zⁿ.

Proof:

g(z)h(z) = (Σ_{i=−∞}^∞ p_X(i) zⁱ)(Σ_{j=−∞}^∞ p_Y(j) zʲ)
         = Σ_{n=−∞}^∞ (Σ_{i+j=n} p_X(i) p_Y(j)) zⁿ
         = Σ_{n=−∞}^∞ p_{X+Y}(n) zⁿ. □

Proposition 6.9 Let X and Y be independent continuous random variables. Then

F_{X+Y}(a) = ∫_{−∞}^∞ F_X(a − y) f_Y(y) dy = ∫_{−∞}^∞ F_X(a − y) dF_Y(y).

That is, the cumulative distribution function of X + Y is the convolution of F_X and F_Y: F_{X+Y} = F_X ∗ F_Y.

In addition,

f_{X+Y}(a) = ∫_{−∞}^∞ f_X(a − y) f_Y(y) dy.

Proof:

F_{X+Y}(a) = P(X + Y ≤ a) = ∫∫_{x+y≤a} f_{X,Y}(x, y) dx dy
           = ∫_{−∞}^∞ ∫_{−∞}^{a−y} f_X(x)f_Y(y) dx dy   (by independence)
           = ∫_{−∞}^∞ F_X(a − y) f_Y(y) dy
           = ∫_{−∞}^∞ F_X(a − y) dF_Y(y).

Then differentiating with respect to a gives the probability density function of X + Y:

f_{X+Y}(a) = ∫_{−∞}^∞ f_X(a − y) f_Y(y) dy. □

6.4.1 Sum of Binomial

Theorem 6.10 Let X1 , X2 , · · · , Xk be independent binomial random variables with parameters (n1 , p), (n2 , p),
· · · , (nk , p) respectively, then X1 +X2 +· · ·+Xk is a binomial random variable with parameter (n1 +n2 +· · ·+nk , p).

Proof: First show that the sum of an independent binomial random variable with parameters (m, p) and a Bernoulli random variable with parameter p is a binomial random variable with parameters (m + 1, p). Then the sum of independent binomial random variables with parameters (m₁, p) and (m₂, p) is binomial with parameters (m₁ + m₂, p), and induction gives the desired result. □

6.4.2 Sum of Poisson

Theorem 6.11 If X1 , X2 , · · · , Xr are independent Poisson random variable with parameters λ1 , · · · , λr , then X1 +
· · · + Xr is Poisson with parameters λ1 + · · · + λr .

Proof: Prove using induction. □

6.4.3 Sum of Uniform

Proposition 6.12 Let X and Y be independent uniform random variables on (0, 1). Then

f_{X+Y}(a) = a if 0 ≤ a ≤ 1; f_{X+Y}(a) = 2 − a if 1 < a < 2; and f_{X+Y}(a) = 0 otherwise.

Proof: Recall f_{X+Y}(a) = ∫_{−∞}^∞ f_X(a − y) f_Y(y) dy; then consider cases based on the value of a. □

Proposition 6.13 Let X₁, X₂, · · · , Xₙ, · · · be independent uniform random variables on (0, 1). Let Fₙ be the cumulative distribution function of X₁ + · · · + Xₙ. Then

Fₙ(x) = P(X₁ + · · · + Xₙ ≤ x) = xⁿ/n!

for 0 ≤ x ≤ 1.

Proof: We proceed by induction. For n = 1 it is clear that F₁(x) = P(X₁ ≤ x) = x, so the statement holds. Assume the statement holds for some n ∈ ℕ; we consider the case n + 1. For 0 ≤ x ≤ 1, the convolution formula gives

F_{n+1}(x) = ∫₀^x Fₙ(x − t) · 1 dt = ∫₀^x (x − t)ⁿ/n! dt = x^{n+1}/(n + 1)!.

Hence Fₙ(x) = xⁿ/n! for all n by induction. □

Corollary 6.13.1 Let N be the minimum integer n such that X1 + X2 + · · · + Xn > 1. Then E[N ] = e, i.e., the
expected value of the number of independent uniform (0, 1) random variable that must be summed for the sum to
exceed 1 is e.
In addition, the expected value of the number of independent uniform (0, 1) random variables that must be summed
for the sum to exceed x (0 ≤ x ≤ 1) is ex .

Proof: N > n ⇔ X₁ + · · · + Xₙ ≤ 1, so P(N > n) = Fₙ(1) = 1/n!. Thus

E[N] = Σ_{i=1}^∞ P(N ≥ i) = Σ_{n=0}^∞ P(N > n) = Σ_{n=0}^∞ 1/n! = e.

Similarly we obtain the second conclusion. □
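The fact that E[N] = e is fun to confirm by simulation (a Python/NumPy sketch; the number of trials is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)
trials = 100_000
counts = np.empty(trials)
for k in range(trials):
    total, n = 0.0, 0
    while total <= 1.0:        # keep summing uniforms until the sum exceeds 1
        total += rng.random()
        n += 1
    counts[k] = n
print(counts.mean())           # should be close to e ~ 2.71828
```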

6.4.4 Sum of Gamma


Recall that an (α, λ)-gamma random variable has probability density function f(x) = λe^{−λx}(λx)^{α−1}/Γ(α), x > 0.

Proposition 6.14 Let X and Y be independent gamma random variables with parameters (α, λ) and (β, λ). Then

f_{X+Y}(a) = (B(α, β)/(Γ(α)Γ(β))) λe^{−λa} (λa)^{α+β−1}, a > 0.

Proof: By the convolution formula,

f_{X+Y}(a) = ∫₀^a (λe^{−λ(a−y)}[λ(a − y)]^{α−1} · λe^{−λy}(λy)^{β−1})/(Γ(α)Γ(β)) dy
           = (λ^{α+β}/(Γ(α)Γ(β))) e^{−λa} ∫₀^a (a − y)^{α−1} y^{β−1} dy   (let y = ax)
           = (λ^{α+β}/(Γ(α)Γ(β))) e^{−λa} a^{α+β−1} ∫₀^1 (1 − x)^{α−1} x^{β−1} dx
           = (B(α, β)/(Γ(α)Γ(β))) λe^{−λa} (λa)^{α+β−1}, a > 0. □

Let Z be a gamma random variable with parameters (α + β, λ); it has probability density function

f_Z(x) = λe^{−λx}(λx)^{α+β−1}/Γ(α + β), x > 0.

Since f_{X+Y} is proportional to f_Z and both integrate to 1, we have f_{X+Y} = f_Z (so X + Y is gamma with parameters (α + β, λ)), and comparing the constants we conclude that

B(α, β) = Γ(α)Γ(β)/Γ(α + β).

Proposition 6.15 Let Z be a standard normal random variable and Y = Z². Then

f_Y(y) = ((1/2) e^{−y/2} ((1/2)y)^{1/2−1})/√π, y > 0.

That is, Y = Z² is a gamma random variable with parameters (1/2, 1/2), and Γ(1/2) = √π.

Proof: For small ϵ > 0,

ϵ f_Y(y) ≈ P(y ≤ Y ≤ y + ϵ) = 2P(√y ≤ Z ≤ √(y + ϵ)) ≈ 2P(√y ≤ Z ≤ √y + ϵ/(2√y)) ≈ 2f_Z(√y) · ϵ/(2√y).

Dividing both sides by ϵ, we get

f_Y(y) = f_Z(√y)/√y = (1/√(2πy)) e^{−y/2} = ((1/2) e^{−y/2} ((1/2)y)^{1/2−1})/√π, y > 0.

So f_Y is a gamma density with parameters (1/2, 1/2). Then by comparing the normalizing constants, we get Γ(1/2) = √π. □

Definition: let Z₁, Z₂, · · · , Zₙ be independent standard normal random variables. Then X²ₙ = Z₁² + · · · + Zₙ² is a gamma random variable with parameters (n/2, 1/2). This is called the chi-squared random variable with n degrees of freedom.
Note that X²ₙ is a gamma random variable as it is a sum of n gamma random variables with common rate 1/2; by induction, X²ₙ has parameters (n/2, 1/2).


6.4.5 Sum of Normals

Proposition 6.16 Let X and Y be independent normal random variables with parameters (µ1 , σ12 ) and (µ2 , σ22 )
respectively. Then X + Y is a normal random variable with parameters (µ1 + µ2 , σ12 + σ22 ).

Proof: We first prove the case where S and Z are independent normal random variables, S with parameters (0, σ²) and Z with parameters (0, 1), i.e., Z is standard normal. Then S + Z is normal with parameters (0, 1 + σ²):

f_{S+Z}(a) = ∫_{−∞}^∞ f_S(a − z) f_Z(z) dz
           = ∫_{−∞}^∞ (1/(2πσ)) e^{−(a−z)²/(2σ²)} e^{−z²/2} dz
           = C exp(−a²/(2(1 + σ²))),

where C is a constant that does not depend on a. Since f_{S+Z} is a density, C must be the appropriate normal normalizing constant, so S + Z is normal with mean 0 and variance 1 + σ².
Now for the general case, suppose X, Y are normal with parameters (µ₁, σ₁²) and (µ₂, σ₂²) respectively. Then

X + Y = σ₂((X − µ₁)/σ₂ + (Y − µ₂)/σ₂) + µ₁ + µ₂,

where (X − µ₁)/σ₂ is normal with parameters (0, σ₁²/σ₂²) and (Y − µ₂)/σ₂ is standard normal and independent of it. Applying the result above, X + Y is normal with mean µ₁ + µ₂ and variance σ₁² + σ₂². □

Corollary 6.16.1 Suppose X_i, i = 1, 2, · · · , n, are independent random variables, normally distributed with parameters (µ_i, σ_i²) respectively. Then X₁ + X₂ + · · · + Xₙ is normally distributed with parameters (Σ_{i=1}^n µ_i, Σ_{i=1}^n σ_i²).

Proof: Apply induction. □

6.4.6 Sum of Exponential

Remark: an exponential random variable with parameter λ is a gamma random variable with parameters (1, λ). Hence, if X₁, X₂, · · · , Xₙ are independent identically distributed exponential random variables with parameter λ, then Y = X₁ + X₂ + · · · + Xₙ is a gamma random variable with parameters (n, λ), by Proposition 6.14 and induction.

6.5 Conditional Distribution


Definition: let X and Y be discrete random variables. The conditional probability mass function of X given Y = b is

p_{X|Y}(x|b) = p_{X,Y}(x, b)/p_Y(b),

provided that p_Y(b) > 0.
Note that if X and Y are independent, then pX|Y (x|b) = pX (x).

Definition: let X and Y be jointly continuous random variables. The conditional probability density function of X given Y = y is

f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y),

if f_Y(y) > 0.
Note that if X and Y are independent, then f_{X|Y}(x|y) = f_X(x).

Proposition 6.17 Let X and Y be discrete random variables such that p_Y(y) > 0. Then

F_{X|Y}(x|y) = Σ_{a≤x} p_{X|Y}(a|y).

Proof: This is the case because

F_{X|Y}(x|y) = P(X ≤ x|Y = y) = Σ_{a≤x} P(X = a|Y = y) = Σ_{a≤x} p_{X|Y}(a|y). □

Proposition 6.18 Let X and Y be jointly continuous random variables. Then

P(X ∈ A|Y = y) = ∫_A f_{X|Y}(x|y) dx.

In particular, the conditional cumulative distribution function of X given Y = y is

F_{X|Y}(a|y) = P(X ≤ a|Y = y) = ∫_{−∞}^a f_{X|Y}(x|y) dx.

Proof: This is clear from the definition of conditional probability. □

Proposition 6.19 Let Y and Z be independent, where Y is chi-squared with 1 degree of freedom and Z is standard normal. Let T = Z/√Y. Then the joint probability density of T and Y is

f_{T,Y}(t, y) = f_Y(y) f_{T|Y}(t|y) = (1/(2π)) e^{−y(t²+1)/2}, y > 0.

The probability density function of T is

f_T(t) = ∫₀^∞ (1/(2π)) e^{−y(t²+1)/2} dy = 1/(π(t² + 1)).

Proof: Since Y and Z are independent, the conditional distribution of T given Y = y is the distribution of (1/√y)Z, which is normal with mean 0 and variance 1/y. Hence, the conditional density of T given Y = y is

f_{T|Y}(t|y) = (1/√(2π/y)) e^{−t²y/2}, −∞ < t < ∞.

Multiplying by f_Y(y) = (1/√(2πy)) e^{−y/2} gives the joint density, and integrating out y gives f_T. □

Definition: let Y and Z be independent, where Y is chi-squared with n degrees of freedom and Z is standard normal. Then T = Z/√(Y/n) has a t-distribution with n degrees of freedom.

Lemma 6.20 Let X be a continuous and N a discrete random variable. Then

P(a ≤ X ≤ a + ϵ | N = k) ≈ P(N = k | X = a) f(a) ϵ / P(N = k).

Proof:

P(a ≤ X ≤ a + ϵ | N = k) = P(a ≤ X ≤ a + ϵ, N = k)/P(N = k)
                          = P(N = k | a ≤ X ≤ a + ϵ) P(a ≤ X ≤ a + ϵ)/P(N = k)
                          ≈ P(N = k | X = a) f(a) ϵ / P(N = k). □

Definition: let X be a continuous and N a discrete random variable. Then the conditional probability density function of X given that N = k is

f_{X|N}(x|k) = (P(N = k|X = x)/P(N = k)) f_X(x).

Definition: let X be continuous and N a discrete random variable. Then the conditional probability mass function of N given that X = x is

p_{N|X}(n|x) = (f_{X|N}(x|n)/f_X(x)) P(N = n).
Definition: let X be a discrete random variable. For any A ⊆ ℝ, the conditional probability mass function of X given X ∈ A is defined to be

P(X = a | X ∈ A) = P(X = a)/P(X ∈ A) if a ∈ A, and 0 if a ∉ A.

Definition: let X be a continuous random variable. For A ⊆ ℝ, the conditional probability density function of X given X ∈ A is defined to be

f_{X|X∈A}(x) = f_X(x)/∫_A f_X(x) dx for x ∈ A.

6.6 Joint Distribution of Functions


Definition: let X₁ and X₂ be jointly continuous random variables. Suppose Y₁ = g₁(X₁, X₂) and Y₂ = g₂(X₁, X₂), such that the transformation (x₁, x₂) ↦ (y₁, y₂) is one-to-one. We define the Jacobian of (x₁, x₂) ↦ (y₁, y₂) to be the determinant

J(x₁, x₂) = (∂y₁/∂x₁)(∂y₂/∂x₂) − (∂y₁/∂x₂)(∂y₂/∂x₁),

and we assume J(x₁, x₂) is nonzero and continuous at all (x₁, x₂).

Theorem 6.21 The joint probability density function of Y₁ = g₁(X₁, X₂), Y₂ = g₂(X₁, X₂) in this case is given by

f_{Y₁,Y₂}(y₁, y₂) = f_{X₁,X₂}(x₁, x₂) |J(x₁, x₂)|^{−1},

where g₁(x₁, x₂) = y₁ and g₂(x₁, x₂) = y₂.

Proof: This is done by the change of coordinates. □

Similarly, using the n-dimensional Jacobian determinant, we can find the joint density of Y₁, Y₂, · · · , Yₙ from that of X₁, X₂, · · · , Xₙ, if Y_i = g_i(X₁, · · · , Xₙ).
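A concrete instance (a Python/NumPy sketch, not from the notes): take independent standard normals X₁, X₂ and Y₁ = X₁ + X₂, Y₂ = X₁ − X₂, so J = −2 and the transformed density factors into two independent N(0, 2) densities. The simulation below checks the predicted variances and the vanishing correlation:

```python
import numpy as np

rng = np.random.default_rng(6)
x1 = rng.standard_normal(1_000_000)
x2 = rng.standard_normal(1_000_000)
y1, y2 = x1 + x2, x1 - x2

# Theorem 6.21 predicts Y1, Y2 are independent, each N(0, 2).
print(y1.var(), y2.var())          # both near 2
print(np.corrcoef(y1, y2)[0, 1])   # near 0
```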

7 Expectation of Random Variables
7.1 Extra Properties of Expectation

Proposition 7.1 Let X be a nonnegative integer-valued random variable. Then

E[X] = Σ_{i=1}^∞ P(X ≥ i).

Proof: Note that for any nonnegative values a₁, a₂, · · · ,

Σ_{j=1}^∞ (a₁ + · · · + a_j) P(X = j) = Σ_{i=1}^∞ a_i P(X ≥ i).

Taking a₁ = a₂ = · · · = 1 gives E[X] = Σ_j j P(X = j) = Σ_i P(X ≥ i), the desired result. □

Proposition 7.2 Let X be a random variable, if a ≤ X ≤ b, then a ≤ E[X] ≤ b.

Proof: Suppose Y ≥ 0, then it is clear that E[Y ] ≥ 0 (consider discrete and continuous cases), similarly if Y ≤ 0,
then E[Y ] ≤ 0. Now since X ≥ a, then X − a ≥ 0, so

0 ≤ E[X − a] = E[X] − a =⇒ E[X] ≥ a.

Similarly, we have E[X] ≤ b. □

Corollary 7.2.1 Let X be a random variable. If P (a ≤ X ≤ b) = 1, then a ≤ E[X] ≤ b.

Proof: Since P (a ≤ x ≤ b) = 1, then consider two cases, X discrete and X continuous.


If X is discrete, then it is clear that a ≤ X ≤ b, hence a ≤ E[X] ≤ b holds.
Rb
If X is continuous, then a fX (x)dx = 1. Then consider E[X − a], and E[b − X], we can easily get that they are
both greater than or equal to zero, hence a ≤ E[X] ≤ b.
A more rigorous proof requires the understanding of measure theory. □

Proposition 7.3 Suppose X is a random variable such that P (0 ≤ X ≤ c) = 1 and only takes values between 0
and c. Then
c2
Var(X) ≤ .
4

Proof: Since X only takes values between 0 and c, we have 0 ≤ E[X] ≤ c and X² ≤ cX, so E[X²] ≤ cE[X]. Writing m = E[X],

Var(X) = E[X²] − (E[X])² ≤ cm − m² = c²/4 − (m − c/2)² ≤ c²/4. □

Theorem 7.4 Let X and Y be discrete random variables with joint mass function p(x, y). For any function g(x, y),

E[g(X, Y)] = Σ_x Σ_y g(x, y)p(x, y).

Proof: By definition, we have

E[g(X, Y)] = Σ_{(x,y)} g(x, y)p(x, y) = Σ_x Σ_y g(x, y)p(x, y). □

Theorem 7.5 Let X, Y be jointly continuous random variables with joint probability density function f(x, y). For any function g(x, y),

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y)f(x, y) dx dy.

Proof: First consider the case g(X, Y) ≥ 0. For any nonnegative random variable W we have E[W] = ∫₀^∞ P(W > t) dt, so

E[g(X, Y)] = ∫₀^∞ P(g(X, Y) > t) dt
           = ∫₀^∞ ∫∫_{(x,y): g(x,y)>t} f(x, y) dy dx dt
           = ∫_x ∫_y ∫₀^{g(x,y)} f(x, y) dt dy dx   (by change of order of integration)
           = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y)f(x, y) dy dx.

The case of general g(X, Y) is similar; we deal with it by splitting g(X, Y) into its positive and negative parts g⁺(X, Y) and g⁻(X, Y). □

Proposition 7.6 Let X and Y be jointly continuous random variables with joint probability density function f(x, y). Then

∫_{−∞}^∞ ∫_{−∞}^∞ x f(x, y) dy dx = ∫_{−∞}^∞ x f_X(x) dx = E[X],

∫_{−∞}^∞ ∫_{−∞}^∞ y f(x, y) dx dy = ∫_{−∞}^∞ y f_Y(y) dy = E[Y],

E[X + Y] = ∫_{−∞}^∞ ∫_{−∞}^∞ (x + y) f(x, y) dx dy = E[X] + E[Y].

Proof: This is clear from the definition of expected value and of jointly continuous random variables. □

Let X be a random variable having finite expectation µ and variance σ², and let g(·) be a twice differentiable function. Then

E[g(X)] ≈ g(µ) + (g″(µ)/2) σ².

Proof: By Taylor's theorem, expanding about µ,

g(X) ≈ g(µ) + g′(µ)(X − µ) + g″(µ)(X − µ)²/2.

Hence

E[g(X)] ≈ E[g(µ) + g′(µ)(X − µ) + g″(µ)(X − µ)²/2] = g(µ) + 0 + (g″(µ)/2) σ². □

7.2 Sum of Random Variables

Theorem 7.7 Let X and Y be random variables. Then E[X + Y ] = E[X] + E[Y ] if E[X] and E[Y ] are finite. In
general, if X1 , X2 , · · · , Xn are random variables, then

E[X₁ + X₂ + · · · + Xₙ] = E[X₁] + E[X₂] + · · · + E[Xₙ].

Proof: Using Induction. We know that E[X + Y ] = E[X] + E[Y ] for both the case of discrete and continuous
random variables. □

Corollary 7.7.1 Let X and Y be random variables, then E[X − Y ] = E[X] − E[Y ].

Proof: Since E[Y ] + E[X − Y ] = E[X]. □

Corollary 7.7.2 Suppose X and Y are random variables such that X ≥ Y , then E[X] ≥ E[Y ].

Proof: Since X − Y ≥ 0, then E[X − Y ] ≥ 0, Hence E[X] − E[Y ] = E[X − Y ] ≥ 0. Therefore, E[X] ≥ E[Y ]. □

Definition: suppose X₁, X₂, · · · , Xₙ are random variables. Then X̄ = (1/n) Σ_{i=1}^n X_i is called the sample mean.

Lemma 7.8 (Boole’s Inequality) P (A1 ) + · · · + P (An ) ≥ P (A1 ∪ · · · ∪ An ).

Proof: This follows from inclusion-exclusion principle. □

Proposition 7.9 Suppose A₁, A₂, · · · , Aₙ are events and X₁, X₂, · · · , Xₙ are their respective indicator variables. Let Y = 1 − Π_{i=1}^n (1 − X_i). Then

P(A₁ ∪ · · · ∪ Aₙ) = E[Y] = Σ_{i=1}^n P(A_i) − Σ_{1≤i<j≤n} P(A_iA_j) + · · · − (−1)ⁿ P(A₁A₂ · · · Aₙ).

Proof: Note that P (A1 ∪ · · · ∪ An ) = E[Y ]. And

E[Y ] = 1 − E[(1 − X1 )(1 − X2 ) · · · (1 − Xn )]

Expand and we get the desired result. □

Theorem 7.10 Let X₁, X₂, · · · be a sequence of random variables. Suppose one of the following holds:

• X_i is nonnegative for i = 1, 2, · · · ;

• Σ_{i=1}^∞ E[|X_i|] < ∞.

Then E[Σ_{i=1}^∞ X_i] = Σ_{i=1}^∞ E[X_i].

Proof: The first case is justified by the σ-additivity of the underlying measure (monotone convergence); the second by absolute convergence. □

7.3 Moments of Number of Events


Definition: let X be the number of events among A₁, A₂, · · · , Aₙ that occur, and let X_i be the indicator variable of A_i. The number of pairs A_iA_j (i < j) that occur is given by (X choose 2). In general, (X choose k) gives the number of k-tuples A_{i₁}A_{i₂} · · · A_{i_k} that occur.

Lemma 7.11

(X choose k) = Σ_{i₁<i₂<···<i_k} X_{i₁} X_{i₂} · · · X_{i_k},

E[(X choose k)] = Σ_{i₁<i₂<···<i_k} P(A_{i₁} A_{i₂} · · · A_{i_k}).

Proof: This is quite clear from the definition of moments of number of events. □

Corollary 7.11.1

E[X(X − 1) · · · (X − k + 1)/k!] = Σ_{i₁<i₂<···<i_k} P(A_{i₁} A_{i₂} · · · A_{i_k}).

Proof: This is true since

(X choose k) = X(X − 1) · · · (X − k + 1)/k!. □

7.4 Covariance and Correlations

Proposition 7.12 Suppose X and Y are independent, then, for any functions h and g,

E[g(X)h(Y )] = E[g(X)]E[h(Y )].

Proof: Suppose X and Y are discrete and independent. Then

E[g(X)h(Y)] = Σ_{(x,y)} g(x)h(y) p_{X,Y}(x, y)
            = Σ_{(x,y)} g(x)h(y) p_X(x)p_Y(y)
            = (Σ_x g(x)p_X(x)) (Σ_y h(y)p_Y(y))
            = E[g(X)]E[h(Y)].

Suppose X and Y are jointly continuous with joint density f(x, y) and independent. Then

E[g(X)h(Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x)h(y)f(x, y) dx dy
            = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x)h(y)f_X(x)f_Y(y) dx dy
            = ∫_{−∞}^∞ g(x)f_X(x) dx · ∫_{−∞}^∞ h(y)f_Y(y) dy
            = E[g(X)]E[h(Y)]. □

Lemma 7.13 Suppose X and Y are independent identical random variables with variance σ 2 , then

E[(X − Y )2 ] = 2σ 2 .

Proof: Since X and Y are independent and identically distributed,

E[(X − Y)²] = E[X² − 2XY + Y²] = 2E[X²] − 2E[X]E[Y] = 2E[X²] − 2(E[X])² = 2Var(X) = 2σ². □

Definition: let X and Y be random variables. The covariance is defined by

Cov(X, Y ) = E[XY ] − E[X]E[Y ] = E[(X − E[X])(Y − E[Y ])].

If X and Y are independent, then Cov(X, Y ) = 0. If Cov(X, Y ) = 0, X and Y may not be independent.

Proposition 7.14 The following are true about Covariance:

1. Cov(X, X) = Var(X).

2. Cov(X, Y ) = Cov(Y, X).

3. Cov(aX, Y ) = aCov(X, Y ).

4. Cov(X1 + X2 , Y ) = Cov(X1 , Y ) + Cov(X2 , Y ).

Proof:

1. Cov(X, X) = E[X²] − (E[X])² = Var(X).

2. This is clear from the definition.

3. Cov(aX, Y) = E[aXY] − E[aX]E[Y] = a(E[XY] − E[X]E[Y]) = aCov(X, Y).

4. Cov(X₁ + X₂, Y) = E[(X₁ + X₂)Y] − E[X₁ + X₂]E[Y]
                   = E[X₁Y] + E[X₂Y] − E[X₁]E[Y] − E[X₂]E[Y]
                   = Cov(X₁, Y) + Cov(X₂, Y). □

Corollary 7.14.1 Suppose X₁, · · · , Xₙ and Y₁, · · · , Y_m are random variables. Then

Cov(Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j) = Σ_{i=1}^n Σ_{j=1}^m Cov(X_i, Y_j).

Proof: Induction. □

Remark: Cov(·, ·) is an inner product on real random variables. The induced norm on this space is σ_X = √Var(X).

Theorem 7.15 Let X₁, X₂, · · · , Xₙ be random variables. Then

Var(Σ_{i=1}^n X_i) = Σ_{i=1}^n Var(X_i) + 2 Σ_{i<j} Cov(X_i, X_j).

In particular, if X₁, X₂, · · · , Xₙ are independent, then

Var(Σ_{i=1}^n X_i) = Σ_{i=1}^n Var(X_i).

Proof: If X₁, · · · , Xₙ are independent, then all the covariances are 0, giving the second formula. For the first formula:

Var(X₁ + X₂ + · · · + Xₙ) = Cov(X₁ + · · · + Xₙ, X₁ + · · · + Xₙ)
                          = Σ_{i=1}^n Cov(X_i, X₁ + · · · + Xₙ)
                          = Σ_{i=1}^n Σ_{j=1}^n Cov(X_i, X_j)
                          = Σ_{i=1}^n Var(X_i) + 2 Σ_{i<j} Cov(X_i, X_j). □

Definition: let X₁, · · · , Xₙ be independent and identically distributed random variables with mean µ and variance σ². We define X̄ = (1/n) Σ_{i=1}^n X_i to be the sample mean.
Definition: we define X_i − X̄ to be the deviation, i = 1, · · · , n.
Definition: we define S² = Σ_{i=1}^n (X_i − X̄)²/(n − 1) to be the sample variance.

Proposition 7.16

1. E[X̄] = (1/n) Σ_{i=1}^n E[X_i] = µ.

2. Var(X̄) = (1/n²) Σ_{i=1}^n Var(X_i) = σ²/n.

3. E[(X_i − X̄)²] = σ²(1 − 1/n).

4. E[S²] = (n/(n − 1)) σ²(1 − 1/n) = σ².

Proof: This can be derived directly from the definitions; for 3, note E[(X_i − X̄)²] = Var(X_i − X̄) = Var(X_i) + Var(X̄) − 2Cov(X_i, X̄) = σ² + σ²/n − 2σ²/n. □
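The unbiasedness of S² is visible in simulation (a Python/NumPy sketch; the normal distribution, n = 10, and the number of repetitions are arbitrary choices; ddof=1 selects the 1/(n − 1) normalization):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 1.0, 2.0, 10, 200_000
samples = rng.normal(mu, sigma, size=(reps, n))

s2 = samples.var(axis=1, ddof=1)   # sample variance S^2
print(s2.mean(), sigma**2)         # E[S^2] = sigma^2
xbar = samples.mean(axis=1)
print(xbar.var(), sigma**2 / n)    # Var(Xbar) = sigma^2 / n
```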

Definition: let X and Y be random variables with positive variances. The correlation of X and Y is defined to be

ρ(X, Y) = Cov(X, Y)/√(Var(X)Var(Y)).
X and Y are uncorrelated if ρ(X, Y ) = 0.

Proposition 7.17 Let c > 0. Then ρ(cX, Y) = ρ(X, Y) and ρ(−cX, Y) = −ρ(X, Y).

Let X and Y be random variables with variances σ_X² and σ_Y² respectively, and let X₁ = X/σ_X and Y₁ = Y/σ_Y. Then ρ(X, Y) = ρ(X₁, Y₁) = Cov(X₁, Y₁).

Proof: Suppose c > 0. Then

ρ(cX, Y) = Cov(cX, Y)/√(Var(cX)Var(Y)) = cCov(X, Y)/(c√(Var(X)Var(Y))) = ρ(X, Y).

The second statement is shown similarly.

Since σ_X, σ_Y > 0, applying the first part with c = 1/σ_X and c = 1/σ_Y gives ρ(X, Y) = ρ(X₁, Y₁). Furthermore, Var(X₁) = Var(Y₁) = 1, hence we have the last equality. □

Proposition 7.18 Let X and Y be random variables with positive variances. Then −1 ≤ ρ(X, Y ) ≤ 1.

Proof: Replace X and Y by X/σX and Y /σY if necessary. We may assume that Var(X) = Var(Y ) = 1. Then

Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y )


= 2 + 2ρ(X, Y ) ≥ 0,
Var(X − Y ) = Var(X) + Var(Y ) − 2Cov(X, Y )
= 2 − 2ρ(X, Y ) ≥ 0.

Hence we have |ρ(X, Y )| ≤ 1. □

Remarks: Var(X) = 0 if and only if X is a constant with probability 1.

• ρ(X, Y) = 1 ⇔ Y = aX + b with probability 1 for some a > 0.

• ρ(X, Y) = −1 ⇔ Y = aX + b with probability 1 for some a < 0.

Proposition 7.19 Suppose X and Y are identical random variables that are not necessarily independent, then

Cov(X + Y, X − Y ) = 0.

Proof: Since X and Y are identical, then E[X n ] = E[Y n ] for n ∈ N, and Var(X) = Var(Y ). So

Cov(X + Y, X − Y ) = Cov(X, X) + Cov(X, −Y ) + Cov(Y, X) + Cov(Y, −Y )


= Var(X) − Var(Y )
=0

Lemma 7.20 Suppose Y = a + bX, b ≠ 0. Then ρ(X, Y) = 1 if b > 0, and ρ(X, Y) = −1 if b < 0.

Proof: Suppose b > 0. Then E[Y] = a + bE[X] and Var(Y) = b²Var(X), so √(Var(X)Var(Y)) = bVar(X). Then

ρ(X, Y) = Cov(X, Y)/√(Var(X)·Var(Y)) = Cov(X, a + bX)/(bVar(X)) = Cov(X, bX)/(bVar(X)) = bVar(X)/(bVar(X)) = 1.

The proof of the other case is similar. □

7.5 Multinomial and Multivariate Normal Distribution


Definition: consider m independent trials, each of which results in one of r possible outcomes with probabilities p₁, · · · , p_r, where Σ_{i=1}^r p_i = 1. Let N_i, i = 1, · · · , r, denote the number of the m trials that result in outcome i. Then N₁, N₂, · · · , N_r have the multinomial distribution, which is given by

P(N₁ = n₁, · · · , N_r = n_r) = (m!/(n₁! · · · n_r!)) p₁^{n₁} · · · p_r^{n_r},  Σ_{i=1}^r n_i = m.

Proposition 7.21 Suppose the conditions in the definition holds, then the following are true:

1. for each i ∈ {1, · · · , r}, Ni is a binomial random variable with parameters (m, pi ).

2. for i ̸= j, Ni + Nj is binomial with parameters (m, pi + pj ).

3. If i ̸= j, then Cov(Ni , Nj ) = −mpi pj .

Proof: (1) and (2) are by direct computation. For (3), since N_i + N_j is binomial with parameters (m, p_i + p_j), we know its variance, and since Var(N_i + N_j) = Var(N_i) + Var(N_j) + 2Cov(N_i, N_j), we can solve for the covariance. □
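The negative covariance is easy to observe empirically (a Python/NumPy sketch; m = 30, the probability vector, and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)
m, p = 30, [0.2, 0.5, 0.3]
counts = rng.multinomial(m, p, size=500_000)

n1, n2 = counts[:, 0], counts[:, 1]
print(np.cov(n1, n2)[0, 1])    # empirical Cov(N1, N2)
print(-m * p[0] * p[1])        # theoretical -m*p1*p2 = -3.0
```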

Definition: let Z1 , · · · , Zn be independent standard normal random variables. If for some constants aij , 1 ≤ i ≤ m,
1 ≤ j ≤ n, and µi , 1 ≤ i ≤ m,

X1 = a11 Z1 + · · · + a1n Zn + µ1 ,
X2 = a21 Z1 + · · · + a2n Zn + µ2 ,
..
.
Xₘ = a_{m1} Z₁ + · · · + a_{mn} Zₙ + µₘ,

then the random variables X1 , X2 , · · · , Xm are said to have a multivariate normal distribution.

Proposition 7.22 Suppose the conditions in the definition hold. Then the following are true:

1. E[X_i] = µ_i;

2. Var(X_i) = Σ_{j=1}^n a_{ij}²;

3. Cov(X_i, X_j) = Σ_{k=1}^n a_{ik} a_{jk}.
Proof: First note that a sum of independent normals is normal, so X_i is normal with parameters (µ_i, Σ_{j=1}^n a_{ij}²). Hence we have (1) and (2). Next, for i ≠ j,

Cov(X_i, X_j) = Cov(Σ_{k=1}^n a_{ik} Z_k, Σ_{l=1}^n a_{jl} Z_l) = Σ_{k=1}^n a_{ik} a_{jk} Cov(Z_k, Z_k) = Σ_{k=1}^n a_{ik} a_{jk},

since Cov(Z_k, Z_l) = 0 for k ≠ l and Cov(Z_k, Z_k) = 1. □

7.6 Conditional Expectation
We recall the definition of the conditional distribution: let X and Y be discrete random variables with joint mass function p(x, y), and suppose P(Y = y) > 0. Conditioned on Y = y, X is a random variable with mass function

p_{X|Y}(x|y) = P(X = x|Y = y) = p(x, y)/p_Y(y).

Definition: conditioned on Y = y, the expected value of X is

E[X|Y = y] = Σ_x x p_{X|Y}(x|y) = Σ_x x p(x, y)/p_Y(y).

Let X and Y be jointly continuous with probability density function f(x, y), and suppose f_Y(y) > 0. Conditioned on Y = y, X is a random variable with density function

f_{X|Y}(x|y) = f(x, y)/f_Y(y).

Definition: conditioned on Y = y, the expected value of X is

E[X|Y = y] = ∫_{−∞}^∞ x f(x, y)/f_Y(y) dx.

Lemma 7.23 Conditional expectation satisfies all properties of ordinary expectation:

• For the discrete case, E[g(X)|Y = y] = Σ_x g(x) p_{X|Y}(x|y).

• For the continuous case, E[g(X)|Y = y] = ∫_{−∞}^∞ g(x) f_{X|Y}(x|y) dx.

• Expectation is linear:

  – E[aX|Y = y] = aE[X|Y = y];

  – E[X₁ + X₂|Y = y] = E[X₁|Y = y] + E[X₂|Y = y].

Proof: This follows from the fact that a conditional probability is a probability. □

Let X and Y be joint random variables. Then E[X|Y = y] is a real-valued function of y, so E[X|Y] is a function of Y; hence it is itself a random variable.

Theorem 7.24 Let Z = E[X|Y ] be a random variable, then E[Z] = E[E[X|Y ]] = E[X].

Proof: For the discrete case:

E[Z] = Σ_y E[X|Y = y] P(Y = y)
     = Σ_y Σ_x x (P(X = x, Y = y)/P(Y = y)) P(Y = y)
     = Σ_x x (Σ_y P(X = x, Y = y))
     = Σ_x x P(X = x) = E[X].

For the continuous case:

E[Z] = ∫_{−∞}^∞ E[X|Y = y] f_Y(y) dy
     = ∫_{−∞}^∞ ∫_{−∞}^∞ x (f(x, y)/f_Y(y)) f_Y(y) dx dy
     = ∫_{−∞}^∞ x ∫_{−∞}^∞ f(x, y) dy dx
     = ∫_{−∞}^∞ x f_X(x) dx = E[X]. □
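The tower property can be checked with a two-stage simulation (a Python/NumPy sketch, anticipating the setup of Proposition 7.26 below; the Exponential(1) parameter and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(9)
# Y ~ Exponential(1); given Y = y, X ~ Poisson(y). Then E[X|Y] = Y.
y = rng.exponential(1.0, size=1_000_000)
x = rng.poisson(y)

print(x.mean())   # E[X]
print(y.mean())   # E[E[X|Y]] = E[Y]; both should be near 1
```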

Proposition 7.25 (Law of Total Probability for Continuous Case) Suppose X is a discrete random variable and Y a continuous random variable. Then

P(X = i) = ∫_{−∞}^∞ P(X = i|Y = y) f_Y(y) dy.

Proof: Define the indicator random variable

Z = 1 if X = i, and Z = 0 otherwise.

Then

P(X = i) = P(Z = 1) = E[Z] = E[E[Z|Y]] = ∫_{−∞}^∞ E[Z|Y = y] f_Y(y) dy = ∫_{−∞}^∞ P(X = i|Y = y) f_Y(y) dy. □

Proposition 7.26 Let N be Poisson whose parameter is generated by an exponential random variable X (i.e., given X = λ, N is Poisson with parameter λ). Then

• E[N|X = λ] = λ;

• E[N|X] = X;

• E[N] = E[E[N|X]] = E[X].

Proof: Direct verification from the definition works. □

Lemma 7.27 Suppose X and Y are independent, then

E[X|Y = y] = E[X] for all y.

Proof: We split into two cases: both discrete and both continuous.
Suppose X and Y are both discrete. Then P(X = x|Y = y) = P(X = x) for any x; hence

E[X|Y = y] = E[X] for all y.

On the other hand, if both X and Y are continuous, then f_{X|Y}(x|y) = f_X(x) for any x; hence

E[X|Y = y] = E[X] for all y. □
Lemma 7.28 Suppose g is a function and X and Y are random variables, then E[g(X)Y |X] = g(X)E[Y |X].

Proof: Let X = x, where x is arbitrary, then

E[g(X)Y |X = x] = g(x)E[Y |X = x],

hence E[g(X)Y |X] = g(X)E[Y |X]. □

7.7 Conditional Variance


Let X and Y be random variables. We can consider the distribution of X given that Y = y, and its variance

Var(X|Y = y) = E[(X − E[X|Y = y])² | Y = y].

Definition: the conditional variance is defined to be

Var(X|Y ) = E[(X − E[X|Y ])2 |Y ].

Lemma 7.29

• Var(X|Y) = E[X²|Y] − (E[X|Y])².

• Var(E[X|Y]) = E[(E[X|Y])²] − (E[E[X|Y]])².

• E[Var(X|Y)] = E[X²] − E[(E[X|Y])²].

Proof: Since Var(Z) = E[Z 2 ]−(E[Z])2 , then Var(X|Y ) = E[X 2 |Y ]−(E[X|Y ])2 and Var(E[X|Y ]) = E[(E[X|Y ])2 ]−
(E[E[X|Y ]])2 .
Hence

E[Var(X|Y )] = E[E[X 2 |Y ]] − E[(E[X|Y ])2 ]


= E[X 2 ] − E[(E[X|Y ])2 ]

Corollary 7.29.1 Var(X) = E[Var(X|Y )] + Var(E[X|Y ]).

Proof: From the lemma, we know that

E[Var(X|Y)] + Var(E[X|Y]) = E[X²] − E[(E[X|Y])²] + E[(E[X|Y])²] − (E[E[X|Y]])²
                          = E[X²] − (E[X])²
                          = Var(X). □
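The variance decomposition can also be checked with the Poisson-with-exponential-parameter example used earlier (a Python/NumPy sketch; here E[X|Y] = Var(X|Y) = Y, so both sides should be near E[Y] + Var(Y) = 2):

```python
import numpy as np

rng = np.random.default_rng(10)
y = rng.exponential(1.0, size=1_000_000)
x = rng.poisson(y)   # given Y, X is Poisson(Y): E[X|Y] = Var(X|Y) = Y

print(x.var())               # Var(X)
print(y.mean() + y.var())    # E[Var(X|Y)] + Var(E[X|Y]); both near 2
```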

7.8 Conditional Expectation and Predication
By prediction we mean the following: suppose the value of X is observed, and we want to predict the value of a second variable Y. We want a function g such that if X = x, then g(x) is the prediction for Y. The function g must be chosen so that g(X) is close to Y; i.e., we want to choose g to minimize E[(Y − g(X))²].

Lemma 7.30 The best predictor of Y based on X is g(x) = E[Y |X = x], i.e. g(X) = E[Y |X] is the best predictor
of Y . I.e., for all g(X),
E[(Y − g(X))2 ] ≥ E[(Y − E[Y |X])2 ].

Proof: We want to minimize E[(Y − g(X))²|X]. Now

E[(Y − g(X))²|X] = E[(Y − E[Y|X] + E[Y|X] − g(X))²|X]
                = E[(Y − E[Y|X])²|X] + E[(E[Y|X] − g(X))²|X] + 2E[(Y − E[Y|X])(E[Y|X] − g(X))|X].

Since, given X, E[Y|X] − g(X) is a function of X, it can be treated as a constant; thus

E[(Y − E[Y|X])(E[Y|X] − g(X))|X] = (E[Y|X] − g(X)) E[Y − E[Y|X] | X] = (E[Y|X] − g(X))(E[Y|X] − E[Y|X]) = 0.

Hence we have

E[(Y − g(X))²] ≥ E[(Y − E[Y|X])²],

with equality if g(X) = E[Y|X]. □

Lemma 7.31 The best constant predictor of Y is c = E[Y], and at this value E[(Y − c)²] = Var(Y).

Proof: We want to find c ∈ ℝ such that E[(Y − c)²] is minimal.

E[(Y − c)²] = c² − 2cE[Y] + E[Y²] = (c − E[Y])² + E[Y²] − (E[Y])² = (c − E[Y])² + Var(Y).

Hence the minimum is Var(Y), attained when c = E[Y]. □

Lemma 7.32 Suppose the joint distribution of X and Y is not completely known, but µ_X, µ_Y, σ_X, σ_Y and ρ = ρ(X, Y) are. We want constants a, b minimizing E[(Y − a − bX)²]. The best linear predictor of Y is µ_Y + ρ(σ_Y/σ_X)(X − µ_X), and the minimum of E[(Y − a − bX)²] is σ_Y²(1 − ρ²).

Proof: First consider the case µ_X = µ_Y = 0 and σ_X = σ_Y = 1, ρ = ρ(X, Y). Then

E[(Y − a − bX)²] = E[Y² + a² + b²X² − 2aY + 2abX − 2bXY] = 1 + a² + b² − 2bρ = a² + (b − ρ)² + 1 − ρ².

Hence E[(Y − a − bX)²] is minimal when a = 0 and b = ρ, so the best linear predictor of Y in this case is a + bX = ρX, and the minimum of E[(Y − a − bX)²] is 1 − ρ².
Next, suppose X, Y are arbitrary, with means µ_X, µ_Y and variances σ_X², σ_Y² respectively. Let

X₁ = (X − µ_X)/σ_X and Y₁ = (Y − µ_Y)/σ_Y.

Then

Y − a − bX = σ_Y (Y₁ − (a + bµ_X − µ_Y)/σ_Y − (bσ_X/σ_Y) X₁).

By the previous analysis, E[(Y − a − bX)²] is minimal when (a + bµ_X − µ_Y)/σ_Y = 0 and bσ_X/σ_Y = ρ, that is, b = ρσ_Y/σ_X and a = µ_Y − bµ_X.
The best linear predictor of Y is thus

µ_Y + ρ(σ_Y/σ_X)(X − µ_X),

and the minimum of E[(Y − a − bX)²] is σ_Y²(1 − ρ²). □
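The formulas for a and b can be applied directly to data (a Python/NumPy sketch; the model Y = 2X + noise and the sample size are arbitrary choices; here σ_Y² = 5, ρ² = 4/5, so the minimal mean squared error is σ_Y²(1 − ρ²) = 1):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(0, 1, 500_000)
y = 2 * x + rng.normal(0, 1, 500_000)     # Y correlated with X

rho = np.corrcoef(x, y)[0, 1]
b = rho * y.std() / x.std()               # b = rho * sigma_Y / sigma_X
a = y.mean() - b * x.mean()               # a = mu_Y - b * mu_X
print(a, b)                               # near 0 and 2
print(np.mean((y - a - b * x) ** 2))      # near sigma_Y^2 (1 - rho^2) = 1
```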

7.9 Moment Generating Function


Let X be a discrete random variable; recall that E[Xⁿ] = Σ_x xⁿ p(x).

Definition: for any (analytic) function g, the Maclaurin series of g(t) is

g(t) = Σ_{n=0}^∞ (g^{(n)}(0)/n!) tⁿ.

Suppose we seek g such that g^{(n)}(0) = E[Xⁿ] for all n:

g(t) = Σ_{n=0}^∞ (E[Xⁿ]/n!) tⁿ
     = Σ_{n=0}^∞ (Σ_x xⁿ p(x)/n!) tⁿ
     = Σ_x Σ_{n=0}^∞ ((tx)ⁿ/n!) p(x)
     = Σ_x p(x) e^{tx}
     = E[e^{tX}].

So g(t) = E[e^{tX}] is a function of t with the desired property.

Definition: let X be a random variable, its moment generating function is MX (t) = E[etX ].

Lemma 7.33 Assume that M_X(t) has a power series expansion at 0. Then M_X^{(n)}(0) = E[Xⁿ] for any nonnegative integer n.

Proof: Assume X is a continuous random variable with density function f. Then

Σ_{n=0}^∞ (M_X^{(n)}(0)/n!) tⁿ = M_X(t) = E[e^{tX}] = ∫_{−∞}^∞ e^{tx} f(x) dx
                              = ∫_{−∞}^∞ Σ_{n=0}^∞ ((tx)ⁿ/n!) f(x) dx
                              = Σ_{n=0}^∞ (tⁿ/n!) ∫_{−∞}^∞ xⁿ f(x) dx
                              = Σ_{n=0}^∞ (E[Xⁿ]/n!) tⁿ.

Comparing coefficients gives M_X^{(n)}(0) = E[Xⁿ]. □
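The moment-extraction property can be checked numerically with finite differences (a Python sketch; the exponential MGF λ/(λ − t) is derived in Proposition 7.34 below, and the rate λ = 2 and step size h are arbitrary choices):

```python
lam = 2.0
M = lambda t: lam / (lam - t)    # MGF of Exponential(lam), valid for t < lam

h = 1e-4
m1 = (M(h) - M(-h)) / (2 * h)            # numerical M'(0) ~ E[X] = 1/lam
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2    # numerical M''(0) ~ E[X^2] = 2/lam^2
print(m1, 1 / lam)
print(m2, 2 / lam**2)
```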

Proposition 7.34

• Suppose X is a Bernoulli random variable with parameter p. Then M_X(t) = peᵗ + 1 − p.

• Suppose X is a Poisson random variable with parameter λ. Then M_X(t) = exp[λ(eᵗ − 1)].

• Suppose X is a geometric random variable with parameter p. Then M_X(t) = peᵗ/(1 − (1 − p)eᵗ).

• Suppose X is an exponential random variable with parameter λ. Then M_X(t) = λ/(λ − t), t < λ.

• Suppose Z is the standard normal random variable. Then M_Z(t) = e^{t²/2}. If X = σZ + µ, which is normal with parameters (µ, σ²), σ > 0, then M_X(t) = exp(µt + σ²t²/2).

• Suppose X is a gamma random variable with parameters α > 0, λ > 0. Then M_X(t) = (λ/(λ − t))^α.

Proof:

• Suppose X is a Bernoulli random variable with parameter p. Then

M_X(t) = E[e^{tX}] = e^{t·1} P(X = 1) + e^{t·0} P(X = 0) = peᵗ + 1 − p.

• Suppose X is a Poisson random variable with parameter λ. Then

M_X(t) = E[e^{tX}] = Σ_{n=0}^∞ e^{tn} e^{−λ} λⁿ/n! = e^{−λ} Σ_{n=0}^∞ (λeᵗ)ⁿ/n! = e^{−λ} e^{λeᵗ} = exp[λ(eᵗ − 1)].

• Suppose X is a geometric random variable with parameter p. Then

M_X(t) = E[e^{tX}] = Σ_{n=1}^∞ e^{tn} (1 − p)^{n−1} p = Σ_{n=1}^∞ [eᵗ(1 − p)]^{n−1} eᵗ p = peᵗ/(1 − (1 − p)eᵗ).

• Suppose X is an exponential random variable with parameter λ. Then, for t < λ,

M_X(t) = E[e^{tX}] = ∫₀^∞ e^{tx} λe^{−λx} dx = λ ∫₀^∞ e^{−(λ−t)x} dx = λ/(λ − t).

• Suppose Z is the standard normal random variable. Then

M_Z(t) = E[e^{tZ}] = ∫_{−∞}^∞ e^{tz} (1/√(2π)) e^{−z²/2} dz
       = ∫_{−∞}^∞ (1/√(2π)) e^{−(z−t)²/2 + t²/2} dz
       = e^{t²/2} ∫_{−∞}^∞ (1/√(2π)) e^{−(z−t)²/2} dz
       = e^{t²/2}.

Now if X = σZ + µ, then

M_X(t) = E[e^{tX}] = E[e^{t(σZ+µ)}] = e^{tµ} E[e^{tσZ}] = e^{tµ} M_Z(tσ) = e^{tµ} e^{(tσ)²/2} = exp(µt + σ²t²/2).

• Suppose X is gamma with parameters (α, λ), α, λ > 0. Then, for t < λ,

M_X(t) = E[e^{tX}] = ∫₀^∞ e^{tx} λe^{−λx}(λx)^{α−1}/Γ(α) dx
       = (λ/(λ − t))^α ∫₀^∞ (λ − t)e^{−(λ−t)x}[(λ − t)x]^{α−1}/Γ(α) dx
       = (λ/(λ − t))^α. □

Proposition 7.35 If MX (t) = MY (t) for all t near 0, then X and Y have the same distribution.

Proof: This is due to the uniqueness of Taylor Series. □

Proposition 7.36 Suppose X and Y are random variable such that Y = aX + b. Let the moment generating
function of X be MX (t), then the moment generating function MY (t) of Y is

MY (t) = etb MX (at).

Proof: Since Y = aX + b, by the definition of the moment generating function we have

M_Y(t) = E[e^{tY}] = E[e^{atX+bt}] = e^{bt} E[e^{(at)X}] = e^{bt} M_X(at). □

7.10 Moment Generating Function for Sum of Independent Random Variables

Proposition 7.37 Suppose X and Y are independent, then MX+Y (t) = MX (t)MY (t). More generally, if X1 , X2 , · · · , Xn
are independent random variables, then

MX1 +X2 +···+Xn (t) = MX1 (t) · MX2 (t) · · · MXn (t).

Proof: It suffices to prove the case for two independent random variables, as the rest can be proven by induction.
Suppose X and Y are independent, then

E[et(X+Y ) ] = E[etX etY ] = E[etX ]E[etY ].

Hence we have the desired result. □

Proposition 7.38 Suppose X is a binomial random variable with parameters (n, p); then M_X(t) = (peᵗ + 1 − p)ⁿ. Suppose X is a negative binomial random variable with parameters (r, p); then

M_X(t) = (peᵗ/(1 − (1 − p)eᵗ))ʳ.

Proof: A binomial random variable with parameters (n, p) is the sum of n independent Bernoulli random variables with parameter p. A negative binomial random variable with parameters (r, p) is the sum of r independent geometric random variables with parameter p. □

Proposition 7.39 Let X be chi-squared with n degrees of freedom, then MX (t) = (1 − 2t)−n/2 .

Proof: Recall that a chi-squared random variable with n degrees of freedom can be written as Z₁² + · · · + Zₙ², where Z₁, · · · , Zₙ are independent standard normal random variables. Then, for t < 1/2,

E[e^{tX}] = Π_{i=1}^n M_{Z_i²}(t) = Π_{i=1}^n E[e^{tZ_i²}]
          = Π_{i=1}^n (1/√(2π)) ∫_{−∞}^∞ e^{tx²} e^{−x²/2} dx
          = Π_{i=1}^n (1/√(2π)) ∫_{−∞}^∞ e^{−x²(1−2t)/2} dx
          = Π_{i=1}^n (1 − 2t)^{−1/2}
          = (1 − 2t)^{−n/2}. □
Proposition 7.40 Let $X_1, X_2, \ldots$ be independent random variables with the same distribution as $X$, and let $N$ be a nonnegative integer-valued random variable independent of the $X_i$. Let $Y = X_1 + \cdots + X_N$. Then $E[e^{tY} \mid N = n] = [M_X(t)]^n$, i.e., $E[e^{tY} \mid N] = [M_X(t)]^N$, so $M_Y(t) = E[E[e^{tY} \mid N]] = E[(M_X(t))^N]$. Then we have the following (a simulation sketch follows the list):

• $M_Y'(t) = E[N (M_X(t))^{N-1} M_X'(t)]$.

• $E[Y] = M_Y'(0) = E[N (M_X(0))^{N-1} M_X'(0)] = E[N] E[X]$.

• $\mathrm{Var}(Y) = E[N]\mathrm{Var}(X) + (E[X])^2 \mathrm{Var}(N)$.
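
As a sketch of the mean and variance identities (hypothetical parameters; assuming numpy), simulate a compound sum with $N$ Poisson and the $X_i$ exponential:

    import numpy as np

    rng = np.random.default_rng(2)
    trials, lam, mu = 10**5, 4.0, 2.0   # N ~ Poisson(lam), X ~ exponential with rate mu

    ns = rng.poisson(lam, trials)
    # draw Y = X_1 + ... + X_N afresh for each trial (an empty sum is 0)
    ys = np.array([rng.exponential(1/mu, k).sum() for k in ns])

    print(ys.mean(), lam/mu)                      # E[Y] = E[N]E[X] = 2
    print(ys.var(), lam/mu**2 + (1/mu)**2 * lam)  # Var(Y) = E[N]Var(X) + (E[X])^2 Var(N) = 2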

7.11 Joint Moment Generating Function


Definition: Let $X$ and $Y$ be random variables. The joint moment generating function is $M_{X,Y}(s,t) = E[e^{sX+tY}]$. It is clear that $M_X(s) = E[e^{sX}] = M_{X,Y}(s, 0)$ and $M_Y(t) = E[e^{tY}] = M_{X,Y}(0, t)$.
Note that the joint distribution of $X$ and $Y$ is uniquely determined by $M_{X,Y}(s,t)$.
Moreover, $X$ and $Y$ are independent if and only if $M_{X,Y}(s,t) = M_X(s) M_Y(t)$ for all $s$ and $t$ (this result extends to $n$-dimensional joint moment generating functions). Note that this factorization is stronger than the single-variable identity $M_{X+Y}(t) = M_X(t) M_Y(t)$, which only involves the diagonal $s = t$ and does not by itself imply independence.
For the nontrivial direction, suppose $M_{X,Y}(s,t) = M_X(s) M_Y(t)$, and let $X_1$, $Y_1$ be independent random variables with the same distributions as $X$, $Y$ respectively. Then
$$M_{X_1,Y_1}(s,t) = M_{X_1}(s) M_{Y_1}(t) = M_X(s) M_Y(t) = M_{X,Y}(s,t).$$
Hence, by uniqueness, $(X, Y)$ has the same joint distribution as the independent pair $(X_1, Y_1)$, and so $X$ and $Y$ are independent.

Proposition 7.41 Suppose the number of events that occur is Poisson with parameter $\lambda$, and each event is independently classified as type I with probability $p$ and type II with probability $1 - p$. Let $X_i$ be the number of type $i$ events. Then $X_1$ and $X_2$ are independent Poisson random variables with parameters $\lambda p$ and $\lambda(1-p)$.

Proof: Let $X$ be the total number of events. Given $X = n$, $X_1$ is binomial with parameters $(n, p)$ and $X_2 = n - X_1$. Then
$$E[e^{sX_1+tX_2} \mid X = n] = E[e^{sX_1+t(n-X_1)} \mid X = n] = e^{tn} E[e^{(s-t)X_1} \mid X = n] = e^{tn}(pe^{s-t} + 1 - p)^n = (pe^s + (1-p)e^t)^n,$$
so $E[e^{sX_1+tX_2} \mid X] = (pe^s + (1-p)e^t)^X$. Since $E[e^{tX}] = e^{\lambda(e^t-1)}$ implies $E[a^X] = e^{\lambda(a-1)}$ for $a > 0$, we get
$$M_{X_1,X_2}(s,t) = E[(pe^s + (1-p)e^t)^X] = e^{\lambda(pe^s + (1-p)e^t - 1)} = e^{\lambda p(e^s-1)} e^{\lambda(1-p)(e^t-1)} = M_{X_1}(s) M_{X_2}(t). \qquad \square$$
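
A quick simulation sketch (assuming numpy; $\lambda = 5$ and $p = 0.3$ are hypothetical choices) illustrating that the two type counts have Poisson means $\lambda p$, $\lambda(1-p)$ and are uncorrelated:

    import numpy as np

    rng = np.random.default_rng(3)
    trials, lam, p = 10**6, 5.0, 0.3

    total = rng.poisson(lam, trials)
    x1 = rng.binomial(total, p)    # type I count given the total
    x2 = total - x1                # type II count

    print(x1.mean(), x1.var())     # both approach lam*p = 1.5 (a Poisson signature)
    print(x2.mean(), x2.var())     # both approach lam*(1-p) = 3.5
    print(np.cov(x1, x2)[0, 1])    # approaches 0, consistent with independence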

Proposition 7.42 Let $Z_1$ and $Z_2$ be independent standard normal random variables. Then $X = \frac{1}{2}(Z_1 + Z_2)$ and $Y = \frac{1}{2}(Z_1 - Z_2)$ are normal random variables. In addition, $X$ and $Y$ are independent, and hence so are $X$ and $Y^2$.

Proof: We compute the joint moment generating function, using the independence of $Z_1$ and $Z_2$:
$$M_{X,Y}(s,t) = E[e^{sX+tY}] = E[e^{\frac{1}{2}(s+t)Z_1}] E[e^{\frac{1}{2}(s-t)Z_2}] = e^{\frac{1}{8}(s+t)^2} e^{\frac{1}{8}(s-t)^2} = e^{\frac{1}{4}(s^2+t^2)} = e^{\frac{1}{4}s^2} e^{\frac{1}{4}t^2}.$$
The joint moment generating function is separable, and each factor is the moment generating function of a normal random variable (with mean 0 and variance $1/2$). Hence $X$ and $Y$ are normal and independent, so $X$ and $Y^2$ are also independent. □

Proposition 7.43 Let $X_1, X_2, \ldots, X_n$ be independent normal random variables with parameters $(\mu, \sigma^2)$. Let $\overline{X} = \frac{1}{n}(X_1 + \cdots + X_n)$ and $S^2 = \sum_{i=1}^{n} (X_i - \overline{X})^2/(n-1)$. Then $\overline{X}$ and $S^2$ are independent.

7.12 Summary on Random Variables

8 Limit Theorems
8.1 Inequalities

Lemma 8.1 (Markov's Inequality) Let $X$ be a nonnegative random variable. Then for any $a > 0$,
$$P(X \ge a) \le \frac{E[X]}{a}.$$

Proof: Let $I$ be the indicator variable of the event $X \ge a$:
$$I = \begin{cases} 1 & \text{if } X \ge a \\ 0 & \text{if } X < a. \end{cases}$$
If $X \ge a$, then $aI = a \le X$; if $X < a$, then $aI = 0 \le X$. Hence $X \ge aI$, so $E[X] \ge aE[I] = aP(X \ge a)$. □

Proposition 8.2 (Chernoff Bounds) Let $X$ be a random variable with moment generating function $M(t)$. Then for any $a > 0$ and $t > 0$,
$$P(X \ge a) \le e^{-ta} M(t).$$

Proof: For $t > 0$, $X \ge a$ if and only if $e^{tX} \ge e^{ta}$. Hence by Markov's Inequality, we have
$$P(X \ge a) = P(e^{tX} \ge e^{ta}) \le \frac{E[e^{tX}]}{e^{ta}} = e^{-ta} M(t).$$
Since this holds for every $t > 0$, the right-hand side may be minimized over $t$. □

Corollary 8.2.1 (Chernoff Bounds for the Poisson Random Variable) Suppose $X$ is a Poisson random variable with parameter $\lambda$, and $i > \lambda$. Then
$$P(X \ge i) \le \frac{e^{-\lambda}(e\lambda)^i}{i^i}.$$

Proof: By the Chernoff bounds, for all $t > 0$,
$$P(X \ge i) \le e^{\lambda(e^t-1)} e^{-it}.$$
Differentiating the right-hand side with respect to $t$, the minimum is attained when $e^t = i/\lambda$ (a valid choice of $t > 0$ since $i > \lambda$). Substituting this value for $t$ gives the desired inequality. □
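
A short numerical sketch (assuming scipy is available; $\lambda = 3$ is a hypothetical choice) comparing this bound with the exact Poisson tail probability:

    import numpy as np
    from scipy import stats

    lam = 3.0
    for i in [5, 8, 12]:
        exact = stats.poisson.sf(i - 1, lam)           # P(X >= i)
        bound = np.exp(-lam) * (np.e*lam)**i / i**i    # Chernoff bound
        print(i, exact, bound)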

Corollary 8.2.2 (Chernoff Bounds for Standard Normal Variable) Suppose $Z$ is standard normal and $a > 0$. Then
$$P(Z > a) \le \frac{1}{2} e^{-a^2/2}.$$

Proof:
$$P(Z > a) = \int_a^{\infty} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du = \int_0^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(x+a)^2/2}\,dx \qquad (x = u - a)$$
$$= \int_0^{\infty} \frac{1}{\sqrt{2\pi}} e^{-a^2/2} e^{-x^2/2} e^{-ax}\,dx \le \frac{1}{\sqrt{2\pi}} e^{-a^2/2} \int_0^{\infty} e^{-x^2/2} \cdot 1\,dx = \frac{1}{\sqrt{2\pi}} e^{-a^2/2} \sqrt{\frac{\pi}{2}} = \frac{1}{2} e^{-a^2/2},$$
where the inequality uses $e^{-ax} \le 1$ for $x \ge 0$. □

Lemma 8.3 (Chebyshev's Inequality) Let $X$ be a random variable, and let $\mu_X$ be the mean of $X$ and $\sigma_X^2$ the variance of $X$. Then
$$P(|X - \mu_X| \ge a) \le \frac{\sigma_X^2}{a^2} \qquad \forall a > 0.$$

Proof: We have $\sigma_X^2 = E[(X - \mu_X)^2]$. Let $Y = (X - \mu_X)^2 \ge 0$ and $a > 0$. By Markov's Inequality, we have
$$P(Y \ge a^2) \le \frac{E[Y]}{a^2} = \frac{\sigma_X^2}{a^2}.$$
But $P(Y \ge a^2) = P((X - \mu_X)^2 \ge a^2) = P(|X - \mu_X| \ge a)$, so
$$P(|X - \mu_X| \ge a) \le \frac{\sigma_X^2}{a^2} \qquad \forall a > 0. \qquad \square$$
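
As a sketch (assuming numpy; the exponential sample with mean and variance 1 is a hypothetical choice), both Markov's and Chebyshev's inequalities can be compared with empirical frequencies:

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(1.0, 10**6)   # mean 1, variance 1
    a = 3.0

    print(np.mean(x >= a), 1/a)                  # Markov: P(X >= a) <= E[X]/a
    print(np.mean(np.abs(x - 1) >= a), 1/a**2)   # Chebyshev: P(|X - mu| >= a) <= sigma^2/a^2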

Corollary 8.3.1 Suppose $X$ is the standard normal random variable and $a > 0$. Then
$$\Phi(a) \ge 1 - \frac{1}{2a^2}.$$

Proof: By symmetry, $P(|X| \ge a) = 2P(X \ge a) = 2(1 - \Phi(a))$. Since $X$ has mean 0 and variance 1, Chebyshev's Inequality gives $P(|X| \ge a) \le \frac{1}{a^2}$. Combining the two, we have the desired result. □

Theorem 8.4 If $X$ is a random variable such that $\mathrm{Var}(X) = 0$, then $P(X = \mu_X) = 1$.

Proof: Let $X$ be a random variable such that $\mathrm{Var}(X) = 0$. For any $\epsilon > 0$, by Chebyshev's Inequality,
$$P(|X - \mu_X| > \epsilon) \le \frac{\sigma_X^2}{\epsilon^2} = 0.$$
Let $\epsilon \to 0^+$. Since the events $\{|X - \mu_X| > \epsilon\}$ increase to $\{|X - \mu_X| > 0\}$ as $\epsilon \downarrow 0$, continuity of probability gives
$$0 = \lim_{\epsilon \to 0^+} P(|X - \mu_X| > \epsilon) = P(|X - \mu_X| > 0).$$
Hence $P(|X - \mu_X| \ne 0) = 0$, which implies that $P(X = \mu_X) = 1$. □

Proposition 8.5 (One-sided Chebyshev’s Inequality) Let X be a random variable with mean µ and variance
σ 2 , then for any a > 0, we have
σ2
P (X ≥ µ + a) ≤ 2 .
σ + a2

Proof: Consider Y = σ1 (X − µ). Then Y is a random variable with mean 0 and variance 1.
 a
P (X ≥ µ + a) = P Y ≥ 2 .
σ
2
Hence to prove that P (X ≥ µ + a) ≤ σ2σ+a2 for all a > 0, it suffices to show that P (Y ≥ a) ≤ 1+a
1
2 for all a > 0.

Now suppose Y is a random variable with mean 0 and variance 1. Then Y ≥ a implies for all b, (Y + b)2 ≥ (a + b)2 ,
hence by Chebyshev’s inequality, we have

1 + b2
P (Y ≥ a) ≤ P ((Y + b)2 ≥ (a + b)2 ) ≤ .
(a + b)2

And note when b = a1 ,


1 + b2 1
2
= .
(a + b) 1 + a2
hence  !
1 2 1 2
  
1
P (Y ≥ a) ≤ P Y + ≥ a+ ≤ .
a a 1 + a2

Proposition 8.6 (Jensen's Inequality) If $f(x)$ is a convex function, then
$$E[f(X)] \ge f(E[X]),$$
provided that the expectations exist and are finite. If $f(x)$ is concave, then
$$E[f(X)] \le f(E[X]),$$
provided that the expectations exist and are finite.

Proposition 8.7 (Cauchy-Schwarz Inequality) Suppose $X$ and $Y$ are random variables. Then
$$(E[XY])^2 \le E[X^2] E[Y^2].$$

Proof: For any $t \in \mathbb{R}$, note that $E[(tX + Y)^2] \ge 0$, hence
$$E[X^2]t^2 + 2E[XY]t + E[Y^2] \ge 0$$
for all $t$; hence the discriminant of the quadratic must be less than or equal to zero, that is,
$$4E[XY]^2 - 4E[X^2]E[Y^2] \le 0.$$
Hence we conclude that
$$(E[XY])^2 \le E[X^2] E[Y^2]. \qquad \square$$

8.2 Limit Theorems

Proposition 8.8 (Weak Law of Large Numbers) Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with $E[X_i] = \mu$. If $\epsilon > 0$, then
$$P\left(\left|\frac{X_1 + X_2 + \cdots + X_n}{n} - \mu\right| \ge \epsilon\right) \to 0 \quad \text{as } n \to \infty.$$

Proof: We prove the result under the additional assumption that the variance $\mathrm{Var}(X_i) = \sigma^2$ is finite. Let $\overline{X}_n = (X_1 + \cdots + X_n)/n$. Then $E[\overline{X}_n] = \mu$ and $\mathrm{Var}(\overline{X}_n) = \sigma^2/n$.
By Chebyshev's Inequality, for any $\epsilon > 0$, we have
$$P(|\overline{X}_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2}.$$
Let $\epsilon > 0$ be fixed. Then $P(|\overline{X}_n - \mu| \ge \epsilon) \to 0$ as $n \to \infty$. □
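
A minimal simulation sketch of the weak law (assuming numpy; exponential variables with mean $\mu = 2$ and the tolerance $\epsilon = 0.1$ are hypothetical choices):

    import numpy as np

    rng = np.random.default_rng(5)
    mu, eps, trials = 2.0, 0.1, 2000

    for n in [10, 100, 1000, 10000]:
        means = rng.exponential(mu, size=(trials, n)).mean(axis=1)
        # empirical P(|sample mean - mu| >= eps), which shrinks as n grows
        print(n, np.mean(np.abs(means - mu) >= eps))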

Corollary 8.8.1 Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with $E[X_i] = \mu$. If $\epsilon > 0$, then
$$P\left(\left|\frac{X_1 + X_2 + \cdots + X_n}{n} - \mu\right| \le \epsilon\right) \to 1 \quad \text{as } n \to \infty.$$

Proof: Taking the complement of the probability, we get the desired result. □

Lemma 8.9 Let $Z_n$ be a random variable having cumulative distribution function $F_{Z_n}$ and moment generating function $M_{Z_n}$, $n = 1, 2, \ldots$. Let $Z$ be a random variable having cumulative distribution function $F_Z$ and moment generating function $M_Z$. If $M_{Z_n}(t) \to M_Z(t)$ for all $t$ as $n \to \infty$, then $F_{Z_n}(t) \to F_Z(t)$ for all $t$ at which $F_Z$ is continuous.

Theorem 8.10 (Central Limit Theorem) Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$. Then the distribution of
$$\frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$
tends to the standard normal distribution as $n \to \infty$.


Proof: Let $Z_n = (X_1 + \cdots + X_n - n\mu)/(\sigma\sqrt{n})$. Note that the $Y_i = (X_i - \mu)/\sigma$ are independent and identically distributed with mean 0 and variance 1; let their common moment generating function be $M(t)$ (assumed to exist near 0). Then $Z_n = \sum_{i=1}^{n} \frac{Y_i}{\sqrt{n}}$, so $Z_n$ has moment generating function $\left[M\left(\frac{t}{\sqrt{n}}\right)\right]^n$. Let $Z$ be a standard normal random variable, so $M_Z(t) = e^{t^2/2}$. By Lemma 8.9, it suffices to show that $\left[M\left(\frac{t}{\sqrt{n}}\right)\right]^n \to e^{t^2/2}$ as $n \to \infty$.

Let $L(t) = \ln M(t)$. Then it is equivalent to show that $nL\left(\frac{t}{\sqrt{n}}\right) \to \frac{t^2}{2}$. Note that
$$L(0) = \ln M(0) = \ln 1 = 0,$$
$$L'(0) = \frac{M'(0)}{M(0)} = \frac{E[Y_i]}{1} = 0,$$
$$L''(0) = \frac{M''(0)M(0) - [M'(0)]^2}{[M(0)]^2} = \frac{E[Y_i^2] \cdot 1 - (E[Y_i])^2}{1^2} = 1.$$
Set $x = \frac{1}{\sqrt{n}}$. Then by L'Hopital's Rule, we have
$$\lim_{n \to \infty} nL\left(\frac{t}{\sqrt{n}}\right) = \lim_{x \to 0^+} \frac{L(tx)}{x^2} = \lim_{x \to 0^+} \frac{tL'(tx)}{2x} = \lim_{x \to 0^+} \frac{t^2 L''(tx)}{2} = \frac{t^2 L''(0)}{2} = \frac{t^2}{2},$$
by assuming that $L''$ is continuous at 0. And this completes the proof of the theorem. □
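
A simulation sketch of the theorem (assuming numpy; Uniform(0,1) summands and $n = 30$ are hypothetical choices) comparing standardized sums with the standard normal:

    import numpy as np

    rng = np.random.default_rng(6)
    n, trials = 30, 10**5
    mu, sigma = 0.5, np.sqrt(1/12)    # mean and sd of Uniform(0, 1)

    sums = rng.random(size=(trials, n)).sum(axis=1)
    z = (sums - n*mu) / (sigma*np.sqrt(n))

    print(z.mean(), z.std())          # approximately 0 and 1
    print(np.mean(z <= 1.0))          # approximately Phi(1) = 0.8413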

Proposition 8.11 (Strong Law of Large Numbers) Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with mean $\mu = E[X_i]$. Then
$$P\left(\lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = \mu\right) = 1,$$
i.e.,
$$\frac{X_1 + \cdots + X_n}{n} \to \mu \quad \text{as } n \to \infty, \text{ with probability 1.}$$

Proof: We prove the case where $E[X_i^4] = K < \infty$.

We first consider the case $\mu = E[X_i] = 0$. Let $S_n = X_1 + \cdots + X_n$. Expanding $S_n^4$, the resulting expectations are of the forms
$$E[X_i^4], \quad E[X_i^3 X_j], \quad E[X_i^2 X_j^2], \quad E[X_i^2 X_j X_k], \quad E[X_i X_j X_k X_l]$$
for distinct indices $i, j, k, l$. Since the $X_i$ are independent with mean 0, $E[X_i^3 X_j] = E[X_i^2 X_j X_k] = E[X_i X_j X_k X_l] = 0$. So
$$E[S_n^4] = \sum_{i=1}^{n} E[X_i^4] + 6\sum_{i<j} E[X_i^2 X_j^2] = nK + 3n(n-1)(E[X_i^2])^2 \le nK + 3n(n-1)K \le n^2 K + 3n^2 K = 4n^2 K,$$
where $(E[X_i^2])^2 \le E[X_i^4] = K$ by the Cauchy-Schwarz Inequality. Hence
$$E[(S_n/n)^4] \le 4n^2 K / n^4 = 4K/n^2,$$
and
$$E\left[\sum_{n=1}^{\infty} (S_n/n)^4\right] = \sum_{n=1}^{\infty} E[(S_n/n)^4] \le 4K \sum_{n=1}^{\infty} \frac{1}{n^2} < \infty,$$
where the interchange of expectation and sum is justified because all terms are nonnegative.
So we have concluded that $E\left[\sum_{n=1}^{\infty} (S_n/n)^4\right] < \infty$. Let $Y = \sum_{n=1}^{\infty} (S_n/n)^4$; then $P(Y = \infty) = 0$, i.e., $P(Y < \infty) = 1$. Then in particular $(S_n/n)^4 \to 0$ with probability 1, which implies
$$\frac{X_1 + \cdots + X_n}{n} \to 0 \quad \text{as } n \to \infty \text{ with probability 1.}$$

Next, for the general case, suppose $E[X_i] = \mu$. Consider $Y_i = X_i - \mu$; then $E[Y_i] = 0$, and $E[Y_i^4] = E[(X_i - \mu)^4] \le E[8X_i^4 + 8\mu^4]$, since $(a + b)^4 \le 8(a^4 + b^4)$ by the convexity of $x^4$. Since $E[8X_i^4 + 8\mu^4] < \infty$ by the fact that $E[X_i^4] < \infty$, then $E[Y_i^4] < \infty$. Thus we conclude that $\frac{Y_1 + \cdots + Y_n}{n} \to 0$ as $n \to \infty$ with probability 1. But since $X_i = Y_i + \mu$ for each $i$,
$$\frac{X_1 + \cdots + X_n}{n} = \frac{Y_1 + \cdots + Y_n}{n} + \mu \to \mu \quad \text{as } n \to \infty. \qquad \square$$

Corollary 8.11.1 Suppose $X_1, X_2, \ldots$ are independent and identically distributed positive random variables with $E[\ln(X_i)]$ finite. Then, with probability 1,
$$\lim_{n \to \infty} \left(\prod_{i=1}^{n} X_i\right)^{1/n} = e^{E[\ln(X_i)]}.$$

Proof: Since $X_1, X_2, \ldots$ are independent and identically distributed, so are $\ln(X_1), \ln(X_2), \ldots$. Note that
$$\left(\prod_{i=1}^{n} X_i\right)^{1/n} = \exp\left(\frac{\sum_{i=1}^{n} \ln(X_i)}{n}\right).$$
Since, by the Strong Law of Large Numbers,
$$\frac{\sum_{i=1}^{n} \ln(X_i)}{n} \to E[\ln(X_i)]$$
with probability 1 as $n \to \infty$, and the exponential function is continuous, we get
$$\lim_{n \to \infty} \left(\prod_{i=1}^{n} X_i\right)^{1/n} = \lim_{n \to \infty} \exp\left(\frac{\sum_{i=1}^{n} \ln(X_i)}{n}\right) = e^{E[\ln(X_i)]}. \qquad \square$$
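
A small sketch of this geometric-mean limit (assuming numpy; lognormal inputs are a hypothetical choice, for which $E[\ln X_i] = 0.1$ by construction):

    import numpy as np

    rng = np.random.default_rng(7)
    m = 0.1                                # E[ln X] for a lognormal(m, 1) variable
    x = rng.lognormal(mean=m, sigma=1.0, size=10**6)

    geo_mean = np.exp(np.mean(np.log(x)))  # (prod x_i)^(1/n), computed via logs
    print(geo_mean, np.exp(m))             # both approximately e^{0.1}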

Remark: suppose $X_1, X_2, \ldots$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. If the $X_i$ are normal, then exactly
$$\sum_{i=1}^{n} X_i \sim \text{Normal}(n\mu, n\sigma^2) \quad \text{and} \quad \frac{\sum_{i=1}^{n} X_i}{n} \sim \text{Normal}\left(\mu, \frac{\sigma^2}{n}\right).$$
For general $X_i$, the Central Limit Theorem says that these distributions hold approximately for large $n$.

Proposition 8.12 Let $Z_n$, $n \ge 1$, be a sequence of random variables and $c$ a constant such that, for each $\epsilon > 0$, $P(|Z_n - c| > \epsilon) \to 0$ as $n \to \infty$. Then for any bounded continuous function $g$,
$$E[g(Z_n)] \to g(c) \quad \text{as } n \to \infty.$$

Proof: We give the proof when the $Z_n$ are continuous random variables with densities $p_{Z_n}$; the discrete case is analogous, with sums in place of integrals.
Since $g$ is bounded, $|g(x)| \le M$ for some $M \in \mathbb{R}$. Since $g$ is continuous, for every $c \in \mathbb{R}$ and $\epsilon > 0$ there exists $\delta > 0$ such that $|x - c| \le \delta$ implies $|g(x) - g(c)| \le \epsilon$.
By the definition of expected values, we have
$$E[g(Z_n)] = \int g(x) p_{Z_n}(x)\,dx = \int_{|x-c| \le \delta} g(x) p_{Z_n}(x)\,dx + \int_{|x-c| > \delta} g(x) p_{Z_n}(x)\,dx.$$
Now, for $x$ such that $|x - c| \le \delta$, we have $g(x) \le g(c) + \epsilon$, and for $x$ such that $|x - c| > \delta$, we have $g(x) \le M$. Therefore,
$$E[g(Z_n)] \le (g(c) + \epsilon) \int_{|x-c| \le \delta} p_{Z_n}\,dx + M \int_{|x-c| > \delta} p_{Z_n}\,dx = (g(c) + \epsilon) P(|Z_n - c| \le \delta) + M P(|Z_n - c| > \delta).$$
Similarly, we have that
$$E[g(Z_n)] \ge (g(c) - \epsilon) P(|Z_n - c| \le \delta) - M P(|Z_n - c| > \delta).$$
Letting $n \to \infty$, $P(|Z_n - c| > \delta) \to 0$ and $P(|Z_n - c| \le \delta) \to 1$, so
$$g(c) - \epsilon \le \liminf_{n \to \infty} E[g(Z_n)] \le \limsup_{n \to \infty} E[g(Z_n)] \le g(c) + \epsilon.$$
Since $\epsilon > 0$ was arbitrary, $\lim_{n \to \infty} E[g(Z_n)] = g(c)$. □
n→∞
