
A First Course In Probability Notes

Lou Yi

Last Edited: December 1, 2023


Contents

1 Combinatorial Analysis

2 Axioms of Probability
2.1 Axioms of Probability
2.2 Inclusion-Exclusion Identity
2.3 Equally Likely Outcomes
2.4 Limit of Probability
2.5 Probability as a Measure of Belief

3 Conditional Probability
3.1 Conditional Probability
3.2 Bayes' Formula
3.3 Independent Events
3.4 Conditional Probability is a Probability

4 Discrete Random Variables
4.1 Definition Involving Discrete Random Variables
4.2 Expected Value
4.3 Moments And Variance
4.4 Bernoulli and Binomial Random Variables
4.5 Poisson Random Variable
4.6 Other Discrete Random Variables
4.6.1 Geometric Random Variable
4.6.2 Negative Binomial Random Variable
4.6.3 Hypergeometric Random Variable
4.7 Expected Value of Sums of Random Variables
4.8 Some Interesting Results

5 Continuous Random Variables
5.1 Continuous Random Variable
5.2 Expectation and Variance
5.3 Uniform Random Variable
5.4 Normal Random Variables
5.5 Exponential Random Variables
5.6 Other Continuous Distributions
5.6.1 Gamma Distribution
5.6.2 Beta Distribution
5.7 Information on the Gamma and Beta Function
5.8 Distribution of a Function of a Random Variable

6 Jointly Distributed Random Variables
6.1 Joint Cumulative Distribution Function
6.2 Joint Distribution of Random Variables
6.3 Independent Random Variables
6.4 Sums of Independent Random Variables
6.4.1 Sum of Binomial
6.4.2 Sum of Poisson
6.4.3 Sum of Uniform
6.4.4 Sum of Gamma
6.4.5 Sum of Normals
6.4.6 Sum of Exponential
6.5 Conditional Distribution
6.6 Joint Distribution of Functions

7 Expectation of Random Variables
7.1 Extra Properties of Expectation
7.2 Sum of Random Variables
7.3 Moments of Number of Events
7.4 Covariance and Correlations
7.5 Multinomial and Multivariate Normal Distribution
7.6 Conditional Expectation
7.7 Conditional Variance
7.8 Conditional Expectation and Prediction
7.9 Moment Generating Function
7.10 Moment Generating Function for Sum of Independent Random Variables
7.11 Joint Moment Generating Function
7.12 Summary on Random Variables

8 Limit Theorems
8.1 Inequalities
8.2 Limit Theorems
1 Combinatorial Analysis
The basic principle of counting:
Suppose that two experiments are to be performed. If the first experiment can result in any one of m possible
outcomes and if for each outcome of experiment 1, there are n possible outcomes of experiment 2, then together
there are mn possible outcomes of the two experiments.
Proof: Basically we can list the outcomes in an m × n matrix. The matrix has mn entries, hence there are mn
possible outcomes. □

The generalized basic principle of counting:


If r experiments that are to be performed are such that the first one may result in any of n1 possible outcomes; and if, for each of these n1 possible outcomes, there are n2 possible outcomes of the second experiment; and if, for each of the possible outcomes of the first two experiments, there are n3 possible outcomes of the third experiment; and so on, then there is a total of n1 · n2 · · · nr possible outcomes of the r experiments.
Proof: Apply induction on the basic principle of counting. □
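The principle is easy to check concretely. Below is a minimal Python sketch (the outcome labels are invented purely for illustration): enumerating the combined outcomes of r experiments with itertools.product yields exactly n1 · n2 · · · nr tuples.

```python
# Illustrative sketch of the generalized counting principle: the number of
# combined outcomes equals the product of the per-experiment outcome counts.
from itertools import product
from math import prod

experiments = [["a", "b", "c"], [0, 1], ["x", "y", "z", "w"]]  # n1=3, n2=2, n3=4
outcomes = list(product(*experiments))                          # all combined outcomes
assert len(outcomes) == prod(len(e) for e in experiments)       # 3 * 2 * 4 = 24
```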

Permutation:
If we want to arrange n items, then there are a total of n! such different orderings.
Proof: Note that there are n different choices for the first position of the arrangement, then followed by n − 1
choices on the second position, n − 2 choices on the third and so on. Hence there are n × (n − 1) × · · · × 1 = n!
different orderings. □

Permutation with identical items:

Suppose there are n objects, of which n1 are alike, n2 are alike, · · · , nr are alike. Then there are a total of
$$\frac{n!}{n_1!\,n_2!\cdots n_r!}$$
different permutations.
Proof: Firstly there are n! different permutations if all the objects are distinct. Then for each group of nk identical items there are nk! ways to arrange them among themselves, so each distinct ordering is counted n1! n2! · · · nr! times among the n! permutations. Thus the formula follows. □

Combination:
We define $\binom{n}{r}$, for r ≤ n, by
$$\binom{n}{r} = \frac{n!}{(n-r)!\,r!}$$
and say that $\binom{n}{r}$ represents the number of possible combinations of n objects taken r at a time.
Proof: Firstly, there are n!/(n − r)! permutations of length r choosing from n items. Since order does not matter, each combination is counted r! times, hence the total number of combinations is
$$\frac{n!}{(n-r)!\,r!}. \qquad \square$$
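As a sanity check, the formula agrees with Python's built-in combinatorics helpers; this sketch is illustrative and not part of the notes' derivation.

```python
# Sketch: compare the combination formula with math.comb and math.perm.
from math import comb, factorial, perm

n, r = 10, 4
assert perm(n, r) == factorial(n) // factorial(n - r)   # ordered selections of length r
assert comb(n, r) == perm(n, r) // factorial(r)         # each combination counted r! times
assert comb(n, r) == factorial(n) // (factorial(n - r) * factorial(r))
```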


Lemma 1.1 Suppose n ∈ N and r ∈ N with 1 ≤ r ≤ n. Then
$$\binom{n}{r} = \binom{n-1}{r-1} + \binom{n-1}{r}.$$

Proof: Note that in order to choose r items from a list of n objects, we can either include the first item and choose r − 1 items from the remaining n − 1 objects, or we can exclude the first item and choose r items from the remaining n − 1 objects. Hence we have the formula. □

Pascal’s Triangle:
Pascal's Triangle is a triangular array of numbers in which the element in the ith row and jth column has the value $\binom{i-1}{j-1}$.

Theorem 1.2 (The Binomial Theorem) Suppose n is a positive integer, then
$$(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$$
Hence, the $\binom{n}{k}$ are often known as binomial coefficients.

Proof: The proof is done by induction, or one can note that the coefficient of $x^k y^{n-k}$ counts the $\binom{n}{k}$ ways to choose which k of the n factors contribute an x. □

Corollary 1.2.1 There are exactly 2ⁿ subsets of a set consisting of n elements.

Proof: Since there are $\binom{n}{k}$ subsets of size k, there are a total of
$$\sum_{k=0}^{n} \binom{n}{k} = (1+1)^n = 2^n$$
subsets of the original set. □
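A one-line numerical check of the corollary (illustrative only):

```python
# Sketch verifying Corollary 1.2.1 for a small n: summing binomial
# coefficients over all subset sizes k gives 2^n subsets in total.
from math import comb

n = 6
assert sum(comb(n, k) for k in range(n + 1)) == 2 ** n
```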

Permutations of length r choosing from n items:
If we want to arrange items in a list of length r chosen from n objects, with r ≤ n, then there are
$$\frac{n!}{(n-r)!}$$
such permutations.
Proof: Similar to the above reasoning, there are n × (n − 1) × · · · × (n − r + 1) different orderings, which is equal to
$$\frac{n!}{(n-r)!}. \qquad \square$$

Multinomial coefficients:
Let n = n1 + n2 + · · · + nr. The multinomial coefficient $\binom{n}{n_1, n_2, \cdots, n_r}$ is defined by
$$\binom{n}{n_1, n_2, \cdots, n_r} = \frac{n!}{n_1!\, n_2! \cdots n_r!}.$$
$\binom{n}{n_1, n_2, \cdots, n_r}$ represents the number of possible divisions of n distinct objects into r distinct groups of sizes n1, · · · , nr respectively. It also equals the number of possible permutations of n objects of which n1 are alike, n2 are alike, · · · , nr are alike.

Note: Let r = 2, so n = n1 + n2; then
$$\binom{n}{n_1, n_2} = \binom{n}{n_1} = \binom{n}{n_2}.$$

Theorem 1.3 (Multinomial Theorem) Suppose n ∈ N, r ∈ N*; then
$$(x_1 + x_2 + \cdots + x_r)^n = \sum \binom{n}{n_1, n_2, \cdots, n_r} x_1^{n_1} x_2^{n_2} \cdots x_r^{n_r},$$
where the sum is taken over all nonnegative integers n1, · · · , nr so that
$$n_1 + n_2 + \cdots + n_r = n.$$

Proof: The theorem is clear from the combinatorial point of view: the coefficient of $x_1^{n_1}\cdots x_r^{n_r}$ counts the ways to divide the n factors into groups contributing each xᵢ. □

Proposition 1.4 The number of integer solutions to the equation
$$x_1 + x_2 + \cdots + x_r = n$$
where xᵢ ≥ kᵢ, i ∈ {1, 2, · · · , r}, is given by
$$\binom{n + r - 1 - \sum_{i=1}^{r} k_i}{r-1}.$$
In particular, the number of positive integer solutions is given by $\binom{n-1}{r-1}$ and the number of nonnegative integer solutions is given by $\binom{n+r-1}{r-1}$.

Proof: We can transform the problem into inserting r − 1 plates into the $n + r - 1 - \sum_{i=1}^{r} k_i$ gaps between stars; this is done by considering yᵢ = xᵢ + (1 − kᵢ), so that each yᵢ ≥ 1 and
$$y_1 + y_2 + \cdots + y_r = n + r - \sum_{i=1}^{r} k_i.$$
So the number of ways to do this is given by
$$\binom{n + r - 1 - \sum_{i=1}^{r} k_i}{r-1}. \qquad \square$$

Corollary 1.4.1 The number of integer solutions to the inequality
$$x_1 + x_2 + \cdots + x_r \le n$$
where xᵢ ≥ kᵢ, i ∈ {1, 2, · · · , r}, is given by
$$\binom{n + r - \sum_{i=1}^{r} k_i}{r}.$$

Proof: Introducing a nonnegative slack variable, it is equivalent to calculate the number of positive integer solutions to
$$y_1 + y_2 + \cdots + y_r + y_{r+1} = n + r + 1 - \sum_{i=1}^{r} k_i,$$
and by Proposition 1.4 the number of solutions of this is given by
$$\binom{n + r - \sum_{i=1}^{r} k_i}{r}. \qquad \square$$

2 Axioms of Probability
2.1 Axioms of Probability
Sample Space:
Consider an experiment whose outcome is not predictable. The set of all possible outcomes of the experiment is
called the sample space. It is usually denoted by S.

Events:
Any subset E of the sample space S is an event. If the outcome is contained in E, then we say E occurs.
Note that S itself is an event, which is also known as the sure event. ∅ is also an event, which is known as the null
event.

Operations on Events:
Operations on Events are precisely operations on sets. Let E and F be two events of a sample space S, then

• The union E ∪ F consists of all outcomes in E or in F or in both.

• The intersection E ∩ F, commonly denoted EF, consists of all outcomes in both E and F.

• If EF = ∅, then E and F are called mutually exclusive.

• The complement Eᶜ consists of all outcomes in S not in E.

Similarly, let E1, E2, · · · be a sequence of events of a sample space S.

• Their union, denoted by $\bigcup_{n=1}^{\infty} E_n$, consists of all outcomes which are in at least one of the En.

• Their intersection, denoted by $\bigcap_{n=1}^{\infty} E_n$, consists of all outcomes which are in every En.

Theorem 2.1 Let E, F and G be events of a sample space S. Then

• (Commutative Laws) E ∪ F = F ∪ E, EF = FE.

• (Associative Laws) (E ∪ F) ∪ G = E ∪ (F ∪ G), (EF)G = E(FG).

• (Distributive Laws) (E ∪ F)G = EG ∪ FG, (EF) ∪ G = (E ∪ G)(F ∪ G).

• (De Morgan's Laws) (E ∪ F)ᶜ = EᶜFᶜ, (EF)ᶜ = Eᶜ ∪ Fᶜ.

Proof: Simple element chasing between sets. □

Theorem 2.2 Let E1 , E2 , · · · and F be events of a sample space S. Then

• General Distributive Laws:

  – $\left(\bigcup_{n=1}^{\infty} E_n\right) \cap F = \bigcup_{n=1}^{\infty}(E_n \cap F)$;
  – $\left(\bigcap_{n=1}^{\infty} E_n\right) \cup F = \bigcap_{n=1}^{\infty}(E_n \cup F)$.

• General De Morgan's Laws:

  – $\left(\bigcup_{n=1}^{\infty} E_n\right)^c = \bigcap_{n=1}^{\infty} E_n^c$;
  – $\left(\bigcap_{n=1}^{\infty} E_n\right)^c = \bigcup_{n=1}^{\infty} E_n^c$.

Proof: Element chase. □

Probability:
Let E be any event of an experiment. Let n(E) be the number of times that E occurs in the first n repetitions of
the experiment. The probability of E is
$$P(E) = \lim_{n\to\infty} \frac{n(E)}{n}$$

if this limit exists.

Axioms of Probability:
Let S be the sample space of an experiment. Suppose that a number P (E) is defined for every event E of S, s.t.,

• 0 ≤ P(E) ≤ 1.

• P(S) = 1.

• For any sequence of mutually exclusive events E1, E2, · · · ,
$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i).$$

Then P (E) is called the probability of the event E.

Proposition 2.3 P (∅) = 0.

Proof: Let ∅ = E1 = E2 = E3 = · · · . Then E1, E2, · · · are mutually exclusive, and by the third axiom of probability $P(\varnothing) = \sum_{i=1}^{\infty} P(\varnothing)$, which forces P(∅) = 0. □

Proposition 2.4 Let E1, · · · , En be mutually exclusive; then
$$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i).$$

Proof: Let ∅ = E_{n+1} = E_{n+2} = · · · . Then one can show that E1, E2, · · · are mutually exclusive. Hence
$$P\left(\bigcup_{i=1}^{n} E_i\right) = P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i) = \sum_{i=1}^{n} P(E_i). \qquad \square$$

Proposition 2.5 Suppose E is an event of an experiment, then P (E c ) = 1 − P (E).

Proof: By the previous proposition, P (E) + P (E c ) = P (E ∪ E C ) = P (S) = 1. □

Proposition 2.6 Suppose that E ⊆ F are events of an experiment, then P (E) ≤ P (F ).

Proof: It is clear that E and F E c are mutually exclusive, and their union is F . Then P (F ) = P (E) + P (E c F ) ≥
P (E) + 0 = P (E). □

Proposition 2.7 Let E and F be any two events of an experiment, then

P (E ∪ F ) = P (E) + P (F ) − P (EF ),

which is known as the Inclusion-Exclusion Identity. And

P (E ∪ F ) ≤ P (E) + P (F ),

which is known as Boole's Inequality.

Proof: It is clear that EF c , EF and F E c are mutually exclusive. Then

P (E ∪ F ) = P (EF c ) + P (EF ) + P (F E c )
= [P (EF c ) + P (EF )] + [P (EF ) + P (F E c )] − P (EF )
= P (E) + P (F ) − P (EF ).

Then Boole's Inequality follows immediately. □

2.2 Inclusion-Exclusion Identity
Theorem 2.8 Let E1, E2, · · · , En be events; then
$$P(E_1 \cup E_2 \cup \cdots \cup E_n) = \sum_{i=1}^{n} P(E_i) - \sum_{i_1 < i_2} P(E_{i_1}E_{i_2}) + \cdots + (-1)^{r+1}\!\!\sum_{i_1<i_2<\cdots<i_r}\!\! P(E_{i_1}E_{i_2}\cdots E_{i_r}) + \cdots + (-1)^{n+1}P(E_1E_2\cdots E_n).$$
The following is a succinct way of writing the inclusion-exclusion identity:
$$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{r=1}^{n} (-1)^{r+1} \sum_{i_1 < \cdots < i_r} P(E_{i_1}\cdots E_{i_r}).$$

Proof: We present two proofs of the theorem.


We first prove it using induction. The case where n = 2 is trivial. Suppose that the formula is true for n, we show
it for n + 1. First apply the n = 2 case, then distributivity of intersections:

$$P(E_1 \cup \cdots \cup E_n \cup E_{n+1}) = P(E_1 \cup \cdots \cup E_n) + P(E_{n+1}) - P((E_1 \cup \cdots \cup E_n) \cap E_{n+1})$$
$$= P(E_1 \cup \cdots \cup E_n) + P(E_{n+1}) - P((E_1 \cap E_{n+1}) \cup (E_2 \cap E_{n+1}) \cup \cdots \cup (E_n \cap E_{n+1})).$$
The first and last terms in brackets are unions of n events, for which we assumed the formula to hold (applying the inductive hypothesis). Therefore
$$P(E_1 \cup \cdots \cup E_{n+1}) = \sum_{1\le i \le n} P(E_i) - \sum_{1 \le i_1 < i_2 \le n} P(E_{i_1} \cap E_{i_2}) + \sum_{1\le i_1<i_2<i_3\le n} P(E_{i_1}\cap E_{i_2}\cap E_{i_3}) - \cdots + (-1)^{n+1} P(E_1 \cap \cdots \cap E_n)$$
$$+\, P(E_{n+1}) - \sum_{1\le i\le n} P(E_i \cap E_{n+1}) + \sum_{1\le i_1<i_2\le n} P(E_{i_1}\cap E_{i_2}\cap E_{n+1}) - \cdots - (-1)^n\!\!\sum_{1\le i_1<\cdots<i_{n-1}\le n}\!\! P(E_{i_1}\cap\cdots\cap E_{i_{n-1}}\cap E_{n+1}) - (-1)^{n+1} P(E_1\cap\cdots\cap E_{n+1}).$$
Then by rearranging terms, we get the required result.

Next we argue using combinatorics:

If an outcome of the sample space is not a member of any of the sets Eᵢ, then its probability does not contribute anything to either side of the equality. Now, suppose that an outcome is in exactly m of the events Eᵢ, where m > 0. Then, since it is in $\bigcup_i E_i$, its probability is counted once in $P(\bigcup_i E_i)$; also, as this outcome is contained in $\binom{m}{k}$ subsets of the type $E_{i_1}E_{i_2}\cdots E_{i_k}$, its probability is counted
$$\binom{m}{1} - \binom{m}{2} + \binom{m}{3} - \cdots \pm \binom{m}{m}$$
times on the right of the equality sign in the right-hand side of the theorem. Thus, for m > 0, we must show that
$$1 = \binom{m}{1} - \binom{m}{2} + \binom{m}{3} - \cdots \pm \binom{m}{m}.$$
However, since $1 = \binom{m}{0}$, the preceding equation is equivalent to
$$\sum_{i=0}^{m} (-1)^i \binom{m}{i} = 0,$$
and the latter equation follows from the binomial theorem, since
$$0 = (-1+1)^m = \sum_{i=0}^{m} (-1)^i \binom{m}{i} (1)^{m-i}.$$
Since every element is counted the same number of times on both sides of the equality, the respective probabilities are equal. □
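To see the identity in action, here is a small illustrative sketch on an equally likely sample space with three arbitrarily chosen events; exact arithmetic with Fraction avoids rounding error.

```python
# Sketch checking Theorem 2.8: the inclusion-exclusion sum equals the
# directly computed probability of the union on a finite sample space.
from itertools import combinations
from fractions import Fraction

S = set(range(20))                                   # equally likely outcomes
events = [set(range(0, 10)), set(range(5, 15)), {2, 3, 17}]
P = lambda E: Fraction(len(E), len(S))

direct = P(set().union(*events))
ie = sum((-1) ** (r + 1) * sum(P(set.intersection(*c))
                               for c in combinations(events, r))
         for r in range(1, len(events) + 1))
assert direct == ie
```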

2.3 Equally Likely Outcomes


Equally Likely Experiments:
Suppose the sample space S of an experiment is finite and all outcomes in S are equally likely to occur. Then for any event E, $P(E) = \frac{|E|}{|S|}$, where |E| is the number of outcomes in E.

Proposition 2.9 The probability of drawing a specific card from a normal deck of 52 cards after burning any number (less than 52) of cards is 1/52.

Proof: Suppose the card drawn is at the ith position in the deck; there are 51! orderings for which this is possible and 52! total permutations, thus the probability is 51!/52! = 1/52 no matter how many cards are burned. □

Corollary 2.9.1 Suppose there are n specific cards in a deck of 52 cards; then the probability of drawing any of them after burning any number (less than 52) of cards is n/52.

Proof: Each of the n cards is equally likely to be drawn. □

2.4 Limit of Probability


Definition: a sequence of events {En, n ≥ 1} is said to be an increasing sequence if
$$E_1 \subset E_2 \subset \cdots \subset E_n \subset E_{n+1} \subset \cdots,$$
whereas it is said to be a decreasing sequence if
$$E_1 \supset E_2 \supset \cdots \supset E_n \supset E_{n+1} \supset \cdots.$$

Definition: if {En, n ≥ 1} is an increasing sequence of events, then we define a new event, denoted by $\lim_{n\to\infty} E_n$, by
$$\lim_{n\to\infty} E_n = \bigcup_{i=1}^{\infty} E_i.$$
Similarly, if {En, n ≥ 1} is a decreasing sequence of events, we define $\lim_{n\to\infty} E_n$ by
$$\lim_{n\to\infty} E_n = \bigcap_{i=1}^{\infty} E_i.$$

Proposition 2.10 If {En, n ≥ 1} is either an increasing or a decreasing sequence of events, then
$$\lim_{n\to\infty} P(E_n) = P\left(\lim_{n\to\infty} E_n\right).$$

Proof: Suppose first that {En, n ≥ 1} is an increasing sequence, and define the events Fn, n ≥ 1, by F1 = E1 and
$$F_n = E_n \left( \bigcup_{i=1}^{n-1} E_i \right)^{\!c} = E_n E_{n-1}^c, \quad n > 1.$$
Then it is clear that the Fn's are mutually exclusive, and
$$\bigcup_{i=1}^{\infty} F_i = \bigcup_{i=1}^{\infty} E_i \quad \text{and} \quad \bigcup_{i=1}^{n} F_i = \bigcup_{i=1}^{n} E_i \ \text{ for all } n \ge 1.$$
Thus,
$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = P\left(\bigcup_{i=1}^{\infty} F_i\right) = \sum_{i=1}^{\infty} P(F_i) \quad \text{(by Axiom 3 of a Probability Function)}$$
$$= \lim_{n\to\infty} \sum_{i=1}^{n} P(F_i) = \lim_{n\to\infty} P\left(\bigcup_{i=1}^{n} F_i\right) = \lim_{n\to\infty} P\left(\bigcup_{i=1}^{n} E_i\right) = \lim_{n\to\infty} P(E_n),$$
which proves the result when {En, n ≥ 1} is increasing.
If {En, n ≥ 1} is a decreasing sequence, then {Eₙᶜ, n ≥ 1} is an increasing sequence. Hence, from the preceding equations,
$$P\left(\bigcup_{i=1}^{\infty} E_i^c\right) = \lim_{n\to\infty} P(E_n^c).$$
However, because $\bigcup_{i=1}^{\infty} E_i^c = \left(\bigcap_{i=1}^{\infty} E_i\right)^{\!c}$, it follows that
$$P\left(\left(\bigcap_{i=1}^{\infty} E_i\right)^{\!c}\right) = \lim_{n\to\infty} P(E_n^c).$$
Thus
$$1 - P\left(\bigcap_{i=1}^{\infty} E_i\right) = \lim_{n\to\infty} [1 - P(E_n)] = 1 - \lim_{n\to\infty} P(E_n).$$
Therefore, we conclude
$$P\left(\bigcap_{i=1}^{\infty} E_i\right) = \lim_{n\to\infty} P(E_n). \qquad \square$$
i=1

2.5 Probability as a Measure of Belief


Measure of Belief:
If n(E) is the number of times that E occurs in n repetitions of an experiment, then $P(E) = \lim_{n\to\infty} n(E)/n$.
If we believe that a coin is fair, then P({H}) = P({T}) = 1/2; if we believe that a die is fair, then P({1}) = P({2}) = · · · = P({6}) = 1/6.
It is logical to suppose that a measure of the degree of one's belief should satisfy the axioms of probability.
3 Conditional Probability
3.1 Conditional Probability
Definition: suppose P(F) > 0; the conditional probability that E occurs given that F has occurred is given by
$$P(E|F) = \frac{P(EF)}{P(F)}.$$
Note that P(E|F) = P(EF|F) and P(EF) = P(E|F)P(F).

Lemma 3.1 Suppose E ⊆ F; then $P(E|F) = \frac{P(E)}{P(F)}$.

Proof: Since E ⊆ F, we have EF = E, so $P(E|F) = \frac{P(EF)}{P(F)} = \frac{P(E)}{P(F)}$. □

Proposition 3.2 (Multiplication Rule) P (EF ) = P (E|F )P (F ).

Proof: Suppose P (F ) > 0, then P (E|F ) = P (EF )/P (F ). Hence P (EF ) = P (E|F )P (F ). Now suppose P (F ) =
0, then P (E|F ) = 0, as the probability of E happening given F happens is zero (F can never happen). So again
we have P (EF ) = P (E|F )P (F ). □

Corollary 3.2.1 (General Multiplication Rule) P (E1 · · · En ) = P (E1 )P (E2 |E1 ) · · · P (En |E1 · · · En−1 ).

Proof: Suppose P (E1 · · · En−1 ) = 0, then the statement is trivial as both sides equal to 0. Otherwise, we can use
induction to prove the general statement. □

3.2 Bayes’ Formula

Proposition 3.3 (Law of Total Probability) P (E) = P (EF )+P (EF C ) = P (E|F )P (F )+P (E|F C )[1−P (F )].

Proof: First, note that EF ∪ EFᶜ = E and EF ∩ EFᶜ = ∅. Hence we have P(E) = P(EF ∪ EFᶜ) = P(EF) + P(EFᶜ).
Next we apply the formula for conditional probability and get the second equality. □

Theorem 3.4 (Bayes' Formula) Let E and F be events with P(E) > 0; then
$$P(F|E) = \frac{P(E|F)P(F)}{P(E|F)P(F) + P(E|F^c)P(F^c)}.$$

Proof: By the definition of conditional probability and the law of total probability,
$$P(F|E) = \frac{P(FE)}{P(E)} = \frac{P(EF)}{P(E|F)P(F) + P(E|F^c)P(F^c)} = \frac{P(E|F)P(F)}{P(E|F)P(F) + P(E|F^c)P(F^c)}. \qquad \square$$
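A worked numerical sketch of Bayes' formula, with hypothetical numbers for a diagnostic test (the prevalence, sensitivity, and false-positive rate below are invented for illustration):

```python
# Sketch: P(condition | positive test) via Bayes' formula.
p_F = 0.01           # P(F): prevalence of the condition (hypothetical)
p_E_given_F = 0.99   # P(E|F): test sensitivity (hypothetical)
p_E_given_Fc = 0.05  # P(E|F^c): false-positive rate (hypothetical)

posterior = (p_E_given_F * p_F) / (p_E_given_F * p_F + p_E_given_Fc * (1 - p_F))
print(posterior)     # about 0.167: a positive result is still mostly a false alarm
```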

Proposition 3.5 Let E and F be two events; then
$$P(F|E) \ge P(F) \iff P(E|F) \ge P(E|F^c).$$
That is to say, if E and F are positively associated, then E and Fᶜ are negatively associated.

Proof: Note $P(F|E) = \frac{P(E|F)P(F)}{P(E|F)P(F) + P(E|F^c)P(F^c)}$; then
$$P(F|E) \ge P(F) \iff \frac{P(E|F)P(F)}{P(E|F)P(F) + P(E|F^c)P(F^c)} \ge P(F)$$
$$\iff P(E|F) \ge P(E|F)P(F) + P(E|F^c)P(F^c) \iff P(E|F)P(F^c) \ge P(E|F^c)P(F^c) \iff P(E|F) \ge P(E|F^c). \qquad \square$$

Proposition 3.6 (Generalized Law of Total Probability) Let F1, F2, · · · be mutually exclusive with $\bigcup_{n=1}^{\infty} F_n = S$. For any event E, we have
$$P(E) = \sum_{n=1}^{\infty} P(E|F_n)P(F_n).$$

Proof: Suppose $\bigcup_{n=1}^{\infty} F_n = S$; then
$$E = E \cap S = E \cap \bigcup_{n=1}^{\infty} F_n = \bigcup_{n=1}^{\infty} (E \cap F_n).$$
Hence
$$P(E) = P\left(\bigcup_{n=1}^{\infty} E \cap F_n\right) = \sum_{n=1}^{\infty} P(EF_n) = \sum_{n=1}^{\infty} P(E|F_n)P(F_n). \qquad \square$$

Theorem 3.7 (Generalized Bayes' Formula) Let F1, F2, · · · be mutually exclusive with $\bigcup_{n=1}^{\infty} F_n = S$. For any event E, we have
$$P(F_j|E) = \frac{P(E|F_j)P(F_j)}{\sum_{n=1}^{\infty} P(E|F_n)P(F_n)}.$$

Proof: Note that
$$P(F_j|E) = \frac{P(F_jE)}{P(E)} = \frac{P(EF_j)}{P(E)} = \frac{P(E|F_j)P(F_j)}{\sum_{n=1}^{\infty} P(E|F_n)P(F_n)}. \qquad \square$$

Definition: the odds of an event A are defined by
$$\frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)}.$$

Lemma 3.8 Suppose E and F are events; then the odds of the event F given E are
$$\frac{P(F|E)}{P(F^c|E)} = \frac{P(F)P(E|F)}{P(F^c)P(E|F^c)}.$$

Proof:
$$\frac{P(F|E)}{P(F^c|E)} = \frac{P(FE)}{P(F^cE)} = \frac{P(F)P(E|F)}{P(F^c)P(E|F^c)}. \qquad \square$$

3.3 Independent Events


Definition: events E and F are said to be independent if P (EF ) = P (E)P (F ), and dependent otherwise.
Note that if P (F ) = 0, then P (EF ) = 0 = P (E)P (F ); so E, F are automatically independent.

Lemma 3.9 If P (F ) > 0, then E and F are independent if and only if P (E|F ) = P (E). If P (E) > 0, then E
and F are independent if and only if P (F |E) = P (F ).

Proof: Suppose P(F) > 0; then $P(E|F) = \frac{P(EF)}{P(F)} = P(E)$ if and only if P(EF) = P(E)P(F). Similarly we have the second assertion. □

Proposition 3.10 (Property of Independent Events) Let E and F be events, then the following are equiva-
lent:

1. E and F are independent;

2. E and F C are independent;

3. E C and F are independent;

4. E C and F C are independent.

Proof: Suppose E and F are independent, i.e., P (EF ) = P (E)P (F ). Notice that EF and EF C are mutually
exclusive with union E. Then

P (E) = P (EF ) + P (EF C ) = P (E)P (F ) + P (EF C )

This implies that P (EF C ) = P (E) − P (E)P (F ) = P (E)[1 − P (F )] = P (E)P (F C ).


Reversing roles, we get that all four statements are equivalent. □

Definition: let E, F , G be events. They are independent if

1. P (EF G) = P (E)P (F )P (G), and

2. P (EF ) = P (E)P (F ), P (EG) = P (E)P (G), P (F G) = P (F )P (G).

Proposition 3.11 Suppose E, F, G are independent. Then E is independent of any event formed from F and G.

Proof: We just need to consider the cases FG, F ∪ G, FᶜG, FGᶜ, FᶜGᶜ, Fᶜ ∪ G, F ∪ Gᶜ, and Fᶜ ∪ Gᶜ. □

Definition: a sequence of events E1 , E2 , · · · are independent if

P (Ei1 · · · Eir ) = P (Ei1 ) · · · P (Eir ) for any r ≥ 2, i1 < · · · < ir .

Definition: suppose an experiment consists of a sequence of subexperiments. Let Eᵢ be the outcome of the ith subexperiment. If E1, E2, · · · are independent and have the same set of possible outcomes, then they are often called trials.

Proposition 3.12 Let E and F be mutually exclusive events. Suppose independent trials are performed. Then the probability that E occurs before F is
$$\frac{P(E)}{P(E) + P(F)}.$$

Proof: Let S be the event that E occurs before F, and let K be the event that neither E nor F happens on the first trial. Conditioning on the first trial,
$$P(S) = P(S|E)P(E) + P(S|F)P(F) + P(S|K)P(K) = P(E) + 0 + P(S)\,(1 - P(E) - P(F)),$$
which gives
$$P(S) = \frac{P(E)}{P(E) + P(F)}. \qquad \square$$

Proposition 3.13 Suppose that a man is gambling against an infinitely rich adversary and at each stage he either wins or loses 1 unit with respective probabilities p and 1 − p. If the man starts with i units, then the probability that he will eventually go broke is
$$\begin{cases} 1 & \text{if } p \le \frac{1}{2}, \\[2pt] \left(\frac{q}{p}\right)^{i} & \text{if } p > \frac{1}{2}, \end{cases}$$
where q = 1 − p.

Proof: Let P(n) denote the probability that the man starts with n units and goes broke. Then P(0) = 1 and P(n) = P(1)ⁿ for n ≥ 1, since to go broke from n units he must drop from n to n − 1, then from n − 1 to n − 2, and so on, each with probability P(1). Also, conditioning on the first bet, P(1) = (1 − p)P(0) + pP(2) = (1 − p) + p[P(1)]². Then solve this quadratic for P(1) for the desired result. □
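The ruin probability can be checked by simulation; the sketch below truncates the adversary's "infinite" wealth at a large cap, which is an approximation of the proposition's setting.

```python
# Monte Carlo sketch of Proposition 3.13 with p > 1/2: the ruin probability
# starting from i units should be close to (q/p)^i.
import random

p, i, cap, trials = 0.6, 3, 200, 20000
ruined = 0
for _ in range(trials):
    w = i
    while 0 < w < cap:                      # stop at ruin or at the wealth cap
        w += 1 if random.random() < p else -1
    ruined += (w == 0)
print(ruined / trials, ((1 - p) / p) ** i)  # both about 0.296
```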

3.4 Conditional Probability is a Probability

Proposition 3.14 Suppose F is an event with P(F) > 0; then P(·|F) is a probability.

Proof: We verify that the conditional probability satisfies the 3 axioms of probability.

1. 0 ≤ P(E|F) ≤ 1.
   Proof: EF ⊆ F, so 0 ≤ P(EF) ≤ P(F) and $0 \le P(E|F) = \frac{P(EF)}{P(F)} \le 1$.

2. P(S|F) = 1.
   Proof: $P(S|F) = \frac{P(SF)}{P(F)} = \frac{P(F)}{P(F)} = 1$.

3. We claim that if E1, E2, · · · are mutually exclusive events, then
$$P\left(\bigcup_{i=1}^{\infty} E_i \,\Big|\, F\right) = \sum_{i=1}^{\infty} P(E_i|F).$$
This is because
$$P\left(\bigcup_{i=1}^{\infty} E_i \,\Big|\, F\right) = \frac{P\left(\left(\bigcup_{i=1}^{\infty} E_i\right) F\right)}{P(F)} = \frac{P\left(\bigcup_{i=1}^{\infty} E_i F\right)}{P(F)} = \frac{\sum_{i=1}^{\infty} P(E_i F)}{P(F)} = \sum_{i=1}^{\infty} \frac{P(E_i F)}{P(F)} = \sum_{i=1}^{\infty} P(E_i|F).$$

Hence conditional probability is a probability. □

Proposition 3.15 (Properties of Conditional Probability As A Probability) Fix any event F with P (F ) >
0, then

• Inclusion-Exclusion Identity:
$$P(E_1 \cup E_2 \mid F) = P(E_1|F) + P(E_2|F) - P(E_1E_2|F).$$

• Law of Total Probability:
$$P(E|F) = P(E|FG)P(G|F) + P(E|FG^c)P(G^c|F).$$

Proof: Define Q(E) = P (E|F ), since Q is a probability function, then

P ((E1 ∪ E2 )|F ) = Q(E1 ∪ E2 )


= Q(E1 ) + Q(E2 ) − Q(E1 E2 )
= P (E1 |F ) + P (E2 |F ) − P (E1 E2 |F )

And
$$P(E|F) = Q(E) = Q(E|G)Q(G) + Q(E|G^c)Q(G^c) = P(E|FG)P(G|F) + P(E|FG^c)P(G^c|F),$$
since $Q(E|G) = \frac{Q(EG)}{Q(G)} = \frac{P(EG|F)}{P(G|F)} = \frac{P(EGF)}{P(FG)} = P(E|FG)$ and Q(G) = P(G|F). □

4 Discrete Random Variables
4.1 Definition Involving Discrete Random Variables
Definition: on the sample space of an experiment, the quantities of interest, or real-valued functions on the sample
space are called random variables.

Definition: suppose a random variable X can take on at most a countable number of possible values (finite {a1, · · · , an} or enumerable {a1, a2, · · · }). Then X is said to be discrete. pX(a) = P(X = a) is the probability mass function of X.
Suppose X only assumes values in {a1, a2, · · · , an}, with the aᵢ all distinct; then $\sum_{i=1}^{n} p_X(a_i) = 1$ and pX(a) = 0 for a ≠ a1, a2, · · · , an.
Suppose X only assumes values in {a1, a2, · · · }, with the aᵢ all distinct; then $\sum_{i=1}^{\infty} p_X(a_i) = 1$ and pX(a) = 0 for a ≠ a1, a2, · · · .
Note that for a discrete random variable X, if pX(a) = 0, we may assume that X does not take the value a.

Definition: we define the cumulative distribution function of a random variable X to be
$$F_X(a) = \sum_{x \le a} p(x).$$
Suppose a discrete random variable X takes values a1 < a2 < · · · . Then the distribution function FX is a non-decreasing step function. If aᵢ ≤ a < aᵢ₊₁, then
$$F_X(a) = \sum_{x\le a} p(x) = p(a_1) + \cdots + p(a_i).$$

There is a jump of size pX (ai ) = P (X = ai ) occurring at ai .

Proposition 4.1 (Properties of the Cumulative Distribution Function)

1. F is a nondecreasing function.

2. lim F (b) = 1.
b→∞

3. lim F (b) = 0.
b→−∞

4. F is right continuous.

4.2 Expected Value


Definition: let X be a discrete random variable with values x1, · · · , xm. The expected value / mean of X is $E[X] = \sum_{i=1}^{m} x_i p_X(x_i)$.
Definition: let X be a discrete random variable with values x1, x2, · · · . The expected value / mean of X is $E[X] = \sum_{i=1}^{\infty} x_i p_X(x_i)$.
Definition: we define the indicator variable of an event E to be
$$I = \begin{cases} 1 & \text{if } E \text{ occurs}, \\ 0 & \text{if } E \text{ does not occur}. \end{cases}$$

Lemma 4.2 E[I] = P (E).

Proof: Note I can only take values 0 and 1, and pI (1) = P (E) and pI (0) = P (E c ). Hence E[I] = 1 · P (E) + 0 ·
P (E c ) = P (E). □

Note: suppose X is a discrete random variable, then for any function g, Y = g(X) is again a discrete random
variable.
Proposition 4.3 Let X be a discrete random variable with values x1, x2, · · · . Then $E[g(X)] = \sum_{i=1}^{\infty} g(x_i)\,p_X(x_i)$ for any function g.

Proof: Let Y = g(X); then $E[g(X)] = E[Y] = \sum_y y\,P(Y = y)$. Fix y such that g(x) = y for some x, and let $E_y = \{x \mid g(x) = y\}$. Then $P(Y = y) = \sum_{x \in E_y} P(X = x)$. Therefore
$$E[g(X)] = \sum_y y \sum_{x\in E_y} P(X=x) = \sum_y \sum_{x \in E_y} y\, P(X=x) = \sum_y \sum_{x\in E_y} g(x)P(X=x) = \sum_x g(x)P(X=x). \qquad \square$$

Lemma 4.4 Suppose c ∈ R, then E[c] = c.

Proof: $E[c] = \sum_x c\, p_X(x) = c \sum_x p_X(x) = c$. □

Lemma 4.5 Suppose X is a discrete random variable with value x1 , x2 , · · · . Then if a, b ∈ R, E[aX+b] = aE[X]+b.

Proof: If a, b ∈ R, then
$$E[aX+b] = \sum_{i=1}^{\infty}(ax_i + b)\,p_X(x_i) = a\sum_{i=1}^{\infty} x_i\, p_X(x_i) + b\sum_{i=1}^{\infty} p_X(x_i) = aE[X] + b. \qquad \square$$

Lemma 4.6 Suppose X1 , X2 , · · · , Xn are random variables, c1 , c2 , · · · , cn ∈ R, then E(c1 X1 + · · · + cn Xn ) =


c1 E(X1 ) + · · · + cn E(Xn ).

Proof: By induction, using the two-variable case. □

4.3 Moments And Variance


Definition: let X be a random variable. E[Xⁿ] is called the nth moment of X. Suppose X is a discrete random variable with values x1, x2, · · · ; then $E[X^n] = \sum_{i=1}^{\infty} x_i^n\, p_X(x_i)$.

Lemma 4.7 Let I be the indicator of an event E. Then E[Iⁿ] = P(E).

Proof: E[Iⁿ] = 1ⁿ·P(I = 1) + 0ⁿ·P(I = 0) = P(E). □

Definition: let X be a random variable; we denote the mean E[X] by µX. Then the variance of X is Var(X) = E[(X − µX)²].

Proposition 4.8 Var(X) = E[X 2 ] − (E[X])2 .

Proof:

Var(X) = E[(X − µ)2 ]


= E[X 2 − 2µX + µ2 ]
= E[X 2 ] − 2µE[X] + µ2
= E[X 2 ] − 2µ2 + µ2
= E[X 2 ] − µ2
= E[X 2 ] − (E[X])2 .

Corollary 4.8.1 For any discrete random variable X, E[X²] ≥ (E[X])² and, provided E[X] ≠ 0, E[X²]/|E[X]| ≥ |E[X]|.

Proof: It is clear that E[(X − µ)²] ≥ 0, and E[X²] − (E[X])² = E[(X − µ)²] ≥ 0. Dividing by |E[X]| yields the second claim. □

Proposition 4.9 Suppose X is a random variable, a, b ∈ R, then Var(aX + b) = a2 Var(X).

Proof: Let µ = E[X], then

Var(aX + b) = E[(aX + b)2 ] − (E[aX + b])2


= E[a2 X 2 + 2abX + b2 ] − (aµ + b)2
= (a2 E[X 2 ] + 2abµ + b2 ) − (a2 µ2 + 2abµ + b2 )
= a2 (E[X 2 ] − µ2 )
= a2 Var(X).


Definition: we define the standard deviation of X to be the principal square root of Var(X), i.e., $SD(X) = \sqrt{\mathrm{Var}(X)}$.
Usually we write Var(X) = σ², where σ ≥ 0 is the standard deviation. Then SD(aX + b) = |a| SD(X).

4.4 Bernoulli and Binomial Random Variables


Definition: a random variable X is a Bernoulli random variable if pX(1) = p and pX(0) = 1 − p for some 0 < p < 1.
Definition: a random variable X is a binomial random variable if $p_X(i) = \binom{n}{i}p^i(1-p)^{n-i}$ for some 0 < p < 1, i = 0, 1, · · · , n.
Suppose $p_X(i) = \binom{n}{i}p^i(1-p)^{n-i}$, 0 < p < 1, i = 0, 1, · · · , n; then X is said to have a binomial distribution with parameters (n, p). In particular, a Bernoulli random variable is binomial with parameters (1, p).

Proposition 4.10 Let X be a binomial random variable with parameters (n, p). Then E[X] = np and Var(X) = np(1 − p).

Proof: Notice that if we let Xᵢ denote the random variable where Xᵢ = 1 if the ith trial is a success and Xᵢ = 0 if the ith trial is a failure, then
$$E[X] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} p = np.$$
Now note that Var(X) = E[X²] − (E[X])² and (E[X])² = n²p², hence we calculate E[X²]; but first we compute E[X(X − 1)]:
$$E[X(X-1)] = \sum_{i=0}^{n} i(i-1)\binom{n}{i}p^i(1-p)^{n-i} = n(n-1)p^2\sum_{i=2}^{n}\frac{(n-2)!}{(i-2)!(n-i)!}\,p^{i-2}(1-p)^{n-i}$$
$$= n(n-1)p^2\sum_{k=0}^{n-2}\binom{n-2}{k}p^k(1-p)^{(n-2)-k} = n(n-1)p^2 \cdot 1 = n(n-1)p^2.$$
Then it follows that E[X²] = n(n − 1)p² + np, so Var(X) = n(n − 1)p² + np − n²p² = np(1 − p). □
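A simulation sketch of the proposition (the parameters are chosen arbitrarily):

```python
# Sketch: sample mean and variance of binomial draws vs np and np(1-p).
import random

n, p, trials = 20, 0.3, 50000
xs = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
mean = sum(xs) / trials
var = sum((x - mean) ** 2 for x in xs) / trials
print(mean, n * p)            # both about 6.0
print(var, n * p * (1 - p))   # both about 4.2
```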

Lemma 4.11 Suppose X is a binomial random variable with parameters (n, p). Then
$$\frac{p_X(i)}{p_X(i-1)} = \frac{n-i+1}{i}\cdot\frac{p}{1-p}$$
and
$$\frac{p_X(i+1)}{p_X(i)} = \frac{n-i}{i+1}\cdot\frac{p}{1-p}.$$
In addition, we have
$$p_X(i+1) = \frac{(n-i)p}{(i+1)(1-p)}\,p_X(i).$$

Proof: Direct computation yields this result. □

Proposition 4.12 Suppose X is a binomial random variable with parameters (n, p) and (n + 1)p is not an integer; then pX(i) first increases monotonically, reaches its largest value when i is the largest integer ≤ (n + 1)p, and then decreases monotonically. If (n + 1)p is an integer, then pX(i) takes its maximum value at both (n + 1)p and (n + 1)p − 1.

Proof: By the previous lemma, we have that pX(i) ≥ pX(i − 1) if and only if (n − i + 1)p ≥ i(1 − p), which happens if and only if i ≤ (n + 1)p. Hence we have the desired result. □

Proposition 4.13 Suppose X is a binomial random variable with parameters (n, p); then E[Xᵏ] = npE[(Y + 1)ᵏ⁻¹], where Y is a binomial random variable with parameters (n − 1, p).

Proof: Recall the identity
$$i\binom{n}{i} = n\binom{n-1}{i-1};$$
then
$$E[X^k] = \sum_{i=0}^{n} i^k \binom{n}{i} p^i (1-p)^{n-i} = np\sum_{i=1}^{n} i^{k-1}\binom{n-1}{i-1} p^{i-1}(1-p)^{n-i} = np\sum_{j=0}^{n-1}(j+1)^{k-1}\binom{n-1}{j} p^j (1-p)^{n-1-j} = npE[(Y+1)^{k-1}]. \qquad \square$$
4.5 Poisson Random Variable


Definition: a random variable X is Poisson with parameter λ > 0 if
$$p_X(i) = P(X = i) = e^{-\lambda}\frac{\lambda^i}{i!}, \quad i = 0, 1, 2, \cdots.$$

A Poisson random variable with parameter λ > 0 is an approximation of binomial random variable with parameters
(n, p) such that λ = np with n very large and p very small.

Examples of Poisson Random Variable:

• The number of misprints on a page of a book.

• The number of people in a community who survive to age 100.

• The number of wrong telephone numbers dialed in a day.

• The number of packages of biscuits sold in a store in a day.

• The number of customers entering a post office on a given day.

Proposition 4.14 Let X be a Poisson random variable with parameter λ > 0. Then E[X] = λ and Var(X) = λ.

Proof: Let X be a Poisson random variable with parameter λ > 0. Then $p_X(i) = P(X=i) = e^{-\lambda}\frac{\lambda^i}{i!}$, so
$$E[X] = \sum_{i=0}^{\infty} i\, e^{-\lambda}\frac{\lambda^i}{i!} = e^{-\lambda}\sum_{i=1}^{\infty}\frac{\lambda^i}{(i-1)!} = \lambda e^{-\lambda}\sum_{j=0}^{\infty}\frac{\lambda^j}{j!} = \lambda e^{-\lambda} e^{\lambda} = \lambda.$$
Similarly, one computes E[X(X − 1)] = λ²; hence E[X²] = λ² + λ, so Var(X) = E[X²] − (E[X])² = λ. □

Poisson Approximation:
One can approximate the probability mass function of a binomial random variable with parameter (n, p) using
Poisson distribution with parameter λ = np, when n is large enough and p is very small, so λ is moderate.

Poisson Paradigm:
Let pᵢ be the probability that event i occurs, i = 1, · · · , n. If the pᵢ are small, and the trials are independent or weakly dependent, then the total number of these events that occur can be approximated by a Poisson random variable with parameter $\lambda = \sum_{i=1}^{n} p_i$.

Definition: we say events E and F are weakly dependent if P (E) ≈ P (E|F ).

Poisson Process:
Suppose events occur at random points of time. Let λ > 0. Assume

1. The probability that exactly 1 event occurs in an interval of length h is approximately λh.

2. The probability that 2 or more events occur in an interval of length h is much smaller than λh.

3. The number of events occurring in non-overlapping intervals are independent.

Let N(t) denote the number of events occurring in an interval of length t. Then N(t) is a Poisson random variable with parameter λt and
$$P(N(t) = k) = e^{-\lambda t}\frac{(\lambda t)^k}{k!}, \quad k = 0, 1, 2, \cdots.$$

Lemma 4.15 Let X be a Poisson random variable with parameter λ. Then P (X = i) increases monotonically and
then decreases monotonically as i increases, reaching its maximum when i is the largest integer not exceeding λ.

Proof: Note that
$$\frac{P(X=i)}{P(X=i-1)} = \frac{\lambda}{i},$$
which is ≥ 1 if and only if i ≤ λ. Hence P(X = i) increases for i ≤ λ and decreases thereafter, attaining its maximum at the largest integer not exceeding λ. □

Proposition 4.16 Let X be a Poisson random variable with parameter λ. Then
$$E[X^n] = \lambda E[(X+1)^{n-1}].$$

Proof:
$$E[X^n] = \sum_{i=1}^{\infty} i^n e^{-\lambda}\frac{\lambda^i}{i!} = \sum_{i=1}^{\infty} i^{n-1} e^{-\lambda}\frac{\lambda^i}{(i-1)!} = \lambda \sum_{j=0}^{\infty} (j+1)^{n-1} e^{-\lambda}\frac{\lambda^j}{j!} = \lambda E[(X+1)^{n-1}]. \qquad \square$$

Proposition 4.17 Suppose that the number of events occurring in a given time period is a Poisson random variable with parameter λ, and that each event is independently classified as a type i event with probability pᵢ, i = 1, · · · , n, with Σpᵢ = 1. Then the numbers of type i events that occur are independent Poisson random variables with respective parameters λpᵢ.

Proof: Let X denote the Poisson random variable with parameter λ, and let Xᵢ denote the number of type i events that occur. Given X = k, the number of type i events is a binomial random variable with parameters (k, pᵢ). Hence by the law of total probability, we have
$$P(X_i = n) = \sum_{k=n}^{\infty}\binom{k}{n}p_i^n(1-p_i)^{k-n}\, e^{-\lambda}\frac{\lambda^k}{k!} = \frac{e^{-\lambda}p_i^n\lambda^n}{n!}\sum_{k=n}^{\infty}\frac{[\lambda(1-p_i)]^{k-n}}{(k-n)!} = e^{-\lambda}\frac{(p_i\lambda)^n}{n!}\,e^{\lambda(1-p_i)} = e^{-p_i\lambda}\frac{(p_i\lambda)^n}{n!}.$$
This holds for all nonnegative integers n; hence Xᵢ is a Poisson random variable with parameter pᵢλ. One can then verify that the Xᵢ's are independent using the multinomial distribution. □

4.6 Other Discrete Random Variables
4.6.1 Geometric Random Variable

Definition: independent trials are performed until a success occurs. Suppose the probability of success is p, where 0 < p < 1. Then if we let X denote the number of trials needed until a success occurs, we have P(X = n) = (1 − p)ⁿ⁻¹p, n = 1, 2, · · · . In this way, we define X to be the Geometric Random Variable with parameter p.
Note that $\sum_{n=1}^{\infty} p_X(n) = \frac{p}{1-(1-p)} = 1$. So P(X = ∞) = 0 and we may say that the event X = ∞ does not occur.

Proposition 4.18 Let X be a geometric random variable with parameter p, 0 < p < 1. Then E[X] = 1/p and Var(X) = (1 − p)/p².

Proof: Let X be a geometric random variable with parameter p, 0 < p < 1; then
$$E[X] = \sum_{i=1}^{\infty} i(1-p)^{i-1}p.$$
Note that
$$\sum_{i=1}^{\infty}(1-x)^i = \frac{1-x}{x};$$
by differentiating both sides with respect to x and multiplying by −x, we have
$$\frac{1}{x} = \sum_{i=1}^{\infty} i(1-x)^{i-1}x.$$
Letting x = p, we have E[X] = 1/p. Similarly, differentiating twice gives
$$\frac{2}{x^3} = \sum_{i=2}^{\infty} i(i-1)(1-x)^{i-2},$$
so, multiplying by x(1 − x),
$$\frac{2(1-x)}{x^2} = \sum_{i=2}^{\infty} i(i-1)(1-x)^{i-1}x.$$
Hence E[X(X − 1)] = 2(1 − p)/p², so E[X²] = 2(1 − p)/p² + 1/p and Var(X) = E[X²] − (E[X])² = (1 − p)/p². □

Lemma 4.19 Suppose n ∈ N and k ∈ N⁺, and X is a geometric random variable with parameter p; then
$$P\{X = n + k \mid X > n\} = P\{X = k\}.$$

Proof: P{X = n + k | X > n} is the probability that, given the first n trials are failures, the first success occurs k trials after them. This is the same as getting the first success at the kth trial of a fresh sequence, which has probability P{X = k}. □

4.6.2 Negative Binomial Random Variable

Definition: suppose independent trials are performed with success probability p. Let X be the number of trials needed for r successes. Suppose X = n, where n ≥ r: in the first n − 1 trials there are r − 1 successes and n − r failures, and the nth trial is a success, so $P(X = n) = \binom{n-1}{r-1}p^r(1-p)^{n-r}$. Hence we define X to be the Negative Binomial Random Variable with parameters (r, p), 0 < p < 1, if
$$p_X(n) = \binom{n-1}{r-1}p^r(1-p)^{n-r}, \quad n = r, r+1, \cdots.$$
Note that a geometric random variable is negative binomial with parameters (1, p).
Intuitively, one can see negative binomial random variable as the reverse of binomial random variable. Thus let X
be a negative binomial random variable with parameters r and p, and let Y be a binomial random variable with
parameters n and p. Then P (X > n) = P (Y < r).

Proposition 4.20 Let X be negative binomial with parameters (r, p). Then E[X] = r/p and Var(X) = r(1 − p)/p².

Proof: Let X be negative binomial with parameters (r, p). Then
$$E[X^k] = \sum_{n=r}^{\infty} n^k\binom{n-1}{r-1}p^r(1-p)^{n-r} = \frac{r}{p}\sum_{n=r}^{\infty} n^{k-1}\binom{n}{r}p^{r+1}(1-p)^{n-r} \qquad\left(\text{since } n\binom{n-1}{r-1} = r\binom{n}{r}\right)$$
$$= \frac{r}{p}\sum_{m=r+1}^{\infty}(m-1)^{k-1}\binom{m-1}{r}p^{r+1}(1-p)^{m-(r+1)} = \frac{r}{p}E[(Y-1)^{k-1}],$$
where Y is a negative binomial random variable with parameters (r + 1, p). Setting k = 1 in the preceding equation yields
$$E[X] = \frac{r}{p}.$$
Setting k = 2 in the equation for E[Xᵏ] and using the formula for the expected value of a negative binomial random variable gives
$$E[X^2] = \frac{r}{p}E[Y-1] = \frac{r}{p}\left(\frac{r+1}{p} - 1\right).$$
Therefore
$$\mathrm{Var}(X) = \frac{r}{p}\left(\frac{r+1}{p}-1\right) - \left(\frac{r}{p}\right)^2 = \frac{r(1-p)}{p^2}. \qquad \square$$

Lemma 4.21 Suppose X is a negative binomial random variable with parameters (r, p), and Y is a binomial
random variable with parameters (n, p). Then

P {X > n} = P {Y < r}.

More explicitly, we have
$$\sum_{i=n+1}^{\infty}\binom{i-1}{r-1}p^r(1-p)^{i-r} = \sum_{i=0}^{r-1}\binom{n}{i}p^i(1-p)^{n-i}.$$

Proof: The event X > n occurs exactly when there are fewer than r successes in the first n trials. Hence P{X > n} = P{Y < r}. □
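The identity can be checked numerically; in this sketch the infinite left-hand tail is computed as 1 minus a finite sum.

```python
# Sketch checking Lemma 4.21: P(X > n) for negative binomial (r, p) equals
# P(Y < r) for binomial (n, p).
from math import comb

r, p, n = 3, 0.4, 10
left = 1 - sum(comb(k - 1, r - 1) * p ** r * (1 - p) ** (k - r)
               for k in range(r, n + 1))
right = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(r))
assert abs(left - right) < 1e-12
```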

4.6.3 Hypergeometric Random Variable

Definition: n balls are randomly chosen without replacement from an urn containing m white and N − m black balls. Let X be the number of white balls chosen; then
$$P(X = i) = \frac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}}, \quad i = 0, 1, \cdots, n.$$
We define X to be the Hypergeometric Random Variable with parameters (n, N, m) if
$$p_X(i) = \frac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}}, \quad i = 0, 1, \cdots, n.$$
Note that 0 ≤ X ≤ n and 0 ≤ X ≤ m. If m < i ≤ n, then $\binom{m}{i} = 0$, so P(X = i) = 0.

Proposition 4.22 Let X be hypergeometric with parameters (n, N, m). Then E[X] = nm/N and
$$\mathrm{Var}(X) = np(1-p)\left(1 - \frac{n-1}{N-1}\right),$$
where p = m/N.

Proof: Let X be hypergeometric with parameters (n, N, m). Then
$$E[X^k] = \sum_{i=0}^{n} i^k P\{X=i\} = \sum_{i=1}^{n} i^k\,\frac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}}.$$
Using the identities
$$i\binom{m}{i} = m\binom{m-1}{i-1} \quad\text{and}\quad n\binom{N}{n} = N\binom{N-1}{n-1},$$
we obtain
$$E[X^k] = \frac{nm}{N}\sum_{i=1}^{n} i^{k-1}\,\frac{\binom{m-1}{i-1}\binom{N-m}{n-i}}{\binom{N-1}{n-1}} = \frac{nm}{N}\sum_{j=0}^{n-1}(j+1)^{k-1}\,\frac{\binom{m-1}{j}\binom{N-m}{n-1-j}}{\binom{N-1}{n-1}} = \frac{nm}{N}E[(Y+1)^{k-1}],$$
where Y is a hypergeometric random variable with parameters n − 1, N − 1 and m − 1. Hence upon setting k = 1, we have
$$E[X] = \frac{nm}{N}.$$
Upon setting k = 2 in the equation for E[Xᵏ], we obtain
$$E[X^2] = \frac{nm}{N}E[Y+1] = \frac{nm}{N}\left(\frac{(n-1)(m-1)}{N-1} + 1\right).$$
Hence
$$\mathrm{Var}(X) = \frac{nm}{N}\left(\frac{(n-1)(m-1)}{N-1} + 1 - \frac{nm}{N}\right).$$
Letting p = m/N and using the identity
$$\frac{m-1}{N-1} = \frac{Np-1}{N-1} = p - \frac{1-p}{N-1},$$
we have
$$\mathrm{Var}(X) = np\left[(n-1)p - (n-1)\frac{1-p}{N-1} + 1 - np\right] = np(1-p)\left(1 - \frac{n-1}{N-1}\right). \qquad \square$$

Approximating Hypergeometric Using Binomial:

Note that when N and m are large enough, and n is small compared to N and m, the probability of drawing a white ball with or without replacement does not change much; in both cases it is almost equal to m/N. Hence we can approximate the distribution of the number of white balls drawn using the binomial distribution with parameters (n, m/N).
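The following sketch compares the hypergeometric pmf with its binomial approximation, for arbitrarily chosen large N and m and small n.

```python
# Sketch: hypergeometric (n, N, m) pmf vs binomial (n, m/N) pmf.
from math import comb

N, m, n = 10000, 4000, 10
p = m / N
for i in range(n + 1):
    hyper = comb(m, i) * comb(N - m, n - i) / comb(N, n)
    binom = comb(n, i) * p ** i * (1 - p) ** (n - i)
    print(i, round(hyper, 6), round(binom, 6))   # columns agree closely
```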

Proposition 4.23 Let X be a hypergeometric random variable with parameters (n, N, m); then
$$P(X = i+1) = \frac{(m-i)(n-i)}{(i+1)(N-m-n+i+1)}\,P(X=i), \quad i = 0, 1, \cdots, n-1.$$
P(X = i) is maximized when i is the greatest integer smaller than $p = \frac{(m+1)(n+1)}{N+2}$ if p is not an integer; P(X = i) is maximized at both i = p and i = p − 1 if p is an integer.

Proof: According to the definition of the hypergeometric random variable, we have
$$\frac{P(X=i+1)}{P(X=i)} = \frac{\binom{m}{i+1}\binom{N-m}{n-i-1}}{\binom{m}{i}\binom{N-m}{n-i}} = \frac{(m-i)(n-i)}{(i+1)(N-m-n+i+1)}.$$
P(X = i + 1) ≥ P(X = i) if and only if
$$(i+1)(N-m-n+i+1) \le (m-i)(n-i).$$
Expanding both sides and cancelling the common terms i² − im − in, this becomes
$$i(N+2) \le (m+1)(n+1) - (N+2), \quad\text{i.e.,}\quad i+1 \le \frac{(m+1)(n+1)}{N+2}.$$
Hence P(X = i) is maximized when i is the greatest integer smaller than p = (m + 1)(n + 1)/(N + 2) if p is not an integer, and at both i = p and i = p − 1 if p is an integer. □

4.7 Expected Value of Sums of Random Variables

Proposition 4.24 Suppose the sample space S is countable. For any discrete random variable X on S,
$$E[X] = \sum_{s\in S} X(s)p(s).$$

Proof: This is clear intuitively, but we give a rigorous proof. Let Eᵢ be the event that X = xᵢ, i.e., s ∈ Eᵢ ⇔ X(s) = xᵢ. So $P(X = x_i) = P(E_i) = \sum_{s\in E_i} p(s)$. Hence
$$E[X] = \sum_{i=1}^{\infty} x_i P(X=x_i) = \sum_{i=1}^{\infty} x_i \sum_{s\in E_i} p(s) = \sum_{i=1}^{\infty}\sum_{s\in E_i} X(s)p(s) = \sum_{s\in S} X(s)p(s). \qquad \square$$

Proposition 4.25 Let X1 , X2 , · · · , Xn be discrete random variables. Then E[X1 +· · ·+Xn ] = E[X1 ]+· · ·+E[Xn ].

Proof: We prove the case for two random variables, and we can apply induction for the general result.
Let X and Y be discrete random variables on S. Then
X X
E[X + Y ] = (X + Y )(s)p(s) = [X(s) + Y (s)]p(s)
s∈S s∈S
X X
= X(s)p(s) + Y (s)p(s)
s∈S s∈S

= E[X] + E[Y ].

Proposition 4.26 Let X1, · · · , Xn be discrete random variables and X = X1 + · · · + Xn. Then
$$E[X^2] = \sum_{i=1}^{n}E[X_i^2] + \sum_{i\ne j}E[X_iX_j].$$

Proof: Notice
E[X 2 ] = E[(X1 + · · · + Xn )(X1 + · · · + Xn )].

Then using the previous proposition, we get the desired result. □

4.8 Some Interesting Results

Proposition 4.27 Suppose a positive integer is chosen at random; then the probability that it contains no repeated prime factor is 6/π².

Proof: The probability that the chosen integer is not divisible by the square of the ith prime pᵢ is $1 - \frac{1}{p_i^2}$. Hence the probability that the number is not divisible by the square of any prime is
$$\prod_{i=1}^{\infty}\left(1 - \frac{1}{p_i^2}\right) = \prod_{i=1}^{\infty}\frac{p_i^2-1}{p_i^2} = \frac{1}{\zeta(2)} = \frac{6}{\pi^2}. \qquad \square$$

5 Continuous Random Variables
5.1 Continuous Random Variable
Definition: a random variable X is continuous if there exists a nonnegative function f such that
$$P(X \in B) = \int_B f(x)\,dx$$
for any set B of real numbers.


In this case, FX (the cumulative distribution function of X) is continuous everywhere. In particular, FX has no jump at any a ∈ R, so P(X = a) = 0 for every a ∈ R.

Definition: suppose $P(X \in B) = \int_B f_X(x)\,dx$ for any set B of real numbers. Then fX is called the probability density function of X. Then:

1. $P(X = a) = \int_a^a f_X(x)\,dx = 0$.

2. $P(a \le X \le b) = P(a < X < b) = \int_a^b f_X(x)\,dx$ for any a < b.

3. $F_X(a) = P(X \le a) = \int_{-\infty}^{a} f_X(x)\,dx$. It then also follows that $F_X'(a) = f_X(a)$ if fX is continuous at a.

4. $P(X \ge a) = \int_a^{\infty} f_X(x)\,dx$.

5. Assume that f is continuous at a; then P(a < X < a + ϵ) ≈ ϵf(a).

5.2 Expectation and Variance


Definition: if X is a continuous random variable with density function fX, the expected value E[X] is defined by
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx.$$

Definition: the median m of a continuous random variable X is the value $m = \frac{a+b}{2}$, where
$$a = \inf\{x : F(x) \ge \tfrac{1}{2}\} \quad\text{and}\quad b = \sup\{x : F(x) \le \tfrac{1}{2}\}.$$

Definition: the mode m of a continuous random variable X is the value m such that fX(m) is maximal.

Lemma 5.1 Suppose X is a continuous random variable whose probability density function fX (x) is even, then
E[X] = 0.

Proof: Since fX(x) is even, xfX(x) is odd, so
$$\int_{-\infty}^{\infty} x f_X(x)\,dx = 0,$$
provided that the integral exists. □

Lemma 5.2 Let Y be a nonnegative continuous random variable with probability density function f. Then
$$\int_0^{\infty} P(Y > y)\,dy = E[Y].$$

Proof:
$$\int_0^{\infty} P(Y>y)\,dy = \int_0^{\infty}\left(\int_y^{\infty} f(x)\,dx\right)dy = \int_0^{\infty}\left(\int_0^{x} dy\right)f(x)\,dx = \int_0^{\infty} x f(x)\,dx = E[Y]. \qquad \square$$

Lemma 5.3 Suppose Y is an arbitrary continuous random variable with probability density function f. Then
$$E[Y] = \int_0^{\infty} P(Y > y)\,dy - \int_0^{\infty} P(Y < -y)\,dy.$$

Proof:
$$E[Y] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^{\infty} x f(x)\,dx + \int_{-\infty}^{0} x f(x)\,dx = \int_0^{\infty} P(Y>y)\,dy - \int_0^{\infty} y f(-y)\,dy$$
$$= \int_0^{\infty} P(Y > y)\,dy - \int_0^{\infty} P(Y < -y)\,dy. \qquad \square$$

Proposition 5.4 Suppose a continuous random variable X has probability density function f. Then for any function g,
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx.$$

Proof: First we prove the statement in the special case that g(x) ≥ 0. Then
$$E[g(X)] = \int_0^{\infty} P(g(X) > y)\,dy = \int_0^{\infty} \int_{x:\, g(x)>y} f(x)\,dx\,dy = \int_{x:\, g(x)>0} \int_0^{g(x)} dy\, f(x)\,dx = \int_{x:\, g(x)>0} f(x)g(x)\,dx.$$
Hence the statement holds for g ≥ 0.

Now, for the general case, consider g = g⁺ − g⁻, where
$$g^+(x) = \max\{0, g(x)\}, \qquad g^-(x) = |\min\{0, g(x)\}|.$$
Then it is clear that g⁺ ≥ 0 and g⁻ ≥ 0. If g(x) ≥ 0, then g(x) = g(x) − 0 = g⁺(x) − g⁻(x); if g(x) ≤ 0, then g(x) = 0 − |g(x)| = g⁺(x) − g⁻(x). Hence by the linearity of expectation and integration, we have
$$E[g(X)] = E[g^+(X)] - E[g^-(X)] = \int_{-\infty}^{\infty} [g^+(x) - g^-(x)]f(x)\,dx = \int_{-\infty}^{\infty} g(x)f(x)\,dx. \qquad \square$$

Lemma 5.5 Let X be a continuous random variable. For a, b ∈ R,
$$E[aX + b] = aE[X] + b.$$

Proof:
$$E[aX+b] = \int_{-\infty}^{\infty}(ax+b)f_X(x)\,dx = a\int_{-\infty}^{\infty} x f_X(x)\,dx + b\int_{-\infty}^{\infty} f_X(x)\,dx = aE[X] + b. \qquad \square$$

Definition: let X be a continuous random variable. Then the variance of X is defined to be

Var(X) = E[(X − µX )2 ]

where µX = E[X]. Equivalently, we have

Var(X) = E[X 2 ] − (E[X])2 .

Notice the same proof used for discrete random variables applies to continuous random variables as well. Similarly, we have
$$\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X).$$

5.3 Uniform Random Variable


Definition: a random variable X is uniformly distributed over (0, 1) if the probability density function is
$$f(x) = \begin{cases} 1 & 0 < x < 1, \\ 0 & \text{otherwise}. \end{cases}$$

Lemma 5.6 One can easily verify the following if X is uniformly distributed over (0, 1):

• If 0 < a < b < 1, then $P(a \le X \le b) = \int_a^b f(x)\,dx = b - a$.

• $E[X] = \int_0^1 x f(x)\,dx = \frac{1}{2}$.

• $E[X^2] = \int_0^1 x^2 f(x)\,dx = \frac{1}{3}$.

• $\mathrm{Var}(X) = E[X^2] - (E[X])^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}$.

Definition: a random variable X is uniformly distributed over (α, β) if the density function is
$$f(x) = \begin{cases} \frac{1}{\beta - \alpha} & \alpha < x < \beta, \\ 0 & \text{otherwise}. \end{cases}$$
Note that Y = aX + b (a > 0) is uniform if X is uniform. In particular, if X is uniform over (0, 1), then Y = aX + b (a > 0) is uniform over (b, a + b). Hence Y = (β − α)X + α is uniform over (α, β). Then

1. $E[Y] = (\beta-\alpha)E[X] + \alpha = (\beta-\alpha)\tfrac{1}{2} + \alpha = \tfrac{1}{2}(\alpha+\beta)$.

2. $\mathrm{Var}(Y) = (\beta-\alpha)^2\,\mathrm{Var}(X) = \tfrac{1}{12}(\beta-\alpha)^2$.

3. Moreover, the cumulative distribution function of Y is
$$F_Y(y) = \begin{cases} 0 & y \le \alpha, \\ (y-\alpha)/(\beta-\alpha) & \alpha < y < \beta, \\ 1 & y \ge \beta. \end{cases}$$

Bertrand's Paradox:
Bertrand's paradox is a probability problem that is not well posed. The problem states: consider a random chord of a circle; what is the probability that the length of the chord is greater than the side of the equilateral triangle inscribed in that circle? The problem cannot be solved as stated because we do not know what is meant by a random chord of a circle: different ways of selecting the chord at random yield different probabilities.

5.4 Normal Random Variables


Definition: Z is the standard normal random variable if the probability density function is

$$f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}.$$

Lemma 5.7
$$\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}.$$

Proof: Recall that
$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}.$$
We can prove this formula by calculating the volume of the solid under $z = e^{-x^2-y^2}$ using multivariable calculus and the shell method. Then, using integration by substitution ($x = \sqrt{2}\,u$), we have the desired result. This lemma also implies that the density of Z indeed integrates to 1, so Z is a genuine random variable. □

Lemma 5.8 E[Z] = 0.

Proof: $E[Z] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x e^{-x^2/2}\,dx = 0$ since $xe^{-x^2/2}$ is an odd function. □

Definition: the cumulative distribution function P(Z ≤ x) is given by
$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt.$$

Lemma 5.9 Φ(−x) = 1 − Φ(x).

Proof: Since the probability density function is even, then

Φ(−x) = P (Z ≤ −x) = P (Z ≥ x) = 1 − P (Z < x) = 1 − Φ(x).

Corollary 5.9.1 Suppose Z is the standard normal random variable and x > 0; then

• P(Z > x) = P(Z < −x);

• P(|Z| > x) = 2P(Z > x);

• P(|Z| < x) = 2P(Z < x) − 1.

Proposition 5.10 Suppose Z has probability density function $f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$. Then Var(Z) = E[Z²] = 1.

Proof: By definition, we have
$$E[Z^2] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^2 e^{-x^2/2}\,dx.$$
Let u = x and $dv/dx = xe^{-x^2/2}$. Then, integrating by parts,
$$E[Z^2] = \frac{1}{\sqrt{2\pi}}\left(\left[-xe^{-x^2/2}\right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-x^2/2}\,dx\right) = \frac{1}{\sqrt{2\pi}}\left(0 + \sqrt{2\pi}\right) = 1.$$
Since E[Z] = 0, we have Var(Z) = E[Z²] = 1. □

Definition: X is a normal variable with parameters (µ, σ²) if the probability density function of X is
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/(2\sigma^2)}.$$
One can verify that X = σZ + µ, so E[X] = µ and Var(X) = σ².


Conversely, if X is a normal variable with parameters (µ, σ 2 ), then Z = (X −µ)/σ is standard normally distributed.

Lemma 5.11 Suppose X is a normal variable with parameters (µ, σ 2 ), then Y = aX + b is a normal variable with
parameters (aµ + b, a2 σ 2 ).

Proof: Suppose Y = aX + b; since X is a normal variable with parameters (µ, σ²), X = σZ + µ. Hence Y = aσZ + aµ + b, thus Y is a normal variable with parameters (aµ + b, a²σ²). □

Proposition 5.12 Suppose X is a normal variable with parameters (µ, σ 2 ), then

P (X > µ + kσ) = 1 − Φ(k).

And if k > 0, then


P (µ − kσ < X < µ + kσ) = 2Φ(k) − 1.

Proof: Let Z = (X − µ)/σ; then Z is standard normally distributed, and
$$P(X > \mu + k\sigma) = P\left(\frac{X-\mu}{\sigma} > \frac{(\mu+k\sigma)-\mu}{\sigma}\right) = P(Z > k) = 1 - \Phi(k).$$
Next, notice that P(X < µ − kσ) = P(X > µ + kσ) = 1 − Φ(k), so
$$P(\mu - k\sigma < X < \mu + k\sigma) = 1 - 2[1 - \Phi(k)] = 2\Phi(k) - 1. \qquad \square$$

Theorem 5.13 (The De Moivre-Laplace Limit Theorem) Let Sn be a binomial random variable with parameters (n, p). Then
$$P\left(a \le \frac{S_n - np}{\sqrt{np(1-p)}} \le b\right) \to \Phi(b) - \Phi(a)$$
as n → ∞, where Φ is the standard normal cumulative distribution function. In approximations, P(Sn = i) is written as P(i − ½ < Sn < i + ½) (the continuity correction for integer-valued Sn).

Proof: The theorem follows from the fact that
$$\frac{S_n - np}{\sqrt{np(1-p)}}$$
is approximately standard normal, since Sn has mean np and variance np(1 − p). □
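A numerical sketch of the approximation with the continuity correction, using math.erf to evaluate Φ (the parameters are chosen for illustration):

```python
# Sketch: P(Sn = i) for Sn ~ binomial(n, p) vs the normal approximation
# over the corrected interval (i - 1/2, i + 1/2).
from math import comb, erf, sqrt

def Phi(x):  # standard normal cdf via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p, i = 100, 0.5, 52
mu, sigma = n * p, sqrt(n * p * (1 - p))
exact = comb(n, i) * p ** i * (1 - p) ** (n - i)
approx = Phi((i + 0.5 - mu) / sigma) - Phi((i - 0.5 - mu) / sigma)
print(exact, approx)   # both about 0.0735
```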

5.5 Exponential Random Variables


Definition: a random variable X is exponential with parameter λ > 0 if the probability density function is
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$
Then the cumulative distribution function of this random variable is
$$F(x) = \begin{cases} 1 - e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$

Lemma 5.14 Suppose X is an exponential random variable with parameter λ; if c > 0, then cX is an exponential random variable with parameter λ/c.

Proof: Let Y = cX; then clearly Y takes nonnegative values, so for y ≥ 0 we have
$$F_Y(y) = P(Y < y) = P(cX < y) = P\left(X < \frac{y}{c}\right) = F_X\left(\frac{y}{c}\right).$$
Hence
$$f_Y(y) = \frac{d}{dy} F_X\left(\frac{y}{c}\right) = \frac{\lambda}{c}\, e^{-\lambda y/c}.$$
So, indeed, Y is exponential with parameter λ/c. □

Proposition 5.15 Suppose X is an exponential random variable with parameter λ > 0 and Y = λX. Then

• E[X] = E[Y]/λ = 1/λ.

• Var(X) = Var(Y)/λ² = 1/λ².

Proof: P(Y > y) = P(X > y/λ) = e^{−λ(y/λ)} = e^{−y}, so Y is exponential with parameter 1. Then
$$E[Y] = \int_0^{\infty} P(Y>y)\,dy = \int_0^{\infty} e^{-y}\,dy = 1,$$
$$E[Y^2] = \int_0^{\infty} P(Y^2>y)\,dy = \int_0^{\infty} e^{-\sqrt{y}}\,dy = 2,$$
$$\mathrm{Var}(Y) = E[Y^2] - (E[Y])^2 = 2 - 1^2 = 1.$$
Hence the proposition follows. □

Proposition 5.16 If X is an exponential random variable with mean 1/λ, then
$$E[X^k] = \frac{k!}{\lambda^k}, \quad k = 1, 2, 3, \cdots.$$

Proof: Since X is exponential with parameter λ,
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$
Then, integrating by parts,
$$E[X^k] = \int_0^{\infty} x^k \lambda e^{-\lambda x}\,dx = \left[-x^k e^{-\lambda x}\right]_0^{\infty} + \int_0^{\infty} k x^{k-1} e^{-\lambda x}\,dx = \frac{k}{\lambda}\int_0^{\infty} x^{k-1}\lambda e^{-\lambda x}\,dx = \frac{k}{\lambda}\, E[X^{k-1}].$$
Then by induction we can show that $E[X^k] = \frac{k!}{\lambda^k}$. □

Definition: if P(X > s + t | X > t) = P(X > s) for all s, t ≥ 0, then X is said to be memoryless.
Equivalently, X is memoryless if and only if P(X > s + t) = P(X > s)P(X > t) for all s, t ≥ 0.

Lemma 5.17 Exponential random variables are memoryless.

Proof: Suppose X is the lifetime of a component that has survived for t hours. The probability that it survives at least another s hours is

P(X > s + t | X > t) = P(X > s + t)/P(X > t) = e^{−λ(s+t)}/e^{−λt} = e^{−λs} = P(X > s). □
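A small simulation can illustrate memorylessness (a Python/NumPy sketch; the rate λ = 0.5, the times s, t, and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 0.5, size=1_000_000)  # lambda = 0.5, so scale = 1/lambda

s, t = 1.0, 2.0
lhs = np.mean(x > s + t) / np.mean(x > t)   # estimates P(X > s+t | X > t)
rhs = np.mean(x > s)                        # estimates P(X > s)
print(lhs, rhs)  # both should be near exp(-0.5 * 1.0) ~ 0.6065
```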

Definition: X is a double exponential random variable with parameter λ > 0 if its probability density function is

f(x) = (1/2) λ e^{−λ|x|}, −∞ < x < ∞.

Lemma 5.18 Suppose F_X(x) is the cumulative distribution function of a double exponential random variable X with parameter λ > 0. Then

F_X(x) = (1/2) e^{λx} for x < 0, and F_X(x) = 1 − (1/2) e^{−λx} for x ≥ 0.

Proof: One can integrate and verify directly. □

Proposition 5.19 Suppose X is a double exponential random variable with parameter λ > 0, and let Y be exponentially distributed with parameter λ. Then

• E[X] = 0;

• E[X²] = E[Y²] = 2/λ²;

• Var(X) = 2/λ².

Proof: Let Y = |X|. Then for y ≥ 0,

P(Y > y) = P(X > y) + P(X < −y) = 2P(X > y) = 2 ∫_y^∞ (1/2) λe^{−λx} dx = e^{−λy}.

So Y is exponentially distributed with parameter λ, hence E[Y²] = 2/λ². Since Y² = |X|² = X², it is clear that E[X²] = E[Y²] = 2/λ². By simple integration we also get E[X] = 0 (the p.d.f. is even). Then Var(X) = E[X²] − (E[X])² = 2/λ². □

Definition: let X be a positive continuous random variable with density f and distribution function F, and let F̄(t) = 1 − F(t). Then λ(t) = f(t)/F̄(t) is called the hazard (failure) rate function. The interpretation is that if an object has functioned for time t, then λ(t) dt approximates the probability that it fails in the next instant of length dt.

Proposition 5.20 Let λ(s), s > 0, be the hazard rate function of a positive random variable X. Then

F(t) = 1 − exp(−∫₀^t λ(s) ds).

Proof:

∫₀^t λ(s) ds = ∫₀^t f(s)/(1 − F(s)) ds = [−ln(1 − F(s))]₀^t = −ln(1 − F(t)) + ln(1 − F(0)) = −ln(1 − F(t)),

which is equivalent to F(t) = 1 − exp(−∫₀^t λ(s) ds). □

Proposition 5.21 Suppose λ(t) is the hazard failure rate function of a random variable X. Then X is an expo-
nential random variable with parameter λ if and only if λ(t) = λ.

Proof: Suppose X is an exponential random variable with parameter λ. Then

λ(t) = f(t)/F̄(t) = λe^{−λt}/(1 − (1 − e^{−λt})) = λ.

Next, suppose λ(t) = λ. Then

F_X(t) = 1 − exp(−∫₀^t λ ds) = 1 − exp(−λt).

Hence X is an exponential random variable with rate λ. □

5.6 Other Continuous Distributions


5.6.1 Gamma Distribution

Definition: a random variable X is gamma with parameters (n, λ), λ > 0, if the density function is

f(t) = λe^{−λt} (λt)^{n−1}/(n − 1)!, t ≥ 0.

It is the distribution of the time at which the nth event of a Poisson process of rate λ occurs.
Definition: a random variable X is gamma with parameters (α, λ), α, λ > 0, if the density function is

f(t) = λe^{−λt} (λt)^{α−1}/Γ(α), t ≥ 0,

where

Γ(α) = ∫₀^∞ λe^{−λt} (λt)^{α−1} dt = ∫₀^∞ e^{−y} y^{α−1} dy

is the gamma function.

Recall that Γ(α + 1) = αΓ(α), for α > 0. And Γ(n) = (n − 1)! for positive integer n.

Suppose n is a positive integer. Then a gamma random variable X with parameters (n, λ) represents the waiting time until the nth event of a Poisson process with rate λ.

Lemma 5.22 Let X be gamma with parameters (α, λ). Then E[X] = α/λ.

Proof:

E[X] = (1/Γ(α)) ∫₀^∞ t · λe^{−λt} (λt)^{α−1} dt
     = (1/Γ(α)) ∫₀^∞ e^{−λt} (λt)^α dt
     = (1/(λΓ(α))) ∫₀^∞ λe^{−λt} (λt)^{(α+1)−1} dt
     = Γ(α + 1)/(λΓ(α))
     = α/λ. □

Lemma 5.23 Let X be gamma with parameters (α, λ). Then E[X²] = α(α + 1)/λ² and Var(X) = α/λ².

Proof:

E[X²] = (1/Γ(α)) ∫₀^∞ t² · λe^{−λt} (λt)^{α−1} dt
      = (1/(λΓ(α))) ∫₀^∞ e^{−λt} (λt)^{α+1} dt
      = (1/(λ²Γ(α))) ∫₀^∞ λe^{−λt} (λt)^{(α+2)−1} dt
      = Γ(α + 2)/(λ²Γ(α))
      = α(α + 1)/λ².

Hence it follows that

Var(X) = E[X²] − (E[X])² = α(α + 1)/λ² − α²/λ² = α/λ². □
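These moment formulas are easy to sanity-check by simulation (a Python/NumPy sketch; note that NumPy parametrizes its gamma sampler by shape and scale = 1/λ, and the parameter values below are arbitrary choices):

```python
import numpy as np

alpha, lam = 3.5, 2.0
rng = np.random.default_rng(1)
x = rng.gamma(shape=alpha, scale=1 / lam, size=1_000_000)

print(x.mean(), alpha / lam)     # E[X] = alpha/lambda
print(x.var(), alpha / lam**2)   # Var(X) = alpha/lambda^2
```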

5.6.2 Beta Distribution

Definition: a random variable X is beta with parameters (a, b), a, b > 0, if the density function is

f(x) = x^{a−1}(1 − x)^{b−1}/B(a, b), 0 < x < 1,

where

B(a, b) = ∫₀^1 x^{a−1}(1 − x)^{b−1} dx = Γ(a)Γ(b)/Γ(a + b).
It can be interpreted as the distribution of the success probability p of a trial, given that a successes and b failures were observed in the first a + b trials. For example, a beta distribution with parameters (40, 60) gives the distribution of the success rate p of a trial, given that performing the trial 100 times yielded 40 successes and 60 failures.

Lemma 5.24 If a = b = 1, then the beta random variable X with parameter (a, b) is uniform on (0, 1).

Proof: If a = b = 1, then f(x) = 1/B(1, 1) = 1 for 0 < x < 1. Hence X is uniform on (0, 1). □

Proposition 5.25 Suppose n, m ∈ ℕ, then

∫₀^1 xⁿ (1 − x)ᵐ dx = n! m!/(n + m + 1)!.

Proof: Let C(n, m) = ∫₀^1 xⁿ (1 − x)ᵐ dx. Using integration by parts, we have

C(n, m) = (m/(n + 1)) C(n + 1, m − 1).

Note that C(n, 0) = 1/(n + 1). Then using induction on m we can prove the identity. □

Lemma 5.26 B(a + 1, b) = (a/(a + b)) B(a, b).

Proof: Integration by parts. □

Proposition 5.27 Suppose X is a beta random variable with parameters (a, b), then

• E[X] = a/(a + b);

• E[X²] = (a + 1)a/((a + b + 1)(a + b));

• Var(X) = ab/((a + b)²(a + b + 1)).

Proof:

E[X] = ∫₀^1 x · x^{a−1}(1 − x)^{b−1}/B(a, b) dx = B(a + 1, b)/B(a, b) = a/(a + b).

E[X²] = ∫₀^1 x² · x^{a−1}(1 − x)^{b−1}/B(a, b) dx = B(a + 2, b)/B(a, b) = (a + 1)a/((a + b + 1)(a + b)).

Then the value of Var(X) follows from the previous two values. □
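Again these moments can be sanity-checked numerically (a Python/NumPy sketch; the parameters (40, 60) echo the example above, and the sample size is arbitrary):

```python
import numpy as np

a, b = 40, 60
rng = np.random.default_rng(2)
x = rng.beta(a, b, size=1_000_000)

print(x.mean(), a / (a + b))                          # E[X]
print(x.var(), a * b / ((a + b) ** 2 * (a + b + 1)))  # Var(X)
```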

5.7 Information on the Gamma and Beta Function


Definition: the Γ function is defined by

Γ(s) = ∫₀^{+∞} e^{−x} x^{s−1} dx,

where 0 < s < ∞.

Proposition 5.28 ∫₀^{+∞} e^{−x} x^{s−1} dx converges for s > 0; hence Γ(s) is well-defined.

Proof: We decompose the improper integral into two parts:

I₁ = ∫₀^1 e^{−x} x^{s−1} dx,  I₂ = ∫₁^∞ e^{−x} x^{s−1} dx.

First consider I₁. When s ≥ 1, I₁ is a definite integral, hence it converges; when 0 < s < 1,

e^{−x} x^{s−1} = x^{s−1}/eˣ < 1/x^{1−s},

and since 1 − s < 1, the integral comparison theorem shows that I₁ converges.


Now consider I₂. Since

lim_{x→∞} x² (e^{−x} x^{s−1}) = lim_{x→∞} x^{s+1}/eˣ = 0,

the limit comparison theorem shows that I₂ converges as well. Thus ∫₀^∞ e^{−x} x^{s−1} dx converges for all s > 0. □

The comparison test and limit comparison test for improper integrals:

• Suppose f(x), g(x) are continuous on [a, ∞) and 0 ≤ f(x) ≤ g(x) for x ≥ a. If ∫_a^∞ g(x) dx converges, then ∫_a^∞ f(x) dx converges; if ∫_a^∞ f(x) dx diverges, then ∫_a^∞ g(x) dx diverges.

• Suppose f(x) is continuous on [a, ∞) and f(x) ≥ 0. If there exists a constant p > 1 such that lim_{x→∞} xᵖ f(x) = c < ∞, then ∫_a^∞ f(x) dx converges; if lim_{x→∞} x f(x) = d > 0, then ∫_a^∞ f(x) dx diverges.

Lemma 5.29 Γ(s + 1) = sΓ(s).

Proof: Using integration by parts, one has

Γ(s + 1) = ∫₀^∞ e^{−x} xˢ dx = [−e^{−x} xˢ]₀^∞ + s ∫₀^∞ e^{−x} x^{s−1} dx = [0 − 0] + sΓ(s) = sΓ(s). □

Lemma 5.30 Γ(1) = 1.

Proof: Γ(1) = ∫₀^∞ e^{−t} dt = 1. □

Proposition 5.31 Γ(n + 1) = n!, for n ∈ N.

Proof: Using induction, we easily get the result. □

Lemma 5.32 As s → 0+ , we have Γ(s) → ∞.

Proof: The Γ function is continuous for all positive values of s (being a convergent integral); this will be used without proof. Since Γ(s) = Γ(s + 1)/s and Γ(s + 1) → Γ(1) = 1 as s → 0⁺, we get

lim_{s→0⁺} Γ(s) = lim_{s→0⁺} Γ(s + 1)/s = ∞. □

Lemma 5.33 Suppose f, g are convex function with the same domain, then f + g is convex.

Proof: Suppose f, g are convex functions on D. Let x, y ∈ D and λ ∈ [0, 1]. Then

(f + g)(λx + (1 − λ)y) = f(λx + (1 − λ)y) + g(λx + (1 − λ)y)
                       ≤ λf(x) + (1 − λ)f(y) + λg(x) + (1 − λ)g(y)
                       = λ(f + g)(x) + (1 − λ)(f + g)(y).

Hence f + g is convex. □

Proposition 5.34 ln Γ is convex on (0, ∞).

Proof: Let 1 < p < ∞ and 1/p + 1/q = 1. Applying Hölder's inequality, we obtain

Γ(x/p + y/q) = ∫₀^∞ t^{x/p + y/q − 1} e^{−t} dt
             = ∫₀^∞ (t^{(x−1)/p} e^{−t/p}) (t^{(y−1)/q} e^{−t/q}) dt
             ≤ (∫₀^∞ t^{x−1} e^{−t} dt)^{1/p} (∫₀^∞ t^{y−1} e^{−t} dt)^{1/q}
             = Γ(x)^{1/p} Γ(y)^{1/q}.

Hence

ln Γ(x/p + y/q) ≤ ln(Γ(x)^{1/p} Γ(y)^{1/q}) = (1/p) ln Γ(x) + (1/q) ln Γ(y).

This implies ln Γ is convex. □

Theorem 5.35 If f is a positive function on (0, ∞) such that

1. f (x + 1) = xf (x),

2. f (1) = 1,

3. ln f is convex,

then f (x) = Γ(x).

Proof: Since Γ satisfies 1, 2, 3, it is enough to prove that f(x) is uniquely determined by 1, 2, 3 for all x > 0. By 1, it is enough to do this for x ∈ (0, 1), as the remaining values are determined by the values of f on (0, 1).
Put φ = ln f. Then

φ(x + 1) = φ(x) + ln x (0 < x < ∞),

φ(1) = 0, and φ is convex. Suppose 0 < x < 1 and n is a positive integer; then φ(n + 1) = ln(n!). Consider the difference quotients of φ on the intervals [n, n + 1], [n + 1, n + 1 + x], [n + 1, n + 2]. Since φ is convex,

ln n ≤ (φ(n + 1 + x) − φ(n + 1))/x ≤ ln(n + 1).

Repeated application of φ(x + 1) = φ(x) + ln x gives

φ(n + 1 + x) = φ(x) + ln[x(x + 1) · · · (x + n)].

Thus

ln n ≤ (φ(x) + ln[x(x + 1) · · · (x + n)] − φ(n + 1))/x ≤ ln(n + 1).

Then by some algebraic manipulation, we have

0 ≤ φ(x) − ln( n! nˣ / (x(x + 1) · · · (x + n)) ) ≤ x ln(1 + 1/n).

The expression on the right tends to 0 as n → ∞; hence φ(x) is uniquely determined, and the proof is complete. □

Corollary 5.35.1 Suppose 0 < x < 1, then

Γ(x) = lim_{n→∞} n! nˣ / (x(x + 1) · · · (x + n)).

Proof: This is clear from the proof of the above theorem. □


Proposition 5.36 Γ(1/2) = √π.

Proof: By definition, Γ(s) = ∫₀^{+∞} e^{−x} x^{s−1} dx. Substituting x = u², dx = 2u du, we get

Γ(s) = 2 ∫₀^∞ e^{−u²} u^{2s−1} du.

Let t = 2s − 1, i.e., s = (1 + t)/2; then

∫₀^∞ e^{−u²} uᵗ du = (1/2) Γ((1 + t)/2).

When s = 1/2, t = 0, so

Γ(1/2) = 2 ∫₀^∞ e^{−u²} du = √π. □

Proposition 5.37 (Euler's reflection formula)

Γ(s)Γ(1 − s) = π/sin(πs)

for 0 < s < 1.

Proof: to read the proof of this or more related readings on the Γ function, visit https://en.wikipedia.org/
wiki/Gamma_function □

Theorem 5.38 If x > 0 and y > 0, then

∫₀^1 t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y).

This integral is also known as the beta function B(x, y).

Proof: Note that B(1, y) = 1/y, and using Hölder's inequality, one can show that ln B(x, y) is a convex function of x for each fixed y. We show that

B(x + 1, y) = (x/(x + y)) B(x, y).

Indeed, integrating by parts,

B(x + 1, y) = ∫₀^1 (t/(1 − t))ˣ (1 − t)^{x+y−1} dt
            = [−(t/(1 − t))ˣ · (1 − t)^{x+y}/(x + y)]₀^1 + (x/(x + y)) ∫₀^1 t^{x−1}(1 − t)^{y−1} dt
            = (x/(x + y)) B(x, y).

Then for each y, consider the function

f(x) = Γ(x + y) B(x, y)/Γ(y).

Then f(1) = 1, f(x + 1) = x f(x), and ln f(x) = ln B(x, y) + ln Γ(x + y) − ln Γ(y) is convex. Hence f(x) = Γ(x) by Theorem 5.35. This implies

B(x, y) = Γ(x)Γ(y)/Γ(x + y). □

Corollary 5.38.1 (Legendre's duplication formula)

Γ(x) = (2^{x−1}/√π) Γ(x/2) Γ((x + 1)/2).

Proof: Let

f(x) = (2^{x−1}/√π) Γ(x/2) Γ((x + 1)/2).

Note f(1) = (1/√π)Γ(1/2)Γ(1) = 1, f(x + 1) = 2 · (x/2) f(x) = x f(x), and

ln f(x) = (x − 1) ln 2 − ln √π + ln Γ(x/2) + ln Γ((x + 1)/2)

is convex, being a sum of an affine function and compositions of the convex function ln Γ with affine maps. This implies f(x) = Γ(x) on (0, ∞), completing the proof. □

Theorem 5.39 (Stirling's Formula) Stirling's formula provides a simple approximation to Γ(x + 1) when x is large:

lim_{x→∞} Γ(x + 1)/((x/e)ˣ √(2πx)) = 1.

Proof: Apply the change of variable t = x(1 + u) in the definition

Γ(x + 1) = ∫₀^∞ tˣ e^{−t} dt.

Then we get

Γ(x + 1) = x^{x+1} e^{−x} ∫_{−1}^∞ [(1 + u)e^{−u}]ˣ du.

Define h(u) so that h(0) = 1 and

(1 + u)e^{−u} = exp(−(u²/2) h(u))

for −1 < u < ∞, u ≠ 0. One can check that h(u) is indeed well defined, as

h(u) = (2/u²)[u − ln(1 + u)], u ≠ 0.

Since h(0) = 1, one can verify that h is continuous. It is also clear that h(u) decreases monotonically from ∞ to 0 as u increases from −1 to ∞.

Substituting u = s√(2/x), we get

Γ(x + 1) = xˣ e^{−x} √(2x) ∫_{−∞}^∞ ψ_x(s) ds,

where ψ_x(s) = exp(−s² h(s√(2/x))) for −√(x/2) < s < ∞, and ψ_x(s) = 0 for s ≤ −√(x/2).

Next one can verify the following facts:

1. For every s, ψ_x(s) → e^{−s²} as x → ∞.

2. The convergence in 1 is uniform on [−A, A] for every A < ∞.

3. When s < 0, then 0 < ψ_x(s) < e^{−s²}.

4. When s > 0 and x > 1, then 0 < ψ_x(s) < ψ₁(s).

5. ∫₀^∞ ψ₁(s) ds < ∞.

Then by uniform convergence, together with the bounds in 3–5, the integral converges to the integral of the limit. Since

∫_{−∞}^∞ e^{−s²} ds = √π,

we have

lim_{x→∞} Γ(x + 1)/((x/e)ˣ √(2πx)) = 1. □
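One can observe the convergence numerically (a Python sketch using the standard library's math.gamma; the sample points are arbitrary choices):

```python
import math

# The ratio Gamma(x+1) / ((x/e)^x * sqrt(2*pi*x)) should tend to 1 as x grows.
for x in [1, 5, 10, 50, 100]:
    ratio = math.gamma(x + 1) / ((x / math.e) ** x * math.sqrt(2 * math.pi * x))
    print(x, ratio)
```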

5.8 Distribution of a Function of a Random Variable


Recall that if X is a random variable then for any function g, g(X) is also a random variable.

Theorem 5.40 Let X be a continuous random variable and let Y = g(X), where g is strictly monotonic and differentiable. Then for y in the range of g,

f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)|.

Proof: We consider two cases. First, suppose g(x) is strictly increasing and differentiable. Then

F_Y(y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)).

So

f_Y(y) = F_Y′(y) = f_X(g^{−1}(y)) (d/dy) g^{−1}(y).

Now suppose g(x) is strictly decreasing and differentiable. Then

F_Y(y) = P(g(X) ≤ y) = P(X ≥ g^{−1}(y)) = 1 − F_X(g^{−1}(y)).

So

f_Y(y) = F_Y′(y) = −f_X(g^{−1}(y)) (d/dy) g^{−1}(y).

In both cases f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)|, since g^{−1} is increasing in the first case and decreasing in the second. □

Corollary 5.40.1 Suppose g is strictly increasing, and X, Y are continuous random variables such that Y = g(X); then F_Y(y) = F_X(g^{−1}(y)). Suppose g is strictly decreasing and Y = g(X); then F_Y(y) = 1 − F_X(g^{−1}(y)).

Proof: This follows from the proof of the theorem. □

Definition: let X be a normal random variable with parameters (µ, σ 2 ). Then Y = eX is called lognormal with
parameters (µ, σ 2 ).

Proposition 5.41 Let X be a normal random variable with parameters (µ, σ²). Then the probability density function of Y = e^X is given by

f_Y(y) = (1/(√(2π) σ y)) exp{−(ln y − µ)²/(2σ²)}, y > 0.

Proof: The density of X is

f_X(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}.

Let g(x) = eˣ. Then g^{−1}(y) = ln y and (d/dy) g^{−1}(y) = 1/y. For y > 0,

f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)| = f_X(ln y) (1/y) = (1/(√(2π) σ y)) exp{−(ln y − µ)²/(2σ²)}. □
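The transformed density can be checked against an empirical histogram (a Python/NumPy sketch; the parameters µ, σ, the test point y₀, and the bin width h are arbitrary choices):

```python
import math
import numpy as np

mu, sigma = 0.5, 0.8
rng = np.random.default_rng(3)
y = np.exp(rng.normal(mu, sigma, size=1_000_000))  # Y = e^X with X normal

def f_Y(v):
    # Lognormal density derived above.
    return math.exp(-(math.log(v) - mu) ** 2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma * v)

y0, h = 2.0, 0.05
# Fraction of samples in a small bin, divided by its width, estimates the density.
print(np.mean((y > y0) & (y < y0 + h)) / h, f_Y(y0))
```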

Lemma 5.42 Suppose Y is lognormal with parameters (µ, σ 2 ), if c > 0, then cY is lognormal with parameters
(µ + ln c, σ 2 ).

Proof: Suppose Y = eX , where X is normal with parameters (µ, σ 2 ), then cY = eX+ln c , and clearly X + ln c is
normal with parameters (µ + ln c, σ 2 ). □

Proposition 5.43 Let Z be a standard normal random variable, and let g be a differentiable function with derivative g′ such that

lim_{x→±∞} g(x) e^{−x²/2} = 0.

Then

1. E[g ′ (Z)] = E[Zg(Z)];

2. E[Z n+1 ] = nE[Z n−1 ].

Proof: Let f(z) denote the probability density function of Z, so

f(z) = (1/√(2π)) e^{−z²/2}.

1. Integrating by parts,

E[g′(Z)] = ∫_{−∞}^∞ g′(z) f(z) dz
         = (1/√(2π)) ∫_{−∞}^∞ e^{−z²/2} g′(z) dz
         = (1/√(2π)) ( [e^{−z²/2} g(z)]_{−∞}^∞ − ∫_{−∞}^∞ (−z e^{−z²/2}) g(z) dz )
         = (1/√(2π)) ∫_{−∞}^∞ z e^{−z²/2} g(z) dz
         = E[Zg(Z)],

where the boundary term vanishes by the hypothesis on g.

2. Let g(x) = xⁿ, so g′(x) = n x^{n−1}. Then

E[Z^{n+1}] = E[Zg(Z)] = E[g′(Z)] = E[nZ^{n−1}] = nE[Z^{n−1}]. □

Proposition 5.44 Let X be a nonnegative continuous random variable. Then

E[Xⁿ] = ∫₀^∞ n t^{n−1} P(X > t) dt.

Proof: Note that xⁿ = ∫₀^x n t^{n−1} dt. Then

E[Xⁿ] = ∫₀^∞ xⁿ f(x) dx
      = ∫₀^∞ ∫₀^x n t^{n−1} f(x) dt dx
      = ∫₀^∞ ∫_t^∞ n t^{n−1} f(x) dx dt   (changing the order of integration)
      = ∫₀^∞ n t^{n−1} P(X > t) dt. □

Corollary 5.44.1 If X is a nonnegative continuous random variable, then P(X > a) ≤ E[Xⁿ]/aⁿ for any a > 0 and positive integer n.

Proof: It suffices to show that aⁿ P(X > a) ≤ E[Xⁿ]. Using the same argument as before,

aⁿ P(X > a) = ∫₀^a n t^{n−1} P(X > a) dt ≤ ∫₀^a n t^{n−1} P(X > t) dt ≤ E[Xⁿ],

since P(X > a) ≤ P(X > t) for t ≤ a. □
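The moment bound is easy to see in a simulation (a Python/NumPy sketch; the exponential distribution, a = 3, and n = 2 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=1_000_000)  # nonnegative, with E[X^n] = n!

a, n = 3.0, 2
print(np.mean(x > a))         # empirical P(X > a), about exp(-3) ~ 0.0498
print(np.mean(x**n) / a**n)   # bound E[X^n]/a^n ~ 2/9; it should dominate
```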

6 Jointly Distributed Random Variables
6.1 Joint Cumulative Distribution Function
Definition: let X and Y be two random variables, their joint cumulative probability distribution function is
FX,Y (a, b) = P (X ≤ a, Y ≤ b).

Proposition 6.1 Let F (a, b) = P (X ≤ a, Y ≤ b) be the joint cumulative distribution function of X and Y . Then
it can be used to generate all probability involving X and Y :

• P(a₁ < X ≤ a₂, Y ≤ b) = P(X ≤ a₂, Y ≤ b) − P(X ≤ a₁, Y ≤ b) = F(a₂, b) − F(a₁, b).

• P(X ≤ a, b₁ < Y ≤ b₂) = P(X ≤ a, Y ≤ b₂) − P(X ≤ a, Y ≤ b₁) = F(a, b₂) − F(a, b₁).

• P(a₁ < X ≤ a₂, b₁ < Y ≤ b₂) = P(X ≤ a₂, b₁ < Y ≤ b₂) − P(X ≤ a₁, b₁ < Y ≤ b₂) = F(a₂, b₂) − F(a₂, b₁) − F(a₁, b₂) + F(a₁, b₁).

• F_X(a) = P(X ≤ a, Y ∈ ℝ) = F(a, ∞).

• F_Y(b) = P(X ∈ ℝ, Y ≤ b) = F(∞, b).

• P(X > a, Y > b) = 1 − F_X(a) − F_Y(b) + F(a, b).

Thus F(a, b) determines P((X, Y) ∈ C) for any Borel set C ⊆ ℝ².

Proof: This is clear from the definition of the joint cumulative probability distribution function. □

Definition: suppose X and Y are discrete random variables, then we use their joint probability mass function:

pX,Y (x, y) = P (X = x, Y = y).

Suppose Y takes values y₁, y₂, · · · . Then

P(X = x) = Σ_j P(X = x, Y = y_j)

is the mass function of X.

Suppose X takes values x₁, x₂, · · · . Then

P(Y = y) = Σ_i P(X = x_i, Y = y)

is the mass function of Y.


Definition: the probability mass functions of X and Y obtained in this way are called marginal probability mass functions:

p_X(x) = Σ_j p_{X,Y}(x, y_j),  p_Y(y) = Σ_i p_{X,Y}(x_i, y).

Definition: suppose X and Y are continuous random variables. We say X and Y are jointly continuous if there is a function f_{X,Y}(x, y) ≥ 0, called their joint probability density function, such that

P((X, Y) ∈ C) = ∫∫_C f_{X,Y}(x, y) dx dy

for any C ⊆ ℝ².


Then

F_{X,Y}(a, b) = P(X ≤ a, Y ≤ b) = ∫_{−∞}^a ∫_{−∞}^b f_{X,Y}(x, y) dy dx,

and

f_{X,Y}(a, b) = ∂²/∂a∂b F_{X,Y}(a, b).

Lemma 6.2 Suppose X and Y are jointly continuous with joint probability density function f(x, y). Then

P(X ∈ A, Y ∈ B) = ∫_A ∫_B f(x, y) dy dx = ∫_B ∫_A f(x, y) dx dy.

For small ϵ₁, ϵ₂,

P(a ≤ X ≤ a + ϵ₁, b ≤ Y ≤ b + ϵ₂) ≈ f(a, b) ϵ₁ ϵ₂.

Proof: This is by definition of joint probability distribution function and two variable integration. □

Definition: suppose X and Y are jointly continuous with joint probability density function f(x, y). Let A ⊆ ℝ; then

P(X ∈ A) = P(X ∈ A, Y ∈ ℝ) = ∫_A ∫_{−∞}^∞ f(x, y) dy dx.

So X is a continuous random variable with probability density function

f_X(x) = ∫_{−∞}^∞ f(x, y) dy,

which is known as the marginal probability density function. Similarly, for B ⊆ ℝ we have

P(Y ∈ B) = P(X ∈ ℝ, Y ∈ B) = ∫_B ∫_{−∞}^∞ f(x, y) dx dy,

so Y is a continuous random variable with probability density function

f_Y(y) = ∫_{−∞}^∞ f(x, y) dx.

6.2 Joint Distribution of Random Variables


Definition: let X1 , X2 , · · · , Xn be random variables, then their joint cumulative probability distribution function is

F (a1 , a2 , · · · , an ) = P (X1 ≤ a1 , X2 ≤ a2 , · · · , Xn ≤ an ).

Definition: if X1 , X2 , · · · , Xn are discrete random variables, their joint probability mass function is

p(a1 , a2 , · · · , an ) = P (X1 = a1 , X2 = a2 , · · · , Xn = an ).

Definition: X₁, X₂, · · · , Xₙ are said to be jointly continuous if there is a joint probability density function f such that

P((X₁, · · · , Xₙ) ∈ C) = ∫ · · · ∫_C f(x₁, · · · , xₙ) dx₁ · · · dxₙ

for any C ⊆ ℝⁿ.

Definition: suppose that n independent and identical experiments are performed, where each experiment results in exactly one of r possible outcomes with respective probabilities p₁, · · · , p_r, Σ_{i=1}^r p_i = 1. Let X_i be the number of experiments that result in the ith outcome. Then X₁, · · · , X_r have the multinomial distribution with joint mass function

p(n₁, n₂, · · · , n_r) = (n choose n₁, · · · , n_r) p₁^{n₁} · · · p_r^{n_r},  n = Σ_{i=1}^r n_i.
i=1

6.3 Independent Random Variables


Definition: random variables X and Y are independent if the events X ∈ A and Y ∈ B are independent for any
A, B ⊆ R, i.e.,
P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B).

In particular
P (X ≤ a, Y ≤ b) = P (X ≤ a)P (Y ≤ b).

Theorem 6.3 Random variables X and Y are independent if and only if for any a, b ∈ R, we have

FX,Y (a, b) = FX (a)FY (b).

Theorem 6.4 Discrete random variables X and Y are independent if and only if p_{X,Y}(a, b) = p_X(a)p_Y(b) for all a, b. If this is the case, then

E[XY] = E[X]E[Y] and Var(X + Y) = Var(X) + Var(Y).

Proof: Since X and Y are independent, for any a, b, let A = {a} and B = {b}; we have P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B), so p_{X,Y}(a, b) = p_X(a)p_Y(b). On the other hand, if p_{X,Y}(a, b) = p_X(a)p_Y(b), then for any A, B, we have

P(X ∈ A, Y ∈ B) = Σ_{x∈A} Σ_{y∈B} p_{X,Y}(x, y)
               = Σ_{x∈A} Σ_{y∈B} p_X(x)p_Y(y)
               = Σ_{x∈A} p_X(x) Σ_{y∈B} p_Y(y)
               = P(X ∈ A)P(Y ∈ B).

Then we have the following:


XX
E[XY ] = abpX,Y (a, b)
a b
X X
= a b · pX (a)pY (b)
a b
X X
= apX (a) · bpY (b)
a b

= E[X]E[Y ]

On the other hand,

Var(X + Y) = E[(X + Y)²] − (E[X + Y])²
           = E[X²] + E[Y²] + 2E[XY] − (E[X])² − (E[Y])² − 2E[X]E[Y]
           = E[X²] − (E[X])² + E[Y²] − (E[Y])²
           = Var(X) + Var(Y). □

Theorem 6.5 Jointly continuous random variables X, Y are independent if and only if

fX,Y (x, y) = fX (x)fY (y)

for all x, y.

Proof: ⇒: if X and Y are independent, then FX,Y (x, y) = FX (x)FY (y), so

∂2
fX,Y (x, y) = FX,Y (x, y) = FX′ (x)FY′ (y) = fX (x)fY (y).
∂x∂y

⇐: if f_{X,Y}(x, y) = f_X(x)f_Y(y), then

F_{X,Y}(a, b) = ∫_{−∞}^a ∫_{−∞}^b f_{X,Y}(x, y) dy dx
             = ∫_{−∞}^a ∫_{−∞}^b f_X(x)f_Y(y) dy dx
             = ∫_{−∞}^a f_X(x) dx · ∫_{−∞}^b f_Y(y) dy
             = F_X(a)F_Y(b). □

Definition: random variables X1 , · · · , Xn are independent if for any Ai ⊆ R, we have

P (X1 ∈ A1 , · · · , Xn ∈ An ) = P (X1 ∈ A1 ) · · · P (Xn ∈ An ).

Theorem 6.6 X1 , · · · , Xn are independent if and only if for any ai ∈ R,

F (a1 , · · · , an ) = FX1 (a1 ) · · · FXn (an ).

That is, if X1 , · · · , Xn are discrete random variables, then they are independent if and only if for any ai ∈ R,

p(a1 , · · · , an ) = pX1 (a1 ) · · · pXn (an ).

If X1 , · · · , Xn are jointly continuous, then they are independent if and only if for any ai ∈ R,

f (a1 , · · · , an ) = fX1 (a1 ) · · · fXn (an ).

Proposition 6.7 Suppose X₁, X₂, · · · , Xₙ are jointly continuous or discrete random variables. Then X₁, X₂, · · · , Xₙ are independent if and only if their joint probability density (or mass) function f(x₁, · · · , xₙ) can be written as

f(x₁, · · · , xₙ) = Π_{i=1}^n g_i(x_i)

for nonnegative functions g_i(x), i = 1, · · · , n.

Proof: The case of jointly discrete random variables is straightforward; we consider the jointly continuous case.
Suppose X₁, · · · , Xₙ are independent, and let f_{X_i}(t) be their respective probability density functions. Then

f(x₁, · · · , xₙ) = Π_{i=1}^n f_{X_i}(x_i),

which has the required form. Conversely, suppose f(x₁, · · · , xₙ) = Π_{i=1}^n g_i(x_i). Integrating out all variables except x_i gives

f_{X_i}(x_i) = g_i(x_i) · Π_{j≠i} ∫_{−∞}^∞ g_j(x_j) dx_j = C_i g_i(x_i).

Since

∫ · · · ∫ Π_{i=1}^n g_i(x_i) dx₁ · · · dxₙ = 1 = ∫ · · · ∫ Π_{i=1}^n f_{X_i}(x_i) dx₁ · · · dxₙ,

we conclude that the product of the C_i's is one; hence f(x₁, · · · , xₙ) = Π_{i=1}^n f_{X_i}(x_i) and X₁, · · · , Xₙ are independent. □

6.4 Sums of Independent Random Variables

Proposition 6.8 Let X and Y be independent integer-valued random variables. Then

P(X + Y = n) = Σ_{i+j=n} P(X = i)P(Y = j).

Proof:

P(X + Y = n) = Σ_i P(X = i, Y = n − i) = Σ_i P(X = i)P(Y = n − i) = Σ_{i+j=n} P(X = i)P(Y = j). □

Corollary 6.8.1 Let X and Y be independent integer-valued random variables. Let g(z) = Σ_{i=−∞}^∞ p_X(i) zⁱ and h(z) = Σ_{j=−∞}^∞ p_Y(j) zʲ. Then

g(z)h(z) = Σ_{n=−∞}^∞ p_{X+Y}(n) zⁿ.

Proof:

g(z)h(z) = (Σ_{i=−∞}^∞ p_X(i) zⁱ)(Σ_{j=−∞}^∞ p_Y(j) zʲ)
         = Σ_{n=−∞}^∞ (Σ_{i+j=n} p_X(i) p_Y(j)) zⁿ
         = Σ_{n=−∞}^∞ p_{X+Y}(n) zⁿ. □

Proposition 6.9 Let X and Y be independent continuous random variables. Then

F_{X+Y}(a) = ∫_{−∞}^∞ F_X(a − y) f_Y(y) dy = ∫_{−∞}^∞ F_X(a − y) dF_Y(y).

That is, the cumulative distribution function of X + Y is the convolution of F_X and F_Y: F_{X+Y} = F_X ∗ F_Y.

In addition,

f_{X+Y}(a) = ∫_{−∞}^∞ f_X(a − y) f_Y(y) dy.

Proof:

F_{X+Y}(a) = P(X + Y ≤ a) = ∫∫_{x+y≤a} f_{X,Y}(x, y) dx dy
           = ∫_{−∞}^∞ ∫_{−∞}^{a−y} f_X(x)f_Y(y) dx dy   (by independence)
           = ∫_{−∞}^∞ F_X(a − y) f_Y(y) dy
           = ∫_{−∞}^∞ F_X(a − y) dF_Y(y).

Then differentiating with respect to a gives the probability density function of X + Y:

f_{X+Y}(a) = ∫_{−∞}^∞ f_X(a − y) f_Y(y) dy. □

6.4.1 Sum of Binomial

Theorem 6.10 Let X1 , X2 , · · · , Xk be independent binomial random variables with parameters (n1 , p), (n2 , p),
· · · , (nk , p) respectively, then X1 +X2 +· · ·+Xk is a binomial random variable with parameter (n1 +n2 +· · ·+nk , p).

Proof: First show that the sum of an independent binomial random variable with parameters (m, p) and a Bernoulli random variable with parameter p is a binomial random variable with parameters (m + 1, p). Then the sum of independent binomial random variables with parameters (m₁, p) and (m₂, p) is binomial with parameters (m₁ + m₂, p), and induction gives the desired result. □

6.4.2 Sum of Poisson

Theorem 6.11 If X1 , X2 , · · · , Xr are independent Poisson random variable with parameters λ1 , · · · , λr , then X1 +
· · · + Xr is Poisson with parameters λ1 + · · · + λr .

Proof: Prove using induction. □

6.4.3 Sum of Uniform

Proposition 6.12 Let X and Y be independent uniform random variables on (0, 1). Then

f_{X+Y}(a) = a if 0 ≤ a ≤ 1; f_{X+Y}(a) = 2 − a if 1 < a < 2; and f_{X+Y}(a) = 0 otherwise.

Proof: Recall f_{X+Y}(a) = ∫_{−∞}^∞ f_X(a − y) f_Y(y) dy; then consider cases based on the value of a. □

Proposition 6.13 Let X₁, X₂, · · · , Xₙ, · · · be independent uniform random variables on (0, 1). Let Fₙ be the cumulative distribution function of X₁ + · · · + Xₙ. Then

Fₙ(x) = P(X₁ + · · · + Xₙ ≤ x) = xⁿ/n!

for 0 ≤ x ≤ 1.

Proof: We proceed by induction. For n = 1 it is clear that F₁(x) = P(X₁ ≤ x) = x, so the statement holds. Assume the statement holds for some n ∈ ℕ; we consider the case n + 1. For 0 ≤ x ≤ 1, the convolution formula gives

F_{n+1}(x) = ∫₀^x Fₙ(x − t) · 1 dt = ∫₀^x (x − t)ⁿ/n! dt = x^{n+1}/(n + 1)!.

Hence Fₙ(x) = xⁿ/n! for all n by induction. □

Corollary 6.13.1 Let N be the minimum integer n such that X1 + X2 + · · · + Xn > 1. Then E[N ] = e, i.e., the
expected value of the number of independent uniform (0, 1) random variable that must be summed for the sum to
exceed 1 is e.
In addition, the expected value of the number of independent uniform (0, 1) random variables that must be summed
for the sum to exceed x (0 ≤ x ≤ 1) is ex .

Proof: N > n ⇔ X₁ + · · · + Xₙ ≤ 1, so P(N > n) = Fₙ(1) = 1/n!. Thus

E[N] = Σ_{i=1}^∞ P(N ≥ i) = Σ_{n=0}^∞ P(N > n) = Σ_{n=0}^∞ 1/n! = e.

Similarly we obtain the second conclusion. □
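The fact that E[N] = e is fun to confirm by simulation (a Python/NumPy sketch; the number of trials is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)
trials = 100_000
counts = np.empty(trials)
for k in range(trials):
    total, n = 0.0, 0
    while total <= 1.0:        # keep summing uniforms until the sum exceeds 1
        total += rng.random()
        n += 1
    counts[k] = n
print(counts.mean())           # should be close to e ~ 2.71828
```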

6.4.4 Sum of Gamma


Recall that an (α, λ)-gamma random variable has probability density function f(x) = λe^{−λx}(λx)^{α−1}/Γ(α), x > 0.

Proposition 6.14 Let X and Y be independent gamma random variables with parameters (α, λ) and (β, λ). Then

f_{X+Y}(a) = (B(α, β)/(Γ(α)Γ(β))) λe^{−λa} (λa)^{α+β−1}, a > 0.

Proof: By the convolution formula,

f_{X+Y}(a) = ∫₀^a (λe^{−λ(a−y)}[λ(a − y)]^{α−1} · λe^{−λy}(λy)^{β−1})/(Γ(α)Γ(β)) dy
           = (λ^{α+β}/(Γ(α)Γ(β))) e^{−λa} ∫₀^a (a − y)^{α−1} y^{β−1} dy   (let y = ax)
           = (λ^{α+β}/(Γ(α)Γ(β))) e^{−λa} a^{α+β−1} ∫₀^1 (1 − x)^{α−1} x^{β−1} dx
           = (B(α, β)/(Γ(α)Γ(β))) λe^{−λa} (λa)^{α+β−1}, a > 0. □

Let Z be a gamma random variable with parameters (α + β, λ); it has probability density function

f_Z(x) = λe^{−λx}(λx)^{α+β−1}/Γ(α + β), x > 0.

Since f_{X+Y} is proportional to f_Z and both integrate to 1, we have f_{X+Y} = f_Z (so X + Y is gamma with parameters (α + β, λ)), and comparing the constants we conclude that

B(α, β) = Γ(α)Γ(β)/Γ(α + β).

Proposition 6.15 Let Z be a standard normal random variable and Y = Z². Then

f_Y(y) = ((1/2) e^{−y/2} ((1/2)y)^{1/2−1})/√π, y > 0.

That is, Y = Z² is a gamma random variable with parameters (1/2, 1/2), and Γ(1/2) = √π.

Proof: For small ϵ > 0,

ϵ f_Y(y) ≈ P(y ≤ Y ≤ y + ϵ) = 2P(√y ≤ Z ≤ √(y + ϵ)) ≈ 2P(√y ≤ Z ≤ √y + ϵ/(2√y)) ≈ 2f_Z(√y) · ϵ/(2√y).

Dividing both sides by ϵ, we get

f_Y(y) = f_Z(√y)/√y = (1/√(2πy)) e^{−y/2} = ((1/2) e^{−y/2} ((1/2)y)^{1/2−1})/√π, y > 0.

So f_Y is a gamma density with parameters (1/2, 1/2). Then by comparing the normalizing constants, we get Γ(1/2) = √π. □

Definition: let Z₁, Z₂, · · · , Zₙ be independent standard normal random variables. Then X²ₙ = Z₁² + · · · + Zₙ² is a gamma random variable with parameters (n/2, 1/2). This is called the chi-squared random variable with n degrees of freedom.
Note that X²ₙ is a gamma random variable as it is a sum of n gamma random variables with common rate 1/2; by induction, X²ₙ has parameters (n/2, 1/2).


6.4.5 Sum of Normals

Proposition 6.16 Let X and Y be independent normal random variables with parameters (µ1 , σ12 ) and (µ2 , σ22 )
respectively. Then X + Y is a normal random variable with parameters (µ1 + µ2 , σ12 + σ22 ).

Proof: We first prove the case where S and Z are independent normal random variables, S with parameters (0, σ²) and Z with parameters (0, 1), i.e., Z is standard normal. Then S + Z is normal with parameters (0, 1 + σ²):

f_{S+Z}(a) = ∫_{−∞}^∞ f_S(a − z) f_Z(z) dz
           = ∫_{−∞}^∞ (1/(2πσ)) e^{−(a−z)²/(2σ²)} e^{−z²/2} dz
           = C exp(−a²/(2(1 + σ²))),

where C is a constant that does not depend on a. Since f_{S+Z} is a density, C must be the appropriate normal normalizing constant, so S + Z is normal with mean 0 and variance 1 + σ².
Now for the general case, suppose X, Y are normal with parameters (µ₁, σ₁²) and (µ₂, σ₂²) respectively. Then

X + Y = σ₂((X − µ₁)/σ₂ + (Y − µ₂)/σ₂) + µ₁ + µ₂,

where (X − µ₁)/σ₂ is normal with parameters (0, σ₁²/σ₂²) and (Y − µ₂)/σ₂ is standard normal and independent of it. Applying the result above, X + Y is normal with mean µ₁ + µ₂ and variance σ₁² + σ₂². □

Corollary 6.16.1 Suppose X_i, i = 1, 2, · · · , n, are independent random variables, normally distributed with parameters (µ_i, σ_i²) respectively. Then X₁ + X₂ + · · · + Xₙ is normally distributed with parameters (Σ_{i=1}^n µ_i, Σ_{i=1}^n σ_i²).

Proof: Apply induction. □

6.4.6 Sum of Exponential

Remark: an exponential random variable with parameter λ is a gamma random variable with parameters (1, λ). Hence, if X₁, X₂, · · · , Xₙ are independent identically distributed exponential random variables with parameter λ, then Y = X₁ + X₂ + · · · + Xₙ is a gamma random variable with parameters (n, λ), by Proposition 6.14 and induction.

6.5 Conditional Distribution


Definition: let X and Y be discrete random variables. The conditional probability mass function of X given Y = b is

p_{X|Y}(x|b) = p_{X,Y}(x, b)/p_Y(b),

provided that p_Y(b) > 0.
Note that if X and Y are independent, then pX|Y (x|b) = pX (x).

Definition: let X and Y be jointly continuous random variables. The conditional probability density function of X given Y = y is

f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y),

if f_Y(y) > 0.
Note that if X and Y are independent, then f_{X|Y}(x|y) = f_X(x).

Proposition 6.17 Let X and Y be discrete random variables such that p_Y(y) > 0. Then

F_{X|Y}(x|y) = Σ_{a≤x} p_{X|Y}(a|y).

Proof: This is the case because

F_{X|Y}(x|y) = P(X ≤ x|Y = y) = Σ_{a≤x} P(X = a|Y = y) = Σ_{a≤x} p_{X|Y}(a|y). □

Proposition 6.18 Let X and Y be jointly continuous random variables. Then

P(X ∈ A|Y = y) = ∫_A f_{X|Y}(x|y) dx.

In particular, the conditional cumulative distribution function of X given Y = y is

F_{X|Y}(a|y) = P(X ≤ a|Y = y) = ∫_{−∞}^a f_{X|Y}(x|y) dx.

Proof: This is clear from the definition of conditional probability. □

Proposition 6.19 Let Y and Z be independent, where Y is chi-squared with 1 degree of freedom and Z is standard normal. Let T = Z/√Y. Then the joint probability density of T and Y is

f_{T,Y}(t, y) = f_Y(y) f_{T|Y}(t|y) = (1/(2π)) e^{−y(t²+1)/2}, y > 0.

The probability density function of T is

f_T(t) = ∫₀^∞ (1/(2π)) e^{−y(t²+1)/2} dy = 1/(π(t² + 1)).

Proof: Since Y and Z are independent, the conditional distribution of T given Y = y is the distribution of (1/√y)Z, which is normal with mean 0 and variance 1/y. Hence, the conditional density of T given Y = y is

f_{T|Y}(t|y) = (1/√(2π/y)) e^{−t²y/2}, −∞ < t < ∞.

Multiplying by f_Y(y) = (1/√(2πy)) e^{−y/2} gives the joint density, and integrating out y gives f_T. □

Definition: let Y and Z be independent, where Y is chi-squared with n degrees of freedom and Z is standard normal. Then T = Z/√(Y/n) has a t-distribution with n degrees of freedom.

Lemma 6.20 Let X be a continuous and N a discrete random variable. Then

P(a ≤ X ≤ a + ϵ | N = k) ≈ P(N = k | X = a) f(a) ϵ / P(N = k).

Proof:

P(a ≤ X ≤ a + ϵ | N = k) = P(a ≤ X ≤ a + ϵ, N = k)/P(N = k)
                          = P(N = k | a ≤ X ≤ a + ϵ) P(a ≤ X ≤ a + ϵ)/P(N = k)
                          ≈ P(N = k | X = a) f(a) ϵ / P(N = k). □

Definition: let X be a continuous and N a discrete random variable. Then the conditional probability density function of X given that N = k is

f_{X|N}(x|k) = (P(N = k|X = x)/P(N = k)) f_X(x).

Definition: let X be continuous and N a discrete random variable. Then the conditional probability mass function of N given that X = x is

p_{N|X}(n|x) = (f_{X|N}(x|n)/f_X(x)) P(N = n).
Definition: let X be a discrete random variable. For any A ⊆ ℝ, the conditional probability mass function of X given X ∈ A is defined to be

P(X = a | X ∈ A) = P(X = a)/P(X ∈ A) if a ∈ A, and 0 if a ∉ A.

Definition: let X be a continuous random variable. For A ⊆ ℝ, the conditional probability density function of X given X ∈ A is defined to be

f_{X|X∈A}(x) = f_X(x)/∫_A f_X(x) dx for x ∈ A.

6.6 Joint Distribution of Functions


Definition: let X₁ and X₂ be jointly continuous random variables. Suppose Y₁ = g₁(X₁, X₂) and Y₂ = g₂(X₁, X₂), such that the transformation (x₁, x₂) ↦ (y₁, y₂) is one-to-one. We define the Jacobian of (x₁, x₂) ↦ (y₁, y₂) to be the determinant

J(x₁, x₂) = (∂y₁/∂x₁)(∂y₂/∂x₂) − (∂y₁/∂x₂)(∂y₂/∂x₁),

and we assume J(x₁, x₂) is nonzero and continuous at all (x₁, x₂).

Theorem 6.21 The joint probability density function of Y₁ = g₁(X₁, X₂), Y₂ = g₂(X₁, X₂) in this case is given by

f_{Y₁,Y₂}(y₁, y₂) = f_{X₁,X₂}(x₁, x₂) |J(x₁, x₂)|^{−1},

where g₁(x₁, x₂) = y₁ and g₂(x₁, x₂) = y₂.

Proof: This is done by the change of coordinates. □

Similarly, using the n-dimensional Jacobian determinant, we can find the joint density of Y₁, Y₂, · · · , Yₙ from that of X₁, X₂, · · · , Xₙ, if Y_i = g_i(X₁, · · · , Xₙ).
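A concrete instance (a Python/NumPy sketch, not from the notes): take independent standard normals X₁, X₂ and Y₁ = X₁ + X₂, Y₂ = X₁ − X₂, so J = −2 and the transformed density factors into two independent N(0, 2) densities. The simulation below checks the predicted variances and the vanishing correlation:

```python
import numpy as np

rng = np.random.default_rng(6)
x1 = rng.standard_normal(1_000_000)
x2 = rng.standard_normal(1_000_000)
y1, y2 = x1 + x2, x1 - x2

# Theorem 6.21 predicts Y1, Y2 are independent, each N(0, 2).
print(y1.var(), y2.var())          # both near 2
print(np.corrcoef(y1, y2)[0, 1])   # near 0
```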

7 Expectation of Random Variables
7.1 Extra Properties of Expectation

Proposition 7.1 Let X be a nonnegative integer-valued random variable. Then

E[X] = Σ_{i=1}^∞ P(X ≥ i).

Proof: Note that for any nonnegative values a₁, a₂, · · · ,

Σ_{j=1}^∞ (a₁ + · · · + a_j) P(X = j) = Σ_{i=1}^∞ a_i P(X ≥ i).

Taking a₁ = a₂ = · · · = 1 gives E[X] = Σ_j j P(X = j) = Σ_i P(X ≥ i), the desired result. □

Proposition 7.2 Let X be a random variable, if a ≤ X ≤ b, then a ≤ E[X] ≤ b.

Proof: Suppose Y ≥ 0, then it is clear that E[Y ] ≥ 0 (consider discrete and continuous cases), similarly if Y ≤ 0,
then E[Y ] ≤ 0. Now since X ≥ a, then X − a ≥ 0, so

0 ≤ E[X − a] = E[X] − a =⇒ E[X] ≥ a.

Similarly, we have E[X] ≤ b. □

Corollary 7.2.1 Let X be a random variable. If P (a ≤ X ≤ b) = 1, then a ≤ E[X] ≤ b.

Proof: Since P (a ≤ x ≤ b) = 1, then consider two cases, X discrete and X continuous.


If X is discrete, then it is clear that a ≤ X ≤ b, hence a ≤ E[X] ≤ b holds.
Rb
If X is continuous, then a fX (x)dx = 1. Then consider E[X − a], and E[b − X], we can easily get that they are
both greater than or equal to zero, hence a ≤ E[X] ≤ b.
A more rigorous proof requires the understanding of measure theory. □

Proposition 7.3 Suppose X is a random variable such that P (0 ≤ X ≤ c) = 1 and only takes values between 0
and c. Then
c2
Var(X) ≤ .
4

Proof: Since X only takes values between 0 and c, we have 0 ≤ E[X] ≤ c and X² ≤ cX, so E[X²] ≤ cE[X]. Writing m = E[X],

Var(X) = E[X²] − (E[X])² ≤ cm − m² = c²/4 − (m − c/2)² ≤ c²/4. □

Theorem 7.4 Let X and Y be discrete random variables with joint mass function p(x, y). For any function g(x, y),

E[g(X, Y)] = Σ_x Σ_y g(x, y)p(x, y).

Proof: By definition, we have

E[g(X, Y)] = Σ_{(x,y)} g(x, y)p(x, y) = Σ_x Σ_y g(x, y)p(x, y). □

Theorem 7.5 Let X, Y be jointly continuous random variables with joint probability density function f(x, y). For any function g(x, y),

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y)f(x, y) dx dy.

Proof: First consider the case g(X, Y) ≥ 0. For any nonnegative random variable W we have E[W] = ∫₀^∞ P(W > t) dt, so

E[g(X, Y)] = ∫₀^∞ P(g(X, Y) > t) dt
           = ∫₀^∞ ∫∫_{(x,y): g(x,y)>t} f(x, y) dy dx dt
           = ∫_x ∫_y ∫₀^{g(x,y)} f(x, y) dt dy dx   (by change of order of integration)
           = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y)f(x, y) dy dx.

The case of general g(X, Y) is similar; we deal with it by splitting g(X, Y) into its positive and negative parts g⁺(X, Y) and g⁻(X, Y). □

Proposition 7.6 Let X and Y be jointly continuous random variables with joint probability density function f(x, y). Then

∫_{−∞}^∞ ∫_{−∞}^∞ x f(x, y) dy dx = ∫_{−∞}^∞ x f_X(x) dx = E[X],

∫_{−∞}^∞ ∫_{−∞}^∞ y f(x, y) dx dy = ∫_{−∞}^∞ y f_Y(y) dy = E[Y],

E[X + Y] = ∫_{−∞}^∞ ∫_{−∞}^∞ (x + y) f(x, y) dx dy = E[X] + E[Y].

Proof: This is clear from the definition of expected value and of jointly continuous random variables. □

Let X be a random variable having finite expectation µ and variance σ², and let g(·) be a twice differentiable function. Then

E[g(X)] ≈ g(µ) + (g″(µ)/2) σ².

Proof: By Taylor's theorem, expanding about µ,

g(X) ≈ g(µ) + g′(µ)(X − µ) + g″(µ)(X − µ)²/2.

Hence

E[g(X)] ≈ E[g(µ) + g′(µ)(X − µ) + g″(µ)(X − µ)²/2] = g(µ) + 0 + (g″(µ)/2) σ². □

7.2 Sum of Random Variables

Theorem 7.7 Let X and Y be random variables. Then E[X + Y ] = E[X] + E[Y ] if E[X] and E[Y ] are finite. In
general, if X1 , X2 , · · · , Xn are random variables, then

E[X₁ + X₂ + · · · + Xₙ] = E[X₁] + E[X₂] + · · · + E[Xₙ].

Proof: Using Induction. We know that E[X + Y ] = E[X] + E[Y ] for both the case of discrete and continuous
random variables. □

Corollary 7.7.1 Let X and Y be random variables, then E[X − Y ] = E[X] − E[Y ].

Proof: Since E[Y ] + E[X − Y ] = E[X]. □

Corollary 7.7.2 Suppose X and Y are random variables such that X ≥ Y , then E[X] ≥ E[Y ].

Proof: Since X − Y ≥ 0, then E[X − Y ] ≥ 0, Hence E[X] − E[Y ] = E[X − Y ] ≥ 0. Therefore, E[X] ≥ E[Y ]. □

Definition: suppose X₁, X₂, · · · , Xₙ are random variables. Then X̄ = (1/n) Σ_{i=1}^n X_i is called the sample mean.

Lemma 7.8 (Boole’s Inequality) P (A1 ) + · · · + P (An ) ≥ P (A1 ∪ · · · ∪ An ).

Proof: This follows from inclusion-exclusion principle. □

Proposition 7.9 Suppose A₁, A₂, · · · , Aₙ are events and X₁, X₂, · · · , Xₙ are their respective indicator variables. Let Y = 1 − Π_{i=1}^n (1 − X_i). Then

P(A₁ ∪ · · · ∪ Aₙ) = E[Y] = Σ_{i=1}^n P(A_i) − Σ_{1≤i<j≤n} P(A_iA_j) + · · · − (−1)ⁿ P(A₁A₂ · · · Aₙ).

Proof: Note that P (A1 ∪ · · · ∪ An ) = E[Y ]. And

E[Y ] = 1 − E[(1 − X1 )(1 − X2 ) · · · (1 − Xn )]

Expand and we get the desired result. □

Theorem 7.10 Let X₁, X₂, · · · be a sequence of random variables. Suppose one of the following holds:

• X_i is nonnegative for i = 1, 2, · · · ;

• Σ_{i=1}^∞ E[|X_i|] < ∞.

Then E[Σ_{i=1}^∞ X_i] = Σ_{i=1}^∞ E[X_i].

Proof: The first case is justified by the σ-additivity of the underlying measure (monotone convergence); the second by absolute convergence. □

7.3 Moments of Number of Events


Definition: let X be the number of events among A₁, A₂, · · · , Aₙ that occur, and let X_i be the indicator variable of A_i. The number of pairs A_iA_j (i < j) that occur is given by (X choose 2). In general, (X choose k) gives the number of k-tuples A_{i₁}A_{i₂} · · · A_{i_k} that occur.

Lemma 7.11

(X choose k) = Σ_{i₁<i₂<···<i_k} X_{i₁} X_{i₂} · · · X_{i_k},

E[(X choose k)] = Σ_{i₁<i₂<···<i_k} P(A_{i₁} A_{i₂} · · · A_{i_k}).

Proof: This is quite clear from the definition of moments of number of events. □

Corollary 7.11.1

E[X(X − 1) · · · (X − k + 1)/k!] = Σ_{i₁<i₂<···<i_k} P(A_{i₁} A_{i₂} · · · A_{i_k}).

Proof: This is true since

(X choose k) = X(X − 1) · · · (X − k + 1)/k!. □

7.4 Covariance and Correlations

Proposition 7.12 Suppose X and Y are independent, then, for any functions h and g,

E[g(X)h(Y )] = E[g(X)]E[h(Y )].

Proof: Suppose X and Y are discrete and independent. Then

E[g(X)h(Y)] = Σ_{(x,y)} g(x)h(y) p_{X,Y}(x, y)
            = Σ_{(x,y)} g(x)h(y) p_X(x)p_Y(y)
            = (Σ_x g(x)p_X(x)) (Σ_y h(y)p_Y(y))
            = E[g(X)]E[h(Y)].

Suppose X and Y are jointly continuous with joint density f(x, y) and independent. Then

E[g(X)h(Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x)h(y)f(x, y) dx dy
            = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x)h(y)f_X(x)f_Y(y) dx dy
            = ∫_{−∞}^∞ g(x)f_X(x) dx · ∫_{−∞}^∞ h(y)f_Y(y) dy
            = E[g(X)]E[h(Y)]. □

Lemma 7.13 Suppose X and Y are independent identical random variables with variance σ 2 , then

E[(X − Y )2 ] = 2σ 2 .

Proof: Since X and Y are independent and identically distributed,

E[(X − Y)²] = E[X² − 2XY + Y²] = 2E[X²] − 2E[X]E[Y] = 2E[X²] − 2(E[X])² = 2Var(X) = 2σ². □

Definition: let X and Y be random variables. The covariance is defined by

Cov(X, Y ) = E[XY ] − E[X]E[Y ] = E[(X − E[X])(Y − E[Y ])].

If X and Y are independent, then Cov(X, Y ) = 0. If Cov(X, Y ) = 0, X and Y may not be independent.

Proposition 7.14 The following are true about Covariance:

1. Cov(X, X) = Var(X).

2. Cov(X, Y ) = Cov(Y, X).

3. Cov(aX, Y ) = aCov(X, Y ).

4. Cov(X1 + X2 , Y ) = Cov(X1 , Y ) + Cov(X2 , Y ).

Proof:

1. Cov(X, X) = E[X²] − (E[X])² = Var(X).

2. This is clear from the definition.

3. Cov(aX, Y) = E[aXY] − E[aX]E[Y] = a(E[XY] − E[X]E[Y]) = aCov(X, Y).

4. Cov(X₁ + X₂, Y) = E[(X₁ + X₂)Y] − E[X₁ + X₂]E[Y]
                   = E[X₁Y] + E[X₂Y] − E[X₁]E[Y] − E[X₂]E[Y]
                   = Cov(X₁, Y) + Cov(X₂, Y). □

Corollary 7.14.1 Suppose X₁, · · · , Xₙ and Y₁, · · · , Y_m are random variables. Then

Cov(Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j) = Σ_{i=1}^n Σ_{j=1}^m Cov(X_i, Y_j).

Proof: Induction. □

Remark: Cov(·, ·) is an inner product on real random variables. The induced norm on this space is σ_X = √Var(X).

Theorem 7.15 Let X₁, X₂, · · · , Xₙ be random variables. Then

Var(Σ_{i=1}^n X_i) = Σ_{i=1}^n Var(X_i) + 2 Σ_{i<j} Cov(X_i, X_j).

In particular, if X₁, X₂, · · · , Xₙ are independent, then

Var(Σ_{i=1}^n X_i) = Σ_{i=1}^n Var(X_i).

Proof: If X₁, · · · , Xₙ are independent, then all the covariances are 0, giving the second formula. For the first formula:

Var(X₁ + X₂ + · · · + Xₙ) = Cov(X₁ + · · · + Xₙ, X₁ + · · · + Xₙ)
                          = Σ_{i=1}^n Cov(X_i, X₁ + · · · + Xₙ)
                          = Σ_{i=1}^n Σ_{j=1}^n Cov(X_i, X_j)
                          = Σ_{i=1}^n Var(X_i) + 2 Σ_{i<j} Cov(X_i, X_j). □

Definition: let X₁, · · · , Xₙ be independent and identically distributed random variables with mean µ and variance σ². We define X̄ = (1/n) Σ_{i=1}^n X_i to be the sample mean.
Definition: we define X_i − X̄ to be the deviation, i = 1, · · · , n.
Definition: we define S² = Σ_{i=1}^n (X_i − X̄)²/(n − 1) to be the sample variance.

Proposition 7.16

1. E[X̄] = (1/n) Σ_{i=1}^n E[X_i] = µ.

2. Var(X̄) = (1/n²) Σ_{i=1}^n Var(X_i) = σ²/n.

3. E[(X_i − X̄)²] = σ²(1 − 1/n).

4. E[S²] = (n/(n − 1)) σ²(1 − 1/n) = σ².

Proof: This can be derived directly from the definitions; for 3, note E[(X_i − X̄)²] = Var(X_i − X̄) = Var(X_i) + Var(X̄) − 2Cov(X_i, X̄) = σ² + σ²/n − 2σ²/n. □
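The unbiasedness of S² is visible in simulation (a Python/NumPy sketch; the normal distribution, n = 10, and the number of repetitions are arbitrary choices; ddof=1 selects the 1/(n − 1) normalization):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 1.0, 2.0, 10, 200_000
samples = rng.normal(mu, sigma, size=(reps, n))

s2 = samples.var(axis=1, ddof=1)   # sample variance S^2
print(s2.mean(), sigma**2)         # E[S^2] = sigma^2
xbar = samples.mean(axis=1)
print(xbar.var(), sigma**2 / n)    # Var(Xbar) = sigma^2 / n
```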

Definition: let X and Y be random variables with positive variances. The correlation of X and Y is defined to be

ρ(X, Y) = Cov(X, Y)/√(Var(X)Var(Y)).
X and Y are uncorrelated if ρ(X, Y ) = 0.

Proposition 7.17 Let c > 0. Then ρ(cX, Y) = ρ(X, Y) and ρ(−cX, Y) = −ρ(X, Y).

Let X and Y be random variables with variances σ_X² and σ_Y² respectively, and let X₁ = X/σ_X and Y₁ = Y/σ_Y. Then ρ(X, Y) = ρ(X₁, Y₁) = Cov(X₁, Y₁).

Proof: Suppose c > 0. Then

ρ(cX, Y) = Cov(cX, Y)/√(Var(cX)Var(Y)) = cCov(X, Y)/(c√(Var(X)Var(Y))) = ρ(X, Y).

The second statement is shown similarly.

Since σ_X, σ_Y > 0, applying the first part with c = 1/σ_X and c = 1/σ_Y gives ρ(X, Y) = ρ(X₁, Y₁). Furthermore, Var(X₁) = Var(Y₁) = 1, hence we have the last equality. □

Proposition 7.18 Let X and Y be random variables with positive variances. Then −1 ≤ ρ(X, Y ) ≤ 1.

Proof: Replace X and Y by X/σX and Y /σY if necessary. We may assume that Var(X) = Var(Y ) = 1. Then

Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y )


= 2 + 2ρ(X, Y ) ≥ 0,
Var(X − Y ) = Var(X) + Var(Y ) − 2Cov(X, Y )
= 2 − 2ρ(X, Y ) ≥ 0.

Hence we have |ρ(X, Y )| ≤ 1. □

Remarks: Var(X) = 0 if and only if X is a constant with probability 1.

• ρ(X, Y) = 1 ⇔ Y = aX + b with probability 1 for some a > 0.

• ρ(X, Y) = −1 ⇔ Y = aX + b with probability 1 for some a < 0.

Proposition 7.19 Suppose X and Y are identical random variables that are not necessarily independent, then

Cov(X + Y, X − Y ) = 0.

Proof: Since X and Y are identical, then E[X n ] = E[Y n ] for n ∈ N, and Var(X) = Var(Y ). So

Cov(X + Y, X − Y ) = Cov(X, X) + Cov(X, −Y ) + Cov(Y, X) + Cov(Y, −Y )


= Var(X) − Var(Y )
=0

Lemma 7.20 Suppose Y = a + bX, b ≠ 0. Then ρ(X, Y) = 1 if b > 0, and ρ(X, Y) = −1 if b < 0.

Proof: Suppose b > 0. Then E[Y] = a + bE[X] and Var(Y) = b²Var(X), so √(Var(X)Var(Y)) = bVar(X). Then

ρ(X, Y) = Cov(X, Y)/√(Var(X)·Var(Y)) = Cov(X, a + bX)/(bVar(X)) = Cov(X, bX)/(bVar(X)) = bVar(X)/(bVar(X)) = 1.

The proof of the other case is similar. □

7.5 Multinomial and Multivariate Normal Distribution


Definition: consider m independent trials, each of which results in one of r possible outcomes with probabilities p₁, · · · , p_r, where Σ_{i=1}^r p_i = 1. Let N_i, i = 1, · · · , r, denote the number of the m trials that result in outcome i. Then N₁, N₂, · · · , N_r have the multinomial distribution, which is given by

P(N₁ = n₁, · · · , N_r = n_r) = (m!/(n₁! · · · n_r!)) p₁^{n₁} · · · p_r^{n_r},  Σ_{i=1}^r n_i = m.

Proposition 7.21 Suppose the conditions in the definition holds, then the following are true:

1. for each i ∈ {1, · · · , r}, Ni is a binomial random variable with parameters (m, pi ).

2. for i ̸= j, Ni + Nj is binomial with parameters (m, pi + pj ).

3. If i ̸= j, then Cov(Ni , Nj ) = −mpi pj .

Proof: (1) and (2) are by direct computation. For (3), since N_i + N_j is binomial with parameters (m, p_i + p_j), we know its variance, and since Var(N_i + N_j) = Var(N_i) + Var(N_j) + 2Cov(N_i, N_j), we can solve for the covariance. □
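The negative covariance is easy to observe empirically (a Python/NumPy sketch; m = 30, the probability vector, and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)
m, p = 30, [0.2, 0.5, 0.3]
counts = rng.multinomial(m, p, size=500_000)

n1, n2 = counts[:, 0], counts[:, 1]
print(np.cov(n1, n2)[0, 1])    # empirical Cov(N1, N2)
print(-m * p[0] * p[1])        # theoretical -m*p1*p2 = -3.0
```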

Definition: let Z1 , · · · , Zn be independent standard normal random variables. If for some constants aij , 1 ≤ i ≤ m,
1 ≤ j ≤ n, and µi , 1 ≤ i ≤ m,

X1 = a11 Z1 + · · · + a1n Zn + µ1 ,
X2 = a21 Z1 + · · · + a2n Zn + µ2 ,
..
.
Xₘ = a_{m1} Z₁ + · · · + a_{mn} Zₙ + µₘ,

then the random variables X1 , X2 , · · · , Xm are said to have a multivariate normal distribution.

Proposition 7.22 Suppose the conditions in the definition hold. Then the following are true:

1. E[X_i] = µ_i;

2. Var(X_i) = Σ_{j=1}^n a_{ij}²;

3. Cov(X_i, X_j) = Σ_{k=1}^n a_{ik} a_{jk}.
Proof: First note that a sum of independent normals is normal, so X_i is normal with parameters (µ_i, Σ_{j=1}^n a_{ij}²). Hence we have (1) and (2). Next, for i ≠ j,

Cov(X_i, X_j) = Cov(Σ_{k=1}^n a_{ik} Z_k, Σ_{l=1}^n a_{jl} Z_l) = Σ_{k=1}^n a_{ik} a_{jk} Cov(Z_k, Z_k) = Σ_{k=1}^n a_{ik} a_{jk},

since Cov(Z_k, Z_l) = 0 for k ≠ l and Cov(Z_k, Z_k) = 1. □

7.6 Conditional Expectation
We recall the definition of the conditional distribution: let X and Y be discrete random variables with joint mass function p(x, y), and suppose P(Y = y) > 0. Conditioned on Y = y, X is a random variable with mass function

p_{X|Y}(x|y) = P(X = x|Y = y) = p(x, y)/p_Y(y).

Definition: conditioned on Y = y, the expected value of X is

E[X|Y = y] = Σ_x x p_{X|Y}(x|y) = Σ_x x p(x, y)/p_Y(y).

Let X and Y be jointly continuous with probability density function f(x, y), and suppose f_Y(y) > 0. Conditioned on Y = y, X is a random variable with density function

f_{X|Y}(x|y) = f(x, y)/f_Y(y).

Definition: conditioned on Y = y, the expected value of X is

E[X|Y = y] = ∫_{−∞}^∞ x f(x, y)/f_Y(y) dx.

Lemma 7.23 Conditional expectation satisfies all properties of ordinary expectation:

• For the discrete case, E[g(X)|Y = y] = Σ_x g(x) p_{X|Y}(x|y).

• For the continuous case, E[g(X)|Y = y] = ∫_{−∞}^∞ g(x) f_{X|Y}(x|y) dx.

• Expectation is linear:

  – E[aX|Y = y] = aE[X|Y = y];

  – E[X₁ + X₂|Y = y] = E[X₁|Y = y] + E[X₂|Y = y].

Proof: This follows from the fact that a conditional probability is a probability. □

Let X and Y be joint random variables. Then E[X|Y = y] is a real-valued function of y, so E[X|Y] is a function of Y; hence it is itself a random variable.

Theorem 7.24 Let Z = E[X|Y ] be a random variable, then E[Z] = E[E[X|Y ]] = E[X].

Proof: For the discrete case:

E[Z] = Σ_y E[X|Y = y] P(Y = y)
     = Σ_y Σ_x x (P(X = x, Y = y)/P(Y = y)) P(Y = y)
     = Σ_x x (Σ_y P(X = x, Y = y))
     = Σ_x x P(X = x) = E[X].

For the continuous case:

E[Z] = ∫_{−∞}^∞ E[X|Y = y] f_Y(y) dy
     = ∫_{−∞}^∞ ∫_{−∞}^∞ x (f(x, y)/f_Y(y)) f_Y(y) dx dy
     = ∫_{−∞}^∞ x ∫_{−∞}^∞ f(x, y) dy dx
     = ∫_{−∞}^∞ x f_X(x) dx = E[X]. □
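The tower property can be checked with a two-stage simulation (a Python/NumPy sketch, anticipating the setup of Proposition 7.26 below; the Exponential(1) parameter and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(9)
# Y ~ Exponential(1); given Y = y, X ~ Poisson(y). Then E[X|Y] = Y.
y = rng.exponential(1.0, size=1_000_000)
x = rng.poisson(y)

print(x.mean())   # E[X]
print(y.mean())   # E[E[X|Y]] = E[Y]; both should be near 1
```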

Proposition 7.25 (Law of Total Probability for Continuous Case) Suppose X is a discrete random variable and Y a continuous random variable. Then

P(X = i) = ∫_{−∞}^∞ P(X = i|Y = y) f_Y(y) dy.

Proof: Define the indicator random variable

Z = 1 if X = i, and Z = 0 otherwise.

Then

P(X = i) = P(Z = 1) = E[Z] = E[E[Z|Y]] = ∫_{−∞}^∞ E[Z|Y = y] f_Y(y) dy = ∫_{−∞}^∞ P(X = i|Y = y) f_Y(y) dy. □

Proposition 7.26 Let N be Poisson whose parameter is generated by an exponential random variable X (i.e., given X = λ, N is Poisson with parameter λ). Then

• E[N|X = λ] = λ;

• E[N|X] = X;

• E[N] = E[E[N|X]] = E[X].

Proof: Direct verification from the definition works. □

Lemma 7.27 Suppose X and Y are independent, then

E[X|Y = y] = E[X] for all y.

Proof: We split into two cases: both discrete and both continuous.
Suppose X and Y are both discrete. Then P(X = x|Y = y) = P(X = x) for any x; hence

E[X|Y = y] = E[X] for all y.

On the other hand, if both X and Y are continuous, then f_{X|Y}(x|y) = f_X(x) for any x; hence

E[X|Y = y] = E[X] for all y. □
Lemma 7.28 Suppose g is a function and X and Y are random variables, then E[g(X)Y |X] = g(X)E[Y |X].

Proof: Let X = x, where x is arbitrary, then

E[g(X)Y |X = x] = g(x)E[Y |X = x],

hence E[g(X)Y |X] = g(X)E[Y |X]. □

7.7 Conditional Variance


Let X and Y be random variables. We can consider the distribution of X given that Y = y, and its variance

Var(X|Y = y) = E[(X − E[X|Y = y])² | Y = y].

Definition: the conditional variance is defined to be

Var(X|Y ) = E[(X − E[X|Y ])2 |Y ].

Lemma 7.29

• Var(X|Y) = E[X²|Y] − (E[X|Y])².

• Var(E[X|Y]) = E[(E[X|Y])²] − (E[E[X|Y]])².

• E[Var(X|Y)] = E[X²] − E[(E[X|Y])²].

Proof: Since Var(Z) = E[Z 2 ]−(E[Z])2 , then Var(X|Y ) = E[X 2 |Y ]−(E[X|Y ])2 and Var(E[X|Y ]) = E[(E[X|Y ])2 ]−
(E[E[X|Y ]])2 .
Hence

E[Var(X|Y )] = E[E[X 2 |Y ]] − E[(E[X|Y ])2 ]


= E[X 2 ] − E[(E[X|Y ])2 ]

Corollary 7.29.1 Var(X) = E[Var(X|Y )] + Var(E[X|Y ]).

Proof: From the lemma, we know that

E[Var(X|Y)] + Var(E[X|Y]) = E[X²] − E[(E[X|Y])²] + E[(E[X|Y])²] − (E[E[X|Y]])²
                          = E[X²] − (E[X])²
                          = Var(X). □
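The variance decomposition can also be checked with the Poisson-with-exponential-parameter example used earlier (a Python/NumPy sketch; here E[X|Y] = Var(X|Y) = Y, so both sides should be near E[Y] + Var(Y) = 2):

```python
import numpy as np

rng = np.random.default_rng(10)
y = rng.exponential(1.0, size=1_000_000)
x = rng.poisson(y)   # given Y, X is Poisson(Y): E[X|Y] = Var(X|Y) = Y

print(x.var())               # Var(X)
print(y.mean() + y.var())    # E[Var(X|Y)] + Var(E[X|Y]); both near 2
```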

7.8 Conditional Expectation and Predication
By prediction we mean the following: suppose the value of X is observed, and we want to predict the value of a second variable Y. We want a function g such that if X = x, then g(x) is the prediction for Y. The function g must be chosen so that g(X) is close to Y; i.e., we want to choose g to minimize E[(Y − g(X))²].

Lemma 7.30 The best predictor of Y based on X is g(x) = E[Y |X = x], i.e. g(X) = E[Y |X] is the best predictor
of Y . I.e., for all g(X),
E[(Y − g(X))2 ] ≥ E[(Y − E[Y |X])2 ].

Proof: We want to minimize E[(Y − g(X))²|X]. Now

E[(Y − g(X))²|X] = E[(Y − E[Y|X] + E[Y|X] − g(X))²|X]
                = E[(Y − E[Y|X])²|X] + E[(E[Y|X] − g(X))²|X] + 2E[(Y − E[Y|X])(E[Y|X] − g(X))|X].

Since, given X, E[Y|X] − g(X) is a function of X, it can be treated as a constant; thus

E[(Y − E[Y|X])(E[Y|X] − g(X))|X] = (E[Y|X] − g(X)) E[Y − E[Y|X] | X] = (E[Y|X] − g(X))(E[Y|X] − E[Y|X]) = 0.

Hence we have

E[(Y − g(X))²] ≥ E[(Y − E[Y|X])²],

with equality if g(X) = E[Y|X]. □

Lemma 7.31 The best constant predictor of Y is c = E[Y], and at this value E[(Y − c)²] = Var(Y).

Proof: We want to find c ∈ ℝ such that E[(Y − c)²] is minimal.

E[(Y − c)²] = c² − 2cE[Y] + E[Y²] = (c − E[Y])² + E[Y²] − (E[Y])² = (c − E[Y])² + Var(Y).

Hence the minimum is Var(Y), attained when c = E[Y]. □

Lemma 7.32 Suppose the joint distribution of X and Y is not completely known, but µ_X, µ_Y, σ_X, σ_Y and ρ = ρ(X, Y) are. We want constants a, b minimizing E[(Y − a − bX)²]. The best linear predictor of Y is µ_Y + ρ(σ_Y/σ_X)(X − µ_X), and the minimum of E[(Y − a − bX)²] is σ_Y²(1 − ρ²).

Proof: First consider the case µ_X = µ_Y = 0 and σ_X = σ_Y = 1, ρ = ρ(X, Y). Then

E[(Y − a − bX)²] = E[Y² + a² + b²X² − 2aY + 2abX − 2bXY] = 1 + a² + b² − 2bρ = a² + (b − ρ)² + 1 − ρ².

Hence E[(Y − a − bX)²] is minimal when a = 0 and b = ρ, so the best linear predictor of Y in this case is a + bX = ρX, and the minimum of E[(Y − a − bX)²] is 1 − ρ².
Next, suppose X, Y are arbitrary, with means µ_X, µ_Y and variances σ_X², σ_Y² respectively. Let

X₁ = (X − µ_X)/σ_X and Y₁ = (Y − µ_Y)/σ_Y.

Then

Y − a − bX = σ_Y (Y₁ − (a + bµ_X − µ_Y)/σ_Y − (bσ_X/σ_Y) X₁).

By the previous analysis, E[(Y − a − bX)²] is minimal when (a + bµ_X − µ_Y)/σ_Y = 0 and bσ_X/σ_Y = ρ, that is, b = ρσ_Y/σ_X and a = µ_Y − bµ_X.
The best linear predictor of Y is thus

µ_Y + ρ(σ_Y/σ_X)(X − µ_X),

and the minimum of E[(Y − a − bX)²] is σ_Y²(1 − ρ²). □
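The formulas for a and b can be applied directly to data (a Python/NumPy sketch; the model Y = 2X + noise and the sample size are arbitrary choices; here σ_Y² = 5, ρ² = 4/5, so the minimal mean squared error is σ_Y²(1 − ρ²) = 1):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(0, 1, 500_000)
y = 2 * x + rng.normal(0, 1, 500_000)     # Y correlated with X

rho = np.corrcoef(x, y)[0, 1]
b = rho * y.std() / x.std()               # b = rho * sigma_Y / sigma_X
a = y.mean() - b * x.mean()               # a = mu_Y - b * mu_X
print(a, b)                               # near 0 and 2
print(np.mean((y - a - b * x) ** 2))      # near sigma_Y^2 (1 - rho^2) = 1
```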

7.9 Moment Generating Function


Let X be a discrete random variable; recall that E[Xⁿ] = Σ_x xⁿ p(x).

Definition: for any (analytic) function g, the Maclaurin series of g(t) is

g(t) = Σ_{n=0}^∞ (g^{(n)}(0)/n!) tⁿ.

Suppose we seek g such that g^{(n)}(0) = E[Xⁿ] for all n:

g(t) = Σ_{n=0}^∞ (E[Xⁿ]/n!) tⁿ
     = Σ_{n=0}^∞ (Σ_x xⁿ p(x)/n!) tⁿ
     = Σ_x Σ_{n=0}^∞ ((tx)ⁿ/n!) p(x)
     = Σ_x p(x) e^{tx}
     = E[e^{tX}].

So g(t) = E[e^{tX}] is a function of t with the desired property.

Definition: let X be a random variable, its moment generating function is MX (t) = E[etX ].

Lemma 7.33 Assume that M_X(t) has a power series expansion at 0. Then M_X^{(n)}(0) = E[Xⁿ] for any nonnegative integer n.

Proof: Assume X is a continuous random variable with density function f. Then

Σ_{n=0}^∞ (M_X^{(n)}(0)/n!) tⁿ = M_X(t) = E[e^{tX}] = ∫_{−∞}^∞ e^{tx} f(x) dx
                              = ∫_{−∞}^∞ Σ_{n=0}^∞ ((tx)ⁿ/n!) f(x) dx
                              = Σ_{n=0}^∞ (tⁿ/n!) ∫_{−∞}^∞ xⁿ f(x) dx
                              = Σ_{n=0}^∞ (E[Xⁿ]/n!) tⁿ.

Comparing coefficients gives M_X^{(n)}(0) = E[Xⁿ]. □
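The moment-extraction property can be checked numerically with finite differences (a Python sketch; the exponential MGF λ/(λ − t) is derived in Proposition 7.34 below, and the rate λ = 2 and step size h are arbitrary choices):

```python
lam = 2.0
M = lambda t: lam / (lam - t)    # MGF of Exponential(lam), valid for t < lam

h = 1e-4
m1 = (M(h) - M(-h)) / (2 * h)            # numerical M'(0) ~ E[X] = 1/lam
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2    # numerical M''(0) ~ E[X^2] = 2/lam^2
print(m1, 1 / lam)
print(m2, 2 / lam**2)
```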

Proposition 7.34

• Suppose X is a Bernoulli random variable with parameter p. Then M_X(t) = peᵗ + 1 − p.

• Suppose X is a Poisson random variable with parameter λ. Then M_X(t) = exp[λ(eᵗ − 1)].

• Suppose X is a geometric random variable with parameter p. Then M_X(t) = peᵗ/(1 − (1 − p)eᵗ).

• Suppose X is an exponential random variable with parameter λ. Then M_X(t) = λ/(λ − t), t < λ.

• Suppose Z is the standard normal random variable. Then M_Z(t) = e^{t²/2}. If X = σZ + µ, which is normal with parameters (µ, σ²), σ > 0, then M_X(t) = exp(µt + σ²t²/2).

• Suppose X is a gamma random variable with parameters α > 0, λ > 0. Then M_X(t) = (λ/(λ − t))^α.

Proof:

• Suppose X is a Bernoulli random variable with parameter p. Then

M_X(t) = E[e^{tX}] = e^{t·1} P(X = 1) + e^{t·0} P(X = 0) = peᵗ + 1 − p.

• Suppose X is a Poisson random variable with parameter λ. Then

M_X(t) = E[e^{tX}] = Σ_{n=0}^∞ e^{tn} e^{−λ} λⁿ/n! = e^{−λ} Σ_{n=0}^∞ (λeᵗ)ⁿ/n! = e^{−λ} e^{λeᵗ} = exp[λ(eᵗ − 1)].

• Suppose X is a geometric random variable with parameter p. Then

M_X(t) = E[e^{tX}] = Σ_{n=1}^∞ e^{tn} (1 − p)^{n−1} p = Σ_{n=1}^∞ [eᵗ(1 − p)]^{n−1} eᵗ p = peᵗ/(1 − (1 − p)eᵗ).

• Suppose X is an exponential random variable with parameter λ. Then, for t < λ,

M_X(t) = E[e^{tX}] = ∫₀^∞ e^{tx} λe^{−λx} dx = λ ∫₀^∞ e^{−(λ−t)x} dx = λ/(λ − t).

• Suppose Z is the standard normal random variable. Then

M_Z(t) = E[e^{tZ}] = ∫_{−∞}^∞ e^{tz} (1/√(2π)) e^{−z²/2} dz
       = ∫_{−∞}^∞ (1/√(2π)) e^{−(z−t)²/2 + t²/2} dz
       = e^{t²/2} ∫_{−∞}^∞ (1/√(2π)) e^{−(z−t)²/2} dz
       = e^{t²/2}.

Now if X = σZ + µ, then

M_X(t) = E[e^{tX}] = E[e^{t(σZ+µ)}] = e^{tµ} E[e^{tσZ}] = e^{tµ} M_Z(tσ) = e^{tµ} e^{(tσ)²/2} = exp(µt + σ²t²/2).

• Suppose X is gamma with parameters (α, λ), α, λ > 0. Then, for t < λ,

M_X(t) = E[e^{tX}] = ∫₀^∞ e^{tx} λe^{−λx}(λx)^{α−1}/Γ(α) dx
       = (λ/(λ − t))^α ∫₀^∞ (λ − t)e^{−(λ−t)x}[(λ − t)x]^{α−1}/Γ(α) dx
       = (λ/(λ − t))^α. □

Proposition 7.35 If MX (t) = MY (t) for all t near 0, then X and Y have the same distribution.

Proof: This is due to the uniqueness of Taylor Series. □

Proposition 7.36 Suppose X and Y are random variable such that Y = aX + b. Let the moment generating
function of X be MX (t), then the moment generating function MY (t) of Y is

MY (t) = etb MX (at).

Proof: Since Y = aX + b, by the definition of the moment generating function we have

M_Y(t) = E[e^{tY}] = E[e^{atX+bt}] = e^{bt} E[e^{(at)X}] = e^{bt} M_X(at). □

7.10 Moment Generating Function for Sum of Independent Random Variables

Proposition 7.37 Suppose X and Y are independent, then MX+Y (t) = MX (t)MY (t). More generally, if X1 , X2 , · · · , Xn
are independent random variables, then

MX1 +X2 +···+Xn (t) = MX1 (t) · MX2 (t) · · · MXn (t).

Proof: It suffices to prove the case for two independent random variables, as the rest can be proven by induction.
Suppose X and Y are independent, then

E[et(X+Y ) ] = E[etX etY ] = E[etX ]E[etY ].

Hence we have the desired result. □

Proposition 7.38 Suppose X is a binomial random variable with parameters (n, p); then M_X(t) = (peᵗ + 1 − p)ⁿ. Suppose X is a negative binomial random variable with parameters (r, p); then

M_X(t) = (peᵗ/(1 − (1 − p)eᵗ))ʳ.

Proof: A binomial random variable with parameters (n, p) is the sum of n independent Bernoulli random variables with parameter p. A negative binomial random variable with parameters (r, p) is the sum of r independent geometric random variables with parameter p. □

Proposition 7.39 Let X be chi-squared with n degrees of freedom, then MX (t) = (1 − 2t)−n/2 .

Proof: Recall that a chi-squared random variable with n degrees of freedom can be written as Z₁² + · · · + Zₙ², where Z₁, · · · , Zₙ are independent standard normal random variables. Then, for t < 1/2,

E[e^{tX}] = Π_{i=1}^n M_{Z_i²}(t) = Π_{i=1}^n E[e^{tZ_i²}]
          = Π_{i=1}^n (1/√(2π)) ∫_{−∞}^∞ e^{tx²} e^{−x²/2} dx
          = Π_{i=1}^n (1/√(2π)) ∫_{−∞}^∞ e^{−x²(1−2t)/2} dx
          = Π_{i=1}^n (1 − 2t)^{−1/2}
          = (1 − 2t)^{−n/2}. □
Proposition 7.40 Let $X_1, X_2, \ldots$ be independent random variables with the same distribution as $X$, and let $N$ be a nonnegative integer-valued random variable independent of the $X_i$. Let $Y = X_1 + \cdots + X_N$. Then $E[e^{tY} \mid N = n] = [M_X(t)]^n$, i.e., $E[e^{tY} \mid N] = [M_X(t)]^N$, so $M_Y(t) = E[E[e^{tY} \mid N]] = E[(M_X(t))^N]$. Then we have the following (a simulation sketch follows the list):

• $M_Y'(t) = E[N (M_X(t))^{N-1} M_X'(t)]$.

• $E[Y] = M_Y'(0) = E[N (M_X(0))^{N-1} M_X'(0)] = E[N] E[X]$.

• $\mathrm{Var}(Y) = E[N]\mathrm{Var}(X) + (E[X])^2 \mathrm{Var}(N)$.
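
As a sketch of the mean and variance identities (hypothetical parameters; assuming numpy), simulate a compound sum with $N$ Poisson and the $X_i$ exponential:

    import numpy as np

    rng = np.random.default_rng(2)
    trials, lam, mu = 10**5, 4.0, 2.0   # N ~ Poisson(lam), X ~ exponential with rate mu

    ns = rng.poisson(lam, trials)
    # draw Y = X_1 + ... + X_N afresh for each trial (an empty sum is 0)
    ys = np.array([rng.exponential(1/mu, k).sum() for k in ns])

    print(ys.mean(), lam/mu)                      # E[Y] = E[N]E[X] = 2
    print(ys.var(), lam/mu**2 + (1/mu)**2 * lam)  # Var(Y) = E[N]Var(X) + (E[X])^2 Var(N) = 2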

7.11 Joint Moment Generating Function


Definition: Let $X$ and $Y$ be random variables. The joint moment generating function is $M_{X,Y}(s,t) = E[e^{sX+tY}]$. It is clear that $M_X(s) = E[e^{sX}] = M_{X,Y}(s, 0)$ and $M_Y(t) = E[e^{tY}] = M_{X,Y}(0, t)$.
Note that the joint distribution of $X$ and $Y$ is uniquely determined by $M_{X,Y}(s,t)$.
Moreover, $X$ and $Y$ are independent if and only if $M_{X,Y}(s,t) = M_X(s) M_Y(t)$ for all $s$ and $t$ (this result extends to $n$-dimensional joint moment generating functions). Note that this factorization is stronger than the single-variable identity $M_{X+Y}(t) = M_X(t) M_Y(t)$, which only involves the diagonal $s = t$ and does not by itself imply independence.
For the nontrivial direction, suppose $M_{X,Y}(s,t) = M_X(s) M_Y(t)$, and let $X_1$, $Y_1$ be independent random variables with the same distributions as $X$, $Y$ respectively. Then
$$M_{X_1,Y_1}(s,t) = M_{X_1}(s) M_{Y_1}(t) = M_X(s) M_Y(t) = M_{X,Y}(s,t).$$
Hence, by uniqueness, $(X, Y)$ has the same joint distribution as the independent pair $(X_1, Y_1)$, and so $X$ and $Y$ are independent.

Proposition 7.41 Suppose the number of events that occur is Poisson with parameter $\lambda$, and each event is independently classified as type I with probability $p$ and type II with probability $1 - p$. Let $X_i$ be the number of type $i$ events. Then $X_1$ and $X_2$ are independent Poisson random variables with parameters $\lambda p$ and $\lambda(1-p)$.

Proof: Let $X$ be the total number of events. Given $X = n$, $X_1$ is binomial with parameters $(n, p)$ and $X_2 = n - X_1$. Then
$$E[e^{sX_1+tX_2} \mid X = n] = E[e^{sX_1+t(n-X_1)} \mid X = n] = e^{tn} E[e^{(s-t)X_1} \mid X = n] = e^{tn}(pe^{s-t} + 1 - p)^n = (pe^s + (1-p)e^t)^n,$$
so $E[e^{sX_1+tX_2} \mid X] = (pe^s + (1-p)e^t)^X$. Since $E[e^{tX}] = e^{\lambda(e^t-1)}$ implies $E[a^X] = e^{\lambda(a-1)}$ for $a > 0$, we get
$$M_{X_1,X_2}(s,t) = E[(pe^s + (1-p)e^t)^X] = e^{\lambda(pe^s + (1-p)e^t - 1)} = e^{\lambda p(e^s-1)} e^{\lambda(1-p)(e^t-1)} = M_{X_1}(s) M_{X_2}(t). \qquad \square$$
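
A quick simulation sketch (assuming numpy; $\lambda = 5$ and $p = 0.3$ are hypothetical choices) illustrating that the two type counts have Poisson means $\lambda p$, $\lambda(1-p)$ and are uncorrelated:

    import numpy as np

    rng = np.random.default_rng(3)
    trials, lam, p = 10**6, 5.0, 0.3

    total = rng.poisson(lam, trials)
    x1 = rng.binomial(total, p)    # type I count given the total
    x2 = total - x1                # type II count

    print(x1.mean(), x1.var())     # both approach lam*p = 1.5 (a Poisson signature)
    print(x2.mean(), x2.var())     # both approach lam*(1-p) = 3.5
    print(np.cov(x1, x2)[0, 1])    # approaches 0, consistent with independence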

Proposition 7.42 Let $Z_1$ and $Z_2$ be independent standard normal random variables. Then $X = \frac{1}{2}(Z_1 + Z_2)$ and $Y = \frac{1}{2}(Z_1 - Z_2)$ are normal random variables. In addition, $X$ and $Y$ are independent, and hence so are $X$ and $Y^2$.

Proof: We compute the joint moment generating function, using the independence of $Z_1$ and $Z_2$:
$$M_{X,Y}(s,t) = E[e^{sX+tY}] = E[e^{\frac{1}{2}(s+t)Z_1}] E[e^{\frac{1}{2}(s-t)Z_2}] = e^{\frac{1}{8}(s+t)^2} e^{\frac{1}{8}(s-t)^2} = e^{\frac{1}{4}(s^2+t^2)} = e^{\frac{1}{4}s^2} e^{\frac{1}{4}t^2}.$$
The joint moment generating function is separable, and each factor is the moment generating function of a normal random variable (with mean 0 and variance $1/2$). Hence $X$ and $Y$ are normal and independent, so $X$ and $Y^2$ are also independent. □

Proposition 7.43 Let $X_1, X_2, \ldots, X_n$ be independent normal random variables with parameters $(\mu, \sigma^2)$. Let $\overline{X} = \frac{1}{n}(X_1 + \cdots + X_n)$ and $S^2 = \sum_{i=1}^{n} (X_i - \overline{X})^2/(n-1)$. Then $\overline{X}$ and $S^2$ are independent.

7.12 Summary on Random Variables

8 Limit Theorems
8.1 Inequalities

Lemma 8.1 (Markov's Inequality) Let $X$ be a nonnegative random variable. Then for any $a > 0$,
$$P(X \ge a) \le \frac{E[X]}{a}.$$

Proof: Let $I$ be the indicator variable of the event $X \ge a$:
$$I = \begin{cases} 1 & \text{if } X \ge a \\ 0 & \text{if } X < a. \end{cases}$$
If $X \ge a$, then $aI = a \le X$; if $X < a$, then $aI = 0 \le X$. Hence $X \ge aI$, so $E[X] \ge aE[I] = aP(X \ge a)$. □

Proposition 8.2 (Chernoff Bounds) Let $X$ be a random variable with moment generating function $M(t)$. Then for any $a > 0$ and $t > 0$,
$$P(X \ge a) \le e^{-ta} M(t).$$

Proof: For $t > 0$, $X \ge a$ if and only if $e^{tX} \ge e^{ta}$. Hence by Markov's Inequality, we have
$$P(X \ge a) = P(e^{tX} \ge e^{ta}) \le \frac{E[e^{tX}]}{e^{ta}} = e^{-ta} M(t).$$
Since this holds for every $t > 0$, the right-hand side may be minimized over $t$. □

Corollary 8.2.1 (Chernoff Bounds for the Poisson Random Variable) Suppose $X$ is a Poisson random variable with parameter $\lambda$, and $i > \lambda$. Then
$$P(X \ge i) \le \frac{e^{-\lambda}(e\lambda)^i}{i^i}.$$

Proof: By the Chernoff bounds, for all $t > 0$,
$$P(X \ge i) \le e^{\lambda(e^t-1)} e^{-it}.$$
Differentiating the right-hand side with respect to $t$, the minimum is attained when $e^t = i/\lambda$ (a valid choice of $t > 0$ since $i > \lambda$). Substituting this value for $t$ gives the desired inequality. □
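
A short numerical sketch (assuming scipy is available; $\lambda = 3$ is a hypothetical choice) comparing this bound with the exact Poisson tail probability:

    import numpy as np
    from scipy import stats

    lam = 3.0
    for i in [5, 8, 12]:
        exact = stats.poisson.sf(i - 1, lam)           # P(X >= i)
        bound = np.exp(-lam) * (np.e*lam)**i / i**i    # Chernoff bound
        print(i, exact, bound)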

Corollary 8.2.2 (Chernoff Bounds for Standard Normal Variable) Suppose $Z$ is standard normal and $a > 0$. Then
$$P(Z > a) \le \frac{1}{2} e^{-a^2/2}.$$

Proof:
$$P(Z > a) = \int_a^{\infty} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du = \int_0^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(x+a)^2/2}\,dx \qquad (x = u - a)$$
$$= \int_0^{\infty} \frac{1}{\sqrt{2\pi}} e^{-a^2/2} e^{-x^2/2} e^{-ax}\,dx \le \frac{1}{\sqrt{2\pi}} e^{-a^2/2} \int_0^{\infty} e^{-x^2/2} \cdot 1\,dx = \frac{1}{\sqrt{2\pi}} e^{-a^2/2} \sqrt{\frac{\pi}{2}} = \frac{1}{2} e^{-a^2/2},$$
where the inequality uses $e^{-ax} \le 1$ for $x \ge 0$. □

Lemma 8.3 (Chebyshev's Inequality) Let $X$ be a random variable, and let $\mu_X$ be the mean of $X$ and $\sigma_X^2$ the variance of $X$. Then
$$P(|X - \mu_X| \ge a) \le \frac{\sigma_X^2}{a^2} \qquad \forall a > 0.$$

Proof: We have $\sigma_X^2 = E[(X - \mu_X)^2]$. Let $Y = (X - \mu_X)^2 \ge 0$ and $a > 0$. By Markov's Inequality, we have
$$P(Y \ge a^2) \le \frac{E[Y]}{a^2} = \frac{\sigma_X^2}{a^2}.$$
But $P(Y \ge a^2) = P((X - \mu_X)^2 \ge a^2) = P(|X - \mu_X| \ge a)$, so
$$P(|X - \mu_X| \ge a) \le \frac{\sigma_X^2}{a^2} \qquad \forall a > 0. \qquad \square$$
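
As a sketch (assuming numpy; the exponential sample with mean and variance 1 is a hypothetical choice), both Markov's and Chebyshev's inequalities can be compared with empirical frequencies:

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(1.0, 10**6)   # mean 1, variance 1
    a = 3.0

    print(np.mean(x >= a), 1/a)                  # Markov: P(X >= a) <= E[X]/a
    print(np.mean(np.abs(x - 1) >= a), 1/a**2)   # Chebyshev: P(|X - mu| >= a) <= sigma^2/a^2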

Corollary 8.3.1 Suppose $X$ is the standard normal random variable and $a > 0$. Then
$$\Phi(a) \ge 1 - \frac{1}{2a^2}.$$

Proof: By symmetry, $P(|X| \ge a) = 2P(X \ge a) = 2(1 - \Phi(a))$. Since $X$ has mean 0 and variance 1, Chebyshev's Inequality gives $P(|X| \ge a) \le \frac{1}{a^2}$. Combining the two, we have the desired result. □

Theorem 8.4 If $X$ is a random variable such that $\mathrm{Var}(X) = 0$, then $P(X = \mu_X) = 1$.

Proof: Let $X$ be a random variable such that $\mathrm{Var}(X) = 0$. For any $\epsilon > 0$, by Chebyshev's Inequality,
$$P(|X - \mu_X| > \epsilon) \le \frac{\sigma_X^2}{\epsilon^2} = 0.$$
Let $\epsilon \to 0^+$. Since the events $\{|X - \mu_X| > \epsilon\}$ increase to $\{|X - \mu_X| > 0\}$ as $\epsilon \downarrow 0$, continuity of probability gives
$$0 = \lim_{\epsilon \to 0^+} P(|X - \mu_X| > \epsilon) = P(|X - \mu_X| > 0).$$
Hence $P(|X - \mu_X| \ne 0) = 0$, which implies that $P(X = \mu_X) = 1$. □

Proposition 8.5 (One-sided Chebyshev’s Inequality) Let X be a random variable with mean µ and variance
σ 2 , then for any a > 0, we have
σ2
P (X ≥ µ + a) ≤ 2 .
σ + a2

Proof: Consider Y = σ1 (X − µ). Then Y is a random variable with mean 0 and variance 1.
 a
P (X ≥ µ + a) = P Y ≥ 2 .
σ
2
Hence to prove that P (X ≥ µ + a) ≤ σ2σ+a2 for all a > 0, it suffices to show that P (Y ≥ a) ≤ 1+a
1
2 for all a > 0.

Now suppose Y is a random variable with mean 0 and variance 1. Then Y ≥ a implies for all b, (Y + b)2 ≥ (a + b)2 ,
hence by Chebyshev’s inequality, we have

1 + b2
P (Y ≥ a) ≤ P ((Y + b)2 ≥ (a + b)2 ) ≤ .
(a + b)2

And note when b = a1 ,


1 + b2 1
2
= .
(a + b) 1 + a2
hence  !
1 2 1 2
  
1
P (Y ≥ a) ≤ P Y + ≥ a+ ≤ .
a a 1 + a2

Proposition 8.6 (Jensen's Inequality) If $f(x)$ is a convex function, then
$$E[f(X)] \ge f(E[X]),$$
provided that the expectations exist and are finite. If $f(x)$ is concave, then
$$E[f(X)] \le f(E[X]),$$
provided that the expectations exist and are finite.

Proposition 8.7 (Cauchy-Schwarz Inequality) Suppose $X$ and $Y$ are random variables. Then
$$(E[XY])^2 \le E[X^2] E[Y^2].$$

Proof: For any $t \in \mathbb{R}$, note that $E[(tX + Y)^2] \ge 0$, hence
$$E[X^2]t^2 + 2E[XY]t + E[Y^2] \ge 0$$
for all $t$; hence the discriminant of the quadratic must be less than or equal to zero, that is,
$$4E[XY]^2 - 4E[X^2]E[Y^2] \le 0.$$
Hence we conclude that
$$(E[XY])^2 \le E[X^2] E[Y^2]. \qquad \square$$

8.2 Limit Theorems

Proposition 8.8 (Weak Law of Large Numbers) Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with $E[X_i] = \mu$. If $\epsilon > 0$, then
$$P\left(\left|\frac{X_1 + X_2 + \cdots + X_n}{n} - \mu\right| \ge \epsilon\right) \to 0 \quad \text{as } n \to \infty.$$

Proof: We prove the result under the additional assumption that the variance $\mathrm{Var}(X_i) = \sigma^2$ is finite. Let $\overline{X}_n = (X_1 + \cdots + X_n)/n$. Then $E[\overline{X}_n] = \mu$ and $\mathrm{Var}(\overline{X}_n) = \sigma^2/n$.
By Chebyshev's Inequality, for any $\epsilon > 0$, we have
$$P(|\overline{X}_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2}.$$
Let $\epsilon > 0$ be fixed. Then $P(|\overline{X}_n - \mu| \ge \epsilon) \to 0$ as $n \to \infty$. □
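
A minimal simulation sketch of the weak law (assuming numpy; exponential variables with mean $\mu = 2$ and the tolerance $\epsilon = 0.1$ are hypothetical choices):

    import numpy as np

    rng = np.random.default_rng(5)
    mu, eps, trials = 2.0, 0.1, 2000

    for n in [10, 100, 1000, 10000]:
        means = rng.exponential(mu, size=(trials, n)).mean(axis=1)
        # empirical P(|sample mean - mu| >= eps), which shrinks as n grows
        print(n, np.mean(np.abs(means - mu) >= eps))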

Corollary 8.8.1 Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with $E[X_i] = \mu$. If $\epsilon > 0$, then
$$P\left(\left|\frac{X_1 + X_2 + \cdots + X_n}{n} - \mu\right| \le \epsilon\right) \to 1 \quad \text{as } n \to \infty.$$

Proof: Taking the complement of the probability, we get the desired result. □

Lemma 8.9 Let $Z_n$ be a random variable having cumulative distribution function $F_{Z_n}$ and moment generating function $M_{Z_n}$, $n = 1, 2, \ldots$. Let $Z$ be a random variable having cumulative distribution function $F_Z$ and moment generating function $M_Z$. If $M_{Z_n}(t) \to M_Z(t)$ for all $t$ as $n \to \infty$, then $F_{Z_n}(t) \to F_Z(t)$ for all $t$ at which $F_Z$ is continuous.

Theorem 8.10 (Central Limit Theorem) Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$. Then the distribution of
$$\frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$
tends to the standard normal distribution as $n \to \infty$.


Proof: Let $Z_n = (X_1 + \cdots + X_n - n\mu)/(\sigma\sqrt{n})$. Note that the $Y_i = (X_i - \mu)/\sigma$ are independent and identically distributed with mean 0 and variance 1; let their common moment generating function be $M(t)$ (assumed to exist near 0). Then $Z_n = \sum_{i=1}^{n} \frac{Y_i}{\sqrt{n}}$, so $Z_n$ has moment generating function $\left[M\left(\frac{t}{\sqrt{n}}\right)\right]^n$. Let $Z$ be a standard normal random variable, so $M_Z(t) = e^{t^2/2}$. By Lemma 8.9, it suffices to show that $\left[M\left(\frac{t}{\sqrt{n}}\right)\right]^n \to e^{t^2/2}$ as $n \to \infty$.

Let $L(t) = \ln M(t)$. Then it is equivalent to show that $nL\left(\frac{t}{\sqrt{n}}\right) \to \frac{t^2}{2}$. Note that
$$L(0) = \ln M(0) = \ln 1 = 0,$$
$$L'(0) = \frac{M'(0)}{M(0)} = \frac{E[Y_i]}{1} = 0,$$
$$L''(0) = \frac{M''(0)M(0) - [M'(0)]^2}{[M(0)]^2} = \frac{E[Y_i^2] \cdot 1 - (E[Y_i])^2}{1^2} = 1.$$
Set $x = \frac{1}{\sqrt{n}}$. Then by L'Hopital's Rule, we have
$$\lim_{n \to \infty} nL\left(\frac{t}{\sqrt{n}}\right) = \lim_{x \to 0^+} \frac{L(tx)}{x^2} = \lim_{x \to 0^+} \frac{tL'(tx)}{2x} = \lim_{x \to 0^+} \frac{t^2 L''(tx)}{2} = \frac{t^2 L''(0)}{2} = \frac{t^2}{2},$$
by assuming that $L''$ is continuous at 0. And this completes the proof of the theorem. □
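
A simulation sketch of the theorem (assuming numpy; Uniform(0,1) summands and $n = 30$ are hypothetical choices) comparing standardized sums with the standard normal:

    import numpy as np

    rng = np.random.default_rng(6)
    n, trials = 30, 10**5
    mu, sigma = 0.5, np.sqrt(1/12)    # mean and sd of Uniform(0, 1)

    sums = rng.random(size=(trials, n)).sum(axis=1)
    z = (sums - n*mu) / (sigma*np.sqrt(n))

    print(z.mean(), z.std())          # approximately 0 and 1
    print(np.mean(z <= 1.0))          # approximately Phi(1) = 0.8413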

Proposition 8.11 (Strong Law of Large Numbers) Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with mean $\mu = E[X_i]$. Then
$$P\left(\lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = \mu\right) = 1,$$
i.e.,
$$\frac{X_1 + \cdots + X_n}{n} \to \mu \quad \text{as } n \to \infty, \text{ with probability 1.}$$

Proof: We prove the case where $E[X_i^4] = K < \infty$.

We first consider the case $\mu = E[X_i] = 0$. Let $S_n = X_1 + \cdots + X_n$. Expanding $S_n^4$, the resulting expectations are of the forms
$$E[X_i^4], \quad E[X_i^3 X_j], \quad E[X_i^2 X_j^2], \quad E[X_i^2 X_j X_k], \quad E[X_i X_j X_k X_l]$$
for distinct indices $i, j, k, l$. Since the $X_i$ are independent with mean 0, $E[X_i^3 X_j] = E[X_i^2 X_j X_k] = E[X_i X_j X_k X_l] = 0$. So
$$E[S_n^4] = \sum_{i=1}^{n} E[X_i^4] + 6\sum_{i<j} E[X_i^2 X_j^2] = nK + 3n(n-1)(E[X_i^2])^2 \le nK + 3n(n-1)K \le n^2 K + 3n^2 K = 4n^2 K,$$
where $(E[X_i^2])^2 \le E[X_i^4] = K$ by the Cauchy-Schwarz Inequality. Hence
$$E[(S_n/n)^4] \le 4n^2 K / n^4 = 4K/n^2,$$
and
$$E\left[\sum_{n=1}^{\infty} (S_n/n)^4\right] = \sum_{n=1}^{\infty} E[(S_n/n)^4] \le 4K \sum_{n=1}^{\infty} \frac{1}{n^2} < \infty,$$
where the interchange of expectation and sum is justified because all terms are nonnegative.
So we have concluded that $E\left[\sum_{n=1}^{\infty} (S_n/n)^4\right] < \infty$. Let $Y = \sum_{n=1}^{\infty} (S_n/n)^4$; then $P(Y = \infty) = 0$, i.e., $P(Y < \infty) = 1$. Then in particular $(S_n/n)^4 \to 0$ with probability 1, which implies
$$\frac{X_1 + \cdots + X_n}{n} \to 0 \quad \text{as } n \to \infty \text{ with probability 1.}$$

Next, for the general case, suppose $E[X_i] = \mu$. Consider $Y_i = X_i - \mu$; then $E[Y_i] = 0$, and $E[Y_i^4] = E[(X_i - \mu)^4] \le E[8X_i^4 + 8\mu^4]$, since $(a + b)^4 \le 8(a^4 + b^4)$ by the convexity of $x^4$. Since $E[8X_i^4 + 8\mu^4] < \infty$ by the fact that $E[X_i^4] < \infty$, then $E[Y_i^4] < \infty$. Thus we conclude that $\frac{Y_1 + \cdots + Y_n}{n} \to 0$ as $n \to \infty$ with probability 1. But since $X_i = Y_i + \mu$ for each $i$,
$$\frac{X_1 + \cdots + X_n}{n} = \frac{Y_1 + \cdots + Y_n}{n} + \mu \to \mu \quad \text{as } n \to \infty. \qquad \square$$

Corollary 8.11.1 Suppose $X_1, X_2, \ldots$ are independent and identically distributed positive random variables with $E[\ln(X_i)]$ finite. Then, with probability 1,
$$\lim_{n \to \infty} \left(\prod_{i=1}^{n} X_i\right)^{1/n} = e^{E[\ln(X_i)]}.$$

Proof: Since $X_1, X_2, \ldots$ are independent and identically distributed, so are $\ln(X_1), \ln(X_2), \ldots$. Note that
$$\left(\prod_{i=1}^{n} X_i\right)^{1/n} = \exp\left(\frac{\sum_{i=1}^{n} \ln(X_i)}{n}\right).$$
Since, by the Strong Law of Large Numbers,
$$\frac{\sum_{i=1}^{n} \ln(X_i)}{n} \to E[\ln(X_i)]$$
with probability 1 as $n \to \infty$, and the exponential function is continuous, we get
$$\lim_{n \to \infty} \left(\prod_{i=1}^{n} X_i\right)^{1/n} = \lim_{n \to \infty} \exp\left(\frac{\sum_{i=1}^{n} \ln(X_i)}{n}\right) = e^{E[\ln(X_i)]}. \qquad \square$$
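
A small sketch of this geometric-mean limit (assuming numpy; lognormal inputs are a hypothetical choice, for which $E[\ln X_i] = 0.1$ by construction):

    import numpy as np

    rng = np.random.default_rng(7)
    m = 0.1                                # E[ln X] for a lognormal(m, 1) variable
    x = rng.lognormal(mean=m, sigma=1.0, size=10**6)

    geo_mean = np.exp(np.mean(np.log(x)))  # (prod x_i)^(1/n), computed via logs
    print(geo_mean, np.exp(m))             # both approximately e^{0.1}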

Remark: suppose $X_1, X_2, \ldots$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. If the $X_i$ are normal, then exactly
$$\sum_{i=1}^{n} X_i \sim \text{Normal}(n\mu, n\sigma^2) \quad \text{and} \quad \frac{\sum_{i=1}^{n} X_i}{n} \sim \text{Normal}\left(\mu, \frac{\sigma^2}{n}\right).$$
For general $X_i$, the Central Limit Theorem says that these distributions hold approximately for large $n$.

Proposition 8.12 Let $Z_n$, $n \ge 1$, be a sequence of random variables and $c$ a constant such that, for each $\epsilon > 0$, $P(|Z_n - c| > \epsilon) \to 0$ as $n \to \infty$. Then for any bounded continuous function $g$,
$$E[g(Z_n)] \to g(c) \quad \text{as } n \to \infty.$$

Proof: We give the proof when the $Z_n$ are continuous random variables with densities $p_{Z_n}$; the discrete case is analogous, with sums in place of integrals.
Since $g$ is bounded, $|g(x)| \le M$ for some $M \in \mathbb{R}$. Since $g$ is continuous, for every $c \in \mathbb{R}$ and $\epsilon > 0$ there exists $\delta > 0$ such that $|x - c| \le \delta$ implies $|g(x) - g(c)| \le \epsilon$.
By the definition of expected values, we have
$$E[g(Z_n)] = \int g(x) p_{Z_n}(x)\,dx = \int_{|x-c| \le \delta} g(x) p_{Z_n}(x)\,dx + \int_{|x-c| > \delta} g(x) p_{Z_n}(x)\,dx.$$
Now, for $x$ such that $|x - c| \le \delta$, we have $g(x) \le g(c) + \epsilon$, and for $x$ such that $|x - c| > \delta$, we have $g(x) \le M$. Therefore,
$$E[g(Z_n)] \le (g(c) + \epsilon) \int_{|x-c| \le \delta} p_{Z_n}\,dx + M \int_{|x-c| > \delta} p_{Z_n}\,dx = (g(c) + \epsilon) P(|Z_n - c| \le \delta) + M P(|Z_n - c| > \delta).$$
Similarly, we have that
$$E[g(Z_n)] \ge (g(c) - \epsilon) P(|Z_n - c| \le \delta) - M P(|Z_n - c| > \delta).$$
Letting $n \to \infty$, $P(|Z_n - c| > \delta) \to 0$ and $P(|Z_n - c| \le \delta) \to 1$, so
$$g(c) - \epsilon \le \liminf_{n \to \infty} E[g(Z_n)] \le \limsup_{n \to \infty} E[g(Z_n)] \le g(c) + \epsilon.$$
Since $\epsilon > 0$ was arbitrary, $\lim_{n \to \infty} E[g(Z_n)] = g(c)$. □
n→∞
