A First Course in Probability Notes
Lou Yi
Contents
2 Axioms of Probability
  2.1 Axioms of Probability
  2.2 Inclusion-Exclusion Identity
  2.3 Equally Likely Outcomes
  2.4 Limit of Probability
  2.5 Probability as a Measure of Belief
3 Conditional Probability
  3.1 Conditional Probability
  3.2 Bayes' Formula
  3.3 Independent Events
  3.4 Conditional Probability is a Probability
6 Jointly Distributed Random Variables
  6.1 Joint Cumulative Distribution Function
  6.2 Joint Distribution of Random Variables
  6.3 Independent Random Variables
  6.4 Sums of Independent Random Variables
    6.4.1 Sum of Binomial
    6.4.2 Sum of Poisson
    6.4.3 Sum of Uniform
    6.4.4 Sum of Gamma
    6.4.5 Sum of Normals
    6.4.6 Sum of Exponential
  6.5 Conditional Distribution
  6.6 Joint Distribution of Functions
8 Limit Theorems
  8.1 Inequalities
  8.2 Limit Theorems
1 Combinatorial Analysis
The basic principle of counting:
Suppose that two experiments are to be performed. If the first experiment can result in any one of m possible
outcomes and if for each outcome of experiment 1, there are n possible outcomes of experiment 2, then together
there are mn possible outcomes of the two experiments.
Proof: We can list the outcomes of the two experiments in an m × n matrix, with entry (i, j) the pair (ith outcome of experiment 1, jth outcome of experiment 2). The matrix has mn entries, hence there are mn possible outcomes. □
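One can sanity-check the principle by direct enumeration; the following sketch uses only Python's standard library, with illustrative outcome sets:

```python
from itertools import product

# Basic principle of counting: m outcomes for experiment 1 and n outcomes
# for experiment 2 yield m * n combined outcomes.
experiment1 = ["a", "b", "c"]   # m = 3 possible outcomes
experiment2 = [1, 2, 3, 4]      # n = 4 possible outcomes

outcomes = list(product(experiment1, experiment2))
assert len(outcomes) == len(experiment1) * len(experiment2)  # 3 * 4 = 12
```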
Permutation:
If we want to arrange n distinct items, then there are a total of n! different orderings.
Proof: Note that there are n different choices for the first position of the arrangement, followed by n − 1 choices for the second position, n − 2 for the third, and so on. Hence there are n × (n − 1) × · · · × 1 = n! different orderings. □
Permutations with indistinguishable objects:
If we want to arrange n objects of which n1 are alike, n2 are alike, · · · , nr are alike, then there are
  n! / (n1! n2! · · · nr!)
different permutations.
Proof: Firstly there are n! different permutations if all the objects are distinct. Then for each group of nk
identical items, there are nk ! different ways to arrange them, hence for each nk identical items, each ordering in
the original permutation is counted repeatedly by nk ! times. Thus the formula follows. □
Combination:
We define \binom{n}{r}, for r ≤ n, by
  \binom{n}{r} = n! / ((n − r)! r!),
and say that \binom{n}{r} represents the number of possible combinations of n objects taken r at a time.
Proof: There are n!/(n − r)! permutations of length r chosen from n items. Since order does not matter, each combination is counted r! times, hence the total number of combinations is n!/((n − r)! r!). □
Pascal's identity: for 1 ≤ r ≤ n, \binom{n}{r} = \binom{n−1}{r−1} + \binom{n−1}{r}.
Proof: Note that in order to choose r items from a list of n objects, we can either include the first item and choose r − 1 items from the remaining n − 1 objects, or exclude the first item and choose r items from the remaining n − 1 objects. Hence we have the formula. □
Pascal's Triangle:
Pascal's Triangle is a triangular array of numbers in which the entry in the ith row and jth column has the value \binom{i−1}{j−1}.
The Binomial Theorem:
  (x + y)^n = Σ_{k=0}^{n} \binom{n}{k} x^k y^{n−k}.
Hence the numbers \binom{n}{k} are often known as binomial coefficients.
Proof: The proof is done by induction, or one can note that the coefficient of x^k y^{n−k} counts the \binom{n}{k} ways to choose which k of the n factors contribute an x. □
Number of subsets:
A set of n elements has 2^n subsets.
Proof: Since there are \binom{n}{k} subsets of size k, the total number of subsets is
  Σ_{k=0}^{n} \binom{n}{k} = (1 + 1)^n = 2^n. □
Permutations of length r choosing from n items:
If we want to arrange items in a list of length r chosen from n objects, with r ≤ n, then there are
  n! / (n − r)!
such permutations.
Proof: Note that, by the same reasoning as above, there are n × (n − 1) × · · · × (n − r + 1) different orderings, which is equal to n!/(n − r)!. □
Multinomial coefficients:
Let n = n1 + n2 + · · · + nr. The multinomial coefficient is defined by
  \binom{n}{n1, n2, · · · , nr} = n! / (n1! · n2! · · · nr!).
It represents the number of possible divisions of n distinct objects into r distinct groups of sizes n1, · · · , nr respectively, and equally the number of possible permutations of n objects of which n1 are alike, n2 are alike, · · · , nr are alike, where n1 + n2 + · · · + nr = n.
Integer solutions of equations:
The number of integer solutions of
  x1 + x2 + · · · + xr = n,
where xi ≥ ki for i ∈ {1, 2, · · · , r}, is given by
  \binom{n + r − 1 − Σ_{i=1}^{r} ki}{r − 1}.
In particular, the number of positive integer solutions (each ki = 1) is \binom{n−1}{r−1}, and the number of nonnegative integer solutions (each ki = 0) is \binom{n+r−1}{r−1}.
Proof: We can transform the problem into one of inserting r − 1 plates into the gaps between n + r − Σ_{i=1}^{r} ki items: considering yi = xi + (1 − ki), each yi ≥ 1 and
  y1 + y2 + · · · + yr = n + r − Σ_{i=1}^{r} ki,
and the number of positive solutions of this equation is \binom{n + r − 1 − Σ ki}{r − 1}. □
Integer solutions of an inequality:
The number of integer solutions of
  x1 + x2 + · · · + xr ≤ n, with xi ≥ ki,
is \binom{n + r − Σ_{i=1}^{r} ki}{r}.
Proof: By the same reasoning, introducing a slack variable yr+1 ≥ 1, it is equivalent to counting the positive integer solutions of
  y1 + y2 + · · · + yr + yr+1 = n + r + 1 − Σ_{i=1}^{r} ki,
which number \binom{n + r − Σ_{i=1}^{r} ki}{r}. □
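A brute-force enumeration agrees with the stars-and-bars counts for small cases; this is an illustrative check, not part of the notes:

```python
from itertools import product
from math import comb

def count_solutions(n, r, k):
    """Count integer solutions of x1 + ... + xr = n with each xi >= k."""
    return sum(1 for xs in product(range(k, n + 1), repeat=r) if sum(xs) == n)

n, r = 10, 3
# Positive solutions: C(n-1, r-1); nonnegative solutions: C(n+r-1, r-1).
assert count_solutions(n, r, 1) == comb(n - 1, r - 1)   # 36
assert count_solutions(n, r, 0) == comb(n + r - 1, r - 1)  # 66
```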
2 Axioms of Probability
2.1 Axioms of Probability
Sample Space:
Consider an experiment whose outcome is not predictable. The set of all possible outcomes of the experiment is
called the sample space. It is usually denoted by S.
Events:
Any subset E of the sample space S is an event. If the outcome of the experiment is contained in E, then we say that E occurs.
Note that S itself is an event, which is also known as the sure event. ∅ is also an event, which is known as the null
event.
Operations on Events:
Operations on Events are precisely operations on sets. Let E and F be two events of a sample space S, then
(Commutative Laws) E ∪ F = F ∪ E, EF = F E.
(Distributive Laws)
– (⋃_{n=1}^{∞} En) ∩ F = ⋃_{n=1}^{∞} (En ∩ F);
– (⋂_{n=1}^{∞} En) ∪ F = ⋂_{n=1}^{∞} (En ∪ F).
(DeMorgan's Laws)
– (⋃_{n=1}^{∞} En)^c = ⋂_{n=1}^{∞} En^c;
– (⋂_{n=1}^{∞} En)^c = ⋃_{n=1}^{∞} En^c.
Probability (relative frequency interpretation):
Let E be any event of an experiment. Let n(E) be the number of times that E occurs in the first n repetitions of the experiment. The probability of E is
  P(E) = lim_{n→∞} n(E)/n.
Axioms of Probability:
Let S be the sample space of an experiment. Suppose that a number P(E) is defined for every event E of S, such that:
1. 0 ≤ P(E) ≤ 1;
2. P(S) = 1;
3. for any sequence of mutually exclusive events E1, E2, · · · (Ei Ej = ∅ when i ≠ j),
  P(⋃_{i=1}^{∞} Ei) = Σ_{i=1}^{∞} P(Ei).
Proposition: P(∅) = 0.
Proof: Let ∅ = E1 = E2 = E3 = · · · . Then E1, E2, · · · are mutually exclusive, so by the third axiom of probability P(∅) = Σ_{i=1}^{∞} P(∅), which is possible only if P(∅) = 0. □
Proposition (finite additivity): For mutually exclusive events E1, · · · , En,
  P(⋃_{i=1}^{n} Ei) = Σ_{i=1}^{n} P(Ei).
Proof: Let ∅ = En+1 = En+2 = · · · . Then E1, E2, · · · are mutually exclusive, hence
  P(⋃_{i=1}^{n} Ei) = P(⋃_{i=1}^{∞} Ei) = Σ_{i=1}^{∞} P(Ei) = Σ_{i=1}^{n} P(Ei). □
Proposition (monotonicity): If E ⊆ F, then P(E) ≤ P(F).
Proof: It is clear that E and FE^c are mutually exclusive, and their union is F. Then P(F) = P(E) + P(E^c F) ≥ P(E) + 0 = P(E). □
Proposition: For any events E and F,
  P(E ∪ F) = P(E) + P(F) − P(EF),
and consequently P(E ∪ F) ≤ P(E) + P(F).
Proof:
  P(E ∪ F) = P(EF^c) + P(EF) + P(FE^c)
    = [P(EF^c) + P(EF)] + [P(EF) + P(FE^c)] − P(EF)
    = P(E) + P(F) − P(EF). □
Theorem 2.8 (Inclusion-Exclusion Identity) Let E1, E2, · · · , En be events; then
  P(E1 ∪ E2 ∪ · · · ∪ En) = Σ_{i=1}^{n} P(Ei) − Σ_{i1<i2} P(Ei1 Ei2) + · · ·
    + (−1)^{r+1} Σ_{i1<i2<···<ir} P(Ei1 Ei2 · · · Eir)
    + · · · + (−1)^{n+1} P(E1 E2 · · · En).
Proof (by induction on n): The case n = 2 is the preceding proposition. For the inductive step, write
  P(E1 ∪ · · · ∪ En ∪ En+1) = P(E1 ∪ · · · ∪ En) + P(En+1) − P((E1 ∪ · · · ∪ En) ∩ En+1).
The first and last terms in brackets are n-fold unions, for which we assume the formula holds (inductive hypothesis). Therefore
  P(E1 ∪ E2 ∪ · · · ∪ En ∪ En+1)
    = Σ_{1≤i≤n} P(Ei) − Σ_{1≤i1<i2≤n} P(Ei1 ∩ Ei2) + Σ_{1≤i1<i2<i3≤n} P(Ei1 ∩ Ei2 ∩ Ei3) − · · · + (−1)^{n+1} P(E1 ∩ E2 ∩ · · · ∩ En)
    + P(En+1) − Σ_{1≤i≤n} P(Ei ∩ En+1) + Σ_{1≤i1<i2≤n} P(Ei1 ∩ Ei2 ∩ En+1)
    − · · · − (−1)^{n} Σ_{1≤i1<i2<···<in−1≤n} P(Ei1 ∩ Ei2 ∩ · · · ∩ Ein−1 ∩ En+1),
and grouping the terms of equal order gives the identity for n + 1 events.
Alternative proof (counting argument): Suppose an outcome is contained in exactly m of the events Ei. On the left-hand side it is counted once; on the right-hand side it is counted \binom{m}{1} − \binom{m}{2} + \binom{m}{3} − · · · ± \binom{m}{m} times. Thus, for m > 0, we must show that
  1 = \binom{m}{1} − \binom{m}{2} + \binom{m}{3} − · · · ± \binom{m}{m}.
However, since 1 = \binom{m}{0}, the preceding equation is equivalent to
  Σ_{i=0}^{m} (−1)^i \binom{m}{i} = 0,
and the latter equation follows from the binomial theorem, since
  0 = (−1 + 1)^m = Σ_{i=0}^{m} \binom{m}{i} (−1)^i (1)^{m−i}.
Since every outcome is counted the same number of times on both sides of the equality, the respective probabilities are equal. □
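One can verify the identity numerically on random events in a finite, equally likely sample space; the following sketch is illustrative:

```python
from itertools import combinations
import random

random.seed(0)
S = range(100)  # equally likely outcomes
events = [set(random.sample(S, random.randint(10, 60))) for _ in range(4)]

def P(A):
    return len(A) / len(S)

lhs = P(set().union(*events))
rhs = 0.0
for r in range(1, len(events) + 1):          # alternating sums over r-fold intersections
    for combo in combinations(events, r):
        rhs += (-1) ** (r + 1) * P(set.intersection(*combo))
assert abs(lhs - rhs) < 1e-12
```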
Proposition 2.9 The probability of drawing a specific card from a normal deck of 52 cards, after burning any number (less than 52) of cards, is 1/52.
Proof: Suppose the specific card is at the ith position in the deck; there are 51! orderings for which this is possible, out of 52! total possible permutations, so the probability is 51!/52! = 1/52 no matter how many cards are burned. □
Corollary 2.9.1 Suppose there are n specific cards in a deck of 52 cards; then the probability of drawing any of them, after burning any number (less than 52) of cards, is n/52.
Definition: a sequence of events {En, n ≥ 1} is said to be an increasing sequence if
  E1 ⊂ E2 ⊂ · · · ⊂ En ⊂ En+1 ⊂ · · · ,
whereas it is said to be a decreasing sequence if
  E1 ⊃ E2 ⊃ · · · ⊃ En ⊃ En+1 ⊃ · · · .
Definition: if {En, n ≥ 1} is an increasing sequence of events, then we define a new event, denoted by lim_{n→∞} En, by
  lim_{n→∞} En = ⋃_{i=1}^{∞} Ei.
Similarly, if {En, n ≥ 1} is a decreasing sequence of events, we define
  lim_{n→∞} En = ⋂_{i=1}^{∞} Ei.
Proposition (continuity of probability): If {En, n ≥ 1} is either an increasing or a decreasing sequence of events, then
  lim_{n→∞} P(En) = P(lim_{n→∞} En).
Proof: Suppose first that {En, n ≥ 1} is an increasing sequence, and define the events Fn, n ≥ 1, by F1 = E1 and
  Fn = En (⋃_{i=1}^{n−1} Ei)^c = En E_{n−1}^c, n > 1;
the Fn are mutually exclusive with ⋃_{i=1}^{n} Fi = ⋃_{i=1}^{n} Ei for all n. Thus,
  P(⋃_{i=1}^{∞} Ei) = P(⋃_{i=1}^{∞} Fi)
    = Σ_{i=1}^{∞} P(Fi)   (by Axiom 3 of a probability function)
    = lim_{n→∞} Σ_{i=1}^{n} P(Fi)
    = lim_{n→∞} P(⋃_{i=1}^{n} Fi)
    = lim_{n→∞} P(⋃_{i=1}^{n} Ei)
    = lim_{n→∞} P(En),
which proves the result when {En , n ≥ 1} is increasing.
If {En, n ≥ 1} is a decreasing sequence, then {En^c, n ≥ 1} is an increasing sequence. Hence, from the preceding equations,
  P(⋃_{i=1}^{∞} Ei^c) = lim_{n→∞} P(En^c).
However, because ⋃_{i=1}^{∞} Ei^c = (⋂_{i=1}^{∞} Ei)^c, it follows that
  P((⋂_{i=1}^{∞} Ei)^c) = lim_{n→∞} P(En^c).
Thus
  1 − P(⋂_{i=1}^{∞} Ei) = lim_{n→∞} [1 − P(En)] = 1 − lim_{n→∞} P(En).
Therefore, we conclude
  P(⋂_{i=1}^{∞} Ei) = lim_{n→∞} P(En). □
3 Conditional Probability
3.1 Conditional Probability
Definition: suppose P(F) > 0; the conditional probability that E occurs given that F has occurred is given by
  P(E|F) = P(EF)/P(F).
Lemma 3.1 Suppose E ⊆ F; then P(E|F) = P(E)/P(F).
Proof: Since E ⊆ F, EF = E, so P(E|F) = P(EF)/P(F) = P(E)/P(F). □
Proposition 3.2 (Multiplication Rule) P(EF) = P(E|F)P(F).
Proof: Suppose P(F) > 0; then P(E|F) = P(EF)/P(F), hence P(EF) = P(E|F)P(F). Now suppose P(F) = 0; then we take P(E|F) = 0, as the probability of E happening given that F happens is zero (F can never happen). So again we have P(EF) = P(E|F)P(F). □
Corollary 3.2.1 (General Multiplication Rule) P (E1 · · · En ) = P (E1 )P (E2 |E1 ) · · · P (En |E1 · · · En−1 ).
Proof: Suppose P (E1 · · · En−1 ) = 0, then the statement is trivial as both sides equal to 0. Otherwise, we can use
induction to prove the general statement. □
Proposition 3.3 (Law of Total Probability) P (E) = P (EF )+P (EF C ) = P (E|F )P (F )+P (E|F C )[1−P (F )].
Theorem 3.4 (Bayes' Formula) Let E and F be events with P(E) > 0; then
  P(F|E) = P(E|F)P(F) / [P(E|F)P(F) + P(E|F^c)P(F^c)].
Proof:
  P(F|E) = P(FE)/P(E) = P(EF) / [P(E|F)P(F) + P(E|F^c)P(F^c)] = P(E|F)P(F) / [P(E|F)P(F) + P(E|F^c)P(F^c)]. □
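A numerical illustration of the formula (the test characteristics below are invented for the example): a positive result from a fairly accurate test for a rare condition still leaves the posterior probability modest.

```python
# P(F): prior; P(E|F): true positive rate; P(E|F^c): false positive rate.
p_F = 0.01
p_E_given_F = 0.99
p_E_given_Fc = 0.05

posterior = (p_E_given_F * p_F) / (p_E_given_F * p_F + p_E_given_Fc * (1 - p_F))
print(round(posterior, 4))  # ~0.1667: P(F|E) is only about 1/6
```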
Proposition 3.5 P(F|E) ≥ P(F) if and only if P(E|F) ≥ P(E|F^c). This is to say, if E and F are positively associated, then E and F^c are negatively associated.
Proof: Note P(F|E) = P(E|F)P(F) / [P(E|F)P(F) + P(E|F^c)P(F^c)]; then
  P(F|E) ≥ P(F) ⇔ P(E|F)P(F) / [P(E|F)P(F) + P(E|F^c)P(F^c)] ≥ P(F)
    ⇔ P(E|F) ≥ P(E|F)P(F) + P(E|F^c)P(F^c)
    ⇔ P(E|F)P(F^c) ≥ P(E|F^c)P(F^c)
    ⇔ P(E|F) ≥ P(E|F^c). □
Proposition 3.6 (Generalized Law of Total Probability) Let F1, F2, · · · be mutually exclusive with ⋃_{n=1}^{∞} Fn = S. For any event E, we have
  P(E) = Σ_{n=1}^{∞} P(E|Fn)P(Fn).
Proof: Since ⋃_{n=1}^{∞} Fn = S,
  E = E ∩ S = E ∩ ⋃_{n=1}^{∞} Fn = ⋃_{n=1}^{∞} (E ∩ Fn).
Hence, the E ∩ Fn being mutually exclusive,
  P(E) = P(⋃_{n=1}^{∞} (E ∩ Fn)) = Σ_{n=1}^{∞} P(EFn) = Σ_{n=1}^{∞} P(E|Fn)P(Fn). □
Theorem 3.7 (Generalized Bayes' Formula) Let F1, F2, · · · be mutually exclusive with ⋃_{n=1}^{∞} Fn = S. For any event E with P(E) > 0, we have
  P(Fj|E) = P(E|Fj)P(Fj) / Σ_{n=1}^{∞} P(E|Fn)P(Fn).
Definition: the odds of an event A are
  P(A)/P(A^c) = P(A)/(1 − P(A)).
Lemma 3.8 Suppose E and F are events; then the odds of the event F given E are
  P(F|E)/P(F^c|E) = P(F)P(E|F) / [P(F^c)P(E|F^c)].
Proof:
  P(F|E)/P(F^c|E) = P(FE)/P(F^c E) = P(F)P(E|F) / [P(F^c)P(E|F^c)]. □
Definition: events E and F are said to be independent if P(EF) = P(E)P(F).
Lemma 3.9 If P(F) > 0, then E and F are independent if and only if P(E|F) = P(E). If P(E) > 0, then E and F are independent if and only if P(F|E) = P(F).
Proof: Suppose P(F) > 0. Then P(E|F) = P(EF)/P(F) = P(E) if and only if P(EF) = P(E)P(F). Similarly we have the second assertion. □
Proposition 3.10 (Property of Independent Events) If E and F are independent, then so are E and F^c.
Proof: Suppose E and F are independent, i.e., P(EF) = P(E)P(F). Notice that EF and EF^c are mutually exclusive with union E. Then
  P(EF^c) = P(E) − P(EF) = P(E) − P(E)P(F) = P(E)[1 − P(F)] = P(E)P(F^c). □
Proposition 3.11 Suppose E, F, G are independent. Then E is independent of any event formed from F and G.
Definition: suppose an experiment consists of a sequence of subexperiments. Let Ei be the outcome of the ith subexperiment. If E1, E2, · · · are independent and have the same set of possible outcomes, then they are often called trials.
Proposition 3.12 Let E and F be mutually exclusive events. Suppose independent trials are performed. Then the probability that E occurs before F is
  P(E) / (P(E) + P(F)).
Proof: Let S be the event that E occurs before F, and let K be the event that neither happens on a given trial, so P(K) = 1 − P(E) − P(F). Conditioning on the first trial, P(S) = P(E) + P(K)P(S); solving gives P(S) = P(E)/(P(E) + P(F)). □
Proposition 3.13 Suppose that a man is gambling against an infinitely rich adversary and at each stage he either wins or loses 1 unit with respective probabilities p and 1 − p. If the man starts with i units, then the probability that he will eventually go broke is
  1, if p ≤ 1/2;
  (q/p)^i, if p > 1/2,
where q = 1 − p.
Proof: Let P(n) denote the probability that the man starts with n units and goes broke. Then P(0) = 1 and P(n) = P(1)^n for n ≥ 1 (to go broke from n units he must first reach n − 1, then n − 2, and so on, each step an independent copy of going broke from 1). Also, conditioning on the first bet, P(1) = (1 − p) · P(0) + p · P(2) = (1 − p) + p[P(1)]^2. Solving this quadratic for P(1) gives the desired result. □
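A Monte Carlo sketch (illustrative parameters; for p > 1/2 a walk that reaches a high ceiling is treated as having escaped, which is accurate up to (q/p)^ceiling):

```python
import random

def ruin_probability(i, p, trials=20_000, ceiling=60):
    """Estimate P(eventual ruin) from fortune i; valid demo for p > 1/2."""
    broke = 0
    for _ in range(trials):
        fortune = i
        while 0 < fortune < ceiling:
            fortune += 1 if random.random() < p else -1
        broke += (fortune == 0)
    return broke / trials

random.seed(1)
p, i = 0.6, 3
print(ruin_probability(i, p), ((1 - p) / p) ** i)  # both ~ 0.296
```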
3.4 Conditional Probability is a Probability
Proposition 3.14 Fix an event F with P(F) > 0. Then Q(E) = P(E|F) is itself a probability, i.e., it satisfies the three axioms.
Proof: We verify that the conditional probability satisfies the 3 axioms of probability.
1. 0 ≤ P(E|F) ≤ 1.
Proof: EF ⊆ F, so 0 ≤ P(EF) ≤ P(F) and 0 ≤ P(E|F) = P(EF)/P(F) ≤ 1.
2. P(S|F) = 1.
Proof: P(S|F) = P(SF)/P(F) = P(F)/P(F) = 1.
3. If E1, E2, · · · are mutually exclusive, then P(⋃_{i=1}^{∞} Ei | F) = Σ_{i=1}^{∞} P(Ei|F).
This is because
  P(⋃_{i=1}^{∞} Ei | F) = P((⋃_{i=1}^{∞} Ei)F)/P(F) = P(⋃_{i=1}^{∞} Ei F)/P(F)
    = Σ_{i=1}^{∞} P(Ei F)/P(F) = Σ_{i=1}^{∞} P(Ei|F). □
Proposition 3.15 (Properties of Conditional Probability As A Probability) Fix any event F with P(F) > 0; then Q(E) = P(E|F) inherits all the properties of a probability. For example, the inclusion-exclusion identity holds:
  P(E1 ∪ E2 | F) = P(E1|F) + P(E2|F) − P(E1E2|F).
And, applying the law of total probability to Q (with Q(E|G) = Q(EG)/Q(G) = P(E|FG)),
  P(E|F) = Q(E)
    = Q(E|G)Q(G) + Q(E|G^c)Q(G^c)
    = P(E|FG)P(G|F) + P(E|FG^c)P(G^c|F).
4 Discrete Random Variables
4.1 Definition Involving Discrete Random Variables
Definition: on the sample space of an experiment, the quantities of interest, or real-valued functions on the sample
space are called random variables.
Definition: suppose a random variable X can take on at most a countable number of possible values (finite {a1, · · · , an} or enumerable {a1, a2, · · · }). Then X is said to be discrete, and pX(a) = P(X = a) is the probability mass function of X.
Suppose X only assumes values in {a1, a2, · · · , an}, the ai all distinct; then Σ_{i=1}^{n} pX(ai) = 1 and pX(a) = 0 for a ≠ a1, a2, · · · , an.
Suppose X only assumes values in {a1, a2, · · · , an, · · · }, the ai all distinct; then Σ_{i=1}^{∞} pX(ai) = 1 and pX(a) = 0 for a ≠ a1, a2, · · · .
Note that for a discrete random variable X, if pX(a) = 0, we may assume that X does not take the value a.
Suppose a discrete random variable X takes values a1 < a2 < · · · . Then the distribution function FX is a non-decreasing step function: if ai ≤ a < ai+1, then
  FX(a) = Σ_{x≤a} p(x) = p(a1) + · · · + p(ai).
Properties of a cumulative distribution function F:
1. F is a nondecreasing function.
2. lim F (b) = 1.
b→∞
3. lim F (b) = 0.
b→−∞
4. F is right continuous.
18
E[X] = ∞
P
i=1 xi pX (xi ).
Definition: we define the indicator variable of an event E to be
  I = 1 if E occurs, 0 if E does not occur.
Then E[I] = P(E).
Proof: Note I can only take values 0 and 1, with pI(1) = P(E) and pI(0) = P(E^c). Hence E[I] = 1 · P(E) + 0 · P(E^c) = P(E). □
Note: suppose X is a discrete random variable, then for any function g, Y = g(X) is again a discrete random
variable.
Proposition 4.3 Let X be a discrete random variable with values x1, x2, · · · . Then E[g(X)] = Σ_{i=1}^{∞} g(xi)pX(xi) for any function g.
Proof: Let Y = g(X); then E[g(X)] = E[Y] = Σ_y yP(Y = y). Fix y such that g(x) = y for some x, and let Ey = {x | g(x) = y}. Then P(Y = y) = Σ_{x∈Ey} P(X = x). Therefore
  E[g(X)] = Σ_y y Σ_{x∈Ey} P(X = x)
    = Σ_y Σ_{x∈Ey} yP(X = x)
    = Σ_y Σ_{x∈Ey} g(x)P(X = x)
    = Σ_x g(x)P(X = x). □
Corollary: for a constant c, E[c] = c.
Proof: E[c] = Σ_x c pX(x) = c Σ_x pX(x) = c. □
Lemma 4.5 Suppose X is a discrete random variable with values x1, x2, · · · . Then for a, b ∈ R, E[aX + b] = aE[X] + b.
Proof: If a, b ∈ R, then
  E[aX + b] = Σ_{i=1}^{∞} (axi + b)pX(xi)
    = a Σ_{i=1}^{∞} xi pX(xi) + b Σ_{i=1}^{∞} pX(xi)
    = aE[X] + b. □
Definition: let X be a random variable; we denote the mean value E[X] by µX. The variance of X is
  Var(X) = E[(X − µX)^2].
Proposition 4.8 Var(X) = E[X^2] − (E[X])^2.
Proof: With µ = E[X],
  Var(X) = E[(X − µ)^2] = E[X^2 − 2µX + µ^2] = E[X^2] − 2µE[X] + µ^2 = E[X^2] − (E[X])^2. □
Corollary 4.8.1 For any discrete random variable X, E[X^2] ≥ (E[X])^2 and, when E[X] ≠ 0, E[X^2]/|E[X]| ≥ |E[X]|.
Proof: It is clear that E[(X − µ)^2] ≥ 0, and E[X^2] − (E[X])^2 = E[(X − µ)^2] ≥ 0. Dividing by |E[X]| gives the second claim. □
Definition: we define the standard deviation of X to be the principal square root of Var(X), i.e., SD(X) = √Var(X). Usually we write Var(X) = σ^2, where σ ≥ 0 is the standard deviation. Then SD(aX + b) = |a|SD(X).
Definition: suppose pX(i) = \binom{n}{i} p^i (1 − p)^{n−i}, 0 < p < 1, i = 0, 1, · · · , n. Then X is said to have a binomial distribution with parameters (n, p). In particular, a Bernoulli random variable is binomial with parameters (1, p).
Proposition 4.10 Let X be a binomial random variable with parameters (n, p). Then E[X] = np and Var(X) = np(1 − p).
Proof: Notice that if we let Xi denote the random variable with Xi = 1 if the ith trial is a success and Xi = 0 if the ith trial is a failure, then
  E[X] = Σ_{i=1}^{n} E[Xi] = Σ_{i=1}^{n} p = np.
Now note that Var(X) = E[X^2] − (E[X])^2 and (E[X])^2 = n^2 p^2, hence we calculate E[X^2]; but first we compute E[X(X − 1)]:
  E[X(X − 1)] = Σ_{i=0}^{n} i(i − 1) \binom{n}{i} p^i (1 − p)^{n−i}
    = n(n − 1)p^2 Σ_{i=2}^{n} [(n − 2)! / ((i − 2)!(n − i)!)] p^{i−2} (1 − p)^{n−i}
    = n(n − 1)p^2 Σ_{k=0}^{n−2} \binom{n−2}{k} p^k (1 − p)^{(n−2)−k}   (k = i − 2)
    = n(n − 1)p^2 · 1
    = n(n − 1)p^2.
Then it follows that E[X^2] = E[X(X − 1)] + E[X] = n(n − 1)p^2 + np, so Var(X) = n(n − 1)p^2 + np − n^2p^2 = np(1 − p). □
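A direct computation from the pmf confirms both formulas exactly (illustrative parameters):

```python
from math import comb

n, p = 12, 0.3
pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

mean = sum(i * pmf[i] for i in range(n + 1))
var = sum(i * i * pmf[i] for i in range(n + 1)) - mean**2
assert abs(mean - n * p) < 1e-12          # E[X] = np
assert abs(var - n * p * (1 - p)) < 1e-12  # Var(X) = np(1-p)
```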
Lemma 4.11 Suppose X is a binomial random variable with parameters (n, p). Then
  pX(i)/pX(i − 1) = [(n − i + 1)/i] · [p/(1 − p)]
and
  pX(i + 1)/pX(i) = [(n − i)/(i + 1)] · [p/(1 − p)].
In addition, we have
  pX(i + 1) = [(n − i)p / ((i + 1)(1 − p))] pX(i).
Proposition 4.12 Suppose X is a binomial random variable with parameters (n, p). If (n + 1)p is not an integer, then pX(i) first increases monotonically, reaches its largest value when i is the largest integer ≤ (n + 1)p, and then decreases monotonically. If (n + 1)p is an integer, then pX(i) takes its maximum value at both (n + 1)p and (n + 1)p − 1.
Proof: By the previous lemma, pX(i) ≥ pX(i − 1) if and only if (n − i + 1)p ≥ i(1 − p), which happens if and only if i ≤ (n + 1)p. Hence we have the desired result. □
Proposition 4.13 Suppose X is a binomial random variable with parameters (n, p), then E[X k ] = npE[(Y +1)k−1 ]
where Y is a binomial random variable with parameters (n − 1, p).
Proof: Recall the identity
  i \binom{n}{i} = n \binom{n−1}{i−1};
then
  E[X^k] = Σ_{i=0}^{n} i^k \binom{n}{i} p^i (1 − p)^{n−i}
    = np Σ_{i=1}^{n} i^{k−1} \binom{n−1}{i−1} p^{i−1} (1 − p)^{n−i}
    = np Σ_{j=0}^{n−1} (j + 1)^{k−1} \binom{n−1}{j} p^j (1 − p)^{n−1−j}   (j = i − 1)
    = np E[(Y + 1)^{k−1}]. □
Definition: X is a Poisson random variable with parameter λ > 0 if
  pX(i) = P(X = i) = e^{−λ} λ^i / i!, i = 0, 1, 2, · · · .
A Poisson random variable with parameter λ > 0 approximates a binomial random variable with parameters (n, p) such that λ = np, with n very large and p very small.
Proposition 4.14 Let X be a Poisson random variable with parameter λ > 0. Then E[X] = λ and Var(X) = λ.
Proof: Let X be a Poisson random variable with parameter λ > 0. Then pX(i) = P(X = i) = e^{−λ} λ^i / i!. So
  E[X] = Σ_{i=0}^{∞} i e^{−λ} λ^i / i!
    = Σ_{i=1}^{∞} e^{−λ} λ^i / (i − 1)!
    = λ e^{−λ} Σ_{j=0}^{∞} λ^j / j!   (j = i − 1)
    = λ e^{−λ} e^{λ}
    = λ.
Similarly, we compute E[X(X − 1)] = λ^2; hence E[X^2] = λ^2 + λ, so Var(X) = E[X^2] − (E[X])^2 = λ. □
Poisson Approximation:
One can approximate the probability mass function of a binomial random variable with parameter (n, p) using
Poisson distribution with parameter λ = np, when n is large enough and p is very small, so λ is moderate.
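The quality of the approximation is easy to inspect numerically (parameters chosen for illustration):

```python
from math import comb, exp, factorial

n, p = 500, 0.01     # n large, p small; lambda = np = 5 is moderate
lam = n * p
for i in range(6):
    binom = comb(n, i) * p**i * (1 - p)**(n - i)
    poisson = exp(-lam) * lam**i / factorial(i)
    print(i, round(binom, 5), round(poisson, 5))  # the two columns nearly agree
```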
Poisson Paradigm:
Let pi be the probability that event i occurs, i = 1, · · · , n. If the pi are small, and the trials are independent or weakly dependent, then the total number of these events that occur can be approximated by a Poisson random variable with parameter λ = Σ_{i=1}^{n} pi.
Poisson Process:
Suppose events occur at random points of time, and let λ > 0. Assume that, for small h,
1. the probability that exactly 1 event occurs in an interval of length h is approximately λh;
2. the probability that 2 or more events occur in an interval of length h is much smaller than λh;
and that the numbers of events occurring in disjoint intervals are independent. Let N(t) denote the number of events occurring in an interval of length t. Then N(t) is a Poisson random variable with parameter λt and
  P(N(t) = k) = e^{−λt} (λt)^k / k!, k = 0, 1, 2, · · · .
Lemma 4.15 Let X be a Poisson random variable with parameter λ. Then P (X = i) increases monotonically and
then decreases monotonically as i increases, reaching its maximum when i is the largest integer not exceeding λ.
Proof: Note that pX(i)/pX(i − 1) = λ/i, so P(X = i) ≥ P(X = i − 1) if and only if
  λ/i ≥ 1,
which happens if and only if λ ≥ i. □
Proposition 4.16 Let X be a Poisson random variable with parameter λ. Then E[X^n] = λE[(X + 1)^{n−1}].
Proof:
  E[X^n] = Σ_{i=1}^{∞} i^n e^{−λ} λ^i / i!
    = Σ_{i=1}^{∞} i^{n−1} e^{−λ} λ^i / (i − 1)!
    = λ Σ_{j=0}^{∞} (j + 1)^{n−1} e^{−λ} λ^j / j!   (j = i − 1)
    = λ E[(X + 1)^{n−1}]. □
Proposition 4.17 Suppose that the number of events occurring in a given time period is a Poisson random variable with parameter λ. If each event is classified as a type i event with probability pi, i = 1, · · · , n, Σ pi = 1, independently of the other events, then the numbers of type i events that occur are independent Poisson random variables with respective parameters λpi.
Proof: Let X denote the Poisson random variable with parameter λ, and let Xi denote the number of type i events that occur. Given X = k, the number of type i events is binomial with parameters (k, pi). Hence, by the law of total probability, we have
  P(Xi = n) = Σ_{k=n}^{∞} \binom{k}{n} pi^n (1 − pi)^{k−n} e^{−λ} λ^k / k!
    = e^{−λ} (pi^n / n!) Σ_{k=n}^{∞} λ^k (1 − pi)^{k−n} / (k − n)!
    = e^{−λ} ((pi λ)^n / n!) Σ_{j=0}^{∞} [λ(1 − pi)]^j / j!   (j = k − n)
    = e^{−λ} ((pi λ)^n / n!) e^{λ(1−pi)}
    = e^{−pi λ} (pi λ)^n / n!.
This holds for all nonnegative integers n, hence Xi is a Poisson random variable with parameter pi λ. One can verify that the Xi's are independent using the multinomial distribution. □
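A simulation of this "thinning" property (illustrative rate and type probability; the Poisson total is generated via rate-1 exponential interarrival times):

```python
import random

random.seed(2)
lam, p1 = 10.0, 0.3
counts = []
for _ in range(100_000):
    # Number of rate-1 arrivals before time lam is Poisson(lam).
    events, t = 0, random.expovariate(1.0)
    while t < lam:
        events += 1
        t += random.expovariate(1.0)
    # Keep each event independently with probability p1.
    counts.append(sum(1 for _ in range(events) if random.random() < p1))

mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
print(round(mean, 2), round(var, 2))  # both ~ lam * p1 = 3, as for a Poisson
```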
4.6 Other Discrete Random Variables
4.6.1 Geometric Random Variable
Definition: independent trials are performed until a success occurs. Suppose the probability of success is p, where 0 < p < 1. Then if we let X denote the number of trials needed until a success occurs, we have P(X = n) = (1 − p)^{n−1} p, n = 1, 2, · · · . In this way, we define X to be the geometric random variable with parameter p.
Note that Σ_{n=1}^{∞} pX(n) = p / (1 − (1 − p)) = 1. So P(X = ∞) = 0 and we may say that the event X = ∞ does not occur.
Proposition 4.18 Let X be a geometric random variable with parameter p, 0 < p < 1. Then E[X] = 1/p and Var(X) = (1 − p)/p^2.
Proof: Let X be a geometric random variable with parameter p; then
  E[X] = Σ_{i=1}^{∞} i (1 − p)^{i−1} p.
Note that, for 0 < x < 1,
  Σ_{i=1}^{∞} (1 − x)^i = (1 − x)/x;
differentiating both sides with respect to x and multiplying by −x, we have
  1/x = Σ_{i=1}^{∞} i(1 − x)^{i−1} x,
so E[X] = 1/p. Differentiating twice instead gives
  2/x^3 = Σ_{i=2}^{∞} i(i − 1)(1 − x)^{i−2},
so, multiplying by (1 − x)x,
  2(1 − x)/x^2 = Σ_{i=2}^{∞} i(i − 1)(1 − x)^{i−1} x.
Hence E[X(X − 1)] = 2(1 − p)/p^2, so E[X^2] = 2(1 − p)/p^2 + 1/p, and Var(X) = E[X^2] − (E[X])^2 = (1 − p)/p^2. □
Lemma 4.19 (Memorylessness) Suppose n ∈ N and k ∈ N+, and X is a geometric random variable with parameter p; then P(X = n + k | X > n) = P(X = k).
Proof: P(X = n + k | X > n) is the probability that, given the first n trials are failures, the success arrives k trials after them. This is the same as having the first success at the kth trial, which has probability P(X = k). □
Definition: suppose independent trials are performed with success probability p. Let X be the number of trials needed for r successes. Suppose X = n, where n ≥ r: in the first n − 1 trials there are r − 1 successes and n − r failures, and the nth trial is a success, so P(X = n) = \binom{n−1}{r−1} p^r (1 − p)^{n−r}. Hence we define X to be the negative binomial random variable with parameters (r, p), 0 < p < 1, if
  pX(n) = \binom{n−1}{r−1} p^r (1 − p)^{n−r}, n = r, r + 1, · · · .
Note that a geometric random variable is negative binomial with parameter (1, p).
Intuitively, one can see negative binomial random variable as the reverse of binomial random variable. Thus let X
be a negative binomial random variable with parameters r and p, and let Y be a binomial random variable with
parameters n and p. Then P (X > n) = P (Y < r).
Proposition 4.20 Let X be negative binomial with parameters (r, p). Then E[X] = r/p and Var(X) = r(1 − p)/p^2.
Proof: Using the identity n\binom{n−1}{r−1} = r\binom{n}{r}, one finds
  E[X^k] = (r/p) E[(Y − 1)^{k−1}],
where Y is a negative binomial random variable with parameters (r + 1, p). Setting k = 1 in the preceding equation yields
  E[X] = r/p.
Setting k = 2 in the equation for E[X^k] and using the formula for the expected value of a negative binomial random variable gives
  E[X^2] = (r/p) E[Y − 1] = (r/p) [(r + 1)/p − 1].
Therefore
  Var(X) = (r/p)[(r + 1)/p − 1] − (r/p)^2 = r(1 − p)/p^2. □
Lemma 4.21 Suppose X is a negative binomial random variable with parameters (r, p), and Y is a binomial random variable with parameters (n, p). Then P(X > n) = P(Y < r).
Proof: X > n exactly when the rth success does not arrive within the first n trials, which is the same as getting fewer than r successes in the first n trials. Hence P(X > n) = P(Y < r). □
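The identity can be confirmed exactly from the two pmfs (illustrative parameters):

```python
from math import comb

r, p, n = 3, 0.4, 8

# P(X > n): the rth success has not arrived by trial n.
p_X_gt_n = 1 - sum(comb(k - 1, r - 1) * p**r * (1 - p)**(k - r)
                   for k in range(r, n + 1))
# P(Y < r): fewer than r successes among n trials.
p_Y_lt_r = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(r))

assert abs(p_X_gt_n - p_Y_lt_r) < 1e-12
```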
Definition: n balls are randomly chosen from an urn containing m white and N − m black balls, without replacement. Let X be the number of white balls chosen; then X is hypergeometric with parameters (n, N, m), with
  pX(i) = P(X = i) = \binom{m}{i}\binom{N−m}{n−i} / \binom{N}{n}, i = 0, 1, · · · , n.
Note that 0 ≤ X ≤ n and 0 ≤ X ≤ m; if m < i ≤ n, then \binom{m}{i} = 0, so P(X = i) = 0.
Proposition 4.22 Let X be hypergeometric with parameters (n, N, m). Then E[X] = nm/N and
  Var(X) = np(1 − p)[1 − (n − 1)/(N − 1)],
where p = m/N.
Proof: Let X be hypergeometric with parameters (n, N, m). Then
  E[X^k] = Σ_{i=0}^{n} i^k P(X = i) = Σ_{i=1}^{n} i^k \binom{m}{i}\binom{N−m}{n−i} / \binom{N}{n}.
Using the identities i\binom{m}{i} = m\binom{m−1}{i−1} and n\binom{N}{n} = N\binom{N−1}{n−1}, this reduces to
  E[X^k] = (nm/N) E[(Y + 1)^{k−1}],
where Y is a hypergeometric random variable with parameters (n − 1, N − 1, m − 1). Hence upon setting k = 1, we have
  E[X] = nm/N.
Upon setting k = 2 in the equation for E[X^k], we obtain
  E[X^2] = (nm/N) E[Y + 1] = (nm/N) [(n − 1)(m − 1)/(N − 1) + 1].
Hence
  Var(X) = (nm/N) [(n − 1)(m − 1)/(N − 1) + 1 − nm/N].
Letting p = m/N and using the identity
  (m − 1)/(N − 1) = (Np − 1)/(N − 1) = p − (1 − p)/(N − 1),
we have
  Var(X) = np [(n − 1)p − (n − 1)(1 − p)/(N − 1) + 1 − np]
    = np(1 − p) [1 − (n − 1)/(N − 1)]. □
Remark: when N is large compared to n, drawing a white ball with or without replacement changes little; in both cases the probability is almost equal to m/N. Hence we can approximate the hypergeometric probabilities using the binomial distribution with parameters (n, m/N).
Proposition 4.23 Let X be a hypergeometric random variable with parameters (n, N, m); then
  P(X = i + 1) = [(m − i)(n − i) / ((i + 1)(N − m − n + i + 1))] P(X = i), for i = 0, 1, · · · , n − 1.
Consequently, P(X = i + 1) ≥ P(X = i) if and only if
  (i + 1)(N − m − n + i + 1) ≤ (m − i)(n − i);
expanding both sides and cancelling the common terms −im − in + i^2 gives
  i(N + 2) ≤ (m + 1)(n + 1) − (N + 2), i.e., i ≤ (m + 1)(n + 1)/(N + 2) − 1.
Hence P(X = i) is maximized when i is the greatest integer not exceeding p = (m + 1)(n + 1)/(N + 2), if p is not an integer; P(X = i) is maximized at both i = p and i = p − 1, if p is an integer. □
Proposition 4.24 Suppose the sample space S is countable. For any discrete random variable X on S,
  E[X] = Σ_{s∈S} X(s)p(s).
Proof: This is intuitively clear, but we give a rigorous proof. Let Ei be the event that X = xi, i.e., s ∈ Ei ⇔ X(s) = xi. So P(X = xi) = P(Ei) = Σ_{s∈Ei} p(s). Hence
  E[X] = Σ_{i=1}^{∞} xi P(X = xi)
    = Σ_{i=1}^{∞} Σ_{s∈Ei} xi p(s)
    = Σ_{i=1}^{∞} Σ_{s∈Ei} X(s)p(s)
    = Σ_{s∈S} X(s)p(s). □
Proposition 4.25 Let X1, X2, · · · , Xn be discrete random variables. Then E[X1 + · · · + Xn] = E[X1] + · · · + E[Xn].
Proof: We prove the case of two random variables; induction gives the general result. Let X and Y be discrete random variables on S. Then
  E[X + Y] = Σ_{s∈S} (X + Y)(s)p(s) = Σ_{s∈S} [X(s) + Y(s)]p(s)
    = Σ_{s∈S} X(s)p(s) + Σ_{s∈S} Y(s)p(s)
    = E[X] + E[Y]. □
Remark: for a sum X = X1 + · · · + Xn, the second moment can be computed by expanding
  E[X^2] = E[(X1 + · · · + Xn)(X1 + · · · + Xn)] = Σ_i E[Xi^2] + Σ_{i≠j} E[Xi Xj].
4.8 Some Interesting Results
Proposition 4.27 Suppose a positive integer is chosen at random; then the probability that it contains no repeated prime factor is 6/π^2.
Proof: Suppose such an integer is chosen at random; the probability that it is not divisible by the square of the ith prime pi is 1 − 1/pi^2. Hence the probability of the number not being divisible by the square of any prime is
  Π_{i=1}^{∞} (1 − 1/pi^2) = Π_{i=1}^{∞} (pi^2 − 1)/pi^2 = 6/π^2,
by Euler's product for ζ(2) = Σ_{n≥1} 1/n^2 = π^2/6. □
5 Continuous Random Variables
5.1 Continuous Random Variable
Definition: a random variable X is continuous if there exists a nonnegative function f such that
  P(X ∈ B) = ∫_B f(x)dx
for every set B of real numbers.
Definition: suppose P(X ∈ B) = ∫_B fX(x)dx for any set B of real numbers; then fX is called the probability density function of X.
Then
1. P(X = a) = ∫_a^a fX(x)dx = 0.
2. P(a ≤ X ≤ b) = P(a < X < b) = ∫_a^b fX(x)dx for any a < b.
3. FX(a) = P(X ≤ a) = ∫_{−∞}^a fX(x)dx; it also follows that F′X(a) = fX(a) if fX is continuous at a.
4. P(X ≥ a) = ∫_a^∞ fX(x)dx.
Definition: the median m of a continuous random variable X is the value m = (a + b)/2, where
  a = inf{x : F(x) ≥ 1/2} and b = sup{x : F(x) ≤ 1/2}.
Definition: the mode m of a continuous random variable X is the value m such that fX (m) is maximum.
Lemma 5.1 Suppose X is a continuous random variable whose probability density function fX(x) is even (and whose expectation exists); then E[X] = 0.
Proof: Since fX(x) is even, xfX(x) is odd, so
  E[X] = ∫_{−∞}^{∞} xfX(x)dx = 0. □
Lemma 5.2 Let Y be a nonnegative continuous random variable with probability density function f. Then
  E[Y] = ∫_0^∞ P(Y > y)dy.
Proof:
  ∫_0^∞ P(Y > y)dy = ∫_0^∞ (∫_y^∞ f(x)dx) dy
    = ∫_0^∞ (∫_0^x f(x)dy) dx   (swapping the order of integration)
    = ∫_0^∞ xf(x)dx
    = E[Y]. □
Lemma 5.3 Suppose Y is an arbitrary continuous random variable with probability density function f. Then
  E[Y] = ∫_0^∞ P(Y > y)dy − ∫_0^∞ P(Y < −y)dy.
Proof:
  E[Y] = ∫_{−∞}^∞ xf(x)dx
    = ∫_0^∞ xf(x)dx + ∫_{−∞}^0 xf(x)dx
    = ∫_0^∞ P(Y > y)dy − ∫_0^∞ yf(−y)dy   (substituting x = −y in the second integral)
    = ∫_0^∞ P(Y > y)dy − ∫_0^∞ P(Y < −y)dy. □
Proposition 5.4 Suppose a continuous random variable X has probability density function f. Then for any function g,
  E[g(X)] = ∫_{−∞}^∞ g(x)f(x)dx.
Proof: First we prove the statement for the special case g(x) ≥ 0. Then, by Lemma 5.2,
  E[g(X)] = ∫_0^∞ P(g(X) > y)dy
    = ∫_0^∞ ∫_{g(x)>y} f(x)dx dy
    = ∫_{g(x)>0} ∫_0^{g(x)} f(x)dy dx
    = ∫_{g(x)>0} f(x)g(x)dx.
For general g, write g^+(x) = max{g(x), 0} and g^−(x) = −min{g(x), 0}. Then it is clear that g^+ ≥ 0, g^− ≥ 0, and g = g^+ − g^−: if g(x) ≥ 0, then g(x) = g(x) − 0 = g^+(x) − g^−(x); if g(x) ≤ 0, then g(x) = 0 − |g(x)| = g^+(x) − g^−(x). Hence by the linearity of expectation and integration, we have
  E[g(X)] = E[g^+(X)] − E[g^−(X)] = ∫_{−∞}^∞ [g^+(x) − g^−(x)]f(x)dx = ∫_{−∞}^∞ g(x)f(x)dx. □
Corollary 5.4.1 E[aX + b] = aE[X] + b.
Proof:
  E[aX + b] = ∫_{−∞}^∞ (ax + b)fX(x)dx = a∫_{−∞}^∞ xfX(x)dx + b∫_{−∞}^∞ fX(x)dx = aE[X] + b. □
Definition: as in the discrete case, Var(X) = E[(X − µX)^2]. The same proofs used for discrete random variables apply to continuous random variables as well; in particular,
  Var(aX + b) = a^2 Var(X).
Lemma 5.6 One can easily verify the following if X is uniformly distributed over (0, 1): E[X] = 1/2 and Var(X) = 1/12.
Definition: a random variable X is uniformly distributed over (α, β) if its density function is
  f(x) = 1/(β − α) for α < x < β, and 0 otherwise.
Note that Y = aX + b (a > 0) is uniform if X is uniform. In particular, if X is uniform over (0, 1), then Y = aX + b (a > 0) is uniform over (b, a + b). Hence Y = (β − α)X + α is uniform over (α, β), with E[Y] = (α + β)/2 and Var(Y) = (β − α)^2/12.
Bertrand's Paradox:
Bertrand's paradox is a probability problem that cannot be solved as stated. The problem asks: consider a random chord of a circle; what is the probability that the length of the chord is greater than the side of the equilateral triangle inscribed in that circle? The reason the problem cannot be solved is that we do not know what is meant by a random chord of a circle; different ways of selecting the chord lead to different probabilities.
Definition: Z is a standard normal random variable if its probability density function is
  f(x) = (1/√(2π)) e^{−x^2/2}.
Lemma 5.7
  ∫_{−∞}^∞ e^{−x^2/2} dx = √(2π).
Lemma 5.8 E[Z] = 0.
Proof: E[Z] = (1/√(2π)) ∫_{−∞}^∞ x e^{−x^2/2} dx = 0, since xe^{−x^2/2} is an odd function. □
Corollary 5.9.1 Suppose Z is the standard normal random variable and x > 0; then, by the symmetry of the density,
  P(Z > x) = P(Z < −x), i.e., 1 − Φ(x) = Φ(−x).
Proposition 5.10 Suppose Z has probability density function f (x) = √1 e−x /2 . Then Var(Z) = E[Z 2 ] = 1.
2π
Definition: X is a normal random variable with parameters (µ, σ^2) if the probability density function of X is
  f(x) = (1/(√(2π)σ)) e^{−(x−µ)^2/(2σ^2)}.
Lemma 5.11 Suppose X is a normal random variable with parameters (µ, σ^2); then Y = aX + b is a normal random variable with parameters (aµ + b, a^2σ^2).
Proof: Since X is normal with parameters (µ, σ^2), we may write X = σZ + µ with Z standard normal. Hence Y = aX + b = aσZ + aµ + b, so Y is normal with parameters (aµ + b, a^2σ^2). □
Lemma 5.12 Suppose X is normal with parameters (µ, σ^2); then P(X > µ + kσ) = 1 − Φ(k) and P(X < µ − kσ) = 1 − Φ(k).
Proof: Let Z = (X − µ)/σ; then Z is standard normally distributed, and
  P(X > µ + kσ) = P((X − µ)/σ > ((µ + kσ) − µ)/σ) = P(Z > k) = 1 − Φ(k).
Next, notice that by symmetry P(X < µ − kσ) = P(X > µ + kσ) = 1 − Φ(k), which gives the second claim (equivalently, P(X > µ − kσ) = Φ(k)). □
Theorem 5.13 (The De Moivre-Laplace Limit Theorem) Let Sn be a binomial random variable with parameters (n, p). Then
  P(a ≤ (Sn − np)/√(np(1 − p)) ≤ b) → Φ(b) − Φ(a)
as n → ∞, where Φ is the standard normal cumulative distribution function; here Sn is standardized by its mean np and variance np(1 − p). In approximations, P(Sn = i) is written as P(i − 1/2 < Sn < i + 1/2) (the continuity correction for integers).
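A numerical comparison of the exact binomial point probability with the normal approximation plus continuity correction (illustrative parameters; Φ is built from math.erf):

```python
from math import comb, erf, sqrt

def phi(x):  # standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p, i = 100, 0.5, 55
exact = comb(n, i) * p**i * (1 - p)**(n - i)
mu, sigma = n * p, sqrt(n * p * (1 - p))
approx = phi((i + 0.5 - mu) / sigma) - phi((i - 0.5 - mu) / sigma)
print(round(exact, 5), round(approx, 5))  # ~0.04847 vs ~0.04839
```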
Definition: X is an exponential random variable with parameter λ > 0 if its probability density function is f(x) = λe^{−λx} for x ≥ 0 (and 0 otherwise); equivalently, FX(x) = 1 − e^{−λx} for x ≥ 0.
Lemma 5.14 Suppose X is an exponential random variable with parameter λ; if c > 0, then cX is an exponential random variable with parameter λ/c.
Proof: Let Y = cX; then clearly Y takes nonnegative values, and for y ≥ 0, FY(y) = P(cX ≤ y) = FX(y/c). Hence
  fY(y) = (d/dy) FX(y/c) = (λ/c) e^{−λy/c}.
So, indeed, Y is exponential with parameter λ/c. □
Proposition 5.15 Suppose X is an exponential random variable with parameter λ > 0 and Y = λX. Then Y is exponential with parameter 1, E[Y] = 1, and Var(Y) = 1; consequently E[X] = 1/λ and Var(X) = 1/λ^2.
Proof: P(Y > y) = P(X > y/λ) = e^{−λ(y/λ)} = e^{−y}, hence fY(y) = e^{−y}. Then
  E[Y] = ∫_0^∞ P(Y > y)dy = ∫_0^∞ e^{−y}dy = 1,
and, applying Lemma 5.2 to Y^2,
  E[Y^2] = ∫_0^∞ P(Y^2 > y)dy = ∫_0^∞ e^{−√y}dy = 2,
so Var(Y) = E[Y^2] − (E[Y])^2 = 2 − 1^2 = 1. □
Proposition 5.16 If X is an exponential random variable with mean 1/λ (i.e., parameter λ), then
  E[X^k] = k!/λ^k, k = 1, 2, 3, · · · .
Proof: Integrating by parts,
  E[X^k] = ∫_0^∞ x^k λe^{−λx}dx
    = [−x^k e^{−λx}]_0^∞ + ∫_0^∞ kx^{k−1}e^{−λx}dx
    = (k/λ) ∫_0^∞ x^{k−1}λe^{−λx}dx
    = (k/λ) E[X^{k−1}].
Then by induction we can show that E[X^k] = k!/λ^k. □
Definition: if P (X > s + t|X > t) = P (X > s) for all s, t ≥ 0, then X is said to be memoryless.
If a random variable is memoryless, then it also implies that P (X > s + t) = P (X > s)P (X > t), for all s, t ≥ 0.
Proposition 5.17 Exponential random variables are memoryless.
Proof: Suppose the component has survived for t hours. The probability that it survives at least another s hours is
  P(X > s + t | X > t) = P(X > s + t)/P(X > t) = e^{−λ(s+t)}/e^{−λt} = e^{−λs} = P(X > s). □
Definition: X is a double exponential random variable with parameter λ > 0 if its probability density function is
  f(x) = (1/2)λe^{−λ|x|}, −∞ < x < ∞.
Lemma 5.18 Suppose FX(x) is the cumulative distribution function of a double exponential random variable X with parameter λ > 0; then
  FX(x) = (1/2)e^{λx} for x < 0, and FX(x) = 1 − (1/2)e^{−λx} for x ≥ 0.
Proposition 5.19 Suppose X is a double exponential random variable with parameter λ > 0, and let Y = |X|. Then Y is exponentially distributed with parameter λ, E[X] = 0, and Var(X) = 2/λ^2.
Proof: Y = |X| is exponentially distributed with parameter λ, so E[Y^2] = 2/λ^2. Since Y^2 = |X|^2 = X^2, it is clear that E[X^2] = E[Y^2] = 2/λ^2. By simple integration we also get E[X] = 0 (the p.d.f. is even). Then Var(X) = E[X^2] − (E[X])^2 = 2/λ^2. □
Definition: let X be a positive continuous random variable with distribution function F and density f, and let F̄(t) = 1 − F(t). Then we define λ(t) = f(t)/F̄(t) to be the hazard (failure) rate function. The interpretation of the hazard rate function is that if an object has functioned for time t, then λ(t)dt is approximately the probability that it fails within the next dt units of time.
Proposition 5.20 Let λ(s), s > 0, be the hazard rate function of a positive random variable X. Then
  F(t) = 1 − exp(−∫_0^t λ(s)ds).
Proof:
  ∫_0^t λ(s)ds = ∫_0^t f(s)/(1 − F(s)) ds
    = [−ln(1 − F(s))]_0^t
    = −ln(1 − F(t)) + ln(1 − F(0))
    = −ln(1 − F(t)),
which is equivalent to F(t) = 1 − exp(−∫_0^t λ(s)ds). □
Proposition 5.21 Suppose λ(t) is the hazard (failure) rate function of a random variable X. Then X is an exponential random variable with parameter λ if and only if λ(t) ≡ λ.
Proof: First, if X is exponential with parameter λ, then
  λ(t) = f(t)/F̄(t) = λe^{−λt} / (1 − (1 − e^{−λt})) = λ.
Next, suppose λ(t) = λ; then
  FX(t) = 1 − exp(−∫_0^t λ ds) = 1 − exp(−λt). □
Definition: a random variable X is gamma with parameters (n, λ), λ > 0, if its density function is
  f(t) = λe^{−λt}(λt)^{n−1}/(n − 1)!, t ≥ 0.
It is the time at which the nth event of a Poisson process of rate λ occurs.
Definition: a random variable X is gamma with parameters (α, λ), α, λ > 0, if its density function is
  f(t) = λe^{−λt}(λt)^{α−1}/Γ(α), t ≥ 0,
where
  Γ(α) = ∫_0^∞ λe^{−λt}(λt)^{α−1}dt = ∫_0^∞ e^{−y}y^{α−1}dy.
Recall that Γ(α + 1) = αΓ(α) for α > 0, and Γ(n) = (n − 1)! for positive integers n.
Suppose n is a positive integer; then a gamma distribution X with parameters (n, λ) represents the distribution of the time it takes for n events to occur, when the events occur according to a Poisson process with rate λ.
Lemma 5.22 Let X be gamma with parameters (α, λ). Then E[X] = α/λ.
Proof:
  E[X] = (1/Γ(α)) ∫_0^∞ t · λe^{−λt}(λt)^{α−1}dt
    = (1/Γ(α)) ∫_0^∞ e^{−λt}(λt)^{α}dt
    = (1/(λΓ(α))) ∫_0^∞ λe^{−λt}(λt)^{(α+1)−1}dt
    = Γ(α + 1)/(λΓ(α))
    = α/λ. □
Lemma 5.23 Let X be gamma with parameters (α, λ). Then E[X^2] = α(α + 1)/λ^2 and Var(X) = α/λ^2.
Proof:
  E[X^2] = (1/Γ(α)) ∫_0^∞ t^2 · λe^{−λt}(λt)^{α−1}dt
    = (1/(λΓ(α))) ∫_0^∞ e^{−λt}(λt)^{α+1}dt
    = (1/(λ^2 Γ(α))) ∫_0^∞ λe^{−λt}(λt)^{(α+2)−1}dt
    = Γ(α + 2)/(λ^2 Γ(α))
    = α(α + 1)/λ^2.
Then Var(X) = E[X^2] − (E[X])^2 = α(α + 1)/λ^2 − α^2/λ^2 = α/λ^2. □
Definition: a random variable X is beta with parameters (a, b), a, b > 0, if its density function is
  f(x) = x^{a−1}(1 − x)^{b−1}/B(a, b), 0 < x < 1,
where
  B(a, b) = ∫_0^1 x^{a−1}(1 − x)^{b−1}dx = Γ(a)Γ(b)/Γ(a + b).
It represents the distribution of the success probability p of a trial, given that there were a successes and b failures in the first a + b trials. For example, a beta distribution with parameters (40, 60) gives the distribution of the success rate p of the trial if, having performed the trial 100 times, one got 40 successes and 60 failures.
Lemma 5.24 If a = b = 1, then the beta random variable X with parameters (a, b) is uniform on (0, 1).
Proof: If a = b = 1, then f(x) = 1/B(1, 1) = 1 for 0 < x < 1. Hence it is uniform on (0, 1). □
Proposition 5.25 Suppose n, m ∈ N; then
  ∫_0^1 x^n(1 − x)^m dx = n!m!/(n + m + 1)!.
Proof: Let C(n, m) = ∫_0^1 x^n(1 − x)^m dx; then using integration by parts, we have
  C(n, m) = [m/(n + 1)] C(n + 1, m − 1).
Note that C(n, 0) = 1/(n + 1). Then using induction on m we can prove the identity. □
Lemma 5.26 B(a + 1, b) = [a/(a + b)] B(a, b).
Proposition 5.27 Suppose X is a beta random variable with parameters (a, b); then
  E[X] = a/(a + b),
  E[X^2] = (a + 1)a / ((a + b + 1)(a + b)),
  Var(X) = ab / ((a + b)^2 (a + b + 1)).
Proof:
  E[X] = ∫_0^1 x · x^{a−1}(1 − x)^{b−1}/B(a, b) dx = B(a + 1, b)/B(a, b) = a/(a + b),
  E[X^2] = ∫_0^1 x^2 · x^{a−1}(1 − x)^{b−1}/B(a, b) dx = B(a + 2, b)/B(a, b) = (a + 1)a / ((a + b + 1)(a + b)).
Then the value of the variance of X follows from the previous two values. □
Lemma 5.28 The integral Γ(s) = ∫_0^∞ e^{−x}x^{s−1}dx converges for all s > 0.
Proof: We decompose the improper integral into two parts:
  I1 = ∫_0^1 e^{−x}x^{s−1}dx, I2 = ∫_1^∞ e^{−x}x^{s−1}dx.
Firstly, consider I1: when s ≥ 1, I1 is a proper integral, hence it converges; when 0 < s < 1,
  e^{−x} · x^{s−1} = x^{s−1}/e^x < 1/x^{1−s},
and ∫_0^1 x^{−(1−s)}dx converges since 1 − s < 1, so I1 converges by comparison. For I2, since lim_{x→∞} x^2 · e^{−x}x^{s−1} = 0, convergence follows from the limit comparison test below. □
The comparison test and limit comparison test for improper integrals:
– Suppose f(x), g(x) are continuous on [a, ∞) with 0 ≤ f(x) ≤ g(x) for x ≥ a. If ∫_a^∞ g(x)dx converges, then ∫_a^∞ f(x)dx converges; if ∫_a^∞ f(x)dx diverges, then ∫_a^∞ g(x)dx diverges.
– Suppose f(x) is continuous on [a, ∞) and f(x) ≥ 0. If there exists a constant p > 1 such that lim_{x→∞} x^p f(x) = c < ∞, then ∫_a^∞ f(x)dx converges; if lim_{x→∞} xf(x) = d > 0, then ∫_a^∞ f(x)dx diverges.
Lemma 5.29 Γ(1) = 1.
Proof: Γ(1) = ∫_0^∞ e^{−t}dt = 1. □
Lemma 5.30 Γ(n + 1) = n! for n ∈ N.
Proof: Using Γ(s + 1) = sΓ(s) and induction, we easily get the result. □
Lemma 5.31 lim_{s→0+} Γ(s) = ∞.
Proof: The Γ function is continuous for all positive values s (it is an integral); this will be used without proof. As s → 0+, since Γ(s) = Γ(s + 1)/s and Γ(s + 1) → Γ(1) = 1,
  lim_{s→0+} Γ(s) = lim_{s→0+} Γ(s + 1)/s = ∞. □
Lemma 5.33 Suppose f, g are convex functions with the same domain; then f + g is convex.
Proof: Suppose f, g are convex functions on D; let x, y ∈ D and λ ∈ [0, 1]. Then
  (f + g)(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) + λg(x) + (1 − λ)g(y) = λ(f + g)(x) + (1 − λ)(f + g)(y).
Hence f + g is convex. □
Lemma 5.34 ln Γ is convex on (0, ∞).
Proof: Let 1 < p < ∞ and 1/p + 1/q = 1. Applying Hölder's inequality, we obtain
  Γ(x/p + y/q) = ∫_0^∞ t^{x/p + y/q − 1} e^{−t} dt
    = ∫_0^∞ (t^{(x−1)/p} e^{−t/p})(t^{(y−1)/q} e^{−t/q}) dt
    ≤ (∫_0^∞ t^{x−1} e^{−t} dt)^{1/p} (∫_0^∞ t^{y−1} e^{−t} dt)^{1/q}
    = Γ(x)^{1/p} Γ(y)^{1/q}.
Hence
  ln Γ(x/p + y/q) ≤ ln(Γ(x)^{1/p} Γ(y)^{1/q}) = (1/p) ln Γ(x) + (1/q) ln Γ(y).
This implies ln Γ is convex. □
Theorem 5.35 (Bohr-Mollerup) Suppose f is a positive function on (0, ∞) such that
1. f(x + 1) = xf(x),
2. f(1) = 1,
3. ln f is convex.
Then f(x) = Γ(x).
Proof: Since Γ satisfies 1, 2, 3, it is enough to prove that f(x) is uniquely determined by 1, 2, 3 for all x > 0. By 1, it is enough to do this for x ∈ (0, 1), as the remaining values are determined by the values of f on (0, 1].
Put φ = ln f. Then
  φ(x + 1) = φ(x) + ln x (0 < x < ∞),
φ(1) = 0, and φ is convex. Suppose 0 < x < 1 and n is a positive integer; then φ(n + 1) = ln(n!). Consider the difference quotients of φ on the intervals [n, n + 1], [n + 1, n + 1 + x], [n + 1, n + 2]. Since φ is convex,
  ln n ≤ [φ(n + 1 + x) − φ(n + 1)]/x ≤ ln(n + 1).
Repeatedly applying φ(x + 1) = φ(x) + ln x gives φ(n + 1 + x) = φ(x) + ln[x(x + 1) · · · (x + n)], thus
  ln n ≤ [φ(x) + ln(x(x + 1) · · · (x + n)) − φ(n + 1)]/x ≤ ln(n + 1).
Then by some algebraic manipulation, we have
  0 ≤ φ(x) − ln[n!n^x / (x(x + 1) · · · (x + n))] ≤ x ln(1 + 1/n).
The expression on the right tends to 0 as n → ∞, hence φ(x) is uniquely determined, and the proof is complete. □
Corollary 5.35.1 For x > 0,
  Γ(x) = lim_{n→∞} n!n^x / (x(x + 1) · · · (x + n)).
Proposition 5.36 Γ(1/2) = √π.
Proof: By definition, Γ(s) = ∫_0^∞ e^{−x}x^{s−1}dx. Substituting x = u^2, dx = 2u du, gives
  Γ(s) = 2 ∫_0^∞ e^{−u^2} u^{2s−1} du.
Let t = 2s − 1, i.e., s = (1 + t)/2; then
  ∫_0^∞ e^{−u^2} u^t du = (1/2) Γ((1 + t)/2).
When s = 1/2, t = 0, so
  Γ(1/2) = 2 ∫_0^∞ e^{−u^2} du = √π. □
Theorem 5.37 (Euler's Reflection Formula) For 0 < s < 1,
  Γ(s)Γ(1 − s) = π / sin(πs).
Proof: For the proof of this, or more related reading on the Γ function, see https://en.wikipedia.org/wiki/Gamma_function. □
Theorem 5.38 For x, y > 0,
  B(x, y) = Γ(x)Γ(y)/Γ(x + y).
Proof (sketch): Note that B(1, y) = 1/y and, using Hölder's inequality, ln B(x, y) is a convex function of x for each fixed y. We show
  B(x + 1, y) = [x/(x + y)] B(x, y):
indeed, integrating by parts,
  B(x + 1, y) = ∫_0^1 (t/(1 − t))^x (1 − t)^{x+y−1} dt
    = [−(t/(1 − t))^x · (1 − t)^{x+y}/(x + y)]_0^1 + [x/(x + y)] ∫_0^1 t^{x−1}(1 − t)^{y−1} dt
    = [x/(x + y)] B(x, y).
It follows that f(x) = Γ(x + y)B(x, y)/Γ(y) satisfies the three conditions of the Bohr-Mollerup theorem, hence f(x) = Γ(x), which is the claim. □
Corollary 5.38.1 (Legendre Duplication Formula)
  Γ(x) = (2^{x−1}/√π) Γ(x/2) Γ((x + 1)/2).
Proof: Let
  f(x) = (2^{x−1}/√π) Γ(x/2) Γ((x + 1)/2).
Note f(1) = 1, f(x + 1) = 2 · (x/2) f(x) = xf(x), and
  ln f(x) = (x − 1) ln 2 − ln√π + ln Γ(x/2) + ln Γ((x + 1)/2).
Hence ln f is convex (a sum of convex functions), and this implies f(x) = Γ(x) on (0, ∞). Thus we have completed the proof. □
Theorem 5.39 (Stirling's Formula) Stirling's formula provides a simple approximate expression for Γ(x + 1) when x is large:
  lim_{x→∞} Γ(x + 1) / [(x/e)^x √(2πx)] = 1.
Proof (sketch): Substituting t = x(1 + u) in Γ(x + 1) = ∫_0^∞ t^x e^{−t}dt, we get
  Γ(x + 1) = x^{x+1} e^{−x} ∫_{−1}^∞ [(1 + u)e^{−u}]^x du.
Define
  h(u) = (2/u^2)[u − ln(1 + u)], u ≠ 0,
with h(0) = 1; one can verify that h is continuous, and that h(u) decreases monotonically from ∞ to 0 as u increases from −1 to ∞.
Substitute u = s√(2/x); then
  Γ(x + 1) = x^x e^{−x} √(2x) ∫_{−∞}^∞ ψx(s)ds,
where
  ψx(s) = exp[−s^2 h(s√(2/x))] for −√(x/2) < s < ∞, and ψx(s) = 0 for s ≤ −√(x/2).
The following facts hold:
1. ψx(s) → e^{−s^2} as x → ∞, for every s;
2. the convergence in 1 is uniform on [−A, A] for every A < ∞;
3. when s < 0, 0 < ψx(s) < e^{−s^2};
4. when s > 0 and x > 1, ψx(s) is dominated by the integrable function ψ1(s).
It follows that the integral converges to the integral of the limit. Since
  ∫_{−∞}^∞ e^{−s^2}ds = √π,
we have
  lim_{x→∞} Γ(x + 1) / [(x/e)^x √(2πx)] = 1. □
Theorem 5.40 Let X be a continuous random variable and Y = g(X), where g is strictly monotonic and differentiable. Then, for y = g(x),
  fY(y) = fX(g^{−1}(y)) |(d/dy) g^{−1}(y)|.
Proof: We consider two cases. First, suppose g(x) is strictly increasing and differentiable. Then
  FY(y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = FX(g^{−1}(y)),
so
  fY(y) = F′Y(y) = fX(g^{−1}(y)) (d/dy) g^{−1}(y).
Now suppose g(x) is strictly decreasing and differentiable. Then
  FY(y) = P(g(X) ≤ y) = P(X ≥ g^{−1}(y)) = 1 − FX(g^{−1}(y)),
so
  fY(y) = F′Y(y) = −fX(g^{−1}(y)) (d/dy) g^{−1}(y).
Since (d/dy) g^{−1}(y) is positive in the first case and negative in the second, both cases combine into the absolute-value formula. □
Corollary 5.40.1 Suppose g is strictly increasing, and X, Y are continuous random variables with Y = g(X); then FY(y) = FX(g^{−1}(y)). Suppose g is strictly decreasing; then FY(y) = 1 − FX(g^{−1}(y)).
Proof: This follows from the proof of the theorem. □
Definition: let X be a normal random variable with parameters (µ, σ 2 ). Then Y = eX is called lognormal with
parameters (µ, σ 2 ).
Proposition 5.41 Let X be a normal random variable with parameters (µ, σ^2); then the probability density function of Y = e^X is given by
  fY(y) = (1/(√(2π)σy)) exp{−(ln y − µ)^2/(2σ^2)}, y > 0.
Proof: Apply Theorem 5.40 with g(x) = e^x, so g^{−1}(y) = ln y:
  fY(y) = fX(g^{−1}(y)) |(d/dy) g^{−1}(y)|
    = fX(ln y) · (1/y)
    = (1/(√(2π)σy)) exp{−(ln y − µ)^2/(2σ^2)}. □
Lemma 5.42 Suppose Y is lognormal with parameters (µ, σ 2 ), if c > 0, then cY is lognormal with parameters
(µ + ln c, σ 2 ).
Proof: Suppose Y = eX , where X is normal with parameters (µ, σ 2 ), then cY = eX+ln c , and clearly X + ln c is
normal with parameters (µ + ln c, σ 2 ). □
Proposition 5.43 Let Z be a standard normal random variable, and let g be a differentiable function with derivative g′ such that
  lim_{x→±∞} g(x)e^{−x^2/2} = 0.
Then E[g′(Z)] = E[Zg(Z)].
Proof: Let f(z) denote the probability density function of Z, so
  f(z) = (1/√(2π)) e^{−z^2/2}.
Integrating by parts,
  E[g′(Z)] = ∫_{−∞}^∞ g′(z)f(z)dz
    = (1/√(2π)) ∫_{−∞}^∞ e^{−z^2/2} g′(z)dz
    = (1/√(2π)) ([e^{−z^2/2} g(z)]_{−∞}^∞ − ∫_{−∞}^∞ (−z)e^{−z^2/2} g(z)dz)
    = (1/√(2π)) ∫_{−∞}^∞ z e^{−z^2/2} g(z)dz
    = E[Zg(Z)]. □
Proposition 5.44 Let X be a nonnegative continuous random variable with density f; then
  E[X^n] = ∫_0^∞ nt^{n−1} P(X > t)dt.
Proof: Note that x^n = ∫_0^x nt^{n−1}dt; then
  E[X^n] = ∫_0^∞ x^n f(x)dx
    = ∫_0^∞ ∫_0^x nt^{n−1} f(x)dt dx
    = ∫_0^∞ ∫_t^∞ nt^{n−1} f(x)dx dt   (swapping the order of integration)
    = ∫_0^∞ nt^{n−1} P(X > t)dt. □
Corollary 5.44.1 If X is a nonnegative continuous random variable, then P(X > a) ≤ E[X^n]/a^n for any a > 0 and positive integer n.
Proof: It suffices to show that a^n P(X > a) ≤ E[X^n]. Using the same representation as before,
  a^n P(X > a) = ∫_0^a nt^{n−1} P(X > a)dt ≤ ∫_0^a nt^{n−1} P(X > t)dt ≤ E[X^n]. □
6 Jointly Distributed Random Variables
6.1 Joint Cumulative Distribution Function
Definition: let X and Y be two random variables, their joint cumulative probability distribution function is
FX,Y (a, b) = P (X ≤ a, Y ≤ b).
Proposition 6.1 Let F(a, b) = P(X ≤ a, Y ≤ b) be the joint cumulative distribution function of X and Y. Then it can be used to generate all probabilities involving X and Y; for example, for a1 < a2 and b1 < b2,
  P(a1 < X ≤ a2, b1 < Y ≤ b2) = F(a2, b2) − F(a1, b2) − F(a2, b1) + F(a1, b1).
Proof: This is clear from the definition of the joint cumulative probability distribution function (inclusion-exclusion on the rectangle). □
Definition: suppose X and Y are discrete random variables; then their joint probability mass function is
  pX,Y(x, y) = P(X = x, Y = y).
The marginal probability mass functions are obtained by summing the joint mass function:
  pX(x) = Σ_j pX,Y(x, yj),
  pY(y) = Σ_i pX,Y(xi, y).
Definition: suppose X and Y are continuous random variables; then it is convenient to use their joint probability density function fX,Y(x, y), such that
  P((X, Y) ∈ C) = ∬_C fX,Y(x, y)dxdy.
Definition: X and Y are jointly continuous if there is a function fX,Y(x, y) ≥ 0 such that
  P((X, Y) ∈ C) = ∬_C fX,Y(x, y)dxdy,
and then
  fX,Y(a, b) = ∂^2/∂a∂b FX,Y(a, b).
Lemma 6.2 Suppose X and Y are jointly continuous with joint probability density function f(x, y). Then
  P(X ∈ A, Y ∈ B) = ∫_A ∫_B f(x, y)dy dx = ∫_B ∫_A f(x, y)dx dy.
For small ϵ1, ϵ2,
  P(a ≤ X ≤ a + ϵ1, b ≤ Y ≤ b + ϵ2) ≈ f(a, b)ϵ1ϵ2.
Proof: This is by the definition of the joint probability density function and two-variable integration. □
Definition: suppose X and Y are jointly continuous with joint probability density function f(x, y). For A ⊆ R,
  P(X ∈ A) = P(X ∈ A, Y ∈ R) = ∫_A ∫_{−∞}^∞ f(x, y)dy dx = ∫_A fX(x)dx,
where fX(x) = ∫_{−∞}^∞ f(x, y)dy is known as the marginal probability density function of X. Similarly, for B ⊆ R, we have
  P(Y ∈ B) = P(X ∈ R, Y ∈ B) = ∫_B ∫_{−∞}^∞ f(x, y)dx dy.
6.2 Joint Distribution of Random Variables
Definition: the joint cumulative distribution function of X1, X2, · · · , Xn is
  F(a1, a2, · · · , an) = P(X1 ≤ a1, X2 ≤ a2, · · · , Xn ≤ an).
Definition: if X1, X2, · · · , Xn are discrete random variables, their joint probability mass function is
  p(a1, a2, · · · , an) = P(X1 = a1, X2 = a2, · · · , Xn = an).
Definition: X1, X2, · · · , Xn are said to be jointly continuous if there is a joint probability density function f such that
  P((X1, · · · , Xn) ∈ C) = ∫ · · · ∫_C f(x1, · · · , xn)dx1 · · · dxn
for any C ⊆ R^n.
Definition: suppose that n independent and identical experiments are performed, each resulting in exactly one of r possible outcomes with respective probabilities p1, · · · , pr, Σ_{i=1}^{r} pi = 1. Let Xi be the number of experiments that result in the ith outcome. Then (X1, · · · , Xr) has the multinomial distribution with joint mass function
  p(n1, n2, · · · , nr) = \binom{n}{n1, · · · , nr} p1^{n1} · · · pr^{nr}, n = Σ_{i=1}^{r} ni.
6.3 Independent Random Variables
Definition: random variables X and Y are independent if, for any sets A and B of real numbers,
  P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B).
In particular,
  P(X ≤ a, Y ≤ b) = P(X ≤ a)P(Y ≤ b).
Theorem 6.3 Random variables X and Y are independent if and only if for any a, b ∈ R, we have
  FX,Y(a, b) = FX(a)FY(b).
Theorem 6.4 Discrete random variables X and Y are independent if and only if pX,Y(a, b) = pX(a)pY(b) for all a, b. If this is the case, then
  E[XY] = E[X]E[Y],
  Var(X + Y) = Var(X) + Var(Y).
Proof: If X and Y are independent, then for any a, b, taking A = {a} and B = {b}, we have P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B), so pX,Y(a, b) = pX(a)pY(b). On the other hand, if pX,Y(a, b) = pX(a)pY(b), then for any A, B, we have
  P(X ∈ A, Y ∈ B) = Σ_{x∈A} Σ_{y∈B} pX,Y(x, y)
    = Σ_{x∈A} Σ_{y∈B} pX(x)pY(y)
    = (Σ_{x∈A} pX(x)) (Σ_{y∈B} pY(y))
    = P(X ∈ A)P(Y ∈ B).
In that case,
  E[XY] = Σ_x Σ_y xy pX(x)pY(y) = (Σ_x x pX(x)) (Σ_y y pY(y)) = E[X]E[Y],
and Var(X + Y) = Var(X) + Var(Y) follows by expanding E[(X + Y)^2] and using E[XY] = E[X]E[Y]. □
Theorem 6.5 Jointly continuous random variables X, Y are independent if and only if
  fX,Y(x, y) = fX(x)fY(y)
for all x, y.
Proof: If X and Y are independent, then FX,Y(x, y) = FX(x)FY(y), so
  fX,Y(x, y) = ∂^2/∂x∂y FX,Y(x, y) = F′X(x)F′Y(y) = fX(x)fY(y).
Conversely, if the joint density factors, then
  FX,Y(a, b) = ∫_{−∞}^a ∫_{−∞}^b fX(x)fY(y)dy dx = FX(a)FY(b),
so X and Y are independent by Theorem 6.3. □
The same holds for n random variables. That is, if X1, · · · , Xn are discrete random variables, then they are independent if and only if for any ai ∈ R,
  p(a1, · · · , an) = Π_{i=1}^{n} pXi(ai);
if X1, · · · , Xn are jointly continuous, then they are independent if and only if for any xi ∈ R,
  f(x1, · · · , xn) = Π_{i=1}^{n} fXi(xi).
Proposition 6.7 Suppose X1, X2, · · · , Xn are jointly continuous or discrete random variables. Then X1, X2, · · · , Xn are independent if and only if their joint probability density (or mass) function f(x1, · · · , xn) can be written as
  f(x1, · · · , xn) = Π_{i=1}^{n} gi(xi)
for some functions g1, · · · , gn.
Proof: The case of jointly discrete random variables is straightforward; we consider the case of jointly continuous random variables.
If X1, · · · , Xn are independent, let fXi(t) be their respective probability density functions. Then we have
  f(x1, · · · , xn) = Π_{i=1}^{n} fXi(xi),
which is of the desired product form. Conversely, suppose f(x1, · · · , xn) = Π_{i=1}^{n} gi(xi); then, integrating out the other variables,
  fXi(xi) = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ Π_{j} gj(xj) dx1 · · · dxi−1 dxi+1 · · · dxn = Ci gi(xi),
where Ci is a constant (the product of the integrals of the gj, j ≠ i). Now since
  ∫ · · · ∫ Π_{i=1}^{n} gi(xi)dx1 · · · dxn = 1 = ∫ · · · ∫ Π_{i=1}^{n} fXi(xi)dx1 · · · dxn,
we conclude that the product of the Ci's is one, hence f = Π fXi and X1, · · · , Xn are jointly independent. □
6.4 Sums of Independent Random Variables
Proposition 6.8 Let X and Y be independent integer-valued random variables; then
  P(X + Y = n) = Σ_{i} P(X = i)P(Y = n − i).
Proof:
  P(X + Y = n) = Σ_i P(X = i, Y = n − i) = Σ_i P(X = i)P(Y = n − i)
    = Σ_j P(X = n − j, Y = j) = Σ_j P(X = n − j)P(Y = j)
    = Σ_{i+j=n} P(X = i)P(Y = j). □
Corollary 6.8.1 Let X and Y be independent integer-valued random variables. Let g(z) = Σ_{i=−∞}^{∞} pX(i)z^i and h(z) = Σ_{j=−∞}^{∞} pY(j)z^j. Then
  g(z)h(z) = Σ_{n=−∞}^{∞} pX+Y(n)z^n.
Proof:
  g(z)h(z) = (Σ_{i=−∞}^{∞} pX(i)z^i) · (Σ_{j=−∞}^{∞} pY(j)z^j)
    = Σ_{n=−∞}^{∞} (Σ_{i+j=n} pX(i)pY(j)) z^n
    = Σ_{n=−∞}^{∞} pX+Y(n)z^n. □
Proposition 6.9 Let X and Y be independent continuous random variables; then
  FX+Y(a) = ∫_{−∞}^∞ FX(a − y)fY(y)dy,
and, differentiating, fX+Y(a) = ∫_{−∞}^∞ fX(a − y)fY(y)dy (the convolution of fX and fY).
Proof:
  FX+Y(a) = P(X + Y ≤ a) = ∬_{x+y≤a} fX(x)fY(y)dxdy
    = ∫_{−∞}^∞ ∫_{−∞}^{a−y} fX(x)fY(y)dx dy
    = ∫_{−∞}^∞ FX(a − y)fY(y)dy
    = ∫_{−∞}^∞ FX(a − y)dFY(y). □
6.4.1 Sum of Binomial
Theorem 6.10 Let X1 , X2 , · · · , Xk be independent binomial random variables with parameters (n1 , p), (n2 , p),
· · · , (nk , p) respectively, then X1 +X2 +· · ·+Xk is a binomial random variable with parameter (n1 +n2 +· · ·+nk , p).
Proof: First show that the sum of an independent binomial random variable with parameters (m, p) and a Bernoulli random variable with parameters (1, p) is a binomial random variable with parameters (m + 1, p). It follows that the sum of independent binomial random variables with parameters (m1, p) and (m2, p) is binomial with parameters (m1 + m2, p), and induction gives the desired result. □
6.4.2 Sum of Poisson
Theorem 6.11 If X1, X2, · · · , Xr are independent Poisson random variables with parameters λ1, · · · , λr, then X1 + · · · + Xr is Poisson with parameter λ1 + · · · + λr.
6.4.3 Sum of Uniform
Proposition 6.12 Let X and Y be independent uniform random variables on (0, 1); then
  fX+Y(a) = a if 0 ≤ a ≤ 1; 2 − a if 1 < a < 2; 0 otherwise.
Proof: Recall fX+Y(a) = ∫_{−∞}^∞ fX(a − y)fY(y)dy; the integrand is 1 exactly when 0 < y < 1 and 0 < a − y < 1, so consider the cases 0 ≤ a ≤ 1 and 1 < a < 2. □
Proposition 6.13 Let X1, X2, · · · , Xn, · · · be independent uniform random variables on (0, 1). Let Fn be the cumulative distribution function of X1 + · · · + Xn. Then
  Fn(x) = P(X1 + · · · + Xn ≤ x) = x^n/n!
for 0 ≤ x ≤ 1.
Proof: We proceed by induction. For n = 1, clearly F1(x) = P(X1 ≤ x) = x, so the statement holds. Assume the statement for some n ∈ N; then, convolving with the density of Xn+1, for 0 ≤ x ≤ 1,
  Fn+1(x) = ∫_0^x Fn(x − t) · 1 dt = ∫_0^x (x − t)^n/n! dt = x^{n+1}/(n + 1)!.
Hence Fn(x) = x^n/n! by induction. □
Corollary 6.13.1 Let N be the minimum integer n such that X1 + X2 + · · · + Xn > 1. Then E[N] = e; i.e., the expected number of independent uniform (0, 1) random variables that must be summed for the sum to exceed 1 is e.
More generally, the expected number of independent uniform (0, 1) random variables that must be summed for the sum to exceed x (0 ≤ x ≤ 1) is e^x.
Proof: N > n ⇔ X1 + · · · + Xn ≤ 1, so P(N > n) = Fn(1) = 1/n!. Thus
  E[N] = Σ_{i=1}^{∞} P(N ≥ i) = Σ_{n=0}^{∞} P(N > n) = Σ_{n=0}^{∞} 1/n! = e. □
6.4.4 Sum of Gamma
Proposition 6.14 Let X and Y be independent gamma random variables with parameters (α, λ) and (β, λ); then X + Y is gamma with parameters (α + β, λ):
  fX+Y(a) = [B(α, β)/(Γ(α)Γ(β))] λe^{−λa}(λa)^{α+β−1} = λe^{−λa}(λa)^{α+β−1}/Γ(α + β), a > 0.
Proof (sketch): Computing the convolution and substituting y = ax,
  fX+Y(a) = (1/(Γ(α)Γ(β))) ∫_0^a λe^{−λ(a−y)}(λ(a−y))^{α−1} λe^{−λy}(λy)^{β−1} dy
    = [B(β, α)/(Γ(α)Γ(β))] λe^{−λa}(λa)^{α+β−1}.
A gamma random variable Z with parameters (α + β, λ) has probability density function
  fZ(x) = λe^{−λx}(λx)^{α+β−1}/Γ(α + β), x > 0;
comparing the two densities (both integrate to 1) also recovers the identity
  B(α, β) = Γ(α)Γ(β)/Γ(α + β). □
Proposition 6.15 Let Z be standard normal and Y = Z^2; then
  fY(y) = (1/2)e^{−y/2}((1/2)y)^{1/2−1} / √π, y > 0;
i.e., Y = Z^2 is a gamma random variable with parameters (1/2, 1/2), and Γ(1/2) = √π.
Proof: For small ϵ > 0,
  ϵfY(y) ≈ P(y ≤ Y ≤ y + ϵ)
    = 2P(√y ≤ Z ≤ √(y + ϵ))
    ≈ 2P(√y ≤ Z ≤ √y + ϵ/(2√y))
    ≈ 2fZ(√y) · ϵ/(2√y).
Hence fY(y) = fZ(√y)/√y = (1/√(2π)) y^{−1/2} e^{−y/2}, which is the gamma (1/2, 1/2) density; comparing the normalizing constants gives Γ(1/2) = √π. □
Definition: let Z1, Z2, · · · , Zn be independent standard normal random variables. Then X_n^2 = Z1^2 + · · · + Zn^2 is a gamma random variable with parameters (n/2, 1/2). This is called the chi-squared random variable with n degrees of freedom.
Note that X_n^2 is a sum of n independent gamma (1/2, 1/2) random variables, so by the result on sums of gammas and induction, X_n^2 has parameters (n/2, 1/2).
Proposition 6.16 Let X and Y be independent normal random variables with parameters (µ1 , σ12 ) and (µ2 , σ22 )
respectively. Then X + Y is a normal random variable with parameters (µ1 + µ2 , σ12 + σ22 ).
64
Proof: We first prove the case where S and Z are independent normal random variables, with S having parameters (0, σ²) and Z having parameters (0, 1), i.e., Z is standard normal; we show that S + Z is normal with parameters (0, 1 + σ²). Indeed,

fS+Z (a) = ∫_{−∞}^{∞} fS (a − z) fZ (z) dz
= ∫_{−∞}^{∞} (1/(2πσ)) e^{−(a−z)²/(2σ²)} e^{−z²/2} dz
= C exp( −a²/(2(1 + σ²)) ),

where C is a constant that does not depend on a. Since fS+Z is a density, C must be the corresponding normal normalizing constant, so S + Z is normal with mean 0 and variance 1 + σ².
Now for the general case, suppose X, Y are normal with parameters (µ1, σ1²) and (µ2, σ2²) respectively. Then

X + Y = σ2 ( (X − µ1)/σ2 + (Y − µ2)/σ2 ) + µ1 + µ2,

where (X − µ1)/σ2 is normal with parameters (0, σ1²/σ2²) and (Y − µ2)/σ2 is standard normal. Hence, applying the result just obtained, X + Y is normal with mean µ1 + µ2 and variance σ2²(1 + σ1²/σ2²) = σ1² + σ2². □
Corollary 6.16.1 Suppose Xi, i = 1, 2, · · · , n, are independent random variables that are normally distributed with parameters (µi, σi²) respectively. Then X1 + X2 + · · · + Xn is normally distributed with parameters (Σ_{i=1}^{n} µi, Σ_{i=1}^{n} σi²).
Proposition: suppose X1, X2, · · · , Xn are independent, identically distributed exponential random variables with parameter λ. Then Y = X1 + X2 + · · · + Xn is a gamma random variable with parameters (n, λ).
Proof: An exponential random variable with parameter λ is a gamma random variable with parameters (1, λ), so the result follows from Proposition 6.14 by induction. □
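A matching moment check (hedged sketch, illustrative parameters): a gamma random variable with parameters (n, λ) has mean n/λ and variance n/λ².

    import numpy as np

    rng = np.random.default_rng(4)
    n, lam, N = 4, 2.0, 500_000
    y = rng.exponential(1.0 / lam, size=(N, n)).sum(axis=1)  # sum of n exponential(lam)
    print(y.mean(), n / lam)        # ~ 2.0
    print(y.var(), n / lam ** 2)    # ~ 1.0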
6.5 Conditional Distribution
Definition: let X and Y be discrete random variables; the conditional probability mass function of X given Y = b is

pX|Y (x|b) = pX,Y (x, b)/pY (b),

provided that pY (b) > 0.
Note that if X and Y are independent, then pX|Y (x|b) = pX (x).
Definition: let X and Y be jointly continuous random variables; the conditional probability density function of X given Y = y is

fX|Y (x|y) = fX,Y (x, y)/fY (y),

if fY (y) > 0.
Note that if X and Y are independent, then fX|Y (x|y) = fX (x).
Proposition 6.17 Let X and Y be discrete random variables such that pY (y) > 0. Then
FX|Y (x|y) = Σ_{a≤x} pX|Y (a|y).
Proposition 6.19 Let Y and Z be independent, where Y is chi-squared with 1 degree of freedom and Z is standard normal. Let T = Z/√Y. Then the joint probability density of T and Y is

fT,Y (t, y) = fY (y) fT|Y (t|y) = (1/(2π)) e^{−y(t²+1)/2}, y > 0.
Proof: Since Y and Z are independent, the conditional distribution of T given Y = y is the distribution of √(1/y) Z, which is normal with mean 0 and variance 1/y. Hence the conditional density of T given Y = y is

fT|Y (t|y) = (1/√(2π/y)) e^{−t²y/2}, −∞ < t < ∞,

and multiplying by fY (y) gives the stated joint density. □
Definition: let Y and Z be independent, where Y is chi-squared with n degrees of freedom and Z is standard normal. Then T = Z/√(Y/n) has a t-distribution with n degrees of freedom.
Motivation: for small ϵ > 0,

P(a ≤ X ≤ a + ϵ | N = k) = P(a ≤ X ≤ a + ϵ, N = k) / P(N = k)
= P(N = k | a ≤ X ≤ a + ϵ) P(a ≤ X ≤ a + ϵ) / P(N = k)
≈ P(N = k | X = a) f(a) ϵ / P(N = k).
Definition: let X be a continuous and N a discrete random variable, then the conditional probability density
function of X given that N = k is
fX|N (x|k) = (P(N = k|X = x)/P(N = k)) fX (x).
Definition: let X be continuous and N a discrete random variable, then the conditional probability mass function
of N given that X = x is
pN|X (n|x) = (fX|N (x|n)/fX (x)) P(N = n).
Definition: let X be a discrete random variable. For any A ⊆ R with P(X ∈ A) > 0, the conditional probability mass function of X given X ∈ A is defined to be

P(X = a|X ∈ A) = P(X = a)/P(X ∈ A) if a ∈ A, and 0 if a ∉ A.
Definition: let X be a continuous random variable and let A ⊆ R with ∫_A fX (x)dx > 0. The conditional probability density function of X given X ∈ A is defined to be

fX|X∈A (x) = fX (x) / ∫_A fX (x)dx, for x ∈ A.
6.6 Joint Distribution of Functions
Suppose X1 and X2 are jointly continuous random variables and Y1 = g1(X1, X2), Y2 = g2(X1, X2), where the equations y1 = g1(x1, x2), y2 = g2(x1, x2) can be uniquely solved for x1, x2, and each partial derivative ∂yi/∂xj is continuous at all (x1, x2), with the determinant below nonzero. Thus, we define the Jacobian of (x1, x2) ↦ (y1, y2) to be

J(x1, x2) = det( ∂y1/∂x1, ∂y1/∂x2 ; ∂y2/∂x1, ∂y2/∂x2 ) = (∂y1/∂x1)(∂y2/∂x2) − (∂y1/∂x2)(∂y2/∂x1) ≠ 0.
Theorem 6.21 The joint probability density function of Y1 = g1(X1, X2), Y2 = g2(X1, X2) in this case is given by

fY1,Y2 (y1, y2) = fX1,X2 (x1, x2) |J(x1, x2)|^{−1},

where (x1, x2) is the unique solution of y1 = g1(x1, x2), y2 = g2(x1, x2). Similarly, using the Jacobian of the n-dimensional matrix of partial derivatives, we can find the joint density of Y1, Y2, · · · , Yn from that of X1, X2, · · · , Xn when Yi = gi(X1, · · · , Xn).
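For a concrete instance of Theorem 6.21 (a hedged sketch: the linear map and the test point are arbitrary choices), take X1, X2 independent standard normals and y1 = x1 + x2, y2 = x1 − x2, so J = −2; the theorem's density then agrees with the known answer, namely that Y1 and Y2 are independent normals with mean 0 and variance 2.

    import numpy as np

    phi1 = lambda x: np.exp(-x * x / 2) / np.sqrt(2 * np.pi)  # N(0, 1) density
    phi2 = lambda y: np.exp(-y * y / 4) / np.sqrt(4 * np.pi)  # N(0, 2) density

    y1, y2 = 0.7, -1.3
    x1, x2 = (y1 + y2) / 2, (y1 - y2) / 2       # inverse of the map
    lhs = phi1(x1) * phi1(x2) / abs(-2)         # f_{X1,X2}(x1, x2) |J|^{-1}
    rhs = phi2(y1) * phi2(y2)                   # known joint density of (Y1, Y2)
    print(lhs, rhs)                             # equal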
7 Expectation of Random Variables
7.1 Extra Properties of Expectation
Proof: Suppose Y ≥ 0; then it is clear that E[Y] ≥ 0 (consider the discrete and continuous cases separately); similarly, if Y ≤ 0, then E[Y] ≤ 0. Now, since X ≥ a, then X − a ≥ 0, so E[X] − a = E[X − a] ≥ 0, i.e., E[X] ≥ a. □
Proposition 7.3 Suppose X is a random variable such that P(0 ≤ X ≤ c) = 1, i.e., X only takes values between 0 and c. Then

Var(X) ≤ c²/4.
Proof: Note that since X only takes values between 0 and c, we have X² ≤ cX, so 0 ≤ E[X] ≤ c and E[X²] ≤ E[cX] = cE[X]. Hence

Var(X) = E[X²] − (E[X])² ≤ cE[X] − (E[X])² ≤ c²/4,

where the last inequality holds because the quadratic t ↦ ct − t² attains its maximum c²/4 at t = c/2.
□
Theorem 7.4 Let X and Y be discrete random variables with joint mass function p(x, y). For any function g(x, y),

E[g(X, Y)] = Σ_x Σ_y g(x, y) p(x, y).

Theorem 7.5 Let X, Y be jointly continuous random variables with joint probability density function f(x, y). For any function g(x, y),

E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy.
Proof: First consider the case where g(X, Y) ≥ 0; for any such nonnegative random variable we have

E[g(X, Y)] = ∫_0^∞ P{g(X, Y) > t} dt.

Then

E[g(X, Y)] = ∫_0^∞ P{g(X, Y) > t} dt
= ∫_0^∞ ∫∫_{(x,y): g(x,y)>t} f(x, y) dy dx dt
= ∫_x ∫_y ∫_0^{g(x,y)} f(x, y) dt dy dx   (by change of order of integration)
= ∫_x ∫_y g(x, y) f(x, y) dy dx
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy.
The case of general g(X, Y) is similar; we deal with it by splitting g(X, Y) into its positive and negative parts g+(X, Y) and g−(X, Y). □
Proposition 7.6 Let X and Y be jointly continuous random variables with joint probability density function f(x, y). Then

∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x, y) dy dx = ∫_{−∞}^{∞} x fX (x) dx = E[X],
∫_{−∞}^{∞} ∫_{−∞}^{∞} y f(x, y) dx dy = ∫_{−∞}^{∞} y fY (y) dy = E[Y],
E[X + Y] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x + y) f(x, y) dx dy = E[X] + E[Y].
Proof: This is clear from the definition of expected value and joint continuous random variables. □
Let X be a random variable having finite expectation µ and variance σ², and let g(·) be a twice differentiable function. Then

E[g(X)] ≈ g(µ) + (g″(µ)/2) σ².

Indeed, the second-order Taylor expansion of g about µ gives

g(X) ≈ g(µ) + g′(µ)(X − µ) + g″(µ)(X − µ)²/2.

Hence

E[g(X)] ≈ E[ g(µ) + g′(µ)(X − µ) + g″(µ)(X − µ)²/2 ]
= g(µ) + 0 + (g″(µ)/2) σ².
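The quality of this second-order approximation can be checked against a case where E[g(X)] is known exactly (hedged sketch: g(x) = e^x with X normal is an illustrative choice, since E[e^X] = e^{µ+σ²/2} exactly):

    import numpy as np

    mu, sigma = 0.4, 0.3
    exact = np.exp(mu + sigma ** 2 / 2)                # lognormal mean, exact
    approx = np.exp(mu) + np.exp(mu) * sigma ** 2 / 2  # g(mu) + g''(mu) sigma^2 / 2
    print(exact, approx)                               # close when sigma is small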
Theorem 7.7 Let X and Y be random variables. Then E[X + Y] = E[X] + E[Y] if E[X] and E[Y] are finite. In general, if X1, X2, · · · , Xn are random variables, then E[X1 + X2 + · · · + Xn] = E[X1] + E[X2] + · · · + E[Xn].
Proof: By induction; we know that E[X + Y] = E[X] + E[Y] in both the discrete and continuous cases. □
Corollary 7.7.1 Let X and Y be random variables, then E[X − Y ] = E[X] − E[Y ].
Corollary 7.7.2 Suppose X and Y are random variables such that X ≥ Y , then E[X] ≥ E[Y ].
Proof: Since X − Y ≥ 0, then E[X − Y] ≥ 0. Hence E[X] − E[Y] = E[X − Y] ≥ 0; therefore, E[X] ≥ E[Y]. □
Definition: Suppose X1, X2, · · · , Xn are random variables. Then X̄ = (1/n) Σ_{i=1}^{n} Xi is called the sample mean.
Proposition 7.9 Suppose A1, A2, · · · , An are events and X1, X2, · · · , Xn are their respective indicator variables. Let Y = 1 − Π_{i=1}^{n}(1 − Xi); then

P(A1 ∪ · · · ∪ An) = E[Y] = Σ_{i=1}^{n} P(Ai) − Σ_{1≤i<j≤n} P(Ai Aj) + · · · + (−1)^{n+1} P(A1 A2 · · · An).
Theorem 7.10 Let X1, X2, · · · be a sequence of random variables. Suppose one of the following holds:
1. Xi is nonnegative for i = 1, 2, · · · ;
2. Σ_{i=1}^{∞} E[|Xi|] < ∞.
Then E[Σ_{i=1}^{∞} Xi] = Σ_{i=1}^{∞} E[Xi].
Proof: The first case is justified by the σ-additivity of the underlying measure (monotone convergence); the second by the absolute convergence of the series. □
Lemma 7.11 Let A1, A2, · · · , An be events with indicator variables X1, · · · , Xn, and let X = Σ_{i=1}^{n} Xi be the number of these events that occur. Writing C(X, k) = X(X − 1) · · · (X − k + 1)/k! for the binomial coefficient,

C(X, k) = Σ_{i1<i2<···<ik} Xi1 Xi2 · · · Xik,

and hence

E[C(X, k)] = Σ_{i1<i2<···<ik} P(Ai1 Ai2 · · · Aik).
Proof: The product Xi1 Xi2 · · · Xik equals 1 precisely when Ai1, · · · , Aik all occur, and the number of such k-subsets drawn from the X events that occur is C(X, k); taking expectations of both sides gives the second identity. □
Corollary 7.11.1

E[ X(X − 1) · · · (X − k + 1)/k! ] = Σ_{i1<i2<···<ik} P(Ai1 Ai2 · · · Aik).
Proposition 7.12 Suppose X and Y are independent. Then, for any functions h and g, E[g(X)h(Y)] = E[g(X)]E[h(Y)].
Proof: Suppose X and Y are discrete and independent; then

E[g(X)h(Y)] = Σ_{(x,y)} g(x)h(y) pX,Y (x, y)
= Σ_{(x,y)} g(x)h(y) pX (x) pY (y)
= ( Σ_x g(x) pX (x) ) ( Σ_y h(y) pY (y) )
= Σ_x g(x) pX (x) E[h(Y)]
= E[g(X)] E[h(Y)].
Suppose instead X and Y are jointly continuous with joint density f(x, y) and independent; then

E[g(X)h(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) f(x, y) dx dy
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) fX (x) fY (y) dx dy
= ( ∫_{−∞}^{∞} g(x) fX (x) dx ) ( ∫_{−∞}^{∞} h(y) fY (y) dy )
= E[g(X)] E[h(Y)]. □
Lemma 7.13 Suppose X and Y are independent, identically distributed random variables with variance σ². Then

E[(X − Y)²] = 2σ².

Proof: E[(X − Y)²] = Var(X − Y) + (E[X − Y])² = Var(X) + Var(Y) + 0 = 2σ². □
If X and Y are independent, then Cov(X, Y) = 0. If Cov(X, Y) = 0, X and Y may not be independent.
1. Cov(X, X) = Var(X).
2. Cov(X, Y) = Cov(Y, X).
3. Cov(aX, Y) = aCov(X, Y).
4. Cov(Σ_{i=1}^{n} Xi, Σ_{j=1}^{m} Yj) = Σ_{i=1}^{n} Σ_{j=1}^{m} Cov(Xi, Yj).
Proof: (1)–(3) follow directly from Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]; (4) follows by induction. □
Remark: Cov(·, ·) is an inner product on real random variables. The induced norm on the real random variable space is σX = √(Var(X)).
Proposition: Var(X1 + X2 + · · · + Xn) = Σ_{i=1}^{n} Var(Xi) + 2 Σ_{i<j} Cov(Xi, Xj). In particular, if X1, X2, · · · , Xn are independent, then

Var( Σ_{i=1}^{n} Xi ) = Σ_{i=1}^{n} Var(Xi).
Proof: Note that if X1, · · · , Xn are independent, then all the covariances Cov(Xi, Xj), i ≠ j, are 0, hence the second formula follows from the first. For the first formula, we have the following:

Var(X1 + X2 + · · · + Xn) = Cov(X1 + X2 + · · · + Xn, X1 + X2 + · · · + Xn)
= Σ_{i=1}^{n} Cov(Xi, X1 + · · · + Xn)
= Σ_{i=1}^{n} Cov(Xi, Xi) + Σ_{i≠j} Cov(Xi, Xj)
= Σ_{i=1}^{n} Var(Xi) + 2 Σ_{i<j} Cov(Xi, Xj). □
Definition: let X1, · · · , Xn be independent, identically distributed random variables with mean µ and variance σ². Then we define X̄ = (1/n) Σ_{i=1}^{n} Xi to be the sample mean.
Definition: we define Xi − X̄ to be the deviation, i = 1, · · · , n.
Definition: we define S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1) to be the sample variance.
Proposition 7.16
1. E[X̄] = (1/n) Σ_{i=1}^{n} E[Xi] = µ.
2. Var(X̄) = (1/n²) Σ_{i=1}^{n} Var(Xi) = σ²/n.
3. E[(Xi − X̄)²] = σ²(1 − 1/n).
4. E[S²] = (n/(n − 1)) · σ²(1 − 1/n) = σ².
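These identities can be confirmed by simulation (hedged sketch; the distribution, n, and replication count are illustrative choices). In particular, dividing by n − 1 makes S² unbiased for σ².

    import numpy as np

    rng = np.random.default_rng(5)
    n, reps, sigma2 = 8, 200_000, 4.0
    x = rng.normal(1.0, np.sqrt(sigma2), size=(reps, n))

    xbar = x.mean(axis=1)
    s2 = x.var(axis=1, ddof=1)      # sample variance with the n - 1 denominator

    print(xbar.var(), sigma2 / n)   # ~ sigma^2 / n
    print(s2.mean(), sigma2)        # ~ sigma^2 (unbiased)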
Definition: let X and Y be random variables with positive variances. The correlation of X and Y is defined to be

ρ(X, Y) = Cov(X, Y)/√(Var(X)Var(Y)).
X and Y are uncorrelated if ρ(X, Y) = 0. For c > 0,

ρ(cX, Y) = Cov(cX, Y)/√(Var(cX)Var(Y)) = cCov(X, Y)/(c√(Var(X)Var(Y))) = ρ(X, Y).
Proposition 7.18 Let X and Y be random variables with positive variances. Then −1 ≤ ρ(X, Y ) ≤ 1.
Proof: Replace X and Y by X/σX and Y/σY if necessary; we may assume that Var(X) = Var(Y) = 1, so ρ(X, Y) = Cov(X, Y). Then

0 ≤ Var(X + Y) = 2 + 2Cov(X, Y) and 0 ≤ Var(X − Y) = 2 − 2Cov(X, Y),

which together give −1 ≤ ρ(X, Y) ≤ 1. □
Proposition 7.19 Suppose X and Y are identically distributed random variables, not necessarily independent. Then Cov(X + Y, X − Y) = 0.
Proof: Since X and Y are identically distributed, E[X^n] = E[Y^n] for n ∈ N, and Var(X) = Var(Y). So

Cov(X + Y, X − Y) = Cov(X, X) − Cov(X, Y) + Cov(Y, X) − Cov(Y, Y) = Var(X) − Var(Y) = 0. □
Proposition 7.20 Let X be a random variable with positive variance and let Y = a + bX, b ≠ 0. Then ρ(X, Y) = 1 if b > 0 and ρ(X, Y) = −1 if b < 0.
Proof: Suppose b > 0; then E[Y] = a + bE[X] and Var(Y) = b²Var(X). Then

ρ(X, Y) = Cov(X, Y)/√(Var(X) · Var(Y))
= Cov(X, a + bX)/(bVar(X))
= Cov(X, bX)/(bVar(X))
= bVar(X)/(bVar(X))
= 1.

The case b < 0 is similar. □
Definition: suppose m independent trials are performed, each resulting in outcome i with probability pi, i = 1, · · · , r, where Σ_{i=1}^{r} pi = 1, and let Ni denote the number of trials that result in outcome i. Then (N1, · · · , Nr) is said to have a multinomial distribution.
Proposition 7.21 Suppose the conditions in the definition hold. Then the following are true:
1. For each i ∈ {1, · · · , r}, Ni is a binomial random variable with parameters (m, pi).
2. Var(Ni) = mpi(1 − pi).
3. If i ≠ j, then Cov(Ni, Nj) = −mpi pj.
Proof: (1) and (2) follow by direct computation. For (3), since Ni + Nj is binomial with parameters (m, pi + pj), we know its variance; from Var(Ni + Nj) = Var(Ni) + Var(Nj) + 2Cov(Ni, Nj) we can then solve for the covariance. □
Definition: let Z1 , · · · , Zn be independent standard normal random variables. If for some constants aij , 1 ≤ i ≤ m,
1 ≤ j ≤ n, and µi , 1 ≤ i ≤ m,
X1 = a11 Z1 + · · · + a1n Zn + µ1 ,
X2 = a21 Z1 + · · · + a2n Zn + µ2 ,
..
.
Xm = am1 Z1 + · · · + amn Zn + µm
then the random variables X1 , X2 , · · · , Xm are said to have a multivariate normal distribution.
Proposition 7.22 Suppose the conditions in the definition hold. Then the following are true:
1. E[Xi] = µi;
2. Var(Xi) = Σ_{j=1}^{n} aij²;
3. Cov(Xi, Xj) = Σ_{k=1}^{n} aik ajk.
Proof: First note that a sum of independent normals is normal, so Xi is normal with parameters (µi, Σ_{j=1}^{n} aij²). Hence we have (1) and (2). Next, for i ≠ j,

Cov(Xi, Xj) = Cov( Σ_{k=1}^{n} aik Zk, Σ_{l=1}^{n} ajl Zl )
= Σ_{k=1}^{n} Cov(aik Zk, ajk Zk)   (the cross terms vanish since the Zk are independent)
= Σ_{k=1}^{n} aik ajk Var(Zk) = Σ_{k=1}^{n} aik ajk. □
7.6 Conditional Expectation
We recall the definition of conditional distribution: let X and Y be discrete random variables with joint mass function p(x, y), and let P(Y = y) > 0. Conditioned on Y = y, X is a random variable with mass function pX|Y (x|y) = p(x, y)/pY (y). Let X and Y be jointly continuous with probability density function f(x, y), and let fY (y) > 0. Conditioned on Y = y, X is a random variable with density function fX|Y (x|y) = f(x, y)/fY (y).
Conditional expectation is linear: for instance, E[X1 + · · · + Xn | Y = y] = Σ_{i=1}^{n} E[Xi | Y = y].
Proof: This follows from the fact that a conditional probability is a probability. □
Let X and Y be jointly distributed random variables. Then E[X|Y = y] is a real-valued function of y. So E[X|Y] is a function of Y; hence it is itself a random variable.
Theorem 7.24 Let Z = E[X|Y ] be a random variable, then E[Z] = E[E[X|Y ]] = E[X].
Proof: For the discrete case:

E[Z] = Σ_y E[X|Y = y] P(Y = y)
= Σ_y Σ_x x (P(X = x, Y = y)/P(Y = y)) P(Y = y)
= Σ_x x ( Σ_y P(X = x, Y = y) )
= Σ_x x P(X = x) = E[X].
For the continuous case, similarly,

E[Z] = ∫_{−∞}^{∞} E[X|Y = y] fY (y) dy
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} x fX|Y (x|y) dx fY (y) dy
= ∫_{−∞}^{∞} x ( ∫_{−∞}^{∞} f(x, y) dy ) dx
= ∫_{−∞}^{∞} x fX (x) dx
= E[X]. □
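A small simulation illustrates the theorem with a discrete conditioning variable (hedged sketch: the die-and-coin model is an illustrative assumption). Take Y a fair die and, given Y = y, let X be binomial (y, 1/2); then E[X|Y] = Y/2, so E[X] = E[E[X|Y]] = E[Y]/2 = 1.75.

    import numpy as np

    rng = np.random.default_rng(6)
    N = 500_000
    y = rng.integers(1, 7, N)        # fair die
    x = rng.binomial(y, 0.5)         # X | Y = y ~ Binomial(y, 1/2)

    print(x.mean(), (y / 2).mean())  # both approximately 1.75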
Proposition 7.25 (Law of Total Probability, Continuous Case) Suppose X is a discrete random variable and Y a continuous random variable. Then

P(X = i) = ∫_{−∞}^{∞} P(X = i|Y = y) fY (y) dy.
Proof: Let Z be the indicator of the event {X = i}, i.e., Z = 1 if X = i and Z = 0 otherwise. Then

P(X = i) = P(Z = 1) = E[Z]
= E[E[Z|Y]]
= ∫_{−∞}^{∞} E[Z|Y = y] fY (y) dy
= ∫_{−∞}^{∞} P(X = i|Y = y) fY (y) dy. □
Proposition 7.26 Let X be an exponential random variable and, given X = λ, let N be Poisson with parameter λ. Then

E[N|X = λ] = λ, i.e., E[N|X] = X.
Proposition 7.27 Suppose X and Y are independent. Then E[X|Y = y] = E[X] for all y.
Proof: We split into two cases: both discrete and both continuous. Suppose X and Y are both discrete; then P(X = x|Y = y) = P(X = x) for any x, hence E[X|Y = y] = Σ_x x P(X = x|Y = y) = Σ_x x P(X = x) = E[X]. On the other hand, if both X and Y are continuous, then fX|Y (x|y) = fX (x) for any x, hence E[X|Y = y] = ∫_{−∞}^{∞} x fX|Y (x|y) dx = ∫_{−∞}^{∞} x fX (x) dx = E[X]. □
Lemma 7.28 Suppose g is a function and X and Y are random variables, then E[g(X)Y |X] = g(X)E[Y |X].
Proof: Let X = x, where x is arbitrary; then E[g(X)Y|X = x] = E[g(x)Y|X = x] = g(x)E[Y|X = x]. Since x is arbitrary, E[g(X)Y|X] = g(X)E[Y|X]. □
Lemma 7.29
Var(X|Y ) = E[X 2 |Y ] − (E[X|Y ])2 .
Proof: Since Var(Z) = E[Z²] − (E[Z])², then Var(X|Y) = E[X²|Y] − (E[X|Y])² and Var(E[X|Y]) = E[(E[X|Y])²] − (E[E[X|Y]])².
Hence, taking expectations and adding (the conditional variance formula),

E[Var(X|Y)] + Var(E[X|Y]) = E[E[X²|Y]] − E[(E[X|Y])²] + E[(E[X|Y])²] − (E[E[X|Y]])² = E[X²] − (E[X])² = Var(X). □
7.8 Conditional Expectation and Prediction
By prediction we mean the following: suppose the value of X is observed, and we want to predict the value of a second variable Y. We want to find a function g such that if X = x, then g(x) is the prediction for Y. The function g must be chosen such that g(X) is close to Y; i.e., we want to choose g so that E[(Y − g(X))²] is minimized.
Lemma 7.30 The best predictor of Y based on X is g(x) = E[Y|X = x], i.e., g(X) = E[Y|X] is the best predictor of Y: for all g(X),

E[(Y − g(X))²] ≥ E[(Y − E[Y|X])²].

Proof:

E[(Y − g(X))²|X] = E[(Y − E[Y|X] + E[Y|X] − g(X))²|X]
= E[(Y − E[Y|X])²|X] + E[(E[Y|X] − g(X))²|X] + 2E[(Y − E[Y|X])(E[Y|X] − g(X))|X].

Since, given X, E[Y|X] − g(X) is a function of X, it can be treated as a constant; thus

E[(Y − E[Y|X])(E[Y|X] − g(X))|X] = (E[Y|X] − g(X)) E[Y − E[Y|X] | X]
= (E[Y|X] − g(X))(E[Y|X] − E[Y|X])
= 0.

Hence E[(Y − g(X))²|X] ≥ E[(Y − E[Y|X])²|X], and taking expectations of both sides gives the result. □
Lemma 7.31 The best constant predictor of Y is c = E[Y], and at this value of c, E[(Y − c)²] = Var(Y).
Proof: E[(Y − c)²] = E[Y²] − 2cE[Y] + c² = Var(Y) + (E[Y] − c)². Hence it is clear that the minimum is Var(Y), which happens when c = E[Y]. □
Lemma 7.32 Suppose the joint distribution of X and Y is not completely known, and we want to find constants a, b such that E[(Y − a − bX)²] is minimized. The best linear predictor of Y is µY + ρ(σY/σX)(X − µX), and the minimum of E[(Y − a − bX)²] is σY²(1 − ρ²).
Proof: First consider the case where µX = µY = 0 and σX = σY = 1, and let ρ = ρ(X, Y) = E[XY]. Then

E[(Y − a − bX)²] = E[Y²] + a² + b²E[X²] − 2bE[XY] − 2aE[Y] + 2abE[X] = 1 + a² + b² − 2bρ.

Hence E[(Y − a − bX)²] is minimal if a = 0 and b = ρ, so the best linear predictor of Y in this case is a + bX = ρX, and the minimum of E[(Y − a − bX)²] is 1 − ρ².
Next, suppose X, Y are arbitrary with means µX, µY and variances σX², σY² respectively. Let

X1 = (X − µX)/σX and Y1 = (Y − µY)/σY.

Then

Y − a − bX = σY [ Y1 − (a + bµX − µY)/σY − (bσX/σY) X1 ].

By our previous analysis, E[(Y − a − bX)²] is minimized if (a + bµX − µY)/σY = 0 and bσX/σY = ρ, that is, b = ρ σY/σX and a = µY − bµX.
The best linear predictor of Y is thus

µY + ρ (σY/σX)(X − µX),

and the minimum of E[(Y − a − bX)²] is σY²(1 − ρ²). □
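On simulated data, the least-squares coefficients recover exactly these formulas (hedged sketch: the bivariate model below, with µX = 0, σX = σY = 1, µY = 2, is an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(7)
    N, rho = 400_000, 0.6
    x = rng.standard_normal(N)
    y = 2.0 + rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(N)

    b = np.cov(x, y)[0, 1] / x.var()   # ~ rho * sigmaY / sigmaX = rho
    a = y.mean() - b * x.mean()        # ~ muY - b muX = 2
    mse = np.mean((y - a - b * x) ** 2)
    print(a, b, mse, 1 - rho ** 2)     # mse ~ sigmaY^2 (1 - rho^2)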
Suppose we let g(t) be such that g^(n)(0) = E[X^n]:

g(t) = Σ_{n=0}^{∞} (E[X^n]/n!) t^n
= Σ_{n=0}^{∞} ( Σ_x x^n p(x) / n! ) t^n
= Σ_x Σ_{n=0}^{∞} ((tx)^n/n!) p(x)
= Σ_x p(x) e^{tx}
= E[e^{tX}].
Definition: let X be a random variable; its moment generating function is MX (t) = E[e^{tX}].
Lemma 7.33 Assume that MX (t) has a power series expansion at 0. Then MX^(n)(0) = E[X^n] for any nonnegative integer n.
Proof: Assume that X is a continuous random variable with density function f. Then

Σ_{n=0}^{∞} (MX^(n)(0)/n!) t^n = MX (t) = E[e^{tX}] = ∫_{−∞}^{∞} e^{tx} f(x) dx
= ∫_{−∞}^{∞} Σ_{n=0}^{∞} ((tx)^n/n!) f(x) dx
= Σ_{n=0}^{∞} ∫_{−∞}^{∞} ((tx)^n/n!) f(x) dx
= Σ_{n=0}^{∞} (t^n/n!) ∫_{−∞}^{∞} x^n f(x) dx
= Σ_{n=0}^{∞} (E[X^n]/n!) t^n.

Comparing coefficients gives MX^(n)(0) = E[X^n]. □
Proposition 7.34
Suppose X is a Poisson random variable with parameter λ. Then MX (t) = exp[λ(e^t − 1)].
Suppose X is a geometric random variable with parameter p. Then MX (t) = pe^t/(1 − (1 − p)e^t).
Proof: For the Poisson case,

MX (t) = Σ_{n=0}^{∞} e^{tn} e^{−λ} λ^n/n! = e^{−λ} Σ_{n=0}^{∞} (λe^t)^n/n! = e^{−λ} e^{λe^t} = exp[λ(e^t − 1)].

For the geometric case, provided (1 − p)e^t < 1,

MX (t) = Σ_{n=1}^{∞} e^{tn} p(1 − p)^{n−1} = pe^t Σ_{n=1}^{∞} ((1 − p)e^t)^{n−1} = pe^t/(1 − (1 − p)e^t). □
Suppose Z is the standard normal random variable. Then

MZ (t) = E[e^{tZ}] = ∫_{−∞}^{∞} e^{tz} (1/√(2π)) e^{−z²/2} dz
= ∫_{−∞}^{∞} (1/√(2π)) e^{−(z−t)²/2 + t²/2} dz
= e^{t²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(z−t)²/2} dz
= e^{t²/2}.
Now if X = σZ + µ, then

MX (t) = E[e^{tX}] = E[e^{t(σZ+µ)}] = e^{µt} MZ (σt) = e^{µt + σ²t²/2},

so a normal random variable with parameters (µ, σ²) has moment generating function e^{µt + σ²t²/2}.
Suppose X is a gamma random variable with parameters (α, λ). Then, for t < λ,

MX (t) = ∫_0^∞ e^{tx} λe^{−λx}(λx)^{α−1}/Γ(α) dx
= (λ/(λ − t))^α ∫_0^∞ (λ − t)e^{−(λ−t)x}[(λ − t)x]^{α−1}/Γ(α) dx
= (λ/(λ − t))^α,

since the last integrand is the gamma (α, λ − t) density.
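The gamma formula just derived is easy to check by Monte Carlo (hedged sketch; α, λ, t are illustrative values with t < λ):

    import numpy as np

    rng = np.random.default_rng(8)
    alpha, lam, t = 3.0, 2.0, 0.5
    x = rng.gamma(shape=alpha, scale=1.0 / lam, size=1_000_000)
    print(np.exp(t * x).mean(), (lam / (lam - t)) ** alpha)  # both ~ (4/3)^3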
Proposition 7.35 If MX (t) = MY (t) for all t near 0, then X and Y have the same distribution.
Proposition 7.36 Suppose X and Y are random variables such that Y = aX + b. Let the moment generating function of X be MX (t); then the moment generating function of Y is MY (t) = e^{bt} MX (at).
Proof: Since Y = aX + b, by the definition of the moment generating function,

MY (t) = E[e^{tY}] = E[e^{atX + bt}] = e^{bt} E[e^{(at)X}] = e^{bt} MX (at). □
Proposition 7.37 Suppose X and Y are independent, then MX+Y (t) = MX (t)MY (t). More generally, if X1 , X2 , · · · , Xn
are independent random variables, then
MX1 +X2 +···+Xn (t) = MX1 (t) · MX2 (t) · · · MXn (t).
Proof: It suffices to prove the case of two independent random variables; the general statement follows by induction. Suppose X and Y are independent; then, by Proposition 7.12,

MX+Y (t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = MX (t) MY (t). □
Proposition 7.38 Suppose X is a binomial random variable with parameters (n, p); then MX (t) = (pe^t + 1 − p)^n. Suppose X is a negative binomial random variable with parameters (r, p); then

MX (t) = ( pe^t/(1 − (1 − p)e^t) )^r.

Proof: A binomial random variable with parameters (n, p) is the sum of n independent Bernoulli random variables with parameter p, each with moment generating function pe^t + 1 − p. A negative binomial random variable with parameters (r, p) is the sum of r independent geometric random variables with parameter p. Now apply Proposition 7.37. □
Proposition 7.39 Let X be chi-squared with n degrees of freedom, then MX (t) = (1 − 2t)−n/2 .
Proof: Recall that a chi-squared random variable with n degrees of freedom can be written as Z1² + · · · + Zn², where Z1, · · · , Zn are independent standard normal random variables. Then, for t < 1/2,

E[e^{tX}] = Π_{i=1}^{n} MZi² (t) = Π_{i=1}^{n} E[e^{tZi²}] = Π_{i=1}^{n} (1/√(2π)) ∫_{−∞}^{∞} e^{tx²} e^{−x²/2} dx
= Π_{i=1}^{n} (1/√(2π)) ∫_{−∞}^{∞} e^{−x²(1−2t)/2} dx
= Π_{i=1}^{n} (1 − 2t)^{−1/2}
= (1 − 2t)^{−n/2},

using the Gaussian integral ∫_{−∞}^{∞} e^{−x²(1−2t)/2} dx = √(2π/(1 − 2t)). □
Proposition 7.40 Let X1, X2, · · · be independent and identically distributed with X, let N be a random variable taking nonnegative integer values, independent of the Xi, and let Y = X1 + · · · + XN. Then E[e^{tY}|N = n] = [MX (t)]^n, i.e., E[e^{tY}|N] = [MX (t)]^N, so

MY (t) = E[E[e^{tY}|N]] = E[(MX (t))^N].
Remark: for jointly distributed X and Y, the joint moment generating function is MX,Y (s, t) = E[e^{sX+tY}]. If X and Y are independent, and X1, Y1 are independent random variables distributed as X and Y respectively, then

MX1,Y1 (s, t) = MX1 (s)MY1 (t) = MX (s)MY (t) = MX,Y (s, t).
Proposition 7.41 Suppose the number of events that occur is Poisson with parameter λ, and each event, independently of the others, is of type I with probability p and of type II with probability 1 − p. Let Xi be the number of events of type i. Then X1 and X2 are independent Poisson random variables with parameters λp and λ(1 − p).
Proof: Let X be the total number of events. Given X = n + m, X1 is binomial with parameters (n + m, p). Then

P(X1 = n, X2 = m) = P(X1 = n, X2 = m | X = n + m) P(X = n + m)
= C(n + m, n) p^n (1 − p)^m · e^{−λ} λ^{n+m}/(n + m)!
= (e^{−λp}(λp)^n/n!) · (e^{−λ(1−p)}(λ(1 − p))^m/m!),

and the joint mass function factors into the two stated Poisson mass functions. □
Proposition 7.42 Let Z1 and Z2 be independent standard normal random variables. Then X = (1/2)(Z1 + Z2) and Y = (1/2)(Z1 − Z2) are normal random variables. In addition, X and Y are independent, and so are X and Y².
Proof: Compute the joint moment generating function:

E[e^{sX+tY}] = E[e^{((s+t)/2)Z1 + ((s−t)/2)Z2}] = e^{(s+t)²/8} e^{(s−t)²/8} = e^{s²/4} e^{t²/4}.

The joint moment generating function is separable, and each factor is the moment generating function of a normal random variable (mean 0, variance 1/2). Hence X and Y are normal and independent, so X and Y² are also independent. □
Proposition 7.43 Let X1, X2, · · · , Xn be independent normal random variables with parameters (µ, σ²). Let X̄ = (1/n)(X1 + · · · + Xn) and S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1). Then X̄ and S² are independent.
7.12 Summary on Random Variables
8 Limit Theorems
8.1 Inequalities
Lemma 8.1 (Markov’s Inequality) Let X be a nonnegative random variable, then for any a > 0,
E[X]
P (X ≥ a) ≤ .
a
Proof: Let I be the indicator variable of the event {X ≥ a}: I = 1 if X ≥ a and I = 0 if X < a.
If X ≥ a, then aI = a ≤ X; if X < a, then aI = 0 ≤ X. Hence X ≥ aI, so E[X] ≥ aE[I] = aP(X ≥ a). □
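For a distribution whose tail is known in closed form, the bound can be compared directly (hedged sketch: X exponential with mean 1 is an illustrative choice, so P(X ≥ a) = e^{−a} and the Markov bound is 1/a):

    import numpy as np

    for a in (1.0, 2.0, 5.0):
        print(a, np.exp(-a), 1.0 / a)   # exact tail vs. Markov bound 1/a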
Proposition 8.2 (Chernoff Bounds) Let X be a random variable with moment generating function M(t). Then for any a > 0 and t > 0,

P(X ≥ a) ≤ e^{−ta} M(t).

Proof: If X ≥ a, then e^{tX} ≥ e^{ta} for any t > 0. Hence by Markov's Inequality,

P(X ≥ a) ≤ P(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}]/e^{ta} = e^{−ta} M(t). □
Corollary 8.2.1 (Chernoff Bounds for the Poisson Random Variable) Suppose X is a Poisson random variable with parameter λ. Then

P(X ≥ i) ≤ e^{−λ}(eλ)^i/i^i.

Proof: By the Chernoff bound with M(t) = exp[λ(e^t − 1)],

P(X ≥ i) ≤ e^{λ(e^t − 1)} e^{−it}, t > 0.

To minimize the right-hand side, differentiate with respect to t; the minimum is attained when λe^t = i, i.e., e^t = i/λ (valid for i > λ). Substituting this value of t gives the desired inequality. □
Corollary 8.2.2 (Chernoff Bounds for the Standard Normal Variable) Suppose Z is standard normal and a > 0. Then

P(Z > a) ≤ (1/2) e^{−a²/2}.
Proof:

P(Z > a) = ∫_a^∞ (1/√(2π)) e^{−u²/2} du
= ∫_0^∞ (1/√(2π)) e^{−(x+a)²/2} dx   (substituting x = u − a)
= ∫_0^∞ (1/√(2π)) e^{−a²/2} e^{−x²/2} e^{−ax} dx
≤ (1/√(2π)) e^{−a²/2} ∫_0^∞ e^{−x²/2} · 1 dx
= (1/√(2π)) e^{−a²/2} · √(π/2)
= (1/2) e^{−a²/2}. □
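The bound can be compared with the exact tail P(Z > a) = erfc(a/√2)/2 (hedged sketch; the grid of a values is arbitrary):

    import math

    for a in (0.5, 1.0, 2.0, 3.0):
        exact = 0.5 * math.erfc(a / math.sqrt(2.0))
        bound = 0.5 * math.exp(-a * a / 2.0)
        print(a, exact, bound)   # the bound always dominates the exact tail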
Lemma 8.3 (Chebyshev's Inequality) Let X be a random variable, let µX be the mean of X, and let σX² be the variance of X. Then

P(|X − µX| ≥ a) ≤ σX²/a² for all a > 0.
Proof: Note σX² = E[(X − µX)²]. Let Y = (X − µX)² ≥ 0 and a > 0. By Markov's Inequality,

P(Y ≥ a²) ≤ E[Y]/a² = σX²/a².

Since |X − µX| ≥ a if and only if Y ≥ a², this gives

P(|X − µX| ≥ a) ≤ σX²/a² for all a > 0. □
Corollary 8.3.1 Suppose X is the standard normal random variable and a > 0. Then

Φ(a) ≥ 1 − 1/(2a²).
Proof: P(|X| ≥ a) = 2P(X ≥ a) = 2(1 − Φ(a)), and P(|X| ≥ a) ≤ 1/a² by Chebyshev's Inequality (here µ = 0, σ² = 1). Combining the two gives the desired result. □
Theorem 8.4 If X is a random variable such that Var(X) = 0, then P(X = µX) = 1.
Proof: Let X be a random variable such that Var(X) = 0. For any ϵ > 0, by Chebyshev's Inequality,

P(|X − µX| > ϵ) ≤ σX²/ϵ² = 0.

Let ϵ → 0⁺. Then, by continuity of probability,

0 = lim_{ϵ→0⁺} P(|X − µX| > ϵ) = P(|X − µX| > 0).

Hence P(|X − µX| ≠ 0) = 0, which implies that P(X = µX) = 1. □
Proposition 8.5 (One-sided Chebyshev's Inequality) Let X be a random variable with mean µ and variance σ². Then for any a > 0,

P(X ≥ µ + a) ≤ σ²/(σ² + a²).

Proof: Consider Y = (X − µ)/σ. Then Y is a random variable with mean 0 and variance 1, and

P(X ≥ µ + a) = P(Y ≥ a/σ).

Hence, to prove that P(X ≥ µ + a) ≤ σ²/(σ² + a²) for all a > 0, it suffices to show that P(Y ≥ a) ≤ 1/(1 + a²) for all a > 0 (then apply this with a/σ in place of a). Now suppose Y is a random variable with mean 0 and variance 1. For any b ≥ 0, Y ≥ a implies (Y + b)² ≥ (a + b)², hence by Markov's inequality,

P(Y ≥ a) ≤ P((Y + b)² ≥ (a + b)²) ≤ E[(Y + b)²]/(a + b)² = (1 + b²)/(a + b)².

Minimizing the right-hand side over b (the minimum occurs at b = 1/a) gives P(Y ≥ a) ≤ 1/(1 + a²). □
Proposition 8.6 (Jensen's Inequality) Suppose f(x) is convex. Then

E[f(X)] ≥ f(E[X]),

provided that the expectations exist and are finite. Suppose f(x) is concave; then

E[f(X)] ≤ f(E[X]),

provided that the expectations exist and are finite.
Proposition 8.7 (Cauchy-Schwarz Inequality) Suppose X and Y are random variables with finite second moments. Then

(E[XY])² ≤ E[X²] E[Y²].

Proof: Note that

0 ≤ E[(tX + Y)²] = t² E[X²] + 2t E[XY] + E[Y²]

for all t; hence the discriminant of the quadratic in t must be less than or equal to zero, that is,

4(E[XY])² − 4E[X²]E[Y²] ≤ 0,

which gives the inequality. □
Proposition 8.8 (Weak Law of Large Numbers) Let X1, X2, · · · be independent, identically distributed random variables with E[Xi] = µ. If ϵ > 0, then

P( |(X1 + X2 + · · · + Xn)/n − µ| ≥ ϵ ) → 0 as n → ∞.
Proof: Assume additionally that Var(Xi) = σ² < ∞. Let X̄n = (X1 + · · · + Xn)/n. Then E[X̄n] = µ and Var(X̄n) = σ²/n. By Chebyshev's inequality, for any ϵ > 0,

P(|X̄n − µ| ≥ ϵ) ≤ σ²/(nϵ²) → 0 as n → ∞. □
Corollary 8.8.1 Let X1, X2, · · · be independent, identically distributed random variables with E[Xi] = µ. If ϵ > 0, then

P( |(X1 + X2 + · · · + Xn)/n − µ| ≤ ϵ ) → 1 as n → ∞.
Proof: Taking the complement of the probability, we get the desired result. □
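The shrinking deviation probability is visible in simulation (hedged sketch: exponential(1) summands, so µ = 1, with illustrative ϵ and replication count):

    import numpy as np

    rng = np.random.default_rng(9)
    eps, reps = 0.1, 20_000
    for n in (10, 100, 1000):
        xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
        print(n, np.mean(np.abs(xbar - 1.0) >= eps))   # -> 0 as n grows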
Lemma 8.9 Let Zn be a random variable having cumulative distribution function FZn and moment generating function MZn, n = 1, 2, · · · , and let Z be a random variable having cumulative distribution function FZ and moment generating function MZ. If MZn (t) → MZ (t) for all t as n → ∞, then FZn (t) → FZ (t) at every t at which FZ is continuous.
Theorem 8.10 (Central Limit Theorem) Let X1, X2, · · · be independent, identically distributed random variables with mean µ and variance σ². Then the distribution of

(X1 + · · · + Xn − nµ)/(σ√n)

tends to the standard normal distribution as n → ∞.
Proof: Let Zn = (X1 + · · · + Xn − nµ)/(σ√n). Note that the Yi = (Xi − µ)/σ are identically distributed with mean 0 and variance 1; let their common moment generating function be M(t). Then Zn = Σ_{i=1}^{n} Yi/√n, so Zn has moment generating function [M(t/√n)]^n. Let Z be a standard normal random variable, so MZ (t) = e^{t²/2}. We show that [M(t/√n)]^n → e^{t²/2} as n → ∞.
Let L(t) = ln M(t). Then it is equivalent to show that nL(t/√n) → t²/2. Note

L(0) = ln M(0) = ln 1 = 0,
L′(0) = M′(0)/M(0) = E[Y]/1 = 0,
L″(0) = (M″(0)M(0) − [M′(0)]²)/[M(0)]² = (E[Y²] · 1 − (E[Y])²)/1² = 1.

Substituting x = 1/√n and applying L'Hopital's rule twice,

lim_{n→∞} nL(t/√n) = lim_{x→0⁺} L(tx)/x² = lim_{x→0⁺} tL′(tx)/(2x) = lim_{x→0⁺} t²L″(tx)/2 = t²/2,

by assuming that L″ is continuous at 0. By Lemma 8.9, this completes the proof of the theorem. □
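A simulation shows the normalized sums behaving like a standard normal even for skewed summands (hedged sketch: exponential(1) summands, for which µ = σ = 1, with illustrative n):

    import numpy as np

    rng = np.random.default_rng(10)
    n, reps = 200, 100_000
    x = rng.exponential(1.0, size=(reps, n))
    z = (x.sum(axis=1) - n * 1.0) / np.sqrt(n)   # (S_n - n mu)/(sigma sqrt(n))

    print(z.mean(), z.std())                     # ~ 0 and ~ 1
    print(np.mean(z > 1.96))                     # ~ 0.025, the normal upper tail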
Proposition 8.11 (Strong Law of Large Numbers) Let X1, X2, · · · be independent, identically distributed random variables with mean µ = E[Xi]. Then

P( lim_{n→∞} (X1 + · · · + Xn)/n = µ ) = 1,

or equivalently,

(X1 + · · · + Xn)/n → µ almost surely as n → ∞.
Proof (sketch, assuming K = E[Xi⁴] < ∞): We may assume µ = 0 (otherwise replace Xi by Xi − µ). Let Sn = X1 + · · · + Xn. Expanding Sn⁴ and using independence (every term containing a factor with an odd power has zero expectation),

E[Sn⁴] = Σ_{i=1}^{n} E[Xi⁴] + 6 Σ_{i<j} E[Xi² Xj²]
= nK + 3n(n − 1)(E[Xi²])²
≤ nK + 3n(n − 1)K
≤ n²K + 3n²K
= 4n²K,

since (E[Xi²])² ≤ E[Xi⁴] = K. Hence E[(Sn/n)⁴] ≤ 4n²K/n⁴ = 4K/n², so

E[ Σ_{n=1}^{∞} (Sn/n)⁴ ] = Σ_{n=1}^{∞} E[(Sn/n)⁴] ≤ 4K Σ_{n=1}^{∞} 1/n² < ∞.

Therefore Σ_{n=1}^{∞} (Sn/n)⁴ < ∞ with probability 1, which forces (Sn/n)⁴ → 0, and hence Sn/n → 0 = µ, almost surely. □
Corollary: let X1, X2, · · · be independent, identically distributed positive random variables. Then, with probability 1,

lim_{n→∞} ( Π_{i=1}^{n} Xi )^{1/n} = e^{E[ln(Xi)]}.
Proof: Since X1, X2, · · · are independent, identically distributed random variables, so are ln(X1), ln(X2), · · · . Note

( Π_{i=1}^{n} Xi )^{1/n} = exp( Σ_{i=1}^{n} ln(Xi) / n ).

Since, by the Strong Law of Large Numbers,

Σ_{i=1}^{n} ln(Xi) / n → E[ln(Xi)]

as n → ∞, and exp is continuous, then

lim_{n→∞} ( Π_{i=1}^{n} Xi )^{1/n} = lim_{n→∞} exp( Σ_{i=1}^{n} ln(Xi) / n ) = e^{E[ln(Xi)]}. □
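For uniform (0, 1) variables, E[ln Xi] = −1, so the geometric mean should approach e^{−1}; a one-line check (hedged sketch, arbitrary sample size):

    import numpy as np

    rng = np.random.default_rng(11)
    x = rng.random(1_000_000)
    print(np.exp(np.log(x).mean()), np.exp(-1.0))   # both ~ 0.3679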
Remark: suppose X1, X2, · · · are i.i.d. random variables with mean µ and variance σ². Then, by the Central Limit Theorem, for large n we have the approximations in distribution

Σ_{i=1}^{n} Xi ≈ Normal(nµ, nσ²)

and

(1/n) Σ_{i=1}^{n} Xi ≈ Normal(µ, σ²/n).
Proposition 8.12 Let Zn, n ≥ 1, be a sequence of random variables and c a constant such that, for each ϵ > 0, P(|Zn − c| > ϵ) → 0 as n → ∞. Then, for any bounded continuous function g,

E[g(Zn)] → g(c) as n → ∞.
Proof: Suppose Zn is discrete; then the Zn converge in distribution to the random variable Z with P(Z = c) = 1, and the desired statement follows. So we consider the case where Zn is a sequence of continuous random variables with densities fZn.
Since g is bounded, |g(x)| ≤ M for some M ∈ R; g is also continuous, so for every ϵ > 0 there exists a δ > 0 such that |x − c| ≤ δ implies |g(x) − g(c)| ≤ ϵ.
By the definition of expected values,

E[g(Zn)] = ∫_{−∞}^{∞} g(x) fZn (x) dx = ∫_{|x−c|≤δ} g(x) fZn (x) dx + ∫_{|x−c|>δ} g(x) fZn (x) dx.
Now, for x such that |x − c| ≤ δ, we have g(x) ≤ g(c) + ϵ, and for x such that |x − c| > δ, we have g(x) ≤ M. Therefore,

E[g(Zn)] ≤ (g(c) + ϵ) ∫_{|x−c|≤δ} fZn (x) dx + M ∫_{|x−c|>δ} fZn (x) dx.

Since ∫_{|x−c|>δ} fZn (x) dx = P(|Zn − c| > δ) → 0, letting n → ∞ and then ϵ → 0 gives lim sup_{n→∞} E[g(Zn)] ≤ g(c); a symmetric argument using g(x) ≥ g(c) − ϵ and g(x) ≥ −M gives lim inf_{n→∞} E[g(Zn)] ≥ g(c). Hence lim_{n→∞} E[g(Zn)] = g(c). □