A SHORT SUMMARY ON ‘A FIRST COURSE IN PROBABILITY’

LONG CHEN

ABSTRACT. This is an outline of the book ‘A First Course in Probability’, which can serve as a minimal introduction to probability theory.

CONTENTS
1. Combinatorial Analysis
2. Axioms of Probability
3. Conditional Probability and Bayes’ Formulae
   Bayes’s Formulae
4. Random Variables
   Important Discrete Random Variables
5. Continuous Random Variables
   Important Continuous Random Variables
6. Jointly Distributed Random Variables
6.1. Joint Distribution
6.2. Summation of Independent Random Variables
7. Properties of Expectation and Variance
7.1. Expectation of Sums of Random Variables
7.2. Moments of the Number of Events that Occur
7.3. Covariance, Variance of Sums, and Correlations
7.4. Moment Generating Functions
7.5. The Sample Mean and Variance
8. Limit Theorems
8.1. Tail bound
8.2. The Central Limit Theorem
8.3. Law of Large Numbers
Appendix

Date: November 12, 2015.



1. COMBINATORIAL ANALYSIS
The following sampling table presents the number of possible samples of size k out of
a population of size n, under various assumptions about how the sample is collected.
TABLE 1. Sampling Table

                 Order             No Order
With Rep         n^k               C(n + k − 1, k)
Without Rep      k! · C(n, k)      C(n, k)

Here ‘Rep’ stands for ‘Replacement’ or ‘Repetition’, meaning that the sample may contain duplicate items.
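As a quick numerical illustration of Table 1 (a Python sketch that is not part of the original text; the values of n and k are arbitrary), the four counts can be computed directly:

```python
from math import comb, perm

# Number of samples of size k from a population of size n (cf. Table 1).
n, k = 5, 3  # illustrative values

ordered_with_rep = n**k                  # With Rep, Order: n^k
ordered_without_rep = perm(n, k)         # Without Rep, Order: k! * C(n, k)
unordered_without_rep = comb(n, k)       # Without Rep, No Order: C(n, k)
unordered_with_rep = comb(n + k - 1, k)  # With Rep, No Order: C(n+k-1, k)

print(ordered_with_rep, ordered_without_rep,
      unordered_without_rep, unordered_with_rep)  # 125 60 10 35
```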

2. AXIOMS OF PROBABILITY
Let S denote the set of all possible outcomes of an experiment. S is called the sample
space of the experiment. An event is a subset of S.
An intuitive way of defining the probability is as follows. For each event E of the sample space S, we define n(E) to be the number of times that E occurs in the first n repetitions of the experiment. Then the naive definition of the probability of the event E is
Pr(E) = lim_{n→∞} n(E)/n.
That is, Pr(E) can be interpreted as a long-run relative frequency. This definition possesses a serious drawback: how do we know the limit exists? We have to believe (assume) that the limit exists!
Instead, we can accept the following axioms of probability as the definition of a probability:
(A1) 0 ≤ Pr(E) ≤ 1;
(A2) Pr(S) = 1;
(A3) for mutually exclusive events E1, E2, . . .,
Pr(∪_{i=1}^{∞} Ei) = Σ_{i=1}^{∞} Pr(Ei).
The inclusion-exclusion formula for two events is
Pr(E ∪ F) = Pr(E) + Pr(F) − Pr(E ∩ F),
which can be generalized to n events and will be proved later:
Pr(∪_{i=1}^{n} Ei) = Σ_{i=1}^{n} Pr(Ei) − Σ_{i<j} Pr(Ei ∩ Ej) + Σ_{i<j<k} Pr(Ei ∩ Ej ∩ Ek) − · · · + (−1)^{n+1} Pr(∩_{i=1}^{n} Ei).
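As a sanity check (an illustrative Python sketch, not from the original text; the sample space and events are arbitrary), the three-event version of the formula can be verified by brute-force counting on a finite sample space:

```python
from fractions import Fraction

# Uniform sample space: one roll of a fair die.
S = set(range(1, 7))
Pr = lambda A: Fraction(len(A & S), len(S))

E, F, G = {1, 2, 3}, {2, 4, 6}, {3, 4, 5}  # arbitrary events

lhs = Pr(E | F | G)
rhs = (Pr(E) + Pr(F) + Pr(G)
       - Pr(E & F) - Pr(E & G) - Pr(F & G)
       + Pr(E & F & G))
print(lhs, rhs, lhs == rhs)  # 1 1 True
```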

More rigorously, the probability can be defined as a normalized measure. A σ-field F (on S) is a collection of subsets of S satisfying the following conditions:
(1) it is not empty: S ∈ F;
(2) if E ∈ F, then E^c ∈ F;
(3) if E1, E2, . . . ∈ F, then ∪_{i=1}^{∞} Ei ∈ F.
From these properties, it follows that the σ-algebra is also closed under countable intersections (by applying De Morgan’s laws).
Then the probability is a normalized measure defined on the σ-algebra satisfying the three axioms. Usually we denote the whole setup as a triple (S, F, Pr).

3. CONDITIONAL PROBABILITY AND BAYES’ FORMULAE


For events E and F, the conditional probability of E given that F has occurred is denoted by Pr(E|F) and is defined by
Pr(E|F) = Pr(E ∩ F) / Pr(F).
It can be rewritten as a multiplication rule of probability
Pr(E ∩ F ) = Pr(F ) Pr(E|F ).
If Pr(E ∩ F ) = Pr(E) Pr(F ), then we say that the events E and F are independent.
This condition is equivalent to Pr(E|F ) = Pr(E) and to Pr(F |E) = Pr(F ). Thus, the
events E and F are independent if knowledge of the occurrence of one of them does not
affect the probability of the other.
The conditional independence of E and F given A is defined as
Pr(E ∩ F |A) = Pr(E|A) Pr(F |A).
Conditional independence does not imply independence, and independence does not imply
conditional independence.
A valuable identity is
Pr(E) = Pr(E|F ) Pr(F ) + Pr(E|F c ) Pr(F c )
which can be used to compute Pr(E) by “conditioning” on whether F occurs. The formula can be generalized to a set of mutually exclusive and exhaustive events (a partition of the sample space S) F1, . . . , Fn, which is known as the law of total probability:
(1)    Pr(E) = Σ_{i=1}^{n} Pr(E|Fi) Pr(Fi).

Bayes’s Formulae. Let H denote the hypothesis and E the evidence. Bayes’s formula reads
Pr(H|E) = Pr(H ∩ E)/Pr(E) = Pr(E|H) Pr(H) / [ Pr(E|H) Pr(H) + Pr(E|H^c) Pr(H^c) ].
The ordering in the conditional probability is switched since, in practice, the information known is Pr(E|H), not Pr(H|E).
For a set of mutually exclusive and exhaustive events (a partition of the sample space S) F1, . . . , Fn, the formula is
Pr(Fj|E) = Pr(E ∩ Fj)/Pr(E) = Pr(E|Fj) Pr(Fj) / Σ_{i=1}^{n} Pr(E|Fi) Pr(Fi).
If the events Fi , i = 1, . . . , n, are competing hypotheses, then Bayes’s formula shows how
to compute the conditional probabilities of these hypotheses when additional evidence E
becomes available.

Example 3.1. The color of a person’s eyes is determined by a single pair of genes. If they
are both blue-eyed genes, then the person will have blue eyes; if they are both brown-eyed
genes, then the person will have brown eyes; and if one of them is a blue-eyed gene and
the other a brown-eyed gene, then the person will have brown eyes. (Because of the latter
fact, we say that the brown-eyed gene is dominant over the blue-eyed one.) A newborn
child independently receives one eye gene from each of its parents, and the gene it receives
from a parent is equally likely to be either of the two eye genes of that parent. Suppose
that Smith and both of his parents have brown eyes, but Smith’s sister has blue eyes.
(1) Suppose that Smith’s wife has blue eyes. What is the probability that their first
child will have brown eyes?
(2) If their first child has brown eyes, what is the probability that their next child will
also have brown eyes?
The answers are 2/3 and 3/4. Let the three events describing Smith’s pair of genes be F1 = (br, br), F2 = (br, bl), F3 = (bl, br). The second question is different from the first one because of Bayes’s formula: after the evidence E = ‘their first child has brown eyes’, the probabilities Pr(Fj|E) of the hypotheses change. The conditional probability of A = ‘their next child will also have brown eyes’ given E can be computed by the law of total probability:
Pr(A|E) = Σ_{i=1}^{3} Pr(A | Fi ∩ E) Pr(Fi|E).
Note that Pr(A) = Σ_{i=1}^{3} Pr(A|Fi) Pr(Fi) = 2/3. The new evidence E increases the probability of having brown eyes since E is in favor of that fact.
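The two answers can be reproduced with a short computation (a Python sketch, not part of the original text), conditioning on Smith’s possible genotypes:

```python
from fractions import Fraction as F

# Smith's genotype given brown eyes and a blue-eyed sister (both parents (br, bl)):
# F1 = (br, br), F2 = (br, bl), F3 = (bl, br), each with prior 1/3.
prior = {"F1": F(1, 3), "F2": F(1, 3), "F3": F(1, 3)}
# Probability that a child of Smith and his blue-eyed wife has brown eyes.
p_brown = {"F1": F(1), "F2": F(1, 2), "F3": F(1, 2)}

# (1) First child has brown eyes: law of total probability.
p_E = sum(p_brown[f] * prior[f] for f in prior)
print(p_E)  # 2/3

# (2) Update the genotype probabilities given E, then condition again.
posterior = {f: p_brown[f] * prior[f] / p_E for f in prior}
print(sum(p_brown[f] * posterior[f] for f in posterior))  # 3/4
```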
The odds of an event A are defined by
Pr(A)/Pr(A^c) = Pr(A)/(1 − Pr(A)).
If the odds are equal to α, then it is common to say that the odds are ‘α to 1’ in favor of the hypothesis. For example, if Pr(A) = 2/3, then the odds are 2.
Consider now a hypothesis H, and suppose new evidence E is introduced. The new odds after the evidence E has been introduced are
(2)    Pr(H|E)/Pr(H^c|E) = [Pr(H)/Pr(H^c)] · [Pr(E|H)/Pr(E|H^c)].
The posterior odds of H are the likelihood ratio times the prior odds.

4. RANDOM VARIABLES
A real-valued function defined on the sample space is called a random variable (RV),
i.e., X : S → R. The event {X ≤ x} is a subset of S.
Two functions in the classical sense are associated with a random variable. The cumulative distribution function (CDF) F : R → [0, 1] is defined as
F(x) = Pr{X ≤ x},
which is an increasing and right-continuous function, and the density function (for a continuous RV) is
f(x) = F'(x).
All probabilities concerning X can be stated in terms of F .
Random variables can be classified into discrete RVs and continuous RVs. A random variable whose set of possible values is either finite or countably infinite is called discrete. If X is a discrete random variable, then the function
p(xi) = Pr{X = xi}
is called the probability mass function (PMF) of X, and p(x) = F'(x) is understood in the distribution sense; see Figure 1.

FIGURE 1. A PMF and CDF for a discrete random variable.
The expectation, commonly called the mean, is
E[X] = Σ_i xi p(xi).
The variance of a random variable X, denoted by Var(X), is defined by
Var(X) = E[(X − E[X])²] = E[X²] − E²[X].
It is a measure of the spread of the possible values of X. The quantity √Var(X) is called the standard deviation of X.
An important property of the expected value is the linearity of the expectation:
E[ Σ_{i=1}^{n} Xi ] = Σ_{i=1}^{n} E[Xi].
It can be proved using another equivalent definition of expectation,
E[X] = Σ_{s∈S} X(s) p(s).
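For a concrete discrete example (a Python sketch, not from the original text; the PMF below is arbitrary), the mean, variance, and standard deviation can be computed directly from the definitions:

```python
# An arbitrary PMF p(x) on {0, 1, 2, 3}.
pmf = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}
assert abs(sum(pmf.values()) - 1.0) < 1e-12

mean = sum(x * p for x, p in pmf.items())       # E[X]
second = sum(x**2 * p for x, p in pmf.items())  # E[X^2]
var = second - mean**2                          # Var(X) = E[X^2] - E[X]^2

print(mean, var, var**0.5)  # ≈ 1.7, 0.81, 0.9
```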

Important Discrete Random Variables. We list several discrete random variables corresponding to the sampling schemes.

TABLE 2. Distributions for Sampling Schemes

                          Replace                              No Replace
Fixed n trials            Binomial(n, p) (Bern if n = 1)       Hypergeometric
Draw until r successes    Negative Binomial (Geom if r = 1)    Negative Hypergeometric

Consider in total n trials with r successful trials, where each trial results in a success with probability p; denote this setup by the triple (n, r, p). If n is fixed, then r is a random variable, called the binomial random variable. When n = 1, it is the Bernoulli random variable. If r is fixed, then n is a random variable, called the negative binomial random variable. The special case r = 1 is called the geometric random variable.
The case without replacement is better understood as drawing balls from a bin. Consider in total N balls of which m are white and N − m are black. If drawing a white ball is considered a success, then the probability of success is p = m/N when only one ball is chosen. If the selected ball is put back after each draw, then we are in the case considered before. Now, without replacement, a sample of size n is chosen and we ask for r of the balls to be white; denote this setup by (n, r, N, m). If n is fixed, then r is a random variable, called the hypergeometric random variable. If r is fixed, then n is a random variable, called the negative hypergeometric random variable.
We explain these distributions in detail below.
• Bin(n, p). Binomial random variable with parameters (n, p):
p(i) = C(n, i) p^i (1 − p)^{n−i},   i = 0, . . . , n.
Such a random variable can be interpreted as the number of successes that occur when n independent trials, each of which results in a success with probability p, are performed. Its mean and variance are given by
E[X] = np,   Var(X) = np(1 − p).
The special case n = 1 is called the Bernoulli random variable and is denoted by Bern(p).
• NBin(r, p). Negative binomial random variable with parameters (r, p):
Pr{X = n} = C(n − 1, r − 1) p^r (1 − p)^{n−r},   n = r, r + 1, . . .
Suppose that independent trials, each having probability p, 0 < p < 1, of being a success, are performed until a total of r successes is accumulated. The random variable X equals the number of trials required. In order for the rth success to occur at the nth trial, there must be r − 1 successes in the first n − 1 trials and the nth trial must be a success.
• Geom(p). Geometric random variable with parameter p:
p(i) = p(1 − p)^{i−1},   i = 1, 2, . . .
Such a random variable represents the trial number of the first success when each trial is independently a success with probability p. Its mean and variance are
E[X] = 1/p,   Var(X) = (1 − p)/p².
• HGeom(n, N, m). Hypergeometric random variable with parameters (n, N, m):
Pr{X = i} = C(m, i) C(N − m, n − i) / C(N, n),   i = 0, 1, . . . , n.
An urn contains N balls, of which m are white and N − m are black. A sample of
size n is chosen randomly from this urn. X is the number of white balls selected.
• NHGeom(N, m, r). Negative hypergeometric random variable with integer parameters (N, m, r):
Pr{X = n} = [ C(m, r − 1) C(N − m, n − r) / C(N, n − 1) ] · (m − (r − 1)) / (N − (n − 1)),   n = r, r + 1, . . .
An urn contains N balls, of which m are special and N − m are ordinary. These balls are removed one at a time, randomly. The random variable X is equal to the number of balls that need to be withdrawn until a total of r special balls have been removed.
To obtain the probability mass function of a negative hypergeometric random variable X, note that X will equal n if both
(1) the first n − 1 withdrawals consist of r − 1 special and n − r ordinary balls, and
(2) the nth ball withdrawn is special.
When p ≪ 1 and n ≫ 1, Bin(n, p) ≈ Pois(λ) with λ = np, where the Poisson distribution is defined below.
• Pois(λ). Poisson random variable with parameter λ:
p(i) = e^{−λ} λ^i / i!,   i ≥ 0.
If a large number of (approximately) independent trials are performed, each having a small probability of being successful, then the number of successful trials that result will have a distribution which is approximately that of a Poisson random variable. The mean and variance of a Poisson random variable are both equal to its parameter λ.

5. CONTINUOUS RANDOM VARIABLES
A random variable X is continuous if there is a nonnegative function f, called the probability density function of X, such that, for any set B ⊂ R,
Pr{X ∈ B} = ∫_B f(x) dx.
If X is continuous, then its distribution function F will be differentiable and
d/dx F(x) = f(x).

FIGURE 2. PDF and CDF functions of a continuous random variable.
The expected value of a continuous random variable X is defined by
E[X] = ∫_{−∞}^{∞} x f(x) dx.

For a nonnegative random variable X, the following formula can be proved by interchanging the order of integration:
(3)    E[X] = ∫_0^∞ Pr{X > x} dx = ∫_0^∞ ∫_x^∞ f(y) dy dx.
The Law of the Unconscious Statistician (LOTUS) states that, for any function g,
E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx,
which can be proved using (3) and again interchanging the order of integration.

Important Continuous Random Variables.


• Unif(a, b). Uniform random variable over the interval (a, b):
f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise.
Its expected value and variance are
E[X] = (a + b)/2,   Var(X) = (b − a)²/12.
Universality of Uniform (UoU). For any continuous random variable, we can
transform it into a uniform random variable and vice versa. Let X be a continuous
random variable with CDF FX (x). Then
(4) FX (X) ∼ Unif(0, 1).
Indeed for any increasing function g and for any a in the range of g,
Pr{g(X) < a} = Pr{X < g −1 (a)} = FX (g −1 (a)),
which implies (4) by choosing g = FX .
Similarly, if U ∼ Unif(0, 1), then
F_X^{−1}(U) ∼ X;
that is, X can be simulated by plugging a uniform random variable into the inverse CDF (see the sketch after this list of distributions).

• N(µ, σ²). Normal random variable with parameters (µ, σ²):
f(x) = 1/(√(2π) σ) · e^{−(x−µ)²/(2σ²)},   −∞ < x < ∞.
It can be shown that
µ = E[X],   σ² = Var(X).
The transformed random variable
Z = (X − µ)/σ
is normal with mean 0 and variance 1. Such a random variable is said to be a
standard normal random variable. Probabilities about X can be expressed in terms
of probabilities about the standard normal variable Z.
The identity ∫_R f = 1 for the normal distribution can be proved as follows:
I² = ∫_R e^{−x²/2} dx ∫_R e^{−y²/2} dy = ∫_{R²} e^{−(x²+y²)/2} dx dy.
The last integral can be easily evaluated in polar coordinates.
• Expo(λ). Exponential random variable with parameter λ:
f(x) = λ e^{−λx} for x ≥ 0, and f(x) = 0 otherwise.
Its expected value and variance are, respectively,
E[X] = 1/λ,   Var(X) = 1/λ².
A key property possessed only by exponential random variables is that they are
memoryless, in the sense that, for positive s and t,
Pr{X > s + t|X > t} = Pr{X > s}.
If X represents the life of an item, then the memoryless property states that, for
any t, the remaining life of a t-year-old item has the same probability distribution
as the life of a new item. Thus, one need not remember the age of an item to
know its distribution of remaining life. In summary, a product with an Expo(λ)
lifetime is always ‘as good as new’.
• Γ(α, λ). Gamma distribution with parameters (α, λ):
f(x) = λ e^{−λx} (λx)^{α−1} / Γ(α) for x ≥ 0, and f(x) = 0 otherwise,
where the gamma function is defined by
Γ(α) = ∫_0^∞ e^{−x} x^{α−1} dx.
The gamma distribution often arises in practice as the distribution of the amount of time one has to wait until a total of n events has occurred. When α = 1, it reduces to the exponential distribution.
• β(a, b). Beta distribution with parameters (a, b):
f(x) = x^{a−1} (1 − x)^{b−1} / B(a, b) for 0 < x < 1, and f(x) = 0 otherwise,
where the constant B(a, b) is given by
B(a, b) = ∫_0^1 x^{a−1} (1 − x)^{b−1} dx.

• χ²(n). Chi-square distribution. Let Zi ∼ N(0, 1) for i = 1, . . . , n. Then the square sum of the Zi, i.e., X = Σ_{i=1}^{n} Zi², is said to have the χ²(n) distribution, and
X ∼ Γ(n/2, 1/2).

• Weibull distribution with parameters (ν, α, β). The Weibull distribution function is
F(x) = 0 for x ≤ ν, and F(x) = 1 − exp( −((x − ν)/α)^β ) for x > ν,
and the density function can be obtained by differentiation.


It is widely used in the field of life phenomena as the distribution of the lifetime
of some object, especially when the “weakest link” model is appropriate for the
object.
• Cauchy distribution with parameter θ:
f(x) = (1/π) · 1/(1 + (x − θ)²),   −∞ < x < ∞.
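As promised above, here is a numerical sketch (Python, not part of the original text; parameter values are arbitrary) of the universality of the uniform for the exponential distribution, together with a check of the memoryless property on the simulated sample:

```python
import math
import random

# Inverse-CDF sampling: for Expo(lam), F(x) = 1 - exp(-lam*x), so
# F^{-1}(u) = -ln(1 - u)/lam and X = F^{-1}(U) with U ~ Unif(0, 1).
random.seed(0)
lam, n = 2.0, 200_000
xs = [-math.log(1.0 - random.random()) / lam for _ in range(n)]

print(sum(xs) / n)  # ≈ E[X] = 1/lam = 0.5

# Memorylessness: Pr{X > s + t | X > t} ≈ Pr{X > s} = exp(-lam*s).
s, t = 0.3, 0.7
tail = [x for x in xs if x > t]
lhs = sum(x > s + t for x in tail) / len(tail)
rhs = sum(x > s for x in xs) / n
print(lhs, rhs, math.exp(-lam * s))  # all ≈ 0.5488
```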

6. JOINTLY DISTRIBUTED RANDOM VARIABLES


A random variable is a single-variable function S → R. Given two random variables X and Y, we obtain a vector-valued function (X, Y): S → R².

6.1. Joint Distribution. The joint cumulative probability distribution function of the pair of random variables X and Y is defined by
F(x, y) = Pr{X ≤ x, Y ≤ y},   −∞ < x, y < ∞.

All probabilities regarding the pair can be obtained from F . To find the individual proba-
bility distribution functions of X and Y , use

FX(x) = lim_{y→∞} F(x, y),   FY(y) = lim_{x→∞} F(x, y).

If X and Y are both discrete random variables, then their joint probability mass function
is defined by
p(i, j) = Pr{X = i, Y = j}.
The individual mass functions are
Pr{X = i} = Σ_j p(i, j),   Pr{Y = j} = Σ_i p(i, j).

If we list p(i, j) as a table, then the above mass functions are called marginal PMF.
The random variables X and Y are said to be jointly continuous if there is a function f(x, y), called the joint probability density function, such that for any two-dimensional set C,
Pr{(X, Y) ∈ C} = ∫∫_C f(x, y) dx dy.

If X and Y are jointly continuous, then they are individually continuous with density functions
fX(x) = ∫_{−∞}^{∞} f(x, y) dy,   fY(y) = ∫_{−∞}^{∞} f(x, y) dx.

6.2. Summation of Independent Random Variables. The random variables X and Y


are independent if, for all sets A and B,
Pr{X ∈ A, Y ∈ B} = Pr{X ∈ A} Pr{Y ∈ B}.
For independent random variables, the joint distribution function, the joint PMF, and the joint density function factorize as
F (x, y) = FX (x)FY (y)
p(x, y) = pX (x)pY (y)
f (x, y) = fX (x)fY (y).
If X and Y are independent continuous random variables, then the distribution function of their sum can be obtained from the identity
F_{X+Y}(a) = ∫_{−∞}^{∞} FX(a − y) fY(y) dy.
The density function is the convolution
fX+Y = fX ∗ fY .
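The convolution identity can be checked numerically (an illustrative Python sketch, not from the original text): for X, Y ∼ Expo(1), the convolution f_X ∗ f_Y should agree with the Γ(2, 1) density a e^{−a}.

```python
import math

f = lambda x: math.exp(-x) if x >= 0 else 0.0  # Expo(1) density
h = 0.001                                      # grid spacing

def conv(a):
    # (f * f)(a) = ∫ f(a - y) f(y) dy, approximated by a Riemann sum on [0, a]
    return sum(f(a - k * h) * f(k * h) * h for k in range(int(a / h) + 1))

for a in (0.5, 1.0, 2.0):
    print(a, round(conv(a), 4), round(a * math.exp(-a), 4))  # nearly equal
```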
Examples of sums of independent random variables. We assume X1 , X2 , . . . , Xn are
independent random variables of the same type.
• Xi ∼ N(µi, σi²):  Σ_{i=1}^{n} Xi ∼ N( Σ_{i=1}^{n} µi, Σ_{i=1}^{n} σi² ).
• Xi is a Poisson RV with parameter λi:  Σ_{i=1}^{n} Xi is Poisson with parameter Σ_{i=1}^{n} λi.
• Xi is a binomial RV with parameters (ni, p):  Σ_{i=1}^{n} Xi is binomial with parameters ( Σ_{i=1}^{n} ni, p ).
• Xi has the Gamma distribution with parameters (ti, λ):  Σ_{i=1}^{n} Xi has the Gamma distribution with parameters ( Σ_{i=1}^{n} ti, λ ).

7. PROPERTIES OF EXPECTATION AND VARIANCE


7.1. Expectation of Sums of Random Variables. If X and Y have a joint probability mass function p(x, y) or a joint density function f(x, y), then
E[g(X, Y)] = Σ_y Σ_x g(x, y) p(x, y)   or   E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy,
respectively.
It can be proved by simply switching the order of integration. A consequence of the pre-
ceding equations is that
E[X + Y ] = E[X] + E[Y ],

which generalizes to
(5) E[X1 + · · · + Xn ] = E[X1 ] + · · · + E[Xn ].
By definition E[cX] = cE[X]. Therefore E[·] is linear.
Question: How about the expectation of the product of random variables? Do we have
(6) E[XY ] = E[X]E[Y ]?
The answer is NO in general and YES if X and Y are independent. More generally, if X and Y are independent, then, for any functions h and g,
(7) E[g(X)h(Y )] = E[g(X)]E[h(Y )].
This fact can be easily proved by separation of variables in the corresponding integral, since the joint density function of independent variables factorizes.
Using (5), we can decompose a random variable into a sum of simpler random variables. For example, the binomial random variable can be decomposed into a sum of Bernoulli random variables.
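A small simulation (a Python sketch, not from the original text; the parameters are arbitrary) illustrates this decomposition: writing X = I1 + · · · + In with Ii ∼ Bern(p), linearity gives E[X] = np.

```python
import random

random.seed(1)
n, p, trials = 20, 0.3, 100_000

def binomial_sample():
    # Sum of n independent Bernoulli(p) indicators.
    return sum(random.random() < p for _ in range(n))

empirical_mean = sum(binomial_sample() for _ in range(trials)) / trials
print(empirical_mean, n * p)  # ≈ 6.0 and 6.0
```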
Two important and interesting examples:
• a random walk in the plane;
• complexity of the quick-sort algorithm.
7.2. Moments of the Number of Events that Occur. Express a random variable as a
combination of indicator random variables. For an event A, the indicator random variable
IA = 1 if A occurs and 0 otherwise. Then IA ∼ Bern(p) and E[IA ] = Pr(A). For two
events A and B, we have the properties
• IA IB = IA∩B ;
• IA∪B = IA + IB − IA IB .
For given events A1, . . . , An, let X be the number of these events that occur. If we introduce an indicator variable Ii for the event Ai, then
X = Σ_{i=1}^{n} Ii,
and consequently
E[X] = Σ_{i=1}^{n} E[Ii] = Σ_{i=1}^{n} Pr(Ai).
Now suppose we are interested in the number of pairs of events that occur. Then
C(X, 2) = Σ_{i<j} Ii Ij.
Taking expectations yields
E[C(X, 2)] = Σ_{i<j} E[Ii Ij] = Σ_{i<j} Pr(Ai Aj),
giving that
E[X²] − E[X] = 2 Σ_{i<j} Pr(Ai Aj).
Moreover, by considering the number of distinct subsets of k events that all occur, we have
E[C(X, k)] = Σ_{i1<i2<···<ik} E[Ii1 Ii2 · · · Iik] = Σ_{i1<i2<···<ik} Pr(Ai1 Ai2 · · · Aik).
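These identities are easy to check by simulation (a Python sketch, not from the original text): take Ai = ‘the i-th of n fair coin flips is heads’, so that X is the number of heads.

```python
import random
from math import comb

random.seed(2)
n, trials = 5, 200_000

xs = [sum(random.random() < 0.5 for _ in range(n)) for _ in range(trials)]

mean_X = sum(xs) / trials                          # ≈ sum_i Pr(A_i) = n/2
mean_pairs = sum(comb(x, 2) for x in xs) / trials  # ≈ sum_{i<j} Pr(A_i ∩ A_j)

print(mean_X, n * 0.5)             # ≈ 2.5, 2.5
print(mean_pairs, comb(n, 2) / 4)  # ≈ 2.5, 2.5
```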

7.3. Covariance, Variance of Sums, and Correlations. The covariance between X and
Y , denoted by Cov(X, Y ), is defined by

Cov(X, Y ) = E[(X − µX )(Y − µY )] = E[XY ] − E[X] E[Y ].

Cov(·, ·) is a symmetric bilinear functional and Cov(X, X) = Var(X) ≥ 0. That is, Cov(·, ·) defines an inner product on a quotient space of the space of random variables.
Let us identify this subspace. First of all, we restrict to the subspace of random variables
with finite second moment. Second, we identify two random variables if they differ by a
constant. The obtained quotient space is isomorphic to the subspace of random variables
with finite second moment and mean zero; on that subspace, the covariance is exactly the
L2 inner product of real-valued functions.
The correlation of two random variables X and Y, denoted by ρ(X, Y), is defined by
ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).

The correlation coefficient is a measure of the degree of linearity between X and Y . A


value of ρ(X, Y ) near +1 or −1 indicates a high degree of linearity between X and Y ,
whereas a value near 0 indicates that such linearity is absent. If ρ(X, Y ) = 0, then X and Y
are said to be uncorrelated. From the inner product point of view, ρ(X, Y) = cos θ(X, Y) encodes the angle between X and Y, and being uncorrelated is equivalent to orthogonality.
By the definition, we have a precise characterization of the question (6):

E[XY ] = E[X]E[Y ] ⇐⇒ X, Y are uncorrelated.

If X and Y are independent, then Cov(X, Y ) = 0. However, the converse is not


true. That is, two random variables can be dependent but uncorrelated. An example: X ∼ N(0, 1) and Y = X²; then X and Y are dependent but uncorrelated. When X and Y are jointly normal, the converse is true; namely, jointly normal random variables are uncorrelated iff they are independent.
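This example is easy to examine numerically (a Python sketch, not from the original text; statistics.correlation requires a recent Python version): the sample correlation of X and Y = X² is close to zero, yet conditioning on |X| clearly changes the distribution of Y.

```python
import random
import statistics

random.seed(3)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
ys = [x * x for x in xs]

print(statistics.correlation(xs, ys))  # ≈ 0: uncorrelated

# Dependence: the conditional mean of Y changes when we condition on |X| > 1.
big = [y for x, y in zip(xs, ys) if abs(x) > 1]
print(statistics.mean(ys), statistics.mean(big))  # ≈ 1.0 vs ≈ 2.5
```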
Writing
Var( Σ_{i=1}^{n} Xi ) = Cov( Σ_{i=1}^{n} Xi, Σ_{j=1}^{n} Xj ),
we get a formula for the variance of a sum of random variables:
Var( Σ_{i=1}^{n} Xi ) = Σ_{i=1}^{n} Var(Xi) + 2 Σ_{i<j} Cov(Xi, Xj).
In particular, if the Xi are independent, we can exchange the sum and Var, i.e.,
Var( Σ_{i=1}^{n} Xi ) = Σ_{i=1}^{n} Var(Xi),   when the Xi are independent.

Note that Var(·) is not linear but scales quadratically. By definition,
Var(cX) = Cov(cX, cX) = c² Cov(X, X) = c² Var(X).



7.4. Moment Generating Functions. The moment generating function M (t) of the ran-
dom variable X is defined for t in some open interval containing 0 as
M (t) = E[etX ].
We call M (t) the moment generating function because all of the moments of X can be
obtained by successively differentiating M (t) and then evaluating the result at t = 0.
Specifically, we have
M (n) (0) = E[X n ] n ≥ 1.
When X and Y are independent, using (7), we know
(8) MX+Y (t) = MX (t)MY (t).
In particular,
M_{aX+b}(t) = E[e^{taX} e^{tb}] = e^{tb} E[e^{taX}] = e^{bt} MX(at).

For the standardized version Z = (X − µ)/σ, we have the relation


MX (t) = eµt MZ (σt).
An important result is that the moment generating function uniquely determines the dis-
tribution. This result leads to a simple proof that the sum of independent normal (Poisson,
gamma) random variables remains a normal (Poisson, gamma) random variable by using
(8) to compute the corresponding moment generating function.
Moments describe the shape of a distribution. Recall that ∫_R f(x) dx = 1, which heuristically means f(x) decays faster than 1/x as |x| → ∞. Similarly, if the (2k)-th moment is finite, then heuristically f(x) decays faster than 1/x^{2k+1}.
7.5. The Sample Mean and Variance. Let X1 , . . . , Xn be i.i.d (independent and identi-
cally distributed) random variables having distribution function F and expected value µ.
Such a sequence of random variables is said to constitute a sample from the distribution F .
The quantity
X̄ = Σ_{i=1}^{n} Xi / n
is called the sample mean. By the linearity of expectation, E[X̄] = µ. When the distribution mean µ is unknown, the sample mean is often used in statistics to estimate it.
The quantities Xi − X̄, i = 1, . . . , n, are called deviations: the differences between the individual data and the sample mean. The random variable
S² = Σ_{i=1}^{n} (Xi − X̄)² / (n − 1)
is called the sample variance. Then
Var(X̄) = σ²/n,   E[S²] = σ².
The sample mean and a deviation from the sample mean are uncorrelated, i.e.,
Cov(Xi − X̄, X̄) = 0.
Although X̄ and the deviation Xi − X̄ are uncorrelated, they are not, in general, independent. A special exception is when the Xi are normal random variables; in that case the sample mean and the sample variance are in fact independent.
In the normal case the sample mean X̄ is a normal random variable with mean µ and variance σ²/n, and the random variable (n − 1)S²/σ² = Σ_{i=1}^{n} (Xi − X̄)²/σ² is a chi-squared random variable with n − 1 degrees of freedom, which explains why the denominator in the definition of S² is n − 1 rather than n.
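A quick simulation (a Python sketch, not from the original text; it uses non-normal Expo(1) data to emphasize that E[S²] = σ² does not require normality) illustrates Var(X̄) = σ²/n and the unbiasedness of S²:

```python
import random
import statistics

random.seed(4)
n, reps = 10, 50_000
sigma2 = 1.0  # variance of Expo(1)

means, s2s = [], []
for _ in range(reps):
    sample = [random.expovariate(1.0) for _ in range(n)]
    means.append(statistics.mean(sample))
    s2s.append(statistics.variance(sample))  # sample variance with n - 1

print(statistics.variance(means), sigma2 / n)  # ≈ 0.1
print(statistics.mean(s2s), sigma2)            # ≈ 1.0
```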

8. LIMIT THEOREMS
The most important theoretical results in probability theory are limit theorems. Of these, the most important are:
• Laws of large numbers. The average of a sequence of random variables converges (in a certain topology) to the expected average.
• Central limit theorems. The sum of a large number of random variables has a probability distribution that is approximately normal.

8.1. Tail bound. Taking the expectation of the inequality
χ({X ≥ a}) ≤ X/a,
valid for a nonnegative random variable X and a > 0, we obtain Markov’s inequality.
Markov’s inequality. Let X be a non-negative random variable, i.e., X ≥ 0. Then, for any value a > 0,
Pr{X ≥ a} ≤ E[X]/a.
Chebyshev’s inequality. If X is a random variable with finite mean µ and variance σ², then, for any value a > 0,
Pr{|X − µ| ≥ a} ≤ σ²/a².
Just apply Markov’s inequality to the non-negative RV (X − µ)².
The importance of Markov’s and Chebyshev’s inequalities is that they enable us to derive bounds on probabilities when only the mean, or both the mean and the variance, of the probability distribution are known.
If we know more moments of X, we can obtain more effective bounds. For example, if r is a nonnegative even integer, then applying Markov’s inequality to X^r gives
(9)    Pr{|X| ≥ a} = Pr{X^r ≥ a^r} ≤ E[X^r]/a^r,
a bound that falls off as 1/a^r. The larger r is, the faster the decay, provided a bound on E[X^r] is available. If we write F̄(a) := Pr{X > a} = 1 − Pr{X ≤ a} = 1 − F(a), then the bound (9) tells how fast the function F̄ decays. Heuristically, a finite moment E[X^r] means the PDF f decays faster than 1/x^{r+1} and F̄ decays like 1/x^r.
On the other hand, as Markov’s and Chebyshev’s inequalities are valid for all distributions, we cannot expect the bound on the probability to be very close to the actual probability in most cases.
Indeed, Markov’s inequality is useless when a ≤ µ, and Chebyshev’s inequality is useless when a ≤ σ, since the upper bound is then greater than or equal to one. We can improve the bound to be strictly less than one.
One-sided Chebyshev inequality. If E[X] = µ and Var(X) = σ², then, for any a > 0,
Pr{X ≥ µ + a} ≤ σ²/(σ² + a²),
Pr{X ≤ µ − a} ≤ σ²/(σ² + a²).

Proof. Assume first that µ = E[X] = 0 (otherwise apply the argument to X − µ). Let b > 0 and note that X ≥ a is equivalent to X + b ≥ a + b. Hence
Pr{X ≥ a} = Pr{X + b ≥ a + b} ≤ Pr{(X + b)² ≥ (a + b)²}.
Upon applying Markov’s inequality, the preceding yields that
Pr{X ≥ a} ≤ E[(X + b)²]/(a + b)² = (σ² + b²)/(a + b)².
Letting b = σ²/a, which minimizes the upper bound, gives the desired result. □
When the moment generating function is available (all moments are finite), we have the
Chernoff bound which usually implies exponential decay of the tail.
Chernoff bounds.
Pr{X ≥ a} ≤ e−ta M (t) for all t > 0,
Pr{X ≤ a} ≤ e−ta M (t) for all t < 0.
We can obtain the best bound by using the t that minimizes the upper bound e−ta M (t).
Example 8.1 (Chernoff bounds for the standard normal distribution). Let X ∼ N(0, 1) be a standard normal random variable. Then M(t) = e^{t²/2}, so the Chernoff bound is given by
Pr{X ≥ a} ≤ e^{−ta} e^{t²/2}   for all t > 0.
The minimum is achieved at t = a, which gives the exponentially decaying tail bound
(10)    Pr{X ≥ a} ≤ e^{−a²/2}   for all a > 0.
Similarly, we get the tail bound on the left:
Pr{X ≤ a} ≤ e^{−a²/2}   for all a < 0.
For X ∼ N(0, σ²), the tail bound becomes
(11)    Pr{X ≥ a} ≤ e^{−a²/(2σ²)}   for all a > 0.
The smaller σ is, the faster the decay is.
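The relative strength of these bounds can be seen numerically (a Python sketch, not from the original text). For X ∼ N(0, 1), Markov’s inequality applied to |X| (with E|X| = √(2/π)) and Chebyshev’s inequality bound the two-sided tail Pr{|X| ≥ a}, while the Chernoff bound (10) and the exact value refer to the one-sided tail Pr{X ≥ a}:

```python
import math

def normal_tail(a):
    # Exact one-sided tail Pr{X >= a} for X ~ N(0, 1).
    return 0.5 * math.erfc(a / math.sqrt(2))

E_absX = math.sqrt(2 / math.pi)
for a in (1.0, 2.0, 3.0):
    markov = E_absX / a             # bounds Pr{|X| >= a}
    chebyshev = 1.0 / a**2          # bounds Pr{|X| >= a}
    chernoff = math.exp(-a**2 / 2)  # bounds Pr{X >= a}
    print(a, round(markov, 4), round(chebyshev, 4),
          round(chernoff, 4), round(normal_tail(a), 4))
```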
8.2. The Central Limit Theorem. The central limit theorem is one of the most remarkable results in probability theory. Loosely put, it states that the sum of a large number of independent random variables has a distribution that is approximately normal.
Theorem 8.2 (The central limit theorem). Let X1 , X2 , . . . be a sequence of independent
and identically distributed random variables, each having finite mean µ and variance σ 2 .
Then the distribution of
( Σ_{i=1}^{n} Xi − nµ ) / (σ √n)
tends to the standard normal as n → ∞.
Equivalently, X̄ is approximately N(µ, σ²/n) for large n; here recall that the sample mean is X̄ = Σ_{i=1}^{n} Xi/n. The variance of X̄ scales like O(1/n) and approaches zero, which implies the weak law of large numbers by Chebyshev’s inequality.
The central limit result can be extended to independent but not necessarily identically distributed random variables.
Theorem 8.3 (Central limit theorem for independent random variables). Let X1 , X2 , . . .
be a sequence of independent random variables having finite mean µi and variance σi2 . If
(1) the Xi are uniformly bounded, and
(2) Σ_{i=1}^{∞} σi² = ∞,
then
Pr{ Σ_{i=1}^{n} (Xi − µi) / √( Σ_{i=1}^{n} σi² ) ≤ a } → Φ(a)   as n → ∞.

The application of the central limit theorem to show that measurement errors are approx-
imately normally distributed is regarded as an important contribution to science. Indeed,
in the 17th and 18th centuries the central limit theorem was often called the law of fre-
quency of errors. The central limit theorem was originally stated and proven by the French
mathematician Pierre-Simon, Marquis de Laplace.
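The theorem is easy to observe in simulation (a Python sketch, not from the original text; Unif(0, 1) summands and the sample sizes are arbitrary choices): standardized sums are approximately standard normal.

```python
import random
import statistics

random.seed(5)
n, reps = 30, 50_000
mu, sigma = 0.5, (1 / 12) ** 0.5  # mean and standard deviation of Unif(0, 1)

zs = [(sum(random.random() for _ in range(n)) - n * mu) / (sigma * n**0.5)
      for _ in range(reps)]

print(statistics.mean(zs), statistics.stdev(zs))  # ≈ 0 and ≈ 1
print(sum(z <= 1.0 for z in zs) / reps)           # ≈ Φ(1) ≈ 0.8413
```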
8.3. Law of Large Numbers.
Theorem 8.4 (The weak law of large numbers). Let X1 , X2 , . . . be a sequence of i.i.d.
random variables, each having finite mean E[Xi] = µ. Then, for any ε > 0,
Pr{ |(1/n) Σ_{i=1}^{n} Xi − µ| ≥ ε } → 0   as n → ∞.
Proof. Let X̄ = Σ_{i=1}^{n} Xi/n and assume additionally that Var(Xi) = σ² is finite. By the linearity of expectation, E[X̄] = µ. Since the Xi are i.i.d., the variance is additive over independent variables and scales quadratically under constant scaling, so Var(X̄) = σ²/n. Apply Chebyshev’s inequality to get
Pr{ |(1/n) Σ_{i=1}^{n} Xi − µ| ≥ ε } ≤ σ²/(n ε²) → 0   as n → ∞.

Theorem 8.5 (The strong law of large numbers). Let X1 , X2 , . . . be a sequence of inde-
pendent and identically distributed random variables, each having finite mean E[Xi] = µ. Then, with probability 1, (1/n) Σ_{i=1}^{n} Xi → µ as n → ∞, i.e.,
Pr{ lim_{n→∞} (1/n) Σ_{i=1}^{n} Xi = µ } = 1.

What is the difference between the weak and the strong laws of large numbers? The weak law only guarantees that, for each fixed large n, the average is likely to be near µ, while the strong law guarantees that, with probability 1, the average stays near µ for all sufficiently large n.
As an application of the strong law of large numbers, suppose that a sequence of inde-
pendent trials of some experiment is performed. Let E be a fixed event of the experiment,
and denote by Pr(E) the probability that E occurs on any particular trial. Let Xi be the in-
dicator random variable of E on the ith trial. We have, by the strong law of large numbers,
that with probability 1,
(1/n) Σ_{i=1}^{n} Xi → E[Xi] = Pr(E).
Therefore, if we accept the interpretation that “with probability 1” means “with certainty,”
we obtain the theoretical justification for the long-run relative frequency interpretation of
probabilities.
The weak law of large numbers was originally proven by James Bernoulli for the special
case where the Xi are Bernoulli random variables. The general form of the weak law of
large numbers was proved by the Russian mathematician Khintchine.

The strong law of large numbers was originally proven, in the special case of Bernoulli
random variables, by the French mathematician Borel. The general form of the strong law
was proven by the Russian mathematician A. N. Kolmogorov.
(To do: include a sketch of the proofs.)

APPENDIX
Formulae

e^x = Σ_{n=0}^{∞} x^n/n! = lim_{n→∞} (1 + x/n)^n,
Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx,
β(a, b) = ∫_0^1 x^{a−1} (1 − x)^{b−1} dx = Γ(a)Γ(b)/Γ(a + b),
1 + 1/2 + 1/3 + · · · + 1/n ≈ ln n + 0.577,
n! ≈ √(2πn) (n/e)^n.

TABLE 3. Table of Distributions

Distribution   PMF/PDF                                              Mean        Variance      MGF
Binomial       Pr{X = k} = C(n, k) p^k (1 − p)^{n−k}                np          np(1 − p)     (1 − p + p e^t)^n
Poisson        Pr{X = k} = e^{−λ} λ^k / k!                          λ           λ             e^{λ(e^t − 1)}
Uniform        f(x) = 1/(b − a)                                     (a + b)/2   (b − a)²/12   (e^{tb} − e^{ta}) / (t(b − a))
Normal         f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}                µ           σ²            e^{tµ + σ²t²/2}
Gamma          f(x) = λ e^{−λx} (λx)^{α−1} / Γ(α), x ≥ 0            α/λ         α/λ²          (λ/(λ − t))^α, t < λ
Chi-square     f(x) = x^{n/2−1} e^{−x/2} / (2^{n/2} Γ(n/2)), x > 0  n           2n            (1 − 2t)^{−n/2}, t < 1/2
