Probability
LONG CHEN
ABSTRACT. This is an outline of the book ‘A First Course in Probability’, which can serve
as a minimal introduction to probability theory.
CONTENTS
1. Combinatorial Analysis
2. Axioms of Probability
3. Conditional Probability and Bayes’ Formulae
   Bayes’s Formulae
4. Random Variables
   Important Discrete Random Variables
5. Continuous Random Variables
   Important Continuous Random Variables
6. Jointly Distributed Random Variables
6.1. Joint Distribution
6.2. Summation of Independent Random Variables
7. Properties of Expectation and Variance
7.1. Expectation of Sums of Random Variables
7.2. Moments of the Number of Events that Occur
7.3. Covariance, Variance of Sums, and Correlations
7.4. Moment Generating Functions
7.5. The Sample Mean and Variance
8. Limit Theorems
8.1. Tail Bound
8.2. The Central Limit Theorem
8.3. Law of Large Numbers
Appendix
1. COMBINATORIAL ANALYSIS
The following sampling table presents the number of possible samples of size k out of
a population of size n, under various assumptions about how the sample is collected.
TABLE 1. Sampling Table

                 Order                No Order
With Rep         $n^k$                $\binom{n+k-1}{k}$
Without Rep      $k!\binom{n}{k}$     $\binom{n}{k}$

Here ‘Rep’ stands for ‘Replacement’ or ‘Repetition’, meaning that the sample can
contain duplicate items.
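The four entries of Table 1 can be checked numerically. A minimal sketch (the values n = 5, k = 3 are illustrative):

```python
from math import comb, factorial

# Number of size-k samples out of a population of size n,
# one expression per cell of Table 1 (illustrative n = 5, k = 3).
n, k = 5, 3

ordered_with_rep = n ** k                        # order, with rep: n^k
ordered_without_rep = factorial(k) * comb(n, k)  # order, without rep: k! C(n,k) = n!/(n-k)!
unordered_with_rep = comb(n + k - 1, k)          # no order, with rep: C(n+k-1, k)
unordered_without_rep = comb(n, k)               # no order, without rep: C(n, k)

print(ordered_with_rep, ordered_without_rep,
      unordered_with_rep, unordered_without_rep)  # 125 60 35 10
```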
2. AXIOMS OF PROBABILITY
Let S denote the set of all possible outcomes of an experiment. S is called the sample
space of the experiment. An event is a subset of S.
An intuitive way of defining the probability is as follows. For each event E of the
sample space S, we define n(E) to be the number of outcomes favorable to E in the first
n experiments. Then the naive definition of the probability of the event E is
$$\Pr(E) = \lim_{n\to\infty}\frac{n(E)}{n}.$$
That is, Pr(E) can be interpreted as a long-run relative frequency. This definition possesses
a serious drawback: how do we know the limit exists? We have to believe (assume) that it does!
Instead, we can accept the following axioms of probability as the definition of a proba-
bility:
(A1) $0 \le \Pr(E) \le 1$;
(A2) $\Pr(S) = 1$;
(A3) for any sequence of mutually exclusive events $E_1, E_2, \dots$,
$$\Pr\Big(\bigcup_{i=1}^{\infty}E_i\Big) = \sum_{i=1}^{\infty}\Pr(E_i).$$
The inclusion-exclusion formula for two events is
Pr(E ∪ F ) = Pr(E) + Pr(F ) − Pr(E ∩ F )
which can be generalized to n events and will be proved later:
$$\Pr\Big(\bigcup_{i=1}^{n}E_i\Big) = \sum_{i=1}^{n}\Pr(E_i) - \sum_{i<j}\Pr(E_i\cap E_j)
+ \sum_{i<j<k}\Pr(E_i\cap E_j\cap E_k) - \cdots + (-1)^{n+1}\Pr\Big(\bigcap_{i=1}^{n}E_i\Big).$$
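The inclusion-exclusion formula can be verified by brute force on a small sample space. A sketch with three illustrative events on one roll of a fair die:

```python
from itertools import combinations
from fractions import Fraction

# Verify inclusion-exclusion for three events on the sample space of a
# fair die roll. The three events are made up for illustration.
S = {1, 2, 3, 4, 5, 6}
E = [{1, 2, 3}, {2, 4, 6}, {3, 6}]

def pr(A):
    # Equally likely outcomes: Pr(A) = |A| / |S|.
    return Fraction(len(A), len(S))

# Left-hand side: probability of the union, computed directly.
lhs = pr(set().union(*E))

# Right-hand side: alternating sum over all nonempty subcollections.
rhs = Fraction(0)
for r in range(1, len(E) + 1):
    for idx in combinations(range(len(E)), r):
        inter = set(S)
        for i in idx:
            inter &= E[i]
        rhs += (-1) ** (r + 1) * pr(inter)

print(lhs, rhs)
```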
3. CONDITIONAL PROBABILITY AND BAYES’ FORMULAE

Bayes’s Formulae. Let H denote the hypothesis and E the evidence. Bayes’s formula reads
$$\Pr(H\mid E) = \frac{\Pr(H\cap E)}{\Pr(E)}
= \frac{\Pr(E\mid H)\Pr(H)}{\Pr(E\mid H)\Pr(H) + \Pr(E\mid H^c)\Pr(H^c)}.$$
The order of conditioning is switched because, in practice, the information known is
Pr(E|H), not Pr(H|E).
For a set of mutually exclusive and exhaustive events (a partition of the sample space
S) F1 , . . . , Fn , the formula is
$$\Pr(F_j\mid E) = \frac{\Pr(E\cap F_j)}{\Pr(E)}
= \frac{\Pr(E\mid F_j)\Pr(F_j)}{\sum_{i=1}^{n}\Pr(E\mid F_i)\Pr(F_i)}.$$
If the events Fi , i = 1, . . . , n, are competing hypotheses, then Bayes’s formula shows how
to compute the conditional probabilities of these hypotheses when additional evidence E
becomes available.
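Bayes’s formula with a partition can be sketched numerically. The setting below (three hypothetical machines with made-up production shares and defect rates, evidence E = ‘a randomly chosen item is defective’) is illustrative, not from the text:

```python
# Partition F1, F2, F3: which machine produced the item (hypothetical
# numbers). Evidence E: the item is defective.
prior = [0.5, 0.3, 0.2]            # Pr(F_i): share of production
defect_rate = [0.01, 0.02, 0.05]   # Pr(E | F_i)

# Law of total probability: Pr(E) = sum_i Pr(E|F_i) Pr(F_i).
pr_e = sum(p * d for p, d in zip(prior, defect_rate))

# Bayes: Pr(F_j | E) = Pr(E|F_j) Pr(F_j) / Pr(E).
posterior = [p * d / pr_e for p, d in zip(prior, defect_rate)]

print(pr_e, posterior)
```

Note how the rare but defect-prone machine F3 goes from prior 0.2 to the most likely explanation once the defect is observed.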
Example 3.1. The color of a person’s eyes is determined by a single pair of genes. If they
are both blue-eyed genes, then the person will have blue eyes; if they are both brown-eyed
genes, then the person will have brown eyes; and if one of them is a blue-eyed gene and
the other a brown-eyed gene, then the person will have brown eyes. (Because of the latter
fact, we say that the brown-eyed gene is dominant over the blue-eyed one.) A newborn
child independently receives one eye gene from each of its parents, and the gene it receives
from a parent is equally likely to be either of the two eye genes of that parent. Suppose
that Smith and both of his parents have brown eyes, but Smith’s sister has blue eyes.
(1) Suppose that Smith’s wife has blue eyes. What is the probability that their first
child will have brown eyes?
(2) If their first child has brown eyes, what is the probability that their next child will
also have brown eyes?
The answers are 2/3 and 3/4. Let the three events be F1 = (br, br), F2 = (br, bl), F3 =
(bl, br), the possible gene pairs of Smith. The second question differs from the first because
of Bayes’s formula: after the evidence E = ‘their first child has brown eyes’, the probability
Pr(Fj |E) is changed. The conditional probability of A = ‘their next child will also have
brown eyes’ given E can be computed by the law of total probability
$$\Pr(A\mid E) = \sum_{i=1}^{3}\Pr(A\mid F_i\cap E)\,\Pr(F_i\mid E).$$
Note that $\Pr(A) = \sum_{i=1}^{3}\Pr(A\mid F_i)\Pr(F_i) = 2/3$. The new evidence E increases the
probability of having brown eyes since E is in favor of that fact.
The odds of an event A are defined by
$$\frac{\Pr(A)}{\Pr(A^c)} = \frac{\Pr(A)}{1-\Pr(A)}.$$
If the odds are equal to α, then it is common to say that the odds are ‘α to 1’ in favor of
the hypothesis. For example, if Pr(A) = 2/3, then the odds are 2.
Consider now a hypothesis H and suppose new evidence E is introduced. The new odds after
the evidence E has been introduced are
(2) $$\frac{\Pr(H\mid E)}{\Pr(H^c\mid E)} = \frac{\Pr(H)}{\Pr(H^c)}\cdot\frac{\Pr(E\mid H)}{\Pr(E\mid H^c)}.$$
The posterior odds of H are the likelihood ratio times the prior odds.
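Equation (2) can be cross-checked against the plain form of Bayes’s formula. A small numeric sketch (all probabilities are made up for illustration):

```python
# Odds form of Bayes's rule: posterior odds = likelihood ratio * prior odds.
pr_h = 0.25                 # prior Pr(H)
pr_e_given_h = 0.8          # Pr(E | H)
pr_e_given_hc = 0.1         # Pr(E | H^c)

prior_odds = pr_h / (1 - pr_h)
likelihood_ratio = pr_e_given_h / pr_e_given_hc
posterior_odds = likelihood_ratio * prior_odds

# Cross-check against the plain Bayes computation of Pr(H | E).
pr_e = pr_e_given_h * pr_h + pr_e_given_hc * (1 - pr_h)
pr_h_given_e = pr_e_given_h * pr_h / pr_e
assert abs(posterior_odds - pr_h_given_e / (1 - pr_h_given_e)) < 1e-12

print(posterior_odds)
```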
4. RANDOM VARIABLES
A real-valued function defined on the sample space is called a random variable (RV),
i.e., X : S → R. The event {X ≤ x} is a subset of S.
Two classical functions are associated with a random variable. The cumulative
distribution function (CDF) $F:\mathbb{R}\to[0,1]$ is defined as
$$F(x) = \Pr\{X\le x\},$$
which is an increasing and right-continuous function, and the density function (for a
continuous RV) is
$$f(x) = F'(x).$$
All probabilities concerning X can be stated in terms of F .
FIGURE 1. PMF and CDF functions of a discrete random variable.
By the linearity of expectation,
$$E\Big[\sum_{i=1}^{n}X_i\Big] = \sum_{i=1}^{n}E[X_i].$$
It can be proved using another equivalent definition of expectation
$$E[X] = \sum_{s\in S}X(s)\,p(s).$$

Important Discrete Random Variables.
Consider in total n trials with r successful trials, where each trial results in a success with
probability p; denote this by the triple (n, r, p). If n is fixed, r is the random variable,
called the binomial random variable.
5. CONTINUOUS RANDOM VARIABLES
FIGURE 2. PDF and CDF functions of a continuous random variable.

To find the probability that a continuous random variable takes on a value in an interval,
integrate the PDF over that interval:
$$F(b) - F(a) = \int_a^b f(x)\,dx.$$
Analogous to the discrete case, where one sums x times the PMF, the expected value of a
continuous random variable is obtained by integrating x times the PDF:
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.$$
For a nonnegative random variable X, the following formulae can be proved by interchang-
ing the order of integration
(3) $$E[X] = \int_0^{\infty}\Pr\{X>x\}\,dx = \int_0^{\infty}\!\int_x^{\infty}f(y)\,dy\,dx.$$
The Law of the Unconscious Statistician (LOTUS) is that, for any function g,
$$E[g(X)] = \int_{-\infty}^{\infty}g(x)f(x)\,dx,$$
which can be proved using (3) and again interchanging the order of integration.
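Formula (3) and LOTUS can be checked numerically for a concrete distribution. A sketch for Expo(λ) with λ = 2, using a crude trapezoidal rule (the truncation at x = 20 is an approximation, since the tail there is negligible):

```python
import math

# Check (3) for X ~ Expo(lam): Pr{X > x} = e^{-lam x}, so the tail
# integral should equal E[X] = 1/lam; LOTUS with g(x) = x^2 should
# give E[X^2] = 2/lam^2.
lam = 2.0

def tail(x):
    # Pr{X > x} for the exponential distribution.
    return math.exp(-lam * x)

def g_times_f(x):
    # g(x) f(x) with g(x) = x^2 and f the Expo(lam) density.
    return x * x * lam * math.exp(-lam * x)

def trapezoid(fn, b, n):
    # Simple trapezoidal rule on [0, b] with n subintervals.
    h = b / n
    return h * (0.5 * fn(0.0) + sum(fn(i * h) for i in range(1, n)) + 0.5 * fn(b))

integral = trapezoid(tail, 20.0, 200_000)      # should be ~ 1/lam = 0.5
lotus = trapezoid(g_times_f, 20.0, 200_000)    # should be ~ 2/lam^2 = 0.5

print(integral, lotus)
```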
The identity $\int_{\mathbb{R}} f = 1$ for the normal distribution can be proved as follows:
$$I^2 = \int_{\mathbb{R}} e^{-\frac12 x^2}\,dx \int_{\mathbb{R}} e^{-\frac12 y^2}\,dy
= \int_{\mathbb{R}^2} e^{-\frac12(x^2+y^2)}\,dx\,dy.$$
The last integral can be easily evaluated in polar coordinates.
• Expo(λ). Exponential random variable with parameter λ:
$$f(x) = \begin{cases}\lambda e^{-\lambda x}, & x\ge 0,\\ 0, & \text{otherwise.}\end{cases}$$
Its expected value and variance are, respectively,
$$E[X] = \frac{1}{\lambda}, \qquad \operatorname{Var}(X) = \frac{1}{\lambda^2}.$$
A key property possessed only by exponential random variables is that they are
memoryless, in the sense that, for positive s and t,
Pr{X > s + t|X > t} = Pr{X > s}.
If X represents the life of an item, then the memoryless property states that, for
any t, the remaining life of a t-year-old item has the same probability distribution
as the life of a new item. Thus, one need not remember the age of an item to
know its distribution of remaining life. In summary, a product with an Expo(λ)
lifetime is always ‘as good as new’.
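The memoryless property can be seen in simulation: the conditional tail probability matches the unconditional one. A sketch (the parameters λ, s, t are illustrative):

```python
import random

# Simulate the memoryless property of Expo(lam):
# Pr{X > s + t | X > t} should match Pr{X > s} = e^{-lam s}.
random.seed(1)
lam, s, t = 1.5, 0.4, 0.9
samples = [random.expovariate(lam) for _ in range(400_000)]

exceed_t = [x for x in samples if x > t]
cond = sum(x > s + t for x in exceed_t) / len(exceed_t)   # Pr{X > s+t | X > t}
plain = sum(x > s for x in samples) / len(samples)        # Pr{X > s}

print(cond, plain)
```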
• Γ(α, λ). Gamma distribution with parameters (α, λ):
$$f(x) = \begin{cases}\dfrac{\lambda e^{-\lambda x}(\lambda x)^{\alpha-1}}{\Gamma(\alpha)}, & x\ge 0,\\ 0, & \text{otherwise.}\end{cases}$$
• B(a, b). Beta distribution with parameters (a, b):
$$f(x) = \begin{cases}\dfrac{x^{a-1}(1-x)^{b-1}}{B(a,b)}, & 0<x<1,\\ 0, & \text{otherwise.}\end{cases}$$
6. JOINTLY DISTRIBUTED RANDOM VARIABLES
6.1. Joint Distribution. The joint cumulative probability distribution function of the
pair of random variables X and Y is defined by
$$F(a,b) = \Pr\{X\le a,\ Y\le b\}.$$
All probabilities regarding the pair can be obtained from F. To find the individual proba-
bility distribution functions of X and Y, use
$$F_X(a) = \lim_{b\to\infty}F(a,b), \qquad F_Y(b) = \lim_{a\to\infty}F(a,b).$$
If X and Y are both discrete random variables, then their joint probability mass function
is defined by
p(i, j) = Pr{X = i, Y = j}.
The individual mass functions are
$$\Pr\{X=i\} = \sum_j p(i,j), \qquad \Pr\{Y=j\} = \sum_i p(i,j).$$
If we list p(i, j) as a table, then the above mass functions are called marginal PMF.
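Computing marginals from a joint table is a direct translation of the two sums above. A sketch with a made-up joint PMF (rows i, columns j):

```python
# Marginal PMFs from a joint PMF table p(i, j). The table is illustrative.
p = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.25, (1, 1): 0.15,
    (2, 0): 0.05, (2, 1): 0.25,
}

# Pr{X = i} = sum_j p(i, j); Pr{Y = j} = sum_i p(i, j).
px = {}
py = {}
for (i, j), prob in p.items():
    px[i] = px.get(i, 0.0) + prob
    py[j] = py.get(j, 0.0) + prob

print(px, py)
```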
The random variables X and Y are said to be jointly continuous if there is a function
f (x, y), called the joint probability density function, such that for any two-dimensional set
C,
$$\Pr\{(X,Y)\in C\} = \iint_C f(x,y)\,dx\,dy.$$
If X and Y are jointly continuous, then they are individually continuous with density
functions
$$f_X(x) = \int_{-\infty}^{\infty}f(x,y)\,dy, \qquad f_Y(y) = \int_{-\infty}^{\infty}f(x,y)\,dx.$$
7. PROPERTIES OF EXPECTATION AND VARIANCE

7.1. Expectation of Sums of Random Variables. For any two random variables X and Y,
$E[X+Y] = E[X] + E[Y]$, which generalizes to
(5) E[X1 + · · · + Xn ] = E[X1 ] + · · · + E[Xn ].
By definition E[cX] = cE[X]. Therefore E[·] is linear.
Question: how about the expectation of the product of random variables? Do we have
(6) $E[XY] = E[X]E[Y]$?
The answer is NO in general and YES if X and Y are independent. More generally, if X and Y
are independent, then, for any functions h and g,
(7) $E[g(X)h(Y)] = E[g(X)]E[h(Y)]$.
This fact can be easily proved by separation of variables in the corresponding integral,
as the joint density function of independent variables is separable.
Using (5), we can decompose a random variable into a sum of simpler random
variables. For example, a binomial random variable can be decomposed into a sum of
Bernoulli random variables.
Two important and interesting examples:
• a random walk in the plane;
• complexity of the quick-sort algorithm.
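The Bernoulli decomposition can be demonstrated directly: summing n independent indicators with success probability p yields a binomial sample, and linearity gives E[X] = np. A simulation sketch:

```python
import random

# Bin(n, p) as a sum of n Bernoulli(p) indicators; by linearity of
# expectation, E[X] = n * p with no combinatorics needed.
random.seed(2)
n, p = 10, 0.3
trials = 100_000

total = 0
for _ in range(trials):
    x = sum(1 if random.random() < p else 0 for _ in range(n))  # X = sum of I_i
    total += x

mean = total / trials
print(mean)  # close to n * p = 3.0
```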
7.2. Moments of the Number of Events that Occur. Express a random variable as a
combination of indicator random variables. For an event A, the indicator random variable
IA = 1 if A occurs and 0 otherwise. Then IA ∼ Bern(p) and E[IA ] = Pr(A). For two
events A and B, we have the properties
• IA IB = IA∩B ;
• IA∪B = IA + IB − IA IB .
For given events A1 , . . . , An , let X be the number of these events that occur. If we
introduce an indicator variable Ii for event Ai , then
$$X = \sum_{i=1}^{n}I_i,$$
and consequently
$$E[X] = \sum_{i=1}^{n}E[I_i] = \sum_{i=1}^{n}\Pr(A_i).$$
Now suppose we are interested in the number of pairs of events that occur. Then
$$\binom{X}{2} = \sum_{i<j}I_iI_j,$$
giving that
$$E[X^2] - E[X] = 2\sum_{i<j}\Pr(A_iA_j).$$
Moreover, by considering the number of distinct subsets of k events that all occur, we have
$$E\left[\binom{X}{k}\right] = \sum_{i_1<i_2<\cdots<i_k}E[I_{i_1}I_{i_2}\cdots I_{i_k}]
= \sum_{i_1<i_2<\cdots<i_k}\Pr(A_{i_1}A_{i_2}\cdots A_{i_k}).$$
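The indicator method can be checked exactly on the classical matching problem, where Ai is the event that a uniformly random permutation fixes position i. A sketch by full enumeration for n = 6:

```python
from itertools import permutations
from fractions import Fraction

# Matching problem: X = number of fixed points of a random permutation
# of {0,...,n-1}. Indicator method predicts E[X] = 1 and
# E[X^2] - E[X] = 2 * C(n,2) * 1/(n(n-1)) = 1, i.e. E[X^2] = 2.
n = 6
perms = list(permutations(range(n)))

def fixed_points(p):
    return sum(p[i] == i for i in range(n))

e_x = Fraction(sum(fixed_points(p) for p in perms), len(perms))
e_x2 = Fraction(sum(fixed_points(p) ** 2 for p in perms), len(perms))

print(e_x, e_x2)
```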
7.3. Covariance, Variance of Sums, and Correlations. The covariance between X and
Y, denoted by Cov(X, Y), is defined by
$$\operatorname{Cov}(X,Y) = E\big[(X-E[X])(Y-E[Y])\big] = E[XY] - E[X]E[Y],$$
and the correlation of X and Y by
$$\rho(X,Y) = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}.$$
For a sum of random variables,
$$\operatorname{Var}\Big(\sum_{i=1}^{n}X_i\Big) = \sum_{i=1}^{n}\operatorname{Var}(X_i)
+ 2\sum_{i<j}\operatorname{Cov}(X_i,X_j).$$
In particular, if the $X_i$ are independent, we can exchange the sum and Var, i.e.,
$$\operatorname{Var}\Big(\sum_{i=1}^{n}X_i\Big) = \sum_{i=1}^{n}\operatorname{Var}(X_i),
\quad\text{when the } X_i \text{ are independent.}$$
Note that Var(·) is not linear but quadratic with respect to scaling: by definition,
$\operatorname{Var}(cX) = c^2\operatorname{Var}(X)$.
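The variance-of-sum identity can be verified exactly on a small joint PMF. A sketch with a made-up table in which X and Y are deliberately dependent, so the covariance term matters:

```python
from fractions import Fraction

# Check Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) on a small joint
# PMF (an illustrative table; X and Y are dependent here).
p = {(0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
     (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8)}

def e(f):
    # Expectation of f(X, Y) under the joint PMF.
    return sum(prob * f(x, y) for (x, y), prob in p.items())

ex, ey = e(lambda x, y: x), e(lambda x, y: y)
var_x = e(lambda x, y: x * x) - ex ** 2
var_y = e(lambda x, y: y * y) - ey ** 2
cov = e(lambda x, y: x * y) - ex * ey
var_sum = e(lambda x, y: (x + y) ** 2) - e(lambda x, y: x + y) ** 2

print(var_sum, var_x + var_y + 2 * cov)
```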
7.4. Moment Generating Functions. The moment generating function M (t) of the ran-
dom variable X is defined for t in some open interval containing 0 as
M (t) = E[etX ].
We call M (t) the moment generating function because all of the moments of X can be
obtained by successively differentiating M (t) and then evaluating the result at t = 0.
Specifically, we have
$$M^{(n)}(0) = E[X^n], \qquad n\ge 1.$$
When X and Y are independent, using (7), we know
(8) MX+Y (t) = MX (t)MY (t).
In particular,
$$M_{aX+b}(t) = E\big[e^{t(aX+b)}\big] = e^{tb}E\big[e^{taX}\big] = e^{bt}M_X(at).$$
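Property (8) can be checked numerically for a concrete case: two independent Bernoulli(p) variables sum to a Bin(2, p) variable, whose MGF should be the square of the Bernoulli MGF. A sketch:

```python
import math

# For independent X, Y ~ Bern(p), X + Y ~ Bin(2, p), so (8) says
# M_{X+Y}(t) = M_X(t) M_Y(t) = (1 - p + p e^t)^2.
p, t = 0.3, 0.7

m_bern = 1 - p + p * math.exp(t)          # MGF of Bernoulli(p) at t

# MGF of Bin(2, p) computed directly from its PMF.
pmf = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
m_bin = sum(pk * math.exp(t * k) for k, pk in enumerate(pmf))

print(m_bin, m_bern ** 2)
```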
8. LIMIT THEOREMS
The most important theoretical results in probability theory are limit theorems. Of
these, the most important are:
• Laws of large numbers: the average of a sequence of random variables converges
(in a certain topology) to the expected average.
• Central limit theorems: the sum of a large number of random variables has a proba-
bility distribution that is approximately normal.
8.2. The Central Limit Theorem. Let $X_1, X_2, \dots$ be a sequence of independent random
variables with means $\mu_i = E[X_i]$ and variances $\sigma_i^2 = \operatorname{Var}(X_i)$. If
(1) the $X_i$ are uniformly bounded;
(2) $\sum_{i=1}^{\infty}\sigma_i^2 = \infty$,
then the distribution of the standardized sum converges to the standard normal:
$$\Pr\left\{\sum_{i=1}^{n}(X_i-\mu_i)\Big/\sqrt{\sum_{i=1}^{n}\sigma_i^2}\le a\right\}
\to \Phi(a) \quad\text{as } n\to\infty.$$
The application of the central limit theorem to show that measurement errors are approx-
imately normally distributed is regarded as an important contribution to science. Indeed,
in the 17th and 18th centuries the central limit theorem was often called the law of fre-
quency of errors. The central limit theorem was originally stated and proven by the French
mathematician Pierre-Simon, Marquis de Laplace.
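The normal approximation can be seen in simulation: standardized sums of i.i.d. Uniform(0, 1) variables behave like a standard normal. A sketch (the choices n = 48 and 20,000 repetitions are illustrative):

```python
import random
import statistics

# Standardized sums of n i.i.d. Uniform(0,1) variables should be
# approximately standard normal for moderately large n.
random.seed(3)
n, reps = 48, 20_000
mu, sigma2 = 0.5, 1 / 12            # mean and variance of Uniform(0,1)

z = []
for _ in range(reps):
    s = sum(random.random() for _ in range(n))
    z.append((s - n * mu) / (n * sigma2) ** 0.5)

mean_z = statistics.mean(z)          # should be near 0
sd_z = statistics.pstdev(z)          # should be near 1
frac_le_1 = sum(v <= 1 for v in z) / reps  # should be near Phi(1) ~ 0.8413

print(mean_z, sd_z, frac_le_1)
```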
8.3. Law of Large Numbers.
Theorem 8.4 (The weak law of large numbers). Let X1 , X2 , . . . be a sequence of i.i.d.
random variables, each having finite mean E[Xi ] = µ. Then, for any ε > 0,
$$\Pr\left\{\Big|\frac{1}{n}\sum_{i=1}^{n}X_i - \mu\Big|\ge\varepsilon\right\}\to 0
\quad\text{as } n\to\infty.$$
Proof. Let $\bar{X} = \sum_{i=1}^{n}X_i/n$, and assume additionally that the common variance
$\sigma^2 = \operatorname{Var}(X_i)$ is finite (the theorem holds with a finite mean only, but this
simple proof uses the variance). By the linearity of expectation, $E[\bar{X}] = \mu$. Since the
$X_i$ are i.i.d., the variance is additive over independent variables and scales quadratically
with constant scaling, so $\operatorname{Var}(\bar{X}) = \sigma^2/n$. Applying Chebyshev's
inequality gives
$$\Pr\left\{\Big|\frac{1}{n}\sum_{i=1}^{n}X_i - \mu\Big|\ge\varepsilon\right\}
\le \frac{\sigma^2}{n\varepsilon^2}\to 0 \quad\text{as } n\to\infty. \qquad\square$$
Theorem 8.5 (The strong law of large numbers). Let $X_1, X_2, \dots$ be a sequence of
independent and identically distributed random variables, each having finite mean
$E[X_i] = \mu$. Then, with probability 1, $\frac{1}{n}\sum_{i=1}^{n}X_i\to\mu$ as $n\to\infty$, i.e.,
$$\Pr\left\{\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}X_i = \mu\right\} = 1.$$
What is the difference between the weak and the strong laws of large numbers? The weak
law says that, for each fixed large n, the average is likely to be near µ (with how large n
must be depending on ε); the strong law asserts that, with probability 1, the entire
sequence of averages converges to µ.
As an application of the strong law of large numbers, suppose that a sequence of inde-
pendent trials of some experiment is performed. Let E be a fixed event of the experiment,
and denote by Pr(E) the probability that E occurs on any particular trial. Let Xi be the in-
dicator random variable of E on the ith trial. We have, by the strong law of large numbers,
that with probability 1,
$$\frac{1}{n}\sum_{i=1}^{n}X_i \to E[X_1] = \Pr(E).$$
Therefore, if we accept the interpretation that “with probability 1” means “with certainty,”
we obtain the theoretical justification for the long-run relative frequency interpretation of
probabilities.
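This relative-frequency interpretation is easy to see in simulation. A sketch with E = ‘a fair die shows a six’, so the running frequency should approach Pr(E) = 1/6:

```python
import random

# Relative frequency of E = "a fair die shows a six" over n independent
# trials; by the strong law it converges to Pr(E) = 1/6.
random.seed(4)
n = 300_000
hits = 0
for _ in range(n):
    hits += random.randrange(6) == 5   # indicator X_i of E on this trial

freq = hits / n
print(freq)
```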
The weak law of large numbers was originally proven by James Bernoulli for the special
case where the Xi are Bernoulli random variables. The general form of the weak law of
large numbers was proved by the Russian mathematician Khintchine.
The strong law of large numbers was originally proven, in the special case of Bernoulli
random variables, by the French mathematician Borel. The general form of the strong law
was proven by the Russian mathematician A. N. Kolmogorov.
APPENDIX
Formulae
$$e^x = \sum_{n=0}^{\infty}\frac{x^n}{n!} = \lim_{n\to\infty}\Big(1+\frac{x}{n}\Big)^n,$$
$$\Gamma(t) = \int_0^{\infty}x^{t-1}e^{-x}\,dx,$$
$$\beta(a,b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)},$$
$$1+\frac{1}{2}+\frac{1}{3}+\cdots+\frac{1}{n}\approx \ln n + 0.577,$$
$$n! \approx \sqrt{2\pi n}\,\Big(\frac{n}{e}\Big)^n.$$
Distribution | PMF/PDF | Mean | Variance | MGF
Binomial | $\Pr\{X=k\} = \binom{n}{k}p^k(1-p)^{n-k}$ | $np$ | $np(1-p)$ | $(1-p+pe^t)^n$
Poisson | $\Pr\{X=k\} = e^{-\lambda}\lambda^k/k!$ | $\lambda$ | $\lambda$ | $e^{\lambda(e^t-1)}$
Uniform | $f(x) = 1/(b-a)$, $a<x<b$ | $(a+b)/2$ | $(b-a)^2/12$ | $\dfrac{e^{tb}-e^{ta}}{t(b-a)}$
Normal | $f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu$ | $\sigma^2$ | $e^{t\mu+\sigma^2t^2/2}$
Gamma | $f(x) = \lambda e^{-\lambda x}(\lambda x)^{\alpha-1}/\Gamma(\alpha)$, $x\ge 0$ | $\alpha/\lambda$ | $\alpha/\lambda^2$ | $\big(\frac{\lambda}{\lambda-t}\big)^{\alpha}$, $t<\lambda$