Probability Review
You will undoubtedly find these notes lack many important details. I
strongly urge you to seek out more detailed treatments of this
material as needed -- e.g., by reading these notes alongside a textbook or similarly
thorough resource -- especially if using them for more than a light refresher.
-- Paul J. Hurtado
Contents
Basic Definitions
Set Operations
Independence
    Independent vs Mutually Exclusive (aka Disjoint)
Random Variables
    Discrete Random Variables
    Continuous Random Variables
Multivariate Distributions
    Motivating Examples: Multivariate vs Univariate
    Density vs Likelihood
    Random Vectors and Joint Densities
    Independence Revisited
    Conditional Distributions Revisited
    Expected Values, Variance, and Covariance Revisited
    Combining Random Variables: Sums, Products, Quotients
Special Distributions
    Geometric and Negative Binomial Distributions
    Exponential and Gamma Distributions
    Normal (Gaussian) Distribution
Basic Definitions
Experiment: Any procedure that can be repeated under the same conditions a (theoretically)
infinite number of times, and such that its outcomes are well defined. By well defined we mean we
can describe all possible outcomes.
Sample Space: The set of all possible outcomes of an experiment. Usually denoted by S.
Event: A subset of the sample space S. Events are usually denoted by capital letters.
1. S is the sample space, the set of all outcomes. (Some texts use Ω instead of S.) Ex: For
a coin toss experiment, S = {H, T }.
2. F is the σ-algebra associated with S. It is a collection of subsets of S (we call these
subsets events), and includes S and the empty set ∅. This set of events is closed under
countable unions, countable intersections, and complementation.
3. P is our probability function P : F → [0, 1]. It associates each event (i.e., each subset of S
included in F) with a number between 0 and 1. Furthermore, we require that P satisfy the
axioms of probability given below.
Set Operations
Operations on events (sets): Union, Intersection, Complement.
1. The union of A and B is the event (set) A ∪ B which contains all elements in A, in B, or in both.
2. The intersection of A and B is the event (set) A ∩ B which contains all elements in both A and B.
3. The complement of A is the event (set) AC which contains all elements in S not in A.
NOTE: We can extend the definition of a union (or intersection) of two events, to any finite
number of events A1 , A2 , . . . , Ak defined over the sample space S.
Definition: Events A and B are mutually exclusive if their intersection is empty (A ∩ B = ∅).
1. A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
2. A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
3. A ∪ (B ∪ C) = (A ∪ B) ∪ C
4. A ∩ (B ∩ C) = (A ∩ B) ∩ C
The probability function P satisfies the following axioms:
1. P (A) ≥ 0 for every event A.
2. P (S) = 1.
3. The probability of a union of two mutually exclusive events A and B is the sum of their
probabilities: P (A ∪ B) = P (A) + P (B) for any mutually exclusive events A and B.
4. The probability of a union of infinitely many pairwise disjoint events is the sum of their
probabilities. That is, if A1 , A2 , . . . are events over S such that Ai ∩ Aj = ∅ for i ≠ j, then
P (∪_{i=1}^∞ Ai ) = Σ_{i=1}^∞ P (Ai ).
NOTE: Axioms 1 - 3 are enough for finite sample spaces. Axiom 4 is necessary when the sample
space is infinite (e.g. the real numbers, R).
Properties of Probability Functions
Suppose P is probability function on the subsets of the sample space S, and A and B are events
defined over S. Then, the following are true.
1. P (AC ) = 1 − P (A).
2. P (Ø) = 0.
3. If A ⊂ B, then P (A) ≤ P (B).
4. For any event A, P (A) ≤ 1.
5. If events A1 , A2 , . . . , Ak are such that Ai ∩ Aj = ∅ for i ≠ j, then P (∪_{i=1}^k Ai ) = Σ_{i=1}^k P (Ai ).
6. Addition Rule: For any two events A and B: P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
Theorem: Multiplication Rule for more than 2 events: Let A1 , A2 , . . . , An be events over
S. Then P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 )P (A2 |A1 ) · · · P (An−1 |A1 ∩ · · · ∩ An−2 )P (An |A1 ∩ · · · ∩ An−1 ).
Definition: Sets B1 , B2 , . . . , Bn form a partition of the sample space S if: (1) They "cover" S,
i.e., B1 ∪ B2 ∪ . . . ∪ Bn = S; and (2) They are pairwise disjoint.
Theorem: Total Probability formula: Let the sets B1 , B2 , . . . , Bn form a partition of the
sample space S. Let A be an event over S. Then
P (A) = Σ_{i=1}^n P (A|Bi )P (Bi ).
Bayes Rule
Theorem: Bayes Formula: (1) For any events A and B defined on sample space S and such
that P (B) ≠ 0 we have:

P (A|B) = P (B|A)P (A) / P (B),
(2) More generally, if the sets B1 , B2 , . . . , Bn form a partition of the sample space S, we have
P (Bj |A) = P (A|Bj )P (Bj ) / Σ_{i=1}^n P (A|Bi )P (Bi ),
for every j = 1, . . . , n.
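As a concrete illustration, here is a short Python sketch of both the total probability formula and Bayes' formula for a partition; the priors P (Bi ) and conditional probabilities P (A|Bi ) are made-up numbers for illustration only:

```python
# Hypothetical partition B1, B2, B3 with priors P(Bi) and likelihoods P(A | Bi).
priors = [0.5, 0.3, 0.2]          # P(B1), P(B2), P(B3) -- must sum to 1
likelihoods = [0.01, 0.02, 0.05]  # P(A | Bi)

# Total probability formula: P(A) = sum_i P(A | Bi) P(Bi)
p_A = sum(l * p for l, p in zip(likelihoods, priors))

# Bayes' formula for each j: P(Bj | A) = P(A | Bj) P(Bj) / P(A)
posteriors = [l * p / p_A for l, p in zip(likelihoods, priors)]

print(round(p_A, 4))                      # 0.021
print([round(q, 4) for q in posteriors])  # posteriors sum to 1
```

Note how the denominator of Bayes' formula is exactly the total probability of A.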
Independence
Definition: Two events A and B are called independent if P (A ∩ B) = P (A)P (B).
NOTE: If A and B are independent (and the conditioning events have positive probability), then P (A|B) = P (A) and P (B|A) = P (B).
1. Independence deals with the relationship between the probabilities of events A and B,
and the probability of their co-occurrence, P (A ∩ B). Independence says something about
events that can co-occur, whereas disjoint events, by definition, never co-occur.
2. The notion of sets being disjoint relates to the elements in those events, and whether
any are shared (i.e., whether or not they have an empty intersection). Mutual exclusivity
describes which outcomes cannot co-occur. Intuition should tell us that disjoint sets are
NOT independent! Why? Suppose two events are disjoint. Then knowledge of one event
occurring tells you quite a bit of information about whether or not the other has occurred
(by definition, it has not!). For example, if you are 22 years old, I know that you are not 21
years old. In fact, disjoint sets cannot be independent except in the trivial case where one
or both events has probability zero: since P (A ∩ B) = 0 for disjoint events, they can only
satisfy the definition of independence (P (A ∩ B) = P (B) P (A)) if P (A) = 0 or P (B) = 0 (or
both are true).
Example: Consider the experiment defined by drawing one card out of a standard 52 card deck.
Let event A be that the card is red (i.e., A is the set of all 26 red cards) and B be the event that
the card is a king (i.e., B is all four kings).
Are A and B independent? Check that they satisfy the definition, P (A ∩ B) = P (A)P (B): here
P (A) = 26/52 = 1/2, P (B) = 4/52 = 1/13, and P (A ∩ B) = P (red king) = 2/52 = 1/26 = P (A)P (B),
so yes, A and B are independent.
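One way to confirm the check is to enumerate the deck directly. A small Python sketch using exact arithmetic via fractions:

```python
from fractions import Fraction
from itertools import product

# Enumerate a standard 52-card deck as (rank, suit) pairs.
ranks = ["A","2","3","4","5","6","7","8","9","10","J","Q","K"]
suits = ["hearts", "diamonds", "clubs", "spades"]   # first two are red
deck = list(product(ranks, suits))

A = [c for c in deck if c[1] in ("hearts", "diamonds")]  # red card
B = [c for c in deck if c[0] == "K"]                     # king
AB = [c for c in A if c in B]                            # red king

P = lambda E: Fraction(len(E), len(deck))
print(P(A), P(B), P(AB))         # 1/2 1/13 1/26
print(P(AB) == P(A) * P(B))      # True: A and B are independent
```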
Combinatorics: Counting, Ordering, Arranging
Multiplication Rule: If operation A can be performed in n different ways and operation B can
be performed in m different ways, then the sequence of these two operations (say, AB) can be
performed in n · m ways.
Number of permutations of elements that are not all different: The number of permutations
of length n that can be formed from a set of n1 objects of type 1, n2 objects of type 2, . . ., nk
objects of type k, where Σ_{i=1}^k ni = n, is

n!/(n1 ! n2 ! · · · nk !).
Combinations: A set of k unordered objects is called a combination of size k.
NOTE: The number of combinations of size k of n distinct objects is the number of different
subsets of size k formed from a set of n elements.
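Both counts are easy to compute directly. A brief Python sketch (the MISSISSIPPI and poker-hand examples are standard illustrations, not taken from the text above):

```python
from math import factorial, comb

# Permutations of n objects with repeats: n! / (n1! n2! ... nk!).
# Example: distinct arrangements of the letters of "MISSISSIPPI"
# (n = 11: one M, four I's, four S's, two P's).
n_arrangements = factorial(11) // (factorial(1) * factorial(4) * factorial(4) * factorial(2))
print(n_arrangements)    # 34650

# Combinations: number of size-k subsets of n distinct objects.
print(comb(52, 5))       # 2598960 five-card poker hands
```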
Random Variables
Definition A probability space (S, E, P ) is composed of a sample space S, the σ-algebra E (a
collection of subsets of S), and a probability function P : E → [0, 1] that satisfies Kolmogorov’s axioms.
In practice, we think of random variables (r.v.) in two ways.
1. We commonly think of a random variable as a ‘‘place holder” for the observed outcome of an
experiment. Ex: Let X be the number of heads in 10 coin tosses.
2. Formally, if X is a random variable, it is a real-valued measurable function that maps one
probability space into another (real-valued) probability space. That is, X : (S, E, P ) →
(Ω, F, PX ) where Ω ⊆ R and we define
PX (A) = P ({s ∈ S : X(s) ∈ A}) for all events A ∈ F
Definition A real-valued function X that maps one probability space (S, E) to another probability
space (Ω, F) is called a random variable (r.v.) if
X −1 (E) ∈ E for all E ∈ F
That is, each event in the ‘‘new” algebra corresponds to (measurable) events in the original space.
This ensures that X induces a consistent probability measure on the new space.
Definition Suppose r.v. X maps (S, E, P ) → (Ω, F, PX ). The probability function (measure) PX
is called the probability distribution of X and is given by
PX (A) = P ({s ∈ S : X(s) ∈ A}) for all A ∈ F.
NOTE: By X being real-valued, we mean that Ω ⊆ R or Ω ⊆ Rn . In the latter case, we call X a
random vector.
Example 1: Stating that ‘‘X is a Bernoulli r.v. with probability p of success” implies that
S = {0, 1} and P (X = k) = p^k (1 − p)^(1−k) . That is, P (X = 1) = p and P (X = 0) = 1 − p.
Example 2: Stating that ‘‘Y is a binomial r.v. with parameters n and p” implies that Y = Σ_{i=1}^n Xi
is the number of successes in a Bernoulli process of length n, and therefore that S = {0, 1, ..., n}
and P (Y = k) = C(n, k) p^k (1 − p)^(n−k) for k ∈ S (zero otherwise).
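A quick Python sketch of the binomial pmf from Example 2, checking that the probabilities sum to one and that the mean is np (n = 10 and p = 0.5 are arbitrary choices):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ binomial(n, p): C(n,k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
print(sum(pmf))                          # 1.0 (probabilities sum to one)
mean = sum(k * pk for k, pk in enumerate(pmf))
print(mean)                              # 5.0 = n*p
```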
Theorem: The distribution of X is uniquely determined by the cumulative distribution
function (cdf) of X, denoted by F or FX : F (x) = P (X ≤ x) for all real x.
Properties of cdf
1. F is nondecreasing: if x ≤ y then F (x) ≤ F (y);
2. F is right-continuous;
3. lim_{x→∞} F (x) = 1;
4. lim_{x→−∞} F (x) = 0.

NOTE: For a sequence of increasing sets A1 ⊂ A2 ⊂ . . ., the probability of their union is the limit of
their probabilities, that is: P (∪_{i=1}^∞ Ai ) = lim_{i→∞} P (Ai ).
Types of distributions: There are three main types of distributions / random variables:
1. Discrete r.v.: CDF is a step function, S has at most a countable number of outcomes.
Examples: Binomial, Poisson.
2. Continuous r.v.: CDF is continuous, and probabilities are computed by integrating a density.
Examples: Normal, Exponential.
3. Mixed r.v.: CDF has both jumps and intervals of continuous increase.
Discrete Random Variables
Definition. Suppose a sample space S has a finite or countable number of simple outcomes. Let p
be a real valued function on S such that p(s) ≥ 0 for each s ∈ S and Σ_{s∈S} p(s) = 1. Then p is
called a probability mass function, and it defines a probability on S via P (A) = Σ_{s∈A} p(s).
Definition. A random variable with finite or countably many values is called a discrete random
variable.
Definition. Any discrete random variable X is described by its probability density function
(or probability mass function), denoted pX (k), which provides probabilities of all values of X as
follows: pX (k) = P (X = k).
Examples:
1. Binomial r.v. X with n trials and probability of success equal to p, i.e., X ∼ binom(n, p):

pX (k) = P (k successes in n trials) = C(n, k) p^k (1 − p)^(n−k) , for k = 0, 1, 2, . . . , n.
Definition. Let X be a discrete random variable. For any real number t, the cumulative
distribution function F of X at t is given by F (t) = P (X ≤ t) = Σ_{k≤t} pX (k).
Continuous Random Variables
Suppose a sample space Ω is uncountable, e.g., Ω = [0, 1] or Ω = R. We can define a random
variable X : (Ω, E) → (S, B) where the new sample space S is a subset of R and the algebra B
is the Borel sets (all unions, intersections and complements of the open and closed intervals in
S). The probability structure on such a space can be described using a special function, f called
probability density function (pdf).
Definition: A random variable Y mapping S (a subset of the real numbers) into the real numbers is
called a continuous random variable if there is a function f , the pdf of Y , such that

P (a ≤ Y ≤ b) = ∫_a^b f (t) dt.

For any event A defined on S: P (A) = ∫_A f (t) dt.
Theorem: For any continuous random variable X, P (X = a) = 0 for any real number a.
Definition. The cdf of a continuous random variable Y (with pdf f ) is FY , given by

FY (y) = P (Y ≤ y) = P ({s ∈ S : Y (s) ≤ y}) = ∫_{−∞}^y f (t) dt for any real y.
Theorem. If FY (t) is the cdf and fY (t) is the pdf of a continuous random variable Y , then

(d/dt) FY (t) = fY (t).
Linear transformation: Let X be a continuous random variable with pdf fX . Let Y = aX + b,
where a ≠ 0 and b are real constants. Then the pdf of Y is: gY (y) = (1/|a|) fX ((y − b)/a).
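The transformation formula can be sanity-checked numerically: if X ∼ N (0, 1) and Y = 2X + 3, then gY should coincide with the N (3, 4) density. A Python sketch (the particular constants a = 2, b = 3 are arbitrary):

```python
from math import exp, pi, sqrt

def norm_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)

# If X ~ N(0,1) and Y = aX + b, the formula gives g_Y(y) = (1/|a|) f_X((y-b)/a).
# For a = 2, b = 3 this should match the N(3, 2^2) density directly.
a, b = 2.0, 3.0
for y in [1.0, 3.0, 4.5]:
    g = norm_pdf((y - b) / a) / abs(a)        # transformed-pdf formula
    direct = norm_pdf(y, mu=3.0, sigma=2.0)   # N(3, 4) density
    print(round(g, 6), round(direct, 6))      # the two columns agree
```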
Expectation and Expected Values
We often quantify the central tendency of a random variable using its expected value (mean).
1. If X is a discrete random variable with pdf pX (k), then the expected value of X is given by

E(X) = µ = µX = Σ_{all k} k · pX (k) = Σ_{all k} k · P (X = k).

2. If X is a continuous random variable with pdf fX (x), then the expected value of X is given by

E(X) = µ = µX = ∫_{−∞}^∞ x fX (x) dx.

3. If X is a mixed random variable with cdf F , then the expected value of X is given by

E(X) = µ = µX = ∫_{−∞}^∞ x F ′(x) dx + Σ_{all k} k · P (X = k),

where F ′ is the derivative of F where the derivative exists and the k’s in the summation are the
‘‘discrete” values of X.
NOTE: For the expectation of a random variable to exist, we assume that all integrals and sums
in the definition of the expectation above converge absolutely.
Definition: The median of a random variable is the value at the midpoint of the distribution of X
-- another way to characterize the central tendency of a random variable. Specifically, if X is a
discrete random variable, then its median m is the point for which P (X < m) = P (X > m). If
there are two values m and m′ such that P (X ≤ m) = 0.5 and P (X ≥ m′ ) = 0.5, the median is
the average of m and m′ , (m + m′ )/2.
If X is a continuous random variable with pdf f , the median m is the solution of the equation:

∫_{−∞}^m f (x) dx = 1/2.
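For example, for an exponential density f (x) = λe^(−λx), x > 0, solving this equation gives the median m = ln(2)/λ. A quick Python check (λ = 0.5 is an arbitrary choice):

```python
from math import log, exp

# For an exponential r.v. with rate lam, the cdf is F(m) = 1 - e^(-lam*m),
# so the median solves 1 - e^(-lam*m) = 1/2, giving m = ln(2)/lam.
lam = 0.5
m = log(2) / lam
print(round(m, 4))                       # 1.3863
print(round(1 - exp(-lam * m), 12))      # 0.5: half the probability lies below m
```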
If X is a continuous random variable with pdf fX (x), and if g is a continuous function, then the
expected value of g(X) is given by
E(g(X)) = ∫_{−∞}^∞ g(x) f (x) dx,

provided that ∫_{−∞}^∞ |g(x)| f (x) dx is finite.
NOTE: Expected value is a linear operator, that is E(aX + b) = aE(X) + b, for any rv X.
NOTE: (Markov’s inequality.) For any nonnegative random variable X and any a > 0,

P (X ≥ a) ≤ E(X)/a.
Variance
To get an idea about variability of a random variable, we look at the measures of spread. These
include the variance, standard deviation, and coefficient of variation.
Definition. The variance of a random variable, denoted Var(X) or σ^2 , is the average of its squared
deviations from the mean µ. Let X be a random variable.

1. If X is a discrete random variable with pdf pX (k) and mean µX , then the variance of X is
given by

Var(X) = σ^2 = E[(X − µX )^2 ] = Σ_{all k} (k − µX )^2 pX (k) = Σ_{all k} (k − µX )^2 P (X = k).

2. If X is a continuous random variable with pdf fX (x) and mean µX , then the variance of X is
given by

Var(X) = σ^2 = E[(X − µX )^2 ] = ∫_{−∞}^∞ (x − µX )^2 fX (x) dx.

3. If X is a mixed random variable with cdf F and mean µX , then the variance of X is given by

Var(X) = σ^2 = E[(X − µX )^2 ] = ∫_{−∞}^∞ (x − µX )^2 F ′(x) dx + Σ_{all k} (k − µX )^2 P (X = k),

where F ′ is the derivative of F where the derivative exists and the k’s in the summation are the
‘‘discrete” values of X.
NOTE: If E(X 2 ) is not finite, then the variance does not exist.
Definition. The standard deviation (sd(X) or σ) is sd(X) = √Var(X).
NOTE: The units of variance are square units of the random variable. The units of standard
deviation are the same as the units of the random variable.
Theorem: Let X be a random variable with mean µ and variance σ^2 . Then we can compute σ^2 as
follows: σ^2 = E(X^2 ) − (E(X))^2 = E(X^2 ) − µ^2 .
Definition: The coefficient of variation (CV) is the standard deviation divided by the mean:

CV(X) = √Var(X) / E(X) = σ/µ.
The sd gives an absolute measure of spread, while the CV quantifies spread relative to the mean.
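A small Python sketch computing these quantities for a fair six-sided die, verifying that the definition of the variance agrees with the shortcut E(X^2 ) − µ^2 (exact arithmetic via fractions):

```python
from fractions import Fraction

# Fair six-sided die: pmf p(k) = 1/6 for k = 1..6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

mu   = sum(k * p for k, p in pmf.items())            # E(X)
ex2  = sum(k * k * p for k, p in pmf.items())        # E(X^2)
var1 = sum((k - mu)**2 * p for k, p in pmf.items())  # variance by definition
var2 = ex2 - mu**2                                   # shortcut formula
print(mu, var1, var2)        # 7/2 35/12 35/12

sd = float(var1) ** 0.5
cv = sd / float(mu)          # coefficient of variation: sd / mean
print(round(sd, 4), round(cv, 4))
```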
1. The rth moment of X (about the origin) is E(X r ), provided that the moment exists.
2. The rth moment of X about the mean is E[(X − µX )r ], provided that the moment exists.
Multivariate Distributions
In statistics, we typically work with data sets with sample sizes greater than one! This naturally
leads us to consider all of these data not as replicates from a single univariate distribution, but
as a single vector-valued observation from a multivariate distribution. Before we discuss how the
above material generalizes to N > 1 dimensions, here is some motivation from statistics.
Since a normal r.v. with mean µ and standard deviation σ can be written as µ plus a normal r.v.
with mean 0 (i.e., µ + N (0, σ)), it follows that the linear regression model can be written as

Yi = β0 + β1 Xi1 + · · · + βk Xik + εi

where each εi is an independent normal with mean 0 and standard deviation σ. Writing these n
equations in matrix form yields
Y = Xβ + ε,

where Y = (Y1 , . . . , Yn )^T , X is the n × (k + 1) design matrix whose ith row is (1, Xi1 , . . . , Xik ),
β = (β0 , β1 , . . . , βk )^T , and ε = (ε1 , . . . , εn )^T .
Note that E(Y) = Xβ. Assume the observed outcomes (data) y = (y1 , · · · , yn )^T and inputs X are
known, and that the goal is to estimate the (unknown) parameters β (call this estimate β̂). Statistical
theory says the best way to compute that estimate is to take the sum of squared differences (SSD)
between the observed data and the expected model output for a given set of parameters β (i.e.,
SSD = rT r where r = y − E(Y); a measure of ‘‘distance” between model and data) then use the
β that minimizes that distance as our estimate β̂. It can be shown with a little linear algebra that
β̂ = (X^T X)^(−1) X^T y.
Therefore we’ve used linear algebra and a little multivariate calculus to turn an optimization
problem into a relatively simple matrix computation!
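For the two-parameter case (simple linear regression), the normal equations can be solved by hand with a 2 × 2 matrix inverse. A Python sketch with made-up data lying exactly on the line y = 1 + 2x:

```python
# Least-squares fit of y ≈ b0 + b1*x via the normal equations
# beta_hat = (X^T X)^(-1) X^T y, written out for the 2-parameter case.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
n = len(xs)

# Entries of X^T X and X^T y for the design matrix with rows (1, x_i).
sx, sxx = sum(xs), sum(x * x for x in xs)
sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))

# Invert the 2x2 matrix [[n, sx], [sx, sxx]] and multiply by X^T y.
det = n * sxx - sx * sx
b0 = (sxx * sy - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det
print(b0, b1)    # 1.0 2.0 -- recovers the line exactly
```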
Concluding Remark: In practice, statistics is a multivariate endeavor and therefore you should
be familiar with these basic probability concepts in a multivariate setting. Also, some basic tools
from linear algebra are essential to thinking critically about both theoretical and applied statistics.
Density vs Likelihood
Definition: A random sample of size n is a set of n independent and identically distributed
(iid ) observations X1 = x1 , X2 = x2 , . . ., Xn = xn .
Here’s a crude, graphical way of estimating the mean µ and variance σ 2 of a normal distribution
from a random sample of data: Plot a histogram, choose an initial µ and σ and overlay the
corresponding density curve. Iteratively adjust µ and σ until it looks like a good fit. In R...
[Figure: three histograms of the sample (X from 6 to 14, Density from 0.0 to 0.4), each overlaid
with a normal density curve for a different choice of µ and σ.]
Formally, we’d like to compute some ‘‘goodness of fit” measure instead of just trusting our intuition
with what ‘‘looks like a good fit”. This might be the SSD (sometimes called the sum of squared
errors [SSE ]) from the OLS example above, but another option comes from some theoretical
results in mathematical statistics: the likelihood of parameters µ and σ given the data X. Here,
our estimates are the values of µ and σ that maximize the likelihood.
What is this likelihood ? This is defined by the distribution for random vector X, but where we flip
around our notion of what’s fixed and what varies. That is, we treat the x values (our data) as
fixed, and our candidate parameter estimates µ and σ are treated as variable quantities. Let us
consider a specific example to see how we define and use a likelihood function in practice.
Example: Assume all Xi are iid with normal density f (xi ; µ, σ). This implies the joint density
fX (x1 , ..., xn ; µ, σ) = ∏_{i=1}^n f (xi ; µ, σ). Here we can write it as a simple product, thanks to the
independence of the individual random variables. This density function defines the likelihood
function for parameters µ and σ:

L(µ, σ; x) = ∏_{i=1}^n f (xi ; µ, σ).
Note that we’ve gone from a function of n variables (number of data points) down to a function of 2
variables (number of parameters), and our domain is no longer the sample space but is instead the
range of possible parameters (µ ∈ R, σ ∈ R+ ). Plotting likelihood values over a range of possible
parameter values (here holding one parameter constant while varying the other) in R yields...
par(mfrow=c(1,2))
# Example data (assumed here so the code runs on its own): sample stored in X
set.seed(1); X = rnorm(50, mean=10, sd=2)
# Likelihood as a function of mu, with sd held fixed at 2
Lik = Vectorize(function(mu, sd, xs) prod(dnorm(xs, mean=mu, sd=sd)), "mu")
curve(Lik(x, 2, X), from=8, to=12, main="MEAN", xlab=expression(mu))
# Likelihood as a function of sd, with mu held fixed at 10
Lik = Vectorize(function(mu, sd, xs) prod(dnorm(xs, mean=mu, sd=sd)), "sd")
curve(Lik(10, x, X), from=0, to=4, main="SD", xlab=expression(sigma))
# Optimization algorithms can then be used to refine estimates.
[Figure: the two resulting likelihood curves, Lik(x, 2, X) over µ ∈ [8, 12] (MEAN) and
Lik(10, x, X) over σ ∈ [0, 4] (SD), each peaking near the corresponding estimate.]
The maximum likelihood estimates of µ and σ are the pair of values that yield the maximum
likelihood value. In this case, using the optim() function yields µ = 9.98 and σ = 1.88.
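For iid normal data the maximization can also be done analytically: the log-likelihood is maximized at the sample mean and at the square root of the average squared deviation (note the 1/n, not 1/(n − 1)). A Python sketch with made-up data:

```python
from math import sqrt

# Closed-form maximum likelihood estimates for iid normal data.
# Illustrative data (any numeric sample works here).
X = [8.1, 9.4, 10.2, 11.7, 9.9, 10.6, 8.8, 11.0]
n = len(X)

mu_hat = sum(X) / n
sigma_hat = sqrt(sum((x - mu_hat)**2 for x in X) / n)  # note 1/n, not 1/(n-1)
print(round(mu_hat, 4), round(sigma_hat, 4))
```

Running a numerical optimizer on the likelihood should reproduce these closed-form values up to convergence tolerance.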
Concluding Remark: In this example, we are inferring the parameters for a single distribution
from our random sample of data. We do so by treating those data as a random vector -- a single
observation from a multivariate distribution. We typically do statistics by treating all of our
data as a single outcome from a joint distribution. Therefore, to have a deeper understanding of
Statistics, we need to understand Probability from a multivariate perspective.
Random Vectors and Joint Densities
Joint densities describe probability distributions of random vectors. A random vector X is an
n-dimensional vector where each component is itself a random variable, i.e., X = (X1 , X2 , . . . , Xn ),
where all Xi s are rvs.
Discrete random vectors are described by the joint probability density function (or joint pdf)
of X1 , . . . , Xn , denoted pX1 ,...,Xn (x1 , . . . , xn ) = P (X1 = x1 , . . . , Xn = xn ).
Another name for the joint pdf of a discrete random vector is joint probability mass function (pmf).
Computing probabilities for discrete random vectors. For any subset A of R2 , we have
P ((X, Y ) ∈ A) = Σ_{(x,y)∈A} P (X = x, Y = y) = Σ_{(x,y)∈A} pX,Y (x, y).
Continuous random vectors are described by the joint probability density function of X and Y
(or joint pdf) denoted by fX,Y (x, y). The pdf has the following properties: fX,Y (x, y) ≥ 0, it
integrates to 1 over R^2 , and P ((X, Y ) ∈ A) = ∫∫_A fX,Y (x, y) dx dy.
Joint cdf of a vector (X, Y ). The joint cumulative distribution function of X and Y (or joint
cdf) is defined by
FX,Y (u, v) = P (X ≤ u, Y ≤ v).
Theorem. Let FX,Y (u, v) be the joint cdf of the vector (X, Y ). Then the joint pdf of (X, Y ), fX,Y ,
is given by the second partial derivative of the cdf. That is, fX,Y (x, y) = ∂^2/∂x∂y FX,Y (x, y), provided that
FX,Y (x, y) has continuous second partial derivatives.
Independence Revisited
Definition. Two random variables X and Y are called independent if and only if (iff ) for any sets A
and B of real numbers, P (X ∈ A and Y ∈ B) = P (X ∈ A)P (Y ∈ B).
NOTE: Random variables X and Y are independent iff FX,Y (x, y) = FX (x)FY (y), where FX,Y (x, y) is
the joint cdf of (X, Y ), and FX (x) and FY (y) are the marginal cdf’s of X and Y , respectively.
NOTE: More generally, random variables X1 , X2 , . . . , Xn are independent iff

fX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = fX1 (x1 )fX2 (x2 ) · · · fXn (xn ),
where fX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) is the joint pdf of the vector (X1 , X2 , . . . , Xn ), and fX1 (x1 ),
fX2 (x2 ), · · · , and fXn (xn ) are the marginal pdf’s of the variables X1 , X2 , . . . , Xn .
Conditional Distributions Revisited
Definition. If (X, Y ) is a discrete random vector with pmf pX,Y (x, y), and if pY (y) > 0, then
the conditional distribution of X given Y = y is given by the conditional pmf

pX|Y =y (x) = pX,Y (x, y) / pY (y).

Similarly, if pX (x) > 0, then the conditional distribution of Y given X = x is given by the
conditional pmf pY |X=x (y) = pX,Y (x, y) / pX (x).
Definition. If (X, Y ) is a continuous random vector with pdf fX,Y (x, y), and if fY (y) > 0, then
the conditional distribution of X given Y = y is given by the conditional pdf

fX|Y =y (x) = fX,Y (x, y) / fY (y).

Similarly, if fX (x) > 0, then the conditional distribution of Y given X = x is given by the conditional
pdf fY |X=x (y) = fX,Y (x, y) / fX (x).
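A small Python sketch computing a marginal and a conditional pmf from a made-up joint pmf table over x ∈ {0, 1} and y ∈ {0, 1}:

```python
from fractions import Fraction

# Toy joint pmf p(x, y); the values are hypothetical but sum to 1.
joint = {(0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
         (1, 0): Fraction(3, 8), (1, 1): Fraction(1, 8)}

# Marginal of Y: p_Y(y) = sum over x of p(x, y).
pY = {y: sum(p for (x2, y2), p in joint.items() if y2 == y) for y in (0, 1)}

# Conditional pmf of X given Y = y: p_{X|Y=y}(x) = p(x, y) / p_Y(y).
cond = {(x, y): joint[(x, y)] / pY[y] for (x, y) in joint}
print(pY[0], pY[1])                    # 1/2 1/2
print(cond[(0, 0)], cond[(1, 0)])      # 1/4 3/4 -- sums to 1 over x
```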
Expected Values, Variance, and Covariance Revisited
Definition: Let (X, Y ) be a random vector with pmf p (discrete) or pdf f (continuous). Let g be
a real valued function of (X, Y ). Then, the expected value of random variable g(X, Y ) is
E(g(X, Y )) = Σ_{all x} Σ_{all y} g(x, y) p(x, y) in the discrete case, or

E(g(X, Y )) = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f (x, y) dx dy in the continuous case,
provided that the sums and the integrals converge absolutely.
Mean of a sum of random variables. Let X and Y be any random variables, and a and b real
numbers. Then
E(aX + bY ) = aE(X) + bE(Y ),
provided both expectations are finite.
NOTE: Let X1 , X2 , . . . , Xn be any random variables with finite means, and let a1 , a2 , . . . , an be a
set of real numbers. Then
E(a1 X1 + a2 X2 + · · · + an Xn ) = a1 E(X1 ) + a2 E(X2 ) + · · · + an E(Xn ).
Mean of a product of independent random variables. If X and Y are independent random
variables with finite expectations, then E(XY ) = E(X) E(Y ).
Combining Random Variables: Sums, Products, Quotients
Let X and Y be independent random variables with pdf or pmf’s fX and fY or pX and pY ,
respectively. Then...
If X and Y are discrete random variables, then the pmf of their sum W = X + Y is

pW (w) = Σ_{all x} pX (x) pY (w − x).
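For example, the pmf of the sum of two fair dice follows from this convolution formula. A Python sketch (exact arithmetic via fractions):

```python
from fractions import Fraction

# pmf of one fair die, and of the sum W = X + Y of two independent dice
# via the convolution p_W(w) = sum_x p_X(x) p_Y(w - x).
die = {k: Fraction(1, 6) for k in range(1, 7)}

pW = {}
for w in range(2, 13):
    pW[w] = sum(die.get(x, 0) * die.get(w - x, 0) for x in die)

print(pW[2], pW[7], pW[12])    # 1/36 1/6 1/36
print(sum(pW.values()))        # 1
```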
If X and Y are continuous random variables, then the pdf of their sum W = X + Y is the convolution
of the individual densities:

fW (w) = ∫_{−∞}^∞ fX (x) fY (w − x) dx.
If X and Y are independent continuous random variables, then the pdf of their quotient W = Y /X
is given by:

fW (w) = ∫_{−∞}^∞ |x| fX (x) fY (wx) dx.

The above formula is valid provided X equals zero on at most a set of isolated points (no intervals).
If X and Y are independent continuous random variables, then the pdf of their product W = XY
is given by:

fW (w) = ∫_{−∞}^∞ (1/|x|) fX (w/x) fY (x) dx.
Special Distributions
Some useful distributions are ‘‘special” enough to be named. They include: Poisson, exponential,
Normal (Gaussian), Gamma, geometric, negative binomial, Binomial and hypergeometric distribu-
tions. We already saw and used exponential, Binomial and hypergeometric distributions. We will
now explore the definitions and properties of the other ‘‘special” distributions.
Bernoulli. Sample space {0, 1}, mean p, variance p(1 − p), and mass function
px = p^x (1 − p)^(1−x)
Binomial. The number of successes in n Bernoulli(p) trials. Discrete random variable with sample
space {0, . . . , n}, and mass function (with parameters n, p) given by
px = C(n, x) p^x (1 − p)^(n−x) , where C(n, x) = n!/(x!(n − x)!) is the binomial coefficient.
The mean is np and the variance is np(1 − p).
Multinomial (Generalized Binomial). Discrete random variable for the number of each of
k types of outcomes in n trials. Sample space {0, ..., n}^k , and mass function (with parameters
n, p1 , ..., pk , where Σ_{i=1}^k pi = 1) given by

px1 ,...,xk = (n!/(x1 ! · · · xk !)) p1^x1 · · · pk^xk
The marginals are binomial, thus the means are E(Xi ) = npi and the variances are V ar(Xi ) =
npi (1 − pi ).
Hypergeometric. Discrete r.v. with sample space {0, ..., w}, and mass function (with parameters
N , w, n) given by

px = C(w, x) C(N − w, n − x) / C(N, n)

The mean is n w/N and the variance is n (w/N )(1 − w/N )(N − n)/(N − 1).
Generalized Hypergeometric. Discrete random variable with parameters n, n1 , ..., nk , where
Σ_{i=1}^k ni = N , with mass function

px1 ,...,xk = C(n1 , x1 ) · · · C(nk , xk ) / C(N, n)
The marginals are Hypergeometric.
Uniform (Continuous). Continuous random variable with sample space [a, b], and density
function

fX (x) = (b − a)^(−1) for a ≤ x ≤ b (zero otherwise).

The mean is (b + a)/2 and the variance is (b − a)^2 /12.
Poisson. Discrete random variable with sample space {0, 1, 2, . . .} and mass function (with
parameter λ > 0) given by

P (X = k) = e^(−λ) λ^k / k! for k = 0, 1, 2, . . . .

The mean and variance are the same, namely E(X) = Var(X) = λ.
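A quick numerical check that E(X) = Var(X) = λ, truncating the pmf at a large K (λ = 3 is an arbitrary choice):

```python
from math import exp, factorial

# Poisson pmf P(X = k) = e^(-lam) lam^k / k!, truncated at K terms.
lam, K = 3.0, 80
pmf = [exp(-lam) * lam**k / factorial(k) for k in range(K)]

mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean)**2 * p for k, p in enumerate(pmf))
print(round(sum(pmf), 10))               # 1.0 (total probability)
print(round(mean, 6), round(var, 6))     # both approximately lam = 3.0
```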
Poisson Model. Suppose events can occur in space or time in such a way that:
1. The probability that two events occur in the same small area or time interval is zero.
2. The numbers of events occurring in disjoint areas or time intervals are independent.
3. The probability that an event occurs in a given area or time interval T depends only on the
size of the area or length of the time interval, and not on their location.
Poisson Process. Suppose that events satisfying the Poisson model occur at the rate λ per unit
time. Let X(t) denote the number of events occurring in a time interval of length t. Then

P (X(t) = k) = e^(−λt) (λt)^k / k! for k = 0, 1, 2, . . . .

X(t) is called a Poisson process with rate λ.
Exponential. Continuing from above, the waiting time Y between consecutive events has an
exponential distribution with parameter λ (that is with mean 1/λ), that is P (Y > t) = e−λt , t > 0,
or equivalently,
f (t) = λe−λt , for t > 0.
The mean is 1/λ and the variance is 1/λ2 .
Geometric and Negative Binomial Distributions
Geometric experiment: Toss a fair coin until the first H appears. Let X=number of tosses
required for the first H. Then X has geometric distribution with probability of success 0.5.
More generally, for probability of success p,

P (X = k) = (1 − p)^(k−1) p, for k = 1, 2, 3, . . . .

It is denoted X ∼ Geo(p). The mean and variance of a geometric distribution are EX = 1/p and
Var(X) = (1 − p)/p^2 , respectively. The mgf of X is MX (t) = p e^t /(1 − (1 − p)e^t ).
Memoryless property of geometric distribution. Let X ∼ Geo(p), then for any n and k, we
have
P (X = n + k | X > n) = P (X = k).
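The memoryless property is easy to verify symbolically, since P (X > n) = (1 − p)^n . A Python sketch with arbitrary choices of p, n, and k (exact arithmetic via fractions):

```python
from fractions import Fraction

# Geometric pmf P(X = k) = (1-p)^(k-1) p, for k = 1, 2, ...
p = Fraction(1, 3)
pmf = lambda k: (1 - p)**(k - 1) * p

# Memorylessness: P(X = n + k | X > n) = pmf(n + k) / P(X > n),
# and P(X > n) = (1-p)^n.
n, k = 4, 2
cond = pmf(n + k) / (1 - p)**n
print(cond == pmf(k))    # True
```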
Negative Binomial experiment. Think of a geometric experiment performed until we get r
successes. Let X = number of trials until we have r successes. Then X has a negative binomial
distribution, with

P (X = k) = C(k − 1, r − 1) p^r (1 − p)^(k−r) , for k = r, r + 1, r + 2, . . . .

The mean is r/p and the variance is r(1 − p)/p^2 .
Exponential and Gamma Distributions
Definition. The Gamma function. For any positive real number r > 0, the gamma function
of r is denoted Γ(r) and equal to
Γ(r) = ∫_0^∞ y^(r−1) e^(−y) dy.
Theorem. Properties of Gamma function. The Gamma function satisfies the following
properties:
1. Γ(1) = 1.
2. Γ(r + 1) = r Γ(r) for any r > 0.
3. Γ(n) = (n − 1)! for any positive integer n.
Definition of the Gamma random variable. For any real positive numbers r > 0 and λ > 0,
a random variable with pdf
fX (x) = (λ^r /Γ(r)) x^(r−1) e^(−λx) , x > 0,
is said to have a Gamma distribution with parameters r and λ, denoted X ∼ Γ(r, λ).
Theorem: moments and mgf of a gamma distribution. If X ∼ Γ(r, λ) then
1. EX = r/λ.
2. Var(X) = r/λ^2 .
3. MX (t) = (λ/(λ − t))^r for t < λ.
Theorem. Let X1 , X2 , . . . , Xn be iid exponential random variables with parameter λ, that is, with
mean 1/λ. Then the sum of the Xi ’s has a gamma distribution with parameters n and λ. More precisely,
Σ_{i=1}^n Xi ∼ Γ(n, λ).
Theorem. A sum of independent gamma random variables X ∼ Γ(r, λ) and Y ∼ Γ(s, λ) with the
same λ has a gamma distribution with r′ = r + s and the same λ. That is, X + Y ∼ Γ(r + s, λ).
Note: In a sequence of Poisson events occurring with rate λ per unit time/area, the waiting time
for the r’th event has a Γ(r, λ) distribution.
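A Monte Carlo sanity check of the sum-of-exponentials theorem: with n = 5 and λ = 2 (arbitrary choices), the simulated sums should have mean n/λ = 2.5 and variance n/λ^2 = 1.25, matching the Γ(n, λ) moments. A Python sketch, seeded for reproducibility:

```python
import random
from statistics import mean, pvariance

# A sum of n iid exponential(lam) r.v.s should be Gamma(n, lam),
# hence have mean n/lam and variance n/lam^2.
random.seed(42)
n, lam, reps = 5, 2.0, 100_000

sums = [sum(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]
print(round(mean(sums), 2))       # close to n/lam = 2.5
print(round(pvariance(sums), 2))  # close to n/lam^2 = 1.25
```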
Normal (Gaussian) Distribution
Normal (Gaussian) distribution. Continuous random variable X has a normal distribution
with mean µ and variance σ 2 if its pdf is of the form:
f (x) = (1/(√(2π) σ)) e^(−(x−µ)^2 /(2σ^2 )) ,

where µ and σ^2 are real valued constants (σ > 0). If X has pdf as above, we denote it: X ∼ N (µ, σ^2 ).
The mgf of X is MX (t) = e^(µt + σ^2 t^2 /2) , for any real t.
The normal pdf is bell shaped and centered around the mean µ. There is a special Normal
distribution with mean 0 and variance 1, called standard normal distribution, and denoted by
Z ∼ N (0, 1). The standard normal pdf is
f (z) = (1/√(2π)) e^(−z^2 /2) .
The values of the standard normal cdf are tabulated. To find probabilities related to general normal
random variables, use the following fact:
Theorem. If X ∼ N (µ, σ^2 ), then Z = (X − µ)/σ ∼ N (0, 1).
1. Let X1 ∼ N (µ1 , σ1^2 ) and X2 ∼ N (µ2 , σ2^2 ), with X1 and X2 independent. Then X1 ± X2 ∼
N (µ1 ± µ2 , σ1^2 + σ2^2 ), and more generally:
2. Let Xi ∼ N (µi , σi^2 ), for i = 1, . . . , n, with the Xi ’s independent. Then Y = Σ_{i=1}^n Xi ∼
N (Σ_{i=1}^n µi , Σ_{i=1}^n σi^2 ), and any linear combination Σ_{i=1}^n ai Xi is normal with mean
Σ_{i=1}^n ai µi and variance Σ_{i=1}^n ai^2 σi^2 .
Normal Approximation to Binomial. Let X ∼ Bin(n, p) and Y ∼ N (np, np(1 − p)). Then
for large n
P (a ≤ X ≤ b) ≈ P (a ≤ Y ≤ b).
Continuity correction for the normal approximation to binomial. To ‘‘correct” for the
fact that the binomial is a discrete and the normal a continuous distribution, we use the following
correction for continuity: P (X = x) ≈ P (x − 0.5 < Y < x + 0.5).
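A Python sketch comparing the exact binomial probability with the continuity-corrected normal approximation, using the standard normal cdf Φ(z) = (1 + erf(z/√2))/2; the choices n = 100, p = 0.5, and the interval [45, 55] are arbitrary:

```python
from math import comb, erf, sqrt

# Exact binomial probability vs the normal approximation with the
# continuity correction P(a <= X <= b) ≈ P(a - 0.5 < Y < b + 0.5).
n, p = 100, 0.5
a, b = 45, 55

exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))

mu, sigma = n * p, sqrt(n * p * (1 - p))
Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard normal cdf
approx = Phi((b + 0.5 - mu) / sigma) - Phi((a - 0.5 - mu) / sigma)

print(round(exact, 4), round(approx, 4))   # very close agreement
```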
Convergence Concepts & Laws of Large Numbers
Before discussing the Central Limit Theorem (CLT), Weak Law of Large Numbers (WLLN) and
Strong Law of Large Numbers (SLLN) it helps to know some different convergence concepts that
exist in probability (and measure theory).
We begin with two results that help us bound probabilities when only the mean is known:
Markov Inequality: For any non-negative valued r.v. Y with E(Y ) = µ and any a > 0,

P (Y ≥ a) ≤ E(Y )/a.

Proof (continuous case):

E(Y )/a = (1/a) ∫_0^∞ y f (y) dy ≥ (1/a) ∫_a^∞ y f (y) dy ≥ (1/a) ∫_a^∞ a f (y) dy = P (Y ≥ a).
Chebychev Inequality: For a r.v. X with E(X) = µ and Var(X) = σ^2 < ∞, and any k > 0,
the probability that X deviates by more than k from the mean is bounded by

P (|X − µ| ≥ k) ≤ σ^2 /k^2 .

Sketch of Proof: Apply the Markov Inequality using Y = (X − µ)^2 and a = k^2 .
In measure theory, almost everywhere means a statement holds true for all but a set of measure zero.
Thinking of random variables as functions on our sample space, this is just pointwise convergence
of the random variables except perhaps on some set of measure zero.
Example 1 (Convergence in probability, but not almost surely.)
Let U be uniform on [0,1], and define the sequence of random variables Yn to all depend directly on
U according to Yn = U + 1An (U ) where intervals An are defined as the nth interval in the sequence
[0,1/2], [1/2,1], [0,1/3], [1/3,2/3], [2/3,1], [0,1/4],... That is, for observation U = u, Yn = u + 1 if
u ∈ An , otherwise Yn = u. Note these r.v.s Yn are not independent, since each depends directly
on U ! Observing that, as n → ∞, the width of interval An → 0, it follows that Yn converges in
probability to U since, for any 0 < ε ≤ 1,

lim_{n→∞} P (|Yn − U | ≥ ε) = lim_{n→∞} P (U ∈ An ) = 0.
But for a given outcome U = u, Yn (u) never converges since for any N > 0 there is always some
k > N where Yk (u) = 1 + u.
[Figure: scatterplots of Yn − U versus U for n = 7, . . . , 11, showing the unit-height spike over the
shrinking interval An .]
Laws of Large Numbers
Weak Law of Large Numbers (WLLN): Let Xi be iid with mean µ, and let X̄n = (1/n) Σ_{i=1}^n Xi .
Then X̄n converges in probability to µ, i.e., X̄n → µ in probability. That is, for all ε > 0,

lim_{n→∞} P (|X̄n − µ| ≥ ε) = 0 ⇔ lim_{n→∞} P (|X̄n − µ| < ε) = 1.
Proof (when Var(X) = σ^2 < ∞): Apply the Chebychev Inequality. This was first proven in the
1700s by Bernoulli, and incrementally generalized by Chebychev and then Markov.
Strong Law of Large Numbers (SLLN): Let Xi be iid with mean µ, and let X̄n = (1/n) Σ_{i=1}^n Xi .
Then X̄n converges almost surely to µ. That is, for all ε > 0,

P (lim_{n→∞} |X̄n − µ| ≥ ε) = 0 ⇔ P (lim_{n→∞} |X̄n − µ| < ε) = 1.
NOTE: Borel gave the first proof of the SLLN, 200 years later, in 1909. It was incrementally
improved by Cantelli, Khintchine (who named it the SLLN) and Kolmogorov (in the 1930s).
Weak vs Strong: Accordingly, almost sure convergence is called a stronger form of convergence
than convergence in probability, and convergence in distribution is weaker still.
NOTE: The WLLN and SLLN basically both state that the average of n iid random variables
(with mean µ < ∞) converges to µ as n → ∞. The Weak LLN states this in the weaker form
(X̄n → µ in probability), while the Strong LLN states this in the stronger form (X̄n → µ almost surely).
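The LLN is easy to see empirically: sample means of iid uniform(0, 1) draws (µ = 0.5) settle down near µ as n grows. A Python sketch, seeded for reproducibility:

```python
import random
from statistics import mean

# Sample means of iid uniform(0,1) draws; the true mean is mu = 0.5.
random.seed(7)
for n in [10, 1_000, 100_000]:
    xbar = mean(random.random() for _ in range(n))
    print(n, round(xbar, 4))    # compare each sample mean to mu = 0.5
```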
NOTE: Other CLTs relax the iid assumptions, but require additional conditions that must be
met. The Lyapunov CLT, for example, relaxes the assumption of identical distributions.