[Figure 6.1: A mind map of the concepts related to random variables and probability distributions, as described in this chapter, connecting density estimation (Chapter 11), the Bernoulli distribution, conjugacy, sufficient statistics, and independence.]
Modern probability is based on a set of axioms proposed by Kolmogorov (Grinstead and Snell, 1997; Jaynes, 2003) that introduce the three concepts of sample space, event space, and probability measure. The probability space models a real-world process (referred to as an experiment) with random outcomes.
The probability of a single event must lie in the interval [0, 1], and the total probability over all outcomes in the sample space Ω must be 1, i.e., P(Ω) = 1. Given a probability space (Ω, A, P), we want to use it to model some real-world phenomenon. In machine learning, we often avoid explicitly referring to the probability space, but instead refer to probabilities on quantities of interest, which we denote by T. In this book, we refer to T as the target space and refer to elements of T as states. We introduce a function X : Ω → T that takes an element of Ω (an outcome) and returns a particular quantity of interest x, a value in T. This association/mapping from Ω to T is called a random variable. For example, in the case of tossing two coins and counting the number of heads, a random variable X maps to the three possible outcomes: X(hh) = 2, X(ht) = 1, X(th) = 1, and X(tt) = 0. In this particular case, T = {0, 1, 2}, and it is the probabilities on elements of T that we are interested in. For a finite sample space Ω and finite T, the function corresponding to a random variable is essentially a lookup table. For any subset S ⊆ T, we associate P_X(S) ∈ [0, 1] (the probability) to a particular event occurring corresponding to the random variable X. Example 6.1 provides a concrete illustration of the terminology. (The name "random variable" is a great source of misunderstanding, as it is neither random nor is it a variable: it is a function.)
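To make the "lookup table" view concrete, here is a minimal sketch (ours, not from the book) in Python: the random variable for the two-coin example is literally a dictionary from outcomes to values in T, and P_X(S) for a subset S ⊆ T is accumulated from the outcome probabilities.

```python
# A random variable as a lookup table: X maps each outcome in Omega to a
# value in the target space T = {0, 1, 2} (the number of heads).
X = {"hh": 2, "ht": 1, "th": 1, "tt": 0}

def P_X(S, outcome_prob):
    """P_X(S): total probability of all outcomes that X maps into S."""
    return sum(p for outcome, p in outcome_prob.items() if X[outcome] in S)

# Fair coins: every outcome in Omega has probability 1/4.
fair = {"hh": 0.25, "ht": 0.25, "th": 0.25, "tt": 0.25}
print(P_X({1}, fair))     # 0.5: two of the four outcomes give one head
print(P_X({0, 2}, fair))  # 0.5: the complementary event
```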
Remark. The aforementioned sample space Ω unfortunately is referred to by different names in different books. Another common name for Ω is "state space" (Jacod and Protter, 2004), but state space is sometimes reserved for referring to states in a dynamical system (Hasselblatt and Katok, 2003). ♦
We assume that the reader is already familiar with computing probabilities of intersections and unions of sets of events. A gentler introduction to probability with many examples can be found in chapter 2 of Walpole et al. (2011).

Example 6.1
(This toy example is essentially a biased coin flip.)
Consider a statistical experiment where we model a funfair game con-
sisting of drawing two coins from a bag (with replacement). There are
coins from USA (denoted as $) and UK (denoted as £) in the bag, and
since we draw two coins from the bag, there are four outcomes in total.
The state space or sample space Ω of this experiment is then {($, $), ($, £), (£, $), (£, £)}. Let us assume that the composition of the bag of coins is such that a draw returns at random a $ with probability 0.3.
The event we are interested in is the total number of times the repeated
draw returns $. Let us define a random variable X that maps the sample
space Ω to T , which denotes the number of times we draw $ out of the
bag. We can see from the preceding sample space that we can get zero $, one $,
or two $s, and therefore T = {0, 1, 2}. The random variable X (a function
or lookup table) can be represented as a table like the following:
X(($, $)) = 2 (6.1)
X(($, £)) = 1 (6.2)
X((£, $)) = 1 (6.3)
X((£, £)) = 0 . (6.4)
Since we return the first coin we draw before drawing the second, this
implies that the two draws are independent of each other, which we will
discuss in Section 6.4.5. Note that there are two experimental outcomes,
which map to the same event, where only one of the draws returns $.
Therefore, the probability mass function (Section 6.2.1) of X is given by
P (X = 2) = P (($, $))
= P ($) · P ($)
= 0.3 · 0.3 = 0.09 (6.5)
P (X = 1) = P (($, £) ∪ (£, $))
= P (($, £)) + P ((£, $))
= 0.3 · (1 − 0.3) + (1 − 0.3) · 0.3 = 0.42 (6.6)
P (X = 0) = P ((£, £))
= P (£) · P (£)
= (1 − 0.3) · (1 − 0.3) = 0.49 . (6.7)
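As a quick numerical check of (6.5)–(6.7), the following sketch (ours, not the book's) enumerates the four outcomes of the two independent draws and accumulates the pmf of X:

```python
from itertools import product

p = {"$": 0.3, "£": 0.7}                  # probability of each coin per draw
pmf = {0: 0.0, 1: 0.0, 2: 0.0}
for draw in product(p, repeat=2):         # ($,$), ($,£), (£,$), (£,£)
    x = sum(coin == "$" for coin in draw)     # the random variable X
    pmf[x] += p[draw[0]] * p[draw[1]]         # independent draws multiply
print(pmf)   # {0: 0.49, 1: 0.42, 2: 0.09}, matching (6.5)-(6.7)
```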
6.1.3 Statistics
Probability theory and statistics are often presented together, but they con-
cern different aspects of uncertainty. One way of contrasting them is by the
kinds of problems that are considered. Using probability, we can consider
a model of some process, where the underlying uncertainty is captured
by random variables, and we use the rules of probability to derive what
happens. In statistics, we observe that something has happened and try
to figure out the underlying process that explains the observations. In this
sense, machine learning is close to statistics in its goals to construct a
model that adequately represents the process that generated the data. We
can use the rules of probability to obtain a “best-fitting” model for some
data.
Another aspect of machine learning systems is that we are interested
in generalization error (see Chapter 8). This means that we are actually
interested in the performance of our system on instances that we will observe in the future, which are not identical to the instances that we have seen so far.
[Figure 6.2: Visualization of a discrete bivariate probability mass function, with random variables X and Y: counts n_ij in a grid over states x_1, . . . , x_5 and y_1, y_2, y_3, with column sums c_i and row sums r_j. Adapted from Bishop (2006).]
Example 6.2
Consider two random variables X and Y , where X has five possible states
and Y has three possible states, as shown in Figure 6.2. We denote by nij
the number of events with state X = xi and Y = yj , and denote by
N the total number of events. The value $c_i$ is the sum of the individual frequencies for the $i$th column, that is, $c_i = \sum_{j=1}^{3} n_{ij}$. Similarly, the value $r_j$ is the row sum, that is, $r_j = \sum_{i=1}^{5} n_{ij}$. Using these definitions, we can
compactly express the distribution of X and Y .
The probability distribution of each random variable, the marginal
probability, can be seen as the sum over a row or column
$$P(X = x_i) = \frac{c_i}{N} = \frac{\sum_{j=1}^{3} n_{ij}}{N} \tag{6.10}$$
and
$$P(Y = y_j) = \frac{r_j}{N} = \frac{\sum_{i=1}^{5} n_{ij}}{N} \,, \tag{6.11}$$
where ci and rj are the ith column and j th row of the probability table,
respectively. By convention, for discrete random variables with a finite number of events, we assume that probabilities sum up to one, that is,
$$\sum_{i=1}^{5} P(X = x_i) = 1 \quad\text{and}\quad \sum_{j=1}^{3} P(Y = y_j) = 1 \,. \tag{6.12}$$
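The following sketch computes the marginals (6.10)–(6.12) from a table of counts n_ij; the count values are made up for illustration:

```python
import numpy as np

n = np.array([[3, 1, 0, 2, 4],    # rows: the three states of Y
              [1, 5, 2, 0, 1],    # columns: the five states of X
              [0, 2, 3, 1, 2]])
N = n.sum()                  # total number of events
P_X = n.sum(axis=0) / N      # column sums c_i / N, cf. (6.10)
P_Y = n.sum(axis=1) / N      # row sums r_j / N, cf. (6.11)
assert np.isclose(P_X.sum(), 1.0) and np.isclose(P_Y.sum(), 1.0)   # (6.12)
```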
Remark. We reiterate that there are in fact two distinct concepts when
talking about distributions. First is the idea of a pdf (denoted by f (x)),
which is a nonnegative function that sums to one. Second is the law of a
random variable X , that is, the association of a random variable X with
the pdf f (x). ♦
[Figure 6.3: Examples of discrete and continuous uniform distributions: (a) discrete distribution over states z; (b) continuous distribution over x; the vertical axis shows p(x). See Example 6.3 for details of the distributions.]
For most of this book, we will not use the notation f (x) and FX (x) as
we mostly do not need to distinguish between the pdf and cdf. However,
we will need to be careful about pdfs and cdfs in Section 6.7.
Example 6.3
We consider two examples of the uniform distribution, where each state is
equally likely to occur. This example illustrates some differences between
discrete and continuous probability distributions.
Let Z be a discrete uniform random variable with three states {z = −1.1, z = 0.3, z = 1.5}. (The actual values of these states are not meaningful here; we deliberately chose numbers to drive home the point that we do not want to use, and should ignore, the ordering of the states.) The probability mass function can be represented as a table of probability values:

z           −1.1   0.3   1.5
P(Z = z)    1/3    1/3   1/3

Alternatively, we can think of this as a graph (Figure 6.3(a)), where we use the fact that the states can be located on the x-axis, and the y-axis represents the probability of a particular state. The y-axis in Figure 6.3(a) is deliberately extended so that it is the same as in Figure 6.3(b).
Let X be a continuous random variable taking values in the range 0.9 ≤ X ≤ 1.6, as represented by Figure 6.3(b). Observe that the height of the density is 1/(1.6 − 0.9) ≈ 1.43, which is greater than 1: unlike a pmf, a pdf may exceed 1, since only its integral must equal 1.
naturally from fulfilling the desiderata (Jaynes, 2003, chapter 2). Prob-
abilistic modeling (Section 8.4) provides a principled foundation for de-
signing machine learning methods. Once we have defined probability dis-
tributions (Section 6.2) corresponding to the uncertainties of the data and
our problem, it turns out that there are only two fundamental rules, the
sum rule and the product rule.
Recall from (6.9) that p(x, y) is the joint distribution of the two ran-
dom variables x, y . The distributions p(x) and p(y) are the correspond-
ing marginal distributions, and p(y | x) is the conditional distribution of y
given x. Given the definitions of the marginal and conditional probability
for discrete and continuous random variables in Section 6.2, we can now
present the two fundamental rules in probability theory. (These two rules arise naturally (Jaynes, 2003) from the requirements we discussed in Section 6.1.1.) The first rule, the sum rule, states that
$$p(x) = \begin{cases} \sum_{y \in \mathcal{Y}} p(x, y) & \text{if } y \text{ is discrete} \\ \int_{\mathcal{Y}} p(x, y)\, dy & \text{if } y \text{ is continuous} \,, \end{cases} \tag{6.20}$$
where $\mathcal{Y}$ are the states of the target space of random variable Y. This means that we sum out (or integrate out) the set of states y of the random variable Y. The sum rule is also known as the marginalization property. The sum rule relates the joint distribution to a marginal distribution. In
general, when the joint distribution contains more than two random vari-
ables, the sum rule can be applied to any subset of the random variables,
resulting in a marginal distribution of potentially more than one random
variable. More concretely, if x = [x1 , . . . , xD ]> , we obtain the marginal
$$p(x_i) = \int p(x_1, \ldots, x_D)\, dx_{\setminus i} \,. \tag{6.21}$$
The second rule, known as the product rule, relates the joint distribution to the conditional distribution via
$$p(x, y) = p(y \mid x)\, p(x) \,. \tag{6.22}$$
The product rule can be interpreted as the fact that every joint distribution of two random variables can be factorized into (written as a product of) two other distributions. The two factors are the marginal distribution of the first random variable p(x), and the conditional distribution of the second random variable given the first, p(y | x). Since the ordering
of random variables is arbitrary in p(x, y), the product rule also implies
p(x, y) = p(x | y)p(y). To be precise, (6.22) is expressed in terms of the
probability mass functions for discrete random variables. For continuous
random variables, the product rule is expressed in terms of the probability
density functions (Section 6.2.3).
In machine learning and Bayesian statistics, we are often interested in
making inferences of unobserved (latent) random variables given that we
have observed other random variables. Let us assume we have some prior
knowledge p(x) about an unobserved random variable x and some rela-
tionship p(y | x) between x and a second random variable y , which we
can observe. If we observe y, we can use Bayes' theorem to draw some conclusions about x given the observed values of y. Bayes' theorem (also called Bayes' rule or Bayes' law)
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)} \tag{6.23}$$
is a direct consequence of the sum and product rules, where p(x) is the prior, p(y | x) the likelihood, and p(x | y) the posterior.
The quantity
$$p(y) := \int p(y \mid x)\, p(x)\, dx = \mathbb{E}_X[p(y \mid x)] \tag{6.27}$$
is the marginal likelihood/evidence. The right-hand side of (6.27) uses the expectation operator, which we define in Section 6.4.1. By definition, the
marginal likelihood integrates the numerator of (6.23) with respect to the
latent variable x. Therefore, the marginal likelihood is independent of
x, and it ensures that the posterior p(x | y) is normalized. The marginal
likelihood can also be interpreted as the expected likelihood where we
take the expectation with respect to the prior p(x). Beyond normalization
of the posterior, the marginal likelihood also plays an important role in
Bayesian model selection, as we will discuss in Section 8.6. Due to the integration in (8.44), the evidence is often hard to compute.
Bayes' theorem (6.23) allows us to invert the relationship between x and y given by the likelihood. Therefore, Bayes' theorem is sometimes called the probabilistic inverse. We will discuss Bayes' theorem further in Section 8.4.
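For a discrete target space, Bayes' theorem (6.23) is a few lines of array arithmetic. In the sketch below, the prior and likelihood values are made up for illustration; the evidence (6.27) appears as the normalizing constant:

```python
import numpy as np

prior = np.array([0.6, 0.4])                  # p(x) over two latent states
likelihood = np.array([[0.7, 0.2, 0.1],       # p(y | x = 0) over three y values
                       [0.1, 0.3, 0.6]])      # p(y | x = 1)
y = 2                                         # the observation
evidence = likelihood[:, y] @ prior           # p(y), the marginal likelihood
posterior = likelihood[:, y] * prior / evidence   # p(x | y), sums to 1
print(posterior)                              # [0.2, 0.8]
```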
Remark. In Bayesian statistics, the posterior distribution is the quantity
of interest as it encapsulates all available information from the prior and
the data. Instead of carrying the posterior around, it is possible to focus
on some statistic of the posterior, such as the maximum of the posterior,
which we will discuss in Section 8.3. However, focusing on some statistic
of the posterior leads to loss of information. If we think in a bigger con-
text, then the posterior can be used within a decision-making system, and
having the full posterior can be extremely useful and lead to decisions that
are robust to disturbances. For example, in the context of model-based re-
inforcement learning, Deisenroth et al. (2015) show that using the full
posterior distribution of plausible transition functions leads to very fast
(data/sample efficient) learning, whereas focusing on the maximum of
the posterior leads to consistent failures. Therefore, having the full pos-
terior can be very useful for a downstream task. In Chapter 9, we will
continue this discussion in the context of linear regression. ♦
Definition 6.3 (Expected Value). The expected value of a function g : ℝ → ℝ of a univariate continuous random variable X ∼ p(x) is given by
$$\mathbb{E}_X[g(x)] = \int_{\mathcal{X}} g(x)\, p(x)\, dx \,, \tag{6.28}$$
where $\mathcal{X}$ is the set of possible outcomes (the target space) of the random variable X.
where the subscript in $\mathbb{E}_{X_d}$ indicates that we are taking the expected value with respect to the dth element of the vector x. ♦
Definition 6.3 defines the meaning of the notation EX as the operator
indicating that we should take the integral with respect to the probabil-
ity density (for continuous distributions) or the sum over all states (for
discrete distributions). The definition of the mean (Definition 6.4) is a special case of the expected value, obtained by choosing g to be the identity function.
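As an illustrative sketch (not from the book), the operator E_X can be approximated by Monte Carlo: average g over samples from p(x). Here p is a standard Gaussian and g(x) = x², so the true value of E_X[g(x)] is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)    # draws from p(x) = N(0, 1)
estimate = np.mean(samples ** 2)            # approximates integral of g(x) p(x) dx
print(estimate)                             # close to 1.0, the exact expectation
```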
Definition 6.4 (Mean). The mean of a random variable X with states x ∈ ℝ^D is an average and is defined as the vector of expected values $\mathbb{E}_X[x] := [\mathbb{E}_{X_1}[x_1], \ldots, \mathbb{E}_{X_D}[x_D]]^\top \in \mathbb{R}^D$.
Example 6.4
Consider the two-dimensional distribution illustrated in Figure 6.4:
$$p(x) = 0.4\, \mathcal{N}\!\left(x \,\middle|\, \begin{bmatrix} 10 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) + 0.6\, \mathcal{N}\!\left(x \,\middle|\, \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 8.4 & 2.0 \\ 2.0 & 1.7 \end{bmatrix}\right) . \tag{6.33}$$
We will define the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ in Section 6.5. Also shown is its corresponding marginal distribution in each dimension. Observe that the distribution is bimodal (has two modes), but one of the marginal distributions is unimodal.
[Figure 6.4: Illustration of the mean, mode, and median for a two-dimensional dataset, as well as its marginal densities.]
Remark. The expected value (Definition 6.3) is a linear operator. For example, given a real-valued function f(x) = a g(x) + b h(x), where a, b ∈ ℝ and x ∈ ℝ^D, we obtain
$$\mathbb{E}_X[f(x)] = \int f(x)\, p(x)\, dx \tag{6.34a}$$
$$= \int [a\, g(x) + b\, h(x)]\, p(x)\, dx \tag{6.34b}$$
$$= a \int g(x)\, p(x)\, dx + b \int h(x)\, p(x)\, dx \tag{6.34c}$$
$$= a\, \mathbb{E}_X[g(x)] + b\, \mathbb{E}_X[h(x)] \,. \tag{6.34d}$$
♦
For two random variables, we may wish to characterize their correspondence to each other.
[Figure 6.5: Two-dimensional datasets with identical means and variances along each axis (colored lines) but with different covariances: (a) x and y are negatively correlated; (b) x and y are positively correlated.]
$$p(x_i) = \int p(x_1, \ldots, x_D)\, dx_{\setminus i} \,, \tag{6.39}$$
where "\i" denotes "all variables but i". The off-diagonal entries are the cross-covariance terms Cov[x_i, x_j] for i, j = 1, . . . , D, i ≠ j.
where $x_n \in \mathbb{R}^D$. Similar to the empirical mean, the empirical covariance matrix is a D × D matrix
$$\Sigma := \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^\top . \tag{6.42}$$
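A numeric sketch of the empirical mean and the (biased) empirical covariance (6.42), with a made-up data matrix of N observations in D dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                       # N = 500, D = 3
x_bar = X.mean(axis=0)                              # empirical mean
Sigma = (X - x_bar).T @ (X - x_bar) / len(X)        # D x D, divides by N
assert np.allclose(Sigma, np.cov(X.T, bias=True))   # agrees with numpy's biased estimator
```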
To compute the statistics for a particular dataset, we would use the realizations (observations) x_1, . . . , x_N and apply (6.41) and (6.42). Empirical covariance matrices are symmetric and positive semidefinite (see Section 3.2.3). (Throughout the book, we use the empirical covariance, which is a biased estimate; the unbiased, sometimes called corrected, covariance has the factor N − 1 in the denominator instead of N.)

6.4.3 Three Expressions for the Variance

We now focus on a single random variable X and use the preceding empirical formulas to derive three possible expressions for the variance. (The derivations are exercises at the end of this chapter.) The following derivation is the same for the population variance, except that we need to take care of integrals. The standard definition of variance, corresponding to the definition of covariance (Definition 6.5), is the expectation of the squared deviation of a random variable X from its expected value µ, i.e.,
$$\mathbb{V}_X[x] := \mathbb{E}_X[(x - \mu)^2] \,. \tag{6.43}$$
The expectation in (6.43) and the mean µ = E_X[x] are computed us-
ing (6.32), depending on whether X is a discrete or continuous random
variable. The variance as expressed in (6.43) is the mean of a new random
variable Z := (X − µ)2 .
When estimating the variance in (6.43) empirically, we need to resort to a two-pass algorithm: one pass through the data to calculate the mean µ using (6.41), and then a second pass using this estimate µ̂ to calculate the variance. It turns out that we can avoid two passes by rearranging the terms. The formula in (6.43) can be converted to the so-called raw-score formula for variance:
$$\mathbb{V}_X[x] = \mathbb{E}_X[x^2] - \left(\mathbb{E}_X[x]\right)^2 \,. \tag{6.44}$$
The expression in (6.44) can be remembered as "the mean of the square minus the square of the mean". It can be calculated empirically in one pass through the data, since we can accumulate x_i (to calculate the mean) and x_i^2 simultaneously, where x_i is the ith observation. Unfortunately, if implemented in this way, it can be numerically unstable: if the two terms in (6.44) are huge and approximately equal, we may suffer from an unnecessary loss of numerical precision in floating-point arithmetic. The raw-score version of the variance can nevertheless be useful in machine learning, e.g., when deriving the bias–variance decomposition (Bishop, 2006).
A third way to understand the variance is that it is a sum of pairwise differences between all pairs of observations. Consider a sample x_1, . . . , x_N of realizations of random variable X, and compute the squared difference between pairs x_i and x_j.
By expanding the square, we can show that the sum of the N² pairwise differences is the empirical variance of the observations:
$$\frac{1}{N^2} \sum_{i,j=1}^{N} (x_i - x_j)^2 = 2\left[\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2\right] . \tag{6.45}$$
We see that (6.45) is twice the raw-score expression (6.44). This means that we can express the sum of pairwise distances (of which there are N²) as a sum of deviations from the mean (of which there are N). Geometrically, this means that there is an equivalence between the pairwise distances and the distances from the center of the set of points. From a computational perspective, this means that by computing the mean (N terms in the summation), and then computing the variance (again N terms in the summation), we can obtain an expression (left-hand side of (6.45)) that has N² terms.
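The sketch below checks the three expressions numerically on an arbitrary sample: the two-pass definition (6.43), the one-pass raw-score formula (6.44), and half the mean pairwise squared difference from (6.45):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
two_pass  = np.mean((x - x.mean()) ** 2)                 # (6.43), empirical
raw_score = np.mean(x ** 2) - x.mean() ** 2              # (6.44), one pass
pairwise  = np.mean((x[:, None] - x[None, :]) ** 2) / 2  # half of (6.45)'s LHS
print(two_pass, raw_score, pairwise)                     # all equal 4.0
```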
Example 6.5
Consider a random variable X with zero mean ($\mathbb{E}_X[x] = 0$) and also $\mathbb{E}_X[x^3] = 0$. Let $y = x^2$ (hence, Y is dependent on X) and consider the covariance (6.36) between X and Y. But this gives
$$\mathrm{Cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y] = \mathbb{E}[x^3] = 0 \,. \tag{6.54}$$
[Figure 6.6: Geometry of random variables. If random variables X and Y are uncorrelated, they are orthogonal vectors in a corresponding vector space, and the Pythagorean theorem applies: a right triangle with legs of length $\sqrt{\mathrm{var}[x]}$ and $\sqrt{\mathrm{var}[y]}$ has hypotenuse $\sqrt{\mathrm{var}[x + y]}$.]
[Figure 6.7: Gaussian distribution of two random variables x₁ and x₂, showing the joint density p(x₁, x₂).]
[Figure 6.8: Gaussian distributions: (a) one-dimensional case; (b) two-dimensional case.]
Example 6.6
[Figure 6.9: (a) Bivariate Gaussian with the conditioning slice x₂ = −1 indicated; (b) the marginal of a joint Gaussian distribution is Gaussian; (c) the conditional distribution of a Gaussian is also Gaussian.]
Since expectations are linear operations, we can obtain the weighted sum of independent Gaussian random variables
$$p(ax + by) = \mathcal{N}\big(a\mu_x + b\mu_y,\; a^2\Sigma_x + b^2\Sigma_y\big) \,. \tag{6.79}$$

Example 6.7
Consider a mixture of two univariate Gaussian densities
$$p(x) = \alpha\, p_1(x) + (1 - \alpha)\, p_2(x) \,, \tag{6.80}$$
where the scalar 0 < α < 1 is the mixture weight, and p₁(x) and p₂(x) are univariate Gaussian densities (Equation (6.62)) with different parameters, i.e., (µ₁, σ₁²) ≠ (µ₂, σ₂²).
Then the mean of the mixture density p(x) is given by the weighted sum of the means of each random variable:
$$\mathbb{E}[x] = \alpha \mu_1 + (1 - \alpha) \mu_2 \,. \tag{6.82}$$
Proof The mean of the mixture density p(x) is given by the weighted
sum of the means of each random variable. We apply the definition of the
mean (Definition 6.4), and plug in our mixture (6.80), which yields
$$\mathbb{E}[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx \tag{6.83a}$$
$$= \int_{-\infty}^{\infty} \big(\alpha\, x\, p_1(x) + (1 - \alpha)\, x\, p_2(x)\big)\, dx \tag{6.83b}$$
$$= \alpha \int_{-\infty}^{\infty} x\, p_1(x)\, dx + (1 - \alpha) \int_{-\infty}^{\infty} x\, p_2(x)\, dx \tag{6.83c}$$
$$= \alpha \mu_1 + (1 - \alpha) \mu_2 \,. \tag{6.83d}$$
To compute the variance, we can use the raw-score version of the vari-
ance from (6.44), which requires an expression of the expectation of the
squared random variable. Here we use the definition of an expectation of
a function (the square) of a random variable (Definition 6.3),
$$\mathbb{E}[x^2] = \int_{-\infty}^{\infty} x^2\, p(x)\, dx \tag{6.84a}$$
$$= \int_{-\infty}^{\infty} \big(\alpha\, x^2 p_1(x) + (1 - \alpha)\, x^2 p_2(x)\big)\, dx \,. \tag{6.84b}$$
Remark. The preceding derivation holds for any density, but since the
Gaussian is fully determined by the mean and variance, the mixture den-
sity can be determined in closed form. ♦
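A Monte Carlo sketch of these mixture results, with illustrative parameters: the sample mean should match (6.83d), and the sample variance should match (6.44) with E[x²] = α(µ₁² + σ₁²) + (1 − α)(µ₂² + σ₂²), using the fact that E[x²] = µ² + σ² for each Gaussian component:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, mu1, s1, mu2, s2 = 0.3, -1.0, 0.5, 2.0, 1.5
pick = rng.random(1_000_000) < alpha            # choose component 1 w.p. alpha
x = np.where(pick, rng.normal(mu1, s1, pick.shape),
                   rng.normal(mu2, s2, pick.shape))
mean = alpha * mu1 + (1 - alpha) * mu2                           # (6.83d)
mean_sq = alpha * (mu1**2 + s1**2) + (1 - alpha) * (mu2**2 + s2**2)
print(x.mean(), mean)                 # Monte Carlo vs. closed form
print(x.var(), mean_sq - mean**2)     # raw-score variance, cf. (6.44)
```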
For a mixture density, the individual components can be considered
to be conditional distributions (conditioned on the component identity).
Equation (6.85c) is an example of the conditional variance formula, also known as the law of total variance, which generally states that for two random variables X and Y it holds that $\mathbb{V}_X[x] = \mathbb{E}_Y[\mathbb{V}_X[x \mid y]] + \mathbb{V}_Y[\mathbb{E}_X[x \mid y]]$, i.e., the (total) variance of X is the expected conditional variance plus the variance of the conditional mean.
We consider in Example 6.17 a bivariate standard Gaussian random variable X on which we perform a linear transformation Ax. The outcome is a Gaussian random variable with mean zero and covariance AA⊤. Observe that adding a constant vector will change the mean of the distribution without affecting its variance, that is, the random variable x + µ is Gaussian with mean µ and identity covariance. Hence, any linear/affine transformation of a Gaussian random variable is Gaussian distributed.
Consider a Gaussian distributed random variable $X \sim \mathcal{N}(\mu, \Sigma)$. For a given matrix A of appropriate shape, let Y be a random variable such that y = Ax is a transformed version of x. We can compute the mean of y by exploiting that the expectation is a linear operator (6.50) as follows:
$$\mathbb{E}[y] = \mathbb{E}[Ax] = A\,\mathbb{E}[x] = A\mu \,.$$
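A quick sketch (with an illustrative µ, Σ, and A) confirming that y = Ax has mean Aµ and covariance AΣA⊤:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])                       # maps R^2 to R^3
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T                                       # y_n = A x_n for every sample
print(y.mean(axis=0), A @ mu)                     # empirical mean vs. A mu
print(np.allclose(np.cov(y.T), A @ Sigma @ A.T, atol=0.1))   # covariance A Sigma A^T
```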
It turns out that the class of distributions called the exponential family
provides the right balance of generality while retaining favorable compu-
tation and inference properties. Before we introduce the exponential fam-
ily, let us see three more members of “named” probability distributions,
the Bernoulli (Example 6.8), Binomial (Example 6.9), and Beta (Exam-
ple 6.10) distributions.
Example 6.8
The Bernoulli distribution is a distribution for a single binary random variable X with state x ∈ {0, 1}. It is governed by a single continuous parameter µ ∈ [0, 1] that represents the probability of X = 1. The Bernoulli distribution Ber(µ) is defined as
$$p(x \mid \mu) = \mu^x (1 - \mu)^{1-x} \,, \quad x \in \{0, 1\} \,, \tag{6.92}$$
$$\mathbb{E}[x] = \mu \,, \tag{6.93}$$
$$\mathbb{V}[x] = \mu(1 - \mu) \,, \tag{6.94}$$
where E[x] and V[x] are the mean and variance of the binary random
variable X .
[Figure 6.10: Examples of the Binomial distribution p(m) for µ ∈ {0.1, 0.4, 0.75} and N = 15, plotted against the number m of observations x = 1 in N = 15 experiments.]
[Figure 6.11: Examples of the Beta distribution p(µ | α, β) for different values of α and β: α = β = 0.5; α = β = 1; α = 2, β = 0.3; α = 4, β = 10; α = 5, β = 1.]
Remark. There is a whole zoo of distributions with names, and they are
related in different ways to each other (Leemis and McQueston, 2008).
It is worth keeping in mind that each named distribution is created for a particular reason, but may have other applications. Knowing the reason behind the creation of a particular distribution often allows insight into how to best use it. We introduced the preceding three distributions to be able to illustrate conjugacy and exponential families in what follows. ♦
6.6.1 Conjugacy
According to Bayes’ theorem (6.23), the posterior is proportional to the
product of the prior and the likelihood. The specification of the prior can
be tricky for two reasons: First, the prior should encapsulate our knowl-
edge about the problem before we see any data. This is often difficult to
describe. Second, it is often not possible to compute the posterior distribu-
tion analytically. However, there are some priors that are computationally convenient: conjugate priors.
Definition 6.13 (Conjugate Prior). A prior is conjugate for the likelihood function if the posterior is of the same form/type as the prior.
Conjugacy is particularly convenient because we can algebraically cal-
culate our posterior distribution by updating the parameters of the prior
distribution.
Remark. When considering the geometry of probability distributions, con-
jugate priors retain the same distance structure as the likelihood (Agarwal
and Daumé III, 2010). ♦
To introduce a concrete example of conjugate priors, we describe in Ex-
ample 6.11 the Binomial distribution (defined on discrete random vari-
ables) and the Beta distribution (defined on continuous random vari-
ables).
∝ Beta(h + α, N − h + β) , (6.104d)
i.e., the posterior distribution is a Beta distribution like the prior: the Beta prior is conjugate for the parameter µ in the Binomial likelihood function.
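Because the update is closed form, a conjugate Beta-Binomial computation is just parameter bookkeeping. A sketch with illustrative prior hyperparameters and data:

```python
# Beta(a, b) prior on mu; observing h successes in N trials gives the
# posterior Beta(h + a, N - h + b), as in (6.104d).
a, b = 2.0, 2.0            # prior hyperparameters (illustrative)
N, h = 10, 7               # data: h successes out of N trials
post_a, post_b = h + a, N - h + b
posterior_mean = post_a / (post_a + post_b)   # mean of a Beta distribution
print(post_a, post_b, posterior_mean)         # 9.0 5.0 0.642...
```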
Table 6.2 lists examples of conjugate priors for the parameters of some standard likelihoods used in probabilistic modeling. Distributions such as Multinomial, inverse Gamma, inverse Wishart, and Dirichlet can be found in any statistical text, and are described in Bishop (2006), for example. The Beta distribution is the conjugate prior for the parameter µ in both the Binomial and the Bernoulli likelihood. For a Gaussian likelihood function, we can place a conjugate Gaussian prior on the mean. The reason why the Gaussian likelihood appears twice in the table is that we need to distinguish the univariate from the multivariate case. In the univariate (scalar) case, the inverse Gamma is the conjugate prior for the variance. In the multivariate case, we use a conjugate inverse Wishart distribution as a prior on the covariance matrix. (The Gamma prior is conjugate for the precision (inverse variance) in the univariate Gaussian likelihood, and the Wishart prior is conjugate for the precision matrix (inverse covariance matrix) in the multivariate Gaussian likelihood.) The Dirichlet distribution is the conjugate prior for the multinomial likelihood function. For further details, we refer to Bishop (2006).
$$\theta = \log \frac{\mu}{1 - \mu} \tag{6.115}$$
$$\phi(x) = x \tag{6.116}$$
$$A(\theta) = -\log(1 - \mu) = \log(1 + \exp(\theta)) \,. \tag{6.117}$$
The relationship between θ and µ is invertible, so that
$$\mu = \frac{1}{1 + \exp(-\theta)} \,. \tag{6.118}$$
The relation (6.118) is used to obtain the right equality of (6.117).
Example 6.15
Recall the exponential family form of the Bernoulli distribution (6.113d),
$$p(x \mid \mu) = \exp\!\left[x \log \frac{\mu}{1 - \mu} + \log(1 - \mu)\right] . \tag{6.121}$$
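A small sketch checking that the exponential-family form agrees with the standard Bernoulli pmf (6.92), with an illustrative µ:

```python
import numpy as np

mu = 0.3
theta = np.log(mu / (1 - mu))          # natural parameter, (6.115)
A = np.log1p(np.exp(theta))            # log-partition function, (6.117)
for x in (0, 1):
    direct = mu**x * (1 - mu)**(1 - x)     # (6.92)
    expfam = np.exp(theta * x - A)         # exponential-family form
    assert np.isclose(direct, expfam)
print(1 / (1 + np.exp(-theta)))        # sigmoid recovers mu, (6.118)
```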
variables. However, we may not be able to obtain the functional form of the
distribution under transformations. Furthermore, we may be interested
in nonlinear transformations of random variables for which closed-form
expressions are not readily available.
Remark (Notation). In this section, we will be explicit about random vari-
ables and the values they take. Hence, recall that we use capital letters
X, Y to denote random variables and small letters x, y to denote the val-
ues in the target space T that the random variables take. We will explicitly
write pmfs of discrete random variables X as P (X = x). For continuous
random variables X (Section 6.2.2), the pdf is written as f (x) and the cdf
is written as FX (x). ♦
We will look at two approaches for obtaining distributions of transformations of random variables: a direct approach using the definition of a cumulative distribution function and a change-of-variable approach that uses the chain rule of calculus (Section 5.2.2). (Moment generating functions can also be used to study transformations of random variables (Casella and Berger, 2002, chapter 2).) The change-of-variable approach is widely used because it provides a "recipe" for attempting to compute the resulting distribution due to a transformation. We will explain the techniques for univariate random variables, and will only briefly provide the results for the general case of multivariate random variables.
Transformations of discrete random variables can be understood directly. Suppose that there is a discrete random variable X with pmf P(X = x) (Section 6.2.1), and an invertible function U(x). Consider the transformed random variable Y := U(X), with pmf P(Y = y). Then
$$P(Y = y) = P(U(X) = y) = P(X = U^{-1}(y)) \,.$$
We also need to keep in mind that the domain of the random variable may
have changed due to the transformation by U .
Example 6.16
Let X be a continuous random variable with probability density function
$$f(x) = 3x^2 \quad \text{on } 0 \leq x \leq 1 \,. \tag{6.128}$$
We are interested in finding the pdf of Y = X².
The function f is an increasing function of x, and therefore the resulting value of y lies in the interval [0, 1]. We obtain
$$F_Y(y) = P(Y \leq y) \quad\text{definition of cdf} \tag{6.129a}$$
$$= P(X^2 \leq y) \quad\text{transformation of interest} \tag{6.129b}$$
$$= P(X \leq y^{\frac{1}{2}}) \quad\text{inverse} \tag{6.129c}$$
$$= F_X(y^{\frac{1}{2}}) \quad\text{definition of cdf} \tag{6.129d}$$
$$= \int_0^{y^{\frac{1}{2}}} 3t^2\, dt \quad\text{cdf as a definite integral} \tag{6.129e}$$
$$= \big[t^3\big]_{t=0}^{t=y^{\frac{1}{2}}} \quad\text{result of integration} \tag{6.129f}$$
$$= y^{\frac{3}{2}} \,, \quad 0 \leq y \leq 1 \,. \tag{6.129g}$$
Therefore, the cdf of Y is
$$F_Y(y) = y^{\frac{3}{2}} \tag{6.130}$$
for 0 ≤ y ≤ 1. To obtain the pdf, we differentiate the cdf:
$$f(y) = \frac{d}{dy} F_Y(y) = \frac{3}{2}\, y^{\frac{1}{2}} \tag{6.131}$$
for 0 ≤ y ≤ 1.
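A Monte Carlo sketch of this example: we sample X from f(x) = 3x² via its inverse cdf (F_X(x) = x³, so X = U^{1/3} for uniform U), square the samples, and compare a histogram of Y against the derived pdf (3/2)·y^{1/2}:

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.random(1_000_000)
x = u ** (1 / 3)                    # X ~ f(x) = 3 x^2 by inverse-cdf sampling
y = x ** 2                          # the transformation of interest
hist, edges = np.histogram(y, bins=20, range=(0, 1), density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - 1.5 * np.sqrt(centers))))   # small; matches (6.131)
```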
Theorem 6.15. Let X be a continuous random variable with a strictly monotonic cumulative distribution function F_X(x). Then the random variable Y defined as
$$Y := F_X(X) \tag{6.132}$$
has a uniform distribution.
Theorem 6.15 is known as the probability integral transform, and it is used to derive algorithms for sampling from distributions by transforming the result of sampling from a uniform random variable (Bishop, 2006).
The algorithm works by first generating a sample from a uniform distribu-
tion, then transforming it by the inverse cdf (assuming this is available)
to obtain a sample from the desired distribution. The probability integral
transform is also used for hypothesis testing whether a sample comes from
a particular distribution (Lehmann and Romano, 2005). The idea that the
output of a cdf gives a uniform distribution also forms the basis of copu-
las (Nelsen, 2006).
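A sketch of both directions of the recipe, assuming SciPy is available for the Gaussian cdf and its inverse (`scipy.stats.norm.cdf` / `.ppf`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
z = rng.standard_normal(100_000)
u = stats.norm.cdf(z)                   # Y = F_X(X) is uniform on (0, 1)
print(np.isclose(u.mean(), 0.5, atol=0.01), np.isclose(u.var(), 1/12, atol=0.01))

z2 = stats.norm.ppf(rng.random(100_000))     # inverse cdf of uniform samples
print(np.isclose(z2.std(), 1.0, atol=0.01))  # ...yields Gaussian samples again
```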
Let us break down the reasoning step by step, with the goal of understanding the more general change-of-variables approach in Theorem 6.16. (Change of variables in probability relies on the change-of-variables method in calculus (Tandra, 2014).)
Remark. The name "change of variables" comes from the idea of changing the variable of integration when faced with a difficult integral. For univariate functions, we use the substitution rule of integration,
$$\int f(g(x))\, g'(x)\, dx = \int f(u)\, du \,, \quad \text{where } u = g(x) \,. \tag{6.133}$$
The derivation of this rule is based on the chain rule of calculus (5.32) and on applying the fundamental theorem of calculus twice. The fundamental theorem of calculus formalizes the fact that integration and differentiation are somehow "inverses" of each other. An intuitive understanding of the rule can be obtained by thinking (loosely) about small changes (differentials) to the equation u = g(x), that is, by considering ∆u = g′(x)∆x as a differential of u = g(x). By substituting u = g(x), the argument inside the integral on the right-hand side of (6.133) becomes f(g(x)). By pretending that the term du can be approximated by du ≈ ∆u = g′(x)∆x, and that dx ≈ ∆x, we obtain (6.133). ♦
Consider a univariate random variable X , and an invertible function
U , which gives us another random variable Y = U (X). We assume that
random variable X has states x ∈ [a, b]. By the definition of the cdf, we
have
$$F_Y(y) = P(Y \leq y) \,. \tag{6.134}$$
Theorem 6.16 (Theorem 17.2 in Billingsley (1995)). Let f(x) be the value of the probability density of the multivariate continuous random variable X. If the vector-valued function y = U(x) is differentiable and invertible for all values within the domain of x, then for corresponding values of y, the probability density of Y = U(X) is given by
$$f(y) = f_x\big(U^{-1}(y)\big) \cdot \left|\det\!\left(\frac{\partial}{\partial y} U^{-1}(y)\right)\right| \,. \tag{6.144}$$
The theorem looks intimidating at first glance, but the key point is that
a change of variable of a multivariate random variable follows the pro-
cedure of the univariate change of variable. First we need to work out
the inverse transform, and substitute that into the density of x. Then we
calculate the determinant of the Jacobian and multiply the result. The
following example illustrates the case of a bivariate random variable.
Example 6.17
Consider a bivariate random variable X with states $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ and probability density function
$$f\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \frac{1}{2\pi} \exp\!\left(-\frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) . \tag{6.145}$$
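For an invertible linear map y = Ax, the recipe of Theorem 6.16 gives f(y) = f_x(A⁻¹y)·|det A⁻¹|. A sketch (with an illustrative A) checking this against the known result that Y = AX is N(0, AA⊤) when X is standard bivariate Gaussian as in (6.145):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])           # illustrative invertible transform
A_inv = np.linalg.inv(A)

def f(x):                            # standard bivariate Gaussian, (6.145)
    return np.exp(-0.5 * x @ x) / (2 * np.pi)

def f_y(y):                          # change-of-variables formula (6.144)
    return f(A_inv @ y) * abs(np.linalg.det(A_inv))

y = np.array([0.7, -1.2])            # an arbitrary test point
S = A @ A.T                          # covariance of Y = A X
gauss = np.exp(-0.5 * y @ np.linalg.inv(S) @ y) / (2 * np.pi * np.sqrt(np.linalg.det(S)))
print(np.isclose(f_y(y), gauss))     # True
```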
Exercises
6.1 Consider the following bivariate distribution p(x, y) of two discrete random
variables X and Y .
[The table of joint probabilities over the states x₁, . . . , x₅ of X and y₁, y₂, y₃ of Y is omitted here.]
Compute:
a. The marginal distributions p(x) and p(y).
b. The conditional distributions p(x|Y = y1 ) and p(y|X = x3 ).
6.2 Consider a mixture of two Gaussian distributions (illustrated in Figure 6.4),
$$0.4\, \mathcal{N}\!\left(\begin{bmatrix} 10 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) + 0.6\, \mathcal{N}\!\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 8.4 & 2.0 \\ 2.0 & 1.7 \end{bmatrix}\right) .$$
Choose a conjugate prior for the Bernoulli likelihood and compute the pos-
terior distribution p(µ | x1 , . . . , xN ).
6.4 There are two bags. The first bag contains four mangos and two apples; the
second bag contains four mangos and four apples.
We also have a biased coin, which shows "heads" with probability 0.6 and "tails" with probability 0.4. If the coin shows "heads", we pick a fruit at random from bag 1; otherwise we pick a fruit at random from bag 2.
Your friend flips the coin (you cannot see the result), picks a fruit at random
from the corresponding bag, and presents you a mango.
What is the probability that the mango was picked from bag 2?
Hint: Use Bayes’ theorem.
6.5 Consider the time-series model
$$x_{t+1} = A x_t + w \,, \quad w \sim \mathcal{N}(0, Q)$$
$$y_t = C x_t + v \,, \quad v \sim \mathcal{N}(0, R) \,,$$
where w, v are i.i.d. Gaussian noise variables. Further, assume that $p(x_0) = \mathcal{N}(\mu_0, \Sigma_0)$.
$$C = (A^{-1} + B^{-1})^{-1}$$
$$c = C(A^{-1} a + B^{-1} b)$$
$$c = (2\pi)^{-\frac{D}{2}} \,|A + B|^{-\frac{1}{2}} \exp\!\left(-\tfrac{1}{2}(a - b)^\top (A + B)^{-1} (a - b)\right) .$$
Furthermore, we have
$$y = Ax + b + w \,,$$