
6 Probability and Distributions

This material is published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. ©2021 by M. P. Deisenroth, A. A. Faisal, and C. S. Ong. https://mml-book.com.

Probability, loosely speaking, concerns the study of uncertainty. Probability can be thought of as the fraction of times an event occurs, or as a degree of belief about an event. We then would like to use this probability to measure the chance of something occurring in an experiment. As mentioned in Chapter 1, we often quantify uncertainty in the data, uncertainty in the machine learning model, and uncertainty in the predictions produced by the model. Quantifying uncertainty requires the idea of a random variable, which is a function that maps outcomes of random experiments to a set of properties that we are interested in. Associated with the random variable is a function that measures the probability that a particular outcome (or set of outcomes) will occur; this is called the probability distribution.

Probability distributions are used as a building block for other concepts, such as probabilistic modeling (Section 8.4), graphical models (Section 8.5), and model selection (Section 8.6). In the next section, we present the three concepts that define a probability space (the sample space, the events, and the probability of an event) and how they are related to a fourth concept called the random variable. The presentation is deliberately slightly hand-wavy, since a rigorous presentation might occlude the intuition behind the concepts. An outline of the concepts presented in this chapter is shown in Figure 6.1.

[Figure 6.1: A mind map of the concepts related to random variables and probability distributions, as described in this chapter. The map links the random variable and its distribution to summary statistics (mean, variance), the sum and product rules, Bayes' theorem, independence, sufficient statistics, the exponential family, and example distributions (Gaussian, Bernoulli, Beta), with pointers to Chapters 9–11.]

6.1 Construction of a Probability Space


The theory of probability aims at defining a mathematical structure to describe random outcomes of experiments. For example, when tossing a single coin, we cannot determine the outcome, but by doing a large number of coin tosses, we can observe a regularity in the average outcome. Using this mathematical structure of probability, the goal is to perform automated reasoning, and in this sense, probability generalizes logical reasoning (Jaynes, 2003).

6.1.1 Philosophical Issues


When constructing automated reasoning systems, classical Boolean logic does not allow us to express certain forms of plausible reasoning. Consider the following scenario: We observe that A is false. We find B becomes less plausible, although no conclusion can be drawn from classical logic. We observe that B is true. It seems A becomes more plausible. We use this form of reasoning daily. We are waiting for a friend, and consider three possibilities: H1, she is on time; H2, she has been delayed by traffic; and H3, she has been abducted by aliens. When we observe our friend is late, we must logically rule out H1. We also tend to consider H2 to be more likely, though we are not logically required to do so. Finally, we may consider H3 to be possible, but we continue to consider it quite unlikely.
How do we conclude H2 is the most plausible answer? Seen in this way, probability theory can be considered a generalization of Boolean logic. In the context of machine learning, it is often applied in this way to formalize the design of automated reasoning systems. Further arguments about how probability theory is the foundation of reasoning systems can be found in Pearl (1988).

"For plausible reasoning it is necessary to extend the discrete true and false values of truth to continuous plausibilities" (Jaynes, 2003).
The philosophical basis of probability and how it should be somehow related to what we think should be true (in the logical sense) was studied by Cox (Jaynes, 2003). Another way to think about it is that if we are precise about our common sense, we end up constructing probabilities. E. T. Jaynes (1922–1998) identified three mathematical criteria, which must apply to all plausibilities:

1. The degrees of plausibility are represented by real numbers.
2. These numbers must be based on the rules of common sense.


3. The resulting reasoning must be consistent, with the three following meanings of the word "consistent":

(a) Consistency or non-contradiction: When the same result can be reached through different means, the same plausibility value must be found in all cases.
(b) Honesty: All available data must be taken into account.
(c) Reproducibility: If our states of knowledge about two problems are the same, then we must assign the same degree of plausibility to both of them.

The Cox–Jaynes theorem proves these plausibilities to be sufficient to define the universal mathematical rules that apply to plausibility p, up to transformation by an arbitrary monotonic function. Crucially, these rules are the rules of probability.
Remark. In machine learning and statistics, there are two major interpretations of probability: the Bayesian and frequentist interpretations (Bishop, 2006; Efron and Hastie, 2016). The Bayesian interpretation uses probability to specify the degree of uncertainty that the user has about an event. It is sometimes referred to as "subjective probability" or "degree of belief". The frequentist interpretation considers the relative frequencies of events of interest to the total number of events that occurred. The probability of an event is defined as the relative frequency of the event in the limit when one has infinite data. ♦

Some machine learning texts on probabilistic models use lazy notation and jargon, which is confusing. This text is no exception. Multiple distinct concepts are all referred to as "probability distribution", and the reader often has to disentangle the meaning from the context. One trick to help make sense of probability distributions is to check whether we are trying to model something categorical (a discrete random variable) or something continuous (a continuous random variable). The kinds of questions we tackle in machine learning are closely related to whether we are considering categorical or continuous models.

6.1.2 Probability and Random Variables

There are three distinct ideas that are often confused when discussing probabilities. First is the idea of a probability space, which allows us to quantify the idea of a probability. However, we mostly do not work directly with this basic probability space. Instead, we work with random variables (the second idea), which transfer the probability to a more convenient (often numerical) space. The third idea is the idea of a distribution or law associated with a random variable. We will introduce the first two ideas in this section and expand on the third idea in Section 6.2.

Modern probability is based on a set of axioms proposed by Kolmogorov (Grinstead and Snell, 1997; Jaynes, 2003) that introduce the three concepts of sample space, event space, and probability measure. The probability space models a real-world process (referred to as an experiment) with random outcomes.

The sample space Ω: The sample space is the set of all possible outcomes of the experiment, usually denoted by Ω. For example, two successive coin tosses have a sample space of {hh, tt, ht, th}, where "h" denotes "heads" and "t" denotes "tails".

The event space A: The event space is the space of potential results of the experiment. A subset A of the sample space Ω is in the event space A if at the end of the experiment we can observe whether a particular outcome ω ∈ Ω is in A. The event space A is obtained by considering the collection of subsets of Ω, and for discrete probability distributions (Section 6.2.1) A is often the power set of Ω.

The probability P: With each event A ∈ A, we associate a number P(A) that measures the probability or degree of belief that the event will occur. P(A) is called the probability of A.

The probability of a single event must lie in the interval [0, 1], and the total probability over all outcomes in the sample space Ω must be 1, i.e., P(Ω) = 1. Given a probability space (Ω, A, P), we want to use it to model some real-world phenomenon. In machine learning, we often avoid explicitly referring to the probability space, but instead refer to probabilities on quantities of interest, which we denote by T. In this book, we refer to T as the target space and refer to elements of T as states. We introduce a function X : Ω → T that takes an element of Ω (an outcome) and returns a particular quantity of interest x, a value in T. This association/mapping from Ω to T is called a random variable. (The name "random variable" is a great source of misunderstanding, as it is neither random nor is it a variable: it is a function.) For example, in the case of tossing two coins and counting the number of heads, a random variable X maps to the three possible outcomes: X(hh) = 2, X(ht) = 1, X(th) = 1, and X(tt) = 0. In this particular case, T = {0, 1, 2}, and it is the probabilities on elements of T that we are interested in. For a finite sample space Ω and finite T, the function corresponding to a random variable is essentially a lookup table. For any subset S ⊆ T, we associate P_X(S) ∈ [0, 1] (the probability) to a particular event occurring corresponding to the random variable X. Example 6.1 provides a concrete illustration of the terminology.
Remark. The aforementioned sample space Ω unfortunately is referred to by different names in different books. Another common name for Ω is "state space" (Jacod and Protter, 2004), but state space is sometimes reserved for referring to states in a dynamical system (Hasselblatt and Katok, 2003). Other names sometimes used to describe Ω are "sample description space", "possibility space", and "event space". ♦

Example 6.1
We assume that the reader is already familiar with computing probabilities of intersections and unions of sets of events. A gentler introduction to probability with many examples can be found in chapter 2 of Walpole et al. (2011). This toy example is essentially a biased coin flip.
Consider a statistical experiment where we model a funfair game consisting of drawing two coins from a bag (with replacement). There are coins from USA (denoted as $) and UK (denoted as £) in the bag, and since we draw two coins from the bag, there are four outcomes in total. The state space or sample space Ω of this experiment is then {($, $), ($, £), (£, $), (£, £)}. Let us assume that the composition of the bag of coins is such that a draw returns at random a $ with probability 0.3.
The event we are interested in is the total number of times the repeated draw returns $. Let us define a random variable X that maps the sample space Ω to T, which denotes the number of times we draw $ out of the bag. We can see from the preceding sample space that we can get zero $, one $, or two $s, and therefore T = {0, 1, 2}. The random variable X (a function or lookup table) can be represented as a table like the following:

X(($, $)) = 2   (6.1)
X(($, £)) = 1   (6.2)
X((£, $)) = 1   (6.3)
X((£, £)) = 0   (6.4)

Since we return the first coin we draw before drawing the second, the two draws are independent of each other, which we will discuss in Section 6.4.5. Note that there are two experimental outcomes that map to the same event, where only one of the draws returns $. Therefore, the probability mass function (Section 6.2.1) of X is given by

P(X = 2) = P(($, $)) = P($) · P($) = 0.3 · 0.3 = 0.09   (6.5)
P(X = 1) = P(($, £) ∪ (£, $)) = P(($, £)) + P((£, $)) = 0.3 · (1 − 0.3) + (1 − 0.3) · 0.3 = 0.42   (6.6)
P(X = 0) = P((£, £)) = P(£) · P(£) = (1 − 0.3) · (1 − 0.3) = 0.49   (6.7)
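The pmf above can be checked mechanically. The following Python sketch (our illustration, not from the book; representing outcomes as tuples is an arbitrary choice) enumerates the sample space Ω and pushes the outcome probabilities through the random variable X:

```python
# A minimal sketch of Example 6.1: enumerate the sample space of two
# independent draws and compute the pmf of X, the number of $ coins.
from itertools import product

p_dollar = 0.3  # probability that a single draw returns $

# Sample space: ordered pairs of coin types, with outcome probabilities.
outcomes = {
    pair: (p_dollar if pair[0] == "$" else 1 - p_dollar)
          * (p_dollar if pair[1] == "$" else 1 - p_dollar)
    for pair in product(["$", "£"], repeat=2)
}

# Random variable X: map each outcome to the number of $ coins drawn.
pmf = {}
for pair, prob in outcomes.items():
    x = pair.count("$")
    pmf[x] = pmf.get(x, 0.0) + prob

print(pmf)  # {2: 0.09, 1: 0.42, 0: 0.49} (up to floating-point rounding)
```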


In the calculation, we equated two different concepts, the probability of the output of X and the probability of the samples in Ω. For example, in (6.7) we say P(X = 0) = P((£, £)). Consider the random variable X : Ω → T and a subset S ⊆ T (for example, a single element of T, such as the outcome that one head is obtained when tossing two coins). Let X⁻¹(S) be the pre-image of S by X, i.e., the set of elements of Ω that map to S under X: {ω ∈ Ω : X(ω) ∈ S}. One way to understand the transformation of probability from events in Ω via the random variable X is to associate it with the probability of the pre-image of S (Jacod and Protter, 2004). For S ⊆ T, we have the notation

$$P_X(S) = P(X \in S) = P(X^{-1}(S)) = P(\{\omega \in \Omega : X(\omega) \in S\}) . \tag{6.8}$$

The left-hand side of (6.8) is the probability of the set of possible outcomes (e.g., number of $ = 1) that we are interested in. Via the random variable X, which maps states to outcomes, we see in the right-hand side of (6.8) that this is the probability of the set of states (in Ω) that have the property (e.g., $£, £$). We say that a random variable X is distributed according to a particular probability distribution P_X, which defines the probability mapping between the event and the probability of the outcome of the random variable. In other words, the function P_X, or equivalently P ∘ X⁻¹, is the law or distribution of the random variable X.

Remark. The target space, that is, the range T of the random variable X, is used to indicate the kind of probability space, i.e., a T random variable. When T is finite or countably infinite, this is called a discrete random variable (Section 6.2.1). For continuous random variables (Section 6.2.2), we only consider T = ℝ or T = ℝᴰ. ♦

6.1.3 Statistics

Probability theory and statistics are often presented together, but they concern different aspects of uncertainty. One way of contrasting them is by the kinds of problems that are considered. Using probability, we can consider a model of some process, where the underlying uncertainty is captured by random variables, and we use the rules of probability to derive what happens. In statistics, we observe that something has happened and try to figure out the underlying process that explains the observations. In this sense, machine learning is close to statistics in its goal of constructing a model that adequately represents the process that generated the data. We can use the rules of probability to obtain a "best-fitting" model for some data.

Another aspect of machine learning systems is that we are interested in generalization error (see Chapter 8). This means that we are actually interested in the performance of our system on instances that we will observe in the future, which are not identical to the instances that we have seen so far. This analysis of future performance relies on probability and statistics, most of which is beyond what will be presented in this chapter. The interested reader is encouraged to look at the books by Boucheron et al. (2013) and Shalev-Shwartz and Ben-David (2014). We will see more about statistics in Chapter 8.

6.2 Discrete and Continuous Probabilities

Let us focus our attention on ways to describe the probability of an event as introduced in Section 6.1. Depending on whether the target space is discrete or continuous, the natural way to refer to distributions is different. When the target space T is discrete, we can specify the probability that a random variable X takes a particular value x ∈ T, denoted as P(X = x). The expression P(X = x) for a discrete random variable X is known as the probability mass function. When the target space T is continuous, e.g., the real line ℝ, it is more natural to specify the probability that a random variable X is in an interval, denoted by P(a ≤ X ≤ b) for a < b. By convention, we specify the probability that a random variable X is less than a particular value x, denoted by P(X ≤ x). The expression P(X ≤ x) for a continuous random variable X is known as the cumulative distribution function. We will discuss continuous random variables in Section 6.2.2. We will revisit the nomenclature and contrast discrete and continuous random variables in Section 6.2.3.

Remark. We will use the phrase univariate distribution to refer to distributions of a single random variable (whose states are denoted by non-bold x). We will refer to distributions of more than one random variable as multivariate distributions, and will usually consider a vector of random variables (whose states are denoted by bold x). ♦

6.2.1 Discrete Probabilities

When the target space is discrete, we can imagine the probability distribution of multiple random variables as filling out a (multidimensional) array of numbers. Figure 6.2 shows an example. The target space of the joint probability is the Cartesian product of the target spaces of each of the random variables. We define the joint probability as the entry of both values jointly

$$P(X = x_i, Y = y_j) = \frac{n_{ij}}{N} , \tag{6.9}$$

where $n_{ij}$ is the number of events with state $x_i$ and $y_j$, and N is the total number of events. The joint probability is the probability of the intersection of both events, that is, P(X = xᵢ, Y = yⱼ) = P(X = xᵢ ∩ Y = yⱼ). Figure 6.2 illustrates the probability mass function (pmf) of a discrete probability distribution.


[Figure 6.2: Visualization of a discrete bivariate probability mass function, with random variables X (states x1, …, x5) and Y (states y1, y2, y3), showing counts n_ij in an array with column sums c_i and row sums r_j. Adapted from Bishop (2006).]

For two random variables X and Y, the probability that X = x and Y = y is (lazily) written as p(x, y) and is called the joint probability. One can think of a probability as a function that takes state x and y and returns a real number, which is the reason we write p(x, y). The marginal probability that X takes the value x irrespective of the value of random variable Y is (lazily) written as p(x). We write X ∼ p(x) to denote that the random variable X is distributed according to p(x). If we consider only the instances where X = x, then the fraction of instances (the conditional probability) for which Y = y is written (lazily) as p(y | x).

Example 6.2
Consider two random variables X and Y, where X has five possible states and Y has three possible states, as shown in Figure 6.2. We denote by $n_{ij}$ the number of events with state X = xᵢ and Y = yⱼ, and denote by N the total number of events. The value cᵢ is the sum of the individual frequencies for the ith column, that is, $c_i = \sum_{j=1}^{3} n_{ij}$. Similarly, the value rⱼ is the row sum, that is, $r_j = \sum_{i=1}^{5} n_{ij}$. Using these definitions, we can compactly express the distribution of X and Y.
The probability distribution of each random variable, the marginal probability, can be seen as the sum over a row or column

$$P(X = x_i) = \frac{c_i}{N} = \frac{\sum_{j=1}^{3} n_{ij}}{N} \tag{6.10}$$

and

$$P(Y = y_j) = \frac{r_j}{N} = \frac{\sum_{i=1}^{5} n_{ij}}{N} , \tag{6.11}$$

where cᵢ and rⱼ are the ith column and jth row of the probability table, respectively. By convention, for discrete random variables with a finite number of events, we assume that probabilities sum up to one, that is,

$$\sum_{i=1}^{5} P(X = x_i) = 1 \quad \text{and} \quad \sum_{j=1}^{3} P(Y = y_j) = 1 . \tag{6.12}$$

The conditional probability is the fraction of a row or column in a particular cell. For example, the conditional probability of Y given X is

$$P(Y = y_j \,|\, X = x_i) = \frac{n_{ij}}{c_i} , \tag{6.13}$$

and the conditional probability of X given Y is

$$P(X = x_i \,|\, Y = y_j) = \frac{n_{ij}}{r_j} . \tag{6.14}$$
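The relations (6.9)–(6.14) translate directly into array operations. In the sketch below, the count table n is hypothetical (Figure 6.2 does not give concrete numbers); rows index the five states of X and columns the three states of Y:

```python
import numpy as np

# Hypothetical count table n[i, j] for Example 6.2: 5 states of X (rows)
# and 3 states of Y (columns). The counts are made up for illustration.
n = np.array([[10,  4,  2],
              [ 8, 12,  6],
              [ 3,  9,  5],
              [ 7,  1, 11],
              [ 6,  8,  8]])
N = n.sum()

joint = n / N                     # P(X = x_i, Y = y_j), eq. (6.9)
p_x = joint.sum(axis=1)           # marginal P(X = x_i), eq. (6.10)
p_y = joint.sum(axis=0)           # marginal P(Y = y_j), eq. (6.11)
p_y_given_x = n / n.sum(axis=1, keepdims=True)  # eq. (6.13): rows sum to 1
p_x_given_y = n / n.sum(axis=0, keepdims=True)  # eq. (6.14): columns sum to 1

assert np.isclose(p_x.sum(), 1.0) and np.isclose(p_y.sum(), 1.0)
```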

In machine learning, we use discrete probability distributions to model categorical variables, i.e., variables that take a finite set of unordered values. They could be categorical features, such as the degree taken at university when used for predicting the salary of a person, or categorical labels, such as letters of the alphabet when doing handwriting recognition. Discrete distributions are also often used to construct probabilistic models that combine a finite number of continuous distributions (Chapter 11).

6.2.2 Continuous Probabilities

We consider real-valued random variables in this section, i.e., we consider target spaces that are intervals of the real line ℝ. In this book, we pretend that we can perform operations on real random variables as if we have discrete probability spaces with finite states. However, this simplification is not precise for two situations: when we repeat something infinitely often, and when we want to draw a point from an interval. The first situation arises when we discuss generalization errors in machine learning (Chapter 8). The second situation arises when we want to discuss continuous distributions, such as the Gaussian (Section 6.5). For our purposes, the lack of precision allows for a briefer introduction to probability.

Remark. In continuous spaces, there are two additional technicalities, which are counterintuitive. First, the set of all subsets (used to define the event space A in Section 6.1) is not well behaved enough. A needs to be restricted to behave well under set complements, set intersections, and set unions. Second, the size of a set (which in discrete spaces can be obtained by counting the elements) turns out to be tricky. The size of a set is called its measure. For example, the cardinality of discrete sets, the length of an interval in ℝ, and the volume of a region in ℝᵈ are all measures. Sets that behave well under set operations and additionally have a topology are called a Borel σ-algebra. Betancourt details a careful construction of probability spaces from set theory without being bogged down in technicalities; see https://tinyurl.com/yb3t6mfd. For a more precise construction, we refer to Billingsley (1995) and Jacod and Protter (2004). In this book, we consider real-valued random variables with their corresponding Borel σ-algebra. We consider random variables with values in ℝᴰ to be a vector of real-valued random variables. ♦
Definition 6.1 (Probability Density Function). A function f : ℝᴰ → ℝ is called a probability density function (pdf) if

1. ∀x ∈ ℝᴰ : f(x) ≥ 0
2. Its integral exists and

$$\int_{\mathbb{R}^D} f(x)\, dx = 1 . \tag{6.15}$$

For probability mass functions (pmf) of discrete random variables, the integral in (6.15) is replaced with a sum (6.12).
Observe that the probability density function is any function f that is non-negative and integrates to one. We associate a random variable X with this function f by

$$P(a \leq X \leq b) = \int_a^b f(x)\, dx , \tag{6.16}$$

where a, b ∈ ℝ and x ∈ ℝ are outcomes of the continuous random variable X. States x ∈ ℝᴰ are defined analogously by considering a vector of x ∈ ℝ. This association (6.16) is called the law or distribution of the random variable X.

Remark. In contrast to discrete random variables, the probability of a continuous random variable X taking a particular value P(X = x) is zero; the set {X = x} has measure zero. This is like trying to specify an interval in (6.16) where a = b. ♦
Definition 6.2 (Cumulative Distribution Function). A cumulative distribution function (cdf) of a multivariate real-valued random variable X with states x ∈ ℝᴰ is given by

$$F_X(x) = P(X_1 \leq x_1, \ldots, X_D \leq x_D) , \tag{6.17}$$

where $X = [X_1, \ldots, X_D]^\top$, $x = [x_1, \ldots, x_D]^\top$, and the right-hand side represents the probability that random variable Xᵢ takes a value smaller than or equal to xᵢ. (There are cdfs that do not have corresponding pdfs.)
The cdf can be expressed also as the integral of the probability density function f(x) so that

$$F_X(x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_D} f(z_1, \ldots, z_D)\, dz_1 \cdots dz_D . \tag{6.18}$$

Remark. We reiterate that there are in fact two distinct concepts when talking about distributions. First is the idea of a pdf (denoted by f(x)), which is a nonnegative function that integrates (or, in the discrete case, sums) to one. Second is the law of a random variable X, that is, the association of a random variable X with the pdf f(x). ♦

[Figure 6.3: Examples of (a) discrete and (b) continuous uniform distributions, plotting P(Z = z) against z and p(x) against x on the same vertical scale. See Example 6.3 for details of the distributions.]

For most of this book, we will not use the notation f(x) and F_X(x), as we mostly do not need to distinguish between the pdf and cdf. However, we will need to be careful about pdfs and cdfs in Section 6.7.

6.2.3 Contrasting Discrete and Continuous Distributions

Recall from Section 6.1.2 that probabilities are positive and the total probability sums up to one. For discrete random variables (see (6.12)), this implies that the probability of each state must lie in the interval [0, 1]. However, for continuous random variables the normalization (see (6.15)) does not imply that the value of the density is less than or equal to 1 for all values. We illustrate this in Figure 6.3 using the uniform distribution for both discrete and continuous random variables.

Example 6.3
We consider two examples of the uniform distribution, where each state is equally likely to occur. This example illustrates some differences between discrete and continuous probability distributions.
Let Z be a discrete uniform random variable with three states {z = −1.1, z = 0.3, z = 1.5}. (The actual values of these states are not meaningful here; we deliberately chose numbers to drive home the point that we do not want to use, and should ignore, the ordering of the states.) The probability mass function can be represented as a table of probability values:

z           −1.1   0.3   1.5
P(Z = z)     1/3   1/3   1/3

Alternatively, we can think of this as a graph (Figure 6.3(a)), where we use the fact that the states can be located on the x-axis, and the y-axis represents the probability of a particular state. The y-axis in Figure 6.3(a) is deliberately extended so that it is the same as in Figure 6.3(b).
Let X be a continuous random variable taking values in the range 0.9 ≤ X ≤ 1.6, as represented by Figure 6.3(b). Observe that the height of the density can be greater than 1. However, it needs to hold that

$$\int_{0.9}^{1.6} p(x)\, dx = 1 . \tag{6.19}$$
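A two-line numerical check of this point, using the numbers from Example 6.3 (the code itself is our illustration, not the book's):

```python
# Sketch: the continuous uniform density on [0.9, 1.6] from Example 6.3.
# Its height exceeds 1, but it still integrates to one.
a, b = 0.9, 1.6
height = 1.0 / (b - a)           # ~1.43 > 1, a valid density value
area = height * (b - a)          # integral of the pdf over [a, b]
print(height, area)              # 1.428... 1.0

# Discrete uniform on three states: each probability is 1/3 <= 1.
states = [-1.1, 0.3, 1.5]
pmf = {z: 1 / len(states) for z in states}
print(pmf, sum(pmf.values()))
```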

Remark. There is an additional subtlety with regards to discrete probability distributions. The states z₁, …, z_d do not in principle have any structure, i.e., there is usually no way to compare them, for example z₁ = red, z₂ = green, z₃ = blue. However, in many machine learning applications discrete states take numerical values, e.g., z₁ = −1.1, z₂ = 0.3, z₃ = 1.5, where we could say z₁ < z₂ < z₃. Discrete states that assume numerical values are particularly useful because we often consider expected values (Section 6.4.1) of random variables. ♦

Unfortunately, machine learning literature uses notation and nomenclature that hides the distinction between the sample space Ω, the target space T, and the random variable X. For a value x of the set of possible outcomes of the random variable X, i.e., x ∈ T, p(x) denotes the probability that random variable X has the outcome x; we think of the outcome x as the argument that results in the probability p(x). For discrete random variables, this is written as P(X = x), which is known as the probability mass function. The pmf is often referred to as the "distribution". For continuous variables, p(x) is called the probability density function (often referred to as a density). To muddy things even further, the cumulative distribution function P(X ≤ x) is often also referred to as the "distribution". In this chapter, we will use the notation X to refer to both univariate and multivariate random variables, and denote the states by non-bold x and bold x, respectively. We summarize the nomenclature in Table 6.1.

Table 6.1 Nomenclature for probability distributions.

Type         "Point probability"                "Interval probability"
Discrete     P(X = x)                           Not applicable
             (probability mass function)
Continuous   p(x)                               P(X ≤ x)
             (probability density function)     (cumulative distribution function)

Remark. We will be using the expression "probability distribution" not only for discrete probability mass functions but also for continuous probability density functions, although this is technically incorrect. In line with most machine learning literature, we also rely on context to distinguish the different uses of the phrase probability distribution. ♦

6.3 Sum Rule, Product Rule, and Bayes' Theorem

We think of probability theory as an extension to logical reasoning. As we discussed in Section 6.1.1, the rules of probability presented here follow naturally from fulfilling the desiderata (Jaynes, 2003, chapter 2). Probabilistic modeling (Section 8.4) provides a principled foundation for designing machine learning methods. Once we have defined probability distributions (Section 6.2) corresponding to the uncertainties of the data and our problem, it turns out that there are only two fundamental rules, the sum rule and the product rule.

Recall from (6.9) that p(x, y) is the joint distribution of the two random variables x, y. The distributions p(x) and p(y) are the corresponding marginal distributions, and p(y | x) is the conditional distribution of y given x. Given the definitions of the marginal and conditional probability for discrete and continuous random variables in Section 6.2, we can now present the two fundamental rules in probability theory. These two rules arise naturally (Jaynes, 2003) from the requirements we discussed in Section 6.1.1.

The first rule, the sum rule, states that

$$p(x) = \begin{cases} \sum_{y \in \mathcal{Y}} p(x, y) & \text{if } y \text{ is discrete} \\ \int_{\mathcal{Y}} p(x, y)\, dy & \text{if } y \text{ is continuous} \end{cases} \tag{6.20}$$

where $\mathcal{Y}$ denotes the states of the target space of random variable Y. This means that we sum out (or integrate out) the set of states y of the random variable Y. The sum rule is also known as the marginalization property. The sum rule relates the joint distribution to a marginal distribution. In general, when the joint distribution contains more than two random variables, the sum rule can be applied to any subset of the random variables, resulting in a marginal distribution of potentially more than one random variable. More concretely, if $x = [x_1, \ldots, x_D]^\top$, we obtain the marginal

$$p(x_i) = \int p(x_1, \ldots, x_D)\, dx_{\setminus i} \tag{6.21}$$

by repeated application of the sum rule, where we integrate/sum out all random variables except $x_i$, which is indicated by $\setminus i$, which reads "all except i".
Remark. Many of the computational challenges of probabilistic modeling are due to the application of the sum rule. When there are many variables or discrete variables with many states, the sum rule boils down to performing a high-dimensional sum or integral. Performing high-dimensional sums or integrals is generally computationally hard, in the sense that there is no known polynomial-time algorithm to calculate them exactly. ♦
The second rule, known as the product rule, relates the joint distribution to the conditional distribution via

$$p(x, y) = p(y \,|\, x)\, p(x) . \tag{6.22}$$

The product rule can be interpreted as the fact that every joint distribution of two random variables can be factorized (written as a product)


of two other distributions. The two factors are the marginal distribution of the first random variable p(x), and the conditional distribution of the second random variable given the first, p(y | x). Since the ordering of random variables is arbitrary in p(x, y), the product rule also implies p(x, y) = p(x | y)p(y). To be precise, (6.22) is expressed in terms of the probability mass functions for discrete random variables. For continuous random variables, the product rule is expressed in terms of the probability density functions (Section 6.2.3).
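Both rules are easy to verify numerically for a discrete joint distribution. In the sketch below, the joint table is made up for illustration; the sum rule is a sum over an axis, and the product rule recovers the joint from a conditional and a marginal:

```python
import numpy as np

# Sketch (assumed joint table): checking the sum rule (6.20) and the
# product rule (6.22) on a small discrete joint distribution p(x, y).
p_xy = np.array([[0.10, 0.20],
                 [0.05, 0.25],
                 [0.30, 0.10]])   # rows: states of x, columns: states of y
assert np.isclose(p_xy.sum(), 1.0)

# Sum rule: marginalize out y.
p_x = p_xy.sum(axis=1)

# Product rule: p(x, y) = p(y | x) p(x).
p_y_given_x = p_xy / p_x[:, None]
reconstructed = p_y_given_x * p_x[:, None]
assert np.allclose(reconstructed, p_xy)
```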
In machine learning and Bayesian statistics, we are often interested in making inferences of unobserved (latent) random variables given that we have observed other random variables. Let us assume we have some prior knowledge p(x) about an unobserved random variable x and some relationship p(y | x) between x and a second random variable y, which we can observe. If we observe y, we can use Bayes' theorem to draw some conclusions about x given the observed values of y. Bayes' theorem (also Bayes' rule or Bayes' law)

$$\underbrace{p(x \,|\, y)}_{\text{posterior}} = \frac{\overbrace{p(y \,|\, x)}^{\text{likelihood}}\ \overbrace{p(x)}^{\text{prior}}}{\underbrace{p(y)}_{\text{evidence}}} \tag{6.23}$$

is a direct consequence of the product rule in (6.22), since

$$p(x, y) = p(x \,|\, y)\, p(y) \tag{6.24}$$

and

$$p(x, y) = p(y \,|\, x)\, p(x) , \tag{6.25}$$

so that

$$p(x \,|\, y)\, p(y) = p(y \,|\, x)\, p(x) \iff p(x \,|\, y) = \frac{p(y \,|\, x)\, p(x)}{p(y)} . \tag{6.26}$$
In (6.23), p(x) is the prior, which encapsulates our subjective prior knowledge of the unobserved (latent) variable x before observing any data. We can choose any prior that makes sense to us, but it is critical to ensure that the prior has a nonzero pdf (or pmf) on all plausible x, even if they are very rare.
The likelihood p(y | x) describes how x and y are related, and in the case of discrete probability distributions, it is the probability of the data y if we were to know the latent variable x. The likelihood is sometimes also called the "measurement model". Note that the likelihood is not a distribution in x, but only in y. We call p(y | x) either the "likelihood of x (given y)" or the "probability of y given x", but never the likelihood of y (MacKay, 2003).
The posterior p(x | y) is the quantity of interest in Bayesian statistics because it expresses exactly what we are interested in, i.e., what we know about x after having observed y.


The quantity

$$p(y) := \int p(y \,|\, x)\, p(x)\, dx = \mathbb{E}_X[p(y \,|\, x)] \tag{6.27}$$

is the marginal likelihood/evidence. The right-hand side of (6.27) uses the expectation operator, which we define in Section 6.4.1. By definition, the marginal likelihood integrates the numerator of (6.23) with respect to the latent variable x. Therefore, the marginal likelihood is independent of x, and it ensures that the posterior p(x | y) is normalized. The marginal likelihood can also be interpreted as the expected likelihood, where we take the expectation with respect to the prior p(x). Beyond normalization of the posterior, the marginal likelihood also plays an important role in Bayesian model selection, as we will discuss in Section 8.6. Due to the integration in (8.44), the evidence is often hard to compute.
Bayes' theorem (6.23) allows us to invert the relationship between x and y given by the likelihood. Therefore, Bayes' theorem is sometimes called the probabilistic inverse. We will discuss Bayes' theorem further in Section 8.4.
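As a small sketch of (6.23) and (6.27) in action, consider a binary latent variable x and a binary observation y; all the numbers below are invented for illustration:

```python
# Sketch (made-up numbers): Bayes' theorem (6.23) for a binary latent x
# (e.g., a rare condition) and a binary observation y (a noisy test).
p_x = 0.01                      # prior P(x = 1)
p_y_given_x1 = 0.95             # likelihood P(y = 1 | x = 1)
p_y_given_x0 = 0.05             # likelihood P(y = 1 | x = 0)

# Marginal likelihood / evidence (6.27) via the sum rule.
p_y = p_y_given_x1 * p_x + p_y_given_x0 * (1 - p_x)

# Posterior P(x = 1 | y = 1).
posterior = p_y_given_x1 * p_x / p_y
print(posterior)   # ~0.16: still unlikely despite a positive observation
```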
Remark. In Bayesian statistics, the posterior distribution is the quantity of interest as it encapsulates all available information from the prior and the data. Instead of carrying the posterior around, it is possible to focus on some statistic of the posterior, such as the maximum of the posterior, which we will discuss in Section 8.3. However, focusing on some statistic of the posterior leads to loss of information. If we think in a bigger context, then the posterior can be used within a decision-making system, and having the full posterior can be extremely useful and lead to decisions that are robust to disturbances. For example, in the context of model-based reinforcement learning, Deisenroth et al. (2015) show that using the full posterior distribution of plausible transition functions leads to very fast (data/sample-efficient) learning, whereas focusing on the maximum of the posterior leads to consistent failures. Therefore, having the full posterior can be very useful for a downstream task. In Chapter 9, we will continue this discussion in the context of linear regression. ♦

6.4 Summary Statistics and Independence

We are often interested in summarizing sets of random variables and comparing pairs of random variables. A statistic of a random variable is a deterministic function of that random variable. The summary statistics of a distribution provide one useful view of how a random variable behaves, and as the name suggests, provide numbers that summarize and characterize the distribution. We describe the mean and the variance, two well-known summary statistics. Then we discuss two ways to compare a pair of random variables: first, how to say that two random variables are independent; and second, how to compute an inner product between them.


6.4.1 Means and Covariances

Mean and (co)variance are often useful to describe properties of probability distributions (expected values and spread). We will see in Section 6.6 that there is a useful family of distributions (called the exponential family), where the statistics of the random variable capture all possible information.
The concept of the expected value is central to machine learning, and the foundational concepts of probability itself can be derived from the expected value (Whittle, 2000).

Definition 6.3 (Expected Value). The expected value of a function g : ℝ → ℝ of a univariate continuous random variable X ∼ p(x) is given by

$$\mathbb{E}_X[g(x)] = \int_{\mathcal{X}} g(x)\, p(x)\, dx . \tag{6.28}$$

Correspondingly, the expected value of a function g of a discrete random variable X ∼ p(x) is given by

$$\mathbb{E}_X[g(x)] = \sum_{x \in \mathcal{X}} g(x)\, p(x) , \tag{6.29}$$

where $\mathcal{X}$ is the set of possible outcomes (the target space) of the random variable X.

In this section, we consider discrete random variables to have numerical outcomes. This can be seen by observing that the function g takes real numbers as inputs. (The expected value of a function of a random variable is sometimes referred to as the law of the unconscious statistician; see Casella and Berger (2002, Section 2.2).)

Remark. We consider multivariate random variables X as a finite vector of univariate random variables $[X_1, \ldots, X_D]^\top$. For multivariate random variables, we define the expected value element-wise:

$$\mathbb{E}_X[g(x)] = \begin{bmatrix} \mathbb{E}_{X_1}[g(x_1)] \\ \vdots \\ \mathbb{E}_{X_D}[g(x_D)] \end{bmatrix} \in \mathbb{R}^D , \tag{6.30}$$

where the subscript $\mathbb{E}_{X_d}$ indicates that we are taking the expected value with respect to the dth element of the vector x. ♦

Definition 6.3 defines the meaning of the notation $\mathbb{E}_X$ as the operator indicating that we should take the integral with respect to the probability density (for continuous distributions) or the sum over all states (for discrete distributions). The definition of the mean (Definition 6.4) is a special case of the expected value, obtained by choosing g to be the identity function.

Definition 6.4 (Mean). The mean of a random variable X with states x ∈ ℝᴰ is an average and is defined as

$$\mathbb{E}_X[x] = \begin{bmatrix} \mathbb{E}_{X_1}[x_1] \\ \vdots \\ \mathbb{E}_{X_D}[x_D] \end{bmatrix} \in \mathbb{R}^D , \tag{6.31}$$

where

$$\mathbb{E}_{X_d}[x_d] := \begin{cases} \int_{\mathcal{X}} x_d\, p(x_d)\, dx_d & \text{if } X \text{ is a continuous random variable} \\ \sum_{x_i \in \mathcal{X}} x_i\, p(x_d = x_i) & \text{if } X \text{ is a discrete random variable} \end{cases} \tag{6.32}$$

for d = 1, …, D, where the subscript d indicates the corresponding dimension of x. The integral and sum are over the states $\mathcal{X}$ of the target space of the random variable X.

In one dimension, there are two other intuitive notions of "average": the median and the mode. The median is the "middle" value if we sort the values, i.e., 50% of the values are greater than the median and 50% are smaller than the median. This idea can be generalized to continuous values by considering the value where the cdf (Definition 6.2) is 0.5. For distributions that are asymmetric or have long tails, the median provides an estimate of a typical value that is closer to human intuition than the mean value. Furthermore, the median is more robust to outliers than the mean. The generalization of the median to higher dimensions is non-trivial, as there is no obvious way to "sort" in more than one dimension (Hallin et al., 2010; Kong and Mizera, 2012). The mode is the most frequently occurring value. For a discrete random variable, the mode is defined as the value of x having the highest frequency of occurrence. For a continuous random variable, the mode is defined as a peak in the density p(x). A particular density p(x) may have more than one mode, and furthermore there may be a very large number of modes in high-dimensional distributions. Therefore, finding all the modes of a distribution can be computationally challenging.
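The following sketch (our illustration) contrasts the three notions of "average" on a long-tailed sample; the rounding step merely makes a mode well defined for continuous draws:

```python
import numpy as np
from collections import Counter

# Sketch: empirical mean, median, and mode for a skewed sample.
rng = np.random.default_rng(0)
x = np.round(rng.exponential(scale=2.0, size=1_000), 1)  # long-tailed data

mean = x.mean()                       # sensitive to the long tail
median = np.median(x)                 # robust "middle" value
mode = Counter(x.tolist()).most_common(1)[0][0]  # most frequent value
print(mean, median, mode)             # typically mode < median < mean here
```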

Example 6.4
Consider the two-dimensional distribution illustrated in Figure 6.4:

$$p(x) = 0.4\, \mathcal{N}\!\left(x \,\Big|\, \begin{bmatrix} 10 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) + 0.6\, \mathcal{N}\!\left(x \,\Big|\, \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 8.4 & 2.0 \\ 2.0 & 1.7 \end{bmatrix}\right) . \tag{6.33}$$

We will define the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ in Section 6.5. Also shown is its corresponding marginal distribution in each dimension. Observe that the distribution is bimodal (has two modes), but one of the marginal distributions is unimodal (has one mode). The horizontal bimodal univariate distribution illustrates that the mean and median can be different from each other. While it is tempting to define the two-dimensional median to be the concatenation of the medians in each dimension, the fact that we cannot define an ordering of two-dimensional points makes it difficult. When we say "cannot define an ordering", we mean that there is more than one way to define the relation < so that $\begin{bmatrix} 3 \\ 0 \end{bmatrix} < \begin{bmatrix} 2 \\ 3 \end{bmatrix}$.

[Figure 6.4: Illustration of the mean, modes, and median for a two-dimensional dataset, as well as its marginal densities.]
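The following sketch (our illustration, not the book's code) samples the mixture in (6.33) and compares the empirical mean with the concatenation of per-dimension medians discussed above:

```python
import numpy as np

# Sketch of the bimodal mixture in (6.33): sample it and compare the
# empirical mean with the per-dimension medians (Figure 6.4 intuition).
rng = np.random.default_rng(1)
N = 10_000
component = rng.random(N) < 0.4   # mixture weights 0.4 / 0.6
samples = np.where(
    component[:, None],
    rng.multivariate_normal([10, 2], [[1, 0], [0, 1]], size=N),
    rng.multivariate_normal([0, 0], [[8.4, 2.0], [2.0, 1.7]], size=N),
)

print(samples.mean(axis=0))        # pulled between the two modes
print(np.median(samples, axis=0))  # concatenated per-dimension medians
```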

Remark. The expected value (Definition 6.3) is a linear operator. For example, given a real-valued function f(x) = ag(x) + bh(x), where a, b ∈ ℝ and x ∈ ℝᴰ, we obtain

$$\mathbb{E}_X[f(x)] = \int f(x)\, p(x)\, dx \tag{6.34a}$$
$$= \int [a g(x) + b h(x)]\, p(x)\, dx \tag{6.34b}$$
$$= a \int g(x)\, p(x)\, dx + b \int h(x)\, p(x)\, dx \tag{6.34c}$$
$$= a\, \mathbb{E}_X[g(x)] + b\, \mathbb{E}_X[h(x)] . \tag{6.34d}$$

♦


For two random variables, we may wish to characterize their correspondence to each other. The covariance intuitively represents the notion of how dependent random variables are on one another.

Definition 6.5 (Covariance (Univariate)). The covariance between two univariate random variables X, Y ∈ ℝ is given by the expected product of their deviations from their respective means, i.e.,

$$\mathrm{Cov}_{X,Y}[x, y] := \mathbb{E}_{X,Y}\big[(x - \mathbb{E}_X[x])(y - \mathbb{E}_Y[y])\big] . \tag{6.35}$$

Remark. When the random variable associated with the expectation or covariance is clear by its arguments, the subscript is often suppressed (for example, $\mathbb{E}_X[x]$ is often written as $\mathbb{E}[x]$). ♦

By using the linearity of expectations, the expression in Definition 6.5 can be rewritten as the expected value of the product minus the product of the expected values, i.e.,

$$\mathrm{Cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y] . \tag{6.36}$$

The covariance of a variable with itself, Cov[x, x], is called the variance and is denoted by $\mathbb{V}_X[x]$. The square root of the variance is called the standard deviation and is often denoted by σ(x). The notion of covariance can be generalized to multivariate random variables.

Definition 6.6 (Covariance (Multivariate)). If we consider two multivariate random variables X and Y with states x ∈ ℝᴰ and y ∈ ℝᴱ respectively, the covariance between X and Y is defined as

$$\mathrm{Cov}[x, y] = \mathbb{E}[xy^\top] - \mathbb{E}[x]\mathbb{E}[y]^\top = \mathrm{Cov}[y, x]^\top \in \mathbb{R}^{D \times E} . \tag{6.37}$$

(Terminology: the covariance of multivariate random variables Cov[x, y] is sometimes referred to as cross-covariance, with covariance referring to Cov[x, x].)
Definition 6.6 can be applied with the same multivariate random variable in both arguments, which results in a useful concept that intuitively captures the "spread" of a random variable. For a multivariate random variable, the variance describes the relation between individual dimensions of the random variable.

Definition 6.7 (Variance). The variance of a random variable X with states x ∈ ℝᴰ and a mean vector µ ∈ ℝᴰ is defined as

$$\mathbb{V}_X[x] = \mathrm{Cov}_X[x, x] \tag{6.38a}$$
$$= \mathbb{E}_X[(x - \mu)(x - \mu)^\top] = \mathbb{E}_X[xx^\top] - \mathbb{E}_X[x]\mathbb{E}_X[x]^\top \tag{6.38b}$$
$$= \begin{bmatrix} \mathrm{Cov}[x_1, x_1] & \mathrm{Cov}[x_1, x_2] & \cdots & \mathrm{Cov}[x_1, x_D] \\ \mathrm{Cov}[x_2, x_1] & \mathrm{Cov}[x_2, x_2] & \cdots & \mathrm{Cov}[x_2, x_D] \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}[x_D, x_1] & \cdots & \cdots & \mathrm{Cov}[x_D, x_D] \end{bmatrix} . \tag{6.38c}$$

The D × D matrix in (6.38c) is called the covariance matrix of the multivariate random variable X. The covariance matrix is symmetric and positive semidefinite and tells us something about the spread of the data. On its diagonal, the covariance matrix contains the variances of the marginals

$$p(x_i) = \int p(x_1, \ldots, x_D)\, dx_{\setminus i} , \tag{6.39}$$

where "\i" denotes "all variables but i". The off-diagonal entries are the cross-covariance terms Cov[xᵢ, xⱼ] for i, j = 1, …, D, i ≠ j.

[Figure 6.5: Two-dimensional datasets with identical means and variances along each axis (colored lines) but with different covariances: (a) x and y are negatively correlated; (b) x and y are positively correlated.]

Remark. In this book, we generally assume that covariance matrices are positive definite to enable better intuition. We therefore do not discuss corner cases that result in positive semidefinite (low-rank) covariance matrices. ♦

When we want to compare the covariances between different pairs of random variables, it turns out that the variance of each random variable affects the value of the covariance. The normalized version of covariance is called the correlation.

Definition 6.8 (Correlation). The correlation between two random variables X, Y is given by

$$\mathrm{corr}[x, y] = \frac{\mathrm{Cov}[x, y]}{\sqrt{\mathbb{V}[x]\mathbb{V}[y]}} \in [-1, 1] . \tag{6.40}$$

The correlation matrix is the covariance matrix of standardized random variables, x/σ(x). In other words, each random variable is divided by its standard deviation (the square root of the variance) in the correlation matrix.
The covariance (and correlation) indicate how two random variables are related; see Figure 6.5. Positive correlation corr[x, y] means that when x grows, then y is also expected to grow. Negative correlation means that as x increases, then y decreases.
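A sketch relating (6.36) and (6.40) to data resembling Figure 6.5(b); the data-generating coefficients are arbitrary:

```python
import numpy as np

# Sketch: empirical covariance and correlation for positively
# correlated data, mirroring Figure 6.5(b).
rng = np.random.default_rng(2)
x = rng.normal(size=1_000)
y = 0.8 * x + 0.3 * rng.normal(size=1_000)    # y depends linearly on x

cov = np.mean(x * y) - x.mean() * y.mean()    # eq. (6.36), biased estimate
corr = cov / np.sqrt(x.var() * y.var())       # eq. (6.40), lies in [-1, 1]
print(cov, corr)                              # corr close to +0.94 here
```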

6.4.2 Empirical Means and Covariances

The definitions in Section 6.4.1 are often also called the population mean and covariance, as they refer to the true statistics of the population. In machine learning, we need to learn from empirical observations of data. Consider a random variable X. There are two conceptual steps to go from population statistics to the realization of empirical statistics. First, we use the fact that we have a finite dataset (of size N) to construct an empirical statistic that is a function of a finite number of identical random variables, X₁, …, X_N. Second, we observe the data, that is, we look at the realization x₁, …, x_N of each of the random variables and apply the empirical statistic.
Specifically, for the mean (Definition 6.4), given a particular dataset we can obtain an estimate of the mean, which is called the empirical mean or sample mean. The same holds for the empirical covariance.

Definition 6.9 (Empirical Mean and Covariance). The empirical mean vector is the arithmetic average of the observations for each variable, and it is defined as

$$\bar{x} := \frac{1}{N} \sum_{n=1}^{N} x_n , \tag{6.41}$$

where xₙ ∈ ℝᴰ.
Similar to the empirical mean, the empirical covariance matrix is a D × D matrix

$$\Sigma := \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^\top . \tag{6.42}$$
To compute the statistics for a particular dataset, we would use the realizations (observations) x₁, …, x_N and use (6.41) and (6.42). Empirical covariance matrices are symmetric and positive semidefinite (see Section 3.2.3). Throughout the book, we use the empirical covariance, which is a biased estimate; the unbiased (sometimes called corrected) covariance has the factor N − 1 in the denominator instead of N.
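A minimal NumPy sketch of (6.41) and (6.42) (our illustration). Note that numpy.cov defaults to the unbiased N − 1 factor, so bias=True is needed to match the biased estimate in (6.42):

```python
import numpy as np

# Sketch: empirical mean (6.41) and biased empirical covariance (6.42)
# for a dataset X of shape (N, D).
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))

x_bar = X.mean(axis=0)                      # empirical mean vector
centered = X - x_bar
Sigma = centered.T @ centered / X.shape[0]  # divides by N, not N - 1

# np.cov uses the unbiased N - 1 factor by default; bias=True matches (6.42).
assert np.allclose(Sigma, np.cov(X.T, bias=True))
```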
6.4.3 Three Expressions for the Variance

We now focus on a single random variable X and use the preceding empirical formulas to derive three possible expressions for the variance. (The derivations are exercises at the end of this chapter.) The following derivation is the same for the population variance, except that we need to take care of integrals. The standard definition of variance, corresponding to the definition of covariance (Definition 6.5), is the expectation of the squared deviation of a random variable X from its expected value µ, i.e.,

$$\mathbb{V}_X[x] := \mathbb{E}_X[(x - \mu)^2] . \tag{6.43}$$

The expectation in (6.43) and the mean $\mu = \mathbb{E}_X[x]$ are computed using (6.32), depending on whether X is a discrete or continuous random variable. The variance as expressed in (6.43) is the mean of a new random variable Z := (X − µ)².
When estimating the variance in (6.43) empirically, we need to resort to a two-pass algorithm: one pass through the data to calculate the mean µ using (6.41), and then a second pass using this estimate µ̂ to calculate the variance. It turns out that we can avoid two passes by rearranging the terms. The formula in (6.43) can be converted to the so-called raw-score formula for variance:

$$\mathbb{V}_X[x] = \mathbb{E}_X[x^2] - (\mathbb{E}_X[x])^2 . \tag{6.44}$$

The expression in (6.44) can be remembered as "the mean of the square minus the square of the mean". It can be calculated empirically in one pass through the data, since we can accumulate xᵢ (to calculate the mean) and xᵢ² simultaneously, where xᵢ is the ith observation. Unfortunately, if implemented in this way, it can be numerically unstable: if the two terms in (6.44) are huge and approximately equal, we may suffer from an unnecessary loss of numerical precision in floating-point arithmetic. The raw-score version of the variance can be useful in machine learning, e.g., when deriving the bias–variance decomposition (Bishop, 2006).
A third way to understand the variance is that it is a sum of pairwise differences between all pairs of observations. Consider a sample x₁, …, x_N of realizations of random variable X, and we compute the squared difference between pairs of xᵢ and xⱼ. By expanding the square, we can show that the sum of the N² pairwise differences is twice the empirical variance of the observations:

$$\frac{1}{N^2} \sum_{i,j=1}^{N} (x_i - x_j)^2 = 2 \left[ \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)^2 \right] . \tag{6.45}$$

We see that (6.45) is twice the raw-score expression (6.44). This means that we can express the sum of pairwise distances (of which there are N² of them) as a sum of deviations from the mean (of which there are N). Geometrically, this means that there is an equivalence between the pairwise distances and the distances from the center of the set of points. From a computational perspective, this means that by computing the mean (N terms in the summation), and then computing the variance (again N terms in the summation), we can obtain an expression (left-hand side of (6.45)) that has N² terms.
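The three expressions (6.43)–(6.45) can be compared numerically; the sketch below is our illustration:

```python
import numpy as np

# Sketch: the three expressions for the empirical variance agree.
rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)
N = len(x)

two_pass = np.mean((x - x.mean()) ** 2)              # definition (6.43)
raw_score = np.mean(x ** 2) - x.mean() ** 2          # one-pass form (6.44)
pairwise = np.sum((x[:, None] - x[None, :]) ** 2) / N**2  # LHS of (6.45)

print(two_pass, raw_score, pairwise / 2)  # pairwise sum is twice the variance
```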

6.4.4 Sums and Transformations of Random Variables

We may want to model a phenomenon that cannot be well explained by textbook distributions (we introduce some in Sections 6.5 and 6.6), and hence may perform simple manipulations of random variables (such as adding two random variables).
Consider two random variables X, Y with states x, y ∈ ℝᴰ. Then:

$$\mathbb{E}[x + y] = \mathbb{E}[x] + \mathbb{E}[y] \tag{6.46}$$
$$\mathbb{E}[x - y] = \mathbb{E}[x] - \mathbb{E}[y] \tag{6.47}$$
$$\mathbb{V}[x + y] = \mathbb{V}[x] + \mathbb{V}[y] + \mathrm{Cov}[x, y] + \mathrm{Cov}[y, x] \tag{6.48}$$
$$\mathbb{V}[x - y] = \mathbb{V}[x] + \mathbb{V}[y] - \mathrm{Cov}[x, y] - \mathrm{Cov}[y, x] . \tag{6.49}$$


Mean and (co)variance exhibit some useful properties when it comes
to affine transformations of random variables. Consider a random variable
X with mean µ and covariance matrix Σ and a (deterministic) affine
transformation y = Ax + b of x. Then y is itself a random variable
whose mean vector and covariance matrix are given by

E_Y[y] = E_X[Ax + b] = A E_X[x] + b = Aµ + b ,    (6.50)
V_Y[y] = V_X[Ax + b] = V_X[Ax] = A V_X[x] Aᵀ = AΣAᵀ ,    (6.51)
respectively. Furthermore (this can be shown directly by using the
definition of the mean and covariance),

Cov[x, y] = E[x(Ax + b)ᵀ] − E[x]E[Ax + b]ᵀ    (6.52a)
          = E[x]bᵀ + E[xxᵀ]Aᵀ − µbᵀ − µµᵀAᵀ    (6.52b)
          = µbᵀ − µbᵀ + (E[xxᵀ] − µµᵀ)Aᵀ    (6.52c)
          = ΣAᵀ ,    (6.52d)

where Σ = E[xxᵀ] − µµᵀ is the covariance of X (see (6.38b)).
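
As an empirical sanity check of (6.50), (6.51), and (6.52d), the following NumPy sketch (an illustration with assumed toy values, not part of the text) compares Monte Carlo estimates with Aµ + b, AΣAᵀ, and ΣAᵀ:

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([3.0, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)  # samples of x
y = x @ A.T + b                                       # y = Ax + b row-wise

print(y.mean(axis=0), A @ mu + b)     # (6.50): E[y] vs. Aµ + b
print(np.cov(y.T), A @ Sigma @ A.T)   # (6.51): V[y] vs. AΣAᵀ

xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
print(xc.T @ yc / (len(x) - 1), Sigma @ A.T)  # (6.52d): Cov[x, y] vs. ΣAᵀ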

6.4.5 Statistical Independence


Definition 6.10 (Independence). Two random variables X, Y are statistically
independent if and only if

p(x, y) = p(x)p(y) .    (6.53)
Intuitively, two random variables X and Y are independent if the value
of y (once known) does not add any additional information about x (and
vice versa). If X, Y are (statistically) independent, then
p(y | x) = p(y) ,
p(x | y) = p(x) ,
V_{X,Y}[x + y] = V_X[x] + V_Y[y] ,
Cov_{X,Y}[x, y] = 0 .
The converse of the last point does not hold in general: two random
variables can have zero covariance without being statistically independent.
To understand why, recall that covariance measures only linear dependence.
Therefore, random variables that are nonlinearly dependent can have zero
covariance.

Example 6.5
Consider a random variable X with zero mean (E_X[x] = 0) and also
E_X[x³] = 0. Let y = x² (hence, Y is dependent on X) and compute the
covariance (6.36) between X and Y. This gives

Cov[x, y] = E[xy] − E[x]E[y] = E[x³] = 0 .    (6.54)
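
A quick numerical illustration of this example (a NumPy sketch; choosing X standard normal satisfies E_X[x] = E_X[x³] = 0):

import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000)  # zero mean and zero third moment
y = x ** 2                          # y is a deterministic function of x

print(np.cov(x, y)[0, 1])               # ≈ 0: zero covariance
print(np.corrcoef(np.abs(x), y)[0, 1])  # yet strongly (nonlinearly) dependent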


In machine learning, we often consider problems that can be modeled as
independent and identically distributed (i.i.d.) random variables,
X_1, . . . , X_N. For more than two random variables, the word
“independent” (Definition 6.10) usually refers to mutually independent
random variables, where all subsets are independent (see Pollard (2002,
chapter 4) and Jacod and Protter (2004, chapter 3)). The phrase
“identically distributed” means that all the random variables are from the
same distribution.
Another concept that is important in machine learning is conditional
independence.
Definition 6.11 (Conditional Independence). Two random variables X
and Y are conditionally independent given Z if and only if

p(x, y | z) = p(x | z) p(y | z)  for all z ∈ Z ,    (6.55)

where Z is the set of states of random variable Z. We write X ⊥⊥ Y | Z to
denote that X is conditionally independent of Y given Z.
Definition 6.11 requires that the relation in (6.55) must hold true for
every value of z . The interpretation of (6.55) can be understood as “given
knowledge about z , the distribution of x and y factorizes”. Independence
can be cast as a special case of conditional independence if we write
X ⊥⊥ Y | ∅. By using the product rule of probability (6.22), we can expand the
left-hand side of (6.55) to obtain
p(x, y | z) = p(x | y, z)p(y | z) . (6.56)
By comparing the right-hand side of (6.55) with (6.56), we see that p(y | z)
appears in both of them so that
p(x | y, z) = p(x | z) . (6.57)
Equation (6.57) provides an alternative definition of conditional indepen-
dence, i.e., X ⊥⊥ Y | Z . This alternative presentation provides the inter-
pretation “given that we know z , knowledge about y does not change our
knowledge of x”.

6.4.6 Inner Products of Random Variables


Recall the definition of inner products from Section 3.2. We can define an
inner product between random variables, which we briefly describe in this
section. (Inner products between multivariate random variables can be
treated in a similar fashion.) If we have two uncorrelated random variables
X, Y, then

V[x + y] = V[x] + V[y] .    (6.58)
Since variances are measured in squared units, this looks very much like
the Pythagorean theorem for right triangles, c² = a² + b².
In the following, we see whether we can find a geometric interpreta-
tion of the variance relation of uncorrelated random variables in (6.58).


[Figure 6.6: Geometry of random variables. If random variables X and Y
are uncorrelated, they are orthogonal vectors in a corresponding vector
space, and the Pythagorean theorem applies: the sides have lengths
a = √var[x] and b = √var[y], and the hypotenuse has length
c = √var[x + y] = √(var[x] + var[y]).]

Random variables can be considered vectors in a vector space, and we


can define inner products to obtain geometric properties of random vari-
ables (Eaton, 2007). If we define
⟨X, Y⟩ := Cov[x, y]    (6.59)

for zero mean random variables X and Y, we obtain an inner product. We
see that the covariance is symmetric, positive definite (Cov[x, x] = 0 ⟺
x = 0), and linear in either argument (Cov[αx + z, y] = α Cov[x, y] +
Cov[z, y] for α ∈ R). The length of a random variable is

‖X‖ = √Cov[x, x] = √V[x] = σ[x] ,    (6.60)
i.e., its standard deviation. The “longer” the random variable, the more
uncertain it is; and a random variable with length 0 is deterministic.
If we look at the angle θ between two random variables X, Y, we get

cos θ = ⟨X, Y⟩/(‖X‖ ‖Y‖) = Cov[x, y]/√(V[x]V[y]) ,    (6.61)
which is the correlation (Definition 6.8) between the two random vari-
ables. This means that we can think of correlation as the cosine of the
angle between two random variables when we consider them geometri-
cally. We know from Definition 3.7 that X ⊥ Y ⇐⇒ hX, Y i = 0. In our
case, this means that X and Y are orthogonal if and only if Cov[x, y] = 0,
i.e., they are uncorrelated. Figure 6.6 illustrates this relationship.
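
As a small numerical illustration (a sketch, not from the text), the empirical correlation coefficient equals the cosine of the angle between the centered sample vectors:

import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(10_000)
y = 0.8 * x + 0.6 * rng.standard_normal(10_000)  # correlated with x

xc, yc = x - x.mean(), y - y.mean()              # center the samples
cos_theta = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(cos_theta, np.corrcoef(x, y)[0, 1])        # the two agree, cf. (6.61)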
Remark. While it is tempting to use the Euclidean distance (constructed


[Figure 6.7: Gaussian distribution p(x1, x2) of two random variables x1
and x2, shown as a mesh plot.]

from the preceding definition of inner products) to compare probability


distributions, it is unfortunately not the best way to obtain distances be-
tween distributions. Recall that the probability mass (or density) is posi-
tive and needs to add up to 1. These constraints mean that distributions
live on something called a statistical manifold. The study of this space of
probability distributions is called information geometry. Computing distances
between distributions is often done using the Kullback–Leibler divergence,
which is a generalization of distance that accounts for properties of
the statistical manifold. Just like the Euclidean distance is a special case of
a metric (Section 3.3), the Kullback-Leibler divergence is a special case of
two more general classes of divergences called Bregman divergences and
f -divergences. The study of divergences is beyond the scope of this book,
and we refer for more details to the recent book by Amari (2016), one of
the founders of the field of information geometry. ♦

6.5 Gaussian Distribution


The Gaussian distribution is the most well-studied probability distribution
for continuous-valued random variables. It is also referred to as the
normal distribution. The Gaussian distribution arises naturally when we
consider sums of independent and identically distributed random variables;
this is known as the central limit theorem (Grinstead and Snell, 1997).
Its importance originates from the fact that it has many computationally
convenient properties, which we will be discussing in the following. In
particular, we will use it to define the likelihood and prior for linear
regression (Chapter 9), and consider a mixture of Gaussians for density
estimation (Chapter 11).
There are many other areas of machine learning that also benefit from
using a Gaussian distribution, for example Gaussian processes, variational
inference, and reinforcement learning. It is also widely used in other
application areas such as signal processing (e.g., the Kalman filter),
control (e.g., the linear quadratic regulator), and statistics (e.g.,
hypothesis testing).

[Figure 6.8: Gaussian distributions overlaid with 100 samples. (a)
Univariate (one-dimensional) Gaussian; the red cross shows the mean and
the red line shows the extent of the variance. (b) Multivariate
(two-dimensional) Gaussian, viewed from top; the red cross shows the mean
and the colored lines show the contour lines of the density.]

For a univariate random variable, the Gaussian distribution has a density
that is given by

p(x | µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)) .    (6.62)
The multivariate Gaussian distribution (also known as a multivariate
normal distribution) is fully characterized by a mean vector µ and a
covariance matrix Σ and is defined as

p(x | µ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)) ,    (6.63)

where x ∈ R^D. We write p(x) = N(x | µ, Σ) or X ∼ N(µ, Σ). Figure 6.7
shows a bivariate Gaussian (mesh), with the corresponding contour plot.
Figure 6.8 shows a univariate Gaussian and a bivariate Gaussian with
corresponding samples. The special case of the Gaussian with zero mean and
identity covariance, that is, µ = 0 and Σ = I, is referred to as the
standard normal distribution.
Gaussians are widely used in statistical estimation and machine learn-
ing as they have closed-form expressions for marginal and conditional dis-
tributions. In Chapter 9, we use these closed-form expressions extensively
for linear regression. A major advantage of modeling with Gaussian ran-
dom variables is that variable transformations (Section 6.7) are often not
needed. Since the Gaussian distribution is fully specified by its mean and
covariance, we often can obtain the transformed distribution by applying
the transformation to the mean and covariance of the random variable.

6.5.1 Marginals and Conditionals of Gaussians are Gaussians


In the following, we present marginalization and conditioning in the gen-
eral case of multivariate random variables. If this is confusing at first read-
ing, the reader is advised to consider two univariate random variables in-
stead. Let X and Y be two multivariate random variables, that may have


different dimensions. To consider the effect of applying the sum rule of


probability and the effect of conditioning, we explicitly write the Gaus-
sian distribution in terms of the concatenated states [xᵀ, yᵀ]ᵀ,

p(x, y) = N( [µx; µy] , [Σxx, Σxy; Σyx, Σyy] ) ,    (6.64)
where Σxx = Cov[x, x] and Σyy = Cov[y, y] are the marginal covari-
ance matrices of x and y , respectively, and Σxy = Cov[x, y] is the cross-
covariance matrix between x and y .
The conditional distribution p(x | y) is also Gaussian (illustrated in Fig-
ure 6.9(c)) and given by (derived in Section 2.3 of Bishop, 2006)

p(x | y) = N(µ_{x|y}, Σ_{x|y})    (6.65)
µ_{x|y} = µx + Σxy Σyy⁻¹ (y − µy)    (6.66)
Σ_{x|y} = Σxx − Σxy Σyy⁻¹ Σyx .    (6.67)
Note that in the computation of the mean in (6.66), the y -value is an
observation and no longer random.
Remark. The conditional Gaussian distribution shows up in many places,
where we are interested in posterior distributions:

- The Kalman filter (Kalman, 1960), one of the most central algorithms for
  state estimation in signal processing, does nothing but computing
  Gaussian conditionals of joint distributions (Deisenroth and Ohlsson,
  2011; Särkkä, 2013).
- Gaussian processes (Rasmussen and Williams, 2006), which are a practical
  implementation of a distribution over functions. In a Gaussian process,
  we make assumptions of joint Gaussianity of random variables. By
  (Gaussian) conditioning on observed data, we can determine a posterior
  distribution over functions.
- Latent linear Gaussian models (Roweis and Ghahramani, 1999; Murphy,
  2012), which include probabilistic principal component analysis (PPCA)
  (Tipping and Bishop, 1999). We will look at PPCA in more detail in
  Section 10.7.

The marginal distribution p(x) of a joint Gaussian distribution p(x, y)
(see (6.64)) is itself Gaussian and computed by applying the sum rule
(6.20) and given by
p(x) = ∫ p(x, y) dy = N(x | µx, Σxx) .    (6.68)

The corresponding result holds for p(y), which is obtained by marginaliz-


ing with respect to x. Intuitively, looking at the joint distribution in (6.64),
we ignore (i.e., integrate out) everything we are not interested in. This is
illustrated in Figure 6.9(b).


Example 6.6
[Figure 6.9: (a) Bivariate Gaussian; (b) marginal of a joint Gaussian
distribution is Gaussian; (c) the conditional distribution of a Gaussian is
also Gaussian.]

Consider the bivariate Gaussian distribution (illustrated in Figure 6.9):


   
p(x1, x2) = N( [0; 2] , [0.3, −1; −1, 5] ) .    (6.69)
We can compute the parameters of the univariate Gaussian, conditioned
on x2 = −1, by applying (6.66) and (6.67) to obtain the mean and vari-
ance respectively. Numerically, this is
µ_{x1|x2=−1} = 0 + (−1) · 0.2 · (−1 − 2) = 0.6    (6.70)

and

σ²_{x1|x2=−1} = 0.3 − (−1) · 0.2 · (−1) = 0.1 .    (6.71)
Therefore, the conditional Gaussian is given by

p(x1 | x2 = −1) = N(0.6, 0.1) .    (6.72)

The marginal distribution p(x1), in contrast, can be obtained by applying
(6.68), which is essentially using the mean and variance of the random
variable x1, giving us

p(x1) = N(0, 0.3) .    (6.73)
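
The numbers in (6.70)–(6.73) are easy to reproduce with a few lines of NumPy (a sketch implementing (6.66)–(6.68) for this example):

import numpy as np

mu = np.array([0.0, 2.0])
Sigma = np.array([[0.3, -1.0], [-1.0, 5.0]])
x2 = -1.0

# Conditional p(x1 | x2 = -1) via (6.66) and (6.67).
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]
print(mu_cond, var_cond)   # 0.6, 0.1

# Marginal p(x1) via (6.68): read off the corresponding blocks.
print(mu[0], Sigma[0, 0])  # 0.0, 0.3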


6.5.2 Product of Gaussian Densities


For linear regression (Chapter 9), we need to compute a Gaussian likeli-
hood. Furthermore, we may wish to assume a Gaussian prior (Section 9.3).
We apply Bayes’ theorem to compute the posterior, which results in a
multiplication of the likelihood and the prior, that is, the multiplication
of two Gaussian densities. The product of two Gaussians
N(x | a, A) N(x | b, B) is a Gaussian distribution scaled by a constant
c ∈ R, given by c N(x | c, C) with

C = (A⁻¹ + B⁻¹)⁻¹    (6.74)
c = C(A⁻¹a + B⁻¹b)    (6.75)
c = (2π)^{−D/2} |A + B|^{−1/2} exp(−½ (a − b)ᵀ (A + B)⁻¹ (a − b)) .    (6.76)

(The derivation is an exercise at the end of this chapter.) The scaling
constant c itself can be written in the form of a Gaussian density either
in a or in b with an “inflated” covariance matrix A + B, i.e.,
c = N(a | b, A + B) = N(b | a, A + B).

Remark. For notational convenience, we will sometimes use N(x | m, S)
to describe the functional form of a Gaussian density even if x is not a
random variable. We have just done this in the preceding demonstration
when we wrote

c = N(a | b, A + B) = N(b | a, A + B) .    (6.77)

Here, neither a nor b is a random variable. However, writing c in this way
is more compact than (6.76). ♦
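
A small NumPy sketch (illustrative) that computes the parameters (6.74)–(6.76) of the product of two Gaussian densities:

import numpy as np

def gaussian_product(a, A, b, B):
    # N(x | a, A) * N(x | b, B) = c * N(x | c_mean, C), cf. (6.74)-(6.76).
    A_inv, B_inv = np.linalg.inv(A), np.linalg.inv(B)
    C = np.linalg.inv(A_inv + B_inv)              # (6.74)
    c_mean = C @ (A_inv @ a + B_inv @ b)          # (6.75)
    D, diff = len(a), a - b
    c = ((2 * np.pi) ** (-D / 2)
         * np.linalg.det(A + B) ** -0.5
         * np.exp(-0.5 * diff @ np.linalg.solve(A + B, diff)))  # (6.76)
    return c_mean, C, c

a, A = np.zeros(2), np.eye(2)
b, B = np.ones(2), 2 * np.eye(2)
print(gaussian_product(a, A, b, B))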

6.5.3 Sums and Linear Transformations


If X, Y are independent Gaussian random variables (i.e., the joint
distribution is given as p(x, y) = p(x)p(y)) with p(x) = N(x | µx, Σx) and
p(y) = N(y | µy, Σy), then x + y is also Gaussian distributed and given by

p(x + y) = N(µx + µy, Σx + Σy) .    (6.78)
Knowing that p(x + y) is Gaussian, the mean and covariance matrix can
be determined immediately using the results from (6.46) through (6.49).
This property will be important when we consider i.i.d. Gaussian noise
acting on random variables, as is the case for linear regression (Chap-
ter 9).

Example 6.7
Since expectations are linear operations, we can obtain the weighted sum
of independent Gaussian random variables

p(ax + by) = N(aµx + bµy, a²Σx + b²Σy) .    (6.79)


Remark. A case that will be useful in Chapter 11 is the weighted sum of


Gaussian densities. This is different from the weighted sum of Gaussian
random variables. ♦
In Theorem 6.12, the random variable x is from a density that is a
mixture of two densities p1 (x) and p2 (x), weighted by α. The theorem can
be generalized to the multivariate random variable case, since linearity of
expectations holds also for multivariate random variables. However, the
idea of a squared random variable needs to be replaced by xxᵀ.

Theorem 6.12. Consider a mixture of two univariate Gaussian densities

p(x) = αp1 (x) + (1 − α)p2 (x) , (6.80)

where the scalar 0 < α < 1 is the mixture weight, and p1(x) and p2(x) are
univariate Gaussian densities (Equation (6.62)) with different parameters,
i.e., (µ1, σ1²) ≠ (µ2, σ2²).
Then the mean of the mixture density p(x) is given by the weighted sum
of the means of each random variable:

E[x] = αµ1 + (1 − α)µ2 . (6.81)

The variance of the mixture density p(x) is given by

V[x] = [ασ1² + (1 − α)σ2²] + ([αµ1² + (1 − α)µ2²] − [αµ1 + (1 − α)µ2]²) .    (6.82)

Proof. The mean of the mixture density p(x) is given by the weighted
sum of the means of each random variable. We apply the definition of the
mean (Definition 6.4), and plug in our mixture (6.80), which yields

E[x] = ∫_{−∞}^{∞} x p(x) dx    (6.83a)
     = ∫_{−∞}^{∞} (α x p1(x) + (1 − α) x p2(x)) dx    (6.83b)
     = α ∫_{−∞}^{∞} x p1(x) dx + (1 − α) ∫_{−∞}^{∞} x p2(x) dx    (6.83c)
     = αµ1 + (1 − α)µ2 .    (6.83d)

To compute the variance, we can use the raw-score version of the vari-
ance from (6.44), which requires an expression of the expectation of the
squared random variable. Here we use the definition of an expectation of
a function (the square) of a random variable (Definition 6.3),
E[x²] = ∫_{−∞}^{∞} x² p(x) dx    (6.84a)
      = ∫_{−∞}^{∞} (α x² p1(x) + (1 − α) x² p2(x)) dx    (6.84b)
      = α ∫_{−∞}^{∞} x² p1(x) dx + (1 − α) ∫_{−∞}^{∞} x² p2(x) dx    (6.84c)
      = α(µ1² + σ1²) + (1 − α)(µ2² + σ2²) ,    (6.84d)
where in the last equality, we again used the raw-score version of the
variance (6.44), giving σ² = E[x²] − µ². This is rearranged such that the
expectation of a squared random variable is the sum of the squared mean
and the variance.
Therefore, the variance is given by subtracting (6.83d) from (6.84d):

V[x] = E[x²] − (E[x])²    (6.85a)
     = α(µ1² + σ1²) + (1 − α)(µ2² + σ2²) − (αµ1 + (1 − α)µ2)²    (6.85b)
     = [ασ1² + (1 − α)σ2²] + ([αµ1² + (1 − α)µ2²] − [αµ1 + (1 − α)µ2]²) .    (6.85c)

Remark. The preceding derivation holds for any density, but since the
Gaussian is fully determined by the mean and variance, the mixture den-
sity can be determined in closed form. ♦
For a mixture density, the individual components can be considered
to be conditional distributions (conditioned on the component identity).
Equation (6.85c) is an example of the conditional variance formula, also
known as the law of total variance, which generally states that for two
random variables X and Y it holds that V_X[x] = E_Y[V_X[x|y]] + V_Y[E_X[x|y]],
i.e., the (total) variance of X is the expected conditional variance plus the
variance of a conditional mean.
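
The closed-form moments (6.81) and (6.82) can be checked against Monte Carlo estimates; here is a brief NumPy sketch (with assumed toy parameters):

import numpy as np

alpha, mu1, s1, mu2, s2 = 0.3, -2.0, 1.0, 3.0, 0.5
rng = np.random.default_rng(4)
n = 1_000_000

# Sample the mixture: pick a component, then sample from it.
comp = rng.uniform(size=n) < alpha
x = np.where(comp, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))

mean_closed = alpha * mu1 + (1 - alpha) * mu2                    # (6.81)
var_closed = (alpha * s1**2 + (1 - alpha) * s2**2
              + alpha * mu1**2 + (1 - alpha) * mu2**2
              - mean_closed**2)                                  # (6.82)
print(x.mean(), mean_closed)
print(x.var(), var_closed)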
In Example 6.17, we consider a bivariate standard Gaussian random
variable X and perform a linear transformation Ax on it. The outcome
is a Gaussian random variable with mean zero and covariance AAᵀ. Observe
that adding a constant vector will change the mean of the distribution
without affecting its variance; that is, the random variable x + µ is
Gaussian with mean µ and identity covariance. Hence, any linear/affine
transformation of a Gaussian random variable is also Gaussian distributed.
Consider a Gaussian distributed random variable X ∼ N(µ, Σ). For
a given matrix A of appropriate shape, let Y be a random variable such
that y = Ax is a transformed version of x. We can compute the mean of
y by exploiting that the expectation is a linear operator (6.50) as follows:

E[y] = E[Ax] = A E[x] = Aµ .    (6.86)

Similarly, the variance of y can be found by using (6.51):

V[y] = V[Ax] = A V[x] Aᵀ = AΣAᵀ .    (6.87)

This means that the random variable y is distributed according to

p(y) = N(y | Aµ, AΣAᵀ) .    (6.88)


Let us now consider the reverse transformation: when we know that a


random variable has a mean that is a linear transformation of another
random variable. For a given full rank matrix A ∈ R^{M×N}, where M ≥ N,
let y ∈ R^M be a Gaussian random variable with mean Ax, i.e.,

p(y) = N(y | Ax, Σ) .    (6.89)

What is the corresponding probability distribution p(x)? If A is
invertible, then we can write x = A⁻¹y and apply the transformation in the
previous paragraph. However, in general A is not invertible, and we use an
approach similar to that of the pseudo-inverse (3.57). That is, we
pre-multiply both sides with Aᵀ and then invert AᵀA, which is symmetric and
positive definite, giving us the relation

y = Ax ⟺ (AᵀA)⁻¹Aᵀy = x .    (6.90)

Hence, x is a linear transformation of y, and we obtain

p(x) = N(x | (AᵀA)⁻¹Aᵀy, (AᵀA)⁻¹AᵀΣA(AᵀA)⁻¹) .    (6.91)

6.5.4 Sampling from Multivariate Gaussian Distributions


We will not explain the subtleties of random sampling on a computer, and
the interested reader is referred to Gentle (2004). In the case of a mul-
tivariate Gaussian, this process consists of three stages: first, we need a
source of pseudo-random numbers that provide a uniform sample in the
interval [0,1]; second, we use a non-linear transformation such as the
Box-Müller transform (Devroye, 1986) to obtain a sample from a univari-
ate Gaussian; and third, we collate a vector of these samples to obtain a
sample from a multivariate standard normal N(0, I).
For a general multivariate Gaussian, that is, where the mean is nonzero
and the covariance is not the identity matrix, we use the properties of
linear transformations of a Gaussian random variable. Assume we are
interested in generating samples xi, i = 1, . . . , n, from a multivariate
Gaussian distribution with mean µ and covariance matrix Σ. We would like
to construct the sample from a sampler that provides samples from the
multivariate standard normal N(0, I).
To obtain samples from a multivariate normal N(µ, Σ), we can use the
properties of a linear transformation of a Gaussian random variable: If
x ∼ N(0, I), then y = Ax + µ, where AAᵀ = Σ, is Gaussian distributed with
mean µ and covariance matrix Σ. One convenient choice of A is the Cholesky
decomposition (Section 4.3) of the covariance matrix, Σ = AAᵀ. (To compute
the Cholesky factorization of a matrix, the matrix must be symmetric and
positive definite (Section 3.2.3); covariance matrices possess this
property.) The Cholesky decomposition has the benefit that A is
triangular, leading to efficient computation.
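
A minimal NumPy sketch of this sampling procedure (for illustration; in practice one would typically call a library sampler directly):

import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -1.0])
Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])

A = np.linalg.cholesky(Sigma)          # lower-triangular A with A Aᵀ = Σ
z = rng.standard_normal((10_000, 2))   # rows are samples from N(0, I)
y = z @ A.T + mu                       # y = Az + µ has distribution N(µ, Σ)

print(y.mean(axis=0))   # ≈ µ
print(np.cov(y.T))      # ≈ Σ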


6.6 Conjugacy and the Exponential Family


Many of the probability distributions “with names” that we find in statis-
tics textbooks were discovered to model particular types of phenomena.
For example, we have seen the Gaussian distribution in Section 6.5. The
distributions are also related to each other in complex ways (Leemis and
McQueston, 2008). For a beginner in the field, it can be overwhelming to
figure out which distribution to use. In addition, many of these distribu-
tions were discovered at a time when statistics and computation were done
by pencil and paper. (“Computers” used to be a job description.) It is
natural to ask what the meaningful concepts in the computing age are
(Efron and Hastie, 2016). In the previous section,
we saw that many of the operations required for inference can be conve-
niently calculated when the distribution is Gaussian. It is worth recalling
at this point the desiderata for manipulating probability distributions in
the machine learning context:

1. There is some “closure property” when applying the rules of probability,


e.g., Bayes’ theorem. By closure, we mean that applying a particular
operation returns an object of the same type.
2. As we collect more data, we do not need more parameters to describe
the distribution.
3. Since we are interested in learning from data, we want parameter es-
timation to behave nicely.

It turns out that the class of distributions called the exponential family
provides the right balance of generality while retaining favorable compu-
tation and inference properties. Before we introduce the exponential fam-
ily, let us see three more members of “named” probability distributions,
the Bernoulli (Example 6.8), Binomial (Example 6.9), and Beta (Exam-
ple 6.10) distributions.

Example 6.8
The Bernoulli distribution is a distribution for a single binary random
variable X with state x ∈ {0, 1}. It is governed by a single continuous pa-
rameter µ ∈ [0, 1] that represents the probability of X = 1. The Bernoulli
distribution Ber(µ) is defined as
p(x | µ) = µ^x (1 − µ)^{1−x} ,  x ∈ {0, 1} ,    (6.92)
E[x] = µ ,    (6.93)
V[x] = µ(1 − µ) ,    (6.94)
where E[x] and V[x] are the mean and variance of the binary random
variable X .

An example where the Bernoulli distribution can be used is when we


are interested in modeling the probability of “heads” when flipping a coin.


[Figure 6.10: Examples of the Binomial distribution p(m) for
µ ∈ {0.1, 0.4, 0.75} and N = 15, plotted against the number m of
observations x = 1 in N = 15 experiments.]

Remark. The rewriting above of the Bernoulli distribution, where we use


Boolean variables as numerical 0 or 1 and express them in the exponents,
is a trick that is often used in machine learning textbooks. Another
occurrence of this is when expressing the Multinomial distribution. ♦

Example 6.9 (Binomial Distribution)


The Binomial distribution is a generalization of the Bernoulli distribution
to a distribution over integers (illustrated in Figure 6.10). In
particular, the Binomial can be used to describe the probability of
observing m occurrences of X = 1 in a set of N samples from a Bernoulli
distribution where p(X = 1) = µ ∈ [0, 1]. The Binomial distribution
Bin(N, µ) is defined as

p(m | N, µ) = (N choose m) µ^m (1 − µ)^{N−m} ,    (6.95)
E[m] = Nµ ,    (6.96)
V[m] = Nµ(1 − µ) ,    (6.97)
where E[m] and V[m] are the mean and variance of m, respectively.

An example where the Binomial could be used is if we want to describe


the probability of observing m “heads” in N coin-flip experiments if the
probability of observing a head in a single experiment is µ.

Example 6.10 (Beta Distribution)


We may wish to model a continuous random variable on a finite interval.
The Beta distribution is a distribution over a continuous random variable
µ ∈ [0, 1], which is often used to represent the probability for some binary
event (e.g., the parameter governing the Bernoulli distribution). The Beta


distribution Beta(α, β) (illustrated in Figure 6.11) itself is governed by


two parameters α > 0, β > 0 and is defined as
p(µ | α, β) = (Γ(α + β)/(Γ(α)Γ(β))) µ^{α−1} (1 − µ)^{β−1} ,    (6.98)
E[µ] = α/(α + β) ,  V[µ] = αβ/((α + β)²(α + β + 1)) ,    (6.99)

where Γ(·) is the Gamma function defined as

Γ(t) := ∫_0^∞ x^{t−1} exp(−x) dx ,  t > 0 ,    (6.100)
Γ(t + 1) = tΓ(t) .    (6.101)
Note that the fraction of Gamma functions in (6.98) normalizes the Beta
distribution.

[Figure 6.11: Examples of the Beta distribution p(µ | α, β) for different
values of α and β: α = 0.5 = β; α = 1 = β; α = 2, β = 0.3; α = 4, β = 10;
α = 5, β = 1.]

Intuitively, α moves probability mass toward 1, whereas β moves prob-


ability mass toward 0. There are some special cases (Murphy, 2012):

- For α = 1 = β, we obtain the uniform distribution U[0, 1].
- For α, β < 1, we get a bimodal distribution with spikes at 0 and 1.
- For α, β > 1, the distribution is unimodal.
- For α, β > 1 and α = β, the distribution is unimodal, symmetric, and
  centered in the interval [0, 1], i.e., the mode/mean is at 1/2.

Remark. There is a whole zoo of distributions with names, and they are
related in different ways to each other (Leemis and McQueston, 2008).
It is worth keeping in mind that each named distribution is created for a
particular reason, but may have other applications. Knowing the reason
behind the creation of a particular distribution often allows insight into
how to best use it. We introduced the preceding three distributions to be


able to illustrate the concepts of conjugacy (Section 6.6.1) and exponen-


tial families (Section 6.6.3). ♦

6.6.1 Conjugacy
According to Bayes’ theorem (6.23), the posterior is proportional to the
product of the prior and the likelihood. The specification of the prior can
be tricky for two reasons: First, the prior should encapsulate our knowl-
edge about the problem before we see any data. This is often difficult to
describe. Second, it is often not possible to compute the posterior distribu-
tion analytically. However, there are some priors that are computationally
convenient: conjugate priors.

Definition 6.13 (Conjugate Prior). A prior is conjugate for the likelihood
function if the posterior is of the same form/type as the prior.
Conjugacy is particularly convenient because we can algebraically cal-
culate our posterior distribution by updating the parameters of the prior
distribution.
Remark. When considering the geometry of probability distributions, con-
jugate priors retain the same distance structure as the likelihood (Agarwal
and Daumé III, 2010). ♦
To introduce a concrete example of conjugate priors, we describe in Ex-
ample 6.11 the Binomial distribution (defined on discrete random vari-
ables) and the Beta distribution (defined on continuous random vari-
ables).

Example 6.11 (Beta-Binomial Conjugacy)


Consider a Binomial random variable x ∼ Bin(N, µ) where
p(x | N, µ) = (N choose x) µ^x (1 − µ)^{N−x} ,  x = 0, 1, . . . , N ,    (6.102)
is the probability of finding x times the outcome “heads” in N coin flips,
where µ is the probability of a “head”. We place a Beta prior on the pa-
rameter µ, that is, µ ∼ Beta(α, β), where
p(µ | α, β) = (Γ(α + β)/(Γ(α)Γ(β))) µ^{α−1} (1 − µ)^{β−1} .    (6.103)
If we now observe some outcome x = h, that is, we see h heads in N coin
flips, we compute the posterior distribution on µ as
p(µ | x = h, N, α, β) ∝ p(x | N, µ) p(µ | α, β)    (6.104a)
    ∝ µ^h (1 − µ)^{N−h} µ^{α−1} (1 − µ)^{β−1}    (6.104b)
    = µ^{h+α−1} (1 − µ)^{(N−h)+β−1}    (6.104c)

    ∝ Beta(h + α, N − h + β) ,    (6.104d)

i.e., the posterior distribution is a Beta distribution like the prior;
that is, the Beta prior is conjugate for the parameter µ in the Binomial
likelihood function.

Table 6.2 Examples of conjugate priors for common likelihood functions.

Likelihood   | Conjugate prior           | Posterior
-------------|---------------------------|---------------------------
Bernoulli    | Beta                      | Beta
Binomial     | Beta                      | Beta
Gaussian     | Gaussian/inverse Gamma    | Gaussian/inverse Gamma
Gaussian     | Gaussian/inverse Wishart  | Gaussian/inverse Wishart
Multinomial  | Dirichlet                 | Dirichlet
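
The update in (6.104d) is easy to verify numerically. The following sketch (illustrative, using SciPy) compares the analytic posterior with a grid-based normalization of likelihood times prior:

import numpy as np
from scipy.stats import beta, binom

a, b, N, h = 2.0, 3.0, 10, 7               # prior Beta(2, 3); 7 heads in 10 flips

mu = np.linspace(1e-6, 1 - 1e-6, 2001)     # grid over the parameter µ
unnorm = binom.pmf(h, N, mu) * beta.pdf(mu, a, b)
post_grid = unnorm / np.trapz(unnorm, mu)  # normalize numerically

post_exact = beta.pdf(mu, a + h, b + N - h)    # Beta(h + α, N − h + β), (6.104d)
print(np.max(np.abs(post_grid - post_exact)))  # close to zero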

In the following example, we will derive a result that is similar to the


Beta-Binomial conjugacy result. Here we will show that the Beta distribu-
tion is a conjugate prior for the Bernoulli distribution.

Example 6.12 (Beta-Bernoulli Conjugacy)


Let x ∈ {0, 1} be distributed according to the Bernoulli distribution with
parameter θ ∈ [0, 1], that is, p(x = 1 | θ) = θ. This can also be expressed
as p(x | θ) = θ^x (1 − θ)^{1−x}. Let θ be distributed according to a Beta
distribution with parameters α, β, that is, p(θ | α, β) ∝ θ^{α−1}(1 − θ)^{β−1}.
Multiplying the Beta and the Bernoulli distributions, we get

p(θ | x, α, β) ∝ p(x | θ) p(θ | α, β)    (6.105a)
    ∝ θ^x (1 − θ)^{1−x} θ^{α−1} (1 − θ)^{β−1}    (6.105b)
    = θ^{α+x−1} (1 − θ)^{β+(1−x)−1}    (6.105c)
    ∝ p(θ | α + x, β + (1 − x)) .    (6.105d)

The last line is the Beta distribution with parameters (α + x, β + (1 − x)).

Table 6.2 lists examples of conjugate priors for the parameters of some
standard likelihoods used in probabilistic modeling. Distributions such as
Multinomial, inverse Gamma, inverse Wishart, and Dirichlet can be found
in any statistical text, and are described in Bishop (2006), for example.
The Beta distribution is the conjugate prior for the parameter µ in both
the Binomial and the Bernoulli likelihood. For a Gaussian likelihood
function, we can place a conjugate Gaussian prior on the mean. The reason
why the Gaussian likelihood appears twice in the table is that we need
to distinguish the univariate from the multivariate case. In the univariate
(scalar) case, the inverse Gamma is the conjugate prior for the variance.
In the multivariate case, we use a conjugate inverse Wishart distribution
as a prior on the covariance matrix. (The Gamma prior is conjugate for the
precision (inverse variance) in the univariate Gaussian likelihood, and
the Wishart prior is conjugate for the precision matrix (inverse
covariance matrix) in the multivariate Gaussian likelihood.) The Dirichlet
distribution is the conju-


gate prior for the multinomial likelihood function. For further details, we
refer to Bishop (2006).

6.6.2 Sufficient Statistics


Recall that a statistic of a random variable is a deterministic function of
that random variable. For example, if x = [x1, . . . , xN]ᵀ is a vector of
univariate Gaussian random variables, that is, xn ∼ N(µ, σ²), then the
sample mean µ̂ = (1/N)(x1 + · · · + xN) is a statistic. Sir Ronald Fisher
discovered the notion of sufficient statistics: the idea that there are statistics
that will contain all available information that can be inferred from data
corresponding to the distribution under consideration. In other words, suf-
ficient statistics carry all the information needed to make inference about
the population, that is, they are the statistics that are sufficient to repre-
sent the distribution.
For a set of distributions parametrized by θ, let X be a random variable
with distribution p(x | θ0 ) given an unknown θ0 . A vector φ(x) of statistics
is called sufficient statistics for θ0 if they contain all possible informa-
tion about θ0 . To be more formal about “contain all possible information”,
this means that the probability of x given θ can be factored into a part
that does not depend on θ, and a part that depends on θ only via φ(x).
The Fisher-Neyman factorization theorem formalizes this notion, which
we state in Theorem 6.14 without proof.
Theorem 6.14 (Fisher-Neyman). [Theorem 6.5 in Lehmann and Casella
(1998)] Let X have probability density function p(x | θ). Then the
statistics φ(x) are sufficient for θ if and only if p(x | θ) can be written
in the form

p(x | θ) = h(x) g_θ(φ(x)) ,    (6.106)
where h(x) is a distribution independent of θ and gθ captures all the depen-
dence on θ via sufficient statistics φ(x).
If p(x | θ) does not depend on θ, then φ(x) is trivially a sufficient statistic
for any function φ. The more interesting case is that p(x | θ) is dependent
only on φ(x) and not x itself. In this case, φ(x) is a sufficient statistic for
θ.
In machine learning, we consider a finite number of samples from a
distribution. One could imagine that for simple distributions (such as the
Bernoulli in Example 6.8) we only need a small number of samples to
estimate the parameters of the distributions. We could also consider the
opposite problem: If we have a set of data (a sample from an unknown
distribution), which distribution gives the best fit? A natural question to
ask is, as we observe more data, do we need more parameters θ to de-
scribe the distribution? It turns out that the answer is yes in general, and
this is studied in non-parametric statistics (Wasserman, 2007). A converse
question is to consider which class of distributions have finite-dimensional


sufficient statistics, that is, the number of parameters needed to describe


them does not increase arbitrarily. The answer is exponential family dis-
tributions, described in the following section.

6.6.3 Exponential Family


There are three possible levels of abstraction we can have when con-
sidering distributions (of discrete or continuous random variables). At
level one (the most concrete end of the spectrum), we have a particu-
lar named distribution
 with fixed parameters, for example a univariate
Gaussian N 0, 1 with zero mean and unit variance. In machine learning,
we often use the second level of abstraction, that is, we fix the paramet-
ric form (the univariate Gaussian) and infer the parameters
 from data. For
example, we assume a univariate Gaussian N µ, σ 2 with unknown mean
µ and unknown variance σ 2 , and use a maximum likelihood fit to deter-
mine the best parameters (µ, σ 2 ). We will see an example of this when
considering linear regression in Chapter 9. A third level of abstraction is
to consider families of distributions, and in this book, we consider the ex-
ponential family. The univariate Gaussian is an example of a member of
the exponential family. Many of the widely used statistical models, includ-
ing all the “named” models in Table 6.2, are members of the exponential
family. They can all be unified into one concept (Brown, 1986).
Remark. A brief historical anecdote: Like many concepts in mathemat-
ics and science, exponential families were independently discovered at
the same time by different researchers. In the years 1935–1936, Edwin
Pitman in Tasmania, Georges Darmois in Paris, and Bernard Koopman in
New York independently showed that the exponential families are the only
families that enjoy finite-dimensional sufficient statistics under repeated
independent sampling (Lehmann and Casella, 1998). ♦
An exponential family is a family of probability distributions,
parameterized by θ ∈ R^D, of the form

p(x | θ) = h(x) exp(⟨θ, φ(x)⟩ − A(θ)) ,    (6.107)
where φ(x) is the vector of sufficient statistics. In general, any inner
product (Section 3.2) can be used in (6.107), and for concreteness we will
use the standard dot product here (⟨θ, φ(x)⟩ = θᵀφ(x)). Note that the form
of the exponential family is essentially a particular expression of
g_θ(φ(x)) in the Fisher-Neyman theorem (Theorem 6.14).
The factor h(x) can be absorbed into the dot product term by adding
another entry (log h(x)) to the vector of sufficient statistics φ(x), and
constraining the corresponding parameter θ0 = 1. The term A(θ) is the
normalization constant that ensures that the distribution sums up or inte-
grates to one and is called the log-partition function. A good intuitive
notion of exponential families can be obtained by ignoring these two terms


and considering exponential families as distributions of the form


p(x | θ) ∝ exp(θᵀφ(x)) .    (6.108)

For this form of parametrization, the parameters θ are called the natural
parameters. At first glance, it seems that exponential families are a mun-
dane transformation by adding the exponential function to the result of a
dot product. However, there are many implications that allow for conve-
nient modeling and efficient computation based on the fact that we can
capture information about data in φ(x).

Example 6.13 (Gaussian as Exponential Family)  


Consider the univariate Gaussian distribution N(µ, σ²). Let φ(x) = [x, x²]ᵀ.
Then by using the definition of the exponential family,

p(x | θ) ∝ exp(θ1 x + θ2 x²) .    (6.109)

Setting

θ = [µ/σ², −1/(2σ²)]ᵀ    (6.110)

and substituting into (6.109), we obtain

p(x | θ) ∝ exp(µx/σ² − x²/(2σ²)) ∝ exp(−(x − µ)²/(2σ²)) .    (6.111)

Therefore, the univariate Gaussian distribution is a member of the
exponential family with sufficient statistic φ(x) = [x, x²]ᵀ, and natural
parameters given by θ in (6.110).

Example 6.14 (Bernoulli as Exponential Family)


Recall the Bernoulli distribution from Example 6.8
p(x | µ) = µ^x (1 − µ)^{1−x} ,  x ∈ {0, 1} .    (6.112)

This can be written in exponential family form

p(x | µ) = exp[log(µ^x (1 − µ)^{1−x})]    (6.113a)
    = exp[x log µ + (1 − x) log(1 − µ)]    (6.113b)
    = exp[x log µ − x log(1 − µ) + log(1 − µ)]    (6.113c)
    = exp[x log(µ/(1 − µ)) + log(1 − µ)] .    (6.113d)
The last line (6.113d) can be identified as being in exponential family
form (6.107) by observing that
h(x) = 1 (6.114)


θ = log(µ/(1 − µ))    (6.115)
φ(x) = x    (6.116)
A(θ) = −log(1 − µ) = log(1 + exp(θ)) .    (6.117)
The relationship between θ and µ is invertible so that
µ = 1/(1 + exp(−θ)) .    (6.118)
The relation (6.118) is used to obtain the right equality of (6.117).

Remark. The relationship between the original Bernoulli parameter µ and


the natural parameter θ is known as the sigmoid or logistic function.
Observe that µ ∈ (0, 1) but θ ∈ R, and therefore the sigmoid function
squeezes a real value into the range (0, 1). This property is useful in
machine learning; for example, it is used in logistic regression (Bishop,
2006, section 4.3.2), as well as a nonlinear activation function in neural
networks (Goodfellow et al., 2016, chapter 6). ♦
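
The mapping between µ and the natural parameter θ in (6.115)–(6.118) can be checked in a few lines (an illustrative sketch):

import numpy as np

def to_natural(mu):
    # (6.115): the natural parameter of the Bernoulli distribution (logit).
    return np.log(mu / (1 - mu))

def to_mean(theta):
    # (6.118): the inverse mapping (sigmoid / logistic function).
    return 1 / (1 + np.exp(-theta))

mu = 0.7
theta = to_natural(mu)
print(theta, to_mean(theta))      # the round trip recovers µ = 0.7
print(np.log(1 + np.exp(theta)), -np.log(1 - mu))  # both sides of (6.117)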
It is often not obvious how to find the parametric form of the conjugate
distribution of a particular distribution (for example, those in Table 6.2).
Exponential families provide a convenient way to find conjugate pairs of
distributions. Suppose the random variable X is a member of the
exponential family (6.107):

p(x | θ) = h(x) exp(⟨θ, φ(x)⟩ − A(θ)) .    (6.119)

Every member of the exponential family has a conjugate prior (Brown,


1986)
    
p(θ | γ) = h_c(θ) exp(⟨[γ1; γ2], [θ; −A(θ)]⟩ − A_c(γ)) ,    (6.120)

where γ = [γ1; γ2] has dimension dim(θ) + 1. The sufficient statistics of
the conjugate prior are [θ; −A(θ)]. By using the knowledge of the general
form of conjugate priors for exponential families, we can derive functional
forms of conjugate priors corresponding to particular distributions.

Example 6.15
Recall the exponential family form of the Bernoulli distribution (6.113d)
 
p(x | µ) = exp[x log(µ/(1 − µ)) + log(1 − µ)] .    (6.121)


The canonical conjugate prior has the form

p(µ | α, β) = h_c(µ) exp[α log(µ/(1 − µ)) + (β + α) log(1 − µ) − A_c(γ)] ,    (6.122)

where we defined γ := [α, β + α]ᵀ and h_c(µ) := 1/(µ(1 − µ)). Equation
(6.122) then simplifies to

p(µ | α, β) = exp[(α − 1) log µ + (β − 1) log(1 − µ) − A_c(α, β)] .    (6.123)
Putting this in non-exponential family form yields
p(µ | α, β) ∝ µ^{α−1} (1 − µ)^{β−1} ,    (6.124)

which we identify as the Beta distribution (6.98). In Example 6.12, we
assumed that the Beta distribution is the conjugate prior of the Bernoulli
distribution and showed that it was indeed the conjugate prior. In this
example, we derived the form of the Beta distribution by looking at the
canonical conjugate prior of the Bernoulli distribution in exponential fam-
ily form.

As mentioned in the previous section, the main motivation for expo-


nential families is that they have finite-dimensional sufficient statistics.
Additionally, conjugate distributions are easy to write down, and the con-
jugate distributions also come from an exponential family. From an infer-
ence perspective, maximum likelihood estimation behaves nicely because
empirical estimates of sufficient statistics are optimal estimates of the pop-
ulation values of sufficient statistics (recall the mean and covariance of a
Gaussian). From an optimization perspective, the log-likelihood function
is concave, allowing for efficient optimization approaches to be applied
(Chapter 7).

6.7 Change of Variables/Inverse Transform


It may seem that there are very many known distributions, but in reality
the set of distributions for which we have names is quite limited. There-
fore, it is often useful to understand how transformed random variables
are distributed. For example, assuming that X is a random variable
 dis-
tributed according to the univariate normal distribution N 0, 1 , what is
the distribution of X 2 ? Another example, which is quite common in ma-
chine learning, is, given that X1 and X2 are univariate standard normal,
what is the distribution of ½(X1 + X2)?
One option to work out the distribution of ½(X1 + X2) is to calculate
the mean and variance of X1 and X2 and then combine them. As we saw
in Section 6.4.4, we can calculate the mean and variance of resulting ran-
dom variables when we consider affine transformations of random vari-


ables. However, we may not be able to obtain the functional form of the
distribution under transformations. Furthermore, we may be interested
in nonlinear transformations of random variables for which closed-form
expressions are not readily available.
Remark (Notation). In this section, we will be explicit about random vari-
ables and the values they take. Hence, recall that we use capital letters
X, Y to denote random variables and small letters x, y to denote the val-
ues in the target space T that the random variables take. We will explicitly
write pmfs of discrete random variables X as P (X = x). For continuous
random variables X (Section 6.2.2), the pdf is written as f (x) and the cdf
is written as FX (x). ♦
We will look at two approaches for obtaining distributions of transfor-
mations of random variables: a direct approach using the definition of a
cumulative distribution function and a change-of-variable approach that
uses the chain rule of calculus (Section 5.2.2). (Moment generating
functions can also be used to study transformations of random variables;
see Casella and Berger (2002, chapter 2).) The change-of-variable approach
is widely used because it provides a “recipe” for attempting to compute
the resulting distribution due to a transformation. We will explain the
techniques for univariate random variables, and will only briefly provide
the results for the general case of multivariate random variables.
Transformations of discrete random variables can be understood di-
rectly. Suppose that there is a discrete random variable X with pmf P (X =
x) (Section 6.2.1), and an invertible function U (x). Consider the trans-
formed random variable Y := U (X), with pmf P (Y = y). Then

P(Y = y) = P(U(X) = y)    transformation of interest    (6.125a)
         = P(X = U⁻¹(y)) ,    inverse    (6.125b)

where we can observe that x = U⁻¹(y). Therefore, for discrete random


variables, transformations directly change the individual events (with the
probabilities appropriately transformed).

6.7.1 Distribution Function Technique


The distribution function technique goes back to first principles, and uses
the definition of a cdf F_X(x) = P(X ≤ x) and the fact that its
differential is the pdf f(x) (Wasserman, 2004, chapter 2). For a random
variable X and a function U, we find the pdf of the random variable
Y := U(X) by

1. Finding the cdf:

   F_Y(y) = P(Y ≤ y) .    (6.126)

2. Differentiating the cdf F_Y(y) to get the pdf f(y):

   f(y) = (d/dy) F_Y(y) .    (6.127)


We also need to keep in mind that the domain of the random variable may
have changed due to the transformation by U .

Example 6.16
Let X be a continuous random variable with probability density function
on 0 ≤ x ≤ 1:

f(x) = 3x² .    (6.128)
We are interested in finding the pdf of Y = X².
The function f is an increasing function of x, and therefore the resulting
value of y lies in the interval [0, 1]. We obtain

F_Y(y) = P(Y ≤ y)    definition of cdf    (6.129a)
       = P(X² ≤ y)    transformation of interest    (6.129b)
       = P(X ≤ y^{1/2})    inverse    (6.129c)
       = F_X(y^{1/2})    definition of cdf    (6.129d)
       = ∫_0^{y^{1/2}} 3t² dt    cdf as a definite integral    (6.129e)
       = [t³]_{t=0}^{t=y^{1/2}}    result of integration    (6.129f)
       = y^{3/2} ,  0 ≤ y ≤ 1 .    (6.129g)
Therefore, the cdf of Y is

F_Y(y) = y^{3/2}    (6.130)

for 0 ≤ y ≤ 1. To obtain the pdf, we differentiate the cdf:

f(y) = (d/dy) F_Y(y) = (3/2) y^{1/2}    (6.131)

for 0 ≤ y ≤ 1.
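
A quick Monte Carlo check of (6.131) (a sketch; since X has cdf x³, we can sample it as X = V^{1/3} for uniform V):

import numpy as np

rng = np.random.default_rng(6)
v = rng.uniform(size=1_000_000)
x = v ** (1 / 3)   # X has cdf x^3 on [0, 1], hence pdf 3x^2
y = x ** 2         # transformed variable Y = X^2

hist, edges = np.histogram(y, bins=20, range=(0, 1), density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - 1.5 * np.sqrt(centers))))  # small deviation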

In Example 6.16, we considered a strictly monotonically increasing
function f(x) = 3x². This means that we could compute an inverse function.
(Functions that have inverses are called bijective functions (Section
2.7).) In general, we require that the function of interest y = U(x) has
an inverse x = U⁻¹(y). A useful result can be obtained by considering the
cumulative distribution function F_X(x) of a random variable X, and using
it as the transformation U(x). This leads to the following theorem.

Theorem 6.15. [Theorem 2.1.10 in Casella and Berger (2002)] Let X be a


continuous random variable with a strictly monotonic cumulative distribu-
tion function F_X(x). Then the random variable Y defined as

Y := F_X(X)    (6.132)

has a uniform distribution.


Theorem 6.15 is known as the probability integral transform, and it is
used to derive algorithms for sampling from distributions by transforming
the result of sampling from a uniform random variable (Bishop, 2006).
The algorithm works by first generating a sample from a uniform distribu-
tion, then transforming it by the inverse cdf (assuming this is available)
to obtain a sample from the desired distribution. The probability integral
transform is also used for hypothesis testing whether a sample comes from
a particular distribution (Lehmann and Romano, 2005). The idea that the
output of a cdf gives a uniform distribution also forms the basis of copu-
las (Nelsen, 2006).
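
A minimal sketch of this sampling algorithm (illustrative), using the exponential distribution with rate λ, whose cdf F(x) = 1 − exp(−λx) has the closed-form inverse F⁻¹(u) = −log(1 − u)/λ:

import numpy as np

rng = np.random.default_rng(7)
lam = 2.0

u = rng.uniform(size=1_000_000)  # uniform samples on [0, 1]
x = -np.log(1 - u) / lam         # the inverse cdf maps them to Exponential(λ)

print(x.mean(), 1 / lam)         # mean of Exponential(λ) is 1/λ
print(x.var(), 1 / lam ** 2)     # variance is 1/λ²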

6.7.2 Change of Variables


The distribution function technique in Section 6.7.1 is derived from first
principles, based on the definitions of cdfs and using properties of in-
verses, differentiation, and integration. This argument from first principles
relies on two facts:

1. We can transform the cdf of Y into an expression that is a cdf of X .


2. We can differentiate the cdf to obtain the pdf.

Let us break down the reasoning step by step, with the goal of understand-
ing the more general change-of-variables approach in Theorem 6.16. Change of variables
in probability relies
Remark. The name “change of variables” comes from the idea of chang- on the
ing the variable of integration when faced with a difficult integral. For change-of-variables
univariate functions, we use the substitution rule of integration, method in
Z Z calculus (Tandra,
0 2014).
f (g(x))g (x)dx = f (u)du , where u = g(x) . (6.133)

The derivation of this rule is based on the chain rule of calculus (5.32) and
by applying twice the fundamental theorem of calculus. The fundamental
theorem of calculus formalizes the fact that integration and differentiation
are somehow “inverses” of each other. An intuitive understanding of the
rule can be obtained by thinking (loosely) about small changes (differen-
tials) to the equation u = g(x), that is by considering ∆u = g 0 (x)∆x as a
differential of u = g(x). By substituting u = g(x), the argument inside the
integral on the right-hand side of (6.133) becomes f (g(x)). By pretending
that the term du can be approximated by du ≈ ∆u = g 0 (x)∆x, and that
dx ≈ ∆x, we obtain (6.133). ♦
Consider a univariate random variable X , and an invertible function
U , which gives us another random variable Y = U (X). We assume that
random variable X has states x ∈ [a, b]. By the definition of the cdf, we
have
F_Y(y) = P(Y ≤ y) .    (6.134)


We are interested in a function U of the random variable


P(Y ≤ y) = P(U(X) ≤ y) ,    (6.135)
where we assume that the function U is invertible. An invertible function
on an interval is either strictly increasing or strictly decreasing. In the case
that U is strictly increasing, its inverse U⁻¹ is also strictly increasing.
By applying the inverse U⁻¹ to the arguments of P(U(X) ≤ y), we obtain

P(U(X) ≤ y) = P(U⁻¹(U(X)) ≤ U⁻¹(y)) = P(X ≤ U⁻¹(y)) .    (6.136)
The right-most term in (6.136) is an expression of the cdf of X . Recall the
definition of the cdf in terms of the pdf
P(X ≤ U⁻¹(y)) = ∫_a^{U⁻¹(y)} f(x) dx .    (6.137)

Now we have an expression of the cdf of Y in terms of x:

F_Y(y) = ∫_a^{U⁻¹(y)} f(x) dx .    (6.138)

To obtain the pdf, we differentiate (6.138) with respect to y :


f(y) = (d/dy) F_Y(y) = (d/dy) ∫_a^{U⁻¹(y)} f(x) dx .    (6.139)
Note that the integral on the right-hand side is with respect to x, but we
need an integral with respect to y because we are differentiating with
respect to y . In particular, we use (6.133) to get the substitution
∫ f(U⁻¹(y)) (U⁻¹)′(y) dy = ∫ f(x) dx ,  where x = U⁻¹(y) .    (6.140)

Using (6.140) on the right-hand side of (6.139) gives us


f(y) = (d/dy) ∫_a^{U⁻¹(y)} f_x(U⁻¹(y)) (U⁻¹)′(y) dy .    (6.141)
We then recall that differentiation is a linear operator and we use the
subscript x to remind ourselves that fx (U −1 (y)) is a function of x and not
y . Invoking the fundamental theorem of calculus again gives us
f(y) = f_x(U⁻¹(y)) · (d/dy) U⁻¹(y) .    (6.142)
Recall that we assumed that U is a strictly increasing function. For decreas-
ing functions, it turns out that we have a negative sign when we follow
the same derivation. We introduce the absolute value of the differential to
have the same expression for both increasing and decreasing U :
f(y) = f_x(U⁻¹(y)) · |(d/dy) U⁻¹(y)| .    (6.143)

This is called the change-of-variable technique. The term
|(d/dy) U⁻¹(y)| in (6.143) measures how much a unit volume changes when
applying U (see also the definition of the Jacobian in Section 5.3).
Remark. In comparison to the discrete case in (6.125b), we have an
additional factor |(d/dy) U⁻¹(y)|. The continuous case requires more care
because P(Y = y) = 0 for all y. The probability density function f(y) does
not have a description as a probability of an event involving y. ♦
So far in this section, we have been studying univariate change of vari-
ables. The case for multivariate random variables is analogous, but com-
plicated by the fact that the absolute value cannot be used for multivariate
functions. Instead, we use the determinant of the Jacobian matrix. Recall
from (5.58) that the Jacobian is a matrix of partial derivatives, and that
the existence of a nonzero determinant shows that we can invert the Ja-
cobian. Recall the discussion in Section 4.1 that the determinant arises
because our differentials (cubes of volume) are transformed into paral-
lelepipeds by the Jacobian. Let us summarize preceding the discussion in
the following theorem, which gives us a recipe for multivariate change of
variables.

Theorem 6.16. [Theorem 17.2 in Billingsley (1995)] Let f(x) be the value
of the probability density of the multivariate continuous random variable X.
If the vector-valued function y = U(x) is differentiable and invertible for
all values within the domain of x, then for corresponding values of y, the
probability density of Y = U(X) is given by

    f(y) = f_x(U^{-1}(y)) · |det( ∂/∂y U^{-1}(y) )| .            (6.144)
The theorem looks intimidating at first glance, but the key point is that
a change of variable of a multivariate random variable follows the pro-
cedure of the univariate change of variable. First we need to work out
the inverse transform and substitute it into the density of x. Then we
calculate the absolute value of the determinant of the Jacobian and multiply
the density by it. The following example illustrates the case of a bivariate
random variable.

Example 6.17
Consider a bivariate random variable X with states x = [x1; x2] and proba-
bility density function

    f([x1; x2]) = 1/(2π) · exp( −(1/2) [x1; x2]^T [x1; x2] ) .   (6.145)

We use the change-of-variable technique from Theorem 6.16 to derive the
effect of a linear transformation (Section 2.7) of the random variable.
Consider a matrix A ∈ R^{2×2} defined as

    A = [a b; c d] .                                             (6.146)

We are interested in finding the probability density function of the trans-
formed bivariate random variable Y with states y = Ax.
Recall that for change of variables we require the inverse transformation
of x as a function of y. Since we consider linear transformations, the
inverse transformation is given by the matrix inverse (see Section 2.2.2).
For 2 × 2 matrices, we can explicitly write out the formula:

    [x1; x2] = A^{-1} [y1; y2] = 1/(ad − bc) · [d −b; −c a] [y1; y2] .   (6.147)
Observe that ad − bc is the determinant (Section 4.1) of A. The corre-
sponding probability density function is given by

    f(x) = f(A^{-1}y) = 1/(2π) · exp( −(1/2) y^T A^{-T} A^{-1} y ) .   (6.148)

The partial derivative of a matrix times a vector with respect to the vector
is the matrix itself (Section 5.5), and therefore

    ∂/∂y A^{-1} y = A^{-1} .                                     (6.149)

Recall from Section 4.1 that the determinant of the inverse is the inverse
of the determinant, so that the determinant of the Jacobian matrix is

    det( ∂/∂y A^{-1} y ) = 1/(ad − bc) .                         (6.150)
We are now able to apply the change-of-variable formula from Theo-
rem 6.16 by multiplying (6.148) with the absolute value of (6.150), which
yields

    f(y) = f(x) · |det( ∂/∂y A^{-1} y )|                         (6.151a)
         = 1/(2π) · exp( −(1/2) y^T A^{-T} A^{-1} y ) · |ad − bc|^{-1} .   (6.151b)

While Example 6.17 is based on a bivariate random variable, which al-
lows us to easily compute the matrix inverse, the preceding relation holds
for higher dimensions.

Remark. We saw in Section 6.5 that the density f(x) in (6.145) is actually
the standard Gaussian distribution, and the transformed density f(y) is a
bivariate Gaussian with covariance Σ = AA^T. ♦
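
As a numerical sanity check of Example 6.17 and the preceding remark, the following Python/NumPy sketch samples from the standard bivariate Gaussian, transforms by a concrete invertible matrix A (our own illustrative choice, as are the sample size and tolerances), and verifies that the empirical covariance is close to AA^T and that (6.151b) agrees with the N(0, AA^T) density:

```python
import numpy as np

# Illustrative choice of A (an assumption, not from the text); any invertible
# 2x2 matrix works here.
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])

rng = np.random.default_rng(0)
x = rng.standard_normal((1_000_000, 2))  # samples from the density in (6.145)
y = x @ A.T                              # y = A x, applied row-wise

# The empirical covariance of Y should be close to Sigma = A A^T.
Sigma = A @ A.T
assert np.allclose(np.cov(y.T), Sigma, atol=0.05)

# The change-of-variable density (6.151b) at a test point ...
y0 = np.array([1.0, -0.5])
Ainv = np.linalg.inv(A)
det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]  # ad - bc
f_change_of_var = np.exp(-0.5 * y0 @ Ainv.T @ Ainv @ y0) / (2 * np.pi * abs(det))

# ... agrees with the N(0, Sigma) density at the same point, since
# Sigma^{-1} = A^{-T} A^{-1} and det(Sigma) = (det A)^2.
f_gaussian = np.exp(-0.5 * y0 @ np.linalg.solve(Sigma, y0)) / (
    2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
assert np.isclose(f_change_of_var, f_gaussian)
```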
We will use the ideas in this chapter to describe probabilistic modeling
in Section 8.4, as well as introduce a graphical language in Section 8.5. We
will see direct machine learning applications of these ideas in Chapters 9
and 11.

6.8 Further Reading


This chapter is rather terse at times. Grinstead and Snell (1997) and
Walpole et al. (2011) provide more relaxed presentations that are suit-
able for self-study. Readers interested in more philosophical aspects of
probability should consider Hacking (2001), whereas an approach that
is more related to software engineering is presented by Downey (2014).
An overview of exponential families can be found in Barndorff-Nielsen
(2014). We will see more about how to use probability distributions to
model machine learning tasks in Chapter 8. Ironically, the recent surge
in interest in neural networks has resulted in a broader appreciation of
probabilistic models. For example, the idea of normalizing flows (Jimenez
Rezende and Mohamed, 2015) relies on change of variables for transform-
ing random variables. An overview of methods for variational inference as
applied to neural networks is described in chapters 16 to 20 of the book
by Goodfellow et al. (2016).
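
To connect the pointer to normalizing flows back to this chapter, here is a minimal sketch of the core flow computation: evaluating a density by composing a base density with the inverse transform and the log-determinant term from (6.143). The affine transform and all names below are our own illustrative construction, not taken from any of the cited works.

```python
import numpy as np

# Illustrative invertible transform U(x) = a*x + b with a != 0 (assumption).
a, b = 2.0, 1.0

def log_base_density(x):
    # log of the standard Gaussian density, the usual "base" distribution
    return -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)

def log_density_y(y):
    x = (y - b) / a                # inverse transform U^{-1}(y)
    log_det_jac = -np.log(abs(a))  # log |d/dy U^{-1}(y)|, as in (6.143)
    return log_base_density(x) + log_det_jac

print(log_density_y(1.0))  # log-density of Y = aX + b evaluated at y = 1
```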
We sidestepped a large part of the difficulty in continuous random vari-
ables by avoiding measure-theoretic questions (Billingsley, 1995; Pollard,
2002), and by assuming without construction that we have real numbers,
and ways of defining sets on real numbers as well as their appropriate fre-
quency of occurrence. These details do matter, for example, in the specifi-
cation of conditional probability p(y | x) for continuous random variables
x, y (Proschan and Presnell, 1998). The lazy notation hides the fact that
we want to specify that X = x (which is a set of measure zero). Fur-
thermore, we are interested in the probability density function of y . A
more precise notation would have to say E_y[f(y) | σ(x)], where we take
the expectation over y of a test function f conditioned on the σ -algebra of
x. A more technical audience interested in the details of probability the-
ory has many options (Jaynes, 2003; MacKay, 2003; Jacod and Protter,
2004; Grimmett and Welsh, 2014), including some very technical discus-
sions (Shiryayev, 1984; Lehmann and Casella, 1998; Dudley, 2002; Bickel
and Doksum, 2006; Çinlar, 2011). An alternative way to approach proba-
bility is to start with the concept of expectation, and “work backward” to
derive the necessary properties of a probability space (Whittle, 2000). As
machine learning allows us to model more intricate distributions on ever
more complex types of data, a developer of probabilistic machine learn-
ing models would have to understand these more technical aspects. Ma-
chine learning texts with a probabilistic modeling focus include the books
by MacKay (2003); Bishop (2006); Rasmussen and Williams (2006); Bar-
ber (2012); Murphy (2012).

Exercises
6.1 Consider the following bivariate distribution p(x, y) of two discrete random
    variables X and Y.

                           X
                  x1     x2     x3     x4     x5
            y1   0.01   0.02   0.03   0.1    0.1
        Y   y2   0.05   0.1    0.05   0.07   0.2
            y3   0.1    0.05   0.03   0.05   0.04

    Compute:
    a. The marginal distributions p(x) and p(y).
    b. The conditional distributions p(x | Y = y1) and p(y | X = x3).
6.2 Consider a mixture of two Gaussian distributions (illustrated in Figure 6.4),

        0.4 N([10; 2], [1 0; 0 1]) + 0.6 N([0; 0], [8.4 2.0; 2.0 1.7]) .

    a. Compute the marginal distributions for each dimension.
    b. Compute the mean, mode, and median for each marginal distribution.
    c. Compute the mean and mode for the two-dimensional distribution.
6.3 You have written a computer program that sometimes compiles and some-
    times does not (the code does not change). You decide to model the apparent
    stochasticity (success vs. no success) x of the compiler using a Bernoulli
    distribution with parameter µ:

        p(x | µ) = µ^x (1 − µ)^{1−x} ,   x ∈ {0, 1} .

    Choose a conjugate prior for the Bernoulli likelihood and compute the pos-
    terior distribution p(µ | x1, . . . , xN).
6.4 There are two bags. The first bag contains four mangos and two apples; the
second bag contains four mangos and four apples.
We also have a biased coin, which shows “heads” with probability 0.6 and
“tails” with probability 0.4. If the coin shows “heads”, we pick a fruit at
random from bag 1; otherwise we pick a fruit at random from bag 2.
Your friend flips the coin (you cannot see the result), picks a fruit at random
from the corresponding bag, and presents you a mango.
What is the probability that the mango was picked from bag 2?
Hint: Use Bayes’ theorem.
6.5 Consider the time-series model

        x_{t+1} = A x_t + w ,   w ∼ N(0, Q) ,
        y_t = C x_t + v ,       v ∼ N(0, R) ,

    where w, v are i.i.d. Gaussian noise variables. Further, assume that p(x_0) =
    N(µ_0, Σ_0).

    a. What is the form of p(x_0, x_1, . . . , x_T)? Justify your answer (you do not
       have to explicitly compute the joint distribution).
    b. Assume that p(x_t | y_1, . . . , y_t) = N(µ_t, Σ_t).
       1. Compute p(x_{t+1} | y_1, . . . , y_t).
       2. Compute p(x_{t+1}, y_{t+1} | y_1, . . . , y_t).
       3. At time t + 1, we observe the value y_{t+1} = ŷ. Compute the conditional
          distribution p(x_{t+1} | y_1, . . . , y_{t+1}).
6.6 Prove the relationship in (6.44), which relates the standard definition of the
variance to the raw-score expression for the variance.
6.7 Prove the relationship in (6.45), which relates the pairwise difference be-
tween examples in a dataset with the raw-score expression for the variance.
6.8 Express the Bernoulli distribution in the natural parameter form of the ex-
ponential family, see (6.107).
6.9 Express the Binomial distribution as an exponential family distribution. Also
express the Beta distribution as an exponential family distribution. Show that
the product of the Beta and the Binomial distribution is also a member of
the exponential family.
6.10 Derive the relationship in Section 6.5.2 in two ways:
    a. By completing the square
    b. By expressing the Gaussian in its exponential family form

    The product of two Gaussians N(x | a, A) N(x | b, B) is an unnormalized
    Gaussian distribution c N(x | c, C) with

        C = (A^{-1} + B^{-1})^{-1}
        c = C(A^{-1} a + B^{-1} b)
        c = (2π)^{-D/2} |A + B|^{-1/2} exp( −(1/2) (a − b)^T (A + B)^{-1} (a − b) ) .

    Note that the normalizing constant c itself can be considered a (normalized)
    Gaussian distribution either in a or in b with an “inflated” covariance matrix
    A + B, i.e., c = N(a | b, A + B) = N(b | a, A + B).
6.11 Iterated Expectations.
    Consider two random variables x, y with joint distribution p(x, y). Show that

        E_X[x] = E_Y[ E_X[x | y] ] .

    Here, E_X[x | y] denotes the expected value of x under the conditional distri-
    bution p(x | y).
6.12 Manipulation of Gaussian Random Variables.
    Consider a Gaussian random variable x ∼ N(x | µ_x, Σ_x), where x ∈ R^D.
    Furthermore, we have

        y = Ax + b + w ,

    where y ∈ R^E, A ∈ R^{E×D}, b ∈ R^E, and w ∼ N(w | 0, Q) is indepen-
    dent Gaussian noise. “Independent” implies that x and w are independent
    random variables and that Q is diagonal.

    a. Write down the likelihood p(y | x).
    b. The distribution p(y) = ∫ p(y | x) p(x) dx is Gaussian. Compute the mean
       µ_y and the covariance Σ_y. Derive your result in detail.

    c. The random variable y is being transformed according to the measure-
       ment mapping

           z = Cy + v ,

       where z ∈ R^F, C ∈ R^{F×E}, and v ∼ N(v | 0, R) is independent Gaus-
       sian (measurement) noise.

       Write down p(z | y).
       Compute p(z), i.e., the mean µ_z and the covariance Σ_z. Derive your
       result in detail.
    d. Now, a value ŷ is measured. Compute the posterior distribution p(x | ŷ).
       Hint for solution: This posterior is also Gaussian, i.e., we need to de-
       termine only its mean and covariance matrix. Start by explicitly com-
       puting the joint Gaussian p(x, y). This also requires us to compute the
       cross-covariances Cov_{x,y}[x, y] and Cov_{y,x}[y, x]. Then apply the rules
       for Gaussian conditioning.
6.13 Probability Integral Transformation
    Given a continuous random variable X with cdf F_X(x), show that the ran-
    dom variable Y := F_X(X) is uniformly distributed (Theorem 6.15).
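
    For intuition (not a proof), the claim of Exercise 6.13 can be illustrated empirically. The sketch below uses our own illustrative assumptions: an exponential choice of X and SciPy's Kolmogorov–Smirnov test against Uniform(0, 1).

```python
import numpy as np
from scipy import stats

# Empirical illustration (not a proof) of the probability integral transform:
# if X is continuous with cdf F_X, then Y = F_X(X) is Uniform(0, 1).
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # illustrative choice of X
y = stats.expon(scale=2.0).cdf(x)             # Y = F_X(X)

# A Kolmogorov-Smirnov test against Uniform(0, 1) should not reject.
ks = stats.kstest(y, "uniform")
print(f"KS statistic: {ks.statistic:.4f}, p-value: {ks.pvalue:.3f}")
```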
