
Probabilistic Artificial Intelligence

Andreas Krause, Jonas Hübotter

arXiv:2502.05244v1 [cs.AI] 7 Feb 2025

Institute for Machine Learning


Department of Computer Science
Compiled on February 11, 2025.

This manuscript is based on the course Probabilistic Artificial Intelligence (263-5210-00L) at ETH Zürich.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

© 2025 ETH Zürich. All rights reserved.


Preface

Artificial intelligence commonly refers to the science and engineering of artificial systems that can carry out tasks generally associated with requiring aspects of human intelligence, such as playing games, translating languages, and driving cars. In recent years, there have been exciting advances in learning-based, data-driven approaches towards AI, and machine learning and deep learning have enabled computer systems to perceive the world in unprecedented ways. Reinforcement learning has enabled breakthroughs in complex games such as Go and challenging robotics tasks such as quadrupedal locomotion.

A key aspect of intelligence is to not only make predictions, but reason about the uncertainty in these predictions, and to consider this uncertainty when making decisions. This is what "Probabilistic Artificial Intelligence" is about. The first part covers probabilistic approaches to machine learning. We discuss the differentiation between "epistemic" uncertainty due to lack of data and "aleatoric" uncertainty, which is irreducible and stems, e.g., from noisy observations and outcomes. We discuss concrete approaches towards probabilistic inference, such as Bayesian linear regression, Gaussian process models and Bayesian neural networks. Often, inference and making predictions with such models is intractable, and we discuss modern approaches to efficient approximate inference.

The second part of the manuscript is about taking uncertainty into account in sequential decision tasks. We consider active learning and Bayesian optimization — approaches that collect data by proposing experiments that are informative for reducing the epistemic uncertainty. We then consider reinforcement learning, a rich formalism for modeling agents that learn to act in uncertain environments. After covering the basic formalism of Markov Decision Processes, we consider modern deep RL approaches that use neural network function approximation. We close by discussing modern approaches in model-based RL, which harness epistemic and aleatoric uncertainty to guide exploration, while also reasoning about safety.

Guide to the Reader

The material covered in this manuscript may support a one-semester graduate introduction to probabilistic machine learning and sequential decision-making. We welcome readers from all backgrounds. However, we assume familiarity with basic concepts in probability, calculus, linear algebra, and machine learning (e.g., neural networks) as covered in a typical introductory course to machine learning. In Chapter 1, we give a gentle introduction to probabilistic inference, which serves as the foundation for the rest of the manuscript. As part of this first chapter, we also review key concepts from probability theory. We provide a chapter reviewing key concepts of further mathematical background in the back of the manuscript.

Throughout the manuscript, we focus on key concepts and ideas rather than their historical development. We encourage you to consult the provided references for further reading and historical context to delve deeper into the covered topics.

Finally, we have included a set of exercises at the end of each chapter. When we highlight an exercise throughout the text, we use this question mark: ? (e.g., Problem 1.1) — so don't be surprised when you stumble upon it. You will find solutions to all exercises in the back of the manuscript.

We hope you will find this resource useful.

Contributing

We encourage you to raise issues and suggest fixes for anything you
think can be improved. We are thankful for any such feedback!
Contact: [email protected]

Acknowledgements

We are grateful to Sebastian Curi for creating the original Jupyter notebooks that accompany the course at ETH Zürich and which were instrumental in the creation of many figures. We thank Hado van Hasselt for kindly contributing Figure 12.1, and thank Tuomas Haarnoja (Haarnoja et al., 2018a) and Roberto Calandra (Chua et al., 2018) for kindly agreeing to have their figures included in this manuscript. Furthermore, many of the exercises in these notes are adapted from iterations of the course at ETH Zürich. Special thanks to all instructors that contributed to the course material over the years. We also thank all students of the course in the Fall of 2022, 2023, and 2024 who provided valuable feedback on various iterations of this manuscript and corrected many mistakes.

Finally, we thank Zhiyuan Hu, Shyam Sundhar Ramesh, Leander Diaz-Bone, Nicolas Menet, and Ido Hakimi for proofreading parts of various drafts of this text.
Contents

1 Fundamentals of Inference
  1.1 Probability
  1.2 Probabilistic Inference
  1.3 Supervised Learning and Point Estimates
  1.4 Outlook: Decision Theory

I Probabilistic Machine Learning

2 Linear Regression
  2.1 Weight-space View
  2.2 Aleatoric and Epistemic Uncertainty
  2.3 Non-linear Regression
  2.4 Function-space View

3 Filtering
  3.1 Conditioning and Prediction
  3.2 Kalman Filters

4 Gaussian Processes
  4.1 Learning and Inference
  4.2 Sampling
  4.3 Kernel Functions
  4.4 Model Selection
  4.5 Approximations

5 Variational Inference
  5.1 Laplace Approximation
  5.2 Predictions with a Variational Posterior
  5.3 Blueprint of Variational Inference
  5.4 Information Theoretic Aspects of Uncertainty
  5.5 Evidence Lower Bound

6 Markov Chain Monte Carlo Methods
  6.1 Markov Chains
  6.2 Elementary Sampling Methods
  6.3 Sampling using Gradients

7 Deep Learning
  7.1 Artificial Neural Networks
  7.2 Bayesian Neural Networks
  7.3 Approximate Probabilistic Inference
  7.4 Calibration

II Sequential Decision-Making

8 Active Learning
  8.1 Conditional Entropy
  8.2 Mutual Information
  8.3 Submodularity of Mutual Information
  8.4 Maximizing Mutual Information
  8.5 Learning Locally: Transductive Active Learning

9 Bayesian Optimization
  9.1 Exploration-Exploitation Dilemma
  9.2 Online Learning and Bandits
  9.3 Acquisition Functions

10 Markov Decision Processes
  10.1 Bellman Expectation Equation
  10.2 Policy Evaluation
  10.3 Policy Optimization
  10.4 Partial Observability

11 Tabular Reinforcement Learning
  11.1 The Reinforcement Learning Problem
  11.2 Model-based Approaches
  11.3 Balancing Exploration and Exploitation
  11.4 Model-free Approaches

12 Model-free Reinforcement Learning
  12.1 Tabular Reinforcement Learning as Optimization
  12.2 Value Function Approximation
  12.3 Policy Approximation
  12.4 On-policy Actor-Critics
  12.5 Off-policy Actor-Critics
  12.6 Maximum Entropy Reinforcement Learning
  12.7 Learning from Preferences

13 Model-based Reinforcement Learning
  13.1 Planning
  13.2 Learning
  13.3 Exploration

A Mathematical Background
  A.1 Probability
  A.2 Quadratic Forms and Gaussians
  A.3 Parameter Estimation
  A.4 Optimization
  A.5 Useful Matrix Identities and Inequalities

B Solutions

Bibliography

Summary of Notation

Acronyms

Index
1 Fundamentals of Inference

Boolean logic is the algebra of statements which are either true or false.
Consider, for example, the statements

“If it is raining, the ground is wet.” and “It is raining.”

A quite remarkable property of Boolean logic is that we can combine these premises to draw logical inferences which are new (true) statements. In the above example, we can conclude that the ground must be wet. This is an example of logical reasoning which is commonly referred to as logical inference, and the study of artificial systems that are able to perform logical inference is known as symbolic artificial intelligence.

But is it really raining? Perhaps it is hard to tell by looking out of the window. Or we have seen it rain earlier, but some time has passed since we have last looked out of the window. And is it really true that if it rains, the ground is wet? Perhaps the rain is just light enough that it is absorbed quickly, and therefore the ground still appears dry.

This goes to show that in our experience, the real world is rarely black and white. We are frequently (if not usually) uncertain about the truth of statements, and yet we are able to reason about the world and make predictions. We will see that the principles of Boolean logic can be extended to reason in the face of uncertainty. The mathematical framework that allows us to do this is probability theory, which — as we will find in this first chapter — can be seen as a natural extension of Boolean logic from the domain of certainty to the domain of uncertainty. In fact, in the 20th century, Richard Cox and Edwin Thompson Jaynes have done early work to formalize probability theory as the "logic under uncertainty" (Cox, 1961; Jaynes, 2002).

In this first chapter, we will briefly recall the fundamentals of probability theory, and we will see how probabilistic inference can be used to reason about the world. In the remaining chapters, we will then discuss how probabilistic inference can be performed efficiently given limited computational resources and limited time, which is the key challenge in probabilistic artificial intelligence.

1.1 Probability

Probability is commonly interpreted in two different ways. In the frequentist interpretation, one interprets the probability of an event (say a coin coming up "heads" when flipping it) as the limit of relative frequencies in repeated independent experiments. That is,

    Probability = lim_{N → ∞} (# events happening in N trials) / N.

This interpretation is natural, but has a few issues. It is not very difficult to conceive of settings where repeated experiments do not make sense. Consider the outcome:

“Person X will live for at least 80 years.”

There is no way in which we could conduct multiple independent experiments in this case. Still, this statement is going to turn out either true or false; as humans we are just not able to determine its truth value beforehand. Nevertheless, humans commonly have beliefs about statements of this kind. We also commonly reason about statements such as
such as

“The Beatles were more groundbreaking than The Monkees.”

This statement does not even have an objective truth value, and yet we
as humans tend to have opinions about it.

While it is natural to consider the relative frequency of the outcome in repeated experiments as our belief, if we are not able to conduct repeated experiments, our notion of probability is simply a subjective measure of uncertainty about outcomes. In the early 20th century, Bruno De Finetti has done foundational work to formalize this notion, which is commonly called Bayesian reasoning or the Bayesian interpretation of probability (De Finetti, 1970).

We will see that modern approaches to probabilistic inference often lend themselves to a Bayesian interpretation, even if such an interpretation is not strictly necessary. For our purposes, probabilities will be a means to an end: the end usually being solving some task. This task may be to make a prediction or to take an action with an uncertain outcome, and we can evaluate methods according to how well they perform on this task. No matter the interpretation, the mathematical framework of probability theory which we will formally introduce in the following is the same.

1.1.1 Probability Spaces


A probability space is a mathematical model for a random experiment.
The set of all possible outcomes of the experiment Ω is called sample
space. An event A ⊆ Ω of interest may be any combination of possible
outcomes. The set of all events A ⊆ P (Ω) that we are interested in
is often called the event space of the experiment.1 This set of events is 1
We use P (Ω) to denote the power set
required to be a σ-algebra over the sample space. (set of all subsets) of Ω.

Definition 1.1 (σ-algebra). Given the set Ω, the set A ⊆ P(Ω) is a σ-algebra over Ω if the following properties are satisfied:
1. Ω ∈ A;
2. if A ∈ A, then Ā ∈ A (closedness under complements); and
3. if we have A_i ∈ A for all i, then ⋃_{i=1}^∞ A_i ∈ A (closedness under countable unions).

Note that the three properties of σ-algebras correspond to characteristics we universally expect when working with random experiments. Namely, that we are able to reason about the event Ω that any of the possible outcomes occur, that we are able to reason about an event not occurring, and that we are able to reason about events that are composed of multiple (smaller) events.

Example 1.2: Event space of throwing a die

The event space A can also be thought of as "how much information is available about the experiment". For example, if the experiment is a throw of a die and Ω is the set of possible values on the die, Ω = {1, . . . , 6}, then the following A implies that the observer cannot distinguish between 1 and 3:

    A := {∅, Ω, {1, 3, 5}, {2, 4, 6}}.

Intuitively, the observer only understands the parity of the face of the die.
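
As a quick sanity check of Definition 1.1 (this snippet is not part of the original text), one can verify the three σ-algebra properties for the event space of Example 1.2 in a few lines of Python; the frozenset representation is an arbitrary implementation choice.

from itertools import combinations

Omega = frozenset(range(1, 7))   # sample space of a die throw
A = {frozenset(), Omega, frozenset({1, 3, 5}), frozenset({2, 4, 6})}

assert Omega in A                                          # property 1
assert all(Omega - event in A for event in A)              # property 2: closed under complements
assert all(a | b in A for a, b in combinations(A, 2))      # property 3: (finite) unions suffice for a finite A
print("A is a sigma-algebra over Omega")
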

Definition 1.3 (Probability measure). Given the set Ω and the σ-algebra A over Ω, the function

    P : A → R

is a probability measure on A if the Kolmogorov axioms are satisfied:
1. 0 ≤ P(A) ≤ 1 for any A ∈ A;
2. P(Ω) = 1; and
3. P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i) for any countable set of mutually disjoint events {A_i ∈ A}_i.²

² We say that a set of sets {A_i}_i is disjoint if for all i ≠ j we have A_i ∩ A_j = ∅.
Remarkably, all further statements about probability follow from these three natural axioms. For an event A ∈ A, we call P(A) the probability of A. We are now ready to define a probability space.

Definition 1.4 (Probability space). A probability space is a triple (Ω, A, P) where
• Ω is a sample space,
• A is a σ-algebra over Ω, and
• P is a probability measure on A.

Example 1.5: Borel σ-algebra over R

In our context, we often have that Ω is the set of real numbers R or a compact subset of it. In this case, a natural event space is the σ-algebra generated by the set of events

    A_x := {x' ∈ Ω : x' ≤ x}.

The smallest σ-algebra A containing all sets A_x is called the Borel σ-algebra. A contains all "reasonable" subsets of Ω (except for some pathological examples). For example, A includes all singleton sets {x}, as well as all countable unions of intervals.

In the case of discrete Ω, in fact A = P(Ω), i.e., the Borel σ-algebra contains all subsets of Ω.

1.1.2 Random Variables

The set Ω is often rather complex. For example, take Ω to be the set of all possible graphs on n vertices. Then the outcome of our experiment is a graph. Usually, we are not interested in a specific graph but rather a property such as the number of edges, which is shared by many graphs. A function that maps a graph to its number of edges is a random variable.

Definition 1.6 (Random variable). A random variable X is a function

    X : Ω → T

where T is called target space of the random variable,³ and where X respects the information available in the σ-algebra A. That is,⁴

    ∀S ⊆ T : {ω ∈ Ω : X(ω) ∈ S} ∈ A.    (1.1)

³ For a random variable that maps a graph to its number of edges, T = N_0. For our purposes, you can generally assume T ⊆ R.
⁴ In our example of throwing a die, X should assign the same value to the outcomes 1, 3, 5.

Concrete values x of a random variable X are often referred to as states or realizations of X. The probability that X takes on a value in S ⊆ T is

    P(X ∈ S) = P({ω ∈ Ω : X(ω) ∈ S}).    (1.2)

1.1.3 Distributions
Consider a random variable X on a probability space (Ω, A, P), where Ω is a compact subset of R, and A the Borel σ-algebra.

In this case, we can refer to the probability that X assumes a particular state or set of states by writing

    p_X(x) := P(X = x)    (in the discrete setting),    (1.3)
    P_X(x) := P(X ≤ x).    (1.4)

Note that "X = x" and "X ≤ x" are merely events (that is, they characterize subsets of the sample space Ω satisfying this condition) which are in the Borel σ-algebra, and hence their probability is well-defined.

Hereby, p_X and P_X are referred to as the probability mass function (PMF) and cumulative distribution function (CDF) of X, respectively. Note that we can also implicitly define probability spaces through random variables and their associated PMF/CDF, which is often very convenient.

We list some common examples of discrete distributions in Appendix A.1.1. Further, note that for continuous variables, P(X = x) = 0. Here, instead we typically use the probability density function (PDF), to which we (with slight abuse of notation) also refer with p_X. We discuss densities in greater detail in Section 1.1.4.

We call the subset S ⊆ T of the domain of a PMF or PDF p_X such that all elements x ∈ S have positive probability, p_X(x) > 0, the support of the distribution p_X. This quantity is denoted by X(Ω).

1.1.4 Continuous Distributions


As mentioned, a continuous random variable can be characterized by
its probability density function (PDF). But what is a density? We can
derive some intuition from physics.

Let M be a (non-homogeneous) physical object, e.g., a rock. We com-


monly use m( M) and vol( M) to refer to its mass and volume, respec-
tively. Now, consider for a point x ∈ M and a ball Br ( x) around x with
radius r the following quantities:

lim vol( Br ( x)) = 0 lim m( Br ( x)) = 0.


r →0 r →0
6 probabilistic artificial intelligence

They appear utterly uninteresting at first, yet, if we divide them, we


get what is called the density of M at x.

m( Br ( x)) .
lim = ρ ( x ).
r →0 vol( Br ( x))

We know that the relationship between density and mass is described


by the following formula:
Z
m( M) = ρ( x) dx.
M

In other words, the density is to be integrated. For a small region I


around x, we can approximate m( I ) ≈ ρ( x) · vol( I ).

Crucially, observe that even though the mass of any particular point x is zero, i.e., m({x}) = 0, assigning a density ρ(x) to x is useful for integration and approximation. The same idea applies to continuous random variables, only that volume corresponds to intervals on the real line and mass to probability. Recall that probability density functions are normalized such that their probability mass across the entire real line integrates to one.

Example 1.7: Normal distribution / Gaussian

A famous example of a continuous distribution is the normal distribution, also called Gaussian. We say a random variable X is normally distributed, X ∼ N(µ, σ²), if its PDF is

    N(x; µ, σ²) := 1/√(2πσ²) · exp(−(x − µ)²/(2σ²)).    (1.5)

We have E[X] = µ and Var[X] = σ². If µ = 0 and σ² = 1, this distribution is called the standard normal distribution. The Gaussian CDF cannot be expressed in closed form.

[Figure 1.1: PDF of the standard normal distribution N(x; 0, 1). Observe that the PDF is symmetric around the mode.]

Note that the mean of a Gaussian distribution coincides with the maximizer of its PDF, also called mode of a distribution.
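
To make Equation (1.5) concrete, the following minimal Python check (not from the original text) evaluates the density, confirms that it matches scipy's implementation, and approximates its integral; the grid is an arbitrary choice.

import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu=0.0, sigma2=1.0):
    # PDF of N(mu, sigma2), cf. Equation (1.5)
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.linspace(-10.0, 10.0, 100_001)
assert np.allclose(normal_pdf(x), norm.pdf(x))        # agrees with scipy
print(float(normal_pdf(x).sum() * (x[1] - x[0])))     # Riemann sum over R ~ 1.0
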

We will focus in the remainder of this chapter on continuous distributions, but the concepts we discuss extend mostly to discrete distributions simply by "replacing integrals by sums".

1.1.5 Joint Probability


A joint probability (as opposed to a marginal probability) is the prob-
ability of two or more events occurring simultaneously:
.
P( A, B) = P( A ∩ B). (1.6)
fundamentals of inference 7

In terms of random variables, this concept extends to joint distributions. Instead of characterizing a single random variable, a joint distribution is a function p_X : R^n → R, characterizing a random vector X := [X_1 · · · X_n]^⊤. For example, if the X_i are discrete, the joint distribution characterizes joint probabilities of the form

    P(X = [x_1, . . . , x_n]) = P(X_1 = x_1, . . . , X_n = x_n),

and hence describes the relationship among all variables X_i. For this reason, a joint distribution is also called a generative model. We use X_{i:j} to denote the random vector [X_i · · · X_j]^⊤.

We can "sum out" (respectively "integrate out") variables from a joint distribution in a process called "marginalization":

Fact 1.8 (Sum rule). We have that

    p(x_{1:i−1}, x_{i+1:n}) = ∫_{X_i(Ω)} p(x_{1:i−1}, x_i, x_{i+1:n}) dx_i.    (1.7)

1.1.6 Conditional Probability


Conditional probability updates the probability of an event A given
some new information, for example, after observing the event B.

Definition 1.9 (Conditional probability). Given two events A and B


such that P( B) > 0, the probability of A conditioned on B is given as
A
P( A, B)
. B
P( A | B ) = . (1.8) Ω
P( B )
Figure 1.2: Conditioning an event A on
Simply rearranging the terms yields, another event B can be understood as re-
placing the universe of all possible out-
P( A, B) = P( A | B) · P( B) = P( B | A) · P( A). (1.9) comes Ω by the observed outcomes B.
Then, the conditional probability is sim-
Thus, the probability that both A and B occur can be calculated by ply expressing the likelihood of A given
that B occurred.
multiplying the probability of event A and the probability of B condi-
tional on A occurring.

We say Z ∼ X | Y = y (or simply Z ∼ X | y) if Z follows the conditional


distribution
. pX,Y ( x, y)
pX|Y ( x | y ) = . (1.10)
pY (y)
If X and Y are discrete, we have that pX|Y ( x | y) = P(X = x | Y = y)
as one would naturally expect.

Extending Equation (1.9) to arbitrary random vectors yields the product rule (also called the chain rule of probability):

Fact 1.10 (Product rule). Given random variables X_{1:n},

    p(x_{1:n}) = p(x_1) · ∏_{i=2}^n p(x_i | x_{1:i−1}).    (1.11)

Combining sum rule and product rule, we can compute marginal probabilities too:

    p(x) = ∫_{Y(Ω)} p(x, y) dy = ∫_{Y(Ω)} p(x | y) · p(y) dy,    (1.12)

first using the sum rule (1.7) and then the product rule (1.11). This is called the law of total probability (LOTP), which is colloquially often referred to as conditioning on Y. If it is difficult to compute p(x) directly, conditioning can be a useful technique when Y is chosen such that the densities p(x | y) and p(y) are straightforward to understand.
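
The sum rule, product rule, and law of total probability are easy to verify numerically for a small discrete joint distribution. The following sketch (not from the original text; the joint table is made up) does exactly that.

import numpy as np

p_xy = np.array([[0.10, 0.30],     # joint p(x, y) of two binary variables,
                 [0.20, 0.40]])    # rows indexed by x, columns by y

p_x = p_xy.sum(axis=1)             # sum rule (1.7): p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y           # conditioning (1.10): p(x | y) = p(x, y) / p(y)

assert np.allclose(p_x_given_y * p_y, p_xy)                  # product rule (1.11)
assert np.allclose((p_x_given_y * p_y).sum(axis=1), p_x)     # law of total probability (1.12)
print(p_x, p_y)
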

1.1.7 Independence
Two random vectors X and Y are independent (denoted X ⊥ Y) if and only if knowledge about the state of one random vector does not affect the distribution of the other random vector, namely if their conditional CDF (or, in case they have a joint density, their conditional PDF) simplifies to

    P_{X|Y}(x | y) = P_X(x),    p_{X|Y}(x | y) = p_X(x).    (1.13)

For the conditional probabilities to be well-defined, we need to assume that p_Y(y) > 0.

The more general characterization of independence is that X and Y are independent if and only if their joint CDF (or, in case they have a joint density, their joint PDF) can be decomposed as follows:

    P_{X,Y}(x, y) = P_X(x) · P_Y(y),    p_{X,Y}(x, y) = p_X(x) · p_Y(y).    (1.14)

The equivalence of the two characterizations (when p_Y(y) > 0) is easily proven using the product rule: p_{X,Y}(x, y) = p_Y(y) · p_{X|Y}(x | y).

A "weaker" notion of independence is conditional independence.⁵ Two random vectors X and Y are conditionally independent given a random vector Z (denoted X ⊥ Y | Z) iff, given Z, knowledge about the value of one random vector Y does not affect the distribution of the other random vector X, namely if

    P_{X|Y,Z}(x | y, z) = P_{X|Z}(x | z),    (1.15a)
    p_{X|Y,Z}(x | y, z) = p_{X|Z}(x | z).    (1.15b)

⁵ We discuss in Remark 1.11 how "weaker" is to be interpreted in this context.

Similarly to independence, we have that X and Y are conditionally independent given Z if and only if their joint CDF or joint PDF can be decomposed as follows:

    P_{X,Y|Z}(x, y | z) = P_{X|Z}(x | z) · P_{Y|Z}(y | z),    (1.16a)
    p_{X,Y|Z}(x, y | z) = p_{X|Z}(x | z) · p_{Y|Z}(y | z).    (1.16b)

Remark 1.11: Common causes

How can conditional independence be understood as a "weaker" notion of independence? Clearly, conditional independence does not imply independence: a trivial example is X ⊥ X | X ⇏ X ⊥ X.⁶ Neither does independence imply conditional independence: for example, X ⊥ Y ⇏ X ⊥ Y | X + Y.⁷

⁶ X ⊥ X | X is true trivially.
⁷ Knowing X and X + Y already implies the value of Y, and hence X and Y are not conditionally independent given X + Y.

When we say that conditional independence is a weaker notion, we mean to emphasize that X and Y can be "made" (conditionally) independent by conditioning on the "right" Z even if X and Y are dependent. This is known as Reichenbach's common cause principle, which says that for any two dependent random variables X and Y there exists a random variable Z (which may be X or Y) that causally influences both X and Y, and which is such that X ⊥ Y | Z.

1.1.8 Directed Graphical Models


Directed graphical models (also called Bayesian networks) are often
used to visually denote the (conditional) independence relationships
of a large number of random variables. They are a schematic repre-
sentation of the factorization of the generative model into a product
of conditional distributions as a directed acyclic graph. Given the se-
quence of random variables { Xi }in=1 , their generative model can be
expressed as
n
p( x1:n ) = ∏ p(xi | parents(xi )) (1.17) 8
More generally, vertices u and v are
i =1 conditionally independent given a set of
vertices Z if Z d-separates u and v, which
where parents( xi ) is the set of parents of the vertex Xi in the directed we will not cover in depth here.
graphical model. In other words, the parenthood relationship encodes
a conditional independence of a random variable X with a random
c
variable Y given their parents:8 Y

X ⊥ Y | parents( X ), parents(Y ). (1.18)


X1 ··· Xn
Equation (1.17) simply uses the product rule and the conditional in-
dependence relationships to factorize the generative model. This can
a1 an
greatly reduce the model’s complexity, i.e., the length of the product.
Figure 1.3: Example of a directed
graphical model. The random vari-
ables X1 , . . . , Xn are mutually indepen-
dent given the random variable Y. The
squared rectangular nodes are used to
represent dependencies on parameters
c, a1 , . . . , an .
An example of a directed graphical model is given in Figure 1.3. Circular vertices represent random quantities (i.e., random variables). In contrast, square vertices are commonly used to represent deterministic quantities (i.e., parameters that the distributions depend on). In the given example, we have that X_i is conditionally independent of all other X_j given Y. Plate notation is a condensed notation used to represent repeated variables of a graphical model. An example is given in Figure 1.4.

[Figure 1.4: The same directed graphical model as in Figure 1.3 using plate notation.]
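
To illustrate the factorization (1.17) for the model of Figure 1.3 (a toy numerical example, not from the original text; all probability tables are made up), the joint of Y and X_1, . . . , X_n is simply the product of p(y) and the conditionals p(x_i | y).

import numpy as np
from itertools import product

n = 3
p_y = np.array([0.6, 0.4])                  # prior over a binary Y
p_x1_given_y = np.array([[0.9, 0.2],        # p(X_i = 1 | Y = y), one row per i
                         [0.7, 0.3],
                         [0.5, 0.5]])

def joint(y, xs):
    # p(y, x_1, ..., x_n) = p(y) * prod_i p(x_i | y), cf. Equation (1.17)
    probs = np.where(np.array(xs) == 1, p_x1_given_y[:, y], 1 - p_x1_given_y[:, y])
    return p_y[y] * probs.prod()

total = sum(joint(y, xs) for y in (0, 1) for xs in product((0, 1), repeat=n))
print(total)                                # ~ 1.0: the factorized model is a valid joint

Note that this factorization is described by 2n + 1 numbers, whereas a full joint table over the n + 1 binary variables would require 2^{n+1} − 1 parameters.
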
1.1.9 Expectation

The expected value or mean E[X] of a random vector X is the (asymptotic) arithmetic mean of an arbitrarily increasing number of independent realizations of X. That is,⁹

    E[X] := ∫_{X(Ω)} x · p(x) dx.    (1.19)

⁹ In infinite probability spaces, absolute convergence of E[X] is necessary for the existence of E[X].

A very special and often used property of expectations is their linearity, namely that for any random vectors X and Y in R^n and any A ∈ R^{m×n}, b ∈ R^m it holds that

    E[AX + b] = A E[X] + b    and    E[X + Y] = E[X] + E[Y].    (1.20)

Note that X and Y do not necessarily have to be independent! Further, if X and Y are independent then

    E[XY^⊤] = E[X] · E[Y]^⊤.    (1.21)

The following intuitive lemma can be used to compute expectations of transformed random variables.

Fact 1.12 (Law of the unconscious statistician, LOTUS).

    E[g(X)] = ∫_{X(Ω)} g(x) · p(x) dx    (1.22)

where g : X(Ω) → R^n is a "nice" function¹⁰ and X is a continuous random vector. The analogous statement with a sum replacing the integral holds for discrete random variables.

¹⁰ g being a continuous function which is either bounded or absolutely integrable (i.e., ∫ |g(x)| p(x) dx < ∞) is sufficient. This is satisfied in most cases.

This is a nontrivial fact that can be proven using the change of variables formula which we discuss in Section 1.1.11.

Similarly to conditional probability, we can also define conditional expectations. The expectation of a continuous random vector X given that Y = y is defined as

    E[X | Y = y] := ∫_{X(Ω)} x · p_{X|Y}(x | y) dx.    (1.23)

Observe that E[X | Y = ·] defines a deterministic mapping from y to E[X | Y = y]. Therefore, E[X | Y] is itself a random vector:

    E[X | Y](ω) = E[X | Y = Y(ω)]    (1.24)

where ω ∈ Ω. This random vector E[X | Y] is called the conditional expectation of X given Y.

Analogously to the law of total probability (1.12), one can condition an expectation on another random vector. This is known as the tower rule or the law of total expectation (LOTE):

Theorem 1.13 (Tower rule). Given random vectors X and Y, we have

    E_Y[E_X[X | Y]] = E[X].    (1.25)

Proof sketch. We only prove the case where X and Y have a joint density. We have

    E[E[X | Y]] = ∫ (∫ x · p(x | y) dx) p(y) dy
                = ∫∫ x · p(x, y) dx dy           (by the definition of conditional densities (1.10))
                = ∫ x (∫ p(x, y) dy) dx          (by Fubini's theorem)
                = ∫ x · p(x) dx                  (using the sum rule (1.7))
                = E[X].

1.1.10 Covariance and Variance


Given two random vectors X in Rn and Y in Rm , their covariance is
defined as
h i
.
Cov[X, Y] = E (X − E[X])(Y − E[Y])⊤ (1.26)
h i
= E XY⊤ − E[X] · E[Y]⊤ (1.27)

= Cov[Y, X]⊤ ∈ Rn×m . (1.28)


Covariance measures the linear dependence between two random vec-
tors since a direct consequence of its definition (1.26) is that given lin-
′ ′ ′ ′
ear maps A ∈ Rn ×n , B ∈ Rm ×m , vectors c ∈ Rn , d ∈ Rm and random
vectors X in Rn and Y in Rm , we have that
Cov[ AX + c, BY + d] = ACov[X, Y] B⊤ . (1.29)
Two random vectors X and Y are said to be uncorrelated if and only
if Cov[X, Y] = 0. Note that if X and Y are independent, then Equa-
tion (1.21) implies that X and Y are uncorrelated. The reverse does not
hold in general.
12 probabilistic artificial intelligence

Remark 1.14: Correlation

The correlation of the random vectors X and Y is a normalized covariance,

    Cor[X, Y](i, j) := Cov[X_i, Y_j] / √(Var[X_i] Var[Y_j]) ∈ [−1, 1].    (1.30)

Two random vectors X and Y are therefore uncorrelated if and only if Cor[X, Y] = 0.

There is also a nice geometric interpretation of covariance and correlation. For zero-mean random variables X and Y, Cov[X, Y] is an inner product.¹¹ The cosine of the angle θ between X and Y (that are not deterministic) coincides with their correlation,

    cos θ = Cov[X, Y] / (∥X∥ ∥Y∥) = Cor[X, Y],    (1.31)

using the Euclidean inner product formula Cov[X, Y] = ∥X∥ ∥Y∥ cos θ. Here, cos θ is also called a cosine similarity. Thus,

    θ = arccos Cor[X, Y].    (1.32)

¹¹ That is, Cov[X, Y] is symmetric, Cov[X, Y] is linear (here we use E[X] = E[Y] = 0), and Cov[X, X] ≥ 0.

For example, if X and Y are uncorrelated, then they are orthogonal in the inner product space. If Cor[X, Y] = −1 then θ ≡ π (that is, X and Y "point in opposite directions"), whereas if Cor[X, Y] = 1 then θ ≡ 0 (that is, X and Y "point in the same direction").

The covariance of a random vector X in R^n with itself is called its variance:

    Var[X] := Cov[X, X]    (1.33)
            = E[(X − E[X])(X − E[X])^⊤]    (1.34)
            = E[XX^⊤] − E[X] · E[X]^⊤    (1.35)
            = [ Cov[X_1, X_1]  · · ·  Cov[X_1, X_n]
                      ⋮          ⋱          ⋮
                Cov[X_n, X_1]  · · ·  Cov[X_n, X_n] ].    (1.36)

The scalar variance Var[X] of a random variable X is a measure of uncertainty about the value of X since it measures the average squared deviation from E[X]. We will see that the eigenvalue spectrum of a covariance matrix can serve as a measure of uncertainty in the multivariate setting.¹²

¹² The multivariate setting (as opposed to the univariate setting) studies the joint distribution of multiple random variables.

Remark 1.15: Standard deviation

The length of a random variable X in the inner product space described in Remark 1.14 is called its standard deviation,

    ∥X∥ := √(Cov[X, X]) = √(Var[X]) = σ[X].    (1.37)

That is, the longer a random variable is in the inner product space, the more "uncertain" we are about its value. If a random variable has length 0, then it is deterministic.

The variance of a random vector X is also called the covariance matrix of X and denoted by Σ_X (or Σ if the correspondence to X is clear from context). A covariance matrix is symmetric by definition due to the symmetry of covariance, and is always positive semi-definite ? (Problem 1.4).

Two useful properties of variance are the following:
• It follows from Equation (1.29) that for any linear map A ∈ R^{m×n} and vector b ∈ R^m,

    Var[AX + b] = A Var[X] A^⊤.    (1.38)

  In particular, Var[−X] = Var[X].
• It follows from the definition of variance (1.34) that for any two random vectors X and Y,

    Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y].    (1.39)

  In particular, if X and Y are independent then the covariance term vanishes and Var[X + Y] = Var[X] + Var[Y].
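
A quick Monte Carlo check of the identities (1.38) and (1.39) (not from the original text; the distribution, the matrix A, and the sample size are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 1.0], [[2.0, 0.5], [0.5, 1.0]], size=200_000)
A = np.array([[1.0, -2.0], [0.5, 3.0]])
b = np.array([1.0, 2.0])

cov_X = np.cov(X, rowvar=False)
cov_AXb = np.cov(X @ A.T + b, rowvar=False)
print(np.allclose(cov_AXb, A @ cov_X @ A.T))    # Var[AX + b] = A Var[X] A^T (holds for sample covariances too)

x1, x2 = X[:, 0], X[:, 1]
lhs = np.var(x1 + x2)
rhs = np.var(x1) + np.var(x2) + 2 * np.cov(x1, x2)[0, 1]
print(np.isclose(lhs, rhs, rtol=1e-2))          # Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y]
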
Analogously to conditional probability and conditional expectation, we can also define conditional variance. The conditional variance of a random vector X given another random vector Y is the random vector

    Var[X | Y] := E[(X − E[X | Y])(X − E[X | Y])^⊤ | Y].    (1.40)

Intuitively, the conditional variance is the remaining variance when we use E[X | Y] to predict X rather than if we used E[X]. One can also condition a variance on another random vector, analogously to the laws of total probability (1.12) and expectation (1.25).

Theorem 1.16 (Law of total variance, LOTV).

    Var[X] = E_Y[Var_X[X | Y]] + Var_Y[E_X[X | Y]].    (1.41)

Here, the first term measures the average deviation from the mean of X across realizations of Y, and the second term measures the uncertainty in the mean of X across realizations of Y. In Section 2.2, we will see that both terms have a meaningful characterization in the context of probabilistic inference.

Proof sketch of LOTV. To simplify the notation, we present only a proof for the univariate setting.

    Var[X] = E[X²] − E[X]²
           = E[E[X² | Y]] − E[E[X | Y]]²                       (by the tower rule (1.25))
           = E[Var[X | Y] + E[X | Y]²] − E[E[X | Y]]²          (by the definition of variance (1.35))
           = E[Var[X | Y]] + (E[E[X | Y]²] − E[E[X | Y]]²)
           = E[Var[X | Y]] + Var[E[X | Y]].                    (by the definition of variance (1.35))
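
The law of total variance can be illustrated with a small simulation (not from the original text; the hierarchical model and all numbers are made up): let Y be a coin flip and X | Y a Gaussian whose mean depends on Y.

import numpy as np

rng = np.random.default_rng(1)
n = 500_000
Y = rng.binomial(1, 0.3, size=n)          # Y ~ Bernoulli(0.3)
X = rng.normal(loc=5.0 * Y, scale=2.0)    # X | Y ~ N(5Y, 4)

# E[Var[X | Y]] = 4 and Var[E[X | Y]] = Var[5Y] = 25 * 0.3 * 0.7 = 5.25
print(np.var(X), 4 + 25 * 0.3 * 0.7)      # both ~ 9.25
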

1.1.11 Change of Variables


It is often useful to understand the distribution of a transformed ran-
dom variable Y = g( X ) that is defined in terms of a random variable
X, whose distribution is known. Let us first consider the univariate
setting. We would like to express the distribution of Y in terms of the
distribution of X, that is, we would like to find
 
PY (y) = P(Y ≤ y) = P( g( X ) ≤ y) = P X ≤ g−1 (y) . (1.42)

When the random variables are continuous, this probability can be ex-
pressed as an integration over the domain of X. We can then use the
substitution rule of integration to “change the variables” to an inte-
gration over the domain of Y. Taking the derivative yields the density
pY .13 There is an analogous change of variables formula for the multi- 13
The full proof of the change of vari-
ables formula in the univariate setting
variate setting.
can be found in section 6.7.2 of “Math-
ematics for machine learning” (Deisen-
roth et al., 2020).
Fact 1.17 (Change of variables formula). Let X be a random vector in R^n with density p_X and let g : R^n → R^n be a differentiable and invertible function. Then Y = g(X) is another random variable, whose density can be computed based on p_X and g as follows:

    p_Y(y) = p_X(g^{-1}(y)) · |det Dg^{-1}(y)|    (1.43)

where Dg^{-1}(y) is the Jacobian of g^{-1} evaluated at y.

Here, the term |det Dg^{-1}(y)| measures how much a unit volume changes when applying g. Intuitively, the change of variables swaps the coordinate system over which we integrate. The factor |det Dg^{-1}(y)| corrects for the change in volume that is caused by this change in coordinates.
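
As a concrete instance of Fact 1.17 (not from the original text), pushing a standard normal X through g(x) = exp(x) gives p_Y(y) = p_X(log y) · 1/y, which is exactly the log-normal density:

import numpy as np
from scipy.stats import lognorm, norm

# Y = g(X) = exp(X) with X ~ N(0, 1); g^{-1}(y) = log y and |D g^{-1}(y)| = 1 / y
y = np.linspace(1e-3, 50.0, 100_000)
p_y = norm.pdf(np.log(y)) / y                     # Equation (1.43)
print(np.allclose(p_y, lognorm.pdf(y, s=1.0)))    # True: matches scipy's log-normal density
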

Intuitively, you can think of the vector field g as a perturbation to X, "pushing" the probability mass around. The perturbation of a density p_X by g is commonly denoted by the pushforward

    g_♯ p_X := p_Y    where Y = g(X).    (1.44)

This concludes our quick tour of probability theory, and we are well-prepared to return to the topic of probabilistic inference.

1.2 Probabilistic Inference

Recall the logical implication "If it is raining, the ground is wet." from the beginning of this chapter. Suppose that we look outside a window and see that it is not raining: will the ground be dry? Logical reasoning does not permit drawing an inference of this kind, as there might be reasons other than rain for which the ground could be wet (e.g., sprinklers). However, intuitively, by observing that it is not raining, we have just excluded the possibility that the ground is wet because of rain, and therefore we would deem it "more likely" that the ground is dry than before. In other words, if we were to walk outside now and the ground was wet, we would be more surprised than we would have been if we had not looked outside the window before.

As humans, we are constantly making such "plausible" inferences of our beliefs: be it about the weather, the outcomes of our daily decisions, or the behavior of others. Probabilistic inference is the process of updating such a prior belief P(W) to a posterior belief P(W | R̄) upon observing R̄, where — to reduce clutter — we write W for "The ground is wet", R for "It is raining", and R̄ for the negation of R.

The central principle of probabilistic inference is Bayes' rule:

Theorem 1.18 (Bayes' rule). Given random vectors X in R^n and Y in R^m, we have for any x ∈ R^n, y ∈ R^m that

    p(x | y) = p(y | x) · p(x) / p(y).    (1.45)

Proof. Bayes' rule is a direct consequence of the definition of conditional densities (1.10) and the product rule (1.11).

Let us consider the meaning of each term separately:
• the prior p(x) is the initial belief about x,
• the (conditional) likelihood p(y | x) describes how likely the observations y are under a given value x,
• the posterior p(x | y) is the updated belief about x after observing y,
• the joint likelihood p(x, y) = p(y | x) p(x) combines prior and likelihood,
• the marginal likelihood p(y) describes how likely the observations y are across all values of x.

The marginal likelihood can be computed using the sum rule (1.7) or the law of total probability (1.12),

    p(y) = ∫_{X(Ω)} p(y | x) · p(x) dx.    (1.46)

Note, however, that the marginal likelihood is simply normalizing the conditional distribution to integrate to one, and therefore a constant with respect to x. For this reason, p(y) is commonly called the normalizing constant.
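
A numerical companion to the rain example (the prior and the likelihood of wet ground without rain are made-up values, not from the original text):

p_rain = 0.3                        # prior P(R)
p_wet_given_rain = 1.0              # likelihood P(W | R), cf. Example 1.19
p_wet_given_no_rain = 0.2           # e.g., sprinklers

# marginal likelihood via the law of total probability (1.46)
p_wet = p_wet_given_rain * p_rain + p_wet_given_no_rain * (1 - p_rain)
# posterior via Bayes' rule (1.45)
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(p_rain_given_wet)             # ~ 0.68 > 0.3: observing wet ground makes rain more likely
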

Example 1.19: Plausible inferences

Let us confirm our intuition from the above example. The logical implication "If it is raining, the ground is wet." (denoted R → W) can be succinctly expressed as P(W | R) = 1. Since P(W) ≤ 1, we know that

    P(R | W) = P(W | R) · P(R) / P(W) = P(R) / P(W) ≥ P(R).

That is, observing that the ground is wet makes it more likely to be raining. From P(R | W) ≥ P(R) we know P(R̄ | W) ≤ P(R̄),¹⁴ which leads us to conclude that

    P(W | R̄) = P(R̄ | W) · P(W) / P(R̄) ≤ P(W),

that is, having observed it not to be raining made the ground less likely to be wet.

¹⁴ Since P(X̄) = 1 − P(X).

Example 1.19 is called a plausible inference because the observation of R̄ does not completely determine the truth value of W, and hence, does not permit logical inference. In the case, however, that logical inference is permitted, it coincides with probabilistic inference.

Example 1.20: Logical inferences

For example, if we were to observe that the ground is not wet, then logical inference implies that it must not be raining: W̄ → R̄. This is called the contrapositive of R → W.

Indeed, by probabilistic inference, we obtain analogously

    P(R | W̄) = P(W̄ | R) · P(R) / P(W̄) = (1 − P(W | R)) · P(R) / P(W̄) = 0,

since P(W | R) = 1.

Observe that a logical inference does not depend on the prior P(R): even if the prior was P(R) = 1 in Example 1.20, after observing that the ground is not wet, we are forced to conclude that it is not raining to maintain logical consistency. The examples highlight that while logical inference does not require the notion of a prior, plausible (probabilistic!) inference does.

1.2.1 Where do priors come from?

Bayes' rule necessitates the specification of a prior p(x). Different priors can lead to the deduction of dramatically different posteriors, as one can easily see by considering the extreme cases of a prior that is a point density at x = x_0 and a prior that is "uniform" over R^n.¹⁵ In the former case, the posterior will be a point density at x_0 regardless of the likelihood. In other words, no evidence can alter the "prior belief" the learner ascribed to x. In the latter case, the learner has "no prior belief", and therefore the posterior will be proportional to the likelihood. Both steps of probabilistic inference are perfectly valid, though one might debate which prior is more reasonable.

¹⁵ The latter is not a valid probability distribution, but we can still derive meaning from the posterior as we discuss in Remark 1.22.

Someone who follows the Bayesian interpretation of probability might argue that everything is conditional, meaning that the prior is simply a posterior of all former observations. While this might seem natural ("my world view from today is the combination of my world view from yesterday and the observations I made today"), this lacks an explanation for "the first day". Someone else who is more inclined towards the frequentist interpretation might also object to the existence of a prior belief altogether, arguing that a prior is subjective and therefore not a valid or desirable input to a learning algorithm. Put differently, a frequentist "has the belief not to have any belief". This is perfectly compatible with probabilistic inference, as long as the prior is chosen to be noninformative:

    p(x) ∝ const.    (1.47)

Choosing a noninformative prior in the absence of any evidence is known as the principle of indifference or the principle of insufficient reason, which dates back to the famous mathematician Pierre-Simon Laplace.

Example 1.21: Why be indifferent?

Consider a criminal trial with three suspects, A, B, and C. The collected evidence shows that suspect C cannot have committed the crime; however, it does not yield any information about suspects A and B. Clearly, any distribution respecting the data must assign zero probability of having committed the crime to suspect C. However, any distribution interpolating between (1, 0, 0) and (0, 1, 0) respects the data. The principle of indifference suggests that the desired distribution is (1/2, 1/2, 0), and indeed, any alternative distribution seems unreasonable.

Remark 1.22: Noninformative and improper priors

It is not necessarily required that the prior p(x) is a valid distribution (i.e., integrates to 1). Consider, for example, the noninformative prior p(x) ∝ 1{x ∈ I} where I ⊆ R^n is an infinitely large interval. Such a prior which is not a valid distribution is called an improper prior. We can still derive meaning from the posterior of a given likelihood and (improper) prior as long as the posterior is a valid distribution.

Laplace's principle of indifference can be generalized to cases where some evidence is available. The maximum entropy principle, originally proposed by Jaynes (1968), states that one should choose as prior, from all possible distributions that are consistent with prior knowledge, the one that makes the least "additional assumptions", i.e., is the least "informative". In philosophy, this principle is known as Occam's razor or the principle of parsimony. The "informativeness" of a distribution p is quantified by its entropy, which is defined as

    H[p] := E_{x∼p}[− log p(x)].    (1.48)

The more concentrated p is, the less is its entropy; the more diffuse p is, the greater is its entropy.¹⁶

¹⁶ We give a thorough introduction to entropy in Section 5.4.

In the absence of any prior knowledge, the uniform distribution has the highest entropy,¹⁷ and hence, the maximum entropy principle suggests a noninformative prior (as does Laplace's principle of indifference). In contrast, if the evidence perfectly determines the value of x, then the only consistent explanation is the point density at x. The maximum entropy principle characterizes a reasonable choice of prior for these two extreme cases and all cases in between. Bayes' rule can in fact be derived as a consequence of the maximum entropy principle in the sense that the posterior is the least "informative" distribution among all distributions that are consistent with the prior and the observations ? (Problem 5.7).

¹⁷ This only holds true when the set of possible outcomes of x is finite (or a bounded continuous interval), as in this case, the noninformative prior is a proper distribution — the uniform distribution. In the "infinite case", there is no uniform distribution and the noninformative prior can be attained from the maximum entropy principle as the limiting solution as the number of possible outcomes of x is increased.
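
A small numerical illustration of Equation (1.48) (not from the original text): among distributions over a fixed finite set of outcomes, the uniform distribution has the largest entropy.

import numpy as np

def entropy(p):
    # H[p] = E_{x ~ p}[-log p(x)] for a discrete distribution, cf. Equation (1.48)
    p = np.asarray(p, dtype=float)
    return float(-(p[p > 0] * np.log(p[p > 0])).sum())

print(entropy([0.25, 0.25, 0.25, 0.25]))   # log 4 ~ 1.386, maximal over four outcomes
print(entropy([0.7, 0.1, 0.1, 0.1]))       # ~ 0.940, more concentrated, lower entropy
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0, a point mass
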

1.2.2 Conjugate Priors


If the prior p( x) and posterior p( x | y) are of the same family of distri-
butions, the prior is called a conjugate prior to the likelihood p(y | x).
This is a very desirable property, as it allows us to recursively ap-
ply the same learning algorithm implementing probabilistic inference.
We will see in Chapter 2 that under some conditions the Gaussian is
self-conjugate. That is, if we have a Gaussian prior and a Gaussian like-
lihood then our posterior will also be Gaussian. This will provide us
with the first efficient implementation of probabilistic inference.

Example 1.23: Conjugacy of beta and binomial distribution

As an example for conjugacy, we will show that the beta distribution is a conjugate prior to a binomial likelihood. Recall the PMF of the binomial distribution,

    Bin(k; n, θ) = (n choose k) θ^k (1 − θ)^{n−k},    (1.49)

and the PDF of the beta distribution,

    Beta(θ; α, β) ∝ θ^{α−1} (1 − θ)^{β−1}.    (1.50)

We assume the prior θ ∼ Beta(α, β) and likelihood k | θ ∼ Bin(n, θ). Let n_H = k be the number of heads and n_T = n − k the number of tails in the binomial trial k. Then,

    p(θ | k) ∝ p(k | θ) p(θ)                                  (using Bayes' rule (1.45))
             ∝ θ^{n_H} (1 − θ)^{n_T} · θ^{α−1} (1 − θ)^{β−1}
             = θ^{α+n_H−1} (1 − θ)^{β+n_T−1}.

Thus, θ | k ∼ Beta(α + n_H, β + n_T).

This same conjugacy can be shown for the multivariate generalization of the beta distribution, the Dirichlet distribution, and the multivariate generalization of the binomial distribution, the multinomial distribution.
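
A minimal sketch of the beta-binomial update from Example 1.23 using scipy (not from the original text; the prior hyperparameters and data are arbitrary):

from scipy.stats import beta

alpha, beta_prior = 2.0, 2.0          # prior Beta(2, 2) over the heads probability theta
n_heads, n_tails = 7, 3               # observed coin flips

posterior = beta(alpha + n_heads, beta_prior + n_tails)   # theta | k ~ Beta(alpha + n_H, beta + n_T)
print(posterior.mean())               # (alpha + n_H) / (alpha + beta + n) = 9 / 14 ~ 0.64
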

1.2.3 Tractable Inference with the Normal Distribution


Using arbitrary distributions for learning and inference is computa-
tionally very expensive when the number of dimensions is large —
even in the discrete setting. For example, computing marginal distri-
butions using the sum rule yields an exponentially long sum in the
20 probabilistic artificial intelligence

size of the random vector. Similarly, the normalizing constant of the X1 ··· X n −1 Xn P( X1:n )

conditional distribution is a sum of exponential length. Even to rep- 0 ··· 0 0 0.01


0 ··· 0 1 0.001
resent any discrete joint probability distribution requires space that is 0 ··· 1 0 0.213
exponential in the number of dimensions (cf. Figure 1.5). .. .. .. ..
. . . .
One strategy to get around this computational blowup is to restrict 1 ··· 1 1 0.0003

the class of distributions. Gaussians are a popular choice for this pur- Figure 1.5: A table representing a joint
pose since they have extremely useful properties: they have a compact distribution of n binary random vari-
ables. The table has 2n rows. The num-
representation and — as we will see in Chapter 2 — they allow for
ber of parameters is 2n − 1 since the fi-
closed-form probabilistic inference. nal probability is determined by all other
probabilities as they must sum to one.
In Equation (1.5), we have already seen the PDF of the univariate Gaussian distribution. A random vector X in R^n is normally distributed, X ∼ N(µ, Σ), if its PDF is

    N(x; µ, Σ) := 1/√(det(2πΣ)) · exp(−½ (x − µ)^⊤ Σ^{-1} (x − µ))    (1.51)

where µ ∈ R^n is the mean vector and Σ ∈ R^{n×n} the covariance matrix ? (Problem 1.11). We call Λ := Σ^{-1} the precision matrix. X is also called a Gaussian random vector (GRV). N(0, I) is the multivariate standard normal distribution. We call a Gaussian isotropic if its covariance matrix is of the form Σ = σ² I for some σ² ∈ R. In this case, the sublevel sets of the PDF are perfect spheres as can be seen in Figure 1.6.

[Figure 1.6: Shown are the PDFs of two-dimensional Gaussians with mean 0 and covariance matrices

    Σ_1 := [[1, 0], [0, 1]],    Σ_2 := [[1, 0.9], [0.9, 1]],

respectively.]

Note that a Gaussian can be represented using only O(n²) parameters. In the case of a diagonal covariance matrix, which corresponds to n independent univariate Gaussians ? (Problem 1.8), we just need O(n) parameters.

In Equation (1.51), we assume that the covariance matrix Σ is invertible, i.e., does not have the eigenvalue 0. This is not a restriction since it can be shown that a covariance matrix has a zero eigenvalue if and only if there exists a deterministic linear relationship between some variables in the joint distribution ? (Problem 1.6). As we have already seen that a covariance matrix does not have negative eigenvalues ? (Problem 1.4), this ensures that Σ and Λ are positive definite.¹⁸

¹⁸ The inverse of a positive definite matrix is also positive definite.
An important property of the normal distribution is that it is closed under marginalization and conditioning.

Theorem 1.24 (Marginal and conditional distribution). ? (Problem 1.9) Consider the Gaussian random vector X and fix index sets A ⊆ [n] and B ⊆ [n]. Then, we have that for any such marginal distribution,

    X_A ∼ N(µ_A, Σ_AA),    (1.52)

and that for any such conditional distribution,

    X_A | X_B = x_B ∼ N(µ_{A|B}, Σ_{A|B})    where    (1.53a)
    µ_{A|B} := µ_A + Σ_AB Σ_BB^{-1} (x_B − µ_B),    (1.53b)
    Σ_{A|B} := Σ_AA − Σ_AB Σ_BB^{-1} Σ_BA.    (1.53c)

By µ_A we denote [µ_{i_1}, . . . , µ_{i_k}] where A = {i_1, . . . , i_k}; Σ_AA is defined analogously. In Equation (1.53b), µ_A characterizes the prior belief and Σ_AB Σ_BB^{-1} (x_B − µ_B) represents "how different" x_B is from what was expected.

Theorem 1.24 provides a closed-form characterization of probabilistic inference for the case that random variables are jointly Gaussian. We will discuss in Chapter 2 how this can be turned into an efficient inference algorithm.

Observe that upon inference, the variance can only shrink! Moreover, how much the variance is reduced depends purely on where the observations are made (i.e., the choice of B) but not on what the observations are. In contrast, the posterior mean µ_{A|B} depends affinely on x_B. These are special properties of the Gaussian and do not generally hold true for other distributions.
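
The formulas of Theorem 1.24 translate directly into a few lines of numpy. The following sketch (not from the original text; the mean, covariance, and observed value are arbitrary) conditions a three-dimensional Gaussian on its last coordinate.

import numpy as np

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
A, B = [0, 1], [2]                   # condition X_A on X_B = x_B
x_B = np.array([3.0])

S_AB = Sigma[np.ix_(A, B)]
S_BB_inv = np.linalg.inv(Sigma[np.ix_(B, B)])
mu_cond = mu[A] + S_AB @ S_BB_inv @ (x_B - mu[B])                            # Equation (1.53b)
Sigma_cond = Sigma[np.ix_(A, A)] - S_AB @ S_BB_inv @ Sigma[np.ix_(B, A)]     # Equation (1.53c)
print(mu_cond, Sigma_cond)

Note that Sigma_cond is independent of the observed value x_B, in line with the observation above that only the choice of B, not the observation itself, determines how much the variance shrinks.
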

It can be shown that Gaussians are additive and closed under affine transformations ? (Problem 1.10). The closedness under affine transformations (1.78) implies that a Gaussian X ∼ N(µ, Σ) is equivalently characterized as

    X = Σ^{1/2} Y + µ    (1.54)

where Y ∼ N(0, I) and Σ^{1/2} is the square root of Σ.¹⁹ Importantly, this implies together with Theorem 1.24 and additivity (1.79) that:

    Any affine transformation of a Gaussian random vector is a Gaussian random vector.

¹⁹ More details on the square root of a symmetric and positive definite matrix can be found in Appendix A.2.

A consequence of this is that given any jointly Gaussian random vectors X_A and X_B, X_A can be expressed as an affine function of X_B with added independent Gaussian noise. Formally, we define

    X_A := A X_B + b + ε    where    (1.55a)
    A := Σ_AB Σ_BB^{-1},    (1.55b)
    b := µ_A − Σ_AB Σ_BB^{-1} µ_B,    (1.55c)
    ε ∼ N(0, Σ_{A|B}).    (1.55d)

It directly follows from the closedness of Gaussians under affine transformations (1.78) that the characterization of X_A via Equation (1.55) is equivalent to X_A ∼ N(µ_A, Σ_AA), and hence, any Gaussian X_A can be modeled as a so-called conditional linear Gaussian, i.e., an affine function of another Gaussian X_B with additional independent Gaussian noise. We will use this fact frequently to represent Gaussians in a compact form.
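
Equation (1.55) can be verified by sampling. The following sketch (not from the original text; a one-dimensional X_A and X_B with arbitrary joint parameters) draws X_B from its marginal, applies the conditional linear Gaussian, and recovers the joint statistics.

import numpy as np

rng = np.random.default_rng(3)
mu_A, mu_B = 1.0, -2.0
S_AA, S_AB, S_BB = 2.0, 0.6, 1.0                 # entries of the joint covariance of (X_A, X_B)

A = S_AB / S_BB                                   # Equation (1.55b)
b = mu_A - A * mu_B                               # Equation (1.55c)
var_eps = S_AA - S_AB ** 2 / S_BB                 # Sigma_{A|B}, cf. Equation (1.53c)

x_B = rng.normal(mu_B, np.sqrt(S_BB), size=500_000)
x_A = A * x_B + b + rng.normal(0.0, np.sqrt(var_eps), size=500_000)    # Equation (1.55a)

print(x_A.mean(), x_A.var(), np.cov(x_A, x_B)[0, 1])   # ~ mu_A = 1.0, Sigma_AA = 2.0, Sigma_AB = 0.6
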

1.3 Supervised Learning and Point Estimates

Throughout the first part of this manuscript, we will focus mostly on


the supervised learning problem where we want to learn a function

f⋆ : X → Y

from labeled training data. That is, we are given a collection of labeled
.
examples, Dn = {( xi , yi )}in=1 , where the xi ∈ X are inputs and the f⋆
f
yi ∈ Y are outputs (called labels), and we want to find a function fˆ that fˆ
best-approximates f ⋆ . It is common to choose fˆ from a parameter-
ized function class F (Θ), where each function f θ is described by some F
parameters θ ∈ Θ.
Figure 1.7: Illustration of estimation er-
ror and approximation error. f ⋆ denotes
the true function and fˆ is the best ap-
Remark 1.25: What this manuscript is about and not about
proximation from the function class F .
As illustrated in Figure 1.7, the restriction to a function class leads We do not specify here, how one could
quantify “error”. For more details, see
to two sources of error: the estimation error of having “incorrectly” Appendix A.3.5.
determined fˆ within the function class, and the approximation er-
ror of the function class itself. Choosing a “good” function class
/ architecture with small approximation error is therefore critical
for any practical application of machine learning. We will discuss
various function classes, from linear models to deep neural net-
works, however, determining the “right” function class will not
be the focus of this manuscript. To keep the exposition simple,
we will assume in the following that f ⋆ ∈ F (Θ) with parameters
θ⋆ ∈ Θ.

Instead, we will focus on the problem of estimation/inference


within a given function class. We will see that inference in smaller
function classes is often more computationally efficient since the
search space is smaller or — in the case of Gaussians — has a
known tractable structure. On the other hand, larger function

classes are more expressive and therefore can typically better ap-
proximate the ground truth f ⋆ .

We differentiate between the task of regression where Y := R^k (the labels are usually scalar, so k = 1), and the task of classification where Y := C and C is an m-element set of classes. In other words, regression is the task of predicting a continuous label, whereas classification is the task of predicting a discrete class label. These two tasks are intimately related: in fact, we can think of classification tasks as a regression problem where we learn a probability distribution over class labels. In this regression problem, Y := ∆_C where ∆_C denotes the set of all probability distributions over the set of classes C which is an (m − 1)-dimensional convex polytope in the m-dimensional space of probabilities [0, 1]^m (cf. Appendix A.1.2).

For now, let us stick to the regression setting. We will assume that the observations are noisy, that is, y_i ∼ p(· | x_i, θ⋆) independently across i, for some known conditional distribution p(· | x_i, θ) but unknown parameter θ⋆. (The case where the labels are deterministic is the special case of p(· | x_i, θ⋆) being a point density at f⋆(x_i).) Our assumption can equivalently be formulated as

y_i = f_θ(x_i) + ε_i(x_i)    (1.56)

where f_θ(x_i) is the "signal", namely the mean of p(· | x_i, θ), and ε_i(x_i) = y_i − f_θ(x_i) is the "noise", some independent zero-mean noise, for example (but not necessarily) Gaussian. It is crucial that the assumed noise distribution accurately reflects the noise of the data: for example, using a (light-tailed) Gaussian noise model in the presence of heavy-tailed noise will fail! We discuss the distinction between light and heavy tails in Appendix A.3.2. When the noise distribution may depend on x_i, the noise is said to be heteroscedastic and otherwise the noise is called homoscedastic.

1.3.1 Maximum Likelihood Estimation
A common approach to finding f̂ is to select the model f ∈ F(Θ) under which the training data is most likely. This is called the maximum likelihood estimate (or MLE):

θ̂_MLE := arg max_{θ∈Θ} p(y_{1:n} | x_{1:n}, θ)    (1.57)
        = arg max_{θ∈Θ} ∏_{i=1}^n p(y_i | x_i, θ),    using the independence of the training data (1.56).

Such products of probabilities are often numerically unstable, which is why one typically takes the logarithm:

θ̂_MLE = arg max_{θ∈Θ} ∑_{i=1}^n log p(y_i | x_i, θ),    (1.58)

where the sum is the log-likelihood. We will denote the negative log-likelihood by ℓ_nll(θ; D_n).
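As a concrete illustration (a minimal sketch of our own; the linear-Gaussian observation model and all constants below are assumptions made only for this example), the MLE can be computed numerically by minimizing the negative log-likelihood:

# A minimal sketch: computing the MLE by minimizing the negative log-likelihood
# l_nll(theta; D_n), assuming y_i ~ N(theta^T x_i, sigma_n^2).
import numpy as np
from scipy.optimize import minimize

def nll(theta, X, y, sigma_n=0.1):
    # negative log-likelihood up to additive constants independent of theta
    return 0.5 * np.sum((y - X @ theta) ** 2) / sigma_n**2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

theta_mle = minimize(nll, x0=np.zeros(3), args=(X, y)).x
print(theta_mle)  # close to theta_true for this sample size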



The MLE is often used in practice due to its desirable asymptotic properties as the sample size n increases. We give a brief summary here and provide additional background and definitions in Appendix A.3. To give any guarantees on the convergence of the MLE, we necessarily need to assume that θ⋆ is identifiable, that is, θ⋆ ≠ θ implies f⋆ ≠ f_θ for any θ ∈ Θ; in words, there is no other parameter θ that yields the same function f_θ as θ⋆. If additionally, ℓ_nll is "well-behaved" then standard results say that the MLE is consistent and asymptotically normal (Van der Vaart, 2000):

θ̂_MLE → θ⋆ in probability and θ̂_MLE → N(θ⋆, S_n) in distribution as n → ∞.    (1.59)

Here, we denote by S_n the asymptotic variance of the MLE which can be understood as measuring the "quality" of the estimate: a "smaller" variance means that we can be more confident that the MLE is close to the true parameter. This implies in some sense that the MLE is asymptotically unbiased. Moreover, the MLE can be shown to be asymptotically efficient which is to say that there exists no other consistent estimator with a "smaller" asymptotic variance (see Appendix A.3.4).

The situation is quite different in the finite sample regime. Here, the
MLE need not be unbiased, and it is susceptible to overfitting to the
(finite) training data as we discuss in more detail in Appendix A.3.5.

1.3.2 Using Priors: Maximum a Posteriori Estimation


We can incorporate prior assumptions about the parameters θ⋆ into the
estimation procedure. One approach of this kind is to find the mode
of the posterior distribution, called the maximum a posteriori estimate (or
MAP estimate):
.
θ̂MAP = arg max p(θ | x1:n , y1:n ) (1.60)
θ∈Θ
= arg max p(y1:n | x1:n , θ) · p(θ) (1.61) by Bayes’ rule (1.45)
θ∈Θ
n
= arg max log p(θ) + ∑ log p(yi | xi , θ) (1.62) taking the logarithm
θ∈Θ i =1
= arg min − log p(θ) + ℓnll (θ; Dn ) . (1.63)
θ∈Θ | {z } | {z }
regularization quality of fit

Here, the log-prior log p(θ) acts as a regularizer. Common regularizers


are given, for example, by
• p(θ) = N (θ; 0, (2λ)−1 I ) which yields − log p(θ) = λ ∥θ∥22 + const,
• p(θ) = Laplace(θ; 0, λ−1 ) which yields − log p(θ) = λ ∥θ∥1 + const,
• a uniform prior (cf. Section 1.2.1) for which the MAP is equivalent
to the MLE. In other words, the MLE is merely the mode of the
posterior distribution under a uniform prior.
The Gaussian and Laplace regularizers act as simplicity biases, pre-
ferring simpler models over more complex ones, which empirically

tends to reduce the risk of overfitting. However, one may also encode
more nuanced information about the (assumed) structure of θ⋆ into
the prior.
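The correspondence between the priors listed above and their regularizers is easy to verify numerically; the following small sketch (our own, with an arbitrary choice of λ and θ) checks that the negative log-density differs from λ‖θ‖₂² respectively λ‖θ‖₁ only by a constant:

# A minimal sketch: -log N(theta; 0, (2*lam)^{-1} I) = lam * ||theta||_2^2 + const and
# -log Laplace(theta; 0, 1/lam) = lam * ||theta||_1 + const. lam is chosen arbitrarily.
import numpy as np
from scipy.stats import norm, laplace

lam = 0.7
theta = np.array([0.3, -1.2, 2.0])

neg_log_gauss = -norm(loc=0.0, scale=np.sqrt(1 / (2 * lam))).logpdf(theta).sum()
neg_log_laplace = -laplace(loc=0.0, scale=1 / lam).logpdf(theta).sum()

print(neg_log_gauss - lam * np.sum(theta**2))         # constant, independent of theta
print(neg_log_laplace - lam * np.sum(np.abs(theta)))  # constant, independent of theta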

An alternative way of encoding a prior is by restricting the function class to some Θ̃ ⊂ Θ, for example to rotation- and translation-invariant models as often done when the inputs are images. This effectively sets p(θ) = 0 for all θ ∈ Θ \ Θ̃ but is better suited for numerical optimization than to impose this constraint directly on the prior.

Encoding prior assumptions into the function class or into the parame-
ter estimation can accelerate learning and improve generalization per-
formance dramatically, yet importantly, incorporating a prior can also
inhibit learning in case the prior is “wrong”. For example, when the
learning task is to differentiate images of cats from images of dogs,
consider the (stupid) prior that only permits models that exclusively
use the upper-left pixel for prediction. No such model will be able to
solve the task, and therefore starting from this prior makes the learn-
ing problem effectively unsolvable which illustrates that priors have to
be chosen with care.

1.3.3 When does the prior matter?


We have seen that the MLE has desirable asymptotic properties, and
that MAP estimation can be seen as a regularized MLE where the type
of regularization is encoded by the prior. Is it possible to derive similar
asymptotic results for the MAP estimate?

To answer this question, we will look at the asymptotic effect of the prior on the posterior more generally. Doob's consistency theorem states that assuming parameters are identifiable (akin to the assumption required for consistency of the MLE, cf. Equation (1.59)), there exists Θ̃ ⊆ Θ with p(Θ̃) = 1 such that the posterior is consistent for any θ⋆ ∈ Θ̃ (Doob, 1949; Miller, 2016, 2018):

θ | D_n → θ⋆ in probability as n → ∞.    (1.64)

In words, Doob's consistency theorem tells us that for any prior distribution, the posterior is guaranteed to converge to a point density in the (small) neighborhood θ⋆ ∈ B of the true parameter as long as p(B) > 0, where B can for example be a ball of radius ϵ around θ⋆ (with respect to some geometry of Θ). We call such a prior a well-specified prior.
Remark 1.26: Cromwell’s rule
In the case where |Θ| is finite, Doob’s consistency theorem strongly
suggests that the prior should not assign 0 probability (or proba-
bility 1 for that matter) to any individual parameter θ ∈ Θ, unless
we know with certainty that θ⋆ ≠ θ. This is called Cromwell's rule, and a prior obeying this rule is always well-specified.

Under the same assumption that the prior is well-specified (and regularity conditions akin to the assumptions required for asymptotic normality of the MLE, cf. Equation (1.59)), the Bernstein-von Mises theorem, which was first discovered by Pierre-Simon Laplace in the early 19th century, establishes the asymptotic normality of the posterior distribution (Van der Vaart, 2000; Miller, 2016):

θ | D_n → N(θ⋆, S_n) in distribution as n → ∞,    (1.65)

where S_n is the same as the asymptotic variance of the MLE. This has also been called the "Bayesian central limit theorem", which is a bit of a misnomer since the theorem also applies to likelihoods (when the prior is noninformative) which are often used in frequentist statistics.

These results link probabilistic inference to maximum likelihood estimation in the asymptotic limit of infinite data. Intuitively, in the limit of infinite data, the prior is "overwhelmed" by the observations and the posterior becomes equivalent to the limiting distribution of the MLE. (More examples and discussion can be found in section 17.8 of Le Cam (1986), chapter 8 of Le Cam and Yang (2000), chapter 10 of Van der Vaart (2000), and in Tanner (1991).) One can interpret the regime of infinite data as the regime where computational resources and time are unlimited and plausible inferences evolve into logical inferences. This transition signifies a shift from the realm of uncertainty to that of certainty. The importance of the prior surfaces precisely in the non-asymptotic regime where plausible inferences are necessary due to limited computational resources and limited time.

1.3.4 Estimation vs Inference


You can interpret a single parameter vector θ ∈ Θ as “one possible
explanation” of the data. Maximum likelihood and maximum a pos-
teriori estimation are examples of estimation algorithms which return one such parameter vector, called a point estimate. That is, given
the training set Dn , they return a single parameter vector θ̂n . We give
a more detailed account of estimation in Appendix A.3.

Example 1.27: Point estimates and invalid logical inferences


To see why point estimates can be problematic, recall Example 1.19.
We have seen that the logical implication

“If it is raining, the ground is wet.”

can be expressed as P(W | R) = 1. Observing that “The ground


is wet.” does not permit logical inference, yet, the maximum like-
lihood estimate of R is R̂MLE = 1. This is logically inconsistent
since there might be other explanations for the ground to be wet,
such as a sprinkler! With only a finite sample (say independently
observing n times that the ground is wet), we cannot rule out with

certainty that the ground is wet for other reasons than rain.

In practice, we never observe an infinite amount of data. Example 1.27


demonstrates that on a finite sample, point estimates may perform
invalid logical inferences, and can therefore lure us into a false sense
of certainty.

Remark 1.28: MLE and MAP are approximations of inference

The MLE and MAP estimate can be seen as a naive approximation of probabilistic inference, represented by a point density which "collapses" all probability mass at the mode of the posterior distribution. This can be a relatively decent — even if overly simple — approximation when the distribution is unimodal, symmetric, and light-tailed as in Figure 1.8, but is usually a very poor approximation for practical posteriors that are complex and multimodal.

Figure 1.8: The MLE/MAP are point estimates at the mode θ̂ of the posterior distribution p(θ | D).

In this manuscript, we will focus mainly on algorithms for probabilistic inference which compute or approximate the distribution p(θ | x_{1:n}, y_{1:n}) over parameters. Returning a distribution over parameters is natural since this acknowledges that given a finite sample with noisy observations, more than one parameter vector can explain the data.

1.3.5 Probabilistic Inference and Prediction


The prior distribution p(θ) can be interpreted as the degree of our belief that the model parameterized by θ "describes the (previously seen) data best". The likelihood captures how likely the training data is under a particular model:

p(y_{1:n} | x_{1:n}, θ) = ∏_{i=1}^n p(y_i | x_i, θ).    (1.66)

The posterior then represents our belief about the best model after seeing the training data. Using Bayes' rule (1.45), we can write it as

p(θ | x_{1:n}, y_{1:n}) = (1/Z) p(θ) ∏_{i=1}^n p(y_i | x_i, θ)    where    (1.67a)
Z := ∫_Θ p(θ) ∏_{i=1}^n p(y_i | x_i, θ) dθ    (1.67b)

is the normalizing constant. (We generally assume that p(θ | x_{1:n}) = p(θ). For our purposes, you can think of the inputs x_{1:n} as fixed deterministic parameters, but one can also consider inputs drawn from a distribution over X.) We refer to this process of learning a model from data as learning. We can then use our learned model for prediction at a new input x⋆ by conditioning on θ,

p(y⋆ | x⋆, x_{1:n}, y_{1:n}) = ∫_Θ p(y⋆, θ | x⋆, x_{1:n}, y_{1:n}) dθ    by the sum rule (1.7)
                             = ∫_Θ p(y⋆ | x⋆, θ) · p(θ | x_{1:n}, y_{1:n}) dθ,    (1.68)    by the product rule (1.11) and y⋆ ⊥ x_{1:n}, y_{1:n} | θ.
Here, the distribution over models p(θ | x_{1:n}, y_{1:n}) is called the posterior and the distribution over predictions p(y⋆ | x⋆, x_{1:n}, y_{1:n}) is called the predictive posterior. The predictive posterior quantifies our posterior uncertainty about the "prediction" y⋆, however, since this is typically a complex distribution, it is difficult to communicate this uncertainty to a human. One statistic that can be used for this purpose is the smallest set C_δ(x⋆) ⊆ R for a fixed δ ∈ (0, 1) such that

P(y⋆ ∈ C_δ(x⋆) | x⋆, x_{1:n}, y_{1:n}) ≥ 1 − δ.    (1.69)

That is, we believe with "confidence" at least 1 − δ that the true value of y⋆ lies in C_δ(x⋆). Such a set C_δ(x⋆) is called a credible set.
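For a Gaussian predictive posterior, such a credible set is simply a symmetric interval around the mean; a small sketch (with made-up values for the mean and standard deviation):

# A minimal sketch: a (1 - delta) credible interval for a Gaussian predictive
# posterior with mean mu and standard deviation sigma (hypothetical values).
from scipy.stats import norm

mu, sigma, delta = 2.0, 0.5, 0.05
lo, hi = norm.interval(1 - delta, loc=mu, scale=sigma)
print(lo, hi)  # approximately mu -/+ 1.96 * sigma for delta = 0.05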
Figure 1.9: Example of a 95% credible set at x⋆ where the predictive posterior is Gaussian with mean µ(x⋆) and standard deviation σ(x⋆). In this case, the gray area integrates to ≈ 0.95 for C_{0.05}(x⋆) = [µ(x⋆) ± 1.96σ(x⋆)].

We have seen here that the tasks of learning and prediction are intimately related. Indeed, "prediction" can be seen in many ways as a natural by-product of "reasoning" (i.e., probabilistic inference), where we evaluate the likelihood of outcomes given our learned explanations for the world. This intuition can be read off directly from Equation (1.68) where p(y⋆ | x⋆, θ) corresponds to the likelihood of an outcome given the explanation θ and p(θ | x_{1:n}, y_{1:n}) corresponds to our inferred belief about the world. We will see many more examples of this link between probabilistic inference and prediction throughout this manuscript.

The high-dimensional integrals of Equations (1.67b) and (1.68) are typ-


ically intractable, and represent the main computational challenge in
probabilistic inference. Throughout the first part of this manuscript,
we will describe settings where exact inference is tractable, as well
as modern approximate inference algorithms that can be used when
exact inference is intractable.

1.3.6 Recursive Probabilistic Inference and Memory


We have already alluded to the fact that probabilistic inference has a
recursive structure, which lends itself to continual learning and which
often leads to efficient algorithms. Let us denote by

p^(t)(θ) := p(θ | x_{1:t}, y_{1:t})    (1.70)

the posterior after the first t observations with p^(0)(θ) = p(θ). Now, suppose that we have already computed p^(t)(θ) and observe y_{t+1}. We can recursively update the posterior as follows,

p^(t+1)(θ) = p(θ | y_{1:t+1})
           ∝ p(θ | y_{1:t}) · p(y_{t+1} | θ, y_{1:t})    using Bayes' rule (1.45)
           = p^(t)(θ) · p(y_{t+1} | θ),    (1.71)    using y_{t+1} ⊥ y_{1:t} | θ, see Figure 2.3.

Intuitively, the posterior distribution at time t “absorbs” or “summa-


rizes” all seen data.

By unrolling the recursion of Equation (1.71), we see that regardless


of the philosophical interpretation of probability, probabilistic infer-
ence is a fundamental mechanism of learning. Even the MLE which
performs naive approximate inference without a prior (i.e., a uniform
prior), is based on p(n) (θ) ∝ p(y1:n | x1:n , θ) which is the result of n
individual plausible inferences, where the (t + 1)-st inference uses the
posterior of the t-th inference as its prior.

So far we have been considering the supervised learning setting, where


all data is available a-priori. However, by sequentially obtaining the
new posterior and replacing our prior, we can also perform proba-
bilistic inference as data arrives online (i.e., in “real-time”). This is
analogous to recursive logical inference where derived consequences
are repeatedly added to the set of propositions to derive new conse-
quences. This also highlights the intimate connection between “rea-
soning” and “memory”. Indeed, the posterior distribution p(t) (θ) can
be seen as a form of memory that evolves with time t.
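The recursion (1.71) is easy to implement when Θ is a finite grid; the sketch below (our own toy example of inferring a coin's bias from flips) repeatedly multiplies the current posterior by the likelihood of the next observation and renormalizes:

# A minimal sketch of recursive probabilistic inference (1.71) on a discrete grid:
# the posterior over a coin's bias theta is updated observation by observation,
# each time using the previous posterior as the new prior. Toy data only.
import numpy as np

theta_grid = np.linspace(0.01, 0.99, 99)
posterior = np.ones_like(theta_grid) / len(theta_grid)  # p^(0): uniform prior

observations = [1, 0, 1, 1, 1, 0, 1]  # coin flips (1 = heads)
for y in observations:
    likelihood = theta_grid if y == 1 else 1 - theta_grid
    posterior = posterior * likelihood   # p^(t+1) is proportional to p^(t) * p(y | theta)
    posterior /= posterior.sum()         # renormalize

print(theta_grid[np.argmax(posterior)])  # posterior mode after all observations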

1.4 Outlook: Decision Theory

How can we use our predictions to make concrete decisions under


uncertainty? We will study this question extensively in Part II of this
manuscript, but briefly introduce some fundamental concepts here.
Making decisions using a probabilistic model p(y | x) of output y ∈ Y
given input x ∈ X , such as the ones we have discussed in the previous
section, is commonly formalized by
• a set of possible actions A, and
• a reward function r (y, a) ∈ R that computes the reward or utility of
taking action a ∈ A, assuming the true output is y ∈ Y .
Standard decision theory recommends picking the action with the
largest expected utility:

a⋆(x) := arg max_{a∈A} E_{y|x}[r(y, a)].    (1.72)

Here, a⋆ is called the optimal decision rule because, under the given
probabilistic model, no other rule can yield a higher expected utility.

Let us consider some examples of reward functions and their corre-


sponding optimal decisions:

Example 1.29: Reward functions

Under the decision rule from Equation (1.72), different reward functions r can lead to different decisions. Let us examine two reward functions for the case where Y = A = R (Problem 1.13).
• Alternatively to considering r as a reward function, we can interpret −r as the loss of taking action a when the true output is y. If our goal is for our actions a to "mimic" the output y, a natural choice is the squared loss, −r(y, a) = (y − a)². It turns out that under the squared loss, the optimal decision is simply the mean: a⋆(x) = E[y | x].
• To contrast this, we consider the asymmetric loss,

−r(y, a) = c_1 max{y − a, 0} + c_2 max{a − y, 0},

where the first term is the underestimation error and the second term the overestimation error, so that underestimation and overestimation are penalized differently. When y | x ∼ N(µ_x, σ_x²) then the optimal decision is

a⋆(x) = µ_x + σ_x · Φ^{-1}(c_1 / (c_1 + c_2))

where Φ is the CDF of the standard normal distribution and the second term quantifies pessimism / optimism. (Recall that the CDF Φ of the standard normal distribution is a sigmoid with its inverse satisfying Φ^{-1}(u) < 0 if u < 0.5, Φ^{-1}(u) = 0 if u = 0.5, and Φ^{-1}(u) > 0 if u > 0.5.) Note that if c_1 = c_2, then the second term vanishes and the optimal decision is the same as under the squared loss. If c_1 > c_2, the second term is positive (i.e., optimistic) to avoid underestimation, and if c_1 < c_2, the second term is negative (i.e., pessimistic) to avoid overestimation. We will find these notions of optimism and pessimism to be useful in many decision-making scenarios.
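The two optimal decision rules from Example 1.29 can be checked against a brute-force numerical minimization of the (Monte Carlo estimated) expected loss; a small sketch with made-up values of µ_x, σ_x, c_1, c_2:

# A minimal sketch: for y | x ~ N(mu, sigma^2), compare the closed-form optimal
# actions from Example 1.29 with a brute-force minimization of a Monte Carlo
# estimate of the expected asymmetric loss. All constants are made up.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, c1, c2 = 1.0, 2.0, 3.0, 1.0
y = rng.normal(mu, sigma, size=100_000)

def expected_asym_loss(a):
    return np.mean(c1 * np.maximum(y - a, 0) + c2 * np.maximum(a - y, 0))

actions = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 801)
a_numeric = actions[np.argmin([expected_asym_loss(a) for a in actions])]
a_closed = mu + sigma * norm.ppf(c1 / (c1 + c2))

print(a_numeric, a_closed)  # approximately equal
print(np.mean(y))           # optimal action under the squared loss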

While Equation (1.72) describes how to make optimal decisions given


a (posterior) probabilistic model, it does not tell us how to learn or
improve this model in the first place. That is, these decisions are only
optimal under the assumption that we cannot use their outcomes and
our resulting observations to update our model and inform future de-
cisions. When we start to consider the effect of our decisions on future
data and future posteriors, answering “how do I make optimal deci-
sions?” becomes more complex, and we will study this in Part II on
sequential decision-making.

Discussion

In this chapter, we have learned about the fundamental concepts of


probabilistic inference. We have seen that probabilistic inference is the

natural extension of logical reasoning to domains with uncertainty. We


have also derived the central principle of probabilistic inference, Bayes’
rule, which is simple to state but often computationally challenging. In
the next part of this manuscript, we will explore settings where exact
inference is tractable, as well as modern approaches to approximate
probabilistic inference.

Overview of Mathematical Background

We have included brief summaries of the fundamentals of parame-


ter estimation (mean estimation in particular) and optimization in
Appendices A.3 and A.4, respectively, which we will refer back to
throughout the manuscript. Appendix A.2 discusses the correspon-
dence of Gaussians and quadratic forms. Appendix A.5 comprises a
list of useful matrix identities and inequalities.

Problems

1.1. Properties of probability.

Let (Ω, A, P) be a probability space. Derive the following properties


of probability from the Kolmogorov axioms:
1. For any A, B ∈ A, if A ⊆ B then P( A) ≤ P( B).
2. For any A ∈ A, P(Ā) = 1 − P(A).
3. For any countable set of events {A_i ∈ A}_i,

P(⋃_{i=1}^∞ A_i) ≤ ∑_{i=1}^∞ P(A_i),    (1.73)

which is called a union bound.

1.2. Random walks on graphs.

Let G be a simple connected finite graph. We start at a vertex u of G.


At every step, we move to one of the neighbors of the current vertex
uniformly at random, e.g., if the vertex has 3 neighbors, we move to
one of them, each with probability 1/3. What is the probability that
the walk visits a given vertex v eventually?

1.3. Law of total expectation.

Show that if {A_i}_{i=1}^k are a partition of Ω and X is a random vector,

E[X] = ∑_{i=1}^k E[X | A_i] · P(A_i).    (1.74)

1.4. Covariance matrices are positive semi-definite.

Prove that a covariance matrix Σ is always positive semi-definite. That


is, all of its eigenvalues are greater or equal to zero, or equivalently,
x⊤ Σx ≥ 0 for any x ∈ Rn .

1.5. Probabilistic inference.

As a result of a medical screening, one of the tests revealed a serious


disease in a person. The test has a high accuracy of 99% (the prob-
ability of a positive response in the presence of a disease is 99% and
the probability of a negative response in the absence of a disease is
also 99%). However, the disease is quite rare and occurs only in one
person per 10 000. Calculate the probability of the examined person
having the identified disease.

1.6. Zero eigenvalues of covariance matrices.

We say that a random vector X in Rn is not linearly independent if for


some α ∈ Rn \ {0}, α⊤ X = 0.
1. Show that if X is not linearly independent, then Var[X] has a zero
eigenvalue.
2. Show that if Var[X] has a zero eigenvalue, then X is not linearly
independent.
Hint: Consider the variance of λ⊤ X where λ is the eigenvector corre-
sponding to the zero eigenvalue.
Thus, we have shown that Var[X] has a zero eigenvalue if and only if X
is not linearly independent.

1.7. Product of Gaussian PDFs.

Let µ1 , µ2 ∈ Rn be mean vectors and Σ 1 , Σ 2 ∈ Rn×n be covariance


matrices. Prove that

N ( x; µ, Σ ) ∝ N ( x; µ1 , Σ 1 ) · N ( x; µ2 , Σ 2 ) (1.75)

for some mean vector µ ∈ Rn and covariance matrix Σ ∈ Rn×n . That


is, show that the product of two Gaussian PDFs is proportional to the
PDF of a Gaussian.

1.8. Independence of Gaussians.

Show that two jointly Gaussian random vectors, X and Y, are indepen-
dent if and only if X and Y are uncorrelated.

1.9. Marginal / conditional distribution of a Gaussian.

Prove Theorem 1.24. That is, show that


1. every marginal of a Gaussian is Gaussian; and

2. conditioning on a subset of variables of a joint Gaussian is Gaus-


sian
by finding their corresponding PDFs.

Hint: You may use that for matrices Σ and Λ such that Σ^{-1} = Λ,
• if Σ and Λ are symmetric,

[x_A; x_B]^⊤ [[Λ_{AA}, Λ_{AB}], [Λ_{BA}, Λ_{BB}]] [x_A; x_B]
  = x_A^⊤ Λ_{AA} x_A + x_A^⊤ Λ_{AB} x_B + x_B^⊤ Λ_{BA} x_A + x_B^⊤ Λ_{BB} x_B
  = x_A^⊤ (Λ_{AA} − Λ_{AB} Λ_{BB}^{-1} Λ_{BA}) x_A + (x_B + Λ_{BB}^{-1} Λ_{BA} x_A)^⊤ Λ_{BB} (x_B + Λ_{BB}^{-1} Λ_{BA} x_A),

• Λ_{BB}^{-1} = Σ_{BB} − Σ_{BA} Σ_{AA}^{-1} Σ_{AB},
• Λ_{BB}^{-1} Λ_{BA} = −Σ_{BA} Σ_{AA}^{-1}.
The final two equations follow from the general characterization of the inverse
of a block matrix (Petersen et al., 2008, section 9.1.3).

1.10. Closedness properties of Gaussians.

Recall the notion of a moment-generating function (MGF) of a random vector X in R^n which is defined as

φ_X(t) := E[exp(t^⊤ X)], for all t ∈ R^n.    (1.76)

An MGF uniquely characterizes a distribution. The MGF of the multivariate Gaussian X ∼ N(µ, Σ) is

φ_X(t) = exp(t^⊤ µ + (1/2) t^⊤ Σ t).    (1.77)

This generalizes the MGF of the univariate Gaussian from Equation (A.40).

Prove the following facts.


1. Closedness under affine transformations: Given an n-dimensional Gaus-
sian X ∼ N (µ, Σ ), and A ∈ Rm×n and b ∈ Rm ,

AX + b ∼ N ( Aµ + b, AΣ A⊤ ). (1.78)

2. Additivity: Given two independent Gaussian random vectors X ∼ N (µ, Σ )


and X′ ∼ N (µ′ , Σ ′ ) in Rn ,

X + X ′ ∼ N ( µ + µ ′ , Σ + Σ ′ ). (1.79)
These properties are unique to Gaussians and a reason for why they
are widely used for learning and inference.

1.11. Expectation and variance of Gaussians.

Derive that E[X] = µ and Var[X] = Σ when X ∼ N (µ, Σ ).


Hint: First derive the expectation and variance of a univariate standard nor-
mal random variable.

1.12. Non-affine transformations of Gaussians.

Answer the following questions with yes or no.


1. Does there exist any non-affine transformation of a Gaussian ran-
dom vector which is Gaussian? If yes, give an example.
2. Let X, Y, Z be independent standard normal random variables. Is (X + YZ)/√(1 + Z²) Gaussian?

1.13. Decision theory.

Derive the optimal decisions under the squared loss and the asymmet-
ric loss from Example 1.29.
part I

Probabilistic Machine Learning


Preface to Part I

As humans, we constantly learn about the world around us. We learn


to interact with our physical surroundings. We deepen our under-
standing of the world by establishing relationships between actors, ob-
jects, and events. And we learn about ourselves by observing how
we interact with the world and with ourselves. We then continuously
use this knowledge to make inferences and predictions, be it about the
weather, the movement of a ball, or the behavior of a friend.

With limited computational resources, limited genetic information, and


limited life experience, we are not able to learn everything about the
world to complete certainty. We saw in Chapter 1 that probability the-
ory is the mathematical framework for reasoning with uncertainty in
the same way that logic is the mathematical framework for reasoning
with certainty. We will discuss two kinds of uncertainty: “aleatoric”
uncertainty which cannot be reduced under computational constraints,
and “epistemic” uncertainty which can be reduced by observing more
data.
An important aspect of learning is that we do not just learn once, but continually. Bayes' rule allows us to update our beliefs and reduce our uncertainty as we observe new data — a process that is called probabilistic inference. By taking the former posterior as the new prior, probabilistic inference can be performed continuously and repeated indefinitely as we observe more and more data.

Our sensory information is often noisy and imperfect, which is another source of uncertainty. The same is true for machines, even if they can sometimes sense aspects of the world more accurately than humans. We discuss how one can infer latent structure of the world from sensed data, such as the state of a dynamical system like a car, in a process that is called filtering.

Figure 1.10: A schematic illustration of probabilistic inference in the context of the (supervised) learning of a model θ from perceived data D. The prior model p(θ) can equip the model with anything from substantial, to little, to no prior knowledge.
In this first part of the manuscript, we examine how we can build ma-
chines that are capable of (continual) learning and inference. First, we
introduce probabilistic inference in the context of linear models which

make predictions based on fixed (often hand-designed) features. We


then discuss how probabilistic inference can be scaled to kernel meth-
ods and Gaussian processes which use a large (potentially infinite)
number of features, and to deep neural networks which learn features
dynamically from data. In these models, exact inference is typically in-
tractable, and we discuss modern methods for approximate inference
such as variational inference and Markov chain Monte Carlo. We high-
light a tradeoff between curiosity (i.e., extrapolating beyond the given
data) and conformity (i.e., fitting the given data), which surfaces as a
fundamental principle of probabilistic inference in the regime where
the data and our computational resources are limited.
2
Linear Regression

As a first example of probabilistic inference, we will study linear models for regression, which assume that the output y ∈ R is a linear function of the input x ∈ R^d:

y ≈ w^⊤ x + w_0

where w ∈ R^d are the weights and w_0 ∈ R is the intercept. (As we have discussed in Section 1.3, regression models can also be used for classification. The canonical example of a linear model for classification is logistic regression, which we will discuss in Section 5.1.1.) Observe that if we define the extended inputs x′ := (x, 1) and w′ := (w, w_0), then w′^⊤ x′ = w^⊤ x + w_0, implying that without loss of generality it suffices to study linear functions without the intercept term w_0. We will therefore consider the following function class of linear models

f(x; w) := w^⊤ x.

Figure 2.1: Example of linear regression with the least squares estimator (shown in blue).

We will consider the supervised learning task of learning weights w from labeled training data {(x_i, y_i)}_{i=1}^n. We define the design matrix,

X := [x_1^⊤; …; x_n^⊤] ∈ R^{n×d},    (2.1)

as the collection of inputs and the vector y := [y_1 ⋯ y_n]^⊤ ∈ R^n as the collection of labels. For each noisy observation (x_i, y_i), we define the value of the approximation of our model, f_i := w^⊤ x_i. Our model at the inputs X is described by the vector f := [f_1 ⋯ f_n]^⊤ which can be expressed succinctly as f = Xw.

The most common way of estimating w from data is the least squares estimator,

ŵ_ls := arg min_{w∈R^d} ∑_{i=1}^n (y_i − w^⊤ x_i)² = arg min_{w∈R^d} ‖y − Xw‖₂²,    (2.2)

minimizing the squared difference between the labels and predictions of the model. A slightly different estimator is used for ridge regression,

ŵ_ridge := arg min_{w∈R^d} ‖y − Xw‖₂² + λ‖w‖₂²    (2.3)

where λ > 0. The squared L2 regularization term λ‖w‖₂² penalizes large w and thus reduces the "complexity" of the resulting model. (Ridge regression is more robust to multicollinearity than standard linear regression. Multicollinearity occurs when multiple independent inputs are highly correlated. In this case, their individual effects on the predicted variable cannot be estimated well. Classical linear regression is highly volatile to small input changes. The regularization of ridge regression reduces this volatility by introducing a bias on the weights towards 0.)

It can be shown that the unique solutions to least squares and ridge regression are given by

ŵ_ls = (X^⊤ X)^{-1} X^⊤ y    and    (2.4)
ŵ_ridge = (X^⊤ X + λI)^{-1} X^⊤ y,    (2.5)

respectively if the Hessian of the loss is positive definite (i.e., the loss is strictly convex), which is the case as long as the columns of X are not linearly dependent (Problem 2.1 (1)). Least squares regression can be seen as finding the orthogonal projection of y onto the column space of X, as is illustrated in Figure 2.2 (Problem 2.1 (2)).
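The closed-form estimators (2.4) and (2.5) are straightforward to compute; a minimal sketch on randomly generated data, using a linear solve rather than an explicit matrix inverse:

# A minimal sketch of the closed-form least squares and ridge estimators (2.4), (2.5)
# on randomly generated data.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 5, 0.1
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w_ls = np.linalg.solve(X.T @ X, X.T @ y)                       # (2.4)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # (2.5)
print(w_ls, w_ridge, sep="\n")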
Figure 2.2: Least squares regression finds the orthogonal projection of y onto span{X} (here illustrated as the plane).

2.0.1 Maximum Likelihood Estimation

Since our function class comprises linear functions of the form w^⊤ x, the observation model from Equation (1.56) simplifies to

y_i = w⋆^⊤ x_i + ε_i    (2.6)

for some weight vector w⋆, where for the purpose of this chapter we will additionally assume that ε_i ∼ N(0, σ_n²) is homoscedastic Gaussian noise (ε_i is called additive white Gaussian noise). This observation model is equivalently characterized by the Gaussian likelihood,

y_i | x_i, w ∼ N(w^⊤ x_i, σ_n²),    (2.7)    using Equation (1.55).

Based on this likelihood we can compute the MLE (1.57) of the weights:

ŵ_MLE = arg max_{w∈R^d} ∑_{i=1}^n log p(y_i | x_i, w) = arg min_{w∈R^d} ∑_{i=1}^n (y_i − w^⊤ x_i)²,

plugging in the Gaussian likelihood and simplifying. Note that therefore ŵ_MLE = ŵ_ls.

In practice, the noise variance σ_n² is typically unknown and also has to be determined, for example, through maximum likelihood estimation. It is a straightforward exercise to check that the MLE of σ_n² given fixed weights w is σ̂_n² = (1/n) ∑_{i=1}^n (y_i − w^⊤ x_i)² (Problem 2.2).
2.1 Weight-space View

The most immediate and natural probabilistic interpretation of linear regression is to quantify uncertainty about the weights w. Recall that probabilistic inference requires specification of a generative model comprised of prior and likelihood. Throughout this chapter, we will use the Gaussian prior,

w ∼ N(0, σ_p² I),    (2.8)

and the Gaussian likelihood from Equation (2.7). We will discuss possible (probabilistic) strategies for choosing hyperparameters such as the prior variance σ_p² and the noise variance σ_n² in Section 4.4.

Figure 2.3: Directed graphical model of Bayesian linear regression in plate notation.

Remark 2.1: Why a Gaussian prior?

The choice of using a Gaussian prior may seem somewhat arbitrary at first sight, except perhaps for the nice analytical properties of Gaussians that we have seen in Section 1.2.3 and which will prove useful. The maximum entropy principle (cf. Section 1.2.1) provides a more fundamental justification for Gaussian priors since it turns out that the Gaussian has the maximum entropy among all distributions on R^d with known mean and variance (Problem 5.6).

Next, let us derive the posterior distribution over the weights.

log p(w | x_{1:n}, y_{1:n})
  = log p(w) + log p(y_{1:n} | x_{1:n}, w) + const    by Bayes' rule (1.45)
  = log p(w) + ∑_{i=1}^n log p(y_i | x_i, w) + const    using independence of the samples
  = −(1/2) [σ_p^{-2} ‖w‖₂² + σ_n^{-2} ∑_{i=1}^n (y_i − w^⊤ x_i)²] + const    using the Gaussian prior and likelihood
  = −(1/2) [σ_p^{-2} ‖w‖₂² + σ_n^{-2} ‖y − Xw‖₂²] + const    using ∑_{i=1}^n (y_i − w^⊤ x_i)² = ‖y − Xw‖₂²
  = −(1/2) [σ_p^{-2} w^⊤ w + σ_n^{-2} (w^⊤ X^⊤ X w − 2 y^⊤ X w + y^⊤ y)] + const
  = −(1/2) [w^⊤ (σ_n^{-2} X^⊤ X + σ_p^{-2} I) w − 2 σ_n^{-2} y^⊤ X w] + const.    (2.9)

Observe that the log-posterior is a quadratic form in w, so the posterior distribution must be Gaussian (see Equation (A.12)):

w | x_{1:n}, y_{1:n} ∼ N(µ, Σ)    (2.10a)

where we can read off the mean and variance to be

µ := σ_n^{-2} Σ X^⊤ y,    (2.10b)
Σ := (σ_n^{-2} X^⊤ X + σ_p^{-2} I)^{-1}.    (2.10c)

This also shows that Gaussians with known variance and linear likelihood are self-conjugate, a property that we had hinted at in Section 1.2.2. It can be shown more generally that Gaussians with known variance are self-conjugate to any Gaussian likelihood (Murphy, 2007). For other generative models, the posterior can typically not be expressed in closed-form — this is a very special property of Gaussians!
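As a minimal sketch (with synthetic data and arbitrary hyperparameters), the posterior (2.10) can be computed in a few lines:

# A minimal sketch of the Bayesian linear regression posterior (2.10):
# Sigma = (sigma_n^{-2} X^T X + sigma_p^{-2} I)^{-1}, mu = sigma_n^{-2} Sigma X^T y.
import numpy as np

rng = np.random.default_rng(0)
sigma_n, sigma_p = 0.1, 1.0
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + sigma_n * rng.normal(size=50)

Sigma = np.linalg.inv(X.T @ X / sigma_n**2 + np.eye(3) / sigma_p**2)  # (2.10c)
mu = Sigma @ X.T @ y / sigma_n**2                                     # (2.10b)
print(mu)     # posterior mean, coincides with the ridge/MAP estimate
print(Sigma)  # posterior covariance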

2.1.1 Maximum a Posteriori Estimation

Computing the MAP estimate for the weights,

ŵ_MAP = arg max_w log p(y_{1:n} | x_{1:n}, w) + log p(w)
       = arg min_w ‖y − Xw‖₂² + (σ_n²/σ_p²) ‖w‖₂²,    (2.11)    using that the likelihood and prior are Gaussian,

we observe that this is identical to ridge regression with weight decay λ := σ_n²/σ_p²: ŵ_MAP = ŵ_ridge. Equation (2.11) is simply the MLE loss with an additional L2-regularization (originating from the prior) that encourages keeping weights small. Recall that the MAP estimate corresponds to the mode of the posterior distribution, which in the case of a Gaussian is simply its mean µ. As to be expected, µ coincides with the analytical solution to ridge regression from Equation (2.5).

Figure 2.4: Level sets of L2- (blue) and L1-regularization (red), corresponding to Gaussian and Laplace priors, respectively. It can be seen that L1-regularization is more effective in encouraging sparse solutions (that is, solutions where many components are set to exactly 0).

Example 2.2: Lasso as the MAP estimate with a Laplace prior

One problem with ridge regression is that the contribution of nearly-zero weights to the L2-regularization term is negligible. Thus, L2-regularization is typically not sufficient to perform variable selection (that is, set some weights to zero entirely), which is often desirable for interpretability of the model.

A commonly used alternative to ridge regression is the least absolute shrinkage and selection operator (or lasso), which regularizes with the L1-norm:

ŵ_lasso := arg min_{w∈R^d} ‖y − Xw‖₂² + λ‖w‖₁.    (2.12)

It turns out that lasso can also be viewed as probabilistic inference, using a Laplace prior w ∼ Laplace(0, h) with length scale h instead of a Gaussian prior.

Computing the MAP estimate for the weights yields,

ŵ_MAP = arg max_w log p(y_{1:n} | x_{1:n}, w) + log p(w)
       = arg min_w ∑_{i=1}^n (y_i − w^⊤ x_i)² + (σ_n²/h) ‖w‖₁,    (2.13)    using that the likelihood is Gaussian and the prior is Laplacian,

which coincides with the lasso with weight decay λ := σ_n²/h.

To make predictions at a test point x⋆, we define the (model-)predicted point f⋆ := ŵ_MAP^⊤ x⋆ and obtain the label prediction

y⋆ | x⋆, x_{1:n}, y_{1:n} ∼ N(f⋆, σ_n²).    (2.14)

Here we observe that using point estimates such as the MAP estimate
does not quantify uncertainty in the weights. The MAP estimate sim-
ply collapses all mass of the posterior around its mode. This can be
harmful when we are highly unsure about the best model, e.g., because
we have observed insufficient data.

2.1.2 Probabilistic Inference

Rather than selecting a single weight vector ŵ to make predictions, we can use the full posterior distribution. This is known as Bayesian linear regression (BLR) and illustrated with an example in Figure 2.5.

Figure 2.5: Comparison of linear regression (MLE), ridge regression (MAP estimate), and Bayesian linear regression when the data is generated according to y | w, x ∼ N(w^⊤ x, σ_n²). The true mean is shown in black, the MLE in blue, and the MAP estimate in red. The dark gray area denotes the epistemic uncertainty of Bayesian linear regression and the light gray area the additional homoscedastic noise. On the left, σ_n = 0.15. On the right, σ_n = 0.7.

To make predictions at a test point x⋆, we let f⋆ := w^⊤ x⋆ which has the distribution

f⋆ | x⋆, x_{1:n}, y_{1:n} ∼ N(µ^⊤ x⋆, x⋆^⊤ Σ x⋆),    (2.15)    using the closedness of Gaussians under linear transformations (1.78).

Note that this does not take into account the noise in the labels σ_n². For the label prediction y⋆, we obtain

y⋆ | x⋆, x_{1:n}, y_{1:n} ∼ N(µ^⊤ x⋆, x⋆^⊤ Σ x⋆ + σ_n²),    (2.16)    using additivity of Gaussians (1.79).
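A self-contained sketch (synthetic data, arbitrary hyperparameters) of the predictive posterior (2.16), with the variance split into its two terms:

# A minimal sketch of the predictive posterior (2.16): predictive mean mu^T x_star and
# variance x_star^T Sigma x_star + sigma_n^2.
import numpy as np

rng = np.random.default_rng(0)
sigma_n, sigma_p = 0.1, 1.0
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + sigma_n * rng.normal(size=50)

Sigma = np.linalg.inv(X.T @ X / sigma_n**2 + np.eye(3) / sigma_p**2)
mu = Sigma @ X.T @ y / sigma_n**2

x_star = np.array([1.0, 0.0, -1.0])
pred_mean = mu @ x_star
epistemic = x_star @ Sigma @ x_star          # uncertainty about the weights
aleatoric = sigma_n**2                       # irreducible label noise
print(pred_mean, epistemic + aleatoric)      # mean and variance of y_star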

2.1.3 Recursive Probabilistic Inference

We have already discussed the recursive properties of probabilistic inference in Section 1.3.6. For Bayesian linear regression with a Gaussian prior and likelihood, this principle can be used to derive an efficient online algorithm since also the posterior is a Gaussian,

p^(t)(w) = N(w; µ^(t), Σ^(t)),    (2.17)

which can be stored efficiently using only O(d²) parameters. This leads to an efficient online algorithm for Bayesian linear regression with time-independent(!) memory complexity O(d²) and round complexity O(d²) (Problem 2.5). The interpretation of Bayesian linear regression as an online algorithm also highlights similarities to other sequential models such as Kalman filters, which we discuss in Chapter 3. In Example 3.5, we will learn that online Bayesian linear regression is, in fact, an example of a Kalman filter.

2.2 Aleatoric and Epistemic Uncertainty

The predictive posterior distribution from Equation (2.16) highlights a decomposition of uncertainty wherein x⋆^⊤ Σ x⋆ corresponds to the uncertainty about our model due to the lack of data (commonly referred to as the epistemic uncertainty) and σ_n² corresponds to the uncertainty about the labels that cannot be explained by the inputs and any model from the model class (commonly referred to as the aleatoric uncertainty, "irreducible noise", or simply "(label) noise") (Problem 2.6).

A natural probabilistic approach is to represent epistemic uncertainty


with a probability distribution over models. Intuitively, the variance
of this distribution measures our uncertainty about the model and its
mode corresponds to our current best (point) estimate. The distri-
bution over weights of a linear model is one example, and we will
continue to explore this approach for other models in the following
chapters.

It is a practical modeling choice how much inaccuracy to attribute to


epistemic or aleatoric uncertainty. Generally, when a poor model is
used to explain a process, more inaccuracy has to be attributed to irre-
ducible noise. For example, when a linear model is used to “explain”
a nonlinear process, most uncertainty is aleatoric as the model cannot
explain the data well. As we use more expressive models, a larger
portion of the uncertainty can be explained by the data.

Epistemic and aleatoric uncertainty can be formally defined in terms of the law of total variance (1.41),

Var[y⋆ | x⋆] = E_θ[Var_{y⋆}[y⋆ | x⋆, θ]] + Var_θ[E_{y⋆}[y⋆ | x⋆, θ]],    (2.18)

where the first term is the aleatoric uncertainty and the second term the epistemic uncertainty.

Here, the mean variability of predictions y⋆ averaged across all mod-


els θ is the estimated aleatoric uncertainty. In contrast, the variability of
the mean prediction y⋆ under each model θ is the estimated epistemic
uncertainty. This decomposition of uncertainty will appear frequently
throughout this manuscript.

2.3 Non-linear Regression

We can use linear regression not only to learn linear functions. The trick is to apply a nonlinear transformation ϕ : R^d → R^e to the features x_i, where d is the dimension of the input space and e is the dimension of the designed feature space. We denote the design matrix comprised of transformed features by Φ ∈ R^{n×e}. Note that if the feature transformation ϕ is the identity function then Φ = X.

Figure 2.6: Applying linear regression with a feature space of polynomials of degree 10. The least squares estimate is shown in blue, ridge regression in red, and lasso in green.

Example 2.3: Polynomial regression

Let ϕ(x) := [x², x, 1] and w := [a, b, c]. Then the function that our model learns is given as

f = ax² + bx + c.

Thus, our model can exactly represent all polynomials up to degree 2.

However, to learn polynomials of degree m in d input dimensions, we need to apply the nonlinear transformation

ϕ(x) = [1, x_1, …, x_d, x_1², …, x_d², x_1·x_2, …, x_{d−1}·x_d, …, x_{d−m+1}·…·x_d].

Note that the feature dimension e is ∑_{i=0}^m (d+i−1 choose i) = Θ(d^m): the vector contains (d+i−1 choose i) monomials of degree i, as this is the number of ways to choose i times from d items with replacement and without consideration of order. (To see this, consider the following encoding: we take a sequence of d + i − 1 spots. Selecting any subset of i spots, we interpret the remaining d − 1 spots as "barriers" separating each of the d items. The selected spots correspond to the number of times each item has been selected. For example, if 2 items are to be selected out of a total of 4 items with replacement, one possible configuration is "◦ || ◦ |" where ◦ denotes a selected spot and | denotes a barrier. This configuration encodes that the first and third item have each been chosen once. The number of possible configurations, each encoding a unique outcome, is therefore (d+i−1 choose i).) Thus, the dimension of the feature space grows exponentially in the degree of polynomials and input dimensions. Even for relatively small m and d, this becomes completely unmanageable.

The example of polynomials highlights that it may be inefficient to keep track of the weights w ∈ R^e when e is large, and that it may be useful to instead consider a reparameterization which is of dimension n rather than of the feature dimension.

2.4 Function-space View

Let us now look at Bayesian linear regression through a different lens. Previously, we have been interpreting it as a distribution over the weights w of a linear function f = Φw. The key idea is that for a finite set of inputs (ensuring that the design matrix is well-defined), we can equivalently consider a distribution directly over the estimated function values f. We call this the function-space view of Bayesian linear regression.

Instead of considering a prior over the weights w ∼ N (0, σp2 I ) as we


have done previously, we now impose a prior directly on the values of
our model at the observations. Using that Gaussians are closed under
linear maps (1.78), we obtain the equivalent prior y

f | X ∼ N (ΦE[w], ΦVar[w]Φ⊤ ) = N (0, σp2 ΦΦ⊤ ) (2.19) 2


fn
| {z }
K 0

where K ∈ Rn × n
is the so-called kernel matrix. Observe that the entries −2
f1
of the kernel matrix can be expressed as K (i, j) = σp2 · ϕ( xi )⊤ ϕ( x j ). −4

You may say that nothing has changed, and you would be right — x1 xn

that is precisely the point. Note, however, that the shape of the kernel x

matrix is n × n rather than the e × e covariance matrix over weights, Figure 2.7: An illustration of the
which becomes unmanageable when e is large. The kernel matrix K function-space view. The model is de-
scribed by the points ( xi , f i ).
has entries only for the finite set of observed inputs. However, in
principle, we could have observed any input, and this motivates the
definition of the kernel function
.
k( x, x′ ) = σp2 · ϕ( x)⊤ ϕ( x′ ) (2.20)

for arbitrary inputs x and x′ . A kernel matrix is simply a finite “view”


of the kernel function,
 
k ( x1 , x1 ) · · · k ( x1 , x n )
 .. .. .. 
K=  . . .

 (2.21)
k ( x n , x1 ) · · · k ( x n , x n )

Observe that by definition of the kernel matrix in Equation (2.19), the


kernel matrix is a covariance matrix and the kernel function measures
the covariance of the function values f ( x) and f ( x′ ) given inputs x
and x′ :

k( x, x′ ) = Cov f ( x), f ( x′ ) .
 
(2.22)

Moreover, note that we have reformulated5 the learning algorithm 5


we often say “kernelized”
such that the feature space is now implicit in the choice of kernel, and
the kernel is defined by inner products of (nonlinearly transformed)
inputs. In other words, the choice of kernel implicitly determines the
class of functions that f is sampled from (without expressing the func-
tions explicitly in closed-form), which encodes our prior beliefs. This
is known as the kernel trick.

2.4.1 Learning and Predictions

We have already kernelized the Bayesian linear regression prior. The posterior distribution f | X, y is again Gaussian due to the closedness properties of Gaussians, analogously to our derivation of the prior kernel matrix in Equation (2.19).

It remains to show that we can also rely on the kernel trick for predictions. Given the test point x⋆, we define

Φ̃ := [Φ; ϕ(x⋆)^⊤],    ỹ := [y; y⋆],    f̃ := [f; f⋆].

We immediately obtain f̃ = Φ̃w. Analogously to our analysis of predictions from the weight-space view, we add the label noise to obtain the estimate ỹ = f̃ + ε̃ where ε̃ := [ε_1 ⋯ ε_n ε⋆]^⊤ ∼ N(0, σ_n² I) is the independent label noise. Applying the same reasoning as we did for the prior, we obtain

f̃ | X, x⋆ ∼ N(0, K̃)    (2.23)

where K̃ := σ_p² Φ̃Φ̃^⊤. Adding the label noise yields

ỹ | X, x⋆ ∼ N(0, K̃ + σ_n² I).    (2.24)

Finally, we can conclude from the closedness of Gaussian random vectors under conditional distributions (1.53) that the predictive posterior y⋆ | x⋆, X, y follows again a normal distribution. We will do a full derivation of the posterior and predictive posterior in Section 4.1.

2.4.2 Efficient Polynomial Regression


But how does the kernel trick address our concerns about efficiency
raised in Section 2.3? After all, computing the kernel for a feature
space of dimension e still requires computing sums of length e which
is prohibitive when e is large. The kernel trick opens up a couple of
new doors for us:
1. For certain feature transformations ϕ, we may be able to find an
easier to compute expression equivalent to ϕ( x)⊤ ϕ( x′ ).
2. If this is not possible, we could approximate the inner product by
an easier to compute expression.
3. Or, alternatively, we may decide not to care very much about the
exact feature transformation and simply experiment with kernels
that induce some feature space (which may even be infinitely di-
mensional).
We will explore the third approach when we revisit kernels in Sec-
tion 4.3. A polynomial feature transformation can be computed effi-
ciently in closed-form.

Fact 2.4. For the polynomial feature transformation ϕ up to degree m from Example 2.3, it can be shown that up to constant factors,

ϕ(x)^⊤ ϕ(x′) = (1 + x^⊤ x′)^m.    (2.25)

For example, for input dimension 2, the kernel (1 + x^⊤ x′)² corresponds to the feature vector ϕ(x) = [1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1², x_2²]^⊤.
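Fact 2.4 is easy to sanity-check numerically for this degree-2 case; a small sketch with random test inputs:

# A minimal sketch: for d = 2 and m = 2, the explicit feature map
# phi(x) = [1, sqrt(2) x1, sqrt(2) x2, sqrt(2) x1 x2, x1^2, x2^2] satisfies
# phi(x)^T phi(x') = (1 + x^T x')^2, illustrating Fact 2.4.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1**2, x2**2])

rng = np.random.default_rng(0)
x, x_prime = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(x_prime))   # equal (up to floating point error) ...
print((1 + x @ x_prime) ** 2)  # ... to the polynomial kernel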

Discussion

We have explored a probabilistic perspective on linear models, and


seen that classical approaches such as least squares and ridge regres-
sion can be interpreted as approximate probabilistic inference. We
then saw that we can even perform exact probabilistic inference effi-
ciently if we adopt a Gaussian prior and Gaussian noise assumption.
These are already powerful tools, which are often applied also to non-
linear models if we treat the latent feature space — which was either
human-designed or learned via deep learning — as fixed. In the next
chapter, we will digress briefly from the storyline on “learning” to see
how we can adopt a similar probabilistic perspective to track latent
states over time. Then, in Chapter 4, we will see how we can use the
function-space view and kernel trick to learn flexible nonlinear models
with exact probabilistic inference, without ever explicitly representing
the feature space.

Problems

2.1. Closed-form linear regression.


1. Derive the unique solutions to least squares and ridge regression
from Equations (2.4) and (2.5).
2. For an n × m matrix A and vector x ∈ Rm , we call Π A x the orthogo-
nal projection of x onto span{ A} = { Ax′ | x′ ∈ Rm }. In particular,
an orthogonal projection satisfies x − Π A x ⊥ Ax′ for all x′ ∈ Rm .
Show that ŵls from Equation (2.4) is such that X ŵls is the unique
closest point to y on span{ X }, i.e., it satisfies X ŵls = ΠX y.

2.2. MLE of noise variance.

Show that the MLE of σ_n² given fixed weights w is

σ̂_n² = (1/n) ∑_{i=1}^n (y_i − w^⊤ x_i)².    (2.26)

2.3. Variance of least squares around training data.

Show that the variance of a prediction at the point [1 x⋆]^⊤ is smallest when x⋆ is the mean of the training data. More formally, show that if inputs are of the form x_i = [1 x_i]^⊤ where x_i ∈ R and ŵ_ls is the least squares estimate, then Var[y⋆ | [1 x⋆]^⊤, ŵ_ls] is minimized for x⋆ = (1/n) ∑_{i=1}^n x_i.

2.4. Bayesian linear regression.

Suppose you are given the following observations

X = [1 1; 1 2; 2 1; 2 2],    y = [2.4, 4.3, 3.1, 4.9]^⊤,

and assume the data follows a linear model with homoscedastic noise
N (0, σn2 ) where σn2 = 0.1.
1. Find the maximum likelihood estimate ŵMLE given the data.
2. Now assume that we have a prior p(w) = N (w; 0, σp2 I ) with σp2 = 0.05.
Find the MAP estimate ŵMAP given the data and the prior.
3. Use the posterior p(w | X, y) to get a posterior prediction for the
label y⋆ at x⋆ = [3 3]⊤ . Report the mean and the variance of this
prediction.
4. How would you have to change the prior p(w) such that

ŵMAP → ŵMLE ?

2.5. Online Bayesian linear regression.


1. Can you design an algorithm that updates the posterior (as op-
posed to recalculating it from scratch using Equation (2.10)) in a
smarter way? The requirement is that the memory should not grow
as O(t).
2. If d is large, computing the inverse every round is very expensive. Can you use the recursive structure you found in the previous question to bring down the computational complexity of every round to O(d²)?


The resulting efficient online algorithm is known as online Bayesian


linear regression.

2.6. Aleatoric and epistemic uncertainty of BLR.

Prove for Bayesian linear regression that x⋆ ⊤ Σx⋆ is the epistemic un-
certainty and σn2 the aleatoric uncertainty in y⋆ under the decomposi-
tion of Equation (2.18).

2.7. Hyperpriors.

We consider a dataset {( xi , yi )}in=1 of size n, where xi ∈ Rd denotes


the feature vector and yi ∈ R denotes the label of the i-th data point.
Let ε i be i.i.d. samples from the Gaussian distribution N (0, λ−1 ) for a
given λ > 0. We collect the labels in a vector y ∈ Rn , the features in
a matrix X ∈ Rn×d , and the noise in a vector ε ∈ Rn . The labels are
generated according to y = Xw + ε.

To perform Bayesian Linear Regression, we consider the prior distribu-


tion over the parameter vector w to be N (µ, λ−1 Id ), where Id denotes
the d-dimensional identity matrix and µ ∈ Rd is a hyperparameter.
1. Given this Bayesian data model, what is the conditional covariance matrix Σ_y := Var[y | X, µ, λ]?
2. Calculate the maximum likelihood estimate of the hyperparame-
ter µ.
3. Since we are unsure about the hyperparameter µ, we decide to
model our uncertainty about µ by placing the “hyperprior” µ ∼ N (0, Id ).
Is the posterior distribution p(µ | X, y, λ) a Gaussian distribution?
If yes, what are its mean vector and covariance matrix?
4. What is the posterior distribution p(λ | X, y, µ)?
Hint: For any a ∈ R, A ∈ R^{n×n} it holds that det(aA) = a^n det(A).
3
Filtering

Before we continue in Chapter 4 with the function-space view of re-


gression, we want to look at a seemingly different but very related
problem. We will study Bayesian learning and inference in the state
space model, where we want to keep track of the state of an agent over
time based on noisy observations. In this model, we have a sequence
of (hidden) states (Xt )t∈N0 where Xt is in Rd and a sequence of obser-
vations (Yt )t∈N0 where Yt is in Rm .

The process of keeping track of the state using noisy observations is


also known as Bayesian filtering or recursive Bayesian estimation. Fig-
ure 3.1 illustrates this process, where an agent perceives the current
state of the world and then updates its beliefs about the state based on
this observation.

Figure 3.1: Schematic view of Bayesian filtering: an agent perceives the current state of the world (world_t) via data D_t, updates its model p(θ | D_{1:t}) accordingly, and repeats this as the world evolves to world_{t+1} with new data D_{t+1} and model p(θ | D_{1:t+1}).

We will discuss Bayesian filtering more broadly in the next section. A


Kalman filter is an important special case of a Bayes’ filter, which uses
a Gaussian prior over the states and conditional linear Gaussians to
describe the evolution of states and observations. Analogously to the

previous chapter, we will see that inference in this model is tractable


due to the closedness properties of Gaussians.

Definition 3.1 (Kalman filter). A Kalman filter is specified by a Gaus-


sian prior over the states,

X0 ∼ N (µ, Σ ), (3.1)

and a conditional linear Gaussian motion model and sensor model,

X_{t+1} := F X_t + ε_t,    F ∈ R^{d×d}, ε_t ∼ N(0, Σ_x),    (3.2)
Y_t := H X_t + η_t,    H ∈ R^{m×d}, η_t ∼ N(0, Σ_y),    (3.3)

respectively. The motion model is sometimes also called transition model or dynamics model. Crucially, Kalman filters assume that F and H are known. In general, F and H may depend on t. Also, ε and η may have a non-zero mean, commonly called a "drift".

Figure 3.2: Directed graphical model of a Kalman filter with hidden states X_t and observables Y_t.

Because Kalman filters use conditional linear Gaussians, which we have already seen in Equation (1.55), their joint distribution (over all variables) is also Gaussian. This means that predicting the future states of a Kalman filter is simply inference with multivariate Gaussians. In Bayesian filtering, however, we do not only want to make predictions occasionally. In Bayesian filtering, we want to keep track of states, that is, predict the current state of an agent online. (Here, online is common terminology to say that we want to perform inference at time t without exposure to times t + 1, t + 2, …, so in "real-time".) To do this efficiently, we need to update our belief about the state of the agent recursively, similarly to our recursive Bayesian updates in Bayesian linear regression (see Section 2.1.3).
From the directed graphical model of a Kalman filter shown in Fig-


ure 3.2, we can immediately gather the following conditional indepen-
dence relations,2 2
Alternatively, they follow from the def-
inition of the motion and sensor models
as linear updates.
Xt+1 ⊥ X1:t−1 , Y1:t−1 | Xt , (3.4)
Yt ⊥ X1:t−1 | Xt (3.5)
Yt ⊥ Y1:t−1 | Xt−1 . (3.6)

The first conditional independence property is also known as the Markov


property, which we will return to later in our discussion of Markov
chains and Markov decision processes. This characterization of the
Kalman filter, yields the following factorization of the joint distribu-
tion:
t
p( x1:t , y1:t ) = ∏ p(xi | x1:i−1 ) p(yi | x1:t , y1:i−1 ) using the product rule (1.11)
i =1
t
= p( x1 ) p(y1 | x1 ) ∏ p( xi | xi−1 ) p(yi | xi ). (3.7) using the conditional independence
i =2 properties from (3.4), (3.5), and (3.6)
filtering 53

3.1 Conditioning and Prediction

We can describe Bayesian filtering by the following recursive scheme


with the two phases, conditioning (also called “update”) and prediction:

Algorithm 3.2: Bayesian filtering


1 start with a prior over initial states p( x0 )
2 for t = 1 to ∞ do
3 assume we have p( xt | y1:t−1 )
4 conditioning: compute p( xt | y1:t ) using the new observation yt
5 prediction: compute p( xt+1 | y1:t )

Let us consider the conditioning step first:


1
p( xt | y1:t ) =p( xt | y1:t−1 ) p(yt | xt , y1:t−1 ) using Bayes’ rule (1.45)
Z
1
= p( xt | y1:t−1 ) p(yt | xt ). (3.8) using the conditional independence
Z structure (3.6)
For the prediction step, we obtain,
Z
p( xt+1 | y1:t ) = p( xt+1 , xt | y1:t ) dxt using the sum rule (1.7)
Z
= p( xt+1 | xt , y1:t ) p( xt | y1:t ) dxt using the product rule (1.11)
Z
= p( xt+1 | xt ) p( xt | y1:t ) dxt . (3.9) using the conditional independence
structure (3.4)
In general, these distributions can be very complicated, but for Gaus-
sians (i.e., Kalman filters) they can be expressed in closed-form.

Remark 3.3: Bayesian smoothing


Bayesian smoothing is a closely related task to Bayesian filtering.
While Bayesian filtering methods estimate the current state based
only on observations obtained before and at the current time step,
Bayesian smoothing computes the distribution of Xk | y1:t where
t > k. That is Bayesian smoothing estimates Xk based on data until
and beyond time k. Note that if k = t, then Bayesian smoothing
coincides with Bayesian filtering.

Analogously to Equation (3.8),

p( xk | y1:t ) ∝ p( xk | y1:k ) p(yk+1:t | xk ). (3.10)

If we assume a Gaussian prior and conditional Gaussian transi-


tion and dynamics models (this is called Kalman smoothing), then
54 probabilistic artificial intelligence

by the closedness properties of Gaussians, Xk | y1:t is a Gaus-


sian. Indeed, all terms of Equation (3.10) are Gaussian PDFs and
as seen in Equation (1.75), the product of two Gaussian PDFs is
again proportional to a Gaussian PDF.

The first term, Xk | y1:k , is the marginal posterior of the hidden


states of the Kalman filter which can be obtained with Bayesian
filtering.

By conditioning on Xk+1 , we have for the second term,


Z
p(yk+1:t | xk ) = p(yk+1:t | xk , xk+1 ) p( xk+1 | xk ) dxk+1 using the sum rule (1.7) and product
Z rule (1.11)
= p(yk+1:t | xk+1 ) p( xk+1 | xk ) dxk+1 using the conditional independence
Z structure (3.5)
= p(yk+1 | xk+1 ) p(yk+2:t | xk+1 ) p( xk+1 | xk ) dxk+1 using the conditional independence
structure (3.6)
(3.11)

Let us have a look at the terms in the product:


• p(yk+1 | xk+1 ) is obtained from the sensor model,
• p( xk+1 | xk ) is obtained from the transition model, and
• p(yk+2:t | xk+1 ) can be computed recursively backwards in time.
This recursion results in linear equations resembling a Kalman
filter running backwards in time.

Thus, in the setting of Kalman smoothing, both factors of Equa-


tion (3.10) can be computed efficiently: one using a (forward)
Kalman filter; the other using a “backward” Kalman filter. More
concretely, in time O(t), we can compute the two factors for all
k ∈ [t]. This approach is known as two-filter smoothing or the
forward-backward algorithm.

3.2 Kalman Filters

Let us return to the setting of Kalman filters where priors and likeli-
hoods are Gaussian. Here, we will see that the update and prediction
steps can be computed in closed form.

3.2.1 Conditioning

The conditioning operation in Kalman filters is also called the Kalman


update. Before introducing the general Kalman update, let us consider
a simpler example:
filtering 55

Example 3.4: Random walk in 1d


We use the simple motion and sensor models,3 3
This corresponds to F = H = I and a
drift of 0.
Xt+1 | xt ∼ N ( xt , σx2 ), (3.12a)
Yt | xt ∼ N ( xt , σy2 ). (3.12b)

Let Xt | y1:t ∼ N (µt , σt2 ) be our belief at time t. It can be shown


that Bayesian filtering yields the belief Xt+1 | y1:t+1 ∼ N (µt+1 , σt2+1 )
at time t + 1 where ? Problem 3.1

. σy2 µt + (σt2 + σx2 )yt+1 . (σt2 + σx2 )σy2


µ t +1 = , σt2+1 = . (3.13)
σt2 + σx2 + σy2 σt2 + σx2 + σy2

Although looking intimidating at first, this update has a very nat-


ural interpretation. Let us define the following quantity,

. σt2 + σx2 σy2


λ= = 1 − ∈ [0, 1]. (3.14)
σt2 + σx2 + σy2 σt2 + σx2 + σy2

Using λ, we can write the updated mean as a convex combination


of the previous mean and the observation,

µt+1 = (1 − λ)µt + λyt+1 (3.15)


= µ t + λ ( y t +1 − µ t ). (3.16)

x
Intuitively, λ is a form of “gain” that influences how much of the
new information should be incorporated into the updated mean.
1 2 3 4 5 6
For this reason, λ is also called Kalman gain. t

The updated variance can similarly be rewritten, Figure 3.3: Hidden states during a ran-
dom walk in one dimension.
σt2+1 = λσy2 = (1 − λ)(σt2 + σx2 ). (3.17)

In particular, observe that if µt = yt+1 (i.e., we observe our predic-


tion), we have µt+1 = µt as there is no new information. Similarly,
for σy2 → ∞ (i.e., we do not trust our observations), we have

λ → 0, µ t +1 = µ t , σt2+1 = σt2 + σx2 .

In contrast, for σy2 → 0, we have

λ → 1, µ t +1 = y t +1 , σt2+1 = 0.

The general formulas for the Kalman update follow the same logic as
in the above example of a one-dimensional random walk. Given the
56 probabilistic artificial intelligence

prior belief Xt | y1:t ∼ N (µt , Σ t ), we have

Xt+1 | y1:t+1 ∼ N (µt+1 , Σ t+1 ) where (3.18a)


.
µt+1 = Fµt + Kt+1 (yt+1 − H Fµt ), (3.18b)
. ⊤
Σ t+1 = ( I − Kt+1 H )( FΣ t F + Σ x ). (3.18c)

Hereby, Kt+1 is the Kalman gain,


.
Kt+1 = ( FΣ t F ⊤ + Σ x ) H ⊤ ( H ( FΣ t F ⊤ + Σ x ) H ⊤ + Σ y )−1 ∈ Rd×m .
(3.18d)

Note that Σ t and Kt can be computed offline as they are independent


of the observation yt+1 . Fµt represents the expected state at time t + 1,
and hence, H Fµt corresponds to the expected observation. Therefore,
the term yt+1 − H Fµt measures the error in the predicted observation
and the Kalman gain Kt+1 appears as a measure of relevance of the
new observation compared to the prediction.

Example 3.5: Bayesian linear regression as a Kalman filter


Even though they arise from a rather different setting, it turns
out that Kalman filters are a generalization of Bayesian linear re-
gression! To see this, recall the online Bayesian linear regression
algorithm from Section 2.1.3. Observe that by keeping attempting
to estimate the (hidden) weights w⋆ from sequential noisy obser-
vations yt , this algorithm performs Bayesian filtering! Moreover,
we have used a Gaussian prior and likelihood. This is precisely
the setting of a Kalman filter!

Concretely, we are estimating the constant (i.e., F = I, ε = 0)


hidden state xt = w(t) with prior w(0) ∼ N (0, σp2 I ).

Our sensor model is time-dependent, since in each iteration we


observe a different input xt . Furthermore, we only observe a
scalar-valued label yt .4 Formally, our sensor model is character- 4
That is, m = 1 in our general Kalman
ized by ht = x⊤ 2
t and noise ηt = ε t with ε t ∼ N (0, σn ).
filter formulation from above.

You will show in ? that the Kalman update (3.18) is the online Problem 3.2
equivalent to computing the posterior of the weights in Bayesian
linear regression.

3.2.2 Predicting
Using now that the marginal posterior of Xt is a Gaussian due to the
closedness properties of Gaussians, we have

Xt+1 | y1:t ∼ N (µ̂t+1 , Σ̂ t+1 ), (3.19)


filtering 57

and it suffices to compute the prediction mean µ̂t+1 and covariance


matrix Σ̂ t+1 .

For the mean,

µ̂t+1 = E[ xt+1 | y1:t ]


= E[ Fxt + εt | y1:t ] using the motion model (3.2)

= FE[ xt | y1:t ] using linearity of expectation (1.20) and


E[ ε t ] = 0
= Fµt . (3.20) using the mean of the Kalman update

For the covariance matrix,


h i
Σ̂ t+1 = E ( xt+1 − µ̂t+1 )( xt+1 − µ̂t+1 )⊤ y1:t using the definition of the covariance
h i h i matrix (1.36)
= FE ( xt − µt )( xt − µt )⊤ y1:t F ⊤ + E εt ε⊤t using (3.20), the motion model (3.2) and
that εt is independent of the
= FΣ t F ⊤ + Σ x . (3.21) observations

Optional Readings
Kalman filters and related models are often called temporal models.
For a broader look at such models, read chapter 15 of “Artificial
intelligence: a modern approach” (Russell and Norvig, 2002).

Discussion

In this chapter, we have introduced Kalman filters as a special case of


probabilistic filtering where probabilistic inference can be performed
in closed form. Similarly to Bayesian linear regression, probabilistic in-
ference is tractable due to assuming Gaussian priors and likelihoods.
Indeed, learning linear models and Kalman filters are very closely re-
lated as seen in Example 3.5, and we will further explore this relation-
ship in Problem 4.3. We will refer back to filtering in the second part
of this manuscript when we discuss sequential decision-making with
partial observability of the state space. Next, we return to the storyline
on “learning” using exact probabilistic inference.

Problems

3.1. Kalman update.

Derive the predictive distribution Xt+1 | y1:t+1 (3.13) of the Kalman fil-
ter described in the above example using your knowledge about mul-
tivariate Gaussians from Section 1.2.3.
Hint: First compute the predictive distribution Xt+1 | y1:t .
58 probabilistic artificial intelligence

3.2. Bayesian linear regression as a Kalman filter.

Recall the specific Kalman filter from Example 3.5. With this model
the Kalman update (3.18) simplifies to

Σ t −1 x t
kt = , (3.22a)
x⊤
t t−1 xt + σn
Σ 2

µ t = µ t −1 + k t ( y t − x ⊤
t µ t −1 ), (3.22b)
Σt = Σ t −1 − k t x ⊤
t Σ t −1 , (3.22c)

with µ0 = 0 and Σ 0 = σp2 I. Note that the Kalman gain kt is a vector


in Rd . We assume σn2 = σp2 = 1 for simplicity.

Prove by induction that the (µt , Σ t ) produced by the Kalman update


are equivalent to (µ, Σ ) from the posterior of Bayesian linear regres-
sion (2.10) given x1:t , y1:t . You may use that Σ − 1
t kt = xt .

Hint: In the inductive step, first prove the equivalence of Σ t and then expand
Σ− 1
t µt to prove the equivalence of µt .

3.3. Parameter estimation using Kalman filters.

Suppose that we want to estimate the value of an unknown constant π


using uncorrelated measurements

y t = π + ηt , ηt ∼ N (0, σy2 ).

1. How can this problem be formulated as a Kalman filter? Compute


closed form expressions for the Kalman gain and the variance of
the estimation error σt2 in terms of t, σy2 , and σ02 .
2. What is the Kalman filter when t → ∞?
3. Suppose that one has no prior assumptions on π, meaning that
µ0 = 0 and σ02 → ∞. Which well-known estimator does the Kalman
filter reduce to in this case?
4
Gaussian Processes

Let us remember our first attempt from Chapter 2 at scaling up Bayesian


linear regression to nonlinear functions. We saw that we can model
nonlinear functions by transforming the input space to a suitable higher-
dimensional feature space, but found that this approach scales poorly
if we require a large number of features. We then found something
remarkable: by simply changing our perspective from a weight-space
view to a function-space view, we could implement Bayesian linear
regression without ever needing to compute the features explicitly.
Under the function-space view, the key object describing the class of
functions we can model is not the features ϕ( x), but instead the kernel
function which only implicitly defines a feature space. Our key ob-
servation in this chapter is that we can therefore stop reasoning about
feature spaces, and instead directly work with kernel functions that
describe “reasonable” classes of functions.

We are still concerned with the problem of estimating the value of


a function f : X → R at arbitrary points x⋆ ∈ X given training
data { xi , yi }in=1 , where the labels are assumed to be corrupted by ho-
moscedastic Gaussian noise with variance σn2 ,

yi = f ( xi ) + ε i , ε i ∼ N (0, σn2 ).

As in Chapter 2 on Bayesian linear regression, we denote by X the


design matrix (collection of training inputs) and by y the vector of
training labels. We will represent the unknown function value at a
.
point x ∈ X by the random variable f x = f ( x). The collection of these
random variables is then called a Gaussian process if any finite subset
of them is jointly Gaussian:

Definition 4.1 (Gaussian process, GP). A Gaussian process is an infinite


set of random variables such that any finite number of them are jointly
Gaussian and such that they are consistent under marginalization.1 1
That is, if you take a joint distribution
for n variables and marginalize out one
of them, you should recover the joint dis-
tribution for the remaining n − 1 vari-
ables.
60 probabilistic artificial intelligence

The fact that with a Gaussian process, any finite subset of the random
variables is jointly Gaussian is the key property allowing us to perform

f (x)
exact probabilistic inference. Intuitively, a Gaussian process can be
interpreted as a normal distribution over functions — and is therefore

y
often called an “infinite-dimensional Gaussian”.

A Gaussian process is characterized by a mean function µ : X → R and


x p( f ( x ))
a covariance function (or kernel function) k : X × X → R such that for
.
any set of points A = { x1 , . . . , xm } ⊆ X , we have Figure 4.1: A Gaussian process can be
interpreted as an infinite-dimensional
.
f A = [ f x1 · · · f xm ]⊤ ∼ N (µ A , K AA ) (4.1) Gaussian over functions. At any loca-
tion x in the domain, this yields a dis-
where tribution over values f ( x ) shown in red.
    The blue line corresponds to the MAP
µ ( x1 ) k ( x1 , x1 ) · · · k ( x1 , x m ) estimate (i.e., mean function of the Gaus-
.  ..  .  .. .. sian process), the dark gray region corre-
µA =  K AA = .. 
 . , . . (4.2)
. . sponds to the epistemic uncertainty and
 

the light gray region denotes the addi-
µ( xm ) k ( x m , x1 ) · · · k ( xm , xm ) tional aleatoric uncertainty.
We write f ∼ GP (µ, k). In particular, given a mean function, covari-
ance function, and using the homoscedastic noise assumption,

y⋆ | x⋆ ∼ N (µ( x⋆ ), k( x⋆ , x⋆ ) + σn2 ). (4.3)

Commonly, for notational simplicity, the mean function is taken to


be zero. Note that for a fixed mean this is not a restriction, as we
can simply apply the zero-mean Gaussian process to the difference
between the mean and the observations.2 2
For alternative ways of representing a
mean function, refer to section 2.7 of
“Gaussian processes for machine learn-
4.1 Learning and Inference ing” (Williams and Rasmussen, 2006).

First, let us look at learning and inference in the context of Gaussian


processes. With slight abuse of our previous notation, let us denote the
.
set of observed points by A = { x1 , . . . , xn }. Given a prior f ∼ GP (µ, k)
and the noisy observations yi = f ( xi ) + ε i with ε i ∼ N (0, σn2 ), we can
then write the joint distribution of the observations y1:n and the noise-
free prediction f ⋆ at a test point x⋆ as
" #
y
| x⋆ , X ∼ N (µ̃, K̃ ), where (4.4)
f⋆
 
" # " # k( x, x1 )
K AA + σn I2 k x⋆ ,A
. µA . .  .. 
µ̃ = ⋆ , K̃ = ⊤ ⋆ ⋆ , k x,A =  .
.
µ( x ) k x⋆ ,A k( x , x )  
k( x, xn )
(4.5)
Deriving the conditional distribution using (1.53), we obtain that the
Gaussian process posterior is given by

f | x1:n , y1:n ∼ GP (µ′ , k′ ), where (4.6)


gaussian processes 61

.
µ′ ( x) = µ( x) + k⊤ 2 −1
x,A ( K AA + σn I ) ( y A − µ A ), (4.7)
′ ′ . ′
k ( x, x ) = k( x, x ) − k⊤
x,A ( K AA + σn2 I )−1 k x′ ,A . (4.8)
Observe that analogously to Bayesian linear regression, the posterior
covariance can only decrease when conditioning on additional data,
and is independent of the observations yi .

We already studied inference in the function-space view of Bayesian


linear regression, but did not make the predictive posterior explicit.
Using Equation (4.6), the predictive posterior at x⋆ is simply
f ⋆ | x⋆ , x1:n , y1:n ∼ N (µ′ ( x⋆ ), k′ ( x⋆ , x⋆ )). (4.9)

4.2 Sampling

Often, we are not interested in the full predictive posterior distribution,


but merely want to obtain samples of our Gaussian process model. We
will briefly examine two approaches.
1. For the first approach, consider a discretized subset of points
.
f = [ f1 , . . . , fn ]
that we want to sample.3 Note that f ∼ N (µ, K ). We have already 3
For example, if we want to render the
seen in Equation (1.54) that function, the length of this vector could
be guided by the screen resolution.
f = K /2 ε + µ
1
(4.10)
where K 1/2 is the square root of K and ε ∼ N (0, I ) is standard
Gaussian noise.4 However, computing the square root of K takes 4
We discuss square roots of matrices in
O n3 time. Appendix A.2.


2. For the second approach, recall the product rule (1.11),


n
p( f 1 , . . . , f n ) = ∏ p( f i | f 1:i−1 ).
i =1
That is the joint distribution factorizes neatly into a product where
each factor only depends on the “outcomes” of preceding factors.
We can therefore obtain samples one-by-one, each time condition-
ing on one more observation:
f 1 ∼ p( f 1 )
f 2 ∼ p( f 2 | f 1 )
(4.11)
f 3 ∼ p( f 3 | f 1 , f 2 )
..
.
This general approach is known as forward sampling. Due to the ma-
trix inverse in the formula of the GP posterior (4.6), this approach
also takes O n3 time.


We will discuss more efficient approximate sampling methods in Sec-


tion 4.5.
62 probabilistic artificial intelligence

4.3 Kernel Functions

We have seen that kernel functions are the key object describing the
class of functions a Gaussian process can model. Depending on the
kernel function, the “shape” of functions that are realized from a Gaus-
sian process varies greatly. Let us recap briefly from Section 2.4 what
a kernel function is:

Definition 4.2 (Kernel function). A kernel function k : X × X → R


satisfies
• k( x, x′ ) = k( x′ , x) for any x, x′ ∈ X (symmetry), and
• K AA is positive semi-definite for any A ⊆ X .
The two defining conditions ensure that for any A ⊆ X , K AA is a valid
covariance matrix. We say that a kernel function is positive definite if
K AA is positive definite for any A ⊆ X .

Intuitively, the kernel function evaluated at locations x and x′ describes


how f ( x) and f ( x′ ) are related, which we can express formally as

k( x, x′ ) = Cov f ( x), f ( x′ ) .
 
(4.12)

If x and x′ are “close”, then f ( x) and f ( x′ ) are usually taken to be


positively correlated, encoding a “smooth” function.
f (x)
In the following, we will discuss some of the most common kernel
functions, how they can be combined to create “new” kernels, and
how we can characterize the class of functions they can model.

4.3.1 Common Kernels


First, we look into some of the most commonly used kernels. Often an
additional factor σ2 (output scale) is added, which we assume here to x
be 1 for simplicity.
Figure 4.2: Functions sampled according
1. The linear kernel is defined as to a Gaussian process with a linear ker-
nel and ϕ = id.
.
k( x, x′ ; ϕ) = ϕ( x)⊤ ϕ( x′ ) (4.13)
f (x)
where ϕ is a nonlinear transformation as introduced in Section 2.3
or the identity.

Remark 4.3: GPs with linear kernel and BLR


A Gaussian process with a linear kernel is equivalent to Bayesian x
linear regression. This follows directly from the function-space Figure 4.3: Functions sampled accord-
view of Bayesian linear regression (see Section 2.4) and com- ing to a Gaussian process with a linear
kernel and ϕ( x ) = [1, x, x2 ] (left) and
paring the derived kernel function (2.20) with the definition of
ϕ( x ) = sin( x ) (right).
the linear kernel (4.13).
gaussian processes 63

2. The Gaussian kernel (also known as squared exponential kernel or ra-


dial basis function (RBF) kernel) is defined as

2
!
. ∥ x − x ′ ∥2
k( x, x′ ; h) = exp − (4.14)
2h2

where h is its length scale. The larger the length scale h, the smoother 5
As the length scale is increased, the ex-
the resulting functions.5 Furthermore, it turns out that the feature ponent of the exponential increases, re-
sulting in a higher dependency between
space (think back to Section 2.4!) corresponding to the Gaussian locations.
kernel is “infinitely dimensional”, as you will show in ? . So the Problem 4.1
Gaussian kernel already encodes a function class that we were not
able to model under the weight-space view of Bayesian linear re-
gression.

f (x) Figure 4.4: Functions sampled according


to a Gaussian process with a Gaussian
kernel and length scales h = 5 (left) and
h = 1 (right).

k( x − x′ )
1.00

0.75

x 0.50

0.25
3. The Laplace kernel (also known as exponential kernel) is defined as
0.00

∥ x − x′ ∥ −2 0 2
 
.
k ( x, x′ ; h) = exp − 2
. (4.15) x − x′
h
Figure 4.5: Gaussian kernel with length
scales h = 1, h = 0.5, and h = 0.2.
As can be seen in Figure 4.7, samples from a GP with Laplace
kernel are non-smooth as opposed to the samples from a GP with
Gaussian kernel.

f (x) Figure 4.7: Functions sampled accord-


ing to a Gaussian process with a Laplace
kernel and length scales h = 10 000 (left)
and h = 10 (right).

k( x − x′ )
1.00

0.75

x 0.50

0.25

4. The Matérn kernel trades the smoothness of the Gaussian and the 0.00
Laplace kernels. As such, it is frequently used in practice to model −2 0 2
x − x′
Figure 4.6: Laplace kernel with length
scales h = 1, h = 0.5, and h = 0.2.
64 probabilistic artificial intelligence

“real world” functions that are relatively smooth. It is defined as


√ ′∥
!ν √ ′∥
!
21− ν 2ν ∥ x − x 2ν ∥ x − x
.
k( x, x′ ; ν, h) = 2
Kν 2
Γ(ν) h h
(4.16)

where Γ is the Gamma function, Kν the modified Bessel function


of the second kind, and h a length scale parameter. For ν = 1/2, the
Matérn kernel is equivalent to the Laplace kernel. For ν → ∞, the
Matérn kernel is equivalent to the Gaussian kernel. The resulting
functions are ⌈ν⌉ − 1 times mean square differentiable.6 In partic- 6
Refer to Remark A.12 for the defini-
ular, GPs with a Gaussian kernel are infinitely many times mean tions of mean square continuity and dif-
ferentiability.
square differentiable whereas GPs with a Laplace kernel are mean
square continuous but not mean square differentiable.

4.3.2 Composing Kernels


Given two kernels k1 : X × X → R and k2 : X × X → R, they can
be composed to obtain a new kernel k : X × X → R in the following
ways:
.
• k( x, x′ ) = k1 ( x, x′ ) + k2 ( x, x′ ),
.
• k( x, x′ ) = k1 ( x, x′ ) · k2 ( x, x′ ),
.
• k( x, x′ ) = c · k1 ( x, x′ ) for any c > 0,
.
• k( x, x′ ) = f (k1 ( x, x′ )) for any polynomial f with positive coeffi-
cients or f = exp.
.
For example, the additive structure of a function f ( x) = f 1 ( x) + f 2 ( x)
can be easily encoded in GP models. Suppose that f 1 ∼ GP (µ1 , k1 )
and f 2 ∼ GP (µ2 , k2 ), then the distribution of the sum of those two
.
functions f = f 1 + f 2 ∼ GP (µ1 + µ2 , k1 + k2 ) is another GP.7 7
We use f = f 1 + f 2 to denote the func-
tion f (·) = f 1 (·) + f 2 (·).
Whereas the addition of two kernels k1 and k2 can be thought of as
an OR operation (i.e., the kernel has high value if either k1 or k2 have
high value), the multiplication of k1 and k2 can be thought of as an
AND operation (i.e., the kernel has high value if both k1 and k2 have
high value). For example, the product of two linear kernels results in
functions which are quadratic.

As mentioned previously, the constant c of a scaled kernel function


.
k′ ( x, x′ ) = c · k( x, x′ ) is generally called the output scale of a kernel,
and it scales the variance Var[ f ( x)] = c · k( x, x) of the predictions f ( x)
from GP (µ, k′ ).

Optional Readings
For a broader introduction to how kernels can be used and com-
bined to model certain classes of functions, read
gaussian processes 65

• chapter 2 of “Automatic model construction with Gaussian pro-


cesses” (Duvenaud, 2014) also known as the “kernel cookbook”,
• chapter 4 of “Gaussian processes for machine learning” (Williams
and Rasmussen, 2006).

4.3.3 Stationarity and Isotropy


Kernel functions are commonly classified according to two properties:

Definition 4.4 (Stationarity and isotropy). A kernel k : Rd × Rd → R


is called
• stationary (or shift-invariant) if there exists a function k̃ such that
k̃( x − x′ ) = k( x, x′ ), and
• isotropic if there exists a function k̃ such that k̃(∥ x − x′ ∥) = k( x, x′ )
with ∥·∥ any norm.

Note that stationarity is a necessary condition for isotropy. In other


words, isotropy implies stationarity.

Example 4.5: Stationarity and isotropy of kernels

stationary isotropic
linear kernel no no
Gaussian kernel yes yes
. 2
k( x, x′ ) = exp(− ∥ x − x′ ∥ M )
yes no ∥·∥ M denotes the Mahalanobis norm
where M is positive semi-definite induced by matrix M

For x′ = x, stationarity implies that the kernel must only depend


on 0. In other words, a stationary kernel must depend on relative
locations only. This is clearly not the case for the linear kernel,
which depends on the absolute locations of x and x′ . Therefore,
the linear kernel cannot be isotropic either.

For the Gaussian kernel, isotropy follows immediately from its


definition.

The last kernel is clearly stationary by definition, but not isotropic


for general matrices M. Note that for M = I it is indeed isotropic.

Stationarity encodes the idea that relative location matters more than
absolute location: the process “looks the same” no matter where we
shift it in the input space. This is often appropriate when we believe
the same statistical behavior holds across the entire domain (e.g., no
region is special). Isotropy goes one step further by requiring that
66 probabilistic artificial intelligence

the kernel depends only on the distance between points, so that all
directions in the space are treated equally. In other words, there is no
preferred orientation or axis. This is especially useful in settings where
we expect uniform behavior in every direction (as with the Gaussian
kernel). Such kernels are simpler to specify and interpret since we
only need a single “scale” (like a length scale) rather than multiple
parameters or directions.

4.3.4 Reproducing Kernel Hilbert Spaces


We can characterize the precise class of functions that can be modeled
by a Gaussian process with a given kernel function. This correspond-
ing function space is called a reproducing kernel Hilbert space (RKHS),
and we will discuss it briefly in this section.

Recall that Gaussian processes keep track of a posterior distribution


f | x1:n , y1:n over functions. We will in fact show later that the corre-
sponding MAP estimate fˆ corresponds to the solution to a regularized
optimization problem in the RKHS space of functions. This duality
is similar to the duality between the MAP estimate of Bayesian linear
regression and ridge regression we observed in Chapter 2. So what is
the reproducing kernel Hilbert space of a kernel function k?

Definition 4.6 (Reproducing kernel Hilbert space, RKHS). Given a ker-


nel k : X × X → R, its corresponding reproducing kernel Hilbert space is
the space of functions f defined as
( )
n
.
Hk (X ) = f (·) = ∑ αi k(xi , ·) : n ∈ N, xi ∈ X , αi ∈ R . (4.17)
i =1

The inner product of the RKHS is defined as

n n′
.
⟨ f , g⟩k = ∑ ∑ αi α′j k(xi , x′j ), (4.18)
i =1 j =1


where g(·) = ∑nj=1 α′j k( x′j , ·), and induces the norm ∥ f ∥k = ⟨ f , f ⟩k .
p

You can think of the norm as measuring the “smoothness” or “com-


plexity” of f . ? Problem 4.4 (2)

It is straightforward to check that for all x ∈ X , k( x, ·) ∈ Hk (X ).


Moreover, the RKHS inner product ⟨·, ·⟩k satisfies for all x ∈ X and
f ∈ Hk (X ) that f ( x) = ⟨ f (·), k( x, ·)⟩k which is also known as the
reproducing property ? . That is, evaluations of RKHS functions f are Problem 4.4 (1)
inner products in Hk (X ) parameterized by the “feature map” k( x, ·).

The representer theorem (Schölkopf et al., 2001) characterizes the solu-


tion to regularized optimization problems in RKHSs:
gaussian processes 67

Theorem 4.7 (Representer theorem). ? Let k be a kernel and let λ > 0. Problem 4.5
For f ∈ Hk (X ) and training data {( xi , f ( xi ))}in=1 , let L( f ( x1 ), . . . , f ( xn )) ∈
R ∪ {∞} denote any loss function which depends on f only through its eval-
uation at the training points. Then, any minimizer

fˆ ∈ arg min L( f ( x1 ), . . . , f ( xn )) + λ ∥ f ∥2k (4.19)


f ∈Hk (X )

admits a representation of the form

n
fˆ( x) = α̂⊤ k x,{ xi }n
i =1
= ∑ α̂i k(x, xi ) for some α̂ ∈ Rn . (4.20)
i =1

This statement is remarkable: the solutions to general regularized op-


timization problems over the generally infinite-dimensional space of
functions Hk (X ) can be represented as a linear combination of the
kernel functions evaluated at the training points. The representer the-
orem can be used to show that the MAP estimate of a Gaussian process
corresponds to the solution of a regularized linear regression problem
in the RKHS of the kernel function, namely, ? Problem 4.6

. 1
fˆ = arg min − log p(y1:n | x1:n , f ) + ∥ f ∥2k . (4.21)
f ∈H (X ) 2
k

Here, the first term corresponds to the likelihood, measuring the “qual-
ity of fit”. The regularization term limits the “complexity” of fˆ. Reg-
ularization is necessary to prevent overfitting since in an expressive
RKHSs, there may be many functions that interpolate the training data
perfectly. This shows the close link between Gaussian process regres-
sion and Bayesian linear regression, with the kernel function k gener-
alizing the inner product of feature maps to feature spaces of possi-
bly “infinite dimensionality”. Because solutions can be represented as
linear combinations of kernel evaluations at the training points, Gaus-
sian processes remain computationally tractable even though they can
model functions over “infinite-dimensional” feature spaces.

4.4 Model Selection

We have not yet discussed how to pick the hyperparameters θ (e.g.,


parameters of kernels). A common technique in supervised learning
is to select hyperparameters θ, such that the resulting function esti-
mate fˆθ leads to the most accurate predictions on hold-out validation
data. After reviewing this approach, we contrast it with a probabilistic
approach to model selection, which avoids using point estimates of fˆθ
and rather utilizes the full posterior.
68 probabilistic artificial intelligence

4.4.1 Optimizing Validation Set Performance


A common approach to model selection is to split our data D into
.
separate training set D train = {( xtrain
i , ytrain
i )}in=1 and validation sets
val . val val m
D = {( xi , yi )}i=1 . We then optimize the model for a parameter
candidate θj using the training set. This is usually done by picking a
point estimate (like the MAP estimate),
.
fˆj = arg max p( f | xtrain train
1:n , y1:n ). (4.22)
f

Then, we score θj according to the performance of fˆj on the validation


set,
. val ˆ
θ̂ = arg max p(yval
1:m | x1:m , f j ). (4.23)
θj

This ensures that fˆj does not depend on D val .

Remark 4.8: Approximating population risk


Why is it useful to separate the data into a training and a vali-
dation set? Recall from Appendix A.3.5 that minimizing the em-
pirical risk without separating training and validation data may
lead to overfitting as both the loss and fˆj depend on the same
data D . In contrast, using independent training and validation
sets, fˆj does not depend on D val , and we have that
m
1 h i
m ∑ ℓ(yval val ˆ
i | xi , f j ) ≈ E( x,y)∼P ℓ(y | x, fˆj ) , (4.24)
i =1

iid
using Monte Carlo sampling.8 In words, for reasonably large m, 8
We generally assume D ∼ P , in par-
minimizing the empirical risk as we do in Equation (4.23) approx- ticular, we assume that the individual
samples of the data are i.i.d.. Recall
imates minimizing the population risk. that in this setting, Hoeffding’s inequal-
ity (A.41) can be used to gauge how
large m should be.
While this approach often is quite effective at preventing overfitting
as compared to using the same data for training and picking θ̂, it still
collapses the uncertainty in f into a point estimate. Can we do better?

4.4.2 Maximizing the Marginal Likelihood


We have already seen for Bayesian linear regression, that picking a
point estimate loses a lot of information. Instead of optimizing the
effects of θ for a specific point estimate fˆ of the model f , maximizing
the marginal likelihood optimizes the effects of θ across all realizations
of f . In this approach, we obtain our hyperparameter estimate via
.
θ̂MLE = arg max p(y1:n | x1:n , θ) (4.25) using the definition of marginal
θ likelihood in Bayes’ rule (1.45)
gaussian processes 69

Z
= arg max p(y1:n , f | x1:n , θ) d f by conditioning on f using the sum rule
θ (1.7)
Z
= arg max p(y1:n | x1:n , f , θ) p( f | θ) d f . (4.26) using the product rule (1.11)
θ
Remarkably, this approach typically avoids overfitting even though we
do not use a separate training and validation set. The following ta-
ble provides an intuitive argument for why maximizing the marginal
likelihood is a good strategy.

Table 4.1: The table gives an intuitive ex-


likelihood prior
planation of effects of parameter choices
θ on the marginal likelihood. Note that
“underfit” model
small for “almost all” f large words in quotation marks refer to in-
(too simple θ) tuitive quantities, as we have infinitely
many realizations of f .
“overfit” model large for “few” f
small
(too complex θ) small for “most” f

marginal likelihood
simple
“just right” moderate for “many” f moderate
intermediate
For an “underfit” model, the likelihood is mostly small as the data complex
cannot be well described, while the prior is large as there are “fewer”
functions to choose from. For an “overfit” model, the likelihood is
large for “some” functions (which would be picked if we were only all possible data sets
minimizing the training error and not doing cross validation) but small Figure 4.8: A schematic illustration of
for “most” functions. The prior is small, as the probability mass has the marginal likelihood of a simple, in-
termediate, and complex model across
to be distributed among “more” functions. Thus, in both cases, one all possible data sets.
term in the product will be small. Hence, maximizing the marginal
likelihood naturally encourages trading between a large likelihood and
a large prior.

In the context of Gaussian process regression, recall from Equation (4.3)


that
y1:n | x1:n , θ ∼ N (0, K f ,θ + σn2 I ) (4.27)
where K f ,θ denotes the kernel matrix at the inputs x1:n depending on
.
the kernel function parameterized by θ. We write Ky,θ = K f ,θ + σn2 I.
Continuing from Equation (4.25), we obtain
θ̂MLE = arg max N (y; 0, Ky,θ )
θ
1 1  n
= arg min y⊤ Ky,θ
−1
y+ log det Ky,θ + log 2π (4.28) taking the negative logarithm
θ 2 2 2
1 1
= arg min y⊤ Ky,θ
−1

y+ log det Ky,θ (4.29) the last term is independent of θ
θ 2 2
The first term of the optimization objective describes the “goodness of
fit” (i.e., the “alignment” of y with Ky,θ). The second term character-
izes the “volume” of the model class. Thus, this optimization naturally
trades the aforementioned objectives.
70 probabilistic artificial intelligence

Marginal likelihood maximization is an empirical Bayes method. Often


it is simply referred to as empirical Bayes. It also has the nice property
that the gradient of its objective (the MLL loss) can be expressed in
closed-from ? , Problem 4.7
!
∂ 1 −1 ∂Ky,θ
log p(y1:n | x1:n , θ) = tr (αα⊤ − Ky,θ ) (4.30) 1.0

MLL loss
∂θ j 2 ∂θ j
0.5
. −1
where α = and tr( M ) is the trace of a matrix M. This optimiza-
Ky,θ y
0.0
tion problem is, in general, non-convex. Figure 4.10 gives an example
of two local optima according to empirical Bayes. 0 100 200
# of iterations
Taking a step back, observe that taking a probabilistic perspective on
model selection naturally led us to consider all realizations of our Figure 4.9: An example of model selec-
tion by maximizing the log likelihood
model f instead of using point estimates. However, we are still us- (without hyperpriors) using a linear,
ing point estimates for our model parameters θ. Continuing on our quadratic, Laplace, Matérn (ν = 3/2),
and Gaussian kernel, respectively. They
probabilistic adventure, we could place a prior p(θ) on them too.9 We
are used to learn the function
could use it to obtain the MAP estimate (still a point estimate!) which sin( x )
adds an additional regularization term x 7→ + ε, ε ∼ N (0, 0.01)
x
. using SGD with learning rate 0.1.
θ̂MAP = arg max p(θ | x1:n , y1:n ) (4.31) 9
Such a prior is called hyperprior.
θ
= arg min − log p(θ) − log p(y1:n | x1:n , θ). (4.32) using Bayes’ rule (1.45) and then taking
θ the negative logarithm

An alternative approach is to consider the full posterior distribution


over parameters θ. The resulting predictive distribution is, however,
intractable,
Z Z
⋆ ⋆
p(y | x , x1:n , y1:n ) = p(y⋆ | x⋆ , f ) · p( f | x1:n , y1:n , θ) · p(θ) d f dθ.
(4.33)

Recall that as the mode of Gaussians coincides with their mean, the
MAP estimate corresponds to the mean of the predictive posterior.

As a final note, observe that in principle, there is nothing stopping us


from descending deeper in the probabilistic hierarchy. The prior on the
model parameters θ is likely to have parameters too. Ultimately, we
need to break out of this hierarchy of dependencies and choose a prior.

4.5 Approximations

To learn a Gaussian process, we need to invert n × n matrices, hence


the computational cost is O n3 . Compare this to Bayesian linear re-


gression which allows us to learn a regression model in O nd2 time




(even online) where d is the feature dimension. It is therefore natural


to look for ways of approximating a Gaussian process.
gaussian processes 71

101
Figure 4.10: The top plot shows contour
lines of an empirical Bayes with two lo-
cal optima. The bottom two plots show
the Gaussian processes corresponding
to the two optimal models. The left
model with smaller lengthscale is chosen
noise standard deviation σn

within a more flexible class of models,


while the right model explains more ob-
servations through noise. Adapted from
100 figure 5.5 of “Gaussian processes for
machine learning” (Williams and Ras-
mussen, 2006).

10−1

100 101
lengthscale h

2 2

1 1
f (x)

0 0

−1 −1

−2 −2

−5.0 −2.5 0.0 2.5 5.0 −5.0 −2.5 0.0 2.5 5.0
x x
72 probabilistic artificial intelligence

4.5.1 Local Methods

Recall that during forward sampling, we had to condition on a larger


and larger number of previous samples. When sampling at a loca-
tion x, a very simple approximation is to only condition on those sam-
ples x′ that are “close” (where |k( x, x′ )| ≥ τ for some τ > 0). Essen-
tially, this method “cuts off the tails” of the kernel function k. However,
τ has to be chosen carefully as if τ is chosen too large, samples become
essentially independent.

This is one example of a sparse approximation of a Gaussian process. We


will discuss more advanced sparse approximations known as “induc-
ing point methods” in Section 4.5.3.

4.5.2 Kernel Function Approximation

Another method is to approximate the kernel function directly. The


idea is to construct a “low-dimensional” feature map ϕ : Rd → Rm
that approximates the kernel,

k( x, x′ ) ≈ ϕ( x)⊤ ϕ( x′ ). (4.34)

Then, we can apply Bayesian linear regression, resulting in a time com-


plexity of O nm2 + m3 .


One example of this approach are random Fourier features, which we


will discuss in the following.

Remark 4.9: Fourier transform


Im
First, let us remind ourselves of Fourier transformations. The
Fourier transform is a method of decomposing frequencies into i eiφ
their individual components.

Recall Euler’s formula which states that for any x ∈ R,


sin φ
φ
ix
e = cos x + i sin x (4.35) 0
cos φ

where i is the imaginary unit of complex numbers. The formula is


illustrated in Figure 4.11. Note that e−i2πx corresponds to rotating 0 1
clockwise around the unit circle in R2 — completing a rotation Re
whenever x ∈ R reaches the next natural number. Figure 4.11: Illustration of Euler’s for-
mula. It can be seen that eiφ corresponds
We can scale x by a frequency ξ: e−i2πξx .
If x ∈ Rd ,
we can also to a (counter-clockwise) rotation on the
scale each component j of x by a different frequency ξ ( j). Multi- unit circle as φ varies from 0 to 2π.
plying a function f : Rd → R with the rotation around the unit
circle with given frequencies ξ, yields a quantity that describes the
gaussian processes 73

f (x)
amplitude of the frequencies ξ,
1
Z
. ⊤x
fˆ(ξ ) = f ( x)e−i2πξ dx. (4.36)
Rd

fˆ is called the Fourier transform of f . f is called the inverse Fourier


transform of fˆ, and can be computed using
0
Z

f ( x) = fˆ(ξ )ei2πξ x dξ. (4.37) −1 1
Rd x
. fˆ(ω )
It is common to write ω = 2πξ. See Figure 4.12 for an example.
2
Refer to “But what is the Fourier Transform? A visual introduc-
tion” (Sanderson, 2018) for a visual introduction.

Because a stationary kernel k : Rd × Rd → R can be interpreted as a 0


function in one variable, it has an associated Fourier transform which
we denote by p(ω). That is, −π π
Z ω
⊤ ( x− x′ )
k ( x − x′ ) = p(ω)eiω dω. (4.38)
Rd Figure 4.12: The Fourier transform of a
rectangular pulse,

Fact 4.10 (Bochner’s theorem). A continuous stationary kernel on Rd is


(
. 1 x ∈ [−1, 1]
f (x) =
positive definite if and only if its Fourier transform p(ω) is non-negative. 0 otherwise,

is given by
Bochner’s theorem implies that when a continuous and stationary
Z 1
1  iω 
kernel is positive definite and scaled appropriately, its Fourier trans- fˆ(ω ) = e−iωx dx = e − e−iω
−1 iω
form p(ω) is a proper probability distribution. In this case, p(ω) is 2 sin(ω )
= .
called the spectral density of the kernel k. ω

Remark 4.11: Eigenvalue spectrum of stationary kernels


When a kernel k is stationary (i.e., a univariate function of x − x′ ),
its eigenfunctions (with respect to the usual Lebesgue measure)
turn out to be the complex exponentials exp(iω⊤ ( x − x′ )). In
simpler terms, you can think of these exponentials as “building
blocks” at different frequencies ω. The spectral density p(ω) as-
sociated with the kernel tells you how strongly each frequency
contributes, i.e., how large the corresponding eigenvalue is.

A key insight of this analysis is that the rate at which these magni-
tudes p(ω) decay with increasing frequency ω reveals the smooth-
ness of the processes governed by the kernel. If a kernel allocates
more “power” to high frequencies (meaning the spectral density
decays slowly), the resulting processes will appear “rougher”.
Conversely, if high-frequency components are suppressed, the pro-
cess will appear “smoother”.
74 probabilistic artificial intelligence

For an in-depth introduction to the eigenfunction analysis of ker-


nels, refer to section 4.3 of “Gaussian processes for machine learn-
ing” (Williams and Rasmussen, 2006).

Example 4.12: Spectral density of the Gaussian kernel


The Gaussian kernel with length scale h has the spectral density
Z
⊤ ( x− x′ )
p(ω) = k ( x − x′ ; h)e−iω d( x − x′ ) using the definition of the Fourier
Rd transform (4.36)
!
∥ x∥22
Z
= exp − − iω⊤ x dx using the definition of the Gaussian
Rd 2h2 kernel (4.14)
!
2
2 d/2 2 ∥ ω ∥2
= (2h π ) exp −h . (4.39)
2

The key idea is now to interpret the kernel as an expectation,


Z
⊤ ′
k ( x − x′ ) = p(ω)eiω ( x− x ) dω from Equation (4.38)
Rd
h ⊤ ′
i
= Eω∼ p eiω (x− x ) by the definition of expectation (1.19)
h i
= Eω∼ p cos(ω⊤ x − ω⊤ x′ ) + i sin(ω⊤ x − ω⊤ x′ ) . using Euler’s formula (4.35)

Observe that as both k and p are real, convergence of the integral im-
plies Eω∼ p sin(ω⊤ x − ω⊤ x′ ) = 0. Hence,
 

h i
= Eω∼ p cos(ω⊤ x − ω⊤ x′ )
h i
= Eω∼ p Eb∼Unif([0,2π ]) cos((ω⊤ x + b) − (ω⊤ x′ + b)) expanding with b − b
h
= Eω∼ p Eb∼Unif([0,2π ]) cos(ω⊤ x + b) cos(ω⊤ x′ + b) using the angle subtraction identity,
cos(α − β) = cos α cos β + sin α sin β
i
+ sin(ω⊤ x + b) sin(ω⊤ x′ + b)
h i
= Eω∼ p Eb∼Unif([0,2π ]) 2 cos(ω⊤ x + b) cos(ω⊤ x′ + b) using

= Eω∼ p,b∼Unif([0,2π ]) zω,b ( x) · zω,b ( x′ )


 
(4.40) Eb [cos(α + b) cos( β + b)]
= Eb [sin(α + b) sin( β + b)]
. √
where zω,b ( x) = 2 cos(ω⊤ x + b),
for b ∼ Unif([0, 2π ])
m
1

m ∑ zω(i) ,b(i) (x) · zω(i) ,b(i) (x′ ) (4.41) using Monte Carlo sampling to estimate
i =1 the expectation, see Example A.6
iid iid
for independent samples ω(i) ∼ p and b(i) ∼ Unif([0, 2π ]),

= z( x)⊤ z( x′ ) (4.42)

where the (randomized) feature map of random Fourier features is


. 1
z( x) = √ [zω(1) ,b(1) ( x), . . . , zω(m) ,b(m) ( x)]⊤ . (4.43)
m
gaussian processes 75

Intuitively, each component of the feature map z( x) projects x onto a f (x)


random direction ω drawn from the (inverse) Fourier transform p(ω) 4

of k( x − x′ ), and wraps this line onto the unit circle in R2 . After trans-
forming two points x and x′ in this way, their inner product is an unbi- 2

ased estimator of k( x − x′ ). The mapping zω,b ( x) = 2 cos(ω⊤ x + b)
0
additionally rotates the circle by a random amount b and projects the
points onto the interval [0, 1].
2 −5.0 −2.5 0.0 2.5 5.0
Rahimi et al. (2007) show that Bayesian linear regression with the fea-
1
ture map z approximates Gaussian processes with a stationary kernel:
0
Theorem 4.13 (Uniform convergence of Fourier features). Suppose M
is a compact subset of Rd with diameter diam(M). Then for a stationary −1
kernel k, the random Fourier features z, and any ϵ > 0 it holds that −5.0 −2.5 0.0 2.5 5.0
x
!
P sup z( x)⊤ z( x′ ) − k ( x − x′ ) ≥ ϵ (4.44) Figure 4.13: Example of random Fourier
x,x′ ∈M features with where the number of fea-
tures m is 5 (top) and 10 (bottom), re-
σp diam(M) 2 mϵ2
   
8 spectively. The noise-free true function
≤2 exp − is shown in black and the mean of the
ϵ 8( d + 2)
Gaussian process is shown in blue.
.
where σp2 = Eω∼ p ω⊤ ω is the second moment of p, m is the dimension
 

of z( x), and d is the dimension of x. ? Problem 4.8

Note that the error probability decays exponentially fast in the dimen- f (x)
sion of the Fourier feature space. 2

4.5.3 Data Sampling 1

Another natural approach is to only consider a (random) subset of the 0


training data during learning. The naive approach is to subsample
uniformly at random. Not very surprisingly, we can do much better. −5 0 5
x
One subsampling method is the inducing points method (Quinonero-
Candela and Rasmussen, 2005). The idea is to summarize the data Figure 4.14: Inducing points u are shown
as vertical dotted red lines. The noise-
around so-called inducing points.10 For now, let us consider an arbi- free true function is shown in black and
trary set of inducing points, the mean of the Gaussian process is
shown in blue. Observe that the true
. function is approximated “well” around
U = { x1 , . . . , x k }.
the inducing points.
The inducing points can be treated as
10

Then, the original Gaussian process can be recovered using marginal- hyperparameters.
ization,
Z Z
p( f ⋆ , f ) = p( f ⋆ , f , u) du = p( f ⋆ , f | u) p(u) du, (4.45) using the sum rule (1.7) and product
Rk Rk rule (1.11)
. .
where f = [ f ( x1 ) · · · f ( xn )]⊤ and f ⋆ = f ( x⋆ ) at some evaluation
.
point x⋆ ∈ X . We use u = [ f ( x1 ) · · · f ( xk )]⊤ ∈ Rk to denote the pre-
dictions of the model at the inducing points U. Due to the marginaliza-
tion property of Gaussian processes (4.1), we have that u ∼ N (0, KUU ).
76 probabilistic artificial intelligence

The key idea is to approximate the joint prior, assuming that f ⋆ and f
are conditionally independent given u,
Z
p( f ⋆ , f ) ≈ p( f ⋆ | u) p( f | u) p(u) du. (4.46)
Rk

Here, p( f | u) and p( f ⋆ | u) are commonly called the training condi-


tional and the testing conditional, respectively. Still denoting the obser-
.
vations by A = { x1 , . . . , xn } and defining ⋆ = { x⋆ }, we know, using
the closed-form expression for conditional Gaussians (1.53),

−1
p( f | u) ∼ N ( f ; K AU KUU u, K AA − Q AA ), (4.47a)
⋆ ⋆ −1
p( f | u) ∼ N ( f ; K⋆U KUU u, K⋆⋆ − Q⋆⋆ ) (4.47b)

. −1
where Q ab = KaU KUU KUb . Intuitively, K AA represents the prior co-
variance and Q AA represents the covariance “explained” by the induc-
ing points.11 11
For more details, refer to section 2
of “A unifying view of sparse ap-
Computing the full covariance matrix is expensive. In the following, proximate Gaussian process regression”
(Quinonero-Candela and Rasmussen,
we mention two approximations to the covariance of the training con- 2005).
ditional (and testing conditional).
f (x)
Example 4.14: Subset of regressors 3

The subset of regressors (SoR) approximation is defined as 2

. 1
−1
qSoR ( f | u) = N ( f ; K AU KUU u, 0), (4.48a)
0
. −1
qSoR ( f ⋆ | u) = N ( f ⋆ ; K⋆U KUU u, 0). (4.48b) −1

−5.0 −2.5 0.0 2.5 5.0


Comparing to Equation (4.47), SoR simply forgets about all vari- 2
ance and covariance.
0

Example 4.15: Fully independent training conditional


−2
The fully independent training conditional (FITC) approximation is −5.0 −2.5 0.0 2.5 5.0
defined as x

. −1 Figure 4.15: Comparison of SoR (top)


qFITC ( f | u) = N ( f ; K AU KUU u, diag{K AA − Q AA }), (4.49a) and FITC (bottom). The inducing points
. −1
qFITC ( f ⋆ | u) = N ( f ⋆ ; K⋆U KUU u, diag{K⋆⋆ − Q⋆⋆ }). (4.49b) u are shown as vertical dotted red lines.
The noise-free true function is shown in
black and the mean of the Gaussian pro-
In contrast to SoR, FITC keeps track of the variances but forgets cess is shown in blue.
about the covariance.

The computational cost for inducing point methods SoR and FITC is
dominated by the cost of inverting KUU . Thus, the time complexity is
cubic in the number of inducing points, but only linear in the number
of data points.
gaussian processes 77

Discussion

This chapter introduced Gaussian processes which leverage the function-


space view on linear regression to perform exact probabilistic inference
with flexible, nonlinear models. A Gaussian process can be seen as a
non-parametric model since it can represent an infinite-dimensional
parameter space. Instead, as we saw with the representer theorem,
such non-parametric (i.e., “function-space”) models are directly rep-
resented as functions of the data points. While this can make these
models more flexible than a simple linear parametric model in input
space, it also makes them computationally expensive as the number of
data points grows. To this end, we discussed several ways of approxi-
mating Gaussian processes.

Nevertheless, for today’s internet-scale datasets, modern machine learn-


ing typically relies on large parametric models that learn features from
data. These models can effectively amortize the cost of inference dur-
ing training by encoding information into a fixed set of parameters. In
the following chapters, we will start to explore approaches to approx-
imate probabilistic inference that can be applied to such models.

Problems

4.1. Feature space of Gaussian kernel.


1. Show that the univariate Gaussian kernel with length scale h = 1
implicitly defines a feature space with basis vectors
 
ϕ0 ( x )

ϕ ( x )
 1 − x2 j
 1  with ϕj ( x ) = p j! e
ϕ( x ) =  2 x .
..
.

Hint: Use the Taylor series expansion of the exponential function, e x = ∑∞


j =0
xj
j! .
2. Note that the vector ϕ( x ) is ∞-dimensional. Thus, taking the
function-space view allows us to perform regression in an infinite-
dimensional feature space. What is the effective dimension when
regressing n univariate data points with a Gaussian kernel?

4.2. Kernels on the circle.

Consider a dataset {( xi , yi )}in=1 with labels yi ∈ R and inputs xi which


lie on the unit circle S ⊂ R2 . In particular, any element of S can
be identified with points in R2 of form (cos(θ ), sin(θ )) or with the
respective angles θ ∈ [0, 2π ).

You now want to use GP regression to learn an unknown mapping


from S to R using this dataset. Thus, you need a valid kernel k :
78 probabilistic artificial intelligence

S × S → R. First, we look at kernels k which can be understood as


analogous to the Gaussian kernel.
1. You think of the “extrinsic” kernel ke : S × S → R defined by
!
′ . ∥ x(θ ) − x(θ ′ )∥22
ke (θ, θ ) = exp − ,
2κ 2
.
where x(θ ) = (cos(θ ), sin(θ )). Is ke positive semi-definite for all
values of κ > 0?
2. Then, you think of an “intrinsic” kernel ki : S × S → R defined by
d(θ, θ ′ )2
 
.
ki (θ, θ ′ ) = exp −
2κ 2
.
where d(θ, θ ′ ) = min(|θ − θ ′ |, |θ − θ ′ − 2π |, |θ − θ ′ + 2π |) is the
standard arc length distance on the circle S.
You would now like to test whether this kernel is positive semi-
definite. We pick κ = 2 and compute the kernel matrix K for the
points corresponding to the angles {0, π/2, π, 3π/2}. This kernel
matrix K has eigenvectors (1, 1, 1, 1) and (−1, 1, −1, 1).
Now compute the eigenvalue corresponding to the eigenvector
(−1, 1, −1, 1).
3. Is ki positive semi-definite for κ = 2?
4. A mathematician friend of yours suggests to you yet another kernel
for points on the circle S, called the heat kernel. The kernel itself has
a complicated expression but can be accurately approximated by
!
L −1
′ . 1
2
k h (θ, θ ) = 1+ ∑ e − κ2 l 2 ′
2 cos(l (θ − θ )) ,
Cκ l =1

where L ∈ N controls the quality of approximation and Cκ > 0 is


a normalizing constant that depends only on κ.
Is k h is positive semi-definite for all values of κ > 0 and L ∈ N?
Hint: Recall that cos( a − b) = cos( a) cos(b) + sin( a) sin(b).

4.3. A Kalman filter as a Gaussian process.

Next we will show that the Kalman filter from Example 3.4 can be seen
as a Gaussian process. To this end, we define

f : N0 → R, t 7 → Xt . (4.50)
.
Assuming that X0 ∼ N (0, σ02 ) and Xt+1 = Xt + ε t with independent
noise ε t ∼ N (0, σx2 ), show that

f ∼ GP (0, kKF ) where (4.51)


′ .
kKF (t, t ) = σ02 + σx2 min{t, t′ }. (4.52)

.
This particular kernel k (t, t′ ) = min{t, t′ } but over the continuous-time
domain defines the Wiener process (also known as Brownian motion).
gaussian processes 79

4.4. Reproducing property and RKHS norm.


1. Derive the reproducing property.
Hint: Use k( x, x′ ) = ⟨k( x, ·), k( x′ , ·)⟩k .
2. Show that the RKHS norm ∥·∥k is a measure of smoothness by
proving that for any f ∈ Hk (X ) and x, y ∈ X it holds that

| f ( x) − f (y)| ≤ ∥ f ∥k ∥k( x, ·) − k(y, ·)∥k .

4.5. Representer theorem.

With this, we can now derive the representer theorem (4.20).

Hint: Recall
1. the reproducing property f ( x) = ⟨ f , k( x, ·)⟩k with k( x, ·) ∈ Hk (X )
which holds for all f ∈ Hk (X ) and x ∈ Hk (X ), and
2. that the norm after projection is smaller or equal the norm before projec-
tion.
Then decompose f into parallel and orthogonal components with respect to
span{k ( x1 , ·), . . . , k( xn , ·)}.

4.6. MAP estimate of Gaussian processes.

Let us denote by A = { x1 , . . . , xn } the set of training points. We will


now show that the MAP estimate of GP regression corresponds to
the solution of the regularized linear regression problem in the RKHS
stated in Equation (4.21):

. 1
fˆ = arg min − log p(y1:n | x1:n , f ) + ∥ f ∥2k .
f ∈H (X ) 2
k

In the following, we abbreviate K = K AA . We will also assume that


the GP has a zero mean function.
1. Show that Equation (4.21) is equivalent to

   α̂ := arg min_{α ∈ Rⁿ} ∥y − Kα∥₂² + λ ∥α∥K²  (4.53)

   for some λ > 0, which is also known as kernel ridge regression. Determine λ.
2. Show that Equation (4.53) with the λ determined in (1) is equiva-
lent to the MAP estimate of GP regression.
Hint: Recall from Equation (4.6) that the MAP estimate at a point x⋆ is E[f⋆ | x⋆, X, y] = k⊤_{x⋆,A}(K + σn²I)⁻¹ y.

4.7. Gradient of the marginal likelihood.

In this exercise, we derive Equation (4.30).

Recall that we were considering a dataset ( X, y) of noise-perturbed


evaluations yi = f ( xi ) + ε i where ε i ∼ N (0, σn2 ) and f is an unknown

function. We make the hypothesis f ∼ GP (0, k θ ) with a zero mean


function and the covariance function k θ. We are interested in finding
the hyperparameters θ that maximize the marginal likelihood p(y | X, θ).
1. Derive Equation (4.30).
Hint: You can use the following identities:
(a) for any invertible matrix M,

   ∂M⁻¹/∂θj = −M⁻¹ (∂M/∂θj) M⁻¹,  and  (4.54)

(b) for any symmetric positive definite matrix M,

   ∂ log det(M)/∂θj = tr(M⁻¹ ∂M/∂θj).  (4.55)
2. Assume now that the covariance function for the noisy targets (i.e.,
including the noise contribution) can be expressed as

k y,θ ( x, x′ ) = θ0 k̃( x, x′ )

where k̃ is a valid kernel independent of θ₀.¹²
Show that ∂ log p(y | X, θ)/∂θ₀ = 0 admits a closed-form solution for θ₀ which we denote by θ₀⋆.
¹² That is, Ky,θ(i, j) = ky,θ(xi, xj).
3. How should the optimal parameter θ0⋆ be scaled if we scale the
labels y by a scalar s?

4.8. Uniform convergence of Fourier features.

In this exercise, we will prove Theorem 4.13.


. .
Let s( x, x′ ) = z( x)⊤ z( x′ ) and f ( x, x′ ) = s( x, x′ ) − k( x, x′ ). Observe that
both functions are shift invariant, and we will therefore denote them
as univariate functions with argument ∆ ≡ x − x′ ∈ M∆ . Notice that
our goal is to bound the probability of the event sup_{∆∈M∆} |f(∆)| ≥ ϵ.
1. Show that for all ∆ ∈ M∆, P(|f(∆)| ≥ ϵ) ≤ 2 exp(−mϵ²/4).
What we have derived in (1) is known as a pointwise convergence guar-
antee. However, we are interested in bounding the uniform convergence
over the compact set M∆ .

Our approach will be to “cover” the compact set M∆ using T balls of radius r whose centers we denote by {∆i}_{i=1}^T. It can be shown that this is possible for some T ≤ (4 diam(M)/r)^d. It can furthermore be shown that

∀i. |f(∆i)| < ϵ/2  and  ∥∇f(∆⋆)∥₂ < ϵ/(2r)   =⇒   sup_{∆∈M∆} |f(∆)| < ϵ

where ∆⋆ := arg max_{∆∈M∆} ∥∇f(∆)∥₂.


2. Prove P(∥∇f(∆⋆)∥₂ ≥ ϵ/(2r)) ≤ (2rσp/ϵ)².
   Hint: Recall that the random Fourier feature approximation is unbiased, i.e., E[s(∆)] = k(∆).
3. Prove P(⋃_{i=1}^T |f(∆i)| ≥ ϵ/2) ≤ 2T exp(−mϵ²/16).
4. Combine the results from (2) and (3) to prove Theorem 4.13.
   Hint: You may use that
   (a) αr⁻ᵈ + βr² = 2β^{d/(d+2)} α^{2/(d+2)} for r = (α/β)^{1/(d+2)}, and
   (b) σp diam(M)/ϵ ≥ 1.
5. Show that for the Gaussian kernel (4.14), σp² = d/h².
   Hint: First show σp² = −tr(H∆ k(0)).

4.9. Subset of regressors.


1. Using an SoR approximation, prove the following:

   qSoR(f, f⋆) = N([f; f⋆]; 0, [[Q_AA, Q_A⋆], [Q_⋆A, Q_⋆⋆]])  (4.56)
   qSoR(f⋆ | y) = N(f⋆; Q_⋆A Q̃_AA⁻¹ y, Q_⋆⋆ − Q_⋆A Q̃_AA⁻¹ Q_A⋆)  (4.57)

   where Q̃_ab := Q_ab + σn²I.
2. Derive that the resulting model is a degenerate Gaussian process with covariance function

   kSoR(x, x′) := k⊤_{x,U} K_UU⁻¹ k_{x′,U}.  (4.58)
5
Variational Inference

We have seen how to perform (efficient) probabilistic inference with


Gaussians, exploiting their closed-form formulas for marginal and con-
ditional distributions. But what if we work with other distributions?

In this and the following chapter, we will discuss two methods of ap-
proximate inference. We begin by discussing variational (probabilistic)
inference, which aims to find a good approximation of the posterior
distribution from which it is easy to sample. In Chapter 6, we discuss
Markov chain Monte Carlo methods, which approximate the sampling
from the posterior distribution directly.

The fundamental idea behind variational inference is to approximate


the true posterior distribution using a “simpler” posterior that is as
close as possible to the true posterior:

p(θ | x1:n, y1:n) = (1/Z) p(θ, y1:n | x1:n) ≈ q(θ | λ) =: qλ(θ)  (5.1)
where λ represents the parameters of the variational posterior qλ , also
called variational parameters. In doing so, variational inference reduces
probabilistic inference — where the fundamental difficulty lies in solv-
ing high-dimensional integrals — to an optimization problem. Opti-
mizing (stochastic) objectives is a well-understood problem with effi-
cient algorithms that perform well in practice.1 1
We provide an overview of first-order
methods such as stochastic gradient de-
scent in Appendix A.4.
5.1 Laplace Approximation

Before introducing a general framework of variational inference, we


discuss a simpler method of approximate inference known as Laplace’s
method. This method was proposed as a method of approximating in-
tegrals as early as 1774 by Pierre-Simon Laplace. The idea is to use
a Gaussian approximation (that is, a second-order Taylor approxima-

tion) of the posterior distribution around its mode. Let

ψ(θ) := log p(θ | x1:n, y1:n)  (5.2)

denote the log-posterior. Then, using a second-order Taylor approximation (A.53) around the mode θ̂ of ψ (i.e., the MAP estimate), we obtain the approximation ψ̂ which is accurate for θ ≈ θ̂:

ψ(θ) ≈ ψ̂(θ) := ψ(θ̂) + (θ − θ̂)⊤∇ψ(θ̂) + ½(θ − θ̂)⊤Hψ(θ̂)(θ − θ̂)
            = ψ(θ̂) + ½(θ − θ̂)⊤Hψ(θ̂)(θ − θ̂).  (5.3)    using ∇ψ(θ̂) = 0

Compare this expression to the log-PDF of a Gaussian:

log N(θ; θ̂, Λ⁻¹) = −½(θ − θ̂)⊤Λ(θ − θ̂) + const.  (5.4)

Since ψ(θ̂) is constant with respect to θ,

ψ̂(θ) = log N(θ; θ̂, −Hψ(θ̂)⁻¹) + const.  (5.5)

The Laplace approximation q of p is

q(θ) := N(θ; θ̂, Λ⁻¹) ∝ exp(ψ̂(θ))  where  (5.6a)
Λ := −Hψ(θ̂) = −Hθ log p(θ | x1:n, y1:n)|θ=θ̂.  (5.6b)

Recall that for this approximation to be well-defined, the covariance matrix Λ⁻¹ (or equivalently the precision matrix Λ) needs to be symmetric and positive semi-definite. Let us verify that this is indeed the case for sufficiently smooth ψ.² In this case, the Hessian Λ is symmetric since the order of differentiation does not matter. Moreover, by the second-order optimality condition, Hψ(θ̂) is negative semi-definite since θ̂ is a maximum of ψ, which implies that Λ is positive semi-definite.
² ψ being twice continuously differentiable around θ̂ is sufficient.
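To make the recipe of Equation (5.6) concrete, here is a minimal sketch in Python/NumPy and SciPy. The log-posterior (a Beta-shaped toy example), the optimizer settings, and the finite-difference step are our own illustrative choices, not prescribed by the text:

```python
import numpy as np
from scipy.optimize import minimize

# Toy log-posterior (our choice): log p(theta | data) = 6 log(theta) + 2 log(1 - theta) + const,
# i.e. a Beta(7, 3)-shaped posterior on (0, 1).
def log_posterior(theta):
    return 6 * np.log(theta) + 2 * np.log(1 - theta)

# Step 1: find the mode (the MAP estimate) by minimizing the negative log-posterior.
res = minimize(lambda th: -log_posterior(th[0]), x0=np.array([0.5]),
               bounds=[(1e-6, 1 - 1e-6)])
theta_hat = res.x[0]

# Step 2: curvature at the mode via a central finite difference (Lambda = -Hessian).
eps = 1e-5
d2 = (log_posterior(theta_hat + eps) - 2 * log_posterior(theta_hat)
      + log_posterior(theta_hat - eps)) / eps ** 2
precision = -d2

# Laplace approximation: q(theta) = N(theta; theta_hat, 1 / precision).
print(theta_hat, 1.0 / precision)  # mode 0.75, variance roughly 0.023
```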

Example 5.1: Laplace approximation of a Gaussian


Consider approximating the Gaussian density p(θ) = N (θ; µ, Σ )
using a Laplace approximation.

We know that the mode of p is µ, which we can verify by comput-


ing the gradient,

∇θ log p(θ) = −½(2Σ⁻¹θ − 2Σ⁻¹µ), which is zero if and only if θ = µ.  (5.7)
For the Hessian of log p(θ), we get

Hθ log p(θ) = ( Dθ (Σ −1 µ − Σ −1 θ))⊤ = −(Σ −1 )⊤ = −Σ −1 . (5.8) using ( A−1 )⊤ = ( A⊤ )−1 and symmetry
of Σ

We see that the Laplace approximation of a Gaussian p(θ) is exact, which should not come as a surprise since the second-order Taylor approximation of log p(θ) is exact for Gaussians.

Figure 5.1: The Laplace approximation q greedily selects the mode of the true posterior distribution p and matches the curvature around the mode p̂. As shown here, the Laplace approximation can be extremely overconfident when p is not approximately Gaussian.

The Laplace approximation matches the shape of the true posterior around its mode but may not represent it accurately elsewhere — often leading to extremely overconfident predictions. An example is given in Figure 5.1. Nevertheless, the Laplace approximation has some desirable properties such as being relatively easy to apply in a post-hoc manner, that is, after having already computed the MAP estimate. It preserves the MAP point estimate as its mean and just “adds” a little uncertainty around it. However, the fact that it can be arbitrarily different from the true posterior makes it unsuitable for approximate probabilistic inference.

5.1.1 Example: Bayesian Logistic Regression

As an example, we will look at Laplace approximation in the context of Bayesian logistic regression. Logistic regression learns a classifier that decides for a given input whether it belongs to one of two classes. A sigmoid function, typically the logistic function,

σ(z) := 1 / (1 + exp(−z)) ∈ (0, 1),  z = w⊤x,  (5.9)

is used to obtain the class probabilities.

Figure 5.2: The logistic function squashes the linear function w⊤x onto the interval (0, 1).

Bayesian logistic regression corresponds to Bayesian linear regression with a Bernoulli likelihood,

y | x, w ∼ Bern(σ(w⊤x)),  (5.10)

where y ∈ {−1, 1} is the binary class label.³ Observe that given a data point (x, y), the probability of a correct classification is

p(y | x, w) = σ(w⊤x) if y = 1, and 1 − σ(w⊤x) if y = −1; that is, p(y | x, w) = σ(yw⊤x),  (5.11)

as the logistic function σ is symmetric around 0. Also, recall that Bayesian linear regression used the prior

p(w) = N(w; 0, σp²I) ∝ exp(−∥w∥₂² / (2σp²)).

Figure 5.3: Logistic regression classifies data into two classes with a linear decision boundary.
³ The same approach extends to Gaussian processes where it is known as Gaussian process classification, see Problem 5.2 and Hensman et al. (2015).
weights:

ŵ = arg max_w p(w | x1:n, y1:n)
  = arg max_w p(w) p(y1:n | x1:n, w)    using Bayes’ rule (1.45)
  = arg max_w log p(w) + log p(y1:n | x1:n, w)    taking the logarithm
  = arg max_w −(1/(2σp²)) ∥w∥₂² + ∑_{i=1}^n log σ(yi w⊤xi)    using independence of the observations and Equation (5.11)
  = arg min_w (1/(2σp²)) ∥w∥₂² + ∑_{i=1}^n log(1 + exp(−yi w⊤xi)).  (5.12)    using the definition of σ (5.9)

Note that for λ = 1/(2σp²), the above optimization is equivalent to standard (regularized) logistic regression where

ℓlog(w⊤x; y) := log(1 + exp(−yw⊤x))  (5.13)

is called logistic loss. The gradient of the logistic loss is given by ? Problem 5.1 (1)

∇w ℓlog (w⊤ x; y) = −yx · σ(−yw⊤ x). (5.14)

Recall that due to the symmetry of σ around 0, σ (−yw⊤ x) is the prob-


ability that x was not classified as y. Intuitively, if the model is “sur-
prised” by the label, the gradient is large.

We can therefore use SGD with the (regularized) gradient step and
with batch size 1,

w ← w(1 − 2ληt ) + ηt yxσ (−yw⊤ x), (5.15)

for the data point ( x, y) picked uniformly at random from the training
data. Here, 2ληt is due to the gradient of the regularization term, in
effect, performing weight decay.
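A minimal sketch of this update rule (5.15) in Python/NumPy follows. The toy data, the decaying learning-rate schedule, and all variable names are our own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_estimate_sgd(X, y, lam=0.1, lr=0.1, n_steps=10_000, seed=0):
    """SGD on the regularized logistic loss with batch size 1, cf. (5.15)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n_steps):
        i = rng.integers(n)                        # data point picked uniformly at random
        x_i, y_i = X[i], y[i]                      # labels are in {-1, +1}
        eta = lr / np.sqrt(t + 1)                  # decaying step size (one common choice)
        w = w * (1 - 2 * lam * eta) + eta * y_i * x_i * sigmoid(-y_i * (w @ x_i))
    return w

# Toy data: two Gaussian blobs with labels -1 / +1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
w_hat = map_estimate_sgd(X, y)
print(w_hat)
```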

Example 5.2: Laplace approx. of Bayesian logistic regression


We have already found the mode of the posterior distribution, ŵ.

Let us denote by
.
πi = P(yi = 1 | xi , ŵ) = σ(ŵ⊤ xi ) (5.16)

the probability of xi belonging to the positive class under the


model given by the MAP estimate of the weights. For the pre-
cision matrix, we then have

Λ = −Hw log p(w | x1:n, y1:n)|w=ŵ
  = −Hw log p(y1:n | x1:n, w)|w=ŵ − Hw log p(w)|w=ŵ
  = ∑_{i=1}^n Hw ℓlog(w⊤xi; yi)|w=ŵ + σp⁻²I    using the definition of the logistic loss (5.13)
  = ∑_{i=1}^n xi xi⊤ πi(1 − πi) + σp⁻²I    using the Hessian of the logistic loss (5.73) which you derive in Problem 5.1 (2)
  = X⊤ diag_{i∈[n]}{πi(1 − πi)} X + σp⁻²I.  (5.17)

Observe that πi (1 − πi ) ≈ 0 if πi ≈ 1 or πi ≈ 0. That is, if a train-


ing example is “well-explained” by ŵ, then its contribution to the
precision matrix is small. In contrast, we have πi (1 − πi ) = 0.25
for πi = 0.5. Importantly, Λ does not depend on the normalization
constant of the posterior distribution which is hard to compute.

In summary, we have that N (ŵ, Λ−1 ) is the Laplace approxima-


tion of p(w | x1:n , y1:n ).
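Equation (5.17) translates directly into a few lines of NumPy. This is a sketch with a synthetic design matrix and a stand-in ŵ (in practice one would plug in the MAP estimate, e.g. from the SGD sketch above):

```python
import numpy as np

def laplace_precision(X, w_hat, sigma_p=1.0):
    # Precision matrix (5.17): X^T diag(pi_i (1 - pi_i)) X + sigma_p^{-2} I.
    pi = 1.0 / (1.0 + np.exp(-X @ w_hat))        # pi_i = sigma(w_hat^T x_i)
    W = pi * (1.0 - pi)
    return X.T @ (W[:, None] * X) + np.eye(X.shape[1]) / sigma_p ** 2

# Stand-in values (for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w_hat = np.array([1.5, -0.5])

Lambda = laplace_precision(X, w_hat)
Sigma = np.linalg.inv(Lambda)                    # covariance of q(w) = N(w_hat, Lambda^{-1})
print(Sigma)
```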

5.2 Predictions with a Variational Posterior

How can we make predictions using our variational approximation?


We simply approximate the (intractable) true posterior with our varia-
tional posterior:

p(y⋆ | x⋆, x1:n, y1:n) = ∫ p(y⋆ | x⋆, θ) p(θ | x1:n, y1:n) dθ    using the sum rule (1.7)
                       ≈ ∫ p(y⋆ | x⋆, θ) qλ(θ) dθ.  (5.18)

A straightforward approach is to observe that Equation (5.18) can be


viewed as an expectation over the variational posterior qλ and approx-
imated via Monte Carlo sampling:

= Eθ∼qλ[p(y⋆ | x⋆, θ)]  (5.19)
≈ (1/m) ∑_{j=1}^m p(y⋆ | x⋆, θj)  (5.20)

where the θj are drawn i.i.d. from qλ.
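A sketch of this Monte Carlo estimate (5.20) for the Bayesian logistic regression case follows; the Gaussian variational posterior parameters w_hat and Sigma are stand-in values chosen only for illustration:

```python
import numpy as np

def predict_mc(x_star, w_hat, Sigma, n_samples=10_000, seed=0):
    # Monte Carlo estimate (5.19)-(5.20) of p(y* = +1 | x*) under q(w) = N(w_hat, Sigma).
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(w_hat, Sigma, size=n_samples)   # w_j ~ q
    probs = 1.0 / (1.0 + np.exp(-W @ x_star))                   # p(y* = +1 | x*, w_j)
    return probs.mean()

# Stand-in posterior parameters, chosen only for illustration.
w_hat = np.array([1.5, -0.5])
Sigma = np.array([[0.05, 0.01], [0.01, 0.08]])
print(predict_mc(np.array([1.0, 1.0]), w_hat, Sigma))
```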

Example 5.3: Predictions in Bayesian logistic regression


In the case of Bayesian logistic regression with a Gaussian approx-
imation of the posterior, we can obtain more accurate predictions.

Observe that the final prediction y⋆ is conditionally independent


of the model parameters w given the “latent value” f⋆ = w⊤x⋆:

p(y⋆ | x⋆, x1:n, y1:n) ≈ ∫ p(y⋆ | x⋆, w) qλ(w) dw
                       = ∫∫ p(y⋆ | f⋆) p(f⋆ | x⋆, w) qλ(w) dw df⋆    once more, using the sum rule (1.7)
                       = ∫ p(y⋆ | f⋆) (∫ p(f⋆ | x⋆, w) qλ(w) dw) df⋆.  (5.21)    rearranging terms

The outer integral can be readily approximated since it is only


one-dimensional! The challenging part is the inner integral, which
is a high-dimensional integral over the model weights w. Since the
posterior over weights qλ (w) = N (w; ŵ, Λ−1 ) is a Gaussian, we
have due to the closedness properties of Gaussians (1.78) that

∫ p(f⋆ | x⋆, w) qλ(w) dw = N(ŵ⊤x⋆, x⋆⊤Λ⁻¹x⋆).  (5.22)

Crucially, this is a one-dimensional Gaussian in function-space as


opposed to the d-dimensional Gaussian qλ in weight-space!

As we have seen in Equation (5.11), for Bayesian logistic regres-


sion, the prediction y⋆ depends deterministically on the predicted
latent value f ⋆ : p(y⋆ | f ⋆ ) = σ (y⋆ f ⋆ ). Combining Equations (5.21)
and (5.22), we obtain

p(y⋆ | x⋆, x1:n, y1:n) ≈ ∫ σ(y⋆ f⋆) · N(f⋆; ŵ⊤x⋆, x⋆⊤Λ⁻¹x⋆) df⋆.  (5.23)

We have replaced the high-dimensional integral over the model


parameters θ by the one-dimensional integral over the prediction
of our variational posterior f ⋆ . While this integral is generally
still intractable, it can be approximated efficiently using numerical
quadrature methods such as the Gauss-Legendre quadrature or
alternatively with Monte Carlo sampling.
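One way to approximate the one-dimensional integral (5.23) numerically is with a Gaussian quadrature rule. The sketch below uses Gauss-Hermite quadrature (a close relative of the Gauss-Legendre rule mentioned above, tailored to Gaussian weights); the stand-in values for ŵ and Λ⁻¹ are ours:

```python
import numpy as np

def predict_quadrature(x_star, w_hat, Sigma, n_nodes=30):
    # Approximates (5.23): integral of sigma(f*) N(f*; m, s^2) df* for y* = +1.
    m = w_hat @ x_star                               # mean of the latent f*
    s = np.sqrt(x_star @ Sigma @ x_star)             # std of the latent f*
    t, u = np.polynomial.hermite.hermgauss(n_nodes)  # Gauss-Hermite nodes and weights
    f = m + np.sqrt(2.0) * s * t                     # change of variables to N(m, s^2)
    return np.sum(u / (1.0 + np.exp(-f))) / np.sqrt(np.pi)

# Stand-in posterior parameters, chosen only for illustration.
w_hat = np.array([1.5, -0.5])
Sigma = np.array([[0.05, 0.01], [0.01, 0.08]])
print(predict_quadrature(np.array([1.0, 1.0]), w_hat, Sigma))
```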

5.3 Blueprint of Variational Inference

General probabilistic inference poses the challenge of approximating


the posterior distribution with limited memory and computation, re-
source constraints also present in humans and other intelligent sys-
tems. These resource constraints require information to be compressed,
and as we will see, such a compression poses a fundamental tradeoff
between model accuracy (on the observed data) and model complexity
(to avoid overfitting).

Laplace approximation approximates the true (intractable) posterior


with a simpler one, by greedily matching mode and curvature around
it. Can we find “less greedy” approaches? We can view variational
probabilistic inference more generally as a family of approaches aim-
ing to approximate the true posterior distribution by one that is closest
(according to some criterion) among a “simpler” class of distributions.
To this end, we need to fix a class of distributions and define suit-
able criteria, which we can then optimize numerically. The key ben-

efit is that we can reduce the (generally intractable) problem of high-


dimensional integration to the (often much more tractable) problem of
optimization.

Definition 5.4 (Variational family). Let P be the class of all probability distributions. A variational family Q ⊆ P is a class of distributions such that each distribution q ∈ Q is characterized by unique variational parameters λ ∈ Λ.

Figure 5.4: An illustration of variational inference in the space of distributions P. The variational distribution q⋆ ∈ Q is the optimal approximation of the true posterior p.

Example 5.5: Family of independent Gaussians

A straightforward example for a variational family is the family of independent Gaussians,

Q := {q(θ) = N(θ; µ, diag_{i∈[d]}{σi²})},  (5.24)

which is parameterized by λ := [µ1:d, σ²1:d]. Such a multivariate distribution where all variables are independent is called a mean-field distribution. Importantly, this family of distributions is characterized by only 2d parameters!

Note that Figure 5.4 is a generalization of the canonical distinction be-


tween estimation error and approximation error from Figure 1.7, only
that here, we operate in the space of distributions over functions as op-
posed the space of functions. A common notion of distance between
two distributions q and p is the Kullback-Leibler divergence KL(q∥ p)
which we will define in the next section. Using this notion of distance,
we need to solve the following optimization problem:

q⋆ := arg min_{q∈Q} KL(q∥p) = arg min_{λ∈Λ} KL(qλ∥p).  (5.25)

In Section 5.4, we introduce information theory and the Kullback-


Leibler divergence. Then, in Section 5.5, we discuss how the opti-
mization problem of Equation (5.25) can be solved efficiently.

5.4 Information Theoretic Aspects of Uncertainty

One of our main objectives throughout this manuscript is to capture


the “uncertainty” about events A in an appropriate probability space.
One very natural measure of uncertainty is their probability, P( A). In
this section, we will introduce an alternative measure of uncertainty,
namely the so-called “surprise” about the event A.

5.4.1 Surprise

The surprise about an event with probability u is defined as

S[u] := − log u.  (5.26)

Figure 5.5: Surprise S[u] associated with an event of probability u.

Observe that the surprise is a function from R≥0 to R, where we let S[0] ≡ ∞. Moreover, for a discrete random variable X, we have that p(x) ≤ 1, and hence, S[p(x)] ≥ 0. But why is it reasonable to measure surprise by − log u?

Remarkably, it can be shown that the following natural axiomatic characterization leads to exactly this definition of surprise.
Theorem 5.6 (Axiomatic characterization of surprise). The axioms
1. S[u] > S[v] =⇒ u < v (anti-monotonicity),    [we are more surprised by unlikely events]
2. S is continuous,    [no jumps in surprise for infinitesimal changes of probability]
3. S[uv] = S[u] + S[v] for independent events,    [the surprise of independent events is additive]
characterize S up to a positive constant factor.

Proof. Observe that the third condition looks similar to the product
rule of logarithms: log(uv) = log u + log v. We can formalize this
intuition by remembering Cauchy’s functional equation, f ( x + y) =
f(x) + f(y), which has the unique family of solutions {f : x ↦ cx :
c ∈ R} if f is required to be continuous. Such a solution is called an
.
“additive function”. Consider the function g( x ) = f (e x ). Then, g is
additive if and only if

f ( e x e y ) = f ( e x + y ) = g ( x + y ) = g ( x ) + g ( y ) = f ( e x ) + f ( e y ).

This is precisely the third axiom of surprise for f = S and e x = u!


Hence, the second and third axioms of surprise imply that g must be
additive and that g(x) = S[e^x] = cx for some c ∈ R. If we replace e^x
by u, we obtain S[u] = c log u. The first axiom of surprise implies that
c < 0, and thus, S[u] = −c′ log u for some c′ > 0.

Importantly, surprise offers a different perspective on uncertainty as opposed to probability: the uncertainty about an event can either be interpreted in terms of its probability or in terms of its surprise, and the two “spaces of uncertainty” are related by a log-transform. This relationship is illustrated in Figure 5.6. Information theory is the study of uncertainty in terms of surprise.

Figure 5.6: Illustration of the probability space and the corresponding “surprise space”.

Throughout this manuscript we will see many examples where modeling uncertainty in terms of surprise (i.e., the information-theoretic interpretation of uncertainty) is useful. One example where we have

already encountered the “surprise space” was in the context of likelihood maximization (cf. Section 1.3.1) where we used that the log-transform linearizes products of probabilities. We will see later in Chapter 6 that in many cases the surprise S[p(x)] can also be interpreted as a “cost” or “energy” associated with the state x.

5.4.2 Entropy

The entropy of a distribution p is the average surprise about samples from p. In this way, entropy is a notion of uncertainty associated with the distribution p: if the entropy of p is large, we are more uncertain about x ∼ p than if the entropy of p were low. Formally,

H[p] := Ex∼p[S[p(x)]] = Ex∼p[− log p(x)].  (5.27)

Figure 5.7: Entropy of a Bernoulli experiment with success probability p.
When X ∼ p is a random vector distributed according to p, we write H[X] := H[p]. Observe that by definition, if p is discrete then H[p] ≥ 0 as p(x) ≤ 1 (∀x).⁴ For discrete distributions it is common to use the logarithm with base 2 rather than the natural logarithm:⁵

H[p] = −∑_x p(x) log₂ p(x)    (if p is discrete),  (5.28a)
H[p] = −∫ p(x) log p(x) dx    (if p is continuous).  (5.28b)

⁴ The entropy of a continuous distribution can be negative. For example, H[Unif([a, b])] = −∫ (1/(b−a)) log(1/(b−a)) dx = log(b − a), which is negative if b − a < 1.
⁵ Recall that log₂ x = log x / log 2, that is, logarithms to a different base only differ by
Let us briefly recall Jensen’s inequality, which is a useful tool when a constant factor.
working with expectations of convex functions such as entropy:6 6
The surprise S[u] is convex in u.

Fact 5.7 (Jensen’s Inequality). ? Given a random variable X and a convex Problem 5.3 (1)
function g : R → R, we have

g(E[ X ]) ≤ E[ g( X )]. (5.29)


g
E[ g( X )]

Example 5.8: Examples of entropy


• Fair Coin H[Bern(0.5)] = −2(0.5 log2 0.5) = 1. g(E[ X ])

• Unfair Coin
E[ X ]

H[Bern(0.1)] = −0.1 log2 0.1 − 0.9 log2 0.9 ≈ 0.469. Figure 5.8: An illustration of Jensen’s in-
equality. Due to the convexity of g, we
• Uniform Distribution have that g evaluated at E[ X ] will always
be below the average of evaluations of g.
n
1 1
H[Unif({1, . . . , n})] = − ∑ log2 = log2 n.
i =1
n n

The uniform distribution has the maximum entropy among all


discrete distributions supported on {1, . . . , n} ? . Note that a Problem 5.3 (2)
fair coin corresponds to a uniform distribution with n = 2. Also

observe that log2 n corresponds to the number of bits required


to encode the outcome of the experiment.
In general, the entropy H[ p] of a discrete distribution p can be
interpreted as the average number of bits required to encode a
sample x ∼ p, or in other words, the average “information” car-
ried by a sample x.

Example 5.9: Entropy of a Gaussian


Let us derive the entropy of a univariate Gaussian. Recall the PDF,

( x − µ )2
 
2 1
N ( x; µ, σ ) = exp −
Z 2σ2

where Z = 2πσ2 . Using the definition of entropy (5.28b), we
obtain,

( x − µ )2
 
1
h i Z
2
H N (µ, σ ) = − exp − 2
Z  2σ
( x − µ )2

1
· log exp − dx
Z 2σ2
( x − µ )2
 
1
Z
= log Z exp − dx
Z 2σ2
| {z }
 1
( x − µ ) ( x − µ )2
2 
1
Z
+ exp − dx
Z 2σ2 2σ2
1 h i
= log Z + 2 E ( x − µ)2 using LOTUS (1.22)

√ 1
= log(σ 2π ) + using E ( x − µ)2 = Var[ x ] = σ2 (1.34)
 

√ 2

= log(σ 2πe). (5.30) using log e = 1/2

In general, the entropy of a Gaussian is

1 1  
H[N (µ, Σ )] = log det(2πeΣ ) = log (2πe)d det(Σ ) . (5.31)
2 2
Note that the entropy is a function of the determinant of the co-
variance matrix Σ. In general, there are various ways of “scalariz-
ing” the notion of uncertainty for a multivariate distribution. The
determinant of Σ measures the volume of the credible sets around
the mean µ, and is also called the generalized variance. Next to
entropy and generalized variance (which are closely related for
Gaussians), a common scalarization is the trace of Σ, which is also
called the total variance.

5.4.3 Cross-Entropy
How can we use entropy to measure our average surprise when as-
suming the data follows some distribution q but in reality the data
follows a different distribution p?

Definition 5.10 (Cross-entropy). The cross-entropy of a distribution q


relative to the distribution p is
.
H[ p∥q] = Ex∼ p [S[q( x )]] = Ex∼ p [− log q( x )]. (5.32)

Cross-entropy can also be expressed in terms of the KL-divergence (cf.


Section 5.4.4) KL( p∥q) which measures how “different” the distribu-
tion q is from a reference distribution p,

H[ p∥q] = H[ p] + KL( p∥q) ≥ H[ p]. (5.33) KL( p∥q) ≥ 0 is shown in Problem 5.5

Quite intuitively, the average surprise in samples from p with respect


to the distribution q is given by the inherent uncertainty in p and the
additional surprise that is due to us assuming the wrong data distri-
bution q. The “closer” q is to the true data distribution p, the smaller
is the additional average surprise.

5.4.4 Kullback-Leibler Divergence


As mentioned, the Kullback-Leibler divergence is a (non-metric) mea-
sure of distance between distributions. It is defined as follows.

Definition 5.11 (Kullback-Leibler divergence, KL-divergence). Given


two distributions p and q, the Kullback-Leibler divergence (or relative en-
tropy) of q with respect to p,
.
KL( p∥q) = H[ p∥q] − H[ p] (5.34)
= Eθ∼ p [S[q(θ)] − S[ p(θ)]] (5.35)
p(θ)
 
= Eθ∼ p log , (5.36)
q(θ)

measures how different q is from a reference distribution p.

In words, KL( p∥q) measures the additional expected surprise when ob-
serving samples from p that is due to assuming the (wrong) distribu- 7
The KL-divergence only captures the
tion q and which not inherent in the distribution p already.7 additional expected surprise since the
surprise inherent in p (as measured by
H[ p]) is subtracted.
The KL-divergence has the following properties:
• KL( p∥q) ≥ 0 for any distributions p and q ? , Problem 5.5 (1)
• KL( p∥q) = 0 if and only if p = q almost surely ? , and Problem 5.5 (2)
• there exist distributions p and q such that KL( p∥q) ̸= KL(q∥ p).

The KL-divergence can simply be understood as a shifted version of


cross-entropy, which is zero if we consider the divergence between two
identical distributions.

We will briefly look at another interpretation for how KL-divergence


measures “distance” between distributions. Suppose we are presented
with a sequence θ1 , . . . , θn of independent samples from either a dis-
tribution p or a distribution q, both of which are known. Which of
p or q was used to generate the data is, however, unknown to us,
and we would like to find out. A natural approach is to choose the
distribution whose data likelihood is larger. That is, we choose p if
p(θ1:n ) > q(θ1:n ) and vice versa. Assuming that the samples are inde-
pendent and rewriting the inequality slightly, we choose p if
n n
p(θi ) p(θ )
∏ q(θi )
> 1, or equivalently if ∑ log q(θii) > 0. (5.37) taking the logarithm
i =1 i =1

Assume without loss of generality that θi ∼ p. Then, using the law of


large numbers (A.36),

1 n p(θi ) a.s. p(θ)


 

n i∑
log → Eθ∼ p log = KL( p∥q) (5.38)
=1
q(θi ) q(θ)

as n → ∞. Plugging this into our decision criterion from Equation (5.37),


we find that
" #
n
p(θi )
E ∑ log = nKL( p∥q). (5.39)
i =1
q(θi )

In this way, KL( p∥q) measures the observed “distance” between p


and q. Recall that assuming p ̸= q we have that KL( p∥q) > 0 with
probability 1, and therefore we correctly choose p with probability 1
as n → ∞. Moreover, Hoeffding’s inequality (A.41) can be used to de-
termine “how quickly” samples converge to this limit, that is, how
quickly we can distinguish between p and q.

Example 5.12: KL-divergence of Bernoulli random variables


Suppose we are given two Bernoulli distributions Bern( p) and
Bern(q). Then, their KL-divergence is

Bern( x; p)
KL(Bern( p)∥Bern(q)) = ∑ Bern( x; p) log
Bern( x; q)
x ∈{0,1}
p (1 − p )
= p log + (1 − p) log . (5.40)
q (1 − q )

Observe that KL(Bern( p)∥Bern(q)) = 0 if and only if p = q.



Example 5.13: KL-divergence of Gaussians

Suppose we are given two Gaussian distributions p := N(µp, Σp) and q := N(µq, Σq) with dimension d. The KL-divergence of p and q is given by (see Problem 5.8)

KL(p∥q) = ½ ( tr(Σq⁻¹Σp) + (µp − µq)⊤Σq⁻¹(µp − µq) − d + log(det Σq / det Σp) ).  (5.41)

For independent Gaussians with unit variance, Σp = Σq = I, the expression simplifies to the squared Euclidean distance,

KL(p∥q) = ½ ∥µq − µp∥₂².  (5.42)

If we approximate independent Gaussians with variances σi²,

p := N(µ, diag{σ₁², . . . , σd²}),

by a standard normal distribution, q := N(0, I), the expression simplifies to

KL(p∥q) = ½ ∑_{i=1}^d (σi² + µi² − 1 − log σi²).  (5.43)

Here, the term µi² penalizes a large mean of p, the term σi² penalizes a large variance of p, and the term − log σi² penalizes a small variance of p. As expected, KL(p∥q) is proportional to the amount of information we lose by approximating p with the simpler distribution q.

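Both (5.41) and its diagonal special case (5.43) are easy to implement directly. The following is a short NumPy sketch with toy parameters of our choosing; the last two lines verify numerically that the two expressions agree:

```python
import numpy as np

def kl_gaussians(mu_p, Sigma_p, mu_q, Sigma_q):
    # KL(p || q) between multivariate Gaussians, cf. (5.41).
    d = mu_p.shape[0]
    Sigma_q_inv = np.linalg.inv(Sigma_q)
    diff = mu_p - mu_q
    return 0.5 * (np.trace(Sigma_q_inv @ Sigma_p)
                  + diff @ Sigma_q_inv @ diff
                  - d
                  + np.log(np.linalg.det(Sigma_q) / np.linalg.det(Sigma_p)))

# Example: a diagonal p approximated by a standard normal q, cf. (5.43).
mu = np.array([1.0, -0.5])
Sigma = np.diag([0.5, 2.0])
print(kl_gaussians(mu, Sigma, np.zeros(2), np.eye(2)))
print(0.5 * np.sum(np.diag(Sigma) + mu ** 2 - 1 - np.log(np.diag(Sigma))))  # same value
```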
5.4.5 Forward and Reverse KL-divergence

Figure 5.9: Comparison of the forward KL-divergence q1⋆ and the reverse KL-divergence q2⋆ when used to approximate the true posterior p. The first plot shows the PDFs in a one-dimensional feature space where p is a mixture of two univariate Gaussians. The second plot shows contour lines of the PDFs in a two-dimensional feature space where the non-diagonal Gaussian p is approximated by diagonal Gaussians q1⋆ and q2⋆. It can be seen that q1⋆ selects the variance and q2⋆ selects the mode of p. The approximation q1⋆ is more conservative than the (overconfident) approximation q2⋆.

KL(p∥q) is also called the forward (or inclusive) KL-divergence. In contrast, KL(q∥p) is called the reverse (or exclusive) KL-divergence. Figure 5.9 shows the approximations of a general Gaussian obtained when Q is the family of diagonal (independent) Gaussians. Thereby,

q1⋆ := arg min_{q∈Q} KL(p∥q)   and   q2⋆ := arg min_{q∈Q} KL(q∥p).

q1⋆ is the result when using the forward KL-divergence and q2⋆ is the result when using reverse KL-divergence. It can be seen that the reverse KL-divergence tends to greedily select the mode and underestimate the variance which, in this case, leads to an overconfident predic-

tion. The forward KL-divergence, in contrast, is more conservative and


yields what one could consider the “desired” approximation.

Recall that in the blueprint of variational inference (5.25) we used


the reverse KL-divergence. This is for computational reasons. Ob-
serve that to approximate the KL-divergence KL( p∥q) using Monte
Carlo sampling, we would need to obtain samples from p yet p is
the intractable posterior distribution which we were trying to approx-
imate in the first place. Crucially, observe that if the true posterior
p(· | x1:n , y1:n ) is in the variational family Q, then
a.s. a.s.
arg min KL(q∥ p(· | x1:n , y1:n )) = p(· | x1:n , y1:n ), (5.44) as minq∈Q KL(q∥ p(· | x1:n , y1:n )) = 0
q∈Q

so minimizing reverse-KL still recovers the true posterior almost surely.

Remark 5.14: Greediness of reverse-KL


As in the previous example, consider the independent Gaussians
.
p = N (µ, diagi∈[d] {σi2 }),

which we seek to approximate by a standard normal distribu-


.
tion q = N (0, I ). Using (5.41), we obtain for the reverse KL-
divergence,

1  
KL(q∥ p) = tr diag{σi−2 } + µ⊤ diag{σi−2 }µ − d
2  
+ log det diag{σi2 }
!
1 d µ2i
2 i∑
−2
= σi + 2
− 1 + log σi2 . (5.45)
=1 σi

Here, σi−2 penalizes small variance, µ2i /σi2 penalizes a large mean,
and log σi2 penalizes large variance. Compare this to the expres-
sion for the forward KL-divergence KL( p∥q) that we have seen in
Equation (5.43). In particular, observe that reverse-KL penalizes
large variance less strongly than forward-KL.

Note, however, that reverse-KL is not greedy in the same sense


as Laplace approximation, as it does still take the variance into
account and does not purely match the mode of p.

5.4.6 Interlude: Minimizing Forward KL-Divergence


Before completing the blueprint of variational inference in Section 5.5
by showing how reverse-KL can be efficiently minimized, we will di-
gress briefly and relate minimizing forward-KL to two other well-known

inference algorithms. This discussion will deepen our understanding


of the KL-divergence and its role in probabilistic inference, but feel free
to skip ahead to Section 5.5 if you are eager to complete the blueprint.

Minimizing forward-KL as maximum likelihood estimation: First, we ob-


serve that minimizing the forward KL-divergence is equivalent to max-
imum likelihood estimation on an infinitely large sample size. The
classical application of this result is in the setting where p( x) is a gen-
erative model, and we aim to estimate its density with the parameter-
ized model qλ .

Lemma 5.15 (Forward KL-divergence as MLE). Given some generative


model p( x) and a likelihood qλ ( x) = q( x | λ) (that we use to approximate
the true data distribution), we have
n
a.s. 1
arg min KL( p∥qλ ) = arg max lim
n→∞ n ∑ log q(xi | λ), (5.46)
λ∈Λ λ∈Λ i =1
iid
where xi ∼ p are independent samples from the true data distribution.

Proof.

KL( p∥qλ ) = H[ p∥qλ ] − H[ p] using the definition of KL-divergence


(5.34)
= E(x,y)∼ p [− log q( x | λ)] + const dropping H[ p] and using the definition
n of cross-entropy (5.32)
a.s. 1
= − lim
n→∞ n ∑ log q(xi | λ) + const using Monte Carlo sampling, i.e., the
i =1 law of large numbers (A.36)
iid
where xi ∼ p are independent samples.

This tells us that any maximum likelihood estimate qλ minimizes the


forward KL-divergence to the empirical data distribution. Note that
here, we aim to learn model parameters λ for estimating the probabil-
ity of x, whereas in the setting of variational probabilistic inference, we
want to learn parameters λ of a distribution over θ and θ parameterizes
a distribution over x. This interpretation is therefore not immediately
useful for probabilistic inference (i.e., in the setting where p is a pos-
terior distribution over model parameters θ) as a maximum likelihood
estimate requires i.i.d. samples from p which we cannot easily obtain
in this case.8 8
It is possible to obtain “approximate”
samples using Markov chain Monte
Carlo (MCMC) methods which we dis-
Example 5.16: Minimizing cross-entropy cuss in Chapter 6.
Minimizing the KL-divergence between p and qλ is equivalent to
minimizing cross-entropy since KL( p∥qλ ) = H[ p∥qλ ] − H[ p] and
H[ p] is constant with respect to p.

Let’s consider an example in a binary classification problem with


the label y ∈ {0, 1} and predicted class probability ŷ ∈ [0, 1] for
some fixed input. It is natural to use cross-entropy as a measure
of dissimilarity between y and ŷ,
.
ℓbce (ŷ; y) = H[Bern(y)∥Bern(ŷ)]
=− ∑ Bern( x; y) log Bern( x; ŷ)
(5.47)
x ∈{0,1}

= −y log ŷ − (1 − y) log(1 − ŷ).

This loss function is also known as the binary cross-entropy loss and
we will discuss it in more detail in Section 7.1.3 in the context of
neural networks.

Minimizing forward-KL as moment matching: Now to a second inter-


pretation of minimizing forward-KL. Moment matching (also known as
the method of moments) is a technique for approximating an unknown
distribution p with a parameterized distribution qλ where λ is cho-
sen such that qλ matches the (estimated) moments of p. For example,
given the estimates a and B of the first and second moment of p,9 and 9
These estimates are computed using
if qλ is a Gaussian with parameters λ = {µ, Σ }, then moment matching the samples from p. For example, using
a sample mean and a sample variance
chooses λ as the solution to to compute the estimates of the first and
second moment.
!
E p [θ] ≈ a = µ = Eqλ [θ]
h i h i
!
E p θθ⊤ ≈ B = Σ + µµ⊤ = Eqλ θθ⊤ . using the definition of variance (1.35)

In general the number of moments to be matched (i.e., the number of


equations) is adjusted such that it is equal to the number of parameters
to be estimated. We will see now that the “matching” of moments
is also ensured when qλ is obtained by minimizing the forward KL-
divergence within the family of Gaussians.

The Gaussian PDF can be expressed as

1
N (θ; µ, Σ ) = exp(λ⊤ s(θ)) where (5.48)
Z (λ)
" #
. Σ −1 µ
λ= (5.49)
vec[Σ −1 ]
" #
. θ
s(θ) = (5.50)
vec[− 21 θθ⊤ ]

. R
and Z (λ) = exp(λ⊤ s(θ)) dθ, and we will confirm this in just a mo-
ment.10 The family of distributions with densities of the form (5.48) — 10
Given a matrix A ∈ Rn×m , we use
with an additional scaling constant h(θ) which is often 1 — is called vec[ A] ∈ Rn·m
to denote the row-by-row concatenation
of A yielding a vector of length n · m.

the exponential family of distributions. Here, s(θ) are the sufficient statis-
tics, λ are called the natural parameters, and Z (λ) is the normalizing
constant. In this context, Z (λ) is often called the partition function.

To see that the Gaussian is indeed part of the exponential family as


promised in Equations (5.49) and (5.50), consider
 
1
N (θ; µ, Σ ) ∝ exp − (θ − µ)⊤ Σ −1 (θ − µ)
2
   
1
∝ exp tr − θ⊤ Σ −1 θ + θ⊤ Σ −1 µ expanding the inner product and using
2 that tr( x ) = x for all x ∈ R
   
1
= exp tr − θθ⊤ Σ −1 + θ⊤ Σ −1 µ using that the trace is invariant under
2 cyclic permutations
 
= exp vec[− 12 θθ⊤ ]⊤ vec[Σ −1 ] + θ⊤ Σ −1 µ . using tr( AB) = vec[ A]⊤ vec[ B] for any
A, B ∈ Rn×n
This allows us to express the forward KL-divergence as

p(θ)
Z
KL( p∥qλ ) = p(θ) log dθ
qλ (θ)
Z
=− p(θ) · λ⊤ s(θ) dθ + log Z (λ) + const.
R
using that p(θ) log p(θ) dθ is constant

Differentiating with respect to the natural parameters λ gives

1
Z Z
∇λ KL( p∥qλ ) = − p(θ)s(θ) dθ + s(θ) exp(λ⊤ s(θ)) dθ
Z (λ)
= −Eθ∼ p [s(θ)] + Eθ∼qλ [s(θ)].

Hence, for any minimizer of KL( p∥qλ ), we have that the sufficient
statistics under p and qλ match:

E p [s(θ)] = Eqλ [s(θ)]. (5.51)

Therefore, in the Gaussian case,


h i h i
E p [θ] = Eqλ [θ] and E p − 21 θθ⊤ = Eqλ − 12 θθ⊤ ,

implying that
h i
E p [θ] = µ and Var p [θ] = E p θθ⊤ − E p [θ] · E p [θ]⊤ = Σ (5.52) using Equation (1.35)

where µ and Σ are the mean and variance of the approximation qλ , re-
spectively. That is, a Gaussian qλ minimizing KL( p∥qλ ) has the same
first and second moment as p. Combining this insight with our obser-
vation from Equation (5.46) that minimizing forward-KL is equivalent
to maximum likelihood estimation, we see that if we use MLE to fit a
Gaussian to given data, this Gaussian will eventually match the first
and second moments of the data distribution.

5.5 Evidence Lower Bound

Let us return to the blueprint of variational inference from Section 5.3.


To complete this blueprint, it remains to show that the reverse KL-
divergence can be minimized efficiently. We have

q(θ)
 
KL(q∥ p(· | x1:n , y1:n )) = Eθ∼q log using the definition of the
p(θ | x1:n , y1:n ) KL-divergence (5.34)
p(y1:n | x1:n )q(θ)
 
= Eθ∼q log using the definition of conditional
p(y1:n , θ | x1:n ) probability (1.8)
= log p(y1:n | x1:n ) using linearity of expectation (1.20)
−Eθ∼q [log p(y1:n , θ | x1:n )] − H[q]
| {z }
− L(q,p;Dn )

where L(q, p; Dn ) is called the evidence lower bound (ELBO) given the
data Dn = {( xi , yi )}in=1 . This gives the relationship

L(q, p; Dn ) = log p(y1:n | x1:n ) − KL(q∥ p(· | x1:n , y1:n )). (5.53)
| {z }
const

Thus, maximizing the ELBO coincides with minimizing reverse-KL.

Maximizing the ELBO,


.
L(q, p; Dn ) = Eθ∼q [log p(y1:n , θ | x1:n )] + H[q], (5.54)

selects q that has large joint likelihood p(y1:n , θ | x1:n ) and large en-
tropy H[q]. The ELBO can also be expressed in various other forms:

L(q, p; Dn ) = Eθ∼q [log p(y1:n , θ | x1:n ) − log q(θ)] (5.55a) using the definition of entropy (5.27)

= Eθ∼q [log p(y1:n | x1:n , θ) + log p(θ) − log q(θ)] (5.55b) using the product rule (1.11)

= Eθ∼q [log p(y1:n | x1:n , θ)] − KL(q∥ p(·)) . (5.55c) using the definition of KL-divergence
| {z } | {z } (5.34)
log-likelihood proximity to prior

where we denote by p(·) the prior distribution. Equation (5.55c) high-


lights the connection to probabilistic inference, namely that maximiz-
ing the ELBO selects a variational distribution q that is close to the
prior distribution p(·) while also maximizing the average likelihood of
the data p(y1:n | x1:n , θ) for θ ∼ q. This is in contrast to maximum a
posteriori estimation, which picks a single model θ that maximizes the
likelihood and proximity to the prior. As an example, let us look at
the case where the prior is noninformative, i.e., p(·) ∝ 1. In this case,
the ELBO simplifies to Eθ∼q [log p(y1:n , | x1:n , θ)] + H[q] + const. That
is, maximizing the ELBO maximizes the average likelihood of the data
under the variational distribution q while regularizing q to have high

entropy. Why is it reasonable to maximize the entropy of q? Consider


two distributions q1 and q2 under which the data is “equally” likely
and which are “equally” close to the prior. Maximizing the entropy
selects the distribution that exhibits the most uncertainty which is in
accordance with the maximum entropy principle.11 11
In Section 1.2.1, we discussed the max-
imum entropy principle and the related
Recalling that KL-divergence is non-negative, it follows from Equa- principle of indifference at length. In
simple terms, the maximum entropy
tion (5.53) that the evidence lower bound is a (uniform12 ) lower bound principle states that without informa-
to the evidence log p(y1:n | x1:n ): tion, we should choose the distribution
that is maximally uncertain.
log p(y1:n | x1:n ) ≥ L(q, p; Dn ). (5.56) 12
That is, the bound holds for any varia-
tional distribution q (with full support).
This indicates that maximizing the evidence lower bound is an ade-
quate method of model selection which can be used instead of max-
imizing the evidence (marginal likelihood) directly (as was discussed
in Section 4.4.2). Note that this inequality lower bounds the logarithm
of an integral by an expectation of a logarithm over some variational
distribution q. Hence, the ELBO is a family of lower bounds — one for
each variational distribution. Such inequalities are called variational
inequalities.

Example 5.17: Gaussian VI vs Laplace approximation


Consider maximizing the ELBO within the variational family of
.
Gaussians Q = {q(θ) = N (θ; µ, Σ )}. How does this relate to
Laplace approximation which also fits a Gaussian approximation
to the posterior

p(θ | D) ∝ p(y1:n , θ | x1:n )?

It turns out that both approximations are closely related. Indeed,


it can be shown that while the Laplace approximation is fitted
locally at the MAP estimate θ̂, satisfying

0 = ∇θ log p(y1:n , θ | x1:n )


−1
(5.57)
Σ = − Hθ log p(y1:n , θ | x1:n )|θ=θ̂ ,

Gaussian variational inference satisfies the conditions of the Laplace


approximation on average with respect to the approximation q ? : Problem 5.10

0 = Eθ∼q [∇θ log p(y1:n , θ | x1:n )]


−1
(5.58)
Σ = −Eθ∼q [ Hθ log p(y1:n , θ | x1:n )].

For this reason, the Gaussian variational approximation does not


suffer from the same overconfidence as the Laplace approxima-
tion.13 13
see Figure 5.1

Example 5.18: ELBO for Bayesian logistic regression


Recall that Bayesian logistic regression uses the prior distribution
w ∼ N (0, I ).14 14
We omit the scaling factor σp2 here for
simplicity.
Suppose we use the variational family Q of all diagonal Gaussians
from Example 5.5. We have already seen in Equation (5.43) that
for a prior p ∼ N (0, I ) and a variational distribution

qλ ∼ N (µ, diagi∈[d] {σi2 }),

we have

1 d 2
2 i∑
KL(q∥ p(·)) = (σi + µ2i − 1 − log σi2 ).
=1

It remains to find the expected likelihood under models from our


approximate posterior:
" #
n
Ew∼qλ [log p(y1:n | x1:n , w)] = Ew∼qλ ∑ log p(yi | xi , w) using independence of the data
i =1
" #
n
= Ew∼qλ − ∑ ℓlog (w; xi , yi ) . substituting the logistic loss (5.13)
i =1
(5.59)
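To make this concrete, here is a sketch of a Monte Carlo estimator of the ELBO (5.55c) for this model, combining the closed-form KL term (5.43) with sampled log-likelihoods; the toy data and all names are our own illustrative choices:

```python
import numpy as np

def elbo_estimate(mu, log_sigma, X, y, n_samples=200, seed=0):
    # Monte Carlo estimate of the ELBO (5.55c) for Bayesian logistic regression with
    # q(w) = N(mu, diag(sigma^2)) and prior w ~ N(0, I); labels y are in {-1, +1}.
    rng = np.random.default_rng(seed)
    sigma = np.exp(log_sigma)
    kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma)   # closed form (5.43)
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    W = mu + sigma * eps                                              # w_j ~ q (reparameterized)
    log_lik = -np.log1p(np.exp(-y[None, :] * (W @ X.T))).sum(axis=1)  # sum_i log sigma(y_i w^T x_i)
    return log_lik.mean() - kl

# Toy data (two blobs), chosen only for illustration.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
print(elbo_estimate(np.zeros(2), np.zeros(2), X, y))
```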

5.5.1 Gradient of Evidence Lower Bound


We have yet to discuss how the optimization problem of maximizing
the ELBO can be solved efficiently. A suitable tool is stochastic gradi-
ent descent (SGD), however, SGD requires unbiased gradient estimates
.
of the loss ℓ(λ; Dn ) = − L(qλ , p; Dn ). That is, we need to obtain gradi-
ent estimates of

∇λ L(qλ , p; Dn ) = ∇λ Eθ∼qλ [log p(y1:n | x1:n , θ)] − ∇λ KL(qλ ∥ p(·)). using the definition of the ELBO (5.55c)
(5.60)

Typically, the KL-divergence (and its gradient) can be computed ex-


actly for commonly used variational families. For example, we have
already seen a closed-form expression of the KL-divergence for Gaus-
sians in Equation (5.41).

Obtaining the gradient of the expected log-likelihood is more diffi-


cult. This is because the expectation integrates over the measure qλ ,
which depends on the variational parameters λ. Thus, we cannot move
the gradient operator inside the expectation as commonly done (cf.
Appendix A.1.5). There are two main techniques which are used to

rewrite the gradient in such a way that Monte Carlo sampling becomes
possible.

One approach is to use score gradients via the “score function trick”:

∇λ Eθ∼qλ [log p(y1:n | x1:n , θ)]


= Eθ∼qλ [log p(y1:n | x1:n , θ) ∇λ log qλ (θ)], (5.61)
| {z }
score function

which we introduce in Section 12.3.2 in the context of reinforcement


learning. More common in the context of variational inference is the
so-called “reparameterization trick”.

Theorem 5.19 (Reparameterization trick). Given a random variable ε ∼ ϕ


(which is independent of λ) and given a differentiable and invertible function
.
g : Rd → Rd . We let θ = g (ε; λ). Then,

qλ (θ) = ϕ(ε) · |det( Dε g (ε; λ))|−1 , (5.62)


Eθ∼qλ [ f (θ)] = Eε∼ϕ [ f ( g (ε; λ))] (5.63)

for a “nice” function f : Rd → Re .

Proof. By the change of variables formula (1.43) and using ε = g −1 (θ; λ),
 
qλ (θ) = ϕ(ε) · det Dθ g −1 (θ; λ)
 
= ϕ(ε) · det ( Dε g (ε; λ))−1 by the inverse function theorem,
Dg −1 (y) = Dg ( x)−1
−1
= ϕ(ε) · |det( Dε g (ε; λ))| . using det A−1 = det( A)−1


Equation (5.63) is a direct consequence of the law of the unconscious


statistician (1.22).

In other words, the reparameterization trick allows a change of “den-


sities” by finding a function g (·; λ) and a reference density ϕ such that
qλ = g (·; λ)♯ ϕ is the pushforward of ϕ under perturbation g. Apply-
ing the reparameterization trick, we can swap the order of gradient
and expectation,

∇λ Eθ∼qλ [ f (θ)] = Eε∼ϕ [∇λ f ( g (ε; λ))]. (5.64) using Equation (A.5)

We call a distribution qλ reparameterizable if it admits reparameteriza-


tion, i.e., if we can find g and a suitable reference density ϕ which is
independent of λ.

Example 5.20: Reparameterization trick for Gaussians


Suppose we use a Gaussian variational approximation,
.
qλ (θ) = N (θ; µ, Σ ),

where we assume Σ to have full rank (i.e., be invertible). We have


seen in Equation (1.78) that a Gaussian random vector ε ∼ N (0, I )
following a standard normal distribution can be transformed to
follow the Gaussian distribution qλ by using the linear transfor-
mation,
.
θ = g (ε; λ) = Σ /2 ε + µ.
1
(5.65)

In particular, we have

ε = g −1 (θ; λ) = Σ − /2 (θ − µ)
1
and (5.66) by solving Equation (5.65) for ε
 
ϕ(ε) = qλ (θ) · det Σ /2 .
1
(5.67) using the reparameterization trick (i.e.,
the change of variables formula) (5.62)

.
In the following, we write C = Σ 1/2 . Let us now derive the gradient
estimate for the evidence lower bound assuming the Gaussian vari-
ational approximation from Example 5.20. This approach extends to
any reparameterizable distribution.

∇λ Eθ∼qλ [log p(y1:n | x1:n , θ)]


h i
= ∇C,µ Eε∼N (0,I ) log p(y1:n | x1:n , θ)|θ=Cε+µ (5.68) using the reparameterization trick (5.63)
" #
1 n
n i∑
= n · ∇C,µ Eε∼N (0,I ) log p(yi | xi , θ)|θ=Cε+µ using independence of the data and
=1 extending with n/n
h i
= n · ∇C,µ Eε∼N (0,I ) Ei∼Unif([n]) log p(yi | xi , θ)|θ=Cε+µ interpreting the sum as an expectation
h i
= n · Eε∼N (0,I ) Ei∼Unif([n]) ∇C,µ log p(yi | xi , θ)|θ=Cε+µ (5.69) using Equation (A.5)
m
1
≈ n·
m ∑ ∇C,µ log p(yi j | xi j , θ)
θ=Cε j +µ
(5.70) using Monte Carlo sampling
j =1

iid iid
where ε j ∼ N (0, I ) and i j ∼ Unif([n]). This yields an unbiased gra-
dient estimate, which we can use with stochastic gradient descent to
maximize the evidence lower bound. We have successfully recast the
difficult problems of learning and inference as an optimization prob-
lem!
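The following sketch spells out the resulting estimator (5.70) for Bayesian logistic regression with a diagonal Gaussian variational family, computing the chain-rule gradients with respect to µ and σ by hand. The toy data and names are ours, and in practice an automatic-differentiation framework would do this bookkeeping:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood_grad_estimate(mu, sigma, X, y, m=64, seed=0):
    # Reparameterized Monte Carlo estimate (5.70) of the gradient of
    # E_{w ~ N(mu, diag(sigma^2))}[log p(y | X, w)] with respect to (mu, sigma).
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    grad_mu = np.zeros_like(mu)
    grad_sigma = np.zeros_like(sigma)
    for _ in range(m):
        eps = rng.standard_normal(mu.shape[0])                 # eps ~ N(0, I)
        i = rng.integers(n)                                    # data point, uniformly at random
        w = mu + sigma * eps                                   # w = g(eps; lambda)
        g_w = y[i] * X[i] * sigmoid(-y[i] * (w @ X[i]))        # grad_w log p(y_i | x_i, w), cf. (5.14)
        grad_mu += n * g_w / m                                 # chain rule: dw/dmu = I
        grad_sigma += n * g_w * eps / m                        # chain rule: dw/dsigma_k = eps_k
    return grad_mu, grad_sigma

# Toy data (two blobs), chosen only for illustration.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
print(likelihood_grad_estimate(np.zeros(2), np.ones(2), X, y))
```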

The procedure of approximating the true posterior using a variational


posterior by maximizing the evidence lower bound using stochastic
optimization is also called black box stochastic variational inference (Ran-
ganath et al., 2014; Titsias and Lázaro-Gredilla, 2014; Duvenaud and

Adams, 2015). The only requirement is that we can obtain unbiased


gradient estimates from the evidence lower bound (and the likelihood).
We have just discussed one of many approaches to obtain such gradi-
ent estimates (Mohamed et al., 2020). If we use the variational family
of diagonal Gaussians, we only require twice as many parameters as
other inference techniques like MAP estimation. The performance can
be improved by using natural gradients and variance reduction tech-
niques for the gradient estimates such as control variates.

5.5.2 Minimizing Surprise via Exploration and Exploitation


Now that we have established a way to optimize the ELBO, let us dwell
a bit more on its interpretation. Observe that the evidence can also be
interpreted as the negative surprise about the observations under the
prior distribution p(θ), and by negating Equation (5.56), we obtain the
variational upper bound

S[ p(y1:n | x1:n )] ≤ Eθ∼q [S[ p(y1:n , θ | x1:n )]] −H[q]


| {z }
called energy (5.71)
= − L(q, p; Dn ).

Here, − L(q, p; Dn ) is commonly called the (variational) free energy with


respect to q. Free energy can also be characterized as

− L(q, p; Dn ) = Eθ∼q [S[ p(y1:n | x1:n , θ)]] + KL(q∥ p(·)) , (5.72) analogously to Equation (5.55c)
| {z } | {z }
average surprise proximity to prior

and therefore its minimization is minimizing the average surprise about


the data under the variational distribution q while maximizing prox-
imity to the prior p(·). Systems that minimize the surprise in their
observations are widely studied in many areas of science.15 15
Refer to the free energy principle which
was originally introduced by the neuro-
To minimize surprise, the free energy makes apparent a natural trade- scientist Karl Friston (Friston, 2010).
off between two extremes: On the one hand, models q(θ) that “overfit”
to observations (e.g., by using a point estimate of θ), and hence, result
in a large surprise when new observations deviate from this specific
belief. On the other hand, models q(θ) that “underfit” to observa-
tions (e.g., by expecting outcomes with equal probability), and hence,
any observation results in a non-negligible surprise. Either of these
extremes is undesirable.

As we have alluded to previously when introducing the ELBO, the two


terms constituting free energy map neatly onto this tradeoff. Therein,
the entropy (which is maximized) encourages q to have uncertainty, in
other words, to “explore” beyond the finite data. In contrast, the energy
(which is minimized) encourages q to fit the observed data closely, in

other words, to “exploit” the finite data. This tradeoff is ubiquitous in


approximations to probabilistic inference that deal with limited com-
putational resources and limited time, and we will encounter it many
times more. We will point out these connections as we go along.

Since this tradeoff is so fundamental and appears in many branches of


science under different names, it is difficult to give it an appropriate
unifying name. The essence of this tradeoff can be captured as a prin-
ciple of curiosity and conformity, which suggests that reasoning under
uncertainty requires curiosity to entertain and pursue alternative ex-
planations of the data and conformity to make consistent predictions.

Discussion

We have explored variational inference, where we approximate the


intractable posterior distribution of probabilistic inference with a sim-
pler distribution. We operationalized this idea by turning the infer-
ence problem, which requires computing high-dimensional integrals,
into a tractable optimization problem. Gaussians are frequently used
as variational distributions due to their versatility and compact repre-
sentation.

Nevertheless, recall from Figure 5.4 that while the estimation error in
variational inference can be small, choosing a variational family that is
too simple can lead to a large approximation error. We have seen that
for posteriors that are multimodal or have heavy tails, Gaussians may
not provide a good approximation. In the next chapter, we will explore
alternative techniques for approximate inference that can handle more
complex posteriors.

Problems

5.1. Logistic loss.


1. Derive the gradient of ℓlog as given in Equation (5.14).
2. Show that

Hw ℓlog (w⊤ x; y) = xx⊤ · σ (w⊤ x) · (1 − σ (w⊤ x)). (5.73)

Hint: Begin by deriving the first derivative of the logistic function, and
use the chain rule of multivariate calculus,

Dx ( f ◦ g ) = ( D f ◦ g ) · Dx g (5.74)
| {z } | g ( x) {z }
Rn →Rm × n Rn →Rm × k ·Rk × n

where g : Rn → Rk and f : Rk → Rm .
3. Is the logistic loss ℓlog convex in w?

5.2. Gaussian process classification.

In this exercise, we will study the use of Gaussian processes for clas-
sification tasks, commonly called Gaussian process classification (GPC).
Linear logistic regression is extended to GPC by replacing the Gaus-
sian prior over weights with a GP prior on f ,

f ∼ GP (0, k), y | x, f ∼ Bern(σ ( f ( x))) (5.75)

where σ : R → (0, 1) is some logistic-type function. Note that Bayesian


logistic regression is the special case where k is the linear kernel and σ
is the logistic function. This is analogous to the relationship of Bayesian
linear regression and Gaussian process regression.

In the GP regression setting of Chapter 4, yi was assumed to be a


noisy observation of f ( xi ). In the classification setting, we now have
that yi ∈ {−1, +1} is a binary class label and f ( xi ) ∈ R is a latent
value. We study the setting where σ (z) = Φ(z; 0, σn2 ) is the CDF of a
univariate Gaussian with mean 0 and variance σn2 , also called a probit
likelihood.

To make probabilistic predictions for a query point x⋆ , we first com-


pute the distribution of the latent variable f ⋆ ,
Z
p( f ⋆ | x1:n , y1:n , x⋆ ) = p( f ⋆ | x1:n , x⋆ , f ) p( f | x1:n , y1:n ) d f (5.76) using sum rule (1.7) and product rule
(1.11), and f ⋆ ⊥ y1:n | f
where p( f | x1:n , y1:n ) is the posterior over the latent variables.
1. Assuming that we can efficiently compute p( f ⋆ | x1:n , y1:n , x⋆ ) (ap-
proximately), describe how we can find the predictive posterior
p(y⋆ = +1 | x1:n , y1:n , x⋆ ).
2. The posterior over the latent variables is not a Gaussian as we used
a non-Gaussian likelihood, and hence, the integral of the latent
predictive posterior (5.76) is analytically intractable. A common
technique is to approximate the latent posterior p( f | x1:n , y1:n )
.
with a Gaussian using a Laplace approximation q = N ( fˆ, Λ−1 ). It
is generally not possible to obtain an analytical representation of
the mode of the Laplace approximation fˆ. Instead, fˆ is commonly
found using a second-order optimization scheme such as Newton’s
method.
(a) Find the precision matrix Λ of the Laplace approximation.
Hint: Observe that for a label yi ∈ {−1, +1}, the probability of a
correct classification given the latent value f i is p(yi | f i ) = σ (yi f i ),
where we use the symmetry of the probit likelihood around 0.
(b) Assume that k( x, x′ ) = x⊤ x′ is the linear kernel (σp = 1) and
that σ is the logistic function (5.9). Show for this setting that
the matrix Λ derived in (a) is equivalent to the precision matrix

Λ′ of the Laplace approximation of Bayesian logistic regression


(5.17). You may assume that fˆi = ŵ⊤ xi . (This should not be surprising
since, as already mentioned, Gaussian process classification is a
generalization of Bayesian logistic regression.)
Hint: First derive under which condition Λ and Λ′ are “equivalent”.
(c) Observe that the (approximate) latent predictive posterior

q( f ⋆ | x1:n , y1:n , x⋆ ) = ∫ p( f ⋆ | x1:n , x⋆ , f ) q( f | x1:n , y1:n ) d f ,

which uses the Laplace approximation of the latent posterior, is
Gaussian. Determine its mean and variance. (Using the Laplace-
approximated latent posterior, [ f ⋆ f ] are jointly Gaussian; thus, it
follows directly from Theorem 1.24 that the marginal distribution
over f ⋆ is also Gaussian.)
Hint: Condition on the latent variables f using the laws of total
expectation and variance.
(d) Compare the prediction p( f ⋆ | x1:n , y1:n , x⋆ ) you obtained in
(1) (but now using the Laplace-approximated latent predictive
posterior) to the prediction σ (E f ⋆ ∼q [ f ⋆ ]). Are they identical? If
not, describe how they are different.
3. The use of the probit likelihood may seem arbitrary. Consider the
following model which may be more natural,

f ∼ GP (0, k),    y = 1{ f ( x ) + ε ≥ 0},    ε ∼ N (0, σn² ),    (5.77)

where f ( x ) + ε is the noisy latent value of GP regression.

Show that the model from Equation (5.75) using a noise-free latent
process with probit likelihood Φ(z; 0, σn2 ) is equivalent (in expecta-
tion over ε) to the model from Equation (5.77).

5.3. Jensen’s inequality.


1. Prove the finite form of Jensen’s inequality.
Theorem 5.21 (Jensen’s inequality, finite form). Let f : Rn → R be a
convex function. Suppose that x1 , . . . , xk ∈ Rn and θ1 , . . . , θk ≥ 0 with
θ1 + · · · + θk = 1. Then,

f ( θ1 x1 + · · · + θ k x k ) ≤ θ1 f ( x1 ) + · · · + θ k f ( x k ). (5.78)

Observe that if X is a random variable with finite support, the


above two versions of Jensen’s inequality are equivalent.
2. Show that for any discrete distribution p supported on a finite do-
main of size n, H[ p] ≤ log2 n. This implies that the uniform distri-
bution has maximum entropy.

5.4. Binary cross-entropy loss.

Show that the logistic loss (5.13) is equivalent to the binary cross-
entropy loss with ŷ = σ ( fˆ). That is,

ℓlog ( fˆ; y) = ℓbce (ŷ; y). (5.79)



5.5. Gibbs’ inequality.


1. Prove KL( p∥q) ≥ 0 which is also known as Gibbs’ inequality.
2. Let p and q be discrete distributions with finite identical support
A. Show that KL( p∥q) = 0 if and only if p ≡ q.
Hint: Use that if a function f : Rn → R is strictly convex and x1 , . . . , xk ∈
Rn , θ1 , . . . , θk ≥ 0, θ1 + · · · + θk = 1, we have that

f ( θ1 x1 + · · · + θ k x k ) = θ1 f ( x1 ) + · · · + θ k f ( x k ) (5.80)

iff x1 = · · · = xk . This is a slight adaptation of Jensen’s inequality in


finite-form, which you proved in Problem 5.3.

5.6. Maximum entropy principle.

In this exercise we will prove that the normal distribution is the distri-
bution with maximal entropy among all (univariate) distributions sup-
.
ported on R with fixed mean µ and variance σ2 . Let g( x ) = N ( x; µ, σ2 ),
and f ( x ) be any distribution on R with mean µ and variance σ2 .
1. Prove that KL( f ∥ g) = H[ g] − H[ f ].
Hint: Equivalently, show that H[ f ∥ g] = H[ g]. That is, the expected
surprise evaluated based on the Gaussian g is invariant to the true distri-
bution f .
2. Conclude that H[ g] ≥ H[ f ].

5.7. Probabilistic inference as a consequence of the maximum en-


tropy principle.

Consider the family of generative models of the random vectors X in X


and Y in Y :
∆X ×Y = { q : X × Y → R≥0 : ∫X ×Y q( x, y) dx dy = 1 }.

Suppose that we observe Y to be y′ , and are looking for a (new) gen-


erative model that is consistent with this information, that is,
q(y) = ∫X q( x, y) dx = δy′ (y)    (using the sum rule (1.7))

where δy′ denotes the point density at y′ . The product rule (1.11)
implies that q( x, y) = δy′ (y) · q( x | y), but any choice of q( x | y) is
possible.

We will derive that given any fixed generative model pX,Y , the “pos-
terior” distribution qX (·) = pX|Y (· | y′ ) minimizes the relative entropy
KL(qX,Y ∥ pX,Y ) subject to the constraint Y = y′ . In other words, among
all distributions qX,Y that are consistent with the observation Y = y′ ,
the posterior distribution qX (·) = pX|Y (· | y′ ) is the one with “maxi-
mum entropy”.
110 probabilistic artificial intelligence

1. Show that the optimization problem

arg min_{q∈∆X ×Y} KL(qX,Y ∥ pX,Y )
subject to q(y) = δy′ (y) ∀y ∈ Y

is solved by q( x, y) = δy′ (y) · p( x | y).


Hint: Solve the dual problem.
2. Conclude that q( x) = p( x | y′ ).

5.8. KL-divergence of Gaussians.

Derive Equation (5.41).

Hint: If X ∼ N (µ, Σ ) in d dimensions, then we have that for any m ∈ Rd


and A ∈ Rd×d ,
E[(X − m)⊤ A(X − m)] = (µ − m)⊤ A(µ − m) + tr( AΣ )    (5.81)

5.9. Forward vs reverse KL.


1. Consider a factored approximation q( x, y) = q( x )q(y) to a joint
distribution p( x, y). Show that to minimize the forward KL( p∥q)
we should set q( x ) = p( x ) and q(y) = p(y), i.e., the optimal ap-
proximation is a product of marginals.
2. Consider the following joint distribution p, where the rows repre-
sent y and the columns x:

         x = 1   x = 2   x = 3   x = 4
y = 1     1/8     1/8      0       0
y = 2     1/8     1/8      0       0
y = 3      0       0      1/4      0
y = 4      0       0       0      1/4

Show that the reverse KL(q∥ p) for this p has three distinct minima.
Identify those minima and evaluate KL(q∥ p) at each of them.
3. What is the value of KL(q∥ p) if we use the approximation q( x, y) = p( x ) p(y)?

5.10. Gaussian VI vs Laplace approximation.

In this exercise, we compare the Laplace approximation from Sec-


tion 5.1 to variational inference with the variational family of Gaus-
sians,
.
Q = {q(θ) = N (θ; µ, Σ )}.

1. Let p be any distribution on R, and let q⋆ = arg minq∈Q KL( p∥q).


Show that q⋆ differs from the Laplace approximation of p.
variational inference 111

Minimizing forward-KL is typically intractable, and we have seen that


it is therefore common to minimize the reverse-KL instead:

q̃ = arg min_{q∈Q} KL(q ∥ p(· | Dn )).

2. Show that q̃ = N (µ, Σ ) satisfies Equation (5.58):

0 = Eθ∼q̃ [∇θ log p(y1:n , θ | x1:n )]


Σ −1 = −Eθ∼q̃ [ Hθ log p(y1:n , θ | x1:n )].

Hint 1: For any positive definite and symmetric matrix A, it holds that
∇A log det( A) = A−1 .
Hint 2: For any function f and Gaussian p = N (µ, Σ ),

∇µ Ex∼ p [ f ( x)] = Ex∼ p [∇x f ( x)],
∇Σ Ex∼ p [ f ( x)] = (1/2) Ex∼ p [ Hx f ( x)].    (5.82)
Recall the conditions satisfied by the Laplace approximation of the
posterior p(θ | D) ∝ p(y1:n , θ | x1:n ) as detailed in Equation (5.57). The
Laplace approximation is fitted locally at the MAP estimate θ̂. Compar-
ing Equation (5.57) to Equation (5.58), we see that Gaussian variational
inference satisfies the conditions of the Laplace approximation on av-
erage. For more details, refer to Opper and Archambeau (2009).

5.11. Gradient of reverse-KL.


.
Suppose p = N (0, σp2 I ) and a tractable distribution described by
.
qλ = N (µ, diag{σ12 , . . . , σd2 })
. .
where µ = [µ1 · · · µd ] and λ = [µ1 · · · µd σ1 · · · σd ]. Show that the
gradient of KL(qλ ∥ p(·)) with respect to λ is given by

∇µ KL(qλ ∥ p(·)) = σp⁻² µ, and    (5.83a)
∇[σ1 ··· σd ] KL(qλ ∥ p(·)) = [ σ1 /σp² − 1/σ1   · · ·   σd /σp² − 1/σd ].    (5.83b)

5.12. Reparameterizable distributions.


1. Let X ∼ Unif([ a, b]) for any a ≤ b. That is,

pX ( x ) = 1/(b − a) if x ∈ [ a, b], and pX ( x ) = 0 otherwise.    (5.84)

Show that X can be reparameterized in terms of Unif([0, 1]). Hint:


You may use that for any Y ∼ Unif([ a, b]) and c ∈ R,
• Y + c ∼ Unif([ a + c, b + c]) and
• cY ∼ Unif([c · a, c · b]).
112 probabilistic artificial intelligence

.
2. Let Z ∼ N (µ, σ² ) and X = e^Z ; that is, X is log-normally distributed
with parameters µ and σ². Show that X can be
reparameterized in terms of N (0, 1).
3. Show that Cauchy(0, 1) can be reparameterized in terms of Unif([0, 1]).
Finally, let us apply the reparameterization trick to compute the gra-
dient of an expectation.
.
4. Let ReLU(z) = max{0, z} and w > 0. Show that

(d/dµ) E_{x∼N (µ,1)} [ReLU(wx )] = wΦ(µ)

where Φ denotes the CDF of the standard normal distribution.


6
Markov Chain Monte Carlo Methods

Variational inference approximates the entire posterior distribution.


However, note that the key challenge in probabilistic inference is not
learning the posterior distribution, but using the posterior distribution
for predictions,
p(y⋆ | x⋆ , x1:n , y1:n ) = ∫ p(y⋆ | x⋆ , θ) p(θ | x1:n , y1:n ) dθ.    (6.1)

This integral can be interpreted as an expectation over the posterior


distribution,

= Eθ∼ p(·|x1:n ,y1:n ) [ p(y⋆ | x⋆ , θ)]. (6.2)


.
Observe that the likelihood f (θ) = p(y⋆ | x⋆ , θ) is easy to evaluate. The
difficulty lies in sampling from the posterior distribution. Assuming
we can obtain independent samples from the posterior distribution,
we can use Monte Carlo sampling to obtain an unbiased estimate of
the expectation,
(1/m) ∑_{i=1}^{m} f (θ(i) )    (6.3)

for independent samples θ(i) ∼ p(· | x1:n , y1:n ) (i.i.d.). The law of large
numbers (A.36) and Hoeffding’s inequality (A.41) imply that this estimator
is consistent and sharply concentrated (for more details, see Appendix A.3.3).
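As a concrete illustration, here is a minimal Python sketch (not from the text) of such a Monte Carlo estimate; the function and variable names, the toy Gaussian likelihood, and the stand-in “posterior samples” are all illustrative assumptions.

    import numpy as np

    def monte_carlo_predictive(posterior_samples, likelihood, x_star):
        """Estimate E_{θ∼p(·|x_{1:n},y_{1:n})}[p(y* | x*, θ)] by averaging over samples, cf. (6.3)."""
        return np.mean([likelihood(x_star, theta) for theta in posterior_samples])

    # Toy example: pretend the posterior over θ is N(1, 0.1²) and the likelihood
    # of a test point is Gaussian with mean θ and unit variance.
    rng = np.random.default_rng(0)
    samples = rng.normal(loc=1.0, scale=0.1, size=1_000)   # stand-in for posterior samples
    lik = lambda x, theta: np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)
    print(monte_carlo_predictive(samples, lik, x_star=1.2))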

Obtaining samples of the posterior distribution is therefore sufficient


to perform approximate inference. Recall that the difficulty of computing
the posterior p exactly was in finding the normalizing constant Z,

p( x ) = (1/Z) q( x ).    (6.4)
The joint likelihood q is typically easy to obtain. Note that q( x ) is pro-
portional to the probability density associated with x, but q does not

integrate to 1. Such a function is also called a finite measure. Without


normalizing q, we cannot directly sample from it.

Remark 6.1: The difficulty of sampling — even with a PDF


Even a decent approximation of Z does not yield a general effi-
cient sampling method. For example, one very common approach
to sampling is inverse transform sampling (cf. Appendix A.1.3)
which requires an (approximate) quantile function. Computing
the quantile function given an arbitrary PDF requires solving in-
tegrals over the domain of the PDF which is what we were trying
to avoid in the first place.

The key idea of Markov chain Monte Carlo methods is to construct


a Markov chain, which is efficient to simulate and has the stationary
distribution p.

6.1 Markov Chains

To start, let us revisit the fundamental theory behind Markov chains.

Definition 6.2 (Markov chain). A (finite and discrete-time) Markov chain


over the state space
.
S = {0, . . . , n − 1} (6.5)

is a stochastic process ( Xt )t∈N0 (i.e., a sequence of random variables)
valued in S such that the Markov property is satisfied:

Xt+1 ⊥ X0:t−1 | Xt .    (6.6)

[Figure 6.1: Directed graphical model of a Markov chain, X1 → X2 → X3 → · · ·.
The random variable Xt+1 is conditionally independent of the random
variables X0:t−1 given Xt .]

Intuitively, the Markov property states that future behavior is indepen-
dent of past states given the present state.

Remark 6.3: Generalizations of Markov chains


One can also define continuous-state Markov chains (for example,
where states are vectors in Rd ) and the results which we state for
(finite) Markov chains will generally carry over. For a survey, refer
to “General state space Markov chains and MCMC algorithms”
(Roberts and Rosenthal, 2004).

Moreover, one can also consider continuous-time Markov chains.


One example of such a continuous-space and continuous-time
Markov chain is the Wiener process (cf. Remark 6.23).

We restrict our attention to time-homogeneous Markov chains (that is,
Markov chains whose transition probabilities do not change over time), which

can be characterized by a transition function,


.
p( x ′ | x ) = P( Xt+1 = x ′ | Xt = x ).

(6.7)

As the state space is finite, we can describe the transition function by


the transition matrix,
 
P = [ p( x1 | x1 ) · · · p( xn | x1 ) ; . . . ; p( x1 | xn ) · · · p( xn | xn ) ] ∈ Rn×n ,    (6.8)

i.e., the entry of P in row i and column j is p( x j | xi ).
p ( x1 | x n ) ··· p( xn | xn )

Note that each row of P must always sum to 1. Such matrices are also
called stochastic.

The transition graph of a Markov chain is a directed graph consisting of


vertices S and weighted edges represented by the adjacency matrix P.

The current state of the Markov chain at time t is denoted by the


probability distribution qt over states S, that is, Xt ∼ qt . In the finite
setting, qt is a PMF, which is often written explicitly as the row vec-
tor qt ∈ R1×|S| . The initial state (or prior) of the Markov chain is given
as X0 ∼ q0 . One iteration of the Markov chain can then be expressed
as follows (see Problem 6.1):

qt+1 = qt P. (6.9)

It is implied directly that we can write the state of the Markov chain
at time t + k as

qt+k = qt P k . (6.10)

The entry Pk ( x, x ′ ) corresponds to the probability of transitioning from


state x ∈ S to state x ′ ∈ S in exactly k steps (see Problem 6.2). We denote this entry
by p(k) ( x ′ | x ).

In the analysis of Markov chains, there are two main concepts of inter-
est: stationarity and convergence. We begin by introducing stationar-
ity.

6.1.1 Stationarity
Definition 6.4 (Stationary distribution). A distribution π is stationary
with respect to the transition function p iff

π ( x ) = ∑_{x′∈S} p( x | x ′ ) π ( x ′ )    (6.11)

holds for all x ∈ S. It follows from Equation (6.9) that equivalently, π


is stationary w.r.t. a transition matrix P iff

π = πP. (6.12)

After entering a stationary distribution π, a Markov chain will always


remain in the stationary distribution. In particular, suppose that Xt is
distributed according to π, then for all k ≥ 0, Xt+k ∼ π.
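As a small numerical illustration (an assumed toy example, not from the text), one can verify Equations (6.9) and (6.12) for a two-state chain whose stationary distribution, [2/3, 1/3], matches the one stated for example (4) of Figure 6.2:

    import numpy as np

    # Transition matrix: state 1 stays with prob. 1/2 and moves to state 2 with
    # prob. 1/2; state 2 always moves back to state 1.
    P = np.array([[0.5, 0.5],
                  [1.0, 0.0]])

    q = np.array([1.0, 0.0])        # arbitrary initial distribution q0
    for _ in range(1_000):
        q = q @ P                   # one iteration of the Markov chain, Eq. (6.9)

    print(q)                        # ≈ [2/3, 1/3]
    print(np.allclose(q, q @ P))    # the limit satisfies π = πP, Eq. (6.12)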

Remark 6.5: When does a stationary distribution exist?


In general, there are Markov chains with infinitely many station-
ary distributions or no stationary distribution at all. You can find
some examples in Figure 6.2.

It can be shown that there exists a unique stationary distribution π


if the Markov chain is irreducible, that is, if every state is reach-
able from every other state with a positive probability when the
Markov chain is run for enough steps. Formally,

∀ x, x ′ ∈ S. ∃k ∈ N. p(k) ( x ′ | x ) > 0. (6.13)

Equivalently, a Markov chain is irreducible iff its transition graph


is strongly connected.

6.1.2 Convergence

Let us now consider Markov chains with a unique stationary distribution.
(Observe that the stationary distribution of an irreducible Markov chain
must have full support, that is, assign positive probability to every state.)
A natural next question is whether this Markov chain converges to its
stationary distribution. We say that a Markov chain converges to its
stationary distribution iff we have

lim_{t→∞} qt = π,    (6.14)

irrespective of the initial distribution q0 .

Remark 6.6: When does a Markov chain converge?


Even if a Markov chain has a unique stationary distribution, it
does not have to converge to it. Consider example (3) in Fig-
ure 6.2. Clearly, π = ( 12 , 12 ) is the unique stationary distribution.
However, observe that if we start with a suitable initial distribu-
tion such as q0 = (1, 0), at no point in time will the probability
of all states be positive, and in particular, the chain will not con-
verge to π. Instead, the chain behaves periodically, i.e., its state
distributions are q2t = (1, 0) and q2t+1 = (0, 1) for all t ∈ N0 . It
turns out that if we exclude such “periodic” Markov chains, then
the remaining (irreducible) Markov chains will always converge
to their stationary distribution.

Formally, a Markov chain is aperiodic if for all states x ∈ S,

∃k0 ∈ N. ∀k ≥ k0 . p(k) ( x | x ) > 0. (6.15)

In words, a Markov chain is aperiodic iff for every state x, the


transition graph has a closed path from x to x with length k for all
k ∈ N greater than some k0 ∈ N.

This additional property leads to the concept of ergodicity.

Definition 6.7 (Ergodicity). A Markov chain is ergodic iff there exists a
t ∈ N0 such that for any x, x ′ ∈ S we have

p(t) ( x ′ | x ) > 0,    (6.16)

whereby p(t) ( x ′ | x ) is the probability to reach x ′ from x in exactly t
steps. Equivalent conditions are
1. that there exists some t ∈ N0 such that all entries of Pt are strictly
positive; and
2. that it is irreducible and aperiodic.

[Figure 6.2: Transition graphs of Markov chains: (1) is not ergodic as its
transition diagram is not strongly connected; (2) is not ergodic for the same
reason; (3) is irreducible but periodic and therefore not ergodic; (4) is ergodic
with stationary distribution π (1) = 2/3, π (2) = 1/3.]

Example 6.8: Making a Markov chain ergodic

A commonly used strategy to ensure that a Markov chain is ergodic
is to add “self-loops” to every vertex in the transition graph. That is,
to ensure that at any point in time, the Markov chain remains with
positive probability in its current state.

Take a (not necessarily ergodic) but irreducible Markov chain with


transition matrix P. We define the new Markov chain
P′ = (1/2) P + (1/2) I.    (6.17)
It is a simple exercise to confirm that P′ is stochastic, and hence
a valid transition matrix. Also, it follows directly that P′ is irre-
ducible (as P is irreducible) and aperiodic as every vertex has a
closed path of length 1 to itself, and therefore the chain is ergodic.

Take now π to be a stationary distribution of P. We have that π


is also a stationary distribution of P′ as

πP′ = (1/2) πP + (1/2) π I = (1/2) π + (1/2) π = π.    (6.18)    (using (6.12))

Fact 6.9 (Fundamental theorem of ergodic Markov chains, theorem 4.9


of Levin and Peres (2017)). An ergodic Markov chain has a unique station-
ary distribution π (with full support) and
lim_{t→∞} qt = π    (6.19)

irrespective of the initial distribution q0 .

This naturally suggests constructing an ergodic Markov chain such


that its stationary distribution coincides with the posterior distribu-
tion. If we then sample “sufficiently long”, Xt is drawn from a distri-
bution that is “very close” to the posterior distribution.

Remark 6.10: How quickly does a Markov chain converge?


The convergence speed of Markov chains is a rich field of research.
“Sufficiently long” and “very close” are commonly made precise
by the notions of rapidly mixing Markov chains and total variation
distance.

Definition 6.11 (Total variation distance). The total variation dis-


tance between two probability distributions µ and ν on A is de-
fined by
.
∥µ − ν∥TV = 2 sup_{A⊆A} |µ( A) − ν( A)| .    (6.20)

It defines the distance between µ and ν to be the maximum dif-


ference between the probabilities that µ and ν assign to the same
event.

As opposed to the KL-divergence (5.34), the total variation dis-


tance is a metric. In particular, it is symmetric and satisfies the
triangle inequality. It can be shown that
∥µ − ν∥TV ≤ √(2 KL(µ∥ν))    (6.21)

which is known as Pinsker’s inequality. Moreover, if µ and ν are


discrete distributions over the set S, it can be shown that

∥µ − ν∥TV = ∑_{i∈S} |µ(i) − ν(i)| .    (6.22)

Definition 6.12 (Mixing time). For a Markov chain with stationary


distribution π, its mixing time with respect to the total variation
distance for any ϵ > 0 is
.
τTV (ϵ) = min{t | ∀q0 : ∥qt − π ∥TV ≤ ϵ}. (6.23)

Thus, the mixing time measures the time required by a Markov


chain for the distance to stationarity to be small. A Markov chain
is typically said to be rapidly mixing if for any ϵ > 0,

τTV (ϵ) ∈ O(poly(n, log(1/ϵ))). (6.24)



That is, a rapidly mixing Markov chain on n states needs to be


simulated for at most poly(n) steps to obtain a “good” sample
from its stationary distribution π.

You can find a thorough introduction to mixing times in chapter


4 of “Markov chains and mixing times” (Levin and Peres, 2017).
Later chapters introduce methods for showing that a Markov chain
is rapidly mixing.

6.1.3 Detailed Balance Equation


How can we confirm that the stationary distribution of a Markov chain
coincides with the posterior distribution? The detailed balance equa-
tion yields a very simple method.

Definition 6.13 (Detailed balance equation / reversibility). A Markov


chain satisfies the detailed balance equation with respect to a distribu-
tion π iff

π ( x ) p( x′ | x ) = π ( x′ ) p( x | x′ ) (6.25)

holds for any x, x ′ ∈ S. A Markov chain that satisfies the detailed


balance equation with respect to π is called reversible with respect to π.

Lemma 6.14. Given a finite Markov chain, if the Markov chain is reversible
with respect to π then π is a stationary distribution. (Note that reversibility
with respect to π is only a sufficient condition for stationarity of π, it is
not necessary! In particular, there are irreversible ergodic Markov chains.)

Proof. Let qt = π. We have,

qt+1 ( x ) = ∑_{x′∈S} p( x | x ′ ) qt ( x ′ )        (using the Markov property (6.6))
          = ∑_{x′∈S} p( x | x ′ ) π ( x ′ )
          = ∑_{x′∈S} p( x ′ | x ) π ( x )           (using the detailed balance equation (6.25))
          = π ( x ) ∑_{x′∈S} p( x ′ | x )
          = π ( x ).                                (using that ∑_{x′∈S} p( x ′ | x ) = 1)

That is, if we can show that the detailed balance equation (6.25) holds
for some distribution q, then we know that q is the stationary distribu-
tion of the Markov chain.

Next, reconsider our posterior distribution p( x ) = (1/Z) q( x ) from Equa-
tion (6.4). If we substitute the posterior for π in the detailed balance
equation, we obtain

(1/Z) q( x ) p( x ′ | x ) = (1/Z) q( x ′ ) p( x | x ′ ),    (6.26)

or equivalently,

q ( x ) p ( x ′ | x ) = q ( x ′ ) p ( x | x ′ ). (6.27)

In words, we do not need to know the true posterior p to check that


the stationary distribution of our Markov chain coincides with p; it
suffices to know the finite measure q!
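The following toy check (an assumed example, not from the text) illustrates this point: reversibility with respect to the unnormalized measure q can be verified without ever computing Z.

    import numpy as np

    q = np.array([3.0, 1.0])        # unnormalized target; the true distribution is q / q.sum()

    # A transition matrix that is reversible with respect to q (chosen by hand here).
    P = np.array([[2/3, 1/3],
                  [1.0, 0.0]])

    for x in range(2):
        for x_next in range(2):
            assert np.isclose(q[x] * P[x, x_next],
                              q[x_next] * P[x_next, x])   # q(x) p(x'|x) = q(x') p(x|x'), Eq. (6.27)

    print("stationary distribution:", q / q.sum())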

6.1.4 Ergodic Theorem


If we now suppose that we can construct a Markov chain whose sta-
tionary distribution coincides with the posterior distribution — we
will see later that this is possible — it is not apparent that this allows
us to estimate expectations over the posterior distribution. Note that
although constructing such a Markov chain allows us to obtain sam-
ples from the posterior distribution, they are not independent. In fact,
due to the structure of a Markov chain, by design, they are strongly
dependent. Thus, the law of large numbers and Hoeffding’s inequality
do not apply. By itself, it is not even clear that an estimator relying on
samples from a single Markov chain will be unbiased.

Theoretically, we could simulate many Markov chains separately and


obtain one sample from each of them. This, however, is extremely
inefficient. It turns out that there is a way to generalize the (strong)
law of large numbers to Markov chains.

Theorem 6.15 (Ergodic theorem, appendix C of Levin and Peres (2017)).


Given an ergodic Markov chain ( Xt )t∈N0 over a finite state space S with sta-
tionary distribution π and a function f : S → R,

(1/n) ∑_{i=1}^{n} f ( xi ) → ∑_{x∈S} π ( x ) f ( x ) = Ex∼π [ f ( x )] almost surely    (6.28)

as n → ∞, where xi is sampled according to p(· | xi−1 ).

This result is the fundamental reason for why Markov chain Monte
Carlo methods are possible. There are analogous results for continu-
ous domains.

[Figure 6.3: Illustration of the “burn-in” time t0 of a Markov chain
approximating the posterior p(y⋆ = 1 | X, y) of Bayesian logistic regression.
The true posterior p is shown in gray. The distribution of the Markov chain
at time t is shown in red.]

Note, however, that the ergodic theorem only tells us that simulating
a single Markov chain yields an unbiased estimator. It does not tell
us anything about the rate of convergence and variance of such an
estimator. The convergence rate depends on the mixing time of the
Markov chain, which is difficult to establish in general.

In practice, one observes that Markov chain Monte Carlo methods have
a so-called “burn-in” time during which the distribution of the Markov

chain does not yet approximate the posterior distribution well. Typi-
cally, the first t0 samples are therefore discarded,

E[ f ( X )] ≈ 1/( T − t0 ) ∑_{t=t0 +1}^{T} f ( Xt ).    (6.29)

It is not clear in general how T and t0 should be chosen such that the
estimator is unbiased; rather, they have to be tuned.

Another widely used heuristic is to first find the mode of the posterior
distribution and then start the Markov chain at that point. This tends
to increase the rate of convergence drastically, as the Markov chain
does not have to “walk to the location in the state space where most
probability mass will be located”.

6.2 Elementary Sampling Methods

We will now examine methods for constructing and sampling from a


Markov chain with the goal of approximating samples from the pos-
terior distribution p. Note that in this setting the state space of the
Markov chain is Rn and a single state at time t is described by the
.
random vector X = [ X1 , . . . , Xn ].

6.2.1 Metropolis-Hastings Algorithm


Suppose we are given a proposal distribution r ( x′ | x) which, given
we are in state x, proposes a new state x′ . Metropolis and Hastings
showed that using the acceptance distribution Bern(α( x′ | x)) where

α( x′ | x) = min{ 1, p( x′ ) r ( x | x′ ) / ( p( x ) r ( x′ | x ) ) }    (6.30)
           = min{ 1, q( x′ ) r ( x | x′ ) / ( q( x ) r ( x′ | x ) ) }    (6.31)

(similarly to the detailed balance equation, the normalizing constant Z cancels)

to decide whether to follow the proposal yields a Markov chain with
stationary distribution p( x) = (1/Z) q( x).

Algorithm 6.16: Metropolis-Hastings algorithm


1 initialize x ∈ Rn
2 for t = 1 to T do
3 sample x′ ∼ r ( x′ | x)
4 sample u ∼ Unif([0, 1])
5 if u ≤ α( x′ | x) then update x ← x′
6 else update x ← x
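A minimal Python sketch of Algorithm 6.16 with Gaussian random-walk proposals (anticipating Example 6.22) could look as follows; the target density, the proposal scale, and the number of iterations are illustrative choices.

    import numpy as np

    def metropolis_hastings(f, x0, n_steps=10_000, tau=0.5, rng=None):
        """Sample from p(x) ∝ exp(-f(x)) with symmetric proposals x' = x + τ·N(0, I)."""
        rng = rng if rng is not None else np.random.default_rng(0)
        x = np.asarray(x0, dtype=float)
        samples = []
        for _ in range(n_steps):
            x_prop = x + tau * rng.standard_normal(x.shape)      # propose x' ~ r(x' | x)
            log_alpha = min(0.0, f(x) - f(x_prop))               # log acceptance probability
            if np.log(rng.uniform()) <= log_alpha:               # accept with probability α(x' | x)
                x = x_prop
            samples.append(x.copy())
        return np.array(samples)

    # Example: standard Gaussian target with energy f(x) = ||x||² / 2.
    samples = metropolis_hastings(lambda x: 0.5 * np.sum(x**2), x0=np.zeros(2))
    print(samples[2_000:].mean(axis=0), samples[2_000:].std(axis=0))   # ≈ 0 and ≈ 1 after burn-in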

Intuitively, the acceptance distribution corrects for the bias in the pro-
posal distribution. That is, if the proposal distribution r is likely to
propose states with low probability under p, the acceptance distri-
bution will reject these proposals frequently. The following theorem
formalizes this intuition.

Theorem 6.17 (Metropolis-Hastings theorem). Given an arbitrary pro-


posal distribution r whose support includes the support of q, the stationary
distribution of the Markov chain simulated by the Metropolis-Hastings algo-
rithm is p( x) = (1/Z) q( x).

Proof. First, let us define the transition probabilities of the Markov


chain. The probability of transitioning from a state x to a state x′ is
given by r ( x′ | x)α( x′ | x) if x ̸= x′ and the probability of proposing to
remain in state x, r ( x | x), plus the probability of denying the proposal,
otherwise.

p( x′ | x) = r ( x′ | x) α( x′ | x) if x ̸= x′ , and
p( x′ | x) = r ( x | x) + ∑_{x′′ ̸= x} r ( x′′ | x)(1 − α( x′′ | x)) otherwise.    (6.32)

We will show that the stationary distribution is p by showing that p


satisfies the detailed balance equation (6.25). Let us fix arbitrary states
x and x′ . First, observe that if x = x′ , then the detailed balance equa-
tion is trivially satisfied. Without loss of generality we assume

α( x | x′ ) = 1,    α( x′ | x) = q( x′ ) r ( x | x′ ) / ( q( x ) r ( x′ | x ) ).

For x ̸= x′ , we then have,

p( x) · p( x′ | x) = (1/Z) q( x) p( x′ | x)                               (using the definition of the distribution p)
= (1/Z) q( x) r ( x′ | x) α( x′ | x)                                      (using the transition probabilities of the Markov chain)
= (1/Z) q( x) r ( x′ | x) · q( x′ ) r ( x | x′ ) / ( q( x ) r ( x′ | x ) )  (using the definition of the acceptance distribution α)
= (1/Z) q( x′ ) r ( x | x′ )
= (1/Z) q( x′ ) r ( x | x′ ) α( x | x′ )                                  (using the definition of the acceptance distribution α)
= (1/Z) q( x′ ) p( x | x′ )                                               (using the transition probabilities of the Markov chain)
= p( x′ ) · p( x | x′ ).                                                  (using the definition of the distribution p)

Note that by the fundamental theorem of ergodic Markov chains (6.19),


for convergence to the stationary distribution, it is sufficient for the

Markov chain to be ergodic. Ergodicity follows immediately when


the transition probabilities p(· | x) have full support. For example,
if the proposal distribution r (· | x) has full support, the full support
of p(· | x) follows immediately from Equation (6.32). The rate of con-
vergence of Metropolis-Hastings depends strongly on the choice of the
proposal distribution, and we will explore different choices of proposal
distribution in the following.

6.2.2 Gibbs Sampling


A popular example of a Metropolis-Hastings algorithm is Gibbs sam-
pling as presented in Algorithm 6.18.

Algorithm 6.18: Gibbs sampling


1 initialize x = [ x1 , . . . , xn ] ∈ Rn
2 for t = 1 to T do
3 pick a variable i uniformly at random from {1, . . . , n}
.
4 set x−i = [ x1 , . . . , xi−1 , xi+1 , . . . , xn ]
5 update xi by sampling according to the posterior distribution
p ( xi | x −i )
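For illustration, the following sketch runs Algorithm 6.18 on an assumed toy target, a standard bivariate Gaussian with correlation ρ, for which both conditionals p( xi | x−i ) are Gaussians that are easy to sample from.

    import numpy as np

    def gibbs_bivariate_gaussian(rho=0.8, n_steps=10_000, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        x = np.zeros(2)
        cond_std = np.sqrt(1.0 - rho**2)
        samples = []
        for _ in range(n_steps):
            i = rng.integers(2)                                  # pick a coordinate uniformly at random
            # For this target, p(x_i | x_{-i}) = N(ρ x_{-i}, 1 - ρ²).
            x[i] = rho * x[1 - i] + cond_std * rng.standard_normal()
            samples.append(x.copy())
        return np.array(samples)

    samples = gibbs_bivariate_gaussian()
    print(np.corrcoef(samples[1_000:].T))                        # empirical correlation ≈ ρ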

Intuitively, by re-sampling single coordinates according to the pos-


terior distribution given the other coordinates, Gibbs sampling finds
states that are successively “more” likely. Selecting the index i uni-
formly at random ensures that the underlying Markov chain is ergodic
provided the conditional distributions p(· | x−i ) have full support.

Theorem 6.19 (Gibbs sampling as Metropolis-Hastings). Gibbs sam-


pling is a Metropolis-Hastings algorithm. For any fixed i ∈ [n], it has
proposal distribution

ri ( x′ | x) = p( xi′ | x′−i ) if x′ differs from x only in entry i, and ri ( x′ | x) = 0 otherwise,    (6.33)

and acceptance distribution αi ( x′ | x) = 1.

Proof. We show that αi ( x′ | x) = 1 follows from the definition of an


acceptance distribution in Metropolis-Hastings (6.31) and the choice of
proposal distribution (6.33).

By (6.31),

αi ( x′ | x) = min{ 1, p( x′ ) ri ( x | x′ ) / ( p( x ) ri ( x′ | x ) ) }.

Note that p( x) = p( xi , x−i ) = p( xi | x−i ) p( x−i ) using the product rule
(1.11). Therefore,

αi ( x′ | x) = min{ 1, p( xi′ | x′−i ) p( x′−i ) ri ( x | x′ ) / ( p( xi | x−i ) p( x−i ) ri ( x′ | x ) ) }
= min{ 1, p( xi′ | x′−i ) p( x′−i ) p( xi | x−i ) / ( p( xi | x−i ) p( x−i ) p( xi′ | x′−i ) ) }    (using the proposal distribution (6.33))
= min{ 1, p( x′−i ) / p( x−i ) }
= 1.    (using that p( x′−i ) = p( x−i ))

If the index i is chosen uniformly at random as in Algorithm 6.18, then
the proposal distribution is r ( x′ | x) = (1/n) ∑_{i=1}^{n} ri ( x′ | x), which analo-
gously to Theorem 6.19 has the associated acceptance distribution

α( x′ | x) = min{ 1, p( x′ ) r ( x | x′ ) / ( p( x ) r ( x′ | x ) ) } = min{ 1, ∑_{i=1}^{n} p( x′ ) ri ( x | x′ ) / ∑_{i=1}^{n} p( x ) ri ( x′ | x ) } = 1.

Corollary 6.20 (Convergence of Gibbs sampling). As Gibbs sampling


is a specific example of an MH-algorithm, the stationary distribution of the
simulated Markov chain is p( x).

Note that for the proposals of Gibbs sampling, we have

p( xi | x−i ) = p( xi , x−i ) / ∑_{xi} p( xi , x−i ) = q( xi , x−i ) / ∑_{xi} q( xi , x−i ).    (6.34)

(using the definition of conditional probability (1.8) and the sum rule (1.7))
Under many models, this probability can be efficiently evaluated due
to the conditioning on the remaining coordinates x−i . If Xi has finite
support, the normalizer can be computed exactly.

6.3 Sampling using Gradients

In this section, we discuss more advanced sampling methods. The


main idea that we will study is the interpretation of sampling as an
optimization problem. We will build towards an optimization view
of sampling step-by-step, and first introduce what is commonly called
the “energy” of a distribution.

6.3.1 Gibbs Distributions


Gibbs distributions are a special class of distributions that are widely
used in machine learning, and which are characterized by an energy.

Definition 6.21 (Gibbs distribution). Formally, a Gibbs distribution (also


called a Boltzmann distribution) is a continuous distribution p whose
PDF is of the form
1
p( x) = exp(− f ( x)). (6.35)
Z

f : Rn → R is also called an energy function. When the energy func-


tion f is convex, its Gibbs distribution is called log-concave.

A useful property is that Gibbs distributions always have full support
(this can easily be seen as exp(·) > 0).
It is often easier to reason about “energies” rather than probabilities as
they are neither restricted to be non-negative nor do they have to inte-
grate to 1. Note that the Gibbs distribution belongs to the exponential
family (5.48) with sufficient statistic − f ( x).

Observe that the posterior distribution can always be interpreted as a


Gibbs distribution as long as prior and likelihood have full support,

p(θ | x1:n , y1:n ) = (1/Z) p(θ) p(y1:n | x1:n , θ)    (using Bayes’ rule (1.45))
                    = (1/Z) exp(−[− log p(θ) − log p(y1:n | x1:n , θ)]).    (6.36)
Thus, defining the energy function
.
f (θ) = − log p(θ) − log p(y1:n | x1:n , θ)    (6.37)
      = − log p(θ) − ∑_{i=1}^{n} log p(yi | xi , θ),    (6.38)

yields

p(θ | x1:n , y1:n ) = (1/Z) exp(− f (θ)).    (6.39)
Note that f coincides with the loss function used for MAP estimation
(1.62). For a noninformative prior, the regularization term vanishes
and the energy reduces to the negative log-likelihood ℓnll (θ; D) (i.e.,
the loss function of maximum likelihood estimation (1.57)).
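As a concrete (assumed) instance, the following sketch evaluates such an energy for Bayesian logistic regression with labels y ∈ {−1, +1}, where the regularization coefficient λ plays the role of the Gaussian prior; this is the same energy that reappears in Equation (6.53) of Problem 6.6.

    import numpy as np

    def energy(w, X, y, lam=1.0):
        """Negative log-posterior of Bayesian logistic regression, up to an additive constant."""
        reg = lam * np.sum(w**2)                            # from the Gaussian prior on w
        log_lik = -np.sum(np.log1p(np.exp(-y * (X @ w))))   # logistic log-likelihood
        return reg - log_lik

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 3))
    y = np.where(X @ np.array([1.0, -2.0, 0.5]) >= 0, 1.0, -1.0)
    print(energy(np.zeros(3), X, y))                        # = 50 · log 2 at w = 0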

Using that the posterior is a Gibbs distribution, we can rewrite the


acceptance distribution of Metropolis-Hastings,

α( x′ | x) = min{ 1, r ( x | x′ )/r ( x′ | x) · exp( f ( x) − f ( x′ )) }.    (6.40)

(this is obtained by substituting the PDF of a Gibbs distribution for the posterior)

Example 6.22: Metropolis-Hastings with Gaussian proposals


Let us consider Metropolis-Hastings with the Gaussian proposal
distribution,
.
r ( x′ | x) = N ( x′ ; x, τ I ). (6.41)

Due to the symmetry of Gaussians, we have

r ( x | x′ ) / r ( x′ | x) = N ( x; x′ , τ I ) / N ( x′ ; x, τ I ) = 1.

Hence, the acceptance distribution is defined by

α( x′ | x) = min{1, exp( f ( x) − f ( x′ ))}.    (6.42)

Intuitively, when a state with lower energy is proposed, that is
f ( x′ ) ≤ f ( x), then the proposal will always be accepted. In contrast,
if the energy of the proposed state is higher, the acceptance probability
decreases exponentially in the difference of energies f ( x) − f ( x′ ). Thus,
Metropolis-Hastings minimizes the energy function, which corresponds
to minimizing the negative log-likelihood and negative log-prior. The
variance in the proposals τ helps in getting around local optima, but the
search direction is uniformly random (i.e., “uninformed”).

[Figure 6.4: Metropolis-Hastings and Langevin dynamics minimize the
energy function f (θ ) shown in blue. Suppose we start at the black dot θ0 ,
then the black and red arrows denote possible subsequent samples.
Metropolis-Hastings uses an “uninformed” search direction, whereas
Langevin dynamics uses the gradient of f (θ ) to make “more promising”
proposals. The random proposals help get past local optima.]

6.3.2 From Energy to Surprise (and back)

Energy-based models are a well-known class of models in machine
learning where an energy function f is learned from data. These energy
functions do not need to originate from a probability distribution,
yet they induce a probability distribution via their Gibbs distribution
p( x) ∝ exp(− f ( x)). As we will see in Problem 6.7, this Gibbs distribution
is the associated maximum entropy distribution. Observe that the
surprise about x under this distribution is given by
surprise about x under this distribution is given by

S[ p( x)] = f ( x) + log Z. (6.43)

That is, up to a constant shift, the energy of x coincides with the surprise
about x. Energies are therefore sufficient for comparing the “likelihood”
of points, and they do not require normalization. (Intuitively, an energy
can be used to compare the “likelihood” of two points x and x′ , whereas
the probability of x makes a statement about the “likelihood” of x relative
to all other points.)

What kind of energies could we use? In Section 6.3.1, we discussed
the use of the negative log-posterior or negative log-likelihood as
energies. In general, any loss function ℓ( x) can be thought of as an
energy function with an associated maximum entropy distribution
p( x) ∝ exp(−ℓ( x)).
p( x) ∝ exp(−ℓ( x)).

6.3.3 Langevin Dynamics


Until now, we have looked at Metropolis-Hastings algorithms with
proposal distributions that do not explicitly take into account the cur-
vature of the energy function around the current state. Langevin dy-
namics adapts the Gaussian proposals of the Metropolis-Hastings al-
gorithm we have seen in Example 6.22 to search the state space in an
“informed” direction. The simple idea is to bias the sampling towards
states with lower energy, thereby making it more likely that a proposal
is accepted.

A natural idea is to shift the proposal distribution in the direction of the
negative gradient of the energy function. This yields the following proposal
distribution,

r ( x′ | x) = N ( x′ ; x − ηt ∇ f ( x), 2ηt I ). (6.44)

The resulting variant of Metropolis-Hastings is known as the Metropo-


lis adjusted Langevin algorithm (MALA) or Langevin Monte Carlo (LMC).
It can be shown that, as ηt → 0, the acceptance probability α( x′ | x) → 1
(using that the acceptance probability is 1 if x′ = x).
Hence, the Metropolis-Hastings acceptance step can be omitted once
the rejection probability becomes negligible. The algorithm which al-
ways accepts the proposal of Equation (6.44) is known as the unadjusted
Langevin algorithm (ULA).
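A minimal sketch of ULA, which always accepts the proposal (6.44), is given below; the target, step size, and number of steps are illustrative assumptions, and MALA/LMC would additionally apply the Metropolis-Hastings acceptance step (6.40).

    import numpy as np

    def ula(grad_f, theta0, eta=1e-2, n_steps=10_000, rng=None):
        """Unadjusted Langevin algorithm for p(θ) ∝ exp(-f(θ))."""
        rng = rng if rng is not None else np.random.default_rng(0)
        theta = np.asarray(theta0, dtype=float)
        samples = []
        for _ in range(n_steps):
            eps = np.sqrt(2 * eta) * rng.standard_normal(theta.shape)   # ε ~ N(0, 2ηI)
            theta = theta - eta * grad_f(theta) + eps                   # θ' = θ - η ∇f(θ) + ε
            samples.append(theta.copy())
        return np.array(samples)

    # Example: standard Gaussian target, f(θ) = ||θ||²/2 and ∇f(θ) = θ.
    samples = ula(lambda th: th, theta0=np.zeros(2))
    print(samples[2_000:].std(axis=0))                                  # ≈ 1, up to discretization bias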

Observe that if the stationary distribution is the posterior distribution

p(θ | x1:n , y1:n ) = (1/Z) exp( log p(θ) + ∑_{i=1}^{n} log p(yi | xi , θ) ),

where the term inside the exponential equals − f (θ), with energy f as we
discussed in Section 6.3.1, then the proposal θ′ of MALA/LMC can be
equivalently formulated as

θ′ ← θ − ηt ∇ f (θ) + ε
   = θ + ηt ( ∇ log p(θ) + ∑_{i=1}^{n} ∇ log p(yi | xi , θ) ) + ε    (6.45)

where ε ∼ N (0, 2ηt I ).

Remark 6.23: ULA is a discretized diffusion


The unadjusted Langevin algorithm can be seen as a discretiza-
tion of Langevin dynamics, which is a continuous-time stochastic
process with a drift and with random stationary and independent
Gaussian increments. The randomness is modeled by a Wiener
process.

Definition 6.24 (Wiener process). The Wiener process (also known


as Brownian motion) is a sequence of random vectors {Wt }t≥0 such
that
1. W0 = 0,
2. with probability 1, Wt is continuous in t,
3. the process has independent increments (that is, the “future”
increments Wt+u − Wt for u ≥ 0 are independent of past values Ws
for s < t), and
4. Wt+u − Wt ∼ N (0, uI ).
Consider the continuous-time stochastic process θ defined by the

stochastic differential equation (SDE)



dθt = −∇ f (θt ) dt + √2 dWt ,    (6.46)

Such a stochastic process is also called a diffusion (process) and


Equation (6.46) specifically is called Langevin dynamics. Here, the
first term is called the “drift” of the process, and the second term
is called its “noise”. Note that if the noise term is zero then Equa-
tion (6.46) is simply an ordinary differential equation (ODE).

A diffusion can be discretized using the Euler-Maruyama method


(also called “forward Euler”) to obtain a discrete approximation
θk ≈ θ(τk ) where τk denotes the k-th time step. Choosing the time
.
steps such that ∆tk = τk+1 − τk = ηk yields the approximation

θk+1 = θk − ∇ f (θk ) ∆tk + 2 ∆Wk (6.47)
.
where ∆Wk = Wτk+1 − Wτk ∼ N (0, ∆tk I ). Observe that this co-
incides with the update rule of Langevin dynamics from Equa-
tion (6.45).

The appearance of drift and noise is closely related to the principle


of curiosity and conformity which we encountered in the previous chapter on
variational inference. The noise term (also called the diffusion) leads
to exploration of the state space (i.e., curiosity about alternative expla-
nations of the data), whereas the drift term (also called the distillation)
leads to minimizing the energy or loss (i.e., conformity to the data).
Interestingly, this same principle appears in both variational inference
and MCMC, two very different approaches to approximate probabilis-
tic inference. In the remainder of this manuscript we will find this to
be a reoccurring theme.

For log-concave distributions, the mixing time of the Markov chain


underlying Langevin dynamics can be shown to be polynomial in the
dimension n (Vempala and Wibisono, 2019). You will prove this result
for strongly log-concave distributions in Problem 6.9 and see that the analysis is
analogous to the canonical convergence analysis of classical optimiza-
tion schemes.

6.3.4 Stochastic Gradient Langevin Dynamics


Note that computing the gradient of the energy function, which corre-
sponds to computing exact gradients of the log-prior and log-likelihood,
in every step can be expensive. The proposal step of MALA/LMC can
be made more efficient by approximating the gradient with an unbi-

ased gradient estimate, leading to stochastic gradient Langevin dynamics


(SGLD) shown in Algorithm 6.25 (Welling and Teh, 2011). Observe that
SGLD (6.48) differs from MALA/LMC (6.45) only by using a sample-
based approximation of the gradient.

Algorithm 6.25: Stochastic gradient Langevin dynamics, SGLD


1 initialize θ
2 for t = 1 to T do
3 sample i1 , . . . , im ∼ Unif({1, . . . , n}) independently
4 sample ε ∼ N (0, 2ηt I )
 
5 θ ← θ + ηt ( ∇ log p(θ) + (n/m) ∑_{j=1}^{m} ∇ log p(yij | xij , θ) ) + ε // (6.48)

Intuitively, in the initial phase of the algorithm, the stochastic gra-


dient term dominates, and therefore, SGLD corresponds to a variant
of stochastic gradient ascent. In the later phase, the update rule is
dominated by the injected noise ε, and will effectively be Langevin
dynamics. SGLD transitions smoothly between the two phases.

Under additional assumptions, SGLD is guaranteed to converge to the


posterior distribution for decreasing learning rates ηt = Θ(t−1/3 ) (Ra-
ginsky et al., 2017; Xu et al., 2018). SGLD does not use the acceptance
step from Metropolis-Hastings as asymptotically, SGLD corresponds
to Langevin dynamics and the Metropolis-Hastings rejection probabil-
ity goes to zero for a decreasing learning rate.
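The following sketch (with assumed model-specific gradients and one possible choice of decreasing step size) spells out the SGLD update (6.48) for a toy model.

    import numpy as np

    def sgld(grad_log_prior, grad_log_lik, ys, theta0, m=32, n_steps=5_000, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        n = len(ys)
        theta = np.asarray(theta0, dtype=float)
        samples = []
        for t in range(1, n_steps + 1):
            eta = 0.1 * t ** (-1 / 3)                                # decreasing step size η_t = Θ(t^{-1/3})
            idx = rng.integers(n, size=m)                            # minibatch i_1, ..., i_m
            grad = grad_log_prior(theta) + (n / m) * sum(
                grad_log_lik(theta, ys[i]) for i in idx)             # unbiased gradient estimate
            eps = np.sqrt(2 * eta) * rng.standard_normal(theta.shape)
            theta = theta + eta * grad + eps                         # update (6.48)
            samples.append(theta.copy())
        return np.array(samples)

    # Toy model: prior θ ~ N(0, 1) and observations y_i ~ N(θ, 1).
    rng = np.random.default_rng(1)
    ys = rng.normal(2.0, 1.0, size=500)
    samples = sgld(lambda th: -th, lambda th, yi: yi - th, ys, theta0=np.zeros(1))
    print(samples[1_000:].mean())                                    # ≈ posterior mean n·ȳ/(n + 1)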

6.3.5 Hamiltonian Monte Carlo


As MALA and SGLD can be seen as a sampling-based analogue of
GD and SGD, a similar analogue for (stochastic) gradient descent with
momentum is the (stochastic gradient) Hamiltonian Monte Carlo (HMC)
algorithm, which we discuss in the following (Duane et al., 1987; Chen
et al., 2014).

We have seen that if we want to sample from a distribution

p( x) ∝ exp(− f ( x))

with energy function f , we can construct a Markov chain whose dis-


tribution converges to p. We have also seen that for this approach to
work, the chain must move through all areas of significant probability
with reasonable speed.

If one is faced with a distribution p which is multimodal (i.e., that


has several “peaks”), one has to ensure that the chain will explore all
modes, and can therefore “jump between different areas of the space”.

[Figure 6.5: A commutative diagram of sampling and optimization
algorithms, relating GD, SGD, GD/SGD with momentum, LD, SGLD, HMC,
and SG-HMC along the dimensions “stochastic”, “momentum”, and
“sampling”. Langevin dynamics (LD) is the non-stochastic variant of SGLD.]

So in general, local updates are doomed to fail. Methods such as


Metropolis-Hastings with Gaussian proposals, or even Langevin Monte
Carlo might face this issue, as they do not jump to distant areas of the
state space with significant acceptance probability. It will therefore
take a long time to move from one peak to another.

The HMC algorithm is an instance of Metropolis-Hastings which uses


momentum to propose distant points that conserve energy, with high
acceptance probability. The general idea of HMC is to lift samples x
to a higher-order space by considering an auxiliary variable y with
the same dimension as x. We also lift the distribution p to a distribu-
tion on the ( x, y)-space by defining a distribution p(y | x) and setting
.
p( x, y) = p(y | x) p( x). It is common to pick p(y | x) to be a Gaussian
with zero mean and variance mI. Hence,
 
p( x, y) ∝ exp( −(1/(2m)) ∥y∥₂² − f ( x ) ).    (6.49)

Physicists might recognize the above as the canonical distribution of a


Newtonian system if one takes x as the position and y as the momen-
tum. H ( x, y) = (1/(2m)) ∥y∥₂² + f ( x) is called the Hamiltonian. HMC then
takes a step in this higher-order space according to the Hamiltonian
dynamics (that is, HMC follows the trajectory of these dynamics for some
time),

dx/dt = ∇y H ,    dy/dt = −∇x H ,    (6.50)
dt dt

reaching some new point ( x′ , y′ ) and projecting back to the state space
by selecting x′ as the new sample. This is illustrated in Figure 6.6.
In the next iteration, we resample the momentum y′ ∼ p(· | x′ ) and
repeat the procedure.

[Figure 6.6: Illustration of Hamiltonian Monte Carlo. Shown is the contour
plot of a distribution p, which is a mixture of two Gaussians, in the
( x, y)-space. First, the initial point in the state space is lifted to the
( x, y)-space. Then, we move according to Hamiltonian dynamics and finally
project back onto the state space.]

In an implementation of this algorithm, one has to solve Equation (6.50)
numerically rather than exactly. Typically, this is done using the Leapfrog

method, which for a step size τ computes

y(t + τ/2) = y(t) − (τ/2) ∇x f ( x(t))    (6.51a)
x(t + τ ) = x(t) + (τ/m) y(t + τ/2)    (6.51b)
y(t + τ ) = y(t + τ/2) − (τ/2) ∇x f ( x(t + τ )).    (6.51c)

Then, one repeats this procedure L times to arrive at a point ( x′ , y′ ).


To correct for the resulting discretization error, the proposal is either
accepted or rejected in a final Metropolis-Hastings acceptance step. If
the proposal distribution is symmetric (which we will confirm in a
moment), the acceptance probability is
.
α(( x′ , y′ ) | ( x, y)) = min{1, exp( H ( x, y) − H ( x′ , y′ ))}. (6.52)

It follows that p( x, y) is the stationary distribution of the Markov chain


underlying HMC. Due to the independence of x and y, this also im-
plies that the projection to x yields a Markov chain with stationary
distribution p( x).
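A minimal sketch of HMC with the Leapfrog integrator (6.51) and the acceptance step (6.52) follows; unit mass m = 1, step size τ, and path length L are illustrative choices.

    import numpy as np

    def hmc(f, grad_f, x0, tau=0.1, L=20, n_iters=2_000, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        x = np.asarray(x0, dtype=float)
        samples = []
        for _ in range(n_iters):
            y = rng.standard_normal(x.shape)                  # resample momentum, p(y | x) = N(0, I)
            x_new, y_new = x.copy(), y.copy()
            for _ in range(L):                                # L Leapfrog steps, Eq. (6.51)
                y_new = y_new - 0.5 * tau * grad_f(x_new)
                x_new = x_new + tau * y_new
                y_new = y_new - 0.5 * tau * grad_f(x_new)
            H_old = 0.5 * y @ y + f(x)                        # Hamiltonian H(x, y) with m = 1
            H_new = 0.5 * y_new @ y_new + f(x_new)
            if np.log(rng.uniform()) <= H_old - H_new:        # accept with prob. min{1, exp(H - H')}
                x = x_new
            samples.append(x.copy())
        return np.array(samples)

    # Example: standard Gaussian target, f(x) = ||x||²/2.
    samples = hmc(lambda x: 0.5 * x @ x, lambda x: x, x0=np.zeros(2))
    print(samples[500:].mean(axis=0), samples[500:].std(axis=0))      # ≈ 0 and ≈ 1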

So why is the proposal distribution symmetric? This follows from


the time-reversibility of Hamiltonian dynamics. It is straightforward
to check that the dynamics from Equation (6.50) are identical if we
replace t with −t and y with −y. Intuitively, unlike the position x,
the momentum y is reversed when time is reversed as it depends on
the velocity which is the time-derivative of the position. (The momentum
of the i-th coordinate is yi = mi vi where mi is the mass and vi = dxi /dt
is the velocity.) In simpler terms, time-reversibility states that if we
observe the evolution of a system (e.g., two billiard balls colliding), we
cannot distinguish whether
tem (e.g., two billiard balls colliding), we cannot distinguish whether

we are observing the system evolve forward or backward in time. The


Leapfrog method maintains the time-reversibility of the dynamics.

Symmetry of the proposal distribution is ensured by proposing the


point ( x′ , −y′ ) (more formally, the proposal distribution is the Dirac delta
at ( x′ , −y′ )). Intuitively, this is simply to ensure that the system is
run backward in time as often as it is run forward in time. Recall that
the momentum is resampled before each iteration (i.e., the proposed
momentum is “discarded”) and observe that p( x′ , −y′ ) = p( x′ , y′ ) (using
that p(y | x) was chosen to be symmetric around zero), so we can safely
disregard the direction of time when computing the acceptance
probability in Equation (6.52).

acceptance probability in Equation (6.52).

Discussion

To summarize, Markov chain Monte Carlo methods use a Markov


chain to approximately sample from an intractable distribution. Note
that unlike for variational inference, the convergence of many methods
can be guaranteed. Moreover, for log-concave distributions (e.g., with
Bayesian logistic regression), the underlying Markov chain converges
quickly to the stationary distribution. Methods such as Langevin dy-
namics and Hamiltonian Monte Carlo aim to accelerate mixing by
proposing points with a higher acceptance probability than Metropolis-
Hastings with “undirected” Gaussian proposals. Nevertheless, in gen-
eral, the convergence (mixing time) may be slow, meaning that, in
practice, accuracy and efficiency have to be traded.

Optional Readings
• Ma, Chen, Jin, Flammarion, and Jordan (2019).
Sampling can be faster than optimization.
• Teh, Thiery, and Vollmer (2016).
Consistency and fluctuations for stochastic gradient Langevin dy-
namics.
• Chen, Fox, and Guestrin (2014).
Stochastic gradient hamiltonian monte carlo.

Problems

6.1. Markov chain update.

Prove Equation (6.9), i.e., that one iteration of the Markov chain can be
expressed as qt+1 = qt P.

6.2. k-step transitions.

Prove that the entry Pk ( x, x ′ ) corresponds to the probability of transi-


tioning from state x ∈ S to state x ′ ∈ S in exactly k steps.

6.3. Finding stationary distributions.

A news station classifies each day as “good”, “fair”, or “poor” based


on its daily ratings which fluctuate with what is occurring in the news.
Moreover, the following table shows the probability of the type of the
next day conditioned on the type of the current day.

                          next day
                   good     fair     poor
current    good    0.60     0.30     0.10
day        fair    0.50     0.25     0.25
           poor    0.20     0.40     0.40

In the long run, what percentage of news days will be classified as


“good”?

6.4. Example of Metropolis-Hastings.

Consider the state space {0, 1}n of binary strings having length n. Let
the proposal distribution be r ( x ′ | x ) = 1/n if x ′ differs from x in
exactly one bit and r ( x ′ | x ) = 0 otherwise. Suppose we desire a sta-
tionary distribution p for which p( x ) is proportional to the number of
ones that occur in the bit string x. For example, in the long run, a ran-
dom walk should visit a string having five 1s five times as often as it
visits a string having only a single 1. Provide a general formula for the
acceptance probability α( x ′ | x ) that would be used if we were to ob-
tain the desired stationary distribution using the Metropolis-Hastings
algorithm.

6.5. Practical examples of Gibbs sampling.

In this exercise, we look at some examples where Gibbs sampling is


useful.
1. Consider the distribution
 
. n x + α −1
p( x, y) = y (1 − y ) n − x + β −1 , x ∈ [n], y ∈ [0, 1]. Gamma distribution The PDF of the
x
gamma distribution Gamma(α, β) is de-
Convince yourself that it is hard to sample directly from p and fined as

prove that it is an easy task if one uses Gibbs sampling. That is, Gamma( x; α, β) ∝ x α−1 e− βx , x ∈ R>0 .
show that the conditional distributions p( x | y) and p(y | x ) are A random variable X ∼ Gamma(α, β)
easy to sample from. measures the waiting time until α > 0
events occur in a Poisson process with
Hint: Take a look at the Beta distribution (1.50). rate β > 0. In particular, when α = 1
2. Consider the following generative model p(µ, λ, x1:n ) given by the then the gamma distribution coincides
iid with the exponential distribution with
likelihood x1:n | µ, λ ∼ N (µ, λ−1 ) and the independent priors rate β.

µ ∼ N (µ0 , λ0−1 ) and λ ∼ Gamma(α, β).



We would like to sample from the posterior p(µ, λ | x1:n ). Show


that

µ | λ, x1:n ∼ N (mλ , lλ−1 ) and λ | µ, x1:n ∼ Gamma( aµ , bµ ),

and derive mλ , lλ , aµ , bµ . Such a prior is called a semi-conjugate prior


to the likelihood, as the prior on µ is conjugate for any fixed value
of λ and vice-versa.
iid
3. Let us assume that x1:n | α, c ∼ Pareto(α, c) and assume the im-
proper prior p(α, c) ∝ 1{α, c > 0} which corresponds to a noninfor-
mative prior. Derive the posterior p(α, c | x1:n ). Then, also derive
the conditional distributions p(α | c, x1:n ) and p(c | α, x1:n ), and
observe that they correspond to known distributions / are easy to
sample from.

6.6. Energy function of Bayesian logistic regression.

Recall from Equation (5.12) that the energy function of Bayesian logis-
tic regression is
f (w) = λ ∥w∥₂² + ∑_{i=1}^{n} log(1 + exp(−yi w⊤ xi )),    (6.53)

which coincided with the standard optimization objective of (regular-


ized) logistic regression.

Show that the posterior distribution of Bayesian logistic regression is


log-concave.

6.7. Maximum entropy property of Gibbs distribution.


1. Let X be a random variable supported on the finite set T ⊂ R. (The same result can be shown to hold for arbitrary compact subsets.)
The same result can be shown to hold
Show that the Gibbs distribution with energy function T1 f ( x ) for for arbitrary compact subsets.

some temperature scalar T ∈ R is the distribution with maximum


entropy of all distributions supported on T that satisfy the con-
straint E[ f ( X )] < ∞.
Hint: Solve the dual problem (analogously to Problem 5.7).
2. What happens for T → {0, ∞}?

6.8. Energy reduction of Gibbs sampling.

Let p( x) be a probability density over Rd , which we want to sample


from. Assume that p is a Gibbs distribution with energy function
f : Rd → R.

In this exercise, we will study a single round of Gibbs sampling with


initial state x and final state x′ , where x′j = xi′ if j = i and x′j = x j otherwise,

for some fixed index i and xi′ ∼ p(· | x−i ).

Show that

E_{xi′ ∼ p(·| x−i )} [ f ( x′ )] ≤ f ( x) − S[ p( xi | x−i )] + H[ p(· | x−i )].    (6.54)

That is, the energy is expected to decrease if the surprise of xi given


x−i is larger than the expected surprise of the new xi′ given x−i , i.e.,
S[ p( xi | x−i )] ≥ H[ p(· | x−i )].

Hint: Recall the framing of Gibbs sampling as a variant of Metropolis-Hastings


and relate this to the acceptance distribution of Metropolis-Hastings when p
is a Gibbs distribution.

6.9. Mixing time of Langevin dynamics.

In this exercise, we will show that for certain Gibbs distributions,


p(θ) ∝ exp(− f (θ)), Langevin dynamics is rapidly mixing. To do this,
we will observe that Langevin dynamics can be seen as a continuous-
time optimization algorithm in the space of distributions.

First, we consider a simpler and more widely-known optimization al-


gorithm, namely the gradient flow

dxt = −∇ f ( xt ) dt. (6.55)

Note that gradient descent is simply the discrete-time approximation


of gradient flow just as ULA is the discrete-time approximation of
Langevin dynamics. In the analysis of ODEs such as the gradient flow,
so-called Lyapunov functions are commonly used to prove convergence
of xt to a fixed point (also called an equilibrium).

Let us assume that f is α-strongly convex for some α > 0, that is,
f (y) ≥ f ( x) + ∇ f ( x)⊤ (y − x) + (α/2) ∥y − x∥₂²  ∀ x, y ∈ Rn .    (6.56)
In words, f is lower bounded by a quadratic function with curvature α.
Moreover, assume w.l.o.g. that f is minimized at f (0) = 0. (This can always
be achieved by shifting the coordinate system and subtracting a constant from f .)
1. Show that f satisfies the Polyak-Łojasiewicz (PL) inequality, i.e.,
f ( x) ≤ (1/(2α)) ∥∇ f ( x)∥₂²  ∀ x ∈ Rn .    (6.57)
f ( x) ≤ ∥∇ f ( x)∥22 ∀ x ∈ Rn . (6.57)

d
2. Prove dt f ( xt ) ≤ −2α f ( xt ).
Thus, 0 is the fixed point of Equation (6.55) and the Lyapunov func-
tion f is monotonically decreasing along the trajectory of xt . We
recall Grönwall’s inequality which states that for any real-valued con-
tinuous functions g(t) and β(t) on the interval [0, T ] ⊂ R such that
(d/dt) g(t) ≤ β(t) g(t) for all t ∈ [0, T ] we have
g(t) ≤ g(0) exp( ∫0t β(s) ds )  ∀t ∈ [0, T ].    (6.58)

3. Conclude that f ( xt ) ≤ e−2αt f ( x0 ).

Now that we have proven the convergence of gradient flow using f as


Lyapunov function, we will follow the same template to prove the con-
vergence of Langevin dynamics to the distribution p(θ) ∝ exp(− f (θ)).
We will use that the evolution of {θt }t≥0 following the Langevin dy-
namics (6.46) is equivalently characterized by their densities {qt }t≥0
following the Fokker-Planck equation
∂qt /∂t = ∇ · (qt ∇ f ) + ∆qt .    (6.59)
Here, ∇· and ∆ are the divergence and Laplacian operators, respec-
tively (for ease of notation, we omit the explicit dependence of qt , p, and f
on θ). Intuitively, the first term of the Fokker-Planck equation cor-
responds to the drift and its second term corresponds to the diffu-
responds to the drift and its second term corresponds to the diffu- plicit dependence of qt , p, and f on θ.

sion (i.e., the Gaussian noise).

Remark 6.26: Intuition on vector calculus


Recall that the divergence ∇ · F of a vector field F measures the
change of volume under the flow of F. That is, if in the small
neighborhood of a point x, F points towards x, then the divergence
at x is negative as the volume shrinks. If F points away from x,
then the divergence at x is positive as the volume increases.
The Laplacian ∆φ = ∇ · (∇ φ) of a scalar field φ can be under-
stood intuitively as measuring “heat dissipation”. That is, if φ( x)
is smaller than the average value of φ in a small neighborhood of
x, then the Laplacian at x is positive.

Regarding the Fokker-Planck equation (6.59), the second term ∆qt


can therefore be understood as locally dissipating the probability
mass of qt (which is due to the diffusion term in the SDE). On the
other hand, the term ∇ · (qt ∇ f ) can be understood as a Laplacian
of f “weighted” by qt . Intuitively, the vector field ∇ f moves flow
in the direction of high energy, and hence, its divergence is larger
in regions of lower energy and smaller in regions of higher energy.
This term therefore corresponds to a drift from regions of high
energy to regions of low energy.

4. Show that ∆qt = ∇ · (qt ∇ log qt ), implying that the Fokker-Planck


equation simplifies to
 
      ∂qt /∂t = ∇ · ( qt ∇ log(qt /p) ).   (6.60)
Hint: The Laplacian of a scalar field φ is ∆φ = ∇ · (∇ φ).
Observe that the Fokker-Planck equation already implies that p is in-
deed a stationary distribution, as if qt = p then ∂qt /∂t = 0. Moreover, note

the similarity of the integrand of KL(qt ∥ p), namely qt log(qt /p), to Equation (6.60).
We will therefore use the KL-divergence as Lyapunov function.
5. Prove that d/dt KL(qt ∥ p) = −J(qt ∥ p). Here,

      J(qt ∥ p) = Eθ∼qt [ ∥∇ log(qt (θ)/p(θ))∥22 ]   (6.61)

denotes the relative Fisher information of p with respect to qt . Hint:


For any distribution q on Rn ,
      ∫Rn (∇ · (qF)) φ dx = − ∫Rn q ∇ φ · F dx   (6.62)

follows for any vector field F and scalar field φ from the divergence theo-
rem and the product rule of the divergence operator.
Thus, the relative Fisher information can be seen as the negated time-
derivative of the KL-divergence, and as J(qt ∥ p) ≥ 0 it follows that the
KL-divergence is decreasing along the trajectory.

The log-Sobolev inequality (LSI) is satisfied by a distribution p with a


constant α > 0 if for all q:

   KL(q∥ p) ≤ (1/(2α)) J(q∥ p).   (6.63)

It is a classical result that if f is α-strongly convex then p satisfies the
LSI with constant α (Bakry and Émery, 2006).
6. Show that if f is α-strongly convex for some α > 0 (we say that p
is “strongly log-concave”), then KL(qt ∥ p) ≤ e−2αt KL(q0 ∥ p).
7. Conclude that under the same assumption on f , Langevin dynam-
ics is rapidly mixing, i.e., τTV (ϵ) ∈ O(poly(n, log(1/ϵ))).
To summarize, we have seen that Langevin dynamics is an optimiza-
tion scheme in the space of distributions, and that its convergence can
be analyzed analogously to classical optimization schemes. Notably,
in this exercise we have studied continuous-time Langevin dynamics.
Convergence guarantees for discrete-time approximations can be de-
rived using the same techniques. If this interests you, refer to “Rapid
convergence of the unadjusted Langevin algorithm: Isoperimetry suf-
fices” (Vempala and Wibisono, 2019).
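
As a complement to the analysis above, the following minimal Python/NumPy sketch (not part of the exercise) runs ULA on the strongly log-concave quadratic energy f (θ) = (α/2)∥θ∥22 , whose Gibbs distribution is N (0, I/α); the step size, number of steps, and number of chains are arbitrary illustrative choices.

    import numpy as np

    # ULA on p(theta) ∝ exp(-f(theta)) with f(theta) = 0.5 * alpha * ||theta||^2,
    # i.e., the target is N(0, I/alpha). Illustrative hyperparameters only.
    rng = np.random.default_rng(0)
    alpha, eta, n_chains, d = 2.0, 0.01, 10_000, 2

    def grad_f(theta):
        return alpha * theta  # gradient of the quadratic energy

    theta = rng.normal(size=(n_chains, d)) * 5.0  # deliberately poor initialization
    for _ in range(500):
        noise = rng.normal(size=(n_chains, d))
        theta = theta - eta * grad_f(theta) + np.sqrt(2 * eta) * noise  # ULA step

    print(theta.var(axis=0))  # approaches 1/alpha = 0.5 (up to discretization bias)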

6.10. Hamiltonian Monte Carlo.


1. Prove that if the dynamics are solved exactly (as opposed to numer-
ically using the Leapfrog method), then the acceptance probability
of the MH-step is always 1.
2. Prove that the Langevin Monte Carlo algorithm from (6.44) can be
seen as a special case of HMC if only one Leapfrog step is used
(L = 1) and m = 1.
7
Deep Learning

We began our journey through probabilistic machine learning with


linear models. In most practical applications, however, the labels depend nonlinearly
on the inputs, and for this reason linear models are often used in conjunction
with “hand-designed” nonlinear features. Designing these features is
costly, time-consuming, and requires expert knowledge. Moreover, the
design of such features is inherently limited by human comprehension
and ingenuity.

7.1 Artificial Neural Networks

One widely used family of nonlinear functions are artificial “deep” neural networks,1

      f : Rd → Rk ,   f ( x; θ) = φ(WL φ(WL−1 (· · · φ(W1 x))))   (7.1)

where θ = [W1 , . . . , WL ] is a vector of weights (written as matrices Wl ∈ Rnl ×nl−1 )2
and φ : R → R is a component-wise nonlinear function. Thus, a deep neural network
can be seen as nested (“deep”) linear functions composed with nonlinearities. This
simple kind of neural network is also called a multilayer perceptron.

   1 In the following, we will refrain from using the characterizations “artificial” and “deep” for better readability.
   2 where n0 = d and nL = k
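
As a concrete illustration of Equation (7.1), here is a minimal NumPy sketch of a multilayer perceptron forward pass; the layer widths and the choice φ = Tanh are arbitrary, illustrative assumptions.

    import numpy as np

    def mlp_forward(x, weights, phi=np.tanh):
        # f(x; theta) = phi(W_L phi(... phi(W_1 x))), cf. Eq. (7.1)
        nu = x
        for W in weights:
            nu = phi(W @ nu)
        return nu

    rng = np.random.default_rng(0)
    d, n1, n2, k = 3, 8, 8, 2                     # input, hidden, hidden, output widths
    weights = [rng.normal(size=(n1, d)),          # W_1
               rng.normal(size=(n2, n1)),         # W_2
               rng.normal(size=(k, n2))]          # W_3
    print(mlp_forward(rng.normal(size=d), weights))   # k outputs ("logits")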

In this chapter, we will be focusing mostly on performing probabilis-


tic inference with a given neural network architecture. To this end,
understanding the basic architecture of a multilayer perceptron will
be sufficient for us. For a more thorough introduction to the field of
deep learning and various architectures, you may refer to the books of
Goodfellow et al. (2016) and Prince (2023).

A neural network can be visualized by a computation graph. An exam-


ple for such a computation graph is given in Figure 7.1. The columns
of the computation graph are commonly called layers, whereby the

Figure 7.1: Computation graph of a neural network with two hidden layers. [Figure: input layer x1 , . . . , xd ; hidden layers 1 and 2 with activations ν(1), ν(2); output layer f 1 , . . . , f k ; edges carry the weights w(l)i,j .]

left-most column is the input layer, the right-most column is the output
layer, and the remaining columns are the hidden layers. The inputs are
(as we have previously) referred to as x = [ x1 , . . . , xd ]. The outputs (i.e., vertices
of the output layer) are often referred to as logits and named f = [ f 1 , . . . , f k ].
The activations of an individual (hidden) layer l of the neural network are described by

      ν(l) = φ(Wl ν(l−1) )   (7.2)

where ν(0) = x. The activation of the i-th node is νi(l) = ν(l) (i).

7.1.1 Activation Functions


The nonlinearity φ is called an activation function. The following two
Tanh( x )
activation functions are particularly common:
1
1. The hyperbolic tangent (Tanh) is defined as
0
. exp(z) − exp(−z)
Tanh(z) = ∈ (−1, 1). (7.3)
exp(z) + exp(−z)
−1
Tanh is a scaled and shifted variant of the sigmoid function (5.9) −2 0 2
x
which we have previously seen in the context of logistic regression
as Tanh(z) = 2σ (2z) − 1. ReLU( x )
2. The rectified linear unit (ReLU) is defined as 3

2
.
ReLU(z) = max{z, 0} ∈ [0, ∞). (7.4)
1

In particular, the ReLU activation function leads to “sparser” gra- 0


dients as it selects the halfspace of inputs with positive sign. More- −2 0 2
x
over, the gradients of ReLU do not “vanish” as z → ±∞ which can
Figure 7.2: The Tanh and ReLU activa-
lead to faster training. tion functions, respectively.
It is important that the activation function is nonlinear because otherwise, any
composition of layers would still represent a linear function. Nonlinear activation
functions allow the network to approximate arbitrary functions. This is known as
the universal approximation theorem, and it states that any artificial neural network
with just a single hidden layer (with arbitrary width) and non-polynomial activation
function φ can approximate any continuous function to an arbitrary accuracy.

7.1.2 Classification

Although we mainly focus on regression, neural networks can equally well be used
for classification. If we want to classify inputs into c sepa-
rate classes, we can simply construct a neural network with c outputs,
f = [ f 1 , . . . , f c ], and normalize them into a probability distribution. Often, the
softmax function is used for normalization,

      σi ( f ) = exp( f i ) / ∑_{j=1}^{c} exp( f j )   (7.5)

where σi ( f ) corresponds to the probability mass of class i. The softmax is a
generalization of the logistic function (5.9) to more than two classes (Problem 7.1).
Note that the softmax corresponds to a Gibbs distribution with energies − f .

Figure 7.3: Softmax σ1 ( f 1 , f 2 ) for a binary classification problem. Blue denotes a small probability and yellow denotes a large probability of belonging to class 1, respectively. [Figure omitted.]
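
As a small aside, Equation (7.5) is commonly implemented by shifting the logits by their maximum, which leaves the softmax unchanged but avoids numerical overflow; a minimal NumPy sketch:

    import numpy as np

    def softmax(f):
        # Normalize a vector of logits f into class probabilities, cf. Eq. (7.5).
        z = f - np.max(f)          # shift-invariance of the softmax; improves stability
        expz = np.exp(z)
        return expz / expz.sum()

    print(softmax(np.array([2.0, 0.5, -1.0])))   # three class probabilities summing to 1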

7.1.3 Maximum Likelihood Estimation


We will study neural networks under the lens of supervised learn-
ing (cf. Section 1.3) where we are provided some independently-
sampled (noisy) data D = {( xi , yi )}in=1 generated according to an un-
known process y ∼ p(· | x, θ⋆ ), which we wish to approximate.

Upon initialization, the network does generally not approximate this


process well, so a key element of deep learning is “learning” a param-
eterization θ that is a good approximation. To this end, one typically
considers a loss function ℓ(θ; y) which quantifies the “error” of the net-
work outputs f ( x; θ). In the classical setting of regression, i.e., y ∈ R
and k = 1, ℓ is often taken to be the (empirical) mean squared error,
      ℓmse (θ; D) = (1/n) ∑_{i=1}^{n} ( f ( xi ; θ) − yi )2 .   (7.6)

As we have already seen in Section 2.0.1 in the context of linear re-


gression, minimizing mean squared error corresponds to maximum
likelihood estimation under a Gaussian likelihood.

In the setting of classification where y ∈ {0, 1}c is a one-hot encod-


ing of class membership,3 it is instead common to interpret the outputs of a neural
network as probabilities akin to our discussion in Section 7.1.2. We denote by
qθ (· | x) the resulting probability distribution over classes with PMF
[σ1 ( f ( x; θ)), . . . , σc ( f ( x; θ))], and aim to find θ such that qθ (y | x) ≈ p(y | x).

   3 That is, exactly one component of y is 1 and all others are 0, indicating to which class the given example belongs.

In this context, it is common to

minimize the cross-entropy,


      H[ p∥qθ ] = E( x,y)∼ p [− log qθ (y | x)]   (using the definition of cross-entropy (5.32))
                ≈ −(1/n) ∑_{i=1}^{n} log qθ (yi | xi ) =: ℓce (θ; D)   (7.7)   (using Monte Carlo sampling)

which can be understood as minimizing the surprise about the training data under
the model. ℓce is called the cross-entropy loss. Disregarding the constant 1/n, we
can rewrite the cross-entropy loss as

      ℓce (θ; D) ∝ − ∑_{i=1}^{n} log qθ (yi | xi ) = ℓnll (θ; D).   (7.8)
Recall that ℓnll (θ; D) is the negative log-likelihood of the training data, and thus,
empirically minimizing cross-entropy can equivalently be interpreted as maximum
likelihood estimation.4 Furthermore, recall from Problem 5.1 that for a two-class
classification problem the cross-entropy loss is equivalent to the logistic loss.

   4 We have previously seen this equivalence of MLE and empirically minimizing KL-divergence in Section 5.4.6 (minimizing the cross-entropy H[ p∥qθ ] is equivalent to minimizing forward-KL KL( p∥qθ )). Note that this interpretation is not exclusive to the canonical cross-entropy loss from Equation (7.7), but holds for any MLE. For example, minimizing mean squared error corresponds to empirically minimizing the KL-divergence with a Gaussian likelihood.

7.1.4 Backpropagation

A crucial property of neural networks is that they are differentiable. That is, we can
compute gradients ∇θ ℓ of f with respect to the parameterization of the model
θ = W1:L and some loss function ℓ( f ; y). Being able to obtain these gradients
efficiently allows for “learning” a
particular function from data using first-order optimization methods.

The algorithm for computing gradients of a neural network is called


backpropagation and is essentially a repeated application of the chain
rule. Note that using the chain rule for every path through the network
is computationally infeasible, as this quickly leads to a combinatorial
explosion as the number of hidden layers is increased. The key insight
of backpropagation is that we can use the feed-forward structure of our
neural network to memoize computations of the gradient, yielding
a linear time algorithm. Obtaining gradients by backpropagation is
often called automatic differentiation (auto-diff). For more details, refer
to Goodfellow et al. (2016).

Computing the exact gradient for each data point is still fairly expen-
sive when the size of the neural network is large. Typically, stochastic
gradient descent is used to obtain unbiased gradient estimates using
batches of only m of the n data points, where m ≪ n.

7.2 Bayesian Neural Networks

Figure 7.4: Bayesian neural networks model a distribution over the weights of a neural network. [Figure omitted.]

How can we perform probabilistic inference in neural networks? We adopt the
same strategy which we already used for Bayesian linear
regression, we impose a Gaussian prior on the weights,

θ ∼ N (0, σp2 I ). (7.9)

Similarly, we can use a Gaussian likelihood to describe how well the


data is described by the model f ,

y | x, θ ∼ N ( f ( x; θ), σn2 ). (7.10)

Thus, instead of considering weights as point estimates which are


learned exactly, Bayesian neural networks learn a distribution over the
weights of the network. In principle, other priors and likelihoods can
be used, yet Gaussians are typically chosen due to their closedness
properties, which we have seen in Section 1.2.3 and many times since.

7.2.1 Maximum a Posteriori Estimation


Before studying probabilistic inference, let us first consider MAP esti-
mation in the context of neural networks.

The MAP estimate of the weights is obtained by


      θ̂MAP = arg maxθ  log p(θ) + ∑_{i=1}^{n} log p(yi | xi , θ).   (7.11)

In Section 7.1.3, we have seen that the negative log-likelihood under


a Gaussian likelihood (7.10) is the squared error between label and
prediction,

      log p(yi | xi , θ) = − (yi − f ( xi ; θ))2 / (2σn2 ) + const.   (7.12)

Obtaining the MAP estimate instead, simply corresponds to adding an


L2 -regularization term to the squared error loss,
      θ̂MAP = arg minθ  (1/(2σp2 )) ∥θ∥22 + (1/(2σn2 )) ∑_{i=1}^{n} (yi − f ( xi ; θ))2 .   (7.13)

      (using that for an isotropic Gaussian prior, log p(θ) = −(1/(2σp2 )) ∥θ∥22 + const)

As we have already seen in Remark A.31 and the context of Bayesian


linear regression (and ridge regression), using a Gaussian prior is
equivalent to applying weight decay.5 Using gradient ascent, we obtain the
following update rule,

      θ ← θ(1 − ληt ) + ηt ∑_{i=1}^{n} ∇ log p(yi | xi , θ)   (7.14)

where λ = 1/σp2 . The gradients of the likelihood can be obtained using automatic
differentiation.

   5 Recall that weight decay regularizes weights by shrinking them towards zero.
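
A minimal NumPy sketch of the update rule (7.14); grad_log_lik stands in for whatever automatic differentiation would return for the chosen network and likelihood, and the toy example below uses a linear model with a unit-variance Gaussian likelihood purely for illustration.

    import numpy as np

    def map_gradient_step(theta, xs, ys, grad_log_lik, lam=1.0, eta=1e-3):
        # One gradient-ascent step on the MAP objective, cf. Eq. (7.14); lam = 1/sigma_p^2.
        grad_sum = sum(grad_log_lik(x, y, theta) for x, y in zip(xs, ys))
        return theta * (1.0 - lam * eta) + eta * grad_sum

    # Toy example: f(x; theta) = theta^T x with unit noise variance, so
    # grad_theta log p(y | x, theta) = (y - theta^T x) x.
    rng = np.random.default_rng(0)
    theta = rng.normal(size=3)
    xs, ys = rng.normal(size=(10, 3)), rng.normal(size=10)
    grad = lambda x, y, th: (y - th @ x) * x
    for _ in range(100):
        theta = map_gradient_step(theta, xs, ys, grad, lam=0.1, eta=0.05)
    print(theta)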

7.2.2 Heteroscedastic Noise


Equation (7.10) uses the scalar parameter σn2 to model the aleatoric un-
certainty (label noise), similarly to how we modeled the label noise y
in Bayesian linear regression and Gaussian processes. Such a noise
model is called homoscedastic noise as the noise is assumed to be uni-
form across the domain. In many settings, however, the noise is inherently
heteroscedastic. That is, the noise varies depending on
the input and which “region” of the domain the input is from. This
behavior is visualized in Figure 7.5.

There is a natural way of modeling heteroscedastic noise with Bayesian neural
networks. We use a neural network with two outputs f 1 and f 2 and define

      y | x, θ ∼ N (µ( x; θ), σ2 ( x; θ))   where   (7.15a)
      µ( x; θ) = f 1 ( x; θ),   (7.15b)
      σ2 ( x; θ) = exp( f 2 ( x; θ)),   (7.15c)

where we exponentiate f 2 to ensure non-negativity of the variance.

Figure 7.5: Illustration of data with variable (heteroscedastic) noise. The noise increases as the inputs increase in magnitude. The noise-free function is shown in black. [Figure omitted.]

Using this model, the likelihood term from Equation (7.11) is

      log p(yi | xi , θ) = log N (yi ; µ( xi ; θ), σ2 ( xi ; θ))
         = log( 1/√(2πσ2 ( xi ; θ)) ) − (yi − µ( xi ; θ))2 / (2σ2 ( xi ; θ))   (note that the normalizing constant depends on the noise model!)
         = log(1/√(2π)) − (1/2) [ log σ2 ( xi ; θ) + (yi − µ( xi ; θ))2 / σ2 ( xi ; θ) ],   (7.16)

where the first term is constant.
Hence, the model can either explain the label yi by an accurate model
µ( xi ; θ) or by a large variance σ2 ( xi ; θ), yet, it is penalized for choos-
ing large variances. Intuitively, this allows the model to attenuate losses for some
data points by attributing them to large variance (when no model re-
flecting all data points simultaneously can be found). This allows the
model to “learn” its aleatoric uncertainty. However, recall that the
MAP estimate still corresponds to a point estimate of the weights, so
we forgo modeling the epistemic uncertainty.
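
A common way to train such a model is to minimize the negative log-likelihood (7.16) directly. Below is a minimal NumPy sketch of this loss; it assumes the two outputs f 1 and f 2 have already been computed by some network (how they are produced is left abstract here).

    import numpy as np

    def heteroscedastic_nll(f1, f2, y):
        # Negative log-likelihood of Eq. (7.16), dropping the constant log(1/sqrt(2*pi)).
        mu, var = f1, np.exp(f2)               # sigma^2(x; theta) = exp(f2), Eq. (7.15c)
        return 0.5 * np.mean(np.log(var) + (y - mu) ** 2 / var)

    # The model can explain a hard-to-fit label either by moving mu closer to y
    # or by predicting a larger variance, but it is penalized log(var) for the latter.
    y = np.array([0.0, 0.1, 5.0])
    print(heteroscedastic_nll(np.zeros(3), np.zeros(3), y))                # unit variance everywhere
    print(heteroscedastic_nll(np.zeros(3), np.array([0.0, 0.0, 3.0]), y))  # large variance on the outlier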

7.3 Approximate Probabilistic Inference

Naturally, we want to understand the epistemic uncertainty of our


model too. However, learning and inference in Bayesian neural net-
works are generally intractable (even when using a Gaussian prior
and likelihood) when the noise is not assumed to be homoscedastic
and known.6 Thus, we are led to the techniques of approximate infer- 6
In this case, the conjugate prior to
ence, which we discussed in the previous two chapters. a Gaussian likelihood is not a Gaus-
sian. See, e.g., https://en.wikipedia.
org/wiki/Conjugate_prior.

7.3.1 Variational Inference


As we have discussed in Chapter 5, we can apply black box stochas-
tic variational inference which — in the context of neural networks
— is also known as Bayes by Backprop. As variational family, we use
the family of independent Gaussians which we have already encoun-
tered in Example 5.5.7 Recall the fundamental objective of variational inference (5.25),

   7 Independent Gaussians are useful because they can be encoded using only 2d parameters, which is crucial when the size of the neural network is large.
      arg min_{q∈Q} KL(q∥ p(· | x1:n , y1:n ))
         = arg max_{q∈Q} L(q, p; D)   (using Equation (5.53))
         = arg max_{q∈Q} Eθ∼q [log p(y1:n | x1:n , θ)] − KL(q∥ p(·)).   (using Equation (5.55c))

The KL-divergence KL(q∥ p(·)) can be expressed in closed-form for


Gaussians.8 Recall that we can obtain unbiased gradient estimates of the expectation
using the reparameterization trick (5.68),

      Eθ∼q [log p(y1:n | x1:n , θ)] = Eε∼N (0,I ) [ log p(y1:n | x1:n , θ) |θ=Σ1/2 ε+µ ].

   8 see Equation (5.41)

As Σ is the diagonal matrix diag{σ12 , . . . , σd2 }, Σ 1/2 = diag{σ1 , . . . , σd }.


The gradients of the likelihood can be obtained using backpropaga-
tion. We can now use the variational posterior qλ to perform approxi-
mate inference,
      p(y⋆ | x⋆ , x1:n , y1:n ) = ∫ p(y⋆ | x⋆ , θ) p(θ | x1:n , y1:n ) dθ   (using the sum rule (1.7) and product rule (1.11))
         = Eθ∼ p(·|x1:n ,y1:n ) [ p(y⋆ | x⋆ , θ)]   (interpreting the integral as an expectation over the posterior)
         ≈ Eθ∼qλ [ p(y⋆ | x⋆ , θ)]   (approximating the posterior with the variational posterior qλ)
         ≈ (1/m) ∑_{i=1}^{m} p(y⋆ | x⋆ , θ(i) )   (7.17)   (using Monte Carlo sampling)

for independent samples θ(i) ∼ qλ (i.i.d.),

         = (1/m) ∑_{i=1}^{m} N (y⋆ ; µ( x⋆ ; θ(i) ), σ2 ( x⋆ ; θ(i) )).   (7.18)   (using the neural network (7.15a))

Intuitively, variational inference in Bayesian neural networks can be


interpreted as averaging the predictions of multiple neural networks
drawn according to the variational posterior qλ .

Using the Monte Carlo samples θ(i) , we can also estimate the mean of
our predictions,
      E[y⋆ | x⋆ , x1:n , y1:n ] ≈ (1/m) ∑_{i=1}^{m} µ( x⋆ ; θ(i) ) =: µ( x⋆ ),   (7.19)

and the variance of our predictions,

      Var[y⋆ | x⋆ , x1:n , y1:n ] = Eθ [Vary⋆ [y⋆ | x⋆ , θ]] + Varθ [Ey⋆ [y⋆ | x⋆ , θ]]   (using the law of total variance (1.41))
         = Eθ [σ2 ( x⋆ ; θ)] + Varθ [µ( x⋆ ; θ)].   (using the likelihood (7.15a))

Recall from Equation (2.18) that the first term corresponds to the ale-
atoric uncertainty of the data and the second term corresponds to the
epistemic uncertainty of the model. We can approximate them using
the Monte Carlo samples θ(i) ,
      Var[y⋆ | x⋆ , x1:n , y1:n ] ≈ (1/m) ∑_{i=1}^{m} σ2 ( x⋆ ; θ(i) ) + (1/(m−1)) ∑_{i=1}^{m} (µ( x⋆ ; θ(i) ) − µ( x⋆ ))2   (7.20)

using a sample mean (A.14) and sample variance (A.16).
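
Given the per-sample predictive means and variances produced by m forward passes with weights drawn from qλ , Equations (7.19) and (7.20) amount to a few lines of NumPy; the arrays below are synthetic stand-ins.

    import numpy as np

    def predictive_moments(mus, sigma2s):
        # mus[i] = mu(x*; theta_i), sigma2s[i] = sigma^2(x*; theta_i), cf. Eqs. (7.19)-(7.20)
        mean = mus.mean()
        aleatoric = sigma2s.mean()     # E_theta[sigma^2(x*; theta)]
        epistemic = mus.var(ddof=1)    # Var_theta[mu(x*; theta)] (sample variance)
        return mean, aleatoric, epistemic

    rng = np.random.default_rng(0)
    mus = rng.normal(loc=1.0, scale=0.3, size=100)   # disagreement between samples -> epistemic
    sigma2s = np.full(100, 0.25)                     # predicted noise -> aleatoric
    print(predictive_moments(mus, sigma2s))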

7.3.2 Markov Chain Monte Carlo


As we have discussed in Chapter 6, an alternative method to approxi-
mating the full posterior distribution is to sample from it directly. By
the ergodic theorem (6.28), we can use any of the discussed sampling
strategies to obtain samples θ(t) such that

      p(y⋆ | x⋆ , x1:n , y1:n ) ≈ (1/T) ∑_{t=1}^{T} p(y⋆ | x⋆ , θ(t) ).   (see (6.29))

Here, we omit the offset t0 which is commonly used to avoid the


“burn-in” period for simplicity. Algorithms such as SGLD or SG-HMC
are often used as they rely only on stochastic gradients of the loss func-
tion which can be computed efficiently using automatic differentiation.

Typically, for large networks, we cannot afford to store all T samples of


models. Thus, we need to summarize the iterates.9 One approach is to keep track
of m snapshots of weights [θ(1) , . . . , θ(m) ] according to some schedule and use those
for inference (e.g., by averaging the predictions of the corresponding neural
networks). This approach of sampling a subset of some data is generally called
subsampling.

   9 That is, combine the individual samples of weights θ(i) .

Another approach is to summarize (that is, approximate) the weights


using sufficient statistics (e.g., a Gaussian).10 In other words, we learn the Gaussian
approximation,

      θ ∼ N (µ, Σ ), where   (7.21a)
      µ = (1/T) ∑_{i=1}^{T} θ(i) ,   (7.21b)   (using a sample mean (A.14))
      Σ = (1/(T−1)) ∑_{i=1}^{T} (θ(i) − µ)(θ(i) − µ)⊤ ,   (7.21c)   (using a sample variance (A.16))

   10 A statistic is sufficient for a family of probability distributions if the samples from which it is calculated yield no more information than the statistic with respect to the learned parameters. We provide a formal definition in Section 8.2.

using sample means and sample (co)variances. This can be implemented efficiently
using running averages of the first and second moments,

      µ ← (1/(T+1)) ( Tµ + θ)   and   A ← (1/(T+1)) ( TA + θθ⊤ )   (7.22)

upon observing the fresh sample θ. Σ can easily be calculated from these moments,

      Σ = (T/(T−1)) ( A − µµ⊤ ).   (7.23)   (using the characterization of sample variance in terms of estimators of the first and second moment (A.17))

To predict, we can sample weights θ from the learned Gaussian.
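
The running updates (7.22)-(7.23) can be implemented as follows (a minimal NumPy sketch; the random vectors below merely stand in for SGD/SGLD iterates).

    import numpy as np

    class RunningGaussian:
        # Track running first and second moments of weight iterates, cf. Eqs. (7.22)-(7.23).
        def __init__(self, d):
            self.T, self.mu, self.A = 0, np.zeros(d), np.zeros((d, d))

        def update(self, theta):
            self.mu = (self.T * self.mu + theta) / (self.T + 1)
            self.A = (self.T * self.A + np.outer(theta, theta)) / (self.T + 1)
            self.T += 1

        def covariance(self):
            return self.T / (self.T - 1) * (self.A - np.outer(self.mu, self.mu))

    rng = np.random.default_rng(0)
    tracker = RunningGaussian(d=2)
    for _ in range(1000):                      # stand-in for SGD/SGLD iterates
        tracker.update(rng.normal(loc=[1.0, -1.0], scale=0.1, size=2))
    print(tracker.mu)
    print(tracker.covariance())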

Remark 7.1: Stochastic weight averaging


It turns out that this approach works well even without inject-
ing additional Gaussian noise during training, e.g., when using
SGD rather than SGLD. Simply taking the mean of the iterates of
SGD is called stochastic weight averaging (SWA). Describing the iter-
ates of SGD by Gaussian sufficient statistics (analogously to Equa-
tion (7.21)), is known as the stochastic weight averaging-Gaussian
(SWAG) method (Izmailov et al., 2018).

7.3.3 Dropout and Dropconnect


We will now discuss two approximate inference techniques that are tailored to
neural networks. The first is “dropout”/“dropconnect” regularization. Traditionally,
dropout regularization (Hinton et al., 2012; Srivastava et al., 2014) randomly omits
vertices of the computation graph during training, see Figure 7.7. In contrast,
dropconnect regularization (Wan et al., 2013) randomly omits edges of the
computation graph. The key idea that we will present here is to interpret this type
of regularization as performing variational inference.

Figure 7.6: Illustration of dropout regularization. Some vertices of the computation graph are randomly omitted. In contrast, dropconnect regularization randomly omits edges of the computation graph. [Figure omitted.]

In our exposition, we will focus on dropconnect, but the same approach also applies
to dropout (Gal and Ghahramani, 2016). Suppose that we omit an edge of the
computation graph (i.e., set its weight to zero) with probability p. Then our
variational posterior is given by

      q(θ | λ) = ∏_{j=1}^{d} q j (θ j | λ j )   (7.24)

where d is the number of weights of the neural network and

      q j (θ j | λ j ) = p δ0 (θ j ) + (1 − p) δλ j (θ j ).   (7.25)

Figure 7.7: Interpretation of dropconnect regularization as variational inference. The only coordinates where the variational posterior q j has positive probability are 0 and λ j (with masses p and 1 − p, respectively). [Figure omitted.]

Here, δα is the Dirac delta function with point mass at α.11 The variational
parameters λ correspond to the “original” weights of the network. In words, the
variational posterior expresses that the j-th weight has value 0 with probability p
and value λ j with probability 1 − p. For fixed weights λ, sampling from the
variational posterior qλ corresponds to sampling a vector z with entries
z(i) ∼ Bern( p), yielding z ⊙ λ which is one of 2^d possible subnetworks.12

   11 see Appendix A.1.4
   12 A ⊙ B denotes the Hadamard (element-wise) product.

The weights λ can be learned by maximizing the ELBO, analogously to


black-box variational inference. The KL-divergence term of the ELBO
is not tractable for the variational family described by Equation (7.25); instead, a
common approach is to use a mixture of Gaussians:

      q j (θ j | λ j ) = p N (θ j ; 0, 1) + (1 − p) N (θ j ; λ j , 1).   (7.26)

In this case, it can be shown that KL(qλ ∥ p(·)) ≈ ( p/2) ∥λ∥22 for sufficiently
large d (Gal and Ghahramani, 2015, proposition 1). Thus,

      arg max_{λ∈Λ} L(qλ , p; D)
         = arg max_{λ∈Λ} Eθ∼qλ [log p(y1:n | x1:n , θ)] − KL(qλ ∥ p(·))   (using Equation (5.55c))
         ≈ arg min_{λ∈Λ} −(1/m) ∑_{i=1}^{m} log p(y1:n | x1:n , θ)|θ=z(i) ⊙λ+ε(i) + ( p/2) ∥λ∥22   (7.27)   (using Monte Carlo sampling)

where we reparameterize θ ∼ qλ by θ = z ⊙ λ + ε with z(i ) ∼ Bern( p)


and ε ∼ N (0, I ). Equation (7.27) is the standard L2 -regularized loss
function of a neural network with weights λ and dropconnect, and it
is straightforward to obtain unbiased gradient estimates by automatic
differentiation.

Crucially, for the interpretation of dropconnect regularization as vari-


ational inference to be valid, we also need to perform dropconnect
regularization during inference,

      p(y⋆ | x⋆ , x1:n , y1:n ) ≈ Eθ∼qλ [ p(y⋆ | x⋆ , θ)]
         ≈ (1/m) ∑_{i=1}^{m} p(y⋆ | x⋆ , θ(i) )   (7.28)   (using Monte Carlo sampling)

where θ(i) ∼ qλ are independent samples. This coincides with our ear-
lier discussion of variational inference for Bayesian neural networks
in Equation (7.17). In words, we average the predictions of m neural
networks for each of which we randomly “drop out” weights.
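
A minimal sketch of this Monte Carlo averaging: we draw m dropconnect masks, evaluate the network once per masked weight vector, and average the predictions. The predict function is a placeholder for a forward pass with the given weights, and the toy usage at the end is purely illustrative.

    import numpy as np

    def mc_dropconnect_predict(x, lam, predict, p=0.1, m=100, rng=None):
        # Average predictions over m random dropconnect masks, cf. Eq. (7.28).
        if rng is None:
            rng = np.random.default_rng()
        preds = []
        for _ in range(m):
            z = (rng.random(size=lam.shape) >= p).astype(float)  # drop each weight with prob. p, cf. Eq. (7.25)
            preds.append(predict(x, z * lam))                    # one of the 2^d subnetworks
        return np.mean(preds, axis=0)

    # Toy usage with a linear "network":
    rng = np.random.default_rng(0)
    lam = rng.normal(size=5)
    predict = lambda x, theta: theta @ x
    print(mc_dropconnect_predict(np.ones(5), lam, predict, p=0.2, m=200, rng=rng))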

Remark 7.2: Masksembles


A practical problem of dropout is that for any reasonable choice
of dropout probability, the dropout masks z(i) will overlap signif-
icantly, which tends to make the predictions p(y⋆ | x⋆ , θ(i) ) highly
correlated. This can lead to underestimation of epistemic uncer-
tainty. Masksembles (Durasov et al., 2021) mitigate this issue by
choosing a fixed set of pre-defined dropout masks, which have
controlled overlap, and alternating between them during training.
In the extreme case of “infinitely many” masks, masksembles are
equivalent to dropout since each mask is only seen once.

7.3.4 Probabilistic Ensembles


We have seen that variational inference in the context of Bayesian neu-
ral networks can be interpreted as averaging the predictions of m neu-
ral networks drawn according to the variational posterior.

A natural adaptation of this idea is to immediately learn the weights


of m neural networks. The idea is to randomly choose m training sets
by sampling uniformly from the data with replacement. Then, using
our analysis from Section 7.2.1, we obtain m MAP estimates of the
weights θ(i) , yielding the approximation

      p(y⋆ | x⋆ , x1:n , y1:n ) = Eθ∼ p(·| x1:n ,y1:n ) [ p(y⋆ | x⋆ , θ)]
         ≈ (1/m) ∑_{i=1}^{m} p(y⋆ | x⋆ , θ(i) ).   (7.29)   (using bootstrapping)

Here, the randomly chosen m training sets induce “diversity” of the


models. In practice, in the context of deep neural networks where
the global minimizer of the loss can rarely be identified, it is com-
mon to use the full training data to train each of the m neural net-
works. Random initialization and random shuffling of the training
data is typically enough to ensure some degree of diversity between
the individual models (Lakshminarayanan et al., 2017). We can con-
nect ensembles to the other approximate inference techniques we have
discussed: First, ensembles can be seen as a specific kind of masksem-
ble, where the masks are non-overlapping,13 which mitigates the issue 13
That is, the m models do not share
of correlated predictions from Remark 7.2. Second, ensembling can be any of their parameters. Ensembles and
dropout lie on opposite ends of this
combined with other approximate inference techniques such as varia- spectrum.
tional inference, Laplace approximation, or SWAG to get a mixture of
Gaussians as the posterior approximation.

Note that Equation (7.29) is not equivalent to Monte Carlo sampling,


although it looks very similar. The key difference is that this approach
does not sample from the true posterior distribution p, but instead

from the empirical posterior distribution p̂ given the (re-sampled) MAP


estimates. Intuitively, this can be understood as the difference between
sampling from a distribution p directly (Monte Carlo sampling) versus
sampling from an approximate (empirical) distribution p̂ (correspond-
ing to the training data), which itself is constructed from samples of the
true distribution p. This approach is known as bootstrapping or bagging
(short for bootstrap aggregating) and plays a central role in model-free
reinforcement learning. We will return to this concept in Section 11.4.1.
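
A minimal sketch of such a bootstrapped ensemble; train is a placeholder for fitting one model (e.g., by MAP estimation as in Section 7.2.1) and returning an object with a predict method. As noted above, for deep networks one often trains each member on the full data and relies on random initialization for diversity instead.

    import numpy as np

    def train_bootstrap_ensemble(xs, ys, train, m=5, rng=None):
        # Train m models on bootstrap resamples of the data, cf. Eq. (7.29).
        if rng is None:
            rng = np.random.default_rng()
        n = len(xs)
        models = []
        for _ in range(m):
            idx = rng.integers(0, n, size=n)      # sample n indices with replacement
            models.append(train(xs[idx], ys[idx]))
        return models

    def ensemble_predict(models, x):
        # Average the members' predictions at a test input x.
        return np.mean([model.predict(x) for model in models], axis=0)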

7.3.5 Diverse Probabilistic Ensembles

Probabilistic ensembles can be loosely interpreted as randomly initial-


izing m “particles” {θ(i) }im=1 and then pushing each particle towards
regions of high posterior probability. A potential issue with this ap-
proach is that if the initialization of particles is not sufficiently diverse,
the particles may “collapse” since every particle eventually converges
to a local optimum of the loss function. A natural approach to mitigate
this issue is to alter the objective of each particle from simply aiming
to minimize the loss − log p(θ | x1:n , y1:n ), which we will abbreviate by
− log p(θ), to also “push” the particles away from each other. We will
see next that this can be interpreted as a form of variational inference
under a very flexible variational family.

In our discussion of variational inference, we have seen that minimiz-


ing reverse-KL is equivalent to maximizing the evidence lower bound,
and used this to derive an optimization algorithm to compute approx-
imate posteriors. We will discuss the alternative approach of directly
computing the gradient of the KL-divergence. Consider the variational
family of all densities that can be expressed as smooth transformations
of points sampled from a reference density ϕ. That is, we consider
Qϕ = {T♯ ϕ | T : Θ → Θ is smooth} where T♯ ϕ is the density of the random vector
θ′ = T(θ) with θ ∼ ϕ.14 The density ϕ can be thought of as the initial distribution
of the particles, and the smooth map T as the dynamics that push the particles
towards the target density p. It can be shown that for “almost any” reference
density ϕ, this variational family Qϕ is expressive enough to closely approximate
“almost arbitrary” distributions.15 A natural approach is therefore to learn the
appropriate smooth map T between the reference density ϕ and the target density p.

   14 Refer back to Section 1.1.11 for a recap on pushforwards.
   15 For a more detailed discussion, refer to “Stein variational gradient descent: A general purpose Bayesian inference algorithm” (Liu and Wang, 2016).

Example 7.3: Gaussian variational inference


We have seen in Section 5.5 that if ϕ is standard normal and
T( x; {µ, Σ 1/2 }) = µ + Σ 1/2 x is an affine transformation, then the ELBO

can be maximized efficiently using stochastic gradient descent.


However, in this case, Qϕ can only approximate Gaussian-like dis-
tributions since the expressivity of the map T is limited under the
fixed reference density ϕ.

An alternative approach to Gaussian variational inference is the fol-


lowing algorithm known as Stein variational gradient descent (SVGD),
where we recursively apply carefully chosen smooth maps to the cur-
rent variational approximation:

      q0 −→ q1 −→ q2 −→ · · ·   (applying T⋆0 , T⋆1 , T⋆2 , . . .)   where qt+1 = T⋆t ♯ qt .   (7.30)

We consider maps T = id + f where id(θ) = θ denotes the identity map and f ( x)
represents a (small) perturbation. Recall that at time t we seek to minimize
KL(T♯ qt ∥ p), so we choose the smooth map as

      T⋆t = id − ηt ∇f KL(T♯ qt ∥ p) |f =0   (7.31)

where ηt is a step size. In this way, the SVGD update (7.31) can be
interpreted as a step of “functional” gradient descent.

To be able to compute the gradient of the KL-divergence, we need


to make some structural assumptions on the perturbation f . SVGD
assumes that f = [ f 1 · · · f d ]⊤ with f i ∈ Hk (Θ) is from some re-
producing kernel Hilbert space Hk (Θ) of a positive definite kernel k;
we say f ∈ Hkd (Θ). Within the RKHS, we can compute the gradient
of the KL-divergence exactly. Liu and Wang (2016) show that in this
case, the functional gradient of the KL-divergence can be expressed in
closed-form as −∇f KL(T♯ q∥ p) |f =0 = φ⋆q,p where

      φ⋆q,p (·) = Eθ∼q [k(·, θ)∇θ log p(θ) + ∇θ k(·, θ)].   (7.32)

SVGD then approximates q using the particles {θ(i) }im=1 as follows:

Algorithm 7.4: Stein variational gradient descent, SVGD


1 initialize particles {θ(i) }im=1
2 repeat
3 for each particle i ∈ [m] do
4      θ(i) ← θ(i) + ηt φ̂⋆q,p (θ(i) ) where
          φ̂⋆q,p (θ) = (1/m) ∑_{j=1}^{m} [ k(θ, θ( j) ) ∇θ log p(θ) + ∇θ( j) k(θ, θ( j) ) ]

5 until converged

Often, a Gaussian kernel (4.14) with length scale h is used to model


the perturbations, in which case the repulsion term is
      ∇θ( j) k(θ, θ( j) ) = (1/h2 ) (θ − θ( j) ) k(θ, θ( j) )   (7.33)

and the negative functional gradient simplifies to

      φ̂⋆q,p (θ) = (1/m) ∑_{j=1}^{m} k(θ, θ( j) ) [ ∇θ log p(θ) + h−2 (θ − θ( j) ) ],   (7.34)

where the first term in the brackets is the drift and the second term is the repulsion.
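
A compact NumPy sketch of this update for a Gaussian target, following the form of Equation (7.34); the target, kernel length scale, step size, and number of iterations are arbitrary illustrative choices.

    import numpy as np

    def svgd_step(particles, grad_log_p, h=1.0, eta=0.05):
        # One SVGD update with a Gaussian kernel, cf. Eq. (7.34).
        m = len(particles)
        diffs = particles[:, None, :] - particles[None, :, :]       # theta_i - theta_j
        k = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * h ** 2))     # kernel matrix k(theta_i, theta_j)
        drift = k.sum(axis=1, keepdims=True) * grad_log_p(particles)
        repulsion = (k[:, :, None] * diffs).sum(axis=1) / h ** 2
        return particles + eta * (drift + repulsion) / m

    # Illustrative target p = N(0, I), so grad log p(theta) = -theta.
    rng = np.random.default_rng(0)
    particles = rng.normal(loc=3.0, size=(50, 2))       # poorly initialized particles
    for _ in range(500):
        particles = svgd_step(particles, lambda th: -th)
    print(particles.mean(axis=0))                        # drifts towards the mode at the origin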

Note that SVGD has similarities to Langevin dynamics, which as seen


in Problem 6.9 can also be interpreted as following a gradient of the KL-divergence.
Whereas Langevin dynamics perturbs particles according to a drift to-
wards regions of high posterior probability and some random diffu-
sion (cf. Equation (6.46)), the first term of φ̂⋆q,p (θ) perturbs particles to
drift towards regions of high posterior probability while the second
term leads to a mutual “repulsion” of particles. Notably, the perturba-
tions of Langevin dynamics are noisy, while SVGD perturbs particles
deterministically and the randomness is exclusively in the initializa-
tion of particles. The repulsion term prevents particles from collaps-
ing to a single mode of the posterior distribution, which is a possible
failure mode of other particle-based posterior approximations such as
ensembles.

Note that the above decomposition of φ̂⋆q,p (θ) is once more an example
of the principle of curiosity and conformity which we have seen to be a
recurring theme in approaches to approximate inference. The repul-
sion term leads to exploration of the particles (i.e., “curiosity” about
alternative explanations), while the drift term leads to minimization of
the loss (i.e., “conformity” to the data).

Lu et al. (2019) show that under some assumptions, SVGD converges


asymptotically to the target density p as ηt → 0. SVGD’s name orig-
inates from Stein’s method which is a general-purpose approach for
characterizing convergence in distribution.16

   16 For an introduction to Stein’s method, read “Measuring sample quality with Stein’s method” (Gorham and Mackey, 2015).

7.4 Calibration

A key challenge of Bayesian deep learning (and also other probabilis-


tic methods) is the calibration of models. We say that a model is well-
calibrated if its confidence coincides with its accuracy across many pre-
dictions. Consider a classification model that predicts that the label
of a given input belongs to some class with probability 80%. If the
model is well-calibrated, then the prediction is correct about 80% of
the time. In other words, during calibration, we adjust the probability
estimation of the model.

We will first mention two methods of estimating the calibration of a


model, namely the marginal likelihood and reliability diagrams. Then,
in Section 7.4.3, we survey commonly used heuristics for empirically
improving the calibration.

7.4.1 Evidence
A popular method (which we already encountered multiple times) is
to use the evidence of a validation set xval
1:m of size m given the training
set xtrain
1:n of size n for estimating the model calibration. Here, the ev-
idence can be understood as describing how well the validation set is
described by the model trained on the training set. We obtain,

      log p(yval1:m | xval1:m , xtrain1:n , ytrain1:n )
         = log ∫ p(yval1:m | xval1:m , θ) p(θ | xtrain1:n , ytrain1:n ) dθ   (using the sum rule (1.7) and product rule (1.11))
         ≈ log ∫ p(yval1:m | xval1:m , θ) qλ (θ) dθ   (approximating with the variational posterior)
         = log ∫ ∏_{i=1}^{m} p(yvali | xvali , θ) qλ (θ) dθ.   (7.35)   (using the independence of the data)

The resulting integrals are typically very small which leads to numerical
instabilities. Therefore, it is common to maximize a lower bound to the evidence
instead,

         = log Eθ∼qλ [ ∏_{i=1}^{m} p(yvali | xvali , θ) ]   (interpreting the integral as an expectation over the variational posterior)
         ≥ Eθ∼qλ [ ∑_{i=1}^{m} log p(yvali | xvali , θ) ]   (7.36)   (using Jensen’s inequality (5.29))
         ≈ (1/k) ∑_{j=1}^{k} ∑_{i=1}^{m} log p(yvali | xvali , θ( j) )   (7.37)   (using Monte Carlo sampling)

for independent samples θ( j) ∼ qλ .

7.4.2 Reliability Diagrams


Reliability diagrams take a frequentist perspective to estimate the cal-
ibration of a model. For simplicity, we assume a calibration problem
with two classes, 1 and −1 (similarly to logistic regression).17 17
Reliability diagrams generalize be-
yond this restricted example.
We group the predictions of a validation set into M interval bins of
size 1/M according to the class probability predicted by the model,
P(Yi = 1 | xi ). We then compare within each bin, how often the model
thought the inputs belonged to the class (confidence) with how often
the inputs actually belonged to the class (frequency). Formally, we

define Bm as the set of samples falling into bin m and let

      freq( Bm ) = (1/| Bm |) ∑_{i∈Bm} 1{Yi = 1}   (7.38)

be the proportion of samples in bin m that belong to class 1 and let

      conf( Bm ) = (1/| Bm |) ∑_{i∈Bm} P(Yi = 1 | xi )   (7.39)

be the average confidence of samples belonging to class 1 within the bin m.

Figure 7.8: Examples of reliability diagrams with ten bins (axes: conf vs. freq). A perfectly calibrated model approximates the diagonal dashed red line. The first reliability diagram shows a well-calibrated model. In contrast, the second reliability diagram shows an overconfident model. [Figure omitted.]

Thus, a model is well calibrated if freq( Bm ) ≈ conf( Bm ) for each bin m ∈ [ M ].
There are two common metrics of calibration that quantify how “close” a model is
to being well calibrated.

1. The expected calibration error (ECE) is the average deviation of a model from
   perfect calibration,

      ℓECE = ∑_{m=1}^{M} (| Bm |/n) |freq( Bm ) − conf( Bm )|   (7.40)

   where n is the size of the validation set.

2. The maximum calibration error (MCE) is the maximum deviation of a model from
   perfect calibration among all bins,

      ℓMCE = max_{m∈[ M ]} |freq( Bm ) − conf( Bm )| .   (7.41)
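
A small NumPy sketch of Equations (7.38)-(7.41), computing freq, conf, ECE, and MCE from predicted class-1 probabilities and binary labels with M equally sized bins; the synthetic data at the end is perfectly calibrated by construction.

    import numpy as np

    def calibration_errors(probs, labels, M=10):
        # probs[i] = P(Y_i = 1 | x_i), labels[i] in {0, 1}; bins [0, 1/M), [1/M, 2/M), ...
        bins = np.minimum((probs * M).astype(int), M - 1)
        n, ece, mce = len(probs), 0.0, 0.0
        for m in range(M):
            mask = bins == m
            if not mask.any():
                continue
            freq = labels[mask].mean()         # Eq. (7.38)
            conf = probs[mask].mean()          # Eq. (7.39)
            gap = abs(freq - conf)
            ece += mask.sum() / n * gap        # Eq. (7.40)
            mce = max(mce, gap)                # Eq. (7.41)
        return ece, mce

    rng = np.random.default_rng(0)
    probs = rng.uniform(size=10_000)
    labels = (rng.uniform(size=10_000) < probs).astype(int)
    print(calibration_errors(probs, labels))    # both errors should be close to zero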

7.4.3 Heuristics for Improving Calibration


We now survey a few heuristics which can be used empirically to im-
prove model calibration.
.
1. Histogram binning assigns a calibrated score qm = freq( Bm ) to each
bin during validation. Then, during inference, we return the cal-
ibrated score qm of the bin corresponding to the prediction of the
model.
2. Isotonic regression extends histogram binning by using variable bin
.
boundaries. We find a piecewise constant function f = [ f 1 , . . . , f M ]
that minimizes the bin-wise squared loss,
      min_{M, f ,a}  ∑_{m=1}^{M} ∑_{i=1}^{n} 1{am ≤ P(Yi = 1 | xi ) < am+1 }( f m − yi )2   (7.42a)
      subject to  0 = a1 ≤ · · · ≤ aM+1 = 1,   (7.42b)
                  f1 ≤ · · · ≤ f M   (7.42c)
.
where f are the calibrated scores and a = [ a1 , . . . , a M+1 ] are the
bin boundaries. We then return the calibrated score f m of the bin
corresponding to the prediction of the model.

3. Platt scaling adjusts the logits zi of the output layer to


.
qi = σ ( azi + b) (7.43)

and then learns parameters a, b ∈ R to maximize the likelihood.


4. Temperature scaling is a special and widely used instance of Platt scaling where
   a = 1/T and b = 0 for some temperature scalar T > 0,

      qi = σ ( zi /T ).   (7.44)

   Intuitively, for a larger temperature T, the probability is distributed more evenly
   among the classes (without changing the ranking), yielding a more uncertain
   prediction. In contrast, for a lower temperature T, the probability is concentrated
   more towards the top choices, yielding a less uncertain prediction. As seen in
   Problem 6.7, temperature scaling can be motivated as tuning the mean of the
   softmax distribution.
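
Temperature scaling is a one-line change to the softmax; the sketch below only illustrates the effect of T on the predicted distribution (learning T on a validation set is omitted).

    import numpy as np

    def softmax_with_temperature(logits, T=1.0):
        # Temperature-scaled softmax, cf. Eq. (7.44).
        z = logits / T
        z = z - z.max()                  # numerical stability
        expz = np.exp(z)
        return expz / expz.sum()

    logits = np.array([3.0, 1.0, 0.0])   # classes A, B, C
    for T in (0.5, 1.0, 5.0):
        print(T, softmax_with_temperature(logits, T))  # larger T -> more uniform, ranking unchanged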

Optional Readings
• Guo, Pleiss, Sun, and Weinberger (2017). On calibration of modern neural networks.
• Blundell, Cornebise, Kavukcuoglu, and Wierstra (2015). Weight uncertainty in neural network.
• Kendall and Gal (2017). What uncertainties do we need in Bayesian deep learning for computer vision?

Figure 7.9: Illustration of temperature scaling for a classifier with three classes. On the top, we have a prediction with a high temperature, yielding a very uncertain prediction (in favor of class A). Below, we have a prediction with a low temperature, yielding a prediction that is strongly in favor of class A. Note that the ranking (A ≻ C ≻ B) is preserved. [Figure omitted.]
the ranking (A ≻ C ≻ B) is preserved.

Discussion

This chapter concludes our discussion of (approximate) probabilistic


inference. Across the last three chapters, we have seen numerous
methods for approximating the posterior distributions of deep neural
networks:
• Methods such as dropout and stochastic weight averaging are fre-
quently used in practice. Other particle-based approaches such as
ensembles and SVGD are used less frequently since they are compu-
tationally more expensive to train, but are some of the most effective
methods in estimating uncertainty.
• Recently, Laplace approximations regained interest since they can
be applied “post-hoc” after training simply by computing or ap-
proximating the Hessian of the loss function (Daxberger et al., 2021;
Antorán et al., 2022). Still, Laplace approximations come with the
limitations inherent to unimodal Gaussian approximations.

• Other work, particularly in fine-tuning, has explored approximating


the posterior distribution of deep neural networks by treating them
as linear functions in a fixed learned feature space, in which case
one can use the tools for exact probabilistic inference from Chap-
ters 2 and 4 (e.g., Hübotter et al., 2025).
Despite large progress in approximate inference over the past decade,
efficient and reliable uncertainty estimation of large models remains
an important open challenge.

Problems

7.1. Softmax is a generalization of the logistic function.

Show that for a two-class classification problem (i.e., c = 2), the soft-
max function is equivalent to the logistic function (5.9) for the univari-
.
ate model f = f 1 − f 0 . That is, σ1 ( f ) = σ ( f ) and σ0 ( f ) = 1 − σ ( f ).

Thus, the softmax function is a generalization of the logistic function


to more than two classes.
part II

Sequential Decision-Making
Preface to Part II

In the first part of the manuscript, we have learned about how we can
build machines that are capable of updating their beliefs and reducing
their epistemic uncertainty through probabilistic inference. We have
also discussed ways of keeping track of the world through noisy sen-
sory information by filtering. An important aspect of intelligence is to
use this acquired knowledge for making decisions and taking actions
that have a positive impact on the world.

Already today, we are surrounded by machines that make decisions


and take actions; that is, exhibit some degree of agency. Be it a search
engine producing a list of search results, a chatbot answering a ques-
tion, or a driving-assistance system steering a car: these systems are
all perceiving the world, making decisions, and then taking actions
that in turn have an effect on the world. Figure 7.10 illustrates this
perception-action loop.

Figure 7.10: An illustration of the perception-action loop. This is a straightforward extension of our view of probabilistic inference from Figure 1.10 with the addition of an “action” component which is capable of “adaptively” interacting with the outside world and the internal world model. [Diagram: worldt → perception → Dt → model p(θ | D1:t ) → action → worldt+1 .]

In the second part of this manuscript, we will discuss the underpin-


ning principles of building machines that are capable of making se-

quential decisions. We will see that decision-making itself can be cast


as probabilistic inference, obeying the same mechanisms that we used
in the first part to build learning systems.

We discuss various ways of addressing the question:

How to act, given that computational resources and time are limited?

One approach is to act with the aim to reduce epistemic uncertainty,


which is the topic of active learning. Another approach is to act with
the aim to maximize some reward signal, which is the topic of bandits,
Bayesian optimization, and reinforcement learning.

This surfaces the exploration-exploitation dilemma where the agent


has to prioritize either maximizing its immediate rewards or reducing
its uncertainty about the world which might pay off in the future. We
discuss that this dilemma is, in fact, in direct correspondence to the
principle of curiosity and conformity which we discussed extensively
throughout Part I.

Since time is limited, it is critical to be sample-efficient when learning


the most important aspects of the world. At the same time, interactions
with the world are often complex, and some interactions might even
be harmful. We discuss how an agent can use its epistemic uncertainty
to guide the exploration of its environment while mitigating risks and
reasoning about safety.
8
Active Learning

By now, we have seen that probabilistic machine learning is very use-


ful for estimating the uncertainty in our models (epistemic uncer-
tainty) and in the data (aleatoric uncertainty). We have been focus-
ing on the setting of supervised learning where we are given a set
D = {( xi , yi )}in=1 of labeled data, yet we often encounter settings
where we have only little data and acquiring new data is costly.

In this chapter — and in the following chapter on Bayesian optimiza-


tion — we will discuss how one can use uncertainty to effectively
collect more data. In other words, we want to figure out where in
the domain we should sample to obtain the most useful information.
Throughout most of this chapter, we focus on the most common way
of quantifying “useful information”, namely the expected reduction in
entropy which is also called the mutual information.

8.1 Conditional Entropy

We begin by introducing the notion of conditional entropy. Recall that


the entropy H[X] of a random vector X can be interpreted as our av-
erage surprise when observing realizations x ∼ X. Thus, entropy can
be considered as a quantification of the uncertainty about a random
vector (or equivalently, its distribution).1

   1 We discussed entropy extensively in Section 5.4.
A natural extension is to consider the entropy of X given the occur-
rence of an event corresponding to another random variable (e.g.,
Y = y for a random vector Y),
.
H[X | Y = y] = Ex∼ p( x|y) [− log p( x | y)]. (8.1)

Instead of averaging over the surprise of samples from the distribu-


tion p( x) (like the entropy H[X]), this quantity simply averages over
the surprise of samples from the conditional distribution p( x | y).

Definition 8.1 (Conditional entropy). The conditional entropy of a ran-


dom vector X given the random vector Y is defined as

.
H[X | Y] = Ey∼ p(y) [H[X | Y = y]] (8.2)
= E(x,y)∼ p(x,y) [− log p( x | y)]. (8.3)

Intuitively, the conditional entropy of X given Y describes our aver-


age surprise about realizations of X given a particular realization of Y,
averaged over all such possible realizations of Y. In other words, con-
ditional entropy corresponds to the expected remaining uncertainty
in X after we observe Y. Note that, in general, H[X | Y] ̸= H[Y | X].

It is crucial to stress the difference between H[X | Y = y] and the con-


ditional entropy H[X | Y]. The former simply corresponds to a proba-
bilistic update of our uncertainty in X after we have observed the real-
ization y ∼ Y. In contrast, conditional entropy predicts how much un-
certainty will remain about X (in expectation) after we will observe Y.

Definition 8.2 (Joint entropy). One can also define the joint entropy of
random vectors X and Y,

.
H[X, Y] = E( x,y)∼ p( x,y) [− log p( x, y)], (8.4)

as the combined uncertainty about X and Y. Observe that joint entropy


is symmetric.

This gives the chain rule for entropy,

H[X, Y] = H[Y] + H[X | Y] (8.5) using the product rule (1.11) and the
definition of conditional entropy (8.2)
= H[ X ] + H[ Y | X ]. (8.6) using symmetry of joint entropy

That is, the joint entropy of X and Y is given by the uncertainty about X
and the additional uncertainty about Y given X. Moreover, this also
yields Bayes’ rule for entropy,

      H[ X | Y ] = H[ Y | X ] + H[ X ] − H[ Y ].   (8.7)   (using the chain rule for entropy (8.5) twice)

A very intuitive property of entropy is its monotonicity: when condi-


tioning on additional observations the entropy can never increase,

H[ X | Y ] ≤ H[ X ]. (8.8)

Colloquially, this property is also called the “information never hurts”


principle. We will derive a proof in the following section.

8.2 Mutual Information

Recall that our fundamental objective is to reduce entropy, as this cor-


responds to reduced uncertainty in the variables, which we want to
predict. Thus, we are interested in how much information we “gain”
about the random vector X by choosing to observe a random vector Y.
By our interpretation of conditional entropy from the previous section,
this is described by the following quantity.

Definition 8.3 (Mutual information, MI). The mutual information of X


and Y (also known as the information gain) is defined as
.
I(X; Y) = H[X] − H[X | Y] (8.9)
= H[X] + H[Y] − H[X, Y]. (8.10)

In words, we subtract the uncertainty left about X after observing Y


from our initial uncertainty about X. This measures the reduction in
our uncertainty in X (as measured by entropy) upon observing Y. Un-
like conditional entropy, it follows from the definition that mutual in-
formation is symmetric. That is,

I(X; Y) = I(Y; X). (8.11)

Thus, the mutual information between X and Y can be understood as


the approximation error (or information loss) when assuming that X and Y are
independent.

Figure 8.1: Information gain. The first graph shows the prior. The second graph shows a selection of samples with large information gain (large uncertainty reduction). The third graph shows a selection of samples with small information gain (small uncertainty reduction). [Figure omitted.]

In particular, using Gibbs’ inequality (cf. Problem 5.5), this relationship shows that
I(X; Y) ≥ 0 with equality when X and Y are independent, and also proves the
information never hurts principle (8.8) as

      0 ≤ I(X; Y) = H[X] − H[X | Y].   (8.12)

Figure 8.2: Relationship between mutual information and entropy, expressed as a Venn diagram (regions H[X], H[Y], H[X, Y], H[X | Y], I(X; Y), H[Y | X]). [Figure omitted.]

Example 8.4: Mutual information of Gaussians

Given the Gaussian random vector X ∼ N (µ, Σ ) and the noisy observation
Y = X + ε where ε ∼ N (0, σn2 I ), we want to find the information gain of X when
observing Y. Using our definitions from this chapter, we obtain

I(X; Y) = I(Y; X) using symmetry (8.11)

= H[ Y ] − H[ Y | X ] by mutual information (8.9)

= H[ Y ] − H[ ε ] given X, the only randomness in Y


originates from ε
         = (1/2) log( (2πe)d det(Σ + σn2 I ) ) − (1/2) log( (2πe)d det(σn2 I ) )   (using the entropy of Gaussians (5.31))
         = (1/2) log( det(Σ + σn2 I ) / det(σn2 I ) )
         = (1/2) log det( I + σn−2 Σ ).   (8.13)
Intuitively, the larger the noise σn2 in relation to the covariance
of X, the smaller the information gain.
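
Equation (8.13) is straightforward to evaluate; a small NumPy sketch with an arbitrary 2 × 2 covariance, using a log-determinant for numerical stability:

    import numpy as np

    def gaussian_information_gain(Sigma, sigma_n2):
        # I(X; Y) = 0.5 * log det(I + Sigma / sigma_n2), cf. Eq. (8.13)
        d = Sigma.shape[0]
        _, logdet = np.linalg.slogdet(np.eye(d) + Sigma / sigma_n2)
        return 0.5 * logdet

    Sigma = np.array([[1.0, 0.5],
                      [0.5, 1.0]])
    for sigma_n2 in (0.1, 1.0, 10.0):
        print(sigma_n2, gaussian_information_gain(Sigma, sigma_n2))  # gain shrinks as noise grows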

8.2.1 Synergy and Redundancy

It is sometimes useful to write down the mutual information of X


and Y conditioned (in expectation) on a third random vector Z. This
leads us to the following definition.

Definition 8.5 (Conditional mutual information). The conditional mu-


tual information of X and Y given Z is defined as

.
I(X; Y | Z) = H[X | Z] − H[X | Y, Z]. (8.14)
= H[X, Z] + H[Y, Z] − H[Z] − H[X, Y, Z] (8.15) using the relationship of joint and
conditional entropy (8.5)
= I(X; Y, Z) − I(X; Z). (8.16)

Thus, the conditional mutual information corresponds to the reduction


of uncertainty in X when observing Y, given we already observed Z.
It also follows that conditional mutual information is symmetric:

I(X; Y | Z) = I(Y; X | Z) . (8.17)

We have seen in this chapter that entropy is monotonically decreasing


as we condition on new information, and called this the “information
never hurts” principle (8.8). However, the same does not hold for mu-
tual information! That is, information about a random vector Z may
reduce the mutual information between random vectors X and Y (Problem 8.2).

Remark 8.6: Sufficient statistics and data processing inequality


A related concept is the data processing inequality (8.42) which
you prove in Problem 8.2 (3) and which allows us to formalize a concept which
we have seen multiple times already, namely the notion of a suf-
ficient statistic. Consider the Markov chain λ → X → s(X), for
example, λ may be parameters of the distribution of X. By the
data processing inequality (8.42), I(λ; s(X)) ≤ I(λ; X). If the data
processing inequality is satisfied with equality then s(X) is called
a sufficient statistic of X for the inference of λ.

To understand the behavior of mutual information under conditioning,


it is helpful to consider the interaction information
.
I(X; Y; Z) = I(X; Y) − I(X; Y | Z). (8.18)

If the interaction is positive then some information about X that is


provided by Y is also provided by Z (i.e., conditioning on Z decreases
MI between X and Y), and we say that there is redundancy between Y
and Z (with respect to X). Conversely, if the interaction is negative then
learning about Z increases what can be learned from Y about X, and
we say that there is synergy between Y and Z. We will see later in this
chapter that the absence of synergies can lead to efficient algorithms
for maximizing mutual information.

8.2.2 Mutual Information as Utility Function


Following our introduction of mutual information, it is natural to an-
swer the question “where should I collect data?” by saying “wherever
mutual information is maximized”. More concretely, assume we are
given a set X of possible observations of f , where y x denotes a single
such observation at x ∈ X ,
.
yx = f x + ε x , (8.19)
.
f x = f ( x), and ε x is some zero-mean Gaussian noise. For a set of
observations S = { x1 , . . . , xn }, we can write yS = f S + ε where
   
y x1 f x1
.  ..  .  .  2
yS =  .  , f S =  .. 
  
 , and ε ∼ N (0, σn I ).
y xn f xn

Note that both yS and f S are random vectors. Our goal is then to find
a subset S ⊆ X of size n maximizing the information gain between
our model f and yS .

This yields the maximization objective,


.
I (S) = I( f S ; yS ) = H[ f S ] − H[ f S | yS ]. (8.20)

Here, H[ f S ] corresponds to the uncertainty about f S before obtain-


ing the observations yS and H[ f S | yS ] corresponds to the uncertainty
about f S , in expectation, after obtaining the observations yS .

Remark 8.7: Making optimal decisions with intrinsic rewards


Note that this objective function maps neatly onto our initial con-
sideration of making optimal decisions under uncertainty from

Section 1.4. In fact, you can think of maximizing I (S) simply as


computing the optimal decision rule for the utility

r (yS , S) = H[ f S ] − H[ f S | YS = yS ], (8.21) with YS = yS denoting an event

with I (S) = EyS [r (yS , S)] measuring the expected utility of ob-
servations yS . Such a utility or reward function is often called an
intrinsic reward since it does not measure an “objective” external
quantity, but instead a “subjective” quantity that is internal to the
model of f .
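For jointly Gaussian f_S and y_S (as with a GP model and Gaussian noise), the objective (8.20) has the closed form I(S) = (1/2) log det(I + σ_n^{−2} K_SS), which is the expression used later in (9.10). The following is a minimal sketch of evaluating this objective; the RBF kernel, its hyperparameters, and the function names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between row-stacked inputs A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def information_gain(X_S, noise_std=0.1):
    """I(f_S; y_S) = 0.5 * log det(I + sigma_n^{-2} K_SS) for a zero-mean GP prior."""
    K = rbf_kernel(X_S, X_S)
    _, logdet = np.linalg.slogdet(np.eye(len(X_S)) + K / noise_std**2)
    return 0.5 * logdet

# A spread-out set of observations is more informative than a clustered one.
X_spread = np.linspace(0, 5, 5).reshape(-1, 1)
X_clustered = np.full((5, 1), 2.5)
print(information_gain(X_spread), information_gain(X_clustered))
```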

Observe that picking a subset of points S ⊆ X from the domain X is


a combinatorial problem. That is to say, we are optimizing a function
over discrete sets. In general, such combinatorial optimization prob-
lems tend to be very difficult. It can be shown that maximizing mutual
information is NP-hard.

8.3 Submodularity of Mutual Information

We will look at optimizing mutual information in the following sec-


tion. First, we want to introduce the notion of submodularity which is
important in the analysis of discrete functions.

Definition 8.8 (Marginal gain). Given a (discrete) function F : P (X ) → R,


the marginal gain of x ∈ X given A ⊆ X is defined as

∆_F(x | A) := F(A ∪ {x}) − F(A). (8.22)

Intuitively, the marginal gain describes how much “adding” the addi-
tional x to A increases the value of F.

When maximizing mutual information, the marginal gain is (Problem 8.4)

∆ I ( x | A ) = I( f x ; y x | y A ) (8.23)
= H[ y x | y A ] − H[ ε x ]. (8.24)

That is, when maximizing mutual information, the marginal gain corresponds
to the difference between the remaining uncertainty about y_x after observing
y_A and the entropy of the noise H[ε_x]. Altogether, the marginal gain
represents the reduction in uncertainty gained by additionally observing {x}.

Definition 8.9 (Submodularity). A (discrete) function F : P(X) → R
is submodular iff for any x ∈ X and any A ⊆ B ⊆ X it satisfies

F(A ∪ {x}) − F(A) ≥ F(B ∪ {x}) − F(B). (8.25)

Figure 8.3: Monotone submodularity. The effect of “adding” x to the smaller set A is larger than the effect of adding x to the larger set B.

Equivalently, using our definition of marginal gain, we have that F is


submodular iff for any x ∈ X and any A ⊆ B ⊆ X ,

∆ F ( x | A ) ≥ ∆ F ( x | B ). (8.26)

That is, “adding” x to the smaller set A yields more marginal gain
than adding x to the larger set B. In other words, the function F has
“diminishing returns”. In this way, submodularity can be interpreted
as a notion of “concavity” for discrete functions.

Definition 8.10 (Monotone submodularity). A function F : P (X ) → R


is called monotone iff for any A ⊆ B ⊆ X it satisfies

F ( A ) ≤ F ( B ). (8.27)

If F is also submodular, then F is called monotone submodular.

Theorem 8.11. The objective I is monotone submodular.


Proof. We fix arbitrary subsets A ⊆ B ⊆ X and any x ∈ X. We have,

I is submodular ⇐⇒ ∆_I(x | A) ≥ ∆_I(x | B)   (by submodularity in terms of marginal gain (8.26))
⇐⇒ H[y_x | y_A] − H[ε_x] ≥ H[y_x | y_B] − H[ε_x]   (using Equation (8.24))
⇐⇒ H[y_x | y_A] ≥ H[y_x | y_B].   (H[ε_x] cancels)

Due to the “information never hurts” principle (8.8) of entropy and as
A ⊆ B, this is always true. Moreover,

I is monotone ⇐⇒ I(A) ≤ I(B)   (by the definition of monotonicity (8.27))
⇐⇒ I(f_A; y_A) ≤ I(f_B; y_B)   (using the definition of I (8.20))
⇐⇒ I(f_B; y_A) ≤ I(f_B; y_B)   (using I(f_B; y_A) = I(f_A; y_A) as y_A ⊥ f_B | f_A)
⇐⇒ H[f_B] − H[f_B | y_A] ≤ H[f_B] − H[f_B | y_B]   (using the definition of MI (8.9))
⇐⇒ H[f_B | y_A] ≥ H[f_B | y_B],   (H[f_B] cancels)

which is also satisfied due to the “information never hurts” principle (8.8).

The submodularity of I can be interpreted from the perspective of


information theory. It turns out that submodularity is equivalent to
the absence of synergy between observations (Problem 8.5). Intuitively, without
synergies, acting greedily is enough to find a near-optimal solution.
If there are synergies, then the combinatorial search problem is much
harder, because single-step optimal actions do not necessarily lead us
to the optimal solution. Consider the extreme case of having to solve
a “needle in a haystack” problem, where only a single subset of X
with size k achieves objective value 1, with all other subsets achieving
objective value 0. In this case, we can do nothing but exhaustively
search through all (|X| choose k) subsets of size k to find the optimal solution.

8.4 Maximizing Mutual Information

As we cannot efficiently pick a set S ⊆ X to maximize mutual informa-


tion but know that I is submodular, a natural approach is to maximize
mutual information greedily. That is, we pick the locations x1 through
xn individually by greedily finding the location with the maximal mu-
tual information. The following general result for monotone submod-
ular function maximization shows that, indeed, this greedy approach
provides a good approximation.

Theorem 8.12 (Greedy submodular function maximization). If the set


function F : P(X) → R≥0 is monotone submodular, then greedily maximizing F
(i.e., constructing S_n by adding in each of n rounds the element with maximal
marginal gain) is a (1 − 1/e)-approximation:²

F(S_n) ≥ (1 − 1/e) · max_{S⊆X, |S|=n} F(S). (8.28)

2 Note that 1 − 1/e ≈ 0.632.

Proof. Fix any n ≥ 1. Let S⋆ ∈ arg max{F(S) | S ⊆ X, |S| ≤ n}. We
can assume |S⋆| = n due to the monotonicity (8.27) of F, and we write
S⋆ := {x⋆_1, . . . , x⋆_n}. We have,

F(S⋆) ≤ F(S⋆ ∪ S_t)   (using monotonicity (8.27))
     = F(S_t) + ∑_{i=1}^{n} ∆_F(x⋆_i | S_t ∪ {x⋆_1, . . . , x⋆_{i−1}})   (using the definition of marginal gain (8.22))
     ≤ F(S_t) + ∑_{x⋆ ∈ S⋆} ∆_F(x⋆ | S_t)   (using submodularity (8.26))
     ≤ F(S_t) + n(F(S_{t+1}) − F(S_t)).   (using that S_{t+1} = S_t ∪ {x} is chosen such that ∆_F(x | S_t) is maximized)

Let δ_t := F(S⋆) − F(S_t). Then,

δ_t = F(S⋆) − F(S_t) ≤ n(F(S_{t+1}) − F(S_t)) = n(δ_t − δ_{t+1}).

This implies δ_{t+1} ≤ (1 − 1/n) δ_t and hence δ_n ≤ (1 − 1/n)^n δ_0 ≤ δ_0/e, using the
well-known inequality 1 − x ≤ e^{−x} for all x ∈ R.

Finally, observe that δ_0 = F(S⋆) − F(∅) ≤ F(S⋆) due to the non-negativity of F. We obtain,

δ_n = F(S⋆) − F(S_n) ≤ δ_0/e ≤ F(S⋆)/e.

Rearranging the terms yields the theorem.

Optional Readings
The original proof of greedy maximization for submodular functions was given by “An analysis of approximations for maximizing submodular set functions” (Nemhauser et al., 1978).

For more background on maximizing submodular functions, see


“Submodular function maximization” (Krause and Golovin, 2014).
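To make the greedy scheme analyzed in Theorem 8.12 concrete, the following is a minimal, generic sketch of greedy set-function maximization; the name `greedy_maximize` and the use of hashable candidates (e.g., indices into a pool of inputs) are illustrative assumptions.

```python
def greedy_maximize(F, candidates, n):
    """Greedily grow a set S of size n, adding in each round the element x
    with the largest marginal gain Delta_F(x | S) = F(S + [x]) - F(S)."""
    S = []
    for _ in range(n):
        best_x, best_gain = None, float("-inf")
        for x in candidates:
            if x in S:
                continue
            gain = F(S + [x]) - F(S)
            if gain > best_gain:
                best_x, best_gain = x, gain
        S.append(best_x)
    return S
```

Applied to a monotone submodular set function such as the information-gain sketch above (wrapped to accept a list of inputs), the returned set enjoys the (1 − 1/e) guarantee of Theorem 8.12.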

Now that we have established that greedy maximization of mutual


information is a decent approximation to maximizing the joint infor-
mation of data, we will look at how this optimization problem can be
solved in practice.

8.4.1 Uncertainty Sampling


When maximizing mutual information, at time t when we have already
picked St = { x1 , . . . , xt }, we need to solve the following optimization
problem,
x_{t+1} := arg max_{x∈X} ∆_I(x | S_t) (8.29)
        = arg max_{x∈X} I(f_x; y_x | y_{S_t}). (8.30)   (using Equation (8.23))

Note that f x and y x are univariate random variables. Thus, using


our formula for the mutual information of conditional linear Gaus-
sians (8.13), we can simplify to,

= arg max_{x∈X} (1/2) log(1 + σ_t²(x)/σ_n²) (8.31)

where σt2 ( x) is the (remaining) variance at x after observing St . As-


suming the label noise is independent of x (i.e., homoscedastic),
= arg max_{x∈X} σ_t²(x). (8.32)

Therefore, if f is modeled by a Gaussian and we assume homoscedas-


tic noise, greedily maximizing mutual information corresponds to sim-
ply picking the point x with the largest variance. This algorithm is also
called uncertainty sampling.
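As a minimal sketch of uncertainty sampling (8.32) with a GP model (the RBF kernel, its hyperparameters, and the helper names are illustrative assumptions):

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def posterior_variance(X_obs, X_cand, noise_var=0.01):
    """GP posterior variance sigma_t^2 at the candidates, given noisy observations at X_obs."""
    if len(X_obs) == 0:
        return np.diag(rbf(X_cand, X_cand))
    K = rbf(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    k_star = rbf(X_cand, X_obs)
    return np.diag(rbf(X_cand, X_cand)) - np.sum(k_star @ np.linalg.inv(K) * k_star, axis=1)

# Uncertainty sampling: repeatedly query the candidate with the largest posterior variance.
X_cand = np.linspace(0, 5, 101).reshape(-1, 1)
X_obs = np.empty((0, 1))
for t in range(5):
    x_next = X_cand[np.argmax(posterior_variance(X_obs, X_cand))]
    X_obs = np.vstack([X_obs, x_next])  # we would also record the noisy label y_next here
print(np.round(X_obs.ravel(), 2))
```

Note that with fixed hyperparameters the GP posterior variance depends only on where we have observed, not on the observed values, so this homoscedastic selection rule could even be planned before collecting any labels.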

x⋆
8.4.2 Heteroscedastic Noise

Uncertainty sampling is clearly problematic if the noise is heteroscedastic.
If there is a set of inputs whose large aleatoric uncertainty dominates the
epistemic uncertainty, uncertainty sampling will continuously choose those
points even though the epistemic uncertainty will not be reduced substantially
(cf. Figure 8.4).

Figure 8.4: Uncertainty sampling with heteroscedastic noise. The epistemic uncertainty of the model is shown in dark gray, the aleatoric uncertainty of the data in light gray. Uncertainty sampling would repeatedly pick points around x⋆ as they maximize the epistemic uncertainty, even though the aleatoric uncertainty at x⋆ is much larger than at the boundary.

Looking at Equation (8.31) suggests a natural fix. Instead of only considering
the epistemic uncertainty σ_t²(x), we can also consider the

aleatoric uncertainty σ_n²(x) by modeling heteroscedastic noise, yielding

x_{t+1} := arg max_{x∈X} (1/2) log(1 + σ_t²(x)/σ_n²(x)) = arg max_{x∈X} σ_t²(x)/σ_n²(x). (8.33)

Thus, we choose locations that trade large epistemic uncertainty with


large aleatoric uncertainty. Ideally, we find a location where the epis-
temic uncertainty is large, and the aleatoric uncertainty is low, which
promises a significant reduction of uncertainty around this location.
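A minimal sketch of the heteroscedastic criterion (8.33), reusing the hypothetical `posterior_variance` helper from the previous sketch and assuming some model `noise_variance(x)` of the aleatoric noise (e.g., from a heteroscedastic GP):

```python
import numpy as np

def heteroscedastic_uncertainty_sampling(X_obs, X_cand, noise_variance):
    """Pick the candidate maximizing the ratio of epistemic to aleatoric variance (8.33)."""
    epistemic = posterior_variance(X_obs, X_cand)   # sigma_t^2(x)
    aleatoric = noise_variance(X_cand)              # assumed model of sigma_n^2(x)
    return X_cand[np.argmax(epistemic / aleatoric)]
```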

8.4.3 Classification
While we focused on regression, one can apply active learning also
for other settings, such as (probabilistic) classification. In this setting,
for any input x, a model produces a categorical distribution over la-
bels y_x (see Section 1.3). Here, uncertainty sampling corresponds to selecting samples
that maximize the entropy of the predicted label y_x,

x_{t+1} := arg max_{x∈X} H[y_x | x_{1:t}, y_{1:t}]. (8.34)

The entropy of a categorical distribution is simply a finite sum of


weighted surprise terms.

Figure 8.5: Uncertainty sampling in classification. The area with high uncertainty (as measured by entropy) is highlighted in yellow. The figures shown each display a sequence of model updates, each after one new observation. In the left figure, the classes are well-separated and uncertainty is dominated by epistemic uncertainty, whereas in the right figure the uncertainty is dominated by noise. In the latter case, if we mostly choose points x in the area of highest uncertainty (i.e., close to the decision boundary) to make observations, the label noise results in frequently changing models.

This approach generally leads to sampling points that are close to the
decision boundary. Often, the uncertainty is mainly dominated by label noise
rather than epistemic uncertainty, and hence, we do not learn much from our
observations. This is a similar problem to the one we encountered with
uncertainty sampling in the setting of heteroscedastic noise.

This naturally suggests distinguishing between the aleatoric and epis-


temic uncertainty of the model f (parameterized by θ). To this end,
mutual information can be used similarly as we have done with un-
certainty sampling for regression,
x_{t+1} := arg max_{x∈X} I(θ; y_x | x_{1:t}, y_{1:t}) (8.35)
        = arg max_{x∈X} I(y_x; θ | x_{1:t}, y_{1:t})   (using symmetry (8.11))
        = arg max_{x∈X} H[y_x | x_{1:t}, y_{1:t}] − H[y_x | θ, x_{1:t}, y_{1:t}]   (using the definition of mutual information (8.9))
        = arg max_{x∈X} H[y_x | x_{1:t}, y_{1:t}] − E_{θ | x_{1:t}, y_{1:t}}[H[y_x | θ, x_{1:t}, y_{1:t}]] (8.36)   (using the definition of conditional entropy (8.2))
        = arg max_{x∈X} H[y_x | x_{1:t}, y_{1:t}] − E_{θ | x_{1:t}, y_{1:t}}[H[y_x | θ]], (8.37)   (using the definition of entropy (5.27) and assuming y_x ⊥ x_{1:t}, y_{1:t} | θ)

where the first term of (8.37) is the entropy of the predictive posterior and the second term is the expected entropy of the likelihood.

The first term measures the entropy of the averaged prediction while
the second term measures the average entropy of predictions. Thus,
the first term looks for points where the average prediction is not con-
fident. In contrast, the second term penalizes points where many of
the sampled models are not confident about their prediction, and thus
looks for points where the models are confident in their predictions.
This identifies those points x where the models disagree about the la-
bel y x (that is, each model is “confident” but the models predict differ-
ent labels). For this reason, this approach is known as Bayesian active
learning by disagreement (BALD).

Note that the second term of the difference acts as a regularizer when
compared to Equation (8.34). The second term mirrors our description
of aleatoric uncertainty from Section 2.2. Recall that we interpreted ale-
atoric uncertainty as the average uncertainty for all models. Crucially,
here we use entropy to “measure” uncertainty, whereas previously we
have been using variance. Therefore, intuitively, Equation (8.36) sub-
tracts the aleatoric uncertainty from the total uncertainty about the
label.

Observe that both terms require approximate forms of the posterior


distribution. In Chapters 5 and 6, we have seen various approaches
from variational inference and MCMC methods which can be used
here. The first term can be approximated by the predictive distribu-
tion of an approximated posterior which is obtained, for example, us-
ing variational inference. The nested entropy of the second term is
typically easy to compute, as it corresponds to the entropy of the (dis-
crete) likelihood distribution p(y | x, θ) of the model θ. The outer
expectation of the second term may be approximated using (approxi-
mate) samples from the posterior distribution obtained via variational
inference, MCMC, or some other method.
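As a minimal sketch of estimating the BALD objective (8.37) from (approximate) posterior samples of θ, represented here simply by an array of per-sample class probabilities p(y | x, θ_m) (as one might obtain, e.g., from Monte Carlo dropout or an MCMC chain); all names are illustrative.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Entropy of categorical distributions given along the last axis."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def bald_scores(probs):
    """probs has shape (num_posterior_samples, num_candidates, num_classes).
    Returns an estimate of I(theta; y_x | data) for each candidate x."""
    total = entropy(probs.mean(axis=0))       # H[y_x | data], entropy of the predictive posterior
    aleatoric = entropy(probs).mean(axis=0)   # E_theta[H[y_x | theta]], expected likelihood entropy
    return total - aleatoric

# Example with placeholder posterior predictions for 100 candidates and 3 classes.
probs = np.random.dirichlet(np.ones(3), size=(20, 100))
x_next_index = np.argmax(bald_scores(probs))
```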

Optional Readings
• Gal, Islam, and Ghahramani (2017).
Deep Bayesian active learning with image data.

8.5 Learning Locally: Transductive Active Learning

So far we have explored how to select observations that provide us


with the best predictor f ( x) across the entire domain x ∈ X . How-
ever, we typically utilize predictors to make predictions at a particular
location x⋆ . It is therefore a natural question to ask how to select ob-
servations that lead to the best individual prediction f ( x⋆ ) at the pre-
diction target x⋆ . The distinction between these two settings is closely
related to the distinction between two general approaches to learning:
inductive learning and transductive learning. Inductive learning aims to
extract general rules from the data, that is, to extract and compress
the most bits of information. In contrast, transductive learning aims
to make the best prediction at a particular location, that is, to extract
the most bits of information relevant to the prediction at x⋆ . The concept
of transduction was developed by Vladimir Vapnik in the 1980s who
described its essence as follows:
“When solving a problem of interest, do not solve a more general prob-
lem as an intermediate step. Try to get the answer that you really need
but not a more general one.” — Vladimir Vapnik

Remark 8.13: What are the prediction targets x⋆ ?


Typically, in transductive learning, we cannot directly observe the
value f ( x⋆ ) at the prediction target x⋆ , for example, when learn-
ing from a fixed dataset X ′ ⊂ X . If we could observe f ( x⋆ ) di-
rectly (or perturbed by noise), solving the learning task would
only require memorization. Instead, most interesting learning
tasks require generalizing f ( x⋆ ) from the behavior of f at other
locations. Therefore, transductive learning becomes interesting
precisely when we cannot directly observe f(x⋆) (we will discuss an
example of this kind in Problem 8.6).
Note that this is similar to the inductive setting where we, in prin-
ciple, could observe f ( x) at any location x, but practically, we can
only observe f ( x) at a finite number of locations. Since such an in-
ductive model is then used to make predictions at any location x⋆ ,
it also needs to generalize from the observations.

Following the transductive paradigm, when already knowing that our


goal is to predict f ( x⋆ ) our objective is to select observations that
provide the most information about f ( x⋆ ). We can express this ob-
jective elegantly using the probabilistic framework from this chapter
(MacKay, 1992; Hübotter et al., 2024):
x_{t+1} := arg max_{x∈X′} I(f_{x⋆}; y_x | x_{1:t}, y_{1:t}) = arg min_{x∈X′} H[f_{x⋆} | x_{1:t}, y_{1:t}, y_x]. (8.38)

Hübotter et al. (2024) show that this objective leads to a remarkably


different selection of observations compared to the inductive active
learning objective we discussed earlier. Indeed, while the inductive
objective focuses on selecting a diverse set of examples, the transductive
objective also takes into account the relevance of the examples to the
prediction target x⋆ . We can see this tradeoff between diversity and
relevance by rewriting the transductive objective as
I(f_{x⋆}; y_x | x_{1:t}, y_{1:t}) = I(f_{x⋆}; y_x) − I(f_{x⋆}; y_x; y_{1:t} | x_{1:t}) (8.39)
(using the definition of interaction information (8.18); the first term is the “relevance”, the second the “redundancy”)

where the first term measures the information gain of y x about f x⋆


while the second term is the interaction information which measures
the redundancy of the information in y x and y1:t about f x⋆ . In this
way, transductive active learning describes a middle ground between
traditional search and retrieval methods that focus on relevance on the
one hand and inductive active learning which focuses on diversity on
the other hand.
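For a GP (or any jointly Gaussian model), the transductive objective (8.38) can be evaluated in closed form via the Gaussian identity I(U; V) = −(1/2) log(1 − ρ²), where ρ is the correlation between f_{x⋆} and the candidate observation y_x under the current posterior. The following is a minimal sketch under these assumptions; the kernel and all names are illustrative.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def posterior_cov(X_obs, A, B, noise_var=0.01):
    """GP posterior covariance k_t(A, B) given noisy observations at X_obs."""
    if len(X_obs) == 0:
        return rbf(A, B)
    K = rbf(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    return rbf(A, B) - rbf(A, X_obs) @ np.linalg.solve(K, rbf(X_obs, B))

def transductive_scores(X_obs, X_cand, x_star, noise_var=0.01):
    """I(f_{x*}; y_x | data) for each candidate x."""
    var_star = posterior_cov(X_obs, x_star, x_star, noise_var)[0, 0]
    cross = posterior_cov(X_obs, X_cand, x_star, noise_var)[:, 0]
    var_cand = np.diag(posterior_cov(X_obs, X_cand, X_cand, noise_var))
    rho2 = cross**2 / (var_star * (var_cand + noise_var))
    return -0.5 * np.log(1.0 - rho2)

# Example: score candidate observations for predicting f at x_star = 2.5.
X_cand = np.linspace(0, 5, 51).reshape(-1, 1)
scores = transductive_scores(np.empty((0, 1)), X_cand, np.array([[2.5]]))
```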

Optional Readings
• Hübotter, Sukhija, Treven, As, and Krause (2024).
Transductive active learning: Theory and applications.
In modern machine learning, one often differentiates between a
“pre-training” and a “fine-tuning” stage. During pre-training, a
model is trained on a large dataset to extract general knowledge
without a specific task in mind. Then, during fine-tuning, the
model is adapted to a specific task by training on a smaller dataset.
Whereas (inductive) active learning is closely linked to the pre-
training stage, transductive active learning has been shown to be
useful for task-specific fine-tuning:
• Hübotter, Bongni, Hakimi, and Krause (2025).
Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs.
• Bagatella, Hübotter, Martius, and Krause (2024).
Active Fine-Tuning of Generalist Policies.

Discussion

We have discussed how to select the most informative data. Thereby,


we focused mostly on inductive active learning which is applicable to
“pre-training” when we aim to extract general knowledge from data,
but also explored transductive active learning which is useful for “fine-
tuning” when we aim to adapt a model to a specific task.

We focused on quantifying the “informativeness” of data by its infor-


mation gain, which is a common approach, though many other viable

criteria exist:

Remark 8.14: Beyond mutual information


The problem of identifying which experiments to conduct in order
to maximize learning is studied extensively in the field of experi-
mental design where a set of observations S is commonly called a
design. In the presence of a prior and likelihood, and a different
possible posterior distribution for each design S, the field is also
called Bayesian experimental design (Chaloner and Verdinelli, 1995;
Rainforth et al., 2024; Mutnỳ, 2024).

As we highlighted in the beginning of this chapter, how we mea-


sure the utility (i.e., the informativeness) of a design S is crucial.
Such a measure is called a design criterion and a design which
is optimal with respect to a design criterion is called an optimal
design. The literature studies various design criteria beyond max-
imizing mutual information (i.e., minimizing posterior entropy).
One popular alternative is to select the observations S that min-
imize the trace of the posterior covariance matrix, which corre-
sponds to minimizing the posterior average variance.

Next, we will move to the topic of optimization and ask which data we
should select to find the optimum of an unknown function as quickly
as possible. In the following chapter, we will focus on “Bayesian opti-
mization” (also called “bandit optimization”) where our aim is to find
and sample the optimal point. A related task that is slightly more re-
lated to active learning is the “best-arm identification” problem where
we aim only to identify the optimal point without sampling it. This
problem is closely related to transductive active learning (with the lo-
cal task being defined by the location of the maximum) and so-called
entropy search methods that minimize the entropy of the posterior dis-
tribution over the location or value of the maximum (akin to Equa-
tion (8.38)) are often used to solve this problem (Hennig and Schuler,
2012; Wang and Jegelka, 2017; Hvarfner et al., 2022).

Problems

8.1. Mutual information and KL divergence.

Show that by expanding the definition of mutual information,

I(X; Y) = E_{(x,y)∼p}[ log( p(x, y) / (p(x) p(y)) ) ]
        = KL( p(x, y) ∥ p(x) p(y) ) (8.40)
        = E_{y∼p}[ KL( p(x | y) ∥ p(x) ) ], (8.41)

where p( x, y) denotes the joint distribution and p( x), p(y) denote the
marginal distributions of X and Y.

8.2. Non-monotonicity of cond. mutual information.


1. Show that if X ⊥ Z then I(X; Y | Z) ≥ I(X; Y).
2. Show that if X ⊥ Z | Y then I(X; Y | Z) ≤ I(X; Y).
3. Note that the condition X ⊥ Z | Y is the Markov property, namely,
{X, Y, Z} form a Markov chain with graphical model X → Y → Z.
This situation often occurs when data is processed sequentially.
Prove

I(X; Z) ≤ I(X; Y). (8.42)

which is also known as the data processing inequality, and which


says that processing cannot increase the information contained in
a signal.

8.3. Interaction information.


1. Show that interaction information is symmetric.
2. Let X1 , X2 ∼ Bern( p) for some p ∈ (0, 1) and independent. We
denote by Y := X1 ⊕ X2 their XOR. Compute I(Y; X1; X2).

8.4. Marginal gain of maximizing mutual information.

Show that in the context of maximizing mutual information, the marginal


gain is

∆ I ( x | A ) = I( f x ; y x | y A ) = H[ y x | y A ] − H[ ε x ].

8.5. Submodularity means no synergy.

Show that submodularity is equivalent to the absence of synergy be-


tween observations. That is, show that for all A ⊆ B,

I( f x ; y x ; yB\ A | y A ) ≥ 0. (8.43)

8.6. Transductive active learning.

Consider the prior distribution Xi ∼ N (0, 1) with all Xi independent


and consider the “output” variable

Z := ∑_{i=1}^{100} i · X_i.

Our observation when Xit is selected in round t is generated by


Y_t := X_{i_t} + ε_t   with   ε_t iid∼ N(0, 1).

For a set of inputs S = {i1 , . . . , it } ⊆ {1, . . . , 100}, we define the infor-


mation gain of S as
F(S) := I(Z; Y_{1:t}) = H[Z] − H[Z | Y_{1:t}]

where H[ Z ] is the entropy of Z according to the prior and H[ Z | Y1:t ]


is the conditional entropy after the observations Y1 , . . . , Yt .

Note: The random vector ( X1 , . . . , X100 , Z ) is jointly Gaussian due to the


closedness of Gaussians under linear transformations.
1. We define the marginal information gain ∆( j | S) for a new observa-
tion Yj as
∆(j | S) := F(S ∪ {j}) − F(S).

Does ∆( j | S) ≥ 0 hold for all j and S?


2. Is maximizing the marginal information gain ∆(i | S) equivalent to
picking the point i ∈ {1, . . . , 100} with maximum variance under
the current posterior over Xi , that is, “equivalent to uncertainty
sampling”?
3. Let us consider the alternative prior where Xi ∼ Bern(0.5) are fair
coin flips which we observe directly (i.e., Yt = Xit ). For which of
the following definitions of Z is the acquisition function S 7→ F (S)
submodular?
(a) Z := X1 ∧ · · · ∧ X100 with ∧ denoting the logical AND.
(b) Z := X1 ∨ · · · ∨ X100 with ∨ denoting the logical OR.
(c) Z := X1 ⊕ · · · ⊕ X100 with ⊕ denoting the logical XOR, i.e., the
exclusive OR which returns 1 iff exactly one of the Xi is 1 and
0 otherwise.
9 Bayesian Optimization

Often, obtaining data is costly. In the previous chapter, this led us


to investigate how we can optimally improve our understanding (i.e.,
reduce uncertainty) of the process we are trying to model. However,
purely improving our understanding is often not good enough. In
many cases, we want to use our improving understanding simultane-
ously to reach certain goals. This is a very common problem in artificial
intelligence and will concern us for the rest of this manuscript. One
common instance of this problem is the setting of optimization.

Figure 9.1: Illustration of Bayesian optimization. We pass an input x_t into the unknown function f⋆ to obtain noisy observations y_t = f⋆(x_t) + ϵ_t.

Given some function f⋆ : X → R, suppose we want to find the

arg max_{x∈X} f⋆(x). (9.1)

Now, contrary to classical optimization, we are interested in the setting


where the function f ⋆ is unknown to us (like a “black-box”). We are
only able to obtain noisy observations of f ⋆ ,

yt = f ⋆ ( xt ) + ε t . (9.2)

Moreover, these noise-perturbed evaluations are costly to obtain. We


will assume that similar alternatives yield similar results,1 which is
commonly encoded by placing a Gaussian process prior on f⋆. This
assumed correlation is fundamentally what will allow us to learn a
model of f⋆ from relatively few samples.2

1 That is, f⋆ is “smooth”. We will be more precise in the subsequent parts of this chapter. If this were not the case, optimizing the function without evaluating it everywhere would not be possible. Fortunately, many interesting functions obey this relatively weak assumption.

2 There are countless examples of this problem in the “real world”. Instances are
• drug trials
• chemical engineering — the development of physical products
• recommender systems
• automatic machine learning — automatic tuning of model & hyperparameters
• and many more...

9.1 Exploration-Exploitation Dilemma

In Bayesian optimization, we want to learn a model of f⋆ and use
this model to optimize f⋆ simultaneously. These goals are somewhat
contrary. Learning a model of f⋆ requires us to explore the input space
while using the model to optimize f⋆ requires us to focus on the most
promising well-explored areas. This trade-off is commonly known as
the exploration-exploitation dilemma, whereby

• exploration refers to choosing points that are “informative” with re-


spect to the unknown function. For example, points that are far
away from previously observed points (i.e., have high posterior vari-
ance);3 and
• exploitation refers to choosing promising points where we expect the
function to have high values. For example, points that have a high
posterior mean and a low posterior variance.

3 We explored this topic (with strategies like uncertainty sampling) in the previous chapter.
In other words, the exploration-exploitation dilemma refers to the chal-
lenge of learning enough to understand f ⋆ , but not learning too much
to lose track of the objective — optimizing f ⋆ .

The exploration-exploitation dilemma is yet another example of the


principle of curiosity and conformity which we introduced in Section 5.5.2
and encountered many times since in our study of approximate prob-
abilistic inference. We will see in subsequent chapters that sequential
decision-making is intimately related to probabilistic inference, and
there we will also make this correspondence more precise.

9.2 Online Learning and Bandits

Bayesian optimization is closely related to a form of online learning.


In online learning we are given a set of possible inputs X and an un-
known function f ⋆ : X → R. We are now asked to choose a sequence
of inputs x_1, . . . , x_T online,4 and our goal is to maximize our cumulative
reward ∑_{t=1}^{T} f⋆(x_t). Depending on what we observe about f⋆,
there are different variants of online learning. Bayesian optimization
is closest to the so-called (stochastic) “bandit” setting.

4 Online is best translated as “sequential”. That is, we need to pick x_{t+1} based only on our prior observations y_1, . . . , y_t.

9.2.1 Multi-Armed Bandits


The “multi-armed bandits” (MAB) problem is a classical, canonical for-
malization of the exploration-exploitation dilemma. In the MAB prob-
lem, we are provided with k possible actions (arms) and want to maxi-
mize our reward online within the time horizon T. We do not know the
reward distributions of the actions in advance, however, so we need to
trade learning the reward distribution with following the most promis-
ing action. Bayesian optimization can be interpreted as a variant of the
MAB problem where there can be a potentially infinite number of ac-
tions (arms), but their rewards are correlated (because of the smooth-
ness of the Gaussian process prior).

Figure 9.2: Illustration of a multi-armed bandit with four arms, each with a different reward distribution. The agent tries to identify the arm with the most beneficial reward distribution, shown in green.

There exists a large body of work on this and similar problems in online
decision-making. Much of this work develops theory on how to
explore and exploit in the face of uncertainty. The shared prevalence
of the exploration-exploitation dilemma signals a deep connection be-

tween online learning and Bayesian optimization (and — as we will


later come to see — reinforcement learning). Many of the approaches
which we will encounter in the context of these topics are strongly
related to methods in online learning.

One of the key principles of the theory on multi-armed bandits and


reinforcement learning is the principle of “optimism in the face of un-
certainty”, which suggests that it is a good guideline to explore where
we can hope for the best outcomes. We will frequently come back
to this general principle in our discussion of algorithms for Bayesian
optimization and reinforcement learning.

9.2.2 Regret
The key performance metric in online learning is the regret.

Definition 9.1 (Regret). The (cumulative) regret for a time horizon T


associated with choices {x_t}_{t=1}^{T} is defined as

R_T := ∑_{t=1}^{T} ( max_x f⋆(x) − f⋆(x_t) ) (9.3)
    = T max_x f⋆(x) − ∑_{t=1}^{T} f⋆(x_t), (9.4)

where each summand max_x f⋆(x) − f⋆(x_t) in (9.3) is called the instantaneous regret.

The regret can be interpreted as the additive loss with respect to the
static optimum maxx f ⋆ ( x).

The goal is to find algorithms that achieve sublinear regret,

lim_{T→∞} R_T / T = 0. (9.5)

Importantly, if we use an algorithm which explores forever, e.g., by going
to a random point x̃ with a constant probability ϵ in each round,
then the regret will grow linearly with time. This is because the instantaneous
regret is at least ϵ(max_x f⋆(x) − f⋆(x̃)) and non-decreasing.
Conversely, if we use an algorithm which never explores, then we
might never find the static optimum, and hence, also incur constant
instantaneous regret in each round, implying that regret grows linearly
with time. Thus, achieving sublinear regret requires balancing
exploration and exploitation.

Typically, online learning (and Bayesian optimization) considers stationary
environments, hence the comparison to the static optimum.
Dynamic environments are studied in online algorithms (see metrical
task systems,5 convex function chasing,6 and generalizations of multi-armed
bandits to changing reward distributions) and reinforcement
learning. When operating in dynamic environments, other metrics
such as the competitive ratio,7 which compares against the best dynamic
choice, are useful. As we will later come to see in Section 13.1 in
the context of reinforcement learning, operating in dynamic environments
is deeply connected to a rich field of research called control.

5 Metrical task systems are a classical example in online algorithms. Suppose we are moving in a (finite) decision space X. In each round, we are given a “task” f_t : X → R which is more or less costly depending on our state x_t ∈ X. In many contexts, it is natural to assume that it is also costly to move around in the decision space. This cost is modeled by a metric d(·, ·) on X. In metrical task systems, we want to minimize our total cost, ∑_{t=1}^{T} f_t(x_t) + d(x_t, x_{t−1}). That is, we want to trade completing our tasks optimally with moving around in the state space. Crucially, we do not know the sequence of tasks f_t in advance. Due to the cost associated with moving in the decision space, previous choices affect the future!

6 Convex function chasing (or convex body chasing) generalizes metrical task systems to continuous domains X. To make any guarantees about the performance in these settings, one typically has to assume that the tasks f_t are convex. Note that this mirrors our assumption in Bayesian optimization that similar alternatives yield similar results.

7 To assess the performance in dynamic environments, we typically compare to a dynamic optimum. As these problems are difficult (we are usually not able to guarantee convergence to the dynamic optimum), one considers a multiplicative performance metric similar to the approximation ratio, the competitive ratio, cost(ALG) ≤ α · cost(OPT), where OPT corresponds to the dynamic optimal choice (in hindsight).

9.3 Acquisition Functions

It is common to use a so-called acquisition function to greedily pick the
next point to sample based on the current model.

Throughout our description of acquisition functions, we will focus on a


setting where we model f ⋆ using a Gaussian process which we denote
by f . The methods generalize to other means of learning f ⋆ such as
Bayesian deep learning. The various acquisition functions F are used
in the same way as is illustrated in Algorithm 9.2.

Algorithm 9.2: Bayesian optimization (with GPs)


1 initialize f ∼ GP (µ0 , k0 )
2 for t = 1 to T do
3 choose xt = arg maxx∈X F ( x; µt−1 , k t−1 )
4 observe yt = f ( xt ) + ϵt
5 perform a probabilistic update to obtain µt and k t

Remark 9.3: Model selection


Selecting a model of f ⋆ in sequential decision-making is much
harder than in the i.i.d. data setting of supervised learning. There
are mainly the following two dangers:
• the data sets collected in active learning and Bayesian optimiza-
tion are small; and
• the data points are selected dependently on prior observations.
This leads to a specific danger of overfitting. In particular, due to
feedback loops between the model and the acquisition function,
one may end up sampling the same point repeatedly.

One approach to reduce the chance of overfitting is the use of


hyperpriors which we mentioned previously in Section 4.4.2. An-
other approach that often works fairly well is to occasionally (ac-
cording to some schedule) select points uniformly at random in-
stead of using the acquisition function. This tends to prevent get-
ting stuck in suboptimal parts of the state space.

One possible acquisition function is uncertainty sampling (8.31), which


we discussed in the previous chapter. However, this acquisition func-
tion does not at all take into account the objective of maximizing f ⋆
and focuses solely on exploration.

Suppose that our model f of f⋆ is well-calibrated, in the sense that
the true function lies within its confidence bounds. Consider the best
lower bound, that is, the maximum of the lower confidence bound.
Now, if the true function is really contained in the confidence bounds,
it must hold that the optimum is somewhere above this best lower
bound. In particular, we can exclude all regions of the domain where
the upper confidence bound (the optimistic estimate of the function
value) is lower than the best lower bound. This is visualized in Figure 9.3.

Figure 9.3: Optimism in Bayesian optimization. The unknown function is shown in black, our model in blue with gray confidence bounds. The dotted black line denotes the maximum lower bound. We can therefore focus our exploration on the yellow regions where the upper confidence bound is higher than the maximum lower bound.

Therefore, we only really care what the function looks like in the regions
where the upper confidence bound is larger than the best lower
bound. The key idea behind the methods that we will explore is to
focus exploration on these plausible maximizers.

Note that it is crucial that our uncertainty about f reflects the “fit” of
our model to the unknown function. If the model is not well calibrated
or does not describe the underlying function at all, these methods will

perform poorly. This is where we can use the Bayesian philosophy by


imposing a prior belief that may be conservative.

9.3.1 Upper Confidence Bound


The principle of optimism in the face of uncertainty suggests picking the
point where we can hope for the optimal outcome. In this setting, this
corresponds to simply maximizing the upper confidence bound (UCB),

x_{t+1} := arg max_{x∈X} µ_t(x) + β_{t+1} σ_t(x), (9.6)

where σ_t(x) := √(k_t(x, x)) is the standard deviation at x and β_t regulates
how confident we are about our model f (i.e., the choice of confidence interval).

This acquisition function naturally trades exploitation by preferring
a large posterior mean with exploration by preferring a large posterior
variance. Note that if β_t = 0 then UCB is purely exploitative,
whereas, if β_t → ∞, UCB recovers uncertainty sampling (i.e., is purely
explorative).8 UCB is an example of an optimism-based method, as it
greedily picks the point where we can hope for the best outcome.

Figure 9.4: Plot of the UCB acquisition function for β = 0.25 and β = 1, respectively.

8 Due to the monotonicity of (·)², it does not matter whether we optimize the variance or the standard deviation at x.

As can be seen in Figure 9.4, the UCB acquisition function is gener-
ally non-convex. For selecting the next point, we can use approximate

global optimization techniques like Lipschitz optimization (in low di-


mensions) and gradient ascent with random initialization (in high di-
mensions). Another widely used technique is to sample some random
points from the domain, score them according to this criterion, and
simply take the best one.
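A minimal sketch of Algorithm 9.2 with the UCB rule (9.6) over a finite candidate grid; the explicit GP update, the toy objective `f_star`, and the fixed β are illustrative assumptions rather than recommended choices.

```python
import numpy as np

def rbf(A, B, ell=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X_obs, y_obs, X_cand, noise_var=0.01):
    """Posterior mean and standard deviation of a zero-mean GP at the candidates."""
    if len(X_obs) == 0:
        return np.zeros(len(X_cand)), np.ones(len(X_cand))
    K = rbf(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    k_star = rbf(X_cand, X_obs)
    mean = k_star @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def f_star(x):                           # unknown black-box function, for illustration only
    return np.sin(3 * x) - 0.5 * x

X_cand = np.linspace(0, 3, 200).reshape(-1, 1)
X_obs, y_obs = np.empty((0, 1)), np.empty(0)
beta = 2.0                               # fixed confidence parameter (illustrative)
for t in range(20):
    mean, std = gp_posterior(X_obs, y_obs, X_cand)
    x_next = X_cand[np.argmax(mean + beta * std)]           # UCB rule (9.6)
    y_next = f_star(x_next[0]) + 0.1 * np.random.randn()    # noisy evaluation (9.2)
    X_obs, y_obs = np.vstack([X_obs, x_next]), np.append(y_obs, y_next)
print(X_obs[np.argmax(y_obs)])           # best observed input
```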

The choice of β t is crucial for the performance of UCB. Intuitively,


for UCB to work even if the unknown function f ⋆ is not contained in
the confidence bounds, we use β t to re-scale the confidence bounds
to enclose f⋆ as shown in Figure 9.5. A theoretical analysis requires
that β_t is chosen “correctly”. Formally, we say that the sequence β_t is
chosen correctly if it leads to well-calibrated confidence intervals, that is,
if with probability at least 1 − δ,

∀t ≥ 1, ∀x ∈ X : f⋆(x) ∈ C_t(x) := [µ_{t−1}(x) ± β_t(δ) · σ_{t−1}(x)]. (9.7)

Figure 9.5: Re-scaling the confidence bounds. The dotted gray lines represent updated confidence bounds.

Bounds on β_t(δ) can be derived both in a “Bayesian” and in a “frequentist”
setting. In the Bayesian setting, it is assumed that f⋆ is
drawn from the prior GP, i.e., f⋆ ∼ GP(µ_0, k_0). However, in many
cases this may be an unrealistic assumption. In the frequentist setting,
it is assumed instead that f⋆ is a fixed element of a reproducing kernel
Hilbert space H_k(X) which, depending on the kernel k, can encompass
a large class of functions. We will discuss the Bayesian setting first and
later return to the frequentist setting.

Theorem 9.4 (Bayesian confidence intervals, lemma 5.5 of Srinivas


et al. (2010)). (See Problem 9.2.) Let δ ∈ (0, 1). Assuming f⋆ ∼ GP(µ_0, k_0) and Gaussian
observation noise ϵ_t ∼ N(0, σ_n²), the sequence

β_t(δ) = O( √(log(|X| t / δ)) ) (9.8)

satisfies P(∀t ≥ 1, x ∈ X : f⋆(x) ∈ C_t(x)) ≥ 1 − δ.

Under the assumption of well-calibrated confidence intervals, we can


bound the regret of UCB.

Theorem 9.5 (Regret of GP-UCB, theorem 2 of Srinivas et al. (2010)).


(See Problem 9.3.) If β_t(δ) is chosen “correctly” for a fixed δ ∈ (0, 1), with probability at
least 1 − δ, greedily choosing the upper confidence bound yields cumulative regret

R_T = O( β_T(δ) √(γ_T T) ) (9.9)

where

γ_T := max_{S⊆X, |S|=T} I(f_S; y_S) = max_{S⊆X, |S|=T} (1/2) log det(I + σ_n^{−2} K_SS) (9.10)

is the maximum information gain after T rounds.



Observe that if the information gain is sublinear in T then we achieve


sublinear regret and, in particular, converge to the true optimum. The
information gain γT measures how much can be learned about f ⋆
within T rounds. If the function is assumed to be smooth (perhaps
even linear), then the information gain is smaller than if the function
was assumed to be “rough”. Intuitively, the smoother the functions
encoded by the prior, the smaller is the class of functions to choose
from and the more can be learned from a single observation about
“neighboring” points.

Theorem 9.6 (Information gain of common kernels, theorem 5 of Srini-


vas et al. (2010) and remark 2 of Vakili et al. (2021)). Due to submodular-
ity, we have the following bounds on the information gain of common kernels:
• linear kernel:
  γ_T = O(d log T), (9.11)
• Gaussian kernel:
  γ_T = O((log T)^{d+1}), (9.12)
• Matérn kernel for ν > 2:
  γ_T = O( T^{d/(2ν+d)} (log T)^{2ν/(2ν+d)} ). (9.13)

Figure 9.6: Information gain (upper bound on γ_T) of independent, linear, Gaussian, and Matérn (ν ≈ 0.5) kernels with d = 2 (up to constant factors). The kernels with sublinear information gain have strong diminishing returns (due to their strong dependence between “close” points). In contrast, the independent kernel has no dependence between points in the domain, and therefore no diminishing returns. Intuitively, the “smoother” the class of functions modeled by the kernel, the stronger are the diminishing returns.

The information gain of common kernels is illustrated in Figure 9.6.
Notably, when all points in the domain are independent, the information
gain is linear in T. This is because when the function f⋆ may be
arbitrarily “rough”, we cannot generalize from a single observation to
“neighboring” points, and as there are infinitely many points in the
domain X there are no diminishing returns. As one would expect, in
this case, Theorem 9.5 does not yield sublinear regret. However, we
can see from Theorem 9.6 that the information gain is sublinear for lin-
ear, Gaussian, and most Matérn kernels. Moreover, observe that unless
the function is linear, the information gain grows exponentially with
the dimension d. This is because the number of “neighboring” points
(with respect to Euclidean geometry) decreases exponentially with the
dimension which is also known as the curse of dimensionality.

As mentioned, the size of the confidence intervals can also be analyzed


under a frequentist assumption on f ⋆ .

Theorem 9.7 (Frequentist confidence intervals, theorem 2 of Chowd-


hury and Gopalan (2017)). Let δ ∈ (0, 1). Assuming f ⋆ ∈ Hk (X ), we
have that with probability at least 1 − δ, the sequence

β_t(δ) = ∥f⋆∥_k + σ_n √(2(γ_t + log(1/δ))) (9.14)

satisfies P(∀t ≥ 1, x ∈ X : f ⋆ ( x) ∈ Ct ( x)) ≥ 1 − δ.



That is, β t depends on the information gain of the kernel as well as on


the “complexity” of f ⋆ which is measured in terms of its norm in the
underlying reproducing kernel Hilbert space Hk (X ).

Remark 9.8: Bayesian vs frequentist assumption


Theorems 9.4 and 9.7 provide different bounds on β t (δ) based
on fundamentally different assumptions on the ground truth f ⋆ :
The Bayesian assumption is that f ⋆ is drawn from the prior GP,
whereas the frequentist assumption is that f ⋆ is an element of a
reproducing kernel Hilbert space Hk (X ). The frequentist assump-
tion holds uniformly for all functions f ⋆ with ∥ f ⋆ ∥k < ∞, whereas
the Bayesian assumption holds only under the Bayesian “belief”
that f ⋆ is drawn from the prior GP.

Interestingly, neither assumption encompasses the other. This is


because if f ∼ GP (0, k ) then it can be shown that almost surely
∥ f ∥k = ∞, which implies that f ̸∈ Hk (X ) (Srinivas et al., 2010).

We remark that Theorem 9.7 holds also under the looser assumption
that observations are perturbed by σn -sub-Gaussian noise (cf. Equa-
tion (A.39)) instead of Gaussian noise. The bound on γT from Equa-
tion (9.13) for the Matérn kernel does not yield sublinear regret when
combined with the standard regret bound from Theorem 9.5, however,
Whitehouse et al. (2024) show that the regret of GP-UCB is sublinear
also in this case provided σn2 is chosen carefully.

This concludes our discussion of the UCB algorithm. We have seen


that its regret can be analyzed under both Bayesian and frequentist
assumptions on f ⋆ .

9.3.2 Improvement
Another well-known family of methods is based on keeping track of
a running optimum fˆt , and scoring points according to their improve-
ment upon the running optimum. The improvement of x after round t
is measured by
I_t(x) := ( f(x) − f̂_t )^+ (9.15)

where we use (·)+ to denote max{0, ·}.

The probability of improvement (PI) picks the point that maximizes the
probability to improve upon the running optimum,
x_{t+1} := arg max_{x∈X} P(I_t(x) > 0 | x_{1:t}, y_{1:t}) (9.16)
        = arg max_{x∈X} P(f(x) > f̂_t | x_{1:t}, y_{1:t}) (9.17)
        = arg max_{x∈X} Φ( (µ_t(x) − f̂_t) / σ_t(x) ) (9.18)   (using linear transformations of Gaussians (1.78))

where Φ denotes the CDF of the standard normal distribution and we
use that f(x) | x_{1:t}, y_{1:t} ∼ N(µ_t(x), σ_t²(x)). Probability of improvement
tends to be biased in favor of exploitation, as it prefers points with
large posterior mean and small posterior variance which is typically
true “close” to the previously observed maximum f̂_t.

Probability of improvement looks at how likely a point is to improve
upon the running optimum. An alternative is to look at how much a
point is expected to improve upon the running optimum. This acquisition
function is called the expected improvement (EI),

x_{t+1} := arg max_{x∈X} E[I_t(x) | x_{1:t}, y_{1:t}]. (9.19)

Intuitively, EI seeks a large expected improvement (exploitation) while
also preferring states with a large variance (exploration). Expected improvement
yields the same regret bound as UCB (Nguyen et al., 2017).
The expected improvement acquisition function is often flat which
makes it difficult to optimize in practice due to vanishing gradients.
One approach addressing this is to instead optimize the logarithm of
EI (Ament et al., 2024).

Figure 9.7: Plot of the PI and EI acquisition functions, respectively.
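Under a Gaussian posterior f(x) | x_{1:t}, y_{1:t} ∼ N(µ_t(x), σ_t²(x)), both acquisition functions admit simple closed forms; the EI expression below, (µ − f̂)Φ(z) + σϕ(z) with z = (µ − f̂)/σ, is the standard closed form of (9.19) under this Gaussian assumption. A minimal sketch:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mean, std, f_best):
    """PI (9.18): Phi((mu_t(x) - f_hat_t) / sigma_t(x))."""
    return norm.cdf((mean - f_best) / std)

def expected_improvement(mean, std, f_best):
    """EI (9.19) in closed form for a Gaussian posterior."""
    z = (mean - f_best) / std
    return (mean - f_best) * norm.cdf(z) + std * norm.pdf(z)

# Score a few candidates given their posterior mean/std and the running optimum.
mean, std = np.array([0.1, 0.4, 0.2]), np.array([0.5, 0.1, 0.3])
print(probability_of_improvement(mean, std, f_best=0.35))
print(expected_improvement(mean, std, f_best=0.35))
```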


Figure 9.8: Contour lines of acquisition functions for varying ∆_t = µ_t(x) − f̂_t and σ_t. A brighter color corresponds to a larger acquisition function value. The first graph shows contour lines of UCB with β_t = 0.75, the second of PI, and the third of EI.

9.3.3 Thompson Sampling

We can also interpret the principle of optimism in the face of uncertainty
we select the next point according to the probability that it is optimal
(assuming that the posterior distribution is an accurate representation
of the uncertainty),
 
. ′
π ( x | x1:t , y1:t ) = P f | x1:t ,y1:t f ( x) = max f ( x ) (9.20)
x′

xt+1 ∼ π (· | x1:t , y1:t ). (9.21)

This approach of sampling according to the probability of maximality π


is called probability matching. Probability matching is exploratory as it
prefers points with larger variance (as they automatically have a larger
chance of being optimal), but at the same time exploitative as it ef-
fectively discards points with low posterior mean and low posterior
variance. Unfortunately, it is generally difficult to compute π analyti-
cally given a posterior.

Instead, it is common to use a sampling-based approximation of π.


Observe that the density π can be expressed as an expectation,

π(x | x_{1:t}, y_{1:t}) = E_{f | x_{1:t}, y_{1:t}}[ 1{ f(x) = max_{x′} f(x′) } ], (9.22)

which we can approximate using Monte Carlo sampling (typically us-


ing a single sample),

≈ 1{ f̃_{t+1}(x) = max_{x′} f̃_{t+1}(x′) } (9.23)

where f˜t+1 ∼ p(· | x1:t , y1:t ) is a sample from our posterior distri-
bution. Observe that this approximation of π coincides with a point
density at the maximizer of f˜t+1 .

The resulting algorithm is known as Thompson sampling. At time t + 1,


we sample a function f˜t+1 ∼ p(· | x1:t , y1:t ) from our posterior distri-
bution. Then, we simply maximize f˜t+1 ,
x_{t+1} := arg max_{x∈X} f̃_{t+1}(x). (9.24)

In many cases, the randomness in the realizations of f˜t+1 is already


sufficient to effectively trade exploration and exploitation. Similar re-
gret bounds to those of UCB can also be established for Thompson
sampling (Russo and Van Roy, 2016; Kandasamy et al., 2018).
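A minimal sketch of Thompson sampling over a finite candidate set: draw one joint sample of f from the GP posterior and maximize it as in (9.24). The `kernel` argument (e.g., the RBF kernel from the earlier sketches) and the helper names are illustrative assumptions.

```python
import numpy as np

def posterior_moments(X_obs, y_obs, X_cand, kernel, noise_var=0.01):
    """Posterior mean vector and covariance matrix of a zero-mean GP at X_cand."""
    K_cc = kernel(X_cand, X_cand)
    if len(X_obs) == 0:
        return np.zeros(len(X_cand)), K_cc
    K = kernel(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    K_co = kernel(X_cand, X_obs)
    mean = K_co @ np.linalg.solve(K, y_obs)
    cov = K_cc - K_co @ np.linalg.solve(K, K_co.T)
    return mean, cov

def thompson_step(X_obs, y_obs, X_cand, kernel, rng):
    """Sample f_tilde from the posterior and return its maximizer (9.24)."""
    mean, cov = posterior_moments(X_obs, y_obs, X_cand, kernel)
    f_sample = rng.multivariate_normal(mean, cov + 1e-10 * np.eye(len(mean)))
    return X_cand[np.argmax(f_sample)]

# Usage: x_next = thompson_step(X_obs, y_obs, X_cand, rbf, np.random.default_rng(0))
```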

9.3.4 Information-Directed Sampling


After having looked at multiple methods that aim to balance exploita-
tion of the current posterior distribution over f for immediate returns
with exploration to reduce uncertainty about f for future returns, we
will next discuss a method that makes this tradeoff explicit.
Denoting the instantaneous regret of choosing x as ∆(x) := max_{x′} f⋆(x′) − f⋆(x)
and by I_t(x) some function capturing the “information gain” associated
with observing x in iteration t + 1, we can define the information ratio,

Ψ_t(x) := ∆(x)² / I_t(x), (9.25)

which was originally introduced by Russo and Van Roy (2016). Here
exploitation reduces regret while exploration increases information
gain, and hence, points x that minimize the information ratio are those
that most effectively balance exploration and exploitation. We can
make the key observation that the regret ∆(·) decreases when It (·) de-
creases, as a small It (·) implies that the algorithm has already learned
a lot about the function f ⋆ . The strength of this relationship is quanti-
fied by the information ratio:

Theorem 9.9 (Proposition 1 of Russo and Van Roy (2014) and theo-
rem 8 of Kirschner and Krause (2018)). For any iteration T ≥ 1, let
∑_{t=1}^{T} I_{t−1}(x_t) ≤ γ_T and suppose that Ψ_{t−1}(x_t) ≤ Ψ_T for all t ∈ [T]. Then,
the cumulative regret is bounded by

R_T ≤ √(γ_T Ψ_T T). (9.26)

Proof. By Equation (9.25), r_t = ∆(x_t) = √(Ψ_{t−1}(x_t) · I_{t−1}(x_t)). Hence,

R_T = ∑_{t=1}^{T} r_t
    = ∑_{t=1}^{T} √(Ψ_{t−1}(x_t) · I_{t−1}(x_t))
    ≤ √( ∑_{t=1}^{T} Ψ_{t−1}(x_t) · ∑_{t=1}^{T} I_{t−1}(x_t) )   (using the Cauchy-Schwarz inequality)
    ≤ √(γ_T Ψ_T T).   (using the assumptions on I_t(·) and Ψ_t(·))

Example 9.10: How to measure “information gain”?


One possibility of measuring the “information gain” is
I_t(x) := I(f_x; y_x | x_{1:t}, y_{1:t}) (9.27)

which — as you may recall — is precisely the marginal gain of
the utility I(S) = I(f_S; y_S) we were studying in Chapter 8. In this case,

∑_{t=1}^{T} I_{t−1}(x_t) = ∑_{t=1}^{T} I(f_{x_t}; y_{x_t} | x_{1:t−1}, y_{1:t−1})
    ≤ ∑_{t=1}^{T} I(f_{x_{1:T}}; y_{x_t} | x_{1:t−1}, y_{1:t−1})   (using I(X, Z; Y) ≥ I(X; Y), which follows from Equations (8.12) and (8.16) and is called monotonicity of MI)
    = I(f_{x_{1:T}}; y_{x_{1:T}})   (by repeated application of Equation (8.16), also called the chain rule of MI)
    ≤ γ_T.   (by definition of γ_T (9.10))

The regret bound from Theorem 9.9 suggests an algorithm which in


each iteration chooses the point which minimizes the information ratio
(9.25). However, this is not possible since ∆(·) is unknown due to
its dependence on f ⋆ . Kirschner and Krause (2018) propose to use a
surrogate to the regret which is based on the current model of f ⋆ ,
∆̂_t(x) := max_{x′∈X} u_t(x′) − l_t(x). (9.28)

Here, u_t(x) := µ_t(x) + β_{t+1} σ_t(x) and l_t(x) := µ_t(x) − β_{t+1} σ_t(x) are
the upper and lower confidence bounds of the confidence interval
C_t(x) of f⋆(x), respectively. Similarly to our discussion of UCB, we
make the assumption that the sequence β_t is chosen “correctly” (cf.
Equation (9.7)) so that the confidence interval is well-calibrated and
∆(x) ≤ ∆̂_t(x) with high probability. The resulting algorithm

x_{t+1} := arg min_{x∈X} { Ψ̂_t(x) := ∆̂_t(x)² / I_t(x) } (9.29)
is known as information-directed sampling (IDS).
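A minimal sketch of the IDS rule (9.29) over a finite candidate set, given arrays of posterior means and standard deviations; as one possible measure of information gain we use the Gaussian marginal gain (1/2) log(1 + σ_t²(x)/σ_n²) from (8.31). The names and the fixed β are illustrative.

```python
import numpy as np

def ids_select(mean, std, beta, noise_var=0.01):
    """Information-directed sampling (9.29): minimize the surrogate information ratio."""
    upper = mean + beta * std                            # u_t(x)
    lower = mean - beta * std                            # l_t(x)
    regret_surrogate = np.max(upper) - lower             # Delta_hat_t(x), Equation (9.28)
    info_gain = 0.5 * np.log1p(std**2 / noise_var)       # I_t(x), cf. Equation (8.31)
    return np.argmin(regret_surrogate**2 / (info_gain + 1e-12))
```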

Theorem 9.11 (Regret of IDS, lemma 8 of Kirschner and Krause (2018)).


(See Problem 9.6.) Let β_t(δ) be chosen “correctly” for a fixed δ ∈ (0, 1). Then, if the measure
of information gain is I_t(x) = I(f_x; y_x | x_{1:t}, y_{1:t}), with probability at least
1 − δ, IDS has cumulative regret

R_T = O( β_T(δ) √(γ_T T) ). (9.30)

Regret bounds such as Theorem 9.11 can be derived also for different
measures of information gain. For example, the argument of Problem 9.6
also goes through for the “greedy” measure

I_t(x) := I(f_{x_t^{UCB}}; y_x | x_{1:t}, y_{1:t}) (9.31)

which focuses exclusively on reducing the uncertainty at x_t^{UCB} rather
than globally. We compare the two measures of information gain in
Figure 9.9. Observe that the acquisition function depends critically
on the choice of I_t(·) and is less sensitive to the scaling of confidence intervals.

Figure 9.9: Plot of the surrogate information ratio Ψ̂: IDS selects its minimizer. The first two plots use the “global” information gain measure from Example 9.10 with β = 0.25 and β = 0.5, respectively. The third plot uses the “greedy” information gain measure from Equation (9.31) and β = 1.

IDS trades exploitation and exploration by balancing the (exploitative)
regret surrogate with a measure of information gain (such as those
studied in Chapter 8) that is purely explorative. In this way, IDS can
account for kinds of information which are not addressed by alternative
algorithms such as UCB or EI (Russo and Van Roy, 2014): Depending
on the measure of information gain, IDS can select points to obtain
indirect information about other points or cumulate information that
does not immediately lead to a higher reward but only when combined
with subsequent observations. Moreover, IDS avoids selecting
points which yield irrelevant information.

9.3.5 Probabilistic Inference


As we mentioned before, computing the probability of maximality π
is generally intractable. You can think of computing π as attempting
to fully solve the probabilistic inference problem associated with de-
termining the optimum of f | x1:t , y1:t . In many cases, it is useful to
determine the probability that a point is optimal under the current
posterior distribution, and we will consider one particular example in
the following.

Example 9.12: Maximizing recall


In domains such as molecular design we often use machine learn-
ing to screen candidates for further manual testing. The goal here
is to suggest a small set E from a large domain of molecules X , so
that the probability of E containing the optimal molecule, i.e., the
recall, is maximized. Note that this task is quite different from on-
line Bayesian optimization, for example, in BO we get sequential
feedback that we can use to decide which inputs to query next.
Nevertheless, we will see in this section that both tasks turn out
to be closely related.

Let us assume that the maximizer is unique almost surely, which


with GPs is automatically the case if there are no same-mean,
perfectly-correlated entries. The recall task is then intimately re-
lated to the task of determining the probability of maximality,
since
 
arg max_{E⊆X: |E|=k} P_{f | x_{1:t}, y_{1:t}}( max_{x∈E} f(x) = max_{x′} f(x′) )
    = arg max_{E⊆X: |E|=k} ∑_{x∈E} π(x | x_{1:t}, y_{1:t}). (9.32)
(by noting that for x ̸= y, the events {f(x) = max_{x′} f(x′)} and {f(y) = max_{x′} f(x′)} are disjoint)

Thus, to obtain the recall-optimal candidates for further testing,
we need to find the probability of maximality. However, as mentioned,
computing the probability of maximality π is generally intractable.

Figure 9.10 depicts an example of a recall task, where the black
line shows the optimal recall achieved by knowing the probability
of maximality π exactly. LITE (Menet et al., 2025) is an
almost-linear time approximation of π, whereas TS selects points
via Thompson sampling and MEANS selects the points with the
highest posterior mean. Selecting E according to probability of
maximality achieves much higher recall than the other intuitive heuristics.

Figure 9.10: Menet et al. (2025) show in an experiment that selecting E ⊆ X according to the estimates of probability of maximality from LITE (here called F-LITE) is near-optimal. Intuitive heuristics such as Thompson sampling (TS) or selecting points with the highest posterior mean (MEANS) perform worse. (Axes: expected recall against |E|/|X|.)

So how can we estimate the probability of maximality? LITE approxi-


mates π by
 

π(x | x_{1:t}, y_{1:t}) = P_{f | x_{1:t}, y_{1:t}}( f(x) ≥ max_{x′} f(x′) )
    ≈ P_{f | x_{1:t}, y_{1:t}}( f(x) ≥ κ∗ ) (9.33)
    = Φ( (µ_t(x) − κ∗) / σ_t(x) ) (9.34)

with Φ denoting the CDF of the standard normal distribution and κ∗
chosen such that the approximation of π integrates to 1, so that it is a
valid distribution.

Figure 9.11: Plot of the probability of maximality as estimated by LITE.

Remarkably, LITE is intimately related to many of the BO methods we
discussed in this chapter. First, and most obviously, Equation (9.33)
looks similar to the PI acquisition function (9.17). But note that κ∗ is
not equal to the best observed value f̂_t as in PI, but instead typically
larger. If κ∗ > f̂_t, then Equation (9.33) is more exploratory than PI:
it puts additional emphasis on points with large posterior variance
and comparatively less emphasis on points with large posterior mean,
which we illustrate in Figure 9.11.
which we illustrate in Figure 9.11.

An insightful interpretation of LITE is that it balances two different


kinds of exploration: optimism in the face of uncertainty and entropy reg-
ularization. To see this, let us define the following variational objective:
 q 
.
W (π ) = ∑ π ( x) · µt ( x) + 2 S′ (π ( x)) · σt ( x) (9.35)
x∈X
| {z } | {z }
exploitation exploration

.
with the “quasi-surprise” S′ (u) = 21 (ϕ(Φ−1 (u))/u)2 . The quasi-surprise
S′ (·) behaves similarly to the surprise − ln(·). In fact, their asymptotics
coincide:

S′ (1) = 0 = − ln(1) and S′ (u) → − ln(u) as u → 0+ .

The objective W is maximized for those probability distributions π


that are concentrated around points with large mean µt ( x) and points
with large exploration bonus. We have seen that the uncertainty σt ( x)
about f ( x) is the standard exploration bonus of UCB-style algorithms.
In Equation (9.35), σt ( x) is weighted by the quasi-surprise, which acts
as “entropy regularization”: it increases the entropy of π by uniformly
pushing π ( x) away from zero. You can think of entropy regularization
as encouraging a different kind of exploration than optimism by not
making deterministic decisions like in UCB.

Menet et al. (2025) show that LITE (9.34) is the solution to the varia-
bayesian optimization 191

tional problem (9.35) among valid probability distributions ? : Problem 9.7

µt (·) − κ ∗
 
arg max W (π ) = Φ (9.36)
π ∈∆X
σt (·)

with κ ∗ such that the right-hand side sums to 1.9 This indicates that 9
∆X is the probability simplex on X .
LITE and Thompson sampling, which samples from probability of
maximality, achieve exploration through two means:
1. Optimism: by preferring points with large uncertainty σt ( x) about
the reward value f ( x).
2. Decision uncertainty: by assigning some probability mass to all x,
that is, by remaining uncertain about which x is the maximizer.
In our discussion of balancing exploration and exploitation in rein-
forcement learning, we will return to this dichotomy of exploration
strategies.

Remark 9.13: But why is “exploration” useful for recall?


Let us pause for a moment and reflect on why this interpreta-
tion of LITE is remarkable. This interpretation shows that LITE
is more “exploratory” than simply taking the highest posterior
means. However, the recall task (9.32) differs from the standard
BO task in that we do not collect any further observations, which
we may use downstream to make better decisions. In the recall
task, we are only interested in having the best shot at including
the maximizer in the set E, without obtaining any further infor-
mation or making any subsequent decisions. At first sight, this
seems to suggest that we should be as “exploitative” as possible.
Then how can it be that “exploration” is useful?

We will explore this question with the following example: You are
observing a betting game. When placing a bet, players can either
place a safe bet (“playing safe”) or a risky bet (“playing risky”).
You now have to place a bet on their bets, and estimate which of
the players will win the most money: those that play safe or those
that play risky? Note that you do not care whether your guess
ends up in 2nd place or last place among all players — you only
care about whether your guess wins. That is, you are interested
in recalling the best player, not in “ranking” the players.

Consider three players: one that plays safe and two that play risky.
Suppose that the safe bet has payoff S = 1 while each risky bet
has payoff R ∼ N (0, 100). In expectation, the safe player will win
the most money. However, one can see with just a little bit of al-
gebra that the probability of either of the risky players winning
the most money is ≈ 35%, whereas the safe player only wins with
192 probabilistic artificial intelligence

probability ≈ 29% ? . That is, it is in fact optimal to bet on either Problem 9.8
of the risky players since the best player might have vastly out-
performed their expected winnings, and performed closer to their
upper confidence bound. In summary, maximizing recall requires
us to be “exploratory” since it is likely that the optimum among
inputs is one that has performed better than expected, not simply
the one with the highest expected performance.

Discussion

In this chapter, we have explored the exploration-exploitation dilemma


in the context of optimizing black-box functions. To this end, we
have explored various methods to balance exploration and exploita-
tion. While computing the precise probability of maximality is gener-
ally intractable, we found that it can be understood approximately as
balancing two sources of exploration: optimism in the face of uncer-
tainty and entropy regularization.

In the following chapters, we will begin to discuss stateful settings


where the black-box optimization task is known as “reinforcement
learning”. Naturally, we will see that the exploration-exploitation
dilemma is also a central challenge in reinforcement learning, and we
will revisit many of the concepts we have discussed in this chapter.

Optional Readings
• Srinivas, Krause, Kakade, and Seeger (2010).
Gaussian process optimization in the bandit setting: No regret and
experimental design.
• Golovin, Solnik, Moitra, Kochanski, Karro, and Sculley (2017).
Google vizier: A service for black-box optimization.
• Romero, Krause, and Arnold (2013).
Navigating the protein fitness landscape with Gaussian processes.
• Chowdhury and Gopalan (2017).
On kernelized multi-armed bandits.

Problems

9.1. Convergence to the static optimum.

Show that any algorithm where limt→∞ f ⋆ ( xt ) exists achieves sublin-


ear regret if and only if it converges to the static optimum, that is,

lim f ⋆ ( xt ) = max f ⋆ ( x). (9.37)


t→∞ x
bayesian optimization 193

Hint: Use that if a sequence an converges to a as n → ∞, then we have for


the sequence
n
. 1
bn =
n ∑ ai (9.38)
i =1

that limn→∞ bn = a. This is also known as the Cesàro mean.

9.2. Bayesian confidence intervals.

In this exercise, we derive Theorem 9.4.


1. For fixed t ≥ 1 and x ∈ X , prove
2
P( f ⋆ ( x) ̸∈ Ct ( x) | x1:t−1 , y1:t−1 ) ≤ e− β t /2 . (9.39)

Hint: Bound P( Z > c) for Z ∼ N (0, 1) and c > 0.


2. Prove Theorem 9.4.

9.3. Regret of GP-UCB.

To develop some intuition, we will derive Theorem 9.5.


1. Show that if Equation (9.7) holds, then for a fixed t ≥ 1 the instan-
taneous regret rt is bounded by 2β t σt−1 ( xt ).
. . .
2. Let ST = { xt }tT=1 , and define f T = f ST and yT = ysT . Prove
!
1 T σt2−1 ( xt )
I( f T ; yT ) = ∑ log 1 + . (9.40)
2 t =1 σn2

3. Combine (1) and (2) to show Theorem 9.5. We assume w.l.o.g. that
the sequence { β t }t is monotonically increasing.
Hint: If s ∈ [0, M ] for some M > 0 then s ≤ C · log(1 + s) with
.
C = M/ log(1 + M ).

9.4. Sublinear regret of GP-UCB for a linear kernel.

Assume that f ⋆ ∼ GP (0, k) where k is the linear kernel

k( x, x′ ) = x⊤ x′ .

In addition, we assume that for any x ∈ X , ∥ x∥2 ≤ 1. Moreover, recall


that the points in a finite set S ⊆ X can be written in a matrix form
(the “design matrix”) which we denote by XS ∈ Rd×|S| .
1. Prove that γT = O(d log T ).
2. Deduce from (1) and Theorem 9.5 that limT →∞ R T /T = 0.

9.5. Closed-form expected improvement.

Let us denote the acquisition function of EI from Equation (9.19) by


EIt ( x). In this exercise, we derive a closed-form expression.
194 probabilistic artificial intelligence

1. Show that
Z +∞
EIt ( x) = (µt ( x) + σt ( x)ε − fˆt ) · ϕ(ε) dε (9.41)
( fˆt −µt ( x))/σt ( x)

where ϕ is the PDF of the univariate standard normal distribution


(1.5).
Hint: Reparameterize the posterior distribution

f ( x) | x1:t , y1:t ∼ N (µt ( x), σt2 ( x))

using a standard normal distribution.


2. Using the above expression, show that
! !
µt ( x) − fˆt µt ( x) − fˆt
EIt ( x) = (µt ( x) − fˆt )Φ + σt ( x)ϕ
σt ( x) σt ( x)
(9.42)
. Ru
where Φ(u) = −∞ ϕ(ε) dε denotes the CDF of the standard normal
distribution.
Note that the first term of Equation (9.42) encourages exploitation
while the second term encourages exploration. EI can be seen as a
special case of UCB where the confidence bounds are scaled depend-
ing on x:

β t = ϕ(zt ( x))/Φ(zt ( x))


.
for zt ( x) = (µt ( x) − fˆt )/σt ( x).

9.6. Regret of IDS.

We derive Theorem 9.11.


1. Prove that for all t ≥ 1, ∆ˆ t ( xUCB ) ≤ 2β t+1 σt ( xUCB ) where we de-
t t
UCB .
note by xt = arg maxx∈X ut ( x) the point maximizing the upper
confidence bound after iteration t.
2. Using (1) and the assumption on It , bound Ψ b t ( xUCB ).
t
Hint: You may find the hint of Problem 9.3 (3) useful.
3. Complete the proof of Theorem 9.11.

9.7. Variational form of LITE.

Let |X | < ∞ be such that we can write W as an objective function over


the probability simplex ∆|X | ⊂ R|X | . Derive Equation (9.36).

Hint: First show that W (·) is concave (i.e., minimizing −W (·) is a convex
optimization problem) and then use Lagrange multipliers to find the optimum.
bayesian optimization 195

9.8. Finding the winning player.

Consider the betting game from Remark 9.13 with two risky players,
R1 ∼ N (0, 100) and R2 ∼ N (0, 100), and one safe player S = 1. Prove
that the individual probability of any of the risky players winning the
most money is larger than the winning probability of the safe player.
We assume that players payoffs are mutually independent.
.
Hint: Compute the CDF of R = max{ R1 , R2 }.
10
Markov Decision Processes

We will now turn to the topic of probabilistic planning. Planning deals


with the problem of deciding which action an agent should play in
a (stochastic) environment.1 A key formalism for probabilistic plan- 1
An environment is stochastic as op-
ning in known environments are so-called Markov decision processes. posed to deterministic, when the out-
come of actions is random.
Starting from the next chapter, we will look at reinforcement learning,
which extends probabilistic planning to unknown environments.

Consider the setting where we have a sequence of states ( Xt )t∈N0 sim-


ilarly to Markov chains. But now, the next state Xt+1 of an agent does A1 A2

not only depend on the previous state Xt but also depends on the last
X1 X2 X3 ···
action At of this agent.
Figure 10.1: Directed graphical model of
Definition 10.1 ((Finite) Markov decision process, MDP). A (finite) a Markov decision process with hidden
Markov decision process is specified by states Xt and actions At .
.
• a (finite) set of states X = {1, . . . , n},
.
• a (finite) set of actions A = {1, . . . , m},
• transition probabilities
.
p( x ′ | x, a) = P Xt+1 = x ′ | Xt = x, At = a

(10.1)

which is also called the dynamics model, and


• a reward function r : X × A → R which maps the current state x and
an action a to some reward.

The reward function may also depend on the next state x ′ , however,
we stick to the above model for simplicity. Also, the reward function
can be random with mean r. Observe that r induces the sequence of
rewards ( Rt )t∈N0 , where
.
R t = r ( Xt , A t ), (10.2)

which is sometimes used in the literature instead of r.

Crucially, we assume the dynamics model p and the reward function r


to be known. That is, we operate in a known environment. For now,
198 probabilistic artificial intelligence

we also assume that the environment is fully observable. In other words,


we assume that our agent knows its current state. In Section 10.4, we
discuss how this method can be extended to the partially observable
setting.

Our fundamental objective is to learn how the agent should behave to


optimize its reward. In other words, given its current state, the agent
should decide (optimally) on the action to play. Such a decision map
— whether optimal or not — is called a policy.

Definition 10.2 (Policy). A policy is a function that maps each state x ∈


X to a probability distribution over the actions. That is, for any t > 0,
.
π ( a | x ) = P( A t = a | Xt = x ). (10.3)

In other words, a policy assigns to each action a ∈ A, a probability of


being played given the current state x ∈ X.

We assume that policies are stationary, that is, do not change over time.

Remark 10.3: Stochastic policies


We will see later in this chapter that in fully observable environ-
ments optimal policies are always deterministic. Thus, there is
no need to consider stochastic policies in the context of Markov
decision processes. For this chapter, you can think of a policy π
simply as a deterministic mapping π : X → A from current state
to played action. In the context of reinforcement learning, we will
later see in Section 12.5.1 that randomized policies are important
in trading exploration and exploitation.

Observe that a policy induces a Markov chain ( Xtπ )t∈N0 with transition
probabilities,
.
pπ ( x ′ | x ) = P Xtπ+1 = x ′ | Xtπ = x = ∑ π (a | x) p(x′ | x, a).

(10.4)
a∈ A

This is crucial: if our agent follows a fixed policy (i.e., decision-making


protocol) then the evolution of the process is described fully by a
Markov chain.

As mentioned, we want to maximize the reward. There are many mod-


els of calculating a score from the infinite sequence of rewards ( Rt )t∈N0 .
For the purpose of our discussion of Markov decision processes and
reinforcement learning, we will focus on a very common reward called
discounted payoff.

Definition 10.4 (Discounted payoff). The discounted payoff (also called


markov decision processes 199

discounted total reward) from time t is defined as the random variable,



.
Gt = ∑ γm Rt+m (10.5)
m =0

where γ ∈ [0, 1) is the discount factor.

Remark 10.5: Other reward models


Other well-known methods for combining rewards into a score
are
(instantaneous) (finite-horizon) (mean payoff )
T −1 T
. . . 1
Gt = Rt , Gt = ∑ Rt+m , and Gt = lim inf
T →∞ T ∑ Rt+m .
m =0 m =0

The methods that we will discuss can also be analyzed using these
or other alternative reward models.

We now want to understand the effect of the starting state and initial
action on our optimization objective Gt . To analyze this, it is common
to use the following two functions:

Definition 10.6 (State value function). The state value function,2 2


Recall that following a fixed policy π
induces a Markov chain ( Xtπ )t∈N0 . We
.
t ( x ) = Eπ [ Gt | Xt = x ],
vπ (10.7) define
.
Eπ [·] = E(X π )t∈N [·] (10.6)
measures the average discounted payoff from time t starting from t 0

state x ∈ X. as an expectation over all possible se-


quences of states ( xt )t∈N0 within this
Definition 10.7 (State-action value function). The state-action value func- Markov chain.
tion (also called Q-function),
.
t ( x, a ) = Eπ [ Gt | Xt = x, At = a ]
qπ (10.8)
= r ( x, a) + γ ∑ ′
p( x | x, a) · vπ ′
t +1 ( x ), (10.9) by expanding the defition of the
x′ ∈X discounted payoff (10.5); corresponds to
one step in the induced Markov chain
measures the average discounted payoff from time t starting from
state x ∈ X and with playing action a ∈ A. In other words, it com-
bines the immediate return with the value of the next states.

Note that both vπ t ( x ) and qt ( x, a ) are deterministic scalar-valued func-


π

tions. Because we assumed stationary dynamics, rewards, and poli-


cies, the discounted payoff starting from a given state x will be in-
.
dependent of the start time t. Thus, we write vπ ( x ) = v0π ( x ) and
.
qπ ( x, a) = q0π ( x, a) without loss of generality.

10.1 Bellman Expectation Equation

Let us now see how we can compute the value function,

vπ ( x ) = Eπ [ G0 | X0 = x ] using the definition of the value


function (10.7)
200 probabilistic artificial intelligence


" #
= Eπ ∑ m
γ Rm X0 = x using the definition of the discounted
m =0 payoff (10.5)

" #
h i
0
= Eπ γ R0 X0 = x + γEπ ∑ m
γ R m +1 X0 = x using linearity of expectation (1.20)
m =0

" " # #
= r ( x, π ( x )) + γEx′ Eπ ∑ m
γ R m +1 X1 = x ′
X0 = x by simplifying the first expectation and
m =0 conditioning the second expectation on
"

# X1
= r ( x, π ( x )) + γ ∑ ′
p( x | x, π ( x ))Eπ ∑ m
γ R m +1 X1 = x ′
expanding the expectation on X1 and
x′ ∈X m =0 using conditional independence of the
"

# discounted payoff of X0 given X1
= r ( x, π ( x )) + γ ∑

p( x ′ | x, π ( x ))Eπ ∑ γm Rm X0 = x ′ shifting the start time of the discounted
x ∈X m =0 payoff using stationarity

∑ p( x ′ | x, π ( x ))Eπ G0 | X0 = x ′
 
= r ( x, π ( x )) + γ using the definition of the discounted

x ∈X payoff (10.5)

= r ( x, π ( x )) + γ ∑

p( x ′ | x, π ( x )) · vπ ( x ′ ). (10.10) using the definition of the value
x ∈X function (10.7)

= r ( x, π ( x )) + γEx′ | x,π (x) vπ ( x ′ ) .


 
(10.11) interpreting the sum as an expectation

This equation is known as the Bellman expectation equation, and it shows


a recursive dependence of the value function on itself. The intuition
is clear: the value of the current state corresponds to the reward from
the next action plus the discounted sum of all future rewards obtained
from the subsequent states.

For stochastic policies, the above calculation can be extended to yield,


!
v (x) =
π
∑ π (a | x) r ( x, a) + γ ∑ ′
p( x | x, a)v ( x )π ′
(10.12)
a∈ A x′ ∈X
h i
= Ea∼π (x) r ( x, a) + γEx′ | x,a vπ ( x ′ )

(10.13)

= Ea∼π (x) [qπ ( x, a)]. (10.14)

For stochastic policies, by also conditioning on the first action, one can
obtain an analogous equation for the state-action value function,

qπ ( x, a) = r ( x, a) + γ ∑

p( x ′ | x, a) ∑

π ( a′ | x ′ )qπ ( x ′ , a′ ) (10.15)
x ∈X a ∈A
= r ( x, a) + γEx′ | x,a Ea′ ∼π (x′ ) q ( x , a′ ) .
 π ′

(10.16)

Note that it does not make sense to consider a similar recursive for-
mula for the state-action value function in the setting of deterministic
policies as the action played when in state x ∈ X is uniquely deter-
mined as π ( x ). In particular,

vπ ( x ) = qπ ( x, π ( x )). (10.17)
markov decision processes 201

1 (0) Figure 10.2: Example of an MDP, which


we study in Problem 10.1. Suppose you
1/2 (-1) 1 (-1) are building a company. The shown
S MDP models “how to become rich and
famous”. Here, the action S is short for
poor, 1/2 (-1) poor, saving and the action A is short for ad-
unknown A famous A vertising.
Suppose you begin by being “poor
and unknown”. Then, the greedy ac-
1/2
S tion (i.e., the action maximizing instan-
1/2 (0)
(10) taneous reward) is to save. However,
within this simplified environment, sav-
1/2 (-1)
1/2 (10) 1/2 (0) 1 (-1) ing when you are poor and unknown
1/2
means that you will remain poor and un-
(-1) known forever. As the potential rewards
S A in other states are substantially larger,
this simple example illustrates that fol-
rich, 1/2 (10) rich, lowing the greedy choice is generally not
unknown
S A
famous optimal.
The example is adapted from
Andrew Moore’s lecture notes on
1/2 (10) MDPs (Moore, 2002).

10.2 Policy Evaluation

Bellman’s expectation equation tells us how we can find the value


function vπ of a fixed policy π using a system of linear equations!
Using,

   
v π (1) r (1, π (1))
.  .  π . 
 .. 
vπ =  ..  , r = 
 , and
.


vπ (n) r (n, π (n))
  (10.18)
p(1 | 1, π (1)) · · · p(n | 1, π (1))
.  .. .. .. 
Pπ =  . . .


p(1 | n, π (n)) · · · p(n | n, π (n))

and a little bit of linear algebra, the Bellman expectation equation (10.10)
is equivalent to

vπ = r π + γPπ vπ (10.19)
⇐⇒ ( I − γP )v = r
π π π

⇐⇒ vπ = ( I − γPπ )−1 r π . (10.20)

Solving this linear system of equations (i.e., performing matrix inver-


sion) takes cubic time in the size of the state space.
202 probabilistic artificial intelligence

10.2.1 Fixed-point Iteration


To obtain an (approximate) solution of vπ , we can use that it is the
unique fixed-point of the affine mapping Bπ : Rn → Rn ,
.
Bπ v = r π + γPπ v. (10.21)

Using this fact (which we will prove in just a moment), we can use
fixed-point iteration of Bπ .

Algorithm 10.8: Fixed-point iteration


1 initialize vπ (e.g., as 0)
2 for t = 1 to T do
3 vπ ← Bπ vπ = r π + γPπ vπ

Fixed-point iteration has computational advantages, for example, for


sparse transitions.

Theorem 10.9. vπ is the unique fixed-point of Bπ .

Proof. It is immediate from Bellman’s expectation equation (10.19) and


the definition of Bπ (10.21) that vπ is a fixed-point of Bπ . To prove
uniqueness, we will show that Bπ is a contraction.

Remark 10.10: Contractions


A contraction is a concept from topology. In a Banach space (X , ∥·∥)
(a metric space with a norm), f : X → X is a contraction iff there
exists a k < 1 such that

∥ f ( x) − f (y)∥ ≤ k · ∥ x − y∥ (10.22)

for any x, y ∈ X . By the Banach fixed-point theorem, a contraction


admits a unique fixed-point. Intuitively, by iterating the function
f , the distance to any fixed-point shrinks by a factor k in each
iteration, hence, converges to 0. As we cannot converge to multi-
ple fixed-points simultaneously, the fixed-point of a contraction f
must be unique.
3
The L∞ norm (also called supremum
Let v ∈ Rn and v′ ∈ Rn be arbitrary initial guesses. We use the L∞ norm) is defined as
.
space,3 ∥ x∥∞ = max | x(i )| . (10.23)
i

Bπ v − Bπ v′ ∞
= r π + γPπ v − r π − γPπ v′ ∞
using the definition of Bπ (10.21)

= γ Pπ (v − v′ ) ∞
≤ γ max
x∈ X
∑ p( x ′ | x, π ( x )) · v( x ′ ) − v′ ( x ′ ) . using the definition of the L∞ norm
x′ ∈X (10.23), expanding the multiplication,
and using |∑i ai | ≤ ∑i | ai |
markov decision processes 203

≤ γ v − v′ ∞
. (10.24) using ∑ x′ ∈X p( x ′ | x, π ( x )) = 1 and
|v( x ′ ) − v′ ( x ′ )| ≤ ∥v − v′ ∥∞
Thus, by Equation (10.22), Bπ is a contraction and by Banach’s fixed-
point theorem vπ is its unique fixed-point.

Let vtπ be the value function estimate after t iterations. Then, we have
for the convergence of fixed-point iteration,

∥vtπ − vπ ∥∞ = Bπ vtπ−1 − Bπ vπ ∞
using the update rule of fixed-point
iteration and Bπ vπ = vπ
≤ γ vtπ−1 − vπ ∞
using (10.24)

= γt ∥v0π − vπ ∥∞ . (10.25) by induction

This shows that fixed-point iteration converges to vπ exponentially


fast.

10.3 Policy Optimization

Recall that our goal was to find an optimal policy,


.
π ⋆ = arg max Eπ [ G0 ]. (10.26)
π

We can alternatively characterize an optimal policy as follows: We


define a partial ordering over policies by
· ′
π ≥ π ′ ⇐⇒ vπ ( x ) ≥ vπ ( x ) (∀ x ∈ X ). (10.27)

π ⋆ is then simply a policy which is maximal according to this partial


ordering.

It follows that all optimal policies have identical value functions. Sub-
. ⋆ . ⋆
sequently, we use v⋆ = vπ and q⋆ = qπ to denote the state value
function and state-action value function arising from an optimal pol-
icy, respectively. As an optimal policy maximizes the value of each
state, we have that

v⋆ ( x ) = max vπ ( x ), q⋆ ( x, a) = max qπ ( x, a). (10.28)


π π

Simply optimizing over each policy is not a good idea as there are mn
deterministic policies in total. It turns out that we can do much better.

10.3.1 Greedy Policies


Consider a policy that acts greedily according to the immediate return.
It is fairly obvious that this policy will not perform well because the
agent might never get to high-reward states. But what if someone
could tell us not just the immediate return, but the long-term value of
204 probabilistic artificial intelligence

the states our agent can reach in a single step? If we knew the value of
each state our agent can reach, then we can simply pick the action that
maximizes the expected value. We will make this approach precise in
the next section.

This thought experiment suggests the definition of a greedy policy


with respect to a value function.

Definition 10.11 (Greedy policy). The greedy policy with respect to a


state-action value function q is defined as

.
πq ( x ) = arg max q( x, a). (10.29)
a∈ A

Analogously, we define the greedy policy with respect to a state value


function v,

.
πv ( x ) = arg max r ( x, a) + γ ∑ p( x ′ | x, a) · v( x ′ ). (10.30)
a∈ A ′
x ∈X

We can use v and q interchangeably ? . Problem 10.2

10.3.2 Bellman Optimality Equation

Observe that following the greedy policy πv , will lead us to a new


value function vπv . With respect to this value function, we can again
obtain a greedy policy, of which we can then obtain a new value
function. In this way, the correspondence between greedy policies
and value functions induces a cyclic dependency, which is visualized
vπ induces πvπ
in Figure 10.3.

It turns out that the optimal policy π ⋆ is a fixed-point of this depen- vπ πv


dency. This is made precise by the following theorem.
πv induces vπv
Theorem 10.12 (Bellman’s theorem). A policy π ⋆ is optimal iff it is greedy
Figure 10.3: Cyclic dependency between
with respect to its own value function. In other words, π ⋆ is optimal iff π ⋆ ( x ) value function and greedy policy.
is a distribution over the set arg maxa∈ A q⋆ ( x, a).

In particular, if for every state there is a unique action that maxi-


mizes the state-action value function, the policy π ⋆ is deterministic
and unique,

π ⋆ ( x ) = arg max q⋆ ( x, a). (10.31)


a∈ A

Proof. It is a direct consequence of Equation (10.28) that a policy is


optimal iff it is greedy with respect to q⋆ .
markov decision processes 205

This theorem confirms our intuition from the previous section that
greedily following an optimal value function is itself optimal. In par-
ticular, Bellman’s theorem shows that there always exists an optimal
policy which is deterministic and stationary.

We have seen, that π ⋆ is a fixed-point of greedily picking the best ac-


tion according to its state-action value function. The converse is also
true:

Corollary 10.13. The optimal value functions v⋆ and q⋆ are a fixed-point of


the so-called Bellman update,

v⋆ ( x ) = max q⋆ ( x, a), (10.32)


a∈ A
= max r ( x, a) + γEx′ | x,a v⋆ ( x ′ )
 
(10.33) using the definition of the q-function (10.9)
a∈ A
 
q ( x, a) = r ( x, a) + γEx′ | x,a max q⋆ ( x ′ , a′ ) .

(10.34)
a′ ∈ A

Proof. It follows from Equation (10.14) that

v⋆ ( x ) = Ea∼π ⋆ ( x) [q⋆ ( x, a)]. (10.35)

Thus, as π ⋆ is greedy with respect to q⋆ , v⋆ ( x ) = maxa∈ A q⋆ ( x, a).

Equation (10.34) follows analogously from Equation (10.15).

These equations are also called the Bellman optimality equations. Intu-
Bellman’s optimality principle Bell-
itively, the Bellman optimality equations express that the value of a man’s optimality equations for MDPs
state under an optimal policy must equal the expected return for the are one of the main settings of Bell-
man’s optimality principle. However,
best action from that state. Bellman’s theorem is also known as Bell-
Bellman’s optimality principle has many
man’s optimality principle, which is a more general concept. other important applications, for exam-
ple in dynamic programming. Broadly
The two perspectives of Bellman’s theorem naturally suggest two sep- speaking, Bellman’s optimality principle
arate ways of finding the optimal policy. Policy iteration uses the per- says that optimal solutions to decision
problems can be decomposed into opti-
spective from Equation (10.31) of π ⋆ as a fixed-point of the dependency mal solutions to sub-problems.
between greedy policy and value function. In contrast, value iteration
uses the perspective from Equation (10.32) of v⋆ as the fixed-point of
the Bellman update. Another approach which we will not discuss here
is to use a linear program where the Bellman update is interpreted as
a set of linear inequalities.

10.3.3 Policy Iteration


Starting from an arbitrary initial policy, policy iteration as shown in
Algorithm 10.14 uses the Bellman expectation equation to compute
the value function of that policy (as we have discussed in Section 10.2)
and then chooses the greedy policy with respect to that value function
as its next iterate.
206 probabilistic artificial intelligence

Algorithm 10.14: Policy iteration


1 initialize π (arbitrarily)
2 repeat
3 compute vπ
4 compute πvπ
5 π ← πvπ
6 until converged

Let πt be the policy after t iterations. We will now show that policy
iteration converges to the optimal policy. The proof is split into two
parts. First, we show that policy iteration improves policies mono-
tonically. Then, we will use this fact to show that policy iteration
converges.

Lemma 10.15 (Monotonic improvement of policy iteration). We have,


• vπt+1 ( x ) ≥ vπt ( x ) for all x ∈ X; and
• vπt+1 ( x ) > vπt ( x ) for at least one x ∈ X, unless vπt ≡ v⋆ .

Proof. We consider the Bellman update from (10.32) as the mapping


B ⋆ : Rn → Rn ,
.
( B⋆ v)( x ) = max q( x, a), (10.36)
a∈ A

where q is the state-action value function corresponding to the state


value function v ∈ Rn . Recall that after obtaining vπt , policy iteration
.
first computes the greedy policy w.r.t. vπt , πt+1 = πvπt , and then
computes its value function vπt+1 .

To establish the (weak) monotonic improvement of policy iteration, we


consider a fixed-point iteration (cf. Algorithm 10.8) of vπt+1 initialized
by vπt . We denote the iterates by ṽτ , in particular, we have that ṽ0 =
vπt and limτ →∞ ṽτ = vπt+1 .4 First, observe that for the first iteration 4
using the convergence of fixed-point it-
of fixed-point iteration, eration (10.25)

ṽ1 ( x ) = ( B⋆ vπt )( x ) using that πt+1 is greedy wrt. vπt

= max q ( x, a)πt
using the definition of the Bellman
a∈ A update (10.36)
≥ q ( x, πt ( x ))
πt

= vπt ( x )
= ṽ0 ( x ). using (10.17)

Let us now consider a single iteration of fixed-point iteration. We have,

ṽτ +1 ( x ) = r ( x, πt+1 ( x )) + γ ∑

p( x ′ | x, πt+1 ( x )) · ṽτ ( x ′ ). using the definition of ṽτ +1 (10.21)
x ∈X
markov decision processes 207

Using an induction on τ, we conclude,

≥ r ( x, πt+1 ( x )) + γ ∑ p( x ′ | x, πt+1 ( x )) · ṽτ −1 ( x ′ ) using the induction hypothesis,



x ∈X ṽτ ( x ′ ) ≥ ṽτ −1 ( x ′ )
= ṽτ ( x ).

This establishes the first claim,

vπt+1 = lim ṽτ ≥ ṽ0 = vπt . (10.37)


τ →∞

For the second claim, recall from Bellman’s theorem (10.32) that v⋆ is a
(unique) fixed-point of the Bellman update B⋆ .5 In particular, we have 5
We will show in Equation (10.38) that
vπt+1 ≡ vπt if and only if vπt+1 ≡ vπt ≡ v⋆ . In other words, if vπt ̸≡ v⋆ B⋆ is a contraction, implying that v⋆ is
the unique fixed-point of B⋆ .
then Equation (10.37) is strict for at least one x ∈ X and vπt+1 ̸≡ vπt .
This proves the strict monotonic improvement of policy iteration.

Theorem 10.16 (Convergence of policy iteration). For finite Markov de-


cision processes, policy iteration converges to an optimal policy.

Proof. Finite Markov decision processes only have a finite number of


deterministic policies (albeit exponentially many). Observe that policy
iteration only considers deterministic policies, and recall that there is
an optimal policy that is deterministic. As the value of policies strictly
increase in each iteration until an optimal policy is found, policy iter-
ation must converge in finite time.

It can be shown that policy iteration converges to an exact solution in


a polynomial number of iterations (Ye, 2011). Each iteration of policy
iteration requires computing the value function, which we have seen
to be of cubic complexity in the number of states.

10.3.4 Value Iteration


As we have mentioned, another natural approach of finding the opti-
mal policy is to interpret v⋆ as the fixed point of the Bellman update.
Recall our definition of the Bellman update from Equation (10.36),

( B⋆ v)( x ) = max q( x, a),


a∈ A

where q was the state-action value function associated with the state
value function v. The value iteration algorithm is shown in Algo-
rithm 10.17.

We will now prove the convergence of value iteration using the fixed-
point interpretation.

Theorem 10.18 (Convergence of value iteration). Value iteration con-


verges asymptotically to an optimal policy.
208 probabilistic artificial intelligence

Algorithm 10.17: Value iteration


1 initialize v( x ) ← maxa∈ A r ( x, a) for each x ∈ X
2 for t = 1 to ∞ do
3 v( x ) ← ( B⋆ v)( x ) = maxa∈ A q( x, a) for each x ∈ X
4 choose πv

Proof. Clearly, value iteration converges if v⋆ is the unique fixed-point


of B⋆ . We already know from Bellman’s theorem (10.32) that v⋆ is
a fixed-point of B⋆ . It remains to show that it is indeed the unique
fixed-point.

Analogously to our proof of the convergence of fixed-point iteration to


the value function vπ , we show that B⋆ is a contraction. Fix arbitrary
v, v′ ∈ Rn , then

B⋆ v − B⋆ v′ ∞
= max ( B⋆ v)( x ) − ( B⋆ v′ )( x ) using the definition of the L∞ norm
x∈X (10.23)

= max max q( x, a) − max q′ ( x, a) using the definition of the Bellman


x∈X a∈ A a∈ A update (10.36)
≤ max max q( x, a) − q′ ( x, a) using |maxx f ( x ) − maxx g( x )| ≤
x ∈ X a∈ A
maxx | f ( x ) − g( x )|
≤ γ max max
x ∈ X a∈ A


p( x ′ | x, a) v( x ′ ) − v′ ( x ′ ) using the definition of the Q-function
x ∈X (10.9) and |∑i ai | ≤ ∑i | ai |

≤ γ v−v ∞
(10.38) using ∑ x′ ∈X p( x ′ | x, a) = 1 and
|v( x ′ ) − v′ ( x ′ )| ≤ ∥v − v′ ∥∞
where q and q′ are the state-action value functions associated with v
and v′ , respectively. By Equation (10.22), B⋆ is a contraction and by
Banach’s fixed-point theorem v⋆ is its unique fixed-point.

Remark 10.19: Value iteration as a dynamic program


Let us denote by vt the value function estimate after the t-th iter-
ation. Observe that vt ( x ) corresponds to the maximum expected
reward when starting in state x and the “world ends” after t time
steps. In particular, v0 corresponds to the maximum immediate
reward. This suggests a different perspective on value iteration
(akin to dynamic programming) where in each iteration we ex-
tend the time horizon of our approximation by one time step.

For any ϵ > 0, value iteration converges to an ϵ-optimal solution in


polynomial time. However, unlike policy iteration, value iteration does
not generally reach the exact optimum in a finite number of iterations.
Recalling the update rule of value iteration, its main benefit is that
each iteration only requires a sum over all possible actions a in state
markov decision processes 209

x and a sum over all reachable states x ′ from x. In sparse Markov


decision processes,6 an iteration of value iteration can be performed in 6
Sparsity refers to the interconnectivity
(virtually) constant time. of the state space. When only few states
are reachable from any state, we call an
MDP sparse.
10.4 Partial Observability

So far we have focused on the fully observable setting. That is, at


any time, our agent knows its current state. We have seen that we
can efficiently find the optimal policy (as long as the Markov decision
process is finite).

We have already encountered the partially observable setting in Chap-


ter 3, where we discussed filtering. In this section, we consider how
Markov decision processes can be extended to a partially observable
A1 A2
setting where the agent can only access noisy observations Yt of its
state Xt . X1 X2 X3 ···

Definition 10.20 (Partially observable Markov decision process, POMDP).


Similarly to a Markov decision process, a partially observable Markov de- Y1 Y2 Y3

cision process is specified by Figure 10.4: Directed graphical model of


a partially observable Markov decision
• a set of states X, process with hidden states Xt , observ-
• a set of actions A, ables Yt , and actions At .
• transition probabilities p( x ′ | x, a), and
• a reward function r : X × A → R.
Additionally, it is specified by
• a set of observations Y, and
• observation probabilities
.
o (y | x ) = P(Yt = y | Xt = x ). (10.39)

Whereas MDPs are controlled Markov chains, POMDPs are controlled


hidden Markov models.

Remark 10.21: Hidden Markov models


A hidden Markov model is a Markovian process with unobserv- X1 X2 X3 ···
able states Xt and observations Yt that depend on Xt in a known
way. Y1 Y2 Y3

Definition 10.22 (Hidden Markov model, HMM). A hidden Markov Figure 10.5: Directed graphical model
model is specified by of a hidden Markov model with hidden
states Xt and observables Yt .
• a set of states X,
.
• transition probabilities p( x ′ | x ) = P( Xt+1 = x ′ | Xt = x ) (also
called motion model), and
.
• a sensor model o (y | x ) = P(Yt = y | Xt = x ).

Following from its directed graphical model shown in Figure 10.5,


210 probabilistic artificial intelligence

its joint probability distribution factorizes into


t
P( x1:t , y1:t ) = P( x1 ) · o (y1 | x1 ) · ∏ p( xi | xi−1 ) · o (yi | xi ).
i =2
(10.40)

Observe that a Kalman filter can be viewed as a hidden Markov


model with conditional linear Gaussian motion and sensor models
and a Gaussian prior on the initial state. In particular, the tasks of
filtering, smoothing, and predicting which we discussed extensively
in Chapter 3 are also of interest for hidden Markov models.

A widely used application of hidden Markov models is to find the


most likely sequence (also called most likely explanation) of hidden
states x1:t given a series of observations y1:t ,7 that is, to find 7
This is useful in many applications
such as speech recognition, decoding
arg max P( x1:t | y1:t ). (10.41) data that was transmitted over a noisy
x1:t channel, beat detection, and many more.

This task can be solved in linear time by a simple backtracking


algorithm known as the Viterbi algorithm.

POMDPs are a very powerful model, but very hard to solve in gen-
eral. POMDPs can be reduced to a Markov decision process with an
enlarged state space. The key insight is to consider an MDP whose
states are the beliefs,
.
bt ( x ) = P( Xt = x | y1:t , a1:t−1 ), (10.42)
about the current state in the POMDP. In other words, the states of the
MDP are probability distributions over the states of the POMDP. We
will make this more precise in the following.

Let us assume that our prior belief about the state of our agent is given
.
by b0 ( x ) = P( X0 = x ). Keeping track of how beliefs change over time
is known as filtering, which we already encountered in Section 3.1.
Given a prior belief bt , an action taken at , and a new observation yt+1 ,
the belief state can be updated as follows,
bt+1 ( x ) = P( Xt+1 = x | y1:t+1 , a1:t ) by the definition of beliefs (10.42)
1
= P(yt+1 | Xt+1 = x )P( Xt+1 = x | y1:t , a1:t ) using Bayes’ rule (1.45)
Z
1
= o (yt+1 | x )P( Xt+1 = x | y1:t , a1:t ) using the definition of observation
Z probabilities (10.39)
1
= o (yt+1 | x ) ∑ p( x | x ′ , at )P Xt = x ′ | y1:t , a1:t−1

by conditioning on the previous state x ′ ,
Z x′ ∈X noting at does not influence Xt
1
= o ( y t + 1 | x ) ∑ p ( x | x ′ , a t ) bt ( x ′ ) (10.43) using the definition of beliefs (10.42)
Z x′ ∈X
markov decision processes 211

where
.
Z= ∑ o ( y t +1 | x ) ∑ p ( x | x ′ , a t ) bt ( x ′ ) . (10.44)
x∈X x′ ∈X

Thus, the updated belief state is a deterministic mapping from the


previous belief state depending only on the (random) observation yt+1
and the taken action at . Note that this obeys a Markovian structure of
transition probabilities with respect to the beliefs bt .

The sequence of belief-states defines the sequence of random vari-


ables ( Bt )t∈N0 ,
.
Bt = Xt | y1:t , a1:t−1 , (10.45)

where the (state-)space of all beliefs is the (infinite) space of all proba-
bility distributions over X,8 8
This definition naturally extends to
n continuous state spaces X .
|X|
o
. .
B = ∆ X = b ∈ R|X | : b ≥ 0, ∑i=1 b(i ) = 1 . (10.46)

A Markov decision process, where every belief corresponds to a state


is called a belief-state MDP.

Definition 10.23 (Belief-state Markov decision process). Given a POMDP,


the corresponding belief-state Markov decision process is a Markov deci-
sion process specified by
.
• the belief space B = ∆ X depending on the hidden states X,
• the set of actions A,
• transition probabilities
.
τ (b′ | b, a) = P Bt+1 = b′ | Bt = b, At = a ,

(10.47)

• and rewards
.
ρ(b, a) = Ex∼b [r ( x, a)] = ∑ b( x )r ( x, a). (10.48)
x∈X

It remains to derive the transition probabilities τ in terms of the origi-


nal POMDP. We have,

τ ( bt + 1 | bt , a t ) = P ( bt + 1 | bt , a t )
= ∑ P ( bt + 1 | bt , a t , y t + 1 ) P ( y t + 1 | bt , a t ) . (10.49) by conditioning on yt+1 ∈ Y
y t + 1 ∈Y

Using the Markovian structure of the belief updates, we naturally set,





 1 if bt+1 matches the belief update

of Equation (10.42) given bt , at , and

.
P ( bt + 1 | bt , a t , y t + 1 ) =


 y t +1 ,
0 otherwise.

(10.50)
212 probabilistic artificial intelligence

The final missing piece is the likelihood of an observation yt+1 given


the prior belief bt and action at , which using our interpretation of be-
liefs corresponds to
h i
P(yt+1 | bt , at ) = Ex∼bt Ex′ | x,at P yt+1 | Xt+1 = x ′

h i
= Ex∼bt Ex′ | x,at o (yt+1 | x ′ )

using the definition of observation
probabilities (10.39)
= ∑ bt ( x ) ∑

p( x ′ | x, at ) · o (yt+1 | x ′ ). (10.51)
x∈X x ∈X

In principle, we can now apply arbitrary algorithms for planning in


MDPs to POMDPs. Of course, the problem is that there are infinitely
many beliefs, even for a finite state space X.9 The belief-state MDP has 9
You can think of an | X |-dimensional
therefore an infinitely large belief space B . Even when only planning space. Here, all points whose coordi-
nates sum to 1 correspond to probability
over finite horizons, exponentially many beliefs can be reached. So the distributions (i.e., beliefs) over the hid-
belief space blows-up very quickly. den states X. The convex hull of these
points is also known as the (| X | − 1)-
dimensional probability simplex (cf. Ap-
We will study MDPs with large state spaces (where transition dynam-
pendix A.1.2). Now, by definition of the
ics and rewards are unknown) in Chapters 12 and 13. Similar methods (| X | − 1)-dimensional probability sim-
can also be used to approximately solve POMDPs. plex as a polytope in | X | − 1 dimen-
sions, we can conclude that its bound-
ary consists of infinitely many points in
A key idea in approximate solutions to POMDPs is that most belief | X | dimensions. Noting that these points
states are never reached. A common approach is to discretize the corresponded to the probability distribu-
belief space by sampling or by applying a dimensionality reduction. tions on | X |, we conclude that there are
infinitely many such distributions.
Examples are point-based value iteration (PBVI) and point-based policy it-
eration (PBPI) (Shani et al., 2013).

Discussion

Even though we focus on the fully observed setting throughout this


manuscript, the partially observed setting can be reduced to the fully
observed setting with very large state spaces. In the next chapter, we
will consider learning and planning in unknown Markov decision pro-
cesses (i.e., reinforcement learning) for small state spaces. The setting
of small state and action spaces is also known as the tabular setting.
Then, in the final two chapters, we will consider approximate meth-
ods for large state and action spaces. In particular, in Section 13.1, we
will revisit the problem of probabilistic planning in known Markov
decision processes, but with continuous state and action spaces.
markov decision processes 213

Problems

10.1. Value functions.

Recall the example of “becoming rich and famous” from Figure 10.2.
Consider the policy, π ≡ S (i.e., to always save) and let γ = 1/2. Show
that the (rounded) state-action value function qπ is as follows:

save advertise
poor, unknown 0 0.1
poor, famous 4.4 1.2
rich, famous 17.8 1.2
rich, unknown 13.3 0.1

Shown in bold is the state value function vπ .

10.2. Greedy policies.

Show that if q and v arise from the same policy, that is, q is defined in
terms of v as per Equation (10.9), then

πv ≡ πq . (10.52)

This implies that we can use v and q interchangeably.

10.3. Optimal policies.

Again, recall the example of “becoming rich and famous” from Fig-
ure 10.2.
1. Show that the policy π ≡ S, which we considered in Problem 10.1,
is not optimal.
2. Instead, consider the policy

 A if poor and unknown
π′ ≡
S otherwise

and let γ = 1/2. Show that the (rounded) state-action value func-

tion qπ is as follows:

save advertise
poor, unknown 0.8 1.6
poor, famous 4.5 1.2
rich, famous 17.8 1.2
rich, unknown 13.4 0.2

Shown in bold is the state value function vπ .
3. Is the policy π ′ optimal?
214 probabilistic artificial intelligence

10.4. Linear convergence of policy iteration.

Denote by πt the policy obtained by policy iteration after t iterations.


Use that the Bellman operator B⋆ is a contraction with the unique
fixed-point v⋆ to show that

∥ v π t − v ⋆ ∥ ∞ ≤ γ t ∥ v π0 − v ⋆ ∥ ∞ (10.53)

where vπ and v⋆ are vector representations of the functions vπ and v⋆ ,


respectively.

Hint: Recall from Lemma 10.15 that vπt+1 ≥ B⋆ vπt ≥ vπt .

10.5. Reward modification.

A key technique for solving sequential decision problems is the modi-


fication of reward functions that leaves the optimal policy unchanged
while improving sample efficiency or convergence rates. This exercise
looks at simple ways of modifying rewards and understanding how
these modifications affect the optimal policy.
. .
Consider two Markov decision processes M = ( X, A, p, r ) and M′ = ( X, A, p, r ′ )
where the reward function r is modified to obtain r ′ , and the rewards
are bounded and discounted by the discount factor γ ∈ [0, 1). Let π ⋆M
be the optimal policy for M.
1. Suppose r ′ ( x ) = αr ( x ), where α > 0. Show that the optimal pol-
icy π ⋆ of M is also an optimal policy of M′ .
2. Given a modification of the form r ′ ( x ) = r ( x ) + c, where c > 0 is
a constant scalar, show that the optimal policy π ⋆M can be different
from π ⋆M′ .
3. Another way of modifying the reward function is through reward
shaping where one supplies additional rewards to the agent to guide
the learning process. When one has no knowledge of the un-
derlying transition dynamics p, a commonly used transformation
is r ′ ( x, x ′ ) = r ( x, x ′ ) + f ( x, x ′ ) where f is a potential-based shaping
function defined as
.
f ( x, x ′ ) = γϕ( x ′ ) − ϕ( x ), ϕ : X → R. (10.54)

Show that the optimal policy remains unchanged under this defi-
nition of f .

10.6. A partially observable fishing problem.

We model an angler’s decisions while fishing, where the states are


partially observable. There are two states: (1) Fish (F): A fish is hooked
on the line. (2) No fish (F): No fish is hooked on the line.

The angler can choose between two actions:


markov decision processes 215

• Pull up the rod (P): If there is a fish on the line (F), there is a 90%
chance of catching it (reward +10, transitioning to F) and a 10%
chance of it escaping (reward −1, transitioning to F). If there is no
fish (F), pulling up the rod results in no catch, staying in F with a
reward of −5.
• Waiting (W): All waiting actions result in a reward of −1. In state
F, there is a 60% chance of the fish staying (remaining in F) and a
40% chance of it escaping (transitioning to F). In state F, there is a
50% chance of a fish biting (transitioning to F) and a 50% chance of
no change (remaining in F).
Suggestion: Draw the MDP transition diagram. Draw each transition with
action, associated probability, and associated reward.

Since the angler cannot directly observe whether there is a fish on the
line, they receive a noisy observation about the state. This observation
can be:
• o1 : The signal suggests that a fish might be on the line.
• o2 : The signal suggests that there is no fish on the line.
The observation model, which defines the probability of receiving each
observation given the true state is as follows:

P(o1 | ·) P(o2 | ·)
F 0.8 0.2
F 0.3 0.7

The angler’s goal is to choose actions that maximize their overall re-
ward, balancing the chances of catching a fish against the cost of wait-
ing and unsuccessful pulls.
1. Given an initial belief b0 ( F ) = b0 ( F ) = 0.5, the angler chooses to
wait and observes o1 . Compute the updated belief b1 using the
observation model and belief update equation (10.42).
2. Given belief b1 ( F ) ≈ 0.765 and b1 ( F ) ≈ 0.235, compute the up-
dated belief b2 for both actions P (pull) and W (wait), both in the
case where you observe o1 (fish likely) and o2 (fish unlikely).
11
Tabular Reinforcement Learning

11.1 The Reinforcement Learning Problem

Reinforcement learning is concerned with probabilistic planning in


unknown environments. This extends our study of known environ-
ments in the previous chapter. Those environments are still modeled
by Markov decision processes, but in reinforcement learning, we do
not know the dynamics p and rewards r in advance. Hence, reinforce- x t +1 rt at
ment learning is at the intersection of the theories of probabilistic plan-
ning (i.e., Markov decision processes) and learning (e.g., multi-armed
bandits), which we covered extensively in the previous chapters.
Figure 11.1: In reinforcement learning,
an agent interacts with its environment
We will continue to focus on the fully observed setting, where the in a sequence of rounds. After playing
agent knows its current state. As we have seen in the previous section, an action at , it observes rewards rt and
its new state xt+1 . The agent then uses
the partially observed setting corresponds to a fully observed setting
this information to learn how to act to
with an enlarged state space. In this chapter, we will begin by consid- maximize reward.
ering reinforcement learning with small state and action spaces. This
setting is often called the tabular setting, as the value functions can be
computed exhaustively for all states and stored in a table.

Clearly, the agent needs to trade exploring and learning about the en-
vironment with exploiting its knowledge to maximize rewards. Thus,
the exploration-exploitation dilemma, which was at the core of Bayesian
optimization (see Section 9.1), also plays a crucial role in reinforcement
learning. In fact, Bayesian optimization can be viewed as reinforce-
ment learning with a fixed state: In each round, the agent plays an
action, aiming to find the action that maximizes the reward. How-
ever, playing the same action multiple times yields the same reward,
implying that we remain in a single state. In the context of Bayesian
optimization, we used “regret” as performance metric: in the jargon
of planning, minimizing regret corresponds to maximizing the cumu-
lative reward.
218 probabilistic artificial intelligence

Another key challenge of reinforcement learning is that the observed


data is dependent on the played actions. This is in contrast to the
setting of supervised learning that we have been considering in earlier
chapters, where the data is sampled independently.

11.1.1 Trajectories
The data that the agent collects is modeled using so-called trajectories.

Definition 11.1 (Trajectory). A trajectory τ is a (possibly infinite) se-


quence,
.
τ = (τ0 , τ1 , τ2 , . . . ), (11.1)

of transitions,
.
τi = ( xi , ai , ri , xi+1 ), (11.2)

where xi ∈ X is the starting state, ai ∈ A is the played action, ri ∈ R is


the attained reward, and xi+1 ∈ X is the ending state.

In the context of learning a dynamics and rewards model, xi and ai can


be understood as inputs, and ri and xi+1 can be understood as labels
of a regression problem.

Crucially, the newly observed states xt+1 and the rewards rt (across
multiple transitions) are conditionally independent given the previ-
ous states xt and actions at . This follows directly from the Markovian
structure of the underlying Markov decision process.1 Formally, we 1
Recall the Markov property (6.6), which
have, assumes that in the underlying Markov
decision process (i.e., in our environ-
ment) the future state of an agent is inde-
X t +1 ⊥ X t ′ +1 | X t , X t ′ , A t , A t ′ , (11.3a) pendent of past states given the agent’s
current state. This is commonly called a
R t ⊥ R t ′ | Xt , Xt ′ , A t , A t ′ , (11.3b) Markovian structure. From this Marko-
vian structure, we gather that repeated
for any t, t′ ∈ N0 . In particular, if xt = xt′ and at = at′ , then xt+1 encounters of state-action pairs result
and xt′ +1 are independent samples according to the transition model in independent trials of the transition
model and rewards.
p( Xt+1 | xt , at ). Analogously, if xt = xt′ and at = at′ , then rt and rt′
are independent samples of the reward model r ( xt , at ). As we will
see later in this chapter and especially in Chapter 13, this indepen-
dence property is crucial for being able to learn about the underlying
Markov decision process. Notably, this implies that we can apply the
law of large numbers (A.36) and Hoeffding’s inequality (A.41) to our
estimators of both quantities.

The collection of data is commonly classified into two settings. In the


episodic setting, the agent performs a sequence of “training” rounds (called
episodes). In the beginning of each episode, the agent is reset to some
initial state. In contrast, in the continuous setting (or non-episodic /
tabular reinforcement learning 219

online setting), the agent learns online. Especially, every action, every
reward, and every state transition counts.

The episodic setting is more applicable to an agent playing a computer


game. That is, the agent is performing in a simulated environment
that is easy to reset. The continuous setting is akin to an agent that
is deployed to the “real world”. In principle, real-world agents can
be trained in simulated environments before being deployed. How-
ever, this bears the risk of learning to exploit or rely on features of the
simulated environment that are not present in the real environment.
Sometimes, using a simulated environment for training is downright
impossible, as the real environment is too complex.

11.1.2 On-policy and Off-policy Methods


Another important distinction in how data is collected, is the distinc-
tion between on-policy and off-policy methods. As the names suggest,
on-policy methods are used when the agent has control over its own ac-
tions, in other words, the agent can freely choose to follow any policy.
Being able to follow a policy is helpful, for example because it allows
the agent to experiment with trading exploration and exploitation.

In contrast, off-policy methods can be used even when the agent can-
not freely choose its actions. Off-policy methods are therefore able
to make use of purely observational data. This might be data that
was collected by another agent, a fixed policy, or during a previous
episode. Off-policy methods are therefore more sample-efficient than
on-policy methods. This is crucial, especially in settings where con-
ducting experiments (i.e., collecting new data) is expensive.

11.2 Model-based Approaches

Approaches to reinforcement learning are largely categorized into two


classes. Model-based approaches aim to learn the underlying Markov
decision process. More concretely, they learn models of the dynamics p
and rewards r. They then use these models to perform planning (i.e.,
policy optimization) in the underlying Markov decision process. In
contrast, model-free approaches learn the value function directly. We
begin by discussing model-based approaches to the tabular setting. In
Section 11.4, we cover model-free approaches.

11.2.1 Learning the Underlying Markov Decision Process


Recall that the underlying Markov decision process was specified by
its dynamics p( x ′ | x, a) that correspond to the probability of entering
220 probabilistic artificial intelligence

state x ′ ∈ X when playing action a ∈ A from state x ∈ X, and its


rewards r ( x, a) for playing action a ∈ A in state x ∈ X. A natural first
idea is to use maximum likelihood estimation to approximate these
quantities.

We can think of each transition x ′ | x, a as sampling from a categorical


random variable of which we want to estimate the success probabilities
for landing in each of the states. Therefore, as we have seen in Exam-
ple A.7, the MLE of the dynamics model coincides with the sample
mean,
N ( x ′ | x, a)
p̂( x ′ | x, a) = (11.4)
N (a | x)
where N ( x ′ | x, a) counts the number of transitions from state x to
state x ′ when playing action a and N ( a | x ) counts the number of tran-
sitions that start in state x and play action a (regardless of the next
state). Similarly, for the rewards model, we obtain the following max-
imum likelihood estimate (i.e., sample mean),

1
r̂ ( x, a) =
N (a | x) ∑ rt . (11.5)
t =0
xt = x
at = a

It is immediate that both estimates are unbiased as both correspond to


a sample mean.

Still, for the models of our environment to become accurate, our agent
needs to visit each state-action pair ( x, a) numerous times. Note that
our estimators for dynamics and rewards are only well-defined when
we visit the corresponding state-action pair at least once. However,
in a stochastic environment, a single visit will likely not result in an
accurate model. We can use Hoeffding’s inequality (A.41) to gauge
how accurate the estimates are after only a limited number of visits.

11.3 Balancing Exploration and Exploitation

The next natural question is how to use our current model of the en-
vironment to pick actions such that exploration and exploitation are
traded effectively. This is what we will consider next.

Given the estimated MDP given by p̂ and r̂, we can compute the opti-
mal policy using either policy iteration or value iteration. For example,
using value iteration, we can compute the optimal state-action value
function Q⋆ within the estimated MDP, and then employ the greedy
policy

π(x) = arg max_{a∈A} Q⋆(x, a).    (11.6)

Recall from Equation (10.31) that this corresponds to always picking


the best action under the current model (that is, π is the optimal pol-
icy). But since the model is inaccurate, while potentially quickly gen-
erating some reward, we will likely get stuck in a suboptimal state.

11.3.1 ε-greedy

Consider the other extreme: If we always pick a random action, we


will eventually(!) estimate the dynamics and rewards correctly, yet we
will do extremely poorly in terms of maximizing rewards along the
way. To trade exploration and exploitation, a natural idea is to balance
these two extremes.

Arguably, the simplest idea is the following: At each time step, throw
a biased coin. If this coin lands heads, we pick an action uniformly at
random among all actions. If the coin lands tails, we pick the best ac-
tion under our current model. This algorithm is called ε-greedy, where
the probability of a coin landing heads at time t is ε t .

Algorithm 11.2: ε-greedy


1 for t= 0 to ∞ do
2 sample u ∈ Unif([0, 1])
3 if u ≤ ε t then pick action uniformly at random among all actions
4 else pick best action under the current model

The ε-greedy algorithm provides a general framework for addressing


the exploration-exploitation dilemma. When the underlying MDP is
learned using Monte Carlo estimation as we discussed in Section 11.2.1,
the resulting algorithm is known as Monte Carlo control. However, the
same framework can also be used in the model-free setting where we
pick the best action without estimating the full underlying MDP. We
discuss this approach in greater detail in Section 11.4.

Amazingly, this simple algorithm already works quite well. Neverthe-


less, it can clearly be improved. The key problem of ε-greedy is that it
explores the state space in an uninformed manner. In other words, it
explores ignoring all past experience. It thus does not eliminate clearly
suboptimal actions. This is a problem, especially as we typically have
many state-action pairs and recalling that we have to explore each such
pair many times to learn an accurate model.

Remark 11.3: Asymptotic convergence


It can be shown that Monte Carlo control converges to an optimal
policy (albeit slowly) almost surely when the learned policy is
“greedy in the limit with infinite exploration”.

Definition 11.4 (Greedy in the limit with infinite exploration, GLIE).


A sequence of policies πt is said to be greedy in the limit with infinite
exploration if
1. all state-action pairs are explored infinitely many times,2

   lim_{t→∞} Nt(x, a) = ∞, and    (11.7)

2. the policy converges to a greedy policy,

   lim_{t→∞} πt(a | x) = 1{a = arg max_{a′∈A} Q⋆t(x, a′)}    (11.8)

where we denote by Nt(x, a) the number of transitions from state
x playing action a until time t, and Q⋆t is the optimal state-action
value function in the estimated MDP at time t.

2 That all state-action pairs are chosen is a fundamental requirement.
There is no reason why any algorithm would converge to the true value
function for all states when it only sees some state-action pairs finitely
many times, or even not at all.

Note that ε-greedy is GLIE with probability 1 if the sequence


(ε t )t∈N0 satisfies the Robbins-Monro (RM) conditions (A.60),
ε_t ≥ 0 ∀t,    ∑_{t=0}^{∞} ε_t = ∞,    and    ∑_{t=0}^{∞} ε_t² < ∞.

The RM-conditions are satisfied, for example, if ε t = 1/t.

Theorem 11.5 (Convergence of Monte Carlo control). GLIE Monte


Carlo control converges to an optimal policy with probability 1.

Intuitively, the probability of exploration converges to zero, and


hence, the policy will “eventually coincide” with the greedy pol-
icy. Moreover, the greedy policy will “eventually coincide” with
the optimal policy due to an argument akin to the convergence of
policy iteration (see Lemma 10.15 and Theorem 10.16), and using that
each state-action pair is visited infinitely often.

11.3.2 Softmax Exploration


An alternative to using ε-greedy for trading between greedy exploita-
tion and uniform exploration is the so-called softmax exploration or
Boltzmann exploration. Given the agent is in state x, we pick action a
with probability,
 
π_λ(a | x) ∝ exp(Q⋆(x, a) / λ),    (11.9)

which is the Gibbs distribution with temperature parameter λ > 0.


Observe that for λ → 0, softmax exploration corresponds to greedily
maximizing the Q-function (i.e., greedy exploitation), whereas for λ → ∞,
softmax exploration explores uniformly at random. This can out-
perform ε-greedy as the exploration is directed towards actions with
larger estimated value.
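A minimal sketch of sampling from the Boltzmann distribution (11.9) (not from the text), assuming the Q-estimates for the current state are given as a vector:

```python
import numpy as np

def softmax_action(q_values, temperature, rng=np.random.default_rng()):
    """Sample an action from the Boltzmann distribution (11.9).

    q_values: 1-D array of current Q-estimates Q(x, a) for all actions in state x.
    temperature: λ > 0; small λ ≈ greedy exploitation, large λ ≈ uniform exploration.
    """
    logits = np.asarray(q_values) / temperature
    logits -= logits.max()                 # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(q_values), p=probs)
```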

11.3.3 Optimism
Recall from our discussion of multi-armed bandits in Section 9.2.1 that
a key principle in effectively trading exploration and exploitation is
optimism in the face of uncertainty. Let us apply this principle to the
reinforcement learning setting. The key idea is to assume that the dy-
namics and rewards model “work in our favor” until we have learned
“good estimates” of the true dynamics and rewards.

[Figure 11.2: Illustration of the fairy-tale state of Rmax. If in doubt, the
agent believes actions from the state x to lead to the fairy-tale state x⋆
with maximal rewards. This encourages the exploration of unknown
states.]

More formally, if r(x, a) is unknown, we set r̂(x, a) = Rmax, where Rmax
is the maximum reward our agent can attain during a single transition.
Similarly, if p(x′ | x, a) is unknown, we set p̂(x⋆ | x, a) = 1, where x⋆ is
a “fairy-tale state”. The fairy-tale state corresponds to everything our
agent could wish for, that is,

p̂(x⋆ | x⋆, a) = 1    ∀a ∈ A,    (11.10)
r̂(x⋆, a) = Rmax    ∀a ∈ A.    (11.11)

In practice, the decision of when to assume that the learned dynamics


and reward models are “good enough” has to be tuned.

In using these optimistic estimates of p and r, we obtain an optimistic


underlying Markov decision process that exhibits a bias towards explo-
ration. In particular, the rewards attained in this MDP, are an upper
bound of the true reward. The resulting algorithm is known as the
Rmax algorithm.

How many transitions are “enough”? We can use Hoeffding’s inequal-
ity (A.41) to get a rough idea! The key here is our observation from
Equation (11.3) that the transitions and rewards are conditionally in-
dependent given the state-action pairs since, as we have discussed in
Section 6.1.4 on the ergodic theorem, Hoeffding’s inequality does not
hold for dependent samples. In this case, Hoeffding’s inequality tells
us that for the absolute approximation error to be below ϵ with prob-
ability at least 1 − δ, we need

N(a | x) ≥ (R²max / 2ϵ²) log(2/δ).    (11.12) (see (A.42))
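As a small worked example (not from the text), assuming Rmax = 1, ϵ = 0.1, and δ = 0.05, the bound (11.12) can be evaluated directly:

```python
import math

def required_visits(r_max, eps, delta):
    """Number of visits N(a | x) required by the Hoeffding bound (11.12)."""
    return math.ceil(r_max**2 / (2 * eps**2) * math.log(2 / delta))

# r_max = 1, eps = 0.1, delta = 0.05 gives roughly 185 visits per state-action pair
print(required_visits(1.0, 0.1, 0.05))
```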

Lemma 11.7 (Exploration and exploitation of Rmax ). Every T time steps,


with high probability, Rmax either

Algorithm 11.6: Rmax algorithm


1 add the fairy-tale state x ⋆ to the Markov decision process
2 set r̂ ( x, a) = Rmax for all x ∈ X and a ∈ A
3 set p̂( x ⋆ | x, a) = 1 for all x ∈ X and a ∈ A
4 compute the optimal policy π̂ for r̂ and p̂
5 for t = 0 to ∞ do
6 execute policy π̂ (for some number of steps)
7 for each visited state-action pair ( x, a), update r̂ ( x, a)
8 estimate transition probabilities p̂( x ′ | x, a)
9 after observing “enough” transitions and rewards, recompute the
optimal policy π̂ according to the current model p̂ and r̂.

• obtains near-optimal reward; or


• visits at least one unknown state-action pair.4

Here, T depends on the mixing time of the Markov chain induced by the
optimal policy.

4 Note that in the tabular setting, there are “only” polynomially many
state-action pairs.

Theorem 11.8 (Convergence of Rmax , Brafman and Tennenholtz (2002)).


With probability at least 1 − δ, Rmax reaches an ϵ-optimal policy in a number
of steps that is polynomial in | X |, | A|, T, 1/ϵ, 1/δ, and Rmax .

11.3.4 Challenges of Model-based Approaches


We have seen that the Rmax algorithm performs remarkably well in the
tabular setting. However, there are important computational limita-
tions to the model-based approaches that we discussed so far.

First, observe that the (tabular) model-based approach requires us to


store p̂(x′ | x, a) and r̂(x, a) in a table. This table already has O(n²m)
entries, where n = |X| is the number of states and m = |A| is the number
of actions. Even though polynomial in the size of the state and action
spaces, this quickly becomes unmanageable.

Second, the model-based approach requires us to “solve” the learned


Markov decision processes to obtain the optimal policy (using policy
or value iteration). As we continue to learn over time, we need to find
the optimal policy many times. Rmax recomputes the policy after each
state-action pair is observed sufficiently often, so O(nm) times.

11.4 Model-free Approaches

In the previous section, we have seen that learning and remembering


the model as well as planning within the estimated model can po-
tentially be quite expensive in the model-based approach. We there-

fore turn to model-free methods that estimate the value function di-
rectly. Thus, they require neither remembering the full model nor
planning (i.e., policy optimization) in the underlying Markov decision
process. We will, however, return to model-based methods in Chap-
ter 13 to see that promise lies in combining methods from model-based
reinforcement learning with methods from model-free reinforcement
learning.

A significant benefit of model-based reinforcement learning is that it
is inherently off-policy. That is, any trajectory, regardless of the policy
used to obtain it, can be used to improve the model of the underlying
Markov decision process. In the model-free setting, this is not necessar-
ily true. By default, estimating the value function from the data of a
trajectory will yield an estimate of the value function corresponding to
the policy that was used to sample the data.

We will start by discussing on-policy methods and later see how the
value function can be estimated off-policy.

11.4.1 On-policy Value Estimation


Let us suppose our agent follows a fixed policy π. Then, the corre-
sponding value function vπ is given as

vπ(x) = r(x, π(x)) + γ ∑_{x′∈X} p(x′ | x, π(x)) · vπ(x′)    (using the definition of the value function (10.7))
      = E_{R0,X1}[R0 + γ vπ(X1) | X0 = x, A0 = π(x)].    (11.13) (interpreting the above expression as an expectation over the random quantities R0 and X1)

Our first instinct might be to use a Monte Carlo estimate of this expec-
tation. Due to the conditional independence of the transitions (11.3),
Monte Carlo approximation does yield an unbiased estimate,

vπ(x) ≈ r + γ vπ(x′),    (11.14)

where the agent observed the transition ( x, a, r, x ′ ). Note that to es-


timate this expectation we use a single(!) sample,5 unlike our previ-
ous applications of Monte Carlo sampling where we usually averaged
over m samples. However, there is one significant problem in this ap-
proximation. Our approximation of vπ does in turn depend on the
(unknown) true value of vπ!

5 The idea is that we will use this approximation repeatedly as our agent
collects new data to achieve the same effect as initially averaging over
multiple samples.

The key idea is to use a bootstrapping estimate of the value function


instead. That is, in place of the true value function vπ , we will use a
“running estimate” V π . In other words, whenever observing a new
transition, we use our previous best estimate of vπ to obtain a new
estimate V π . We already encountered bootstrapping briefly in Sec-
tion 7.3.4 in the context of probabilistic ensembles in Bayesian deep
learning. More generally, bootstrapping refers to approximating a true

quantity (e.g., vπ ) by using an empirical quantity (e.g., V π ), which it-


self is constructed using samples from the true quantity that is to be
approximated.

Due to its use in estimating the value function, bootstrapping is a core


concept to model-free reinforcement learning. Crucially, using a boot-
strapping estimate generally results in biased estimates of the value
function. Moreover, due to relying on a single sample, the estimates
from Equation (11.14) tend to have very large variance.

The variance of the estimate is typically reduced by mixing new esti-


mates of the value function with previous estimates using a learning
rate αt . This yields the temporal-difference learning algorithm.

Algorithm 11.9: Temporal-difference (TD) learning


1 initialize V π arbitrarily (e.g., as 0)
2 for t = 0 to ∞ do
3 follow policy π to obtain the transition ( x, a, r, x ′ )
4 V π ( x ) ← (1 − αt )V π ( x ) + αt (r + γV π ( x ′ )) // (11.15)

The update rule is sometimes written equivalently as

V π ( x ) ← V π ( x ) + αt (r + γV π ( x ′ ) − V π ( x )). (11.16)

Thus, the update to V π ( x ) is proportional to the learning rate and the


difference between the previous estimate and the renewed estimate
using the new observation.
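A minimal sketch of the TD(0) update (11.15)/(11.16) (not from the text), assuming tabular states indexed by integers and transitions generated by the fixed policy π:

```python
import numpy as np

def td_update(V, x, r, x_next, alpha, gamma):
    """One temporal-difference update of the value estimate V for the visited state x.

    V: 1-D array of value estimates, indexed by state.
    Implements V(x) <- V(x) + alpha * (r + gamma * V(x') - V(x)), cf. (11.16).
    """
    td_error = r + gamma * V[x_next] - V[x]
    V[x] += alpha * td_error
    return V

# example usage with 5 states, all values initialized to 0
V = np.zeros(5)
V = td_update(V, x=0, r=1.0, x_next=2, alpha=0.1, gamma=0.9)
```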

Theorem 11.10 (Convergence of TD-learning, Jaakkola et al. (1993)).


If (αt )t∈N0 satisfies the RM-conditions (A.60) and all state-action pairs are
chosen infinitely often, then V π converges to vπ with probability 1.

Importantly, note that due to the Monte Carlo approximation of Equa-


tion (11.13) with respect to transitions attained by following policy π,
TD-learning is fundamentally on-policy. That is, for the estimates V π
to converge to the true value function vπ , the transitions that are used
for the estimation must follow policy π.

11.4.2 SARSA: On-policy Control


TD-learning merely estimates the value function of a fixed policy π.
To find the optimal policy π ⋆ , we can use an analogue of policy it-
eration (see Algorithm 10.14). Here, it is more convenient to use an
estimate of the state-action value function qπ which can be obtained

analogously to the bootstrapping estimate of vπ (11.14),

qπ(x, a) = r(x, a) + γ ∑_{x′∈X} p(x′ | x, a) ∑_{a′∈A} π(a′ | x′) qπ(x′, a′)    (using Bellman’s expectation equation (10.15))
         = E_{R0,X1,A1}[R0 + γ qπ(X1, A1) | X0 = x, A0 = a]    (11.17) (interpreting the above expression as an expectation over R0, X1 and A1)
         ≈ r + γ qπ(x′, a′),    (11.18) (Monte Carlo approximation with a single sample)

where the agent observed transitions ( x, a, r, x ′ ) and ( x ′ , a′ , r ′ , x ′′ ).

The update rule from TD-learning is therefore adapted to6

Qπ(x, a) ← (1 − αt) Qπ(x, a) + αt (r + γ Qπ(x′, a′)).    (11.19)

6 Note that for deterministic policies π, Qπ(x′, a′) = Qπ(x′, π(x′)) = Vπ(x′)
if the transitions are obtained by following policy π.

This algorithm is known as SARSA (short for state-action-reward-state-


action). Similar convergence guarantees to those of TD-learning can
also be derived for SARSA.

Theorem 11.11 (Convergence of SARSA, Singh et al. (2000)). If (αt )t∈N0


satisfies the RM-conditions (A.60) and all state-action pairs are chosen in-
finitely often, then Qπ converges to qπ with probability 1.

The policy iteration scheme to identify the optimal policy can be out-
lined as follows: In each iteration t, we estimate the value function qπt
of policy πt with the estimate Qπt obtained from SARSA. We then
choose the greedy policy with respect to Qπt as the next policy πt+1 .
However, due to the on-policy nature of SARSA, we cannot reuse any
data between the iterations. Moreover, it turns out that in practice,
when using only finitely many samples, this form of greedily opti-
mizing Markov decision processes does not explore enough. At least
partially, this can be compensated for by injecting noise when choosing
the next action, e.g., by following an ε-greedy policy or using softmax
exploration.
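A minimal sketch of the SARSA update (11.19) (not from the text), assuming a tabular Q array indexed by state and action:

```python
import numpy as np

def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma):
    """On-policy SARSA update for the observed quintuple (x, a, r, x', a').

    Q: 2-D array of shape (num_states, num_actions).
    Implements Q(x, a) <- (1 - alpha) Q(x, a) + alpha (r + gamma Q(x', a')), cf. (11.19).
    """
    target = r + gamma * Q[x_next, a_next]
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target
    return Q
```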

11.4.3 Off-policy Value Estimation

Consider the following slight adaptation of the derivation of SARSA (11.18),

qπ(x, a) = r(x, a) + γ ∑_{x′∈X} p(x′ | x, a) ∑_{a′∈A} π(a′ | x′) qπ(x′, a′)    (using Bellman’s expectation equation (10.15))
         = E_{R0,X1}[R0 + γ ∑_{a′∈A} π(a′ | X1) qπ(X1, a′) | X0 = x, A0 = a]    (11.20) (interpreting the above expression as an expectation over R0 and X1)
         ≈ r + γ ∑_{a′∈A} π(a′ | x′) qπ(x′, a′),    (11.21) (Monte Carlo approximation with a single sample)

where the agent observed the transition ( x, a, r, x ′ ). This yields the


update rule,
Qπ(x, a) ← (1 − αt) Qπ(x, a) + αt (r + γ ∑_{a′∈A} π(a′ | x′) Qπ(x′, a′)).    (11.22)

This adapted update rule explicitly chooses the subsequent action a′ ac-
cording to policy π whereas SARSA absorbs this choice into the Monte
Carlo approximation. The algorithm has analogous convergence guar-
antees to those of SARSA.

Crucially, this algorithm is off-policy. That is, we can use transitions


that were obtained according to any policy to estimate the value of a
fixed policy π, which we may have never used! Perhaps this seems
contradictory at first, but it is not. As noted, the key difference to
the on-policy TD-learning and SARSA is that our estimate of the Q-
function explicitly keeps track of the next-performed action. It does so
for any action in any state. Moreover, note that the transitions that are
due to the dynamics model and rewards are unaffected by the used
policy. They merely depend on the originating state-action pair. We
can therefore use the instances where other policies played action π ( x )
in state x to estimate the performance of π.
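A minimal sketch of the off-policy update (11.22) (not from the text), assuming a tabular Q array and a target policy given as a matrix of action probabilities:

```python
import numpy as np

def off_policy_update(Q, pi, x, a, r, x_next, alpha, gamma):
    """Off-policy evaluation update (11.22) for a fixed target policy pi.

    Q: array of shape (num_states, num_actions), estimate of q^pi.
    pi: array of shape (num_states, num_actions), pi[x, a] = pi(a | x).
    The transition (x, a, r, x') may come from *any* behavior policy.
    """
    expected_next = np.dot(pi[x_next], Q[x_next])   # sum_a' pi(a' | x') Q(x', a')
    target = r + gamma * expected_next
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target
    return Q
```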

11.4.4 Q-learning: Off-policy Control


It turns out that there is a way to estimate the value function of the
optimal policy directly. Recall from Bellman’s theorem (10.31) that the
optimal policy π ⋆ can be characterized in terms of the optimal state-
action value function q⋆ ,

π⋆(x) = arg max_{a∈A} q⋆(x, a).

π⋆ corresponds to greedily maximizing the value function.

Analogously to our derivation of SARSA (11.18), only using Bellman’s


theorem (10.32) in place of Bellman’s expectation equation (10.15), we
obtain,

q⋆(x, a) = r(x, a) + γ ∑_{x′∈X} p(x′ | x, a) max_{a′∈A} q⋆(x′, a′)    (using that the Q-function is a fixed-point of the Bellman update, see Bellman’s theorem (10.32))
         = E_{R0,X1}[R0 + γ max_{a′∈A} q⋆(X1, a′) | X0 = x, A0 = a]    (11.23) (interpreting the above expression as an expectation over R0 and X1)
         ≈ r + γ max_{a′∈A} q⋆(x′, a′),    (11.24) (Monte Carlo approximation with a single sample)

where the agent observed the transition ( x, a, r, x ′ ). Using a bootstrap-


ping estimate Q⋆ for q⋆ , we obtain a structurally similar algorithm to

TD-learning and SARSA — only for estimating the optimal Q-function


directly! This algorithm is known as Q-learning. Whereas we have seen
that the optimal policy can be found using SARSA in a policy-iteration-
like scheme, Q-learning is conceptually similar to value iteration.

Algorithm 11.12: Q-learning


1 initialize Q⋆ ( x, a) arbitrarily (e.g., as 0)
2 for t = 0 to ∞ do
3 observe the transition ( x, a, r, x ′ )
4 Q⋆ ( x, a) ← (1 − αt ) Q⋆ ( x, a) + αt (r + γ maxa′ ∈ A Q⋆ ( x ′ , a′ ))
// (11.25)

Similarly to TD-learning, the update rule can also be expressed as


 
Q⋆(x, a) ← Q⋆(x, a) + αt (r + γ max_{a′∈A} Q⋆(x′, a′) − Q⋆(x, a)).    (11.26)

Crucially, the Monte Carlo approximation of Equation (11.23) does not


depend on the policy. Thus, Q-learning is an off-policy method.
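A minimal sketch of one Q-learning step (11.25) (not from the text), assuming a tabular Q array:

```python
import numpy as np

def q_learning_update(Q, x, a, r, x_next, alpha, gamma):
    """Off-policy Q-learning update for the observed transition (x, a, r, x').

    Q: array of shape (num_states, num_actions), estimate of q*.
    Implements Q(x, a) <- (1 - alpha) Q(x, a) + alpha (r + gamma max_a' Q(x', a')).
    """
    target = r + gamma * np.max(Q[x_next])
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target
    return Q
```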

Theorem 11.13 (Convergence of Q-learning, Jaakkola et al. (1993)). If


(αt )t∈N0 satisfies the RM-conditions (A.60) and all state-action pairs are
chosen infinitely often (that is, the sequence of policies used to obtain the
transitions is GLIE), then Q⋆ converges to q⋆ with probability 1.

It can be shown that with probability at least 1 − δ, Q-learning con-


verges to an ϵ-optimal policy in a number of steps that is polynomial
in log | X |, log | A|, 1/ϵ and log 1/δ (Even-Dar et al., 2003).

11.4.5 Optimistic Q-learning

The next natural question is how to effectively trade exploration and


exploitation to both visit all state-action pairs many times, but also
attain a high reward.

However, as we have seen in Section 11.3, random “uninformed” ex-


ploration like ε-greedy and softmax exploration explores the state space
very slowly. We therefore return to the principle of optimism in the face
of uncertainty, which already led us to the Rmax algorithm in the model-
based setting. We will now additionally assume that the rewards are
non-negative, that is, 0 ≤ r ( x, a) ≤ Rmax (∀ x ∈ X, a ∈ A). It turns out
that a similar algorithm to Rmax also exists for (model-free) Q-learning:
it is called optimistic Q-learning and shown in Algorithm 11.14.

Algorithm 11.14: Optimistic Q-learning


1 initialize Q⋆(x, a) = Vmax ∏_{t=1}^{Tinit} (1 − αt)^{−1}
2 for t = 0 to ∞ do
3 pick action at = arg maxa∈ A Q⋆ ( x, a) and observe the transition
( x, at , r, x ′ )
4 Q⋆ ( x, at ) ← (1 − αt ) Q⋆ ( x, at ) + αt (r + γ maxa′ ∈ A Q⋆ ( x ′ , a′ ))
// (11.27)

Here,
Vmax := Rmax / (1 − γ) ≥ max_{x∈X, a∈A} q⋆(x, a),

is an upper bound on the discounted return and Tinit is some initial-


ization time. Intuitively, the initialization of Q⋆ corresponds to the
best-case long-term reward, assuming that all individual rewards are
upper bounded by Rmax . This is shown by the following lemma.

Lemma 11.15. Denote by Q⋆t the approximation of q⋆ attained in the t-th
iteration of optimistic Q-learning. Then, for any state-action pair (x, a) and
iteration t such that N(a | x) ≤ Tinit,7

Q⋆t(x, a) ≥ Vmax ≥ q⋆(x, a).    (11.28)

7 N(a | x) is the number of times action a is performed in state x.

Proof. We write β_τ := ∏_{i=1}^{τ} (1 − α_i) and η_{i,τ} := α_i ∏_{j=i+1}^{τ} (1 − α_j). Using
the update rule of optimistic Q-learning (11.27), we have

Q⋆t(x, a) = β_{N(a|x)} Q⋆0(x, a) + ∑_{i=1}^{N(a|x)} η_{i,N(a|x)} (r + γ max_{ai∈A} Q⋆_{ti}(xi, ai))    (11.29)

where xi is the next state arrived at time ti when action a is performed
the i-th time in state x.

Using the assumption that the rewards are non-negative, from Equa-
tion (11.29) and Q⋆0(x, a) = Vmax / β_{Tinit}, we immediately have

Q⋆t(x, a) ≥ (β_{N(a|x)} / β_{Tinit}) Vmax ≥ Vmax.    (using N(a | x) ≤ Tinit)

Now, if Tinit is chosen large enough, it can be shown that optimistic


Q-learning converges quickly to an optimal policy.

Theorem 11.16 (Convergence of optimistic Q-learning, Even-Dar and


Mansour (2001)). With probability at least 1 − δ, optimistic Q-learning ob-
tains an ϵ-optimal policy after a number of steps that is polynomial in | X |,

| A|, 1/ϵ, log 1/δ, and Rmax where the initialization time Tinit is upper bounded
by a polynomial in the same coefficients.

Note that for Q-learning, we still need to store Q⋆ ( x, a) for any state-
action pair in memory. Thus, Q-learning requires O(nm) memory.
During each transition, we need to compute

max_{a′∈A} Q⋆(x′, a′)

once. If we run Q-learning for T iterations, this yields a time complex-


ity of O( Tm). Crucially, for sparse Markov decision processes where,
in most states, only few actions are permitted, each iteration of Q-
learning can be performed in (virtually) constant time. This is a big
improvement of the quadratic (in the number of states) performance
of the model-based Rmax algorithm.

Discussion

We have seen that both the model-based Rmax algorithm and the model-
free Q-learning take time polynomial in the number of states | X | and
the number of actions | A| to converge. While this is acceptable in small
grid worlds, this is completely unacceptable for large state and action
spaces.

Often, domains are continuous, for example when modeling beliefs


about states in a partially observable environment. Also, in many
structured domains (e.g., chess or multiagent planning), the size of the
state and action space is exponential in the size of the input. In the final
two chapters, we will therefore explore how model-free and model-
based methods can be used (approximately) in such large domains.

Problems

11.1. Q-learning.

Assume the following grid world, where from state A the agent can
go to the right and down, and from state B to the left and down. From
states G1 and G2 the only action is to exit. The agent receives a reward
(+10 or +1) only when exiting.

Rewards:            States:
  0     0             A    B
+10    +1            G1   G2

We assume the discount factor γ = 1 and that all actions are determin-
istic.

1. We observe the following two episodes:

   Episode 1:
   x     a      x′     r
   A     ↓      G1     0
   G1    exit          10

   Episode 2:
   x     a      x′     r
   B     ←      A      0
   A     ↓      G1     0
   G1    exit          10

   Assume α = 0.3, and that Q-values of all non-terminal states are
   initialized to 0.5. What are the Q-values

   Q⋆(A, ↓), Q⋆(G1, exit), Q⋆(G2, exit)

   learned by executing Q-learning with the above episodes?


2. Will Q-learning converge to q⋆ for all state-action pairs ( x, a) if we
repeat episode 1 and episode 2 infinitely often? If not, design a
sequence of episodes that leads to convergence.
3. How does the choice of initial Q-values influence the convergence
of the Q-learning algorithm when episodes are obtained off-policy?
4. Determine v⋆ for all states.
12 Model-free Reinforcement Learning

In the previous chapter, we have seen methods for tabular settings.


Our goal now is to extend the model-free methods like TD-learning
and Q-learning to large state-action spaces X and A. We have seen
that a crucial bottleneck of these methods is the parameterization of
the value function. If we want to store the value function in a ta-
ble, we need at least O(|X |) space. If we learn the Q function, we
even need O(|X | · |A|) space. Also, for large state-action spaces, the
time required to compute the value function for every state-action pair
exactly will grow polynomially in the size of the state-action space.
Hence, a natural idea is to learn approximations of these functions with
a low-dimensional parameterization. Such approximations were the
focus of the first few chapters and are, in fact, the key idea behind
methods for reinforcement learning in large state-action spaces.

12.1 Tabular Reinforcement Learning as Optimization

To begin with, let us reinterpret the model-free methods from the pre-
vious section, TD-learning and Q-learning, as solving an optimization
problem, where each iteration corresponds to a single gradient update.
We will focus on TD-learning here, but the same interpretation applies
to Q-learning. Recall the update rule of TD-learning (11.15),

V π ( x ) ← (1 − αt )V π ( x ) + αt (r + γV π ( x ′ )).

Note that this looks just like the update rule of an optimization algo-
rithm! We can parameterize our estimates V π with parameters θ that
are updated according to the gradient of some loss function, assuming
fixed bootstrapping estimates. In particular, in the tabular setting (i.e.,
over a finite domain), we can parameterize the value function exactly
by learning a separate parameter for each state,
θ := [θ(1), . . . , θ(n)],    Vπ(x; θ) := θ(x).    (12.1)

To re-derive the above update rule as a gradient update, let us consider


the following loss function,
ℓ(θ; x, r) := ½ (vπ(x) − θ(x))²    (12.2)
           = ½ (r + γ E_{x′ | x,π(x)}[vπ(x′)] − θ(x))²    (12.3) (using Bellman’s expectation equation (10.11))
Note that this loss corresponds to a standard squared loss of the differ-
ence between the parameter θ( x ) and the label vπ ( x ) we want to learn.

We can now find the gradient of this loss. Elementary calculations


yield,
 
∇_{θ(x)} ℓ(θ; x, r) = θ(x) − (r + γ E_{x′ | x,π(x)}[vπ(x′)]).    (12.4)

Now, we cannot compute this derivative because we cannot compute


the expectation. Firstly, the expectation is over the true value func-
tion which is unknown to us. Secondly, the expectation is over the
transition model which we are trying to avoid in model-free methods.

To resolve the first issue, analogously to TD-learning, instead of learn-


ing the true value function vπ which is unknown, we learn the boot-
strapping estimate V π . Recall that the core principle behind boot-
strapping as discussed in Section 11.4.1 is that this bootstrapping esti-
mate V π is treated as if it were independent of the current estimate of
the value function θ. To emphasize this, we write V π ( x; θold ) ≈ vπ ( x )
where θold = θ but θold is treated as a constant with respect to θ.1

To resolve the second issue, analogously to the introduction of TD-
learning in the previous chapter, we will use a Monte Carlo estimate
using a single sample. Recall that this is only possible because the
transitions are conditionally independent given the state-action pair.

1 That is, the bootstrapping estimate Vπ(x; θold) is assumed to be constant
with respect to θ(x) in the same way that vπ(x) is constant with respect
to θ(x). If we were not using the bootstrapping estimate, the following
derivation of the gradient of the loss would not be this simple.

Remark 12.1: Sample (in)efficiency of model-free methods


These two shortcuts are two of the main reasons why model-
free methods such as TD-learning and Q-learning are usually sam-
ple inefficient. This is because using a bootstrapping estimate leads
to “(initially) incorrect” and “unstable” targets of the optimiza-
tion problem,2 and Monte Carlo estimation with a single sample
leads to a large variance. Recall that the theoretical guarantees for
model-free methods in the tabular setting therefore required that
all state-action pairs are visited infinitely often.

2 We explore this in some more capacity in Section 12.2.1 where we
cover heuristic approaches to alleviate this problem to some degree.

Using the aforementioned shortcuts, let us define the loss ℓ after ob-
serving the single transition ( x, a, r, x ′ ),
ℓ(θ; x, r, x′) := ½ (r + γ θold(x′) − θ(x))².    (12.5)

We define the gradient of this loss with respect to θ( x ) as


δTD := ∇_{θ(x)} ℓ(θ; x, r, x′)
     = θ(x) − (r + γ θold(x′)).    (12.6)

This error term is also called temporal-difference (TD) error. The tempo-
ral difference error compares the previous estimate of the value func-
tion to the bootstrapping estimate of the value function. We know
from the law of large numbers (A.36) that Monte Carlo averages are
unbiased.3 We therefore have,

E_{x′ | x,π(x)}[δTD] ≈ ∇_{θ(x)} ℓ(θ; x, r).    (12.7)

3 Crucially, the samples are unbiased with respect to the approximate
label in terms of the bootstrapping estimate only. Due to bootstrapping
the value function, the estimates are not unbiased with respect to the
true value function. Moreover, the variance of each individual estima-
tion of the gradient is large, as we only consider a single transition.

Naturally, we can use these unbiased gradient estimates with respect
to the loss ℓ to perform stochastic gradient descent. This yields the
update rule,

Vπ(x; θ) = θ(x) ← θ(x) − αt δTD    (12.8) (using stochastic gradient descent with learning rate αt, see Algorithm A.29)
         = (1 − αt) θ(x) + αt (r + γ θold(x′))    (using the definition of the temporal-difference error (12.6))
         = (1 − αt) Vπ(x; θ) + αt (r + γ Vπ(x′; θold)).    (12.9) (substituting Vπ(x; θ) for θ(x))

Observe that this gradient update coincides with the update rule of
TD-learning (11.15). Therefore, TD-learning is essentially performing
stochastic gradient descent using the TD-error as an unbiased gradient
estimate.4 Crucially, TD-learning performs stochastic gradient descent
with respect to the bootstrapping estimate of the value function Vπ
and not the true value function vπ! Stochastic gradient descent with
a bootstrapping estimate is also called stochastic semi-gradient descent.
Importantly, the optimization target r + γVπ(x′; θold) from the loss ℓ is
now moving between iterations which introduces some practical chal-
lenges we will discuss in Section 12.2.1. We have seen in the previous
chapter that using a bootstrapping estimate still guarantees (asymp-
totic) convergence to the true value function.

4 An alternative interpretation is that TD-learning performs gradient
descent with respect to the loss ℓ.

12.2 Value Function Approximation

To scale to large state spaces, it is natural to approximate the value


function using a parameterized model, V ( x; θ) or Q( x, a; θ). You may
think of this as a regression problem where we map state(-action) pairs
to a real number. Recall from the previous section that this is a strict
generalization of the tabular setting, as we could use a separate pa-
rameter to learn the value function for each individual state-action
pair. Our goal for large state-action spaces is to exploit the smooth-
ness properties5 of the value function to condense the representation.

5 That is, the value function takes similar values in “similar” states.

A straightforward approach is to use a linear function approximation


with the hand-designed feature map ϕ,
Q(x, a; θ) := θ⊤ϕ(x, a).    (12.10)

A common alternative is to use a deep neural network to learn these
features instead. Doing so is also known as deep reinforcement learning.6

6 Note that often non-Bayesian deep learning (i.e., point estimates of the
weights of a neural network) is applied. In the final chapter, Chapter 13,
we will explore the benefits of using Bayesian deep learning.

We will now apply the derivation from the previous section directly to
Q-learning. For Q-learning, after observing the transition (x, a, r, x′),
the loss function is given as

ℓ(θ; x, a, r, x′) := ½ (r + γ max_{a′∈A} Q⋆(x′, a′; θold) − Q⋆(x, a; θ))².    (12.11)

Here, we simply use Bellman’s optimality equation (10.32) to estimate


q⋆ ( x, a), instead of the estimation of vπ ( x) using Bellman’s expectation
equation for TD-learning. The difference between the current approx-
imation and the optimization target,
δB := r + γ max_{a′∈A} Q⋆(x′, a′; θold) − Q⋆(x, a; θ),    (12.12)

is called the Bellman error. Analogously to TD-learning,7 we obtain the
gradient update,

θ ← θ − αt ∇θ ℓ(θ; x, a, r, x′)    (12.13)
  = θ − αt ∇θ ½ (r + γ max_{a′∈A} Q⋆(x′, a′; θold) − Q⋆(x, a; θ))²    (using the definition of ℓ (12.11))
  = θ + αt δB ∇θ Q⋆(x, a; θ).    (12.14) (using the chain rule)

7 Compare to Equation (12.8).

When using a neural network to learn Q⋆, we can use automatic dif-
ferentiation to obtain unbiased gradient estimates. In the case of linear
function approximation, we can compute the gradient exactly,

  = θ + αt δB ∇θ θ⊤ϕ(x, a)    (using the linear approximation of Q⋆ (12.10))
  = θ + αt δB ϕ(x, a).    (12.15)

In the tabular setting, this algorithm is identical to Q-learning and, in


particular, converges to the true Q-function q⋆ (see Theorem 11.13).
There are few such results in the approximate setting. Usage in practice
indicates that
using an approximation of the value function “should be fine” when a
“rich-enough” class of functions is used.
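A minimal sketch of the semi-gradient Q-learning update with linear function approximation (12.15) (not from the text), assuming a hand-designed feature map phi(x, a):

```python
import numpy as np

def linear_q_update(theta, phi, x, a, r, x_next, actions, alpha, gamma):
    """Semi-gradient Q-learning step for Q(x, a; theta) = theta^T phi(x, a), cf. (12.15).

    phi(x, a) returns a feature vector; theta is the weight vector.
    The bootstrapped target treats the current weights as a fixed "old" estimate.
    """
    q_next = max(theta @ phi(x_next, a_next) for a_next in actions)
    bellman_error = r + gamma * q_next - theta @ phi(x, a)   # delta_B, cf. (12.12)
    return theta + alpha * bellman_error * phi(x, a)
```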

12.2.1 Heuristics
The vanilla stochastic semi-gradient descent is very slow. In this sub-
section, we will discuss some “tricks of the trade” to improve its per-
formance.

Stabilizing optimization targets: There are mainly two problems. The


first problem is that, as mentioned previously, the bootstrapping es-
timate changes after each iteration. As we are trying to learn an ap-
proximate value function that depends on the bootstrapping estimate,
this means that the optimization target is “moving” between iterations.
In practice, moving targets lead to stability issues. The first family of
techniques we discuss here aim to “stabilize” the optimization targets.

One such technique is called neural fitted Q-iteration or deep Q-networks


(DQN) (Mnih et al., 2015). DQN updates the neural network used
for the approximate bootstrapping estimate infrequently to maintain a
constant optimization target across multiple episodes. How this is im-
plemented exactly varies. One approach is to clone the neural network
and maintain one changing neural network (“online network”) for the
most recent estimate of the Q-function which is parameterized by θ,
and one fixed neural network (“target network”) used as the target
which is parameterized by θold and which is updated infrequently.

This can be implemented by maintaining a data set D of observed


transitions (the so-called replay buffer) and then “every once in a while”
(e.g., once |D| is large enough) solving a regression problem, where the
labels are determined by the target network. This yields a loss term
where the target is fixed across all transitions in the replay buffer D ,
ℓDQN(θ; D) := ½ ∑_{(x,a,r,x′)∈D} (r + γ max_{a′∈A} Q⋆(x′, a′; θold) − Q⋆(x, a; θ))².    (12.16) (compare to the Q-learning loss (12.11))

The loss can also be interpreted (in an online sense) as performing


regular Q-learning with the modification that the target network θold
is not updated to θ after every observed transition, but instead only
after observing |D|-many transitions. This technique is known as ex-
perience replay. Another approach is Polyak averaging where the target
network is gradually “nudged” by the neural network used to estimate
the Q-function.
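A minimal sketch of computing DQN regression targets from a replay buffer with a separate target network (not from the text; the buffer format and the function q_target are illustrative assumptions):

```python
import numpy as np

def dqn_targets(batch, q_target, gamma):
    """Compute fixed regression targets r + gamma * max_a' Q(x', a'; theta_old), cf. (12.16).

    batch: list of transitions (x, a, r, x_next).
    q_target: function mapping a state to the vector of Q-values under theta_old.
    """
    return np.array([r + gamma * np.max(q_target(x_next)) for (_, _, r, x_next) in batch])

def polyak_update(theta_old, theta, tau=0.005):
    """Nudge the target-network weights towards the online weights (Polyak averaging)."""
    return [(1 - tau) * w_old + tau * w for w_old, w in zip(theta_old, theta)]
```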

Maximization bias: Now, observe that the estimates Q⋆ are noisy esti-
mates of q⋆ and consider the term,

max_{a′∈A} q⋆(x′, a′) ≈ max_{a′∈A} Q⋆(x′, a′; θold),

from the loss function (12.16). This term maximizes a noisy estimate
of q⋆ , which leads to a biased estimate of max q⋆ as can be seen in
Figure 12.1. The fact that the update rules of Q-learning and DQN
are affected by inaccuracies (i.e., noise in the estimates) of the learned
Q-function is known as the “maximization bias”.

[Figure 12.1: Illustration of overestimation during learning. In each
state (x-axis), there are 10 actions. The left column shows the true
values v⋆(x) (purple line). All true action values are defined by
q⋆(x, a) := v⋆(x). The green line shows estimated values Q⋆(x, a) for
one action as a function of state, fitted to the true value at several
sampled states (green dots). The middle column plots show all the
estimated values (green), and the maximum of these values (dashed
black). The maximum is higher than the true value (purple, left plot)
almost everywhere. The right column plots show the difference in red.
The blue line in the right plots is the estimate used by Double Q-learning
with a second set of samples for each state. The blue line is much closer
to zero, indicating less bias. The three rows correspond to different true
functions (left, purple) or capacities of the fitted function (left, green).
Reproduced with permission from “Deep reinforcement learning with
double q-learning” (Van Hasselt et al., 2016).]

Double DQN (DDQN) is an algorithm that addresses this maximization
bias (Van Hasselt et al., 2016). Instead of picking the optimal action
with respect to the old network, it picks the optimal action with respect
to the new network,

ℓDDQN(θ; D) := ½ ∑_{(x,a,r,x′)∈D} (r + γ Q⋆(x′, a⋆(x′; θ); θold) − Q⋆(x, a; θ))²    (12.17)
where a⋆(x′; θ) := arg max_{a′∈A} Q⋆(x′, a′; θ).    (12.18)

Intuitively, this change ensures that the evaluation of the target net-
work is consistent with the updated Q-function, which makes the al-
gorithm more robust to noise.

Similarly to DQN, this can also be interpreted as the online update,

θ ← θ + αt (r + γ Q⋆(x′, a⋆(x′; θ); θold) − Q⋆(x, a; θ)) ∇θ Q⋆(x, a; θ)    (12.19)

after observing a single transition (x, a, r, x′) where, while differenti-
ating, a⋆(x′; θ) is treated as constant with respect to θ. θold is then
updated to θ after observing |D|-many transitions.
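A minimal sketch contrasting the DDQN target (12.17)–(12.18) with the DQN target (not from the text; q_online and q_target are placeholder functions returning Q-value vectors):

```python
import numpy as np

def ddqn_target(r, x_next, q_online, q_target, gamma):
    """Double DQN target: select the action with the online network,
    evaluate it with the target network, cf. (12.17)-(12.18)."""
    a_star = int(np.argmax(q_online(x_next)))     # a*(x'; theta)
    return r + gamma * q_target(x_next)[a_star]   # Q*(x', a*; theta_old)

def dqn_target(r, x_next, q_target, gamma):
    """Standard DQN target: both selection and evaluation use the target network."""
    return r + gamma * np.max(q_target(x_next))
```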

12.3 Policy Approximation

Q-learning defines a policy implicitly by


π⋆(x) := arg max_{a∈A} Q⋆(x, a).    (12.20)

Q-learning also maximizes over the set of all actions in its update step
while learning the Q-function. This is intractable for large and, in
particular, continuous action spaces. A natural idea to escape this lim-
itation is to immediately learn an approximate parameterized policy,
π⋆(x) ≈ π(x; φ) := πφ(x).    (12.21)
Methods that find an approximate policy are also called policy search
methods or policy gradient methods.

Whereas with Q-learning, exploration can be encouraged by using an


ε-greedy policy, softmax exploration, or an optimistic initialization, we
will see later that policy gradient methods fundamentally rely on ran-
domized policies for exploration.

Remark 12.2: Notation


We refer to deterministic policies π in bold, as they can be in-
terpreted as vector-valued functions from X to A. We still refer
to randomized policies by π, as for each state x ∈ X they are
represented as a PDF over actions A.

In particular, we denote by πφ( a | x) the probability (density) of


playing action a when in state x according to πφ.

12.3.1 Estimating Policy Values


We will begin by attributing a “value” to a policy. Recall the defini-
tion of the discounted payoff Gt from time t, which we are aiming to
maximize,

Gt = ∑ γm Rt+m . see Equation (10.5)
m =0
We define Gt:T to be the bounded discounted payoff until time T,
T −1− t
.
Gt:T = ∑ γm Rt+m . (12.22)
m =0
Based on these two random variables, we can define the policy value
function:

Definition 12.3 (Policy value function). The policy value function,

J(π) := Eπ[G0] = Eπ[∑_{t=0}^{∞} γ^t Rt],    (12.23)

measures the expected discounted payoff of policy π.9 We also define
the bounded variant,

JT(π) := Eπ[G0:T] = Eπ[∑_{t=0}^{T−1} γ^t Rt].    (12.24)

9 We neglect here that implicitly one also averages over the initial state
if this state is not fixed.

For simplicity, we will abbreviate J(φ) := J(πφ).

Remark 12.4: Notation of policy value function


We adopt the more common notation J (π ) for the policy value
function, as opposed to j(π ), which would be consistent with our
notation of the (true) value functions vπ , qπ . So don’t be con-
fused by this: just like the value functions vπ , qπ , the policy value
function J (π ) is a deterministic object, measuring the mean dis-
counted payoff. We will use b J (π ) to refer to our estimates of the
policy value function.

Naturally, we want to maximize J (φ). That is, we want to solve


φ⋆ := arg max_{φ} J(φ)    (12.25)

which is a non-convex optimization problem. Let us see how J(φ)
can be evaluated to understand the optimization problem better. We
will again use a Monte Carlo estimate. Recall that a fixed φ induces a
unique Markov chain, which can be simulated. In the episodic setting,
each episode (also called rollout) of length T yields an independently
sampled trajectory,

τ^(i) := ((x0^(i), a0^(i), r0^(i), x1^(i)), (x1^(i), a1^(i), r1^(i), x2^(i)), . . .).    (12.26)

Simulating m rollouts yields the samples τ^(1), . . . , τ^(m) ∼ Πφ, where
Πφ is the distribution of all trajectories (i.e., rollouts) of the Markov
chain induced by policy πφ. We denote the (bounded) discounted
payoff of the i-th rollout by

g0:T^(i) := ∑_{t=0}^{T−1} γ^t rt^(i)    (12.27)

where rt^(i) is the reward at time t of the i-th rollout. Using a Monte
Carlo approximation, we can then estimate JT(φ). Moreover, due to
the exponential discounting of future rewards, it is reasonable to ap-
proximate the policy value function using bounded trajectories,

J(φ) ≈ JT(φ) ≈ ĴT(φ) := (1/m) ∑_{i=1}^{m} g0:T^(i).    (12.28)
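A minimal sketch of the Monte Carlo estimate (12.28) (not from the text), assuming a simulator function rollout(policy, horizon) that returns the sequence of rewards of one episode:

```python
import numpy as np

def estimate_policy_value(rollout, policy, num_rollouts, horizon, gamma):
    """Monte Carlo estimate of J_T(phi), cf. (12.28).

    rollout(policy, horizon) is assumed to simulate one episode and
    return the rewards r_0, ..., r_{T-1}.
    """
    discounts = gamma ** np.arange(horizon)
    returns = []
    for _ in range(num_rollouts):
        rewards = np.asarray(rollout(policy, horizon))
        returns.append(np.dot(discounts[:len(rewards)], rewards))   # discounted payoff g_{0:T}
    return np.mean(returns)
```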

12.3.2 Policy Gradient


Policy gradient methods solve the above optimization problem (12.25)
by stochastic gradient ascent on the policy parameter φ:

φ ← φ + η ∇φ J (φ). (12.29)

How can we compute the policy gradient? Let us first formally define
the distribution over trajectories Πφ that we introduced in the previous
section. We can specify the probability of a specific trajectory τ under
a policy πφ by
T −1
Πφ(τ ) = p( x0 ) ∏ πφ(at | xt ) p(xt+1 | xt , at ). (12.30)
t =0

For optimizing J (φ) we need to obtain unbiased gradient estimates:

∇φ J (φ) ≈ ∇φ JT (φ) = ∇φ Eτ ∼Πφ [ G0:T ]. (12.31)

Note that the expectation integrates over the measure Πφ, which de-
pends on the parameter φ. Thus, we cannot move the gradient oper-
ator inside the expectation as we have often done previously (cf. Ap-
pendix A.1.5). This should remind you of the reparameterization trick
(see Equation (5.62)) that we used to solve a similar gradient in the
context of variational inference. In this context, however, we cannot
apply the reparameterization trick.10 Fortunately, there is another way
of estimating this gradient.

10 This is because the distribution Πφ is generally not reparameterizable.
We will, however, see that reparameterization gradients are also useful
in reinforcement learning. See, e.g., Sections 12.5.1 and 13.1.2.

Theorem 12.5 (Score gradient estimator). Under some regularity assump-
tions, we have

∇φ E_{τ∼Πφ}[G0] = E_{τ∼Πφ}[G0 ∇φ log Πφ(τ)].    (12.32)

This estimator of the gradient is called the score gradient estimator.

Proof. To begin with, let us look at the so-called score function of the
distribution Πφ, ∇φ log Πφ(τ ). Using the chain rule, the score function
can be expressed as

∇φ log Πφ(τ) = ∇φ Πφ(τ) / Πφ(τ)    (12.33)

and by rearranging the terms, we obtain

∇φ Πφ(τ) = Πφ(τ) ∇φ log Πφ(τ).    (12.34)

This is called the score function trick.11

11 We have already applied this “trick” in Problem 6.9.

Now, assuming that state and action spaces are continuous, we obtain

∇φ E_{τ∼Πφ}[G0] = ∇φ ∫ Πφ(τ) · G0 dτ    (using the definition of expectation (1.19))
               = ∫ ∇φ Πφ(τ) · G0 dτ    (using the regularity assumptions to swap gradient and integral)
               = ∫ G0 · Πφ(τ) ∇φ log Πφ(τ) dτ    (using the score function trick (12.34))
               = E_{τ∼Πφ}[G0 ∇φ log Πφ(τ)].    (interpreting the integral as an expectation over Πφ)

Intuitively, maximizing J (φ) increases the probability of policies with


high returns and decreases the probability of policies with low returns.

To use the score gradient estimator for estimating the gradient, we


need to compute ∇φ log Πφ(τ ).

∇φ log Πφ(τ)
  = ∇φ (log p(x0) + ∑_{t=0}^{T−1} log πφ(at | xt) + ∑_{t=0}^{T−1} log p(xt+1 | xt, at))    (using the definition of the distribution over trajectories Πφ)
  = ∇φ log p(x0) + ∑_{t=0}^{T−1} ∇φ log πφ(at | xt) + ∑_{t=0}^{T−1} ∇φ log p(xt+1 | xt, at)
  = ∑_{t=0}^{T−1} ∇φ log πφ(at | xt).    (12.35) (using that the first and third term are independent of φ)

When using a neural network for the parameterization of the policy π,


we can use automatic differentiation to compute the gradients.

The expectation of the score gradient estimator (12.32) can be approx-


imated using Monte Carlo sampling,
∇φ JT(φ) ≈ ∇φ ĴT(φ) := (1/m) ∑_{i=1}^{m} g0:T^(i) ∑_{t=0}^{T−1} ∇φ log πφ(at^(i) | xt^(i)).    (12.36)

However, typically the variance of these estimates is very large. Using
so-called baselines, we can reduce the variance dramatically (see Prob-
lem 12.3).

Lemma 12.6 (Score gradients with baselines). We have,

E_{τ∼Πφ}[G0 ∇φ log Πφ(τ)] = E_{τ∼Πφ}[(G0 − b) ∇φ log Πφ(τ)].    (12.37)

Here, b ∈ R is called a baseline.

Proof. For the term to the right, we have due to linearity of expecta-
tion (1.20),

E_{τ∼Πφ}[(G0 − b) ∇φ log Πφ(τ)] = E_{τ∼Πφ}[G0 ∇φ log Πφ(τ)] − E_{τ∼Πφ}[b · ∇φ log Πφ(τ)].

Thus, it remains to show that the second term is zero,

E_{τ∼Πφ}[b · ∇φ log Πφ(τ)] = b · ∫ Πφ(τ) ∇φ log Πφ(τ) dτ    (using the definition of expectation (1.19))
                           = b · ∫ Πφ(τ) (∇φ Πφ(τ) / Πφ(τ)) dτ    (substituting the score function (12.33), “undoing the score function trick”)
                           = b · ∫ ∇φ Πφ(τ) dτ    (Πφ(τ) cancels)
                           = b · ∇φ ∫ Πφ(τ) dτ
                           = b · ∇φ 1 = 0.    (integrating a PDF over its domain is 1 and the derivative of a constant is 0)

One can even show that we can subtract arbitrary baselines depending
on previous states (see Problem 12.4).

Example 12.7: Downstream returns


A commonly used state-dependent baseline is

b(τ0:t−1) := ∑_{m=0}^{t−1} γ^m rm.    (12.38)

This baseline subtracts the returns of all actions before time t. In-
tuitively, using this baseline, the score gradient only considers
downstream returns. Recall from Equation (12.22) that we de-
fined Gt:T as the bounded discounted payoff from time t. It is also
commonly called the (bounded) downstream return (or reward to go)
beginning at time t.

For a fixed trajectory τ that is bounded at time T, we have

G0 − b(τ0:t−1 ) = γt Gt:T , (12.39)

yielding the gradient estimator,

∇φ J(φ) ≈ ∇φ JT(φ) = E_{τ∼Πφ}[G0 ∇φ log Πφ(τ)]    (using the score gradient estimator (12.32))
                    = E_{τ∼Πφ}[∑_{t=0}^{T−1} γ^t Gt:T ∇φ log πφ(at | xt)].    (12.40) (using a state-dependent baseline (12.102))

Performing stochastic gradient descent with the score gradient esti-


mator and downstream returns is known as the REINFORCE algo-
rithm (Williams, 1992) which is shown in Algorithm 12.8.

Algorithm 12.8: REINFORCE algorithm


1 initialize policy weights φ
2 repeat
3 generate an episode (i.e., rollout) to obtain trajectory τ
4 for t = 0 to T − 1 do
5 set gt:T to the downstream return from time t
6 φ ← φ + ηγt gt:T ∇φ log πφ( at | xt ) // (12.41)

7 until converged
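A minimal sketch of one REINFORCE pass over a rollout for a softmax policy over a discrete action set (not from the text; the tabular-softmax parameterization is an illustrative assumption):

```python
import numpy as np

def softmax_policy(phi, x):
    """pi_phi(. | x) for a tabular softmax parameterization phi[x, a]."""
    logits = phi[x] - phi[x].max()
    p = np.exp(logits)
    return p / p.sum()

def reinforce_update(phi, trajectory, eta, gamma):
    """One REINFORCE pass over a rollout [(x_0, a_0, r_0), (x_1, a_1, r_1), ...], cf. (12.41)."""
    rewards = np.array([r for (_, _, r) in trajectory])
    T = len(trajectory)
    for t, (x, a, _) in enumerate(trajectory):
        g_t = np.sum(gamma ** np.arange(T - t) * rewards[t:])   # downstream return g_{t:T}
        probs = softmax_policy(phi, x)
        grad_log = -probs                                       # gradient of log pi(a | x) w.r.t. phi[x, .]
        grad_log[a] += 1.0
        phi[x] += eta * (gamma ** t) * g_t * grad_log
    return phi
```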

The variance of REINFORCE can be reduced further. A common tech-
nique is to subtract a term bt from the downstream returns,

∇φ J(φ) = E_{τ∼Πφ}[∑_{t=0}^{T} γ^t (Gt:T − bt) ∇φ log πφ(at | xt)].    (12.42)

For example, we can subtract the t-independent mean reward to go,

bt = b := (1/T) ∑_{t′=0}^{T−1} Gt′:T.    (12.43)
The main advantage of policy gradient methods such as REINFORCE


is that they can be used in continuous action spaces. However, REIN-
FORCE is not guaranteed to find an optimal policy. Even when operat-
ing in very small domains, REINFORCE can get stuck in local optima.

Typically, policy gradient methods are slow due to the large variance in
the score gradient estimates. Because of this, they need to take small
steps and require many rollouts of a Markov chain. Moreover, we
cannot reuse data from previous rollouts, as policy gradient methods
are fundamentally on-policy.12 12
This is because the score gradient esti-
mator is used to obtain gradients of the
Next, we will combine value approximation techniques like Q-learning policy value function with respect to the
current policy.
and policy gradient methods, leading to an often more practical family
of methods called actor-critic methods.

12.4 On-policy Actor-Critics

Actor-Critic methods reduce the variance of policy gradient estimates


by using ideas from value function approximation. They use function
approximation both to approximate value functions and to approxi-
mate policies. The goal for these algorithms is to scale to reinforce-
ment learning problems, where we both have large state spaces and
large action spaces.

12.4.1 Advantage Function


A key concept of actor-critic methods is the advantage function.

Definition 12.9 (Advantage function). Given a policy π, the advantage


function,
aπ(x, a) := qπ(x, a) − vπ(x)    (12.44)
          = qπ(x, a) − E_{a′∼π(x)}[qπ(x, a′)],    (12.45) (using Equation (10.14))

measures the advantage of picking action a ∈ A when in state x ∈ X


over simply following policy π.

It follows immediately from Equation (12.45) that for any policy π and
state x ∈ X , there exists an action a ∈ A such that aπ ( x, a) is non-
negative,

max_{a∈A} aπ(x, a) ≥ 0.    (12.46)

Moreover, it follows directly from Bellman’s theorem (10.31) that

π is optimal ⇐⇒ ∀ x ∈ X , a ∈ A : aπ ( x, a) ≤ 0. (12.47)

In other words, quite intuitively, π is optimal if and only if there is no


action that has an advantage in any state over the action that is played
by π.

Finally, we can re-define the greedy policy πq with respect to the state-
action value function q as
πq(x) := arg max_{a∈A} a(x, a)    (12.48)

since

arg max_{a∈A} a(x, a) = arg max_{a∈A} (q(x, a) − v(x)) = arg max_{a∈A} q(x, a),

as v( x) is independent of a. This coincides with our initial definition of


greedy policies in Equation (10.29). Intuitively, the advantage function
is a shifted version of the state-action value function q that is relative
to 0. Using this quantity rather than q often has numerical advantages.

12.4.2 Policy Gradient Theorem


Recall the score gradient estimator (12.40) that we had introduced in
the previous section,

∇φ JT(φ) = E_{τ∼Πφ}[∑_{t=0}^{T−1} γ^t Gt:T ∇φ log πφ(at | xt)].

Previously, we have approximated the policy value function J (φ) by


the bounded policy value function JT (φ). We said that this approxi-
mation was “reasonable” due to the diminishing returns. Essentially,
we have “cut off the tails” of the policy value function. Let us now
reinterpret score gradients while taking into account the tails of J (φ).

∇φ J(φ) = lim_{T→∞} ∇φ JT(φ)    (12.49)
         = ∑_{t=0}^{∞} E_{τ∼Πφ}[γ^t Gt ∇φ log πφ(at | xt)].    (substituting the score gradient estimator with downstream returns (12.40) and using linearity of expectation (1.20))

Observe that because the expectations only consider downstream re-


turns, we can disregard all data from the trajectory prior to time t. Let
us define
τt:∞ := ((xt, at, rt, xt+1), (xt+1, at+1, rt+1, xt+2), . . .),    (12.50)

as the trajectory from time step t. Then,

  = ∑_{t=0}^{∞} E_{τt:∞∼Πφ}[γ^t Gt ∇φ log πφ(at | xt)].

We now condition on xt and at,

  = ∑_{t=0}^{∞} E_{xt,at}[γ^t E_{rt,τt+1:∞}[Gt | xt, at] ∇φ log πφ(at | xt)].    (using that πφ(at | xt) is a constant given xt and at)

Observe that averaging over the trajectories E_{τ∼Πφ}[·] that are sampled
according to policy πφ is equivalent to our shorthand notation E_{πφ}[·]
from Equation (10.6),

  = ∑_{t=0}^{∞} E_{xt,at}[γ^t E_{πφ}[Gt | xt, at] ∇φ log πφ(at | xt)]
  = ∑_{t=0}^{∞} E_{xt,at}[γ^t qπφ(xt, at) ∇φ log πφ(at | xt)].    (12.51) (using the definition of the Q-function (10.8))

It turns out that E_{xt,at}[qπφ(xt, at)] exhibits much less variance than our
previous estimator E_{xt,at}[E_{πφ}[Gt | xt, at]]. Equation (12.51) is known as
the policy gradient theorem.

Often, the policy gradient theorem is stated in a slightly rephrased
form in terms of the discounted state occupancy measure,13

ρφ^∞(x) := (1 − γ) ∑_{t=0}^{∞} γ^t p_{Xt}(x).    (12.52)

The factor (1 − γ) ensures that ρφ^∞ is a probability density as

∫ ρφ^∞(x) dx = (1 − γ) ∑_{t=0}^{∞} γ^t ∫ p_{Xt}(x) dx = (1 − γ) ∑_{t=0}^{∞} γ^t = 1.

Intuitively, ρφ^∞(x) measures how often we visit state x when following
policy πφ. It can be thought of as a “discounted frequency”.

13 Depending on the reward setting, there exist various variations of the
policy gradient theorem. We derived the variant for infinite-horizon
discounted payoffs. “Reinforcement learning: An introduction” (Sutton
and Barto, 2018) derive the variant for undiscounted average rewards.

Theorem 12.10 (Policy gradient theorem in terms of ρφ^∞). Policy gradi-
ents can be represented in terms of the Q-function,

∇φ J(φ) ∝ E_{x∼ρφ^∞}[E_{a∼πφ(·|x)}[qπφ(x, a) ∇φ log πφ(a | x)]].    (12.53)

Proof. The right hand side of Equation (12.51) can be expressed as

∑_{t=0}^{∞} ∫ p_{Xt}(x) · E_{at∼πφ(·|x)}[γ^t qπφ(x, at) ∇φ log πφ(at | x)] dx
  = (1 / (1 − γ)) ∫ ρφ^∞(x) · E_{a∼πφ(·|x)}[qπφ(x, a) ∇φ log πφ(a | x)] dx

where we swapped the order of sum and integral and reorganized
terms.

Matching our intuition, according to the policy gradient theorem, max-


imizing J (φ) corresponds to increasing the probability of actions with
a large value and decreasing the probability of actions with a small
value, taking into account how often the resulting policy visits certain
states.

Observe that we cannot use the policy gradient to calculate the gradient exactly, as we do not know q^{π_φ}. Instead, we will use bootstrapping estimates Q^{π_φ} of q^{π_φ}.

Figure 12.2: Illustration of one iteration of actor-critic methods. The dependencies between the actors (policy approximation) and critics (value function approximation) are shown as arrows. Methods differ in the exact order in which actor and critic are updated.

12.4.3 A First Actor-Critic

Actor-Critic methods consist of two components:
• a parameterized policy, π(a | x; φ) ≐ π_φ, which is called the actor; and    (12.54)
• a value function approximation, q^{π_φ}(x, a) ≈ Q^{π_φ}(x, a; θ), which is called the critic. In the following, we will abbreviate Q^{π_φ} by Q.    (12.55)
In deep reinforcement learning, neural networks are used to parameterize both actor and critic. Therefore, in principle, the actor-critic framework allows scaling to both large state spaces and large action spaces. We begin by discussing on-policy actor-critics.

One approach in the online setting (i.e., non-episodic setting), is to


simply use SARSA for learning the critic. To learn the actor, we use
stochastic gradient descent with gradients obtained using single sam-
ples from

    ∇_φ J(φ) ≈ ∇_φ Ĵ(φ) ≐ ∑_{t=0}^{∞} E_{(x_t,a_t)∼π_φ}[ γ^t Q(x_t, a_t; θ) ∇_φ log π_φ(a_t | x_t) ],    (12.56)    see Equation (12.51)

where Q is a bootstrapping estimate of q^{π_φ}. This algorithm is known as online actor-critic or Q actor-critic and shown in Algorithm 12.11.

Algorithm 12.11: Online actor-critic
1 initialize parameters φ and θ
2 repeat
3     use π_φ to obtain transition (x, a, r, x′)
4     δ = r + γ Q(x′, π_φ(x′); θ) − Q(x, a; θ)
      // actor update
5     φ ← φ + η γ^t Q(x, a; θ) ∇_φ log π_φ(a | x)    // (12.57)
      // critic update
6     θ ← θ + η δ ∇_θ Q(x, a; θ)    // (12.58)
7 until converged

Comparing to the derivation of TD-learning from Equation (12.8), we observe that Equation (12.58) corresponds to the SARSA update rule.14 Due to the use of SARSA for learning the critic, this algorithm is fundamentally on-policy.

14 The gradient with respect to θ appears analogously to our derivation of approximate Q-learning (12.15).

Crucially, by neglecting the dependence of the bootstrapping estimate Q on the policy parameters φ, we introduce bias in the gradient estimates. In other words, using the bootstrapping estimate Q means that the resulting gradient direction might not be a valid ascent direction. In particular, the actor is not guaranteed to improve. Still, it turns out that under strong so-called “compatibility conditions” that are rarely satisfied in practice, a valid ascent direction can be guaranteed.
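To make the update rules (12.57) and (12.58) concrete, here is a minimal Python (PyTorch) sketch of one online actor-critic step. It assumes a discrete action space, an actor network mapping a state to action logits and a critic network mapping a state to a vector of Q-values; these names, the sampled next action, and dropping the γ^t factor are illustrative simplifications, not part of the original algorithm statement.

import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      x, a, r, x_next, gamma=0.99):
    # SARSA-style critic target: bootstrap with a next action sampled from the policy.
    with torch.no_grad():
        a_next = torch.distributions.Categorical(logits=actor(x_next)).sample()
        target = r + gamma * critic(x_next)[a_next]

    # Critic update, cf. (12.58): gradient step on the squared TD error.
    q_xa = critic(x)[a]
    critic_loss = 0.5 * (target - q_xa) ** 2
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update, cf. (12.57): score gradient weighted by the detached critic
    # (the gamma^t factor is omitted, as is common in practice).
    log_pi = torch.log_softmax(actor(x), dim=-1)[a]
    actor_loss = -critic(x)[a].detach() * log_pi
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()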

12.4.4 Improved Actor-Critics


Reducing variance: To further reduce the variance of the gradient es-
timates, it turns out that a similar approach to the baselines we dis-
cussed in the previous section on policy gradient methods is useful.
A common approach is to subtract the state value function from esti-
mates of the Q-function,

    φ ← φ + η_t γ^t (Q(x, a; θ) − V(x; θ)) ∇_φ log π_φ(a | x)    (12.59)
      = φ + η_t γ^t A(x, a; θ) ∇_φ log π_φ(a | x),    (12.60)    using the definition of the advantage function (12.44)
where A( x, a; θ) is a bootstrapped estimate of the advantage func-
tion aπφ . This algorithm is known as advantage actor-critic (A2C) (Mnih
et al., 2016). Recall that the Q-function is an absolute quantity, whereas
the advantage function is a relative quantity, where the sign is informa-
tive for the gradient direction. Intuitively, an absolute value is harder
to estimate than the sign. Actor-Critic methods are therefore often
implemented with respect to the advantage function rather than the
Q-function.

Taking a step back, observe that policy gradient methods such as


REINFORCE generally have high variance in their gradient estimates.
However, due to using Monte Carlo estimates of Gt , the gradient es-
timates are unbiased. In contrast, using a bootstrapped Q-function to
obtain gradient estimates yields estimates with a smaller variance, but
those estimates are biased. We are faced with a bias-variance tradeoff . A
natural approach is therefore to blend both gradient estimates to allow

for effectively trading bias and variance. This leads to algorithms such
as generalized advantage estimation (GAE) (Schulman et al., 2016).
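As a concrete illustration of such a blended estimate, the following is a minimal sketch of the standard GAE recursion on a single rollout; the interface and names are illustrative. The parameter lam interpolates between the low-variance but biased one-step TD advantage (lam = 0) and the unbiased but high-variance Monte Carlo estimate (lam = 1).

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout.

    rewards: r_0, ..., r_{T-1}; values: V(x_0), ..., V(x_T) (bootstrap value last).
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                          # discounted sum of TD errors
        adv[t] = gae
    return adv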

Exploration: Similarly to REINFORCE, actor-critic methods typically


rely on randomization in the policy to encourage exploration, the idea
being that if the policy is stochastic, then the agent will visit a diverse
set of states. The inherent stochasticity of the policy is, however, often
insufficient. A common problem is that the policy quickly “collapses”
to a deterministic policy since the objective function is greedily ex-
ploitative. A common workaround is to use an ε-greedy policy (cf.
Section 11.3.1) or to explicitly encourage the policy to exhibit uncer-
tainty by adding an entropy term to the objective function (more on
this in Section 12.6). However, note that for on-policy methods, chang-
ing the policy also changes the value function learned by the critic.

Improving sample efficiency: Actor-Critic methods typically suffer from


low sample efficiency. When additionally using an on-policy method,
actor-critics often need an extremely large number of interactions be-
fore learning a near-optimal policy, because they cannot reuse past
data. Allowing to reuse past data is a major advantage of off-policy
methods like Q-learning.

One well-known variant that slightly improves the sample efficiency is trust-region policy optimization (TRPO) (Schulman et al., 2015). TRPO uses multiple iterations, where in each iteration a fixed critic is used to optimize the policy.15 During iteration k, we select

    φ_{k+1} ← arg max_φ Ĵ(φ)  subject to  E_{x∼ρ^∞_{φ_k}}[ KL(π_{φ_k}(· | x) ∥ π_φ(· | x)) ] ≤ δ    (12.61)

for some fixed δ > 0 and where

    Ĵ(φ) ≐ E_{x∼ρ^∞_{φ_k}, a∼π_{φ_k}(·|x)}[ w_k(φ; x, a) A^{π_{φ_k}}(x, a) ].    (12.62)

15 Intuitively, each iteration performs a collection of gradient ascent steps.

Notably, Ĵ is an expectation with respect to the previous policy π_{φ_k} and the previous critic A^{π_{φ_k}}. TRPO uses importance sampling where the importance weights (called “likelihood ratios”),

    w_k(φ; x, a) ≐ π_φ(a | x) / π_{φ_k}(a | x),

are used to correct for taking the expectation over the previous policy. When w_k ≈ 1 the policies π_φ and π_{φ_k} are similar, whereas when w_k ≪ 1 or w_k ≫ 1, the policies differ significantly. To be able to assume that the fixed critic is a good approximation within a certain “trust region” (i.e., one iteration), we impose the constraint

    E_{x∼ρ^∞_{φ_k}}[ KL(π_{φ_k}(· | x) ∥ π_φ(· | x)) ] ≤ δ

to optimize only in the “neighborhood” of the current policy. This


constraint is also necessary for the importance weights not to blow up.

Remark 12.12: Estimating the KL-divergence


Instead of naive computation with E_{a∼π_{φ_k}(·|x)}[ −log w_k(φ; x, a) ], the KL-divergence is commonly estimated by Monte Carlo samples of

    KL(π_{φ_k}(· | x) ∥ π_φ(· | x)) = E_{a∼π_{φ_k}(·|x)}[ w_k(φ; x, a) − 1 − log w_k(φ; x, a) ],

which adds the “baseline” w_k(φ; x, a) − 1 with mean 0. Observe that this estimator is unbiased, always non-negative since log(x) ≤ x − 1 for all x, while having a lower variance than the naive estimator.
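A minimal sketch of this estimator from sampled log-probabilities (plain NumPy; names are illustrative):

import numpy as np

def kl_estimate(logp_old, logp_new):
    """Monte Carlo estimate of KL(pi_old || pi_new) from actions a ~ pi_old.

    logp_old, logp_new: arrays of log pi_old(a|x) and log pi_new(a|x) for the
    sampled actions. Each term w - 1 - log w is non-negative.
    """
    log_w = logp_new - logp_old      # log likelihood ratio
    w = np.exp(log_w)
    return np.mean(w - 1.0 - log_w)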

Taking the expectation with respect to the previous policy πφk means
that we can reuse data from rollouts within the same iteration. That is,
TRPO allows reusing past data as long as it can still be “trusted”. This
makes TRPO “somewhat” off-policy. Fundamentally, though, TRPO is
still an on-policy method.

Proximal policy optimization (PPO) is a family of heuristic variants of


TRPO which replace the constrained optimization problem of Equa-
tion (12.61) by the unconstrained optimization of a regularized objec-
tive (Schulman et al., 2017; Wang et al., 2020). PPO algorithms often
work well in practice. One canonical PPO method uses the modified
objective

    φ_{k+1} ← arg max_φ Ĵ(φ) − λ E_{x∼ρ^∞_{φ_k}}[ KL(π_{φ_k}(· | x) ∥ π_φ(· | x)) ]    (12.63)

with some λ > 0, which regularizes towards the trust region. An-
other common variant of PPO is based on controlling the importance
weights directly rather than regularizing by the KL-divergence. PPO
is used, for example, to train large-scale language models such as GPT
(Stiennon et al., 2020; OpenAI, 2023) which we will discuss in more
detail in Section 12.7. There we will also see that the objective from
Equation (12.63) can be cast as performing probabilistic inference.
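To illustrate how the regularized objective of Equation (12.63) is typically estimated from a batch of data collected under π_{φ_k}, here is a minimal sketch; it assumes precomputed log-probabilities and advantage estimates, and note that many PPO implementations instead clip the importance weights directly (this is only a sketch, not the implementation of any particular library):

import torch

def ppo_kl_surrogate(logp_new, logp_old, advantages, lam=0.1):
    """Estimate of -(J_hat(phi) - lam * KL), to be minimized, cf. (12.63).

    logp_new:   log pi_phi(a|x) for sampled (x, a), differentiable in phi
    logp_old:   log pi_phi_k(a|x), fixed (no gradient)
    advantages: estimates of A^{pi_phi_k}(x, a)
    """
    log_w = logp_new - logp_old.detach()
    w = torch.exp(log_w)                     # importance weights, cf. (12.62)
    surrogate = (w * advantages).mean()      # estimate of J_hat(phi)
    kl = (w - 1.0 - log_w).mean()            # KL estimate from Remark 12.12
    return -(surrogate - lam * kl)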

Improving computational efficiency: A practical problem with the above


methods is that the estimation of the advantage function A( x, a; θ)
requires training a separate critic, next to the policy (i.e., the actor)
parameterized by φ. This can be computationally expensive. In par-
ticular, when both models are large neural networks (think multiple
billions of parameters each), training both models is computationally

prohibitive. Now, recall that we introduced critics in the first place


to reduce the variance of the policy gradient estimates.16 Group relative policy optimization (GRPO) replaces the critic in PPO with simple Monte Carlo estimates of the advantage function (Shao et al., 2024):

    Ĵ(φ) ≐ E_{{τ^{(i)}}_{i=1}^m ∼ Π_{φ_k}(·|x)}[ (1/m) ∑_{i=1}^m ∑_{t=1}^T w_k(φ; x_t^{(i)}, a_t^{(i)}) Â_{t,i}^{π_{φ_k}} ],    (12.64)

    where  Â_{t,i}^{π_{φ_k}} ≐ ( g_{t:T}^{(i)} − mean({τ^{(i)}}) ) / std({τ^{(i)}})

16 That is, we moved from the REINFORCE policy update (12.41) to the actor-critic policy update (12.57).

estimates the advantage of action a_t^{(i)} at time t by comparing to the mean reward and normalizing by the standard deviation of rewards from all trajectories τ^{(i)}.17 GRPO combines Monte Carlo sampling and baselines for variance reduction with the trust-region optimization of PPO, leading to a method that is more sample efficient than naive REINFORCE while being computationally more efficient than PPO.

17 In Equation (12.36), we have already seen that using a Monte Carlo estimate of returns is a simple approach to reduce variance without needing to learn a critic.
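A minimal sketch of the group-relative advantage computation for outcome rewards (one scalar reward per response sampled for the same prompt); the function name and interface are illustrative:

import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Normalize the rewards of m responses sampled for the same prompt.

    group_rewards: array of shape (m,) with one outcome reward per response.
    Returns advantage estimates relative to the group mean, cf. (12.64).
    """
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)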

12.5 Off-policy Actor-Critics

In many applications, sample efficiency is crucial. Either because re-


quiring too many interactions is computationally prohibitive or be-
cause obtaining sufficiently many samples for learning a near-optimal
policy is simply impossible. We therefore now want to look at a sep-
arate family of actor-critic methods, which are off-policy, and hence,
allow for the reuse of past data. These algorithms use the reparameter-
ization gradient estimates, which we encountered before in the context
of variational inference,18 instead of score gradient estimators. 18
see Section 5.5.1

The on-policy methods that we discussed in the previous section can


be understood as performing a variant of policy iteration, where we
use an estimate of the state-action value function of the current policy
and then try to improve that policy by acting greedily with respect
to this estimate. They mostly vary in how improving the policy is
traded with improving the estimate of its value. Fundamentally, these
methods rely on policy evaluation.19 19
Policy evaluation is at the core of pol-
icy iteration. See Algorithm 10.14 for
the definition of policy iteration and Sec-
The techniques that we will introduce in this section are much more
tion 10.2 for a summary of policy evalu-
closely related to value iteration, essentially making use of Bellman’s ation in the context of Markov decision
optimality principle to learn the optimal value function directly which processes.

characterizes the optimal policy.

To begin with, let us assume that the policy π is deterministic. We


will later lift this restriction in Section 12.5.1. Recall that our initial
motivation to consider policy gradient methods and then actor-critic

methods was the intractability of the DQN loss

    ℓ_DQN(θ; D) = 1/2 ∑_{(x,a,r,x′)∈D} ( r + γ max_{a′∈A} Q⋆(x′, a′; θ^old) − Q⋆(x, a; θ) )²    see Equation (12.16)

when the action space A is large. What if we simply replace the exact maximum over actions by a parameterized policy?

    ℓ_DQN(θ; D) ≈ 1/2 ∑_{(x,a,r,x′)∈D} ( r + γ Q⋆(x′, π_φ(x′); θ^old) − Q⋆(x, a; θ) )².    (12.65)

We want to train our parameterized policy to learn the maximization


over actions, that is, to approximate the greedy policy20

    π_φ(x) ≈ π⋆(x) = arg max_{a∈A} Q⋆(x, a; θ).    (12.66)

20 Here, we already apply the improvement of DDQN to use the most-recent estimate of the Q-function for action selection (see Equation (12.17)).

The key idea is that if we use a “rich-enough” parameterization of


policies, selecting the greedy policy with respect to Q⋆ is equivalent to

    φ⋆ = arg max_φ E_{x∼µ}[ Q⋆(x, π_φ(x); θ) ]    (12.67)

where µ(x) > 0 is an exploration distribution over states with full support.21 We refer to this expectation by

    Ĵ_µ(φ; θ) ≐ E_{x∼µ}[ Q⋆(x, π_φ(x); θ) ].    (12.68)

21 We require full support to ensure that all states are explored.

Commonly, the exploration distribution µ is taken to be the distribution that samples states uniformly from a replay buffer. Note that we can easily obtain unbiased gradient estimates of Ĵ_µ with respect to φ:

    ∇_φ Ĵ_µ(φ; θ) = E_{x∼µ}[ ∇_φ Q⋆(x, π_φ(x); θ) ].    (12.69)    see Appendix A.1.5

Analogously to on-policy actor-critics (see Equation (12.56)), we use a


bootstrapping estimate of Q⋆ . That is, we neglect the dependence of
the critic Q⋆ on the actor πφ, and in particular, the policy parameters
φ. We have seen that bootstrapping works with Q-learning, so there is
reason to hope that it will work in this context too. This then allows
us to use the chain rule to compute the gradient,

    ∇_φ Q⋆(x, π_φ(x); θ) = D_φ π_φ(x) · ∇_a Q⋆(x, a; θ)|_{a=π_φ(x)}.    (12.70)

This corresponds to evaluating the bootstrapping estimate of the Q-


function at πφ( x) and obtaining a gradient estimate of the policy es-
timate (e.g., through automatic differentiation). Note that as πφ is
vector-valued, Dφπφ( x) is the Jacobian of πφ evaluated at x.

Exploration: Now that we have estimates of the gradient of our op-


timization target b Jµ , it is natural to ask how we should select actions
(based on πφ) to trade exploration and exploitation. As we have seen,
policy gradient techniques rely on the randomness in the policy to ex-
plore, but here we consider deterministic policies. As our method is
off-policy, a simple idea in continuous action spaces is to add Gaus-
sian noise to the action selected by πφ — also known as Gaussian noise
“dithering”.22 This corresponds to an algorithm called deep deterministic policy gradients (Lillicrap et al., 2016) shown in Algorithm 12.13.

22 Intuitively, this adds “additional randomness” to the policy π_φ.

This algorithm is essentially equivalent to Q-learning with function


approximation (e.g., DQN),23 with the only exception that we replace 23
see Equation (12.16)
the maximization over actions with the learned policy πφ.

Algorithm 12.13: Deep deterministic policy gradients, DDPG
1  initialize φ, θ, a (possibly non-empty) replay buffer D = ∅
2  set φ^old = φ and θ^old = θ
3  for t = 0 to ∞ do
4      observe state x, pick action a = π_φ(x) + ε for ε ∼ N(0, λI)
5      execute a, observe r and x′
6      add (x, a, r, x′) to the replay buffer D
7      if collected “enough” data then
           // policy improvement step
8          for some iterations do
9              sample a mini-batch B of D
10             for each transition in B, compute the label y = r + γ Q⋆(x′, π(x′; φ^old); θ^old)
               // critic update
11             θ ← θ − η ∇_θ (1/|B|) ∑_{(x,a,r,x′,y)∈B} (y − Q⋆(x, a; θ))²
               // actor update
12             φ ← φ + η ∇_φ (1/|B|) ∑_{(x,a,r,x′,y)∈B} Q⋆(x, π(x; φ); θ)
13             θ^old ← (1 − ρ) θ^old + ρ θ
14             φ^old ← (1 − ρ) φ^old + ρ φ

Twin delayed DDPG (TD3) is an extension of DDPG that uses two sepa-
rate critic networks for predicting the maximum action and evaluating
the policy (Fujimoto et al., 2018). This addresses the maximization
bias akin to Double-DQN. TD3 also applies delayed updates to the
actor network, which increases stability.
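As a concrete illustration of lines 10-12 of Algorithm 12.13, here is a minimal PyTorch sketch of one DDPG critic and actor update on a mini-batch, including the Polyak averaging of the target networks; the network objects, their interfaces, and the hyperparameters are illustrative assumptions, not part of the original algorithm statement.

import torch

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.99, rho=0.005):
    x, a, r, x_next = batch   # tensors of shapes (B, dx), (B, da), (B,), (B, dx)

    # Critic update: regress Q(x, a) onto the bootstrapped label y (lines 10-11).
    with torch.no_grad():
        y = r + gamma * critic_targ(x_next, actor_targ(x_next)).squeeze(-1)
    critic_loss = ((y - critic(x, a).squeeze(-1)) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend Q(x, pi_phi(x)) through the critic (line 12);
    # gradients accumulated in the critic here are simply not applied.
    actor_loss = -critic(x, actor(x)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Polyak averaging of the target networks (lines 13-14).
    with torch.no_grad():
        for p, p_targ in zip(critic.parameters(), critic_targ.parameters()):
            p_targ.mul_(1 - rho).add_(rho * p)
        for p, p_targ in zip(actor.parameters(), actor_targ.parameters()):
            p_targ.mul_(1 - rho).add_(rho * p)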

12.5.1 Randomized Policies


We have seen that randomized policies naturally encourage explo-
ration. With deterministic actor-critic methods like DDPG, we had
to inject Gaussian noise to enforce sufficient exploration. A natural
question is therefore whether we can also handle randomized policies
in this framework of off-policy actor-critics.

The key idea is to replace the squared loss of the critic,

    ℓ_DQN(θ; D) ≈ 1/2 ( r + γ Q⋆(x′, π(x′; φ); θ^old) − Q⋆(x, a; θ) )²,    refer to the squared loss of Q-learning (12.11)

which only considers the fixed action π(x′; φ), with an expected squared loss,

    ℓ_DQN(θ; D) ≈ E_{a′∼π(x′;φ)}[ 1/2 ( r + γ Q⋆(x′, a′; θ^old) − Q⋆(x, a; θ) )² ],    (12.71)

which considers a distribution over actions.

It turns out that we can still compute gradients of this expectation:

    ∇_θ E_{a′∼π(x′;φ)}[ 1/2 ( r + γ Q⋆(x′, a′; θ^old) − Q⋆(x, a; θ) )² ]
    = E_{a′∼π(x′;φ)}[ ∇_θ 1/2 ( r + γ Q⋆(x′, a′; θ^old) − Q⋆(x, a; θ) )² ].    see Appendix A.1.5

Similarly to our definition of the Bellman error (12.12), we define by

    δ_B(a′) ≐ r + γ Q⋆(x′, a′; θ^old) − Q⋆(x, a; θ),    (12.72)

the Bellman error for a fixed action a′. Using the chain rule, we obtain

    = E_{a′∼π(x′;φ)}[ δ_B(a′) ∇_θ Q⋆(x, a; θ) ].    (12.73)

Note that this is identical to the gradient in DQN (12.14), except that
now we have an expectation over actions. As we have done many times
already, we can use automatic differentiation to obtain gradient esti-
mates of ∇θ Q⋆ ( x, a; θ). This provides us with a method of obtaining
unbiased gradient estimates for the critic.

We also need to reconsider the actor update. When using a random-


ized policy, the objective function changes to

    Ĵ_µ(φ; θ) ≐ E_{x∼µ} E_{a∼π(x;φ)}[ Q⋆(x, a; θ) ],

of which we can obtain gradients via

    ∇_φ Ĵ_µ(φ; θ) = E_{x∼µ} ∇_φ E_{a∼π(x;φ)}[ Q⋆(x, a; θ) ].    (12.74)    see Appendix A.1.5

Note that the inner expectation is with respect to a measure that de-
pends on the parameters φ, which we are trying to optimize. We there-
fore cannot move the gradient operator inside the expectation. This is
a problem that we have already encountered several times. In the
previous section on policy gradients, we used the score gradient esti-
mator.24 Earlier, in Chapter 5 on variational inference we have already 24
see Equation (12.32)
seen reparameterization gradients.25 Here, if our policy is reparame- 25
see (5.63)
terizable, we can use the reparameterization trick from Theorem 5.19!

Example 12.14: Reparameterization gradients for Gaussians


Suppose we use a Gaussian parameterization of policies,

    π(x; φ) ≐ N(µ(x; φ), Σ(x; φ)).

Then, using conditional linear Gaussians, our action a is given by

    a = g(ε; x, φ) ≐ Σ^{1/2}(x; φ) ε + µ(x; φ),    ε ∼ N(0, I)    (12.75)

where Σ^{1/2}(x; φ) is the square root of Σ(x; φ). This coincides with our earlier application of the reparameterization trick to Gaussians in Example 5.20.

As we have seen, not only Gaussians are reparameterizable. In general,


we called a distribution (in this context, a policy) reparameterizable iff
a ∼ π ( x; φ) is such that a = g (ε; x, φ), where ε ∼ ϕ is an independent
random variable.

Then, we have,

    ∇_φ E_{a∼π(x;φ)}[ Q⋆(x, a; θ) ]
    = E_{ε∼ϕ}[ ∇_φ Q⋆(x, g(ε; x, φ); θ) ]    (12.76)    using the reparameterization trick (5.63)
    = E_{ε∼ϕ}[ ∇_a Q⋆(x, a; θ)|_{a=g(ε;x,φ)} · D_φ g(ε; x, φ) ].    (12.77)    using the chain rule analogously to Equation (12.70)

In this way, we can obtain unbiased gradient estimates for reparame-


terizable policies. This general technique does not only apply to con-
tinuous action spaces. For discrete action spaces, there is the analogous
so-called Gumbel-max trick, which we will not discuss in greater detail
here.

The algorithm that uses Equation (12.73) to obtain gradients for the
critic and reparameterization gradients for the actor is called stochastic
value gradients (SVG) (Heess et al., 2015).
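To illustrate the reparameterization gradient of Equations (12.76) and (12.77) for a Gaussian policy, here is a minimal PyTorch sketch; automatic differentiation applies the chain rule through the sampled action, and the module interfaces and names are illustrative assumptions.

import torch

def reparam_actor_loss(actor, critic, x):
    """Reparameterized actor objective -E_{a ~ pi(x; phi)}[Q(x, a)], cf. (12.76).

    `actor(x)` is assumed to return the mean and log-std of a Gaussian policy;
    rsample() draws a = mu + sigma * eps with eps ~ N(0, I), so gradients
    flow through the action into the policy parameters, cf. (12.75).
    """
    mu, log_sigma = actor(x)
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    a = dist.rsample()              # reparameterized sample
    return -critic(x, a).mean()     # minimize the negative objective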

12.6 Maximum Entropy Reinforcement Learning

In practice, algorithms like SVG often do not explore enough. A key


issue with relying on randomized policies for exploration is that they
might collapse to deterministic policies. That is, the algorithm might
quickly reach a local optimum, where all mass is placed on a single
action.

A simple trick that encourages a little bit of extra exploration is to


regularize the randomized policies “away” from putting all mass on
a single action. In other words, we want to encourage the policies
to exhibit some uncertainty. A natural measure of uncertainty is en-
tropy, which we have already seen several times.26 This approach is 26
see Section 5.4
known as entropy regularization or maximum entropy reinforcement learn-
ing (MERL). Canonically, entropy regularization is applied to finite-
horizon rewards (cf. Remark 10.5), yielding the optimization problem
of maximizing

    J_λ(φ) ≐ J(φ) + λ H[Π_φ]    (12.78)
           = ∑_{t=1}^T E_{(x_t,a_t)∼Π_φ}[ r(x_t, a_t) + λ H[π_φ(· | x_t)] ],    (12.79)

where we have a preference for entropy in the actor distribution to


encourage exploration which is regulated by the temperature param-
eter λ. As λ → 0, we recover the “standard” reinforcement learning
objective (here for finite-horizon rewards):

    J(φ) = ∑_{t=1}^T E_{(x_t,a_t)∼Π_φ}[ r(x_t, a_t) ].    (12.80)

Here, for notational convenience, we begin the sum with t = 1 rather


than t = 0.

12.6.1 Entropy Regularization as Probabilistic Inference


The entropy-regularized objective from Equation (12.79) leads us to a remarkable interpretation of reinforcement learning and, more generally, decision-making under uncertainty as solving an inference problem akin to variational inference. The framing of “control as inference” will lead us to contemporary algorithms for reinforcement learning as well as paint a path for decision-making under uncertainty beyond stationary MDPs.

Figure 12.3: Directed graphical model of the underlying hidden Markov model with hidden states X_t, optimality variables O_t, and actions A_t.

Let us denote by Π⋆ the distribution over trajectories τ under the optimal policy π⋆. By framing the problem of optimal control as an inference problem in a hidden Markov model with hidden “optimality variables” O_t ∈ {0, 1} indicating whether the played action a_t was

optimal we can derive Π⋆ analytically. That is to say, when Ot = 1 and


Ot+1:T ≡ 1 the policy from time t onwards was optimal. To simplify
the notation, we will denote the event Ot = 1 by Ot .

We consider the HMM defined by the Gibbs distribution

    p(O_t | x_t, a_t) ∝ exp( (1/λ) r(x_t, a_t) ),    with λ > 0    (12.81)
which is a natural choice as we have seen in Problem 6.7 that the Gibbs
distribution maximizes entropy subject to E[Ot · r ( xt , at ) | xt , at ] < ∞.

The distribution over trajectories conditioned on optimality of actions (i.e.,


conditioned on O_{1:T}) is given by

    Π⋆(τ) ≐ p(τ | O_{1:T}) = p(x_1) ∏_{t=1}^T p(a_t | x_t, O_t) p(x_{t+1} | x_t, a_t).    (12.82)    Using Equation (12.30). We assume here that the dynamics and initial state distribution are “fixed”, that is, we assume p(x_1 | O_{1:T}) = p(x_1) and p(x_{t+1} | x_t, a_t, O_{1:T}) = p(x_{t+1} | x_t, a_t).

It remains to determine p(a_t | x_t, O_t) which corresponds to the optimal policy π⋆(a_t | x_t). It is generally useful to think of the situation where the prior policy p(a_t | x_t) is uniform on A,27 in which case by Bayes’ rule (1.45), p(a_t | x_t, O_t) ∝ p(O_t | x_t, a_t), so

    Π⋆(τ) ∝ p(x_1) [ ∏_{t=1}^T p(x_{t+1} | x_t, a_t) ] exp( (1/λ) ∑_{t=1}^T r(x_t, a_t) ).    (12.83)

27 This is not a restriction as any informative prior can be pushed into Equation (12.81).

Recall that our fundamental goal is to approximate Π⋆ with a distribution over trajectories Π_φ under the parameterized policy π_φ. It is therefore a natural idea to minimize KL(Π_φ ∥ Π⋆):28

    arg min_φ KL(Π_φ ∥ Π⋆)
    = arg min_φ H[Π_φ ∥ Π⋆] − H[Π_φ]    using the definition of KL-divergence (5.34)
    = arg max_φ E_{τ∼Π_φ}[ log Π⋆(τ) − log Π_φ(τ) ]    using the definition of cross-entropy (5.32) and entropy (5.27)
    = arg max_φ E_{τ∼Π_φ}[ ∑_{t=1}^T r(x_t, a_t) − λ log π_φ(a_t | x_t) ]    using Equations (12.30) and (12.83) and simplifying
    = arg max_φ ∑_{t=1}^T E_{(x_t,a_t)∼Π_φ}[ r(x_t, a_t) + λ H[π_φ(· | x_t)] ].    (12.84)    using the definition of entropy (5.27) and linearity of expectation (1.20)

28 Observe that we cannot easily minimize forward-KL as we cannot sample from Π⋆. In the context of RL, it can be argued that the mode-seeking behavior of reverse-KL is preferable over the moment-matching behavior of forward-KL (Levine, 2018).

That is, entropy regularization is equivalent to minimizing the KL-divergence from Π⋆ to Π_φ. This highlights a very natural tradeoff between exploration and exploitation, wherein H[Π_φ ∥ Π⋆] encourages exploitation and H[Π_φ] encourages exploration.


 

It can be shown that a “softmax” version of the Bellman optimality equation (10.34) can be obtained for Equation (12.84) (? Problem 12.7):

    q⋆(x, a) = (1/λ) r(x, a) + E_{x′∼x,a}[ log ∫_A exp( q⋆(x′, a′) ) da′ ]    (12.85)

with the convention that q⋆(x_T, a) = 0 for all a.29 Here, q⋆ is called a soft value function. As we will see in Problem 12.7, the optimal policy has the form π⋆(a | x) ∝ exp(q⋆(x, a)), that is, it simply corresponds to performing softmax exploration (11.9) with the soft value function. The second term of Equation (12.85) quantifies downstream rewards. In comparison to the “standard” Bellman optimality equation (10.34), the soft value function is less greedy which tends to encourage robustness.

29 Note that Equation (10.34) was derived in the infinite horizon setting with discounted rewards, whereas here we study the finite horizon setting. Equation (12.85) is the natural extension of the standard Bellman optimality equation in the finite horizon setting where the downstream rewards are measured by a softmax rather than the greedy policy.

Analogously to Q-learning, the soft value function q⋆ can be approx-


imated via a bootstrapped “critic” Q⋆ which is called soft Q-learning
(Levine, 2018). Note that computing the optimal policy requires com-
puting an integral over the action space, which is typically intractable
for continuous action spaces. As discussed in Sections 12.3 to 12.5
and analogously to actor-critic methods such as DDPG and SVG, we
can learn a parameterized policy (i.e., an “actor”) πφ to approximate
the optimal policy π ⋆ . The resulting algorithm, soft actor critic (SAC)
(Haarnoja et al., 2018a,b), is widely used. Due to its off-policy nature,
it is also relatively sample efficient.
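To make the contrast between the greedy and the “softmax” aggregation of Q-values over actions concrete, here is a tiny illustrative sketch for a finite action space (the setup and names are not from the original text):

import numpy as np

def hard_and_soft_value(q_row):
    """Compare greedy and 'softmax' aggregation of Q-values at a fixed state.

    q_row: array (n_actions,) of state-action values. The soft value
    log sum_a exp(q) upper-bounds the hard value max_a q and approaches it
    as one entry dominates; it is the less greedy aggregation used in (12.85).
    """
    hard = np.max(q_row)
    soft = np.log(np.sum(np.exp(q_row)))
    return hard, soft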

Figure 12.4: Comparison of training curves of a selection of on-policy and off-policy policy gradient methods. Reproduced with permission from “Soft actor-critic algorithms and applications” (Haarnoja et al., 2018b).

We can also express this optimization in terms of an evidence lower

bound. The evidence lower bound for the observations O_{1:T} is

    L(Π_φ, Π⋆; O_{1:T}) = E_{τ∼Π_φ}[ log p(O_{1:T} | τ) + log Π(τ) − log Π_φ(τ) ]    (12.86)    using the definition of the ELBO (5.55b)

where Π denotes the distribution over trajectories (12.30) under the prior policy p(a_t | x_t). This is commonly written as the variational free energy

    −L(Π_φ, Π⋆; O_{1:T}) = E_{τ∼Π_φ}[ S[p(O_{1:T} | τ)] ] + KL(Π_φ ∥ Π)    (12.87)
                         = S[p(O_{1:T})] + KL(Π_φ ∥ Π⋆),    (12.88)    using that Π⋆(τ) = p(τ | O_{1:T})

where the first term of (12.88) is the “extrinsic” value and the second the “epistemic” value,

which we already encountered in Section 5.5.2 in the context of vari-


ational inference. Here, the “extrinsic” value is independent of the
variational distribution Πφ and can be thought of as a fixed “problem
cost”, whereas the “epistemic” value can be interpreted as the approx-
imation error or “solution cost”. To summarize, we have seen that

    arg max_φ J_λ(φ) = arg min_φ KL(Π_φ ∥ Π⋆) = arg max_φ L(Π_φ, Π⋆; O_{1:T}).

Recall that the free energy −L(Π_φ, Π⋆; O_{1:T}) is a variational upper bound to the surprise about observations S[Π⋆(O_{1:T})] when following an optimal policy.30 For example, undesirable states incur a low reward while desirable states yield a high reward, and thus, if we expect optimality of actions (i.e., O_{1:T}), paths leading to such states have high and low surprise, respectively.31

30 see Section 5.5.2
31 This is because a path leading to a low-reward state will include sub-optimal actions.

An agent which acts to minimize free energy with O_{1:T} can be thought
of as hallucinating to perform optimally, and acting to minimize the
surprise about having played suboptimal actions. Think, for example,
about the robotics task of moving an arm to a new position. Intuitively,
minimizing free energy solves this task by “hallucinating that the arm
is at the goal position”, and then minimizing the surprise with respect
to this perturbed world model. In this way, MERL can be understood
as identifying paths of least surprise akin to the free energy principle.

Remark 12.15: Towards active inference


Maximum entropy reinforcement learning makes one fundamental assumption, namely, that the “biased” distribution about observations specifying the underlying HMM is

    p(O_t | x_t, a_t) ∝ exp( (1/λ) r(x_t, a_t) ).

This assumption is key to the framing of optimal control (in an



unknown MDP) with a certain reward function as an inference


problem. One could conceive other HMMs. For example, Fellows
et al. (2019) propose an HMM defined in terms of the current
value function of the agent:

    p(O_t | x_t, a_t) ∝ exp( (1/λ) Q(x_t, a_t; θ) ).

Crucially, the choice of p(Ot | xt , at ) is the only place where the


reward enters the inference problem, and one can conceive of set-
tings where a “stationary” (i.e., time-independent) reward is not
assumed to exist. The general approach to decision-making as
probabilistic inference presented in this section (but for possibly
reward-independent and non-stationary HMMs) is known as ac-
tive inference (Friston et al., 2015; Millidge et al., 2020, 2021; Parr
et al., 2022).

12.7 Learning from Preferences

So far, we have been assuming that the agent is presented with a reward signal after every played action. This is a natural assumption in domains such as games and robotics — even though it often requires substantial “reward engineering” to break down complex tasks with sparse rewards to more manageable tasks (cf. Section 13.3). In many other domains such as an agent learning to drive a car or a chatbot, it is unclear how one can even quantify the reward associated with an action or a sequence of actions. For example, in the context of autonomous driving it is typically desired that agents behave “human-like” even though a different driving behavior may also reach the destination safely.

Figure 12.5: We generalize the perspective of reinforcement learning from Figure 11.1 by allowing the feedback to come from either the environment or an evaluation by other agents (e.g., humans), and by allowing the feedback to come in other forms than a numerical reward.

The task of “aligning” the behavior of an agent to human expectations is difficult in complex domains such as the physical world and language, yet crucial for their practical use. To this end, one can conceive
of alternative ways for presenting “feedback” to the agent:
• The classical feedback in reinforcement learning is a numerical score.
Consider, for example, a recommender system for movies. The feed-
back is obtained after a movie was recommended to a user by a
user-rating on a given scale (often 1 to 10). This rating is infor-
mative as it corresponds to an absolute value assessment, allowing to
place the recommendation in a complete ranking of all previous
recommendations. However, numerical feedback of this type can
be error-prone as it is scale-dependent (different users may ascribe
different value to a recommendation rated a 7).

• An alternative feedback mechanism is comparison-based. The user


is presented with k alternative actions and selects their preferred ac-
tion (or alternatively returns a ranking of actions). This feedback is
typically easy to obtain as humans are fast in making “this-or-that”
decisions. However, in contrast to numerical rewards, the feedback
provides information only on the user’s relative preferences. That is,
such preference feedback encodes fewer bits of information than
score feedback, and it therefore often takes longer to learn complex
behavior from this feedback.

Remark 12.16: Context-dependent feedback


We neglect here that feedback is often context-dependent (Lindner
and El-Assady, 2022; Casper et al., 2023). For example, if someone
is asked whether they prefer “ice cream” over “pizza” the answer
may depend on whether they are hungry and the weather.

12.7.1 Language Models as Agents


In the following, we discuss approaches to learning from preference
feedback in the context of autoregressive32 large language models and 32
An autoregressive model predicts/-
chatbots. A chatbot is an agent (often based on a transformer architec- generates the next “token” as a function
of previous tokens.
ture, Vaswani et al. (2017)) parameterized by φ that given a prompt x re-
turns a (stochastic) response y. The autoregressive generation of the re-
sponse can be understood in terms of a policy πφ(yt+1 | x, y1:t ) which
generates the next token given the prompt and all previous tokens.33 33
In large language models, a “token”
We denote the policy over complete responses (i.e., the chatbot) by is usually taken to be a letter, word, or
something in-between. A special token
T −1 is used to terminate the response.
Πφ(y | x) = ∏ πφ(yt+1 | x, y1:t ). (12.89)
t =0

In RL jargon, the agents action corresponds to the choice of next to-


ken yt+1 and the deterministic dynamics add this token to the current
(incomplete) response y1:t . A full trajectory y consists sequentially of
all tokens y1:T comprising a response, and the prompt x can be inter-
preted as a context to this trajectory. Observe that Equation (12.89) is 34
The pre-trained language model is
derived from the general representation of the distribution over trajec- usually obtained by self-supervised
tories (12.30), noting that prior and dynamics are deterministic. training on a large corpus of text.
Self-supervised learning generates labeled
The standard pipeline for applying pre-trained34 large language mod- training data from an unlabeled data
source by selectively “masking-out”
els such as GPT (OpenAI, 2023) to downstream tasks consists of two parts of the data. When training an
main steps which are illustrated in Figure 12.6 (Stiennon et al., 2020): autoregressive language model, labeled
(1) supervised fine-tuning and (2) post-training using preference feed- training data can be obtained by repeat-
edly “masking-out” the next word in
back. a sentence, and training the language
model to predict this word. Such large
The first step is to fine-tune the language model with supervised learn- models that can be fine-tuned and “post-
ing on high-quality data for the downstream task of interest. For ex- trained” to various downstream tasks
are also called foundation models.

Figure 12.6: Illustration of the learning


step 1 step 2 pipeline of a large language model.
supervised fine-tuning post-training
Πφ

D=

A B

Πinit
A or B?

ample, when the goal is to build a chatbot, this data may consist of
desirable responses to some exemplary prompts. We will denote the
parameters of the fine-tuned language model by φinit , and its associ-
ated policy by Πinit .

The second step is then to “post-train” the language model Πinit from
the first step using human feedback. Here, it is important that Πinit is
already capable of producing sensible output (i.e., with correct spelling
and grammar). Learning this from scratch using only preference feed-
back would take far too long. Instead, the post-training step is used to
align the agent to the task and user preferences.

In each iteration of post-training, the model is prompted with prompts


x to produce pairs of answers (y A , yB ) ∼ Πφ(· | x). These answers are
presented to human labelers who express their preference for one of
the answers, denoted y A ≻ yB | x. A popular choice for modeling
preferences is the Bradley-Terry model which stipulates that the human
preference distribution is given by

    p(y_A ≻ y_B | x, r) = exp(r(y_A | x)) / ( exp(r(y_A | x)) + exp(r(y_B | x)) )    (12.90)

for some unknown latent reward model r(y | x) (Bradley and Terry, 1952). This can be written in terms of the logistic function σ (5.9):

    = σ( r(y_A | x) − r(y_B | x) ).    (12.91)    as seen in Problem 7.1 this is the Gibbs distribution with energy −r(y | x) in a binary classification problem

Remark 12.17: Outcome rewards


Note that the Bradley-Terry model attributes reward only to “complete” responses. We call such a reward an outcome reward.35 While the following discussion is on outcome rewards (which is most common in the context of language models), everything translates to individual per-step rewards.

35 Framing this in terms of individual “per-step” rewards which we have seen so far, this corresponds to a (sparse) reward which is zero until the final action.

The aggregated human feedback D = {y_A^{(i)} ≻ y_B^{(i)} | x^{(i)}}_{i=1}^n across n
different prompts is then used to update the language model πφ. In
the next two sections, we discuss two standard approaches to post-
training: reinforcement learning from human feedback (RLHF) and
direct preference optimization (DPO).

12.7.2 Reinforcement Learning from Human Feedback

RLHF separates the post-training step into two stages (Stiennon et al.,
2020). First, the human feedback is used to learn an approximate re-
ward model rθ. This reward model is then used in the second stage to
determine a refined policy Πφ.

Figure 12.7: Illustration of the post-training process of RLHF. In stage 1, a reward model r_θ is learned from preference comparisons (“A or B?”) of responses generated by Π_init; in stage 2, an optimal policy is learned: a prompt x and a response y ∼ Π_φ(· | x) are sampled, the reward r_θ(y | x) is computed, and the policy is updated.

Learning a reward model: During the first stage, the initial policy ob-
tained by supervised fine-tuning Πinit is used to generate propos-
als (y A , yB ) for some exemplary prompts x, which are then ranked
according to the preference of human labelers. This preference data
can be used to learn a reward model rθ by maximum likelihood esti-

mation (or equivalently minimizing cross-entropy):

    arg max_θ p(D | r_θ) = arg max_θ E_{(y_A ≻ y_B | x)∼D}[ log σ( r_θ(y_A | x) − r_θ(y_B | x) ) ].    (12.92)    using Equation (12.91) and SGD

This is analogous to the standard maximum likelihood estimation of


the reward model in model-based RL with score feedback which we
discussed in Section 11.2.1. The reward model rθ is often initialized
from the initial policy πinit by placing a linear layer producing a scalar
output on top of the final transformer layer.
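A minimal sketch of this maximum likelihood objective as a loss over a batch of preference pairs (PyTorch; the reward model interface and names are illustrative assumptions):

import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, x, y_preferred, y_dispreferred):
    """Negative log-likelihood of the Bradley-Terry model, cf. (12.90)-(12.92)."""
    r_a = reward_model(x, y_preferred)      # shape (batch,)
    r_b = reward_model(x, y_dispreferred)   # shape (batch,)
    # -log sigma(r_A - r_B), averaged over the batch
    return -F.logsigmoid(r_a - r_b).mean()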

Learning an optimal policy: One can now employ the methods from this and previous chapters to determine the optimal policy for the approximate reward r_θ. Due to the use of an approximate reward, however, simply maximizing r_θ surfaces the so-called “reward gaming” problem which is illustrated in Figure 12.8. As the responses generated by the learned policy π_φ deviate from the distribution of D (i.e., the distribution induced by the initial policy π_init), the approximate reward model becomes inaccurate. The approximate reward may severely overestimate the true reward in regions far away from the training data. A common approach to address this problem is to regularize the policy search towards policies whose distribution does not stray away “too far” from the training distribution.

Figure 12.8: Illustration of “reward gaming”. Shown in black is the true reward r(y | x) for a fixed prompt x. Shown in blue is the approximation based on the feedback D to the responses shown in red. The yellow region symbolizes responses y where the approximate reward can still be “trusted”.

Analogously to the trust-region methods we discussed in Section 12.4.4, the deviation from the initial policy is typically controlled by maximizing the regularized objective

    J_λ(φ; φ_init | x) ≐ E_{y∼Π_φ(·|x)}[ r(y | x) ] − λ KL(Π_φ(· | x) ∥ Π_init(· | x))

in expectation over prompts x sampled uniformly at random from the dataset D. Note that this coincides with the PPO objective from Equation (12.63) with an outcome reward. We can expand the regularization term to obtain

    = E_{y∼Π_φ(·|x)}[ r(y | x) ] + λ H[Π_φ(· | x)] − λ H[Π_φ(· | x) ∥ Π_init(· | x)],    (12.93)    using the definition of KL-divergence (5.34); the first two terms form the entropy-regularized RL objective

which indicates an intimate relationship to entropy-regularized RL.36

36 compare to Equation (12.79)

The optimal policy maximizing J_λ(φ; φ_init | x) is (? Problem 12.8)

    Π⋆(y | x) ∝ Π_init(y | x) exp( (1/λ) r(y | x) )    (12.94)

and can be interpreted as a probabilistic update to the prior Πinit


where exp((1/λ) r(y | x)) corresponds to the “likelihood of optimality”.
As λ → ∞, Π⋆ → Πinit , and as λ → 0, the optimal policy reduces to
deterministically picking the response with the highest reward.

In practice, sampling from Π⋆ explicitly is intractable, but note that


any of the approximate inference methods discussed in Part I are ap-
plicable here. In the context of chatbots, it is important that the sam-
pling from the resulting policy is efficient (i.e., the chatbot should re-
spond quickly to prompts). Most commonly, the optimization problem
of maximizing J_λ(φ; φ_init | x) (with the estimated reward r_θ) is solved
approximately within a parameterized family of policies. This is typi-
cally done using policy gradient methods such as PPO (Stiennon et al.,
2020) or GRPO (Guo et al., 2025).

Remark 12.18: Other (non-preference) reward models


Note that the post-training pipeline we described here is agnostic
to the choice of reward model. That is, instead of post-training
our language model to comply with human preferences, we could
have also post-trained it to maximize any other reward signal.
More recently, works have explored the use of reward models for
challenging “reasoning problems” such as mathematical calcula-
tions, where the reward is usually based on the correctness of the
answer.37 One prominent example for this are “reasoning” mod- 37
On training data where the answer to
els such as the DeepSeek-R1 model (Guo et al., 2025), which was problem x is known to be y⋆ ( x), the re-
ward is simply r (y | x) = 1{y⋆ ( x) ∈ y}.
trained with GRPO. That is, the reward is 1 if the response y
contains the correct answer and 0 other-
wise.
12.7.3 Direct Preference Optimization
Observe that the reward model can be expressed in terms of its associated optimal policy:

    r(y | x) = λ log( Π⋆(y | x) / Π_init(y | x) ) + const.    (12.95)    by reorganizing the terms of Equation (12.94)
In particular, it follows that given a fixed prior Πinit , any policy Πφ
has a family of associated reward models with respect to which it is
optimal! We denote by

    r_[φ](y | x) ≐ λ log( Π_φ(y | x) / Π_init(y | x) )    (12.96)
the “simplest” of these reward models. Remembering the characteriza-
tion of the optimal policy from Equation (12.94) it follows immediately
that Πφ is optimal with respect to r[φ] .

Instead of first learning an approximate reward model and then find-


ing the associated optimal policy, DPO exploits the relationship of

Equation (12.95) to learn the optimal policy directly (Rafailov et al.,


2023). Substituting rθ in the maximum likelihood estimation from
Equation (12.92), yields the objective

    E_{(y_A ≻ y_B | x)∼D}[ log σ( r_[φ](y_A | x) − r_[φ](y_B | x) ) ]
    = E_{(y_A ≻ y_B | x)∼D}[ log σ( λ log (Π_φ(y_A | x) / Π_init(y_A | x)) − λ log (Π_φ(y_B | x) / Π_init(y_B | x)) ) ].    (12.97)    using Equation (12.95)

Gradients can be computed via automatic differentiation:

    λ E_{(y_A ≻ y_B | x)∼D}[ σ( r_[φ](y_B | x) − r_[φ](y_A | x) ) ( ∇_φ log Π_φ(y_A | x) − ∇_φ log Π_φ(y_B | x) ) ],    (12.98)

where σ(r_[φ](y_B | x) − r_[φ](y_A | x)) weights each example according to the error of the implicit reward estimate, ∇_φ log Π_φ(y_A | x) increases the likelihood of y_A, and −∇_φ log Π_φ(y_B | x) decreases the likelihood of y_B.

Intuitively, DPO successively increases the likelihood of preferred re-


sponses y A and decreases the likelihood of dispreferred responses yB .
Examples (y A ≻ yB | x) are weighted by the strength of regulariza-
tion λ and by the degree to which the implicit reward model incor-
rectly orders the responses.
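A minimal sketch of the resulting DPO loss, i.e., the negative of the objective in Equation (12.97), over a batch of preference pairs (PyTorch; the sequence log-probabilities are assumed to be precomputed as sums of token log-probabilities, and all names are illustrative):

import torch
import torch.nn.functional as F

def dpo_loss(logp_pi_a, logp_pi_b, logp_init_a, logp_init_b, lam=0.1):
    """-log sigma(lam * (log-ratio of y_A - log-ratio of y_B)), cf. (12.97).

    logp_pi_*:   log Pi_phi(y_* | x) under the trained policy (differentiable)
    logp_init_*: log Pi_init(y_* | x) under the frozen reference policy
    """
    ratio_a = logp_pi_a - logp_init_a   # implicit reward of preferred response / lam
    ratio_b = logp_pi_b - logp_init_b   # implicit reward of dispreferred response / lam
    return -F.logsigmoid(lam * (ratio_a - ratio_b)).mean()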

Discussion

In this chapter, we studied central ideas in actor-critic methods. We


have seen two main approaches to use policy-gradient methods. We
began, in Section 12.3, by introducing the REINFORCE algorithm which
uses policy gradients and Monte Carlo estimation, but suffered from
large variance in the gradient estimates of the policy value function. In
Section 12.4, we have then seen a number of actor-critic methods such
as A2C and GAE behaving similarly to policy iteration that exhibit less
variance, but are very sample inefficient due to their on-policy nature.
TRPO improves the sample efficiency slightly, but not fundamentally.

In Section 12.5, we discussed a second family of policy gradient tech-


niques that generalize Q-learning and are akin to value iteration. For
reparameterizable policies, this led us to algorithms such as DDPG,
TD3, SVG. Importantly, these algorithms are significantly more sample
efficient than on-policy policy gradient methods, which often results
in much faster learning of a near-optimal policy. In Section 12.6, we
discussed entropy regularization which frames reinforcement learning
as probabilistic inference, and we derived the SAC algorithm which is
widely used and works quite well in practice.

Finally, in Section 12.7, we studied two canonical approaches to learn-


ing from preference feedback: RLHF which separately learns reward
model and policy and DPO which learns a policy directly. We have
seen that RLHF is akin to model-based RL as it explicitly learns a
reward model through maximum likelihood estimation. In contrast,
DPO is more closely related to model-free RL and policy gradient
methods as it learns the optimal policy directly.

Optional Readings
• A3C: Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and
Kavukcuoglu (2016).
Asynchronous methods for deep reinforcement learning.
• GAE: Schulman, Moritz, Levine, Jordan, and Abbeel (2016).
High-dimensional continuous control using generalized advantage es-
timation.
• TRPO: Schulman, Levine, Abbeel, Jordan, and Moritz (2015).
Trust region policy optimization.
• PPO: Schulman, Wolski, Dhariwal, Radford, and Klimov (2017).
Proximal policy optimization algorithms.
• DDPG: Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, and
Wierstra (2016).
Continuous control with deep reinforcement learning.
• TD3: Fujimoto, Hoof, and Meger (2018).
Addressing function approximation error in actor-critic methods.
• SVG: Heess, Wayne, Silver, Lillicrap, Erez, and Tassa (2015).
Learning continuous control policies by stochastic value gradients.
• SAC: Haarnoja, Zhou, Abbeel, and Levine (2018a).
Soft actor-critic: Off-policy maximum entropy deep reinforcement
learning with a stochastic actor.
• DPO: Rafailov, Sharma, Mitchell, Ermon, Manning, and Finn
(2023).
Direct preference optimization: Your language model is secretly a re-
ward model.

Problems

12.1. Q-learning and function approximation.

Consider the MDP of Figure 12.9 and set γ = 1.


1. Using Bellman’s theorem, prove that v⋆ ( x ) = − | x − 4| is the opti-
mal value function.
2. Suppose we observe the following episode:

x a x′ r
3 −1 2 −1
2 1 3 −1
3 1 4 −1
4 1 4 0

We initialize all Q-values to 0. Compute the updated Q-values


using Q-learning with learning rate α = 1/2.
3. We will now approximate the Q-function with a linear function.
We let
.
Q( x, a; w) = xw0 + aw1 + w2

where w = [w0 w1 w2 ]⊤ ∈ R3 .
Suppose we have wold = [1 − 1 − 2]⊤ and w = [−1 1 1]⊤ ,
and we observe the transition τ = (2, −1, −1, 1). Use the learn-
ing rate α = 1/2 to compute ∇w ℓ(w; τ ) and the updated weights
w′ = w − α∇w ℓ(w; τ ).

(1,-1) (1,-1) (1,-1) (1,0) (1,-1) (1,-1) Figure 12.9: MDP studied in Prob-
lem 12.1. Each arrow marks a (deter-
ministic) transition and is labeled with
(-1,-1) 1 2 3 4 5 6 7 (1,-1)
(action, reward).

(-1,-1) (-1,-1) (-1,-1) (-1,-1) (-1,-1)


(-1,0)

12.2. Eligibility vector.

The vector ∇φ log πφ( at | xt ) is commonly called eligibility vector. In


the following, we assume that the action space A is finite and denote
it by A.

If we parameterize πφ as a softmax distribution

. exp(h( x, a, φ))
πφ( a | x) = (12.99)
∑b∈ A exp(h( x, b, φ))
.
with linear preferences h( x, a, φ) = φ⊤ ϕ( x, a) where ϕ( x, a) is some
feature vector, what is the form of the eligibility vector?

12.3. Variance of score gradients with baselines.

In this exercise, we will see a sufficient condition for baselines to re-


duce the variance of score gradient estimators.
1. Suppose for a random vector X, we want to estimate E[ f (X)] for
some function f . Assume that you are given a function g and also
its expectation E[ g(X)]. Instead of estimating E[ f (X)] directly, we
will instead estimate E[ f (X) − g(X)] as we know from linearity of
expectation (1.20) that

E[ f (X)] = E[ f (X) − g(X)] + E[ g(X)].

Prove that if 12 Var[ g(X)] ≤ Cov[ f (X), g(X)], then

Var[ f (X) − g(X)] ≤ Var[ f (X)]. (12.100)

2. Consider estimating ∇φ J (φ). Prove that if b2 ≤ 2b · r ( x, a) for every


state x ∈ X and action a ∈ A, then

Varτ ∼Πφ ( G0 − b)∇φ log Πφ(τ ) ≤ Varτ ∼Πφ G0 ∇φ log Πφ(τ ) .


   

(12.101)
12.4. Score gradients with state-dependent baselines.

For a sequence of state-dependent baselines {b(τ0:t−1 )}tT=1 where


.
τ0:t−1 = (τ0 , τ1 , . . . , τt−1 ),

show that
Eτ ∼Πφ G0 ∇φ log Πφ(τ )
 
" #
T −1 (12.102)
= Eτ ∼Πφ ∑ (G0 − b(τ0:t−1 ))∇φ log πφ(at | xt )
t =0

where we write b(τ0:−1 ) = 0.

12.5. Policy gradients with downstream returns.

Suppose we are training an agent to solve a computer game. There are


only two possible actions, specifically:
1. do nothing; and
2. move.
Each episode lasts for four (T = 3) time steps. The policy πθ is com-
pletely determined by the parameter θ ∈ [0, 1]. Here, for simplicity,
we have assumed that the policy is independent of the current state.
The probability of moving (action 2) is equal to θ and the probability
of doing nothing (action 1) is 1 − θ.

Initially, θ = 0.5. One episode is played with this initial policy and the
results are

actions = (1, 0, 1, 0), rewards = (1, 0, 1, 1).



Compute the policy gradient estimate with downstream returns, dis-


count factor γ = 1/2, and the provided single sample τ ∼ Πθ .

12.6. Policy gradient with an exponential family.


1. Suppose, we can choose between two actions a ∈ {0, 1} in each
state. A natural stochastic policy is induced by a Bernoulli distri-
bution,

a ∼ Bern(σ ( fφ( x))), (12.103)

where σ is the logistic function from Equation (5.9). First, write


down the expression for πφ( a | x). Then, derive the expression for
∇φ J (φ) in terms of qπφ , σ( fφ( x)), and ∇φ fφ( x) using the policy
gradient theorem.
2. The Bernoulli distribution is part of a family of distributions that
allows for a much simpler derivation of the gradient than was nec-
essary in (1). A univariate exponential family is a family of distribu-
tions whose PDF (or PMF) can be expressed in canonical form as

πφ( a | x) = h( a) exp( a fφ( x) − A( fφ( x))) (12.104)

where h, f , and A are known functions.38 Derive the expression of 38


Observe that this form is equivalent to
the policy gradient ∇φ J (φ) for such a distribution. the form introduced in Equation (5.48)
where fφ ( x) is the natural parameter,
3. Can you relate the results of the previous two exercises (1) and (2)? and we let A( f ) = log Z ( f ).
What are h and A in case of the Bernoulli distribution?
4. The Gaussian distribution with unit variance N ( fφ( x), 1) is of the
same canonical form with
fφ( x)2
A( fφ( x)) = . (12.105)
2
Determine the policy gradient ∇φ J (φ).
5. For a Gaussian policy, can we instead apply the reparameterization
trick (5.65) that we have seen in the context of variational inference?
If yes, how? If not, why?

12.7. Soft value function.

In this exercise, we derive the optimal policy solving Equation (12.84)


and the soft value function from Equation (12.85).
. . R
1. We let β( at | xt ) = exp( λ1 r ( xt , at )), Z ( x) = A β( a | x) da, and
denote by π̂ (· | x) the policy β(· | x)/Z ( x). Show that

KL Πφ∥Π⋆


T (12.106)
∑ Ext ∼Πφ
  
= KL πφ(· | xt )∥π̂ (· | xt ) − log Z ( xt ) .
t =1

2. Show that if the space of policies parameterized by φ is sufficiently


expressive, π ⋆ ( a | x) ∝ exp(q⋆ ( x, a)) solves Equation (12.84).

12.8. PPO as probabilistic inference.


1. Consider the same generative model as was introduced in Sec-
tion 12.6.1 when we interpreted entropy-regularized RL as prob-
abilistic inference. Only now, assume that T = 1 39 and suppose 39
since, in this section, we have been
that the prior over actions is p(y | x) = Πinit (y | x) rather than uni- considering outcome rewards

form. Define the distribution over optimal trajectories Π⋆ (y | x) as


before in Equation (12.82). Show that for any context x,

arg min KL Πφ(· | x)∥Π⋆ (· | x) = arg max Jλ (φ; φinit | x).



φ φ
(12.107)

2. Conclude that the policy maximizing Jλ (φ; φinit | x) is


 
1
Π⋆ (y | x) ∝ Πinit (y | x) exp r (y | x) . (12.108)
λ
13
Model-based Reinforcement Learning

In this final chapter, we will revisit the model-based approach to rein-


forcement learning. We will see some advantages it offers over model-
free methods. In particular, we will use the machinery of probabilistic
inference, which we developed in the first chapters, to quantify uncer-
tainty about our model and use this uncertainty for planning, explo-
ration, and reliable decision-making.

To recap, in Chapter 11, we began by discussing model-based rein-


forcement learning which attempts to learn the underlying Markov
decision process and then use it for planning. We have seen that in
the tabular setting, computing and storing the entire model is compu-
tationally expensive. This led us to consider the family of model-free
approaches, which learn the value function directly, and as such can
be considered more economical in the amount of data that they store.

In Chapter 12, we saw that using function approximation, we were able


to scale model-free methods to very large state and action spaces. We
will now explore similar ideas in the model-based framework. Namely,
we will use function approximation to condense the representation of
our model of the environment. More concretely, we will learn an ap-
proximate dynamics model f ≈ p and approximate rewards r, which
is also called a world model.

There are a few ways in which the model-based approach is advanta-


geous. First, if we have an accurate model of the environment, we can
use it for planning — ideally also for interacting safely with the envi-
ronment. However, in practice, we will rarely have such a model. In
fact, the accuracy of our model will depend on the past experience of
our agent and the region of the state-action space. Understanding the
uncertainty in our model of the environment is crucial for planning.
In particular, quantifying uncertainty is necessary to drive safe(!) ex-
ploration and avoid undesired states.

Moreover, modeling uncertainty in our model of the environment can


be extremely useful in deciding where to explore. Learning a model
can therefore help to dramatically reduce the sample complexity over
model-free techniques. Often times this is crucial when developing
agents for real-world use as in such settings, exploration is costly and
potentially dangerous.

Algorithm 13.1 describes the general approach to model-based rein-


forcement learning.

Algorithm 13.1: Model-based reinforcement learning (outline)


1 start with an initial policy π and no (or some) initial data D
2 for several episodes do
3 roll out policy π to collect data
4 learn a model of the dynamics f and rewards r from data
5 plan a new policy π based on the estimated models

We face three main challenges in model-based reinforcement learning.


First, given a fixed model, we need to perform planning to decide on
which actions to play. Second, we need to learn models f and r accu-
rately and efficiently. Third, we need to effectively trade exploration
and exploitation. We will discuss these three topics in the following.

13.1 Planning

There exists a large literature on planning in various settings. These


settings can be mainly characterized as
• discrete or continuous action spaces;
• fully- or partially observable state spaces;
• constrained or unconstrained; and
• linear or nonlinear dynamics.
In Chapter 10, we have already seen algorithms such as policy iteration
and value iteration, which can be used to solve planning exactly in
tabular settings. In the following, we will now focus on the setting of
continuous state and action spaces, fully observable state spaces, no
constraints, and nonlinear dynamics.

13.1.1 Deterministic Dynamics


To begin with, let us assume that our dynamics model is deterministic
and known. That is, given a state-action pair, we know the subsequent
state,

   x_{t+1} = f(x_t, a_t).   (13.1)

We continue to focus on the setting of infinite-horizon discounted returns (10.5), which we have been considering throughout our discussion of reinforcement learning. This yields the objective,

   max_{a_{0:∞}} ∑_{t=0}^∞ γ^t r(x_t, a_t)   such that   x_{t+1} = f(x_t, a_t).   (13.2)

Now, because we are optimizing over an infinite time horizon, we can-


not solve this optimization problem directly. This problem is studied
ubiquitously in the area of optimal control. We will discuss one central
idea from optimal control that is widely used in model-based rein-
forcement learning, and will later return to using this idea for learning
parametric policies in Section 13.1.3.

Planning over finite horizons: The key idea of a classical algorithm from
optimal control called receding horizon control (RHC) or model predictive
control (MPC) is to iteratively plan over finite horizons. That is, in each
round, we plan over a finite time horizon H and carry out the first
action.

Algorithm 13.2: Model predictive control, MPC
1 for t = 0 to ∞ do
2   observe x_t
3   plan over a finite horizon H,

      max_{a_{t:t+H−1}} ∑_{τ=t}^{t+H−1} γ^{τ−t} r(x_τ, a_τ)   such that   x_{τ+1} = f(x_τ, a_τ)   (13.3)

4   carry out action a_t

Figure 13.1: Illustration of model predictive control in a deterministic transition model. The agent starts in position x_0 and wants to reach x⋆ despite the black obstacle. We use the reward function r(x) = −∥x − x⋆∥. The gray concentric circles represent the length of a single step. We plan with a time horizon of H = 2. Initially, the agent does not “see” the black obstacle, and therefore moves straight towards the goal. As soon as the agent sees the obstacle, the optimal trajectory is “replanned”. The dotted red line corresponds to the optimal trajectory, the agent's steps are shown in blue.

Observe that the state x_τ can be interpreted as a deterministic function x_τ(a_{t:τ−1}), which depends on all actions from time t to time τ − 1 and the state x_t. To solve the optimization problem of a single iteration, we therefore need to maximize,

   J_H(a_{t:t+H−1}) := ∑_{τ=t}^{t+H−1} γ^{τ−t} r(x_τ(a_{t:τ−1}), a_τ).   (13.4)

This optimization problem is in general non-convex. If the actions are continuous and the dynamics and reward models are differentiable, we can nevertheless obtain analytic gradients of Ĵ_H. This can be done using the chain rule and “backpropagating” through time, analogously to backpropagation in neural networks (see Section 7.1.4). Especially for large horizons H, this optimization problem becomes difficult to solve exactly due to local optima and vanishing/exploding gradients.

Tree search: Often, heuristic global optimization methods (also called


“search methods”) are used to optimize Ĵ_H. An example is the class of random shooting methods, which find the optimal choice among a set of random
proposals. Of course, obtaining a “good” set of randomly proposed
action sequences is crucial. A naive way of generating the proposals
is to pick them uniformly at random. This strategy, however, usually
does not perform very well as it corresponds to suggesting random
walks through the state space. Alternatives are to sample from a Gaussian distribution or to use the cross-entropy method, which gradually adapts
the sampling distribution by reweighing samples according to the re-
wards they produce.

Algorithm 13.3: Random shooting methods
1 generate m sets of random samples, a^{(i)}_{t:t+H−1}
2 pick the sequence of actions a^{(i⋆)}_{t:t+H−1} where

      i⋆ = arg max_{i∈[m]} J_H(a^{(i)}_{t:t+H−1})   (13.5)
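To make this concrete, below is a minimal Python sketch of random shooting used inside an MPC loop for a toy, deterministic one-dimensional problem. The dynamics f, the reward r, the horizon H = 10, and the number of proposals m = 500 are hypothetical choices made only for this illustration.

    import numpy as np

    # Toy deterministic dynamics and reward; both are hypothetical choices.
    def f(x, a):
        return x + a                                    # simple integrator dynamics

    def r(x, a):
        return -(x - 5.0) ** 2 - 0.1 * a ** 2           # reward peaks at the goal state x* = 5

    def random_shooting(x_t, H=10, m=500, gamma=0.95, rng=np.random.default_rng(0)):
        # Evaluate m random action sequences of length H and return the best first action.
        best_value, best_first_action = -np.inf, 0.0
        for _ in range(m):
            actions = rng.uniform(-1.0, 1.0, size=H)    # one proposal a_{t:t+H-1}
            x, value = x_t, 0.0
            for k, a in enumerate(actions):             # compute J_H under the known model
                value += gamma ** k * r(x, a)
                x = f(x, a)
            if value > best_value:
                best_value, best_first_action = value, actions[0]
        return best_first_action

    # MPC-style loop: replan at every step and carry out only the first action.
    x = 0.0
    for t in range(20):
        a = random_shooting(x)
        x = f(x, a)
    print("final state:", x)                            # the state is steered towards the goal

Sampling the proposals from a Gaussian or adapting them with the cross-entropy method, as discussed above, only changes how the proposals are generated.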

The evolution of the state can be visualized as a tree where — if dy-


namics are deterministic — the branching is fully determined by the
played actions. For this reason, classical tree search algorithms can be
employed, such as alpha-beta pruning or Monte Carlo tree search (MCTS),
which was used for example by “AlphaZero” (Silver et al., 2016, 2017)
and can be viewed as an advanced variant of a shooting method.

Finite-horizon planning with sparse rewards: A common problem of finite-


horizon methods is that in the setting of sparse rewards, there is often
no signal that can be followed. You can think of an agent that operates
in a kitchen and tries to find a box of candy. Yet, to get to this box, it
needs to perform a large number of actions. In particular, if this num-
ber of actions is larger than the horizon H, then the local optimization
problem of MPC will not take into account the reward for choosing
this long sequence of actions. Thus, the box of candy will likely never
be found by our agent.

A solution to this problem is to append a long-term value estimate to the finite-horizon sum. The idea is to not only consider the rewards attained while following the actions a_{t:t+H−1}, but to also consider the value of the final state x_{t+H}, which estimates the discounted sum of future rewards,

   Ĵ_H(a_{t:t+H−1}) := ∑_{τ=t}^{t+H−1} γ^{τ−t} r(x_τ(a_{t:τ−1}), a_τ)  [short-term]  +  γ^H V(x_{t+H})  [long-term].   (13.6)

Intuitively, γ^H V(x_{t+H}) is estimating the tail of the infinite sum.

Figure 13.2: Illustration of finite-horizon planning with sparse rewards. When the finite time horizon does not suffice to “reach” a reward, the agent has no signal to follow.

Remark 13.4: Planning generalizes model-free methods!


Observe that for H = 1, when we use the value function esti-
mate associated with this MPC controller, maximizing JH coin-
cides with using the greedy policy πV . That is,

   a_t = arg max_{a∈A} Ĵ_1(a) = π_V(x_t).   (13.7)

Thus, by looking ahead for a single time step, we recover the ap-
proaches from the model-free setting in this model-based setting!
Essentially, if we do not plan long-term and only consider the
value estimate, the model-based setting reduces to the model-free
setting. However, in the model-based setting, we are now able to
use our model of the transition dynamics to anticipate the down-
stream effects of picking a particular action at . This is one of the
fundamental reasons why model-based approaches are typically far more sample-efficient than model-free methods.

To obtain the value estimates, we can use the approaches we discussed


in detail in Section 11.4, such as TD-learning for on-policy value esti-
mates and Q-learning for off-policy value estimates. For large state-
action spaces, we can use their approximate variants, which we dis-
cussed in Section 12.1 and Section 12.2. To improve value estimates,
we can obtain “artificial” data by rolling out policies within our model.
This is a key advantage over model-free methods as, once a sufficiently
accurate model has been learned, data for value estimation can be gen-
erated efficiently in simulation.

13.1.2 Stochastic Dynamics


How can we extend this approach to planning to a stochastic transi-
tion model? A natural extension of model predictive control is to do
what is called stochastic average approximation (SAA) or trajectory sam-
pling (Chua et al., 2018). Like in MPC, we still optimize over a deter-
ministic sequence of actions, but now we will average over all resulting
trajectories.

Algorithm 13.5: Trajectory sampling


1 for t = 0 to ∞ do
2 observe xt
3 optimize expected performance over a finite horizon H,
" #
t + H −1
max Ext+1:t+ H
at:t+ H −1
∑ γ τ −t H
rτ + γ V ( xt+ H ) at:t+ H −1 , f (13.8)
τ =t

4 carry out action at

Intuitively, trajectory sampling optimizes over a much simpler object — namely a deterministic sequence of actions of length H — than finding a policy, which corresponds to finding an optimal decision tree mapping states to actions. Of course, using trajectory sampling (from an arbitrary starting state) implies such a policy. However, trajectory sampling never computes this policy explicitly, and rather, in each step, only plans over a finite horizon.

Figure 13.3: Illustration of trajectory sampling. High-reward states are shown in brighter colors. The agent iteratively plans a finite number of time steps into the future and picks the best initial action.
Computing the expectation exactly is typically not possible as this in-
volves solving a high-dimensional integral of nonlinear functions. In-
stead, a common approach is to use Monte Carlo estimates of this ex-
pectation. This approach is known as Monte Carlo trajectory sampling.
The key issue with using sampling-based estimation is that the sampled trajectory (i.e., sampled sequence of states) we obtain depends on the actions we pick. In other words, the measure we average over depends on the decision variables — the actions. This is a problem that we have seen several times already! It naturally suggests using the reparameterization trick (see Theorem 5.19 and Equation (12.76)).

Previously, we have used the reparameterization trick to reparameter-


ize variational distributions (see Section 5.5.1) and to reparameterize
policies (see Section 12.5.1). It turns out that we can use the exact
same approach for reparameterizing the transition model. We say that
a (stochastic) transition model f is reparameterizable iff xt+1 ∼ f ( xt , at )
is such that xt+1 = g (ε; xt , at ), where ε ∼ ϕ is a random variable that
is independent of xt and at . We have already seen in Example 12.14
(in the context of stochastic policies) how a Gaussian transition model
can be reparameterized.

In this case, x_τ is determined recursively by a_{t:τ−1} and ε_{t:τ−1},

   x_τ = x_τ(ε_{t:τ−1}; a_{t:τ−1}) := g(ε_{τ−1}; g(. . . ; g(ε_{t+1}; g(ε_t; x_t, a_t), a_{t+1}), . . . ), a_{τ−1}).   (13.9)

This allows us to obtain unbiased estimates of J_H using Monte Carlo approximation,

   Ĵ_H(a_{t:t+H−1}) ≈ (1/m) ∑_{i=1}^m ( ∑_{τ=t}^{t+H−1} γ^{τ−t} r(x_τ(ε^{(i)}_{t:τ−1}; a_{t:τ−1}), a_τ) + γ^H V(x_{t+H}(ε^{(i)}_{t:t+H−1}; a_{t:t+H−1})) )   (13.10)

where ε^{(i)}_{t:t+H−1} ∼ ϕ are independent samples. To optimize this approximation we can again compute analytic gradients or use shooting methods as we have discussed in Section 13.1.1 for deterministic dynamics.
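The following sketch shows how such a Monte Carlo estimate can be computed under a hypothetical reparameterizable Gaussian transition g(ε; x, a) = x + a + σε and a made-up reward. The noise samples ε are drawn once and then held fixed, so that the estimate is a deterministic function of the action sequence, which can subsequently be optimized (e.g., by shooting methods or, in an autodiff framework, by gradient ascent).

    import numpy as np

    sigma = 0.1                                     # hypothetical noise scale
    g = lambda eps, x, a: x + a + sigma * eps       # reparameterized Gaussian transition
    r = lambda x, a: -(x - 3.0) ** 2                # hypothetical reward
    V = lambda x: 0.0                               # terminal value estimate (zero for simplicity)

    def J_hat(actions, eps, gamma=0.95, x_t=0.0):
        # Monte Carlo estimate (13.10): average over m fixed noise realizations.
        m, H = eps.shape
        total = 0.0
        for i in range(m):
            x, ret = x_t, 0.0
            for k in range(H):
                ret += gamma ** k * r(x, actions[k])
                x = g(eps[i, k], x, actions[k])
            total += ret + gamma ** H * V(x)
        return total / m

    rng = np.random.default_rng(0)
    H, m = 5, 64
    eps = rng.standard_normal((m, H))               # sample the noise once and keep it fixed
    actions = np.zeros(H)
    print(J_hat(actions, eps))                      # to be maximized over the actions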

13.1.3 Parametric Policies


When using algorithms such as model predictive control for planning,
planning needs to be done online before each time we take an action.
This is called closed-loop control and can be expensive. Especially when
the time horizon is large, or we encounter similar states many times
(leading to “repeated optimization problems”), it can be beneficial to
“store” the planning decision in a (deterministic) policy,
   a_t := π(x_t; φ) = π_φ(x_t).   (13.11)

This policy can then be trained offline and evaluated cheaply online,
which is known as open-loop control.

This is akin to a problem that we have discussed in detail in the previ-


ous chapter when extending Q-learning to large action spaces. There,
this led us to discuss policy gradient and actor-critic methods. Recall
that in Q-learning, we seek to follow the greedy policy,

   π⋆(x) = arg max_{a∈A} Q⋆(x, a; θ),   [see Equation (12.20)]

and therefore had to solve an optimization problem over all actions.


We accelerated this by learning an approximate policy that “mim-
icked” this optimization,

   φ⋆ = arg max_φ E_{x∼µ}[Q⋆(x, π_φ(x); θ)] =: Ĵ_µ(φ; θ)   [see Equation (12.67)]

where µ( x) > 0 was some exploration distribution that has full sup-
port and thus leads to the exploration of all states. The key idea was
that if we use a differentiable approximation Q and a differentiable
parameterization of policies, which is “rich enough”, then both op-
timizations are equivalent, and we can use the chain rule to obtain

gradient estimates of the second expression. We then used this to de-


rive the deep deterministic policy gradients (DDPG) and stochastic value
gradients (SVG) algorithms. It turns out that there is a very natural
analogue to DDPG/SVG for model-based reinforcement learning.

Instead of maximizing the Q-function directly, we use finite-horizon


planning to estimate the immediate value of the policy within the next
H time steps and simply use the Q-function to approximate the ter-
minal value (i.e., the tails of the infinite sum). Then, our objective
becomes,
" #
H −1
.
Jµ,H (φ; θ) = Ex ∼µ,x |π , f ∑ γ rτ + γ Q ( x H , πφ( x H ); θ)
b
0 1:H φ
τ H ⋆
τ =0
(13.12)
This approach naturally extends to randomized policies using repa-
rameterization gradients, which we have discussed in Section 12.5.1.
Analogously to Remark 13.4, for H = 0, this coincides with the DDPG
objective! For larger time horizons, the look-ahead takes into account
the transition model for planning next time steps. This tends to help
dramatically in improving policies much more rapidly between episodes.
Instead of just gradually improving policies a little by slightly adapt-
ing the policy to the learned value function estimates (as in model-free
RL), we use the model to anticipate the consequences of actions multi-
ple time steps ahead. This is at the heart of model-based reinforcement
learning.
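As a small illustration of the objective in Equation (13.12), the sketch below estimates it by Monte Carlo rollouts for a linear policy π_φ(x) = φx under a hypothetical learned Gaussian model and a crude stand-in for the terminal Q-function. In practice the rollout would be implemented in an automatic-differentiation framework so that ∇_φ is obtained by backpropagating through the H simulated steps; here we only evaluate the objective and pick φ by a naive grid search.

    import numpy as np

    # Hypothetical learned model, reward, and terminal Q-function (illustration only).
    f_mean = lambda x, a: 0.9 * x + a
    f_std = 0.05
    reward = lambda x, a: -(x ** 2) - 0.01 * a ** 2
    Q_terminal = lambda x, a: -(x ** 2) - 0.01 * a ** 2     # crude stand-in for Q*

    def policy(x, phi):
        return phi * x                                      # deterministic linear policy

    def J_mu_H(phi, H=5, n_samples=256, gamma=0.99, rng=np.random.default_rng(0)):
        # Monte Carlo estimate of the H-step objective (13.12).
        total = 0.0
        for _ in range(n_samples):
            x, ret = rng.normal(), 0.0                      # x_0 ~ mu (standard normal here)
            for tau in range(H):
                a = policy(x, phi)
                ret += gamma ** tau * reward(x, a)
                x = f_mean(x, a) + f_std * rng.standard_normal()   # reparameterized model step
            ret += gamma ** H * Q_terminal(x, policy(x, phi))      # terminal value via Q
            total += ret
        return total / n_samples

    # Naive search over the policy parameter in place of gradient ascent.
    phis = np.linspace(-1.5, 0.5, 21)
    print("best phi:", max(phis, key=J_mu_H))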

Essentially, we are using methods such as Q-learning and DDPG/SVG


as subroutines within the framework of model predictive control to do
much bigger steps in policy improvement than to slightly improve the
next picked action. To encourage exploration, it is common to extend
the objective in Equation (13.12) by an additional entropy regulariza-
tion term as seen in Section 12.6.

Remark 13.6: Planning as inference


We have been using Monte Carlo rollouts to estimate the expecta-
tion of Equation (13.12). That is, we have been using a Monte
Carlo approximation (e.g., a sample mean) using samples ob-
tained by “rolling out” the induced Markov chain of a fixed policy.

It would certainly be preferable to compute the expectation ex-


actly, however, this is generally not possible as this involves solv-
ing a high dimensional integral. Recall that we faced the same
problem when studying inference in the first half of the manuscript.
In both problems, we need to approximate high-dimensional inte-
grals (i.e., expectations). This suggests a deep connection between

the problems of planning and inference. It is therefore not sur-


prising that many techniques for approximate inference that we
have seen earlier can also be applied to planning.
• Monte Carlo approximation which we have been focusing on dur-
ing our discussion of planning is a very simple inference algo-
rithm — approximating an expectation by sampling from the
distribution that is averaged over. This allowed us to obtain
unbiased gradient estimates (which may have high variance).
• An alternative approach is moment matching (cf. Section 5.4.6).
Instead of approximating the expectation, here, we approxi-
mate the distribution over trajectories using a tractable distri-
bution (e.g., a Gaussian) and “matching” their moments. This
then allows us to analytically compute gradients of the expecta-
tion in Equation (13.12). A prominent example of this approach
is probabilistic inference for learning control (PILCO).
• In Section 12.6, we have used variational inference for planning,
and seen that it coincides with entropy regularization as imple-
mented by the soft actor critic (SAC) algorithm.

13.2 Learning

Thus far, we have considered known environments. That is, we as-


sumed that the transition model f and the rewards r are known. In
reinforcement learning, f and r are (of course!) not known. Instead, we
have to estimate them from data. This will also be crucial in our later
discussion of exploration in Section 13.3 where we explore methods of
driving data collection to learn what we need to learn about the world.

First, let us revisit one of our key observations when we first intro-
duced the reinforcement learning problem. Namely, that the observed
transitions x′ and rewards r are conditionally independent given the
state-action pairs (x, a) (see Equation (11.3)). This is due to the Markovian structure of the underlying Markov decision process.

This is the key observation that allows us to treat the estimation of


the dynamics and rewards as a simple regression problem (or a den-
sity estimation problem when the quantities are stochastic rather than
deterministic). More concretely, we can estimate the dynamics and
rewards off-policy using the standard supervised learning techniques
we discussed in earlier chapters, from a replay buffer

D = {( xt , at , rt , xt+1 )}t . (13.13)


| {z } | {z }
“input” “label”

Here, xt and at are the “inputs”, and rt and xt+1 are the “labels” of the

regression problem. Due to the conditional independence of the labels


given the inputs, we have independent label noise (i.e., “independent
training data”) which is the basic assumption that we have been mak-
ing throughout our discussion of techniques for probabilistic machine
learning in Part I.

The key difference to supervised learning is that the set of inputs de-
pends on how we act. That is, the current inputs arise from previous
policies, and the inputs which we will observe in the future will de-
pend on the model (and policy) obtained from the current data: we
have feedback loops! We will come back to this aspect of reinforce-
ment learning in the next section on exploration. For now, recall we
only assume that we have used an arbitrary policy to collect some data,
which we then stored in a replay buffer, and which we now want to
use to learn the “best-possible” model of our environment.
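As a minimal illustration of this regression view, the sketch below generates a toy replay buffer and fits a linear dynamics model and a linear reward model by least squares. The toy system and the linear model class are assumptions made only for this example; Gaussian processes or (Bayesian) neural networks would be used in exactly the same way.

    import numpy as np

    rng = np.random.default_rng(0)

    # Collect a toy replay buffer D = {(x_t, a_t, r_t, x_{t+1})} with a random policy.
    X, A, R, X_next = [], [], [], []
    x = 0.0
    for t in range(200):
        a = rng.uniform(-1.0, 1.0)
        r = -x ** 2                                                  # observed reward
        x_next = 0.8 * x + 0.5 * a + 0.01 * rng.standard_normal()    # unknown true dynamics
        X.append(x); A.append(a); R.append(r); X_next.append(x_next)
        x = x_next

    # Treat (x_t, a_t) as inputs and x_{t+1}, r_t as labels; fit both by least squares.
    inputs = np.column_stack([X, A, np.ones(len(X))])                # features [x, a, 1]
    theta_dyn, *_ = np.linalg.lstsq(inputs, np.array(X_next), rcond=None)
    theta_rew, *_ = np.linalg.lstsq(inputs, np.array(R), rcond=None)

    print("estimated dynamics coefficients:", theta_dyn)   # roughly [0.8, 0.5, 0.0]
    print("estimated reward coefficients:", theta_rew)     # the reward is nonlinear, so this fit is crude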

13.2.1 Probabilistic Inference


In the following, we will discuss how we can use the techniques from
probabilistic inference, which we have seen in Part I, to learn the dy-
namics and reward models. Thereby, we will focus on learning the
transition model f as learning the reward model r is completely analo-
gous. For learning deterministic dynamics or rewards, we can use for
example Gaussian processes (cf. Chapter 4) or deep neural networks
(cf. Chapter 7). We will now focus on the setting where the dynamics
are stochastic, that is,
   x_{t+1} ∼ f(x_t, a_t; ψ).   (13.14)

Example 13.7: Conditional Gaussian dynamics


We could, for example, use a conditional Gaussian for the transi-
tion model,

xt+1 ∼ N (µ( xt , at ; ψ), Σ ( xt , at ; ψ)). (13.15)

As we have seen in Equation (A.7), we can rewrite the covari-


ance matrix Σ as a product of a lower-triangular matrix L and its
transpose using the Cholesky decomposition Σ = LL⊤ of Σ. This
allows us to represent the model by only n(n + 1)/2 parameters.
Moreover, we have learned that Gaussians are reparameterizable,
which we have seen to be useful for planning.

Note that this model reduces to a deterministic model if the co-


variance is zero. So the stochastic transition models encompass
all deterministic models. Moreover, in many applications (such
as robotics), it is often useful to use stochastic models to attribute

slight inaccuracies and measurement noise to a small uncertainty


in the model.
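A minimal sketch of such a conditional Gaussian model: the covariance is represented by its Cholesky factor L, built from n(n + 1)/2 free parameters, and next states are sampled via the reparameterization x_{t+1} = µ + Lε. The linear mean map and the specific parameter values are hypothetical and only serve to illustrate the parameterization.

    import numpy as np

    n = 2                                                # state dimension

    def mean(x, a, W):
        # Hypothetical mean function: a linear map of the state-action pair.
        return W @ np.concatenate([x, a])

    def cholesky_factor(l_params):
        # Build a lower-triangular L from n(n+1)/2 parameters; exponentiating the
        # diagonal guarantees that Sigma = L L^T is positive definite.
        L = np.zeros((n, n))
        L[np.tril_indices(n)] = l_params
        L[np.diag_indices(n)] = np.exp(np.diag(L))
        return L

    def sample_next_state(x, a, W, l_params, rng):
        # Reparameterized sample: x_{t+1} = mu(x, a) + L eps with eps ~ N(0, I).
        eps = rng.standard_normal(n)
        return mean(x, a, W) + cholesky_factor(l_params) @ eps

    rng = np.random.default_rng(0)
    W = rng.standard_normal((n, n + 1)) * 0.5            # n state dims plus one action dim
    l_params = rng.standard_normal(n * (n + 1) // 2) * 0.1
    x, a = np.zeros(n), np.array([1.0])
    print(sample_next_state(x, a, W, l_params, rng))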

A first approach might be to obtain a point estimate for f , either


through maximum likelihood estimation (which we have seen to over-
fit easily) or through maximum a posteriori estimation. If, for example,
f is represented as a deep neural network, we have already seen how
to find the MAP estimate of its weights in Section 7.2.1.

Remark 13.8: The key pitfall of point estimates


However, using point estimates leads to a key pitfall of model-
based reinforcement learning. Using a point estimate of the model
for planning — even if this point estimate is very accurate — of-
ten performs very poorly. The reason is that planning is very good
at overfitting (i.e., exploiting) small errors in the transition model.
Moreover, the errors in the model estimate compound over time
when using a longer time horizon H. The key to remedy this pit-
fall lies in being robust to misestimated models. This naturally
suggests quantifying the uncertainty in our model estimate and
taking it into account during planning.4 In the following section, we will rediscover that estimates of epistemic uncertainty are also extremely useful for driving (safe) exploration — something that we have already encountered in our discussion of Bayesian optimization.

4 Quantifying the uncertainty of an estimate is a problem that we have spent the first few chapters exploring. Notably, refer to
• Section 2.2 for a description of epistemic and aleatoric uncertainty;
• Chapter 4 for our use of uncertainty estimates in the context of Gaussian processes;
• Chapter 7 for our use of uncertainty estimates in the context of Bayesian deep learning; and
• Chapters 8 and 9 for our use of epistemic uncertainty estimates to drive exploration.

In the following, we will differentiate between the epistemic uncertainty and the aleatoric uncertainty. Recall from Section 2.2 that epistemic uncertainty corresponds to our uncertainty about the model, p(f | D), while aleatoric uncertainty corresponds to the uncertainty of the transitions in the underlying Markov decision process (which can be thought of as “irreducible” noise), p(x_{t+1} | f, x_t, a_t).

Intuitively, probabilistic inference of dynamics models corresponds to


learning a distribution over possible models f and r given prior beliefs,
where f and r characterize the underlying Markov decision process.
This goes to show another benefit of the model-based over the model-
free approach to reinforcement learning. Namely, that it is much easier
to encode prior knowledge about the transition and rewards model.

Example 13.9: Inference with conditional Gaussian dynamics


Let us revisit inference with our conditional Gaussian dynamics

model from Equation (13.15),

xt+1 ∼ N (µ( xt , at ; ψ), Σ ( xt , at ; ψ)).

Recall that in the setting of Bayesian deep learning, most approximate inference techniques represented the approximate posterior using some form of a mixture of Gaussians,5

   p(x_{t+1} | D, x_t, a_t) ≈ (1/m) ∑_{i=1}^m N(µ(x_t, a_t; ψ^{(i)}), Σ(x_t, a_t; ψ^{(i)})).   (13.16)

5 see
• Equation (7.18) for variational inference;
• Equation (7.21a) for Markov chain Monte Carlo;
• Equation (7.28) for dropout regularization; and
• Equation (7.29) for probabilistic ensembles.

Hereby, the epistemic uncertainty is represented by the variance between mixture components, and the aleatoric uncertainty by the average variance within the components (see Section 7.3.1).
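For a one-dimensional next state, this decomposition of the mixture (13.16) can be computed as in the following sketch; the component means and variances are made up for illustration.

    import numpy as np

    # Predictions of m ensemble members for the same (x_t, a_t): the means and
    # variances of the conditional Gaussians N(mu^(i), sigma^(i)^2). Made-up values.
    mus = np.array([1.0, 1.2, 0.9, 1.1])
    sigmas2 = np.array([0.05, 0.04, 0.06, 0.05])

    aleatoric = sigmas2.mean()        # average variance within the components
    epistemic = mus.var()             # variance of the means between components
    total = aleatoric + epistemic     # variance of the equally weighted mixture

    print(f"aleatoric: {aleatoric:.3f}, epistemic: {epistemic:.3f}, total: {total:.3f}")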

In supervised learning, we often conflated the notions of epistemic


and aleatoric uncertainty. In the context of planning, there is an im-
portant consequence of the decomposition into epistemic and aleatoric
uncertainty. Recall that the epistemic uncertainty corresponds to a
distribution over Markov decision processes f , whereas the aleatoric
uncertainty corresponds to the randomness in the transitions within
one such MDP f . Crucially, this randomness in the transitions must
be consistent within a single MDP! That is, once we selected a sin-
gle MDP for planning, we should disregard the epistemic uncertainty
and solely focus on the randomness of the transitions. Then, to take
into account epistemic uncertainty, we should average our plan across
the different realizations of f . This yields the following Monte Carlo
estimate of our reward Ĵ_H,

   Ĵ_H(a_{t:t+H−1}) ≈ (1/m) ∑_{i=1}^m Ĵ_H(a_{t:t+H−1}; f^{(i)})   where   (13.17)

   Ĵ_H(a_{t:t+H−1}; f) := ∑_{τ=t}^{t+H−1} γ^{τ−t} r(x_τ(ε^{(i)}_{t:τ−1}; a_{t:τ−1}, f), a_τ) + γ^H V(x_{t+H}).   (13.18)

Here, f^{(i)} ∼ p(f | D) are independent samples of the transition model, and ε^{(i)}_{t:t+H−1} ∼ ϕ parameterizes the dynamics analogously to Equation (13.9):

   x_τ(ε_{t:τ−1}; a_{t:τ−1}, f) := f(ε_{τ−1}; f(. . . ; f(ε_{t+1}; f(ε_t; x_t, a_t), a_{t+1}), . . . ), a_{τ−1}).   (13.19)

Observe that the epistemic and aleatoric uncertainty are treated differ-
ently. Within a particular MDP f , we ensure that randomness (i.e., ale-

atoric uncertainty) is simulated consistently using our previous framework from our discussion of planning (see Equation (13.10)). The Monte Carlo samples of f take into account the epistemic uncertainty about the transition model.
In our previous discussion of planning, we assumed the Markov deci-
sion process f to be fixed. Essentially, in Equation (13.17) we are now
using Monte Carlo trajectory sampling as a subroutine and average
over an “ensemble” of Markov decision processes.
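The sketch below makes this treatment of the two kinds of uncertainty explicit for a hypothetical one-dimensional model: each sampled model f^(i) (here, a sampled coefficient of linear dynamics) stays fixed for an entire rollout, the transition noise ε provides the aleatoric randomness within that rollout, and the resulting returns are averaged as in Equation (13.17).

    import numpy as np

    rng = np.random.default_rng(0)
    reward = lambda x, a: -(x - 2.0) ** 2            # hypothetical reward
    sigma = 0.05                                     # aleatoric noise scale

    def J_hat_single_model(actions, theta, eps, gamma=0.95, x=0.0):
        # One consistent "world": the sampled model parameter theta is fixed throughout.
        ret = 0.0
        for k, a in enumerate(actions):
            ret += gamma ** k * reward(x, a)
            x = theta * x + a + sigma * eps[k]       # aleatoric randomness via eps
        return ret

    def J_hat(actions, m=20, gamma=0.95):
        H = len(actions)
        thetas = rng.normal(0.9, 0.05, size=m)       # epistemic uncertainty: sampled models f^(i)
        eps = rng.standard_normal((m, H))            # per-model transition noise
        returns = [J_hat_single_model(actions, thetas[i], eps[i], gamma) for i in range(m)]
        return np.mean(returns)                      # average over the "ensemble" of MDPs

    print(J_hat(np.full(4, 0.5)))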

Figure 13.4: Illustration of planning with


epistemic uncertainty and Monte Carlo
sampling. The agent considers m alter-
native “worlds”. Within each world, it
plans a sequence of actions over a fi-
nite time horizon. Then, the agent av-
erages all optimal initial actions from
all worlds. Crucially, each world by it-
self is consistent. That is, its transition
model (i.e., the aleatoric uncertainty of
the model) is constant.

The same approach that we have seen in Section 13.1.3 can be used
to “compile” these plans into a parametric policy that can be trained
offline, in which case, we write Ĵ_H(π) instead of Ĵ_H(a_{t:t+H−1}).

This leads us to a first greedily exploitative algorithm for model-based


reinforcement learning, which is shown in Algorithm 13.10. This al-
gorithm is purely exploitative as it greedily maximizes the expected
reward with respect to the transition model, taking into account epis-
temic uncertainty.

Algorithm 13.10: Greedy exploitation for model-based RL


1 start with (possibly empty) data D = ∅ and a prior p( f ) = p( f | D)
2 for several episodes do
3 plan a new policy π to (approximately) maximize,
      max_π E_{f∼p(·|D)}[ Ĵ_H(π; f) ]   (13.20)

4 roll out policy π to collect more data


5 update posterior p( f | D)

In the context of Gaussian process models, this algorithm is called



probabilistic inference for learning control (PILCO) (Deisenroth and Ras-


mussen, 2011), which was the first demonstration of how sample effi-
cient model-based reinforcement learning can be. As was mentioned
in Remark 13.6, the originally proposed PILCO uses moment matching
instead of Monte Carlo averaging.

Figure 13.5: Sample efficiency of model-free and model-based reinforcement learning. DDPG and SAC are shown as constant (black) lines, because they take an order of magnitude more time steps before learning a good model. They also compare the PETS algorithm (blue) to deterministic ensembles (orange), where the transition models are assumed to be deterministic (or to have covariance 0). Reproduced with permission from “Deep reinforcement learning in a handful of trials using probabilistic dynamics models” (Chua et al., 2018).

In the context of neural networks, this algorithm is called probabilistic ensembles with trajectory sampling (PETS) (Chua et al., 2018), which is one of the state-of-the-art algorithms. PETS uses an ensemble of conditional Gaussian distributions over weights, trajectory sampling for evaluating the performance and model predictive control for planning.

Notably, PETS does not explicitly explore. Exploration only happens due to the uncertainty in the model, which already drives exploration to some extent. In many settings, however, this incentive is not sufficient for exploration — especially when rewards are sparse.

Optional Readings
• PILCO: Deisenroth and Rasmussen (2011).
PILCO: A model-based and data-efficient approach to policy search.
• PETS: Chua, Calandra, McAllister, and Levine (2018).
Deep reinforcement learning in a handful of trials using probabilistic
dynamics models.

13.2.2 Partial Observability


Depending on the task, it may be difficult to learn the observed dy-
namics directly. For example, when learning an artificial player for a
computer game based only on the games’ visual interface it may be
difficult to predict the next frame in pixel space. Instead, a common
approach is to treat the visual interface as an observation in a POMDP
with a hidden latent space (cf. Section 10.4), and to learn the dynamics
within the latent space and the observation model separately.

An example of this approach is the deep planning network (PlaNet) algo-


rithm which learns such a POMDP via variational inference and uses
the cross-entropy method for closed-loop planning of action sequences
(Hafner et al., 2019). The Dreamer algorithm replaces the closed-loop
planning of PlaNet by open-loop planning with entropy regularization
(Hafner et al., 2020, 2021). Notably, neither PlaNet nor Dreamer explic-
itly take into account epistemic model uncertainty during planning but
rather use point estimates.

13.3 Exploration

Exploration is critical when interacting with unknown environments,


such as in reinforcement learning. We first encountered the exploration-
exploitation dilemma in our discussion of Bayesian optimization, where
we aimed at maximizing an unknown function in as little time as possi-
ble by making noisy observations (see Chapter 9). Within the framework of Bayesian optimization, we used so-called acquisition functions for selecting the
next point at which to observe the unknown function. Observe that
these acquisition functions are analogous to policies in the setting of
reinforcement learning. In particular, the policy that uses greedy ex-
ploitation like we have seen in the previous section is analogous to
simply picking the point that maximizes the mean of the posterior dis-
tribution. In the context of Bayesian optimization, we have already
seen that this is insufficient for exploration and can easily get stuck
in locally optimal solutions. As reinforcement learning is a strict gen-
eralization of Bayesian optimization, there is no reason why such a
strategy should be sufficient now.

Recall from our discussion of Bayesian optimization that we “solved”


this problem by using the epistemic uncertainty in our model of the
unknown function to pick the next point to explore. This is what we
will now explore in the context of reinforcement learning.

One simple strategy that we already investigated is the addition of


some exploration noise. In other words, one follows a greedy exploita-
tive strategy, but every once in a while, one chooses a random action

(like in ε-greedy); or one adds additional noise to the selected actions


(known as Gaussian noise “dithering”). We have already seen that in
difficult exploration tasks, these strategies are not systematic enough.

Two other approaches that we have considered before are Thompson


sampling (cf. Section 9.3.3) and, more generally, the principle of opti-
mism in the face of uncertainty (cf. Section 9.2.1).

13.3.1 Thompson Sampling

Recall from Section 9.3.3 the main idea behind Thompson sampling:
namely, that the randomness in the realizations of f from the posterior
distribution is already enough to drive exploration. That is, instead of
picking the action that performs best on average across all realizations
of f , Thompson sampling picks the action that performs best for a
single realization of f . The epistemic uncertainty in the realizations
of f leads to variance in the picked actions and provides an additional
incentive for exploration. This yields Algorithm 13.11 which is an
immediate adaptation of greedy exploitation and straightforward to
implement.

Algorithm 13.11: Thompson sampling


1 start with (possibly empty) data D = ∅ and a prior p( f ) = p( f | D)
2 for several episodes do
3 sample a model f ∼ p(· | D)
4 plan a new policy π to (approximately) maximize,
      max_π Ĵ_H(π; f)   (13.21)

5 roll out policy π to collect more data


6 update posterior p( f | D)

13.3.2 Optimistic Exploration

We have already seen in the context of Bayesian optimization and tab-


ular reinforcement learning that optimism is a central pillar for explo-
ration. But how can we explore optimistically in model-based rein-
forcement learning?

Let us consider a set M(D) of plausible models given some data D . Op-
timistic exploration would then optimize for the most advantageous
model among all models that are plausible given the seen data.

Example 13.12: Plausible conditional Gaussians


In the context of conditional Gaussians, we can consider the set of
all models such that the prediction of a single dimension i lies in
some confidence interval,
   f_i(x, a) ∈ C_i := [µ_i(x, a | D) − β_i σ_i(x, a | D),  µ_i(x, a | D) + β_i σ_i(x, a | D)],   (13.22)

where β_i tunes the width of the confidence interval, analogously to Section 9.3.1. The set of plausible models is then given as

   M(D) := {f | ∀x ∈ X, a ∈ A, i ∈ [d] : f_i(x, a) ∈ C_i}.   (13.23)
M(D) = { f | ∀ x ∈ X , a ∈ A, i ∈ [d] : f i ( x, a) ∈ Ci }. (13.23)

When compared to greedy exploitation, instead of taking the opti-


mal step on average with respect to all realizations of the transition
model f , optimistic exploration as shown in Algorithm 13.13 takes the
optimal step with respect to the most optimistic model among all tran-
sition models that are consistent with the data.

Algorithm 13.13: Optimistic exploration


1 start with (possibly empty) data D = ∅ and a prior p( f ) = p( f | D)
2 for several episodes do
3 plan a new policy π to (approximately) maximize,
      max_π max_{f∈M(D)} Ĵ_H(π; f)   (13.24)

4 roll out policy π to collect more data


5 update posterior p( f | D)

In general, the joint maximization over π and f is very difficult to


solve computationally. Yet, remarkably, it turns out that this complex
optimization problem can be reduced to standard model-based rein-
forcement learning with a fixed model. The key idea is to consider
an agent that can control its “luck”. In other words, we assume that
the agent believes it can control the outcome of its actions — or rather
choose which of the plausible dynamics it follows. The “luck” of the
agent can be considered as additional decision variables. Consider the
optimization problem,
   π_{t+1} := arg max_π max_{η(·)∈[−1,1]^d} Ĵ_H(π; f̂_t)   (13.25)

with “optimistic” dynamics

   f̂_{t,i}(x, a) := µ_{t,i}(x, a) + β_{t,i} η_i(x, a) σ_{t,i}(x, a).   (13.26)

Here the decision variables ηi control the variance of an action. That


is, within the confidence bounds of the transition model, the agent
can freely choose the state that is reached by playing an action a from
state x. Essentially, this corresponds to maximizing expected reward
in an augmented (optimistic) MDP with known dynamics f̂ and with a
larger action space that also includes the decision variables η. This is a
known MDP for which we can use our toolbox for planning which we
developed in Section 13.1.
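The following sketch illustrates this augmented MDP for a hypothetical one-dimensional model: the agent searches jointly over action sequences and luck variables η ∈ [−1, 1]^H, and transitions follow the hallucinated dynamics of Equation (13.26). The posterior mean µ, standard deviation σ, and the scaling β are made-up stand-ins for a learned model.

    import numpy as np

    # Hypothetical posterior over the dynamics: mean, standard deviation, and beta.
    mu = lambda x, a: 0.9 * x + a
    sigma = lambda x, a: 0.2 * (1.0 + abs(x))        # more uncertainty far from the data
    beta = 2.0
    reward = lambda x, a: -(x - 4.0) ** 2

    def optimistic_return(actions, etas, gamma=0.95, x=0.0):
        # Roll out the hallucinated dynamics f_hat(x, a) = mu + beta * eta * sigma.
        ret = 0.0
        for k in range(len(actions)):
            a, eta = actions[k], etas[k]
            ret += gamma ** k * reward(x, a)
            x = mu(x, a) + beta * eta * sigma(x, a)
        return ret

    # Joint random search over actions and luck variables, as in a shooting method.
    rng = np.random.default_rng(0)
    H, m = 5, 2000
    best_value, best_plan = -np.inf, None
    for _ in range(m):
        actions = rng.uniform(-1.0, 1.0, size=H)
        etas = rng.uniform(-1.0, 1.0, size=H)        # the agent "chooses its luck"
        value = optimistic_return(actions, etas)
        if value > best_value:
            best_value, best_plan = value, (actions, etas)
    print("optimistic value:", best_value, "first action:", best_plan[0][0])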

The algorithm that maximizes expected reward in this augmented


MDP is called hallucinated upper confidence reinforcement learning (H-
UCRL) (Curi et al., 2020; Treven et al., 2024). H-UCRL can be seen as
the natural extension of the UCB acquisition function from Bayesian
optimization to reinforcement learning. An illustration of the algo-
rithm is given in Figure 13.6.

Figure 13.6: Illustration of H-UCRL in a one-dimensional state space. The agent “hallucinates” that it takes the black trajectory when, in reality, the outcomes of its actions are as shown in blue. The agent can hallucinate to land anywhere within the gray confidence regions (i.e., the epistemic uncertainty in the model) using the luck decision variables η. This allows agents to discover long sequences of actions leading to sparse rewards.

Intuitively, the agent has the tendency to believe that it can achieve
much more than it actually can. As more data is collected, the con-
fidence bounds shrink and the optimistic policy rewards converge to
the actual rewards. Yet, crucially, we only collect data in regions of
the state-action space that are more promising than the regions that
we have already explored. That is, we only collect data in the most
promising regions.

Optimistic exploration yields the strongest effects for hard exploration tasks, for example, settings with large penalties associated with performing certain actions and settings with sparse rewards.9 In those settings, most other strategies (i.e., those that are not sufficiently explorative) learn not to act at all. However, even in settings of “ordinary rewards”, optimistic exploration often learns good policies faster.

9 Action penalties are often used to discourage the agent from exhibiting unwanted behavior. However, increasing the action penalty increases the difficulty of exploration. Therefore, optimistic exploration is especially useful in settings where we want to practically disallow many actions by attributing large penalties to them.

13.3.3 Constrained Exploration


Besides making exploration more efficient, another use of uncertainty
is to make exploration more safe. Today, reinforcement learning is still
far away from being deployed directly to the real world. In practice,
reinforcement learning is almost always used together with a simula-
tor, in which the agent can “safely” train and explore. Yet, in many
domains, it is not possible to simulate the training process, either be-
cause we are lacking a perfect model of the environment, or because
simulating such a model is too computationally inefficient. This is where we can make use of uncertainty estimates of our model to avoid unsafe states.

Figure 13.7: Illustration of planning with confidence bounds. The unsafe set of states is shown as the red region. The agent starts at the position denoted by the black dot and plans a sequence of actions. The confidence bounds on the subsets of states that are reached by this action sequence are shown in gray.

Let us denote by Xunsafe the subset of unsafe states of our state space X. A natural idea is to perform planning using confidence bounds of our epistemic uncertainty. This allows us to pessimistically forecast the plausible consequences of our actions, given what we have already learned about the transition model. As we collect more data, the confidence bounds will shrink, requiring us to be less conservative over time. This idea is also at the heart of fields like robust control.

A general formalism for planning under constraints is the notion of


constrained Markov decision processes (Altman, 1999). Given dynamics f ,
planning in constrained MDPs amounts to the following optimization
problem:

" #
.
max
π
Jµ (π; f ) = Ex0 ∼µ,x1:∞ |π, f ∑ γt rt (13.27a)
t =0

subject to Jµc (π; f ) ≤ δ (13.27b)

where µ is a distribution over the initial state and



" #
.
Jµc (π; f ) = Ex0 ∼µ,x1:∞ |π, f ∑ γ c( xt )
t
(13.28)
t =0

are expected discounted costs with respect to a cost function c : X → R≥0 .10 10 It is straightforward to extend this
. framework to allow for multiple con-
Observe that for the cost function c( x) = 1{ x ∈ Xunsafe }, the value
straints.
Jµc (π; f ) can be interpreted as an upper bound on the (discounted)
probability of visiting unsafe states under dynamics f ,11 and hence, 11
This follows from a simple union
c
the constraint Jµ (π; f ) ≤ δ bounds the probability of visiting an un- bound (1.73).

safe state when following policy π.

The augmented Lagrangian method can be used to solve the optimization


problem of Equation (13.27).12 Thereby, one solves

   max_π min_{λ≥0} J_µ(π; f) − λ (J^c_µ(π; f) − δ)   (13.29)

   = max_π { J_µ(π; f)   if π is feasible
             −∞          otherwise.   (13.30)

12 For an introduction, read chapter 17 of “Numerical optimization” (Wright et al., 1999).

Observe that if π is feasible, then J^c_µ(π; f) ≤ δ and so the minimum over λ is attained at λ = 0. Conversely, if π is infeasible, the inner minimization drives the objective to −∞ as λ can be made arbitrarily large. In practice,
an additional penalty term is added to smooth the objective when tran-
sitioning between feasible and infeasible policies.
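The sketch below illustrates the basic Lagrangian relaxation with dual ascent (omitting the quadratic augmentation and smoothing penalty) on a toy problem with a small, finite set of candidate policies whose values J and costs J^c are simply made up. The primal step picks the policy maximizing the current Lagrangian; the dual step increases λ whenever the constraint is violated.

    import numpy as np

    # Hypothetical candidate policies with expected reward J and expected cost Jc.
    J = np.array([10.0, 8.0, 6.0, 3.0])
    Jc = np.array([5.0, 2.5, 1.0, 0.2])
    delta = 1.5                                          # constraint: Jc(pi) <= delta

    lam, best_feasible = 0.0, None
    for k in range(1, 101):
        pi = int(np.argmax(J - lam * (Jc - delta)))      # primal step
        if Jc[pi] <= delta and (best_feasible is None or J[pi] > J[best_feasible]):
            best_feasible = pi                           # keep the best feasible iterate
        lam = max(0.0, lam + (0.5 / k) * (Jc[pi] - delta))   # dual ascent, decaying step size

    print("best feasible policy:", best_feasible, "final lambda:", round(lam, 2))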

Note that solving constrained optimization problems such as Equa-


tion (13.27) yields an optimal safe policy. However, it is not ensured
that constraints are not violated during the search for the optimal pol-
icy. Generally, exterior penalty methods such as the augmented La-
grangian method allow for generating infeasible points during the
search, and are therefore not suitable when constraints have to be
strictly enforced at all times. Thus, this method is more applicable
in the episodic setting (e.g., when an agent is first “trained” in a simu-
lated environment and then “deployed” to the actual task) rather than
in the continuous setting where the agent has to operate in the “real
world” from the beginning and cannot easily be reset.

Remark 13.14: Barrier functions


The augmented Lagrangian method is merely one example of op-
timizing a joint objective of maximizing rewards and minimizing
costs to find a policy with bounded costs. Another example of
this approach are so-called Barrier functions which augment the
reward objective with a smoothed approximation of the boundary
of a set Xunsafe . That is, one solves

   max_π J_µ(π; f) − λ · B^c_µ(π; f)   (13.31)

for some λ > 0 where the barrier function B^c_µ(π; f) goes to infinity as a state on the boundary of Xunsafe is approached. Examples for barrier functions are −log c(x) and 1/c(x).

Barrier functions are an interior penalty method which ensure that


points are feasible during the search for the optimal solution.

So far, we have assumed knowledge of the true dynamics to solve the


optimization problem of Equation (13.27). If we do not know the true
dynamics but instead have access to a set of plausible models M(D)
given data D (cf. Section 13.3.2) which encompasses the true dynam-
ics, then a natural strategy is to be optimistic with respect to future

rewards and pessimistic with respect to future constraint violations.


More specifically, we solve the optimization problem,

   max_π max_{f∈M(D)} J_µ(π; f)   (13.32a)   [optimistic]

   subject to max_{f∈M(D)} J^c_µ(π; f) ≤ δ.   (13.32b)   [pessimistic]

Intuitively, jointly maximizing Jµ (π; f ) with respect to π and (plausi-


ble) f can lead the agent to try behaviors with potentially high reward
due to optimism (i.e., the agent “dreams” about potential outcomes).
On the other hand, being pessimistic with respect to constraint viola-
tions enforces the safety constraints (i.e., the agent has “nightmares”
about potential outcomes).

If the policy values Jµ (π; f ) and Jµc (π; f ) are modeled as a distribu-
tion (e.g., using Bayesian deep learning), then the inner maximiza-
tion over plausible dynamics can be approximated using samples from
the posterior distributions. Thus, the augmented Lagrangian method
can also be used to solve the general optimization problem of Equa-
tion (13.32). The resulting algorithm is known as Lagrangian model-
based agent (LAMBDA) (As et al., 2022).

13.3.4 Safe Exploration


As noted in the previous section, in many settings we do not only
want to eventually find a safe policy, but we also want to ensure that
we act safely while searching for an optimal policy. To this end, re-
call the general approach outlined in the beginning of the previous
section wherein we plan using (pessimistic) confidence bounds of the
plausible consequences of our actions.

One key challenge of this approach is that we need to forecast plausi-


ble trajectories. The confidence bounds of such trajectories cannot be
nicely represented anymore. One approach is to over-approximate the
confidence bounds over reachable states from trajectories.
Figure 13.8: Illustration of long-term consequence when planning a finite number of steps. Green dots are to denote safe states and the red dot is to denote an unsafe state. After performing the first action, the agent is still able to return to the previous state. Yet, after reaching the third state, the agent is already guaranteed to end in an unsafe state. When using only a finite horizon of H = 2 for planning, the agent might make this transition regardless.

Theorem 13.15 (informal, Koller et al. (2018)). For conditional Gaussian dynamics, the reachable states of trajectories can be over-approximated with high probability.

Another key challenge is that actions might have consequences that exceed the time horizon used for planning. In other words, by performing an action now, our agent might put itself into a state that is not yet unsafe, but out of which it cannot escape and which will eventually lead to an unsafe state. You may think of a car driving towards a wall. When a crash with the wall is designated as an unsafe state, there are quite a few states in advance at which it is already impossible

to avoid the crash. Thus, looking ahead a finite number of steps is not
sufficient to prevent entering unsafe states.

It turns out that it is still possible to use the epistemic uncertainty


about the model and short-term plausible behaviors to make guaran-
tees about certain long-term consequences. One idea is to also consider a set of safe states Xsafe alongside the set of unsafe states Xunsafe, inside of which our agent knows how to stay (i.e., remain safe). In other words, for states x ∈ Xsafe, we know that our agent can always behave in such a way that it will not reach an unsafe state.13 An illustration of this approach is shown in Figure 13.9.

13 In the example of driving a car, the set of safe states corresponds to those states where we know that we can still safely brake before hitting the wall.
Figure 13.9: Illustration of long-term consequence when planning a finite number of steps. The unsafe set of states is shown in red and the safe set of states Xsafe is shown in blue. The confidence intervals corresponding to actions of the performance plan and safety plan are shown in orange and green, respectively.

The problem is that this set of safe states might be very conservative.
That is to say, it is likely that rewards are mostly attained outside of
the set of safe states. The key idea is to plan two sequences of actions,
instead of only one. “Plan A” (the performance plan) is planned with
the objective to solve the task, that is, attain maximum reward. “Plan
B” (the safety plan) is planned with the objective to return to the set of
safe states Xsafe . In addition, we enforce that both plans must agree on
the first action to be played.

Under the assumption that the confidence bounds are conservative


estimates, this guarantees that after playing this action, our agent will
still be in a state of which it can return to the set of safe states. In this
way, we can gradually increase the set of states that are safe to explore.
It can be shown that under suitable conditions, returning to the safe
set can be guaranteed (Koller et al., 2018).

13.3.5 Safety Filters


An alternative approach is to (slightly) adjust a potentially unsafe pol-
icy π to obtain a policy π̂ which avoids entering unsafe states with
high probability.

Following our interpretation of constrained policy optimization in terms


of optimism and pessimism from Section 13.3.3, to pessimistically eval-
uate the safety of a policy with respect to a cost function c given a set
of plausible models M(D), we can use
   C^π(x) := max_{f∈M(D)} J^c_{δ_x}(π; f)   (13.33)   [δ_x is the point density at x]

where the initial-state distribution δ_x is to denote that the initial state is x. Observe that Equation (13.33) permits a reparameterization in terms of additional decision variables η which is analogous to our discussion in Section 13.3.2. Concretely, we have

   C^π(x) = max_η J^c_{δ_x}(π; f̂)   (13.34)

where f̂ are the adjusted dynamics (13.26) which are based on the “luck” variables η. In the context of estimating costs which we aim to minimize (as opposed to rewards which we aim to maximize), η can be interpreted as the “bad luck” of the agent.

When our only objective is to act safely, that is, we only aim to mini-
mize cost and are indifferent to rewards, then this reparameterization
allows us to find a “maximally safe” policy,
   π_safe := arg min_π E_{x∼µ}[C^π(x)] = arg min_π max_η J^c_µ(π; f̂).   (13.35)

Under some conditions, π_safe can be shown to satisfy π_safe(x) = π^x(x) for any x where π^x := arg min_π C^π(x) (Curi et al., 2022, proposition
4.2). On its own, the policy πsafe is rather useless for exploring the
state space. In particular, when in a state that we already deem safe,
following policy πsafe , the agent aims simply to “stay” within the set of
safe states which means that it has no incentive to explore / maximize
reward.

Instead, one can interpret πsafe as a “backup” policy in case we realize


upon exploring that a certain trajectory is too dangerous, akin to our
notion of the “safety plan” in Section 13.3.4. That is, given any (poten-
tially explorative and dangerous) policy π, we can find the adjusted
policy,
   π̂(x) := arg min_{a∈A} d(π(x), a)   (13.36a)
   subject to max_η C^{π_safe}(f̂(x, a)) ≤ δ   (13.36b)

for some metric d(·, ·) on A. The constraint ensures that the pessimistic next state f̂(x, a) is recoverable by following policy π_safe. In this way,

   π̃(x) := { π̂(x)       if C^{π_safe}(x) ≤ δ
              π_safe(x)   if C^{π_safe}(x) > δ   (13.37)

is the policy “closest” to π which is δ-safe with respect to the pessimistic cost estimates C^{π_safe} (Curi et al., 2022).14 This is also called a safety filter. Similar approaches using backup policies also allow safe exploration in the model-free setting (Sukhija et al., 2023; Widmer et al., 2023).

14 The policy π̃ is required as, a priori, it is not guaranteed that the state x_t will satisfy C^{π_safe}(x_t) ≤ δ for all t unless π̂ is replanned after every step.
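A minimal sketch of the filtering step (13.36) for a one-dimensional action space with a finite set of candidate actions. The pessimistic cost C^{π_safe} of the hallucinated next state and the backup policy are hypothetical stand-ins; in practice they would come from a learned model and a learned (maximally safe) policy.

    import numpy as np

    delta = 1.0
    candidate_actions = np.linspace(-1.0, 1.0, 41)

    # Hypothetical pessimistic cost-to-go of the backup policy from state x.
    def C_safe(x):
        return max(0.0, abs(x) - 2.0) * 5.0          # states with |x| <= 2 are recoverable

    # Hypothetical pessimistic next state (worst case over the "bad luck" eta).
    def f_pessimistic(x, a):
        mu, sigma, beta = 0.9 * x + a, 0.1, 2.0
        return mu + beta * sigma * np.sign(mu)       # pushed away from the origin

    def safety_filter(x, proposed_action, pi_safe):
        # Among actions whose pessimistic next state is recoverable, pick the closest one.
        safe = [a for a in candidate_actions if C_safe(f_pessimistic(x, a)) <= delta]
        if not safe:                                 # no recoverable action: fall back to pi_safe
            return pi_safe(x)
        return min(safe, key=lambda a: abs(a - proposed_action))

    pi_safe = lambda x: -0.5 * np.sign(x)            # hypothetical backup policy: move towards 0
    print(safety_filter(x=1.8, proposed_action=1.0, pi_safe=pi_safe))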

Optional Readings
• Curi, Berkenkamp, and Krause (2020).
Efficient model-based reinforcement learning through optimistic pol-
icy search and planning.
• Berkenkamp, Turchetta, Schoellig, and Krause (2017).
Safe model-based reinforcement learning with stability guarantees.
• Koller, Berkenkamp, Turchetta, and Krause (2018).
Learning-based model predictive control for safe exploration.
• As, Usmanova, Curi, and Krause (2022).
Constrained Policy Optimization via Bayesian World Models.
• Curi, Lederer, Hirche, and Krause (2022).
Safe Reinforcement Learning via Confidence-Based Filters.
• Turchetta, Berkenkamp, and Krause (2019).
Safe exploration for interactive machine learning.

Discussion

In this final chapter, we learned that leveraging a learned “world model”


for planning can be dramatically more sample-efficient than model-
free reinforcement learning. Additionally, such world models allow
for effective multitask learning since they can be reused across tasks
and only the reward function needs to be swapped out.15 Finally, we explored how to use probabilistic inference of our world model to drive exploration more effectively and safely.

15 A similar idea, though often classified as model-free, is to learn a goal-conditioned policy which is studied extensively in goal-conditioned reinforcement learning (Andrychowicz et al., 2017; Plappert et al., 2018; Park et al., 2024).
Problems

13.1. Bounding the regret of H-UCRL.

Analogously to our analysis of the UCB acquisition function for Bayesian


optimization, we can use optimism to bound the regret of H-UCRL. We
assume for simplicity that d = 1 and drop the index i in the follow-
ing. Let us denote by xt,k the k-th state visited during episode t when
following policy πt . We denote by xbt,k the corresponding state but un-
der the optimistic dynamics fb rather than the true dynamics f . The
instantaneous regret during episode t is given by

rt = JH (π ⋆ ; f ) − JH (πt ; f ) (13.38)
model-based reinforcement learning 297

where we take the objective to be JH (π; f ) = ∑kH=−01 r ( xt,k , π ( xt,k )).

You may assume that f ( x, π ( x )), r ( x, π ( x )), and σ( x, π ( x )) are Lips-


chitz continuous in x.
1. Show by induction that for any k ≥ 0, with high probability,

   |x̂_{t,k} − x_{t,k}| ≤ 2β_t ∑_{l=0}^{k−1} α_t^{k−1−l} σ_{t−1}(x_{t,l}, π_t(x_{t,l}))   (13.39)

   where α_t depends on the Lipschitz constants and β_t.

2. Assuming w.l.o.g. that α_t ≥ 1, show that with high probability,

   r_t ≤ O( β_t H α_t^{H−1} ∑_{k=0}^{H−1} σ_{t−1}(x_{t,k}, π_t(x_{t,k})) )   (13.40)

3. Let Γ_T := max_{π_1,...,π_T} ∑_{t=1}^T ∑_{k=0}^{H−1} σ²_{t−1}(x_{t,k}, π_t(x_{t,k})). Analogously to Problem 9.3, it can be derived that Γ_T ≤ O(H γ_T) if the dynamics are modeled by a Gaussian process.16 Bound the cumulative regret R_T = ∑_{t=1}^T r_t ≤ O( β_T H^{3/2} α_T^{H−1} √(T Γ_T) ).

   16 For details, see appendix A of Treven et al. (2024).

Thus, if the dynamics are modeled by a Gaussian process with kernel such that γ_T is sublinear, the regret of H-UCRL is sublinear.
A
Mathematical Background

A.1 Probability

A.1.1 Common discrete distributions

• Bernoulli Bern( p) describes (biased) coin flips. Its domain is Ω = {0, 1}


where 1 corresponds to heads and 0 corresponds to tails. p ∈ [0, 1]
is the probability of the coin landing heads, that is, P( X = 1) = p
and P( X = 0) = 1 − p.
• Binomial Bin(n, p) counts the number of heads in n independent
Bernoulli trials, each with probability of heads p.
• Categorical Cat(m, p1 , . . . , pm ) is a generalization of the Bernoulli
distribution and represents throwing a (biased) m-sided die. Its
domain is Ω = [m] and we have P( X = i ) = pi . We require pi ≥ 0
and ∑i∈[m] pi = 1.
• Multinomial Mult(n, m, p1 , . . . , pm ) counts the number of outcomes
of each side in n independent Categorical trials.
• Uniform Unif(S) assigns identical probability mass to all values in the set S. That is, P(X = x) = 1/|S| (∀x ∈ S).

A.1.2 Probability simplex

We denote by ∆m the set of all categorical distributions on m classes.


Observe that ∆m is an (m − 1)-dimensional convex polytope.

To see this, let us consider the m-dimensional space of probabilities [0, 1]m .
It follows from our characterization of the categorical distribution in
Appendix A.1.1 that there is a one-to-one correspondence between
probability distributions over m classes and points in the space [0, 1]m
where all coordinates sum to one. This (m − 1)-dimensional subspace
of [0, 1]m is also known as the probability simplex.

A.1.3 Universality of the Uniform and Sampling


Uniform distribution: The (continuous) uniform distribution Unif([a, b]) is the only distribution that assigns constant density to all points in the support [a, b]. That is, it has PDF p(u) = 1/(b − a) if u ∈ [a, b] and 0 otherwise, and CDF P(u) = (u − a)/(b − a) for u ∈ [a, b] (with P(u) = 0 for u < a and P(u) = 1 for u > b).

Sampling from a continuous random variable is a common and non-trivial problem. A family of techniques is motivated by the following general property of the uniform distribution, colloquially known as the “universality of the uniform”.

First, given a random variable X ∼ P, we call

   P^{−1}(u) := min{x | P(x) ≥ u}   for all 0 < u < 1   (A.1)

the quantile function of X. That is, P−1 (u) corresponds to the value x
such that the probability of X being at most x is u. If the CDF P is
invertible, then P−1 coincides with the inverse of P.

Theorem A.1 (Universality of the uniform). If U ∼ Unif([0, 1]) and P


is invertible, then P−1 (U ) ∼ P.

Proof. Let U ∼ Unif([0, 1]). Then,


 
P P−1 (U ) ≤ x = P(U ≤ P( x )) = P( x ). using P(U ≤ u) = u if U ∼ Unif([0, 1])

This implies that if we are able to sample from Unif([0, 1]),1 then we 1
This is done in practice using so-called
are able to sample from any distribution with invertible CDF. This pseudo-random number generators.

method is known as inverse transform sampling.

In the case of Gaussians, we learned that the CDF cannot be expressed


in closed-form (and hence, is not invertible), however, for practical pur-
poses, the quantile function of Gaussians can be approximated well.

A.1.4 Point densities and Dirac’s delta function

The Dirac delta function δα is a function satisfying δα ( x ) = 0 for any x ̸= α


R
and δα ( x ) dx = 1. δα is also called a point density at α.

As an example, let us consider the random variable Y = g( X ), which


is defined in terms of another random variable X and a continuously
differentiable function g : R → R. Using the sum rule and product
rule, we can express the PDF of Y as
Z Z
p(y) = p(y | x ) · p( x ) dx = δg( x) (y) · p( x ) dx. (A.2)

In Section 1.1.11, we discuss how one can obtain an explicit represen-


tation of pY using the “change of variables” formula.
mathematical background 301

A.1.5 Gradients of expectations


Gradient The gradient of a function f :
Rn → R at a point x ∈ Rn is
We often encounter gradients of expectations, ∇θ EX [ f (X, θ)].
∂ f ( x) ⊤
h i
.
∇ f ( x) = ∂∂xf ((1x)) · · · ∂x . (A.3)
Fact A.2 (Differentiation under the integral sign). Let f ( x, t) be differen- (n)

tiable in x, integrable in t, and be such that


∂ f ( x, t)
≤ g( x)
∂t
for some integrable function g. Then, if the distribution of X is not parame-
terized by t,2 That is, E[ f (X, t)] = p( x) · f ( x, t) dx
2
R

and p( x) does not depend on t.


∂E[ f (X, t)] ∂ f (X, t)
 
=E . (A.4)
∂t ∂t

Therefore, if the distribution of X is not parameterized by θ,

∇θ E[ f (X, θ)] = E[∇θ f (X, θ)]. (A.5)

A.2 Quadratic Forms and Gaussians

Definition A.3 (Quadratic form). Given a symmetric matrix A ∈ Rn×n ,


the quadratic form induced by A is defined as
n n
f A : Rn → R, x 7→ x⊤ Ax = ∑ ∑ A(i, j) · x(i) · x( j). (A.6)
i =1 j =1

We call A a positive definite matrix if all eigenvalues of A are positive.


Equivalently, we have f A ( x) > 0 for all x ∈ Rn \ {0} and f A (0) = 0.
Similarly, A is called positive semi-definite if all eigenvalues of A are
non-negative, or equivalently, if f A ( x) ≥ 0 for all x ∈ Rn . In particular,
p
if A is positive definite then f A ( x) is a norm (called the Mahalanobis
norm induced by A), and is often denoted by ∥ x∥ A .

If A is positive definite, then the sublevel sets of its induced quadratic


form f A are convex ellipsoids. Not coincidentally, the same is true for
the sublevel sets of the PDF of a normal distribution N (µ, Σ ), which
we have seen an example of in Figure 1.6. Hereby, the axes of the
ellipsoid and their corresponding squared lengths are the eigenvectors
and eigenvalues of Σ, respectively.

Remark A.4: Correspondence of quadratic forms and Gaussians


Observe that the PDF of a zero-mean multivariate Gaussian N (0, Σ )
is an exponentiated and appropriately scaled quadratic form in-
duced by the positive definite precision matrix Σ −1 . The constant
factor is chosen such that the resulting function is a valid proba-
302 probabilistic artificial intelligence

bility density function, that is, sums to one.

One important property of positive definiteness of Σ is that Σ can


be decomposed into the product of a lower-triangular matrix with its
transpose. This is known as the Cholesky decomposition.

Fact A.5 (Cholesky decomposition, symmetric matrix-form). For any


symmetric and positive (semi-)definite matrix A ∈ Rn×n , there is a decom-
position of the form

A = LL⊤ (A.7)

where L ∈ Rn×n is lower triangular and positive (semi-)definite.

We will not prove this fact, but it is not hard to see that a decomposi-
tion exists (it takes more work to show that L is lower-triangular).

Let A be a symmetric and positive (semi-)definite matrix. By the spec-


tral theorem of symmetric matrices, A = VΛV ⊤ , where Λ is a diagonal
matrix of eigenvalues and V is an orthonormal matrix of correspond-
ing eigenvectors. Consider the matrix
.
A /2 = VΛ /2 V ⊤
1 1
(A.8)
.
where Λ1/2
p
= diag{ Λ(i, i )}, also called the square root of A. Then,
A /2 A /2 = VΛ /2 V ⊤ VΛ /2 V ⊤
1 1 1 1

= VΛ /2 Λ /2 V ⊤
1 1

= VΛV ⊤ = A. (A.9)

It is immediately clear from the definition that A1/2 is also symmetric


and positive (semi-)definite.

Quadratic forms of positive semi-definite matrices are a generalization


of the Euclidean norm, as

∥ x∥2A = x⊤ Ax = x⊤ A /2 A /2 x = ( A /2 x)⊤ A /2 x = ∥ A /2 x∥22 ,


1 1 1 1 1
(A.10)

and in particular,
1
log N ( x; µ, Σ ) = − ∥ x − µ∥2Σ −1 + const (A.11)
2
1h i
= − x⊤ Σ −1 x − 2µ⊤ Σ −1 x + const, (A.12)
2
1
log N ( x; 0, I ) = − ∥ x∥22 + const. (A.13)
2

A.3 Parameter Estimation

In this section, we provide a more detailed account of parameter esti-


mation as outlined in Section 1.3.
mathematical background 303

A.3.1 Estimators
Suppose we are given a collection of independent samples x1 , . . . , xn
from some random vector X. Often, the exact distribution of X is un-
known to us, but we still want to “estimate” some property of this
distribution, for example its mean. We denote the property that we
aim to estimate from our sample by θ⋆ . For example, if our goal is
.
estimating the mean of X, then θ⋆ = E[X].

An estimator for a parameter θ⋆ is a random vector θ̂n that is a function


of n sample variables X1 , . . . , Xn whose distribution is identical to the
distribution of X. Any concrete sample { xi ∼ Xi }in=1 yields a concrete
estimate of θ⋆ .

Example A.6: Estimating expectations


The most common estimator for E[X] is the sample mean
n
. 1
Xn =
n ∑ Xi (A.14)
i =1
iid
where Xi ∼ X. Using a sample mean to estimate an expectation is
often also called Monte Carlo sampling, and the resulting estimate
is referred to as a Monte Carlo average.

Example A.7: MLE of Bernoulli random variables


Let us consider an i.i.d. sample x1:n of Bern( p) distributed random
variables. We want to estimate the parameter p using a MLE. We
have,

p̂MLE = arg max P( x1:n | p)


p
n
= arg max ∑ log P( xi | p)
p i =1
n
= arg max ∑ log p xi (1 − p)1− xi using the Bernoulli PMF, see
p i =1 Appendix A.1.1
n
= arg max ∑ xi log p + (1 − xi ) log(1 − p).
p i =1

Computing the first derivative with respect to p, we see that the


objective is maximized by
n
1
=
n ∑ xi . (A.15)
i =1

Thus, the maximum likelihood estimate for the parameter p of


X ∼ Bern( p) coincides with the sample mean X n .
304 probabilistic artificial intelligence

What does it mean for θ̂n to be a “good estimate” of θ⋆ ? There are


two canonical measures of goodness of an estimator: its bias and its
variance.

Clearly, we want E θ̂n = θ⋆ . Estimators that satisfy this property are


 

called unbiased. The bias, Eθ̂n θ̂n − θ⋆ , of an unbiased estimator is 0.


 

It follows directly from linearity of expectation (1.20) that the sample


mean is unbiased. In Appendix A.3.3, we will see that the variance
of the sample mean is small for reasonably large n for “light-tailed”
distributions.

Example A.8: Estimating variances


Analogously to the sample mean, the most common estimator for
the covariance matrix Var[X] is the sample variance (also called sam-
ple covariance matrix)
n
. 1
n − 1 i∑
Sn2 = (Xi − Xn )(Xi − Xn )⊤ (A.16)
=1
!
n 1 n ⊤
n − 1 n i∑

= Xi Xi − X n X n (A.17)
=1
iid
where Xi ∼ X. It can be shown that the sample variance is un-
biased ? . Intuitively, the normalizing factor is 1/n−1 because Sn2 Problem A.1
depends on the sample mean Xn , which is obtained using the
same samples. For this reason, Sn2 has n − 1 degrees of freedom.3 3
That is, any sample Xi can be recovered
using all other samples and the sample
mean.
The second desirable property of an estimator θ̂n is that its variance is
small.4 A common measure for the variance of an estimator of θ is the 4
The variance of estimators vector-
mean squared error, valued estimators θ̂n is typically studied
component wise.
h i
.
MSE(θ̂n ) = Eθ̂n (θ̂n − θ ⋆ )2 . (A.18)

The mean squared error can be decomposed into the estimator’s bias
and variance:
h i
MSE(θ̂n ) = Eθ̂n (θ̂n − θ ⋆ )2 = Var θ̂n + (E θ̂n − θ ⋆ )2 ,
   
(A.19) using (1.35) and
Varθ̂n θ̂n − θ ⋆ = Var θ̂n
   

the mean squared error of an estimator can be written as the sum of


its variance and its squared bias.

A desirable property is for θ̂n to converge to θ ⋆ in probability (A.28):

lim P θ̂n − θ ⋆ > ϵ = 0.



∀ϵ > 0 :
n→∞

Such estimators are called consistent, and a sufficient condition for con-
sistency is that the mean squared error converges to zero as n → ∞.5 5
It follows from Chebyshev’s inequality
MSE(θ̂n )
(A.72) that P θ̂n − θ ⋆ > ϵ ≤

ϵ2
.
mathematical background 305

Consistency is an asymptotic property. In practice, one would want to


know “how quickly” the estimator converges as n grows. To this end,
an estimator θ̂n is said to be sharply concentrated around θ ⋆ if

∀ϵ > 0 : P θ̂n − θ ⋆ > ϵ ≤ exp(−Ω(ϵ)),



(A.20)

where Ω(ϵ) denotes the class of functions that grow at least linearly
in ϵ.6 Thus, if an estimator is sharply concentrated, its absolute error 6
That is, h ∈ Ω(ϵ) if and only if
h(ϵ)
is bounded by an exponentially quickly decaying error probability. limϵ→∞ ϵ > 0. With slight abuse of
notation, we force h to be positive (so as
to ensure that the argument to the expo-
A.3.2 Heavy Tails nential function is negative) whereas in
the traditional definition of Landau sym-
It is often said that a sharply concentrated estimator θ̂n has small tails, bols, h is only required to grow linearly
in absolute value.
where “tails” refer to the “ends” of a PDF. Let us examine the differ-
ence between a light-tailed and a heavy-tailed distribution more closely. p( x )

Definition A.9. A distribution PX is said to have a heavy (right) tail if 10−1


its tail distribution function
10−3
.
P X ( x ) = 1 − PX ( x ) = P( X > x ) (A.21)
10−5

decays slower than that of the exponential distribution, that is, 10−7
5 10
P (x) x
lim sup X−λx = ∞ (A.22)
x →∞ e
Figure A.1: Shown are the right tails of
P (x) the PDFs of a Gaussian with mean 1 and
for all λ > 0. When lim supx→∞ PX ( x) > 1, the (right) tail of X is said variance 1, a exponential distribution
Y
to be heavier than the (right) tail of Y. with mean 1 and parameter λ = 1, and
a log-normal distribution with mean 1
It is immediate from the definitions that the distribution of an unbi- and variance 1 on a log-scale.
ased estimator is light-tailed if and only if the estimator is sharply
concentrated, so both notions are equivalent.

Example A.10: Light- and heavy-tailed distributions


• A Gaussian X ∼ N (0, 1) is light-tailed since its tail distribution
function is bounded by
Z ∞ Z ∞ 2
1 2 t 1 2 e−x /2
PX (x) = √ e−t /2 dt ≤ √ e−t /2 dt = √ . using t
x ≥1
x 2π x x 2π 2πx
(A.23)
p( x )
• The Laplace distribution with PDF
0.4
| x − µ|
 
Laplace( x; µ, h) ∝ exp − (A.24)
h 0.2

is light-tailed, but its tails decay slower than those of the Gaus-
0.0
sian. It is also “more sharply peaked” at its mean than the
0 2 4
Gaussian. x

Figure A.2: Shown are the right tails of


the PDFs of a Laplace distribution with
mean 0 and length scale 1 and a Gaus-
sian with mean 0 and variance 1.
306 probabilistic artificial intelligence

Heavy-tailed distributions frequently occur in many domains, the


following just serving as a few examples.
• A well-known example of a heavy-tailed distribution is the log-
normal distribution. A random variable X is logarithmically nor-
mal distributed with parameters µ and σ2 if log X ∼ N (µ, σ2 ).
The log-normal arises when modeling natural growth phenom-
ena or stock prices which are multiplicative, and hence, become
additive on a logarithmic scale.
• The Pareto distribution was originally used to model the distri-
bution of wealth in a society, but it is also used to model many
other phenomena such as the size of cities, the frequency of
words, and the returns on stocks. Formally, the Pareto distri-
bution is defined by the following PDF,

αcα
Pareto( x; α, c) = 1{ x ≥ c }, x∈R (A.25)
x α +1
where the tail index α > 0 models the “weight” of the right
tail of the distribution (larger α means lighter tail) and c > 0
corresponds to a cutoff threshold. The distribution is supported
on [c, ∞), and as α → ∞ it approaches a point density at c. The
Pareto’s right tail is P( x ) = ( xc )α for all x ≥ c.
• The Cauchy distribution arises in the description of numerous
physical phenomena. The CDF of a Cauchy distribution is

x−m
 
. 1 1
PX ( x ) = arctan + (A.26)
π τ 2

where X ∼ Cauchy(m, τ ) with location m and scale τ > 0.

Light- and heavy-tailed distributions exhibit dramatically different be-


havior. For example, consider the sample mean X n of n i.i.d. random
variables Xi with mean 1, and we are told that the sample mean is 5.
What does this tell us about the (conditional) distribution of X1 ? There
are many possible explanations for this observation, but they can be
largely grouped into two categories:
1. either many Xi are slightly larger than 1,
2. or very few Xi are much larger than 1.
Nair et al. (2022) term interpretation (1) the conspiracy principle and
interpretation (2) the catastrophe principle. It turns out that which of the
two principles applies depends on the tails of the distribution of X (1) .
If the tails are light, then the conspiracy principle applies, and if the
tails are heavy, then the catastrophe principle applies as is illustrated
in Figure A.3.

When working with light-tailed distributions, the conspiracy princi-


mathematical background 307

p ( x1 | x 2 = 5) Figure A.3: Shown are the conditional


distributions of X1 given the event that
the sample mean across two samples is
0.5 surprisingly large: X 2 = 5. We plot
conspiracy principle the examples where X is a light-tailed
0.4 Gaussian with mean 1, X is a heavy-
tailed log-normal with mean 1, and X
0.3 is exponential with mean 1. When X is
catastrophe principle light-tailed, the large mean is explained
0.2 by “conspiratory” samples X1 and X2 . In
contrast, when X is heavy-tailed, the large
mean is explained by a single “catas-
0.1
trophic” event. See also Nair et al.
(2022).
0.0

0 2 4 6 8 10
x1

ple tells us that outliers are rare, and hence, we can usually ignore
them. In contrast, when working with heavy-tailed distributions, the
catastrophe principle tells us that outliers are not just common but a
defining feature, and hence, we need to be careful to not ignore them.

Readings
For more details and examples of heavy-tailed distributions, refer
to “The Fundamentals of Heavy Tails” (Nair et al., 2022).

A.3.3 Mean Estimation and Concentration

We have seen that we desire two properties in estimators: namely,


(1) that they are unbiased; and (2) that their variance is small.7 For 7
That is, they are consistent and, ideally,
estimating expectations with the sample mean X n which is also known their variance converges quickly.

as a Monte Carlo approximation, we will see that both properties follow


from standard results in statistics.

We have already concluded that the sample mean (A.14) is an unbi-


ased estimator for EX. We will now see that the sample mean is also
consistent, that its limiting distribution can usually be determined ex-
plicitly, and in some cases it is even sharply concentrated. First, let
us recall the common notions of convergence of sequences of random
variables.

Definition A.11 (Convergence of random variables). Let { Xn }n∈N be


a sequence of random variables and X another random variable. We
say that,
1. Xn converges to X almost surely (also called convergence with proba-
308 probabilistic artificial intelligence

bility 1) if
n o
P ω ∈ Ω : lim Xn (ω ) = X (ω ) = 1, (A.27)
n→∞
a.s.
and we write Xn → X as n → ∞.
2. Xn converges to X in probability if for any ϵ > 0,

lim P(| Xn − X | > ϵ) = 0, (A.28)


n→∞
P
and we write Xn → X as n → ∞.
3. Xn converges to X in distribution if for all points x ∈ X (Ω) at which
PX is continuous,

lim PXn ( x ) = PX ( x ), (A.29)


n→∞
D
and we write Xn → X as n → ∞.
It can be shown that as n → ∞,
a.s. P D
Xn → X =⇒ Xn → X =⇒ Xn → X. (A.30)

Remark A.12: Convergence in mean square


Mean square continuity and mean square differentiability are a
generalization of continuity and differentiation to random pro-
cesses, where the limit is generalized to a limit in the “mean
square sense”.

A sequence of random variables { Xn }n∈N is said to converge to


the random variable X in mean square if
h i
lim E ( Xn − X )2 = 0. (A.31)
n→∞

Using Markov’s inequality (A.71) it can be seen that convergence


in mean square implies convergence in probability. Moreover, con-
vergence in mean square implies

lim EXn = EX and (A.32)


n→∞
lim EXn2 = EX 2 . (A.33)
n→∞

Whereas a deterministic function f ( x) is said to be continuous at


a point x⋆ if limx→ x⋆ f ( x) = f ( x⋆ ), a random process f ( x) is mean
square continuous at x⋆ if
h i
lim⋆ E ( f ( x) − f ( x⋆ ))2 = 0. (A.34)
x→ x

It can be shown that a random process is mean square continuous


at x⋆ iff its kernel function k ( x, x′ ) is continuous at x = x′ = x⋆ .
mathematical background 309

Similarly, a random process f ( x) is mean square differentiable at a


point x in direction i if ( f ( x + δei ) − f ( x))/δ converges in mean
square as δ → 0 where ei is the unit vector in direction i. This
notion can be extended to higher-order derivatives.

For the precise notions of mean square continuity and mean square
differentiability in the context of Gaussian processes, refer to sec-
tion 4.1.1 of “Gaussian processes for machine learning” (Williams
and Rasmussen, 2006).

Given a random variable X : Ω → R with finite variance, it can be


shown that
P
X n → EX (A.35)

which is known as the weak law of large numbers (WLLN) and which
establishes consistency of the sample mean ? . Using more advanced Problem A.2
tools, it is possible to show almost sure convergence of the sample
mean even when the variance is infinite:

Fact A.13 (Strong law of large numbers, SLLN). Given the random
variable X : Ω → R with finite mean. Then, as n → ∞,
a.s.
X n → EX. (A.36)

To get an idea of how quickly the sample mean converges, we can look
at its variance:
" #
1 n Var[ X ]
n i∑
 
Var X n = Var Xi = . (A.37)
=1
n

Remarkably, one cannot only compute its variance, but also its limiting
distribution. This is known as the central limit theorem (CLT) which
states that the prediction error of the sample mean tends to a normal
distribution as the sample size goes to infinity (even if the samples
themselves are not normally distributed).

Fact A.14 (Central limit theorem by Lindeberg-Lévy). Given the


random variable X : Ω → R with finite mean and finite variance. Then,
 
D Var[ X ]
X n → N EX, (A.38)
n
as n → ∞.

The central limit theorem makes the critical assumption that the vari-
ance of X is finite. This is not the case for many heavy-tailed distri-
310 probabilistic artificial intelligence

butions such as the Pareto or Cauchy distributions. One can general-


ize the central limit theorem to distributions with infinite variance,
with the important distinction that their limiting distribution is no
longer a Gaussian but also a heavy-tailed distribution with infinite
variance (Nair et al., 2022).

For a subclass of light-tailed distributions it is also possible to show


that the sample mean X n is sharply concentrated, which is a much
stronger and much more practical property than consistency. We will
consider the class of sub-Gaussian random variables which encompass
random variables whose tail probabilities decay at least as fast as those
of a Gaussian.

Definition A.15 (Sub-Gaussian random variable). A random variable


X : Ω → R is called σ-sub-Gaussian for σ > 0 if for all λ ∈ R,
 2 2
h
λ( X −EX )
i σ λ
E e ≤ exp . (A.39)
2

Example A.16: Examples of sub-Gaussian random variables


• If a random variable X is σ-sub-Gaussian, then so is − X.
• If X ∼ N (µ, σ2 ) then a simple calculation shows that

σ 2 λ2
h i  
.
φ X (λ) = E eλX = exp µλ + for any λ ∈ R
2
(A.40)

which is called the “moment-generating function” of the nor-


mal distribution, and implying that X is σ-sub-Gaussian.
• If the random variable X : Ω → R satisfies a ≤ X ≤ b with
probability 1 then X is b−2 a -sub-Gaussian. This is also known
as Hoeffding’s lemma.

Theorem A.17 (Hoeffding’s inequality). Let X : Ω → R be a σ-sub-


Gaussian random variable. Then, for any ϵ > 0,

nϵ2

P X n − EX ≥ ϵ ≤ 2 exp − 2 .

(A.41)

In words, the absolute error of the sample mean is bounded by an


exponentially quickly decaying error probability δ. Solving for n, we
obtain that for
2σ2 2
n≥ log (A.42)
ϵ2 δ
the probability that the absolute error is greater than ϵ is at most δ.
mathematical background 311

.
Proof of Theorem A.17. Let Sn = nX n = X1 + · · · + Xn . We have for any
λ, ϵ > 0 that

P X n − EX ≥ ϵ = P(Sn − ESn ≥ nϵ)



 
= P eλ(Sn −ESn ) ≥ enϵλ using that z 7→ eλz is increasing

≤ e−nϵλ E[eλ(Sn −ESn ) ] using Markov’s inequality (A.71)


n
= e−nϵλ ∏ E[eλ(Xi −EX ) ] using independence of the Xi
i =1
n
≤ e−nϵλ ∏ eσ
2 λ2 /2
using the characterizing property of a
i =1 σ-sub-Gaussian random variable (A.39)
nσ2 λ2
 
= exp −nϵλ + .
2

Minimizing the expression with respect to λ, we set λ = ϵ/σ2 , and


obtain
nσ2 λ2 nϵ2
    
P X n − EX ≥ ϵ ≤ min exp −nϵλ +

= exp − 2 .
λ >0 2 2σ
The theorem then follows from

P X n − EX ≥ ϵ = P X n − EX ≥ ϵ + P X n − EX ≤ −ϵ
  

and noting that the second term can be bounded analogously to the
first term by considering the random variables − X1 , . . . , − Xn .

The law of large numbers and Hoeffding’s inequality tell us that when X
is light-tailed, we can estimate EX very precisely with “few” samples
using a sample mean. Crucially, the sample mean requires independent
samples xi from X which are often hard to obtain.

Remark A.18: When the sample mean fails


While we have seen that the sample mean works well for light-
tailed distributions, it fails for heavy-tailed distributions. For ex-
ample, if we only assume that the variance of X is finite, then
the best known error rate is obtained by applying Chebyshev’s
inequality (A.72),
Var[ X ]
P X n − EX ≥ ϵ ≤

(A.43)
nϵ2
which decays only linearly in n. This is a result of the catastrophe
principle, namely, that outliers may be likely and therefore need
to be accounted for.

Robust methods of estimation in the presence of outliers or cor-


ruptions have been studied extensively in robust statistics. On the
312 probabilistic artificial intelligence

subject of mean estimation, approaches such as trimming outliers


before taking the sample mean (called the truncated sample mean)
or using the median-of-means which selects the median among k
sample means — each computed on a subset of the data — have
been shown to yield sharply concentrated estimates, even when
the variance of X is infinite (Bubeck et al., 2013).

A.3.4 Asymptotic Efficiency of Maximum Likelihood Estimation


In Section 1.3.1, we briefly discussed the asymptotic behavior of the
MLE, and we mentioned that the MLE can be shown to be “asymptot-
ically efficient”.

Let us first address the question of how the asymptotic covariance


matrix Sn looks like? It can be shown that Sn = In (θ)−1 where
.
In (θ) = EDn [ Hθ ℓnll (θ; Dn )] (A.44)

is the so-called Fisher information which captures the curvature of the


negative log-likelihood around θ. The Fisher information In (θ) can
be used to measure the “difficulty” of estimating θ as shown by the
Cramér-Rao lower bound:

Fact A.19 (Cramér-Rao lower bound). Let θ̂n be an unbiased estimator


of θ. Then,8 8
We use A ⪰ B as shorthand for A − B
being positive semi-definite. This partial
ordering of positive semi-definite matri-
Var θ̂n ⪰ In (θ)−1 .
 
(A.45) ces is called the Loewner order.

An estimator is called efficient if it achieves equality in the Cramér-Rao


lower bound, and the MLE is therefore asymptotically efficient.

A.3.5 Population Risk and Empirical Risk


The notion of “error” mentioned in Section 1.3 is typically captured by
a loss function ℓ(ŷ; y) ∈ R which is small when the prediction ŷ = fˆ( x)
is “close” to the true label y = f ⋆ ( x) and large otherwise. Let us fix a
distribution PX over inputs, which together with the likelihood from
Equation (1.56) induces an unknown joint distribution over input-label
pairs ( x, y) by P . The canonical objective is to best-approximate the
mappings ( x, y) ∼ P , that is, to minimize
h i
E( x,y)∼P ℓ( fˆ( x); y) . (A.46)

This quantity is also called the population risk. However, the underlying
distribution P is unknown to us. All that we can work with is the
mathematical background 313

iid
training data for which we assume Dn ∼ P . It is therefore natural to
consider minimizing
n
1
n ∑ ℓ( fˆ(xi ); yi ), Dn = {( xi , yi )}in=1 , (A.47)
i =1

which is known as the empirical risk.

However, selecting fˆ by minimizing the empirical risk can be problem-


atic. The reason is that in this case the model fˆ and the empirical risk
depend on the same data Dn , implying that the empirical risk will not
be an unbiased estimator of the population risk. This can result in a
model which fits the training data too closely, and which is failing to
generalize to unseen data — a problem called overfitting. We will dis-
cuss some (probabilistic) solutions to this problem in Section 4.4 when
covering model selection.

A.4 Optimization

Finding parameter estimates is one of the many examples where we


seek to minimize some function ℓ.9 The field of optimization has a rich 9
W.l.o.g. we assume that we want to
history, which we will not explore in much detail here. What will be minimize ℓ. If we wanted to maximize
the objective, we can simply minimize its
important for us is that given that the function to be optimized (called negation.
the objective or loss) fulfills certain smoothness properties, optimization
is a well-understood problem and can often be solved exactly (e.g.,
when the objective is convex) or “approximately” when the objective
is non-convex. In fact, we will see that it is often advantageous to
frame problems as optimization problems when suitable because the
machinery to solve these problems is so extensive.

Readings
For a more thorough reminder of optimization methods, read
chapter 7 of “Mathematics for machine learning” (Deisenroth et al.,
2020).

A.4.1 Stationary Points


In this section, we derive some basic facts about unconstrained opti-
mization problems. Given some function f : Rn → R, we want to find
min f ( x). (A.48)
x ∈Rn

We say that a point x⋆ ∈ Rn is a (global) optimum of f if f ( x⋆ ) ≤ f ( x)


for any x ∈ Rn .

Consider the more general problem of minimizing f over some subset


S ⊆ Rn , that is, to minimize the function f S : S → R, x 7→ f ( x). If there
314 probabilistic artificial intelligence

exists some open subset S ⊆ Rn including x⋆ such that x⋆ is optimal


with respect to the function f S , then x⋆ is called a local optimum of f .

Remark A.20: Differentiability


We will generally assume that f is continuously (Fréchet) differ-
entiable on Rn . That is, at any point x ∈ Rn , there exists ∇ f ( x)
such that for any δ ∈ Rn ,

f ( x + δ) = f ( x) + ∇ f ( x)⊤ δ + o (∥δ∥2 ), (A.49)

o (∥δ∥2 )
where limδ→0 = 0, and ∇ f ( x) is continuous on Rn .
∥ δ ∥2 x3
Equation (A.49) is also called a first-order expansion of f at x. 20

Definition A.21 (Stationary point). Given a function f : Rn → R, a 0

point x ∈ Rn where ∇ f ( x) = 0 is called a stationary point of f .


−20
Being a stationary point is not sufficient for optimality. Take for exam-
. . −2
ple the point x = 0 of f ( x ) = x3 . Such a point that is stationary but 0 2
x
not (locally) optimal is called a saddle point.
Figure A.4: Example of a saddle point at
Theorem A.22 (First-order optimality condition). If x ∈ Rn is a local x = 0.
extremum of a differentiable function f : Rn → R, then ∇ f ( x) = 0.

Proof. Assume x is a local minimum of f . Then, for all d ∈ Rn and for


all small enough λ ∈ R, we have f ( x) ≤ f ( x + λd), so

0 ≤ f ( x + λd) − f ( x)
= λ ∇ f ( x ) ⊤ d + o ( λ ∥ d ∥2 ). using a first-order expansion of f
around x
Dividing by λ and taking the limit λ → 0, we obtain

o ( λ ∥ d ∥2 )
0 ≤ ∇ f ( x)⊤ d + lim = ∇ f ( x)⊤ d.
λ →0 λ
. y
Take d = −∇ f ( x). Then, 0 ≤ − ∥∇ f ( x)∥22 , so ∇ f ( x) = 0.
8
f
A.4.2 Convexity 6

4
Convex functions are a subclass of functions where finding global min-
2
ima is substantially easier than for general functions.
0
Definition A.23 (Convex function). A function f : Rn → R is convex if −2 0 2
x
∀ x, y ∈ Rn : ∀θ ∈ [0, 1] : f (θx + (1 − θ )y) ≤ θ f ( x) + (1 − θ ) f (y).
Figure A.5: Example of a convex func-
(A.50) tion. Any line between two points on f
lies “above” f .
That is, any line between two points on f lies “above” f . If the in-
equality of Equation (A.50) is strict, we say that f is strictly convex.
mathematical background 315

If the function f is convex, we say that the function − f is concave. The


above is also known as the 0th-order characterization of convexity. y

8
Theorem A.24 (First-order characterization of convexity). Suppose that f
f : Rn → R is differentiable, then f is convex if and only if 6

4
∀ x, y ∈ Rn : f (y) ≥ f ( x) + ∇ f ( x)⊤ (y − x). (A.51) 2

0
Observe that the right-hand side of the inequality is an affine function −2 0 2
with slope ∇ f ( x) based at f ( x). x

Figure A.6: The first-order characteriza-


Proof. In the following, we will make use of directional derivatives. tion characterizes convexity in terms of
In particular, we will use that given a function f : Rn → R that is affine lower bounds. Shown is an affine
lower bound based at x = −2.
differentiable at x ∈ Rn ,

f ( x + λd) − f ( x)
lim = ∇ f ( x)⊤ d (A.52)
λ →0 λ

for any direction d ∈ Rn ? . Problem A.3

• “⇒”: Fix any x, y ∈ Rn . As f is convex,

f ((1 − θ ) x + θy) ≤ (1 − θ ) f ( x) + θ f (y),

for all θ ∈ [0, 1]. We can rearrange to

f ((1 − θ ) x + θy) − f ( x) ≤ θ ( f (y) − f ( x)).


| {z }
x+θ (y− x)

Dividing by θ yields,

f ( x + θ (y − x)) − f ( x)
≤ f ( y ) − f ( x ).
θ
Taking the limit θ → 0 on both sides gives the directional derivative
at x in direction y − x,

∇ f ( x)⊤ (y − x) = D f ( x)[y − x] ≤ f (y) − f ( x).


.
• “⇐”: Fix any x, y ∈ Rn and let z = θy + (1 − θ ) x. We have,

f ( y ) ≥ f ( z ) + ∇ f ( z ) ⊤ ( y − z ), and

f ( x ) ≥ f ( z ) + ∇ f ( z ) ( x − z ).

We also have y − z = (1 − θ )(y − x) and x − z = θ ( x − y). Hence,

θ f (y) + (1 − θ ) f ( x) ≥ f (z) + ∇ f (z)⊤ (θ (y − z) + (1 − θ )( x − z))


| {z }
0
= f (θy + (1 − θ ) x).
316 probabilistic artificial intelligence

Theorem A.25. Let f : Rn → R be a convex and differentiable func-


tion. Then, if x⋆ ∈ Rn is a stationary point of f , x⋆ is a global minimum
of f .

Proof. By the first-order characterization of convexity, we have for any


y ∈ Rn ,

f ( y ) ≥ f ( x ⋆ ) + ∇ f ( x ⋆ ) ⊤ ( y − x ⋆ ) = f ( x ⋆ ).
| {z }
0

Generally, the main difficulty in solving convex optimization problems


lies therefore in finding stationary points (or points that are sufficiently
close to stationary points).

Remark A.26: Second-order characterization of convexity


We say that f : Rn → R is twice continuously (Fréchet) differen-
tiable if for any point x ∈ Rn , there exists ∇ f ( x) and H f ( x) such Jacobian Given a vector-valued func-
tion,
that for any δ ∈ Rn ,  
g1 ( x )
1 g : Rn → Rm ,
 .. 
f ( x + δ) = f ( x) + ∇ f ( x)⊤ δ + δ⊤ H f ( x)δ + o (∥δ∥22 ), (A.53) x 7→  . ,


2 gm ( x )
o (∥δ∥22 ) where gi : Rn → R, the Jacobian of g at
where limδ→0 = 0, and ∇ f ( x) and H f ( x) are continuous
∥δ∥22 x ∈ Rn is
on Rn . Equation (A.53) is also called a second-order expansion of f  ∂g ( x)
1 ( x)

1
∂x(1)
· · · ∂g
∂x(n)
at x. 
.  . .. .. 

Dg ( x) =  .. . .  .
 
Twice continuously differentiable functions admit a natural char- ∂gm ( x)
··· ∂gm ( x)
∂x(1) ∂x(n)
acterization of convexity in terms of the Hessian. (A.54)
Observe that for a function f : Rn → R,
Fact A.27 (Second-order characterization of convexity). Consider

a twice continuously differentiable function f : Rn → R. Then, f is D f ( x) = ∇ f ( x) . (A.55)

convex if and only if H f ( x) is positive semi-definite for all x ∈ Rn .


Hessian The Hessian of a function f :
It follows that f is concave if and only if H f ( x) is negative semi-
Rn → R at a point x ∈ Rn is
definite for all x ∈ Rn . .
H f ( x) = H f ( x)
 ∂ 2 f ( x) ∂ 2 f ( x)

···
 ∂x(1) ∂x(1) ∂x(1) ∂x(n)
.. ..

.  ..
A.4.3 Stochastic Gradient Descent =

 . . . 

∂ 2 f ( x) ∂ 2 f ( x)
∂x(n) ∂x(1)
··· ∂x(n) ∂x(n)
In this manuscript, we primarily employ so-called first-order methods,
(A.56)
which rely on (estimates of) the gradient of the objective function to ⊤
= ( D ∇ f ( x)) (A.57)
determine a direction of local improvement. The main idea behind
∈ Rn × n .
these methods is to repeatedly take a step in the opposite direction of
Thus, a Hessian captures the curvature
the gradient scaled by a learning rate ηt , which may depend on the of f . If the Hessian of f is positive def-
current iteration t. inite at a point x, then f is “curved up
around x”.
Hessians are symmetric when the sec-
ond partial derivatives are continuous,
due to Schwartz’s theorem.
mathematical background 317

We will often want to minimize a stochastic optimization objective


.
L(θ) = Ex∼ p [ℓ(θ; x)] (A.58)

where ℓ and its gradient ∇ℓ are known.

Based on our discussion in previous subsections, it is a natural first


step to look for stationary points of L, that is, the roots of ∇ L.

Fact A.28 (Robbins-Monro (RM) algorithm). Let g : Rn → Rm be


an unknown function of which we want to find a root and suppose that
we have access to unbiased noisy observations G(θ) of g (θ). The scheme
.
θt+1 = θt − ηt g (t) (θt ), (A.59)

where g (t) (θt ) ∼ G(θt ) are independent and unbiased estimates of


g (θt ), is known as the Robbins-Monro algorithm.

It can be shown that if the sequence of learning rates {ηt }∞


t=0 is chosen
such that the Robbins-Monro conditions,10 10
ηt = 1/t is an example of a learning
rate satisfying the RM-conditions.
∞ ∞
ηt ≥ 0 ∀t, ∑ ηt = ∞ and ∑ ηt2 < ∞, (A.60)
t =0 t =0

and additional regularity assumptions are satisfied,11 then we have that 11


For more details, refer to “Online
learning and stochastic approximations”
a.s. (Bottou, 1998).
g (θt ) → 0 as t → ∞. (A.61)

That is, the RM-algorithm converges to a root almost surely.12 12


See Equation (A.27) for a definition of
almost sure convergence of a sequence
of random variables.
Using Robbins-Monro to find a root of ∇ L is known as stochastic gradi-
ent descent. In particular, when ℓ is convex and the RM-conditions are
satisfied, Robbins-Monro converges to a stationary point (and hence,
a global minimum) of L almost surely. Moreover, it can be shown that
for general ℓ and satisfied RM-conditions, stochastic gradient descent
converges to a local minimum almost surely. Intuitively, the random-
ness in the gradient estimates allows the algorithm to “jump past”
saddle points.

A commonly used strategy to obtain unbiased gradient estimates is


to take the sample mean of the gradient with respect to some set of
samples B (also called a batch) as is shown in Algorithm A.29.

Example A.30: Minimizing training loss


A common application of SGD in the context of machine learning
.
is the following: p = Unif({ x1 , . . . , xn }) is a uniform distribution
318 probabilistic artificial intelligence

Algorithm A.29: Stochastic gradient descent, SGD


1 initialize θ
2 repeat
.
3 draw mini-batch B = { x(1) , . . . , x(m) }, x(i) ∼ p
4 θ ← θ − ηt m1 ∑im=1 ∇θ ℓ(θ; x(i) )
5 until converged

over the training inputs, yielding the objective


n
1
L(θ) =
n ∑ ℓ(θ; xi ), (A.62)
i =1

where ℓ(θ; xi ) is the loss of a model parameterized by θ for train-


ing input xi and m is a fixed batch size. Here, computing the gra-
dients exactly (i.e., choosing m = n) is expensive when the size of
the training set is large.

A commonly used alternative to sampling each batch indepen-


dently, is to split the training set into equally sized batches and
perform a gradient descent step with respect to each of them. Typ-
ically, the optimum is not found after a single pass through the
training data, so the same procedure is repeated multiple times.
One such iteration is called an epoch.

Remark A.31: Regularization via weight decay


A common technique to improve the “stability” of minima found
through gradient-based methods is to “regularize” the loss by
adding an explicit bias favoring “simpler” solutions. That is, given
a loss ℓ measuring the quality of fit, we consider the regularized
loss
.
ℓ′ (θ; x) = ℓ(θ; x) + r (θ) (A.63)
| {z } |{z}
quality of fit regularization

where r (θ) is large for “complex” and small for “simple” choices
of θ, respectively. A common choice is r (θ) = λ2 ∥θ∥22 for some
λ > 0, which is known as L2 -regularization.

Recall from Section 1.3.2 that in the context of likelihood maxi-


mization, this choice of r corresponds to imposing the Gaussian
prior N (0, λI ), and finding the MAP estimate. Imposing, for ex-
ample, a Laplace prior leads to L1 -regularization.
mathematical background 319

Using L2 -regularization, we obtain for the gradient,

∇θ ℓ′ (θ; x) = ∇θ ℓ(θ; x) + λθ. (A.64)

Thus, a gradient descent step changes to

θ ← (1 − ληt )θ − ηt ∇θ ℓ(θ; x). (A.65)

That is, in each step, θ decays towards zero at the rate ληt . This
regularization method is also called weight decay.

Remark A.32: Adaptive learning rates and momentum


Commonly, in optimization, constant learning rates are used to
accelerate mixing. The performance can be further improved by
taking an adaptive gradient step with respect to the geometry of
the cost function, which is done by commonly used algorithms
such as Adagrad and Adam. The methods employed by these algo-
rithms are known as adaptive learning rates and momentum.

For more details on momentum read section 7.1.2 of “Mathe-


matics for machine learning” (Deisenroth et al., 2020) and for
an overview of the aforementioned optimization algorithms re-
fer to “An overview of gradient descent optimization algorithms”
(Ruder, 2016).

A.5 Useful Matrix Identities and Inequalities

• Woodbury’s matrix identity states that for any matrices A ∈ Rn×n ,


C ∈ Rm×m , and U, V ⊤ ∈ Rn×m where A and C are invertible,
( A + UCV )−1 = A−1 − A−1 U (C −1 + VA−1 U )−1 VA−1 . (A.66)
The following identity, called the Sherman-Morrison formula, is a di-
rect consequence:
( A + xx⊤ )−1 = A−1 − A−1 x(1 + x⊤ A−1 x)−1 x⊤ A−1 (A.67a)
( A−1 x)( A−1 x)⊤
= A −1 − (A.67b)
1 + x ⊤ A −1 x
for any symmetric and invertible matrix A ∈ Rn×n and x ∈ Rn .
• The matrix inversion lemma states that for matrices A, B ∈ Rn×n ,
( A + B ) −1 = A −1 − A −1 ( A −1 + B −1 ) −1 A −1 . (A.68)
• Hadamard’s inequality states that
d
det( M ) ≤ ∏ M (i, i) (A.69)
i =0
320 probabilistic artificial intelligence

for any positive definite matrix M ∈ Rd×d .


• The Weinstein-Aronszajn identity for positive definite matrices A ∈
Rd×n and B ∈ Rn×d states that

det( Id×d + AB) = det( In×n + BA). (A.70)

Problems

A.1. The sample variance is unbiased.

Let X be a zero-mean random variable. Confirm that the sample vari-


ance of X is indeed unbiased.

A.2. Consistency of the sample mean.

We will now show that the sample mean is consistent. To this end, let
us first derive two classical concentration inequalities.
1. Prove Markov’s inequality which says that if X is a non-negative
random variable, then for any ϵ > 0,

EX
P( X ≥ ϵ ) ≤ . (A.71)
ϵ
2. Prove Chebyshev’s inequality which says that if X is a random vari-
able with finite and non-zero variance, then for any ϵ > 0,

VarX
P(| X − EX | ≥ ϵ) ≤ . (A.72)
ϵ2
3. Prove the weak law of large numbers from Equation (A.35).

A.3. Directional derivatives.

Given a function f : Rn → R that is differentiable at x ∈ Rn , the


directional derivative of f at x in the direction d ∈ Rn is

. f ( x + λd) − f ( x)
D f ( x)[d] = lim . (A.73)
λ →0 λ

Show that

D f ( x)[d] = ∇ f ( x)⊤ d. (A.74)

Hint: Consider a first-order expansion of f at x in direction λd.


B
Solutions

Fundamentals of Inference

Solution to Problem 1.1.


1. By the third axiom and B = A ∪ ( B \ A),

P( B ) = P( A ) + P( B \ A ).

Noting from the first axiom that P( B \ A) ≥ 0 completes the proof.


2. By the second axiom,

P A ∪ A = P( Ω ) = 1


and by the third axiom,

P A ∪ A = P( A ) + P A .
 

Reorganizing the equations completes the proof.


3. Define the countable sequence of events
 
−1
i[
.
Ai′ = Ai \  A j .
j =1

Note that the sequence of events { Ai′ }i is disjoint. Thus, we have


by the third axiom and then using (1) that
∞ ∞ ∞ ∞
! !
P Ai = P Ai = ∑ P Ai′ ≤ ∑ P( Ai ).

[ [ 
i =1 i =1 i =1 i =1

Solution to Problem 1.2. We show that any vertex v is visited even-


tually with probability 1.

We denote by w → v the event that the random walk starting at vertex


w visits the vertex v eventually, we denote by Γ(w) the neighborhood
322 probabilistic artificial intelligence

.
of w, and we write deg(w) = |Γ(w)|. We have,

P( w → v ) = ∑ P the random walk first visits w′ · P w′ → v


 
using the law of total probability (1.12)
w′ ∈Γ(w)
1
∑ P w′ → v .

= using that the random walk moves to a
deg(w) w′ ∈Γ(w) neighbor uniformly at random

Take u to be the vertex such that P(u → v) is minimized. Then,


1
P( u → v ) = ∑ P u ′ → v ≥ P( u → v ).

deg(u) u′ ∈Γ(u) | {z }
≥P( u → v )

That is, for all neighbors u′ of u, P(u → v) = P(u′ → v). Using that
the graph is connected and finite, we conclude P(u → v) = P(w → v)
for any vertex w. Finally, note that P(v → v) = 1, and hence, the
random walk starting at any vertex u visits the vertex v eventually
with probability 1.

Solution to Problem 1.3. Let 1{ Ai } be the indicator random variable


for the event Ai . Then,

E[X · 1{ Ai }] = E[E[X · 1{ Ai } | 1{ Ai }]] by the tower rule (1.25)

= E[ X · 1{ A i } | 1{ A i } = 1] · P(1{ A i } = 1) expanding the outer expectation


+ 0 · P(1{ A i } = 0)
= E[ X | A i ] · P( A i ). the event 1{ Ai } = 1 is equivalent to Ai

Summing up for all i, the left-hand side becomes


k
E[ X ] = ∑ E[X · 1{ Ai }].
i =1

.
Solution to Problem 1.4. Let Σ = Var[X] be a covariance matrix of
the random vector X and fix any z ∈ Rn . Then,
h i
z⊤ Σz = z⊤ E (X − E[X])(X − E[X])⊤ z using the definitiion of variance (1.34)
h i
= E z⊤ (X − E[X])(X − E[X])⊤ z . using linearity of expectation (1.20)

.
Define the random variable Z = z⊤ (X − E[X]). Then,
h i
= E Z2 ≥ 0.

Solution to Problem 1.5. Let us start by defining some events that


we will reason about later. For ease of writing, let us call the person in
question X.

D = X has the disease,


solutions 323

P = The test shows a positive response.

Now we can translate the information in the question to formal state-


ments in terms of D and P,

P( D ) = 10−4 the disease is rare

P( P | D ) = P P | D = 0.99.

the test is accurate

We want to determine P( D | P). One can find this probability by using


Bayes’ rule (1.45),

P( P | D ) · P( D )
P( D | P ) = .
P( P )

From the quantities above, we have everything except for P( P). This,
however, we can compute using the law of total probability,

P( P ) = P( P | D ) · P( D ) + P P | D · P D
 

= 0.99 · 10−4 + 0.01 · (1 − 10−4 )


= 0.010098.

Hence, P( D | P) = 0.99 · 10−4 /0.010098 ≈ 0.0098 = 0.98%.

Solution to Problem 1.6.


1. Suppose that X is not linearly independent. Then, there exists some
α ∈ Rn \ {0} such that α⊤ X = 0. We have that α is an eigenvector
of Var[X] with zero eigenvalue since
h i
α⊤ Var[X]α = Var α⊤ X = Var[0] = 0. using Equation (1.38)

2. Suppose that Var[X] has a zero eigenvalue. Let λ be the corre-


sponding eigenvector and consider the suggested linear combina-
.
tion Y = λ⊤ X. We have
h i
Var[Y ] = Var λ⊤ X

= λ⊤ Var[X]λ using Equation (1.38)


n
= ∑ λ (i ) · 0 using the property of a zero eigenvalue:
i =1 λ⊤ Var[X]λ = 0
=0

which implies that Y must be a constant.

Solution to Problem 1.7. We need to find µ ∈ Rn and Σ ∈ Rn×n such


that for all x ∈ Rn ,

( x − µ ) ⊤ Σ −1 ( x − µ )
= ( x − µ1 ) ⊤ Σ − 1 ⊤ −1
1 ( x − µ1 ) + ( x − µ2 ) Σ 2 ( x − µ2 ) + const.
324 probabilistic artificial intelligence

The left-hand side of Equation (B.1) is equal to

x⊤ Σ −1 x − 2x⊤ Σ −1 µ + µ⊤ Σ −1 µ.

The right-hand side of Equation (B.1) is equal to


   
x⊤ Σ − 1 ⊤ −1 ⊤ −1
1 x + x Σ 2 x − 2 x Σ 1 µ1 + x Σ 2 µ2
⊤ −1
 
+ µ1⊤ Σ − 1
1
µ 1 + µ ⊤ −1
2 Σ 2 µ 2
   
⊤ −1 −1 ⊤ −1 −1
= x Σ 1 + Σ 2 x − 2x Σ 1 µ1 + Σ 2 µ2
 
+ µ1⊤ Σ −
1
1
µ 1 + µ ⊤ −1
2 Σ 2 µ 2 .

We observe that both sides are equal up to constant terms if

Σ −1 = Σ − 1 −1
1 + Σ2 and Σ −1 µ = Σ − 1 −1
1 µ1 + Σ 2 µ2 .

Solution to Problem 1.8. Recall that independence of any random


vectors X and Y implies that they are uncorrelated,1 the converse im- 1
This is because the expectation of their
plication is a special property of Gaussian random vectors. product factorizes, see Equation (1.21).

.
Consider the joint Gaussian random vector Z = [X Y] ∼ N (µ, Σ ) and
assume that X ∼ N (µX , Σ X ) and Y ∼ N (µY , Σ Y ) are uncorrelated.
Then, Σ can be expressed as
" #
ΣX 0
Σ= ,
0 ΣY

implying that the PDF of Z factorizes,

N ([ x y]; µ, Σ )

1 1
= p exp − ( x − µX )⊤ Σ − 1
X ( x − µX )
det(2πΣ X ) · det(2πΣ Y ) 2

1
− (y − µY )⊤ Σ − 1
Y (y − µY )
2
= N ( x; µX , Σ X ) · N (y; µY , Σ Y ).

Solution to Problem 1.9. Let x ∼ X. The joint distribution can be


expressed as

p( x) = p( x A , x B )
 " #⊤ " #" #
1 1 x A − µ A Λ AA Λ AB xA − µA 
= exp−
Z 2 xB − µB Λ BA Λ BB xB − µB

where Z denotes the normalizing constant. To simplify the notation,


we write
" # " #
∆ A . xA − µA
= .
∆B xB − µB
solutions 325

1. We obtain

p( x A )
" #⊤ "
 # " #
1 1
Z
∆ A Λ AA Λ AB ∆A 
= exp− dx B using the sum rule (1.7)
Z 2 ∆B Λ BA Λ BB ∆B
 i
1 1h ⊤ −1
= exp − ∆ A (Λ AA − Λ AB Λ BB Λ BA )∆ A using the first hint
Z 2
 
1h
Z i
· exp − (∆ B + Λ− 1
BB Λ BA ∆ A )⊤ Λ BB (∆ B + Λ− 1
BB Λ BA ∆ A ) dx B .
2

Observe that the integrand is anrunnormalized Gaussian PDF, and


 
hence, the integral evaluates to det 2πΛ− 1
BB (so is constant with
respect to x A ). We can therefore simplify to
 i
1 1h ⊤ −1
= ′ exp − ∆ A (Λ AA − Λ AB Λ BB Λ BA )∆ A
Z 2
 i
1 1 h ⊤ −1
= ′ exp − ∆ A Σ AA ∆ A . using the second hint
Z 2

2. We obtain

p( x B | x A )
p( x A , x B )
= using the definition of conditional
p( x A ) distributions (1.10)
 " #⊤ " # " #
1 1 ∆A Λ AA Λ AB ∆ A 
= ′ exp− noting that p( x A ) is constant with
Z 2 ∆B Λ BA Λ BB ∆B respect to x B
 i
1 1h ⊤ −1
= ′ exp − ∆ A (Λ AA − Λ AB Λ BB Λ BA )∆ A using the first hint
Z  2 h
1 i
−1 ⊤ −1
· exp − (∆ B + Λ BB Λ BA ∆ A ) Λ BB (∆ B + Λ BB Λ BA ∆ A )
2
 i
1 1h −1 ⊤ −1
= ′′ exp − (∆ B + Λ BB Λ BA ∆ A ) Λ BB (∆ B + Λ BB Λ BA ∆ A ) . observing that the first exponential is
Z 2 constant with respect to x B

Finally, observe that

∆ B + Λ− 1 −1
BB Λ BA ∆ A = µ B − Λ BB Λ BA ( x A − µ A ) = µ B| A and using the third hint

Λ−
BB
1
= Σ BB − Σ BA Σ − 1
AA Σ AB = Σ B| A . using the second hint

Thus, x B | x A ∼ N (µ B| A , Σ B| A ).

Solution to Problem 1.10.


. .
1. Let Y = AX + b and define s = A⊤ t. We have for the MGF of Y
that
 
φY (t ) = E exp t ⊤ Y
326 probabilistic artificial intelligence

   
= E exp t ⊤ AX · exp t ⊤ b
   
= E exp s⊤ X · exp t ⊤ b
 
= φX (s) · exp t ⊤ b
 
1 ⊤  
= exp s µ + s Σs · exp t ⊤ b

2
 
⊤ 1 ⊤ ⊤
= exp t ( Aµ + b) + t AΣ A t ,
2

which implies that Y ∼ N ( Aµ + b, AΣ A⊤ ).


.
2. We have for the MGF of Y = X + X′ that
 
φY (t ) = E exp t ⊤ Y
   
= E exp t ⊤ X · E exp t ⊤ X′ using the independence of X and X′

= φX ( t ) · φX′ ( t )
   
⊤ 1 ⊤ ⊤ ′ 1 ⊤ ′
= exp t µ + t Σt · exp t µ + t Σ t
2 2
 
1
= exp t ⊤ (µ + µ′ ) + t ⊤ (Σ + Σ ′ )t ,
2

which implies that Y ∼ N (µ + µ′ , Σ + Σ ′ ).

Solution to Problem 1.11. Note that if Y ∼ N (0, 1) is a (univariate)


standard normal random variable, then
Z ∞  
1 1
E [Y ] = √ y · exp − y2 dy using the PDF of the standard normal
2π −∞ 2 distribution (1.5)
Z 0   Z ∞   
1 1 1
= √ y · exp − y2 dy + y · exp − y2 dy
2π −∞ 2 0 2
  0   ∞ !
1 1 1
= √ − exp − y2 + − exp − y2
2π 2 −∞ 2 0
1
= √ ([−1 + 0] + [0 + 1]) = 0,

h i
Var[Y ] = E Y 2 − E[Y ]2 using the definition of variance (1.35)
| {z }
0
Z ∞  
1 2 1 2
= √ y · exp − y dy using the PDF of the standard normal
2π −∞ 2 distribution (1.5)
Z 0   Z ∞   
1 1 1
= √ y2 · exp − y2 dy + y2 · exp − y2 dy
2π −∞ 2 0 2
  0 Z 0  
1 1 1
= √ −y · exp − y2 + exp − y2 dy integrating by parts
2π 2 −∞ −∞ 2
1 2 ∞
   Z ∞   
1 2
+ −y · exp − y + exp − y dy
2 0 0 2
solutions 327

Z 0   Z ∞   
1 1 1
= √ exp − y2 dy + exp − y2 dy
2π −∞ 2 0 2
Z ∞  
1 1
= √ exp − y2 dy = 1. a PDF integrates to 1
2π −∞ 2

Recall from Equation (1.54) that we can express X ∼ N (µ, Σ ) as

X = Σ /2 Y + µ
1

where Y ∼ N (0, I ). Using that Y is a vector of independent univariate


standard normal random variables, we conclude that E[Y] = 0 and
Var[Y] = I. We obtain
h i
E[X] = E Σ /2 Y + µ = Σ /2 E[Y] +µ = µ, and
1 1
using linearity of expectation (1.20)
| {z }
0
1 ⊤
h i
Var[X] = Var Σ /2 Y + µ = Σ /2 Var[Y] Σ /2 = Σ.
1 1
using Equation (1.38)
| {z }
I

Solution to Problem 1.12.


1. Yes. In the following, we give one example of a non-affine trans-
formation that preserves the Gaussianity of a random vector.
Let X ∼ N (µ, Σ ) be a Gaussian random vector in Rd . Define the
coordinate-wise transformation ϕ : Rd → Rd as (ϕ( x))i = ϕi ( xi )
with

−( x − µ ) + µ | x − µ | < 1
. i i i i i
ϕi ( xi ) =
 xi otherwise.

Intuitively, ϕi simply flips all the “mass” in a neighborhood of µi


on the i-th coordinate. Due to the symmetry of the Gaussian dis-
tribution around its mean, you could easily imagine that the trans-
formation still preserves the Gaussian property. Also, this function
cannot be an affine transformation since it is not continuous.
To prove Gaussianity, we can use the change of variables formula
(1.43). Let
 
.
Y = ϕ(X), pY (y) = pX (ϕ−1 (y)) · det Dϕ−1 (y) (∀y ∈ Rd ).

There are two cases for y:


(a) If |yi − µi | < 1, then xi = ϕi−1 (yi ) = −(yi − µi ) + µi . Also
note that in this case, the Jacobian matrix is simply a diagonal
matrix with −1 in the i-th position and 1 elsewhere. Hence, its
determinant is −1. We have
 
pY (y) = pX (ϕ−1 (y)) · det Dϕ−1 (y)
= pX (ϕ−1 (y))
328 probabilistic artificial intelligence

 .. 
.
 
= pX 
−(yi − µi ) + µi 
 
..
.
 . 
..
 
= pX 
  yi  
  since pX is symmetric w.r.t. µi
..
.
= p X ( y ).

(b) If |yi − µi | ≥ 1, then pY (y) = pX (y) since ϕ is the identity


function in this case.
Thus, Y has an identical PDF to X and is therefore Gaussian.
. X +YZ
2. Yes. It is difficult to calculate the distribution of W = √ di-
1+ Z 2
rectly since Z is in the denominator. The trick here is to first con-
dition W on Z, and then integrate over the distribution of Z. For
all z ∈ R, we have
X + YZ
 
p (W | Z = z ) = p √ Z=z
1 + Z2
X + Yz
 
=p √ since X, Y ⊥ Z
1 + z2
= N (W; 0, 1) as X + Yz ∼ N (0, 1 + z2 )

which means that W | Z = z ∼ N (0, 1) which is independent of z!


Using the law of total probability (1.12), we can now write
Z
p (W ) = p(W | Z = z) · p Z (z) dz
Z
= N (W; 0, 1) p Z (z) dz

= N (W; 0, 1).

Solution to Problem 1.13.


1. We have

Ey [(y − a)2 | x] = a2 − 2aEy [y | x] + const,

which is easily seen to be minimized when a = Ey [y | x].


2. We have
Z ∞ Z a
Ey [r (y, a) | x] = c1 (y − a) p(y | x) dy + c2 ( a − y) p(y | x) dy.
a −∞

We can differentiate this expression with respect to a:


Z ∞ Z a
d
Ey [r (y, a) | x] = −c1 p(y | x) dy + c2 p(y | x) dy
da a −∞
= − c1 P( y ≥ a | x ) + c2 P( y ≤ a | x )
solutions 329

= −c1 (1 − P(y ≤ a | x)) + c2 P(y ≤ a | x)


= − c1 + ( c1 + c2 )P( y ≤ a | x ).

Setting this equal to zero, we find the critical point condition

! c1
P( y ≤ a | x ) = .
c1 + c2
 
We obtain a⋆ ( x) = µ x + σx · Φ−1 c c+1 c2 by transforming to a stan-
1
dard normal random variable.

Linear Regression

Solution to Problem 2.1.


1. We begin by deriving the gradient and Hessian of the least squares
and ridge losses. For least squares,

∇w ∥y − Xw∥22 = ∇w (w⊤ X ⊤ Xw − 2w⊤ X ⊤ y + ∥y∥22 )


= 2( X ⊤ Xw − X ⊤ y),
Hw ∥y − Xw∥22 = 2X ⊤ X.

Similarly, for ridge regression,

∇w ∥y − Xw∥22 + λ ∥w∥22 = 2( X ⊤ Xw − X ⊤ y) + 2λw,


Hw ∥y − Xw∥22 + λ ∥w∥22 = 2X ⊤ X + 2λI.

From the assumption that the Hessians are positive definite, we


know that any minimizer is a unique globally optimal solution
(due to strict convexity), and that X ⊤ X and X ⊤ X + λI are invert-
ible.
Using the first-order optimality condition for convex functions, we
attain the solutions to least squares and ridge regression by setting
the respective gradient to 0.
2. We choose ŵls such that X ŵls = ΠX y. This implies that y −
X ŵls ⊥ Xw for all w ∈ Rd . In other words, (y − X ŵls )⊤ Xw = 0
for all w, which implies that (y − X ŵls )⊤ X = 0. By simple alge-
braic manipulation it can be seen that this condition is equivalent
to the gradient condition of the previous problem.

Solution to Problem 2.2. To prevent confusion with the length of the


data set n, we drop the subscript from the noise variance σn2 in the
following. We have
n
σ̂ = arg max ∑ log p(yi | xi , σ )
σ i =1
330 probabilistic artificial intelligence

n
1 1
= arg min ∑ log(2σ2 π ) + 2 (yi − w⊤ xi )2
σ i =1
2 2σ
n
n 1
= arg min
2
log σ2 + 2
2σ ∑ ( y i − w ⊤ x i )2 .
σ i =1

We can solve this optimization problem by differentiating and setting


to zero:
" #
∂ n 1 n
log σ + 2 ∑ (yi − w xi ) = 0
2 ⊤ 2
∂σ2 2 2σ i=1
n
n 1
2σ2
− 4

∑ ( y i − w ⊤ x i )2 = 0
i =1
n
1
n− 2
σ ∑ ( y i − w ⊤ x i ) 2 = 0. multiplying through by 2σ2
i =1

The desired result follows by solving for σ2 .

Solution to Problem 2.3. Let us first derive the variance of the least
squares estimate:
h i
Var[ŵls | X ] = Var ( X ⊤ X )−1 X ⊤ y X using Equation (2.4)

= ( X ⊤ X )−1 X ⊤ Var[y | X ] (( X ⊤ X )−1 X ⊤ )⊤ using Equation (1.38)


⊤ −1 ⊤ ⊤ −1
= (X X ) X Var[y | X ] X ( X X ) . (B.1) using ( A⊤ )−1 = ( A−1 )⊤

Due to the Gaussian likelihood (2.7), Var[y | X ] = σn2 I, so

= σn2 ( X ⊤ X )−1 . (B.2)

In the two-dimensional setting, i.e., data is of the form xi = [1 xi ]⊤ ( xi ∈


R), we have
" #
⊤ n ∑ xi
X X= .
∑ xi ∑ xi2

Thus,
" #
σn2 ∑ xi2 − ∑ xi
Var[ŵls | X ] = σn2 ( X ⊤ X )−1 = using Equation (B.2)
Z − ∑ xi n

.
where Z = n(∑ xi2 ) − (∑ xi )2 .

Therefore, the predictive variance at a point [1 x ⋆ ]⊤ is


" #
h i 1 σ2 σ2
1 x Var[ŵls ] ⋆ = n ∑ xi2 − 2xi x ⋆ + ( x ⋆ )2 = n

∑ ( x i − x ⋆ )2 .
x Z Z

Thus, the predictive variance is minimized for x ⋆ = 1


n ∑ xi .
solutions 331

Solution to Problem 2.4.


1. Recall from Section 2.0.1 that the MLE and least squares estimate
coincide if the noise is zero-mean Gaussian. We therefore have,
" #
0.63
ŵMLE = ŵls = ( X ⊤ X )−1 X ⊤ y = .
1.83

2. Recall from Equation (2.10) that w | X, y ∼ N (µ, Σ ) with


  −1
Σ = σn−2 X ⊤ X + σp−2 I and

µ = σn−2 ΣX ⊤ y.

Inserting σn2 = 0.1 and σp2 = 0.05 yields,


" #
0.019 −0.014
Σ= .
−0.014 0.019

Then,
# "
0.91
ŵMAP =µ= .
1.31

3. Recall from Equation (2.16) that y⋆ | X, y, x⋆ ∼ N (µ⋆y , σy2⋆ ) with

µ⋆y = µ⊤ x⋆ and σy2⋆ = x⋆ ⊤ Σx⋆ + σn2 .

Inserting x⋆ = [3 3]⊤ , and σn2 , µ and Σ from above yields,

µ⋆y = 6.66 and σy2⋆ = 0.186.

4. One would have to let σp2 → ∞.

Solution to Problem 2.5. We denote by Xt the design matrix and by


yt the vector of observations including the first t data points.
1. Note that
t t
Xt⊤ Xt = ∑ xi xi⊤ and Xt⊤ yt = ∑ yi xi .
i =1 i =1

This means that after observing the (t + 1)-st data point, we have
that

Xt⊤+1 Xt+1 = Xt⊤ Xt + xt+1 x⊤


t +1 and
Xt⊤+1 yt+1 = Xt⊤ yt + y t +1 x t +1 .

Hence, by just keeping Xt⊤ Xt (which is a d × d matrix) and Xt⊤ yt


(which is a vector in Rd ) in memory, and updating them as above,
we do not need to keep the whole data in memory.
332 probabilistic artificial intelligence

2. One has to compute (σn−2 Xt⊤ Xt + σp−2 I )−1 for finding µ and Σ in
every round. We can write

(σn−2 Xt⊤+1 Xt+1 + σp−2 I )−1 = σn2 ( Xt⊤+1 Xt+1 + σn2 σp−2 I )−1
= σn2 ( Xt⊤ Xt + σn2 σp−2 I + xt+1 x⊤
t +1 )
−1
| {z }
At

where At ∈ Rd×d . Using the Woodbury matrix identity (A.67) and


that we know the inverse of At (from the previous iteration), the
computation of the inverse of ( At + xt+1 x⊤ 2

t+1 ) is of O d , which
is much better than computing the inverse of ( At + xt+1 x⊤
t+1 ) from
scratch.

Solution to Problem 2.6. The law of total variance (2.18) yields the
following decomposition of the predictive variance,

Vary⋆ [y⋆ | x⋆ , x1:n , y1:n ] = Ew Vary⋆ [y⋆ | x⋆ , w] x1:n , y1:n


 

+ Varw Ey⋆ [y⋆ | x⋆ , w] x1:n , y1:n


 

wherein the first term corresponds to the aleatoric uncertainty and the
second term corresponds to the epistemic uncertainty.

The aleatoric uncertainty is given by


h i
Ew Vary⋆ [y⋆ | x⋆ , w] x1:n , y1:n = Ew σn2 x1:n , y1:n = σn2
 
using the definition of σn2 . (2.7)

For the epistemic uncertainty,


h i
Varw Ey⋆ [y⋆ | x⋆ , w] x1:n , y1:n = Varw w⊤ x⋆
 
x1:n , y1:n using that y⋆ = w⊤ x⋆ + ε where ε is
zero-mean noise
= x⋆ ⊤ Varw [w | x1:n , y1:n ] x⋆ using Equation (1.38)
⋆⊤ ⋆
=x Σx using Equation (2.10)

where Σ is the posterior covariance matrix.

Solution to Problem 2.7.


.
1. Let f = Xw. Since w | µ, λ ∼ N (µ, λ−1 Id ), we have

f | X, µ, λ ∼ N ( Xµ, λ−1 X X ⊤ ). using Equation (1.38)

Thus,
y | X, µ, λ ∼ N ( Xµ, λ−1 X X ⊤ + λ−1 In ).

2. We will denote the MLE by µ


b1 . We have

b1 = arg max p(y | X, µ, λ) = arg max log N (y; Xµ, Σ y ).


µ
µ µ

We can simplify

1
− log N (y; Xµ, Σ y ) = (y − Xµ)⊤ Σ − 1
y ( y − Xµ ) + const
2
1
= µ⊤ X ⊤ Σ − 1 ⊤ ⊤ −1
y Xµ − µ X Σ y y + const.
2
Taking the gradient with respect to µ, we obtain
 
1 ⊤ ⊤ −1
∇µ µ X Σ y Xµ − y⊤ Σ − y
1
Xµ + const = X ⊤Σ− 1 ⊤ −1
y Xµ − X Σ y y.
2

This is zero iff µ = ( X ⊤ Σ − 1 −1 ⊤ −1


y X ) X Σ y y.
3. By Bayes’ rule (1.45),

p(µ | X, y, λ) ∝ p(y | X, µ, λ) · p(µ).

Taking the negative logarithm,

− log p(µ | X, y, λ) = − log p(y | X, µ, λ) − log p(µ) + const.

Analogously to the previous question, we can simplify

1 1 ⊤
= (y − Xµ)⊤ Σ − 1
y ( y − Xµ ) + µ µ + const
2 2
1 ⊤ ⊤ −1 ⊤ ⊤ −1
= µ ( X Σ y X + Id )µ − µ X Σ y y + const.
2
This is a quadratic in µ, so the posterior distribution must be Gaus-
sian. By matching the terms with

1
log N ( x; µ′ , Σ ′ ) = − x⊤ Σ ′−1 x + x⊤ Σ ′−1 µ′ + const,
2
we obtain the covariance matrix ( X ⊤ Σ − 1
y X + Id )
−1 and the mean

vector Σ µ X ⊤ Σ − 1
y y.
.
e y = λΣ y . By Bayes’ rule (1.45),
4. Let Σ

p(λ | X, y, µ) ∝ p(λ) · p(y | X, µ, λ)


= e−λ · N (y; Xµ, Σ y )
 
−λ
−1/2 1 ⊤ −1
∝ e · det Σ y exp − (y − Xµ) Σ y (y − Xµ)
2
 −1/2 
1

−λ −1 e ⊤ −1 e −1
= e · det λ Σ y exp − (y − Xµ) (λ Σ y ) (y − Xµ)
2
 
−λ n/2 λ ⊤ e −1
∝ e ·λ exp − (y − Xµ) Σ y (y − Xµ) using that λ−1 Σ
e y is independent of λ
2
  
n/2 1 ⊤ e −1
= λ exp −λ 1 + (y − Xµ) Σ y (y − Xµ)
2
∝ Gamma(α, β)

with α = 1 + n
2 and β = 1 + 21 (y − Xµ)⊤ Σ
e− 1
y ( y − Xµ ).

Filtering

Solution to Problem 3.1. Recall from Equation (3.8) that

p(x_{t+1} | y_{1:t+1}) = (1/Z) p(x_{t+1} | y_{1:t}) p(y_{t+1} | x_{t+1}).   (B.3)

Using the sensor model (3.12b),

p(y_{t+1} | x_{t+1}) = (1/Z′) exp(−(y_{t+1} − x_{t+1})² / (2σ_y²)).

It remains to compute the predictive distribution,

p(x_{t+1} | y_{1:t}) = ∫ p(x_{t+1} | x_t) p(x_t | y_{1:t}) dx_t                                    (using Equation (3.9))
                     = (1/Z″) ∫ exp(−(1/2)[(x_{t+1} − x_t)²/σ_x² + (x_t − µ_t)²/σ_t²]) dx_t         (using the motion model (3.12a) and previous update)
                     = (1/Z″) ∫ exp(−(1/2)[σ_t²(x_{t+1} − x_t)² + σ_x²(x_t − µ_t)²]/(σ_t²σ_x²)) dx_t.

The exponent is the sum of two expressions that are quadratic in x_t. Completing the square allows rewriting any quadratic ax² + bx + c as the sum of a squared term a(x + b/(2a))² and a residual term c − b²/(4a) that is independent of x. In this case, we have a = (σ_t² + σ_x²)/(σ_t²σ_x²), b = −2(σ_t²x_{t+1} + σ_x²µ_t)/(σ_t²σ_x²), and c = (σ_t²x_{t+1}² + σ_x²µ_t²)/(σ_t²σ_x²). The residual term can be taken outside the integral, giving

p(x_{t+1} | y_{1:t}) = (1/Z″) exp(−(1/2)(c − b²/(4a))) ∫ exp(−(a/2)(x_t + b/(2a))²) dx_t.

The integral is simply the integral of a Gaussian over its entire support, and thus evaluates to 1. We are therefore left with only the residual term from the quadratic. Plugging back in the expressions for a, b, and c and simplifying, we obtain

p(x_{t+1} | y_{1:t}) = (1/Z″) exp(−(1/2)(x_{t+1} − µ_t)²/(σ_t² + σ_x²)).

That is, X_{t+1} | y_{1:t} ∼ N(µ_t, σ_t² + σ_x²).

Plugging our results back into Equation (B.3), we obtain

p(x_{t+1} | y_{1:t+1}) = (1/Z‴) exp(−(1/2)[(x_{t+1} − µ_t)²/(σ_t² + σ_x²) + (y_{t+1} − x_{t+1})²/σ_y²]).

Completing the square analogously to our derivation of the predictive distribution yields,

p(x_{t+1} | y_{1:t+1}) = (1/Z‴) exp(−(1/2) (x_{t+1} − [(σ_t² + σ_x²)y_{t+1} + σ_y²µ_t]/(σ_t² + σ_x² + σ_y²))² / [(σ_t² + σ_x²)σ_y²/(σ_t² + σ_x² + σ_y²)]).

Hence, X_{t+1} | y_{1:t+1} ∼ N(µ_{t+1}, σ_{t+1}²) as defined in Equation (3.13).
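To make the recursion concrete, here is a small sketch (ours, with assumed variable names) of the update (3.13) for this random-walk model: the predictive variance is first widened by σ_x², then combined with the new measurement.

import numpy as np

def kalman_update_1d(mu_t, var_t, y_next, var_x, var_y):
    # one step of the 1D Kalman filter for x_{t+1} = x_t + eps_t, y_t = x_t + eta_t
    var_pred = var_t + var_x                               # X_{t+1} | y_{1:t} ~ N(mu_t, var_t + var_x)
    k = var_pred / (var_pred + var_y)                      # Kalman gain
    mu_next = mu_t + k * (y_next - mu_t)                   # matches Equation (3.13)
    var_next = var_pred * var_y / (var_pred + var_y)
    return mu_next, var_next

rng = np.random.default_rng(1)
x, mu, var = 0.0, 0.0, 1.0
for _ in range(5):
    x += rng.normal(scale=1.0)                             # latent random walk, sigma_x = 1
    y = x + rng.normal(scale=0.5)                          # noisy observation, sigma_y = 0.5
    mu, var = kalman_update_1d(mu, var, y, 1.0, 0.25)
    print(round(mu, 3), round(var, 3))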

Solution to Problem 3.2. We prove the equivalence by induction.


Note that the base case is satisfied trivially since the priors are identi-
cal.

Assume after t − 1 steps that (µt−1 , Σ t−1 ) coincide with the BLR poste-
rior for the first t − 1 data points. We will show that the Kalman filter
update equations yield the BLR posterior for the first t data points.

Covariance update:

Σ t = Σ t −1 − k t x ⊤
t Σ t −1
(Σ t−1 xt )(Σ t−1 xt )⊤
= Σ t −1 − using the symmetry of Σ t−1
x⊤
t Σ t −1 x t + 1
  −1
= Σ− 1
t −1 + x x
t t

using the Sherman-Morrison
  −1 formula (A.67) with A−1 = Σ t−1
⊤ ⊤
= X1:t X
−1 1:t−1 + x x
t t + I by the inductive hypothesis and using
  −1 Equation (2.10)

= X1:t X1:t + I

= Σ BLR
t . using Equation (2.10)

Mean update:

Σ− 1 −1 −1 ⊤
t µ t = Σ t µ t −1 + Σ t k t ( y t − x t µ t −1 )
= Σ− 1 ⊤ −1 ⊤
t −1 µ t −1 + x t x t µ t −1 + Σ t k t ( y t − x t µ t −1 ) using Σ − 1 −1 ⊤
t = Σ t −1 + x t x t

= Σ− 1 ⊤ ⊤
t −1 µ t −1 + x t x t µ t −1 + x t ( y t − x t µ t −1 ) using Σ − 1
t kt = xt

= Σ− 1
t −1 µ t −1 + x t y t canceling terms

= X1:t −1 y1:t−1 + xt yt by the inductive hypothesis and using
⊤ Equation (2.10)
= X1:t y1:t
= Σ− 1 ⊤
t Σ t X1:t y1:t
= Σ− 1 BLR ⊤
t Σt X1:t y1:t using the covariance update above
−1 BLR
= Σ t µt . using Equation (2.10)

This completes the induction.

Solution to Problem 3.3.


1. The given parameter estimation problem can be formulated as a
Kalman filter in the following way:

   x_t = π   ∀t,
   y_t = x_t + η_t,   η_t ∼ N(0, σ_y²).

   Thus, in terms of Kalman filters, this yields f = h = 1, ε_t = σ_x² = 0.

   Using Equation (3.14), the Kalman gain is given by

   k_{t+1} = σ_t² / (σ_t² + σ_y²),

   whereas the variance of the estimation error σ_t² satisfies

   σ_{t+1}² = σ_y² k_{t+1} = σ_t²σ_y² / (σ_t² + σ_y²).          (using Equation (3.17))

   To get the closed form, observe that

   1/σ_{t+1}² = 1/σ_t² + 1/σ_y² = 1/σ_{t−1}² + 2/σ_y² = ⋯ = 1/σ_0² + (t+1)/σ_y²,

   yielding,

   σ_{t+1}² = σ_0²σ_y² / ((t+1)σ_0² + σ_y²)   and   k_{t+1} = σ_0² / ((t+1)σ_0² + σ_y²).

2. When t → ∞, we get k_{t+1} → 0 and σ_{t+1}² → 0, giving

   µ_{t+1} = µ_t + k_{t+1}(y_{t+1} − µ_t) = µ_t,               (using Equation (3.16))

   thus resulting in a stationary sequence.

3. Observe that σ_0² → ∞ implies k_{t+1} = 1/(t+1). Therefore,

   µ_{t+1} = µ_t + (1/(t+1))(y_{t+1} − µ_t)                    (using Equation (3.16))
           = (t/(t+1)) µ_t + y_{t+1}/(t+1)
           = (t/(t+1)) ((t−1)/t · µ_{t−1} + y_t/t) + y_{t+1}/(t+1)     (using Equation (3.16))
           = ((t−1)/(t+1)) µ_{t−1} + (y_t + y_{t+1})/(t+1)
           = ⋯
           = (y_1 + ⋯ + y_{t+1})/(t+1),
which is simply the sample mean.
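A short numerical check of this limiting case (our own sketch): with gain k_{t+1} = 1/(t + 1), the Kalman mean recursion reproduces the running sample mean exactly.

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=10)

mu = 0.0                                  # uninformative prior, sigma_0^2 -> infinity
for t, y_t in enumerate(y):
    mu = mu + (y_t - mu) / (t + 1)        # Kalman recursion with gain 1/(t+1)

print(mu, np.mean(y))                     # identical up to floating-point error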

Gaussian Processes

Solution to Problem 4.1.


1. We have for any j ∈ ℕ₀ and x, y ∈ ℝ,

   (1/√(j!)) e^{−x²/2} x^j · (1/√(j!)) e^{−y²/2} y^j = e^{−(x²+y²)/2} (xy)^j / j!.

   Summing over all j, we obtain

   ϕ(x)⊤ϕ(y) = ∑_{j=0}^∞ ϕ_j(x) ϕ_j(y)
             = e^{−(x²+y²)/2} ∑_{j=0}^∞ (xy)^j / j!
             = e^{−(x²+y²)/2} e^{xy}                  (using the Taylor series expansion for the exponential function)
             = e^{−(x−y)²/2}
             = k(x, y).

   A numerical sketch of this feature expansion follows after this solution.

2. As we have seen in Section 2.4, the effective dimension is n. The


crucial difference of kernelized regression (e.g., Gaussian processes)
to linear regression is that the effective dimension grows with the
sample size, whereas it is fixed for linear regression. Models where
the effective dimension may depend on the sample size are called
non-parametric models, and models where the effective dimension
is fixed are called parametric models.
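The following sketch (ours) truncates the feature expansion from part 1 after a few terms and compares it to the Gaussian kernel; a modest number of features already gives a close match for moderate |x| and |y|.

import numpy as np
from math import factorial

def phi(x, num_features=20):
    # truncated feature map of k(x, y) = exp(-(x - y)^2 / 2)
    js = np.arange(num_features)
    return np.exp(-x**2 / 2) * x**js / np.sqrt([factorial(j) for j in js])

x, y = 0.7, -0.4
print(phi(x) @ phi(y), np.exp(-(x - y)**2 / 2))   # nearly identical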

Solution to Problem 4.2.


1. Yes. This is merely a restriction of the standard Gaussian kernel in
R2 onto its subset S.
2. We have

   K v = [[1, e^{−π²/32}, e^{−π²/8}, e^{−π²/32}],
          [e^{−π²/32}, 1, e^{−π²/32}, e^{−π²/8}],
          [e^{−π²/8}, e^{−π²/32}, 1, e^{−π²/32}],
          [e^{−π²/32}, e^{−π²/8}, e^{−π²/32}, 1]] [−1, 1, −1, 1]⊤

       = (1 + e^{−π²/8} − 2e^{−π²/32}) [−1, 1, −1, 1]⊤,

   where the scalar factor is ≈ −0.18.

3. No. (2) gives an example of a negative eigenvalue in a covariance matrix. This implies that k_i cannot be positive semi-definite.

4. Yes. We have k_h(θ, θ′) = ⟨ϕ(θ), ϕ(θ′)⟩_{ℝ^L} with

   ϕ(θ) = (1/√C_κ) [ 1,
                     e^{−κ²/4} √2 cos(θ),
                     e^{−κ²/4} √2 sin(θ),
                     ⋮,
                     e^{−κ²(L−1)²/4} √2 cos((L−1)θ),
                     e^{−κ²(L−1)²/4} √2 sin((L−1)θ) ]⊤.

Solution to Problem 4.3. First we look at the mean,

µ ( t ) = E [ X t ] = E [ X t −1 + ε t −1 ] = E [ X t −1 ] = µ ( t − 1 ).

Knowing that µ(0) = 0, we can derive that µ(t) = 0 (∀t).

Now we look at the variance of X_t,

Var[X_t] = Var[X_0 + ∑_{τ=0}^{t−1} ε_τ] = σ_0² + tσ_x².          (using that the noise is independent)

Finally, we look at the distribution of [f(t) f(t′)]⊤ for arbitrary t ≤ t′.

[X_t, X_{t′}]⊤ = [X_t, X_t]⊤ + [0, ∑_{τ=t}^{t′−1} ε_τ]⊤.

Therefore, we get that

[X_t, X_{t′}]⊤ ∼ N( [0, 0]⊤, Var[X_t] [[1, 1], [1, 1]] + (t′ − t) [[0, 0], [0, σ_x²]] )
              = N( [0, 0]⊤, [[Var[X_t], Var[X_t]], [Var[X_t], Var[X_t] + (t′ − t)σ_x²]] ).

We take the kernel k_KF(t, t′) to be the covariance between f(t) and f(t′), which is Var[X_t] = σ_0² + σ_x² t. Notice, however, that we assumed t ≤ t′. Thus, overall, the kernel is described by

k_KF(t, t′) = σ_0² + σ_x² min{t, t′}.

Solution to Problem 4.4.


1. As f ∈ Hk (X ) we can express f ( x) for some β i ∈ R and xi ∈ X as

n
f ( x) = ∑ βi k(x, xi )
i =1
n
= ∑ βi ⟨k(xi , ·), k(x, ·)⟩k
i =1
* +
n
= ∑ βi k(xi , ·), k(x, ·)
i =1 k
= ⟨ f (·), k( x, ·)⟩k .

2. By applying Cauchy-Schwarz,

| f ( x) − f (y)| = | ⟨ f , k( x, ·) − k(y, ·)⟩k |


≤ ∥ f ∥k ∥k( x, ·) − k(y, ·)∥k

.
Solution to Problem 4.5. We denote by f A = Π A f the orthogonal
projection of f onto span{k( x1 , ·), . . . , k( xn , ·)} which implies that fˆA
is a linear combination of k( x1 , ·), . . . , k( xn , ·).
.
We then have that f A⊥ = f − f A is orthogonal to span{k( x1 , ·), . . . , k( xn , ·)}.
Therefore, for any i ∈ [n],
D E
f ( x) = ⟨ f , k ( xi , ·)⟩k = f A + f A⊥ , k( xi , ·) = ⟨ f A , k( xi , ·)⟩k = f A ( x)
k

which implies

L( f ( x1 ), . . . , f ( xn )) = L( f A ( x1 ), . . . , f A ( xn )).

Denoting the objective of Equation (4.19) by j( f ) and noting that ∥ f A ∥k ≤ ∥ f ∥k ,


.
we have that j( f A ) ≤ j( f ). Therefore, if fˆ minimizes j( f ), then fˆA = Π A fˆ
is also a minimizer since j( fˆA ) ≤ j( fˆ). Thus, we conclude that there
exists some α̂ ∈ Rn such that fˆA ( x) = ∑in=1 α̂i k( xi , x) minimizes j( f ).

Solution to Problem 4.6.


1. By the representer theorem (4.20), fˆ( x) = α̂⊤ k x,A . In particular, we
have f = K α̂ and therefore

1
− log p(y1:n | x1:n , f ) = ∥y − f ∥22 + const
2σn2
1
= 2 ∥y − K α̂∥22 + const.
2σn

The regularization term simplifies to

∥ f ∥2k = ⟨ f , f ⟩k = α̂⊤ K α̂ = ∥α̂∥2K .

Combining, we have that

1 1
α̂ = arg min ∥y − Kα∥22 + ∥α∥2K
α ∈Rn 2σn2 2

as desired. It follows by multiplying through with 2σn2 that λ = σn2 .


2. Expanding the objective determined in (1), we are looking for the
minimizer of

α⊤ (σn2 K + K 2 )α − 2y⊤ Kα + y⊤ y.

Differentiating with respect to the coefficients α, we obtain the min-


imizer α̂ = (K + σ_n²I)⁻¹y. Thus, the prediction at a point x⋆ is k_{x⋆,A}⊤(K + σ_n²I)⁻¹y, which coincides with the MAP estimate.
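As a sketch of how this estimate is used in practice (our own code, using a Gaussian kernel as an example), the coefficients α̂ = (K + σ_n²I)⁻¹y give predictions k_{x⋆,A}⊤α̂ that coincide with the GP posterior mean.

import numpy as np

def gaussian_kernel(A, B, h=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 h^2)) for all pairs of rows of A and B
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * h**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
sigma_n2 = 0.1**2

K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + sigma_n2 * np.eye(len(X)), y)    # alpha_hat
X_star = np.linspace(-3, 3, 5)[:, None]
print(np.round(gaussian_kernel(X_star, X) @ alpha, 2))       # predictions at test points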

Solution to Problem 4.7.


1. Applying the two hints Equations (4.54) and (4.55) to Equation (4.29)
yields,
!
∂ 1 ⊤ −1 ∂Ky,θ −1 1 −1 ∂Ky,θ
log p(y | X, θ) = y Ky,θ K y − tr Ky,θ .
∂θ j 2 ∂θ j y,θ 2 ∂θ j

We can simplify to
! !
1 −1 ∂Ky,θ −1 1 −1 ∂Ky,θ
= tr y⊤ Ky,θ −1 ∂Ky,θ −1
K y − tr Ky,θ using that y⊤ Ky,θ ∂θ j Ky,θ y is a scalar
2 ∂θ j y,θ 2 ∂θ j
!
1 −1 −1 ∂Ky,θ −1 ∂Ky,θ
= tr Ky,θ yy⊤ Ky,θ − Ky,θ using the cyclic property and linearity
2 ∂θ j ∂θ j of the trace
!
1 −1 −1 ⊤ ∂Ky,θ −1 ∂Ky,θ −1
= tr Ky,θ y(Ky,θ y) − Ky,θ using that Ky,θ is symmetric
2 ∂θ j ∂θ j
!
1 −1 ∂Ky,θ
= tr (αα⊤ − Ky,θ ) .
2 ∂θ j

2. We denote by K̃ the covariance matrix of y for the covariance func-


tion k̃, so Ky,θ = θ0 K̃. Then,

∂ 1  
log p(y | X, θ) = tr (θ0−2 K̃ −1 y(K̃ −1 y)⊤ − θ0−1 K̃ −1 )K̃ . using Equation (4.30)
∂θ0 2
Simplifying the terms and using linearity of the trace, we obtain
that
∂ 1  
log p(y | X, θ) = 0 ⇐⇒ θ0 = tr yy⊤ K̃ −1 .
∂θ0 n
.
If we define Λ̃ = K̃ −1 as the precision matrix associated to y for
the covariance function k̃, we can express θ0⋆ in closed form as
n n
1
θ0⋆ =
n ∑ ∑ Λ̃(i, j)yi y j . (B.4)
i =1 j =1

3. We immediately see from Equation (B.4) that θ0⋆ scales by s2 if y is


scaled by s.

Solution to Problem 4.8.


√ √ √
1. We have that s(·) ∈ [− 2, 2], and hence, s(∆ i ) is 2-sub-Gaussian.2 2
see Example A.16
It then follows from Hoeffding’s inequality (A.41) that

mϵ2
 
P(| f (∆ i )| ≥ ϵ) ≤ 2 exp − .
4
2. We can apply Markov’s inequality (A.71) to obtain
  ϵ 2 

⋆ ϵ ⋆ 2
P ∥∇ f (∆ )∥2 ≥ = P ∥∇ f (∆ )∥2 ≥
2r 2r

2
2rE ∥∇ f (∆⋆ )∥2

≤ .
ϵ
It remains to bound the expectation. We have

E ∥∇ f (∆⋆ )∥22 = E ∥∇s(∆⋆ ) − ∇k(∆⋆ )∥22


= E ∥∇s(∆⋆ )∥22 − 2∇k(∆⋆ )⊤ E∇s(∆⋆ ) + E ∥∇k(∆⋆ )∥22 . using linearity of expectation (1.20)

Note that E∇s(∆) = ∇k(∆) using Equation (A.5) and using that
s(∆) is an unbiased estimator of k(∆). Therefore,

= E ∥∇s(∆⋆ )∥22 − E ∥∇k(∆⋆ )∥22


≤ E ∥∇s(∆⋆ )∥22
≤ Eω∼ p ∥ω∥22 = σp2 . using that s is the cos of a linear
function in ω
3. Using a union bound (1.73) and then the result of (1),
!
T T
mϵ2
 
ϵ  ϵ
P ≤ ∑ P | f (∆ i )| ≥
[
| f (∆ i )| ≥ ≤ 2T exp − .
i =1
2 i =1
2 16

4. First, note that by contraposition,


ϵ ϵ
sup | f (∆)| ≥ ϵ =⇒ ∃i. | f (∆ i )| ≥ or ∥∇ f (∆⋆ )∥2 ≥ ,
∆∈M∆ 2 2r

and therefore,
! !
T
ϵ ϵ
P ≤P | f (∆ i )| ≥ ∪ ∥∇ f (∆⋆ )∥2 ≥
[
sup | f (∆)| ≥ ϵ
∆∈M∆ i =1
2 2r
!
T
ϵ  ϵ
≤P + P ∥∇ f (∆⋆ )∥2 ≥
[
| f (∆ i )| ≥ using a union bound (1.73)
i =1
2 2r
mϵ2 2rσp 2
   
≤ 2T exp − 4 + using the results from (2) and (3)
2 ϵ
≤ αr −d + βr2 using T ≤ (4 diam(M)/r )d

   2
2 2σp
with α = 2(4 diam(M))d exp − mϵ
24
and β = ϵ . Using the
hint, we obtain
d 2
= 2β d+2 α d+2
 !2d  d+1 2
23 σp diag(M) mϵ2
 
= 22  exp − 3
ϵ 2 ( d + 2)
2
σp diag(M) mϵ2
  
≤ 28 exp − 3
using
σp diam(M)
≥1
ϵ 2 ( d + 2) ϵ

5. We have
h i
σp2 = Eω∼ p ω⊤ ω

Z
= ω⊤ ω · p(ω) dω
Z
⊤0
= ω⊤ ω · eiω p(ω) dω.

∂2 ⊤∆ ⊤∆
eiω ω 2j eiω
R R
Now observe that ∂∆2j
p(ω) dω = − p(ω) dω. Thus,
 Z 

= −tr H∆ p(ω)eiω ∆ dω
∆=0
= −tr( H∆ k(0)). using that p is the Fourier transform of
k (4.38)
Finally, we have for the Gaussian kernel that
!
∂2 ∆⊤ ∆ 1
2
exp − 2 = − 2.
∂∆ j 2h h
∆ j =0

Solution to Problem 4.9.


.
1. We write f˜ = [ f f ⋆ ]⊤ . From the definition of SoR (4.48), we gather
" # !
−1
˜ ˜ K AU KUU
qSoR ( f | u) = N f ; −1 u, 0 . (B.5)
K⋆U KUU

We know that f˜ and u are jointly Gaussian, and hence, the marginal
distribution of f˜ is also Gaussian. We have for the mean and vari-
ance that

E f˜ = E E f˜ u
    
using the tower rule (1.25)
"" # #
−1
K AU KUU
= Eu −1 u using Equation (B.5)
K⋆U KUU
" #
−1
K AU KUU
= −1 E [ u ] = 0 using linearity of expectation (1.20)
K⋆U KUU
Var f˜ = E Var f˜ u +Var E f˜ u
     
using the law of total variance (1.41)
| {z }
0
"" # #
−1
K AU KUU
= Varu −1 u using Equation (B.5)
K⋆U KUU
−1 ⊤
" # " #
−1
K AU KUU K AU KUU
= −1 Var [u] −1 using Equation (1.38)
K⋆U KUU | {z } K⋆U KUU
KUU
" #
Q AA Q A⋆
= .
Q⋆ A Q⋆⋆

Having determined qSoR ( f , f ⋆ ), qSoR ( f ⋆ | y) follows directly using


the formulas for finding the Gaussian process predictive posterior
(4.6).
2. The given covariance function follows directly from inspecting the
derived covariance matrix, Var[ f ] = Q AA .

Variational Inference

Solution to Problem 5.1.


1. We have

∇w ℓlog (w⊤ x; y) = ∇w log(1 + exp(−yw⊤ x)) using the definition of the logistic loss
(5.13)
1 + exp(yw⊤ x)
= ∇w log
exp(yw⊤ x)
= ∇w log(1 + exp(yw⊤ x)) − yw⊤ x
1
= · exp(−yw⊤ x) · yx − yx using the chain rule
1 + exp(−yw⊤ x)
!
exp(yw⊤ x)
= −yx · 1 −
1 + exp(yw⊤ x)
= −yx · σ(−yw⊤ x). using the definition of the logistic
function (5.9)
2. As suggested in the hint, we compute the first derivative of σ,

. d d 1
σ′ (z) = σ(z) = using the definition of the logistic
dz dz 1 + exp(−z) function (5.9)
exp(−z)
= using the quotient rule
(1 + exp(−z))2
= σ(z) · (1 − σ(z)). using the definition of the logistic
function (5.9)
We get for the Hessian of ℓlog ,

Hw ℓlog (w⊤ x; y) = Dw ∇w ℓlog (w⊤ x; y) using the definition of a Hessian (A.57)


and symmetry
= −yx Dw σ(−yw⊤ x). using the gradient of the logistic loss
from (1)
We have Dw − yw⊤ x = −yx⊤ and Dz σ (z) = σ′ (z), and therefore,
using the chain rule of multivariate calculus (5.74),

= −yx · σ′ (−yw⊤ x) · (−yx⊤ )


= xx⊤ · σ′ (−yw⊤ x). using y2 = 1

Finally, recall that

σ (w⊤ x) = P(y = +1 | x, w) and


σ(−w⊤ x) = P(y = −1 | x, w).

Thus,

σ′ (−yw⊤ x) = P(Y ̸= y | x, w) · (1 − P(Y ̸= y | x, w))


= P(Y ̸= y | x, w) · P(Y = y | x, w)
= (1 − P(Y = y | x, w)) · P(Y = y | x, w)
= σ ′ ( w ⊤ x ).

3. By the second-order characterization of convexity (cf. Remark A.26),


a twice differentiable function f is convex if and only if its Hessian
.
is positive semi-definite. Writing c = σ′ (w⊤ x), we have for any
δ ∈ Rn that
 
δ⊤ Hw ℓlog (w⊤ x; y)δ = δ⊤ c · xx⊤ δ = c(δ⊤ x)2 ≥ 0, using c ≥ 0 and (·)2 ≥ 0

and hence, the logistic loss is convex in w.
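A small numerical sanity check (ours) of the gradient derived in part 1, comparing ∇_w ℓ_log(w⊤x; y) = −y x σ(−y w⊤x) against central finite differences.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, x, y):
    return np.log1p(np.exp(-y * (w @ x)))

rng = np.random.default_rng(0)
w, x, y = rng.normal(size=3), rng.normal(size=3), -1.0

grad_closed = -y * x * sigmoid(-y * (w @ x))
eps = 1e-6
grad_fd = np.array([(logistic_loss(w + eps * e, x, y) - logistic_loss(w - eps * e, x, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(grad_closed, grad_fd, atol=1e-5))   # True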

Solution to Problem 5.2.


1. Using the law of total probability (1.12),

p(y⋆ = +1 | x1:n , y1:n , x⋆ )


Z
= p(y⋆ = +1 | f ⋆ ) p( f ⋆ | x1:n , y1:n , x⋆ ) d f ⋆
Z
= σ ( f ⋆ ) p( f ⋆ | x1:n , y1:n , x⋆ ) d f ⋆ . using y⋆ ∼ Bern(σ( f ⋆ ))

Due to the non-Gaussian likelihood, the integral is analytically in-


tractable. However, as the integral is one-dimensional, numeri-
cal approximations such as the Gauss-Legendre quadrature can be
used.
2. (a) According to Bayes’ rule (1.45), we know that

ψ( f ) = log p( f | x1:n , y1:n ) using Equation (5.2)

= log p(y1:n | f ) + log p( f | x1:n ) − log p(y1:n | x1:n ) using Bayes’ rule (1.45)

= log p(y1:n | f ) + log p( f | x1:n ) + const.

Note that p( f | x1:n ) = N ( f ; 0, K AA ). Plugging in this closed-


form Gaussian distribution of the GP prior gives
1 ⊤ −1
= log p(y1:n | f ) − f K AA f + const
2
Differentiating with respect to f yields

∇ψ( f ) = ∇ log p(y1:n | f ) − K − 1


AA f
Hψ( f ) = H f log p(y1:n | f ) − K − 1
AA .
.
Hence, Λ = K − 1
AA + W where W = − H f log p ( y1:n | f ) f = fˆ .
It remains to derive W for the probit likelihood. Using inde-
pendence of the training examples,
n
log p(y1:n | f ) = ∑ log p(yi | f i ),
i =1

and hence, the Hessian of this expression is diagonal. Using


the symmetry of Φ(z; 0, σn2 ) around zero, we can write

log p(yi | f i ) = log Φ(yi f i ; 0, σn2 ).



. .
In the following, we write N (z) = N (z; 0, σn2 ) and Φ(z) =
Φ(z; 0, σn2 ) to simplify the notation. Differentiating with respect
to f i , we obtain

∂ y N ( fi )
log Φ(yi f i ) = i using N (yi f i ) = N ( f i ) since N is an
∂ fi Φ ( yi f i ) even function and yi ∈ {±1}
∂2 N ( f i )2 y f N ( fi )
log Φ(yi f i ) = − − i2 i ,
∂ fi2 Φ ( y i f i )2 σn Φ(yi f i )

2
and W = −diag{ ∂∂f 2 log Φ(yi f i )}in=1 .
i f = fˆ
(b) Note that Λ′ is a precision matrix over weights w and f = Xw,
so by Equation (1.38) the corresponding variance over latent
values f is XΛ′−1 X ⊤ . The two precision matrices are therefore
equivalent if Λ−1 = XΛ′−1 X ⊤ .
Analogously to Example 5.2 and Problem 5.1, we have that

W = − H f log p(y1:n | f ) f = fˆ

= diagi∈[n] {σ( fˆi )(1 − σ( fˆi ))}


= diagi∈[n] {πi (1 − πi )}.

By Woodbury’s matrix identity (A.66),

Λ′−1 = ( I + X ⊤ W X )−1
= I − X ⊤ (W −1 + X X ⊤ )−1 X.

Thus,

XΛ′−1 X ⊤ = X X ⊤ − X X ⊤ (W −1 + X X ⊤ )−1 X X ⊤
= K AA − K AA (W −1 + K AA )−1 K AA using that K AA = X X ⊤

= (K − 1
AA + W )
−1
by the matrix inversion lemma (A.68)
−1
=Λ .

(c) Using the formulas for the conditional distribution of a Gaus-


sian (1.53), evaluating a conditional GP at a test point x⋆ yields,

f ⋆ | x ⋆ , f ∼ N ( µ ⋆ , k ⋆ ), where (B.6a)
⋆ .
µ = k⊤ −1
x⋆ ,A K AA f , (B.6b)
.
k⋆ = k ( x⋆ , x⋆ ) − k⊤ −1
x⋆ ,A K AA k x⋆ ,A . (B.6c)

Then, using the tower rule (1.25),


h i
Eq [ f ⋆ | x⋆ , x1:n , y1:n ] = E f ∼q E f ⋆ ∼ p [ f ⋆ | x⋆ , f ] x1:n , y1:n

= k⊤ −1
x⋆ ,A K AA Eq [ f | x1:n , y1:n ] using Equation (B.6b) and linearity of
expectation (1.20)

= k⊤ −1 ˆ
x⋆ ,A K AA f . (B.7)

Observe that any maximum fˆ of ψ( f ) needs to satisfy ∇ψ( f ) =


0. Hence, fˆ = K AA (∇f log p(y1:n | f )), and the expectation
simplifies to

= k⊤
x⋆ ,A (∇f log p ( y1:n | f )).

Using the law of total variance (1.41),

Varq [ f ⋆ | x⋆ , x1:n , y1:n ]


h i
= E f ∼q Var f ⋆ ∼ p [ f ⋆ | x⋆ , x1:n , f ] x1:n , y1:n
h i
+ Var f ∼q E f ⋆ ∼ p [ f ⋆ | x⋆ , x1:n , f ] x1:n , y1:n
h i
= Eq [k⋆ | x1:n , y1:n ] + Varq k⊤ −1
x⋆ ,A K AA f x1:n , y1:n using Equations (B.6b) and (B.6c)

= k⋆ + k⊤ −1 −1
x⋆ ,A K AA Varq [ f | x1:n , y1:n ] K AA k x⋆ ,A . using that k⋆ is independent of f ,
Equation (1.38), and symmetry of K −
AA
1

Recall from (a) that Varq [ f | x1:n , y1:n ] = (K AA + W −1 ) −1 , so

= k ( x⋆ , x⋆ ) − k⊤ −1
x⋆ ,A K AA k x⋆ ,A plugging in the expression for k⋆ (B.6c)
+ k⊤ −1
x⋆ ,A K AA ( K AA + W
−1 −1 −1
) K AA k x⋆ ,A
= k ( x⋆ , x⋆ ) − k⊤ x⋆ ,A ( K AA + W
−1 −1 ⋆
) k x ,A . (B.8) using the matrix inversion lemma (A.68)

Notice the similarity of Equations (B.7) and (B.8) to Equations (B.6b)


and (B.6c). The latter is the conditional mean and variance if
f is known whereas the former is the conditional mean and
variance given the noisy observation y1:n of f . The matrix W
quantifies the noise in the observations.
(d) Recall that
Z
p(y⋆ = +1 | x1:n , y1:n , x⋆ ) ≈ σ ( f ⋆ )q( f ⋆ | x1:n , y1:n , x⋆ ) d f ⋆ using the Laplace-approximated latent
predictive posterior

= Eq [σ( f ⋆ )]. using LOTUS (1.22)

This quantity can be interpreted as the averaged prediction over


all latent predictions f ⋆ . In contrast, σ (Eq [ f ⋆ ]) can be under-
stood as the “MAP” prediction, which is obtained using the
MAP estimate of f ⋆ .3 As σ is nonlinear, the two quantities 3
As q is a Gaussian, its mode (i.e., the
are not identical, and generally the averaged prediction is pre- MAP estimate) and its mean coincide.

ferred.
3. We have

Eε 1 { f ( x ) + ε ≥ 0 } = Pε ( f ( x ) + ε ≥ 0 ) using E[ X ] = p if X ∼ Bern( p)

= Pε (−ε ≤ f ( x))
= Pε (ε ≤ f ( x)) using that the distribution of ε is
symmetric around 0
= Φ( f ( x); 0, σn2 ).

Solution to Problem 5.3.


1. Recall that as f is convex,

∀ x1 , x2 , ∀λ ∈ [0, 1] : f (λx1 + (1 − λ) x2 ) ≤ λ f ( x1 ) + (1 − λ) f ( x2 ).

We prove the statement by induction on k. The base case, k = 2,


follows trivially from the convexity of f . For the induction step,
suppose that the statement holds for some fixed k ≥ 2 and assume
w.l.o.g. that θk+1 ∈ (0, 1). We then have,
!
k +1 k
θi
∑ θ i f ( x i ) = (1 − θ k +1 ) ∑ 1 − θ k +1 f ( x i ) + θ k +1 f ( x k +1 )
i =1 i =1
!
k
θi
≥ (1 − θ k +1 ) · f ∑ x + θ k +1 f ( x k +1 ) using the induction hypothesis
i =1
1 − θ k +1 i
!
k +1
≥ f ∑ θi xi . using the convexity of f
i =1

2. Noting that log2 is concave, we have by Jensen’s inequality,


  
1
H[ p] = Ex∼ p log2 by definition of entropy (5.28a)
p( x )
 
1
≤ log2 Ex∼ p by Jensen’s inequality (5.29)
p( x )
= log2 n.

Solution to Problem 5.4. If y = 1 then

ℓ_bce(ŷ; y) = − log ŷ = log(1 + e^{−f̂}) = ℓ_log(f̂; y).

If y = −1 then

ℓ_bce(ŷ; y) = − log(1 − ŷ) = log(1 + e^{f̂}) = ℓ_log(f̂; y).

Here the second equality follows from the simple algebraic fact

1 − 1/(1 + e^{−z}) = 1 − e^z/(e^z + 1) = 1/(1 + e^z).          (multiplying by e^z)

Solution to Problem 5.5.


1. Let p and q be two (continuous) distributions. The KL-divergence
between p and q is

p( x)
 
KL( p∥q) = Ex∼ p log using the definition of KL-divergence
q( x) (5.34)
q( x)
  
= Ex∼ p S .
p( x)

Note that the surprise S[u] = − log u is a convex function in u, and


hence,

q( x)
  
≥ S Ex∼ p using Jensen’s inequality (5.29)
p( x)
Z 
=S q( x) dx

= S[1] = 0. a probability density integrates to 1

2. We observe from the derivation of (1) that KL( p∥q) = 0 iff equality
holds for Jensen’s inequality. Now, if p and q are discrete with final
and identical support, we can follow from the hint that Jensen’s
inequality degenerates to an equality iff p and q are point wise
identical.

Solution to Problem 5.6.


1. We have
Z
H[ f ∥ g ] = − f ( x ) log g( x ) dx using the definition of cross-entropy
R (5.32)
( x − µ )2
   
1
Z
=− f ( x ) · log √ − dx using that g( x ) = N ( x; µ, σ2 )
R 2πσ2 2σ2
 Z
1 1
Z
= − log √ f ( x ) dx + 2 f ( x )( x − µ)2 dx
2πσ2 | R {z } 2σ R
1
1 √ h i
= log(σ 2π ) + 2 Ex∼ f ( x − µ)2
2σ | {z }
σ2
1 √
= log(σ 2π ) +
2
= H[ g ]. using the entropy of Gaussians (5.30)

2. We have shown that

H[ g] − H[ f ] = KL( f ∥ g) ≥ 0,

and hence, H[ g] ≥ H[ f ]. That is, for a fixed mean µ and variance


σ2 , the distribution that has maximum entropy among all distribu-
tions on R is the normal distribution.

Solution to Problem 5.7.


1. This sample solution follows the works of Caticha and Giffin (2006);
Caticha (2021). Writing down the Lagrangian with dual variables

λ0 and λ1 (y) for the normalization and data constraints yields

q( x, y)
Z
L(q, λ0 , λ1 ) = q( x, y) log dx dy the objective, using (5.34)
p( x, y)
 Z 
+ λ0 1 − q( x, y) dx dy the normalization constraint
Z  Z 
+ λ1 (y) δy′ (y) − q( x, y) dx dy the data constraint

q( x, y)
Z  
= q( x, y) log − λ0 − λ1 (y) dx dy + const.
p( x, y)

Note that L(q, λ0 , λ1 ) = KL(qX,Y ∥ pX,Y ) if the constraints are satis-


fied. Thus, we simply need to solve the (dual) optimization prob-
lem

min max L(q, λ0 , λ1 ).


q(·,·)≥0 λ0 ,λ1 (·)∈R

We have
∂L(q, λ0 , λ1 ) q( x, y)
= log − λ0 − λ1 (y) + 1. using the product rule of differentiation
∂q( x, y) p( x, y)

Setting the partial derivatives to zero, we obtain

1
q( x, y) = exp(λ0 + λ1 (y) − 1) p( x, y) = exp(λ1 (y)) p( x, y)
Z
.
where Z = exp(λ0 − 1) denotes the normalizing constant. We can
determine λ1 (y) from the data constraint:

1
Z Z
q( x, y) dx = exp(λ1 (y)) p( x, y) dx
Z
1
= exp(λ1 (y)) · p(y) using the sum rule (1.7)
Z
!
= δy′ (y).

p( x,y)
It follows that q( x, y) = δy′ (y) · p(y) = δy′ (y) · p( x | y). using the definition of conditional
2. From the sum rule (1.7), we obtain distributions (1.10)

Z Z
q( x) = q( x, y) dy = δy′ (y) · p( x | y) dy = p( x | y′ ).

Solution to Problem 5.8. We can rewrite the KL-divergence as

KL( p∥q) = Ex∼ p [log p( x) − log q( x)] using the definition of KL-divergence
"  (5.34)
1 det Σ q 1
= Ex∼ p log  − ( x − µ p )⊤ Σ − 1
p (x − µ p ) using the Gaussian PDF (1.51)
2 det Σ p 2

1
+ ( x − µq )⊤ Σ − 1
q ( x − µq )
2


1 det Σ q 1 h i
= log  − Ex∼ p ( x − µ p )⊤ Σ − 1
p (x − µ p ) using linearity of expectation (1.20)
2 det Σ p 2
1 h i
+ Ex∼ p ( x − µq )⊤ Σ − 1
q ( x − µq )
2
As ( x − µ p )⊤ Σ −
p ( x − µ p ) ∈ R, we can rewrite the second term as
1

1 h  i
Ex∼ p tr ( x − µ p )⊤ Σ −
p
1
( x − µ p )
2
1 h  i
= Ex∼ p tr ( x − µ p )( x − µ p )⊤ Σ − p
1
using the cyclic property of the trace
2
1  h i 
= tr Ex∼ p ( x − µ p )( x − µ p )⊤ Σ − p
1
using linearity of the trace and linearity
2 of expectation (1.20)
1  
= tr Σ p Σ − p
1
using the definition of the covariance
2 matrix (1.34)
1 d
= tr( I ) = .
2 2
For the third term, we use the hint (5.81) to obtain
1 h i
Ex∼ p ( x − µq )⊤ Σ −q
1
( x − µ q )
2
1   
= (µ p − µq )⊤ Σ −
q
1
( µ p − µ q ) + tr Σ −1
q Σ p .
2
Putting all terms together we get

1 det Σ q
KL( p∥q) = log  − d + (µ p − µq )⊤ Σ − 1
q (µ p − µq )
2 det Σ p
 
+tr Σ − 1
q Σp .

Solution to Problem 5.9.


1. Let p and q be discrete distributions. The derivation is analogous
if p and q are taken to be continuous. First, we write the KL-
divergence between p and q as

KL( p∥q)
p( x, y)
= ∑ ∑ p( x, y) log2 using the definition of KL-divergence
x y q( x )q(y) (5.34)

= ∑ ∑ p(x, y) log2 p(x, y) − ∑ ∑ p(x, y) log2 q(x)


x y x y

− ∑ ∑ p( x, y) log2 q(y)
x y

= ∑ ∑ p( x, y) log2 p( x, y) − ∑ p( x ) log2 q( x ) − ∑ p(y) log2 q(y) using the sum rule (1.7)
x y x y

= −H[ p( x, y)] + H[ p( x )∥q( x )] + H[ p(y)∥q(y)] using the definitions of entropy (5.27)


and cross-entropy (5.32)
= − H[ p( x, y)] + H[ p( x )] + H[ p(y)] + KL( p( x )∥q( x )) using Equation (5.33)

+ KL( p(y)∥q(y))

= KL( p( x )∥q( x )) + KL( p(y)∥q(y)) + const.

Hence, to minimize KL( p∥q) with respect to the variational dis-


tributions q( x ) and q(y) we should set KL( p( x )∥q( x )) = 0 and
KL( p(y)∥q(y)) = 0, respectively. This is obtained when

q( x ) = p( x ) and q ( y ) = p ( y ).

2. The reverse KL-divergence KL(q∥ p) on the finite domain x, y ∈


{1, 2, 3, 4} is defined as
q( x )q(y)
KL(q∥ p) = ∑ ∑ q(x)q(y) log2 p( x, y)
.
x y

We can easily observe from the above formula that the support
of q must be a subset of the support of p. In other words, if
q( x, y) is positive outside the support of p (i.e., when p( x, y) = 0)
then KL(q∥ p) = ∞. Hence, the reverse KL-divergence has an infi-
nite value except when the support of q is either {1, 2} × {1, 2} or
{(3, 3)} or {(4, 4)}. Thus, it has three local minima.
For the first case, the minimum is achieved when q( x ) = q(y) =
( 12 , 12 , 0, 0). The corresponding KL-divergence is KL(q∥ p) = log2 2 =
1. For the second case and the third case, q( x ) = q(y) = (0, 0, 1, 0)
and q( x ) = q(y) = (0, 0, 0, 1), respectively. The KL-divergence in
both cases is KL(q∥ p) = log2 4 = 2.
3. Let us compute p( x = 4) and p(y = 1):
1
p ( x = 4) = ∑ p(x = 4, y) = 4 ,
y
1
p ( y = 1) = ∑ p(x, y = 1) = 4 .
x

1
Hence, q( x = 4, y = 1) = p( x = 4) p(y = 1) = 16 , however, p( x =
4, y = 1) = 0. We therefore have for the reverse KL-divergence that
KL(q∥ p) = ∞.

Solution to Problem 5.10.


1. Recall from Section 5.4.6 that q⋆ matches the first and second mo-
ments of p. In contrast, the Laplace approximation matches the
mode of p and the second derivative of − log p. In general, the
mean is different from the mode, and so is the second moment
from the second derivative.
2. We have

arg min KL(q∥ p(· | D))


q∈Q

= arg max L(q, p; D) using Equation (5.53)


q∈Q

= arg max Eθ∼q [log p(y1:n , θ | x1:n )] + H[q] using Equation (5.54)
q∈Q
n 1
= arg max Eθ∼q [log p(y1:n , θ | x1:n )] + log(2πe) + log det(Σ ). using Equation (5.31)
q∈Q 2 2

Differentiating with respect to µ and Σ, we have that q̃ must satisfy

0 = ∇µ Eθ∼q̃ [log p(y1:n , θ | x1:n )]


−1
Σ = −2∇Σ Eθ∼q̃ [log p(y1:n , θ | x1:n )]. using the first hint

The result follows by Equation (5.82).

.
Solution to Problem 5.11. To simplify the notation, we write Σ =
diag{σ12 , . . . , σd2 }. The reverse KL-divergence can be expressed as

(σp2 )d
!
1  
KL(qλ ∥ p(·)) = tr σp−2 Σ + σp−2 µ⊤ µ − d + log using the expression for the
2 det(Σ ) KL-divergence of Gaussians (5.41)
!
d d
1
2 p i∑
= σ −2
σi + σp µ µ − d + d log σp − ∑ log σi .
2 −2 ⊤ 2 2
=1 i =1

It follows immediately that ∇µ KL(qλ ∥ p(·)) = σp−2 µ. Moreover,


 
∂ 1 −2 ∂ 2 ∂ σ 1
KL(qλ ∥ p(·)) = σp σi − 2
log σi = 2i − .
∂σi 2 ∂σi ∂σ σp σi
| {z } | i {z }
2σi 2/σ
i

Solution to Problem 5.12.


1. Let Y ∼ Unif([0, 1]). Then, using the two hints,

(b − a)Y + a ∼ Unif([ a, b]).

2. Let Z1 ∼ N (µ, σ2 ) and Z2 ∼ N (0, 1). We have, X = e Z1 . Recall


from Equation (1.54) that Z1 can equivalently be expressed in terms
of Z2 as Z1 = σZ2 + µ. This yields,

X = e Z1 = eσZ2 +µ .

3. Denote by F the CDF of Cauchy(0, 1). Observe that F is invertible


with inverse F −1 (y) = tan(π (y − 12 )). Let Y ∼ Unif([0, 1]) and
write X = F −1 (Y ). Then,

PX ( x ) = P( X ≤ x )
 
= P F − 1 (Y ) ≤ x
= P(Y ≤ F ( x ))
= F ( x ).

This reparameterization works for any distribution with invert-


ible CDF (not just Cauchy) and is known as the universality of the
uniform (cf. Appendix A.1.3). The universality of the uniform is
commonly used in pseudo-random number generators as it allows
“lifting” samples from a uniform distribution to countless other
well-known distributions.
4. The derivative of ReLU(z) is 1{z > 0}. Applying the reparameter-
ization trick gives
d d
Ex∼N (µ,1) ReLU(wx ) = Eε∼N (0,1) ReLU(w(µ + ε))
dµ dµ
= wEε 1{w(µ + ε) > 0} using the chain rule

= wEε 1{µ + ε > 0}


= wPε (µ + ε > 0)
= wPε (ε < µ)
= wPε (ε ≤ µ)
= wΦ(µ).
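To illustrate part 3 numerically (our own sketch), the inverse-CDF reparameterization X = tan(π(Y − 1/2)) of a uniform Y indeed produces standard Cauchy samples.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
x = np.tan(np.pi * (u - 0.5))            # F^{-1}(u) for the standard Cauchy distribution

for q in (-2.0, 0.0, 1.0):               # compare empirical and true CDF
    print(round(np.mean(x <= q), 3), round(stats.cauchy.cdf(q), 3))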

Markov Chain Monte Carlo Methods

Solution to Problem 6.1. We have

q_{t+1}(x′) = P(X_{t+1} = x′) = ∑_x P(X_t = x) P(X_{t+1} = x′ | X_t = x) = ∑_x q_t(x) p(x′ | x).          (using the sum rule (1.7))

Noting that p( x ′ | x ) = P ( x, x ′ ), we conclude qt+1 = qt P.

Solution to Problem 6.2. It follows directly from the definition of


matrix multiplication that

Pk ( x, x ′ ) = ∑ P ( x, x1 ) · P ( x1 , x2 ) · · · P ( xk−1 , x ′ )
x1 ,...,xk−1

∑ P ( X1 = x 1 | X0 = x ) · · · P X k = x ′ | X k − 1 = x k − 1

= using the definition of the transition
x1 ,...,xk−1 matrix (6.8)

∑ P X1 = x 1 , · · · , X k − 1 = x k − 1 , X k = x ′ | X0 = x

= using the product rule (1.11)
x1 ,...,xk−1

= P X k = x ′ | X0 = x .

using the sum rule (1.7)

Solution to Problem 6.3. We consider the transition matrix


 
P = [[0.60, 0.30, 0.10],
     [0.50, 0.25, 0.25],
     [0.20, 0.40, 0.40]].

We note that the entries of P are all different from 0, thus the Markov chain corresponding to this transition matrix is ergodic (all elements of the transition matrix being strictly greater than 0 is a sufficient, but not necessary, condition for ergodicity). Thus, there exists a unique stationary distribution π to which the Markov chain converges irrespectively of the distribution over initial states q_0.

converges irrespectively of the distribution over initial states q0 .

We know that P⊤ π = π (where we write π as a column vector),


therefore, to find the stationary distribution π, we need to find the
normalized eigenvector associated with eigenvalue 1 of the matrix P⊤ .
That is, we want to solve (P⊤ − I )π = 0 for π. We obtain the linear
system of equations,

−0.40π1 + 0.50π2 + 0.20π3 = 0


0.30π1 − 0.75π2 + 0.40π3 = 0
0.10π1 + 0.25π2 − 0.60π3 = 0.

Note that the left hand side of equation i corresponds to the probabil-
ity of entering state i at stationarity minus πi . Quite intuitively, this
difference should be 0, that is, after one iteration the random walk is
at state i with the same probability as before the iteration.

Solving this system of equations, for example, using the Gaussian


elimination algorithm, we obtain the normalized eigenvector
 
π = (1/72) [35, 22, 15]⊤.

Thus, we conclude that in the long run, the percentage of news days
that will be classified as “good” is 35/72.
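The stationary distribution can also be obtained numerically (a sketch of ours): either from the eigenvector of P⊤ with eigenvalue 1, or by repeatedly applying the transition matrix to an arbitrary initial distribution.

import numpy as np

P = np.array([[0.60, 0.30, 0.10],
              [0.50, 0.25, 0.25],
              [0.20, 0.40, 0.40]])

vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()
print(pi)                                 # approx. [35, 22, 15] / 72

q = np.array([1.0, 0.0, 0.0])             # sanity check: q_{t+1} = P^T q_t converges to pi
for _ in range(100):
    q = P.T @ q
print(q)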

Solution to Problem 6.4. Observe that the described proposal distri-


bution is symmetric. Therefore, the acceptance probability simplifies
to
p( x′ )
 

α( x | x ) = min 1, .
p( x )

If we denote the number of 1s in a bit string by w( x ), we have the


requirement that p( x ) ∝ w( x ). Therefore, the acceptance probability
becomes
α(x′ | x) = min{1, w(x′)/w(x)}   if w(x) ≠ 0,   and   α(x′ | x) = 1   otherwise.
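A compact sketch (ours) of this Metropolis-Hastings sampler on bit strings of length n: propose flipping a uniformly random bit and accept with probability min{1, w(x′)/w(x)}.

import numpy as np

rng = np.random.default_rng(0)
n = 10
x = rng.integers(0, 2, size=n)                 # initial bit string

samples = []
for _ in range(20_000):
    x_prop = x.copy()
    x_prop[rng.integers(n)] ^= 1               # flip one uniformly random bit
    w, w_prop = x.sum(), x_prop.sum()
    accept = 1.0 if w == 0 else min(1.0, w_prop / w)
    if rng.uniform() < accept:
        x = x_prop
    samples.append(x.sum())

# empirically, P(w(x) = k) is proportional to k * binom(n, k)
counts = np.bincount(samples, minlength=n + 1)
print(counts / counts.sum())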

Solution to Problem 6.5.


1. We have to compute the conditional distributions. Notice that for
x ∈ {0, . . . , n} and y ∈ [0, 1],
 
n x
p( x, y) = y (1 − y)n− x · yα−1 (1 − y) β−1 = Bin( x; n, y) · Cy ,
x

where Bin(n, y) is the PMF of the binomial distribution (1.49) with


n trials and success probability y, and Cy is a constant depending
on y. It is clear that

p( x, y)
p( x | y) = using the definition of conditional
p(y) probability (1.8)
Bin( x; n, y) · Cy
=
p(y)
= Bin( x; n, y). using that p( x | y) is a probability
distribution over x and Bin( x; n, y)
So in short, sampling from p( x | y) is equivalent to sampling from already sums to 1, so Cy = p(y)

a binomial distribution, which amounts to n times throwing a coin


with bias y, and outputting the number of heads.
For the other conditional distribution, recall the PMF of the beta
distribution with parameters α, β,

Beta(y; α, β) = C · yα−1 (1 − y) β−1

where C is some constant depending on α and β only. We then


have

p( x, y) = Beta(y; x + α, n − x + β) · Cx ,

where Cx is some constant depending on x, α, β. This shows (anal-


ogously to above), that

p(y | x ) = Beta(y; x + α, n − x + β).

So for sampling y given x, one can sample from the beta distri-
bution. There are several methods for sampling from a beta dis-
tribution, and we refer the reader to the corresponding Wikipedia
page.
2. We first derive the posterior distribution of µ. We have

log p(µ | λ, x1:n ) = log p(µ) + log p( x1:n | µ, λ) + const


λ0 λ n
=− (µ − µ0 )2 − ∑ ( xi − µ)2 + const
2 2 i =1
n
1  
= − (λ0 + nλ)µ2 + λ0 µ0 + λ ∑ xi µ + const. by expanding the squares
2 | {z } i =1
lλ | {z }
mλ lλ

It follows that lλ = λ0 + nλ, mλ = (λ0 µ0 + λ ∑in=1 xi )/lλ , and


µ | λ, x1:n ∼ N (µλ , lλ−1 ). That is, the posterior precision is the
sum of prior precision and the precisions of each observation, and
the posterior mean is a weighted average of prior mean and obser-
vations (where the weights are the precisions).

For the posterior distribution of λ, we have

p(λ | µ, x1:n ) ∝ p(λ) · p( x1:n | µ, λ)


n n 2
= λα−1 e− βλ · λ 2 e− 2 ∑i=1 (xi −µ)
λ

1 n
= λ α + 2 −1 e − λ ( β + 2 ∑ i =1 ( x i − µ ) )
n 2

= Gamma(λ; aµ , bµ )

where aµ = α + n
2 and bµ = β + 1
2 ∑in=1 ( xi − µ)2 .
3. We have

p(α, c | x1:n ) ∝ p(α, c) · p( x1:n | α, c)


n
αcα
∝ 1{α, c > 0} ∏ 1{ x i ≥ c }
i =1 xiα+1
αn cnα
= 1{c < x ⋆ }1{α, c > 0}
(∏in=1 xi )α+1
.
where x ⋆ = min{ x1 , . . . , xn }. It is not clear how one could sample
from this distribution directly. Instead, we use Gibbs sampling.
For the posterior distribution of α, we have

p(α | c, x1:n ) ∝ p(α, c | x1:n )


αn cnα
∝ 1{ α > 0}
(∏in=1 x i ) α +1
!!
n
n
= α exp −α ∑ log xi − n log c 1{ α > 0}
i =1

∝ Gamma(α; a, b)
. .
where a = n + 1 and b = ∑in=1 log xi − n log c. On the other hand,

p(c | α, x1:n ) ∝ p(α, c | x1:n ) ∝ cnα 1{0 < c < x ⋆ }.

It remains to show that it is easy to sample from a random variable


X with p X ( x ) ∝ x a 1{0 < x < b} for a, b > 0. We have that the
Rb a +1
normalizing constant is given by 0 x a dx = ba+1 . Therefore, the
CDF of X is
Z x
PX ( x ) = p(y) dy
0
a+1 x a
Z
= y dy
b a +1 0
a + 1 x a +1
= a +1
b a+1
 x  a +1
= .
b
1
The inverse CDF is given by PX−1 (y) = by a+1 . Therefore, we can
sample from X using inverse transform sampling (cf. Appendix A.1.3)
1
by sampling Y ∼ Unif([0, 1]) and setting X = bY a+1 .

Solution to Problem 6.6. First, note that the sum of convex functions
is convex, hence, we consider each term individually.

The Hessian of the regularization term is λI, and thus, by the second-
order characterization of convexity, this term is convex in w.

Finally, note that the second term is a sum of logistic losses ℓlog (5.13),
and we have seen in Problem 5.1 that ℓlog is convex in w.

Solution to Problem 6.7.


1. Our goal is to solve the optimization problem

max −
p∈∆T
∑ p( x ) log2 p( x )
x ∈T
(B.9)
subject to ∑ p( x ) f ( x ) = µ
x ∈T

for some µ ∈ R. The Lagrangian with dual variables λ0 and λ1 is


given by
!
L( p, λ0 , λ1 ) = − ∑ p( x ) log2 p( x ) + λ0 1 − ∑ p( x ) +
x ∈T x ∈T
!
λ1 µ − ∑ p( x ) f ( x )
x ∈T

=− ∑ p( x )(log2 p( x ) + λ0 + λ1 f ( x )) + const.
x ∈T

Note that L( p, λ0 , λ1 ) = H[ p] if the constraints are satisfied. Thus,


we simply need to solve the (dual) optimization problem

max min L( p, λ0 , λ1 ).
p≥0 λ0 ,λ1 ∈R

We have

L( p, λ0 , λ1 ) = − log2 p( x ) − λ0 − λ1 f ( x ) − 1.
∂p( x )
Setting the partial derivatives to zero, we obtain
log(·)
p( x ) = 2 exp(−λ0 − λ1 f ( x ) − 1) using log2 (·) = log(2)

∝ exp(−λ1 f ( x )).

Clearly, p is a valid probability mass function when normalized


(i.e., for an appropriate choice of λ0 ). We complete the proof by
.
setting T = λ1 .
1
2. As T → ∞ (and λ1 → 0), the optimization problem reduces to pick-
ing the maximum entropy distribution without the first-moment
constraint. This distribution is the uniform distribution over T .
Conversely, as T → 0 (and λ1 → ∞), the Gibbs distribution re-
duces to a point density around its mode.

Solution to Problem 6.8. Recall the following two facts:


1. Gibbs sampling is an instance of Metropolis-Hastings with pro-
posal distribution

 p( x ′ | x′ ) if x′ differs from x only in entry i
−i
r ( x′ | x) = i
0 otherwise

and acceptance distribution α( x′ | x) = 1.


2. The acceptance distribution of Metropolis-Hastings where the sta-
tionary distribution p is a Gibbs distribution with energy function
f is

r ( x | x′ )
 
′ ′
α( x | x) = min 1, exp( f ( x) − f ( x )) .
r ( x′ | x)
We therefore know that
r ( x | x′ )
exp( f ( x) − f ( x′ )) ≥ 1.
r ( x′ | x)

We remark that this inequality even holds with equality using our
derivation of Theorem 6.19. Taking the logarithm and reorganizing
the terms, we obtain

f ( x′ ) ≤ f ( x) + log r ( x | x′ ) − log r ( x′ | x). (B.10)

By the definition of the proposal distribution of Gibbs sampling,

r ( x′ | x) = p( xi′ | x−i ) and r ( x | x ′ ) = p ( x i | x −i ). using x′ −i = x−i

Taking the expectation of Equation (B.10),

Ex′ ∼ p(·| x−i ) f ( x′ ) ≤ f ( x) + log p( xi | x−i ) − Ex′ ∼ p(·| x−i ) log p( xi′ | x−i )
   
i i

= f ( x) − S[ p( xi | x−i )] + H[ p(· | x−i )].

That is, the energy is expected to decrease if the expected surprise of


the new sample xi′ | x−i is smaller than the surprise of the current
sample xi | x−i .

Solution to Problem 6.9.


1. We take the minimum of Equation (6.56) with respect to y on both
sides. The minimum of the left-hand side is f (0) = 0. To find the
minimum of the right-hand side, we differentiate with respect to
y:

∂ h α i
f ( x) + ∇ f ( x)⊤ (y − x) + ∥y − x∥22 = ∇ f ( x) + α(y − x).
∂y 2

Setting the partial derivative to zero, we obtain

1
y = x− ∇ f ( x ).
α
Plugging this y into Equation (6.56), we have

1 1
0 ≥ f ( x) − ∥∇ f ( x)∥22 + ∥∇ f ( x)∥22
α 2α
1
= f ( x) − ∥∇ f ( x)∥22 .

2. Using the chain rule,

d d
f ( xt ) = ∇ f ( xt )⊤ xt .
dt dt
d
Note that dt xt = −∇ f ( xt ) by Equation (6.55), so

= − ∥∇ f ( xt )∥22
≤ −2α f ( xt )

where the last step follows from the PL-inequality (6.57).


3. Follows directly from (2) and Grönwall’s inequality (6.58) by letting
Rt
g(t) = f ( xt ) and noting that − 0 2α ds = −2αt.
4. It suffices to show ∇qt = qt ∇ log qt . By the chain rule,

∇qt
∇ log qt = .
qt

We obtain the desired result by rearranging the terms.


5. We have

d d qt
Z
KL(qt ∥ p) = qt log dθ.
dt dt p

By the chain rule,

qt qt
Z Z
∂qt ∂
= log dθ + qt log dθ.
∂t p ∂t p

For the second term, we have


qt d
Z Z Z
∂ ∂qt
qt log dθ = dθ = qt dθ = 0.
∂t p ∂t dt | {z }
1

Plugging in the Fokker-Planck equation (6.60) into the first term,


we obtain
 
qt qt
Z
= ∇ · qt ∇ log log dθ.
p p

. qt .
Letting φ = log p and F = ∇ φ, and then applying the hint, we
have
Z
= (∇ · qt F) φ dθ
Z
=− qt ∥∇ φ∥22 dθ
" #
2
qt (θ)
= −Eθ∼qt ∇ log
p(θ) 2

= −J( q t ∥ p ).

6. Noting that p satisfies the LSI with constant α and combining with
d
(5), we have that dt KL(qt ∥ p) ≤ −2αKL(qt ∥ p). Observe that this
result is analogous to the result derived in (2) and the LSI can in-
tuitively be seen as the PL-inequality, but in the space of distribu-
tions. Analogously to (3), we obtain the desired convergence result
by applying Grönwall’s inequality (6.58).
7. By Pinsker’s inequality (6.21), ∥qt − p∥TV ≤ e−αt 2KL(q0 ∥ p). It
p

follows from elementary algebra that ∥qt − p∥TV ≤ ϵ if t ≥ α1 log Cϵ


. p
where C = 2KL(q0 ∥ p). Thus, τTV (ϵ) ∈ O(log(1/ϵ)).

Solution to Problem 6.10.


1. We show that under the dynamics (6.50), the Hamiltonian H ( x, y)
is a constant. In particular, H ( x′ , y′ ) = H ( x, y). This directly im-
plies that

α(( x′ , y′ ) | ( x, y)) = min{1, exp( H ( x′ , y′ ) − H ( x, y))} = 1.

To see why H ( x, y) is constant, we compute


d dx dy
H ( x, y) = ∇x H · + ∇y H · using the chain rule
dt dt dt
= ∇x H · ∇y H − ∇x H · ∇y H using the Hamiltonian dynamics (6.50)

= 0.

2. By applying one Leapfrog step, we have for x,

x(t + τ ) = x(t) + τy(t + τ/2) using Equation (6.51b)


 τ 
= x(t) + τ y(t) − ∇x f ( x(t)) using Equation (6.51a)
2
τ2
= x(t) − ∇x f ( x(t)) + τy(t).
2
Now observe that y(t) is a Gaussian random variable, indepen-
dent of x (because we sample y freshly at the beginning of the
L Leapfrog steps, and we are doing just one Leapfrog step). By
. .
renaming τ ′ = τ2/2 and ϵ = y(t), we get

xt+1 = xt − τ ′ ∇x f ( xt ) + 2τ ′ ϵ

which coincides with the proposal distribution of Langevin Monte


Carlo (6.44).

Deep Learning

Solution to Problem 7.1. We have


σ_1(f) = exp(f_1) / (exp(f_0) + exp(f_1))          (using the definition of the softmax function (7.5))
       = 1 / (exp(f_0 − f_1) + 1)                  (multiplying by exp(−f_1))
       = σ(−(f_0 − f_1)).                          (using the definition of the logistic function (5.9))

σ_0(f) = 1 − σ_1(f) follows from the fact that σ_0(f) + σ_1(f) = 1.
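A one-line numerical check of this identity (our sketch):

import numpy as np

def softmax(f):
    e = np.exp(f - f.max())                    # subtract the maximum for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f = np.array([0.3, -1.2])                      # logits [f_0, f_1]
print(softmax(f)[1], sigmoid(f[1] - f[0]))     # identical: sigma_1(f) = sigma(-(f_0 - f_1))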

Active Learning

Solution to Problem 8.1. We have

I(X; Y) = H[X] − H[X | Y] using the definition of mutual


information (8.9)
= Ex [− log p( x)] − E(x,y) [− log p( x | y)] using the definitions of entropy (5.27)
and conditional entropy (8.2)
= E(x,y) [− log p( x)] − E(x,y) [− log p( x | y)] using the law of total expectation (1.12)

p( x | y)
 
= E(x,y) log .
p( x)

From this we get directly that

p( x | y)
 
I(X; Y) = Ey Ex|y log
p( x)
= Ey [KL( p( x | y)∥ p( x))], using the definition of KL-divergence
(5.34)
and we also conclude
p( x, y)
 
I(X; Y) = E( x,y) log using the definition of conditional
p( x) p(y) probability (1.8)
= KL( p( x, y)∥ p( x) p(y)). using the definition of KL-divergence
(5.34)

Solution to Problem 8.2. Symmetry of conditional mutual informa-


tion (8.17) and Equation (8.16) give the following relationship,

I(X; Y, Z) = I(X; Y) + I(X; Z | Y) = I(X; Z) + I(X; Y | Z). (B.11)

1. X ⊥ Z implies I(X; Z) = 0. Thus, Equation (B.11) simplifies to

I(X; Y) + I(X; Z | Y) = I(X; Y | Z).

Using that I(X; Z | Y) ≥ 0, we conclude I(X; Y | Z) ≥ I(X; Y).



2. X ⊥ Z | Y implies I(X; Z | Y) = 0. Equation (B.11) simplifies to

I(X; Y) = I(X; Z) + I(X; Y | Z).

Using that I(X; Z) ≥ 0, we conclude I(X; Y | Z) ≤ I(X; Y).


3. Again, Equation (B.11) simplifies to

I(X; Y) = I(X; Z) + I(X; Y | Z).

Using that I(X; Y | Z) ≥ 0, we conclude I(X; Z) ≤ I(X; Y).

Solution to Problem 8.3.


1. Expanding the definition of interaction information, one obtains

I(X; Y; Z) = (H[X] + H[Y] + H[Z])


− (H[X, Y] + H[X, Z] + H[Y, Z])
+ H[X, Y, Z],

and hence, interaction information is symmetric.


2. Conditional on either one of X1 or X2 , the distribution of Y re-
mains unchanged, and hence I(Y; X1 | X2 ) = I(Y; X2 | X1 ) = 0.
Conversely, conditional on both X1 and X2 , Y is fully determined,
and hence I(Y; X1 , X2 ) = 1 noting that Y encodes one bit worth of
information. Thus, I(Y; X1 ; X2 ) = −1 meaning that there is syn-
ergy between X1 and X2 with respect to Y.

Solution to Problem 8.4. As suggested, we derive the result in two


steps.
1. First, we have

∆ I ( x | A) = I ( A ∪ { x}) − I ( A) using the definition of marginal gain


(8.22)
= I( f A∪{ x} ; y A , y x ) − I( f A ; y A ) using the definition of I (8.20)

= I( f A∪{ x} ; y A , y x ) − I( f A∪{ x} ; y A ) using y A ⊥ f x | f A

= H[ f A∪{ x} | y A ] − H[ f A∪{ x} | y A , y x ] using the definition of MI (8.9)

= I( f A∪{ x} ; y x | y A ) using the definition of cond. MI (8.14)

= I( f A ; y x | f x , y A ) + I( f x ; y x | y A ) using Equation (8.16)

= I( f x ; y x | y A ). using I( f A ; y x | f x , y A ) = 0 as
yx ⊥ f A | f x
2. For the second part, we get

I( f x ; y x | y A ) = I( y x ; f x | y A ) using symmetry of conditional MI (8.17)

= H[ y x | y A ] − H[ y x | f x , y A ] using the definition of cond. MI (8.14)

= H[ y x | y A ] − H[ y x | f x ] using that y x ⊥ ε A so y x ⊥ y A | f x

= H[ y x | y A ] − H[ ε x ]. given f x , the only randomness in y x


originates from ε x

Solution to Problem 8.5. We have

I( f x ; y x ; yB \ A | y A ) = I( f x ; y x | y A ) − I( f x ; y x | yB ) using the definitiion of interaction


information (8.18)
= ∆ I ( x | A) − ∆ I ( x | B) using Equation (8.23)

≥0

where the final inequality follows from submodularity of I.

Solution to Problem 8.6.


1. Yes. We have

∆( j | S) = F (S ∪ { j}) − F (S) = H[ Z | YS ] − H[ Z | YS∪{ j} ].

This is non-negative iff H[ Z | YS ] ≥ H[ Z | YS∪{ j} ] which is the


“information never hurts” property (8.8) of conditional entropy.
2. No, our acquisition function F is not equivalent to uncertainty sam-
pling. Note that all Xi have identical prior variance. It suffices to
show that not all i1 ∈ {1, . . . , 100} maximize the marginal informa-
tion gain ∆(i1 | ∅).
The prior variance of Z is Var[ Z ] = ∑100 2
i =1 i since all Xi are inde-
pendent. Since the x1:100 and Z are jointly Gaussian, the posterior
variance of Z after observing Y1 is

Cov[ Z, Y1 ]2
Var[ Z | Y1 ] = Var[ Z ] − .
Var[Y1 ]

The marginal information gain ∆(i1 | ∅) is equal up to constants


to − 12 log Var[ Z | Y1 ]. The variance of the observation is Var[Y1 ] =
 
Var Xi1 + ε 1 = 2 and the covariance is Cov[ Z, Y1 ] = i1 . Therefore,

100
1
Var[ Z | Y1 ] = ∑ i 2 − 2 i1 ,
i =1

which is uniquely minimized for i1 = 100.


3. The acquisition function is submodular if the previous observation
of an input Xi does not imply that the information conveyed by
observing another input X j about the value of Z is larger than it
would have been without having previously observed Xi . Clearly,
if i = j then the information conveyed by the second observation
is always 0 bit, since the first observation was noiseless. However,
depending on Z previous observations of inputs other than X j may
be crucial for the informativeness of observing X j .
(a) Yes. If X j = 0 then we know that Z = 0 also, no matter the
previously observed inputs. If X j = 1 then it becomes slightly
more probable that Z = 1, however, this does not depend on
which other points have been observed previously.

(b) Yes. The case of OR is symmetric to AND. If X j = 1 then


we know that Z = 1 also, no matter the previously observed
inputs. If X j = 0 then it becomes slightly more probable that
Z = 0, however, this does not depend on which other points
have been observed previously.
(c) No. Consider the case where we have observed X1:99 and are
now observing X100 . The marginal information gain of all pre-
vious observations was 0 bit, since the observed values for X1:99
did not affect the distribution of Z: the prior of Z was Bern(0.5)
and the current posterior is still the same Bernoulli distribu-
tion. Now, observing X100 will deterministically determine the
value of f , therefore conveying 1 bit of information. However,
just observing X100 alone, without any prior observations, con-
veys no information. This shows that the marginal information
gain can increase after conditioning on more observations, and
therefore the acquisition function F is not submodular with this
choice of Z.
The key intuition behind the above arguments is that AND and OR
are monotonic functions whereas XOR is not.

Bayesian Optimization

Solution to Problem 9.1. First, observe that

lim_{T→∞} R_T/T = max_x f⋆(x) − lim_{T→∞} (1/T) ∑_{t=1}^T f⋆(x_t)          (using the definition of regret (9.4))
                = max_x f⋆(x) − lim_{t→∞} f⋆(x_t).   (B.12)                 (using the Cesàro mean (9.38))

Now, suppose that the algorithm converges to the static optimum,

lim f ⋆ ( xt ) = max f ⋆ ( x).


t→∞ x

Together with Equation (B.12) we conclude that the algorithm achieves


sublinear regret.

For the other direction, we prove the contrapositive. That is, we as-
sume that the algorithm does not converge to the static optimum and
show that it has (super-)linear regret. We distinguish between two
cases. Our assumption is formalized by

lim f ⋆ ( xt ) < max f ⋆ ( x).


t→∞ x

Together with Equation (B.12) we conclude limT →∞ RT/T > 0.



Solution to Problem 9.2.


1. As suggested by the hint, we let Z ∼ N (0, 1) and c > 0, and bound

P( Z > c ) = P( Z − c > 0)
Z ∞
1 2
= √ e−(z+c) /2 dz. using the PDF of the univariate normal
0 2π distribution (1.5)

Note that for z ≥ 0,

( z + c )2 z2 c2 z2 c2
= + zc + ≥ + .
2 2 2 2 2
Thus,
Z ∞
2 /2 1 2
≤ e−c √ e−z /2 dz
0 2π
2
= e−c /2 P( Z > 0)
1 2
= e−c /2 . (B.13) using symmetry of the standard normal
2 distribution around 0
Since we made the Bayesian assumption f ⋆ ( x) ∼ N (µ0 ( x), σ02 ( x))
and assumed Gaussian observation noise yt ∼ N ( f ⋆ ( xt ), σn2 ), the
posterior is also Gaussian:

f ⋆ ( x) | x1:t , y1:t ∼ N (µt ( x), σt2 ( x)).


.
Hence, writing Pt (·) = P(· | x1:t , y1:t ),

Pt−1 ( f ⋆ ( x) ̸∈ Ct ( x)) = Pt−1 (| f ⋆ ( x) − µt−1 ( x)| > β t σt−1 ( x))


 ⋆
f ( x ) − µ t −1 ( x )

= 2Pt−1 > βt using symmetry of the Gaussian
σt−1 ( x) distribution
2
≤ e− β t /2 . using Equation (B.13) with c = β t

2. We have
!
P f ⋆ ( x) ̸∈ Ct ( x) ∑ ∑ P( f ⋆ (x) ̸∈ Ct (x))
[ [
≤ using a union bound (1.73)
x∈X t≥1 x∈X t≥1

≤ |X | ∑ e− β t /2 .
2
using (1)
t ≥1
.
Letting β2t = 2 log(|X | (πt)2 /(6δ)), we get
6δ 1
=
π2 ∑ t2
t ≥1
π2
= δ. using ∑t≥1 1
t2
= 6

Solution to Problem 9.3.


1. We denote the static optimum by x⋆ . By the definition of xt ,

µt−1 ( xt ) + β t σt−1 ( xt ) ≥ µt−1 ( x⋆ ) + β t σt−1 ( x⋆ )



≥ f ⋆ ( x ⋆ ). using Equation (9.7)

Thus,

rt = f ⋆ ( x⋆ ) − f ⋆ ( xt )
≤ β t σt−1 ( xt ) + µt−1 ( xt ) − f ⋆ ( xt )
≤ 2β t σt−1 ( xt ). again using Equation (9.7)

2. We have for any fixed T,

I ( f T + 1 ; yT + 1 ) = H [ yT + 1 ] − H [ ε T + 1 ] analogously to Equation (8.24)


   
= H [ yT ] − H [ ε T ] + H y x T +1 | yT − H ε x T +1 using the chain rule for entropy (8.5)
 and the mutual independence of ε T +1
= I ( f T ; yT ) + I f x T +1 ; y x T +1 | yT using the definition of MI (8.9)
!
1 σT2 ( x T +1 )
= I( f T ; yT ) + log 1 + . using Equation (8.13)
2 σn2

Note that I( f0 ; y0 ) = 0. The result then follows by induction.


3. By Cauchy-Schwarz, R2T ≤ T ∑tT=1 rt2 , and hence, it suffices to show
∑tT=1 rt2 ≤ O β2T γT . We have


T T
∑ rt2 ≤ 4β2T ∑ σt2−1 (xt ) using part (1)
t =1 t =1
T σt2−1 ( xt )
= 4σn2 β2T ∑ σ2 .
t =1 n
.
Observe that σt2−1 ( xt )/σn2 is bounded by M = maxx∈X σ02 ( x)/σn2 as
variance is monotonically decreasing (cf. Section 1.2.3). Applying
the hint, we obtain
!
T σ 2 (x )
t
≤ 4Cσn2 β2T ∑ log 1 + t−1 2
t =1 σn
= 8Cσn2 β2T I( f T ; yT ) using part (2)

≤ 8Cσn2 β2T γT . using the definition of γT (9.10)

Solution to Problem 9.4.


1. Let S ⊆ X be such that |S| ≤ T. Recall from Equation (8.13) that
I( f S ; yS ) = 12 log det I + σn−2 KSS . Using that the kernel is linear


we can rewrite KSS = XS⊤ XS . Using Weinstein-Aronszajn’s identity


(A.70) we have
1   1  
I( f S ; yS ) = log det I + σn−2 XS⊤ XS = log det I + σn−2 XS XS⊤ .
2 2
.
If we define M = I + σn−2 XS XS⊤ as a sum of symmetric positive
definite matrices, M itself is symmetric positive definite. Thus, we
have from Hadamard’s inequality (A.69),
 
det( M ) ≤ det diag{ I + σn−2 XS XS⊤ } diag{ A} refers to the diagonal matrix
whose elements are those of A

 
= det I + σn−2 diag{ XS XS⊤ } .

Note that
|S|
diag{ XS XS⊤ }(i, i ) = ∑ x t ( i )2
t =1
d |S| |S|
≤ ∑ ∑ xt (i)2 = ∑ |∥ x{zt ∥}22 ≤ |S| ≤ T.
i =1 t =1 t =1
≤1

If we denote by λ ≤ T the largest term of diag{ XS XS⊤ } then we


have

det( M ) ≤ (1 + σn−2 λ)d ≤ (1 + σn−2 T )d ,

yielding,
d
I( f S ; yS ) ≤ log(1 + σn−2 T )
2
implying that γT = O(d log T ).
2. Using the regret bound (cf. Theorem 9.5) and the Bayesian confi-
dence intervals (cf. Theorem 9.4), and then γT = O(d log T ), we
have
 p  √ 
R T = O β T γT T = O e dT ,
RT
and hence, limT →∞ T = 0.

Solution to Problem 9.5.


1. Note that f is a Gaussian process, and hence, our posterior dis-
tribution after round t is entirely defined by the mean function
µt and the covariance function k t . Reparameterizing the posterior
distribution using a standard Gaussian (1.54), we obtain

f ( x) | x1:t , y1:t = µt ( x) + σt ( x)ε

for ε ∼ N (0, 1). We get

EIt ( x) = E f ( x)∼N (µt ( x),σ2 ( x)) [ It ( x)] using the definition of expected
t
improvement (9.19)
h i
= E f (x)∼N (µt (x),σ2 (x)) ( f ( x) − fˆt )+ using the definition of improvement
t
(9.15)
h i
= Eε∼N (0,1) (µt ( x) + σt ( x)ε − fˆt )+ using the reparameterization
Z +∞
= (µt ( x) + σt ( x)ε − fˆt )+ · ϕ(ε) dε.
−∞
fˆ −µ ( x) .
For ε < t σ ( xt ) = zt ( x) we have (µt ( x) + σt ( x)ε − fˆt )+ = 0. Thus,
t
we obtain
Z +∞
EIt ( x) = (µt ( x) + σt ( x)ε − fˆt ) · ϕ(ε) dε. (B.14)
zt ( x)

2. By splitting the integral from Equation (B.14) into two distinct
   terms, we obtain

   EI_t(x) = ∫_{z_t(x)}^{+∞} (µ_t(x) − f̂_t) ϕ(ε) dε − σ_t(x) ∫_{z_t(x)}^{+∞} (−ε) · ϕ(ε) dε.

   For the first term, we use the symmetry of N(0,1) around 0 to
   write the integral in terms of the CDF. For the second term, we
   notice that (−ε) · ϕ(ε) = (1/√(2π)) d/dε e^{−ε²/2} = d/dε ϕ(ε). Thus, we can evaluate
   this integral directly,

   EI_t(x) = (µ_t(x) − f̂_t) Φ(−z_t(x)) − σ_t(x) ( lim_{ε→∞} ϕ(ε) − ϕ(z_t(x)) ).

   Using the symmetry of ϕ around 0, we obtain

   EI_t(x) = (µ_t(x) − f̂_t) Φ((µ_t(x) − f̂_t) / σ_t(x)) + σ_t(x) ϕ((µ_t(x) − f̂_t) / σ_t(x)).
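
The closed form above is straightforward to implement. The following sketch (not part of the solution) evaluates it for a hypothetical posterior mean µ, standard deviation σ, and incumbent value f̂_t, and verifies it against a Monte Carlo estimate of E[(f(x) − f̂_t)^+].

```python
# Closed-form expected improvement and a Monte Carlo sanity check (illustrative values).
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI(x) = (mu - f_best) * Phi(z) + sigma * phi(z) with z = (mu - f_best) / sigma."""
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

mu, sigma, f_best = 0.3, 0.8, 0.5                       # hypothetical posterior at a single x
samples = np.random.default_rng(0).normal(mu, sigma, size=1_000_000)
print(expected_improvement(mu, sigma, f_best))          # closed form
print(np.maximum(samples - f_best, 0.0).mean())         # ≈ same value
```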

Solution to Problem 9.6.


1. As x_t^UCB is the UCB action,

   ∆̂_t(x_t^UCB) = u_t(x_t^UCB) − l_t(x_t^UCB) = 2β_{t+1} σ_t(x_t^UCB).

2. We first bound I_t(x_t^UCB):

   I_t(x_t^UCB) = I(f_{x_t^UCB}; y_{x_t^UCB} | x_{1:t}, y_{1:t})
                = (1/2) log(1 + σ_t^2(x_t^UCB) / σ_n^2).             (using Equation (8.13))

   Note that σ_t^2(x)/σ_n^2 ≤ C for some constant C since variance is
   decreasing monotonically. So, applying the hint,

   I_t(x_t^UCB) ≥ σ_t^2(x_t^UCB) / (2Cσ_n^2).

   Combining this with (1), we obtain

   Ψ̂_t(x_t^UCB) = ∆̂_t(x_t^UCB)^2 / I_t(x_t^UCB) ≤ 8Cσ_n^2 β_{t+1}^2.

3. With high probability,

   Ψ_t(x_{t+1}) ≤ Ψ̂_t(x_{t+1}) ≤ Ψ̂_t(x_t^UCB) ≤ 8Cσ_n^2 β_{t+1}^2,

   where the first inequality is due to ∆(x) ≤ ∆̂_t(x) with high probability,
   the second inequality is due to the definition of the IDS algorithm (9.29),
   and the third inequality is from (2). Invoking Theorem 9.9 with
   Ψ_T = 8Cσ_n^2 β_T^2, we obtain that

   R_T ≤ √(γ_T Ψ_T T) = β_T √(8Cσ_n^2 γ_T T)

   with high probability.

Solution to Problem 9.7. First notice that by the definition of S′, we
obtain the simpler objective:

   W(π) = ∑_{x∈X} ( π(x) µ_t(x) + ϕ(Φ^{-1}(π(x))) σ_t(x) ).

Next, we show that W(π) is concave by computing the Hessian:

   ∂W(π)/∂π(x) = µ_t(x) − σ_t(x) Φ^{-1}(π(x)) ϕ(Φ^{-1}(π(x))) · d/dπ(x) Φ^{-1}(π(x))
               = µ_t(x) − σ_t(x) Φ^{-1}(π(x)),

   ∂²W(π)/∂π(x)∂π(z) = −( σ_t(x) / ϕ(Φ^{-1}(π(x))) ) · 1{x = z}   { < 0 if x = z,  = 0 if x ≠ z },

where we used the inverse function rule twice. From the negative definiteness
of the Hessian, it follows that W(·) is strictly concave.

We show next that the optimum lies in the relative interior of the probability
simplex, π* ∈ relint(∆^X). Indeed, at the border of the probability simplex the
partial derivatives explode:

   ∂W(π)/∂π(x) = µ_t(x) − σ_t(x) Φ^{-1}(π(x))   →  +∞    as π(x) → 0⁺,
                                                  is finite for π(x) ∈ (0, 1),
                                                  →  −∞    as π(x) → 1⁻.

Together with the concavity of W(·), this ensures that π* ∈ relint(∆^X).
Hence, π* is a local optimizer of W(π) on the plane defined by ∑_{x∈X} π(x) = 1.
Consequently, we obtain the Lagrangian

   L(π, κ): (0, 1)^{|X|} × R → R,   (π, κ) ↦ W(π) + κ ( 1 − ∑_{x∈X} π(x) ).

Setting its partial derivatives equal to zero, we derive the closed-form
solution:

   0 = µ_t(x) − σ_t(x) Φ^{-1}(π*(x)) − κ*   ⇐⇒   π*(x) = Φ( (µ_t(x) − κ*) / σ_t(x) ),

where κ* ensures a valid distribution, i.e., ∑_{x∈X} π*(x) = 1.
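
The closed-form solution lends itself to a simple implementation: the following sketch (not part of the solution) computes π*(x) = Φ((µ_t(x) − κ*)/σ_t(x)) on a finite domain, finding κ* by root finding since ∑_x π(x) is strictly decreasing in κ. The posterior means and standard deviations below are arbitrary placeholders.

```python
# Compute the closed-form distribution π* by solving for the Lagrange multiplier κ*.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def probability_of_maximality(mu, sigma):
    pi = lambda kappa: norm.cdf((mu - kappa) / sigma)
    # ∑_x π(x) is strictly decreasing in κ, so a bracketing root-finder recovers κ*.
    lo, hi = mu.min() - 10 * sigma.max(), mu.max() + 10 * sigma.max()
    kappa_star = brentq(lambda k: pi(k).sum() - 1.0, lo, hi)
    return pi(kappa_star)

mu = np.array([0.1, 0.5, 0.4])      # hypothetical posterior means µ_t(x)
sigma = np.array([0.3, 0.2, 0.5])   # hypothetical posterior standard deviations σ_t(x)
print(probability_of_maximality(mu, sigma))   # sums to (approximately) one
```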



Solution to Problem 9.8. Since R1 and R2 are independent, we have

   P(max{R_1, R_2} ≤ x) = P({R_1 ≤ x} ∩ {R_2 ≤ x})
                        = P(R_1 ≤ x) · P(R_2 ≤ x)        (using independence)
                        = F_{R_1}(x) · F_{R_2}(x)
                        = Φ²(x/10),

where Φ is the CDF of the standard Gaussian. We are looking for the
probability that either R_1 or R_2 is larger than S = 1:

   P(max{R_1, R_2} > 1) = 1 − P(max{R_1, R_2} ≤ 1) = 1 − Φ²(1/10).

We have Φ²(1/10) ≈ 0.29. Thus, the probability that either R_1 or R_2 is
larger than S = 1 is approximately 0.71. Due to the symmetry in the
problem, we know

   P(R_1 > 1) = P(R_2 > 1) ≈ 0.35.
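
A quick numerical check of the first quantity (not part of the solution), assuming R_1, R_2 ∼ N(0, 10²) so that F_{R_i}(x) = Φ(x/10):

```python
# Verify P(max{R1, R2} > 1) = 1 - Φ²(1/10) by a Monte Carlo simulation.
import numpy as np
from scipy.stats import norm

print(1 - norm.cdf(1 / 10) ** 2)                       # closed form, ≈ 0.71
R = np.random.default_rng(0).normal(0, 10, size=(2, 1_000_000))
print((R.max(axis=0) > 1).mean())                      # Monte Carlo, ≈ 0.71
```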

Markov Decision Processes

Solution to Problem 10.1. We can use Equation (10.15) to write the


state-action values as a linear system of equations (i.e., as a “table”).
This linear system can be solved, for example, using Gaussian elimi-
nation to yield the desired result.
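
As a concrete illustration (the MDP of the problem is not reproduced here, so the arrays below are hypothetical placeholders), the following sketch solves the Bellman expectation equation exactly as the linear system v_π = (I − γP_π)^{-1} r_π and reads off the state-action values.

```python
# Exact policy evaluation by solving the linear Bellman expectation equation.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # p(x' | x, a), shape (A, X, X)
r = rng.uniform(size=(n_states, n_actions))                         # r(x, a)
policy = rng.integers(n_actions, size=n_states)                     # deterministic π(x)

P_pi = P[policy, np.arange(n_states)]          # transition matrix under π
r_pi = r[np.arange(n_states), policy]          # reward vector under π
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
q_pi = r + gamma * np.einsum('axy,y->xa', P, v_pi)
print(v_pi, q_pi, sep="\n")
```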

Solution to Problem 10.2. It follows directly from the definition of


the state-action value function (10.9) that

   argmax_{a∈A} q(x, a) = argmax_{a∈A} ( r(x, a) + γ ∑_{x′∈X} p(x′ | x, a) · v(x′) ).

Solution to Problem 10.3.


1. Recall from Bellman’s theorem (10.31) that a policy is optimal iff it
is greedy with respect to its state-action value function. Now, ob-
serve that in the “poor, unknown” state, the policy π is not greedy.
2. Analogously to Problem 10.1, we write the state-action values as a
linear system of equations and solve the system using, e.g., Gaus-
sian elimination.
3. Observe from the result of (2) that π ′ is greedy with respect to its
state-action value function, and hence, it follows from Bellman’s
theorem that π ′ is optimal.

Solution to Problem 10.4. Using the hint and v⋆ ≥ v_π for any policy π,

   ‖v_{π_t} − v⋆‖_∞ ≤ ‖B⋆ v_{π_{t−1}} − v⋆‖_∞
                    = ‖B⋆ v_{π_{t−1}} − B⋆ v⋆‖_∞      (using that v⋆ is a fixed-point of B⋆, that is, B⋆ v⋆ = v⋆)
                    ≤ γ ‖v_{π_{t−1}} − v⋆‖_∞          (using that B⋆ is a contraction, see Theorem 10.18)
                    ≤ γ^t ‖v_{π_0} − v⋆‖_∞.           (by induction)

Solution to Problem 10.5.


1. Recall that the value function v_M^π for an MDP M is defined as
   v_M^π(x) = E_π[∑_{t=0}^∞ γ^t R_t | X_0 = x]. Given an optimal policy π⋆ for
   M and any policy π, we know that for any x ∈ X,

        v_M^{π⋆}(x) ≥ v_M^π(x)
   ⇐⇒  E_{π⋆}[∑_{t=0}^∞ γ^t R_t | X_0 = x] ≥ E_π[∑_{t=0}^∞ γ^t R_t | X_0 = x]
   ⇐⇒  E_{π⋆}[∑_{t=0}^∞ γ^t αR_t | X_0 = x] ≥ E_π[∑_{t=0}^∞ γ^t αR_t | X_0 = x]       (multiplying both sides by α)
   ⇐⇒  E_{π⋆}[∑_{t=0}^∞ γ^t R_t′ | X_0 = x] ≥ E_π[∑_{t=0}^∞ γ^t R_t′ | X_0 = x]
   ⇐⇒  v_{M′}^{π⋆}(x) ≥ v_{M′}^π(x).

   Thus, π⋆ is an optimal policy for M′.


2. We give an example where the optimal policies differ when re-
wards are shifted.
Consider an MDP with three states {1, 2, 3} where 1 is the initial
state and 3 is a terminal state. If one plays action A in states 1
or 2 one transitions directly to the terminal state. Additionally, in
state 1 one can play action B which leads to state 2. Let every
.
transition give a deterministic reward of r = −1. Then it is optimal
to traverse the shortest path to the terminal state, in particular, to
choose action A when in state 1.
.
If we consider the reward r ′ = r + 2 = 1, then it is optimal to tra-
verse the longest path to the terminal state, in particular, to choose
action B when in state 1.
3. For an MDP M, we know that its optimal state-action value function
   satisfies Bellman's optimality equation (10.34),

   q⋆_M(x, a) = E_{x′|x,a}[ r(x, x′) + γ max_{a′∈A} q⋆_M(x′, a′) ].

   For the MDP M′, we have

   q⋆_{M′}(x, a) = E_{x′|x,a}[ r′(x, x′) + γ max_{a′∈A} q⋆_{M′}(x′, a′) ]
                 = E_{x′|x,a}[ r(x, x′) + f(x, x′) + γ max_{a′∈A} q⋆_{M′}(x′, a′) ]
                 = E_{x′|x,a}[ r(x, x′) + γϕ(x′) − ϕ(x) + γ max_{a′∈A} q⋆_{M′}(x′, a′) ].

   Reorganizing the terms, we obtain

   q⋆_{M′}(x, a) + ϕ(x) = E_{x′|x,a}[ r(x, x′) + γ max_{a′∈A} ( q⋆_{M′}(x′, a′) + ϕ(x′) ) ].

   If we now define q(x, a) = q⋆_{M′}(x, a) + ϕ(x), we have

   q(x, a) = E_{x′|x,a}[ r(x, x′) + γ max_{a′∈A} q(x′, a′) ].

   This is exactly Bellman's optimality equation for the MDP M with
   reward function r, and hence, q ≡ q⋆_M.
   If we take π⋆ to be an optimal policy for M, then it satisfies

   π⋆(x) ∈ argmax_{a∈A} q⋆_M(x, a)
         = argmax_{a∈A} ( q⋆_{M′}(x, a) + ϕ(x) )      (using the above characterization of q⋆_M)
         = argmax_{a∈A} q⋆_{M′}(x, a).                (using that ϕ(x) is independent of a)

Solution to Problem 10.6.


1. We compute the answer in three steps:
   (a) Predict step: We compute the predicted belief after action W
       but before observing o1.

       b0′(F)  = b0(F) · 0.6 + b0(F̄) · 0.5 = 0.55
       b0′(F̄) = b0(F) · 0.4 + b0(F̄) · 0.5 = 0.45.

   (b) Update step: Using the observation o1 and the observation
       model, we update the belief.

       b1(F)  = (1/Z) b0′(F) P(o1 | F)  = (1/Z) · 0.55 · 0.8
       b1(F̄) = (1/Z) b0′(F̄) P(o1 | F̄) = (1/Z) · 0.45 · 0.3.

   (c) Normalization: We compute the normalization constant Z.

       Z = 0.55 · 0.8 + 0.45 · 0.3 = 0.575.

   Therefore,

       b1(F)  = (0.55 · 0.8) / 0.575 ≈ 0.765,
       b1(F̄) = (0.45 · 0.3) / 0.575 ≈ 0.235.
2. We observe that if A1 = P, then P(X2 = F) = 0 given any starting
   state, which will result in a zero in the belief update formula.
   Therefore,
   • b2 = (0, 1) given A1 = P and O1 = o1,
   • b2 = (0, 1) given A1 = P and O1 = o2.

   Case with A1 = W and O1 = o1: Using the belief update formula (10.42),

       b2(F)  ∝ P(o1 | F)  · ( b1(F) · p(F | F, W)  + b1(F̄) · p(F | F̄, W) )   ≈ 0.461
       b2(F̄) ∝ P(o1 | F̄) · ( b1(F) · p(F̄ | F, W) + b1(F̄) · p(F̄ | F̄, W) )  ≈ 0.127.

   By rescaling the probabilities, we get b2 ≈ (0.784, 0.216).

   Case with A1 = W and O1 = o2: Using the belief update formula (10.42),

       b2(F)  ∝ P(o2 | F)  · ( b1(F) · p(F | F, W)  + b1(F̄) · p(F | F̄, W) )   ≈ 0.115
       b2(F̄) ∝ P(o2 | F̄) · ( b1(F) · p(F̄ | F, W) + b1(F̄) · p(F̄ | F̄, W) )  ≈ 0.296.

   By rescaling the probabilities, we get b2 ≈ (0.279, 0.721).
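
The predict-update recursion used above is easy to implement. The following sketch (not part of the solution) reproduces the numbers for a two-state POMDP; the transition and observation probabilities are reconstructed from the values appearing in the computation (action W only), with the remaining entries filled in as complements.

```python
# Bayesian filtering in a two-state POMDP with states (F, F̄), reproducing the beliefs above.
import numpy as np

P_trans = np.array([[0.6, 0.4],    # p(x' | x = F,  W)
                    [0.5, 0.5]])   # p(x' | x = F̄, W)
P_obs = np.array([[0.8, 0.2],      # P(o1 | F), P(o2 | F)
                  [0.3, 0.7]])     # P(o1 | F̄), P(o2 | F̄)

def belief_update(b, obs):
    """One step of Bayesian filtering: predict with the dynamics, then condition on obs."""
    predicted = b @ P_trans                     # predict step
    unnormalized = P_obs[:, obs] * predicted    # update step
    return unnormalized / unnormalized.sum()    # normalization

b1 = belief_update(np.array([0.5, 0.5]), obs=0)   # after W, observing o1
print(b1)                                         # ≈ (0.765, 0.235)
print(belief_update(b1, obs=0))                   # ≈ (0.784, 0.216)
print(belief_update(b1, obs=1))                   # ≈ (0.279, 0.721)
```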

Tabular Reinforcement Learning

Solution to Problem 11.1.

1. Q⋆ ( A, ↓) = 1.355, Q⋆ ( G1 , exit) = 5.345, Q⋆ ( G2 , exit) = 0.5


2. Repeating the given episodes infinitely often will not lead to con-
vergence to the optimal Q-function because not all state-action
pairs are visited infinitely often.
Let us assume we observe the following episode instead of the first
episode.
Episode 3
x a x′ r
A → B 0
B ↓ G2 0
G2 exit 1
If we repeat episodes 2 and 3 infinitely often, Q-learning will con-
verge to the optimal Q-function as all state-action pairs will be
visited infinitely often.
3. First, recall that Q-learning is an off-policy algorithm, and hence,
even if episodes are obtained off-policy, Q-learning will still con-
verge to the optimal Q-function (if the other convergence condi-
tions are met). Note that it only matters which state-action pairs
are observed and not which policies were followed to obtain these
observations.
The “closer” the initial Q-values are to the optimal Q-function,
the faster the convergence of Q-learning. However, if the conver-
gence conditions are met, Q-learning will converge to the optimal
Q-function regardless of the initial Q-values.
4. v⋆ ( A) = 10, v⋆ ( B) = 10, v⋆ ( G1 ) = 10, v⋆ ( G2 ) = 1
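
As a concrete illustration of the update rule underlying this problem (the specific gridworld and episodes are not reproduced here), the following generic tabular Q-learning sketch applies Q(x, a) ← Q(x, a) + α(r + γ max_{a′} Q(x′, a′) − Q(x, a)) to a hypothetical episode.

```python
# Generic tabular Q-learning on hypothetical episodes (placeholder states and actions).
from collections import defaultdict

def q_learning(episodes, alpha=0.5, gamma=0.9, terminal_value=0.0):
    Q = defaultdict(float)
    for episode in episodes:
        for x, a, r, x_next, actions_next in episode:
            target = r + gamma * max((Q[(x_next, a2)] for a2 in actions_next),
                                     default=terminal_value)
            Q[(x, a)] += alpha * (target - Q[(x, a)])
    return Q

# Hypothetical episode in the format (x, a, r, x', actions available in x'):
episode = [("A", "right", 0.0, "B", ["down"]),
           ("B", "down", 0.0, "G2", ["exit"]),
           ("G2", "exit", 1.0, None, [])]
print(dict(q_learning([episode] * 100)))
```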

Model-free Reinforcement Learning

Solution to Problem 12.1.


1. We have to show that

   v⋆(x) = max_{a∈A} ( r(x, a) + γ E_{x′|x,a}[v⋆(x′)] )

   for every x ∈ {1, 2, . . . , 7}. We give a derivation here for x = 1 and
   x = 2.
   • For x = 1,

     v⋆(1) = −3,
     max_{a∈A} ( r(1, a) + γ E_{x′|1,a}[v⋆(x′)] ) = −3

     since

     r(1, a) + γ E_{x′|1,a}[v⋆(x′)] = { −1 + (−2) = −3  if a = 1,   −1 + (−3) = −4  if a = −1. }

   • Likewise, for x = 2,

     v⋆(2) = −2,
     max_{a∈A} ( r(2, a) + γ E_{x′|2,a}[v⋆(x′)] ) = −2

     since

     r(2, a) + γ E_{x′|2,a}[v⋆(x′)] = { −1 + (−1) = −2  if a = 1,   −1 + (−3) = −4  if a = −1. }
2. We have

   Q(3, −1) = 0 + (1/2) ( −1 + max_{a′∈A} Q(2, a′) ) = (1/2)(−1 + 0) = −1/2
   Q(2, 1)  = 0 + (1/2) ( −1 + max_{a′∈A} Q(3, a′) ) = (1/2)(−1 + 0) = −1/2
   Q(3, 1)  = 0 + (1/2) ( −1 + max_{a′∈A} Q(4, a′) ) = (1/2)(−1 + 0) = −1/2
   Q(4, 1)  = 0 + (1/2) ( 0 + max_{a′∈A} Q(4, a′) )  = (1/2)(0 + 0)  = 0.
3. We compute

   ∇_w ℓ(w; τ) = −( r + γ max_{a′∈A} Q(x′, a′; w_old) − Q(x, a; w) ) · (x, a, 1)^⊤      (using the derivation of Equation (12.15))
               = −( −1 + max_{a′∈A} {1 − a′ − 2} − (−2 − 1 + 1) ) · (2, −1, 1)^⊤
               = (−2, 1, −1)^⊤.

   This gives

   w′ = w − α ∇_w ℓ(w; τ)
      = (−1, 1, 1)^⊤ − (1/2) · (−2, 1, −1)^⊤ = (0, 1/2, 3/2)^⊤.
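
The following is a generic sketch of the semi-gradient update used in part 3, for a linear Q-function Q(x, a; w) = w^⊤(x, a, 1) with the TD target treated as a constant computed from w_old; the transition, weights, and action set below are arbitrary placeholders rather than the numbers of the problem.

```python
# Semi-gradient TD update for a linear Q-function (placeholder transition and weights).
import numpy as np

def features(x, a):
    return np.array([x, a, 1.0])

def td_update(w, w_old, transition, actions, alpha=0.5, gamma=1.0):
    x, a, r, x_next = transition
    target = r + gamma * max(w_old @ features(x_next, a2) for a2 in actions)
    grad = -(target - w @ features(x, a)) * features(x, a)   # ∇_w ℓ(w; τ)
    return w - alpha * grad

w = np.zeros(3)
print(td_update(w, w_old=w.copy(), transition=(1.0, 1.0, 0.5, 2.0), actions=[-1.0, 1.0]))
```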

Solution to Problem 12.2. We have

   ∇_φ log π_φ(a | x) = ∇_φ π_φ(a | x) / π_φ(a | x)                                                  (using the chain rule)
                      = ϕ(x, a) − ( ∑_{b∈A} ϕ(x, b) exp(φ^⊤ ϕ(x, b)) ) / ( ∑_{b∈A} exp(φ^⊤ ϕ(x, b)) )  (using elementary calculus)
                      = ϕ(x, a) − ∑_{b∈A} π_φ(b | x) · ϕ(x, b).
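
The identity derived above can be checked numerically by finite differences; the feature map and parameters in the following sketch (not part of the solution) are arbitrary.

```python
# Finite-difference check of the softmax score function ∇_φ log π_φ(a|x).
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 4, 3
Phi = rng.standard_normal((n_actions, dim))     # ϕ(x, b) for a fixed state x
phi_params = rng.standard_normal(dim)           # policy parameters φ

def log_pi(params, a):
    logits = Phi @ params
    return logits[a] - np.log(np.exp(logits).sum())

a = 2
pi = np.exp(Phi @ phi_params); pi /= pi.sum()
analytic = Phi[a] - pi @ Phi                    # ϕ(x,a) − ∑_b π_φ(b|x) ϕ(x,b)
eps = 1e-6
numeric = np.array([(log_pi(phi_params + eps * e, a) - log_pi(phi_params - eps * e, a)) / (2 * eps)
                    for e in np.eye(dim)])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```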

Solution to Problem 12.3.

1. The result follows directly using that

   Var[f(X) − g(X)] = Var[f(X)] + Var[g(X)] − 2Cov[f(X), g(X)].      (using Equation (1.39))

2. Denote by r(τ) the discounted rewards attained by trajectory τ.
   Let f(τ) = r(τ) ∇_φ log Π_φ(τ) and g(τ) = b ∇_φ log Π_φ(τ). Recall that
   E_{τ∼Π_φ}[g(τ)] = 0, implying that

   Var[g(τ)] = Var[b ∇_φ log Π_φ(τ)]
             = E[(b ∇_φ log Π_φ(τ))²].                               (using the definition of variance (1.34))

   On the other hand,

   Cov[f(τ), g(τ)] = E[(f(τ) − E[f(τ)]) g(τ)]                        (using the definition of covariance (1.26))
                   = E[f(τ) g(τ)] − E[f(τ)] E[g(τ)]                  (using linearity of expectation (1.20),
                                                                      and E[g(τ)] = 0)
                   = E[b · r(τ) · (∇_φ log Π_φ(τ))²].

   Therefore, if b² ≤ 2b · r(x, a) for every state x ∈ X and action
   a ∈ A, then the result follows from Equation (12.100).

Solution to Problem 12.4. First, observe that


" #
T −1
Eτ ∼Πφ G0 ∇φ log Πφ(τ ) = Eτ ∼Πφ ∑ G0 ∇φ log πφ( at | xt ) ,
 
using Equation (12.30)
t =0

and hence, it suffices to show


" #
T −1
Eτ ∼Πφ ∑ G0 ∇φ log πφ(at | xt )
t =0
" # (B.15)
T −1
= Eτ ∼Πφ ∑ (G0 − b(τ0:t−1 ))∇φ log πφ(at | xt ) .
t =0

We prove Equation (B.15) with an induction on T. The base case (T =


0) is satisfied trivially. Fixing any T and assuming Equation (B.15)
holds for T, we have,
" #
T
Eτ ∼Πφ ∑ (G0 − b(τ0:t−1 ))∇φ log πφ(at | xt )
t =0
" " ##
T
= Eτ0:T−1 EτT ∑ (G0 − b(τ0:t−1 ))∇φ log πφ(at | xt ) τ0:T −1 using the tower rule (1.25)
t =0
"
T −1
= Eτ0:T−1 ∑ G0 ∇φ log πφ(at | xt ) using the induction hypothesis
t =0
#
+ EτT ( G0 − b(τ0:T −1 ))∇φ log πφ( a T | x T ) τ0:T −1
 
.

Using the score function trick for the score function ∇φ log πφ( at | xt )
analogously to the proof of Lemma 12.6, we have,

EτT b(τ0:T −1 )∇φ log πφ( a T | x T ) τ0:T −1


 

= EaT b(τ0:T −1 )∇φ log πφ( a T | x T ) τ0:T −1


 
Z
= b(τ0:T −1 ) πφ( a T | x T )∇φ log πφ( a T | x T ) da T
Z
= b(τ0:T −1 ) ∇φ πφ( a T | x T ) da T using the score function trick,
= 0. ∇φ log πφ ( at | xt ) = ∇φ πφ (at |xt )/πφ (at |xt )

Thus,
" " ##
T
= Eτ0:T−1 EτT ∑ G0 ∇φ log πφ(at | xt ) τ0:T −1
t =0
" #
T
= Eτ ∼Πφ ∑ G0 ∇φ log πφ(at | xt ) . using the tower rule (1.25) again
t =0

Solution to Problem 12.5. Each trajectory τ is described by four tran-


sitions,

τ = ( x0 , a0 , r0 , x1 , a1 , r1 , x2 , a2 , r2 , x3 , a3 , r3 , x4 ).

Moreover, we have given

   π_θ(2 | x) = θ,       ∂π_θ(2 | x)/∂θ = +1,
   π_θ(1 | x) = 1 − θ,   ∂π_θ(1 | x)/∂θ = −1.

We first compute the downstream returns for the given episode,

   G_{0:4} = r_0 + γ r_1 + γ² r_2 + γ³ r_3 = 1 + (1/2)·0 + (1/4)·1 + (1/8)·1 = 11/8
   G_{1:4} = r_1 + γ r_2 + γ² r_3 = 0 + (1/2)·1 + (1/4)·1 = 3/4
   G_{2:4} = r_2 + γ r_3 = 1 + (1/2)·1 = 3/2
   G_{3:4} = r_3 = 1.

Lastly, we can combine them to compute the policy gradient,

   ∇_θ j(θ) ≈ ∑_{t=0}^{3} γ^t G_{t:4} ∇_θ log π_θ(a_t | x_t)         (using Monte Carlo approximation of
                                                                      Equation (12.40) with a single sample)
            = 1 · (11/8) − 1 · (1/2) · (3/4) + 1 · (1/4) · (3/2) − 1 · (1/8) · 1 = 5/4.
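
The computation above is reproduced by the following sketch (not part of the solution), which evaluates the downstream returns and the single-sample estimate ∑_t γ^t G_{t:T} ∇_θ log π_θ(a_t | x_t) with the per-step score values ±1 used above.

```python
# Single-sample policy gradient estimate for the episode computed above (γ = 1/2).
import numpy as np

gamma = 0.5
rewards = np.array([1.0, 0.0, 1.0, 1.0])
scores = np.array([+1.0, -1.0, +1.0, -1.0])   # per-step values of ∇_θ log π_θ(a_t | x_t)

T = len(rewards)
returns = np.array([sum(gamma**k * rewards[t + k] for k in range(T - t)) for t in range(T)])
grad_estimate = sum(gamma**t * returns[t] * scores[t] for t in range(T))
print(returns)        # [11/8, 3/4, 3/2, 1]
print(grad_estimate)  # 5/4
```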

Solution to Problem 12.6.


1. The distribution on actions is given by

   π_φ(a | x) = σ(f_φ(x))^a · (1 − σ(f_φ(x)))^{1−a}.

   To simplify the notation, we write E for E_{(x,a)∼π_φ} and q for q^{π_φ}.
   We get

   ∇_φ j(φ) = E[ q(x, a) ∇_φ ( a log σ(f_φ(x)) + (1 − a) log(1 − σ(f_φ(x))) ) ]                      (using the policy gradient theorem (12.53))
            = E[ q(x, a) ∇_f ( a log σ(f) + (1 − a) log(1 − σ(f)) ) ∇_φ f_φ(x) ]                     (using the chain rule)
            = E[ q(x, a) ∇_f ( −a log(1 + e^{−f}) + (1 − a) log( e^{−f} / (1 + e^{−f}) ) ) ∇_φ f_φ(x) ]  (using the definition of the logistic function (5.9))
            = E[ q(x, a) ∇_f ( −f + af − log(1 + e^{−f}) ) ∇_φ f_φ(x) ]
            = E[ q(x, a) ( a − 1 + e^{−f} / (1 + e^{−f}) ) ∇_φ f_φ(x) ]
            = E[ q(x, a) ( a − σ(f) ) ∇_φ f_φ(x) ].                                                  (using the definition of the logistic function (5.9))

   The term a − σ(f) can be understood as a residual as it corresponds
   to the difference between the target action a and the expected action σ(f).
2. We have

   ∇_φ j(φ) = E[ q(x, a) ∇_f log π_f(a | x) ∇_φ f(x) ]               (using Equation (12.53) and the chain rule)
            = E[ q(x, a) ∇_f ( log h(a) + a f − A(f) ) ∇_φ f(x) ]
            = E[ q(x, a) ( a − ∇_f A(f) ) ∇_φ f(x) ].

3. We have ∇_f A(f) = σ(f). We are therefore looking for a function A(f) whose
   derivative is σ(f) = 1/(1 + e^{−f}) = e^f/(1 + e^f). With this equality of the
   sigmoid we can find the integral, and we have A(f) = log(1 + e^f) + c. Let us
   confirm that this gives us the Bernoulli distribution with c = 0:

   π_f(a | x) = h(a) exp( a f − log(1 + e^f) )
              = h(a) e^{af} / (1 + e^f)
              = h(a) · { σ(f)      if a = 1,
                         1 − σ(f)  if a = 0. }

   This is the Bernoulli distribution with parameter σ(f) where we
   have h(a) = 1.
4. Using that ∇f A( f ) = f , we immediately get

∇φ j(φ) = E q( x, a)( a − f )∇φ f ( x) .


 

5. No, we cannot use the reparameterization trick since we do not


know how the states x depend on action a. These dependencies
are determined by the unknown dynamics of the environment.
Nonetheless, we can apply it after sampling an episode accord-
ing to a policy and then updating policy parameters in hindsight.
This is for example done by the soft actor-critic (SAC) algorithm
(cf. Section 12.6).

Solution to Problem 12.7.


1. Equation (12.84) can be written as

   KL(Π_φ ∥ Π⋆) = ∑_{t=1}^T E_{(x_t,a_t)∼Π_φ}[ −(1/λ) r(x_t, a_t) − H[π_φ(· | x_t)] ]
                = ∑_{t=1}^T E_{(x_t,a_t)∼Π_φ}[ −log β(a_t | x_t) − H[π_φ(· | x_t)] ].

   Adding and subtracting log Z(x_t) gives

                = ∑_{t=1}^T E_{(x_t,a_t)∼Π_φ}[ S[π̂(a_t | x_t)] − log Z(x_t) − H[π_φ(· | x_t)] ]
                = ∑_{t=1}^T E_{x_t∼Π_φ}[ H[π_φ(· | x_t) ∥ π̂(· | x_t)] − log Z(x_t) − H[π_φ(· | x_t)] ]   (using the definition of cross-entropy (5.32))
                = ∑_{t=1}^T E_{x_t∼Π_φ}[ KL(π_φ(· | x_t) ∥ π̂(· | x_t)) − log Z(x_t) ].                    (using the definition of KL-divergence (5.34))

2. We prove the statement by (reverse) induction on t. For the base
   case, note that the term

   E_{x_T∼Π_φ}[ KL(π_φ(· | x_T) ∥ π̂(· | x_T)) − log Z(x_T) ]

   is minimized for π_φ ≡ π̂. The KL-divergence then evaluates to
   zero, and we are left only with the log Z(x_T) term.
   For the inductive step, fix any 1 ≤ t < T. The policy π⋆(a_t | x_t) must
   minimize the two terms

   E_{x_t∼Π_φ}[ KL(π_φ(· | x_t) ∥ π̂(· | x_t)) − log Z(x_t) ]
   + E_{(x_t,a_t)∼Π_φ}[ E_{x_{t+1}∼p(·|x_t,a_t)}[ −log Z(x_{t+1}) ] ],

   where the first term stems directly from the objective (12.106) and
   the second term represents the contribution of π_φ(a_t | x_t) to all
   subsequent terms. Letting β⋆(a | x) = exp(q⋆(x, a)), Z⋆(x) = ∫_A β⋆(a | x) da,
   and recalling that we denote by π⋆(· | x) the policy β⋆(· | x)/Z⋆(x), this
   objective can be re-expressed as

   E_{x_t∼Π_φ}[ KL(π_φ(· | x_t) ∥ π⋆(· | x_t)) − log Z⋆(x_t) ],

   which is minimized for π_φ ≡ π⋆, leaving only the log Z⋆(x_t) term.
   It remains only to observe that β⋆(a_T | x_T) = β(a_T | x_T), so in the
   final state x_T, log Z⋆(x_T) = log Z(x_T) and the policies π⋆ and π̂
   coincide.

Solution to Problem 12.8.

1. Let O denote the event that the response y is optimal. Since the
   prior over actions (i.e., responses) is not uniform, we have

   Π⋆(y | x) = p(y | x, O)
             ∝ p(y | x) · p(O | x, y)                                 (using Bayes' rule (1.45))
             = Π_{φ_init}(y | x) exp( (1/λ) r(y | x) ).               (B.16)

   The derivation is then analogous to Equation (12.84):

   argmin_φ KL(Π_φ(· | x) ∥ Π⋆(· | x))
      = argmin_φ H[Π_φ(· | x) ∥ Π⋆(· | x)] − H[Π_φ(· | x)]                                   (using the definition of KL-divergence (5.34))
      = argmax_φ E_{y∼Π_φ(·|x)}[ log Π⋆(y | x) − log Π_φ(y | x) ]                            (using the definition of cross-entropy (5.32) and entropy (5.27))
      = argmax_φ E_{y∼Π_φ(·|x)}[ r(y | x) + λ log Π_{φ_init}(y | x) − λ log Π_φ(y | x) ]     (using Equation (B.16) and simplifying)
      = argmax_φ J_λ(φ; φ_init | x).

2. Recall that the KL-divergence is minimized at 0 if and only if the


two distributions are identical. The desired result follows from (1)
and Equation (B.16).

Model-based Reinforcement Learning

Solution to Problem 13.1.


1. To simplify the notation, we use z_{t,k} as shorthand for (x_{t,k}, π_t(x_{t,k}))
   (and similarly ẑ_{t,k} for (x̂_{t,k}, π_t(x̂_{t,k}))). The base case is implied
   trivially. For the induction step, assume that Equation (13.39) holds at
   iteration k. We have

   x̂_{t,k+1} − x_{t,k+1} = f̂_{t−1}(ẑ_{t,k}) − f(z_{t,k})
                          ≤ f̂_{t−1}(ẑ_{t,k}) − f(ẑ_{t,k}) + f(ẑ_{t,k}) − f(z_{t,k})      (adding and subtracting f(ẑ_{t,k}) and
                                                                                            using Cauchy-Schwarz)
                          ≤ 2β_t σ_{t−1}(ẑ_{t,k}) + L_1 (x̂_{t,k} − x_{t,k}),

   where the final inequality follows with high probability from the
   assumption that the confidence intervals are well-calibrated (cf.
   Equation (13.22)) and the assumed Lipschitzness. Continuing,

                          = 2β_t [ σ_{t−1}(z_{t,k}) + ( σ_{t−1}(ẑ_{t,k}) − σ_{t−1}(z_{t,k}) ) ] + L_1 (x̂_{t,k} − x_{t,k}).

   Once more using Lipschitz continuity, we obtain

                          ≤ 2β_t ( σ_{t−1}(z_{t,k}) + L_2 (x̂_{t,k} − x_{t,k}) ) + L_1 (x̂_{t,k} − x_{t,k})
                          = 2β_t σ_{t−1}(z_{t,k}) + α_t (x̂_{t,k} − x_{t,k}),

   where α_t = L_1 + 2β_t L_2. By the induction hypothesis,

                          ≤ 2β_t ∑_{l=0}^{k} α_t^{k−l} σ_{t−1}(z_{t,l}).

   This is identical to the analysis of UCB from Problem 9.3, only that
   here errors compound along the trajectory.
2. The assumption that α_t ≥ 1 implies that

   x̂_{t,k} − x_{t,k} ≤ 2β_t α_t^{H−1} ∑_{l=0}^{k−1} σ_{t−1}(z_{t,l}).                (B.17)

   Moreover, by definition of π_t, we have with high probability that
   J_H(π_t; f̂) ≥ J_H(π⋆; f). This is because π_t maximizes reward under
   optimistic dynamics. Thus,

   r_t = ∑_{k=0}^{H−1} r(x_{t,k}, π⋆(x_{t,k})) − r(x_{t,k}, π_t(x_{t,k}))
       ≤ ∑_{k=0}^{H−1} r(x̂_{t,k}, π_t(x̂_{t,k})) − r(x_{t,k}, π_t(x_{t,k}))
       ≤ L_3 ∑_{k=0}^{H−1} (x̂_{t,k} − x_{t,k})                                        (using Lipschitz-continuity of r)
       ≤ 2β_t α_t^{H−1} L_3 ∑_{k=0}^{H−1} ∑_{l=0}^{k−1} σ_{t−1}(z_{t,l})               (using Equation (B.17))
       ≤ 2β_t H α_t^{H−1} L_3 ∑_{k=0}^{H−1} σ_{t−1}(z_{t,k}).

3. Let us first bound R_T^2. By the Cauchy-Schwarz inequality,

   R_T^2 ≤ T ∑_{t=1}^T r_t^2
         ≤ O( T ∑_{t=1}^T β_t^2 H^2 α_t^{2(H−1)} ( ∑_{k=0}^{H−1} σ_{t−1}(z_{t,k}) )^2 )      (using (2))
         ≤ O( T β_T^2 H^3 α_T^{2(H−1)} ∑_{t=1}^T ∑_{k=0}^{H−1} σ_{t−1}^2(z_{t,k}) ).         (using Cauchy-Schwarz and assuming
                                                                                              w.l.o.g. that β_t is monotonically increasing)

   Taking the square root, we obtain

   R_T ≤ O( β_T H^{3/2} α_T^{H−1} √( T ∑_{t=1}^T ∑_{k=0}^{H−1} σ_{t−1}^2(z_{t,k}) ) )
       ≤ O( β_T H^{3/2} α_T^{H−1} √(T Γ_T) ).

Mathematical Background

Solution to Problem A.1. Using that X is zero-mean, we have that

   E[X̄_n²] = Var[X̄_n] = (1/n) Var[X]

and

   E[ (1/n) ∑_{i=1}^n X_i² ] = (1/n) ∑_{i=1}^n E[X_i²] = Var[X].

Thus,

   E[S_n²] = (n/(n−1)) ( E[ (1/n) ∑_{i=1}^n X_i² ] − E[X̄_n²] ) = Var[X].
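
A quick simulation (not part of the solution) illustrating the unbiasedness shown above; the distribution of X and the sample size are arbitrary.

```python
# The 1/(n−1)-normalized sample variance is unbiased; the 1/n version is not.
import numpy as np

rng = np.random.default_rng(0)
n, reps, true_var = 5, 200_000, 4.0
X = rng.normal(0.0, np.sqrt(true_var), size=(reps, n))
print(X.var(axis=1, ddof=1).mean())   # ≈ 4.0 (unbiased)
print(X.var(axis=1, ddof=0).mean())   # ≈ 4.0 * (n-1)/n = 3.2 (biased)
```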

Solution to Problem A.2.


1. W.l.o.g. we assume that X is continuous. We have

   EX = ∫_0^∞ x · p(x) dx                                              (using the definition of expectation (1.19))
      = ∫_0^ϵ x · p(x) dx + ∫_ϵ^∞ x · p(x) dx ≥ ϵ ∫_ϵ^∞ p(x) dx = ϵ · P(X ≥ ϵ),

   where the first integral is non-negative.

.
2. Consider the non-negative random variable Y = ( X − EX )2 . We
have
 
P(| X − EX | ≥ ϵ) = P ( X − EX )2 ≥ ϵ2
E ( X − EX )2
 
≤ using Markov’s inequality (A.71)
ϵ2
VarX
= . using the definition of variance (1.34)
ϵ2
3. Fix any ϵ > 0. Applying Chebyshev’s inequality and noting that
EX n = EX, we obtain

VarX n
P X n − EX ≥ ϵ ≤

.
ϵ2
We further have for the variance of the sample mean that
" #
1 n 1 n VarX
VarX n = Var ∑
n i =1
Xi = 2 ∑ Var[ Xi ] =
n i =1 n
.

Thus,
VarX
lim P X n − EX ≥ ϵ ≤ lim 2 = 0

n→∞ n→∞ ϵ n
P
which is precisely the definition of X n → EX.

Solution to Problem A.3. Fix a λ > 0. Using a first-order expansion,


we have

f ( x + λd) = f ( x) + λ∇ f ( x)⊤ d + o (λ ∥d∥2 ).

Dividing by λ yields

   ( f(x + λd) − f(x) ) / λ = ∇f(x)^⊤ d + o(λ ‖d‖_2) / λ,

where the last term tends to 0. Taking the limit λ → 0 gives the desired result.


Summary of Notation

We follow these general rules:


• uppercase italic for constants N
• lowercase italic for indices i and scalar variables x
• lowercase italic bold for vectors x, entries are denoted x(i )
• uppercase italic bold for matrices M, entries are denoted M (i, j)
• uppercase italic for random variables X
• uppercase bold for random vectors X, entries are denoted X(i )
• uppercase italic for sets A
• uppercase calligraphy for spaces (usually infinite sets) A

.
= equality by definition
≈ approximately equals
∝ proportional to (up to multiplicative constants), f ∝ g iff ∃k. ∀ x. f ( x ) = k · g( x )
const an (additive) constant
N set of natural numbers {1, 2, . . . }
N0 set of natural numbers, including 0, N ∪ {0}
R set of real numbers
[m] set of natural numbers from 1 to m, {1, 2, . . . , m − 1, m}
i:j subset of natural numbers between i and j, {i, i + 1, . . . , j − 1, j}
( a, b] real interval between a and b including b but not including a
f :A→B function f from elements of set A to elements of set B
f ◦g function composition, f ( g(·))
(·)+ max{0, ·}
log logarithm with base e
P ( A) power set (set of all subsets) of A
.
1{ predicate} indicator function (1{ predicate} = 1 if the predicate is true, else 0)
⊙ Hadamard (element-wise) product
← assignment

analysis

∇ f ( x ) ∈ Rn gradient of a function f : Rn → R at a point x ∈ Rn


Dg ( x) ∈ Rm×n Jacobian of a function g : Rn → Rm at a point x ∈ Rn
H f ( x ) ∈ Rn × n Hessian of a function f : Rn → R at a point x ∈ Rn
∇·F divergence operation on vector field F
∆f Laplacian of a scalar field f : Rn → R
f ∈ O( g)           f grows at most as fast as g (up to constant factors), 0 ≤ lim sup_{n→∞} f(n)/g(n) < ∞
f ∈ Õ( g)           f grows at most as fast as g up to constant and logarithmic factors
∥·∥α α-norm
∥·∥ A Mahalanobis norm induced by matrix A

linear algebra

I identity matrix
A⊤ transpose of matrix A
A −1 inverse of invertible matrix A
A1/2 square root of a symmetric and positive semi-definite matrix A
det( A) determinant of A
tr( A) trace of A, ∑i A(i, i )
diagi∈ I { ai } diagonal matrix with elements ai , indexed according to the set I

probability

Ω sample space
A event space
P probability measure
X∼P random variable X follows the distribution P
iid
X1:n ∼ P random variables X1:n are independent and identically distributed according to
distribution P
x∼P value x is sampled according to distribution P
PX cumulative distribution function of a random variable X
PX tail distribution function of a random variable X
PX−1 quantile function of a random variable X
pX probability mass function (if discrete) or probability density function
(if continuous) of a random variable X
∆A set of all probability distributions over the set A
δα Dirac delta function, point density at α
g♯ p pushforward of a density p under perturbation g

X⊥Y random variable X is independent of random variable Y


X⊥Y|Z random variable X is conditionally independent of random variable Y
given random variable Z
E[ X ] expected value of random variable X
Ex∼ X [ f ( x )] expected value of the random variable f ( X ), E[ f ( X )]
E[ X | Y ] conditional expectation of random variable X given random variable Y
Cov[ X, Y ] covariance of random variable X and random variable Y
Cor[ X, Y ] correlation of random variable X and random variable Y
Var[ X ] variance of random variable X
Var[ X | Y ] conditional variance of random variable X given random variable Y
ΣX covariance matrix of random vector X
ΛX precision matrix of random vector X
MSE( X ) mean squared error of random variable X
Xn sample mean of random variable X with n samples
Sn2 sample variance of random variable X with n samples
a.s.
Xn → X the sequence of random variables Xn converges almost surely to X
P
Xn → X the sequence of random variables Xn converges to X in probability
D
Xn → X the sequence of random variables Xn converges to X in distribution
S[ u ] surprise associated with an event of probability u
H[ p ], H[ X ] entropy of distribution p (or random variable X)
H[ p ∥ q ] cross-entropy of distribution q with respect to distribution p
KL( p∥q) KL-divergence of distribution q with respect to distribution p
J( p ∥ q ) relative Fisher information of distribution q with respect to distribution p
H[ X | Y ] conditional entropy of random variable X given random variable Y
H[ X, Y ] joint entropy of random variables X and Y
I( X; Y ) mutual information of random variables X and Y
I( X; Y | Z ) conditional mutual information of random variables X and Y given random
variable Z
N (µ, Σ ) normal distribution with mean µ covariance Σ
Laplace(µ, h) Laplace distribution with mean µ scale h
Unif(S) uniform distribution on the set S
Bern( p) Bernoulli distribution with success probability p
Bin(n, p) binomial distribution with n trials and success probability p
Beta(α, β) beta distribution with shape parameters α and β
Gamma(α, β) gamma distribution with shape α and rate β
Cauchy(m, τ ) Cauchy distribution with location m and scale τ
Pareto(c, α) Pareto distribution with cutoff threshold c and shape α

supervised learning

θ parameterization of a model

X input space
Y label space
x∈X input
ϵ( x) zero-mean noise, sometimes assumed to be independent of x
y∈Y (noisy) label, f ( x) + ϵ( x) where f is unknown
D ⊆ X ×Y labeled training data, {( xi , yi )}in=1
X ∈ Rn × d design matrix when X = Rd
Φ ∈ Rn × e design matrix in feature space Re
y ∈ Rn label vector when Y = R
p(θ) prior belief about θ
p(θ | x1:n , y1:n ) posterior belief about θ given training data
p(y1:n | x1:n , θ) likelihood of training data under the model parameterized by θ
p(y1:n | x1:n ) marginal likelihood of training data
θ̂MLE maximum likelihood estimate of θ
θ̂MAP maximum a posteriori estimate of θ
ℓnll (θ; D) negative log-likelihood of the training data D under model θ

bayesian linear models

w ∈ Rd weights of linear function f ( x; w) = w⊤ x


ŵls least squares estimate of w
ŵridge ridge estimate of w
ŵlasso lasso estimate of w
N (0, σp2 I ) prior
N (w⊤ x, σn2 ) likelihood
µ ∈ Rd posterior mean, σn−2 ΣX ⊤ y
Σ ∈ Rd × d posterior covariance matrix, (σn−2 X ⊤ X + σp−2 I )−1
K ∈ Rn × n kernel matrix, σp2 X X ⊤
σ logistic function
Bern(σ(w⊤ x)) logistic likelihood
ℓlog (·; x, y) logistic loss of a single training example ( x, y)

kalman filters

Xt sequence of hidden states in Rd


Yt sequence of observations in Rm
F ∈ Rd × d motion model
H ∈ Rm × d sensor model
εt zero-mean motion noise with covariance matrix Σ x
ηt zero-mean sensor noise with covariance matrix Σ y

K t ∈ Rd × m Kalman gain

gaussian processes

µ:X →R mean function


k : X ×X → R kernel function / covariance function
f ∼ GP (µ, k ) f is a Gaussian process with mean function µ and kernel function k
Hk (X ) reproducing kernel Hilbert space associated with kernel function k : X × X → R

deep models

Wl ∈ Rnl ×nl −1 weight matrix of layer l


ν ( l ) ∈ Rn l activations of layer l
φ activation function
Tanh hyperbolic tangent activation function
ReLU rectified linear unit activation function
σi ( f ) softmax function computing the probability mass of class i given outputs f

variational inference

Q variational family
λ∈Λ variational parameters
qλ variational posterior parameterized by λ
L(q, p; D) evidence lower bound for data D of variational posterior q and true posterior p(· | D)

markov chains

S set of n states
Xt sequence of states
p( x′ | x ) transition function, probability of going from state x to state x ′
p(t) ( x ′ | x ) probability of reaching x ′ from x in exactly t steps
P ∈ Rn × n transition matrix
qt distribution over states at time t
π stationary distribution
∥µ − ν∥TV total variation distance between two distributions µ and ν
τTV mixing time with respect to total variation distance

markov chain monte carlo methods

r ( x′ | x) proposal distribution, probability of proposing x′ when in x


α( x′ | x) acceptance distribution, probability of accepting the proposal x′ when in x
f energy function

active learning

S⊆X set of observations


I (S) maximization objective, quantifying the “information value” of S
∆ I ( x | A) marginal gain of observation x with respect to objective I given prior observations A

bayesian optimization

RT cumulative regret for time horizon T


F ( x; µ, σ ) acquisition function
Ct ( x) confidence interval of f ⋆ ( x) after round t
β t (δ) scale of confidence interval to achieve confidence level δ
γT maximum information gain after T rounds

reinforcement learning

X, X set of states
A, A set of actions
p( x ′ | x, a) dynamics model, probability of transitioning from state x to state x ′ when playing
action a
r reward function
Xt sequence of states
At sequence of actions
Rt sequence of rewards
π (a | x) policy, probability of playing action a when in state x
Gt discounted payoff from time t
γ discount factor

t (x)
vπ state value function, average discounted payoff from time t starting from state x
t ( x, a )
qπ state-action value function, average discounted payoff from time t starting from state x
playing action a
aπt ( x, a ) advantage function, qπ t ( x, a ) − vt ( x )
π

J (π ) policy value function, expected reward of policy π


Acronyms

a.s. almost surely, with high probability, with probability LASSO least absolute shrinkage and selection operator
1 LD Langevin dynamics
A2C advantage actor-critic LITE linear-time independence-based estimators (of
BALD Bayesian active learning by disagreement probability of maximality)
BLR Bayesian linear regression LMC Langevin Monte Carlo
CDF cumulative distribution function LOTE law of total expectation
CLT central limit theorem LOTP law of total probability
DDPG deep deterministic policy gradients LOTUS law of the unconscious statistician
DDQN double deep Q-networks LOTV law of total variance
DPO direct preference optimization LSI log-Sobolev inequality
DQN deep Q-networks MAB multi-armed bandits
ECE expected calibration error MALA Metropolis adjusted Langevin algorithm
EI expected improvement MAP maximum a posteriori
ELBO evidence lower bound MC Monte Carlo
ES entropy search MCE maximum calibration error
FITC fully independent training conditional MCMC Markov chain Monte Carlo
GAE generalized advantage estimation MCTS Monte Carlo tree search
GD gradient descent MDP Markov decision process
GLIE greedy in the limit with infinite exploration MERL maximum entropy reinforcement learning
GP Gaussian process MES max-value entropy search
GPC Gaussian process classification MGF moment-generating function
GRPO group relative policy optimization MI mutual information
GRV Gaussian random vector MLE maximum likelihood estimate
H-UCRL hallucinated upper confidence reinforcement MLL marginal log likelihood
learning MPC model predictive control
HMC Hamiltonian Monte Carlo MSE mean squared error
HMM hidden Markov model NLL negative log likelihood
i.i.d. independent and identically distributed ODE ordinary differential equation
IDS information-directed sampling OPES output-space predictive entropy search
iff if and only if PBPI point-based policy iteration
JES joint entropy search PBVI point-based value iteration
KL Kullback-Leibler PDF probability density function
LAMBDA Lagrangian model-based agent PES predictive entropy search

PETS probabilistic ensembles with trajectory sampling SG-HMC stochastic gradient Hamiltonian Monte Carlo
PI probability of improvement SGD stochastic gradient descent
PI policy iteration SGLD stochastic gradient Langevin dynamics
PILCO probabilistic inference for learning control SLLN strong law of large numbers
PL Polyak-Łojasiewicz SoR subset of regressors
PlaNet deep planning network SVG stochastic value gradients
PMF probability mass function SVGD Stein variational gradient descent
POMDP partially observable Markov decision process SWA stochastic weight averaging
PPO proximal policy optimization SWAG stochastic weight averaging-Gaussian
RBF radial basis function Tanh hyperbolic tangent
ReLU rectified linear unit TD temporal difference
RHC receding horizon control TD3 twin delayed DDPG
RKHS reproducing kernel Hilbert space TRPO trust-region policy optimization
RL reinforcement learning UCB upper confidence bound
RLHF reinforcement learning from human feedback ULA unadjusted Langevin algorithm
RM Robbins-Monro VI variational inference
SAA stochastic average approximation VI value iteration
SAC soft actor-critic w.l.o.g. without loss of generality
SARSA state-action-reward-state-action w.r.t. with respect to
SDE stochastic differential equation WLLN weak law of large numbers
Index

L1-regularization, 318 batch, 317
L2-regularization, 318 batch size, 318
L∞ norm, 202 Bayes by Backprop, 145
Rmax algorithm, 224 Bayes’ rule, 15
σ-algebra, 3 Bayes’ rule for entropy, 162
ε-greedy, 221, 288 Bayesian active learning by disagreement, 171
Bayesian experimental design, 174
acceptance distribution, 121 Bayesian filtering, 51
acquisition function, 180, 287 Bayesian linear regression, 43
activation, 140 Bayesian logistic regression, 85
activation function, 140 Bayesian networks, 9
active inference, 260 Bayesian neural network, 143
actor, 247 Bayesian optimization, 180
actor-critic method, 247 Bayesian reasoning, 2
Adagrad, 319 Bayesian smoothing, 53
Adam, 319 belief, 52, 210
adaptive learning rate, 319 belief-state Markov decision process, 211
additive white Gaussian noise, 40 Bellman error, 236, 254
advantage actor-critic, 248 Bellman expectation equation, 200
advantage function, 244 Bellman optimality equation, 205
aleatoric uncertainty, 44 Bellman update, 205
almost sure convergence, 307 Bellman’s optimality principle, 205
alpha-beta pruning, 276 Bellman’s theorem, 204
aperiodic Markov chain, 117 Bernoulli distribution, 299
approximation error, 22 Bernstein-von Mises theorem, 26
augmented Lagrangian method, 291 bias, 304
automatic differentiation (auto-diff), 142 bias-variance tradeoff, 248
binary cross-entropy loss, 98
backpropagation, 142 binomial distribution, 299
bagging, 150 black box stochastic variational inference, 104
Banach fixed-point theorem, 202 Bochner’s theorem, 73
barrier function, 292 baseline, 242
baseline, 242 Boltzmann exploration, 222
bootstrapping, 150, 225 convergence in distribution, 308
Borel σ-algebra, 4 convergence in probability, 308
bounded discounted payoff, 239 convergence with probability 1, 308
Bradley-Terry model, 262 convex body chasing, 179
Brownian motion, 127 convex function, 314
burn-in time, 120 convex function chasing, 179
correlation, 12
calibration, 152 cosine similarity, 12
catastrophe principle, 306 covariance, 11
categorical distribution, 299 covariance function, 60
Cauchy distribution, 306 covariance matrix, 13
central limit theorem, 309 Cramér-Rao lower bound, 312
Cesàro mean, 193 credible set, 28
chain rule for entropy, 162 critic, 247
chain rule of mutual information, 187 Cromwell’s rule, 25
chain rule of probability, 7 cross-entropy, 93
change of variables formula, 14 cross-entropy loss, 142
Chebyshev’s inequality, 320 cross-entropy method, 276
Cholesky decomposition, 302 curse of dimensionality, 183
classification, 23
closed-loop control, 279 d-separation, 9
competitive ratio, 180 data processing inequality, 175
completing the square, 334 deep deterministic policy gradients, 253, 280
computation graph, 139 deep planning network, 287
concave function, 315 deep Q-networks, 237
conditional distribution, 7 deep reinforcement learning, 236
conditional entropy, 162 density, 6
conditional expectation, 11 design, 174
conditional independence, 8 design criterion, 174
conditional likelihood, 15 design matrix, 39
conditional linear Gaussian, 22 detailed balance equation, 119
conditional mutual information, 164 differentiation under the integral sign, 301
conditional probability, 7 diffusion, 128
conditional variance, 13 diffusion process, 128
conditioning, 8 Dirac delta function, 300
conjugate prior, 19 direct preference optimization, 265
consistent estimator, 304 directed graphical model, 9
conspiracy principle, 306 directional derivative, 320
constrained Markov decision processes, 291 Dirichlet distribution, 19
context, 261 discounted payoff, 198, 240, 275
continuous setting, 218 discounted state occupancy measure, 246
contraction, 202 distillation, 128
contrapositive, 16 Doob’s consistency theorem, 25
control, 180 double DQN, 238
downstream return, 243 finite-horizon, 199
Dreamer, 287 first-order characterization of convexity, 315
dropconnect regularization, 147 first-order expansion, 314
dropout regularization, 147 first-order optimality condition, 314
dynamics model, 197 Fisher information, 312
fixed-point iteration, 202
efficient, 312 Fokker-Planck equation, 136
eligibility vector, 268 forward KL-divergence, 95
empirical Bayes, 70 forward sampling, 61
empirical risk, 313 forward-backward algorithm, 54
energy function, 125 foundation model, 261
entropy, 18, 91 Fourier transform, 73
entropy regularization, 190, 256 free energy, 105
entropy search, 174 free energy principle, 105, 259
episode, 218 fully independent training conditional, 76
episodic setting, 218 fully observable environment, 198
epistemic uncertainty, 44 function class, 22
epoch, 318 function-space view, 45
equilibrium, 135
ergodic theorem, 120
fundamental theorem of ergodic Markov chains, 117
ergodicity, 117
estimation, 26
gamma distribution, 133
estimation error, 22
Gaussian, 6, 20
estimator, 303
Gaussian kernel, 63
Euler’s formula, 72
Gaussian noise “dithering”, 253, 288
event, 3
event space, 3 Gaussian process, 59
evidence, 101 Gaussian process classification, 85, 107
exclusive KL-divergence, 95 Gaussian random vector, 20
expectation, 10 generalized advantage estimation, 249
expected calibration error, 154 generalized variance, 92
expected improvement, 185 generative model, 7
experience replay, 237 Gibbs distribution, 124
experimental design, 174 Gibbs sampling, 123
exploration distribution, 252 Gibbs’ inequality, 109
exploration noise, 287 global optimum, 313
exploration-exploitation dilemma, 177, 217, 287 goal-conditioned reinforcement learning, 296
exponential family, 99, 270 gradient, 301
exponential kernel, 63 gradient flow, 135
greedy in the limit with infinite exploration, 222
feature space, 45 greedy policy, 204
feed-forward neural network, 142 group relative policy optimization, 251
filtering, 37, 51, 209, 210 Grönwall’s inequality, 135
finite measure, 114 Gumbel-max trick, 255
Hadamard’s inequality, 319 Jacobian, 316
hallucinated upper confidence reinforcement learning, 290
Jensen’s inequality, 91
joint distribution, 7
Hamiltonian, 130 joint entropy, 162
Hamiltonian Monte Carlo, 129 joint likelihood, 16
heat kernel, 78
heavy-tailed distribution, 305 Kalman filter, 52
Hessian, 316 Kalman gain, 55, 56
heteroscedastic, 23 Kalman smoothing, 53
heteroscedastic noise, 144 Kalman update, 55
hidden layer, 140 kernel function, 46, 60, 62
hidden Markov model, 209 kernel matrix, 46
histogram binning, 154 kernel ridge regression, 79
Hoeffding’s inequality, 310 kernel trick, 46
Hoeffding’s lemma, 310 kernelization, 46
homoscedastic, 23 Kolmogorov axioms, 3
homoscedastic noise, 144 Kullback-Leibler divergence, 93
hyperbolic tangent, 140
hyperprior, 50, 70, 180 labels, 22
Lagrangian model-based agent, 293
importance sampling, 249 Langevin dynamics, 128
improper prior, 18 Langevin Monte Carlo, 127
improvement, 184 Laplace approximation, 84
inclusive KL-divergence, 95 Laplace distribution, 305
independence, 8 Laplace kernel, 63
inducing points method, 75 Laplace’s method, 83
inductive learning, 172 lasso, 42
information gain, 163 law of large numbers, 309
information never hurts, 162, 163 law of the unconscious statistician, 10
information ratio, 186 law of total expectation, 11
information-directed sampling, 188 law of total probability, 8
informative prior, 17 law of total variance, 13
input layer, 140 layer, 139
inputs, 22 Leapfrog method, 131
instantaneous, 199 learning, 27
instantaneous regret, 179 least squares estimator, 39
interaction information, 165 length scale, 63
intrinsic reward, 166 light-tailed distribution, 305
inverse Fourier transform, 73 likelihood, 15
inverse transform sampling, 114, 300 linear kernel, 62
irreducible Markov chain, 116 linear regression, 39
isotonic regression, 154 linearity of expectation, 10
isotropic Gaussian, 20 local optimum, 314
isotropic kernel, 65 Loewner order, 312
log-concave distribution, 125 mixing time, 118
log-likelihood, 23 mode, 6
log-normal distribution, 306 model predictive control, 275
log-prior, 24 model-based reinforcement learning, 219, 274
log-Sobolev inequality, 137 model-free reinforcement learning, 219
logical inference, 1 moment matching, 98, 281
logistic function, 85 moment-generating function, 33
logistic loss, 86 momentum, 319
logits, 140 monotone submodularity, 167
loss function, 313 monotonicity of mutual information, 187
Lyapunov function, 135 Monte Carlo approximation, 281, 307
Monte Carlo control, 221
Mahalanobis norm, 301 Monte Carlo rollouts, 280
marginal gain, 166 Monte Carlo trajectory sampling, 278
marginal likelihood, 16 Monte Carlo tree search, 276
marginalization, 7 most likely explanation, 210
Markov chain, 114 multi-armed bandits, 178
Markov decision process, 197 multicollinearity, 40
Markov property, 52, 114 multilayer perceptron, 139
Markov’s inequality, 320 multinomial distribution, 19, 299
masksemble, 149 mutual information, 161, 163
matrix inversion lemma, 319
Matérn kernel, 63
natural parameters, 99
maximization bias, 237
negative log-likelihood, 23, 142
maximizing the marginal likelihood, 68
neural fitted Q-iteration, 237
maximum a posteriori estimate, 24
neural network, 139
maximum calibration error, 154
noninformative prior, 17
maximum entropy principle, 18
normal distribution, 6, 20
maximum entropy reinforcement learning, 256
normalizing constant, 16, 27
maximum likelihood estimate, 23
mean, 10
mean function, 60 objective function, 313
mean payoff, 199 Occam’s razor, 18
mean square continuity, 308 off-policy, 219
mean square convergence, 308 on-policy, 219
mean square differentiability, 309 online, 52
mean squared error, 141, 304 online actor-critic, 248
mean-field distribution, 89 online algorithms, 179
median-of-means, 312 online Bayesian linear regression, 49
method of moments, 98 online learning, 178
metrical task systems, 179 open-loop control, 279
Metropolis adjusted Langevin algorithm, 127 optimal control, 275
Metropolis-Hastings algorithm, 121 optimal decision rule, 29
Metropolis-Hastings theorem, 122 optimal design, 174
optimism in the face of uncertainty, 30, 179, 181, 185, 190, 223, 229, 288
principle of indifference, 17
principle of insufficient reason, 17
optimistic exploration, 289 principle of parsimony, 18
optimistic Q-learning, 230 prior, 15
outcome reward, 263 probabilistic artificial intelligence, 2
output layer, 140
output scale, 62, 64
outputs, 22
overfitting, 24, 313
probabilistic ensembles with trajectory sampling, 286
probabilistic inference, 1, 15, 27, 37
probabilistic inference for learning control, 281, 286
Pareto distribution, 306 probability, 4
partially observable Markov decision process, 209 probability density function, 5
partition function, 99 probability matching, 186
perception-action loop, 159 probability measure, 3
performance plan, 294 probability of improvement, 184
Pinsker’s inequality, 118 probability of maximality, 186, 189
planning, 197, 217 probability simplex, 212, 299
plate notation, 10 probability space, 4
Platt scaling, 155 probit likelihood, 107
plausible inference, 16 product rule, 8
plausible model, 288 proposal distribution, 121
point density, 300 proximal policy optimization, 250
point estimate, 26 pseudo-random number generators, 300
point-based policy iteration, 212 pushforward, 15
point-based value iteration, 212
pointwise convergence, 80 Q actor-critic, 247
policy, 198 Q-function, 199
policy gradient method, 239 Q-learning, 229
policy gradient theorem, 246 quadratic form, 301
policy iteration, 206, 251 quantile function, 300
policy search method, 239
policy value function, 239 radial basis function kernel, 63
Polyak averaging, 237 random Fourier features, 72
Polyak-Łojasiewicz inequality, 135 random shooting methods, 276
population risk, 312 random variable, 4
positive definite, 301 random vector, 7
positive definite kernel, 62 rapidly mixing Markov chain, 118
positive semi-definite, 301 realizations of a random variable, 5
posterior, 16, 28 recall, 189
precision matrix, 20 receding horizon control, 275
prediction, 27 rectified linear unit, 140
predictive posterior, 28 recursive Bayesian estimation, 51
principle of curiosity and conformity, 106, 128, 152, 178
redundancy, 165
regression, 23
regret, 179 small tails, 305
Reichenbach’s common cause principle, 9 snapshot, 146
REINFORCE algorithm, 243 soft actor critic, 258, 281
reinforcement learning, 217 soft Q-learning, 258
reinforcement learning from human feedback, 263 soft value function, 258
relative entropy, 93 softmax exploration, 222
relative Fisher information, 137 softmax function, 141
reliability diagram, 153 spectral density, 73
reparameterizable distribution, 103 square root, 302
reparameterization trick, 103, 255, 278 squared exponential kernel, 63
replay buffer, 237 standard deviation, 13
representer theorem, 66 standard normal distribution, 6, 20
reproducing kernel Hilbert space, 66 state of a random variable, 5
reproducing property, 66 state space model, 51
reverse KL-divergence, 95 state value function, 199
reversible Markov chain, 119 state-action value function, 199
reward shaping, 214 state-dependent baseline, 269
reward to go, 243 stationary distribution, 115
ridge regression, 40 stationary kernel, 65
Robbins-Monro algorithm, 317 stationary point, 314
Robbins-Monro conditions, 317 Stein variational gradient descent, 151
robust control, 291 Stein’s method, 152
robust statistics, 311 stochastic average approximation, 277
rollout, 240 stochastic environment, 197
stochastic gradient descent, 318
saddle point, 314 stochastic gradient Langevin dynamics, 129
safety filter, 296 stochastic matrix, 115
safety plan, 294 stochastic process, 114
sample covariance matrix, 304 stochastic semi-gradient descent, 235
sample mean, 303 stochastic value gradients, 255, 280
sample space, 3 stochastic weight averaging, 147
sample variance, 304 stochastic weight averaging-Gaussian, 147
SARSA, 227 strict convexity, 314
score function, 103, 241 sub-Gaussian random variable, 310
score function trick, 241 sublinear regret, 179
score gradient estimator, 103, 241 submodularity, 166
second-order characterization of convexity, 316 subsampling, 146
second-order expansion, 316 subset of regressors, 76
self-conjugacy, 19 sufficient statistic, 99, 146, 164
self-supervised learning, 261 sum rule, 7
semi-conjugate prior, 134 supervised learning, 22, 161
sharply concentrated, 305 support, 5
Sherman-Morrison formula, 319 supremum norm, 202
shift-invariant kernel, 65 surprise, 90
symbolic artificial intelligence, 1 unadjusted Langevin algorithm, 127
synergy, 165 unbiased estimator, 304
uncertainty sampling, 169
tabular setting, 217 uncorrelated, 11
tail distribution function, 305 uniform convergence, 80
tail index, 306 uniform distribution, 299, 300
target space, 4 union bound, 31
temperature scaling, 155 universal approximation theorem, 140
temporal models, 57 universality of the uniform, 300, 353
temporal-difference error, 235 upper confidence bound, 181
temporal-difference learning, 226
testing conditional, 76
value iteration, 208
Thompson sampling, 186, 288
variance, 12
time-homogeneous process, 114
variational family, 89
total variance, 92
variational inference, 88, 145, 150, 281
total variation distance, 118
variational parameters, 83
tower rule, 11
variational posterior, 83
training conditional, 76
Viterbi algorithm, 210
trajectory, 218
trajectory sampling, 277
transductive learning, 172 weak law of large numbers, 309
transition, 218 weight decay, 319
transition function, 115 weight-space view, 40
transition graph, 115 Weinstein-Aronszajn identity, 320
transition matrix, 115 well-calibrated confidence interval, 182
truncated sample mean, 312 well-specified prior, 25
trust-region policy optimization, 249 Wiener process, 78, 114, 127
twin delayed DDPG, 253 Woodbury matrix identity, 319
two-filter smoothing, 54 world model, 273
