Probabilistic Artificial Intelligence
Andreas Krause, Jonas Hübotter
This manuscript is based on the course Probabilistic Artificial Intelligence (263-5210-00L) at ETH Zürich.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Contributing
We encourage you to raise issues and suggest fixes for anything you
think can be improved. We are thankful for any such feedback!
Contact: [email protected]
Acknowledgements
We are grateful to Sebastian Curi for creating the original Jupyter note-
books that accompany the course at ETH Zürich and which were in-
strumental in the creation of many figures. We thank Hado van Hasselt
for kindly contributing Figure 12.1, and thank Tuomas Haarnoja (Haarnoja
et al., 2018a) and Roberto Calandra (Chua et al., 2018) for kindly agree-
ing to have their figures included in this manuscript. Furthermore,
many of the exercises in these notes are adapted from iterations of the
course at ETH Zürich. Special thanks to all instructors that contributed
to the course material over the years. We also thank all students of
the course in the Fall of 2022, 2023, and 2024 who provided valuable
feedback on various iterations of this manuscript and corrected many errors.
Contents

1 Fundamentals of Inference
1.1 Probability
1.2 Probabilistic Inference
1.3 Supervised Learning and Point Estimates
1.4 Outlook: Decision Theory
2 Linear Regression
2.1 Weight-space View
2.2 Aleatoric and Epistemic Uncertainty
2.3 Non-linear Regression
2.4 Function-space View
3 Filtering
3.1 Conditioning and Prediction
3.2 Kalman Filters
4 Gaussian Processes
4.1 Learning and Inference
4.2 Sampling
4.3 Kernel Functions
4.4 Model Selection
4.5 Approximations
5 Variational Inference
5.1 Laplace Approximation
5.2 Predictions with a Variational Posterior
5.3 Blueprint of Variational Inference
5.4 Information Theoretic Aspects of Uncertainty
5.5 Evidence Lower Bound
B Solutions
Bibliography
Acronyms
Index
1 Fundamentals of Inference
Boolean logic is the algebra of statements which are either true or false.
Consider, for example, the statements "it is raining" and "the ground is wet", together with the implication "if it is raining, the ground is wet."
This goes to show that in our experience, the real world is rarely black
and white. We are frequently (if not usually) uncertain about the truth
of statements, and yet we are able to reason about the world and make
predictions. We will see that the principles of Boolean logic can be ex-
tended to reason in the face of uncertainty. The mathematical frame-
work that allows us to do this is probability theory, which — as we
will find in this first chapter — can be seen as a natural extension of
Boolean logic from the domain of certainty to the domain of uncer-
tainty. In fact, in the 20th century, Richard Cox and Edwin Thompson
Jaynes have done early work to formalize probability theory as the
“logic under uncertainty” (Cox, 1961; Jaynes, 2002).
1.1 Probability
This statement does not even have an objective truth value, and yet we
as humans tend to have opinions about it.
(A σ-algebra A over Ω is a collection of subsets of Ω which contains Ω and is closed under complementation and countable unions.)
Definition 1.3 (Probability measure). Given the set Ω and the σ-algebra A over Ω, the function

P : A → R

is called a probability measure if it satisfies the following three axioms:
1. P(A) ≥ 0 for all events A ∈ A,
2. P(Ω) = 1, and
3. P(⋃_i A_i) = ∑_i P(A_i) for any countable collection of disjoint events {A_i ∈ A}_i.²

² We say that a set of sets {A_i}_i is disjoint if for all i ≠ j we have A_i ∩ A_j = ∅.
Remarkably, all further statements about probability follow from these
three natural axioms. For an event A ∈ A, we call P( A) the probability
of A. We are now ready to define a probability space.
The set Ω is often rather complex. For example, take Ω to be the set of
all possible graphs on n vertices. Then the outcome of our experiment
is a graph. Usually, we are not interested in a specific graph but rather
a property such as the number of edges, which is shared by many
graphs. A function that maps a graph to its number of edges is a
random variable.
Formally, a random variable is a (measurable) function X : Ω → T from the sample space Ω to a target space T.
1.1.3 Distributions
Consider a random variable X on a probability space (Ω, A, P), where
Ω is a compact subset of R, and A the Borel σ-algebra.
Note that “X = x” and “X ≤ x” are merely events (that is, they char-
acterize subsets of the sample space Ω satisfying this condition) which
are in the Borel σ-algebra, and hence their probability is well-defined.
By analogy, think of a physical object with mass m: even though a single point has no mass, we can assign to each point x a density ρ(x) as the limiting ratio of mass to volume over shrinking balls B_r(x) around x,

lim_{r→0} m(B_r(x)) / vol(B_r(x)) = ρ(x).
Crucially, observe that even though the mass of any particular point
x is zero, i.e., m({ x}) = 0, assigning a density ρ( x) to x is useful for
integration and approximation. The same idea applies to continuous
random variables, only that volume corresponds to intervals on the
real line and mass to probability. Recall that probability density functions are normalized such that their probability mass across the entire real line integrates to one.
P(X = [ x1 , . . . , xn ]) = P( X1 = x1 , . . . , Xn = xn ),
and hence describes the relationship among all variables Xi . For this
reason, a joint distribution is also called a generative model. We use Xi:j
to denote the random vector [ Xi · · · X j ]⊤ .
1.1.7 Independence
Two random vectors X and Y are independent (denoted X ⊥ Y) if and only if knowledge about the state of one random vector does not affect the distribution of the other random vector, namely if their conditional CDF (or, in case they have a joint density, their conditional PDF) simplifies to

P(X ≤ x | Y = y) = P(X ≤ x), or p(x | y) = p(x), respectively, for all x and y.
This is a nontrivial fact that can be proven using the change of variables
formula which we discuss in Section 1.1.11.
Proof sketch. We only prove the case where X and Y have a joint density. We have

E[E[X | Y]] = ∫ (∫ x · p(x | y) dx) p(y) dy
= ∫∫ x · p(x, y) dx dy    by definition of conditional densities
= ∫ x (∫ p(x, y) dy) dx    (1.10)    by Fubini's theorem
= ∫ x · p(x) dx    using the sum rule (1.7)
= E[X].
Var[X] := Cov[X, X]    (1.33)
= E[(X − E[X])(X − E[X])⊤]    (1.34)
= E[XX⊤] − E[X] · E[X]⊤    (1.35)
= [ Cov[X₁, X₁]  ⋯  Cov[X₁, X_n]
    ⋮            ⋱   ⋮
    Cov[X_n, X₁] ⋯  Cov[X_n, X_n] ].    (1.36)
That is, the longer a random variable is in the inner product space,
the more “uncertain” we are about its value. If a random variable
has length 0, then it is deterministic.
Here, the first term measures the average deviation from the mean of X across realizations of Y, and the second term measures the uncertainty in the mean of X that is due to the randomness of Y.
When the random variables are continuous, this probability can be ex-
pressed as an integration over the domain of X. We can then use the
substitution rule of integration to “change the variables” to an inte-
gration over the domain of Y. Taking the derivative yields the density
p_Y.¹³ There is an analogous change of variables formula for the multivariate setting.

¹³ The full proof of the change of variables formula in the univariate setting can be found in section 6.7.2 of "Mathematics for machine learning" (Deisenroth et al., 2020).
Fact 1.17 (Change of variables formula). Let X be a random vector in R^n with density p_X and let g : R^n → R^n be a differentiable and invertible function. Then Y = g(X) is another random variable, whose density can be computed based on p_X and g as follows:

p_Y(y) = p_X(g^{−1}(y)) · |det D g^{−1}(y)|.    (1.43)

Here, the term |det D g^{−1}(y)| measures how much a unit volume is stretched or compressed by the transformation, and thereby corrects for the change in volume that is caused by this change in coordinates.
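As a minimal illustration of Fact 1.17 in code, the following sketch (with an arbitrary toy transformation g(x) = exp(x) and X ∼ N(0, 1)) compares the density obtained from the change of variables formula with a histogram of transformed samples:

```python
import numpy as np

# Change of variables for Y = g(X) = exp(X) with X ~ N(0, 1):
# p_Y(y) = p_X(g^{-1}(y)) * |d/dy g^{-1}(y)| = p_X(log y) / y.
rng = np.random.default_rng(0)
y = np.exp(rng.standard_normal(200_000))  # samples of Y = exp(X)

def p_y(y):
    """Density of Y obtained from the change of variables formula."""
    return np.exp(-0.5 * np.log(y) ** 2) / (np.sqrt(2 * np.pi) * y)

# compare the analytic density with a histogram of the transformed samples
edges = np.linspace(0.05, 8.0, 81)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - p_y(centers))))  # close to 0, up to binning/truncation error
```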
1.2 Probabilistic Inference
This concludes our quick tour of probability theory, and we are well-
prepared to return to the topic of probabilistic inference.
Recall the logical implication “If it is raining, the ground is wet.” from
the beginning of this chapter. Suppose that we look outside a window
and see that it is not raining: will the ground be dry? Logical reason-
ing does not permit drawing an inference of this kind, as there might
be reasons other than rain for which the ground could be wet (e.g.,
sprinklers). However, intuitively, by observing that it is not raining,
we have just excluded the possibility that the ground is wet because of
rain, and therefore we would deem it “more likely” that the ground is
dry than before. In other words, if we were to walk outside now and
the ground was wet, we would be more surprised than we would have
been if we had not looked outside the window before.
p(x | y) = p(y | x) · p(x) / p(y).    (1.45)

The marginal likelihood can be computed using the sum rule (1.7) or the law of total probability (1.12),

p(y) = ∫_{X(Ω)} p(y | x) · p(x) dx.    (1.46)
That is, observing that the ground is wet makes it more likely to be raining. From P(R | W) ≥ P(R) we know P(R̄ | W) ≤ P(R̄),¹⁴ and hence

P(W | R̄) = P(R̄ | W) · P(W) / P(R̄) ≤ P(W),

that is, having observed it not to be raining made the ground less likely to be wet.

¹⁴ since P(X̄) = 1 − P(X)
P(R | W̄) = P(W̄ | R) · P(R) / P(W̄) = (1 − P(W | R)) · P(R) / P(W̄) = 0,    as P(W | R) = 1.
Observe that a logical inference does not depend on the prior P( R):
Even if the prior was P( R) = 1 in Example 1.20, after observing that
the ground is not wet, we are forced to conclude that it is not raining to
maintain logical consistency. The examples highlight that while logical
inference does not require the notion of a prior, plausible (probabilistic!)
inference does.
p( x) ∝ const. (1.47)
The more concentrated p is, the less is its entropy; the more diffuse p is, the greater is its entropy.¹⁶

¹⁶ We give a thorough introduction to entropy in Section 5.4.

In the absence of any prior knowledge, the uniform distribution has the highest entropy,¹⁷ and hence, the maximum entropy principle suggests a noninformative prior (as does Laplace's principle of indifference). In contrast, if the evidence perfectly determines the value of x, then the only consistent explanation is the point density at x. The maximum entropy principle characterizes a reasonable choice of prior for these two extreme cases and all cases in between. Bayes' rule can in fact be derived as a consequence of the maximum entropy principle in the sense that the posterior is the least "informative" distribution among all distributions that are consistent with the prior and the observations.

¹⁷ This only holds true when the set of possible outcomes of x is finite (or a bounded continuous interval), as in this case, the noninformative prior is a proper distribution — the uniform distribution. In the "infinite case", there is no uniform distribution and the noninformative prior can be attained from the maximum entropy principle as the limiting solution as the number of possible outcomes of x is increased.
Thus, θ | k ∼ Beta(α + n H , β + n T ).
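As a concrete illustration of this conjugate update, here is a minimal sketch (with an arbitrary Beta(1, 1) prior and a simulated biased coin) of the Beta posterior over the heads probability θ:

```python
import numpy as np

def beta_posterior(alpha, beta, flips):
    """Update a Beta(alpha, beta) prior on the heads probability theta
    with a sequence of coin flips (1 = heads, 0 = tails)."""
    n_heads = int(np.sum(flips))
    n_tails = len(flips) - n_heads
    return alpha + n_heads, beta + n_tails

rng = np.random.default_rng(0)
flips = rng.binomial(1, 0.7, size=100)            # data from a biased coin
a_post, b_post = beta_posterior(1.0, 1.0, flips)  # uniform prior Beta(1, 1)
print(a_post, b_post, a_post / (a_post + b_post)) # posterior mean is approx. 0.7
```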
size of the random vector. Similarly, the normalizing constant of the
the class of distributions. Gaussians are a popular choice for this purpose since they have extremely useful properties: they have a compact representation and — as we will see in Chapter 2 — they allow for closed-form probabilistic inference.

Figure 1.5: A table representing a joint distribution of n binary random variables. The table has 2^n rows. The number of parameters is 2^n − 1 since the final probability is determined by all other probabilities as they must sum to one.
In Equation (1.5), we have already seen the PDF of the univariate Gaus-
sian distribution. A random vector X in Rn is normally distributed,
X ∼ N (µ, Σ ), if its PDF is
N(x; µ, Σ) := (1/√(det(2πΣ))) exp(−½ (x − µ)⊤ Σ^{−1} (x − µ)).    (1.51)
covariance matrix does not have negative eigenvalues (Problem 1.4), this ensures that Σ and Λ are positive definite.¹⁸

¹⁸ The inverse of a positive definite matrix is also positive definite.
An important property of the normal distribution is that it is closed
under marginalization and conditioning.
X_A | X_B = x_B ∼ N(µ_{A|B}, Σ_{A|B})    where    (1.53a)
µ_{A|B} := µ_A + Σ_{AB} Σ_{BB}^{−1} (x_B − µ_B),    (1.53b)
Σ_{A|B} := Σ_{AA} − Σ_{AB} Σ_{BB}^{−1} Σ_{BA}.    (1.53c)

Here, µ_A characterizes the prior belief and Σ_{AB} Σ_{BB}^{−1} (x_B − µ_B) represents "how different" x_B is from what was expected.
Observe that upon inference, the variance can only shrink! Moreover,
how much the variance is reduced depends purely on where the obser-
vations are made (i.e., the choice of B) but not on what the observations
are. In contrast, the posterior mean µ_{A|B} depends affinely on the observation x_B. These
are special properties of the Gaussian and do not generally hold true
for other distributions.
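The conditional formulas (1.53) translate directly into code. The following sketch (with arbitrary example numbers) computes µ_{A|B} and Σ_{A|B} for a partitioned Gaussian:

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_A, idx_B, x_B):
    """Compute the conditional N(mu_{A|B}, Sigma_{A|B}) of X_A given X_B = x_B,
    following Equations (1.53b) and (1.53c)."""
    mu_A, mu_B = mu[idx_A], mu[idx_B]
    S_AA = Sigma[np.ix_(idx_A, idx_A)]
    S_AB = Sigma[np.ix_(idx_A, idx_B)]
    S_BB = Sigma[np.ix_(idx_B, idx_B)]
    gain = S_AB @ np.linalg.inv(S_BB)
    mu_cond = mu_A + gain @ (x_B - mu_B)
    Sigma_cond = S_AA - gain @ S_AB.T
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu_c, Sigma_c = condition_gaussian(mu, Sigma, idx_A=[0, 1], idx_B=[2], x_B=np.array([0.5]))
print(mu_c, Sigma_c)
```

Note that Σ_{A|B} in the sketch indeed does not depend on x_B, matching the discussion above.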
It can be shown that Gaussians are additive and closed under affine transformations (Problem 1.10). The closedness under affine transformations (1.78) implies that a Gaussian X ∼ N(µ, Σ) is equivalently characterized as

X = Σ^{1/2} Y + µ    (1.54)

where Y ∼ N(0, I) and Σ^{1/2} is the square root of Σ.¹⁹ Importantly, this implies together with Theorem 1.24 and additivity (1.79) that:

Any affine transformation of a Gaussian random vector is a Gaussian random vector.

¹⁹ More details on the square root of a symmetric and positive definite matrix can be found in Appendix A.2.
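Equation (1.54) is also how multivariate Gaussians are typically sampled in practice: draw standard normal noise and transform it with a square root of Σ. A minimal sketch, using the Cholesky factor as one common choice of matrix square root:

```python
import numpy as np

def sample_gaussian(mu, Sigma, n_samples, rng):
    """Sample from N(mu, Sigma) via X = Sigma^{1/2} Y + mu with Y ~ N(0, I),
    using the lower-triangular Cholesky factor as a matrix square root."""
    L = np.linalg.cholesky(Sigma)
    Y = rng.standard_normal((n_samples, len(mu)))
    return Y @ L.T + mu

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
X = sample_gaussian(mu, Sigma, 50_000, rng)
print(X.mean(axis=0))            # approx. mu
print(np.cov(X, rowvar=False))   # approx. Sigma
```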
A := Σ_{AB} Σ_{BB}^{−1},    (1.55b)
b := µ_A − Σ_{AB} Σ_{BB}^{−1} µ_B,    (1.55c)
ε ∼ N(0, Σ_{A|B}).    (1.55d)
In supervised learning, we want to learn an unknown function

f⋆ : X → Y

from labeled training data. That is, we are given a collection of labeled examples, D_n := {(x_i, y_i)}_{i=1}^n, where the x_i ∈ X are inputs and the y_i ∈ Y are outputs (called labels), and we want to find a function f̂ that best-approximates f⋆. It is common to choose f̂ from a parameterized function class F(Θ), where each function f_θ is described by some parameters θ ∈ Θ.

Figure 1.7: Illustration of estimation error and approximation error. f⋆ denotes the true function and f̂ is the best approximation from the function class F. We do not specify here how one could quantify "error"; for more details, see Appendix A.3.5.

Remark 1.25: What this manuscript is about and not about

As illustrated in Figure 1.7, the restriction to a function class leads to two sources of error: the estimation error of having "incorrectly" determined f̂ within the function class, and the approximation error of the function class itself. Choosing a "good" function class / architecture with small approximation error is therefore critical for any practical application of machine learning. We will discuss various function classes, from linear models to deep neural networks; however, determining the "right" function class will not be the focus of this manuscript. To keep the exposition simple, we will assume in the following that f⋆ ∈ F(Θ) with parameters θ⋆ ∈ Θ.
Larger function classes are more expressive and therefore can typically better approximate the ground truth f⋆.
We differentiate between the task of regression, where Y := R^k,²⁰ and the task of classification, where Y := C and C is an m-element set of classes. In other words, regression is the task of predicting a continuous label, whereas classification is the task of predicting a discrete class label. These two tasks are intimately related: in fact, we can think of classification tasks as a regression problem where we learn a probability distribution over class labels. In this regression problem, Y := ∆^C where ∆^C denotes the set of all probability distributions over the set of classes C, which is an (m − 1)-dimensional convex polytope in the m-dimensional space of probabilities [0, 1]^m (cf. Appendix A.1.2).

²⁰ The labels are usually scalar, so k = 1.
For now, let us stick to the regression setting. We will assume that the observations are noisy, that is, y_i ∼ p(· | x_i, θ⋆) independently for some known conditional distribution p(· | x_i, θ) but unknown parameter θ⋆.²¹ Our assumption can equivalently be formulated as

y_i = f_{θ⋆}(x_i) + ε_i(x_i)    (1.56)

where f_{θ⋆}(x_i) is the signal and ε_i(x_i) is the noise.

²¹ The case where the labels are deterministic is the special case of p(· | x_i, θ⋆) being a point density at f⋆(x_i).
The MLE is often used in practice due to its desirable asymptotic prop-
erties as the sample size n increases. We give a brief summary here
and provide additional background and definitions in Appendix A.3.
To give any guarantees on the convergence of the MLE, we necessarily need to assume that θ⋆ is identifiable.²³ If additionally, ℓ_nll is "well-behaved", then standard results say that the MLE is consistent and asymptotically normal (Van der Vaart, 2000):

θ̂_MLE → θ⋆ in probability and θ̂_MLE → N(θ⋆, S_n) in distribution, as n → ∞.    (1.59)

²³ That is, θ⋆ ≠ θ ⟹ f⋆ ≠ f_θ for any θ ∈ Θ. In words, there is no other parameter θ that yields the same function f_θ as θ⋆.
The situation is quite different in the finite sample regime. Here, the
MLE need not be unbiased, and it is susceptible to overfitting to the
(finite) training data as we discuss in more detail in Appendix A.3.5.
tends to reduce the risk of overfitting. However, one may also encode
more nuanced information about the (assumed) structure of θ⋆ into
the prior.
Encoding prior assumptions into the function class or into the parame-
ter estimation can accelerate learning and improve generalization per-
formance dramatically, yet importantly, incorporating a prior can also
inhibit learning in case the prior is “wrong”. For example, when the
learning task is to differentiate images of cats from images of dogs,
consider the (stupid) prior that only permits models that exclusively
use the upper-left pixel for prediction. No such model will be able to
solve the task, and therefore starting from this prior makes the learn-
ing problem effectively unsolvable which illustrates that priors have to
be chosen with care.
In words, Doob's consistency theorem tells us that for any prior distribution, the posterior is guaranteed to converge to a point density in the (small) neighborhood θ⋆ ∈ B of the true parameter as long as p(B) > 0.²⁷ We call such a prior a well-specified prior.

²⁷ B can, for example, be a ball of radius ϵ around θ⋆ (with respect to some geometry of Θ).
Remark 1.26: Cromwell’s rule
In the case where |Θ| is finite, Doob’s consistency theorem strongly
suggests that the prior should not assign 0 probability (or proba-
bility 1 for that matter) to any individual parameter θ ∈ Θ, unless
we know with certainty that θ⋆ ̸= θ. This is called Cromwell’s rule,
Under the same assumption that the prior is well-specified (and regularity conditions²⁸), the Bernstein-von Mises theorem, which was first discovered by Pierre-Simon Laplace in the early 19th century, establishes the asymptotic normality of the posterior distribution (Van der Vaart, 2000; Miller, 2016):

θ | D_n → N(θ⋆, S_n) in distribution, as n → ∞.    (1.65)

²⁸ These regularity conditions are akin to the assumptions required for asymptotic normality of the MLE, cf. Equation (1.59).
certainty that the ground is wet for other reasons than rain.

In this manuscript, we will focus mainly on algorithms for probabilistic inference which compute or approximate the distribution p(θ | x_{1:n}, y_{1:n}) over parameters. Returning a distribution over parameters is natural since this acknowledges that given a finite sample with noisy observations, more than one parameter vector can explain the data.

Figure 1.8: The MLE/MAP are point estimates at the mode θ̂ of the posterior distribution p(θ | D).
The posterior then represents our belief about the best model after seeing the training data. Using Bayes' rule (1.45), we can write it as³¹

p(y⋆ | x⋆, x_{1:n}, y_{1:n}) = ∫_Θ p(y⋆ | x⋆, θ) · p(θ | x_{1:n}, y_{1:n}) dθ,    (1.68)

by the product rule (1.11) and this conditional independence.

³¹ We generally assume that y⋆ ⊥ x_{1:n}, y_{1:n} | θ.
Here, the distribution over models p(θ | x_{1:n}, y_{1:n}) is called the posterior and the distribution over predictions p(y⋆ | x⋆, x_{1:n}, y_{1:n}) is called the predictive posterior. The predictive posterior quantifies our posterior uncertainty about the "prediction" y⋆; however, since this is typically a complex distribution, it is difficult to communicate this uncertainty to a human. One statistic that can be used for this purpose is the smallest set C_δ(x⋆) ⊆ R for a fixed δ ∈ (0, 1) such that

P(y⋆ ∈ C_δ(x⋆) | x⋆, x_{1:n}, y_{1:n}) ≥ 1 − δ.    (1.69)

That is, we believe with "confidence" at least 1 − δ that the true value y⋆ lies in C_δ(x⋆).
the posterior after the first t observations, with p^(0)(θ) = p(θ). Now, suppose that we have already computed p^(t)(θ) and observe y_{t+1}. We can recursively update the posterior as follows,

p^(t+1)(θ) ∝ p^(t)(θ) · p(y_{t+1} | θ),    (1.71)

using y_{t+1} ⊥ y_{1:t} | θ (see Figure 2.3).
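A minimal sketch of this recursive update for a one-dimensional parameter, maintaining the posterior on a discrete grid (the Gaussian likelihood and the grid are arbitrary illustrative choices):

```python
import numpy as np

theta_grid = np.linspace(-3, 3, 601)
posterior = np.ones_like(theta_grid)     # p^(0): uniform prior on the grid
posterior /= posterior.sum()

def likelihood(y, theta, sigma=0.5):
    return np.exp(-0.5 * ((y - theta) / sigma) ** 2)

rng = np.random.default_rng(0)
for t in range(20):
    y = rng.normal(1.0, 0.5)                           # observations around theta* = 1
    posterior = posterior * likelihood(y, theta_grid)  # p^(t+1) proportional to p^(t) * p(y | theta)
    posterior /= posterior.sum()                       # renormalize

print(theta_grid[np.argmax(posterior)])                # concentrates near 1.0
```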
Here, a⋆ is called the optimal decision rule because, under the given
probabilistic model, no other rule can yield a higher expected utility.
Discussion
Problems
P(⋃_{i=1}^∞ A_i) ≤ ∑_{i=1}^∞ P(A_i).    (1.73)

E[X] = ∑_{i=1}^k E[X | A_i] · P(A_i).    (1.74)
N ( x; µ, Σ ) ∝ N ( x; µ1 , Σ 1 ) · N ( x; µ2 , Σ 2 ) (1.75)
Show that two jointly Gaussian random vectors, X and Y, are indepen-
dent if and only if X and Y are uncorrelated.
Hint: You may use that for matrices Σ and Λ such that Σ^{−1} = Λ,
• if Σ and Λ are symmetric,

[x_A; x_B]⊤ [Λ_{AA} Λ_{AB}; Λ_{BA} Λ_{BB}] [x_A; x_B]
= x_A⊤ Λ_{AA} x_A + x_A⊤ Λ_{AB} x_B + x_B⊤ Λ_{BA} x_A + x_B⊤ Λ_{BB} x_B
= x_A⊤ (Λ_{AA} − Λ_{AB} Λ_{BB}^{−1} Λ_{BA}) x_A + (x_B + Λ_{BB}^{−1} Λ_{BA} x_A)⊤ Λ_{BB} (x_B + Λ_{BB}^{−1} Λ_{BA} x_A),

• Λ_{BB}^{−1} = Σ_{BB} − Σ_{BA} Σ_{AA}^{−1} Σ_{AB},
• Λ_{BB}^{−1} Λ_{BA} = −Σ_{BA} Σ_{AA}^{−1}.
The final two equations follow from the general characterization of the inverse
of a block matrix (Petersen et al., 2008, section 9.1.3).
This generalizes the MGF of the univariate Gaussian from Equation (A.40).
AX + b ∼ N ( Aµ + b, AΣ A⊤ ). (1.78)
X + X ′ ∼ N ( µ + µ ′ , Σ + Σ ′ ). (1.79)
These properties are unique to Gaussians and a reason why they are widely used for learning and inference.
Derive the optimal decisions under the squared loss and the asymmet-
ric loss from Example 1.29.
part I
We define the matrix X := [x₁ ⋯ x_n]⊤ ∈ R^{n×d} as the collection of inputs and the vector y := [y₁ ⋯ y_n]⊤ ∈ R^n as the collection of labels. For each noisy observation (x_i, y_i), we define the value of the approximation of our model, f_i := w⊤x_i. Our model at the inputs X is described by the vector f := [f₁ ⋯ f_n]⊤ which can be expressed succinctly as f = Xw.
The most common way of estimating w from data is the least squares
estimator,
ŵ_ls := argmin_{w ∈ R^d} ∑_{i=1}^n (y_i − w⊤x_i)² = argmin_{w ∈ R^d} ‖y − Xw‖₂²,    (2.2)
is strictly convex (Problem 2.1 (1)), which is the case as long as the columns of X are not linearly dependent. Least squares regression can be seen as finding the orthogonal projection of y onto the column space of X, as is illustrated in Figure 2.2 (Problem 2.1 (2)).
y_i = w⋆⊤ x_i + ε_i    (2.6)

for some weight vector w⋆, where for the purpose of this chapter we will additionally assume that ε_i ∼ N(0, σ_n²) is homoscedastic Gaussian noise.³ This observation model is equivalently characterized by the Gaussian likelihood,

³ ε_i is called additive white Gaussian noise.

Figure 2.2: Least squares regression finds the orthogonal projection of y onto span{X} (here illustrated as the plane).
Based on this likelihood we can compute the MLE (1.57) of the weights:

ŵ_MLE = argmax_{w ∈ R^d} ∑_{i=1}^n log p(y_i | x_i, w) = argmin_{w ∈ R^d} ∑_{i=1}^n (y_i − w⊤x_i)²,    plugging in the Gaussian likelihood and simplifying.
In practice, the noise variance σn2 is typically unknown and also has to
be determined, for example, through maximum likelihood estimation.
It is a straightforward exercise to check that the MLE of σ_n² given fixed weights w is σ̂_n² = (1/n) ∑_{i=1}^n (y_i − w⊤x_i)² (Problem 2.2).
This also shows that Gaussians with known variance and linear like-
lihood are self-conjugate, a property that we had hinted at in Sec-
tion 1.2.2. It can be shown more generally that Gaussians with known
encourages keeping weights small. Recall that the MAP estimate corresponds to the mode of the posterior distribution, which in the case of a Gaussian is simply its mean µ. As to be expected, µ coincides with the ridge estimate.

One problem with ridge regression is that the contribution of nearly-zero weights to the L2-regularization term is negligible. Thus, L2-regularization is typically not sufficient to perform variable selection (that is, set some weights to zero entirely), which is often desirable for interpretability of the model.

Figure 2.4: Level sets of L2- (blue) and L1-regularization (red), corresponding to Gaussian and Laplace priors, respectively. It can be seen that L1-regularization is more effective in encouraging sparse solutions (that is, solutions where many components are set to exactly 0).
A commonly used alternative to ridge regression is the least ab-
solute shrinkage and selection operator (or lasso), which regularizes
with the L1 -norm:
ŵ_lasso := argmin_{w ∈ R^d} ‖y − Xw‖₂² + λ‖w‖₁.    (2.12)
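For concreteness, a minimal sketch of the ridge estimator in closed form (λ and the toy data below are arbitrary choices). Unlike ridge regression, the lasso (2.12) has no closed-form solution and is typically solved with iterative methods such as coordinate descent.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, 0.0, -2.0])
y = X @ w_true + 0.1 * rng.standard_normal(50)

print(ridge_estimate(X, y, lam=0.0))   # ordinary least squares
print(ridge_estimate(X, y, lam=10.0))  # weights are shrunk towards zero
```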
Here we observe that using point estimates such as the MAP estimate
does not quantify uncertainty in the weights. The MAP estimate sim-
ply collapses all mass of the posterior around its mode. This can be
harmful when we are highly unsure about the best model, e.g., because
we have observed insufficient data.
To make predictions at a test point x⋆, we let f⋆ := w⊤x⋆, which has the distribution

y⋆ | x⋆, x_{1:n}, y_{1:n} ∼ N(µ⊤x⋆, x⋆⊤Σx⋆ + σ_n²),    (2.16)    using additivity of Gaussians (1.79).
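The full Bayesian linear regression posterior and the predictive distribution (2.16) can be sketched in a few lines, under the assumptions of this section (Gaussian prior N(0, σ_p² I) and homoscedastic noise σ_n²; the toy data are arbitrary):

```python
import numpy as np

def blr_posterior(X, y, sigma_p2, sigma_n2):
    """Posterior N(mu, Sigma) over the weights of Bayesian linear regression
    with prior N(0, sigma_p2 * I) and noise variance sigma_n2."""
    d = X.shape[1]
    Sigma = np.linalg.inv(X.T @ X / sigma_n2 + np.eye(d) / sigma_p2)
    mu = Sigma @ X.T @ y / sigma_n2
    return mu, Sigma

def blr_predict(x_star, mu, Sigma, sigma_n2):
    """Predictive mean and variance of y* at a test point x*, cf. Equation (2.16)."""
    return mu @ x_star, x_star @ Sigma @ x_star + sigma_n2

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([0.5, -1.0]) + 0.1 * rng.standard_normal(30)
mu, Sigma = blr_posterior(X, y, sigma_p2=1.0, sigma_n2=0.01)
print(blr_predict(np.array([1.0, 1.0]), mu, Sigma, sigma_n2=0.01))
```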
Note that the feature dimension e is ∑_{i=0}^m (d+i−1 choose i) = Θ(d^m).⁴ Thus, the dimension of the feature space grows exponentially in the degree of polynomials and input dimensions. Even for relatively small m and d, this becomes completely unmanageable.

⁴ Observe that the vector contains (d+i−1 choose i) monomials of degree i as this is the number of ways to choose i times from d items with replacement and without consideration of order. To see this, consider the following encoding: We take a sequence of d + i − 1 spots. Selecting any subset of i spots, we interpret the remaining d − 1 spots as "barriers" separating each of the d items. The selected spots correspond to the number of times each item has been selected. For example, if 2 items are to be selected out of a total of 4 items with replacement, one possible configuration is "◦ || ◦ |" where ◦ denotes a selected spot and | denotes a barrier. This configuration encodes that the first and third item have each been chosen once. The number of possible configurations — each encoding a unique outcome — is therefore (d+i−1 choose i).

The example of polynomials highlights that it may be inefficient to keep track of the weights w ∈ R^e when e is large, and that it may be useful to instead consider a reparameterization which is of dimension n rather than of the feature dimension.

2.4 Function-space View

Let us now look at Bayesian linear regression through a different lens. Previously, we have been interpreting it as a distribution over the weights w of a linear function f = Φw. The key idea is that for a
where K ∈ R^{n×n} is the so-called kernel matrix. Observe that the entries of the kernel matrix can be expressed as K(i, j) = σ_p² · ϕ(x_i)⊤ϕ(x_j). You may say that nothing has changed, and you would be right — that is precisely the point. Note, however, that the shape of the kernel matrix is n × n rather than the e × e covariance matrix over weights, which becomes unmanageable when e is large. The kernel matrix K has entries only for the finite set of observed inputs. However, in principle, we could have observed any input, and this motivates the definition of the kernel function

k(x, x′) := σ_p² · ϕ(x)⊤ϕ(x′).    (2.20)

Figure 2.7: An illustration of the function-space view. The model is described by the points (x_i, f_i).
k(x, x′) = Cov[f(x), f(x′)].    (2.22)
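A small sketch makes the identity (2.20) concrete for a degree-2 polynomial feature map in one dimension (with σ_p² = 1 for simplicity): building K from explicit features ϕ(x) and from the kernel function gives the same matrix, cf. also (2.25).

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map in one dimension."""
    return np.stack([np.ones_like(x), np.sqrt(2) * x, x**2], axis=-1)

def k(x, x_prime):
    """Corresponding kernel: k(x, x') = phi(x)^T phi(x') = (1 + x x')^2."""
    return (1.0 + np.outer(x, x_prime)) ** 2

x = np.array([-1.0, 0.0, 0.5, 2.0])
K_features = phi(x) @ phi(x).T
K_kernel = k(x, x)
print(np.allclose(K_features, K_kernel))  # True
```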
It remains to show that we can also rely on the kernel trick for predictions. Given the test point x⋆, we define

Φ̃ := [Φ; ϕ(x⋆)⊤],    ỹ := [y; y⋆],    f̃ := [f; f⋆].

f̃ | X, x⋆ ∼ N(0, K̃)    (2.23)

where K̃ := σ_p² Φ̃Φ̃⊤. Adding the label noise yields
ϕ ( x ) ⊤ ϕ ( x ′ ) = (1 + x ⊤ x ′ ) m . (2.25)
Discussion
Problems
x⋆ = (1/n) ∑_{i=1}^n x_i.
and assume the data follows a linear model with homoscedastic noise
N (0, σn2 ) where σn2 = 0.1.
1. Find the maximum likelihood estimate ŵMLE given the data.
2. Now assume that we have a prior p(w) = N (w; 0, σp2 I ) with σp2 = 0.05.
Find the MAP estimate ŵMAP given the data and the prior.
3. Use the posterior p(w | X, y) to get a posterior prediction for the
label y⋆ at x⋆ = [3 3]⊤ . Report the mean and the variance of this
prediction.
4. How would you have to change the prior p(w) such that
ŵMAP → ŵMLE ?
Prove for Bayesian linear regression that x⋆ ⊤ Σx⋆ is the epistemic un-
certainty and σn2 the aleatoric uncertainty in y⋆ under the decomposi-
tion of Equation (2.18).
2.7. Hyperpriors.
X0 ∼ N (µ, Σ ), (3.1)
Let us return to the setting of Kalman filters where priors and likeli-
hoods are Gaussian. Here, we will see that the update and prediction
steps can be computed in closed form.
3.2.1 Conditioning
Intuitively, λ is a form of "gain" that influences how much of the new information should be incorporated into the updated mean. For this reason, λ is also called the Kalman gain. The updated variance can similarly be rewritten,

σ²_{t+1} = λσ_y² = (1 − λ)(σ_t² + σ_x²).    (3.17)

Figure 3.3: Hidden states during a random walk in one dimension.
In the limit of vanishing measurement noise, σ_y² → 0, we have λ → 1, µ_{t+1} = y_{t+1}, and σ²_{t+1} = 0: the filter fully trusts the latest observation.
The general formulas for the Kalman update follow the same logic as
in the above example of a one-dimensional random walk. Given the
You will show in Problem 3.2 that the Kalman update (3.18) is the online equivalent to computing the posterior of the weights in Bayesian linear regression.
3.2.2 Predicting
Using now that the marginal posterior of Xt is a Gaussian due to the
closedness properties of Gaussians, we have
Optional Readings
Kalman filters and related models are often called temporal models.
For a broader look at such models, read chapter 15 of “Artificial
intelligence: a modern approach” (Russell and Norvig, 2002).
Discussion
Problems
Derive the predictive distribution Xt+1 | y1:t+1 (3.13) of the Kalman fil-
ter described in the above example using your knowledge about mul-
tivariate Gaussians from Section 1.2.3.
Hint: First compute the predictive distribution Xt+1 | y1:t .
Recall the specific Kalman filter from Example 3.5. With this model the Kalman update (3.18) simplifies to

k_t = Σ_{t−1} x_t / (x_t⊤ Σ_{t−1} x_t + σ_n²),    (3.22a)
µ_t = µ_{t−1} + k_t (y_t − x_t⊤ µ_{t−1}),    (3.22b)
Σ_t = Σ_{t−1} − k_t x_t⊤ Σ_{t−1},    (3.22c)

Hint: In the inductive step, first prove the equivalence of Σ_t and then expand Σ_t^{−1} µ_t to prove the equivalence of µ_t.
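A minimal sketch of the recursion (3.22) as an online update of a Gaussian belief over the weights, on an arbitrary synthetic data stream:

```python
import numpy as np

def kalman_update(mu, Sigma, x_t, y_t, sigma_n2):
    """One step of the recursion (3.22): update the Gaussian belief over the
    weights after observing (x_t, y_t)."""
    k_t = Sigma @ x_t / (x_t @ Sigma @ x_t + sigma_n2)   # gain, (3.22a)
    mu_new = mu + k_t * (y_t - x_t @ mu)                 # (3.22b)
    Sigma_new = Sigma - np.outer(k_t, x_t) @ Sigma       # (3.22c)
    return mu_new, Sigma_new

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
mu, Sigma = np.zeros(2), np.eye(2)                       # prior N(0, I)
for _ in range(200):
    x_t = rng.normal(size=2)
    y_t = x_t @ w_true + 0.1 * rng.standard_normal()
    mu, Sigma = kalman_update(mu, Sigma, x_t, y_t, sigma_n2=0.01)
print(mu)  # close to w_true
```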
y t = π + ηt , ηt ∼ N (0, σy2 ).
yi = f ( xi ) + ε i , ε i ∼ N (0, σn2 ).
The fact that with a Gaussian process, any finite subset of the random variables is jointly Gaussian is the key property allowing us to perform exact probabilistic inference. Intuitively, a Gaussian process can be interpreted as a normal distribution over functions — and is therefore often called an "infinite-dimensional Gaussian".
µ′(x) := µ(x) + k_{x,A}⊤ (K_{AA} + σ_n² I)^{−1} (y_A − µ_A),    (4.7)
k′(x, x′) := k(x, x′) − k_{x,A}⊤ (K_{AA} + σ_n² I)^{−1} k_{x′,A}.    (4.8)
Observe that analogously to Bayesian linear regression, the posterior
covariance can only decrease when conditioning on additional data,
and is independent of the observations yi .
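Equations (4.7) and (4.8) translate directly into code. The following sketch (assuming a zero prior mean and a Gaussian kernel, with arbitrary toy data) computes the posterior mean and covariance at test points:

```python
import numpy as np

def gaussian_kernel(a, b, h=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * h**2))

def gp_posterior(x_train, y_train, x_test, sigma_n2=0.01, h=1.0):
    """GP posterior mean and covariance at x_test, cf. Equations (4.7)-(4.8),
    assuming a zero prior mean."""
    K_AA = gaussian_kernel(x_train, x_train, h) + sigma_n2 * np.eye(len(x_train))
    K_sA = gaussian_kernel(x_test, x_train, h)
    K_ss = gaussian_kernel(x_test, x_test, h)
    mean = K_sA @ np.linalg.solve(K_AA, y_train)
    cov = K_ss - K_sA @ np.linalg.solve(K_AA, K_sA.T)
    return mean, cov

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 20)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(20)
x_test = np.linspace(-3, 3, 5)
mean, cov = gp_posterior(x_train, y_train, x_test)
print(mean)           # close to sin(x_test)
print(np.diag(cov))   # posterior variances
```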
4.2 Sampling
We have seen that kernel functions are the key object describing the
class of functions a Gaussian process can model. Depending on the
kernel function, the “shape” of functions that are realized from a Gaus-
sian process varies greatly. Let us recap briefly from Section 2.4 what
a kernel function is:
k(x, x′) = Cov[f(x), f(x′)].    (4.12)
k(x, x′; h) := exp(−‖x − x′‖₂² / (2h²))    (4.14)

where h is its length scale. The larger the length scale h, the smoother the resulting functions.⁵ Furthermore, it turns out that the feature space (think back to Section 2.4!) corresponding to the Gaussian kernel is "infinitely dimensional", as you will show in Problem 4.1. So the Gaussian kernel already encodes a function class that we were not able to model under the weight-space view of Bayesian linear regression.

⁵ As the length scale is increased, the exponent of the exponential increases, resulting in a higher dependency between locations.
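To see the effect of the length scale, one can sample functions from the GP prior directly: build the kernel matrix on a grid and draw from the corresponding multivariate Gaussian. A minimal sketch (the jitter term is a standard numerical stabilizer; the grid and length scales are arbitrary):

```python
import numpy as np

def gaussian_kernel(a, b, h):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * h**2))

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
for h in [1.0, 0.5, 0.2]:
    K = gaussian_kernel(x, x, h) + 1e-8 * np.eye(len(x))   # jitter for stability
    # one sample path f ~ N(0, K), drawn via the Cholesky factor
    f = np.linalg.cholesky(K) @ rng.standard_normal(len(x))
    print(h, float(np.std(np.diff(f))))  # rougher (larger increments) for small h
```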
Figure 4.5: Gaussian kernel with length scales h = 1, h = 0.5, and h = 0.2.

3. The Laplace kernel (also known as exponential kernel) is defined as

   k(x, x′; h) := exp(−‖x − x′‖₂ / h).    (4.15)

   As can be seen in Figure 4.7, samples from a GP with Laplace kernel are non-smooth as opposed to the samples from a GP with Gaussian kernel.

Figure 4.6: Laplace kernel with length scales h = 1, h = 0.5, and h = 0.2.

4. The Matérn kernel trades the smoothness of the Gaussian and the Laplace kernels. As such, it is frequently used in practice to model
Optional Readings
For a broader introduction to how kernels can be used and com-
bined to model certain classes of functions, read
                                                      stationary    isotropic
linear kernel                                         no            no
Gaussian kernel                                       yes           yes
k(x, x′) := exp(−‖x − x′‖²_M), M positive semi-definite   yes           no

(‖·‖_M denotes the Mahalanobis norm induced by matrix M.)
Stationarity encodes the idea that relative location matters more than
absolute location: the process “looks the same” no matter where we
shift it in the input space. This is often appropriate when we believe
the same statistical behavior holds across the entire domain (e.g., no
region is special). Isotropy goes one step further by requiring that
the kernel depends only on the distance between points, so that all
directions in the space are treated equally. In other words, there is no
preferred orientation or axis. This is especially useful in settings where
we expect uniform behavior in every direction (as with the Gaussian
kernel). Such kernels are simpler to specify and interpret since we
only need a single “scale” (like a length scale) rather than multiple
parameters or directions.
⟨f, g⟩_k := ∑_{i=1}^n ∑_{j=1}^{n′} α_i α′_j k(x_i, x′_j),    (4.18)

where g(·) = ∑_{j=1}^{n′} α′_j k(x′_j, ·), and induces the norm ‖f‖_k = √⟨f, f⟩_k.
Theorem 4.7 (Representer theorem, Problem 4.5). Let k be a kernel and let λ > 0. For f ∈ H_k(X) and training data {(x_i, f(x_i))}_{i=1}^n, let L(f(x₁), …, f(x_n)) ∈ R ∪ {∞} denote any loss function which depends on f only through its evaluation at the training points. Then, any minimizer of L(f(x₁), …, f(x_n)) + λ‖f‖_k² over H_k(X) can be written as

f̂(x) = α̂⊤ k_{x,{x_i}_{i=1}^n} = ∑_{i=1}^n α̂_i k(x, x_i)    for some α̂ ∈ R^n.    (4.20)
f̂ := argmin_{f ∈ H_k(X)} −log p(y_{1:n} | x_{1:n}, f) + ½ ‖f‖_k².    (4.21)
Here, the first term corresponds to the likelihood, measuring the “qual-
ity of fit”. The regularization term limits the “complexity” of fˆ. Reg-
ularization is necessary to prevent overfitting since in an expressive
RKHS, there may be many functions that interpolate the training data
perfectly. This shows the close link between Gaussian process regres-
sion and Bayesian linear regression, with the kernel function k gener-
alizing the inner product of feature maps to feature spaces of possi-
bly “infinite dimensionality”. Because solutions can be represented as
linear combinations of kernel evaluations at the training points, Gaus-
sian processes remain computationally tractable even though they can
model functions over “infinite-dimensional” feature spaces.
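The representer theorem makes (4.21) computationally concrete: for the squared loss, the coefficients α̂ solve a linear system in the kernel matrix. A sketch of this kernel ridge regression estimator (the regularization weight λ and the toy data are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(a, b, h=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * h**2))

def kernel_ridge_fit(x_train, y_train, lam=0.1, h=1.0):
    """Fit alpha so that f(x) = sum_i alpha_i k(x, x_i) minimizes the squared
    loss plus lam * ||f||_k^2 (representer theorem)."""
    K = gaussian_kernel(x_train, x_train, h)
    return np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def kernel_ridge_predict(x_test, x_train, alpha, h=1.0):
    return gaussian_kernel(x_test, x_train, h) @ alpha

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 30)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(30)
alpha = kernel_ridge_fit(x_train, y_train)
print(kernel_ridge_predict(np.array([0.0, 1.5]), x_train, alpha))
```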
using Monte Carlo sampling.⁸ In words, for reasonably large m, minimizing the empirical risk as we do in Equation (4.23) approximates minimizing the population risk.

⁸ We generally assume D ∼ P i.i.d.; in particular, we assume that the individual samples of the data are i.i.d. Recall that in this setting, Hoeffding's inequality (A.41) can be used to gauge how large m should be.

While this approach often is quite effective at preventing overfitting as compared to using the same data for training and picking θ̂, it still collapses the uncertainty in f into a point estimate. Can we do better?
= argmax_θ ∫ p(y_{1:n}, f | x_{1:n}, θ) df    by conditioning on f using the sum rule (1.7)
= argmax_θ ∫ p(y_{1:n} | x_{1:n}, f, θ) p(f | θ) df.    (4.26)    using the product rule (1.11)
Remarkably, this approach typically avoids overfitting even though we do not use a separate training and validation set. The following table provides an intuitive argument for why maximizing the marginal likelihood is a good strategy.

                              marginal likelihood
"just right"   moderate for "many" f   moderate

For an "underfit" model, the likelihood is mostly small as the data cannot be well described, while the prior is large as there are "fewer" functions to choose from. For an "overfit" model, the likelihood is large for "some" functions (which would be picked if we were only minimizing the training error and not doing cross validation) but small for "most" functions. The prior is small, as the probability mass has to be distributed among "more" functions. Thus, in both cases, one term in the product will be small. Hence, maximizing the marginal likelihood naturally encourages trading between a large likelihood and a large prior.

Figure 4.8: A schematic illustration of the marginal likelihood of a simple, intermediate, and complex model across all possible data sets.
∂/∂θ_j log p(y_{1:n} | x_{1:n}, θ) = ½ tr((αα⊤ − K_{y,θ}^{−1}) ∂K_{y,θ}/∂θ_j)

where α := K_{y,θ}^{−1} y and tr(M) is the trace of a matrix M. This optimization problem is, in general, non-convex. Figure 4.10 gives an example of two local optima according to empirical Bayes.

Figure 4.9: An example of model selection by maximizing the log likelihood (without hyperpriors) using a linear, quadratic, Laplace, Matérn (ν = 3/2), and Gaussian kernel, respectively. They are used to learn the function x ↦ sin(x)/x + ε, ε ∼ N(0, 0.01), using SGD with learning rate 0.1.

Taking a step back, observe that taking a probabilistic perspective on model selection naturally led us to consider all realizations of our model f instead of using point estimates. However, we are still using point estimates for our model parameters θ. Continuing on our probabilistic adventure, we could place a prior p(θ) on them too.⁹ We could use it to obtain the MAP estimate (still a point estimate!) which adds an additional regularization term

θ̂_MAP := argmax_θ p(θ | x_{1:n}, y_{1:n})    (4.31)
        = argmin_θ −log p(θ) − log p(y_{1:n} | x_{1:n}, θ),    (4.32)    using Bayes' rule (1.45) and then taking the negative logarithm.

⁹ Such a prior is called a hyperprior.
Recall that as the mode of Gaussians coincides with their mean, the
MAP estimate corresponds to the mean of the predictive posterior.
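The quantity optimized by empirical Bayes, the log marginal likelihood log p(y_{1:n} | x_{1:n}, θ), has a simple closed form for a GP with Gaussian likelihood. A minimal sketch (assuming a zero-mean GP with a Gaussian kernel and hyperparameters θ = (h, σ_n)):

```python
import numpy as np

def gaussian_kernel(a, b, h):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * h**2))

def log_marginal_likelihood(x, y, h, sigma_n2):
    """log p(y | x, theta) = -1/2 y^T K_y^{-1} y - 1/2 log det K_y - n/2 log(2 pi),
    where K_y = K + sigma_n^2 I (zero-mean GP)."""
    n = len(x)
    K_y = gaussian_kernel(x, x, h) + sigma_n2 * np.eye(n)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_det = 2 * np.sum(np.log(np.diag(L)))
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(40)
for h in [0.2, 1.0, 5.0]:
    print(h, log_marginal_likelihood(x, y, h, sigma_n2=0.01))
```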
4.5 Approximations
Figure 4.10: The top plot shows contour lines of an empirical Bayes with two local optima; the axes are the lengthscale h and the noise standard deviation σ_n. The bottom two plots show the Gaussian processes corresponding to the two optimal models. The left model with smaller lengthscale is chosen
k(x, x′) ≈ ϕ(x)⊤ϕ(x′).    (4.34)

amplitude of the frequencies ξ,

f̂(ξ) := ∫_{R^d} f(x) e^{−i2πξ⊤x} dx.    (4.36)
is given by

f̂(ω) = ∫_{−1}^{1} e^{−iωx} dx = (1/(iω))(e^{iω} − e^{−iω}) = 2 sin(ω)/ω.

Bochner's theorem implies that when a continuous and stationary kernel is positive definite and scaled appropriately, its Fourier transform p(ω) is a proper probability distribution. In this case, p(ω) is called the spectral density of the kernel k.
A key insight of this analysis is that the rate at which these magni-
tudes p(ω) decay with increasing frequency ω reveals the smooth-
ness of the processes governed by the kernel. If a kernel allocates
more “power” to high frequencies (meaning the spectral density
decays slowly), the resulting processes will appear “rougher”.
Conversely, if high-frequency components are suppressed, the pro-
cess will appear “smoother”.
Observe that as both k and p are real, convergence of the integral implies E_{ω∼p}[sin(ω⊤x − ω⊤x′)] = 0. Hence,

= E_{ω∼p}[cos(ω⊤x − ω⊤x′)]
= E_{ω∼p} E_{b∼Unif([0,2π])}[cos((ω⊤x + b) − (ω⊤x′ + b))]    expanding with b − b
= E_{ω∼p} E_{b∼Unif([0,2π])}[cos(ω⊤x + b) cos(ω⊤x′ + b) + sin(ω⊤x + b) sin(ω⊤x′ + b)]    using the angle subtraction identity, cos(α − β) = cos α cos β + sin α sin β
= E_{ω∼p} E_{b∼Unif([0,2π])}[2 cos(ω⊤x + b) cos(ω⊤x′ + b)]    using that for b ∼ Unif([0, 2π]) the two product terms have the same expectation
= z(x)⊤z(x′)    (4.42)
of k(x − x′), and wraps this line onto the unit circle in R². After transforming two points x and x′ in this way, their inner product is an unbiased estimator of k(x − x′). The mapping z_{ω,b}(x) = √2 cos(ω⊤x + b) additionally rotates the circle by a random amount b and projects the points onto the interval [0, 1].
2 −5.0 −2.5 0.0 2.5 5.0
Rahimi et al. (2007) show that Bayesian linear regression with the fea-
1
ture map z approximates Gaussian processes with a stationary kernel:
0
Theorem 4.13 (Uniform convergence of Fourier features). Suppose M
is a compact subset of Rd with diameter diam(M). Then for a stationary −1
kernel k, the random Fourier features z, and any ϵ > 0 it holds that −5.0 −2.5 0.0 2.5 5.0
x
!
P sup z( x)⊤ z( x′ ) − k ( x − x′ ) ≥ ϵ (4.44) Figure 4.13: Example of random Fourier
x,x′ ∈M features with where the number of fea-
tures m is 5 (top) and 10 (bottom), re-
σp diam(M) 2 mϵ2
8 spectively. The noise-free true function
≤2 exp − is shown in black and the mean of the
ϵ 8( d + 2)
Gaussian process is shown in blue.
where σ_p² := E_{ω∼p}[ω⊤ω] is the second moment of p and m is the dimension of the feature map z. Note that the error probability decays exponentially fast in the dimension of the Fourier feature space.
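For the Gaussian kernel the spectral density is itself Gaussian, which makes random Fourier features particularly easy to sample. A sketch of the construction and a quick empirical check of (4.42) (the choices of m and h are arbitrary; ω ∼ N(0, I/h²) is the spectral density of the Gaussian kernel with length scale h):

```python
import numpy as np

def random_fourier_features(x, m, h, rng):
    """Random Fourier features z(x) in R^m approximating the Gaussian kernel
    with length scale h: omega ~ N(0, I / h^2), b ~ Unif([0, 2*pi])."""
    d = x.shape[1]
    omega = rng.normal(scale=1.0 / h, size=(m, d))
    b = rng.uniform(0.0, 2 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(x @ omega.T + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
h = 1.0
Z = random_fourier_features(x, m=5000, h=h, rng=rng)
K_approx = Z @ Z.T
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2 * h**2))
print(np.max(np.abs(K_approx - K_exact)))  # small for large m
```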
Then, the original Gaussian process can be recovered using marginalization,

p(f⋆, f) = ∫_{R^k} p(f⋆, f, u) du = ∫_{R^k} p(f⋆, f | u) p(u) du,    (4.45)

using the sum rule (1.7) and product rule (1.11), where f := [f(x₁) ⋯ f(x_n)]⊤ and f⋆ := f(x⋆) at some evaluation point x⋆ ∈ X. We use u := [f(x̄₁) ⋯ f(x̄_k)]⊤ ∈ R^k to denote the predictions of the model at the inducing points U. Due to the marginalization property of Gaussian processes (4.1), we have that u ∼ N(0, K_{UU}).
The key idea is to approximate the joint prior, assuming that f ⋆ and f
are conditionally independent given u,
p(f⋆, f) ≈ ∫_{R^k} p(f⋆ | u) p(f | u) p(u) du.    (4.46)
p(f | u) = N(f; K_{AU} K_{UU}^{−1} u, K_{AA} − Q_{AA}),    (4.47a)
p(f⋆ | u) = N(f⋆; K_{⋆U} K_{UU}^{−1} u, K_{⋆⋆} − Q_{⋆⋆})    (4.47b)

where Q_{ab} := K_{aU} K_{UU}^{−1} K_{Ub}. Intuitively, K_{AA} represents the prior covariance and Q_{AA} represents the covariance "explained" by the inducing points.¹¹

¹¹ For more details, refer to section 2 of "A unifying view of sparse approximate Gaussian process regression" (Quinonero-Candela and Rasmussen, 2005).

Computing the full covariance matrix is expensive. In the following, we mention two approximations to the covariance of the training conditional (and testing conditional).
Example 4.14: Subset of regressors

q_SoR(f | u) := N(f; K_{AU} K_{UU}^{−1} u, 0),    (4.48a)
q_SoR(f⋆ | u) := N(f⋆; K_{⋆U} K_{UU}^{−1} u, 0).    (4.48b)
The computational cost for inducing point methods SoR and FITC is
dominated by the cost of inverting KUU . Thus, the time complexity is
cubic in the number of inducing points, but only linear in the number
of data points.
Discussion
Problems
Next we will show that the Kalman filter from Example 3.4 can be seen as a Gaussian process. To this end, we define

f : N₀ → R, t ↦ X_t.    (4.50)

Assuming that X₀ ∼ N(0, σ₀²) and X_{t+1} := X_t + ε_t with independent noise ε_t ∼ N(0, σ_x²), show that

This particular kernel, k(t, t′) := min{t, t′}, but over the continuous-time domain, defines the Wiener process (also known as Brownian motion).
Hint: Recall
1. the reproducing property f(x) = ⟨f, k(x, ·)⟩_k with k(x, ·) ∈ H_k(X), which holds for all f ∈ H_k(X) and x ∈ X, and
2. that the norm after projection is smaller than or equal to the norm before projection.
Then decompose f into parallel and orthogonal components with respect to span{k(x₁, ·), …, k(x_n, ·)}.
f̂ = argmin_{f ∈ H_k(X)} −log p(y_{1:n} | x_{1:n}, f) + ½ ‖f‖_k²

for some λ > 0, which is also known as kernel ridge regression. Determine λ.
2. Show that Equation (4.53) with the λ determined in (1) is equivalent to the MAP estimate of GP regression.
Hint: Recall from Equation (4.6) that the MAP estimate at a point x⋆ is

E[f⋆ | x⋆, X, y] = k_{x⋆,A}⊤ (K + σ_n² I)^{−1} y.
∂M^{−1}/∂θ_j = −M^{−1} (∂M/∂θ_j) M^{−1},    and    (4.54)
k y,θ ( x, x′ ) = θ0 k̃( x, x′ )
2. Prove P(‖∇f(∆⋆)‖₂ ≥ ϵ/(2r)) ≤ (2rσ_p/ϵ)².
Hint: Recall that the random Fourier feature approximation is unbiased, i.e., E[s(∆)] = k(∆).
3. Prove P(⋃_{i=1}^T |f(∆_i)| ≥ ϵ/2) ≤ 2T exp(−mϵ²/16).
4. Combine the results from (2) and (3) to prove Theorem 4.13.
Hint: You may use that
(a) αr^{−d} + βr² = 2β^{d/(d+2)} α^{2/(d+2)} for r = (α/β)^{1/(d+2)}, and
(b) σ_p diam(M)/ϵ ≥ 1.
5. Show that for the Gaussian kernel (4.14), σ_p² = d/h².
Hint: First show σ_p² = −tr(H_∆ k(0)).
In this and the following chapter, we will discuss two methods of ap-
proximate inference. We begin by discussing variational (probabilistic)
inference, which aims to find a good approximation of the posterior
distribution from which it is easy to sample. In Chapter 6, we discuss
Markov chain Monte Carlo methods, which approximate the sampling
from the posterior distribution directly.
p(θ | x_{1:n}, y_{1:n}) = (1/Z) p(θ, y_{1:n} | x_{1:n}) ≈ q(θ | λ) =: q_λ(θ)    (5.1)

where λ represents the parameters of the variational posterior q_λ, also called variational parameters. In doing so, variational inference reduces probabilistic inference — where the fundamental difficulty lies in solving high-dimensional integrals — to an optimization problem. Optimizing (stochastic) objectives is a well-understood problem with efficient algorithms that perform well in practice.¹

¹ We provide an overview of first-order methods such as stochastic gradient descent in Appendix A.4.
5.1 Laplace Approximation
ψ(θ) ≈ ψ̂(θ) := ψ(θ̂) + (θ − θ̂)⊤∇ψ(θ̂) + ½(θ − θ̂)⊤H_ψ(θ̂)(θ − θ̂)
            = ψ(θ̂) + ½(θ − θ̂)⊤H_ψ(θ̂)(θ − θ̂),    (5.3)    using ∇ψ(θ̂) = 0

log N(θ; θ̂, Λ^{−1}) = −½(θ − θ̂)⊤Λ(θ − θ̂) + const.    (5.4)
Since ψ(θ̂) is constant with respect to θ,
∇_θ log p(θ) = −½(2Σ^{−1}θ − 2Σ^{−1}µ) = 0  ⟺  θ = µ.    (5.7)

For the Hessian of log p(θ), we get

H_θ log p(θ) = (D_θ(Σ^{−1}µ − Σ^{−1}θ))⊤ = −(Σ^{−1})⊤ = −Σ^{−1},    (5.8)    using (A^{−1})⊤ = (A⊤)^{−1} and symmetry of Σ.
We see that the Laplace approximation of a Gaussian p(θ) is exact, which should not come as a surprise since the second-order Taylor expansion of the (quadratic) log-density of a Gaussian is exact.

The Laplace approximation matches the shape of the true posterior around its mode but may not represent it accurately elsewhere — often
5.1.1 Example: Bayesian Logistic Regression

is used to obtain the class probabilities. Bayesian logistic regression corresponds to Bayesian linear regression with a Bernoulli likelihood,

Figure 5.2: The logistic function σ(z) squashes the linear function w⊤x onto the interval (0, 1).
is called logistic loss. The gradient of the logistic loss is given by (Problem 5.1 (1))
We can therefore use SGD with the (regularized) gradient step and
with batch size 1,
for the data point ( x, y) picked uniformly at random from the training
data. Here, 2ληt is due to the gradient of the regularization term, in
effect, performing weight decay.
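A minimal sketch of this stochastic gradient step (it assumes labels y ∈ {−1, +1}, for which the gradient of the logistic loss is −y x σ(−y w⊤x); the learning rate η and regularization weight λ are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lam=0.01, eta=0.1, epochs=100, rng=None):
    """SGD with batch size 1 for L2-regularized logistic regression,
    assuming labels y in {-1, +1}."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = -y[i] * X[i] * sigmoid(-y[i] * (w @ X[i]))  # gradient of the logistic loss
            w = w - eta * (grad + 2 * lam * w)                 # 2*lam*eta*w acts as weight decay
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.1 * rng.standard_normal(200))
w_hat = sgd_logistic(X, y, rng=rng)
print(w_hat, sigmoid(X @ w_hat)[:5])  # estimated weights and class probabilities
```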
Let us denote by

π_i := P(y_i = 1 | x_i, ŵ) = σ(ŵ⊤x_i)    (5.16)

where θ_j ∼ q_λ i.i.d.    (5.21)
Q := { q(θ) = N(θ; µ, diag_{i∈[d]}{σ_i²}) },    (5.24)

q⋆ := argmin_{q∈Q} KL(q∥p) = argmin_{λ∈Λ} KL(q_λ∥p).    (5.25)
5.4.1 Surprise

The surprise about an event with probability u is defined as

S[u] := −log u.    (5.26)

Observe that the surprise is a function from R≥0 to R, where we let S[0] ≡ ∞. Moreover, for a discrete random variable X, we have
Proof. Observe that the third condition looks similar to the product rule of logarithms: log(uv) = log u + log v. We can formalize this intuition by remembering Cauchy's functional equation, f(x + y) = f(x) + f(y), which has the unique family of solutions {f : x ↦ cx : c ∈ R} if f is required to be continuous. Such a solution is called an "additive function". Consider the function g(x) := f(e^x). Then, g is additive if and only if

f(e^x e^y) = f(e^{x+y}) = g(x + y) = g(x) + g(y) = f(e^x) + f(e^y).
Throughout this manuscript we will see many examples where modeling uncertainty in terms of surprise (i.e., the information-theoretic interpretation of uncertainty) is useful. One example where we have

Figure 5.6: Illustration of the probability space and the corresponding "surprise space".
5.4.2 Entropy

The entropy of a distribution p is the average surprise about samples from p. In this way, entropy is a notion of uncertainty associated with the distribution p: if the entropy of p is large, we are more uncertain about x ∼ p than if the entropy of p were low. Formally,

H[p] := E_{x∼p}[S[p(x)]] = E_{x∼p}[−log p(x)].    (5.27)

Figure 5.7: Entropy of a Bernoulli experiment with success probability p.
When X ∼ p is a random vector distributed according to p, we write H[X] := H[p]. Observe that by definition, if p is discrete then H[p] ≥ 0 as p(x) ≤ 1 (∀x).⁴ For discrete distributions it is common to use the logarithm with base 2 rather than the natural logarithm:⁵

H[p] = −∑_x p(x) log₂ p(x)    (if p is discrete),    (5.28a)
H[p] = −∫ p(x) log p(x) dx    (if p is continuous).    (5.28b)

⁴ The entropy of a continuous distribution can be negative. For example, H[Unif([a, b])] = −∫ (1/(b − a)) log(1/(b − a)) dx = log(b − a), which is negative if b − a < 1.
⁵ Recall that log₂ x = log x / log 2, that is, logarithms to a different base only differ by a constant factor.

Let us briefly recall Jensen's inequality, which is a useful tool when working with expectations of convex functions such as entropy:⁶

⁶ The surprise S[u] is convex in u.

Fact 5.7 (Jensen's Inequality, Problem 5.3 (1)). Given a random variable X and a convex function g : R → R, we have
• Unfair Coin

  H[Bern(0.1)] = −0.1 log₂ 0.1 − 0.9 log₂ 0.9 ≈ 0.469.

• Uniform Distribution

  H[Unif({1, …, n})] = −∑_{i=1}^n (1/n) log₂(1/n) = log₂ n.

Figure 5.8: An illustration of Jensen's inequality. Due to the convexity of g, we have that g evaluated at E[X] will always be below the average of evaluations of g.
N(x; µ, σ²) = (1/Z) exp(−(x − µ)²/(2σ²))

where Z = √(2πσ²). Using the definition of entropy (5.28b), we obtain

H[N(µ, σ²)] = −∫ (1/Z) exp(−(x − µ)²/(2σ²)) · log((1/Z) exp(−(x − µ)²/(2σ²))) dx
= log Z · ∫ (1/Z) exp(−(x − µ)²/(2σ²)) dx + ∫ (1/Z) exp(−(x − µ)²/(2σ²)) · (x − µ)²/(2σ²) dx    (the first integral equals 1)
= log Z + (1/(2σ²)) E[(x − µ)²]    using LOTUS (1.22)
= log(σ√(2π)) + 1/2    using E[(x − µ)²] = Var[x] = σ² (1.34)
= log(σ√(2πe)).    (5.30)    using log √e = 1/2

For a multivariate Gaussian,

H[N(µ, Σ)] = ½ log det(2πeΣ) = ½ log((2πe)^d det(Σ)).    (5.31)
Note that the entropy is a function of the determinant of the co-
variance matrix Σ. In general, there are various ways of “scalariz-
ing” the notion of uncertainty for a multivariate distribution. The
determinant of Σ measures the volume of the credible sets around
the mean µ, and is also called the generalized variance. Next to
entropy and generalized variance (which are closely related for
Gaussians), a common scalarization is the trace of Σ, which is also
called the total variance.
5.4.3 Cross-Entropy
How can we use entropy to measure our average surprise when as-
suming the data follows some distribution q but in reality the data
follows a different distribution p?
H[p∥q] = H[p] + KL(p∥q) ≥ H[p].    (5.33)    (KL(p∥q) ≥ 0 is shown in Problem 5.5)

In words, KL(p∥q) measures the additional expected surprise when observing samples from p that is due to assuming the (wrong) distribution q and which is not inherent in the distribution p already.⁷

⁷ The KL-divergence only captures the additional expected surprise since the surprise inherent in p (as measured by H[p]) is subtracted.

The KL-divergence has the following properties:
• KL(p∥q) ≥ 0 for any distributions p and q (Problem 5.5 (1)),
• KL(p∥q) = 0 if and only if p = q almost surely (Problem 5.5 (2)), and
• there exist distributions p and q such that KL(p∥q) ≠ KL(q∥p).
(1/n) ∑_{i=1}^n log(p(θ_i)/q(θ_i)) → E_{θ∼p}[log(p(θ)/q(θ))] = KL(p∥q)    (5.38)
KL(Bern(p)∥Bern(q)) = ∑_{x∈{0,1}} Bern(x; p) log(Bern(x; p)/Bern(x; q))
= p log(p/q) + (1 − p) log((1 − p)/(1 − q)).    (5.40)
For independent Gaussians with unit variance, Σ_p = Σ_q = I, the expression simplifies to the squared Euclidean distance,

KL(p∥q) = ½ ‖µ_q − µ_p‖₂².    (5.42)

If we approximate independent Gaussians with variances σ_i²,

p := N(µ, diag{σ₁², …, σ_d²}),

by a standard normal distribution, q := N(0, I), the expression simplifies to

KL(p∥q) = ½ ∑_{i=1}^d (σ_i² + µ_i² − 1 − log σ_i²).    (5.43)

Here, the term µ_i² penalizes a large mean of p, the term σ_i² penalizes a large variance of p, and the term −log σ_i² penalizes a small variance of p. As expected, KL(p∥q) is proportional to the amount of information we lose by approximating p with the simpler distribution q.
KL(q∥p) = ½ (tr(diag{σ_i^{−2}}) + µ⊤ diag{σ_i^{−2}} µ − d + log det(diag{σ_i²}))
= ½ ∑_{i=1}^d (σ_i^{−2} + µ_i²/σ_i² − 1 + log σ_i²).    (5.45)

Here, σ_i^{−2} penalizes small variance, µ_i²/σ_i² penalizes a large mean, and log σ_i² penalizes large variance. Compare this to the expression for the forward KL-divergence KL(p∥q) that we have seen in Equation (5.43). In particular, observe that reverse-KL penalizes large variance less strongly than forward-KL.
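The closed forms (5.43) and (5.45) are easy to evaluate numerically; a minimal sketch (with arbitrary means and variances):

```python
import numpy as np

def kl_forward(mu, sigma2):
    """KL(p || q) for p = N(mu, diag(sigma2)) and q = N(0, I), cf. Equation (5.43)."""
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - np.log(sigma2))

def kl_reverse(mu, sigma2):
    """KL(q || p) for the same pair of distributions, cf. Equation (5.45)."""
    return 0.5 * np.sum(1.0 / sigma2 + mu**2 / sigma2 - 1.0 + np.log(sigma2))

mu = np.array([0.5, -1.0, 0.0])
sigma2 = np.array([2.0, 0.5, 1.0])
print(kl_forward(mu, sigma2), kl_reverse(mu, sigma2))
# Increasing a variance sigma_i^2 grows kl_forward roughly linearly,
# but kl_reverse only logarithmically: reverse-KL penalizes large variance less.
```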
Proof.
This loss function is also known as the binary cross-entropy loss and
we will discuss it in more detail in Section 7.1.3 in the context of
neural networks.
N(θ; µ, Σ) = (1/Z(λ)) exp(λ⊤s(θ))    where    (5.48)

λ := [Σ^{−1}µ; vec[Σ^{−1}]]    (5.49)
s(θ) := [θ; vec[−½θθ⊤]]    (5.50)

and Z(λ) := ∫ exp(λ⊤s(θ)) dθ, and we will confirm this in just a moment.¹⁰ The family of distributions with densities of the form (5.48) — with an additional scaling constant h(θ) which is often 1 — is called

¹⁰ Given a matrix A ∈ R^{n×m}, we use vec[A] ∈ R^{n·m} to denote the row-by-row concatenation of A yielding a vector of length n · m.
the exponential family of distributions. Here, s(θ) are the sufficient statis-
tics, λ are called the natural parameters, and Z (λ) is the normalizing
constant. In this context, Z (λ) is often called the partition function.
KL(p∥q_λ) = ∫ p(θ) log(p(θ)/q_λ(θ)) dθ
= −∫ p(θ) · λ⊤s(θ) dθ + log Z(λ) + const,    using that ∫ p(θ) log p(θ) dθ is constant

∇_λ KL(p∥q_λ) = −∫ p(θ) s(θ) dθ + (1/Z(λ)) ∫ s(θ) exp(λ⊤s(θ)) dθ
= −E_{θ∼p}[s(θ)] + E_{θ∼q_λ}[s(θ)].
Hence, for any minimizer of KL(p∥q_λ), we have that the sufficient statistics under p and q_λ match:

E_{θ∼p}[s(θ)] = E_{θ∼q_λ}[s(θ)],

implying that

E_p[θ] = µ and Var_p[θ] = E_p[θθ⊤] − E_p[θ] · E_p[θ]⊤ = Σ,    (5.52)    using Equation (1.35),
where µ and Σ are the mean and variance of the approximation qλ , re-
spectively. That is, a Gaussian qλ minimizing KL( p∥qλ ) has the same
first and second moment as p. Combining this insight with our obser-
vation from Equation (5.46) that minimizing forward-KL is equivalent
to maximum likelihood estimation, we see that if we use MLE to fit a
Gaussian to given data, this Gaussian will eventually match the first
and second moments of the data distribution.
q(θ)
KL(q∥ p(· | x1:n , y1:n )) = Eθ∼q log using the definition of the
p(θ | x1:n , y1:n ) KL-divergence (5.34)
p(y1:n | x1:n )q(θ)
= Eθ∼q log using the definition of conditional
p(y1:n , θ | x1:n ) probability (1.8)
= log p(y1:n | x1:n ) using linearity of expectation (1.20)
−Eθ∼q [log p(y1:n , θ | x1:n )] − H[q]
| {z }
− L(q,p;Dn )
where L(q, p; Dn ) is called the evidence lower bound (ELBO) given the
data Dn = {( xi , yi )}in=1 . This gives the relationship
L(q, p; Dn ) = log p(y1:n | x1:n ) − KL(q∥ p(· | x1:n , y1:n )). (5.53)
| {z }
const
selects q that has large joint likelihood p(y1:n , θ | x1:n ) and large en-
tropy H[q]. The ELBO can also be expressed in various other forms:
L(q, p; Dn ) = Eθ∼q [log p(y1:n , θ | x1:n ) − log q(θ)] (5.55a) using the definition of entropy (5.27)
= Eθ∼q [log p(y1:n | x1:n , θ) + log p(θ) − log q(θ)] (5.55b) using the product rule (1.11)
= E_{θ∼q}[log p(y_{1:n} | x_{1:n}, θ)] − KL(q‖p(·)),   (5.55c)      using the definition of the KL-divergence (5.34)
where the first term is the expected log-likelihood and the second measures proximity to the prior.
we have

KL(q‖p(·)) = (1/2) ∑_{i=1}^{d} (σ_i² + µ_i² − 1 − log σ_i²).
∇λ L(qλ , p; Dn ) = ∇λ Eθ∼qλ [log p(y1:n | x1:n , θ)] − ∇λ KL(qλ ∥ p(·)). using the definition of the ELBO (5.55c)
(5.60)
The difficulty is that the expectation in the first term is taken with respect to q_λ, which itself depends on λ. We therefore need to rewrite the gradient in such a way that Monte Carlo sampling becomes possible.
One approach is to use score gradients via the “score function trick”:
Proof. By the change of variables formula (1.43) and using ε = g^{−1}(θ; λ),

q_λ(θ) = ϕ(ε) · |det D_θ g^{−1}(θ; λ)|
       = ϕ(ε) · |det (D_ε g(ε; λ))^{−1}|      by the inverse function theorem, D g^{−1}(y) = (D g(x))^{−1}
       = ϕ(ε) · |det(D_ε g(ε; λ))|^{−1}.      using det(A^{−1}) = det(A)^{−1}
∇λ Eθ∼qλ [ f (θ)] = Eε∼ϕ [∇λ f ( g (ε; λ))]. (5.64) using Equation (A.5)
In particular, we have

ε = g^{−1}(θ; λ) = Σ^{−1/2}(θ − µ)   (5.66)      by solving Equation (5.65) for ε

and

ϕ(ε) = q_λ(θ) · |det Σ^{1/2}|.   (5.67)      using the reparameterization trick (i.e., the change of variables formula) (5.62)
In the following, we write C := Σ^{1/2}. Let us now derive the gradient
estimate for the evidence lower bound assuming the Gaussian vari-
ational approximation from Example 5.20. This approach extends to
any reparameterizable distribution.
where ε_j ~iid~ N(0, I) and i_j ~iid~ Unif([n]). This yields an unbiased gra-
dient estimate, which we can use with stochastic gradient descent to
maximize the evidence lower bound. We have successfully recast the
difficult problems of learning and inference as an optimization prob-
lem!
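As an illustration of this recipe, here is a minimal PyTorch sketch of stochastic variational inference for Bayesian linear regression with a standard normal prior and known noise; the data, step sizes, and batch size are made up, and the KL term is the closed-form diagonal-Gaussian expression derived above.

import torch

# hypothetical data for Bayesian linear regression with noise std 0.1
n, d = 100, 3
X = torch.randn(n, d)
y = X @ torch.randn(d) + 0.1 * torch.randn(n)

# variational parameters of a diagonal Gaussian q_lambda
mu = torch.zeros(d, requires_grad=True)
log_sigma = torch.zeros(d, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

m = 10  # minibatch size
for step in range(2000):
    idx = torch.randint(0, n, (m,))
    eps = torch.randn(d)
    theta = mu + torch.exp(log_sigma) * eps                         # reparameterization theta = g(eps; lambda)
    loglik = -(n / m) * 0.5 * ((y[idx] - X[idx] @ theta) ** 2 / 0.1 ** 2).sum()   # unbiased log-likelihood estimate
    kl = 0.5 * (torch.exp(2 * log_sigma) + mu ** 2 - 1 - 2 * log_sigma).sum()     # KL(q || N(0, I)), closed form
    loss = -(loglik - kl)                                           # negative ELBO estimate
    opt.zero_grad()
    loss.backward()                                                 # unbiased gradient via the reparameterization trick
    opt.step()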
−L(q, p; D_n) = E_{θ∼q}[S[p(y_{1:n} | x_{1:n}, θ)]] + KL(q‖p(·)),   (5.72)      analogously to Equation (5.55c),
where the first term is the average surprise and the second measures proximity to the prior.
Discussion
Nevertheless, recall from Figure 5.4 that while the estimation error in
variational inference can be small, choosing a variational family that is
too simple can lead to a large approximation error. We have seen that
for posteriors that are multimodal or have heavy tails, Gaussians may
not provide a good approximation. In the next chapter, we will explore
alternative techniques for approximate inference that can handle more
complex posteriors.
Problems
Hint: Begin by deriving the first derivative of the logistic function, and
use the chain rule of multivariate calculus,
D_x(f ∘ g) = (D f)|_{g(x)} · D_x g   (5.74)

where g: R^n → R^k and f: R^k → R^m, so that D_x(f ∘ g) ∈ R^{m×n}, (D f)|_{g(x)} ∈ R^{m×k}, and D_x g ∈ R^{k×n}.
3. Is the logistic loss ℓlog convex in w?
In this exercise, we will study the use of Gaussian processes for clas-
sification tasks, commonly called Gaussian process classification (GPC).
Linear logistic regression is extended to GPC by replacing the Gaus-
sian prior over weights with a GP prior on f ,
Show that the model from Equation (5.75) using a noise-free latent
process with probit likelihood Φ(z; 0, σn2 ) is equivalent (in expecta-
tion over ε) to the model from Equation (5.77).
f ( θ1 x1 + · · · + θ k x k ) ≤ θ1 f ( x1 ) + · · · + θ k f ( x k ). (5.78)
Show that the logistic loss (5.13) is equivalent to the binary cross-
entropy loss with ŷ = σ ( fˆ). That is,
f ( θ1 x1 + · · · + θ k x k ) = θ1 f ( x1 ) + · · · + θ k f ( x k ) (5.80)
In this exercise we will prove that the normal distribution is the distribution with maximal entropy among all (univariate) distributions supported on R with fixed mean µ and variance σ². Let g(x) := N(x; µ, σ²), and f(x) be any distribution on R with mean µ and variance σ².
1. Prove that KL( f ∥ g) = H[ g] − H[ f ].
Hint: Equivalently, show that H[ f ∥ g] = H[ g]. That is, the expected
surprise evaluated based on the Gaussian g is invariant to the true distri-
bution f .
2. Conclude that H[ g] ≥ H[ f ].
where δy′ denotes the point density at y′ . The product rule (1.11)
implies that q( x, y) = δy′ (y) · q( x | y), but any choice of q( x | y) is
possible.
We will derive that given any fixed generative model pX,Y , the “pos-
terior” distribution qX (·) = pX|Y (· | y′ ) minimizes the relative entropy
KL(qX,Y ∥ pX,Y ) subject to the constraint Y = y′ . In other words, among
all distributions qX,Y that are consistent with the observation Y = y′ ,
the posterior distribution qX (·) = pX|Y (· | y′ ) is the one with “maxi-
mum entropy”.
The joint distribution p(x, y) (rows index x, columns index y):

         y=1    y=2    y=3    y=4
  x=1    1/8    1/8    0      0
  x=2    1/8    1/8    0      0
  x=3    0      0      1/4    0
  x=4    0      0      0      1/4
Show that the reverse KL(q∥ p) for this p has three distinct minima.
Identify those minima and evaluate KL(q∥ p) at each of them.
3. What is the value of KL(q∥ p) if we use the approximation q( x, y) = p( x ) p(y)?
Hint 1: For any positive definite and symmetric matrix A, it holds that
∇A log det( A) = A−1 .
Hint 2: For any function f and Gaussian p = N (µ, Σ ),
2. Let Z ∼ N (µ, σ2 ) and X = e Z . That is, X is logarithmically nor-
mal distributed with parameters µ and σ2 . Show that X can be
reparameterized in terms of N (0, 1).
3. Show that Cauchy(0, 1) can be reparameterized in terms of Unif([0, 1]).
Finally, let us apply the reparameterization trick to compute the gra-
dient of an expectation.
4. Let ReLU(z) := max{0, z} and w > 0. Show that

   (d/dµ) E_{x∼N(µ,1)}[ReLU(wx)] = wΦ(µ)
for independent θ^{(i)} ~iid~ p(· | x_{1:n}, y_{1:n}). The law of large numbers (A.36) and Hoeffding's inequality (A.41) imply that this estimator is consistent and sharply concentrated.¹

(Footnote 1: For more details, see Appendix A.3.3.)
p(x) = (1/Z) q(x).   (6.4)
The joint likelihood q is typically easy to obtain. Note that q(x) is proportional to the probability density associated with x, but q does not integrate to 1.
X_{t+1} ⊥ X_{0:t−1} | X_t.   (6.6)

[Figure 6.1: Directed graphical model of a Markov chain (X_1 → X_2 → X_3 → ···). The random variable X_{t+1} is conditionally independent of the random variables X_{0:t−1} given X_t.]

Intuitively, the Markov property states that future behavior is independent of past states given the present state.
Note that each row of P must always sum to 1. Such matrices are also
called stochastic.
qt+1 = qt P. (6.9)
It follows directly that we can write the state of the Markov chain at time t + k as

q_{t+k} = q_t P^k.   (6.10)
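For illustration (not part of the original text), the following NumPy sketch propagates a state distribution with a made-up two-state transition matrix according to q_{t+1} = q_t P.

import numpy as np

P = np.array([[0.9, 0.1],      # hypothetical transition matrix; each row sums to 1
              [0.2, 0.8]])
q = np.array([1.0, 0.0])       # initial distribution over the two states (row vector)
for _ in range(100):
    q = q @ P                  # one step of the chain, cf. (6.9)
print(q)                       # for this P the distribution approaches (2/3, 1/3), which satisfies pi = pi P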
In the analysis of Markov chains, there are two main concepts of inter-
est: stationarity and convergence. We begin by introducing stationar-
ity.
6.1.1 Stationarity
Definition 6.4 (Stationary distribution). A distribution π is stationary
with respect to the transition function p iff
π (x) = ∑ p( x | x ′ )π ( x ′ ) (6.11)
x ′ ∈S
π = πP. (6.12)
116 probabilistic artificial intelligence
6.1.2 Convergence
lim_{t→∞} q_t = π,   (6.14)
[Figure 6.2: Transition graphs of Markov chains: (1) is not ergodic as its transition diagram is not strongly connected; (2) is not ergodic for the same reason; (3) is irreducible but periodic and therefore not ergodic; (4) is ergodic with stationary distribution π(1) = 2/3, π(2) = 1/3.]

Example 6.8: Making a Markov chain ergodic
A commonly used strategy to ensure that a Markov chain is ergodic is to add "self-loops" to every vertex in the transition graph. That is, to ensure that at any point in time, the Markov chain remains with positive probability in its current state.
πP′ = (1/2) πP + (1/2) πI = (1/2) π + (1/2) π = π.   (6.18)      using (6.12)
π ( x ) p( x′ | x ) = π ( x′ ) p( x | x′ ) (6.25)
Lemma 6.14. Given a finite Markov chain, if the Markov chain is reversible with respect to π then π is a stationary distribution.⁵

(Footnote 5: Note that reversibility of π is only a sufficient condition for stationarity of π, it is not necessary! In particular, there are irreversible ergodic Markov chains.)

Proof. Let π := q_t. We have,
q_{t+1}(x) = ∑_{x′∈S} p(x | x′) q_t(x′)      using the Markov property (6.6)
           = ∑_{x′∈S} p(x | x′) π(x′)
           = ∑_{x′∈S} p(x′ | x) π(x)      using the detailed balance equation (6.25)
           = π(x) ∑_{x′∈S} p(x′ | x)
           = π(x).      using that ∑_{x′∈S} p(x′ | x) = 1
That is, if we can show that the detailed balance equation (6.25) holds
for some distribution q, then we know that q is the stationary distribu-
tion of the Markov chain.
or equivalently,
q ( x ) p ( x ′ | x ) = q ( x ′ ) p ( x | x ′ ). (6.27)
(1/n) ∑_{i=1}^{n} f(x_i) → (a.s.) ∑_{x∈S} π(x) f(x) = E_{x∼π}[f(x)]   (6.28)

as n → ∞, where x_i ∼ p(X_i | x_{i−1}).
This result is the fundamental reason for why Markov chain Monte Carlo methods are possible. There are analogous results for continuous domains.

[Figure 6.3: Illustration of the "burn-in" time t_0 of a Markov chain approximating the posterior p(y⋆ = 1 | X, y) of Bayesian logistic regression. The true posterior p is shown in gray. The distribution of the Markov chain at time t is shown in red.]

Note, however, that the ergodic theorem only tells us that simulating a single Markov chain yields an unbiased estimator. It does not tell us anything about the rate of convergence and variance of such an estimator. The convergence rate depends on the mixing time of the Markov chain, which is difficult to establish in general.

In practice, one observes that Markov chain Monte Carlo methods have a so-called "burn-in" time during which the distribution of the Markov
chain does not yet approximate the posterior distribution well. Typi-
cally, the first t0 samples are therefore discarded,
E[f(X)] ≈ (1/(T − t_0)) ∑_{t=t_0+1}^{T} f(X_t).   (6.29)
It is not clear in general how T and t_0 should be chosen such that the estimator is unbiased; rather, they have to be tuned in practice.
Another widely used heuristic is to first find the mode of the posterior
distribution and then start the Markov chain at that point. This tends
to increase the rate of convergence drastically, as the Markov chain
does not have to “walk to the location in the state space where most
probability mass will be located”.
Using the acceptance distribution

α(x′ | x) := min{ 1, p(x′) r(x | x′) / ( p(x) r(x′ | x) ) }   (6.30)
           = min{ 1, q(x′) r(x | x′) / ( q(x) r(x′ | x) ) }   (6.31)      similarly to the detailed balance equation, the normalizing constant Z cancels

to decide whether to follow the proposal yields a Markov chain with stationary distribution p(x) = (1/Z) q(x).
Intuitively, the acceptance distribution corrects for the bias in the pro-
posal distribution. That is, if the proposal distribution r is likely to
propose states with low probability under p, the acceptance distri-
bution will reject these proposals frequently. The following theorem
formalizes this intuition.
α(x | x′) = 1,   α(x′ | x) = q(x′) r(x | x′) / ( q(x) r(x′ | x) ).
By (6.31),

α_i(x′ | x) = min{ 1, p(x′) r_i(x | x′) / ( p(x) r_i(x′ | x) ) }
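The following NumPy sketch (with a made-up unnormalized target and a symmetric Gaussian proposal) illustrates Metropolis–Hastings with the acceptance rule (6.31); since the proposal is symmetric, the proposal ratio cancels.

import numpy as np

rng = np.random.default_rng(0)

def unnorm_log_q(x):
    # hypothetical unnormalized log-density (equal mixture of two Gaussians)
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_samples, tau=1.0, x0=0.0):
    x, samples = x0, []
    for _ in range(n_samples):
        x_prop = x + np.sqrt(tau) * rng.standard_normal()     # symmetric proposal r(x'|x) = N(x'; x, tau)
        log_alpha = min(0.0, unnorm_log_q(x_prop) - unnorm_log_q(x))   # log of (6.31); r cancels
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop                                         # accept, otherwise stay at x
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings(10_000)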
p(θ | x_{1:n}, y_{1:n}) = (1/Z) p(θ) p(y_{1:n} | x_{1:n}, θ)      using Bayes' rule (1.45)
                        = (1/Z) exp(−[−log p(θ) − log p(y_{1:n} | x_{1:n}, θ)]).   (6.36)
Thus, defining the energy function

f(θ) := −log p(θ) − log p(y_{1:n} | x_{1:n}, θ)   (6.37)
      = −log p(θ) − ∑_{i=1}^{n} log p(y_i | x_i, θ),   (6.38)
yields

p(θ | x_{1:n}, y_{1:n}) = (1/Z) exp(−f(θ)).   (6.39)
Note that f coincides with the loss function used for MAP estimation
(1.62). For a noninformative prior, the regularization term vanishes
and the energy reduces to the negative log-likelihood ℓnll (θ; D) (i.e.,
the loss function of maximum likelihood estimation (1.57)).
α(x′ | x) = min{ 1, ( r(x | x′) / r(x′ | x) ) exp( f(x) − f(x′) ) }.   (6.40)      this is obtained by substituting the PDF of a Gibbs distribution for the posterior

For a symmetric proposal such as r(x′ | x) = N(x′; x, τI), the proposal ratio is

r(x | x′) / r(x′ | x) = N(x; x′, τI) / N(x′; x, τI) = 1.
α(x′ | x) = min{ 1, exp( f(x) − f(x′) ) }.   (6.42)
That is, up to a constant shift, the energy of x coincides with the surprise about x. Energies are therefore sufficient for comparing the "likelihood" of points, and they do not require normalization.⁷

(Footnote 7: Intuitively, an energy can be used to compare the "likelihood" of two points x and x′, whereas the probability of x makes a statement about the "likelihood" of x relative to all other points.)

What kind of energies could we use? In Section 6.3.1, we discussed the use of the negative log-posterior or negative log-likelihood as energies. In general, any loss function ℓ(x) can be thought of as an energy function with an associated maximum entropy distribution p(x) ∝ exp(−ℓ(x)).
θ′ ← θ − η_t ∇f(θ) + ε
   = θ + η_t ( ∇ log p(θ) + ∑_{i=1}^{n} ∇ log p(y_i | x_i, θ) ) + ε   (6.45)
                                          full gradients    stochastic gradients
  sampling from p(x) ∝ exp(−f(x))         LD                SGLD
  optimization of f                       GD                SGD
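A minimal NumPy sketch of SGLD for Bayesian linear regression, where the gradients of the log-prior and log-likelihood are available in closed form; the data and hyperparameters are made up, and the injected noise has variance 2η_t, the standard choice for (stochastic gradient) Langevin dynamics.

import numpy as np

rng = np.random.default_rng(1)
n, d, sigma_n, sigma_p = 200, 2, 0.5, 1.0          # hypothetical problem sizes and noise/prior scales
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0]) + sigma_n * rng.standard_normal(n)

theta, m, eta = np.zeros(d), 20, 1e-3
samples = []
for t in range(5000):
    idx = rng.integers(0, n, m)
    grad_log_prior = -theta / sigma_p ** 2                               # gradient of log N(theta; 0, sigma_p^2 I)
    grad_log_lik = (n / m) * X[idx].T @ (y[idx] - X[idx] @ theta) / sigma_n ** 2   # minibatch estimate
    noise = np.sqrt(2 * eta) * rng.standard_normal(d)                    # injected Gaussian noise
    theta = theta + eta * (grad_log_prior + grad_log_lik) + noise        # cf. (6.45)
    samples.append(theta.copy())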
reaching some new point ( x′ , y′ ) and projecting back to the state space
by selecting x′ as the new sample. This is illustrated in Figure 6.6.
In the next iteration, we resample the momentum y′ ∼ p(· | x′ ) and
repeat the procedure.
y(t + τ/2) = y(t) − (τ/2) ∇_x f(x(t))   (6.51a)
x(t + τ) = x(t) + (τ/m) y(t + τ/2)   (6.51b)
y(t + τ) = y(t + τ/2) − (τ/2) ∇_x f(x(t + τ)).   (6.51c)
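A direct transcription of the leapfrog scheme (6.51) as a NumPy sketch; the quadratic energy used in the example is made up.

import numpy as np

def leapfrog(x, y, grad_f, tau, m=1.0, n_steps=1):
    # Repeated leapfrog steps for Hamiltonian dynamics with energy f, cf. (6.51a)-(6.51c).
    for _ in range(n_steps):
        y = y - 0.5 * tau * grad_f(x)   # half step for the momentum, (6.51a)
        x = x + (tau / m) * y           # full step for the position, (6.51b)
        y = y - 0.5 * tau * grad_f(x)   # half step for the momentum, (6.51c)
    return x, y

# example: standard Gaussian target, f(x) = x^2 / 2, so grad f(x) = x
x, y = np.array([1.0]), np.array([0.0])
x, y = leapfrog(x, y, lambda x: x, tau=0.1, n_steps=10)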
Discussion
Optional Readings
• Ma, Chen, Jin, Flammarion, and Jordan (2019).
Sampling can be faster than optimization.
• Teh, Thiery, and Vollmer (2016).
Consistency and fluctuations for stochastic gradient Langevin dy-
namics.
• Chen, Fox, and Guestrin (2014).
Stochastic gradient Hamiltonian Monte Carlo.
Problems
Prove Equation (6.9), i.e., that one iteration of the Markov chain can be
expressed as qt+1 = qt P.
                             next day
                             good    fair    poor
  current day    good        0.60    0.30    0.10
                 fair        0.50    0.25    0.25
                 poor        0.20    0.40    0.40
Consider the state space {0, 1}n of binary strings having length n. Let
the proposal distribution be r ( x ′ | x ) = 1/n if x ′ differs from x in
exactly one bit and r ( x ′ | x ) = 0 otherwise. Suppose we desire a sta-
tionary distribution p for which p( x ) is proportional to the number of
ones that occur in the bit string x. For example, in the long run, a ran-
dom walk should visit a string having five 1s five times as often as it
visits a string having only a single 1. Provide a general formula for the acceptance probability α(x′ | x) that would be used to obtain the desired stationary distribution using the Metropolis-Hastings algorithm.
prove that it is an easy task if one uses Gibbs sampling. That is, show that the conditional distributions p(x | y) and p(y | x) are easy to sample from.

Hint: Take a look at the Beta distribution (1.50).

(Margin note: Gamma(x; α, β) ∝ x^{α−1} e^{−βx}, x ∈ R_{>0}. A random variable X ∼ Gamma(α, β) measures the waiting time until α > 0 events occur in a Poisson process with rate β > 0. In particular, when α = 1 then the gamma distribution coincides with the exponential distribution with rate β.)

2. Consider the following generative model p(µ, λ, x_{1:n}) given by the likelihood x_{1:n} | µ, λ ~iid~ N(µ, λ^{−1}) and the independent priors
Recall from Equation (5.12) that the energy function of Bayesian logis-
tic regression is
f(w) = λ ‖w‖₂² + ∑_{i=1}^{n} log(1 + exp(−y_i w^⊤ x_i)),   (6.53)
Show that
Let us assume that f is α-strongly convex for some α > 0, that is,
f(y) ≥ f(x) + ∇f(x)^⊤(y − x) + (α/2) ‖y − x‖₂²   ∀x, y ∈ R^n.   (6.56)
In words, f is lower bounded by a quadratic function with curvature α.
Moreover, assume w.l.o.g. that f is minimized at 0 with f(0) = 0.¹⁴

(Footnote 14: This can always be achieved by shifting the coordinate system and subtracting a constant from f.)

1. Show that f satisfies the Polyak-Łojasiewicz (PL) inequality, i.e.,
f(x) ≤ (1/(2α)) ‖∇f(x)‖₂²   ∀x ∈ R^n.   (6.57)
2. Prove that (d/dt) f(x_t) ≤ −2α f(x_t).
Thus, 0 is the fixed point of Equation (6.55) and the Lyapunov func-
tion f is monotonically decreasing along the trajectory of xt . We
recall Grönwall’s inequality which states that for any real-valued con-
tinuous functions g(t) and β(t) on the interval [0, T ] ⊂ R such that
(d/dt) g(t) ≤ β(t) g(t) for all t ∈ [0, T] we have

g(t) ≤ g(0) exp( ∫₀ᵗ β(s) ds )   ∀t ∈ [0, T].   (6.58)
the similarity of the integrand of KL(q_t‖p), namely q_t log(q_t/p), to Equation (6.60). We will therefore use the KL-divergence as Lyapunov function.
5. Prove that (d/dt) KL(q_t‖p) = −J(q_t‖p). Here,

J(q_t‖p) := E_{θ∼q_t}[ ‖∇ log( q_t(θ)/p(θ) )‖₂² ]   (6.61)
follows for any vector field F and scalar field φ from the divergence theo-
rem and the product rule of the divergence operator.
Thus, the relative Fisher information can be seen as the negated time-
derivative of the KL-divergence, and as J(qt ∥ p) ≥ 0 it follows that the
KL-divergence is decreasing along the trajectory.
KL(q‖p) ≤ (1/(2α)) J(q‖p).   (6.63)
It is a classical result that if f is α-strongly convex then p satisfies the
LSI with constant α (Bakry and Émery, 2006).
6. Show that if f is α-strongly convex for some α > 0 (we say that p
is “strongly log-concave”), then KL(qt ∥ p) ≤ e−2αt KL(q0 ∥ p).
7. Conclude that under the same assumption on f , Langevin dynam-
ics is rapidly mixing, i.e., τTV (ϵ) ∈ O(poly(n, log(1/ϵ))).
To summarize, we have seen that Langevin dynamics is an optimiza-
tion scheme in the space of distributions, and that its convergence can
be analyzed analogously to classical optimization schemes. Notably,
in this exercise we have studied continuous-time Langevin dynamics.
Convergence guarantees for discrete-time approximations can be de-
rived using the same techniques. If this interests you, refer to “Rapid
convergence of the unadjusted Langevin algorithm: Isoperimetry suf-
fices” (Vempala and Wibisono, 2019).
One widely used family of nonlinear functions is that of artificial "deep" neural networks,¹

f: R^d → R^k,   f(x; θ) := φ(W_L φ(W_{L−1}(··· φ(W_1 x))))   (7.1)

where θ := [W_1, ..., W_L] is a vector of weights (written as matrices W_l ∈ R^{n_l × n_{l−1}})² and φ: R → R is a component-wise nonlinear function. Thus, a deep neural network can be seen as nested ("deep") linear functions composed with nonlinearities. This simple kind of neural network is also called a multilayer perceptron.

(Footnote 1: In the following, we will refrain from using the characterizations "artificial" and "deep" for better readability.)
(Footnote 2: where n_0 = d and n_L = k.)
[Figure 7.1: Computation graph of a neural network with two hidden layers, mapping the inputs x_1, ..., x_d through hidden layers ν^(1), ν^(2) to the outputs f_1, ..., f_k; the edges are weighted by the entries of W_1, W_2, W_3.]
In the computation graph of Figure 7.1, the left-most column is the input layer, the right-most column is the output layer, and the remaining columns are the hidden layers. The inputs are (as we have previously) referred to as x := [x_1, ..., x_d]. The outputs (i.e., vertices of the output layer) are often referred to as logits and named f := [f_1, ..., f_k]. The activations of an individual (hidden) layer l of the neural network are described by

ν^(l) := φ(W_l ν^(l−1))   (7.2)

where ν^(0) := x. The activation of the i-th node is ν_i^(l) = ν^(l)(i).
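A minimal NumPy sketch of the forward pass (7.1)-(7.2); following the formula above, the activation is applied at every layer (in practice the output layer is often left linear), biases are omitted, and the layer sizes in the example are made up.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, phi=relu):
    # nested application phi(W_L phi(W_{L-1} ... phi(W_1 x))), cf. (7.1)-(7.2)
    nu = x
    for W in weights:
        nu = phi(W @ nu)
    return nu

rng = np.random.default_rng(0)
d, n1, n2, k = 4, 8, 8, 3
weights = [rng.standard_normal((n1, d)),
           rng.standard_normal((n2, n1)),
           rng.standard_normal((k, n2))]
f = mlp_forward(rng.standard_normal(d), weights)   # logits f in R^k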
ReLU(z) := max{z, 0} ∈ [0, ∞).   (7.4)

and it states that any artificial neural network with just a single hidden layer (with arbitrary width) and non-polynomial activation function φ can approximate any continuous function to an arbitrary accuracy.
7.1.2 Classification

Although we mainly focus on regression, neural networks can equally well be used for classification. If we want to classify inputs into c separate classes, we can simply construct a neural network with c outputs, f = [f_1, ..., f_c], and normalize them into a probability distribution. Often, the softmax function is used for normalization,

σ_i(f) := exp(f_i) / ∑_{j=1}^{c} exp(f_j)   (7.5)

[Figure 7.3: Softmax σ_1(f_1, f_2) for a binary classification problem. Blue denotes a small probability and yellow denotes a large probability of belonging to class 1, respectively.]
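A NumPy sketch of the softmax normalization (7.5), using the standard max-shift for numerical stability (which leaves the result unchanged since the shift cancels in the ratio).

import numpy as np

def softmax(f):
    # sigma_i(f) = exp(f_i) / sum_j exp(f_j), cf. (7.5)
    z = np.exp(f - np.max(f))   # subtract the maximum logit to avoid overflow
    return z / z.sum()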
Computing the exact gradient for each data point is still fairly expen-
sive when the size of the neural network is large. Typically, stochastic
gradient descent is used to obtain unbiased gradient estimates using
batches of only m of the n data points, where m ≪ n.
7.2 Bayesian Neural Networks

How can we perform probabilistic inference in neural networks? We adopt the same strategy which we already used for Bayesian linear regression.
log p(y_i | x_i, θ) = −(y_i − f(x_i; θ))² / (2σ_n²) + const.   (7.12)
θ ← θ(1 − λη_t) + η_t ∑_{i=1}^{n} ∇ log p(y_i | x_i, θ)   (7.14)

where λ := 1/σ_p². The gradients of the likelihood can be obtained using automatic differentiation.
= arg max_{q∈Q} E_{θ∼q}[log p(y_{1:n} | x_{1:n}, θ)] − KL(q‖p(·)).      using Equation (5.55c)
Using the Monte Carlo samples θ(i) , we can also estimate the mean of
our predictions,
E[y⋆ | x⋆, x_{1:n}, y_{1:n}] ≈ (1/m) ∑_{i=1}^{m} µ(x⋆; θ^{(i)}) =: µ̄(x⋆),   (7.19)
Recall from Equation (2.18) that the first term corresponds to the ale-
atoric uncertainty of the data and the second term corresponds to the
epistemic uncertainty of the model. We can approximate them using
the Monte Carlo samples θ(i) ,
Var[y⋆ | x⋆, x_{1:n}, y_{1:n}] ≈ (1/m) ∑_{i=1}^{m} σ²(x⋆; θ^{(i)}) + (1/(m−1)) ∑_{i=1}^{m} (µ(x⋆; θ^{(i)}) − µ̄(x⋆))²   (7.20)
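A small NumPy sketch of the Monte Carlo approximations (7.19)-(7.20), decomposing the predictive variance into an aleatoric and an epistemic part; the arrays of per-sample means and variances are assumed to be given by the network for sampled weights θ^(i).

import numpy as np

def predictive_moments(mus, sigmas2):
    # mus[i] = mu(x*; theta_i), sigmas2[i] = sigma^2(x*; theta_i)
    mean = mus.mean()                  # (7.19)
    aleatoric = sigmas2.mean()         # average aleatoric uncertainty
    epistemic = mus.var(ddof=1)        # sample variance of the means (epistemic uncertainty)
    return mean, aleatoric + epistemic, aleatoric, epistemic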
p(y⋆ | x⋆, x_{1:n}, y_{1:n}) ≈ (1/T) ∑_{t=1}^{T} p(y⋆ | x⋆, θ^{(t)}).      see (6.29)
schedule and use those for inference (e.g., by averaging the predictions
of the corresponding neural networks). This approach of sampling a
subset of some data is generally called subsampling.
Σ := (1/(T−1)) ∑_{i=1}^{T} (θ^{(i)} − µ)(θ^{(i)} − µ)^⊤,   (7.21c)      using a sample variance (A.16)
q(θ | λ) := ∏_{j=1}^{d} q_j(θ_j | λ_j)   (7.24)
Here, δ_α is the Dirac delta function with point mass at α.¹¹ The variational parameters λ correspond to the "original" weights of the network. In words, the variational posterior expresses that the j-th weight has value 0 with probability p and value λ_j with probability 1 − p.

(Footnote 11: see Appendix A.1.4.)

For fixed weights λ, sampling from the variational posterior q_λ corresponds to sampling a vector z with entries z(i) ∼ Bern(1 − p), yielding z ⊙ λ which is one of the 2^d possible subnetworks.¹²

(Footnote 12: A ⊙ B denotes the Hadamard (element-wise) product.)
q_j(θ_j | λ_j) := p N(θ_j; 0, 1) + (1 − p) N(θ_j; λ_j, 1).   (7.26)

In this case, it can be shown that KL(q_λ‖p(·)) ≈ (p/2) ‖λ‖₂² for sufficiently large d (Gal and Ghahramani, 2015, proposition 1). Thus,
where θ^{(i)} ~iid~ q_λ are independent samples. This coincides with our ear-
lier discussion of variational inference for Bayesian neural networks
in Equation (7.17). In words, we average the predictions of m neural
networks for each of which we randomly “drop out” weights.
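A sketch of Monte Carlo dropout prediction under the interpretation above: each of the m forward passes uses a random subnetwork z ⊙ λ in which every weight is dropped independently with probability p. The forward function is a stand-in for the network's forward pass and is not defined here.

import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, lam, forward, p=0.5, m=100):
    # lam: vector of "original" weights; forward(x, weights): hypothetical forward pass
    preds = []
    for _ in range(m):
        z = (rng.uniform(size=lam.shape) > p).astype(float)   # keep a weight with probability 1 - p
        preds.append(forward(x, z * lam))                     # prediction of one random subnetwork
    return np.mean(preds, axis=0)                             # average the m predictions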
as the dynamics that push the particles towards the target density p. It can be shown that for "almost any" reference density ϕ, this variational family Q_ϕ is expressive enough to closely approximate "almost arbitrary" distributions.¹⁵ A natural approach is therefore to learn the appropriate smooth map T between the reference density ϕ and the target density p.

(Footnote 15: For a more detailed discussion, refer to "Stein variational gradient descent: A general purpose Bayesian inference algorithm" (Liu and Wang, 2016).)
q_0 →(T⋆_0) q_1 →(T⋆_1) q_2 →(T⋆_2) ···   where   q_{t+1} := (T⋆_t)_♯ q_t.   (7.30)
We consider maps T := id + f where id(θ) := θ denotes the identity map and f represents a (small) perturbation. Recall that at time t we seek to minimize KL(T_♯ q_t ‖ p), so we choose the smooth map as

T⋆_t := id − η_t ∇_f KL(T_♯ q_t ‖ p) |_{f=0}   (7.31)
where ηt is a step size. In this way, the SVGD update (7.31) can be
interpreted as a step of “functional” gradient descent.
φ⋆_{q,p}(·) := E_{θ∼q}[ k(·, θ) ∇_θ log p(θ) + ∇_θ k(·, θ) ].   (7.32)
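A minimal NumPy sketch of one SVGD step with an RBF kernel, estimating φ⋆ in (7.32) with the empirical distribution of the particles; the kernel bandwidth and step size are made up. The first summand acts as a drift towards high density, the second as a repulsion between particles.

import numpy as np

def svgd_step(particles, grad_log_p, eta=0.1, h=1.0):
    n = len(particles)
    updated = []
    for i in range(n):
        phi = np.zeros_like(particles[i])
        for j in range(n):
            diff = particles[i] - particles[j]
            k = np.exp(-diff @ diff / (2 * h ** 2))     # RBF kernel k(theta_i, theta_j)
            phi += k * grad_log_p(particles[j])         # drift term, pulls particles towards high density
            phi += k * diff / h ** 2                    # repulsion term, grad of k w.r.t. its second argument
        updated.append(particles[i] + eta * phi / n)
    return updated

# example usage with a standard Gaussian target, grad log p(theta) = -theta
rng = np.random.default_rng(0)
particles = [rng.standard_normal(2) for _ in range(20)]
for _ in range(100):
    particles = svgd_step(particles, lambda th: -th)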
Note that the above decomposition of φ̂⋆q,p (θ) is once more an example
of the principle of curiosity and conformity which we have seen to be a
recurring theme in approaches to approximate inference. The repul-
sion term leads to exploration of the particles (i.e., “curiosity” about
alternative explanations), while the drift term leads to minimization of
the loss (i.e., “conformity” to the data).
7.4.1 Evidence
A popular method (which we already encountered multiple times) is to use the evidence of a validation set x^val_{1:m} of size m given the training set x^train_{1:n} of size n for estimating the model calibration. Here, the evidence can be understood as describing how well the validation set is described by the model trained on the training set. We obtain,
The resulting integrals are typically very small which leads to numer-
ical instabilities. Therefore, it is common to maximize a lower bound
to the evidence instead,
" #
m
= log Eθ∼qλ ∏ p(yval val
i | xi , θ) interpreting the integral as an
i =1 expectation over the variational
"
m
# posterior
≥ Eθ∼qλ ∑ log p(yval
i | xval
i , θ) (7.36) using Jensen’s inequality (5.29)
i =1
k m
1
≈
k ∑ ∑ log p(yval val ( j)
i | xi , θ ) (7.37) using Monte Carlo sampling
j =1 i =1
iid
for independent samples θ( j) ∼ qλ .
define B_m as the set of samples falling into bin m and let

freq(B_m) := (1/|B_m|) ∑_{i∈B_m} 1{Y_i = 1}   (7.38)

be the proportion of samples in bin m that belong to class 1 and let

conf(B_m) := (1/|B_m|) ∑_{i∈B_m} P(Y_i = 1 | x_i)   (7.39)

be the average confidence of the samples in bin m.

Thus, a model is well calibrated if freq(B_m) ≈ conf(B_m) for each bin m ∈ [M]. There are two common metrics of calibration that quantify
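A NumPy sketch computing freq(B_m) and conf(B_m) as in (7.38)-(7.39) from predicted probabilities and binary labels; the equal-width binning by confidence is one common (assumed) choice.

import numpy as np

def reliability_bins(probs, labels, M=10):
    # probs[i] = predicted P(Y_i = 1 | x_i), labels[i] in {0, 1}
    bins = np.minimum((probs * M).astype(int), M - 1)   # assign each sample to a bin by its confidence
    freq, conf = np.full(M, np.nan), np.full(M, np.nan)
    for m in range(M):
        mask = bins == m
        if mask.any():
            freq[m] = labels[mask].mean()   # proportion of positives in bin m, (7.38)
            conf[m] = probs[mask].mean()    # average confidence in bin m, (7.39)
    return freq, conf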
Intuitively, for a larger temperature T, the probability is distributed more evenly among the classes (without changing the ranking), yielding a more uncertain prediction. In contrast, for a lower temperature T, the probability is concentrated more towards the top choices, yielding a less uncertain prediction. As seen in Problem 6.7, temperature scaling can be motivated as tuning the mean of the softmax distribution.

[Figure 7.9: Illustration of temperature scaling for a classifier with three classes. On the top, we have a prediction with a high temperature, yielding a very uncertain prediction (in favor of class A). Below, we have a prediction with a low temperature, yielding a prediction that is strongly in favor of class A. Note that the ranking (A ≻ C ≻ B) is preserved.]

Optional Readings
• Guo, Pleiss, Sun, and Weinberger (2017).
On calibration of modern neural networks.
• Blundell, Cornebise, Kavukcuoglu, and Wierstra (2015).
Weight uncertainty in neural network.
• Kendall and Gal (2017).
What uncertainties do we need in Bayesian deep learning for computer vision?.
Discussion
Problems
Show that for a two-class classification problem (i.e., c = 2), the softmax function is equivalent to the logistic function (5.9) for the univariate model f := f_1 − f_0. That is, σ_1(f) = σ(f) and σ_0(f) = 1 − σ(f).
Sequential Decision-Making
Preface to Part II
In the first part of the manuscript, we have learned about how we can
build machines that are capable of updating their beliefs and reducing
their epistemic uncertainty through probabilistic inference. We have
also discussed ways of keeping track of the world through noisy sen-
sory information by filtering. An important aspect of intelligence is to
use this acquired knowledge for making decisions and taking actions
that have a positive impact on the world.
How to act, given that computational resources and time are limited?
H[X | Y] := E_{y∼p(y)}[ H[X | Y = y] ]   (8.2)
          = E_{(x,y)∼p(x,y)}[ −log p(x | y) ].   (8.3)
Definition 8.2 (Joint entropy). One can also define the joint entropy of
random vectors X and Y,
H[X, Y] := E_{(x,y)∼p(x,y)}[ −log p(x, y) ],   (8.4)
H[X, Y] = H[Y] + H[X | Y] (8.5) using the product rule (1.11) and the
definition of conditional entropy (8.2)
= H[ X ] + H[ Y | X ]. (8.6) using symmetry of joint entropy
That is, the joint entropy of X and Y is given by the uncertainty about X
and the additional uncertainty about Y given X. Moreover, this also
yields Bayes’ rule for entropy,
H[ X | Y ] ≤ H[ X ]. (8.8)
= (1/2) log( det(Σ + σ_n² I) / det(σ_n² I) )
= (1/2) log det( I + σ_n^{−2} Σ ).   (8.13)
Intuitively, the larger the noise σn2 in relation to the covariance
of X, the smaller the information gain.
I(X; Y | Z) := H[X | Z] − H[X | Y, Z]   (8.14)
             = H[X, Z] + H[Y, Z] − H[Z] − H[X, Y, Z]   (8.15)      using the relationship of joint and conditional entropy (8.5)
             = I(X; Y, Z) − I(X; Z).   (8.16)
Note that both yS and f S are random vectors. Our goal is then to find
a subset S ⊆ X of size n maximizing the information gain between
our model f and yS .
with I (S) = EyS [r (yS , S)] measuring the expected utility of ob-
servations yS . Such a utility or reward function is often called an
intrinsic reward since it does not measure an “objective” external
quantity, but instead a “subjective” quantity that is internal to the
model of f .
∆_F(x | A) := F(A ∪ {x}) − F(A).   (8.22)
Intuitively, the marginal gain describes how much “adding” the addi-
tional x to A increases the value of F.
∆ I ( x | A ) = I( f x ; y x | y A ) (8.23)
= H[ y x | y A ] − H[ ε x ]. (8.24)
That is, when maximizing mutual information, the marginal gain corresponds to the difference between the uncertainty after observing y_A and the entropy of the noise H[ε_x]. Altogether, the marginal gain represents the reduction in uncertainty by observing {x}.
∆ F ( x | A ) ≥ ∆ F ( x | B ). (8.26)
That is, “adding” x to the smaller set A yields more marginal gain
than adding x to the larger set B. In other words, the function F has
“diminishing returns”. In this way, submodularity can be interpreted
as a notion of “concavity” for discrete functions.
F ( A ) ≤ F ( B ). (8.27)
⇐⇒ H[ y x | y A ] ≥ H[ y x | yB ]. H[ε x ] cancels
⇐⇒ I( f B ; y A ) ≤ I( f B ; yB ) using I( f B ; y A ) = I( f A ; y A ) as
yA ⊥ f B | f A
⇐⇒ H[ f B ] − H[ f B | y A ] ≤ H[ f B ] − H[ f B | yB ] using the definition of MI (8.9)
⇐⇒ H[ f B | y A ] ≥ H[ f B | yB ], H[ f B ] cancels
δ_n = F(S⋆) − F(S_n) ≤ δ_0/e ≤ F(S⋆)/e.
Rearranging the terms yields the theorem.
Optional Readings
The original proof of greedy maximization for submodular functions was given by "An analysis of approximations for maximizing submodular set functions".
= arg max_{x∈X} (1/2) log( 1 + σ_t²(x)/σ_n² )   (8.31)
8.4.2 Heteroscedastic Noise

Uncertainty sampling is clearly problematic if the noise is heteroscedastic. If there is a particular set of inputs with a large aleatoric uncertainty dominating the epistemic uncertainty, uncertainty sampling will continuously choose those points even though the epistemic uncertainty will not be reduced substantially (cf. Figure 8.4).

[Figure 8.4: Uncertainty sampling with heteroscedastic noise. The epistemic uncertainty of the model is shown in dark gray. The aleatoric uncertainty of the data is shown in light gray. Uncertainty sampling would repeatedly pick points around x⋆ as they maximize the epistemic uncertainty, even though the aleatoric uncertainty at x⋆ is much larger than at the boundary.]

Looking at Equation (8.31) suggests a natural fix. Instead of only considering the epistemic uncertainty σ_t²(x), we can also consider the aleatoric uncertainty σ_n²(x).
8.4.3 Classification

While we focused on regression, one can apply active learning also for other settings, such as (probabilistic) classification. In this setting, for any input x, a model produces a categorical distribution over labels y_x.³ Here, uncertainty sampling corresponds to selecting samples that maximize the entropy of the predicted label y_x,

x_{t+1} := arg max_{x∈X} H[y_x | x_{1:t}, y_{1:t}].   (8.34)

(Footnote 3: see Section 1.3.)
The first term measures the entropy of the averaged prediction while
the second term measures the average entropy of predictions. Thus,
the first term looks for points where the average prediction is not con-
fident. In contrast, the second term penalizes points where many of
the sampled models are not confident about their prediction, and thus
looks for points where the models are confident in their predictions.
This identifies those points x where the models disagree about the la-
bel y x (that is, each model is “confident” but the models predict differ-
ent labels). For this reason, this approach is known as Bayesian active
learning by disagreement (BALD).
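A NumPy sketch of the BALD score computed from Monte Carlo predictions of sampled models: the entropy of the average prediction minus the average entropy of the predictions; the array layout is an assumption for illustration.

import numpy as np

def bald_scores(probs):
    # probs[s, i, c] = P(y_i = c | x_i, theta_s) for sampled models theta_s
    mean_probs = probs.mean(axis=0)                                                  # average prediction per input
    entropy_of_mean = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=-1)      # first term
    mean_entropy = -np.sum(probs * np.log(probs + 1e-12), axis=-1).mean(axis=0)      # second term
    return entropy_of_mean - mean_entropy   # large where the sampled models disagree about the label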
Note that the second term of the difference acts as a regularizer when
compared to Equation (8.34). The second term mirrors our description
of aleatoric uncertainty from Section 2.2. Recall that we interpreted ale-
atoric uncertainty as the average uncertainty for all models. Crucially,
here we use entropy to “measure” uncertainty, whereas previously we
have been using variance. Therefore, intuitively, Equation (8.36) sub-
tracts the aleatoric uncertainty from the total uncertainty about the
label.
Optional Readings
• Gal, Islam, and Ghahramani (2017).
Deep Bayesian active learning with image data.
172 probabilistic artificial intelligence
Optional Readings
• Hübotter, Sukhija, Treven, As, and Krause (2024).
Transductive active learning: Theory and applications.
In modern machine learning, one often differentiates between a
“pre-training” and a “fine-tuning” stage. During pre-training, a
model is trained on a large dataset to extract general knowledge
without a specific task in mind. Then, during fine-tuning, the
model is adapted to a specific task by training on a smaller dataset.
Whereas (inductive) active learning is closely linked to the pre-
training stage, transductive active learning has been shown to be
useful for task-specific fine-tuning:
• Hübotter, Bongni, Hakimi, and Krause (2025).
Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs.
• Bagatella, Hübotter, Martius, and Krause (2024).
Active Fine-Tuning of Generalist Policies.
Discussion
criteria exist:
Next, we will move to the topic of optimization and ask which data we
should select to find the optimum of an unknown function as quickly
as possible. In the following chapter, we will focus on “Bayesian opti-
mization” (also called “bandit optimization”) where our aim is to find
and sample the optimal point. A related task that is more closely connected to active learning is the "best-arm identification" problem, where we aim only to identify the optimal point without sampling it. This problem is closely related to transductive active learning (with the local task being defined by the location of the maximum), and so-called entropy search methods that minimize the entropy of the posterior distribution over the location or value of the maximum (akin to Equation (8.38)) are often used to solve this problem (Hennig and Schuler, 2012; Wang and Jegelka, 2017; Hvarfner et al., 2022).
Problems
I(X; Y) = E_{(x,y)∼p}[ log( p(x, y) / (p(x) p(y)) ) ]
        = KL( p(x, y) ‖ p(x) p(y) )   (8.40)
        = E_{y∼p}[ KL( p(x | y) ‖ p(x) ) ],   (8.41)
where p( x, y) denotes the joint distribution and p( x), p(y) denote the
marginal distributions of X and Y.
∆ I ( x | A ) = I( f x ; y x | y A ) = H[ y x | y A ] − H[ ε x ].
I( f x ; y x ; yB\ A | y A ) ≥ 0. (8.43)
Z := ∑_{i=1}^{100} i · X_i.
yt = f ⋆ ( xt ) + ε t . (9.2)
9.2.2 Regret
The key performance metric in online learning is the regret.
The regret can be interpreted as the additive loss with respect to the
static optimum maxx f ⋆ ( x).
early with time. Thus, achieving sublinear regret requires balancing exploration and exploitation.

Typically, online learning (and Bayesian optimization) consider stationary environments, hence the comparison to the static optimum. Dynamic environments are studied in online algorithms (see metrical task systems⁵, convex function chasing⁶, and generalizations of multi-armed bandits to changing reward distributions) and reinforcement learning.

(Footnote 5: That is, we want to trade completing our tasks optimally with moving around in the state space. Crucially, we do not know the sequence of tasks f_t in advance. Due to the cost associated with moving in the decision space, previous choices affect the future!)

(Footnote 6: Convex function chasing (or convex body chasing) generalizes metrical task systems to continuous domains X. To make any guarantees about the performance in these settings, one typically has to assume that the tasks f_t are convex. Note that this mirrors our assumption in Bayesian optimization that similar alternatives yield similar results.)
value) is lower than the best lower bound. This is visualized in Figure 9.3.

[Figure 9.3: Optimism in Bayesian optimization. The unknown function is shown in black, our model in blue with gray confidence bounds. The dotted black line denotes the maximum lower bound. We can therefore focus our exploration to the yellow regions where the upper confidence bound is higher than the maximum lower bound.]

Therefore, we only really care how the function looks in the regions where the upper confidence bound is larger than the best lower bound. The key idea behind the methods that we will explore is to focus exploration on these plausible maximizers.

Note that it is crucial that our uncertainty about f reflects the "fit" of our model to the unknown function. If the model is not well calibrated or does not describe the underlying function at all, these methods will
point where we can hope for the optimal outcome. In this setting, this corresponds to simply maximizing the upper confidence bound (UCB),

x_{t+1} := arg max_{x∈X} µ_t(x) + β_{t+1} σ_t(x),   (9.6)

where σ_t(x) := √(k_t(x, x)) is the standard deviation at x and β_t regulates how confident we are about our model f (i.e., the choice of confidence interval).

[Figure 9.5: Re-scaling the confidence bounds. The dotted gray lines represent updated confidence bounds.]

Bounds on β_t(δ) can be derived both in a "Bayesian" and in a "frequentist" setting. In the Bayesian setting, it is assumed that f⋆ is drawn from the prior GP, i.e., f⋆ ∼ GP(µ_0, k_0). However, in many cases this may be an unrealistic assumption. In the frequentist setting, it is assumed instead that f⋆ is a fixed element of a reproducing kernel Hilbert space H_k(X) which, depending on the kernel k, can encompass a large class of functions. We will discuss the Bayesian setting first and later return to the frequentist setting.
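On a finite candidate set, the UCB rule (9.6) is a one-liner given the posterior mean and standard deviation; the following NumPy sketch is illustrative and assumes these have already been computed from a GP.

import numpy as np

def ucb_acquisition(mu, sigma, beta):
    # mu[i], sigma[i]: posterior mean and standard deviation at candidate x_i; cf. (9.6)
    return int(np.argmax(mu + beta * sigma))

# example usage with made-up posterior statistics over 5 candidates
next_index = ucb_acquisition(np.array([0.1, 0.4, 0.3, 0.0, 0.2]),
                             np.array([0.5, 0.1, 0.3, 0.6, 0.2]), beta=2.0)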
where

γ_T := max_{S⊆X, |S|=T} I(f_S; y_S) = max_{S⊆X, |S|=T} (1/2) log det( I + σ_n^{−2} K_{SS} )   (9.10)
• linear kernel
  γ_T = O(d log T),   (9.11)
• Gaussian kernel
  γ_T = O((log T)^{d+1}),   (9.12)
• Matérn kernel for ν > 1/2
  γ_T = O( T^{d/(2ν+d)} (log T)^{2ν/(2ν+d)} ).   (9.13)

[Figure 9.6: Information gain of independent, linear, Gaussian, and Matérn (ν ≈ 0.5) kernels with d = 2 (up to constant factors). The kernels with sublinear information gain have strong diminishing returns (due to their strong dependence between "close" points). In contrast, the independent kernel has no dependence between points in the domain, and therefore no diminishing returns. Intuitively, the "smoother" the class of functions modeled by the kernel, the stronger are the diminishing returns.]

The information gain of common kernels is illustrated in Figure 9.6. Notably, when all points in the domain are independent, the information gain is linear in T. This is because when the function f⋆ may be arbitrarily "rough", we cannot generalize from a single observation to "neighboring" points, and as there are infinitely many points in the domain X there are no diminishing returns. As one would expect, in this case, Theorem 9.5 does not yield sublinear regret. However, we can see from Theorem 9.6 that the information gain is sublinear for lin-
ear, Gaussian, and most Matérn kernels. Moreover, observe that unless
the function is linear, the information gain grows exponentially with
the dimension d. This is because the number of “neighboring” points
(with respect to Euclidean geometry) decreases exponentially with the
dimension which is also known as the curse of dimensionality.
We remark that Theorem 9.7 holds also under the looser assumption
that observations are perturbed by σn -sub-Gaussian noise (cf. Equa-
tion (A.39)) instead of Gaussian noise. The bound on γT from Equa-
tion (9.13) for the Matérn kernel does not yield sublinear regret when
combined with the standard regret bound from Theorem 9.5, however,
Whitehouse et al. (2024) show that the regret of GP-UCB is sublinear
also in this case provided σn2 is chosen carefully.
9.3.2 Improvement
Another well-known family of methods is based on keeping track of
a running optimum fˆt , and scoring points according to their improve-
ment upon the running optimum. The improvement of x after round t
is measured by
I_t(x) := (f(x) − f̂_t)^+   (9.15)
The probability of improvement (PI) picks the point that maximizes the
probability to improve upon the running optimum,
x_{t+1} := arg max_{x∈X} P(I_t(x) > 0 | x_{1:t}, y_{1:t})   (9.16)
         = arg max_{x∈X} P(f(x) > f̂_t | x_{1:t}, y_{1:t})   (9.17)
         = arg max_{x∈X} Φ( (µ_t(x) − f̂_t) / σ_t(x) )   (9.18)      using linear transformations of Gaussians (1.78)
Probability of improvement looks at how likely a point is to improve upon the running optimum. An alternative is to look at how much a point is expected to improve upon the running optimum. This acquisition function is called the expected improvement (EI),

x_{t+1} := arg max_{x∈X} E[I_t(x) | x_{1:t}, y_{1:t}].   (9.19)

Intuitively, EI seeks a large expected improvement (exploitation) while also preferring states with a large variance (exploration). Expected improvement yields the same regret bound as UCB (Nguyen et al., 2017).

[Figure 9.7: Plot of the PI and EI acquisition functions, respectively.]

The expected improvement acquisition function is often flat which makes it difficult to optimize in practice due to vanishing gradients. One approach addressing this is to instead optimize the logarithm of EI (Ament et al., 2024).
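For a Gaussian posterior, expected improvement has the standard closed form (µ_t(x) − f̂_t)Φ(z) + σ_t(x)ϕ(z) with z = (µ_t(x) − f̂_t)/σ_t(x), which follows from the integral derived in the problems of this chapter. A short SciPy sketch (illustrative only):

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # closed-form EI for a Gaussian posterior with mean mu and standard deviation sigma
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)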
where f˜t+1 ∼ p(· | x1:t , y1:t ) is a sample from our posterior distri-
bution. Observe that this approximation of π coincides with a point
density at the maximizer of f˜t+1 .
Ψ_t(x) := ∆(x)² / I_t(x),   (9.25)
which was originally introduced by Russo and Van Roy (2016). Here
exploitation reduces regret while exploration increases information
gain, and hence, points x that minimize the information ratio are those
that most effectively balance exploration and exploitation. We can
make the key observation that the regret ∆(·) decreases when It (·) de-
creases, as a small It (·) implies that the algorithm has already learned
a lot about the function f ⋆ . The strength of this relationship is quanti-
fied by the information ratio:
Theorem 9.9 (Proposition 1 of Russo and Van Roy (2014) and Theorem 8 of Kirschner and Krause (2018)). For any iteration T ≥ 1, let ∑_{t=1}^{T} I_{t−1}(x_t) ≤ γ_T and suppose that Ψ_{t−1}(x_t) ≤ Ψ̄_T for all t ∈ [T]. Then, the cumulative regret is bounded by

R_T ≤ √(γ_T Ψ̄_T T).   (9.26)
R_T = ∑_{t=1}^{T} r_t
    = ∑_{t=1}^{T} √( Ψ_{t−1}(x_t) · I_{t−1}(x_t) )
    ≤ √( ∑_{t=1}^{T} Ψ_{t−1}(x_t) · ∑_{t=1}^{T} I_{t−1}(x_t) )      using the Cauchy-Schwarz inequality
    ≤ √( γ_T Ψ̄_T T ).      using the assumptions on I_t(·) and Ψ_t(·)
Regret bounds such as Theorem 9.11 can be derived also for different measures of information gain. For example, the argument of Problem 9.6 also goes through for the "greedy" measure

I_t(x) := I(f_{x_t^{UCB}}; y_x | x_{1:t}, y_{1:t})   (9.31)

than globally. We compare the two measures of information gain in Figure 9.9. Observe that the acquisition function depends critically on the choice of I_t(·) and is less sensitive to the scaling of confidence intervals.

studied in Chapter 8) that is purely explorative. In this way, IDS can
tive algorithms such as UCB or EI (Russo and Van Roy, 2014): Depending on the measure of information gain, IDS can select points to obtain indirect information about other points or cumulating information that does not immediately lead to a higher reward but only when combined with subsequent observations. Moreover, IDS avoids selecting points which yield irrelevant information.

[Figure 9.9: Plot of the surrogate information ratio Ψ̂; IDS selects its minimizer. The first two plots use the "global" information gain measure from Example 9.10 with β = 0.25 and β = 0.5, respectively. The third plot uses the "greedy" information gain measure from Equation (9.31) and β = 1.]
with Φ denoting the CDF of the standard normal distribution and κ∗ chosen such that the approximation of π integrates to 1, so that it is a valid distribution.

Remarkably, LITE is intimately related to many of the BO methods we
with the "quasi-surprise" S′(u) := (1/2) (ϕ(Φ^{−1}(u))/u)². The quasi-surprise
S′ (·) behaves similarly to the surprise − ln(·). In fact, their asymptotics
coincide:
Menet et al. (2025) show that LITE (9.34) is the solution to the variational problem
arg max_{π∈∆^X} W(π) = Φ( (µ_t(·) − κ∗) / σ_t(·) )   (9.36)
with κ∗ such that the right-hand side sums to 1.⁹

(Footnote 9: ∆^X is the probability simplex on X.)

This indicates that
LITE and Thompson sampling, which samples from probability of
maximality, achieve exploration through two means:
1. Optimism: by preferring points with large uncertainty σt ( x) about
the reward value f ( x).
2. Decision uncertainty: by assigning some probability mass to all x,
that is, by remaining uncertain about which x is the maximizer.
In our discussion of balancing exploration and exploitation in rein-
forcement learning, we will return to this dichotomy of exploration
strategies.
We will explore this question with the following example: You are
observing a betting game. When placing a bet, players can either
place a safe bet (“playing safe”) or a risky bet (“playing risky”).
You now have to place a bet on their bets, and estimate which of
the players will win the most money: those that play safe or those
that play risky? Note that you do not care whether your guess
ends up in 2nd place or last place among all players — you only
care about whether your guess wins. That is, you are interested
in recalling the best player, not in “ranking” the players.
Consider three players: one that plays safe and two that play risky.
Suppose that the safe bet has payoff S = 1 while each risky bet
has payoff R ∼ N (0, 100). In expectation, the safe player will win
the most money. However, one can see with just a little bit of al-
gebra that the probability of either of the risky players winning
the most money is ≈ 35%, whereas the safe player only wins with
probability ≈ 29% (see Problem 9.8). That is, it is in fact optimal to bet on either
of the risky players since the best player might have vastly out-
performed their expected winnings, and performed closer to their
upper confidence bound. In summary, maximizing recall requires
us to be “exploratory” since it is likely that the optimum among
inputs is one that has performed better than expected, not simply
the one with the highest expected performance.
Discussion
Optional Readings
• Srinivas, Krause, Kakade, and Seeger (2010).
Gaussian process optimization in the bandit setting: No regret and
experimental design.
• Golovin, Solnik, Moitra, Kochanski, Karro, and Sculley (2017).
Google vizier: A service for black-box optimization.
• Romero, Krause, and Arnold (2013).
Navigating the protein fitness landscape with Gaussian processes.
• Chowdhury and Gopalan (2017).
On kernelized multi-armed bandits.
Problems
3. Combine (1) and (2) to show Theorem 9.5. We assume w.l.o.g. that
the sequence { β t }t is monotonically increasing.
Hint: If s ∈ [0, M] for some M > 0 then s ≤ C · log(1 + s) with C := M/log(1 + M).
k( x, x′ ) = x⊤ x′ .
1. Show that

EI_t(x) = ∫_{(f̂_t − µ_t(x))/σ_t(x)}^{+∞} (µ_t(x) + σ_t(x)ε − f̂_t) · ϕ(ε) dε   (9.41)
Hint: First show that W (·) is concave (i.e., minimizing −W (·) is a convex
optimization problem) and then use Lagrange multipliers to find the optimum.
Consider the betting game from Remark 9.13 with two risky players,
R1 ∼ N (0, 100) and R2 ∼ N (0, 100), and one safe player S = 1. Prove
that the individual probability of any of the risky players winning the
most money is larger than the winning probability of the safe player.
We assume that players payoffs are mutually independent.
Hint: Compute the CDF of R := max{R_1, R_2}.
10
Markov Decision Processes
not only depends on the previous state X_t but also on the last action A_t of this agent.

[Figure 10.1: Directed graphical model of a Markov decision process with hidden states X_t and actions A_t.]

Definition 10.1 ((Finite) Markov decision process, MDP). A (finite) Markov decision process is specified by
• a (finite) set of states X := {1, ..., n},
• a (finite) set of actions A := {1, ..., m},
• transition probabilities

  p(x′ | x, a) := P(X_{t+1} = x′ | X_t = x, A_t = a),   (10.1)
The reward function may also depend on the next state x ′ , however,
we stick to the above model for simplicity. Also, the reward function
can be random with mean r. Observe that r induces the sequence of
rewards ( Rt )t∈N0 , where
R_t := r(X_t, A_t),   (10.2)
We assume that policies are stationary, that is, do not change over time.
Observe that a policy induces a Markov chain ( Xtπ )t∈N0 with transition
probabilities,
p^π(x′ | x) := P(X^π_{t+1} = x′ | X^π_t = x) = ∑_{a∈A} π(a | x) p(x′ | x, a).   (10.4)
The methods that we will discuss can also be analyzed using these
or other alternative reward models.
We now want to understand the effect of the starting state and initial
action on our optimization objective Gt . To analyze this, it is common
to use the following two functions:
v^π(x) = E_π[ ∑_{m=0}^{∞} γ^m R_m | X_0 = x ]      using the definition of the discounted payoff (10.5)
  = E_π[ γ^0 R_0 | X_0 = x ] + γ E_π[ ∑_{m=0}^{∞} γ^m R_{m+1} | X_0 = x ]      using linearity of expectation (1.20)
  = r(x, π(x)) + γ E_{x′}[ E_π[ ∑_{m=0}^{∞} γ^m R_{m+1} | X_1 = x′ ] | X_0 = x ]      by simplifying the first expectation and conditioning the second expectation on X_1
  = r(x, π(x)) + γ ∑_{x′∈X} p(x′ | x, π(x)) E_π[ ∑_{m=0}^{∞} γ^m R_{m+1} | X_1 = x′ ]      expanding the expectation on X_1 and using conditional independence of the discounted payoff of X_0 given X_1
  = r(x, π(x)) + γ ∑_{x′∈X} p(x′ | x, π(x)) E_π[ ∑_{m=0}^{∞} γ^m R_m | X_0 = x′ ]      shifting the start time of the discounted payoff using stationarity
  = r(x, π(x)) + γ ∑_{x′∈X} p(x′ | x, π(x)) E_π[ G_0 | X_0 = x′ ]      using the definition of the discounted payoff (10.5)
  = r(x, π(x)) + γ ∑_{x′∈X} p(x′ | x, π(x)) · v^π(x′).   (10.10)      using the definition of the value function (10.7)
For stochastic policies, by also conditioning on the first action, one can
obtain an analogous equation for the state-action value function,
q^π(x, a) = r(x, a) + γ ∑_{x′∈X} p(x′ | x, a) ∑_{a′∈A} π(a′ | x′) q^π(x′, a′)   (10.15)
          = r(x, a) + γ E_{x′|x,a}[ E_{a′∼π(x′)}[ q^π(x′, a′) ] ].   (10.16)
Note that it does not make sense to consider a similar recursive for-
mula for the state-action value function in the setting of deterministic
policies as the action played when in state x ∈ X is uniquely deter-
mined as π ( x ). In particular,
vπ ( x ) = qπ ( x, π ( x )). (10.17)
v^π := [v^π(1), ..., v^π(n)]^⊤,   r^π := [r(1, π(1)), ..., r(n, π(n))]^⊤,   and

P^π := the matrix with entries (P^π)_{x,x′} := p(x′ | x, π(x)), i.e.,

P^π = ⎡ p(1 | 1, π(1))  ···  p(n | 1, π(1)) ⎤
      ⎢        ⋮          ⋱         ⋮       ⎥   (10.18)
      ⎣ p(1 | n, π(n))  ···  p(n | n, π(n)) ⎦

and a little bit of linear algebra, the Bellman expectation equation (10.10) is equivalent to

v^π = r^π + γ P^π v^π   (10.19)
⇐⇒ (I − γP^π) v^π = r^π
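Equation (10.19) suggests computing v^π exactly by solving the linear system; a minimal NumPy sketch (assuming P^π and r^π have been assembled as above):

import numpy as np

def policy_evaluation(P_pi, r_pi, gamma):
    # solve (I - gamma P^pi) v^pi = r^pi, cf. (10.19)
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)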
Using this fact (which we will prove in just a moment), we can use
fixed-point iteration of Bπ .
∥ f ( x) − f (y)∥ ≤ k · ∥ x − y∥ (10.22)
‖B^π v − B^π v′‖_∞ = ‖r^π + γP^π v − r^π − γP^π v′‖_∞      using the definition of B^π (10.21)
                   = γ ‖P^π(v − v′)‖_∞
                   ≤ γ max_{x∈X} ∑_{x′∈X} p(x′ | x, π(x)) · |v(x′) − v′(x′)|      using the definition of the L∞ norm (10.23), expanding the multiplication, and using |∑_i a_i| ≤ ∑_i |a_i|
                   ≤ γ ‖v − v′‖_∞.   (10.24)      using ∑_{x′∈X} p(x′ | x, π(x)) = 1 and |v(x′) − v′(x′)| ≤ ‖v − v′‖_∞
Thus, by Equation (10.22), Bπ is a contraction and by Banach’s fixed-
point theorem vπ is its unique fixed-point.
Let vtπ be the value function estimate after t iterations. Then, we have
for the convergence of fixed-point iteration,
‖v^π_t − v^π‖_∞ = ‖B^π v^π_{t−1} − B^π v^π‖_∞      using the update rule of fixed-point iteration and B^π v^π = v^π
                ≤ γ ‖v^π_{t−1} − v^π‖_∞      using (10.24)
It follows that all optimal policies have identical value functions. Subsequently, we use v⋆ := v^{π⋆} and q⋆ := q^{π⋆} to denote the state value function and state-action value function arising from an optimal policy, respectively. As an optimal policy maximizes the value of each state, we have that
Simply optimizing over each policy is not a good idea as there are m^n deterministic policies in total. It turns out that we can do much better.
the states our agent can reach in a single step? If we knew the value of
each state our agent can reach, then we can simply pick the action that
maximizes the expected value. We will make this approach precise in
the next section.
π_q(x) := arg max_{a∈A} q(x, a).   (10.29)

π_v(x) := arg max_{a∈A} r(x, a) + γ ∑_{x′∈X} p(x′ | x, a) · v(x′).   (10.30)
This theorem confirms our intuition from the previous section that
greedily following an optimal value function is itself optimal. In par-
ticular, Bellman’s theorem shows that there always exists an optimal
policy which is deterministic and stationary.
These equations are also called the Bellman optimality equations. Intuitively, the Bellman optimality equations express that the value of a state under an optimal policy must equal the expected return for the best action from that state. Bellman's theorem is also known as Bellman's optimality principle, which is a more general concept.

(Margin note: Bellman's optimality principle. Bellman's optimality equations for MDPs are one of the main settings of Bellman's optimality principle. However, Bellman's optimality principle has many other important applications, for example in dynamic programming. Broadly speaking, Bellman's optimality principle says that optimal solutions to decision problems can be decomposed into optimal solutions to sub-problems.)

The two perspectives of Bellman's theorem naturally suggest two separate ways of finding the optimal policy. Policy iteration uses the perspective from Equation (10.31) of π⋆ as a fixed-point of the dependency between greedy policy and value function. In contrast, value iteration uses the perspective from Equation (10.32) of v⋆ as the fixed-point of the Bellman update. Another approach which we will not discuss here is to use a linear program where the Bellman update is interpreted as a set of linear inequalities.
Let πt be the policy after t iterations. We will now show that policy
iteration converges to the optimal policy. The proof is split into two
parts. First, we show that policy iteration improves policies mono-
tonically. Then, we will use this fact to show that policy iteration
converges.
= max_{a∈A} q^{π_t}(x, a)      using the definition of the Bellman update (10.36)
≥ q^{π_t}(x, π_t(x))
= v^{π_t}(x)
= ṽ_0(x).      using (10.17)
ṽ_{τ+1}(x) = r(x, π_{t+1}(x)) + γ ∑_{x′∈X} p(x′ | x, π_{t+1}(x)) · ṽ_τ(x′).      using the definition of ṽ_{τ+1} (10.21)
For the second claim, recall from Bellman's theorem (10.32) that v⋆ is a (unique) fixed-point of the Bellman update B⋆.⁵ In particular, we have v^{π_{t+1}} ≡ v^{π_t} if and only if v^{π_{t+1}} ≡ v^{π_t} ≡ v⋆. In other words, if v^{π_t} ̸≡ v⋆ then Equation (10.37) is strict for at least one x ∈ X and v^{π_{t+1}} ̸≡ v^{π_t}. This proves the strict monotonic improvement of policy iteration.

(Footnote 5: We will show in Equation (10.38) that B⋆ is a contraction, implying that v⋆ is the unique fixed-point of B⋆.)
where q was the state-action value function associated with the state
value function v. The value iteration algorithm is shown in Algo-
rithm 10.17.
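A compact NumPy sketch of value iteration with a tolerance-based stopping rule; the array layout of the transition tensor and reward matrix is an assumption for illustration.

import numpy as np

def value_iteration(p, r, gamma, tol=1e-8):
    # p[x, a, y] = transition probability to y from x under action a; r[x, a] = reward
    n_states, n_actions = r.shape
    v = np.zeros(n_states)
    while True:
        q = r + gamma * np.einsum('xay,y->xa', p, v)   # q(x, a) = r(x, a) + gamma sum_y p(y|x,a) v(y)
        v_new = q.max(axis=1)                          # Bellman update, cf. (10.32)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)             # value estimate and a greedy policy derived from it
        v = v_new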
We will now prove the convergence of value iteration using the fixed-
point interpretation.
‖B⋆ v − B⋆ v′‖_∞ = max_{x∈X} |(B⋆ v)(x) − (B⋆ v′)(x)|      using the definition of the L∞ norm (10.23)
Definition 10.22 (Hidden Markov model, HMM). A hidden Markov model is specified by
• a set of states X,
• transition probabilities p(x′ | x) := P(X_{t+1} = x′ | X_t = x) (also called motion model), and
• a sensor model o(y | x) := P(Y_t = y | X_t = x).

[Figure 10.5: Directed graphical model of a hidden Markov model with hidden states X_t and observables Y_t.]
POMDPs are a very powerful model, but very hard to solve in gen-
eral. POMDPs can be reduced to a Markov decision process with an
enlarged state space. The key insight is to consider an MDP whose
states are the beliefs,
$$b_t(x) \doteq P(X_t = x \mid y_{1:t}, a_{1:t-1}), \tag{10.42}$$
about the current state in the POMDP. In other words, the states of the
MDP are probability distributions over the states of the POMDP. We
will make this more precise in the following.
Let us assume that our prior belief about the state of our agent is given by $b_0(x) \doteq P(X_0 = x)$. Keeping track of how beliefs change over time is known as filtering, which we already encountered in Section 3.1.
Given a prior belief bt , an action taken at , and a new observation yt+1 ,
the belief state can be updated as follows,
$$\begin{aligned}
b_{t+1}(x) &= P(X_{t+1} = x \mid y_{1:t+1}, a_{1:t}) && \text{by the definition of beliefs (10.42)} \\
&= \frac{1}{Z} P(y_{t+1} \mid X_{t+1} = x)\, P(X_{t+1} = x \mid y_{1:t}, a_{1:t}) && \text{using Bayes' rule (1.45)} \\
&= \frac{1}{Z} o(y_{t+1} \mid x)\, P(X_{t+1} = x \mid y_{1:t}, a_{1:t}) && \text{using the definition of observation probabilities (10.39)} \\
&= \frac{1}{Z} o(y_{t+1} \mid x) \sum_{x' \in X} p(x \mid x', a_t)\, P(X_t = x' \mid y_{1:t}, a_{1:t-1}) && \text{by conditioning on the previous state } x' \text{, noting } a_t \text{ does not influence } X_t \\
&= \frac{1}{Z} o(y_{t+1} \mid x) \sum_{x' \in X} p(x \mid x', a_t)\, b_t(x') && \text{using the definition of beliefs (10.42)} \tag{10.43}
\end{aligned}$$
where
$$Z \doteq \sum_{x \in X} o(y_{t+1} \mid x) \sum_{x' \in X} p(x \mid x', a_t)\, b_t(x'). \tag{10.44}$$
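The belief update (10.43) with the normalizer (10.44) fits into a few lines of code. The array layout below (`p[x_prev, a, x_next]` for the motion model and `o[x, y]` for the sensor model) is an assumption made for illustration.

```python
import numpy as np

def update_belief(b, a, y, p, o):
    """POMDP belief update, cf. (10.43) and (10.44).

    b: current belief over states, shape (n_states,)
    p: transition model, p[x_prev, a, x_next]
    o: observation model, o[x, y]
    """
    predicted = p[:, a, :].T @ b                 # sum_x' p(x | x', a) b(x') for every x
    unnormalized = o[:, y] * predicted           # multiply by the observation likelihood
    return unnormalized / unnormalized.sum()     # divide by the normalizer Z
```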
where the (state-)space of all beliefs is the (infinite) space of all probability distributions over X,^8
$$\mathcal{B} \doteq \Delta^X \doteq \left\{ b \in \mathbb{R}^{|X|} : b \ge 0, \; \sum_{i=1}^{|X|} b(i) = 1 \right\}. \tag{10.46}$$
Footnote 8: This definition naturally extends to continuous state spaces X.
• and rewards
$$\rho(b, a) \doteq \mathbb{E}_{x \sim b}[r(x, a)] = \sum_{x \in X} b(x)\, r(x, a). \tag{10.48}$$
$$\tau(b_{t+1} \mid b_t, a_t) = P(b_{t+1} \mid b_t, a_t) = \sum_{y_{t+1} \in Y} P(b_{t+1} \mid b_t, a_t, y_{t+1})\, P(y_{t+1} \mid b_t, a_t). \tag{10.49}$$
(by conditioning on $y_{t+1} \in Y$)
Discussion
Problems
Recall the example of “becoming rich and famous” from Figure 10.2.
Consider the policy, π ≡ S (i.e., to always save) and let γ = 1/2. Show
that the (rounded) state-action value function qπ is as follows:
                  save    advertise
poor, unknown      0        0.1
poor, famous       4.4      1.2
rich, famous      17.8      1.2
rich, unknown     13.3      0.1
Show that if q and v arise from the same policy, that is, q is defined in
terms of v as per Equation (10.9), then
πv ≡ πq . (10.52)
Again, recall the example of “becoming rich and famous” from Fig-
ure 10.2.
1. Show that the policy π ≡ S, which we considered in Problem 10.1,
is not optimal.
2. Instead, consider the policy
$$\pi' \equiv \begin{cases} A & \text{if poor and unknown} \\ S & \text{otherwise} \end{cases}$$
and let γ = 1/2. Show that the (rounded) state-action value function $q^{\pi'}$ is as follows:
                  save    advertise
poor, unknown      0.8      1.6
poor, famous       4.5      1.2
rich, famous      17.8      1.2
rich, unknown     13.4      0.2

Shown in bold is the state value function $v^{\pi'}$.
3. Is the policy π ′ optimal?
$$\| v^{\pi_t} - v^\star \|_\infty \le \gamma^t \| v^{\pi_0} - v^\star \|_\infty \tag{10.53}$$
Show that the optimal policy remains unchanged under this definition of f.
• Pull up the rod (P): If there is a fish on the line (F), there is a 90% chance of catching it (reward +10, transitioning to F̄) and a 10% chance of it escaping (reward −1, transitioning to F̄). If there is no fish (F̄), pulling up the rod results in no catch, staying in F̄ with a reward of −5.
• Waiting (W): All waiting actions result in a reward of −1. In state F, there is a 60% chance of the fish staying (remaining in F) and a 40% chance of it escaping (transitioning to F̄). In state F̄, there is a 50% chance of a fish biting (transitioning to F) and a 50% chance of no change (remaining in F̄).
Suggestion: Draw the MDP transition diagram. Draw each transition with
action, associated probability, and associated reward.
Since the angler cannot directly observe whether there is a fish on the
line, they receive a noisy observation about the state. This observation
can be:
• o1 : The signal suggests that a fish might be on the line.
• o2 : The signal suggests that there is no fish on the line.
The observation model, which defines the probability of receiving each
observation given the true state is as follows:
      P(o1 | ·)   P(o2 | ·)
F        0.8         0.2
F̄        0.3         0.7
The angler’s goal is to choose actions that maximize their overall re-
ward, balancing the chances of catching a fish against the cost of wait-
ing and unsuccessful pulls.
1. Given an initial belief b0(F) = b0(F̄) = 0.5, the angler chooses to wait and observes o1. Compute the updated belief b1 using the observation model and belief update equation (10.42).
2. Given belief b1(F) ≈ 0.765 and b1(F̄) ≈ 0.235, compute the updated belief b2 for both actions P (pull) and W (wait), both in the case where you observe o1 (fish likely) and o2 (fish unlikely).
11
Tabular Reinforcement Learning
Clearly, the agent needs to trade off exploring and learning about the environment against exploiting its knowledge to maximize rewards. Thus,
the exploration-exploitation dilemma, which was at the core of Bayesian
optimization (see Section 9.1), also plays a crucial role in reinforcement
learning. In fact, Bayesian optimization can be viewed as reinforce-
ment learning with a fixed state: In each round, the agent plays an
action, aiming to find the action that maximizes the reward. How-
ever, playing the same action multiple times yields the same reward,
implying that we remain in a single state. In the context of Bayesian
optimization, we used “regret” as performance metric: in the jargon
of planning, minimizing regret corresponds to maximizing the cumu-
lative reward.
11.1.1 Trajectories
The data that the agent collects is modeled using so-called trajectories. A trajectory is a sequence of transitions,
$$\tau_i \doteq (x_i, a_i, r_i, x_{i+1}), \tag{11.2}$$
Crucially, the newly observed states x_{t+1} and the rewards r_t (across multiple transitions) are conditionally independent given the previous states x_t and actions a_t. This follows directly from the Markovian structure of the underlying Markov decision process.^1 Formally, we have
$$X_{t+1} \perp X_{t'+1} \mid X_t, X_{t'}, A_t, A_{t'}, \tag{11.3a}$$
$$R_t \perp R_{t'} \mid X_t, X_{t'}, A_t, A_{t'}, \tag{11.3b}$$
for any t, t′ ∈ N0. In particular, if x_t = x_{t'} and a_t = a_{t'}, then x_{t+1} and x_{t'+1} are independent samples according to the transition model p(X_{t+1} | x_t, a_t). Analogously, if x_t = x_{t'} and a_t = a_{t'}, then r_t and r_{t'} are independent samples of the reward model r(x_t, a_t). As we will see later in this chapter and especially in Chapter 13, this independence property is crucial for being able to learn about the underlying Markov decision process. Notably, this implies that we can apply the law of large numbers (A.36) and Hoeffding's inequality (A.41) to our estimators of both quantities.
Footnote 1: Recall the Markov property (6.6), which assumes that in the underlying Markov decision process (i.e., in our environment) the future state of an agent is independent of past states given the agent's current state. This is commonly called a Markovian structure. From this Markovian structure, we gather that repeated encounters of state-action pairs result in independent trials of the transition model and rewards.
In on-policy methods (i.e., in the online setting), the agent learns online. In particular, every action, every reward, and every state transition counts.
In contrast, off-policy methods can be used even when the agent can-
not freely choose its actions. Off-policy methods are therefore able
to make use of purely observational data. This might be data that
was collected by another agent, a fixed policy, or during a previous
episode. Off-policy methods are therefore more sample-efficient than
on-policy methods. This is crucial, especially in settings where con-
ducting experiments (i.e., collecting new data) is expensive.
Still, for the models of our environment to become accurate, our agent
needs to visit each state-action pair ( x, a) numerous times. Note that
our estimators for dynamics and rewards are only well-defined when
we visit the corresponding state-action pair at least once. However,
in a stochastic environment, a single visit will likely not result in an
accurate model. We can use Hoeffding’s inequality (A.41) to gauge
how accurate the estimates are after only a limited number of visits.
The next natural question is how to use our current model of the en-
vironment to pick actions such that exploration and exploitation are
traded effectively. This is what we will consider next.
Given the MDP estimated by p̂ and r̂, we can compute the optimal policy using either policy iteration or value iteration. For example,
using value iteration, we can compute the optimal state-action value
function Q⋆ within the estimated MDP, and then employ the greedy
policy
11.3.1 ε-greedy
Arguably, the simplest idea is the following: At each time step, throw
a biased coin. If this coin lands heads, we pick an action uniformly at
random among all actions. If the coin lands tails, we pick the best ac-
tion under our current model. This algorithm is called ε-greedy, where
the probability of a coin landing heads at time t is ε t .
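A minimal sketch of ε-greedy action selection; the Q-table layout and the decaying schedule for ε_t mentioned in the comment are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(Q, x, eps, rng=np.random.default_rng()):
    """Pick a random action with probability eps, else the greedy action (sketch).

    Q: table of estimated values, shape (n_states, n_actions); x: current state index.
    """
    n_actions = Q.shape[1]
    if rng.random() < eps:                 # coin lands heads: explore
        return int(rng.integers(n_actions))
    return int(Q[x].argmax())              # coin lands tails: exploit

# A common (assumed) choice is a decaying schedule such as eps_t = min(1.0, c / (t + 1)).
```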
11.3.3 Optimism
Recall from our discussion of multi-armed bandits in Section 9.2.1 that a key principle in effectively trading exploration and exploitation is optimism in the face of uncertainty. Let us apply this principle to the reinforcement learning setting. The key idea is to assume that the dynamics and rewards model "work in our favor" until we have learned "good estimates" of the true dynamics and rewards.
More formally, if r(x, a) is unknown, we set r̂(x, a) = Rmax, where Rmax is the maximum reward our agent can attain during a single transition. Similarly, if p(x′ | x, a) is unknown, we set p̂(x⋆ | x, a) = 1, where x⋆ is a "fairy-tale state". The fairy-tale state corresponds to everything our agent could wish for, that is,
$$\hat{p}(x^\star \mid x^\star, a) = 1 \quad \forall a \in A, \tag{11.10}$$
$$\hat{r}(x^\star, a) = R_{\max} \quad \forall a \in A. \tag{11.11}$$
[Figure 11.2: Illustration of the fairy-tale state of Rmax. If in doubt, the agent believes actions from the state x to lead to the fairy-tale state x⋆ with maximal rewards. This encourages the exploration of unknown states.]
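A minimal sketch of this optimistic initialization. The fairy-tale state is appended as an extra state index, and all function and variable names are illustrative assumptions rather than the text's notation.

```python
import numpy as np

def rmax_init(n_states, n_actions, r_max):
    """Optimistic (Rmax-style) initialization with a fairy-tale state x* = n_states (sketch)."""
    n = n_states + 1                        # last index plays the role of the fairy-tale state x*
    r_hat = np.full((n, n_actions), r_max)  # unknown rewards are assumed maximal, cf. (11.11)
    p_hat = np.zeros((n, n_actions, n))
    p_hat[:, :, n - 1] = 1.0                # unknown transitions lead to x*, cf. (11.10)
    return p_hat, r_hat

# As (x, a) pairs are observed sufficiently often, the corresponding entries of p_hat and
# r_hat would be replaced by empirical estimates, and the policy recomputed by planning.
```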
Storing these estimates requires a number of entries that is polynomial in the size of the state and action spaces, yet this quickly becomes unmanageable. We therefore turn to model-free methods that estimate the value function directly. Thus, they require neither remembering the full model nor
planning (i.e., policy optimization) in the underlying Markov decision
process. We will, however, return to model-based methods in Chap-
ter 13 to see that promise lies in combining methods from model-based
reinforcement learning with methods from model-free reinforcement
learning.
We will start by discussing on-policy methods and later see how the
value function can be estimated off-policy.
$$\begin{aligned}
v^\pi(x) &= r(x, \pi(x)) + \gamma \sum_{x' \in X} p(x' \mid x, \pi(x)) \cdot v^\pi(x') && \text{using the definition of the value function (10.7)} \\
&= \mathbb{E}_{R_0, X_1}\!\left[ R_0 + \gamma v^\pi(X_1) \mid X_0 = x, A_0 = \pi(x) \right] && \text{interpreting the above expression as an expectation over the random quantities } R_0 \text{ and } X_1 \tag{11.13}
\end{aligned}$$
Our first instinct might be to use a Monte Carlo estimate of this expectation. Due to the conditional independence of the transitions (11.3), Monte Carlo approximation does yield an unbiased estimate,
$$\approx r + \gamma v^\pi(x'). \tag{11.14}$$
$$V^\pi(x) \leftarrow V^\pi(x) + \alpha_t \left( r + \gamma V^\pi(x') - V^\pi(x) \right). \tag{11.16}$$
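The TD-learning update (11.16) in code; the episode loop and the learning-rate schedule are omitted, and the array-based value table is an assumption for illustration.

```python
def td_update(V, x, r, x_next, alpha, gamma):
    """Single TD(0) update of the value estimate V for a fixed policy, cf. (11.16)."""
    td_target = r + gamma * V[x_next]       # bootstrapping estimate of v(x)
    V[x] += alpha * (td_target - V[x])      # move V(x) towards the target
    return V
```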
$$q^\pi(x, a) = r(x, a) + \gamma \sum_{x' \in X} p(x' \mid x, a) \sum_{a' \in A} \pi(a' \mid x')\, q^\pi(x', a') \quad \text{using Bellman's expectation equation (10.15)}$$
The policy iteration scheme to identify the optimal policy can be out-
lined as follows: In each iteration t, we estimate the value function qπt
of policy πt with the estimate Qπt obtained from SARSA. We then
choose the greedy policy with respect to Qπt as the next policy πt+1 .
However, due to the on-policy nature of SARSA, we cannot reuse any
data between the iterations. Moreover, it turns out that in practice,
when using only finitely many samples, this form of greedily opti-
mizing Markov decision processes does not explore enough. At least
partially, this can be compensated for by injecting noise when choosing
the next action, e.g., by following an ε-greedy policy or using softmax
exploration.
$$\begin{aligned}
q^\pi(x, a) &= r(x, a) + \gamma \sum_{x' \in X} p(x' \mid x, a) \sum_{a' \in A} \pi(a' \mid x')\, q^\pi(x', a') && \text{using Bellman's expectation equation (10.15)} \\
&= \mathbb{E}_{R_0, X_1}\!\left[ R_0 + \gamma \sum_{a' \in A} \pi(a' \mid X_1)\, q^\pi(X_1, a') \,\middle|\, X_0 = x, A_0 = a \right] && \text{interpreting the above expression as an expectation over } R_0 \text{ and } X_1 \tag{11.20} \\
&\approx r + \gamma \sum_{a' \in A} \pi(a' \mid x')\, q^\pi(x', a'), && \text{Monte Carlo approximation with a single sample} \tag{11.21}
\end{aligned}$$
This adapted update rule explicitly chooses the subsequent action a′ ac-
cording to policy π whereas SARSA absorbs this choice into the Monte
Carlo approximation. The algorithm has analogous convergence guar-
antees to those of SARSA.
$$\begin{aligned}
q^\star(x, a) &= r(x, a) + \gamma \sum_{x' \in X} p(x' \mid x, a) \max_{a' \in A} q^\star(x', a') && \text{using that the Q-function is a fixed-point of the Bellman update, see Bellman's theorem (10.32)} \\
&= \mathbb{E}_{R_0, X_1}\!\left[ R_0 + \gamma \max_{a' \in A} q^\star(X_1, a') \,\middle|\, X_0 = x, A_0 = a \right] && \text{interpreting the above expression as an expectation over } R_0 \text{ and } X_1 \tag{11.23} \\
&\approx r + \gamma \max_{a' \in A} q^\star(x', a'), && \text{Monte Carlo approximation with a single sample} \tag{11.24}
\end{aligned}$$
Here,
. Rmax
Vmax = ≥ max q⋆ ( x, a),
1−γ x ∈ X,a∈ A
Proof. We write $\beta_\tau \doteq \prod_{i=1}^{\tau} (1 - \alpha_i)$ and $\eta_{i,\tau} \doteq \alpha_i \prod_{j=i+1}^{\tau} (1 - \alpha_j)$. Using the update rule of optimistic Q-learning (11.27), we have
$$Q^\star_t(x, a) = \beta_{N(a \mid x)} Q^\star_0(x, a) + \sum_{i=1}^{N(a \mid x)} \eta_{i, N(a \mid x)} \left( r + \gamma \max_{a_i \in A} Q^\star_{t_i}(x_i, a_i) \right) \tag{11.29}$$
Using the assumption that the rewards are non-negative, from Equation (11.29) and $Q^\star_0(x, a) = V_{\max} / \beta_{T_{\mathrm{init}}}$, we immediately have
$$Q^\star_t(x, a) \ge \frac{\beta_{N(a \mid x)}}{\beta_{T_{\mathrm{init}}}} V_{\max} \ge V_{\max}. \quad \text{using } N(a \mid x) \le T_{\mathrm{init}}$$
| A|, 1/ϵ, log 1/δ, and Rmax where the initialization time Tinit is upper bounded
by a polynomial in the same coefficients.
Note that for Q-learning, we still need to store Q⋆ ( x, a) for any state-
action pair in memory. Thus, Q-learning requires O(nm) memory.
During each transition, we need to compute
$$\max_{a' \in A} Q^\star(x', a')$$
Discussion
We have seen that both the model-based Rmax algorithm and the model-
free Q-learning take time polynomial in the number of states | X | and
the number of actions | A| to converge. While this is acceptable in small
grid worlds, this is completely unacceptable for large state and action
spaces.
Problems
11.1. Q-learning.
Assume the following grid world, where from state A the agent can
go to the right and down, and from state B to the left and down. From
states G1 and G2 the only action is to exit. The agent receives a reward
(+10 or +1) only when exiting.
Rewards:          States:
  0    0            A    B
+10   +1           G1   G2
We assume the discount factor γ = 1 and that all actions are determin-
istic.
To begin with, let us reinterpret the model-free methods from the pre-
vious section, TD-learning and Q-learning, as solving an optimization
problem, where each iteration corresponds to a single gradient update.
We will focus on TD-learning here, but the same interpretation applies
to Q-learning. Recall the update rule of TD-learning (11.15),
V π ( x ) ← (1 − αt )V π ( x ) + αt (r + γV π ( x ′ )).
Note that this looks just like the update rule of an optimization algo-
rithm! We can parameterize our estimates V π with parameters θ that
are updated according to the gradient of some loss function, assuming
fixed bootstrapping estimates. In particular, in the tabular setting (i.e.,
over a finite domain), we can parameterize the value function exactly
by learning a separate parameter for each state,
$$\theta \doteq [\theta(1), \ldots, \theta(n)], \qquad V^\pi(x; \theta) \doteq \theta(x). \tag{12.1}$$
Using the aforementioned shortcuts, let us define the loss ℓ after ob-
serving the single transition ( x, a, r, x ′ ),
$$\ell(\theta; x, r, x') \doteq \frac{1}{2} \left( r + \gamma \theta^{\mathrm{old}}(x') - \theta(x) \right)^2. \tag{12.5}$$
This error term is also called the temporal-difference (TD) error. The temporal-difference error compares the previous estimate of the value function to the bootstrapping estimate of the value function. We know from the law of large numbers (A.36) that Monte Carlo averages are unbiased.^3 We therefore have,
$$\mathbb{E}_{x' \mid x, \pi(x)}[\delta_{\mathrm{TD}}] \approx \nabla_{\theta(x)} \ell(\theta; x, r). \tag{12.7}$$
Footnote 3: Crucially, the samples are unbiased with respect to the approximate label in terms of the bootstrapping estimate only. Due to bootstrapping the value function, the estimates are not unbiased with respect to the true value function. Moreover, the variance of each individual estimation of the gradient is large, as we only consider a single transition.
Naturally, we can use these unbiased gradient estimates with respect to the loss ℓ to perform stochastic gradient descent. This yields the update rule,
Observe that this gradient update coincides with the update rule of TD-learning (11.15). Therefore, TD-learning is essentially performing stochastic gradient descent using the TD-error as an unbiased gradient estimate.^4 Crucially, TD-learning performs stochastic gradient descent with respect to the bootstrapping estimate of the value function V^π and not the true value function v^π! Stochastic gradient descent with a bootstrapping estimate is also called stochastic semi-gradient descent. Importantly, the optimization target r + γV^π(x′; θ^old) from the loss ℓ is now moving between iterations, which introduces some practical challenges we will discuss in Section 12.2.1. We have seen in the previous chapter that using a bootstrapping estimate still guarantees (asymptotic) convergence to the true value function.
Footnote 4: An alternative interpretation is that TD-learning performs gradient descent with respect to the loss ℓ.
$$\begin{aligned}
\theta &\leftarrow \theta - \alpha_t \nabla_\theta \ell(\theta; x, a, r, x') \tag{12.13} \\
&= \theta - \alpha_t \nabla_\theta \frac{1}{2} \left( r + \gamma \max_{a' \in A} Q^\star(x', a'; \theta^{\mathrm{old}}) - Q^\star(x, a; \theta) \right)^2 && \text{using the definition of } \ell \text{ (12.11)}
\end{aligned}$$
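A minimal sketch of one such semi-gradient step. To keep the example free of deep-learning dependencies, it assumes a linear approximator Q(x, a; θ) = θ⊤ϕ(x, a) and a hand-written gradient; the feature map `phi` and argument names are assumptions, not the text's notation.

```python
import numpy as np

def semi_gradient_step(theta, theta_old, phi, x, a, r, x_next, actions, alpha, gamma):
    """One semi-gradient step on 0.5 * (target - Q)^2 with a linear Q (sketch).

    phi(x, a) -> np.ndarray is a feature map; theta_old is held fixed (bootstrapping).
    """
    target = r + gamma * max(theta_old @ phi(x_next, b) for b in actions)
    q = theta @ phi(x, a)
    grad = -(target - q) * phi(x, a)   # gradient w.r.t. theta, target treated as constant
    return theta - alpha * grad
```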
12.2.1 Heuristics
The vanilla stochastic semi-gradient descent is very slow. In this sub-
section, we will discuss some “tricks of the trade” to improve its per-
formance.
Maximization bias: Now, observe that the estimates Q⋆ are noisy estimates of q⋆ and consider the term $\max_{a' \in A} Q^\star(x', a'; \theta^{\mathrm{old}})$ from the loss function (12.16). This term maximizes a noisy estimate of q⋆, which leads to a biased estimate of max q⋆ as can be seen in Figure 12.1. The fact that the update rules of Q-learning and DQN are affected by inaccuracies (i.e., noise in the estimates) of the learned Q-function is known as the "maximization bias".
[Figure 12.1: Illustration of the maximization bias, with panels "True value and an estimate", "All estimates and max", and "Bias as function of state": the true value q⋆(x, a), the noisy estimates Q⋆(x, a), their maximum max_a Q⋆(x, a), and the resulting bias max_a Q⋆(x, a) − max_a q⋆(x, a) as a function of the state, compared to a Double-Q estimate.]
Q-learning also maximizes over the set of all actions in its update step
while learning the Q-function. This is intractable for large and, in
particular, continuous action spaces. A natural idea to escape this lim-
itation is to immediately learn an approximate parameterized policy,
$$\pi^\star(x) \approx \pi(x; \varphi) \doteq \pi_\varphi(x). \tag{12.21}$$
Methods that find an approximate policy are also called policy search
methods or policy gradient methods.
For simplicity, we will abbreviate $J(\varphi) \doteq J(\pi_\varphi)$.
where $r_t^{(i)}$ is the reward at time t of the i-th rollout. Using a Monte Carlo approximation, we can then estimate $J_T(\varphi)$. Moreover, due to the exponential discounting of future rewards, it is reasonable to approximate the policy value function using bounded trajectories,
$$J(\varphi) \approx J_T(\varphi) \approx \hat{J}_T(\varphi) \doteq \frac{1}{m} \sum_{i=1}^{m} g_{0:T}^{(i)}. \tag{12.28}$$
φ ← φ + η ∇φ J (φ). (12.29)
How can we compute the policy gradient? Let us first formally define
the distribution over trajectories Πφ that we introduced in the previous
section. We can specify the probability of a specific trajectory τ under
a policy πφ by
$$\Pi_\varphi(\tau) = p(x_0) \prod_{t=0}^{T-1} \pi_\varphi(a_t \mid x_t)\, p(x_{t+1} \mid x_t, a_t). \tag{12.30}$$
Note that the expectation integrates over the measure Πφ, which depends on the parameter φ. Thus, we cannot move the gradient operator inside the expectation as we have often done previously (cf. Appendix A.1.5). This should remind you of the reparameterization trick (see Equation (5.62)) that we used to solve a similar gradient in the context of variational inference. In this context, however, we cannot apply the reparameterization trick.^10 Fortunately, there is another way of estimating this gradient.
Footnote 10: This is because the distribution Πφ is generally not reparameterizable. We will, however, see that reparameterization gradients are also useful in reinforcement learning. See, e.g., Sections 12.5.1 and 13.1.2.
Theorem 12.5 (Score gradient estimator). Under some regularity assumptions, we have
Proof. To begin with, let us look at the so-called score function of the distribution Πφ, ∇φ log Πφ(τ). Using the chain rule, the score function can be expressed as
$$\nabla_\varphi \log \Pi_\varphi(\tau) = \frac{\nabla_\varphi \Pi_\varphi(\tau)}{\Pi_\varphi(\tau)} \tag{12.33}$$
$$\begin{aligned}
\nabla_\varphi \log \Pi_\varphi(\tau) &= \nabla_\varphi \left( \log p(x_0) + \sum_{t=0}^{T-1} \log \pi_\varphi(a_t \mid x_t) + \sum_{t=0}^{T-1} \log p(x_{t+1} \mid x_t, a_t) \right) && \text{using the definition of the distribution over trajectories } \Pi_\varphi \\
&= \nabla_\varphi \log p(x_0) + \sum_{t=0}^{T-1} \nabla_\varphi \log \pi_\varphi(a_t \mid x_t) + \sum_{t=0}^{T-1} \nabla_\varphi \log p(x_{t+1} \mid x_t, a_t) \\
&= \sum_{t=0}^{T-1} \nabla_\varphi \log \pi_\varphi(a_t \mid x_t). && \text{using that the first and third term are independent of } \varphi \tag{12.35}
\end{aligned}$$
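A minimal sketch of the resulting Monte Carlo score gradient estimator (REINFORCE-style): weight the summed score terms of each rollout by its discounted return and average. The interfaces `trajectories` (lists of (x, a, r) tuples) and `log_prob_grad` are assumptions for illustration.

```python
import numpy as np

def score_gradient(trajectories, log_prob_grad, gamma):
    """Estimate grad_phi J from rollouts using the score function, cf. (12.35) (sketch).

    trajectories: list of lists of (x, a, r) tuples sampled from the current policy.
    log_prob_grad(x, a): gradient of log pi_phi(a | x) with respect to phi (np.ndarray).
    """
    grads = []
    for traj in trajectories:
        G = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))   # discounted return
        score_sum = sum(log_prob_grad(x, a) for (x, a, _) in traj)    # sum of score terms
        grads.append(G * score_sum)
    return np.mean(grads, axis=0)
```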
Proof. For the term to the right, we have due to linearity of expecta-
tion (1.20),
One can even show that we can subtract arbitrary baselines depending on previous states (see Problem 12.4).
$$b(\tau_{0:t-1}) \doteq \sum_{m=0}^{t-1} \gamma^m r_m. \tag{12.38}$$
This baseline subtracts the returns of all actions before time t. In-
tuitively, using this baseline, the score gradient only considers
downstream returns. Recall from Equation (12.22) that we de-
fined Gt:T as the bounded discounted payoff from time t. It is also
commonly called the (bounded) downstream return (or reward to go)
beginning at time t.
$$b_t \doteq b \doteq \frac{1}{T} \sum_{t'=0}^{T-1} G_{t':T}. \tag{12.43}$$
Typically, policy gradient methods are slow due to the large variance in the score gradient estimates. Because of this, they need to take small steps and require many rollouts of a Markov chain. Moreover, we cannot reuse data from previous rollouts, as policy gradient methods are fundamentally on-policy.^12
Footnote 12: This is because the score gradient estimator is used to obtain gradients of the policy value function with respect to the current policy.
Next, we will combine value approximation techniques like Q-learning and policy gradient methods, leading to an often more practical family of methods called actor-critic methods.
It follows immediately from Equation (12.45) that for any policy π and
state x ∈ X , there exists an action a ∈ A such that aπ ( x, a) is non-
negative,
$$\max_{a \in A} a^\pi(x, a) \ge 0. \tag{12.46}$$
π is optimal ⇐⇒ ∀ x ∈ X , a ∈ A : aπ ( x, a) ≤ 0. (12.47)
Finally, we can re-define the greedy policy πq with respect to the state-action value function q as
$$\pi_q(x) \doteq \arg\max_{a \in A} a(x, a) \tag{12.48}$$
since the advantage a(x, a) differs from q(x, a) only by a term that does not depend on a, and hence both definitions select the same action.
Observe that averaging over the trajectories $\mathbb{E}_{\tau \sim \Pi_\varphi}[\cdot]$ that are sampled according to policy πφ is equivalent to our shorthand notation $\mathbb{E}_{\pi_\varphi}[\cdot]$ from Equation (10.6),
$$\begin{aligned}
&= \sum_{t=0}^{\infty} \mathbb{E}_{x_t, a_t}\!\left[ \gamma^t\, \mathbb{E}_{\pi_\varphi}[G_t \mid x_t, a_t]\, \nabla_\varphi \log \pi_\varphi(a_t \mid x_t) \right] \\
&= \sum_{t=0}^{\infty} \mathbb{E}_{x_t, a_t}\!\left[ \gamma^t\, q^{\pi_\varphi}(x_t, a_t)\, \nabla_\varphi \log \pi_\varphi(a_t \mid x_t) \right]. && \text{using the definition of the Q-function (10.8)} \tag{12.51}
\end{aligned}$$
It turns out that the estimator based on $q^{\pi_\varphi}(x_t, a_t)$ exhibits much less variance than our previous estimator based on $\mathbb{E}_{\pi_\varphi}[G_t \mid x_t, a_t]$. Equation (12.51) is known as the policy gradient theorem.
Intuitively, $\rho_\varphi^\infty(x)$ measures how often we visit state x when following policy πφ. It can be thought of as a "discounted frequency".
$$= \frac{1}{1 - \gamma} \int \rho_\varphi^\infty(x) \cdot \mathbb{E}_{a \sim \pi_\varphi(\cdot \mid x)}\!\left[ q^{\pi_\varphi}(x, a)\, \nabla_\varphi \log \pi_\varphi(a \mid x) \right] dx$$
Observe that we cannot use the policy gradient to calculate the gradient exactly, as we do not know $q^{\pi_\varphi}$. Instead, we will use bootstrapping estimates $Q^{\pi_\varphi}$ of $q^{\pi_\varphi}$.
for effectively trading bias and variance. This leads to algorithms such
as generalized advantage estimation (GAE) (Schulman et al., 2016).
$$w_k(\varphi; x, a) \doteq \frac{\pi_\varphi(a \mid x)}{\pi_{\varphi_k}(a \mid x)},$$
are used to correct for taking the expectation over the previous pol-
icy. When wk ≈ 1 the policies πφ and πφk are similar, whereas when
wk ≪ 1 or wk ≫ 1, the policies differ significantly. To be able to
assume that the fixed critic is a good approximation within a certain
“trust region” (i.e., one iteration), we impose the constraint
Taking the expectation with respect to the previous policy πφk means
that we can reuse data from rollouts within the same iteration. That is,
TRPO allows reusing past data as long as it can still be “trusted”. This
makes TRPO “somewhat” off-policy. Fundamentally, though, TRPO is
still an on-policy method.
with some λ > 0, which regularizes towards the trust region. An-
other common variant of PPO is based on controlling the importance
weights directly rather than regularizing by the KL-divergence. PPO
is used, for example, to train large-scale language models such as GPT
(Stiennon et al., 2020; OpenAI, 2023) which we will discuss in more
detail in Section 12.7. There we will also see that the objective from
Equation (12.63) can be cast as performing probabilistic inference.
when the action space A is large. What if we simply replace the exact
maximum over actions by a parameterized policy?
$$\ell_{\mathrm{DQN}}(\theta; \mathcal{D}) \approx \frac{1}{2} \sum_{(x, a, r, x') \in \mathcal{D}} \left( r + \gamma Q^\star(x', \pi_\varphi(x'); \theta^{\mathrm{old}}) - Q^\star(x, a; \theta) \right)^2. \tag{12.65}$$
The resulting algorithm is known as deep deterministic policy gradients (DDPG) (Lillicrap et al., 2016), shown in Algorithm 12.13. (In practice, exploration is encouraged by adding some "randomness" to the policy πφ.)
Twin delayed DDPG (TD3) is an extension of DDPG that uses two sepa-
rate critic networks for predicting the maximum action and evaluating
the policy (Fujimoto et al., 2018). This addresses the maximization
bias akin to Double-DQN. TD3 also applies delayed updates to the
actor network, which increases stability.
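A minimal sketch of how the policy replaces the maximization when computing the critic targets, as in (12.65); the callable interfaces `q_old` and `policy` stand in for the target critic and the deterministic actor and are assumptions, not the text's notation.

```python
def ddpg_targets(batch, q_old, policy, gamma):
    """Compute bootstrapped critic targets r + gamma * Q(x', pi(x'); theta_old) (sketch).

    batch: iterable of transitions (x, a, r, x_next);
    q_old(x, a): target critic; policy(x): deterministic actor.
    """
    return [r + gamma * q_old(x_next, policy(x_next)) for (_, _, r, x_next) in batch]
```

The critic is then regressed onto these targets, while the actor is updated by ascending the critic's value at the actor's own actions.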
Margin note: $\ell_{\mathrm{DQN}}(\theta; \mathcal{D}) \approx \frac{1}{2} \left( r + \gamma Q^\star(x', \pi(x'; \varphi); \theta^{\mathrm{old}}) - Q^\star(x, a; \theta) \right)^2$; refer to the squared loss of Q-learning (12.11).
the Bellman error for a fixed action a′ . Using the chain rule, we obtain
Note that this is identical to the gradient in DQN (12.14), except that
now we have an expectation over actions. As we have done many times
already, we can use automatic differentiation to obtain gradient esti-
mates of ∇θ Q⋆ ( x, a; θ). This provides us with a method of obtaining
unbiased gradient estimates for the critic.
Note that the inner expectation is with respect to a measure that depends on the parameters φ, which we are trying to optimize. We therefore cannot move the gradient operator inside the expectation. This is a problem that we have already encountered several times. In the previous section on policy gradients, we used the score gradient estimator.^24 Earlier, in Chapter 5 on variational inference, we have already seen reparameterization gradients.^25 Here, if our policy is reparameterizable, we can use the reparameterization trick from Theorem 5.19! Then, we have,
Footnote 24: see Equation (12.32). Footnote 25: see (5.63).
The algorithm that uses Equation (12.73) to obtain gradients for the critic and reparameterization gradients for the actor is called stochastic value gradients (SVG) (Heess et al., 2015).
Let us denote by Π⋆ the distribution over trajectories τ under the optimal policy π⋆. By framing the problem of optimal control as an inference problem in a hidden Markov model with hidden "optimality variables" Ot ∈ {0, 1} indicating whether the played action at was
[Figure 12.3: Directed graphical model of the underlying hidden Markov model with hidden states Xt, optimality variables Ot, and actions At.]
$$\arg\max_\varphi J_\lambda(\varphi) = \arg\min_\varphi \mathrm{KL}\!\left( \Pi_\varphi \,\|\, \Pi^\star \right) = \arg\max_\varphi \mathcal{L}(\Pi_\varphi, \Pi^\star; O_{1:T}).$$
So far, we have been assuming that the agent is presented with a reward signal after every played action. This is a natural assumption in domains such as games and robotics, even though it often requires substantial "reward engineering" to break down complex tasks with sparse rewards to more manageable tasks (cf. Section 13.3). In many other domains, such as an agent learning to drive a car or a chatbot, it is unclear how one can even quantify the reward associated with an action or a sequence of actions. For example, in the context of autonomous driving it is typically desired that agents behave "human-like" even though a different driving behavior may also reach the destination safely.
[Figure 12.5: We generalize the perspective of reinforcement learning from Figure 11.1 by allowing the feedback to come from either the environment or an evaluation by other agents (e.g., humans), and by allowing the feedback to come in other forms than a numerical reward.]
The task of "aligning" the behavior of an agent to human expectations is difficult in complex domains such as the physical world and language, yet crucial for their practical use. To this end, one can conceive of alternative ways for presenting "feedback" to the agent:
• The classical feedback in reinforcement learning is a numerical score. Consider, for example, a recommender system for movies. The feedback is obtained after a movie was recommended to a user by a user-rating on a given scale (often 1 to 10). This rating is informative as it corresponds to an absolute value assessment, allowing one to place the recommendation in a complete ranking of all previous recommendations. However, numerical feedback of this type can be error-prone as it is scale-dependent (different users may ascribe different value to a recommendation rated a 7).
For example, when the goal is to build a chatbot, this data may consist of desirable responses to some exemplary prompts. We will denote the
parameters of the fine-tuned language model by φinit , and its associ-
ated policy by Πinit .
The second step is then to “post-train” the language model Πinit from
the first step using human feedback. Here, it is important that Πinit is
already capable of producing sensible output (i.e., with correct spelling
and grammar). Learning this from scratch using only preference feed-
back would take far too long. Instead, the post-training step is used to
align the agent to the task and user preferences.
$$p(y_A \succ y_B \mid x, r) = \frac{\exp(r(y_A \mid x))}{\exp(r(y_A \mid x)) + \exp(r(y_B \mid x))} \tag{12.90}$$
$$= \sigma\!\left( r(y_A \mid x) - r(y_B \mid x) \right). \tag{12.91}$$
(As seen in Problem 7.1, this is the Gibbs distribution with energy −r(y | x) in a binary classification problem.)
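A small numeric check of (12.90)/(12.91): the probability that y_A is preferred is the logistic sigmoid of the reward difference. The reward values below are made up purely for illustration.

```python
import numpy as np

def preference_prob(r_a, r_b):
    """P(y_A preferred over y_B) under the model (12.90)/(12.91)."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))    # sigma(r_a - r_b)

print(preference_prob(2.0, 1.0))   # ~0.731: a one-unit reward gap
print(preference_prob(1.0, 1.0))   # 0.5: equal rewards, no preference
```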
The aggregated human feedback $\mathcal{D} = \{ y_A^{(i)} \succ y_B^{(i)} \mid x^{(i)} \}_{i=1}^{n}$ across n different prompts is then used to update the language model πφ. In
the next two sections, we discuss two standard approaches to post-
training: reinforcement learning from human feedback (RLHF) and
direct preference optimization (DPO).
RLHF separates the post-training step into two stages (Stiennon et al.,
2020). First, the human feedback is used to learn an approximate re-
ward model rθ. This reward model is then used in the second stage to
determine a refined policy Πφ.
Learning a reward model: During the first stage, the initial policy ob-
tained by supervised fine-tuning Πinit is used to generate propos-
als (y A , yB ) for some exemplary prompts x, which are then ranked
according to the preference of human labelers. This preference data
can be used to learn a reward model rθ by maximum likelihood estimation.
Learning an optimal policy: One can now employ the methods from this and previous chapters to determine the optimal policy for the approximate reward rθ. Due to the use of an approximate reward, however, simply maximizing rθ surfaces the so-called "reward gaming" problem which is illustrated in Figure 12.8. As the responses generated by the learned policy πφ deviate from the distribution of D (i.e., the distribution induced by the initial policy πinit), the approximate reward model becomes inaccurate. The approximate reward may severely overestimate the true reward in regions far away from the data.
$$\Pi^\star(y \mid x) \propto \Pi_{\mathrm{init}}(y \mid x) \exp\!\left( \frac{1}{\lambda} r(y \mid x) \right) \tag{12.94}$$
Discussion
Optional Readings
• A3C: Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and
Kavukcuoglu (2016).
Asynchronous methods for deep reinforcement learning.
• GAE: Schulman, Moritz, Levine, Jordan, and Abbeel (2016).
High-dimensional continuous control using generalized advantage es-
timation.
• TRPO: Schulman, Levine, Abbeel, Jordan, and Moritz (2015).
Trust region policy optimization.
• PPO: Schulman, Wolski, Dhariwal, Radford, and Klimov (2017).
Proximal policy optimization algorithms.
• DDPG: Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, and
Wierstra (2016).
Continuous control with deep reinforcement learning.
• TD3: Fujimoto, Hoof, and Meger (2018).
Addressing function approximation error in actor-critic methods.
• SVG: Heess, Wayne, Silver, Lillicrap, Erez, and Tassa (2015).
Learning continuous control policies by stochastic value gradients.
• SAC: Haarnoja, Zhou, Abbeel, and Levine (2018a).
Soft actor-critic: Off-policy maximum entropy deep reinforcement
learning with a stochastic actor.
• DPO: Rafailov, Sharma, Mitchell, Ermon, Manning, and Finn
(2023).
Direct preference optimization: Your language model is secretly a re-
ward model.
Problems
x a x′ r
3 −1 2 −1
2 1 3 −1
3 1 4 −1
4 1 4 0
where w = [w0 w1 w2 ]⊤ ∈ R3 .
Suppose we have $w_{\mathrm{old}} = [1, -1, -2]^\top$ and $w = [-1, 1, 1]^\top$,
and we observe the transition τ = (2, −1, −1, 1). Use the learn-
ing rate α = 1/2 to compute ∇w ℓ(w; τ ) and the updated weights
w′ = w − α∇w ℓ(w; τ ).
[Figure 12.9: MDP studied in Problem 12.1, with states 1 to 7. Each arrow marks a (deterministic) transition and is labeled with (action, reward).]
$$\pi_\varphi(a \mid x) \doteq \frac{\exp(h(x, a, \varphi))}{\sum_{b \in A} \exp(h(x, b, \varphi))} \tag{12.99}$$
with linear preferences $h(x, a, \varphi) \doteq \varphi^\top \phi(x, a)$ where $\phi(x, a)$ is some feature vector, what is the form of the eligibility vector?
12.4. Score gradients with state-dependent baselines.
Show that
$$\mathbb{E}_{\tau \sim \Pi_\varphi}\!\left[ G_0\, \nabla_\varphi \log \Pi_\varphi(\tau) \right] = \mathbb{E}_{\tau \sim \Pi_\varphi}\!\left[ \sum_{t=0}^{T-1} \left( G_0 - b(\tau_{0:t-1}) \right) \nabla_\varphi \log \pi_\varphi(a_t \mid x_t) \right] \tag{12.102}$$
Initially, θ = 0.5. One episode is played with this initial policy and the
results are
$$\mathrm{KL}\!\left( \Pi_\varphi \,\|\, \Pi^\star \right) = \sum_{t=1}^{T} \mathbb{E}_{x_t \sim \Pi_\varphi}\!\left[ \mathrm{KL}\!\left( \pi_\varphi(\cdot \mid x_t) \,\|\, \hat{\pi}(\cdot \mid x_t) \right) - \log Z(x_t) \right]. \tag{12.106}$$
13
Model-based Reinforcement Learning
13.1 Planning
x t +1 = f ( x t , a t ). (13.1)
$$\max_{a_{0:\infty}} \sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) \quad \text{such that} \quad x_{t+1} = f(x_t, a_t). \tag{13.2}$$
Planning over finite horizons: The key idea of a classical algorithm from optimal control called receding horizon control (RHC) or model predictive control (MPC) is to iteratively plan over finite horizons. That is, in each round, we plan over a finite time horizon H and carry out the first action.
$$J_H(a_{t:t+H-1}) \doteq \sum_{\tau=t}^{t+H-1} \gamma^{\tau - t}\, r(x_\tau(a_{t:\tau-1}), a_\tau). \tag{13.4}$$
A solution to this problem is to amend a long-term value estimate to the finite-horizon sum. The idea is to not only consider the rewards attained while following the actions $a_{t:t+H-1}$, but to also consider the value of the final state $x_{t+H}$, which estimates the discounted sum of future rewards.
$$\hat{J}_H(a_{t:t+H-1}) \doteq \underbrace{\sum_{\tau=t}^{t+H-1} \gamma^{\tau - t}\, r(x_\tau(a_{t:\tau-1}), a_\tau)}_{\text{short-term}} + \underbrace{\gamma^H V(x_{t+H})}_{\text{long-term}}. \tag{13.6}$$
[Figure 13.2: Illustration of finite-horizon planning with sparse rewards. When the finite time horizon does not suffice to "reach" a reward, the agent has no signal to follow.]
$$a_t = \arg\max_{a \in A} \hat{J}_1(a) = \pi_V(x_t). \tag{13.7}$$
Thus, by looking ahead for a single time step, we recover the approaches from the model-free setting in this model-based setting! Essentially, if we do not plan long-term and only consider the value estimate, the model-based setting reduces to the model-free setting. However, in the model-based setting, we are now able to use our model of the transition dynamics to anticipate the downstream effects of picking a particular action at. This is one of the fundamental reasons why model-based approaches are typically far more sample-efficient than model-free methods.
$$x_\tau = x_\tau(\varepsilon_{t:\tau-1}; a_{t:\tau-1}) \doteq g(\varepsilon_{\tau-1}; g(\ldots; g(\varepsilon_{t+1}; g(\varepsilon_t; x_t, a_t), a_{t+1}), \ldots), a_{\tau-1}). \tag{13.9}$$
$$\hat{J}_H(a_{t:t+H-1}) \approx \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{\tau=t}^{t+H-1} \gamma^{\tau - t}\, r\!\left( x_\tau(\varepsilon^{(i)}_{t:\tau-1}; a_{t:\tau-1}), a_\tau \right) + \gamma^H V\!\left( x_{t+H}(\varepsilon^{(i)}_{t:t+H-1}; a_{t:t+H-1}) \right) \right) \tag{13.10}$$
where $\varepsilon^{(i)}_{t:t+H-1} \overset{\mathrm{iid}}{\sim} \phi$ are independent samples. To optimize this approximation we can again compute analytic gradients or use shooting methods as we have discussed in Section 13.1.1 for deterministic dynamics.
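A minimal sketch of a random-shooting MPC step for the stochastic objective (13.10): sample candidate action sequences, average their Monte Carlo scores over noise rollouts, and carry out the first action of the best plan. The interfaces (`g`, `reward`, `value`, `sample_actions`) and the standard-normal noise distribution are assumptions for illustration.

```python
import numpy as np

def mpc_random_shooting(x_t, g, reward, value, sample_actions, H, m, gamma,
                        n_candidates=100, rng=np.random.default_rng()):
    """Pick the first action of the best candidate plan under a Monte Carlo objective like (13.10).

    g(eps, x, a): reparameterized dynamics; reward(x, a), value(x): scalar functions;
    sample_actions(H): returns a candidate action sequence of length H.
    """
    best_score, best_plan = -np.inf, None
    for _ in range(n_candidates):
        plan = sample_actions(H)
        score = 0.0
        for _ in range(m):                          # average over m noise rollouts
            x, ret = x_t, 0.0
            for tau, a in enumerate(plan):
                ret += gamma ** tau * reward(x, a)
                x = g(rng.standard_normal(), x, a)  # eps ~ phi (assumed standard normal here)
            ret += gamma ** H * value(x)            # long-term value of the final state
            score += ret / m
        if score > best_score:
            best_score, best_plan = score, plan
    return best_plan[0]                             # carry out only the first action
```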
This policy can then be trained offline and evaluated cheaply online,
which is known as open-loop control.
where µ( x) > 0 was some exploration distribution that has full sup-
port and thus leads to the exploration of all states. The key idea was
that if we use a differentiable approximation Q and a differentiable
parameterization of policies, which is “rich enough”, then both op-
timizations are equivalent, and we can use the chain rule to obtain
13.2 Learning
First, let us revisit one of our key observations when we first introduced the reinforcement learning problem. Namely, that the observed transitions x′ and rewards r are conditionally independent given the state-action pairs (x, a).^3 This is due to the Markovian structure of the underlying Markov decision process.
Footnote 3: see Equation (11.3).
Here, xt and at are the "inputs", and rt and xt+1 are the "labels" of a supervised learning problem.
The key difference to supervised learning is that the set of inputs de-
pends on how we act. That is, the current inputs arise from previous
policies, and the inputs which we will observe in the future will de-
pend on the model (and policy) obtained from the current data: we
have feedback loops! We will come back to this aspect of reinforce-
ment learning in the next section on exploration. For now, recall we
only assume that we have used an arbitrary policy to collect some data,
which we then stored in a replay buffer, and which we now want to
use to learn the “best-possible” model of our environment.
$$x_\tau(\varepsilon_{t:\tau-1}; a_{t:\tau-1}, f) \doteq f(\varepsilon_{\tau-1}; f(\ldots; f(\varepsilon_{t+1}; f(\varepsilon_t; x_t, a_t), a_{t+1}), \ldots), a_{\tau-1}). \tag{13.19}$$
Observe that the epistemic and aleatoric uncertainty are treated differ-
ently. Within a particular MDP f , we ensure that randomness (i.e., ale-
The same approach that we have seen in Section 13.1.3 can be used to "compile" these plans into a parametric policy that can be trained offline, in which case we write $\hat{J}_H(\pi)$ instead of $\hat{J}_H(a_{t:t+H-1})$.
Optional Readings
• PILCO: Deisenroth and Rasmussen (2011).
PILCO: A model-based and data-efficient approach to policy search.
• PETS: Chua, Calandra, McAllister, and Levine (2018).
Deep reinforcement learning in a handful of trials using probabilistic
dynamics models.
13.3 Exploration
Recall from Section 9.3.3 the main idea behind Thompson sampling:
namely, that the randomness in the realizations of f from the posterior
distribution is already enough to drive exploration. That is, instead of
picking the action that performs best on average across all realizations
of f , Thompson sampling picks the action that performs best for a
single realization of f . The epistemic uncertainty in the realizations
of f leads to variance in the picked actions and provides an additional
incentive for exploration. This yields Algorithm 13.11 which is an
immediate adaptation of greedy exploitation and straightforward to
implement.
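A minimal sketch of this Thompson-sampling loop. The interfaces `posterior.sample()`, `posterior.update()`, `plan_optimal`, and the environment methods are placeholder assumptions for illustration, not part of Algorithm 13.11.

```python
def thompson_sampling_episode(posterior, plan_optimal, env, episode_len):
    """One episode of Thompson sampling for model-based RL (sketch).

    posterior.sample(): draw a single dynamics/reward model f from the posterior;
    plan_optimal(f): compute a policy that is optimal for that single realization.
    """
    f = posterior.sample()                 # one realization of f drives exploration
    policy = plan_optimal(f)
    data = []
    x = env.reset()
    for _ in range(episode_len):
        a = policy(x)
        x_next, r = env.step(a)            # assumed environment interface
        data.append((x, a, r, x_next))
        x = x_next
    posterior.update(data)                 # condition the model on the new transitions
    return data
```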
Let us consider a set M(D) of plausible models given some data D . Op-
timistic exploration would then optimize for the most advantageous
model among all models that are plausible given the seen data.
Intuitively, the agent has the tendency to believe that it can achieve
much more than it actually can. As more data is collected, the con-
fidence bounds shrink and the optimistic policy rewards converge to
the actual rewards. Yet, crucially, we only collect data in regions of
the state-action space that are more promising than the regions that
we have already explored. That is, we only collect data in the most
promising regions.
settings, most other strategies (i.e., those that are not sufficiently ex-
plorative), learn not to act at all. However, even in settings of “ordinary
rewards”, optimistic exploration often learns good policies faster.
are expected discounted costs with respect to a cost function c : X → R≥0.^10 Observe that for the cost function $c(x) \doteq \mathbb{1}\{x \in \mathcal{X}_{\mathrm{unsafe}}\}$, the value $J^c_\mu(\pi; f)$ can be interpreted as an upper bound on the (discounted) probability of visiting unsafe states under dynamics f,^11 and hence, the constraint $J^c_\mu(\pi; f) \le \delta$ bounds the probability of visiting an unsafe state.
Footnote 10: It is straightforward to extend this framework to allow for multiple constraints. Footnote 11: This follows from a simple union bound (1.73).
for some λ > 0 where the barrier function $B^c_\mu(\pi; f)$ goes to infinity as a state on the boundary of $\mathcal{X}_{\mathrm{unsafe}}$ is approached. Examples of barrier functions are $-\log c(x)$ and $1/c(x)$.
If the policy values Jµ (π; f ) and Jµc (π; f ) are modeled as a distribu-
tion (e.g., using Bayesian deep learning), then the inner maximiza-
tion over plausible dynamics can be approximated using samples from
the posterior distributions. Thus, the augmented Lagrangian method
can also be used to solve the general optimization problem of Equa-
tion (13.32). The resulting algorithm is known as Lagrangian model-
based agent (LAMBDA) (As et al., 2022).
to avoid the crash. Thus, looking ahead a finite number of steps is not
sufficient to prevent entering unsafe states.
[Figure: illustration of the performance plan ("Plan A") and the safety plan ("Plan B") in the state space X with unsafe region Xunsafe.]
The problem is that this set of safe states might be very conservative.
That is to say, it is likely that rewards are mostly attained outside of
the set of safe states. The key idea is to plan two sequences of actions,
instead of only one. “Plan A” (the performance plan) is planned with
the objective to solve the task, that is, attain maximum reward. “Plan
B” (the safety plan) is planned with the objective to return to the set of
safe states Xsafe . In addition, we enforce that both plans must agree on
the first action to be played.
where fb are the adjusted dynamics (13.26) which are based on the
“luck” variables η. In the context of estimating costs which we aim to
minimize (as opposed to rewards which we aim to maximize), η can
be interpreted as the “bad luck” of the agent.
When our only objective is to act safely, that is, we only aim to mini-
mize cost and are indifferent to rewards, then this reparameterization
allows us to find a “maximally safe” policy,
$$\pi_{\mathrm{safe}} \doteq \arg\min_\pi \mathbb{E}_{x \sim \mu}[C^\pi(x)] = \arg\min_\pi \max_\eta J^c_\mu(\pi; \hat{f}). \tag{13.35}$$
for some metric d(·, ·) on A. The constraint ensures that the pessimistic
next state fb( x, a) is recoverable by following policy πsafe . In this way,
$$\tilde{\pi}(x) \doteq \begin{cases} \hat{\pi}(x) & \text{if } C^{\pi_{\mathrm{safe}}}(x) \le \delta \\ \pi_{\mathrm{safe}}(x) & \text{if } C^{\pi_{\mathrm{safe}}}(x) > \delta \end{cases} \tag{13.37}$$
Optional Readings
• Curi, Berkenkamp, and Krause (2020).
Efficient model-based reinforcement learning through optimistic pol-
icy search and planning.
• Berkenkamp, Turchetta, Schoellig, and Krause (2017).
Safe model-based reinforcement learning with stability guarantees.
• Koller, Berkenkamp, Turchetta, and Krause (2018).
Learning-based model predictive control for safe exploration.
• As, Usmanova, Curi, and Krause (2022).
Constrained Policy Optimization via Bayesian World Models.
• Curi, Lederer, Hirche, and Krause (2022).
Safe Reinforcement Learning via Confidence-Based Filters.
• Turchetta, Berkenkamp, and Krause (2019).
Safe exploration for interactive machine learning.
Discussion
rt = JH (π ⋆ ; f ) − JH (πt ; f ) (13.38)
$$\left\| \hat{x}_{t,k} - x_{t,k} \right\| \le 2\beta_t \sum_{l=0}^{k-1} \alpha_t^{k-1-l}\, \sigma_{t-1}\!\left( x_{t,l}, \pi_t(x_{t,l}) \right) \tag{13.39}$$
3. Let $\Gamma_T \doteq \max_{\pi_1, \ldots, \pi_T} \sum_{t=1}^{T} \sum_{k=0}^{H-1} \sigma^2_{t-1}(x_{t,k}, \pi_t(x_{t,k}))$. Analogously to Problem 9.3, it can be derived that $\Gamma_T \le O(H \gamma_T)$ if the dynamics are modeled by a Gaussian process.^16 Bound the cumulative regret $R_T = \sum_{t=1}^{T} r_t \le O\!\left( \beta_T H^{3/2} \alpha_T^{H-1} \sqrt{T \Gamma_T} \right)$.
Footnote 16: For details, see appendix A of Treven et al. (2024).
Thus, if the dynamics are modeled by a Gaussian process with a kernel such that γT is sublinear, the regret of H-UCRL is sublinear.
A
Mathematical Background
A.1 Probability
To see this, let us consider the m-dimensional space of probabilities [0, 1]m .
It follows from our characterization of the categorical distribution in
Appendix A.1.1 that there is a one-to-one correspondence between
probability distributions over m classes and points in the space [0, 1]m
where all coordinates sum to one. This (m − 1)-dimensional subspace
of [0, 1]m is also known as the probability simplex.
the quantile function of X. That is, $P^{-1}(u)$ corresponds to the value x such that the probability of X being at most x is u. If the CDF P is invertible, then $P^{-1}$ coincides with the inverse of P.
This implies that if we are able to sample from Unif([0, 1]),^1 then we are able to sample from any distribution with invertible CDF.
Footnote 1: This is done in practice using so-called pseudo-random number generators.
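A minimal sketch of this inverse transform sampling, here for the exponential distribution whose quantile function is available in closed form; the choice of distribution and the sanity check are assumptions made for illustration.

```python
import numpy as np

def sample_exponential(lam, n, rng=np.random.default_rng()):
    """Sample Exp(lam) via the quantile function P^{-1}(u) = -ln(1 - u) / lam (sketch)."""
    u = rng.uniform(0.0, 1.0, size=n)      # u ~ Unif([0, 1])
    return -np.log1p(-u) / lam             # apply the inverse CDF

samples = sample_exponential(lam=2.0, n=10_000)
print(samples.mean())                      # should be close to 1 / lam = 0.5
```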
A = LL⊤ (A.7)
We will not prove this fact, but it is not hard to see that a decomposi-
tion exists (it takes more work to show that L is lower-triangular).
$$= V \Lambda^{1/2} \Lambda^{1/2} V^\top = V \Lambda V^\top = A. \tag{A.9}$$
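A minimal sketch of how the Cholesky factor (A.7) is used to sample from a multivariate Gaussian N(µ, A): transform standard normal noise by L. The example covariance and mean are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])                 # symmetric positive definite

L = np.linalg.cholesky(A)                  # A = L @ L.T, with L lower-triangular
eps = rng.standard_normal((10_000, 2))     # rows are standard normal samples
samples = mu + eps @ L.T                   # each row is then distributed as N(mu, A)

print(np.cov(samples, rowvar=False))       # empirical covariance is approximately A
```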
and in particular,
$$\log \mathcal{N}(x; \mu, \Sigma) = -\frac{1}{2} \| x - \mu \|^2_{\Sigma^{-1}} + \mathrm{const} \tag{A.11}$$
$$= -\frac{1}{2} \left[ x^\top \Sigma^{-1} x - 2 \mu^\top \Sigma^{-1} x \right] + \mathrm{const}, \tag{A.12}$$
$$\log \mathcal{N}(x; 0, I) = -\frac{1}{2} \| x \|^2_2 + \mathrm{const}. \tag{A.13}$$
A.3.1 Estimators
Suppose we are given a collection of independent samples x1 , . . . , xn
from some random vector X. Often, the exact distribution of X is un-
known to us, but we still want to “estimate” some property of this
distribution, for example its mean. We denote the property that we
aim to estimate from our sample by θ⋆ . For example, if our goal is
.
estimating the mean of X, then θ⋆ = E[X].
The mean squared error can be decomposed into the estimator's bias and variance:
$$\mathrm{MSE}(\hat{\theta}_n) = \mathbb{E}_{\hat{\theta}_n}\!\left[ (\hat{\theta}_n - \theta^\star)^2 \right] = \mathrm{Var}\!\left[ \hat{\theta}_n \right] + \left( \mathbb{E}\hat{\theta}_n - \theta^\star \right)^2, \tag{A.19}$$
using (1.35) and $\mathrm{Var}_{\hat{\theta}_n}\!\left[ \hat{\theta}_n - \theta^\star \right] = \mathrm{Var}\!\left[ \hat{\theta}_n \right]$.
Such estimators are called consistent, and a sufficient condition for consistency is that the mean squared error converges to zero as n → ∞.^5
Footnote 5: It follows from Chebyshev's inequality (A.72) that $P\!\left( |\hat{\theta}_n - \theta^\star| > \epsilon \right) \le \frac{\mathrm{MSE}(\hat{\theta}_n)}{\epsilon^2}$.
where Ω(ϵ) denotes the class of functions that grow at least linearly in ϵ.^6 Thus, if an estimator is sharply concentrated, its absolute error is bounded by an exponentially quickly decaying error probability.
Footnote 6: That is, h ∈ Ω(ϵ) if and only if $\lim_{\epsilon \to \infty} \frac{h(\epsilon)}{\epsilon} > 0$. With slight abuse of notation, we force h to be positive (so as to ensure that the argument to the exponential function is negative) whereas in the traditional definition of Landau symbols, h is only required to grow linearly in absolute value.

A.3.2 Heavy Tails

It is often said that a sharply concentrated estimator $\hat{\theta}_n$ has small tails, where "tails" refer to the "ends" of a PDF. Let us examine the difference between a light-tailed and a heavy-tailed distribution more closely. The (right) tail of X is said to be heavy if it decays slower than that of the exponential distribution, that is,
$$\limsup_{x \to \infty} \frac{\bar{P}_X(x)}{e^{-\lambda x}} = \infty \tag{A.22}$$
for all λ > 0. When $\limsup_{x \to \infty} \frac{\bar{P}_X(x)}{\bar{P}_Y(x)} > 1$, the (right) tail of X is said to be heavier than the (right) tail of Y.
[Figure A.1: Shown are the right tails of the PDFs of a Gaussian with mean 1 and variance 1, an exponential distribution with mean 1 and parameter λ = 1, and a log-normal distribution with mean 1 and variance 1 on a log-scale.]
It is immediate from the definitions that the distribution of an unbiased estimator is light-tailed if and only if the estimator is sharply concentrated, so both notions are equivalent.
is light-tailed, but its tails decay slower than those of the Gaussian. It is also "more sharply peaked" at its mean than the Gaussian.
$$\mathrm{Pareto}(x; \alpha, c) = \frac{\alpha c^\alpha}{x^{\alpha + 1}} \mathbb{1}\{x \ge c\}, \quad x \in \mathbb{R} \tag{A.25}$$
where the tail index α > 0 models the "weight" of the right tail of the distribution (larger α means lighter tail) and c > 0 corresponds to a cutoff threshold. The distribution is supported on [c, ∞), and as α → ∞ it approaches a point density at c. The Pareto's right tail is $\bar{P}(x) = (c/x)^\alpha$ for all x ≥ c.
• The Cauchy distribution arises in the description of numerous physical phenomena. The CDF of a Cauchy distribution is
$$P_X(x) \doteq \frac{1}{\pi} \arctan\!\left( \frac{x - m}{\tau} \right) + \frac{1}{2}. \tag{A.26}$$
When working with light-tailed distributions, outliers are rare, and hence, we can usually ignore them. In contrast, when working with heavy-tailed distributions, the
catastrophe principle tells us that outliers are not just common but a
defining feature, and hence, we need to be careful to not ignore them.
Readings
For more details and examples of heavy-tailed distributions, refer
to “The Fundamentals of Heavy Tails” (Nair et al., 2022).
1. Xn converges to X almost surely (or, with probability 1) if
$$P\!\left( \left\{ \omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega) \right\} \right) = 1, \tag{A.27}$$
and we write $X_n \overset{\mathrm{a.s.}}{\to} X$ as n → ∞.
2. Xn converges to X in probability if for any ϵ > 0,
For the precise notions of mean square continuity and mean square
differentiability in the context of Gaussian processes, refer to sec-
tion 4.1.1 of “Gaussian processes for machine learning” (Williams
and Rasmussen, 2006).
which is known as the weak law of large numbers (WLLN) and which establishes consistency of the sample mean (see Problem A.2). Using more advanced tools, it is possible to show almost sure convergence of the sample mean even when the variance is infinite:
Fact A.13 (Strong law of large numbers, SLLN). Given the random variable X : Ω → R with finite mean. Then, as n → ∞,
$$\bar{X}_n \overset{\mathrm{a.s.}}{\to} \mathbb{E}X. \tag{A.36}$$
To get an idea of how quickly the sample mean converges, we can look at its variance:
$$\mathrm{Var}\!\left[ \bar{X}_n \right] = \mathrm{Var}\!\left[ \frac{1}{n} \sum_{i=1}^{n} X_i \right] = \frac{\mathrm{Var}[X]}{n}. \tag{A.37}$$
Remarkably, one can not only compute its variance, but also its limiting
distribution. This is known as the central limit theorem (CLT) which
states that the prediction error of the sample mean tends to a normal
distribution as the sample size goes to infinity (even if the samples
themselves are not normally distributed).
The central limit theorem makes the critical assumption that the vari-
ance of X is finite. This is not the case for many heavy-tailed distributions.
$$\varphi_X(\lambda) \doteq \mathbb{E}\!\left[ e^{\lambda X} \right] = \exp\!\left( \mu \lambda + \frac{\sigma^2 \lambda^2}{2} \right) \quad \text{for any } \lambda \in \mathbb{R} \tag{A.40}$$
$$P\!\left( \left| \bar{X}_n - \mathbb{E}X \right| \ge \epsilon \right) \le 2 \exp\!\left( -\frac{n \epsilon^2}{2 \sigma^2} \right). \tag{A.41}$$
Proof of Theorem A.17. Let $S_n \doteq n \bar{X}_n = X_1 + \cdots + X_n$. We have for any λ, ϵ > 0 that
$$P\!\left( \left| \bar{X}_n - \mathbb{E}X \right| \ge \epsilon \right) = P\!\left( \bar{X}_n - \mathbb{E}X \ge \epsilon \right) + P\!\left( \bar{X}_n - \mathbb{E}X \le -\epsilon \right)$$
and noting that the second term can be bounded analogously to the first term by considering the random variables $-X_1, \ldots, -X_n$.
The law of large numbers and Hoeffding’s inequality tell us that when X
is light-tailed, we can estimate EX very precisely with “few” samples
using a sample mean. Crucially, the sample mean requires independent
samples xi from X which are often hard to obtain.
This quantity is also called the population risk. However, the underlying
distribution P is unknown to us. All that we can work with is the
training data, for which we assume $\mathcal{D}_n \overset{\mathrm{iid}}{\sim} \mathcal{P}$. It is therefore natural to consider minimizing
$$\frac{1}{n} \sum_{i=1}^{n} \ell(\hat{f}(x_i); y_i), \qquad \mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{n}, \tag{A.47}$$
A.4 Optimization
Readings
For a more thorough reminder of optimization methods, read
chapter 7 of “Mathematics for machine learning” (Deisenroth et al.,
2020).
where $\lim_{\delta \to 0} \frac{o(\|\delta\|_2)}{\|\delta\|_2} = 0$, and ∇f(x) is continuous on $\mathbb{R}^n$. Equation (A.49) is also called a first-order expansion of f at x.
$$0 \le f(x + \lambda d) - f(x) = \lambda \nabla f(x)^\top d + o(\lambda \|d\|_2). \quad \text{using a first-order expansion of } f \text{ around } x$$
Dividing by λ and taking the limit λ → 0, we obtain
$$0 \le \nabla f(x)^\top d + \lim_{\lambda \to 0} \frac{o(\lambda \|d\|_2)}{\lambda} = \nabla f(x)^\top d.$$
Take $d \doteq -\nabla f(x)$. Then, $0 \le -\|\nabla f(x)\|_2^2$, so $\nabla f(x) = 0$.
A.4.2 Convexity
Convex functions are a subclass of functions where finding global minima is substantially easier than for general functions.
Definition A.23 (Convex function). A function f : Rn → R is convex if
$$\forall x, y \in \mathbb{R}^n : \forall \theta \in [0, 1] : f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y). \tag{A.50}$$
That is, any line between two points on f lies "above" f. If the inequality of Equation (A.50) is strict, we say that f is strictly convex.
[Figure A.5: Example of a convex function. Any line between two points on f lies "above" f.]
Theorem A.24 (First-order characterization of convexity). Suppose that f : Rn → R is differentiable, then f is convex if and only if
$$\forall x, y \in \mathbb{R}^n : f(y) \ge f(x) + \nabla f(x)^\top (y - x). \tag{A.51}$$
Observe that the right-hand side of the inequality is an affine function with slope ∇f(x) based at f(x).
$$\lim_{\lambda \to 0} \frac{f(x + \lambda d) - f(x)}{\lambda} = \nabla f(x)^\top d \tag{A.52}$$
Dividing by θ yields,
$$\frac{f(x + \theta(y - x)) - f(x)}{\theta} \le f(y) - f(x).$$
Taking the limit θ → 0 on both sides gives the directional derivative at x in direction y − x,
$$f(y) \ge f(z) + \nabla f(z)^\top (y - z), \quad \text{and} \quad f(x) \ge f(z) + \nabla f(z)^\top (x - z).$$
$$f(y) \ge f(x^\star) + \underbrace{\nabla f(x^\star)^\top (y - x^\star)}_{= 0} = f(x^\star).$$
where r (θ) is large for “complex” and small for “simple” choices
of θ, respectively. A common choice is r (θ) = λ2 ∥θ∥22 for some
λ > 0, which is known as L2 -regularization.
That is, in each step, θ decays towards zero at the rate ληt . This
regularization method is also called weight decay.
Problems
We will now show that the sample mean is consistent. To this end, let
us first derive two classical concentration inequalities.
1. Prove Markov’s inequality which says that if X is a non-negative
random variable, then for any ϵ > 0,
$$P(X \ge \epsilon) \le \frac{\mathbb{E}X}{\epsilon}. \tag{A.71}$$
2. Prove Chebyshev’s inequality which says that if X is a random vari-
able with finite and non-zero variance, then for any ϵ > 0,
$$P\!\left( |X - \mathbb{E}X| \ge \epsilon \right) \le \frac{\mathrm{Var}X}{\epsilon^2}. \tag{A.72}$$
3. Prove the weak law of large numbers from Equation (A.35).
$$Df(x)[d] \doteq \lim_{\lambda \to 0} \frac{f(x + \lambda d) - f(x)}{\lambda}. \tag{A.73}$$
Show that
Fundamentals of Inference
$$P(B) = P(A) + P(B \setminus A).$$
$$P(A \cup \bar{A}) = P(\Omega) = 1$$
$$P(A \cup \bar{A}) = P(A) + P(\bar{A}).$$
of w, and we write $\deg(w) \doteq |\Gamma(w)|$. We have,
That is, for all neighbors u′ of u, P(u → v) = P(u′ → v). Using that
the graph is connected and finite, we conclude P(u → v) = P(w → v)
for any vertex w. Finally, note that P(v → v) = 1, and hence, the
random walk starting at any vertex u visits the vertex v eventually
with probability 1.
Solution to Problem 1.4. Let $\Sigma \doteq \mathrm{Var}[X]$ be a covariance matrix of the random vector X and fix any $z \in \mathbb{R}^n$. Then,
$$z^\top \Sigma z = z^\top \mathbb{E}\!\left[ (X - \mathbb{E}[X])(X - \mathbb{E}[X])^\top \right] z \quad \text{using the definition of variance (1.34)}$$
$$= \mathbb{E}\!\left[ z^\top (X - \mathbb{E}[X])(X - \mathbb{E}[X])^\top z \right]. \quad \text{using linearity of expectation (1.20)}$$
Define the random variable $Z \doteq z^\top (X - \mathbb{E}[X])$. Then,
$$= \mathbb{E}\!\left[ Z^2 \right] \ge 0.$$
$$P(P \mid D) = P(\bar{P} \mid \bar{D}) = 0.99. \quad \text{the test is accurate}$$
$$P(D \mid P) = \frac{P(P \mid D) \cdot P(D)}{P(P)}.$$
From the quantities above, we have everything except for P(P). This, however, we can compute using the law of total probability,
$$P(P) = P(P \mid D) \cdot P(D) + P(P \mid \bar{D}) \cdot P(\bar{D})$$
$$(x - \mu)^\top \Sigma^{-1} (x - \mu) = (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) + (x - \mu_2)^\top \Sigma_2^{-1} (x - \mu_2) + \mathrm{const}.$$
$$x^\top \Sigma^{-1} x - 2 x^\top \Sigma^{-1} \mu + \mu^\top \Sigma^{-1} \mu.$$
$$\Sigma^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1} \quad \text{and} \quad \Sigma^{-1} \mu = \Sigma_1^{-1} \mu_1 + \Sigma_2^{-1} \mu_2.$$
Consider the joint Gaussian random vector $Z \doteq [X\; Y] \sim \mathcal{N}(\mu, \Sigma)$ and assume that $X \sim \mathcal{N}(\mu_X, \Sigma_X)$ and $Y \sim \mathcal{N}(\mu_Y, \Sigma_Y)$ are uncorrelated. Then, Σ can be expressed as
$$\Sigma = \begin{bmatrix} \Sigma_X & 0 \\ 0 & \Sigma_Y \end{bmatrix},$$
$$\mathcal{N}([x\; y]; \mu, \Sigma) = \frac{1}{\sqrt{\det(2\pi \Sigma_X) \cdot \det(2\pi \Sigma_Y)}} \exp\!\left( -\frac{1}{2} (x - \mu_X)^\top \Sigma_X^{-1} (x - \mu_X) - \frac{1}{2} (y - \mu_Y)^\top \Sigma_Y^{-1} (y - \mu_Y) \right) = \mathcal{N}(x; \mu_X, \Sigma_X) \cdot \mathcal{N}(y; \mu_Y, \Sigma_Y).$$
p( x) = p( x A , x B )
" #⊤ " #" #
1 1 x A − µ A Λ AA Λ AB xA − µA
= exp−
Z 2 xB − µB Λ BA Λ BB xB − µB
1. We obtain
p( x A )
" #⊤ "
# " #
1 1
Z
∆ A Λ AA Λ AB ∆A
= exp− dx B using the sum rule (1.7)
Z 2 ∆B Λ BA Λ BB ∆B
i
1 1h ⊤ −1
= exp − ∆ A (Λ AA − Λ AB Λ BB Λ BA )∆ A using the first hint
Z 2
1h
Z i
· exp − (∆ B + Λ− 1
BB Λ BA ∆ A )⊤ Λ BB (∆ B + Λ− 1
BB Λ BA ∆ A ) dx B .
2
2. We obtain

p(x_B | x_A)
= p(x_A, x_B) / p(x_A)    using the definition of conditional distributions (1.10)
= (1/Z′) exp(−½ [∆_A; ∆_B]⊤ [Λ_AA  Λ_AB; Λ_BA  Λ_BB] [∆_A; ∆_B])    noting that p(x_A) is constant with respect to x_B
= (1/Z′) exp(−½ ∆_A⊤(Λ_AA − Λ_AB Λ_BB⁻¹ Λ_BA)∆_A)    using the first hint
  · exp(−½ (∆_B + Λ_BB⁻¹Λ_BA∆_A)⊤ Λ_BB (∆_B + Λ_BB⁻¹Λ_BA∆_A))
= (1/Z″) exp(−½ (∆_B + Λ_BB⁻¹Λ_BA∆_A)⊤ Λ_BB (∆_B + Λ_BB⁻¹Λ_BA∆_A)).    observing that the first exponential is constant with respect to x_B

∆_B + Λ_BB⁻¹Λ_BA∆_A = x_B − (µ_B − Λ_BB⁻¹Λ_BA(x_A − µ_A)) = x_B − µ_{B|A}    using the third hint
Λ_BB⁻¹ = Σ_BB − Σ_BA Σ_AA⁻¹ Σ_AB = Σ_{B|A}.    using the second hint

Thus, x_B | x_A ∼ N(µ_{B|A}, Σ_{B|A}).
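The conditional formulas above are easy to check numerically. The following minimal numpy sketch (with a made-up 3-dimensional joint Gaussian) computes µ_{B|A} and Σ_{B|A} via the equivalent covariance-form expressions µ_{B|A} = µ_B + Σ_BA Σ_AA⁻¹(x_A − µ_A) and Σ_{B|A} = Σ_BB − Σ_BA Σ_AA⁻¹Σ_AB; all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint Gaussian over (x_A, x_B) with dim(A) = 1 and dim(B) = 2.
mu = np.array([0.0, 1.0, -1.0])
L = rng.normal(size=(3, 3))
Sigma = L @ L.T + 3 * np.eye(3)          # random symmetric positive definite covariance

iA, iB = [0], [1, 2]
S_AA = Sigma[np.ix_(iA, iA)]
S_AB = Sigma[np.ix_(iA, iB)]
S_BA = Sigma[np.ix_(iB, iA)]
S_BB = Sigma[np.ix_(iB, iB)]

x_A = np.array([0.5])                    # observed value of x_A

# Conditional mean and covariance of x_B given x_A.
mu_cond = mu[iB] + S_BA @ np.linalg.solve(S_AA, x_A - mu[iA])
Sigma_cond = S_BB - S_BA @ np.linalg.solve(S_AA, S_AB)

print("mu_{B|A} =", mu_cond)
print("Sigma_{B|A} =\n", Sigma_cond)
```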
= E[exp(t⊤AX)] · exp(t⊤b)
= E[exp(s⊤X)] · exp(t⊤b)
= φ_X(s) · exp(t⊤b)
= exp(s⊤µ + ½ s⊤Σs) · exp(t⊤b)
= exp(t⊤(Aµ + b) + ½ t⊤AΣA⊤t),

= φ_X(t) · φ_X′(t)
= exp(t⊤µ + ½ t⊤Σt) · exp(t⊤µ′ + ½ t⊤Σ′t)
= exp(t⊤(µ + µ′) + ½ t⊤(Σ + Σ′)t),
= (1/√(2π)) [∫_{−∞}^0 exp(−½y²) dy + ∫_0^∞ exp(−½y²) dy]
= (1/√(2π)) ∫_{−∞}^∞ exp(−½y²) dy = 1.    a PDF integrates to 1
X = Σ^{1/2} Y + µ

= p_X((…, −(y_i − µ_i) + µ_i, …)⊤)
= p_X((…, y_i, …)⊤)    since p_X is symmetric w.r.t. µ_i
= p_X(y).
= N (W; 0, 1).
we require P(y ≤ a | x) = c_1 / (c_1 + c_2).

We obtain a⋆(x) = µ_x + σ_x · Φ⁻¹(c_1 / (c_1 + c_2)) by transforming to a standard normal random variable.
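As a hedged illustration of this decision rule, the sketch below evaluates a⋆(x) = µ_x + σ_x Φ⁻¹(c_1/(c_1 + c_2)) with scipy and checks by sampling that P(y ≤ a⋆ | x) ≈ c_1/(c_1 + c_2); the predictive mean, standard deviation, and costs are made-up values, not taken from the problem.

```python
import numpy as np
from scipy.stats import norm

def asymmetric_threshold(mu_x, sigma_x, c1, c2):
    """Decision a*(x) = mu_x + sigma_x * Phi^{-1}(c1 / (c1 + c2)),
    i.e., the quantile at which P(y <= a | x) = c1 / (c1 + c2)."""
    return mu_x + sigma_x * norm.ppf(c1 / (c1 + c2))

# Hypothetical predictive distribution y | x ~ N(2.0, 0.5^2) and costs c1, c2.
a_star = asymmetric_threshold(mu_x=2.0, sigma_x=0.5, c1=1.0, c2=3.0)
print(a_star)

# Sanity check via sampling: the fraction of y below a* should be ~ c1/(c1+c2) = 0.25.
y = np.random.default_rng(0).normal(2.0, 0.5, size=100_000)
print(np.mean(y <= a_star))
```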
Linear Regression
= arg min_σ ∑_{i=1}^n [½ log(2σ²π) + (1/(2σ²))(y_i − w⊤x_i)²]
= arg min_σ (n/2) log σ² + (1/(2σ²)) ∑_{i=1}^n (y_i − w⊤x_i)².
Solution to Problem 2.3. Let us first derive the variance of the least
squares estimate:
Var[ŵ_ls | X] = Var[(X⊤X)⁻¹X⊤y | X]    using Equation (2.4)

Thus,

Var[ŵ_ls | X] = σ_n²(X⊤X)⁻¹ = (σ_n²/Z) [∑x_i²  −∑x_i; −∑x_i  n]    using Equation (B.2)

where Z = n(∑x_i²) − (∑x_i)².
µ = σn−2 ΣX ⊤ y.
Then,
# "
0.91
ŵMAP =µ= .
1.31
This means that after observing the (t + 1)-st data point, we have
that
2. One has to compute (σ_n⁻²X_t⊤X_t + σ_p⁻²I)⁻¹ for finding µ and Σ in every round. We can write

(σ_n⁻²X_{t+1}⊤X_{t+1} + σ_p⁻²I)⁻¹ = σ_n²(X_{t+1}⊤X_{t+1} + σ_n²σ_p⁻²I)⁻¹
= σ_n²(X_t⊤X_t + σ_n²σ_p⁻²I + x_{t+1}x_{t+1}⊤)⁻¹,

where we abbreviate A_t ≐ X_t⊤X_t + σ_n²σ_p⁻²I.
Solution to Problem 2.6. The law of total variance (2.18) yields the
following decomposition of the predictive variance,
wherein the first term corresponds to the aleatoric uncertainty and the
second term corresponds to the epistemic uncertainty.
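The decomposition itself is not reproduced above, so as an illustration only, the following sketch computes the standard split for Bayesian linear regression, Var[y⋆ | x⋆] = σ_n² + x⋆⊤Σx⋆, where the first (aleatoric) term is the noise variance and the second (epistemic) term comes from the posterior covariance Σ of the weights; the data and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal Bayesian linear regression sketch (prior w ~ N(0, sigma_p^2 I), noise sigma_n^2),
# decomposing the predictive variance at a test point into aleatoric + epistemic parts.
n, d = 20, 2
sigma_n, sigma_p = 0.5, 1.0
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -0.5])                       # hypothetical ground truth
y = X @ w_true + sigma_n * rng.normal(size=n)

Sigma_post = np.linalg.inv(X.T @ X / sigma_n**2 + np.eye(d) / sigma_p**2)
mu_post = Sigma_post @ X.T @ y / sigma_n**2

x_star = np.array([0.3, -1.2])
aleatoric = sigma_n**2                               # irreducible noise variance
epistemic = x_star @ Sigma_post @ x_star             # uncertainty about the weights
print("predictive variance:", aleatoric + epistemic,
      "= aleatoric", aleatoric, "+ epistemic", epistemic)
```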
Thus,

y | X, µ, λ ∼ N(Xµ, λ⁻¹XX⊤ + λ⁻¹I_n).

We can simplify

−log N(y; Xµ, Σ_y) = ½(y − Xµ)⊤Σ_y⁻¹(y − Xµ) + const
= ½ µ⊤X⊤Σ_y⁻¹Xµ − µ⊤X⊤Σ_y⁻¹y + const.

Taking the gradient with respect to µ, we obtain

∇_µ [½ µ⊤X⊤Σ_y⁻¹Xµ − y⊤Σ_y⁻¹Xµ + const] = X⊤Σ_y⁻¹Xµ − X⊤Σ_y⁻¹y.

= ½(y − Xµ)⊤Σ_y⁻¹(y − Xµ) + ½µ⊤µ + const
= ½ µ⊤(X⊤Σ_y⁻¹X + I_d)µ − µ⊤X⊤Σ_y⁻¹y + const.

This is a quadratic in µ, so the posterior distribution must be Gaussian. By matching the terms with

log N(x; µ′, Σ′) = −½ x⊤Σ′⁻¹x + x⊤Σ′⁻¹µ′ + const,

we obtain the covariance matrix Σ_µ = (X⊤Σ_y⁻¹X + I_d)⁻¹ and the mean vector Σ_µ X⊤Σ_y⁻¹y.

4. Let Σ̃_y ≐ λΣ_y. By Bayes' rule (1.45),

with α = 1 + n/2 and β = 1 + ½(y − Xµ)⊤Σ̃_y⁻¹(y − Xµ).
Filtering
= (1/Z″) ∫ exp(−½ [(x_{t+1} − x_t)²/σ_x² + (x_t − µ_t)²/σ_t²]) dx_t    using the motion model (3.12a) and the previous update
= (1/Z″) ∫ exp(−½ [σ_t²(x_{t+1} − x_t)² + σ_x²(x_t − µ_t)²] / (σ_t²σ_x²)) dx_t.

The exponent is the sum of two expressions that are quadratic in x_t. Completing the square allows rewriting any quadratic ax² + bx + c as the sum of a squared term a(x + b/(2a))² and a residual term c − b²/(4a) that is independent of x. In this case, we have a = (σ_t² + σ_x²)/(σ_t²σ_x²), b = −2(σ_t²x_{t+1} + σ_x²µ_t)/(σ_t²σ_x²), and c = (σ_t²x_{t+1}² + σ_x²µ_t²)/(σ_t²σ_x²). The residual term can be taken outside the integral, giving

= (1/Z″) exp(−½ (c − b²/(4a))) ∫ exp(−(a/2)(x_t + b/(2a))²) dx_t.

The integral is simply the integral of a Gaussian over its entire support, and thus evaluates to 1. We are therefore left with only the residual term from the quadratic. Plugging back in the expressions for a, b, and c and simplifying, we obtain

= (1/Z″) exp(−½ (x_{t+1} − µ_t)² / (σ_t² + σ_x²)).

That is, X_{t+1} | y_{1:t} ∼ N(µ_t, σ_t² + σ_x²).
Assume after t − 1 steps that (µt−1 , Σ t−1 ) coincide with the BLR poste-
rior for the first t − 1 data points. We will show that the Kalman filter
update equations yield the BLR posterior for the first t data points.
Covariance update:

Σ_t = Σ_{t−1} − k_t x_t⊤ Σ_{t−1}
= Σ_{t−1} − (Σ_{t−1}x_t)(Σ_{t−1}x_t)⊤ / (x_t⊤Σ_{t−1}x_t + 1)    using the symmetry of Σ_{t−1}
= (Σ_{t−1}⁻¹ + x_t x_t⊤)⁻¹    using the Sherman-Morrison formula (A.67) with A⁻¹ = Σ_{t−1}
= (X_{1:t−1}⊤X_{1:t−1} + x_t x_t⊤ + I)⁻¹    by the inductive hypothesis and using Equation (2.10)
= (X_{1:t}⊤X_{1:t} + I)⁻¹
= Σ_t^BLR.    using Equation (2.10)

Mean update:

Σ_t⁻¹µ_t = Σ_t⁻¹µ_{t−1} + Σ_t⁻¹k_t(y_t − x_t⊤µ_{t−1})
= Σ_{t−1}⁻¹µ_{t−1} + x_t x_t⊤µ_{t−1} + Σ_t⁻¹k_t(y_t − x_t⊤µ_{t−1})    using Σ_t⁻¹ = Σ_{t−1}⁻¹ + x_t x_t⊤
= Σ_{t−1}⁻¹µ_{t−1} + x_t x_t⊤µ_{t−1} + x_t(y_t − x_t⊤µ_{t−1})    using Σ_t⁻¹k_t = x_t
= Σ_{t−1}⁻¹µ_{t−1} + x_t y_t    canceling terms
= X_{1:t−1}⊤y_{1:t−1} + x_t y_t    by the inductive hypothesis and using Equation (2.10)
= X_{1:t}⊤y_{1:t}
= Σ_t⁻¹Σ_t X_{1:t}⊤y_{1:t}
= Σ_t⁻¹Σ_t^BLR X_{1:t}⊤y_{1:t}    using the covariance update above
= Σ_t⁻¹µ_t^BLR.    using Equation (2.10)
x_t = π    ∀t,
y_t = x_t + η_t,    η_t ∼ N(0, σ_y²).

k_{t+1} = σ_t² / (σ_t² + σ_y²),
σ_{t+1}² = σ_y² k_{t+1} = σ_t²σ_y² / (σ_t² + σ_y²).    using Equation (3.17)

1/σ_{t+1}² = 1/σ_t² + 1/σ_y² = 1/σ_{t−1}² + 2/σ_y² = ··· = 1/σ_0² + (t+1)/σ_y²,

yielding,

µ_{t+1} = µ_t + (1/(t+1))(y_{t+1} − µ_t)    using Equation (3.16)
= (t/(t+1)) µ_t + y_{t+1}/(t+1)
= (t/(t+1)) [((t−1)/t) µ_{t−1} + y_t/t] + y_{t+1}/(t+1)    using Equation (3.16)
= ((t−1)/(t+1)) µ_{t−1} + (y_t + y_{t+1})/(t+1)
⋮
= (y_1 + ··· + y_{t+1}) / (t+1),

which is simply the sample mean.
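A small simulation makes this concrete: running the scalar Kalman filter above on noisy observations of a constant signal (with a very flat prior, standing in for σ_0 → ∞) reproduces the running sample mean. The noise level and prior below are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the 1D Kalman filter for a constant signal x_t = pi with observation noise
# sigma_y; with a very flat prior the posterior mean tracks the running sample mean.
sigma_y = 1.0
mu, var = 0.0, 1e8          # prior N(mu_0, sigma_0^2) with a very large sigma_0
ys = np.pi + sigma_y * rng.normal(size=50)

for t, y in enumerate(ys, start=1):
    k = var / (var + sigma_y**2)          # Kalman gain (no motion noise here)
    mu = mu + k * (y - mu)
    var = sigma_y**2 * k
    assert np.isclose(mu, np.mean(ys[:t]), atol=1e-4)

print(mu, np.mean(ys))      # both approximate pi
```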
Gaussian Processes
(1/√(j!)) e^{−x²/2} x^j · (1/√(j!)) e^{−y²/2} y^j = e^{−(x²+y²)/2} (xy)^j / j!.
µ ( t ) = E [ X t ] = E [ X t −1 + ε t −1 ] = E [ X t −1 ] = µ ( t − 1 ).
[X_t; X_{t′}] = [X_t; X_t] + [0; ∑_{τ=t}^{t′−1} ε_τ].
We take the kernel kKF (t, t′ ) to be the covariance between f (t) and
f (t′ ), which is Var[ Xt ] = σ02 + σx2 t. Notice, however, that we assumed
t ≤ t′ . Thus, overall, the kernel is described by
f(x) = ∑_{i=1}^n β_i k(x, x_i)
= ∑_{i=1}^n β_i ⟨k(x_i, ·), k(x, ·)⟩_k
= ⟨∑_{i=1}^n β_i k(x_i, ·), k(x, ·)⟩_k
= ⟨f(·), k(x, ·)⟩_k.
2. By applying Cauchy-Schwarz,
Solution to Problem 4.5. We denote by f A = Π A f the orthogonal
projection of f onto span{k( x1 , ·), . . . , k( xn , ·)} which implies that fˆA
is a linear combination of k( x1 , ·), . . . , k( xn , ·).
We then have that f A⊥ = f − f A is orthogonal to span{k( x1 , ·), . . . , k( xn , ·)}.
Therefore, for any i ∈ [n],
f(x_i) = ⟨f, k(x_i, ·)⟩_k = ⟨f_A + f_A⊥, k(x_i, ·)⟩_k = ⟨f_A, k(x_i, ·)⟩_k = f_A(x_i),

which implies

L(f(x_1), . . . , f(x_n)) = L(f_A(x_1), . . . , f_A(x_n)).
−log p(y_{1:n} | x_{1:n}, f) = (1/(2σ_n²)) ‖y − f‖₂² + const
= (1/(2σ_n²)) ‖y − Kα̂‖₂² + const.

α̂ = arg min_{α∈Rⁿ} (1/(2σ_n²)) ‖y − Kα‖₂² + ½ ‖α‖_K²

α⊤(σ_n²K + K²)α − 2y⊤Kα + y⊤y.
We can simplify to

= ½ tr(y⊤K_{y,θ}⁻¹ (∂K_{y,θ}/∂θ_j) K_{y,θ}⁻¹ y) − ½ tr(K_{y,θ}⁻¹ ∂K_{y,θ}/∂θ_j)    using that y⊤K_{y,θ}⁻¹(∂K_{y,θ}/∂θ_j)K_{y,θ}⁻¹y is a scalar
= ½ tr(K_{y,θ}⁻¹ y y⊤ K_{y,θ}⁻¹ ∂K_{y,θ}/∂θ_j − K_{y,θ}⁻¹ ∂K_{y,θ}/∂θ_j)    using the cyclic property and linearity of the trace
= ½ tr(K_{y,θ}⁻¹ y (K_{y,θ}⁻¹ y)⊤ ∂K_{y,θ}/∂θ_j − K_{y,θ}⁻¹ ∂K_{y,θ}/∂θ_j)    using that K_{y,θ} is symmetric
= ½ tr((αα⊤ − K_{y,θ}⁻¹) ∂K_{y,θ}/∂θ_j).
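The trace expression above can be verified numerically. The sketch below compares it with a finite-difference approximation of the marginal log-likelihood for the lengthscale h of a Gaussian (RBF) kernel with fixed noise variance; the data, kernel, and hyperparameter values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Check the gradient formula (1/2) tr((alpha alpha^T - K^{-1}) dK/dh) against a
# finite difference of the marginal log-likelihood for an RBF kernel lengthscale h.
X = rng.normal(size=(15, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=15)
sigma_n = 0.1

def K_y(h):
    sq = (X - X.T) ** 2
    return np.exp(-sq / (2 * h**2)) + sigma_n**2 * np.eye(len(X))

def mll(h):
    K = K_y(h)
    alpha = np.linalg.solve(K, y)
    return -0.5 * y @ alpha - 0.5 * np.linalg.slogdet(K)[1] - 0.5 * len(y) * np.log(2 * np.pi)

h = 0.8
K = K_y(h)
alpha = np.linalg.solve(K, y)
dK_dh = np.exp(-(X - X.T) ** 2 / (2 * h**2)) * ((X - X.T) ** 2) / h**3   # d/dh of the RBF part
grad = 0.5 * np.trace((np.outer(alpha, alpha) - np.linalg.inv(K)) @ dK_dh)

eps = 1e-5
print(grad, (mll(h + eps) - mll(h - eps)) / (2 * eps))   # should agree closely
```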
∂/∂θ_0 log p(y | X, θ) = ½ tr((θ_0⁻² K̃⁻¹ y (K̃⁻¹ y)⊤ − θ_0⁻¹ K̃⁻¹) K̃).    using Equation (4.30)

Simplifying the terms and using linearity of the trace, we obtain that

∂/∂θ_0 log p(y | X, θ) = 0  ⟺  θ_0 = (1/n) tr(y y⊤ K̃⁻¹).

If we define Λ̃ ≐ K̃⁻¹ as the precision matrix associated to y for the covariance function k̃, we can express θ_0⋆ in closed form as

θ_0⋆ = (1/n) ∑_{i=1}^n ∑_{j=1}^n Λ̃(i, j) y_i y_j. (B.4)
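As a quick sanity check of (B.4), the sketch below draws y from a GP prior with a known output scale and recovers it via θ_0⋆ = (1/n) tr(y y⊤ K̃⁻¹); the kernel, noise level, and true scale (2.5) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Closed-form output scale (B.4): theta_0* = (1/n) tr(y y^T K_tilde^{-1}),
# with K_tilde the base kernel matrix (here an RBF kernel plus a small jitter).
X = rng.normal(size=(30, 1))
K_tilde = np.exp(-(X - X.T) ** 2 / 2) + 1e-2 * np.eye(30)
y = rng.multivariate_normal(np.zeros(30), 2.5 * K_tilde)   # true output scale 2.5

Lambda = np.linalg.inv(K_tilde)                            # precision matrix
theta0 = np.trace(np.outer(y, y) @ Lambda) / len(y)        # = sum_ij Lambda[i,j] y_i y_j / n
print(theta0)                                              # should be roughly 2.5
```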
P(|f(∆_i)| ≥ ϵ) ≤ 2 exp(−mϵ²/4).
2. We can apply Markov's inequality (A.71) to obtain

P(‖∇f(∆⋆)‖₂ ≥ ϵ/(2r)) = P(‖∇f(∆⋆)‖₂² ≥ (ϵ/(2r))²)
≤ (2r/ϵ)² E[‖∇f(∆⋆)‖₂²].
It remains to bound the expectation. We have
Note that E∇s(∆) = ∇k(∆) using Equation (A.5) and using that
s(∆) is an unbiased estimator of k(∆). Therefore,
and therefore,

P(sup_{∆∈M_∆} |f(∆)| ≥ ϵ) ≤ P(⋃_{i=1}^T {|f(∆_i)| ≥ ϵ/2} ∪ {‖∇f(∆⋆)‖₂ ≥ ϵ/(2r)})
≤ P(⋃_{i=1}^T {|f(∆_i)| ≥ ϵ/2}) + P(‖∇f(∆⋆)‖₂ ≥ ϵ/(2r))    using a union bound (1.73)
≤ 2T exp(−mϵ²/4) + (2rσ_p/ϵ)²    using the results from (2) and (3)
≤ αr^{−d} + βr²    using T ≤ (4 diam(M)/r)^d

with α ≐ 2(4 diam(M))^d exp(−mϵ²/4) and β ≐ (2σ_p/ϵ)². Using the hint, we obtain
= 2 β^{d/(d+2)} α^{2/(d+2)}
= 2² (2³ σ_p diam(M) / ϵ)^{2d/(d+2)} exp(−mϵ² / (2³(d+2)))
≤ 2⁸ (σ_p diam(M) / ϵ)² exp(−mϵ² / (2³(d+2)))    using σ_p diam(M)/ϵ ≥ 1
5. We have
σ_p² = E_{ω∼p}[ω⊤ω]
= ∫ ω⊤ω · p(ω) dω
= ∫ ω⊤ω · e^{iω⊤0} p(ω) dω.

Now observe that ∂²/∂∆_j² ∫ e^{iω⊤∆} p(ω) dω = −∫ ω_j² e^{iω⊤∆} p(ω) dω. Thus,

= −tr(H_∆ ∫ p(ω) e^{iω⊤∆} dω)|_{∆=0}
= −tr(H_∆ k(0)).    using that p is the Fourier transform of k (4.38)

Finally, we have for the Gaussian kernel that

∂²/∂∆_j² exp(−∆⊤∆/(2h²))|_{∆_j=0} = −1/h².
We know that f̃ and u are jointly Gaussian, and hence, the marginal distribution of f̃ is also Gaussian. We have for the mean and variance that

E[f̃] = E[E[f̃ | u]]    using the tower rule (1.25)
= E_u[[K_AU K_UU⁻¹; K_⋆U K_UU⁻¹] u]    using Equation (B.5)
= [K_AU K_UU⁻¹; K_⋆U K_UU⁻¹] E[u] = 0    using linearity of expectation (1.20)

Var[f̃] = E[Var[f̃ | u]] + Var[E[f̃ | u]]    using the law of total variance (1.41); the first term is 0
= Var_u[[K_AU K_UU⁻¹; K_⋆U K_UU⁻¹] u]    using Equation (B.5)
= [K_AU K_UU⁻¹; K_⋆U K_UU⁻¹] Var[u] [K_AU K_UU⁻¹; K_⋆U K_UU⁻¹]⊤    using Equation (1.38) and Var[u] = K_UU
= [Q_AA  Q_A⋆; Q_⋆A  Q_⋆⋆].
Variational Inference
∇_w ℓ_log(w⊤x; y) = ∇_w log(1 + exp(−y w⊤x))    using the definition of the logistic loss (5.13)
= ∇_w log((1 + exp(y w⊤x)) / exp(y w⊤x))
= ∇_w [log(1 + exp(y w⊤x)) − y w⊤x]
= (1/(1 + exp(y w⊤x))) · exp(y w⊤x) · yx − yx    using the chain rule
= −yx · (1 − exp(y w⊤x)/(1 + exp(y w⊤x)))
= −yx · σ(−y w⊤x).    using the definition of the logistic function (5.9)
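A numerical gradient check of this result is straightforward; the sketch below compares −y x σ(−y w⊤x) against central finite differences of the logistic loss for made-up values of w, x, and y.

```python
import numpy as np

# Numerical check of grad_w l_log(w; x, y) = -y * x * sigmoid(-y * w @ x).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, x, y):
    return np.log1p(np.exp(-y * (w @ x)))

w = np.array([0.3, -1.2, 0.7])
x = np.array([1.0, 2.0, -0.5])
y = -1.0

analytic = -y * x * sigmoid(-y * (w @ x))

eps = 1e-6
numeric = np.array([
    (logistic_loss(w + eps * e, x, y) - logistic_loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(3)
])
print(analytic, numeric)   # should match to ~1e-6
```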
2. As suggested in the hint, we compute the first derivative of σ,
σ′(z) ≐ d/dz σ(z) = d/dz [1/(1 + exp(−z))]    using the definition of the logistic function (5.9)
= exp(−z) / (1 + exp(−z))²    using the quotient rule
= σ(z) · (1 − σ(z)).    using the definition of the logistic function (5.9)
We get for the Hessian of ℓlog ,
Thus,
= log p(y1:n | f ) + log p( f | x1:n ) − log p(y1:n | x1:n ) using Bayes’ rule (1.45)
In the following, we write N(z) ≐ N(z; 0, σ_n²) and Φ(z) ≐ Φ(z; 0, σ_n²) to simplify the notation. Differentiating with respect to f_i, we obtain

∂/∂f_i log Φ(y_i f_i) = y_i N(f_i) / Φ(y_i f_i)    using N(y_i f_i) = N(f_i) since N is an even function and y_i ∈ {±1}
∂²/∂f_i² log Φ(y_i f_i) = −N(f_i)²/Φ(y_i f_i)² − y_i f_i N(f_i)/(σ_n² Φ(y_i f_i)),

and W = −diag{∂²/∂f_i² log Φ(y_i f_i)}_{i=1}^n evaluated at f = f̂.
(b) Note that Λ′ is a precision matrix over weights w and f = Xw,
so by Equation (1.38) the corresponding variance over latent
values f is XΛ′−1 X ⊤ . The two precision matrices are therefore
equivalent if Λ−1 = XΛ′−1 X ⊤ .
Analogously to Example 5.2 and Problem 5.1, we have that
W = −H_f log p(y_{1:n} | f)|_{f = f̂}

Λ′⁻¹ = (I + X⊤WX)⁻¹
= I − X⊤(W⁻¹ + XX⊤)⁻¹X.

Thus,

XΛ′⁻¹X⊤ = XX⊤ − XX⊤(W⁻¹ + XX⊤)⁻¹XX⊤
= K_AA − K_AA(W⁻¹ + K_AA)⁻¹K_AA    using that K_AA = XX⊤
= (K_AA⁻¹ + W)⁻¹    by the matrix inversion lemma (A.68)
= Λ⁻¹.
f⋆ | x⋆, f ∼ N(µ⋆, k⋆), where (B.6a)
µ⋆ ≐ k_{x⋆,A}⊤ K_AA⁻¹ f, (B.6b)
k⋆ ≐ k(x⋆, x⋆) − k_{x⋆,A}⊤ K_AA⁻¹ k_{x⋆,A}. (B.6c)

= k_{x⋆,A}⊤ K_AA⁻¹ E_q[f | x_{1:n}, y_{1:n}]    using Equation (B.6b) and linearity of expectation (1.20)
= k_{x⋆,A}⊤ K_AA⁻¹ f̂. (B.7)

= k_{x⋆,A}⊤ (∇_f log p(y_{1:n} | f)).

= k⋆ + k_{x⋆,A}⊤ K_AA⁻¹ Var_q[f | x_{1:n}, y_{1:n}] K_AA⁻¹ k_{x⋆,A}    using that k⋆ is independent of f, Equation (1.38), and symmetry of K_AA⁻¹
= k(x⋆, x⋆) − k_{x⋆,A}⊤ K_AA⁻¹ k_{x⋆,A} + k_{x⋆,A}⊤ K_AA⁻¹ (K_AA⁻¹ + W)⁻¹ K_AA⁻¹ k_{x⋆,A}    plugging in the expression for k⋆ (B.6c)
= k(x⋆, x⋆) − k_{x⋆,A}⊤ (K_AA + W⁻¹)⁻¹ k_{x⋆,A}. (B.8)    using the matrix inversion lemma (A.68)
ferred.
3. We have
Eε 1 { f ( x ) + ε ≥ 0 } = Pε ( f ( x ) + ε ≥ 0 ) using E[ X ] = p if X ∼ Bern( p)
= Pε (−ε ≤ f ( x))
= Pε (ε ≤ f ( x)) using that the distribution of ε is
symmetric around 0
= Φ( f ( x); 0, σn2 ).
∀ x1 , x2 , ∀λ ∈ [0, 1] : f (λx1 + (1 − λ) x2 ) ≤ λ f ( x1 ) + (1 − λ) f ( x2 ).
If y = −1 then

ℓ_bce(ŷ; y) = −log(1 − ŷ) = log(1 + e^{f̂}) = ℓ_log(f̂; y).

Here the second equality follows from the simple algebraic fact

1 − 1/(1 + e^{−z}) = 1 − e^z/(e^z + 1) = 1/(1 + e^z).    multiplying by e^z/e^z
KL(p‖q) = E_{x∼p}[log(p(x)/q(x))]    using the definition of KL-divergence (5.34)
= E_{x∼p}[S[q(x)/p(x)]]
≥ S[E_{x∼p}[q(x)/p(x)]]    using Jensen's inequality (5.29)
= S[∫ q(x) dx]
2. We observe from the derivation of (1) that KL(p‖q) = 0 iff equality holds for Jensen's inequality. Now, if p and q are discrete with finite and identical support, it follows from the hint that Jensen's inequality degenerates to an equality iff p and q are pointwise identical.
H[ g] − H[ f ] = KL( f ∥ g) ≥ 0,
L(q, λ_0, λ_1) = ∫ q(x, y) log(q(x, y)/p(x, y)) dx dy    the objective, using (5.34)
+ λ_0 (1 − ∫ q(x, y) dx dy)    the normalization constraint
+ ∫ λ_1(y) (δ_{y′}(y) − ∫ q(x, y) dx) dy    the data constraint
= ∫ q(x, y) [log(q(x, y)/p(x, y)) − λ_0 − λ_1(y)] dx dy + const.

We have

∂L(q, λ_0, λ_1)/∂q(x, y) = log(q(x, y)/p(x, y)) − λ_0 − λ_1(y) + 1.    using the product rule of differentiation

q(x, y) = exp(λ_0 + λ_1(y) − 1) p(x, y) = (1/Z) exp(λ_1(y)) p(x, y)

where Z ≐ exp(1 − λ_0) denotes the normalizing constant. We can determine λ_1(y) from the data constraint:

∫ q(x, y) dx = (1/Z) exp(λ_1(y)) ∫ p(x, y) dx
= (1/Z) exp(λ_1(y)) · p(y),    using the sum rule (1.7)
which must equal δ_{y′}(y).

It follows that q(x, y) = δ_{y′}(y) · p(x, y)/p(y) = δ_{y′}(y) · p(x | y).    using the definition of conditional distributions (1.10)

2. From the sum rule (1.7), we obtain

q(x) = ∫ q(x, y) dy = ∫ δ_{y′}(y) · p(x | y) dy = p(x | y′).
KL(p‖q) = E_{x∼p}[log p(x) − log q(x)]    using the definition of KL-divergence (5.34)
= E_{x∼p}[½ log(det Σ_q / det Σ_p) − ½(x − µ_p)⊤Σ_p⁻¹(x − µ_p) + ½(x − µ_q)⊤Σ_q⁻¹(x − µ_q)]    using the Gaussian PDF (1.51)
= ½ log(det Σ_q / det Σ_p) − ½ E_{x∼p}[(x − µ_p)⊤Σ_p⁻¹(x − µ_p)] + ½ E_{x∼p}[(x − µ_q)⊤Σ_q⁻¹(x − µ_q)]    using linearity of expectation (1.20)

As (x − µ_p)⊤Σ_p⁻¹(x − µ_p) ∈ R, we can rewrite the second term as

½ E_{x∼p}[tr((x − µ_p)⊤Σ_p⁻¹(x − µ_p))]
= ½ E_{x∼p}[tr((x − µ_p)(x − µ_p)⊤Σ_p⁻¹)]    using the cyclic property of the trace
= ½ tr(E_{x∼p}[(x − µ_p)(x − µ_p)⊤] Σ_p⁻¹)    using linearity of the trace and linearity of expectation (1.20)
= ½ tr(Σ_p Σ_p⁻¹)    using the definition of the covariance matrix (1.34)
= ½ tr(I) = d/2.

For the third term, we use the hint (5.81) to obtain

½ E_{x∼p}[(x − µ_q)⊤Σ_q⁻¹(x − µ_q)] = ½ [(µ_p − µ_q)⊤Σ_q⁻¹(µ_p − µ_q) + tr(Σ_q⁻¹Σ_p)].

Putting all terms together we get

KL(p‖q) = ½ [log(det Σ_q / det Σ_p) − d + (µ_p − µ_q)⊤Σ_q⁻¹(µ_p − µ_q) + tr(Σ_q⁻¹Σ_p)].
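The closed form lends itself to a direct implementation. The function below evaluates KL(p‖q) for multivariate Gaussians exactly as derived, and the example (with made-up means and covariances) also checks that KL(p‖p) = 0.

```python
import numpy as np

def kl_gaussians(mu_p, Sigma_p, mu_q, Sigma_q):
    """Closed-form KL(p || q) between multivariate Gaussians, as derived above."""
    d = len(mu_p)
    Sq_inv = np.linalg.inv(Sigma_q)
    diff = mu_p - mu_q
    return 0.5 * (np.linalg.slogdet(Sigma_q)[1] - np.linalg.slogdet(Sigma_p)[1]
                  - d + diff @ Sq_inv @ diff + np.trace(Sq_inv @ Sigma_p))

# Hypothetical example; KL(p || p) should be exactly 0.
mu_p, Sigma_p = np.zeros(2), np.array([[1.0, 0.3], [0.3, 2.0]])
mu_q, Sigma_q = np.array([0.5, -0.5]), np.eye(2)
print(kl_gaussians(mu_p, Sigma_p, mu_q, Sigma_q))
print(kl_gaussians(mu_p, Sigma_p, mu_p, Sigma_p))   # == 0
```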
KL(p‖q)
= ∑_x ∑_y p(x, y) log₂ (p(x, y) / (q(x)q(y)))    using the definition of KL-divergence (5.34)
= ∑_x ∑_y p(x, y) log₂ p(x, y) − ∑_x ∑_y p(x, y) log₂ q(x) − ∑_x ∑_y p(x, y) log₂ q(y)
= ∑_x ∑_y p(x, y) log₂ p(x, y) − ∑_x p(x) log₂ q(x) − ∑_y p(y) log₂ q(y)    using the sum rule (1.7)
⋮
+ KL(p(y)‖q(y))
q( x ) = p( x ) and q ( y ) = p ( y ).
We can easily observe from the above formula that the support
of q must be a subset of the support of p. In other words, if
q( x, y) is positive outside the support of p (i.e., when p( x, y) = 0)
then KL(q∥ p) = ∞. Hence, the reverse KL-divergence has an infi-
nite value except when the support of q is either {1, 2} × {1, 2} or
{(3, 3)} or {(4, 4)}. Thus, it has three local minima.
For the first case, the minimum is achieved when q( x ) = q(y) =
( 12 , 12 , 0, 0). The corresponding KL-divergence is KL(q∥ p) = log2 2 =
1. For the second case and the third case, q( x ) = q(y) = (0, 0, 1, 0)
and q( x ) = q(y) = (0, 0, 0, 1), respectively. The KL-divergence in
both cases is KL(q∥ p) = log2 4 = 2.
3. Let us compute p(x = 4) and p(y = 1):

p(x = 4) = ∑_y p(x = 4, y) = 1/4,
p(y = 1) = ∑_x p(x, y = 1) = 1/4.

Hence, q(x = 4, y = 1) = p(x = 4) p(y = 1) = 1/16, however, p(x = 4, y = 1) = 0. We therefore have for the reverse KL-divergence that KL(q‖p) = ∞.
= arg max_{q∈Q} E_{θ∼q}[log p(y_{1:n}, θ | x_{1:n})] + H[q]    using Equation (5.54)
= arg max_{q∈Q} E_{θ∼q}[log p(y_{1:n}, θ | x_{1:n})] + (n/2) log(2πe) + ½ log det(Σ).    using Equation (5.31)
Solution to Problem 5.11. To simplify the notation, we write Σ ≐ diag{σ_1², . . . , σ_d²}. The reverse KL-divergence can be expressed as

KL(q_λ‖p(·)) = ½ [tr(σ_p⁻²Σ) + σ_p⁻²µ⊤µ − d + log((σ_p²)^d / det(Σ))]    using the expression for the KL-divergence of Gaussians (5.41)
= ½ [σ_p⁻² ∑_{i=1}^d σ_i² + σ_p⁻²µ⊤µ − d + d log σ_p² − ∑_{i=1}^d log σ_i²].
X = e^{Z_1} = e^{σZ_2 + µ}.
PX ( x ) = P( X ≤ x )
= P F − 1 (Y ) ≤ x
= P(Y ≤ F ( x ))
= F ( x ).
q_{t+1}(x′) = P(X_{t+1} = x′) = ∑_x P(X_t = x) P(X_{t+1} = x′ | X_t = x),    using the sum rule (1.7); the two factors are q_t(x) and p(x′ | x)

P^k(x, x′) = ∑_{x_1,...,x_{k−1}} P(x, x_1) · P(x_1, x_2) ··· P(x_{k−1}, x′)
= ∑_{x_1,...,x_{k−1}} P(X_1 = x_1 | X_0 = x) ··· P(X_k = x′ | X_{k−1} = x_{k−1})    using the definition of the transition matrix (6.8)
= ∑_{x_1,...,x_{k−1}} P(X_1 = x_1, . . . , X_{k−1} = x_{k−1}, X_k = x′ | X_0 = x)    using the product rule (1.11)
= P(X_k = x′ | X_0 = x).    using the sum rule (1.7)
We note that the entries of P are all different from 0, thus the Markov chain corresponding to this transition matrix is ergodic.⁴ Thus, there

⁴ All elements of the transition matrix being strictly greater than 0 is a sufficient, but not necessary, condition for ergodicity.
Note that the left hand side of equation i corresponds to the probabil-
ity of entering state i at stationarity minus πi . Quite intuitively, this
difference should be 0, that is, after one iteration the random walk is
at state i with the same probability as before the iteration.
Thus, we conclude that in the long run, the percentage of news days
that will be classified as “good” is 35/72.
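For a concrete computation, a stationary distribution can be obtained as the left eigenvector of the transition matrix with eigenvalue 1. The transition matrix in the sketch below is hypothetical and not the transition matrix of this problem.

```python
import numpy as np

# Sketch: stationary distribution pi of an ergodic Markov chain as the left eigenvector
# of the transition matrix P with eigenvalue 1 (pi P = pi); P here is made up.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

print(pi)                                 # stationary distribution
print(pi @ P)                             # equals pi again
print(np.linalg.matrix_power(P, 50)[0])   # rows of P^t converge to pi for ergodic chains
```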
p(x | y) = p(x, y) / p(y)    using the definition of conditional probability (1.8)
= Bin(x; n, y) · C_y / p(y)
= Bin(x; n, y).    using that p(x | y) is a probability distribution over x and Bin(x; n, y) already sums to 1, so C_y = p(y)

So in short, sampling from p(x | y) is equivalent to sampling from a binomial distribution.

p(x, y) = Beta(y; x + α, n − x + β) · C_x,

So for sampling y given x, one can sample from the beta distribution. There are several methods for sampling from a beta distribution, and we refer the reader to the corresponding Wikipedia page.
2. We first derive the posterior distribution of µ. We have

= λ^{α + n/2 − 1} e^{−λ(β + ½ ∑_{i=1}^n (x_i − µ)²)}
= Gamma(λ; a_µ, b_µ)

where a_µ ≐ α + n/2 and b_µ ≐ β + ½ ∑_{i=1}^n (x_i − µ)².

3. We have

∝ Gamma(α; a, b)

where a ≐ n + 1 and b ≐ ∑_{i=1}^n log x_i − n log c. On the other hand,
Solution to Problem 6.6. First, note that the sum of convex functions
is convex, hence, we consider each term individually.
The Hessian of the regularization term is λI, and thus, by the second-
order characterization of convexity, this term is convex in w.
Finally, note that the second term is a sum of logistic losses ℓlog (5.13),
and we have seen in Problem 5.1 that ℓlog is convex in w.
max_{p∈∆^T} − ∑_{x∈T} p(x) log₂ p(x)    (B.9)
subject to ∑_{x∈T} p(x) f(x) = µ

= − ∑_{x∈T} p(x) (log₂ p(x) + λ_0 + λ_1 f(x)) + const.

max_{p≥0} min_{λ_0,λ_1∈R} L(p, λ_0, λ_1).

We have

∂L(p, λ_0, λ_1)/∂p(x) = −log₂ p(x) − λ_0 − λ_1 f(x) − 1.

Setting the partial derivatives to zero, we obtain

p(x) = 2^{−λ_0 − λ_1 f(x) − 1}    using log₂(·) = log(·)/log(2)
∝ exp(−λ_1 f(x)).
α(x′ | x) = min{1, (r(x | x′)/r(x′ | x)) exp(f(x) − f(x′))}.

We therefore know that

(r(x | x′)/r(x′ | x)) exp(f(x) − f(x′)) ≥ 1.

We remark that this inequality even holds with equality using our derivation of Theorem 6.19. Taking the logarithm and reorganizing the terms, we obtain

E_{x′_i∼p(·|x_{−i})}[f(x′)] ≤ f(x) + log p(x_i | x_{−i}) − E_{x′_i∼p(·|x_{−i})}[log p(x′_i | x_{−i})]
∂/∂y [f(x) + ∇f(x)⊤(y − x) + (α/2)‖y − x‖₂²] = ∇f(x) + α(y − x).

y = x − (1/α)∇f(x).

Plugging this y into Equation (6.56), we have

0 ≥ f(x) − (1/α)‖∇f(x)‖₂² + (1/(2α))‖∇f(x)‖₂²
= f(x) − (1/(2α))‖∇f(x)‖₂².
2. Using the chain rule,

d/dt f(x_t) = ∇f(x_t)⊤ (d/dt x_t).

Note that d/dt x_t = −∇f(x_t) by Equation (6.55), so

= −‖∇f(x_t)‖₂²
≤ −2α f(x_t)
∇ log q_t = ∇q_t / q_t.

d/dt KL(q_t‖p) = d/dt ∫ q_t log(q_t/p) dθ
= ∫ (∂q_t/∂t) log(q_t/p) dθ + ∫ q_t (∂/∂t) log(q_t/p) dθ.

Letting φ ≐ log(q_t/p) and F ≐ ∇φ, and then applying the hint, we have

= ∫ (∇ · q_t F) φ dθ
= − ∫ q_t ‖∇φ‖₂² dθ
= −E_{θ∼q_t}[‖∇ log(q_t(θ)/p(θ))‖₂²]
= −J(q_t‖p).
6. Noting that p satisfies the LSI with constant α and combining with
d
(5), we have that dt KL(qt ∥ p) ≤ −2αKL(qt ∥ p). Observe that this
result is analogous to the result derived in (2) and the LSI can in-
tuitively be seen as the PL-inequality, but in the space of distribu-
tions. Analogously to (3), we obtain the desired convergence result
by applying Grönwall’s inequality (6.58).
7. By Pinsker's inequality (6.21), ‖q_t − p‖_TV ≤ e^{−αt} √(2 KL(q_0‖p)). It follows that lim_{t→∞} ‖q_t − p‖_TV = 0.
Deep Learning
Active Learning
= E_{(x,y)}[log(p(x | y)/p(x))].

I(X; Y) = E_y[E_{x|y}[log(p(x | y)/p(x))]]
= E_y[KL(p(x | y)‖p(x))],    using the definition of KL-divergence (5.34)

and we also conclude

I(X; Y) = E_{(x,y)}[log(p(x, y)/(p(x)p(y)))]    using the definition of conditional probability (1.8)
= KL(p(x, y)‖p(x)p(y)).    using the definition of KL-divergence (5.34)
= I( f x ; y x | y A ). using I( f A ; y x | f x , y A ) = 0 as
yx ⊥ f A | f x
2. For the second part, we get
= H[ y x | y A ] − H[ y x | f x ] using that y x ⊥ ε A so y x ⊥ y A | f x
≥0
Var[Z | Y_1] = Var[Z] − Cov[Z, Y_1]² / Var[Y_1].
100
1
Var[ Z | Y1 ] = ∑ i 2 − 2 i1 ,
i =1
Bayesian Optimization
lim_{T→∞} R_T/T = max_x f⋆(x) − lim_{T→∞} (1/T) ∑_{t=1}^T f⋆(x_t)    using the definition of regret (9.4)
= max_x f⋆(x) − lim_{t→∞} f⋆(x_t). (B.12)    using the Cesàro mean (9.38)
For the other direction, we prove the contrapositive. That is, we as-
sume that the algorithm does not converge to the static optimum and
show that it has (super-)linear regret. We distinguish between two
cases. Our assumption is formalized by
P(Z > c) = P(Z − c > 0)
= ∫_0^∞ (1/√(2π)) e^{−(z+c)²/2} dz.    using the PDF of the univariate normal distribution (1.5)

(z + c)²/2 = z²/2 + zc + c²/2 ≥ z²/2 + c²/2.

Thus,

≤ e^{−c²/2} ∫_0^∞ (1/√(2π)) e^{−z²/2} dz
= e^{−c²/2} P(Z > 0)
= ½ e^{−c²/2}. (B.13)    using symmetry of the standard normal distribution around 0
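The bound (B.13) is easy to check against the exact Gaussian tail: the sketch below compares P(Z > c) (via scipy's survival function) with ½e^{−c²/2} for a few values of c.

```python
import numpy as np
from scipy.stats import norm

# Quick numerical check of (B.13): P(Z > c) <= exp(-c^2 / 2) / 2 for Z ~ N(0, 1).
for c in [0.5, 1.0, 2.0, 3.0]:
    exact = norm.sf(c)                  # survival function P(Z > c)
    bound = 0.5 * np.exp(-c**2 / 2)
    print(c, exact, bound, exact <= bound)
```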
Since we made the Bayesian assumption f ⋆ ( x) ∼ N (µ0 ( x), σ02 ( x))
and assumed Gaussian observation noise yt ∼ N ( f ⋆ ( xt ), σn2 ), the
posterior is also Gaussian:
2. We have

P(⋃_{x∈X} ⋃_{t≥1} {f⋆(x) ∉ C_t(x)}) ≤ ∑_{x∈X} ∑_{t≥1} P(f⋆(x) ∉ C_t(x))    using a union bound (1.73)
≤ |X| ∑_{t≥1} e^{−β_t²/2}.    using (1)

Letting β_t² ≐ 2 log(|X|(πt)²/(6δ)), we get

= (6δ/π²) ∑_{t≥1} 1/t²
= δ.    using ∑_{t≥1} 1/t² = π²/6
Thus,
rt = f ⋆ ( x⋆ ) − f ⋆ ( xt )
≤ β t σt−1 ( xt ) + µt−1 ( xt ) − f ⋆ ( xt )
≤ 2β t σt−1 ( xt ). again using Equation (9.7)
∑_{t=1}^T r_t² ≤ 4β_T² ∑_{t=1}^T σ_{t−1}²(x_t)    using part (1)
= 4σ_n² β_T² ∑_{t=1}^T σ_{t−1}²(x_t)/σ_n².

Observe that σ_{t−1}²(x_t)/σ_n² is bounded by M ≐ max_{x∈X} σ_0²(x)/σ_n² as variance is monotonically decreasing (cf. Section 1.2.3). Applying the hint, we obtain

≤ 4Cσ_n² β_T² ∑_{t=1}^T log(1 + σ_{t−1}²(x_t)/σ_n²)
= 8Cσ_n² β_T² I(f_T; y_T)    using part (2)
= det(I + σ_n⁻² diag{X_S X_S⊤}).

Note that

diag{X_S X_S⊤}(i, i) = ∑_{t=1}^{|S|} x_t(i)²
≤ ∑_{i=1}^d ∑_{t=1}^{|S|} x_t(i)² = ∑_{t=1}^{|S|} ‖x_t‖₂² ≤ |S| ≤ T,    using ‖x_t‖₂² ≤ 1

yielding,

I(f_S; y_S) ≤ (d/2) log(1 + σ_n⁻²T)

implying that γ_T = O(d log T).
2. Using the regret bound (cf. Theorem 9.5) and the Bayesian confi-
dence intervals (cf. Theorem 9.4), and then γT = O(d log T ), we
have
R_T = O(√(β_T γ_T T)) = Õ(√(dT)),

and hence, lim_{T→∞} R_T/T = 0.
EI_t(x) = E_{f(x)∼N(µ_t(x),σ_t²(x))}[I_t(x)]    using the definition of expected improvement (9.19)
= E_{f(x)∼N(µ_t(x),σ_t²(x))}[(f(x) − f̂_t)₊]    using the definition of improvement (9.15)
= E_{ε∼N(0,1)}[(µ_t(x) + σ_t(x)ε − f̂_t)₊]    using the reparameterization
= ∫_{−∞}^{+∞} (µ_t(x) + σ_t(x)ε − f̂_t)₊ · ϕ(ε) dε.

For ε < (f̂_t − µ_t(x))/σ_t(x) ≐ z_t(x) we have (µ_t(x) + σ_t(x)ε − f̂_t)₊ = 0. Thus, we obtain

EI_t(x) = ∫_{z_t(x)}^{+∞} (µ_t(x) + σ_t(x)ε − f̂_t) · ϕ(ε) dε. (B.14)
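Continuing the integral in (B.14) yields the familiar closed form EI_t(x) = (µ_t(x) − f̂_t)Φ(u) + σ_t(x)ϕ(u) with u = (µ_t(x) − f̂_t)/σ_t(x); this closed form is standard but not derived in the lines above, so the sketch below simply checks it against a Monte Carlo estimate with made-up values of µ, σ, and f̂.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Compare a Monte Carlo estimate of the expected improvement with the standard closed form
# EI = (mu - f_best) * Phi(u) + sigma * phi(u), where u = (mu - f_best) / sigma.
mu, sigma, f_best = 1.2, 0.7, 1.5

u = (mu - f_best) / sigma
ei_closed = (mu - f_best) * norm.cdf(u) + sigma * norm.pdf(u)

samples = rng.normal(mu, sigma, size=1_000_000)
ei_mc = np.mean(np.maximum(samples - f_best, 0.0))

print(ei_closed, ei_mc)   # should agree to ~1e-3
```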
∆̂_t(x_t^UCB) = u_t(x_t^UCB) − l_t(x_t^UCB) = 2β_{t+1} σ_t(x_t^UCB).

Note that σ_t²(x)/σ_n² ≤ C for some constant C since variance is decreasing monotonically. So, applying the hint,

≥ σ_t²(x_t^UCB) / (2Cσ_n²).

Ψ̂_t(x_t^UCB) = ∆̂_t(x_t^UCB)² / I_t(x_t^UCB) ≤ 8Cσ_n² β_{t+1}².

Ψ_t(x_{t+1}) ≤ Ψ̂_t(x_{t+1}) ≤ Ψ̂_t(x_t^UCB) ≤ 8Cσ_n² β_{t+1}²,
∂/∂π(x) W(π) = µ_t(x) − σ_t(x) Φ⁻¹(π(x)) ϕ(Φ⁻¹(π(x))) · (d/dπ(x)) Φ⁻¹(π(x))
= µ_t(x) − σ_t(x) Φ⁻¹(π(x))

∂²/(∂π(x)∂π(z)) W(π) = −σ_t(x) 1{x = z} / ϕ(Φ⁻¹(π(x)))    (< 0 if x = z, = 0 if x ≠ z),

where we used the inverse function rule twice. From negative definiteness of the Hessian, it follows that W(·) is strictly concave.

We show next that the optimum lies in the relative interior of the probability simplex, π⋆ ∈ relint(∆^X). Indeed, at the border of the probability simplex the partial derivatives explode:

∂/∂π(x) W(π) = µ_t(x) − σ_t(x) Φ⁻¹(π(x)) = { ∞ as π(x) → 0⁺;  finite for π(x) ∈ (0, 1);  −∞ as π(x) → 1⁻ }.
P(max{R_1, R_2} ≤ x) = P({R_1 ≤ x} ∩ {R_2 ≤ x})
= P(R_1 ≤ x) · P(R_2 ≤ x)    using independence
= F_{R_1}(x) · F_{R_2}(x)
= Φ²(x/10)

where Φ is the CDF of the standard Gaussian. We are looking for the probability that either R_1 or R_2 is larger than S = 1:

P(max{R_1, R_2} > 1) = 1 − P(max{R_1, R_2} ≤ 1) = 1 − Φ²(1/10).

We have Φ²(1/10) ≈ 0.29. Thus, the probability that either R_1 or R_2 is larger than S = 1 is approximately 0.71. Due to the symmetry in the problem, we know
Solution to Problem 10.4. Using the hint and v⋆ ≥ vπ for any policy
π,
∥ v π t − v ⋆ ∥ ∞ ≤ ∥ B ⋆ v π t −1 − v ⋆ ∥ ∞
E_{π⋆}[∑_{t=0}^∞ γ^t R′_t | X_0 = x] ≥ E_π[∑_{t=0}^∞ γ^t R′_t | X_0 = x]
⟺ v^{π⋆}_{M′}(x) ≥ v^π_{M′}(x).

If we now define q(x, a) ≐ q⋆_{M′}(x, a) + ϕ(x), we have

q(x, a) = E_{x′|x,a}[r(x, x′) + γ max_{a′∈A} q(x′, a′)].
b_1(F) = (1/Z) b′_0(F) P(o_1 | F) = (1/Z) · 0.55 · 0.8
b_1(F̄) = (1/Z) b′_0(F̄) P(o_1 | F̄) = (1/Z) · 0.45 · 0.3.

(c) Normalization: We compute the normalization constant Z.

b_2(F) ∝ P(o_1 | F) · [b_1(F) · p(F | F, W) + b_1(F̄) · p(F | F̄, W)] ≈ 0.461
b_2(F̄) ∝ P(o_1 | F̄) · [b_1(F) · p(F̄ | F, W) + b_1(F̄) · p(F̄ | F̄, W)] ≈ 0.127.
b_2(F) ∝ P(o_2 | F) · [b_1(F) · p(F | F, W) + b_1(F̄) · p(F | F̄, W)] ≈ 0.115
b_2(F̄) ∝ P(o_2 | F̄) · [b_1(F) · p(F̄ | F, W) + b_1(F̄) · p(F̄ | F̄, W)] ≈ 0.296.
v⋆(1) = −3

max_{a∈A} [r(1, a) + γ E_{x′|1,a}[v⋆(x′)]] = −3

since

r(1, a) + γ E_{x′|1,a}[v⋆(x′)] = −1 + (−2) = −3 if a = 1,  and  −1 + (−3) = −4 if a = −1.

• Likewise, for x = 2,

v⋆(2) = −2

max_{a∈A} [r(2, a) + γ E_{x′|2,a}[v⋆(x′)]] = −2

since

r(2, a) + γ E_{x′|2,a}[v⋆(x′)] = −1 + (−1) = −2 if a = 1,  and  −1 + (−3) = −4 if a = −1.
2. We have

Q(3, −1) = 0 + ½ (−1 + max_{a′∈A} Q(2, a′)) = ½ (−1 + 0) = −½
Q(2, 1) = 0 + ½ (−1 + max_{a′∈A} Q(3, a′)) = ½ (−1 + 0) = −½
Q(3, 1) = 0 + ½ (−1 + max_{a′∈A} Q(4, a′)) = ½ (−1 + 0) = −½
Q(4, 1) = 0 + ½ (0 + max_{a′∈A} Q(4, a′)) = ½ (0 + 0) = 0.
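The updates above are instances of the tabular Q-learning rule Q(x, a) ← Q(x, a) + α(r + γ max_{a′} Q(x′, a′) − Q(x, a)). The sketch below runs this rule on a small chain environment; the environment is a hypothetical stand-in and is not claimed to match the problem's MDP exactly.

```python
import numpy as np

# Minimal tabular Q-learning sketch of the update used above. The chain environment is
# made up: 4 states, actions move left/right, -1 reward per step, last state absorbing.
n_states, actions = 4, [-1, +1]
alpha, gamma = 0.5, 1.0
Q = np.zeros((n_states, len(actions)))

def step(x, a):
    if x == n_states - 1:                 # absorbing goal state
        return x, 0.0
    x_next = int(np.clip(x + a, 0, n_states - 1))
    return x_next, -1.0

rng = np.random.default_rng(0)
for episode in range(200):
    x = 0
    for _ in range(50):
        ai = rng.integers(len(actions))   # fully random exploration for simplicity
        x_next, r = step(x, actions[ai])
        td_target = r + gamma * np.max(Q[x_next])
        Q[x, ai] += alpha * (td_target - Q[x, ai])
        x = x_next

print(Q)   # the greedy policy should move right toward the goal state
```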
3. We compute

∇_w ℓ(w; τ) = −(r + γ max_{a′∈A} Q(x′, a′; w_old) − Q(x, a; w)) · [x; a; 1]    using the derivation of Equation (12.15)
= −(−1 + max_{a′∈A} {1 − a′ − 2} − (−2 − 1 + 1)) · [2; −1; 1]
= [−2; 1; −1].

This gives

w′ = w − α ∇_w ℓ(w; τ)
= [−1; 1; 1] − ½ [−2; 1; −1]
= [0; 1/2; 3/2].
Var[ f (X) − g(X)] = Var[ f (X)] + Var[ g(X)] − 2Cov[ f (X), g(X)]. using Equation (1.39)
Cov[ f (τ ), g(τ )] = E[( f (τ ) − E[ f (τ )]) g(τ )] using the definition of covariance (1.26)
Using the score function trick for the score function ∇φ log πφ( at | xt )
analogously to the proof of Lemma 12.6, we have,
Thus,
" " ##
T
= Eτ0:T−1 EτT ∑ G0 ∇φ log πφ(at | xt ) τ0:T −1
t =0
" #
T
= Eτ ∼Πφ ∑ G0 ∇φ log πφ(at | xt ) . using the tower rule (1.25) again
t =0
τ = ( x0 , a0 , r0 , x1 , a1 , r1 , x2 , a2 , r2 , x3 , a3 , r3 , x4 ).
π_θ(1 | x) = 1 − θ,    ∂π_θ(1 | x)/∂θ = −1.
∇_φ j(φ)
= E[q(x, a) ∇_φ (a log σ(f_φ(x)) + (1 − a) log(1 − σ(f_φ(x))))]    using the policy gradient theorem (12.53)
= E[q(x, a) ∇_f (a log σ(f) + (1 − a) log(1 − σ(f))) ∇_φ f_φ(x)]    using the chain rule
= E[q(x, a) ∇_f (−a log(1 + e^{−f}) + (1 − a) log(e^{−f}/(1 + e^{−f}))) ∇_φ f_φ(x)]    using the definition of the logistic function (5.9)
= E[q(x, a) ∇_f (−f + a f − log(1 + e^{−f})) ∇_φ f_φ(x)]
= E[q(x, a) (a − 1 + e^{−f}/(1 + e^{−f})) ∇_φ f_φ(x)]
= E[q(x, a) (a − σ(f)) ∇_φ f_φ(x)].    using the definition of the logistic function (5.9)

The term a − σ(f) can be understood as a residual as it corresponds to the difference between the target action a and the expected action σ(f).
2. We have

∇_φ j(φ) = E[q(x, a) ∇_f log π_f(a | x) ∇_φ f(x)]    using Equation (12.53) and the chain rule
= E[q(x, a) ∇_f (log h(a) + a f − A(f)) ∇_φ f(x)]
= E[q(x, a) (a − ∇_f A(f)) ∇_φ f(x)].

π_f(a | x) = h(a) exp(a f − log(1 + e^f))
= h(a) e^{af} / (1 + e^f)
= h(a) · σ(f) if a = 1,  and  h(a) · (1 − σ(f)) if a = 0.
KL(Π_φ‖Π⋆)
= ∑_{t=1}^T E_{(x_t,a_t)∼Π_φ}[−(1/λ) r(x_t, a_t) − H[π_φ(· | x_t)]]
= ∑_{t=1}^T E_{(x_t,a_t)∼Π_φ}[−log β(a_t | x_t) − H[π_φ(· | x_t)]].

= ∑_{t=1}^T E_{x_t∼Π_φ}[KL(π_φ(· | x_t)‖π̂(· | x_t)) − log Z(x_t)],    using the definition of KL-divergence (5.34)

where the first term stems directly from the objective (12.106) and the second term represents the contribution of π_φ(a_t | x_t) to all subsequent terms. Letting β⋆(a | x) ≐ exp(q⋆(x, a)), Z⋆(x) ≐ ∫_A β⋆(a | x) da, and recalling that we denote by π⋆(· | x) the policy β⋆(· | x)/Z⋆(x), this objective can be reexpressed as
Π⋆(y | x) = p(y | x, O)
∝ p(y | x) · p(O | x, y)    using Bayes' rule (1.45)
= Π_{φ_init}(y | x) exp((1/λ) r(y | x)). (B.16)

The derivation is then analogous to Equation (12.84):
≤ f̂_{t−1}(ẑ_{t,k}) − f(ẑ_{t,k}) + f(ẑ_{t,k}) − f(z_{t,k})    adding and subtracting f(ẑ_{t,k}) and using Cauchy-Schwarz
≤ 2β_t σ_{t−1}(ẑ_{t,k}) + L_1 ‖x̂_{t,k} − x_{t,k}‖
where the final inequality follows with high probability from the
assumption that the confidence intervals are well-calibrated (cf.
Equation (13.22)) and the assumed Lipschitzness.
This is identical to the analysis of UCB from Problem 9.3, only that
here errors compound along the trajectory.
2. The assumption that α_t ≥ 1 implies that

‖x̂_{t,k} − x_{t,k}‖ ≤ 2β_t α_t^{H−1} ∑_{l=0}^{k−1} σ_{t−1}(z_{t,l}). (B.17)

≤ ∑_{k=0}^{H−1} [r(x̂_{t,k}, π_t(x̂_{t,k})) − r(x_{t,k}, π_t(x_{t,k}))]
≤ L_3 ∑_{k=0}^{H−1} ‖x̂_{t,k} − x_{t,k}‖    using Lipschitz-continuity of r
≤ 2β_t α_t^{H−1} L_3 ∑_{k=0}^{H−1} ∑_{l=0}^{k−1} σ_{t−1}(z_{t,l})    using Equation (B.17)
≤ 2β_t H α_t^{H−1} L_3 ∑_{k=0}^{H−1} σ_{t−1}(z_{t,k}).
Mathematical Background
2. Consider the non-negative random variable Y ≐ (X − EX)². We have

P(|X − EX| ≥ ϵ) = P((X − EX)² ≥ ϵ²)
≤ E[(X − EX)²] / ϵ²    using Markov's inequality (A.71)
= VarX / ϵ².    using the definition of variance (1.34)

3. Fix any ϵ > 0. Applying Chebyshev's inequality and noting that E X̄_n = EX, we obtain

P(|X̄_n − EX| ≥ ϵ) ≤ Var X̄_n / ϵ².

We further have for the variance of the sample mean that

Var X̄_n = Var[(1/n) ∑_{i=1}^n X_i] = (1/n²) ∑_{i=1}^n Var[X_i] = VarX / n.

Thus,

lim_{n→∞} P(|X̄_n − EX| ≥ ϵ) ≤ lim_{n→∞} VarX / (ϵ²n) = 0,

which is precisely the definition of X̄_n →^P EX.
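The statement can also be illustrated empirically: the sketch below estimates P(|X̄_n − EX| ≥ ϵ) for i.i.d. Exp(1) samples and compares it to the Chebyshev bound VarX/(ϵ²n), which shrinks as n grows; the distribution, ϵ, and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical illustration of Chebyshev's inequality and the weak law of large numbers:
# for the sample mean of n i.i.d. Exp(1) variables (mean 1, variance 1),
# P(|X_bar_n - 1| >= eps) <= 1 / (eps^2 * n), and this probability vanishes as n grows.
eps, repeats = 0.2, 20_000
for n in [10, 100, 1000]:
    means = rng.exponential(1.0, size=(repeats, n)).mean(axis=1)
    empirical = np.mean(np.abs(means - 1.0) >= eps)
    chebyshev = 1.0 / (eps**2 * n)
    print(n, empirical, min(chebyshev, 1.0))
```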
Dividing by λ yields,

[f(x + λd) − f(x)] / λ = ∇f(x)⊤d + o(λ‖d‖₂)/λ,

where the last term vanishes as λ → 0.
Eitan Altman. Constrained Markov decision processes: stochastic modeling. Routledge, 1999.
Sebastian Ament, Samuel Daulton, David Eriksson, Maximilian Balandat, and Eytan Bakshy. Unexpected improvements
to expected improvement for bayesian optimization. In NeurIPS, volume 36, 2024.
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin,
OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NeurIPS, volume 30, 2017.
Javier Antorán, David Janz, James U Allingham, Erik Daxberger, Riccardo Rb Barbano, Eric Nalisnick, and José Miguel
Hernández-Lobato. Adapting the linearised laplace model evidence for modern deep learning. In ICML, 2022.
Yarden As, Ilnura Usmanova, Sebastian Curi, and Andreas Krause. Constrained policy optimization via bayesian world
models. In ICLR, 2022.
Marco Bagatella, Jonas Hübotter, Georg Martius, and Andreas Krause. Active fine-tuning of generalist policies. arXiv
preprint arXiv:2410.05026, 2024.
Dominique Bakry and Michel Émery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84: Proceedings.
Springer, 2006.
Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning
with stability guarantees. In NeurIPS, volume 30, 2017.
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In
ICML, 2015.
Léon Bottou. Online learning and stochastic approximations. On-line learning in neural networks, 17(9), 1998.
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.
Biometrika, 39(3/4), 1952.
Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement
learning. JMLR, 3, 2002.
Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information
Theory, 59(11), 2013.
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman,
Tomasz Korbak, David Lindner, Pedro Freire, et al. Open problems and fundamental limitations of reinforcement
learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
Ariel Caticha. Entropy, information, and the updating of probabilities. Entropy, 23(7), 2021.
Ariel Caticha and Adom Giffin. Updating probabilities. In AIP Conference Proceedings, volume 872. American Institute of
Physics, 2006.
Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statistical Science, 1995.
Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamiltonian monte carlo. In ICML, 2014.
Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In ICML, 2017.
Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of
trials using probabilistic dynamics models. In NeurIPS, volume 31, 2018.
Sebastian Curi, Felix Berkenkamp, and Andreas Krause. Efficient model-based reinforcement learning through optimistic
policy search and planning. In NeurIPS, volume 33, 2020.
Sebastian Curi, Armin Lederer, Sandra Hirche, and Andreas Krause. Safe reinforcement learning via confidence-based
filters. In CDC, 2022.
Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. Laplace
redux-effortless bayesian deep learning. In NeurIPS, volume 34, 2021.
Bruno De Finetti. Theory of probability: A critical introductory treatment. John Wiley, 1970.
Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In ICML,
2011.
Marc Peter Deisenroth, A Aldo Faisal, and Cheng Soon Ong. Mathematics for machine learning. Cambridge University
Press, 2020.
Joseph Leo Doob. Application of the theory of martingales. Actes du Colloque International Le Calcul des Probabilités et ses
applications (Lyon, 28 Juin – 3 Juillet, 1948), 1949.
Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid monte carlo. Physics letters B, 195
(2), 1987.
Nikita Durasov, Timur Bagautdinov, Pierre Baque, and Pascal Fua. Masksembles for uncertainty estimation. In CVPR,
2021.
David Duvenaud. Automatic model construction with Gaussian processes. PhD thesis, University of Cambridge, 2014.
David Duvenaud and Ryan P Adams. Black-box stochastic variational inference in five lines of python. In NeurIPS
Workshop on Black-box Learning and Inference, 2015.
Eyal Even-Dar and Yishay Mansour. Convergence of optimistic and incremental q-learning. In NeurIPS, volume 14, 2001.
Eyal Even-Dar, Yishay Mansour, and Peter Bartlett. Learning rates for q-learning. JMLR, 5(1), 2003.
Matthew Fellows, Anuj Mahajan, Tim GJ Rudner, and Shimon Whiteson. Virel: A variational inference framework for
reinforcement learning. In NeurIPS, volume 32, 2019.
Karl Friston. The free-energy principle: a unified brain theory? Nature reviews neuroscience, 11(2), 2010.
Karl Friston, Francesco Rigoli, Dimitri Ognibene, Christoph Mathys, Thomas Fitzgerald, and Giovanni Pezzulo. Active
inference and epistemic value. Cognitive neuroscience, 6(4), 2015.
Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In
ICML, 2018.
Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Insights and applications. In ICML Workshop
on Deep Learning, volume 1, 2015.
Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep
learning. In ICML, 2016.
Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In ICML, 2017.
Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and David Sculley. Google vizier: A
service for black-box optimization. In ACM SIGKDD, 2017.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
Jackson Gorham and Lester Mackey. Measuring sample quality with stein’s method. In NeurIPS, volume 28, 2015.
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In ICML, 2017.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang,
Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint
arXiv:2501.12948, 2025.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. In ICML, 2018a.
Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu,
Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905,
2018b.
Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning
latent dynamics for planning from pixels. In ICML, 2019.
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent
imagination. In ICLR, 2020.
Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In
ICLR, 2021.
Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control
policies by stochastic value gradients. In NeurIPS, volume 28, 2015.
Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. JMLR, 13(6), 2012.
James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational gaussian process classification. In
AISTATS, 2015.
Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural
networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
Jonas Hübotter, Bhavya Sukhija, Lenart Treven, Yarden As, and Andreas Krause. Transductive active learning: Theory
and applications. In NeurIPS, 2024.
Jonas Hübotter, Sascha Bongni, Ido Hakimi, and Andreas Krause. Efficiently learning at test-time: Active fine-tuning of
llms. In ICLR, 2025.
Carl Hvarfner, Frank Hutter, and Luigi Nardi. Joint entropy search for maximally-informed bayesian optimization. In
NeurIPS, volume 35, 2022.
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights
leads to wider optima and better generalization. In UAI, 2018.
Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algo-
rithms. In NeurIPS, volume 6, 1993.
Edwin T Jaynes. Prior probabilities. IEEE Transactions on systems science and cybernetics, 4(3), 1968.
Edwin T Jaynes. Probability theory: The logic of science. Cambridge university press, 2002.
Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. Parallelised bayesian optimisation
via thompson sampling. In AISTATS, 2018.
Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NeurIPS,
volume 30, 2017.
Johannes Kirschner and Andreas Krause. Information directed sampling and bandits with heteroscedastic noise. In
COLT, 2018.
Torsten Koller, Felix Berkenkamp, Matteo Turchetta, and Andreas Krause. Learning-based model predictive control for
safe exploration. In CDC, 2018.
Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability, 3, 2014.
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation
using deep ensembles. In NeurIPS, volume 30, 2017.
Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer Science & Business Media, 1986.
Lucien Marie Le Cam and Grace Lo Yang. Asymptotics in statistics: some basic concepts. Springer Science & Business
Media, 2000.
David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint
arXiv:1805.00909, 2018.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan
Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
David Lindner and Mennatallah El-Assady. Humans are not boltzmann distributions: Challenges and opportunities for
modelling human feedback and interaction in reinforcement learning. arXiv preprint arXiv:2206.13316, 2022.
Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In
NeurIPS, volume 29, 2016.
Jianfeng Lu, Yulong Lu, and James Nolen. Scaling limit of the stein variational gradient descent: The mean field regime.
SIAM Journal on Mathematical Analysis, 51(2), 2019.
Yi-An Ma, Yuansi Chen, Chi Jin, Nicolas Flammarion, and Michael I Jordan. Sampling can be faster than optimization.
Proceedings of the National Academy of Sciences, 116(42), 2019.
David JC MacKay. Information-based objective functions for active data selection. Neural computation, 4(4), 1992.
Nicolas Menet, Jonas Hübotter, Parnian Kassraie, and Andreas Krause. Lite: Efficiently estimating gaussian probability
of maximality. In AISTATS, 2025.
Jeffrey W Miller. A detailed treatment of doob’s theorem. arXiv preprint arXiv:1801.03122, 2018.
W Jeffrey Miller. Lecture notes on advanced stochastic modeling. Duke University, Durham, NC, 2016.
Beren Millidge, Alexander Tschantz, Anil K Seth, and Christopher L Buckley. On the relationship between active
inference and control as inference. In IWAI Workshop on Active Inference. Springer, 2020.
Beren Millidge, Alexander Tschantz, and Christopher L Buckley. Whence the expected free energy? Neural Computation,
33(2), 2021.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin
Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.
nature, 518(7540), 2015.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver,
and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte carlo gradient estimation in machine
learning. JMLR, 21(132), 2020.
Andrew W. Moore. Prediction and search in probabilistic worlds: Markov systems, markov decision processes, and
dynamic programming. https://autonlab.org/assets/tutorials/mdp09.pdf, 2002.
Kevin P Murphy. Conjugate bayesian analysis of the gaussian distribution. def, 1(2σ2):16, 2007.
Mojmír Mutnỳ. Modern Adaptive Experiment Design: Machine Learning Perspective. PhD thesis, ETH Zurich, 2024.
Jayakrishnan Nair, Adam Wierman, and Bert Zwart. The Fundamentals of Heavy Tails, volume 53. Cambridge University
Press, 2022.
George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing
submodular set functions. Mathematical programming, 14, 1978.
Vu Nguyen, Sunil Gupta, Santu Rana, Cheng Li, and Svetha Venkatesh. Regret for expected improvement over the
best-observed value and stopping condition. In ACML, 2017.
Manfred Opper and Cédric Archambeau. The variational gaussian approximation revisited. Neural computation, 21(3),
2009.
Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal-conditioned
rl. arXiv preprint arXiv:2410.20092, 2024.
Thomas Parr, Giovanni Pezzulo, and Karl J Friston. Active inference: the free energy principle in mind, brain, and behavior.
MIT Press, 2022.
Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 7(15),
2008.
Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh
Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments
and request for research. arXiv preprint arXiv:1802.09464, 2018.
Joaquin Quinonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process
regression. JMLR, 6, 2005.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct prefer-
ence optimization: Your language model is secretly a reward model. In NeurIPS, volume 36, 2023.
Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dy-
namics: a nonasymptotic analysis. In COLT, 2017.
Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In NeurIPS, 2007.
Tom Rainforth, Adam Foster, Desi R Ivanova, and Freddie Bickford Smith. Modern bayesian experimental design.
Statistical Science, 39(1), 2024.
Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In AISTATS, 2014.
Gareth O. Roberts and Jeffrey S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability
Surveys, 1, 2004.
Philip A Romero, Andreas Krause, and Frances H Arnold. Navigating the protein fitness landscape with gaussian
processes. Proceedings of the National Academy of Sciences, 110(3), 2013.
Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
Stuart Russell and Peter Norvig. Artificial intelligence: a modern approach. Prentice Hall, 2002.
Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In NeurIPS, volume 27,
2014.
Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of thompson sampling. JMLR, 17(1), 2016.
Grant Sanderson. But what is the fourier transform? a visual introduction, 2018. URL https://www.youtube.com/watch?v=spUNpyF58BY.
Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In COLT. Springer, 2001.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In
ICML, 2015.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control
using generalized advantage estimation. In ICLR, 2016.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algo-
rithms. arXiv preprint arXiv:1707.06347, 2017.
Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-based pomdp solvers. Autonomous Agents and Multi-
Agent Systems, 27(1), 2013.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li,
Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint
arXiv:2402.03300, 2024.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser,
Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks
and tree search. nature, 529(7587), 2016.
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Lau-
rent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforce-
ment learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. Convergence results for single-step on-policy
reinforcement-learning algorithms. Machine learning, 38(3), 2000.
Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit
setting: No regret and experimental design. In ICML, 2010.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way
to prevent neural networks from overfitting. JMLR, 15(1), 2014.
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and
Paul F Christiano. Learning to summarize with human feedback. In NeurIPS, volume 33, 2020.
Bhavya Sukhija, Matteo Turchetta, David Lindner, Andreas Krause, Sebastian Trimpe, and Dominik Baumann.
Gosafeopt: Scalable safe exploration for global optimization of dynamical systems. Artificial Intelligence, 320, 2023.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
Yee Whye Teh, Alexandre H Thiery, and Sebastian J Vollmer. Consistency and fluctuations for stochastic gradient
langevin dynamics. JMLR, 17, 2016.
Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational bayes for non-conjugate inference. In ICML,
2014.
Lenart Treven, Jonas Hübotter, Bhavya Sukhija, Florian Dörfler, and Andreas Krause. Efficient exploration in continuous-
time model-based reinforcement learning. In NeurIPS, volume 36, 2024.
Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration for interactive machine learning. In NeurIPS,
volume 32, 2019.
Sattar Vakili, Kia Khezeli, and Victor Picheny. On information gain and regret bounds in gaussian process bandits. In
AISTATS, 2021.
Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI,
volume 30, 2016.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In NeurIPS, volume 30, 2017.
Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted langevin algorithm: Isoperimetry suffices.
In NeurIPS, volume 32, 2019.
Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropcon-
nect. In ICML, 2013.
Yuhui Wang, Hao He, and Xiaoyang Tan. Truly proximal policy optimization. In UAI, 2020.
Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient bayesian optimization. In ICML, 2017.
Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In ICML, 2011.
Justin Whitehouse, Zhiwei Steven Wu, and Aaditya Ramdas. On the sublinear regret of gp-ucb. In NeurIPS, volume 36,
2024.
Daniel Widmer, Dongho Kang, Bhavya Sukhija, Jonas Hübotter, Andreas Krause, and Stelian Coros. Tuning legged
locomotion controllers via safe bayesian optimization. In CoRL, 2023.
Christopher K Williams and Carl Edward Rasmussen. Gaussian processes for machine learning, volume 2. MIT press, 2006.
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine
learning, 8(3), 1992.
Stephen Wright, Jorge Nocedal, et al. Numerical optimization. Springer Science, 35(67-68), 1999.
Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of langevin dynamics based algorithms for
nonconvex optimization. In NeurIPS, volume 31, 2018.
Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial for the markov decision problem with a
fixed discount rate. Mathematics of Operations Research, 36(4), 2011.
Summary of Notation
≐ equality by definition
≈ approximately equals
∝ proportional to (up to multiplicative constants), f ∝ g iff ∃k. ∀ x. f ( x ) = k · g( x )
const an (additive) constant
N set of natural numbers {1, 2, . . . }
N0 set of natural numbers, including 0, N ∪ {0}
R set of real numbers
[m] set of natural numbers from 1 to m, {1, 2, . . . , m − 1, m}
i:j subset of natural numbers between i and j, {i, i + 1, . . . , j − 1, j}
( a, b] real interval between a and b including b but not including a
f :A→B function f from elements of set A to elements of set B
f ◦g function composition, f ( g(·))
(·)+ max{0, ·}
log logarithm with base e
P ( A) power set (set of all subsets) of A
1{ predicate} indicator function (1{ predicate} = 1 if the predicate is true, else 0)
⊙ Hadamard (element-wise) product
← assignment
analysis
linear algebra
I identity matrix
A⊤ transpose of matrix A
A −1 inverse of invertible matrix A
A1/2 square root of a symmetric and positive semi-definite matrix A
det( A) determinant of A
tr( A) trace of A, ∑i A(i, i )
diagi∈ I { ai } diagonal matrix with elements ai , indexed according to the set I
probability
Ω sample space
A event space
P probability measure
X ∼ P random variable X follows the distribution P
X1:n ∼iid P random variables X1:n are independent and identically distributed according to distribution P
x∼P value x is sampled according to distribution P
PX cumulative distribution function of a random variable X
PX tail distribution function of a random variable X
PX−1 quantile function of a random variable X
pX probability mass function (if discrete) or probability density function
(if continuous) of a random variable X
∆A set of all probability distributions over the set A
δα Dirac delta function, point density at α
g♯ p pushforward of a density p under perturbation g
supervised learning
θ parameterization of a model
X input space
Y label space
x∈X input
ϵ( x) zero-mean noise, sometimes assumed to be independent of x
y∈Y (noisy) label, f ( x) + ϵ( x) where f is unknown
D ⊆ X ×Y labeled training data, {( xi , yi )}in=1
X ∈ Rn × d design matrix when X = Rd
Φ ∈ Rn × e design matrix in feature space Re
y ∈ Rn label vector when Y = R
p(θ) prior belief about θ
p(θ | x1:n , y1:n ) posterior belief about θ given training data
p(y1:n | x1:n , θ) likelihood of training data under the model parameterized by θ
p(y1:n | x1:n ) marginal likelihood of training data
θ̂MLE maximum likelihood estimate of θ
θ̂MAP maximum a posteriori estimate of θ
ℓnll (θ; D) negative log-likelihood of the training data D under model θ
kalman filters
K t ∈ Rd × m Kalman gain
gaussian processes
deep models
variational inference
Q variational family
λ∈Λ variational parameters
qλ variational posterior parameterized by λ
L(q, p; D) evidence lower bound for data D of variational posterior q and true posterior p(· | D)
markov chains
S set of n states
Xt sequence of states
p( x′ | x ) transition function, probability of going from state x to state x ′
p(t) ( x ′ | x ) probability of reaching x ′ from x in exactly t steps
P ∈ Rn × n transition matrix
qt distribution over states at time t
π stationary distribution
∥µ − ν∥TV total variation distance between two distributions µ and ν
τTV mixing time with respect to total variation distance
active learning
bayesian optimization
reinforcement learning
X, X set of states
A, A set of actions
p( x ′ | x, a) dynamics model, probability of transitioning from state x to state x ′ when playing
action a
r reward function
Xt sequence of states
At sequence of actions
Rt sequence of rewards
π (a | x) policy, probability of playing action a when in state x
Gt discounted payoff from time t
γ discount factor
vπt (x) state value function, average discounted payoff from time t starting from state x
qπt (x, a) state-action value function, average discounted payoff from time t starting from state x and playing action a
aπt (x, a) advantage function, qπt (x, a) − vπt (x)
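To anchor the payoff notation, here is a minimal sketch (assuming the convention Gt = Σ k≥0 γ^k R t+k for a finite reward sequence; vπt and qπt are then expectations of Gt under the policy π):

```python
import numpy as np

# Illustrative helper; the recursion G_t = R_t + gamma * G_{t+1} is assumed.
def discounted_payoff(rewards, gamma):
    """Return the discounted payoffs G_0, ..., G_{T-1}."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(discounted_payoff([1.0, 0.0, 0.0, 1.0], gamma=0.9))
```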
Acronyms
a.s. almost surely, with high probability, with probability 1
A2C advantage actor-critic
BALD Bayesian active learning by disagreement
BLR Bayesian linear regression
CDF cumulative distribution function
CLT central limit theorem
DDPG deep deterministic policy gradients
DDQN double deep Q-networks
DPO direct preference optimization
DQN deep Q-networks
ECE expected calibration error
EI expected improvement
ELBO evidence lower bound
ES entropy search
FITC fully independent training conditional
GAE generalized advantage estimation
GD gradient descent
GLIE greedy in the limit with infinite exploration
GP Gaussian process
GPC Gaussian process classification
GRPO group relative policy optimization
GRV Gaussian random vector
H-UCRL hallucinated upper confidence reinforcement learning
HMC Hamiltonian Monte Carlo
HMM hidden Markov model
i.i.d. independent and identically distributed
IDS information-directed sampling
iff if and only if
JES joint entropy search
KL Kullback-Leibler
LAMBDA Lagrangian model-based agent
LASSO least absolute shrinkage and selection operator
LD Langevin dynamics
LITE linear-time independence-based estimators (of probability of maximality)
LMC Langevin Monte Carlo
LOTE law of total expectation
LOTP law of total probability
LOTUS law of the unconscious statistician
LOTV law of total variance
LSI log-Sobolev inequality
MAB multi-armed bandits
MALA Metropolis adjusted Langevin algorithm
MAP maximum a posteriori
MC Monte Carlo
MCE maximum calibration error
MCMC Markov chain Monte Carlo
MCTS Monte Carlo tree search
MDP Markov decision process
MERL maximum entropy reinforcement learning
MES max-value entropy search
MGF moment-generating function
MI mutual information
MLE maximum likelihood estimate
MLL marginal log likelihood
MPC model predictive control
MSE mean squared error
NLL negative log likelihood
ODE ordinary differential equation
OPES output-space predictive entropy search
PBPI point-based policy iteration
PBVI point-based value iteration
PDF probability density function
PES predictive entropy search
PETS probabilistic ensembles with trajectory sampling
PI probability of improvement
PI policy iteration
PILCO probabilistic inference for learning control
PL Polyak-Łojasiewicz
PlaNet deep planning network
PMF probability mass function
POMDP partially observable Markov decision process
PPO proximal policy optimization
RBF radial basis function
ReLU rectified linear unit
RHC receding horizon control
RKHS reproducing kernel Hilbert space
RL reinforcement learning
RLHF reinforcement learning from human feedback
RM Robbins-Monro
SAA stochastic average approximation
SAC soft actor-critic
SARSA state-action-reward-state-action
SDE stochastic differential equation
SG-HMC stochastic gradient Hamiltonian Monte Carlo
SGD stochastic gradient descent
SGLD stochastic gradient Langevin dynamics
SLLN strong law of large numbers
SoR subset of regressors
SVG stochastic value gradients
SVGD Stein variational gradient descent
SWA stochastic weight averaging
SWAG stochastic weight averaging-Gaussian
Tanh hyperbolic tangent
TD temporal difference
TD3 twin delayed DDPG
TRPO trust-region policy optimization
UCB upper confidence bound
ULA unadjusted Langevin algorithm
VI variational inference
VI value iteration
w.l.o.g. without loss of generality
w.r.t. with respect to
WLLN weak law of large numbers
Index