BAYESIAN OPTIMIZATION
ROMAN GARNETT
preface ix
notation xiii
1 introduction 1
1.1 Formalization of optimization 2
1.2 The Bayesian approach 5
2 gaussian processes 15
2.1 Definition and basic properties 16
2.2 Inference with exact and noisy observations 18
2.3 Overview of remainder of chapter 26
2.4 Joint Gaussian processes 26
2.5 Continuity 28
2.6 Differentiability 30
2.7 Existence and uniqueness of global maxima 33
2.8 Inference with non-Gaussian observations and constraints 35
2.9 Summary of major ideas 41
9 implementation 201
9.1 Gaussian process inference, scaling, and approximation 201
9.2 Optimizing acquisition functions 207
9.3 Starting and stopping optimization 210
9.4 Summary of major ideas 212
c gradients 307
bibliography 331
index 353
Can I ask a dumb question about gps? Let’s say that I’m doing function approximation on an interval with a gp. So I’ve got this mean function 𝑚(𝑥) and a variance function 𝑣(𝑥). Is it true that if I pick a particular point 𝑥, then 𝑓(𝑥) ∼ N(𝑚(𝑥), 𝑣(𝑥))? Please say yes.
If this is true, then I think the idea of doing Bayesian optimization using gps is, dare I say, trivial.
The nuances of some applications require modifications to the basic sequential optimization scheme that is the focus of the bulk of the book, and chapter 11 introduces several notable extensions to this basic setup. Each is systematically presented through the unifying lens of Bayesian decision theory to illustrate how one might proceed when facing a novel situation.
Finally, chapter 12 provides a brief and standalone history of Bayesian optimization. This was perhaps the most fun chapter for me to write,
if only because it forced me to plod through old Soviet literature (in an
actual library! what a novelty these days!). To my surprise I was able to
antedate many Bayesian optimization policies beyond their commonly
attested origin, including expected improvement, knowledge gradient,
probability of improvement, and upper confidence bound. (A reader
familiar with the literature may be surprised to learn the last of these
was actually the first policy discussed by kushner in his 1962 paper.)
Despite my best efforts, there may still be stones left to be overturned
before the complete history is revealed.
Dependencies between the main chapters are illustrated in the margin. (Margin figure: a dependency graph for chapters 2–11; chapter 1 is a universal dependency.) There are two natural linearizations of the material. The first is the one I adopted and personally prefer, which covers modeling prior to decision making. However, one could also proceed in the other order, reading chapters 5–7 first, then looping back to chapter 2. After covering the material in these chapters (in either order), the remainder of the book can be perused at will. Logical partial paths through the book include:
• a minimal but self-contained introduction: chapters 1–2, 5–7
• a shorter introduction requiring leaps of faith: chapters 1 and 7
• a crash course on the underlying theory: chapters 1–2, 5–7, 10
• a head start on implementing a software package: chapters 1–9
A reader already quite comfortable with Gaussian processes might wish to skip over chapters 2–4 entirely.
I struggled for some time over whether to include a chapter on applications. On the one hand, Bayesian optimization ultimately owes its popularity to its success in optimizing a growing and diverse set of difficult objectives. However, these applications often require extensive technical background to appreciate, and an adequate coverage would be tedious to write and tedious to read. As a compromise, I provide an annotated bibliography (appendix d, p. 313) outlining the optimization challenges involved in notable domains of interest and pointing to studies where these challenges were successfully overcome with the aid of Bayesian optimization.
The sheer size of the Bayesian optimization literature – especially
the output of the previous decade – makes it impossible to provide a
complete survey of every recent development. This is especially true
for the extensions discussed in chapter 11 and even more so for the
bibliography on applications, where work has proliferated in myriad
branching directions. Instead I settled for presenting what I considered
to be the most important ideas and providing pointers to entry points
for the relevant literature. The reader should not read anything into any
omissions; there is simply too much high-quality work to go around.
Additional information about the book, including a list of errata as
they are discovered, may be found at the companion webpage:
bayesoptbook.com
I encourage the reader to report any errata or other issues to the com-
panion GitHub repository for discussion and resolution:
github.com/bayesoptbook/bayesoptbook.github.io
Roman Garnett
St. Louis, Missouri
January 2022
NOTATION
All vectors are column vectors and are denoted in lowercase bold: x ∈ ℝᵈ. Matrices are denoted in uppercase bold: A.
We adopt the “numerator layout” convention for matrix calculus: the derivative of a vector by a scalar is a (column) vector, whereas the derivative of a scalar by a vector is a row vector. This results in the chain rule proceeding from left-to-right; for example, if a vector x(𝜃) depends on a scalar parameter 𝜃, then for a function 𝑓(x), we have:

𝜕𝑓/𝜕𝜃 = (𝜕𝑓/𝜕x)(𝜕x/𝜕𝜃).
When an indicator function is required, we use the Iverson bracket notation. For a statement 𝑠, we have:

[𝑠] = 1 if 𝑠 is true; 0 otherwise.
symbol description
≡ identical equality of functions; for a constant 𝑐, 𝑓 ≡ 𝑐 is a constant function
∇ gradient operator
∅ termination option: the action of immediately terminating optimization
≺ either Pareto dominance or the Löwner order: for symmetric A, B, A ≺ B if and only if B − A
is positive definite
𝜔 ∼ 𝑝(𝜔) is sampled according to: 𝜔 is a realization of a random variable with probability density 𝑝(𝜔)
∐ᵢ Xᵢ disjoint union of {Xᵢ}: ∐ᵢ Xᵢ = ⋃ᵢ {(𝑥, 𝑖) | 𝑥 ∈ Xᵢ}
|A| determinant of square matrix A
|x| Euclidean norm of vector x; |x − y| is thus the Euclidean distance between vectors x and y
‖𝑓‖H𝐾 norm of function 𝑓 in reproducing kernel Hilbert space H𝐾
A⁻¹ inverse of square matrix A
xᵀ transpose of vector x
0 vector or matrix of zeros
A action space for a decision
𝛼 (𝑥; D) acquisition function evaluating 𝑥 given data D
𝛼𝜏 (𝑥; D) expected marginal gain in 𝑢 (D) after observing at 𝑥 then making 𝜏 − 1 additional optimal
observations given the outcome
𝛼𝜏∗ (D) value of D with horizon 𝜏: expected marginal gain in 𝑢 (D) from 𝜏 additional optimal obser-
vations
𝛼 ei expected improvement
𝛼𝑓 ∗ mutual information between 𝑦 and 𝑓 ∗
𝛼 kg knowledge gradient
𝛼 pi probability of improvement
𝛼𝑥 ∗ mutual information between 𝑦 and 𝑥 ∗
𝛼 ucb upper confidence bound
𝛼 ts Thompson sampling “acquisition function”: a draw 𝑓 ∼ 𝑝(𝑓 | D)
𝛽 confidence parameter in Gaussian process upper confidence bound policy
𝛽 (x; D) batch acquisition function evaluating x given data D; may have modifiers analogous to 𝛼
C prior covariance matrix of observed values y: C = cov[y]
𝑐 (D) cost of acquiring data D
chol A Cholesky decomposition of positive definite matrix A: if 𝚲 = chol A, then A = 𝚲𝚲ᵀ
corr[𝜔,𝜓 ] correlation of random variables 𝜔 and 𝜓 ; with a single argument, corr[𝜔] = corr[𝜔, 𝜔]
cov[𝜔,𝜓 ] covariance of random variables 𝜔 and 𝜓 ; with a single argument, cov[𝜔] = cov[𝜔, 𝜔]
D set of observed data, D = (x, y)
D′, D₁ set of observed data after observing at 𝑥: D′ = D ∪ (𝑥, 𝑦) = (x′, y′)
D𝜏 set of observed data after 𝜏 observations
𝐷 kl [𝑝 k 𝑞] Kullback–Leibler divergence between distributions with probability densities 𝑝 and 𝑞
Δ(𝑥, 𝑦) marginal gain in utility after acquiring observation (𝑥, 𝑦): Δ(𝑥, 𝑦) = 𝑢(D′) − 𝑢(D)
𝛿 (𝜔 − 𝑎) Dirac delta distribution on 𝜔 with point mass at 𝑎
diag x diagonal matrix with diagonal x
𝔼, 𝔼𝜔 expectation, expectation with respect to 𝜔
𝜀 measurement error associated with an observation at 𝑥: 𝜀 = 𝑦 − 𝜙
𝑓 objective function; 𝑓 : X → ℝ
𝑓 |Y the restriction of 𝑓 onto the subdomain Y ⊂ X
𝑓∗ globally maximal value of the objective function: 𝑓 ∗ = max 𝑓
𝛾𝜏 information capacity of an observation process given 𝜏 iterations
GP (𝑓 ; 𝜇, 𝐾) Gaussian process on 𝑓 with mean function 𝜇 and covariance function 𝐾
H𝐾 reproducing kernel Hilbert space associated with kernel 𝐾
H𝐾 [𝐵] ball of radius 𝐵 in H𝐾 : {𝑓 | k𝑓 k H𝐾 ≤ 𝐵}
𝐻 [𝜔] discrete or differential entropy of random variable 𝜔
𝐻 [𝜔 | D] discrete or differential entropy of random variable 𝜔 after conditioning on D
𝐼 (𝜔;𝜓 ) mutual information between random variables 𝜔 and 𝜓
𝐼 (𝜔;𝜓 | D) mutual information between random variables 𝜔 and 𝜓 after conditioning on D
I identity matrix
𝐾 prior covariance function: 𝐾 = cov[𝑓 ]
𝐾D posterior covariance function given data D: 𝐾D = cov[𝑓 | D]
𝐾m Matérn covariance function
𝐾se squared exponential covariance function
𝜅 cross covariance between 𝑓 and observed values y: 𝜅 (𝑥) = cov[y, 𝜙 | 𝑥]
ℓ either a length-scale parameter or the lookahead horizon
𝜆 output-scale parameter
M space of models indexed by the hyperparameter vector 𝜽
m prior expected value of observed values y, m = 𝔼[y]
𝜇 either the prior mean function, 𝜇 = 𝔼[𝑓 ], or the predictive mean of 𝜙: 𝜇 = 𝔼[𝜙 | 𝑥, D] = 𝜇D (𝑥)
𝜇D posterior mean function given data D: 𝜇D = 𝔼[𝑓 | D]
N (𝝓; 𝝁, 𝚺) multivariate normal distribution on 𝝓 with mean vector 𝝁 and covariance matrix 𝚺
N measurement error covariance corresponding to observed values y
O is asymptotically bounded above by: for nonnegative functions 𝑓, 𝑔 of 𝜏, 𝑓 = O(𝑔) if 𝑓 /𝑔 is
asymptotically bounded by a constant as 𝜏 → ∞
O∗ as above with logarithmic factors suppressed: 𝑓 = O∗ (𝑔) if 𝑓 (𝜏) (log 𝜏)𝑘 = O(𝑔) for some 𝑘
Ω is asymptotically bounded below by: 𝑓 = Ω(𝑔) if 𝑔 = O(𝑓 )
𝑝 probability density
𝑞 either an approximation to probability density 𝑝 or a quantile function
Φ(𝑧) standard normal cumulative distribution function: Φ(𝑧) = ∫₋∞ᶻ 𝜙(𝑧′) d𝑧′
𝜙 value of the objective function at 𝑥: 𝜙 = 𝑓(𝑥)
𝜙(𝑧) standard normal probability density function: 𝜙(𝑧) = (2𝜋)^(−1/2) exp(−𝑧²/2)
Pr probability
ℝ set of real numbers
𝑅𝜏 cumulative regret after 𝜏 iterations
𝑅¯𝜏 [𝐵] worst-case cumulative regret after 𝜏 iterations on the rkhs ball H𝐾 [𝐵]
𝑟𝜏 simple regret after 𝜏 iterations
𝑟¯𝜏 [𝐵] worst-case simple regret after 𝜏 iterations on the rkhs ball H𝐾 [𝐵]
P a correlation matrix
𝜌 a scalar correlation
𝜌𝜏 instantaneous regret on iteration 𝜏
𝑠2 predictive variance of 𝑦; for additive Gaussian noise, 𝑠 2 = var[𝑦 | 𝑥, D] = 𝜎 2 + 𝜎𝑛2
𝚺 a covariance matrix, usually the Gram matrix associated with x: 𝚺 = 𝐾D (x, x)
𝜎2 predictive variance of 𝜙: 𝜎 2 = 𝐾D (𝑥, 𝑥)
𝜎𝑛2 variance of measurement error at 𝑥: 𝜎𝑛2 = var[𝜀 | 𝑥]
std[𝜔] standard deviation of random variable 𝜔
T(𝜙; 𝜇, 𝜎², 𝜈) Student-𝑡 distribution on 𝜙 with 𝜈 degrees of freedom, mean 𝜇, and variance 𝜎²
TN(𝜙; 𝜇, 𝜎², 𝐼) truncated normal distribution, N(𝜙; 𝜇, 𝜎²) truncated to interval 𝐼
𝜏 either decision horizon (in the context of decision making) or number of optimization itera-
tions passed (in the context of asymptotic analysis)
Θ is asymptotically bounded above and below by: 𝑓 = Θ(𝑔) if 𝑓 = O(𝑔) and 𝑓 = Ω(𝑔)
𝜽 vector of hyperparameters indexing a model space M
tr A trace of square matrix A
𝑢 (D) utility of data D
var[𝜔] variance of random variable 𝜔
𝑥 putative input location of the objective function
x either a sequence of observed locations x = {𝑥𝑖 } or (when the distinction is important) a
vector-valued input location
𝑥∗ a location attaining the globally maximal value of 𝑓 : 𝑥 ∗ ∈ arg max 𝑓 ; 𝑓 (𝑥 ∗ ) = 𝑓 ∗
X domain of objective function
𝑦 value resulting from an observation at 𝑥
y observed values resulting from observations at locations x
𝑧 𝑧-score of measurement 𝑦 at 𝑥: 𝑧 = (𝑦 − 𝜇)/𝑠
1
INTRODUCTION
1.1. formalization of optimization
Optimization policy
Directly solving for the location of global optima is infeasible except in
exceptional circumstances. The tools of traditional calculus are virtually
powerless in this setting; for example, enumerating and classifying every
stationary point in the domain would be tedious at best and perhaps
even impossible. Mathematical optimization instead takes an indirect
approach: we design a sequence of experiments to probe the objective
function for information that, we hope, will reveal the solution to (1.1).
The iterative procedure in algorithm 1.1 formalizes this process. We
begin with an initial (possibly empty) dataset D that we grow incremen-
tally through a sequence of observations of our design. In each iteration,
an optimization policy inspects the available data and selects a point
𝑥 ∈ X where we make our next observation.⁵ This action in turn reveals a corresponding value 𝑦 provided by the system under study. We append the newly observed information to our dataset and finally decide whether to continue with another observation or terminate and return the current data. When we inevitably do choose to terminate, the returned data can be used by an external consumer as desired, for example to inform a subsequent decision (terminal recommendations: § 5.1, p. 90).

⁵ Here “policy” has the same meaning as in other decision-making contexts: it maps our state (indexed by our data, D) to an action (the location of our next observation, 𝑥).
We place no restrictions on how an optimization policy is imple-
mented beyond mapping an arbitrary dataset to some point in the do-
main for evaluation. A policy may be deterministic or stochastic, as
demonstrated respectively by the prototypical examples of grid search
and random search. In fact, these popular policies are nonadaptive and
completely ignore the observed data. However, when observations only
come at significant cost, we will naturally prefer policies that adapt their
behavior in light of evolving information. The primary challenge in opti-
Observation model
For optimization to be feasible, the observations we obtain must provide
information about the objective function that can guide our search and in
aggregate determine the solution to (1.1). A near-universal assumption in
mathematical optimization is that observations yield exact evaluations of
the objective function at our chosen locations. However, this assumption
is unduly restrictive: many settings feature inexact measurements due
to noisy sensors, imperfect simulation, or statistical approximation. A
typical example featuring additive observation noise is shown in the margin. (Margin figure: inexact observations of an objective function corrupted by additive noise.) Although the objective function is not observed directly, the noisy measurements nonetheless constrain the plausible options due to strong dependence on the objective.
We thus relax the assumption of exact observation and instead assume that observations are realized by a stochastic mechanism depending on the objective function. Namely, we assume that the value 𝑦 resulting from an observation at some point 𝑥 is distributed according to an observation model depending on the underlying objective function value 𝜙 = 𝑓(𝑥):

𝑝(𝑦 | 𝑥, 𝜙).   (1.2)
Through judicious design of the observation model, we may consider a
wide range of observation mechanisms.
As with the optimization policy, we do not make any assumptions about the nature of the observation model, save one. Unless otherwise mentioned, we assume that a set of multiple measurements y are conditionally independent given the corresponding observation locations x and objective function values 𝝓 = 𝑓(x):

𝑝(y | x, 𝝓) = ∏ᵢ 𝑝(𝑦ᵢ | 𝑥ᵢ, 𝜙ᵢ).   (1.3)

This is not strictly necessary but is overwhelmingly common in practice and will simplify our presentation considerably.
One particular observation model will enjoy most of our attention in this book: additive Gaussian noise. Here we model the value 𝑦 observed at 𝑥 as

𝑦 = 𝜙 + 𝜀,

where 𝜀 represents measurement error. Errors are assumed to be Gaussian distributed with mean zero, implying a Gaussian observation model:

𝑝(𝑦 | 𝑥, 𝜙, 𝜎ₙ) = N(𝑦; 𝜙, 𝜎ₙ²).   (1.4)

(Margin figure: additive Gaussian noise – the distribution of the value 𝑦 observed at 𝑥 is Gaussian, centered on the objective function value 𝜙.)

Here the observation noise scale 𝜎ₙ may optionally depend on 𝑥, allowing us to model both homoskedastic and heteroskedastic errors (heteroskedastic noise: § 2.2, p. 25).
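As a concrete illustration of this observation model, the following minimal sketch simulates noisy measurements under (1.4); the objective 𝑓 and the noise scale function here are stand-ins chosen for illustration, not anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # a stand-in objective function; the text leaves f unspecified here
    return np.sin(3 * x) + x

def noise_scale(x):
    # heteroskedastic noise scale sigma_n(x); a constant would give homoskedastic noise
    return 0.1 + 0.2 * x

x = rng.uniform(0, 1, size=5)                         # observation locations
phi = f(x)                                            # latent objective values, phi = f(x)
y = phi + noise_scale(x) * rng.standard_normal(5)     # y = phi + eps, eps ~ N(0, sigma_n(x)^2)
```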
Termination
The final decision we make in each iteration of optimization is whether
to terminate immediately or continue with another observation. As with
the optimization policy, we do not assume any particular mechanism by
which this decision is made. Termination may be deterministic – such
as stopping after reaching a certain optimization goal or exhausting
a preallocated observation budget – or stochastic, and may optionally
depend on the observed data. In many cases, the time of termination
may in fact not be under the control of the optimization routine at all
but instead decided by an external agent. However, we will also consider
scenarios where the optimization procedure can dynamically choose when to return based upon inspection of the available data (optimal termination: § 5.4, p. 103; practical termination: § 9.3, p. 210).
observation with some measure of faith that the outcome will ultimately
prove beneficial and justify the cost of obtaining it. The sequential nature
of optimization further compounds the weight of this uncertainty, as the
outcome of each observation not only has an immediate impact, but also
forms the basis on which all future decisions are made. Developing an
effective policy requires somehow addressing this uncertainty.
The Bayesian approach systematically relies on probability and Bayes-
ian inference to reason about the uncertain quantities arising during
optimization. This critically includes the objective function itself, which
is treated as a random variable to be inferred in light of our prior ex-
pectations and any available data. In Bayesian optimization, this belief
then takes an active role in decision making by guiding the optimiza-
tion policy, which may evaluate the merit of a proposed observation
location according to our belief about the value we might observe. We
introduce the key ideas of this process with examples below, starting
with a refresher on Bayesian inference.
Bayesian inference
To frame the following discussion, we offer a quick overview of Bayesian inference as a reminder to the reader. This introduction is far from complete, but there are numerous excellent references available.⁶

⁶ The literature is vast. The following references are excellent, but no list can be complete: d. j. c. mackay (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press; a. o’hagan and j. forster (2004). Kendall’s Advanced Theory of Statistics. Vol. 2b: Bayesian Inference. Arnold; j. o. berger (1985). Statistical Decision Theory and Bayesian Analysis. Springer–Verlag.

Bayesian inference is a framework for inferring uncertain features of a system of interest from observations grounded in the laws of probability. To illustrate the basic ideas, we may begin by identifying some unknown feature of a given system that we wish to reason about. In the context of optimization, this might represent, for example, the value of the objective function at a given location, or the location 𝑥∗ or value 𝑓∗ of the global optimum (1.1). We will take the first of these as a running example: inferring about the value of an objective function at some arbitrary point 𝑥, 𝜙 = 𝑓(𝑥). We will shortly extend this example to inference about the entire objective function.
In the Bayesian approach to inference, all unknown quantities are
treated as random variables. This is a powerful convention as it allows us
to represent beliefs about these quantities with probability distributions
reflecting their plausible values. Inference then takes the form of an
inductive process where these beliefs are iteratively refined in light of
observed data by appealing to probabilistic identities.
As with any induction, we must start somewhere. Here we begin with
a so-called prior distribution (or simply prior), 𝑝(𝜙 | 𝑥), which encodes what we consider to be plausible values for 𝜙 before observing any data.⁷ The prior distribution allows us to inject our knowledge about and experience with the system of interest into the inferential process, saving us from having to begin “from scratch” or entertain patently absurd possibilities. The left panel of figure 1.1 illustrates a prior distribution for our example, indicating support over a range of values.

⁷ Here we assume the location of interest 𝑥 is known, hence our conditioning the prior on its value.
Once a prior has been established, the next stage of inference is to refine our initial beliefs in light of observed data. Suppose in our example we now obtain a measurement 𝑦 at 𝑥; the observation model 𝑝(𝑦 | 𝑥, 𝜙) serves as the likelihood of this value. Bayes’ theorem then yields the posterior distribution of 𝜙:

𝑝(𝜙 | 𝑥, 𝑦) = 𝑝(𝜙 | 𝑥) 𝑝(𝑦 | 𝑥, 𝜙) / 𝑝(𝑦 | 𝑥).   (1.5)

Figure 1.1: Bayesian inference for an unknown function value 𝜙 = 𝑓(𝑥). Left: a prior distribution over 𝜙; middle: the likelihood of the marked observation 𝑦 according to an additive Gaussian noise observation model (1.4) (prior shown for reference); right: the posterior distribution in light of the observation and the prior (prior and likelihood shown for reference).
The posterior is proportional to the prior weighted by the likelihood of
the observed value. The denominator is a constant with respect to 𝜙 that
ensures normalization:
𝑝(𝑦 | 𝑥) = ∫ 𝑝(𝑦 | 𝑥, 𝜙) 𝑝(𝜙 | 𝑥) d𝜙.   (1.6)
The right panel of figure 1.1 shows the posterior resulting from the
measurement in the middle panel. The posterior represents a compromise
between our experience (encoded in the prior) and the information
contained in the data (encoded in the likelihood).
Throughout this book we will use the catchall notation D to represent all the information influencing a posterior belief; here the relevant information is D = (𝑥, 𝑦), and the posterior distribution is then 𝑝(𝜙 | D).
As mentioned previously, Bayesian inference is an inductive process
whereby we can continue to refine our beliefs through additional ob-
servation. At this point, the induction is trivial: to incorporate a new
observation, what was our posterior serves as the prior in the context
of the new information, and multiplying by the likelihood and renor-
malizing yields a new posterior. We may continue in this manner as
desired.
The posterior distribution is not usually the end result of Bayesian
inference but rather a springboard enabling follow-on tasks such as
prediction or decision making, both of which are integral to Bayesian
optimization. To address the former, suppose that after deriving the
posterior (1.5), we wish to predict the result of an independent, repeated
noisy observation at 𝑥, 𝑦′. Treating the outcome as a random variable, we may derive its distribution by integrating our posterior belief about 𝜙 against the observation model (1.2):⁸

𝑝(𝑦′ | 𝑥, D) = ∫ 𝑝(𝑦′ | 𝑥, 𝜙) 𝑝(𝜙 | 𝑥, D) d𝜙;   (1.7)

this is known as the posterior predictive distribution for 𝑦′. By integrating over all possible values of 𝜙 weighted by their plausibility, the posterior predictive distribution naturally accounts for uncertainty in the unknown objective function value; see the figure in the margin.

(Margin figure: posterior predictive distribution for a repeated measurement at 𝑥 for our running example. The location of our first measurement 𝑦 and the posterior distribution of 𝜙 are shown for reference. There is more uncertainty in 𝑦′ than 𝜙 due to the effect of observation noise.)

⁸ This expression takes the same form as (1.6), which is simply the (prior) predictive distribution evaluated at the actual observed value.

The Bayesian approach to decision making also relies on a posterior belief about unknown features affecting the outcomes of our decisions, as we will discuss shortly.
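Because both the prior on 𝜙 and the observation model are Gaussian in our running example, the posterior (1.5) and the posterior predictive (1.7) are available in closed form. The sketch below works through this conjugate update; the prior moments, noise scale, and observed value are illustrative assumptions, not values from the text.

```python
import numpy as np

# prior belief about phi = f(x): p(phi | x) = N(m, v)   (illustrative numbers)
m, v = 0.0, 1.0
# additive Gaussian noise observation model (1.4) with noise scale sigma_n
sigma_n = 0.5
y = 1.2        # an observed value at x

# posterior (1.5) is Gaussian by conjugacy: a precision-weighted combination
v_post = 1.0 / (1.0 / v + 1.0 / sigma_n**2)
m_post = v_post * (m / v + y / sigma_n**2)

# posterior predictive (1.7) for a repeated noisy measurement y' at x:
# integrating N(y'; phi, sigma_n^2) against N(phi; m_post, v_post)
m_pred, v_pred = m_post, v_post + sigma_n**2
```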
Figure 1.2: An example prior process for an objective defined on an interval. We illustrate the marginal belief at every
location with its mean and a 95% credible interval and also show three example functions sampled from the
prior process.
We summarize the marginal belief of the model, for each point in the
domain showing the prior mean (dark blue) and 95% credible interval
(light blue) for the corresponding function value. We also show three
functions sampled from the prior process, each exhibiting the assumed
behavior. We encourage the reader to become comfortable with this
plotting convention, as we will use it throughout this book. In particular
we eschew axis labels, as they are always the same: the horizontal axis
represents the domain X and the vertical axis the function value. Further,
we do not mark units on axes to stress relative rather than absolute
behavior, as scale is arbitrary in this illustration.
We can encode a vast array of information into the prior process
and can model significantly more complex structure than in this sim-
ple example. We will explore the world of possibilities in chapter 3,
including interaction at different scales, nonstationarity, low intrinsic dimensionality, and more (nonstationarity, warping: § 3.4, p. 56; low intrinsic dimensionality: § 3.5, p. 61).
With the prior process in hand, suppose we now make a set of
observations at some locations x, revealing corresponding values y; we
aggregate this information into a dataset D = (x, y). Bayesian inference accounts for these observations by forming the posterior process 𝑝(𝑓 | D).
The derivation of the posterior process can be understood as a two-stage process. First we consider the impact of the data on the corresponding function values 𝝓 alone (1.5):

𝑝(𝝓 | D) ∝ 𝑝(𝝓 | x) 𝑝(y | x, 𝝓).   (1.9)

The quantities on the right-hand side are known: the first term is given by the prior process (1.8), and the second by the observation model (1.3), which serves the role of a likelihood. We now extend the posterior on 𝝓 to all of 𝑓:⁹

𝑝(𝑓 | D) = ∫ 𝑝(𝑓 | x, 𝝓) 𝑝(𝝓 | D) d𝝓.   (1.10)

⁹ The given expression sweeps some details under the rug. A careful derivation of the posterior process proceeds by finding the posterior of an arbitrary finite-dimensional vector 𝝓∗ = 𝑓(x∗): 𝑝(𝝓∗ | x∗, D) = ∫ 𝑝(𝝓∗ | x∗, x, 𝝓) 𝑝(𝝓 | D) d𝝓,
Figure 1.3: The posterior process for our example scenario in figure 2.1 conditioned on three exact observations.
Figure 1.4: A prototypical acquisition function corresponding to our example posterior from figure 1.3.
(Figure 1.5 panels: the objective function, followed by the posterior after steps 1–19; selected points are annotated as exploitation or exploration.)
Figure 1.5: The posterior after the indicated number of steps of an example Bayesian optimization policy, starting from
the posterior in figure 1.4. The marks show the points chosen by the policy, progressing from top to bottom.
Observations sufficiently close to the optimum are marked in red; the optimum was located on iteration 7.
𝑝 (𝑓 ) = GP (𝑓 ; 𝜇, 𝐾)
2.1. definition and basic properties
Let us pause to consider the implications of this choice. First, note that var[𝜙 | 𝑥] = 𝐾(𝑥, 𝑥) = 1 at every point 𝑥 ∈ X, and thus the covariance function (2.4) also measures the correlation between the function values 𝜙 and 𝜙′. This correlation decreases with the distance between 𝑥 and 𝑥′, falling from unity to zero as these points become increasingly separated; see the illustration in the margin. (Margin figure: the squared exponential covariance (2.4) as a function of the distance between inputs.) We can loosely interpret this as a statistical consequence of continuity: function values at nearby locations are highly correlated, whereas function values at distant locations are effectively independent. This assumption also implies that observing the function at some point 𝑥 provides nontrivial information about the function at sufficiently nearby locations (roughly when |𝑥 − 𝑥′| < 3). We will explore this implication further shortly.
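The claimed decay of correlation with distance is easy to check numerically. A small sketch, assuming the unit-scale squared exponential covariance (2.4):

```python
import numpy as np

def K_se(x1, x2):
    # squared exponential covariance (2.4) with unit output scale and length scale
    return np.exp(-0.5 * (x1 - x2)**2)

d = np.array([0.0, 0.5, 1.0, 2.0, 3.0])
print(K_se(0.0, d))
# roughly [1.0, 0.88, 0.61, 0.14, 0.011]: function values are effectively
# independent once the separation |x - x'| exceeds about 3
```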
Figure 2.1: Our example Gaussian process on the domain X = [0, 30]. We illustrate the marginal belief at every location
with its mean and a 95% credible interval and also show three example functions sampled from the process.
For a Gaussian process, the marginal distribution of any single function value is univariate normal (2.2): 𝑝(𝜙 | 𝑥) = N(𝜙; 𝜇, 𝜎²), with 𝜇 = 𝜇(𝑥) and 𝜎² = 𝐾(𝑥, 𝑥); from these moments we may compute predictive credible intervals.
Sampling
We may gain more insight by inspecting samples drawn from our exam-
ple process reflecting the joint distribution of function values. Although
it is impossible to represent an arbitrary function on X in finite memory,
we can approximate the sampling process by taking a dense grid x ⊂ X and sampling the corresponding function values from their joint multivariate normal distribution (2.2) (sampling from a multivariate normal distribution: § a.2, p. 299). Plotting the sampled vectors against
the chosen grid reveals curves approximating draws from the Gaussian
process. Figure 2.1 illustrates this procedure for our example using a grid
of 1000 equally spaced points. Each sample is smooth and has several
local optima distributed throughout the domain – for some applications,
this might be a reasonable model for an objective function on X.
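The sampling procedure described above can be sketched in a few lines. The zero prior mean, unit-hyperparameter squared exponential covariance, and the small jitter added for numerical stability are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def K_se(X1, X2):
    # squared exponential covariance (2.4); unit hyperparameters assumed
    return np.exp(-0.5 * (X1[:, None] - X2[None, :])**2)

x_grid = np.linspace(0, 30, 1000)        # dense grid on X = [0, 30]
mu = np.zeros_like(x_grid)               # prior mean (zero, for illustration)
Sigma = K_se(x_grid, x_grid)             # Gram matrix of the grid

# sample three functions from the joint multivariate normal (2.2);
# the jitter keeps the covariance numerically positive definite
jitter = 1e-10 * np.eye(len(x_grid))
samples = rng.multivariate_normal(mu, Sigma + jitter, size=3)

# 95% pointwise credible interval from the marginal belief
lo = mu - 1.96 * np.sqrt(np.diag(Sigma))
hi = mu + 1.96 * np.sqrt(np.diag(Sigma))
```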
2.2. inference with exact and noisy observations
Figure 2.2: The posterior for our example scenario in figure 2.1 conditioned on three exact observations.
𝑝(𝑓, y) = GP([𝑓, y]; [𝜇, m], [𝐾, 𝜅ᵀ; 𝜅, C]).   (2.6)

This notation, analogous to (a.12), extends the Gaussian process on 𝑓 to include the entries of y; that is, we assume the distribution of any finite subset of function and/or observed values is multivariate normal. We specify the joint distribution via the marginal distribution of y:¹²

𝑝(y) = N(y; m, C),   (2.7)

and the cross-covariance function between y and 𝑓:

𝜅(𝑥) = cov[y, 𝜙 | 𝑥].   (2.8)

¹² We assume C is positive definite; if it were only positive semidefinite, there would be wasteful linear dependence among observations.
Although it may seem absurd that we could identify and observe a vector
satisfying such strong restrictions on its distribution, we can already
deduce several examples from first principles, including:
• any vector of function values (2.2) (inference from exact observations: § 2.2, p. 22),
• any affine transformation of function values (a.10) (§ a.2, p. 298), and
• limits of such quantities, such as partial derivatives or expectations (§ 2.6, p. 30).
Further, we may condition on any of the above even if corrupted by independent additive Gaussian noise, as we will shortly demonstrate.
We may condition the joint distribution (2.6) on y analogously to the finite-dimensional case (a.14) (conditioning a multivariate normal distribution: § a.2, p. 299), resulting in a Gaussian process posterior on 𝑓. Writing D = y for the observed data, we have:
𝑝 (𝑓 | D) = GP (𝑓 ; 𝜇D , 𝐾D ), (2.9)
where

𝜇D(𝑥) = 𝜇(𝑥) + 𝜅(𝑥)ᵀC⁻¹(y − m);
𝐾D(𝑥, 𝑥′) = 𝐾(𝑥, 𝑥′) − 𝜅(𝑥)ᵀC⁻¹𝜅(𝑥′).   (2.10)
This can be verified by computing the joint distribution of an arbitrary finite set of function values and y and conditioning on the latter (a.14).¹³

¹³ This is a useful exercise! The result will be a stochastic process with multivariate normal finite-dimensional distributions, a Gaussian process by definition (2.5).
The above result provides a simple procedure for gp posterior infer-
ence from any vector of observations satisfying (2.6):
1. compute the marginal distribution of y (2.7),
2. derive the cross-covariance function 𝜅 (2.8), and
3. find the posterior distribution of 𝑓 via (2.9–2.10).
We will realize this procedure for several special cases below. However,
we will first demonstrate how we may seamlessly handle measurements
corrupted by additive Gaussian noise and build intuition for the posterior
distribution by dissecting its moments in terms of the statistics of the
observations and the correlation structure of the prior.
where S is diagonal with 𝑆ᵢᵢ = √𝐶ᵢᵢ = std[𝑦ᵢ] and P = corr[y] is the observation correlation matrix. We may then rewrite the posterior mean of 𝜙 as

𝜇 + 𝜎𝝆ᵀP⁻¹z,

where z and 𝝆 represent the vectors of measurement 𝑧-scores and the cross-correlation between 𝜙 and y, respectively:

𝑧ᵢ = (𝑦ᵢ − 𝑚ᵢ)/𝑠ᵢ;   𝜌ᵢ = [𝜅(𝑥)]ᵢ/(𝜎𝑠ᵢ).

The posterior mean is now in the same form as the scalar case (2.12), with the introduction of the observation correlation matrix moderating the 𝑧-scores to account for dependence between the observed values.¹⁵ The posterior standard deviation of 𝜙 in the vector-valued case is

𝜎 √(1 − 𝝆ᵀP⁻¹𝝆),

¹⁵ It can be instructive to contrast the behavior of the posterior when conditioning on two highly correlated values versus two independent ones. In the former case, the posterior does not change much as a result of the second measurement, as dependence reduces the effective number of measurements.
𝑝(𝝓 | x) = N(𝝓; 𝝁, 𝚺),

𝑝(𝑓 | D) = GP(𝑓; 𝜇D, 𝐾D),

where

𝜇D(𝑥) = 𝜇(𝑥) + 𝐾(𝑥, x)𝚺⁻¹(𝝓 − 𝝁);
𝐾D(𝑥, 𝑥′) = 𝐾(𝑥, 𝑥′) − 𝐾(𝑥, x)𝚺⁻¹𝐾(x, 𝑥′).   (2.14)
Our previous figure 2.2 illustrates the posterior resulting from conditioning our gp prior in figure 2.1 on three exact measurements, with high-level analysis of its behavior in the accompanying text.

N = 𝜎ₙ²I,   (2.16)

and independent heteroskedastic noise with scale depending on location according to a function 𝜎ₙ: X → ℝ≥0:

For a given observation location 𝑥, we will simply write 𝜎ₙ for the associated noise scale, leaving any dependence on 𝑥 implicit.
Figure 2.3: Posteriors for our example gp from figure 2.1 conditioned on 15 noisy observations with independent ho-
moskedastic noise (2.16). The signal-to-noise ratio is 10 for the top example, 3 for the middle example, and 1 for
the bottom example.
Figure 2.3 shows a sequence of posterior distributions resulting from
conditioning our example gp on data corrupted by increasing levels of
homoskedastic noise (2.16). As the noise level increases, the observations
have diminishing influence on our belief, with some extreme values
eventually being partially explained away as outliers. As measurements
are assumed to be inexact, the posterior mean is not compelled to inter-
polate perfectly through the observations, as in the exact case (figure
Figure 2.4: The posterior distribution for our example gp from figure 2.1 conditioned on 50 observations with heteroskedas-
tic observation noise (2.17). We show predictive credible intervals for both the latent objective function and
noisy observations; the standard deviation of the observation noise increases linearly from left to right.
2.2). Further, with increasing levels of noise, our posterior belief reflects
significant residual uncertainty in the function, even in regions with
multiple nearby observations.
We illustrate an example of Gaussian process inference with het-
eroskedastic noise (2.17) in figure 2.4, where the signal-to-noise ratio
decreases smoothly from left-to-right over the domain. Although the
observations provide relatively even coverage, our posterior uncertainty
is minimal on the left-hand side of the domain – where the measure-
ments provide maximal information – and increases as our observations
become more noisy and less informative.
We will often require the posterior predictive distribution for a noisy measurement 𝑦 that would result from observing at a given location 𝑥. The posterior distribution on 𝑓 (2.19) provides the posterior predictive distribution for the latent function value 𝜙 = 𝑓(𝑥) (2.5), 𝑝(𝜙 | 𝑥, D) = N(𝜙; 𝜇, 𝜎²) with 𝜇 = 𝜇D(𝑥) and 𝜎² = 𝐾D(𝑥, 𝑥), but this does not account for the effect of observation noise. In the case of independent additive Gaussian noise (2.16–2.17), deriving the posterior predictive distribution is trivial; we have (a.15):

𝑝(𝑦 | 𝑥, D) = N(𝑦; 𝜇, 𝑠²);   𝑠² = 𝜎² + 𝜎ₙ².

More generally, writing 𝑁 for the noise covariance, 𝑁(𝑥, 𝑥′) = cov[𝜀, 𝜀′ | 𝑥, 𝑥′], the posterior of the observation process is then a gp with mean function 𝜇D and covariance function 𝐾D + 𝑁.
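A minimal sketch of these two predictive distributions, assuming posterior moments of the latent value are already in hand and an illustrative heteroskedastic noise scale function:

```python
import numpy as np

def noise_scale(x):
    # an assumed heteroskedastic noise scale sigma_n(x), increasing across the domain,
    # loosely mimicking the setup described for figure 2.4
    return 0.05 * (1.0 + x / 10.0)

# posterior moments of the latent value phi = f(x) at a test point x,
# e.g. taken from the posterior computed in the earlier sketch (placeholder numbers)
x, mu_D, var_D = 20.0, 0.3, 0.04

# predictive for the latent value:     p(phi | x, D) = N(mu_D, var_D)
# predictive for a noisy observation:  p(y | x, D) = N(mu_D, s^2), s^2 = var_D + sigma_n(x)^2
s2 = var_D + noise_scale(x)**2
ci_phi = (mu_D - 1.96 * np.sqrt(var_D), mu_D + 1.96 * np.sqrt(var_D))
ci_y = (mu_D - 1.96 * np.sqrt(s2), mu_D + 1.96 * np.sqrt(s2))
```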
2.4. joint gaussian processes
Definition
To elaborate, consider a set of functions {𝑓ᵢ: Xᵢ → ℝ} we wish to model.²¹ We define the disjoint union of these functions, ⊔𝑓 – defined on the disjoint union²² of their domains, X = ∐ᵢ Xᵢ – by insisting its restriction to each domain be compatible with the corresponding function:

⊔𝑓: X → ℝ;   ⊔𝑓 |Xᵢ ≡ 𝑓ᵢ.

We now can define a gp on ⊔𝑓 by choosing mean and covariance functions on X as desired:

𝑝(⊔𝑓) = GP(⊔𝑓; 𝜇, 𝐾).   (2.22)

We will call this construction a joint Gaussian process on {𝑓ᵢ}.

²¹ The domains need not be equal, but they often are in practice.
²² A disjoint union represents a point 𝑥 ∈ Xᵢ by the pair (𝑥, 𝑖), thereby combining the domains while retaining their identities.
It is often convenient to decompose the moments of a joint gp into their restrictions on relevant subspaces. For example, consider a joint gp (2.22) on 𝑓: F → ℝ and 𝑔: G → ℝ. After defining

𝜇𝑓 ≡ 𝜇|F;   𝜇𝑔 ≡ 𝜇|G;
𝐾𝑓 ≡ 𝐾|F×F;   𝐾𝑔 ≡ 𝐾|G×G;   𝐾𝑓𝑔 ≡ 𝐾|F×G;   𝐾𝑔𝑓 ≡ 𝐾|G×F,

we can see that 𝑓 and 𝑔 in fact have marginal gp distributions:²³

𝑝(𝑓) = GP(𝑓; 𝜇𝑓, 𝐾𝑓);   𝑝(𝑔) = GP(𝑔; 𝜇𝑔, 𝐾𝑔),   (2.23)

When convenient we will notate a joint gp in terms of these decomposed functions, here writing:²⁴

𝑝(𝑓, 𝑔) = GP([𝑓, 𝑔]; [𝜇𝑓, 𝜇𝑔], [𝐾𝑓, 𝐾𝑓𝑔; 𝐾𝑔𝑓, 𝐾𝑔]).   (2.25)

²³ In fact, any restriction of a gp-distributed function has a gp (or multivariate normal) distribution.
²⁴ We also used this notation in (2.6), where the “domain” of the vector y can be taken to be some finite index set of appropriate size.
Example
We can demonstrate the behavior of a joint Gaussian process by extend-
ing our running example gp on 𝑓 : [0, 30] → ℝ. Recall the prior on 𝑓 has
Figure 2.5: A joint Gaussian process over two functions on the shared domain X = [0, 30]. The marginal belief over both
functions is the same as our example gp from figure 2.1, but the cross-covariance (2.26) between the functions
strongly couples their behavior. We also show a sample from the joint distribution illustrating the strong
correlation induced by the joint prior.
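A joint Gaussian process of this kind is easy to simulate. The sketch below couples two functions on a shared domain through a cross-covariance of the assumed form 𝐾𝑓𝑔 = 𝜌𝐾 (the actual cross-covariance (2.26) behind figure 2.5 is not reproduced in this excerpt) and draws one joint sample:

```python
import numpy as np

rng = np.random.default_rng(1)

def K_se(X1, X2):
    return np.exp(-0.5 * (X1[:, None] - X2[None, :])**2)

x = np.linspace(0, 30, 200)
K = K_se(x, x)
rho = 0.9                      # assumed cross-correlation, chosen for illustration
K_fg = rho * K                 # cross-covariance coupling f and g

# joint covariance over the stacked vector [f(x); g(x)]
joint_K = np.block([[K, K_fg], [K_fg.T, K]])
joint_mu = np.zeros(2 * len(x))

sample = rng.multivariate_normal(joint_mu, joint_K + 1e-10 * np.eye(2 * len(x)))
f_sample, g_sample = sample[:len(x)], sample[len(x):]
```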
2.5 continuity
In this and the following sections we will establish some important
properties of Gaussian processes determined by the properties of their
Figure 2.6: The joint posterior for our example joint gp prior in figure 2.5 conditioned on five exact observations of each
function.
2.6 differentiability
We can approach the question of differentiability by again reasoning
about the limiting behavior of linear transformations of function values.
Suppose 𝑓 : X → ℝ with X ⊂ ℝ𝑑 has distribution GP (𝑓 ; 𝜇, 𝐾), and
consider the 𝑖th partial derivative of 𝑓 at x, if it exists:
∂𝑓/∂𝑥ᵢ(x) = lim_{ℎ→0} [𝑓(x + ℎeᵢ) − 𝑓(x)]/ℎ,

where eᵢ is the 𝑖th standard basis vector. For ℎ > 0, the value in the limit is Gaussian distributed as a linear transformation of Gaussian-distributed random variables (a.9). Assuming the corresponding partial derivative of the mean exists at x and the corresponding partial derivative with respect to each input of the covariance function exists at x = x′, then as ℎ → 0 the partial derivative converges in distribution to a Gaussian (sequences of normal rvs: § a.2, p. 300):

𝑝(∂𝑓/∂𝑥ᵢ(x) | x) = N(∂𝑓/∂𝑥ᵢ(x); ∂𝜇/∂𝑥ᵢ(x), ∂²𝐾/∂𝑥ᵢ∂𝑥ᵢ′(x, x)).
If this property holds for each coordinate 1 ≤ 𝑖 ≤ 𝑑, then 𝑓 is said to be differentiable in mean square at x.
If 𝑓 is differentiable in mean square everywhere in the domain, the process itself is called differentiable in mean square, and we have the remarkable result that the function and its gradient have a joint Gaussian process distribution:

𝑝(𝑓, ∇𝑓) = GP([𝑓, ∇𝑓]; [𝜇, ∇𝜇], [𝐾, 𝐾∇ᵀ; ∇𝐾, ∇𝐾∇ᵀ]).   (2.28)
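For a concrete instance of this joint distribution, the sketch below conditions the derivative of a one-dimensional process on exact function observations, using the analytic derivatives of an assumed unit-hyperparameter squared exponential kernel and the general formulas (2.9–2.10); the data are placeholders.

```python
import numpy as np

def K_se(a, b):
    # unit-hyperparameter squared exponential kernel (an assumed choice)
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2)

def dK_da(a, b):
    # cov[f'(a), f(b)] = dK/da = (b - a) K(a, b) for this kernel
    return (b[None, :] - a[:, None]) * K_se(a, b)

def d2K(a, b):
    # cov[f'(a), f'(b)] = (1 - (a - b)^2) K(a, b) for this kernel
    return (1.0 - (a[:, None] - b[None, :])**2) * K_se(a, b)

# exact observations of f (illustrative data), zero prior mean assumed
x_obs = np.array([5.0, 12.0, 20.0])
y_obs = np.array([0.2, 1.0, -0.5])
x_test = np.linspace(0, 30, 200)

C = K_se(x_obs, x_obs) + 1e-10 * np.eye(3)    # covariance of the observations
cross = dK_da(x_test, x_obs)                  # cov[f'(x*), f(x_obs)]

# posterior moments of the derivative, via the general formulas (2.9-2.10)
mu_grad = cross @ np.linalg.solve(C, y_obs)
K_grad = d2K(x_test, x_test) - cross @ np.linalg.solve(C, cross.T)
```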
Figure 2.7: The joint posterior of the function and its derivative for our example Gaussian process from figure 2.2. The
dashed line in the lower plot corresponds to a derivative of zero.
and 𝐾∇ᵀ: X × X → (ℝᵈ)∗ maps pairs of points to row vectors – the transpose of the covariance between 𝑓(x) and ∇𝑓(x′):

𝐾∇ᵀ(x, x′) = ∇𝐾(x′, x)ᵀ.

Finally, the function ∇𝐾∇ᵀ: X × X → ℝᵈ×ᵈ represents the result of applying both operations, mapping a pair of points to the covariance matrix between the entries of the corresponding gradients:

[∇𝐾∇ᵀ(x, x′)]ᵢⱼ = cov[∂𝑓/∂𝑥ᵢ(x), ∂𝑓/∂𝑥ⱼ′(x′) | x, x′] = ∂²𝐾/∂𝑥ᵢ∂𝑥ⱼ′(x, x′).
Figure 2.8: The joint posterior of the derivative of our example Gaussian process after adding a new observation nearby
another suggesting a large positive slope. The dashed line in the lower plot corresponds to a derivative of zero.
2.7. existence and uniqueness of global maxima
Figure 2.9: The joint posterior of the derivative of our example Gaussian process after adding an exact observation of the
derivative at the indicated location. The dashed line in the lower plot corresponds to a derivative of zero.
then (under mild conditions) we again have a joint Gaussian process distribution over 𝑓 and 𝑍.³³ This enables both inference about 𝑍 and conditioning on noisy observations of integrals, such as a Monte Carlo estimate of an expectation. The former is the basis for Bayesian quadrature, an analog of Bayesian optimization bringing Bayesian experimental design to bear on numerical integration.³²,³⁴,³⁵

³³ This can be shown, for example, by considering the limiting distribution of Riemann sums.
³⁴ a. o’hagan (1991). Bayes–Hermite Quadrature. Journal of Statistical Planning and Inference 29(3):245–260.
³⁵ c. e. rasmussen and z. ghahramani (2002). Bayesian Monte Carlo. neurips 2002.
As 𝑓 is unknown, these quantities are random variables. Many Bayesian optimization algorithms operate by reasoning about the distributions of (and uncertainties in) these quantities induced by our belief on 𝑓 (mutual information and entropy search: § 7.6, p. 135).
There are two technical issues we must address. The first is whether
we can be certain that a globally optimal value 𝑓 ∗ exists when the objec-
tive function is random. If existence is not guaranteed, then its distribu-
tion is meaningless. The second issue is one of uniqueness: assuming the
objective does attain a maximal value, can we be certain the optimum
is unique? In general 𝑥 ∗ is a set-valued random variable, and thus its
distribution might have support over arbitrary subsets of the domain,
rendering it complicated to reason about. However, if we could ensure
the uniqueness of 𝑥 ∗, its distribution would have support on X rather
than its power set, allowing more straightforward inference.
Both the existence of 𝑓 ∗ and uniqueness of 𝑥 ∗ are tacitly assumed
throughout the Bayesian optimization literature when building algo-
rithms based on distributions of these quantities, but these properties
are not guaranteed for arbitrary Gaussian processes. However, we can
ensure these properties hold almost surely under mild conditions.
Counterexamples
Although the above conditions for ensuring existence of 𝑓 ∗ and unique-
ness of 𝑥 ∗ are fairly mild, it is easy to construct counterexamples.
Consider a function on the closed unit interval, which we note is compact: 𝑓: [0, 1] → ℝ. We endow 𝑓 with a “white noise”⁴¹ Gaussian process with

𝜇(𝑥) ≡ 0;   𝐾(𝑥, 𝑥′) = [𝑥 = 𝑥′].

⁴¹ It turns out this naïve model of white noise has horrible mathematical properties, but it is sufficient for this counterexample.
Now 𝑓 almost surely does not have a maximum. Roughly, because the value of 𝑓 at every point in the domain is independent of every other, there will almost always be a point with value exceeding any putative maximum.⁴² However, the conditions of sample path continuity were violated as the covariance is discontinuous at 𝑥 = 𝑥′.

⁴² Let 𝑄 = ℚ ∩ [0, 1] = {𝑞ᵢ} be the rationals in the domain and let 𝑓∗ be a putative maximum. Defining 𝜙ᵢ = 𝑓(𝑞ᵢ), we must have 𝜙ᵢ ≤ 𝑓∗ for every 𝑖; call this event 𝐴. Define the event 𝐴ₖ by 𝑓∗ exceeding the first 𝑘 elements of 𝑄. From independence, Pr(𝐴ₖ) = ∏ᵢ₌₁ᵏ Pr(𝜙ᵢ ≤ 𝑓∗) = Φ(𝑓∗)ᵏ, which vanishes as 𝑘 → ∞; hence Pr(𝐴) = 0.

We may also construct a Gaussian process that almost surely achieves a maximum that is not unique. Consider a random function 𝑓 defined on the (compact) interval [0, 4𝜋] defined by the parametric model
Figure 2.10: Regression with observations corrupted with heavy-tailed noise. The triangular marks indicate observations
lying beyond the plotted range. Shown is the posterior distribution of an objective function (ground truth
plotted in red) modeling the errors as Gaussian. The posterior is heavily affected by the outliers.
One obvious limitation is an incompatibility with naturally non-Gaussian observations. A scenario particularly relevant to optimization is heavy-tailed noise. Consider the data shown in figure 2.10, where some observations represent extreme outliers. These errors are poorly modeled as Gaussian, and attempting to infer the underlying objective function with the additive Gaussian noise model leads to overfitting and poor predictive performance. A Student-𝑡 error model with 𝜈 ≈ 4 degrees of freedom provides a robust alternative:⁴³

𝑝(𝑦 | 𝑥, 𝜙) = T(𝑦; 𝜙, 𝜎ₙ², 𝜈).   (2.31)

(Margin figure: a Student-𝑡 error model (solid) with a Gaussian error model (dashed) for reference. The heavier tails of the Student-𝑡 model can better explain large outliers.)

The heavier tails of this model can better explain large outliers; unfortunately, the non-Gaussian nature of this model also renders exact inference impossible. We will demonstrate how to overcome this impasse.

⁴³ k. l. lange et al. (1989). Robust Statistical Modeling Using the 𝑡 Distribution. Journal of the American Statistical Association 84(408):881–896.

Constraints on an objective function, such as bounds on given function values, can also provide valuable information during optimization, but many natural constraints cannot be reduced to observations that can be handled in closed form. Several Bayesian optimization policies impose hypothetical constraints on the objective function when designing each observation, requiring inference from intractable constraints even when the observations themselves pose no difficulties.
To see how constraints might arise in optimization, consider a Gaussian process belief on a one-dimensional objective 𝑓, and suppose we wish to condition on 𝑓 having a local maximum at a given location 𝑥. Assuming the function is twice differentiable, we can invoke the second-derivative test to encode this information in two constraints (differentiability, derivative observations: § 2.6, p. 30):
Figure 2.11: The probability density function of an example distribution along with 50 samples drawn independently from the distribution. In Monte Carlo approaches, the distribution is effectively approximated by a mixture of Dirac delta distributions at the sample locations.
The functions {𝑡ᵢ} are called factors or local functions that may comprise a likelihood augmented by any desired (hard or soft) constraints. The term “local functions” arises because each factor often depends only on a low-dimensional subspace of y, often a single entry.⁴⁵

⁴⁵ For example, when observations are conditionally independent given the corresponding function values, the likelihood factorizes into a product of one-dimensional factors (1.3): 𝑝(y | x, 𝝓) = ∏ᵢ 𝑝(𝑦ᵢ | 𝑥ᵢ, 𝜙ᵢ).

The posterior on y (2.33) in turn induces a posterior on 𝑓:

𝑝(𝑓 | D) = ∫ 𝑝(𝑓 | y) 𝑝(y | D) dy.   (2.34)
Figure 2.12: Regression with observations corrupted with heavy-tailed noise. The triangular marks indicate observations
lying beyond the plotted range. Shown is the posterior distribution of an objective function (ground truth
plotted in red) modeling the errors as Student-𝑡 distributed with 𝜈 = 4 degrees of freedom. The posterior was
approximated from 100 000 Monte Carlo samples. Comparing with the additive Gaussian noise model from
figure 2.10, this model effectively ignores the outliers and the fit is excellent.
𝑝(𝑓 | D) ≈ (1/𝑠) ∑ᵢ₌₁ˢ 𝑝(𝑓 | yᵢ) = (1/𝑠) ∑ᵢ₌₁ˢ GP(𝑓; 𝜇Dᵢ, 𝐾D).   (2.35)

𝑝(𝜙 | 𝑥, D) ≈ (1/𝑠) ∑ᵢ₌₁ˢ N(𝜙; 𝜇ᵢ, 𝜎²);   𝜇ᵢ = 𝜇Dᵢ(𝑥);   𝜎² = 𝐾D(𝑥, 𝑥).   (2.36)

Although slightly more complex than the Gaussian marginals of a Gaussian process, this is often convenient enough for most needs.
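The mixture marginal (2.36) is straightforward to assemble once samples of the latent vector are available. The sketch below assumes a handful of such samples have already been drawn by some sampler (not shown), along with a zero-mean squared exponential prior; all numbers are placeholders.

```python
import numpy as np

def K_se(a, b):
    return np.exp(-0.5 * (np.asarray(a)[:, None] - np.asarray(b)[None, :])**2)

# observed locations and hypothetical samples of the latent values y at those
# locations, e.g. drawn from p(y | D) by an MCMC routine not shown here
x_obs = np.array([3.0, 15.0, 27.0])
y_samples = np.array([[0.1, 0.9, -0.4],
                      [0.3, 1.1, -0.2],
                      [0.0, 0.8, -0.6]])     # shape (s, n); placeholder numbers

x = 10.0                                     # test location
kappa = K_se([x], x_obs)                     # cross-covariance, zero prior mean assumed
Sigma = K_se(x_obs, x_obs) + 1e-10 * np.eye(3)

# the conditioning covariance is the same for every sample (2.36)
sigma2 = K_se([x], [x])[0, 0] - (kappa @ np.linalg.solve(Sigma, kappa.T))[0, 0]
# each sample contributes one Gaussian component with its own mean (2.35-2.36)
mus = np.array([(kappa @ np.linalg.solve(Sigma, y_i))[0] for y_i in y_samples])

# the approximate marginal p(phi | x, D) is the equally weighted Gaussian mixture
# with means mus and common variance sigma2; its overall mean and variance are
mix_mean = mus.mean()
mix_var = sigma2 + mus.var()
```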
A Monte Carlo approximation to the posterior for the heavy-tailed
dataset from figure 2.10 is shown in figure 2.12. The observations were
modeled as corrupted by Student-𝑡 errors with 𝜈 = 4 degrees of freedom.
The posterior was approximated using a truly excessive number of sam-
ples (100 000, with a burn-in of 10 000) from the y posterior drawn using
elliptical slice sampling.47 The outliers in the data are ignored and the
predictive performance is excellent.
We are free to design this approximation as we see fit. There are several
general-purpose approaches available, distinguished by how they ap-
proach maximizing the fidelity of fitting the true posterior (2.33). These
include the Laplace approximation (§ b.1, p. 301), Gaussian expectation propagation (§ b.2, p. 302), and variational Bayesian inference. The first two of these methods are covered in appendix b, and nickisch and rasmussen provide an extensive survey of these and other approaches in the context of Gaussian process binary classification.⁴⁸

⁴⁸ h. nickisch and c. e. rasmussen (2008). Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research 9(Oct):2035–2078.

Regardless of the details of the approximation scheme, the high-level result is the same – the normal approximation (2.37) in turn induces an approximate Gaussian process posterior on 𝑓. To demonstrate this, we consider the posterior on 𝑓 that would arise from a direct observation of y (2.9–2.10) and integrate against the approximate posterior (2.37):

𝑝(𝑓 | D) ≈ ∫ 𝑝(𝑓 | y) 𝑞(y | D) dy = GP(𝑓; 𝜇D, 𝐾D),   (2.38)

where

𝜇D(𝑥) = 𝜇(𝑥) + 𝜅(𝑥)ᵀC⁻¹(m̃ − m);
𝐾D(𝑥, 𝑥′) = 𝐾(𝑥, 𝑥′) − 𝜅(𝑥)ᵀC⁻¹(C − C̃)C⁻¹𝜅(𝑥′).   (2.39)

C̃ = C − C(C + N)⁻¹C,   (2.40)
To demonstrate the power of approximate inference, we return to
our motivating scenario of conditioning a one-dimensional process on
having a local maximum at an identified point 𝑥, which we can achieve by
Figure 2.13: Approximately conditioning a Gaussian process to have a local maximum at the marked point 𝑥. We show
each stage of the conditioning process with a sample drawn from the corresponding posterior. We begin
with the unconstrained process (left), which we condition on the first derivative being zero at 𝑥 using exact
inference (middle). Finally we use Gaussian expectation propagation to approximately condition on the second
derivative being negative at 𝑥.
𝑝(ℎ) = N(ℎ; 𝑚, 𝑠²).

𝑝(ℎ | D) ≈ 𝑞(ℎ | D) = N(ℎ; 𝑚̃, 𝑠̃²).
Figure 2.14: A demonstration of Gaussian expectation propagation. On the left we have a Gaussian belief on the second
derivative, 𝑝 (ℎ). We wish to constrain this value to be negative, introducing a step-function factor encoding
the constraint, [ℎ < 0]. The resulting distribution is non-Gaussian (right), but we can approximate it with a
Gaussian, which induces an updated gp posterior on the function approximately incorporating the constraint.
Going beyond this example, we may use the approach outlined above to realize a general framework for Bayesian nonlinear regression by combining a gp prior on a latent function with an observation model appropriate for the task at hand, then approximating the posterior as desired. The convenience and modeling flexibility offered by Gaussian processes can easily justify any extra effort required for approximating the posterior. This can be seen as a nonlinear extension of the well-known family of generalized linear models.⁴⁹
This approach is quite popular and has been realized countless times. Notable examples include binary classification using a logistic or probit observation model,⁵⁰ modeling point processes as a nonhomogeneous Poisson process with unknown intensity,⁵¹,⁵² and robust regression with heavy-tailed additive noise such as Laplace⁵³ or Student-𝑡⁵⁴,⁵⁵ distributed errors. With regard to the latter and our previous heavy-tailed noise example, a Laplace approximation to the posterior for the data in figures 2.10–2.12 with the Student-𝑡 observation model produces an approximate posterior in excellent agreement with the Monte Carlo approximation in figure 2.12; see figure 2.15. The cost of approximate inference in this case was dramatically (several orders of magnitude) cheaper than Monte Carlo sampling.

⁵⁰ h. nickisch and c. e. rasmussen (2008). Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research 9(Oct):2035–2078.
⁵¹ j. møller et al. (1998). Log Gaussian Cox Processes. Scandinavian Journal of Statistics 25(3):451–482.
⁵² r. p. adams et al. (2009). Tractable Nonparametric Bayesian Inference in Poisson Processes with Gaussian Process Intensities. icml 2009.
⁵³ m. kuss (2006). Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning. PhD thesis. Technische Universität Darmstadt. [§ 5.4]
⁵⁴ r. m. neal (1997). Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification. Technical report (9702). Department of Statistics, University of Toronto.
⁵⁵ p. jylänki et al. (2011). Robust Gaussian Process Regression with a Student-𝑡 Likelihood. Journal of Machine Learning Research 12(99):3227–3257.
2.9 summary of major ideas

Gaussian processes have been studied – in one form or another – for over 100 years.⁵⁶ Although we have covered a lot of ground in this chapter, we have only scratched the surface of an expansive body of literature. A good entry point to that literature is rasmussen and williams’s monograph, which focuses on machine learning applications of Gaussian processes but also covers their theoretical underpinnings and properties in depth.⁵⁷ A good companion to this work is the book of adler and taylor, which takes a deep dive into the properties and geometry of sample paths, including statistical properties of their maxima.⁵⁸

⁵⁶ diaconis identified an early application of gps by poincaré for nonlinear regression: p. diaconis (1988). Bayesian Numerical Analysis. In: Statistical Decision Theory and Related Topics iv; h. poincaré (1912). Calcul des probabilités. Gauthier–Villars.
⁵⁷ c. e. rasmussen and c. k. i. williams (2006). Gaussian Processes for Machine Learning. mit Press.
⁵⁸ r. j. adler and j. e. taylor (2007). Random Fields and Geometry. Springer–Verlag.
This joint distribution allows us to condition a Gaussian process on (potentially noisy) derivative observations (§ 2.6, p. 32).
• The existence and uniqueness of global maxima for Gaussian process sample paths can be guaranteed under mild assumptions on the mean and covariance functions (§ 2.7, p. 33). Establishing these properties ensures that the location 𝑥∗ and value 𝑓∗ of the global maximum are well-founded random variables, which will be critical for some optimization methods introduced later in the book.⁵⁹
• Inference from non-Gaussian observations and constraints is possible via Monte Carlo sampling or Gaussian approximate inference (§ 2.8, p. 35).

⁵⁹ In particular, policies grounded in information theory under the umbrella of “entropy search.” See § 7.6, p. 135 for more.
The development of some system of a priori distributions suitable 1 j. mockus (1974). On Bayesian Methods for
for different classes of the function 𝑓 is probably the most important Seeking the Extremum. Optimization Tech-
niques ifip Technical Conference.
problem in the application of [the] Bayesian approach to. . . global
optimization.
Figure 3.1: The importance of the prior mean function in determining sample path behavior. The models in the first two
panels differ in their mean function but share the same covariance function. Sample path behavior is identical
up to translation. The model in the third panel features the same mean function as the first panel but a different
covariance function. Samples exhibit dramatically different behavior.
3.1 the prior mean function
𝜇 (𝑥; 𝑐) ≡ 𝑐, (3.1)
𝑝 (𝑓 | 𝑐) = GP (𝑓 ; 𝜇 ≡ 𝑐, 𝐾),
where the uncertainty in the mean has been absorbed into the prior
covariance function. We may now use this prior directly, avoiding any
4 Noting that 𝑐 and 𝑓 form a joint Gaussian pro- estimation of 𝑐. The unknown mean will be automatically marginalized
cess, we may perform inference as described in both the prior and posterior process, and we may additionally derive
in § 2.4, p. 26 to reveal their joint posterior.
the posterior belief over 𝑐 given data if it is of interest.4
basis functions, 𝝍 where the vector-valued function 𝝍 : X → ℝ𝑛 defines the basis functions
weight vector, 𝜷 and 𝜷 is a vector of weights.5
Now consider a parametric Gaussian process prior with a mean
function of this form (3.4) and arbitrary covariance function 𝐾. Placing
a multivariate normal prior on 𝜷,
6 a. o’hagan (1978). Curve Fitting and Optimal and marginalizing yields the marginal prior,6, 7
Design for Prediction. Journal of the Royal Sta-
tistical Society Series B (Methodological) 40(1): 𝑝 (𝑓 ) = GP (𝑓 ; 𝑚, 𝐶),
1–42.
7 c. e. rasmussen and c. k. i. williams (2006). where
Gaussian Processes for Machine Learning. mit
Press. [§ 2.7] 𝑚(𝑥) = a⊤𝝍(𝑥); 𝐶(𝑥, 𝑥′) = 𝐾(𝑥, 𝑥′) + 𝝍(𝑥)⊤B𝝍(𝑥′). (3.6)
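The snippet below is a small illustration of (3.6) under assumed choices – the basis 𝝍, the weight prior mean a, and the weight prior covariance B are arbitrary – showing how the prior on the weights folds into the marginal prior moments.

```python
# Illustrative sketch of the marginal prior (3.6): a linear mean on basis
# functions psi with weight prior N(a, B) absorbed into mean and covariance.
import numpy as np

def psi(x):                       # example basis: [1, x]
    return np.stack([np.ones_like(x), x])

def base_kernel(x1, x2, ell=1.0):
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell)**2)

a = np.array([0.0, 0.5])          # prior mean of the weights (assumed)
B = np.diag([1.0, 0.1])           # prior covariance of the weights (assumed)

def marginal_mean(x):
    return psi(x).T @ a

def marginal_cov(x1, x2):
    return base_kernel(x1, x2) + psi(x1).T @ B @ psi(x2)
```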
3.2 the prior covariance function
The covariance function determines fundamental properties of sample sample path continuity: § 2.5, p. 28
path behavior, including continuity, differentiability, and aspects of the sample path differentiability: § 2.6, p. 30
global optima, as we have already seen. Perhaps more so than the mean existence and uniqueness of global maxima:
function, careful design of the covariance function is critical to ensure § 2.7, p. 33
fidelity in modeling. We will devote considerable discussion to this topic,
beginning with some important properties and moving on to useful
examples and mechanisms for systematically modifying and composing
multiple covariance functions together to model complex behavior.
After appropriate normalization, a covariance function 𝐾 may be
loosely interpreted as a measure of similarity between points in the
domain. Namely, given 𝑥, 𝑥 0 ∈ X, the correlation between the corre-
sponding function values is correlation between function values, 𝜌
𝜌 = corr[𝜙, 𝜙′ | 𝑥, 𝑥′] = 𝐾(𝑥, 𝑥′) / √(𝐾(𝑥, 𝑥) 𝐾(𝑥′, 𝑥′)), (3.9)
and we may interpret the strength of this dependence as a measure of
similarity between the input locations. This intuition can be useful, but
some caveats are in order. To begin, note that correlation may be negative,
which might be interpreted as indicating dis-similarity as the function
values react to information with opposite sign.
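As a quick numerical illustration (my own sketch, with a squared exponential covariance and random inputs), one can compute the correlation (3.9) between two function values directly from the Gram matrix; the symmetry and nonnegative eigenvalues the sketch also verifies are exactly the consistency conditions discussed next.

```python
# Quick numerical check of a covariance function's required properties,
# using an illustrative squared exponential covariance.
import numpy as np

def se_kernel(x1, x2, ell=1.0):
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell)**2)

x = np.random.rand(20)
K = se_kernel(x, x)

assert np.allclose(K, K.T)            # symmetry
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10         # positive semidefinite (up to roundoff)

# correlation between the function values at the first two points, as in (3.9)
rho = K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])
```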
Further, for a proposed covariance function 𝐾 to be admissible, it
must satisfy two global consistency properties ensuring that the collec-
tion of random variables comprising 𝑓 are able to satisfy the purported
relationships. First, we can immediately deduce from its definition (3.8)
that 𝐾 must be symmetric in its inputs. Second, the covariance function symmetry and positive semidefiniteness
must be positive semidefinite; that is, given any finite set of points x ⊂ X,
the Gram matrix 𝐾 (x, x) must have only nonnegative eigenvalues.10 10 Symmetry guarantees the eigenvalues are real.
To illustrate how positive semidefiniteness ensures statistical validity, consequences of positive semidefiniteness
note that a direct consequence is that 𝐾 (𝑥, 𝑥) = var[𝜙 | 𝑥] ≥ 0, and thus
marginal variance is always nonnegative. On a slightly less trivial level,
consider a pair of points x = (𝑥, 𝑥′) and normalize the corresponding
Gram matrix 𝚺 = 𝐾(x, x) to yield the correlation matrix:
P = corr[𝝓 | x] = [ 1  𝜌 ; 𝜌  1 ],
3.3 notable covariance functions
Samples from a centered Gaussian process We will refer to the Matérn and the limiting case of the squared expo-
with squared exponential covariance 𝐾se . nential covariance functions together as the Matérn family. The squared
exponential covariance is without a doubt the most prevalent covari-
ance function in the statistical and machine learning literature. However,
it may not always be a good choice in practice. Sample paths from a
centered Gaussian process with squared exponential covariance are in-
17 m. l. stein (1999). Interpolation of Spatial Data: finitely differentiable, which has been ridiculed as an absurd assumption
Some Theory for Kriging. Springer–Verlag. for most physical processes.17 stein does not mince words on this, start-
[§ 1.7] ing off a three-sentence “summary of practical suggestions” with “use
the Matérn model” and devoting significant effort to discouraging the
use of the squared exponential in the context of geostatistics.
Between these extremes are the cases 𝜈 = 3/2 and 𝜈 = 5/2, which
respectively model once- and twice-differentiable functions:
𝐾M3/2(𝑥, 𝑥′) = (1 + √3 𝑑) exp(−√3 𝑑); (3.13)
𝐾M5/2(𝑥, 𝑥′) = (1 + √5 𝑑 + (5/3) 𝑑²) exp(−√5 𝑑). (3.14)
Figure 3.4 illustrates samples from centered Gaussian processes with
different values of the smoothness parameters 𝜈. The 𝜈 = 5/2 case in 18 j. snoek et al. (2012). Practical Bayesian Op-
particular has been singled out as a prudent off-the-shelf choice for timization of Machine Learning Algorithms.
neurips 2012.
Bayesian optimization when no better alternative is obvious.18
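For concreteness, here is a minimal sketch (arbitrary grid and unit length scale, not tied to the settings of figure 3.4) of the Matérn 𝜈 = 3/2 and 𝜈 = 5/2 covariances (3.13)–(3.14) and of drawing sample paths from a centered process.

```python
# Sketch of the Matern 3/2 and 5/2 covariances and sample paths from a
# centered GP; grid and parameters are made up for illustration.
import numpy as np

def matern32(d):
    return (1 + np.sqrt(3) * d) * np.exp(-np.sqrt(3) * d)

def matern52(d):
    return (1 + np.sqrt(5) * d + 5 * d**2 / 3) * np.exp(-np.sqrt(5) * d)

x = np.linspace(0, 10, 200)
d = np.abs(x[:, None] - x[None, :])        # pairwise distances (unit length scale)
K = matern52(d) + 1e-10 * np.eye(len(x))   # jitter for numerical stability
samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
```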
3.4 modifying and combining covariance functions
In this context the parameter 𝜆 is known as an output scale, or when
the base covariance is stationary with 𝐾 (𝑥, 𝑥) = 1, the signal variance,
as it determines the variance of any function value: var[𝜙 | 𝑥, 𝜆] = 𝜆 2.
The illustration in the margin shows the effect of scaling the squared
exponential covariance function by a series of increasing output scales.
We can also of course consider nonlinear transformations of the
function output as well. This can be useful for modeling constraints –
such as nonnegativity or boundedness – that are not compatible with
the Gaussian assumption. However, a nonlinear transformation of a
Gaussian process is no longer Gaussian, so it is often more convenient to
model the transformed function after “removing the constraint.”
The squared exponential covariance 𝐾se scaled by a range of output scales 𝜆 (3.20).
We may use the general form of this scaling result (3.19) to transform
a stationary covariance into a nonstationary one, as any nonconstant
scaling is sufficient to break translation invariance. We show an example
of such a transformation in figure 3.5, where we have scaled a stationary
covariance by a bump function to create a prior on smooth functions
with compact support.
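A hedged sketch of this construction follows, scaling a Matérn covariance by a bump function with made-up center and width so that sample paths vanish outside a compact interval.

```python
# Illustrative sketch of scaling a stationary covariance by a bump function
# to obtain a prior on smooth functions with compact support.
import numpy as np

def matern52(d):
    return (1 + np.sqrt(5) * d + 5 * d**2 / 3) * np.exp(-np.sqrt(5) * d)

def bump(x, center=5.0, width=3.0):
    # smooth function vanishing outside [center - width, center + width]
    t = (x - center) / width
    out = np.zeros_like(x)
    inside = np.abs(t) < 1
    out[inside] = np.exp(-1.0 / (1.0 - t[inside]**2))
    return out

def scaled_kernel(x1, x2, ell=1.0):
    d = np.abs(x1[:, None] - x2[None, :]) / ell
    return bump(x1)[:, None] * matern52(d) * bump(x2)[None, :]
```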
Taking this one step further, we may consider dilating each axis by a
separate factor:
𝑑 A = |Ax − Ax0 |.
Nonlinear warping
When using a covariance function with an inherent length scale, such as
a Matérn or squared exponential covariance, some linear transformation
of the domain is almost always considered, whether it be simple dilation
(3.22), anisotropic scaling (3.25), or a general transformation (3.26). How-
23 We did see a periodic gp in the previous chap- ever, nonlinear transformations can also be useful for imposing structure
ter (2.30); however, that model only had sup-
on the domain, a process commonly referred to as warping.
port on perfectly sinusoidal functions.
To provide an example that may not often be useful in optimization
24 d. j. c. mackay (1998). Introduction to Gaus-
but is illustrative nonetheless, suppose we wish to model a function
sian Processes. Neural Networks and Machine
Learning. Vol. 168. Springer–Verlag. [§ 5.2] 𝑓 : ℝ → ℝ that we believe to be smooth and periodic with period 𝑝.
25 The covariance on the circle is usually
None of the covariance functions introduced thus far would be able to
inherited from a covariance on ℝ2. The result induce the periodic correlations that this assumption would entail.23 A
of composing with the squared exponential construction due to mackay is to compose a map onto a circle of radius
covariance in particular is often called “the”
periodic covariance, but we stress that any
𝑟 = 𝑝/(2𝜋):24
other covariance on ℝ2 could be used instead.
𝑥 ↦→ (𝑟 cos 𝑥, 𝑟 sin 𝑥) (3.27)
with a covariance function on that space reflecting any desired properties
of 𝑓.25 As this map identifies points separated by any multiple of the
period, the corresponding function values are perfectly correlated, as
desired. A sample from a Gaussian process employing this construction
with a Matérn covariance after warping is shown in the margin.
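A sketch of mackay's construction under assumed parameters appears below; note that I rescale the input angle so that the resulting covariance has period 𝑝, which I take to be the intent of (3.27).

```python
# Sketch of the periodic construction (3.27): embed the input on a circle of
# radius p/(2*pi) and apply a Matern covariance in the embedded space.
import numpy as np

def matern52(d):
    return (1 + np.sqrt(5) * d + 5 * d**2 / 3) * np.exp(-np.sqrt(5) * d)

def periodic_kernel(x1, x2, period=2.0, ell=1.0):
    r = period / (2 * np.pi)
    a1, a2 = 2 * np.pi * x1 / period, 2 * np.pi * x2 / period
    u1 = r * np.stack([np.cos(a1), np.sin(a1)], axis=-1)
    u2 = r * np.stack([np.cos(a2), np.sin(a2)], axis=-1)
    d = np.linalg.norm(u1[:, None, :] - u2[None, :, :], axis=-1) / ell
    return matern52(d)   # any covariance on the plane could be used instead
```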
A compelling use of warping is to build nonstationary models by
A sample path of a centered gp with Matérn
covariance with 𝜈 = 5/2 (3.14) after applying
composing a nonlinear map with a stationary covariance, an idea snoek
the periodic warping function (3.27). et al. explored in the context of Bayesian optimization.26 Many objec-
tive functions exhibit different behavior depending on the proximity to
26 j. snoek et al. (2014). Input Warping for Bayes- the optimum, suggesting that nonstationary models may sometimes be
ian Optimization of Non-Stationary Functions. worth exploring. snoek et al. proposed a flexible family of warping func-
icml 2014. tions for optimization problems with box-bounded constraints, where
we may take the domain to be the unit cube by scaling and translating as
necessary: X = [0, 1] 𝑛. The idea is to warp each coordinate of the input
via the cumulative distribution function of a beta distribution:
𝑥𝑖 ↦→ 𝐼 (𝑥𝑖 ; 𝛼𝑖 , 𝛽𝑖 ), (3.28)
where (𝛼𝑖 , 𝛽𝑖 ) are shape parameters and 𝐼 is the regularized beta func-
tion. This represents a monotonic bijection on the unit interval that can
assume several shapes; see the marginal figure for examples. The map
may contract portions of the domain and expand others, effectively decreasing
and increasing the length scale in those regions. Finally, taking
𝛼 = 𝛽 = 1 recovers the identity map, allowing us to degrade gracefully
to the unwarped case if desired.
Some examples of beta cdf warping functions (3.28).
In figure 3.7 we combine a beta warping on a one-dimensional domain
with a stationary covariance on the output. The chosen warping shortens
the length scale near the center of the domain and extends it near the
boundary, which might be reasonable for an objective function expected
to exhibit the most “interesting” behavior on the interior of its domain.
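Below is an illustrative sketch of this warping (the shape parameters, length scale, and Matérn base covariance are assumptions): each coordinate is passed through a beta cdf before a stationary covariance is applied to the warped inputs.

```python
# Sketch of the beta cdf warping (3.28) composed with a stationary covariance.
import numpy as np
from scipy.stats import beta
from scipy.spatial.distance import cdist

def matern52(d):
    return (1 + np.sqrt(5) * d + 5 * d**2 / 3) * np.exp(-np.sqrt(5) * d)

def warped_kernel(X1, X2, alphas, betas, ell=0.2):
    # warp each coordinate through the beta distribution cdf (the regularized
    # incomplete beta function), then apply a stationary covariance
    W1 = np.column_stack([beta.cdf(X1[:, i], a, b)
                          for i, (a, b) in enumerate(zip(alphas, betas))])
    W2 = np.column_stack([beta.cdf(X2[:, i], a, b)
                          for i, (a, b) in enumerate(zip(alphas, betas))])
    return matern52(cdist(W1, W2) / ell)
```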
A recent innovation is to use sophisticated artificial neural networks
as warping maps for modeling functions of high-dimensional data with
complex structure. Notable examples of this approach include the fami-
lies of manifold Gaussian processes introduced by calandra et al.27 and 27 r. calandra et al. (2016). Manifold Gaussian
deep kernels introduced contemporaneously by wilson et al.28 Here the Processes for Regression. ijcnn 2016.
warping function was taken to be an arbitrary neural network, the output 28 a. g. wilson et al. (2016). Deep Kernel Learn-
layer of which was fed into a suitable stationary covariance function. ing. aistats 2016.
This gives a highly parameterized covariance function where the pa-
rameters of the base covariance and the neural map become parameters
of the resulting model. In the context of Bayesian optimization, this
can be especially useful when there is sufficient data to learn a useful neural representation learning: § 8.11, p.198
representation of the domain via unsupervised methods.
𝐾1 𝐾2 𝐾1 + 𝐾2
Figure 3.8: Samples from centered Gaussian processes with different covariance functions: (left) a squared exponential
covariance, (middle) a squared exponential covariance with smaller output scale and shorter length scale, and
(right) the sum of the two. Samples from the process with the sum covariance show smooth variation on two
different scales.
and thus covariance functions are closed under addition and pointwise
29 The assumption of the processes being cen- multiplication.29 Combining this result with (3.20), we have that any
tered is needed for the product result only; polynomial of covariance functions with nonnegative coefficients forms
otherwise, there would be additional terms
involving scaled versions of each individual
a valid covariance. This enables us to construct infinite families of in-
covariance as in (3.19). The sum result does creasingly complex covariance functions from simple components.
not depend on any assumptions regarding the We may use a sum of covariance functions to model a function with
mean functions. independent additive contributions, such as random behavior on several
length scales. Precisely such a construction is illustrated in figure 3.8. If
the covariance functions are nonnegative and have roughly the same
scale, the effect of addition is roughly one of logical disjunction: the sum
will assume nontrivial values whenever any one of its constituents does.
Meanwhile, a product of covariance functions can loosely be in-
terpreted in terms of logical conjunction, with function values having
appreciable covariance only when every individual covariance function
does. A prototypical example of this effect is a covariance function mod-
eling functions that are “almost periodic,” formed by the product of a
bump-shaped isotropic covariance function such as a squared exponen-
tial (3.12) with a warped version modeling perfectly periodic functions
(3.27). The former moderates the influence of the latter by driving the cor-
relation between function values to zero for inputs that are sufficiently
A sample from a centered Gaussian process separated, regardless of their positions in the periodic cycle. We show a
with an “almost periodic” covariance function.
sample from such a covariance in the margin, where the length scale of
the modulation term is three times the period.
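The sketch below (all length scales, output scales, and periods invented) illustrates both compositions: a sum of squared exponential covariances with two length scales, and an "almost periodic" product of a squared exponential with a periodic covariance.

```python
# Sketch of composing covariance functions by addition and multiplication.
import numpy as np

def se(x1, x2, ell=1.0, out=1.0):
    return out**2 * np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell)**2)

def periodic(x1, x2, period=1.0, ell=1.0):
    # squared exponential on a circle embedding; depends only on the phase
    d = 2 * np.abs(np.sin(np.pi * (x1[:, None] - x2[None, :]) / period))
    return np.exp(-0.5 * (d / ell)**2)

x = np.linspace(0, 10, 100)
K_sum = se(x, x, ell=2.0) + se(x, x, ell=0.3, out=0.5)   # variation on two scales
K_almost_periodic = se(x, x, ell=3.0) * periodic(x, x)   # modulation scale >> period
```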
3.5 modeling functions on high-dimensional domains
There is a consensus in the high-dimensional data analysis commu- 33 e. levina and p. j. bickel (2004). Maximum
nity that the only reason any methods work in very high dimensions Likelihood Estimation of Intrinsic Dimension.
neurips 2004.
is that, in fact, the data are not truly high dimensional.
Neural embeddings
Given the success of deep learning in designing feature representations
for complex, high-dimensional objects, neural embeddings – as used in
35 a. g. wilson et al. (2016). Deep Kernel Learn- the family of deep kernels35 – present a tantalizing option. Neural embed-
ing. aistats 2016. dings have shown some success in Bayesian optimization, where they
can facilitate optimization over complex structured objects by providing
neural representation learning: § 8.11, p.198 a nice continuous latent space to work in.
snoek et al. demonstrated excellent performance on hyperparameter
tuning tasks by interpreting the output layer of a deep neural network as
a set of custom nonlinear basis functions for Bayesian linear regression,
36 j. snoek et al. (2015). Scalable Bayesian Opti- as in (3.6).36 An advantage of this particular construction is that Gaussian
mization Using Deep Neural Networks. icml process inference and prediction is accelerated dramatically by adopting
2015.
the linear covariance (3.6) – the cost of inference scales linearly with the
cost of Gaussian process inference: § 9.1, p. 201 number of observations, rather than cubically as in the general case.
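A hedged sketch of the general idea – not snoek et al.'s implementation – is shown below: the output-layer features of some trained network, here an assumed matrix Phi, serve as basis functions for Bayesian linear regression, so the dominant cost is linear in the number of observations for a fixed number of features.

```python
# Sketch of Bayesian linear regression on assumed neural features Phi (n x m),
# in the spirit of using the output layer as basis functions (3.6).
import numpy as np

def blr_posterior(Phi, y, noise_var=0.01, prior_var=1.0):
    m = Phi.shape[1]
    A = Phi.T @ Phi / noise_var + np.eye(m) / prior_var   # posterior precision
    A_inv = np.linalg.inv(A)
    w_mean = A_inv @ Phi.T @ y / noise_var                # posterior mean of weights
    return w_mean, A_inv

def blr_predict(phi_star, w_mean, w_cov, noise_var=0.01):
    # phi_star: (k, m) features at test inputs
    mean = phi_star @ w_mean
    var = np.einsum('ij,jk,ik->i', phi_star, w_cov, phi_star) + noise_var
    return mean, var
```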
Linear embeddings
Another line of attack is to search for a low-dimensional linear subspace
of the domain encompassing the relevant variation in inputs and model
the function after projection onto that space. For an objective 𝑓 on a
high-dimensional domain X ⊂ ℝⁿ, we consider models of the form
Several specific schemes have been proposed for building such de-
compositions. One convenient approach is to partition the coordinates
43 k. kandasamy et al. (2015). High Dimensional
of the input into disjoint groups and add a contribution defined on each Bayesian Optimisation and Bandits via Addi-
subset.43, 44 Figure 3.10 shows an example, where a two-dimensional ob- tive Models. icml 2015.
jective is the sum of independent axis-aligned components. We might use 44 j. r. gardner et al. (2017). Discovering and Ex-
such a model when every feature of the input is likely to be relevant but ploiting Additive Structure for Bayesian Opti-
only through interaction with a limited number of additional variables. mization. aistats 2017.
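A minimal sketch of such an additive covariance (the coordinate groups and length scales are made up) follows.

```python
# Sketch of an additive covariance built from disjoint groups of coordinates.
import numpy as np
from scipy.spatial.distance import cdist

def se(d):
    return np.exp(-0.5 * d**2)

def additive_kernel(X1, X2, groups, ells):
    # groups: list of index lists partitioning the input coordinates
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for idx, ell in zip(groups, ells):
        K += se(cdist(X1[:, idx], X2[:, idx]) / ell)
    return K

# example usage with assumed grouping of a 4-dimensional input
X = np.random.rand(30, 4)
K = additive_kernel(X, X, groups=[[0, 1], [2, 3]], ells=[0.5, 0.2])
```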
All models we will consider below will be of this composite form (4.1), but
the assessment framework we will describe will accommodate arbitrary
models.
4.1 models and model structures
We have indexed the space by a vector 𝜽 , the entries of which jointly vector of hyperparameters, 𝜽
specify any necessary parameters from their joint range Θ. The entries range of hyperparameter values, Θ
of 𝜽 are known as hyperparameters of the model structure, as they pa-
rameterize the prior distribution for the observations, 𝑝 (y | x, 𝜽 ) (4.2).
In many cases we may be happy with a single suitably flexible model
structure for the data, in which case we can proceed with the correspond-
ing space (4.3) as the set of candidate models. We may also consider
multiple model structures for the data by taking a discrete union of such
spaces, an idea we will return to later in this chapter. multiple model structures: § 4.5, p. 78
Example
Let us momentarily take a step back from abstraction and create an ex-
plicit model space for optimization on the interval X = [𝑎, 𝑏].3 Suppose 3 The interval can be arbitrary; our discussion
our initial beliefs are that the objective will exhibit stationary behav- will be purely qualitative.
ior with a constant trend near zero, and that our observations will be
corrupted by additive noise with unknown signal-to-noise ratio.
For the observation model, we take homoskedastic additive Gaussian observation model: additive Gaussian noise
noise, a reasonable choice when there is no obvious alternative: with unknown scale
and leave the scale of the observation noise 𝜎𝑛 as a parameter. Turning prior mean function: constant mean with
to the prior process, we assume a constant mean function (3.1) with a unknown value
zero-mean normal prior on the unknown constant:
and select the Matérn covariance function with 𝜈 = 5/2 (3.14) with un- prior covariance function: Matérn 𝜈 = 5/2
known output scale 𝜆 (3.20) and unknown length scale ℓ (3.22): with unknown output and length scales
Following our discussion in the last chapter, we may eliminate one eliminating mean parameter via
of the parameters above by marginalizing the unknown constant mean marginalization: § 3.1, p. 47
under its assumed prior,4 leaving us with the identically zero mean
function and an additive contribution to the covariance function (3.3): 4 We would ideally marginalize the other pa-
rameters as well, but it would not result in a
𝜇(𝑥) ≡ 0; 𝐾(𝑥, 𝑥′; 𝜆, ℓ) = 𝑏² + 𝜆²𝐾M5/2(𝑑/ℓ). (4.5) Gaussian process, as we will discuss shortly.
samples from the joint prior over the objective function and the observed
values y that would result from measurements at 15 locations x (4.2) for
a range of these hyperparameters. Even this simple model space is quite
flexible, offering degrees of freedom for the variation in the objective
function and the precision of our measurements.
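The snippet below sketches one point in this model space under assumed hyperparameter values (the constant 𝑏 and the grid are illustrative), drawing a joint prior sample of the objective and the observations in the spirit of figure 4.1.

```python
# Sketch of the example model space: covariance (4.5) with unknown output and
# length scales plus homoskedastic Gaussian noise; all values are illustrative.
import numpy as np

def matern52(d):
    return (1 + np.sqrt(5) * d + 5 * d**2 / 3) * np.exp(-np.sqrt(5) * d)

def prior_covariance(x1, x2, out_scale, length_scale, b=1.0):
    d = np.abs(x1[:, None] - x2[None, :])
    return b**2 + out_scale**2 * matern52(d / length_scale)

x = np.linspace(0, 1, 15)
theta = {'out_scale': 1.0, 'length_scale': 0.1, 'noise': 0.2}  # one model in the space
K = prior_covariance(x, x, theta['out_scale'], theta['length_scale'])
f = np.random.multivariate_normal(np.zeros(len(x)), K + 1e-10 * np.eye(len(x)))
y = f + theta['noise'] * np.random.randn(len(x))               # joint prior sample of f and y
```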
𝑝 (𝜽 ) ∝ 1 (4.6)
may be used, in which case the model prior may not be explicitly ac-
knowledged at all. However, it can be helpful to express at least weakly
informative prior beliefs – especially when working with small datasets
– as it can offer gentle regularization away from patently absurd choices.
This should be possible for most hyperparameters in practice. For ex-
ample, when modeling a physical system, it would be unlikely that
interaction length scales of say one nanometer and one kilometer would
be equally plausible a priori; we might capture this intuition with a wide
prior on the logarithm of the length scale.
Model posterior
Given a set of observations D = (x, y), we may appeal to Bayes’ theorem model posterior, 𝑝 (𝜽 | D)
to derive the posterior distribution over the candidate models:
𝑝 (𝜽 | D) ∝ 𝑝 (𝜽 ) 𝑝 (y | x, 𝜽 ). (4.7)
log 𝑝(y | x, 𝜽) = −(1/2)[(y − 𝝁)⊤(𝚺 + N)⁻¹(y − 𝝁) + log |𝚺 + N| + 𝑛 log 2𝜋]. (4.8)
interpretation of terms The first term of this expression is the sum of the squared Mahalanobis
norms (a.8) of the observations under the prior and represents a measure
of data fit. The second term serves as a complexity penalty: the volume
of any confidence ellipsoid under the prior is proportional to |𝚺 + N|,
9 The dataset was realized using a moderate
and thus this term scales according to the volume of the model’s support
length scale (30 length scales spanning the in observation space. The third term simply ensures normalization.
domain) and a small amount of additive noise,
shown below. But this is impossible to know
from inspection of the data alone, and many Return to example
alternative explanations are just as plausible
according to the model posterior! Let us return to our example scenario and model space. We invite the
reader to consider the hypothetical set of 15 observations in figure 4.2
from our example system of interest and contemplate which models
from our space of candidates in figure 4.1 might be the most compatible
with these observations.9
We illustrate the model posterior given this data in figure 4.3, where,
in the interest of visualization, we have fixed the covariance output
scale to its true value and set the range of the axes to be compatible
with the samples from figure 4.1. The model prior was designed to be
weakly informative regarding the expected order of magnitude of the
hyperparameters by taking independent, wide Gaussian priors on the 10 Both parameters are nonnegative, so the prior
logarithm of the observation noise and covariance length scale.10 has support on the entire parameter range.
The first observation we can make regarding the model posterior is
that it is remarkably broad, with many settings of the model hyperparam-
eters remaining plausible after observing the data. However, the model
posterior does express a preference for models with either low noise and
short length scale or high noise combined with a range of compatible
length scales. Figure 4.4 provides examples of objective function and
observation posteriors corresponding to the hyperparameters indicated
in figure 4.3. Although each is equally plausible in the posterior,11 their 11 The posterior probability density of these
explanations of the data are diverse. points is approximately 10% of the maximum.
When the model prior is flat (4.6), the map model corresponds to the
maximum likelihood estimate (mle) of the model hyperparameters. Fig-
ure 4.5 shows the predictions made by the map model for our running
example; in this case, the map hyperparameters are in fact a reasonable
match to the parameters used to generate the example dataset.
acceleration via gradient-based optimization When the model space is defined over a continuous space of hyperpa-
rameters, computation of the map model can be significantly accelerated
via gradient-based optimization. Here it is advisable to work in the log
domain, where the objective becomes the unnormalized log posterior:
log 𝑝 (𝜽 ) + log 𝑝 (y | x, 𝜽 ). (4.10)
The log marginal likelihood is given in (4.8), noting that 𝝁, 𝚺, and N
are all implicitly functions of the hyperparameters 𝜽 . This objective
gradient of log marginal likelihood with (4.10) is differentiable with respect to 𝜽 assuming the Gaussian process
respect to 𝜽 : § c.1, p. 307 prior moments, the noise covariance, and the model prior are as well, in
which case we may appeal to off-the-shelf gradient methods for solving
(4.9). However, a word of warning is in order: the model posterior is
not guaranteed to be concave and may have multiple local maxima, so
multistart optimization is prudent.
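A minimal sketch of this procedure follows (my own; the Matérn covariance, zero prior mean, wide Gaussian prior on the log hyperparameters, and synthetic data are all assumptions): the unnormalized log posterior (4.10) is maximized over the log hyperparameters with an off-the-shelf optimizer and several restarts.

```python
# Sketch of MAP hyperparameter estimation: log marginal likelihood (4.8) plus
# a weakly informative prior on log hyperparameters, maximized with restarts.
import numpy as np
from scipy.optimize import minimize

def matern52(d):
    return (1 + np.sqrt(5) * d + 5 * d**2 / 3) * np.exp(-np.sqrt(5) * d)

def log_marginal_likelihood(log_theta, x, y):
    log_out, log_ell, log_noise = log_theta
    K = np.exp(2 * log_out) * matern52(np.abs(x[:, None] - x[None, :]) / np.exp(log_ell))
    C = K + np.exp(2 * log_noise) * np.eye(len(x))   # Sigma + N, zero prior mean
    L = np.linalg.cholesky(C + 1e-10 * np.eye(len(x)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * (y @ alpha) - np.log(np.diag(L)).sum() - 0.5 * len(x) * np.log(2 * np.pi)

def neg_log_posterior(log_theta, x, y):
    log_prior = -0.5 * np.sum((log_theta / 2.0)**2)  # wide Gaussian on log parameters
    return -(log_marginal_likelihood(log_theta, x, y) + log_prior)

x = np.linspace(0, 1, 15)
y = np.sin(6 * x) + 0.1 * np.random.randn(15)
best = min((minimize(neg_log_posterior, np.random.randn(3), args=(x, y))
            for _ in range(5)), key=lambda r: r.fun)  # multistart optimization
theta_map = np.exp(best.x)
```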
4.4 model averaging
Figure 4.6: A Monte Carlo estimate of the model-marginal predictive distribution (4.11) for our example scenario using
100 samples drawn from the model posterior in figure 4.3 (4.14–4.15); see illustration in margin. Samples
from the objective function posterior display a variety of behavior due to being associated with different
hyperparameters.
some special cases,14 so we must resort to approximation if we wish 14 A notable example is marginalizing the coeffi-
to pursue this approach. In fact, maximum a posteriori estimation can cients of a linear prior mean against a Gaus-
sian prior: § 3.1, p. 47.
be interpreted as one rather crude approximation scheme where the
model posterior is replaced by a Dirac delta distribution at the map
hyperparameters:
𝑝 (𝜽 | D) ≈ 𝛿 (𝜽 − 𝜽ˆ ).
This can be defensible when the dataset is large compared to the number
of hyperparameters, in which case the model posterior is often unimodal
with little residual uncertainty. However, large datasets are the exception
rather than the rule in Bayesian optimization, and more sophisticated
approximations can pay off when model uncertainty is significant.
𝑝(𝑓 | D) ≈ (1/𝑠) ∑_{𝑖=1}^{𝑠} GP(𝑓; 𝜇D(𝜽ᵢ), 𝐾D(𝜽ᵢ)); (4.14)
𝑝(𝑦 | 𝑥, D) ≈ (1/𝑠) ∑_{𝑖=1}^{𝑠} ∫ 𝑝(𝑦 | 𝑥, 𝜙, 𝜽ᵢ) 𝑝(𝜙 | 𝑥, D, 𝜽ᵢ) d𝜙. (4.15)
Figure 4.7: An approximation to the model-marginal posterior (4.11) using the central composite design approach proposed
by rue et al. A total of nine hyperparameter samples are used for the approximation, illustrated in the margin
below.
15 m. d. hoffman and a. gelman (2014). The Carlo (hmc) such as the no u-turn sampler (nuts) would be a reasonable
No-U-turn Sampler: Adaptively Setting Path choice when the gradient of the log posterior (4.10) is available, as it can
Lengths in Hamiltonian Monte Carlo. Journal
of Machine Learning Research 15(4):1593–1623.
exploit this information to accelerate mixing.15
Figure 4.6 demonstrates a Monte Carlo approximation to the model-
marginal posterior (4.11–4.12) for our running example. Comparing with
the map approximation in figure 4.5, the predictive uncertainty of both
objective function values and observations has increased considerably
due to accounting for model uncertainty in the predictive distributions.
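The sketch below illustrates the mixture approximation (4.14–4.15), assuming a routine gp_predict (a hypothetical stand-in) that returns the predictive moments for a single hyperparameter setting; the per-sample moments are combined via the law of total variance.

```python
# Sketch of Monte Carlo model averaging of GP predictions over hyperparameter
# samples; gp_predict is an assumed stand-in for the single-model posterior.
import numpy as np

def mixture_predictive(x_star, data, theta_samples, gp_predict):
    means, variances = [], []
    for theta in theta_samples:
        m, v = gp_predict(x_star, data, theta)   # moments of p(y | x, D, theta_i)
        means.append(m)
        variances.append(v)
    means, variances = np.array(means), np.array(variances)
    mix_mean = means.mean(axis=0)
    # law of total variance for the equally weighted mixture
    mix_var = variances.mean(axis=0) + (means**2).mean(axis=0) - mix_mean**2
    return mix_mean, mix_var
```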
16 h. rue et al. (2009). Approximate Bayesian Unfortunately this integral remains intractable due to the nonlinear de-
inference for latent Gaussian models by us- pendence of the posterior moments on the hyperparameters, but reducing
ing integrated nested Laplace approximations.
Journal of the Royal Statistical Society Series B to this common form allows us to derive deterministic approximations
(Methodological) 71(2):319–392. against a single assumed posterior.
17 g. e. p. box and k. b. wilson (1951). On the Ex- rue et al. introduced several approximation schemes representing
perimental Attainment of Optimum Condi- different tradeoffs between efficiency and fidelity.16 Notable among these
tions. Journal of the Royal Statistical Society is a simple, sample-efficient procedure grounded in classical experi-
Series B (Methodological) 13(1):1–45.
mental design. Here a central composite design17 in hyperparameter
Figure 4.8: The approximation to the model-marginal posterior (4.11) for our running example using the approach proposed
by osborne et al.
where
4.5 multiple model structures
initial model structure: p. 69 We build a model space comprising several model structures by aug-
menting our previous space with structures incorporating additional
covariance functions. The treatment of the prior mean (unknown con-
stant marginalized against a Gaussian prior) and observation model
(additive Gaussian noise with unknown scale) will remain the same
for all. The model structures reflect a variety of hypotheses positing
potential linear or quadratic behavior:
m5: the Matérn 𝜈 = 5/2 covariance (3.14) from our previous example;
lin: the linear covariance (3.16), where the prior on the slope is vague
and centered at zero and the prior on the intercept agrees with the m5
model;
lin × lin: the product of two linear covariances designed as above, modeling a
latent quadratic function with unknown coefficients;
m5 + lin: the sum of a Matérn 𝜈 = 5/2 and linear covariance designed as in the
corresponding individual model structures; and
m5 × lin: the product of a Matérn 𝜈 = 5/2 and linear covariance designed as in the
corresponding individual model structures.
Objective function samples from models in each of these structures are
shown in figure 4.10. Among these, the model structure closest to the
truth is arguably m5 + lin.
approximation to model structure posterior Following the above discussion, we find the map hyperparameters
for each of these model structures separately and use a Laplace approx-
imation (4.16) to approximate the hyperparameter posterior on each
space, along with the normalizing constant (4.22). Normalizing over the
structures provides an approximate model structure posterior:
Pr(m5 | D) ≈ 10.8%;
Pr(m5 + lin | D) ≈ 71.8%;
Pr(m5 × lin | D) ≈ 17.0%,
with the remaining model structures (lin and lin × lin) sharing the re-
Figure 4.11: An approximation to the model-marginal posterior (4.23) for our multiple-model example. The posterior on
each model structure is approximated separately as a mixture of Gaussian processes following rue et al. (see
figure 4.7); these are then combined by weighting by an approximation of the model structure posterior (4.21).
We show the result with three superimposed, transparent credible intervals, which are shaded with respect to
their weight in contributing to the final approximation.
maining 0.4%. The m5 + lin model structure is the clear winner, and there
is strong evidence that the purely polynomial models are insufficient for
explaining the data alone.
Figure 4.11 illustrates an approximation to the model-marginal pos- approximation to marginal predictive
terior (4.23), approximated by applying rue et al.’s central composite distribution
design approach to each of the model structures, then combining these
into a Gaussian process mixture by weighting by the approximate model
structure posterior. The highly asymmetric credible intervals reflect
the diversity in explanations for the data offered by the chosen model
structures, and the combined model makes reasonable predictions of our
example objective function sampled from the true model.
For this example, averaging over the model structure has important averaging over a space of Gaussian processes
implications regarding the behavior of the resulting optimization policy. in policy computation: § 8.10, p. 192
Figure 4.12 illustrates a common acquisition function25 built from the off- 25 specifically, expected improvement: § 7.3 p. 127
the-shelf m5 model, as well as from the structure-marginal model. The
former chooses to exploit near what it believes is a local optimum, but
the latter has a strong belief in an underlying linear trend and chooses
to explore the right-hand side of the domain instead. For our example
objective function sample, this would in fact reveal the global optimum
with the next observation.
Figure 4.12: Optimization policies built from the map m5 model (left) and the structure-marginal posterior (right). The m5
model chooses to exploit near the local optimum, but the structure-marginal model is aware of the underlying
linear trend and chooses to explore the right-hand side as a result.
𝐾 →𝐵
𝐾 →𝐾 +𝐾
𝐾 → 𝐾𝐾
𝐾 → (𝐾).
The symbol 𝐵 in the first rule represents any desired base kernel. The five
model structures considered in our multiple-structure example above in
End-to-end automation
Follow-on work demonstrated a completely automated Bayesian opti-
mization system built on this structure search procedure avoiding any
29 g. malkomes and r. garnett (2018). Automat- manual modeling at all.29 The key idea was to dynamically maintain a
ing Bayesian optimization with Bayesian opti- set of plausible model structures throughout optimization. Predictions
mization. neurips 2018.
are made via model averaging over this set, offering robustness to model
misspecification when computing the outer optimization policy. Every
time a new observation is obtained, the set of model structures is then
updated via a continual Bayesian optimization in model space given the
new data. This interleaving of Bayesian optimization in data space and
model space offered promising performance.
Finally, gardner et al. offered an alternative to optimization over
model structures by constructing a Markov chain Monte Carlo routine
30 j. r. gardner et al. (2017). Discovering and Ex- to sample model structures from their posterior (4.21).30 The proposed
ploiting Additive Structure for Bayesian Opti- sampler was a realization of the Metropolis–Hastings algorithm with a
mization. aistats 2017.
custom proposal distribution making minor modifications to the incum-
bent structure. In the case of the additive decompositions considered
in that work, this step consisted of applying random atomic operations
such as merging or splitting components of the existing decomposition.
Despite the absolutely enormous number of possible additive decomposi-
tions, this mcmc routine was able to quickly locate promising structures,
and averaging over the sampled structures for prediction resulted in
superior optimization performance as well.
5
DECISION THEORY FOR OPTIMIZATION
5.1 introduction to bayesian decision theory
Isolated decisions
A decision problem under uncertainty has two defining characteristics.
action space, A The first is the action space A, the set of all available decisions. Our
task is to select an action from this space. For example, in sequential
optimization, an optimization policy decision must select a point in the
domain X for observation, and so we have A = X.
The second critical feature is the presence of uncertain elements of
unknown variables affecting decision the world influencing the outcomes of our actions, complicating our
outcome, 𝜓 decision. Let 𝜓 represent a random variable encompassing any relevant
uncertain elements when making and evaluating a decision. Although
we may lack perfect knowledge, Bayesian inference allows us to reason
relevant observed data, D about 𝜓 in light of data via the posterior distribution 𝑝 (𝜓 | D), and we
posterior belief about 𝜓, 𝑝 (𝜓 | D) will use this belief to inform our decision.
Suppose now we must select a decision from an action space A under
uncertainty in 𝜓, informed by a set of observed data D. To guide our
utility function, 𝑢 (𝑎,𝜓, D) choice, we select a real-valued utility function 𝑢 (𝑎,𝜓, D). This function
measures the quality of selecting the action 𝑎 if the true state of the world
were revealed to be 𝜓, with higher utilities indicating more favorable
4 Typical presentations of Bayesian decision the- outcomes. The arguments to a utility function comprise everything
ory omit the data from the utility function, but required to judge the quality of a decision in hindsight: the proposed
including it offers more generality, and this
allowance will be important when we turn our action 𝑎, what we know (the data D), and what we don’t know (the
attention to optimization policies. uncertain elements 𝜓 ).4
We cannot know the exact utility that would result from selecting
any given action a priori, due to our incomplete knowledge of 𝜓. We can,
however, compute the expected utility that would result from selecting
expected utility an action 𝑎, according to our posterior belief:
𝔼[𝑢(𝑎, 𝜓, D) | 𝑎, D] = ∫ 𝑢(𝑎, 𝜓, D) 𝑝(𝜓 | D) d𝜓. (5.2)
This expected utility maps each available action to a real value, inducing
a total order and providing a straightforward mechanism for making our
decision. We pick an action maximizing the expected utility:
𝑎 ∈ arg max_{𝑎′ ∈ A} 𝔼[𝑢(𝑎′, 𝜓, D) | 𝑎′, D]. (5.3)
5 One may question whether this framework is This decision is optimal in the sense that no other action results in greater
complete in some sense: is it possible to make expected utility. (By definition!) This procedure for acting optimally
rational decisions in some other manner? The
von Neumann–Morgenstern theorem shows
under uncertainty – computing expected utility with respect to relevant
that the answer is, surprisingly, no. Assum- unknown variables and maximizing to select an action – is the central
ing a certain set of rationality axioms, any tenant of Bayesian decision making.5
rational preferences over uncertain outcomes
can be captured by the expectation of some
utility function. Thus every rational decision Example: recommending a point for use after optimization
maximizes an expected utility:
With this abstract decision-making framework established, let us analyze
j. von neumann and o. morgenstern (1944).
Theory of Games and Economic Behavior. an example decision that might be faced in the context of optimization.
Princeton University Press. [appendix a] Consider a scenario where the purpose of optimization is to identify a
single point 𝑥 ∈ X for perpetual use in a production system, preferring
𝑢(𝑥, 𝑓) = 𝑓(𝑥) = 𝜙,
which rewards points for achieving high values of the objective function.
Now if our optimization procedure returned a dataset D, the expected
utility from recommending a point 𝑥 is simply the posterior mean of the
corresponding function value:
𝔼[𝑢(𝑥, 𝑓) | 𝑥, D] = 𝔼[𝜙 | 𝑥, D] = 𝜇D(𝑥). (5.4)
Therefore, an optimal recommendation maximizes the posterior mean:
𝑥 ∈ arg max_{𝑥′ ∈ X} 𝜇D(𝑥′).
Optimal terminal recommendation. Above: posterior belief about an objective function
given the data returned by an optimizer, 𝑝(𝑓 | D). Below: the expected utility for our
example, the posterior mean 𝜇D(𝑥). The optimal recommendation maximizes the
expected utility.
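As a tiny illustration of this decision (posterior_mean is a hypothetical stand-in for 𝜇D), the optimal recommendation over a candidate set is simply the maximizer of the posterior mean.

```python
# Sketch of the terminal recommendation decision (5.4): with a linear utility,
# the expected utility of recommending x is the posterior mean mu_D(x).
import numpy as np

def recommend(candidates, posterior_mean):
    expected_utility = posterior_mean(candidates)   # E[u(x, f) | x, D] = mu_D(x)
    return candidates[np.argmax(expected_utility)]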
Bayesian inference of the objective function: To reason about uncertainty in the objective function, we follow
§ 1.2, p. 8 the path laid out in the preceding chapters and maintain a probabilistic
belief throughout optimization, 𝑝 (𝑓 | D). We make no assumptions
regarding the nature of this distribution, and in particular it need not
be a Gaussian process. Equipped with this belief, we may reason about
the result of making an observation at some point 𝑥 via the posterior
predictive distribution 𝑝 (𝑦 | 𝑥, D) (1.7), which will play a key role below.
The ultimate purpose of optimization is to collect and return a dataset
D. Before we can reason about what data we should acquire, we must
first clarify what data we would like to acquire. Following the previous
optimization utility function, 𝑢 (D) section, we will accomplish this by defining a utility function 𝑢 (D) to
evaluate the quality of data returned by an optimizer. This utility function
will serve to establish preferences over optimization outcomes: all other
things being equal, we would prefer to return a dataset with higher
utility than any dataset with lower utility. As before, we will use this
utility to guide the design of policies, by making observations that, in
utility functions for optimization: chapter 6, expectation, promise the biggest improvement in utility. We will define
p. 109 and motivate several utility functions used for optimization in the next
chapter, and some readers may wish to jump ahead to that discussion for
explicit examples before continuing. In the following, we will develop
the general theory in terms of an arbitrary utility function.
5.2 sequential decisions with a fixed budget
In this section we will consider the construction of optimization fixed, known budget
policies assuming that we have a fixed and known budget on the number
of observations we will make. This scenario is both common in practice
and convenient for analysis, as we can for now ignore the question of
when to terminate optimization. Note that this assumption effectively
implies that every observation has a constant acquisition cost, which
may not always be reasonable. We will address variable observation cost-aware optimization: § 5.4, p. 103
costs and the question of when to stop optimization later in this chapter.
Assuming a fixed observation budget allows us to reason about opti-
mization policies in terms of the number of observations remaining to
termination, which will always be known. The problem we will consider
in this section then becomes the following: provided an arbitrary set of
data, how should we design our next evaluation location when exactly 𝜏 number of remaining observations (horizon),
observations remain before termination? In sequential decision making, 𝜏
this value is known as the decision horizon, as it indicates how far we
must look ahead into the future when reasoning about the present.
To facilitate our discussion, we pause to define notation for future
data that will be encountered during optimization relative to the present.
When considering an observation at some point 𝑥, we will call the value
resulting from an observation there 𝑦. We will then call the dataset
available at the next stage of optimization D1 = D ∪ (𝑥, 𝑦) , where putative next observation and dataset: (𝑥, 𝑦),
the subscript indicates the number of future observations incorporated D1
into the current data. We will write (𝑥 2, 𝑦2 ) for the following observa- putative following observation and dataset:
tion, which when acquired will form D2, etc. Our final observation 𝜏 (𝑥 2 , 𝑦2 ), D2
steps in the future will then be (𝑥𝜏 , 𝑦𝜏 ), and the dataset returned by our putative final observation and dataset:
optimization procedure will be D𝜏 , with utility 𝑢 (D𝜏 ). (𝑥𝜏 , 𝑦𝜏 ), D𝜏
This utility of the data we return is our ultimate concern and will
serve as the utility function used to design every observation. Note we
may write this utility in the same form we introduced in our general
discussion:
𝑢(D𝜏) = 𝑢(D, 𝑥, 𝑦, 𝑥₂, 𝑦₂, . . . , 𝑥𝜏, 𝑦𝜏),
where the existing data D are known, 𝑥 is the action under consideration, and 𝑦 and all future observations are unknown.
Explicitly writing out the expectation over the future data in (5.5) yields
the following expression:
∫ · · · ∫ 𝑢(D𝜏) 𝑝(𝑦 | 𝑥, D) ∏_{𝑖=2}^{𝜏} 𝑝(𝑥ᵢ, 𝑦ᵢ | Dᵢ₋₁) d𝑦 d(𝑥ᵢ, 𝑦ᵢ). (5.7)
for this quantity, which is simply the expected terminal utility (5.5) shifted
by the utility of our existing data, 𝑢 (D). It is no coincidence this notation
defining a policy by maximizing an echoes our notation for acquisition functions! We will characterize the
acquisition function: § 5, p. 88 optimal optimization policy by a family of acquisition functions defined
in this manner.
Here we have defined the symbol 𝛼𝜏∗ (D) to represent the expected
increase in utility when starting with D and continuing optimally for value of D with horizon 𝜏, 𝛼𝜏∗ (D)
𝜏 additional observations. This is called the value of the dataset with a
horizon of 𝜏 and will serve a central role below. We have now shown
how to compute the value of any dataset with a horizon of 𝜏 = 1 (5.10)
and how to identify a corresponding optimal action (5.9). This completes
the base case of our argument.
We illustrate the optimal optimization policy with one observation illustration of one-step optimal optimization
remaining in figure 5.1. In this scenario the belief over the objective policy
function 𝑝 (𝑓 | D) is a Gaussian process, and for simplicity we assume
our observations reveal exact values of the objective. We consider an
intuitive utility function: the maximal objective value contained in the
data, 𝑢 (D) = max 𝑓 (x).9 The marginal gain in utility offered by a putative 9 This is a special case of the simple reward util-
final observation (𝑥, 𝑦) is then a piecewise linear function of the observed ity function, which we discuss further in the
next chapter (§ 6.1, p. 109). The corresponding
value:
expected marginal gain is the well-known ex-
𝑢(D₁) − 𝑢(D) = max{𝑦 − 𝑢(D), 0}; (5.11) pected improvement acquisition function (§ 7.3,
p. 127).
that is, the utility increases linearly if we exceed the previously best-
seen value and otherwise remains constant. To design the optimal final
observation, we compute the expectation of this quantity over the domain
and choose the point maximizing it, as shown in the top panels. We also
illustrate the computation of this expectation for the optimal choice and
Figure 5.3: The posterior of the objective function given two possible observations resulting from the optimal two-step
observation 𝑥 illustrated on the facing page. The relatively low value 𝑦′ offers no immediate reward, but reveals
a new local optimum, and the expected future reward from the optimal final decision 𝑥₂ is high. The relatively
high value 𝑦″ offers a large immediate reward and respectable prospects from the optimal final decision as well.
which yields
𝛼₂(𝑥; D) = 𝛼₁(𝑥; D) + 𝔼[𝛼₁(𝑥₂; D₁) | 𝑥, D].
That is, the expected increase in utility after two observations can be decomposition of expected marginal gain
decomposed as the expected increase after our first observation 𝑥 – the
expected immediate gain – plus the expected additional increase from
the final observation 𝑥 2 – the expected future gain.
It is still not clear how to address the second term in this expression.
However, from our analysis of the base case, we can reason as follows.
Given 𝑦 (and thus knowledge of D1 ), the optimal final decision 𝑥 2 (5.9)
results in an expected marginal gain of 𝛼 1∗ (D1 ), a quantity we know how
to compute (5.10). Therefore, assuming optimal future behavior, we have:
𝛼₂(𝑥; D) = 𝛼₁(𝑥; D) + 𝔼[𝛼₁*(D₁) | 𝑥, D], (5.13)
We pause to note that the value of any dataset with null horizon optimal policy: compact notation
is 𝛼 0∗ (D) = 0, and thus the expressions in (5.15–5.17) are valid for any
horizon and compactly express the proposed policy. Further, we have
actually shown that this policy is optimal in the sense of maximizing
expected terminal utility over the space of all policies, at least with optimality
respect to our model of the objective function and observations. This
follows from our induction: the base case is established in (5.9), and the 11 Since ties in (5.16) may be broken arbitrarily,
inductive case by the sequential maximization in (5.16).11 this argument does not rule out the possibility
of there being multiple, equally good optimal
policies.
Bellman optimality and the Bellman equation
Substituting (5.15) into (5.17), we may derive the following recursive 12 r. bellman (1952). On the Theory of Dy-
definition of the value in terms of the value of future data: namic Programming. Proceedings of the Na-
tional Academy of Sciences 38(8):716–719.
𝛼𝜏*(D) = max_{𝑥′ ∈ X} {𝛼₁(𝑥′; D) + 𝔼[𝛼𝜏₋₁*(D₁) | 𝑥′, D]}. (5.18)
This is known as the Bellman equation and is a central result in the theory Bellman equation
of optimal sequential decisions.12 The treatment of future decisions in
this equation – recursively assuming that we will always act to maximize
expected terminal utility given the available data – reflects bellman’s bellman’s principle of optimality
principle of optimality, which characterizes optimal sequential decision
policies in terms of the optimality of subpolicies:13
An optimal policy has the property that whatever the initial state 13 r. bellman (1957). Dynamic Programming.
and initial decision are, the remaining decisions must constitute Princeton University Press.
an optimal policy with regard to the state resulting from the first
decision.
That is, to make a sequence of optimal decisions, we make the first
decision optimally, then make all following decisions optimally given
the outcome!
Figure 5.4: The optimal optimization policy as a decision tree. Squares indicate decisions (the choice of each observation),
and circles represent expectations with respect to random variables (the outcomes of observations). Only one
possible optimization path is shown; dangling edges lead to different futures, and all possibilities are always
considered. We maximize the expected terminal utility 𝑢 (D𝜏 ), recursively assuming optimal future behavior.
𝑥 ∈ arg max 𝛼𝜏;
𝛼𝜏 = 𝛼₁ + 𝔼[𝛼𝜏₋₁*]
= 𝛼₁ + 𝔼[max 𝛼𝜏₋₁]
= 𝛼₁ + 𝔼[max{𝛼₁ + 𝔼[𝛼𝜏₋₂*]}]
= 𝛼₁ + 𝔼[max{𝛼₁ + 𝔼[max{𝛼₁ + 𝔼[max{𝛼₁ + · · · .
5.3 cost and approximation of the optimal policy
Limited lookahead
One widespread and surprisingly effective approximation is to simply
limit how many future observations we consider in each decision. This
is practical as decisions closer to termination require substantially less
computation than earlier decisions.
With this in mind, we can construct a natural family of approxima-
tions to the optimal policy defined by artificially limiting the horizon
used throughout optimization to some computationally feasible maxi-
mum ℓ. When faced with an infeasible decision horizon 𝜏, we make the
crude approximation
and by maximizing this score, we act optimally under the incorrect but
convenient assumption that only ℓ observations remain. This effectively 15 Equivalently, we approximate the true future
assumes 𝑢 (D𝜏 ) ≈ 𝑢 (Dℓ ).15 This may be reasonable if we expect decreas- value 𝛼𝜏∗ −1 with 𝛼 ℓ∗−1 .
ing marginal gains, implying a significant fraction of potential gains can
be attained within the truncated horizon. This scheme is often described myopic approximations
(sometimes disparagingly) as myopic, as we limit our sight to only the
next few observations rather than looking ahead to the full horizon.
A policy that designs each observation to maximize the limited-
horizon acquisition function 𝛼 min{ℓ,𝜏 } is called an ℓ-step lookahead pol- ℓ-step lookahead
icy.16 This is also called a rolling horizon strategy, as the fixed horizon rolling horizon
“rolls along” with us as we go. By limiting the horizon, we bound the
computational effort required for each decision to at-most O(𝑛 ℓ 𝑞 ℓ ) time 16 We take the minimum to ensure we don’t look
with the implementation described above. This can be a considerable beyond the true horizon, which would be non-
sense.
savings when the observation budget is much greater than the selected
lookahead. A lookahead policy is illustrated as a decision tree in figure
5.5. Comparing to the optimal policy in figure 5.4, we simply “cut off”
and ignore any portion of the tree lying deeper than ℓ steps in the future.
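The sketch below illustrates the simplest case, one-step lookahead with the maximal-observed-value utility, which with exact observations reduces to the familiar expected improvement; the candidate grid and the posterior_moments interface are assumptions for illustration.

```python
# Sketch of one-step lookahead: closed-form expected improvement over a
# candidate set; posterior_moments is an assumed stand-in for the GP posterior.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best):
    # closed-form E[max(y - best, 0)] for y ~ N(mu, sigma^2)
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def one_step_lookahead(candidates, posterior_moments, best):
    mu, sigma = posterior_moments(candidates)      # predictive moments
    alpha = expected_improvement(mu, sigma, best)  # alpha_1 on the candidates
    return candidates[np.argmax(alpha)], alpha
```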
Figure 5.6: A decision tree representing a rollout policy. Comparing to the optimal policy in figure 5.4, we simulate future
decisions starting with 𝑥 2 using an efficient but suboptimal heuristic policy, rather than the intractable optimal
policy. We maximize the expected terminal utility 𝑢 (D𝜏 ), assuming potentially suboptimal future behavior.
Rollout
The optimal policy evaluates a potential observation location by simu-
lating the entire remainder of optimization following that choice, recur-
sively assuming we will use the optimal policy for every future decision.
Although sensible, this is clearly intractable. Rollout is an approach to ap-
proximate policy design that emulates the structure of the optimal policy,
but using a tractable suboptimal policy to simulate future decisions.
A rollout policy is illustrated as a decision tree in figure 5.6. Given a
base policy, heuristic policy putative next observation (𝑥, 𝑦), we use an inexpensive so-called base
or heuristic policy to simulate a plausible – but perhaps suboptimal –
realization of the following decision 𝑥 2 . Note there is no branching in
the tree corresponding to this decision, as it does not depend on the
exhaustively enumerated subtree required by the optimal policy. We
then take an expectation with respect to the unknown value 𝑦2 as usual.
Given a putative value of 𝑦2 , we use the base policy to select 𝑥 3 and
continue in this manner until reaching the decision horizon. We use the
terminal utilities in the resulting pruned tree to estimate the expected
marginal gain 𝛼𝜏 , which we maximize as a function of 𝑥.
choice of base policy There are no constraints on the design of the base policy used in
rollout; however, for this approximation to be sensible, we must choose
something relatively efficient. One common and often effective choice
is to simulate future decisions with one-step lookahead. If we again
use off-the-shelf optimization and quadrature routines to traverse the
rollout decision tree in figure 5.6 with this particular choice, the running
time of the policy with a horizon of 𝜏 is O(𝑛 2𝑞𝜏 ), significantly faster
5.4 cost-aware optimization and termination as a decision
A = X ∪ {∅}. (5.19)
For the sake of analysis, after the termination action has been selected, it
is convenient to model the decision process as not actually terminating,
but rather continuing with the collapsed action space A = {∅} – once
you terminate, there’s no going back.
As before, we may derive the optimal optimization policy in the
adaptive termination case via induction on the decision horizon 𝜏. How-
ever, we must address one technical issue: the base case of the induction,
which analyzes the “final” decision, breaks down if we allow the possi-
bility of a nonterminating sequence of decisions. To sidestep this issue,
bound on total number of observations, 𝜏max we assume there is a fixed and known upper bound 𝜏max on the total
number of observations we may make, at which point optimization is
compelled to terminate regardless of any other concern. This is not an
overly restrictive assumption in the context of Bayesian optimization.
Because observations are assumed to be expensive, we can adopt some
18 It is possible to consider unbounded sequen- suitably absurd upper bound without issue; for example, 𝜏max = 1 000 000
tial decision problems, but this is probably not would suffice for an overwhelming majority of plausible scenarios.18
of practical interest in Bayesian optimization:
After assuming the decision process is bounded, our previous induc-
m. h. degroot (1970). Optimal Statistical Deci- tive argument carries through after we demonstrate how to compute
sions. McGraw–Hill. [§ 12.7] the value of the termination action. Fortunately, this is straightforward:
termination does not augment our data, and once this action is taken, no
other action will ever again be allowed. Therefore the expected marginal
gain from termination is always zero:
𝛼𝜏 (∅; D) = 0. (5.20)
With this, substituting A for X in (5.15–5.17) now gives the optimal policy.
Intuitively, the result in (5.20) implies that termination is only the
optimal decision if there is no observation offering positive expected
gain in utility. For the utility functions described in the next chapter –
all of which are agnostic to costs and measure optimization progress
19 This can be proven through various “informa- alone – reaching this state is actually impossible.19 However, explicitly
tion never hurts” (in expectation) results. accounting for observation costs in addition to optimization progress in
the utility function resolves this issue, as we will demonstrate.
Figure 5.8: Illustration of one-step lookahead with the option to terminate. With a linear utility and additive costs, the
expected marginal gain 𝛼 1 is the expected marginal gain to the data utility 𝛼′1 adjusted for the cost of acquisition
𝑐. For some points, the cost-adjusted expected gain is negative, in which case we would prefer immediate
termination to observing there. However, continuing with the chosen point is expected to increase the utility
of the current data.
Consider the objective function belief in the top panel of figure 5.8 (which
is identical to that from our running example from figures 5.1–5.3) and
suppose that the cost of observation now depends on location according
to a known cost function 𝑐 (𝑥),20 illustrated in the middle panel.
20 We will consider unknown and stochastic costs in § 11.1, p. 245.
If we wish to reason about observation costs in the optimization
policy, we must account for them somehow, and the most natural place
to do so is in the utility function. Depending on the situation, there are
many ways we could proceed;21 however, one natural approach is to
first select a utility function measuring the quality of a returned dataset
alone, ignoring any costs incurred to acquire it. We call this quantity
the data utility and notate it with 𝑢′(D). The data utility is akin to the
cost-agnostic utility from the known-budget case, and any one of the
options described in the next chapter could reasonably fill this role.
21 We wish to stress this point – there is considerable flexibility beyond the scheme we describe.
We now adjust the data utility to account for the cost of data acquisi-
tion. In many applications, these costs are additive, so that the total cost
of gathering a dataset D is simply
𝑐 (D) = ∑𝑥 ∈D 𝑐 (𝑥). (5.21)
If the acquisition cost can be expressed in the same units as the data
utility – for example, if both can be expressed in monetary terms22 – then
we might reasonably evaluate a dataset D by the cost-adjusted utility:
𝑢 (D) = 𝑢′(D) − 𝑐 (D). (5.22)
22 Some additional discussion on this natural approach can be found in: h. raiffa and
r. schlaifer (1961). Applied Statistical Decision Theory. Division of Research, Graduate
School of Business Administration, Harvard University. [chapter 4]
5.5. summary of major ideas
In the next chapter we will discuss several prominent utility functions for
measuring the quality of a dataset returned by an optimization procedure.
In the following chapter, we will demonstrate how many common acqui- common Bayesian optimization policies:
sition functions for Bayesian optimization may be realized by performing chapter 7, p. 123
one-step lookahead with these utility functions.
utility functions for optimization
6.1. expected utility of terminal recommendation
In this case, the expected utility from recommending 𝑥 is simply the
posterior mean of 𝜙, as we have already seen (5.4):
𝔼[𝜐 (𝜙) | 𝑥, D] = 𝜇D (𝑥).
[Margin figure: a risk-neutral (linear) utility function.]
Uncertainty in the objective function is not considered in this decision at
all! Rather, we are indifferent between points with equal expected value,
regardless of their uncertainty – that is, we are risk neutral.
Risk neutrality is computationally convenient due to the simple form
of the expected utility, but may not always reflect our true preferences.
In the margin we show beliefs over the objective values for two potential
recommendations with equal expected value but significantly different
risk. In many scenarios we would have a clear preference between the
two alternatives, but a risk-neutral utility induces complete indifference.
[Margin figure: beliefs over two recommendations with equal expected value. A risk-neutral
agent would be indifferent between these alternatives, a risk-averse agent would prefer the
red option, and a risk-seeking agent would prefer the blue option.]
A useful concept when reasoning about risk preferences is the so-
called certainty equivalent. Consider a risky potential recommendation
𝑥, that is, a point for which we do not know the objective value exactly.
The certainty equivalent for 𝑥 is the value of a hypothetical risk-free
alternative for which our preferences would be indifferent. That is, the
certainty equivalent for 𝑥 corresponds to an objective function value 𝜙′
such that
𝜐 (𝜙′) = 𝔼[𝜐 (𝜙) | 𝑥, D].
Under a risk-neutral utility function, the certainty equivalent of a point
𝑥 is simply its expected value: 𝜙′ = 𝜇D (𝑥). Thus we would abandon a po-
tential recommendation for another only if it had greater expected value,
independent of risk. However, we may encode risk-aware preferences
with appropriately designed nonlinear utility functions.
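As a brief worked example (an illustrative choice, not taken from the text), consider the hypothetical risk-averse utility 𝜐 (𝜙) = −exp(−𝛾𝜙) with risk-aversion parameter 𝛾 > 0. If our belief about a recommendation is Gaussian, 𝜙 | 𝑥, D ∼ N (𝜇, 𝜎²), then 𝔼[𝜐 (𝜙) | 𝑥, D] = −exp(−𝛾𝜇 + 𝛾²𝜎²/2), and solving 𝜐 (𝜙′) = 𝔼[𝜐 (𝜙) | 𝑥, D] gives the certainty equivalent 𝜙′ = 𝜇 − 𝛾𝜎²/2. The certainty equivalent sits below the expected value by an amount growing with both the risk-aversion parameter and the predictive variance, so between two recommendations with equal expected value, the less uncertain one would be preferred.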
Simple reward
Suppose an optimization routine returned data D = (x, y) to inform a
terminal recommendation, and that we will make this decision using
the risk-neutral utility function 𝜐 (𝜙) = 𝜙 (6.2). If we limit the action
space of this recommendation to only the locations evaluated during
optimization x, the expected utility of the optimal recommendation is
the so-called simple reward:6, 7
𝑢 (D) = max 𝜇D (x). (6.3)
In the special case of exact observations, where y = 𝑓 (x) = 𝝓, the
simple reward reduces to the maximal function value encountered during
optimization:
𝑢 (D) = max 𝝓. (6.4)
6 This name contrasts with the cumulative reward: § 6.2, p. 114.
7 One technical caveat is in order: when the dataset is empty, the maximum degenerates
and we have 𝑢 (∅) = −∞.
Figure 6.1: The terminal recommendations corresponding to the simple reward and global reward for an example dataset
comprising five observations. The prior distribution for the objective for this demonstration is illustrated in
figure 6.3.
One-step lookahead with the simple reward utility function pro-
duces a widely-used acquisition function known as expected improvement
(§ 7.3, p. 127), which we will discuss in detail in the next two chapters.
Global reward
Another prominent utility is the global reward.8 Here we again consider a
risk-neutral terminal recommendation, but now expand the action space
for this recommendation to the entire domain X. The expected utility of
this recommendation is the global maximum of the posterior mean:
𝑢 (D) = max𝑥 ∈X 𝜇D (𝑥). (6.5)
8 “Global simple reward” would be a more accurate (but annoyingly bulky) name.
Figure 6.2: The utility 𝑢 (D) = max y would prefer the excessively noisy dataset on the left to the less-noisy dataset
on the right with smaller maximum value. The data on the left reveal little information about the objective
function, and the maximum observed value is very likely to be an outlier, whereas the data on the right indicate
reasonable progress.
This is simple to demonstrate by contemplating the preferences over
outcomes encoded in the utility, which may not align with intuition.
This disparity is especially notable in situations with excessively noisy
observations, where the maximum value observed will likely reflect
spurious noise rather than actual optimization progress.
Figure 6.2 shows an extreme but illustrative example. We consider
two optimization outcomes over the same domain, one with excessively
noisy observations and the other with exact measurements. The noisy
dataset contains a large observation on the right-hand side of the domain,
but this is almost certainly the result of noise, as indicated by the objective
function posterior. Although the other dataset has a lower maximal value,
the observations are more trustworthy and represent a plainly better
outcome. But the proposed utility (6.6) prefers the noisier dataset! On
the other hand, both the simple and global reward utilities prefer the
noiseless dataset, as the data produce a larger effect on the posterior
mean – and thus yield more promising recommendations.
Of course, errors in noisy measurements are not always as extreme
as in this example. When the signal-to-noise ratio is relatively high, the
utility (6.6) can serve as a reasonable approximation to the simple reward.
We will discuss this approximation scheme further in the context of
expected improvement (§ 7.3, p. 127).
6.3. information gain
One notable use of cumulative reward is in active search (§ 11.11, p. 282), a simple
mathematical model of scientific discovery. Here, we successively select
points for investigation seeking novel members of a rare, valuable class
V ⊂ X. Observing at a point 𝑥 ∈ X yields a binary observation indicating
membership in the desired class: 𝑦 = [𝑥 ∈ V]. Most studies of active
search seek to maximize the cumulative reward (6.7) of the gathered
data, hoping to discover as many valuable items as possible.
The information gain offered by a dataset D is then the reduction in
entropy when moving from the prior to the posterior distribution:
𝑢 (D) = 𝐻 [𝜔] − 𝐻 [𝜔 | D], (6.8)
where 𝐻 [𝜔 | D] is the differential entropy of the posterior:11
𝐻 [𝜔 | D] = − ∫ 𝑝 (𝜔 | D) log 𝑝 (𝜔 | D) d𝜔.
11 A caveat is in order regarding this notation, which is not standard. In information theory
𝐻 [𝜔 | D] denotes the conditional entropy of 𝜔 given D, which is the expectation of the
given quantity over the observed values y. For our purposes it will be more useful for this
to signify the differential entropy of the notationally parallel posterior 𝑝 (𝜔 | D). When
needed, we will write conditional entropy with an explicit expectation: 𝔼[𝐻 [𝜔 | D] | x].
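As a minimal numeric illustration of (6.8) (not from the text), suppose 𝜔 is a scalar and our belief is Gaussian both before and after observing D. The differential entropy of a Gaussian with standard deviation 𝜎 is ½ log(2𝜋e𝜎²), so the information gain reduces to half the log of the variance ratio; the short sketch below checks this with hypothetical prior and posterior scales.

    import numpy as np

    def gaussian_entropy(sigma):
        # differential entropy (in nats) of a normal with standard deviation sigma
        return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

    sigma_prior, sigma_post = 2.0, 0.5           # hypothetical prior/posterior scales
    gain = gaussian_entropy(sigma_prior) - gaussian_entropy(sigma_post)
    print(gain, 0.5 * np.log(sigma_prior**2 / sigma_post**2))   # both ~1.386 nats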
6.5. comparison of utility functions
Figure 6.3: The objective function prior used throughout our utility function comparison. Marginal beliefs of function
values are shown, as well as the induced beliefs over the location of the global optimum, 𝑝 (𝑥 ∗ ), and the value
of the global optimum, 𝑝 (𝑓 ∗ ). Note that there is a significant probability that the global optimum is achieved
on the boundary of the domain, reflected by large point masses.
Figure 6.4: An example dataset of five observations and the resulting posterior belief of the objective function. This dataset
exhibits relatively low simple reward (6.3) but relatively high global reward (6.5) and information gain (6.8)
about the location 𝑥 ∗ and value 𝑓 ∗ of the optimum.
covariance (3.12). This prior is illustrated in figure 6.3, along with the
induced beliefs about the location 𝑥 ∗ and value 𝑓 ∗ of the global optimum.
Both distributions reflect considerable uncertainty.15 We will examine
two datasets that might be returned by an optimizer using this model and
discuss how different utility functions would evaluate these outcomes.
15 For this model, a unique optimum will exist with probability one; see § 2.7, p. 34 for more details.
Figure 6.5: An example dataset containing a single observation and the resulting posterior belief of the objective function.
This dataset exhibits relatively high simple reward (6.3) but relatively low global reward (6.5) and information
gain (6.8) about the location 𝑥 ∗ and value 𝑓 ∗ of the optimum.
common bayesian optimization policies
Figure 7.1: The scenario we will consider for illustrating optimization policies. The objective function prior is a Gaussian
process with constant mean and Matérn covariance with 𝜈 = 5/2 (3.14). We show the marginal predictive
distributions and three samples from the posterior conditioned on the indicated observations.
7.2. decision-theoretic policies
Figure 7.2: The true objective function used for simulating optimization policies.
of the returned data (5.16). This policy is optimal in the average case: it
maximizes the expected utility of the returned dataset over the space
of all possible policies.5 Unfortunately, optimality comes at a great cost.
Computing the optimal policy requires recursive simulation of the entire
remainder of optimization, a random process due to uncertainty in the
outcomes of our observations. In general, the cost of computing the
optimal policy grows exponentially with the horizon, the number of
observations remaining before termination.
5 To be precise, optimality is defined with respect to a model for the objective function
𝑝 (𝑓 ), an observation model 𝑝 (𝑦 | 𝑥, 𝜙), a utility function 𝑢 (D), and an upper bound on the
number of observations allowed 𝜏. Bayesian decision theory provides a policy achieving
the maximal expected utility at termination with respect to these choices.
However, the structure of the optimal policy suggests a natural family
of lookahead approximations based on fixing a computationally tractable
maximum horizon throughout optimization (limited lookahead: § 5.3, p. 101).
This line of reasoning has led to many of the practical policies available
for Bayesian optimization.
In fact, most popular algorithms represent one-step lookahead, where
in each iteration we greedily maximize the expected utility after obtain-
ing only a single additional observation. Although these policies are
maximally myopic, they are also maximally efficient among lookahead
approximations and have delivered impressive empirical performance in
a wide range of settings.
It may seem surprising that such dramatically myopic policies have
any use at all. There is a huge difference between the scale of reasoning
in one-step lookahead compared with the optimal procedure, which
may consider hundreds of future decisions or more when designing an
observation. However, the situation is somewhat more nuanced than it
might appear. In a seminal paper, kushner argued that myopic policies
may in fact show better empirical performance than a theoretically
optimal policy, and his argument remains convincing:6
Since a mathematical model of [𝑓 ] is available, it is theoretically
possible, once a criterion of optimality is given, to determine the
mathematically optimal sampling policy. However. . . determination
of the optimum sampling policies is extremely difficult. Because
of this, the development of our sampling laws has been guided
primarily by heuristic considerations.7 There are some advantages
to the approximate approach. . . [and] its use may yield better results
than would a procedure that is optimum for the model. Although the
model selected for [𝑓 ] is the best we have found for our purposes,
it is sometimes too general. . .
6 h. j. kushner (1964). A New Method of Locating the Maximum Point of an Arbitrary
Multipeak Curve in the Presence of Noise. Journal of Basic Engineering 86(1):97–106.
7 Specifically, maximizing probability of improvement: § 7.5, p. 131.
One-step lookahead
Let us review the generic procedure for developing a one-step lookahead
policy and adopt standard notation to facilitate their description. Suppose
we have selected an arbitrary utility function 𝑢 (D) to evaluate a returned
dataset. Suppose further that we have already gathered an arbitrary
dataset D = (x, y) and wish to select the next evaluation location. This
is the fundamental role of an optimization policy.
If we were to choose some point 𝑥, we would observe a corresponding
value 𝑦 and update our dataset, forming D′ = (x′, y′) = D ∪ {(𝑥, 𝑦)}.
Note that in our discussion on decision theory in chapter 5, we notated
this updated dataset with the symbol D1 , as we needed to be able to
distinguish between datasets after the incorporation of a variable number
of additional observations. As our focus in this chapter will be on one-step
lookahead, we can simplify notation by dropping subscripts indicating
time. Instead, we will systematically use the prime symbol to indicate
future quantities after the acquisition of the next observation.
In one-step lookahead, we evaluate a proposed point 𝑥 via the ex-
pected marginal gain in utility after incorporating an observation there
(5.8):
𝛼 (𝑥; D) = 𝔼[𝑢 (D′) | 𝑥, D] − 𝑢 (D),
which serves as an acquisition function (§ 5, p. 88) inducing preferences over possi-
ble observation locations. We design each observation by maximizing
this score:
𝑥 ∈ arg max𝑥′ ∈X 𝛼 (𝑥′; D). (7.1)
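The following Python sketch illustrates this generic one-step scheme under some simplifying assumptions: the expectation over the putative value 𝑦 is approximated by sampling rather than an exact integral, the maximization in (7.1) is taken over a finite candidate set, and model and utility are hypothetical interfaces standing in for a Gaussian process implementation and a utility function.

    import numpy as np

    def expected_marginal_gain(x, data, model, utility, n_samples=64, rng=None):
        # alpha(x; D) = E[u(D') | x, D] - u(D), estimated by averaging the utility of
        # the updated dataset over putative observations y drawn from the predictive
        # distribution. `model` and `utility` are hypothetical interfaces.
        rng = rng or np.random.default_rng()
        u_now = utility(data, model)
        u_new = [utility(data + [(x, model.posterior(data).sample_y(x, rng))], model)
                 for _ in range(n_samples)]
        return float(np.mean(u_new)) - u_now

    def one_step_policy(candidates, data, model, utility):
        # design the next observation by maximizing the acquisition function (7.1)
        # over a finite candidate set
        scores = [expected_marginal_gain(x, data, model, utility) for x in candidates]
        return candidates[int(np.argmax(scores))]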
7.3. expected improvement
𝑢 (D) = max 𝜇D (x).
Suppose we have already gathered observations D = (x, y) and wish
to choose the next evaluation location. Expected improvement is derived
by measuring the expected marginal gain in utility, or the instantaneous
improvement, 𝑢 (D′) − 𝑢 (D),10 offered by making the next observation
at a proposed location 𝑥:11
𝛼 ei (𝑥; D) = ∫ max 𝜇D′ (x′) 𝑝 (𝑦 | 𝑥, D) d𝑦 − max 𝜇D (x). (7.2)
10 This reasoning is the same for all one-step lookahead policies, which could all be
described as maximizing “expected improvement.” But this name has been claimed for the
simple reward utility alone.
11 As mentioned in the last chapter, simple reward degenerates with an empty dataset;
expected improvement does as well. In that case we can simply ignore the second term
and compute the first, which for zero-mean additive noise becomes the mean function of
the prior process.
Expected improvement reduces to a particularly nice expression in
the case of exact observations of the objective, where the utility takes a
simpler form (6.4). Suppose that, when we elect to make an observation
at a location 𝑥, we observe the exact objective value 𝜙 = 𝑓 (𝑥). Consider
a dataset D = (x, 𝝓), and define 𝜙 ∗ = max 𝝓 to be the so-called incum-
bent: the maximal objective value yet seen.12 As a consequence of exact
observation, we have
𝑢 (D) = 𝜙 ∗ ; 𝑢 (D′) = max(𝜙 ∗, 𝜙);
and thus
𝑢 (D′) − 𝑢 (D) = max(𝜙 − 𝜙 ∗, 0).
12 The value 𝜙 ∗ is incumbent as it is currently “holding office” as our standing recommenda-
tion until it is deposed by a better candidate.
Figure 7.3: The expected improvement acquisition function (7.2) corresponding to our running example.
Expected improvement is illustrated for our running example in
figure 7.3. In this case, maximizing expected improvement will select
a point near the previous best point found, an example of exploitation.
Notice that the expected improvement vanishes near regions where
we have existing observations. Although these locations may be likely
to yield values higher than 𝜙 ∗ due to relatively high expected value,
the relatively narrow credible intervals suggest that the magnitude of
any improvement is likely to be small. Expected improvement is thus
considering the exploration–exploitation dilemma in the selection of the
next observation location, and the tradeoff between these two concerns
is considered automatically.
Figure 7.4 shows the posterior belief of the objective after sequentially
maximizing expected improvement to gather 20 additional observations
of our example objective function. The global optimum was efficiently
located. The distribution of the sample locations, with more evaluations
in the most promising regions, reflects consideration of the exploration–
exploitation dilemma. However, there seems to have been a focus on
exploitation throughout the entire process; the first ten observations
for example never strayed from the initially known local optimum. This
behavior is a reflection of the simple reward utility function underlying
the policy, which only rewards the discovery of high objective func-
tion values at observed locations. As a result, one-step lookahead may
7.4. knowledge gradient
Figure 7.4: The posterior after 10 (top) and 20 (bottom) steps of the optimization policy induced by the expected improve-
ment acquisition function (7.2) on our running example. The tick marks show the points chosen by the policy,
progressing from top to bottom, during iterations 1–10 (top) and 11–20 (bottom). Observations within 0.2
length scales of the optimum are marked in red; the optimum was located on iteration 19.
Figure 7.5: The knowledge gradient acquisition function (7.4) corresponding to our running example.
7.5. probability of improvement
Figure 7.7: The posterior after 10 (top) and 20 (bottom) steps of the optimization policy induced by the knowledge gradient
acquisition function (7.4) on our running example. The tick marks show the points chosen by the policy,
progressing from top to bottom, during iterations 1–10 (top) and 11–20 (bottom). Observations within 0.2 length
scales of the optimum are marked in red; the optimum was located on iteration 15.
of these new maxima coincide with local maxima of the expected im-
provement acquisition function; see figure 7.3 for comparison. This is
not a coincidence! One way to interpret this relation is that, due to re-
warding large values of the posterior mean at observed locations only,
expected improvement must essentially guess on which side the hidden
local optimum of the objective lies and hope to be correct. The knowl-
edge gradient, on the other hand, considers identifying this maximum
on either side a success, and guessing is not necessary.
Figure 7.7 illustrates the behavior of the knowledge gradient policy
on our example optimization scenario. The global optimum was located
efficiently. Comparing the decisions made by the knowledge gradient to
those made by expected improvement (see figure 7.4), we can observe
a somewhat more even exploration of the domain, including in local
maxima. The knowledge gradient policy does not necessarily need to
expend observations to verify a suspected maximum, instead putting
more trust into the model to have correct beliefs in these regions.
Figure 7.8: An illustrative example comparing the behavior of probability of improvement with expected improvement
computed with respect to the dashed target. The predictive distributions for two points 𝑥 and 𝑥′ are shown. The
distributions have equal mean but the distribution at 𝑥′ has larger predictive standard deviation. The shaded
regions represent the region of improvement. The relatively safe 𝑥 is preferred by probability of improvement,
whereas the more-risky 𝑥′ is preferred by expected improvement.
Consider the simple reward (§ 6.1, p. 109) of an already gathered dataset D = (x, y):
Figure 7.9: The probability of improvement acquisition function (7.5) corresponding to our running example for different
values of the target improvement 𝜀. The target is expressed as a fraction of the range of the posterior mean
over the space. Increasing the target improvement leads to increasingly exploratory behavior.
7.6. mutual information and entropy search
Figure 7.11: The posterior after 20 steps of the optimization policy induced by probability of improvement with 𝜀 = 0
(7.5) on our running example. The tick marks show the points chosen by the policy, progressing from top to
bottom.
Evidently we must carefully select the desired improvement threshold
to achieve ideal behavior. jones provided some simple, data-driven advice
for choosing improvement thresholds that remains sound.14 Define
𝜇 ∗ = max𝑥 ∈X 𝜇D (𝑥); 𝑟 = max 𝜇D (x) − min 𝜇D (x)
to represent the global maximum of the posterior mean and the range of
the posterior mean at the observed locations. jones suggests considering
targets of the form
𝜇 ∗ + 𝛼𝑟,
where 𝛼 ≥ 0 controls the amount of desired improvement in terms of
the range of observed data. He provides a table of 27 suggested values
for 𝛼 in the range [0, 3] and remarks that the points optimizing the set
of induced acquisition functions typically cluster together in a small
number of locations, each representing a different tradeoff between
exploration and exploitation.15 jones continued to recommend selecting
one point from each of these clusters to evaluate in parallel, defining a
batch optimization policy. Although this may not always be possible, the
recommended parameterization of the desired improvement is natural
and would be appropriate for general use.
14 d. r. jones (2001). A Taxonomy of Global Optimization Methods Based on Response
Surfaces. Journal of Global Optimization 21(4):345–383.
15 The proposed values for 𝛼 given by jones are, in increasing order:
0, 0.0001, 0.001, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13,
0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.75, 1, 1.5, 2, 3.
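A small Python sketch of assembling jones's improvement targets follows; it assumes the posterior mean has already been evaluated on a dense grid (used as a stand-in for the global maximum) and at the observed locations, so the function names and inputs here are illustrative rather than part of the text.

    import numpy as np

    # jones's 27 suggested values for alpha, as compiled in the footnote above
    alphas = [0, 0.0001, 0.001, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09,
              0.1, 0.11, 0.12, 0.13, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.75, 1, 1.5, 2, 3]

    def improvement_targets(mu_grid, mu_observed):
        # mu_grid: posterior mean on a dense grid (proxy for the global maximum);
        # mu_observed: posterior mean at the observed locations
        mu_star = np.max(mu_grid)
        r = np.max(mu_observed) - np.min(mu_observed)
        return [mu_star + a * r for a in alphas]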
This proposal is illustrated for our running example in figure 7.12. We
begin with the posterior after selecting 10 points in our previous demo
(see figure 7.10), and indicate the points maximizing the probability of
improvement for jones’s proposed improvement targets. The points clus-
ter together in four regions reflecting varying exploration–exploitation
tradeoffs.
Figure 7.12: The points maximizing probability of improvement using the 27 improvement thresholds proposed by
jones, beginning with the posterior from figure 7.10 after 10 total observations have been obtained. The tick
marks show the chosen points and cluster together in four regions representing different tradeoffs between
exploration and exploitation.
The acquisition function in these methods is mutual information, a mea-
sure of dependence between random variables that is a central concept
in information theory.16, 17
16 t. m. cover and j. a. thomas (2006). Elements of Information Theory. John Wiley & Sons.
17 d. j. c. mackay (2003). Information Theory, Inference, and Learning Algorithms.
Cambridge University Press.
The reasoning underlying entropy search policies is somewhat dif-
ferent from and more general than the other acquisition functions we
have considered thus far, all of which ultimately focus on maximizing
the posterior mean function. Although this is a pragmatic concern, it is
intimately linked to optimization. Information-theoretic experimental
design is instead motivated by an abstract pursuit of knowledge, and may
be interpreted as a mathematical formulation of the scientific method.
We begin by identifying some unknown feature of the world that
we wish to learn about; in the context of Bayesian inference, this will be
some random variable 𝜔. We then view each observation we make as an
opportunity to learn about this random variable, and seek to gather data
that will, in aggregate, provide considerable information about 𝜔. This
process is analogous to a scientist designing a sequence of experiments
to understand some natural phenomenon, where each experiment may
be chosen to challenge or confirm constantly evolving beliefs.
The framework of information theory allows us to formalize this
process. We may quantify the amount of information provided about a
random variable 𝜔 by a dataset D via the information gain (§ 6.3, p. 115), a concept for
which we provided two definitions in the last chapter. Adopting either
definition as a utility function and performing one-step lookahead yields
mutual information as an acquisition function.
Information-theoretic optimization policies select 𝜔 such that its de-
termination gives insight into our optimization problem (1.1). However,
by selecting different choices for 𝜔, we can generate radically differ-
ent policies, each attempting to learn about a different aspect of the
system of interest. Maximizing mutual information has long been pro-
moted as a general framework for optimal experimental design,18 and
this framework has been applied in numerous active learning settings.19
Before showing how mutual information arises in a decision-theoretic
context, we pause to define the concept and derive some important prop-
erties.
18 d. v. lindley (1956). On a Measure of the Information Provided by an Experiment. The
Annals of Mathematical Statistics 27(4):986–1005.
19 b. settles (2012). Active Learning. Morgan & Claypool.
Mutual information
Let 𝜔 and 𝜓 be random variables with probability density functions 𝑝 (𝜔)
and 𝑝 (𝜓 ). The mutual information between 𝜔 and 𝜓 is
𝐼 (𝜔;𝜓 ) = ∬ 𝑝 (𝜔,𝜓 ) log [𝑝 (𝜔,𝜓 ) / (𝑝 (𝜔) 𝑝 (𝜓 ))] d𝜔 d𝜓 . (7.7)
This expression may be recognized as the Kullback–Leibler divergence
between the joint distribution of the random variables and the product
of their marginal distributions:
𝐼 (𝜔;𝜓 ) = 𝐷 kl [𝑝 (𝜔,𝜓 ) ‖ 𝑝 (𝜔) 𝑝 (𝜓 )].
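As a quick numerical illustration of (7.7) (not from the text), we can estimate the mutual information between the two components of a correlated bivariate Gaussian by Monte Carlo and compare against the standard closed form −½ log(1 − 𝜌²); the correlation value below is arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    rho = 0.8
    cov = np.array([[1.0, rho], [rho, 1.0]])
    inv, logdet = np.linalg.inv(cov), np.log(np.linalg.det(cov))

    # Monte Carlo estimate of (7.7): average of log p(w, psi) - log p(w) - log p(psi)
    s = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
    log_joint = -0.5 * np.einsum('ij,jk,ik->i', s, inv, s) - np.log(2 * np.pi) - 0.5 * logdet
    log_marginals = -0.5 * (s[:, 0]**2 + s[:, 1]**2) - np.log(2 * np.pi)
    print(np.mean(log_joint - log_marginals))      # approximately 0.51 nats
    print(-0.5 * np.log(1 - rho**2))               # closed form: 0.5108... nats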
We may extend this definition to conditional probability distributions as
well. Given an arbitrary set of observed data D, we define the conditional
mutual information between 𝜔 and 𝜓 by:20
𝐼 (𝜔;𝜓 | D) = ∬ 𝑝 (𝜔,𝜓 | D) log [𝑝 (𝜔,𝜓 | D) / (𝑝 (𝜔 | D) 𝑝 (𝜓 | D))] d𝜔 d𝜓 .
Here we have simply conditioned all distributions on the data and applied
the definition in (7.7) to the posterior beliefs.
20 Some authors use the notation 𝐼 (𝜔;𝜓 | D) to represent the expectation of the given
quantity with respect to the dataset D. In optimization, we will always have an explicit
dataset in hand, in which case the provided definition is more useful.
Several properties of mutual information are immediately evident
from its definition. First, mutual information is symmetric in its argu-
ments:
𝐼 (𝜔;𝜓 ) = 𝐼 (𝜓 ; 𝜔). (7.8)
We also have that if 𝜔 and 𝜓 are independent, then 𝑝 (𝜔,𝜓 ) = 𝑝 (𝜔) 𝑝 (𝜓 )
and the mutual information is zero:
𝐼 (𝜔;𝜓 ) = ∬ 𝑝 (𝜔,𝜓 ) log [𝑝 (𝜔) 𝑝 (𝜓 ) / (𝑝 (𝜔) 𝑝 (𝜓 ))] d𝜔 d𝜓 = 0.
Thus the mutual information between 𝜔 and 𝜓 is the expected decrease
in the differential entropy of 𝜔 if we were to observe 𝜓 . Due to symmetry
(7.8), we may swap the roles of 𝜔 and 𝜓 to derive an equivalent expression
in the other direction:
𝐼 (𝜔;𝜓 ) = 𝐻 [𝜔] − 𝔼𝜓 [𝐻 [𝜔 | 𝜓 ]] = 𝐻 [𝜓 ] − 𝔼𝜔 [𝐻 [𝜓 | 𝜔]]. (7.10)
Observing either 𝜔 or 𝜓 will, in expectation, provide the same amount
of information about the other: the mutual information 𝐼 (𝜔;𝜓 ).21
21 It is important to note that this is true only in expectation. Consider two random variables
𝑥 and 𝑦 with the following joint distribution. 𝑥 takes value 0 or 1 with probability 1/2 each.
If 𝑥 is 0, 𝑦 takes value 0 or 1 with probability 1/2 each. If 𝑥 is 1, 𝑦 takes value 0 or −1 with
probability 1/2 each. The entropy of 𝑥 is 1 bit and the entropy of 𝑦 is 1.5 bits. Observing 𝑥
always yields 0.5 bits about 𝑦. However, observing 𝑦 produces either no information about
𝑥 (0 bits), with probability 1/2, or complete information about 𝑥 (1 bit), with probability 1/2.
So the information gain about 𝑥 from 𝑦 and about 𝑦 from 𝑥 is actually never equal. However,
the expected information gain is equal, 𝐼 (𝑥; 𝑦) = 0.5 bits.
Maximizing mutual information as an optimization policy
Mutual information arises naturally in Bayesian sequential experimental
design as the one-step expected information gain resulting from an ob-
servation. In the previous chapter, we introduced two different methods
for quantifying this information gain. The first was the reduction in the
differential entropy of 𝜔 from the prior to the posterior:
𝑢 (D) = 𝐻 [𝜔] − 𝐻 [𝜔 | D] (7.11)
= ∫ 𝑝 (𝜔 | D) log 𝑝 (𝜔 | D) d𝜔 − ∫ 𝑝 (𝜔) log 𝑝 (𝜔) d𝜔.
Figure 7.13: The posterior belief about the location of the global optimum, 𝑝 (𝑥 ∗ | D), and about the value of the global
optimum, 𝑝 (𝑓 ∗ | D), for our running example. Note the significant probability mass associated with the
optimum lying on the boundary.
Figure 7.14: The mutual information between the observed value and the location of the global optimum, 𝛼𝑥 ∗ (middle
panel), and between the observed value and the value of the global optimum, 𝛼 𝑓 ∗ (bottom panel), for our
running example.
7.7. multi-armed bandits and optimization
Figure 7.15: The posterior after 10 (top) and 20 (bottom) steps of maximizing the mutual information between the observed
value 𝑦 and the location of the global maximum 𝑥 ∗ (7.16) on our running example. The tick marks show the
points chosen by the policy, progressing from top to bottom. Observations within 0.2 length scales of the
optimum are marked in red; the optimum was located on iteration 16.
Figure 7.16: The posterior after 10 (top) and 25 (bottom) steps of maximizing the mutual information between the observed
value 𝑦 and the value of the global maximum 𝑓 ∗ (7.17) on our running example. The tick marks show the
points chosen by the policy, progressing from top to bottom. Observations within 0.2 length scales of the
optimum are marked in red; the optimum was located on iteration 23.
To facilitate the following discussion, for each arm 𝑥 ∈ X, we define
𝜙 = 𝔼[𝑦 | 𝑥] to be its expected reward and will aggregate these into a
vector f when convenient. We also define
𝑥 ∗ ∈ arg max f; 𝑓 ∗ = max f
to be the index of an arm with maximal expected reward and the value of
that optimal reward, respectively.27 If the reward distributions associated
with each arm were known a priori, the optimal policy would be trivial:
we would always select the arm with the highest expected reward. This
policy generates expected reward 𝑓 ∗ in each iteration, and it is clear
from linearity of expectation that this is optimal. Unfortunately, the
reward distributions are unknown to the agent and must be learned from
observations instead. This complicates policy design considerably.
27 All of the notation throughout this section is chosen to align with that for optimization.
In a multi-armed bandit, an arm 𝑥 is associated with expected reward 𝜙 = 𝔼[𝑦 | 𝑥 ]. In
optimization with zero-mean additive noise, a point 𝑥 is associated with expected observed
value 𝜙 = 𝑓 (𝑥) = 𝔼[𝑦 | 𝑥 ].
The only way we can learn about the reward distributions is to
allocate resources to each arm and observe the outcomes. If the reward
distributions have considerable spread and/or overlap with each other,
a large number of observations may be necessary before the agent can
confidently conclude which arm is optimal. The agent thus faces an
exploration–exploitation dilemma, constantly forced to decide whether
to select an arm believed to have high expected reward (exploitation)
or whether to sample an uncertain arm to better understand its reward
distribution (exploration). Ideally, the agent would have a policy that
efficiently explores the arms, so that in the limit of many decisions the
agent would eventually allocate an overwhelming majority of resources
to the best possible alternative.
Dozens of policies for the multi-armed bandit problem have been pro-
posed and studied from both the Bayesian and frequentist perspectives,
and many strong convergence results are known.28 Numerous variations
on the basic formulation outlined above have also received consideration
in the literature, and the interested reader may refer to one of several
available exhaustive surveys for more information.29, 30, 31
28 We will explore these connections in our discussion on theoretical convergence results
for Bayesian optimization algorithms in chapter 10, p. 213.
29 d. a. berry and b. fristedt (1985). Bandit Problems: Sequential Allocation of Experiments.
Chapman & Hall.
30 s. bubeck and n. cesa-bianchi (2012). Regret Analysis of Stochastic and Nonstochastic
Multi-armed Bandit Problems. Foundations and Trends in Machine Learning 5(1):1–122.
31 t. lattimore and c. szepesvári (2020). Bandit Algorithms. Cambridge University Press.
The optimal policy (§ 5.2, p. 91) may now be derived following our previous
analysis. We make each decision by maximizing the expected reward by
termination, recursively assuming optimal future behavior (5.15–5.17).
Notably, the optimal decision for the last round is the arm maximizing
the posterior mean reward, reflecting pure exploitation:33
𝑥𝜏 ∈ arg max 𝔼[f | D𝜏−1 ].
33 See § 7.10 for an analogous result in optimization.
More exploratory behavior begins with the penultimate decision and
increases with the decision horizon.
Unfortunately, the cost of computing the optimal policy (§ 5.3, p. 99) increases ex-
ponentially with the horizon. We must therefore find some mechanism to
design computationally efficient but empirically effective policies for use
in practice. This is precisely the same situation we face in optimization!
7.8. maximizing a statistical upper bound
We can interpret this function as a statistical upper confidence bound on
𝜙: the value will exceed the bound only with tunable probability 1 − 𝜋.
As a function of 𝑥, we can interpret 𝑞(𝜋; 𝑥, D) as an optimistic esti-
mate of the entire objective function. The principle of optimism in the
face of uncertainty then suggests observing where this upper confidence
bound is maximized, yielding the acquisition function
𝛼 ucb (𝑥; D) = 𝑞(𝜋; 𝑥, D). (7.18)
Figure 7.17 shows upper confidence bounds for our example scenario
corresponding to three values of the confidence parameter 𝜋. Unlike the
acquisition functions considered previously in this chapter, an upper
Figure 7.17: The upper confidence bound acquisition function (7.18) corresponding to our running example for different
values of the confidence parameter 𝜋. The vertical axis for each acquisition function is shifted to the largest
observed function value. Increasing the confidence parameter leads to increasingly exploratory behavior.
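For a Gaussian process marginal belief about 𝜙, the 𝜋-quantile takes the familiar form 𝜇 + Φ⁻¹(𝜋) 𝜎, so the acquisition function can be sketched in a few lines of Python; the grid-based maximization in the comment is an assumption for illustration, not the only option.

    import numpy as np
    from scipy.stats import norm

    def ucb(mu, sigma, pi=0.999):
        # q(pi; x, D): the pi-quantile of the Gaussian marginal belief about phi,
        # exceeded only with tunable probability 1 - pi
        return mu + norm.ppf(pi) * sigma

    # given the posterior mean/standard deviation evaluated on a candidate grid xs:
    # x_next = xs[np.argmax(ucb(mu_grid, sigma_grid, pi=0.999))]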
Figure 7.18: The posterior after 10 (top) and 20 (bottom) steps of the optimization policy induced by the upper confidence
bound acquisition function (7.18) on our running example. The confidence parameter was set to 𝜋 = 0.999. The
tick marks show the points chosen by the policy, progressing from top to bottom, during iterations 1–10 (top)
and 11–20 (bottom). Observations within 0.2 length scales of the optimum are marked in red; the optimum
was located on iteration 15.
From this point of view, we can interpret the sampled objective function
𝛼 ts as an ordinary acquisition function that is maximized as usual.
7.9. thompson sampling
Figure 7.19: An illustration of Thompson sampling for our example optimization scenario. At the top, we show the
objective function posterior 𝑝 (𝑓 | D) and three samples from this belief. Thompson sampling selects the
next observation location by maximizing one of these samples. In the bottom three panels we show three
possible outcomes of this process, corresponding to each of the sampled objective functions illustrated in the
top panel.
Rather than representing an expected utility or a statistical upper
bound, the acquisition function used in each round of Thompson sam-
pling is a hallucinated objective function that is plausible under our
belief. Whereas a Bayesian decision theoretic policy chooses the opti-
mal action in expectation while averaging over the uncertain objective
function, Thompson sampling chooses the optimal action for a randomly
sampled objective function. This interpretation of Thompson sampling is
illustrated for our example scenario in figure 7.19, showing three possible
outcomes for Thompson sampling. In this case, two samples would ex-
ploit the region surrounding the best-seen point, and one would explore
the region around the left-most observation.
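A minimal sketch of one round of Thompson sampling follows, under the assumption that the domain has been discretized to a finite grid so that a joint posterior sample can be drawn directly; the inputs (grid, posterior mean, posterior covariance) are assumed to be available from a Gaussian process implementation.

    import numpy as np

    def thompson_step(xs, mu, cov, rng=None):
        # Draw one realization of the objective from the joint posterior over a
        # finite grid xs (mean mu, covariance cov) and return the maximizing point.
        rng = rng or np.random.default_rng()
        f_sample = rng.multivariate_normal(mu, cov)
        return xs[int(np.argmax(f_sample))]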
Figure 7.20 shows the posterior belief of the objective after 15 rounds
of Thompson sampling for our example scenario. The global maximum
was located remarkably quickly. Of course, as a stochastic policy, it is
not guaranteed that the behavior of Thompson sampling will resemble
this outcome. In fact, this was a remarkably lucky run! The most likely
locations of the optimum were ignored in the first rounds, quickly leading
to the discovery of the optimum on the left. Figure 7.21 shows another, less fortunate run exhibiting slow convergence.
Figure 7.20: The posterior after 15 steps of Thompson sampling (7.19) on our running example. The tick marks show the
points chosen by the policy, progressing from top to bottom. Observations within 0.2 length scales of the
optimum are marked in red; the optimum was located on iteration 8.
7.10. other ideas in policy construction
Figure 7.21: The posterior after 80 steps of Thompson sampling (7.19) on our running example, using a random seed
different from figure 7.20. The global optimum was located on iteration 78.
Although this door is open for any of the decision-theoretic policies con-
sidered in this chapter, two-step expected improvement has received the
most attention. osborne et al. derived two-step expected improvement
and demonstrated good empirical performance on some test functions
compared with the one-step alternative.45 However, it is telling that they
restricted their investigation to a limited number of functions due to the
inherent computational expense. ginsbourger and le riche completed
a contemporaneous exploration of two-step expected improvement and
provided an explicit example showing superior behavior from the less
myopic policy.46 Recently, several authors have revisited (2+)-step looka-
head and developed sophisticated implementation schemes rendering
longer horizons more feasible.47, 48
45 m. a. osborne et al. (2009). Gaussian Processes for Global Optimization. lion 3.
46 d. ginsbourger and r. le riche (2010). Towards Gaussian Process-based Optimization
with Finite Time Horizon. moda 9.
47 j. wu and p. i. frazier (2019). Practical Two-Step Look-Ahead Bayesian Optimization.
neurips 2019.
48 s. jiang et al. (2020b). Efficient Nonmyopic Bayesian Optimization via One-Shot
Multi-Step Trees. neurips 2020.
We provided an in-depth illustration and deconstruction of two-step
expected improvement for our example scenario in figures 5.2–5.3. Note
that the two-step expected improvement is appreciable even for the
(useless!) options of evaluating at the previously observed locations, as
we can still make conscientious use of the following observation.
Figure 7.22 illustrates the progress of 20 evaluations designed by
maximizing two-step expected improvement for our example scenario.
Comparing with the one-step alternative in figure 7.4, the less myopic
policy exhibits somewhat more exploratory behavior and discovered the
optimum more efficiently – after 15 rather than 19 evaluations.
Rollout (§ 5.3, p. 102) has also been considered as an approach to building nonmy-
opic optimization policies. Again the focus of these investigations has
been on expected improvement (or the related knowledge gradient), but
the underlying principles could be extended to other policies.
lam et al. combined expected improvement with several steps of
rollout, again maximizing expected improvement as the base policy.49 The
authors also proposed optionally adjusting the utility function through
49 r. r. lam et al. (2016). Bayesian Optimization with a Finite Budget: An Approximate
Dynamic Programming Approach. neurips 2016.
Figure 7.22: The posterior after 10 (top) and 20 (bottom) steps of the optimization policy induced by maximizing two-step
expected improvement on our running example. The tick marks show the points chosen by the policy,
progressing from top to bottom, during iterations 1–10 (top) and 11–20 (bottom). Observations within 0.2
length scales of the optimum are marked in red; the optimum was located on iteration 15.
Figure 7.23: The modified expected improvement acquisition function (7.21) for our running example for different values
of the target improvement 𝜀. The target is expressed as a fraction of the range of the posterior mean over the
space. Increasing the target improvement leads to increasingly exploratory behavior.
authors also demonstrated that dynamically setting the batch size to the
remaining budget outperformed an arbitrary fixed size, suggesting that
budget adaptation was important for success.
jiang et al. continued this thread with an even more dramatic ap-
proximation dubbed binoculars, potentially initiating an arms race
toward increasingly nonmyopic acronyms.55 The idea is to construct a
single batch observation in each iteration, then select a point from this
batch for evaluation. This represents an extreme computational savings
over glasses, which must construct a batch anew for every proposed
observation location. However, the method retains the same fundamental
motivation: well-designed batch policies (§ 11.3, p. 252) automatically induce diversity
among batch members, encouraging exploration in the resulting sequen-
tial policy (see figure 11.3). The strong connection between the optimal
batch and sequential policies (11.9–11.10) provides further motivation for
this approach. jiang et al. also conducted a study of optimization per-
formance versus the cost of computing the policy: one-step lookahead,
binoculars, glasses, and rollout comprised the Pareto frontier (§ 11.7, p. 269), with
each method increasing computational effort by an order of magnitude.
55 s. jiang et al. (2020a). binoculars for Efficient, Nonmyopic Sequential Experimental
Design. icml 2020.
Figure 7.24: The posterior after 15 steps of the streltsov and vakili optimization policy on our running example. The
tick marks show the points chosen by the policy, progressing from top to bottom. Observations within 0.2
length scales of the optimum are marked in red; the optimum was located on iteration 14.
threshold was set dynamically based on the remaining budget, using the
asymptotic behavior of the maximum of iid Gaussian random variables.57
This can be interpreted as an approximate batch rollout policy where re-
maining decisions are simulated by fictitious uncorrelated observations;
for some models this serves as an efficient simulation of random rollout.
57 j. mockus (1989). Bayesian Approach to Global Optimization: Theory and Applications.
Kluwer Academic Publishers. [§ 2.5]
The modified expected improvement (7.21) was also the basis for an
unusual policy proposed by streltsov and vakili.58 Let 𝑐 : X → ℝ >0
quantify the cost of making an observation at any proposed location; in
the simplest case we could take the cost to be constant. To evaluate the
promise of making an observation at 𝑥, we solve the equation
𝛼′ei (𝑥; D, 𝛼 sv ) = 𝑐 (𝑥)
for 𝛼 sv , which will serve as the acquisition function value at 𝑥.59 That is,
we solve for the improvement threshold that would render an observa-
tion at 𝑥 cost-prohibitive in expectation, and design each observation
to coincide with the last point to be ruled out when considering in-
creasingly demanding thresholds. The resulting policy shows interesting
behavior, at least on our running example; see figure 7.24. After effec-
tive initial exploration, the global optimum was located on iteration
14. The behavior is similar to the upper confidence bound approach in
figure 7.18, and indeed streltsov and vakili showed that the proposed
method can be understood as a variation on this method with a location-
dependent upper confidence quantile depending on observation cost and
uncertainty.
58 s. streltsov and p. vakili (1999). A Nonmyopic Utility Function for Statistical Global
Optimization Algorithms. Journal of Global Optimization 14(3):283–298.
59 As 𝛼′ei is monotonically decreasing with respect to 𝛼 sv and approaches zero as 𝛼 sv → ∞,
the unique solution can be found efficiently via bisection.
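The root-finding step in footnote 59 admits a very small sketch. Here alpha_ei_mod is a hypothetical callable giving the modified expected improvement (7.21) at a fixed 𝑥 as a function of the improvement threshold; everything else follows from its assumed monotonicity.

    def streltsov_vakili_score(alpha_ei_mod, cost, lo=0.0, hi=1.0, tol=1e-8):
        # Solve alpha_ei_mod(threshold) = cost for the threshold by bisection.
        # alpha_ei_mod is assumed monotonically decreasing and to approach zero,
        # so a unique root exists whenever alpha_ei_mod(0) >= cost.
        while alpha_ei_mod(hi) > cost:          # grow the bracket until it straddles the root
            hi *= 2.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if alpha_ei_mod(mid) > cost:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)                  # serves as the acquisition value alpha_sv at x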
An even simpler mechanism for injecting exploration into an ex-
isting policy is to occasionally make decisions randomly or via some
other policy encouraging pure exploration. de ath et al. for example
considered a family of 𝜀-greedy policies where a one-step lookahead
policy is interrupted, with probability 𝜀 in each iteration, by evaluating at
a location chosen uniformly at random from the domain.60 These policies
delivered impressive empirical performance, even when the one-step
lookahead policy was as simple as maximizing the posterior mean (see
below).
𝛼 (𝑥; D) = 𝜇D (𝑥).
60 g. de ath et al. (2021). Greed is Good: Exploration and Exploitation Trade-offs in
Bayesian Optimisation. acm Transactions on Evolutionary Learning and Optimization
1(1):1–22.
Figure 7.25: The posterior after 15 steps of two-step lookahead for cumulative reward (7.22) on our running example. The
tick marks show the points chosen by the policy, progressing from top to bottom. The policy becomes stuck
on iteration 7.
From cursory inspection of our example scenario in figure 7.1, we can see
that maximizing this acquisition function can cause the policy to become
“stuck” with no chance of recovery. In our example, the posterior mean
is maximized at the previously best-seen point, so the policy will select
this point forevermore.
It is also interesting to consider two-step lookahead, where the ac-
quisition function becomes (5.12):
𝛼 2 (𝑥; D) = 𝜇D (𝑥) + 𝔼[max𝑥′ ∈X 𝜇D′ (𝑥′) | 𝑥, D].
8
COMP U TING POLICIES WI TH GAUSSIAN PROCESSES
𝑝 (𝑓 | D) = GP (𝑓 ; 𝜇D , 𝐾D ). (8.1)
We will also require the predictive distribution for the observed value
𝑦 resulting from a measurement at 𝑥. In addition to the straightforward
case of exact measurements, where 𝑦 = 𝜙 and the predictive distribution
is given above (8.2), we will also consider corruption by independent,
zero-mean additive Gaussian noise:
𝑝 (𝑦 | 𝑥, 𝜙) = N (𝑦; 𝜙, 𝜎𝑛²).
Again we allow the noise scale 𝜎𝑛 to depend on 𝑥 if desired. We will
notate the resulting predictive distribution for 𝑦 with
𝑝 (𝑦 | 𝑥, D) = N (𝑦; 𝜇, 𝑠²).
For a differentiable acquisition function 𝛼, the chain rule gives
𝜕𝛼/𝜕𝑥 = (𝜕𝛼/𝜕𝜇)(𝜕𝜇/𝜕𝑥) + (𝜕𝛼/𝜕𝑠)(𝜕𝑠/𝜕𝑥). (8.6)
The gradients of the predictive mean and standard deviation for a Gaus-
sian process are easily computable assuming the prior mean and covari-
ance functions and the observation noise scale are differentiable, and
general expressions are provided in an appendix (§ c.2, p. 308).
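With independent additive noise, the predictive parameters for 𝑦 follow from those for 𝜙 by inflating the variance; a tiny helper (an illustrative sketch, not the book's code) makes this explicit.

    import numpy as np

    def predictive_y(mu, sigma, sigma_n):
        # With independent zero-mean Gaussian noise, the predictive distribution of
        # an observation is y | x, D ~ N(mu, s^2) with s^2 = sigma^2 + sigma_n^2.
        return mu, np.sqrt(sigma**2 + sigma_n**2)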
8.2. expected improvement
Here 𝜙 ∗ is the previously best seen, incumbent objective function value,
and max(𝜙 − 𝜙 ∗, 0) measures the improvement offered by observing a
value of 𝜙. Figure 8.1 illustrates this integral.
To proceed, we resolve the max operator to yield two integrals:
𝛼 ei (𝑥; D) = ∫𝜙∗∞ 𝜙 N (𝜙; 𝜇, 𝜎 2 ) d𝜙 − 𝜙 ∗ ∫𝜙∗∞ N (𝜙; 𝜇, 𝜎 2 ) d𝜙,
both of which can be computed easily assuming 𝜎 > 0.2 The first term is
proportional to the expected value of a normal distribution truncated at
𝜙 ∗, and the second term is the complementary normal cdf scaled by 𝜙 ∗.
The resulting acquisition function can be written conveniently in terms
of the standard normal pdf and cdf:
𝛼 ei (𝑥; D) = (𝜇 − 𝜙 ∗ ) Φ((𝜇 − 𝜙 ∗ )/𝜎) + 𝜎 𝜙((𝜇 − 𝜙 ∗ )/𝜎). (8.9)
2 In the degenerate case 𝜎 = 0, we simply have 𝛼 ei (𝑥; D) = max(𝜇 − 𝜙 ∗, 0).
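A direct NumPy/SciPy transcription of (8.9), including the degenerate 𝜎 = 0 case from footnote 2, is a few lines; the vectorized handling of arrays is an implementation choice rather than anything prescribed by the text.

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, phi_star):
        # Noiseless expected improvement (8.9); mu and sigma may be arrays of
        # predictive means/standard deviations, phi_star is the incumbent value.
        mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
        z = np.where(sigma > 0, (mu - phi_star) / np.where(sigma > 0, sigma, 1.0), 0.0)
        ei = (mu - phi_star) * norm.cdf(z) + sigma * norm.pdf(z)
        # degenerate case sigma = 0: max(mu - phi_star, 0)
        return np.where(sigma > 0, ei, np.maximum(mu - phi_star, 0.0))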
Examining this expression, it is tempting to interpret its two terms
as respectively encouraging exploitation (favoring points with high
expected value 𝜇) and exploration (favoring points with high uncertainty
𝜎). Indeed, taking partial derivatives with respect to 𝜇 and 𝜎, we have:
𝜕𝛼 ei /𝜕𝜇 = Φ((𝜇 − 𝜙 ∗ )/𝜎) > 0; 𝜕𝛼 ei /𝜕𝜎 = 𝜙((𝜇 − 𝜙 ∗ )/𝜎) > 0.
Expected improvement is thus monotonically increasing in both 𝜇 and
𝜎. Increasing a point’s expected value naturally makes the point more
favorable for exploitation, and increasing its uncertainty makes it more
favorable for exploration. Either action would increase the expected
improvement. The tradeoff between these two concerns is considered
automatically and is reflected in the magnitude of the derivatives above.
Maximization of expected improvement in Euclidean domains may
be guided by its gradient with respect to the proposed evaluation location
𝑥. Using the results above and applying the chain rule, we have (8.6):
𝜕𝛼 ei /𝜕𝑥 = Φ((𝜇 − 𝜙 ∗ )/𝜎) 𝜕𝜇/𝜕𝑥 + 𝜙((𝜇 − 𝜙 ∗ )/𝜎) 𝜕𝜎/𝜕𝑥,
Unfortunately it is not immediately clear how to extend our opti-
mization algorithm to the case of noisy functions. With noisy data,
we really want to find the point where the signal is optimized. Sim-
ilarly, our expected improvement criterion should be defined in
terms of the signal component. [emphasis added by jones et al.]
3 d. r. jones et al. (1998). Efficient Global Optimization of Expensive Black-Box Functions.
Journal of Global Optimization 13(4):455–492.
If we define 𝜇 ∗ = max 𝜇D (x) to represent the simple reward of the
current data, then we must compute:
𝛼 ei (𝑥; D) = ∫ [max 𝜇D′ (x′) − 𝜇 ∗ ] N (𝑦; 𝜇, 𝑠 2 ) d𝑦.
We first reduce this computation to an expectation of the general form
𝑔(a, b) = ∫ max(a + b𝑧) 𝜙 (𝑧) d𝑧, (8.10)
where a, b ∈ ℝ𝑛 are arbitrary vectors and 𝑧 is a standard normal random
variable.5 Note that given the observation 𝑦, the updated posterior mean
at x′ is a vector we may compute in closed form (2.19):
𝜇D′ (x′) = 𝜇D (x′) + [𝐾D (x′, 𝑥)/𝑠] [(𝑦 − 𝜇)/𝑠].
5 The dimensions of a and b can be arbitrary as long as they are equal.
This update is linear in 𝑦. Applying the transformation 𝑦 = 𝜇 + 𝑠𝑧 yields
𝜇D′ (x′) = a + b𝑧,
where
a = 𝜇D (x′); b = 𝐾D (x′, 𝑥)/𝑠, (8.12)
and we may express expected improvement in the desired form:
𝛼 ei (𝑥; D) = 𝑔(a, b) − 𝜇 ∗.
Figure 8.2: The geometric intuition of the 𝑔(a, b) function. Left: if we make a measurement at 𝑥, the 𝑧-score of the observed
value completely determines the updated posterior mean at that point and all previously observed points.
Right: as a function of 𝑧, the updated posterior mean at each of these points is linear; here the color of each
line corresponds to the matching point on the left. The slope and intercept of each line can be determined
from the posterior (8.12). Not all lines are visible.
maximal value on some interval and the lines appear in strictly increasing
order of slope:
𝑏 1 < 𝑏 2 < · · · < 𝑏𝑛 .
frazier et al. give a simple and efficient algorithm to process a set of
𝑛 lines to eliminate any always-dominated lines, reorder the remainder in
increasing slope, and identify their intervals of dominance in O(𝑛 log 𝑛)
time.7 The output of this procedure is a permutation matrix P, possibly
with some rows deleted, such that
(𝜶 , 𝜷) = (Pa, Pb)
and the new inputs (𝜶 , 𝜷) satisfy the desired properties. We will assume
below that the inputs have been preprocessed in such a manner.
Given a set of lines in the desired form, we may partition the real
line into a collection of 𝑛 intervals
−∞ = 𝑐 1 < 𝑐 2 < · · · < 𝑐𝑛+1 = ∞
such that the 𝑖th line 𝑎𝑖 + 𝑏𝑖 𝑧 dominates on the corresponding interval
(𝑐𝑖 , 𝑐𝑖+1 ).8 This allows us to decompose the desired expectation (8.10)
into a sum of contributions on each interval:
𝑔(a, b) = ∑𝑖 ∫𝑐𝑖𝑐𝑖+1 (𝑎𝑖 + 𝑏𝑖 𝑧) 𝜙 (𝑧) d𝑧.
7 Briefly, we sort the lines in ascending order of slope, then add each line in turn to a set
of dominating lines, checking whether any previously added lines need to be removed
and updating the intervals of dominance.
8 Reordering the lines in order of increasing slope guarantees this correspondence: the line
with minimal slope is always the “leftmost” in the upper envelope, etc.
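The following Python sketch implements the same idea (a sweep in order of increasing slope, as in footnote 7, followed by truncated-Gaussian sums using ∫ᶜᵈ 𝑧𝜙(𝑧) d𝑧 = 𝜙(𝑐) − 𝜙(𝑑)); it is a straightforward illustration, not frazier et al.'s exact implementation.

    import numpy as np
    from scipy.stats import norm

    def g(a, b):
        # E_z[max_i (a_i + b_i z)] for z ~ N(0, 1): build the upper envelope of the
        # lines a_i + b_i z, then sum the truncated-Gaussian contribution of each piece.
        order = np.lexsort((a, b))                       # ascending slope, ties by intercept
        a, b = np.asarray(a, float)[order], np.asarray(b, float)[order]
        keep, breaks = [], []                            # dominating lines and breakpoints
        for i in range(len(a)):
            if keep and b[i] == b[keep[-1]]:             # parallel lines: larger intercept wins
                keep.pop()
                if breaks:
                    breaks.pop()
            while keep:
                j = keep[-1]
                z = (a[j] - a[i]) / (b[i] - b[j])        # where line i overtakes line j
                if breaks and z <= breaks[-1]:           # line j never dominates; discard it
                    keep.pop()
                    breaks.pop()
                else:
                    keep.append(i)
                    breaks.append(z)
                    break
            else:
                keep.append(i)
        c = np.concatenate(([-np.inf], breaks, [np.inf]))
        a, b = a[keep], b[keep]
        return float(np.sum(a * (norm.cdf(c[1:]) - norm.cdf(c[:-1]))
                            + b * (norm.pdf(c[:-1]) - norm.pdf(c[1:]))))

    # sanity check: max(z, -z) = |z| has expectation sqrt(2 / pi) ~ 0.7979
    # print(g([0.0, 0.0], [1.0, -1.0]))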
Figure 8.3: After a measurement at 𝑥, the updated simple reward can be achieved at one of four points (left), whose
corresponding lines comprise the upper envelope max(a + b𝑧) (right). The colors of the line segments on the
right correspond to the possible updated maximum locations on the left. The lightest point on the left serves
as a “backup option” if the observed value is low.
Figure 8.4: Expected improvement using different plug-in estimators (8.17–8.18) compared with the noisy expected
improvement as the expected marginal gain in simple reward (8.7).
incumbent value 𝜙 ∗. Several possibilities for this estimate have been put
forward. One option is to plug in the maximum noisy observation:
𝜙 ∗ ≈ max y. (8.17)
However, this may not always behave as expected for the same reason the maximum observed value does not serve as a sensible utility function (6.6; § 6.1, p. 113). With very noisy data, the maximum observed value is most likely spurious rather than a meaningful goalpost. Further, as we are likely to overestimate our progress due to bias in this estimate, the resulting behavior may become excessively exploratory. The approximation will eventually devolve to expected improvement against an inflated threshold (7.21; § 7.10, p. 154), which may overly encourage exploration; see figure 7.23. An especially spurious observation can bias the estimate in (8.17) (and our behavior) for a considerable time.
Opinions on the proposed approximation using this simple plug-in estimate (8.17) vary dramatically. picheny et al. discarded the idea out of hand as “naïve” and lacking robustness.10 The authors also found it empirically inferior in their investigation and described the same over-exploratory effect and explanation given above. On the other hand, nguyen et al. described the estimator as “standard” and concluded it was empirically preferable!11

10 v. picheny et al. (2013b). A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization 48(3):607–626.
11 v. nguyen et al. (2017). Regret for Expected Improvement over the Best-Observed Value and Stopping Condition. acml 2017.
An alternative estimator is the simple reward of the data (6.3):12, 13
$$\phi^* \approx \max \mu_{\mathcal{D}}(\mathbf{x}), \qquad (8.18)$$
which is less biased and may be preferable. A simple extension is to maximize other predictive quantiles,10, 14 and huang et al. recommend using a relatively low quantile, specifically Φ(−1) ≈ 0.16, in the interest of risk aversion.

12 e. vazquez et al. (2008). Global optimization based on noisy evaluations: an empirical study of two statistical approaches. icipe 2008.
13 z. wang and n. de freitas (2014). Theoretical Analysis of Bayesian Optimization with Unknown Gaussian Process Hyper-Parameters. arXiv: 1406.7758 [stat.ML].
14 d. huang et al. (2006b). Global Optimization of Stochastic Black-Box Systems via Sequential Kriging Meta-Models. Journal of Global Optimization 34(3):441–466.

In figure 8.4 we compare noisy expected improvement with two plug-in approximations. The plug-in estimators agree that sampling
on the right-hand side of the domain is the most promising course of
action, but our formulation of noisy expected improvement prefers a less
explored region. This decision is motivated by the interesting behavior
of the posterior, which shows considerable disagreement regarding the
updated posterior mean; see figure 8.6. This nuance is only revealed
as our formulation reasons about the joint predictive distribution of y′,
whereas the plug-in estimators only inspect the marginals.
Another proposed approximation scheme for noisy expected improvement is reinterpolation. We fit a noiseless Gaussian process to imputed values of the objective function at the observed locations 𝝓 = 𝑓(x), then compute the exact expected improvement for this surrogate. A natural choice considered by forrester et al. is to impute using the posterior mean:15
$$\boldsymbol{\phi} \approx \mu_{\mathcal{D}}(\mathbf{x}),$$
resulting in the approximation (computed with respect to the surrogate):

15 a. i. j. forrester et al. (2006). Design and Analysis of “Noisy” Computer Experiments. aiaa Journal 44(10):2331–2339.
Figure 8.5: forrester et al.’s approximation to noisy expected improvement (8.19). Given a Gaussian process fit to noisy
data, we compute the exact expected improvement (8.9) for a noiseless Gaussian process fit to the posterior
mean.
Figure 8.6: letham et al.’s approximation to noisy expected improvement (8.20). We take the expectation of the exact
expected improvement (8.9) for a noiseless Gaussian process fit to exact observations at the observed locations.
The middle panels show realizations of the reinterpolated process and the resulting expected improvement.
values cannot stray too far from the true underlying objective function
value. Thus we have 𝑦 ≈ 𝜙; 𝑠 ≈ 𝜎, and any inaccuracy in (8.17), (8.18), or
(8.20) will be minor. However, this heuristic argument breaks down in
high-noise regimes.
Figure 8.7: A comparison of noiseless expected improvement with a plug-in estimate for 𝜙 ∗ (8.17) and the expected
one-step marginal gain in simple reward (7.2). Here the noise variance increases linearly from zero on the
left-hand side of the domain to a signal-to-noise ratio of 1 on the right-hand side. The larger enveloping area
shows 95% credible intervals for the noisy observation 𝑦, whereas the smaller area provides the same credible
intervals for the objective function.
is also increasing with 𝜎 when the target value is greater than the predic-
tive mean, or equivalently when the probability of improvement is less
than 1/2. Therefore probability of improvement also tends to encourage
exploration in this typical case. However, when the predictive mean is
greater than the improvement threshold, probability of improvement is,
perhaps surprisingly, decreasing with 𝜎, discouraging exploration in this
favorable regime. Probability of improvement favors a relatively safe option returning a certain but modest improvement over a riskier alternative offering a potentially larger but less certain improvement, as demonstrated in figure 7.8 (risk aversion of probability of improvement: § 7.5, p. 132).
With the partial derivatives computed above, we may readily compute the gradient of (8.22) with respect to the proposed evaluation location 𝑥 in the noiseless case (8.6):
$$\frac{\partial \alpha_{\mathrm{PI}}'}{\partial x} = \frac{1}{\sigma}\left(\frac{\partial \mu}{\partial x} - \alpha_{\mathrm{PI}}'\,\frac{\partial \sigma}{\partial x}\right).$$
Due to the convexity of the updated simple reward, improvement occurs on at most two intervals:20
$$(-\infty, \ell) \cup (u, \infty).$$

20 In the case that improvement is impossible or occurs at a single point, the probability of improvement is zero.
Note that one of these end points may not exist, for example if every
slope in the b vector were positive; see figure 8.3 for an example. In
this case we may take ℓ = −∞ or 𝑢 = ∞ as appropriate. Now given the
endpoints (ℓ, 𝑢), the probability of improvement may be computed in
terms of the standard normal cdf:
𝛼 pi (𝑥; D) = Φ(ℓ) + Φ(−𝑢). (8.23)
We may compute the gradient of this noisy formulation of probability of improvement; details are given in the appendix (§ c.3, p. 309).
$$\tau(\beta) = \max_{x' \in \mathcal{X}} \bigl(\mu + \beta\sigma\bigr)$$
Like all numerical integration methods, Gauss–Hermite quadrature entails measuring the integrand ℎ at a set of 𝑛 points, called nodes, {𝑧𝑖}, then approximating the integral by a weighted sum of the measured values:
$$I \approx \sum_{i=1}^{n} w_i\, h(z_i).$$
We may also extend this scheme to approximate the gradient of the acquisition function using the same nodes and weights; details are provided in the accompanying appendix (§ c.3, p. 309).
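A minimal sketch of a one-step lookahead acquisition value computed by Gauss–Hermite quadrature is shown below; here Delta is a stand-in for the marginal gain in utility as a function of the putative observed value 𝑦, and the names are illustrative rather than drawn from any particular library.

```python
import numpy as np

def lookahead_acquisition(mu, s, Delta, n=16):
    # physicists' Gauss-Hermite rule: int h(t) exp(-t^2) dt ~= sum_i w_i h(t_i)
    t, w = np.polynomial.hermite.hermgauss(n)
    # the change of variables y = mu + sqrt(2) s t turns the rule into an
    # expectation with respect to y ~ N(mu, s^2)
    y = mu + np.sqrt(2.0) * s * t
    return np.sum(w * np.array([Delta(yi) for yi in y])) / np.sqrt(np.pi)
```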
$$\mu^* = \max_{x \in \mathcal{X}} \mu_{\mathcal{D}}(x)$$
𝑝 (f | D) = N (f; 𝝁, 𝚺).
We may compute this expectation in closed form following our analysis of noisy expected improvement (§ 8.2, p. 160), which was merely a slight adaptation of frazier et al.’s approach.
The updated posterior mean vector 𝝁′ after incorporating an observation (𝑥, 𝑦) is linear in the observed value 𝑦:
$$\boldsymbol{\mu}' = \boldsymbol{\mu} + \frac{\boldsymbol{\Sigma}_x}{s}\, \frac{y - \mu_x}{s},$$
where 𝜇𝑥 is the entry of 𝝁 corresponding to the index 𝑥, and 𝚺𝑥 is similarly the corresponding column of 𝚺. If we define
$$\mathbf{a} = \boldsymbol{\mu}; \qquad \mathbf{b} = \frac{\boldsymbol{\Sigma}_x}{s},$$
then we may rewrite the knowledge gradient in terms of the 𝑔 function introduced in the context of noisy expected improvement (8.10):
$$\alpha_{\mathrm{KG}}(x; \mathcal{D}) = g(\mathbf{a}, \mathbf{b}) - \mu^*.$$
We may evaluate this expression exactly in O(𝑛 log 𝑛) time following our previous discussion (computation of 𝑔: § 8.2, p. 161). Thus we may compute the knowledge gradient policy with a discrete domain in O(𝑛² log 𝑛) time per iteration by exhaustive computation of the acquisition function.
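A minimal sketch of this exhaustive computation over a finite domain follows, reusing the g(a, b) routine sketched above for noisy expected improvement; mu, Sigma, and noise (posterior mean vector, posterior covariance matrix, and observation noise variance) are assumed inputs.

```python
import numpy as np

def knowledge_gradient(mu, Sigma, noise):
    mu_star = np.max(mu)
    alpha = np.empty_like(mu, dtype=float)
    for i in range(len(mu)):
        s = np.sqrt(Sigma[i, i] + noise)    # predictive standard deviation of y at point i
        a, b = mu, Sigma[:, i] / s          # updated posterior mean is a + b z
        alpha[i] = g(a, b) - mu_star        # g as in the earlier sketch
    return alpha
```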
Figure 8.10: A comparison of the knowledge gradient acquisition function and the kgcp approximation for an example
scenario. In this case the kgcp approximation reverts to expected improvement.
$$\alpha_{\mathrm{KG}}(x; \mathcal{D}) \approx \frac{1}{\sqrt{\pi}} \sum_{i=1}^{n} w_i\, \Delta(x, y_i); \qquad \Delta(x, y_i) = \max_{x' \in \mathcal{X}} \mu_{\mathcal{D}'}(x') - \mu^*.$$
With some care, we can also approximate the gradient of the knowledge gradient in this scheme; details are given in the appendix (§ c.3, p. 310).
This expression is almost identical to expected improvement (8.7)!
In fact, if we define
𝜇 ∗ = max 𝜇D (x),
then a simple manipulation gives:
𝑥 ∼ 𝑝 (𝑥 ∗ | D).
Figure 8.11: The distribution of the location of the global maximum, 𝑝 (𝑥 ∗ | D), for an example scenario, and 100 samples
drawn from this distribution.
Exhaustive sampling
In “small” domains, we can realize Thompson sampling via brute force. If the domain can be exhaustively covered by a sufficiently small set of points 𝝃 (for example, with a dense grid or a low-discrepancy sequence), then we can simply sample the associated objective function values 𝝓 = 𝑓(𝝃) and maximize:33
$$x = \arg\max \boldsymbol{\phi}; \qquad \boldsymbol{\phi} \sim p(\boldsymbol{\phi} \mid \boldsymbol{\xi}, \mathcal{D}).$$
Note that maximizing a draw from the Gaussian process in this manner naturally samples from 𝑝(𝑥∗, 𝑓∗ | D). The distribution of 𝝓 is multivariate normal, making sampling easy:
$$p(\boldsymbol{\phi} \mid \boldsymbol{\xi}, \mathcal{D}) = \mathcal{N}(\boldsymbol{\phi}; \boldsymbol{\mu}, \boldsymbol{\Sigma}); \qquad \boldsymbol{\mu} = \mu_{\mathcal{D}}(\boldsymbol{\xi}); \quad \boldsymbol{\Sigma} = K_{\mathcal{D}}(\boldsymbol{\xi}, \boldsymbol{\xi}).$$

33 We are taking a slight liberty with notation here as we have previously used 𝝓 for 𝑓(x), the latent objective function values at the observed locations. However, for the remainder of the chapter we will be assuming the general, potentially noisy case, where our data will be written D = (x, y) and we will have no need to refer to 𝑓(x).
The running time of this procedure grows quickly with the size of 𝝃, although sophisticated numerical methods enable scaling to roughly 50 000 points.34

34 g. pleiss et al. (2020). Fast Matrix Square Roots with Applications to Gaussian Processes and Bayesian Optimization. neurips 2020.
Figure 8.11 shows 100 Thompson samples for our example scenario
generated via exhaustive sampling, taking 𝝃 to be a grid of 1000 points.
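A minimal sketch of exhaustive Thompson sampling over such a grid is shown below; the posterior_mean and posterior_cov callables are assumed stand-ins for whatever Gaussian process implementation is available.

```python
import numpy as np

def thompson_sample_exhaustive(xi, posterior_mean, posterior_cov, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    mu = posterior_mean(xi)                                     # posterior mean over the grid
    Sigma = posterior_cov(xi, xi)                               # posterior covariance over the grid
    L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(len(xi)))     # jittered factor for sampling
    phi = mu + L @ rng.standard_normal(len(xi))                 # phi ~ N(mu, Sigma)
    return xi[np.argmax(phi)]
```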
On-demand sampling
An alternative to exhaustive sampling is to use off-the-shelf optimization routines to maximize a draw from the objective function posterior we

Algorithm 8.1: On-demand sampling.
  Dts ← D                              ▷ initialize fictitious dataset with current data
  repeat
    given request for observation at 𝑥:
      𝜙 ← 𝑝(𝜙 | 𝑥, Dts)                ▷ sample value at 𝑥
      Dts ← Dts ∪ (𝑥, 𝜙)               ▷ update fictitious dataset
      yield 𝜙
  until external optimizer terminates
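A minimal sketch of the same idea in code: objective values are drawn lazily and the fictitious dataset grows as an external optimizer requests evaluations, keeping the sampled “draw” self-consistent. The predict and condition methods are hypothetical stand-ins for a Gaussian process posterior interface.

```python
import numpy as np

class OnDemandSample:
    def __init__(self, gp_posterior, rng=None):
        self.post = gp_posterior                     # current belief, including fictitious data
        self.rng = np.random.default_rng() if rng is None else rng

    def __call__(self, x):
        mu, var = self.post.predict(x)               # marginal belief at the requested location
        phi = mu + np.sqrt(var) * self.rng.standard_normal()
        self.post = self.post.condition(x, phi)      # treat the sampled value as a noiseless observation
        return phi

# usage: pass an OnDemandSample instance to any off-the-shelf maximizer as the
# objective; its argmax is (approximately) a Thompson sample of x*.
```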
𝑝 (𝜙, ∇𝜙 | 𝑥, Dts )
However, in high dimensions, the additional cost required to condition on these gradient observations may become excessive. In 𝑑 dimensions, returning gradients effectively reduces the number of function evaluations we can allow for the optimizer by a factor of (𝑑 + 1) if we wish to maintain the same total computational effort.

35 m. lázaro-gredilla et al. (2010). Sparse Spectrum Gaussian Process Regression. Journal of Machine Learning Research 11(Jun):1865–1881.
36 j. m. hernández-lobato et al. (2014). Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. neurips 2014.
For the additive Gaussian noise model, if N is the covariance matrix of the noise contributions to the observed values y, the posterior moments of the weight vector in this approximation are:
Let us consider the evaluation of (8.36) for a Gaussian process model with additive Gaussian noise. To begin, the first term is simply the differential entropy of a one-dimensional Gaussian distribution (8.5) and may be computed in closed form (a.17):
$$H[y \mid x, \mathcal{D}] = \tfrac{1}{2} \log(2\pi e s^2).$$
$$\mathbb{E}\bigl[\, H[y \mid x, x^*\!, \mathcal{D}] \,\big|\, x, \mathcal{D} \,\bigr] \approx \frac{1}{n} \sum_{i=1}^{n} H[y \mid x, x_i^*, \mathcal{D}].$$
When the prior covariance function is stationary, hernández-lobato et al. further propose to exploit the efficient approximate Thompson sampling scheme via sparse spectrum approximation described in the last section (§ 8.7, p. 178). When feasible, this reduces the cost of drawing the samples, but it is not necessary for correctness.
Figure 8.13: The example scenario we will consider for illustrating the predictive entropy search approximation to
𝑝 (𝑓 | 𝑥, 𝑥 ,∗ D), using the marked location for 𝑥 ∗.
$$H[y \mid x, x^*\!, \mathcal{D}] \approx \tfrac{1}{2} \log(2\pi e s_*^2). \qquad (8.40)$$
$$\alpha_{x^*}(x; \mathcal{D}) \approx \alpha_{\mathrm{PES}}(x; \mathcal{D}) = \log s - \frac{1}{n} \sum_{i=1}^{n} \log s_{*i}. \qquad (8.41)$$
∇∗ = 0; (8.42)
H∗ ≺ 0. (8.43)
Figure 8.14: Top: the posterior for our example after conditioning on the derivative being zero at 𝑥 ∗ (8.42). Bottom: the
approximate posterior after conditioning on the second derivative being negative at 𝑥 ∗ (8.44).
which combined with the diagonal constraint guarantees negative definiteness.43 The combination of these conditions (8.44–8.45) is stricter than mere negative definiteness, as we eliminate all degrees of freedom for the off-diagonal entries. However, an advantage of this approach is that we can enforce the off-diagonal constraint via exact conditioning.

43 hernández-lobato et al. point out this may not be faithful to the model and suggest the alternative of matching the off-diagonal entries of the Hessian of the objective function sample that generated 𝑥∗. However, this does not guarantee negative definiteness without tweaking (8.44).

To proceed we dispense with the constraints we can condition on
exactly. Let D′ represent our dataset augmented with the gradient (8.42)
and off-diagonal Hessian (8.45) observations. The joint distribution of
the latent objective value 𝜙, the purportedly optimal value 𝜙 ∗ = 𝑓 (𝑥 ∗ ),
and the diagonal of the Hessian h∗ given this additional information is
multivariate normal:
Figure 8.15: Top: the approximate posterior after conditioning on 𝜙 ∗ exceeding the function values at previously measured
locations (8.49). Bottom: the approximate posterior after conditioning on 𝜙 ∗ dominating elsewhere (8.50).
Our approximate posterior after incorporating (8.47) and performing expectation propagation is shown in the bottom panel of figure 8.14.44

44 For this and the following demonstrations, we show the expectation propagation approximation to the entire objective function posterior; this is not required to approximate the marginal predictive distribution and is only for illustration.
Ensuring 𝑥 ∗ is a global optimum
Our belief now reflects our desire that 𝑥 ∗ be a local maximum; however,
we wish for 𝑥 ∗ to be the global maximum. Global optimality is not
easy to enforce, as it entails infinitely many constraints bounding the
objective at every point in the domain. hernández-lobato et al. instead
approximate this condition with optimality at the most relevant locations:
the already-observed points x and the proposed point 𝑥.
To enforce that 𝜙∗ exceed the objective function values at the observed points, we could theoretically add 𝝓 = 𝑓(x) to our prior (8.46), then add one factor for each observation: ∏𝑗 [𝜙𝑗 < 𝜙∗]. However, this approach requires an increasing number of factors as we gather more data, rendering expectation propagation (and thus the acquisition function) increasingly expensive. Further, factors corresponding to obviously suboptimal observations are uninformative and simply represent extra work for no benefit.
Instead, we enforce this constraint through a single factor truncating with respect to the maximal value of 𝝓: [max 𝝓 < 𝜙∗] (truncating with respect to an unknown threshold with ep: § b.2, p. 306). In general, this threshold will be a random variable unless our observations are noiseless. Fortunately, expectation propagation enables tractable approximate truncation at an unknown, Gaussian-distributed threshold. Define
$$\mu_{\max} = \mathbb{E}[\max \boldsymbol{\phi} \mid \mathcal{D}]; \qquad \sigma_{\max}^2 = \operatorname{var}[\max \boldsymbol{\phi} \mid \mathcal{D}]; \qquad (8.48)$$
these moments can be approximated via either sampling or an assumed density filtering approach described by clark.45 Taking a moment-matched Gaussian approximation to max 𝝓 and integrating yields the

45 c. e. clark (1961). The Greatest of a Finite Set of Random Variables. Operations Research 9(2):145–162.
Figure 8.16: The predictive entropy search approximation (8.41) to the mutual information with 𝑥 ∗ acquisition function
(8.36) using the 100 Thompson samples from figure 8.11.
[𝜙 < 𝜙 ∗ ]. (8.50)
Efficient implementation
A practical realization of predictive entropy search can benefit from
careful precomputation and reuse of partial results. We will outline an
efficient implementation strategy, beginning with three steps of one-time
initial work.
1. Estimate the moments of max 𝝓 (8.48).
2. Generate a set of Thompson samples {𝑥𝑖∗ }.
3. For each sample 𝑥∗, derive the joint distribution of the function value, gradient, and Hessian at 𝑥∗. Let z∗ = (𝜙∗, h∗, upper H∗, ∇∗) represent a vector comprising these random variables, with those that will be subjected to expectation propagation first. We compute:
$$p(\mathbf{z}^* \mid x^*\!, \mathcal{D}) = \mathcal{N}(\mathbf{z}^*; \boldsymbol{\mu}^*\!, \boldsymbol{\Sigma}^*).$$
Find the marginal belief over (𝜙∗, h∗) and use Gaussian expectation propagation to approximate the posterior after incorporating the factors (8.47, 8.49). Let vectors $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\sigma}}^2$ denote the site parameters at termination. Finally, precompute47
$$\mathbf{V}_*^{-1} = \left( \boldsymbol{\Sigma}^* + \begin{bmatrix} \tilde{\boldsymbol{\Sigma}} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \right)^{-1}; \qquad \boldsymbol{\alpha}^* = \mathbf{V}_*^{-1} \left( \begin{bmatrix} \tilde{\boldsymbol{\mu}} \\ \mathbf{0} \end{bmatrix} - \boldsymbol{\mu}^* \right),$$
where $\tilde{\boldsymbol{\Sigma}} = \operatorname{diag} \tilde{\boldsymbol{\sigma}}^2$. These quantities do not depend on 𝑥 and will be repeatedly reused during prediction.

47 In the interest of numerical stability, the inverse of V∗ should not be stored directly; for relevant practical advice see: c. e. rasmussen and c. k. i. williams (2006). Gaussian Processes for Machine Learning. mit Press. / j. p. cunningham et al. (2011). Gaussian Probabilities and Expectation Propagation. arXiv: 1111.6832 [stat.ML].
After completing the preparations above, suppose a proposed ob-
servation location 𝑥 is given. For each sample 𝑥∗, we compute the joint
distribution of 𝜙 and 𝜙 ∗ :
and derive the approximate posterior given the exact gradient (8.42)
and off-diagonal Hessian (8.45) observations and the factors (8.47, 8.49).
Defining
$$\mathbf{K} = \begin{bmatrix} \mathbf{k}^\top \\ \mathbf{k}_*^\top \end{bmatrix} = \operatorname{cov}\bigl([\phi, \phi^*]^\top\!, \mathbf{z}^*\bigr),$$
the desired distribution is N(𝜙, 𝜙∗; m, S), where:48
$$\mathbf{m} = \begin{bmatrix} m \\ m^* \end{bmatrix} = \boldsymbol{\mu} + \mathbf{K}\boldsymbol{\alpha}^*; \qquad \mathbf{S} = \begin{bmatrix} \varsigma^2 & \rho \\ \rho & \varsigma_*^2 \end{bmatrix} = \boldsymbol{\Sigma} - \mathbf{K}\mathbf{V}_*^{-1}\mathbf{K}^\top. \qquad (8.51)$$

48 This can be derived by marginalizing z∗ according to its approximate posterior from step 3 above: ∫ 𝑝(𝜙, 𝜙∗ | z∗) 𝑝(z∗) dz∗.
We now apply the prediction constraint (8.50) with one final step of
expectation propagation. Define (b.7, b.11–b.12):
$$\bar{\mu} = m - m^*; \qquad \bar{\sigma}^2 = \varsigma^2 - 2\rho + \varsigma_*^2;$$
$$z = -\frac{\bar{\mu}}{\bar{\sigma}}; \qquad \alpha = -\frac{\phi(z)}{\Phi(z)\,\bar{\sigma}}; \qquad \gamma = -\frac{\bar{\sigma}}{\alpha}\left(\frac{\phi(z)}{\Phi(z)} + z\right)^{-1}.$$
Unfortunately, this expression cannot be computed exactly due to the complexity of the distribution 𝑝(𝑓∗ | D); see figure 8.17 for an example. However, several effective approximations have been proposed, including max-value entropy search (mes)49 and output-space predictive entropy search (opes).50

49 z. wang and s. jegelka (2017). Max-value Entropy Search for Efficient Bayesian Optimization. icml 2017.
50 m. w. hoffman and z. ghahramani (2015). Output-Space Predictive Entropy Search for Flexible Global Optimization. Bayesian Optimization Workshop, neurips 2015.

The issues we face in estimating (8.53), and the strategies we use to overcome them, largely mirror those in predictive entropy search. To begin, the first term is the differential entropy of a Gaussian and may be computed exactly:
$$H[y \mid x, \mathcal{D}] = \tfrac{1}{2} \log(2\pi e s^2). \qquad (8.54)$$
The second term, however, presents some challenges, and the available approximations to (8.53) diverge in their estimation approach. We will discuss the mes and opes approximations in parallel, as they share the same basic strategy and only differ in some details along the way.
$$\mathbb{E}\bigl[\, H[y \mid x, f^*\!, \mathcal{D}] \,\big|\, x, \mathcal{D} \,\bigr] \approx \frac{1}{n} \sum_{i=1}^{n} H[y \mid x, f_i^*, \mathcal{D}]. \qquad (8.55)$$
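In the special case of exact observations, each term in this sum is the entropy of a truncated normal distribution, and the Monte Carlo estimator collapses to a short closed form; a minimal sketch under that simplifying assumption is given below (the noisy case requires the convolution illustrated in figure 8.19).

```python
import numpy as np
from scipy.stats import norm

def mes(mu, sigma, f_star_samples):
    """mu, sigma: predictive moments at a candidate x; f_star_samples: draws of f*."""
    gamma = (np.asarray(f_star_samples) - mu) / sigma
    Z = norm.cdf(gamma)
    # truncated-normal entropy identity for each sample, averaged over samples
    return np.mean(gamma * norm.pdf(gamma) / (2 * Z) - np.log(Z))
```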
Figure 8.17: Left: the true distribution 𝑝 (𝑓 ∗ | D) for our running scenario, estimated using exhaustive sampling. Middle:
wang and jegelka’s approximation to 𝑝 (𝑓 ∗ | D) (8.56) using a grid of 100 equally spaced representer points.
Right: an approximation to 𝑝 (𝑓 ∗ | D) from sampling the function values at the 100 Thompson samples from
figure 8.11.
Figure 8.18: An approximation to the mutual information between the observed value 𝑦 and the value of the global
optimum 𝛼 𝑓 ∗ = 𝐼 (𝑦; 𝑓 ∗ | 𝑥, D) for our running example using the independent approximation to 𝑝 (𝑓 ∗ | D)
(8.56) and numerical integration.
Figure 8.19: Left: an example of the latent predictive distribution 𝑝 (𝜙 | 𝑥, 𝑓,∗ D), which takes the form of a truncated
normal distribution (8.58), and the resulting predictive distribution 𝑝 (𝑦 | 𝑥, 𝑓,∗ D), which is a convolution with
a centered normal distribution accounting for Gaussian observation noise. Right: a Gaussian expectation
propagation approximation to the predictive distribution (8.62).
hoffman and ghahramani instead approximate the latent predictive distribution (8.58) using Gaussian expectation propagation (truncating a variable with ep: § b.2, p. 305):
Figure 8.20: The mes and opes approximations to the mutual information with 𝑓 ∗ for an example high-noise scenario
with unit signal-to-noise ratio. Both use the independent representer point approximation to 𝑝 (𝑓 ∗ | D) (8.56).
note the {𝑓𝑖∗} samples will in general not depend on 𝑥, so we do not need to worry about any dependence of their distribution on the observation location. The details for the opes approximation are provided in the appendix (gradient of opes acquisition function: § c.3, p. 311).
noise process. Bayesian model averaging over this space yields marginal
posterior and predictive distributions:
$$p(f \mid \mathcal{D}) = \int p(f \mid \mathcal{D}, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, \mathrm{d}\boldsymbol{\theta}; \qquad (8.64)$$
$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, \mathcal{D}, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, \mathrm{d}\boldsymbol{\theta}, \qquad (8.65)$$
which are integrated against the model posterior 𝑝(𝜽 | D) (4.7). Both of
these distributions are in general intractable, but we developed several
viable approximations in chapter 3, all of which approximate the objective
function posterior (8.64) with a mixture of Gaussian processes and the
posterior predictive distribution (8.65) with a mixture of Gaussians.
There are two natural ways we might address this dependence when
averaging over models. One is to seek to maximize the expected utility
of the data, averaged over the choice of model:
$$\mathbb{E}u(\mathcal{D}) = \int u(\mathcal{D}; \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, \mathrm{d}\boldsymbol{\theta}. \qquad (8.66)$$
Writing the marginal gain in expected utility as 𝔼Δ, we may derive an acquisition function via one-step lookahead:
$$\begin{aligned}
\mathbb{E}\alpha(x; \mathcal{D}) &= \int \mathbb{E}\Delta(x, y)\, p(y \mid x, \mathcal{D})\, \mathrm{d}y \\
&= \int\!\!\int \Delta(x, y; \boldsymbol{\theta})\, p(y \mid x, \mathcal{D}, \boldsymbol{\theta})\, \mathrm{d}y \; p(\boldsymbol{\theta} \mid \mathcal{D})\, \mathrm{d}\boldsymbol{\theta} \\
&= \int \alpha(x; \mathcal{D}, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, \mathrm{d}\boldsymbol{\theta}. \qquad (8.67)
\end{aligned}$$
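Computationally, (8.67) amounts to a weighted average of per-model acquisition values over (approximate) samples from the model posterior; a minimal sketch, with alpha, thetas, and weights assumed to be supplied by the surrounding code and the weights normalized to sum to one:

```python
import numpy as np

def averaged_acquisition(x, alpha, thetas, weights):
    # weighted average of alpha(x; D, theta) over posterior samples of theta
    return np.dot(weights, [alpha(x, theta) for theta in thetas])
```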
and information gain about 𝜔 with respect to its marginal belief:
$$\int p(\omega \mid \mathcal{D}, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, \mathrm{d}\boldsymbol{\theta}. \qquad (8.69)$$
Let us notate a utility function defined in this manner with 𝑢𝔼(D), contrasting with its post hoc averaging equivalent, 𝔼𝑢(D) (8.66). Similarly, let us notate its marginal gain with Δ𝔼 and its expected marginal gain with:
$$\alpha\mathbb{E}(x; \mathcal{D}) = \int \Delta\mathbb{E}(x, y)\, p(y \mid x, \mathcal{D})\, \mathrm{d}y. \qquad (8.70)$$
Random forests
Random forests63 are a popular model class renowned for their excellent off-the-shelf performance,64 offering good generalization, strong resistance to overfitting, and efficient training and prediction. Of particular relevance for optimization, random forests are adept at handling high-dimensional data and categorical and conditional features, and may be a better choice than Gaussian processes for objectives featuring any of these characteristics.

63 l. breiman (2001). Random Forests. Machine Learning 45(1):5–32.
64 m. fernández-delgado et al. (2014). Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research 15(90):3133–3181.
Algorithm configuration is one setting where these capabilities are
critical: complex algorithms such as compilers or sat solvers often have
complex configuration schemata with many mutually dependent param-
eters, and it can be difficult to build nontrivial covariance functions for
such inputs. Random forests require no special treatment in this setting
and have delivered impressive performance in predicting algorithmic performance measures such as runtime.65 They are thus a natural choice for Bayesian optimization of these same measures.66

65 f. hutter et al. (2014). Algorithm runtime prediction: Methods & evaluation. Artificial Intelligence 206:79–111.
Classical random forests are not particularly adept at quantifying uncertainty in predictions off-the-shelf. Seeking more nuanced uncertainty quantification, hutter et al. proposed a modification of the vanilla model wherein leaves store both the mean (as usual) and the standard deviation of the training data terminating there.65 We then estimate the predictive distribution with a mixture of Gaussians with moments corresponding to the predictions of the member trees.67, 68 Figure 8.21 compares the predictions of a Gaussian process and a random forest model on a toy dataset. Although they differ in their extrapolatory behavior, the models make very similar predictions otherwise.

66 f. hutter et al. (2011). Sequential Model-Based Optimization for General Algorithm Configuration. lion 5.
67 hutter et al. then fit a single Gaussian distribution to this mixture via moment matching, although this is not strictly necessary.
68 A similar approach can also be used to estimate arbitrary predictive quantiles: n. meinshausen (2006). Quantile Regression Forests. Journal of Machine Learning Research 7(35):983–999.

To realize an optimization policy with a random forest, hutter et al. suggested approximating acquisition functions depending only on marginal predictions – such as (noiseless) expected improvement or probability of improvement – by simply plugging this Gaussian approximation into the expressions derived in this chapter (8.9, 8.22). Either can be computed easily from a Gaussian mixture predictive distribution as well, due to linearity of expectation.
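A rough sketch of such a plug-in policy, using scikit-learn's off-the-shelf random forest for illustration, is shown below. The predictive moments here come only from the spread of per-tree predictions (hutter et al. additionally store within-leaf standard deviations), so this is a simplification of the model described above; the final line is the familiar noiseless expected improvement formula (8.9).

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def forest_expected_improvement(forest, X_candidates, incumbent):
    per_tree = np.stack([tree.predict(X_candidates) for tree in forest.estimators_])
    mu, sigma = per_tree.mean(axis=0), per_tree.std(axis=0) + 1e-12
    z = (mu - incumbent) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

# usage: forest = RandomForestRegressor(n_estimators=100).fit(X, y)
#        ei = forest_expected_improvement(forest, X_grid, incumbent=y.max())
```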
$$p(y \mid x, \mathcal{D}) \approx \frac{1}{s} \sum_{i=1}^{s} \mathcal{N}\bigl( y;\, f(x; \mathbf{w}_i),\, \sigma_i^2 \bigr).$$
This is sufficient to compute policies such as (noiseless) expected improvement and probability of improvement (8.9, 8.22).
The impressive performance of modern deep neural networks is largely due to their ability to learn sophisticated feature representations of complex data, a process that requires enormous amounts of data and may be out of reach in most Bayesian optimization settings. However, when the domain consists of structured objects with sufficiently many unlabeled examples available, one path forward is to train a generative latent variable model – such as a variational autoencoder or generative adversarial network – using unsupervised (or semi-supervised) methods. We can then perform optimization in the resulting latent space, for example by simply constructing a Gaussian process over the learned neural representation.80 This approach has notably proven useful in the design of molecules81 and biological sequences.82 In both of these settings, enormous databases (on the order of hundreds of millions of examples) are available for learning a representation prior to optimization.83, 84

80 This construction represents a realization of a manifold Gaussian process/deep kernel; see § 3.4, p. 59.
81 r. gómez-bombarelli et al. (2018). Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. acs Central Science 4(2):268–276.
82 Numerous example systems are outlined in: b. l. hie and k. k. yang (2021). Adaptive machine learning for protein engineering. arXiv: 2106.05466 [q-bio.QM] [table 1].
83 j. j. irwin et al. (2020). zinc20 – A Free Ultralarge-Scale Chemical Database for Ligand Discovery. Journal of Chemical Information and Modeling 60(12):6065–6073.
84 the uniprot consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research 49(d1):d480–d489.
where Δ(𝑥, 𝑦) is the gain in utility resulting from the observation (𝑥, 𝑦)
(8.27). When Δ is a piecewise linear function of 𝑦, this integral can be
resolved analytically in terms of the standard normal cdf. This is the
case for the expected improvement and probability of improvement acquisition functions, both with and without observation noise (§§ 8.2–8.3, p. 167).
However, when Δ is a more complicated function of the putative observation, we must in general rely on approximate computation to resolve this integral. When the predictive distribution is normal – as in the model class considered in this chapter – Gauss–Hermite quadrature provides a useful and sample-efficient approximation via a weighted average of carefully chosen integration nodes (§ 8.5, p. 171). This allows us to address some more complex acquisition functions such as the knowledge gradient (§ 8.6, p. 172).
The computation of mutual information with 𝑥 ∗ or 𝑓 ∗ entails an
expectation with respect to these random variables, which cannot be
approximated using simple quadrature schemes. Instead, we must rely on
schemes such as Thompson sampling (§ 8.7, p. 176) – a notable policy in its own right – to generate samples and proceed via (simple or quasi-) Monte Carlo integration, and in some cases, further approximations to the conditional predictive distributions resulting from these samples (§§ 8.8–8.9, p. 180).
Finally, Bayesian optimization is of course not limited to a single Gaussian process belief on the objective function, which may be objectionable even when Gaussian processes are the preferred model class due to uncertainty in hyperparameters or model structure. Averaging over a space of Gaussian processes is possible – with some care – by adopting a Gaussian process mixture approximation to the marginal objective function posterior and relying on results from the single gp case (§ 8.10, p. 192). If desired, we may also abandon the model class entirely and compute policies with respect to an alternative such as random forests or Bayesian neural networks (§ 8.11, p. 196).
9
IMPLEMENTATION
There is a rich and mature software ecosystem available for Gaussian pro-
cess modeling and Bayesian optimization, and it is relatively easy to build
sophisticated optimization routines using off-the-shelf libraries. How-
ever, successful implementation of the underlying algorithms requires
attending to some nitty-gritty details to ensure optimal performance, and
what may appear to be simple equations on the page can be challenging
to realize in a limited-precision environment. In this chapter we will
provide a brief overview of the computational details that practitioners
should be aware of when designing Bayesian optimization algorithms,
even when availing themselves of existing software libraries.
$$\boldsymbol{\alpha} = \mathbf{C}^{-1}(\mathbf{y} - \mathbf{m})$$
The Cholesky factor of the updated covariance matrix has the form
$$\operatorname{chol} \mathbf{C}' = \begin{bmatrix} \mathbf{L} & \mathbf{0} \\ \boldsymbol{\Lambda}_1 & \boldsymbol{\Lambda}_2 \end{bmatrix}.$$
Note that the upper-left block is simply the previously computed Cholesky factor, which we can reuse. The new blocks may be computed as
$$\boldsymbol{\Lambda}_1 = \mathbf{X}_1 \mathbf{L}^{-\top}; \qquad \boldsymbol{\Lambda}_2 = \operatorname{chol}\bigl( \mathbf{X}_2 - \boldsymbol{\Lambda}_1 \boldsymbol{\Lambda}_1^\top \bigr).$$
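A minimal sketch of this incremental update in code follows; C_cross and C_new play the roles of X₁ and X₂, the covariance between new and old observations and among the new observations (including any noise contribution), respectively.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def update_cholesky(L, C_cross, C_new):
    """Extend the lower Cholesky factor L of C to that of [[C, C_cross.T], [C_cross, C_new]]."""
    Lambda1 = solve_triangular(L, C_cross.T, lower=True).T          # X1 L^{-T}
    Lambda2 = cholesky(C_new - Lambda1 @ Lambda1.T, lower=True)     # chol(X2 - Lambda1 Lambda1^T)
    top = np.hstack([L, np.zeros((L.shape[0], Lambda2.shape[1]))])
    return np.vstack([top, np.hstack([Lambda1, Lambda2])])
```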
Sparse approximations
An alternative approach for scaling to large datasets is sparse approxi-
mation. Here rather than approximating the linear algebra arising in the
exact posterior, we approximate the posterior distribution itself with a
Gaussian process admitting tractable computation with direct numerical
methods. A large family of sparse approximations has been proposed,
which differ in their details but share the same general approach.
As we have seen, specifying an arbitrary Gaussian distribution for
a set of values jointly Gaussian distributed with a function of interest
induces a gp posterior consistent with that belief (2.39; Gaussian approximate inference: § 2.8, p. 301). This is a powerful
tool, used in approximate inference to optimize the fit of the induced
that is, we assume that all covariance between function values is mod-
erated through the inducing values. This is a popular approximation
scheme known as the Nyström method.17 Importantly, however, we note that the posterior mean does still reflect the information contained in the entire dataset through the true residuals (z − m) and noise N. The approximate posterior covariance also reflects this approximation:
$$K(x, x') - \bigl(\mathbf{K}\boldsymbol{\Sigma}^{-1}\mathbf{k}(x)\bigr)^{\!\top} \bigl(\mathbf{K}\boldsymbol{\Sigma}^{-1}\mathbf{K}^\top + \mathbf{N}\bigr)^{-1} \mathbf{K}\boldsymbol{\Sigma}^{-1}\mathbf{k}(x') \approx K(x, x') - \boldsymbol{\kappa}(x)^\top (\mathbf{C} + \mathbf{N})^{-1} \boldsymbol{\kappa}(x'). \qquad (9.7)$$

17 c. k. i. williams and m. seeger (2000). Using the Nyström Method to Speed Up Kernel Machines. neurips 2000.
Optimization approaches
There are two common lines of attack for optimizing acquisition func-
tions in Bayesian optimization. One approach is to use an off-the-shelf
derivative-free global optimization method such as the “dividing rectan-
gles” (direct) algorithm of jones et al.22 or a member of the covariance matrix adaptation evolution strategy (cma–es) family of algorithms.23 Although a popular choice in the literature, we argue that neither is a particularly good choice in situations where the acquisition function may devolve into effective flatness as described above. However, an algorithm of this class may be reasonable in modest dimension where optimization can be somewhat exhaustive.

22 d. r. jones et al. (1993). Lipschitzian Optimization Without the Lipschitz Constant. Journal of Optimization Theory and Applications 79(1):157–181.
23 n. hansen (2016). The cma Evolution Strategy: A Tutorial. arXiv: 1604.00772 [cs.LG].
In general, a better alternative is multistart local optimization, mak-
ing use of the gradients computed in the previous chapter for rapid
convergence. kim and choi provided an extensive comparison of the
theoretical and empirical performance of global and local optimization
routines for acquisition function optimization, and ultimately recom-
mended multistart local optimization as best practice.24 This is also the approach used in (at least some) sophisticated modern software packages for Bayesian optimization.25

24 j. kim and s. choi (2020). On Local Optimizers of Acquisition Functions in Bayesian Optimization. 12458:675–690.
25 m. balandat et al. (2020). BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. neurips 2020.

To ensure success, we must carefully select starting points so that the relevant regions of the domain are searched. jones (the same jones of the direct algorithm) recognized the problem of vanishing gradients described above and suggested a simple heuristic for selecting local optimization starting points by enumerating and pruning the midpoints between all pairs of observed points.26 A more brute-force approach is to measure the acquisition function on an exhaustive covering of the domain – generated for example by a low-discrepancy sequence – then begin local searches from the highest values seen. This can be effective if the initial set of points is dense enough to probe the neigh-

26 d. r. jones (2001). A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization 21(4):345–383.
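A minimal sketch of this covering-plus-multistart strategy on a box-bounded domain follows; alpha and alpha_grad (the acquisition function and its gradient) are assumed to be supplied by the caller, and the structure is illustrative rather than tied to any particular software package.

```python
import numpy as np
from scipy.stats import qmc
from scipy.optimize import minimize

def maximize_acquisition(alpha, alpha_grad, bounds, n_cover=1024, n_starts=10):
    lo, hi = np.array(bounds, dtype=float).T
    cover = qmc.scale(qmc.Sobol(d=len(lo)).random(n_cover), lo, hi)    # low-discrepancy covering
    starts = cover[np.argsort([-alpha(x) for x in cover])[:n_starts]]  # best points seed local searches
    best_x, best_val = None, -np.inf
    for x0 in starts:
        res = minimize(lambda x: -alpha(x), x0, jac=lambda x: -alpha_grad(x),
                       bounds=list(zip(lo, hi)), method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x, best_val
```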
Initialization
Theoretically, one can begin a Bayesian optimization routine with a
completely empty dataset D = ∅ and then use an optimization policy to
design every observation, and indeed this has been our working model
of Bayesian optimization since sketching the basic idea in algorithm 1.1.
However, Bayesian optimization policies are informed by an underlying
belief about the objective function, which can be significantly misin-
formed when too little data is available, especially when relying on point
estimation for model selection rather than accounting for (significant!) uncertainty in model hyperparameters and/or model structures (model averaging: § 4.4, p. 74).
Due to the sequential nature of optimization and the dependence of each decision on the data observed in previous iterations, it can be wise to use a model-independent procedure to design a small number of initial observations before beginning optimization in earnest. This procedure can be as simple as random sampling39 or a space-filling design such as a low-discrepancy sequence or Latin hypercube design.40 When repeatedly solving related optimization problems, we may even be able to learn how to initialize Bayesian optimization routines from experience. Some authors have proposed sophisticated “warm start” initialization procedures for hyperparameter tuning using so-called metafeatures characterizing the datasets under consideration.41

39 These approaches are discussed and evaluated in: m. w. hoffman and b. shahriari (2014). Modular mechanisms for Bayesian optimization. Bayesian Optimization Workshop, neurips 2014.
40 d. r. jones et al. (1998). Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization 13(4):455–492.
41 m. feurer et al. (2015). Initializing Bayesian Hyperparameter Optimization via Meta-Learning. aaai 2015.
Termination
In many applications of Bayesian optimization, we assume a preallocated
budget on the number of observations we will make and simply terminate
optimization when that budget is expended. However, we may also treat
termination as a decision and adaptively determine when to stop based on collected data (optimal stopping rules: § 5.4, p. 103). Of course, in practice we are free to design a stopping
rule however we see fit, but we can outline some possible options.
Especially when using a policy grounded in decision theory, it is
natural to terminate optimization when the maximum of our chosen
acquisition function drops below some threshold 𝑐, which may depend
on 𝑥:
$$\max_{x \in \mathcal{X}} \bigl[ \alpha(x; \mathcal{D}) - c(x) \bigr] < 0. \qquad (9.9)$$
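A minimal sketch of embedding such a rule in an optimization loop is given below; propose (acquisition maximization), cost, and observe are assumed stand-ins supplied by the surrounding optimization code.

```python
def optimize(D, propose, cost, observe, max_iters):
    for _ in range(max_iters):
        x, alpha_value = propose(D)       # maximize the acquisition function
        if alpha_value - cost(x) < 0:     # stopping rule (9.9): gain no longer worth the cost
            break
        D = D + [(x, observe(x))]         # make the observation and continue
    return D
```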
For acquisition functions derived from decision theory, such a stopping rule may be justified theoretically: we stop when the expected gain from the optimal observation is no longer worth the cost of acquisition.42 A majority of the stopping rules described in the literature assume this form, with the threshold 𝑐 often being determined dynamically based on the scale of observed data.43 dai et al. combined a stopping rule of this form with an otherwise non-decision-theoretic policy (gp-ucb) and showed that its asymptotic performance in terms of expected regret was not adversely affected despite the mismatch in motivation between the policy and stopping rule.44
It may also be prudent to consider purely data-dependent stopping rules in order to avoid undue expense arising from miscalibrated models fruitlessly continuing optimization based on incorrect beliefs. For example, acerbi and ma proposed augmenting a bound on the total number of

42 See in particular our discussion of the one-step optimal stopping rule, p. 104.
43 Some early (but surely not the earliest!) examples: d. d. cox and s. john (1992). A Statistical Method for Global Optimization. smc 1992. / d. r. jones et al. (1998). Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization 13(4):455–492.
44 z. dai et al. (2019). Bayesian Optimization Meets Bayesian Optimal Stopping. icml 2019.
45 l. acerbi and w. j. ma (2017). Practical Bayesian Optimization for Model Fitting with Bayesian Adaptive Direct Search. neurips 2017.
10
THEORETICAL ANALYSIS
When studying the convergence of a global optimization algorithm,
we must be careful to define exactly what we mean by “convergence.” In
general, a convergence argument entails:
• choosing some measure of optimization error,
• choosing some space of possible objective functions, and
• establishing some guarantee for the chosen error on the chosen function
space, such as an asymptotic bound on the worst- or average-case error
in the large-sample limit 𝜏 → ∞.
There is a great deal of freedom in the last of these steps, and we will
discuss several important results and proof strategies later in this chapter.
However, there are well-established conventions for the first two of
these steps, which we will introduce in the following two sections. We begin with the notion of regret, which provides a natural measure of optimization error; useful spaces of objective functions follow in § 10.2 (p. 215).
10.1 regret
Regret is a core concept in the analysis of optimization algorithms, Bayes-
ian or otherwise. The role of regret is to quantify optimization progress
in a manner suitable for establishing convergence to the global optimum
and studying the rate of this convergence. There are several definitions
of regret used in different contexts, all based on the same idea: comparing the objective function values visited during optimization to the globally optimal value, 𝑓∗. The larger this gap, the more “regret” we incur in retrospect for having invested in observations at suboptimal locations.1
Regret is an unavoidable consequence of decision making under uncertainty. Without foreknowledge of the global optimum, we must of course spend some time searching for it, and even what may be optimal actions in the face of uncertainty may seem disappointing in retrospect.2 However, such actions are necessary in order to learn about the environment and inform future decisions. This reasoning gives rise to the classic tension between exploration and exploitation in policy design: although exploration may not yield immediate progress, it enables future success, and if we are careful, reduces future regret. Of course, exploration alone is not sufficient to realize a compelling optimization strategy,3 as we must also exploit what we have learned and adapt our behavior accordingly. An ideal algorithm thus explores efficiently enough that its regret can at least be limited in the long run.

1 One might propose an alternative definition of error by measuring how closely the observed locations approach a global maximum 𝑥∗ rather than by how closely the observed values approach the value of the global optimum 𝑓∗. However, it turns out this is both less convenient for analysis and harder to motivate: in practice it is the value of the objective that we care about the most, and discovering a near-optimal value is a success regardless of its distance to the global optimum.
2 As outlined in chapter 5 (p. 87), optimal actions maximize expected utility in the face of uncertainty. Many observations may not result in progress, even though designed with the best of intentions.
3 For example, it is easy to develop a space-filling design that will eventually locate the global optimum of any continuous function through sheer dumb luck – but it might not do so very quickly!
Most analysis is performed in terms of one of two closely related
notions of regret: simple or cumulative regret, defined below.
Simple regret
Let D𝜏 represent some set of (potentially noisy) observations gathered during optimization,4 and let 𝝓𝜏 = 𝑓(x𝜏) represent the objective function values at the observed locations. The simple regret associated with this data is the difference between the global maximum of the objective and the maximum restricted to the observed locations:5
$$r_\tau = f^* - \max \boldsymbol{\phi}_\tau. \qquad (10.1)$$
It is immediate from its definition that simple regret is nonnegative and vanishes only if the data contain a global optimum. With this in mind, a common goal is to show that the simple regret of data obtained by some policy approaches zero (𝑟𝜏 → 0), implying the policy will eventually (and perhaps efficiently) identify the global optimum, up to vanishing error.

4 The measures of regret introduced here do not depend on the observed values y but only on the underlying objective values 𝝓. We are making a tacit assumption that y is sufficiently informative about 𝝓 for this to be sensible in the large-sample limit.
5 Occasionally a slightly different definition of simple regret is used, analogous to the global reward (6.5), where we measure regret with respect to the maximum of the posterior mean: $r_\tau = f^* - \max_{x \in \mathcal{X}} \mu_{\mathcal{D}_\tau}(x)$.
Cumulative regret
To define cumulative regret, we first introduce the instantaneous regret 𝜌 corresponding to an observation at some point 𝑥, which is the difference between the global maximum of the objective and the function value 𝜙:
$$\rho = f^* - \phi. \qquad (10.2)$$
The cumulative regret for a dataset D𝜏 is then the total instantaneous regret incurred:
$$R_\tau = \sum_i \rho_i = \tau f^* - \sum_i \phi_i. \qquad (10.3)$$
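As a small illustration, both quantities are trivial to compute for a synthetic benchmark where the optimal value is known; phi here denotes the sequence of objective values visited during optimization and f_star the (known) optimum.

```python
import numpy as np

def simple_regret(f_star, phi):
    return f_star - np.max(phi)

def cumulative_regret(f_star, phi):
    return len(phi) * f_star - np.sum(phi)
```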
Figure 10.1: Modifying a continuous function 𝑓 (left) to feature a “needle” on some ball (here an interval) 𝐵. We construct
a continuous function 𝑔 vanishing on the complement of 𝐵 (middle), then add this “correction” to 𝑓 (right).
functions. However, it turns out this space is far too large to be of much
theoretical interest, as it contains functions that are arbitrarily hard to
optimize. Nonetheless, it is easy to characterize convergence (in terms
of simple regret) on this space, and some convergence guarantees of this
type are known for select Bayesian optimization algorithms. We will
begin our discussion here.
We may gain some traction by considering a family of more plausible, “nice” functions whose complexity can be controlled enough to guarantee
rapid convergence. In particular, choosing a Gaussian process prior
for the objective function implies strong correlation structure, and this
insight leads to natural function spaces to study. This has become the
standard approach in modern analysis, and we will consider it shortly.
This expectation is taken with respect to uncertainty in the objective function 𝑓, the observation locations x, and the observed values y.18

18 In some analyses, we may also seek bounds on the regret that hold with high probability with respect to these random variables rather than bounds on the expected regret.
The worst-case alternative
The gp sample path assumption is not always desirable, for example in
the context of a frequentist (that is, worst-case) analysis of a Bayesian
optimization algorithm. This is not as contradictory as it may seem, as
Bayesian analyses can lack robustness to model misspecification – a cer-
tainty in practice. An alternative is to assume that the objective function
lies in some explicit space of “nice” functions H, then find worst-case
convergence guarantees for the Bayesian algorithm on inputs satisfying
this regularity assumption. For example, we might seek to bound the
worst-case expected (simple or cumulative) regret for a function in this space after 𝜏 decisions, $\bar{r}(\tau, \mathcal{H})$ or $\bar{R}(\tau, \mathcal{H})$:
where {𝑥𝑖} ⊂ X is an arbitrary finite set of input locations with corresponding real-valued weights {𝛼𝑖}.20 Note this is precisely the set of all possible posterior mean functions for the Gaussian process arising from exact inference!21 The rkhs H𝐾 is then the completion of this space endowed with the inner product
$$\left\langle \sum_{i=1}^{n} \alpha_i K(x_i, x),\; \sum_{j=1}^{m} \beta_j K(x_j', x) \right\rangle = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j K(x_i, x_j'). \qquad (10.9)$$

20 Equivalently, this is the span of the set of covariance functions with one input held fixed: span { 𝑥 ↦ 𝐾(𝑥, 𝑥′) | 𝑥′ ∈ X }.
21 Inspection of the general posterior mean functions in (2.14, 2.19) reveals they can always be (and can only be) written in this form.
That is, the rkhs is roughly the set of functions “as smooth as” a pos-
terior mean function of the corresponding gp, according to a notion of
“explainability” by the covariance function.
It turns out that belonging to the rkhs H𝐾 is a stronger regularity
assumption than being a sample path of the corresponding gp. In fact,
unless the rkhs is finite-dimensional (which is not normally the case), sample paths from a Gaussian process almost surely do not lie in the corresponding rkhs:22, 23 Pr(𝑓 ∈ H𝐾) = 0. However, the posterior mean function of the same process always lies in the rkhs by the above construction. Figure 10.2 illustrates a striking example of this phenomenon: sample paths from a stationary gp with Matérn covariance function with 𝜈 = 1/2 (3.11) are nowhere differentiable, whereas members of the corresponding rkhs are almost everywhere differentiable. Effectively, the process of averaging over sample paths “smooths out” their erratic behavior in the posterior mean, and elements of the rkhs exhibit similar smoothness.

22 m. n. lukić and j. h. beder (2001). Stochastic Processes with Sample Paths in Reproducing Kernel Hilbert Spaces. Transactions of the American Mathematical Society 353(10):3945–3969.
23 However, sample paths do lie in a “slightly larger” rkhs we can determine from the covariance function: m. kanagawa et al. (2018). Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences. arXiv: 1807.02582 [stat.ML] [theorem 4.12].
$$K_{\mathrm{M}}(d; \nu) = \frac{2^{1-\nu}}{\Gamma(\nu)} \bigl(\sqrt{2\nu}\, d\bigr)^{\nu} K_{\nu}\bigl(\sqrt{2\nu}\, d\bigr), \qquad (10.12)$$
where 𝑑 = |𝑥 − 𝑥′| and 𝐾𝜈 is the modified Bessel function of the second kind. Sample paths from a centered Gaussian process with this covariance are ⌈𝜈⌉ − 1 times continuously differentiable, but the smoothness of sample paths is not as granular as a simple count of derivatives. Rather, the parameter 𝜈 allows us to fine-tune sample path smoothness as desired.28 Figure 10.3 illustrates sample paths generated from a Matérn covariance with a range of smoothness from 𝜈 = 1.01 to 𝜈 = 2. All of these samples are exactly once differentiable, but we might say that the 𝜈 = 1.01 samples are “just barely” so, and that the 𝜈 = 2 samples are “very nearly” twice differentiable.

28 This can be made precise through the coefficient of Hölder continuity in the “final” derivative of 𝐾M, which controls the smoothness of the final derivative of sample paths. Except when 𝜈 is an integer, the Matérn covariance with parameter 𝜈 belongs to the Hölder space C^{𝛼,𝛽} with 𝛼 = ⌊2𝜈⌋ and 𝛽 = 2𝜈 − ⌊2𝜈⌋, and we can expect the final derivative of sample paths to be (𝜈 − ⌊𝜈⌋)-Hölder continuous.
Information capacity
We now require some way to relate the complexity of sample path
behavior to our ability to learn about an unknown function. Information
theory provides an answer through the concept of information capacity,
the maximum rate of information transfer through a noisy observation
mechanism.
In the analysis of Bayesian optimization algorithms, the central con-
cern is how efficiently we can learn about a gp-distributed objective
function GP (𝑓 ; 𝜇, 𝐾) through a set of 𝜏 noisy observations D𝜏 . The infor-
mation capacity of this observation process, as a function of the number
of observations 𝜏, provides a fundamental bound on our ability to learn
about 𝑓. For this discussion, let us adopt the common observation model
of independent and homoskedastic additive Gaussian noise with scale
𝜎𝑛 > 0 (2.16). In this case, information capacity is a function of:
• the covariance 𝐾, which determines the information content of 𝑓, and
• the noise scale 𝜎𝑛 , which limits the amount of information obtainable
through a single observation.
The information regarding 𝑓 contained in an arbitrary set of observations D = (x, y) can be quantified by the mutual information (a.16):29
$$I(\mathbf{y}; f) = H[\mathbf{y}] - H[\mathbf{y} \mid \boldsymbol{\phi}] = \tfrac{1}{2} \log \bigl|\mathbf{I} + \sigma_n^{-2} \boldsymbol{\Sigma}\bigr|, \qquad (10.13)$$
where 𝚺 = 𝐾(x, x). Note that the entropy of y given 𝝓 does not depend on the actual value of 𝝓, and thus the mutual information 𝐼(y; 𝑓) is also the information gain about 𝑓 provided by the data.30 The information capacity 𝛾𝜏 (also known as the maximum information gain) of this observation process is now the maximum amount of information about 𝑓 obtainable through any set of 𝜏 observations:
$$\gamma_\tau = \max_{|\mathbf{x}| = \tau} I(\mathbf{y}; f). \qquad (10.14)$$

29 We have 𝐻[y | 𝑓] = 𝐻[y | 𝝓] by conditional independence (1.3).
30 For more on information gain, see § 6.3, p. 115. Thus far we have primarily concerned ourselves with information gain regarding 𝑥∗ or 𝑓∗; here we are reasoning about the function 𝑓 itself.
results for a particular choice of model. Second, information capacity of any given model is in general np-hard to compute due to the difficulty of the set function maximization in (10.14).31
For these reasons, the typical strategy is to derive agnostic convergence results in terms of the information capacity of an arbitrary observation process, then seek to derive bounds on the information capacity for notable covariance functions such as those in the Matérn family. A common proof strategy for bounding the information gain is to relate the information capacity to the spectrum of the covariance function, with faster spectral decay yielding stronger bounds on information capacity. The first explicit bounds on information capacity for the Matérn and squared exponential covariances were provided by srinivas et al.,32 and the bounds for the Matérn covariance have since been sharpened using similar techniques.33, 34

31 c.-w. ko et al. (1995). An Exact Algorithm for Maximum Entropy Sampling. Operations Research 43(4):684–691.
32 n. srinivas et al. (2010). Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. icml 2010.
33 d. janz et al. (2020). Bandit optimisation of functions in the Matérn kernel rkhs. aistats 2020.
34 s. vakili et al. (2021b). On Information Gain and Regret Bounds in Gaussian Process Bandits. aistats 2021.
For a compact domain X ⊂ ℝ𝑑 and fixed noise scale 𝜎𝑛 , we have
the following asymptotic bounds on the information capacity. For the
Matérn covariance function with smoothness 𝜈, we have:34
$$\gamma_\tau = O\bigl( \tau^{\alpha} (\log \tau)^{1-\alpha} \bigr), \qquad \alpha = \frac{d}{2\nu + d}; \qquad (10.15)$$
and for the squared exponential covariance, we have:32
$$\gamma_\tau = O\bigl( (\log \tau)^{d+1} \bigr). \qquad (10.16)$$
These results embody our stated goal of characterizing smoother sample paths (as measured by 𝜈) as being inherently less complex than rougher sample paths. The growth of the information capacity slows steadily as 𝜈 → ∞, eventually dropping to only logarithmic growth in 𝜏 for the squared exponential covariance. The correct interpretation of this result is that smoother sample paths require less information to describe, and thus the maximum amount of information one could learn is limited compared with rougher sample paths.
Now assume that the prior covariance function is bounded: 𝐾 (𝑥, 𝑥) ≤ 𝑀. Noting that 𝑧² / log(1 + 𝑧²) is increasing for 𝑧 > 0 and that 𝜎𝑖² ≤ 𝑀 (the posterior variance is nonincreasing: § 2.2, p. 22), the following inequality holds for every observation:

𝜎𝑖² ≤ (𝑀 / log(1 + 𝜎𝑛⁻² 𝑀)) log(1 + 𝜎𝑖² / 𝜎𝑛²).
We can now bound the sum of the predictive variances in terms of the information capacity:35

∑_{𝑖=1}^{𝜏} 𝜎𝑖² = O(𝛾𝜏).    (10.18)

35 If an explicit leading constant is desired, we have ∑𝑖 𝜎𝑖² ≤ 𝑐𝛾𝜏 with 𝑐 = 2𝑀 / log(1 + 𝜎𝑛⁻² 𝑀).
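A quick numerical sanity check of the per-observation inequality above and of the explicit constant in the side note; the prior variance bound 𝑀, the noise scale, and the sampled variances are arbitrary illustrative values.

import numpy as np

# check: sigma_i^2 <= (M / log(1 + sigma_n^-2 M)) log(1 + sigma_i^2 / sigma_n^2) whenever sigma_i^2 <= M
M, sigma_n = 2.0, 0.3
rng = np.random.default_rng(0)
var = rng.uniform(0.0, M, size=10_000)   # candidate posterior variances, all bounded by M

rhs = M / np.log(1 + M / sigma_n ** 2) * np.log(1 + var / sigma_n ** 2)
assert np.all(var <= rhs + 1e-12)

# the explicit leading constant from the summed bound in the side note
c = 2 * M / np.log(1 + M / sigma_n ** 2)
print(c)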
Common assumptions
In this section, we will assume that the objective function 𝑓 : X → ℝ is a sample path from a centered Gaussian process GP (𝑓 ; 𝜇 ≡ 0, 𝐾), and that observation noise is independent, homoskedastic Gaussian noise with scale 𝜎𝑛 > 0 (2.16). The domain X will at various times be either a finite set (as a stepping stone toward the continuous case) or a compact and convex subset of a 𝑑-dimensional cube: X ⊂ [0, 𝑚]𝑑. We will also assume that the covariance function is continuous and bounded on X : 𝐾 (𝑥, 𝑥) ≤ 1. Since the covariance function is guaranteed to be bounded anyway (as X is compact), this simply fixes the scale without loss of generality.
where 𝑥𝑖 is the point chosen in the 𝑖th iteration of the policy, 𝜇 and 𝜎 are shorthand for the posterior mean and standard deviation of 𝜙 = 𝑓 (𝑥) given the data available at time 𝑖, D𝑖−1 , and 𝛽𝑖 is a time-dependent exploration parameter. The authors were able to demonstrate that if this exploration parameter is carefully tuned over the course of optimization, then the cumulative regret of the policy can be asymptotically bounded, with high probability, in terms of the information capacity (§ 10.3, p. 222).
We will discuss this result and its derivation in some detail below, as it demonstrates important proof strategies that will be repeated throughout this section and the next.
𝜙 ∈ [𝜇 − 𝛽𝑖 𝜎, 𝜇 + 𝛽𝑖 𝜎] = B𝑖 (𝑥). (10.20)
step 1: show confidence intervals (10.20) are universally valid with high probability
In fact, we can show that for appropriately chosen confidence parameters {𝛽𝑖 }, every such confidence interval is always valid, with high probability. This may seem like a strong claim, but it is simply a consequence of the exponentially decreasing tails of the Gaussian distribution. At time 𝑖, we can use tail bounds on the Gaussian cdf to bound the probability of a given confidence interval failing in terms of 𝛽𝑖 ,38 then use the union bound to bound the probability of (10.20) failing anywhere at time 𝑖.39 Finally, we show that by increasing the confidence parameter 𝛽𝑖 over time – so that the probability of failure decreases suitably quickly – the probability of failure anywhere and at any time is small. srinivas et al. showed in particular that for any 𝛿 ∈ (0, 1), taking40

𝛽𝑖² = 2 log(𝑖² 𝜋² |X | / (6𝛿))    (10.21)

38 srinivas et al. use Pr(𝜙 ∉ B𝑖 (𝑥)) ≤ exp(−𝛽𝑖²/2).
39 Using the above bound, the probability of failure anywhere at time 𝑖 is at most |X | exp(−𝛽𝑖²/2).
40 The mysterious appearance of 𝜋²/6 comes from ∑_{𝑖=1}^{∞} 1/𝑖² = 𝜋²/6.
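The role of the schedule (10.21) can be verified numerically: with this choice, the union bound over the domain and over all times gives total failure probability at most 𝛿. A small sketch, with an arbitrary finite domain size:

import numpy as np

delta, domain_size = 0.05, 1_000
i = np.arange(1, 10 ** 6 + 1)

# beta_i^2 = 2 log(i^2 pi^2 |X| / (6 delta))  (10.21)
beta_sq = 2 * np.log(i ** 2 * np.pi ** 2 * domain_size / (6 * delta))

# per-step failure probability |X| exp(-beta_i^2 / 2) = 6 delta / (pi^2 i^2);
# summed over all times this is at most delta, since sum 1/i^2 = pi^2 / 6
failure = domain_size * np.exp(-beta_sq / 2)
print(failure.sum(), "<=", delta)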
𝜌𝑖 ≤ 2𝛽𝑖 𝜎𝑖 . (10.23)
russo and van roy interpret this bound on the instantaneous regret as guaranteeing that regret can only be high when we also learn a great deal about the objective function to compensate (10.17).41
41 d. russo and b. van roy (2014). Learning to Optimize via Posterior Sampling. Mathematics of Operations Research 39(4):1221–1243.
step 3: bound cumulative regret via bound on instantaneous regret
Finally, we bound the cumulative regret. Assuming the confidence intervals (10.20) are universally valid, we may bound the sum of the squared instantaneous regret up to time 𝜏 by

∑_{𝑖=1}^{𝜏} 𝜌𝑖² ≤ 4 ∑_{𝑖=1}^{𝜏} 𝛽𝑖² 𝜎𝑖² ≤ 4𝛽𝜏² ∑_{𝑖=1}^{𝜏} 𝜎𝑖² = O(𝛽𝜏² 𝛾𝜏).    (10.24)
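To make the policy under analysis concrete, here is a minimal numpy sketch of gp-ucb on a small finite domain, using the confidence parameter schedule (10.21). The squared exponential covariance, the length and noise scales, and the choice of drawing the objective from the prior are illustrative assumptions; this is a sketch of the analyzed policy, not a practical implementation.

import numpy as np

rng = np.random.default_rng(1)

# finite domain and squared exponential prior covariance (illustrative settings)
X = np.linspace(0, 1, 100)
ell, sigma_n, delta = 0.1, 0.1, 0.05
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / ell) ** 2)

# objective drawn from the prior, observed with additive Gaussian noise
f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-10 * np.eye(len(X)))

observed, ys = [], []
for i in range(1, 51):
    if observed:
        idx = np.array(observed)
        C = K[np.ix_(idx, idx)] + sigma_n ** 2 * np.eye(len(idx))
        alpha = np.linalg.solve(C, np.array(ys))
        mu = K[:, idx] @ alpha
        var = np.diag(K) - np.sum(K[idx, :] * np.linalg.solve(C, K[idx, :]), axis=0)
    else:
        mu, var = np.zeros(len(X)), np.diag(K).copy()

    beta = np.sqrt(2 * np.log(i ** 2 * np.pi ** 2 * len(X) / (6 * delta)))   # (10.21)
    x_next = int(np.argmax(mu + beta * np.sqrt(np.maximum(var, 0))))

    observed.append(x_next)
    ys.append(f[x_next] + sigma_n * rng.standard_normal())

print("true maximum:", f.max(), " best value found:", f[np.array(observed)].max())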
• Purely for the sake of analysis, in each iteration 𝑖, we discretize the domain with a grid X𝑖 ⊂ X ; these grids become finer over time and eventually dense. The exact details vary, but it is typical to take a regular grid in [0, 𝑚]𝑑 with spacing on the order of O(1/𝑖^𝑐 ) (for some constant 𝑐 not depending on 𝑑) and take the intersection with X. The resulting discretizations have size log |X𝑖 | = O(𝑑 log 𝑖).
• We note that Lipschitz continuity of 𝑓 allows us to extend valid confidence intervals for the function values at X𝑖 to all of X with only slight inflation. Namely, for any 𝑥 ∈ X, let [𝑥]𝑖 denote the closest point to 𝑥 in X𝑖 . By Lipschitz continuity, a valid confidence interval at [𝑥]𝑖 can be extended to one at 𝑥 with inflation on the order of O(𝐿/𝑖^𝑐 ) due to the discretizations becoming finer over time.
• With this intuition, we design a confidence parameter sequence {𝛽𝑖 } guaranteeing that the function values on both the grids {X𝑖 } and at the points chosen by the algorithm {𝑥𝑖 } always lie in their respective confidence intervals with high probability. As everything is discrete, we can generally start from a guarantee such as (10.21), replace |X | with |X𝑖 |, and fiddle with constants as necessary.
• Finally, we proceed as in the finite case. We bound the instantaneous regret in each iteration in terms of the width of the confidence intervals of the selected points, noting that any extra regret due to discretization shrinks rapidly as O(𝐿/𝑖^𝑐 ) and, if we are careful, does not affect the asymptotic regret. Generally the resulting bound simply replaces any factors of log |X | with factors of log |X𝑖 | = O(𝑑 log 𝑖).
For the particular case of bounding the Bayesian regret of gp-ucb, we can effectively follow the above argument but must deal with some nuance regarding the Lipschitz constant of the objective function. First, note that if the covariance function of our Gaussian process is smooth enough, its sample paths will be continuously differentiable43 and Lipschitz continuous – but with random Lipschitz constant (sample path differentiability: § 2.6, p. 30). srinivas et al. proceed by assuming that the covariance function is sufficiently smooth to ensure that the Lipschitz constant has an exponential tail bound of the following form:

∀𝜆 > 0 : Pr(𝐿 > 𝜆) ≤ 𝑑𝑎 exp(−𝜆²/𝑏²),    (10.26)

where 𝑎, 𝑏 > 0; this allows us to derive a high-probability upper bound on 𝐿. We will say more about this assumption shortly.
43 Hölder continuity of the derivative process covariance is sufficient; this holds for the Matérn family with 𝜈 > 1.
srinivas et al. showed that, under this assumption, a slight modification of the confidence parameter sequence from the finite case44 (10.21) is sufficient to ensure

𝑅𝜏 = O∗(√(𝑑𝜏𝛾𝜏))    (10.27)

with high probability. Thus, assuming the Gaussian process sample paths are smooth enough, the cumulative regret of gp-ucb on continuous domains grows at rate comparable to the discrete case.
44 Specifically, srinivas et al. took 𝛽𝑖² = 2 log(2𝑖²𝜋²/(3𝛿)) + 2𝑑 log(𝑖² 𝑑𝑏𝑚 √(log(4𝑑𝑎/𝛿))); the resulting bound holds with probability at least 1 − 𝛿.
Plugging in the information capacity bounds in (10.15–10.16) and dropping the √𝑑 factor, we have the following high-probability regret
∑_{𝑖=1}^{𝜏} 𝔼[𝑢𝑖 − 𝜙𝑖 ] = ∑_{𝑖=1}^{𝜏} 𝛽𝑖 𝜎𝑖 ≤ 𝛽𝜏 ∑_{𝑖=1}^{𝜏} 𝜎𝑖 = O∗(√(𝜏𝛾𝜏 log |X |)).    (10.32)

As before, we may extend this result to the continuous case via a discretization argument to prove 𝔼[𝑅𝜏 ] = O∗(√(𝑑𝜏𝛾𝜏)), matching the high-probability bound (10.27).50
50 Although used for Thompson sampling, the discretization argument used in the below reference suffices here as well: k. kandasamy et al. (2018). Parallelised Bayesian Optimisation via Thompson Sampling. aistats 2018 [appendix, theorem 11].
Thompson sampling
russo and van roy developed a general approach for transforming Bayesian regret bounds for a wide class of ucb-style algorithms into regret bounds for analogous Thompson sampling algorithms (Thompson sampling: § 7.9, p. 148, § 8.7, p. 176).51
Namely, consider a ucb policy selecting a sequence of points {𝑥𝑖 } for observation by maximizing a sequence of “upper confidence bounds,” which here can be any deterministic functions of the observed data, regardless of their statistical validity. Let {𝑢𝑖 } be the sequence of upper confidence bounds associated with the selected points at the time of their selection, and let {𝑢𝑖∗ } be the sequence of upper confidence bounds associated with a given global optimum, 𝑥 ∗.
51 d. russo and b. van roy (2014). Learning to Optimize via Posterior Sampling. Mathematics of Operations Research 39(4):1221–1243.
This is of exactly the same form as the bound derived above for gp-ucb (10.31), leading immediately to a regret bound for gp-ts matching that for gp-ucb (10.32):

𝔼[𝑅𝜏 ] = O∗(√(𝜏𝛾𝜏 log |X |));    (10.34)

once again, we may extend this result to the continuous case to derive a bound matching that for gp-ucb (10.27).52
52 k. kandasamy et al. (2018). Parallelised Bayesian Optimisation via Thompson Sampling. aistats 2018.
Figure 10.4: A sketch of scarlett’s proof strategy. Given access to a reference function 𝑓, which of its translations by Δ, 𝑓⁻ or 𝑓⁺, is the objective function?
any algorithm maximizing a (nondifferentiable) sample path of Brownian motion (from the Wiener process) on the unit interval:60

𝔼[𝑟𝜏 ] = Ω(1/√(𝜏 log 𝜏));    𝔼[𝑅𝜏 ] = Ω(√(𝜏/log 𝜏)).    (10.37)

which are within a factor of (log 𝜏)^{3/2} of the corresponding lower bounds (10.37), again fairly tight.
60 z. wang et al. (2020b). Tight Regret Bounds for Noisy Optimization of a Brownian Motion. arXiv: 2001.09327 [cs.LG].
Common assumptions
In this section, we will assume that the objective function 𝑓 : X → ℝ, where X is compact, lies in a reproducing kernel Hilbert space corresponding to covariance function 𝐾 and has bounded norm: 𝑓 ∈ H𝐾 [𝐵]
(10.11). As in the previous section, we will also assume that the covariance function is continuous and bounded on X : 𝐾 (𝑥, 𝑥) ≤ 1.
To model observations, we will assume that a sequence of observations {𝑦𝑖 } at {𝑥𝑖 } are corrupted by additive noise, 𝑦𝑖 = 𝜙𝑖 + 𝜀𝑖 , and that the distribution of the errors satisfies mild regularity conditions. First, we will assume each 𝜀𝑖 has mean zero conditioned on its history:

𝔼[𝜀𝑖 | 𝜺<𝑖 ] = 0,

where 𝜺<𝑖 is the vector of errors occurring before time 𝑖. We will also make assumptions regarding the scale of the errors. The most typical assumption is that the distribution of each 𝜀𝑖 is 𝜎𝑛 -sub-Gaussian conditioned on its history; that is, that the tail of the conditional distribution shrinks at least as quickly as a Gaussian distribution with variance 𝜎𝑛²:

∀𝑐 > 0 : Pr(|𝜀𝑖 | > 𝑐 | 𝜺<𝑖 ) ≤ 2 exp(−½ 𝑐²/𝜎𝑛²).

This condition is satisfied, for example, by a distribution bounded on the interval [−𝜎𝑛 , 𝜎𝑛 ] and by any Gaussian distribution with standard deviation of at most 𝜎𝑛 .
Complementing the above assumptions, the Bayesian optimization algorithms that we will analyze model the function with the centered Gaussian process GP (𝑓 ; 𝜇 ≡ 0, 𝐾) and assume independent Gaussian observation noise with scale 𝜎𝑛 .
This rate is within logarithmic factors of the best-known lower bounds for convergence in simple regret, as discussed later in this section.
72 s. vakili et al. (2021a). Optimal Order Simple Regret for Gaussian Process Bandits. neurips 2021. [theorem 3, remark 2]
(on a smaller domain) by discarding all observed data. This ensures that every observation is always gathered nonadaptively.
The intervening batches are constructed by uncertainty sampling. The motivation behind this strategy is that we can bound the width of confidence intervals after 𝑖 rounds of uncertainty sampling in terms of the information capacity; in particular, we can bound the maximum posterior standard deviation after 𝑖 rounds of uncertainty sampling with:74

max_{𝑥 ∈X} 𝜎 = O(√(𝛾𝑖 /𝑖)).    (10.43)

74 x. cai et al. (2021). Lenient Regret and Good-Action Identification in Gaussian Process Bandits. icml 2021. [§b.4]
By combining this result with the concentration inequality in (10.42), and by periodically eliminating regions that are very likely to be suboptimal throughout optimization,75 the authors were able to show that the resulting algorithm has worst-case regret

𝑅̄𝜏 [𝐵] = O∗(√(𝑑𝜏𝛾𝜏))    (10.44)

with high probability, matching the regret bounds for gp-ucb and gp-ts in the Bayesian setting (10.27).
75 The pruning scheme ensures that (if the last batch concluded after 𝑖 steps) the global optimum is very likely among the remaining candidates, and thus the instantaneous regret incurred in each stage of the next batch will be bounded by O(𝛽𝑖 √(𝛾𝑖 /𝑖)) with high probability.
76 m. valko et al. (2013). Finite-Time Analysis of Kernelised Contextual Bandits. uai 2013.
This is not the only result of this type; several other authors have been able to construct algorithms achieving similar worst-case regret bounds through other means.76, 77, 78
77 s. salgia et al. (2020). A Computationally Efficient Approach to Black-box Optimization using Gaussian Process Models. arXiv: 2010.13997 [stat.ML].
78 r. camilleri et al. (2021). High-Dimensional Experimental Design and Kernel Bandits. icml 2021.

Lower bounds and tightness of existing algorithms
Lower bounds on regret are easier to come by in the frequentist setting than in the Bayesian setting, as we have considerable freedom to construct explicit objective functions that are provably difficult to optimize. A common strategy to this end is to take inspiration from the “needle in a haystack” trick discussed earlier (see p. 216). We construct a large set of suitably well-behaved “needles” that have little overlap, then argue that there will always be some needle “missed” by an algorithm with insufficient budget to distinguish all the functions. Figure 10.5 shows a motivating example with four translations of a smooth bump function with height 𝜀 and mutually disjoint support. Given any set of three observations – regardless of how they were chosen – the cumulative regret for at least one of these functions would be 3𝜀.
This construction embodies the spirit of most of the lower bound arguments appearing in the literature. To yield a full proof, we must show how to construct a large number of suitable needles with bounded rkhs norm. For a stationary process, this can usually be accomplished by scaling and translating a suitable bump-shaped function to cover the domain. We also need to bound the regret of an algorithm given an input chosen from this set; here, we can keep the set of potential objectives larger than the optimization budget and appeal to pigeonhole arguments.
In the frequentist setting with noise, the strongest known lower bounds are due to scarlett et al.79 The function class considered in the analysis was scaled and translated versions of a function similar to (in fact,
The bump function used in scarlett et al.’s lower bound analysis: the Fourier transform of a smooth function with compact support as in figure 10.5.
79 j. scarlett et al. (2017). Lower Bounds on Regret for Noisy Gaussian Process Bandit Optimization. colt 2017.
Figure 10.5: Four smooth objective functions with disjoint support.
precisely the Fourier transform of) the bump function in figure 10.5;80 this function has the advantage of having “nearly compact” support while having finite rkhs norm in the entire Matérn family. For optimization on the unit cube X = [0, 1]𝑑 with the Matérn covariance function, the authors were able to establish a lower bound on the cumulative regret of any algorithm of

𝑅̄𝜏 [𝐵] = Ω(𝜏^𝛼);    𝛼 = (𝜈 + 𝑑) / (2𝜈 + 𝑑),

and for the squared exponential covariance, a lower bound of

𝑅̄𝜏 [𝐵] = Ω(√(𝜏 (log 𝜏)^𝑑)).

These bounds are fairly tight – they are within logarithmic factors of the best-known upper bounds in the worst-case setting (10.15–10.16, 10.44); see also the corresponding Bayesian bounds (10.28–10.29).
scarlett et al. also provided lower bounds on the worst-case simple regret 𝑟̄𝜏 [𝐵], in terms of the expected time required to reach a given level of regret. Inverting these bounds in terms of the simple regret at a given time yields rates that are as expected in light of the relation in (10.4): 𝑟𝜏 ≤ 𝑅𝜏 /𝜏.
80 This particular bump function is the prototypical smooth function with compact support: 𝑥 ↦ exp(−1/(1 − |𝑥|²)) if |𝑥| < 1, and 0 otherwise.
in terms of properties of x.82 This would serve as a universal bound on the width of all credible intervals, allowing us to bound the cumulative regret of a ucb-style algorithm via previous arguments.
Thankfully, bounds of this type have enjoyed a great deal of attention, and strong results are available.83 Most of these results bound the posterior standard deviation of a Gaussian process in terms of the fill distance, a measure of how densely a given set of observations fills the domain. For a set of observation locations x ⊂ X, its fill distance 𝛿x is defined to be

𝛿x = max_{𝑥 ∈X} min_{𝑥𝑖 ∈x} |𝑥 − 𝑥𝑖 |,    (10.45)

the largest distance from a point in the domain to the closest observation.
82 m. kanagawa et al. (2018). Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences. arXiv: 1807.02582 [stat.ML]. [§ 5.2]
83 h. wendland (2004). Scattered Data Approximation. Cambridge University Press. [chapter 11]
Intuitively, we should expect the maximum posterior standard de-
viation induced by a set of observations to shrink with its fill distance.
This is indeed the case, and particularly nice results for the rate of this
shrinkage are available for the Matérn family. Namely, for finite 𝜈, once
the fill distance is sufficiently small (below a constant depending on the
covariance but not x), we have:
This will suffice for the arguments to follow.87 However, this strategy may not be the best choice in practical settings, as the fill distance decreases sporadically by large jumps each time the grid is refined. An alternative with superior “anytime” performance would be a low-discrepancy sequence such as a Sobol or Halton sequence, which would achieve slightly larger (by a factor of log 𝜏), but more smoothly decreasing, asymptotic fill distance.
87 We can extend this argument to an arbitrary compact domain X ⊂ ℝ𝑑 by enclosing it in a cube.
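A small sketch comparing the fill distance (10.45) of a regular grid with that of a scrambled Sobol sequence on the unit square; the dense reference grid standing in for the maximum over X, the dimension, and the sample size are illustrative assumptions.

import numpy as np
from scipy.stats import qmc

def fill_distance(x, domain):
    # (10.45): the largest distance from a domain point to its nearest observation
    d = np.linalg.norm(domain[:, None, :] - x[None, :, :], axis=-1)
    return d.min(axis=1).max()

dim, n = 2, 64
# dense reference grid standing in for the domain X = [0, 1]^2
side = np.linspace(0, 1, 100)
domain = np.stack(np.meshgrid(side, side), axis=-1).reshape(-1, dim)

# regular grid with n points versus a scrambled Sobol sequence with n points
g = np.linspace(0, 1, int(round(n ** (1 / dim))))
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, dim)
sobol = qmc.Sobol(dim, scramble=True, seed=0).random(n)

print("grid: ", fill_distance(grid, domain))
print("sobol:", fill_distance(sobol, domain))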
Several sophisticated algorithms are also available for (approximately) optimizing the fill distance from a given fixed number of points.85, 88
Combining this with the bound on standard deviation in (10.47) immediately shows that a simple grid strategy achieves worst-case simple regret90

𝑟̄𝜏 [𝐵] = O(𝜏^{−𝜈/𝑑})

for the Matérn covariance with smoothness 𝜈. bull provided a corresponding lower bound establishing that this rate is actually optimal: 𝑟̄𝜏 [𝐵] = Θ(𝜏^{−𝜈/𝑑}).91 As with previous results, the lower bound derives from an adversarial “needle in haystack” construction as in figure 10.5.
88 y. lyu et al. (2019). Efficient Batch Black-box Optimization with Deterministic Regret Bounds. arXiv: 1905.10041 [cs.LG].
90 Here we use the alternative definition in footnote 5. As the confidence intervals (10.48) are always valid, 𝑓 ∗ must always be in the (rapidly shrinking) confidence interval of whatever point maximizes the posterior mean.
91 a. d. bull (2011). Convergence Rates of Efficient Global Optimization Algorithms. Journal of Machine Learning Research 12(88):2879–2904.
This result is perhaps not as exciting as it could be, as it demonstrates that no adaptive algorithm can perform (asymptotically) better than grid search in the worst case. However, we may reasonably seek similar guarantees for algorithms that are also effective in practice. bull was able to show that maximizing expected improvement yields worst-case simple regret

𝑟̄𝜏 [𝐵] = O∗(𝜏^{−min(𝜈,1)/𝑑}),

which is near optimal for 𝜈 ≤ 1. bull also showed that augmenting expected improvement with occasional random exploration akin to an 𝜀-greedy policy improves its performance to near optimal for any finite smoothness:

𝑟̄𝜏 [𝐵] = O∗(𝜏^{−𝜈/𝑑}).
The added randomness effectively guarantees the fill distance of ob-
servations shrinks quickly enough that we can still rely on posterior
contraction arguments, and this strategy could be useful for analyzing
other policies.
for some constant 𝑐 > 0 depending on the process but not on 𝜏. This condition implies rapid convergence in terms of simple regret as well (as we obviously have 𝑟𝜏 ≤ 𝜌𝜏 ) and bounded cumulative regret after we have entered the converged regime.96
Evidently optimization is much easier with deterministic observations than noisy ones, at least in terms of Bayesian regret. Intuitively, the reason for this discrepancy is that noisy observations may compel us to make repeated measurements in the same region in order to shore up our understanding of the objective function, whereas this is never necessary with exact observations.
96 This is not inconsistent with grünewälder et al.’s lower bound: here we assume smoothness consistent with 𝜈 > 2 in the Matérn family, whereas the adversarial Gaussian process constructed in the lower bound is only as smooth as 𝜈 = 1/2.
𝑓 ∈ H𝐾 [𝐵; 𝜆, ℓ],

then the same function is also in the rkhs ball for any larger output scale and for any vector of shorter (as in the lexicographic order) length scales:

𝑓 ∈ H𝐾 [𝐵; 𝜆′, ℓ′];    𝜆′ ≥ 𝜆;    ℓ′ ≤ ℓ.

With this in mind, a natural idea for deriving theoretical convergence results (but not necessarily for realizing a practical algorithm!) is to ignore any data-dependent scheme for setting hyperparameters and instead simply slowly increase the output scale and slowly decrease the length scales over the course of optimization, so that at some point any function lying in any such rkhs will eventually be captured. This scheme has been used to provide convergence guarantees for both expected improvement102 and gp-ucb103 with unknown hyperparameters.
102 z. wang and n. de freitas (2014). Theoretical Analysis of Bayesian Optimization with Unknown Gaussian Process Hyper-Parameters. arXiv: 1406.7758 [stat.ML].
103 f. berkenkamp et al. (2019). No-Regret Bayesian Optimization with Unknown Hyperparameters. Journal of Machine Learning Research 20(50):1–24.
in a given rkhs and seeks asymptotic bounds on the worst-case regret (10.6) (worst-case regret without noise: § 10.6, p. 239).
• Most upper bounds on cumulative regret derive from a proof strategy where we identify some suitable set of predictive credible intervals that are universally valid with high probability. We may then argue that the cumulative regret is bounded in terms of the total width of these intervals. This strategy lends itself most naturally to analyzing the Gaussian process upper confidence bound (gp-ucb) policy, but also yields bounds for Gaussian process Thompson sampling (gp-ts) due to strong theoretical connections between these algorithms (10.33). (This argument is carefully laid out for Bayesian regret with noise in § 10.4, p. 225.)
• In the presence of noise, a key quantity is the information capacity (§ 10.3, p. 222), the maximum information about the objective that can be revealed by a set of noisy observations. Bounds on this quantity yield bounds on the sum of predictive variances (10.18) and thus cumulative regret. With exact observations, we may derive bounds on credible intervals by relating the fill distance (10.45) of the observations to the maximum standard deviation of the process, as in (10.46) (bounding the posterior standard deviation with exact observations: § 10.6, p. 238).
• To derive lower bounds on Bayesian regret, we seek arguments limiting the rate at which Gaussian process sample paths can be optimized. For worst-case regret, we may construct explicit objective functions in a given rkhs and prove that they are difficult to optimize; here the “needles in haystacks” idea again proves useful. (Lower bounds: Bayesian regret with noise, § 10.4, p. 230; worst-case regret with noise, § 10.5, p. 236; Bayesian regret without noise, § 10.6, p. 240; worst-case regret without noise, § 10.6, p. 239.)
11
EXTENSIONS AND RELATED SETTINGS
In some applications, observation costs may be nontrivially correlated with the objective function. As an extreme example, consider a common problem in algorithm configuration,3 where the goal is to design the parameters of an algorithm so as to minimize its expected running time. Here the cost of evaluating a proposed configuration might be reasonably defined to be proportional to its running time. Up to scaling, the observation cost is precisely equal to the objective! To model such correlations, we could define a joint prior 𝑝 (𝑓, 𝑐) over the cost and objective functions, as well as a joint observation model 𝑝 (𝑦, 𝑧 | 𝑥, 𝜙, 𝜅). We could then continue as normal, computing expected utilities with respect to the joint predictive distribution

𝑝 (𝑦, 𝑧 | 𝑥, D) = ∬ 𝑝 (𝑦, 𝑧 | 𝑥, 𝜙, 𝜅) 𝑝 (𝜙, 𝜅 | 𝑥, D) d𝜙 d𝜅,

2 As an example use case, suppose evaluating the objective requires training a machine learning model on cloud hardware with variable (e.g., spot) pricing.
3 The related literature is substantial, but a Bayesian optimization perspective can be found in: f. hutter et al. (2011). Sequential Model-Based Optimization for General Algorithm Configuration. lion 5.
Figure 11.1: Decision making with uncertain costs. The middle panel shows the cost-agnostic expected improvement
acquisition function along with a belief about an uncertain cost function, here assumed to be independent of
the objective. The bottom panel shows the cost-adjusted expected improvement, marginalizing uncertainty in
the objective and cost function (11.1).
Figure 11.2: Expected gain per unit cost. The middle panel shows the cost-agnostic expected improvement acquisition
function along with a belief about an uncertain cost function, here assumed to be independent of the objective.
The bottom panel shows the expected improvement per unit cost (11.1).
where the functions {𝑔𝑖 } : X → ℝ comprise a set of inequality constraints. The subset of the domain where all constraints are satisfied is known as the feasible region:

F = {𝑥 ∈ X | ∀𝑖 : 𝑔𝑖 (𝑥) ≤ 0}.

In some situations, the value of some or all of the constraint functions may in fact be unknown a priori and only revealed through experimentation, complicating policy design considerably. As an example, consider a business optimizing the parameters of a service to maximize revenue, subject to constraints on customer response. If customer response is measured experimentally – for example via a focus group – we cannot know the feasibility of a proposed solution until after the objective has been measured, and even then only with limited confidence.
Further, even if the constraint functions can be computed exactly on demand, constrained optimization of an uncertain objective function is not entirely straightforward. In particular, an observation of the objective at an infeasible point may yield useful information regarding behavior on the feasible region, and could represent an optimal decision if that information were compelling enough. Thus simply restricting the action space to the feasible region may not be the best approach to policy design.9 Instead, to derive effective policies for constrained optimization, we must reconsider steps 2–4 of our general approach.
9 r. b. gramacy and h. k. h. lee (2011). Optimization Under Unknown Constraints. In: Bayesian Statistics 9.
Deriving a policy
After selecting a model for the constraint functions and a utility function
for our observations, we can derive a policy for constrained optimization
following the standard procedure of induction on the decision horizon.
The expected feasible improvement (11.6) encodes a strong preference for evaluating on the feasible region only, and in the case where the constraint functions are all known, the resulting policy will never evaluate outside the feasible region.13 This is a natural consequence of the one-step nature of the acquisition function: an infeasible observation cannot yield any immediate improvement to the utility (11.3) and thus cannot be one-step optimal. However, this behavior might be seen as undesirable given our previous comment that infeasible observations may yield valuable information about the objective on the feasible region.
If we wish to realize a policy more open to observing outside the feasible region, there are several paths forward. A less-myopic policy built on the same utility (11.3) is one option; even two-step lookahead could elect to obtain an infeasible measurement. Another possibility is one-step lookahead with a more broadly defined utility such as (11.4–11.5), which can see the merit of infeasible observations through more global evaluation of success.
To encourage infeasible observations when prudent, gramacy and lee proposed a score they called the integrated expected conditional improvement:14

∬ [𝛼ei (𝑥′; D) − 𝛼ei (𝑥′; D′)] Pr(𝑥′ ∈ F | 𝑥′, D) 𝑝 (𝑦 | 𝑥, D) d𝑥′ d𝑦,

13 In this case the policy would be equivalent to redefining the domain to encompass only the feasible region F and maximizing the unmodified expected improvement (7.2).
14 r. b. gramacy and h. k. h. lee (2011). Optimization Under Unknown Constraints. In: Bayesian Statistics 9.
Figure 11.3: Optimal batch selection. The heatmap shows the expected one-step marginal gain in simple reward (6.1) from adding a batch of two points – corresponding in location to the belief about the objective plotted along the margins – to the current dataset. Note that the expected marginal gain is symmetric. The optimal batch will observe on both sides of the previously best-seen point. In this example, incorporating either one of the selected points alone would also yield relatively high expected marginal gain.
Finally, we can derive the optimal batch policy for a fixed evaluation budget by induction on the horizon, accounting for the expanded action space for each future decision:

x ∈ arg max_{x′ ∈ X^𝑏} {𝛽1 (x′; D) + 𝔼[𝛽∗𝜏−1 (D1 ) | x′, D]},

where the quantity in braces is 𝛽𝜏 (x′; D).
If desired, we could also allow for variable-cost observations and the option of dynamic termination by accounting for costs and including a termination option in the action space. Another compelling possibility would be to consider dynamic batch sizes by expanding the action space further and assigning an appropriate size-dependent cost function for proposed batch observations.
Optimal batch selection is illustrated in figure 11.3 for designing a batch of two points with horizon 𝜏 = 1. We compute the expected one-step gain in utility (11.7) – here the simple reward (6.1), analogous to
which allows us to rewrite the one-step expected batch marginal gain as:

𝛽1 (x; D) = 𝛼 1 (𝑥; D) + 𝔼[𝛼 1 (𝑥′; D′) | x, D].

The main difference is that in the batch setting, we must commit to both observation locations a priori, whereas in the sequential setting, we can design our second observation optimally given the outcome of the first.
We can extend this relationship to the general case (“unrolling” the optimal policy: § 5.3, p. 99). Temporarily adopting compact notation, a horizon-𝑏 optimal sequential decision satisfies:

𝑥 ∈ arg max {𝛼 1 + 𝔼[max {𝛼 1 + 𝔼[max {𝛼 1 + · · ·    (11.9)
Algorithm 11.1: Sequential simulation.
input: dataset D, batch size 𝑏, acquisition function 𝛼
D′ ← D                                    ▷ initialize fictitious dataset
for 𝑖 = 1 . . . 𝑏 do
    𝑥𝑖 ← arg max_{𝑥 ∈X} 𝛼 (𝑥; D′)          ▷ select the next batch member
    𝑦̂𝑖 ← simulate-observation(𝑥𝑖 , D′)
    D′ ← D′ ∪ {(𝑥𝑖 , 𝑦̂𝑖 )}                 ▷ update fictitious dataset
end for
return x
the expected marginal gain (11.7) is more expensive than in the sequential
setting (5.8), as we must now integrate with respect to the joint distri-
bution over outcomes 𝑝 (y | x, D). Thus even evaluating the acquisition
function at a single point entails more work. Additionally, finding the
optimal decision (11.8) requires optimizing this score over a significantly
larger domain than in the sequential analog, a nontrivial task due to its
potentially complex and multimodal nature – see figure 11.3.
Despite these computational difficulties, synchronous batch Bayesian
optimization has enjoyed significant attention from the research com-
munity. We can identify two recurring research thrusts: deriving general
strategies for extending arbitrary sequential policies to batch policies
and deriving batch extensions of specific sequential policies. We provide
a brief survey below.
and commit to this choice. We now augment our dataset with the chosen point and a fictitious observed value 𝑦̂1 , forming D1 = D ∪ {(𝑥 1 , 𝑦̂1 )}, then maximize the acquisition function again to choose the second point:

𝑥 2 ∈ arg max_{𝑥 ∈X} 𝛼 (𝑥; D1 ).

We proceed in this manner until the desired batch size has been reached. Sequential simulation entails 𝑏 optimization problems on X rather than a single problem on X^𝑏, which can be a significant computational savings.
When the sequential policy represents one-step lookahead, sequential simulation can be regarded as a natural greedy approximation to the one-step batch policy via the decomposition in (11.10): whereas the optimal batch policy must maximize this score jointly, sequential simulation maximizes the score pointwise, fixing each point once chosen.
Figure 11.4: Sequential simulation using the expected improvement (7.2) policy and the kriging believer (11.11) imputation strategy. The first point is selected by maximizing expected improvement, and we condition the model on an observation equal to the posterior mean (top panel). We then maximize the updated expected improvement, condition, and repeat as desired. The bottom panel indicates the locations of several further points selected in this manner.

This procedure requires some mechanism for generating fictitious observations as we build the batch. ginsbourger et al. described two
simple heuristics that have been widely adopted.19 Perhaps the most natural option is to impute the expected value of each observation, a heuristic ginsbourger et al. dubbed the kriging believer strategy:

𝑦̂ = 𝔼[𝑦 | 𝑥, D′].    (11.11)

This has the effect of fixing the posterior mean of the objective function throughout simulation. An even simpler option is to impute a constant value independent of the chosen point, which the authors called the constant liar strategy:

𝑦̂ = 𝑐.    (11.12)

Although this might seem silly, it has the advantage of being model independent and has demonstrated surprisingly good performance in practice. Three natural options for the constant, ranging from the most optimistic to the most pessimistic, are to impute the maximum, mean, or minimum of the known observed values y.
19 d. ginsbourger et al. (2010). Kriging is well-suited to parallelize optimization. In: Computational Intelligence in Expensive Optimization Problems.
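The control flow of sequential simulation with a fictitious-observation heuristic can be sketched in a few lines of Python. The acquisition function and imputation rule below are passed in as callables and are assumptions of this sketch rather than fixed choices: the constant liar imputes the best value seen so far (11.12), while a kriging believer variant would instead return the model’s posterior mean at the chosen point (11.11).

import numpy as np

def sequential_simulation(acquisition, impute, candidates, X_data, y_data, batch_size):
    # build a batch by repeatedly maximizing a sequential acquisition function
    # against a growing fictitious dataset (algorithm 11.1)
    X_fict, y_fict = list(X_data), list(y_data)
    batch = []
    for _ in range(batch_size):
        X_arr, y_arr = np.array(X_fict), np.array(y_fict)
        scores = [acquisition(x, X_arr, y_arr) for x in candidates]
        x_next = candidates[int(np.argmax(scores))]
        y_hat = impute(x_next, X_arr, y_arr)      # fictitious observation
        batch.append(x_next)
        X_fict.append(x_next)
        y_fict.append(y_hat)
    return batch

# constant liar (11.12): impute a constant, here the "most optimistic" choice y.max()
def constant_liar(x, X, y):
    return y.max()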
Seven steps of sequential simulation with the expected improvement acquisition function (7.2) and the kriging believer strategy (11.11) are demonstrated in figure 11.4.
Figure 11.5: Batch construction via local penalization of the expected improvement acquisition function (7.2). We select the first point by maximizing the expected improvement, after which the acquisition function is multiplied by a penalty factor discouraging future batch members in that area. We then maximize the updated acquisition function, penalize, and repeat as desired. The bottom panel indicates the locations of several further points selected in this manner.
The selected points appear reasonable: the first two exploit the best-seen observation (and are near optimal for a batch size of two; see figure 11.3), the next two exploit another local optimum, and the remainder explore the domain.
We then incorporate a multiplicative penalty 𝜑 (𝑥; 𝑥 1 ) into the acquisition function discouraging future batch members from being in a neighborhood of the initial point. This penalty is designed to avoid redundancy between batch members without disrupting the differentiability of the original acquisition function. gonzález et al. describe one simple and effective penalty function derived from an estimate of the global maximum and a Lipschitz constant of the objective function.21 We now select the second batch member by maximizing the penalized acquisition function
21 Any continuously differentiable function on a compact set is Lipschitz continuous.
We may approximate this expectation via Monte Carlo integration23 by sampling from this belief. It is convenient to do so by sampling 𝑛 vectors {z𝑖 } from a standard normal distribution, then transforming these via

z𝑖 ↦ 𝝁 + 𝚲z𝑖 = y𝑖 ,    (11.14)

As in the sequential case, we may reuse these samples to approximate the gradient of the acquisition function with respect to the proposed observation locations, under mild assumptions (c.4–c.5):

𝜕𝛽/𝜕x ≈ (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝜕Δ/𝜕x (x, y𝑖 );    𝜕Δ/𝜕𝑥 𝑗 (x, y𝑖 ) = [𝜕Δ/𝜕𝑥 𝑗 ]_y + (𝜕Δ/𝜕y) (𝜕𝝁/𝜕𝑥 𝑗 + (𝜕𝚲/𝜕𝑥 𝑗 ) z𝑖 ),    (11.16)

23 Gauss–Hermite quadrature, as we recommended in the one-dimensional case, does not scale well with dimension.
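A sketch of the sampling step (11.14) and of a Monte Carlo acquisition estimate built from it; here 𝚲 is taken to be the Cholesky factor of the predictive covariance (a standard choice, assumed rather than stated above), and the per-sample gain Δ, the incumbent value, and the toy predictive moments are illustrative. The same samples could also be reused for a gradient estimate as in (11.16).

import numpy as np

def sample_outcomes(mu, Sigma, n, rng):
    # reparameterization (11.14): y_i = mu + L z_i with z_i ~ N(0, I)
    L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(len(mu)))
    z = rng.standard_normal((n, len(mu)))
    return mu + z @ L.T

def batch_acquisition_mc(delta, mu, Sigma, n=1024, seed=0):
    # Monte Carlo estimate of the expected per-sample gain Delta over outcomes y
    ys = sample_outcomes(mu, Sigma, n, np.random.default_rng(seed))
    return np.mean([delta(y) for y in ys])

# toy example: expected improvement of a batch of two points over an incumbent
incumbent = 1.0
mu = np.array([0.8, 0.9])
Sigma = np.array([[0.25, 0.10],
                  [0.10, 0.25]])
print(batch_acquisition_mc(lambda y: max(y.max() - incumbent, 0.0), mu, Sigma))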
Figure 11.6: The batch knowledge gradient acquisition function for an example scenario. The optimal batch exploits the local optimum, but any batch containing at least one point in that neighborhood is near-optimal.
Probability of improvement

Batch probability of improvement has received relatively little attention, but jones proposed one simple option in the context of threshold selection.32, 33 The idea is to find the optimal sequential decisions using a range of improvement thresholds, representing a spectrum of exploration–exploitation tradeoffs. jones then recommends a simple clustering procedure to remove redundant points, resulting in a batch (of variable size)
32 See § 7.5, p. 134.
33 d. r. jones (2001). A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization 21(4):345–383.
Thompson sampling

The stochastic nature of Thompson sampling enables trivial batch construction by drawing 𝑏 independent samples of the location of the global maximum (7.19):

{𝑥𝑖 } ∼ 𝑝 (𝑥 ∗ | D).

A remarkable advantage of this policy is that the samples may be generated entirely in parallel, which allows linear scaling to arbitrarily large batch sizes. Batch Thompson sampling has delivered impressive performance in a real-world setting with batch sizes up to 𝑏 = 500,40 and is backed by theoretical guarantees on the asymptotic reduction of the simple regret (10.1).41
40 j. m. hernández-lobato et al. (2017). Parallel and Distributed Thompson Sampling for Large-scale Accelerated Exploration of Chemical Space. icml 2017.
41 k. kandasamy et al. (2018). Parallelised Bayesian Optimisation via Thompson Sampling. aistats 2018.
This is simply the one-step marginal gain for the combined batch x ∪ x′
from the synchronous case (11.7)! The only difference with respect to
one-step lookahead is that we can only maximize this score with respect
to x as we are already committed to the pending experiments. Thus a
one-step lookahead policy for the asynchronous case can be reduced to
maximizing the corresponding score from the synchronous case with
some batch members fixed. This reduction has been pointed out by
numerous authors, and effectively every batch policy discussed above
may be modified with little effort to work in the asynchronous case.
Moving beyond one-step lookahead may be extremely challenging,
however, due to the implications of uncertainty in the order of termina-
tion for pending experiments. A full treatment would require a model for
the time of termination and accounting for how the decision tree may
branch after the present decision. Exact computation of even two-step
𝐾𝑖 𝑗 = cov[𝑓𝑖 , 𝑓𝑗 ]
Figure 11.7: Cost-adjusted multifidelity expected improvement for a toy scenario. The objective (left) and its surrogate (right) are modeled as a joint Gaussian process with marginals equivalent to the example from figure 7.1 and constant cross-correlation 0.8. The cost of observation was assumed to be ten times greater for the objective than its surrogate. Maximizing the cost-adjusted multifidelity expected improvement elects to continue evaluating the surrogate.

Figure 11.8: A simulation of optimization with the cost-adjusted multifidelity expected improvement starting from the scenario in figure 11.7. We simulate sequential observations of either the objective or the surrogate, illustrated using the running tick marks. The light gray marks correspond to surrogate observations and the heavy black marks objective observations. The optimum was found after 10 objective and 32 surrogate observations, marked in heavy red. The prior and posterior of the objective function conditioned on all observations are also shown.
Figure 11.8 simulates sequential multifidelity optimization using this policy; here the optimum was discovered after only 10 evaluations of the objective, guided by 32 observations of the cheaper surrogate. A remarkable feature we can see in the posterior is that all evaluations made of the objective function are above the prior mean, nearly all with 𝑧-scores of approximately 𝑧 = 1 or greater. This can be ascribed not to extreme luck, but rather to efficient use of the surrogate to rule out regions unlikely to contain the optimum.
Multifidelity Bayesian optimization has enjoyed sustained interest from the research community, and numerous policies are available. These include adaptations of the expected improvement,50, 51 knowledge gradient,52 upper confidence bound,53, 54 and mutual information with 𝑥 ∗ 55, 56 and 𝑓 ∗ 57 acquisition functions, as well as novel approaches.58
50 d. huang et al. (2006a). Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization 32(5):369–382.
51 v. picheny et al. (2013a). Quantile-Based Optimization of Noisy Computer Experiments With Tunable Precision. Technometrics 55(1):2–13.
52 j. wu et al. (2019). Practical Multi-fidelity Bayesian Optimization for Hyperparameter Tuning. uai 2019.
53 k. kandasamy et al. (2016). Gaussian Process Bandit Optimisation with Multi-fidelity Evaluations. neurips 2016.
54 k. kandasamy et al. (2017). Multi-fidelity Bayesian Optimisation with Continuous Approximations. icml 2017.
55 k. swersky et al. (2013). Multi-Task Bayesian Optimization. neurips 2013.

11.6 multitask optimization

Multitask optimization addresses the sequential or simultaneous optimization of multiple objectives {𝑓𝑖 } : X → ℝ representing performance
Figure 11.9: A demonstration of sequential multitask optimization. Left: a prior distribution over an objective function along with 13 observations selected by maximizing expected improvement, revealing the global maximum with the last evaluation. Right: the posterior distribution over the same objective conditioned on the observations of the two functions (now interpreted as related tasks) in figure 11.8. The global maximum is now found after three observations due to the better informed prior.
only difference is that the objective function model is now informed by our past experience with other tasks; as a result, our initial optimization decisions can be more targeted.
This procedure is illustrated in figure 11.9. Both panels illustrate
the optimization of an objective function by sequentially maximizing
expected improvement. The left panel begins with no information and
locates the global optimum after 13 evaluations. The right panel begins
the process instead with a posterior belief about the objective conditioned
on the data obtained from the two functions in figure 11.8, modeled as
related tasks with cross-correlation 0.8. Due to the better informed initial
belief, we now find the global optimum after only three evaluations.
The case of simultaneous multitask optimization – where we may evaluate any task objective with each observation – requires somewhat more care. We must now design a utility function capturing our joint performance across the tasks and design each observation with respect to this utility. One simple option would be to select utility functions {𝑢𝑖 } quantifying performance on each task separately and then take a weighted average:

𝑢 (D) = ∑_𝑖 𝑤𝑖 𝑢𝑖 (D).

This could be further adjusted for (perhaps task- and/or input-dependent) observation costs if needed. Now we may write the expected marginal gain in this combined utility as a weighted average of the expected marginal gain on each separate task. One-step lookahead would then maximize the weighted acquisition function

𝛼 ([𝑥, 𝑗]; D) = ∑_𝑖 𝑤𝑖 𝛼𝑖 ([𝑥, 𝑗]; D)

over all possible observation location–task pairs [𝑥, 𝑗], where 𝛼𝑖 is the one-step expected marginal gain in 𝑢𝑖 . Note that observing a single task
Figure 11.10: A simple multiobjective optimization example with two objectives {𝑓1 , 𝑓2 } on a one-dimensional domain. We compare four identified points: 𝑥 1 , the global optimum of 𝑓1 ; 𝑥 2 , the global optimum of 𝑓2 ; 𝑥 3 , a compromise with relatively high values of both objectives; and 𝑥 4 , a point with relatively low values of both objectives.
Pareto optimality

Figure 11.11: The objectives from figure 11.10 with the Pareto optimal solutions highlighted. All points along the intervals marked on the horizontal axis are Pareto optimal, with the highlighted corresponding objective values forming the Pareto frontier (see margin).

To illustrate the tradeoffs we may need to consider during multiobjective optimization, consider the two objectives in figure 11.10. The first objective 𝑓1 has its global maximum at 𝑥 1 , which nearly coincides with the global minimum of the second objective 𝑓2 . The reverse is true in the other direction: the global maximum of the second objective, 𝑥 2 ,
achieves a relatively low value on the first objective. Neither point can be preferred over the other on any objective grounds, but a rational agent may have subjective preferences over these two locations depending on the relative importance of the two objectives. Some agents might even prefer a compromise location such as 𝑥 3 to either of these points, as it achieves relatively high – but suboptimal – values for both objectives.
It is clearly impossible to identify an unambiguously optimal location, even in this relatively simple example. We can, however, eliminate some locations as plainly subpar. For example, consider the point 𝑥 4 in figure 11.10. Assuming preferences are nondecreasing with each objective, no rational agent would prefer 𝑥 4 to 𝑥 3 as the latter point achieves higher value for both objectives. We may formalize this intuition by defining a partial order on potential solutions consistent with this reasoning.
We will say that a point 𝑥 dominates another point 𝑥 ′, denoted 𝑥 ′ ≺ 𝑥, if no objective value is lower at 𝑥 than at 𝑥 ′ and if at least one objective value is higher at 𝑥 than at 𝑥 ′. Assuming preferences are consistent with nondecreasing objective values, no agent could prefer a dominated point to any point dominating it. This concept is illustrated in the margin for the example from figure 11.10: all points in the blue regions are dominated, and in particular 𝑥 4 is dominated by both 𝑥 1 and 𝑥 3 . On the other hand, none of 𝑥 1 , 𝑥 2 , or 𝑥 3 is dominated by any of the other points.
A point 𝑥 ∈ X that is not dominated by any other point is called Pareto optimal, and the image of all Pareto optimal points is called the Pareto frontier. The Pareto frontier is a central concept in multiobjective optimization – it represents the set of all possible solutions to the problem consistent with weakly monotone preferences for the objectives. Figure 11.11 shows the Pareto optimal points for our example from figure 11.10, which span four disconnected intervals. We may visualize the Pareto frontier by plotting the image of this set, as shown in the margin.
The regions dominated by points 𝑥 1 and 𝑥 3 in figure 11.10 are highlighted in blue (see margin). 𝑥 4 is dominated by both, and 𝑥 2 by neither.
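The domination relation is straightforward to implement. The following sketch identifies the nondominated (Pareto optimal) members of a finite set of objective vectors, with toy values loosely mirroring the four points discussed above; the numbers themselves are illustrative.

import numpy as np

def pareto_optimal(values):
    # boolean mask of the nondominated rows of an (n, k) array of objective
    # values, where larger is better in every objective
    n = len(values)
    dominated = np.zeros(n, dtype=bool)
    for i in range(n):
        # some other point is >= in every objective and > in at least one
        better_eq = np.all(values >= values[i], axis=1)
        strictly = np.any(values > values[i], axis=1)
        dominated[i] = np.any(better_eq & strictly)
    return ~dominated

phi = np.array([[1.0, 0.10],    # x1: best on f1
                [0.2, 1.00],    # x2: best on f2
                [0.7, 0.70],    # x3: compromise
                [0.4, 0.05]])   # x4: dominated by both x1 and x3
print(pareto_optimal(phi))      # -> [ True  True  True False]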
There are several approaches to multiobjective optimization that differ in when preferences among competing solutions are elicited.62 So-called a posteriori methods seek to identify the entire Pareto frontier for a given problem, with preferences among possible solutions to be determined afterwards. In contrast, a priori methods assume that preferences are already predetermined, allowing us to seek a single Pareto
The Pareto frontier for the scenario in figures 11.10–11.11 (see margin). The four components correspond to the highlighted intervals in figure 11.11.
62 k. m. miettinen (1998). Nonlinear Multiobjective Optimization. Kluwer Academic Publishers.
tier is revealed by the observed data.66 Therefore it provides a sensible measure of progress for a posteriori multiobjective optimization.
The one-step marginal gain in this utility is known as expected hypervolume improvement (ehvi)67, 68 and serves as a popular acquisition function. This score is shown in figure 11.13 for our example; the optimal decision attempts to refine the central portion of the Pareto frontier, but many alternatives are almost as favorable due to the roughness of the current estimate. Computation of ehvi is involved, and its cost grows considerably with the number of objectives. The primary difficulty is enumerating and integrating with respect to the lower bound to the Pareto front, which can become a complex region in higher dimensions. However, efficient algorithms are available for computing ehvi69 and its gradient.70
66 m. fleischer (2003). The Measure of Pareto Optima: Applications to Multi-objective Metaheuristics. emo 2003.
67 m. t. m. emmerich et al. (2006). Single- and Multiobjective Evolutionary Optimization Assisted by Gaussian Random Field Metamodels. ieee Transactions on Evolutionary Computation 10(4):421–439.
68 w. ponweiser et al. (2008). Multiobjective Optimization on a Limited Budget of Evaluations Using Model-Assisted S-Metric Selection. ppsn x.
69 k. yang et al. (2019b). Multi-Objective Bayesian Global Optimization using expected hypervolume improvement gradient. Swarm and Evolutionary Computation 44:945–956.
70 k. yang et al. (2019a). Efficient computation of expected hypervolume improvement using box decomposition algorithms. Journal of Global Optimization 75(1):3–34.

Information-theoretic a posteriori methods

Several popular information-theoretic policies for single-objective optimization have been adapted to a posteriori multiobjective optimization. The key idea behind these methods is to approximate the mutual information between a joint observation of the objectives and either the set of Pareto optimal points (in the domain), X ∗, or the Pareto frontier (in the codomain), F ∗. These approaches operate by maximizing the predictive form of mutual information (8.35) and largely follow the parallel single-objective cases in their approximation.
Figure 11.14: A series of linear scalarizations of the example objectives from figure 11.10. The “blue end” of the color
spectrum is 𝑓1 , and the “red end” 𝑓2 , consistent with that plot.
Let 𝝓 denote the vector of objective function values at an arbitrary location 𝑥, defining 𝜙𝑖 = 𝑓𝑖 (𝑥). A scalarization function 𝑔 : 𝑥 ↦ 𝑔(𝝓) ∈ ℝ maps locations in the domain to scalars determined by their objective values. We may interpret the output as defining preferences over locations in a natural manner: namely, if 𝑔(𝝓) > 𝑔(𝝓′), then the outcomes at 𝑥 are preferred to those at 𝑥 ′ in the scalarization. With this interpretation, a scalarization function allows us to recast multiobjective optimization as a single-objective problem by maximizing 𝑔 with respect to 𝑥.
A scalarization function can in principle be arbitrary, and a priori multiobjective optimization can be framed in terms of maximizing any such function. However, several tunable scalarization functions have been described in the literature that may be used in a general context. A straightforward and intuitive example is linear scalarization:

𝑔lin (𝑥; w) = ∑_𝑖 𝑤𝑖 𝜙𝑖 ,    (11.19)
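A tiny sketch of linear scalarization (11.19): sweeping the weight vector changes which point is preferred, which is how a priori preferences select a single solution. The objective values reuse the illustrative numbers from the Pareto sketch above.

import numpy as np

def linear_scalarization(phi, w):
    # g_lin(x; w) = sum_i w_i phi_i  (11.19), applied to each row of phi
    return phi @ w

phi = np.array([[1.0, 0.1],    # x1
                [0.2, 1.0],    # x2
                [0.7, 0.7]])   # x3
for w1 in (0.9, 0.5, 0.1):
    w = np.array([w1, 1 - w1])
    print(w, "-> preferred point:", int(np.argmax(linear_scalarization(phi, w))))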
Figure 11.15: The probability of nondominance by the available data, Pr(𝑥 ⊀ x | D), for our running example.
Other approaches

zuluaga et al. outlined an intriguing approach to a posteriori multiobjective optimization76 wherein the problem was recast as an active learning problem.77 Namely, the authors considered the binary classification problem of predicting whether a given observation location was (approximately) Pareto optimal or not, then designed observations to maximize expected performance on this task. Their algorithm is supported by theoretical bounds on performance and performed admirably compared with knowles’s algorithm discussed above.73
picheny proposed a spiritually similar approach also based on sequentially reducing a measure of uncertainty in the Pareto frontier.78 This probability is plotted for our running example in figure 11.15; here, there is a significant probability for many points to be nondominated by the rather sparse available data, indicating a significant degree of uncertainty in our understanding of the Pareto frontier. However, as the Pareto frontier is increasingly well determined by the data, this probability will vanish globally and the utility above will tend toward its maximal value of zero. After motivating this score, picheny proceeds to recommend designing observations via one-step lookahead.
76 m. zuluaga et al. (2016). 𝜀-pal: An Active Learning Approach to the Multi-Objective Optimization Problem. Journal of Machine Learning Research 17(104):1–32.
77 b. settles (2012). Active Learning. Morgan & Claypool.
78 v. picheny (2015). Multiobjective optimization using Gaussian process emulators via stepwise uncertainty reduction. Statistics and Computing 25(6):1265–1280.
where the updated dataset D1 will reflect the entire observation (𝑥, 𝑦, g).
Induction on the horizon gives the optimal policy as usual.
Figure 11.16 compares derivative-aware and derivative-unaware ver-
sions of the knowledge gradient (7.4) (assuming exact observation) for an
example scenario. The derivative-aware version dominates the derivative-
unaware one, as the acquisition of more information naturally leads to a
greater expected marginal gain in utility. When derivative information
is unavailable, the optimal decision is to evaluate near the previously
best-seen point, effectively to estimate the derivative via finite differ-
encing; when derivative information is available, an observation at this
Figure 11.16: The knowledge gradient acquisition function for an example scenario reflecting the expected gain in global
reward (6.5) provided exact observations of the objective function and its derivative. The vanilla knowledge
gradient (7.4) based on an observation of the objective alone is shown for reference.
location yields effectively the same expected gain. However, we can fare
even better in expectation by moving a bit farther astray, where the
derivative information is less redundant.
When working with Gaussian process models in high dimension, it may not be wise to augment each observation of the objective with a full observation of the gradient due to the cubic scaling of inference. However, the scheme outlined above opens the door to consider the observation of any measurement related to the gradient. For example, we might condition on the value of a directional derivative, reducing the measurement to a scalar regardless of dimension and limiting computation. Such a scheme was promoted by wu et al., who considered the acquisition of a single coordinate of the gradient; this could be extended to non-axis-aligned directional derivatives without major complication.80 We could also consider a “multifidelity” extension where we weigh various possible gradient observations (including none at all) in light of their expected utility and the cost of acquisition/inference.

scaling of Gaussian process inference: § 9.1, p. 201
80 j. wu et al. (2017). Bayesian Optimization with Gradients. neurips 2017.
Stochastic optimization
In stochastic optimization,81 we seek to optimize the expected performance of the system of interest (also known as the integrated response82) under an assumed distribution for the environmental parameters, 𝑝(𝜔):

𝑓 = 𝔼𝜔[𝑔] = ∫ 𝑔(𝑥, 𝜔) 𝑝(𝜔) d𝜔. (11.23)

Optimizing this objective presents a challenge: in general, the expectation with respect to 𝜔 cannot be evaluated directly but only estimated via repeated trials in different environments, and estimating the expected performance (11.23) with some degree of precision may require numerous evaluations of 𝑔(𝑥, 𝜔). However, when 𝑔 itself is expensive to evaluate – for example, if every trial requires manual manipulation of a robot and its environment before we can measure performance – this may not be the most efficient approach, as we may waste significant resources shoring up our belief about suboptimal values.

81 This is a heavily overloaded phrase, as it is used as an umbrella term for optimization involving any random elements. It is also used, for example, to describe optimization from noisy observations, to include methods such as stochastic gradient descent. As noisy observations are commonplace in Bayesian optimization, we reserve the phrase for stochastic environmental parameters only.
82 b. j. williams et al. (2000). Sequential Design of Computer Experiments to Minimize Integrated Response Functions. Statistica Sinica 10(4):1133–1152.
Instead, when we can control the environment during optimization at will, we can gain some traction by designing a sequence of free parameter–environment pairs, potentially changing both configuration and environment in each iteration. The most direct way to design such an algorithm in our framework would be to model the environmental-conditional objective function 𝑔 directly and define a utility function and policy with respect to this function, in light of the environmental-marginal objective 𝑓 (11.23) and its induced Gaussian process distribution. A Gaussian process model on 𝑔 is particularly practical in this regard, as we may then use Bayesian quadrature to seamlessly estimate and quantify our uncertainty in the objective 𝑓 (11.23) from observations.83

Bayesian quadrature: § 2.6, p. 33
83 If 𝑔 has distribution GP(𝜇, 𝐾), then 𝑓 has distribution GP(𝑚, 𝐶) with:
𝑚 = ∫ 𝜇(𝑥, 𝜔) 𝑝(𝜔) d𝜔;  𝐶 = ∬ 𝐾([𝑥, 𝜔], [𝑥′, 𝜔′]) 𝑝(𝜔) 𝑝(𝜔′) d𝜔 d𝜔′.
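As a rough illustration of the quantities in the footnote above, the sketch below uses plain Monte Carlo (a crude stand-in for Bayesian quadrature, which would handle these integrals in closed form for suitable kernels) to estimate the environmental-marginal mean 𝑚(𝑥) from a hypothetical posterior mean function for 𝑔, assuming a standard normal 𝑝(𝜔):

```python
import numpy as np

rng = np.random.default_rng(0)

def mu(x, omega):
    # Hypothetical posterior mean of g(x, omega), standing in for a fitted GP mean.
    return np.sin(3 * x) + 0.5 * x * np.cos(2 * omega)

def marginal_mean(x, n_samples=100_000):
    """Monte Carlo estimate of m(x) = E_omega[mu(x, omega)] with p(omega) standard normal."""
    omegas = rng.standard_normal(n_samples)
    return float(np.mean(mu(x, omegas)))

print(marginal_mean(0.5))
```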
former case, we seek to optimize a (lower) quantile of the distribution of 𝑔(𝑥, 𝜔), rather than its mean (11.23); see the figure in the margin:

var(𝑥; 𝜋) = inf { 𝛾 | Pr(𝑔(𝑥, 𝜔) ≤ 𝛾 | 𝑥) ≥ 𝜋 }.

The value-at-risk (at the 𝜋 = 5% level) for an example distribution of 𝑔(𝑥, 𝜔). The var is a pessimistic lower bound on performance holding with probability 1 − 𝜋.
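For intuition, the value-at-risk can be estimated empirically by simulating trials in random environments and taking the 𝜋-quantile; the sketch below does this for a hypothetical 𝑔 and a standard normal environment distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x, omega):
    # Hypothetical environmental-conditional objective.
    return -(x - 0.5) ** 2 + 0.1 * omega

def value_at_risk(x, pi=0.05, n_samples=100_000):
    """Empirical var(x; pi): the pi-quantile of g(x, omega) under omega ~ p(omega)."""
    samples = g(x, rng.standard_normal(n_samples))
    return float(np.quantile(samples, pi))

# A pessimistic bound on performance at x = 0.4 holding with probability 1 - pi.
print(value_at_risk(0.4))
```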
however, they also extend their algorithm to robust objectives of the form given in (11.25).

a simple algorithm based on upper/lower confidence bounds with strong convergence guarantees
kirschner et al. considered a related adversarial setting: distribution-
ally robust optimization, where we seek to optimize effectively without
perfect knowledge of the environment distribution. They considered an
objective of the following form:
𝑓(𝑥) = inf_{𝑝 ∈ P} ∫ 𝑔(𝑥, 𝜔) 𝑝(𝜔) d𝜔.
the loss of the network after the weights have converged.96 If we don't treat this objective function as a black box but take a peek inside, we may interpret it as the limiting value of the learning curve defined by the

This idea of abandoning stragglers based on early progress is also the basis of the bandit-based hyperband algorithm.100

96 One could consider early stopping in the training procedure as well, but this simple example is useful for exposition.
100 l. li et al. (2018b). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research 18(185):1–52.
11.11 non-gaussian observation models and active search
Throughout this book, we have focused almost exclusively on the additive
Gaussian noise observation model. There are good reasons for this: it is
a reasonably faithful model of many systems and offers exact inference
with Gaussian process models of the objective function. However, the
assumption of Gaussian noise is not always warranted and may be
fundamentally incompatible with some scenarios.
Fortunately, the decision-theoretic core of most Bayesian optimization approaches does not make any assumptions regarding our model of the objective function or our observations of it, and with some care the utility functions we developed for optimization can be adapted for virtually any scenario. Further, there are readily available pathways for incorporating non-Gaussian observation models into Gaussian process objective function models, so we do not need to abandon that rich model class in order to use alternatives.

decision theory for optimization: chapter 5, p. 87
utility functions for optimization: chapter 6, p. 109
gp inference with non-Gaussian observation models: § 2.8, p. 35
where for this expression we have assumed binary outcomes 𝑦 ∈ {0, 1} with 𝑦 = 1 interpreted as indicating success.105 The authors then derived a policy for this setting via one-step lookahead.

105 Note that the simple reward utility function (6.3) we have been working with can be written in this exact form if we assume any additive observation noise has zero mean.
Another setting involving binary (or categorical) feedback is in optimizing human preferences, such as in a/b testing or user modeling. Here we might seek to optimize user preferences by repeatedly presenting a panel of options and asking for the most preferred item. houlsby et al. described a convenient reduction from preference learning to classification for Gaussian processes that allows the immediate use of standard policies such as expected improvement,106, 104 although more sophisticated policies have also been proposed specifically for this setting.107

106 n. houlsby et al. (2012). Collaborative Gaussian Processes for Preference Learning. neurips 2012.
107 See appendix d, p. 329 for related references.
Active search
garnett et al. introduced active search as a simple model of scientific discovery in a discrete domain X = {𝑥𝑖}.108 In active search, we assume that among these points is hidden a rare, valuable subset exhibiting desirable properties for the task at hand. Given access to an oracle that can – at significant cost – determine whether an identified point belongs to the sought-after class, the problem of active search is to design a sequence of experiments seeking to maximize the number of discoveries in a given budget. A motivating application is drug discovery, where the domain would represent a list of candidate molecules to search for those rare examples exhibiting significant binding activity with a chosen biological target. As the space of candidates is expansive and the cost of even virtual screening is nontrivial, intelligent experimental design has the potential to greatly improve the rate of discovery.

108 r. garnett et al. (2012). Bayesian Optimal Active Search and Surveying. icml 2012.
virtual screening: appendix d, p. 314
To derive an active search policy in our framework, we must first model the observation process and determine a suitable utility function. The former requires consideration of the nuances of a given situation, but we may provide a barebones construction that is already sufficient to be of practical and theoretical interest. Given a discrete domain X, we assume there is some identifiable subset V ⊂ X of valuable points we wish to recover. We associate with each point 𝑥 ∈ X a binary label 𝑦 = [𝑥 ∈ V] indicating whether 𝑥 is valuable (𝑦 = 1) or not (𝑦 = 0). A natural observation model is then to assume that selecting a point 𝑥 for investigation reveals this binary label 𝑦 in response.109 Finally, we may define a natural utility function for active search by assuming that, all other things being held equal, we prefer a dataset containing more valuable points to one with fewer:

𝑢(D) = Σ_{𝑥 ∈ D} 𝑦. (11.27)

modeling observations
utility function
109 Other situations may call for other approaches; for example, if value is determined by thresholding a continuous measurement, we may wish to model that continuous observation process explicitly.
This is simply the cumulative reward utility (6.7), which here can be interpreted as counting the number of valuable points discovered.110

To proceed with the Bayesian decision-theoretic approach, we must build a model for the uncertain elements inherent to each decision. Here the primary object of interest is the predictive posterior distribution Pr(𝑦 = 1 | 𝑥, D), the posterior probability that a given point 𝑥 is valuable. We may build this model in any number of ways, for example by combining a Gaussian process prior on a latent function with an appropriate choice of observation model.111

posterior predictive probability, Pr(𝑦 = 1 | 𝑥, D)
110 The assumption of a discrete domain is to avoid repeatedly observing effectively the same point to trivially “max out” this score.
111 c. e. rasmussen and c. k. i. williams (2006). Gaussian Processes for Machine Learning. mit Press. [chapter 3]
Equipped with a predictive model, deriving the optimal policy is a simple exercise. To begin, the one-step marginal gain in utility (5.8) is simply the probability of success, Pr(𝑦 = 1 | 𝑥, D); that is, the optimal one-step decision is to greedily maximize the probability of success. Although this is a simple (and somewhat obvious) policy that can perform well in practice, theoretical and empirical study on active search has established that massive gains can be had by adopting less myopic policies.

On the theoretical side, garnett et al. demonstrated by construction that the expected performance of any lookahead approximation can be exceeded by any arbitrary amount by extending the lookahead horizon even a single step.112 This result was strengthened by jiang et al., who showed – again by construction – that no policy that can be computed in time polynomial in |X| can approximate the performance of the optimal policy within any constant factor.113 Thus the optimal active search policy is not only hard to compute, but also hard to approximate. These theoretical results, which rely on somewhat unnatural adversarial constructions, have been supported by empirical investigations on real-world data as well.113, 114 For example, garnett et al. demonstrated that simply using two-step instead of one-step lookahead can significantly accelerate virtual screening for drug discovery across a broad range of biological targets.114

112 r. garnett et al. (2012). Bayesian Optimal Active Search and Surveying. icml 2012.
113 s. jiang et al. (2017). Efficient Nonmyopic Active Search. icml 2017.
114 r. garnett et al. (2015). Introducing the ‘active search’ method for iterative virtual screening. Journal of Computer-Aided Molecular Design 29(4):305–314.
cost of computing optimal policy: § 5.3, p. 99
This is perhaps a surprising state of affairs given the success of one-step lookahead – and the relative lack of less myopic alternatives – for traditional optimization. We can resolve this discrepancy by noting that utility functions used in that setting, such as the simple reward (6.3), inherently exhibit decreasing marginal gains as they are effectively bounded by the global maximum. Further, such a utility tends to remain relatively constant throughout optimization, punctuated by brief but significant increases when a new local maximum is discovered. On the other hand, for the cumulative reward (11.27), every observation has the potential to increase the utility by exactly one unit. As a result, every observation is on equal footing in terms of potential impact, and there is increased pressure to consider the entire search trajectory when designing each observation.

moving beyond one-step lookahead: § 7.10, p. 150
One thread of research on active search has focused on developing efficient yet nonmyopic policies grounded in approximate dynamic programming, such as lookahead beyond one step.112 jiang et al. proposed one significantly less myopic alternative policy based on batch rollout.113 The key observation is that we may construct the one-step optimal batch observation of size 𝑘 by computing the posterior predictive probability Pr(𝑦 = 1 | 𝑥, D) for the unlabeled points, sorting, and taking the top 𝑘; this is a consequence of linearity of expectation and utility (11.27). With this, we may realize an efficient batch rollout policy for horizon 𝜏 by maximizing the acquisition function

Pr(𝑦 = 1 | 𝑥, D) + 𝔼𝑦 [ Σ′_{𝜏−1} Pr(𝑦′ = 1 | 𝑥′, D1) | 𝑥, D ]. (11.28)

batch rollout: § 5.3, p. 103

Here the sum-with-prime notation Σ′_{𝜏−1} indicates the sum of the top-(𝜏 − 1) values over the unlabeled data – the expected utility of the optimal
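A minimal sketch of the score (11.28) follows, assuming a hypothetical model interface predictive_prob(points, data) returning Pr(𝑦 = 1 | 𝑥, D) for each candidate; the fantasized label 𝑦 is marginalized explicitly and the top-(𝜏 − 1) updated probabilities are summed:

```python
import numpy as np

def batch_rollout_score(x, unlabeled, data, predictive_prob, horizon):
    """Sketch of the acquisition (11.28): the current success probability plus
    the expected sum of the top-(horizon - 1) updated probabilities over the
    remaining unlabeled points. `predictive_prob(points, data)` is an assumed
    model interface returning Pr(y = 1 | point, data) for each point."""
    p_now = predictive_prob([x], data)[0]
    remaining = [z for z in unlabeled if z != x]
    expected_future = 0.0
    for y, weight in ((1, p_now), (0, 1.0 - p_now)):
        fantasy = data + [(x, y)]                   # fantasized observation
        probs = predictive_prob(remaining, fantasy)
        top = np.sort(probs)[::-1][: horizon - 1]   # top-(tau - 1) values
        expected_future += weight * np.sum(top)
    return p_now + expected_future

# Toy usage with a crude nearest-neighbor "model" (purely illustrative).
def toy_predictive_prob(points, data):
    if not data:
        return np.full(len(points), 0.5)
    xs = np.array([u for u, _ in data])
    ys = np.array([v for _, v in data])
    return np.array([0.1 + 0.8 * ys[np.argmin(np.abs(xs - p))] for p in points])

unlabeled = list(np.linspace(0.0, 1.0, 11))
data = [(0.2, 1)]
scores = [batch_rollout_score(x, unlabeled, data, toy_predictive_prob, horizon=5)
          for x in unlabeled]
print(unlabeled[int(np.argmax(scores))])
```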
subject which has now enjoyed a century of study. Numerous excellent references are available.5, 6

Early work in optimal design did not consider the possibility of adaptively designing a sequence of experiments, an essential feature of Bayesian optimization. Instead, the focus was optimizing fixed designs to minimize some measure of uncertainty when performing inference with the resulting data. This paradigm is practical when experiments are extremely time consuming but can easily run in parallel, such as the agricultural experiments studied extensively by fisher. The most common goals considered in classical optimal design are accurate estimation of model parameters and confident prediction at unseen locations; these goals are usually formulated in terms of optimizing some statistical criterion as a function of the design.7 However, in 1941, hotelling notably studied experimental designs for estimating the location of the maximum of an unknown function in this nonadaptive setting.8 This was perhaps the first rigorous treatment of batch optimization.

3 k. smith (1918). On the Standard Deviations of Adjusted and Interpolated Values of an Observed Polynomial Function and its Constants and the Guidance they Give Towards a Proper Choice of the Distribution of Observations. Biometrika 12(1–2):1–85.
4 r. a. fisher (1935). The Design of Experiments. Oliver and Boyd.
5 g. e. p. box et al. (2005). Statistics for Experimenters: Design, Innovation, and Discovery. John Wiley & Sons.
6 d. c. montgomery (2019). Design and Analysis of Experiments. John Wiley & Sons.
7 These criteria often have alphabetic names: a-optimality, d-optimality, v-optimality, etc.
8 h. hotelling (1941). Experimental Determination of the Maximum of a Function. The Annals of Mathematical Statistics 12(1):20–45.
THE GAUSSIAN DISTRIBUTION
This material will be published by Cambridge University Press as Bayesian Optimization. 295
This prepublication version is free to view and download for personal use only. Not for
redistribution, resale, or use in derivative works. ©Roman Garnett 2022.
the gaussian distribution
We can write the pdf and cdf of an arbitrary nondegenerate Gaussian N(𝑥; 𝜇, 𝜎²) in terms of the standard normal pdf and cdf by appropriately rescaling and translating arguments to their 𝑧-scores (a.1):

𝑝(𝑥) = (1/𝜎) 𝜙((𝑥 − 𝜇)/𝜎);  Pr(𝑥 < 𝑦) = Φ((𝑦 − 𝜇)/𝜎),

where the multiplicative factor in the pdf guarantees normalization.

expressing arbitrary pdfs and cdfs in terms of the standard normal distribution
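These identities are easy to check numerically; for example, using scipy's standard normal (the parameter values below are arbitrary):

```python
from scipy.stats import norm

mu, sigma = 1.5, 2.0
x, y = 0.7, 2.3

# pdf of N(mu, sigma^2) via the standard normal pdf of the z-score of x
assert abs(norm(mu, sigma).pdf(x) - norm.pdf((x - mu) / sigma) / sigma) < 1e-12
# cdf of N(mu, sigma^2) via the standard normal cdf of the z-score of y
assert abs(norm(mu, sigma).cdf(y) - norm.cdf((y - mu) / sigma)) < 1e-12
print("identities verified")
```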
Affine transformations
The family of univariate Gaussian distributions is closed under affine
transformations, which simply translate and rescale the distribution and
adjust its moments accordingly. If 𝑥 has distribution N(𝑥; 𝜇, 𝜎²), then the transformation 𝜉 = 𝑎𝑥 + 𝑏 has distribution

𝑝(𝜉) = N(𝜉; 𝑎𝜇 + 𝑏, 𝑎²𝜎²).
As before, this will serve as the basis of the general multivariate case by considering arbitrary affine transformations of this “standard” example. Suppose x ∈ ℝ𝑑 is a vector-valued random variable and z ∈ ℝ𝑘, 𝑘 ≤ 𝑑, is a 𝑘-dimensional standard multivariate normal vector. If we can write

x = 𝝁 + 𝚲z (a.5)

for some vector 𝝁 ∈ ℝ𝑑 and 𝑑 × 𝑘 matrix 𝚲, then x has a multivariate normal distribution. We can compute its mean and covariance directly from (a.4–a.5):

constructing general case via affine transformations of standard normal vectors
This property completely characterizes the distribution. As in the univariate case, we can interpret every multivariate normal distribution as an affine transformation of a (possibly lower-dimensional) standard normal vector. We again parameterize this family by its first two moments: the mean vector 𝝁 and the covariance matrix 𝚺. This covariance matrix is necessarily symmetric and positive semidefinite, which means all its eigenvalues are nonnegative. We can factor any such matrix as 𝚺 = 𝚲𝚲⊤, allowing us to recover the underlying transformation (a.5), although 𝚲 need not be unique.

parameterization in terms of first two moments
mean vector and covariance matrix, (𝝁, 𝚺)
positive semidefinite
and Δ represents the Mahalanobis distance, a multivariate analog of the (absolute) 𝑧-score (a.2):

Δ = √( (x − 𝝁)⊤ 𝚺⁻¹ (x − 𝝁) ).

It is easy to verify that these definitions are compatible with the univariate case when 𝑑 = 1. Note in that case the condition of 𝚺 being positive definite reduces to the previous condition for nondegeneracy, 𝜎² > 0.

The dependence of the multivariate Gaussian density on x is entirely through the value of the Mahalanobis distance Δ. To gain some geometric insight into the probability density, we can set this value to a constant and compute isoprobability contours. In the case of the standard multivariate

Mahalanobis distance, Δ
compatibility with univariate case
Affine transformations

The multivariate Gaussian distribution has a number of convenient mathematical properties, many of which follow immediately from the characterization in (a.5). First it is obvious that any affine transformation of a multivariate normal distributed vector is also multivariate normal, as affine transformations are closed under composition. If 𝑝(x) = N(x; 𝝁, 𝚺), then 𝝃 = Ax + b has distribution

𝑝(𝝃) = N(𝝃; A𝝁 + b, A𝚺A⊤). (a.9)

distribution of affine transformation Ax + b

Considering the augmented map

x ↦ [I; A] x + [0; b] = [x; 𝝃],
we can see that x and 𝝃 in fact have a joint Gaussian distribution:

𝑝(x, 𝝃) = N( [x; 𝝃]; [𝝁; A𝝁 + b], [𝚺, 𝚺A⊤; A𝚺, A𝚺A⊤] ). (a.10)

joint distribution with affine transformations
Sampling
The characterization of the multivariate normal in terms of affine trans-
formations of standard normal random variables (a.5) also suggests a
simple algorithm for drawing samples from the distribution. Given an
arbitrary multivariate normal distribution N (x; 𝝁, 𝚺), we first factor
the covariance as 𝚺 = 𝚲𝚲⊤, where 𝚲 has size 𝑑 × 𝑘; when 𝚺 is positive
definite, the Cholesky decomposition is the canonical choice. We now
sample a 𝑘-dimensional standard normal vector z by sampling each entry
independently from a univariate standard normal; routines for this task
are readily available. Finally, we transform this vector appropriately to
provide a sample of x: z ↦→ 𝝁 + 𝚲z = x. This procedure entails one-time
O(𝑑 3 ) work to compute 𝚲 (which can be reused), followed by O(𝑑 2 )
work to produce each sample.
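A minimal implementation of this sampling procedure, assuming a positive definite covariance so that the Cholesky factorization applies:

```python
import numpy as np

def sample_mvn(mu, Sigma, n, rng=None):
    """Draw n samples from N(mu, Sigma) via the affine characterization (a.5)."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(Sigma)           # one-time O(d^3) factorization
    z = rng.standard_normal((n, len(mu)))   # n standard normal vectors
    return mu + z @ L.T                     # map each z to mu + L z

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
samples = sample_mvn(mu, Sigma, 100_000, np.random.default_rng(0))
print(samples.mean(axis=0))   # should approach mu
print(np.cov(samples.T))      # should approach Sigma
```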
Marginalization
Often we will have a vector x with a multivariate Gaussian distribution, but only be interested in reasoning about a subset of its entries. Suppose 𝑝(x) = N(x; 𝝁, 𝚺), and partition the vector into two components:2

x = [x1; x2]. (a.11)

We partition the mean vector and covariance matrix in the same way:

𝑝(x) = N( [x1; x2]; [𝝁1; 𝝁2], [𝚺11, 𝚺12; 𝚺21, 𝚺22] ). (a.12)

The marginal distribution of x1 is then simply 𝑝(x1) = N(x1; 𝝁1, 𝚺11). (a.13)

2 We can first permute x if required; as a linear transformation, this will simply permute the entries of 𝝁 and the rows/columns of 𝚺.
A bivariate Gaussian pdf 𝑝(𝑥1, 𝑥2) (top) and the Gaussian marginal 𝑝(𝑥1) (bottom) (a.13).

Conditioning

Multivariate Gaussian distributions are also closed under conditioning on the values of given entries. Suppose again that 𝑝(x) = N(x; 𝝁, 𝚺) and partition x, 𝝁, and 𝚺 as before (a.11–a.12). Suppose now that we learn the exact value of the subvector x2. The posterior on the remaining entries 𝑝(x1 | x2) remains Gaussian, with distribution

𝑝(x1 | x2) = N(x1; 𝝁1|2, 𝚺11|2).
The posterior mean and covariance take the form of updates to the prior moments in light of the revealed information:

𝝁1|2 = 𝝁1 + 𝚺12 𝚺22⁻¹ (x2 − 𝝁2);  𝚺11|2 = 𝚺11 − 𝚺12 𝚺22⁻¹ 𝚺21. (a.14)
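The conditioning formulas (a.14) translate directly into a few lines of linear algebra; a small sketch with arbitrary example moments:

```python
import numpy as np

def condition_mvn(mu, Sigma, idx_keep, idx_obs, x_obs):
    """Posterior moments (a.14) of x1 = x[idx_keep] given x2 = x[idx_obs] = x_obs."""
    mu1, mu2 = mu[idx_keep], mu[idx_obs]
    S11 = Sigma[np.ix_(idx_keep, idx_keep)]
    S12 = Sigma[np.ix_(idx_keep, idx_obs)]
    S22 = Sigma[np.ix_(idx_obs, idx_obs)]
    gain = S12 @ np.linalg.inv(S22)          # Sigma_12 Sigma_22^{-1}
    mu_post = mu1 + gain @ (x_obs - mu2)
    Sigma_post = S11 - gain @ S12.T
    return mu_post, Sigma_post

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
# Condition the first two entries on observing the third at 0.5.
print(condition_mvn(mu, Sigma, [0, 1], [2], np.array([0.5])))
```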
Suppose x and y are 𝑑-dimensional random vectors with joint multivariate normal distribution

𝑝(x, y) = N( [x; y]; [𝝁; 𝝂], [𝚺, P; P, T] ).

Then recognizing their sum z = x + y = [I, I] [x; y] as a linear transformation and applying (a.9), we have:

𝑝(z) = N(z; 𝝁 + 𝝂, 𝚺 + 2P + T).

The pdfs of a bivariate Gaussian 𝑝(𝑥1, 𝑥2) (top) and the conditional distribution 𝑝(𝑥1 | 𝑥2) given the value of 𝑥2 marked by the dotted line (bottom) (a.14). The prior marginal distribution 𝑝(𝑥1) is shown for reference; the observation decreased both the mean and standard deviation.
Differential entropy
The differential entropy of a multivariate normal random variable x with
distribution 𝑝 (x) = N (x; 𝝁, 𝚺), expressed in nats, is
𝐻[x] = ½ log |2𝜋𝑒𝚺|. (a.16)

In the univariate case, this reduces to

𝐻[𝑥] = ½ log 2𝜋𝑒𝜎². (a.17)
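For reference, a short computation of (a.16) using a log-determinant for numerical stability; the univariate expression (a.17) falls out as the 1 × 1 case:

```python
import numpy as np

def mvn_entropy(Sigma):
    """Differential entropy (a.16) of N(mu, Sigma) in nats; independent of mu."""
    sign, logdet = np.linalg.slogdet(2 * np.pi * np.e * np.atleast_2d(Sigma))
    assert sign > 0, "Sigma must be positive definite"
    return 0.5 * logdet

print(mvn_entropy(np.array([[2.0, 0.6], [0.6, 1.0]])))
# Univariate case (a.17): the 1 x 1 computation matches the closed form.
print(mvn_entropy(np.array([[1.3]])), 0.5 * np.log(2 * np.pi * np.e * 1.3))
```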
METHODS FOR APPROXIMATE BAYESIAN INFERENCE
𝑝(x | D) = 𝑍⁻¹ 𝑝(x) 𝑝(D | x)
where H is the Hessian of the negative log posterior evaluated at x̂:

H = − ∂²Ψ/∂x ∂x⊤ (x̂).

Note the first-order term vanishes as we are expanding around a maximum. Exponentiating, we derive an approximation to the unnormalized posterior:

𝑝(x | D) ∝ exp Ψ(x) ≈ exp Ψ(x̂) exp( −½ (x − x̂)⊤ H (x − x̂) ).

Hessian of negative log posterior, H
Through some accounting when normalizing (b.1), the Laplace approximation also gives an approximation to the normalizing constant 𝑍:

Laplace approximation to normalizing constant, 𝑍̂la
Figure b.1: A Laplace approximation to a one-dimensional posterior distribution. The left panel shows the Laplace
approximation before normalization, and the right panel afterwards.
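The construction above is straightforward to prototype. The sketch below (a rough illustration, not reference code) finds the mode numerically with scipy, estimates H by central finite differences, and forms the Gaussian approximation together with the standard Laplace estimate of the normalizing constant:

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(log_post, x0, eps=1e-5):
    """Rough Laplace approximation: N(x_hat, H^{-1}) centered on the posterior
    mode x_hat, with H the Hessian of the negative log posterior at the mode,
    estimated here by central finite differences."""
    neg = lambda x: -log_post(x)
    x_hat = minimize(neg, x0).x
    d = len(x_hat)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (neg(x_hat + ei + ej) - neg(x_hat + ei - ej)
                       - neg(x_hat - ei + ej) + neg(x_hat - ei - ej)) / (4 * eps**2)
    cov = np.linalg.inv(H)
    # Standard Laplace estimate of the normalizing constant (assumed form, not
    # transcribed from the text): exp(Psi(x_hat)) * sqrt((2*pi)^d / det(H)).
    Z_hat = np.exp(log_post(x_hat)) * np.sqrt((2 * np.pi) ** d / np.linalg.det(H))
    return x_hat, cov, Z_hat

# Sanity check on an exact (already Gaussian) "posterior": a standard normal in
# one dimension, for which the mode is 0, the covariance is 1, and Z = 1.
log_p = lambda x: -0.5 * float(np.sum(x ** 2)) - 0.5 * np.log(2 * np.pi)
print(laplace_approximation(log_p, np.array([1.0])))
```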
𝑝(𝝃) = N(𝝃; 𝝁0, 𝚺0).
Suppose we obtain information D about 𝝃 in the form of a collection of factors, each of which specifies the likelihood of a scalar product 𝑥 = a⊤𝝃 associated with that factor.3 We consider the posterior distribution

𝑝(𝝃 | D) = 𝑍⁻¹ 𝑝(𝝃) ∏𝑖 𝑡𝑖(𝑥𝑖), (b.3)

3 These scalar products often simply pick out single entries of 𝝃, in which case we can slightly simplify notation. However, we sometimes require this more general formulation.
Here (𝑍̃𝑖, 𝜇̃𝑖, 𝜎̃𝑖²) are called site parameters for the approximate factor 𝑡̃𝑖, which we may design to optimize the fit. We will consider this issue further shortly. Given arbitrary site parameters for each factor, the resulting approximation to (b.3) is

𝑝(𝝃 | D) ≈ 𝑞(𝝃) = 𝑍̂ep⁻¹ 𝑝(𝝃) ∏𝑖 𝑡̃𝑖(𝑥𝑖). (b.4)

site parameters, (𝑍̃, 𝜇̃, 𝜎̃²)
Gaussian ep approximate posterior, 𝑞(𝝃)
As a product of Gaussians, the approximate posterior is also Gaussian:

𝑞(𝝃) = N(𝝃; 𝝁, 𝚺), (b.5)

with

𝝁 = 𝚺 ( 𝚺0⁻¹ 𝝁0 + Σ𝑖 (𝜇̃𝑖/𝜎̃𝑖²) a𝑖 );  𝚺 = ( 𝚺0⁻¹ + Σ𝑖 (1/𝜎̃𝑖²) a𝑖 a𝑖⊤ )⁻¹.

4 The updated covariance incorporates only a series of rank-one updates, which can be applied using the Sherman–Morrison formula.
Gaussian ep also yields an approximation of the normalizing constant 𝑍, if desired:

𝑍 ≈ 𝑍̂ep = [ N(0; 𝝁0, 𝚺0) / N(0; 𝝁, 𝚺) ] ∏𝑖 𝑍̃𝑖 N(0; 𝜇̃𝑖, 𝜎̃𝑖²). (b.6)

Gaussian ep approximation to normalizing constant, 𝑍̂ep
What remains to be determined is an effective means of choosing the site parameters to maximize the approximation fidelity. One reasonable goal would be to minimize the Kullback–Leibler (kl) divergence between the true and approximate distributions; for our Gaussian approximation (b.5), this is achieved through moment matching.2 Unfortunately, determining the moments of the true posterior (b.3) may be difficult, so expectation propagation instead matches the marginal moments for each of the {𝑥𝑖}, approximately minimizing the kl divergence. This is accomplished through an iterative procedure where we repeatedly sweep over each of the approximate factors and refine its parameters until convergence.

setting site parameters
expectation propagation approximately minimizes kl divergence
We initialize all site parameters to (𝑍̃, 𝜇̃, 𝜎̃²) = (1, 0, ∞); with these choices the approximate factors drop away, and our initial approximation is simply the prior: (𝝁, 𝚺) = (𝝁0, 𝚺0). Now we perform a series of updates for each of the approximate factors in turn. These updates take a convenient general form, and we will drop factor index subscripts below to simplify notation.

site parameter initialization
Let 𝑡̃(𝑥) = 𝑍̃ N(𝑥; 𝜇̃, 𝜎̃²) be an arbitrary factor in our approximation (b.4). The idea behind expectation propagation is to drop this factor from the approximation entirely, forming the cavity distribution:

𝑞̄(𝝃) = 𝑞(𝝃) / 𝑡̃(𝑥),

and replace it with the true factor 𝑡(𝑥), forming the tilted distribution 𝑞̄(𝝃) 𝑡(𝑥). The tilted distribution is closer to the true posterior (b.3) as the factor in question is no longer approximated. We now adjust the site

cavity distribution, 𝑞̄
tilted distribution
Figure b.2: A Gaussian ep approximation to the distribution in figure b.1.
𝑞̄(𝑥) = N(𝑥; 𝜇̄, 𝜎̄²);  𝜇̄ = 𝜎̄²(𝜇𝜎⁻² − 𝜇̃𝜎̃⁻²);  𝜎̄² = (𝜎⁻² − 𝜎̃⁻²)⁻¹. (b.7)
this quantity clearly depends on the cavity parameters (𝜇̄, 𝜎̄²). If we define

𝛼 = ∂ log 𝑍 / ∂𝜇̄;  𝛽 = ∂ log 𝑍 / ∂𝜎̄²; (b.9)

and an auxiliary variable 𝛾 = (𝛼² − 2𝛽)⁻¹, then we may achieve the desired moment matching by updating the site parameters to:6

𝜇̃ = 𝜇̄ + 𝛼𝛾;  𝜎̃² = 𝛾 − 𝜎̄²;  𝑍̃ = 𝑍 √(2𝜋) √(𝜎̃² + 𝜎̄²) exp(½ 𝛼²𝛾). (b.10)

6 t. minka (2008). ep: A quick reference.
This completes our update for the chosen factor; the full ep procedure
repeatedly updates each factor in this manner until convergence. The
result of Gaussian expectation propagation for the distribution from
figure b.1 is shown in figure b.2. The fit is good and reflects the more
global nature of the expectation propagation scheme achieved through
moment matching rather than merely maximizing the posterior.
A convenient aspect of expectation propagation is that incorporating
a new factor only requires computing the zeroth moment against an
arbitrary normal distribution (b.8) and the partial derivatives in (b.9). We
provide these computations for several useful factor types below.
Truncating a variable
A common use of expectation propagation in Bayesian optimization
is to approximately constrain a Gaussian random variable 𝑥 to be less
than a threshold 𝑎. We may capture this information by a single factor
𝑡 (𝑥) = [𝑥 < 𝑎]. In the context of expectation propagation, we must
consider the normalizing constant of the tilted distribution, which is a
truncated normal:
𝑍 = ∫ [𝑥 < 𝑎] N(𝑥; 𝜇̄, 𝜎̄²) d𝑥 = Φ(𝑧);  𝑧 = (𝑎 − 𝜇̄)/𝜎̄. (b.11)
The required quantities for an expectation propagation update are now
(b.10):
𝛼 = −𝜙(𝑧) / (Φ(𝑧) 𝜎̄);  𝛽 = 𝑧𝛼 / (2𝜎̄);  𝛾 = −(𝜎̄/𝛼) (𝜙(𝑧)/Φ(𝑧) + 𝑧)⁻¹. (b.12)
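Putting the pieces together, a single-variable sketch of one expectation propagation update for a truncation factor [𝑥 < 𝑎], following (b.7), (b.11), (b.12), and (b.10); the handling of an initially infinite site variance is merely an implementation convenience:

```python
import numpy as np
from scipy.stats import norm

def ep_truncation_update(mu, sigma2, mu_site, sigma2_site, a):
    """One ep update for a truncation factor [x < a] on a single scalar variable,
    following (b.7), (b.11), (b.12), and (b.10). (mu, sigma2) are the current
    marginal moments of x under q; (mu_site, sigma2_site) are the site
    parameters of the factor being refined (an infinite variance means the
    factor is currently vacuous)."""
    # Cavity distribution (b.7).
    sigma2_cav = 1.0 / (1.0 / sigma2 - 1.0 / sigma2_site)
    mu_cav = sigma2_cav * (mu / sigma2 - mu_site / sigma2_site)
    sigma_cav = np.sqrt(sigma2_cav)
    # Normalizing constant of the tilted distribution (b.11).
    z = (a - mu_cav) / sigma_cav
    Z = norm.cdf(z)
    # Derivative quantities (b.12) and the auxiliary variable gamma.
    alpha = -norm.pdf(z) / (Z * sigma_cav)
    gamma = -(sigma_cav / alpha) / (norm.pdf(z) / Z + z)
    # Updated site parameters (b.10).
    mu_site_new = mu_cav + alpha * gamma
    sigma2_site_new = gamma - sigma2_cav
    Z_site_new = (Z * np.sqrt(2 * np.pi) * np.sqrt(sigma2_site_new + sigma2_cav)
                  * np.exp(0.5 * alpha ** 2 * gamma))
    return mu_site_new, sigma2_site_new, Z_site_new

# Single-factor example: a standard normal prior truncated at a = 0. With one
# factor the cavity is just the prior, so a single update gives the final fit.
print(ep_truncation_update(mu=0.0, sigma2=1.0, mu_site=0.0, sigma2_site=np.inf, a=0.0))
```

Repeating such updates over all factors until the site parameters stop changing yields the full Gaussian ep approximation (b.4).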
A Gaussian ep approximation to a truncated normal distribution is illustrated in figure b.3. The fit is good, but not perfect: approximately
5% of its mass exceeds the threshold. This inaccuracy is the price of
approximation.
We may also apply this approach to approximately condition our belief on 𝝃 on one entry being dominated by another: 𝜉𝑖 < 𝜉𝑗. Consider the vector a where 𝑎𝑖 = 1, 𝑎𝑗 = −1, and all other entries are zero. Then 𝑥 = a⊤𝝃 = 𝜉𝑖 − 𝜉𝑗. The condition is now equivalent to [𝑥 < 0], and we can proceed as outlined above. This approximation is illustrated for a bivariate normal in figure b.4; again the fit appears reasonable.

conditioning on one variable being greater than another
Figure b.5: A Gaussian ep approximation to a one-dimensional normal distribution truncated at an unknown threshold 𝑎
with the marked 95% credible interval.
𝑝(𝑎) = N(𝑎; 𝜇, 𝜎²).
Integrating the hard truncation factor [𝑥 < 𝑎] against this belief yields
the following “soft truncation” factor:
𝑡(𝑥) = ∫ [𝑥 < 𝑎] 𝑝(𝑎) d𝑎 = Φ((𝜇 − 𝑥)/𝜎). (b.13)
We consider again the normalizing constant of the tilted distribution:
𝑍 = ∫ Φ((𝜇 − 𝑥)/𝜎) N(𝑥; 𝜇̄, 𝜎̄²) d𝑥 = Φ(𝑧);  𝑧 = (𝜇 − 𝜇̄)/√(𝜎² + 𝜎̄²),
Defining 𝑠 = √(𝜎² + 𝜎̄²), we may compute:

𝛼 = −𝜙(𝑧) / (Φ(𝑧) 𝑠);  𝛽 = 𝑧𝛼 / (2𝑠);  𝛾 = −(𝑠/𝛼) (𝜙(𝑧)/Φ(𝑧) + 𝑧)⁻¹.
GRADIENTS
and the partial derivatives of the predictive variance and standard deviation with respect to 𝑥 are:

∂𝑠²/∂𝑥 = ∂𝜎²/∂𝑥 + ∂𝜎𝑛²/∂𝑥;  ∂𝑠/∂𝑥 = (1/(2𝑠)) ∂𝑠²/∂𝑥.
values:

∂𝑔/∂𝑎𝑖 = Φ(𝑐𝑖+1) − Φ(𝑐𝑖)
  + (𝑎𝑖 + 𝑏𝑖𝑐𝑖+1 − [𝑖 ≤ 𝑛](𝑎𝑖+1 + 𝑏𝑖+1𝑐𝑖+1)) 𝜙(𝑐𝑖+1) ∂𝑐𝑖+1/∂𝑎𝑖
  − (𝑎𝑖 + 𝑏𝑖𝑐𝑖 − [𝑖 > 1](𝑎𝑖−1 + 𝑏𝑖−1𝑐𝑖)) 𝜙(𝑐𝑖) ∂𝑐𝑖/∂𝑎𝑖;

∂𝑔/∂𝑏𝑖 = 𝜙(𝑐𝑖) − 𝜙(𝑐𝑖+1)
  + (𝑎𝑖 + 𝑏𝑖𝑐𝑖+1 − [𝑖 ≤ 𝑛](𝑎𝑖+1 + 𝑏𝑖+1𝑐𝑖+1)) 𝜙(𝑐𝑖+1) ∂𝑐𝑖+1/∂𝑏𝑖
  − (𝑎𝑖 + 𝑏𝑖𝑐𝑖 − [𝑖 > 1](𝑎𝑖−1 + 𝑏𝑖−1𝑐𝑖)) 𝜙(𝑐𝑖) ∂𝑐𝑖/∂𝑏𝑖.
Here [𝑖 > 1] and [𝑖 ≤ 𝑛] represent the Iverson bracket. We will also require the gradient of the a and b vectors with respect to 𝑥:

∂a/∂𝑥 = [0; ∂𝜇/∂𝑥];  ∂b/∂𝑥 = (1/𝑠) ( ∂𝐾D(x′, 𝑥)/∂𝑥 − b ∂𝑠/∂𝑥 ). (c.2)

gradient of a and b with respect to 𝑥
Note the dependence on the gradient of the predictive parameters. Finally,
if we preprocess the inputs by identifying an appropriate transformation
matrix P such that 𝑔(a, b) = 𝑔(Pa, Pb) = 𝑔(𝜶 , 𝜷) (8.14), then the desired
gradient of expected improvement is:
∂𝛼ei/∂𝑥 = (∂𝑔/∂𝜶) P (∂a/∂𝑥) + (∂𝑔/∂𝜷) P (∂b/∂𝑥).
where [∂Δ/∂𝑥]𝑦 indicates the partial derivative of the updated utility when the observed value 𝑦 is held constant.
∂𝛼pes/∂𝑥 = (1/𝑠) ∂𝑠/∂𝑥 + (1/(2𝑛)) Σ_{𝑖=1}^{𝑛} (1/𝑠∗𝑖²) ∂𝑠∗𝑖²/∂𝑥.
∂𝑚/∂𝑥 = ∂𝜇/∂𝑥 + (𝜶∗)⊤ ∂k/∂𝑥;
∂𝜍²/∂𝑥 = ∂𝜎²/∂𝑥 − 2k⊤ V∗⁻¹ ∂k/∂𝑥;  ∂𝜌/∂𝑥 = ∂𝑘/∂𝑥 − k∗⊤ V∗⁻¹ ∂k/∂𝑥.
ANNOTATED BIBLIOGRAPHY OF APPLICATIONS
De novo design
A recent trend in molecular discovery has been to learn deep genera-
tive models for molecular structures and perform optimization in the
(continuous) latent space of such a model. This enables de novo design,
where the optimization routine can propose entirely novel structures
for evaluation by feeding selected points in the latent space through the
generational procedure.1
Although appealing, a major complication with this scheme is identifying a synthetic route – if one even exists! – for proposed structures. This issue deserves careful consideration when designing a system that can effectively transition from virtual screening to the laboratory.

the question of synthesizability
Reaction optimization
Not all applications of Bayesian optimization in chemistry take the form
of optimizing chemical properties as a function of molecular structure.
For example, even once a useful molecule has been identified, there may
remain numerous parameters of its synthesis – reaction environment,
processing parameters, etc. – that can be further optimized, seeking for
example to reduce the cost and/or increase the yield of production.
Conformational search
Conformational search seeks to identify the configuration(s) of a molecule
with the lowest potential energy. Even for molecules of moderate size,
the space of possible configurations can be enormous and the potential
energy surface can exhibit numerous local minima. Interaction with
other structures such as a surface (adsorption) can further complicate the computation of potential energy, rendering the search even more difficult.
carr, shane f. et al. (2017). Accelerating the Search for Global Minima
on Potential Energy Surfaces using Machine Learning. Journal of
Chemical Physics 145(15):154106.
fang, lincan et al. (2021). Efficient Amino Acid Conformer Search with
Bayesian Optimization. Journal of Chemical Theory and Compu-
tation 17(3):1955–1966.
packwood, daniel (2017). Bayesian Optimization for Materials Science.
Springer–Verlag.
We may also use related Bayesian methods to map out minimal-energy pathways between neighboring minima on a potential energy surface in order to understand the intermediate geometry of a molecule as it transforms from one energetically favorable state to another.

mapping minimal-energy pathways
Structural search
A fundamental question in materials science is how the structure of a
material gives rise to its material properties. However, predicting the
likely structure of a material – for example by evaluating the potential
energy of plausible structures and minimizing – can be extraordinarily
difficult, as the number of possible configurations can be astronomical.
physics
Modern physics is driven by experiments of massive scope, which have
enabled the refinement of theories of the Universe on all scales. The
complexity of these experiments – from data acquisition to the following
analysis and inference – offers many opportunities for optimization, but
the same complexity often renders optimization difficult due to large
parameter spaces and/or the expense of evaluating a particular setting.
Experimental physics
Complex physical instruments such as particle accelerators offer the
capability of extraordinarily fine tuning through careful setting of their
control parameters. In some cases, these parameters can be altered on-the-
fly during operation, resulting in a huge space of possible configurations.
Bayesian optimization can help accelerate the tuning process.
li, yan et al. (2018). A Knowledge Gradient Policy for Sequencing Exper-
iments to Identify the Structure of rna Molecules Using a Sparse
Additive Belief Model. informs Journal on Computing 30(4):625–
786.
A related problem is faced in modern plant breeding, where the
goal is to develop plant varieties with desirable characteristics: yield,
drought or pest resistance, etc. Plant phenotyping is an inherently slow
process, as we must wait for planted seeds to germinate and grow until
sufficiently mature that traits can be measured. This slow turnover rate
makes it impossible to fully explore the space of possible genotypes, the
genetic information that (in combination with other factors including
environment and management) gives rise to the phenotypes we seek
to optimize. Modern plant breeding uses genetic sequencing to guide
the breeding process by building models to predict phenotype from
genotype and using these models in combination with large gene banks
to inform the breeding process. A challenge in this approach is that the
space of possible genotypes is huge, but Bayesian optimization can help
accelerate the search.
Biomedical engineering
Difficult optimization problems are pervasive in biomedical engineering
due to the complexity of the systems involved and the often considerable
cost of gathering data, whether through studies with human subjects,
complicated laboratory testing, and/or nontrivial simulation.
robotics
Robotics is fraught with difficult optimization problems. A robotic plat-
form may have numerous tunable parameters influencing its behavior,
and the dependence of its performance on these parameters may be
highly complex. Further, empirical evaluation can be difficult – real-world
experiments must proceed in real time, and there may be a considerable
set-up time between experiments.
In some cases we may be able to accelerate optimization by augmenting real-world evaluation with simulation in a multifidelity setup.

multifidelity optimization: § 11.5, p. 263
marco, alonso et al. (2017). Virtual vs. Real: Trading Off Simulations and
Physical Experiments in Reinforcement Learning with Bayesian
Optimization. Proceedings of the 2017 ieee International Conference
on Robotics and Automation (icra 2017).
rai, akshara et al. (n.d.). Using Simulation to Improve Sample-Efficiency
of Bayesian Optimization for Bipedal Robots. Journal of Machine
Learning Research 20(49).
Adversarial attacks
A specific safety issue to consider in robotic platforms incorporating
deep neural networks is the possibility of adversarial attacks that may be
able to alter the environment so as to induce unsafe behavior. Bayesian
optimization has been applied to the efficient construction of adversarial
attacks; this capability can in turn be used during the design phase
seeking to build robotic controllers that are robust to such attack.
reinforcement learning
The main challenge in reinforcement learning is that agents can only
evaluate decision policies through trial-and-error: by following a pro-
posed policy and observing the resulting reward. When the system is
complex, evaluating a given policy may be costly in terms of time or
other resources, and when the space of potential policies is large, careful
exploration is necessary.
Policy search is one approach to reinforcement learning that models
the problem of policy design in terms of optimization. We enumerate
a space of potential policies (for example, via parameterization) and
seek to maximize the expected cumulative reward over the policy space.
This abstraction allows the straightforward use of Bayesian optimiza-
tion to construct a series of policies seeking to balance exploration and
exploitation of the policy space. A recurring theme in this research is
the careful use of simulators, when available, to help guide the search in
a multifidelity setup.

multifidelity optimization: § 11.5, p. 263
Reinforcement learning is a central concept in robotics, and many
of the papers cited in the previous section represent applications of
policy search to particular robotic systems. The citations below concern
reinforcement learning in a broader context.
civil engineering
The optimization of large-scale critical systems such as power, trans-
portation, water distribution, and sensor networks can be difficult due
Structural engineering
The following study applied Bayesian optimization in a structural en-
gineering setting: tuning the hyperparameters of a real-time predictive
model for the typhoon-induced response of a ∼1000 m real-world bridge.
electrical engineering
Over the past few decades, sophisticated tools have enabled the automa-
tion of many aspects of digital circuit design, even for extremely large
circuits. However, the design of analog circuits remains largely a man-
ual process. As circuits grow increasingly complex, the optimization of
analog circuits (for example, to minimize power consumption subject to
performance constraints) is becoming increasingly difficult. Even
computer simulation of complex analog circuits can entail significant
cost, so careful experimental design is imperative for exploring the design
space. Bayesian optimization has proven effective in this regard.
mechanical engineering
The following study applied Bayesian optimization in a mechanical
engineering setting: tuning the parameters of a welding process (via
slow and expensive real-world experiments) to maximize weld quality.
Aerospace engineering
Nuances in airfoil design can have significant impact on aerodynamic
performance, improvements in which can in turn lead to considerable
cost savings from increased fuel efficiency. However, airfoil optimization
is challenging due to the large design spaces involved and the nontrivial
cost of evaluating a proposed configuration. Empirical measurement
requires constructing an airfoil and testing its performance in a wind
chamber. This process is too slow to explore the configuration space
effectively, but we can simulate the process via computational fluid dynamics. These computational surrogates are still fairly costly due to the need to numerically solve nonlinear partial differential equations (the Navier–Stokes equations) at a sufficiently fine resolution, but they are nonetheless cheaper and more easily parallelized than experiments.4

4 There are many parallels between this situation and that faced in (computational) chemistry and materials science, where quantum-mechanical simulation also requires numerically solving a partial differential equation (the Schrödinger equation).
Bayesian optimization can accelerate airfoil optimization via careful and cost-aware experimental design. One important idea here that can lead to significant computational savings is multifidelity modeling and optimization. It is relatively easy to control the cost–fidelity tradeoff in computational fluid dynamics simulations by altering their resolution accordingly. This allows us to rapidly explore the design space with cheap-but-rough simulations, then refine the most promising regions.

multifidelity optimization: § 11.5, p. 263
Automobile engineering
Bayesian optimization has also proven useful in automobile engineering.
Automobile components and subsystems can have numerous tunable pa-
rameters affecting performance, and evaluating a given configuration can
be complicated due to complex interactions among vehicle components.
Bayesian optimization has shown success in this setting.
BIBLIOGRAPHY
abbasi-yadkori, yasin (2012). Online Learning for Linearly Parameterized Control Prob-
lems. PhD thesis. University of Alberta.
acerbi, luigi (2018). Variational Bayesian Monte Carlo. Advances in Neural Information
Processing Systems 31 (neurips 2018), pp. 8213–8223.
acerbi, luigi and wei ji ma (2017). Practical Bayesian Optimization for Model Fitting
with Bayesian Adaptive Direct Search. Advances in Neural Information Processing
Systems 30 (neurips 2017), pp. 1836–1846.
adams, ryan prescott, iain murray, and david j. c. mackay (2009). Tractable Nonpara-
metric Bayesian Inference in Poisson Processes with Gaussian Process Intensities.
Proceedings of the 26th International Conference on Machine Learning (icml 2009),
pp. 9–16.
adler, robert j. and jonathan e. taylor (2007). Random Fields and Geometry. Springer
Monographs in Mathematics. Springer–Verlag.
agrawal, rajeev (1995). The Continuum-Armed Bandit Problem. siam Journal on Control
and Optimization 33(6):1926–1951.
agrawal, shipra and navin goyal (2012). Analysis of Thompson Sampling for the Multi-
armed Bandit Problem. Proceedings of the 25th Annual Conference on Learning
Theory (colt 2012). Vol. 23. Proceedings of Machine Learning Research, pp. 39.1–
39.26.
álvarez, mauricio a., lorenzo rosasco, and neil d. lawrence (2012). Kernels for Vector-
Valued Functions: A Review. Foundations and Trends in Machine Learning 4(3):195–
266.
arcones, miguel a. (1992). On the arg max of a Gaussian process. Statistics & Probability
Letters 15(5):373–374.
auer, peter, nicolò cesa-bianchi, and paul fischer (2002). Finite-time Analysis of the
Multiarmed Bandit Problem. Machine Learning 47(2–3):235–256.
balandat, maximilian, brian karrer, daniel r. jiang, samuel daulton, benjamin
letham, andrew gordon wilson, et al. (2020). BoTorch: A Framework for Effi-
cient Monte-Carlo Bayesian Optimization. Advances in Neural Information Process-
ing Systems 33 (neurips 2020), pp. 21524–21538.
baptista, ricardo and matthias poloczek (2018). Bayesian Optimization of Combinato-
rial Structures. Proceedings of the 35th International Conference on Machine Learning
(icml 2018). Vol. 80. Proceedings of Machine Learning Research, pp. 462–471.
bather, john (1996). A Conversation with Herman Chernoff. Statistical Science 11(4):335–
350.
belakaria, syrine, aryan deshwal, and janardhan rao doppa (2019). Max-value En-
tropy Search for Multi-Objective Bayesian Optimization. Advances in Neural Infor-
mation Processing Systems 32 (neurips 2019), pp. 7825–7835.
bellman, richard (1952). On the Theory of Dynamic Programming. Proceedings of the
National Academy of Sciences 38(8):716–719.
bellman, richard (1957). Dynamic Programming. Princeton University Press.
berger, james o. (1985). Statistical Decision Theory and Bayesian Analysis. 2nd ed. Springer
Series in Statistics. Springer–Verlag.
bergstra, james, rémi bardenet, yoshua bengio, and balázs kégl (2011). Algorithms for
Hyper-Parameter Optimization. Advances in Neural Information Processing Systems
24 (neurips 2011), pp. 2546–2554.
This material will be published by Cambridge University Press as Bayesian Optimization. This prepublication 331
version is free to view and download for personal use only. Not for redistribution, resale, or use in derivative
works. ©Roman Garnett 2022.
bergstra, james and yoshua bengio (2012). Random Search for Hyper-Parameter Opti-
mization. Journal of Machine Learning Research 13:281–305.
berkenkamp, felix, angela p. schoellig, and andreas krause (2019). No-Regret Bayes-
ian Optimization with Unknown Hyperparameters. Journal of Machine Learning
Research 20(50):1–24.
berry, donald a. and bert fristedt (1985). Bandit Problems: Sequential Allocation of
Experiments. Monographs on Statistics and Applied Probability. Chapman & Hall.
bertsekas, dimitri p. (2017). Dynamic Programming and Optimal Control. 4th ed. Vol. 1.
Athena Scientific.
bochner, s (1933). Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse.
Mathematische Annalen 108:378–410.
bogunovic, ilija, andreas krause, and jonathan scarlett (2020). Corruption-Tolerant
Gaussian Process Bandit Optimization. Proceedings of the 23rd International Confer-
ence on Artificial Intelligence and Statistics (aistats 2020). Vol. 108. Proceedings of
Machine Learning Research, pp. 1071–1081.
bogunovic, ilija, jonathan scarlett, stefanie jegelka, and volkan cevher (2018).
Adversarially Robust Optimization with Gaussian Processes. Advances in Neural
Information Processing Systems 31 (neurips 2018), pp. 5760–5770.
box, g. e. p. (1954). The Exploration and Exploitation of Response Surfaces: Some General
Considerations and Examples. Biometrics 10(1):16–60.
box, george e. p., j. stuart hunter, and william g. hunter (2005). Statistics for Experi-
menters: Design, Innovation, and Discovery. 2nd ed. Wiley Series in Probability and
Statistics. John Wiley & Sons.
box, g. e. p. and k. b. wilson (1951). On the Experimental Attainment of Optimum Condi-
tions. Journal of the Royal Statistical Society Series B (Methodological) 13(1):1–45.
box, g. e. p. and p. v. youle (1954). The Exploration and Exploitation of Response Surfaces:
An Example of the Link Between the Fitted Surface and the Basic Mechanism of
the System. Biometrics 11(3):287–323.
breiman, leo (2001). Random Forests. Machine Learning 45(1):5–32.
brent, richard p. (1973). Algorithms for Minimization Without Derivatives. Prentice–Hall
Series in Automatic Computation. Prentice–Hall.
brochu, eric, vlad m. cora, and nando de freitas (2010). A Tutorial on Bayesian Opti-
mization of Expensive Cost Functions, with Application to Active User Modeling
and Hierarchical Reinforcement Learning. arXiv: 1012.2599 [cs.LG].
brooks, steve, andrew gelman, galin l. jones, and xiao-li meng, eds. (2011). Handbook
of Markov Chain Monte Carlo. Handbooks of Modern Statistical Methods. Chapman
& Hall.
bubeck, sébastien and nicolò cesa-bianchi (2012). Regret Analysis of Stochastic and
Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine
Learning 5(1):1–122.
bubeck, sébastien, rémi munos, and gilles stoltz (2009). Pure Exploration in Multi-
armed Bandits Problems. Proceedings of the 20th International Conference on Algo-
rithmic Learning Theory (alt 2009). Vol. 5809. Lecture Notes in Computer Science.
Springer–Verlag, pp. 23–37.
bull, adam d. (2011). Convergence Rates of Efficient Global Optimization Algorithms.
Journal of Machine Learning Research 12(88):2879–2904.
caflisch, russel e. (1998). Monte Carlo and quasi-Monte Carlo methods. Acta Numerica 7:
1–49.
cai, xu, selwyn gomes, and jonathan scarlett (2021). Lenient Regret and Good-Action
Identification in Gaussian Process Bandits. Proceedings of the 38th International
Conference on Machine Learning (icml 2021). Vol. 139. Proceedings of Machine
Learning Research, pp. 1183–1192.
cakmak, sait, raul astudillo, peter frazier, and enlu zhou (2020). Bayesian Opti-
mization of Risk Measures. Advances in Neural Information Processing Systems 33
(neurips 2020), pp. 18039–18049.
calandra, roberto, jan peters, carl edward rasmussen, and marc peter deisen-
roth (2016). Manifold Gaussian Processes for Regression. Proceedings of the 2016
International Joint Conference on Neural Networks (ijcnn 2016), pp. 3338–3345.
calvin, j. and a. žilinskas (1999). On the Convergence of the P-Algorithm for One-
Dimensional Global Optimization of Smooth Functions. Journal of Optimization
Theory and Applications 102(3):479–495.
calvin, james m. (1993). Consistency of a Myopic Bayesian Algorithm for One-Dimensional
Global Optimization. Journal of Global Optimization 3(2):223–232.
calvin, james m. (2000). Convergence Rate of the P-Algorithm for Optimization of Con-
tinuous Functions. In: Approximation and Complexity in Numerical Optimization:
Continuous and Discrete Problems. Ed. by panos m. pardalos. Vol. 42. Nonconvex
Optimization and Its Applications. Springer–Verlag, pp. 116–129.
calvin, james m. and antanas žilinskas (2001). On Convergence of a P-Algorithm Based
on a Statistical Model of Continuously Differentiable Functions. Journal of Global
Optimization 19(3):229–245.
camilleri, romain, julian katz-samuels, and kevin jamieson (2021). High-Dimensional
Experimental Design and Kernel Bandits. Proceedings of the 38th International
Conference on Machine Learning (icml 2021). Vol. 139. Proceedings of Machine
Learning Research, pp. 1227–1237.
chapelle, olivier and lihong li (2011). An Empirical Evaluation of Thompson Sampling.
Advances in Neural Information Processing Systems 24 (neurips 2011), pp. 2249–2257.
chernoff, herman (1959). Sequential Design of Experiments. The Annals of Mathematical
Statistics 30(3):755–770.
chernoff, herman (1972). Sequential Analysis and Optimal Design. cbms–nsf Regional
Conference Series in Applied Mathematics. Society for Industrial and Applied
Mathematics.
chevalier, clément and david ginsbourger (2013). Fast Computation of the Multi-points
Expected Improvement with Applications in Batch Selection. Proceedings of the 7th
Learning and Intelligent Optimization Conference (lion 7). Vol. 7997. Lecture Notes
in Computer Science. Springer–Verlag, pp. 59–69.
chowdhury, sayak ray and aditya gopalan (2017). On Kernelized Multi-armed Bandits.
Proceedings of the 34th International Conference on Machine Learning (icml 2017).
Vol. 70. Proceedings of Machine Learning Research, pp. 844–853.
clark, charles e. (1961). The Greatest of a Finite Set of Random Variables. Operations
Research 9(2):145–162.
contal, emile, david buffoni, alexandre robicquet, and nicolas vayatis (2013). Paral-
lel Gaussian Process Optimization with Upper Confidence Bound and Pure Explo-
ration. Proceedings of the 2013 European Conference on Machine Learning and Prin-
ciples and Practice of Knowledge Discovery in Databases (ecml pkdd 2013). Vol. 8188.
Lecture Notes in Computer Science. Springer–Verlag, pp. 225–240.
cover, thomas m. and joy a. thomas (2006). Elements of Information Theory. 2nd ed. John
Wiley & Sons.
cox, dennis d. and susan john (1992). A Statistical Method for Global Optimization.
Proceedings of the 1992 ieee International Conference on Systems, Man, and Cybernetics
(smc 1992), pp. 1241–1246.
cunningham, john p., philipp hennig, and simon lacoste-julien (2011). Gaussian Prob-
abilities and Expectation Propagation. arXiv: 1111.6832 [stat.ML].
cutajar, kurt, michael a. osborne, john p. cunningham, and maurizio filippone (2016).
Preconditioning Kernel Matrices. Proceedings of the 33rd International Conference on
Machine Learning (icml 2016). Vol. 48. Proceedings of Machine Learning Research,
pp. 2529–2538.
dai, zhongxiang, haibin yu, bryan kian hsiang low, and patrick jaillet (2019).
Bayesian Optimization Meets Bayesian Optimal Stopping. Proceedings of the 36th
International Conference on Machine Learning (icml 2019). Vol. 97. Proceedings of
Machine Learning Research, pp. 1496–1506.
dalibard, valentin, michael schaarschmidt, and eiko yoneki (2017). boat: Build-
ing Auto-Tuners with Structured Bayesian Optimization. Proceedings of the 26th
International Conference on World Wide Web (www 2017), pp. 479–488.
dani, varsha, thomas p. hayes, and sham m. kakade (2008). Stochastic Linear Optimiza-
tion Under Bandit Feedback. Proceedings of the 21st Conference on Learning Theory
(colt 2008), pp. 355–366.
davis, philip j. and philip rabinowitz (1984). Methods of Numerical Integration. 2nd ed.
Computer Science and Applied Mathematics. Academic Press.
de ath, george, richard m. everson, alma a. rahat, and jonathan e. fieldsend (2021).
Greed is Good: Exploration and Exploitation Trade-offs in Bayesian Optimisation.
acm Transactions on Evolutionary Learning and Optimization 1(1):1–22.
de ath, george, jonathan e. fieldsend, and richard m. everson (2020). What do you
Mean? The Role of the Mean Function in Bayesian Optimization. Proceedings of the
2020 Genetic and Evolutionary Computation Conference (gecco 2020), pp. 1623–1631.
de freitas, nando, alex j. smola, and masrour zoghi (2012a). Regret Bounds for Deter-
ministic Gaussian Process Bandits. arXiv: 1203.2177 [cs.LG].
de freitas, nando, alex j. smola, and masrour zohgi (2012b). Exponential Regret Bounds
for Gaussian Process Bandits with Deterministic Observations. Proceedings of the
29th International Conference on Machine Learning (icml 2012), pp. 955–962.
degroot, morris h. (1970). Optimal Statistical Decisions. McGraw–Hill.
desautels, thomas, andreas krause, and joel w. burdick (2014). Parallelizing Exploration–
Exploitation Tradeoffs in Gaussian Process Bandit Optimization. Journal of Machine
Learning Research 15(119):4053–4103.
diaconis, persi (1988). Bayesian Numerical Analysis. In: Statistical Decision Theory and
Related Topics iv. Ed. by shanti s. gupta and james o. berger. Vol. 1, pp. 163–175.
djolonga, josip, andreas krause, and volkan cevher (2013). High-Dimensional Gaussian
Process Bandits. Advances in Neural Information Processing Systems 26 (neurips
2013), pp. 1025–1033.
domhan, tobias, jost tobias springenberg, and frank hutter (2015). Speeding up Au-
tomatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation
of Learning Curves. Proceedings of the 24th International Conference on Artificial
Intelligence (ijcai 2015), pp. 3460–3468.
334
bibliography
duvenaud, david, james robert lloyd, roger grosse, joshua b. tenenbaum, and zoubin
ghahramani (2013). Structure Discovery in Nonparametric Regression through
Compositional Kernel Search. Proceedings of the 30th International Conference on
Machine Learning (icml 2013). Vol. 28. Proceedings of Machine Learning Research,
pp. 1166–1174.
emmerich, michael and boris naujoks (2004). Metamodel Assisted Multiobjective Opti-
misation Strategies and their Application in Airfoil Design. In: Adaptive Computing
in Design and Manufacture vi. Ed. by i. c. parmee, pp. 249–260.
emmerich, michael t. m., kyriakos c. giannakoglou, and boris naujoks (2006). Single-
and Multiobjective Evolutionary Optimization Assisted by Gaussian Random Field
Metamodels. ieee Transactions on Evolutionary Computation 10(4):421–439.
fernández-delgado, manuel, eva cernadas, senén barro, and dinani amorim (2014).
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
Journal of Machine Learning Research 15(90):3133–3181.
feurer, matthias, jost tobias springenberg, and frank hutter (2015). Initializing
Bayesian Hyperparameter Optimization via Meta-Learning. Proceedings of 29th
aaai Conference on Artificial Intelligence (aaai 2015), pp. 1128–1135.
fisher, ronald a. (1935). The Design of Experiments. Oliver and Boyd.
fleischer, m. (2003). The Measure of Pareto Optima: Applications to Multi-objective
Metaheuristics. Proceedings of the 2nd International Conference on Evolutionary
Multi-Criterion Optimization (emo 2003). Vol. 2632. Lecture Notes in Computer
Science. Springer–Verlag, pp. 519–533.
forrester, alexander i. j., andy j. keane, and neil w. bressloff (2006). Design and
Analysis of “Noisy” Computer Experiments. aiaa Journal 44(10):2331–2339.
frazier, peter and warren powell (2007). The Knowledge Gradient Policy for Offline
Learning with Independent Normal Rewards. Proceedings of the 2007 ieee Inter-
national Symposium on Approximate Dynamic Programming and Reinforcement
Learning (adprl 2007), pp. 143–150.
frazier, peter, warren powell, and savas dayanik (2009). The Knowledge-Gradient
Policy for Correlated Normal Beliefs. informs Journal on Computing 21(4):599–613.
friedman, milton and l. j. savage (1947). Planning Experiments Seeking Maxima. In:
Selected Techniques of Statistical Analysis for Scientific and Industrial Research, and
Production and Management Engineering. Ed. by churchill eisenhart, millard w.
hartay, and w. allen wallis. McGraw–Hill, pp. 363–372.
fröhlich, lukas p., edgar d. klenske, julia vinogradska, christian daniel, and
melanie n. zeilinger (2020). Noisy-Input Entropy Search for Efficient Robust
Bayesian Optimization. Proceedings of the 23rd International Conference on Artificial
Intelligence and Statistics (aistats 2020). Vol. 108. Proceedings of Machine Learning
Research, pp. 2262–2272.
garcía-barcos, javier and ruben martinez-cantin (2021). Robust policy search for
robot navigation. ieee Robotics and Automation Letters 6(2):2389–2396.
gardner, jacob r., chuan guo, kilian q. weinberger, roman garnett, and roger
grosse (2017). Discovering and Exploiting Additive Structure for Bayesian Opti-
mization. Proceedings of the 20th International Conference on Artificial Intelligence
and Statistics (aistats 2017). Vol. 54. Proceedings of Machine Learning Research,
pp. 1311–1319.
gardner, jacob r., matt j. kusner, zhixiang (eddie) xu, kilian q. weinberger, and
john p. cunningham (2014). Bayesian Optimization with Inequality Constraints.
336
bibliography
338
bibliography
kathuria, tarun, amit deshpande, and pushmeet kohli (2016). Batched Gaussian Pro-
cess Bandit Optimization via Determinantal Point Processes. Advances in Neural
Information Processing Systems 29 (neurips 2016), pp. 4206–4214.
kim, jeankyung and david pollard (1990). Cube Root Asymptotics. The Annals of Statistics
18(1):191–219.
kim, jungtaek and seungjin choi (2020). On Local Optimizers of Acquisition Functions
in Bayesian Optimization. Lecture Notes in Computer Science 12458:675–690.
kim, jungtaek, michael mccourt, tackgeun you, saehoon kim, and seungjin choi
(2021). Bayesian optimization with approximate set kernels. Machine Learning 110(5):
857–879.
klein, aaron, simon bartels, stefan falkner, philipp hennig, and frank hutter
(2015). Towards efficient Bayesian Optimization for Big Data. Bayesian Optimiza-
tion: Scalability and Flexibility Workshop (BayesOpt 2015), Conference on Neural
Information Processing Systems (neurips 2015).
klein, aaron, stefan falkner, jost tobias springenberg, and frank hutter (2017).
Learning Curve Prediction with Bayesian Neural Networks. Proceedings of the 5th
International Conference on Learning Representations (iclr 2017).
knowles, joshua (2005). Parego: A Hybrid Algorithm With On-Line Landscape Approxi-
mation for Expensive Multiobjective Optimization Problems. ieee Transactions on
Evolutionary Computation 10(1):50–66.
ko, chun-wa, jon lee, and maurice qeyranne (1995). An Exact Algorithm for Maximum
Entropy Sampling. Operations Research 43(4):684–691.
konishi, sadanori and genshiro kitagawa (2008). Information Criteria and Statistical
Modeling. Springer Series in Statistics. Springer–Verlag.
kschischang, frank r., brendan j. frey, and hans-andrea leoliger (2001). Factor
Graphs and the Sum–Product Algorithm. ieee Transactions on Information Theory
47(2):498–519.
kulesza, alex and ben taskar (2012). Determinantal Point Processes for Machine Learning.
Foundations and Trends in Machine Learning 5(2–3):123–286.
kushner, harold j. (1962). A Versatile Stochastic Model of a Function of Unknown and
Time Varying Form. Journal of Mathematical Analysis and Applications 5(1):150–167.
kushner, h. j. (1964). A New Method of Locating the Maximum Point of an Arbitrary
Multipeak Curve in the Presence of Noise. Journal of Basic Engineering 86(1):97–106.
kuss, malte (2006). Gaussian Process Models for Robust Regression, Classification, and
Reinforcement Learning. PhD thesis. Technische Universität Darmstadt.
lai, t. l. and herbert robbins (1985). Asymptotically Efficient Adaptive Allocation Rules.
Advances in Applied Mathematics 6(1):4–22.
lam, remi r., karen e. wilcox, and david h. wolpert (2016). Bayesian Optimization with
a Finite Budget: An Approximate Dynamic Programming Approach. Advances in
Neural Information Processing Systems 29 (neurips 2016), pp. 883–891.
lange, kenneth l., roderick j. a. little, and jeremy m. g. taylor (1989). Robust Statistical
Modeling Using the 𝑡 Distribution. Journal of the American Statistical Association
84(408):881–896.
lattimore, tor and csaba szepesvári (2020). Bandit Algorithms. Cambridge University
Press.
lázaro-gredilla, miguel, joaqin qiñonero-candela, carl edward rasmussen, and
aníbal r. figueiras-vidal (2010). Sparse Spectrum Gaussian Process Regression.
Journal of Machine Learning Research 11(Jun):1865–1881.
340
bibliography
letham, benjamin, brian karrer, guilherme ottoni, and eytan bakshy (2019). Con-
strained Bayesian Optimization with Noisy Experiments. Bayesian Analysis 14(2):
495–519.
levina, elizaveta and peter j. bickel (2004). Maximum Likelihood Estimation of Intrinsic
Dimension. Advances in Neural Information Processing Systems 17 (neurips 2004),
pp. 777–784.
lévy, paul (1948). Processus stochastiques et mouvement brownien. Gauthier–Villars.
li, chun-liang, kirthevasan kandasamy, barnabás póczos, and jeff schneider (2016).
High Dimensional Bayesian Optimization via Restricted Projection Pursuit Models.
Proceedings of the 19th International Conference on Artificial Intelligence and Statistics
(aistats 2016). Vol. 51. Proceedings of Machine Learning Research, pp. 884–892.
li, chunyuan, heerad farkhoor, rosanne liu, and jason yosinski (2018a). Measuring
the Intrinsic Dimension of Objective Landscapes. Proceedings of the 6th International
Conference on Learning Representations (iclr 2018).
li, lisha, kevin jamieson, giulia desalvo, afshin rostamizadeh, and ameet talwalkar
(2018b). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimiza-
tion. Journal of Machine Learning Research 18(185):1–52.
li, zihan and jonathan scarlett (2021). Gaussian Process Bandit Optimization with Few
Batches. arXiv: 2110.07788 [stat.ML].
lindley, d. v. (1956). On a Measure of the Information Provided by an Experiment. The
Annals of Mathematical Statistics 27(4):986–1005.
lindley, d. v. (1972). Bayesian Statistics, A Review. cbms–nsf Regional Conference Series
in Applied Mathematics. Society for Industrial and Applied Mathematics.
locatelli, m. (1997). Bayesian Algorithms for One-Dimensional Global Optimization.
Journal of Global Optimization 10(1):57–76.
löwner, karl (1934). Über monotone Matrixfunktionen. Mathematische Zeitschrift 38:
177–216.
lukić, milan n. and jay h. beder (2001). Stochastic Processes with Sample Paths in
Reproducing Kernel Hilbert Spaces. Transactions of the American Mathematical
Society 353(10):3945–3969.
lyu, yueming, yuan yuan, and ivor w. tsang (2019). Efficient Batch Black-box Optimiza-
tion with Deterministic Regret Bounds. arXiv: 1905.10041 [cs.LG].
mackay, david j. c. (1998). Introduction to Gaussian Processes. Neural Networks and
Machine Learning. Ed. by christopher m. bishop. Vol. 168. nato asi Series F:
Computer and Systems Sciences. Springer–Verlag.
mackay, david j. c. (2003). Information Theory, Inference, and Learning Algorithms. Cam-
bridge University Press.
malkomes, gustavo and roman garnett (2018). Automating Bayesian optimization
with Bayesian optimization. Advances in Neural Information Processing Systems 31
(neurips 2018), pp. 5984–5994.
malkomes, gustavo, chip schaff, and roman garnett (2016). Bayesian optimization for
automated model selection. Advances in Neural Information Processing Systems 29
(neurips 2016), pp. 2900–2908.
marmin, sébastien, clément chevalier, and david ginsbourger (2015). Differentiating
the Multipoint Expected Improvement for Optimal Batch Design. Proceedings of the
1st International Workshop on Machine Learning, Optimization, and Big Data (mod
2015). Vol. 9432. Lecture Notes in Computer Science. Springer–Verlag, pp. 37–48.
marschak, jacob and roy radner (1972). Economic Theory of Teams. Yale University Press.
martinez-cantin, ruben, kevin tee, and michael mccourt (2018). Practical Bayesian
optimization in the presence of outliers. Proceedings of the 21st International Con-
ference on Artificial Intelligence and Statistics (aistats 2018). Vol. 84. Proceedings of
Machine Learning Research, pp. 1722–1731.
massart, pascal (2007). Concentration Inequalities and Model Selection: Ecole d’Eté de
Probabilités de Saint-Flour xxxiii – 2003. Vol. 1896. Lecture Notes in Mathematics.
Springer–Verlag.
mccullagh, p. and j. a. nelder (1989). Generalized Linear Models. 2nd ed. Monographs on
Statistics and Applied Probability. Chapman & Hall.
meinshausen, nicolai (2006). Quantile Regression Forests. Journal of Machine Learning
Research 7(35):983–999.
miettinen, kaisa m. (1998). Nonlinear Multiobjective Optimization. International Series in
Operations Research & Management Science. Kluwer Academic Publishers.
milgrom, paul and ilya segal (2002). Envelope Theorems for Arbitrary Choice Sets.
Econometrica 70(2):583–601.
minka, thomas (2008). ep: A quick reference. url: https://tminka.github.io/papers/
ep/minka-ep-quickref.pdf.
minka, thomas p. (2001). A family of algorithms for approximate Bayesian inference. Ph.D.
thesis. Massachusetts Institute of Technology.
mockus, jonas (1972). Bayesian Methods of Search for an Extremum. Avtomatika i Vychis-
litel’naya Tekhnika (Automatic Control and Computer Sciences) 6(3):53–62.
mockus, jonas (1974). On Bayesian Methods for Seeking the Extremum. Optimization
Techniques ifip Technical Conference. Vol. 27. Lecture Notes in Computer Science.
Springer–Verlag, pp. 400–404.
mockus, jonas (1989). Bayesian Approach to Global Optimization: Theory and Applications.
Mathematics and its Applications. Kluwer Academic Publishers.
mockus, jonas, william eddy, audris mockus, linas mockus, and gintaras reklaitas
(2010). Bayesian Heuristic Approach to Discrete and Global Optimization: Algorithms,
Visualization, Software, and Applications. Nonconvex Optimization and its Applica-
tions. Kluwer Academic Publishers.
mockus, j., v. tiešis, and a. žilinskas (1978). The Application of Bayesian Methods for
Seeking the Extrememum. In: Towards Global Optimization 2. Ed. by l. c. w. dixon
and g. p. szegö. North–Holland, pp. 117–129.
møller, jesper, anne randi syversveen, and rasmus plenge waagepetersen (1998).
Log Gaussian Cox Processes. Scandinavian Journal of Statistics 25(3):451–482.
montgomery, douglas c. (2019). Design and Analysis of Experiments. 10th ed. John Wiley
& Sons.
moore, andrew w. and christopher g. atkeson (1993). Memory-based Reinforcement
Learning: Efficient Computation with Prioritized Sweeping. Advances in Neural
Information Processing Systems 5 (neurips 1992), pp. 263–270.
moriconi, riccardo, marc peter deisenroth, and k. s. sesh kumar (2020). High-
dimensional Bayesian optimization using low-dimensional feature spaces. Machine
Learning 109(9–10):1925–1943.
moss, henry b., daniel beck, javier gonzález, david s. leslie, and paul rayson (2020).
boss: Bayesian Optimization over String Spaces. Advances in Neural Information
Processing Systems 33 (neurips 2020), pp. 15476–15486.
murray, iain (2016). Differentiation of the Cholesky decomposition. arXiv: 1602.07527
[stat.CO].
342
bibliography
murray, iain, ryan prescott adams, and david j. c. mackay (2010). Elliptical slice
sampling. Proceedings of the 13th International Conference on Artificial Intelligence
and Statistics (aistats 2010). Vol. 9. Proceedings of Machine Learning Research,
pp. 541–548.
mutný, mojmír and andreas krause (2018). Efficient High Dimensional Bayesian Op-
timization with Additivity and Quadrature Fourier Features. Advances in Neural
Information Processing Systems 31 (neurips 2018), pp. 9005–9016.
neal, radford m. (1997). Monte Carlo Implementation of Gaussian Process Models for Bayes-
ian Regression and Classification. Technical report (9702). Department of Statistics,
University of Toronto.
neal, radford m. (1998). Regression and Classification Using Gaussian Process Priors. In:
Bayesian Statistics 6. Ed. by j. m. bernardo, j. o. berger, a. p. dawid, and a. f. m.
smith. Oxford University Press, pp. 475–490.
nguyen, qoc phong, zhongxiang dai, bryan kian hsiang low, and patrick jaillet
(2021a). Optimizing Conditional Value-At-Risk of Black-Box Functions. Advances
in Neural Information Processing Systems 34 (neurips 2021).
nguyen, qoc phong, zhongxiang dai, bryan kian hsiang low, and patrick jaillet
(2021b). Value-at-Risk Optimization with Gaussian Processes. Proceedings of the
38th International Conference on Machine Learning (icml 2021). Vol. 139. Proceedings
of Machine Learning Research, pp. 8063–8072.
nguyen, vu, sunil gupta, santu rana, cheng li, and svetha venkatesh (2017). Regret
for Expected Improvement over the Best-Observed Value and Stopping Condition.
Proceedings of the 9th Asian Conference on Machine Learning (acml 2017). Vol. 77.
Proceedings of Machine Learning Research, pp. 279–294.
nickisch, hannes and carl edward rasmussen (2008). Appproximations for Binary
Gaussian Process Classification. Journal of Machine Learning Research 9(Oct):2035–
2078.
o’hagan, a. (1978). Curve Fitting and Optimal Design for Prediction. Journal of the Royal
Statistical Society Series B (Methodological) 40(1):1–42.
o’hagan, a. (1991). Bayes–Hermite quadrature. Journal of Statistical Planning and Inference
29(3):245–260.
o’hagan, anthony and jonathan forster (2004). Kendall’s Advanced Theory of Statistics.
2nd ed. Vol. 2b: Bayesian Inference. Arnold.
oh, changyong, jakub m. tomczak, efstratios gavves, and max welling (2019). Com-
binatorial Bayesian Optimization using the Graph Cartesian Product. Advances in
Neural Information Processing Systems 32 (neurips 2019), pp. 2914–2924.
øksendal, bernt (2013). Stochastic Differential Equations: An Introduction with Applications.
6th ed. Universitext. Springer–Verlag.
oliveira, rafael, lionel ott, and fabio ramos (2019). Bayesian optimisation under
uncertain inputs. Proceedings of the 22nd International Conference on Artificial
Intelligence and Statistics (aistats 2019). Vol. 89. Proceedings of Machine Learning
Research, pp. 1177–1184.
osborne, michael a., david duvenaud, roman garnett, carl e. rasmussen, stephen j.
roberts, and zoubin ghahramani (2012). Active Learning of Model Evidence
Using Bayesian Quadrature. Advances in Neural Information Processing Systems 25
(neurips 2012), pp. 46–54.
osborne, michael a., roman garnett, and stephen j. roberts (2009). Gaussian Processes
for Global Optimization. Proceedings of the 3rd Learning and Intelligent Optimization
Conference (lion 3).
paria, biswajit, kirthevasan kandasamy, and barnabás póczos (2019). A Flexible
Framework for Multi-Objective Bayesian Optimization using Random Scalarizations.
Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (uai 2019).
Vol. 115. Proceedings of Machine Learning Research, pp. 766–776.
peirce, c. s. (1876). Note on the Theory of the Economy of Research. In: Report of the
Superintendent of the United States Coast Survey Showing the Progress of the Work
for the Fiscal Year Ending with June, 1876. Government Printing Office, pp. 197–201.
picheny, victor (2014). A Stepwise uncertainty reduction approach to constrained global
optimization. Proceedings of the 17th International Conference on Artificial Intelligence
and Statistics (aistats 2014). Vol. 33. Proceedings of Machine Learning Research,
pp. 787–795.
picheny, victor (2015). Multiobjective optimization using Gaussian process emulators via
stepwise uncertainty reduction. Statistics and Computing 25(6):1265–1280.
picheny, victor, david ginsbourger, yann richet, and gregory caplin (2013a). Quantile-
Based Optimization of Noisy Computer Experiments With Tunable Precision. Tech-
nometrics 55(1):2–13.
picheny, victor, tobias wagner, and david ginsbourger (2013b). A benchmark of
kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary
Optimization 48(3):607–626.
pleiss, geoff, martin jankowiak, david eriksson, anil damle, and jacob r. gardner
(2020). Fast Matrix Square Roots with Applications to Gaussian Processes and
Bayesian Optimization. Advances in Neural Information Processing Systems 33
(neurips 2020), pp. 22268–22281.
poincaré, henri (1912). Calcul des probabilités. 2nd ed. Gauthier–Villars.
ponweiser, wolfgang, tobias wagner, dirk biermann, and markus vincze (2008).
Multiobjective Optimization on a Limited Budget of Evaluations Using Model-
Assisted S-Metric Selection. Proceedings of the 10th International Confernce on
Parallel Problem Solving from Nature (ppsn x). Vol. 5199. Lecture Notes in Computer
Science. Springer–Verlag, pp. 784–794.
powell, warren b. (2011). Approximate Dynamic Programming: Solving the Curses of
Dimensionality. 2nd ed. Wiley Series in Probability and Statistics. John Wiley &
Sons.
pronzato, luc (2017). Minimax and maximin space-filling designs: some properties and
methods for construction. Journal de la Société Française de Statistique 158(1):7–36.
raiffa, howard and robert schlaifer (1961). Applied Statistical Decision Theory. Division
of Research, Graduate School of Business Administration, Harvard University.
rasmussen, carl edward and zoubin ghahramani (2002). Bayesian Monte Carlo. Ad-
vances in Neural Information Processing Systems 15 (neurips 2002), pp. 505–512.
rasmussen, carl edward and christopher k. i. williams (2006). Gaussian Processes for
Machine Learning. Adaptive Computation and Machine Learning. mit Press.
riqelme, carlos, george tucker, and jasper snoek (2018). Deep Bayesian Bandits Show-
down. Proceedings of the 6th International Conference on Learning Representations
(iclr 2018).
robbins, herbert (1952). Some Aspects of the Sequential Design of Experiments. Bulletin
of the American Mathematical Society 58(5):527–535.
344
bibliography
rolland, paul, jonathan scarlett, ilija bogunovic, and volkan cevher (2018). High-
Dimensional Bayesian Optimization via Additive Models with Overlapping Groups.
Proceedings of the 21st International Conference on Artificial Intelligence and Statistics
(aistats 2018). Vol. 84. Proceedings of Machine Learning Research, pp. 298–307.
ross, andrew m. (2010). Computing Bounds on the Expected Maximum of Correlated
Normal Variables. Methodology and Computing in Applied Probability 12(1):111–138.
rudin, walter (1976). Principles of Mathematical Analysis. 3rd ed. International Series in
Pure and Applied Mathematics. McGraw–Hill.
rue, håvard, sara martino, and nicolas chopin (2009). Approximate Bayesian inference
for latent Gaussian models by using integrated nested Laplace approximations.
Journal of the Royal Statistical Society Series B (Methodological) 71(2):319–392.
russo, daniel and benjamin van roy (2014). Learning to Optimize via Posterior Sampling.
Mathematics of Operations Research 39(4):1221–1243.
russo, daniel and benjamin van roy (2016). An Information-Theoretic Analysis of Thomp-
son Sampling. Journal of Machine Learning Research 17(68):1–30.
sacks, jerome, william j. welch, toby j. mitchell, and henry p. wynn (1989). Design
and Analysis of Computer Experiments. Statistical Science 4(4):409–435.
salgia, sudeep, sattar vakili, and qing zhao (2020). A Computationally Efficient Ap-
proach to Black-box Optimization using Gaussian Process Models. arXiv: 2010.
13997 [stat.ML].
šaltenis, vydūnas r. (1971). One Method of Multiextremum Optimization. Avtomatika i
Vychislitel’naya Tekhnika (Automatic Control and Computer Sciences) 5(3):33–38.
sanchez, susan m. and paul j. sanchez (2005). Very Large Fractional Factorial and Central
Composite Designs. acm Transactions on Modeling and Computer Simulation 15(4):
362–377.
scarlett, jonathan (2018). Tight Regret Bounds for Bayesian Optimization in One
Dimension. Proceedings of the 35th International Conference on Machine Learning
(icml 2018). Vol. 80. Proceedings of Machine Learning Research, pp. 4500–4508.
scarlett, jonathan, ilija bogunovic, and volkan cevher (2017). Lower Bounds on
Regret for Noisy Gaussian Process Bandit Optimization. Proceedings of the 2017
Conference on Learning Theory (colt 2017). Vol. 65. Proceedings of Machine Learning
Research, pp. 1723–1742.
scarlett, jonathan and volkan cevhar (2021). An Introductory Guide to Fano’s Inequal-
ity with Applications in Statistical Estimation. In: Information-Theoretic Methods
in Data Science. Ed. by miguel r. d. rodrigues and yonina c. eldar. Cambridge
University Press.
schonlau, matthias, william j. welch, and donald r. jones (1998). Global versus Local
Search in Constrained Optimization of Computer Models. In: New Developments
and Applications in Experimental Design. Vol. 34. Lecture Notes – Monograph Series.
Institute of Mathematical Statistics, pp. 11–25.
schonlau, mattihias (1997). Computer Experiments and Global Optimization. Ph.D. thesis.
University of Waterloo.
scott, warren, peter frazier, and warren powell (2011). The Correlated Knowledge
Gradient for Simulation Optimization of Continuous Parameters Using Gaussian
Process Regression. siam Journal on Optimization 21(3):996–1026.
seeger, matthias (2008). Expectation Propagation for Exponential Families. Technical
report. University of California, Berkeley.
settles, burr (2012). Active Learning. Synthesis Lectures on Artificial Intelligence and
Machine Learning. Morgan & Claypool.
shah, amar and zoubin ghahramani (2015). Parallel Predictive Entropy Search for Batch
Global Optimization of Expensive Objective Functions. Advances in Neural Infor-
mation Processing Systems 28 (neurips 2015), pp. 3330–3338.
shah, amar, andrew gordon wilson, and zoubin ghahramani (2014). Student-𝑡 Pro-
cesses as Alternatives to Gaussian Processes. Proceedings of the 17th International
Conference on Artificial Intelligence and Statistics (aistats 2014). Vol. 33. Proceedings
of Machine Learning Research, pp. 877–885.
shahriari, bobak, kevin swersky, ziyu wang, ryan p. adams, and nando de freitas
(2016). Taking the Human Out of the Loop: A Review of Bayesian Optimization.
Proceedings of the ieee 104(1):148–175.
shahriari, bobak, ziyu wang, matthew w. hoffman, alexandre bouchard-côté, and
nando de freitas (2014). An Entropy Search Portfolio for Bayesian Optimization.
arXiv: 1406.4625 [stat.ML].
shamir, ohad (2013). On the Complexity of Bandit and Derivative-Free Stochastic Convex
Optimization. Proceedings of the 24th Annual Conference on Learning Theory (colt
2013). Vol. 30. Proceedings of Machine Learning Research, pp. 3–24.
shannon, c. e. (1948). A Mathematical Theory of Communication. The Bell System Technical
Journal 27(3):379–423.
shao, t. s., t. c. chen, and r. m. frank (1964). Tables of Zeros and Gaussian Weights of
Certain Associated Laguerre Polynomials and the Related Generalized Hermite
Polynomials. Mathematics of Computation 18(88):598–616.
shepp, l. a. (1979). The Joint Density of the Maximum and its Location for a Wiener Process
with Drift. Journal of Applied Probability 16(2):423–427.
silverman, b. w. (1986). Density Estimation for Statistics and Data Analysis. Monographs
on Statistics and Applied Probability. Chapman & Hall.
slepian, david (1962). The One-Sided Barrier Problem for Gaussian Noise. The Bell System
Technical Journal 41(2):463–501.
smith, kirstine (1918). On the Standard Deviations of Adjusted and Interpolated Values of
an Observed Polynomial Function and its Constants and the Guidance they Give
Towards a Proper Choice of the Distribution of Observations. Biometrika 12(1–2):
1–85.
smola, alex j. and bernhard schölkopf (2000). Sparse Greedy Matrix Approximation
for Machine Learning. Proceedings of the 17th International Conference on Machine
Learning (icml 2000), pp. 911–918.
snelson, edward and zoubin ghahramani (2005). Sparse Gaussian Processes using
Pseudo-inputs. Advances in Neural Information Processing Systems 18 (neurips 2005),
pp. 1257–1264.
snoek, jasper, hugo larochelle, and ryan p. adams (2012). Practical Bayesian Optimiza-
tion of Machine Learning Algorithms. Advances in Neural Information Processing
Systems 25 (neurips 2012), pp. 2951–2959.
snoek, jasper, oren rippel, kevin swersky, ryan kiros, nadathur satish, narayanan
sundaram, et al. (2015). Scalable Bayesian Optimization Using Deep Neural Net-
works. Proceedings of the 32nd International Conference on Machine Learning (icml
2015). Vol. 37. Proceedings of Machine Learning Research, pp. 2171–2180.
snoek, jasper, kevin swersky, richard zemel, and ryan p. adams (2014). Input Warping
for Bayesian Optimization of Non-Sationary Functions. Proceedings of the 31st
346
bibliography
348
bibliography
wilson, andrew gordon, zhiting hu, ruslan salakhutdinov, and eric p. xing (2016).
Deep Kernel Learning. Proceedings of the 19th International Conference on Artificial
Intelligence and Statistics (aistats 2016). Vol. 51. Proceedings of Machine Learning
Research, pp. 370–378.
wilson, james t., frank hutter, and marc peter deisenroth (2018). Maximizing acquisi-
tion functions for Bayesian optimization. Advances in Neural Information Processing
Systems 31 (neurips 2018), pp. 9884–9895.
wu, jian and peter i. frazier (2016). The Parallel Knowledge Gradient Method for Batch
Bayesian Optimization. Advances in Neural Information Processing Systems 29
(neurips 2016), pp. 3126–3134.
wu, jian and peter i. frazier (2019). Practical Two-Step Look-Ahead Bayesian Optimiza-
tion. Advances in Neural Information Processing Systems 32 (neurips 2019), pp. 9813–
9823.
wu, jian, matthias poloczek, andrew gordon wilson, and peter i. frazier (2017).
Bayesian Optimization with Gradients. Advances in Neural Information Processing
Systems 30 (neurips 2017), pp. 5267–5278.
wu, jian, saul toscano-palmerin, peter i. frazier, and andrew gordon wilson
(2019). Practical Multi-fidelity Bayesian Optimization for Hyperparameter Tuning.
Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (uai 2019).
Vol. 115. Proceedings of Machine Learning Research, pp. 788–798.
yang, kaifeng, michael emmerich, andré deutz, and thomas bäck (2019a). Efficient
computation of expected hypervolume improvement using box decomposition
algorithms. Journal of Global Optimization 75(1):3–34.
yang, kaifeng, michael emmerich, andré deutz, and thomas bäck (2019b). Multi-
Objective Bayesian Global Optimization using expected hypervolume improvement
gradient. Swarm and Evolutionary Computation 44:945–956.
yue, xubo and raed al kontar (2020). Why Non-myopic Bayesian Optimization is
Promising and How Far Should We Look-ahead? A Study via Rollout. Proceedings
of the 23rd International Conference on Artificial Intelligence and Statistics (aistats
2020). Vol. 108. Proceedings of Machine Learning Research, pp. 2808–2818.
zhang, weitong, dongruo zhou, lihong li, and qanqan gu (2021). Neural Thompson
Sampling. Proceedings of the 9th International Conference on Learning Representations
(iclr 2021.
zhang, yehong, trong nghia hoang, bryan kian hsiang low, and mohan kankan-
halli (2017). Information-Based Multi-Fidelity Bayesian Optimization. Bayesian
Optimization for Science and Engineering Workshop (BayesOpt 2017), Conference on
Neural Information Processing Systems (neurips 2017).
zhou, dongrou, lihong li, and qanqan gu (2020). Neural Contextual Bandits with
ucb-based Exploration. Proceedings of the 37th International Conference on Machine
Learning (icml 2020). Vol. 119. Proceedings of Machine Learning Research, pp. 11492–
11502.
ziatdinov, maxim a., ayana ghosh, and sergei v. kalinin (2021). Physics makes the
difference: Bayesian optimization and active learning via augmented Gaussian
process. arXiv: 2108.10280 [physics.comp-ph].
zilberstein, schlomo (1996). Using Anytime Algorithms in Intelligent Systems. ai Maga-
zine 17(3):73–83.
žilinskas, antanas g. (1975). Single-Step Bayesian Search Method for an Extremum of
Functions of a Single Variable. Kibernetika (Cybernetics) 11(1):160–166.
350
bibliography
This material will be published by Cambridge University Press as Bayesian Optimization. This prepublication 353
version is free to view and download for personal use only. Not for redistribution, resale, or use in derivative
works. ©Roman Garnett 2022.
index
cost-aware optimization, 103, 245, 253, 265
covariance function, see prior covariance function
cross-covariance function, 19, 23, 24, 27, 30, 201, 264
cumulative regret, 𝑅𝜏, 145, 214
cumulative reward, 114, 142, 155, 215, 283
curse of dimensionality, 61, 208

D
de novo design, 314
decoupled constraint observations, 252
deep kernel, 59, 61
deep neural networks, ix, 1, 59, 61, 292
density ratio estimation, 197
design and analysis of computer experiments (dace), 290
determinantal point process, 261, 285
differentiability in mean square, 30, see also continuous differentiability
differential entropy, 𝐻 [𝜔], see entropy
dilation, 56
disjoint union, 27
drug discovery, see molecular design
dynamic termination, see termination decisions

E
early stopping, 210, 281
electrical engineering, applications in, 324
elliptical slice sampling, 38
embedding, see linear embedding, neural embedding
entropy search, see mutual information
entropy, 𝐻 [𝜔], 115, see also conditional entropy
environmental variables, 277
expectation propagation, 39, 182, 190, 273, 302
expected gain per unit cost, 248
expected hypervolume improvement (ehvi), 271
expected improvement, 𝛼 ei, 81, 95, 113, 117, 127, 151, 158, 193, 196, 199, 265, 266, 268, see also simple reward
    augmented, 166
    batch, 259
    comparison with probability of improvement, 132
    computation with noise, 160
        alternative formulations, 163
        gradient, 308
    computation without noise, 159
        gradient, 159
    convergence, 217
    modified, 154
    origin, 289
    worst-case regret with noise, 239, 243
expected utility, 90, 93
exploration bonus, 145
exploration vs. exploitation dilemma, 11, 83, 123, 128, 131, 133, 143, 145, 146, 148, 154, 159, 214, 293
exponential covariance function, 𝐾M1/2, 52, 219
extreme value theorem, 34

F
factor graph, 37
Fano’s inequality, 231
feasible region, F, 249
figure of merit, see acquisition function
fill distance, 𝛿 x, 238
Fourier transform, 50, 53, 236
freeze–thaw Bayesian optimization, 281
fully independent training conditional (fitc) approximation, 207

G
Gauss–Hermite quadrature, 172, 258
    gradient, 309
Gaussian process (gp), GP (𝑓 ; 𝜇, 𝐾), 8, 16, 95, 124, see also prior mean function; prior covariance function
    approximate inference, 35, see also sparse spectrum approximation; sparse approximation
    classification, 41, 283
    computation of policies with, 157
    continuity, 28
    credible intervals, 18
    differentiability, 30
    exact inference, 19
        additive Gaussian noise, 20, 23
        computation, 201
        derivative observations, 32
        exact observation, 22
        interpretation of posterior moments, 21
    joint, see joint Gaussian process
    marginal likelihood 𝑝 (y | x, 𝜽), 72, 202, 220
        gradient, 307
    maxima, existence and uniqueness, 33
T
terminal recommendation, 118
terminal recommendations, 90, 109
termination decisions, 5, 253
    optimal, 103
    practical, 211
termination option, ∅, 104
Thompson sampling, 148, 176, 181, 187, 195, 259
    acquisition function view, 148
    batch, 261
    computation, 176
    origin, 292
    regret bounds

V
value of data, 𝛼𝜏∗, 95, 101
value of sample information, 126
variational inference, 39, 206
virtual screening, 314
von Neumann–Morgenstern theorem, 90, 120

W
weighted Euclidean distance, 57, see also automatic relevance determination
Wiener process, 174, 217, 232, 242, 290
wiggliness, 56, 198, see also length scale
worst-case regret, 𝑟¯𝜏, 𝑅¯𝜏, 218, 232, 239