Bayesian Optimization
Theory and Practice Using Python
Peng Liu
Bayesian Optimization: Theory and Practice Using Python
Peng Liu
Singapore, Singapore
Table of Contents
Introduction ... xv
Chapter 7: Case Study: Tuning CNN Learning Rate with BoTorch ... 185
    Seeking Global Optimum of Hartmann ... 186
    Generating Initial Conditions ... 187
    Updating GP Posterior ... 188
Index ... 225
About the Author
Peng Liu is an assistant professor of quantitative finance
(practice) at Singapore Management University and an
adjunct researcher at the National University of Singapore.
He holds a Ph.D. in Statistics from the National University
of Singapore and has ten years of working experience as a
data scientist across the banking, technology, and hospitality
industries.
About the Technical Reviewer
Jason Whitehorn is an experienced entrepreneur and
software developer and has helped many companies
automate and enhance their business solutions through data
synchronization, SaaS architecture, and machine learning.
Jason obtained his Bachelor of Science in Computer Science
from Arkansas State University, but he traces his passion
for development back many years before then, having first
taught himself to program BASIC on his family’s computer
while in middle school. When he’s not mentoring and
helping his team at work, writing, or pursuing one of his
many side-projects, Jason enjoys spending time with his wife and four children and
living in the Tulsa, Oklahoma, region. More information about Jason can be found on his
website: https://jason.whitehorn.us.
Acknowledgments
This book summarizes my learning journey in Bayesian optimization during my
(part-time) Ph.D. study. It started as a personal interest in exploring this area and
gradually grew into a book combining theory and practice. For that, I thank my
supervisors, Teo Chung Piaw and Chen Ying, for their continued support in my
academic career.
Introduction
Bayesian optimization provides a unified framework that solves the problem of
sequential decision-making under uncertainty. It includes two key components: a
surrogate model approximating the unknown black-box function with uncertainty
estimates and an acquisition function that guides the sequential search. This book
reviews both components, covering both theoretical introduction and practical
implementation in Python, building on top of popular libraries such as GPyTorch and
BoTorch. In addition, the book provides case studies on using Bayesian optimization to seek a simulated function's global optimum or to locate the best hyperparameters (e.g., the learning rate) when training deep neural networks. The book assumes only a minimal understanding of model development and machine learning.
All source code used in this book can be downloaded from github.com/apress/Bayesian-optimization.
CHAPTER 1
Bayesian Optimization Overview
As the name suggests, Bayesian optimization is an area that studies optimization
problems using the Bayesian approach. Optimization aims at locating the optimal
objective value (i.e., a global maximum or minimum) of all possible values or the
corresponding location of the optimum in the environment (the search domain). The
search process starts at a specific initial location and follows a particular policy to
iteratively guide the following sampling locations, collect new observations, and refresh
the guiding policy.
As shown in Figure 1-1, the overall optimization process consists of repeated
interactions between the policy and the environment. The policy is a mapping function
that takes in a new input observation (plus historical ones) and outputs the following
sampling location in a principled way. Here, we are constantly learning and improving the policy, since a good policy guides our search toward the global optimum more efficiently and effectively, spending the limited sampling budget on promising candidate locations. On the other hand, the environment contains
the unknown objective function to be learned by the policy within a specific boundary.
When probing the functional value as requested by the policy, the actual observation
revealed by the environment to the policy is often corrupted by noise, making learning
even more challenging. Thus, Bayesian optimization, a specific approach to global optimization, seeks to learn a policy that helps us efficiently and effectively navigate to the global optimum of an unknown, noise-corrupted environment.
Figure 1-1. The overall Bayesian optimization process. The policy digests the
historical observations and proposes the new sampling location. The environment
governs how the (possibly noise-corrupted) observation at the newly proposed
location is revealed to the policy. Our goal is to learn an efficient and effective
policy that could navigate toward the global optimum as quickly as possible
Global Optimization
Optimization aims to locate the optimal set of parameters of interest across the whole
domain through carefully allocating limited resources. For example, when searching
for the car key at home before leaving for work in two minutes, we would naturally start
with the most promising place where we would usually put the key. If it is not there,
think for a little while about the possible locations and go to the next most promising
place. This process iterates until the key is found. In this example, the policy is digesting
the available information on previous searches and proposing the following promising
location. The environment is the house itself, revealing if the key is placed at the
proposed location upon each sampling.
This is considered an easy example since we are familiar with the environment
in terms of its structural design. However, imagine locating an item in a totally new
environment. The policy would need to account for the uncertainty due to unfamiliarity
with the environment while sequentially determining the next sampling location. When
the sampling budget is limited, as is often the case in real-life searches in terms of
time and resources, the policy needs to argue carefully on the utility of each candidate
sampling location.
Let us formalize the sequential global optimization using mathematical terms. We
are dealing with an unknown scalar-valued objective function f based on a specific
domain Α. In other words, the unknown subject of interest f is a function that maps a
certain sample in Α to a real number in ℝ, that is, f : Α → ℝ. We typically place no specific
assumption about the nature of the domain Α other than that it should be a bounded,
compact, and convex set.
Unless otherwise specified, we focus on the maximization setting instead of
minimization since maximizing the objective function is equivalent to minimizing the
negated objective, and vice versa. The optimization procedure thus aims at locating
the global maximum f ∗ or its corresponding location x∗ in a principled and systematic
manner. Mathematically, we wish to locate f ∗ where
f* = max_{x∈Α} f(x)
and its corresponding location
x* = argmax_{x∈Α} f(x)
Figure 1-2 provides an example one-dimensional objective function with its global
maximum f ∗ and its location x∗ highlighted. The goal of global optimization is thus to
systematically reason about a series of sampling decisions within the total search space
Α, so as to locate the global maximum as fast as possible, that is, sampling as few times
as possible.
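To make the notation concrete, here is a minimal sketch that approximates f* and x* for a toy one-dimensional objective by exhaustive grid search; the objective function and grid size are made-up assumptions for illustration, and this brute-force probing is precisely what Bayesian optimization avoids when each evaluation is expensive.

```python
import numpy as np

# A hypothetical 1D objective on the domain A = [0, 1]; in practice f is unknown.
def f(x):
    return np.sin(10 * x) * x + 0.5 * x

# Exhaustive probing over a dense grid, feasible only because this toy f is cheap.
xs = np.linspace(0, 1, 10_001)
ys = f(xs)
x_star = xs[np.argmax(ys)]   # approximate argmax_{x in A} f(x)
f_star = ys.max()            # approximate f*
print(f"x* ~ {x_star:.4f}, f* ~ {f_star:.4f}")
```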
Figure 1-2. An example objective function with the global maximum and its
location marked with star. The goal of global optimization is to systematically
reason about a series of sampling decisions so as to locate the global maximum as
fast as possible
Note that this is a nonconvex function, as is often the case for real-life functions we are optimizing. Nonconvexity means we cannot resort to first-order gradient-based methods to reliably search for the global optimum, since they will likely converge to a local optimum. This is also one of the advantages of Bayesian optimization compared with gradient-based optimization procedures.
• Each functional evaluation is costly, thus ruling out the option for
an exhaustive probing. We need to have a sample-efficient method to
minimize the number of evaluations of the environment while trying
to locate its global optimum. In other words, the optimizer needs to
fully utilize the existing observations and systematically reason about
the next sampling decision so that the limited resource is well spent
on promising locations.
Figure 1-3. Three possible functional forms. On the left is a convex function whose
optimization is easy. In the middle is a nonconvex function with multiple local
minima, and on the right is also a nonconvex function with a wide flat region full
of saddle points. Optimization for the latter two cases takes a lot more work than
for the first case
Figure 1-4. Slow convergence due to a small learning rate on the left and
divergence due to a large learning rate on the right
Next, we will delve into the various components of a typical Bayesian optimization setup, including the observation model, the optimization policy, and Bayesian inference.
Figure 1-5. Illustrating the actual observations (in dots) and the underlying
objective function (in dashed line). When sampling at a specific location, the
observation would be disrupted by an additive noise. The observation model thus
determines how the observation would be revealed to the policy, which needs to
account for the uncertainty due to noise perturbation
To make our discussion more precise, let us use f (x) to denote the (unknown)
objective function value at location x. We sometimes write f (x) as f for simplicity. We
use y to denote the actual observation at location x, which will slightly differ from f due
to noise perturbation. We can thus express the observation model, which governs how
the policy sees the observation from the environment, as a probability distribution of y
based on a specific location x and true function value f:
p(y | x, f)
Let us assume an additive noise term ε inflicted on f; the actual observation y can
thus be expressed as
y = f + ε
Here, the noise term ε arises from measurement error or inaccurate statistical
approximation, although it may disappear in certain computer simulations. A common practice is to treat the error as a random variable that follows a Gaussian distribution with a zero mean and fixed standard deviation σ, that is, ε ~ N(0, σ²). Note that it is unnecessary to fix σ across the whole domain Α; Bayesian optimization allows for both homoscedastic noise (i.e., fixed σ across Α) and heteroscedastic noise (i.e., σ that depends on the specific location in Α).
Therefore, we can formulate a Gaussian observation model as follows:
p(y | x, f, σ) = N(y; f, σ²)
This means that for a specific location x, the actual observation y is treated as a random variable that follows a Gaussian/normal distribution with mean f and variance σ². Figure 1-6 illustrates an example probability distribution of y centered around f. Note that the variance of the noise is often estimated by sampling a few initial observations and is expected to be small, so that the overall observation model still strongly depends on and stays close to f.
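As a minimal sketch of this observation model (the objective f and the noise level σ here are made-up assumptions, not from the book), we can simulate repeated noisy probes at a fixed location and check that their empirical mean and spread recover f(x) and σ:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # hypothetical true objective (unknown in practice)
    return np.sin(3 * x)

def observe(x, sigma=0.1):     # Gaussian observation model: y ~ N(f(x), sigma^2)
    return f(x) + rng.normal(loc=0.0, scale=sigma)

x = 0.8
samples = np.array([observe(x) for _ in range(1000)])
print(f"f(x) = {f(x):.3f}, sample mean = {samples.mean():.3f}, "
      f"sample std = {samples.std():.3f}")   # mean ~ f(x), std ~ sigma
```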
Figure 1-6. Assuming a normal probability distribution for the actual observation
as a random variable. The Gaussian distribution is centered around the objective
function f value evaluated at a given location x and spread by the variance of the
noise term
The following section introduces Bayesian statistics to lay the theoretical foundation
as we work with probability distributions along the way.
Bayesian Statistics
Bayesian optimization is not a particular algorithm for global optimization; it is a suite of
algorithms based on the principles of Bayesian inference. As the optimization proceeds
in each iteration, the policy needs to determine the next sampling decision or if the
current search needs to be terminated. Due to uncertainty in the objective function and
the observation model, the policy needs to cater to such uncertainty upon deciding
the following sampling location, which bears both an immediate impact on follow-up
decisions and a long-term effect on all future decisions. The samples selected thus need
to reasonably contribute to the ultimate goal of global optimization and justify the cost
incurred due to sampling.
Using Bayesian statistics in optimization paves the way for us to systematically
and quantitatively reason about these uncertainties using probabilities. For example,
we would place a prior belief about the characteristics of the objective function and
quantify its uncertainties by assigning high probability to specific ranges of values and
low probability to others. As more observations are collected, the prior belief is gradually
updated and calibrated toward the true underlying distribution of the objective function
in the form of a posterior distribution.
We now cover the fundamental concepts and tools of Bayesian statistics.
Understanding these sections is essential to appreciate the inner workings of Bayesian
optimization.
Bayesian Inference
Bayesian inference essentially relies on the Bayesian formula (also called Bayes’ rule)
to reason about the interactions among three components: the prior distribution p(θ)
where θ represents the parameter of interest, the likelihood p(data| θ) given a specific
parameter θ, and the posterior distribution p(θ| data). There is one more component, the
evidence of the data p(data), which is often not computable. The Bayesian formula is as
follows:
p(θ | data) = p(data | θ) p(θ) / p(data)
Let us look closely at this widely used formula, arguably the most important in Bayesian statistics. Remember that any Bayesian inference procedure aims to derive the
posterior distribution p(θ| data) (or calculate its marginal expectation) for the parameter
of interest θ, in the form of a probability density function. For example, we might end up
with a continuous posterior distribution as in Figure 1-7, where θ varies from 0 to 1, and
all the probabilities (i.e., area under the curve) would sum to 1.
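To make Bayes' rule concrete, here is a minimal numerical sketch that discretizes θ on a grid and applies the formula directly; the uniform prior, the data values, and the noise level are illustrative assumptions (scipy is assumed available). The evidence appears explicitly as the normalizing constant.

```python
import numpy as np
from scipy.stats import norm

# Discretize theta on [0, 1] and apply Bayes' rule numerically.
theta = np.linspace(0, 1, 1001)
dx = theta[1] - theta[0]
prior = np.ones_like(theta)                  # uniform prior p(theta)
prior /= prior.sum() * dx                    # normalize to a density

data = np.array([0.62, 0.58, 0.65])          # made-up observations
sigma = 0.1                                  # assumed known noise level

# Likelihood p(data | theta) under i.i.d. y ~ N(theta, sigma^2)
likelihood = np.prod(norm.pdf(data[:, None], loc=theta, scale=sigma), axis=0)

evidence = (likelihood * prior).sum() * dx   # p(data), the normalizing constant
posterior = likelihood * prior / evidence    # p(theta | data)
print(f"posterior mean ~ {(theta * posterior).sum() * dx:.3f}")
```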
With a uniform prior, for example, the updated distribution will progressively approach a normal distribution as more data is collected, thus forming a posterior distribution that better approximates the true distribution of θ.
Figure 1-8. Updating the prior uniform distribution toward a posterior normal
distribution as more data is collected. The role of the prior distribution decreases
as more data is collected to support the approximation to the true underlying
distribution
The last term is the denominator p(data), also referred to as the evidence, which represents the probability of obtaining the data over all different choices of θ and serves as a normalizing constant independent of θ in Bayes' theorem. This is the most difficult component to compute, since we need to integrate over all possible values of θ. For each given θ, the likelihood is calculated based on the assumed observation model for data generation, just as in the likelihood term itself; the difference is that the evidence considers every possible value of θ and weights the resulting likelihood by the probability of observing that particular θ. Since the evidence does not depend on θ, it is often ignored when analyzing the proportionate change in the posterior, so the analysis focuses on the likelihood and the prior alone.
A relatively simple case is when the prior p(θ) and the likelihood p(data| θ) are conjugate, making the resulting posterior p(θ| data) analytic and thus easy to work with due to its closed-form expression. Bayesian inference becomes much easier and less restrictive if we can write down the explicit form and generate the exact shape of the posterior p(θ| data) without resorting to sampling methods. The posterior follows the same family of distribution as the prior when the prior is conjugate with the likelihood function. For example, when both the prior and the likelihood follow a normal distribution, the resulting posterior is also normally distributed. However, when the prior and the likelihood are not conjugate, we can still gain insight into the posterior distribution via efficient sampling techniques such as Gibbs sampling.
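As a sketch of the conjugate normal-normal case (a N(θ0, σ0²) prior with an i.i.d. N(θ, σ²) likelihood; all numbers are illustrative), the posterior parameters follow in closed form, with no grid or sampling needed:

```python
import numpy as np

def normal_normal_update(theta0, sigma0, data, sigma):
    """Closed-form posterior for a N(theta0, sigma0^2) prior with an
    i.i.d. N(theta, sigma^2) likelihood -- the conjugate case."""
    n = len(data)
    post_var = 1.0 / (1.0 / sigma0**2 + n / sigma**2)
    post_mean = post_var * (theta0 / sigma0**2 + np.sum(data) / sigma**2)
    return post_mean, np.sqrt(post_var)

data = np.array([0.62, 0.58, 0.65])              # made-up observations
mean, std = normal_normal_update(theta0=0.5, sigma0=0.3, data=data, sigma=0.1)
print(f"posterior: N({mean:.3f}, {std:.3f}^2)")  # analytic, no sampling needed
```

Running this on the same made-up data as the grid-based sketch above recovers essentially the same posterior, which is the practical payoff of conjugacy.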
Figure 1-9. Comparing the frequentist approach and the Bayesian approach
regarding the parameter of interest. The frequentist approach treats θ as a fixed
quantity that can be estimated via MLE, while the Bayesian approach employs a
probability distribution which gets refreshed as more data is collected
occurs given that the event y = Y has occurred. It is thus referred to as conditional
probability, as the probability of the first event is now conditioned on the second event.
All conditional probabilities for a (continuous) random variable x given a specific value
of another random variable (i.e., y = Y) form the conditional probability distribution
p(x| y = Y). More generally, we can write the joint probability distribution of random
variables x and y as p(x, y) and conditional probability distribution as p(x ∣ y).
The joint probability is also symmetrical, that is, p(X and Y) = p(Y and X), which is
a result of the exchangeability property of probability. Plugging in the definition of joint
probability using the chain rule gives the following:
p(X ∩ Y) = p(X | Y) p(Y) = p(Y | X) p(X)
If you look at this equation more closely, it is not difficult to see that it can lead to the
Bayesian formula we introduced earlier, namely:
p(X | Y) = p(Y | X) p(X) / p(Y)
Understanding this connection gives us one more reason not to memorize the
Bayesian formula but to appreciate it. We can also replace a single event x = X with the
random variable x to get the corresponding conditional probability distribution p(x| y = Y).
Lastly, we may only be interested in the probability of an event for one random
variable alone, disregarding the possible realizations of the other random variable.
That is, we would like to consider the probability of the event x = X under all possible
values of y. This is called the marginal probability for the event x = X. The marginal
probability distribution for a (continuous) random variable x in the presence of another
(continuous) random variable y can be calculated as follows:
p(x) = ∫ p(x, y) dy = ∫ p(x | y) p(y) dy
The preceding definition essentially sums up possible values p(x| y) weighted by the
likelihood of occurrence p(y). The weighted sum operation resolves the uncertainty in
the random variable y and thus in a way integrates it out of the original joint probability
distribution, keeping only one random variable. For example, the prior probability
p(θ) in Bayes’ rule is a marginal probability distribution of θ, which integrates out
other random variables, if any. The same goes for the evidence term p(data) which is
calculated by integrating over all possible values of θ.
Similarly, the marginal probability distribution of y is
p(y) = ∫ p(x, y) dx = ∫ p(y | x) p(x) dx
Figure 1-10 summarizes the three common probability distributions. Note that
the joint probability distribution focuses on two or more random variables, while
both the conditional and marginal probability distributions generally refer to a single
random variable. In the case of the conditional probability distribution, the other
random variable assumes a specific value and thus, in a way, “disappears” from the
joint distribution. In the case of the marginal probability distribution, the other random
variable is instead integrated out of the joint distribution.
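The following small sketch illustrates all three distributions on a made-up discrete joint table, where marginalization is a row/column sum and conditioning is a renormalized slice:

```python
import numpy as np

# A made-up joint distribution p(x, y) over two discrete random variables:
# rows index x in {0, 1, 2}, columns index y in {0, 1}.
p_xy = np.array([[0.10, 0.20],
                 [0.25, 0.15],
                 [0.05, 0.25]])
assert np.isclose(p_xy.sum(), 1.0)          # a valid joint distribution

p_x = p_xy.sum(axis=1)                      # marginal p(x): sum y out
p_y = p_xy.sum(axis=0)                      # marginal p(y): sum x out
p_x_given_y1 = p_xy[:, 1] / p_y[1]          # conditional p(x | y = 1)

print("p(x) =", p_x)
print("p(x | y = 1) =", p_x_given_y1)
```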
Let us revisit Bayes’ rule in the context of conditional and marginal probabilities.
Specifically, the likelihood term p(data| θ) can be treated as the conditional probability of
the data given the parameter θ, and the evidence term p(data) is a marginal probability
that needs to be evaluated across all possible choices of θ. Based on the definition
of marginal probability, we can write the calculation of p(data) as a weighted sum
(assuming a continuous θ):
p(data) = ∫ p(data | θ) p(θ) dθ
Independence
A special case that would impact the calculation of the three probabilities mentioned
earlier is independence, where the random variables are now independent of each
other. Let us look at the joint, conditional, and marginal probabilities with independent
random variables.
When two random variables are independent of each other, the event x = X would
have nothing to do with the event y = Y, that is, the conditional probability for x = X
given y = Y becomes p(X| Y) = p(X). The conditional probability distribution for two
independent random variables thus becomes p(x| y) = p(x). Their joint probability
becomes the multiplication of individual probabilities: p(X ∩ Y) = p(X | Y)p(Y) = p(X)p(Y),
and the joint probability distribution becomes a product of individual probability
distributions: p(x, y) = p(x)p(y). The marginal probability of x is just its own probability
distribution:
p(x) = ∫ p(x | y) p(y) dy = ∫ p(x) p(y) dy = p(x) ∫ p(y) dy = p(x)
where we have used the fact that p(x) can be moved out of the integration operation due
to its independence with y, and the total area under a probability distribution is one, that
is, ∫ p(y)dy = 1.
We can also extend to conditional independence, where the random variable x
could be independent from y given another random variable z. In other words, we have
p(x, y| z) = p(x| z)p(y| z).
Prior and Posterior Predictive Distributions
The prior predictive distribution of a new data point y before any data is collected is obtained by marginalizing out the parameter θ:
p(y) = ∫ p(y, θ) dθ = ∫ p(y | θ) p(θ) dθ
which is the exact definition of the evidence term in Bayes’ formula. In a discrete world,
we would take the prior probability for a specific value of the parameter θ, multiply
the likelihood of the resulting data given the current θ, and sum across all weighted
likelihoods.
Now let us look at the posterior predictive distribution for a new data point y′ after observing a collection of data points collectively denoted as 𝒟. We would like to assess how the future data would be distributed and what value of y′ we would be likely to observe if we were to run the experiment and acquire another data point, given that we have observed some actual data. That is, we want to calculate the posterior predictive distribution p(y′ | 𝒟).
We can calculate the posterior predictive distribution by treating it as a marginal distribution (conditioned on the collected dataset 𝒟) and applying the same technique as before, namely:
p(y′ | 𝒟) = ∫ p(y′ | θ, 𝒟) p(θ | 𝒟) dθ
where the second term p(θ | 𝒟) is the posterior distribution of the parameter θ that can be calculated by applying Bayes' rule. However, the first term p(y′ | θ, 𝒟) is more involved. When assessing a new data point after observing some existing data points, a
common assumption is that they are conditionally independent given a particular value of θ. Such conditional independence implies that p(y′ | θ, 𝒟) = p(y′ | θ), which happens to be the likelihood term. Thus, we can simplify the posterior predictive distribution as follows:
p(y′ | 𝒟) = ∫ p(y′ | θ) p(θ | 𝒟) dθ
which follows the same pattern of calculation as the prior predictive distribution. This would then give us the distribution of observations we would expect
for a new experiment (such as probing the environment in the Bayesian optimization
setting) given a set of previously collected observations. The prior and posterior
predictive distributions are summarized in Figure 1-11.
Figure 1-11. Definition of the prior and posterior predictive distributions. Both
are calculated based on the same pattern of a weighted sum between the prior and
the likelihood
Let us look at an example of the prior predictive distribution under a normal prior and likelihood function. Before the experiment starts, we assume the observation model for the likelihood of the data y to follow a normal distribution, that is, y ~ N(θ, σ²), or p(y | θ, σ²) = N(θ, σ²), where θ is the underlying parameter and σ² is a fixed variance. For example, in the case of the observation model in the Bayesian optimization setting introduced earlier, the parameter θ could represent the true objective function, and the variance σ² originates from an additive Gaussian noise. The distribution of y depends on θ, which itself is treated as a random variable with a normal prior distribution p(θ) = N(θ0, σ0²).
The prior predictive distribution can thus be calculated by plugging in the definition
of normal likelihood term p(y| θ) and the normal prior term p(θ). However, there is a
simple trick we can use to avoid the math, which would otherwise be pretty heavy if we
were to plug in the formula of the normal distribution directly.
Let us try directly working with the random variables. We will start by noting that
y = (y − θ) + θ. The first term y − θ takes θ away from y, which decentralizes y by changing
its mean to zero and removes the dependence of y on θ. In other words, (y − θ) ~ N(0, σ²),
which also represents the distribution of the random noise in the observation model
of Bayesian optimization. Since the second term θ is also normally distributed, we can
derive the distribution of y as follows:
y ~ N(0, σ²) + N(θ0, σ0²) = N(θ0, σ² + σ0²)
where we have used the fact that the addition of two independent normally distributed
random variables will also be normally distributed, with the mean and variance
calculated based on the sum of individual means and variances.
Therefore, the marginal probability distribution of y becomes p(y) = N(θ0, σ² + σ0²).
Intuitively, this form also makes sense. Before we start to collect any observation about
y, our best guess for its mean would be θ0, the expected value of the underlying random
variable θ. Its variance is the sum of individual variances since we are considering
uncertainties due to both the prior and the likelihood; the marginal distribution needs
to absorb both variances, thus compounding the resulting uncertainty. Figure 1-12
summarizes the derivation of the prior predictive distributions under the normality
assumption for the likelihood and the prior for a continuous θ.
Figure 1-12. Derivation process of the prior predictive distribution for a new data
point before collecting any observations, assuming a normal distribution for both
the likelihood and the prior
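We can sanity-check this result with a short Monte Carlo sketch (the parameter values are illustrative): sampling θ from the prior and then y given θ reproduces the prior predictive mean θ0 and the compounded variance σ² + σ0².

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, sigma0, sigma = 0.5, 0.3, 0.1       # assumed prior and noise parameters

# Ancestral sampling from the prior predictive:
# draw theta ~ N(theta0, sigma0^2), then y | theta ~ N(theta, sigma^2).
theta = rng.normal(theta0, sigma0, size=100_000)
y = rng.normal(theta, sigma)

print(f"empirical mean {y.mean():.3f} vs theta0 = {theta0}")
print(f"empirical var  {y.var():.4f} vs sigma^2 + sigma0^2 = "
      f"{sigma**2 + sigma0**2:.4f}")
```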
We can follow the same line of reasoning for the case of the posterior predictive distribution for a new observation y′ after collecting some data points, under the normality assumption for the likelihood p(y′ | θ) and the posterior p(θ | 𝒟), where p(y′ | θ) = N(θ, σ²) and p(θ | 𝒟) = N(θ′, σ′²). We can see that the posterior distribution for θ has an updated set of parameters θ′ and σ′², obtained using Bayes' rule as more data is collected.
Now recall the definition of the posterior predictive distribution with a continuous
underlying parameter θ:
p(y′ | 𝒟) = ∫ p(y′ | θ) p(θ | 𝒟) dθ
Figure 1-13 summarizes the derivation of the posterior predictive distributions under
normality assumption for the likelihood and the prior for a continuous θ.
Figure 1-13. Derivation process of the posterior predictive distribution for a new
data point after collecting some observations, assuming a normal distribution for
both the likelihood and the prior
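Putting the pieces together, a minimal sketch (with illustrative numbers) computes the normal-normal posterior and then the posterior predictive, which again follows the "sum of variances" pattern, now with the updated posterior variance:

```python
import numpy as np

theta0, sigma0, sigma = 0.5, 0.3, 0.1       # illustrative prior and noise parameters
data = np.array([0.62, 0.58, 0.65])         # made-up observed dataset D

# Posterior p(theta | D) = N(theta', sigma'^2) from the conjugate update
n = len(data)
post_var = 1.0 / (1.0 / sigma0**2 + n / sigma**2)
post_mean = post_var * (theta0 / sigma0**2 + data.sum() / sigma**2)

# Posterior predictive p(y' | D) = N(theta', sigma^2 + sigma'^2):
# the same "sum of variances" pattern as in the prior predictive case.
pred_mean, pred_var = post_mean, sigma**2 + post_var
print(f"p(y' | D) = N({pred_mean:.3f}, {pred_var:.4f})")
```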
Figure 1-14 illustrates an example of the marginal prior distribution and the
conditional likelihood function (which is also a probability distribution) along with the
observation Y. We can see that both distributions follow a normal curve, and the mean
of the latter is aligned to the actual observation Y due to the conditioning effect from
Y = θ. Also, the probability of observing Y is not very high based on the prior distribution p(θ), which suggests that the prior needs to change in the posterior update of the next iteration in order to improve this probability and conform the subjective expectation to reality.
Figure 1-14. Illustrating the prior distribution and the likelihood function, both
following a normal distribution. The mean of the likelihood function is equal to the
actual observation due to the effect of conditioning
The prior distribution will then gradually get updated to approximate the actual observations by invoking Bayes' rule. This gives the posterior distribution p(θ | Y) = N(θ′, σ′²), shown as a solid line, whose mean is slightly nudged from θ0 toward Y and updated to θ′, as shown in Figure 1-15. The prior distribution and likelihood function
are displayed in dashed lines for reference. The posterior distribution of θ is now more
aligned with what is actually observed in reality.
24
Chapter 1 Bayesian Optimization Overview
Figure 1-15. Deriving the posterior distribution for θ using Bayes’ rule. The
updated mean θ′ is now between the prior mean θ0 and actual observation Y,
suggesting an alignment between subjective preference and reality
25
Chapter 1 Bayesian Optimization Overview
Gaussian Process
A prevalent choice of stochastic process in Bayesian optimization is the Gaussian process, which requires that the finite-dimensional probability distributions of the process be multivariate Gaussian distributions, over a continuous domain with an infinite number of variables. It is a flexible framework to model a broad family of functions and quantify their uncertainties, making it a powerful surrogate model for approximating the true underlying function. We will delve into the details of the Gaussian process in the next chapter, but for now, let us look at a few visual examples to see what it offers.
Figure 1-17 illustrates an example of a “flipped” prior probability distribution for a
single random variable selected from the prior belief of the Gaussian process. Each point
follows a normal distribution. Plotting the mean (solid line) and 95% credible interval
(dashed lines) of all these prior distributions gives us the prior process for the objective
function regarding each location in the domain. The Gaussian process thus employs an
infinite number of normally distributed random variables within a bounded range to
model the underlying objective function and quantify the associated uncertainty via a
probabilistic approach.
Figure 1-17. A sample prior belief of the Gaussian process represented by the
mean and 95% credible interval for each location in the domain. Every objective
value is modeled by a random variable that follows a normal prior predictive
distribution. Collecting the distributions of all random variables could help us
quantify the potential shape of the true underlying function and its probability
The prior process can thus serve as the surrogate data-generating process to
generate samples in the form of functions, an extension of sampling single points from
a probability distribution. For example, if we were to repeatedly sample from the prior
process earlier, we would expect the majority (around 95%) of the samples to fall within
the credible interval and a minority outside this range. Figure 1-18 illustrates three
functions sampled from the prior process.
Figure 1-18. Three example functions sampled from the prior process, where the majority of the functions fall within the 95% credible interval
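A minimal sketch of sampling such functions from a GP prior, using a squared-exponential (RBF) kernel with made-up hyperparameters; each draw from the multivariate normal is one candidate function evaluated on a grid:

```python
import numpy as np

rng = np.random.default_rng(2)
xs = np.linspace(0, 1, 100)                 # a dense grid over the domain

def rbf_kernel(a, b, length_scale=0.2, variance=1.0):
    """Squared-exponential covariance between two sets of 1D locations."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

K = rbf_kernel(xs, xs) + 1e-8 * np.eye(len(xs))   # jitter for numerical stability
mean = np.zeros(len(xs))

# Each draw from the multivariate normal is one function sampled from the prior;
# the pointwise 95% credible interval is mean +/- 1.96 * sqrt(diag(K)).
samples = rng.multivariate_normal(mean, K, size=3)
print(samples.shape)                        # (3, 100): three sampled functions
```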
In the Gaussian process, the uncertainty about the objective value at each location is quantified using the credible interval. As we start to collect observations and assume a noise-free and exact observation model, the uncertainties at the sampled locations will be resolved, leading to zero variance and direct interpolation at these locations. In addition, the variance increases as we move further away from the observations, a result of integrating the prior process with the information provided by the actual observations. Figure 1-19 illustrates the updated posterior process after collecting two observations. The posterior process, with updated knowledge based on the observations, thus makes a more accurate surrogate model and better estimates the objective function.
Figure 1-19. Updated posterior process after incorporating two exact observations in the Gaussian process. The posterior mean interpolates through the observations, and the associated variance shrinks as we move nearer to the observations
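The standard closed-form GP conditioning equations reproduce this behavior; the following sketch conditions on two hypothetical noise-free observations (kernel hyperparameters and data are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2, variance=1.0):
    """Squared-exponential covariance between two sets of 1D locations."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

xs = np.linspace(0, 1, 100)                 # test locations across the domain
X_obs = np.array([0.2, 0.7])                # two exact (noise-free) observations
y_obs = np.sin(3 * X_obs)                   # hypothetical objective values

K = rbf_kernel(X_obs, X_obs) + 1e-8 * np.eye(len(X_obs))  # jitter for stability
K_s = rbf_kernel(X_obs, xs)                 # cross-covariance, shape (2, 100)
K_ss = rbf_kernel(xs, xs)

K_inv = np.linalg.inv(K)
post_mean = K_s.T @ K_inv @ y_obs           # posterior mean interpolates the data
post_cov = K_ss - K_s.T @ K_inv @ K_s
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

# Uncertainty is near zero close to the observed locations and grows away from them.
print(f"min std = {post_std.min():.4f}, max std = {post_std.max():.4f}")
```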
Acquisition Function
The tools from Bayesian inference and the extension to the Gaussian process provide
principled reasoning on the distribution of the objective function. However, we would
still need to incorporate such probabilistic information in our decision-making to search
for the global maximum. We need to build a policy that absorbs the most updated
information on the objective function and recommends the following most promising
sampling location in the face of uncertainties across the domain. The optimization
policy thus plays an essential role in connecting the Gaussian process to the eventual
goal of Bayesian optimization. In particular, the posterior predictive distribution
provides an outlook on the objective value and associated uncertainty for locations not
explored yet, which could be used by the optimization policy to quantify the utility of any
alternative location within the domain.
When converting the posterior knowledge about candidate locations, that is,
posterior parameters such as the mean and the variance, to a single utility score, the
acquisition function comes into play. An acquisition function is a manually designed
mechanism that evaluates the relative potential of each candidate location in the
form of a scalar score, and the location with the maximum score will be used as the
recommendation for the next round of sampling. It is a function that assesses how
valuable a candidate location is when we acquire/sample it.
At each iteration, the acquisition function proposes the next sampling location, and the environment reveals an observation according to the particular observation model. The Gaussian process surrogate model then uses the new
observation to obtain a posterior process in support of follow-up decision-making by the
preset acquisition function. This process continues until the stopping criterion such as
exhausting a given budget is met. Figure 1-20 illustrates this process.
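Here is a compact end-to-end sketch of this loop, pairing a GP surrogate with a simple upper confidence bound (UCB) acquisition on a made-up objective; UCB is one common acquisition choice, used here purely for illustration, and all hyperparameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):                                   # hypothetical black-box objective
    return np.sin(10 * x) * x

def observe(x, noise=0.05):                 # noisy observation model y = f(x) + eps
    return f(x) + rng.normal(0.0, noise)

def rbf_kernel(a, b, ls=0.1):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

xs = np.linspace(0, 1, 200)                 # candidate grid over the domain
X = list(rng.uniform(0, 1, 2))              # two random initial designs
y = [observe(x) for x in X]

for _ in range(10):                         # fixed sampling budget
    Xa, ya = np.array(X), np.array(y)
    K = rbf_kernel(Xa, Xa) + 0.05**2 * np.eye(len(Xa))   # noise on the diagonal
    K_s = rbf_kernel(Xa, xs)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ ya                              # GP posterior mean
    var = 1.0 - np.sum(K_s * (K_inv @ K_s), axis=0)      # GP posterior variance
    ucb = mu + 2.0 * np.sqrt(np.clip(var, 0.0, None))    # optimistic utility score
    x_next = xs[np.argmax(ucb)]                          # acquisition maximizer
    X.append(x_next)
    y.append(observe(x_next))

best = int(np.argmax(y))
print(f"best observed: x = {X[best]:.3f}, y = {y[best]:.3f}")
```

The coefficient on the posterior standard deviation controls the exploration-exploitation trade-off: larger values favor uncertain regions, smaller values favor locations with a high posterior mean.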
Summary
Bayesian optimization is a class of methods that aims at sample-efficient global optimization. This chapter covered the foundations of the BO framework, including the observation model, Bayesian inference, the Gaussian process surrogate model, and the acquisition function.
In the next chapter, we will discuss the first component: the Gaussian process,
covering both theoretical understanding and practical implementation in Python.
CHAPTER 2
Gaussian Processes
In the previous chapter, we covered the derivation of the posterior distribution for
parameter θ as well as the predictive posterior distribution of a new observation y′
under a normal/Gaussian prior distribution. Knowing the posterior predictive
distribution is helpful in supervised learning tasks such as regression and classification.
In particular, the posterior predictive distribution quantifies the possible realizations
and uncertainties of both existing and future observations (if we were to sample again).
In this chapter, we will cover more foundations of the Gaussian process in the first section and then switch to its implementation in code in the second section.
The way we work with the parameters depends on the type of models used for
training. There are two types of models in supervised learning tasks: parametric and
nonparametric models. Parametric models assume a fixed set of parameters to be
estimated and used for prediction. For example, by defining a set of parameters θ
(bolded lowercase to denote multiple elements contained in a vector) given a set of input
observations X (bolded uppercase to denote a matrix) and output target y, we rely on the
parametric model p(y | X, θ) and estimate the optimal parameter values θ̂ via procedures
such as maximum likelihood estimation or maximum a posteriori estimation. Using a
Bayesian approach, we can also infer the full posterior distribution p(θ| X, y) to enable a
distributional representation instead of a point estimate for the parameters θ.
Figure 2-1 illustrates the shorthand math notation for matrix X and vector y.
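As a quick sketch of this parametric view (on made-up linear-Gaussian data), MLE under Gaussian noise reduces to ordinary least squares, while a MAP estimate under a zero-mean Gaussian prior on θ reduces to ridge regression; the data-generating model and the regularization strength are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up data from a linear parametric model y = X @ theta + noise
n, d = 50, 2
X = rng.normal(size=(n, d))
theta_true = np.array([1.5, -0.7])
y = X @ theta_true + rng.normal(scale=0.3, size=n)

# MLE under Gaussian noise reduces to ordinary least squares.
theta_mle = np.linalg.lstsq(X, y, rcond=None)[0]

# MAP with a zero-mean Gaussian prior on theta reduces to ridge regression;
# lam encodes the prior precision relative to the noise variance.
lam = 1.0
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("MLE:", theta_mle, " MAP:", theta_map)
```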