Bayesian Optimization
Theory and Practice Using Python
Peng Liu
Singapore, Singapore
Apress
The publisher, the authors and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
Figure 1-1 The overall Bayesian optimization process. The policy digests the
historical observations and proposes the new sampling location. The environment
governs how the (possibly noise-corrupted) observation at the newly proposed
location is revealed to the policy. Our goal is to learn an efficient and effective policy
that could navigate toward the global optimum as quickly as possible
Global Optimization
Optimization aims to locate the optimal set of parameters of interest
across the whole domain through carefully allocating limited resources.
For example, when searching for the car key at home before leaving for
work in two minutes, we would naturally start with the most promising
place where we usually put the key. If it is not there, we think for a
moment about other possible locations and move to the next most
promising place. This process iterates until the key is found. In this
example, the policy digests the available information from previous
searches and proposes the next promising location. The
environment is the house itself, revealing if the key is placed at the
proposed location upon each sampling.
This is considered an easy example since we are familiar with the
environment in terms of its structural design. However, imagine
locating an item in a totally new environment. The policy would need to
account for the uncertainty due to unfamiliarity with the environment
while sequentially determining the next sampling location. When the
sampling budget is limited, as is often the case in real-life searches
constrained by time and resources, the policy needs to reason carefully
about the utility of each candidate sampling location.
Let us formalize the sequential global optimization using
mathematical terms. We are dealing with an unknown scalar-valued
objective function f defined on a domain A. In other words, the
unknown subject of interest f is a function that maps a sample in A to a
real number in ℝ, that is, f : A → ℝ. We typically place no specific
assumption on the nature of the domain A other than that it be a
bounded, compact, and convex set.
Unless otherwise specified, we focus on the maximization setting
instead of minimization since maximizing the objective function is
equivalent to minimizing the negated objective, and vice versa. The
optimization procedure thus aims at locating the global maximum f∗ or
its corresponding location x∗ in a principled and systematic manner.
Mathematically, we wish to locate f∗ where

f∗ = max_{x ∈ A} f(x) = f(x∗),  with x∗ = argmax_{x ∈ A} f(x)
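As a minimal illustration (a sketch, not code from the book), the following hypothetical objective is maximized by brute force over a dense grid; the function f and the domain [0, 1] are assumptions made purely for demonstration:

```python
import numpy as np

# Hypothetical one-dimensional objective over the domain A = [0, 1].
def f(x):
    return np.sin(6 * x) + 0.5 * np.cos(2 * x)

# Approximate the global maximum f* and its location x* on a dense grid.
x_grid = np.linspace(0.0, 1.0, 10_001)
y_grid = f(x_grid)
x_star = x_grid[np.argmax(y_grid)]
f_star = y_grid.max()
print(f"x* ~ {x_star:.4f}, f* ~ {f_star:.4f}")
```

Such exhaustive evaluation is, of course, exactly what Bayesian optimization seeks to avoid when each evaluation of f is expensive.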
Figure 1-6 Assuming a normal probability distribution for the actual observation as
a random variable. The Gaussian distribution is centered around the objective
function f value evaluated at a given location x and spread by the variance of the
noise term
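To make this observation model concrete, here is a small sketch (an illustrative assumption, not code from the book) that draws a noise-corrupted observation y ~ N(f(x), σ²); the toy objective f and the noise level are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true objective (unknown to the policy in practice).
    return np.sin(6 * x) + 0.5 * np.cos(2 * x)

def observe(x, noise_std=0.1):
    # Noisy observation model: y ~ N(f(x), noise_std**2).
    return f(x) + rng.normal(0.0, noise_std)

y = observe(0.3)
print(y)
```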
The following section introduces Bayesian statistics to lay the
theoretical foundation as we work with probability distributions along
the way.
Bayesian Statistics
Bayesian optimization is not a particular algorithm for global
optimization; it is a suite of algorithms based on the principles of
Bayesian inference. As the optimization proceeds, in each iteration the
policy needs to determine the next sampling decision or whether the
current search should be terminated. Due to uncertainty in the objective
function and the observation model, the policy needs to account for such
uncertainty when deciding the next sampling location, which bears
both an immediate impact on the follow-up decision and a long-term
effect on all future decisions. The samples selected thus need to
reasonably contribute to the ultimate goal of global optimization and
justify the cost incurred due to sampling.
Using Bayesian statistics in optimization paves the way for us to
systematically and quantitatively reason about these uncertainties
using probabilities. For example, we would place a prior belief about
the characteristics of the objective function and quantify its
uncertainties by assigning high probability to specific ranges of values
and low probability to others. As more observations are collected, the
prior belief is gradually updated and calibrated toward the true
underlying distribution of the objective function in the form of a
posterior distribution.
We now cover the fundamental concepts and tools of Bayesian
statistics. Understanding these sections is essential to appreciate the
inner workings of Bayesian optimization.
Bayesian Inference
Bayesian inference essentially relies on the Bayesian formula (also
called Bayes’ rule) to reason about the interactions among three
components: the prior distribution p(θ) where θ represents the
parameter of interest, the likelihood p(data| θ) given a specific
parameter θ, and the posterior distribution p(θ| data). There is one
more component, the evidence of the data p(data), which is often not
computable. The Bayesian formula is as follows:

p(θ| data) = p(data| θ) p(θ) / p(data)
Let us look closely at this widely used and arguably most important
formula in Bayesian statistics. Remember that any Bayesian inference
procedure aims to derive the posterior distribution p(θ| data) (or
calculate its marginal expectation) for the parameter of interest θ, in
the form of a probability density function. For example, we might end
up with a continuous posterior distribution as in Figure 1-7, where θ
varies from 0 to 1, and the total probability (i.e., the area under the
curve) integrates to 1.
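As a hedged illustration of this machinery (not taken from the book), we can carry out Bayes' rule numerically on a grid of θ values in [0, 1]; the coin-flip likelihood below is an assumed example:

```python
import numpy as np

# Discretize theta over [0, 1] and place a uniform prior on the grid.
theta = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(theta)

# Example likelihood: 7 heads out of 10 Bernoulli trials.
heads, flips = 7, 10
likelihood = theta**heads * (1 - theta)**(flips - heads)

# Bayes' rule: posterior is proportional to likelihood times prior,
# normalized so that the area under the curve equals one.
unnormalized = likelihood * prior
posterior = unnormalized / np.trapz(unnormalized, theta)
print(np.trapz(posterior, theta))  # ~ 1.0
```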
Figure 1-8 Updating the prior uniform distribution toward a posterior normal
distribution as more data is collected. The role of the prior distribution decreases as
more data is collected to support the approximation to the true underlying
distribution
If you look more closely at the definition of the conditional
probability, p(θ| data) = p(θ, data)/p(data), it is not difficult to see
that expanding the joint probability as p(θ, data) = p(data| θ)p(θ)
leads to the Bayesian formula we introduced earlier, namely:

p(θ| data) = p(data| θ) p(θ) / p(data)
Independence
A special case that would impact the calculation of the three
probabilities mentioned earlier is independence, where the random
variables are now independent of each other. Let us look at the joint,
conditional, and marginal probabilities with independent random
variables.
When two random variables are independent of each other, the
event x = X would have nothing to do with the event y = Y, that is, the
conditional probability for x = X given y = Y becomes p(X| Y) = p(X). The
conditional probability distribution for two independent random
variables thus becomes p(x| y) = p(x). Their joint probability becomes
the product of the individual probabilities: p(X ∩ Y) = p(X|
Y)p(Y) = p(X)p(Y), and the joint probability distribution becomes a
product of the individual probability distributions: p(x, y) = p(x)p(y). The
marginal probability of x is just its own probability distribution:

p(x) = ∫ p(x, y)dy = ∫ p(x)p(y)dy = p(x) ∫ p(y)dy = p(x)

where we have used the fact that p(x) can be moved out of the
integration operation due to its independence from y, and that the total
area under a probability distribution is one, that is, ∫ p(y)dy = 1.
We can also extend to conditional independence, where the random
variable x could be independent from y given another random variable
z. In other words, we have p(x, y| z) = p(x| z)p(y| z).
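A quick numerical sanity check (an illustrative sketch, not from the book) confirms that the joint probability of two independent events factorizes into the product of the marginals; the event probabilities are assumed values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Two independent Bernoulli events with P(X) = 0.3 and P(Y) = 0.6.
x = rng.random(n) < 0.3
y = rng.random(n) < 0.6

# For independent events, P(X and Y) ~ P(X) * P(Y).
print((x & y).mean())        # ~ 0.18
print(x.mean() * y.mean())   # ~ 0.18
```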
Figure 1-11 Definition of the prior and posterior predictive distributions. Both are
calculated based on the same pattern of a weighted sum between the prior and the
likelihood
Writing y = θ + ε with θ ~ N(θ0, σ0²) and ε ~ N(0, σ²), we have used
the fact that the sum of two independent normally distributed random
variables is also normally distributed, with the mean and variance given
by the sums of the individual means and variances.
Therefore, the marginal probability distribution of y becomes
p(y) = N(θ0, σ0² + σ²). Intuitively, this form also makes sense. Before
we start to collect any observation about y, our best guess for its mean
would be θ0, the expected value of the underlying random variable θ. Its
variance is the sum of individual variances since we are considering
uncertainties due to both the prior and the likelihood; the marginal
distribution needs to absorb both variances, thus compounding the
resulting uncertainty. Figure 1-12 summarizes the derivation of the
prior predictive distribution under the normality assumption for both
the likelihood and the prior, for a continuous θ.
Figure 1-12 Derivation process of the prior predictive distribution for a new data
point before collecting any observations, assuming a normal distribution for both the
likelihood and the prior
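A small Monte Carlo sketch (illustrative, with assumed parameter values θ0 = 1.0, σ0 = 0.5, and σ = 0.3) confirms that sampling θ from the prior and then y from the likelihood yields exactly this prior predictive distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_0, sigma_0 = 1.0, 0.5  # assumed prior: theta ~ N(theta_0, sigma_0**2)
sigma = 0.3                  # assumed noise level: y | theta ~ N(theta, sigma**2)
n = 1_000_000

# Ancestral sampling: draw theta from the prior, then y from the likelihood.
theta = rng.normal(theta_0, sigma_0, size=n)
y = rng.normal(theta, sigma)

# The sample mean and variance match N(theta_0, sigma_0**2 + sigma**2).
print(y.mean(), y.var())               # ~ 1.0, ~ 0.34
print(theta_0, sigma_0**2 + sigma**2)  # 1.0, 0.34
```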
We can follow the same line of reasoning for the case of the posterior
predictive distribution for a new observation y′ after collecting some
data points, under the normality assumption for the likelihood p(y′|
θ) and the posterior p(θ| data), where p(y′| θ) = N(θ, σ²) and
p(θ| data) = N(θ′, σ′²). We can see that the posterior distribution for θ
has an updated set of parameters θ′ and σ′² obtained using Bayes' rule
as more data is collected.
Now recall the definition of the posterior predictive distribution
with a continuous underlying parameter θ:

p(y′| data) = ∫ p(y′| θ)p(θ| data)dθ = N(θ′, σ′² + σ²)
Figure 1-13 Derivation process of the posterior predictive distribution for a new
data point after collecting some observations, assuming a normal distribution for
both the likelihood and the prior
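The closed-form update can be sketched in a few lines; the parameter values and observations below are assumptions, and the formulas are the standard normal-normal conjugate results:

```python
import numpy as np

theta_0, sigma_0 = 1.0, 0.5        # assumed prior: theta ~ N(theta_0, sigma_0**2)
sigma = 0.3                        # assumed noise std of the likelihood
data = np.array([1.3, 1.1, 1.4])   # assumed collected observations

# Standard normal-normal conjugate update for the mean theta.
n = len(data)
post_var = 1.0 / (1.0 / sigma_0**2 + n / sigma**2)                      # sigma'^2
post_mean = post_var * (theta_0 / sigma_0**2 + data.sum() / sigma**2)   # theta'

# Posterior predictive for a new observation y': N(theta', sigma'^2 + sigma^2).
pred_mean, pred_var = post_mean, post_var + sigma**2
print(post_mean, post_var, pred_mean, pred_var)
```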
Gaussian Process
A prevalent choice of stochastic process in Bayesian optimization is the
Gaussian process, which requires that every finite collection of its
random variables follows a multivariate Gaussian distribution, even
though the continuous domain contains an infinite number of such
variables. It is a flexible framework to model a broad family of
functions and quantify their uncertainties, making it a powerful
surrogate model for approximating the true underlying function. We will delve into the details
of the Gaussian process in the next chapter, but for now, let us look at a
few visual examples to see what it offers.
Figure 1-17 illustrates an example of a “flipped” prior probability
distribution for a single random variable selected from the prior belief
of the Gaussian process. Each point follows a normal distribution.
Plotting the mean (solid line) and 95% credible interval (dashed lines)
of all these prior distributions gives us the prior process for the
objective function regarding each location in the domain. The Gaussian
process thus employs an infinite number of normally distributed
random variables within a bounded range to model the underlying
objective function and quantify the associated uncertainty via a
probabilistic approach.
Figure 1-17 A sample prior belief of the Gaussian process represented by the mean
and 95% credible interval for each location in the domain. Every objective value is
modeled by a random variable that follows a normal prior predictive distribution.
Collecting the distributions of all random variables could help us quantify the
potential shape of the true underlying function and its probability
Figure 1-18 Three example functions sampled from the prior process, where the
majority of the functions fall within the 95% credible interval
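To make this concrete, here is a minimal sketch (illustrative; the RBF kernel and the length scale of 0.2 are assumptions) that draws sample functions from a zero-mean GP prior:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=0.2):
    # Squared-exponential (RBF) covariance between two sets of 1-D points.
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * sq_dists / length_scale**2)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 100)

# GP prior: zero mean, RBF covariance; small jitter for numerical stability.
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)

# 95% credible interval of the prior at each location: 0 +/- 1.96 * sqrt(diag(K)).
band = 1.96 * np.sqrt(np.diag(K))
```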
In the Gaussian process, the uncertainty on the objective value of
each location is quantified using the credible interval. As we start to
collect observations, and assuming a noise-free and exact observation
model, the uncertainty at the sampled locations is resolved, leading
to zero variance and direct interpolation at these locations. In
addition, the variance increases as we move farther away from the
observations, a consequence of combining the prior process with the
information provided by the actual observations. Figure 1-19 illustrates
the updated posterior process after collecting two observations. The
posterior process with updated knowledge based on the observations
will thus make a more accurate surrogate model and better estimate
the objective function.
Figure 1-19 Updated posterior process after incorporating two exact observations
in the Gaussian process. The posterior mean interpolates through the observations,
and the associated variance shrinks as we move nearer to the observations
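A compact sketch of this noise-free conditioning (illustrative; the kernel, length scale, and the two observations are assumed) follows the standard GP posterior formulas:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=0.2):
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * sq_dists / length_scale**2)

# Two assumed noise-free observations.
X_obs = np.array([0.25, 0.75])
y_obs = np.array([0.5, -0.3])
x = np.linspace(0.0, 1.0, 100)

# Standard GP conditioning: the posterior mean interpolates the data,
# and the posterior variance drops to zero at the observed locations.
K_oo = rbf_kernel(X_obs, X_obs) + 1e-8 * np.eye(len(X_obs))
K_xo = rbf_kernel(x, X_obs)
post_mean = K_xo @ np.linalg.solve(K_oo, y_obs)
post_cov = rbf_kernel(x, x) - K_xo @ np.linalg.solve(K_oo, K_xo.T)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))
```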
Acquisition Function
The tools from Bayesian inference and the extension to the Gaussian
process provide principled reasoning on the distribution of the
objective function. However, we would still need to incorporate such
probabilistic information in our decision-making to search for the
global maximum. We need to build a policy that absorbs the most
updated information on the objective function and recommends the
next most promising sampling location in the face of uncertainty
across the domain. The optimization policy thus plays an essential role
in connecting the Gaussian process to the eventual goal of Bayesian
optimization. In particular, the posterior predictive distribution
provides an outlook on the objective value and associated uncertainty
for locations not explored yet, which could be used by the optimization
policy to quantify the utility of any alternative location within the
domain.
When converting the posterior knowledge about candidate
locations, that is, posterior parameters such as the mean and the
variance, to a single utility score, the acquisition function comes into
play. An acquisition function is a manually designed mechanism that
evaluates the relative potential of each candidate location in the form of
a scalar score, and the location with the maximum score will be used as
the recommendation for the next round of sampling. It is a function that
assesses how valuable a candidate location would be if we were to acquire/sample it.
The acquisition function needs to be cheap to evaluate as a side
computation, since we must evaluate it at every candidate location
and then locate the maximum utility score, which is itself another
(inner) optimization problem.
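As one concrete example (an illustrative sketch; the book covers specific acquisition functions such as EI and KG later), the upper confidence bound (UCB) rule scores each candidate by the posterior mean plus a multiple of the posterior standard deviation; the posterior summaries below are assumed values:

```python
import numpy as np

def ucb(mean, std, beta=2.0):
    # Upper confidence bound: posterior mean plus beta times posterior std.
    # Larger beta favors exploration; smaller beta favors exploitation.
    return mean + beta * std

# Illustrative posterior summaries over a grid of candidate locations.
x = np.linspace(0.0, 1.0, 100)
post_mean = np.sin(6 * x)                # assumed posterior mean
post_std = 0.05 + 0.3 * np.abs(x - 0.5)  # assumed posterior std

# The recommendation maximizes the utility score -- itself a cheap
# inner optimization problem over the candidate locations.
scores = ucb(post_mean, post_std)
x_next = x[np.argmax(scores)]
print(x_next)
```

The parameter beta in this sketch directly controls the exploration-exploitation trade-off discussed next.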
Many choices of acquisition function have been proposed in the
literature. In a later part of the book, we will cover the popular ones,
such as expected improvement (EI) and knowledge gradient (KG). Still,
it suffices, for now, to understand that it is a predesigned function that
needs to balance two opposing forces: exploration and exploitation.
Exploration encourages resolving the uncertainty across the domain by
sampling at unfamiliar and distant locations, since these areas may
hold a big surprise due to their high uncertainty. Exploitation recommends a
greedy move at promising regions where we expect the observation
value to be high. The exploration-exploitation trade-off is a common
topic in many optimization settings.
Another distinguishing feature is the short-term and long-term
trade-off. A short-term acquisition function only focuses on one step
ahead and assumes this is the last chance to sample from the
environment; thus, the recommendation is to maximize the immediate
utility. A long-term acquisition function employs a multi-step lookahead
approach by simulating potential evolutions/paths in the future and
making a final recommendation by maximizing the long-run utility. We
will cover both types of policies in the book.
There are many other emerging variations in the design of the
acquisition function, such as adding safety constraints to the system
under study. In any case, we would judge the quality of the policy using
a specific acquisition function based on how close we are to the location
of the global maximum upon exhausting our budget. The distance
between the current and optimal locations is often called instant regret
or simple regret. Alternatively, the cumulative regret (cumulative
distances between historical locations and the optimum location)
incurred throughout the sampling process can also be used.
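The two notions can be computed in a few lines; in this illustrative sketch, regret is measured in objective values (one common convention), and the optimum f∗ is assumed known for benchmarking purposes only:

```python
import numpy as np

f_star = 1.0                                     # assumed known optimum (benchmarking only)
history = np.array([0.2, 0.6, 0.85, 0.8, 0.97])  # assumed observed objective values

# Simple (instant) regret: gap between the optimum and the best value so far.
simple_regret = f_star - np.maximum.accumulate(history)

# Cumulative regret: sum of per-step gaps over the whole sampling history.
cumulative_regret = np.cumsum(f_star - history)

print(simple_regret[-1], cumulative_regret[-1])
```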