Peng Liu
Bayesian Optimization
Theory and Practice Using Python
Peng Liu
Singapore, Singapore
Apress Standard
The publisher, the authors and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
Figure 1-1 The overall Bayesian optimization process. The policy digests the
historical observations and proposes the new sampling location. The environment
governs how the (possibly noise-corrupted) observation at the newly proposed
location is revealed to the policy. Our goal is to learn an efficient and effective policy that can navigate toward the global optimum as quickly as possible
Global Optimization
Optimization aims to locate the optimal set of parameters of interest across the whole domain by carefully allocating limited resources. For example, when searching for a car key at home before leaving for work in two minutes, we would naturally start with the most promising place, where we usually put the key. If it is not there, we think for a moment about other possible locations and move to the next most promising place. This process iterates until the key is found. In this example, the policy digests the available information from previous searches and proposes the next promising location. The environment is the house itself, revealing whether the key is at the proposed location upon each sampling.
This is considered an easy example, since we are familiar with the environment in terms of its structural design. However, imagine locating an item in a totally new environment. The policy would need to account for the uncertainty due to unfamiliarity with the environment while sequentially determining the next sampling location. When the sampling budget is limited in time and resources, as is often the case in real-life searches, the policy needs to reason carefully about the utility of each candidate sampling location.
Let us formalize sequential global optimization in mathematical terms. We are dealing with an unknown scalar-valued objective function f defined on a specific domain Α. In other words, the unknown subject of interest f is a function that maps a sample in Α to a real number in ℝ, that is, f : Α → ℝ. We typically place no specific assumption on the nature of the domain Α other than that it should be a bounded, compact, and convex set.
Unless otherwise specified, we focus on the maximization setting instead of minimization, since maximizing the objective function is equivalent to minimizing the negated objective, and vice versa. The optimization procedure thus aims at locating the global maximum f∗ or its corresponding location x∗ in a principled and systematic manner. Mathematically, we wish to locate f∗ where

f∗ = max_{x ∈ Α} f(x) = f(x∗),  with x∗ = arg max_{x ∈ Α} f(x)
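As a minimal numerical sketch, the global maximum f∗ and its location x∗ can be found by exhaustive evaluation over a dense grid of a one-dimensional domain. The objective below is a made-up stand-in; in practice f is unknown and expensive to evaluate, which is precisely why exhaustive search is infeasible and a sampling policy is needed.

```python
import numpy as np

# Hypothetical toy objective on the bounded domain A = [0, 1];
# in a real problem f is unknown and costly to query.
def f(x):
    return np.sin(3 * x) + 0.5 * np.cos(5 * x)

# A dense grid stands in for the domain A
xs = np.linspace(0, 1, 1001)
ys = f(xs)

f_star = ys.max()          # f* = max_{x in A} f(x)
x_star = xs[ys.argmax()]   # x* = argmax_{x in A} f(x)
print(f_star, x_star)
```

This brute-force sketch only serves to make the definition concrete; Bayesian optimization exists to replace the dense grid with a small, carefully chosen set of samples.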
Figure 1-6 Assuming a normal probability distribution for the actual observation as
a random variable. The Gaussian distribution is centered around the objective
function f value evaluated at a given location x and spread by the variance of the
noise term
The following section introduces Bayesian statistics to lay the
theoretical foundation as we work with probability distributions along
the way.
Bayesian Statistics
Bayesian optimization is not a particular algorithm for global optimization; it is a suite of algorithms based on the principles of Bayesian inference. As the optimization proceeds through each iteration, the policy needs to determine the next sampling decision or whether the current search should be terminated. Due to uncertainty in both the objective function and the observation model, the policy needs to account for such uncertainty when deciding the next sampling location, which bears both an immediate impact on the follow-up decision and a long-term effect on all future decisions. The samples selected thus need to contribute meaningfully to the ultimate goal of global optimization and justify the cost incurred by sampling.
Using Bayesian statistics in optimization paves the way for us to
systematically and quantitatively reason about these uncertainties
using probabilities. For example, we would place a prior belief about
the characteristics of the objective function and quantify its
uncertainties by assigning high probability to specific ranges of values
and low probability to others. As more observations are collected, the
prior belief is gradually updated and calibrated toward the true
underlying distribution of the objective function in the form of a
posterior distribution.
We now cover the fundamental concepts and tools of Bayesian
statistics. Understanding these sections is essential to appreciate the
inner workings of Bayesian optimization.
Bayesian Inference
Bayesian inference essentially relies on the Bayesian formula (also called Bayes’ rule) to reason about the interactions among three components: the prior distribution p(θ), where θ represents the parameter of interest; the likelihood p(data| θ) given a specific parameter θ; and the posterior distribution p(θ| data). There is one more component, the evidence of the data p(data), which is often intractable to compute. The Bayesian formula is as follows:

p(θ| data) = p(data| θ)p(θ) / p(data)
Let us look closely at this widely used formula, arguably the most important in Bayesian statistics. Remember that any Bayesian inference procedure aims to derive the posterior distribution p(θ| data) (or calculate its marginal expectation) for the parameter of interest θ, in the form of a probability density function. For example, we might end up with a continuous posterior distribution as in Figure 1-7, where θ varies from 0 to 1 and the total probability (i.e., the area under the curve) integrates to 1.
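As a minimal sketch of this procedure, the posterior over θ ∈ [0, 1] can be approximated on a discrete grid, where normalizing by the sum plays the role of the evidence p(data). The coin-flip likelihood and the data counts below are made-up illustrations, not taken from the book.

```python
import numpy as np

# Discrete grid over theta in [0, 1]
thetas = np.linspace(0, 1, 501)
prior = np.ones_like(thetas)   # uniform prior p(theta)
prior /= prior.sum()

# Hypothetical data: 7 successes out of 10 Bernoulli trials
successes, trials = 7, 10
likelihood = thetas**successes * (1 - thetas)**(trials - successes)

# Bayes' rule: posterior is proportional to likelihood * prior;
# dividing by the total stands in for the evidence p(data)
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

print(thetas[posterior.argmax()])  # posterior mode, close to 0.7
```

With a uniform prior, the posterior mode coincides with the maximum-likelihood estimate of 7/10; a more opinionated prior would pull the mode toward its own mass.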
Figure 1-8 Updating the prior uniform distribution toward a posterior normal
distribution as more data is collected. The role of the prior distribution decreases as
more data is collected to support the approximation to the true underlying
distribution
Looking at this equation more closely, it is not difficult to see that it leads to the Bayesian formula we introduced earlier, namely:

p(θ| data) = p(data| θ)p(θ) / p(data)
Independence
A special case that would impact the calculation of the three
probabilities mentioned earlier is independence, where the random
variables are now independent of each other. Let us look at the joint,
conditional, and marginal probabilities with independent random
variables.
When two random variables are independent of each other, the event x = X has nothing to do with the event y = Y; that is, the conditional probability of x = X given y = Y becomes p(X| Y) = p(X). The conditional probability distribution for two independent random variables thus becomes p(x| y) = p(x). Their joint probability becomes the product of the individual probabilities, p(X ∩ Y) = p(X| Y)p(Y) = p(X)p(Y), and the joint probability distribution becomes a product of the individual probability distributions: p(x, y) = p(x)p(y). The marginal probability of x is then just its own probability distribution:

p(x) = ∫ p(x, y)dy = ∫ p(x)p(y)dy = p(x) ∫ p(y)dy = p(x)

where we have used the fact that p(x) can be moved out of the integration due to its independence from y, and that the total area under a probability distribution is one, that is, ∫ p(y)dy = 1.
We can also extend this to conditional independence, where the random variable x can be independent of y given another random variable z. In other words, we have p(x, y| z) = p(x| z)p(y| z).
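These identities can be checked numerically. The two-dice joint distribution below is a hypothetical example chosen because independence holds exactly.

```python
import numpy as np

# Joint distribution of two independent fair dice: p(x, y) = p(x) p(y)
px = np.full(6, 1 / 6)
py = np.full(6, 1 / 6)
joint = np.outer(px, py)  # p(x, y) as a 6x6 table

# Marginalizing the joint over y recovers p(x): sum_y p(x)p(y) = p(x)
marginal_x = joint.sum(axis=1)
print(np.allclose(marginal_x, px))  # True

# The conditional p(x | y) = p(x, y) / p(y) equals p(x) under independence
conditional = joint[:, 0] / py[0]
print(np.allclose(conditional, px))  # True
```

For dependent variables the joint table would no longer factor as an outer product, and the conditional slice would differ from the marginal.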
Figure 1-11 Definition of the prior and posterior predictive distributions. Both are
calculated based on the same pattern of a weighted sum between the prior and the
likelihood
where we have used the fact that the addition of two independent
normally distributed random variables will also be normally
distributed, with the mean and variance calculated based on the sum of
individual means and variances.
Therefore, the marginal probability distribution of y becomes p(y) = N(θ0, σ02 + σ2). Intuitively, this form also makes sense. Before
we start to collect any observation about y, our best guess for its mean
would be θ0, the expected value of the underlying random variable θ. Its
variance is the sum of individual variances since we are considering
uncertainties due to both the prior and the likelihood; the marginal
distribution needs to absorb both variances, thus compounding the
resulting uncertainty. Figure 1-12 summarizes the derivation of the
prior predictive distributions under the normality assumption for the
likelihood and the prior for a continuous θ.
Figure 1-12 Derivation process of the prior predictive distribution for a new data
point before collecting any observations, assuming a normal distribution for both the
likelihood and the prior
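A quick simulation confirms this compounding of variances under the normality assumption. The prior mean θ0 and the two variances below are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

theta0, sigma0 = 1.0, 0.5  # assumed prior: theta ~ N(theta0, sigma0**2)
sigma = 0.3                # assumed likelihood: y | theta ~ N(theta, sigma**2)

# Sample from the prior predictive: first draw theta from the prior,
# then draw y given that theta from the likelihood
theta = rng.normal(theta0, sigma0, size=100_000)
y = rng.normal(theta, sigma)

# The marginal of y should be N(theta0, sigma0**2 + sigma**2)
print(y.mean())  # close to theta0 = 1.0
print(y.var())   # close to 0.25 + 0.09 = 0.34
```

The empirical mean matches θ0 and the empirical variance matches σ02 + σ2, absorbing the uncertainty of both the prior and the likelihood.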
We can follow the same line of reasoning for the case of the posterior predictive distribution for a new observation y′ after collecting some data points, under the normality assumption for the likelihood p(y′| θ) and the posterior p(θ| data), where p(y′| θ) = N(θ, σ2) and p(θ| data) = N(θ′, σ′2). We can see that the posterior distribution for θ has an updated set of parameters θ′ and σ′2, obtained using Bayes’ rule as more data is collected.
Now recall the definition of the posterior predictive distribution with a continuous underlying parameter θ:

p(y′| data) = ∫ p(y′| θ)p(θ| data)dθ
Figure 1-13 Derivation process of the posterior predictive distribution for a new
data point after collecting some observations, assuming a normal distribution for
both the likelihood and the prior
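Under the normal-normal setting assumed here, the updated parameters θ′ and σ′2 have a standard closed form, and the posterior predictive again compounds the posterior variance with the observation noise. The prior parameters, noise level, and data values below are hypothetical.

```python
import numpy as np

theta0, sigma0 = 0.0, 1.0  # assumed prior N(theta0, sigma0**2)
sigma = 0.5                # assumed observation noise
data = np.array([0.9, 1.1, 1.0, 0.8])  # hypothetical observations
n = len(data)

# Standard conjugate update for a normal likelihood with known variance:
# the posterior p(theta | data) = N(theta_post, sigma_post2)
precision = 1 / sigma0**2 + n / sigma**2
sigma_post2 = 1 / precision
theta_post = sigma_post2 * (theta0 / sigma0**2 + data.sum() / sigma**2)

# Posterior predictive for a new y': N(theta_post, sigma_post2 + sigma**2)
pred_mean = theta_post
pred_var = sigma_post2 + sigma**2
print(pred_mean, pred_var)
```

Note how the posterior mean is pulled from the prior mean toward the data average, and how the predictive variance can never fall below the irreducible noise σ2.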