
Simplifying Bayesian Optimization via In-Context Direct Optimum Sampling

Gustavo Sutter¹,²  Mohammed Abdulrahman¹,²  Hao Wang¹
Sriram Ganapathi Subramanian²  Marc St-Aubin³  Sharon O'Sullivan³  Lawrence Wan³
Luis Ricardez-Sandoval¹  Pascal Poupart¹,²  Agustinus Kristiadi²

¹University of Waterloo  ²Vector Institute  ³BMO, Technology & Operations

arXiv:2505.23913v1 [cs.LG] 29 May 2025

Abstract
The optimization of expensive black-box functions is ubiquitous in science and
engineering. A common solution to this problem is Bayesian optimization (BO),
which generally comprises two components: (i) a surrogate model and (ii) an
acquisition function, which respectively require expensive re-training and
optimization steps at each iteration. Although recent work enabled in-context
surrogate models that do not require re-training, virtually all existing BO methods
still require acquisition function maximization to select the next observation, which
introduces many knobs to tune, such as Monte Carlo samplers and multi-start
optimizers. In this work, we propose a completely in-context, zero-shot solution
for BO that does not require surrogate fitting or acquisition function optimization.
This is done by using a pre-trained deep generative model to directly sample from
the posterior over the optimum point. We show that this process is equivalent
to Thompson sampling and demonstrate the capabilities and cost-effectiveness
of our foundation model on a suite of real-world benchmarks. We achieve an
efficiency gain of more than 35× in terms of wall-clock time when compared with
Gaussian process-based BO, enabling efficient parallel and distributed BO, e.g.,
for high-throughput optimization.

1 Introduction
Many problems in chemistry [14, 15], biology [37, 38] and computer science [9, 40] rely on optimizing an expensive black-box function. Often, input domains are prohibitively large, and objective function evaluation requires laboratory experiments or computation-intensive simulation [43]. This necessitates specialized black-box optimization methods that are data-efficient. Ideally, those methods should be able to find the optimal value by leveraging pre-existing data and smart exploration strategies that query the objective function as little as possible.

[Figure 1: Wall-clock time (s, log scale) versus suggestion batch size for GP-EI, PFNs4BO, and our method. Performing BO in-context with our method enables batched BO with a large batch size (almost) for free.]

An effective method for this type of problem is Bayesian optimization [BO; 11, 28]. This family of methods works by sequentially recommending the next point x_{t+1} in a bid to maximize the target black-box function in as few steps as possible, under the guidance of a probabilistic surrogate model p(f | D_t) trained on previous data points D_t. Gaussian processes [GPs; 35] are the de facto surrogate model for BO due to their tractable posterior inference, being supported by popular libraries and frameworks [4].

Preprint. Under review.


[Figure 2: Optimization loops of different black-box optimizers. Traditional BO: p(f | D) → α(x; D) → x_{k+1}. PFNs4BO [30]: p(y | x, D) → α(x; D) → x_{k+1}. Inverse methods such as Diff-BBO [46] and DiBO [49]: p(x | y, D) → α(y; D) → x_{k+1}. FIBO (ours): p(x∗ | D) → x_{k+1}. Red nodes represent stages that require model fitting or optimization (undesirable), while green nodes represent in-context learning operations (desirable). Fewer nodes are desirable since this implies fewer moving parts.]

However, a traditional GP scales cubically with the number of training points, assumes that targets are jointly Gaussian, and often uses stationary kernels. Alternatively,
one can use Bayesian neural networks, which excel in representing high-dimensional data and deal
well with non-stationary objectives [22, 23]. However, it is unclear how to pick the most suitable
surrogate model for the problem at hand [26]. Moreover, expensive re-training must be done at each BO step, incurring additional computational overhead and more hyperparameters to tune.
To alleviate these issues, recent work proposed in-context learning for surrogate modelling [30].
These works reduce the cost of BO by using a surrogate that does not need to be re-trained as new
data points are collected. Instead, a foundation model is pretrained on a large number of functions
coming from a predetermined prior distribution. This strategy of approximate inference allows for a
flexible prior without requiring traditionally expensive Bayesian inference methods to compute the
approximate posterior. An alternative approach is inverse models that generate input points given
a target function value [25]. Recent work used conditional diffusion models to tackle the problem,
leveraging their power to model complex distributions over high-dimensional spaces [21, 46]. Inverse
methods, however, may require datasets with tens of thousands of data points to fit the generative
model used in the objective optimization loop.
Nevertheless, previous work only addresses the surrogate-modeling question raised above; the other, equally important parts of BO have so far been neglected. Indeed, in addition to surrogate modeling, one must also pick a suitable acquisition function α(· | D) : X → R that depends on the posterior, such as expected improvement [20] or the upper confidence bound [3], and optimize it to obtain the proposed point x_{t+1} = arg max_{x∈X} α(x | D_t). This optimization process is known to be hard and/or expensive [1].
In this work, we aim to simplify Bayesian optimization by directly sampling from the posterior
p(x∗ | Dt ) of the optimum x∗ in an in-context way. That is, we skip both the surrogate modeling
and acquisition function maximization, reducing the number of moving parts in BO to its minimum
(Figure 2). To this end, we pretrain a generative model on pairs (D, x∗) of a context D := {(x_i, f(x_i))}_i and the optimal point x∗ = arg max_{x∈X} f(x) of a given function f ∼ p(f) sampled
from some prior. Due to its in-context, foundational nature, our method does not require practitioners
to train a particular surrogate model. Meanwhile, due to its direct nature, our method does not
require expensive and brittle acquisition function optimization. This results in substantial speed-ups,
especially for parallel suggestions, as illustrated in Figure 1. We refer to our method as Fully
In-Context Bayesian Optimization, or FIBO for short.
FIBO simplifies the modeling and the inner optimization part of BO. Nevertheless, although there is
no explicit acquisition function maximization in FIBO, we prove that it is equivalent to Thompson
sampling [42]. Thus, FIBO is indeed a principled approach to BO. Finally, we demonstrate the
performance of FIBO in standard synthetic test functions and real-world benchmarks. Despite being
more than 10× faster than traditional BO approaches, it matches their performance consistently.

2 Preliminaries

Here, we discuss the necessary background to derive our method: (i) Bayesian optimization, (ii)
in-context probabilistic surrogates, and (iii) deep generative models.

2.1 Bayesian Optimization

Let f : X → R denote an unknown objective function on a space X ⊂ R^d. Our goal is to find an optimal point x∗ ∈ arg max_{x∈X} f(x). We are interested in the case where f is expensive to compute and the input domain cannot be explored exhaustively. We are limited to querying the objective function a few times, and we only have access to its output value f(x).
Bayesian optimization [BO; 11, 28] addresses this problem by using two components: a surrogate
model and an acquisition function. The surrogate consists of a probabilistic model p(f |D) that
captures our posterior beliefs about the objective function f given our current dataset D. Meanwhile,
the acquisition function α : X → R scores points in the input domain, representing our preference over the next locations to be queried. At each step, the next point is selected via arg max_{x∈X} α(x; D).
One example of an acquisition function is Thompson sampling [TS; 42], which is defined as α_TS(x; D) = f̂(x), where f̂ ∼ p(f | D). It is equivalent to sampling the next point from the posterior distribution over the optimal point, that is, x_{t+1} ∼ p(x∗ | D) [39].
of TS and the fact that it faithfully follows the posterior belief ensure a good exploration-exploitation
balance, making it widely used in practice [24].
There are applications for which it is beneficial or even necessary to run multiple evaluations of the
objective function in parallel. In this high-throughput setting, we are interested in simultaneously
suggesting a collection (x1 , ..., xq ) of q points to be evaluated. Evaluating the joint acquisition
function over the entire batch poses complex computational and optimization problems. Popular
solutions include sequential simulation [12], in which the points are optimized greedily through q
steps of standard BO, and MC approaches that can be applied when the posterior distribution is
Gaussian [45]. Importantly, Thompson sampling allows trivial batch construction by simply sampling
q points from the posterior [18].
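To make this batch construction concrete, below is a minimal Python sketch (not from the paper) of batched Thompson sampling over a finite candidate set. The posterior_sampler routine, assumed to return one function draw from p(f | D) evaluated at the candidate points, is a stand-in for any posterior sampling mechanism:

import numpy as np

def thompson_batch(posterior_sampler, candidates, q):
    """Select a batch of q points by Thompson sampling.

    posterior_sampler(X) is assumed to return one function draw
    f_hat ~ p(f | D) evaluated at the rows of X. candidates is an
    (n, d) array covering the input domain.
    """
    batch = []
    for _ in range(q):
        f_hat = posterior_sampler(candidates)        # one sample path
        batch.append(candidates[np.argmax(f_hat)])   # its maximizer
    return np.stack(batch)                           # (q, d) suggestions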

2.2 In-Context Learning

In-context learning refers to algorithms that learn from a few examples provided at test time, without updating any parameter values [48]. In a supervised learning setting, it refers to estimating p(f(x) | {(x_i, f(x_i))}_{i=1}^k, x_query), the probability of the function value at a query point x_query given k in-context examples {(x_i, f(x_i))}_{i=1}^k. It can also be formulated for unsupervised tasks, such as generation, as sampling from p(x_query | x_1, ..., x_k) such that x_query comes from the same distribution as the context examples.
Connecting to the familiar application of in-context learning in large language models, x_k represents the k-th input example (e.g., a text prompt), and f(x_k) is the corresponding output (e.g., a predicted continuation) in supervised tasks like translation or question-answering [5].

2.3 Deep Generative Model

Deep generative models aim to approximate a non-trivial and high-dimensional distribution p(x) by
only having access to its samples. In this work, we are interested in conditional generative models of
the form p_θ(x | c) that make use of a neural network to generate samples conditioned on a context vector c ∈ R^c. This is the setting used in many popular applications such as text-to-image [34], image-to-image [47] or even text-to-molecule generation [13]. In addition to the context vector, the model usually receives as input a latent vector z that comes from a base distribution p_z(z) that is easy to sample from (e.g., a multivariate Gaussian).
One family of such models is normalizing flows [31], which transform a sample from the base distribution into a point from the complex data distribution through an invertible and differentiable transformation T_θ(z; c) parametrized by a neural network. The model is trained to maximize the data likelihood using the change-of-variables formula p(x) = p_z(T_θ⁻¹(x; c)) |det J_{T_θ⁻¹}(x)|, where J_{T_θ⁻¹} denotes the Jacobian matrix of the inverse transformation. Many efforts have been directed towards designing powerful transformations that can be efficiently inverted and have Jacobian determinants that are simple to compute.
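As a minimal illustration of the change-of-variables formula, consider a one-dimensional affine flow; the transformation and parameter values below are illustrative only:

import numpy as np
from scipy.stats import norm

# A toy one-dimensional affine flow x = T(z) = a * z + b with a > 0.
# Its inverse is z = (x - b) / a and |det J_{T^-1}| = 1 / a, so the
# change-of-variables formula gives p(x) = p_z((x - b) / a) / a.
a, b = 2.0, 1.0

def log_prob(x):
    z = (x - b) / a                      # invert the transformation
    return norm.logpdf(z) - np.log(a)    # log p_z(z) + log |det J_{T^-1}|

# Sampling is the forward direction: draw z ~ N(0, 1), push through T.
samples = a * np.random.randn(1000) + b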

[Figure 4: Illustration of a Bayesian optimization loop using FIBO (an encoder followed by a generative head). The model receives a dataset as input and produces samples from the posterior distribution over the optimal point.]

3 Fully In-Context Bayesian Optimization (FIBO)

Our goal is to develop a method that is able to perform Bayesian optimization completely in-context.
That is, we aim to directly suggest the input point of the next observation with no surrogate fitting or
acquisition function maximization. In order to achieve this goal, we propose FIBO, which amounts
to a pretrained generative model that works across different objective functions during test time.

3.1 In-Context Thompson Sampling

Recall that Thompson sampling is characterized by sampling from the posterior distribution p(x∗ | D) over a function f's optimum point after observing data D. FIBO amounts to learning such a distribution through a model p_θ(x∗ | D) parametrized by θ, which is composed of two parts: an encoder and a generative head. The encoder E_{θ_enc} : 2^{X×R} → R^c accepts a set of data points of variable cardinality and summarizes it into a fixed-dimension context vector c = E_{θ_enc}(D). This context vector is passed to the generative head (a normalizing flow) G_{θ_gen} : R^d × R^c → X alongside a random vector z ∈ R^d to produce a sample from the posterior distribution over the optimal point, x̂∗ = G_{θ_gen}(z; c). Combining both components, we can train an end-to-end model parametrized by θ = (θ_enc, θ_gen) by minimizing the negative log-likelihood loss function

    L(θ) = −E_{x∗, D ∼ p(f, x∗, D)}[log p_θ(x∗ | D)].    (1)

Given a context dataset D, we can then use the pretrained pθ (x∗ |D) to directly sample the optimum
point. In Section 3.3 we shall show that this sample comes from the posterior derived from some
prior over functions after observing D. But first, we clarify where this prior comes from and how to
efficiently pretrain this model.
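A minimal PyTorch sketch of this architecture is given below. The mean-pooled MLP encoder is a DeepSets-style stand-in for the actual Transformer encoder, and the flow.log_prob(x, context=c) interface is an assumption in the style of conditional-flow libraries; this is not the authors' implementation:

import torch
import torch.nn as nn

class FIBOSketch(nn.Module):
    def __init__(self, d, c_dim, flow):
        super().__init__()
        # DeepSets-style stand-in for the Transformer encoder: embed each
        # (x, y) pair, then mean-pool into a fixed-size context vector c.
        self.embed = nn.Sequential(
            nn.Linear(d + 1, c_dim), nn.ReLU(), nn.Linear(c_dim, c_dim)
        )
        self.flow = flow  # conditional generative head G_theta (assumed API)

    def loss(self, X, y, x_star):
        # X: (n, d) inputs, y: (n,) observed values, x_star: (1, d) optimum.
        pairs = torch.cat([X, y.unsqueeze(-1)], dim=-1)
        c = self.embed(pairs).mean(dim=0, keepdim=True)  # context c = E(D)
        # Negative log-likelihood of the true optimum under p_theta(x*|D),
        # a Monte Carlo estimate of the loss in Eq. (1).
        return -self.flow.log_prob(x_star, context=c).mean()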

3.2 Model Pretraining

To generate the data for pretraining FIBO, we need to sample pairs (x∗, D) from the joint distribution p(f, x∗, D). Consider a factorization of such a distribution according to the graphical model in Figure 3:

    p(f, x∗, D) = p(x∗ | f) p(D | f) p(f).    (2)

[Figure 3: Factorization of the data-generating distribution: f is the parent of both x∗ and D.]

This factorization makes sense in BO since, given a function f, the true optimum x∗ does not depend on the historical data D. It allows us to efficiently sample (x∗, D) from p(f, x∗, D) because x∗ ⊥⊥ D | f, i.e., we can sample the function first and then separately sample the optimal point and the dataset. Thus, we can perform on-the-fly data augmentation by subsampling our dataset while keeping the optimal point of the corresponding function fixed.
Notice that this data-generating process only requires specifying a prior over functions p(f), as in Müller et al. [29]. The final data-gathering process for pretraining p_θ(x∗ | D) is summarized in Algorithm 1, with full details given in Appendix A.

Algorithm 1 Pretraining data generation
Require: Prior p(f), minimum dataset size N_min, maximum dataset size N_max
Ensure: Pretraining data D_PT
  Initialize pretraining dataset D_PT = ∅
  repeat
    Sample objective function f ∼ p(f)
    x∗ = gradient_ascent(f)
    n ∼ U(N_min, N_max)
    X = (x_1, ..., x_n), x_i ∼ U(X)
    D = {(x_i, f(x_i)) : x_i ∈ X}
    D_PT = D_PT ∪ {(x∗, D)}
  until the desired amount of data is collected

Algorithm 2 BO loop with FIBO
Require: Model p_θ, initial data D_0 = {(x_1, y_1), ..., (x_k, y_k)}, objective f, batch size q, number of iterations T
  for t ← 0 to T − 1 do
    c ← E_{θ_enc}(D_t)
    for i ← 1 to q do
      z_i ∼ N(0, I)
      x_i ← G_{θ_gen}(z_i; c)
      y_i ← f(x_i)
    end for
    D_{t+1} = D_t ∪ {(x_i, y_i)}_{i=1}^q
  end for
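The following Python sketch mirrors Algorithm 2; the model.encode and model.sample methods are assumed wrappers around the encoder and generative head, not part of any released API:

import torch

@torch.no_grad()
def fibo_bo_loop(model, objective, D0, q, T):
    D = list(D0)  # list of (x, y) pairs
    for _ in range(T):
        c = model.encode(D)                     # in-context: no model fitting
        Z = torch.randn(q, model.d)             # base samples z_i ~ N(0, I)
        X = [model.sample(z, c) for z in Z]     # x_i ~ p_theta(x* | D_t)
        for x in X:                             # evaluate the whole batch
            D.append((x, objective(x)))
    return max(D, key=lambda pair: pair[1])     # best (x, y) found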

3.3 Analysis

Importantly, FIBO is an approximation of Thompson sampling under the prior distribution p(f ) used
for pretraining. The following theorem shows that the proposed method is a principled approach to
BO and retains the properties of TS while being more efficient.

Theorem 1. An in-context generative model p_θ(x∗ | D) trained by minimizing the loss L(θ) in (1) using samples (x∗, D) from p(f, x∗, D) defined in (2) is an approximation of Thompson sampling with p(f | D) defined by the chosen prior over functions p(f).

Proof Sketch. The result follows from rewriting the expected negative log-likelihood as an expectation over datasets of the cross-entropy between p_θ(x∗ | D) and p(x∗ | D). Using the relationship between cross-entropy and KL divergence, we show that minimizing the loss minimizes the KL divergence between the two distributions. The complete proof is in Appendix B.

This result allows us to use Thompson sampling following the principles of in-context learning. In contrast, other surrogate models used in BO, such as GPs or Laplace-approximated BNNs, require considerably more computation: the former requires fitting and approximate sampling using spectral techniques [33], while the latter requires fitting and computing the network's Jacobian.

3.4 Using FIBO for Bayesian Optimization

Given a model pretrained as described above, we are now able to perform BO completely in context.
This is done by sampling from the generative model at each step, as described in Algorithm 2.
Since the model is pretrained, no surrogate fitting is performed at test time. In addition, because we directly sample from the posterior over the optimal point, there is no explicit acquisition function maximization. Therefore, FIBO bypasses the iterative optimization steps present in both stages of Bayesian optimization, instead performing a single forward pass through the deep generative model.

As a consequence, FIBO is significantly faster than the alternatives. Figure 5 shows a comparison between FIBO and a Gaussian process in terms of the time taken to generate a suggestion batch of 10 points on the 4D Ackley function. Even though the GP implementation makes use of modern GPU acceleration and efficient algorithms [4], FIBO is more than one order of magnitude faster.

[Figure 5: Wall-clock time (s, log scale) per iteration on the 4D Ackley function with q = 10 for GP-EI, PFNs4BO, and ours.]

4 Related Work

In-context surrogate models Previous works explored alternatives to avoid re-training the sur-
rogate model at each step of the BO loop. In general, solutions to this problem follow the Neural
Processes [10] framework, modelling the predictive distribution for a test point given a set of labelled
points. PFNs4BO [30] makes use of a Prior-Fitted Network [PFN; 29], a pretrained Transformer that outputs continuous distributions using a binned representation, as its surrogate. The model is pretrained by sampling functions from a predetermined prior distribution. Notice that PFNs4BO only addresses the surrogate model; it still requires acquisition function maximization at every step of the loop. Furthermore, because its predictive posterior takes the form of a binned distribution, it cannot enjoy the benefits of batch suggestions implemented with efficient MC approximations, and thus relies on slow sequential simulation techniques.
Generative models for black-box optimization Inverse approaches for black-box optimization
learn the mapping from function values to inputs in the domain of the objective function. While early
work was based on generative adversarial networks [25], recent developments have successfully used
the greater generative power of diffusion models. Krishnamoorthy et al. [21] proposed a method for offline black-box optimization that trains a conditional diffusion model on a given dataset. During evaluation, the model generates samples conditioned on the highest function value observed in the training dataset. Diff-BBO [46] applies an analogous technique in an online fashion, re-fitting an ensemble of diffusion models after collecting every new batch of data. More recently, Yun et al.
[49] proposed DiBO, which at each step trains an unconditional diffusion model and an ensemble of
regression models that are used together to sample the next points using local search. Importantly,
these methods require extensive test-time compute in order to fit multiple models at each step of the objective function optimization loop. For this reason, this family of methods is even more computationally expensive than traditional BO and in-context surrogate modelling, and it is thus outside the scope of our work.
Generative Models for Approximate Bayesian Inference Bayesian statistics is interested in
inferring the posterior distribution over the model parameters, which is often complex and high-
dimensional. Therefore, researchers have investigated ways in which deep generative models can
be used as tools for approximate Bayesian inference. BayesFlow [32] introduces a framework that makes use of DeepSets [50] and normalizing flows to learn the posterior distribution using data-parameter pairs generated via simulation from a predetermined model. The trained model is then used to perform amortized Bayesian inference via sampling. This line of work was extended by Mittal et al. [27], who investigated the effect of different encoders, such as the GRU [7] and the Transformer [44], and compared objective functions based on the forward and backward KL divergences. Reuter et al. [36] explored a more general problem in which there is no predetermined model; instead, they proposed a model pretrained on a large enough dataset to be capable of approximating a large class of distributions. Their model, which consists of a modified Transformer encoder and continuous normalizing flows, shows state-of-the-art performance for generalized linear models and latent factor models.

5 Experiments

Architecture For the encoder, we use the Transformer [44] from the PFN pretrained on the BNN prior released by the authors of PFNs4BO [30]. The context vector is extracted by averaging the Transformer's output vectors across the data points. We observe that starting from a pretrained model helps with performance even if the prior on which we are finetuning is not the same as the one used for pretraining. The generative model is an autoregressive neural spline flow [8] from the normflows library [41]. Implementation details are provided in Appendix C.
Pretraining Prior We generate the data to pretrain our model using a GP prior as described in Hernández-Lobato et al. [17]. Using the Fourier dual representation of the RBF kernel [33], we are able to sample parametric approximations of the GP. This allows us to sample (x∗, D) pairs efficiently using standard linear algebra packages. More specifically, once we sample an instance f ∼ GP, we use L-BFGS-B [6] with multiple restarts to obtain x∗ and sample the dataset D uniformly. In addition, we perform rejection sampling to ensure that the marginal distribution over the optimal point is uniform across the input domain.

[Figure 6: Comparison of Bayesian optimization methods (GP-EI, PFNs4BO, LLA, FIBO) on functions sampled from FIBO's pretraining prior (3D and 4D functions) across different batch sizes q ∈ {10, 20, 50}. Each subplot shows the final GAP (higher is better) versus the average wall-clock time (log scale) per run.]

Baselines We compare our method against a Gaussian process with a Matern-5/2 kernel and Log Expected Improvement [1] (GP-EI), the linearized Laplace approximation [22] (LLA), PFNs4BO [30] with the HEBO prior, and random search (RS). For GP-EI and LLA we use MC acquisition functions [2] to obtain suggestion batches. For PFNs4BO, we adapt their implementation to perform batch selection via sequential simulation using the expected value under the model [12]. The LLA baseline is highly memory-intensive in the batch setting; for this reason, we only report its results up to the largest batch size that fits in GPU memory. Details are provided in Appendix C.
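For reference, a sketch of the GP-EI baseline setup using BoTorch's public API (model, MC acquisition, multi-start optimization) with the restart and raw-sample counts reported in Appendix C; the exact details of the authors' configuration may differ:

import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from gpytorch.mlls import ExactMarginalLogLikelihood
from botorch.acquisition.logei import qLogExpectedImprovement
from botorch.optim import optimize_acqf

def gp_ei_suggest(train_X, train_Y, bounds, q=10):
    # Re-fit the GP surrogate on the current data (the step FIBO skips).
    model = SingleTaskGP(train_X, train_Y)  # train_Y has shape (n, 1)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
    # Monte Carlo LogEI, followed by multi-start acquisition optimization.
    acq = qLogExpectedImprovement(model=model, best_f=train_Y.max())
    candidates, _ = optimize_acqf(
        acq, bounds=bounds, q=q, num_restarts=10, raw_samples=512
    )
    return candidates  # (q, d) batch of suggestions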
Evaluation For all experiments we use the GAP measure [19] to evaluate performance: GAP = (y_i − y_0)/(y∗ − y_0), where y_i is the best value found so far, y_0 is the best value among the initial points, and y∗ is the optimal value. Whenever the objective function does not have a known optimal point, we take y∗ to be the best result across all runs and methods. We also compare the wall-clock time (in seconds) taken to propose the next points at each iteration. For all tasks we perform a total of 200 function evaluations, using batch sizes q ∈ {10, 20, 50} and setting the number of initial points to the corresponding batch size.
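For concreteness, a small helper implementing this measure (illustrative only):

def gap(y_best, y_init, y_opt):
    """GAP = (y_best - y_init) / (y_opt - y_init): 0 means no improvement
    over the best initial point; 1 means the optimum was reached."""
    return (y_best - y_init) / (y_opt - y_init)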

5.1 Well-Specified Functions

We start by considering functions that come from the same distribution used to train FIBO. To this
end, we sample 10 functions from the prior for each dimension. All functions are defined on the unit
hypercube of their corresponding dimension.
The results presented in Figure 6 show that FIBO is up to two and three orders of magnitude faster
than GP and PFNs4BO, respectively, as the batch size increases. Importantly, the gains in speed do
not interfere with the performance in optimization. Across all runs FIBO shows similar performance
in terms of GAP when compared to PFNs4BO, and gets closer to the GP as the batch size increases.
Importantly, given that the functions in this set of experiments come from a GP prior, it is expected
that the GP would have the best performance, considering the surrogate is re-fit at each step.

5.2 Synthetic Functions

We also evaluate the methods on a suite of synthetic functions commonly used in Bayesian Opti-
mization: Ackley function on [−32.768, 32.768]d ⊂ Rd , Levy function on [−10, 10]d ⊂ Rd , and
Rosenbrock function on [−5, 10]d ⊂ Rd , taking d ∈ {3, 4} for all three functions. We also use the
Hartmann function on [0, 1]3 ⊂ R3 .
The results are presented in Figure 7. Across all methods, there is higher variance in the final GAP, with harder tasks, such as the Ackley function, showing lower scores. Once more, FIBO is substantially faster than the baselines while preserving optimization quality.

[Figure 7: Comparison of Bayesian optimization methods (GP-EI, PFNs4BO, LLA, FIBO) on standard synthetic functions (Ackley 3D/4D, Hartmann 3D, Levy 3D/4D) across different batch sizes q ∈ {10, 20, 50}. Each subplot shows the final GAP (higher is better) versus the average wall-clock time (log scale) per run.]

In fact, the aggregated results in Table 1 show that on average FIBO is similar to both GP-EI and PFNs4BO in terms of GAP.

5.3 Real-World Chemistry Tasks

We conduct experiments on a wide variety of chemistry benchmarks for which an oracle function is
provided by the Olympus toolkit [16]. All tasks have continuous input domains and vary in dimension. Complete information about the tasks is provided in Appendix D.
We present the results in Figure 8. The overall results are on par with the previous experiments, showing FIBO to be faster than the other baselines with little to no drop in GAP performance. In addition, this set of experiments shows the clear advantage of FIBO's in-context nature. While GP-EI and PFNs4BO are slower on these tasks than on other tasks with the same batch size, due to a harder inner optimization loop (i.e., maximization of the acquisition function), FIBO's execution time stays the same. As

[Figure 8: Comparison of Bayesian optimization methods (GP-EI, PFNs4BO, LLA, FIBO) on various chemistry tasks from the Olympus toolkit (Colors N9 3D, Fullerenes 3D, Benzylation 4D, Photo PCE10 4D, Photo WF3 4D, SnAr 4D, Suzuki 4D) across different batch sizes q ∈ {10, 20, 50}. Each subplot shows the final GAP (higher is better) versus the average wall-clock time (log scale) per run.]

Table 1: Aggregated results of the Bayesian optimization methods across all runs and tasks for all task sets and batch sizes q ∈ {10, 20, 50}. The best final GAP (higher is better) and wall-clock time in seconds (lower is better) per group is highlighted taking into consideration the standard error.

Task Set   Method    q = 10                     q = 20                      q = 50
                     GAP          Time          GAP          Time           GAP          Time
Prior      GP-EI     0.99 ± 0.01  9.06 ± 0.88   0.98 ± 0.01  8.28 ± 0.93    0.93 ± 0.02  7.88 ± 1.02
           PFNs4BO   0.90 ± 0.03  25.05 ± 1.17  0.91 ± 0.03  55.42 ± 3.07   0.85 ± 0.05  129.04 ± 9.45
           LLA       0.85 ± 0.03  23.99 ± 0.24  —            —              —            —
           FIBO      0.93 ± 0.02  0.12 ± 0.01   0.93 ± 0.02  0.12 ± 0.01    0.92 ± 0.03  0.13 ± 0.01
Synthetic  GP-EI     0.92 ± 0.01  8.69 ± 0.48   0.90 ± 0.02  8.39 ± 0.44    0.87 ± 0.02  7.76 ± 0.51
           PFNs4BO   0.92 ± 0.01  25.35 ± 0.61  0.90 ± 0.02  51.81 ± 1.30   0.89 ± 0.02  126.07 ± 3.26
           LLA       0.82 ± 0.03  23.30 ± 0.11  —            —              —            —
           FIBO      0.92 ± 0.01  0.17 ± 0.02   0.93 ± 0.01  0.18 ± 0.02    0.93 ± 0.01  0.20 ± 0.03
Chemistry  GP-EI     0.89 ± 0.02  11.58 ± 0.86  0.87 ± 0.02  9.85 ± 0.73    0.88 ± 0.02  11.14 ± 0.81
           PFNs4BO   0.88 ± 0.03  53.66 ± 2.72  0.87 ± 0.03  125.15 ± 7.54  0.89 ± 0.02  348.05 ± 23.48
           LLA       0.88 ± 0.02  27.76 ± 0.63  —            —              —            —
           FIBO      0.87 ± 0.02  0.14 ± 0.00   0.82 ± 0.03  0.14 ± 0.00    0.85 ± 0.02  0.17 ± 0.01

presented in Table 1, this results in FIBO being up to 100× faster than GP-EI and up to 1000× faster than PFNs4BO with no significant drop in GAP performance.
In addition, we perform hypothesis testing to assess whether the differences between the methods are statistically significant. We perform the tests in a pairwise fashion for each batch size q ∈ {10, 20, 50}, considering GAP and wall-clock time separately and together. For the one-dimensional tests we use the Wilcoxon rank-sum test, and for the two-dimensional analysis we use a permutation test of the group means. As the p-values presented in Table 2 show, the differences between FIBO and the baselines are statistically significant when considering wall-clock time alone and when considering wall-clock time and final GAP together. Importantly, the statistical analysis shows no significant difference between the final GAP obtained by FIBO and all other baselines.

6 Conclusion

We introduced FIBO, a pretrained model that performs completely in-context Bayesian optimization with no surrogate fitting or acquisition function maximization. We prove that by pretraining FIBO on functions drawn from a chosen prior we are approximating Thompson sampling, demonstrating that the method is a principled approach to Bayesian optimization.

Through experiments on both synthetic and real-world benchmarks, we show that FIBO's results are on par with other existing methods in the literature while being significantly faster, especially in the batch optimization setting. In future work, we plan on scaling FIBO both in terms of pretraining data and model size, following a series of recent works that demonstrate the power of scaling in many different domains.

Table 2: P-values for pairwise hypothesis tests between FIBO and the baselines across all tasks and runs on the real-world chemistry benchmark.

              GP-EI  PFNs4BO  LLA
q = 10  Both  0.00   0.00     0.00
        Time  0.00   0.00     0.00
        GAP   0.15   0.12     0.32
q = 20  Both  0.00   0.00     —
        Time  0.00   0.00     —
        GAP   0.10   0.09     —
q = 50  Both  0.00   0.00     —
        Time  0.00   0.00     —
        GAP   0.26   0.05     —
Limitations We propose a completely in-context method for BO that enjoys a speed-up of multiple orders of magnitude compared to traditional BO methods. However, we focus on problems in lower
dimensions since they are ubiquitous throughout science and engineering. Moreover, our method
does not solve the problem of the inherently long experiment time for computing f (x). Nevertheless,
FIBO is useful in batched, high-throughput, or multi-fidelity scenarios where the computation of
f (x) is less of a problem.

References
[1] Sebastian Ament, Samuel Daulton, David Eriksson, Maximilian Balandat, and Eytan Bakshy. Unexpected
improvements to expected improvement for Bayesian optimization. In NeurIPS, 2023.

[2] Sebastian Ament, Samuel Daulton, David Eriksson, Maximilian Balandat, and Eytan Bakshy. Unexpected
improvements to expected improvement for Bayesian optimization. In NeurIPS, 2023.

[3] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3, 2003.

[4] Maximilian Balandat, Brian Karrer, Daniel Jiang, Samuel Daulton, Ben Letham, Andrew G Wilson, and Eytan Bakshy. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. In NeurIPS, 2020.

[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens
Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models
are few-shot learners. In NeurIPS, 2020.

[6] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound
constrained optimization. SIAM Journal on Scientific Computing, 16(5), 1995.

[7] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical
machine translation. In EMNLP, 2014.

[8] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In NeurIPS,
2019.

[9] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. Auto-sklearn
2.0: Hands-free automl via meta-learning. JMLR, 23(261), 2022.

[10] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan,
Yee Whye Teh, Danilo Rezende, and S. M. Ali Eslami. Conditional neural processes. In ICML, 2018.

[11] Roman Garnett. Bayesian Optimization. Cambridge University Press, 2023.

[12] David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. Kriging Is Well-Suited to Parallelize
Optimization. Springer, 2010.

[13] Haisong Gong, Qiang Liu, Shu Wu, and Liang Wang. Text-guided molecule generation with diffusion
language model. Proceedings of the AAAI Conference on Artificial Intelligence, 38(1), 2024.

[14] Rebecca L Greenaway, Kim E Jelfs, Alan C Spivey, and Sophia N Yaliraki. From alchemist to AI chemist. Nature Reviews Chemistry, 7(8), 2023.

[15] Ryan-Rhys Griffiths and José Miguel Hernández-Lobato. Constrained Bayesian optimization for automatic
chemical design using variational autoencoders. Chem. Sci., 11, 2020.

[16] Florian Häse, Matteo Aldeghi, Riley J. Hickman, Loïc M. Roch, Melodie Christensen, Elena Liles,
Jason E. Hein, and Alán Aspuru-Guzik. Olympus: a benchmarking framework for noisy optimization and
experiment planning. Machine Learning: Science and Technology, 2(3), 2021.

[17] José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In NeurIPS, 2014.

[18] José Miguel Hernández-Lobato, James Requeima, Edward O. Pyzer-Knapp, and Alán Aspuru-Guzik.
Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In
ICML, 2017.

[19] Shali Jiang, Henry Chai, Javier Gonzalez, and Roman Garnett. BINOCULARS for efficient, nonmyopic
sequential experimental design. In ICML, 2020.

[20] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive
black-box functions. Journal of Global Optimization, 13, 1998.

[21] Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, and Aditya Grover. Diffusion models for black-box
optimization. In ICML, 2023.

[22] Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, and Vincent Fortuin. Promises and pitfalls
of the linearized Laplace in Bayesian optimization. In Fifth Symposium on Advances in Approximate
Bayesian Inference, 2023.

[23] Agustinus Kristiadi, Felix Strieth-Kalthoff, Marta Skreta, Pascal Poupart, Alán Aspuru-Guzik, and Geoff
Pleiss. A sober look at LLMs for material discovery: Are they actually good for Bayesian optimization
over molecules? In ICML, 2024.

[24] Agustinus Kristiadi, Felix Strieth-Kalthoff, Sriram Ganapathi Subramanian, Vincent Fortuin, Pascal
Poupart, and Geoff Pleiss. How useful is intermittent, asynchronous expert feedback for Bayesian
optimization? In Sixth Symposium on Advances in Approximate Bayesian Inference-Non Archival Track,
2024.

[25] Aviral Kumar and Sergey Levine. Model inversion networks for model-based optimization. In NeurIPS,
2020.

[26] Yucen Lily Li, Tim G. J. Rudner, and Andrew Gordon Wilson. A study of Bayesian neural network
surrogates for Bayesian optimization. In ICLR, 2024.

[27] Sarthak Mittal, Niels Leif Bracher, Guillaume Lajoie, Priyank Jaini, and Marcus Brubaker. Amortized
in-context Bayesian posterior estimation, 2025.

[28] J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical
Conference Novosibirsk, July 1–7, 1974, 1975.

[29] Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. Transformers
can do Bayesian inference. In ICLR, 2022.

[30] Samuel Müller, Matthias Feurer, Noah Hollmann, and Frank Hutter. PFNs4BO: In-context learning for
Bayesian optimization. In ICML, 2023.

[31] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshmi-
narayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning
Research, 22(57), 2021.

[32] Stefan T. Radev, Ulf K. Mertens, Andreas Voss, Lynton Ardizzone, and Ullrich Köthe. BayesFlow:
Learning complex stochastic models with invertible neural networks. IEEE transactions on neural
networks and learning systems, 33(4):1452–1466, 2020.

[33] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NeurIPS, 2007.

[34] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and
Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.

[35] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The
MIT Press, 2006.

[36] Arik Reuter, Tim G. J. Rudner, Vincent Fortuin, and David Rügamer. Can transformers learn full Bayesian
inference in context? arXiv, 2025.

[37] Philip A Romero, Andreas Krause, and Frances H Arnold. Navigating the protein fitness landscape with
Gaussian processes. Proceedings of the National Academy of Sciences, 110(3), 2013.

[38] Stephen J Ruberg, Francois Beckers, Rob Hemmings, Peter Honig, Telba Irony, Lisa LaVange, Grazyna
Lieberman, James Mayne, and Richard Moscicki. Application of Bayesian approaches in drug development:
starting a virtuous cycle. Nature Reviews Drug Discovery, 22(3), 2023.

[39] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human
out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 2016.

[40] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning
algorithms. In NeurIPS, 2012.

[41] Vincent Stimper, David Liu, Andrew Campbell, Vincent Berenz, Lukas Ryll, Bernhard Schölkopf, and
José Miguel Hernández-Lobato. normflows: A Pytorch package for normalizing flows. Journal of Open
Source Software, 8(86), 2023.

[42] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the
evidence of two samples. Biometrika, 25(3-4), 1933.

[43] Gary Tom, Stefan P. Schmid, Sterling G. Baird, Yang Cao, Kourosh Darvish, Han Hao, Stanley Lo, Sergio
Pablo-García, Ella M. Rajaonson, Marta Skreta, Naruki Yoshikawa, Samantha Corapi, Gun Deniz Akkoc,
Felix Strieth-Kalthoff, Martin Seifrid, and Alán Aspuru-Guzik. Self-driving laboratories for chemistry and
materials science. Chemical Reviews, 124(16), 2024.
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[45] James Wilson, Frank Hutter, and Marc Deisenroth. Maximizing acquisition functions for Bayesian
optimization. In NeurIPS, 2018.

[46] Dongxia Wu, Nikki Lijing Kuang, Ruijia Niu, Yian Ma, and Rose Yu. Diff-BBO: Diffusion-based inverse
modeling for black-box Optimization. In NeurIPS 2024 Workshop on Bayesian Decision-making and
Uncertainty, 2024.

[47] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun
Huang, and Zheng Liu. Omnigen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024.

[48] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning
as implicit Bayesian inference. In ICLR, 2022.

[49] Taeyoung Yun, Kiyoung Om, Jaewoo Lee, Sujin Yun, and Jinkyoo Park. Posterior inference with diffusion
models for high-dimensional black-box optimization, 2025.

[50] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and
Alexander J Smola. Deep sets. In NeurIPS, 2017.

Appendix A Data Generation Details

As described in Section 3.2, we are interested in sampling pairs of the form (x∗, D), where x∗ = arg max_{x∈X} f(x) and D = {(x_i, f(x_i))}_{i=1}^N, for some function f : X → R sampled from a prior distribution p(f). Considering the factorization presented in the graphical model in Figure 3, it is clear that we can sample (f, x∗, D) triples using ancestral sampling. We first sample a function from the chosen prior; then, sampling from p(x∗ | f) is performed by finding the optimal point of f; finally, to sample from p(D | f), we obtain samples of input points uniformly over the domain and pass them through the function. At the end, we can simply discard the function to obtain (x∗, D) pairs.

Importantly, having x∗ ⊥⊥ D | f allows us to perform extensive data augmentation. As the dataset is independent of the optimum for a given function, we can vary the dataset without needing to recompute x∗. We leverage this fact by collecting a large dataset when generating the data and dynamically subsampling from it during pretraining. This has the effect of exponentially augmenting the amount of pretraining data while constructing datasets of variable sizes.

In practice, we want a prior over functions that are non-convex, as that is the underlying structure of many applications studied in Bayesian optimization. Therefore, obtaining x∗ = arg max_{x∈X} f(x) for a known sampled function f is itself a complex task. To deal with this hard global optimization problem, we make use of common practices in the literature, such as optimization methods that use second-order information and multiple restarts. Although this can entail a higher computational cost, the data generation is performed offline a single time and can be highly parallelized across multiple independent processes.
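A minimal sketch of this ancestral sampling procedure, assuming a sample_function routine that draws a callable f ∼ p(f) on the unit hypercube; SciPy's L-BFGS-B with random restarts stands in for the multi-restart optimizer described above:

import numpy as np
from scipy.optimize import minimize

def sample_pair(sample_function, d, n_min, n_max, n_restarts=10):
    f = sample_function()                 # draw f ~ p(f) (assumed routine)
    bounds = [(0.0, 1.0)] * d
    # p(x* | f): global maximization via L-BFGS-B with random restarts.
    starts = np.random.rand(n_restarts, d)
    results = [
        minimize(lambda x: -f(x), x0, method="L-BFGS-B", bounds=bounds)
        for x0 in starts
    ]
    x_star = min(results, key=lambda r: r.fun).x
    # p(D | f): uniform inputs passed through the sampled function.
    n = np.random.randint(n_min, n_max + 1)
    X = np.random.rand(n, d)
    D = [(x, f(x)) for x in X]
    return x_star, D                      # the function f is discarded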

Appendix B Proof of Theorem 1

Let us begin by applying the definition of expected value to the loss function presented in (1):

    L(θ) = −E_{x∗, D ∼ p(x∗, D)}[log p_θ(x∗ | D)]
         = −∫∫ p(x∗, D) log p_θ(x∗ | D) dx∗ dD.    (3)

Now we factor p(x∗, D) using the product rule and rearrange the integral accordingly:

    L(θ) = −∫ p(D) ∫ p(x∗ | D) log p_θ(x∗ | D) dx∗ dD.    (4)

The outer integral in (4) can be seen as an expectation over D, while the inner integral (with the negative sign in front) is the definition of the cross-entropy between p(x∗ | D) and p_θ(x∗ | D). Thus, we can rewrite the loss as

    L(θ) = E_{D ∼ p(D)}[H(p(x∗ | D), p_θ(x∗ | D))].    (5)

Making use of the identity that relates cross-entropy and KL divergence, we get

    L(θ) = E_{D ∼ p(D)}[KL(p(x∗ | D) ∥ p_θ(x∗ | D))] + E_{D ∼ p(D)}[H(p(x∗ | D))].    (6)

Since the second term does not depend on θ, minimizing L(θ) is equivalent to minimizing

    E_{D ∼ p(D)}[KL(p(x∗ | D) ∥ p_θ(x∗ | D))],    (7)

demonstrating that by minimizing L(θ) in (1) we are minimizing the expected KL divergence between the true Thompson sampling distribution and the one modelled by the deep generative model.

Table 3: Descriptions of the tasks from the Olympus toolkit used in our experiments.

Name          Input Dimension   Description
Colors N9     3D                Minimize distance to a target color by mixing red, green, and blue dye
Fullerenes    3D                Maximize the mole fraction of o-xylenyl adducts of Buckminsterfullerene
Photo PCE10   4D                Minimize photo-degradation of organic solar cells with PCE10
Photo WF3     4D                Minimize photo-degradation of organic solar cells with WF3
Benzylation   4D                Minimize yield of impurity in an N-benzylation reaction
SnAr          4D                Minimize the E-factor for a nucleophilic aromatic substitution following the SnAr mechanism
Suzuki        4D                Maximize Suzuki coupling yield by adjusting reaction parameters

Appendix C Implementation Details


GP prior We sample (x∗, D) from an approximation of a GP prior as in Hernández-Lobato et al. [17]. A squared-exponential kernel k(x, x′) = γ² exp(−0.5 Σ_i (x_i − x′_i)²/l_i²) can be approximated using random Fourier features [33] with the feature map ϕ(x) = √(2γ²/m) cos(W x + b), where W and b are composed of m stacked samples from N(0, diag(l⁻²)) and U(0, 2π), respectively. This enables the GP prior to be approximated by a linear model f(x) = ϕ(x)⊤β, with β ∼ N(0, I). Making use of this representation allows us to obtain x∗ = arg max_{x∈X} f(x) and to sample D trivially.

For our experiments we use hyperpriors l_i ∼ U(0.01, 5) and γ² ∼ U(1, 2). In addition, we perform rejection sampling over the (x∗, D) pairs to ensure that the marginal distribution over the optima is uniform in the domain. This last step is necessary to ensure that FIBO is not biased towards specific regions of the input space.
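A sketch of this construction in NumPy, with the hyperpriors stated above; the function name is illustrative:

import numpy as np

def sample_rff_function(d, m=1024):
    """One parametric draw f(x) = phi(x)^T beta approximating the GP prior."""
    ls = np.random.uniform(0.01, 5.0, size=d)   # lengthscales l_i ~ U(0.01, 5)
    gamma2 = np.random.uniform(1.0, 2.0)        # signal variance gamma^2 ~ U(1, 2)
    W = np.random.randn(m, d) / ls              # rows of W ~ N(0, diag(l^-2))
    b = np.random.uniform(0.0, 2 * np.pi, size=m)
    beta = np.random.randn(m)                   # linear weights beta ~ N(0, I)

    def f(x):
        phi = np.sqrt(2.0 * gamma2 / m) * np.cos(W @ x + b)
        return float(phi @ beta)

    return f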

FIBO We pretrain one model per dimension using a dataset of 70,000 pairs sampled from the GP prior described above. The models are trained for 200 epochs with a batch size of 128, using the Adam optimizer with a starting learning rate of 10⁻⁴ and a cosine scheduler. Training is performed in PyTorch on a single NVIDIA Quadro RTX 6000.
The architecture is composed of a Transformer encoder from the PFNs4BO paper pretrained on the BNN prior¹ and a neural spline flow generative head from the normflows library². The output of the encoder is projected to R^c with c = 256 and c = 512 for d = 3 and d = 4, respectively. Each flow block comprises 6 layers with 256 hidden units. For d = 3 we use 6 flow blocks, while for d = 4 we use 8 blocks.

Baselines For PFNs4BO we use the official implementation and weights provided by the authors, extending it to handle batch suggestions via sequential simulation [12] and datasets of different sizes. All acquisition function optimization parameters are kept as they appear in the original paper [30]. We use the BoTorch [4] implementation of the GP model, optimizing LogEI [1] with 10 restarts and 512 raw samples. Finally, for LLA we use the official BoTorch implementation³ alongside the same acquisition function setup used for the GP. FIBO, PFNs4BO, and GP-EI were executed on an NVIDIA Quadro RTX 6000, while LLA was executed on an NVIDIA A40 due to its higher memory requirements.

Appendix D Olympus Details


The Olympus toolkit [16] provides a wide variety of chemistry benchmarks for black-box optimization. It contains machine-learning-based emulators that are used as oracles for the different tasks. This allows fast objective function evaluation while still providing evaluation on real-world applications. Detailed information about each task used in our experiments is provided in Table 3. For all tasks we use the BayesNeuralNet emulator.

¹ https://github.com/automl/PFNs4BO
² https://github.com/VincentStimper/normalizing-flows
³ https://github.com/wiseodd/laplace-bayesopt

