Principled Preferential Bayesian Optimization
Abstract
We study the problem of preferential Bayesian optimization (BO), where we aim to optimize a black-box function with only preference feedback over a pair of candidate solutions. Inspired by the likelihood ratio idea, we construct a confidence set of the black-box function using only the preference feedback. An optimistic algorithm with an efficient computational method is then developed to solve the problem, which enjoys an information-theoretic bound on the total cumulative regret, a first-of-its-kind for preferential BO. This bound further allows us to design a scheme to report an estimated best solution, with a guaranteed convergence rate. Experimental results on sampled instances from Gaussian processes, standard test functions, and a thermal comfort optimization problem all show that our method stably achieves better or competitive performance as compared to the existing state-of-the-art heuristics, which, however, do not have theoretical guarantees on regret bounds or convergence.
1 Introduction
Bayesian optimization (BO) is a popular sample-efficient black-box optimization method (Shahriari et al., 2015; Frazier, 2018). It is widely applied to tuning hyperparameters of machine learning models (Snoek et al., 2012), optimizing the performance of control systems (Xu et al., 2022b), and discovering new drugs (Negoescu et al., 2011), etc.
The main idea of BO is based on surrogate modeling. That is, a learning algorithm (typically Gaussian process regression) is applied to learn the unknown black-box function using historical samples, which then outputs a learned surrogate together with uncertainty quantification. Then BO algorithms, such as the popular Expected Improvement (Jones et al., 1998) and GP-UCB algorithms (Srinivas et al., 2012), use the information of this learned surrogate and uncertainty quantification to choose the next sample point.
The conventional BO setting assumes each sample, which typically corresponds to a round of real-world experiment or software simulation in practice, returns a noisy scalar evaluation of the black-box function. However, many human-in-the-loop systems cannot return such a scalar value, or it is much more difficult to obtain one directly from humans, since humans are bad at sensing absolute magnitude (Kahneman & Tversky, 2013). In contrast, it is much easier for a human to compare a pair of solutions and report which is preferred (Lichtenstein & Slovic, 1971; Tversky & Kahneman, 1974; Kahneman & Tversky, 2013).
This gives rise to preferential Bayesian optimization (González et al., 2017), where the scalar evaluation of the black-box function is not available. Instead, we can query an oracle to compare a pair of solutions, a so-called duel. Such settings arise in a broad range of applications, such as visual design optimization (Koyama et al., 2020), thermal comfort optimization (Abdelrahman & Miller, 2022) and robotic gait optimization (Li et al., 2021).
Existing preferential Bayesian optimization methods are mostly heuristic, without formal guarantees on cumulative regret or convergence to the global optimal solution. For example, González et al. (2017) propose several heuristic acquisition strategies, including expected improvement and Thompson sampling-based methods, for preferential Bayesian optimization. Mikkola et al. (2020) extend preferential Bayesian optimization to the projective setting. Takeno et al. (2023) propose a Thompson sampling-based method for practical preferential Bayesian optimization with a skew Gaussian process. Astudillo et al. (2023) propose a decision-theoretic acquisition strategy with a convergence rate guarantee for a finite input set. However, as far as we know, none of the existing preferential Bayesian optimization methods provides theoretical guarantees on cumulative regret or global convergence with a continuous input space, partially due to the challenge of quantifying uncertainty in a principled way.
Beyond preferential BO, optimization from preference feedback has also been investigated in other contexts. In the following, we first survey the related work other than preferential BO and then highlight our unique contributions.
Dueling Bandits. In dueling bandits (Yue et al., 2012), the goal is to identify the best arm from a finite set of arms, using only noisy comparison feedback. It has also been extended to adversarial (Gajane et al., 2015) and contextual (Dudík et al., 2015; Saha & Krishnamurthy, 2022) settings. One extension that is most related to this work is kernelized dueling bandits (Sui et al., 2017, 2018). However, this line of research is typically restricted to the case where the number of arms is finite, and the regret bound can blow up to infinity when the number of arms goes to infinity (e.g., Thm. 2 in (Sui et al., 2017)). A recent work (Mehta et al., 2023) proposes an offline method with a suboptimality bound by learning the winning probability, which, however, is not applicable to online learning problems due to the linear growth of regret over the randomly sampled compared point sequences. In the existing literature, there is no cumulative regret bound that depends on an inherent complexity metric (such as covering number and maximum information gain (Srinivas et al., 2012)) of the black-box function with continuous input space.
Convex Optimization with Preference Feedback. Saha et al. (2021) and Yue & Joachims (2009) consider the optimization of convex functions, where only a comparison oracle of function values over different points is available. The proposed methods estimate the gradient from the preference signals. However, this line of research restricts the function to be convex, while in practice, the black-box function may be non-convex. Such methods may get stuck in a local optimum and can be sample-inefficient, since each estimate of the gradient already needs several samples.
Reinforcement Learning from Human Feedback. Reinforcement learning from human feedback (RLHF) (Christiano et al., 2017; Griffith et al., 2013) has recently become very popular. It has found success in a wide range of applications, including training robots (Hiranaka et al., 2023), playing games (Warnell et al., 2018), and, remarkably, large language models (Ouyang et al., 2022). On the theoretical side of RLHF research, recent results analyze the offline learning of the implicit reward function (Zhu et al., 2023) and model-based optimistic reinforcement learning from human feedback (Wang et al., 2023). However, the existing theoretical analysis either only deals with finite-dimensional generalized linear models or relies heavily on the complexity measure of Eluder dimension (Osband & Van Roy, 2014). The existing generic theoretical analysis for RLHF cannot be directly applied to the Bayesian optimization setting, where the Eluder dimension of the infinite-dimensional reproducing kernel Hilbert space is not well understood.
Optimistic Model-based Sequential Decision Making. Optimism in the face of uncertainty is a widely adopted design principle for model-based sequential decision making problems, such as in Bayesian optimization/reinforcement learning (Wu et al., 2022; Xu et al., 2023; Pacchiano et al., 2021; Curi et al., 2020; Liu et al., 2023). The optimism principle has also recently been applied to RLHF (Wang et al., 2023). However, as far as we know, there is no existing principled optimistic algorithm for preferential BO yet.
Our contributions. Guided by the optimism principle, we design a preferential Bayesian optimization algorithm that enjoys information-theoretic bounds on the cumulative regret. Specifically, our contributions include:
- Algorithm design. Inspired by the recent work on confidence sets based on the optimistic maximum likelihood estimate (Liu et al., 2023) and the likelihood ratio confidence set idea (Owen, 1990; Emmenegger et al., 2023), we construct a confidence set using only the preference feedback. We then exploit the principle of optimism in the face of uncertainty to design a Principled Optimistic Preferential Bayesian Optimization (POP-BO) algorithm, together with a scheme for reporting an estimated best solution.
- Theoretical analysis. Under some mild regularity assumptions, we prove an information-theoretic bound on the cumulative regret of the POP-BO algorithm, which is the first of its kind for preferential Bayesian optimization. This is significant since previous information-theoretic regret bounds typically assume direct scalar evaluations of black-box functions (Srinivas et al., 2012), while the recent generic theoretical results for RLHF typically rely on Eluder dimension, which is not well understood for RKHS. (Footnote 1: Mehta et al. (2023) provide a bound on the partial cumulative regret, which only captures the suboptimality of one point in each compared duel. We consider the stronger total cumulative regret over both points in the compared duel. See Appendix Q for a detailed discussion.)
- Efficient computations. The optimistic algorithm needs to solve bi-level optimization problems with the inner variable in an infinite-dimensional function space. We leverage the representer theorem (Schölkopf et al., 2001) to reduce the inner optimization problem to a finite-dimensional space, which turns out to be tractable via convex optimization. This further allows efficient grid-free joint optimization.
- Empirical validations and toolbox. Experimental results show that POP-BO consistently achieves better or competitive performance as compared to the state-of-the-art heuristic baselines and more than times speed-up in computation as compared to the Thompson sampling based method. We also provide a reusable toolbox for future applications of our method. (Footnote 2: Code link: https://github.com/PREDICT-EPFL/POP-BO)
2 Problem Statement
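We consider the maximization of a black-box function $f$,
$$\max_{x \in \mathcal{X}} f(x), \tag{2}$$
where $\mathcal{X} \subset \mathbb{R}^d$ with $d$ as the input dimension. We use $x \succ x'$ to denote the event that ‘$x$ is preferred to $x'$’. In contrast to the standard BO setup, we assume that we cannot directly evaluate the scalar value of $f(x)$; rather, we have a comparison oracle that compares any two points $x, x' \in \mathcal{X}$ and returns a preference signal $\mathbf{1}_{x \succ x'} \in \{0, 1\}$, which is defined as
$$\mathbf{1}_{x \succ x'} = \begin{cases} 1, & \text{if } x \text{ is preferred to } x', \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$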
Before proceeding, we state a set of common assumptions.
Assumption 2.1.
$\mathcal{X}$ is compact and nonempty.
Assumption 2.1 is reasonable because, in many applications of Bayesian optimization (e.g., continuous hyperparameter tuning), we are able to restrict the optimization to certain ranges based on domain knowledge. Regarding the black-box function $f$, we assume that,
Assumption 2.2.
$f \in \mathcal{H}_k$, where $k$ is a symmetric, positive semidefinite kernel function and $\mathcal{H}_k$ is the corresponding reproducing kernel Hilbert space (RKHS, see (Schölkopf et al., 2001)). Furthermore, we assume $\|f\|_k \le B$, where $\|\cdot\|_k$ is the norm induced by the inner product in the corresponding RKHS.
Assumption 2.2 requires that the function to be optimized is regular in the sense that it has a bounded norm in the RKHS, which is a common assumption (Chowdhury & Gopalan, 2017a; Zhou & Ji, 2022). For simplicity, we will use $\mathcal{B}_f$ to denote the set $\{\tilde{f} \in \mathcal{H}_k : \|\tilde{f}\|_k \le B\}$, which is a ball with radius $B$ in $\mathcal{H}_k$.
Remark 2.3 (Choice of $B$).
In practice, a tight norm bound $B$ might not be known beforehand. In the theoretical analysis, we only assume that there is a finite bound $B$, possibly unknown beforehand. In the practical implementation of our algorithm, we can adapt $B$ based on hypothesis testing (Newey & McFadden, 1994). For example, we can double $B$ every time we detect a low likelihood value (see Appendix G for more elaboration).
Assumption 2.4.
$k(x, y) \le 1$ for all $x, y \in \mathcal{X}$, and $k$ is continuous on $\mathbb{R}^d \times \mathbb{R}^d$.
Assumption 2.4 is a commonly adopted mild assumption in the BO literature (Srinivas et al., 2012; Chowdhury & Gopalan, 2017a). It holds for most commonly used kernel functions after normalization, such as the linear kernel, the Matérn kernel, and the squared exponential kernel.
Assumption 2.5.
The random preference feedback $\mathbf{1}_{x \succ x'}$ from the comparison oracle follows the Bernoulli distribution with $\mathbb{P}(\mathbf{1}_{x \succ x'} = 1) = \sigma(f(x) - f(x'))$, where $\sigma(u) = 1/(1 + e^{-u})$ and $x, x' \in \mathcal{X}$.
Assumption 2.5 equivalently assumes that,
$$\mathbb{P}(\mathbf{1}_{x \succ x'} = 1) = \frac{e^{f(x)}}{e^{f(x)} + e^{f(x')}}, \tag{4}$$
which can be observed to be the widely used Bradley-Terry-Luce (BTL) model (Bradley & Terry, 1952) for pairwise comparison. The intuition here is that the more advantage $x$ has as compared to $x'$, the more likely $x$ is preferred. The same comparison model is also used in, e.g., training large language models (Ouyang et al., 2022). At step $t$, our algorithm queries the pair $(x_t, x'_t)$ and the comparison oracle returns the random preference $\mathbf{1}_{x_t \succ x'_t}$. For simplicity of notation, we use $\mathbf{1}_t$ to denote the realization of the Bernoulli random variable $\mathbf{1}_{x_t \succ x'_t}$ when querying the comparison oracle at step $t$. Based on the historical comparison results
$$\mathcal{D}_t := \{(x_\tau, x'_\tau, \mathbf{1}_\tau)\}_{\tau=1}^{t}, \tag{5}$$
the algorithm needs to decide the next pair of samples to compare. Unless otherwise stated, all the theoretical results in this paper hold under Assumptions 2.1, 2.2, 2.4, and 2.5, and all the corresponding proofs are in the appendices.
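The comparison oracle of Assumption 2.5 can be simulated in a few lines. The following is a minimal sketch, assuming the logistic link implied by the BTL form; the objective `toy_f`, the helper names, and the query points are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(u: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-u))."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def preference_probability(f, x, x_prime) -> float:
    """Probability that x is preferred to x_prime under the BTL model."""
    return sigmoid(f(x) - f(x_prime))


def comparison_oracle(f, x, x_prime, rng) -> int:
    """Return 1 if x wins the duel against x_prime, else 0 (Bernoulli draw)."""
    return 1 if rng.random() < preference_probability(f, x, x_prime) else 0


def toy_f(x: float) -> float:
    """Hypothetical black-box objective, used only for this illustration."""
    return -((x - 0.3) ** 2)


rng = random.Random(0)
p = preference_probability(toy_f, 0.3, 0.9)
# The same probability written in the exponential-ratio (BTL) form.
btl = math.exp(toy_f(0.3)) / (math.exp(toy_f(0.3)) + math.exp(toy_f(0.9)))
# Empirical win rate of x = 0.3 over x' = 0.9 across repeated duels.
n_duels = 20000
wins = sum(comparison_oracle(toy_f, 0.3, 0.9, rng) for _ in range(n_duels))
win_rate = wins / n_duels
```

The sigmoid form and the exponential-ratio form agree up to floating-point error, and the empirical win rate approaches the model probability as the number of duels grows.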
3 High Confidence Set
Notations. The probability, denoted as $\mathbb{P}$, is taken over the randomness of the preference feedback generated by the comparison oracle and the randomness generated by the algorithm. Let the filtration $\mathcal{F}_t$ capture all the randomness up to step $t$. $\mathcal{N}(\mathcal{B}_f, \epsilon, \|\cdot\|_\infty)$ denotes the standard covering number (Zhou, 2002) of the function space ball $\mathcal{B}_f$ with the covering balls’ radius $\epsilon$ and the infinity norm $\|\cdot\|_\infty$. We will also use $[t]$ to denote the set $\{1, \dots, t\}$.
3.1 Likelihood-based Confidence Set
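We first introduce the function,
$$p_{\hat{f}}(x, x', \mathbf{1}) := \mathbf{1}\,\sigma\big(\hat{f}(x) - \hat{f}(x')\big) + (1 - \mathbf{1})\big(1 - \sigma\big(\hat{f}(x) - \hat{f}(x')\big)\big), \tag{6}$$
which is the likelihood of $\hat{f}$ over the event $\mathbf{1}_{x \succ x'} = \mathbf{1}$ under the Bernoulli preference model in Assumption 2.5.
We can then derive the likelihood function of a fixed function $\hat{f}$ over the historical preference dataset $\mathcal{D}_t$ (Footnote 3: note that $\mathbb{P}_{\hat{f}}(\mathcal{D}_t)$ is the likelihood function in $\hat{f}$ over the historical data $\mathcal{D}_t$, not the probability taken over the data/algorithm randomness),
$$\mathbb{P}_{\hat{f}}(\mathcal{D}_t) := \prod_{\tau=1}^{t} p_{\hat{f}}(x_\tau, x'_\tau, \mathbf{1}_\tau). \tag{7}$$
Taking log gives the log-likelihood function,
$$\ell_t(\hat{f}) := \log \mathbb{P}_{\hat{f}}(\mathcal{D}_t) = \sum_{\tau=1}^{t} \log p_{\hat{f}}(x_\tau, x'_\tau, \mathbf{1}_\tau) = \sum_{\tau=1}^{t} \Big(\mathbf{1}_\tau z_\tau - \log\big(1 + e^{z_\tau}\big)\Big), \tag{8}$$
where $z_\tau := \hat{f}(x_\tau) - \hat{f}(x'_\tau)$, $\mathbf{1}_\tau$ is the data realization of $\mathbf{1}_{x_\tau \succ x'_\tau}$, and the last equality can be checked to hold for either $\mathbf{1}_\tau = 0$ or $\mathbf{1}_\tau = 1$.
A common method for statistical estimation is maximizing the likelihood. Hence, we introduce the maximum likelihood estimator (MLE),
$$\hat{f}^{\mathrm{MLE}}_t \in \arg\max_{\tilde{f} \in \mathcal{B}_f} \ell_t(\tilde{f}). \tag{9}$$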
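As an illustration of the log-likelihood above, the sketch below evaluates it for a candidate function over a list of duels, assuming the logistic (BTL) preference model of Assumption 2.5. The simplified per-duel form `outcome * z - log(1 + exp(z))` with `z = f_hat(x) - f_hat(x_prime)` is algebraically equal to the log of the Bernoulli likelihood; the data and the candidate `f_hat` are hypothetical.

```python
import math


def log1pexp(z: float) -> float:
    """Stable evaluation of log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))


def log_likelihood(f_hat, duels) -> float:
    """Log-likelihood of f_hat over duels.

    duels: iterable of (x, x_prime, outcome), outcome = 1 if x was preferred.
    """
    total = 0.0
    for x, x_prime, outcome in duels:
        z = f_hat(x) - f_hat(x_prime)
        total += outcome * z - log1pexp(z)
    return total


def f_hat(x: float) -> float:
    """Hypothetical candidate function."""
    return math.sin(3.0 * x)


# Hypothetical preference data.
duels = [(0.1, 0.5, 1), (0.5, 0.2, 0), (0.9, 0.4, 1)]

ll = log_likelihood(f_hat, duels)

# Direct evaluation as the log of a product of Bernoulli likelihoods.
direct = 0.0
for x, x_prime, outcome in duels:
    p = 1.0 / (1.0 + math.exp(-(f_hat(x) - f_hat(x_prime))))
    direct += math.log(p if outcome == 1 else 1.0 - p)

# Shifting the candidate by a constant leaves the likelihood unchanged.
shifted = log_likelihood(lambda x: f_hat(x) + 5.0, duels)
```

The stable `log1pexp` helper is a design choice for this sketch: it avoids overflow when a candidate assigns a very large advantage to one point of a duel.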
With the maximum likelihood estimator introduced, the posterior high confidence set can be derived as shown in Thm. 3.1 using the maximum log-likelihood value.
Theorem 3.1 (Likelihood-based Confidence Set).
, let,
(10) |
where , with a constant independent of and . We have,
(11) |
Intuitively, the confidence set includes the functions with the log-likelihood value that is only ‘a little worse’ than the maximum likelihood estimator. It turns out that by correctly setting the ‘worse’ level , the confidence set contains the ground-truth function with high probability. This is reasonable because the preference data is generated with the ground-truth function, and thus the likelihood of the ground-truth function will not be too much lower than the maximum likelihood estimator.
Remark 3.2 (Choice of ).
In Thm. 3.1, also depends on a small positive value , which is to be chosen. In the theoretical analysis, it will be seen that can be selected to be , where is the algorithm’s running horizon.
Remark 3.3 (Likelihood Ratio Idea).
The confidence set contains the functions that satisfy,
(12) |
which is the likelihood ratio confidence set (Owen, 1990).
Remark 3.4.
Surrogate-based black-box optimization with kernel method is often referred to as Bayesian optimization due to its close relations to Bayesian Gaussian process model. Hence, we refer to our method as preferential BO.
Based on the confidence set in Thm. 3.1, we can derive the pointwise confidence range for the black-box function.
(13) |
Fig. 1 demonstrates the maximum likelihood estimate function and the confidence range with the ground truth function sampled from a Gaussian process, random comparison inputs, and set to be a constant . It can be seen that the maximum likelihood estimate approximates the ground truth better and better with the confidence range shrinking, as we have more and more comparison data.
3.2 Bound Duel-wise Error
Thm. 3.1 gives a high confidence set based on the likelihood function. However, it is not straightforward how the likelihood bounds lead to error bounds on the function value differences over a compared pair, which determine the preference distribution. The following lemma gives such a bound over the historical samples.
Lemma 3.5 (Elliptical Bound).
For any estimate that is measurable with respect to the filtration , we have, with probability at least , ,
(14) |
and
(15) |
where , with and the constants as defined in Appendix B.
Lem. 3.5 highlights that with high probability, all the functions in the confidence set have difference values over the historical sample points that lie in a ball with the ground-truth function difference value as the center and as the radius. Lem. 3.5 indicates that our likelihood-based learning scheme can gradually learn the function differences but not the absolute value . This is reasonable since shifting by a constant will not change the distribution of preference feedback.
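The shift-invariance argument can be spelled out in one line. The following worked check assumes the logistic link $\sigma(u) = 1/(1 + e^{-u})$ implied by the BTL form of Assumption 2.5, and uses $c$ for an arbitrary constant (a symbol introduced here only for illustration). For any pair of points $x, x'$,

$$\sigma\big((f(x) + c) - (f(x') + c)\big) = \sigma\big(f(x) - f(x')\big),$$

so every duel has the same preference probability under $f$ and under $f + c$, and the likelihood of any preference dataset is identical for the two. Comparison data can therefore identify function differences at best, which is consistent with the bounds above being stated for differences rather than absolute values.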
Furthermore, to derive an error bound over a new pair , we need to quantify the uncertainty of , where . Since by the definition of , it can be seen that , where
(16) |
Indeed, is the ball with radius in the RKHS equipped with the additive kernel function , which we term as the augmented RKHS here, and inner product . The readers are referred to (Christmann & Hable, 2012; Kandasamy et al., 2015) for more details of the additive kernel and the corresponding RKHS. To quantify the uncertainty of a new pair , we further introduce the function,
(17) | ||||
where , , , and is a positive regularization constant.
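To give a feel for this uncertainty measure, here is a hedged numerical sketch. It assumes (these are assumptions, not details confirmed by the text above) that the additive kernel on duels is the sum of the base kernel over the first and second points, and that the uncertainty function takes the usual regularized kernel form: prior value minus a quadratic form in the inverse of the regularized kernel matrix. The squared exponential base kernel, the function names, and the data are hypothetical.

```python
import numpy as np


def se_kernel(a, b, lengthscale=0.2):
    """Squared exponential kernel between rows of a (n, d) and b (m, d)."""
    a = np.atleast_2d(a)
    b = np.atleast_2d(b)
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)


def additive_kernel(W, V, kernel=se_kernel):
    """Assumed additive kernel on duels: k(x, y) + k(x', y').

    W, V: arrays of shape (n, 2, d) and (m, 2, d) holding duels (x, x').
    """
    return kernel(W[:, 0], V[:, 0]) + kernel(W[:, 1], V[:, 1])


def duel_uncertainty(W_hist, W_new, lam=1.0, kernel=se_kernel):
    """Assumed uncertainty of new duels given historical duels.

    lam is the positive regularization constant.
    """
    K = additive_kernel(W_hist, W_hist, kernel)
    k_new = additive_kernel(W_hist, W_new, kernel)  # shape (n, m)
    prior = np.diag(additive_kernel(W_new, W_new, kernel))
    solved = np.linalg.solve(K + lam * np.eye(len(W_hist)), k_new)
    var = prior - np.sum(k_new * solved, axis=0)
    return np.sqrt(np.maximum(var, 0.0))


# Two historical duels in one dimension.
W_hist = np.array([[[0.1], [0.5]], [[0.5], [0.9]]])
# One already-queried duel and one far from all historical data.
W_new = np.array([[[0.1], [0.5]], [[3.0], [4.0]]])
sig = duel_uncertainty(W_hist, W_new, lam=1.0)
```

Under these assumptions, the duel that has already been queried has a smaller uncertainty value than the duel far from all historical data, whose value stays close to its prior level.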
Theorem 3.6 (Duel-wise Error Bound).
For any estimate measurable with respect to , we have, with probability at least , ,
(18) |
Remark 3.7.
In preferential BO, we do not get the scalar value of . Hence, can not be interpreted as the posterior standard deviation as in (Srinivas et al., 2012). However, it turns out that , as a measure of uncertainty, still accounts for a factor of the duel-wise error.
To characterize the complexity of this augmented RKHS, we use the maximum information gain (Srinivas et al., 2012),
(19) |
where .
4 Algorithm
4.1 Principled Optimistic Algorithm
We are now ready to give the optimistic algorithm in Alg. 1.
The key to Alg. 1 is line 4. The idea is to maximize the optimistic advantage of as compared to with the uncertainty of the black-box function .
In line 3, we set the reference point as the last generated point . In practice, this may correspond to two possible scenarios. In the first, each comparison requires one experiment, such as image quality comparison. In this case, we only need to set one of the compared pair as the last newly generated solution. While in the other scenario, comparing and needs separate experiments for and . For example, when optimizing the building thermal comfort, the occupants need to experience both thermal conditions to report preference. If at step , the oracle still has memory about the experience with input , we can directly compare and . In this case, setting to be saves the experimental expense with .
For online applications, cumulative regret is more of our interest. However, for an offline optimization setting, it may be of more interest to identify one near-optimal solution to report. Unlike in the scalar evaluation setting, where we can directly use the scalar value to report the best observed solution, we can not directly identify the best sampled solution in the preferential Bayesian optimization scenario. To address this issue, we report the solution , where
(20) |
The idea is that although the best sample may not be known, we can derive a solution by minimizing the known term to find a solution to report. Indeed, this term upper bounds the uncertainty of the optimistic advantage (as shown in Thm. 3.6). Hence, the smaller it is, the more certain that is close to the ground-truth optimal value. At step , we can report the current estimated solution with index satisfying a similar formula to Eq. (20).
4.2 Efficient Computations
Line 4 in Alg. 1 requires solving a nested optimization problem with inner variables in an infinite-dimensional function space. The update of the maximum likelihood estimator also requires solving an optimization problem with an infinite-dimensional function as the decision variable. These are in general not tractable in their current forms. Fortunately, we can reduce the infinite-dimensional problems to finite-dimensional ones, thanks to the structures of the problem and the representer theorem (Schölkopf et al., 2001).
Maximum likelihood estimation. Since the log-likelihood function
(21) | ||||
only depends on the function value , we only need to optimize over subject to that they are functions in with norm less or equal to . Furthermore, Alg. 1 sets and thus . So we can reduce the optimization variables to only . Hence, Eq. (21) is reduced to the following log-likelihood function that only depends on ,
(22) | ||||
where , , and .
By the representer theorem (Schölkopf et al., 2001), the maximum likelihood estimation problem can be solved via,
(23) | ||||
subject to |
where . The constraint restricts that the function values need to come from a function inside the function space ball , where the left-hand side is indeed the minimum norm square of the possible interpolant through as shown in (Wendland, 2004). It can be checked that the maximization problem in Eq. (23) has a concave objective (as shown in Appendix A) with a convex feasible set. Thus, the problem in Eq. (23) is tractable via convex optimization.
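The finite-dimensional MLE problem can be illustrated with a small numerical sketch. This is not the CasADi/Ipopt formulation used in the experiments. It swaps in a plain projected gradient ascent, assuming the norm constraint has the quadratic form `Z @ inv(K) @ Z <= B**2` described above. With the Cholesky factor `L` of `K` and the substitution `Z = L @ w`, the constraint becomes the Euclidean ball `||w|| <= B`, whose projection is a simple rescaling. The kernel, data, step size, and jitter are hypothetical choices.

```python
import numpy as np


def se_kernel(a, b, lengthscale=0.2):
    """Squared exponential kernel for 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def reduced_log_likelihood(Z, duels):
    """Log-likelihood as a function of function values Z at sampled points.

    duels: list of (i, j, outcome); outcome = 1 if point i beat point j.
    """
    total = 0.0
    for i, j, o in duels:
        z = Z[i] - Z[j]
        total += o * z - np.logaddexp(0.0, z)
    return total


def mle_projected_gradient(K, duels, B, steps=2000, lr=0.05, jitter=1e-8):
    """Maximize the concave log-likelihood over {Z : Z^T K^{-1} Z <= B^2}."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + jitter * np.eye(n))  # jitter for stability
    w = np.zeros(n)
    for _ in range(steps):
        Z = L @ w
        grad_Z = np.zeros(n)
        for i, j, o in duels:
            g = o - 1.0 / (1.0 + np.exp(-(Z[i] - Z[j])))
            grad_Z[i] += g
            grad_Z[j] -= g
        w = w + lr * (L.T @ grad_Z)  # ascent step in the w coordinates
        norm = np.linalg.norm(w)
        if norm > B:  # project back onto the ball of radius B
            w *= B / norm
    return L @ w


X = np.array([0.1, 0.4, 0.7, 0.9])  # hypothetical sampled points
duels = [(1, 0, 1), (2, 1, 1), (3, 2, 0), (2, 0, 1)]  # hypothetical outcomes
B = 2.0
K = se_kernel(X, X)
Z_mle = mle_projected_gradient(K, duels, B)
rkhs_norm_sq = Z_mle @ np.linalg.solve(K + 1e-8 * np.eye(4), Z_mle)
```

Because the objective is concave and the feasible set is convex, any local maximizer found this way is global; a dedicated convex or nonlinear programming solver would typically converge in far fewer iterations than this first-order sketch.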
Generating new sample point. On the line 4 of Alg. 1, a bi-level optimization problem needs to be solved, where the inner-level part has an infinite-dimensional function variable. The inner optimization problem has the form,
(24) | ||||
subject to | ||||
where is as given in Thm. 3.1. Similar to the representer theorem, we have,
Lemma 4.1.
Similarly, it can be checked that the Prob. (25) is convex.
For low-dimensional , the outer-level problem can be solved via grid search. For medium-dimensional problems, we can optimize the inner/outer variables using a gradient-based/zero-order optimization method. Alternatively, we can jointly optimize , and by a nonlinear programming solver from multiple random initial conditions. That is, we add as another optimization variable as shown in the Prob. (27),
(27) | ||||
subject to |
More details on this joint optimization approach are given in Appendix H.
Remark 4.2.
We add a small positive multiple of the identity matrix to the kernel matrices before inversion to avoid numerical issues.
Remark 4.3.
In this paper, we mainly consider the setting where in each step, the preference is queried over two candidate points. Our Alg. 1 and the efficient computation schemes in this section can be easily extended to multiple-choice setting, where in each step, the best or most preferred point is queried over a batch of candidates. The detailed discussion is in Appendix I.
5 Theoretical Analysis
We first introduce the performance metrics to use. As in the standard Bayesian optimization setting (Srinivas et al., 2012), cumulative regret is used as defined in Eq. (28),
(28) |
where .
Remark 5.1.
Cumulative regret is of interest in the online setting. In the offline optimization setting, it is of more interest to analyze the sub-optimality of the final reported solution, i.e.,
(29) |
where is the final reported solution as defined in Eq. (20).
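The two metrics can be computed as follows in a simulation where the ground-truth function is known. This is a minimal sketch that assumes, following the description of total cumulative regret as covering both points of each compared duel, that the per-step regret is the sum of the suboptimality gaps of the two duel points. The objective, its optimal value, and the duel sequence are hypothetical.

```python
def total_cumulative_regret(f, duels, f_opt):
    """Running total regret over both points of each duel (assumed form).

    f: ground-truth function, available only in simulation.
    duels: list of (x, x_prime) pairs queried at steps 1..T.
    f_opt: optimal value of f over the domain.
    Returns the list of cumulative regrets after each step.
    """
    regrets, running = [], 0.0
    for x, x_prime in duels:
        running += (f_opt - f(x)) + (f_opt - f(x_prime))
        regrets.append(running)
    return regrets


def suboptimality(f, x_reported, f_opt):
    """Gap between the optimal value and the value at the reported solution."""
    return f_opt - f(x_reported)


def toy_f(x):
    """Hypothetical objective with maximizer 0.3 and optimal value 0."""
    return -((x - 0.3) ** 2)


duels = [(0.0, 1.0), (0.2, 0.0), (0.3, 0.2)]
regret_curve = total_cumulative_regret(toy_f, duels, f_opt=0.0)
gap = suboptimality(toy_f, 0.2, f_opt=0.0)
```

Sublinear growth of `regret_curve` in the number of steps is what the regret bounds in this section describe, while `gap` corresponds to the offline metric for a reported solution.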
5.1 Regret Bound and Convergence Rate
Theorem 5.2 (Cumulative Regret Bound).
Remark 5.3 (Differentiate from GP-UCB regret).
Our bound has a similar form as compared to the well-known regret bound for standard GP-UCB type algorithms (Srinivas et al., 2012; Chowdhury & Gopalan, 2017a). However, the term here is significantly different from that in the existing literature (e.g., in Thm. 3 in (Srinivas et al., 2012)). It is derived specifically for the preferential BO and will lead to a bit larger bound for specific kernels in Sec. 5.2.
We highlight that Thm. 5.2 provides the first-of-its-kind information-theoretic bound on the cumulative regret of preferential BO, which further allows us to derive a convergence rate for the reported solution in Thm. 5.4.
Theorem 5.4 (Convergence Guarantee).
Let be defined as in Eq. (20). With probability at least ,
(31) |
Thm 5.4 highlights that by minimizing the known term , the reported final solution has a guaranteed convergence rate.
5.2 Kernel-Specific Bounds and Rates
In this section, we show kernel-specific bounds for the regret and convergence rate for the reported solution. The explicit forms of the considered kernels are given in Appendix L.
Theorem 5.5 (Kernel-Specific Regret Bounds).
Setting and running our POP-BO algorithm in Alg. 1,
-
1.
If , we have,
(32) -
2.
If is a squared exponential kernel, we have,
(33) -
3.
If is a Matérn kernel, we have,
(34) where is the smooth parameter of the Matérn kernel that is assumed to be large enough such that .
Remark 5.6 (Comparison to GP-UCB with Scalar Feedback).
Interestingly, as compared to the kernel-specific bounds in the scalar evaluation-based optimization (Fig. 1 in (Srinivas et al., 2012)), the regret bound of preferential Bayesian optimization approximately has an additional factor of . This is reasonable since intuitively, scalar evaluation can imply preference, but not vice versa. Therefore, preference feedback contains less information and thus may suffer from higher regret. Fig. 2 in Sec. 6.1 and Fig. 4 in Appendix N empirically verify our bounds here.
6 Experimental Results
In this section, we compare our method to the state-of-the-art preferential BO methods on sampled instances from a Gaussian process, standard test functions, and a thermal comfort optimization problem. The comparison outcome is sampled as assumed in Assump. 2.5. We implement our algorithm based on the Gaussian process package GPy (GPy, since 2012). The optimization problems for MLE and generating new samples are formulated and solved using CasADi (Andersson et al., 2019) and Ipopt (Wächter & Biegler, 2006). We compare our method to three baseline methods: dueling Thompson sampling (DTS) (González et al., 2017), skew-GP based preferential BO (SGP) (Takeno et al., 2023), and qEUBO (Astudillo et al., 2023). The dueling Thompson sampling method (González et al., 2017) derives the next pair to compare by maximizing the soft-Copeland’s score. The skew-GP based method (Takeno et al., 2023) applies standard BO algorithms conditioned on the Thompson sampling results on the historical sample points that are consistent with the historical preference feedback. The qEUBO method (Astudillo et al., 2023) uses the expected utility of the best option as an acquisition function. More experimental details and the results on thermal comfort optimization are given in Appendix P.
6.1 Sampled Instances from Gaussian Process
In this section, we sample the black-box function from a Gaussian process with the squared exponential kernel as shown in Appendix L where the variance parameter is and the lengthscale is . We sampled instances in total.
Fig. 2 shows the performance comparisons with baselines. Our method achieves the lowest sublinear growth in cumulative regret. It also achieves better/competitive convergence speed for the reported solution as compared to the DTS method, while outperforming the SGP.
However, our method only uses less than of the computation time as compared to the DTS as shown in Tab. 1. The SGP method gets stuck in a local optimum because it overly trusts the random preference feedback (it is imposed as a hard constraint when doing Thompson sampling). Although the qEUBO method performs slightly better in the reported solution, it suffers from more than times the cumulative regret as compared to ours. Similar to qEUBO (which reports the posterior mean maximizer), we can report the maximizer of the minimum-norm (POP-BO max-MLE in Fig. 2) instead of in Eq. (20), and achieve faster convergence than qEUBO.
DTS | qEUBO | SGP | POP-BO (ours) |
---|---|---|---|
6.2 Test Function Optimization
In this section, we evaluate our method on several well-known global optimization test functions (Dixon, 1978; Molga & Smutnicki, 2005), each of which is divided by the standard deviation of its samples over a grid. We run our method multiple times from different random initial points. Tab. 2 shows that POP-BO consistently finds better or comparable solutions as compared to the other baselines.
Problem | DTS | qEUBO | SGP | POP-BO (ours) |
---|---|---|---|---|
Beale | ||||
Branin | ||||
Bukin | ||||
Cross-in-Tray | ||||
Eggholder | ||||
Holder Table | ||||
Levy13 |
6.3 Scalability to Higher Dimension
To demonstrate the computational scalability of our joint optimization approach (as shown in Prob. (27)), we consider a set of higher dimensional problems. Due to space limitations, we show the results for the optimization of a 12-dimensional black-box function sampled from a Gaussian process with a squared exponential kernel function. More results can be found in Appendix P.1 and Appendix P.2.2. The optimization domain is set to be . We run randomly sampled instances for steps. The average update time per step is only seconds on a personal computer with one Intel64 Family 6 Model 142 Stepping 12 GenuineIntel 1803 MHz processor and 16.0 GB RAM. This is comparatively small considering that each query to the comparison oracle can be very expensive in practice (e.g., heating the room up to a certain temperature to evaluate occupant comfort, which may take tens of minutes). We compare our method to the SGP baseline, which is one of the state-of-the-art computationally practical preferential Bayesian optimization methods. Fig. 3 shows the cumulative regret (in log scale) and the suboptimality of the reported solution for the problem. It can be seen that our algorithm still achieves sublinear regret growth and good convergence for the suboptimality of the reported solution within steps in this 12-dimensional problem. Fig. 3 also shows that POP-BO converges faster in higher dimensional problems and thus scales better than the SGP method.
7 Conclusion and Future Work
In this paper, we have presented a principled optimistic preferential BO algorithm, based on the likelihood-based confidence set. An efficient computational method is developed to implement the algorithm. We further show an information-theoretic bound on the cumulative regret, a first-of-its-kind for preferential BO. We also design a scheme to report an estimated optimal solution, with a guaranteed convergence rate. Experimental results show that our method achieves better or competitive performance as compared to the state-of-the-art heuristics, which, however, do not have theoretical guarantees on regret. Future work includes extensions to safety-critical problems (Berkenkamp et al., 2016; Guo et al., 2023) and game-theoretic settings. The likelihood-based confidence set and the error bound in Sec. 3 can also be applied to more scenarios with preference feedback.
Acknowledgements
This research was supported by the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40_180545, the Swiss Federal Office of Energy SFOE as part of the SWEET consortium SWICE, and in part by the Swiss Data Science Center, grant agreement C20-13.
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References
- Abdelrahman & Miller (2022) Abdelrahman, M. M. and Miller, C. Targeting occupant feedback using digital twins: Adaptive spatial–temporal thermal preference sampling to optimize personal comfort models. Building and Environment, 218:109090, 2022.
- Andersson et al. (2019) Andersson, J. A., Gillis, J., Horn, G., Rawlings, J. B., and Diehl, M. CasADi: a software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, 11(1):1–36, 2019.
- Astudillo et al. (2023) Astudillo, R., Lin, Z. J., Bakshy, E., and Frazier, P. qEUBO: A decision-theoretic acquisition function for preferential Bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pp. 1093–1114. PMLR, 2023.
- Berkenkamp et al. (2016) Berkenkamp, F., Schoellig, A. P., and Krause, A. Safe controller optimization for quadrotors with Gaussian processes. In 2016 IEEE international conference on robotics and automation (ICRA), pp. 491–496. IEEE, 2016.
- Bradley & Terry (1952) Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- Bull (2011) Bull, A. D. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(10), 2011.
- Chowdhury & Gopalan (2017a) Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In International Conference on Machine Learning, pp. 844–853. PMLR, 2017a.
- Chowdhury & Gopalan (2017b) Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. arXiv preprint arXiv:1704.00445, 2017b.
- Christiano et al. (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
- Christmann & Hable (2012) Christmann, A. and Hable, R. Consistency of support vector machines using additive kernels for additive models. Computational Statistics & Data Analysis, 56(4):854–873, 2012.
- Curi et al. (2020) Curi, S., Berkenkamp, F., and Krause, A. Efficient model-based reinforcement learning through optimistic policy search and planning. Advances in Neural Information Processing Systems, 33:14156–14170, 2020.
- Dixon (1978) Dixon, L. C. W. The global optimization problem: an introduction. Towards Global Optimization 2, pp. 1–15, 1978.
- Dudík et al. (2015) Dudík, M., Hofmann, K., Schapire, R. E., Slivkins, A., and Zoghi, M. Contextual dueling bandits. In Conference on Learning Theory, pp. 563–587. PMLR, 2015.
- Edmunds & Triebel (1996) Edmunds, D. E. and Triebel, H. Function spaces, entropy numbers, differential operators, volume 120. Cambridge Univ Pr, 1996.
- Emmenegger et al. (2023) Emmenegger, N., Mutny, M., and Krause, A. Likelihood ratio confidence sets for sequential decision making. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Fanger et al. (1970) Fanger, P. O. et al. Thermal comfort: Analysis and applications in environmental engineering. 1970.
- Frazier (2018) Frazier, P. I. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
- Gajane et al. (2015) Gajane, P., Urvoy, T., and Clérot, F. A relative exponential weighing algorithm for adversarial utility-based dueling bandits. In International Conference on Machine Learning, pp. 218–227. PMLR, 2015.
- González et al. (2017) González, J., Dai, Z., Damianou, A., and Lawrence, N. D. Preferential Bayesian optimization. In International Conference on Machine Learning, pp. 1282–1291. PMLR, 2017.
- GPy (since 2012) GPy. GPy: A Gaussian process framework in python. http://github.com/SheffieldML/GPy, since 2012.
- Griffith et al. (2013) Griffith, S., Subramanian, K., Scholz, J., Isbell, C. L., and Thomaz, A. L. Policy shaping: Integrating human feedback with reinforcement learning. Advances in Neural Information Processing Systems, 26, 2013.
- Guo et al. (2023) Guo, B., Jiang, Y., Kamgarpour, M., and Ferrari-Trecate, G. Safe zeroth-order convex optimization using quadratic local approximations. In 2023 European Control Conference (ECC), pp. 1–8. IEEE, 2023.
- Hiranaka et al. (2023) Hiranaka, A., Hwang, M., Lee, S., Wang, C., Fei-Fei, L., Wu, J., and Zhang, R. Primitive skill-based robot learning from human evaluative feedback. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7817–7824. IEEE, 2023.
- Jones et al. (1998) Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
- Kahneman & Tversky (2013) Kahneman, D. and Tversky, A. Prospect theory: An analysis of decision under risk. In Handbook of the Fundamentals of Financial Decision Making: Part I, pp. 99–127. World Scientific, 2013.
- Kandasamy et al. (2015) Kandasamy, K., Schneider, J., and Póczos, B. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pp. 295–304. PMLR, 2015.
- Koyama et al. (2020) Koyama, Y., Sato, I., and Goto, M. Sequential gallery for interactive visual design optimization. ACM Transactions on Graphics (TOG), 39(4):88–1, 2020.
- Lalley (2013) Lalley, S. P. Concentration inequalities. Lecture notes, University of Chicago, 2013.
- Li et al. (2021) Li, K., Tucker, M., Bıyık, E., Novoseller, E., Burdick, J. W., Sui, Y., Sadigh, D., Yue, Y., and Ames, A. D. ROIAL: Region of interest active learning for characterizing exoskeleton gait preference landscapes. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 3212–3218. IEEE, 2021.
- Lichtenstein & Slovic (1971) Lichtenstein, S. and Slovic, P. Reversals of preference between bids and choices in gambling decisions. Journal of experimental psychology, 89(1):46, 1971.
- Liu et al. (2023) Liu, Q., Netrapalli, P., Szepesvari, C., and Jin, C. Optimistic MLE: A generic model-based algorithm for partially observable sequential decision making. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pp. 363–376, 2023.
- Lyu et al. (2023) Lyu, J., Shi, Y., Du, H., and Lian, Z. Sex-based thermal comfort zones and energy savings in spaces with joint operation of air conditioner and fan. Building and Environment, 246:111002, 2023. ISSN 0360-1323. doi: 10.1016/j.buildenv.2023.111002. URL https://www.sciencedirect.com/science/article/pii/S0360132323010296.
- Maddalena et al. (2021) Maddalena, E. T., Scharnhorst, P., and Jones, C. N. Deterministic error bounds for kernel-based learning techniques under bounded noise. Automatica, 134:109896, 2021.
- Mehta et al. (2023) Mehta, V., Neopane, O., Das, V., Lin, S., Schneider, J., and Neiswanger, W. Kernelized offline contextual dueling bandits. arXiv preprint arXiv:2307.11288, 2023.
- Mikkola et al. (2020) Mikkola, P., Todorović, M., Järvi, J., Rinke, P., and Kaski, S. Projective preferential Bayesian optimization. In International Conference on Machine Learning, pp. 6884–6892. PMLR, 2020.
- Molga & Smutnicki (2005) Molga, M. and Smutnicki, C. Test functions for optimization needs. Test functions for optimization needs, 101:48, 2005.
- Negoescu et al. (2011) Negoescu, D. M., Frazier, P. I., and Powell, W. B. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346–363, 2011.
- Newey & McFadden (1994) Newey, W. K. and McFadden, D. Large sample estimation and hypothesis testing. Handbook of econometrics, 4:2111–2245, 1994.
- Osband & Van Roy (2014) Osband, I. and Van Roy, B. Model-based reinforcement learning and the Eluder dimension. Advances in Neural Information Processing Systems, 27, 2014.
- Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Owen (1990) Owen, A. Empirical likelihood ratio confidence regions. The Annals of Statistics, 18(1):90–120, 1990.
- Pacchiano et al. (2021) Pacchiano, A., Ball, P., Parker-Holder, J., Choromanski, K., and Roberts, S. Towards tractable optimism in model-based reinforcement learning. In Uncertainty in Artificial Intelligence, pp. 1413–1423. PMLR, 2021.
- Saha & Krishnamurthy (2022) Saha, A. and Krishnamurthy, A. Efficient and optimal algorithms for contextual dueling bandits under realizability. In International Conference on Algorithmic Learning Theory, pp. 968–994. PMLR, 2022.
- Saha et al. (2021) Saha, A., Koren, T., and Mansour, Y. Dueling convex optimization. In International Conference on Machine Learning, pp. 9245–9254. PMLR, 2021.
- Schölkopf et al. (2001) Schölkopf, B., Herbrich, R., and Smola, A. J. A generalized representer theorem. In International Conference on Computational Learning Theory, pp. 416–426. Springer, 2001.
- Shahriari et al. (2015) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
- Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012.
- Srinivas et al. (2012) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. W. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
- Sui et al. (2017) Sui, Y., Zhuang, V., Burdick, J. W., and Yue, Y. Multi-dueling bandits with dependent arms. arXiv preprint arXiv:1705.00253, 2017.
- Sui et al. (2018) Sui, Y., Burdick, J., Yue, Y., et al. Stage-wise safe Bayesian optimization with Gaussian processes. In International Conference on Machine Learning, pp. 4781–4789, 2018.
- Takeno et al. (2023) Takeno, S., Nomura, M., and Karasuyama, M. Towards practical preferential Bayesian optimization with skew Gaussian processes. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 33516–33533, 2023.
- Tversky & Kahneman (1974) Tversky, A. and Kahneman, D. Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty. Science, 185(4157):1124–1131, 1974.
- Wächter & Biegler (2006) Wächter, A. and Biegler, L. T. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.
- Wang et al. (2023) Wang, Y., Liu, Q., and Jin, C. Is RLHF more difficult than standard RL? A theoretical perspective. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Warnell et al. (2018) Warnell, G., Waytowich, N., Lawhern, V., and Stone, P. Deep TAMER: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- Wendland (2004) Wendland, H. Scattered data approximation, volume 17. Cambridge university press, 2004.
- Wu et al. (2022) Wu, C., Li, T., Zhang, Z., and Yu, Y. Bayesian optimistic optimization: Optimistic exploration for model-based reinforcement learning. Advances in Neural Information Processing Systems, 35:14210–14223, 2022.
- Wu (2017) Wu, Y. Lecture notes on information-theoretic methods for high-dimensional statistics. Lecture Notes for ECE598YW (UIUC), 16, 2017.
- Xu et al. (2022a) Xu, W., Jiang, Y., Maddalena, E. T., and Jones, C. N. Lower bounds on the worst-case complexity of efficient global optimization. arXiv preprint arXiv:2209.09655, 2022a.
- Xu et al. (2022b) Xu, W., Jones, C. N., Svetozarevic, B., Laughman, C. R., and Chakrabarty, A. VABO: Violation-aware Bayesian optimization for closed-loop control performance optimization with unmodeled constraints. In 2022 American Control Conference (ACC), pp. 5288–5293. IEEE, 2022b.
- Xu et al. (2023) Xu, W., Jiang, Y., Svetozarevic, B., and Jones, C. Constrained efficient global optimization of expensive black-box functions. In International Conference on Machine Learning, pp. 38485–38498. PMLR, 2023.
- Yue & Joachims (2009) Yue, Y. and Joachims, T. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1201–1208, 2009.
- Yue et al. (2012) Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
- Zhang et al. (2024) Zhang, H., Lee, S., and Tzempelikos, A. Bayesian meta-learning for personalized thermal comfort modeling. Building and Environment, 249:111129, 2024. ISSN 0360-1323. doi: 10.1016/j.buildenv.2023.111129. URL https://linkinghub.elsevier.com/retrieve/pii/S0360132323011563.
- Zhou (2002) Zhou, D.-X. The covering number in learning theory. Journal of Complexity, 18(3):739–767, 2002.
- Zhou & Ji (2022) Zhou, X. and Ji, B. On kernelized multi-armed bandits with constraints. Advances in Neural Information Processing Systems, 35, 2022.
- Zhu et al. (2023) Zhu, B., Jordan, M., and Jiao, J. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 43037–43067, 23–29 Jul 2023.
Unless otherwise stated, all results in this appendix hold under Assumptions 2.1, 2.2, 2.4, and 2.5.
Appendix A Preliminaries
To prepare for the proofs of the main results shown in this paper, we first state several useful lemmas.
Lemma A.1.
The function is convex in .
Proof.
We calculate the Hessian of the function and derive
(37) |
Hence, is convex.
∎
Therefore, we can see is concave in .
Lemma A.2.
Appendix B Properties of the function
When applying the function to the difference of objective function , we have the calculations by single variable calculus,
where and . We also introduce some constants , and , which will be used in the proof.
Appendix C Proof of Thm. 3.1
To prepare for the proof of the theorem, we first prove several lemmas.
Lemma C.1.
For any fixed , we have,
(38) |
where is the ground-truth function.
Proof.
We use ( resp.) to denote ( resp.). We use ( resp.) to denote ( resp.). And we use to denote .
where , and the probability is taken with respect to the randomness from the comparison oracle and the randomness from the algorithm.
It can be checked that is a convex function and . This implies that achieves the minimum for the convex function . Therefore,
Rearrangement gives,
Hence, . Therefore,
We further notice that
(39) |
and with probability one,
(40) |
We can thus apply the Azuma–Hoeffding inequality (see, e.g., Lalley, 2013), which gives
Set . That is, . We then get the desired result. ∎
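For reference, the one-sided Azuma–Hoeffding inequality invoked in this proof can be stated in its standard textbook form as follows. The symbols $D_\tau$, $c_\tau$, $\varepsilon$, and $\delta$ are generic placeholders chosen for this note and are not necessarily the notation of the proof above.

```latex
% Standard one-sided Azuma--Hoeffding inequality (generic notation).
% (D_\tau) is a martingale difference sequence with |D_\tau| \le c_\tau almost surely.
\[
  \mathbb{P}\!\left( \sum_{\tau=1}^{t} D_\tau \ge \varepsilon \right)
  \le \exp\!\left( - \frac{\varepsilon^{2}}{2 \sum_{\tau=1}^{t} c_\tau^{2}} \right),
  \qquad \varepsilon > 0 .
\]
% Setting the right-hand side equal to \delta and solving for \varepsilon gives
\[
  \varepsilon = \sqrt{ 2 \log\!\left(\tfrac{1}{\delta}\right) \sum_{\tau=1}^{t} c_\tau^{2} } ,
\]
% so the sum stays below this \varepsilon with probability at least 1 - \delta.
```

The last step, choosing the deviation so that the tail probability equals the target failure level, is the same kind of substitution as the "Set" step that closes the proof.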
We then have the following high probability confidence set lemma.
Lemma C.2.
For any fixed that is independent of , we have, with probability at least ,
(41) |
Proof.
We use to denote the event . We pick and have,
∎
We next bound the difference in log-likelihood between two functions that are close in the infinity norm.
Lemma C.3.
There exists an independent constant , such that, , that satisfies , we have,
(42) |
Proof.
We use (, resp.) to denote (, resp.), .
(43) | ||||
(44) | ||||
(45) | ||||
(46) |
where the equality (43) follows by the definition of log-likelihood function, and the inequality (44) follows by the assumption and the mean-value theorem. The conclusion follows by setting .
∎
Main proof: We use to denote the covering number of the set , with being an -covering of the set . Resetting the ‘’ in Lem. C.2 to and applying the union bound, we have, with probability at least , ,
(47) |
By the definition of -covering, there exists , such that,
(48) |
Hence, with probability at least ,
where the inequality follows by Lem. C.3 and the inequality (47).
Appendix D Proof of Lem. 3.5
We first have a lemma.
Lemma D.1.
We have,
(49) |
where .
Proof.
Let We have,
Since , we have . Hence, and . Therefore, achieves the maximum over at the point . So . Rearrangement then gives the desired result. ∎
For any fixed function , we use the notations and . We have,
Hence,
Rearrangement gives,
(50) |
We then have the following lemma,
Lemma D.2.
For any fixed and , we have, with probability at least ,
(51) |
Proof.
Since and with probability one,
(52) | ||||
(53) | ||||
(54) |
where the last inequality follows from . We can thus apply the Azuma–Hoeffding inequality, which gives
(55) |
We set , and derive
(56) |
Combining the inequality (50) and the inequality (56), the desired result is derived.
∎
Lemma D.3.
For any fixed , we have, with probability at least ,
(57) |
Proof.
We use (with abuse of notation; it is only a local notation in this proof) to denote the event and pick . We have,
∎
Main Proof: Resetting the ‘’ in Lem. D.3 to be , we can guarantee that Eq. (57) holds for all the functions in an -covering of with probability at least , by applying the union bound.
For any , there exists a function in the -covering of , which we set to be , such that . We also use to denote . Thus, we have,
(58) | ||||
(59) | ||||
(60) | ||||
(61) | ||||
(62) | ||||
(63) | ||||
(64) | ||||
(65) |
where and . The inequality (59) follows by the fact that . The inequality (61) follows because . The inequality (62) follows by Lem. D.3 (with reset of ‘’). The inequality (63) follows by that
Furthermore,
(66) | ||||
(67) |
where the inequality follows by mean value theorem. The conclusion then follows.
Appendix E Proof of Thm. 3.6
Before we proceed to prove Thm. 3.6, we first conduct a black-box analysis in Sec. E.1 to bound the pointwise error for a generic RKHS with a generic learning scheme, which we think can be of independent interest.
E.1 Black-box analysis on the pointwise inference error in a generic RKHS
Suppose we have a generic RKHS with a generic positive semidefinite kernel function . After obtaining some information (preference information in this paper) on a sequence , a learning scheme outputs a learnt uncertainty set,
(68) |
where is a function space ball with radius in , is the ground truth function and quantifies the size of this confidence set. Let denote the function input set, which is assumed to be compact. We introduce the function,
(69) |
where is a positive constant and . We have the following theorem.
Theorem E.1.
, we have,
(70) |
Proof.
For simplicity, we use to denote the function , where maps a finite-dimensional point to the RKHS . We also use to denote the inner product of two functions from the RKHS . Therefore, and , . We can introduce the feature map
we then get the kernel matrix , for all and .
Note that when the Hilbert space is finite-dimensional, is interpreted as the normal finite-dimensional matrix. In the more general setting where can be an infinite-dimensional space, is the evaluation operator defined as , with as its adjoint operator. For the simplicity of notation, we abuse the notation to denote the identity mapping in both the RKHS and . The specific meaning of depends on the context.
Since the matrices/operators and are strictly positive definite and
we have
(71) |
Also from the definitions above , and thus . Hence, from Eq. (71) we deduce that
(72) |
which gives
(73) |
by multiplying both sides of Eq. (72) with . This implies
(74) |
where the second equality follows by the definition of . Now observe that ,
(75) | ||||
(76) | ||||
(77) | ||||
(78) | ||||
(79) | ||||
(80) | ||||
(81) | ||||
(82) | ||||
(83) |
where the equality (77) uses Eq. (71), the inequality (80) is by Cauchy–Schwarz, the inequality (82) follows from the assumption that and that is positive semidefinite, and the equality (83) is from Eq. (74). We define ,
(84) | ||||
(85) | ||||
(86) | ||||
(87) | ||||
(88) | ||||
(89) | ||||
(90) | ||||
(91) | ||||
(92) |
where the equality (86) is from Eq. (71), the inequality (87) is by Cauchy–Schwarz, and the equality (89) uses both Eq. (71) and Eq. (74). We can finally derive,
(93) | ||||
(94) | ||||
(95) | ||||
(96) |
where the equality (94) follows by splitting, the inequality (95) follows by triangle inequality, the last inequality follows by combining the inequality (83) and the inequality (92). The conclusion then follows. ∎
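Proofs of this type typically rest on two standard linear-algebra facts: the push-through identity $(\Phi^\top\Phi+\lambda I)^{-1}\Phi^\top = \Phi^\top(\Phi\Phi^\top+\lambda I)^{-1}$ and the resulting equality between the kernel form and the feature form of the posterior-variance-like quantity. The sketch below checks both numerically in a finite-dimensional feature space. That these are exactly the relations labelled Eq. (71) and Eq. (74) is an assumption, and the names `Phi`, `phi_x`, and `lam` are illustrative only.

```python
import numpy as np


def push_through_gap(Phi, lam):
    """Max abs gap between (Phi^T Phi + lam I)^{-1} Phi^T and Phi^T (Phi Phi^T + lam I)^{-1}."""
    t, d = Phi.shape
    lhs = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T)
    rhs = Phi.T @ np.linalg.inv(Phi @ Phi.T + lam * np.eye(t))
    return float(np.max(np.abs(lhs - rhs)))


def variance_gap(Phi, phi_x, lam):
    """Abs gap between the kernel form and the feature form of the variance term."""
    t, d = Phi.shape
    k_xx = phi_x @ phi_x          # k(x, x) for a linear feature map
    k_t = Phi @ phi_x             # vector of k(x_tau, x)
    kernel_form = k_xx - k_t @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(t), k_t)
    feature_form = lam * (phi_x @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), phi_x))
    return float(abs(kernel_form - feature_form))


# Random toy data: 5 observed points with 3-dimensional features (hypothetical sizes).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))
phi_x = rng.normal(size=3)
print(push_through_gap(Phi, 0.1), variance_gap(Phi, phi_x, 0.1))
```

Both gaps should be at the level of floating-point round-off, which is what allows the proof to move freely between the operator form and the kernel-matrix form.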
Remark E.2.
The proof idea is inspired by the proof of Thm. 2 in (Chowdhury & Gopalan, 2017b).
E.2 Main proof of Thm. 3.6
We set the generic RKHS to be the augmented RKHS with the additive kernel function , the function space ball to be , and the confidence set as,
The desired result then follows by applying Thm. E.1.
Appendix F Proof of Lem. 4.1
It suffices to prove that for any feasible solution of Prob. (24), we can find a corresponding feasible solution of Prob. (25) with the same objective value and that the inverse direction also holds.
1. We first show that for any feasible solution of Prob. (24), we can find a corresponding feasible solution of Prob. (25) with the same objective value. Let be a feasible solution of Prob. (24). We construct and . Consider the minimum-norm interpolation problem,
(97) subject to
By the representer theorem, Prob. (97) admits an optimal solution of the form , where . So Prob. (97) can be reduced to
(98) subject to
Hence, by solving Prob. (98), we can derive the minimum norm square with interpolation constraints as
Since itself is an interpolant by construction of , we have
And since the log-likelihood only depends on , it holds that
And the objectives satisfy,
Therefore, is a feasible solution for Prob. (25) with the same objective as for Prob. (24).
2. We then show that for any feasible solution of Prob. (25), we can find a corresponding feasible solution of Prob. (24) with the same objective value. Let be a feasible solution of Prob. (25). We construct
Hence,
And it can be checked that and . So . And the objectives satisfy . So it is proved that for any feasible solution of Prob. (25), we can find a corresponding feasible solution of Prob. (24) with the same objective value.
The desired result then follows.
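As a numerical illustration of the minimum-norm interpolation step used in part 1, the sketch below solves the reduced finite-dimensional problem for a squared exponential kernel. It relies on the standard fact that the minimum-norm interpolant of values z at points X has coefficients alpha = K^{-1} z and squared RKHS norm z^T K^{-1} z, where K is the kernel matrix. The kernel choice, the lengthscale, the small jitter term, and all function names are illustrative assumptions and not part of the original proof.

```python
import numpy as np


def se_gram(X, lengthscale=1.0):
    """Squared exponential kernel matrix for the rows of X (standard textbook form)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))


def min_norm_interpolant(X, z, lengthscale=1.0, jitter=1e-10):
    """Coefficients and squared RKHS norm of the minimum-norm interpolant of (X, z)."""
    K = se_gram(X, lengthscale) + jitter * np.eye(len(X))  # jitter for numerical stability
    alpha = np.linalg.solve(K, z)      # representer-theorem coefficients
    norm_sq = float(z @ alpha)         # equals z^T K^{-1} z
    return alpha, norm_sq


# Toy usage with three one-dimensional points (hypothetical data).
X_demo = np.array([[0.0], [1.0], [2.0]])
z_demo = np.array([1.0, -0.5, 0.3])
alpha_demo, norm_sq_demo = min_norm_interpolant(X_demo, z_demo)
print(alpha_demo, norm_sq_demo)
```

Because any other interpolant of the same values has a norm at least this large, a norm constraint on the function can be replaced by a constraint on z^T K^{-1} z, which is the mechanism behind the equivalence argued above.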
Appendix G Elaboration on Remark 2.3
By assumption , we assume that there exists a large enough constant that upper bounds the norm of the ground-truth black-box function . However, the exact value of this upper bound may be unknown to us in practice, while the execution of our algorithm relies on the knowledge of (in Problem (23), is a key parameter). So we need to guess the value of . Suppose our guess is . It is possible that is even smaller than the ground-truth function norm . To detect this wrong guess, we observe that, with the correct setting of such that , we have that by Thm. 3.1 and the definition of maximum likelihood estimate, with high probability,
where is the maximum likelihood estimate function with function norm bound and is the corresponding parameter as defined in Thm. 3.1 with norm bound . We also have is a valid upper bound on and thus,
Hence,
That is to say, needs to be greater than or equal to when is a valid upper bound on .
Therefore, we can use the heuristic: every time we find that
we double the upper bound guess .
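The doubling heuristic can be sketched as follows. The test used here, namely that the maximum log-likelihood under the current guess falls below the maximum log-likelihood under the doubled guess minus the corresponding confidence parameter, is our reading of the argument above and should be treated as an assumption. The callables `max_loglik` and `beta` are hypothetical placeholders for routines that would compute the constrained maximum log-likelihood and the parameter from Thm. 3.1.

```python
from typing import Callable


def maybe_double_norm_bound(
    b_hat: float,
    max_loglik: Callable[[float], float],
    beta: Callable[[float], float],
) -> float:
    """Return 2 * b_hat if the current guess is detected as too small, else b_hat.

    max_loglik(b): maximum log-likelihood over functions with norm at most b (hypothetical).
    beta(b): confidence-set parameter for norm bound b (hypothetical).
    """
    doubled = 2.0 * b_hat
    # If b_hat were a valid bound, the left side could not fall below the right side
    # (with high probability), so a violation signals an underestimated norm bound.
    if max_loglik(b_hat) < max_loglik(doubled) - beta(doubled):
        return doubled
    return b_hat


# Toy usage with made-up curves: the likelihood keeps improving until the bound is large.
toy_loglik = lambda b: -10.0 / b
toy_beta = lambda b: 1.0
b = 1.0
for _ in range(5):
    b = maybe_double_norm_bound(b, toy_loglik, toy_beta)
print(b)
```

In an actual run the check would be applied once per step with the data collected so far, so the guess can only grow when the data contradict it.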
Appendix H Jointly optimize , and for the problem (25).
For medium-dimensional problems (), we can jointly optimize , , and by a nonlinear programming solver from multiple random initial conditions. That is, we can also treat in the problem (23) as an optimization variable. In this way, we lose convexity but need to solve the problem (23) only once in each step .
More specifically, we solve the optimization problem (99).
(99) | ||||
subject to | ||||
The only constraint that involves is
(100) |
Applying matrix inversion, we derive that the left-hand side is equal to,
(101) |
where .
We can then apply a nonlinear programming solver such as Ipopt to solve the problem (99) from randomly sampled initial points. Then the best converged solution is set to be the next sample point .
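The multi-start strategy can be sketched as follows. To keep the example self-contained, a simple projected gradient ascent with finite-difference gradients stands in for Ipopt, and a toy objective over a box stands in for problem (99); both substitutions, along with all function names and parameter values, are assumptions made for illustration.

```python
import numpy as np


def numerical_grad(fun, x, eps=1e-6):
    """Central finite-difference gradient of fun at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (fun(x + step) - fun(x - step)) / (2.0 * eps)
    return grad


def local_ascent(fun, x0, lower, upper, lr=0.1, n_iter=200):
    """Projected gradient ascent inside the box [lower, upper] (stand-in local solver)."""
    x = np.clip(x0, lower, upper)
    for _ in range(n_iter):
        x = np.clip(x + lr * numerical_grad(fun, x), lower, upper)
    return x


def multistart_maximize(fun, lower, upper, n_starts=8, seed=0):
    """Run the local solver from random initial points and keep the best result."""
    rng = np.random.default_rng(seed)
    best_x, best_val = None, -np.inf
    for _ in range(n_starts):
        x = local_ascent(fun, rng.uniform(lower, upper), lower, upper)
        val = fun(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


# Toy usage: a concave objective on the unit square with maximizer (0.3, 0.3).
toy_objective = lambda x: -float(np.sum((x - 0.3) ** 2))
x_best, v_best = multistart_maximize(toy_objective, np.zeros(2), np.ones(2))
print(x_best, v_best)
```

Since the joint problem is non-convex, each local run may end in a different stationary point; keeping the best converged value across starts is what makes a single solve per step workable in practice.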
Appendix I Extension to the multiple-choice setting
In this paper, we mainly consider the setting where the human expresses a preference over only two choices, because of its low cognitive burden on the human user and the simplicity of its theoretical analysis. However, we can extend POP-BO to the multiple-choice setting, where the human compares multiple choices and reports the favorite one.
Suppose that in each step , we aim to generate a batch of points. Then we can mix the new batch with the old batch generated in step , and query the comparison oracle to report the favorite point among the points.
Firstly, the confidence set of functions can be similarly constructed using the likelihood ratio idea and the multiple-choice probabilistic preference model as in (Astudillo et al., 2023),
(102) |
Secondly, to generate the new batch, the basic idea is that we can apply a ‘bootstrap’-type technique. More specifically, we can sequentially generate the new batch . When generating the new point , we maximize its corresponding optimistic advantage of as compared to the maximum of by solving a similar problem to Problem (23). That is, we solve the Problem (103) to generate the new point in the same batch,
(103) | ||||
subject to | ||||
which is equivalent to
(104) | ||||
subject to | ||||
by introducing an auxiliary variable . Problem (104) can be efficiently solved by the nonlinear programming solver Ipopt.
Appendix J Proof of Thm. 5.2
To prepare for the following analysis, we first give a useful lemma.
Lemma J.1 (Lemma 4, (Chowdhury & Gopalan, 2017b)).
(105) |
Proof.
Apply the Lemma 4 in (Chowdhury & Gopalan, 2017b) by setting the kernel function as . ∎
Appendix K Proof of Thm. 5.4
We have
where is as given in Eq. (17) with the kernel function as and . Furthermore, by the definition of ,
The conclusion then follows.
Appendix L Commonly used specific kernel functions
- Linear:
- Squared Exponential (SE):
where is the variance parameter and is the lengthscale parameter.
- Matérn:
where and are the two positive parameters of the kernel function, is the gamma function, and is the modified Bessel function of the second kind. captures the smoothness of the kernel function.
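For concreteness, the three kernel families listed above can be sketched in code using their common textbook forms. The exact normalization used in this paper (for example, the factor in the exponent of the SE kernel) may differ, so the forms below are assumptions. The Matérn kernel is shown only for the special case of smoothness 5/2, where it has a closed form that avoids the Bessel function. All function and parameter names are illustrative.

```python
import numpy as np


def linear_kernel(x, y):
    """Linear kernel: inner product of the two inputs."""
    return float(np.dot(x, y))


def se_kernel(x, y, variance=1.0, lengthscale=1.0):
    """Squared exponential kernel in a common textbook form (assumed normalization)."""
    sq_dist = float(np.sum((x - y) ** 2))
    return variance * np.exp(-sq_dist / (2.0 * lengthscale ** 2))


def matern52_kernel(x, y, rho=1.0):
    """Matérn kernel with smoothness 5/2 and lengthscale-type parameter rho (closed form)."""
    s = np.sqrt(5.0) * np.linalg.norm(x - y) / rho
    return float((1.0 + s + s ** 2 / 3.0) * np.exp(-s))


# Toy usage on two points in the plane.
x_demo = np.array([1.0, 2.0])
y_demo = np.array([3.0, 4.0])
print(linear_kernel(x_demo, y_demo), se_kernel(x_demo, y_demo), matern52_kernel(x_demo, y_demo))
```

The SE and Matérn kernels depend only on the distance between the inputs, whereas the linear kernel does not, which is one reason the covering-number and information-gain bounds differ across the three cases.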
Appendix M Proof of Thm. 5.5
Recall that
We pick , and can thus derive,
1.
2.
3. is a Matérn kernel. Lem. 3 in (Bull, 2011) implies the equivalence between RKHS and Sobolev Hilbert space. We can then apply the rich results on the bound of covering number of Sobolev Hilbert space (Edmunds & Triebel, 1996). So (by combining the lower bound in Thm. 5.1 (Xu et al., 2022a) and the convergence rate in Thm. 1 (Bull, 2011)). By Thm. 4 in (Kandasamy et al., 2015), we have,
Hence,
Appendix N Empirical Evidence for the Order of The Cumulative Regret
Fig. 4 shows the cumulative regret of the POP-BO algorithm. The experimental conditions are the same as in Sec. 6.1. Note that both the horizontal and vertical axes in Fig. 4 are in log scale, and thus the slope of the curve roughly represents the power of the cumulative regret. It can be clearly seen that the order of the cumulative regret is between and (indeed, close to by checking the slope in log scale), which is consistent with our theoretical results in Thm. 5.5.
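The slope reading described above can be reproduced with a least-squares fit in log-log coordinates. The sketch below is a minimal illustration on synthetic data; the function name and the synthetic regret curve are assumptions and do not correspond to the data behind Fig. 4.

```python
import numpy as np


def loglog_slope(steps, cumulative_regret):
    """Least-squares slope of log(cumulative regret) against log(step index)."""
    slope, _intercept = np.polyfit(np.log(steps), np.log(cumulative_regret), 1)
    return float(slope)


# Synthetic example: a regret curve that grows like a power of the step count.
steps_demo = np.arange(1, 201)
regret_demo = 2.0 * steps_demo ** 0.75
print(loglog_slope(steps_demo, regret_demo))
```

If the cumulative regret behaves like a constant times a power of the number of steps, the fitted slope estimates that power, and a slope below one indicates sublinear growth.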
Appendix O Kernel-Specific Convergence Rate
Similar to the bounds in Appendix M, we can plug in the kernel-specific covering number and maximum information gain to derive the kernel-specific convergence rates in Tab. 3.
Kernel | Linear | Squared Exponential | Matérn |
---|---|---|---|
Appendix P More Experimental Results and Details
Selection of Hyperparameters. Three key hyperparameters that influence the performance of POP-BO are the kernel lengthscale, the norm bound and the confidence level term as shown in Thm. 3.1. We set , where is set to by default. For the sampled instances from Gaussian processes, the lengthscale is set to be the ground truth and the norm bound is set to be times the ground truth. For the test function examples, we choose the lengthscale by maximizing the likelihood value over a set of randomly sampled data and set the norm bound to be by default (with the test functions all normalized).
Details on Sampled Instances from Gaussian Process. Specifically, we randomly sample some knot points from a joint Gaussian distribution marginalized from the Gaussian process, and then construct its corresponding minimum-norm interpolant (Maddalena et al., 2021) as the ground truth function.
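The sampling procedure described above can be sketched as follows: draw knot locations, sample their values jointly from the Gaussian process prior via a Cholesky factor, and return the minimum-norm interpolant of those values. The squared exponential kernel form, the unit-cube domain for the knots, the jitter term, and all names are assumptions made for illustration.

```python
import numpy as np


def se_kernel_matrix(A, B, lengthscale=1.0):
    """Squared exponential kernel matrix between the rows of A and the rows of B."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))


def sample_ground_truth(n_knots, dim, lengthscale=1.0, seed=0, jitter=1e-8):
    """Sample knot values from a GP prior and return their minimum-norm interpolant."""
    rng = np.random.default_rng(seed)
    knots = rng.uniform(0.0, 1.0, size=(n_knots, dim))   # assumed domain: unit cube
    K = se_kernel_matrix(knots, knots, lengthscale) + jitter * np.eye(n_knots)
    values = np.linalg.cholesky(K) @ rng.standard_normal(n_knots)  # joint Gaussian draw
    alpha = np.linalg.solve(K, values)                   # interpolation coefficients

    def f(x):
        """Evaluate the interpolant at one point or at the rows of an array."""
        x = np.atleast_2d(x)
        return se_kernel_matrix(x, knots, lengthscale) @ alpha

    return f, knots, values


# Toy usage: an 8-knot instance in two dimensions.
f_demo, knots_demo, values_demo = sample_ground_truth(n_knots=8, dim=2, lengthscale=0.3)
print(f_demo(np.array([0.5, 0.5])))
```

Constructing the ground truth this way yields a function that lies in the RKHS with a computable norm, which is convenient when the norm bound is set relative to the ground truth.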
Empirical Method for Reporting a Solution. In the experiment of test function optimization, we report the point that maximizes the minimum norm maximum likelihood estimator , which achieves better empirical performance.
Solution Report Method for Baselines. The approach to reporting a solution is the same as in the original paper of the baseline algorithm if it is mentioned. Therefore, for the baseline qEUBO (Astudillo et al., 2023), we report the solution that maximizes the expected objective value conditioned on the historical samples. For the baseline SGP (Takeno et al., 2023), we report the first point of the duel proposed by the algorithm in step . For the baseline DTS (González et al., 2017), we report the Condorcet winner.
Effect of Hyperparameters. We conducted more experiments to assess the effect of hyperparameters. We observe that the most influential hyperparameters are the norm bound and the confidence level . The larger the norm bound is, the more variance the estimated function has. If is set too large, the suboptimality of the reported solution tends to converge more slowly. can be set to be in practice and determines the level of exploration, where is a fixed constant. The larger is, the more explorative the algorithm is, which may lead to higher cumulative regret. However, setting to be very small may cause weak exploration and make the suboptimality of the reported solution converge more slowly.
P.1 Experimental Results for Higher-Dimensional Problems
P.1.1 Higher-Dimensional Problems Sampled from Gaussian Process
We consider the optimization of a 7-dimensional black-box function sampled from a Gaussian process with the kernel function shown in Eq. (106),
(106) |
where and . The optimization domain is set to be . We run randomly sampled instances for steps. The average update time for each step is only seconds on a personal computer with one Intel64 Family 6 Model 142 Stepping 12 GenuineIntel 1803 MHz processor and 16.0 GB RAM. This is comparatively small, considering that each query to the comparison oracle can be very expensive in practice (e.g., heating a room to a certain temperature to evaluate occupant comfort may take tens of minutes). We compare our method to the SGP baseline.
Fig. 5 shows the cumulative regret (in log scale) and the suboptimality of the reported solution for our POP-BO algorithm, where the reported solution is derived by maximizing the maximum likelihood estimate function. It can be clearly seen that our algorithm achieves both sublinear regret growth and fast convergence of the suboptimality of the reported solution in this 7-dimensional problem. Interestingly, the suboptimality of SGP converges similarly to that of our method during the first 50 steps, but gets worse after 50 steps. This is because SGP ignores the randomness in the preference feedback, which leads to misbelief in the function difference value, and such misbelief is more significant when the function difference value is small.
P.1.2 Higher-Dimensional Test Problem
In this section, we further consider the optimization of the -dimensional Ackley function as shown in (Astudillo et al., 2023). For this problem, we compare POP-BO algorithm to the qEUBO algorithm proposed in (Astudillo et al., 2023). Fig. 6 shows the cumulative regret and the suboptimality of the reported solution. In this particular problem, qEUBO performs better than our POP-BO algorithm in terms of cumulative regret, while our POP-BO algorithm performs slightly better than qEUBO in terms of the suboptimality of the reported solution.
P.2 Occupant Thermal Comfort Optimization
P.2.1 Two-Dimensional Comfort Optimization
An accurate model of human thermal comfort is crucial for improving occupants’ comfort while saving energy in buildings. However, establishing such a model has proven to be a complex and challenging task (Zhang et al., 2024), and standard offline models ignore the individual differences among occupants. In this section, we consider the real-world problem of maximizing occupant thermal comfort directly from thermal preference feedback. To emulate real human thermal sensation, we use the well-known and widely adopted Predicted Mean Vote (PMV) model (Fanger et al., 1970) as the ground truth and generate the preference feedback according to the Bernoulli model as assumed in Assumption 2.5. We optimize the indoor air temperature and air speed, which are the two major factors that influence thermal comfort and are controllable by HVAC (Heating, Ventilation, and Air Conditioning) systems and fans. Indeed, tuning these two factors has been proven effective in providing thermal comfort while minimizing energy consumption (Lyu et al., 2023). The result is shown in Fig. 7, where the mean is taken over 30 simulation instances. It can be seen that our method stably achieves superior performance in optimizing human thermal comfort, which implies its potential to deal with preferential feedback in real-world applications. It is also noticeable that, although qEUBO achieves slightly better performance in terms of the convergence of the reported solution, the cumulative regret of qEUBO is almost twice that of POP-BO. This means our method is more favorable in applications where online performance during the optimization is also critical, such as online tuning of HVAC systems.
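The preference feedback generation can be sketched as a simulated comparison oracle. The sketch assumes that the Bernoulli model uses a logistic link applied to the difference of the latent function values, and that an outcome of 1 means the first point is preferred; a toy one-dimensional utility stands in for the PMV-based ground truth. All names are illustrative.

```python
import numpy as np


def preference_probability(f_x, f_x_prime):
    """Probability that x is preferred over x', assuming a logistic link."""
    return 1.0 / (1.0 + np.exp(-(f_x - f_x_prime)))


def query_oracle(f, x, x_prime, rng):
    """Draw one Bernoulli preference outcome: 1 if x is preferred, 0 otherwise."""
    p = preference_probability(f(x), f(x_prime))
    return int(rng.random() < p)


# Toy usage: a made-up comfort utility peaking at 0.5 on a normalized scale.
toy_utility = lambda x: -(x - 0.5) ** 2
rng_demo = np.random.default_rng(0)
outcomes_demo = [query_oracle(toy_utility, 0.5, 0.0, rng_demo) for _ in range(10)]
print(outcomes_demo)
```

Under this model two points with nearly equal utility produce close to a fair coin flip, so the feedback becomes less informative as the compared points approach each other in value.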
P.2.2 Scalability to higher dimension
To demonstrate the scalability of POP-BO in this real-world comfort optimization problem, we additionally tune the mean radiant temperature and relative humidity, which results in a four-dimensional black-box optimization problem. The result is shown in Fig. 8. It can be observed that increasing the dimensionality does not drastically decrease the convergence rate of our method. Furthermore, the baseline method qEUBO decreases the objective value very quickly in the initial steps, but still appears to oscillate considerably after 10 steps. In contrast, our method converges faster than SGP and does not exhibit the oscillation seen with qEUBO.
P.3 Details About the Results in Tab. 2
The cumulative regret and evolution of suboptimality for the different test problems in Tab. 2 are shown in Fig. 9. Since the considered problems only have -dimensional input, and since in applications of Bayesian optimization it is typically desired to obtain a set of solutions with objective value as close to the optimal value as possible, we only consider steps here. As shown in Tab. 2, the other baselines make only limited progress in terms of the suboptimality of the reported solution within only steps (partially also due to the ‘adversarial’ property of the test functions, i.e., severe non-convexity and multiple local maxima). In sharp contrast, our POP-BO algorithm makes significant progress in reducing the suboptimality of the reported solution by balancing exploration and exploitation and by estimating the best solution in a principled way.
To provide more insights into POP-BO’s performance across different settings, we compare our algorithm’s evolution of cumulative regret and suboptimality to other baseline methods for each test problem in Fig. 10 and Fig. 11. It can be observed that our method may perform slightly worse than some baselines in certain problems. For example, our method performs slightly worse than qEUBO in the Bukin problem in terms of suboptimality. However, our method performs stably and is consistently one of the best in all the test problems in terms of the suboptimality.
Appendix Q Additional Contributions as Compared to (Mehta et al., 2023)
Notably, (Mehta et al., 2023) proposes the Borda-AE algorithm, which directly learns the winning probability function using kernel ridge regression. This key design allows the authors to derive an information-theoretic convergence rate and an efficient computation method without having to learn the underlying reward function.
However, (Mehta et al., 2023) has key limitations, and our paper makes additional contributions in the following two aspects.
1. Cumulative regret bound. There are two possible ways to define cumulative regret. One way is to define the (partial) cumulative regret as the summation of the suboptimality of only (that is, ). With this (partial) cumulative regret definition, the Borda-AE algorithm can provide a sublinear (partial) cumulative regret bound, although it has linear growth in the cumulative regret of the compared point sequence . However, in many practical online learning applications, it is desired to control the suboptimality of both and sequences. For example, when tuning the thermal/visual comfort of room occupants, we require the occupants to experience both and conditions for comparison purposes, and the suboptimality (which links to discomfort) caused by both and needs to be managed.
Therefore, it is more practically relevant to define the (total) cumulative regret as the total cumulative suboptimality of both and sequences (that is, ). Interestingly, since by the design of our POP-BO algorithm, this (total) cumulative regret reduces to , for which we provide our sublinear cumulative regret bound. As such, the (total) cumulative regret bound provided by our paper is stronger than the (partial) cumulative regret bound that could be obtained by (Mehta et al., 2023).
2. Applicability to online learning problems. Following the last point, (Mehta et al., 2023) is not applicable to online learning problems, since in line 6 of the Borda-AE algorithm, is uniformly sampled from the action space, which leads to a linear growth of cumulative regret. This means that Borda-AE has very poor online performance and cannot be applied to an online learning problem. For example, in building thermal comfort tuning, we also want to control the discomfort caused during the tuning process. In contrast, our POP-BO algorithm has good online performance, with both a theoretical bound on cumulative regret (Thm. 5.2) and empirical evidence of small cumulative regret (Fig. 2).
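The difference between the (partial) and (total) cumulative regret in item 1 can be illustrated with a toy simulation. Everything in this sketch is hypothetical: the one-dimensional objective `f`, the hand-crafted query sequence that approaches the maximizer, and in particular the assumption that the non-uniform comparison point is simply the previous query. It is not an implementation of either POP-BO or Borda-AE; it only shows how the choice of comparison points affects the total regret.

```python
import numpy as np


def f(x):
    """Hypothetical 1-D objective on [0, 1] with maximum value 0 at x = 0.5."""
    return -(x - 0.5) ** 2


def regrets(queries, comparators, f_opt=0.0):
    """Return (partial, total) cumulative regret sequences.

    partial: cumulative suboptimality of the queried points only.
    total:   partial plus the cumulative suboptimality of the compared points.
    """
    partial = np.cumsum(f_opt - f(queries))
    total = partial + np.cumsum(f_opt - f(comparators))
    return partial, total


rng = np.random.default_rng(0)
T = 2000
t = np.arange(1, T + 1)
queries = 0.5 + 0.5 / np.sqrt(t)  # hypothetical queries approaching the maximizer

# Case A: comparison points drawn uniformly from the action space.
uniform_comparators = rng.uniform(0.0, 1.0, size=T)
partial, total_uniform = regrets(queries, uniform_comparators)

# Case B (assumption): each query is compared with the previous query.
previous_comparators = np.concatenate(([queries[0]], queries[:-1]))
_, total_previous = regrets(queries, previous_comparators)
```

In this toy setting the partial regret is identical in both cases, but the total regret grows roughly linearly when the comparison points are uniform, while it stays within about twice the partial regret when each query is compared with its predecessor.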