On Speeding Up Language Model Evaluation
Abstract
Large language models (LLMs) currently dominate the field of natural language
processing (NLP), representing the state-of-the-art across a diverse array of tasks.
Developing a model of this nature, from training to inference, requires making
numerous decisions that define a combinatorial search problem. For example,
selecting the optimal pre-trained LLM, prompt, or hyperparameters to attain the
best performance for a task often requires evaluating multiple candidates on an
entire test set. This exhaustive evaluation can be time-consuming and costly, as both
inference and metric computation with LLMs are resource-intensive. In this paper,
we address the challenge of identifying the best method within a limited budget for
evaluating methods on test examples. By leveraging the well-studied multi-armed
bandit framework, which sequentially selects the next method-example pair to
evaluate, our approach, combining multi-armed bandit algorithms with low-rank
factorization, significantly reduces the required resources. Experiments show that
our algorithms can identify the top-performing method using only 5-15% of the
typically needed resources, resulting in an 85-95% reduction in cost.
1 Introduction
Large language models (LLMs) have demonstrated remarkable proficiency in diverse tasks such as
question answering, machine translation and mathematical reasoning [5, 16, 43]. They are employed
across numerous applications, ranging from automated customer support to content generation [45].
As the development of new LLMs continues, practitioners find themselves with a plethora of options,
selecting the optimal model, prompt [40], or hyperparameters for their specific needs from hundreds
available. For instance, initiatives like the Chatbot Arena [15] actively maintain nearly 100 models
for benchmarking on user-specified open-ended queries. Similarly, the AlpacaEval [44] project
benchmarks over 200 models against a diverse set of 805 questions.
Querying an LLM is resource-intensive, and therefore such extensive evaluations require significant
investments in time, compute and financial resources. In Figure 1 Left, we show the estimated cost to
fully evaluate (denoted as full evaluation) the 153 models officially included in AlpacaEval [44] as of
May 20, 2024 to be almost 800 USD. In Figure 1 Right, we show that 78.2 Nvidia A6000 GPU hours
are needed for evaluating 205 zero-shot prompts on 784 GSM8K [18] questions using Mistral-7B
[32].
Despite the convention of exhaustive evaluations of all data points in a test set from a task, practitioners
often only care about the overall performance rankings. Typically, the goal is to identify the top-
performing methods or simply the best ones, discarding the lower-ranked alternatives. Therefore,
while full-scale evaluation of each method on every data point is thorough, it may not be cost-effective
when the goal is merely to identify the superior method.
∗ Equal contribution. Correspondence to {jz563, ckb73}@cornell.edu and [email protected].
In this paper, we explore a limited-budget setting. The best method among a set of methods is the
method that has the best average performance among a set of test examples. We want to identify
this best method given a fixed budget for evaluating method-example pairs. A simple baseline is to
evenly split the budget for each method. However, this can be very inefficient – as we observed in
our experiments, for some datasets, we need to evaluate more than 90% of all method-example pairs
to predict the best method correctly with high probability. Intuitively, it is easier to recognize that
low-performance methods are unlikely to be the best. Therefore, a more effective budget allocation
strategy is to spend less on low-performing methods and more on promising ones. Multi-armed
bandit algorithms enable this dynamic allocation by actively selecting the next method-example pair
to evaluate based on the results of previous evaluations.
We propose two active selection algorithms, UCB-E and UCB-E-LRF. Our first algorithm is an
extension of the classical UCB-E [2] for the multi-armed bandit problem. It estimates the upper
confidence bound (UCB) to guide the selection of the method, which is paired with a randomly chosen
example for the next evaluation, in order to efficiently estimate the best method. This algorithm
enjoys a theoretical guarantee: the probability of selecting the best arm converges to 100%
exponentially fast in the number of evaluations. Our second algorithm, UCB-E-LRF, leverages the
intrinsic low-rankness of the scoring matrices. We find that in practice, scoring matrices, where each
row (column) is a method (an example) and the value is some metric score to measure how good the
method is on the example, usually can be well-approximated by a low-rank matrix. Intuitively we
can predict the remaining unobserved method-example pairs by the low-rank factorization (LRF)
and prioritize evaluations of the pairs with large uncertainties in this prediction. By deploying this
intuition on top of UCB-E, UCB-E-LRF actively selects both the method and the example to evaluate,
potentially further improving the efficiency of budget usage.
We study the performance of these algorithms as well as common baselines in a number of practical
settings. Through the empirical analysis, we find that our two active selection algorithms are much
better than the non-active baselines. In addition, we find that, for all algorithms, harder settings,
where the method set is large or the gaps between the top method and the other methods are small,
generally require more budget for evaluating method-example pairs. More interestingly, we observe that
UCB-E works best in easier settings; conversely, UCB-E-LRF outperforms all other
algorithms when the settings are harder.
2 Preliminaries
2.1 Evaluation Workflow in Large Language Model Applications
Table 1: Selected benchmarks where LLMs are state-of-the-art methods. We group them based on
the task type and whether additional LLM-based / human-based scoring is needed for evaluation.
Task Type | LLM Inference + Rule-based Scoring | LLM Inference + LLM-based Scoring
Natural Language Understanding | BOLD [22], GLUE [57], HellaSwag [64], SQuAD [52], TriviaQA [34], WinoGrande [1] | CNN/Dailymail [48], Newsroom [25], XSUM [49]
Open-ended QA | Google NQ [41], HotpotQA [62], QASPER [19] | AlpacaEval [44], Chatbot Arena [15], MT-Bench [66]
STEM and Reasoning | APPS [27], Arc [17], GPQA [53], GSM8K [18], MATH [29], MMLU [28], PIQA [3] | –
A typical evaluation workflow in large language model (LLM) applications involves three steps:
inference, scoring, and performance aggregation; LLMs can play important roles in the first two.
1. Inference. Given a dataset such as one shown in Table 1 and an LLM-based method, the
output from this method is generated through an LLM. Each method can be a distinct LLM
for benchmarking different LLMs, the same LLM with different prompts for
prompt engineering [60], or other configurations such as temperature, decoding
strategies, etc., for hyperparameter tuning.
2. Scoring. The outputs from different methods are scored with a scoring function, i.e., a metric.
The scoring function can be rule-based (exact string match, BLEU [50], ROUGE
[46]), LLM-based (BERTScore [65], LLM judge [66]), or human-based, i.e., a user study.
Depending on the task and dataset format, researchers have employed different types of
scoring functions. In Table 1, we classify a selected group of commonly used datasets
according to the task type and scoring function applied to them. We group LLM-based and
human-based scoring together since they are usually considered alternatives to each other
and have been shown to have relatively high correlation [44].
3. Performance aggregation. The performance of each method is aggregated across the
dataset, typically with a simple average over all examples in the dataset.
Notice that LLMs can play an important role in both inference and scoring. However, LLMs are
becoming increasingly large, and many state-of-the-art LLMs used in LLM-based scoring functions are
black-box APIs [42]. As a result, both inference and scoring in the evaluation workflow have become
massively resource-intensive.
Despite the intensive resources needed, in many practical scenarios, we are only interested in identi-
fying the best method among all methods. For example, for prompt engineering and hyperparameter
tuning, knowing which method (prompt / configuration) is the best is usually sufficient for the next
round of iteration. Intuitively, evaluating all methods on all examples of a dataset is excessive for this
purpose and therefore we are interested in studying the identification of the best method with as few
evaluation pairs as possible.
With the motivation above, we now define our notation and provide a formal problem formulation.
Suppose we have a set of methods F = {f1 , . . . , fn } and a set of examples X = {x1 , . . . , xm }. Let
e : F × X → [0, 1] denote our scoring function and without loss of generality, we assume e(f, x) >
e(f′, x) indicates f has better performance than f′ on the example x. Suppose $E \in [0,1]^{n \times m}$ is the
underlying scoring matrix for a given problem (F, X, e), defined by $E_{i,j} := e(f_i, x_j)$. The score of a
given method $f_i$ is defined as $\mu_i = \frac{1}{m}\sum_{j=1}^{m} E_{i,j}$, and we want to find the best method $i^* = \arg\max_i \mu_i$.
As motivated above, we are interested in the scenario where we have a limited budget of T evaluations.
The question is: how can we select these T method-example pairs to maximize the chance of finding
the best method i∗ ? Formally, we want to study the evaluation algorithm A. The input is the
Figure 2: Active method-example pair selection: after the LLM has evaluated t method-example pairs,
we call Algorithm A to select the next method-example pair. We then query the LLM to evaluate
this pair and fill the received score into the scoring matrix. Algorithm A then updates
its internal state in preparation for the next method-example pair selection. This process is repeated
until the budget T is exhausted and, in the end, the algorithm A predicts the best method f_{î*}.
evaluation budget T , a set of methods F and a set of examples X . In the evaluation algorithm A, we
can evaluate at most T method-example pairs $(f_{i_t}, x_{j_t})$, $t = 1, \dots, T$. The output is the predicted
best method. Our goal is to maximize $P_{\mathcal{A}}(\mathcal{A}(T, \mathcal{F}, \mathcal{X}) = i^*)$, the probability of returning the best
method $i^*$.
3 Algorithms
One simple baseline for designing the evaluation algorithm A is: for each method $f_i \in \mathcal{F}$, we
uniformly sample $\lfloor T/n \rfloor$ examples $X_{i,T}$ from $\mathcal{X}$, estimate the mean $\hat{\mu}_i = \frac{1}{\lfloor T/n \rfloor} \sum_{x \in X_{i,T}} e(f_i, x)$,
and pick the method $f_{\hat{i}^*}$ with the highest estimated mean $\hat{\mu}_{\hat{i}^*}$ as the prediction for the best method.
However, this algorithm can be very inefficient. We will see in the experiments that for some datasets,
we need to evaluate at least 90% of all method-example pairs to predict the best method correctly
with high probability. Unlike this simple baseline, the two algorithms proposed in this section
actively decide which method-example pair to evaluate next, with each decision based on previous
evaluations; Figure 2 illustrates this high-level idea.
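As a concrete reference point, below is a minimal Python sketch of this uniform-allocation baseline. It assumes the full scoring matrix is available as a numpy array standing in for live inference and scoring calls; the function name is ours, not from a released implementation.

```python
import numpy as np

def uniform_baseline(E, T, seed=0):
    """Evenly split the budget T across the n methods and return the index
    of the method with the highest estimated mean. E is the n x m scoring
    matrix; each lookup stands in for one LLM inference + scoring call."""
    rng = np.random.default_rng(seed)
    n, m = E.shape
    per_method = T // n                       # floor(T / n) examples each
    mu_hat = np.empty(n)
    for i in range(n):
        cols = rng.choice(m, size=per_method, replace=False)
        mu_hat[i] = E[i, cols].mean()
    return int(np.argmax(mu_hat))
```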
The algorithm. Notice the simple baseline evenly distributes its total budget across the methods f_i;
this uniform allocation is its main limitation. To distinguish the best method f_{i*} from other good
methods f_i (i.e., µ_{i*} − µ_i is small), we may need more than ⌊T/n⌋ examples, while fewer than
⌊T/n⌋ examples suffice to distinguish f_{i*} from bad methods (i.e., µ_{i*} − µ_i is large).
We address this limitation with our first algorithm, UCB-E (A_ucb-e; Algorithm 1), a simple
extension of the classic multi-armed bandit algorithm UCB-E [2]. At every step t, we estimate the
upper confidence bound of each method f_i from the examples evaluated with that method and pick
the method f_{i_t} with the largest upper confidence bound. Then we uniformly sample one example x_{j_t}
that has not been evaluated with the method f_{i_t}, run the inference procedure for (f_{i_t}, x_{j_t}), and
send the result to the annotator for scoring.
Algorithm 1 UCB-E (A_ucb-e(T, F, X; η))
Input: The evaluation budget T, a set of methods F and a set of examples X, the uncertainty scaling η.
Output: The prediction î∗ for the best method i∗.
1: ∀ f_i ∈ F, set the upper confidence bound B_i = +∞ and the set of evaluated examples S_i = {}; initialize the observed scoring matrix E^obs := {?}^{n×m}.
2: for t = 1, ..., T do
3:   Select: Draw uniformly at random i_t ∈ arg max_i B_i; draw uniformly at random j_t ∈ [m] \ S_{i_t}.
4:   Evaluate: Run inference for the method-example pair (f_{i_t}, x_{j_t}), send the result to the annotator, and receive e(f_{i_t}, x_{j_t}); E^obs_{i_t,j_t} ← e(f_{i_t}, x_{j_t}).
5:   Update: S_{i_t} ← S_{i_t} ∪ {j_t}; B_{i_t} ← (Σ_{j ∈ S_{i_t}} E^obs_{i_t,j}) / |S_{i_t}| + √(η / |S_{i_t}|).
6: end for
Return: î∗ = arg max_i (Σ_{j ∈ S_i} E^obs_{i,j}) / |S_i|
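Below is a minimal Python sketch of Algorithm 1, again with a pre-computed scoring matrix standing in for LLM inference and annotation; it assumes the budget is small enough that the selected method always has an unevaluated example left.

```python
import numpy as np

def ucb_e(E, T, eta=1.0, seed=0):
    """Sketch of Algorithm 1 (UCB-E). E is the n x m scoring matrix; each
    access E[i, j] stands in for one LLM inference + scoring call."""
    rng = np.random.default_rng(seed)
    n, m = E.shape
    B = np.full(n, np.inf)           # upper confidence bounds B_i
    S = [[] for _ in range(n)]       # evaluated example indices S_i
    for _ in range(T):
        # Select: break ties in argmax_i B_i uniformly at random.
        i = rng.choice(np.flatnonzero(B == B.max()))
        j = rng.choice(np.setdiff1d(np.arange(m), S[i]))
        S[i].append(j)               # Evaluate the pair (f_i, x_j).
        B[i] = E[i, S[i]].mean() + np.sqrt(eta / len(S[i]))
    means = [E[i, S[i]].mean() if S[i] else -np.inf for i in range(n)]
    return int(np.argmax(means))     # predicted best method index
```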
Comparison with the simple baseline. The UCB-E algorithm addresses the limitation of the
simple baseline. The upper confidence bounds of bad methods (i.e., µ_{i*} − µ_i is large) will never
exceed the upper confidence bound of f_{i*} once they have been evaluated on a sufficient number of
examples; hence UCB-E will stop selecting these methods and pay more attention to the more
promising ones.
Theoretical results. We state the theoretical guarantee of UCB-E in our context, which is a corollary
of the theorem for UCB-E in the original multi-armed bandit setting.
Corollary 1 (Theoretical guarantee of UCB-E) Define $H_1 = \sum_{i=1, i \neq i^*}^{n} \frac{1}{(\mu_{i^*} - \mu_i)^2}$ and suppose $\eta = \frac{25}{36}\,\frac{T-n}{H_1}$. Then

$$P_{\mathcal{A}_{ucb\text{-}e}}\left(\mathcal{A}_{ucb\text{-}e}(T, \mathcal{F}, \mathcal{X}; \eta) = i^*\right) \;\geq\; 1 - 2Tn \exp\left(-\frac{T-n}{18 H_1}\right).$$
Comment. Given a problem set-up (F, X, e), H1 in the corollary indicates the hardness of the
problem. The larger the gaps between the highest score and the remaining scores, the smaller H1 is
and the higher the probability that UCB-E picks the best method given a fixed budget T. In our
experiments, we will observe that datasets with smaller H1 tend to have a higher chance of predicting
i∗ correctly given the same evaluation budget T.
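For intuition, the snippet below computes H1 from a vector of true per-method mean scores; the numbers are hypothetical and only illustrate how a close runner-up inflates H1.

```python
import numpy as np

def hardness_h1(mu):
    """H1 = sum_{i != i*} 1 / (mu_{i*} - mu_i)^2, assuming a unique best arm."""
    mu = np.asarray(mu, dtype=float)
    i_star = int(np.argmax(mu))
    gaps = mu[i_star] - np.delete(mu, i_star)
    return float(np.sum(1.0 / gaps ** 2))

print(hardness_h1([0.90, 0.50, 0.40]))   # clear winner:    ~10.25
print(hardness_h1([0.90, 0.89, 0.40]))   # close runner-up: ~10004.0
```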
Motivation. We find that the scoring matrix E for real-world problem
instances can often be well approximated by a matrix of low rank r ≪ n, m. This observation
means it is possible to estimate the scores of all method-example pairs from the scores of just
a few. If the matrix E were exactly rank-1, only n + m − 1 scores would be needed
to exactly recover the full scoring matrix E, which is extremely efficient compared to exhaustively
evaluating all n · m combinations. Suppose E^obs is a partially observed scoring matrix whose
observed entries hold the ground-truth scores of the evaluated pairs, and Ê is an estimated
scoring matrix whose entries hold estimated scores for the unevaluated pairs. Suppose
O ∈ {0, 1}^{n×m} is the observation matrix where O_{i,j} indicates whether the method-example pair
(f_i, x_j) has been observed. When the scores of unevaluated method-example pairs are estimated
accurately, the score estimate $\hat{\mu}_i = \frac{1}{m}\sum_{j=1}^{m}\left(O_{i,j} E^{obs}_{i,j} + (1 - O_{i,j})\hat{E}_{i,j}\right)$, which combines both the
evaluated pairs and the estimated pairs, has the potential to be more accurate than an estimate based
only on the evaluated pairs.
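In numpy terms, this combined estimate is a one-liner, assuming unobserved entries of E^obs are stored as 0 rather than "?":

```python
import numpy as np

def combined_scores(E_obs, O, E_hat):
    """mu_hat_i = (1/m) * sum_j [O_ij * E_obs_ij + (1 - O_ij) * E_hat_ij]."""
    return (O * E_obs + (1 - O) * E_hat).mean(axis=1)
```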
The algorithm. Leveraging this intuition of estimating the unevaluated method-example pairs,
we adopt a low-rank factorization (LRF) approach and propose our second algorithm, UCB-E-LRF
(A_uel; Algorithm 2).
Algorithm 2 UCB-E-LRF (A_uel(T, F, X; M, r, K, T_0, b, η))
Input: The evaluation budget T, a set of methods F and a set of examples X, the LRF oracle M, the rank r, the number of solutions in the ensemble K, the warm-up budget T_0, the batch size b, the uncertainty scaling η.
Output: The prediction î∗ for the best method i∗.
1: Uniformly sample T_0 method-example pairs from [n] × [m] and get the observed scoring matrix E^obs ∈ ([0, 1] ∪ {?})^{n×m} and the observation matrix O ∈ {0, 1}^{n×m} w.r.t. these T_0 evaluations.
2: Ê, R ← M(E^obs, O; r, K); ∀ f_i ∈ F, B_i ← (1/m) Σ_{j=1}^{m} (O_{i,j} E^obs_{i,j} + (1 − O_{i,j}) Ê_{i,j} + η R_{i,j}).
3: for t = T_0, ..., T_0 + ⌊(T − T_0)/b⌋ do
4:   Select: Draw i_t ∈ arg max_i B_i; J_t ← Top-b({R_{i_t,j} | j ∈ [m]}).
5:   for j_t ∈ J_t do
6:     Evaluate: Run inference for the method-example pair (f_{i_t}, x_{j_t}), send the result to the annotator, and receive e(f_{i_t}, x_{j_t}); E^obs_{i_t,j_t} ← e(f_{i_t}, x_{j_t}).
7:   end for
8:   Update: ∀ j_t ∈ J_t, O_{i_t,j_t} ← 1; Ê, R ← M(E^obs, O; r, K); ∀ f_i ∈ F, B_i ← (1/m) Σ_{j=1}^{m} (O_{i,j} E^obs_{i,j} + (1 − O_{i,j}) Ê_{i,j} + η R_{i,j}).
9: end for
Return: î∗ = arg max_i (1/m) Σ_{j=1}^{m} (O_{i,j} E^obs_{i,j} + (1 − O_{i,j}) Ê_{i,j})
This algorithm builds on UCB-E, which dynamically allocates the budget to the
promising methods: we estimate the upper confidence bound for each method f_i and evaluate an
example x_{j_t} yet to be evaluated for the method f_{i_t} with the largest upper confidence bound. However,
the specification of this idea in UCB-E-LRF is different, because of our additional intermediate goal –
a good estimation Ê for the unevaluated pairs. Suppose we have a scoring matrix estimation oracle
M: given t evaluated method-example pairs, the oracle outputs the estimated scoring matrix Ê and
the uncertainty matrix R that quantifies the uncertainty of each estimate in Ê. We keep
updating the observation matrix O ∈ {0, 1}^{n×m}, where O_{i,j} indicates whether the method-example pair
(f_i, x_j) has been evaluated. The specifications that differ from UCB-E are:
1. Once the method f_{i_t} is selected, unlike in UCB-E where the example x_{j_t} is uniformly drawn
from the unevaluated examples, we pick the unevaluated example x_{j_t} with the largest uncertainty
R_{i_t,j_t}, because evaluating it yields the largest reduction in uncertainty for the method f_{i_t}.
This allows us to better estimate the scoring matrix Ê with fewer evaluations.
2. The upper confidence bound is now computed as $B_i = \frac{1}{m}\sum_{j=1}^{m}\left(O_{i,j} E^{obs}_{i,j} + (1-O_{i,j})\hat{E}_{i,j} + \eta R_{i,j}\right)$,
which is a combination of the current score estimate $\frac{1}{m}\sum_{j=1}^{m}\left(O_{i,j} E^{obs}_{i,j} + (1-O_{i,j})\hat{E}_{i,j}\right)$
and the uncertainty estimate $\frac{\eta}{m}\sum_{j=1}^{m} R_{i,j}$. Notably, the current score
estimate is the partially observed scores $\frac{1}{m}\sum_{j=1}^{m} O_{i,j} E^{obs}_{i,j}$ plus the estimated scores
$\frac{1}{m}\sum_{j=1}^{m} (1-O_{i,j})\hat{E}_{i,j}$.
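A Python sketch of this select step follows; restricting the Top-b choice to unevaluated examples is our reading of line 4 of Algorithm 2, since evaluated entries carry no remaining uncertainty.

```python
import numpy as np

def select_step(E_obs, O, E_hat, R, eta, b):
    """One select step of UCB-E-LRF (line 4 of Algorithm 2)."""
    # Uncertainty-augmented upper confidence bound for each method.
    B = (O * E_obs + (1 - O) * E_hat).mean(axis=1) + eta * R.mean(axis=1)
    i = int(np.argmax(B))
    unseen = np.flatnonzero(O[i] == 0)                 # unevaluated examples
    J = unseen[np.argsort(R[i, unseen])[::-1][:b]]     # top-b by uncertainty
    return i, J
```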
The choice of scoring matrix estimation oracle. We now introduce the scoring matrix estimation
oracle M used in this paper. The inputs of the oracle M are a partially observed scoring matrix
E^obs ∈ ([0, 1] ∪ {?})^{n×m} and an observation matrix O ∈ {0, 1}^{n×m}, where O_{i,j} indicates whether
the method-example pair (f_i, x_j) has been evaluated. The outputs of M are an
estimated scoring matrix Ê and the corresponding uncertainty matrix R.
We define our M as an ensemble of K low-rank factorization (LRF) solutions [59, 14]. The low-rank
factorization assumes the ground truth matrix is (approximately) low-rank with rank r. We optimize
the low-rank representations U ∈ R^{n×r} and V ∈ R^{m×r} for the following optimization problem via the
alternating least squares method [26]:

$$\min_{U \in \mathbb{R}^{n \times r},\, V \in \mathbb{R}^{m \times r}} \; \sum_{i \in [n],\, j \in [m]} O_{i,j} \left( U_i^\top V_j - E^{obs}_{i,j} \right)^2, \qquad (1)$$
where U_i is the i-th row of U and V_j is the j-th row of V. We further produce both the estimate and
its uncertainty by bootstrapping [23, 67]. To obtain a diverse ensemble of low-rank factorizations,
for each factorization (U^(k), V^(k)) we randomly zero out 5% of the ones in O to form O^(k) and optimize the
objective in Equation 1 with E^obs and O^(k) instead. Finally, we define the estimated scoring matrix Ê
and the uncertainty matrix R as

$$\hat{E} := (\mathbf{1} - O) \circ \left( \frac{1}{K} \sum_{k=1}^{K} U^{(k)} V^{(k)\top} \right), \qquad R_{i,j} := \sqrt{ \frac{1}{K} \sum_{k=1}^{K} \left( \big(U^{(k)} V^{(k)\top}\big)_{i,j} - \hat{E}_{i,j} \right)^2 }, \qquad (2)$$

where $\mathbf{1}$ is the all-ones matrix of shape n × m and ◦ is element-wise multiplication.
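Below is an illustrative implementation of this oracle: masked alternating least squares for Equation 1 (with a small ridge term added for numerical stability, our addition) wrapped in the bootstrap ensemble of Equation 2. Measuring the spread around the unmasked ensemble mean is our reading of the uncertainty definition.

```python
import numpy as np

def masked_als(E_obs, O, r, iters=20, lam=1e-3, rng=None):
    """ALS for min_{U,V} sum_ij O_ij (U_i^T V_j - E_obs_ij)^2 (Equation 1),
    with a small ridge term lam added for numerical stability."""
    rng = np.random.default_rng(rng)
    n, m = E_obs.shape
    U = 0.1 * rng.standard_normal((n, r))
    V = 0.1 * rng.standard_normal((m, r))
    for _ in range(iters):
        for i in range(n):                           # update each row of U
            js = np.flatnonzero(O[i])
            if js.size:
                A = V[js].T @ V[js] + lam * np.eye(r)
                U[i] = np.linalg.solve(A, V[js].T @ E_obs[i, js])
        for j in range(m):                           # update each row of V
            isx = np.flatnonzero(O[:, j])
            if isx.size:
                A = U[isx].T @ U[isx] + lam * np.eye(r)
                V[j] = np.linalg.solve(A, U[isx].T @ E_obs[isx, j])
    return U, V

def lrf_oracle(E_obs, O, r=1, K=8, drop=0.05, seed=0):
    """Bootstrap ensemble of K factorizations: each run drops 5% of the
    observed entries before refitting; returns (E_hat, R) as in Equation 2."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(K):
        Ok = O.copy()
        ones = np.argwhere(Ok == 1)
        kill = ones[rng.choice(len(ones), int(drop * len(ones)), replace=False)]
        Ok[kill[:, 0], kill[:, 1]] = 0
        U, V = masked_als(E_obs, Ok, r, rng=rng)
        preds.append(U @ V.T)
    preds = np.stack(preds)                          # shape (K, n, m)
    mean_pred = preds.mean(axis=0)
    E_hat = (1 - O) * mean_pred                      # estimates, unobserved only
    R = np.sqrt(((preds - mean_pred) ** 2).mean(axis=0))   # per-entry spread
    return E_hat, R
```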
This oracle introduces four additional hyperparameters for our UCB-E-LRF algorithm, which deter-
mine performance and efficiency:
• The rank r of the low-rank factorization. The value of r adjusts the expressiveness of
the factorization fit to the data in Equation 1. Setting r requires weighing the
bias-variance trade-off between smaller and larger r: representations with smaller r
inherently introduce larger approximation error, while representations with larger r are
harder to estimate accurately from only a few observations in E^obs.
• The ensemble size K. Sufficiently large K is necessary to avoid additional noise when
estimating Ê and R in Equation 2. Yet, the computational cost of the factorization scales
with K and therefore K cannot be too large.
• The warm-up budget T0. The oracle M requires a minimum number of initial observations
in order to estimate the low-rank representation accurately. Therefore, before the active
selection phase (lines 3-9 in Algorithm 2), we have a warm-up phase (lines 1-2 in Algorithm 2),
in which T0 method-example pairs are sampled uniformly at random for evaluation before moving
on to the "active" phase.
• The batch size b. The oracle M must be repeatedly refit, which involves optimizing the ensemble
of low-rank factorizations. As the computational cost of this procedure cannot be neglected,
we introduce a batch size hyperparameter. At each time step t, we select b examples Jt
paired with the method it (line 4 in Algorithm 2), evaluate the b method-example pairs
(line 6 in Algorithm 2), and update the intermediate variables with these b scores (line 8
in Algorithm 2). Although increasing b is more computationally friendly, since it
calls the oracle M fewer times, it may hurt performance by reducing the granularity of the
decisions, making the algorithm less "active". The choice of b is ultimately a trade-off
between computational cost and performance; however, we will see in the experiments
section that performance does not vary drastically across many choices of b.
We analyze these four UCB-E-LRF hyperparameters in our experiment sections, along with the
uncertainty scaling η in both UCB-E and UCB-E-LRF.
4 Experiments
4.1 Datasets
To assess the performance of our algorithms under a variety of use cases, we test with three datasets
AlpacaEval [44], Grade School Math 8K (GSM8K) [18] and Physical Interaction: Question An-
swering (PIQA), together with different settings of the method set F and the scoring function e;
Table 2 summarizes the statistics of each dataset along with the method set F and scoring function e used.
For AlpacaEval, we design two method sets F. The first contains all LLMs reported by [44].
The second contains the same LLMs except for the annotator model GPT-4 Turbo. The latter set,
which excludes GPT-4 Turbo, makes the learning problem more challenging and interesting, since the
annotator, GPT-4 Turbo, is a clear winner among all LLMs. For GSM8K and PIQA, we set F as a set
of prompts to simulate prompt engineering, or a set of LLMs with various sampling configurations,
to mimic model selection and hyperparameter tuning. We provide more information about these
datasets in Appendix A.1.
Figure 3: Comparison of baselines and our proposed algorithms on six datasets, evaluated with various
top 1 precision metrics and NDCG@10. The vertical axes of all plots represent the performance on a
metric and the horizontal axes represent the percentage of method-example pairs evaluated. All results
are aggregated over 50 trials with different random seeds. Our proposed algorithms UCB-E and
UCB-E-LRF consistently require far fewer evaluations to achieve the same performance as the baselines.
Table 2: Statistics of each dataset used in our experiments. The examples X of each dataset are the
questions from their respective evaluation benchmark. More details of the datasets can be found in
Appendix A.1.
Dataset Name | Size n × m | Method F | Scoring Function e | H1
AlpacaEval | 154 × 805 | Various LLMs | GPT4-turbo annotator | 966
AlpacaEval (Drop Annotator) | 153 × 805 | Various LLMs excluding GPT4-turbo | GPT4-turbo annotator | 4462
GSM8K Prompts | 205 × 784 | Mistral-7B [32] with different prompts | regex match with correct answer | 107445
GSM8K Models | 122 × 1000 | Various LLMs and sampling configurations | regex match with correct answer | 20562
PIQA Prompts | 177 × 1546 | Tulu-7B [58] with different prompts | regex match with correct choice | 66284
PIQA Models | 103 × 1000 | Various LLMs and sampling configurations | regex match with correct choice | 10273
4.2 Metrics
Top 1 Precision. To determine whether an algorithm finds the best method, we can directly check
whether the algorithm's prediction matches the best method in our empirical data. However, because
every method is empirically evaluated on a limited number of examples, one method may have slightly
lower average performance than another even though it would score higher if evaluated on more
examples. To this end, we calculate top 1 precision by first determining a set of methods we consider
equally good and then checking whether the predicted method from an algorithm is a member of that
set. We propose two ways to determine whether a set of methods is equally good: performance gap
and statistical significance.
Figure 4: Ablations of the effect of hyperparameters on the UCB-E-LRF algorithm. The results suggest
that our selected hyperparameters are efficient and effective.
For the performance gap ϵ top 1 precision, we consider all methods whose performance is within ϵ of
the best empirical method to be equally good. For the statistical significance p top 1 precision, we
perform McNemar's statistical test [47] for each method against the best empirical method. If we
cannot reject the null hypothesis that a method's performance is the same as the best empirical
method's at significance level p, then that method is considered equally good as the best method.
We evaluate all baselines and our algorithms at two ϵ values {0.001, 0.01} and two p values
{0.01, 0.1}, simulating diverse precision requirements.
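For 0/1 rule-based scores, the statistical-significance variant can be sketched as below, implementing the exact McNemar test as a binomial test on the discordant pairs; the helper name and the scipy-based construction are ours.

```python
import numpy as np
from scipy.stats import binomtest

def equally_good(correct, p=0.1):
    """correct: (n x m) 0/1 matrix, correct[i, j] = 1 iff method i is right
    on example j. Returns the indices of methods not significantly worse
    than the empirical best (exact McNemar test on discordant pairs)."""
    best = int(np.argmax(correct.mean(axis=1)))
    good = {best}
    for i in range(correct.shape[0]):
        if i == best:
            continue
        b = int(np.sum((correct[best] == 1) & (correct[i] == 0)))
        c = int(np.sum((correct[best] == 0) & (correct[i] == 1)))
        # Fail to reject H0 (equal performance) => i counts as equally good.
        if b + c == 0 or binomtest(b, b + c, 0.5).pvalue >= p:
            good.add(i)
    return good

# Top 1 precision of one trial: 1.0 if the predicted method is in the set.
# precision = float(predicted_index in equally_good(correct, p=0.1))
```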
NDCG. Although our focus is to identify the best method quickly, it is sometimes also desirable to
rank the top-P promising methods highly in general. We therefore evaluate with normalized discounted
cumulative gain (NDCG) at P. NDCG@P takes as input the top-P prediction of an algorithm; the
higher the true ranks of the predicted methods, the higher the NDCG. We choose P = 10 to evaluate
all algorithms.
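The paper does not spell out its gain definition, so the sketch below uses a common NDCG formulation, with each method's true mean score as its gain and logarithmic position discounts.

```python
import numpy as np

def ndcg_at_p(predicted_top, true_scores, P=10):
    """predicted_top: method indices an algorithm ranks highest (length >= P).
    true_scores: true mean score of every method, used as the gain."""
    gains = np.asarray(true_scores, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, P + 2))     # 1 / log2(rank + 1)
    dcg = np.sum(gains[np.asarray(predicted_top[:P])] * discounts)
    idcg = np.sum(np.sort(gains)[::-1][:P] * discounts)
    return float(dcg / idcg)
```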
We now introduce the set-up of our two algorithms, UCB-E and UCB-E-LRF, together with three
baselines: Row Mean Imputation, Filled Subset, and LRF.
UCB-E. Our proposed algorithm as shown in Algorithm 1. We use η = 1 since this consistently
yields the best performance across all datasets.
UCB-E-LRF. Our proposed algorithm as shown in Algorithm 2. For all datasets, we use rank r = 1
for the low-rank factorization with an ensemble size of K = 64. We use 5% of the data for warm-up,
i.e., T0 = 0.05 × n × m, and η = 5. We use a batch size of b = 32.
Row Mean Imputation. In this baseline, we select method-example pairs uniformly at random
from all pairs. The score for each method is calculated as the average of all the scores evaluated so
far for that method. The detailed algorithmic description is in Algorithm 3.
Filled Subset. Instead of randomly selecting method-example pairs, filled subset first selects an
example index j uniformly at random. Then all methods are evaluated on example j. Once all methods
have been evaluated on it, the algorithm selects a new example index from the remaining ones. The
score for each method is calculated as the average of all the scores evaluated so far for that method.
The detailed algorithmic description is in Algorithm 4. Although row mean imputation and filled
subset do not have a learning component, they both produce an unbiased estimate of each method's
true score at all times.
LRF. Similar to row mean imputation, LRF (low-rank factorization) randomly selects method-example
pairs to evaluate. To estimate the score of each method, LRF averages over all examples: for
evaluated examples, the score is the actual observed score, and for examples yet to be evaluated, the
score is the estimate from the low-rank factorization. The detailed algorithmic description is in
Algorithm 5.
In Figure 3, we plot the performance of baselines and our algorithms on all six datasets (columns)
as the budget T increases from 5% to 100% of the total number of method-example pairs. In each
row, we evaluate the algorithms on a different metric (either top 1 precision with different ϵ, p or
NDCG@10). Each line in the figure is the average result over 50 independent trials with different
seeds. For example, an average top 1 precision of 0.9 indicates that 45 out of 50 trials predict the best
method correctly at the budget level its x-coordinate represents. In addition, we calculate H1 as
defined in Corollary 1, which quantifies the difficulty of finding the best method in a dataset.
Intuitively, a higher H1 value suggests that the method set F is large or that many methods have
performance similar to the best one, making them hard to distinguish. In Appendix A.1 Figure 5, we
also plot histograms showing the distribution of performance across F on these datasets.
How do our algorithms compare with the baselines? As seen in Figure 3, both UCB-E and
UCB-E-LRF consistently achieve high precision and NDCG with a much smaller budget than the
baselines. For example, on AlpacaEval (Drop Annotator), our proposed algorithms reach a precision
of 1 with just 8% of the budget, whereas the baselines require 80-90%, an order of magnitude more.
These results suggest that it is entirely possible to identify the best method without exhaustively
evaluating all method-example pairs. They further demonstrate that the active selection algorithms
(our two algorithms) are more efficient than the non-active ones (the three baselines). Additionally,
the better NDCG performance of our proposed algorithms shows that they rank top-performing
methods more accurately.
How do our two algorithms, UCB-E and UCB-E-LRF, compare? The datasets from left to right
are ranked by the hardness indicated by H1. Interestingly, we find that on easier datasets such as
AlpacaEval, UCB-E performs better and saves 2-3% more budget than UCB-E-LRF, while on harder
datasets, such as GSM8K Prompts and PIQA Prompts, UCB-E-LRF achieves high precision faster
than UCB-E, saving about 10% in absolute budget. These observations hint at which algorithm to
apply in practice.
We provide more empirical analysis below; the ablation results are shown in Figure 4.
Does H1 correctly reflect the hardness of a dataset in the empirical experiments? Yes. In Figure 3,
as H1 decreases, the percentage of the matrix that must be evaluated to reach a precision of 1 also
generally decreases, from more than 20% to just under 5%. Moreover, the H1 values appear related to
the tasks: prompt engineering datasets typically have higher H1, possibly due to the homogeneity of
prompt performance with the same LLM. In contrast, on datasets that benchmark different LLMs,
such as AlpacaEval, it is much easier to find the best-performing model.
Score Only Ablation. To study the effect of using the uncertainty matrix R to select both the next
method and the next example, we consider an ablation of UCB-E-LRF, UCB-E-LRF (Score Only),
which reverts the use of R to that of UCB-E: the upper confidence bound is computed from the
original UCB-E confidence interval added to the mean estimate from the low-rank factorization, and
the example is uniformly selected, as in UCB-E. The detailed algorithmic description can be found
in Algorithm 6. On the hard dataset GSM8K Prompts, where UCB-E-LRF has a clear benefit over
UCB-E, there is a significant gap between UCB-E-LRF and UCB-E-LRF (Score Only). This means
the uncertainty matrix R is a crucial contributor to UCB-E-LRF's advantage over UCB-E.
Batch Size b Ablation. Intuitively, a smaller batch size is more flexible and enables more fine-grained
selection. We experiment with b ∈ {2, 8, 32, 128} and, as shown in the plot, the smallest batch size
gives the best performance, especially on harder datasets, whereas a batch size of 128 significantly
degrades performance. However, a small batch size incurs more overall computational cost because
the low-rank factorization is refit more often; for example, b = 2 takes almost 16 times as long as
b = 32 to achieve a precision of 1.0, which is inefficient. To balance performance and computational
cost, we use a batch size of 32 in our experiments.
Ensemble Size K Ablation. We vary the ensemble size K ∈ {4, 16, 64, 256} and find that a very
small ensemble gives suboptimal performance on hard datasets, while there is almost no performance
difference on an easy dataset like AlpacaEval (Drop Annotator). Performance is robust to different
K as long as K is at least 64.
Uncertainty Scaling Factor η Ablation. We experiment with different uncertainty scaling values
{0.1, 0.5, 1, 5, 10}. As Figure 4 shows, a wide range of η, from 0.1 to 5, gives similar performance
on both datasets, demonstrating robustness to the choice of η.
Rank r Ablation. As discussed in Section 3, r adjusts the bias-variance trade-off. Empirically, we
experiment with r ∈ {1, 2, 5} and find that r = 1 is consistently better than larger values. A larger r
requires more evaluated data to prevent overfitting, which is a disadvantage in the limited-budget
setting. We therefore recommend r = 1 for all datasets.
5 Related Work
Best-arm identification in multi-armed bandits. The goal of best-arm identification [6, 2] is to
find the arm with the highest reward by pulling arms and observing feedback. By analogy, in our
problem the method f_i is the arm and the score µ_i of f_i is the reward. There are two ways to define
the best-arm identification problem: fixed budget and fixed confidence. In the fixed budget setting,
the budget of arm pulls is fixed and the algorithm is designed to maximize the chance of identifying
the correct best arm; the problem defined in this paper has a similar evaluation budget. UCB-E [2]
and Successive Elimination [2] are two pioneering algorithms for this setting, followed by a line
of improvements [30, 38, 10, 39]. Qin [51] states optimal fixed-budget best-arm identification as an
open problem. In the fixed confidence setting, algorithms [35, 37, 31] aim to minimize the number of
arm pulls needed to guarantee a given confidence of returning the best arm. Garivier and Kaufmann
[24] give an optimal algorithm in terms of the minimum number of arm pulls. Another extension
beyond fixed budget or confidence is the PAC learning framework, where the target is to maximize
the chance of finding an approximately best arm, with a tolerance of ε gap to the highest reward
[36, 31, 12].
Low-rank factorization for (noisy) matrix completion. As shown in the objective of Equation 1,
low-rank factorization is a non-convex optimization problem. One line of work therefore focuses on
how to solve this non-convex optimization [9, 8, 14], while another line [13, 7] studies the
approximation error between the estimated low-rank matrix and the target matrix in terms of the
sampling probability p, assuming each entry is observed i.i.d. with probability p and the additive
noise on each observation is i.i.d. Gaussian.
LLM performance evaluation functions. Tremendous effort has been devoted to developing
effective evaluation functions to assess the quality of open-ended generations from language models.
Early work in this direction, such as BLEU [50] and ROUGE [46], is rule-based, using lexical
overlap to quantify the similarity between a generated response and a reference. However, lexical
overlap may not align well with the underlying semantics of the text. The shortcomings of these
rule-based evaluation functions motivated a line of work that uses language models [65, 55, 63] to
evaluate generations. A seminal work in this area is BERTScore, which uses embeddings from a
BERT model [21] to compute similarity. More recently, LLM-as-a-Judge [66] proposes using
instruction-tuned large language models to evaluate generations. Zheng et al. [66] find that costly
proprietary models such as GPT4 have a high agreement rate with humans.
LLM performance evaluation benchmarks. Diverse benchmarks have been developed to evaluate
LLM performance across various domains, including natural language understanding for translation
and sentiment analysis [4, 54, 56], mathematical and common-sense reasoning [18, 29, 3], and text
retrieval and question answering [61, 33, 20]. We refer readers to [11] for a more comprehensive
discussion of existing benchmarks. The commonality of these benchmarks is that a ground truth
answer is typically given, so once the responses from LLMs are generated, they can be evaluated
fairly efficiently without incurring a large amount of overhead such as money and compute. Recently,
a new type of benchmark, or more precisely leaderboard, has been created that attempts to compare
various LLMs to each other. The typical setup provides the same prompt to two LLMs and has their
responses compared side by side by another LLM, which determines which response is superior and
thus wins. Two notable leaderboards are AlpacaEval [44] and Chatbot Arena [15]. AlpacaEval has a
static prompt dataset, and for each prompt and LLM, the response is compared with that of a baseline
model (GPT3 or GPT4). For Chatbot Arena, the prompts are submitted by users and two random
LLMs are selected to respond to each user-written prompt. For both benchmarks, the overall
performance of LLMs is aggregated by computing an average win rate or Elo score from the pairwise
comparison statistics. Since the evaluation requires another LLM to decide the better response, it
incurs additional compute and/or monetary cost compared to benchmarks that evaluate models
against ground truth answers.
7 Acknowledgement
We thank Justin Lovelace and Varsha Kishore for their helpful discussion and feedback on the paper
draft. JPZ is supported by a grant from the Natural Sciences and Engineering Research Council of
Canada (NSERC) (567916). CKB and CPG are supported by the Cornell University AI for Science
Institute, the AI Climate Institute, the National Science Foundation, the National Institute of Food
and Agriculture, and the Air Force Office of Scientific Research. RW is supported by grants from the
National Science Foundation (NSF) (CIF-2402817, CNS-1804829, SaTC-2241100, CCF-2217058),
ARO-MURI (W911NF2110317), and ONR under N00014-24-1-2304. Wen Sun is supported by NSF
IIS-2154711 and NSF CAREER 2339395. This research is also supported by grants from the National
Science Foundation NSF (IIS-2107161, and IIS-1724282, HDR-2118310), the Cornell Center for
Materials Research with funding from the NSF MRSEC program (DMR-1719875), DARPA, arXiv,
LinkedIn, and the New York Presbyterian Hospital.
References
[1] K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.
[2] J.-Y. Audibert and S. Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd
Conference on Learning Theory, 2010.
[3] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. Piqa: Reasoning about physical com-
monsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence,
2020.
[4] O. r. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno Yepes,
P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Neveol, M. Neves, M. Popel, M. Post,
R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri. Findings of the
2016 conference on machine translation. In Proceedings of the First Conference on Machine
Translation, pages 131–198, Berlin, Germany, August 2016. Association for Computational
Linguistics. URL http://www.aclweb.org/anthology/W/W16/W16-2301.
[5] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural
information processing systems, 33:1877–1901, 2020.
[6] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In
Algorithmic Learning Theory: 20th International Conference, ALT 2009, Porto, Portugal,
October 3-5, 2009. Proceedings 20, pages 23–37. Springer, 2009.
[7] C. Cai, G. Li, H. V. Poor, and Y. Chen. Nonconvex low-rank tensor completion from noisy data.
Advances in neural information processing systems, 32, 2019.
[8] E. Candes and B. Recht. Exact matrix completion via convex optimization. Communications of
the ACM, 55(6):111–119, 2012.
[9] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion.
IEEE transactions on information theory, 56(5):2053–2080, 2010.
[10] O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz. Kullback-leibler upper
confidence bounds for optimal sequential allocation. The Annals of Statistics, pages 1516–1541,
2013.
[11] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al.
A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and
Technology, 15(3):1–45, 2024.
[12] A. R. Chaudhuri and S. Kalyanakrishnan. Pac identification of many good arms in stochastic
multi-armed bandits. In International Conference on Machine Learning, pages 991–1000.
PMLR, 2019.
[13] Y. Chen, Y. Chi, J. Fan, C. Ma, and Y. Yan. Noisy matrix completion: Understanding statistical
guarantees for convex relaxation via nonconvex optimization. SIAM journal on optimization,
30(4):3098–3121, 2020.
[14] Y. Chi, Y. M. Lu, and Y. Chen. Nonconvex optimization meets low-rank matrix factorization:
An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269, 2019.
[15] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu,
M. Jordan, J. E. Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human
preference. arXiv preprint arXiv:2403.04132, 2024.
[16] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W.
Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal
of Machine Learning Research, 24(240):1–113, 2023.
[17] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think
you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint
arXiv:1803.05457, 2018.
[18] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek,
J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word
problems. arXiv preprint arXiv:2110.14168, 2021.
[19] P. Dasigi, K. Lo, I. Beltagy, A. Cohan, N. A. Smith, and M. Gardner. A dataset of information-
seeking questions and answers anchored in research papers. 2021.
[20] M. Dell, J. Carlson, T. Bryan, E. Silcock, A. Arora, Z. Shen, L. D’Amico-Wong, Q. Le,
P. Querubin, and L. Heldring. American stories: A large-scale structured text dataset of
historical u.s. newspapers, 2023.
[21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[22] J. Dhamala, T. Sun, V. Kumar, S. Krishna, Y. Pruksachatkun, K.-W. Chang, and R. Gupta.
Bold: Dataset and metrics for measuring biases in open-ended language generation. In
Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency,
FAccT ’21, page 862–872, New York, NY, USA, 2021. Association for Computing Machinery.
ISBN 9781450383097. doi: 10.1145/3442188.3445924. URL https://doi.org/10.1145/
3442188.3445924.
[23] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. Chapman and Hall/CRC, 1994.
[24] A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. In
Conference on Learning Theory, pages 998–1027. PMLR, 2016.
[25] M. Grusky, M. Naaman, and Y. Artzi. Newsroom: A dataset of 1.3 million summaries with
diverse extractive strategies. arXiv preprint arXiv:1804.11283, 2018.
[26] T. Hastie, R. Mazumder, J. D. Lee, and R. Zadeh. Matrix completion and low-rank svd via fast
alternating least squares. The Journal of Machine Learning Research, 16(1):3367–3402, 2015.
[27] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik,
H. He, D. Song, and J. Steinhardt. Measuring coding challenge competence with apps, 2021.
[28] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring
massive multitask language understanding. Proceedings of the International Conference on
Learning Representations (ICLR), 2021.
[29] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Stein-
hardt. Measuring mathematical problem solving with the math dataset. arXiv preprint
arXiv:2103.03874, 2021.
[30] J. Honda and A. Takemura. An asymptotically optimal policy for finite support models in the
multiarmed bandit problem. Machine Learning, 85:361–391, 2011.
[31] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil’ucb: An optimal exploration algorithm
for multi-armed bandits. In Conference on Learning Theory, pages 423–439. PMLR, 2014.
[32] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand,
G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril,
T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023.
[33] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu. Pubmedqa: A dataset for biomedical
research question answering. arXiv preprint arXiv:1909.06146, 2019.
[34] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. triviaqa: A Large Scale Distantly Supervised
Challenge Dataset for Reading Comprehension. arXiv e-prints, art. arXiv:1705.03551, 2017.
[35] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. Pac subset selection in stochastic
multi-armed bandits. In ICML, volume 12, pages 655–662, 2012.
[36] Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In
International conference on machine learning, pages 1238–1246. PMLR, 2013.
[37] E. Kaufmann and S. Kalyanakrishnan. Information complexity in bandit subset selection. In
Conference on Learning Theory, pages 228–251. PMLR, 2013.
[38] E. Kaufmann, O. Cappé, and A. Garivier. On bayesian upper confidence bounds for bandit
problems. In Artificial intelligence and statistics, pages 592–600. PMLR, 2012.
[39] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best-arm identification in
multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
[40] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot
reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
[41] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein,
I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M.-W. Chang, A. Dai,
J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: a benchmark for question answering
research. Transactions of the Association of Computational Linguistics, 2019.
[42] N. Lambert, V. Pyatkin, J. Morrison, L. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar,
T. Zick, Y. Choi, N. A. Smith, and H. Hajishirzi. Rewardbench: Evaluating reward models for
language modeling, 2024.
[43] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone,
C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language
models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
[44] X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto.
Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/
tatsu-lab/alpaca_eval, 2023.
[45] W. Liang, Y. Zhang, Z. Wu, H. Lepp, W. Ji, X. Zhao, H. Cao, S. Liu, S. He, Z. Huang, et al.
Mapping the increasing use of llms in scientific papers. arXiv preprint arXiv:2404.01268, 2024.
[46] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization
branches out, pages 74–81, 2004.
[47] Q. McNemar. Note on the sampling error of the difference between correlated proportions or
percentages. Psychometrika, 12(2):153–157, 1947.
[48] R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al. Abstractive text summarization using
sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023, 2016.
[49] S. Narayan, S. B. Cohen, and M. Lapata. Don’t give me the details, just the summary!
topic-aware convolutional neural networks for extreme summarization. arXiv preprint
arXiv:1808.08745, 2018.
[50] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation
of machine translation. In Proceedings of the 40th annual meeting of the Association for
Computational Linguistics, pages 311–318, 2002.
[51] C. Qin. Open problem: Optimal best arm identification with fixed-budget. In Conference on
Learning Theory, pages 5650–5654. PMLR, 2022.
[52] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine
comprehension of text. In J. Su, K. Duh, and X. Carreras, editors, Proceedings of the 2016
Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin,
Texas, Nov. 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264.
URL https://aclanthology.org/D16-1264.
[53] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R.
Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023.
[54] V. Sahayak, V. Shete, and A. Pathan. Sentiment analysis on twitter data. International Journal
of Innovative Research in Advanced Engineering (IJIRAE), 2(1):178–183, 2015.
[55] T. Sellam, D. Das, and A. P. Parikh. Bleurt: Learning robust metrics for text generation. arXiv
preprint arXiv:2004.04696, 2020.
[56] S. Singh, F. Vargus, D. Dsouza, B. F. Karlsson, A. Mahendiran, W.-Y. Ko, H. Shandilya, J. Patel,
D. Mataciunas, L. OMahony, M. Zhang, R. Hettiarachchi, J. Wilson, M. Machado, L. S. Moura,
D. Krzemiński, H. Fadaei, I. Ergün, I. Okoh, A. Alaagib, O. Mudannayake, Z. Alyafeai, V. M.
Chien, S. Ruder, S. Guthikonda, E. A. Alghamdi, S. Gehrmann, N. Muennighoff, M. Bartolo,
J. Kreutzer, A. Üstün, M. Fadaee, and S. Hooker. Aya dataset: An open-access collection for
multilingual instruction tuning, 2024.
[57] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task
benchmark and analysis platform for natural language understanding. 2019. In the Proceedings
of ICLR.
[58] Y. Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. R. Chandu, D. Wadden, K. MacMillan,
N. A. Smith, I. Beltagy, and H. Hajishirzi. How far can camels go? exploring the state of
instruction tuning on open resources, 2023.
[59] Z. Wen, W. Yin, and Y. Zhang. Solving a low-rank factorization model for matrix completion
by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation,
4(4):333–361, 2012.
[60] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen. Large language models as
optimizers. arXiv preprint arXiv:2309.03409, 2023.
[61] Y. Yang, W.-t. Yih, and C. Meek. Wikiqa: A challenge dataset for open-domain question
answering. In Proceedings of the 2015 conference on empirical methods in natural language
processing, pages 2013–2018, 2015.
[62] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning.
HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on
Empirical Methods in Natural Language Processing (EMNLP), 2018.
[63] W. Yuan, G. Neubig, and P. Liu. Bartscore: Evaluating generated text as text generation.
Advances in Neural Information Processing Systems, 34:27263–27277, 2021.
[64] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really
finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
[65] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. Bertscore: Evaluating text
generation with bert. arXiv preprint arXiv:1904.09675, 2019.
[66] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing,
et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information
Processing Systems, 36, 2024.
[67] A. M. Zoubir and B. Boashash. The bootstrap and its application in signal processing. IEEE
signal processing magazine, 15(1):56–76, 1998.
A Appendix
Figure 5: Histograms of model performance on all six datasets along with their H1 values. Datasets
such as AlpacaEval have a larger gap between the best and second-best methods, which leads to a
much smaller H1 and makes identifying the best method easier.
GSM8K Prompts and PIQA Prompts. To simulate a prompt engineering use case, we create two
datasets: GSM8K Prompts and PIQA Prompts. For these datasets, we ask GPT4 to generate 205 and
177 prompts, respectively, following the prompt engineering work of Yang et al. [60]. The LLMs used
to perform inference on the datasets are Mistral-7B [32] and Tulu-7B [58], respectively. We evaluate
the prompts on 784 and 1546 questions from the training sets (about 10% of the size of the two
training sets). For scoring, we extract the final answers from the responses generated by the LLMs
and compare them with the ground truth answers (for GSM8K) or the ground truth choice (for PIQA).
GSM8K Models and PIQA Models. Lastly, for the model selection and hyperparameter tuning
use case, we similarly create two more datasets on GSM8K and PIQA. Instead of each method
being a prompt, each method is now a combination of an LLM with a sampling configuration.
Specifically, we use 11 publicly available models: GPT2 3, GPT2-Large 4, CodeLLaMA 5, Tulu-7B
2 https://github.com/tatsu-lab/alpaca_eval
3 https://huggingface.co/openai-community/gpt2
4 https://huggingface.co/openai-community/gpt2-large
5 https://huggingface.co/codellama/CodeLlama-7b-python-hf
Algorithm 3 Row Mean Imputation (A_rmi(T, F, X))
Input: The evaluation budget T, a set of methods F and a set of examples X.
Output: The prediction î∗ for the best method i∗.
1: ∀i, the set of evaluated examples S_i = {}.
2: for t = 1, ..., T do
3:   Select: Draw i_t ∈ [n] uniformly at random; draw j_t ∈ [m] \ S_{i_t} uniformly at random.
4:   Evaluate: Run inference for the method-example pair (f_{i_t}, x_{j_t}), send the result to the annotator, and receive e(f_{i_t}, x_{j_t}); E^obs_{i_t,j_t} ← e(f_{i_t}, x_{j_t}).
5:   Update: S_{i_t} ← S_{i_t} ∪ {j_t}.
6: end for
Return: î∗ = arg max_i (Σ_{j ∈ S_i} E^obs_{i,j}) / |S_i|
6, Tulu-2-7B 7, Gemma-7B 8, Phi2 9, Llema-7B 10, LLaMA-2-7B 11, Mistral-7B 12, and StarCoder-7B 13.
We also have three temperature choices {0, 0.5, 1}, two maximum decoding length choices
{128, 512}, and two zero-shot prompt choices (directly asking for the answer, and "Let's think step
by step"). The Cartesian product of all these choices gives 132 different combinations in total as our
methods. Depending on the dataset, some of these configurations ran out of memory on an Nvidia
3090 when we collected our data; we drop these to simulate real-world scenarios. For the examples,
we randomly select 1000 questions from each dataset.
The algorithms for Row Mean Imputation, Filled Subset, LRF, and UCB-E-LRF (Score Only) are
shown in Algorithms 3, 4, 5, and 6, respectively.
6 https://huggingface.co/TheBloke/tulu-7B-fp16
7 https://huggingface.co/allenai/tulu-2-7b
8 https://huggingface.co/google/gemma-7b
9 https://huggingface.co/microsoft/phi-2
10 https://huggingface.co/EleutherAI/llemma_7b
11 https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
12 https://huggingface.co/mistralai/Mistral-7B-v0.1
13 https://huggingface.co/bigcode/starcoder2-7b
Algorithm 5 LRF (A_lrf(T, F, X; M, r, K, T_0, b))
Input: The evaluation budget T, a set of methods F and a set of examples X, the LRF oracle M, the rank r, the number of solutions in the ensemble K, the warm-up budget T_0, the batch size b.
Output: The prediction î∗ for the best method i∗.
1: Uniformly sample T_0 method-example pairs from [n] × [m] and get the observed scoring matrix E^obs ∈ ([0, 1] ∪ {?})^{n×m} and the observation matrix O ∈ {0, 1}^{n×m} w.r.t. these T_0 evaluations.