
tinyBenchmarks: evaluating LLMs with fewer examples

Felipe Maia Polo 1 Lucas Weber 2 Leshem Choshen 3 4 Yuekai Sun 1 Gongjun Xu 1 Mikhail Yurochkin 3 5

Abstract

The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities. These benchmarks consist of tens of thousands of examples, making evaluation of LLMs very expensive. In this paper, we investigate strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. For example, we show that to accurately estimate the performance of an LLM on MMLU, a popular multiple-choice QA benchmark consisting of 14K examples, it is sufficient to evaluate this LLM on 100 curated examples. We release evaluation tools and tiny versions of popular benchmarks: Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results1.

arXiv:2402.14992v1 [cs.CL] 22 Feb 2024

1. Introduction

Large Language Models (LLMs) have demonstrated remarkable abilities to solve a diverse range of tasks (Brown et al., 2020). Quantifying these abilities and comparing different LLMs became a challenge that led to the development of several key benchmarks, e.g., MMLU (Hendrycks et al., 2020), Open LLM Leaderboard (Beeching et al., 2023), HELM (Liang et al., 2022), and AlpacaEval (Li et al., 2023). These benchmarks are comprised of hundreds or thousands of examples, making the evaluation of modern LLMs with billions of parameters computationally, environmentally, and financially very costly. For example, Liang et al. (2022) report that evaluating the performance of a single LLM on HELM costs over 4K GPU hours (or over $10K for APIs). Benchmarks like AlpacaEval (Li et al., 2023) also require a commercial LLM as a judge to perform evaluation, further increasing the costs. Furthermore, evaluation of a single model is often performed many times to monitor checkpoints during pre-training (Biderman et al., 2023a; Liu et al., 2023) and to explore different prompting strategies or a wider range of hyperparameters (Weber et al., 2023b; Mizrahi et al., 2023; Sclar et al., 2023; Voronov et al., 2024).

Figure 1. Estimating accuracy on MMLU (true accuracy) using 100 curated examples (predicted accuracy). IRT++, our best-performing evaluation strategy, predicts the accuracy of recent LLMs released between December 30th and January 18th within 1.9% of their true accuracy on all of MMLU (14K examples).

Our work reassesses the need to evaluate LLMs on such large benchmark datasets. In Figure 1 we demonstrate the efficacy of our best evaluation strategy on MMLU, where we compare accuracy estimates obtained from evaluating LLMs on a curated subset of 100 examples (less than 1% of the examples) to accuracy on all of MMLU, achieving average estimation error under 2%.

1 Department of Statistics, University of Michigan, USA; 2 Department of Translation and Language Sciences, Universitat Pompeu Fabra, Spain; 3 IBM Research; 4 MIT; 5 MIT-IBM Watson AI Lab. Correspondence to: Felipe Maia Polo <[email protected]>.

1 To use our methods for efficient LLM evaluation, please check https://github.com/felipemaiapolo/tinyBenchmarks. This repository includes a Python package for model evaluation and tutorials. Additionally, we have uploaded tiny datasets on huggingface.co/tinyBenchmarks and developed a Google Colab demo in which you can easily use our tools to estimate LLM performances on MMLU. To reproduce the results in this paper, please check this GitHub repository.


We consider a range of evaluation strategies (§3):

1. Stratified random sampling as proposed by Perlitz et al. (2023) for HELM. This approach is the simplest to use but can result in a large estimation error.

2. Clustering examples based on LLMs that have already been evaluated. The key idea is to find examples where (in)correct prediction of an LLM implies that it will also be (in)correct on a subset of other examples. This method performs well in some settings but can be unreliable when such correctness patterns are spurious, e.g., when predicting the accuracy of an LLM specialized to a domain. This strategy is inspired by the Anchor Points method (Vivek et al., 2023), which clusters models' confidence in the correct class for faster evaluation on classification tasks.

3. New strategies built using Item Response Theory (IRT) (Lord et al., 1968) for evaluating individuals through standardized tests. Applying IRT to LLMs viewed as testees and benchmarks as tests, we learn representations of examples encoding the latent abilities required to perform well on these examples. Clustering these representations allows us to find a more robust evaluation set. Furthermore, using the IRT model, we develop tools for improving benchmark accuracy estimates obtained with an arbitrary set of examples.

We present an extensive evaluation of these strategies on four popular benchmarks (§5): Open LLM Leaderboard (Beeching et al., 2023), MMLU (Hendrycks et al., 2020), HELM (Liang et al., 2022), and AlpacaEval 2.0 (Li et al., 2023). Our goal is to assess the effectiveness of estimating the performance of LLMs on these benchmarks using a limited number of examples for evaluation. Overall, we conclude that 100 curated examples per scenario are enough to reliably estimate the performance of various LLMs, within about 2% error on average. Based on our findings we release tiny (100 examples per scenario) versions of every considered benchmark and IRT-based tools for further improving the performance estimation.

1.1. Related work

Efficient benchmarking of LLMs. Multi-dataset benchmarks were introduced to the field of NLP with the advent of pre-trained models (e.g., Wang et al., 2018) and constantly evolved in lockstep with language model capabilities (Srivastava et al., 2022). The ever-increasing size of models and datasets consequently led to high evaluation costs, triggering changes in reported evaluation to accommodate the costs (Biderman et al., 2023b). Ye et al. (2023) considered reducing the number of tasks in Big-bench (Srivastava et al., 2022). Perlitz et al. (2023) found that evaluation on HELM (Liang et al., 2022) relies on diversity across datasets, but the number of examples currently used is excessive. We adopt their stratified sampling approach as one of the efficient evaluation strategies. Vivek et al. (2023) proposed clustering evaluation examples based on models' confidence in the correct class for faster evaluation on classification tasks. One of the approaches we consider is based on an adaptation of their method to popular LLM benchmarks with more diverse tasks.

Item response theory (IRT). IRT (Cai et al., 2016; Van der Linden, 2018; Brzezińska, 2020; Lord et al., 1968) is a well-established set of statistical models used in psychometrics to measure the latent abilities of individuals through standardized testing (An & Yung, 2014; Kingston & Dorans, 1982; Petersen et al., 1982), e.g., in the GRE, SAT, etc. Even though IRT methods have traditionally been used in psychometrics, they are becoming increasingly popular among researchers in the fields of artificial intelligence and natural language processing (NLP). For instance, Lalor et al. (2016) propose using IRT's latent variables to measure language model abilities, Vania et al. (2021) employ IRT models in the context of language model benchmarking to study saturation (un-discriminability) of commonly used benchmarks, and Rodriguez et al. (2021) study several applications of IRT in the context of language models, suggesting that IRT models can be reliably used to predict responses of LLMs on unseen items, categorize items (e.g., according to their difficulty/discriminability), and rank models. However, to the best of our knowledge, IRT has not been used for performance estimation in the context of efficient benchmarking of LLMs. We explore this new path.

Active testing. Another line of related work concerns active learning (Ein-Dor et al., 2020) and especially active testing. In such works, evaluation examples are chosen dynamically using various criteria (Ji et al., 2021; Kossen et al., 2021) to minimize annotation costs. Those methods are somewhat similar to the adaptive IRT we discuss in §6.

2. Problem statement

In this section, we describe in detail the setup we work with and what our objectives are. Consider that a benchmark is composed of scenarios and possibly sub-scenarios. For example, MMLU and GSM8K are scenarios of both the Open LLM Leaderboard and HELM (Lite) (when seen as separate benchmarks, we treat MMLU and AlpacaEval as single-scenario benchmarks), while MMLU has different sub-scenarios like "marketing", "elementary mathematics", and so on. Furthermore, each scenario (or sub-scenario) is composed of examples (analogous to "items" in the IRT literature) that are small tests to be solved by the LLMs; these examples range from multiple-choice questions to text summarization tasks. Our final objective is to estimate the performance of LLMs on the full benchmark, which is given by the average of the performances in individual scenarios (Open LLM Leaderboard, MMLU, AlpacaEval 2.0) or the mean win rate (HELM).


We achieve this objective by first estimating the performance of LLMs in individual scenarios and then aggregating scores. When scenarios have sub-scenarios, it is usually the case that the scenario performance is given by a simple average of sub-scenario performances. The main concern is that each scenario/sub-scenario is composed of hundreds or thousands of examples, making model evaluation costly.

In this work, for a fixed benchmark, we denote the set of examples of each scenario j as Ij, implying that the totality of examples in the benchmark is given by I = ∪j Ij. When an LLM l interacts with an example i ∈ Ij, the system behind the benchmark generates a score that we call "correctness" and denote as Yil. In all the benchmarks we consider in this work, the correctness is either binary, i.e., Yil ∈ {0, 1} (incorrect/correct), or bounded, i.e., Yil ∈ [0, 1], denoting a degree of correctness. The second case applies in situations in which, for instance, there might not be just one correct answer for example i. To simplify the exposition in the main text, we assume that the score for LLM l in scenario j is just the simple average of the correctness of all items in that scenario, that is, (1/|Ij|) Σ_{i∈Ij} Yil. That is not true when different sub-scenarios have different numbers of examples; in that case, one would have to use a weighted average instead, to make sure every sub-scenario is equally important (in the experiments, we consider this case). In Appendix A, we explain in more detail how things change when a scenario has subscenarios with different sizes but we want to give the same importance to each one of them. Our objective when evaluating a model l is to choose a small fraction of examples Îj ⊂ Ij, compute the correctness of model l on every example of Îj, and then use the available data to estimate what the average score of that model in Ij, i.e., (1/|Ij|) Σ_{i∈Ij} Yil, would be. In the next section, we describe strategies for how Îj can be chosen and how the overall score can be estimated.

3. Selecting evaluation examples

In this section, we describe strategies for selecting examples from a fixed scenario j, i.e., Ij, obtaining the set Îj ⊂ Ij described in Section 2. Ideally, the set of selected examples should be representative of the whole set of items in scenario j, that is,

\sum_{i \in \hat{I}_j} w_i Y_{il} \;\approx\; \frac{1}{|I_j|} \sum_{i \in I_j} Y_{il},    (3.1)

for nonnegative weights {wi}_{i∈Îj} such that Σ_{i∈Îj} wi = 1. In the next paragraphs, we describe two possible ways of obtaining Îj and {wi}_{i∈Îj}.

3.1. Stratified random sampling

In some settings (e.g., classifiers; Katariya et al., 2012), it is useful to perform stratified random sampling: subsampling examples while ensuring the representation of certain groups of data. Using subscenarios as the strata for stratified random sampling was proposed by Perlitz et al. (2023) when subsampling examples from HELM scenarios. The authors showed that this is an effective way of sampling examples without too much loss in the ability to rank LLMs by performance. Examples should be randomly selected from sub-scenarios (with uniform probability) in a way such that the difference in the number of examples sampled for two distinct subscenarios is minimal (≤ 1). The rationale behind this method is that, for an effective evaluation, sub-scenarios should be equally represented. The weights are wi = 1/|Îj| for all i ∈ Îj. (A short code sketch of both selection strategies follows §3.2 below.)

3.2. Clustering

Assessing the performance of LLMs on a randomly sampled subset of examples suffers from extra uncertainty in the sampling process, especially when the number of sampled examples is small. Instead, we consider selecting a subset of representative examples using clustering. Vivek et al. (2023) proposed to cluster examples based on the confidence of models in the correct class corresponding to these examples. Representative examples from these clusters, which they call "anchor points", can then be used to evaluate models on classification tasks more efficiently. We adapt their clustering approach to a more general setting, allowing us to extract such anchor points for MMLU, AlpacaEval 2.0, and all scenarios of the Open LLM Leaderboard and HELM.

First, we propose to group examples by model correctness, expecting some examples to represent the rest. Ideally, if example i is an anchor point, then there will be a big set of examples on which models are correct if and only if they get example i correct. The same idea applies when correctness is given by a number in [0, 1]. Assume that we want to select K anchor points and have access to the training set Dtr = {Yl}_{l∈Ltr}, where Yl is a vector in which each entry is given by the correctness score Yil for the examples i ∈ Ij. We represent each example i ∈ Ij by the embedding Ei ∈ R^{|Ltr|}, which is a vector with entries given by Yil for l ∈ Ltr, and then run K-Means (Hastie et al., 2009) with the number of clusters equal to K. After the K centroids are obtained, we find the closest example to each centroid, and each of those points will compose Îj. For a new LLM l ∉ Ltr to be evaluated, we can obtain an estimate of its performance using the estimate in equation 3.1 by setting wi as the fraction of points in Ij assigned to cluster/anchor point i. This method is compelling and simple in detecting anchor points. Still, it can suffer from distribution shifts, since correctness patterns can vary, e.g., over time, and from the curse of dimensionality when |Ltr| is big. Our second approach is intended to be more robust to those problems.
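To make the two selection strategies above concrete, here is a minimal sketch in Python. The function names, array layouts, and toy data are ours (this is not the released tinyBenchmarks package), and it assumes a matrix of correctness scores for the training LLMs is available.

```python
import numpy as np
from sklearn.cluster import KMeans

def stratified_sample(example_ids, subscenario_of, n_total, seed=0):
    """Section 3.1: spread n_total examples as evenly as possible across
    sub-scenarios (counts differ by at most 1); weights are uniform."""
    rng = np.random.default_rng(seed)
    strata = {}
    for i in example_ids:
        strata.setdefault(subscenario_of[i], []).append(i)
    base, extra = divmod(n_total, len(strata))
    chosen = []
    for k, name in enumerate(sorted(strata)):
        n_k = min(base + (1 if k < extra else 0), len(strata[name]))
        chosen += list(rng.choice(strata[name], size=n_k, replace=False))
    return chosen, np.full(len(chosen), 1.0 / len(chosen))

def anchor_points(Y_train, K):
    """Section 3.2: embed example i as its correctness across training LLMs,
    cluster with K-Means, keep the example closest to each centroid, and
    weight it by its cluster's share of the scenario."""
    E = Y_train.T                                   # (n_examples, n_train_llms)
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(E)
    anchors, weights = [], []
    for k in range(K):
        members = np.where(km.labels_ == k)[0]
        dists = np.linalg.norm(E[members] - km.cluster_centers_[k], axis=1)
        anchors.append(int(members[np.argmin(dists)]))
        weights.append(len(members) / E.shape[0])
    return np.array(anchors), np.array(weights)

def estimate_score(y_selected, weights):
    """Weighted estimate of the scenario score, as in equation (3.1)."""
    return float(np.dot(weights, y_selected))

# Toy usage: 300 training LLMs and a scenario with 1400 examples in two sub-scenarios.
rng = np.random.default_rng(0)
Y_train = (rng.random((300, 1400)) < 0.6).astype(float)
subscenario_of = {i: ("marketing" if i < 600 else "anatomy") for i in range(1400)}

ids, w_rand = stratified_sample(range(1400), subscenario_of, n_total=100)
anchors, w_anchor = anchor_points(Y_train, K=100)
y_new = rng.integers(0, 2, size=100).astype(float)   # a new LLM's correctness on the anchors
print(estimate_score(y_new, w_anchor))
```

For stratified sampling the weights are uniform; for anchor points each anchor carries its cluster's share of the scenario, which is what makes the weighted average in equation 3.1 meaningful.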


The second approach we propose is to use the item response theory (IRT) representation of examples, detailed in Section 4, as our embeddings Ei. The IRT model creates a meaningful representation for each example i based on its difficulty and the abilities required to respond to it correctly. This approach immediately solves the dimensionality problem, since Ei is relatively low-dimensional (in our experiments, the dimension of Ei is ≤ 16), and potentially alleviates the distribution shift problem if the IRT model reasonably describes reality and the example representations are stable. As IRT should represent which examples have similar difficulty and require similar abilities, the anchors represent exactly what we looked for. The weight wi is given by the fraction of examples in Ij assigned to cluster/anchor point i.

4. Better performance estimation with IRT

In this section, we propose ways of enhancing performance estimates by using IRT models. We start by discussing the case where Yil ∈ {0, 1}, that is, LLM l responds to example i ∈ I correctly or not. We later also discuss the case where Yil ∈ [0, 1].

4.1. The IRT model

The two-parameter multidimensional IRT model assumes that the probability of LLM l getting example i correct is given by

p_{il} \triangleq P(Y_{il} = 1 \mid \theta_l, \alpha_i, \beta_i) = \frac{1}{1 + \exp(-\alpha_i^\top \theta_l + \beta_i)},    (4.1)

where θl ∈ R^d denotes the unobserved abilities of LLM l, while αi ∈ R^d dictates which dimensions of θl are required from model l to respond to example i correctly. In this formulation, βi ∈ R can be viewed as a bias term that regulates the probability of correctness when θl = 0. We use IRT parameter estimates as the example representations referred to in Section 3. Specifically, we take Ei = (α̂i, β̂i), where α̂i and β̂i are point estimates for the parameters of example i. In the next sections, we introduce two estimators for the performance of an LLM, propose a simple solution for the case Yil ∉ {0, 1}, and describe model fitting.

4.2. IRT-based LLM performance estimation

The performance-IRT (p-IRT) estimator. Assume that we are interested in estimating the performance of a model l ∉ Ltr on scenario j and that point estimates of the example parameters, (α̂i, β̂i), have been computed, using a training set, for all examples in all scenarios, including the examples i ∈ Ij. Formally, we are interested in approximating

Z_{jl} \triangleq \frac{1}{|I_j|} \sum_{i \in I_j} Y_{il}.    (4.2)

Now, assume that we have run model l on a subset of examples from scenario j, obtaining responses {Y_{i_0 l}, ..., Y_{i_k l}} for the examples Îj = {i_0, ..., i_k}. Let θ̂l denote the estimate for θl after observing Îj and possibly a bigger set of examples coming from different scenarios. To obtain that estimate, we maximize the log-likelihood of the freshly observed data with respect to θl, fixing the examples' parameters. This procedure is equivalent to fitting a logistic regression model, which is an instance of the well-studied M-estimation procedure.

Because Zjl is a random variable, we approximate it by estimating the conditional expectation

E[Z_{jl} \mid Y_{i_0 l}, \ldots, Y_{i_k l}] = \frac{1}{|I_j|} \sum_{i \in I_j} E[Y_{il} \mid Y_{i_0 l}, \ldots, Y_{i_k l}] = \frac{1}{|I_j|} \Big( \sum_{i \in \hat{I}_j} Y_{il} + \sum_{i \in I_j \setminus \hat{I}_j} p_{il} \Big) = \frac{\hat{\lambda}}{|\hat{I}_j|} \sum_{i \in \hat{I}_j} Y_{il} + \frac{1 - \hat{\lambda}}{|I_j \setminus \hat{I}_j|} \sum_{i \in I_j \setminus \hat{I}_j} p_{il},

where λ̂ = |Îj|/|Ij| ∈ [0, 1] is a weight that gives more or less importance to the observed set Îj in the performance computation depending on how big that set is. The probability pil = P(Yil = 1 | θl, αi, βi) is given by the IRT model in Equation 4.1. The estimator for the conditional expectation is then given by

\hat{Z}_{jl}^{\text{p-IRT}} \triangleq \hat{E}[Z_{jl} \mid Y_{i_0 l}, \ldots, Y_{i_k l}] = \frac{\hat{\lambda}}{|\hat{I}_j|} \sum_{i \in \hat{I}_j} Y_{il} + \frac{1 - \hat{\lambda}}{|I_j \setminus \hat{I}_j|} \sum_{i \in I_j \setminus \hat{I}_j} \hat{p}_{il},    (4.3)

where p̂il ≜ P(Yil = 1 | θ̂l, α̂i, β̂i). We call the estimator in 4.3 the performance-IRT (p-IRT) estimator.

The idea behind p-IRT is that we can estimate the performance of a model on unseen data by making use of the IRT model. This is especially useful if we can fit θ̂l using data from many scenarios: even though we observe just a few samples per scenario, p-IRT will leverage the whole available data, permitting better estimates of the performance of the LLM on all scenarios. Conditional on the training set, the estimator p-IRT has low variance when θ̂l is obtained from a large dataset and a small bias if the IRT model is reasonably specified.
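For illustration, the sketch below fits θ̂l by maximizing the log-likelihood of the responses observed on the selected examples with the item parameters held fixed (the logistic-regression-style fit described above), and then forms the p-IRT estimate of equation 4.3. It is a simplified stand-in for the paper's implementation; the array names and the use of scipy are our own choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def irt_prob(theta, alpha, beta):
    """2PL-style probability from equation (4.1): sigmoid(alpha^T theta - beta)."""
    return expit(alpha @ theta - beta)

def fit_ability(y_obs, alpha_obs, beta_obs, d):
    """MLE for theta given fixed item parameters (a logistic-regression-style fit)."""
    def neg_loglik(theta):
        p = np.clip(irt_prob(theta, alpha_obs, beta_obs), 1e-9, 1 - 1e-9)
        return -np.sum(y_obs * np.log(p) + (1 - y_obs) * np.log(1 - p))
    return minimize(neg_loglik, x0=np.zeros(d), method="L-BFGS-B").x

def p_irt(y_obs, obs_idx, alpha_all, beta_all, theta_hat):
    """p-IRT estimate (4.3): observed correctness on the selected set plus
    IRT-predicted probabilities on the remaining examples of the scenario."""
    n = len(beta_all)
    lam = len(obs_idx) / n
    unseen = np.setdiff1d(np.arange(n), obs_idx)
    p_unseen = irt_prob(theta_hat, alpha_all[unseen], beta_all[unseen])
    return lam * y_obs.mean() + (1 - lam) * p_unseen.mean()

# Toy usage with d = 5 latent dimensions and 1400 examples in the scenario.
rng = np.random.default_rng(0)
d, n = 5, 1400
alpha_all, beta_all = rng.normal(size=(n, d)), rng.normal(size=n)
obs_idx = rng.choice(n, size=100, replace=False)
y_obs = rng.integers(0, 2, size=100).astype(float)
theta_hat = fit_ability(y_obs, alpha_all[obs_idx], beta_all[obs_idx], d)
score_hat = p_irt(y_obs, obs_idx, alpha_all, beta_all, theta_hat)
```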


Given that θ̂l is potentially estimated using a large sample, it is worth understanding what that implies about our estimates p̂il in the asymptotic regime. To facilitate our analysis, assume for a moment that the true values of (αi, βi) for all i ∈ I are known. From the theory of parametric M-estimation (Van der Vaart, 2000), under some regularity conditions, we have that |Î|^{1/2}(θ̂l − θl*) converges in distribution to N(0, Al) as |Î| → ∞ for a certain covariance matrix Al, where θl* denotes the true ability parameter for LLM l and Î = ∪j Îj. Using the delta method (Van der Vaart, 2000), we affirm in Proposition 4.1 that p̂il is asymptotically normal and centered around pil:

Proposition 4.1. Assuming that the true values of (αi, βi) for all i ∈ I are known and under standard M-estimation regularity conditions, by the delta method,

|\hat{I}|^{1/2} (\hat{p}_{il} - p_{il}) \xrightarrow{d} N\big(0,\; (\sigma'(-\alpha_i^\top \theta_l^* + \beta_i))^2 \, \alpha_i^\top A_l \alpha_i\big),

where σ is the logistic (sigmoid) link function.

Proposition 4.1 implies that p̂il = pil + O_P(|Î|^{-1/2}), meaning that pil can be accurately estimated when |Î| is large. Consequently, the performance can be well approximated in this setup.

We note two limitations of p-IRT that can hinder its effectiveness in practice. First, it does not promptly allow sample weighting, limiting its use of anchor points; second, if the predicted probabilities p̂il are inaccurate, e.g., because of model misspecification, then the performance of p-IRT will deteriorate.

The generalized p-IRT (gp-IRT) estimator. Our final estimator builds upon p-IRT to overcome its limitations. Assume that the estimators in equations 3.1 and 4.3 are obtained as a first step after the collection of examples in Îj. The idea is to compute a third estimator Ẑ_{jl}^{gp-IRT} given by a convex combination of the first two,

\hat{Z}_{jl}^{\text{gp-IRT}} \triangleq \lambda \sum_{i \in \hat{I}_j} w_i Y_{il} + (1 - \lambda) \hat{Z}_{jl}^{\text{p-IRT}},    (4.4)

where λ is a number in [0, 1] that is chosen to optimize the performance of that estimator. To choose λ, we first note that using random sampling (or anchor points) implies low bias but potentially high variance (when Îj is small) for Σ_{i∈Îj} wi Yil. As Îj grows, its variance decreases. On the other hand, conditional on the training set, the variance of Ẑ_{jl}^{p-IRT} is small, especially when θ̂l is fitted with data from many scenarios, but its bias can be high when the IRT model is misspecified, and that bias does not vanish with growing sample size. Thus, a good choice of λ increases with Îj.

We choose λ based on a heuristic derived from Song (1988)'s Corollary 2. It tells us that the optimal linear combination of any two estimators T̂1 and T̂2 (when the sum of the weights is one) depends on the biases, variances, and covariance of the two estimators. If the first estimator is unbiased and the variance of the second is zero, we can show that the optimal estimator is λT̂1 + (1 − λ)T̂2, where λ = b2²/(b2² + v1), b2 denotes T̂2's bias, and v1 denotes T̂1's variance. To apply this result, we assume that the main factors that might prevent gp-IRT from being a good estimator are the variance of the first estimator and the bias of the second one. Then we approximate the first estimator's bias and the second estimator's variance by zero. When our first estimator is obtained by random sampling, we take

\lambda = \frac{\hat{b}^2}{\hat{\sigma}^2 / |\hat{I}_j| + \hat{b}^2}

for two constants σ̂² and b̂². The first constant, σ̂², is obtained by computing the average sample variance of Yil, i ∈ Ij, across LLMs in the training set. The second constant, b̂², is obtained by approximating the IRT bias. We (i) split the training set into two subsets of LLMs; (ii) fit an IRT model on the first part using data from all scenarios; (iii) fit the ability parameter for all the LLMs in the second part using half of the examples of all scenarios; (iv) use that IRT model to predict the correctness (using predicted probabilities) of the unseen examples of scenario j for the models in the second split; (v) average predictions and actual correctness within models, obtaining predicted/actual scenario scores; (vi) compute their absolute differences, obtaining individual error estimates for models; and (vii) average across models, obtaining a final bias estimate, and then square the final number. To give some intuition on how λ is assigned, Figure 2 depicts λ as a function of b̂ and |Îj| when σ̂² = .01. From that figure, we see that if the IRT model bias is small, more weight will be given to p-IRT. The curves are steeper when |Îj| is small because the variance of the first estimator decreases faster when |Îj| is small. When the first estimator is obtained by a method that implies an estimator with smaller variance, e.g., anchor points, we apply the same formula but divide σ̂² by a constant > 1. By default, we divide σ̂² by 4, which is equivalent to halving the standard deviation of the first estimator.

Figure 2. Understanding the effect of IRT bias and sample size |Îj| in the gp-IRT construction: both quantities are positively related to the weight we give to the raw data in performance estimation.
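A minimal sketch of the gp-IRT combination in equation 4.4 together with the λ heuristic above. It assumes σ̂² and b̂² have already been estimated on the training split as described; the default division of σ̂² by 4 for anchor points is included, and all names are ours.

```python
import numpy as np

def gp_irt(y_obs, w_obs, p_irt_estimate, sigma2_hat, b2_hat, anchor_points=True):
    """Combine the weighted raw estimate (3.1) with the p-IRT estimate (4.3)
    using lambda = b^2 / (sigma^2/|I_hat_j| + b^2), as in Section 4.2."""
    raw_estimate = float(np.dot(w_obs, y_obs))        # sum_i w_i * Y_il
    var_term = sigma2_hat / len(y_obs)
    if anchor_points:                                  # default: divide sigma^2 by 4,
        var_term /= 4.0                                # i.e., halve the raw std. dev.
    lam = b2_hat / (var_term + b2_hat)
    return lam * raw_estimate + (1.0 - lam) * p_irt_estimate

# Toy usage: 100 anchor examples with cluster-size weights.
rng = np.random.default_rng(0)
w_obs = np.full(100, 1 / 100)
y_obs = rng.integers(0, 2, size=100).astype(float)
score = gp_irt(y_obs, w_obs, p_irt_estimate=0.62, sigma2_hat=0.01, b2_hat=0.0004)
```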


4.3. Using IRT when Yil is not binary

There are situations in which Yil ∉ {0, 1} but Yil ∈ [0, 1]. For example, in AlpacaEval 2.0, the response variable is bounded and can be translated to the interval [0, 1]. Also, some scenarios of HELM and the Open LLM Leaderboard have scores in [0, 1]. We propose a simple and effective fix. The idea behind our method is to binarize Yil by defining a second variable Ỹil = 1[Yil ≥ c], for a scenario-dependent constant c. More concretely, for each scenario j, we choose c such that

\sum_{i \in I_j,\, l \in L_{tr}} Y_{il} \;\approx\; \sum_{i \in I_j,\, l \in L_{tr}} 1[Y_{il} \geq c].

In that way, approximating the average of Ỹil and Yil should be more or less equivalent. Given that Ỹil ∈ {0, 1}, we can use the standard IRT tools to model it. (A sketch of this binarization and of a simplified fitting procedure follows §4.4 below.)

4.4. Fitting the IRT model

For the estimation procedure, we resort to variational inference. In particular, we assume that θl ∼ N(μθ 1d, (1/uθ) Id), αi ∼ N(μα 1d, (1/uα) Id), and βi ∼ N(μβ, 1/uβ). To take advantage of software for fitting hierarchical Bayesian models (Lalor & Rodriguez, 2023), we introduce (hyper)priors for the prior parameters: μθ ∼ N(0, 10), uθ ∼ Γ(1, 1), μα ∼ N(0, 10), uα ∼ Γ(1, 1), μβ ∼ N(0, 10), and uβ ∼ Γ(1, 1). Finally, to obtain point estimates for the model- and example-specific parameters θl, αi, and βi, we use the means of their variational distributions. To select the dimension of the IRT model during the fitting procedure, we run a simple validation strategy on the training set and choose the dimension that maximizes the predictive power of the IRT model on the validation split; we consider the dimensions in {2, 5, 10, 15}.
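For illustration only, the sketch below binarizes bounded scores as in §4.3 and then fits the item and ability parameters of the model in equation 4.1 by plain maximum-likelihood gradient descent in PyTorch. The paper instead fits the hierarchical Bayesian model above with variational inference (via py-irt); treat this as a rough stand-in that shows the moving parts, with all names being ours.

```python
import numpy as np
import torch

def binarize(Y):
    """Pick a scenario-level threshold c so that the mean of 1[Y >= c]
    roughly matches the mean of the bounded scores (Section 4.3)."""
    cs = np.quantile(Y, np.linspace(0.0, 1.0, 101))      # candidate thresholds
    errs = [abs((Y >= c).mean() - Y.mean()) for c in cs]
    return (Y >= cs[int(np.argmin(errs))]).astype(np.float32)

def fit_irt(Y_bin, d=5, steps=1000, lr=0.05, seed=0):
    """Joint maximum-likelihood fit of theta (abilities), alpha, beta for
    p = sigmoid(alpha_i^T theta_l - beta_i), as in equation (4.1)."""
    torch.manual_seed(seed)
    L, n = Y_bin.shape
    Y = torch.tensor(Y_bin)
    theta = torch.zeros(L, d, requires_grad=True)
    alpha = torch.randn(n, d, requires_grad=True)
    beta = torch.zeros(n, requires_grad=True)
    opt = torch.optim.Adam([theta, alpha, beta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = theta @ alpha.T - beta          # (L, n) matrix of alpha_i^T theta_l - beta_i
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, Y)
        loss.backward()
        opt.step()
    return theta.detach(), alpha.detach(), beta.detach()

# Toy usage: 300 training LLMs, 800 examples with bounded scores in [0, 1].
rng = np.random.default_rng(0)
Y_bounded = rng.random((300, 800)).astype(np.float32)
theta_hat, alpha_hat, beta_hat = fit_irt(binarize(Y_bounded))
```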


5. Assessing evaluation strategies

We assess the ability of the considered evaluation strategies to estimate the performance of LLMs on four popular benchmarks. For a given LLM and a benchmark, each evaluation strategy estimates the performance using evaluation results of this LLM on a given number of examples. We then compare this estimate to the true value, i.e., the performance of this LLM on the complete benchmark.

We use publicly available evaluation data for which we split the models into "train" and "test". Evaluation of the train models is used to find the anchor points and fit the IRT model. The ability to predict performance is measured on the test set of models. We consider two train-test split scenarios: (i) random split and (ii) by date, i.e., using the most recent models for testing. The latter split better represents practical use cases, while also being more challenging, as it is likely to result in a distribution shift between the train and test models due to improving model capabilities over time, which might affect the effectiveness of anchor points and the IRT model.

Benchmarks and models. We describe the size and composition of the four benchmarks, as well as the corresponding LLMs (see Appendix C for additional details):

• HuggingFace's Open LLM Leaderboard (Beeching et al., 2023) consists of 6 scenarios, approx. 29K examples in total. Performance on each of the scenarios is measured with accuracy, and the overall benchmark performance is equal to the average of scenario accuracies. We collect evaluation results for 395 LLMs from the Leaderboard's website and use 75% for training and 25% for testing (split either randomly or by date as described above).

• MMLU (Hendrycks et al., 2020) is a multiple-choice QA scenario consisting of 57 subjects (subscenarios) comprising approx. 14K examples. Performance on MMLU is measured by averaging the accuracies on each of the categories. MMLU is one of the 6 scenarios of the Open LLM Leaderboard and we consider the same set of 395 LLMs and train-test splits. The reason to consider it separately is its immense popularity when comparing LLMs (Touvron et al., 2023; Achiam et al., 2023; Team et al., 2023) and its inclusion in several other benchmarks.

• For HELM (Liang et al., 2022), we use HELM Lite v1.0.0, which has 10 core scenarios (a total of approx. 10K evaluation examples) and 30 models that have their performances registered for all scenarios. Performance metrics for each scenario vary and can be non-binary (e.g., F1 score), and the overall performance on the benchmark is measured with the mean win rate across scenarios. For this benchmark, the dates models were added are not available. Instead, we split models based on the organizations that trained them to create more challenging train-test splits, e.g., all OpenAI models are either in train or in test. For the random train-test split we use 11-fold cross-validation.

• AlpacaEval 2.0 (Li et al., 2023) consists of 100 LLMs evaluated on 805 examples. Although it is a fairly small benchmark, evaluation is expensive as it requires GPT-4 as a judge. For each input, GPT-4 compares the responses of a candidate LLM and a baseline LLM (currently also GPT-4) and declares a winner. The average win rate (the AlpacaEval 2.0 considered in the experiments uses continuous preferences instead of binary ones) is used to measure the overall performance. When splitting the data by date, we pick the 25% most recent models for testing and the rest for training. For the random split, we employ 4-fold cross-validation.

Evaluation strategies. We consider 3 strategies presented in §3 for selecting a subset of examples for efficient evaluation: "random" for stratified random sampling, "correctness" for clustering correctness of models in the train set, and "IRT" for clustering the example representations obtained from the IRT model fit on the train set. For each strategy, we evaluate the vanilla variation, i.e., simply using the performance of a test LLM on the (weighted) set of selected examples to estimate its performance on the full benchmark, and the "++" variation that adjusts this estimate using the IRT model as described in equation (4.4). In total, we assess six evaluation strategies. Results are averaged over 5 restarts.

Figure 3. Performance estimation error per benchmark (columns) tested on random (top row) and recent (bottom row) LLMs for an increasing number of evaluation examples. 100 examples per scenario is sufficient to achieve ≈2% average performance estimation error across benchmarks and evaluated LLMs. This corresponds to 600 out of 29K examples for Open LLM Leaderboard, 100 out of 14K examples for MMLU, 1000 out of 10K examples for HELM, and 100 out of 800 examples for AlpacaEval 2.0.

Key findings. We investigate the effectiveness of the strategies as we increase the number of examples available for evaluating test LLMs. Results for both train-test split scenarios are presented in Figure 3 (see also Figure 12 for Spearman's rank correlations). Our main conclusions are:

• Our approach to reducing evaluation costs is effective. The best-performing strategies achieve estimation error within 2% on all benchmarks with 100 examples or less per dataset or scenario. For example, for MMLU this reduces the evaluation cost by a factor of 140 (from 14K to 100). For the Open LLM Leaderboard even 30 examples per scenario is enough, reducing the evaluation cost by a factor of 160 (from 29K to 180).

• Most strategies perform well when there is a temporal shift between the train and test LLMs (see the lower row of plots in Figure 3 for the results with the "by date" split). Thus our approaches for reducing evaluation costs remain practical when evaluating the performance of newer, more capable LLMs and can help save GPU hours when evaluating future LLMs and/or checkpoints during pre-training.

• IRT-based methods ("IRT" and "IRT++") perform consistently well across benchmarks and train-test splits. The gp-IRT ("++") variation always improves on or matches its vanilla counterpart, while adding only a few seconds to the evaluation time (see Figure 11). Thus we use the IRT-based anchor examples to construct tiny versions (100 examples per scenario) of each of the benchmarks and release them along with the gp-IRT tool (code and pre-trained IRT model) for efficient evaluation of future LLMs. We present additional evaluations of tinyBenchmarks in Figure 4. In Appendix B, we conduct an exploratory analysis of the examples comprising tinyMMLU.

Specialized LLMs. In our previous experiments the test set of LLMs consisted of either a random subset of models or the most recent ones. Both of these test sets are dominated by base and instruction-tuned LLMs. Here we assess the ability of the considered strategies to predict the performance of specialized LLMs, i.e., models fine-tuned for specific domains such as code, biology, or finance. We consider the MMLU benchmark and collect a new hand-picked test set of 40 specialized models. Such models are likely to have unique strengths and perform well in specific MMLU categories while relatively underperforming on others. Thus, their correctness patterns might be different from those in the train set, posing a challenge for our evaluation strategies. We present results in Figure 5.

As we anticipated, the correctness-based anchor strategy deteriorates when tested on specialized LLMs. In contrast, the IRT-based anchors are only slightly affected, demonstrating their robustness and supporting our choice to use them for tinyBenchmarks construction.


Figure 4. Predicted performance compared with true performance for the four benchmarks (columns) and recent LLMs. We verify the efficacy of the evaluation strategies (IRT and IRT++) we chose to construct tinyBenchmarks.

Figure 5. Estimation error on specialized LLMs (right) compared to error on random LLMs (left) on MMLU. Correctness-based example selection is affected the most by this distribution shift.

Estimation error analysis. We present a more detailed view of the estimation error of the best-performing "IRT++" evaluation strategy on MMLU with 100 examples. In Figure 6 we plot estimation error against the actual accuracy of 99 test LLMs for a random train-test split. Our strategy can estimate the performance of more capable LLMs slightly better, although there is no strong dependency. We also note that the estimation error never exceeds 4% (except for one LLM with extremely low performance). Recall that the average error is 2% as shown in Figure 3, supporting the reliability of our evaluation approach.

Figure 6. Spread of estimation errors across a random subset of LLMs with varying capabilities on MMLU. The error tends to be slightly lower for more capable models. The worst-case error across almost all models is ≤ 4%.

6. Conclusion

In this paper, we demonstrate that it is possible to accurately assess the capabilities of LLMs with a fraction (sometimes two orders of magnitude smaller) of the examples in common benchmark datasets by leveraging models of educational assessment from psychometrics. This leads directly to savings in the monetary costs associated with evaluating LLMs, but also in the computational and environmental costs. For practitioners, the computational cost savings are especially convenient because they enable them to evaluate LLMs more frequently during fine-tuning and prompt engineering.

Based on our results we are releasing tinyBenchmarks, pre-selected subsets of examples from the widely adopted LLM benchmarks. tinyBenchmarks are simply small datasets that are straightforward to use to evaluate LLMs cheaply. We are also releasing an IRT-based tool to enhance performance estimation. The tool provides code and IRT parameters trained on the corresponding benchmarks and can be run on a CPU in a few seconds.

6.1. Extensions

Prompt evaluation. A persistent challenge in prompt-based model evaluation is the influence the prompting setup has on model predictions (see, e.g., Lu et al., 2022; Mishra et al., 2022; Min et al., 2022; Yoo et al., 2022; Weber et al., 2023b; Wei et al., 2023). We can use the previously described approaches to make predictions across different prompting setups. This way, we can estimate how well a model will do on a new set of prompts using just a few evaluations, or how a new model will perform on a given prompt. To test this idea, we train an IRT model on the prediction data from Weber et al. (2023a), containing evaluations of eight LLaMA LLMs (vanilla or instruction-tuned on the Alpaca self-instruct dataset; Touvron et al., 2023; Taori et al., 2023) for the ANLI dataset (Nie et al., 2020). The dataset consists of evaluations of the 750 data points wrapped with 15 different instruction templates sourced from the promptsource collection (P3; Bach et al., 2022).


Figure 7. Estimation error when predicting the performance of prompt templates. The results demonstrate that using our methods for efficient prompt-based model evaluation is a promising application.

Similarly to our previous experiments, we evaluate random splits and splits featuring distribution shifts (across model sizes and different instruction templates). For model size, we put all models with sizes 7B, 13B, and 30B in the training set while the models with size 65B go to the test set. For splits related to prompt templates, we consider two different approaches: first, we conduct a 2-fold cross-validation rotating instruction templates; second, we consider using the same and different instruction templates in the in-context-learning examples and in the input example, alternating the strategies in the training and test sets. Results in Figure 7 suggest that prompt-based model evaluation can be efficiently carried out with the methods introduced in this work, even in the presence of several practical distribution shifts.

Adaptive testing. We expect further performance estimation improvements can be squeezed out by more sophisticated applications of similar ideas. For example, instead of pre-selecting a subset of examples before evaluating the LLM, it may be possible to select the examples adaptively during the evaluation process. This idea is widely used in the computer-assisted testing algorithms behind many standardized tests. We demonstrate preliminary results on MMLU using an adaptive IRT variant in Figure 8 (see Figure 14 for results on more benchmarks). Although the estimation performance has improved, our current implementation takes over 5 minutes to run, which might not be as appealing practically.

Figure 8. Preliminary adaptive testing results on MMLU (score estimation error vs. number of seen samples for the random, IRT, IRT++, and adaptIRT++ strategies).

6.2. Limitations

The main limitations of the methods described in this paper are related to potential severe distribution shifts. Taking MMLU as an example, we anticipate larger performance estimation errors for models that fail on simple questions while answering complicated ones correctly, thus altering the correctness patterns. This might be caused by significant architecture or pre-training data changes. A rapid increase in LLM capabilities may also cause extrapolation errors. To alleviate these problems, we recommend periodically updating the curated examples and IRT parameter estimates using data from more modern LLMs.

7. Acknowledgements

We are grateful for the help provided by Yotam Perlitz in downloading data from HELM.

This paper is based upon work supported by the National Science Foundation (NSF) under grants no. 2027737 and 2113373.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

An, X. and Yung, Y.-F. Item response theory: What it is and how you can use the IRT procedure to apply it. SAS Institute Inc, 10(4):364–2014, 2014.

Bach, S., Sanh, V., Yong, Z. X., Webson, A., Raffel, C., Nayak, N. V., Sharma, A., Kim, T., Bari, M. S., Fevry, T., Alyafeai, Z., Dey, M., Santilli, A., Sun, Z., Ben-david, S., Xu, C., Chhablani, G., Wang, H., Fries, J., Al-shaibani, M., Sharma, S., Thakker, U., Almubarak, K., Tang, X., Radev, D., Jiang, M. T.-j., and Rush, A. PromptSource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 93–104, Dublin, Ireland, 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.acl-demo.9.


Beeching, E., Fourrier, C., Habib, N., Han, S., Lambert, N., Rajani, N., Sanseviero, O., Tunstall, L., and Wolf, T. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.

Biderman, S., Prashanth, U. S., Sutawika, L., Schoelkopf, H., Anthony, Q., Purohit, S., and Raf, E. Emergent and predictable memorization in large language models. arXiv preprint arXiv:2304.11158, 2023a.

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023b.

Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., Pecina, P., Post, M., Saint-Amand, H., et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 12–58, 2014.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Brzezińska, J. Item response theory models in the measurement theory. Communications in Statistics - Simulation and Computation, 49(12):3299–3313, 2020.

Cai, L., Choi, K., Hansen, M., and Harrell, L. Item response theory. Annual Review of Statistics and Its Application, 3:297–321, 2016.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Ein-Dor, L., Halfon, A., Gera, A., Shnarch, E., Dankin, L., Choshen, L., Danilevsky, M., Aharonov, R., Katz, Y., and Slonim, N. Active learning for BERT: An empirical study. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7949–7962, Online, 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.emnlp-main.638.

Elvira, V., Martino, L., and Robert, C. P. Rethinking the effective sample size. International Statistical Review, 90(3):525–550, 2022.

Guha, N., Nyarko, J., Ho, D., Ré, C., Chilton, A., Chohlas-Wood, A., Peters, A., Waldon, B., Rockmore, D., Zambrano, D., et al. LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models. Advances in Neural Information Processing Systems, 36, 2024.

Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

Ji, D., Logan, R. L., Smyth, P., and Steyvers, M. Active Bayesian assessment of black-box classifiers. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):7935–7944, 2021. URL https://ojs.aaai.org/index.php/AAAI/article/view/16968.

Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., and Szolovits, P. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.

Katariya, N., Iyer, A., and Sarawagi, S. Active evaluation of classifiers on large datasets. In 2012 IEEE 12th International Conference on Data Mining, pp. 329–338, 2012.

Kingston, N. M. and Dorans, N. J. The feasibility of using item response theory as a psychometric model for the GRE aptitude test. ETS Research Report Series, 1982(1):i–148, 1982.


Kočiskỳ, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., and Grefenstette, E. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018.

Kossen, J., Farquhar, S., Gal, Y., and Rainforth, T. Active testing: Sample-efficient model evaluation. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 5753–5763. PMLR, 2021. URL https://proceedings.mlr.press/v139/kossen21a.html.

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., et al. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.

Lalor, J. P. and Rodriguez, P. py-irt: A scalable item response theory library for Python. INFORMS Journal on Computing, 35(1):5–13, 2023.

Lalor, J. P., Wu, H., and Yu, H. Building an evaluation scale using item response theory. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2016, pp. 648. NIH Public Access, 2016.

Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.

Lin, S., Hilton, J., and Evans, O. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.

Liu, Z., Qiao, A., Neiswanger, W., Wang, H., Tan, B., Tao, T., Li, J., Wang, Y., Sun, S., Pangarkar, O., et al. LLM360: Towards fully transparent open-source LLMs. arXiv preprint arXiv:2312.06550, 2023.

Lord, F., Novick, M., and Birnbaum, A. Statistical Theories of Mental Test Scores. 1968.

Lu, Y., Bartolo, M., Moore, A., Riedel, S., and Stenetorp, P. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8086–8098, Dublin, Ireland, 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.acl-long.556.

Maia Polo, F. and Vicente, R. Effective sample size, dimensionality, and generalization in covariate shift adaptation. Neural Computing and Applications, 35(25):18187–18199, 2023.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11048–11064, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.759.

Mishra, S., Khashabi, D., Baral, C., Choi, Y., and Hajishirzi, H. Reframing instructional prompts to GPTk's language. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 589–612, Dublin, Ireland, 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.findings-acl.50.

Mizrahi, M., Kaplan, G., Malkin, D., Dror, R., Shahaf, D., and Stanovsky, G. State of what art? A call for multi-prompt LLM evaluation. arXiv preprint arXiv:2401.00595, 2023.

Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., and Kiela, D. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4885–4901, 2020.

Perlitz, Y., Bandel, E., Gera, A., Arviv, O., Ein-Dor, L., Shnarch, E., Slonim, N., Shmueli-Scheuer, M., and Choshen, L. Efficient benchmarking (of language models). arXiv preprint arXiv:2308.11696, 2023.

Petersen, N. S. et al. Using item response theory to equate scholastic aptitude test scores. 1982.

Rodriguez, P., Barrow, J., Hoyle, A. M., Lalor, J. P., Jia, R., and Boyd-Graber, J. Evaluation examples are not equally informative: How should that change NLP leaderboards? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4486–4503, Online, 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.acl-long.346.


Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd Schema Challenge at scale. Communications of the ACM, 64(9):99–106, 2021.

Sclar, M., Choi, Y., Tsvetkov, Y., and Suhr, A. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324, 2023.

Song, W. T. Minimal-MSE linear combinations of variance estimators of the sample mean. In 1988 Winter Simulation Conference Proceedings, pp. 414–421. IEEE, 1988.

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Van der Linden, W. J. Handbook of Item Response Theory: Three Volume Set. CRC Press, 2018.

Van der Vaart, A. W. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Vania, C., Htut, P. M., Huang, W., Mungra, D., Pang, R. Y., Phang, J., Liu, H., Cho, K., and Bowman, S. R. Comparing test sets with item response theory. arXiv preprint arXiv:2106.00840, 2021.

Vivek, R., Ethayarajh, K., Yang, D., and Kiela, D. Anchor points: Benchmarking models with much fewer examples. arXiv preprint arXiv:2309.08638, 2023.

Voronov, A., Wolf, L., and Ryabinin, M. Mind your format: Towards consistent evaluation of in-context learning improvements. arXiv preprint arXiv:2401.06766, 2024.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

Weber, L., Bruni, E., and Hupkes, D. The ICL consistency test. arXiv preprint arXiv:2312.04945, 2023a.

Weber, L., Bruni, E., and Hupkes, D. Mind the instructions: A holistic evaluation of consistency and interactions in prompt-based learning. arXiv preprint arXiv:2310.13486, 2023b.

Wei, J., Wei, J., Tay, Y., Tran, D., Webson, A., Lu, Y., Chen, X., Liu, H., Huang, D., Zhou, D., et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023.

Ye, Q., Fu, H. Y., Ren, X., and Jia, R. How predictable are large language model capabilities? A case study on BIG-bench. arXiv preprint arXiv:2305.14947, 2023.

Yoo, K. M., Kim, J., Kim, H. J., Cho, H., Jo, H., Lee, S.-W., Lee, S.-g., and Kim, T. Ground-truth labels matter: A deeper look into input-label demonstrations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2422–2437, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.155.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.


A. Evaluation when subscenarios have different numbers of samples

Suppose we want to estimate the performance on a scenario j which is composed of sj subscenarios. Denote the set of examples in each subscenario of j as Ijk, for k ∈ {1, ..., sj}. Then, Ij = ∪k Ijk, with disjoint Ijk's. For a given LLM l, our main goal is then to estimate (1/sj) Σ_k (1/|Ijk|) Σ_{i∈Ijk} Yil. Note that we can write

\frac{1}{s_j} \sum_k \frac{1}{|I_{jk}|} \sum_{i \in I_{jk}} Y_{il} \;=\; \sum_k \sum_{i \in I_{jk}} \frac{1}{s_j |I_{jk}|} Y_{il} \;=\; \sum_{i \in I_j} \bar{\omega}_i Y_{il}.

This tells us that we can represent the performance of model l as a weighted average instead of a simple average. In our code, the ωi ≜ |Ij| · ω̄i are called balance weights and the ω̄i are called normalized balance weights. In Section 3, when computing the estimates using the stratified random sampling strategy, the weights for each example are still given by 1/|Îj| (because subscenarios should already be equally represented), but when using the clustering ideas, the weight for each anchor point is given by the sum of the ω̄i of all items in its cluster. We do not apply any weighting when fitting the IRT models but only when computing the p-IRT (and gp-IRT) estimate:

\hat{Z}_{jl}^{\text{p-IRT}} = \frac{\hat{\lambda}}{|\hat{I}_j|} \sum_{i \in \hat{I}_j} \omega_i Y_{il} + \frac{1 - \hat{\lambda}}{|I_j \setminus \hat{I}_j|} \sum_{i \in I_j \setminus \hat{I}_j} \omega_i \hat{p}_{il}.
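A small sketch of these balance weights, assuming examples are stored with a sub-scenario label; the data layout is illustrative.

```python
import numpy as np

def normalized_balance_weights(subscenario_ids):
    """omega_bar_i = 1 / (s_j * |I_jk|) for example i in sub-scenario k,
    so that the weights sum to 1 and every sub-scenario counts equally."""
    subscenario_ids = np.asarray(subscenario_ids)
    names, counts = np.unique(subscenario_ids, return_counts=True)
    size_of = dict(zip(names, counts))
    s_j = len(names)
    return np.array([1.0 / (s_j * size_of[k]) for k in subscenario_ids])

# Example: scenario with three sub-scenarios of sizes 3, 2, and 1.
ids = ["marketing"] * 3 + ["anatomy"] * 2 + ["law"]
w_bar = normalized_balance_weights(ids)
assert abs(w_bar.sum() - 1.0) < 1e-9
# Weighted scenario score for a correctness vector y: float(w_bar @ y).
```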

B. tinyMMLU
To construct tinyMMLU we chose 100 examples and weights identified by the IRT anchor point approach (“IRT”)
corresponding to the best test performance (across random seeds) in the experiment presented in the top part of Figure 3 on
MMLU. For comparison, we analogously selected 100 examples with the correctness anchor point method.
To better understand the composition of tinyMMLU, in Figure 9 we visualize the distribution of the weights of the selected
examples and compare it to the weights of the correctness anchors. Recall that weights are non-negative and sum to 1.
If an item has a weight 0.1, for example, that item has a contribution of 10% in the final estimated score. From Figure
9, we can see that tinyMMLU has more uniform weights compared to its correctness-based counterpart. We measure
uniformity through the effective sample size (ESS) of the example weights. ESS, traditionally used in the Monte Carlo
and domain adaptation (Elvira et al., 2022; Maia Polo & Vicente, 2023) literature, measures weight inequality in a way
such that ESS = 0.50, for example, informally means that the corresponding weighted average is influenced by only 50%
of (uniformly weighted) examples. In the context of our problem, more uniform weights of tinyMMLU contribute to its
robustness when evaluating LLMs with varying correctness patterns, such as specialized LLMs in Figure 5.
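The normalized ESS used here can be computed as below. The paper does not spell out the exact formula, so this assumes the common Kish definition ESS = (Σ wi)² / (n Σ wi²), rescaled to lie in (0, 1].

```python
import numpy as np

def normalized_ess(w):
    """Normalized effective sample size in (0, 1]: 1.0 for uniform weights,
    approaching 1/n when a single example carries all the weight."""
    w = np.asarray(w, dtype=float)
    return float(w.sum() ** 2 / (len(w) * np.sum(w ** 2)))

uniform = np.full(100, 0.01)
skewed = np.array([0.5] + [0.5 / 99] * 99)
print(normalized_ess(uniform))   # 1.0
print(normalized_ess(skewed))    # much smaller, reflecting weight inequality
```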
We also investigate the total weight of the tinyMMLU examples within each of the 57 subjects in Figure 10. The highest
weighted are "high school psychology", "elementary mathematics", and "professional law". Interestingly, the weights of the subjects are fairly different from those of its correctness-based counterpart.

Figure 9. Comparing the spread of example weights using both the IRT and correctness approaches to find anchor points. We see that weight inequality is much higher when we cluster examples using correctness.

Figure 10. Weights given to MMLU subscenarios by the two anchoring methods.

C. More details about benchmarks

• HuggingFace's Open LLM Leaderboard (Beeching et al., 2023): the data from this benchmark is composed of 395 LLMs and approx. 29K items that were downloaded from the platform in January 2024. To extract data for those models, we filter all models from the platform that have an MMLU score over .3 (see footnote 5), order them according to their average performance, and select equally spaced models. Then, we kept all models that had scores for all six scenarios: ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), TruthfulQA (Lin et al., 2021), Winogrande (Sakaguchi et al., 2021), and GSM8K (Cobbe et al., 2021). In a second round of data collection, we collected data for 40 "specialized models" by recognizing which models were fine-tuned for math, coding, etc. The two sets of models have an intersection, and in total, we have collected data from 428 LLMs.

• HELM (Liang et al., 2022): we use HELM Lite (https://crfm.stanford.edu/helm/lite) v1.0.0, which is a dataset composed of 37 LLMs and approx. 10K evaluation examples from 10 scenarios. The scenarios are OpenbookQA (Mihaylov et al., 2018), MMLU (Hendrycks et al., 2020), NarrativeQA (Kočiskỳ et al., 2018), NaturalQuestions (closed-book) (Kwiatkowski et al., 2019), NaturalQuestions (open-book), MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), LegalBench (Guha et al., 2024), MedQA (Jin et al., 2021), and WMT14 (Bojar et al., 2014).

D. Extra results
D.1. Running time
We record the running time of IRT inference (ability parameter fitting) when running our experiments. In Figure 11 we show
that the average running time is fairly negligible.

Figure 11. Average running time by the number of test examples: IRT inference.
5 On the leaderboard. The actual score we use can be different because we use the last submission to the leaderboard, while the leaderboard shows the best results among all submissions.


D.2. Rank correlation results


In this section, we explore versions of Figures 3 and 5 when we look at rank correlation (correlation between true and
predicted ranking) instead of performance. It is clear from the plots below that our method can be used to rank models
efficiently with tiny samples.
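Once true and estimated scores are available for a set of LLMs, the rank correlations reported in Figures 12 and 13 can be computed with scipy's spearmanr; the exact implementation used by the authors is not specified, so this is just a convenient choice.

```python
import numpy as np
from scipy.stats import spearmanr

true_scores = np.array([0.71, 0.64, 0.58, 0.49, 0.62])       # full-benchmark performance
predicted_scores = np.array([0.70, 0.66, 0.55, 0.50, 0.60])  # tiny-benchmark estimate
rho, _ = spearmanr(true_scores, predicted_scores)
print(f"Spearman rank correlation: {rho:.3f}")
```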

Figure 12. Rank correlation for true performance and predicted performance among LLMs.

Figure 13. Rank correlation for true performance and predicted performance among LLMs in MMLU. The plot on the left represents a
random split of the data while the plot on the right considers specialized models as the test set.

D.3. Adaptive testing


In this section, we complement the results shown in Figure 8 for all benchmarks.

Figure 14. Results of adaptive testing for different benchmarks (Open LLM Leaderboard, MMLU, HELM, and AlpacaEval): score estimation error (no distribution shift) vs. number of seen samples for the random, IRT, IRT++, and adaptIRT++ strategies.
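The paper does not detail its adaptive IRT variant; the sketch below shows one standard computerized adaptive testing rule, greedily picking the next item by maximum Fisher information under the model of equation 4.1 and refitting the ability after each response. It illustrates the general idea only and is not the authors' implementation; all names are ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_ability(y, alpha, beta, d):
    """MLE for theta with the item parameters fixed; a small ridge penalty
    keeps the fit stable when only a few responses have been observed."""
    def nll(theta):
        p = np.clip(expit(alpha @ theta - beta), 1e-9, 1 - 1e-9)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + 0.5 * np.sum(theta ** 2)
    return minimize(nll, np.zeros(d), method="L-BFGS-B").x

def item_information(theta, alpha, beta):
    """Trace of the per-item Fisher information matrix: p * (1 - p) * ||alpha||^2."""
    p = expit(alpha @ theta - beta)
    return p * (1 - p) * np.sum(alpha ** 2, axis=1)

def adaptive_test(respond, alpha_all, beta_all, budget=100, d=5):
    """Greedy maximum-information item selection; respond(i) returns 0/1 correctness."""
    theta, seen, answers = np.zeros(d), [], []
    for _ in range(budget):
        info = item_information(theta, alpha_all, beta_all)
        info[seen] = -np.inf                       # never re-ask an item
        i = int(np.argmax(info))
        seen.append(i)
        answers.append(respond(i))
        theta = fit_ability(np.array(answers, dtype=float),
                            alpha_all[seen], beta_all[seen], d)
    return seen, theta

# Toy usage with simulated item parameters and a simulated test taker.
rng = np.random.default_rng(0)
alpha_all, beta_all = rng.normal(size=(1400, 5)), rng.normal(size=1400)
respond = lambda i: int(rng.random() < expit(alpha_all[i] @ np.ones(5) - beta_all[i]))
seen_items, theta_hat = adaptive_test(respond, alpha_all, beta_all, budget=30)
```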


E. Individual performances per scenario


In this section, we explore what is behind Figure 3 by looking in detail at results for individual scenarios for the Open LLM
Leaderboard and HELM. It is clear from the following plots that there are scenarios in which our methods shine more than
others.

E.1. Open LLM Leaderboard

Figure 15. ARC

Figure 16. GSM8K


Figure 17. TruthfulQA

Figure 18. HellaSwag

Figure 19. MMLU


Figure 20. Winogrande

E.2. HELM

Figure 21. OpenbookQA


Figure 22. GSM

Figure 23. LegalBench

Figure 24. Math


Figure 25. MedQA

Figure 26. MMLU

Figure 27. NarrativeQA


Figure 28. NaturalQA (closed book)

Figure 29. NaturalQA (open book)

Figure 30. WMT14
