Improving Pretraining Data Using Perplexity Correlations
Tristan Thrush, Christopher Potts & Tatsunori Hashimoto
Department of Computer Science
Stanford University
Stanford, CA 94305, USA
{tthrush,cgpotts,thashim}@stanford.edu
ABSTRACT
Quality pretraining data is often seen as the key to high-performance language
models. However, progress in understanding pretraining data has been slow due
to the costly pretraining runs required for data selection experiments. We present
a framework that avoids these costs and selects high-quality pretraining data with-
out any LLM training of our own. Our work is based on a simple observation:
LLM losses on many pretraining texts are correlated with downstream benchmark
performance, and selecting high-correlation documents is an effective pretraining
data selection method. We build a new statistical framework for data selection
centered around estimates of perplexity-benchmark correlations and perform data
selection using a sample of 90 LLMs taken from the Open LLM Leaderboard
on texts from tens of thousands of web domains. In controlled pretraining ex-
periments at the 160M parameter scale on 8 benchmarks, our approach outper-
forms DSIR on every benchmark, while matching the best data selector found in
DataComp-LM, a hand-engineered bigram classifier.
1 INTRODUCTION
Dataset curation is increasingly crucial for training high-quality large language models (LLMs). As
pretraining datasets have grown, from under 200B tokens in 2020 (Raffel et al., 2020; Gao et al.,
2020) to 240T tokens today (Li et al., 2024), it has become critical to identify subsets of the available
data that will lead to the best LLMs, and a wide range of methods have arisen to meet these needs
(Ilyas et al., 2022; Xie et al., 2023a;b; Engstrom et al., 2024; Everaert & Potts, 2024; Liu et al.,
2024; Llama Team, 2024). However, data-driven approaches to data selection typically involve
expensive model retraining steps that limit their effectiveness, and no algorithm has been reported
to consistently beat or match hand-crafted classifiers for data selection (Li et al., 2024).
Is training new LLMs necessary for data selection? Instead of training our own models, can we use
the growing collection of publicly available, high-performance LLMs (Wolf et al., 2019; Beeching
et al., 2023) to perform data valuation and selection? This would have significant benefits: we
could leverage the millions of dollars collectively spent on building these LLMs, and we would
have coverage over a large, heterogeneous collection of high-performance models varying in size,
architectures, and pretraining data distribution.
Despite these advantages, using existing models for pretraining data selection is challenging, as the
training data for these models are often unknown and heterogeneous. Our key observation is that
data selection can be done using two observable features of all public models today: 1) all open-
weight models produce a causal language modeling loss for a given text, and 2) all of them can be
evaluated on benchmarks. Prior work has found systematic relationships between web corpus loss
and benchmark performance (Wei et al., 2022), which suggests the possibility of using correlations
between perplexity and benchmark scores as the basis for a data selection policy.
In the present paper, we pursue this possibility and find a radically simple approach that is also
effective: we select data via perplexity correlations (Figure 1), where we select data domains (e.g.
wikipedia.org, stackoverflow.com, etc.) for which LLM log-probabilities are highly correlated with
downstream benchmark performance. To enable our approach, we complement our algorithm with
a statistical framework for correlation-based data selection and derive correlation estimators that
perform well over our heterogeneous collection of LLMs.

Figure 1: We want to pretrain on domains where lower loss is generally correlated with higher
downstream performance. Our approach does this by taking public, pretrained LLMs and measuring
correlations across their log-likelihoods (left, red matrix) and performance on a target benchmark
(center, blue vector). We then perform data selection by picking domains with high correlation
and training a fastText classifier that distinguishes these domains from others. This approach is on
par with the best-known data selection methods in our experiments, despite requiring no human
selection of high-quality domains.
We validate our approach over a large collection of pretrained causal LLMs on the Hugging Face
Open LLM Leaderboard (Beeching et al., 2023) and find that perplexity correlations are often pre-
dictive of an LLM’s benchmark performance. More importantly, we find that these relationships are
robust enough to enable reliable data selection that targets downstream benchmarks. In controlled
pretraining experiments at the 160M parameter scale on eight benchmarks, our approach strongly
outperforms DSIR (Xie et al., 2023b) (a commonly used training-free data selection approach based
on n-gram statistics) while generally matching the performance of the best overall method validated
at scale by Li et al. (the OH-2.5 +ELI5 fastText classifier (Joulin et al., 2016)) without any parameter
tuning or human curation.
Our manuscript focuses on pretraining experiments at the 160M scale. To assess the sensitivity of
our experimental results to the choice of pretraining data pool, scale, and experiment design, we use
this paper also as a preregistration for further pretraining experiments using different data sources
and evaluation benchmarks at the 1.4B model scale. We commit to updating the arXiv version of our
manuscript with both positive and negative results on these preregistered validation experiments.
2 RELATED WORK
To go beyond the status quo of deduplication, perplexity filtering, and hand-curation (Laurençon
et al., 2022; BigScience, 2023; Abbas et al., 2023; Groeneveld et al., 2024; Soldaini et al., 2024;
Penedo et al., 2024; Llama Team, 2024), targeted methods have been proposed to filter pretrain-
ing data so that the resulting LLM will achieve higher scores on given benchmarks. There are
lightweight approaches that use n-gram overlap (Xie et al., 2023b) or embedding similarity (Ever-
aert & Potts, 2024) to select training data that is similar to data from a given benchmark. There are
also less-scalable methods that require training proxy LLMs on different data mixtures (Ilyas et al.,
2022; Xie et al., 2023a; Engstrom et al., 2024; Liu et al., 2024; Llama Team, 2024).
Given the high costs of proxy-based data selection methods, they have primarily been used to select
among human-curated pretraining data mixtures (Llama Team, 2024; Li et al., 2024) rather than
a high dimensional space of mixtures. Our work takes an orthogonal approach and builds upon
recent observational studies that have found scaling relationships that hold across collections of
uncontrolled and diverse LLMs (Owen, 2024; Ruan et al., 2024). While these studies do not examine
loss-to-performance relationships or derive useful data selection methods from them, we know that
losses and performance are generally highly correlated. Validation losses on samples of text corpora
are commonly used as a proxy for downstream performance when comparing LLMs pretrained on
the same data distribution (Kaplan et al., 2020; Hoffmann et al., 2022; Wei et al., 2022), even if they
have different architectures (Poli et al., 2023; Peng et al., 2023; Gu & Dao, 2024).
According to a recent survey of data selection approaches by Li et al. (2024), the heavier-weight
pretraining data selection methods have not yet shown large gains, and the current state-of-the-art
across many tasks is still primitive: a fixed fastText classifier (Joulin et al., 2016) combined with an
English filter as a final layer after extensive deduplication and filtering. Are we missing important
information that we can efficiently extract from a diverse collection of already trained models, larger
and more diverse than any single organization is likely to produce? We show promising evidence
supporting this hypothesis – simple loss-performance correlation coefficients are effective when
used for data selection.
3 PROBLEM SETTING
Our goal is to build predictive models of how pretraining data distributions affect downstream bench-
mark performance and use them to build better language models. Unfortunately, this task is challeng-
ing and computationally expensive. A standard approach adopted in paradigms such as datamodeling
(Ilyas et al., 2022) is to obtain N different pretraining distributions {p_i : i ∈ [N], p_i ∈ R^D_{≥0}}
over D ≫ N domains (e.g. arxiv.org, stackoverflow.com, etc.), pretrain and measure model errors
on a target benchmark yi ∈ [0, 1], and fit a model p → y. This approach requires N LLM training
runs, performed at a scale sufficient to obtain non-random performance on y. This can cost tens to
hundreds of millions of dollars for hard benchmarks such as MMLU, where even the performance
of 1B parameter LLMs often does not exceed random chance (Beeching et al., 2023).
Instead, our work considers the following observational setting that requires no training. We obtain
N pretrained, high-performance LLMs that vary in pretraining data, tokenizer, architecture, and
scale (e.g. models on Hugging Face's Open LLM Leaderboard). Now, if we could train a predictor p →
y on these N models, we could avoid large scale model training. Unfortunately, this is impossible
as the training data for these models is often proprietary, and so we have no knowledge of p.
The key observation of our work is that we can replace pi,j (the unobserved sampling probability of
model i’s data selection policy on document j) with an observable surrogate xi,j , which is the nega-
tive log-likelihood of document j under model i.¹ We can then build a regression model that relates
negative log-likelihood xi and benchmark error yi . Using this model, we can select pretraining data
from domains j for which decreasing the loss xi,j is predicted to rapidly decrease error yi .
The perplexity-performance hypothesis. We formulate the task of predicting errors yi from nega-
tive log-probabilities xi as a single-index model (SIM),
yi = f (⟨θ ∗ , xi ⟩ + ϵi ) (1)
where f : R → R is some unknown monotonically increasing univariate function, ϵi is zero-mean
noise which is independent of x, and θ∗ ∈ R^D are unknown weights over D domains.
A single index model is highly flexible (due to the arbitrary, monotone f ) and has the advantage that
we do not need to estimate the nonlinear function f if our goal is to optimize model performance.
We can see this directly from the monotonicity of f as
⟨θ ∗ , xi ⟩ + ϵi < ⟨θ ∗ , xj ⟩ + ϵj ⇐⇒ f (⟨θ ∗ , xi ⟩ + ϵi ) < f (⟨θ ∗ , xj ⟩ + ϵj ). (2)
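As a quick illustration of why f can be ignored for ranking, the following NumPy sketch (entirely synthetic: a hypothetical monotone f, random θ∗, x, and noise) checks that orderings under the index ⟨θ∗, x⟩ + ϵ coincide with orderings of y:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 90, 9841                               # models and domains, matching the scale of our setup
theta_star = rng.dirichlet(np.ones(D))        # hypothetical nonnegative weights theta*
X = rng.normal(size=(N, D))                   # synthetic negative log-likelihoods x_i
eps = 0.01 * rng.normal(size=N)               # zero-mean noise

z = X @ theta_star + eps                      # the index <theta*, x_i> + eps_i
f = lambda t: 1.0 / (1.0 + np.exp(-3.0 * t))  # an arbitrary monotone f
y = f(z)                                      # benchmark values under the SIM

# Because f is monotone, the ordering of the index equals the ordering of y,
# so f never needs to be estimated in order to rank models (Equation 2).
assert np.array_equal(np.argsort(z), np.argsort(y))
```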
Data selection from perplexity correlations. The weights θ ∗ tell us which domain perplexities
are correlated with downstream performance. However, this isn’t sufficient for data selection. Even
if we know how model likelihoods relate to model performance, we do not know how data selec-
tion affects likelihoods. Even worse, this data mixture to likelihood relationship cannot be learned
observationally, as we do not know the data mixture of any of our models.
¹ To be precise, we use bits-per-byte, which normalizes the sequence negative log-likelihood with the number
of UTF-8 bytes. This is defined in terms of the length of the string in tokens L_T, the length of the string in
UTF-8 bytes L_B, and the cross entropy loss ℓ as BPB = L_T ℓ / (L_B ln(2)).
Despite this, we show that there is a clean approach for optimizing the data mixture. Our core
observation is the following: if we find a nonnegative θ ∗ , sampling proportional to θ ∗ is always a
good choice. More formally, we see that this sampling distribution defines the pretraining loss such
that optimizing the training loss directly optimizes the downstream task via the single index model.
Proposition 1 Suppose that θ ∗ weights are non-negative. Then, for models with associated like-
lihoods x ∈ X ⊂ RD , the minimizer of the pretraining loss over the θ ∗ sampling distribution
Ej∼θ∗ [xj ] also has the lowest expected downstream error according to the single index model:
arg min_{x∈X} E_{j∼θ∗}[x_j] = arg min_{x∈X} E[f(⟨θ∗, x⟩ + ϵ)].
This observation follows directly from the fact that we can normalize any non-negative θ ∗ into a
distribution (and shift the normalization constant into f ) which allows us to write the inner product
in the single-index model as a monotone function of the expected pretraining loss:
y = f (⟨θ ∗ , x⟩ + ϵ) = f (Ej∼θ∗ [xj ] + ϵ). (3)
Proposition 1 allows us to entirely avoid the task of finding the optimal data mixture for a target
likelihood. Instead, we pick sampling distributions that make the pretraining loss a monotone func-
tion of the predicted downstream error. Afterward, we can rely on our ability to optimize the loss to
optimize downstream performance.
This view gives us a straightforward roadmap for data selection in the remainder of the paper:
estimate a set of domains where loss and downstream benchmark performance is highly correlated,
and then constrain our θ ∗ estimates to be a pretraining data sampling distribution.
4 METHODS
We now describe the details of our approach, starting by presenting the algorithm itself and the
intuitions behind it, followed by a more precise and mathematical justification for the various steps.
4.1 ALGORITHM
Estimating θ ∗ . The parameter θj∗ measures the relationship between log-likelihoods in domain
j and downstream performance. Because of this, we might naturally expect θj∗ to be related to
nonlinear correlation coefficients between x and y. Our work uses a simple correlation measure,
γ_j = Σ_{1 ≤ k,l ≤ N, k ≠ l} sign(y_k − y_l) (rank_j(x_{k,j}) − rank_j(x_{l,j}))
where rankj (x) is the rank of x among {x1,j . . . xN,j }. This formula is intuitive: when model k
does better than model l, what percentile is model k’s log-likelihood compared to model l’s? While
this is not the only correlation coefficient that performs well (see Appendix E), this functional form
has the additional benefit of being a principled estimate of θ ∗ . In particular, we show in sections
below that in expectation, the ranking of domains in γ exactly matches those of θ ∗ (under standard
high-dimensional regression assumptions; see Section 4.2 for a complete discussion).
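A minimal NumPy/SciPy sketch of this coefficient is below; the function name and the convention that X is the N×D bits-per-byte matrix and y the benchmark values are ours, not a released API. The O(N²) pairwise sum collapses to a weighted sum of per-domain ranks via the antisymmetry of sign, which keeps it cheap even for roughly 10,000 domains.

```python
import numpy as np
from scipy.stats import rankdata

def perplexity_correlation(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """gamma_j = sum_{k != l} sign(y_k - y_l) * (rank_j(x_{k,j}) - rank_j(x_{l,j})).

    X: (N, D) bits-per-byte of N models on D domains; y: (N,) benchmark values
    for the same models (same sign convention as in the text).
    """
    ranks = rankdata(X, axis=0)                  # per-domain ranks, shape (N, D)
    sign = np.sign(y[:, None] - y[None, :])      # pairwise sign matrix, shape (N, N)
    # sum_{k,l} sign(y_k - y_l)(rank_k - rank_l) = 2 * sum_k [sum_l sign(y_k - y_l)] * rank_k
    # by antisymmetry of sign, so no O(N^2) pair terms need to be materialized per domain.
    return 2.0 * (sign.sum(axis=1) @ ranks)      # shape (D,)
```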
Selecting pretraining data. Suppose that we have an accurate estimate γj which is nonnegative. In
this case, we could use γj directly as a data selection procedure and Proposition 1 would ensure that
minimizing the population pretraining loss minimizes downstream errors. Unfortunately, γj can be
negative and the finite number of tokens per domain can make it difficult to minimize the population
pretraining loss. Thus, we must project γj onto the set of reasonable pretraining data distributions
that are nonnegative and account for the per-domain token counts.
What is a good way to project a set of domain rankings estimated via γ into a sampling distribution?
Intuitively, if wikipedia.org has γ_j = 0.5 and arxiv.org has γ_k = 0.9, it would be natural to select
tokens in order of γ, preferring tokens from arxiv.org over tokens from wikipedia.org when selecting
pretraining data.
Having established the ordering of domains, the remaining question is how many tokens we take for
each domain. We follow recent observations that repeating data degrades performance (Abbas et al.,
2023) to arrive at a simple selection algorithm: select domains in greatest to least γ, taking all the
tokens in each domain once, until we exhaust our total pretraining token budget.
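A sketch of this greedy selection, assuming gamma from the estimator above and a hypothetical domain_tokens array holding the number of available tokens per domain:

```python
import numpy as np

def select_domains(gamma: np.ndarray, domain_tokens: np.ndarray, budget: int) -> list[int]:
    """Take whole domains in decreasing order of gamma until the token budget is spent."""
    selected, used = [], 0
    for j in np.argsort(-gamma):          # highest-correlation domains first
        if used >= budget:
            break
        selected.append(int(j))
        used += int(domain_tokens[j])     # each domain's tokens are taken at most once
    return selected                        # indices of the kept domains
```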
Full algorithm. Together, these steps result in a simple, parameter-free algorithm that calculates
our rank correlation coefficient, and selects domains in order from largest to smallest coefficient.
We show this process explicitly in Algorithm 1, and additionally show an extra step where we train
a fastText (Joulin et al., 2016) classifier (using standard settings and bigram features from Li et al.
(2024)) which distinguishes our selected documents and domains from the rest of the pool. The
fastText classifier allows us to perform data selection at a single-page level, and scale the selection
process to larger datasets. We also found the classifier to slightly improve downstream performance
over directly selecting the documents. More information on the specifics of the data selection ap-
proaches that we tested is given in Appendix D.
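The classifier step can be sketched with the fastText Python bindings as below; the file path and label names are placeholders, and the exact settings (standard supervised training with bigram features, following Li et al. (2024)) are given in Appendix D.

```python
import fasttext

# train.txt: one page of text per line, prefixed with __label__include for pages from
# the selected domains and __label__exclude for pages sampled from the rest of the pool.
model = fasttext.train_supervised(input="train.txt", wordNgrams=2)

# Score an individual page and keep it if the classifier prefers the "include" label.
labels, probs = model.predict("text of a candidate web page", k=1)
keep = labels[0] == "__label__include"
```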
4.2 THEORY
We now study the approach closely and show that our choices for the correlation coefficient and
projection step are extensions of the classic, high-dimensional single index model estimator of Plan
et al. (2016). We describe the basic single-index model estimators first, describe our extensions, and
then conclude with a discussion on how our estimator and results deviate from the theory.
Recall that our estimate γ is a U-statistic, defined as pairwise sums of
sign(yi − yj )(Φ(xi ) − Φ(xj )), (6)
for any 1 ≤ i, j ≤ N (where i ̸= j), where Φ is the empirical CDF of the x values. This estimate is
significantly less sensitive to outliers than that of Chen & Banerjee (2017), as the empirical CDF is
bounded between zero and one, and no single model can make the estimator degenerate.
We study this estimate theoretically in the Gaussian setting, where we consider the asymptotically
equivalent estimator with Φ as the CDF of the standard Gaussian. In this case, we can show that this
modified estimator is also consistent in recovering θ ∗ .
We provide the proof in Appendix A. Because we assume ||θ ∗ ||2 = 1 and the expected value in
Equation 7 must be between −1 and 1, we are always within the domain of sin−1 and able to invert
it. After inverting, we get:
θ̂ ∝ sin((π/2) E[sign(yi − yj)(Φ(xi) − Φ(xj))])   (8)
as an estimate for θ∗, where the constant 2√(1 + σ²) term due to noise has been dropped.
Beyond the fact that our estimator is consistent, we can show an even tighter connection to the
Chen & Banerjee estimator and show that our estimates agree when running the original estimator
on rank-transformed data. More specifically, for two models xi and xj with the estimated model
rankings ⟨θ̂, xi⟩ > ⟨θ̂, xj⟩, the expected ranking under the rank transformation (i.e. Φ(x)) matches this
ranking.
Corollary 1 Suppose that θ̂ is any vector of fixed weights and x ∼ N (0, I). Then, conditioning on
the event ⟨θ̂, xi ⟩ < ⟨θ̂, xj ⟩, we have
⟨θ̂, E[Φ(xi ) | ⟨θ̂, xi ⟩ < ⟨θ̂, xj ⟩]⟩ < ⟨θ̂, E[Φ(xj ) | ⟨θ̂, xi ⟩ < ⟨θ̂, xj ⟩]⟩. (9)
with probability 1.
This proof follows from the same calculations as Theorem 1 and is given in Appendix A.
We take a similar approach here. We define a convex constraint set C that forces θ̂ to be a reasonable
sampling distribution and find the best sampling distribution via the linear correlation approach.
We define C as the combination of two sets of constraints. First, we must have a valid sampling
distribution, so we constrain θ̂ to lie in the simplex. As we noted above, it is well-known that dupli-
cating data harms performance (Abbas et al., 2023), and so we constrain θ̂ to avoid data duplication
by limiting the maximum weight on domains. Concretely, if we want to pretrain on m tokens overall,
we enforce θ̂_i ≤ τ_i, ∀i ∈ [1, D], where τ_i is set so that τ_i m is the number of tokens from the i-th domain
that we can access for training.
The resulting linear program has a simple solution and takes the form of initializing θ̂ proj to 0 and
then iterating through the values in θ̂ from largest to smallest, setting the value at the corresponding
index of θ̂ proj to the maximum allowable value, until θ̂ proj sums to 1 (see Appendix B for a proof).
The linear program maximizes ⟨θ̂, θ⟩ subject to:

Σ_{i=1}^{D} θ_i = 1
0 ≤ θ_i ≤ τ_i, ∀i ∈ [1, D],

where τ_i > 0 are fixed values. Then, the solution is:

θ̂_k^proj = τ_k                                     if Σ_{j: r_j(θ̂_j) ≥ r_k(θ̂_k)} τ_j ≤ 1
θ̂_k^proj = 1 − Σ_{j: r_j(θ̂_j) > r_k(θ̂_k)} τ_j      if Σ_{j: r_j(θ̂_j) ≥ r_k(θ̂_k)} τ_j ≥ 1 ∧ Σ_{j: r_j(θ̂_j) > r_k(θ̂_k)} τ_j ≤ 1     (10)
θ̂_k^proj = 0                                       otherwise,
where r is some function that breaks all ties between θ̂j and θ̂k for k ̸= j, and otherwise leaves the
ordinal relationships the same.
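Equivalently, Equation (10) is the following greedy fill; this minimal sketch assumes Σ_i τ_i ≥ 1 so the budget can always be met, and the function name is ours. Because only the ordering of θ̂ matters here, γ can be passed in directly in place of an exact estimate of θ∗.

```python
import numpy as np

def project_to_capped_simplex(theta_hat: np.ndarray, tau: np.ndarray) -> np.ndarray:
    """Fill domains to their caps tau in decreasing order of theta_hat until the
    weights sum to one (the closed form of Eq. 10); assumes sum(tau) >= 1."""
    proj = np.zeros_like(tau, dtype=float)
    remaining = 1.0
    for j in np.argsort(-theta_hat):     # largest theta_hat first; ties broken arbitrarily
        take = min(float(tau[j]), remaining)
        proj[j] = take
        remaining -= take
        if remaining <= 0.0:
            break
    return proj
```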
We note that while the use of this linear program is in line with the constrained estimators proposed
in Chen & Banerjee (2017), the L2 projection is arguably more natural, and does not require assum-
ing that ∥θ̂∥2 = 1 for asymptotic recovery conditions. We derive similar closed-form expressions
for this quadratic case in Appendix B, but do not use this approach for two separate reasons.
First, the L2 projection depends on the L2 norm of θ̂, unlike the linear program which only depends
on the ranks of the values in θ̂. The challenge with determining the norm is that the exact recovery
result in Equation (7) requires knowledge of the noise level, and the trigonometric functions rely
strongly on the Gaussian structure of x. Because of this, we are unlikely to be able to estimate
the norm of θ̂ with any accuracy, and the only way to avoid this would be to treat the norm as
a hyperparameter, which adds unnecessary complexity. The second reason is empirical (although
possibly a consequence of the first) – we found that the linear projection performed better across a
wide range of benchmarks and conditions (See Appendix E).
We conclude by relating our theory to the full algorithm in Section 4.1. The estimation step for γ
is the finite sample, U-estimate of the expectation in Equation (8), dropping the nonlinear transform
sin and π/2 as these two terms do not change the rankings of the domains. The data selection step
directly applies our projection in Equation (10), and we make use of the fact that this projection only
relies on rankings among the domains to use γ rather than an exact estimate for θ ∗ .
Our estimator is far from the only reasonable high-dimensional, single-index model estimator. We
briefly discuss some alternatives and the tradeoffs involved before moving to experimental results.
We could use classic low-dimensional methods regularized for the high-dimensional setting. This
includes ordinal regression (Wooldridge, 2010) and the isotron algorithm (Kalai & Sastry, 2009).
We found these methods to underperform correlation-based estimators, and tuning hyperparameters
added additional complexity that was not needed in the correlation-based approaches.
Another class of methods involves scaling laws (Kaplan et al., 2020; Llama Team, 2024; Ruan
et al., 2024). We could transform the y values via an inverse sigmoid or power law, and fit high-
dimensional linear regression methods (e.g. ridge, partial least squares, or Lasso). We initially found
this approach promising, but the inverse transforms were unstable, and the combination of fitting the
nonlinear transform and regularization required significant amounts of tuning.
Rank-correlation methods, including our robustified version of the estimator from Chen & Banerjee
(2017), and even the standard Spearman correlation (Spearman, 1904) (see Appendix E) performed
well. We believe that in general, robust per-feature correlations are likely to perform well as D ≫ N ,
and extreme levels of regularization are needed to obtain reasonable models. Sparse methods such
as the Lasso (Tibshirani, 1996) are one classic answer, but we cannot necessarily assume that the
underlying correlations θ ∗ are sparse, and we did not find these techniques to perform well.
5 RESULTS
We empirically validate our approach to predicting downstream performance and data selection.
Our validation consists of three sets of experiments: we first pretrain 160M-parameter LLMs from
scratch to study our primary goal of selecting pretraining data to improve downstream performance,
then analyze the ability of losses to predict downstream performance, and conclude with an
analysis of the loss matrix X. Throughout our experiments, we use the same single-index model that
we train using Algorithm 1. As shown in the algorithm, we train the fastText classifier on selected
vs unselected domains and use the classifier to filter the pretraining data at the page-level.
Input data matrix X. To build the input data matrix, X, we collected byte normalized loss values
from a sample of 90 Open LLM Leaderboard (Beeching et al., 2023) LLMs that we could run
without errors. Concretely, these values are defined as bits-per-byte BPB = L_T ℓ / (L_B ln(2)), where L_T is the token
count, L_B is the number of UTF-8 bytes, and ℓ is the per-token cross-entropy (Gao et al., 2020).
We collected these values on the smallest sample provided by the RedPajama V2 (RPJv2) dataset
(Together Computer, 2023) for all domains with ≥ 25 pages in the sample. Specifically, we used
the “sample” subset² of RPJv2, resulting in 9,841 domains/features. Specifics of how we computed
losses are in Appendix C.
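For reference, a minimal sketch of computing bits-per-byte for a single document with a Hugging Face causal LM (the model name is illustrative, and the shifted-label off-by-one in the token count is ignored here; the exact procedure is in Appendix C):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(model, tokenizer, text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # .loss is the mean per-token cross-entropy (natural log) under teacher forcing
        loss = model(**enc, labels=enc["input_ids"]).loss.item()
    n_tokens = enc["input_ids"].shape[1]               # L_T
    n_bytes = len(text.encode("utf-8"))                # L_B
    return n_tokens * loss / (n_bytes * math.log(2))   # BPB = L_T * ell / (L_B * ln 2)

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")        # example model
lm = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m").eval()
print(bits_per_byte(lm, tok, "An example page of text from one web domain."))
```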
Target benchmark performance y. We constructed a target vector, y, for LAMBADA (Paperno
et al., 2016), ARC Easy (Clark et al., 2018), PIQA (Bisk et al., 2020), and SciQ (Welbl et al.,
2017). These are all of the tasks reported in the Pythia scaling experiments for which a model in
the 160M parameter range could meaningfully perform above chance. We also constructed target
vectors for LAMBADA_IT, LAMBADA_FR, LAMBADA_DE, and LAMBADA_ES, which are subsets of
LAMBADA translated into Italian, French, German, and Spanish by Black (2023). These languages
match those in RPJv2 where each page is conveniently tagged as one of five languages: English,
Spanish, French, German, and Italian. The correspondence between our target benchmark languages
and the RPJv2 metadata will be convenient for us, as it will allow us to include language filtering
baselines at no additional cost.
5.1 PRETRAINING
We begin by validating our algorithm in the end-to-end task of pretraining data selection with con-
trolled experiments at the 160M parameter, 3.2B token scale. The low compute requirements of
this setting allowed us to more extensively study replicates and ablations in Appendix E within the
timeframe of a few days. We also note that while 160M models are small, this is far from an easy
setting for our data selection algorithm. Most of the Open LLM Leaderboard models are 10 to
100× larger than the 160M scale, and our single index model must extrapolate substantially from
≈7B scale models to our small-scale validation setting (see Appendix G for a histogram of model
sizes). Finally, the focus on small-scale validation in this manuscript allows us to commit to a clear
preregistration-based strategy toward scaling up (Section 6) rather than cherry-picking experiments
that succeed at scale.
Pretraining data and setting. For pretraining, we used the “sample-100B” subset of RPJv2. This is
larger than the sample that we used to compute our estimate. We filtered this data so it contains only
the domains used for our estimate, and then tokenized the data with the Pythia tokenizer. The vast
majority of the domains from our BPB matrix were present in this larger sample of text. However, 42
(out of 9,841) were not, and so we removed them from our estimate. For every data selection method
that we tested, the task was to further select 3.2B tokens for pretraining, which is Chinchilla-optimal
(Hoffmann et al., 2022) for the 160M-parameter LLM used in our tests.
² https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
Table 1: Average rankings of each data selection method (lower is better) across 8 benchmarks
shows that correlation-based filtering beats baselines by a wide margin, and matches the current best
open data filter from Li et al. (2024). Our approach significantly beats the default filter in Li et al.
(2024) with the EN filter and loses slightly after additional manual language filtering that depends
on the target task (+ manual Lang Filter).
Method                                                      Avg. Rank
None                                                        3.750
Lang Filter                                                 4.000
DSIR (Xie et al., 2023b)                                    4.500
Handcrafted fastText w/o Lang Filter                        3.750
Handcrafted fastText + EN Lang Filter (Li et al., 2024)     3.250
Handcrafted fastText + manual Lang Filter                   1.375
Perplexity Correlations                                     1.750
Baselines. We compare against several baseline data-selection methods. First, we present the results
of uniformly sampling from the available pretraining data. Then we use the language tags present
in RPJv2 to filter only for the language matching the target task. In addition to these commonsense
baselines, we also run DSIR (Xie et al., 2023b): a lightweight training data selection technique based
on n-gram overlaps that Li et al. (2024) found to be competitive with proxy LLM-based techniques
and was also validated at scale (Parmar et al., 2024). Finally, we run the state-of-the-art method for
pretraining data quality filtering found by Li et al., which is a fastText classifier that beats all of the
heavier-weight proxy-LLM methods that were tested. The classifier was trained on a benchmark-
agnostic and handcrafted objective, which is to classify data as coming from Common Crawl³ (low
quality) or OH2.5 (Teknium, 2023) and Reddit ELI5 (Fan et al., 2019) (high quality). It is combined
with an English filter in Li et al., and so we present results for this fastText filter with and without
the additional English filter.
Model and hyperparameters. We use the Pythia 160M LLM configuration from Biderman et al.
(2023) and optimize the hyperparameters including learning rate, weight decay, and warmup to
minimize loss on the uniform sampling (no selection algorithm) baseline. Training hyperparameters
were fixed across all methods. We provide additional training and evaluation details in Appendix D.
Results. We report average rankings over all benchmarks in Table 1, and we find that our approach
significantly outperforms the basic baselines of random sampling, language filtering, and DSIR.
Compared to the existing state of the art from Li et al. (2024), our approach beats the performance
of the default, English-filtered fastText classifier, but loses slightly once we add in a manual language
filtering step to enable better performance on the multilingual LAMBADA datasets. For the main-text
comparisons, we use the optional fastText classifier from our algorithm to select pretraining data at
the page level, but we show ablations without the classifier in Appendix E.
Figure 2 shows how each data selection method affects benchmark performance in more detail.
Each block of rows represents a data selection method, while an individual row represents an LLM
within a method that targets a particular benchmark or set of benchmarks. Columns represent bench-
marks. We see that language filtering and perplexity correlations both clearly optimize for the target
benchmark – within each block, the benchmark column matching each row typically performs best.
Surprisingly, the pattern is much less obvious for DSIR – the heatmap looks much more uniform
across LLMs with different task targets. We also see that while language filtering has significant impacts
on model performance, the gains from our approach significantly exceed those of language filtering
across all tested benchmarks.
Figure 3 shows the distribution of languages in pretraining data selected by our method, targeting
each benchmark. Our algorithm provides significant enrichment of the corresponding languages for
the multilingual benchmarks (LAMBADA_*), but we also find that it does not exclusively select do-
mains in one language. In contrast, for English benchmarks our approach selects nearly exclusively
English data, likely due to the large quantity of high-quality English data in our pretraining data
pool. There are significantly fewer tokens in non-English languages in the pretraining data pool and
our τ constraint to prevent duplication has a large impact on the weights when the benchmarks are
non-English. We provide the same figure when the τ values are made 5× as large in Appendix F.
Finally, we note that our results are somewhat insensitive to the specifics of the perplexity-correlation
procedure we present in Algorithm 1. We show in Appendix E that varying the projection method
(linear, L2 ) and even using Spearman rank correlations (Spearman, 1904) often work better than the
baselines. These results suggest that the performance of our approach is not tightly dependent on
³ https://commoncrawl.org
Figure 2: Pretraining results with different data selection methods. Each row is an LLM, and each
column is a task. The number in the upper left indicates the ranking of the method when targeting
that benchmark compared to other methods (lower is better). Numbers within the heatmap denote
accuracy for all benchmarks except the LAMBADA tasks for which the values are log perplexities
(where lower scores are better). We find that our approach appropriately optimizes data mixes for
the target language and benchmark, and matches the fastText baseline across most benchmarks.
the precise form of the estimator that is coupled to our theory results but holds more broadly across
general perplexity-correlation relationships. Additionally, our approach performs better with the
optional fastText classifier that our algorithm trains, possibly because it can operate at the page-level
instead of the domain-level.⁴
⁴ One might ask if the fastText classifier alone can be an effective data selector, since it appears in both
high-performance selection methods. This is not possible – the classifier can only amplify the effectiveness of
an existing selector rather than replace it – as training a classifier requires supervision, which is provided by
either the perplexity correlation algorithm or human curation (in the case of Li et al. (2024)).

Figure 3: Language distributions of pretraining data selected by perplexity correlations. The default
RPJv2 distribution is given in the left column for reference. The English benchmark targets often
exclusively select English but the reverse is not the case. In every case, our approach selects more
data than the default from the benchmark-matched language (shown as a green box in each column).

Figure 4: Rank predictions given by ⟨θ̂ proj, Φ(x)⟩ for PIQA and LAMBADA_FR. A standard
deviation (σ) from the ideal fit is shown in red. 2σ is shown in orange. Many models outside 2σ
(named and shown in blue) are trained on atypical data such as heavily multilingual data, code, or
GPT-4 (Brown et al., 2020) outputs. Models with atypical architectures (i.e. Mamba (Gu & Dao,
2024)) are named and shown in black. Generally, our estimate tightly predicts ordinal benchmark
performance from web corpus losses.

We have shown that our approach succeeds at the goal of selecting useful pretraining data, but how
good are the single index model's predictions? An accurate map from losses to benchmarks would be
helpful in selecting among candidate pretraining data mixtures generally, even when not using our
pretraining data selection algorithm.
Comparing model performance rankings predicted by our regression to the ground truth, we find
generally accurate predictions. Figure 4 shows 5-fold leave-out plots for PIQA and LAMBADA_FR
with the rank predictions given by ⟨θ̂ proj , Φ(x)⟩. Every point in the plot is a held-out point: we
estimated θ ∗ five times, holding out a different 20% of the data each time, and plotted the prediction
for every point when it was held out.
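The leave-out protocol can be sketched as follows; this snippet uses per-domain Spearman correlations as a stand-in for the θ estimate (the γ estimator or the Appendix E variants could be substituted), and tau denotes the per-domain caps from Section 4. The function name and argument conventions are ours.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr
from sklearn.model_selection import KFold

def heldout_rank_predictions(X: np.ndarray, y: np.ndarray, tau: np.ndarray) -> np.ndarray:
    """5-fold held-out predictions <theta_proj, Phi(x)> for each model (sketch)."""
    phi = rankdata(X, axis=0) / X.shape[0]       # empirical CDF Phi of each domain column
    preds = np.zeros(len(y))
    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        # Per-domain rank correlation on the training fold as the theta estimate.
        theta = np.array([spearmanr(X[train, j], y[train])[0] for j in range(X.shape[1])])
        # Greedy projection onto the capped simplex (Eq. 10).
        proj, remaining = np.zeros_like(tau, dtype=float), 1.0
        for j in np.argsort(-theta):
            take = min(float(tau[j]), remaining)
            proj[j], remaining = take, remaining - take
            if remaining <= 0.0:
                break
        preds[test] = phi[test] @ proj
    return preds

# spearmanr(heldout_rank_predictions(X, y, tau), y)[0] then measures the ordinal fit.
```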
We find that our estimator achieves high ordinal prediction performance across all target tasks. We
include 5-fold leave-out R2 scores for all tasks in Figure 5. However, we complement these strong
results with the additional observation that simply taking the mean loss across all domains is a
strong predictor of model performance (bottom row). The surprising effectiveness of average loss
over uniformly sampled documents has been discussed extensively (Owen, 2024; Wei et al., 2022;
Kaplan et al., 2020) and our results further suggest that regressions with correlations only slightly
above the mean loss baseline still can result in effective data selection methods.
Finally, we discuss outliers in our prediction of model performance. Our predictions are accurate
for LLMs with unusual architectures (e.g. Mamba (Gu & Dao, 2024)), the smallest/largest vocabulary
sizes, context sizes, and parameter sizes. However, we also see that LLMs that were trained on
unusual data are not as well predicted by our approach (e.g. Phi (Gunasekar et al., 2023)). We may
simply require a bigger or more diverse pretraining data pool and set of models to find estimates that
work well for models that expect different styles of text.
Figure 5: Held-out R2 score of our raw correlation estimate θ̂, our projected estimate θ̂ proj , and
the average loss baseline. The 95% bootstrapped confidence intervals are wide enough that no
individual comparison is significant. Across benchmarks, θ̂ proj has statistically significant gains
over the baseline (p=0.035) as it is unlikely that θ̂ proj beats mean loss 7 times out of 8 by chance.
Figure 6: Analysis of the loss matrix. The first row treats domains as examples to be projected via
PCA, while the second row treats models as examples. Panels (a): eigenvalue decay for the eigende-
composition of the D×D covariance matrix resulting from the loss matrix; a few dominant PCs are
seen. (b) and (c): domains plotted by the first two PCA components showing separation of language
in b and entropy in c. (d,e) show analogous plots in t-SNE with a clearer separation of language. (f):
eigenvalue decay analogous to (a). (g,h): models plotted by the first two PCA components showing
clustering by model family (clusters show Pythia (Biderman et al., 2023), Qwen (Bai et al., 2023),
and OpenLlama (Geng & Liu, 2023) derivatives – the three largest clusters in our data), and average
model loss. (i,j) show analogous results under t-SNE where (i) is normalized to remove per-model
entropy differences.
What information is contained in the matrix of model losses X? Clearly, it must contain semantically
meaningful information about the data, such as the language that a piece of text is in. We performed
PCA (Pearson, 1901) and t-SNE (van der Maaten & Hinton, 2008) on X and plotted the first two
components for each of our 9,841 domains. As shown in the first row of Figure 6, we found two
components with relatively high singular values. The first component clearly corresponds with the
language of a domain. The second component corresponds with the average bits-per-byte or entropy
of a domain. The t-SNE components show the same general pattern as well as showing that the
language clusters are very well separated. As shown in our plots, there are several salient clusters
within the language clusters. Within the English cluster, we found a subcluster for luxury goods,
another for legal services and information, another for academic research, and even a cluster for
funeral homes.
The second row of Figure 6 shows plots for the loss matrix when we take the principal components
of the other dimension, where points correspond to the 90 LLMs. For PCA, PC1 corresponds to
entropy. For both cases, it is less clear what the other PCs are, but when we color the three largest
families of models in our data (Pythia (Biderman et al., 2023), Qwen (Bai et al., 2023), and OpenL-
lama (Geng & Liu, 2023)), we see that model families are clustered together in the PC graphs.
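The decomposition itself is easy to reproduce on any such loss matrix; below is a minimal scikit-learn sketch, where the file path is a placeholder and the t-SNE settings are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_bpb = np.load("bpb_matrix.npy")     # placeholder path: (90 models) x (9,841 domains) BPB matrix

# First row of Figure 6: treat domains as examples (rows) and models as features.
domains = X_bpb.T
domain_pca = PCA(n_components=2).fit_transform(domains)
domain_tsne = TSNE(n_components=2, perplexity=30).fit_transform(domains)

# Second row: treat models as examples; z-scoring each model's losses removes
# per-model entropy differences, as in panel (i).
models_norm = (X_bpb - X_bpb.mean(axis=1, keepdims=True)) / X_bpb.std(axis=1, keepdims=True)
model_pca = PCA(n_components=2).fit_transform(X_bpb)
model_tsne = TSNE(n_components=2, perplexity=20).fit_transform(models_norm)
```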
6 ROBUSTNESS CHECKS WITH PRE-REGISTRATION
In small-scale experiments, our approach is competitive with the leading approach from Li et al.’s
survey: a fixed fastText model (Joulin et al., 2016), manually augmented with the best language fil-
tering. This leading approach is heuristic and hand-crafted, requiring appropriate language filtering
matched to the target benchmark and assumptions about what good pretraining data looks like. Our
approach does not make these assumptions and could potentially improve as more public models are
released and we have better data to estimate θ ∗ .
While our results are generally positive, many past data selection methods have reported initially
positive results, only to later break: they may fail to scale to larger models or rely on specific details
of their experimental setting. Our 160M-scale experiments may also raise such concerns.
We design a pre-registered scaling experiment that addresses both the concerns of scale and external
validity. We use the permanence of arXiv preprints as a mechanism to preregister a series of scaling
experiments within the DataComp-LM framework (Li et al., 2024), which is a testbed for data-
selection techniques released with the recent survey. Pre-registering held-out scaling experiments
commits us to reporting negative results and avoids overfitting to our chosen experimental settings.
DataComp-LM is ideal for this preregistered scaling experiment, as it standardizes the setting by
providing a pool of 240 trillion tokens, pretraining code for 412M to 7B parameter models, and
evaluation code for 53 benchmarks, 22 of which are labelled as “core” benchmarks that scale pre-
dictably. Importantly, we have not trained any models on DataComp-LM using our methods or
baselines, making this a true held-out experiment with known high-performance baselines.
Preregistered experiment. We will run the best-performing approach from our paper: a fastText
filter trained on our correlation estimator. We will define the target benchmark for our estimator as
the average of the “core” DataComp-LM benchmarks and run our estimator with perplexities from
our set of 90 Open LLM Leaderboard LLMs on a uniform subsample of the DataComp-LM pool of
data. We will use the provided DataComp-LM code for training LLMs for the “Filtering 1B-1x”
track, where a 1.4B parameter LLM is trained on 28.8B tokens chosen from a 1.64T sample of the
DataComp-LM pool of data. In the DataComp-LM paper, they apply their fixed fastText filter as a
final step after several complicated deduplication and filtering steps. We will report results where
our fastText classifier is used as a direct substitute for this last step alone, as well as another test in
which we replace the entire pipeline with one classifier. We will report results where our estimator
is trained at the domain-level (following this paper) and where our estimator is trained at the page-
level (which we have not tried yet). Finally, we will report analogous results where we replace the
“core” evaluation score with the average score on all of the non-English LAMBADA translations,
and compare the raw fastText classifier from Li et al. (2024) to our approach, using both of these
approaches in place of the full filtering pipeline from 1.64T tokens. We preregister this additional
multilingual task because “core” does not include multilingual evaluations.
We chose the “Filtering 1B-1x” track because it is the most compute-intensive track that we can
perform within a couple weeks, given our resources. For these experiments, the compute needs for
our data selection procedure are negligible compared to LLM pretraining. Per Li et al. (2024), the
pretraining for these experiments combined is estimated to take 240 hours on an 8-GPU H100 node.
7 CONCLUSION
Does high-performance data selection require careful hand-crafted heuristics or prohibitively ex-
pensive model training runs? Our work demonstrates an alternative, viable approach – leveraging
existing, public models as a source of information for data selection. Pretraining experiments sug-
gest that a simple, correlation-based approach to selecting data can be effective, but more broadly,
we show how to 1) use single-index models as a surrogate for downstream performance and 2) build
models that relate losses to downstream performance and use these surrogates effectively in data
selection. Finally, we propose pre-registered scaling experiments on held-out data to test external
validity of reported results. We hope this type of experiment will help improve the trustworthiness
of experimental results for data selection.
ACKNOWLEDGMENTS
We thank Jack Spilecki for conversations on the mathematical aspects of the work. We also thank
Zitong Yang, Yangjun Ruan, and Lisa Li for their helpful feedback throughout the project, Ludwig
Schmidt and Samir Gadre for discussions on scaling laws involving benchmark perplexity, Rohan
Pandey for conversations about scaling laws, Sung Min Park for discussions on drafts of this work,
and William Held for conversations about data selection. This work is supported in part by a grant
from Sandia National Laboratories, and gifts from Open Philanthropy, Meta, Panasonic Research,
and the Tianqiao and Chrissy Chen Institute. Any opinions, findings, and conclusions or recom-
mendations expressed in this material are those of the authors and do not necessarily reflect the
views of Sandia National Laboratories. Tristan Thrush is supported in part by the Stanford Graduate
Fellowship.
REFERENCES
Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S. Morcos. Semdedup: Data-
efficient learning at web-scale through semantic deduplication. arXiv, 2023.
Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky,
Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will
Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael
Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael La-
zos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, CK Luk, Bert Maher, Yunjie Pan,
Christian Puhrsch, Matthias Reso, Mark Saroufim, Marcos Yukio Siraichi, Helen Suk, Michael
Suo, Phil Tillet, Eikan Wang, Xiaodong Wang, William Wen, Shunting Zhang, Xu Zhao, Keren
Zhou, Richard Zou, Ajit Mathews, Gregory Chanan, Peng Wu, and Soumith Chintala. PyTorch 2:
Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Com-
pilation. ACM International Conference on Architectural Support for Programming Languages
and Operating Systems, 2024.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge,
Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu,
Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi
Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng
Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan,
Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou,
Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv, 2023.
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Ra-
jani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf, 2023. URL https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard. Open LLM Leaderboard.
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hal-
lahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya
Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language
models across training and scaling. arXiv, 2023.
BigScience. BLOOM: A 176b-parameter open-access multilingual language model. arXiv, 2023.
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about
physical commonsense in natural language. AAAI, 2020.
Sid Black, 2023. URL https://huggingface.co/datasets/EleutherAI/lambada_openai.
Multilingual LAMBADA.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,
Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv, 2020.
Sheng Chen and Arindam Banerjee. Robust structured estimation with single-index models. ICML,
2017.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and
Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning chal-
lenge. arXiv, 2018.
John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto
the l1-ball for learning in high dimensions. ICML, 2008.
Logan Engstrom, Axel Feldmann, and Aleksander Madry. Dsdm: Model-aware dataset selection
with datamodels. arXiv, 2024.
Dante Everaert and Christopher Potts. Gio: Gradient information optimization for training dataset
selection. ICLR, 2024.
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5:
long form question answering. arXiv, 2019.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason
Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile:
An 800gb dataset of diverse text for language modeling. arXiv, 2020.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Fos-
ter, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muen-
nighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang
Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for
few-shot language model evaluation. Zenodo, 2023.
Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, 2023. URL https://github.com/openlm-research/open_llama.
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord,
Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkin-
son, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar,
Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff,
Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander,
Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Worts-
man, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle
Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi. Olmo: Accelerating the science of
language models. arXiv, 2024.
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv,
2024.
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth
Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital
Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai,
Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. arXiv, 2023.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hen-
nigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy,
Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre.
Training compute-optimal large language models. arXiv, 2022.
Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry. Data-
models: Predicting predictions from training data. ICML, 2022.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient
text classification. arXiv, 2016.
Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression.
COLT, 2009.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language
models. arXiv, 2020.
Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral,
Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen,
Jörg Frohberg, Mario Šaško, Quentin Lhoest, Angelina McMillan-Major, Gerard Dupont, Stella
Biderman, Anna Rogers, Loubna Ben allal, Francesco De Toni, Giada Pistilli, Olivier Nguyen,
Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa, Paulo Villegas, Tristan
Thrush, Shayne Longpre, Sebastian Nagel, Leon Weber, Manuel Muñoz, Jian Zhu, Daniel Van
Strien, Zaid Alyafeai, Khalid Almubarak, Minh Chien Vu, Itziar Gonzalez-Dios, Aitor Soroa,
Kyle Lo, Manan Dey, Pedro Ortiz Suarez, Aaron Gokaslan, Shamik Bose, David Adelani, Long
Phan, Hieu Tran, Ian Yu, Suhas Pai, Jenny Chim, Violette Lepercq, Suzana Ilic, Margaret
Mitchell, Sasha Alexandra Luccioni, and Yacine Jernite. The bigscience roots corpus: A 1.6tb
composite multilingual dataset. NeurIPS Datasets and Benchmarks, 2022.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash
Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel,
Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bit-
ton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej
Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras,
Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic,
Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer,
Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groen-
eveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair
Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-lm: In search of the
next generation of training sets for language models. arXiv, 2024.
Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing
Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training. arXiv,
2024.
Edward W. Ng and Murray Geller. A table of integrals of the error functions. Journal of Research
of the National Bureau of Standards, Section B: Mathematical Sciences, 1968.
David Owen. How predictable is language model benchmark performance? arXiv, 2024.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi,
Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset:
Word prediction requiring a broad discourse context. ACL, 2016.
Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Bo Liu, Aastha Jhunjhunwala, Zhilin
Wang, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Data, data everywhere:
A guide for pretraining dataset construction. arXiv, 2024.
Karl Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Maga-
zine, 1901.
Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin
Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the
finest text data at scale. arXiv, 2024.
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman,
Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen
Hou, Jiaju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden
Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang,
Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang
Zhao, Peng Zhou, Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. Rwkv: Reinventing rnns for the
transformer era. arXiv, 2023.
Yaniv Plan, Roman Vershynin, and Elena Yudovina. High-dimensional estimation with geometric
constraints. Information and Inference: A Journal of the IMA, 2016.
Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua
Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional
language models. arXiv, 2023.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
Yangjun Ruan, Chris J. Maddison, and Tatsunori Hashimoto. Observational scaling laws and the
predictability of language model performance. arXiv, 2024.
Shai Shalev-Shwartz and Yoram Singer. Efficient learning of label ranking by soft projections onto
polyhedra. JMLR, 2006.
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur,
Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh
Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas
Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle
Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke
Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and
Kyle Lo. Dolma: an open corpus of three trillion tokens for language model pretraining research.
arXiv, 2024.
Charles Spearman. The Proof and Measurement of Association between Two Things. The American
Journal of Psychology, 1904.
Teknium. Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants, 2023.
URL https://huggingface.co/datasets/teknium/OpenHermes-2.5.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society Series B: Statistical Methodology, 58(1):267–288, 1996.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko-
lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher,
Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy
Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn,
Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel
Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee,
Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra,
Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi,
Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh
Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen
Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic,
Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models.
arXiv, 2023.
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yo-
gatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol
Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models.
TMLR, 2022.
Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions.
W-NUT, 2017.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface’s trans-
formers: State-of-the-art natural language processing. arXiv, 2019.
Jeffrey M. Wooldridge. Econometric Analysis of Cross Section and Panel Data. MIT Press, 2010.
Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang,
Quoc V. Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up
language model pretraining. NeurIPS, 2023a.
Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language
models via importance resampling. NeurIPS, 2023b.
A ESTIMATOR SOLUTION
A.1 LEMMA 1
Statement of Lemma 1   Define the PDF of the half-normal distribution as $f(x; \sigma) = \frac{\sqrt{2}}{\sigma\sqrt{\pi}} e^{-\frac{x^2}{2\sigma^2}}$ for $x > 0$ and $0$ otherwise. Now, suppose that $Z_1, Z_2 \sim \mathcal{N}(0, I)$ are independent, $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is independent of them, $\beta$ is a unit-norm vector, $Z_+$ follows the standard half-normal distribution, and $Z' \sim \mathcal{N}(0, 1)$ is independent of $Z_+$. Then we have:
$$Z_{1j} \,\Big|\, \langle Z_1 - Z_2, \beta\rangle + \epsilon > 0 \;\overset{d}{=}\; Z'\sqrt{1 - \frac{\beta_j^2}{2 + \sigma^2}} + \frac{\beta_j}{\sqrt{2 + \sigma^2}} Z_+,$$
where $Z_{1j}$ is the $j$-th entry of $Z_1$.
We also use the fact that $\sum_{i=1}^{C} \hat{\beta}_i^2 = 1$ (recall that $\hat{\beta}$ is unit norm) to get: $Y \sim \mathcal{N}(0, 1 - \hat{\beta}_j^2)$. So we have that the conditional $Z_{1j}$ is given by:
$$Z'\sqrt{1 - \hat{\beta}_j^2} + \hat{\beta}_j Z_+ \;=\; Z'\sqrt{1 - \frac{\beta_j^2}{2 + \sigma^2}} + \frac{\beta_j}{\sqrt{2 + \sigma^2}} Z_+,$$
for $Z' \sim \mathcal{N}(0, 1)$. As a corollary, we can see that $Z_{2j}$ under the same condition is given by:
$$Z'\sqrt{1 - \frac{\beta_j^2}{2 + \sigma^2}} - \frac{\beta_j}{\sqrt{2 + \sigma^2}} Z_+.$$
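The distributional identity above can be checked by simulation. The following sketch is illustrative only (the dimension, noise scale, $\beta$, and coordinate $j$ are arbitrary values chosen here, not settings from our experiments); it compares conditional samples of $Z_{1j}$ with samples drawn from the right-hand side of the lemma.

import numpy as np

rng = np.random.default_rng(0)
D, sigma, j = 5, 0.5, 0                      # arbitrary dimension, noise scale, and coordinate
beta = rng.normal(size=D)
beta /= np.linalg.norm(beta)                 # unit-norm beta, as assumed in the lemma

n = 1_000_000
Z1 = rng.normal(size=(n, D))
Z2 = rng.normal(size=(n, D))
eps = rng.normal(scale=sigma, size=n)

keep = (Z1 - Z2) @ beta + eps > 0            # condition on the comparison event
lhs = Z1[keep, j]                            # conditional samples of Z_{1j}

m = lhs.shape[0]
Z_plus = np.abs(rng.normal(size=m))          # standard half-normal Z_+
Z_prime = rng.normal(size=m)                 # independent standard normal Z'
b_j = beta[j]
rhs = Z_prime * np.sqrt(1 - b_j**2 / (2 + sigma**2)) + b_j / np.sqrt(2 + sigma**2) * Z_plus

# The two samples should agree closely in their first two moments.
print(lhs.mean(), rhs.mean())
print(lhs.var(), rhs.var())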
A.2 LEMMA 2
Statement of Lemma 2   Suppose that $\Phi$ is the CDF of a standard Gaussian, $a$ and $c$ are constants, and $Z \sim \mathcal{N}(0, 1)$. Then we have:
$$\mathbb{E}[\Phi(aZ + c)] = \Phi\!\left(\frac{c}{\sqrt{1 + a^2}}\right).$$
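This is a standard Gaussian identity. As a quick numerical sanity check (a sketch with arbitrary values of $a$ and $c$), one can compare a Monte Carlo estimate of the left-hand side with the closed form:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a, c = 1.7, -0.4                       # arbitrary constants
Z = rng.normal(size=5_000_000)

lhs = norm.cdf(a * Z + c).mean()       # Monte Carlo estimate of E[Phi(aZ + c)]
rhs = norm.cdf(c / np.sqrt(1 + a**2))  # closed form from Lemma 2
print(lhs, rhs)                        # should agree to roughly three decimal places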
A.3 LEMMA 3
Statement of Lemma 3   Suppose that $Z_+$ follows the standard half-normal distribution, $\Phi$ is the CDF of a standard Gaussian, and $a$ and $b$ are constants. Then we have:
$$\mathbb{E}\left[\Phi\!\left(\frac{Z_+ b}{\sqrt{a^2 + 1}}\right)\right] = \frac{1}{2} + \frac{1}{\pi}\tan^{-1}\!\left(\frac{b}{\sqrt{a^2 + 1}}\right).$$
Proof: By the definition of expected value, we can take the following integral, where $f_{Z_+}$ is the PDF of $Z_+$. We integrate from $0$ instead of $-\infty$ because the PDF of the standard half-normal is $0$ on the domain below $0$:
$$\mathbb{E}\left[\Phi\!\left(\frac{Z_+ b}{\sqrt{a^2+1}}\right)\right] = \int_0^\infty \Phi\!\left(\frac{zb}{\sqrt{a^2+1}}\right) f_{Z_+}(z)\, dz$$
$$= \int_0^\infty \Phi\!\left(\frac{zb}{\sqrt{a^2+1}}\right) \frac{\sqrt{2}}{\sqrt{\pi}}\, e^{-\frac{z^2}{2}}\, dz$$
$$= \frac{1}{\sqrt{2\pi}}\left(\int_0^\infty e^{-\frac{z^2}{2}}\, dz + \int_0^\infty \operatorname{erf}\!\left(\frac{zb}{\sqrt{2}\sqrt{a^2+1}}\right) e^{-\frac{z^2}{2}}\, dz\right) \quad (*).$$
The second integral is generally non-trivial to solve, but luckily we can solve it by using Equation 2 in Section 4.3 of the integral table from Ng & Geller (1968), which states:
$$\int_0^\infty \operatorname{erf}(cx)\, e^{-d^2 x^2}\, dx = \frac{\sqrt{\pi}}{2d} - \frac{1}{d\sqrt{\pi}}\tan^{-1}\!\left(\frac{d}{c}\right),$$
where $c$ and $d$ are real and positive. We split the solution by cases: $b > 0$, $b = 0$, and $b < 0$. We find that in every case, we can manipulate our integral so that the solution is trivial or the constant inside the $\operatorname{erf}(\cdot)$ is positive (and so we can use the integral table). In every case, we find that the solution is $\frac{1}{2} + \frac{1}{\pi}\tan^{-1}\!\left(\frac{b}{\sqrt{a^2+1}}\right)$.
Case 1: $b > 0$. Applying the integral table with $c = \frac{b}{\sqrt{2}\sqrt{a^2+1}}$ and $d = \frac{1}{\sqrt{2}}$:
$$(*) = \frac{1}{\sqrt{2\pi}}\left(\sqrt{\frac{\pi}{2}} + \sqrt{\frac{\pi}{2}} - \frac{\sqrt{2}}{\sqrt{\pi}}\tan^{-1}\!\left(\frac{\sqrt{a^2+1}}{b}\right)\right) = 1 - \frac{1}{\pi}\tan^{-1}\!\left(\frac{\sqrt{a^2+1}}{b}\right).$$
Because $\tan^{-1}(x) + \tan^{-1}(1/x) = \frac{\pi}{2}$ for $x > 0$, this equals
$$\frac{1}{2} + \frac{1}{\pi}\tan^{-1}\!\left(\frac{b}{\sqrt{a^2+1}}\right).$$
Case 2: $b = 0$. Note that $\operatorname{erf}(0) = 0$; we do not have to use the integral table:
$$(*) = \frac{1}{\sqrt{2\pi}}\left(\sqrt{\frac{\pi}{2}} + 0\right) = \frac{1}{2}.$$
Because $\tan^{-1}(0) = 0$, we have:
$$\frac{1}{2} = \frac{1}{2} + \frac{1}{\pi}\tan^{-1}\!\left(\frac{b}{\sqrt{a^2+1}}\right).$$
Case 3: $b < 0$. Because $\operatorname{erf}(\cdot)$ is an odd function, we can pull the negative out:
$$(*) = \frac{1}{\sqrt{2\pi}}\left(\int_0^\infty e^{-\frac{z^2}{2}}\, dz - \int_0^\infty \operatorname{erf}\!\left(\frac{z|b|}{\sqrt{2}\sqrt{a^2+1}}\right) e^{-\frac{z^2}{2}}\, dz\right).$$
Now we can use the integral table as in the $b > 0$ case:
$$= \frac{1}{\sqrt{2\pi}}\left(\sqrt{\frac{\pi}{2}} - \sqrt{\frac{\pi}{2}} + \frac{\sqrt{2}}{\sqrt{\pi}}\tan^{-1}\!\left(\frac{\sqrt{a^2+1}}{|b|}\right)\right) = \frac{1}{\pi}\tan^{-1}\!\left(\frac{\sqrt{a^2+1}}{|b|}\right)$$
$$= \frac{1}{2} - \frac{1}{\pi}\tan^{-1}\!\left(\frac{|b|}{\sqrt{a^2+1}}\right) = \frac{1}{2} + \frac{1}{\pi}\tan^{-1}\!\left(\frac{b}{\sqrt{a^2+1}}\right).$$
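The case analysis can also be verified numerically. The following sketch (with arbitrary values of $a$ and $b$, chosen here only for illustration) compares a Monte Carlo estimate of $\mathbb{E}[\Phi(Z_+ b / \sqrt{a^2+1})]$ with the closed form for each sign of $b$:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a = 1.3                                      # arbitrary constant
Z_plus = np.abs(rng.normal(size=5_000_000))  # standard half-normal samples

for b in (2.0, 0.0, -2.0):                   # covers the b > 0, b = 0, and b < 0 cases
    mc = norm.cdf(Z_plus * b / np.sqrt(a**2 + 1)).mean()
    closed = 0.5 + np.arctan(b / np.sqrt(a**2 + 1)) / np.pi
    print(b, mc, closed)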
A.4 FULL PROOF
Here, we prove:
$$\mathbb{E}[\operatorname{sign}(y_1 - y_2)(\Phi(x_1) - \Phi(x_2))] = \frac{2}{\pi}\sin^{-1}\!\left(\frac{\theta^*}{\sqrt{4 + 2\sigma_1^2 + 2\sigma_2^2}}\right),$$
with $y_1$, $y_2$, $\Phi(x_1)$, $\Phi(x_2)$, and $\theta^*$ defined in the main text, for the case where $\epsilon_1$ and $\epsilon_2$ are zero-mean Gaussian noise $\sim \mathcal{N}(0, \sigma_1^2)$ and $\sim \mathcal{N}(0, \sigma_2^2)$, respectively.
It is easy to see that this is a more general version of the corresponding theorem in the main text. Applying Lemma 1 and its corollary coordinate-wise, with
$$a = \sqrt{1 - \frac{(\theta_j^*)^2}{2 + \sigma_1^2 + \sigma_2^2}} \quad \text{and} \quad b_1 = \frac{\theta_j^*}{\sqrt{2 + \sigma_1^2 + \sigma_2^2}},$$
and for $Z \sim \mathcal{N}(0,1)$ and $Z_+$ standard half-normal, we have $\Phi(x_{1j}) \,\big|\, \langle x_1 - x_2, \theta^*\rangle + \epsilon_\Delta > 0 \;\overset{d}{=}\; \Phi(Za + Z_+ b_1)$ and
$$\Phi(x_{2j}) \,\big|\, \langle x_1 - x_2, \theta^*\rangle + \epsilon_\Delta > 0 \;\overset{d}{=}\; \Phi(Za + Z_+ b_2),$$
where $b_2 = -\frac{\theta_j^*}{\sqrt{2 + \sigma_1^2 + \sigma_2^2}}$. So for the index $j$, our estimate is:
$$\mathbb{E}[\Phi(Za + Z_+ b_1)] - \mathbb{E}[\Phi(Za + Z_+ b_2)]$$
$$= \mathbb{E}\big[\mathbb{E}[\Phi(Za + c) \mid c = Z_+ b_1]\big] - \mathbb{E}\big[\mathbb{E}[\Phi(Za + c) \mid c = Z_+ b_2]\big].$$
Using Lemma 2, we have:
$$= \mathbb{E}\left[\Phi\!\left(\frac{Z_+ b_1}{\sqrt{a^2+1}}\right)\right] - \mathbb{E}\left[\Phi\!\left(\frac{Z_+ b_2}{\sqrt{a^2+1}}\right)\right].$$
Then, using Lemma 3, we have:
$$= \frac{1}{2} + \frac{1}{\pi}\tan^{-1}\!\left(\frac{b_1}{\sqrt{a^2+1}}\right) - \frac{1}{2} - \frac{1}{\pi}\tan^{-1}\!\left(\frac{b_2}{\sqrt{a^2+1}}\right)$$
$$= \frac{1}{\pi}\tan^{-1}\!\left(\frac{b_1}{\sqrt{a^2+1}}\right) - \frac{1}{\pi}\tan^{-1}\!\left(\frac{b_2}{\sqrt{a^2+1}}\right).$$
Using the fact that $\tan^{-1}$ is an odd function and $b_2 = -b_1$, we get:
$$= \frac{2}{\pi}\tan^{-1}\!\left(\frac{b_1}{\sqrt{a^2+1}}\right).$$
Now, we write $a$ and $b_1$ in terms of $\theta_j^*$:
$$= \frac{2}{\pi}\tan^{-1}\!\left(\frac{\frac{\theta_j^*}{\sqrt{2+\sigma_1^2+\sigma_2^2}}}{\sqrt{2 - \frac{(\theta_j^*)^2}{2+\sigma_1^2+\sigma_2^2}}}\right)
= \frac{2}{\pi}\tan^{-1}\!\left(\frac{\frac{\theta_j^*}{\sqrt{4+2\sigma_1^2+2\sigma_2^2}}}{\sqrt{1 - \left(\frac{\theta_j^*}{\sqrt{4+2\sigma_1^2+2\sigma_2^2}}\right)^2}}\right).$$
Using the identity $\sin^{-1}(x) = \tan^{-1}\!\left(\frac{x}{\sqrt{1-x^2}}\right)$, we have:
$$= \frac{2}{\pi}\sin^{-1}\!\left(\frac{\theta_j^*}{\sqrt{4 + 2\sigma_1^2 + 2\sigma_2^2}}\right).$$
A.5 COROLLARY 1
Corollary 1   Suppose that $\hat{\theta}$ is any vector of fixed weights and $x_i, x_j \sim \mathcal{N}(0, I)$ are independent. Then, conditioning on the event $\langle\hat{\theta}, x_i\rangle < \langle\hat{\theta}, x_j\rangle$, we have
$$\langle\hat{\theta},\; \mathbb{E}[\Phi(x_i) \mid \langle\hat{\theta}, x_i\rangle < \langle\hat{\theta}, x_j\rangle]\rangle \;<\; \langle\hat{\theta},\; \mathbb{E}[\Phi(x_j) \mid \langle\hat{\theta}, x_i\rangle < \langle\hat{\theta}, x_j\rangle]\rangle \tag{9}$$
with probability 1.
Proof: By the result of Appendix A.4 applied with $\hat{\theta}$ in place of $\theta^*$ (and no label noise), each entry $j$ of $\mathbb{E}[\Phi(x_1) - \Phi(x_2)]$ under our condition is $\frac{2}{\pi}\sin^{-1}$ of a quantity proportional to $\hat{\theta}_j$. Because $\sin^{-1}$ is an odd function, this expression has the same sign as $\hat{\theta}_j$. Because the values at every index of $\mathbb{E}[\Phi(x_1) - \Phi(x_2)]$ under our condition and $\hat{\theta}$ are the same sign, we have $\langle\mathbb{E}[\Phi(x_1) - \Phi(x_2)], \hat{\theta}\rangle > 0$, so $\langle\hat{\theta}, \mathbb{E}[\Phi(x_1)]\rangle > \langle\hat{\theta}, \mathbb{E}[\Phi(x_2)]\rangle$.
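Corollary 1 can likewise be checked by simulation. The sketch below is illustrative only (the dimension and the weight vector are arbitrary); it conditions on one sample winning the comparison under $\hat{\theta}$ and confirms that its conditional mean $\Phi$-transformed features score higher under $\hat{\theta}$.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = 4
theta_hat = rng.normal(size=D)             # arbitrary fixed weight vector

n = 1_000_000
x1 = rng.normal(size=(n, D))
x2 = rng.normal(size=(n, D))
keep = x1 @ theta_hat > x2 @ theta_hat     # condition on x1 "winning" under theta_hat

lhs = norm.cdf(x1[keep]).mean(axis=0) @ theta_hat   # <theta_hat, E[Phi(x1) | condition]>
rhs = norm.cdf(x2[keep]).mean(axis=0) @ theta_hat   # <theta_hat, E[Phi(x2) | condition]>
print(lhs > rhs, lhs, rhs)                 # expect True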
Suppose that
$$\hat{\theta}^{\mathrm{proj}} = \arg\max_{\theta}\; \langle\hat{\theta}, \theta\rangle$$
subject to:
$$\sum_{i=1}^{D} \theta_i = 1, \qquad 0 \leq \theta_i \leq \tau_i \;\; \forall i \in [1, D],$$
where $\tau_i > 0$ are fixed values. Then, the solution is:
$$\hat{\theta}_k^{\mathrm{proj}} = \begin{cases}
\tau_k & \text{if } \sum_{j:\, r_j(\hat{\theta}_j) \geq r_k(\hat{\theta}_k)} \tau_j \leq 1 \\[4pt]
1 - \sum_{j:\, r_j(\hat{\theta}_j) > r_k(\hat{\theta}_k)} \tau_j & \text{if } \sum_{j:\, r_j(\hat{\theta}_j) \geq r_k(\hat{\theta}_k)} \tau_j \geq 1 \;\wedge\; \sum_{j:\, r_j(\hat{\theta}_j) > r_k(\hat{\theta}_k)} \tau_j \leq 1 \\[4pt]
0 & \text{otherwise}
\end{cases} \tag{10}$$
where $r$ is some function that breaks all ties between $\hat{\theta}_j$ and $\hat{\theta}_k$ for $k \neq j$, and otherwise leaves the ordinal relationships the same.
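Equation 10 amounts to filling caps greedily in decreasing (tie-broken) order of $\hat{\theta}$. Below is a minimal NumPy sketch of this rule (the function name and example values are illustrative; ties are broken by index in place of $r$, and a feasible cap budget $\sum_i \tau_i \geq 1$ is assumed).

import numpy as np

def solve_eq10(theta_hat, tau):
    """Greedy cap-filling solution of Equation 10 (sketch).

    Each coordinate receives its cap tau_k, a partial remainder, or 0,
    depending on where the cumulative sum of caps (taken in decreasing
    order of theta_hat) crosses the total budget of 1."""
    theta_hat, tau = np.asarray(theta_hat, float), np.asarray(tau, float)
    order = np.argsort(-theta_hat, kind="stable")   # decreasing theta_hat; stable sort breaks ties by index
    proj = np.zeros_like(theta_hat)
    cum = 0.0                                       # sum of caps over strictly higher-ranked coordinates
    for k in order:
        if cum + tau[k] <= 1.0:                     # first case: the full cap fits in the budget
            proj[k] = tau[k]
        elif cum <= 1.0:                            # second case: spend whatever budget remains
            proj[k] = 1.0 - cum
        else:                                       # third case: budget already exhausted
            proj[k] = 0.0
        cum += tau[k]
    return proj

# Hypothetical example: the output sums to 1 and respects the caps.
print(solve_eq10([0.9, 0.5, 0.1, -0.3], [0.4, 0.4, 0.4, 0.4]))   # [0.4, 0.4, 0.2, 0.0]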
Proof: We proceed by considering each of the three cases from Equation 10.
Case 1. Suppose for the sake of contradiction that the optimal solution is $\hat{\theta}^{\mathrm{proj}}$ and yet $\hat{\theta}_k^{\mathrm{proj}} < \tau_k$ for some $\hat{\theta}_k^{\mathrm{proj}}$ falling under the first case of Equation 10. Now suppose that we construct a $\theta'$ also satisfying the projection constraints that is the same as $\hat{\theta}^{\mathrm{proj}}$ except in these places:
$$\theta'_k = \hat{\theta}_k^{\mathrm{proj}} + \Delta = \tau_k$$
$$\theta'_p = \hat{\theta}_p^{\mathrm{proj}} - \delta_1 \geq 0$$
$$\vdots$$
$$\theta'_q = \hat{\theta}_q^{\mathrm{proj}} - \delta_n \geq 0$$
for some $\Delta = \sum_{i=1}^{n} \delta_i > 0$, where $\hat{\theta}_p \geq \cdots \geq \hat{\theta}_q$ are all of the $\hat{\theta}$ values which do not fall under the first condition and where the corresponding $\hat{\theta}^{\mathrm{proj}}$ values are nonzero. We know that there must be some $\hat{\theta}_p^{\mathrm{proj}}, \cdots, \hat{\theta}_q^{\mathrm{proj}}$ from which we can subtract $\delta_1, \cdots, \delta_n$ (and so from which we can take the $\Delta$) because $\sum_{j:\, r_j(\hat{\theta}_j) \geq r_k(\hat{\theta}_k)} \tau_j \leq 1$. Now, we have that $\langle\hat{\theta}, \theta'\rangle \geq \langle\hat{\theta}, \hat{\theta}^{\mathrm{proj}}\rangle$, since the weight $\Delta$ is moved onto a coordinate whose $\hat{\theta}$ value is at least as large as those it is taken from.
At this point, the only way to avoid the contradiction result would be if θ̂k = θ̂p = · · · = θ̂q .
Otherwise, the above non-strict inequality would be a strict inequality. If θ̂k = θ̂p = · · · = θ̂q , then
we know that θ̂k is the smallest θ̂ value satisfying condition 1 and all of the other greater θ̂ values
satisfying condition 1 must be projected to their τ threshold value (otherwise we would get the
contradiction result). In this edge case, we can see that rearranging the remaining weight among
equal θ̂ values does not change the dot product, so all of the solutions that we can get without the
contradiction result are equivalently optimal (including the solution from Equation 10).
Case 3. This is analogous to Case 1. Suppose for the sake of contradiction that the optimal solution is $\hat{\theta}^{\mathrm{proj}}$ and yet $\hat{\theta}_k^{\mathrm{proj}} > 0$ for some $\hat{\theta}_k^{\mathrm{proj}}$ falling under the third case of Equation 10. Now suppose that we construct a $\theta'$ also satisfying the projection constraints that is the same as $\hat{\theta}^{\mathrm{proj}}$ except in these places:
$$\theta'_k = \hat{\theta}_k^{\mathrm{proj}} - \Delta = 0$$
$$\theta'_p = \hat{\theta}_p^{\mathrm{proj}} + \delta_1 \leq \tau_p$$
$$\vdots$$
$$\theta'_q = \hat{\theta}_q^{\mathrm{proj}} + \delta_n \leq \tau_q$$
for some $\Delta = \sum_{i=1}^{n} \delta_i > 0$, where $\hat{\theta}_p \geq \cdots \geq \hat{\theta}_q$ are all of the $\hat{\theta}$ values which do not fall under the third condition and where the corresponding $\hat{\theta}^{\mathrm{proj}}$ values are not at their thresholds. By construction we know that there must be some $\hat{\theta}_p^{\mathrm{proj}}, \cdots, \hat{\theta}_q^{\mathrm{proj}}$ to which we can add $\delta_1, \cdots, \delta_n$. Now, we have that $\langle\hat{\theta}, \theta'\rangle \geq \langle\hat{\theta}, \hat{\theta}^{\mathrm{proj}}\rangle$, since the weight $\Delta$ is moved onto coordinates whose $\hat{\theta}$ values are at least as large as $\hat{\theta}_k$.
At this point, the only way to avoid the contradiction result would be if θ̂k = θ̂p = · · · = θ̂q .
Otherwise, the above non-strict inequality would be a strict inequality. If θ̂k = θ̂p = · · · = θ̂q , then
we know that θ̂k is the largest θ̂ value satisfying condition 3 and all of the other smaller θ̂ values
satisfying condition 3 must be projected to 0 (otherwise we would get the contradiction result). In
this edge case, we can see above that rearranging the remaining weight among equal θ̂ values does
not change the dot product, so all of the solutions that we can get without the contradiction result
are equivalently optimal (including the solution from Equation 10).
Case 2. Above, we showed that both Case 1 and Case 3 are true. So, the remaining weight must be given to the single value of $\hat{\theta}^{\mathrm{proj}}$ not covered by either case.
B.2.1 LEMMA 4
Statement of Lemma 4   Suppose that
$$\hat{\theta}^{\mathrm{proj}} = \arg\min_{\theta}\; \lVert\hat{\theta} - \theta\rVert_2^2$$
subject to:
$$\sum_{i=1}^{D} \theta_i = 1, \qquad 0 \leq \theta_i \leq \tau_i \;\; \forall i \in [1, D],$$
where $\tau_i > 0$ are fixed values. Then, $\hat{\theta}_s^{\mathrm{proj}} = 0$ implies that any $j$ with $\hat{\theta}_s > \hat{\theta}_j$ must have $\hat{\theta}_j^{\mathrm{proj}} = 0$.
Proof: This is similar to Lemma 2 from Shalev-Shwartz & Singer (2006). Assume for the sake of contradiction that $\hat{\theta}_s^{\mathrm{proj}} = 0$ and $\hat{\theta}_s > \hat{\theta}_j$, yet we have $\hat{\theta}_j^{\mathrm{proj}} > 0$.
Now we can construct another vector $\theta'$ that is the same as $\hat{\theta}^{\mathrm{proj}}$, except in two places:
$$\theta'_s = \hat{\theta}_s^{\mathrm{proj}} + \Delta$$
$$\theta'_j = \hat{\theta}_j^{\mathrm{proj}} - \Delta,$$
for some $\Delta$ satisfying $0 < \Delta < \min(\hat{\theta}_j^{\mathrm{proj}},\, \tau_s - \hat{\theta}_s^{\mathrm{proj}})$. This bound on $\Delta$ ensures that $\theta'$ is still within the thresholds. We know that such a $\Delta$ exists because $\min(\hat{\theta}_j^{\mathrm{proj}},\, \tau_s - \hat{\theta}_s^{\mathrm{proj}}) > 0$ (by supposition, $\tau_s - \hat{\theta}_s^{\mathrm{proj}} = \tau_s - 0 > 0$ and $\hat{\theta}_j^{\mathrm{proj}} > 0$).
Now we can compute:
$$\lVert\hat{\theta} - \hat{\theta}^{\mathrm{proj}}\rVert_2^2 - \lVert\hat{\theta} - \theta'\rVert_2^2 = (\hat{\theta}_s - \hat{\theta}_s^{\mathrm{proj}})^2 + (\hat{\theta}_j - \hat{\theta}_j^{\mathrm{proj}})^2 - (\hat{\theta}_s - (\hat{\theta}_s^{\mathrm{proj}} + \Delta))^2 - (\hat{\theta}_j - (\hat{\theta}_j^{\mathrm{proj}} - \Delta))^2$$
$$= 2\Delta\big((\hat{\theta}_s - \hat{\theta}_s^{\mathrm{proj}}) - (\hat{\theta}_j - \hat{\theta}_j^{\mathrm{proj}}) - \Delta\big)$$
$$> 2\Delta\big((\hat{\theta}_s - \hat{\theta}_s^{\mathrm{proj}}) - (\hat{\theta}_j - \hat{\theta}_j^{\mathrm{proj}}) - \min(\hat{\theta}_j^{\mathrm{proj}},\, \tau_s - \hat{\theta}_s^{\mathrm{proj}})\big)$$
$$\geq 2\Delta\big((\hat{\theta}_s - \hat{\theta}_s^{\mathrm{proj}}) - (\hat{\theta}_j - \hat{\theta}_j^{\mathrm{proj}}) - \hat{\theta}_j^{\mathrm{proj}}\big)$$
$$= 2\Delta(\hat{\theta}_s - \hat{\theta}_j)$$
$$> 0.$$
So $\hat{\theta}^{\mathrm{proj}}$ cannot be the optimal solution.
B.2.2 LEMMA 5
Statement of Lemma 5   Suppose that
$$\hat{\theta}^{\mathrm{proj}} = \arg\min_{\theta}\; \lVert\hat{\theta} - \theta\rVert_2^2$$
subject to:
$$\sum_{i=1}^{D} \theta_i = 1, \qquad 0 \leq \theta_i \leq \tau_i \;\; \forall i \in [1, D],$$
where $\tau_i > 0$ are fixed values. Then, $\hat{\theta}_s^{\mathrm{proj}} = \tau_s$ implies $\hat{\theta}_j^{\mathrm{proj}} = \tau_j$ for any $j$ with $\hat{\theta}_j - \tau_j > \hat{\theta}_s - \tau_s$.
Proof: Again, this is similar to Lemma 2 from Shalev-Shwartz & Singer (2006). Assume for the sake of contradiction that $\hat{\theta}_s^{\mathrm{proj}} = \tau_s$ and $\hat{\theta}_j - \tau_j > \hat{\theta}_s - \tau_s$, yet we have $\hat{\theta}_j^{\mathrm{proj}} < \tau_j$.
Now we can construct another vector $\theta'$ that is the same as $\hat{\theta}^{\mathrm{proj}}$, except in two places:
$$\theta'_s = \hat{\theta}_s^{\mathrm{proj}} - \Delta$$
$$\theta'_j = \hat{\theta}_j^{\mathrm{proj}} + \Delta,$$
for some $\Delta$ satisfying $0 < \Delta < \min(\hat{\theta}_s^{\mathrm{proj}},\, \tau_j - \hat{\theta}_j^{\mathrm{proj}})$. This bound on $\Delta$ ensures that $\theta'$ is still within the thresholds. We know that such a $\Delta$ exists because $\min(\hat{\theta}_s^{\mathrm{proj}},\, \tau_j - \hat{\theta}_j^{\mathrm{proj}}) > 0$ (by supposition, $\tau_j - \hat{\theta}_j^{\mathrm{proj}} > 0$ and $\hat{\theta}_s^{\mathrm{proj}} = \tau_s > 0$).
Now we can compute:
$$\lVert\hat{\theta} - \hat{\theta}^{\mathrm{proj}}\rVert_2^2 - \lVert\hat{\theta} - \theta'\rVert_2^2 = (\hat{\theta}_s - \hat{\theta}_s^{\mathrm{proj}})^2 + (\hat{\theta}_j - \hat{\theta}_j^{\mathrm{proj}})^2 - (\hat{\theta}_s - (\hat{\theta}_s^{\mathrm{proj}} - \Delta))^2 - (\hat{\theta}_j - (\hat{\theta}_j^{\mathrm{proj}} + \Delta))^2$$
$$= 2\Delta\big((\hat{\theta}_j - \hat{\theta}_j^{\mathrm{proj}}) - (\hat{\theta}_s - \hat{\theta}_s^{\mathrm{proj}}) - \Delta\big)$$
$$> 2\Delta\big((\hat{\theta}_j - \hat{\theta}_j^{\mathrm{proj}}) - (\hat{\theta}_s - \hat{\theta}_s^{\mathrm{proj}}) - \min(\hat{\theta}_s^{\mathrm{proj}},\, \tau_j - \hat{\theta}_j^{\mathrm{proj}})\big)$$
$$\geq 2\Delta\big((\hat{\theta}_j - \hat{\theta}_j^{\mathrm{proj}}) - (\hat{\theta}_s - \hat{\theta}_s^{\mathrm{proj}}) - (\tau_j - \hat{\theta}_j^{\mathrm{proj}})\big)$$
$$= 2\Delta\big((\hat{\theta}_j - \tau_j) - (\hat{\theta}_s - \hat{\theta}_s^{\mathrm{proj}})\big)$$
$$= 2\Delta\big((\hat{\theta}_j - \tau_j) - (\hat{\theta}_s - \tau_s)\big)$$
$$> 0.$$
So $\hat{\theta}^{\mathrm{proj}}$ cannot be the optimal solution.
Suppose that
$$\hat{\theta}^{\mathrm{proj}} = \arg\min_{\theta}\; \lVert\hat{\theta} - \theta\rVert_2^2$$
subject to:
$$\sum_{i=1}^{D} \theta_i = 1, \qquad 0 \leq \theta_i \leq \tau_i \;\; \forall i \in [1, D],$$
where $\tau_i > 0$ are fixed values. Then the solution is:
$$\hat{\theta}_k^{\mathrm{proj}} = \min(\max(\hat{\theta}_k - \lambda, 0), \tau_k),$$
where $\lambda$ is found (through e.g. bisection search) to satisfy:
$$\sum_{i=1}^{D} \min(\max(\hat{\theta}_i - \lambda, 0), \tau_i) = 1.$$
Proof: Note that this problem is the same as the simplex projection problem from Shalev-Shwartz & Singer (2006) and Duchi et al. (2008), except here we have additional $\theta_i \leq \tau_i$ constraints. The Lagrangian for this problem is:⁵
$$\mathcal{L}(\theta, \mu, \zeta, \lambda) = \frac{1}{2}\lVert\hat{\theta} - \theta\rVert_2^2 + \lambda\left(-1 + \sum_{i=1}^{D} \theta_i\right) - \langle\mu, \theta\rangle + \langle\zeta, \theta - \tau\rangle.$$
To find the optimality condition with respect to a single index of $\theta$, we set the derivative to zero:
$$\frac{d\mathcal{L}}{d\theta_i} = \theta_i - \hat{\theta}_i + \lambda - \mu_i + \zeta_i = 0.$$
The complementary slackness KKT condition gives us that $\zeta_i = \mu_i = 0$ when $0 < \theta_i < \tau_i$, so for $\theta_i$ not at the boundary of our constraints, we get:
$$\theta_i = \hat{\theta}_i - \lambda.$$
So, we have that for all $\theta_i \in (0, \tau_i)$, there is a shared value $\lambda$ which we subtract from $\hat{\theta}_i$ to get the value of $\theta_i$. How do we know which $\theta_i$ are $0$ and which $\theta_i$ are $\tau_i$, though?
Assume that we know $\lambda$. By Lemma 4 and Lemma 5, we can characterize the optimal solution as:
$$\hat{\theta}_k^{\mathrm{proj}} = \max(\hat{\theta}_k - \lambda, 0) \quad \text{and} \quad \hat{\theta}_k^{\mathrm{proj}} = \min(\hat{\theta}_k - \lambda, \tau_k),$$
i.e., $\hat{\theta}_k^{\mathrm{proj}} = \min(\max(\hat{\theta}_k - \lambda, 0), \tau_k)$.
Given this constraint, we can find $\lambda$ through search (moving the value up or down). We can see this by noticing that $\sum_{i=1}^{D} \min(\max(\hat{\theta}_i - \lambda, 0), \tau_i)$ is a strictly decreasing function of $\lambda$ between the setting of $\lambda$ that makes $\hat{\theta}_i - \lambda > 0$ for at least one $i$, and the setting of $\lambda$ that makes $\hat{\theta}_i - \lambda < \tau_i$ for at least one $i$. So in this range, there is only one setting of $\lambda$ that satisfies this equation. We can only choose a $\lambda$ outside of this range when $\sum_{i=1}^{D} \tau_i = 1$, and in this case the solution is trivial: $\hat{\theta}_i^{\mathrm{proj}} = \tau_i$ for all $i$.
⁵ Note that multiplying $\lVert\hat{\theta} - \theta\rVert_2^2$ by $\frac{1}{2}$ does not change the minimization problem and enables us to get rid of a factor of $2$ after taking the derivative of the Lagrangian.
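A minimal NumPy sketch of this clipped-shift projection, with $\lambda$ found by bisection, is below (the function name and example values are illustrative, and $\sum_i \tau_i \geq 1$ is assumed so the problem is feasible). Note that this quadratic projection generally differs from the linear-projection solution in Equation 10; the two are compared in Figure 7.

import numpy as np

def project_capped_simplex(theta_hat, tau, iters=100):
    """Euclidean projection onto {theta : sum(theta) = 1, 0 <= theta <= tau} (sketch).

    Uses the characterization theta_k = min(max(theta_hat_k - lam, 0), tau_k)
    and finds lam by bisection so that the coordinates sum to 1."""
    theta_hat, tau = np.asarray(theta_hat, float), np.asarray(tau, float)

    def total(lam):                       # non-increasing in lam
        return np.minimum(np.maximum(theta_hat - lam, 0.0), tau).sum()

    lo = (theta_hat - tau).min()          # every coordinate capped: total(lo) = sum(tau) >= 1
    hi = theta_hat.max()                  # every coordinate zeroed: total(hi) = 0 <= 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return np.minimum(np.maximum(theta_hat - lam, 0.0), tau)

# Hypothetical example: the output sums to 1 and respects the caps.
print(project_capped_simplex([0.9, 0.5, 0.1, -0.3], [0.4, 0.4, 0.4, 0.4]))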
Table 2: LLM Pretraining Hyperparameters

Parameter                     Value
Per-device Batch Size         128
Learning Rate                 5 × 10⁻³
Warmup Ratio                  0.1
Adam β1                       0.9
Adam β2                       0.95
Adam ϵ                        1 × 10⁻⁸
Weight Decay                  0.1
LR Scheduler                  cosine
Max Grad Norm                 1.0
BF16                          True
Distributed Backend           nccl
Gradient Accumulation Steps   1
We trained each LLM on 4 NVIDIA A100 GPUs. At 3.2B tokens, each training run took under 3
hours with the Hugging Face Trainer (Wolf et al., 2019) and appropriate PyTorch (Ansel et al., 2024)
compile flags. We provide pretraining hyperparameters in Table 2. Given our per-device batch size,
we found the learning rate by increasing it by a factor of 2 until we saw instability and then using
the highest learning rate where no instability was observed. Refer to the Pythia paper (Biderman
et al., 2023) for more information; we initialized the model from scratch using their 160M model
configuration at https://huggingface.co/EleutherAI/pythia-160m. Other hyperparameters
can be assumed to be Hugging Face Trainer defaults at the time of this writing.
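For reference, the Table 2 settings correspond roughly to the following Hugging Face TrainingArguments configuration (a sketch rather than the exact training script; the output path is a placeholder and any unspecified arguments fall back to Trainer defaults):

from transformers import TrainingArguments

# Sketch: Table 2 hyperparameters expressed as Hugging Face TrainingArguments.
training_args = TrainingArguments(
    output_dir="pythia-160m-pretrain",    # placeholder output path
    per_device_train_batch_size=128,
    learning_rate=5e-3,
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-8,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,
    bf16=True,
    ddp_backend="nccl",
    gradient_accumulation_steps=1,
)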
At the end of the pretraining script, we used the Eleuther AI Eval Harness (Gao et al., 2023). For
efficiency, we set the sample limit to 5000 examples per benchmark. Elsewhere, we used the default
settings. On 4 NVIDIA A100s, it took only a few minutes per LLM to compute evaluation results
for SciQ, ARC Easy, PIQA, LAMBADA, and all of the translations of LAMBADA.
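As an illustration (a sketch, not the exact evaluation script), such a run can be launched through the harness's Python API; the checkpoint path is a placeholder, and the task names and limit argument follow lm-evaluation-harness conventions at the time of writing, which may differ across versions:

import lm_eval

# Sketch: evaluate a trained checkpoint on a 5000-example subsample per benchmark.
# Multilingual LAMBADA task names vary by harness version and are omitted here.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/checkpoint",   # placeholder checkpoint path
    tasks=["sciq", "arc_easy", "piqa", "lambada_openai"],
    limit=5000,
)
print(results["results"])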
Table 3: Unique pretraining tokens selected per benchmark, from DSIR.

Benchmark        Tokens
ARC Easy         2,905,206,499
PIQA             2,910,486,295
SciQ             2,920,734,042
LAMBADA          3,022,219,424
LAMBADA (DE)     3,210,986,137
LAMBADA (ES)     3,396,528,704
LAMBADA (FR)     3,413,930,081
LAMBADA (IT)     3,384,854,845
D.3 DSIR
DSIR (Xie et al., 2023b), despite its simplicity, requires some tuning. A decision must be made
about how to format the benchmark data into a single piece of text per example so that it can be
compared with potential pretraining data in terms of n-gram overlap. The LAMBADA tasks only
have one text column per example, so the decision here is trivial. Examples from the other tasks
each have a question, possibly a context, and a set of multiple choice answers to choose from. We
chose to concatenate all of these columns together with spaces to form one piece of text per example,
duplicating the same question as a prefix for each different answer.
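Concretely, this formatting can be sketched with a small helper like the one below (a hypothetical illustration; the field names do not match any benchmark's exact schema):

def format_example(question, choices, context=""):
    """Flatten one multiple-choice example into a single text for n-gram matching,
    duplicating the question (and optional context) as a prefix for each answer choice."""
    prefix = f"{context} {question}".strip()
    return " ".join(f"{prefix} {choice}" for choice in choices)

# Hypothetical usage:
text = format_example(
    question="Which gas do plants absorb during photosynthesis?",
    choices=["carbon dioxide", "oxygen", "nitrogen", "helium"],
)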
DSIR does not allow the user to specify the exact number of unique tokens desired for pretraining.
It only allows the specification of the number of unique pages, which can have wildly varying token
counts. For every DSIR job, we set the desired number of pages to 3325589, which we found
through binary search to produce slightly more than 3.2B unique tokens for LAMBADA (FR). It was
expensive to find this number for even one benchmark, because for each iteration of the binary search,
we had to run DSIR and then the Pythia tokenizer to know how many tokens resulted from the input
page number parameter. We provide the number of unique tokens from DSIR for each task in Table
3. We pretrained on 3.2B tokens for every LLM regardless of whether all of them were unique.
The “SOTA” fastText model from Li et al. (2024) is available here: https://huggingface.co/
mlfoundations/fasttext-oh-eli5. We used this model to filter data by sorting pages by the
model’s “high quality” score, including the top pages in order until we had either reached or gone
slightly over 3.2B unique tokens. This aligns with the data-selection procedure in the original paper,
and is also essentially the same as running the linear projection (Equation 10) at the page-level. We
also applied this method when selecting data using our own fastText filter trained by our algorithm.
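A sketch of this top-pages-by-score selection rule is below (the function and variable names are illustrative; the actual pipeline streams over the full candidate pool and uses the Pythia tokenizer for token counts):

def select_top_pages(pages, scores, token_counts, budget=3_200_000_000):
    """Take pages in decreasing classifier-score order until the unique-token
    budget is reached or just exceeded (a sketch of the selection rule)."""
    order = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    selected, total = [], 0
    for i in order:
        selected.append(pages[i])
        total += token_counts[i]
        if total >= budget:          # stop once we reach or slightly exceed the budget
            break
    return selected, total

# Hypothetical usage: `pages`, `scores`, and `token_counts` are parallel lists of
# page texts, fastText "high quality" probabilities, and per-page token counts.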
(a) Estimate with linear projection. This is our algorithm from the main text without training the additional fastText filter.
(b) Estimate with quadratic projection. Same as (a) except the linear projection is replaced with the quadratic projection.
(c) Spearman rank correlation with linear projection. Same as (a) except we replaced our estimator with the Spearman rank correlation.
(d) fastText filter trained on data selected in (c). This is the same as our algorithm in the main text, replacing our estimator with the Spearman rank correlation.
Figure 7: Pretraining results for different methods within our paradigm. Overall, we see that many rank-correlation pretraining data selection approaches perform well.
Figure 8: This figure is analogous to Figure 3, except the τ thresholds have been multiplied by 5.
We see that our approach selects even more relevant data when the selection pool is larger.
Figure 9: The parameter-count histogram of the 90 models from the Open LLM Leaderboard
(Beeching et al., 2023) that we used to compute our estimate for pretraining data selection. Bar
widths are 160M. The smallest model in the sample has ≈33M parameters and the largest has ≈9B.
The spike around 6.7B parameters is due to a large number of partially trained Pythia (Biderman
et al., 2023) checkpoints from the same training run at that scale. Our algorithm has the hard task
of selecting pretraining data for 160M parameter models, which is abnormally small in the set of
models used to compute the estimate.