Selecting Large Language Models To Fine-Tune Via Rectified Scaling Law
Abstract

The ever-growing ecosystem of LLMs has posed a challenge in selecting the most appropriate pre-trained model to fine-tune amidst a sea of options. Given constrained resources, fine-tuning all models and making selections afterward is unrealistic. In this work, we formulate this resource-constrained selection task into predicting fine-tuning performance and illustrate its natural connection with Scaling Law. Unlike pre-training, we find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase". We also explain why existing Scaling Law fails to capture this phase transition phenomenon both theoretically and empirically. To address this, we introduce the concept of "pre-learned data size" into our Rectified Scaling Law, which overcomes theoretical limitations and fits experimental results much better. By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption, while other methods may provide negatively correlated selection. The project page is available at rectified-scaling-law.github.io.

1. Introduction

Recent years have witnessed the unprecedented development of large language models (LLMs) (Touvron et al., 2023; Achiam et al., 2023), as well as the benefits they bring to numerous downstream tasks (Liu et al., 2019; Devlin et al., 2018). Among all these advances, one important technique is fine-tuning, which re-trains a pre-trained model on specific datasets to convert the model into a task-specific expert (Ke et al., 2023b;a). It has been widely demonstrated to be effective (Alt et al., 2019). The common workflow of fine-tuning an LLM starts with selecting an appropriate pre-trained model. Thanks to the ever-growing ecosystem of LLMs like HuggingFace, we are able to choose from countless models for specific downstream task fine-tuning.

However, the explosion of open-sourced models also poses a "mixed blessing": how can we select the model with optimal performance after fine-tuning? Given various resource constraints on time, computation, and storage (Hoffmann et al., 2022a), it is unrealistic to fine-tune all candidates and make selections afterward. It is also unstable and unpredictable to rely on empirical human impressions to select an LLM for a new task, such as selecting the largest one, the most well-known one, or even the one with the highest zero-shot performance on the targeted task (Brown et al., 2020). In addition, most existing model selection methods (Vu et al., 2020; Dwivedi et al., 2020) fail to solve LLM fine-tuning tasks because they were designed for classification and regression tasks, which are incompatible with generative LLMs (Bai et al., 2023). This brings us to the problem of LLM selection for fine-tuning from a unified perspective, especially in a resource-constrained manner.

To better address this challenge, we formulate LLM Selection in the context of fine-tuning for the first time. Our framework models the challenge as a resource-constrained task to predict the full-fine-tuning performance of a model, i.e., the performance after fine-tuning on the entire downstream task dataset. By measuring the error between the predicted and the true full-fine-tuning performance, we further show that intuitive selection methods based on model size, zero-shot performance, or fine-tuned performance on a small subset all fail to give a good full-fine-tuning performance prediction (Figure 1(a)). The correlation between their predictions and the ground-truth performance is surprisingly low.

We point out that the challenge in predicting full-fine-tuning performance with limited resources naturally draws parallels to the study of LLM Scaling Law (Kaplan et al., 2020), which has been successfully applied to predict LLM pre-training performance with at most 10,000x less compute (Achiam et al., 2023).

*Equal contribution. 1 Institute for Artificial Intelligence, Peking University; 2 Peking University; 3 Stanford University; 4 Tsinghua University. Correspondence to: Yitao Liang <[email protected]>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).
Figure 1. (a) The Pearson correlation between the true full-fine-tuning performance and the predicted performance of three intuitive methods (ZeroShot, SubTuning, ModelSize), given different resource constraints denoted by γ. These baseline methods cannot predict performance well, especially under demanding constraints (small γ), and can even provide negatively correlated predictions. (b) The phase transition phenomenon observed in the scaling of fine-tuning loss L with training sample size D. In addition to the widely studied power phase, where (L, D) are linearly correlated under the log-log scale, we discover the pre-power phase when D is small. Previous laws fail to fit both phases, while our proposed law fits quite well. (c) Our LLM selection algorithm, which extrapolates full-fine-tuning performance based on the new law.
Similarly, can we leverage Scaling Law to efficiently and accurately predict the performance of fine-tuning as well?

In this paper, we conduct thorough experiments on scaling behavior in fine-tuning using 30 models with sizes varying from 100 million to 7 billion parameters. As shown in Figure 1(b), we find a previously unobserved phase transition pattern, called the "pre-power phase", in the low-data regime, where the slope gradually decreases before the widely studied "power phase", where the test loss and the number of samples D are roughly linearly correlated. The transition is crucial for fine-tuning, as typical fine-tuning datasets can vary from hundreds to millions of samples, covering both phases. We theoretically explain this phenomenon via the concept of pre-learned data size, which represents the equivalent amount of downstream task samples that the model has pre-learned from the pre-training corpus. Inspired by this, we establish the Rectified Scaling Law of LLM fine-tuning (a.k.a. "Fine-tuning Scaling Law") by incorporating this concept (Equation (7)), which fits all experimental results much better than all existing laws, aligning with our theoretical judgments.

Based on the Rectified Scaling Law of LLM fine-tuning, we design a novel LLM selection algorithm called "Accept then Stop" (AtS, Figure 1(c)). Starting from the maximum allowed constraints, it keeps accepting fine-tuning results on a series of size-decreasing subsets, stops once it distinguishes the transition pattern, and uses all accepted results to linearly extrapolate the full-fine-tuning performance. The designed algorithm demonstrates outstanding LLM selection performance under extensive experimental settings, and selects the near-optimal model with hundreds of times less resource consumption, under which other approaches can provide negatively correlated selection results. Extensive ablation experiments also prove its robustness and stability.

In summary, we first formulate the LLM selection framework with great compatibility, and draw its connection with the study of Scaling Law for model fine-tuning (Section 2). We demonstrate why previous laws fail to fit fine-tuning performance both theoretically and empirically, and establish a new Scaling Law that fits much better (Section 3). We propose a novel LLM selection algorithm based on the established law that significantly outperforms all other baselines under extensive experimental settings (Section 4). Together, our work makes a first step towards LLM selection for fine-tuning, and towards a better understanding of Scaling Law in practical downstream applications.

2. LLM Selection Framework for Fine-tuning

2.1. Problem Formulation

Throughout the paper, we consider the standard supervised fine-tuning (Dai & Le, 2015; Devlin et al., 2018) paradigm in the full parameter space of auto-regressive models (Graves, 2014) that sequentially predict each token in a target y based on an input x. For a pre-trained model M and a dataset S, we use FT(M; S) to denote the model fine-tuned on dataset S from M.¹ We formulate the model selection task in the context of fine-tuning as follows.

¹The fine-tuning process is regarded as a black box in our paper, as our focus is not on "how to fine-tune a model".

Definition 2.1 (LLM Selection for Fine-tuning). Given a set of pre-trained LLMs M = {M_i}_{i=1}^m with m models, and a fine-tuning sub-dataset S_sub sampled from the complete dataset S, i.e., S_sub ⊂ S ∼ D with |S_sub| = γ|S|, where γ ∈ (0, 1] is the data budget ratio, the goal of an LLM selection algorithm A : (M; S_sub) ↦ ℝ is to score each model M ∈ M with access to S_sub, such that the score reflects the loss over the distribution D after fine-tuning M on S, i.e., we hope that

    L(FT(M̂(S_sub); S)) = min_{M∈M} L(FT(M; S)),    (1)

    where M̂(S_sub) ≜ argmin_{M∈M} A(M, S_sub).    (2)
Here L(M) is the expectation of the average cross-entropy loss of model M on a sample (x, y), taken over the target token sequence y, i.e.,

    L(M) = E_{(x,y)∼D} [ (1/|y|) Σ_{j=1}^{|y|} −log P(y_j | {y_i}_{i=1}^{j−1}, x) ].    (3)

[Figure: Pre-training Scaling vs. Fine-tuning Scaling, plotting Negative Log-Perplexity against SuperGLUE Accuracy for XL and NL-series models; caption lost in extraction.]
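For reference, the per-sample quantity inside Equation (3) can be computed with a HuggingFace-style causal LM as in the sketch below; the model and tokenizer objects are assumed inputs, and L(M) is this value averaged over a held-out test set.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_loss(model, tokenizer, x: str, y: str) -> float:
    """Length-normalized cross-entropy of Eq. (3) for one (x, y) pair."""
    ids_x = tokenizer(x, return_tensors="pt").input_ids
    ids_y = tokenizer(y, return_tensors="pt", add_special_tokens=False).input_ids
    ids = torch.cat([ids_x, ids_y], dim=1)
    logits = model(ids).logits                    # (1, |x|+|y|, vocab)
    # Token y_j is predicted from the logits at the previous position.
    targets = ids[0, ids_x.shape[1]:]             # the |y| target tokens
    preds = logits[0, ids_x.shape[1] - 1 : -1]    # their predictive logits
    log_p = F.log_softmax(preds, dim=-1)
    token_log_p = log_p[torch.arange(targets.numel()), targets]
    return (-token_log_p.mean()).item()           # (1/|y|) * sum_j -log P(y_j | y_<j, x)
```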
Models. … such as GPT2 (Radford et al., 2019). We also include some multilingual models (Xue et al., 2021), MoE-based models (Fedus et al., 2022), and instruction-tuned models (Wu et al., 2024) for diversity. For clarity, we select 6 representative models for illustration throughout the main paper, including GPT2, LaMini-GPT (the instruction-tuned version of GPT2), Cerebras-GPT (three different versions for comparison), and mT5 (a multilingual encoder-decoder model). Results of the complete model set are presented in Appendix B.

Fine-tuning Datasets. We consider machine translation (WMT19 English-Chinese (En-Zh) (Kocmi et al., 2022)), paragraph summarization (Gigaword (Rush et al., 2015)), and multi-task instruction tuning (FLAN (Wei et al., 2021)) as the downstream fine-tuning tasks. These tasks are representative and well-established in NLP with a rich amount of data, allowing us to study the scaling behavior under a wide range of dataset sizes. Details of the processing of each dataset are presented in Appendix C.

Dataset Size. To study the scaling behavior extensively, for each dataset S, we randomly select subsets with D samples, where D ∈ {200, 400, 800, …, 1638400}, covering a wide range of data scales in practical scenarios. We fine-tune models on each subset and test them on a held-out test set to ensure the estimated performance is unbiased. For each setting, we fine-tune the model three times to remove the randomness of subset sampling.

Optimization. We adopt standard fine-tuning using the AdamW optimizer (Loshchilov & Hutter, 2017) and a cosine learning rate scheduler (Loshchilov & Hutter, 2016). We optimize each model under different initial learning rates and batch sizes via hyper-parameter search. This ensures that test losses are optimal under the current settings. More details of fine-tuning are presented in Appendix D.

3.2. Phase Transition with Dataset Size

We plot the test loss for 6 representative models when fine-tuned on subsets of different sizes in Figure 3. We observe a "phase transition" pattern in scaling behaviors: when the loss is relatively large, the curve lies in the "pre-power phase", where the slope of the curve slowly decreases; as the training set size D increases, the loss decreases and the curve enters the "power phase", where it is almost linear, similar to the observed curves in the pre-training stage. For different datasets, depending on their difficulty, the size of data each model requires to transit into the second phase is different.

The pre-power phase has been barely observed before, mainly due to the focus on large data regimes. Indeed, for scaling behavior in pre-training or language-to-code transfer (Hernandez et al., 2021), in which the minimum sample size is ∼10^5, models have already entered the power phase and the pre-power phase becomes invisible. However, many fine-tuning tasks may fall into a relatively low-data regime, making the analysis of the behavior of the pre-power phase inevitable. Below we show that Equation (5) does not take this phase into consideration.

Theorem 3.1. For any positive parameters B, E, α, β, consider the log-log form of the function L̂(·) in Equation (5):

    f(x) = log(L̂(exp(x))) = α log( B/exp(βx) + E ),    (6)

then we have that the derivative f′ is negative and non-decreasing.

Theorem 3.1 establishes a crucial property that the slope f′ cannot decrease, contradicting the co-existence of the pre-power and power phases, since slopes decrease initially and remain roughly unchanged afterward. Indeed, as demonstrated in Figure 3, Equation (5) fits poorly with experimental results (dash lines)², as manifested by the deviation between the predicted loss and the actual loss in the pre-power phase.

²The parameters are fitted using a standard python optimization package; please refer to Appendix A for more details.

3.3. Our Scaling Law with Pre-learned Data

To better understand the underlying mechanism of the phase transition phenomenon, we start with the essential difference between pre-training and fine-tuning. Unlike pre-training, where we train a model from scratch, fine-tuning starts from a model that has been trained on a large corpus. Consequently, pre-training should have provided models with some amount of information relevant to the downstream context (Hernandez et al., 2021).

To capture this concept, we introduce the term pre-learned data size (represented by D_l), which indicates how much downstream data a model has learned from pre-training. This term could be influenced by multiple factors, such as the expressivity of the model, the pre-training corpus size, as well as the difficulty of the downstream task. Intuitively, D_l can be integrated with the scaling term D^β, which represents the amount of information that fine-tuning on D samples can provide the model with. We propose the following improved Scaling Law by incorporating this term, with an identical number of parameters to be fitted.

Definition 3.2 (Rectified Scaling Law). We define the Scaling Law with dataset size D for fine-tuning as

    L̂(D) = B / (D_l + D^β) + E,    (7)

where D_l is the pre-learned data size, β is the power of D denoting the learning difficulty, B adjusts the initial test loss, and E denotes the optimal loss of the model given an infinite amount of data.³

³The parameter α is unnecessary for our law to fit well, and we remove it for the sake of simplicity. All parameters are model-specific. We leave the incorporation of model information into Definition 3.2 as future explorations.
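As a numerical illustration of the contrast between the two laws (with arbitrary, made-up parameter values), the sketch below compares their log-log slopes: the vanilla form of Equation (6) has a non-decreasing slope, exactly as Theorem 3.1 states, while the rectified form of Equation (7) produces a slope that first falls (pre-power phase) and then recovers (power phase), with the turning point matching the x0 derived in Theorem 3.3 below.

```python
import numpy as np

B, E, alpha, beta, D_l = 50.0, 1.0, 1.0, 0.5, 100.0   # illustrative values only

x = np.linspace(0.1, 16.0, 400)                       # x = log D
f_vanilla = alpha * np.log(B / np.exp(beta * x) + E)  # Eq. (6)
f_rect = np.log(B / (D_l + np.exp(beta * x)) + E)     # log-log form of Eq. (7)

slope_v = np.gradient(f_vanilla, x)
slope_r = np.gradient(f_rect, x)

assert np.all(np.diff(slope_v) >= -1e-9)   # Theorem 3.1: f' never decreases
i_min = int(slope_r.argmin())              # rectified slope dips, then rises
assert 0 < i_min < len(x) - 1              # interior minimum = phase transition
x0 = np.log(D_l**2 + B * D_l / E) / (2 * beta)
print(f"slope minimum near x = {x[i_min]:.2f}; theory predicts x0 = {x0:.2f}")
```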
Figure 3. The phase transition from pre-power phase to power phase, and the fitness of different Scaling Laws. The x and y axes are fine-tuning dataset size D and test loss L in log scale. Each subfigure corresponds to a dataset. Solid lines are the fitting results of our law (Eq. 7), and dash lines are the fitting results of the vanilla law (Eq. 5). The full model results are in Appendix E.

This modification with D_l essentially improves the mathematical property of Definition 3.2, as the derivative is no longer monotonous. As proved in Theorem 3.3, the first-order derivative decreases before x0 (corresponding to the pre-power phase) and slightly increases afterward (corresponding to the power phase). In other words, the introduction of D_l is not only conceptually reasonable, but it also elegantly unifies the co-existence of both phases into one law.

Theorem 3.3. For any positive parameters B, E, D_l, β, consider the log-log form of the function L̂(·) in Equation (7):

    f(x) = log(L̂(exp(x))) = log( B/(D_l + exp(βx)) + E ),    (8)

then the second-order derivative f″ is negative for x ∈ (0, x0) and positive for x ∈ (x0, +∞), where x0 = log(D_l² + B·D_l/E) / (2β).

We quantified the fitting error of both laws on all models and datasets using the root-mean-square deviation (RMSD) in Figure 4. On average, each law is required to fit fifteen size-loss pairs. The error of Equation (5) is unavoidably large (with an average RMSD of 0.036). As it can only fit the power phase, a more difficult task results in a later occurrence of the phase transition, contributing to a larger fitting error. On the contrary, our law, Equation (7), has a consistently small RMSD error, with an average RMSD of 0.007. Since both laws have four parameters to fit, this demonstrates that our law captures the intrinsic scaling behavior more accurately.

Figure 4. Root-mean-square deviation (RMSD) of our law (Equation (7)) and vanilla law (Equation (5)) when fitting fine-tuning test loss versus dataset size in log scale. Under the same setting, our law achieves a much lower RMSD error.

4. LLM Selection

With a fine-grained understanding of the scaling behavior, we turn to the LLM selection task and propose a novel algorithm that leverages the newly established LLM Fine-tuning Scaling Law. This allows us to select near-optimal models with hundreds of times less resource consumption.

4.1. Method: from Scaling Law to LLM Selection

From the view of Scaling Law, the goal of LLM selection is to predict the subsequent curve given points that can be computed via S_sub. We capture the essential "phase transition" phenomenon and propose the "Accept then Stop" (AtS) algorithm, which distinguishes samples from the two phases and extrapolates the power phase, which is approximately linear under the log-log scale. This algorithm turns out to be more robust and accurate than fitting the entire law directly, which can be sensitive when γ is small.

We illustrate the process of AtS in Algorithm 1. Specifically, it first fine-tunes the model on S_sub to compute the test loss. It then continuously reduces the dataset size by half, and fine-tunes the model on each smaller subset to get a series of loss-size pairs P = {(D̃_i, L̃_i)}. Whenever a new pair is added, AtS fits a linear function f with all previous pairs, and computes a stop indicator I_stop that captures how far the newest pair deviates from f; once the deviation exceeds the tolerance, AtS stops and extrapolates with the accepted pairs.
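Since Algorithm 1 itself is not reproduced above, the following is a minimal Python sketch of one plausible reading of AtS: the stop indicator is taken to be the residual of the newest (smallest-subset) point against the line fitted to the previously accepted points, with the outlier tolerance δ and minimum accepted rate k from the ablations in Section 4, and the 200-sample floor borrowed from the OurFit variant; all of these details are assumptions for illustration.

```python
import numpy as np

def ats_predict(fine_tune_loss, D_sub: int, D_full: int,
                k: int = 3, delta: float = 0.1) -> float:
    """"Accept then Stop" sketch. fine_tune_loss(D) is assumed to return the
    test loss after fine-tuning the candidate model on a size-D subset."""
    log_d, log_l = [], []
    D = D_sub
    while D >= 200:                              # assumed halving floor
        log_d.append(np.log(D))
        log_l.append(np.log(fine_tune_loss(D)))
        if len(log_d) > max(k, 2):
            # Fit a line to all accepted points except the newest one, ...
            slope, icept = np.polyfit(log_d[:-1], log_l[:-1], 1)
            resid = abs(log_l[-1] - (slope * log_d[-1] + icept))
            if resid > delta:                    # ... stop on the first outlier:
                log_d.pop(); log_l.pop()         # it marks the pre-power phase.
                break
        D //= 2
    slope, icept = np.polyfit(log_d, log_l, 1)   # power-phase trend
    return slope * np.log(D_full) + icept        # predicted log full-tuning loss
```

Ranking candidates by this extrapolated loss then plays the role of A(M, S_sub) in Definition 2.1.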
Table 1. Model selection results (PearCorr, RelAcc) of four methods on three datasets (FLAN, WMT19, Gigaword), in percentage. The best result within the same dataset and budget ratio is in bold font, and the second best result is underlined.

                     FLAN                                 WMT19                                Gigaword
Metric        Ratio  AtS   ZeroShot SubTuning ModelSize   AtS   ZeroShot SubTuning ModelSize   AtS   ZeroShot SubTuning ModelSize

PearCorr (%)  1/8    90.9  -10.7    60.9      -20.9       98.9  7.1      93.5      36.0        98.9  -49.2    93.2      -24.4
              1/16   73.1  -10.7    46.5      -20.9       97.1  7.1      87.1      36.0        97.6  -49.2    89.3      -24.4
              1/32   65.5  -10.7    36.4      -20.9       97.7  7.1      77.7      36.0        96.9  -49.2    85.4      -24.4
              1/64   61.1  -10.7    29.0      -20.9       86.0  7.1      64.5      36.0        92.0  -49.2    80.9      -24.4
              1/128  52.2  -10.7    24.5      -20.9       78.0  7.1      51.7      36.0        91.1  -49.2    76.2      -24.4
              1/256  50.5  -10.7    20.9      -20.9       73.4  7.1      41.6      36.0        89.1  -49.2    69.9      -24.4
              1/512  45.6  -10.7    16.4      -20.9       61.5  7.1      34.5      36.0        91.0  -49.2    64.8      -24.4
              Avg    62.7  -10.7    33.5      -20.9       84.6  7.1      63.4      36.0        93.8  -49.2    79.9      -24.4

RelAcc (%)    1/8    93.6  85.3     93.2      59.6        99.1  84.4     99.1      22.5        100.0 71.3     87.6      71.3
              1/16   93.2  85.3     93.2      59.6        99.1  84.4     99.1      22.5        91.4  71.3     87.6      71.3
              1/32   93.2  85.3     93.2      59.6        99.6  84.4     99.1      22.5        94.3  71.3     87.6      71.3
              1/64   93.2  85.3     93.2      59.6        99.1  84.4     99.1      22.5        100.0 71.3     71.3      71.3
              1/128  85.3  85.3     59.6      59.6        99.1  84.4     99.1      22.5        94.3  71.3     71.3      71.3
              1/256  93.2  85.3     59.6      59.6        99.1  84.4     99.1      22.5        94.3  71.3     71.3      71.3
              1/512  93.2  85.3     59.6      59.6        99.1  84.4     99.1      22.5        91.4  71.3     71.3      71.3
              Avg    92.1  85.3     76.4      59.6        99.2  84.4     99.1      22.5        95.1  71.3     79.4      71.3
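For reference, PearCorr in this table is the Pearson correlation between a method's predicted performance and the true full-fine-tuning performance across the candidate models (cf. Figure 1(a)), computable as below; the rel_acc helper encodes only one plausible reading of RelAcc (true performance of the selected model relative to the best in the pool), since its formal definition is not given in this excerpt.

```python
import numpy as np
from scipy.stats import pearsonr

def pear_corr(predicted: np.ndarray, true_perf: np.ndarray) -> float:
    """Pearson correlation between predicted and true full-fine-tuning
    performance over the model pool; negative values mean the method's
    ranking is anti-correlated with the ground truth."""
    return pearsonr(predicted, true_perf)[0]

def rel_acc(predicted: np.ndarray, true_perf: np.ndarray) -> float:
    """Hypothetical RelAcc: true performance of the model the method would
    select, relative to the best attainable performance in the pool."""
    selected = int(np.argmax(predicted))
    return float(true_perf[selected] / true_perf.max())
```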
Figure 5. Failure cases for the three baseline methods. The horizontal dash lines denote the zero-shot performance, and each point denotes the test loss when fine-tuning the corresponding model (GPT-2 (124M), LaMini-GPT-124M, Cerebras-GPT-111M, Cerebras-GPT-256M) on S_sub with size D. LaMini-GPT-124M has the best full-fine-tuning performance, but its performance on small D is bad.

… model with averaged RelAcc larger than 95% with γ = 1/256, while all other methods fail to provide such a good selection even when γ = 1/8. This implies that AtS can select the near-optimal model with hundreds of times of acceleration.

Why do other algorithms fail? We illustrate why intuitively reasonable methods fail to make predictions in Figure 5. Assume we have 4 models and |S_sub| is roughly 10^4. ModelSize selects the largest model in M regardless of the properties of the downstream task and the models. The assumption behind this is that performance grows with model size, which has been demonstrated to be inaccurate in the fine-tuning stage. ZeroShot and SubTuning both leverage performance on the downstream dataset. However, they only capture the performance under a specific dataset size, while ignoring the global trend of performance with data size. In fact, these methods give Cerebras-GPT-256M the highest score, but eventually, LaMini-GPT-124M outperforms it.

AtS on stratified M. We also consider different model sets M to simulate the constraints of GPU memory. Specifically, we create three subsets of M with different model size thresholds, including 2B, 1.4B, and 700M. The results are presented in Figure 6(a), where AtS outperforms other baselines on all subsets by a large margin.

Influence of k and δ. To illustrate the influence of the outlier tolerance δ and the minimum accepted rate k, we conduct ablation studies on the choice of hyper-parameters and present the results in Figure 6(b). Overall, AtS is not sensitive to hyper-parameter values, indicating its robustness under various circumstances.

LLM selection by fitting Scaling Law. AtS essentially leverages the proposed Scaling Law to estimate the trend of fine-tuning loss. Here we additionally consider two variants of using Scaling Laws: (1) OurFit fine-tunes each model on a sequence of subsets {S_sub^0, S_sub^1, ...}, where S_sub^i ⊂ S_sub^{i−1} and |S_sub^i| = 2^{−i}|S_sub|, until |S_sub^i| < 200. It fits the parameters in our law (Equation (7)) using all data-loss pairs, and predicts the performance on S using the fitted law. (2) VanillaFit follows a similar procedure, except that it fits the previous law (Equation (5)) rather than ours. As shown in Table 2, while all variants outperform the three intuitive methods above, AtS is better than OurFit and VanillaFit thanks to the robustness and stability brought by linearity.

Table 2. PearCorr (%) of AtS and the two law-fitting variants on the three datasets; full results are presented in Appendix F.3.

            FLAN   WMT19   Gigaword   Avg.
AtS         45.6   61.5    91.0       66.0
OurFit      36.8   61.5    78.5       58.9
VanillaFit  20.7   56.5    79.3       52.1

Efficiency Analysis. We further analyze the efficiency of AtS in comparison with other methods. According to Kaplan et al. (2020), the computational cost C, measured in floating point operations (FLOPs), for training can be estimated with the formula C ∼ 6ND, where N represents the number of model parameters and D the dataset size. Considering T training epochs and H hyper-parameter search rounds for each model on a given dataset, we estimate the overall computational costs for FullTuning, SubTuning, and AtS as:

    C_FullTuning = Σ_{M∈M} 6 N_M D T H,
    C_SubTuning  = Σ_{M∈M} 6 N_M (γD) T H = γ · C_FullTuning,
    C_AtS        = Σ_{M∈M} Σ_{i: 2^{−i} ≤ γ} 6 N_M (D/2^i) T H ≤ 2γ · C_FullTuning.

Both AtS and SubTuning exhibit the same order of computational complexity, achieving an acceleration rate of γ. Figure 7 illustrates the Pareto-optimality curve between selection performance and computational costs. Notably, AtS achieves the most optimal Pareto curve, providing near-optimal selection performance akin to FullTuning while significantly reducing computational costs.
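As a quick sanity check of these formulas (with made-up values of N, D, T, and H), the snippet below evaluates the three costs and confirms the geometric-series bound C_AtS ≤ 2γ · C_FullTuning:

```python
# Illustrative numbers only, not the paper's actual experimental settings.
params = [124e6, 354e6, 1.3e9, 2.7e9]        # candidate model sizes N_M
D, T, H, gamma = 1_000_000, 3, 4, 1 / 256    # dataset size, epochs, HP rounds

def flops(n_params: float, n_samples: float) -> float:
    return 6 * n_params * n_samples * T * H  # C ~ 6ND (Kaplan et al., 2020)

def halving_sizes(full: int, ratio: float, floor: int = 200):
    d = ratio * full                         # AtS schedule: gamma*D, gamma*D/2, ...
    while d >= floor:
        yield d
        d /= 2

c_full = sum(flops(n, D) for n in params)
c_sub = sum(flops(n, gamma * D) for n in params)
c_ats = sum(flops(n, d) for n in params for d in halving_sizes(D, gamma))

assert abs(c_sub - gamma * c_full) < 1e-6 * c_full
assert c_ats <= 2 * gamma * c_full           # bound stated in the text
print(f"C_AtS / C_FullTuning = {c_ats / c_full:.6f}  (2*gamma = {2 * gamma:.6f})")
```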
Figure 6. (a) PearCorr of AtS on Gigaword with γ = 1/512 under different memory budgets (different M). Full results are presented in Appendix F.2. (b) Impact of δ and k on PearCorr (%) on Gigaword with γ = 1/512. Full results are presented in Appendix F.1.

Figure 7. Pareto-optimality curve between the selection performance and the computational costs. The performance is evaluated by PearCorr, while the cost is evaluated by the number of floating point operations (FLOPs).

5. Discussion

Phase transition may happen at a certain loss value. As discussed in Section 3, a more difficult downstream task results in a "later" occurrence of the phase transition, which means more training samples are needed for the fine-tuned LLMs to enter the power phase. This phenomenon is justified by our results on FLAN, WMT19, and Gigaword (see Figure 3 and Appendix E). It is intuitive that the multi-task instruction tuning dataset FLAN is the most "difficult", followed by the machine translation dataset WMT19, and then the summarization task Gigaword. In addition, the stronger pre-trained LLMs enter the power phase earlier. An interesting and unified explanation for this phenomenon is that the phase transition may be closely related to the value of the test loss. We also observed that almost all models enter the power phase when their test loss is less than 2.2. This magic number is also reported in Du et al. (2024), who found that the emergent abilities of LLMs (Schaeffer et al., 2024) also appear when the loss is smaller than 2.2. The intrinsic mechanism behind this value still remains a mystery. It may suggest that there exist two distinct "stages" in the general LLM learning process, which is similar but not identical to the Grokking of LLMs (Liu et al., 2022).

Limitations of this paper. Although AtS significantly outperforms other baselines, as shown in Table 1, it also suffers performance degradation when the data budget ratio γ is extremely small and all observed points lie in the pre-power phase. However, a mixed blessing is that, in real applications, it is feasible to detect which stage the curve is in by monitoring the residual errors. Proposing a new algorithm that can make accurate predictions with observations only from the pre-power phase is an interesting direction to pursue. In addition, it will be interesting to see if the benefit of Scaling Laws can be extended to other fine-tuning strategies such as RLHF (Rafailov et al., 2023; Christiano et al., 2017), LoRA (Hu et al., 2021), or more resource constraint types. Another limitation is the lack of a more comprehensive understanding of the mechanism of the pre-power phase and the phase transition. It will be interesting to see if it also appears in situations outside standard fine-tuning, and whether the behavior in this phase is similar to that in fine-tuning.

Outlook on Scaling Law research. We are now in a so-called "post-LLM era", where LLMs are revolutionizing various domains, such as human-like chatbots (Team et al., 2023), clinical applications (Singhal et al., 2022), programming optimization (Romera-Paredes et al., 2023), and geometric proving (Trinh et al., 2024). Scaling Law may be the key to unlocking the huge power of LLMs, since it tells us how we can make progress by investing more resources. However, research on Scaling Law is extremely expensive, and issues like environmental protection have to be considered (Muennighoff et al., 2023). We believe research in this domain should be conducted in a collaborative and decentralized manner, where the community can share observed results and better utilize idle computational resources.
6. Related Works

Model selection. Early model selection methods require that all models share identical architectures and differ only in pre-training datasets (Cui et al., 2018; Tran et al., 2019). Similarity-based methods (Vu et al., 2020; Dwivedi et al., 2020) fine-tune a model on a target dataset, and use the feature similarity between this model and candidate models to predict the fine-tuning performance of each model. Ye et al. (2021) extend the feature-based method to model selection under the out-of-distribution setting. Another line of works designs training-free metrics to examine whether pre-trained features are easily transferred to target tasks (Pándy et al., 2021; Ibrahim et al., 2021). More recently, there have been attempts to formulate the problem as learning to recommend (Li et al., 2023b) or rank (Zhang et al., 2023). One reason for not adopting existing model selection methods outside LLMs is that they focus mainly on classification or regression tasks (Deshpande et al., 2021; Li et al., 2023a). These methods either rely on features of inputs (Lin et al., 2023) or consider a fixed label set (Nguyen et al., 2020), which is not appropriate in the open-world text generation setting and could lead to one-to-many problems (Bao et al., 2019). The ever-growing number of open-sourced LLMs urgently calls for the investigation of LLM selection.

Scaling Law. Laws between model performance and variables like model size or data size during pre-training have been widely studied (Rosenfeld et al., 2019; Aghajanyan et al., 2023; Fernandes et al., 2023; Frantar et al., 2023), and are applied to estimate an optimal allocation of compute for pre-training LLMs (Kaplan et al., 2020; Hoffmann et al., 2022b). Recently, more fine-grained Scaling Laws have been proposed, such as data-constrained scaling (Muennighoff et al., 2023) and hyper-parameter scaling (Bi et al., 2024). For LLM fine-tuning, Hernandez et al. (2021) compared the scaling effect between transfer learning and pre-training, and Tay et al. (2021) observed the inconsistency of model size scaling between pre-training and fine-tuning. A concurrent work (Zhang et al., 2024) suggested a multiplicative law in fine-tuning scaling. However, none of these studies identified the pre-power phase in the fine-tuning process under low-data regimes, and their models fail to capture this phase transition pattern. Within the broader context of deep learning, Rosenfeld et al. (2019); Alabdulmohsin et al. (2022); Caballero et al. (2023) posited the necessity of a transition phase bridging the initial random-guess point and the power-law region in from-scratch training processes. Their primary approach involved modeling different phases separately and integrating them using a smooth function, which essentially introduced more parameters for the Scaling Law. In contrast, our proposed Rectified Scaling Law focuses on the fine-tuning of LLMs, and parameterizes the transition with a single term representing the pre-learned data size. This rectification is not only simple and intuitive but also empirically validated through solid experiments.

7. Conclusion

This paper focuses on two main areas: exploring the Scaling Laws of LLM fine-tuning and addressing the challenge of selecting LLMs for effective fine-tuning. We reveal the inadequacy of conventional Scaling Laws and propose a rectified law with much better theoretical and empirical properties by incorporating the concept of pre-learned data size. Additionally, we present a novel framework for the LLM selection problem and design a new algorithm that leverages the proposed law with significantly improved performance. Our findings not only deepen the understanding of Scaling Laws but also offer actionable insights for selecting LLMs in practice. We aim to provide a robust foundation for the broader and more efficient application of LLMs across various fields.

Acknowledgement

This work is funded in part by the National Key R&D Program of China #2022ZD0160301 and a grant from the CCF-Tencent Rhino-Bird Open Research Fund.

Impact Statement

LLMs require huge amounts of computing power and energy to train and deploy, which results in carbon emissions and climate change. By designing a highly efficient algorithm to select LLMs for fine-tuning, our work significantly reduces the amount of time and resources required to achieve the best performance. This can lead to lower energy consumption and lower cost, making LLMs more affordable and accessible to labs and start-ups when they have a certain task to solve. Our work is fundamental because it contributes to the development of a more sustainable and responsible LLM selection process, which can have a positive impact on the environment and society. Our method approaches a general problem and will not have any direct negative impact or be misused in specific domains as long as the task itself is safe, ethical, and fair.

References

Achiam, O. J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report, 2023. URL https://api.semanticscholar.org/CorpusID:257532815.
Aghajanyan, A., Yu, L., Conneau, A., Hsu, W.-N., Hambardzumyan, K., Zhang, S., Roller, S., Goyal, N., Levy, O., and Zettlemoyer, L. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:255570036.

Alabdulmohsin, I. M., Neyshabur, B., and Zhai, X. Revisiting neural scaling laws in language and vision. Advances in Neural Information Processing Systems, 35:22300–22312, 2022.

Alt, C., Hübner, M., and Hennig, L. Fine-tuning pre-trained transformer language models to distantly supervised relation extraction. arXiv preprint arXiv:1906.08646, 2019.

Bahri, Y., Dyer, E., Kaplan, J., Lee, J., and Sharma, U. Explaining neural scaling laws. arXiv preprint arXiv:2102.06701, 2021.

Bai, J., Zhang, X., Li, C., Hong, H., Xu, X., Lin, C., and Rong, W. How to determine the most powerful pre-trained language model without brute force fine-tuning? an empirical survey. arXiv preprint arXiv:2312.04775, 2023.

Bao, S., He, H., Wang, F., and Wu, H. Plato: Pre-trained dialogue generation model with discrete latent variable. In Annual Meeting of the Association for Computational Linguistics, 2019. URL https://api.semanticscholar.org/CorpusID:204744108.

Bi, D.-A. X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. Deepseek llm: Scaling open-source language models with longtermism. ArXiv, abs/2401.02954, 2024. URL https://api.semanticscholar.org/CorpusID:266818336.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Caballero, E., Gupta, K., Rish, I., and Krueger, D. Broken neural scaling laws, 2023.

Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. ArXiv, abs/1706.03741, 2017. URL https://api.semanticscholar.org/CorpusID:4787508.

Cui, Y., Song, Y., Sun, C., Howard, A. G., and Belongie, S. J. Large scale fine-grained categorization and domain-specific transfer learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4109–4118, 2018. URL https://api.semanticscholar.org/CorpusID:43993788.

Dai, A. M. and Le, Q. V. Semi-supervised sequence learning, 2015.

Deshpande, A., Achille, A., Ravichandran, A., Li, H., Zancato, L., Fowlkes, C., Bhotika, R., Soatto, S., and Perona, P. A linearized framework and a new benchmark for model selection for fine-tuning. arXiv preprint arXiv:2102.00084, 2021.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dey, N., Gosal, G., Khachane, H., Marshall, W., Pathria, R., Tom, M., Hestness, J., et al. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208, 2023.

Du, Z., Zeng, A., Dong, Y., and Tang, J. Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796, 2024.

Dwivedi, K., Huang, J., Cichy, R. M., and Roig, G. Duality diagram similarity: a generic framework for initialization selection in task transfer learning. In European Conference on Computer Vision, 2020. URL https://api.semanticscholar.org/CorpusID:221046068.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.

Fernandes, P., Ghorbani, B., Garcia, X., Freitag, M., and Firat, O. Scaling laws for multilingual neural machine translation. arXiv preprint arXiv:2302.09650, 2023.

Foundation, W. Acl 2019 fourth conference on machine translation (wmt19), shared task: Machine translation of news, 2019. URL http://www.statmt.org/wmt19/translation-task.html.

Frantar, E., Riquelme, C., Houlsby, N., Alistarh, D., and Evci, U. Scaling laws for sparsely-connected foundation models. ArXiv, abs/2309.08520, 2023. URL https://api.semanticscholar.org/CorpusID:262013578.

Ghorbani, B., Firat, O., Freitag, M., Bapna, A., Krikun, M., Garcia, X., Chelba, C., and Cherry, C. Scaling laws for neural machine translation. arXiv preprint arXiv:2109.07740, 2021.

Graff, D., Kong, J., Chen, K., and Maeda, K. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34, 2003.

Graves, A. Generating sequences with recurrent neural networks, 2014.

Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.

Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. Scaling laws for transfer. arXiv preprint arXiv:2102.01293, 2021.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022a.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. ArXiv, abs/2203.15556, 2022b. URL https://api.semanticscholar.org/CorpusID:247778764.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Ibrahim, S., Ponomareva, N., and Mazumder, R. Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance. In ECML/PKDD, 2021. URL https://api.semanticscholar.org/CorpusID:238744475.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Ke, Z., Shao, Y., Lin, H., Konishi, T., Kim, G., and Liu, B. Continual pre-training of language models. In International Conference on Learning Representations, 2023a. URL https://api.semanticscholar.org/CorpusID:258079422.

Ke, Z., Shao, Y., Lin, H., Xu, H., Shu, L., and Liu, B. Adapting a language model while preserving its general knowledge. In Conference on Empirical Methods in Natural Language Processing, 2023b. URL https://api.semanticscholar.org/CorpusID:256105391.

Kocmi, T., Bawden, R., Bojar, O., Dvorkovich, A., Federmann, C., Fishel, M., Gowda, T., Graham, Y., Grundkiewicz, R., Haddow, B., et al. Findings of the 2022 conference on machine translation (wmt22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pp. 1–45, 2022.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.

Li, H., Fowlkes, C., Yang, H., Dabeer, O., Tu, Z., and Soatto, S. Guided recommendation for model fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3633–3642, 2023a.

Li, H., Fowlkes, C. C., Yang, H., Dabeer, O., Tu, Z., and Soatto, S. Guided recommendation for model fine-tuning. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3633–3642, 2023b. URL https://api.semanticscholar.org/CorpusID:260084893.

Li, Y.-F., Bubeck, S., Eldan, R., Giorno, A. D., Gunasekar, S., and Lee, Y. T. Textbooks are all you need ii: phi-1.5 technical report. ArXiv, abs/2309.05463, 2023c. URL https://api.semanticscholar.org/CorpusID:261696657.

Lin, H., Shao, Y., Qian, W., Pan, N., Guo, Y., and Liu, B. Class incremental learning via likelihood ratio based task prediction. ArXiv, abs/2309.15048, 2023. URL https://api.semanticscholar.org/CorpusID:262825998.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Liu, Z., Kitouni, O., Nolte, N. S., Michaud, E., Tegmark, M., and Williams, M. Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35:34651–34663, 2022.

Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W., Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J., et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.

Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv: Learning, 2016. URL https://api.semanticscholar.org/CorpusID:14337532.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Muennighoff, N., Rush, A. M., Barak, B., Scao, T. L., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling data-constrained language models. ArXiv, abs/2305.16264, 2023. URL https://api.semanticscholar.org/CorpusID:258888192.

Nguyen, C. V., Hassner, T., Archambeau, C., and Seeger, M. W. Leep: A new measure to evaluate transferability of learned representations. ArXiv, abs/2002.12462, 2020. URL https://api.semanticscholar.org/CorpusID:211572839.

Pándy, M., Agostinelli, A., Uijlings, J. R. R., Ferrari, V., and Mensink, T. Transferability estimation using bhattacharyya class separability. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9162–9172, 2021. URL https://api.semanticscholar.org/CorpusID:244709516.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch, 2017.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners, 2019. URL https://api.semanticscholar.org/CorpusID:160025533.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. ArXiv, abs/2305.18290, 2023. URL https://api.semanticscholar.org/CorpusID:258959321.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.

Romera-Paredes, B., Barekatain, M., Novikov, A., Balog, M., Kumar, M. P., Dupont, E., Ruiz, F. J. R., Ellenberg, J. S., Wang, P., Fawzi, O., Kohli, P., Fawzi, A., Grochow, J., Lodi, A., Mouret, J.-B., Ringer, T., and Yu, T. Mathematical discoveries from program search with large language models. Nature, 625:468–475, 2023. URL https://api.semanticscholar.org/CorpusID:266223700.

Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., and Shavit, N. A constructive prediction of the generalization error across scales, 2019.

Rush, A. M., Chopra, S., and Weston, J. A neural attention model for abstractive sentence summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015. doi: 10.18653/v1/d15-1044. URL http://dx.doi.org/10.18653/v1/D15-1044.

Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36, 2024.

Singhal, K., Azizi, S., Tu, T., Mahdavi, S., Wei, J., Chung, H. W., Scales, N., Tanwani, A. K., Cole-Lewis, H. J., Pfohl, S. J., Payne, P. A., Seneviratne, M. G., Gamble, P., Kelly, C., Scharli, N., Chowdhery, A., Mansfield, P. A., y Arcas, B. A., Webster, D. R., Corrado, G. S., Matias, Y., Chou, K. H.-L., Gottweis, J., Tomavsev, N., Liu, Y., Rajkomar, A., Barral, J. K., Semturs, C., Karthikesalingam, A., and Natarajan, V. Large language models encode clinical knowledge. Nature, 620:172–180, 2022. URL https://api.semanticscholar.org/CorpusID:255124952.

Tay, Y., Dehghani, M., Rao, J., Fedus, W., Abnar, S., Chung, H. W., Narang, S., Yogatama, D., Vaswani, A., and Metzler, D. Scale efficiently: Insights from pre-training and fine-tuning transformers. arXiv preprint arXiv:2109.10686, 2021.

Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Tran, A., Nguyen, C. V., and Hassner, T. Transferability and hardness of supervised classification tasks. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1395–1405, 2019. URL https://api.semanticscholar.org/CorpusID:201303557.

Trinh, T. H., Wu, Y., Le, Q. V., He, H., and Luong, T. Solving olympiad geometry without human demonstrations. Nature, 625:476–482, 2024. URL https://api.semanticscholar.org/CorpusID:267032902.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.

Vu, T., Wang, T., Munkhdalai, T., Sordoni, A., Trischler, A., Mattarella-Micke, A., Maji, S., and Iyyer, M. Exploring and predicting transferability across nlp tasks. ArXiv, abs/2005.00770, 2020. URL https://api.semanticscholar.org/CorpusID:218487733.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, 2020.
Ye, H., Xie, C., Cai, T., Li, R., Li, Z., and Wang, L. Towards
a theoretical framework of out-of-distribution generaliza-
tion. Advances in Neural Information Processing Systems,
34:23519–23531, 2021.
Zhang, B., Liu, Z., Cherry, C., and Firat, O. When scal-
ing meets llm finetuning: The effect of data, model and
finetuning method. arXiv preprint arXiv:2402.17193,
2024.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen,
M., Chen, S., Dewan, C., Diab, M. T., Li, X., Lin,
X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster,
K., Simig, D., Koura, P. S., Sridhar, A., Wang, T.,
and Zettlemoyer, L. Opt: Open pre-trained trans-
former language models. ArXiv, abs/2205.01068,
2022. URL https://api.semanticscholar.
org/CorpusID:248496292.
Zhang, Y.-K., Huang, T., Ding, Y.-X., chuan Zhan, D.,
and Ye, H.-J. Model spider: Learning to rank pre-
trained models efficiently. ArXiv, abs/2306.03900,
2023. URL https://api.semanticscholar.
org/CorpusID:259088702.
where L_i denotes the test loss of fine-tuning on data size D_i, LSE denotes the log-sum-exp operator, and Huber denotes the Huber loss with δ = 0.001. We find the local minima of the objective above with the standard python package scipy (Virtanen et al., 2020), starting from 50 random initializations of the parameters, and choose the best run for reporting.
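A minimal sketch of this procedure follows, assuming the objective is a Chinchilla-style Huber loss between predicted and observed log-losses, with the rectified law of Equation (7) written through log-sum-exp for numerical stability (the exact parameterization of the objective, which is not reproduced here, may differ):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def fit_rectified_law(D, L, n_init: int = 50, seed: int = 0) -> dict:
    """Fit L_hat(D) = B / (D_l + D**beta) + E by minimizing a Huber loss
    on log-losses, with B = e^b, D_l = e^d, E = e^e for positivity."""
    log_d, log_l = np.log(np.asarray(D, float)), np.log(np.asarray(L, float))

    def huber(r, delta=1e-3):
        small = np.abs(r) <= delta
        return np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

    def objective(p):
        b, e, d, beta = p
        # log L_hat = LSE(b - LSE(d, beta * log D), e)
        denom = logsumexp([np.full_like(log_d, d), beta * log_d], axis=0)
        pred = logsumexp([b - denom, np.full_like(log_d, e)], axis=0)
        return huber(pred - log_l).sum()

    rng = np.random.default_rng(seed)
    runs = (minimize(objective, rng.normal(size=4), method="L-BFGS-B")
            for _ in range(n_init))          # 50 random restarts, keep the best
    best = min(runs, key=lambda r: r.fun)    # a real run would constrain beta > 0
    b, e, d, beta = best.x
    return {"B": np.exp(b), "E": np.exp(e), "D_l": np.exp(d), "beta": beta}
```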
Table 3. Comparison of root-mean-square deviation (RMSD) for fitting different scaling laws. ∆ indicates the improvement in fitting quality of our proposed law over the vanilla law.
FLAN WMT19 Gigaword
Model Name Ours Vanilla ∆ Ours Vanilla ∆ Ours Vanilla ∆
GPT2 0.0075 0.0697 0.0623 0.0089 0.1007 0.0918 0.0030 0.0190 0.0160
GPT2-medium 0.0038 0.0676 0.0639 0.0059 0.0991 0.0932 0.0020 0.0044 0.0024
GPT2-large 0.0056 0.0593 0.0537 0.0152 0.0893 0.0740 0.0035 0.0076 0.0041
GPT2-xl 0.0064 0.0614 0.0550 0.0410 0.1281 0.0871 0.0047 0.0104 0.0057
LaMini-GPT-124M 0.0027 0.0679 0.0652 0.0108 0.1150 0.1043 0.0037 0.0198 0.0161
LaMini-GPT-774M 0.0054 0.0638 0.0584 0.0093 0.1074 0.0981 0.0019 0.0074 0.0055
LaMini-GPT-1.5B 0.0055 0.0664 0.0609 0.0150 0.1353 0.1202 0.0063 0.0104 0.0041
Cerebras-GPT-111M 0.0096 0.0601 0.0505 0.0098 0.1129 0.1032 0.0038 0.0219 0.0181
Cerebras-GPT-256M 0.0105 0.0517 0.0412 0.0095 0.0874 0.0780 0.0022 0.0137 0.0115
Cerebras-GPT-1.3B 0.0038 0.0188 0.0150 0.0131 0.0618 0.0488 0.0048 0.0159 0.0111
Cerebras-GPT-2.7B 0.0030 0.0033 0.0003 0.0114 0.0123 0.0009 0.0020 0.0022 0.0002
Phi-1.5 0.0112 0.0363 0.0251 0.0110 0.0366 0.0256 0.0029 0.0030 0.0001
Phi-2 0.0060 0.0197 0.0137 0.0101 0.0317 0.0216 0.0040 0.0049 0.0008
OPT-350m 0.0078 0.0478 0.0400 0.0135 0.0848 0.0712 0.0045 0.0055 0.0010
OPT-1.3b 0.0024 0.0165 0.0141 0.0150 0.0709 0.0558 0.0024 0.0034 0.0010
OPT-2.7b 0.0052 0.0072 0.0020 0.0229 0.0602 0.0373 0.0012 0.0018 0.0006
OPT-6.7b 0.0025 0.0026 0.0002 0.0073 0.0090 0.0016 0.0028 0.0030 0.0002
ai-forever/mGPT 0.0035 0.0050 0.0015 0.0049 0.0153 0.0104 0.0119 0.0119 0.0000
BART-base 0.0073 0.0506 0.0433 0.0201 0.1075 0.0873 0.0194 0.0247 0.0053
BART-large 0.0129 0.0388 0.0259 0.0123 0.1070 0.0947 0.0055 0.0054 -0.0001
BART-large-cnn 0.0115 0.0302 0.0187 0.0115 0.0747 0.0632 0.0053 0.0059 0.0006
BART-large-xsum 0.0090 0.0357 0.0267 0.0089 0.1011 0.0922 0.0039 0.0046 0.0006
T5-small 0.0039 0.0241 0.0202 0.0135 0.0141 0.0007 0.0079 0.0235 0.0156
T5-base 0.0078 0.0316 0.0238 0.0144 0.0151 0.0007 0.0026 0.0134 0.0108
mT5-base 0.0035 0.0136 0.0101 0.0066 0.0155 0.0088 0.0055 0.0277 0.0221
mT5-large 0.0027 0.0118 0.0091 0.0045 0.0249 0.0204 0.0024 0.0071 0.0046
T5-v1.1-base 0.0069 0.0456 0.0386 0.0117 0.0358 0.0241 0.0056 0.0056 0.0000
switch-base-8 0.0073 0.0298 0.0225 0.0098 0.0104 0.0006 0.0096 0.0110 0.0014
switch-base-16 0.0088 0.0284 0.0195 0.0154 0.0171 0.0017 0.0082 0.0074 -0.0008
switch-base-32 0.0103 0.0307 0.0204 0.0048 0.0058 0.0009 0.0109 0.0131 0.0022
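For reference, a small sketch of how an RMSD entry in Table 3 can be computed for a fitted law; it assumes residuals are taken between predicted and measured log losses, consistent with the log-space fitting objective above (the text does not state the residual space explicitly).

```python
import numpy as np

# RMSD between a fitted law's predictions and the measured test losses
# (assumption: residuals in log space, matching the fitting objective).
def rmsd(pred_loss, true_loss):
    r = np.log(pred_loss) - np.log(true_loss)
    return float(np.sqrt(np.mean(r ** 2)))
```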
Table 4. This table summarizes all the models used in our experiments. Arch. is short for model architecture; De-only, En-De, and MoE stand for decoder-only, encoder-decoder, and Mixture-of-Experts, respectively. The last columns summarize the configuration of each language model: number of parameters (N), number of layers (Nlayer), hidden-state dimension (dmodel), number of attention heads (Nhead), feed-forward dimension (dff), and key/value head dimension (dkv).
Model Name          Arch.      Training Data Source                                       N      Nlayer  dmodel  Nhead  dff    dkv
GPT-2               De-only    WebText                                                    124M   12      768     12     3072   64
GPT-2-medium        De-only    WebText                                                    354M   24      1024    16     4096   64
GPT-2-large         De-only    WebText                                                    774M   36      1280    20     5120   64
GPT-2-xl            De-only    WebText                                                    1.5B   48      1600    25     6400   64
LaMini-GPT-124M     De-only    Finetuned GPT-2 series                                     124M   12      768     12     3072   64
LaMini-GPT-774M     De-only    Finetuned GPT-2 series                                     774M   36      1280    20     5120   64
LaMini-GPT-1.5B     De-only    Finetuned GPT-2 series                                     1.5B   48      1600    25     6400   64
Cerebras-GPT-111M   De-only    The Pile                                                   111M   10      768     12     3072   64
Cerebras-GPT-256M   De-only    The Pile                                                   256M   14      1088    17     4352   64
Cerebras-GPT-1.3B   De-only    The Pile                                                   1.3B   24      2048    16     8192   128
Cerebras-GPT-2.7B   De-only    The Pile                                                   2.7B   32      2560    32     10240  80
Phi-1.5             De-only    Mixed real & synthetic data                                1.4B   24      2048    32     8192   64
Phi-2               De-only    Mixed real & synthetic data                                2.7B   32      2560    32     10240  80
OPT-350m            De-only    BookCorpus, CC-Stories, The Pile, Pushshift.io, CCNewsV2   331M   24      1024    16     4096   64
OPT-1.3b            De-only    BookCorpus, CC-Stories, The Pile, Pushshift.io, CCNewsV2   1.3B   24      2048    32     8192   64
OPT-2.7b            De-only    BookCorpus, CC-Stories, The Pile, Pushshift.io, CCNewsV2   2.7B   32      2560    32     10240  80
OPT-6.7b            De-only    BookCorpus, CC-Stories, The Pile, Pushshift.io, CCNewsV2   6.7B   32      4096    32     16384  128
ai-forever/mGPT     De-only    Multilingual Wikipedia and C4                              1.4B   24      2048    16     8192   128
BART-base           En-De      BookCorpus, CCNews, OpenWebText, STORIES                   96M    12/12   768     12     3072   64
BART-large          En-De      BookCorpus, CCNews, OpenWebText, STORIES                   254M   12/12   1024    16     4096   64
BART-large-CNN      En-De      BART finetuned on CNN                                      254M   12/12   1024    16     4096   64
BART-large-XSUM     En-De      BART finetuned on XSUM                                     254M   12/12   1024    16     4096   64
T5-small            En-De      C4, Wiki-DPR; finetuned on CoLA, SST-2, MRPC, STS-B,       60M    6/6     512     8      2048   64
                               QQP, MNLI, QNLI, etc.
T5-base             En-De      C4, Wiki-DPR; finetuned on CoLA, SST-2, MRPC, STS-B,       223M   12/12   768     12     3072   64
                               QQP, MNLI, QNLI, etc.
mT5-base            En-De      mC4                                                        582M   12/12   768     12     2048   64
mT5-large           En-De      mC4                                                        1.2B   24/24   768     12     2816   64
T5-v1.1-base        En-De      C4                                                         247M   12/12   768     12     2048   64
switch-base-32      En-De MoE  C4                                                         2B     12/12   768     12     3072   64
switch-base-16      En-De MoE  C4                                                         1B     12/12   768     12     3072   64
switch-base-8       En-De MoE  C4                                                         619M   12/12   768     12     3072   64
GPT-2 Series (Radford et al., 2019) The GPT-2 series are transformer-based language models created and released by OpenAI. The models are pre-trained on WebText, a 40GB corpus of English text that has not been publicly released. The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) with a vocabulary size of 50,257. The pre-training objective is causal language modeling (CLM). In this paper, we studied all the released versions of GPT-2: GPT-2 (124M), GPT-2-Medium (355M), GPT-2-Large (774M), and GPT-2-XL (1.5B).
OPT Series (Zhang et al., 2022) Open Pre-trained Transformers (OPT) is a suite of decoder-only pre-trained transformers released by Meta AI on May 3rd, 2022. OPT was predominantly pre-trained on English text, though a small amount of non-English data is present in the training corpus via CommonCrawl. The training data of OPT contains 180B tokens, corresponding to 800GB of data, composed of texts from BookCorpus, CC-Stories, The Pile, Pushshift.io Reddit, and CCNewsV2. The texts are tokenized using the GPT-2 byte-level version of BPE with a vocabulary size of 50,272. In this paper, we studied four versions of OPT: OPT-350M, OPT-1.3B, OPT-2.7B, and OPT-6.7B.
Phi Series (Li et al., 2023c) The Phi models are transformer-based language models created and released by Microsoft to investigate the abilities of smaller models. Their main goal is to answer “how small can an LLM be to achieve certain capabilities”. Their training involved a variety of data sources related to code produced by humans and LLMs. The Phi series includes three pre-trained models without fine-tuning or RLHF: Phi-1 (1.3B), Phi-1.5 (1.3B), and Phi-2 (2.7B). They have shown performance close to the state of the art set by much larger models. In this paper, we studied Phi-1.5 and Phi-2.
LaMini-LM Series (Wu et al., 2023) To alleviate the resource demands of large models, Wu et al. (2023) explored new ways of distilling knowledge from large models into smaller ones. They designed a pipeline that combines synthetic data with existing instructions to produce a wide variety of instruction-training datasets comprising over 2.58 million examples. Based on these instructions, they fine-tuned a diverse herd of language models spanning encoder-decoder and decoder-only families, named “LaMini-LMs”, with parameters ranging from 61M to 1.5B. We chose the LaMini-GPT series for our experiments, which includes some of the largest models in the LaMini family.
Cerebras-GPT (Dey et al., 2023) The Cerebras-GPT family is inspired by the Chinchilla scaling laws, which state that a ratio of 20 training tokens per model parameter is compute-optimal. These models share a similar architecture with GPT-3 but are pre-trained only on The Pile. Cerebras-GPT models use Byte Pair Encoding with a vocabulary size of 50,257. In this paper, we studied Cerebras-GPT-111M, Cerebras-GPT-256M, Cerebras-GPT-1.3B, and Cerebras-GPT-2.7B.
T5, T5 V1.1 and mT5 Series (Raffel et al., 2020; Xue et al., 2020) T5 (Text-to-Text Transfer Transformer) is an encoder-decoder language model first introduced by Raffel et al. (2020). T5 was pre-trained on C4 and fine-tuned on several downstream datasets, achieving state-of-the-art results on many benchmarks, including question answering, text classification, and machine translation. T5-V1.1 shares a similar architecture with T5, except that it adopts GeGLU nonlinearities and scales down both dmodel and dff. In contrast to T5, T5-V1.1 was pre-trained only on C4. mT5 is a multilingual variant of T5-V1.1, pre-trained without dropout on the unlabeled multilingual Common Crawl (mC4) dataset. mT5's training corpus covers 101 languages, making it directly applicable to multilingual settings. We chose T5-small, T5-base, T5-V1.1-base, mT5-base, and mT5-large for our experiments.
BART Series (Lewis et al., 2019) BART is a sequence-to-sequence model with a bidirectional encoder and an auto-regressive decoder. It was trained in two steps: (1) corrupting the pre-training text with an arbitrary noising function, and (2) learning to reconstruct the original text. BART was trained on a mixture of corpora consisting of BookCorpus, CCNews, OpenWebText, and STORIES. In this work, we chose BART-base, BART-large, BART-large-CNN, and BART-large-XSUM for experiments. The last two models are BART-large fine-tuned on the CNN and XSUM datasets respectively, making them suitable for text summarization tasks.
C. Details of Datasets
We mainly conducted experiments on three datasets: WMT-19, Gigaword, and FLAN. The first two tasks (machine translation and summarization) are traditional sequence-to-sequence NLP tasks. The FLAN dataset consists of diverse generation tasks in many formats, making it an ideal benchmark for evaluating LLMs' performance in day-to-day situations. The statistics of the three datasets are shown in Table 5 [4].
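The scaling measurements in Tables 7–9 fine-tune every model on subsets whose sizes double from 200 up to 1,638,400 examples. Below is a minimal sketch of how such subsets could be drawn; the nested-prefix construction and the fixed shuffle seed are assumptions, as the exact sampling procedure is not specified here.

```python
import random

def nested_subsets(examples, seed=0):
    # Sizes doubling from 200 to 1,638,400, as in Tables 7-9.
    sizes = [200 * 2 ** i for i in range(14)]
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Prefixes of one shuffle give nested subsets (assumed, not stated).
    return {n: shuffled[:n] for n in sizes}
```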
FLAN (Longpre et al., 2023) The FLAN Collection consolidates datasets from Flan 2021, P3, Super-Natural Instructions, and dozens of others into a single repository. It then formats them into a variety of templates, including zero-shot, few-shot, and chain-of-thought formats. In our experiments, we use the FLAN Collection provided by HuggingFace [5] and choose the no-option split, which requires the model to generate a free-form answer.
WMT19 (Foundation, 2019) WMT-19 is a public machine translation dataset commonly used for evaluating sequence-to-sequence models. We conducted our experiments on the WMT-19 En-Zh subset. To accommodate the instruction-tuned models within our model set (e.g., the LaMini-GPTs), we prepend the instruction “Translate to Chinese:” to the beginning of each input during fine-tuning.
Gigaword (Graff et al., 2003) Gigaword is a widely used resource in the field of text summarization, comprising billions of words from a vast collection of news articles from sources such as The New York Times and the Associated Press. Each news document in the dataset is paired with a professionally written headline, serving as a compact summary of the main ideas within the article. We likewise prepend the instruction “Generate a summary: ” to input sequences in the dataset.
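Concretely, this prompt construction amounts to a one-line preprocessing step. The sketch below is a hypothetical helper (the function name and the "input" field are illustrative, not the authors' code).

```python
INSTRUCTIONS = {
    "wmt19": "Translate to Chinese: ",
    "gigaword": "Generate a summary: ",
}

def add_instruction(example, task):
    # Prepend the task instruction to the raw source sequence.
    example["input"] = INSTRUCTIONS[task] + example["input"]
    return example

# e.g., with a HuggingFace dataset:
# gigaword = gigaword.map(lambda ex: add_instruction(ex, "gigaword"))
```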
Input:
Premise: Our world has what is for them a normal gravity, but because of our much higher gravitational potential,
our atmosphere is too dense to support them comfortably over sustained periods.
Hypothesis: Your world has the same type of gravity as theirs.
Does the premise entail the hypothesis?
Target: Yes.
Input:
How are binary trees extended?
How do I insert a new node on a binary tree (not search binary tree)?
Do those questions have the same meaning?
Target: no
[4] We re-partition datasets into train/validation/test subsets due to the unavailability of the WMT19 test set and the imbalance in the split between the validation and test sets within Gigaword. We only sub-sample a subset from FLAN since the full dataset is too large.
[5] https://huggingface.co/datasets/Open-Orca/FLAN
Input: Translate to Chinese: When the mother sheep saw him pick up her baby sheep and ran away, she followed
him out of the field.
Target: 当羊妈妈看见她的羊宝宝被人抱走了,赶快跟在李雷后面跑出了田地。
Input: Translate to Chinese: South Africa’s Draft White Paper on Energy Policy promotes energy efficiency and use
of renewable sources of energy.
Target: 南非的《能源政策白皮书草案》提倡提高能源效率和使用可再生能源。
Input: Translate to Chinese: Political scientists like Janine Mossuz-Lavau says there is being a woman this election
season may be an asset.
Target: 政治学家如詹南·摩萨斯-拉瓦说,在这季奄中,身为女性也许就是资本。
Input: Translate to Chinese: The Secretary-General condemned the excessive and disproportionate use of force and
the killing of civilians.
Target: 秘书长谴责这种不成比例地过度使用武力和杀害平民的行为。
Input: Generate a summary: china is to hold the third international expo of necessities for students in nanning city
in south china ’s guangxi zhuang autonomous region from october to november.
Target: china to hold expo of student equipment
Input: Generate a summary: the gold price in hong kong rose ## hk dollars on wednesday to close at #,### hk
dollars a tael , according to po sang bank , one of the major gold dealers in hong kong.
Target: gold price in hong kong up
Input: Generate a summary: riot police used water cannons friday to disperse protesters demanding that the
philippines lift its ban on the deployment of workers to war-ravaged iraq .
Target: police violently disperse protest against ban on workers deployment to iraq
Input: Generate a summary: british prime minister john major thursday hailed the re-election of russian president
boris yeltsin as a sign that “ democracy has taken firm root in russia .
Target: major delighted over yeltsin victory
Table 6. Hyper-parameters used for fine-tuning.
Hyper-parameter   Values
learning rate     search on {1e-4, 3e-4, 5e-4, 1e-3} for small models (< 700M);
                  {3e-5, 5e-5, 1e-4, 3e-4} for large models (> 700M)
batch size        search on {64, 128, 256}
training epochs   20 with early stopping (patience = 3)
optimizer         AdamW
weight decay      0.01
scheduler         cosine
warmup ratio      0.03
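As an illustration, these hyper-parameters map naturally onto a HuggingFace Trainer configuration. The sketch below is assumed, not the authors' training script; the learning rate and batch size are single points from the search grids above, and the output path is hypothetical.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="ft-run",                 # hypothetical output path
    learning_rate=3e-4,                  # one point from the small-model grid
    per_device_train_batch_size=64,      # searched over {64, 128, 256}
    num_train_epochs=20,
    weight_decay=0.01,                   # AdamW is the Trainer default optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    evaluation_strategy="epoch",         # needed so early stopping can trigger
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
# Pass `callbacks=[early_stopping]` when constructing the Trainer.
```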
Table 7. Test loss of 30 models fine-tuned on subsets of the FLAN dataset. The data size ranges from 0 to 1,638,400.
Model 0 200 400 800 1600 3200 6400 12800 25600 51200 102400 204800 409600 819200 1638400
GPT-2 4.857 4.386 4.288 4.191 4.060 3.890 3.826 3.546 3.272 2.988 2.686 2.449 2.193 1.978 1.791
GPT-2-medium 4.375 3.782 3.714 3.614 3.518 3.390 3.249 3.076 2.880 2.673 2.428 2.207 1.966 1.771 1.610
GPT-2-large 4.165 3.525 3.493 3.412 3.285 3.157 3.044 2.898 2.736 2.543 2.324 2.115 1.913 1.739 1.601
GPT-2-xl 3.929 3.306 3.254 3.169 3.058 2.999 2.889 2.774 2.632 2.451 2.270 2.058 1.878 1.693 1.555
LaMini-GPT-124M 4.891 4.248 4.188 4.087 3.946 3.808 3.645 3.421 3.165 2.916 2.653 2.383 2.152 1.917 1.743
LaMini-GPT-774M 4.215 3.497 3.458 3.361 3.257 3.140 3.033 2.878 2.712 2.529 2.329 2.120 1.887 1.731 1.559
LaMini-GPT-1.5B 4.046 3.293 3.240 3.202 3.094 2.990 2.881 2.751 2.628 2.446 2.270 2.061 1.851 1.687 1.530
Cerebras-GPT-111M 4.495 3.763 3.689 3.593 3.489 3.407 3.325 3.237 3.108 2.991 2.827 2.638 2.435 2.226 1.968
Cerebras-GPT-256M 4.097 3.393 3.319 3.230 3.127 3.054 2.974 2.898 2.817 2.708 2.572 2.409 2.211 2.037 1.880
Cerebras-GPT-1.3B 3.388 2.791 2.713 2.646 2.587 2.492 2.412 2.325 2.243 2.131 2.042 1.960 1.881 1.786 1.683
Cerebras-GPT-2.7B 2.914 2.231 2.151 2.088 2.046 1.979 1.925 1.872 1.831 1.779 1.733 1.681 1.631 1.589 1.544
Phi-1.5 4.620 4.063 3.929 3.664 3.462 3.213 3.056 2.895 2.686 2.463 2.237 2.022 1.831 1.671 1.542
Phi-2 3.368 2.538 2.515 2.452 2.424 2.397 2.386 2.330 2.292 2.216 2.146 2.076 2.009 1.944 1.882
OPT-350m 3.729 3.203 3.132 3.020 2.943 2.848 2.767 2.686 2.577 2.453 2.292 2.131 1.964 1.805 1.663
OPT-1.3b 3.022 2.447 2.379 2.317 2.268 2.189 2.110 2.042 1.973 1.902 1.821 1.742 1.672 1.596 1.513
OPT-2.7b 2.793 2.337 2.287 2.240 2.170 2.109 2.031 1.953 1.917 1.873 1.800 1.746 1.689 1.635 1.579
OPT-6.7b 4.442 2.021 1.980 1.973 1.935 1.921 1.895 1.865 1.838 1.812 1.790 1.770 1.741 1.720 1.697
ai-forever/mGPT 3.227 2.623 2.587 2.512 2.478 2.391 2.339 2.292 2.215 2.150 2.096 2.051 1.989 1.942 1.894
BART-base 8.502 4.159 3.990 3.850 3.685 3.532 3.344 3.181 2.979 2.711 2.457 2.251 2.051 1.858 1.685
BART-large 7.533 3.372 3.328 3.106 2.950 2.827 2.712 2.617 2.500 2.337 2.172 2.006 1.853 1.688 1.550
BART-large-cnn 6.026 3.591 3.445 3.213 3.037 2.894 2.757 2.606 2.471 2.338 2.164 1.999 1.829 1.674 1.555
BART-large-xsum 4.908 3.493 3.335 3.168 3.023 2.893 2.755 2.627 2.476 2.350 2.171 2.008 1.836 1.677 1.557
T5-small 3.983 3.021 2.931 2.838 2.757 2.681 2.601 2.508 2.411 2.309 2.208 2.085 1.978 1.857 1.756
T5-base 3.539 2.642 2.585 2.480 2.412 2.344 2.281 2.201 2.131 2.041 1.947 1.837 1.715 1.600 1.520
mT5-base 12.925 3.191 3.121 3.010 2.892 2.758 2.656 2.514 2.413 2.308 2.178 2.069 1.969 1.879 1.799
mT5-large 20.843 2.596 2.528 2.470 2.389 2.311 2.220 2.138 2.051 1.966 1.890 1.810 1.741 1.675 1.601
T5-v1.1-base 28.836 4.012 3.891 3.723 3.503 3.312 3.101 2.903 2.727 2.525 2.328 2.119 1.930 1.727 1.528
switch-base-8 29.484 4.129 3.892 3.689 3.469 3.285 3.132 2.896 2.728 2.536 2.368 2.168 1.988 1.799 1.654
switch-base-16 18.770 3.812 3.620 3.451 3.290 3.101 2.919 2.796 2.633 2.497 2.329 2.163 2.000 1.817 1.684
switch-base-32 24.522 3.652 3.502 3.312 3.181 3.014 2.836 2.704 2.572 2.434 2.304 2.116 1.950 1.780 1.650
Figure 8. The test losses of 30 models fine-tuned on various sizes of subsets derived from the FLAN dataset (y-axis: test loss, log-scale; x-axis: # samples, log-scale). The point size reflects the corresponding model size.
Table 8. Test loss of 30 models fine-tuned on subsets of the WMT19 dataset. The data size ranges from 0 to 1,638,400.
Model 0 200 400 800 1600 3200 6400 12800 25600 51200 102400 204800 409600 819200 1638400
GPT-2 3.403 3.079 3.037 2.955 2.867 2.757 2.521 2.276 1.966 1.713 1.502 1.296 1.131 1.020 0.929
GPT-2-medium 3.148 2.891 2.874 2.735 2.663 2.547 2.369 2.122 1.886 1.645 1.424 1.225 1.068 0.943 0.855
GPT-2-large 2.937 2.888 2.740 2.764 2.589 2.515 2.362 2.128 1.837 1.618 1.401 1.254 1.094 0.948 0.887
GPT-2-xl 2.888 2.646 2.614 2.508 2.461 2.393 2.297 2.143 1.940 1.701 1.477 1.278 1.278 0.896 0.800
LaMini-GPT-124M 3.253 3.061 3.014 2.976 2.916 2.781 2.669 2.473 2.130 1.847 1.606 1.376 1.210 1.062 0.958
LaMini-GPT-774M 2.813 2.680 2.669 2.661 2.536 2.471 2.309 2.072 1.825 1.600 1.373 1.189 1.044 0.921 0.838
LaMini-GPT-1.5B 2.742 2.710 2.660 2.653 2.580 2.490 2.408 2.327 2.001 1.725 1.451 1.230 1.050 0.913 0.790
Cerebras-GPT-111M 3.348 3.034 2.943 2.878 2.796 2.716 2.607 2.455 2.249 2.012 1.792 1.595 1.393 1.170 0.957
Cerebras-GPT-256M 3.109 2.891 2.801 2.664 2.632 2.502 2.364 2.178 1.951 1.786 1.563 1.393 1.229 1.054 0.919
Cerebras-GPT-1.3B 2.610 2.789 2.628 2.521 2.388 2.315 2.238 2.097 1.926 1.732 1.595 1.459 1.316 1.156 1.030
Cerebras-GPT-2.7B 2.192 1.959 1.892 1.842 1.771 1.739 1.705 1.650 1.608 1.540 1.442 1.429 1.410 1.372 1.331
Phi-1.5 2.641 2.883 2.652 2.428 2.361 2.152 1.961 1.802 1.634 1.468 1.317 1.201 1.088 0.981 0.901
Phi-2 1.857 2.272 2.137 1.987 1.941 1.799 1.631 1.507 1.364 1.264 1.123 1.024 0.935 0.858 0.799
OPT-350m 3.199 3.117 2.972 2.972 2.784 2.621 2.438 2.157 1.890 1.637 1.426 1.271 1.119 1.004 0.881
OPT-1.3b 2.727 2.761 2.650 2.615 2.497 2.342 2.148 1.963 1.777 1.563 1.433 1.295 1.162 1.014 0.883
OPT-2.7b 2.495 2.480 2.441 2.391 2.331 2.277 2.106 1.987 1.817 1.652 1.530 1.391 1.289 1.188 1.081
OPT-6.7b 2.262 1.987 1.984 1.979 1.961 1.957 1.945 1.917 1.881 1.864 1.831 1.812 1.787 1.761 1.738
ai-forever/mGPT 2.285 2.089 2.086 2.093 2.071 2.043 2.018 2.007 1.996 1.941 1.919 1.867 1.833 1.786 1.753
BART-base 6.781 3.368 3.366 3.163 3.030 2.874 2.787 2.330 1.991 1.691 1.411 1.254 1.070 0.932 0.859
BART-large 4.145 3.214 3.202 3.056 2.953 2.689 2.490 2.121 1.796 1.524 1.296 1.105 0.957 0.828 0.758
BART-large-cnn 6.028 3.223 3.103 3.029 2.829 2.602 2.285 1.963 1.739 1.485 1.270 1.104 0.962 0.858 0.771
BART-large-xsum 4.263 3.161 3.093 2.973 2.847 2.643 2.371 2.092 1.806 1.510 1.310 1.129 0.980 0.857 0.774
T5-small 4.384 1.251 1.223 1.135 1.048 0.991 0.958 0.903 0.845 0.803 0.781 0.749 0.717 0.664 0.641
T5-base 4.798 1.174 1.060 1.037 0.950 0.885 0.835 0.776 0.745 0.734 0.684 0.644 0.626 0.591 0.575
mT5-base 16.143 2.879 2.822 2.781 2.722 2.692 2.671 2.578 2.471 2.451 2.388 2.322 2.245 2.162 2.079
mT5-large 21.711 2.841 2.814 2.776 2.711 2.687 2.648 2.560 2.472 2.412 2.290 2.211 2.129 2.032 1.941
T5-v1.1-base 10.500 1.389 1.261 1.225 1.176 1.123 1.053 0.991 0.930 0.868 0.808 0.743 0.680 0.622 0.561
switch-base-8 27.451 1.561 1.472 1.374 1.251 1.223 1.125 1.050 0.981 0.923 0.849 0.791 0.741 0.689 0.651
switch-base-16 21.009 1.389 1.290 1.203 1.187 1.094 1.044 0.991 0.913 0.866 0.807 0.756 0.745 0.666 0.631
switch-base-32 18.065 1.351 1.262 1.172 1.112 1.042 0.962 0.901 0.847 0.788 0.733 0.681 0.642 0.601 0.567
Figure 9. The test losses of 30 models fine-tuned on various sizes of subsets derived from the WMT19 dataset (y-axis: test loss, log-scale; x-axis: # samples, log-scale). The point size reflects the corresponding model size.
Table 9. Test loss of 30 models fine-tuned on subsets of the Gigaword dataset. The data size ranges from 0 to 1,638,400.
Model 0 200 400 800 1600 3200 6400 12800 25600 51200 102400 204800 409600 819200 1638400
GPT-2 4.147 2.691 2.596 2.516 2.429 2.329 2.204 2.099 1.983 1.883 1.777 1.690 1.597 1.508 1.431
GPT-2-medium 3.723 2.298 2.214 2.130 2.050 1.965 1.891 1.810 1.742 1.672 1.602 1.530 1.465 1.398 1.349
GPT-2-large 3.613 2.154 2.103 2.018 1.961 1.887 1.799 1.750 1.671 1.603 1.540 1.479 1.408 1.354 1.305
GPT-2-xl 3.411 2.044 2.010 1.954 1.880 1.814 1.773 1.702 1.634 1.577 1.521 1.468 1.413 1.356 1.286
LaMini-GPT-124M 4.414 2.645 2.546 2.457 2.384 2.300 2.203 2.110 1.996 1.888 1.790 1.694 1.595 1.511 1.438
LaMini-GPT-774M 4.161 2.142 2.085 2.015 1.942 1.873 1.814 1.746 1.673 1.603 1.541 1.480 1.422 1.358 1.308
LaMini-GPT-1.5B 4.053 2.041 2.000 1.927 1.877 1.824 1.766 1.703 1.645 1.570 1.518 1.459 1.439 1.354 1.299
Cerebras-GPT-111M 5.108 3.505 3.362 3.217 3.080 2.939 2.780 2.658 2.507 2.354 2.208 2.048 1.914 1.796 1.677
Cerebras-GPT-256M 4.574 3.043 2.934 2.823 2.686 2.576 2.473 2.350 2.225 2.112 1.994 1.888 1.785 1.683 1.586
Cerebras-GPT-1.3B 3.834 2.401 2.324 2.257 2.193 2.139 2.082 2.008 1.924 1.851 1.770 1.682 1.618 1.550 1.482
Cerebras-GPT-2.7B 3.400 2.125 2.054 1.983 1.933 1.866 1.806 1.745 1.692 1.637 1.576 1.533 1.480 1.440 1.391
Phi-1.5 4.169 2.354 2.266 2.157 2.069 1.992 1.905 1.834 1.761 1.679 1.607 1.540 1.483 1.410 1.361
Phi-2 3.245 1.788 1.747 1.705 1.674 1.639 1.602 1.574 1.534 1.478 1.453 1.431 1.389 1.354 1.319
OPT-350m 3.848 2.422 2.312 2.227 2.149 2.078 2.013 1.928 1.858 1.768 1.712 1.635 1.574 1.512 1.450
OPT-1.3b 3.163 1.879 1.828 1.772 1.722 1.686 1.638 1.588 1.543 1.491 1.446 1.403 1.368 1.327 1.290
OPT-2.7b 2.971 1.734 1.697 1.658 1.620 1.576 1.541 1.502 1.462 1.429 1.391 1.363 1.330 1.301 1.270
OPT-6.7b 2.862 1.694 1.656 1.623 1.582 1.549 1.506 1.460 1.428 1.400 1.368 1.339 1.308 1.276 1.245
ai-forever/mGPT 3.676 2.379 2.386 2.238 2.186 2.034 1.939 1.863 1.802 1.732 1.651 1.586 1.530 1.452 1.379
BART-base 8.663 3.299 3.120 2.884 2.710 2.535 2.391 2.021 1.894 1.797 1.696 1.630 1.548 1.469 1.408
BART-large 4.727 2.211 2.102 1.984 1.895 1.809 1.734 1.666 1.610 1.537 1.483 1.420 1.361 1.303 1.257
BART-large-CNN 4.619 2.268 2.172 2.063 1.949 1.842 1.737 1.670 1.594 1.524 1.472 1.403 1.364 1.306 1.255
BART-large-XSUM 4.486 2.204 2.128 2.030 1.934 1.839 1.751 1.686 1.613 1.546 1.484 1.412 1.371 1.311 1.261
T5-small 3.675 2.078 2.061 2.028 1.911 1.863 1.804 1.743 1.680 1.624 1.554 1.484 1.406 1.322 1.250
T5-base 2.880 1.758 1.725 1.679 1.638 1.597 1.542 1.492 1.444 1.395 1.351 1.301 1.247 1.196 1.146
mT5-base 11.509 2.810 2.689 2.589 2.432 2.292 2.167 2.024 1.851 1.721 1.599 1.482 1.371 1.253 1.148
mT5-large 10.154 2.567 2.462 2.331 2.212 2.110 1.987 1.890 1.781 1.679 1.588 1.492 1.418 1.332 1.259
T5-v1.1-base 9.205 2.582 2.451 2.283 2.123 1.979 1.870 1.717 1.614 1.502 1.414 1.326 1.241 1.151 1.071
switch-base-8 20.602 2.672 2.573 2.286 2.124 1.991 1.859 1.726 1.619 1.512 1.430 1.356 1.275 1.206 1.149
switch-base-16 17.835 2.641 2.443 2.253 2.035 1.916 1.789 1.675 1.583 1.480 1.395 1.334 1.260 1.196 1.123
switch-base-32 14.677 2.430 2.309 2.187 1.967 1.881 1.734 1.625 1.563 1.457 1.383 1.305 1.246 1.186 1.106
Figure 10. The test losses of 30 models fine-tuned on various sizes of subsets derived from the Gigaword dataset (y-axis: test loss, log-scale; x-axis: # samples, log-scale). The point size reflects the corresponding model size.
Figure 11. PearCorr of AtS with varied hyper-parameters δ and k across the FLAN, WMT19, and Gigaword datasets (heatmap blocks per dataset and budget ratio, 1/8 to 1/512; horizontal axis K ∈ {3, 4, 5}). Each block presents an ablation analysis, delineating the impact of hyper-parameter settings on specific subsets.
Table 10. Variance of fine-tuning results (test loss, mean ± std) of four typical models on FLAN, for training subset sizes 200 to 3200. The fine-tuning processes are very stable.
Model           200              400              800              1600             3200
GPT-2           4.386 ± 0.0016   4.288 ± 0.0018   4.191 ± 0.0015   4.060 ± 0.0011   3.890 ± 0.0013
Cerebras-256M   3.393 ± 0.0021   3.319 ± 0.0022   3.230 ± 0.0012   3.127 ± 0.0010   3.054 ± 0.0009
BART-base       4.159 ± 0.0051   3.990 ± 0.0049   3.850 ± 0.0045   3.685 ± 0.0042   3.532 ± 0.0020
OPT-350M        3.203 ± 0.0025   3.132 ± 0.0023   3.020 ± 0.0021   2.943 ± 0.0016   2.848 ± 0.0013
Figure 12. Performance of AtS on stratified M with varied memory budgets, measured by PearCorr (y-axis: Pearson correlation; x-axis: budget ratio, 1/8 to 1/512; panels for model size < 2B, < 1.4B, and < 700M; methods compared: AtS, ZeroShot, SubTuning, ModelSize).
26
Selecting Large Language Model to Fine-tune via Rectified Scaling Law
Table 11. Model selection results (PearCorr, RelAcc) of three scaling-law-based methods on three datasets (FLAN, WMT19, Gigaword), in percentage. The best result within the same dataset and budget ratio is in bold font, and the second-best result is underlined.
                    FLAN                        WMT19                       Gigaword
Metric       Ratio  AtS    OurFit  VanillaFit   AtS    OurFit  VanillaFit   AtS    OurFit  VanillaFit
PearCorr (%) 1/8    90.9   77.9    34.7         98.9   95.0    94.4         98.9   97.0    95.1
             1/16   73.1   67.4    58.1         97.0   93.6    83.7         97.6   90.7    92.8
             1/32   65.5   54.4    43.1         97.7   91.1    79.6         96.9   88.3    91.0
             1/64   61.1   47.6    46.7         86.0   83.9    30.9         92.0   83.6    84.3
             1/128  52.2   54.9    41.4         78.0   78.9    35.2         91.1   83.6    47.3
             1/256  50.5   41.1    45.0         73.4   72.9    41.1         89.1   81.5    85.8
             1/512  45.6   36.8    20.7         61.5   61.5    56.5         91.0   78.5    79.3
RelAcc (%)   1/8    93.6   100.0   39.0         99.1   84.9    99.6         100.0  100.0   100.0
             1/16   93.2   100.0   93.2         99.1   84.9    80.7         91.4   100.0   100.0
             1/32   93.2   100.0   93.2         99.6   78.5    99.6         94.3   100.0   100.0
             1/64   93.2   100.0   90.7         99.1   81.8    99.1         100.0  94.3    100.0
             1/128  85.3   85.3    93.2         99.1   78.5    99.1         94.3   94.3    94.3
             1/256  93.2   85.3    85.3         99.1   77.6    99.1         94.3   87.2    94.3
             1/512  93.2   85.3    93.2         99.1   77.6    99.1         91.4   91.4    87.3
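For reference, short sketches of the two metrics reported in Tables 11 and 12 follow. The exact definitions are given in the main text; the versions below assume PearCorr is the Pearson correlation between a method's predicted performance and the true full-fine-tuning performance across candidate models, and RelAcc is the true performance of the selected model relative to the best candidate, with higher performance values treated as better.

```python
import numpy as np
from scipy.stats import pearsonr

def pear_corr(pred_perf, true_perf):
    # Pearson correlation between predicted and true performance, in percent.
    return 100.0 * pearsonr(pred_perf, true_perf)[0]

def rel_acc(pred_perf, true_perf):
    # Performance of the model a method would select, relative to the best
    # model in the pool, in percent (paraphrased definition).
    selected = int(np.argmax(pred_perf))
    return 100.0 * true_perf[selected] / np.max(true_perf)
```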
Table 12. Model selection results of AtS-Family evaluated by RelAcc on three datasets (FLAN, WMT19, Gigaword), in percentage. The best result within the same dataset and budget ratio is in bold font, and the second-best result is underlined.
FLAN WMT19 Gigaword
Ratio AtS SubTuning AtS-Family AtS SubTuning AtS-Family AtS SubTuning AtS-Family
1/8 93.6 93.2 93.2 99.1 99.1 99.6 100.0 87.6 94.3
1/16 93.2 93.2 93.2 99.1 99.1 99.6 91.4 87.6 94.3
1/32 93.2 93.2 93.2 99.6 99.1 99.6 94.3 87.6 94.3
1/64 93.2 93.2 93.2 99.1 99.1 99.6 100.0 71.3 94.3
1/128 85.3 59.6 93.2 99.1 99.1 99.6 94.3 71.3 94.3
1/256 93.2 59.6 93.2 99.1 99.1 99.6 94.3 71.3 94.3
1/512 93.2 59.6 93.2 99.1 99.1 99.6 91.4 71.3 94.3