Scaling Laws For Precision
Tanishq Kumar∗1, Zachary Ankner∗3,4, Benjamin F. Spector2, Blake Bordelon1, Niklas Muennighoff2, Mansheej Paul4, Cengiz Pehlevan1, Christopher Ré2, Aditi Raghunathan5
1Harvard University   2Stanford University   3MIT   4Databricks   5Carnegie Mellon University
arXiv:2411.04330v1 [cs.LG] 7 Nov 2024
Abstract
Low precision training and inference affect both the quality and cost of language models,
but current scaling laws do not account for this. In this work, we devise “precision-aware” scal-
ing laws for both training and inference. We propose that training in lower precision reduces
the model’s effective parameter count, allowing us to predict the additional loss incurred from
training in low precision and post-train quantization. For inference, we find that the degra-
dation introduced by post-training quantization increases as models are trained on more data,
eventually making additional pretraining data actively harmful. For training, our scaling laws
allow us to predict the loss of a model with different parts in different precisions, and suggest
that training larger models in lower precision may be compute optimal. We unify the scaling
laws for post and pretraining quantization to arrive at a single functional form that predicts
degradation from training and inference in varied precisions. We fit on over 465 pretraining runs
and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens.
1 Introduction
Scale has emerged as a central driver of progress in deep learning [Brown, 2020]. Key work on
scaling [Kaplan et al., 2020, Hoffmann et al., 2022] studied tradeoffs between model/dataset size
to balance performance and compute. However, the precision in which models are trained and
served is an important third factor that contributes to both cost and performance. Deep learning
is trending towards lower precision: current frontier models like the Llama-3 series are trained in
BF16 [Dubey et al., 2024], and there is widespread effort to move the pretraining paradigm to FP8
[Micikevicius et al., 2022]. The next generation of hardware will support FP4, and advances in
weight-only quantization have led to training in binary and ternary at scale [Ma et al., 2024, Wang
et al., 2023]. How far will these paradigms go? Specifically, we ask: how do the effects of precision on loss scale as we vary parameters and data, both during and after training?
Studying scaling in precision is challenging because work on scaling laws generally aims to drop fine-grained implementation details in pursuit of universal functional forms, while work on quantization generally does the opposite, focusing on the details: how quantization is done, with what type, and to what part of the model. Seeking a balance, we consider a variety of plausible functional forms and choose one that abstracts the implementation details of quantization away from loss scaling,
∗Equal contribution. Correspondence to [email protected]
[Figure 1 panels: "Scaling: Post-Train Quantization" (left) and "Scaling: Quantized Training" (right); left panel y-axis is val loss after post-train quantization, right panel annotated "Training larger models in lower precision can be compute optimal".]
Figure 1: Schematic of key findings. (Left) Training a fixed model size to various data budgets in
BF16 and quantizing weights at the end. We find that degradation due to post-train quantization
increases with tokens seen during pretraining, so that eventually additional pretraining data
can be harmful. (Right) Our scaling suggests training larger models in lower precision can
be compute-optimal according to the cost model in Section 4.3. Weights, activations, attention
quantized, all models trained on the same data budget, details in Appendix H.
allowing us to predict loss scaling in many situations of practical interest. This functional form posits that bit precision and parameter count interchangeably contribute to a model's "effective parameter count," Neff, and that implementation details, such as which parts of a model are quantized to what precision, interact with loss scaling only through their effect on this quantity.
Overall, we study the scaling of the effects of precision on loss as we vary data and parameters,
both during and after training. We first study how the degradation induced by post-train quantiza-
tion scales with parameters and data. We find that the degradation increases with data, so that for
a fixed model, training on additional data after a certain point can be actively harmful if the model
will be quantized after training. We then shift our focus to quantized training, examining both
the quantization-aware-training (weights only) and low-precision training (weights, activations, at-
tention all quantized) settings. Our scaling laws for pretraining suggest that the compute-optimal
pretraining precision is in general independent of compute budget. Surprisingly, however, this inde-
pendence ceases to be true if model size is constrained, in which case the compute-optimal precision
grows slowly in compute.
In all, we pretrain a suite of 465 language models in 3 to 16 bit precisions, as well as post-train
quantize each to multiple precisions. For a language model with N parameters, trained on D tokens
with training precision Ptrain , and post-train weight precision Ppost , we ultimately find a unified
scaling law that takes the following form:
L(N, D, P_train, P_post) = A N_eff^{−α} + B D^{−β} + E + δ_PTQ(N_eff, D, P_train, P_post)   (1)

(the first three terms are the usual Chinchilla form and capture training-time effects; δ_PTQ captures post-training effects)
where A, B, E, α, β are positive fitted constants, and δPTQ refers to the loss degradation induced
by post-training quantization before inference. Altogether, our results for post-train quantization
illustrate how more pretraining FLOPs do not always lead to better models at inference-
time, and our results for low-precision pretraining suggest that both the standard practice
of training models in 16-bit, and the race to extremely low (sub 4-bit) pretraining
precision, may be suboptimal.
2 Background, Related Work, and Setup
Notation. Throughout, D denotes dataset size in tokens and N denotes model size in parameters.
Pw, Pa, Pkv refer to the bit precision, in integer type, of the weights, activations, and key-value cache ("attention") during training, and Ppost refers to the precision we post-train quantize (PTQ) the weights to at the end for model inference.
the model, all three model parts are tied to the same precision. The inference-time loss degradation
induced by post-train quantization will be denoted δPTQ (N, D, Ptrain , Ppost ), and it is defined as
the change in loss from performing post-training quantization compared to the end of pretraining.
We use “high precision” to mean 16-bit or above.
2.2 Scaling Laws and Parametric Fits
Scaling Laws. Hoffmann et al. [2022] model loss scaling using the functional form L(N, D) =
AN −α + BD−β + E where A, B, α, β, E are positive fitted constants, finding that data and param-
eters should be scaled in roughly equal proportion as more compute becomes available. We will
refer to the scaling of Hoffmann et al. [2022] as "Chinchilla-optimal" or just "Chinchilla," and note that this is often taken colloquially to mean that D/N ≈ 20 is pretraining compute-optimal. On the theoretical front, work on scaling laws [Bahri et al., 2024, Bordelon et al., 2024, Lin et al., 2024a] finds that adding noise to various parts of the model or data affects loss in a predictable way. While previous works have
explored the scaling behavior of post-training quantization in terms of total model bits [Dettmers
and Zettlemoyer, 2023] and knowledge capacity [Allen-Zhu and Li, 2024], we focus instead on data
scaling. We note that in general the exact fitted values of all coefficients and exponents can vary
drastically based on small implementation differences: Besiroglu et al. [2024] find different constants
when attempting to replicate [Hoffmann et al., 2022], Sardana and Frankle [2023] fit coefficients
A, B of different orders of magnitude. For this reason, we emphasize our contribution is not the
numerical values we fit, but the trends and functional forms we identify.
Overtraining. In practice, accounting for inference costs means training smaller models for
substantially longer than Chinchilla-optimal [Sardana and Frankle, 2023, Gadre et al., 2024]. For
instance, Llama-3-8B is trained to D/N ≈ 2000 [Dubey et al., 2024] and the Gemma-2 series up to
D/N > 1000 [Team et al., 2024]. We refer to such models as “overtrained” in this paper, with the
token/parameter ratio D/N being a key quantity throughout. Work on inference-time compute
[Snell et al., 2024, Brown et al., 2024] and on synthetic and multimodal data [Yang et al., 2024, Fan
et al., 2024, Bauer et al., 2024] suggests future models may be even more overtrained. Therefore,
modern work on scale must consider ratios much larger than Chinchilla-optimal, and in this work we perform experiments up to D/N ≈ 10³ and analyze the predictions found by our scaling law for up to D/N ≈ 10⁵. See Appendix B for additional related work.
2.3 Setup
We train and evaluate a suite of OLMo-style models on the Dolma V1.7 dataset [Groeneveld et al.,
2024, Soldaini et al., 2024], using a standard Transformer++ implementation; see Appendix A for
hyperparameters and ablations. Our experiments consist of a sweep of language model pretraining
runs over N ∈ [30, 60, 110, 220] million parameters (non-embedding) and D ∈ [1.5, 3, 6, 13, 26] billion
tokens. Our model sizes are relatively small because we train up to a very high D/N ≈ 10³ to study data scaling, and we set off over 20 runs at every (N, D): we sweep 8 values of precision for each of the weights, activations, and attention.
3.1 Overtrained Models Degrade more when Post-Train Quantized
[Figure 2: val loss after post-train quantization (top row) and degradation δPTQ on a log scale (bottom row) vs. pretraining data, for quantization to INT3, INT4, INT5 and no PTQ, across model sizes.]
We consider different model sizes (columns) trained on various data budgets (x-axis of each
plot) and plot in Figure 2 both the loss after post-train quantization (top row) and the degradation
incurred relative to end of training (bottom row). We find that the degradation δPTQ increases
in training data size across all model sizes, but that for a fixed dataset size larger models incur a
smaller degradation. We additionally observe that δPTQ increases exponentially as we decrease the
precision we quantize to. Based on these observations we model δPTQ as taking the form:
δ_PTQ(N, D, P_post) = C_T (D^{γ_D} / N^{γ_N}) e^{−P_post/γ_post}   (2)
where CT , γD , γN , γpost are positive fitted constants. As we find the fitted values of γD and γN to be
similar (see Appendix I for numerical values), we can think of this as an approximate power law in
the token/parameter ratio D/N . The intuition for this poor data scaling might be that as models
train on more data, they compress more information into their weights, so that perturbations to
weights in the form of quantization are more harmful to loss, all else equal. We discuss formal
theoretical interpretations in Appendix G.
This finding implies that for models that will be post-train quantized, there exists an amount of
pretraining data beyond which additional data is actively harmful to performance at inference-time
(see top-left, Figure 2). This can be defined as the point where additional data increases post-train
degradation more than it decreases loss during pretraining. We solve analytically for this critical
data size in Appendix E. We thus summarize our first scaling finding as follows.
Finding 1. Overtrained language models are more sensitive to post-training quantization.
For models trained in BF16 or above, we can model this loss degradation as
δ_PTQ(N, D, P_post) = C_T (D^{γ_D} / N^{γ_N}) e^{−P_post/γ_post}
where CT , γD , γN , γpost are positive fitted constants. This implies that when D/N is suffi-
ciently large, or Ppost sufficiently small, loss after quantization can increase as models are
pretrained for longer, as in Figure 2. We will revisit and modify Equation 2 in Section 5 to
account for the effects of training in low-precision on δPTQ .
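As a concrete illustration of Finding 1, the sketch below evaluates this functional form on a grid of data budgets. The constants C_T, γ_D, γ_N, γ_post here are illustrative placeholders (not the fitted values from Appendix I), so only the qualitative trend is meaningful.

```python
import numpy as np

def delta_ptq(N, D, P_post, C_T=0.05, gamma_D=0.5, gamma_N=0.5, gamma_post=1.0):
    """Post-train quantization degradation from Finding 1 / Eq. 2:
    delta_PTQ = C_T * (D**gamma_D / N**gamma_N) * exp(-P_post / gamma_post).
    Constants are illustrative placeholders, not the paper's fitted values."""
    return C_T * (D ** gamma_D) / (N ** gamma_N) * np.exp(-P_post / gamma_post)

# Degradation grows with tokens seen and shrinks with model size and post-train precision.
for D in (1.5e9, 6e9, 26e9):
    row = {P: round(float(delta_ptq(110e6, D, P)), 4) for P in (3, 4, 6, 8)}
    print(f"D = {D:.1e} tokens:", row)
```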
[Figure 3 panels: Neff/N vs. precision (bits) for weights, activations, KV cache, and tied (left); IsoLoss contours over Pw (bits) and N (millions) (center, right).]
Figure 3: (Left) Neff /N from our final scaling law. Our fit of Neff (N, Pw ) in this section is the
first step towards this (blue). Empirical (center) and predicted (right) IsoLoss contours illustrating
the precision-parameter tradeoff. Y-axis is weight precision during quantized training. All runs
plotted trained on D = 13B tokens. Predictions from a fitted version of Equation 3, darker lines
correspond to lower loss.
[Figure 4 panels: final val loss vs. Pw (training precision, bits) at 3.3B, 13.1B, and 26.2B tokens, for model sizes 30M, 60M, 110M, and 220M.]
Figure 4: Predicting final validation losses L(N, D, Pw ) for various N, D, Pw to test our proposed
functional form. Points are experimental values, lines are predictions of a single parametric fit of
the form in Equation 3. We train only two model sizes at 26B due to compute constraints.
In line with the empirical trends in Figure 3, we find the best fit for the tradeoff between
weight precision and parameters is Neff (N, Pw ) = N (1 − e−Pw /γw ), where γw is a fitted constant
measuring the sensitivity of model weights (alternative fits explored in Appendix I). We therefore
modify Chinchilla scaling to account for Neff by making the substitution N ↦ Neff(N, Pw), giving the modified form

L(N, D, P_w) = A [N(1 − e^{−P_w/γ_w})]^{−α} + B D^{−β} + E   (3)

Assuming that the effects of quantizing the weights, activations, and KV cache are independent and multiplicative, the same construction extends to all three parts:

N_eff(P_w, P_a, P_kv) = N (1 − e^{−P_w/γ_w})(1 − e^{−P_a/γ_a})(1 − e^{−P_kv/γ_kv})   (4)
We test whether this independence approximately holds by comparing the predictive power of a
model with marginally fitted constants and a model with jointly fitted constants. We show the
predictive power of both models in Figure 5(b, c), finding that both methods for fitting constants
have approximately the same predictive power. These results suggest that the independence as-
sumption is reasonable. In Appendix K, we present further evidence that this "factorized" functional form is a strong fit to the data and discuss alternative factorization schemes.
[Figure 5 panels: predicted vs. actual loss for the Pw marginal sweep (MSE 0.0028, R² 0.9655), the joint fit f(Pw, Pa, Pkv) (MSE 0.0086, R² 0.9006), and the combined marginals f(Pw)f(Pa)f(Pkv) (MSE 0.0089, R² 0.8973).]
Figure 5: (Left) Predicted loss based on fitted values with Equation 4. (center) Fitting γ parameters
jointly on sweeps with combinations of precisions vs (right) fitting them on “marginal” sweeps where
only one model part is in low precision at a time. Outliers are those at extremely low precision
whose training runs are sometimes unstable.
Finding 2. The effects of quantizing the weights, activations, and KV cache during training
are well modeled as independent and multiplicative so that
L(N, D, P_w, P_a, P_kv) = A N_eff^{−α} + B D^{−β} + E
where
N_eff(P_w, P_a, P_kv) = N (1 − e^{−P_w/γ_w})(1 − e^{−P_a/γ_a})(1 − e^{−P_kv/γ_kv})
for which we fit constants γw , γa , γkv that reflect the different sensitivities of weights, activa-
tions, and KV cache. If the three precisions are set to the same value P , as in pretraining,
this simplifies to Neff (P ) ≈ N (1 − e−P/γ̄ )3 where γ̄ is the average of the three parameters.
We visualize this functional form with our fitted values in Figure 3 (left).
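A minimal sketch of Finding 2 follows. The constants are rounded from the fits reported in Appendix I (reading γi there as the activation constant, and ignoring the bias offsets also reported there), so treat the numbers as illustrative rather than exact.

```python
import math

def n_eff(N, P_w, P_a, P_kv, gamma_w=2.67, gamma_a=2.21, gamma_kv=0.96):
    """Effective parameter count: each quantized part contributes a factor (1 - e^{-P/gamma})."""
    return (N * (1 - math.exp(-P_w / gamma_w))
              * (1 - math.exp(-P_a / gamma_a))
              * (1 - math.exp(-P_kv / gamma_kv)))

def loss(N, D, P_w, P_a, P_kv, A=4.3e3, B=1.8e4, E=2.76, alpha=0.5, beta=0.5):
    """Chinchilla form with N replaced by N_eff (Finding 2)."""
    return A * n_eff(N, P_w, P_a, P_kv) ** (-alpha) + B * D ** (-beta) + E

# Lowering training precision acts like shrinking the model: same (N, D), higher loss.
print(round(loss(220e6, 13e9, 16, 16, 16), 3))
print(round(loss(220e6, 13e9, 4, 4, 4), 3))
```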
Since derivations are algebraically involved, we will work up to proportionality and verify proposed
solutions numerically. See Appendix E for mathematical details. We note that the implications
of our functional form are true no matter the scale at which future experiments are done, but
the numerical values we predict depend on our fitted constants which are fitted on smaller-scale,
integer-type experiments.
² In practice, the gains are less than linear due to systems overhead.
[Figure 6 panels: predicted val loss for quantized training, integer type (left); empirical val loss for quantized training, floating-point type (center); compute-optimal precision P*(D) for various N (right).]
Figure 6: Scaling law predictions (left, fitted on integer type) vs. empirical values (center, floating-point type). Precision of weights, activations, attention fixed to Ptrain. Predictions closely match
the empirical trend, but are shifted up by a small amount since floating-point is a more expressive
type and will incur lower loss at the same precision. (Right) When N is held fixed, compute-
optimal precision increases approximately logarithmically with data. Markers correspond to pre-
dicted compute-optimal precision for Llama-3 (8b, 70b, 405b), denoted by (circle, triangle, star)
at each IsoFLOP (lines), illustrating how compute-optimal precision increases in data when model
size is held fixed.
4.3.1 If You Must Train in Low Precision, Increase Parameters Before Data
We minimize L(N, D) with P fixed, subject to the compute constraint C ∝ NDP. With some algebra, at precision P and compute budget C, the optimal allocations N*, D* of parameters and data relative to Chinchilla-optimal NCh, DCh are given by
N*(P, C)/N_Ch(C) ∝ [1 − e^{−P/γ̄}]^{−3α/(α+β)} P^{−β/(α+β)}   and   D*(P, C)/D_Ch(C) ∝ [1 − e^{−P/γ̄}]^{3α/(α+β)} P^{β/(α+β)}   (6)
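The sketch below evaluates the proportionalities in Equation 6 to show the direction of the effect: as precision drops, the compute-optimal allocation shifts toward more parameters and less data relative to Chinchilla. The values of α, β, and γ̄ are illustrative placeholders.

```python
import math

def allocation_vs_chinchilla(P, alpha=0.5, beta=0.5, gamma_bar=2.0):
    """Ratios N*/N_Ch and D*/D_Ch from Eq. 6, up to constants of proportionality."""
    u = 1 - math.exp(-P / gamma_bar)
    n_ratio = u ** (-3 * alpha / (alpha + beta)) * P ** (-beta / (alpha + beta))
    d_ratio = u ** (3 * alpha / (alpha + beta)) * P ** (beta / (alpha + beta))
    return n_ratio, d_ratio

for P in (16, 8, 4, 3):
    n_r, d_r = allocation_vs_chinchilla(P)
    print(f"P = {P:2d} bits: N*/N_Ch ∝ {n_r:.2f}, D*/D_Ch ∝ {d_r:.2f}")
```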
We test our predictions in Figure 6 at a larger scale. We train compute-matched models at various parameter counts and precisions, ranging from FP4 to FP32 and from 220M to 1.6B parameters. We train in floating-point type since that is standard in pretraining [Groeneveld et al., 2024, Deitke et al., 2024], though our scaling laws are fitted on integer type. We plot our predicted trend in Figure 6 (left) and the empirical values in the middle. We find that scaling fits on integer type are a strong fit until 4-bit precision, at which point the difference between the two types becomes more apparent. Integer fits assume all bits contribute in the same way, but the split of floating-point bits into exponent and mantissa means that each likely has its own scaling behavior. The matching of qualitative trends throughout, with the optimum close to the predicted P* ≈ 7 bits, suggests that similar scaling laws may exist across types.
[Figure 7 panels: predicted vs. actual δPTQ (left; MSE 5.06e-02, R² 0.9041) and heatmaps of degradation over Pw, the training precision in bits (center, right).]
Figure 7: Combined plots for predicting degradation. (Left) demonstrates the quality of our fit on
all our runs, including all combinations of pre and post-training precisions. (Center, right) illustrate
visually that our unified degradation form can predict degradation when training and serving in
any precision. Plots (center, right) vary Pw only, but fits in (left) include runs where Pa , Pkv are
also jointly varied.
more because δPTQ scales as N^{−γ_N}, as we found in Section 3. We call this second effect the "overtraining" effect. In practice, the first, "robustification" effect wins out, so that models trained in lower precision overall degrade less when post-train quantized than models trained in high precision. We confirm that using Neff rather than N to predict degradation across training precisions leads to a substantially stronger fit in Figure 16 (top left, top center), verifying that the competing overtraining effect is real.
Modifying δPTQ to account for training precision. We assume training precision is strictly
greater than inference precision, and define degradation as identically zero if they are equal. We
begin by studying how degradation scales with just weight-precision during training, Pw .
Consider Figure 7(center). We fix (N, D) and each cell of the heatmap represents the empirical
degradation δPTQ (Pw , Ppost ). We observe that degradation very quickly increases to its exponen-
tially large value from Section 3 if there is any gap between training and inference-time precision.
This motivates modifying our initial functional form fitted in Section 3 to
δ_PTQ(N, D, P_w, P_post) = C_T e^{−P_post/γ_post} (D^{γ_D} / N_eff^{γ_N}) [1 − e^{−C_w(P_w − P_post)}]   (8)

(the factor D^{γ_D}/N_eff^{γ_N} is the overtraining effect; the bracketed factor is the robustification effect)
where Cw is the only new fitted value. Then, we can extend this to include the precision effects of
activations/attention in the natural way:
δ_PTQ(N, D, P_w, P_a, P_kv, P_post) = C_T e^{−P_post/γ_post} (D^{γ_D} / N_eff^{γ_N}) ∏_{x∈{w,a,kv}} [1 − e^{−C_x(P_x − P_post)}]   (9)
We measure the fit to the data of such a functional form in Figure 7, and find a strong fit with
R2 = 0.90 on over 1000 data points (each of 465 pretraining runs post-train quantized to multiple
precisions).
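A sketch of Equation 9 follows; it reuses an N_eff helper like the one in the Finding 2 sketch, and C_T, the γ's, and the C_x constants are again placeholders. Per the assumption above, each part's robustification factor vanishes when its training precision equals the post-train precision.

```python
import math

def n_eff(N, P_w, P_a, P_kv, gamma_w=2.67, gamma_a=2.21, gamma_kv=0.96):
    return (N * (1 - math.exp(-P_w / gamma_w))
              * (1 - math.exp(-P_a / gamma_a))
              * (1 - math.exp(-P_kv / gamma_kv)))

def delta_ptq_unified(N, D, P_w, P_a, P_kv, P_post,
                      C_T=0.05, gamma_D=0.5, gamma_N=0.5, gamma_post=1.0,
                      C_w=1.0, C_a=1.0, C_kv=1.0):
    """Eq. 9: an 'overtraining' term times one 'robustification' factor per quantized part."""
    overtraining = (C_T * math.exp(-P_post / gamma_post)
                    * D ** gamma_D / n_eff(N, P_w, P_a, P_kv) ** gamma_N)
    robustification = 1.0
    for C_x, P_x in ((C_w, P_w), (C_a, P_a), (C_kv, P_kv)):
        robustification *= 1 - math.exp(-C_x * max(P_x - P_post, 0.0))
    return overtraining * robustification

# Training the weights closer to the serving precision shrinks the degradation.
for P_w in (16, 8, 5):
    print(P_w, round(delta_ptq_unified(220e6, 13e9, P_w, 16, 16, P_post=4), 5))
```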
An interpretable, unified functional form. Now we simplify and interpret the resulting
functional form. Consider training with only weights in low precision and take Cw = 1 for illustra-
tive purposes so we can simplify Equation 9. Denote σ²_tr := e^{−P_w/γ_w} as "training noise" reflecting the decrease in effective parameter count due to training weights in lower precision. Then, Equation 9 simplifies to

δ_PTQ(N, D, P_train, P_post) = C_T (σ²_PTQ − σ²_tr) · D^{γ_D} / N_eff^{γ_N}   (10)

(the first factor in parentheses is the robustification effect; D^{γ_D}/N_eff^{γ_N} is the overtraining effect)
This is the intuitive modification one might make to the form of the initial post-training quantization degradation we fitted in Section 3 (Finding 1), with a small competing-effects factor from Neff pushing in the opposite direction. It cleanly reflects the intuition that models are robustified to PTQ noise to the extent they were trained with similar noise.
L(N, D, P_w, P_a, P_kv, P_post) = A N_eff^{−α} + B D^{−β} + E + δ_PTQ   (11)

where δPTQ(N, D, Pw, Pa, Pkv, Ppost) is in general as in Equation 9 and Neff(N, Pw, Pa, Pkv) is as in Finding 2.
7 Acknowledgements
Tanishq Kumar thanks Tim Dettmers, Chris De Sa, Neil Band and Luke Bailey for helpful com-
ments and discussion. Blake Bordelon is supported by a Google PhD Fellowship. Cengiz Pehlevan
is supported by NSF grant DMS-2134157, NSF CAREER Award IIS-2239780, and a Sloan Re-
search Fellowship. This work has been made possible in part by a gift from the Chan Zuckerberg
Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial
Intelligence. Aditi Raghunathan acknowledges support from AI2050 program by Schmidt Sciences
(Grant G2264481), Google Research Scholar, Apple, NSF, Cisco.
We gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF un-
der Nos. CCF2247015 (Hardware-Aware), CCF1763315 (Beyond Sparsity), CCF1563078 (Volume
to Velocity), and 1937301 (RTML); US DEVCOM ARL under Nos. W911NF-23-2-0184 (Long-
context) and W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under Nos. N000142312633
(Deep Signal Processing); Stanford HAI under No. 247183; NXP, Xilinx, LETI-CEA, Intel, IBM,
Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog
Devices, Google Cloud, Salesforce, Total, the HAI-GCP Cloud Credits for Research program, the
Stanford Data Science Initiative (SDSI). Benjamin F. Spector is supported by a Hertz Fellowship.
The U.S. Government is authorized to reproduce and distribute reprints for Governmental pur-
poses notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions
or recommendations expressed in this material are those of the authors and do not necessarily
reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S.
Government.
References
Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language
models. arXiv preprint arXiv:2001.08361, 2020.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al.
Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha
Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.
arXiv preprint arXiv:2407.21783, 2024.
Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisen-
thwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. Fp8 formats for
deep learning. arXiv preprint arXiv:2209.05433, 2022.
Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong,
Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit llms: All large language models are
in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024.
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang,
Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language
models. arXiv preprint arXiv:2310.11453, 2023.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training
quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. MLSys 2024, 2023.
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant:
Accurate and efficient post-training quantization for large language models. In International
Conference on Machine Learning, pages 38087–38099. PMLR, 2023.
Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural
scaling laws. Proceedings of the National Academy of Sciences, 121(27):e2311878121, 2024.
Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling
laws. arXiv preprint arXiv:2402.01092, 2024.
Licong Lin, Jingfeng Wu, Sham M Kakade, Peter L Bartlett, and Jason D Lee. Scaling laws in
linear regression: Compute, parameters, and data. arXiv preprint arXiv:2406.08466, 2024a.
Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: k-bit inference scaling laws. In
International Conference on Machine Learning, pages 7750–7774. PMLR, 2023.
Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity
scaling laws. arXiv preprint arXiv:2404.05405, 2024.
Tamay Besiroglu, Ege Erdil, Matthew Barnett, and Josh You. Chinchilla scaling: A replication
attempt. arXiv preprint arXiv:2404.10102, 2024.
Nikhil Sardana and Jonathan Frankle. Beyond chinchilla-optimal: Accounting for inference in
language model scaling laws. arXiv preprint arXiv:2401.00448, 2023.
Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Worts-
man, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, et al. Language models scale
reliably with over-training and on downstream tasks. arXiv preprint arXiv:2403.08540, 2024.
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya
Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al.
Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118,
2024.
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally
can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and
Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.
arXiv preprint arXiv:2407.21787, 2024.
Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, and Tatsunori Hashimoto. Synthetic
continued pretraining. arXiv preprint arXiv:2409.07431, 2024.
Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yonglong Tian. Scal-
ing laws of synthetic images for model training... for now. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 7382–7392, 2024.
André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle
Chard, and Ian Foster. Comprehensive exploration of synthetic data generation: A survey. arXiv
preprint arXiv:2401.02524, 2024.
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord,
Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerat-
ing the science of language models. arXiv preprint arXiv:2402.00838, 2024.
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur,
Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An open corpus of
three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159,
2024.
Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher M De Sa. Quip: 2-bit quantization
of large language models with guarantees. Advances in Neural Information Processing Systems,
36, 2024.
Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno,
and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint
arXiv:2402.04291, 2024.
Hamdy Abdelkhalik, Yehia Arafa, Nandakishore Santhi, and Abdel-Hameed A Badawy. Demysti-
fying the nvidia ampere architecture through microbenchmarking and instruction-level analysis.
In 2022 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–8. IEEE,
2022.
Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue
Yang, Bolin Ni, Jingcheng Hu, et al. Fp8-lm: Training fp8 large language models. arXiv preprint
arXiv:2310.18313, 2023.
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Moham-
madreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open
weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146,
2024.
Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng
Zhou, and Jason K Eshraghian. Scalable matmul-free language modeling. arXiv preprint
arXiv:2406.02528, 2024.
Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick
Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large
neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, and Cengiz Pehlevan. Depthwise
hyperparameter transfer in residual networks: Dynamics and scaling limit. arXiv preprint
arXiv:2309.16620, 2023.
Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D Co-
Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, et al. Small-scale proxies for large-scale
transformer training instabilities. arXiv preprint arXiv:2309.14322, 2023a.
Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Zhen Stephen Gou, Phil Blun-
som, Ahmet Üstün, and Sara Hooker. Intriguing properties of quantization at scale. Advances
in Neural Information Processing Systems, 36:34278–34294, 2023.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of machine learning research, 21(140):1–67, 2020.
Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld,
Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the
colossal clean crawled corpus. arXiv preprint arXiv:2104.08758, 2021.
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia,
Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision
training. arXiv preprint arXiv:1710.03740, 2017.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan
Catanzaro. Megatron-lm: Training multi-billion parameter language models using model paral-
lelism. arXiv preprint arXiv:1909.08053, 2019.
Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, and Ludwig
Schmidt. Stable and low-precision training for large-scale vision-language models. Advances
in Neural Information Processing Systems, 36:10271–10298, 2023b.
Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for
large language models. arXiv preprint arXiv:2308.07633, 2023.
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with
low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning
of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise
quantization. arXiv preprint arXiv:2110.02861, 2021.
Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath
Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi Viji Srinivasan, and Kailash Gopalakrish-
nan. Ultra-low precision 4-bit training of deep neural networks. Advances in Neural Information
Processing Systems, 33:1796–1807, 2020.
Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang
Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware
training for large language models. arXiv preprint arXiv:2305.17888, 2023.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan
Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization
for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems,
6:87–100, 2024b.
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang,
Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of
large language models with a single gpu. In International Conference on Machine Learning,
pages 31094–31116. PMLR, 2023.
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashk-
boos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized repre-
sentation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078, 2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song,
John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models:
Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and
efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko-
lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foun-
dation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman
Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-
parameter open-access multilingual language model. 2023.
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le
Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual gener-
alization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.
Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia
Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. Olmoe: Open mixture-of-experts
language models. arXiv preprint arXiv:2409.02060, 2024a.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot,
Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al.
Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christo-
pher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer
language models. arXiv preprint arXiv:2205.01068, 2022.
Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz
Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. Santacoder: don’t
reach for the stars! arXiv preprint arXiv:2301.03988, 2023.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou,
Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with
you! arXiv preprint arXiv:2305.06161, 2023.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane
Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. Starcoder 2 and the stack v2:
The next generation. arXiv preprint arXiv:2402.19173, 2024.
Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari
Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, et al. Fingpt:
Large generative models for a small language. arXiv preprint arXiv:2311.05640, 2023.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge,
Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm:
Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113,
2023.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu,
Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly
capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke
Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, et al. Aya model: An in-
struction finetuned open-access multilingual language model. arXiv preprint arXiv:2402.07827,
2024.
Yangjun Ruan, Chris J Maddison, and Tatsunori Hashimoto. Observational scaling laws and the
predictability of language model performance. arXiv preprint arXiv:2405.10938, 2024.
Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, and Martin
Jaggi. Scaling laws and compute-optimal training beyond fixed training durations. arXiv preprint
arXiv:2405.18392, 2024.
Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan
Narang, Vinh Q Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures:
How does inductive bias influence scaling? arXiv preprint arXiv:2207.10551, 2022a.
Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michal Krutul, Szymon
Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, et al. Scaling
laws for fine-grained mixture of experts. arXiv preprint arXiv:2402.07871, 2024.
Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, and
Ngai Wong. Scaling laws with vocabulary: Larger models deserve larger vocabularies. arXiv
preprint arXiv:2407.13623, 2024.
Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoff-
mann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling
laws for routed language models. In International conference on machine learning, pages 4057–
4086. PMLR, 2022.
Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q Tran, David R So, Siamak Shakeri, Xavier Garcia,
Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, et al. Transcending scaling laws
with 0.1% extra compute. arXiv preprint arXiv:2210.11399, 2022b.
Teven Le Scao, Thomas Wang, Daniel Hesslow, Lucile Saulnier, Stas Bekman, M Saiful Bari, Stella
Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, et al. What language model to train
if you have one million gpu hours? arXiv preprint arXiv:2210.15424, 2022.
Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene
Cheah, Teddy Ferdinan, Haowen Hou, Przemyslaw Kazienko, et al. Eagle and finch: Rwkv with
matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024.
Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang,
Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative
mixed-modal language models. In International Conference on Machine Learning, pages 265–279.
PMLR, 2023.
Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. Revisiting neural scaling laws
in language and vision. Advances in Neural Information Processing Systems, 35:22300–22312,
2022.
Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade
Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for
contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 2818–2829, 2023.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yo-
gatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language
models. arXiv preprint arXiv:2206.07682, 2022.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam
Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond
the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv
preprint arXiv:2206.04615, 2022.
Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, and
Sanmi Koyejo. Scaling laws for downstream task performance of large language models. arXiv
preprint arXiv:2402.04177, 2024.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal,
Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation
of training sets for language models. arXiv preprint arXiv:2406.11794, 2024.
Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing
Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training. arXiv
preprint arXiv:2407.01492, 2024.
Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang,
Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. A survey on data selection
for language models. arXiv preprint arXiv:2402.16827, 2024.
Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra
Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language
models. Advances in Neural Information Processing Systems, 36, 2024b.
Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on
neural networks typically occurs at the edge of stability. In International Conference on Learning
Representations, 2021.
Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David
Cardoze, George Dahl, Zachary Nado, and Orhan Firat. A loss curvature perspective on training
instability in deep learning. arXiv preprint arXiv:2110.04369, 2021.
Quynh Nguyen, Marco Mondelli, and Guido F Montufar. Tight bounds on the smallest eigenvalue
of the neural tangent kernel for deep relu networks. In International Conference on Machine
Learning, pages 8119–8129. PMLR, 2021.
Alexander Atanasov, Blake Bordelon, Sabarish Sainathan, and Cengiz Pehlevan. The on-
set of variance-limited behavior for networks in the lazy and rich regimes. arXiv preprint
arXiv:2212.12147, 2022.
Emmanuel Abbe, Enric Boix-Adsera, Matthew S Brennan, Guy Bresler, and Dheeraj Nagaraj.
The staircase property: How hierarchical structure can guide deep learning. Advances in Neural
Information Processing Systems, 34:26989–27002, 2021.
Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. The merged-staircase property: a
necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural
networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.
Boaz Barak, Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang.
Hidden progress in deep learning: Sgd learns parities near the computational limit. Advances in
Neural Information Processing Systems, 35:21750–21764, 2022.
Appendix
A Hyperparameter Details and Ablations
We launch over 20 runs for each (N, D) combination to study scaling in precision, trained and
validated on the common crawl split of the Dolma dataset [Soldaini et al., 2024]. We use a Trans-
former++ implementation: SwiGLU activations [Shazeer, 2020], RoPE embeddings [Su et al., 2021],
RMSLayerNorm, Adam β values of (0.9, 0.95). We adopt a cosine learning rate schedule with 10%
warmup period and peak learning rate of 6e-4 for the smallest model and learning rates scaled with
width and depth according to depth-µP for the larger models [Yang et al., 2022, Bordelon et al.,
2023]. We use a sequence length of 1024 and batch size of 256 throughout, with Adam ϵ 1e-15,
following [Wortsman et al., 2023a]. We use weight decay of 0.1, as [Ahmadian et al., 2023] find
some results in the quantization literature may be artifacts of insufficient weight decay. We follow
[Ma et al., 2024] in including a LayerNorm before projections because they find it is important for
low precision training to be stable. These are the hyperparameters and settings used for the main
scaling law experiments.
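For reference, the main-run settings above can be collected into a single configuration object; this is just a transcription of the hyperparameters listed in this appendix (the dictionary name and structure are ours, not from the authors' training code).

```python
# Transcription of the main-run hyperparameters described above (Appendix A).
MAIN_RUN_CONFIG = {
    "architecture": "Transformer++ (SwiGLU, RoPE, RMSLayerNorm)",
    "optimizer": "Adam",
    "adam_betas": (0.9, 0.95),
    "adam_eps": 1e-15,
    "weight_decay": 0.1,
    "lr_schedule": "cosine",
    "warmup_fraction": 0.10,
    "peak_lr_smallest_model": 6e-4,   # scaled for larger models via depth-muP
    "sequence_length": 1024,
    "batch_size": 256,
    "extra_layernorm_before_projections": True,  # following Ma et al. (2024)
    "dataset": "Dolma V1.7 (common crawl split)",
}
```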
To check robustness, we then ablate these hyperparameter choices, with results in Figure 8. In
our ablation we use a sequence length of 512 with batch size 128, weight decay of 1e-3, Adam ϵ of
1e-10, a peak learning rate of 1e-4, and a warmup period of 3%. We train models with these alternative hyperparameters at various weight, activation, and KV cache precisions. We train and validate on C4 [Raffel et al., 2020, Dodge et al., 2021] instead. Though these ablations are at rather
smaller scale due to compute constraints, the loss curves follow the same trends – rapid decrease in
final loss with an initial increase in precision from 4 bits, then diminishing returns as we approach
higher precision – as in the main text, suggesting the trends are robust to hyperparameter choices.
Figure 8: L(Pw ), L(Pa ), L(Pkv ) for ablated hyperparameters, N = 30M, D = 1.5B. We can see the
trends persist, where the first few bits reduce final val loss significantly, with diminishing/saturating
returns quickly setting in at higher precision. We do not fit constants on these ablated runs.
2014, Dettmers et al., 2024, 2021, Sun et al., 2020, Liu et al., 2023] or the effects of changing the
precision after training (post-training quantization) [Frantar et al., 2022, Lin et al., 2024b, Dettmers
et al., 2022, Xiao et al., 2023, Sheng et al., 2023, Dettmers et al., 2023]. In this work we study
both the precision during training and after it, and unify the two from a scaling perspective.
Large language models and scaling By scaling up the transformer architecture [Vaswani
et al., 2017] a variety of large language models have been proposed [Brown, 2020, Rae et al., 2021,
Touvron et al., 2023a,b, Dubey et al., 2024, Le Scao et al., 2023, Muennighoff et al., 2022, 2024a,
Groeneveld et al., 2024, Jiang et al., 2023, Zhang et al., 2022, Allal et al., 2023, Li et al., 2023,
Lozhkov et al., 2024, Luukkonen et al., 2023, Bai et al., 2023, Chowdhery et al., 2023, Team et al.,
2023, Üstün et al., 2024, Deitke et al., 2024]. To improve our understanding of these models,
various works have investigated their scaling properties [Ruan et al., 2024, Allen-Zhu and Li, 2024,
Hägele et al., 2024]. Many aspects are relevant to scaling including the architecture [Tay et al.,
2022a, Krajewski et al., 2024, Tao et al., 2024, Clark et al., 2022, Tay et al., 2022b, Scao et al.,
2022, Peng et al., 2024], the modalities considered [Aghajanyan et al., 2023, Alabdulmohsin et al.,
2022, Cherti et al., 2023], the performance metrics [Wei et al., 2022, Srivastava et al., 2022, Isik
et al., 2024], the data composition [Li et al., 2024, Liu et al., 2024, Albalak et al., 2024] and data
repetitions [Muennighoff et al., 2024b]. Our work analyzes one such aspect, which is key to better
scaling: the numeric precision during and after training.
Functional Form                    Val R²   Number of Fitted Parameters
Neff                               0.82     3
Additive/independent power law     0.71     2
Deff                               0.74     3
Neff and Deff (tied)               0.79     3
Neff and Deff (not tied)           0.84     4
Multiplicative power law, N, P     0.75     2
tween floating-point and integer quantization can be described by similar functional forms, where
1(b) provides preliminary evidence for this.
Integer quantization first rescales the inputs into the range specified by the number of bits by

s = Q_p / max(|x|),

rounds to integers, and maps quantized integers back to real values via

x_dequant = s · x_int.

This process introduces quantization error, defined as the difference between the original value x and the dequantized value x_dequant. The goal of quantization is to minimize this error while still reducing the precision. One can think of this as rounding to the nearest point on a uniform lattice. More complicated quantization schemes involve selecting the lattice points in a data- or model-dependent manner. Integer quantization, as implemented, uses a fixed-point scaling based on the maximum absolute value of the tensor, and then scales the values within the range [Q_n, Q_p], where Q_n = −2^{b−1} and Q_p = 2^{b−1} − 1, with b being the number of bits.
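A minimal sketch of symmetric (absmax) integer quantization in the spirit of the description above. Note one convention choice as an assumption: the scale here is defined as max(|x|)/Q_p so that dequantization multiplies by the scale; the paper's exact implementation details may differ.

```python
import numpy as np

def int_quantize_dequantize(x, bits):
    """Round onto a uniform b-bit integer lattice and map back to real values."""
    q_n, q_p = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    s = np.max(np.abs(x)) / q_p                      # scale so the largest magnitude maps to q_p
    x_int = np.clip(np.round(x / s), q_n, q_p)       # quantize: round and clamp to [q_n, q_p]
    return s * x_int                                 # dequantize; error is x - s * x_int

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
for b in (8, 4, 3):
    err = x - int_quantize_dequantize(x, b)
    print(f"INT{b}: RMS quantization error = {np.sqrt(np.mean(err ** 2)):.4f}")
```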
D.2 Floating-Point Quantization
Floating-point quantization is slightly more sophisticated, aiming to make a non-uniform lattice
roughly matching the distribution of the weights, which are assumed to be Gaussian. A floating-
point number is in general represented as:
x_fp = (−1)^s · m · 2^e
where s is the sign bit, m is the mantissa, and e is the exponent. In floating-point quantization,
both the mantissa and exponent are quantized to reduce the bit width. For exponent-mantissa
allocations of bits and the details of the exponent bias, we follow the guidelines of Micikevicius et al. [2022], since that reflects how, for instance, FP8 training is done in production settings. Further in line with Micikevicius et al. [2022], we quantize weights per-channel and activations per-tensor.
Making a full scaling law for floating-point quantization is more involved than our integer
treatment, because the effects of scaling mantissa vs exponent bits are not the same. In contrast,
in integer quantization, each additional bit simply causes us to round into a finer-grained lattice
after rescaling, thereby reducing quantization error by a predictable amount. In floating-point
quantization, altering the exponent affects the dynamic range, while altering the mantissa changes
the precision within that range. This flexibility at once makes floating-point quantization more
suitable for model training, but harder to analyze. We leave a commensurately detailed analysis of
mantissa vs exponent scaling to future work.
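To make the exponent/mantissa distinction concrete, the sketch below simulates rounding to a small floating-point format with a given exponent/mantissa split. It is an approximation only: subnormals and exact bias/reserved-exponent conventions are ignored, so it does not exactly reproduce the formats of Micikevicius et al. [2022].

```python
import numpy as np

def fp_quantize(x, exp_bits, man_bits):
    """Round to a simulated float format with the given exponent/mantissa widths.
    Approximate: no subnormals, simple bias = 2^(exp_bits-1) - 1, sign handled separately."""
    sign = np.sign(x)
    mag = np.abs(x)
    mag = np.where(mag == 0, 1e-30, mag)               # avoid log(0)
    e = np.floor(np.log2(mag))                         # unbiased exponent
    bias = 2 ** (exp_bits - 1) - 1
    e = np.clip(e, -bias, bias)                        # dynamic range set by the exponent bits
    m = mag / 2.0 ** e                                 # mantissa, roughly in [1, 2)
    m = np.round(m * 2 ** man_bits) / 2 ** man_bits    # precision within range set by mantissa bits
    return sign * m * 2.0 ** e

x = np.random.randn(4096)
for eb, mb in ((5, 2), (4, 3), (2, 1)):                # FP8-like E5M2/E4M3 and an FP4-like split
    err = x - fp_quantize(x, eb, mb)
    print(f"E{eb}M{mb}: RMS error = {np.sqrt(np.mean(err ** 2)):.4f}")
```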
E Derivations
E.1 Critical Dataset Size for PTQ
We seek a D_crit that satisfies ∂L(D_crit)/∂D = −∂δ_PTQ(D_crit)/∂D, i.e. the point at which the marginal reduction in pretraining loss is exactly canceled by the marginal increase in post-train degradation. Taking both derivatives for the functional forms presented in the main text and equating these opposing effects, we get the equation

β B D_crit^{−β−1} = γ_D C_T N^{−γ_N} e^{−P_post/γ_post} D_crit^{γ_D−1}   (12)

which implies that

D_crit = ( β B N^{γ_N} e^{P_post/γ_post} / (γ_D C_T) )^{1/(γ_D+β)}   (13)
is the predicted point after which pretraining on more data can increase the loss of a model that is post-train quantized. Note that this quantity explodes in P_post, so that a truly unreasonable amount of data is required for longer pretraining to be harmful at commonly used precisions (e.g., 8-bit). However, we find that on overtrained models with D/N ≫ 10³, these overtraining-degradation effects become nontrivial around 5 bits, and dominant below that.
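Equation 13 can be evaluated directly; the sketch below uses placeholder constants (only loosely in the range of the fits in Appendix I) to show how D_crit explodes as P_post increases.

```python
import math

def d_crit(N, P_post, B=1.8e4, beta=0.5, C_T=0.05, gamma_D=0.5, gamma_N=0.5, gamma_post=1.0):
    """Critical dataset size (Eq. 13) beyond which more pretraining data hurts post-PTQ loss."""
    numerator = beta * B * N ** gamma_N * math.exp(P_post / gamma_post)
    return (numerator / (gamma_D * C_T)) ** (1.0 / (gamma_D + beta))

for P in (3, 5, 8):
    print(f"P_post = {P} bits: D_crit ≈ {d_crit(1e9, P):.2e} tokens")
```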
where u(P) = [1 − e^{−P/γ}]^{−3α} is a fixed constant. The compute-optimal scaling when minimizing the loss over N, D is obtained by replacing D = C/(NP). Optimizing over N, we see that this is equivalent to the original Chinchilla optimization problem but with A → A·u(P) and B → B·P^β. Performing this optimization, we find
N*(P, C) = ( u(P) A α / (B P^β β) )^{1/(α+β)} C^{β/(α+β)},   D*(P, C) = ( u(P) A α / (B P^β β) )^{−1/(α+β)} C^{α/(α+β)}   (16)
We can relate the above expressions to the original Chinchilla-optimal N, D at full precision
NCh (C), DCh (C).
N*(P, C)/N_Ch(C) ∝ [1 − e^{−P/γ̄}]^{−3α/(α+β)} P^{−β/(α+β)}   and   D*(P, C)/D_Ch(C) ∝ [1 − e^{−P/γ̄}]^{3α/(α+β)} P^{β/(α+β)}   (17)
where N is a constant. We therefore have a single variable P to minimize the above formula over:

∂L/∂P = u′(P) A N^{−α} + β B C^{−β} N^β P^{β−1} = 0   (20)
First, we note that u′(P) has the following form

u′(P) = −3α [1 − e^{−P/γ}]^{−3α−1} · (1/γ) e^{−P/γ} = −(3α/γ) e^{−P/γ} u(P)^{(3α+1)/(3α)}   (21)

We thus desire a solution to the implicit equation

(3α/γ) e^{−P/γ} u(P)^{(3α+1)/(3α)} A N^{−α} = β B C^{−β} N^β P^{β−1}   (22)
We now aim to find an approximate asymptotic relationship between P and C as C → ∞. Taking
a logarithm of both sides, we find (neglecting additive constants that are independent of C, P )
−(3α + 1) ln(1 − e^{−P/γ}) − P/γ ≈ −β ln C   (23)

The correct dominant balance at large C is to take P* ∼ βγ ln C, as can be verified numerically. With the constraint that C = NPD, we have that D* ≈ C / (N βγ ln C).
Under the constraint C ∝ N DP , we can replace D in terms of C, N, P giving the loss expression
L = A N^{−α} u(P) + B N^β P^β C^{−β}   (25)

∂L/∂N = −α A N^{−α−1} u(P) + β B N^{β−1} P^β C^{−β} = 0   (26)

∂L/∂P = −(3α/γ) A N^{−α} u(P)^{(3α+1)/(3α)} e^{−P/γ} + β B N^β P^{β−1} C^{−β} = 0   (27)

Multiplying the first equation by N and dividing the second equation by it reveals that the optimal P satisfies a compute-independent implicit equation

(3/γ̄) u(P)^{(3α+1)/(3α)} e^{−P/γ̄} = P^{−1} u(P)   (28)
This exercise reveals that the compute optimal strategy when allowed to jointly optimize N, D, P
is to choose a fixed precision that satisfies the above equation and then to scale up N, D with the
prescription in Appendix I.1.1.
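The implicit equation for the jointly compute-optimal precision can be solved numerically; a sketch using scipy's brentq root finder follows. The value of γ̄ is an assumption (a rough average of the fitted γ's), and the bias offsets in the actual fits are omitted, so the printed number illustrates the procedure rather than reproducing the value discussed in the main text (P* ≈ 7 bits).

```python
import math
from scipy.optimize import brentq

alpha = 0.4965     # fitted alpha from Appendix I
gamma_bar = 1.9    # assumed average of the fitted gammas; the real fits also include offsets

def u(P):
    """u(P) = [1 - e^{-P/gamma_bar}]^{-3 alpha}, as defined in the derivation above."""
    return (1 - math.exp(-P / gamma_bar)) ** (-3 * alpha)

def implicit_eq(P):
    """Left minus right side of Eq. 28; its root is the compute-optimal precision P*."""
    lhs = (3 / gamma_bar) * u(P) ** ((3 * alpha + 1) / (3 * alpha)) * math.exp(-P / gamma_bar)
    rhs = u(P) / P
    return lhs - rhs

p_star = brentq(implicit_eq, 1.0, 32.0)
print(f"Compute-optimal precision with these illustrative constants: {p_star:.2f} bits")
```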
G Why do language models get more sensitive with overtraining?
This section is speculative.
Sharpness. A canonical line of work in optimization demonstrates that model sharpness
increases during learning until it hovers at a maximal value (the “edge of stability”) [Cohen et al.,
2021, Gilmer et al., 2021], so that movement along the top Hessian eigenvector degrades loss by
more throughout training. Though sharpness is formally a worst-case sensitivity, we conjecture
similar results hold for average case, such as loss degradation induced by isotropic noise. It may
be possible that sharpness during language model pretraining does not reach its maximal value for
a long time, which is why sensitivity to noise monotonically seems to increase as D/N → ∞ on
realistic data budgets. Closely related is the largest eigenvalue of the neural tangent kernel (NTK)
which captures the magnitude of the variance of the predictor under parameter noise. This quantity
is known to empirically increase during training in a variety of settings, and is closely related to
generalization guarantees [Nguyen et al., 2021, Atanasov et al., 2022].
Hierarchical learning strategies become more sensitive throughout training. Our
expectation that overtrained language models may degrade more when quantized at inference-time
is motivated in part by the following results. The hierarchical nature of learning is by now well
understood in some toy settings: in [Abbe et al., 2021], it is shown that “staircase” polynomials
of increasing degree are learned faster than high-degree monomials since neural networks combine
existing features to learn new ones. In [Abbe et al., 2022] this result was strengthened to show that
such hierarchical structure is both necessary and sufficient to learn sparse functions with SGD in
two layer neural networks. In this setting, damage to features encoding lower-order polynomials
affects all higher-order ones, so that such networks are increasingly sensitive to fixed feature noise
throughout learning. Another result of a similar flavor is that of [Barak et al., 2022], who show that high-precision gradients are required to learn sparse parity, since it is learned by the amplification of a small initial signal. If language models learn hierarchically, it is possible that the features learned late into overtraining as D/N → ∞ rely on base features, so that noise to the base features significantly damages the higher-order features built on them.
I Numerical Fits
Following [Muennighoff et al., 2024b], we tie α = β so they do not become very different, though this is not required. Distinct α, β only add expressivity to the model, and we have verified that the plots look similar without tying. We also only use the full scaling law when specified in the text, since the law is developed piecewise through the text. For instance, Figures 3 and 4 solely fit Chinchilla with a substitution N ↦ N_eff(P_w), because at that point P_a, P_kv have not been introduced. Figures 5, 6, and 7 use our full scaling law, for instance to make predictions. We emphasize that our numerical constants are unlikely to be useful in other setups because, as [Hoffmann et al., 2022, Sardana and Frankle, 2023] show, fitted constants depend heavily on the architecture and dataset used, which differ from setup to setup. Rather, the trends we identify are the key findings. With that said, our fitted constants are as follows.
Constant    Value
A           4.299e3
α           0.4965
B           1.806e4
E           2.7648
γ_w         2.6745
n_w         0.3037
γ_i         2.2102
n_i         1.4072
γ_kv        0.9578
n_kv        2.4185
C_T         0.0598
δ_D         0.5068
δ_N         0.3439
γ           0.5907
b           1.1277
Note that we include biases in our exponent fits. For instance, when modelling N_eff as a saturating exponential, we find that the different parts of a model cause numerical instability at different precisions, so even if they follow the same functional form, they may be translated (left/right-shifted) versions of each other. A fit written as $e^{x/\gamma_x}$ in the main text is therefore really computed with an offset, $e^{x/\gamma_x + n}$, but including biases everywhere clutters notation and obscures mathematical insight.
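For concreteness, the sketch below evaluates the weights-only "Chinchilla with N ↦ N_eff(P_w)" form using the fitted constants above, assuming the saturating-exponential form N_eff = N(1 − e^(−P_w/γ_w)) with the bias/offset terms omitted and α = β tied as described. It is a minimal illustration, not a reproduction of our fitting code.

```python
# Sketch: evaluating a weights-only "Chinchilla with N -> N_eff(P_w)" loss
# using the fitted constants from the table above. The N_eff form below is a
# simplified saturating exponential with offsets omitted; illustrative only.
import numpy as np

A, B, E = 4.299e3, 1.806e4, 2.7648
alpha = beta = 0.4965   # tied, following the text
gamma_w = 2.6745

def n_eff(N, P_w):
    """Effective parameter count under weight-training precision P_w (bits)."""
    return N * (1.0 - np.exp(-P_w / gamma_w))

def loss(N, D, P_w):
    """Predicted validation loss for N params, D tokens, weights trained in P_w bits."""
    return E + A * n_eff(N, P_w) ** (-alpha) + B * D ** (-beta)

# Example: a 220M-parameter model on 3.3B tokens, weights in 16-bit vs. 4-bit.
for P_w in (16, 4):
    print(f"P_w = {P_w:2d}: predicted loss = {loss(220e6, 3.3e9, P_w):.3f}")
```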
[Figure: loss vs. training precision (bits) for weights, activations, and KV cache; top row at 220M params / 3.3B tokens, bottom row at 110M params / 26.2B tokens.]
Figure 11: Sweeping L(P ) for the three model parts at various N, D.
K Empirical Neff
Consider a model trained with some arbitrary (N, D, P_w). Assuming a Chinchilla functional form with N ↦ N_eff(P_w), we can write the difference between its loss and the loss of a full-precision model as
\[
L(N, D, P_w) - L(N, D, \infty) = A\left[N_{\mathrm{eff}}^{-\alpha} - N^{-\alpha}\right]
\]
as the terms involving B, D, E cancel. Note that N_eff(P_w = ∞) = N by construction. In practice, we use a BF16 model as the “infinite-precision” model, finding no real difference if we use an FP32 model or even a functional fit estimating P_w → ∞ based on our integer quantization loss results. Our goal is to plot what f(P) looks like, where N_eff = N · f(P). Therefore, we can rearrange the above equation as follows:
\[
f(P) := \frac{N_{\mathrm{eff}}}{N} = \frac{1}{N}\left[\frac{L(N, D, P_w) - L(N, D, P_w = \infty)}{A} + N^{-\alpha}\right]^{-1/\alpha} \tag{29}
\]
Then plotting this quantity using our fitted numerical values (see Appendix I) gives us the empirical tradeoff between precision and parameters. We can see that the tradeoff saturates quickly in P to a value near 1. While the functional form is the same for the three model parts, the fitted constants are different. For instance, runs with P_a ≤ 3 or P_kv ≤ 3 often diverged, whereas this was not the case for weight precision. Further, we can see that the KV cache is not sensitive to quantization at higher bit values, but becomes sensitive very quickly around 4-5 bits of precision.
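As a worked example of Equation (29), the sketch below recovers f(P) from a pair of measured losses using the fitted A and α from Appendix I; the loss values and model size used in the example are hypothetical placeholders.

```python
# Sketch of the rearrangement in Equation (29): recovering the empirical
# effective-parameter fraction f(P) = N_eff / N from measured losses.
# A and alpha come from the fitted constants in Appendix I; the example
# loss values and model size below are placeholders for illustration.
import numpy as np

A, alpha = 4.299e3, 0.4965

def f_empirical(loss_low_precision, loss_full_precision, N):
    """f(P) = (1/N) * [ (L(N,D,P_w) - L(N,D,inf)) / A + N**(-alpha) ]**(-1/alpha)."""
    delta = loss_low_precision - loss_full_precision
    return (delta / A + N ** (-alpha)) ** (-1.0 / alpha) / N

# Example: a hypothetical 220M-parameter run where low-precision weight
# training costs 0.08 nats of loss relative to the full-precision baseline.
print(f_empirical(loss_low_precision=3.58, loss_full_precision=3.50, N=220e6))
```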
As far as the joint functional form for N_eff(P_w, P_a, P_kv) is concerned, we acknowledge that alternative factorizations that do not decompose the model into weights, activations, and KV cache may have an equally good fit. For instance, decomposing the weights term into a product of layer-wise effects has a reasonable fit, though it introduces more parameters, and a more coarse-grained version might not decompose the model into parts at all, considering only tied precisions. We choose this factorized form because QAT considers weights only, and activations and attention are the two other components that must then be kept in low precision to see compute gains. Since practitioners often care about the KV cache on its own, we chose to decompose “activations and attention” as “activations and KV cache.” We emphasize that our main point is not that this factorization is objectively correct, but that such a factorization assuming approximate independence is possible in the first place.

[Figure: empirical f(P) = N_eff(P)/N vs. training precision (bits) for weights (P_w), activations (P_a), and KV cache (P_kv).]
Figure 12: Plotting what N_eff looks like empirically. Each black point is a pretraining run; mathematical details of what is plotted are in Appendix E. Blue lines are parametric fits of a saturating exponential.

[Figure: validation loss learning curves vs. tokens (billions), comparing training-time effects (P_train) and post-training effects (P_post).]
Figure 13: Illustration of what finite-precision effects during training and inference look like on learning curves.
L Additional Plots
[Figure: empirical vs. predicted inference-time degradation, shown as heatmaps over P_w (training precision, bits) and P_inf (post-train quantization precision, bits), for (N, D) = (30M, 1.6B), (60M, 6.6B), (110M, 6.6B), and (220M, 6.6B).]
[Figure: predicted vs. actual loss for marginal precision sweeps; P_i sweep (MSE 0.0055, R² 0.9410) and P_kv sweep (MSE 0.0003, R² 0.9965).]
Figure 15: Marginal sweeps for precision of activations and KV cache, along with predictions from
an Neff functional form analogous to Equation 3 fitted from scratch.
[Figure: (a, b) predicted vs. actual post-train degradation on log-log axes; (c) degradation vs. training precision; (d, e) empirical and predicted degradation as heatmaps over P_w (training precision, bits) and P_post (post-training precision, bits).]
Figure 16: Combined plots for predicting degradation. (a) and (b) illustrate different fitting approaches to modeling degradation, demonstrating a stronger fit when N ↦ N_eff is used. (c), (d), and (e) illustrate that our unified degradation form can predict degradation when training and serving in any precision. Plots (c-e) are made for varied P_w, but the fits in (a) and (b) include runs where P_a, P_kv are also jointly varied.