
Scaling Laws for Precision

Tanishq Kumar*¹  Zachary Ankner*³,⁴  Benjamin F. Spector²  Blake Bordelon¹  Niklas Muennighoff²
Mansheej Paul⁴  Cengiz Pehlevan¹  Christopher Ré²  Aditi Raghunathan⁵

¹Harvard University   ²Stanford University   ³MIT   ⁴Databricks   ⁵Carnegie Mellon University

arXiv:2411.04330v1 [cs.LG] 7 Nov 2024

Abstract
Low precision training and inference affect both the quality and cost of language models,
but current scaling laws do not account for this. In this work, we devise “precision-aware” scal-
ing laws for both training and inference. We propose that training in lower precision reduces
the model’s effective parameter count, allowing us to predict the additional loss incurred from
training in low precision and post-train quantization. For inference, we find that the degra-
dation introduced by post-training quantization increases as models are trained on more data,
eventually making additional pretraining data actively harmful. For training, our scaling laws
allow us to predict the loss of a model with different parts in different precisions, and suggest
that training larger models in lower precision may be compute optimal. We unify the scaling
laws for post and pretraining quantization to arrive at a single functional form that predicts
degradation from training and inference in varied precisions. We fit on over 465 pretraining runs
and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens.

1 Introduction
Scale has emerged as a central driver of progress in deep learning [Brown, 2020]. Key work on
scaling [Kaplan et al., 2020, Hoffmann et al., 2022] studied tradeoffs between model/dataset size
to balance performance and compute. However, the precision in which models are trained and
served is an important third factor that contributes to both cost and performance. Deep learning
is trending towards lower precision: current frontier models like the Llama-3 series are trained in
BF16 [Dubey et al., 2024], and there is widespread effort to move the pretraining paradigm to FP8
[Micikevicius et al., 2022]. The next generation of hardware will support FP4, and advances in
weight-only quantization have led to training in binary and ternary at scale [Ma et al., 2024, Wang
et al., 2023]. How far will these paradigms go? Specifically, we ask:

What are the tradeoffs between precision, parameters, and data?


How do they compare for pretraining and inference?

Studying scaling in precision is challenging because work on scaling laws generally aims to drop fine-grained implementation details in pursuit of universal functional forms, while work on quantization generally does the opposite, focusing on the details: how quantization is done, with what type, and to what part of the model. In seeking a balance, we consider a variety of plausible functional forms, and choose one that abstracts implementation details of quantization away from loss scaling,

*Equal contribution. Correspondence to [email protected]

[Figure 1: two panels. Left, "Scaling: Post-Train Quantization": validation loss after post-train quantization vs. token/parameter ratio for INT3–INT6 and a no-PTQ baseline, annotated "more pretraining compute, worse at inference time." Right, "Scaling: Quantized Training": final validation loss vs. training precision (model size), FP4 (1.76B) through FP32 (220M), annotated "training larger models in lower precision can be compute optimal."]

Figure 1: Schematic of key findings. (Left) Training a fixed model size to various data budgets in
BF16 and quantizing weights at the end. We find that degradation due to post-train quantization
increases with tokens seen during pretraining, so that eventually additional pretraining data
can be harmful. (Right) Our scaling suggests training larger models in lower precision can
be compute-optimal according to the cost model in Section 4.3. Weights, activations, attention
quantized, all models trained on the same data budget, details in Appendix H.

allowing us to predict loss scaling in many situations of practical interest. This functional form posits that bit precision and parameter count interchangeably contribute to a model's "effective parameter count," Neff, and that implementation details, like which parts of a model are quantized to what precision, interact with loss scaling only through their effect on this quantity.
Overall, we study the scaling of the effects of precision on loss as we vary data and parameters,
both during and after training. We first study how the degradation induced by post-train quantiza-
tion scales with parameters and data. We find that the degradation increases with data, so that for
a fixed model, training on additional data after a certain point can be actively harmful if the model
will be quantized after training. We then shift our focus to quantized training, examining both
the quantization-aware-training (weights only) and low-precision training (weights, activations, at-
tention all quantized) settings. Our scaling laws for pretraining suggest that the compute-optimal
pretraining precision is in general independent of compute budget. Surprisingly, however, this inde-
pendence ceases to be true if model size is constrained, in which case the compute-optimal precision
grows slowly in compute.
In all, we pretrain a suite of 465 language models in 3 to 16 bit precisions, as well as post-train
quantize each to multiple precisions. For a language model with N parameters, trained on D tokens
with training precision Ptrain , and post-train weight precision Ppost , we ultimately find a unified
scaling law that takes the following form:
L(N, D, Ptrain, Ppost) = A·Neff^−α + B·D^−β + E + δPTQ(Neff, D, Ptrain, Ppost)    (1)

(the first three terms take the usual Chinchilla form and capture training-time effects through Neff; δPTQ captures post-training effects)

where A, B, E, α, β are positive fitted constants, and δPTQ refers to the loss degradation induced
by post-training quantization before inference. Altogether, our results for post-train quantization
illustrate how more pretraining FLOPs do not always lead to better models at inference-
time, and our results for low-precision pretraining suggest that both the standard practice
of training models in 16-bit, and the race to extremely low (sub 4-bit) pretraining
precision, may be suboptimal.

2 Background, Related Work, and Setup
Notation. Throughout, D denotes dataset size in tokens and N denotes model size in parameters.
Pw , Pa , Pkv refer to the bit precision, in integer-type, of the weights, activations, and key-value
cache (“attention”)1 during training, and Ppost refers to the precision we post-train quantize (PTQ)
weights to at the end for model inference. When P or Ptrain is used without reference to a part of
the model, all three model parts are tied to the same precision. The inference-time loss degradation
induced by post-train quantization will be denoted δPTQ (N, D, Ptrain , Ppost ), and it is defined as
the change in loss from performing post-training quantization compared to the end of pretraining.
We use “high precision” to mean 16-bit or above.

2.1 Quantization Fundamentals: How, What, When


The Problem: Compute vs Memory-Bound Workloads. Most deep learning workloads are
bottlenecked by either compute, in the form of matrix multiplications, or memory bandwidth, in
the form of data movement between different parts of the GPU. Different types of workloads have
different bottlenecks: most time is spent doing large matrix multiplications during pretraining,
so it is compute-bound; in contrast, small-batch inference is bandwidth-bound by model weights;
long-sequence decoding is bandwidth-bound by KV cache, etc. This motivates studying scaling
in the training precision of the (weights, activations, KV cache) both in isolation and
in combination.
Quantization: How. Quantization of an operation typically refers to rounding of values
in matrices involved in some computation on the forward/backward pass, with accumulation of
gradients in high/full precision. Quantization is usually done to integer or floating-point type.
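To make the rounding step concrete, the sketch below shows a minimal symmetric round-to-nearest integer quantizer applied to a weight matrix, with the quantized values used only in the forward computation while a high-precision master copy would be kept for gradient accumulation. This is an illustrative, assumption-laden sketch, not the paper's implementation: the per-tensor scale and clipping rule are simplifications.

```python
import numpy as np

def fake_quantize_int(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest 'fake quantization' to a signed integer grid.

    The tensor is scaled so its largest magnitude maps to the largest
    representable integer, rounded, then rescaled back to floating point.
    In training, a high-precision master copy of x would be kept so that
    gradients can be accumulated in full precision.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for a 4-bit signed grid
    scale = np.max(np.abs(x)) / qmax + 1e-12
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                        # dequantized values used in the matmul

# Example: quantize a toy weight matrix to 4 bits and measure the rounding error.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
W_q = fake_quantize_int(W, bits=4)
print("mean abs rounding error:", float(np.abs(W - W_q).mean()))
```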
Quantization: What. Only weights ("quantization-aware training"). Quantizing only weights during training does not offer any compute savings because matrix multiplications are still done in high precision. However, this is commonly done to allow weights to adapt to low precision so they can be served at very low precision at inference-time, thereby alleviating memory bottlenecks [Ma et al., 2024, Wang et al., 2023]. We will refer to this as "quantization-aware-training."
Weights, activations, attention ("low-precision training"). Quantizing activations and attention in addition to weights allows for compute gains because matrix multiplications can be done in low precision (if the hardware supports it) since everything is in the same precision. FP8 training on the Hopper line of GPUs is an example [Micikevicius et al., 2022]. We will refer to this setting as "low-precision training" to distinguish it from quantization-aware training.
Quantization: When. Quantization can be done during or after training. In practice,
when seeking to reduce inference-time memory costs, one first attempts post-train quantization.
If that degrades the model too much, quantization-aware-training is used. Post-train quantization
is typically only applied to model weights [Frantar et al., 2022, Dettmers et al., 2022, Lin et al.,
2023, Xiao et al., 2023]. To reduce pretraining costs, low-precision-training is needed. We will
study scaling laws for post-training quantization in Section 3, for quantized training in Section
4 (examining both quantization-aware training and low precision training) and unify the two in
Section 5. The numerical values of all our fitted constants can be found in Appendix I.
¹We study KV, rather than QKV, because understanding scaling in the KV cache alone is important for many inference settings. For pretraining claims in Section 4.3, we quantize the entire attention computation, including queries, finding that additionally quantizing the query vectors makes a negligible difference to scaling.

2.2 Scaling Laws and Parametric Fits
Scaling Laws. Hoffmann et al. [2022] model loss scaling using the functional form L(N, D) = A·N^−α + B·D^−β + E where A, B, α, β, E are positive fitted constants, finding that data and parameters should be scaled in roughly equal proportion as more compute becomes available. We will
refer to the scaling of [Hoffmann et al., 2022] as “Chinchilla-optimal” or just “Chinchilla” and note
this is often used colloquially as D/N ≈ 20 being pretraining compute-optimal. On the theoretical
front, work on scaling laws [Bahri et al., 2024, Bordelon et al., 2024, Lin et al., 2024a] finds that
noise to various parts of model or data affects loss in a predictable way. While previous works have
explored the scaling behavior of post-training quantization in terms of total model bits [Dettmers
and Zettlemoyer, 2023] and knowledge capacity [Allen-Zhu and Li, 2024], we focus instead on data
scaling. We note that in general the exact fitted values of all coefficients and exponents can vary
drastically based on small implementation differences: Besiroglu et al. [2024] find different constants
when attempting to replicate [Hoffmann et al., 2022], Sardana and Frankle [2023] fit coefficients
A, B of different orders of magnitude. For this reason, we emphasize our contribution is not the
numerical values we fit, but the trends and functional forms we identify.
Overtraining. In practice, accounting for inference costs means training smaller models for
substantially longer than Chinchilla-optimal [Sardana and Frankle, 2023, Gadre et al., 2024]. For
instance, Llama-3-8B is trained to D/N ≈ 2000 [Dubey et al., 2024] and the Gemma-2 series up to
D/N > 1000 [Team et al., 2024]. We refer to such models as “overtrained” in this paper, with the
token/parameter ratio D/N being a key quantity throughout. Work on inference-time compute
[Snell et al., 2024, Brown et al., 2024] and on synthetic and multimodal data [Yang et al., 2024, Fan
et al., 2024, Bauer et al., 2024] suggests future models may be even more overtrained. Therefore,
modern work on scale must consider ratios much larger than Chinchilla-optimal, and in this work
we perform experiments up to D/N ≈ 10³ and analyze the predictions found by our scaling law for up to D/N ≈ 10⁵. See Appendix B for additional related work.

2.3 Setup
We train and evaluate a suite of OLMo-style models on the Dolma V1.7 dataset [Groeneveld et al.,
2024, Soldaini et al., 2024], using a standard Transformer++ implementation; see Appendix A for
hyperparameters and ablations. Our experiments consist of a sweep of language model pretraining
runs over N ∈ [30, 60, 110, 220] million parameters (non-embedding) and D ∈ [1.5, 3, 6, 13, 26] billion
tokens. Our model sizes are relatively small because we train up to a very high D/N ≈ 10³ to
study data scaling and set off over 20 runs at every (N, D): we sweep 8 values of precision for each
of the (weights, activations, attention).

3 Scaling Laws for Post-Train Quantization


The easiest and most common quantization technique is post-train quantizing a model off-the-shelf
[Chee et al., 2024, Huang et al., 2024, Dettmers et al., 2022, Lin et al., 2023, Xiao et al., 2023]. In
this section, we consider models trained in BF16 and use GPTQ [Frantar et al., 2022] to post-train
quantize them, replicating our findings with two other methods in Appendix F. We quantify the
resulting loss degradation δPTQ , finding that post-train quantization scales poorly in data.

3.1 Overtrained Models Degrade more when Post-Train Quantized

[Figure 2: columns for N = 30M, 60M, 110M, 220M vs. token/parameter ratio. Top row: validation loss after post-train quantization for INT3–INT6 and a no-PTQ baseline. Bottom row: PTQ degradation δPTQ on a log scale.]
Figure 2: Loss degradation from PTQ increases with data. Top row is loss after PTQ; bottom row is loss degradation compared to end of training, before PTQ. The top row is thus the gray line in each plot plus the corresponding value in the bottom row. We can see that degradation grows with data; the bottom row is fitted with Equation 2. For D/N sufficiently large (left), loss can increase in data. Even at lower D/N, where post-quant loss continues to decrease with data, the value of data is reduced compared to the baseline. R² = 0.97 over all fitted points (bottom row).

We consider different model sizes (columns) trained on various data budgets (x-axis of each
plot) and plot in Figure 2 both the loss after post-train quantization (top row) and the degradation
incurred relative to end of training (bottom row). We find that the degradation δPTQ increases
in training data size across all model sizes, but that for a fixed dataset size larger models incur a
smaller degradation. We additionally observe that δPTQ increases exponentially as we decrease the
precision we quantize to. Based on these observations we model δPTQ as taking the form:
δPTQ(N, D, Ppost) = CT (D^γD / N^γN) e^(−Ppost/γpost)    (2)

where CT , γD , γN , γpost are positive fitted constants. As we find the fitted values of γD and γN to be
similar (see Appendix I for numerical values), we can think of this as an approximate power law in
the token/parameter ratio D/N . The intuition for this poor data scaling might be that as models
train on more data, they compress more information into their weights, so that perturbations to
weights in the form of quantization are more harmful to loss, all else equal. We discuss formal
theoretical interpretations in Appendix G.
This finding implies that for models that will be post-train quantized, there exists an amount of
pretraining data beyond which additional data is actively harmful to performance at inference-time
(see top-left, Figure 2). This can be defined as the point where additional data increases post-train
degradation more than it decreases loss during pretraining. We solve analytically for this critical
data size in Appendix E. We thus summarize our first scaling finding as follows.

Finding 1. Overtrained language models are more sensitive to post-training quantization.
For models trained in BF16 or above, we can model this loss degradation as
δPTQ(N, D, Ppost) = CT (D^γD / N^γN) e^(−Ppost/γpost)

where CT , γD , γN , γpost are positive fitted constants. This implies that when D/N is suffi-
ciently large, or Ppost sufficiently small, loss after quantization can increase as models are
pretrained for longer, as in Figure 2. We will revisit and modify Equation 2 in Section 5 to
account for the effects of training in low-precision on δPTQ .
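As a worked example of Equation 2, the sketch below evaluates δPTQ for a fixed model size as the token/parameter ratio grows. The constants CT, γD, γN, γpost are made-up placeholders (the paper's fitted values are in its Appendix I), so only the qualitative trend is meaningful: degradation grows with D/N and shrinks exponentially in Ppost.

```python
import numpy as np

# Hypothetical placeholder constants; the paper's fitted values are in its Appendix I.
C_T, gamma_D, gamma_N, gamma_post = 0.05, 0.5, 0.5, 1.5

def delta_ptq(N, D, P_post):
    """Equation 2: degradation from post-train quantizing a BF16-trained model."""
    return C_T * (D ** gamma_D / N ** gamma_N) * np.exp(-P_post / gamma_post)

N = 220e6                                  # 220M parameters
for ratio in (20, 100, 1000):              # token/parameter ratios D/N
    D = ratio * N
    print(f"D/N={ratio:4d}  INT4: {delta_ptq(N, D, 4):.4f}  INT6: {delta_ptq(N, D, 6):.4f}")
```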

4 Scaling Laws for Quantized Training


In this section we study pretraining with weights, activations, and KV cache in various precisions.
Importantly, only training precision, not test-time precision, is varied in this section; we discuss the
interaction between train and test-time precision in Section 5. We sweep the training precisions
of the weights, activations, and KV cache Pw , Pa , Pkv ∈ [3, 12] individually, as well as training
BF16 baselines. We also pretrain models with arbitrary combinations of Pw , Pa , Pkv to validate
our scaling laws. To perform quantization during training, we follow the standard specification of
Micikevicius et al. [2022] unless otherwise noted (see Appendix D for implementation details).

4.1 Quantization-Aware-Training: Quantizing Weights During Training has a Consistent and Predictable Effect
We first examine the trade-off between weight precision Pw and parameters N while holding Pa =
Pkv fixed at high precision. We fix D = 13B tokens and perform a grid sweep over combinations of
N and Pw . We plot the resulting IsoLoss contours where we linearly interpolate the final loss values
in Figure 3. We observe that the bit precision of the weights can be traded off for the number of
parameters, i.e., a model with smaller N but larger Pw can achieve the same loss as a model with
larger N but smaller Pw . Additionally, we find that the gains from increasing the bit precision of
the weights are large at lower precisions but saturate at higher precisions (typically around 6-7 bits
per weight).
[Figure 3: three panels. Left: Neff/N vs. precision (bits) for weights, activations, KV cache, and tied. Center: empirical IsoLoss contours over N (millions) and Pw (bits). Right: predicted loss contours over the same axes.]

Figure 3: (Left) Neff /N from our final scaling law. Our fit of Neff (N, Pw ) in this section is the
first step towards this (blue). Empirical (center) and predicted (right) IsoLoss contours illustrating
the precision-parameter tradeoff. Y-axis is weight precision during quantized training. All runs
plotted trained on D = 13B tokens. Predictions from a fitted version of Equation 3, darker lines
correspond to lower loss.

[Figure 4: final validation loss vs. Pw (training precision, bits) at 3.3B, 13.1B, and 26.2B tokens, for model sizes 30M, 60M, 110M, and 220M.]
Figure 4: Predicting final validation losses L(N, D, Pw ) for various N, D, Pw to test our proposed
functional form. Points are experimental values, lines are predictions of a single parametric fit of
the form in Equation 3. We train only two model sizes at 26B due to compute constraints.

In line with the empirical trends in Figure 3, we find the best fit for the tradeoff between
weight precision and parameters is Neff(N, Pw) = N(1 − e^(−Pw/γw)), where γw is a fitted constant measuring the sensitivity of model weights (alternative fits explored in Appendix I). We therefore modify Chinchilla scaling to account for Neff by making the substitution N ↦ Neff(N, Pw), giving
the modified form:

L(N, D) = A[N(1 − e^(−Pw/γw))]^−α + B·D^−β + E    (3)


where we recall that A, B, E, α, β are fitted positive constants in the usual Chinchilla scaling
form, and γw is a fitted constant we introduce. We plot the predictions of our fit compared to
observed values in Figure 4 for a range of (N, D).
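The sketch below evaluates Equation 3 with hypothetical, Chinchilla-like placeholder constants (not the paper's fits) to illustrate the precision-parameter tradeoff: at low Pw the effective parameter count N(1 − e^(−Pw/γw)) shrinks, so a larger N is needed to reach the same predicted loss.

```python
import numpy as np

# Hypothetical Chinchilla-like placeholder constants, not the paper's fits.
A, B, E, alpha, beta, gamma_w = 406.4, 410.7, 1.69, 0.34, 0.28, 2.5

def loss(N, D, P_w):
    """Equation 3: Chinchilla form with N replaced by N_eff = N * (1 - exp(-P_w / gamma_w))."""
    N_eff = N * (1.0 - np.exp(-P_w / gamma_w))
    return A * N_eff ** -alpha + B * D ** -beta + E

D = 13e9  # 13B tokens, as in the sweep behind Figure 3
for N in (60e6, 110e6, 220e6):
    by_precision = {P_w: round(float(loss(N, D, P_w)), 4) for P_w in (4, 6, 8, 16)}
    print(f"N = {int(N / 1e6)}M ->", by_precision)
```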

4.2 Low-Precision-Training: The Effects of Quantizing Weights, Activations, and Attention are Compositional and Multiplicative
Quantization-aware training does not change the cost of pretraining. This is because modern
GPUs require inputs to a matrix multiplication to have the same precision, i.e. Pw = Pa = Pkv
[Micikevicius et al., 2022]. To understand the interplay between precision and pretraining compute
we must now analyze the scaling behavior of Pa and Pkv as well.
Precision of activations and KV cache affects loss in a similar way. We first verify in Appendix Figure 15 that varying Pa and Pkv in isolation gives rise to scaling behavior that is best fit by a functional form analogous to the one for Pw (Equation 3; Figure 5, left).
We refer to the scaling coefficients computed by varying the precision of just one part of the
model at a time as marginally fitted constants, and those found by fitting on runs that include
multiple model components in low precision at the same time as jointly fitted constants.
Constants fitted marginally and jointly make similarly good predictions. We now
turn our attention to understanding the interactions between weights, activations, and attention.
If the effects of quantizing weights, activations, and attention are independent, then a factorized,
multiplicative interaction of the following form is a natural proposal.

Neff(P) = N(1 − e^(−Pw/γw))(1 − e^(−Pa/γa))(1 − e^(−Pkv/γkv))    (4)

We test whether this independence approximately holds by comparing the predictive power of a
model with marginally fitted constants and a model with jointly fitted constants. We show the
predictive power of both models in Figure 5(b, c), finding that both methods for fitting constants
have approximately the same predictive power. These results suggest that the independence assumption is reasonable. We present further evidence that this "factorized" functional form is a strong fit to the data, and discuss alternative factorization schemes, in Appendix K.

[Figure 5: predicted vs. actual loss for three fits. Left, Pw marginal sweep: MSE 0.0028, R² 0.9655. Center, joint fit f(Pw, Pa, Pkv): MSE 0.0086, R² 0.9006. Right, combined marginals f(Pw)f(Pa)f(Pkv): MSE 0.0089, R² 0.8973.]

Figure 5: (Left) Predicted loss based on fitted values with Equation 4. (center) Fitting γ parameters
jointly on sweeps with combinations of precisions vs (right) fitting them on “marginal” sweeps where
only one model part is in low precision at a time. Outliers are those at extremely low precision
whose training runs are sometimes unstable.

Finding 2. The effects of quantizing the weights, activations, and KV cache during training
are well modeled as independent and multiplicative so that
L(N, D, Pw, Pa, Pkv) = A·Neff^−α + B·D^−β + E

where

Neff(Pw, Pa, Pkv) = N(1 − e^(−Pw/γw))(1 − e^(−Pa/γa))(1 − e^(−Pkv/γkv))

for which we fit constants γw, γa, γkv that reflect the different sensitivities of weights, activations, and KV cache. If the three precisions are set to the same value P, as in pretraining, this simplifies to Neff(P) ≈ N(1 − e^(−P/γ̄))^3 where γ̄ is the average of the three parameters.
We visualize this functional form with our fitted values in Figure 3 (left).
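A small sketch of the factorized form in Finding 2, using hypothetical sensitivity constants rather than the paper's fits: it computes Neff for mixed precisions and checks that tying all three precisions to the same P approximately recovers N(1 − e^(−P/γ̄))^3 with γ̄ the average sensitivity.

```python
import numpy as np

# Hypothetical sensitivities for weights, activations, and KV cache (not the paper's fits).
gamma_w, gamma_a, gamma_kv = 2.5, 2.0, 1.5

def n_eff(N, P_w, P_a, P_kv):
    """Finding 2: multiplicative effective parameter count."""
    return (N * (1 - np.exp(-P_w / gamma_w))
              * (1 - np.exp(-P_a / gamma_a))
              * (1 - np.exp(-P_kv / gamma_kv)))

N = 220e6
print("mixed precisions (W4 / A8 / KV8):", round(n_eff(N, 4, 8, 8) / N, 4))

# Tied precision P: compare the exact product against N * (1 - exp(-P / gamma_bar))**3.
gamma_bar = (gamma_w + gamma_a + gamma_kv) / 3
for P in (4, 8, 16):
    exact = n_eff(N, P, P, P) / N
    approx = (1 - np.exp(-P / gamma_bar)) ** 3
    print(f"P={P:2d}  exact={exact:.4f}  approx={approx:.4f}")
```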

4.3 Implications For Pretraining


When training in a precision P, meaning Pw = Pa = Pkv = P, compute cost scales linearly in P [Abdelkhalik et al., 2022, Peng et al., 2023]². Hoffmann et al. [2022] performed all experiments in 16-bit precision and use a cost model of C = 6ND FLOPs. We generalize this to C = (6/16)·N·D·P to account for the linear relation between compute and precision, which reduces to the Chinchilla cost function for P = 16. We now examine three practically relevant variants of the following optimization problem.
optimization problem.
min_{N,D,P} L(N, D, P) = A[N(1 − e^(−P/γ̄))^3]^−α + B·D^−β + E   subject to   C = (6/16)·N·D·P    (5)

Since derivations are algebraically involved, we will work up to proportionality and verify proposed
solutions numerically. See Appendix E for mathematical details. We note that the implications
of our functional form are true no matter the scale at which future experiments are done, but
the numerical values we predict depend on our fitted constants which are fitted on smaller-scale,
integer-type experiments.
²In practice, the gains are less than linear due to systems overhead.

[Figure 6: three panels. Left: predicted validation loss for quantized training (INT) vs. training precision (model size), INT4 (1.76B) through INT32 (220M). Center: empirical final validation loss for quantized training (FP), FP4 (1.76B) through FP32 (220M). Right: compute-optimal precision P*(D) vs. dataset size D (trillion tokens) for various N.]

Figure 6: Scaling law predictions (left, fitted on integer type) vs empirical values (right, floating-
point type). Precision of weights, activations, attention fixed to Ptrain . Predictions closely match
the empirical trend, but are shifted up by a small amount since floating-point is a more expressive
type and will incur lower loss at the same precision. (Right) When N is held fixed, compute-
optimal precision increases approximately logarithmically with data. Markers correspond to pre-
dicted compute-optimal precision for Llama-3 (8b, 70b, 405b), denoted by (circle, triangle, star)
at each IsoFLOP (lines), illustrating how compute-optimal precision increases in data when model
size is held fixed.

4.3.1 If You Must Train in Low Precision, Increase Parameters Before Data
Minimizing L(N, D) with P fixed, subject to C ∝ N DP . We get with some algebra that at
precision P and compute budget C, the optimal allocations N ∗ , D∗ of parameters and data relative
to Chinchilla-optimal NCh , DCh will be given by

N*(P, C)/NCh(C) ∝ [1 − e^(−P/γ̄)]^(−3α/(α+β)) · P^(−β/(α+β))   and   D*(P, C)/DCh(C) ∝ [1 − e^(−P/γ̄)]^(3α/(α+β)) · P^(β/(α+β))    (6)

which suggests that as the precision of training decreases at fixed compute, we should increase parameters and decrease data. The interpretation of this is that at very low precisions, our effective parameter count vanishes, so that increasing parameter count is compute-optimal since data egregiously outstrips effective parameters.
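As a worked example of Equation 6, the sketch below evaluates the allocation ratios at a few training precisions using placeholder values of α, β, and γ̄ (not the paper's fits), normalized to the 16-bit allocation; with these placeholders, lowering the training precision shifts the compute-optimal allocation toward more parameters and less data.

```python
import numpy as np

# Hypothetical placeholder constants (not the paper's fits).
alpha, beta, gamma_bar = 0.34, 0.28, 3.7

def allocation_ratios(P):
    """Equation 6: optimal N and D relative to Chinchilla-optimal, up to constant factors."""
    bracket = 1 - np.exp(-P / gamma_bar)
    n_ratio = bracket ** (-3 * alpha / (alpha + beta)) * P ** (-beta / (alpha + beta))
    d_ratio = bracket ** (3 * alpha / (alpha + beta)) * P ** (beta / (alpha + beta))
    return n_ratio, d_ratio

# Normalize to the 16-bit allocation so the shift away from it is easy to read off.
n16, d16 = allocation_ratios(16)
for P in (16, 8, 6, 4):
    n, d = allocation_ratios(P)
    print(f"P={P:2d}  N* relative to 16-bit: {n / n16:.2f}x   D* relative to 16-bit: {d / d16:.2f}x")
```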

4.3.2 Compute-Optimal Pretraining Precision is in General Independent of Compute


Jointly minimizing L(N, D, P ) with C ∝ N DP . This is the setting of pretraining without
constraints on N, D, P except for a fixed compute budget. Solving this joint minimization problem
gives an implicit equation for P*(C). Denoting u(P) = [1 − e^(−P/γ̄)]^(−3α), we find (see Appendix E) that this equation takes the form

u(P)^((3α+1)/(3α)) · (3α/γ̄) · e^(−P/γ̄) = P^(−1) · u(P)    (7)
which reveals that in general the optimal pretraining precision is independent of compute budget.
This suggests that compute-optimal precision should be held fixed to P ∗ while N, D are scaled
according to Equation 6. We find this P ∗ ≈ 7 bits when fitting our scaling law on runs with
quantization done to integer type. This has two consequences: first, this means the de-facto
practice of training models in BF16 may be suboptimal. Second, the race to low-
precision training will likely have to stop before going below 4-bits, since this would force
model sizes to become disproportionately (more than 4x) larger to maintain loss scaling (see Figure
3, left).
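The sketch below checks this numerically rather than through the closed form: for several compute budgets it grid-searches the training precision, minimizing Equation 5 over N at each candidate P with D eliminated via the cost model. All constants are hypothetical placeholders, with γ̄ chosen so the optimum lands near 7 bits; the point of the example is that the optimal precision barely moves as the budget grows.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical placeholder constants (not the paper's fits); gamma_bar chosen so P* lands near 7 bits.
A, B, E, alpha, beta, gamma_bar = 406.4, 410.7, 1.69, 0.34, 0.28, 3.7

def best_loss_at_precision(P, C):
    """Minimum of Equation 5 over N at fixed P, with D eliminated via C = (6/16) * N * D * P."""
    def loss(logN):
        N = np.exp(logN)
        D = C / ((6 / 16) * N * P)
        N_eff = N * (1 - np.exp(-P / gamma_bar)) ** 3
        return A * N_eff ** -alpha + B * D ** -beta + E
    return minimize_scalar(loss, bounds=(np.log(1e6), np.log(1e14)), method="bounded").fun

precisions = np.arange(3.0, 16.01, 0.1)
for C in (1e19, 1e20, 1e21):  # three compute budgets
    losses = [best_loss_at_precision(P, C) for P in precisions]
    print(f"C = {C:.0e} FLOPs  ->  compute-optimal P* ~ {precisions[int(np.argmin(losses))]:.1f} bits")
```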

We test our predictions in Figure 6 at a larger scale. We train compute-matched models at various parameter counts and precisions, ranging from FP4 to FP32 and from 220M to 1.6B parameters. We train in floating-point type since that is standard in pretraining [Groeneveld et al., 2024, Deitke et al., 2024], though our scaling laws are fitted on integer type. We plot our predicted trend in Figure 6 (left) and the empirical values in the middle. We find that scaling fits on integer type are a strong fit until 4-bit precision, at which point the difference between the two types becomes more apparent. Integer fits assume all bits contribute in the same way, but the split of floating-point bits into exponent and mantissa means that each likely has its own scaling behavior. The matching of qualitative trends throughout, with the optimum being close to the predicted optimum of P* ≈ 7 bits, suggests that similar scaling laws may exist across types.

4.3.3 But Compute-Optimal Pretraining Precision Can Increase in Compute if Model Size N is Constrained
Minimizing L(D, P ) with N fixed, subject to C ∝ N DP . A common use case in practice
is to train a suite of models of various sizes on similar data. The Llama-3 and Gemma-2 series
[Dubey et al., 2024, Team et al., 2024] are examples. In this setting, N is fixed in advance and
only D, P are jointly optimized. Surprisingly, our scaling laws predict that models of differing sizes
should not necessarily be trained in the same precision, and that compute-optimal precision scales
as P ∗ (C) ∝ log C. Since N is held constant and we show in Appendix E that log C ≈ log D in
proportion, we can write P ∗ (C) ∝ log(D/N ). The intuition for this is that, for a fixed N , precision
acts as a new lever to bring highly overtrained models closer to pretraining optimality3 by reducing
D/Neff .
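A quick numerical illustration of the fixed-N case, using hypothetical placeholder constants: holding N fixed and eliminating D through the cost model, a bounded scalar minimization over the training precision shows the compute-optimal P growing roughly logarithmically as the compute budget (and hence D/N) grows.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical placeholder constants (not the paper's fits).
A, B, E, alpha, beta, gamma_bar = 406.4, 410.7, 1.69, 0.34, 0.28, 3.7
N = 1e9  # model size fixed in advance, as when training a model family on similar data

def loss_fixed_N(P, C):
    """Loss at fixed N, with D set by the cost model C = (6/16) * N * D * P."""
    D = C / ((6 / 16) * N * P)
    N_eff = N * (1 - np.exp(-P / gamma_bar)) ** 3
    return A * N_eff ** -alpha + B * D ** -beta + E

for C in (1e19, 1e20, 1e21, 1e22):  # growing budgets mean growing D/N at fixed N
    res = minimize_scalar(lambda P: loss_fixed_N(P, C), bounds=(3, 16), method="bounded")
    print(f"C = {C:.0e} FLOPs  ->  compute-optimal P* ~ {res.x:.1f} bits")
```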

Finding 3. When N, D, P are optimized jointly, compute-optimal pretraining precision is independent of compute. 16-bit has many unnecessary bits, and 4-bit requires increasing
the model size disproportionately to maintain loss scaling. Our fits imply that 7-8 bits are
compute-optimal. In contrast, when N is fixed in advance, such as when training a model
family on similar data, P ∗ (C) ∝ log C. This suggests that for models that will be significantly
overtrained, higher precision during training may be compute-optimal.

5 A Unified Scaling Law for Precision


In this section, we combine the two scaling laws presented into a unified functional form that
predicts both training/post-training effects, including interactions between the two. We now treat
δPTQ as a function δPTQ (N, D, Ptrain , Ppost ) rather than just δPTQ (N, D, Ppost ) as we did earlier
in Section 3. We find two competing effects at play when predicting δPTQ , but overall, models
trained in lower precision are more robust to post-train quantization in the sense of
incurring lower degradation.
Two competing effects at play during post-train quantization. Intuitively, training any
of Pw , Pa , Pkv in low precision forces the model to learn weights that are robust to “quantization
noise," so they degrade less under PTQ. However, the reduced N ↦ Neff effective parameter count of a model trained in low precision suggests that models trained in low precision will degrade
³An important subtlety here is that since models are overtrained for inference, we want to keep the cost of a forward pass, which is proportional to N·P, fixed, not just N. While N·P is the same for both a model of N0 parameters in 16-bit and one with 2N0 parameters in 8-bit, the latter has higher Neff with our γ̄, so will reach a lower pretraining loss on the same data with the same training/inference costs.

[Figure 7: three panels. Left: predicted vs. actual δPTQ on logarithmic axes (MSE 5.06e-02, R² 0.9041). Center: empirical δPTQ as a heatmap over Pw (training precision, bits) and Ppost (post-training precision, bits). Right: predicted δPTQ over the same axes.]

Figure 7: Combined plots for predicting degradation. (Left) demonstrates the quality of our fit on
all our runs, including all combinations of pre and post-training precisions. (Center, right) illustrate
visually that our unified degradation form can predict degradation when training and serving in
any precision. Plots (center, right) vary Pw only, but fits in (left) include runs where Pa , Pkv are
also jointly varied.

more, because δPTQ scales with N^−γN, as we found in Section 3. We call this second effect the "overtraining" effect. In practice, the first "robustification" effect wins out, so that models trained in lower precision overall degrade less when post-train quantized compared to models trained in high precision. We confirm that using Neff rather than N to predict degradation at various training precisions leads to a substantially stronger fit (Figure 16, top left and top center), verifying that the competing overtraining effect is real.
Modifying δPTQ to account for training precision. We assume training precision is no lower than inference precision, and define degradation as identically zero if they are equal. We
begin by studying how degradation scales with just weight-precision during training, Pw .
Consider Figure 7(center). We fix (N, D) and each cell of the heatmap represents the empirical
degradation δPTQ (Pw , Ppost ). We observe that degradation very quickly increases to its exponen-
tially large value from Section 3 if there is any gap between training and inference-time precision.
This motivates modifying our initial functional form fitted in Section 3 to
δPTQ(N, D, Pw, Ppost) = CT e^(−Ppost/γpost) (D^γD / Neff^γN) [1 − e^(−Cw(Pw−Ppost))]    (8)

(the D^γD/Neff^γN factor is the overtraining effect; the bracketed factor is the robustification effect)

where Cw is the only new fitted value. Then, we can extend this to include the precision effects of
activations/attention in the natural way:

δPTQ(N, D, Pw, Pa, Pkv, Ppost) = CT e^(−Ppost/γpost) (D^γD / Neff^γN) ∏_{x∈{w,a,kv}} [1 − e^(−Cx(Px−Ppost))]    (9)

We measure the fit to the data of such a functional form in Figure 7, and find a strong fit with
R² = 0.90 on over 1000 data points (each of 465 pretraining runs post-train quantized to multiple
precisions).
An interpretable, unified functional form. Now we simplify and interpret the resulting
functional form. Consider training with only weights in low precision and take Cw = 1 for illustrative purposes so we can simplify Equation 9. Denote σtr^2 := e^(−Pw/γw) as "training noise" reflecting the decrease in effective parameter count due to training weights in lower precision. Then, Equation
9 simplifies to
δPTQ(N, D, Ptrain, Ppost) = CT (σPTQ^2 − σtr^2) · (D^γD / Neff^γN)    (10)

where the first factor is the robustification effect and the second the overtraining effect. We note this is the intuitive modification one might make to the form of the initial post-training quantization degradation fitted in Section 3 (Finding 1), with a small competing-effects factor from Neff pushing in the opposite direction. It cleanly reflects the intuition that models are robustified to PTQ noise to the extent they were trained with similar noise.

Finding 4 (Unified Scaling Laws). Modeling low-precision effects during pretraining as independent and multiplicative noise that accumulates, and including post-training quantization degradation, the predicted loss for a language model with N parameters, trained on D tokens with training precisions Pw, Pa, Pkv and post-train quantized to end-time weight-precision Ppost, can be predicted as

L(N, D, Pw, Pa, Pkv, Ppost) = A·Neff^−α + B·D^−β + E + δPTQ    (11)

where δPTQ(N, D, Pw, Pa, Pkv, Ppost) is in general as in Equation 9 and Neff(N, Pw, Pa, Pkv) is as in Finding 2 (Section 4.2).
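Putting the pieces together, the sketch below implements the unified form of Finding 4 with hypothetical placeholder constants throughout: Neff as in Finding 2, δPTQ as in Equation 9 (with degradation taken to be zero when Ppost is not below the training precision, per the assumption in Section 5), and the total loss as in Equation 11.

```python
import numpy as np

# Hypothetical placeholder constants throughout (the paper's fitted values are in its Appendix I).
A, B, E, alpha, beta = 406.4, 410.7, 1.69, 0.34, 0.28
gammas = {"w": 2.5, "a": 2.0, "kv": 1.5}          # training sensitivities (Finding 2)
C_T, gamma_D, gamma_N, gamma_post = 0.05, 0.5, 0.5, 1.5
C_x = {"w": 1.0, "a": 1.0, "kv": 1.0}             # robustification rates (Equation 9)

def n_eff(N, P):
    """Finding 2: N_eff = N * prod_x (1 - exp(-P_x / gamma_x))."""
    return N * np.prod([1 - np.exp(-P[x] / gammas[x]) for x in P])

def delta_ptq(N, D, P, P_post):
    """Equation 9: degradation from post-train quantizing weights to P_post bits."""
    Ne = n_eff(N, P)
    robust = np.prod([1 - np.exp(-C_x[x] * max(P[x] - P_post, 0.0)) for x in P])
    return C_T * (D ** gamma_D / Ne ** gamma_N) * np.exp(-P_post / gamma_post) * robust

def predicted_loss(N, D, P, P_post):
    """Equation 11: unified training-time and post-training loss prediction."""
    return A * n_eff(N, P) ** -alpha + B * D ** -beta + E + delta_ptq(N, D, P, P_post)

N, D = 220e6, 26e9
for P_train in (16, 8, 6):
    P = {"w": P_train, "a": P_train, "kv": P_train}
    print(f"train at {P_train:2d}-bit, serve weights at 4-bit:",
          round(float(predicted_loss(N, D, P, 4)), 4))
```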

6 Conclusion and Limitations


We find that the common inference-time technique of post-train quantization can incur large degra-
dation at very high data budgets, demonstrating a striking example of how more pretraining com-
pute does not always imply stronger models at inference-time. Seeking better data scaling, we
study quantization-aware and low precision training. We find that parameters and bit precision are well modeled as interchangeably controlling an "effective parameter count" of the model, which allows us to predict finite-precision loss effects accurately during both training and inference. The resulting scaling law makes surprising predictions that we qualitatively validate on language models pretrained from scratch with up to 1.7B parameters.
There are limitations to our analysis. First, we use a fixed architecture throughout to examine
the effects of precision, parameters, and tokens in a controlled manner. In contrast, low precision
training often involves architectural tweaks [Ma et al., 2024, Zhu et al., 2024] that can close much
of the gap from a vanilla full precision model. Second, while compute costs do scale linearly
with precision, the gains from halving precision are usually less than 2x due to systems overhead.
Third, we only consider loss scaling without downstream model evaluations. We emphasize that
the trends we find aim to be suggestive rather than prescriptive, and hope future work can more
comprehensively examine these effects at larger model scale. In all, we find that the effects of
precision on loss are predictable and consistent, with important and surprising implications.

7 Acknowledgements
Tanishq Kumar thanks Tim Dettmers, Chris De Sa, Neil Band and Luke Bailey for helpful com-
ments and discussion. Blake Bordelon is supported by a Google PhD Fellowship. Cengiz Pehlevan
is supported by NSF grant DMS-2134157, NSF CAREER Award IIS-2239780, and a Sloan Re-
search Fellowship. This work has been made possible in part by a gift from the Chan Zuckerberg
Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial
Intelligence. Aditi Raghunathan acknowledges support from AI2050 program by Schmidt Sciences
(Grant G2264481), Google Research Scholar, Apple, NSF, Cisco.
We gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF un-
der Nos. CCF2247015 (Hardware-Aware), CCF1763315 (Beyond Sparsity), CCF1563078 (Volume
to Velocity), and 1937301 (RTML); US DEVCOM ARL under Nos. W911NF-23-2-0184 (Long-
context) and W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under Nos. N000142312633
(Deep Signal Processing); Stanford HAI under No. 247183; NXP, Xilinx, LETI-CEA, Intel, IBM,
Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog
Devices, Google Cloud, Salesforce, Total, the HAI-GCP Cloud Credits for Research program, the
Stanford Data Science Initiative (SDSI). Benjamin F. Spector is supported by a Hertz Fellowship.
The U.S. Government is authorized to reproduce and distribute reprints for Governmental pur-
poses notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions
or recommendations expressed in this material are those of the authors and do not necessarily
reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S.
Government.

References
Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language
models. arXiv preprint arXiv:2001.08361, 2020.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al.
Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha
Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.
arXiv preprint arXiv:2407.21783, 2024.

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisen-
thwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. Fp8 formats for
deep learning. arXiv preprint arXiv:2209.05433, 2022.

Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong,
Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit llms: All large language models are
in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024.

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang,
Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language
models. arXiv preprint arXiv:2310.11453, 2023.

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training
quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint, 2023.

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant:
Accurate and efficient post-training quantization for large language models. In International
Conference on Machine Learning, pages 38087–38099. PMLR, 2023.

Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural
scaling laws. Proceedings of the National Academy of Sciences, 121(27):e2311878121, 2024.

Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling
laws. arXiv preprint arXiv:2402.01092, 2024.

Licong Lin, Jingfeng Wu, Sham M Kakade, Peter L Bartlett, and Jason D Lee. Scaling laws in
linear regression: Compute, parameters, and data. arXiv preprint arXiv:2406.08466, 2024a.

Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: k-bit inference scaling laws. In
International Conference on Machine Learning, pages 7750–7774. PMLR, 2023.

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity
scaling laws. arXiv preprint arXiv:2404.05405, 2024.

Tamay Besiroglu, Ege Erdil, Matthew Barnett, and Josh You. Chinchilla scaling: A replication
attempt. arXiv preprint arXiv:2404.10102, 2024.

Nikhil Sardana and Jonathan Frankle. Beyond chinchilla-optimal: Accounting for inference in
language model scaling laws. arXiv preprint arXiv:2401.00448, 2023.

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Worts-
man, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, et al. Language models scale
reliably with over-training and on downstream tasks. arXiv preprint arXiv:2403.08540, 2024.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya
Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al.
Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118,
2024.

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally
can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and
Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.
arXiv preprint arXiv:2407.21787, 2024.

Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, and Tatsunori Hashimoto. Synthetic
continued pretraining. arXiv preprint arXiv:2409.07431, 2024.

Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yonglong Tian. Scal-
ing laws of synthetic images for model training... for now. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 7382–7392, 2024.

André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle
Chard, and Ian Foster. Comprehensive exploration of synthetic data generation: A survey. arXiv
preprint arXiv:2401.02524, 2024.

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord,
Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerat-
ing the science of language models. arXiv preprint arXiv:2402.00838, 2024.

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur,
Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An open corpus of
three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159,
2024.

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher M De Sa. Quip: 2-bit quantization
of large language models with guarantees. Advances in Neural Information Processing Systems,
36, 2024.

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno,
and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint
arXiv:2402.04291, 2024.

Hamdy Abdelkhalik, Yehia Arafa, Nandakishore Santhi, and Abdel-Hameed A Badawy. Demysti-
fying the nvidia ampere architecture through microbenchmarking and instruction-level analysis.
In 2022 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–8. IEEE,
2022.

Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue
Yang, Bolin Ni, Jingcheng Hu, et al. Fp8-lm: Training fp8 large language models. arXiv preprint
arXiv:2310.18313, 2023.

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Moham-
madreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open
weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146,
2024.

Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng
Zhou, and Jason K Eshraghian. Scalable matmul-free language modeling. arXiv preprint
arXiv:2406.02528, 2024.

Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.

Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick
Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large
neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.

Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, and Cengiz Pehlevan. Depthwise
hyperparameter transfer in residual networks: Dynamics and scaling limit. arXiv preprint
arXiv:2309.16620, 2023.

Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D Co-
Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, et al. Small-scale proxies for large-scale
transformer training instabilities. arXiv preprint arXiv:2309.14322, 2023a.

Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Zhen Stephen Gou, Phil Blun-
som, Ahmet Üstün, and Sara Hooker. Intriguing properties of quantization at scale. Advances
in Neural Information Processing Systems, 36:34278–34294, 2023.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of machine learning research, 21(140):1–67, 2020.

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld,
Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the
colossal clean crawled corpus. arXiv preprint arXiv:2104.08758, 2021.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia,
Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision
training. arXiv preprint arXiv:1710.03740, 2017.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan
Catanzaro. Megatron-lm: Training multi-billion parameter language models using model paral-
lelism. arXiv preprint arXiv:1909.08053, 2019.

Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, and Ludwig
Schmidt. Stable and low-precision training for large-scale vision-language models. Advances
in Neural Information Processing Systems, 36:10271–10298, 2023b.

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for
large language models. arXiv preprint arXiv:2308.07633, 2023.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with
low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning
of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise
quantization. arXiv preprint arXiv:2110.02861, 2021.

Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath
Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi Viji Srinivasan, and Kailash Gopalakrish-
nan. Ultra-low precision 4-bit training of deep neural networks. Advances in Neural Information
Processing Systems, 33:1796–1807, 2020.

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang
Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware
training for large language models. arXiv preprint arXiv:2305.17888, 2023.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan
Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization
for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems,
6:87–100, 2024b.
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang,
Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of
large language models with a single gpu. In International Conference on Machine Learning,
pages 31094–31116. PMLR, 2023.
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashk-
boos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized repre-
sentation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078, 2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song,
John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models:
Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and
efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko-
lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foun-
dation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman
Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-
parameter open-access multilingual language model. 2023.
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le
Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual gener-
alization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.
Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia
Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. Olmoe: Open mixture-of-experts
language models. arXiv preprint arXiv:2409.02060, 2024a.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot,
Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al.
Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christo-
pher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer
language models. arXiv preprint arXiv:2205.01068, 2022.
Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz
Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. Santacoder: don’t
reach for the stars! arXiv preprint arXiv:2301.03988, 2023.

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou,
Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with
you! arXiv preprint arXiv:2305.06161, 2023.

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane
Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. Starcoder 2 and the stack v2:
The next generation. arXiv preprint arXiv:2402.19173, 2024.

Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari
Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, et al. Fingpt:
Large generative models for a small language. arXiv preprint arXiv:2311.05640, 2023.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge,
Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm:
Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113,
2023.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu,
Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly
capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke
Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, et al. Aya model: An in-
struction finetuned open-access multilingual language model. arXiv preprint arXiv:2402.07827,
2024.

Yangjun Ruan, Chris J Maddison, and Tatsunori Hashimoto. Observational scaling laws and the
predictability of language model performance. arXiv preprint arXiv:2405.10938, 2024.

Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, and Martin
Jaggi. Scaling laws and compute-optimal training beyond fixed training durations. arXiv preprint
arXiv:2405.18392, 2024.

Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan
Narang, Vinh Q Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures:
How does inductive bias influence scaling? arXiv preprint arXiv:2207.10551, 2022a.

Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michal Krutul, Szymon
Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, et al. Scaling
laws for fine-grained mixture of experts. arXiv preprint arXiv:2402.07871, 2024.

Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, and
Ngai Wong. Scaling laws with vocabulary: Larger models deserve larger vocabularies. arXiv
preprint arXiv:2407.13623, 2024.

Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoff-
mann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling
laws for routed language models. In International conference on machine learning, pages 4057–
4086. PMLR, 2022.

Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q Tran, David R So, Siamak Shakeri, Xavier Garcia,
Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, et al. Transcending scaling laws
with 0.1% extra compute. arXiv preprint arXiv:2210.11399, 2022b.

Teven Le Scao, Thomas Wang, Daniel Hesslow, Lucile Saulnier, Stas Bekman, M Saiful Bari, Stella
Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, et al. What language model to train
if you have one million gpu hours? arXiv preprint arXiv:2210.15424, 2022.

Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene
Cheah, Teddy Ferdinan, Haowen Hou, Przemyslaw Kazienko, et al. Eagle and finch: Rwkv with
matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024.

Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang,
Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative
mixed-modal language models. In International Conference on Machine Learning, pages 265–279.
PMLR, 2023.

Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. Revisiting neural scaling laws
in language and vision. Advances in Neural Information Processing Systems, 35:22300–22312,
2022.

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade
Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for
contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 2818–2829, 2023.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yo-
gatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language
models. arXiv preprint arXiv:2206.07682, 2022.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam
Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond
the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv
preprint arXiv:2206.04615, 2022.

Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, and
Sanmi Koyejo. Scaling laws for downstream task performance of large language models. arXiv
preprint arXiv:2402.04177, 2024.

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal,
Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation
of training sets for language models. arXiv preprint arXiv:2406.11794, 2024.

Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing
Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training. arXiv
preprint arXiv:2407.01492, 2024.

Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang,
Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. A survey on data selection
for language models. arXiv preprint arXiv:2402.16827, 2024.

Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra
Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language
models. Advances in Neural Information Processing Systems, 36, 2024b.

Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on
neural networks typically occurs at the edge of stability. In International Conference on Learning
Representations, 2021.

Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David
Cardoze, George Dahl, Zachary Nado, and Orhan Firat. A loss curvature perspective on training
instability in deep learning. arXiv preprint arXiv:2110.04369, 2021.

Quynh Nguyen, Marco Mondelli, and Guido F Montufar. Tight bounds on the smallest eigenvalue
of the neural tangent kernel for deep relu networks. In International Conference on Machine
Learning, pages 8119–8129. PMLR, 2021.

Alexander Atanasov, Blake Bordelon, Sabarish Sainathan, and Cengiz Pehlevan. The on-
set of variance-limited behavior for networks in the lazy and rich regimes. arXiv preprint
arXiv:2212.12147, 2022.

Emmanuel Abbe, Enric Boix-Adsera, Matthew S Brennan, Guy Bresler, and Dheeraj Nagaraj.
The staircase property: How hierarchical structure can guide deep learning. Advances in Neural
Information Processing Systems, 34:26989–27002, 2021.

Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. The merged-staircase property: a
necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural
networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.

Boaz Barak, Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang.
Hidden progress in deep learning: Sgd learns parities near the computational limit. Advances in
Neural Information Processing Systems, 35:21750–21764, 2022.

Appendix
A Hyperparameter Details and Ablations
We launch over 20 runs for each (N, D) combination to study scaling in precision, training and
validating on the Common Crawl split of the Dolma dataset [Soldaini et al., 2024]. We use a
Transformer++ implementation: SwiGLU activations [Shazeer, 2020], RoPE embeddings [Su et al., 2021],
RMSLayerNorm, and Adam β values of (0.9, 0.95). We adopt a cosine learning rate schedule with a 10%
warmup period and a peak learning rate of 6e-4 for the smallest model, with learning rates scaled with
width and depth according to depth-µP for the larger models [Yang et al., 2022, Bordelon et al.,
2023]. We use a sequence length of 1024 and a batch size of 256 throughout, with Adam ϵ = 1e-15,
following [Wortsman et al., 2023a]. We use weight decay of 0.1, as [Ahmadian et al., 2023] find that
some results in the quantization literature may be artifacts of insufficient weight decay. We follow
[Ma et al., 2024] in including a LayerNorm before projections because they find it is important for
low-precision training to be stable. These are the hyperparameters and settings used for the main
scaling law experiments.
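For concreteness, a minimal sketch of the learning rate schedule described above (linear warmup into cosine decay); the decay floor is an assumption for illustration, since the final learning rate is not specified here:

```python
import math

def lr_schedule(step, total_steps, peak_lr=6e-4, warmup_frac=0.10, floor_frac=0.1):
    """Linear warmup for the first `warmup_frac` of steps, then cosine decay
    from `peak_lr` down to `floor_frac * peak_lr` (the floor is assumed)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (floor_frac + (1.0 - floor_frac) * cosine)
```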
To check robustness, we then ablate these hyperparameter choices, with results in Figure 8. In
our ablation we use a sequence length of 512 with batch size 128, weight decay of 1e-3, Adam ϵ of
1e-10, a peak learning rate of 1e-4, and a warmup period of 3% of training. We train models with
these alternative hyperparameters at various weight, activation, and KV cache precisions. We train
and validate on C4 [Raffel et al., 2020, Dodge et al., 2021] instead. Though these ablations are at
smaller scale due to compute constraints, the loss curves follow the same trends – a rapid decrease in
final loss as precision first increases from 4 bits, then diminishing returns as we approach
higher precision – as in the main text, suggesting the trends are robust to hyperparameter choices.

[Figure 8: three panels (Weights, KV Cache, Activations) showing final loss against precision in bits under the ablated hyperparameters.]

Figure 8: L(Pw ), L(Pa ), L(Pkv ) for ablated hyperparameters, N = 30M, D = 1.5B. We can see the
trends persist, where the first few bits reduce final val loss significantly, with diminishing/saturating
returns quickly setting in at higher precision. We do not fit constants on these ablated runs.

B Additional Related Work


Efficient training and inference Low precision has been key to improving the efficiency of
training and using LLMs [Micikevicius et al., 2017, Shoeybi et al., 2019, Wortsman et al., 2023b,
Zhu et al., 2023]. Prior works generally study either precision during training [Courbariaux et al.,
2014, Dettmers et al., 2024, 2021, Sun et al., 2020, Liu et al., 2023] or the effects of changing the
precision after training (post-training quantization) [Frantar et al., 2022, Lin et al., 2024b, Dettmers
et al., 2022, Xiao et al., 2023, Sheng et al., 2023, Dettmers et al., 2023]. In this work we study
both the precision during training and the precision after it, and unify them from a scaling perspective.

Large language models and scaling By scaling up the transformer architecture [Vaswani
et al., 2017], a variety of large language models have been proposed [Brown, 2020, Rae et al., 2021,
Touvron et al., 2023a,b, Dubey et al., 2024, Le Scao et al., 2023, Muennighoff et al., 2022, 2024a,
Groeneveld et al., 2024, Jiang et al., 2023, Zhang et al., 2022, Allal et al., 2023, Li et al., 2023,
Lozhkov et al., 2024, Luukkonen et al., 2023, Bai et al., 2023, Chowdhery et al., 2023, Team et al.,
2023, Üstün et al., 2024, Deitke et al., 2024]. To improve our understanding of these models,
various works have investigated their scaling properties [Ruan et al., 2024, Allen-Zhu and Li, 2024,
Hägele et al., 2024]. Many aspects are relevant to scaling including the architecture [Tay et al.,
2022a, Krajewski et al., 2024, Tao et al., 2024, Clark et al., 2022, Tay et al., 2022b, Scao et al.,
2022, Peng et al., 2024], the modalities considered [Aghajanyan et al., 2023, Alabdulmohsin et al.,
2022, Cherti et al., 2023], the performance metrics [Wei et al., 2022, Srivastava et al., 2022, Isik
et al., 2024], the data composition [Li et al., 2024, Liu et al., 2024, Albalak et al., 2024] and data
repetitions [Muennighoff et al., 2024b]. Our work analyzes one such aspect, which is key to better
scaling: the numeric precision during and after training.

C Alternative Functional Forms


There are several plausible functional forms to try a priori. The key choices are whether a form
is 1) additive or multiplicative, 2) interacts with parameters/data or is independent of them, and 3) a power
law or an exponential. We try a variety of combinations of these three and find the formulation in
the main text to be one of the best fits, notably with the fewest fitted parameters. We emphasize that
several fitted forms are likely to be reasonable fits to the data, and an important desideratum for
choosing a functional fit is interpretability. Several scaling law papers find multiple fits plausible
in terms of predictive power [Muennighoff et al., 2024b, Kaplan et al., 2020], and ultimately make
a decision based on interpretability.
We make these fit choices on sweeps of the form L(N, D, Pw) and discuss alternatives to the
decomposition/factorization used to account for activations and KV cache in Appendix K, which
assumes an effective parameter count formulation. In this section, a power law refers to a term of
the form C_w · P^{-α_w} where C_w, α_w are fitted. In general, we find that modeling precision effects with
power law fits on their own causes the fitted constants A, B to blow up, whereas this does not
happen with exponential fits, suggesting a power law does not change sharply enough to match
the change in loss induced by precision. We note that while fitting with a joint
notion of effective parameters and effective data leads to a slightly better fit, it requires more fitted
parameters, so we stick with the Neff formulation for simplicity and interpretability. When choosing
between fits we validate on held-out data, and the R² values in Table 1 reflect the fit on the held-out
data. This is in contrast to our plots in the main text, where we have chosen a functional form and
we fit and plot on the same data, as is standard in scaling laws [Muennighoff et al., 2024b].
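To illustrate the fitting procedure, the sketch below fits one candidate form with an effective parameter count and scores it on held-out data. The synthetic data, initial guesses, and the particular Neff parameterization are placeholders for illustration, not the exact form or runs used in the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def loss_neff(X, A, B, E, alpha, gamma):
    """Candidate form with N -> Neff(P): L = E + A*Neff^-alpha + B*D^-alpha
    (alpha = beta tied), using Neff = N * (1 - exp(-P/gamma))^3 as one option."""
    N, D, P = X
    n_eff = N * (1.0 - np.exp(-P / gamma)) ** 3
    return E + A * n_eff ** (-alpha) + B * D ** (-alpha)

# Synthetic stand-in for the (N, D, P) -> loss sweep; real fits use the pretraining runs.
N = rng.choice([30e6, 60e6, 110e6, 220e6], size=200)
D = rng.uniform(1e9, 26e9, size=200)
P = rng.integers(3, 13, size=200).astype(float)
y = loss_neff((N, D, P), 4e3, 2e4, 2.7, 0.5, 2.5) + rng.normal(0, 0.01, size=200)

fit = rng.random(200) < 0.8   # fit on ~80% of runs, validate on the held-out ~20%
popt, _ = curve_fit(loss_neff, (N[fit], D[fit], P[fit]), y[fit],
                    p0=[1e3, 1e4, 2.0, 0.4, 1.0], maxfev=50000)
resid = y[~fit] - loss_neff((N[~fit], D[~fit], P[~fit]), *popt)
print("held-out R^2:", 1 - resid.var() / y[~fit].var())
```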

D Quantization Implementation Details and Types


Two canonical types for neural network quantization are floating-point (FP) and integer (INT)
quantization. Despite their differences in representation, we hypothesize that the scaling behavior
between floating-point and integer quantization can be described by similar functional forms, where
Figure 1(b) provides preliminary evidence for this.

Functional Form                      Val R²    Number of Fitted Parameters
Neff                                 0.82      3
Additive/independent power law       0.71      2
Deff                                 0.74      3
Neff and Deff (tied)                 0.79      3
Neff and Deff (not tied)             0.84      4
Multiplicative power law, N, P       0.75      2

Table 1: Comparison of Functional Forms with R² and Number of Fitted Parameters


D.1 Integer Quantization


In integer quantization, continuous values are mapped to discrete integer values. Typically, this
is done by scaling the original values according to a fixed scale factor. Mathematically, for a real
number x, the quantized integer value xint is computed as:
x_{\mathrm{int}} = \left\lfloor \frac{x}{s} \right\rceil
where s is the scaling factor, and ⌊·⌉ denotes rounding to the nearest integer in the range specified by the number
of bits. The value can then be dequantized back to an approximate real value by multiplying by s:

x_{\mathrm{dequant}} = s \cdot x_{\mathrm{int}}

This process introduces quantization error, defined as the difference between the original value
x and the dequantized value xdequant . The goal of quantization is to minimize this error while
still reducing the precision. One can think of this as rounding to the nearest point on a uniform
lattice. More complicated quantization schemes involve selecting the lattice points in a data or
model-dependent manner. Integer quantization, as implemented, uses a fixed-point scaling based
on the maximum absolute value of the tensor, and then scales the values within the range [Q_n, Q_p],
where Q_n = −2^{b−1} and Q_p = 2^{b−1} − 1, with b being the number of bits.
Integer quantization first rescales the inputs into the range specified by the number of bits by
s = \frac{Q_p}{\max(|x|)}

for tensor-based scaling, or


s = \frac{Q_p}{\max(|x|, \mathrm{dim} = k)}
for channel-based scaling. After scaling, the result is rounded to the nearest integer and then
clamped to the range [Qn , Qp ]. After matrix multiplication, the result is rescaled back into the
original range. We use integer quantization throughout to fit our scaling laws for simplicity. How-
ever, when making pretraining predictions, we test our scaling laws using floating-point quantization
since that is used for pretraining in practice.
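A minimal numpy sketch of this scheme (simulated "fake" quantization; illustrative rather than our exact implementation):

```python
import numpy as np

def int_quantize(x, bits=4, axis=None):
    """Scale into the signed integer range [Qn, Qp], round to the nearest
    lattice point, clamp, and rescale back. axis=None gives per-tensor
    scaling; passing an axis (e.g. the channel dimension) gives per-channel."""
    x = np.asarray(x, dtype=np.float64)
    qn, qp = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x), axis=axis, keepdims=axis is not None)
    s = qp / np.maximum(max_abs, 1e-12)        # s = Qp / max(|x|)
    x_int = np.clip(np.round(x * s), qn, qp)   # round-to-nearest, then clamp
    return x_int / s                           # dequantize back to real values

# Example: weights quantized per-channel, activations per-tensor
w_q = int_quantize(np.random.randn(8, 16), bits=4, axis=1)
a_q = int_quantize(np.random.randn(32, 16), bits=6, axis=None)
```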

D.2 Floating-Point Quantization
Floating-point quantization is slightly more sophisticated, aiming to make a non-uniform lattice
roughly matching the distribution of the weights, which are assumed to be Gaussian. A floating-
point number is in general represented as:

x_{\mathrm{fp}} = (−1)^{s} \cdot m \cdot 2^{e}

where s is the sign bit, m is the mantissa, and e is the exponent. In floating-point quantization,
both the mantissa and exponent are quantized to reduce the bit width. For the exponent-mantissa
bit allocation and the exponent bias, we follow the guidelines of [Micikevicius et al.,
2022], since that reflects how FP8 training is commonly done
in production settings. Further in line with [Micikevicius et al., 2022], we quantize weights per-channel
and activations per-tensor.
Making a full scaling law for floating-point quantization is more involved than our integer
treatment, because the effects of scaling mantissa vs exponent bits are not the same. In contrast,
in integer quantization, each additional bit simply causes us to round into a finer-grained lattice
after rescaling, thereby reducing quantization error by a predictable amount. In floating-point
quantization, altering the exponent affects the dynamic range, while altering the mantissa changes
the precision within that range. This flexibility at once makes floating-point quantization more
suitable for model training, but harder to analyze. We leave a commensurately detailed analysis of
mantissa vs exponent scaling to future work.
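As a rough illustration, the sketch below simulates round-to-nearest quantization to a generic sign/exponent/mantissa format. It omits subnormals and the reserved exponent codes of real FP8 formats such as E4M3/E5M2, so it is a simplified model rather than a faithful implementation of the [Micikevicius et al., 2022] formats:

```python
import numpy as np

def simulate_fp_quant(x, exp_bits=4, man_bits=3):
    """Round-to-nearest simulation of a float format with `exp_bits` exponent
    bits and `man_bits` mantissa bits (no subnormals, infs, or NaNs)."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1                    # standard exponent bias
    max_exp = 2 ** exp_bits - 2 - bias                # largest normal exponent
    min_exp = 1 - bias                                # smallest normal exponent
    sign, mag = np.sign(x), np.abs(x)
    exp = np.clip(np.floor(np.log2(np.where(mag > 0, mag, 1.0))), min_exp, max_exp)
    step = 2.0 ** (exp - man_bits)                    # lattice spacing at this exponent
    q = np.round(mag / step) * step
    max_val = (2 - 2.0 ** (-man_bits)) * 2.0 ** max_exp
    return sign * np.minimum(q, max_val)              # clamp overflow to the max normal
```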

D.3 Hardware Details


Weight-only quantization can accelerate inference because software can be written to move data
between GPU memory hierarchies (HBM to SRAM) in smaller data types, so that a given bandwidth
can move more values per second. This reduces the memory (IO) bottlenecks that often dominate during
inference, even with high-batch workloads. However, we emphasize that the types in which, and therefore the
speed at which, the GPU can natively do matrix multiplications are determined by the hardware
provider, so that even when Pw = Pa = Pqkv (including queries), compute savings are only achieved
when these correspond to both a bit-width and a type that the GPU supports. For instance, many
GPUs support FP32, BF16, INT8, and INT4; the Hopper line of GPUs from NVIDIA supports FP8
matrix multiplication, and the next generation at the time of writing (Blackwell) will support FP4
matrix multiplication. We aim to study scaling in a fairly hardware-agnostic manner so that our
work may be useful in the future. We train all our models with fake (simulated) quantization on
NVIDIA H100 GPUs.

E Derivations
E.1 Critical Dataset Size for PTQ
We seek a D_{crit} that satisfies \frac{\partial L(D_{crit})}{\partial D} = \frac{\partial \delta_{PTQ}(D_{crit})}{\partial D}. Taking both derivatives for the functional
forms presented in the main text and equating their opposing effects, we get the equation

\beta B D_{crit}^{-\beta-1} = \gamma_D C_T N^{-\gamma_N} e^{-P_{post}/\gamma_{post}} D_{crit}^{\gamma_D - 1}    (12)

which implies

D_{crit} = \left( \frac{\beta B N^{\gamma_N} e^{P_{post}/\gamma_{post}}}{\gamma_D C_T} \right)^{\frac{1}{\gamma_D + \beta}}    (13)

is the predicted point after which pretraining on more data can increase the loss of a model that is
post-train quantized. Note that this quantity explodes in P, so that a truly unreasonable amount
of data is required for longer pretraining to be harmful at commonly used precisions (e.g., 8-bit).
However, we find that on overtrained models with D/N ≫ 10^3, these overtraining-degradation effects
become nontrivial around 5 bits, and dominant below that.
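To make Equation 13 concrete, a small numeric sketch is below. The constants are illustrative placeholders in the rough range of Appendix I; the mapping of Table 2's fitted names onto the symbols γD, γN, γpost here is our assumption:

```python
import numpy as np

B, beta = 1.8e4, 0.5                                        # data term of the loss (assumed)
C_T, gamma_D, gamma_N, gamma_post = 0.06, 0.5, 0.34, 0.6    # PTQ degradation constants (assumed)

def d_crit(N, P_post):
    """Equation 13: dataset size beyond which additional pretraining data
    increases the loss of the post-train-quantized model."""
    numerator = beta * B * N ** gamma_N * np.exp(P_post / gamma_post)
    return (numerator / (gamma_D * C_T)) ** (1.0 / (gamma_D + beta))

# D_crit explodes as the post-train precision increases
for p_post in [3, 4, 5, 8]:
    print(f"P_post = {p_post} bits: D_crit ≈ {d_crit(30e6, p_post):.2e} tokens")
```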

E.2 Compute-optimality calculations


We set a constraint C ∝ N DP throughout. Working up to proportionality is essentially rescaling
the compute constraint, so it doesn’t affect the scaling trends we identify, which is our focus.

E.2.1 Fixed Precision Compute Optimal Scaling


Under fixed precision, the loss takes the form

L = u(P) A N^{-\alpha} + B D^{-\beta}    (14)

where u(P) = [1 - e^{-P/\gamma}]^{-3\alpha} is a fixed constant. The compute-optimal scaling when minimizing
the loss over N, D gives

L = u(P) A N^{-\alpha} + B C^{-\beta} N^{\beta} P^{\beta}    (15)

by replacing D = C/(NP). Optimizing over N, we see that this is equivalent to the original Chinchilla
optimization problem but with A \to A\,u(P) and B \to B P^{\beta}. Performing this optimization, we find

N^{*}(P, C) = \left( \frac{u(P) A \alpha}{B P^{\beta} \beta} \right)^{\frac{1}{\alpha+\beta}} C^{\frac{\beta}{\alpha+\beta}}, \qquad D^{*}(P, C) = \left( \frac{u(P) A \alpha}{B P^{\beta} \beta} \right)^{-\frac{1}{\alpha+\beta}} C^{\frac{\alpha}{\alpha+\beta}}    (16)

We can relate the above expressions to the original Chinchilla-optimal N, D at full precision,
N_{Ch}(C), D_{Ch}(C):

\frac{N^{*}(P, C)}{N_{Ch}(C)} \propto \left[ 1 - e^{-P/\bar{\gamma}} \right]^{-\frac{3\alpha}{\alpha+\beta}} P^{-\frac{\beta}{\alpha+\beta}} \quad \text{and} \quad \frac{D^{*}(P, C)}{D_{Ch}(C)} \propto \left[ 1 - e^{-P/\bar{\gamma}} \right]^{\frac{3\alpha}{\alpha+\beta}} P^{\frac{\beta}{\alpha+\beta}}    (17)
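Evaluating Equation 17 numerically shows how the compute-optimal allocation shifts with precision; the constants below are illustrative placeholders, not our fitted values:

```python
import numpy as np

alpha, beta, gamma_bar = 0.5, 0.5, 2.7   # illustrative constants (assumed)

def optimal_shift(P):
    """Relative shift of compute-optimal (N, D) versus full-precision
    Chinchilla, per Equation 17 (up to a constant)."""
    base = (1 - np.exp(-P / gamma_bar)) ** (-3 * alpha / (alpha + beta))
    n_ratio = base * P ** (-beta / (alpha + beta))
    d_ratio = P ** (beta / (alpha + beta)) / base
    return n_ratio, d_ratio

for P in [4, 8, 16]:
    n_r, d_r = optimal_shift(P)
    print(f"P = {P:>2} bits: N*/N_Ch ∝ {n_r:.2f}, D*/D_Ch ∝ {d_r:.2f}")
```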

E.2.2 Fixed model size N


Now, we investigate the case where model size N is fixed but precision and data are jointly optimized
at fixed compute C = N DP . This optimization problem takes the form

L = u(P )AN −α + BD−β (18)


Under fixed compute, we have D = C/(NP), so replacing the second term, we have

L = u(P )AN −α + BC −β N β P β (19)

where N is a constant. We therefore have a single variable P to minimize the above formula over:

\frac{\partial L}{\partial P} = u'(P) A N^{-\alpha} + B C^{-\beta} N^{\beta} \beta P^{\beta-1} = 0    (20)

First, we note that u'(P) has the following form

u'(P) = -3\alpha \left[ 1 - e^{-P/\gamma} \right]^{-3\alpha-1} \times \frac{1}{\gamma} e^{-P/\gamma} = -\frac{3\alpha}{\gamma} e^{-P/\gamma} \times u(P)^{\frac{3\alpha+1}{3\alpha}}    (21)

We thus desire a solution to the implicit equation

\frac{3\alpha}{\gamma} e^{-P/\gamma} \times u(P)^{\frac{3\alpha+1}{3\alpha}} A N^{-\alpha} = B C^{-\beta} N^{\beta} \beta P^{\beta-1}    (22)

We now aim to find an approximate asymptotic relationship between P and C as C \to \infty. Taking
a logarithm of both sides, we find (neglecting additive constants that are independent of C, P)

-(3\alpha + 1) \ln\left( 1 - e^{-P/\gamma} \right) - \frac{1}{\gamma} P \approx -\beta \ln C    (23)

The correct dominant balance at large C is to take P^{\star} \sim \beta \gamma \ln C, as can be verified numerically.
With the constraint that C = N P D we have that D^{\star} \approx \frac{C}{N \beta \gamma \ln C}.
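The dominant balance can be checked numerically by solving Equation 23 for P at several compute budgets (constants below are illustrative placeholders):

```python
import numpy as np
from scipy.optimize import brentq

alpha, beta, gamma = 0.5, 0.5, 2.7   # illustrative constants (assumed)

def balance(P, C):
    # Equation 23 rearranged: -(3a+1) ln(1 - e^{-P/g}) - P/g + beta ln C = 0
    return -(3 * alpha + 1) * np.log1p(-np.exp(-P / gamma)) - P / gamma + beta * np.log(C)

for C in [1e18, 1e21, 1e24]:
    P_star = brentq(lambda P: balance(P, C), 1e-3, 1e4)
    print(f"C = {C:.0e}: P* = {P_star:.1f}, beta*gamma*ln(C) = {beta * gamma * np.log(C):.1f}")
```

The two printed values agree increasingly well as C grows, consistent with P* ~ βγ ln C.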

E.2.3 Minimization over N , D, P with Fixed Compute


Recall our three-way loss function is given as below. We separate Neff into terms involving (N, P )
explicitly here as it makes the math easier to follow.

L(N, D, P) = A N^{-\alpha} u(P) + B D^{-\beta}, \qquad u(P) = [1 - e^{-P/\gamma}]^{-3\alpha}    (24)

Under the constraint C ∝ N DP , we can replace D in terms of C, N, P giving the loss expression

L = A N^{-\alpha} u(P) + B N^{\beta} P^{\beta} C^{-\beta}    (25)

\frac{\partial L}{\partial N} = -\alpha A N^{-\alpha-1} u(P) + \beta B N^{\beta-1} P^{\beta} C^{-\beta} = 0    (26)

\frac{\partial L}{\partial P} = -\frac{3\alpha}{\gamma} A N^{-\alpha} u(P)^{\frac{3\alpha+1}{3\alpha}} e^{-P/\gamma} + \beta B N^{\beta} P^{\beta-1} C^{-\beta} = 0    (27)
Multiplying the first equation by N and dividing the second equation by it reveals that the optimal
P satisfies a compute-independent implicit equation
\frac{3}{\bar{\gamma}}\, u(P)^{\frac{3\alpha+1}{3\alpha}}\, e^{-P/\bar{\gamma}} = P^{-1}\, u(P)    (28)
This exercise reveals that the compute optimal strategy when allowed to jointly optimize N, D, P
is to choose a fixed precision that satisfies the above equation and then to scale up N, D with the
prescription in Appendix I.1.1.
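A numerical solve of Equation 28, with illustrative placeholder constants, confirms that the jointly optimal precision is a fixed, compute-independent value:

```python
import numpy as np
from scipy.optimize import brentq

alpha, gamma_bar = 0.5, 2.7   # illustrative constants (assumed)

def u(P):
    return (1 - np.exp(-P / gamma_bar)) ** (-3 * alpha)

def condition(P):
    # Equation 28: (3/g) u(P)^((3a+1)/(3a)) e^{-P/g} - u(P)/P = 0
    lhs = (3 / gamma_bar) * u(P) ** ((3 * alpha + 1) / (3 * alpha)) * np.exp(-P / gamma_bar)
    return lhs - u(P) / P

P_opt = brentq(condition, 0.1, 64)
print(f"jointly compute-optimal precision (illustrative constants): {P_opt:.1f} bits")
```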

F Replicating PTQ Scaling with other Quantization Methods


Here we replicate the finding that the degradation due to post-train quantization increases
with the token/parameter ratio as D^{\gamma_D}/N^{\gamma_N}. We fit the same functional form as in the main text,
but get slightly different values of the fitted constants, as expected. We replicate on AWQ [Lin et al.,
2023] and on round-to-nearest quantization. The former is a modern and sophisticated technique, and
the latter a simple and naïve approach to quantization. The fact that they, as well as GPTQ in the
main text, share the same failure modes suggests that poor post-training quantization data scaling
should be the default expectation for any newly proposed PTQ technique.
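For reference, a minimal sketch of the round-to-nearest (RTN) baseline, with per-channel scales on weight matrices and everything else left in full precision; numpy arrays stand in for model tensors, and this is an illustrative implementation rather than the exact code used here:

```python
import numpy as np

def rtn_quantize_weights(state_dict, bits=4):
    """Naive weight-only post-training quantization: round-to-nearest with
    per-channel (row-wise) scales on 2D weight matrices."""
    qp = 2 ** (bits - 1) - 1
    out = {}
    for name, w in state_dict.items():
        if w.ndim == 2:
            s = qp / np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12)
            out[name] = np.clip(np.round(w * s), -qp - 1, qp) / s
        else:
            out[name] = w          # leave 1D params (norms, biases) in full precision
    return out

# delta_PTQ is then the gap in val loss between the quantized and original model:
#   delta_PTQ = eval_loss(quantized) - eval_loss(original)
```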

[Figure 9: panels for N = 30M, 60M, 110M, 220M showing val loss after post-train quantization (INT3-INT6 and no PTQ) and the PTQ degradation δPTQ against token/parameter ratio.]
Figure 9: Replicating Section 3 results with AWQ.

[Figure 10: panels for N = 30M, 60M, 110M, 220M showing val loss after post-train quantization (INT3-INT6 and no PTQ) and the PTQ degradation δPTQ against token/parameter ratio.]
Figure 10: Replicating Section 3 results with RTN.

G Why do language models get more sensitive with overtraining?
This section is speculative.
Sharpness. A canonical line of work in optimization demonstrates that model sharpness
increases during learning until it hovers at a maximal value (the “edge of stability”) [Cohen et al.,
2021, Gilmer et al., 2021], so that movement along the top Hessian eigenvector degrades loss by
more throughout training. Though sharpness is formally a worst-case sensitivity, we conjecture
similar results hold for average case, such as loss degradation induced by isotropic noise. It may
be possible that sharpness during language model pretraining does not reach its maximal value for
a long time, which may be why sensitivity to noise seems to increase monotonically as D/N → ∞ at
realistic data budgets. Closely related is the largest eigenvalue of the neural tangent kernel (NTK)
which captures the magnitude of the variance of the predictor under parameter noise. This quantity
is known to empirically increase during training in a variety of settings, and is closely related to
generalization guarantees [Nguyen et al., 2021, Atanasov et al., 2022].
Hierarchical learning strategies become more sensitive throughout training. Our
expectation that overtrained language models may degrade more when quantized at inference-time
is motivated in part by the following results. The hierarchical nature of learning is by now well
understood in some toy settings: in [Abbe et al., 2021], it is shown that “staircase” polynomials
of increasing degree are learned faster than high-degree monomials since neural networks combine
existing features to learn new ones. In [Abbe et al., 2022] this result was strengthened to show that
such hierarchical structure is both necessary and sufficient to learn sparse functions with SGD in
two layer neural networks. In this setting, damage to features encoding lower-order polynomials
affects all higher-order ones, so that such networks are increasingly sensitive to fixed feature noise
throughout learning. Another result of a similar flavor is that of [Barak et al., 2022], who explicitly
require high-precision gradients for sparse parity to be learned, since sparse parity is learned by
the amplification of a small initial signal. If language models learn hierarchically, it is possible that
the features that are learned late into overtraining as D/N → ∞ are reliant on base features, so
that noise harms the base features and therefore significantly damages higher-order features.

H Main Figure Details


The model on the left is N = 30M parameters, chosen because we could train it to the highest
token/parameter ratio given our compute budget. On the right we train a suite of models with
N P kept constant on 16B tokens (so that C = \frac{6}{16} N D P is matched throughout under our cost
model). We plot val loss on Dolma, as throughout the main text, and use floating-point quantization (rather
than integer) to make the pretraining claims as realistic as possible.

I Numerical Fits
Following [Muennighoff et al., 2024b], we tie α = β so they do not become very different, though
this is not required. Distinct α, β only add expressivity to the model and we have verified the plots
look similar without tying. We also only use the full scaling law when specified in the text, since
the law is developed piecewise through the text. For instance, Figures 3 and 4 solely fit Chinchilla
with the substitution N ↦ Neff(Pw) because at that point Pa, Pkv have not been introduced. Figures
5, 6, and 7 use our full scaling law, for instance to make predictions. We emphasize that our numerical
constants are unlikely to be useful because, as [Hoffmann et al., 2022, Sardana and Frankle, 2023]
show, fitted constants depend heavily on the architecture and dataset used, which differs from setup
to setup. Rather, the trends we identify are the key findings. With that said, our fitted constants
are as follows.

Constant Value
A 4.299e3
α 0.4965
B 1.806e4
E 2.7648
γw 2.6745
nw 0.3037
γi 2.2102
ni 1.4072
γkv 0.9578
nkv 2.4185
CT 0.0598
δD 0.5068
δN 0.3439
γ 0.5907
b 1.1277

Table 2: Fitted constants and their values

Note that we include biases in our exponent fits. For instance, when modelling Neff as a saturating
exponential, we find that the different parts of a model cause numerical instability at different low
precisions, so even if they share the same functional form, they may be translated (left/right-shifted)
versions of each other. For instance, a fit of the form e^{x/\gamma_x} in the main text is really computed
with an offset, e^{x/\gamma_x + n}, but including biases everywhere clutters notation and obscures mathematical
insight.

J Are Weights, Activations, and KV Cache Equally Sensitive?


We find that training runs with Pa ≤ 3 or Pkv ≤ 3 are not numerically stable, and often diverge,
while Pw = 3 is still well behaved. In particular, we find activations are more sensitive, though this
could be because we quantize activations per-tensor and weights per-channel following [Micikevicius
et al., 2022], rather than activations being inherently more sensitive. Consequently, we do not fit
or validate on runs with activations or attention bits equal to 3. We leave a more detailed analysis
of fine-grained sensitivity across layers and types of parameters to future work. The figure below
illustrates the empirical sensitivity by plotting L(P) for the three quantities across various (N, D) runs.

[Figure 11: loss against precision in bits for weights, activations, and KV cache, at (N, D) = (220M, 3.3B) and (110M, 26.2B).]
Figure 11: Sweeping L(P ) for the three model parts at various N, D.

K Empirical Neff
Consider a model trained with some arbitrary (N, D, Pw). Assuming a Chinchilla functional form
with N ↦ Neff(Pw), we can write the difference between its loss and the loss of a full-precision
model as

L(N, D, P_w) - L(N, D, \infty) = A\left[ N_{\mathrm{eff}}^{-\alpha} - N^{-\alpha} \right]
as the terms involving B, D, E cancel. Note that Neff (Pw = ∞) = N by construction. In
practice, we use a BF16 model as the “infinite-precision” model, finding no real difference if we
use an FP32 model or even a functional fit estimating Pw → ∞ based on our integer quantization
loss results. Our goal is to plot what f (P ) looks like where Neff = N · f (P ). Therefore, we can
rearrange the above equation as follows
f(P) := \frac{N_{\mathrm{eff}}}{N} = \frac{1}{N} \left[ \frac{L(N, D, P_w) - L(N, D, P_w = \infty)}{A} + N^{-\alpha} \right]^{-1/\alpha}    (29)
Then plotting this quantity using our fitted numerical values (see Appendix I) gives us the
empirical tradeoff between precision and parameters. We can see that the tradeoff quickly
saturates in P to a value near 1. While the functional form is the same for the three model parts,
the fitted constants are different. For instance, runs with Pa ≤ 3 or Pkv ≤ 3 often diverged, and
this was not the case with weight precision. Further, we can see that the KV cache is not sensitive
to quantization at higher bit values, but very quickly becomes sensitive around 4-5 bit precision.
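For concreteness, a minimal sketch of the computation behind Equation 29; the loss values and constants below are illustrative placeholders rather than our measurements:

```python
import numpy as np

A, alpha = 4.3e3, 0.5   # illustrative values in the spirit of Appendix I

def effective_param_fraction(loss_low_precision, loss_bf16, N):
    """Empirical f(P) = Neff/N from Equation 29, treating a BF16 run as the
    'infinite-precision' reference."""
    delta = loss_low_precision - loss_bf16
    return (delta / A + N ** (-alpha)) ** (-1.0 / alpha) / N

# Hypothetical losses for a 30M-parameter model trained at several weight precisions
losses = {3: 3.95, 4: 3.82, 6: 3.76, 8: 3.75, 16: 3.745}
f = {p: effective_param_fraction(loss, losses[16], 30e6) for p, loss in losses.items()}
print(f)   # f(P) rises toward 1 and saturates as precision increases
```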
[Figure 12: three panels (Weights, Activations, KV Cache) plotting f(P) = Neff(P)/N against training precision in bits.]

Figure 12: Plotting what Neff looks like empirically. Each black point is a pretraining run; mathematical details of what is plotted here are in Appendix E. Blue lines are parametric fits of a saturating exponential.

[Figure 13: two panels (Training-time Effects, Ptrain; Post-Training Effects, Ppost) plotting val loss against tokens (billions).]

Figure 13: Illustration of what finite-precision effects during training and inference look like on learning curves.

Then as far as the joint functional form for Neff(Pw, Pa, Pkv) is concerned, we acknowledge
that alternative factorizations that do not decompose the model into weights, activations, and KV
cache, may have an equally good fit. For instance, decomposing the weights term into a product of
layer-wise effects has a reasonable fit, though it introduces more parameters, and a more coarse-grained
version may not decompose the model into parts at all, but only consider tied precisions. We choose
this factorized form because QAT considers weights only, and activations and attention are the two
other things that must then be kept in low precision to see compute gains. Since practitioners often
care about the KV cache on its own, we chose to decompose “activations and attention” as “activations
and KV cache.” We emphasize that our main point is not that this factorization is objectively
correct, but that such a factorization, which assumes approximate independence, is possible in
the first place.

L Additional Plots

[Figure 14: paired panels of empirical vs. predicted inference-time degradation over Pw (training precision, bits) and Pinf (post-train quantization precision, bits), for (N, D) = (30M, 1.6B), (60M, 6.6B), (110M, 6.6B), and (220M, 6.6B).]
Figure 14: Predicted vs actual δPTQ for several N, D.

32
[Figure 15: predicted vs. actual loss for marginal sweeps of activation precision (MSE 0.0055, R² 0.9410) and KV cache precision (MSE 0.0003, R² 0.9965).]
Figure 15: Marginal sweeps for precision of activations and KV cache, along with predictions from
an Neff functional form analogous to Equation 3 fitted from scratch.

[Figure 16: (a) predicted vs. actual δPTQ without Neff (MSE 9.24e-02, R² 0.8249); (b) with Neff (MSE 5.06e-02, R² 0.9041); (c) δPTQ vs. training precision; (d) empirical and (e) predicted δPTQ over Pw (training precision, bits) and Ppost (post-training precision, bits).]
Figure 16: Combined plots for predicting degradation. (a) and (b) illustrate different fitting approaches
to model degradation, demonstrating a stronger fit when N ↦ Neff is used. (c), (d), and (e)
illustrate that our unified degradation form can predict degradation when training and serving in any
precision. Plots (c-e) are made for varied Pw, but the fits in (a) and (b) include runs where Pa, Pkv are also
jointly varied.
