Optimizing Large Language Model Training Using FP4 Quantization
Ruizhe Wang 1 2 † Yeyun Gong 3 2 Xiao Liu 3 2 Guoshuai Zhao 3 2 Ziyue Yang 3 2
Baining Guo 3 Zhengjun Zha 1 Peng Cheng 3 2
Abstract

The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.

Figure 1. Directly casting to FP4 results in significantly higher training loss, whereas our proposed FP4 method achieves accuracy comparable to the BF16 baseline. These results are based on experiments with a 400M LLaMA2 model.

1. Introduction

In the past two years, the rapid development of large language models (LLMs) has significantly reshaped both research priorities and industrial practices. Theoretical analyses and empirical evidence consistently demonstrate that scaling up model size leads to substantial performance improvements (Kaplan et al., 2020; Bi et al., 2024). However, training such large-scale models poses considerable challenges, demanding extensive time, energy, and financial resources. For example, Llama 3 405B (Dubey et al., 2024) was trained on up to 16K H100 GPUs for 54 days. Similarly, GPT-4 (Achiam et al., 2023), with an estimated 1T parameters, required an extraordinary amount of computational power. These examples highlight the urgent need for more efficient training methods to keep up with the increasing demands of LLM development.

Model quantization has proven to be an effective technique for reducing training costs, as low-bit arithmetic kernels can save memory and accelerate computations when used appropriately. Most LLM training systems traditionally rely on FP32 (full precision) or FP16/BF16 (half precision) data formats, but quantization enables these formats to be reduced to lower precision, such as 8-bit or even 4-bit.

Recent advancements in computational hardware, such as NVIDIA's H100 GPUs (Nvidia, 2023) and the upcoming B200 GPUs (Nvidia, 2024), have introduced support for low-bit arithmetic kernels, enabling more efficient computation. The Hopper series GPUs feature high-performance FP8 tensor cores, delivering a 2x speed-up compared to FP16 tensor cores. Meanwhile, the Blackwell series GPUs extend this capability by supporting FP6 and FP4 formats, with FP4 offering the potential to double computational throughput over FP8. Studies like FP8-LM (Peng et al., 2023) and NVIDIA's Transformer Engine (Nvidia, 2022) have demonstrated the feasibility of FP8 tensor cores for model training, but the application of FP4 tensor cores in model training remains an open research question.

† Work done during an internship at MSRA. 1 University of Science and Technology of China. 2 Microsoft SIGMA Team. 3 Microsoft Research Asia. Correspondence to: Yeyun Gong <[email protected]>, Peng Cheng <[email protected]>.
However, leveraging 4-bit data formats for neural network training presents significant challenges due to the extremely limited bit width. Directly quantizing LLMs to such a low-bit format often results in substantial accuracy degradation, as shown in Figure 1. This is primarily because low-bit formats are constrained by a limited dynamic range, which increases the risk of overflow and underflow. Even existing methods for 8-bit quantization experience some degree of accuracy loss, underscoring the difficulty of employing a 4-bit format, which provides only 16 distinct representable values.

In this study, we propose, for the first time, a framework for training language models using the FP4 format, validating the feasibility of this ultra-low-precision representation. To tackle the significant quantization errors associated with weights and activations during model training, we present a series of optimization techniques: (1) for weights, we present a differentiable quantization estimator to improve gradient updates in FP4 computations; by analyzing the impact of quantization on the neural network forward and backward passes, we derive a function with correction terms for accurate gradient estimation; (2) for activations, we develop an outlier clamping and compensation strategy to address the outlier values commonly observed during LLM training; by analyzing activation distributions in LLMs, we introduce a clamping method and a sparse auxiliary matrix to preserve quantization accuracy and maintain model performance.

We conduct comprehensive experiments to demonstrate that our FP4 training framework achieves accuracy comparable to models trained in BF16 or FP8 formats with the same hyperparameters. Leveraging the FP8 tensor cores of NVIDIA H100 GPUs to emulate FP4 computations, we train LLMs with up to 13B parameters and 100B training tokens with only a minor training loss gap. In zero-shot evaluation on downstream tasks, models trained with FP4 show competitive results against BF16 models. We anticipate further speed gains once next-generation hardware such as NVIDIA's B-series GPUs becomes available. We will open-source our training code to facilitate future research and adoption.

2. Preliminaries

According to the IEEE 754 standard (Kahan, 1996), a binary floating-point number consists of three components: a 1-bit sign (S), exponent bits (E), and mantissa bits (M). This is commonly represented as ExMy, where x and y denote the number of bits for the exponent and mantissa, respectively. For example, FP16 uses E5M10 and BF16 uses E8M7. FP8 typically has two variants: E4M3 and E5M2. In our work, we adopt the E2M1 format for 4-bit floating-point representation, as defined in prior studies (Rouhani et al., 2023b;a), with 2 bits for the exponent and 1 bit for the mantissa.

Unlike integer (INT) quantization, floating-point (FP) quantization features uneven quantization intervals and a larger dynamic range. To quantize a high-precision tensor such as FP16 to FP4, we employ the commonly used absmax method (Dettmers et al., 2022; Peng et al., 2023):

$$x_{\mathrm{fp4}} = Q(x_{\mathrm{fp16}} \cdot \gamma), \qquad \gamma = \frac{\mathrm{MAX}_{\mathrm{fp4}}}{\max(|x_{\mathrm{fp16}}|)} \tag{1}$$

Here, MAX_fp4 represents the maximum absolute value in the FP4 format, and γ serves as the scaling factor. For the E2M1 configuration, MAX_fp4 is 6.0. Since the FP4 format supports only 2^4 = 16 distinct values, the quantization function Q() is implemented with a look-up table in a custom CUDA kernel. Detailed format regulations and the quantization implementation can be found in Appendix A.
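As a concrete illustration of Equation (1), the sketch below emulates absmax FP4 (E2M1) quantization in PyTorch. It is our own minimal example, not the custom CUDA kernel used in the framework; the names E2M1_VALUES and quantize_to_fp4 are ours, and rounding is done by a nearest-value search rather than a hardware rounding mode.

```python
import torch

# The representable E2M1 values (see Appendix A, Table 3); MAX_fp4 = 6.0.
E2M1_VALUES = torch.tensor([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                             0.5,  1.0,  1.5,  2.0,  3.0,  4.0,  6.0])

def quantize_to_fp4(x_fp16: torch.Tensor):
    """Absmax quantization onto the E2M1 grid, following Eq. (1)."""
    gamma = 6.0 / x_fp16.abs().max().clamp(min=1e-12)   # scaling factor γ
    x_scaled = x_fp16.float() * gamma                    # map values into [-6, 6]
    # Look-up-table-style rounding: pick the nearest representable value.
    idx = (x_scaled.unsqueeze(-1) - E2M1_VALUES).abs().argmin(dim=-1)
    return E2M1_VALUES[idx], gamma                       # dequantize via x_fp4 / gamma

if __name__ == "__main__":
    x = torch.randn(4, 4, dtype=torch.float16)
    x_q, gamma = quantize_to_fp4(x)
    print(x_q.unique())                                  # only E2M1 values appear
    print((x_q / gamma - x.float()).abs().max())         # quantization error
```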
3. Methodology

In a typical linear layer of a Transformer architecture, the computation can be expressed as Y = A · W, where A is the activation tensor and W is the weight tensor. To fully leverage the capabilities of FP4 tensor cores, both A and W need to be quantized to FP4, as shown in Figure 2. However, directly quantizing these tensors into FP4 introduces significant quantization errors. To address this challenge, we propose the differentiable gradient estimator method for weight tensors (Section 3.1) and the outlier clamping and compensation method for activation tensors (Section 3.2).

3.1. Differentiable Gradient Estimator

Quantization functions are inherently non-differentiable, preventing the reverse flow of the gradient during backpropagation. The widely used Straight-Through Estimator (STE) (Bengio et al., 2013) bypasses this issue by assuming that the gradient of the quantized tensor is equivalent to that of the original tensor. However, this simplification introduces inaccuracies in low-bit settings, as noted in prior studies (Yin et al., 2019; Gong et al., 2019).

To overcome these limitations, we propose a Differentiable Gradient Estimator (DGE) that reduces estimation errors. DGE maintains direct quantization for the forward computation to preserve hardware efficiency while introducing a gradient correction term derived from a differentiable approximation of the quantization function.
Figure 2. The structure of the proposed FP4 training scheme during the forward pass of a linear layer. A high-precision tensor, such as BF16, is quantized into the FP4 format using look-up table quantization. During the GeMM computation, both weight and activation tensors are quantized into FP4 to leverage the FP4 tensor cores. Two scaling factors are then applied to the final result to ensure computational correctness.

Suppose we quantize the model weight W with a non-differentiable quantization function f: W_q = f(W). Considering the backward gradient computation for a linear function with a quantized weight, the forward pass can be expressed as:

$$Y = A W_q = A f(W) \tag{2}$$

During backpropagation, the loss gradients with respect to the weight, ∂L/∂W, and the activation, ∂L/∂A, are computed using the gradient propagated from the subsequent layer, ∂L/∂Y. For the weight gradient, the chain rule gives:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial W_q}\,\frac{\partial W_q}{\partial W} = \left(A^{T}\,\frac{\partial L}{\partial Y}\right)\frac{\partial W_q}{\partial W} \tag{3}$$

where ∂W_q/∂W represents the derivative of the quantization function f. Since f is an element-wise function, its derivative f′ is also element-wise. Thus we have:

$$\frac{\partial W_q[i,j]}{\partial W[k,l]} = \begin{cases} f'(W[i,j]), & \text{if } (i,j) = (k,l), \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$

Therefore ∂W_q/∂W is a diagonal matrix. When applied to the chain rule in Equation (3), this diagonal structure allows the gradient computation to be simplified to an element-wise multiplication between the two terms:

$$\frac{\partial L}{\partial W}[i,j] = \frac{\partial L}{\partial W_q}[i,j] \cdot f'(W[i,j]) \tag{5}$$

or, more compactly,

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial W_q} \odot f'(W), \tag{6}$$

where ⊙ denotes the element-wise (Hadamard) product.

Since f is a non-differentiable quantization function, its derivative f′ is almost everywhere zero, leading to vanishing gradients and causing the weight gradient computation in Equation (6) to fail. The Straight-Through Estimator (STE) addresses this issue by assuming f′(W) ≡ 1, thereby bypassing gradient vanishing. In other words, it directly assumes that ∂L/∂W ≡ ∂L/∂W_q.

To achieve more accurate gradient computation, we propose an alternative approach: approximating the quantization function with a well-chosen differentiable function, computing its derivative, and incorporating it into Equation (6). Specifically, we use the following function to simulate the quantization behavior:

$$f(x) = \delta \cdot \left(1 + \operatorname{sign}\!\left(x - \tfrac{\delta}{2}\right) \cdot \left|x - \tfrac{\delta}{2}\right|^{\frac{1}{k}}\right) \tag{7}$$

Figure 3(a) illustrates this function under k = 5 for the range [0, 0.5], which represents the first positive quantization interval in the E2M1 quantization scheme. The figure also shows that under the STE assumption, the forward quantization function is equivalent to f(x) = x because f′(x) ≡ 1. In Equation (7), δ represents the quantization interval, and k is a parameter that controls the degree of approximation. As k increases, the curve becomes sharper and more closely resembles the behavior of the original hard quantization function.

The derivative of Equation (7) can be expressed as:

$$f'(x) = \frac{1}{k} \cdot \left|x - \tfrac{\delta}{2}\right|^{\frac{1}{k}-1} \tag{8}$$

Figure 3(b) and Figure 3(c) show the complete quantization curve f(x) and its derivative f′(x) under k = 5 within the full E2M1 quantization framework. This framework
Figure 3. Visualization of the Differentiable Gradient Estimator (DGE). (a) Comparison of three quantization methods: hard quantization, differentiable quantization, and STE quantization, demonstrated on a single quantization step. (b) The full quantization curve for E2M1 quantization within its dynamic range [−6.0, 6.0]. (c) The derivative curves for the three methods, highlighting that hard quantization has a gradient of f′(x) ≡ 0, while STE assumes a constant gradient of f′(x) ≡ 1.
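The paper's DGE implementation is not reproduced in this extraction; the sketch below is our own illustration of how a correction term such as Equation (8) can be attached to a hard forward quantizer through a custom PyTorch autograd function. The names DGEQuantize, hard_quantize_e2m1, and dge_derivative are ours, a uniform interval δ is assumed for simplicity, and the derivative is clamped because Equation (8) diverges at the interval midpoint.

```python
import torch

E2M1_VALUES = torch.tensor([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                             0.5,  1.0,  1.5,  2.0,  3.0,  4.0,  6.0])

def hard_quantize_e2m1(x):
    """Non-differentiable forward quantizer: nearest E2M1 value."""
    return E2M1_VALUES[(x.unsqueeze(-1) - E2M1_VALUES).abs().argmin(dim=-1)]

def dge_derivative(x, delta=0.5, k=5, max_grad=3.0):
    """f'(x) = (1/k) * |x - delta/2|^(1/k - 1) from Eq. (8), evaluated per interval
    (illustrative: uniform delta) and clamped near the interval midpoint."""
    t = torch.remainder(x, delta)                      # position inside the interval
    d = (1.0 / k) * (t - delta / 2).abs().clamp(min=1e-4).pow(1.0 / k - 1.0)
    return d.clamp(max=max_grad)

class DGEQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return hard_quantize_e2m1(w)                   # exact hard quantization forward

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Eq. (6): dL/dW = dL/dW_q ⊙ f'(W); STE would return grad_out unchanged.
        return grad_out * dge_derivative(w)

if __name__ == "__main__":
    w = torch.randn(3, 3, requires_grad=True)
    DGEQuantize.apply(w).sum().backward()
    print(w.grad)                                      # corrected (non-constant) gradients
```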
Table 2. Zero-shot evaluation for downstream tasks between BF16 models and FP4 models under different model sizes.

| Model Size | Precision | Average | PiQA | Hellaswag | ObQA | Arc-C | Arc-E | BoolQ | LogiQA | SciQ | Lambada |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.3B | BF16 | 53.23 | 71.11 | 50.80 | 36.60 | 36.69 | 68.60 | 57.83 | 30.26 | 83.30 | 43.84 |
| 1.3B | FP4 (Ours) | 53.13 | 70.89 | 50.82 | 36.20 | 36.86 | 67.47 | 58.23 | 29.49 | 83.90 | 44.30 |
| 7B | BF16 | 53.87 | 71.22 | 52.03 | 37.40 | 38.99 | 67.47 | 60.55 | 27.65 | 85.00 | 44.56 |
| 7B | FP4 (Ours) | 54.42 | 71.87 | 52.97 | 38.40 | 39.85 | 67.97 | 62.20 | 27.96 | 84.70 | 43.88 |
| 13B | BF16 | 54.44 | 72.80 | 53.56 | 38.60 | 38.82 | 67.97 | 57.40 | 29.65 | 86.30 | 44.87 |
| 13B | FP4 (Ours) | 54.95 | 73.78 | 54.12 | 39.60 | 39.68 | 67.89 | 55.90 | 30.88 | 85.80 | 46.89 |
Figure 5. Training curves for BF16 models and FP4 models under different model sizes. (a) Training curves for the 1.3B LLaMA model. (b) Training curves for the 7B LLaMA model. (c) Training curves for the 13B LLaMA model.
BF16 curve. Specifically, after training on 100B tokens, the training losses are as follows: 2.55 (FP4) vs. 2.49 (BF16) for the 1.3B model, 2.17 (FP4) vs. 2.07 (BF16) for the 7B model, and 1.97 (FP4) vs. 1.88 (BF16) for the 13B model.

In addition to training loss, we evaluate the models on a diverse set of downstream task datasets in a zero-shot manner, including Arc (Clark et al., 2018), BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), LogiQA (Liu et al., 2020), PiQA (Bisk et al., 2020), SciQ (Welbl et al., 2017), OpenbookQA (ObQA) (Mihaylov et al., 2018), and Lambada (Paperno et al., 2016). These results are obtained through the widely used lm-evaluation-harness library (https://github.com/EleutherAI/lm-evaluation-harness; Gao et al., 2024). As presented in Table 2, models pre-trained with FP4 demonstrate competitive performance in intrinsic in-context learning capabilities. Under the same model size, the average accuracy of FP4-trained models is comparable to, or even slightly exceeds, that of BF16-trained models. Additionally, the results follow the general trend: larger models achieve higher accuracy under the same number of training tokens.

These results highlight that despite the reduced precision, FP4 training achieves nearly equivalent performance to BF16 both in terms of training loss and downstream task accuracy, making it a promising approach for efficient training of large language models.

4.3. Ablation Study

We divide our ablation study into smaller parts to better highlight the findings of FP4 training. All experiments are conducted on the LLaMA 1.3B model, trained with 10B tokens from a subset of the DCLM dataset. To accelerate convergence for this smaller model, the batch size is reduced from 2048 to 256, while other hyperparameters remain consistent with the main experiments.

Precision. Figure 6(a) presents training curves across various precisions, including BF16 (baseline), MS-AMP FP8 (Peng et al., 2023), Transformer-Engine FP8 (Nvidia, 2022), directly-cast FP4, and our FP4 method. We use W4A4 to denote direct quantization, meaning that both weights and activations are quantized to FP4. W4A4+DGE+OCC denotes our FP4 quantization method, which incorporates the Differentiable Gradient Estimator (DGE) and Outlier Clamping and Compensation (OCC) methods introduced in Section 3. The loss curves show that the two FP8 methods and our FP4 approach maintain pretraining accuracy, while directly-cast FP4 exhibits a significant training loss gap.
Figure 6. Ablation studies. (a) Training curves under different precision frameworks. (b) The effect of the proposed Differentiable Gradient Estimator (DGE). (c) The effect of the proposed Outlier Clamping and Compensation method (OCC); note that directly casting activations into 4-bit leads to divergence, and the loss value turns into NaN (Not a Number). (d) Training curves under different quantization granularities of FP4.
Weights. For weight-only 4-bit quantization (W4A8), we evaluate our Differentiable Gradient Estimator (DGE) method alone against direct quantization. As shown in Figure 6(b), the DGE method significantly improves convergence. Notably, directly quantizing weights into 4-bit does not introduce a substantial training loss gap, suggesting that weights are easier to quantize than activations. For the hyperparameter k in this method, a larger k models the quantization function more faithfully, but it also leads to a more unstable correction term for the gradient. The figure shows that a moderate k = 5 gives the best final performance.

Activation. For activation-only 4-bit quantization (W8A4), we evaluate our Outlier Clamping and Compensation (OCC) method alone against direct quantization. Figure 6(c) reveals that directly quantizing activations in FP4 results in divergence, where the loss values turn into NaN (Not a Number) after a certain number of training steps. Outlier clamping and compensation effectively closes this gap and ensures good convergence, re-emphasizing the importance of appropriate treatment of outliers in the absmax quantization framework. For the hyperparameter α in this method, a larger α implies stronger compensation, but at an increased computational cost. Figure 6(c) shows the model loss under three settings, α = 0.999, 0.99, and 0.97, corresponding to sparse compensation matrices with 0.2%, 2%, and 6% non-zero elements, respectively. Although experiments show that a higher α leads to better model accuracy, consistent with the conclusion of Table 1, we believe that α = 0.99 is the better choice when overall computational performance is taken into account.
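Section 3.2, which defines OCC precisely, is not reproduced in this excerpt, so the following sketch is only our interpretation of the quantile-clamping plus sparse-compensation idea described above: activations are clamped at the α-quantile of their absolute values before FP4 quantization, and the clamped-off outlier residual is kept as a sparse high-precision matrix that is compensated after the GeMM. All function names are ours.

```python
import torch

def outlier_clamp_compensate(a: torch.Tensor, alpha: float = 0.99):
    """Split an activation tensor into a clamped dense part and a sparse residual.

    The dense part is bounded by the alpha-quantile of |a| and is what would be
    quantized to FP4; the sparse residual holds the clamped-off outlier mass in
    high precision so it can be added back after the GeMM.
    """
    threshold = torch.quantile(a.abs().flatten().float(), alpha).item()
    a_clamped = a.clamp(min=-threshold, max=threshold)
    residual = a - a_clamped                      # non-zero only at outlier positions
    return a_clamped, residual.to_sparse()        # roughly (1 - alpha) of the entries

if __name__ == "__main__":
    a = torch.randn(128, 256)
    a[0, :8] *= 50.0                              # inject channel-like outliers
    w = torch.randn(256, 64) * 0.02
    a_c, r = outlier_clamp_compensate(a, alpha=0.99)
    # The quantized path would use a_c; the sparse residual is compensated separately:
    y = a_c @ w + torch.sparse.mm(r, w)
    print((y - a @ w).abs().max())                # exact up to floating-point error
```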
Granularity. We also observe that the granularity of FP4 quantization plays a critical role. While FP8 training schemes (Peng et al., 2023; Nvidia, 2022) achieve sufficient accuracy with coarse-grained tensor-wise quantization, Figure 6(d) shows that tensor-wise scaling in FP4 introduces significant errors. To address this, we adopt vector-wise scaling, with token-wise quantization for activations and channel-wise quantization for weights, aligning with the GeMM computation rules discussed in Section 4.1. Notably, applying coarse-grained quantization to activations alone results in more severe accuracy degradation than applying it to weights alone, revealing that activations are harder to quantize than weights, consistent with the activation outlier issue described in Section 3.2.
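The vector-wise scheme can be written compactly: each activation row (token) and each weight column (output channel) gets its own absmax scale, and because the scales factor out of the inner products they can be divided out of the GeMM result as an outer product. The sketch below is our own illustration of this bookkeeping; quantize_e2m1 stands in for the FP4 look-up-table quantizer.

```python
import torch

E2M1_VALUES = torch.tensor([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                             0.5,  1.0,  1.5,  2.0,  3.0,  4.0,  6.0])

def quantize_e2m1(x):
    return E2M1_VALUES[(x.unsqueeze(-1) - E2M1_VALUES).abs().argmin(dim=-1)]

def fp4_gemm_vector_wise(a: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Y = A @ W with token-wise scales for A and channel-wise scales for W."""
    gamma_a = 6.0 / a.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)  # (tokens, 1)
    gamma_w = 6.0 / w.abs().amax(dim=0, keepdim=True).clamp(min=1e-12)  # (1, channels)
    a_q = quantize_e2m1(a * gamma_a)
    w_q = quantize_e2m1(w * gamma_w)
    y = a_q @ w_q                                   # emulated FP4 GeMM
    return y / (gamma_a * gamma_w)                  # outer product of the two scale vectors

if __name__ == "__main__":
    a, w = torch.randn(16, 64), torch.randn(64, 32) * 0.05
    print((fp4_gemm_vector_wise(a, w) - a @ w).abs().mean())
```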
5. Related Work

Quantized Training and Inference. When discussing the quantization of large language models (LLMs) for training, we typically refer to Fully Quantized Training (FQT). Related research efforts have generally used mixed-precision training frameworks (Micikevicius et al., 2017; Mellempudi et al., 2019) to accelerate model training while maintaining model accuracy. While previous research has mainly concentrated on CNNs or DNNs (Sun et al., 2019; Wang et al., 2018; Banner et al., 2018; Yang et al., 2020), recent studies have demonstrated the feasibility of low-bit mixed-precision training for LLMs (Peng et al., 2023; Nvidia, 2022; Fishman et al., 2024; Xi et al., 2024). In contrast to the FQT scheme, research on low-bit computation for inference has focused on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). While PTQ directly quantizes pre-trained models for inference (Dettmers et al., 2022; Frantar et al., 2022; Lin et al., 2024a; Xiao et al., 2023; Yao et al., 2022; Liu et al., 2024), QAT involves fine-tuning or pre-training the model for better low-bit inference performance (Liu et al., 2023b; Cheng et al., 2023; Wang et al., 2023; Dettmers et al., 2024). Our method differs from QAT in that we aim to accelerate the training process itself while maintaining performance, rather than focusing solely on inference efficiency without regard for training speed.

4-bit Quantization. Recent works in PTQ and QAT have successfully applied 4-bit, 2-bit, or even 1-bit quantization to LLM inference (Dettmers & Zettlemoyer, 2023; Wu et al., 2023). However, these methods target LLM inference and require additional computation such as calibration-set fine-tuning (Wang et al., 2024), rotary matrices and low-rank compensation (Lin et al., 2024b; Ashkboos et al., 2024; Li et al., 2024b), quantization parameter search (Liu et al., 2023a), or even retraining the whole network (Ma et al., 2024). In the field of FQT, an early study (Sun et al., 2020) applied a 4-bit radix-4 FP4 format to convolutional neural networks (CNNs). MXFP (Rouhani et al., 2023b) introduced a novel quantization data format for GPT-style models but lacked feasibility validation in full FP4 settings. Xi et al. (2023) proposed an INT4 training framework, but their focus was on fine-tuning tasks with limited applicability to LLM pretraining. In contrast, our work is the first to propose an FP4 training framework tailored for LLMs, validated from scratch, and designed to align with next-generation hardware like Nvidia's B-series GPUs.

Differentiable Quantization. Unlike previous methods focusing on differentiable quantization (Gong et al., 2019; Uhlich et al., 2019; Chen et al., 2019; Li et al., 2022; Huang et al., 2022), which rely on learnable quantization parameters updated through backpropagation, our differentiable gradient estimator method uses a fixed quantization function. We directly change the gradient estimator from STE to DGE during the backward pass, avoiding continuous updates to the quantization function, which are unfriendly to specialized hardware designs. Our approach is more efficient and better suited to hardware acceleration in large-scale training.

Handling Outliers. Our method for handling activation outliers in LLMs differs significantly from existing approaches, which mainly target model inference (Liu et al., 2023a; Li et al., 2024b; Ashkboos et al., 2024; Liu et al., 2024). Activation outliers in LLMs are typically channel-specific (Xiao et al., 2023; Wei et al., 2022). Channel-wise quantization would reduce quantization loss but conflicts with the computation structure of matrix multiplication in linear layers (Xi et al., 2024; Lee et al., 2024). Previous strategies for this problem, such as smoothing outliers (Xiao et al., 2023) or using rotary matrices (Ashkboos et al., 2024; Liu et al., 2024), rely on offline pre-processing, making them incompatible with pretraining tasks. In contrast, our method addresses outliers dynamically during training without requiring separate calibration datasets, which is critical for maintaining efficiency when pretraining large models.

6. Limitation

One primary limitation of this work lies in the absence of dedicated FP4 tensor cores in existing hardware. Consequently, we are unable to directly measure the potential speedup and energy-efficiency gains achievable with native FP4 support. All current experiments rely on FP4 simulation, which introduces additional computational overhead due to extra precision casting and significantly prolongs runtime. Additionally, due to constraints on computational resources, we have not yet extended our experiments to extremely large-scale models or to datasets comprising trillions of tokens. Investigating such scalability remains a critical direction for future research.

7. Conclusion

We propose the first FP4 pretraining framework for modern Large Language Models (LLMs), overcoming the challenges of limited dynamic range and quantization precision in 4-bit formats. By proposing a differentiable gradient estimator and an outlier compensation mechanism, we effectively reduce the accuracy gap between FP4 and higher-precision baselines like FP8 or FP16, achieving comparable performance across diverse model scales. Our findings demonstrate the feasibility of FP4-based training, provide insights into improving quantization methods for ultra-low-precision computing, and may also serve as a call for next-generation hardware designs that enable efficient 4-bit computation kernels.
Impact Statement

This work demonstrates the feasibility of using ultra-low precision formats like FP4 for training large language models, offering a pathway toward energy conservation and reduced carbon emissions in AI development. By significantly lowering computational and memory demands, FP4-based methods can democratize access to advanced AI systems while promoting environmental sustainability.

Additionally, this research calls for next-generation AI accelerators optimized for 4-bit computations, potentially shaping future hardware innovations. However, broader societal implications must be considered, including the risks of misuse and the amplification of biases inherent in large-scale AI models. Addressing these challenges is essential to ensure responsible and equitable adoption of this technology.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. QuaRot: Outlier-free 4-bit inference in rotated LLMs. arXiv preprint arXiv:2404.00456, 2024.

Banner, R., Hubara, I., Hoffer, E., and Soudry, D. Scalable methods for 8-bit training of neural networks. Advances in Neural Information Processing Systems, 31, 2018.

Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432–7439, 2020.

Chen, S., Wang, W., and Pan, S. J. MetaQuant: Learning to quantize by learning to penetrate non-differentiable quantization. Advances in Neural Information Processing Systems, 32, 2019.

Cheng, W., Zhang, W., Shen, H., Cai, Y., He, X., Lv, K., and Liu, Y. Optimize weight rounding via signed gradient descent for the quantization of LLMs. arXiv preprint arXiv:2309.05516, 2023.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.

Dettmers, T. and Zettlemoyer, L. The case for 4-bit precision: k-bit inference scaling laws. In International Conference on Machine Learning, pp. 7750–7774. PMLR, 2023.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Fishman, M., Chmiel, B., Banner, R., and Soudry, D. Scaling FP8 training to trillion-token LLMs. arXiv preprint arXiv:2409.12517, 2024.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac'h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, July 2024. URL https://zenodo.org/records/12608602.

Gong, R., Liu, X., Jiang, S., Li, T., Hu, P., Lin, J., Yu, F., and Yan, J. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4852–4861, 2019.

Huang, X., Shen, Z., Li, S., Liu, Z., Xianghong, H., Wicaksana, J., Xing, E., and Cheng, K.-T. SDQ: Stochastic differentiable quantization with mixed precision. In International Conference on Machine Learning, pp. 9295–9309. PMLR, 2022.
Kahan, W. IEEE Standard 754 for binary floating-point arithmetic. Lecture Notes on the Status of IEEE, 754(94720-1776):11, 1996.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Lee, C., Jin, J., Kim, T., Kim, H., and Park, E. OWQ: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 13355–13364, 2024.

Li, J., Fang, A., Smyrnis, G., Ivgi, M., Jordan, M., Gadre, S., Bansal, H., Guha, E., Keh, S., Arora, K., et al. DataComp-LM: In search of the next generation of training sets for language models. arXiv preprint arXiv:2406.11794, 2024a.

Li, M., Lin, Y., Zhang, Z., Cai, T., Li, X., Guo, J., Xie, E., Meng, C., Zhu, J.-Y., and Han, S. SVDQuant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007, 2024b.

Li, Z., Yang, T., Wang, P., and Cheng, J. Q-ViT: Fully differentiable quantization for vision transformer. arXiv preprint arXiv:2201.07703, 2022.

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024a.

Lin, Y., Tang, H., Yang, S., Zhang, Z., Xiao, G., Gan, C., and Han, S. QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. arXiv preprint arXiv:2405.04532, 2024b.

Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., and Zhang, Y. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. arXiv preprint arXiv:2007.08124, 2020.

Liu, S.-y., Liu, Z., Huang, X., Dong, P., and Cheng, K.-T. LLM-FP4: 4-bit floating-point quantized transformers. arXiv preprint arXiv:2310.16836, 2023a.

Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. LLM-QAT: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023b.

Liu, Z., Zhao, C., Fedorov, I., Soran, B., Choudhary, D., Krishnamoorthi, R., Chandra, V., Tian, Y., and Blankevoort, T. SpinQuant: LLM quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024.

Ma, S., Wang, H., Ma, L., Wang, L., Wang, W., Huang, S., Dong, L., Wang, R., Xue, J., and Wei, F. The era of 1-bit LLMs: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024.

Mellempudi, N., Srinivasan, S., Das, D., and Kaul, B. Mixed precision training with 8-bit floating point. arXiv preprint arXiv:1905.12334, 2019.

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.

Nvidia. Using FP8 with Transformer Engine, 2022. URL https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html.

Nvidia. NVIDIA H100 Tensor Core GPU architecture, 2023. URL https://resources.nvidia.com/en-us-tensor-core.

Nvidia. NVIDIA Blackwell architecture technical brief, 2024. URL https://resources.nvidia.com/en-us-blackwell-architecture.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.

Peng, H., Wu, K., Wei, Y., Zhao, G., Yang, Y., Liu, Z., Xiong, Y., Yang, Z., Ni, B., Hu, J., et al. FP8-LM: Training FP8 large language models. arXiv preprint arXiv:2310.18313, 2023.

Rouhani, B. D., Garegrat, N., Savell, T., More, A., Han, K.-N., Zhao, R., Hall, M., Klar, J., Chung, E., Yu, Y., et al. OCP microscaling formats (MX) specification, 2023a. URL https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf.

Rouhani, B. D., Zhao, R., More, A., Hall, M., Khodamoradi, A., Deng, S., Choudhary, D., Cornea, M., Dellinger, E., Denolf, K., et al. Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537, 2023b.

Sun, X., Choi, J., Chen, C.-Y., Wang, N., Venkataramani, S., Srinivasan, V. V., Cui, X., Zhang, W., and Gopalakrishnan, K. Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
Sun, X., Wang, N., Chen, C.-Y., Ni, J., Agrawal, A., Cui, X., Venkataramani, S., El Maghraoui, K., Srinivasan, V. V., and Gopalakrishnan, K. Ultra-low precision 4-bit training of deep neural networks. Advances in Neural Information Processing Systems, 33:1796–1807, 2020.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Uhlich, S., Mauch, L., Yoshiyama, K., Cardinaux, F., Garcia, J. A., Tiedemann, S., Kemp, T., and Nakamura, A. Differentiable quantization of deep neural networks. arXiv preprint arXiv:1905.11452, 2(8), 2019.

Wang, H., Ma, S., Dong, L., Huang, S., Wang, H., Ma, L., Yang, F., Wang, R., Wu, Y., and Wei, F. BitNet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023.

Wang, J., Liu, H., Feng, D., Ding, J., and Ding, B. FP4-quantization: Lossless 4-bit quantization for large language models. In 2024 IEEE International Conference on Joint Cloud Computing (JCC), pp. 61–67. IEEE, 2024.

Wang, N., Choi, J., Brand, D., Chen, C.-Y., and Gopalakrishnan, K. Training deep neural networks with 8-bit floating point numbers. Advances in Neural Information Processing Systems, 31, 2018.

Wei, X., Zhang, Y., Zhang, X., Gong, R., Zhang, S., Zhang, Q., Yu, F., and Liu, X. Outlier suppression: Pushing the limit of low-bit transformer language models. Advances in Neural Information Processing Systems, 35:17402–17414, 2022.

Welbl, J., Liu, N. F., and Gardner, M. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209, 2017.

Wu, X., Li, C., Aminabadi, R. Y., Yao, Z., and He, Y. Understanding INT4 quantization for language models: latency speedup, composability, and failure cases. In International Conference on Machine Learning, pp. 37524–37539. PMLR, 2023.

Xi, H., Li, C., Chen, J., and Zhu, J. Training transformers with 4-bit integers. Advances in Neural Information Processing Systems, 36:49146–49168, 2023.

Xi, H., Chen, Y., Zhao, K., Teh, K. J., Chen, J., and Zhu, J. Jetfire: Efficient and accurate transformer pretraining with INT8 data flow and per-block quantization. arXiv preprint arXiv:2403.12422, 2024.

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.

Yang, Y., Deng, L., Wu, S., Yan, T., Xie, Y., and Li, G. Training high-performance and large-scale deep neural networks with full 8-bit integers. Neural Networks, 125:70–82, 2020.

Yao, Z., Yazdani Aminabadi, R., Zhang, M., Wu, X., Li, C., and He, Y. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35:27168–27183, 2022.

Yin, P., Lyu, J., Zhang, S., Osher, S., Qi, Y., and Xin, J. Understanding straight-through estimator in training activation quantized neural nets. arXiv preprint arXiv:1903.05662, 2019.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
Where 1.M represents the normalized mantissa with an implicit leading 1, and the bias (e.g., 127 for single precision or
1023 for double precision) adjusts the exponent to account for its encoding. Subnormal numbers, where the exponent is all
zeros, are handled separately with no implicit leading 1. This representation allows for efficient computation but introduces
rounding errors due to the limited number of bits in the mantissa.
The IEEE 754 standard does not define rules for floating-point formats with precision below 16 bits, such as FP8 and FP4.
For 4-bit floating-point representation, we adopt the E2M1 format as defined in prior studies (Rouhani et al., 2023b;a).
According to the IEEE definition, an exponent field (E) filled with ones does not correspond to a valid numeric value;
instead, it represents infinity (Inf) when the mantissa (M) is all zeros or an invalid number (NaN, Not a Number) when the
mantissa contains nonzero bits. However, this rule is often disregarded in FP8 and FP4 formats due to their limited bit width,
as the priority is to maximize the representation of meaningful numerical values. For example, the FP8-E4M3 format does not define Inf, and the FP6 and FP4 formats define neither Inf nor NaN.
Based on the distribution of exponent and mantissa bits, all representable numbers in the FP4 format are listed in Table 3.
| Format / Binary sequence | 1111 | 1110 | 1101 | 1100 | 1011 | 1010 | 1001 | 1000/0000 | 0001 | 0010 | 0011 | 0100 | 0101 | 0110 | 0111 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E1M2 | -3.5 | -3 | -2.5 | -2 | -1.5 | -1 | -0.5 | 0 | 0.5 | 1 | 1.5 | 2 | 2.5 | 3 | 3.5 |
| E2M1 | -6 | -4 | -3 | -2 | -1.5 | -1 | -0.5 | 0 | 0.5 | 1 | 1.5 | 2 | 3 | 4 | 6 |
| E3M0 | -16 | -8 | -4 | -2 | -1 | -0.5 | -0.25 | 0 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 |
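To make the E2M1 encoding concrete, the short decoder below (our own illustration, assuming an exponent bias of 1, which is not stated explicitly in this excerpt) reproduces the positive half of the E2M1 row in the table above, including the subnormal handling described earlier.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 pattern (S EE M) into its real value, assuming bias = 1."""
    sign = -1.0 if (code >> 3) & 0x1 else 1.0
    exp = (code >> 1) & 0x3
    man = code & 0x1
    if exp == 0:
        magnitude = 0.5 * man                      # subnormal: (man/2) * 2^(1 - bias)
    else:
        magnitude = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)  # normal: implicit leading 1
    return sign * magnitude

if __name__ == "__main__":
    print([decode_e2m1(c) for c in range(8)])      # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```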
be parallelized to take advantage of the highly parallel computing power of GPUs. The following code paragraph shows the
implementation of the quantization kernel.
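The original CUDA listing is not reproduced in this extraction. As a stand-in, the following PyTorch sketch (ours) performs the same look-up-table rounding in a vectorized way using torch.bucketize, which reflects how the operation parallelizes on a GPU without per-element branching; the decision boundaries are the midpoints between adjacent E2M1 values, with ties rounding toward the lower value.

```python
import torch

# E2M1 grid and the midpoints between adjacent values (rounding boundaries).
E2M1_VALUES = torch.tensor([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                             0.5,  1.0,  1.5,  2.0,  3.0,  4.0,  6.0])
E2M1_BOUNDS = (E2M1_VALUES[:-1] + E2M1_VALUES[1:]) / 2

def quantize_e2m1_lut(x_scaled: torch.Tensor) -> torch.Tensor:
    """Vectorized nearest-value rounding onto the E2M1 grid.

    Assumes x_scaled has already been multiplied by the scaling factor γ,
    so its values lie in [-6, 6]. Every element is handled independently,
    which maps directly onto a massively parallel GPU kernel.
    """
    idx = torch.bucketize(x_scaled, E2M1_BOUNDS)   # index of the nearest grid value
    return E2M1_VALUES[idx]

if __name__ == "__main__":
    x = torch.empty(8).uniform_(-6, 6)
    print(x)
    print(quantize_e2m1_lut(x))
```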
$$x_{\mathrm{fp4}} = Q(x_{\mathrm{fp16}} \cdot \gamma), \qquad \gamma = \frac{\mathrm{MAX}_{\mathrm{fp4}}}{\max(|x_{\mathrm{fp16}}|)} \tag{10}$$
For the weight tensor with dimensions (ci × co ), channel-wise quantization is performed as follows:
$$W_{\mathrm{scaled}} = W \odot s_f \tag{11}$$

$$W_q^{\mathrm{scaled}} = Q(W_{\mathrm{scaled}}) \tag{12}$$

$$W_q = W_q^{\mathrm{scaled}} \odot \frac{1}{s_f} \tag{13}$$
Here, sf is the scaling factor, and ⊙ represents the element-wise (Hadamard) product. In tensor-wise quantization, sf is
a scalar. For channel-wise quantization, sf is a vector with dimensions (1 × co ). In this case, the ⊙ operation involves
broadcasting sf to each column of the matrix W (ci × co ), followed by element-wise multiplication.
For Equation (13), it is crucial to note that multiplying by 1/sf ensures mathematical correctness. Practically, however, this
step is performed after the GeMM kernel execution. In other words, the quantized weight tensor provided to the GeMM
kernel is the scaled quantized weight tensor Wqscaled from Equation (12). Nevertheless, for mathematical analysis, the
quantized weight tensor Wq must be re-scaled.
In the backward computation, the loss gradient with respect to W is derived from the forward operation Y = AWq .
According to the matrix multiplication rules for differentiation, the gradient ∂L/∂Wq is computed using the activation
gradient ∂L/∂Y from the subsequent layer.
$$\text{Fwd:}\quad Y = A W_q \qquad\qquad \text{Bwd:}\quad \frac{\partial L}{\partial W_q} = A^{T}\,\frac{\partial L}{\partial Y} \tag{14}$$
By applying the chain rule and referring to Equations (11) to (13), the relationship between ∂L/∂Wq and the actual weight
gradient ∂L/∂W is established. According to Equation (13), the gradient ∂L/∂Wqscaled can be expressed in terms of
∂L/∂Wq using the scaling factor sf :
$$\frac{\partial L}{\partial W_q^{\mathrm{scaled}}} = \frac{\partial L}{\partial W_q} \odot \frac{1}{s_f} \tag{15}$$
Subsequently, the differentiable gradient estimator correction term Q′ (x) is applied to compute ∂L/∂Wscaled :
$$\frac{\partial L}{\partial W_{\mathrm{scaled}}} = \frac{\partial L}{\partial W_q^{\mathrm{scaled}}} \odot Q'(W_{\mathrm{scaled}}) \tag{16}$$
where Q′(x) is the differentiable gradient estimator correction term introduced in Equation (8). Finally, the relationship between ∂L/∂W_scaled and ∂L/∂W is derived by incorporating s_f:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial W_{\mathrm{scaled}}} \odot s_f \tag{17}$$
By combining all these steps, the formula for calculating the true weight gradient ∂L/∂W is obtained:
$$\frac{\partial L}{\partial W} = \left(\frac{\partial L}{\partial W_q} \odot \frac{1}{s_f} \odot Q'(W_{\mathrm{scaled}})\right) \odot s_f \tag{18}$$

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial W_q} \odot Q'(W_{\mathrm{scaled}}) \tag{19}$$
Importantly, the scaling and un-scaling steps cancel each other due to the element-wise nature of the operations, resulting
in a simplified expression. This final formula matches the previously demonstrated Equation (6) in the main body of the
paper, with the only difference being that the variables within the DGE correction term must account for scaled weights. No
changes are required for the Q and Q′ functions.
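As a quick sanity check of the cancellation above (our own verification script, not part of the paper), the snippet below evaluates the weight gradient once via Equation (18), with explicit un-scaling and re-scaling, and once via Equation (19), and confirms the two agree; the gradient, correction, and channel-wise scale tensors are arbitrary stand-ins.

```python
import torch

torch.manual_seed(0)
ci, co = 8, 4
grad_wq = torch.randn(ci, co)            # dL/dW_q from the GeMM backward, Eq. (14)
q_prime = torch.rand(ci, co) + 0.5       # stand-in for Q'(W_scaled), Eq. (8)
s_f = torch.rand(1, co) + 0.5            # channel-wise scaling factor, broadcast over rows

# Eq. (18): un-scale the gradient, apply the DGE correction, then re-scale.
grad_w_18 = (grad_wq * (1.0 / s_f) * q_prime) * s_f
# Eq. (19): the scaling factors cancel, leaving only the DGE correction.
grad_w_19 = grad_wq * q_prime

print(torch.allclose(grad_w_18, grad_w_19))   # True (up to floating-point error)
```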
Figure 8. Visualization of the weight tensors in the dense projection layers of the self-attention module.
Figure 9. Visualization of the weight tensors in the up-projection linear layers of the MLP module.
Figure 10. Visualization of the weight tensors in the down-projection linear layers of the MLP module.
Figure 11. Visualization of the activation tensors from the core attention output.
Figure 12. Visualization of the activation tensors from the post-attention layer normalization output.
Figure 13. Visualization of the activation tensors from the MLP down-projection layer output.
Figures 8 to 10 illustrate the distribution of weight tensors, while Figures 11 to 13 show the distribution of activation tensors.
These results are derived from training the LLaMA 1.3B model over 30,000 iterations. The y-axis is set to a logarithmic
scale to enhance visualization. From these figures, it is evident that weight tensors generally exhibit a smaller dynamic
range, while activation tensors have a significantly larger dynamic range, making them more challenging to quantize.
Regarding distribution characteristics, weight tensors typically follow a normal distribution, with certain tensors exhibiting
small outliers. In contrast, activation tensors vary widely in their distributions. For example, core attention outputs often
follow a regular distribution with minimal outliers. However, many activation tensors, such as layer-norm outputs and
transformer layer outputs, display irregular distributions with numerous outliers, making them particularly difficult to
quantize.
Notably, the outliers in activation tensors during LLM training tend to appear in specific channels. This observation is
further validated through heatmap analysis in Figure 14. The result is obtained through the activation function (GeLU)
output from the first transformer layer.
These analyses underscore the critical importance of effectively addressing activation tensors during quantization, especially
their outliers. Future research could gain valuable insights by exploring the complex distributions and outlier behavior of
activation tensor values.
Figure 14. Heatmap visualization of the activation function (GeLU) output from the first transformer layer on the 30,000 training iteration
of the LLaMA 1.3B model. The vertical light lines in the heatmap correspond to specific channel dimensions in the activation tensor,
highlighting the channel-wise distribution of outliers.
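To illustrate how such channel-wise outliers can be spotted numerically during training (our own diagnostic, not a procedure from the paper), the snippet below compares each channel's absolute maximum with the median of those maxima; channels whose ratio is large correspond to the bright vertical lines in Figure 14.

```python
import torch

def find_outlier_channels(act: torch.Tensor, ratio: float = 10.0):
    """Flag channels whose absolute maximum dominates the typical channel.

    act: activation tensor of shape (tokens, channels), e.g. a GeLU output.
    Returns the indices of channels whose per-channel absmax exceeds
    `ratio` times the median per-channel absmax.
    """
    per_channel_max = act.abs().amax(dim=0)            # (channels,)
    typical = per_channel_max.median()
    return torch.nonzero(per_channel_max > ratio * typical).flatten()

if __name__ == "__main__":
    act = torch.randn(1024, 512)
    act[:, [7, 120, 300]] *= 40.0                       # synthetic outlier channels
    print(find_outlier_channels(act))                   # tensor([  7, 120, 300])
```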