
Optimizing Large Language Model Training Using FP4 Quantization

Ruizhe Wang 1 2 † Yeyun Gong 3 2 Xiao Liu 3 2 Guoshuai Zhao 3 2 Ziyue Yang 3 2
Baining Guo 3 Zhengjun Zha 1 Peng Cheng 3 2

arXiv:2501.17116v1 [cs.LG] 28 Jan 2025

Abstract

The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.

Figure 1. Directly casting to FP4 results in significantly higher training loss, whereas our proposed FP4 method achieves accuracy comparable to the BF16 baseline. These results are based on experiments with a 400M LLaMA2 model.

1. Introduction

In the past two years, the rapid development of large language models (LLMs) has significantly reshaped both research priorities and industrial practices. Theoretical analyses and empirical evidence consistently demonstrate that scaling up model size leads to substantial performance improvements (Kaplan et al., 2020; Bi et al., 2024). However, training such large-scale models poses considerable challenges, demanding extensive time, energy, and financial resources. For example, Llama 3 (Dubey et al., 2024) 405B was trained on up to 16K H100 GPUs for 54 days. Similarly, GPT-4 (Achiam et al., 2023), with an estimated 1T parameters, required an extraordinary amount of computational power. These examples highlight the urgent need for more efficient training methods to keep up with the increasing demands of LLM development.

Model quantization has proven to be an effective technique for reducing training costs, as low-bit arithmetic kernels can save memory and accelerate computations when used appropriately. Most LLM training systems traditionally rely on FP32 (full precision) or FP16/BF16 (half precision) data formats, but quantization enables these formats to be reduced to lower precision, such as 8-bit or even 4-bit.

Recent advancements in computational hardware, such as NVIDIA's H100 GPUs (Nvidia, 2023) and the upcoming B200 GPUs (Nvidia, 2024), have introduced support for low-bit arithmetic kernels, enabling more efficient computation. The Hopper series GPUs feature high-performance FP8 tensor cores, delivering a 2x speed-up compared to FP16 tensor cores. Meanwhile, the Blackwell series GPUs extend this capability by supporting FP6 and FP4 formats, with FP4 offering the potential to double computational throughput over FP8. Studies like FP8-LM (Peng et al., 2023) and NVIDIA's Transformer Engine (Nvidia, 2022) have demonstrated the feasibility of FP8 tensor cores for model training. But the application of FP4 tensor cores in model training remains an open research question.

† Work done during an internship at MSRA. 1 University of Science and Technology of China. 2 Microsoft SIGMA Team. 3 Microsoft Research Asia. Correspondence to: Yeyun Gong <[email protected]>, Peng Cheng <[email protected]>.

However, leveraging 4-bit data formats for neural network training presents significant challenges due to the extremely limited bit width. Directly quantizing LLMs to such a low-bit format often results in substantial accuracy degradation, as shown in Figure 1. This is primarily because low-bit formats are constrained by a limited dynamic range, which increases the risk of overflow and underflow. Even existing methods for 8-bit quantization experience some degree of accuracy loss, underscoring the difficulties of employing a 4-bit format, which provides only 16 distinct representable values.

In this study, we propose a pioneering framework for training language models using the FP4 format, providing a validation of the feasibility of this ultra-low precision representation. To tackle the significant quantization errors associated with weights and activations during model training, we present a series of optimization techniques: (1) for weights, we present a differentiable quantization estimator to improve gradient updates in FP4 computations; by analyzing the impact of quantization on neural network forward and backward passes, we derive a function with correction terms for accurate gradient estimation; (2) for activations, we develop an outlier clamping and compensation strategy to address the issue of outlier values commonly observed during LLM training; by analyzing activation distributions in LLMs, we introduce a clamping method and a sparse auxiliary matrix to preserve quantization accuracy and maintain model performance.

We conduct comprehensive experiments to demonstrate that our FP4 training framework achieves accuracy comparable to models trained in BF16 or FP8 formats with the same hyperparameters. Leveraging the FP8 tensor cores of NVIDIA H100 GPUs to emulate FP4 computations, we train LLMs with up to 13B parameters and 100B training tokens, with only a minor training loss gap. For zero-shot evaluation on downstream tasks, models trained with FP4 show competitive results against BF16 models. We anticipate greater speed gains with the availability of next-generation hardware like NVIDIA's B-series GPUs. We will open-source our training code to facilitate future research and adoption.

2. Preliminaries

According to the IEEE 754 standard (Kahan, 1996), a binary floating-point number consists of three components: a 1-bit sign (S), exponent bits (E), and mantissa bits (M). This is commonly represented as ExMy, where x and y denote the number of bits for the exponent and mantissa, respectively. For example, FP16 uses E5M10 and BF16 uses E8M7. FP8 typically has two variants: E4M3 and E5M2. In our work, we adopt the E2M1 format for 4-bit floating-point representation, as defined in prior studies (Rouhani et al., 2023b;a), with 2 bits for the exponent and 1 bit for the mantissa.

Unlike integer (INT) quantization, floating-point (FP) quantization features uneven quantization intervals and a larger dynamic range. To quantize a high-precision tensor such as FP16 to FP4, we employ the commonly used absmax method (Dettmers et al., 2022; Peng et al., 2023):

x_fp4 = Q(x_fp16 · γ),   γ = MAX_fp4 / max(|x_fp16|)   (1)

Here, MAX_fp4 represents the maximum absolute value in the FP4 format, and γ serves as the scaling factor. For the E2M1 configuration, MAX_fp4 is calculated to be 6.0. The quantization function Q() is implemented as a look-up table in a custom CUDA kernel, since the FP4 format supports only 2^4 = 16 distinct values. Detailed format regulations and the quantization implementation can be found in Appendix A.
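To make the absmax procedure concrete, the following minimal PyTorch sketch simulates Equation (1) for the E2M1 format. The function name, the explicit value table, and the simulate-then-dequantize setup are illustrative assumptions for this sketch, not the framework's actual CUDA kernel (see Appendix A for that).

import torch

# E2M1 (FP4) representable values; MAX_fp4 = 6.0 for this format (see Appendix A).
E2M1_VALUES = torch.tensor([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                             0.5,  1.0,  1.5,  2.0,  3.0,  4.0,  6.0])
MAX_FP4 = 6.0

def quantize_e2m1_absmax(x: torch.Tensor):
    """Simulated absmax FP4 quantization, Eq. (1): x_fp4 = Q(x_fp16 * gamma)."""
    gamma = MAX_FP4 / x.abs().max().clamp(min=1e-12)           # scaling factor
    scaled = (x * gamma).clamp(-MAX_FP4, MAX_FP4)
    # Q(): round each element to the nearest E2M1 value (look-up-table behaviour).
    idx = (scaled.unsqueeze(-1) - E2M1_VALUES).abs().argmin(dim=-1)
    return E2M1_VALUES[idx], gamma

x = torch.randn(4, 8)
x_q, gamma = quantize_e2m1_absmax(x)
x_hat = x_q / gamma                                             # dequantized approximation of x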
3. Methodology

In a typical linear layer of a Transformer architecture, the computation can be expressed as Y = A · W, where A is the activation tensor and W is the weight tensor. To fully leverage the capabilities of FP4 tensor cores, both A and W need to be quantized to FP4, as shown in Figure 2. However, directly quantizing these tensors into FP4 introduces significant quantization errors. To address this challenge, we propose the differentiable gradient estimator method for weight tensors (Section 3.1) and the outlier clamping and compensation method for activation tensors (Section 3.2) to mitigate these issues.

3.1. Differentiable Gradient Estimator

Quantization functions are inherently non-differentiable, preventing the reverse flow of the gradient during backpropagation. The widely used Straight-Through Estimator (STE) (Bengio et al., 2013) bypasses this issue by assuming that the gradient of the quantized tensor is equivalent to that of the original tensor. However, this simplification introduces inaccuracies in low-bit settings, as noted in prior studies (Yin et al., 2019; Gong et al., 2019).

To overcome these limitations, we propose a Differentiable Gradient Estimator (DGE) that reduces estimation errors. DGE maintains direct quantization for forward computation to preserve hardware efficiency while introducing a gradient correction term derived from a differentiable approximation of the quantization function.

Figure 2. The structure of the proposed FP4 training scheme during the forward pass of a linear layer. A high-precision tensor, such as BF16, is quantized into the FP4 format using look-up table quantization. During the GeMM computation, both weight and activation tensors are quantized into FP4 to leverage the FP4 tensor cores. Two scaling factors are then applied to the final result to ensure computational correctness.

Suppose we quantize the model weight W with a non-differentiable quantization function f: W_q = f(W). Considering the backward gradient computation for a linear function with quantized weight, the forward pass can be expressed as:

Y = A W_q = A f(W)   (2)

During backpropagation, the loss gradient with respect to the weight ∂L/∂W and the activation ∂L/∂A are computed using the gradient propagated from the subsequent layer, ∂L/∂Y. For the weight gradient, the chain rule gives:

∂L/∂W = (∂L/∂W_q)(∂W_q/∂W) = (A^T ∂L/∂Y)(∂W_q/∂W)   (3)

where ∂W_q/∂W represents the derivative of the quantization function f. Since f is an element-wise function, its derivative f′ is also element-wise. Thus we have:

∂W_q[i,j] / ∂W[k,l] = f′(W[i,j]) if (i,j) = (k,l), and 0 otherwise.   (4)

Therefore ∂W_q/∂W is a diagonal matrix. When applied to the chain rule in Equation (3), this diagonal structure allows simplification of the gradient computation, reducing it to an element-wise multiplication between the two terms:

(∂L/∂W)[i,j] = (∂L/∂W_q)[i,j] · f′(W[i,j])   (5)

or, in simplified form:

∂L/∂W = ∂L/∂W_q ⊙ f′(W)   (6)

where ⊙ denotes the element-wise (Hadamard) product.

Since f is a non-differentiable quantization function, its derivative f′ is almost everywhere zero, leading to vanishing gradients and causing the weight gradient computation to fail, as shown in Equation (6). The Straight-Through Estimator (STE) addresses this issue by assuming f′(W) ≡ 1, thereby bypassing gradient vanishing. In other words, it directly assumes that ∂L/∂W ≡ ∂L/∂W_q.

To achieve more accurate gradient computation, we propose an alternative approach: approximating the quantization function with a well-chosen differentiable function, computing its derivative, and incorporating it into Equation (6). Specifically, we use the following function to simulate the quantization behavior:

f(x) = δ · (1 + sign(x − δ/2) · |x − δ/2|^(1/k))   (7)

Figure 3(a) illustrates this function under k = 5 for the range [0, 0.5], which represents the first positive quantization interval in the E2M1 quantization scheme. This figure also shows that under the assumption of STE, the forward quantization function is equivalent to f(x) = x because f′(x) ≡ 1. In Equation (7), δ represents the quantization interval, and k is a parameter that controls the degree of approximation. As k increases, the function curve becomes sharper and more closely resembles the behavior of the original hard quantization function.

The derivative of Equation (7) can be expressed as:

f′(x) = (1/k) · |x − δ/2|^(1/k − 1)   (8)

Figure 3(b) and Figure 3(c) show the complete quantization curve f(x) and its derivative f′(x) under k = 5 within the full E2M1 quantization framework.

Figure 3. Visualization of the Differentiable Gradient Estimator (DGE). (a) Comparison of three quantization methods: hard quantization, differentiable quantization, and STE quantization, demonstrated on a single quantization step. (b) The full quantization curve for E2M1 quantization within its dynamic range [−6.0, 6.0]. (c) The derivative curves for the three methods, highlighting that hard quantization has a gradient of f′(x) ≡ 0, while STE assumes a constant gradient of f′(x) ≡ 1.

This framework consists of 14 distinct quantization intervals. To prevent excessively large gradient values, the magnitude of f′(x) is capped at 3.0, impacting only a very small subset of elements.

In practical model training, the Differentiable Gradient Estimator (DGE) is seamlessly integrated into the process. During the forward pass, we retain the hard quantization function for computational efficiency. For the backward pass, a correction term derived from Equation (8) is applied to the weight gradient calculation following Equation (6). A supplementary description of the integration process and a proof for the DGE method in the actual training process are provided in Appendix B.
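The integration can be pictured with a short PyTorch sketch: hard quantization in the forward pass, and the correction term of Equation (8), capped at 3.0, multiplied into the incoming gradient as in Equation (6). The class name, the hard_quantize argument, and the use of a single fixed interval width δ are simplifying assumptions of this sketch; the actual E2M1 grid has interval widths that vary across its range.

import torch

class DGEQuantize(torch.autograd.Function):
    """Hard quantization forward; DGE-corrected weight gradient backward (Eqs. 6 and 8)."""

    @staticmethod
    def forward(ctx, w, hard_quantize, delta=0.5, k=5, clip=3.0):
        ctx.save_for_backward(w)
        ctx.delta, ctx.k, ctx.clip = delta, k, clip
        return hard_quantize(w)                 # e.g. an E2M1 look-up quantizer

    @staticmethod
    def backward(ctx, grad_wq):
        (w,) = ctx.saved_tensors
        delta, k, clip = ctx.delta, ctx.k, ctx.clip
        r = torch.remainder(w, delta)           # position of w inside its interval, in [0, delta)
        # f'(x) = (1/k) * |x - delta/2|^(1/k - 1), capped at 3.0 as in the paper.
        corr = (1.0 / k) * (r - delta / 2).abs().pow(1.0 / k - 1.0)
        corr = corr.clamp(max=clip)
        # Eq. (6): dL/dW = dL/dWq ⊙ f'(W); STE would use corr ≡ 1 here instead.
        return grad_wq * corr, None, None, None, None

# Usage (hypothetical quantizer name): w_q = DGEQuantize.apply(weight, my_e2m1_lookup)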

3.2. Outlier Clamping and Compensation

During LLM training, activation tensors are significantly more challenging to quantize than weight tensors. This difficulty arises from the complex distribution of activation tensor values, often dominated by outliers—specific values that are substantially larger than the rest. Outliers pose a significant challenge to tensor quantization by disproportionately expanding the dynamic range of the target tensor, causing most values to underflow to zero after quantization.

To address this issue, we propose the Outlier Clamping and Compensation method (OCC) to restrict the range of activation tensors and mitigate the underflow problem. Specifically, we identify outliers—values with the largest absolute magnitudes—through quantile searching and clamp them to a predefined threshold. Given a pre-defined quantile α, the clamping function can be expressed as:

Y_c = clamp(Y, max = α, min = 1 − α)   (9)

Figure 4 illustrates the impact of quantization with and without outlier clamping, based on a real activation tensor extracted from the first transformer layer's output of the LLaMA 1.3B model after 30,000 training iterations, where α = 0.999. This approach significantly reduces the mean squared error (MSE) between the original and quantized tensors, enhancing quantization quality and maintaining training stability.

Figure 4. Visualization of the outlier clamping method, based on the first transformer layer's output of the LLaMA 1.3B model after 30,000 training iterations. Up: quantization performed without outlier clamping, leading to severe loss of information. Down: quantization after applying outlier clamping, effectively preserving tensor structure.

We also observed that while clamping effectively reduces quantization error, it inherently introduces some error by disregarding the outlier values. To further preserve accuracy, we propose compensating for this error using a sparse outlier matrix.

In our experiments, the quantile clamping threshold α is set relatively high (around 0.99 ∼ 0.999), making the residual matrix ΔY = Y − Y_c highly sparse, with only about 0.2% ∼ 2% non-zero elements. During computation, the clamped matrix Y_c is processed using FP4 GeMM, while ΔY is handled with high-precision sparse matrix multiplication.
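A minimal sketch of this clamp-and-compensate step is given below, assuming torch.quantile for the quantile search and a generic sparse tensor for the residual; the function name and the exact sparse format are illustrative, and the FP4/sparse GeMM calls are only indicated in the comment.

import torch

def clamp_and_compensate(y: torch.Tensor, alpha: float = 0.99):
    """Outlier clamping (Eq. 9) plus a sparse residual matrix for compensation."""
    # Quantile search (kept simple here; very large tensors may need a sampled search).
    hi = torch.quantile(y.float(), alpha)
    lo = torch.quantile(y.float(), 1.0 - alpha)
    y_c = y.clamp(min=lo.item(), max=hi.item())   # clamped tensor, fed to the FP4 GeMM path
    delta_y = y - y_c                             # non-zero only where outliers were clamped
    return y_c, delta_y.to_sparse()               # residual handled by high-precision sparse GeMM

# The layer output is then recovered as
#   Y ≈ FP4_GeMM(Q(Y_c), Q(W)) + Sparse_GeMM(delta_Y, W)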
Table 1. Quantitative analysis of mathematical accuracy between original and quantized activation tensors. Results represent the average values obtained across all activation tensors over the 30,000 training iterations of the LLaMA 1.3B model.

Clamp   Comp   Quantile   SIM ↑    MSE ↓    SNR ↑
×       —      —          92.19%   0.1055    8.31
√       ×      99.9       98.83%   0.0366   14.25
√       √      99.9       99.61%   0.0245   15.31
√       √      99         100%     0.0099   18.38
√       √      97         100%     0.0068   20.88

Table 1 provides a quantitative analysis of cosine similarity (SIM), mean squared error (MSE), and signal-to-noise ratio (SNR) between the original activation tensors and the quantized tensors. These results represent average values obtained across all activation tensors over the 30,000 training iterations of the LLaMA 1.3B model, demonstrating the impact of outlier clamping and compensation on preserving tensor fidelity during real model training. The data show that outlier clamping significantly improves both cosine similarity and SNR. Moreover, incorporating outlier compensation further reduces quantization loss. Notably, lowering the quantile threshold increases the compensation scale, further reducing quantization loss. However, this introduces a trade-off between computational efficiency and numerical accuracy that must be carefully considered.
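For reference, the three quantities reported in Table 1 can be computed as in the sketch below. It assumes the usual definitions (cosine similarity over flattened tensors, mean squared error, and SNR in decibels as 10·log10 of signal power over error power); the paper's exact measurement code may differ in detail.

import torch
import torch.nn.functional as F

def quantization_metrics(original: torch.Tensor, reconstructed: torch.Tensor):
    """Cosine similarity (SIM), MSE, and SNR (dB) between a tensor and its quantized copy."""
    x = original.flatten().float()
    x_hat = reconstructed.flatten().float()
    sim = F.cosine_similarity(x, x_hat, dim=0)
    mse = torch.mean((x - x_hat) ** 2)
    snr = 10.0 * torch.log10(x.pow(2).mean() / mse.clamp(min=1e-12))
    return sim.item(), mse.item(), snr.item()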
4. Experiment

In this section, we evaluate the proposed FP4 training framework across language models of various sizes. Section 4.1 details the implementation of our FP4 training framework, including the model architecture and hyperparameters. Section 4.2 presents the main results, showcasing training curves and zero-shot performance on downstream tasks. Finally, Section 4.3 provides ablation studies to further validate the effectiveness of each component.

4.1. Experiment Setup

During LLM training, General Matrix Multiplication (GeMM) accounts for over 95% of the computational workload, with this proportion increasing for larger models. Consistent with prior works (Xi et al., 2023; Yang et al., 2020; Dettmers et al., 2022), we focus on 4-bit quantization for GeMM operations, a core feature of FP4 tensor cores. In a GeMM Y = AW, where A (sequence length × input channels) is the activation tensor and W (input channels × output channels) is the weight tensor, quantization is applied along distinct dimensions to align with matrix multiplication logic: A is quantized token-wise (sequence length dimension), while W is quantized channel-wise (output channels dimension). The aforementioned accuracy-preserving techniques are integrated to minimize quantization error. Since FP4 tensor cores are unavailable, we validate FP4 performance using Nvidia H-series GPUs' FP8 tensor cores, which encompass FP4's dynamic range and enable accurate simulation.
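A compact sketch of this vector-wise scheme is shown below: per-token (row) scaling factors for A, per-output-channel (column) scaling factors for W, and both scale vectors undone on the GeMM output, mirroring the flow in Figure 2. The nearest-value rounding helper stands in for the FP4 tensor-core path and is an assumption of this sketch.

import torch

E2M1 = torch.tensor([-6., -4., -3., -2., -1.5, -1., -.5, 0., .5, 1., 1.5, 2., 3., 4., 6.])

def round_e2m1(x):
    # Nearest representable E2M1 value (simulated FP4 rounding).
    return E2M1[(x.unsqueeze(-1) - E2M1).abs().argmin(dim=-1)]

def vector_wise_fp4_gemm(a, w, max_fp4=6.0):
    """A: (s, ci) quantized token-wise; W: (ci, co) quantized channel-wise (output dim)."""
    gamma_a = max_fp4 / a.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)   # shape (s, 1)
    gamma_w = max_fp4 / w.abs().amax(dim=0, keepdim=True).clamp(min=1e-12)   # shape (1, co)
    y = round_e2m1(a * gamma_a) @ round_e2m1(w * gamma_w)   # FP4 x FP4 GeMM (simulated here)
    return y / (gamma_a * gamma_w)                          # undo row and column scaling factors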
In mixed-precision training (Micikevicius et al., 2017), non-GeMM operations, which account for a minor computational fraction, are performed at higher precision to preserve accuracy. Following the framework in (Peng et al., 2023), we perform gradient communication in FP8 format to reduce bandwidth usage and adopt their mixed-precision Adam optimizer to conserve GPU memory. Gradients and first-order moments are stored in FP8, while second-order moments are stored in FP16. Remaining operations, comprising a smaller computational portion, are executed in FP16 or BF16 for stability and precision.

We adopt the widely recognized LLaMA 2 model (Touvron et al., 2023) as the primary model architecture. The training is conducted from scratch using the DCLM dataset (Li et al., 2024a), a comprehensive dataset well-suited for language model pretraining. Hyperparameters remain consistent across precision settings for fair comparison. The learning rate follows a warm-up and cosine decay schedule, with the warm-up phase spanning 5% of total steps and the learning rate gradually decreasing to 10% of its peak over the remaining 90%. The peak learning rate is 3 × 10^−4, with a weight decay of 0.1. For the Adam optimizer, we use β1 = 0.9, β2 = 0.95, and ϵ = 1 × 10^−8. For the special hyperparameters used in the FP4 method, we use k = 5 for the differentiable gradient estimator and select α = 0.99 as the activation clamp and compensation quantile. Input sequences are fixed at 2048 tokens, and the batch size is 2048 sequences, comprising approximately 4M tokens.
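The optimizer and schedule described above can be sketched as follows, assuming a LambdaLR-style implementation and PyTorch's AdamW as a stand-in for the mixed-precision Adam variant; only the quoted hyperparameter values come from this section.

import math
import torch

def build_optimizer_and_schedule(model, total_steps, peak_lr=3e-4):
    # Values quoted in Section 4.1; the exact schedule implementation is an assumption.
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                            betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)
    warmup = int(0.05 * total_steps)                  # warm-up over 5% of total steps

    def lr_lambda(step):
        if step < warmup:
            return step / max(1, warmup)              # linear warm-up to the peak rate
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 10% of peak

    return opt, torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)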
13B) trained with BF16 and FP4 precision. All models are
During LLM training, General Matrix Multiplication trained on 100B tokens using the same dataset and identical
(GeMM) accounts for over 95% of the computational work- hyperparameters. The curves for BF16 and FP4 largely
load, with this proportion increasing for larger models. Con- overlap across different model sizes, with the FP4 curve
sistent with prior works (Xi et al., 2023; Yang et al., 2020; exhibiting a slightly higher training loss compared to the
Dettmers et al., 2022), we focus on 4-bit quantization for

5
Optimizing Large Language Model Training Using FP4 Quantization

Table 2. Zero-shot evaluation for downstream tasks between BF16 models and FP4 models under different model sizes.

Model Size   Precision   Average   PiQA    Hellaswag   ObQA    Arc-C   Arc-E   BoolQ   LogiQA   SciQ    Lambada
1.3B         BF16        53.23     71.11   50.80       36.60   36.69   68.60   57.83   30.26    83.30   43.84
1.3B         FP4(Ours)   53.13     70.89   50.82       36.20   36.86   67.47   58.23   29.49    83.90   44.30
7B           BF16        53.87     71.22   52.03       37.40   38.99   67.47   60.55   27.65    85.00   44.56
7B           FP4(Ours)   54.42     71.87   52.97       38.40   39.85   67.97   62.20   27.96    84.70   43.88
13B          BF16        54.44     72.80   53.56       38.60   38.82   67.97   57.40   29.65    86.30   44.87
13B          FP4(Ours)   54.95     73.78   54.12       39.60   39.68   67.89   55.90   30.88    85.80   46.89

Figure 5. Training curves for BF16 models and FP4 models under different model sizes. (a) Training curves for the 1.3B LLaMA model. (b) Training curves for the 7B LLaMA model. (c) Training curves for the 13B LLaMA model.

Specifically, after training on 100B tokens, the training losses are as follows: 2.55 (FP4) vs. 2.49 (BF16) for the 1.3B model, 2.17 (FP4) vs. 2.07 (BF16) for the 7B model, and 1.97 (FP4) vs. 1.88 (BF16) for the 13B model.

In addition to training loss, we evaluate the models on a diverse set of downstream task datasets in a zero-shot manner, including Arc (Clark et al., 2018), BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), LogiQA (Liu et al., 2020), PiQA (Bisk et al., 2020), SciQ (Welbl et al., 2017), OpenbookQA (ObQA) (Mihaylov et al., 2018), and Lambada (Paperno et al., 2016). These results are obtained through the widely used lm-evaluation-harness library (Gao et al., 2024) (https://github.com/EleutherAI/lm-evaluation-harness). As presented in Table 2, models pre-trained with FP4 demonstrate competitive performance in intrinsic in-context learning capabilities. Under the same model size, the average accuracy of FP4-trained models is comparable to, or even slightly exceeds, that of BF16-trained models. Additionally, the results follow the general trend: larger models achieve higher accuracy under the same number of training tokens.

These results highlight that despite the reduced precision, FP4 training achieves nearly equivalent performance to BF16 both in terms of training loss and downstream task accuracy, making it a promising approach for efficient training of large language models.

4.3. Ablation Study

We divide our ablation study into smaller parts to better highlight the findings of FP4 training. All experiments are conducted on the LLaMA 1.3B model, trained with 10B tokens from a subset of the DCLM dataset. To accelerate convergence for this smaller model, the batch size is reduced from 2048 to 256, while other hyperparameters remain consistent with the main experiments.

Precision. Figure 6(a) presents training curves across various precisions, including BF16 (baseline), MS-AMP FP8 (Peng et al., 2023), Transformer-Engine FP8 (Nvidia, 2022), directly-cast FP4, and our FP4 method. We use W4A4 to denote direct quantization, meaning that both weights and activations are quantized to FP4. Meanwhile, W4A4+DGE+OCC denotes our FP4 quantization method that incorporates the Differentiable Gradient Estimator (DGE) and Outlier Clamping and Compensation (OCC) methods introduced in Section 3. The loss curves show that the two FP8 methods and our FP4 approach maintain pretraining accuracy, while directly-cast FP4 has a significant training loss gap.

Weights. For weight-only 4-bit quantization (W4A8), we evaluate our Differentiable Gradient Estimator (DGE) method alone against direct quantization.

Figure 6. Ablation studies. (a) Training curves under different precision frameworks. (b) The effect of the proposed Differentiable Gradient Estimator (DGE). (c) The effect of the proposed Outlier Clamping and Compensation method (OCC). Note that directly casting activations into 4-bit leads to divergence, and the loss value turns into NaN (Not a Number). (d) Training curves under different quantization granularities of FP4.

As shown in Figure 6(b), the DGE method significantly improves convergence. Notably, directly quantizing weights into 4-bit does not introduce a substantial training loss gap, suggesting that weights are easier to quantize than activations. For the hyperparameter k in this method, a larger k can better model the quantization function, but it can also lead to a more unstable correction term for the gradient. It can also be seen in the figure that a moderate k = 5 gives better final performance.

Activation. For activation-only 4-bit quantization (W8A4), we evaluate our Outlier Clamping and Compensation (OCC) method alone against direct quantization. Figure 6(c) reveals that directly quantizing activations in FP4 results in curve divergence, where the loss values turn into NaN (Not a Number) after certain training steps. Outlier clamping and compensation effectively reduces this loss gap, ensuring good convergence. This experiment re-emphasizes the importance of appropriate treatment of outliers in the absmax quantization framework. For the hyperparameter α in this method, a larger α implies a stronger compensation, but at an increased computational cost. Figure 6(c) shows the model loss under three settings, α = 0.999, 0.99, 0.97, corresponding to sparse compensation matrices with 0.2%, 2% and 6% non-zero elements, respectively. Although experiments show that a higher α leads to better model accuracy, which is consistent with the conclusion of Table 1, we believe that α = 0.99 is a better choice when overall computational performance is taken into account.

Granularity. We also observe that the granularity of FP4 quantization plays a critical role. While FP8 training schemes (Peng et al., 2023; Nvidia, 2022) achieve sufficient accuracy with coarse-grained tensor-wise quantization, Figure 6(d) shows that tensor-wise scaling in FP4 introduces significant errors. To address this, we adopt vector-wise scaling, with token-wise quantization for activations and channel-wise quantization for weights, aligning with GeMM computation rules as discussed in Section 4.1. Notably, applying coarse-grained quantization to activations alone results in more severe accuracy degradation than applying it to weights alone, revealing that activations are harder to quantize than weights, consistent with the activation outlier issue described in Section 3.2.

5. Related Work

Quantized Training and Inference. When discussing the quantization of large language models (LLMs) for training, we typically refer to Fully Quantized Training (FQT). Related research efforts have generally used Mixed Precision Training (Micikevicius et al., 2017; Mellempudi et al., 2019) frameworks to accelerate model training while maintaining model accuracy. While previous research has mainly concentrated on CNNs or DNNs (Sun et al., 2019; Wang et al., 2018; Banner et al., 2018; Yang et al., 2020), recent studies have demonstrated the feasibility of low-bit mixed precision training for LLMs (Peng et al., 2023; Nvidia, 2022; Fishman et al., 2024; Xi et al., 2024). In contrast to the FQT scheme, research on low-bit computation for inference has focused on Post-Training Quantization (PTQ) and Quantization Aware Training (QAT). While PTQ directly quantizes pre-trained models for inference (Dettmers et al., 2022; Frantar et al., 2022; Lin et al., 2024a; Xiao et al., 2023; Yao et al., 2022; Liu et al., 2024), QAT involves fine-tuning or pre-training the model for better low-bit inference performance (Liu et al., 2023b; Cheng et al., 2023; Wang et al., 2023; Dettmers et al., 2024). Our method differs from QAT, as we aim to accelerate the training process while maintaining performance, rather than solely focusing on improving inference efficiency without consideration for the training speed.

4-bit Quantization. Recent works in PTQ and QAT have successfully applied 4-bit, 2-bit or even 1-bit quantization to LLM inference (Dettmers & Zettlemoyer, 2023; Wu et al., 2023). However, these methods focused on LLM inference, requiring additional computation such as calibration-set fine-tuning (Wang et al., 2024), rotary matrices and low-rank compensation (Lin et al., 2024b; Ashkboos et al., 2024; Li et al., 2024b), quantization parameter searching (Liu et al., 2023a), or even retraining the whole network (Ma et al., 2024). In the field of FQT, an early study (Sun et al., 2020) applied a 4-bit radix-4 FP4 format to convolutional neural networks (CNNs). MXFP (Rouhani et al., 2023b) introduced a novel quantization data format for GPT-style models, but lacked feasibility validation on full FP4 settings. (Xi et al., 2023) proposed an INT4 training framework, but their focus was on fine-tuning tasks with limited applicability to LLM pretraining. In contrast, our work is the first to propose an FP4 training framework tailored for LLMs, validated from scratch, and designed to align with next-generation hardware like Nvidia's B-series GPUs.

Differentiable Quantization. Unlike previous methods focusing on differentiable quantization (Gong et al., 2019; Uhlich et al., 2019; Chen et al., 2019; Li et al., 2022; Huang et al., 2022), which rely on learnable quantization parameters updated through backpropagation, our differentiable gradient estimator method uses a fixed quantization function. We directly change the gradient estimator from STE to DGE during the backward pass, avoiding the need for continuous updates to the quantization function, which is not friendly to specialized hardware designs. Our approach is more efficient and more suitable for hardware acceleration in large-scale training.

Handling Outliers. Our method for handling activation outliers in LLMs differs significantly from existing approaches, which mainly target model inference (Liu et al., 2023a; Li et al., 2024b; Ashkboos et al., 2024; Liu et al., 2024). Activation outliers in LLMs are typically channel-specific (Xiao et al., 2023; Wei et al., 2022). Channel-wise quantization would reduce quantization loss but conflicts with the computation structure of matrix multiplication in linear layers (Xi et al., 2024; Lee et al., 2024). Previous strategies to solve this problem, such as smoothing outliers (Xiao et al., 2023) or using rotary matrices (Ashkboos et al., 2024; Liu et al., 2024), rely on offline pre-processing, making them incompatible with pretraining tasks. In contrast, our method addresses outliers dynamically during real-time training without requiring separate calibration datasets, which is critical for maintaining efficiency in pretraining large models.

6. Limitation

One primary limitation of this work lies in the absence of dedicated FP4 tensor cores in existing hardware. Consequently, we are unable to directly measure the potential speedup and energy efficiency gains achievable with native FP4 support. All current experiments rely on FP4 simulations, which introduce additional computational overhead due to extra precision casting and significantly prolong runtime. Additionally, due to constraints on computational resources, we have not yet extended our experiments to extremely large-scale models or to datasets comprising trillions of tokens. Investigating such scalability remains a critical direction for future research.

7. Conclusion

We propose the first FP4 pretraining framework for modern Large Language Models (LLMs), overcoming the challenges of limited dynamic range and quantization precision in 4-bit formats. By proposing a differentiable gradient estimator and an outlier compensation mechanism, we effectively reduce the accuracy gap between FP4 and higher-precision baselines like FP8 or FP16, achieving comparable performance across diverse model scales. Our findings demonstrate the feasibility of FP4-based training, providing insights into improving quantization methods for ultra-low-precision computing, and may also serve as a call for next-generation hardware designs to enable efficient 4-bit computation kernels.

Impact Statement

This work demonstrates the feasibility of using ultra-low precision formats like FP4 for training large language models, offering a pathway toward energy conservation and reduced carbon emissions in AI development. By significantly lowering computational and memory demands, FP4-based methods can democratize access to advanced AI systems while promoting environmental sustainability.

Additionally, this research calls for next-generation AI accelerators optimized for 4-bit computations, potentially shaping future hardware innovations. However, broader societal implications must be considered, including the risks of misuse and the amplification of biases inherent in large-scale AI models. Addressing these challenges is essential to ensure responsible and equitable adoption of this technology.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. Quarot: Outlier-free 4-bit inference in rotated llms. arXiv preprint arXiv:2404.00456, 2024.

Banner, R., Hubara, I., Hoffer, E., and Soudry, D. Scalable methods for 8-bit training of neural networks. Advances in Neural Information Processing Systems, 31, 2018.

Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432–7439, 2020.

Chen, S., Wang, W., and Pan, S. J. Metaquant: Learning to quantize by learning to penetrate non-differentiable quantization. Advances in Neural Information Processing Systems, 32, 2019.

Cheng, W., Zhang, W., Shen, H., Cai, Y., He, X., Lv, K., and Liu, Y. Optimize weight rounding via signed gradient descent for the quantization of llms. arXiv preprint arXiv:2309.05516, 2023.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

Dettmers, T. and Zettlemoyer, L. The case for 4-bit precision: k-bit inference scaling laws. In International Conference on Machine Learning, pp. 7750–7774. PMLR, 2023.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Fishman, M., Chmiel, B., Banner, R., and Soudry, D. Scaling fp8 training to trillion-token llms. arXiv preprint arXiv:2409.12517, 2024.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac'h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records/12608602.

Gong, R., Liu, X., Jiang, S., Li, T., Hu, P., Lin, J., Yu, F., and Yan, J. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4852–4861, 2019.

Huang, X., Shen, Z., Li, S., Liu, Z., Xianghong, H., Wicaksana, J., Xing, E., and Cheng, K.-T. Sdq: Stochastic differentiable quantization with mixed precision. In International Conference on Machine Learning, pp. 9295–9309. PMLR, 2022.

Kahan, W. Ieee standard 754 for binary floating-point arithmetic. Lecture Notes on the Status of IEEE, 754(94720-1776):11, 1996.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Lee, C., Jin, J., Kim, T., Kim, H., and Park, E. Owq: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 13355–13364, 2024.

Li, J., Fang, A., Smyrnis, G., Ivgi, M., Jordan, M., Gadre, S., Bansal, H., Guha, E., Keh, S., Arora, K., et al. Datacomp-lm: In search of the next generation of training sets for language models. arXiv preprint arXiv:2406.11794, 2024a.

Li, M., Lin, Y., Zhang, Z., Cai, T., Li, X., Guo, J., Xie, E., Meng, C., Zhu, J.-Y., and Han, S. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007, 2024b.

Li, Z., Yang, T., Wang, P., and Cheng, J. Q-vit: Fully differentiable quantization for vision transformer. arXiv preprint arXiv:2201.07703, 2022.

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024a.

Lin, Y., Tang, H., Yang, S., Zhang, Z., Xiao, G., Gan, C., and Han, S. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532, 2024b.

Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., and Zhang, Y. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning. arXiv preprint arXiv:2007.08124, 2020.

Liu, S.-y., Liu, Z., Huang, X., Dong, P., and Cheng, K.-T. Llm-fp4: 4-bit floating-point quantized transformers. arXiv preprint arXiv:2310.16836, 2023a.

Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023b.

Liu, Z., Zhao, C., Fedorov, I., Soran, B., Choudhary, D., Krishnamoorthi, R., Chandra, V., Tian, Y., and Blankevoort, T. Spinquant: Llm quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024.

Ma, S., Wang, H., Ma, L., Wang, L., Wang, W., Huang, S., Dong, L., Wang, R., Xue, J., and Wei, F. The era of 1-bit llms: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024.

Mellempudi, N., Srinivasan, S., Das, D., and Kaul, B. Mixed precision training with 8-bit floating point. arXiv preprint arXiv:1905.12334, 2019.

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.

Nvidia. Using fp8 with transformer engine, 2022. URL https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html.

Nvidia. Nvidia h100 tensor core gpu architecture, 2023. URL https://resources.nvidia.com/en-us-tensor-core.

Nvidia. Nvidia blackwell architecture technical brief, 2024. URL https://resources.nvidia.com/en-us-blackwell-architecture.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.

Peng, H., Wu, K., Wei, Y., Zhao, G., Yang, Y., Liu, Z., Xiong, Y., Yang, Z., Ni, B., Hu, J., et al. Fp8-lm: Training fp8 large language models. arXiv preprint arXiv:2310.18313, 2023.

Rouhani, B. D., Garegrat, N., Savell, T., More, A., Han, K.-N., Zhao, R., Hall, M., Klar, J., Chung, E., Yu, Y., et al. Ocp microscaling formats (mx) specification, 2023a. URL https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf.

Rouhani, B. D., Zhao, R., More, A., Hall, M., Khodamoradi, A., Deng, S., Choudhary, D., Cornea, M., Dellinger, E., Denolf, K., et al. Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537, 2023b.

Sun, X., Choi, J., Chen, C.-Y., Wang, N., Venkataramani, S., Srinivasan, V. V., Cui, X., Zhang, W., and Gopalakrishnan, K. Hybrid 8-bit floating point (hfp8) training and inference for deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Sun, X., Wang, N., Chen, C.-Y., Ni, J., Agrawal, A., Cui, X., Venkataramani, S., El Maghraoui, K., Srinivasan, V. V., and Gopalakrishnan, K. Ultra-low precision 4-bit training of deep neural networks. Advances in Neural Information Processing Systems, 33:1796–1807, 2020.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Uhlich, S., Mauch, L., Yoshiyama, K., Cardinaux, F., Garcia, J. A., Tiedemann, S., Kemp, T., and Nakamura, A. Differentiable quantization of deep neural networks. arXiv preprint arXiv:1905.11452, 2(8), 2019.

Wang, H., Ma, S., Dong, L., Huang, S., Wang, H., Ma, L., Yang, F., Wang, R., Wu, Y., and Wei, F. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023.

Wang, J., Liu, H., Feng, D., Ding, J., and Ding, B. Fp4-quantization: Lossless 4bit quantization for large language models. In 2024 IEEE International Conference on Joint Cloud Computing (JCC), pp. 61–67. IEEE, 2024.

Wang, N., Choi, J., Brand, D., Chen, C.-Y., and Gopalakrishnan, K. Training deep neural networks with 8-bit floating point numbers. Advances in Neural Information Processing Systems, 31, 2018.

Wei, X., Zhang, Y., Zhang, X., Gong, R., Zhang, S., Zhang, Q., Yu, F., and Liu, X. Outlier suppression: Pushing the limit of low-bit transformer language models. Advances in Neural Information Processing Systems, 35:17402–17414, 2022.

Welbl, J., Liu, N. F., and Gardner, M. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209, 2017.

Wu, X., Li, C., Aminabadi, R. Y., Yao, Z., and He, Y. Understanding int4 quantization for language models: latency speedup, composability, and failure cases. In International Conference on Machine Learning, pp. 37524–37539. PMLR, 2023.

Xi, H., Li, C., Chen, J., and Zhu, J. Training transformers with 4-bit integers. Advances in Neural Information Processing Systems, 36:49146–49168, 2023.

Xi, H., Chen, Y., Zhao, K., Teh, K. J., Chen, J., and Zhu, J. Jetfire: Efficient and accurate transformer pretraining with int8 data flow and per-block quantization. arXiv preprint arXiv:2403.12422, 2024.

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.

Yang, Y., Deng, L., Wu, S., Yan, T., Xie, Y., and Li, G. Training high-performance and large-scale deep neural networks with full 8-bit integers. Neural Networks, 125:70–82, 2020.

Yao, Z., Yazdani Aminabadi, R., Zhang, M., Wu, X., Li, C., and He, Y. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35:27168–27183, 2022.

Yin, P., Lyu, J., Zhang, S., Osher, S., Qi, Y., and Xin, J. Understanding straight-through estimator in training activation quantized neural nets. arXiv preprint arXiv:1903.05662, 2019.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.

A. Implementation of FP4 Quantization


Floating-point numbers in a computer are represented using a binary format defined by the IEEE 754 standard (Kahan, 1996).
Each number is divided into three components: the sign bit (S), the exponent (E), and the mantissa (or significand, M). This
is commonly represented as ExMy, where x and y denote the number of bits for the exponent and mantissa, respectively.
The sign bit determines whether the number is positive (S = 0) or negative (S = 1). The exponent, stored in a biased
format, encodes the power of two that scales the number, enabling the representation of a wide range of values. The mantissa
contains the significant digits of the number, capturing its precision. A normalized floating-point number is decoded as:

Value = (−1)^S × (1.M) × 2^(E − bias)

Where 1.M represents the normalized mantissa with an implicit leading 1, and the bias (e.g., 127 for single precision or
1023 for double precision) adjusts the exponent to account for its encoding. Subnormal numbers, where the exponent is all
zeros, are handled separately with no implicit leading 1. This representation allows for efficient computation but introduces
rounding errors due to the limited number of bits in the mantissa.
The IEEE 754 standard does not define rules for floating-point formats with precision below 16 bits, such as FP8 and FP4.
For 4-bit floating-point representation, we adopt the E2M1 format as defined in prior studies (Rouhani et al., 2023b;a).
According to the IEEE definition, an exponent field (E) filled with ones does not correspond to a valid numeric value;
instead, it represents infinity (Inf) when the mantissa (M) is all zeros or an invalid number (NaN, Not a Number) when the
mantissa contains nonzero bits. However, this rule is often disregarded in FP8 and FP4 formats due to their limited bit width,
as the priority is to maximize the representation of meaningful numerical values. For example, FP8-E4M3 format doesn’t
define Inf, FP6 and FP4 formats don’t define both Inf and NaN.
Based on the distribution of exponent and mantissa bits, all representable numbers in the FP4 format are listed in Table 3.

Table 3. FP4 Quantization Table under different FP4 formats.

Format   1111   1110   1101   1100   1011   1010   1001    1000/0000   0001   0010   0011   0100   0101   0110   0111
E1M2     -3.5   -3     -2.5   -2     -1.5   -1     -0.5    ±0          0.5    1      1.5    2      2.5    3      3.5
E2M1     -6     -4     -3     -2     -1.5   -1     -0.5    ±0          0.5    1      1.5    2      3      4      6
E3M0     -16    -8     -4     -2     -1     -0.5   -0.25   ±0          0.25   0.5    1      2      4      8      16

Figure 7. Visualization of all representable numbers in different FP4 formats.


The ”E0M3” format is not included here because it is equivalent to the INT4 format, as it doesn’t have any exponent bits.
From Table 3 and Figure 7, we observe that increasing the number of exponent bits expands the dynamic range, while
increasing the number of mantissa bits improves the precision of quantization intervals. We select the E2M1 format in our
main experiments as it offers a balanced trade-off between dynamic range and quantization precision.
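As a worked example, the positive half of the E2M1 row of Table 3 can be reproduced by decoding the four bit fields directly, assuming an exponent bias of 1 and the subnormal rule described above; the helper below is illustrative.

def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (S EE M) into its value; reproduces the E2M1 row of Table 3."""
    s = (code >> 3) & 0x1
    e = (code >> 1) & 0x3
    m = code & 0x1
    bias = 1                                    # assumed exponent bias for E2M1
    if e == 0:                                  # subnormal: no implicit leading 1
        mag = (m / 2) * 2 ** (1 - bias)
    else:                                       # normal: (1.M) * 2^(E - bias)
        mag = (1 + m / 2) * 2 ** (e - bias)
    return -mag if s else mag

# [decode_e2m1(c) for c in range(8)]  ->  [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]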
Since the FP4 format supports only 2^4 = 16 distinct values, we implement a look-up table for FP4 quantization in a custom CUDA kernel. Quantization functions typically involve element-by-element operations on large amounts of data, which can be parallelized to take advantage of the highly parallel computing power of GPUs. The following code shows the implementation of the quantization kernel.

__global__ void quantize_kernel(const float* x, float* output, int x_size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < x_size) {
        float value = x[idx];
        float closest;

        // Round to the nearest representable E2M1 value (thresholds are interval midpoints).
        closest = (value < -5.0f)  ? -6.0f :
                  (value < -3.5f)  ? -4.0f :
                  (value < -2.5f)  ? -3.0f :
                  (value < -1.75f) ? -2.0f :
                  (value < -1.25f) ? -1.5f :
                  (value < -0.75f) ? -1.0f :
                  (value < -0.25f) ? -0.5f :
                  (value <  0.25f) ?  0.0f :
                  (value <  0.75f) ?  0.5f :
                  (value <  1.25f) ?  1.0f :
                  (value <  1.75f) ?  1.5f :
                  (value <  2.5f)  ?  2.0f :
                  (value <  3.5f)  ?  3.0f :
                  (value <  5.0f)  ?  4.0f : 6.0f;

        output[idx] = closest;
    }
}

void quantize(at::Tensor input, at::Tensor output, int size) {
    const float* input_data = input.data_ptr<float>();
    float* output_data = output.data_ptr<float>();

    // One thread per element; launch on the current PyTorch CUDA stream.
    const int threadsPerBlock = 256;
    const int blocks = (size + threadsPerBlock - 1) / threadsPerBlock;
    cudaStream_t stream = at::cuda::getCurrentCUDAStream();

    quantize_kernel<<<blocks, threadsPerBlock, 0, stream>>>(input_data, output_data, size);
}

B. Supplementary Proof for Differentiable Quantization Estimator


We present the complementary proof procedure for the Differentiable Gradient Estimator (DGE) method under actual quantization with vector-wise scaling factors. In the GeMM operation Y = AW, where A is the activation tensor with dimensions (s × c_i, sequence length × input channels) and W is the weight tensor with dimensions (c_i × c_o, input channels × output channels), quantization is applied along distinct dimensions to adhere to the mathematical logic of matrix multiplication. The quantization function is defined as:

x_fp4 = Q(x_fp16 · γ),   γ = MAX_fp4 / max(|x_fp16|)   (10)

For the weight tensor with dimensions (c_i × c_o), channel-wise quantization is performed as follows:

W_scaled = W ⊙ sf   (11)
W_q^scaled = Q(W_scaled)   (12)
W_q = W_q^scaled ⊙ (1/sf)   (13)

Here, sf is the scaling factor, and ⊙ represents the element-wise (Hadamard) product. In tensor-wise quantization, sf is
a scalar. For channel-wise quantization, sf is a vector with dimensions (1 × co ). In this case, the ⊙ operation involves
broadcasting sf to each column of the matrix W (ci × co ), followed by element-wise multiplication.
For Equation (13), it is crucial to note that multiplying by 1/sf ensures mathematical correctness. Practically, however, this
step is performed after the GeMM kernel execution. In other words, the quantized weight tensor provided to the GeMM
kernel is the scaled quantized weight tensor Wqscaled from Equation (12). Nevertheless, for mathematical analysis, the
quantized weight tensor Wq must be re-scaled.
In the backward computation, the loss gradient with respect to W is derived from the forward operation Y = AWq .
According to the matrix multiplication rules for differentiation, the gradient ∂L/∂Wq is computed using the activation
gradient ∂L/∂Y from the subsequent layer.

∂L ∂L
Fwd: Y = AWq Bwd: = AT (14)
∂Wq ∂Y

By applying the chain rule and referring to Equations (11) to (13), the relationship between ∂L/∂Wq and the actual weight
gradient ∂L/∂W is established. According to Equation (13), the gradient ∂L/∂Wqscaled can be expressed in terms of
∂L/∂Wq using the scaling factor sf :

∂L ∂L 1
= ⊙ (15)
∂Wqscaled ∂Wq sf

Subsequently, the differentiable gradient estimator correction term Q′ (x) is applied to compute ∂L/∂Wscaled :

∂L ∂L
= ⊙ Q′ (Wscaled ) (16)
∂Wscaled ∂Wqscaled

Where Q′ (x) is the differentiable gradient estimator correction item introduced in Equation (8). Finally, the relationship
between ∂L/∂Wscaled and ∂L/∂W is derived by incorporating sf :

∂L ∂L
= ⊙ sf (17)
∂W ∂Wscaled

By combining all these steps, the formula for calculating the true weight gradient ∂L/∂W is obtained:

$$ \frac{\partial L}{\partial W} = \left( \frac{\partial L}{\partial W_q} \odot \frac{1}{s_f} \odot Q'(W_{\mathrm{scaled}}) \right) \odot s_f \tag{18} $$
$$ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial W_q} \odot Q'(W_{\mathrm{scaled}}) \tag{19} $$

Importantly, the scaling and un-scaling steps cancel each other because all operations are element-wise, which yields the simplified expression in Equation (19). This final formula matches Equation (6) derived in the main body of the paper, with the only difference being that the argument of the DGE correction term is the scaled weight tensor. No changes are required for the Q and Q′ functions themselves.
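To make the gradient flow of Equation (19) concrete, the following PyTorch autograd sketch shows one possible way to wire it up. It is an illustrative sketch only: dge_correction is a hypothetical placeholder for the Q′(x) term of Equation (8) (not reproduced here), and FP4_MAX and quantize_to_fp4 reuse the helpers sketched after Equation (13).

def dge_correction(x: torch.Tensor) -> torch.Tensor:
    # Placeholder for Q'(x) from Equation (8). Returning ones reduces the
    # sketch to a plain straight-through estimator; substitute the paper's
    # correction term to obtain the actual DGE behavior.
    return torch.ones_like(x)

class FP4WeightQuantDGE(torch.autograd.Function):
    # Channel-wise FP4 weight quantization with the DGE backward of Eq. (19).
    @staticmethod
    def forward(ctx, w: torch.Tensor) -> torch.Tensor:
        s_f = FP4_MAX / w.abs().amax(dim=0, keepdim=True).clamp_min(1e-12)
        w_scaled = w * s_f
        ctx.save_for_backward(w_scaled)
        # Return Wq = Q(W * s_f) / s_f, i.e. Equations (11)-(13).
        return quantize_to_fp4(w_scaled) / s_f

    @staticmethod
    def backward(ctx, grad_wq: torch.Tensor) -> torch.Tensor:
        (w_scaled,) = ctx.saved_tensors
        # Eq. (19): the 1/s_f and s_f factors cancel, leaving only the
        # element-wise DGE correction evaluated at the scaled weights.
        return grad_wq * dge_correction(w_scaled)

Using FP4WeightQuantDGE.apply(W) in place of W before the GeMM then reproduces the weight-gradient expression of Equation (19) during backpropagation, up to the chosen form of Q′(x).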

C. Analyzing Quantization Difficulty Through Tensor Distribution


Section 3 highlights the necessity of quantizing both weight and activation tensors to fully leverage the FP4 tensor core. It also points out that activation tensors are significantly more challenging to quantize than weight tensors. To further support this observation, we provide the actual distributions of weight and activation tensors during model training.


[Figure 8 panels: per-layer histograms (log-scale frequency vs. value) of the self-attention dense weight tensors for layers 0–22; all means are approximately 0 and standard deviations range from 0.0062 to 0.0260.]
Figure 8. Visualization of the weight tensors in the dense projection layers of the self-attention module.

[Figure 9 panels: per-layer histograms (log-scale frequency vs. value) of the MLP up-projection weight tensors for layers 0–22; means ≈ 0, standard deviations 0.0147–0.0282.]
Figure 9. Visualization of the weight tensors in the up-projection linear layers of the MLP module.

[Figure 10 panels: per-layer histograms (log-scale frequency vs. value) of the MLP down-projection weight tensors for layers 0–22; means ≈ 0, standard deviations 0.0094–0.0280.]
Figure 10. Visualization of the weight tensors in the down-projection linear layers of the MLP module.


[Figure 11 panels: per-layer histograms (log-scale frequency vs. value) of the core attention output activations for layers 0–22; means ≈ 0, standard deviations 0.0137–0.3915.]
Figure 11. Visualization of the activation tensors from the core attention output.

[Figure 12 panels: per-layer histograms (log-scale frequency vs. value) of the post-attention layer-norm output activations for layers 0–22; means ≈ 0, standard deviations 0.6730–1.2848.]
Figure 12. Visualization of the activation tensors from the post-attention layer normalization output.

[Figure 13 panels: per-layer histograms (log-scale frequency vs. value) of the MLP down-projection output activations for layers 0–22; means ≈ 0, standard deviations 0.0826–0.9682.]
Figure 13. Visualization of the activation tensors from the MLP down-projection layer output.


Figures 8 to 10 illustrate the distributions of weight tensors, while Figures 11 to 13 show the distributions of activation tensors. These results are derived from training the LLaMA 1.3B model for 30,000 iterations. The y-axis is plotted on a logarithmic scale to improve readability. From these figures, it is evident that weight tensors generally exhibit a small dynamic range, whereas activation tensors span a significantly larger dynamic range, which makes them harder to quantize.
Regarding distribution characteristics, weight tensors typically follow an approximately normal distribution, with certain tensors exhibiting small outliers. In contrast, activation tensors vary widely in their distributions. For example, core attention outputs often follow a well-behaved distribution with few outliers, whereas many other activation tensors, such as layer-norm outputs and transformer layer outputs, display irregular distributions with numerous outliers, making them particularly difficult to quantize.
Notably, the outliers in activation tensors during LLM training tend to concentrate in specific channels. This observation is further validated through the heatmap analysis in Figure 14, which is computed from the activation function (GeLU) output of the first transformer layer.
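For reference, a short PyTorch sketch of such a channel-wise outlier analysis is given below. It is an illustrative sketch with hypothetical helper names, not part of the released training code; it computes per-channel absolute maxima of a (tokens × channels) activation and flags channels whose maxima stand far above the tensor-wide standard deviation, mirroring the vertical lines in Figure 14.

import torch

def channel_outlier_stats(activation: torch.Tensor, k: float = 10.0):
    # activation: (tokens, channels), e.g. the flattened GeLU output of one layer.
    per_channel_absmax = activation.abs().amax(dim=0)          # (channels,)
    tensor_std = activation.std()
    outlier_channels = torch.nonzero(per_channel_absmax > k * tensor_std).flatten()
    return per_channel_absmax, outlier_channels

# Synthetic example: inflating two channels makes them show up as outliers,
# analogous to the bright vertical lines in the Figure 14 heatmap.
acts = torch.randn(4096, 2048)
acts[:, [7, 123]] *= 20.0
_, outliers = channel_outlier_stats(acts)   # typically tensor([7, 123])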
These analyses underscore the critical importance of effectively addressing activation tensors during quantization, especially
their outliers. Future research could gain valuable insights by exploring the complex distributions and outlier behavior of
activation tensor values.

Figure 14. Heatmap visualization of the activation function (GeLU) output from the first transformer layer at the 30,000-th training iteration of the LLaMA 1.3B model. The vertical light lines in the heatmap correspond to specific channel dimensions of the activation tensor, highlighting the channel-wise distribution of outliers.
