Recipes for Pre-Training LLMs with MXFP8
Asit Mishra, Dusan Stosic, Simon Layton
Abstract
Precision scaling — using fewer bits to represent model parameters and related tensors
during pre-training — has emerged as a compelling technique for improving GPU efficiency
without sacrificing accuracy. Microscaling (MX) formats [1] in NVIDIA’s latest Blackwell GPUs represent a major leap in enabling precision scaling. These formats combine narrow floating-point data types with per-block scaling factors, offering a fine-grained approach to quantizing tensors.
Although MX-formats offer the promise of improved numeric stability compared to other
reduced-precision representations, in practice they must be used carefully in order to suc-
cessfully converge an LLM on a multi-trillion token dataset. In this paper, we show that the
rounding mode suggested in OCP [1] specification can lead to divergence when pre-training
an LLM. We show that an improved rounding mode, which uses round-to-infinity to compute scaling factors, enables successful pre-training in MXFP8 for an 8B model on 15T tokens.
1 Introduction
Scaling the number of parameters of deep learning models, especially large generative models, has
intensified the need for more efficient compute and memory solutions. Precision scaling, i.e. reducing the number of bits used to represent a data type, is a successful strategy to improve performance while
lowering memory requirements. However, precision scaling without sacrificing model accuracy
remains a key challenge.
Microscaling (MX) formats, developed in the Open Compute Project (OCP) [1], aim to offer a standardized approach to low-precision representation. By combining narrow floating-point types with fine-grained scaling factors, MX-formats aim to deliver a balance of dynamic range, precision, and hardware efficiency. OCP [1] and research works such as [2, 3, 4] have shown fine-grained scaling to be more effective than coarser-grained (e.g., per-tensor) scaling approaches, especially in sub-8-bit pre-training regimes.
NVIDIA’s latest GPU generation, Blackwell [5], adds native MX data type support to Tensor Cores.
Blackwell supports MX data for 8-bit (MXFP8), 6-bit (MXFP6) and 4-bit (MXFP4) floating-point types.
This paper primarily focuses on MXFP8 and provides an in-depth analysis of pre-training large
language models (LLMs). It details the conversion process from high precision formats to the
quantized MXFP8 format, including rounding modes for data and scaling factors. These insights aim
to guide the efficient and accurate pre-training of LLMs using MXFP8.
Notably, we observe that the rounding mode suggested in OCP [1] for scale factor computation causes MXFP8 pre-training loss to deviate from the BF16 baseline. We propose a modification to this scale factor rounding, using round-to-infinity instead of rounding down, and demonstrate that this adjustment enables MXFP8 pre-training to match BF16 pre-training accuracy.
The primary contributions of this paper are as follows:
– A methodical study of MX quantization for LLM pre-training in which we investigate the following design choices: data formats, quantization schemes, storage, and the selection of tensors to quantize, and examine how these choices affect accuracy. Specifically, we identify E4M3 as the MXFP8 data type that best maintains pre-training accuracy.
– We find that the rounding mode suggested in the OCP specification can lead to divergence during pre-training. Building on this analysis, we propose a rounding mode for scale factors that lets MXFP8-trained models match BF16 model accuracy. Specifically, we describe a recipe to pre-train LLMs with MXFP8 and show experiments in which an 8 billion parameter dense LLM trained on 15 trillion tokens matches both a BF16-trained model and a per-tensor scaled FP8 [6] model with its first and last layers in BF16.
For each design choice, we present ablation studies on small models (hundreds of millions of parameters) pre-trained on short token horizons (hundreds of billions of tokens). We then apply the findings to large multi-billion parameter models trained on trillion-token horizons. This makes our conclusions robust for large-scale foundation model pre-training.
Figure 1: A single MXFP block (green box) and the interpretation of the MXFP format: the 32 stored quantized values Q0 … Q31 share one scale, and each Qi is multiplied by the scale to recover the dequantized FP32 values V0 … V31.
Background: An MX-format is specified by a block size K, a shared scaling factor per block, X, and the data type of the elements in the block. A block is a contiguous set of K elements along an axis of a tensor. K = 32 for all MX types on Blackwell. The data type of X is UE8M0, an unsigned exponent-only value that encodes either NaN or any power-of-two¹ value in the range 2^-127 to 2^127.
Given K input elements in a source format, V_i (typically FP32), 1 ≤ i ≤ K, conversion to MX-format consists of computing X and Q_i such that Q_i × 2^X is the decoded value corresponding to V_i. X and Q_i are stored in memory instead of V_i. This is depicted in Fig. 1. A tensor in the source format is sub-divided into blocks of K elements and converted into MX-format to be stored in memory and/or processed by math units in hardware. Tensor Cores in Blackwell consume X and Q_i to compute the dot product of two MX-formatted blocks. If the accumulated output of a dot-product operation in Tensor Cores is FP32, this output is subsequently quantized to MX-format if an operation consuming the output needs it in that format. The conversion process is Q_i = Quantize_to_fp8(V_i / 2^X). Section 3 describes the conversion details. Fine-grained scaling helps each MX block independently align to the needed range of input values before quantization.
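To make the data layout concrete, the short NumPy sketch below shows how a 1-D slice of a tensor is divided into 32-element blocks, how a block is decoded from (Q_i, X), and how the per-block scales factor out of a block dot product. The helper names are ours and the sketch is purely illustrative; the actual scale computation and FP8 quantization are described in the remainder of the paper.

```python
import numpy as np

K = 32  # MX block size on Blackwell

def split_into_blocks(row):
    """Sub-divide a 1-D slice (taken along one axis of a tensor) into K-element blocks."""
    assert row.size % K == 0
    return row.reshape(-1, K)

def mx_dequantize(q_block, x):
    """Decode one block: V_i ~= Q_i * 2**X, where X is the shared UE8M0 scale exponent."""
    return q_block.astype(np.float32) * (2.0 ** x)

def mx_block_dot(qa, xa, qb, xb):
    """Dot product of two MX-formatted blocks, as consumed by a Tensor Core:
    the two per-block scales factor out of the 32-element accumulation."""
    acc = np.dot(qa.astype(np.float32), qb.astype(np.float32))  # FP32 accumulation
    return acc * 2.0 ** (xa + xb)
```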
Table 1 shows the MX-formats supported in Blackwell. The data type column uses the convention ExMy to denote x bits for the floating-point exponent and y bits for the mantissa. For a fixed bit-width, floating-point numbers trade off exponent width against mantissa width. Thus, MXFP8 E4M3 is an 8-bit data type with 1 sign bit, 4 exponent bits and 3 mantissa bits. Similarly, E5M2 is an 8-bit type with 5 exponent bits and 2 mantissa bits. Compared to E4M3, E5M2 can represent a larger dynamic range, but with less precision. E5M2 is the only type that follows IEEE 754 conventions [7] for the representation of special values. For all other data types, dynamic range is extended by not representing Infinity and NaN (E4M3 has only one bit pattern for NaN [6]). More bits in the exponent field translate to a larger range, while more bits in the mantissa field translate to more precision within a given range. Every floating-point type has a dynamic range that it can represent; we denote this range in binades, the log2 ratio of the maximum to the minimum finite representable value in that format.

¹ A power-of-two is a number of the form 2^n where n is a positive integer, a negative integer, or zero.
Computing X: Typically, most of the values within each block of a tensor will be outside the representable range of the target MX format, either underflowing the minimum or overflowing the maximum representable number. To address this, all values within a block are multiplied by a scale factor, shifting the maximum number of values into the representable range.
The scale factor, X, is computed from the absolute maximum value (amax) among the 32 high-precision input values, i.e. amax = max |V_i|, 1 ≤ i ≤ 32. The goal is to map this amax in the input format to the largest representable value in the desired MX-format. Special care is taken if some or all of the elements in the input are Infinity and/or not-a-number (NaN). If amax in a block is 0, then X is set to -127, so the interpreted scale is 2^-127, and all Q_i in this case are set to 0.
When amax is not Inf, NaN or 0, the OCP [1] specification suggests setting the scale to the largest power-of-two less than or equal to amax, divided by the largest power-of-two representable in the MX-format type. For the E4M3 type, for example, this gives a scale exponent of X = floor(log2(amax)) - floor(log2(448)), since 448 is the largest magnitude representable in E4M3. Thus, the OCP specification ignores the floating-point significand of this ratio.²
We observe accuracy degradation when following the OCP specification. Fig. 2 shows training loss curves for an 843 million parameter transformer model trained under two token horizons: 300 billion and 1 trillion tokens. Two different configurations are studied:
– (cfg1) E4M3 for all tensors (weights (W), activations (A) and gradients (G)), and
– (cfg2) E4M3 for W and A tensors, E5M2 for G tensors.
The E5M2 format has ∼1.8x more binades than E4M3, and since gradients typically have a larger dynamic range, [6] advocated the use of E5M2 with a per-tensor scaling method. This rationale underpins the study of E5M2 gradients. In both cfg1 and cfg2, scale factor computation using the OCP method leads to training divergence (Fig. 2a) or a widening loss gap relative to BF16 (Fig. 2b).

² Assume amax and the maximum representable value in the MX-format (destmax) are normal floating-point numbers of the form amax = 2^A × 1.Ma and destmax = 2^E × 1.Me (where the mantissas Ma and Me lie between 0 (inclusive) and 1). The OCP rounding scheme then results in a scale exponent of A - E (ignoring the case where A - E is clamped to a minimum of -127).
Figure 2: Comparing the scale factor rounding mode suggested in the OCP specification with ours. (a) Validation perplexity for an 843M parameter model trained on 300B tokens. (b) Validation perplexity for an 843M parameter model trained on 1T tokens. E4M3 and E5M2 are MXFP8 formats; the notation E4M3(W) denotes the E4M3 data type for weights; similarly, E4M3(A) and E4M3(G) refer to activations and gradients, respectively.
Algorithm 1 outlines our method for computing the scale factor. The key change is to round the exponent of the ratio of amax to destmax (the maximum representable value in the MX-format) up towards positive infinity, while saturating to the UE8M0 max/min limits. This contrasts with the OCP scheme, which effectively rounds the scale value down. Since a high-precision value V_i is scaled by the scale factor 2^X, rounding up the denominator of the fraction V_i / 2^X tends to map amax below destmax. In contrast, the OCP method tends to map amax above destmax, which subsequently must be clamped so that it becomes representable. We hypothesize that this saturating effect of the OCP rounding method harms model accuracy. A more detailed and accurate workflow to compute and round the value of X is described in Appendix Sec. A.1.
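To make the contrast concrete, the sketch below implements both rounding choices using the log2-based simplification of Algorithm 1 (the bit-exact version used for emulation is given in Appendix A.1); the function names are ours. For amax = 500 with E4M3, the OCP choice yields X = 0, so the scaled amax (500) exceeds destmax = 448 and must be saturated, whereas the round-up choice yields X = 1 and maps amax to 250.

```python
import math

E4M3_MAX = 448.0  # destmax for MXFP8 E4M3

def scale_exponent_ocp(amax):
    """OCP-style scale exponent: floor(log2(amax)) - floor(log2(destmax)).
    This effectively rounds the scale down, so amax / 2**X can land above
    destmax and must then be clamped during quantization."""
    return math.floor(math.log2(amax)) - math.floor(math.log2(E4M3_MAX))

def scale_exponent_ours(amax):
    """Proposed scale exponent: round the exponent of amax/destmax up toward
    +infinity, saturating to the UE8M0 range, so amax maps at or below destmax."""
    x = math.ceil(math.log2(amax / E4M3_MAX))
    return max(-127, min(127, x))

amax = 500.0
for fn in (scale_exponent_ocp, scale_exponent_ours):
    x = fn(amax)
    print(fn.__name__, "X =", x, "scaled amax =", amax / 2.0 ** x)
```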
Fig. 2 shows that, with our proposed rounding scheme, MXFP8 with E4M3 gradients (blue curve) and MXFP8 with E5M2 gradients (purple curve) both overlap with the reference BF16 loss curve across the 300B and 1T token horizons. Later, in Section 3.2, we show that E4M3 is in fact a better choice than E5M2 for pre-training LLMs with MX-style fine-grained scaling.
Quantizing FP32 values to MX type: Once X is computed, V_i is scaled by 2^X and the resulting value is quantized to an FP8-representable number. This is the Quantize_to_fp8() function. Round-to-nearest-ties-to-even (RN) rounding is used during this quantization step. The conversion process is saturating, i.e. if after rounding the resulting value exceeds the FP8 max or is less than the FP8 min value, the result is clamped to the respective max or min value.
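A minimal NumPy approximation of this Quantize_to_fp8() step for E4M3 is sketched below, assuming 3 mantissa bits, a minimum normal exponent of -6 and a maximum magnitude of 448; a bit-exact converter (or the hardware) would operate on FP8 encodings directly rather than in real arithmetic.

```python
import numpy as np

E4M3_MAX = 448.0        # largest finite E4M3 magnitude
E4M3_MIN_NORM_EXP = -6  # exponent of the smallest E4M3 normal (2**-6)
MANT_BITS = 3           # E4M3 mantissa width

def quantize_to_fp8_e4m3(x):
    """Saturating round-to-nearest-even quantization of scaled FP32 values onto
    the E4M3 grid (an approximation for illustration only)."""
    x = np.asarray(x, dtype=np.float32)
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)              # saturate to FP8 max/min
    out = np.zeros_like(x)
    nz = x != 0
    exp = np.floor(np.log2(np.abs(x[nz])))
    exp = np.maximum(exp, E4M3_MIN_NORM_EXP)          # subnormals share exponent -6
    step = np.float32(2.0) ** (exp - MANT_BITS)       # spacing of representable values
    out[nz] = np.round(x[nz] / step) * step           # np.round is ties-to-even (RN)
    return np.clip(out, -E4M3_MAX, E4M3_MAX)
```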
A practical instance of this conversion process in low-precision LLM pre-training arises when matrix
multiplication and accumulation (MMA) outputs, typically stored in FP32, must be mapped to MXFP8.
In this case, 8-bit quantized values are stored in memory, saving write bandwidth and storage capacity compared with storing FP32 values. Subsequent model operations then read MXFP8 values, saving read bandwidth compared to loading FP32 values. Further, since Tensor Cores can
process MX-formatted inputs, the MMA operation in lower precision consumes less energy and
operates at higher throughput.
3.2 E4M3 data type for weights, activations and gradients
Figure 3: Pre-training loss curves comparing E4M3 and E5M2 when used across different tensor types: weights (W), activations (A) and gradients (G). (a) Validation perplexity for an 843M parameter model trained on 1T tokens, comparing E4M3(W)-E5M2(A)-E4M3(G), E5M2(W)-E4M3(A)-E4M3(G), E4M3(W)-E4M3(A)-E5M2(G), E4M3(W)-E4M3(A)-E4M3(G) and BF16. (b) Validation perplexity for an 8B parameter model trained on 1T tokens, comparing E4M3(W)-E4M3(A)-E5M2(G), E4M3(W)-E4M3(A)-E4M3(G) and BF16. The insets show a zoomed-in view of the loss at the end of training.
FP8 has two variants in Blackwell (Table 1): E4M3 and E5M2. Our experiments show the following:
– Better convergence with E4M3 instead of E5M2 for weights and activations during training. Fig. 3a
shows the loss behavior comparing E4M3 and E5M2 for the same 843M model used in Fig. 2b.
The purple curve has E5M2 for activations and the blue curve has E5M2 for weights while all
other tensors are in E4M3. Both these curves show worse loss convergence compared to using
E4M3 for all tensors (orange curve) or using E5M2 for gradients (yellow curve).
– E4M3 for gradients maintains training loss parity with BF16 pre-training, especially for models
with 2 billion or more parameters. Fig. 3b shows the loss behavior comparing E4M3 and E5M2
for the gradient tensor on an 8 billion parameter LLM trained on 1 trillion tokens: using E4M3
gradients (orange) has lower loss than using E5M2 gradients (yellow). This gap increases as
the model is trained on more tokens. This change in behavior with increasing model parameter
counts underscores the importance of examining the numerical properties of formats across a
broad range of model sizes.
Previously, per-tensor scaled FP8 studies [8, 6, 9] and coarse-grained block-scaled FP8 studies [10] used E5M2 for gradient tensors instead of E4M3. With fine-grained scaling, the dynamic range requirements at a 32-element block size are sufficiently captured by the 17.8 binades of the E4M3 type. Once the range requirements are met, precision (or sampling density) becomes important, and E4M3, with 8 samples per binade, is better than E5M2, with only 4. Thus, for our MXFP8 pre-training recipe, we quantize all three tensor types (weights, activations, and gradients) with the E4M3 data type.
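As a quick sanity check on these numbers, the snippet below recomputes the dynamic ranges from the standard FP8 limits given in [6] (max finite 448 and smallest subnormal 2^-9 for E4M3; 57344 and 2^-16 for E5M2): E4M3 spans roughly 17.8 binades and E5M2 roughly 31.8, which is the ~1.8x ratio quoted above.

```python
import math

# Dynamic range in binades = log2(max_finite / min_subnormal), per FP8 type [6].
fp8_limits = {
    "E4M3": (448.0, 2.0 ** -9),     # 3 mantissa bits -> 2**3 = 8 samples per binade
    "E5M2": (57344.0, 2.0 ** -16),  # 2 mantissa bits -> 2**2 = 4 samples per binade
}
binades = {name: math.log2(mx / mn) for name, (mx, mn) in fp8_limits.items()}
print(binades)                               # {'E4M3': ~17.8, 'E5M2': ~31.8}
print(binades["E5M2"] / binades["E4M3"])     # ~1.8x, as quoted in the text
```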
Layers to instantiate in MXFP8 and training workflow: We use language-based transformer models for all studies in this paper; future work will study these recipes on speech and vision models. Based on our studies, our guideline is to quantize the QKV, PROJ and FFN Up- and Down-projection layers to MXFP8 across all transformer blocks in the model. The batch matrix multiplications (BMM1, the query-key dot product, and BMM2, the attention score-value product) in the self-attention layer, along with operations like Softmax, activation functions and residual-add, remain in high precision. We found this to be the safest option for maintaining accuracy parity with BF16 pre-training. This is illustrated in Fig. 4. The input embedding layer and the final output-projection layer are also kept in BF16 or FP16. All studies in this paper use this guideline.
During training with MXFP quantization, the training framework has to keep two copies of each tensor (weights, activations and gradients), one quantized along each axis of dot-product reduction (row and column). Fig. 4 shows how each tensor is used in forward (FPROP), weight-gradient (WGRAD) and activation-gradient (DGRAD) computations during the training loop. Since each tensor is used in both non-transposed and transposed form, quantization needs to occur along two separate axes (row and column), as sketched below.
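The sketch below illustrates only the blocking pattern behind this requirement, assuming the usual linear-layer formulation Y = X W^T (this formulation and the helper names are ours, not taken from the paper); the actual per-block conversion is the one described in Section 3.

```python
import numpy as np

K = 32  # MX block size

def blockwise_amax(t2d, axis):
    """Per-block absolute maxima of a 2-D tensor, grouping K contiguous elements
    along the chosen reduction axis. A full conversion would derive one UE8M0
    scale and 32 E4M3 values per block; here we only show the blocking pattern."""
    t = np.moveaxis(t2d, axis, -1)          # bring the reduction axis last
    blocks = t.reshape(t.shape[0], -1, K)   # [other_dim, n_blocks, K]
    return np.abs(blocks).max(axis=-1)

# Assuming Y = X @ W.T with X: [tokens, c_in] and W: [c_out, c_in]:
#   FPROP  (Y  = X  @ W.T) reduces over c_in   -> X and W blocked along axis 1
#   DGRAD  (dX = dY @ W  ) reduces over c_out  -> dY along axis 1, W along axis 0
#   WGRAD  (dW = dY.T @ X) reduces over tokens -> dY and X along axis 0
# Hence each of X, W and dY is kept in two MXFP8 copies, one per reduction axis.
X = np.random.randn(8192, 4096).astype(np.float32)
amax_fprop = blockwise_amax(X, axis=1)   # blocks along c_in (non-transposed use)
amax_wgrad = blockwise_amax(X, axis=0)   # blocks along tokens (transposed use)
```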
Figure 4: Top: Transformer layers quantized to MXFP8 inside a single transformer block. Bottom: Training workflow for a single layer during FPROP, DGRAD and WGRAD; BF16 weights and activations are quantized to MXFP8 before each MXFP8 GEMM, and the BF16 output is passed to the next layer.
Summary so far: We introduce a new rounding scheme for the MX scale factor that addresses the
divergence caused by the OCP-based approach and achieves loss parity with BF16 on an 843M-
parameter model up to 1T tokens. Additionally, adopting E4M3 and our scale factor calculation
method in Alg. 1 enables scaling to an 8B-parameter model trained on 15T tokens — representing
the largest LLM pre-training at this scale with MXFP formats, to the best of our knowledge. Appendix
(section A.4) shows that our recipe also holds for a 16B-parameter mixture-of-experts model.
We pre-train an 8B parameter Nemotron model [11] using Megatron-LM [12]. The model has 32 transformer blocks, 32 attention heads, a hidden size of 4096, a GQA group size of 8, 128 KV-channels, and a sequence length of 8192 during pre-training. It is trained on 15T tokens with a batch size of 768. The initial learning rate is 6e-4 and decays to 6e-6 with a cosine schedule. A phased data-blending approach is used to train the model: the first phase uses a data mixture that promotes diversity, and the second phase uses high-quality datasets (e.g., Wikipedia). We switch to the second phase at the 60% point of training. This blending style has also been used in other large-scale pre-training setups [8]. The model is pre-trained on 3072 Hopper GPUs (hardware with MX support was unavailable during much of the experimentation period).
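For reference, the hyperparameters listed above can be collected into a single configuration summary; the dictionary and its key names are ours, not a Megatron-LM argument list.

```python
# 8B Nemotron pre-training setup as stated above; key names are illustrative only.
nemotron_8b_pretraining = {
    "transformer_blocks": 32,
    "attention_heads": 32,
    "hidden_size": 4096,
    "gqa_group_size": 8,
    "kv_channels": 128,
    "sequence_length": 8192,
    "training_tokens": 15_000_000_000_000,   # 15T
    "global_batch_size": 768,
    "lr_initial": 6e-4,                      # cosine-decayed ...
    "lr_final": 6e-6,                        # ... to this value
    "data_blend_phase2_start": 0.60,         # switch to high-quality blend at 60%
    "gpus": 3072,                            # Hopper, with MXFP8 emulated in software
}
```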
We use an emulation of MX-formats on Hopper GPUs: tensors that feed an MMA operation are first quantized into the MX-format and then cast back to BF16 before the BF16 MMA operation. The training workflow depicted in Fig. 4 is implemented in Megatron-LM. We validated the numerical fidelity of our emulation package by comparing against a 2B parameter LLM pre-training run on Blackwell using the actual MXFP8 format and confirmed that the two were identical.
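A sketch of this fake-quantization scheme for a single linear layer is shown below, assuming PyTorch BF16 tensors; mxfp8_fake_quantize stands in for the MXFP8 conversion of Section 3 (returning BF16 values numerically equal to the MXFP8-quantized ones) and is not an actual Transformer Engine or Megatron-LM API.

```python
import torch

def mxfp8_emulated_linear(x_bf16, w_bf16, mxfp8_fake_quantize):
    """Emulate an MXFP8 GEMM on a pre-Blackwell GPU: quantize both MMA inputs to
    MXFP8 along their reduction axes, cast the results back to BF16, and run the
    matmul itself in BF16 on the existing Tensor Cores."""
    xq = mxfp8_fake_quantize(x_bf16, block_axis=-1).to(torch.bfloat16)
    wq = mxfp8_fake_quantize(w_bf16, block_axis=-1).to(torch.bfloat16)
    return xq @ wq.t()   # BF16 MMA; inputs carry the MXFP8 quantization error
```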
Fig. 5 shows training loss behavior and task-level accuracy for the 8B pre-trained model. We
report evaluation scores on two sets of downstream tasks: (1) the 5-shot score on MMLU [13] and (2) the averaged 1-shot score across 9 general reasoning benchmarks: ARC-Challenge and ARC-Easy [14],
Race [15], PIQA [16], Winogrande [17], Hellaswag [18], OpenBookQA [19], Social IQA [20] and
Commonsense QA [21]. We observe the following:
– Validation perplexity of the model when pre-trained with MXFP8 matches pre-training with BF16
(left plot in Fig. 5). There is less than 0.50% difference between MXFP8 and BF16 validation
perplexity values throughout the pre-training run.
– The middle and right plots in Fig. 5 show evaluation scores on the two sets of downstream tasks. Again, scores for the MXFP8-trained model match those of the BF16-trained model. This makes MXFP8 a viable candidate for pre-training LLMs.
Figure 5: Pre-training an 8B LLM on 15T tokens. Top: Validation perplexity of BF16 vs. MXFP8 over the 15T-token run (with a zoomed-in view near the end of training). Bottom: Downstream task scores of BF16, FP8 and MXFP8 on MMLU and on the set of 9 reasoning tasks as training progresses. MXFP8 numerics use our proposed rounding method and E4M3 for all quantized tensors.
MXFP8 versus FP8: In addition to MXFP8 and BF16, Fig. 5 also shows task-level scores for the same model trained with traditional FP8 precision. The FP8 recipe uses software-managed per-tensor scaling [6], where an entire tensor is scaled so that the maximum number of tensor values falls into the representable range of the quantized format. We follow the guidelines suggested in [8] for the FP8 pre-training setup: the first and last transformer blocks in the model are kept in BF16 while the linear layers of the remaining blocks are quantized to FP8. [8] found this choice appropriate for pre-training 8B and 56B parameter LLMs on 20T tokens. Keeping some layers in BF16 reduces end-to-end speedup and also complicates pre-training, since a choice has to be made about which layers to leave in higher precision. We observe that MXFP8 matches FP8 accuracy on these two sets of tasks without requiring any BF16 layers.
MXFP8 versus blockwise-FP8: Further, some works like DeepSeek-V3 [10] report the need for smaller scaling block sizes when using FP8. In that setup, certain tensors require 1x128 vector scaling and others require per-block (e.g., 128x128) software scaling, which complicates GEMM kernel design. Native support for MXFP8 simplifies this: fine-grained scaling provides better numerical robustness and avoids any tradeoff between smaller block sizes and hardware speed.
In summary, we find that MXFP8 maintains accuracy relative to BF16 and FP8 pre-trained models. On GB200 Blackwell systems, MXFP8 has 2× higher throughput than BF16, making end-to-end MXFP8 pre-training faster than BF16 pre-training. We also find the MXFP8 recipe simpler to use than FP8 (all layers can be quantized and scaling is handled in hardware) while delivering equal or better throughput.
4 Related work
Low-precision training and inference are widely studied topics in the deep learning literature. While significant progress has been made in low-precision inference [22, 23, 24, 25], there are relatively few studies demonstrating low-precision techniques for pre-training LLMs, especially large-scale pre-training over long token horizons. Our work primarily focuses on 8-bit pre-training; prior work on related low-precision LLM pre-training can be grouped into the following two categories:
– LLM pre-training using FP8 formats: [6] proposes an FP8 binary interchange format, con-
sisting of E4M3 and E5M2 encodings, and a per-tensor scaling approach — an entire tensor is
scaled to capture the maximum number of tensor values in the representable range in FP8. [26]
discusses FP8 pre-training challenges and proposes model-level modifications to train a 7B
parameter model. Recently, per-tensor scaled FP8 was used to train the Nemotron-H family of
LLMs [8] and the Llama-4 family of models also used FP8 [27]. Instead of per-tensor scaling, the DeepSeek-V3 family of models [10] uses block-scaled FP8, which helps to better capture outliers and minimize quantization errors.
– Pre-training using MXFP formats: [2] presents empirical data on pre-training models with
MXFP formats. They show cast-only inference results for MXFP8 and pre-training results for
MXFP6 and MXFP4-weights. [28] studies MXFP4 backward-pass quantization and [29] investigates MXFP4 weight quantization on relatively short token horizons. All of these studies are based on the scaling factor computation method described in [1], which we show to be ineffective at large token horizons.
Acknowledgments
We thank members of the ADLR/PSX team (Sweta P., Mikail K., Ben L.) for helping with the draft
revisions as well as Mohammad S., Carlo D.M., Michael A., Eric C. and Bryan C. with valuable
feedback and discussions. We also thank PM for guidance throughout this work.
References
[1] Bita Darvish Rouhani, Nitin Garegrat, Tom Savell, Ankit More, Kyung-Nam Han, Ritchie Zhao,
Mathew Hall, Jasmine Klar, Eric Chung, Yuan Yu, Michael Schulte, Ralph Wittig, Ian Bratt,
Nigel Stephens, Jelena Milanovic, John Brothers, Pradeep Dubey, Marius Cornea, Alexander
Heinecke, Andres Rodriguez, Martin Langhammer, Summer Deng, Maxim Naumov, Paulius
Micikevicius, Michael Siu, and Colin Verrilli. Ocp microscaling (mx) specification. Open
Compute Project, 2023.
[2] Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi,
Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic
Dusan, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby,
Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mes-
makhosroshahi, Andres Rodriguez, Michael Schulte, Rasoul Shafipour, Lei Shao, Michael
Siu, Pradeep Dubey, Paulius Micikevicius, Maxim Naumov, Colin Verrilli, Ralph Wittig,
Doug Burger, and Eric Chung. Microscaling data formats for deep learning, 2023. URL
https://arxiv.org/abs/2310.10537.
[3] Bita Rouhani, Ritchie Zhao, Venmugil Elango, Rasoul Shafipour, Mathew Hall, Maral Mes-
makhosroshahi, Ankit More, Levi Melnick, Maximilian Golub, Girish Varatkar, Lei Shao, Gau-
rav Kolhe, Dimitry Melts, Jasmine Klar, Renee L’Heureux, Matt Perry, Doug Burger, Eric Chung,
Zhaoxia Deng, Sam Naghshineh, Jongsoo Park, and Maxim Naumov. With shared microexpo-
nents, a little shifting goes a long way, 2023. URL https://arxiv.org/abs/2302.08007.
[4] Steve Dai, Rangharajan Venkatesan, Haoxing Ren, Brian Zimmer, William J. Dally, and Brucek
Khailany. Vs-quant: Per-vector scaled quantization for accurate low-precision neural network
inference, 2021. URL https://arxiv.org/abs/2102.04503.
[6] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard
Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellem-
pudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. Fp8 formats for deep
learning, 2022. URL https://arxiv.org/abs/2209.05433.
[7] IEEE standard for floating-point arithmetic. IEEE Std 754-2008, pages 1–70, 2008. doi:
10.1109/IEEESTD.2008.4610935.
[8] NVIDIA: Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad
Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh,
Ameya Sunil Mahabaleshwarkar, Andrew Tao, Anna Shors, Ashwath Aithal, Ashwin Poojary,
Ayush Dattagupta, Balaram Buddharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon
Norick, Brian Butterfield, Bryan Catanzaro, Carlo del Mundo, Chengyu Dong, Christine Harvey,
Christopher Parisien, Dan Su, Daniel Korzekwa, Danny Yin, Daria Gitman, David Mosal-
lanezhad, Deepak Narayanan, Denys Fridman, Dima Rekesh, Ding Ma, Dmytro Pykhtar, Dong
Ahn, Duncan Riach, Dusan Stosic, Eileen Long, Elad Segal, Ellie Evans, Eric Chung, Erick
Galinkin, Evelina Bakhturina, Ewa Dobrowolska, Fei Jia, Fuxiao Liu, Gargi Prasad, Gerald
Shen, Guilin Liu, Guo Chen, Haifeng Qian, Helen Ngo, Hongbin Liu, Hui Li, Igor Gitman, Ilia
Karmanov, Ivan Moshkov, Izik Golan, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jarno
Seppanen, Jason Lu, Jason Sewall, Jiaqi Zeng, Jiaxuan You, Jimmy Zhang, Jing Zhang, Jining
Huang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jon Barker, Jonathan Cohen,
Joseph Jennings, Jupinder Parmar, Karan Sapra, Kari Briski, Kateryna Chumachenko, Katherine
Luna, Keshav Santhanam, Kezhi Kong, Kirthi Sivamani, Krzysztof Pawelec, Kumar Anik,
Kunlun Li, Lawrence McAfee, Leon Derczynski, Lindsey Pavao, Luis Vega, Lukas Voegtle,
Maciej Bala, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski,
Markus Kliegl, Marta Stepniewska-Dziubinska, Matthieu Le, Matvei Novikov, Mehrzad Samadi,
Michael Andersch, Michael Evans, Miguel Martinez, Mike Chrzanowski, Mike Ranzinger,
Mikolaj Blaz, Misha Smelyanskiy, Mohamed Fawzy, Mohammad Shoeybi, Mostofa Patwary,
Nayeon Lee, Nima Tajbakhsh, Ning Xu, Oleg Rybakov, Oleksii Kuchaiev, Olivier Delalleau,
Osvald Nitski, Parth Chadha, Pasha Shamis, Paulius Micikevicius, Pavlo Molchanov, Peter
Dykas, Philipp Fischer, Pierre-Yves Aquilanti, Piotr Bialecki, Prasoon Varshney, Pritam Gun-
decha, Przemek Tredak, Rabeeh Karimi, Rahul Kandu, Ran El-Yaniv, Raviraj Joshi, Roger
Waleffe, Ruoxi Zhang, Sabrina Kavanaugh, Sahil Jain, Samuel Kriman, Sangkug Lym, San-
jeev Satheesh, Saurav Muralidharan, Sean Narenthiran, Selvaraj Anandaraj, Seonmyeong
Bak, Sergey Kashirsky, Seungju Han, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere
Sreenivas, Sharon Clay, Shelby Thomas, Shrimai Prabhumoye, Shubham Pachori, Shubham
Toshniwal, Shyamala Prayaga, Siddhartha Jain, Sirshak Das, Slawek Kierat, Somshubra Majum-
dar, Song Han, Soumye Singhal, Sriharsha Niverty, Stefania Alborghetti, Suseella Panguluri,
Swetha Bhendigeri, Syeda Nahida Akter, Szymon Migacz, Tal Shiri, Terry Kong, Timo Roman,
Tomer Ronen, Trisha Saar, Tugrul Konuk, Tuomas Rintamaki, Tyler Poon, Ushnish De, Vahid
Noroozi, Varun Singh, Vijay Korthikanti, Vitaly Kurin, Wasi Uddin Ahmad, Wei Du, Wei
Ping, Wenliang Dai, Wonmin Byeon, Xiaowei Ren, Yao Xu, Yejin Choi, Yian Zhang, Ying Lin,
Yoshi Suhara, Zhiding Yu, Zhiqi Li, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, and Zijia Chen.
Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025. URL
https://arxiv.org/abs/2504.03624.
[9] Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, and Carlo Luschi. 8-bit
numerical formats for deep neural networks, 2022. URL https://arxiv.org/abs/2206.
02915.
[10] DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu,
Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian
Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao,
Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang,
Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo,
Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong
Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean
Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li,
Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian,
Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du,
R. J. Chen, R. L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu
Zhang, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu,
Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng
Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W. L. Xiao, Wangding Zeng,
Wanjia Zhao, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang,
X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen,
Xiaokang Chen, Xiaokang Zhang, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang,
Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xinnan Song, Xinxia Shan, Xinyi
Zhou, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei,
Y. X. Zhu, Yang Zhang, Yanhong Xu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng
Sun, Yaohui Li, Yaohui Wang, Yi Yu, Yi Zheng, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying
He, Ying Tang, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo,
Yu Wu, Yuan Ou, Yuchen Zhu, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yukun Zha,
Yunfan Xiong, Yunxian Ma, Yuting Yan, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou,
Z. F. Wu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhen Huang, Zhen Zhang,
Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong
Shao, Zhipeng Xu, Zhiyu Wu, Zhongyu Zhang, Zhuoshu Li, Zihui Gu, Zijia Zhu, Zijun Liu,
Zilin Li, Ziwei Xie, Ziyang Song, Ziyi Gao, and Zizheng Pan. Deepseek-v3 technical report,
2025. URL https://arxiv.org/abs/2412.19437.
[11] Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subrama-
nian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu
Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki,
Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper,
Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, and Bryan Catanzaro.
Nemotron-4 15b technical report, 2024. URL https://arxiv.org/abs/2402.16819.
[12] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan
Catanzaro. Megatron-lm: Training multi-billion parameter language models using model
parallelism, 2020. URL https://arxiv.org/abs/1909.08053.
[13] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https:
//arxiv.org/abs/2009.03300.
[14] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning
challenge. arXiv:1803.05457v1, 2018.
[15] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale
reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
[16] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning
about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/
1911.11641.
[17] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An
adversarial winograd schema challenge at scale, 2019. URL https://arxiv.org/abs/1907.
10641.
[18] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a
machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830.
[19] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor
conduct electricity? a new dataset for open book question answering, 2018. URL https:
//arxiv.org/abs/1809.02789.
[20] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Com-
monsense reasoning about social interactions, 2019. URL https://arxiv.org/abs/1904.
09728.
[21] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA:
A question answering challenge targeting commonsense knowledge. In Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–
4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi:
10.18653/v1/N19-1421. URL https://aclanthology.org/N19-1421.
[22] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han.
Smoothquant: Accurate and efficient post-training quantization for large language models,
2024. URL https://arxiv.org/abs/2211.10438.
[23] Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song
Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving, 2025. URL
https://arxiv.org/abs/2405.04532.
[24] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training
quantization for generative pre-trained transformers, 2023. URL https://arxiv.org/abs/
2210.17323.
[25] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan
Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization
for llm compression and acceleration, 2024. URL https://arxiv.org/abs/2306.00978.
[26] Maxim Fishman, Brian Chmiel, Ron Banner, and Daniel Soudry. Scaling fp8 training to
trillion-token llms, 2025. URL https://arxiv.org/abs/2409.12517.
[27] Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innova-
tion. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, April 2025.
Accessed 12 May 2025.
[28] Albert Tseng, Tao Yu, and Youngsuk Park. Training llms with mxfp4, 2025. URL https:
//arxiv.org/abs/2502.20586.
[29] Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun
Zha, and Peng Cheng. Optimizing large language model training using fp4 quantization, 2025.
URL https://arxiv.org/abs/2501.17116.
[30] Nvidia. Transformer engine. https://github.com/NVIDIA/TransformerEngine/.
[31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,
Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of
mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/
2402.03300.
[32] Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding
warmup-stable-decay learning rates: A river valley loss landscape perspective, 2024. URL
https://arxiv.org/abs/2410.05192.
[33] NVIDIA: Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya,
Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das,
Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans,
Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz
Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala,
John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick LeGresley, Hui Li, Jiwei Liu, Zihan Liu,
Eileen Long, Ameya Sunil Mahabaleshwarkar, Somshubra Majumdar, James Maki, Miguel
Martinez, Maer Rodrigues de Melo, Ivan Moshkov, Deepak Narayanan, Sean Narenthiran,
Jesus Navarro, Phong Nguyen, Osvald Nitski, Vahid Noroozi, Guruprasad Nutheti, Christopher
Parisien, Jupinder Parmar, Mostofa Patwary, Krzysztof Pawelec, Wei Ping, Shrimai Prabhumoye,
Rajarshi Roy, Trisha Saar, Vasanth Rao Naik Sabavat, Sanjeev Satheesh, Jane Polak Scowcroft,
Jason Sewall, Pavel Shamis, Gerald Shen, Mohammad Shoeybi, Dave Sizer, Misha Smelyanskiy,
Felipe Soares, Makesh Narsimhan Sreedhar, Dan Su, Sandeep Subramanian, Shengyang Sun,
Shubham Toshniwal, Hao Wang, Zhilin Wang, Jiaxuan You, Jiaqi Zeng, Jimmy Zhang, Jing
Zhang, Vivienne Zhang, Yian Zhang, and Chen Zhu. Nemotron-4 340b technical report, 2024.
URL https://arxiv.org/abs/2406.11704.
A Appendix
The computation described in Algorithm 1 is a simplification, in that it stores the output of log2(float x) in FP32. The device's log2 function internally performs a round-to-nearest on the returned value, so the result of ceil(log2(x)) can differ if the output of log2 is not stored in a sufficiently wide data type. Hence, for emulation purposes we work directly with the bit representation of the ratio of amax and destmax. We next describe the computation flow.
Background: As a reminder, the quantization process from 32 high-precision values, V_i, to quantized values, Q_i, 1 ≤ i ≤ 32, is given by Q_i = Quantize_to_fp8(V_i / 2^X). 2^X is the scale factor; X is stored in an unsigned 8-bit integer container in memory and interpreted as 2^X by the hardware. This scale factor decodes Q_i back to V_i (up to quantization loss).
The value of the scale factor is 2^X = float_to_8bits(amax/destmax), where amax is the absolute maximum in the input (source) block of 32 elements and destmax is the largest positive number in the destination (MX) number system. float_to_8bits converts a floating-point number to a power-of-two number.
An FP32 number can be represented in the IEEE convention as 2^E × 1.mantissa (normal) or 2^-126 × 0.mantissa (denormal). E lies between -127 and 127 (or 0 to 254 with the exponent bias) and can be represented in the 8-bit container for the scale factor; -126 (or 1 with the exponent bias) is also representable in the 8-bit container. The mantissa lies in [0, 1). So the question is: should the mantissa bits be rounded up, rounded down, rounded to nearest, or discarded when creating a power-of-two number? We find round-up to be the best choice for pre-training with MX-formats.
Rounding: float_to_8bits() operates directly on the bit representation of the ratio amax/destmax: the FP32 exponent field of the ratio is kept, and if any mantissa bit of the ratio is non-zero the exponent is incremented by one (round-up toward positive infinity), saturating to the UE8M0 limits. By construction, amax/destmax never exceeds 2^127 (the largest value representable in UE8M0) for the FP8, FP6 and FP4 formats. In emulation, these computations are carried out in bit-space.
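A sketch of this bit-space computation is given below, assuming a finite, non-negative ratio; handling of NaN/Inf inputs and other corner cases in the production emulation may differ.

```python
import struct

def float_to_8bits_exponent(ratio):
    """Bit-space sketch of float_to_8bits(): keep the FP32 exponent field of
    amax/destmax and round it up (toward +infinity) whenever any mantissa bit
    is set, saturating at the top of the UE8M0 range."""
    bits = struct.unpack(">I", struct.pack(">f", ratio))[0]
    exp_biased = (bits >> 23) & 0xFF        # IEEE-754 FP32 biased exponent field
    mantissa = bits & 0x7FFFFF              # 23 mantissa bits
    if mantissa != 0:                       # significand exceeds 1.0 -> round exponent up
        exp_biased += 1
    exp_biased = min(exp_biased, 254)       # saturate (2**127 is the UE8M0 max)
    return exp_biased - 127                 # unbiased scale exponent X (scale = 2**X)

# Example: amax = 500, destmax = 448 -> ratio ~1.116; the exponent rounds up to 1.
print(float_to_8bits_exponent(500.0 / 448.0))   # -> 1
```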
Section 2 relies on the standard MX-format conversion algorithm defined in [2]; for completeness we show it here, given a shared scaling exponent X computed as in Sec. 3.1.
Quantizing FP32 values to MX type: Once X is computed, V_i / 2^X is computed and the resulting value is quantized to an FP8-representable number (Quantize_to_fp8()). Round-to-nearest-ties-to-even (RN) rounding is used during this quantization step. The conversion is saturating, i.e. if after rounding the resulting value exceeds the FP8 max or is less than the FP8 min value, the result is clamped to the respective max or min value.
Quantization operations add computational overhead; Blackwell has hardware support for rounding the scale (using our proposed method) and for quantizing values, which lowers this overhead.
Computing the matrix product of two tensors involves performing dot-products between sub-vectors of the two tensors. Scaling factors therefore need to be processed once per group of values that share a scale. Since MX-formats use fine-grained scaling, scale factors are processed once after each block's dot-product, i.e. many times per tensor-wide dot-product. This is expensive to do in software, so hardware support for accelerating tensor operations on MX-formats (as in Blackwell) is needed.
A.4 MXFP8 pre-training for a mixture-of-experts model
Section 3.3 presents empirical data showing that MXFP8 matches BF16 accuracy (both training loss and downstream task accuracy). Transformer-based mixture-of-experts (MoE) models are popular in the literature. Fig. 6 shows that MXFP8 pre-training also matches the BF16 pre-training loss curve for the MoE setup we experimented with. The MoE model has 16 billion total parameters and ∼2.5 billion active parameters, and we train it on 1 trillion tokens, following the same guidelines discussed in Section 3.2. The pre-training phase uses a WSD [32] learning rate schedule. The final loss of the MXFP8-trained MoE model is within 0.1% of the BF16-trained model.
Figure 6: MXFP8 versus BF16 training loss for a MoE model over 1T training tokens.
We conduct numerical experiments on LLM pre-training with variants of Nemotron-4 [11] models. Training and model details are described below. The 1T and 300B token datasets are subsets of the 17T-token dataset discussed in [33]. Table 2 details the parameters of the various models used.