Recipes for Pre-Training LLMs with MXFP8
Asit Mishra, Dusan Stosic, Simon Layton
Abstract
Precision scaling — using fewer bits to represent model parameters and related tensors
during pre-training — has emerged as a compelling technique for improving GPU efficiency
without sacrificing accuracy. Microscaling (MX) formats [1] in NVIDIA’s latest Blackwell GPUs represent a major leap in enabling precision scaling. These formats combine narrow floating-point data types with per-block scaling factors, offering a fine-grained approach to quantizing tensors.
Although MX-formats offer the promise of improved numeric stability compared to other
reduced-precision representations, in practice they must be used carefully in order to suc-
cessfully converge an LLM on a multi-trillion token dataset. In this paper, we show that the
rounding mode suggested in OCP [1] specification can lead to divergence when pre-training
an LLM. We show that an improved rounding mode, which uses round-to-infinity to compute scaling factors, enables successful pre-training in MXFP8 for an 8B model on 15T tokens.
1 Introduction
Scaling the number of parameters of deep learning models, especially large generative models, has
intensified the need for more efficient compute and memory solutions. Precision scaling, i.e. reducing the number of bits used to represent a data type, is a successful strategy to improve performance while
lowering memory requirements. However, precision scaling without sacrificing model accuracy
remains a key challenge.
Microscaling (MX) formats, developed in the Open Compute Project (OCP) [1], aim to offer a standardized approach to low-precision representation. By combining narrow floating-point types with fine-grained scaling factors, MX-formats aim to deliver a balance of dynamic range, precision, and hardware efficiency. OCP [1] and research works such as [2, 3, 4] have shown fine-grained scaling to be more effective than coarser-grained (e.g., per-tensor) scaling approaches, especially in sub-8-bit pre-training regimes.
NVIDIA’s latest GPU generation, Blackwell [5], adds native MX data type support to Tensor Cores.
Blackwell supports MX data for 8-bit (MXFP8), 6-bit (MXFP6) and 4-bit (MXFP4) floating-point types.
This paper primarily focuses on MXFP8 and provides an in-depth analysis of pre-training large
language models (LLMs). It details the conversion process from high precision formats to the
quantized MXFP8 format, including rounding modes for data and scaling factors. These insights aim
to guide the efficient and accurate pre-training of LLMs using MXFP8.
Notably, we observe that the rounding mode suggested in OCP [1] for scale factor computation causes MXFP8 pre-training loss to deviate from the BF16 baseline. We propose a modification to this scale factor rounding, using round-to-infinity instead of rounding down, and demonstrate that this adjustment enables MXFP8 pre-training to match BF16 pre-training accuracy.
The primary contributions of this paper are as follows:
– A methodical study of MX quantization for LLM pre-training in which we investigate the following design choices: data formats, quantization schemes, storage, and the selection of tensors to quantize, and examine how these choices affect accuracy. Specifically, we identify E4M3 as the MXFP8 data type that best maintains pre-training accuracy.
– We find that the rounding mode suggested in the OCP specification can lead to divergence during pre-training. Building on this analysis, we propose a rounding mode for scale factors that lets MXFP8-trained models match BF16 model accuracy. Specifically, we describe a recipe to pre-train LLMs with MXFP8 and show experiments in which an 8 billion parameter dense LLM trained on 15 trillion tokens matches both a BF16-trained model and a per-tensor scaled FP8 [6] model with its first and last layers in BF16.
For each design choice, we present ablation studies on small models (hundreds of millions of parameters) pre-trained on short token horizons (hundreds of billions of tokens). We then apply the findings to large multi-billion parameter models trained on trillion-token horizons. This makes our conclusions robust for large-scale foundation model pre-training.
Figure 1: A single MXFP block (green box) and the interpretation of the MXFP format: the 32 stored quantized values Q0 … Q31 share one scale, and each Qi is multiplied by the scale to recover the dequantized FP32 values V0 … V31.
Background: An MX-format is specified by a block size K, a shared scaling factor per block, X, and the data type of the elements in the block. A block is a contiguous set of K elements along an axis of a tensor. K = 32 for all MX types on Blackwell. The data type of X is UE8M0, an unsigned exponent-only value that encodes either NaN or any power-of-two¹ value in the range 2^-127 to 2^127.
Given K input elements in a source format, V_i (typically FP32), 1 ≤ i ≤ K, conversion to MX-format consists of computing X and Q_i such that Q_i × 2^X is the decoded value corresponding to V_i. X and Q_i are stored in memory instead of V_i. This is depicted in Fig. 1. A tensor in the source format is sub-divided into blocks of K elements and converted into MX-format to be stored in memory and/or processed by math units in hardware. Tensor Cores in Blackwell consume X and Q_i to compute the dot product of two MX-formatted blocks. If the accumulated output of a dot-product operation in Tensor Cores is FP32, this output is subsequently quantized to MX-format if an operation consuming the output needs it in that format. The conversion process is Q_i = Quantize_to_fp8(V_i / 2^X). Section 3 describes the conversion details. Fine-grained scaling helps each MX block independently align to the needed range of input values before quantization.
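To make the data layout concrete, the short NumPy sketch below shows how a 1-D slice of a tensor is divided into 32-element blocks, how a block is decoded from (Q_i, X), and how the per-block scales factor out of a block dot product. The helper names are ours and the sketch is purely illustrative; the actual scale computation and FP8 quantization are described in the remainder of the paper.

```python
import numpy as np

K = 32  # MX block size on Blackwell

def split_into_blocks(row):
    """Sub-divide a 1-D slice (taken along one axis of a tensor) into K-element blocks."""
    assert row.size % K == 0
    return row.reshape(-1, K)

def mx_dequantize(q_block, x):
    """Decode one block: V_i ~= Q_i * 2**X, where X is the shared UE8M0 scale exponent."""
    return q_block.astype(np.float32) * (2.0 ** x)

def mx_block_dot(qa, xa, qb, xb):
    """Dot product of two MX-formatted blocks, as consumed by a Tensor Core:
    the two per-block scales factor out of the 32-element accumulation."""
    acc = np.dot(qa.astype(np.float32), qb.astype(np.float32))  # FP32 accumulation
    return acc * 2.0 ** (xa + xb)
```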
Table 1 shows the MX-formats supported in Blackwell. The data type column uses the convention ExMy to denote x bits for the floating-point exponent and y bits for the mantissa. For a fixed bit-width, floating-point numbers trade off exponent width against mantissa width. Thus, MXFP8 E4M3 is an 8-bit data type with 1 sign bit, 4 exponent bits and 3 mantissa bits. Similarly, E5M2 is an 8-bit type with 5 exponent bits and 2 mantissa bits. Compared to E4M3, E5M2 can represent a larger dynamic range, but with less precision. E5M2 is the only type that follows IEEE 754 conventions [7] for the representation of special values. For all other data types, dynamic range is extended by not representing Infinity and NaN (E4M3 has only one bit pattern for NaN [6]). More bits in the exponent field translate to a larger range, while more bits in the mantissa field translate to more precision within a given range. Every floating-point type has a dynamic range that it can represent; we denote this range in binades, the log2 ratio of the maximum to the minimum finite representable value in that format.

¹ A power-of-two is a number of the form 2^n where n is a positive integer, a negative integer, or zero.
Computing X: Typically, most of the values within each block of a tensor will be outside the representable range of the target MX format, either underflowing the minimum or overflowing the maximum representable number. To address this, all values within a block are multiplied by a scale factor, shifting the maximum number of values into the representable range.
The scale factor, X, is computed from the absolute maximum value (amax) among the 32 high-precision input values, i.e. amax = max |V_i|, 1 ≤ i ≤ 32. The goal is to map this amax in the input format to the largest representable value in the desired MX-format. Special care is taken if some or all of the elements in the input are Infinity and/or not-a-number (NaN). If amax in a block is 0, then X is set to -127, so the interpreted scale is 2^-127, and all Q_i in this case are set to 0.
When amax is not Inf, NaN or 0, the OCP [1] specification suggests setting the scale to the largest power-of-two less than or equal to amax, divided by the largest power-of-two representable in the MX-format type. For the E4M3 type, for example, this gives a scale exponent of X = floor(log2(amax)) - floor(log2(448)), since 448 is the largest magnitude representable in E4M3. Thus, the OCP specification ignores the floating-point significand of this ratio.²
We observe accuracy degradation when following the OCP specification. Fig. 2 shows training loss curves for an 843 million parameter transformer model trained under two token horizons: 300 billion and 1 trillion tokens. Two different configurations are studied:
– (cfg1) E4M3 for all tensors (weights (W), activations (A) and gradients (G)), and
– (cfg2) E4M3 for W and A tensors, E5M2 for G tensors.
The E5M2 format has ∼1.8x more binades than E4M3, and since gradients typically have a larger dynamic range, [6] advocated the use of E5M2 with a per-tensor scaling method. This rationale underpins the study of E5M2 gradients. In both cfg1 and cfg2, scale factor computation using the OCP method leads to training divergence (Fig. 2a) or a widening loss gap relative to BF16 (Fig. 2b).

² Assume amax and the maximum representable value in the MX-format (destmax) are normal floating-point numbers of the form amax = 2^A × 1.Ma and destmax = 2^E × 1.Me (where the mantissas Ma and Me lie between 0 (inclusive) and 1). The OCP rounding scheme then results in a scale exponent of A - E (ignoring the case where A - E is clamped to a minimum of -127).
Figure 2: Comparing the scale factor rounding mode suggested in the OCP specification with ours. (a) Validation perplexity for an 843M parameter model trained on 300B tokens. (b) Validation perplexity for an 843M parameter model trained on 1T tokens. E4M3 and E5M2 are MXFP8 formats; the notation E4M3(W) denotes the E4M3 data type for weights; similarly, E4M3(A) and E4M3(G) refer to activations and gradients, respectively.
Algorithm 1 outlines our method for computing the scale factor. The key change is to round the exponent of the ratio of amax to destmax (the maximum representable value in the MX-format) up towards positive infinity, while saturating to the UE8M0 max/min limits. This contrasts with the OCP scheme, which effectively rounds the scale value down. Since a high-precision value V_i is scaled by the scale factor 2^X, rounding up the denominator of the fraction V_i / 2^X tends to map amax below destmax. In contrast, the OCP method tends to map amax above destmax, which subsequently must be clamped so that it becomes representable. We hypothesize that this saturating effect of the OCP rounding method harms model accuracy. A more detailed and accurate workflow to compute and round the value of X is described in Appendix Sec. A.1.
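To make the contrast concrete, the sketch below implements both rounding choices using the log2-based simplification of Algorithm 1 (the bit-exact version used for emulation is given in Appendix A.1); the function names are ours. For amax = 500 with E4M3, the OCP choice yields X = 0, so the scaled amax (500) exceeds destmax = 448 and must be saturated, whereas the round-up choice yields X = 1 and maps amax to 250.

```python
import math

E4M3_MAX = 448.0  # destmax for MXFP8 E4M3

def scale_exponent_ocp(amax):
    """OCP-style scale exponent: floor(log2(amax)) - floor(log2(destmax)).
    This effectively rounds the scale down, so amax / 2**X can land above
    destmax and must then be clamped during quantization."""
    return math.floor(math.log2(amax)) - math.floor(math.log2(E4M3_MAX))

def scale_exponent_ours(amax):
    """Proposed scale exponent: round the exponent of amax/destmax up toward
    +infinity, saturating to the UE8M0 range, so amax maps at or below destmax."""
    x = math.ceil(math.log2(amax / E4M3_MAX))
    return max(-127, min(127, x))

amax = 500.0
for fn in (scale_exponent_ocp, scale_exponent_ours):
    x = fn(amax)
    print(fn.__name__, "X =", x, "scaled amax =", amax / 2.0 ** x)
```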
Fig. 2 shows that, with our proposed rounding scheme, MXFP8 with E4M3 gradients (blue curve) and MXFP8 with E5M2 gradients (purple curve) both overlap with the reference BF16 loss curve across the 300B and 1T token horizons. Later, in Section 3.2, we show that E4M3 is in fact a better choice than E5M2 for pre-training LLMs with MX-style fine-grained scaling.
Quantizing FP32 values to MX type: Once X is computed, V_i is scaled by 2^X and the resulting value is quantized to an FP8-representable number. This is the Quantize_to_fp8() function. Round-to-nearest-ties-to-even (RN) rounding is used during this quantization step. The conversion process is saturating, i.e. if after rounding the resulting value exceeds the FP8 max or is less than the FP8 min value, the result is clamped to the respective max or min value.
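A minimal NumPy approximation of this Quantize_to_fp8() step for E4M3 is sketched below, assuming 3 mantissa bits, a minimum normal exponent of -6 and a maximum magnitude of 448; a bit-exact converter (or the hardware) would operate on FP8 encodings directly rather than in real arithmetic.

```python
import numpy as np

E4M3_MAX = 448.0        # largest finite E4M3 magnitude
E4M3_MIN_NORM_EXP = -6  # exponent of the smallest E4M3 normal (2**-6)
MANT_BITS = 3           # E4M3 mantissa width

def quantize_to_fp8_e4m3(x):
    """Saturating round-to-nearest-even quantization of scaled FP32 values onto
    the E4M3 grid (an approximation for illustration only)."""
    x = np.asarray(x, dtype=np.float32)
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)              # saturate to FP8 max/min
    out = np.zeros_like(x)
    nz = x != 0
    exp = np.floor(np.log2(np.abs(x[nz])))
    exp = np.maximum(exp, E4M3_MIN_NORM_EXP)          # subnormals share exponent -6
    step = np.float32(2.0) ** (exp - MANT_BITS)       # spacing of representable values
    out[nz] = np.round(x[nz] / step) * step           # np.round is ties-to-even (RN)
    return np.clip(out, -E4M3_MAX, E4M3_MAX)
```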
A practical instance of this conversion process in low-precision LLM pre-training arises when matrix
multiplication and accumulation (MMA) outputs, typically stored in FP32, must be mapped to MXFP8.
In this case, 8-bit quantized values are stored in memory, saving write bandwidth and storage capacity compared with storing FP32 values. Subsequent model operations then read MXFP8 values, saving read bandwidth compared to loading FP32 values. Further, since Tensor Cores can
process MX-formatted inputs, the MMA operation in lower precision consumes less energy and
operates at higher throughput.
3.2 E4M3 data type for weights, activations and gradients
Figure 3: Pre-training loss curves comparing E4M3 and E5M2 when used across different tensor types: weights (W), activations (A) and gradients (G). (a) Validation perplexity for an 843M parameter model trained on 1T tokens, comparing E4M3(W)-E5M2(A)-E4M3(G), E5M2(W)-E4M3(A)-E4M3(G), E4M3(W)-E4M3(A)-E5M2(G), E4M3(W)-E4M3(A)-E4M3(G) and BF16. (b) Validation perplexity for an 8B parameter model trained on 1T tokens, comparing E4M3(W)-E4M3(A)-E5M2(G), E4M3(W)-E4M3(A)-E4M3(G) and BF16. The insets show a zoomed-in view of the loss at the end of training.
FP8 has two variants in Blackwell (Table 1): E4M3 and E5M2. Our experiments show the following:
– Better convergence with E4M3 instead of E5M2 for weights and activations during training. Fig. 3a
shows the loss behavior comparing E4M3 and E5M2 for the same 843M model used in Fig. 2b.
The purple curve has E5M2 for activations and the blue curve has E5M2 for weights while all
other tensors are in E4M3. Both these curves show worse loss convergence compared to using
E4M3 for all tensors (orange curve) or using E5M2 for gradients (yellow curve).
– E4M3 for gradients maintains training loss parity with BF16 pre-training, especially for models
with 2 billion or more parameters. Fig. 3b shows the loss behavior comparing E4M3 and E5M2
for the gradient tensor on an 8 billion parameter LLM trained on 1 trillion tokens: using E4M3
gradients (orange) has lower loss than using E5M2 gradients (yellow). This gap increases as
the model is trained on more tokens. This change in behavior with increasing model parameter
counts underscores the importance of examining the numerical properties of formats across a
broad range of model sizes.
Previously, per-tensor scaled FP8 studies [8, 6, 9] and coarse-grained block-scaled FP8 studies [10] used E5M2 for gradient tensors instead of E4M3. With fine-grained scaling, the dynamic range requirements at a 32-element block size are sufficiently captured by the 17.8 binades of the E4M3 type. Once the range requirements are met, precision (or sampling density) becomes important, and E4M3, with 8 samples per binade, is better than E5M2, with only 4. Thus, for our MXFP8 pre-training recipe, we quantize all three tensor types (weights, activations, and gradients) with the E4M3 data type.
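As a quick sanity check on these numbers, the snippet below recomputes the dynamic ranges from the standard FP8 limits given in [6] (max finite 448 and smallest subnormal 2^-9 for E4M3; 57344 and 2^-16 for E5M2): E4M3 spans roughly 17.8 binades and E5M2 roughly 31.8, which is the ~1.8x ratio quoted above.

```python
import math

# Dynamic range in binades = log2(max_finite / min_subnormal), per FP8 type [6].
fp8_limits = {
    "E4M3": (448.0, 2.0 ** -9),     # 3 mantissa bits -> 2**3 = 8 samples per binade
    "E5M2": (57344.0, 2.0 ** -16),  # 2 mantissa bits -> 2**2 = 4 samples per binade
}
binades = {name: math.log2(mx / mn) for name, (mx, mn) in fp8_limits.items()}
print(binades)                               # {'E4M3': ~17.8, 'E5M2': ~31.8}
print(binades["E5M2"] / binades["E4M3"])     # ~1.8x, as quoted in the text
```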
Layers to instantiate in MXFP8 and training workflow: We use language-based transformer models for all studies in this paper; future work will study these recipes on speech and vision models. Based on our studies, our guideline is to quantize the QKV, PROJ and FFN Up- and Down-projection layers to MXFP8 across all transformer blocks in the model. The batch matrix multiplications (BMM1, the query-key dot product, and BMM2, the attention score-value product) in the self-attention layer, along with operations like Softmax, activation functions and residual-add, remain in high precision. We found this to be the safest option for maintaining accuracy parity with BF16 pre-training. This is illustrated in Fig. 4. The input embedding layer and the final output-projection layer are also kept in BF16 or FP16. All studies in this paper use this guideline.
During training with MXFP quantization, the training framework has to keep two copies of each tensor (weights, activations and gradients), one quantized along each axis of dot-product reduction (row and column). Fig. 4 shows how each tensor is used in forward (FPROP), weight-gradient (WGRAD) and activation-gradient (DGRAD) computations during the training loop. Since each tensor is used in both non-transposed and transposed form, quantization needs to occur along two separate axes (row and column), as sketched below.
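The sketch below illustrates only the blocking pattern behind this requirement, assuming the usual linear-layer formulation Y = X W^T (this formulation and the helper names are ours, not taken from the paper); the actual per-block conversion is the one described in Section 3.

```python
import numpy as np

K = 32  # MX block size

def blockwise_amax(t2d, axis):
    """Per-block absolute maxima of a 2-D tensor, grouping K contiguous elements
    along the chosen reduction axis. A full conversion would derive one UE8M0
    scale and 32 E4M3 values per block; here we only show the blocking pattern."""
    t = np.moveaxis(t2d, axis, -1)          # bring the reduction axis last
    blocks = t.reshape(t.shape[0], -1, K)   # [other_dim, n_blocks, K]
    return np.abs(blocks).max(axis=-1)

# Assuming Y = X @ W.T with X: [tokens, c_in] and W: [c_out, c_in]:
#   FPROP  (Y  = X  @ W.T) reduces over c_in   -> X and W blocked along axis 1
#   DGRAD  (dX = dY @ W  ) reduces over c_out  -> dY along axis 1, W along axis 0
#   WGRAD  (dW = dY.T @ X) reduces over tokens -> dY and X along axis 0
# Hence each of X, W and dY is kept in two MXFP8 copies, one per reduction axis.
X = np.random.randn(8192, 4096).astype(np.float32)
amax_fprop = blockwise_amax(X, axis=1)   # blocks along c_in (non-transposed use)
amax_wgrad = blockwise_amax(X, axis=0)   # blocks along tokens (transposed use)
```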
Figure 4: Top: Transformer layers quantized to MXFP8 inside a single transformer block. Bottom: Training workflow for a single layer during FPROP, DGRAD and WGRAD; BF16 weights and activations are quantized to MXFP8 before each MXFP8 GEMM, and the BF16 output is passed to the next layer.
Summary so far: We introduce a new rounding scheme for the MX scale factor that addresses the
divergence caused by the OCP-based approach and achieves loss parity with BF16 on an 843M-
parameter model up to 1T tokens. Additionally, adopting E4M3 and our scale factor calculation
method in Alg. 1 enables scaling to an 8B-parameter model trained on 15T tokens — representing
the largest LLM pre-training at this scale with MXFP formats, to the best of our knowledge. Appendix
(section A.4) shows that our recipe also holds for a 16B-parameter mixture-of-experts model.
We pre-train an 8B parameter Nemotron model [11] using Megatron-LM [12]. The model has 32 transformer blocks, 32 attention heads, a hidden size of 4096, a GQA group size of 8, 128 KV-channels, and a sequence length of 8192 during pre-training. It is trained on 15T tokens with a batch size of 768. The initial learning rate is 6e-4 and decays to 6e-6 with a cosine schedule. A phased data-blending approach is used to train the model: the first phase uses a data mixture that promotes diversity, and the second phase uses high-quality datasets (e.g., Wikipedia). We switch to the second phase at the 60% point of training. This blending style has also been used in other large-scale pre-training setups [8]. The model is pre-trained on 3072 Hopper GPUs (hardware with MX support was unavailable during much of the experimentation period).
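For reference, the hyperparameters listed above can be collected into a single configuration summary; the dictionary and its key names are ours, not a Megatron-LM argument list.

```python
# 8B Nemotron pre-training setup as stated above; key names are illustrative only.
nemotron_8b_pretraining = {
    "transformer_blocks": 32,
    "attention_heads": 32,
    "hidden_size": 4096,
    "gqa_group_size": 8,
    "kv_channels": 128,
    "sequence_length": 8192,
    "training_tokens": 15_000_000_000_000,   # 15T
    "global_batch_size": 768,
    "lr_initial": 6e-4,                      # cosine-decayed ...
    "lr_final": 6e-6,                        # ... to this value
    "data_blend_phase2_start": 0.60,         # switch to high-quality blend at 60%
    "gpus": 3072,                            # Hopper, with MXFP8 emulated in software
}
```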
We use an emulation of MX-formats on Hopper GPUs: tensors that feed an MMA operation are first quantized into the MX-format and then cast back to BF16 before the BF16 MMA operation. The training workflow depicted in Fig. 4 is implemented in Megatron-LM. We validated the numerical fidelity of our emulation package by comparing against a 2B parameter LLM pre-training run on Blackwell using the actual MXFP8 format and confirmed that the two were identical.
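A sketch of this fake-quantization scheme for a single linear layer is shown below, assuming PyTorch BF16 tensors; mxfp8_fake_quantize stands in for the MXFP8 conversion of Section 3 (returning BF16 values numerically equal to the MXFP8-quantized ones) and is not an actual Transformer Engine or Megatron-LM API.

```python
import torch

def mxfp8_emulated_linear(x_bf16, w_bf16, mxfp8_fake_quantize):
    """Emulate an MXFP8 GEMM on a pre-Blackwell GPU: quantize both MMA inputs to
    MXFP8 along their reduction axes, cast the results back to BF16, and run the
    matmul itself in BF16 on the existing Tensor Cores."""
    xq = mxfp8_fake_quantize(x_bf16, block_axis=-1).to(torch.bfloat16)
    wq = mxfp8_fake_quantize(w_bf16, block_axis=-1).to(torch.bfloat16)
    return xq @ wq.t()   # BF16 MMA; inputs carry the MXFP8 quantization error
```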
Fig. 5 shows training loss behavior and task-level accuracy for the 8B pre-trained model. We
report evaluation scores on two sets of downstream tasks: (1) the 5-shot score on MMLU [13] and (2) the averaged 1-shot score across 9 general reasoning benchmarks: ARC-Challenge and ARC-Easy [14],
Race [15], PIQA [16], Winogrande [17], Hellaswag [18], OpenBookQA [19], Social IQA [20] and
Commonsense QA [21]. We observe the following:
– Validation perplexity of the model when pre-trained with MXFP8 matches pre-training with BF16
(left plot in Fig. 5). There is less than 0.50% difference between MXFP8 and BF16 validation
perplexity values throughout the pre-training run.
– The middle and right plots in Fig. 5 show evaluation scores on the two sets of downstream tasks. Again, scores for the MXFP8-trained model match those of the BF16-trained model. This makes MXFP8 a viable candidate for pre-training LLMs.
Figure 5: Pre-training an 8B LLM on 15T tokens. Top: Validation perplexity of BF16 vs. MXFP8 over the 15T-token run (with a zoomed-in view near the end of training). Bottom: Downstream task scores of BF16, FP8 and MXFP8 on MMLU and on the set of 9 reasoning tasks as training progresses. MXFP8 numerics use our proposed rounding method and E4M3 for all quantized tensors.
MXFP8 versus FP8: In addition to MXFP8 and BF16, Fig. 5 also shows task-level scores for the same model trained with traditional FP8 precision. The FP8 recipe uses software-managed per-tensor scaling [6], where an entire tensor is scaled so that the maximum number of tensor values falls into the representable range of the quantized format. We follow the guidelines suggested in [8] for the FP8 pre-training setup: the first and last transformer blocks in the model are kept in BF16 while the linear layers of the remaining blocks are quantized to FP8. [8] found this choice appropriate for pre-training 8B and 56B parameter LLMs on 20T tokens. Keeping some layers in BF16 reduces end-to-end speedup and also complicates pre-training, since a choice has to be made about which layers to leave in higher precision. We observe that MXFP8 matches FP8 accuracy on these two sets of tasks without requiring any BF16 layers.
MXFP8 versus blockwise-FP8: Further, some works like DeepSeek-V3 [10] report the need for smaller scaling block sizes when using FP8. In that setup, certain tensors require 1x128 vector scaling and others require per-block (e.g., 128x128) software scaling, which complicates GEMM kernel design. Native support for MXFP8 simplifies this: fine-grained scaling provides better numerical robustness and avoids any tradeoff between smaller block sizes and hardware speed.
In summary, we find that MXFP8 maintains accuracy relative to BF16 and FP8 pre-trained models. On GB200 Blackwell systems, MXFP8 has 2× higher throughput than BF16, making end-to-end MXFP8 pre-training faster than BF16 pre-training. We also find the MXFP8 recipe simpler to use than FP8 (all layers can be quantized and scaling is handled in hardware) while delivering equal or better throughput.
4 Related work
Low-precision training and inference are widely studied topics in the deep learning literature. While significant progress has been made in low-precision inference [22, 23, 24, 25], there are relatively few studies demonstrating low-precision techniques for pre-training LLMs, especially large-scale pre-training over long token horizons. Our work primarily focuses on 8-bit pre-training; prior work on related low-precision LLM pre-training can be grouped into the following two categories:
– LLM pre-training using FP8 formats: [6] proposes an FP8 binary interchange format, con-
sisting of E4M3 and E5M2 encodings, and a per-tensor scaling approach — an entire tensor is
scaled to capture the maximum number of tensor values in the representable range in FP8. [26]
discusses FP8 pre-training challenges and proposes model-level modifications to train a 7B
parameter model. Recently, per-tensor scaled FP8 was used to train the Nemotron-H family of
LLMs [8] and the Llama-4 family of models also used FP8 [27]. Instead of per-tensor scaling, the DeepSeek-V3 family of models [10] uses block-scaled FP8, which helps to better capture outliers and minimize quantization errors.
– Pre-training using MXFP formats: [2] presents empirical data on pre-training models with
MXFP formats. They show cast-only inference results for MXFP8 and pre-training results for
MXFP6 and MXFP4-weights. [28] studies MXFP4 backward-pass quantization and [29] investigates MXFP4 weight quantization on relatively short token horizons. All of these studies are based on the scaling factor computation method described in [1], which we show to be ineffective at large token horizons.
Acknowledgments
We thank members of the ADLR/PSX team (Sweta P., Mikail K., Ben L.) for helping with the draft
revisions as well as Mohammad S., Carlo D.M., Michael A., Eric C. and Bryan C. with valuable
feedback and discussions. We also thank PM for guidance throughout this work.
References
[1] Bita Darvish Rouhani, Nitin Garegrat, Tom Savell, Ankit More, Kyung-Nam Han, Ritchie Zhao,
Mathew Hall, Jasmine Klar, Eric Chung, Yuan Yu, Michael Schulte, Ralph Wittig, Ian Bratt,
Nigel Stephens, Jelena Milanovic, John Brothers, Pradeep Dubey, Marius Cornea, Alexander
Heinecke, Andres Rodriguez, Martin Langhammer, Summer Deng, Maxim Naumov, Paulius
Micikevicius, Michael Siu, and Colin Verrilli. Ocp microscaling (mx) specification. Open
Compute Project, 2023.
[2] Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi,
Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic
Dusan, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby,
Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mes-
makhosroshahi, Andres Rodriguez, Michael Schulte, Rasoul Shafipour, Lei Shao, Michael
Siu, Pradeep Dubey, Paulius Micikevicius, Maxim Naumov, Colin Verrilli, Ralph Wittig,
Doug Burger, and Eric Chung. Microscaling data formats for deep learning, 2023. URL
https://arxiv.org/abs/2310.10537.
[3] Bita Rouhani, Ritchie Zhao, Venmugil Elango, Rasoul Shafipour, Mathew Hall, Maral Mes-
makhosroshahi, Ankit More, Levi Melnick, Maximilian Golub, Girish Varatkar, Lei Shao, Gau-
rav Kolhe, Dimitry Melts, Jasmine Klar, Renee L’Heureux, Matt Perry, Doug Burger, Eric Chung,
Zhaoxia Deng, Sam Naghshineh, Jongsoo Park, and Maxim Naumov. With shared microexpo-
nents, a little shifting goes a long way, 2023. URL https://arxiv.org/abs/2302.08007.
[4] Steve Dai, Rangharajan Venkatesan, Haoxing Ren, Brian Zimmer, William J. Dally, and Brucek
Khailany. Vs-quant: Per-vector scaled quantization for accurate low-precision neural network
inference, 2021. URL https://arxiv.org/abs/2102.04503.
[6] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard
Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellem-
pudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. Fp8 formats for deep
learning, 2022. URL https://arxiv.org/abs/2209.05433.
[7] IEEE standard for floating-point arithmetic. IEEE Std 754-2008, pages 1–70, 2008. doi:
10.1109/IEEESTD.2008.4610935.
[8] NVIDIA: Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad
Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh,
Ameya Sunil Mahabaleshwarkar, Andrew Tao, Anna Shors, Ashwath Aithal, Ashwin Poojary,
Ayush Dattagupta, Balaram Buddharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon
Norick, Brian Butterfield, Bryan Catanzaro, Carlo del Mundo, Chengyu Dong, Christine Harvey,
Christopher Parisien, Dan Su, Daniel Korzekwa, Danny Yin, Daria Gitman, David Mosal-
lanezhad, Deepak Narayanan, Denys Fridman, Dima Rekesh, Ding Ma, Dmytro Pykhtar, Dong
Ahn, Duncan Riach, Dusan Stosic, Eileen Long, Elad Segal, Ellie Evans, Eric Chung, Erick
Galinkin, Evelina Bakhturina, Ewa Dobrowolska, Fei Jia, Fuxiao Liu, Gargi Prasad, Gerald
Shen, Guilin Liu, Guo Chen, Haifeng Qian, Helen Ngo, Hongbin Liu, Hui Li, Igor Gitman, Ilia
Karmanov, Ivan Moshkov, Izik Golan, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jarno
Seppanen, Jason Lu, Jason Sewall, Jiaqi Zeng, Jiaxuan You, Jimmy Zhang, Jing Zhang, Jining
Huang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jon Barker, Jonathan Cohen,
Joseph Jennings, Jupinder Parmar, Karan Sapra, Kari Briski, Kateryna Chumachenko, Katherine
Luna, Keshav Santhanam, Kezhi Kong, Kirthi Sivamani, Krzysztof Pawelec, Kumar Anik,
Kunlun Li, Lawrence McAfee, Leon Derczynski, Lindsey Pavao, Luis Vega, Lukas Voegtle,
Maciej Bala, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski,
Markus Kliegl, Marta Stepniewska-Dziubinska, Matthieu Le, Matvei Novikov, Mehrzad Samadi,
Michael Andersch, Michael Evans, Miguel Martinez, Mike Chrzanowski, Mike Ranzinger,
Mikolaj Blaz, Misha Smelyanskiy, Mohamed Fawzy, Mohammad Shoeybi, Mostofa Patwary,
Nayeon Lee, Nima Tajbakhsh, Ning Xu, Oleg Rybakov, Oleksii Kuchaiev, Olivier Delalleau,
Osvald Nitski, Parth Chadha, Pasha Shamis, Paulius Micikevicius, Pavlo Molchanov, Peter
Dykas, Philipp Fischer, Pierre-Yves Aquilanti, Piotr Bialecki, Prasoon Varshney, Pritam Gun-
decha, Przemek Tredak, Rabeeh Karimi, Rahul Kandu, Ran El-Yaniv, Raviraj Joshi, Roger
Waleffe, Ruoxi Zhang, Sabrina Kavanaugh, Sahil Jain, Samuel Kriman, Sangkug Lym, San-
jeev Satheesh, Saurav Muralidharan, Sean Narenthiran, Selvaraj Anandaraj, Seonmyeong
Bak, Sergey Kashirsky, Seungju Han, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere
Sreenivas, Sharon Clay, Shelby Thomas, Shrimai Prabhumoye, Shubham Pachori, Shubham
Toshniwal, Shyamala Prayaga, Siddhartha Jain, Sirshak Das, Slawek Kierat, Somshubra Majum-
dar, Song Han, Soumye Singhal, Sriharsha Niverty, Stefania Alborghetti, Suseella Panguluri,
Swetha Bhendigeri, Syeda Nahida Akter, Szymon Migacz, Tal Shiri, Terry Kong, Timo Roman,
Tomer Ronen, Trisha Saar, Tugrul Konuk, Tuomas Rintamaki, Tyler Poon, Ushnish De, Vahid
Noroozi, Varun Singh, Vijay Korthikanti, Vitaly Kurin, Wasi Uddin Ahmad, Wei Du, Wei
Ping, Wenliang Dai, Wonmin Byeon, Xiaowei Ren, Yao Xu, Yejin Choi, Yian Zhang, Ying Lin,
Yoshi Suhara, Zhiding Yu, Zhiqi Li, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, and Zijia Chen.
Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models, 2025. URL
https://arxiv.org/abs/2504.03624.
[9] Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, and Carlo Luschi. 8-bit
numerical formats for deep neural networks, 2022. URL https://arxiv.org/abs/2206.
02915.
[10] DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu,
Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian
Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao,
Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang,
Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo,
Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong
Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean
Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li,
Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian,
Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du,
R. J. Chen, R. L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu
Zhang, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu,
Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng
Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W. L. Xiao, Wangding Zeng,
Wanjia Zhao, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang,
X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen,
Xiaokang Chen, Xiaokang Zhang, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang,
Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xinnan Song, Xinxia Shan, Xinyi
Zhou, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei,
Y. X. Zhu, Yang Zhang, Yanhong Xu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng
Sun, Yaohui Li, Yaohui Wang, Yi Yu, Yi Zheng, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying
He, Ying Tang, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo,
Yu Wu, Yuan Ou, Yuchen Zhu, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yukun Zha,
Yunfan Xiong, Yunxian Ma, Yuting Yan, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou,
Z. F. Wu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhen Huang, Zhen Zhang,
Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong
Shao, Zhipeng Xu, Zhiyu Wu, Zhongyu Zhang, Zhuoshu Li, Zihui Gu, Zijia Zhu, Zijun Liu,
Zilin Li, Ziwei Xie, Ziyang Song, Ziyi Gao, and Zizheng Pan. Deepseek-v3 technical report,
2025. URL https://arxiv.org/abs/2412.19437.
[11] Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subrama-
nian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu
Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki,
Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper,
Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, and Bryan Catanzaro.
Nemotron-4 15b technical report, 2024. URL https://arxiv.org/abs/2402.16819.
[12] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan
Catanzaro. Megatron-lm: Training multi-billion parameter language models using model
parallelism, 2020. URL https://arxiv.org/abs/1909.08053.
[13] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https:
//arxiv.org/abs/2009.03300.
[14] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning
challenge. arXiv:1803.05457v1, 2018.
[15] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale
reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
[16] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning
about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/
1911.11641.
[17] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An
adversarial winograd schema challenge at scale, 2019. URL https://arxiv.org/abs/1907.
10641.
[18] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a
machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830.
[19] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor
conduct electricity? a new dataset for open book question answering, 2018. URL https:
//arxiv.org/abs/1809.02789.
[20] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Com-
monsense reasoning about social interactions, 2019. URL https://arxiv.org/abs/1904.
09728.
[21] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA:
A question answering challenge targeting commonsense knowledge. In Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–
4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi:
10.18653/v1/N19-1421. URL https://aclanthology.org/N19-1421.
[22] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han.
Smoothquant: Accurate and efficient post-training quantization for large language models,
2024. URL https://arxiv.org/abs/2211.10438.
[23] Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song
Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving, 2025. URL
https://arxiv.org/abs/2405.04532.
[24] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training
quantization for generative pre-trained transformers, 2023. URL https://arxiv.org/abs/
2210.17323.
[25] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan
Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization
for llm compression and acceleration, 2024. URL https://arxiv.org/abs/2306.00978.
[26] Maxim Fishman, Brian Chmiel, Ron Banner, and Daniel Soudry. Scaling fp8 training to
trillion-token llms, 2025. URL https://arxiv.org/abs/2409.12517.
[27] Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innova-
tion. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, April 2025.
Accessed 12 May 2025.
[28] Albert Tseng, Tao Yu, and Youngsuk Park. Training llms with mxfp4, 2025. URL https:
//arxiv.org/abs/2502.20586.
[29] Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun
Zha, and Peng Cheng. Optimizing large language model training using fp4 quantization, 2025.
URL https://arxiv.org/abs/2501.17116.
[30] Nvidia. Transformer engine. https://github.com/NVIDIA/TransformerEngine/.
[31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,
Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of
mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/
2402.03300.
[32] Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding
warmup-stable-decay learning rates: A river valley loss landscape perspective, 2024. URL
https://arxiv.org/abs/2410.05192.
[33] NVIDIA: Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya,
Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das,
Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans,
Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz
Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala,
John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick LeGresley, Hui Li, Jiwei Liu, Zihan Liu,
Eileen Long, Ameya Sunil Mahabaleshwarkar, Somshubra Majumdar, James Maki, Miguel
Martinez, Maer Rodrigues de Melo, Ivan Moshkov, Deepak Narayanan, Sean Narenthiran,
Jesus Navarro, Phong Nguyen, Osvald Nitski, Vahid Noroozi, Guruprasad Nutheti, Christopher
Parisien, Jupinder Parmar, Mostofa Patwary, Krzysztof Pawelec, Wei Ping, Shrimai Prabhumoye,
Rajarshi Roy, Trisha Saar, Vasanth Rao Naik Sabavat, Sanjeev Satheesh, Jane Polak Scowcroft,
Jason Sewall, Pavel Shamis, Gerald Shen, Mohammad Shoeybi, Dave Sizer, Misha Smelyanskiy,
Felipe Soares, Makesh Narsimhan Sreedhar, Dan Su, Sandeep Subramanian, Shengyang Sun,
Shubham Toshniwal, Hao Wang, Zhilin Wang, Jiaxuan You, Jiaqi Zeng, Jimmy Zhang, Jing
Zhang, Vivienne Zhang, Yian Zhang, and Chen Zhu. Nemotron-4 340b technical report, 2024.
URL https://arxiv.org/abs/2406.11704.
A Appendix
The computation described in Algorithm 1 is a simplification, in that it stores the output of log2(float x) in FP32. The device's log2 function internally performs a round-to-nearest on the returned value, so the result of ceil(log2(x)) can differ if the output of log2 is not stored in a sufficiently wide data type. Hence, for emulation purposes we work directly with the bit representation of the ratio of amax and destmax. We next describe the computation flow.
Background: As a reminder, the quantization process from 32 high-precision values, V_i, to quantized values, Q_i, 1 ≤ i ≤ 32, is given by Q_i = Quantize_to_fp8(V_i / 2^X). 2^X is the scale factor; X is stored in an unsigned 8-bit integer container in memory and interpreted as 2^X by the hardware. This scale factor decodes Q_i back to V_i (up to quantization loss).
The value of the scale factor is 2^X = float_to_8bits(amax/destmax), where amax is the absolute maximum in the input (source) block of 32 elements and destmax is the largest positive number in the destination (MX) number system. float_to_8bits converts a floating-point number to a power-of-two number.
An FP32 number can be represented in the IEEE convention as 2^E × 1.mantissa (normal) or 2^-126 × 0.mantissa (denormal). E lies between -127 and 127 (or 0 to 254 with the exponent bias) and can be represented in the 8-bit container for the scale factor; -126 (or 1 with the exponent bias) is also representable in the 8-bit container. The mantissa lies in [0, 1). So the question is: should the mantissa bits be rounded up, rounded down, rounded to nearest, or discarded when creating a power-of-two number? We find round-up to be the best choice for pre-training with MX-formats.
Rounding: float_to_8bits() operates directly on the bit representation of the ratio amax/destmax: the FP32 exponent field of the ratio is kept, and if any mantissa bit of the ratio is non-zero the exponent is incremented by one (round-up toward positive infinity), saturating to the UE8M0 limits. By construction, amax/destmax never exceeds 2^127 (the largest value representable in UE8M0) for the FP8, FP6 and FP4 formats. In emulation, these computations are carried out in bit-space.
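A sketch of this bit-space computation is given below, assuming a finite, non-negative ratio; handling of NaN/Inf inputs and other corner cases in the production emulation may differ.

```python
import struct

def float_to_8bits_exponent(ratio):
    """Bit-space sketch of float_to_8bits(): keep the FP32 exponent field of
    amax/destmax and round it up (toward +infinity) whenever any mantissa bit
    is set, saturating at the top of the UE8M0 range."""
    bits = struct.unpack(">I", struct.pack(">f", ratio))[0]
    exp_biased = (bits >> 23) & 0xFF        # IEEE-754 FP32 biased exponent field
    mantissa = bits & 0x7FFFFF              # 23 mantissa bits
    if mantissa != 0:                       # significand exceeds 1.0 -> round exponent up
        exp_biased += 1
    exp_biased = min(exp_biased, 254)       # saturate (2**127 is the UE8M0 max)
    return exp_biased - 127                 # unbiased scale exponent X (scale = 2**X)

# Example: amax = 500, destmax = 448 -> ratio ~1.116; the exponent rounds up to 1.
print(float_to_8bits_exponent(500.0 / 448.0))   # -> 1
```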
Section 2 relies on the standard MX-format conversion algorithm defined in [2]; for completeness we show it here, given a shared scaling exponent X computed as in Sec. 3.1.
Quantizing FP32 values to MX type: Once X is computed, V_i / 2^X is computed and the resulting value is quantized to an FP8-representable number (Quantize_to_fp8()). Round-to-nearest-ties-to-even (RN) rounding is used during this quantization step. The conversion is saturating, i.e. if after rounding the resulting value exceeds the FP8 max or is less than the FP8 min value, the result is clamped to the respective max or min value.
Quantization operations add computational overhead; Blackwell has hardware support for rounding the scale (using our proposed method) and for quantizing values, which lowers this overhead.
Computing the matrix product of two tensors involves performing dot-products between sub-vectors of the two tensors. Scaling factors therefore need to be processed once per group of values that share a scale. Since MX-formats use fine-grained scaling, scale factors are processed once after each block's dot-product, i.e. many times per tensor-wide dot-product. This is expensive to do in software, so hardware support for accelerating tensor operations on MX-formats (as in Blackwell) is needed.
A.4 MXFP8 pre-training for a mixture-of-experts model
Section 3.3 presents empirical data showing that MXFP8 matches BF16 accuracy (both training loss and downstream task accuracy). Transformer-based mixture-of-experts (MoE) models are popular in the literature. Fig. 6 shows that MXFP8 pre-training also matches the BF16 pre-training loss curve for the MoE setup we experimented with. The MoE model has 16 billion total parameters and ∼2.5 billion active parameters, and we train it on 1 trillion tokens, following the same guidelines discussed in Section 3.2. The pre-training phase uses a WSD [32] learning rate schedule. The final loss of the MXFP8-trained MoE model is within 0.1% of the BF16-trained model.
Figure 6: MXFP8 versus BF16 training loss for a MoE model over 1T training tokens.
We conduct numerical experiments on LLM pre-training with variants of Nemotron-4 [11] models. Training and model details are described below. The 1T and 300B token datasets are subsets of the 17T-token dataset discussed in [33]. Table 2 details the parameters of the various models used.