
SageAttention2++: A More Efficient Implementation of SageAttention2

Jintao Zhang 1 Xiaoming Xu 1 Jia Wei 1 Haofeng Huang 1 Pengle Zhang 1 Chendong Xiang 1
Jun Zhu 1 Jianfei Chen 1

arXiv:2505.21136v2 [cs.LG] 28 May 2025

1 Department of Computer Science, Tsinghua University. Preprint.

Abstract

The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by using quantization to speed up the matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster instruction of FP8 Matmul accumulated in FP16. This instruction is 2× faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9× speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention.

1. Introduction

The quadratic time complexity of attention necessitates efficient implementations for real-world applications with long sequences (Jiang et al., 2024). Current approaches to reduce attention's computational demands fall into three main categories: (1) linear attention methods (Wang et al., 2020; Choromanski et al., 2021; Yu et al., 2022; Katharopoulos et al., 2020; Qin et al., 2024; Yang et al., 2024), which achieve O(N) complexity; (2) sparse attention techniques (Liu et al., 2021; Chu et al., 2021; Li et al., 2022; Xiao et al., 2024b;a; Chen et al., 2024; Jiang et al., 2024; Venkataramanan et al., 2024; Gao et al., 2024; Fu et al., 2024; Zhang et al., 2025e; Xi et al., 2025; Zhang et al., 2025f), which process only the relevant portions of the context. While effective, these methods often exhibit limited generality across models and tasks. (3) An alternative direction focuses on hardware-optimized attention implementations that maintain full sequence computation while achieving superior speed and accuracy. Notable examples include FlashAttention (Dao et al., 2022), FlashAttention2 (Dao, 2024; Shah et al., 2024), xFormers (Lefaudeux et al., 2022), and the SageAttention family (Zhang et al., 2025a;b;c;d), which demonstrate strong performance across diverse applications.

Motivation, problem, and our approach. For the second matrix multiplication (Matmul) PV in attention, SageAttention2 accelerates it by quantizing to FP8 and using the mma.f32.f8.f8.f32 instruction. However, mma.f32.f8.f8.f32 employs an FP32 accumulator and is only 2× faster than FP16 Matmul. We find that the mma.f16.f8.f8.f16 instruction, which uses an FP16 accumulator for FP8 Matmul, achieves a 4× speedup over FP16 (NVIDIA, 2022). Therefore, we aim to accelerate SageAttention2 by using this faster instruction. However, using it directly would let the values of PV exceed the representable range of FP16. To address this problem, we propose to narrow the quantization ranges of P and V so that the accumulated values fit within FP16.

Performance. For efficiency, SageAttention2++ delivers a 3.9× speedup over FlashAttention. In terms of accuracy, SageAttention2++ matches SageAttention2's performance. We conduct comprehensive evaluations on state-of-the-art models for text, image, and video generation. The results demonstrate that SageAttention2++ provides plug-and-play acceleration with negligible end-to-end metrics loss across diverse models.

2. Preliminary

Table 1. Speedup compared to matrix multiplication in FP16 with an FP32 accumulator.

GPU                 MM Input    MM Accumulator    Speedup
RTX4090, RTX5090    FP16        FP32              1×
RTX4090, RTX5090    FP8         FP32              2×
RTX4090, RTX5090    FP8         FP16              4×

2.1. SageAttention2

SageAttention2 (Zhang et al., 2025a) is a quantization (Zhang et al., 2025g; Hu et al., 2025) method based on FlashAttention (Dao et al., 2022). FlashAttention tiles Q, K, V into blocks ({Q_i}, {K_i}, {V_i}) and uses online softmax (Milakov & Gimelshein, 2018) to compute attention progressively. For simplicity, we omit the subscripts in the following and use Q, K, P̃, V to denote the tiled blocks. SageAttention2 quantizes Q and K to INT4/INT8 with per-block granularity, P̃ to FP8 in E4M3 with per-block granularity, and V to FP8 in E4M3 with per-channel granularity. This means each of Q, K, P̃ has a separate scale factor: δ_Q = max(|Q|)/127, δ_K = max(|K|)/127, δ_P = max(|P̃|)/448, and each channel of V has a separate scale: δ_V = colmax(|V|)/448. For example, P̂ = ⌈P̃/δ_P⌋ and V̂ = ⌈V/δ_V⌋; then PV = (P̂ V̂) · δ_P · δ_V. By doing so, SageAttention2 accelerates the matrix multiplications in attention through low-bit Tensor Core operations.

2.2. Data Type of Accumulator for Matmul

On some GPUs, the speed of Matmul instructions depends on the accumulator data type. For instance, mma.f32.f8.f8.f32 uses an FP32 accumulator for FP8 Matmul and is only 2× faster than FP16. The instruction using an FP16 accumulator for FP8 Matmul (mma.f16.f8.f8.f16) is 4× faster than FP16. Table 1 summarizes the speedup of Matmul instructions with different accumulators.

3. SageAttention2++

In this section, we introduce SageAttention2++. The workflow of SageAttention2++ is based on SageAttention2, also using the smoothing of Q and K, INT4/INT8 quantization for the QK⊤ Matmul, and FP8 quantization for the PV Matmul. The main difference is that for PV, SageAttention2++ uses the faster instruction (mma.f16.f8.f8.f16), which employs an FP16 accumulator for the FP8 Matmul. To ensure the results of the FP8 Matmul remain within FP16's representable range, we adjust the scale factors of the FP8 quantization.

3.1. Narrowing the FP8 Quantization Range

The specific MMA (NVIDIA, 2025) instruction used for the Matmul between P and V is mma.m16n8k32. If we quantize P and V to FP8 in E4M3 (range -448∼448) as in SageAttention2, the results may exceed FP16's representable range (-65504∼65504). This occurs because 32 product values pv are accumulated in FP16, where p and v come from the quantized P̂ and V̂ (derived from P̃ and V). To ensure the accumulated results stay within FP16's range, we require

|32 × pv| ≤ 65504.    (1)

For instance, choosing |p| ≤ 224 and |v| ≤ 9 satisfies this condition. We therefore narrow the quantization ranges of P and V by adjusting their scale factors:

δ_P = max(|P̃|)/P_r,  δ_V = max(|V|)/V_r,    (2)

where we constrain P_r × V_r ≤ 2047 (since 65504/32 = 2047).

3.2. Delayed FP32 Buffering

The transformation of accumulated values from mma.m16n8k32 (in FP16) to FP32 incurs overhead because it requires additional data-type conversion PTX instructions (NVIDIA, 2025). To reduce this overhead, we accumulate two consecutive mma.m16n8k32 results in FP16 before performing the FP32 conversion, effectively halving the conversion overhead. Maintaining the FP16 representable range then requires:

P_r × V_r ≤ 2047/2.    (3)

Choice of P_r and V_r. Table 2 shows attention accuracy for feasible (P_r, V_r) pairs. The results demonstrate that narrowing the quantization ranges introduces negligible error. We select P_r = 224 and V_r = 4.5 for optimal performance.

Table 2. Average attention accuracy across all attention layers of CogvideoX.

Method        P_r   V_r    CosSim↑   L1↓
SageAttn2     448   448    99.97%    0.01862
SageAttn2++   448   2.25   99.97%    0.01863
SageAttn2++   224   4.5    99.97%    0.01862
SageAttn2++   112   9      99.97%    0.01863

4. Experiment

Main result. SageAttention2++ achieves up to 3.9× speedup over FlashAttention2 while consistently outperforming both SageAttention and SageAttention2 in computational efficiency. Importantly, these performance gains are achieved with negligible impact on end-to-end metrics across diverse model architectures.

4.1. Setup

Models and attentions. We evaluate SageAttention2++ across diverse representative models spanning language, image, and video generation: Llama3.1 (8B) (Dubey et al., 2024) for text2text; CogvideoX (2B), HunyuanVideo (Kong et al., 2024), and Wan (Wan et al., 2025) for text2video; and Flux (schnell) (Black Forest Labs, 2023) and Stable-Diffusion3.5 (turbo) (Stability AI, 2023) for text2image. We compare our method with FlashAttention2 (Dao, 2024), SageAttention (Zhang et al., 2025c),


[Figure 1: bar charts of Speed (TOPS) vs. sequence length (1K-32K) on RTX4090 with head dim = 128, for causal = False and causal = True; series: FlashAttn, Sage1, Sage2(8+8), Sage2(4+8), Sage2++(8+8), Sage2++(4+8).]

Figure 1. Speed comparison between SageAttention2++ and baselines (RTX4090, headdim=128).


[Figure 2: bar charts of Speed (TOPS) vs. sequence length (1K-32K) on RTX4090 with head dim = 64, for causal = False and causal = True; series: FlashAttn, Sage1, Sage2(8+8), Sage2(4+8), Sage2++(8+8), Sage2++(4+8).]

Figure 2. Speed comparison between SageAttention2++ and baselines (RTX4090, headdim=64).


[Figure 3: bar charts of Speed (TOPS) vs. sequence length (1K-32K) on RTX5090 with head dim = 128, for causal = False and causal = True; series: FlashAttn, Sage1, Sage2(8+8), Sage2++(8+8).]

Figure 3. Speed comparison between SageAttention2++ and baselines (RTX5090, headdim=128).


[Figure 4: bar charts of Speed (TOPS) vs. sequence length (1K-32K) on RTX5090 with head dim = 64, for causal = False and causal = True; series: FlashAttn, Sage1, Sage2(8+8), Sage2++(8+8).]

Figure 4. Speed comparison between SageAttention2++ and baselines (RTX5090, headdim=64).

and SageAttention2 (Zhang et al., 2025a). Please note that FlashAttention3 can only run on Hopper GPUs, so FlashAttention2 is already the fastest baseline for RTX5090 and RTX4090. Following SageAttention2's approach, we implement two SageAttention2++ variants: SageAttn2++(8+8) (INT8 for Q, K) and SageAttn2++(4+8) (INT4 for Q, K), both using FP8 in E4M3 for P̃ and V.

Datasets and metrics. Detailed dataset and metric information appears in Appendix A.2.

Implementation. We implement SageAttention2++ using CUDA.

4.2. Speed of Kernels

Kernel Speed. We benchmark the speed of SageAttention2++ against baselines using configurations with headdim=64 and headdim=128, both with

Table 3. End-to-end metrics across text, image, and video generation models.

Model      Attention          WikiText (Ppl.)↓   Lambda (Acc.)↑   NIAH (Acc.)↑
Llama3.1   Full-Precision     6.013              0.815            0.906
           SageAttn2(8+8)     6.019              0.811            0.903
           SageAttn2++(8+8)   6.020              0.813            0.901

Model       Attention          CLIPSIM↑   CLIP-T↑   VQA-a↑   VQA-t↑   FScore↑
CogvideoX   Full-Precision     0.179      0.997     74.499   74.642   4.974
(2B)        SageAttn2(4+8)     0.179      0.997     76.309   66.396   4.386
            SageAttn2(8+8)     0.178      0.997     74.322   74.447   4.899
            SageAttn2++(4+8)   0.179      0.997     74.387   66.568   4.333
            SageAttn2++(8+8)   0.179      0.997     76.309   73.165   4.386
Hunyuan     Full-Precision     0.175      0.999     77.437   52.731   1.169
Video       SageAttn2(4+8)     0.176      0.999     73.282   55.141   0.968
            SageAttn2(8+8)     0.175      0.999     78.145   54.878   1.176
            SageAttn2++(4+8)   0.176      0.999     73.282   52.258   0.968
            SageAttn2++(8+8)   0.175      0.999     78.569   51.080   1.192
Wan         Full-Precision     0.172      0.999     53.255   59.989   1.843
            SageAttn2(4+8)     0.176      0.998     29.728   38.533   0.994
            SageAttn2(8+8)     0.172      0.999     49.794   55.712   1.870
            SageAttn2++(4+8)   0.176      0.998     29.728   38.023   0.994
            SageAttn2++(8+8)   0.172      0.999     50.876   57.140   1.902

Model             Attention          FID↓      sFID↓     CLIP↑    IR↑
Flux              Full-Precision     165.117   147.831   31.401   0.912
                  SageAttn2(4+8)     164.170   147.185   31.358   0.910
                  SageAttn2(8+8)     163.185   146.101   31.453   0.905
                  SageAttn2++(4+8)   164.170   147.185   31.358   0.910
                  SageAttn2++(8+8)   163.555   146.036   31.445   0.902
Stable-Diffusion  Full-Precision     166.369   146.514   31.876   0.929
3.5               SageAttn2(4+8)     164.610   147.350   31.912   0.914
                  SageAttn2(8+8)     164.971   148.498   31.964   0.931
                  SageAttn2++(4+8)   164.610   147.350   31.912   0.914
                  SageAttn2++(8+8)   165.842   146.465   31.968   0.929

[Figure 5: qualitative comparison of Full Precision vs. SageAttn2++ outputs on Wan and Flux.]

Figure 5. A visible example of using SageAttention2++.

and without a Causal Mask (Vaswani, 2017). Specifically, Fig. 1 shows the speed across varying sequence lengths on RTX4090, indicating that SageAttn2++(4+8) and SageAttn2++(8+8) are approximately 3.9× and 3.0× faster than FlashAttention2, respectively. Figs. 2, 3, and 4 show more kernel speed comparisons on RTX4090 and RTX5090 GPUs.
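The range-narrowing rule behind these kernels (Section 3.1) can be checked numerically. The sketch below is illustrative only: plain Python stands in for the FP8 Tensor Core path, and `quantize` with integer rounding is a simplification of E4M3 quantization, not the paper's CUDA kernel. It quantizes one row of P̃ and one column of V with the scale factors of Eq. (2), then verifies the FP16-accumulator bound of Eq. (1).

```python
import random

FP16_MAX = 65504.0
K_MMA = 32  # products accumulated per mma.m16n8k32 step

def quantize(xs, r):
    """Scale xs into [-r, r] and round; returns quantized values and the
    scale factor delta = max(|x|)/r, as in Eq. (2). Integer rounding is a
    stand-in for FP8 E4M3 rounding."""
    delta = max(abs(x) for x in xs) / r
    return [round(x / delta) for x in xs], delta

random.seed(0)
p_block = [random.uniform(-1.0, 1.0) for _ in range(K_MMA)]  # a row of P~
v_col = [random.uniform(-2.0, 2.0) for _ in range(K_MMA)]    # a column of V

Pr, Vr = 224, 4.5  # the paper's choice; Pr * Vr = 1008 <= 2047 / 2
p_q, d_p = quantize(p_block, Pr)
v_q, d_v = quantize(v_col, Vr)

# Eq. (1): the worst-case accumulated magnitude cannot overflow FP16.
assert K_MMA * Pr * Vr <= FP16_MAX

acc = sum(p * v for p, v in zip(p_q, v_q))  # simulated FP16 accumulator
assert abs(acc) <= FP16_MAX

# Dequantizing with d_p * d_v recovers an approximation of the exact dot product.
exact = sum(p * v for p, v in zip(p_block, v_col))
approx = acc * d_p * d_v
print(exact, approx)  # the two agree up to quantization error
```

With the original SageAttention2 ranges (P_r = V_r = 448), the same worst-case check 32 × 448 × 448 would far exceed 65504, which is exactly why the ranges must be narrowed before an FP16 accumulator can be used.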


[Figure 6: qualitative comparison of Full Precision, Sage2++(8+8), and Sage2++(4+8) outputs on CogvideoX and HunyuanVideo.]

Figure 6. Visible examples of using SageAttention2++ on video generation.
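Section 3.2's delayed FP32 buffering can also be mimicked in a few lines. This is a hypothetical plain-Python sketch, not the PTX/CUDA implementation: two consecutive simulated mma.m16n8k32 partial sums are added in FP16 first, so the conversion into the FP32 buffer happens half as often, which is why the constraint tightens to Eq. (3).

```python
FP16_MAX = 65504.0
K_MMA = 32

def fp16_partial(p_q, v_q):
    """One simulated MMA step: 32 quantized products summed in the FP16 accumulator."""
    s = sum(p * v for p, v in zip(p_q, v_q))
    assert abs(s) <= FP16_MAX  # guaranteed when Pr * Vr <= 2047, Eq. (1)
    return s

def delayed_buffering(p_tiles, v_tiles):
    """Accumulate pairs of FP16 partials before converting into the FP32 buffer."""
    fp32_acc = 0.0
    for i in range(0, len(p_tiles), 2):
        pair = (fp16_partial(p_tiles[i], v_tiles[i])
                + fp16_partial(p_tiles[i + 1], v_tiles[i + 1]))
        assert abs(pair) <= FP16_MAX  # guaranteed when Pr * Vr <= 2047 / 2, Eq. (3)
        fp32_acc += pair  # one data-type conversion per two MMA steps
    return fp32_acc

# The paper's choice Pr = 224, Vr = 4.5 satisfies the tightened bound of Eq. (3):
assert 2 * K_MMA * 224 * 4.5 <= FP16_MAX

# Extreme quantized tiles: |p| <= 224 and |v| <= Vr after rounding.
p_tiles = [[224] * K_MMA, [-224] * K_MMA]
v_tiles = [[4] * K_MMA, [4] * K_MMA]
print(delayed_buffering(p_tiles, v_tiles))  # prints 0.0: the two partials cancel
```

The halved bound appears directly in the inner assertion: a pair of partials can reach twice the magnitude of a single one before the FP32 conversion, so P_r × V_r must shrink by the same factor of two.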

4.3. End-to-end Performance

Metrics loss. We evaluate end-to-end model performance using SageAttention2++ against baseline methods. Detailed evaluation results are presented in Table 3. The results indicate that SageAttn2++(8+8) and SageAttn2++(4+8) match the end-to-end metrics of SageAttention2. Specifically, SageAttn2++(8+8) incurs almost no metrics loss across various models, and SageAttn2++(4+8) brings a small metrics loss.

Visible image and video examples. Figs. 5, 6, and 7 show some visible comparison examples.

5. Conclusion

We introduce SageAttention2++ to further accelerate SageAttention2. We propose to utilize the faster instruction of FP8 Matmul accumulated in FP16 for the matrix multiplication PV. Experiments show that SageAttention2++ achieves a 3.9× speedup over FlashAttention (compared with SageAttention2's 3× speedup), while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ can accelerate various models, including those for language, image, and video generation, with negligible end-to-end metrics loss.


References

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023.

Chen, Y., Qian, S., Tang, H., Lai, X., Liu, Z., Han, S., and Jia, J. Longlora: Efficient fine-tuning of long-context large language models. In The International Conference on Learning Representations, 2024.

Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In International Conference on Learning Representations, 2021.

Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., and Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. In Advances in Neural Information Processing Systems, 2021.

Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Fu, T., Huang, H., Ning, X., Zhang, G., Chen, B., Wu, T., Wang, H., Huang, Z., Li, S., Yan, S., Dai, G., Yang, H., and Wang, Y. Moa: Mixture of sparse attention for automatic large language model compression. arXiv preprint arXiv:2406.14909, 2024.

Gao, Y., Zeng, Z., Du, D., Cao, S., So, H. K.-H., Cao, T., Yang, F., and Yang, M. Seerattention: Learning intrinsic sparse attention in your llms. arXiv preprint arXiv:2410.13276, 2024.

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7514-7528, 2021.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Hu, Y., Huang, W., Liang, Z., Chen, C., Zhang, J., Zhu, J., and Chen, J. Identifying sensitive weights via post-quantization integral. 2025.

Jelinek, F., Mercer, R. L., Bahl, L. R., and Baker, J. K. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63-S63, 1977.

Jiang, H., Li, Y., Zhang, C., Wu, Q., Luo, X., Ahn, S., Han, Z., Abdi, A. H., Li, D., Lin, C.-Y., Yang, Y., and Qiu, L. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Kamradt, G. Llmtest needle in a haystack-pressure testing llms. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023.

Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156-5165. PMLR, 2020.

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., Wu, K., Lin, Q., Wang, A., Wang, A., Li, C., Huang, D., Yang, F., Tan, H., Wang, H., Song, J., Bai, J., Wu, J., Xue, J., Wang, J., Yuan, J., Wang, K., Liu, M., Li, P., Li, S., Wang, W., Yu, W., Deng, X., Li, Y., Long, Y., Chen, Y., Cui, Y., Peng, Y., Yu, Z., He, Z., Xu, Z., Zhou, Z., Xu, Z., Tao, Y., Lu, Q., Liu, S., Zhou, D., Wang, H., Yang, Y., Wang, D., Liu, Y., Jiang, J., and Zhong, C. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

Lefaudeux, B., Massa, F., Liskovich, D., Xiong, W., Caggiano, V., Naren, S., Xu, M., Hu, J., Tintore, M., Zhang, S., Labatut, P., Haziza, D., Wehrstedt, L., Reizenstein, J., and Sizov, G. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022.

Li, K., Wang, Y., Peng, G., Song, G., Liu, Y., Li, H., and Qiao, Y. Uniformer: Unified transformer for efficient spatial-temporal representation learning. In International Conference on Learning Representations, 2022.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014, pp. 740-755. Springer, 2014.


Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., and Shan, Y. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22139-22149, 2024.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012-10022, 2021.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations, 2022.

Milakov, M. and Gimelshein, N. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.

NVIDIA. Nvidia ada gpu architecture. Technical whitepaper, 2022. URL https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf.

NVIDIA. Parallel Thread Execution ISA Version 8.7. https://docs.nvidia.com/cuda/pdf/ptx_isa_8.4.pdf, 2025. Accessed: 2025-05-16.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.-Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The lambada dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525-1534, 2016.

Qin, Z., Sun, W., Li, D., Shen, X., Sun, W., and Zhong, Y. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. arXiv preprint arXiv:2401.04658, 2024.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. Advances in Neural Information Processing Systems, 29, 2016.

Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Stability AI. Introducing stable diffusion 3.5. https://stability.ai/news/introducing-stable-diffusion-3-5, 2023.

Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Venkataramanan, S., Ghodrati, A., Asano, Y. M., Porikli, F., and Habibian, A. Skip-attention: Improving vision transformers by paying less attention. In The Twelfth International Conference on Learning Representations, 2024.

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W., Wang, W., Shen, W., Yu, W., Shi, X., Huang, X., Xu, X., Kou, Y., Lv, Y., Li, Y., Liu, Y., Wang, Y., Zhang, Y., Huang, Y., Li, Y., Wu, Y., Liu, Y., Pan, Y., Zheng, Y., Hong, Y., Shi, Y., Feng, Y., Jiang, Z., Han, Z., Wu, Z.-F., and Liu, Z. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

Wu, H., Zhang, E., Liao, L., Chen, C., Hou, J., Wang, A., Sun, W., Yan, Q., and Lin, W. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20144-20154, 2023.

Xi, H., Yang, S., Zhao, Y., Xu, C., Li, M., Li, X., Lin, Y., Cai, H., Zhang, J., Li, D., et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. arXiv preprint arXiv:2502.01776, 2025.

Xiao, C., Zhang, P., Han, X., Xiao, G., Lin, Y., Zhang, Z., Liu, Z., and Sun, M. Infllm: Training-free long-context extrapolation for llms with an efficient context memory. In First Workshop on Long-Context Foundation Models @ ICML 2024, 2024a.

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024b.

Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., and Dong, Y. Imagereward: Learning and evaluating human preferences for text-to-image generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.


Yang, S., Kautz, J., and Hatamizadeh, A. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.

Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., and Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10819-10829, 2022.

Zhang, J., Huang, H., Zhang, P., Wei, J., Zhu, J., and Chen, J. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization. In International Conference on Machine Learning (ICML), 2025a.

Zhang, J., Huang, H., Zhang, P., Wei, J., Zhu, J., and Chen, J. Sageattention2: Efficient attention with smoothing q and per-thread quantization. 2025b.

Zhang, J., Wei, J., Zhang, P., Chen, J., and Zhu, J. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration. In The International Conference on Learning Representations, 2025c.

Zhang, J., Wei, J., Zhang, P., Xu, X., Huang, H., Wang, H., Jiang, K., Zhu, J., and Chen, J. Sageattention3: Microscaling fp4 attention for inference and an exploration of 8-bit training. arXiv preprint arXiv:2505.11594, 2025d.

Zhang, J., Xiang, C., Huang, H., Wei, J., Xi, H., Zhu, J., and Chen, J. Spargeattn: Accurate sparse attention accelerating any model inference. In International Conference on Machine Learning (ICML), 2025e.

Zhang, J., Xiang, C., Huang, H., Wei, J., Xi, H., Zhu, J., and Chen, J. Spargeattn: Training-free sparse attention accelerating any model inference. 2025f.

Zhang, P., Wei, J., Zhang, J., Zhu, J., and Chen, J. Accurate int8 training through dynamic block-level fallback. 2025g.

Zhao, T., Fang, T., Huang, H., Liu, E., Wan, R., Soedarmadji, W., Li, S., Lin, Z., Dai, G., Yan, S., Yang, H., et al. Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation. In International Conference on Learning Representations, 2025.

Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.


A. Appendix
A.1. Visible Comparison Examples

[Figure 7: qualitative comparison of Full Precision, Sage2++(8+8), and Sage2++(4+8) outputs on Flux and Stable-Diffusion-3.5.]

Figure 7. Visible examples of using SageAttention2++ on image generation.

A.2. Datasets and Metrics in Experiments


Datasets. Text-to-text models are evaluated on: WikiText (Merity et al., 2022) to assess the model’s prediction confidence,
LAMBADA (Paperno et al., 2016) for contextual understanding, and Needle-in-A-Haystack (NIAH) task (Kamradt, 2023).
Text-to-video models are evaluated using the open-sora (Zheng et al., 2024) prompt sets. Text-to-image models are assessed
on COCO annotations (Lin et al., 2014).
End-to-end metrics. For text-to-text models, we use perplexity (Ppl.) (Jelinek et al., 1977) for WikiText, accuracy
(Acc.) for LAMBADA and NIAH. For text-to-video models, following Zhao et al. (2025), we evaluate the quality of
generated videos on five metrics: CLIPSIM and CLIP-Temp (CLIP-T) (Liu et al., 2024) to measure the text-video alignment;
VQA-a and VQA-t to assess the video aesthetic and technical quality, respectively; and Flow-score (FScore) for temporal
consistency (Wu et al., 2023). For text-to-image models, generated images are compared with the images in three aspects:
FID (Heusel et al., 2017) and sFID (Salimans et al., 2016) for fidelity evaluation, Clipscore (CLIP) (Hessel et al., 2021) for
text-image alignment, and ImageReward (IR) (Xu et al., 2023) for human preference.
Accuracy metrics. We use three metrics to assess the accuracy of the quantized attention output O′ against the full-precision attention output O. First, we flatten O and O′ into vectors of shape 1 × n. Then we compute the Cosine similarity: CosSim = Σ O O′ / (√(Σ O²) √(Σ O′²)); the Relative L1 distance: L1 = Σ |O − O′| / Σ |O|; and the Root mean square error: RMSE = √((1/n) Σ (O − O′)²).
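These definitions translate directly into code. The following plain-Python sketch is illustrative: `O` and `O_prime` are toy stand-ins, whereas the paper computes the metrics over real attention outputs.

```python
import math

def cos_sim(o, o_prime):
    """Cosine similarity between the flattened outputs."""
    num = sum(a * b for a, b in zip(o, o_prime))
    return num / (math.sqrt(sum(a * a for a in o))
                  * math.sqrt(sum(b * b for b in o_prime)))

def rel_l1(o, o_prime):
    """Relative L1 distance: sum |O - O'| / sum |O|."""
    return sum(abs(a - b) for a, b in zip(o, o_prime)) / sum(abs(a) for a in o)

def rmse(o, o_prime):
    """Root mean square error over the flattened outputs of length n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(o, o_prime)) / len(o))

O = [1.0, 2.0, 3.0, 4.0]        # full-precision output, flattened to 1 x n
O_prime = [1.0, 2.0, 3.0, 4.1]  # quantized output with a small perturbation
print(cos_sim(O, O_prime), rel_l1(O, O_prime), rmse(O, O_prime))
```

For identical outputs, CosSim is 1 (up to floating-point rounding) while L1 and RMSE are 0; small quantization errors move CosSim slightly below 1 and the distances slightly above 0, which is the pattern reported in Table 2.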
