SageAttention2++: A More Efficient Implementation of SageAttention2

Jintao Zhang 1  Xiaoming Xu 1  Jia Wei 1  Haofeng Huang 1  Pengle Zhang 1  Chendong Xiang 1  Jun Zhu 1  Jianfei Chen 1
[Kernel speed figures (cf. Fig. 1-4): Speed (TOPS) vs. Sequence Length (1K-32K) for FlashAttn, Sage1, Sage2(8+8), Sage2(4+8), Sage2++(8+8), and Sage2++(4+8). The first panel pair is RTX4090 with Head dim = 128, causal = False and causal = True; the remaining panels show further RTX4090 and RTX5090 configurations.]
and SageAttention2 (Zhang et al., 2025a). Please note that FlashAttention3 can only run on Hopper GPUs, so FlashAttention2 is already the fastest FlashAttention version available on RTX5090 and RTX4090. Following SageAttention2's approach, we implement two SageAttention2++ variants: SageAttn2++(8+8) (INT8 for Q, K) and SageAttn2++(4+8) (INT4 for Q, K), both using FP8 in E4M3 for P̃ and V.

Datasets and metrics. Detailed dataset and metric information appears in Appendix A.2.

Implementation. We implement SageAttention2++ using CUDA.
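Since the released CUDA kernels are not reproduced here, the following is a minimal PyTorch sketch of the SageAttn2++(8+8) data flow under simplifying assumptions: one INT8 scale per block of Q and K rows, and a plain FP8 (E4M3) cast for P̃ and V. The block size, tensor shapes, and helper names are illustrative only; the actual kernel also applies SageAttention2's smoothing and per-thread quantization and feeds INT8/FP8 operands to tensor cores directly.

```python
# Assumption-based illustration of the SageAttn2++(8+8) recipe, not the paper's CUDA kernel.
import torch

def quant_int8_per_block(x, block=128):
    """Quantize a (seq, head_dim) tensor to INT8 with one scale per block of rows."""
    xq = torch.empty_like(x, dtype=torch.int8)
    scales = []
    for s in range(0, x.shape[0], block):
        blk = x[s:s + block]
        scale = blk.abs().amax().clamp(min=1e-8) / 127.0
        xq[s:s + block] = torch.round(blk / scale).clamp(-127, 127).to(torch.int8)
        scales.append(scale)
    return xq, torch.stack(scales)

def sage2pp_8plus8_reference(q, k, v, block=128):
    # Quantize Q and K to INT8. Here we dequantize and use a float matmul;
    # the real kernel keeps the INT8 operands for tensor-core matmuls.
    qq, qs = quant_int8_per_block(q, block)
    kq, ks = quant_int8_per_block(k, block)
    q_deq = qq.float() * qs.repeat_interleave(block)[: q.shape[0]].unsqueeze(1)
    k_deq = kq.float() * ks.repeat_interleave(block)[: k.shape[0]].unsqueeze(1)
    scores = (q_deq @ k_deq.T) * (q.shape[1] ** -0.5)
    p = torch.softmax(scores, dim=-1)
    # Cast P~ and V to FP8 E4M3 (requires a PyTorch build with float8 dtypes),
    # then back to float32 for the reference matmul.
    p8 = p.to(torch.float8_e4m3fn).float()
    v8 = v.to(torch.float8_e4m3fn).float()
    return p8 @ v8

# Usage: compare against full-precision attention on random inputs.
q, k, v = (torch.randn(1024, 128) for _ in range(3))
out = sage2pp_8plus8_reference(q, k, v)
ref = torch.softmax((q @ k.T) / 128 ** 0.5, dim=-1) @ v
print((out - ref).abs().max())  # small quantization error
```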
Table 3. End-to-end metrics across text, image, and video generation models.

Model      Attention          WikiText (Ppl.) ↓   Lambada (Acc.) ↑   NIAH (Acc.) ↑
Llama3.1   Full-Precision     6.013               0.815              0.906
           SageAttn2(8+8)     6.019               0.811              0.903
           SageAttn2++(8+8)   6.020               0.813              0.901

4.2. Speed of Kernels

Kernel Speed. We benchmark the speed of SageAttention2++ against baselines using configurations with headdim=64 and headdim=128, both with and without a causal mask (Vaswani, 2017). Specifically, Fig. 1 shows the speed across varying sequence lengths on RTX4090, indicating that SageAttn2++(4+8) and SageAttn2++(8+8) are approximately 3.9x and 3.0x faster than FlashAttention2, respectively. Fig. 2, 3, and 4 show more kernel speed comparisons on RTX4090 and RTX5090 GPUs.
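As a rough illustration of how such TOPS numbers can be obtained, the sketch below times an attention call across sequence lengths and converts the measured latency into TOPS using the standard 4·B·H·N²·D FLOP count for QKᵀ and P̃V, halved under a causal mask. It uses PyTorch's scaled_dot_product_attention as a stand-in kernel with assumed batch and head counts; it is not the authors' benchmark script.

```python
# Hedged benchmarking sketch: measure attention latency and report TOPS.
import torch
import torch.nn.functional as F

def attention_tops(batch, heads, seq, head_dim, causal, iters=50):
    q, k, v = (torch.randn(batch, heads, seq, head_dim,
                           device="cuda", dtype=torch.float16) for _ in range(3))
    # Warm up, then time with CUDA events for accurate GPU measurements.
    for _ in range(5):
        F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters
    flops = 4 * batch * heads * seq * seq * head_dim  # QK^T and PV matmuls
    if causal:
        flops //= 2  # roughly half the blocks are skipped
    return flops / seconds / 1e12  # TOPS

if torch.cuda.is_available():
    for n in (1024, 2048, 4096, 8192, 16384, 32768):
        tops = attention_tops(batch=4, heads=32, seq=n, head_dim=128, causal=False)
        print(n, round(tops, 1))
```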
[Figure: SageAttention2++ on HunyuanVideo, comparing Full Precision, Sage2++ (8+8), and Sage2++ (4+8).]
References

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023.

Chen, Y., Qian, S., Tang, H., Lai, X., Liu, Z., Han, S., and Jia, J. Longlora: Efficient fine-tuning of long-context large language models. In The International Conference on Learning Representations, 2024.

Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In International Conference on Learning Representations, 2021.

Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., and Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021.

Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Fu, T., Huang, H., Ning, X., Zhang, G., Chen, B., Wu, T., Wang, H., Huang, Z., Li, S., Yan, S., Dai, G., Yang, H., and Wang, Y. Moa: Mixture of sparse attention for automatic large language model compression. arXiv preprint arXiv:2406.14909, 2024.

Gao, Y., Zeng, Z., Du, D., Cao, S., So, H. K.-H., Cao, T., Yang, F., and Yang, M. Seerattention: Learning intrinsic sparse attention in your llms. arXiv preprint arXiv:2410.13276, 2024.

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7514–7528, 2021.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Hu, Y., Huang, W., Liang, Z., Chen, C., Zhang, J., Zhu, J., and Chen, J. Identifying sensitive weights via post-quantization integral. 2025.

Jelinek, F., Mercer, R. L., Bahl, L. R., and Baker, J. K. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63, 1977.

Jiang, H., Li, Y., Zhang, C., Wu, Q., Luo, X., Ahn, S., Han, Z., Abdi, A. H., Li, D., Lin, C.-Y., Yang, Y., and Qiu, L. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Kamradt, G. LLMTest needle in a haystack: pressure testing LLMs. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023.

Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020.

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., Wu, K., Lin, Q., Wang, A., Wang, A., Li, C., Huang, D., Yang, F., Tan, H., Wang, H., Song, J., Bai, J., Wu, J., Xue, J., Wang, J., Yuan, J., Wang, K., Liu, M., Li, P., Li, S., Wang, W., Yu, W., Deng, X., Li, Y., Long, Y., Chen, Y., Cui, Y., Peng, Y., Yu, Z., He, Z., Xu, Z., Zhou, Z., Xu, Z., Tao, Y., Lu, Q., Liu, S., Zhou, D., Wang, H., Yang, Y., Wang, D., Liu, Y., Jiang, J., and Zhong, C. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

Lefaudeux, B., Massa, F., Liskovich, D., Xiong, W., Caggiano, V., Naren, S., Xu, M., Hu, J., Tintore, M., Zhang, S., Labatut, P., Haziza, D., Wehrstedt, L., Reizenstein, J., and Sizov, G. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022.

Li, K., Wang, Y., Peng, G., Song, G., Liu, Y., Li, H., and Qiao, Y. Uniformer: Unified transformer for efficient spatial-temporal representation learning. In International Conference on Learning Representations, 2022.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014.
Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., and Shan, Y. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22139–22149, 2024.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017.

Milakov, M. and Gimelshein, N. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.

NVIDIA. Nvidia ada gpu architecture, 2022. URL https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf. Technical whitepaper.

NVIDIA. Parallel Thread Execution ISA Version 8.7. https://docs.nvidia.com/cuda/pdf/ptx_isa_8.4.pdf, 2025. Accessed: 2025-05-16.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.-Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The lambada dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525–1534, 2016.

Qin, Z., Sun, W., Li, D., Shen, X., Sun, W., and Zhong, Y. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. arXiv preprint arXiv:2401.04658, 2024.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. Advances in Neural Information Processing Systems, 29, 2016.

Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Stability AI. Introducing stable diffusion 3.5. https://stability.ai/news/introducing-stable-diffusion-3-5, 2023.

Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Venkataramanan, S., Ghodrati, A., Asano, Y. M., Porikli, F., and Habibian, A. Skip-attention: Improving vision transformers by paying less attention. In The Twelfth International Conference on Learning Representations, 2024.

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W., Wang, W., Shen, W., Yu, W., Shi, X., Huang, X., Xu, X., Kou, Y., Lv, Y., Li, Y., Liu, Y., Wang, Y., Zhang, Y., Huang, Y., Li, Y., Wu, Y., Liu, Y., Pan, Y., Zheng, Y., Hong, Y., Shi, Y., Feng, Y., Jiang, Z., Han, Z., Wu, Z.-F., and Liu, Z. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

Wu, H., Zhang, E., Liao, L., Chen, C., Hou, J., Wang, A., Sun, W., Yan, Q., and Lin, W. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20144–20154, 2023.

Xi, H., Yang, S., Zhao, Y., Xu, C., Li, M., Li, X., Lin, Y., Cai, H., Zhang, J., Li, D., et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. arXiv preprint arXiv:2502.01776, 2025.

Xiao, C., Zhang, P., Han, X., Xiao, G., Lin, Y., Zhang, Z., Liu, Z., and Sun, M. Infllm: Training-free long-context extrapolation for llms with an efficient context memory. In First Workshop on Long-Context Foundation Models @ ICML 2024, 2024a.

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024b.

Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., and Dong, Y. Imagereward: Learning and evaluating human preferences for text-to-image generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Zhang, J., Huang, H., Zhang, P., Wei, J., Zhu, J., and Chen, J. Sageattention2: Efficient attention with smoothing q and per-thread quantization. 2025b.

Zhang, J., Wei, J., Zhang, P., Chen, J., and Zhu, J. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration. In The International Conference on Learning Representations, 2025c.

Zhang, J., Wei, J., Zhang, P., Xu, X., Huang, H., Wang, H., Jiang, K., Zhu, J., and Chen, J. Sageattention3: Microscaling fp4 attention for inference and an exploration of 8-bit training. arXiv preprint arXiv:2505.11594, 2025d.

Zhang, J., Xiang, C., Huang, H., Wei, J., Xi, H., Zhu, J., and Chen, J. Spargeattn: Accurate sparse attention accelerating any model inference. In International Conference on Machine Learning (ICML), 2025e.

Zhang, J., Xiang, C., Huang, H., Wei, J., Xi, H., Zhu, J., and Chen, J. Spargeattn: Training-free sparse attention accelerating any model inference. 2025f.

Zhang, P., Wei, J., Zhang, J., Zhu, J., and Chen, J. Accurate int8 training through dynamic block-level fallback. 2025g.

Zhao, T., Fang, T., Huang, H., Liu, E., Wan, R., Soedarmadji, W., Li, S., Lin, Z., Dai, G., Yan, S., Yang, H., et al. Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation. In International Conference on Learning Representations, 2025.

Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.
A. Appendix
A.1. Visible Comparison Examples
[Figure: Visible comparison examples on Flux and Stable-Diffusion-3.5.]