
SageAttention2++: A More Efficient Implementation of SageAttention2

Jintao Zhang 1 Xiaoming Xu 1 Jia Wei 1 Haofeng Huang 1 Pengle Zhang 1 Chendong Xiang 1
Jun Zhu 1 Jianfei Chen 1

arXiv:2505.21136v2 [cs.LG] 28 May 2025

1 Department of Computer Science, Tsinghua University. Preprint.

Abstract

The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by using quantization to speed up the matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster instruction of FP8 Matmul accumulated in FP16. This instruction is 2× faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9× speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention.

1. Introduction

The quadratic time complexity of attention necessitates efficient implementations for real-world applications with long sequences (Jiang et al., 2024). Current approaches to reduce attention's computational demands fall into three main categories: (1) linear attention methods (Wang et al., 2020; Choromanski et al., 2021; Yu et al., 2022; Katharopoulos et al., 2020; Qin et al., 2024; Yang et al., 2024), which achieve O(N) complexity; (2) sparse attention techniques (Liu et al., 2021; Chu et al., 2021; Li et al., 2022; Xiao et al., 2024b;a; Chen et al., 2024; Jiang et al., 2024; Venkataramanan et al., 2024; Gao et al., 2024; Fu et al., 2024; Zhang et al., 2025e; Xi et al., 2025; Zhang et al., 2025f), which process only the relevant portions of the context. While effective, these methods often exhibit limited generality across models and tasks. (3) An alternative direction focuses on hardware-optimized attention implementations that maintain full sequence computation while achieving superior speed and accuracy. Notable examples include FlashAttention (Dao et al., 2022), FlashAttention2 (Dao, 2024; Shah et al., 2024), xFormers (Lefaudeux et al., 2022), and the SageAttention family (Zhang et al., 2025a;b;c;d), which demonstrate strong performance across diverse applications.

Motivation, problem, and our approach. For the second matrix multiplication (Matmul) PV in attention, SageAttention2 accelerates it by quantizing to FP8 and using the mma.f32.f8.f8.f32 instruction. However, mma.f32.f8.f8.f32 employs an FP32 accumulator and is only 2× faster than FP16 Matmul. We find that the mma.f16.f8.f8.f16 instruction, which uses an FP16 accumulator for FP8 Matmul, achieves a 4× speedup over FP16 (NVIDIA, 2022). Therefore, we aim to accelerate SageAttention2 by using this faster instruction. However, using it directly would let the values of PV exceed the representable range of FP16. To address this problem, we propose to narrow the quantization ranges of P and V so that the accumulated values fit within FP16.

Performance. For efficiency, SageAttention2++ delivers a 3.9× speedup over FlashAttention. In terms of accuracy, SageAttention2++ matches SageAttention2's performance. We conduct comprehensive evaluations on state-of-the-art models for text, image, and video generation. The results demonstrate that SageAttention2++ provides plug-and-play acceleration with negligible end-to-end metrics loss across diverse models.

2. Preliminary

Table 1. Speedup compared to matrix multiplication in FP16 with an FP32 accumulator.

GPU                 MM Input    MM Accumulator    Speedup
RTX4090, RTX5090    FP16        FP32              1×
RTX4090, RTX5090    FP8         FP32              2×
RTX4090, RTX5090    FP8         FP16              4×

2.1. SageAttention2

SageAttention2 (Zhang et al., 2025a) is a quantization (Zhang et al., 2025g; Hu et al., 2025) method based on FlashAttention (Dao et al., 2022). FlashAttention tiles Q, K, V into blocks ({Q_i}, {K_i}, {V_i}) and uses online softmax (Milakov & Gimelshein, 2018) to compute attention progressively. For simplicity, we omit the subscripts in the following and use Q, K, P̃, V to denote the tiled blocks. SageAttention2 quantizes Q and K to INT4/INT8 with per-block granularity, P̃ to FP8 in E4M3 with per-block granularity, and V to FP8 in E4M3 with per-channel granularity. This means each of Q, K, P̃ has a separate scale factor: δ_Q = max(|Q|)/127, δ_K = max(|K|)/127, δ_P = max(|P̃|)/448, and each channel of V has a separate scale: δ_V = colmax(|V|)/448. For example, P̂ = ⌈P̃/δ_P⌋ and V̂ = ⌈V/δ_V⌋; then PV = (P̂ V̂) · δ_P · δ_V. By doing so, SageAttention2 accelerates the matrix multiplications in attention through low-bit Tensor Core operations.

2.2. Data Type of Accumulator for Matmul

On some GPUs, the speed of Matmul instructions depends on the accumulator data type. For instance, mma.f32.f8.f8.f32 uses an FP32 accumulator for FP8 Matmul and is only 2× faster than FP16. The instruction using an FP16 accumulator for FP8 Matmul (mma.f16.f8.f8.f16) is 4× faster than FP16. Table 1 summarizes the speedup of Matmul instructions with different accumulators.

3. SageAttention2++

In this section, we introduce SageAttention2++. The workflow of SageAttention2++ is based on SageAttention2, also using the smoothing of Q and K, INT4/INT8 quantization for the QK⊤ Matmul, and FP8 quantization for the PV Matmul. The main difference is that for PV, SageAttention2++ uses the faster instruction (mma.f16.f8.f8.f16), which employs an FP16 accumulator for the FP8 Matmul. To ensure the results of the FP8 Matmul remain within FP16's representable range, we adjust the scale factors of the FP8 quantization.

3.1. Narrowing the FP8 Quantization Range

The specific MMA (NVIDIA, 2025) instruction used for the Matmul between P and V is mma.m16n8k32. If we quantize P and V to FP8 in E4M3 (range -448∼448) as in SageAttention2, the results may exceed FP16's representable range (-65504∼65504). This occurs because 32 product values pv are accumulated in FP16, where p and v come from the quantized P̂ and V̂ (derived from P̃ and V). To ensure the accumulated results stay within FP16's range, we require

|32 × pv| ≤ 65504.    (1)

For instance, choosing |p| ≤ 224 and |v| ≤ 9 satisfies this condition. We therefore narrow the quantization ranges of P and V by adjusting their scale factors:

δ_P = max(|P̃|)/P_r,  δ_V = max(|V|)/V_r,    (2)

where we constrain P_r × V_r ≤ 2047 (since 65504/32 = 2047).

3.2. Delayed FP32 Buffering

The transformation of accumulated values from mma.m16n8k32 (in FP16) to FP32 incurs overhead because it requires additional data-type conversion PTX instructions (NVIDIA, 2025). To reduce this overhead, we accumulate two consecutive mma.m16n8k32 results in FP16 before performing the FP32 conversion, effectively halving the conversion overhead. Maintaining the FP16 representable range then requires:

P_r × V_r ≤ 2047/2.    (3)

Choice of P_r and V_r. Table 2 shows attention accuracy for feasible (P_r, V_r) pairs. The results demonstrate that narrowing the quantization ranges introduces negligible error. We select P_r = 224 and V_r = 4.5 for optimal performance.

Table 2. Average attention accuracy across all attention layers of CogvideoX.

Method        P_r   V_r    CosSim↑   L1↓
SageAttn2     448   448    99.97%    0.01862
SageAttn2++   448   2.25   99.97%    0.01863
SageAttn2++   224   4.5    99.97%    0.01862
SageAttn2++   112   9      99.97%    0.01863

4. Experiment

Main result. SageAttention2++ achieves up to 3.9× speedup over FlashAttention2 while consistently outperforming both SageAttention and SageAttention2 in computational efficiency. Importantly, these performance gains are achieved with negligible impact on end-to-end metrics across diverse model architectures.

4.1. Setup

Models and attentions. We evaluate SageAttention2++ across diverse representative models spanning language, image, and video generation: Llama3.1 (8B) (Dubey et al., 2024) for text2text; CogvideoX (2B), HunyuanVideo (Kong et al., 2024), and Wan (Wan et al., 2025) for text2video; and Flux (schnell) (Black Forest Labs, 2023) and Stable-Diffusion3.5 (turbo) (Stability AI, 2023) for text2image. We compare our method with FlashAttention2 (Dao, 2024), SageAttention (Zhang et al., 2025c),


[Figure 1: bar charts of Speed (TOPS) vs. sequence length (1K-32K) on RTX4090 with head dim = 128, for causal = False and causal = True; series: FlashAttn, Sage1, Sage2(8+8), Sage2(4+8), Sage2++(8+8), Sage2++(4+8).]

Figure 1. Speed comparison between SageAttention2++ and baselines (RTX4090, headdim=128).


[Figure 2: bar charts of Speed (TOPS) vs. sequence length (1K-32K) on RTX4090 with head dim = 64, for causal = False and causal = True; series: FlashAttn, Sage1, Sage2(8+8), Sage2(4+8), Sage2++(8+8), Sage2++(4+8).]

Figure 2. Speed comparison between SageAttention2++ and baselines (RTX4090, headdim=64).


[Figure 3: bar charts of Speed (TOPS) vs. sequence length (1K-32K) on RTX5090 with head dim = 128, for causal = False and causal = True; series: FlashAttn, Sage1, Sage2(8+8), Sage2++(8+8).]

Figure 3. Speed comparison between SageAttention2++ and baselines (RTX5090, headdim=128).


[Figure 4: bar charts of Speed (TOPS) vs. sequence length (1K-32K) on RTX5090 with head dim = 64, for causal = False and causal = True; series: FlashAttn, Sage1, Sage2(8+8), Sage2++(8+8).]

Figure 4. Speed comparison between SageAttention2++ and baselines (RTX5090, headdim=64).

and SageAttention2 (Zhang et al., 2025a). Please note that FlashAttention3 can only run on Hopper GPUs, so FlashAttention2 is already the fastest baseline for RTX5090 and RTX4090. Following SageAttention2's approach, we implement two SageAttention2++ variants: SageAttn2++(8+8) (INT8 for Q, K) and SageAttn2++(4+8) (INT4 for Q, K), both using FP8 in E4M3 for P̃ and V.

Datasets and metrics. Detailed dataset and metric information appears in Appendix A.2.

Implementation. We implement SageAttention2++ using CUDA.

4.2. Speed of Kernels

Kernel Speed. We benchmark the speed of SageAttention2++ against baselines using configurations with headdim=64 and headdim=128, both with

Table 3. End-to-end metrics across text, image, and video generation models.

Model      Attention          WikiText (Ppl.)↓   Lambda (Acc.)↑   NIAH (Acc.)↑
Llama3.1   Full-Precision     6.013              0.815            0.906
           SageAttn2(8+8)     6.019              0.811            0.903
           SageAttn2++(8+8)   6.020              0.813            0.901

Model       Attention          CLIPSIM↑   CLIP-T↑   VQA-a↑   VQA-t↑   FScore↑
CogvideoX   Full-Precision     0.179      0.997     74.499   74.642   4.974
(2B)        SageAttn2(4+8)     0.179      0.997     76.309   66.396   4.386
            SageAttn2(8+8)     0.178      0.997     74.322   74.447   4.899
            SageAttn2++(4+8)   0.179      0.997     74.387   66.568   4.333
            SageAttn2++(8+8)   0.179      0.997     76.309   73.165   4.386
Hunyuan     Full-Precision     0.175      0.999     77.437   52.731   1.169
Video       SageAttn2(4+8)     0.176      0.999     73.282   55.141   0.968
            SageAttn2(8+8)     0.175      0.999     78.145   54.878   1.176
            SageAttn2++(4+8)   0.176      0.999     73.282   52.258   0.968
            SageAttn2++(8+8)   0.175      0.999     78.569   51.080   1.192
Wan         Full-Precision     0.172      0.999     53.255   59.989   1.843
            SageAttn2(4+8)     0.176      0.998     29.728   38.533   0.994
            SageAttn2(8+8)     0.172      0.999     49.794   55.712   1.870
            SageAttn2++(4+8)   0.176      0.998     29.728   38.023   0.994
            SageAttn2++(8+8)   0.172      0.999     50.876   57.140   1.902

Model             Attention          FID↓      sFID↓     CLIP↑    IR↑
Flux              Full-Precision     165.117   147.831   31.401   0.912
                  SageAttn2(4+8)     164.170   147.185   31.358   0.910
                  SageAttn2(8+8)     163.185   146.101   31.453   0.905
                  SageAttn2++(4+8)   164.170   147.185   31.358   0.910
                  SageAttn2++(8+8)   163.555   146.036   31.445   0.902
Stable-Diffusion  Full-Precision     166.369   146.514   31.876   0.929
3.5               SageAttn2(4+8)     164.610   147.350   31.912   0.914
                  SageAttn2(8+8)     164.971   148.498   31.964   0.931
                  SageAttn2++(4+8)   164.610   147.350   31.912   0.914
                  SageAttn2++(8+8)   165.842   146.465   31.968   0.929

[Figure 5: qualitative comparison of Full Precision vs. SageAttn2++ outputs on Wan and Flux.]

Figure 5. A visible example of using SageAttention2++.

and without a Causal Mask (Vaswani, 2017). Specifically, Fig. 1 shows the speed across varying sequence lengths on RTX4090, indicating that SageAttn2++(4+8) and SageAttn2++(8+8) are approximately 3.9× and 3.0× faster than FlashAttention2, respectively. Figs. 2, 3, and 4 show more kernel speed comparisons on RTX4090 and RTX5090 GPUs.
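The range-narrowing rule behind these kernels (Section 3.1) can be checked numerically. The sketch below is illustrative only: plain Python stands in for the FP8 Tensor Core path, and `quantize` with integer rounding is a simplification of E4M3 quantization, not the paper's CUDA kernel. It quantizes one row of P̃ and one column of V with the scale factors of Eq. (2), then verifies the FP16-accumulator bound of Eq. (1).

```python
import random

FP16_MAX = 65504.0
K_MMA = 32  # products accumulated per mma.m16n8k32 step

def quantize(xs, r):
    """Scale xs into [-r, r] and round; returns quantized values and the
    scale factor delta = max(|x|)/r, as in Eq. (2). Integer rounding is a
    stand-in for FP8 E4M3 rounding."""
    delta = max(abs(x) for x in xs) / r
    return [round(x / delta) for x in xs], delta

random.seed(0)
p_block = [random.uniform(-1.0, 1.0) for _ in range(K_MMA)]  # a row of P~
v_col = [random.uniform(-2.0, 2.0) for _ in range(K_MMA)]    # a column of V

Pr, Vr = 224, 4.5  # the paper's choice; Pr * Vr = 1008 <= 2047 / 2
p_q, d_p = quantize(p_block, Pr)
v_q, d_v = quantize(v_col, Vr)

# Eq. (1): the worst-case accumulated magnitude cannot overflow FP16.
assert K_MMA * Pr * Vr <= FP16_MAX

acc = sum(p * v for p, v in zip(p_q, v_q))  # simulated FP16 accumulator
assert abs(acc) <= FP16_MAX

# Dequantizing with d_p * d_v recovers an approximation of the exact dot product.
exact = sum(p * v for p, v in zip(p_block, v_col))
approx = acc * d_p * d_v
print(exact, approx)  # the two agree up to quantization error
```

With the original SageAttention2 ranges (P_r = V_r = 448), the same worst-case check 32 × 448 × 448 would far exceed 65504, which is exactly why the ranges must be narrowed before an FP16 accumulator can be used.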


[Figure 6: qualitative comparison of Full Precision, Sage2++(8+8), and Sage2++(4+8) outputs on CogvideoX and HunyuanVideo.]

Figure 6. Visible examples of using SageAttention2++ on video generation.
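Section 3.2's delayed FP32 buffering can also be mimicked in a few lines. This is a hypothetical plain-Python sketch, not the PTX/CUDA implementation: two consecutive simulated mma.m16n8k32 partial sums are added in FP16 first, so the conversion into the FP32 buffer happens half as often, which is why the constraint tightens to Eq. (3).

```python
FP16_MAX = 65504.0
K_MMA = 32

def fp16_partial(p_q, v_q):
    """One simulated MMA step: 32 quantized products summed in the FP16 accumulator."""
    s = sum(p * v for p, v in zip(p_q, v_q))
    assert abs(s) <= FP16_MAX  # guaranteed when Pr * Vr <= 2047, Eq. (1)
    return s

def delayed_buffering(p_tiles, v_tiles):
    """Accumulate pairs of FP16 partials before converting into the FP32 buffer."""
    fp32_acc = 0.0
    for i in range(0, len(p_tiles), 2):
        pair = (fp16_partial(p_tiles[i], v_tiles[i])
                + fp16_partial(p_tiles[i + 1], v_tiles[i + 1]))
        assert abs(pair) <= FP16_MAX  # guaranteed when Pr * Vr <= 2047 / 2, Eq. (3)
        fp32_acc += pair  # one data-type conversion per two MMA steps
    return fp32_acc

# The paper's choice Pr = 224, Vr = 4.5 satisfies the tightened bound of Eq. (3):
assert 2 * K_MMA * 224 * 4.5 <= FP16_MAX

# Extreme quantized tiles: |p| <= 224 and |v| <= Vr after rounding.
p_tiles = [[224] * K_MMA, [-224] * K_MMA]
v_tiles = [[4] * K_MMA, [4] * K_MMA]
print(delayed_buffering(p_tiles, v_tiles))  # prints 0.0: the two partials cancel
```

The halved bound appears directly in the inner assertion: a pair of partials can reach twice the magnitude of a single one before the FP32 conversion, so P_r × V_r must shrink by the same factor of two.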

4.3. End-to-end Performance

Metrics loss. We evaluate end-to-end model performance using SageAttention2++ against baseline methods. Detailed evaluation results are presented in Table 3. The results indicate that SageAttn2++(8+8) and SageAttn2++(4+8) match the end-to-end metrics of SageAttention2. Specifically, SageAttn2++(8+8) incurs almost no metrics loss across various models, and SageAttn2++(4+8) brings a small metrics loss.

Visible image and video examples. Figs. 5, 6, and 7 show some visible comparison examples.

5. Conclusion

We introduce SageAttention2++ to further accelerate SageAttention2. We propose to utilize the faster instruction of FP8 Matmul accumulated in FP16 for the matrix multiplication PV. Experiments show that SageAttention2++ achieves a 3.9× speedup over FlashAttention (compared with SageAttention2's 3× speedup), while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ can accelerate various models, including those for language, image, and video generation, with negligible end-to-end metrics loss.


References

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023.

Chen, Y., Qian, S., Tang, H., Lai, X., Liu, Z., Han, S., and Jia, J. Longlora: Efficient fine-tuning of long-context large language models. In The International Conference on Learning Representations, 2024.

Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In International Conference on Learning Representations, 2021.

Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., and Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. In Advances in Neural Information Processing Systems, 2021.

Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Fu, T., Huang, H., Ning, X., Zhang, G., Chen, B., Wu, T., Wang, H., Huang, Z., Li, S., Yan, S., Dai, G., Yang, H., and Wang, Y. Moa: Mixture of sparse attention for automatic large language model compression. arXiv preprint arXiv:2406.14909, 2024.

Gao, Y., Zeng, Z., Du, D., Cao, S., So, H. K.-H., Cao, T., Yang, F., and Yang, M. Seerattention: Learning intrinsic sparse attention in your llms. arXiv preprint arXiv:2410.13276, 2024.

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7514-7528, 2021.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Hu, Y., Huang, W., Liang, Z., Chen, C., Zhang, J., Zhu, J., and Chen, J. Identifying sensitive weights via post-quantization integral. 2025.

Jelinek, F., Mercer, R. L., Bahl, L. R., and Baker, J. K. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63-S63, 1977.

Jiang, H., Li, Y., Zhang, C., Wu, Q., Luo, X., Ahn, S., Han, Z., Abdi, A. H., Li, D., Lin, C.-Y., Yang, Y., and Qiu, L. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Kamradt, G. Llmtest needle in a haystack-pressure testing llms. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023.

Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156-5165. PMLR, 2020.

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., Wu, K., Lin, Q., Wang, A., Wang, A., Li, C., Huang, D., Yang, F., Tan, H., Wang, H., Song, J., Bai, J., Wu, J., Xue, J., Wang, J., Yuan, J., Wang, K., Liu, M., Li, P., Li, S., Wang, W., Yu, W., Deng, X., Li, Y., Long, Y., Chen, Y., Cui, Y., Peng, Y., Yu, Z., He, Z., Xu, Z., Zhou, Z., Xu, Z., Tao, Y., Lu, Q., Liu, S., Zhou, D., Wang, H., Yang, Y., Wang, D., Liu, Y., Jiang, J., and Zhong, C. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

Lefaudeux, B., Massa, F., Liskovich, D., Xiong, W., Caggiano, V., Naren, S., Xu, M., Hu, J., Tintore, M., Zhang, S., Labatut, P., Haziza, D., Wehrstedt, L., Reizenstein, J., and Sizov, G. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022.

Li, K., Wang, Y., Peng, G., Song, G., Liu, Y., Li, H., and Qiao, Y. Uniformer: Unified transformer for efficient spatial-temporal representation learning. In International Conference on Learning Representations, 2022.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014, pp. 740-755. Springer, 2014.


Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., and Shan, Y. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22139-22149, 2024.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012-10022, 2021.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations, 2022.

Milakov, M. and Gimelshein, N. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.

NVIDIA. Nvidia ada gpu architecture. Technical whitepaper, 2022. URL https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf.

NVIDIA. Parallel Thread Execution ISA Version 8.7. https://docs.nvidia.com/cuda/pdf/ptx_isa_8.4.pdf, 2025. Accessed: 2025-05-16.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.-Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The lambada dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525-1534, 2016.

Qin, Z., Sun, W., Li, D., Shen, X., Sun, W., and Zhong, Y. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. arXiv preprint arXiv:2401.04658, 2024.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. Advances in Neural Information Processing Systems, 29, 2016.

Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Stability AI. Introducing stable diffusion 3.5. https://stability.ai/news/introducing-stable-diffusion-3-5, 2023.

Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Venkataramanan, S., Ghodrati, A., Asano, Y. M., Porikli, F., and Habibian, A. Skip-attention: Improving vision transformers by paying less attention. In The Twelfth International Conference on Learning Representations, 2024.

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W., Wang, W., Shen, W., Yu, W., Shi, X., Huang, X., Xu, X., Kou, Y., Lv, Y., Li, Y., Liu, Y., Wang, Y., Zhang, Y., Huang, Y., Li, Y., Wu, Y., Liu, Y., Pan, Y., Zheng, Y., Hong, Y., Shi, Y., Feng, Y., Jiang, Z., Han, Z., Wu, Z.-F., and Liu, Z. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

Wu, H., Zhang, E., Liao, L., Chen, C., Hou, J., Wang, A., Sun, W., Yan, Q., and Lin, W. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20144-20154, 2023.

Xi, H., Yang, S., Zhao, Y., Xu, C., Li, M., Li, X., Lin, Y., Cai, H., Zhang, J., Li, D., et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. arXiv preprint arXiv:2502.01776, 2025.

Xiao, C., Zhang, P., Han, X., Xiao, G., Lin, Y., Zhang, Z., Liu, Z., and Sun, M. Infllm: Training-free long-context extrapolation for llms with an efficient context memory. In First Workshop on Long-Context Foundation Models @ ICML 2024, 2024a.

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024b.

Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., and Dong, Y. Imagereward: Learning and evaluating human preferences for text-to-image generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.


Yang, S., Kautz, J., and Hatamizadeh, A. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.

Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., and Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10819-10829, 2022.

Zhang, J., Huang, H., Zhang, P., Wei, J., Zhu, J., and Chen, J. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization. In International Conference on Machine Learning (ICML), 2025a.

Zhang, J., Huang, H., Zhang, P., Wei, J., Zhu, J., and Chen, J. Sageattention2: Efficient attention with smoothing q and per-thread quantization. 2025b.

Zhang, J., Wei, J., Zhang, P., Chen, J., and Zhu, J. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration. In The International Conference on Learning Representations, 2025c.

Zhang, J., Wei, J., Zhang, P., Xu, X., Huang, H., Wang, H., Jiang, K., Zhu, J., and Chen, J. Sageattention3: Microscaling fp4 attention for inference and an exploration of 8-bit training. arXiv preprint arXiv:2505.11594, 2025d.

Zhang, J., Xiang, C., Huang, H., Wei, J., Xi, H., Zhu, J., and Chen, J. Spargeattn: Accurate sparse attention accelerating any model inference. In International Conference on Machine Learning (ICML), 2025e.

Zhang, J., Xiang, C., Huang, H., Wei, J., Xi, H., Zhu, J., and Chen, J. Spargeattn: Training-free sparse attention accelerating any model inference. 2025f.

Zhang, P., Wei, J., Zhang, J., Zhu, J., and Chen, J. Accurate int8 training through dynamic block-level fallback. 2025g.

Zhao, T., Fang, T., Huang, H., Liu, E., Wan, R., Soedarmadji, W., Li, S., Lin, Z., Dai, G., Yan, S., Yang, H., et al. Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation. In International Conference on Learning Representations, 2025.

Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.


A. Appendix
A.1. Visible Comparison Examples

[Figure 7: qualitative comparison of Full Precision, Sage2++(8+8), and Sage2++(4+8) outputs on Flux and Stable-Diffusion-3.5.]

Figure 7. Visible examples of using SageAttention2++ on image generation.

A.2. Datasets and Metrics in Experiments


Datasets. Text-to-text models are evaluated on: WikiText (Merity et al., 2022) to assess the model’s prediction confidence,
LAMBADA (Paperno et al., 2016) for contextual understanding, and Needle-in-A-Haystack (NIAH) task (Kamradt, 2023).
Text-to-video models are evaluated using the open-sora (Zheng et al., 2024) prompt sets. Text-to-image models are assessed
on COCO annotations (Lin et al., 2014).
End-to-end metrics. For text-to-text models, we use perplexity (Ppl.) (Jelinek et al., 1977) for WikiText, accuracy
(Acc.) for LAMBADA and NIAH. For text-to-video models, following Zhao et al. (2025), we evaluate the quality of
generated videos on five metrics: CLIPSIM and CLIP-Temp (CLIP-T) (Liu et al., 2024) to measure the text-video alignment;
VQA-a and VQA-t to assess the video aesthetic and technical quality, respectively; and Flow-score (FScore) for temporal
consistency (Wu et al., 2023). For text-to-image models, generated images are compared with the images in three aspects:
FID (Heusel et al., 2017) and sFID (Salimans et al., 2016) for fidelity evaluation, Clipscore (CLIP) (Hessel et al., 2021) for
text-image alignment, and ImageReward (IR) (Xu et al., 2023) for human preference.
Accuracy metrics. We use three metrics to assess the accuracy of the quantized attention output O′ against the full-precision attention output O. First, we flatten O and O′ into vectors of shape 1 × n. Then we compute the Cosine similarity: CosSim = Σ O O′ / (√(Σ O²) √(Σ O′²)); the Relative L1 distance: L1 = Σ |O − O′| / Σ |O|; and the Root mean square error: RMSE = √((1/n) Σ (O − O′)²).
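These definitions translate directly into code. The following plain-Python sketch is illustrative: `O` and `O_prime` are toy stand-ins, whereas the paper computes the metrics over real attention outputs.

```python
import math

def cos_sim(o, o_prime):
    """Cosine similarity between the flattened outputs."""
    num = sum(a * b for a, b in zip(o, o_prime))
    return num / (math.sqrt(sum(a * a for a in o))
                  * math.sqrt(sum(b * b for b in o_prime)))

def rel_l1(o, o_prime):
    """Relative L1 distance: sum |O - O'| / sum |O|."""
    return sum(abs(a - b) for a, b in zip(o, o_prime)) / sum(abs(a) for a in o)

def rmse(o, o_prime):
    """Root mean square error over the flattened outputs of length n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(o, o_prime)) / len(o))

O = [1.0, 2.0, 3.0, 4.0]        # full-precision output, flattened to 1 x n
O_prime = [1.0, 2.0, 3.0, 4.1]  # quantized output with a small perturbation
print(cos_sim(O, O_prime), rel_l1(O, O_prime), rmse(O, O_prime))
```

For identical outputs, CosSim is 1 (up to floating-point rounding) while L1 and RMSE are 0; small quantization errors move CosSim slightly below 1 and the distances slightly above 0, which is the pattern reported in Table 2.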
