
[CPU][Brgemm] add support for int8 brgemm #143384


Closed · wants to merge 4 commits

Conversation

Valentine233
Collaborator

@Valentine233 Valentine233 commented Dec 17, 2024


pytorch-bot bot commented Dec 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143384

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit ea8db8b with merge base 9631d1a:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, module: cpu, and module: inductor labels Dec 17, 2024
@Valentine233 Valentine233 marked this pull request as draft December 17, 2024 10:05
@Valentine233 Valentine233 added the topic: not user facing label Dec 17, 2024
@Valentine233 Valentine233 marked this pull request as ready for review December 18, 2024 01:19
@mikaylagawarecki mikaylagawarecki added the triaged label Dec 18, 2024
@Valentine233 Valentine233 added the ciflow/trunk label and removed the triaged label Dec 19, 2024
Collaborator

@jgong5 jgong5 left a comment


Can you elaborate on the usage of batch_size? How is it to be used by SDPA?

@Valentine233
Collaborator Author

Valentine233 commented Dec 19, 2024

> Can you elaborate on the usage of batch_size? How is it to be used by SDPA?

The batch_size here refers to the batch parameter in BRGemm (Batch-Reduced Gemm). For the second Gemm in INT8 SDPA, i.e. the product of attention [q_len, kv_len] and value [kv_len, head_size], we do blocking on kv_len to split it into several kv blocks. So we can take advantage of the batch_size parameter to merge several Gemms of [q_len, kv_block_size] and [kv_block_size, head_size] into one BRGemm with batch_size=kv_block_num, with K as the reduced dimension.
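
Purely as an illustration (this is not code from the PR, and the sizes below are invented), the batch-reduced semantics described above can be sketched in PyTorch: a brgemm with batch_size=kv_block_num accumulates kv_block_num block Gemms into one output, which is mathematically the same as a single Gemm over the full kv_len reduction dimension.

```python
import torch

# Illustrative sizes only; not the PR's actual blocking parameters.
q_len, head_size = 32, 64
kv_block_size, kv_block_num = 64, 4
kv_len = kv_block_num * kv_block_size

a_blocks = torch.randn(kv_block_num, q_len, kv_block_size)       # attention blocks
v_blocks = torch.randn(kv_block_num, kv_block_size, head_size)   # value blocks

# Batch-reduced Gemm semantics: accumulate all kv blocks into one output,
# i.e. C = sum_b A[b] @ V[b] with batch_size = kv_block_num.
c_brgemm = torch.zeros(q_len, head_size)
for b in range(kv_block_num):
    c_brgemm += a_blocks[b] @ v_blocks[b]

# Equivalent single Gemm over the full reduction dimension kv_len.
c_ref = a_blocks.permute(1, 0, 2).reshape(q_len, kv_len) @ v_blocks.reshape(kv_len, head_size)
assert torch.allclose(c_brgemm, c_ref, atol=1e-4)
```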

@Valentine233 Valentine233 requested a review from jgong5 December 19, 2024 10:53
@mikaylagawarecki mikaylagawarecki added the triaged label Dec 19, 2024
@jgong5
Collaborator

jgong5 commented Dec 20, 2024

>> Can you elaborate on the usage of batch_size? How is it to be used by SDPA?
>
> The batch_size here refers to the batch parameter in BRGemm (Batch-Reduced Gemm). For the second Gemm in INT8 SDPA, i.e. the product of attention [q_len, kv_len] and value [kv_len, head_size], we do blocking on kv_len to split it into several kv blocks. So we can take advantage of the batch_size parameter to merge several Gemms of [q_len, kv_block_size] and [kv_block_size, head_size] into one BRGemm with batch_size=kv_block_num, with K as the reduced dimension.

In that case, why not make the kv_block_size larger and invoke a single small gemm instead of making a brgemm call?

@Valentine233
Collaborator Author

> In that case, why not make the kv_block_size larger and invoke a single small gemm instead of making a brgemm call?

To efficiently do the softmax, which comes before the gemm of A and V, we create the buffer for A with shape [kv_block_num, q_block_size, kv_block_size] instead of [q_block_size, kv_block_num, kv_block_size]. With this shape, we can do the softmax-related calculations on a small contiguous block [q_block_size, kv_block_size], which shows good perf. As kv_block_num is the outermost dimension, we need to make a brgemm call with a batch size greater than 1.
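
For illustration only (not the PR's C++ kernel; the sizes are invented, and the running max/sum bookkeeping that a full blocked softmax needs across kv blocks is omitted), a sketch of why this layout is convenient: each slab the softmax-related passes touch is one contiguous [q_block_size, kv_block_size] block.

```python
import torch

q_block_size, kv_block_size, kv_block_num = 32, 64, 4

# Attention buffer for A, laid out as [kv_block_num, q_block_size, kv_block_size]:
# each attn[b] is one contiguous [q_block_size, kv_block_size] slab.
attn = torch.randn(kv_block_num, q_block_size, kv_block_size)
assert attn[0].is_contiguous()

for b in range(kv_block_num):
    block = attn[b]                                    # contiguous block
    row_max = block.max(dim=-1, keepdim=True).values   # per-row max within the block
    attn[b] = (block - row_max).exp()                  # exponentiation pass on the block
# A real kernel would also combine per-block maxes/sums across kv blocks before normalizing.
```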

@jgong5
Collaborator

jgong5 commented Dec 20, 2024

> To efficiently do the softmax, which comes before the gemm of A and V, we create the buffer for A with shape [kv_block_num, q_block_size, kv_block_size] instead of [q_block_size, kv_block_num, kv_block_size]. With this shape, we can do the softmax-related calculations on a small contiguous block [q_block_size, kv_block_size], which shows good perf. As kv_block_num is the outermost dimension, we need to make a brgemm call with a batch size greater than 1.

Can you elaborate on the necessity of [kv_block_num, q_block_size, kv_block_size] instead of [q_block_size, kv_block_num, kv_block_size] to make the softmax computation more efficient? Why can't we make the output layout of softmax [q_block_size, kv_len] instead? The reason I am asking these questions is that I'd like to understand the necessity of supporting "batch-reduced" gemm here. My assumption was that the "batch-reduced" semantics is only necessary for computations like convolutions, where the data window to be reduced is non-contiguous by nature.

@Valentine233
Collaborator Author

Valentine233 commented Dec 23, 2024

> Can you elaborate on the necessity of [kv_block_num, q_block_size, kv_block_size] instead of [q_block_size, kv_block_num, kv_block_size] to make the softmax computation more efficient? Why can't we make the output layout of softmax [q_block_size, kv_len] instead? The reason I am asking these questions is that I'd like to understand the necessity of supporting "batch-reduced" gemm here. My assumption was that the "batch-reduced" semantics is only necessary for computations like convolutions, where the data window to be reduced is non-contiguous by nature.

Thanks, Jiong. We assume kv_len is pretty large here, e.g. 9216 from SD (Stable Diffusion). First of all, we need to do kv blocking, because operating on the whole large [q_block_size, kv_len] is memory inefficient. Then we have two options for the layout of attention. If we lay attention out as [q_block_size, kv_block_num, kv_block_size], each [q_block_size, kv_block_size] block has a row stride of kv_len. If we use [kv_block_num, q_block_size, kv_block_size] instead, the row stride is kv_block_size, so each block is contiguous. The latter is more efficient for the softmax calculations, because each [q_block_size, kv_block_size] block is contiguous and we don't need to load much unused buffer from memory. This is also borne out by our experiments.
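
A small PyTorch sketch of the stride argument above, with invented sizes: in the [q_block_size, kv_block_num, kv_block_size] layout a kv block is strided by kv_len, while in the [kv_block_num, q_block_size, kv_block_size] layout each block is a dense, contiguous slab.

```python
import torch

q_block_size, kv_block_size, kv_block_num = 32, 64, 4
kv_len = kv_block_num * kv_block_size

# Layout 1: [q_block_size, kv_block_num, kv_block_size]
# -> one kv block is strided by kv_len and is not contiguous.
attn_a = torch.empty(q_block_size, kv_block_num, kv_block_size)
block_a = attn_a[:, 0, :]
print(block_a.stride())          # (256, 1), i.e. row stride == kv_len
print(block_a.is_contiguous())   # False

# Layout 2: [kv_block_num, q_block_size, kv_block_size]
# -> each block is a dense slab with row stride kv_block_size.
attn_b = torch.empty(kv_block_num, q_block_size, kv_block_size)
block_b = attn_b[0]
print(block_b.stride())          # (64, 1), i.e. row stride == kv_block_size
print(block_b.is_contiguous())   # True
```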

@jgong5
Collaborator

jgong5 commented Dec 23, 2024

> Thanks, Jiong. We assume kv_len is pretty large here, e.g. 9216 from SD (Stable Diffusion). First of all, we need to do kv blocking, because operating on the whole large [q_block_size, kv_len] is memory inefficient. Then we have two options for the layout of attention. If we lay attention out as [q_block_size, kv_block_num, kv_block_size], each [q_block_size, kv_block_size] block has a row stride of kv_len. If we use [kv_block_num, q_block_size, kv_block_size] instead, the row stride is kv_block_size, so each block is contiguous. The latter is more efficient for the softmax calculations, because each [q_block_size, kv_block_size] block is contiguous and we don't need to load much unused buffer from memory. This is also borne out by our experiments.

Are you referring to the input to the softmax or the output of it, or both?

@Valentine233
Collaborator Author

> Are you referring to the input to the softmax or the output of it, or both?

Both the input and output of softmax.

@jgong5
Collaborator

jgong5 commented Dec 25, 2024

>> Are you referring to the input to the softmax or the output of it, or both?
>
> Both the input and output of softmax.

Can we make the layouts of the input and output different? The input can be blocked and the output can be 2D, and then we don't need batch-reduce semantics.
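
A sketch of the suggested option, again with invented sizes and not taken from this PR: keep the blocked buffer for the softmax work, but write the results into a plain 2D [q_block_size, kv_len] buffer, so that the A @ V step becomes a single ordinary gemm and no batch-reduce semantics are needed.

```python
import torch

q_block_size, kv_block_size, kv_block_num, head_size = 32, 64, 4, 16
kv_len = kv_block_num * kv_block_size

# Blocked softmax workspace (each block contiguous, as discussed above).
attn_blocked = torch.rand(kv_block_num, q_block_size, kv_block_size)
value = torch.randn(kv_len, head_size)

# Write each processed block into a plain 2D [q_block_size, kv_len] buffer...
attn_2d = torch.empty(q_block_size, kv_len)
for b in range(kv_block_num):
    attn_2d[:, b * kv_block_size:(b + 1) * kv_block_size] = attn_blocked[b]

# ...so that A @ V is a single ordinary gemm; no batch-reduce needed.
out = attn_2d @ value
assert out.shape == (q_block_size, head_size)
```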

@Valentine233
Collaborator Author

> Can we make the layouts of the input and output different? The input can be blocked and the output can be 2D, and then we don't need batch-reduce semantics.

Thanks, I will try making the output layout different.

@Valentine233
Collaborator Author

As the layout change may impact the kernel perf, I will continue this PR after the perf is confirmed.

@Valentine233
Collaborator Author

> As the layout change may impact the kernel perf, I will continue this PR after the perf is confirmed.

@jgong5 Hi, I have confirmed the perf and removed the support for batch_size in brgemm. Please help review again, thanks!

@Valentine233
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@Valentine233
Collaborator Author

@peterbell10 @ezyang Could you help take a look? Thanks!

Contributor

@ezyang ezyang left a comment


okey dokey but you sure you don't want tests?

@ezyang
Contributor

ezyang commented Jan 10, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


Labels
ciflow/inductor, ciflow/trunk, Merged, module: cpu, module: inductor, open source, topic: not user facing, triaged

7 participants