
[CPU][Brgemm] add support for int8 brgemm #143384


Closed · wants to merge 4 commits

Conversation

Valentine233
Collaborator

@Valentine233 Valentine233 commented Dec 17, 2024


pytorch-bot bot commented Dec 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143384

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit ea8db8b with merge base 9631d1a:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, module: cpu, and module: inductor labels Dec 17, 2024
@Valentine233 Valentine233 marked this pull request as draft December 17, 2024 10:05
@Valentine233 Valentine233 added the topic: not user facing label Dec 17, 2024
@Valentine233 Valentine233 marked this pull request as ready for review December 18, 2024 01:19
@mikaylagawarecki mikaylagawarecki added the triaged label Dec 18, 2024
@Valentine233 Valentine233 added the ciflow/trunk label and removed the triaged label Dec 19, 2024
Collaborator

@jgong5 jgong5 left a comment


Can you elaborate on the usage of batch_size? How is it to be used by SDPA?

@Valentine233
Collaborator Author

Valentine233 commented Dec 19, 2024

> Can you elaborate on the usage of batch_size? How is it to be used by SDPA?

The batch_size here refers to the batch parameter in BRGemm (Batch-Reduced Gemm). For the second Gemm in INT8 SDPA, i.e. the product of attention [q_len, kv_len] and value [kv_len, head_size], we do blocking on kv_len to split it into several kv blocks. So we can take advantage of the batch_size parameter to merge several Gemms of [q_len, kv_block_size] and [kv_block_size, head_size] into one BRGemm with batch_size=kv_block_num, with K as the reduced dimension.
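
Purely as an illustration (this is not code from the PR, and the sizes below are invented), the batch-reduced semantics described above can be sketched in PyTorch: a brgemm with batch_size=kv_block_num accumulates kv_block_num block Gemms into one output, which is mathematically the same as a single Gemm over the full kv_len reduction dimension.

```python
import torch

# Illustrative sizes only; not the PR's actual blocking parameters.
q_len, head_size = 32, 64
kv_block_size, kv_block_num = 64, 4
kv_len = kv_block_num * kv_block_size

a_blocks = torch.randn(kv_block_num, q_len, kv_block_size)       # attention blocks
v_blocks = torch.randn(kv_block_num, kv_block_size, head_size)   # value blocks

# Batch-reduced Gemm semantics: accumulate all kv blocks into one output,
# i.e. C = sum_b A[b] @ V[b] with batch_size = kv_block_num.
c_brgemm = torch.zeros(q_len, head_size)
for b in range(kv_block_num):
    c_brgemm += a_blocks[b] @ v_blocks[b]

# Equivalent single Gemm over the full reduction dimension kv_len.
c_ref = a_blocks.permute(1, 0, 2).reshape(q_len, kv_len) @ v_blocks.reshape(kv_len, head_size)
assert torch.allclose(c_brgemm, c_ref, atol=1e-4)
```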

@Valentine233 Valentine233 requested a review from jgong5 December 19, 2024 10:53
@mikaylagawarecki mikaylagawarecki added the triaged label Dec 19, 2024
@jgong5
Collaborator

jgong5 commented Dec 20, 2024

>> Can you elaborate on the usage of batch_size? How is it to be used by SDPA?
>
> The batch_size here refers to the batch parameter in BRGemm (Batch-Reduced Gemm). For the second Gemm in INT8 SDPA, i.e. the product of attention [q_len, kv_len] and value [kv_len, head_size], we do blocking on kv_len to split it into several kv blocks. So we can take advantage of the batch_size parameter to merge several Gemms of [q_len, kv_block_size] and [kv_block_size, head_size] into one BRGemm with batch_size=kv_block_num, with K as the reduced dimension.

In that case, why not make the kv_block_size larger and invoke a single small gemm instead of making a brgemm call?

@Valentine233
Collaborator Author

> In that case, why not make the kv_block_size larger and invoke a single small gemm instead of making a brgemm call?

To efficiently do the softmax, which comes before the gemm of A and V, we create the buffer for A with shape [kv_block_num, q_block_size, kv_block_size] instead of [q_block_size, kv_block_num, kv_block_size]. With this shape, we can do the softmax-related calculations on a small contiguous block [q_block_size, kv_block_size], which shows good perf. As kv_block_num is the outermost dimension, we need to make a brgemm call with a batch size greater than 1.
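
For illustration only (not the PR's C++ kernel; the sizes are invented, and the running max/sum bookkeeping that a full blocked softmax needs across kv blocks is omitted), a sketch of why this layout is convenient: each slab the softmax-related passes touch is one contiguous [q_block_size, kv_block_size] block.

```python
import torch

q_block_size, kv_block_size, kv_block_num = 32, 64, 4

# Attention buffer for A, laid out as [kv_block_num, q_block_size, kv_block_size]:
# each attn[b] is one contiguous [q_block_size, kv_block_size] slab.
attn = torch.randn(kv_block_num, q_block_size, kv_block_size)
assert attn[0].is_contiguous()

for b in range(kv_block_num):
    block = attn[b]                                    # contiguous block
    row_max = block.max(dim=-1, keepdim=True).values   # per-row max within the block
    attn[b] = (block - row_max).exp()                  # exponentiation pass on the block
# A real kernel would also combine per-block maxes/sums across kv blocks before normalizing.
```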

@jgong5
Collaborator

jgong5 commented Dec 20, 2024

> To efficiently do the softmax, which comes before the gemm of A and V, we create the buffer for A with shape [kv_block_num, q_block_size, kv_block_size] instead of [q_block_size, kv_block_num, kv_block_size]. With this shape, we can do the softmax-related calculations on a small contiguous block [q_block_size, kv_block_size], which shows good perf. As kv_block_num is the outermost dimension, we need to make a brgemm call with a batch size greater than 1.

Can you elaborate on the necessity of [kv_block_num, q_block_size, kv_block_size] instead of [q_block_size, kv_block_num, kv_block_size] to make the softmax computation more efficient? Why can't we make the output layout of softmax [q_block_size, kv_len] instead? The reason I am asking these questions is that I'd like to understand the necessity of supporting "batch-reduced" gemm here. My assumption was that the "batch-reduced" semantics is only necessary for computations like convolutions, where the data window to be reduced is non-contiguous by nature.

@Valentine233
Collaborator Author

Valentine233 commented Dec 23, 2024

> Can you elaborate on the necessity of [kv_block_num, q_block_size, kv_block_size] instead of [q_block_size, kv_block_num, kv_block_size] to make the softmax computation more efficient? Why can't we make the output layout of softmax [q_block_size, kv_len] instead? The reason I am asking these questions is that I'd like to understand the necessity of supporting "batch-reduced" gemm here. My assumption was that the "batch-reduced" semantics is only necessary for computations like convolutions, where the data window to be reduced is non-contiguous by nature.

Thanks, Jiong. We assume kv_len is pretty large here, e.g. 9216 from SD (Stable Diffusion). First of all, we need to do kv blocking, because operating on the whole large [q_block_size, kv_len] is memory inefficient. Then we have two options for the layout of attention. If we lay attention out as [q_block_size, kv_block_num, kv_block_size], each [q_block_size, kv_block_size] block has a row stride of kv_len. If we use [kv_block_num, q_block_size, kv_block_size] instead, the row stride is kv_block_size, so each block is contiguous. The latter is more efficient for the softmax calculations, because each [q_block_size, kv_block_size] block is contiguous and we don't need to load much unused buffer from memory. This is also borne out by our experiments.
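
A small PyTorch sketch of the stride argument above, with invented sizes: in the [q_block_size, kv_block_num, kv_block_size] layout a kv block is strided by kv_len, while in the [kv_block_num, q_block_size, kv_block_size] layout each block is a dense, contiguous slab.

```python
import torch

q_block_size, kv_block_size, kv_block_num = 32, 64, 4
kv_len = kv_block_num * kv_block_size

# Layout 1: [q_block_size, kv_block_num, kv_block_size]
# -> one kv block is strided by kv_len and is not contiguous.
attn_a = torch.empty(q_block_size, kv_block_num, kv_block_size)
block_a = attn_a[:, 0, :]
print(block_a.stride())          # (256, 1), i.e. row stride == kv_len
print(block_a.is_contiguous())   # False

# Layout 2: [kv_block_num, q_block_size, kv_block_size]
# -> each block is a dense slab with row stride kv_block_size.
attn_b = torch.empty(kv_block_num, q_block_size, kv_block_size)
block_b = attn_b[0]
print(block_b.stride())          # (64, 1), i.e. row stride == kv_block_size
print(block_b.is_contiguous())   # True
```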

@jgong5
Collaborator

jgong5 commented Dec 23, 2024

> Thanks, Jiong. We assume kv_len is pretty large here, e.g. 9216 from SD (Stable Diffusion). First of all, we need to do kv blocking, because operating on the whole large [q_block_size, kv_len] is memory inefficient. Then we have two options for the layout of attention. If we lay attention out as [q_block_size, kv_block_num, kv_block_size], each [q_block_size, kv_block_size] block has a row stride of kv_len. If we use [kv_block_num, q_block_size, kv_block_size] instead, the row stride is kv_block_size, so each block is contiguous. The latter is more efficient for the softmax calculations, because each [q_block_size, kv_block_size] block is contiguous and we don't need to load much unused buffer from memory. This is also borne out by our experiments.

Are you referring to the input to the softmax or the output of it, or both?

@Valentine233
Collaborator Author

> Are you referring to the input to the softmax or the output of it, or both?

Both the input and output of softmax.

@jgong5
Collaborator

jgong5 commented Dec 25, 2024

>> Are you referring to the input to the softmax or the output of it, or both?
>
> Both the input and output of softmax.

Can we make the layouts of the input and output different? The input can be blocked and the output can be 2D, and then we don't need batch-reduce semantics.
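
A sketch of the suggested option, again with invented sizes and not taken from this PR: keep the blocked buffer for the softmax work, but write the results into a plain 2D [q_block_size, kv_len] buffer, so that the A @ V step becomes a single ordinary gemm and no batch-reduce semantics are needed.

```python
import torch

q_block_size, kv_block_size, kv_block_num, head_size = 32, 64, 4, 16
kv_len = kv_block_num * kv_block_size

# Blocked softmax workspace (each block contiguous, as discussed above).
attn_blocked = torch.rand(kv_block_num, q_block_size, kv_block_size)
value = torch.randn(kv_len, head_size)

# Write each processed block into a plain 2D [q_block_size, kv_len] buffer...
attn_2d = torch.empty(q_block_size, kv_len)
for b in range(kv_block_num):
    attn_2d[:, b * kv_block_size:(b + 1) * kv_block_size] = attn_blocked[b]

# ...so that A @ V is a single ordinary gemm; no batch-reduce needed.
out = attn_2d @ value
assert out.shape == (q_block_size, head_size)
```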

@Valentine233
Collaborator Author

> Can we make the layouts of the input and output different? The input can be blocked and the output can be 2D, and then we don't need batch-reduce semantics.

Thanks, I will try making the output layout different.

@Valentine233
Collaborator Author

As the layout change may impact the kernel perf, I will continue this PR after the perf is confirmed.

@Valentine233
Collaborator Author

> As the layout change may impact the kernel perf, I will continue this PR after the perf is confirmed.

@jgong5 Hi, I have confirmed the perf and removed the support for batch_size in brgemm. Please help review again, thanks!

@Valentine233
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@Valentine233
Collaborator Author

@peterbell10 @ezyang Could you help take a look? Thanks!

Contributor

@ezyang ezyang left a comment


okey dokey but you sure you don't want tests?

@ezyang
Contributor

ezyang commented Jan 10, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


Labels
ciflow/inductor, ciflow/trunk, Merged, module: cpu, module: inductor, open source, topic: not user facing, triaged

7 participants