Optimize Softmax Perf for Inductor CPU Backend #1401

Closed
jgong5 opened this issue Sep 28, 2022 · 5 comments

jgong5 commented Sep 28, 2022

This issue tracks the features required to optimize Softmax for the Inductor CPU backend. The same features are likely needed by other reduction-related ops.

Below are numbers measured on CPX with a typical Softmax shape from the BERT model, demonstrating the potential of the various optimizations (each column to the right includes all the optimizations of the columns to its left). The optimized numbers were measured with manually altered C++ code based on the Inductor cpp codegen. The L3 cache is flushed before each iteration to keep the numbers stable. ATen and Inductor baseline numbers were measured with the op benchmark, while the others were measured with benchmarks on the generated output_code.py. The two methods show gaps, as indicated in the table, and this needs further investigation.

| Softmax (1,16,384,384, dim=3) | ATen | Inductor baseline (simdlen=8) | +buffer reuse | +inplace buffers | +manual vectorization on reduction |
|---|---|---|---|---|---|
| 28c | 1038us | 1873us (1356us w/ output code) | 2201us (1795us w/ output code) | 955us | 591us |
| 4c | 2113us | 7116us (6767us w/ output code) | 5582us (5332us w/ output code) | 4041us | 1541us |
| 1c | 6447us | 17732us (17223us w/ output code) | 17561us (17307us w/ output code) | 13968us | 4540us |

Numbers by ablation (the opt baseline has all the optimizations mentioned above; in each "-" column one optimization is removed; the speedup columns give the ratio with vs. without the corresponding optimization):

| Softmax (1,16,384,384, dim=3) | ATen | Inductor opt baseline (simdlen=8) | -buffer reuse | -inplace buffers | -manual vectorization on reduction | buffer reuse speedup | inplace buffers speedup | manual vectorization speedup |
|---|---|---|---|---|---|---|---|---|
| 28c | 1038us | 591us | 672us | 1469us | 955us | 1.13X | 2.48X | 1.61X |
| 4c | 2113us | 1694us | 2046us | 2826us | 4041us | 1.20X | 1.66X | 2.38X |
| 1c | 6447us | 5407us | 6406us | 7752us | 13968us | 1.18X | 1.43X | 2.58X |
  1. Buffer reuse (1.13X - 1.18X)
    Currently, the number was measured by setting config.realize_reads_threshold = 1, which avoids computing the compute-intensive exp operation multiple times. We need better heuristics to decide when to reuse buffers without relying on user-provided configuration.
  2. Inplace buffers (1.43X - 2.48X)
    config.inplace_buffers is still not functional. The number was measured by hand-modifying the generated code to mimic what it should produce once it works. The result shows this would bring a significant perf gain.
  3. Manual vectorization on reduction (1.61X - 2.58X)
    This is another big opportunity. The number was measured by manually vectorizing the reduction loops with the PyTorch Vectorized API (a sketch follows this list). It can be implemented with loop-split support.
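
For illustration, a minimal hand-written sketch of the kind of vectorized reduction meant in item 3, using PyTorch's at::vec::Vectorized<float> API on a single contiguous row of the (1,16,384,384) input. The function name, loop structure, tail handling, and the header path are illustrative assumptions rather than the actual Inductor-generated code:

```cpp
// Sketch only: vectorized softmax over one contiguous row of length N,
// written against ATen's at::vec::Vectorized<float> API (header path may
// differ across PyTorch versions).
#include <ATen/cpu/vec/vec.h>
#include <algorithm>
#include <cmath>
#include <limits>

using Vec = at::vec::Vectorized<float>;

void softmax_row(const float* in, float* out, int64_t N) {
  constexpr int64_t kVecSize = Vec::size();
  const int64_t n_vec = N - (N % kVecSize);
  alignas(64) float tmp[kVecSize];

  // Pass 1: max reduction (vector body + scalar tail).
  Vec vmax(-std::numeric_limits<float>::infinity());
  for (int64_t i = 0; i < n_vec; i += kVecSize) {
    vmax = at::vec::maximum(vmax, Vec::loadu(in + i));
  }
  vmax.store(tmp);
  float max_val = *std::max_element(tmp, tmp + kVecSize);
  for (int64_t i = n_vec; i < N; ++i) {
    max_val = std::max(max_val, in[i]);
  }

  // Pass 2: exp(x - max) and sum reduction.
  Vec vsum(0.f);
  for (int64_t i = 0; i < n_vec; i += kVecSize) {
    Vec e = (Vec::loadu(in + i) - Vec(max_val)).exp();
    e.store(out + i);
    vsum = vsum + e;
  }
  vsum.store(tmp);
  float sum = 0.f;
  for (int64_t j = 0; j < kVecSize; ++j) sum += tmp[j];
  for (int64_t i = n_vec; i < N; ++i) {
    out[i] = std::exp(in[i] - max_val);
    sum += out[i];
  }

  // Pass 3: normalize by the sum.
  const Vec vinv(1.f / sum);
  for (int64_t i = 0; i < n_vec; i += kVecSize) {
    (Vec::loadu(out + i) * vinv).store(out + i);
  }
  for (int64_t i = n_vec; i < N; ++i) {
    out[i] *= 1.f / sum;
  }
}
```

The same pattern (vector body plus scalar tail) is what loop-split support in the cpp codegen would allow Inductor to emit automatically in place of the current scalar reduction loops.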
jgong5 changed the title from "Optimize Softmax Perf w/ Inductor for CPU" to "Optimize Softmax Perf for Inductor CPU Backend" on Sep 28, 2022
EikanWang self-assigned this Sep 29, 2022
jansel commented Sep 29, 2022

Interesting, it seems like the biggest speedup would come from vectorization. Vectorizing should be possible; our CUDA backend is generating code for vectorized blocks already.

jgong5 commented Sep 30, 2022

> Interesting, it seems like the biggest speedup would come from vectorization. Vectorizing should be possible; our CUDA backend is generating code for vectorized blocks already.

Yes, it should not be hard to implement, e.g., do a vectorized loop split and then provide vectorized ops overrides...

jgong5 commented Oct 2, 2022

I removed the "compute_at" optimization from the description since the benefit it brings is either insignificant or negative. I added 4-thread numbers. I also added the ablation numbers to rank the importance of the individual optimizations. If we look at typical inference cases with 1 thread or 4 threads per instance, manual vectorization is the most important, followed by in-place buffers, and then better buffer reuse heuristics.

jgong5 commented Oct 2, 2022

Benefit of manual vectorization on LayerNorm (the gaps between the op benchmark and output_code.py numbers still need further investigation; a sketch of what this vectorization looks like follows the table):

| LayerNorm (384,1024) w/ 1024 norm shape | ATen | Inductor baseline (simdlen=8) | +manual vectorization |
|---|---|---|---|
| 28c | 146us | 185us (145us w/ output_code) | 144us |
| 4c | 247us | 214us (186us w/ output_code) | 176us |
| 1c | 630us | 420us | 375us |
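
As a companion to the softmax sketch above, here is a hand-written sketch of what "+manual vectorization" could look like for one LayerNorm row: the sum and sum-of-squares reductions over the 1024-element norm dimension are vectorized with at::vec::Vectorized<float>, then the row is normalized and the affine parameters are applied. Names and the two-pass structure are illustrative assumptions, not the generated code:

```cpp
// Sketch only: vectorized mean/variance for one LayerNorm row of length N,
// followed by the normalize + affine step.
#include <ATen/cpu/vec/vec.h>
#include <cmath>

using Vec = at::vec::Vectorized<float>;

void layer_norm_row(const float* in, const float* gamma, const float* beta,
                    float* out, int64_t N, float eps = 1e-5f) {
  constexpr int64_t kVecSize = Vec::size();
  const int64_t n_vec = N - (N % kVecSize);
  alignas(64) float tmp[kVecSize];

  // Pass 1: sum and sum-of-squares reductions (vector body + scalar tail).
  Vec vsum(0.f), vsumsq(0.f);
  for (int64_t i = 0; i < n_vec; i += kVecSize) {
    Vec x = Vec::loadu(in + i);
    vsum = vsum + x;
    vsumsq = vsumsq + x * x;
  }
  float sum = 0.f, sumsq = 0.f;
  vsum.store(tmp);
  for (int64_t j = 0; j < kVecSize; ++j) sum += tmp[j];
  vsumsq.store(tmp);
  for (int64_t j = 0; j < kVecSize; ++j) sumsq += tmp[j];
  for (int64_t i = n_vec; i < N; ++i) { sum += in[i]; sumsq += in[i] * in[i]; }

  const float mean = sum / N;
  const float rstd = 1.f / std::sqrt(sumsq / N - mean * mean + eps);

  // Pass 2: normalize and apply the affine parameters.
  const Vec vmean(mean), vrstd(rstd);
  for (int64_t i = 0; i < n_vec; i += kVecSize) {
    Vec y = (Vec::loadu(in + i) - vmean) * vrstd;
    (y * Vec::loadu(gamma + i) + Vec::loadu(beta + i)).store(out + i);
  }
  for (int64_t i = n_vec; i < N; ++i) {
    out[i] = (in[i] - mean) * rstd * gamma[i] + beta[i];
  }
}
```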

jgong5 pushed a commit that referenced this issue Oct 4, 2022
…1468)

This PR adds a "buffer realize" heuristic for heavy ops on CPU. Currently only "exp" is considered a heavy op, since its computation requires polynomial approximation. This addresses the first optimization mentioned in this issue (#1401). The list of heavy ops can be revised further in the future.
jgong5 commented Dec 22, 2022

Addressed all the TODOs. Closing.

jgong5 closed this as completed Dec 22, 2022
jgong5 moved this to Done in PyTorch Intel Dec 22, 2022