Optimize Softmax Perf for Inductor CPU Backend #1401
Comments
Interesting, it seems like the biggest speedup would come from vectorization. Vectorizing should be possible; our CUDA backend is already generating code for vectorized blocks.
Yes, it should not be hard to implement, e.g., do a vectorized loop split and then vectorized ops overrides...
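For reference, a minimal sketch of what such a loop split could produce for a simple sum reduction, assuming ATen's at::vec::Vectorized<float> as the vector primitive (the function name and signature are made up for illustration):

```cpp
#include <ATen/cpu/vec/vec.h>

// Illustrative only: a scalar reduction loop
//   for (long i = 0; i < N; ++i) acc += x[i];
// split into a vectorized main loop plus a scalar tail loop.
float reduce_sum_split(const float* x, long N) {
  using Vec = at::vec::Vectorized<float>;
  constexpr long V = Vec::size();
  Vec vacc(0.f);
  long i = 0;
  for (; i + V <= N; i += V) {     // vectorized main loop
    vacc += Vec::loadu(x + i);
  }
  float buf[V];
  vacc.store(buf);
  float acc = 0.f;
  for (long k = 0; k < V; ++k) {   // horizontal reduction of the vector lanes
    acc += buf[k];
  }
  for (; i < N; ++i) {             // scalar tail loop
    acc += x[i];
  }
  return acc;
}
```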
I removed the "compute_at" optimization from the description since the benefit it brings is either insignificant or negative. I added 4-thread numbers. I also added the numbers with ablation to rank the importance of individual optimizations. If we look at typical inference cases with 1 thread or 4 threads per instance, manual vectorization is the most important one, followed by in-place buffers, followed by better buffer reuse heuristics. |
Benefit of manual vectorization on LayerNorm (the gaps between the op benchmark and output_code.py still need further investigation):
…1468) This PR adds a "buffer realize" heuristic for heavy ops on CPU. Currently only "exp" is considered a heavy op since its computation requires polynomial approximation. This addresses the first optimization mentioned in this issue (#1401). The list of heavy ops can be further revised in the future.
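To illustrate the intent of such a heuristic (a hand-written scalar sketch, not the actual generated code): without realizing the intermediate, the exp of each element is recomputed in every consumer loop; realizing it stores the result once into a buffer that later loops read back.

```cpp
#include <cmath>

// Sketch of the difference for a softmax-style kernel over one row of length N.
// Names and shapes are illustrative, not taken from Inductor's real output.

// Without realization: exp(x[i] - max) is evaluated twice per element,
// once for the sum reduction and once for the normalization.
void softmax_row_recompute(const float* x, float* y, long N, float row_max) {
  float sum = 0.f;
  for (long i = 0; i < N; ++i) sum += std::exp(x[i] - row_max);
  for (long i = 0; i < N; ++i) y[i] = std::exp(x[i] - row_max) / sum;
}

// With the heavy op realized: exp is computed once into a buffer (here the
// output itself, which also shows the in-place reuse opportunity).
void softmax_row_realized(const float* x, float* y, long N, float row_max) {
  float sum = 0.f;
  for (long i = 0; i < N; ++i) {
    y[i] = std::exp(x[i] - row_max);
    sum += y[i];
  }
  for (long i = 0; i < N; ++i) y[i] /= sum;
}
```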
Addressed all the TODOs. Closing.
This issue tracks the features required for optimizing Softmax in the Inductor CPU backend. The features might also be needed by other reduction-related ops.
Below are the numbers measured on CPX with a typical Softmax shape from the BERT model as an example to demonstrate the potential of various optimizations (each column on the right includes all the optimizations of the columns to its left). The optimized numbers were measured with manually altered C++ code based on Inductor's cpp codegen. The L3 cache is flushed before each iteration to make sure the numbers are stable. The ATen and Inductor baseline numbers were measured with the op benchmark while the others were measured with generated output_code.py benchmarks. These two have gaps, as indicated in the table, that need further investigation.
Numbers by ablation (the opt baseline has all the optimizations mentioned above; each column on the right has one of the optimizations ablated and reports the speedup ratio with vs. without that optimization)
The three tracked optimizations:

1. Better buffer reuse heuristics: currently, the number was measured by setting config.realize_reads_threshold = 1. This avoids computing the compute-intensive exp operation multiple times. We need a better heuristic to decide when to reuse buffers without relying on a user-provided configuration.
2. In-place buffers: config.inplace_buffers is still not functioning. The number was measured with a bit of hacking that mimics the code it would generate, assuming it works well. The number shows this would bring a significant perf gain.
3. Manual vectorization: this is another big opportunity. The number was measured by manually vectorizing the reduction loops with the PyTorch Vectorized API; a rough sketch is shown after this list. It can be implemented with loop split support.
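A rough sketch of what the manually vectorized reduction loops could look like with ATen's at::vec::Vectorized<float> API (not the actual hand-altered code behind the measurements; the row length is assumed to be a multiple of the vector width and tail handling is omitted):

```cpp
#include <ATen/cpu/vec/vec.h>
#include <algorithm>
#include <limits>

// Sketch of a manually vectorized softmax over the last dimension.
// Assumes N is a multiple of the vector width; no tail handling.
void softmax_vectorized(const float* in, float* out, long rows, long N) {
  using Vec = at::vec::Vectorized<float>;
  constexpr long V = Vec::size();
  for (long r = 0; r < rows; ++r) {
    const float* x = in + r * N;
    float* y = out + r * N;

    // Vectorized max reduction.
    Vec vmax(-std::numeric_limits<float>::infinity());
    for (long i = 0; i < N; i += V) {
      vmax = at::vec::maximum(vmax, Vec::loadu(x + i));
    }
    float buf[V];
    vmax.store(buf);
    float m = buf[0];
    for (long k = 1; k < V; ++k) m = std::max(m, buf[k]);

    // Vectorized exp + sum reduction; exp is stored into the output buffer
    // so it is not recomputed during normalization (in-place reuse).
    Vec vsum(0.f);
    for (long i = 0; i < N; i += V) {
      Vec e = (Vec::loadu(x + i) - Vec(m)).exp();
      e.store(y + i);
      vsum += e;
    }
    vsum.store(buf);
    float s = 0.f;
    for (long k = 0; k < V; ++k) s += buf[k];

    // Vectorized normalization.
    Vec vinv(1.f / s);
    for (long i = 0; i < N; i += V) {
      (Vec::loadu(y + i) * vinv).store(y + i);
    }
  }
}
```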