# Optimize Softmax Perf for Inductor CPU Backend #1401

## Description
This issue tracks the features required to optimize Softmax for the Inductor CPU backend. These features may also be needed by other reduction-related ops.
Below are the numbers measured on CPX with a typical Softmax shape from the BERT model, as an example to demonstrate the potential of various optimizations (each column to the right includes all the optimizations of the columns to its left). The optimized numbers were measured with manually altered C++ code based on the Inductor cpp codegen. The L3 cache is flushed before each iteration to keep the numbers stable (a sketch of such a flush step follows the table below). The ATen and Inductor baseline numbers were measured with the op benchmark, while the others were measured by benchmarking the generated output_code.py. The two methods show gaps, as indicated in the table, which need further investigation.
| Softmax (1,16,384,384, dim=3) | ATen | Inductor baseline (simdlen=8) | +buffer reuse | +inplace buffers | +manual vectorization on reduction |
|---|---|---|---|---|---|
| 28c | 1038us | 1873us (1356us w/ output code) | 2201us (1795us w/ output code) | 955us | 591us |
| 4c | 2113us | 7116us (6767us w/ output code) | 5582us (5332us w/ output code) | 4041us | 1541us |
| 1c | 6447us | 17732us (17223us w/ output code) | 17561us (17307us w/ output code) | 13968us | 4540us |
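
To keep the measurements stable, the L3 cache was flushed before each iteration, as noted above. Below is a minimal sketch of what such a flush step can look like; the helper name and the 64 MiB buffer size are assumptions for illustration, not the actual benchmark harness used for these numbers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not the actual harness): evict the L3 cache between
// benchmark iterations by touching a scratch buffer larger than the L3.
// The 64 MiB size is an assumption; pick something comfortably larger than
// the target CPU's L3 cache.
void flush_l3_cache() {
  static std::vector<char> scratch(64 * 1024 * 1024);
  for (size_t i = 0; i < scratch.size(); i += 64) {  // one write per 64-byte cache line
    scratch[i]++;
  }
}
```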
Numbers by ablation: the optimized baseline has all the optimizations mentioned above; in each of the next three columns one optimization is ablated, and the last three columns show the speedup ratio with vs. without the corresponding optimization. All times are in us.

| Softmax (1,16,384,384, dim=3) | ATen | Inductor opt baseline (simdlen=8) | -buffer reuse | -inplace buffers | -manual vectorization on reduction | buffer reuse speedup | inplace buffers speedup | manual vectorization speedup |
|---|---|---|---|---|---|---|---|---|
| 28c | 1038 | 591 | 672 | 1469 | 955 | 1.13X | 2.48X | 1.61X |
| 4c | 2113 | 1694 | 2046 | 2826 | 4041 | 1.20X | 1.66X | 2.38X |
| 1c | 6447 | 5407 | 6406 | 7752 | 13968 | 1.18X | 1.43X | 2.58X |
- Buffer reuse (1.13X - 1.18X)
  Currently, the number was measured by setting `config.realize_reads_threshold = 1`. This avoids computing the compute-intensive exp operation multiple times. We need a better heuristic to decide when to reuse buffers without relying on user-provided configuration.
- Inplace buffers (1.43X - 2.48X)
  `config.inplace_buffers` is still not functional. The number was measured with a bit of hacking that mimics the code that would be generated if it worked. The numbers show this would bring a significant perf gain.
- Manual vectorization on reduction (1.61X - 2.58X)
  This is another big opportunity. The number was measured by manually vectorizing the reduction loops with the PyTorch Vectorized API (see the sketch after this list). It can be implemented with loop-split support.
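
As a reference point for the manual vectorization item above, here is a minimal sketch of a vectorized softmax row kernel using ATen's `at::vec::Vectorized<float>` API. The function name, header choice, and three-pass structure are assumptions for illustration; this is not the actual hand-modified Inductor output used for the measurements.

```cpp
#include <ATen/cpu/vec/vec.h>
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>

using Vec = at::vec::Vectorized<float>;

// Softmax over one row of length N (here N = 384 for the shape above):
// pass 1: max reduction, pass 2: exp(x - max) and sum reduction,
// pass 3: normalize in place. Horizontal reductions use a small scalar loop
// to keep the sketch portable.
void softmax_row_vectorized(const float* in, float* out, int64_t N) {
  const int64_t V = Vec::size();      // 8 float lanes with AVX2 (simdlen=8)
  const int64_t N_vec = N - (N % V);  // vectorized part; tail handled scalarly
  float buf[Vec::size()];

  // Pass 1: max reduction over the row.
  Vec vmax(-std::numeric_limits<float>::infinity());
  for (int64_t j = 0; j < N_vec; j += V) {
    vmax = at::vec::maximum(vmax, Vec::loadu(in + j));
  }
  vmax.store(buf);
  float max_val = buf[0];
  for (int64_t k = 1; k < V; ++k) max_val = std::max(max_val, buf[k]);
  for (int64_t j = N_vec; j < N; ++j) max_val = std::max(max_val, in[j]);

  // Pass 2: exp(x - max) stored to the output buffer plus a sum reduction.
  // Storing exp once and reusing it is what buffer reuse buys compared to
  // recomputing exp in the normalization pass.
  Vec vsum(0.f);
  const Vec vmax_bcast(max_val);
  for (int64_t j = 0; j < N_vec; j += V) {
    Vec e = (Vec::loadu(in + j) - vmax_bcast).exp();
    e.store(out + j);
    vsum = vsum + e;
  }
  vsum.store(buf);
  float sum_val = 0.f;
  for (int64_t k = 0; k < V; ++k) sum_val += buf[k];
  for (int64_t j = N_vec; j < N; ++j) {
    out[j] = std::exp(in[j] - max_val);
    sum_val += out[j];
  }

  // Pass 3: normalize in place (reusing `out` here is the kind of inplace
  // buffer usage discussed above).
  const Vec vscale(1.f / sum_val);
  for (int64_t j = 0; j < N_vec; j += V) {
    (Vec::loadu(out + j) * vscale).store(out + j);
  }
  for (int64_t j = N_vec; j < N; ++j) out[j] *= (1.f / sum_val);
}
```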