Optimize Softmax Perf for Inductor CPU Backend #1401
Comments
Interesting, it seems like the biggest speedup would come from vectorization. Vectorizing should be possible; our CUDA backend is already generating code for vectorized blocks.
Yes, it should not be hard to implement, e.g., do a vectorized loop split and then vectorized ops overrides...
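For reference, a minimal sketch of what such a loop split could produce for a simple sum reduction, assuming ATen's at::vec::Vectorized<float> as the vector primitive (the function name and signature are made up for illustration):

```cpp
#include <ATen/cpu/vec/vec.h>

// Illustrative only: a scalar reduction loop
//   for (long i = 0; i < N; ++i) acc += x[i];
// split into a vectorized main loop plus a scalar tail loop.
float reduce_sum_split(const float* x, long N) {
  using Vec = at::vec::Vectorized<float>;
  constexpr long V = Vec::size();
  Vec vacc(0.f);
  long i = 0;
  for (; i + V <= N; i += V) {     // vectorized main loop
    vacc += Vec::loadu(x + i);
  }
  float buf[V];
  vacc.store(buf);
  float acc = 0.f;
  for (long k = 0; k < V; ++k) {   // horizontal reduction of the vector lanes
    acc += buf[k];
  }
  for (; i < N; ++i) {             // scalar tail loop
    acc += x[i];
  }
  return acc;
}
```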
I removed the "compute_at" optimization from the description since the benefit it brings is either insignificant or negative. I added 4-thread numbers. I also added the numbers with ablation to rank the importance of individual optimizations. If we look at typical inference cases with 1 thread or 4 threads per instance, manual vectorization is the most important one, followed by in-place buffers, followed by better buffer reuse heuristics. |
Benefit of manual vectorization on LayerNorm (the gaps between the op benchmark and output_code.py still need further investigation):
…1468) This PR adds a "buffer realize" heuristic for heavy ops on CPU. Currently only "exp" is considered a heavy op since its computation requires polynomial approximation. This addresses the first optimization mentioned in this issue (#1401). The list of heavy ops can be further revised in the future.
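To illustrate the intent of such a heuristic (a hand-written scalar sketch, not the actual generated code): without realizing the intermediate, the exp of each element is recomputed in every consumer loop; realizing it stores the result once into a buffer that later loops read back.

```cpp
#include <cmath>

// Sketch of the difference for a softmax-style kernel over one row of length N.
// Names and shapes are illustrative, not taken from Inductor's real output.

// Without realization: exp(x[i] - max) is evaluated twice per element,
// once for the sum reduction and once for the normalization.
void softmax_row_recompute(const float* x, float* y, long N, float row_max) {
  float sum = 0.f;
  for (long i = 0; i < N; ++i) sum += std::exp(x[i] - row_max);
  for (long i = 0; i < N; ++i) y[i] = std::exp(x[i] - row_max) / sum;
}

// With the heavy op realized: exp is computed once into a buffer (here the
// output itself, which also shows the in-place reuse opportunity).
void softmax_row_realized(const float* x, float* y, long N, float row_max) {
  float sum = 0.f;
  for (long i = 0; i < N; ++i) {
    y[i] = std::exp(x[i] - row_max);
    sum += y[i];
  }
  for (long i = 0; i < N; ++i) y[i] /= sum;
}
```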
Addressed all the TODOs. Closing.
This issue tracks the features required for optimizing Softmax in the Inductor CPU backend. The features might also be needed by other reduction-related ops.
Below are the numbers measured on CPX with a typical Softmax shape from the BERT model as an example to demonstrate the potential of various optimizations (each column on the right includes all the optimizations of the columns to its left). The optimized numbers were measured with manually altered C++ code based on Inductor's cpp codegen. The L3 cache is flushed before each iteration to make sure the numbers are stable. The ATen and Inductor baseline numbers were measured with the op benchmark while the others were measured with generated output_code.py benchmarks. These two have gaps, as indicated in the table, that need further investigation.
Numbers by ablation (the opt baseline has all the optimizations mentioned above; each column on the right has one of the optimizations ablated and reports the speedup ratio with vs. without that optimization)
The three tracked optimizations:

1. Better buffer reuse heuristics: currently, the number was measured by setting config.realize_reads_threshold = 1. This avoids computing the compute-intensive exp operation multiple times. We need a better heuristic to decide when to reuse buffers without relying on a user-provided configuration.
2. In-place buffers: config.inplace_buffers is still not functioning. The number was measured with a bit of hacking that mimics the code it would generate, assuming it works well. The number shows this would bring a significant perf gain.
3. Manual vectorization: this is another big opportunity. The number was measured by manually vectorizing the reduction loops with the PyTorch Vectorized API; a rough sketch is shown after this list. It can be implemented with loop split support.
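A rough sketch of what the manually vectorized reduction loops could look like with ATen's at::vec::Vectorized<float> API (not the actual hand-altered code behind the measurements; the row length is assumed to be a multiple of the vector width and tail handling is omitted):

```cpp
#include <ATen/cpu/vec/vec.h>
#include <algorithm>
#include <limits>

// Sketch of a manually vectorized softmax over the last dimension.
// Assumes N is a multiple of the vector width; no tail handling.
void softmax_vectorized(const float* in, float* out, long rows, long N) {
  using Vec = at::vec::Vectorized<float>;
  constexpr long V = Vec::size();
  for (long r = 0; r < rows; ++r) {
    const float* x = in + r * N;
    float* y = out + r * N;

    // Vectorized max reduction.
    Vec vmax(-std::numeric_limits<float>::infinity());
    for (long i = 0; i < N; i += V) {
      vmax = at::vec::maximum(vmax, Vec::loadu(x + i));
    }
    float buf[V];
    vmax.store(buf);
    float m = buf[0];
    for (long k = 1; k < V; ++k) m = std::max(m, buf[k]);

    // Vectorized exp + sum reduction; exp is stored into the output buffer
    // so it is not recomputed during normalization (in-place reuse).
    Vec vsum(0.f);
    for (long i = 0; i < N; i += V) {
      Vec e = (Vec::loadu(x + i) - Vec(m)).exp();
      e.store(y + i);
      vsum += e;
    }
    vsum.store(buf);
    float s = 0.f;
    for (long k = 0; k < V; ++k) s += buf[k];

    // Vectorized normalization.
    Vec vinv(1.f / s);
    for (long i = 0; i < N; i += V) {
      (Vec::loadu(y + i) * vinv).store(y + i);
    }
  }
}
```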