This repository was archived by the owner on Aug 1, 2025. It is now read-only.

Optimize Softmax Perf for Inductor CPU Backend #1401

@jgong5

Description

This issue tracks the required features for optimizing Softmax for the Inductor CPU backend. These features might also be needed by other reduction-related ops.

Below are the numbers measured on CPX with a typical Softmax shape from the BERT model, as an example to demonstrate the potential of the various optimizations (each column on the right includes all of the optimizations in the columns to its left). The optimized numbers were measured with manually modified C++ code based on the Inductor cpp codegen. The L3 cache is flushed before each iteration to make sure the numbers are stable. The ATen and Inductor baseline numbers were measured with the op benchmark, while the others were measured with the generated output_code.py benchmarks; the two methods show gaps, as indicated in the table, and need further investigation.

| Softmax (1,16,384,384,dim=3) | ATen | Inductor baseline (simdlen=8) | +buffer reuse | +inplace buffers | +manual vectorization on reduction |
| --- | --- | --- | --- | --- | --- |
| 28c | 1038us | 1873us (1356us w/ output code) | 2201us (1795us w/ output code) | 955us | 591us |
| 4c | 2113us | 7116us (6767us w/ output code) | 5582us (5332us w/ output code) | 4041us | 1541us |
| 1c | 6447us | 17732us (17223us w/ output code) | 17561us (17307us w/ output code) | 13968us | 4540us |
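
As a side note on methodology, the sketch below shows one common way to implement the "flush L3 before each iteration" step: touching a buffer larger than the last-level cache between timed runs so every iteration starts cold. This is an illustrative assumption, not the actual harness used for the numbers above; the buffer size, iteration count, and helper names are made up for the example.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Assumed scratch size: large enough to evict the L3 of a CPX socket.
constexpr size_t kFlushBytes = 256ull * 1024 * 1024;

// Touch every cache line of the scratch buffer so the softmax tensors are
// evicted; the volatile sink keeps the loop from being optimized out.
static void flush_llc(std::vector<char>& scratch) {
  static volatile char sink = 0;
  for (size_t i = 0; i < scratch.size(); i += 64) {
    scratch[i] = static_cast<char>(i);
    sink = scratch[i];
  }
}

// Time `kernel` (the softmax under test) with cold caches on every iteration
// and return the average latency in microseconds.
template <typename F>
double time_with_cold_cache(F&& kernel, int iters = 100) {
  std::vector<char> scratch(kFlushBytes);
  double total_us = 0.0;
  for (int i = 0; i < iters; ++i) {
    flush_llc(scratch);
    auto t0 = std::chrono::steady_clock::now();
    kernel();
    auto t1 = std::chrono::steady_clock::now();
    total_us += std::chrono::duration<double, std::micro>(t1 - t0).count();
  }
  return total_us / iters;
}
```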

Numbers by Ablation (the opt baseline has all the optimizations mentioned above, each column on the right has one part of the optimizations ablated, and the last three columns give the speedup ratio with vs. without the corresponding optimization):

| Softmax (1,16,384,384,dim=3) | ATen | Inductor opt baseline (simdlen=8) | -buffer reuse | -inplace buffers | -manual vectorization on reduction | speedup: buffer reuse | speedup: inplace buffers | speedup: manual vectorization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 28c | 1038us | 591us | 672us | 1469us | 955us | 1.13X | 2.48X | 1.61X |
| 4c | 2113us | 1694us | 2046us | 2826us | 4041us | 1.20X | 1.66X | 2.38X |
| 1c | 6447us | 5407us | 6406us | 7752us | 13968us | 1.18X | 1.43X | 2.58X |
1. Buffer reuse (1.13X - 1.18X)
   Currently, the number was measured by setting config.realize_reads_threshold = 1, which avoids computing the compute-intensive exp operation multiple times. We need a better heuristic to decide when to realize and reuse buffers without relying on a user-provided configuration.
2. Inplace buffers (1.43X - 2.48X)
   The config.inplace_buffers option is still not functioning. The number was measured with a bit of hacking that mimics the code that would be generated if it worked well. The number shows this would bring a significant perf gain.
3. Manual vectorization on reduction (1.61X - 2.58X)
   This is another big opportunity. The number was measured by manually vectorizing the reduction loops with the PyTorch Vectorized API (see the sketch after this list). It can be implemented with loop-split support.
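
To illustrate where items 2 and 3 (and the realized exp buffer of item 1) are heading, below is a minimal hand-written sketch of a last-dim softmax for the (1,16,384,384) shape: the exp intermediate is written directly into the output buffer instead of a separate temporary, and the max/sum reductions are vectorized with the PyTorch at::vec::Vectorized API. This is not the actual Inductor-generated code; the function name, loop structure, and OpenMP pragma are illustrative assumptions.

```cpp
// Minimal sketch only: hand-written softmax over the last dim, not the actual
// Inductor codegen output. Names and loop structure are illustrative assumptions.
#include <ATen/cpu/vec/vec.h>

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>

void softmax_last_dim(const float* __restrict__ in, float* __restrict__ out,
                      int64_t rows,      // 1 * 16 * 384 for the shape above
                      int64_t row_len) { // 384
  using Vec = at::vec::Vectorized<float>;
  constexpr int64_t kVecSize = Vec::size();

#pragma omp parallel for
  for (int64_t r = 0; r < rows; ++r) {
    const float* x = in + r * row_len;
    float* y = out + r * row_len;

    // 1) Vectorized max reduction over the row, with a scalar tail loop.
    Vec vmax(-std::numeric_limits<float>::infinity());
    int64_t i = 0;
    for (; i + kVecSize <= row_len; i += kVecSize) {
      vmax = at::vec::maximum(vmax, Vec::loadu(x + i));
    }
    float tmp[kVecSize];
    vmax.store(tmp);
    float row_max = tmp[0];
    for (int64_t k = 1; k < kVecSize; ++k) row_max = std::max(row_max, tmp[k]);
    for (; i < row_len; ++i) row_max = std::max(row_max, x[i]);

    // 2) exp(x - max) is stored straight into the output buffer (in-place
    //    reuse, no separate temporary) while the sum is reduced in vector lanes.
    Vec vsum(0.f);
    for (i = 0; i + kVecSize <= row_len; i += kVecSize) {
      Vec e = (Vec::loadu(x + i) - Vec(row_max)).exp();
      e.store(y + i);
      vsum = vsum + e;
    }
    vsum.store(tmp);
    float row_sum = 0.f;
    for (int64_t k = 0; k < kVecSize; ++k) row_sum += tmp[k];
    for (; i < row_len; ++i) {
      float e = std::exp(x[i] - row_max);
      y[i] = e;
      row_sum += e;
    }

    // 3) Normalize in place: multiply by the reciprocal of the sum.
    Vec vinv(1.f / row_sum);
    for (i = 0; i + kVecSize <= row_len; i += kVecSize) {
      (Vec::loadu(y + i) * vinv).store(y + i);
    }
    for (; i < row_len; ++i) y[i] *= 1.f / row_sum;
  }
}
```

The split of each reduction into a vectorized main loop plus a scalar tail is the kind of loop-split support item 3 refers to, and storing the exp result into the output is what in-place buffer reuse would produce in the generated code.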
