SpaceFusion: Advanced Deep Learning Operator Fusion via Space-Mapping Graph

… efficient deep learning operator fusion. First, we develop a novel abstraction, the Space-Mapping Graph (SMG), to holistically model the spatial information of both inter- and intra-operator dependencies. Subsequently, we introduce the …

[Figure 1 residue: element-level dependencies of the fused attention computation, listed in computation order from the inputs to the Softmax output:]

    QK  = GEMM(Query, Key)
    Max = max(QK, dim=0)
    Exp = exp(QK - Max)
    Sum = sum(Exp, dim=0)

[Recovered element nodes: each QK[m, l] depends on Query[m, 1..K] and Key[l, 1..K]; Max[m] depends on the whole row QK[m, 1..L]; Sum depends on the whole row Exp[m, 1..L].]
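To make the element-level dependency chain in Figure 1 concrete, the sketch below (plain NumPy, our own illustration; the shapes and variable names are assumptions, not taken from the paper) computes one output row of the QK→Softmax stage and shows why every element of that row depends on the entire row of QK, and hence on every column of Key:

import numpy as np

# Illustrative shapes only: the m-th query row against L keys of width K.
M, L, K = 4, 8, 16
Query = np.random.rand(M, K)
Key = np.random.rand(L, K)

m = 0
QK_row = Query[m] @ Key.T          # QK[m, 0:L] -- touches every Key[l, :]
Max = QK_row.max()                 # All-to-One reduction over the whole row
Exp = np.exp(QK_row - Max)         # element-wise, but already depends on Max
Sum = Exp.sum()                    # a second All-to-One over the same row
softmax_row = Exp / Sum            # every output element depends on every QK[m, l]

A thread block that owns only part of the row would still need the whole row for Max and Sum, which is the intra-block locality problem discussed in the surrounding text.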
… GPU architecture[26]. For example, the Multi-Head Attention (MHA)[58] contains three non-element-wise operators, including two GEMMs and a Softmax. Figure 1 illustrates the dependencies from a single element of the output tensor to the other tensor elements. This complex dependency pattern exhibits (1) deeply nested dependencies between elements, layer by layer, and (2) wide dependency ranges covering the whole range of a tensor dimension. As a result, scheduling fused MHA into thread blocks while naively respecting the wide and nested dependency pattern leads to low intra-block data locality, resulting in intra-block inefficiency or even fusion failures.

Existing approaches to operator fusion fall into two categories: manually-tuned fusion and auto-tuned fusion. Manually-tuned fusion[11, 22, 30, 31, 65] is the dominant approach to fusing non-element-wise operators. These methods target specific tensor computations. Specific dataflow and parallelism strategies are cleverly designed by domain experts to achieve fusions with complex dependencies. Unfortunately, …

… into multiple parallel-executed SMG blocks, each independent of the others. The temporal slicer (Section 4.3) slices an SMG block into multiple serially-executed intra-blocks to reduce the on-chip memory footprint and to exploit possible optimization opportunities. Finally, we design auto-scheduling methods (Section 5) for SMG. The resource-aware slicing method (Section 5.1) leverages the spatial/temporal slicers to automate the generation of fusion schedules that adhere to hardware resource constraints. The SMG partition method (Section 5.2) partitions SMGs that are unschedulable due to hardware resource constraints into smaller SMGs and resubmits them to the slicing method. We identify the optimal schedule within the generated search space.

Evaluations reveal SpaceFusion's ability to exploit hidden fusion opportunities in intricate dependencies and deliver performance benefits across diverse GPU architectures. For subgraph performance, SpaceFusion achieves a maximum speedup of 10.35x (with an average of 5.31x) over the manually-tuned unfused baseline, and up to 4.03x speedup …
Recent compilers employ tile-level abstractions for fusion. The tile-graph in Welder[49] refines operator dependencies to tile-wise granularity and stitches intermediate tiles via shape alignment, allowing precise scheduling within the memory hierarchy. However, the intra-operator dependencies remain unabstracted, replaced by the shape mappings from input to output tiles in an operator. Note that this compressed information discards the original dependencies. Figures 2 (a) and (b) illustrate the tile shape mappings between the input and output tiles of Softmax and GEMM. Stemming from global reductions along the 𝐾 axis, each output element in Softmax 𝑜𝑢𝑡[𝑚, 𝑘] incorporates all input elements within its corresponding row 𝑖𝑛[𝑚, 0:𝐾]. Figure 2 (c) illustrates the shape alignment of the output tile of Softmax with the input tile of GEMM. This fusion schedules the GEMM tiles that could have been executed serially to be computed in a larger single GEMM block, resulting in a poor intra-block locality in Softmax and GEMM (e.g. the 16x256 Softmax block and the 16x64x256 GEMM block for 𝑇𝑖𝑙𝑒𝑀𝑎𝑙𝑖𝑔𝑛=16, 𝐾=256 and …

… the orange and purple nodes representing data spaces that correspond to the 𝑄𝑢𝑒𝑟𝑦, 𝐾𝑒𝑦, and 𝑄𝐾 tensors mentioned in Figure 3 (a). The node 𝐺𝐸𝑀𝑀 in black is the iteration space defined by the loop structure in Figure 3 (a). Space Mappings abstract One-to-One, One-to-All, and All-to-One mapping relations¹ between computational spaces as is discussed in Section 2, possessing their own geometric …

Figure 3. The SMG for a Single Operator GEMM. (a) the definition of 𝑄𝐾 = 𝐺𝐸𝑀𝑀(𝑄𝑢𝑒𝑟𝑦, 𝐾𝑒𝑦), (b) the visualized SMG for GEMM, and (c) the SMG for GEMM are shown. [Recovered from panel (a), the computational definition; panels (b)/(c) place Query(M,-,K), Key(-,N,K), QK(M,N,-), and the iteration space GEMM(M,N,K) along Dim2:M, Dim1:N, Dim0:K with O2A(dim=1), O2A(dim=2), and A2O(dim=0) mappings:]

    # Data Spaces by Tensors
    Query = tensor(M, K)
    Key   = tensor(N, K)
    QK    = Zeros(M, N)
    # Iteration Space of GEMM
    for (m, n, k) in Grid(M, N, K):
        QK[m, n] += Query[m, k] * Key[n, k]
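As a rough illustration of how the mappings in Figure 3 (b)/(c) can be read off the loop nest in Figure 3 (a) (our own sketch; the helper below is hypothetical and not SpaceFusion's API): an iteration-space dimension that is absent from a tensor's subscripts induces a One-to-All mapping for an input (the value is reused along that dimension) or an All-to-One mapping for an output (values are reduced along it), while a present dimension maps One-to-One.

ITER_DIMS = ("M", "N", "K")   # the GEMM iteration space of Figure 3 (Dim2:M, Dim1:N, Dim0:K)

def classify_mappings(subscript_dims, is_output):
    # subscript_dims: the iteration dims that index the tensor, e.g. ("M", "K") for Query[m, k]
    kinds = {}
    for d in ITER_DIMS:
        if d in subscript_dims:
            kinds[d] = "One-to-One"
        else:
            kinds[d] = "All-to-One" if is_output else "One-to-All"
    return kinds

print(classify_mappings(("M", "K"), is_output=False))  # Query(M,-,K): One-to-All along N
print(classify_mappings(("N", "K"), is_output=False))  # Key(-,N,K):  One-to-All along M
print(classify_mappings(("M", "N"), is_output=True))   # QK(M,N,-):   All-to-One along K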
Figure 4. Connecting SMGs to Construct a New SMG for the Fused Computational Space. ❶ SMG for GEMM in a 3-Dim Space(M,N,K). ❷ SMG for Softmax in a 2-Dim Space(M,N). ❸ GEMM → Softmax: Connecting with One-to-One to Construct a Fused Space. ❹ Fused SMG in a 3-Dim Space(M,N,K). [Recovered node labels include GEMM(M,N,K), QK(M,N,-), Max(M,-), Sub(M,N,-), Exp(M,N,-), Sum(M,-,-), and Div(M,N), connected by O2O, O2A(dim=0/1), and A2O(dim=0/1) mappings.]

Table 3. Decisions for the mappings present in a sliced dimension. ⃝ and × respectively denote the application and avoidance of slicing; △ denotes that further analysis is needed to determine whether slicing is warranted.

Mappings in the Dimension     Spatial Slicer   Temporal Slicer
None                          ⃝                ⃝
One-to-One                    ×                 ×
Input One-to-All              ⃝                ⃝
Other One-to-All              ×                 ⃝
Independent All-to-One(s)     ×                 ⃝
Dependent All-to-Ones         ×                 △
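Table 3 can be read as a small decision table consulted per dimension. The sketch below is our own encoding of it (the names are ours, not the paper's); a dimension is treated as sliceable by a given slicer only if every mapping it carries permits slicing, and the △ entry is conservatively treated as "needs further analysis":

# (spatial slicer, temporal slicer) decisions per mapping kind, following Table 3
SLICING_DECISIONS = {
    "none":                   ("apply", "apply"),
    "one_to_one":             ("avoid", "avoid"),
    "input_one_to_all":       ("apply", "apply"),
    "other_one_to_all":       ("avoid", "apply"),
    "independent_all_to_one": ("avoid", "apply"),
    "dependent_all_to_ones":  ("avoid", "analyze"),   # the triangle entry
}

def sliceable(mappings_in_dim, slicer_index):
    # slicer_index: 0 = spatial slicer, 1 = temporal slicer; "analyze" counts as not (yet) sliceable
    return all(SLICING_DECISIONS[m][slicer_index] == "apply" for m in mappings_in_dim)

# Example: a dimension carrying only an input One-to-All can be sliced spatially,
# whereas one crossed by an independent All-to-One can only be sliced temporally.
print(sliceable(["input_one_to_all"], 0),
      sliceable(["independent_all_to_one"], 0),
      sliceable(["independent_all_to_one"], 1))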
Figure 5. An SMG Example for Multi-Head Attention. (a) DFG, (b) Visualized SMG, (c) SMG. [Figure legend: Input/Intermediate/Output Data Space, Iteration Space; One-to-One, One-to-All, All-to-One mappings; axes Dim0, Dim1, Dim2. Recovered nodes include Query(M,-,K), Key(-,N,K), GEMM1(M,N,K), QK(M,N,-), Max(M,-,-), and Sub(M,N,-), with O2A(dim=1), O2A(dim=2), and A2O mappings.]

Building a fused SMG via connecting SMGs of single operators with intermediate data space dimension alignment: Figure 4 depicts the process of constructing a fused space for two individual operators, GEMM and Softmax, resulting in a fused SMG. ❶ SMG for GEMM defines a 3-dim space. ❷ SMG for Softmax defines a 2-dim space. ❸ GEMM's output data space 𝑄𝐾 (in purple) and Softmax's input data space 𝑄𝐾 (in orange) are connected with a One-to-One mapping. ❹ A fused SMG is constructed via fusing 𝑄𝐾(𝑀,𝑁,−) → 𝑄𝐾(𝑀,𝑁) into a single intermediate data space 𝑄𝐾(𝑀,𝑁,−) with dimension alignment, where "−" is a dimension placeholder for 𝐾 in GEMM. The constructed SMG defines a new 3-dim space. Employing this method, we can establish a multi-operator fusion space wherein One-to-One mappings construct the main dataflow across operators.

An illustrative example of SMG. In Figure 5, (a) the DFG of the simplified MHA, (b) the visualized SMG, and (c) the corresponding SMG are presented. The computation of MHA takes place within a five-dimensional fused computational space (BatchDim, HeadDim, Dim2, Dim1, Dim0). We focus our discussion on the three-dimensional subspace² composed of the final three dimensions (Dim2, Dim1, Dim0). The visualized SMG fuses some of the spaces connected with One-to-One mappings for visual simplicity, as is depicted in dotted boxes in Figure 5 (b) and (c). As is discussed in Section 2, an MHA computation involves three non-element-wise operators with 6 One-to-Alls and 4 All-to-Ones. The visualized SMG in Figure 5 (b) depicts these 10 mappings and their geometric position information. The 4 red arrows represent the 4 All-to-Ones resulting from the reduction operations in MHA. The first originates from GEMM1, the second and third originate from Softmax, and the fourth originates from GEMM2. It is evident that the last three of these four All-to-Ones are geometrically parallel, with their directions aligned along Dim1, while the first All-to-One from GEMM1, towards Dim0, is orthogonal to the last three All-to-Ones.

² BatchDim and HeadDim are not involved, to enhance the visual representation. Both are dimensions without dependencies on the others.

4.2 Spatial Slicer: Exploiting SMG Parallelization
A spatial slicer slices an SMG along given dimensions into multiple independent, parallelizable SMG blocks. Each SMG block is to be mapped to a thread block on GPUs. To prevent dataflow dependencies between SMG blocks generated after spatial slicing, the spatial slicer generally avoids orthogonal mappings. Table 3 encapsulates the varied decisions made when applying slicers to slice an SMG along given dimensions in the presence of diverse mappings. Specifically, spatial slicers do not slice any mappings except for the input One-to-All. The source of an input One-to-All is an input data space abstracting the input of a kernel function, which is stored in global memory and is visible to all thread blocks. Consequently, slicing the input One-to-All does not induce flow dependencies between the resulting blocks.

An illustrative example of the spatial slicer on MHA is given. The unsliced SMG depicted in Figure 5 contains mappings across all three dimensions. Among them, Dim2 is …

[Figure residue: the MHA SMG sliced into parallel SMG blocks (SMG Block 1, SMG Block 2, …), with the generated per-block program:]

    parallel_for Block in SMG_Blocks:
        # load from off-chip mem
        Q_block, K = load(…), load(…)
        QK = GEMM(Q_block, K)
        Max = max(QK, dim=0)
        Sub = QK - Max
        Exp = exp(Sub)
        Sum = sum(Exp, dim=0)
        Div = div(Exp, Sum)
        # load from off-chip mem
        V = load(…)
        …
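As a quick numerical sanity check of the recovered per-block program (our own NumPy sketch, not from the paper; sizes are arbitrary and the reductions are written along axis 1 rather than the figure's dim=0 convention), each block processes its own slice of Query rows against the full Key and Value, and the independently computed blocks reproduce the unsliced result:

import numpy as np

np.random.seed(0)
M, L, K, rows_per_block = 8, 16, 32, 4    # illustrative sizes
Query, Key, Value = np.random.rand(M, K), np.random.rand(L, K), np.random.rand(L, K)

def block_program(Q_block):               # mirrors the recovered per-block pseudocode
    QK = Q_block @ Key.T
    Max = QK.max(axis=1, keepdims=True)
    Exp = np.exp(QK - Max)
    Div = Exp / Exp.sum(axis=1, keepdims=True)
    return Div @ Value

blocks = [block_program(Query[i:i + rows_per_block]) for i in range(0, M, rows_per_block)]
assert np.allclose(np.vstack(blocks), block_program(Query))   # parallel blocks == unsliced result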
[Figure residue: the temporal slicer splits an SMG block into serially executed intra-blocks (Intra-Block 1, Intra-Block 2, Intra-Block 3), combining partial results with Simple Aggregate and Update-then-Aggregate steps. The recovered per-intra-block program:]

    Max = max(QK, dim=0)
    Max = aggrMax(Max_old, Max)                  # Simple Aggregate
    Sub = QK - Max
    Exp = exp(Sub)
    Sum = sum(Exp, dim=0)
    Sum = aggrSum(updateSum(Sum_old), Sum)       # Update then Aggregate
    Div = div(Exp, Sum)
    V_block = load(…)
    Out = GEMM(Div, V_block)
    Out = aggrSum(updateOut(Out_old), Out)       # Update then Aggregate
    …
    Max_old, Sum_old, Out_old = Max, Sum, Out
    store(Out)

[Companion figure panels: (a) Original DFG, (b) b.div Postposited, (c) Resulted DFG, (d) Aggregation Hints and Update Paths for Out and Sum, (e) Update Functions, generated via the update paths. DFG nodes are tagged e.exp, r.sum, r.dot, b.div, and s.div. The recovered update functions:]

    def updateSum(Sum_old):   # Update function for Sum
        return Sum_old * exp(Max_old) / exp(Max)

    def updateOut(Out_old):   # Update function for Out
        return Out_old * Sum_old / Sum * exp(Max_old) / exp(Max)
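The update functions recovered above implement a running rescale of the partial softmax state as intra-blocks are processed serially. The following is a hedged NumPy sketch (our own; variable names follow the figure, but the shapes, the chunking loop, and the axis-1 reduction convention are our assumptions) that applies aggrMax/aggrSum with updateSum and updateOut chunk by chunk and checks the result against the unsliced computation:

import numpy as np

np.random.seed(0)
M, L, K, chunk = 4, 12, 8, 4                      # illustrative sizes
QK = np.random.rand(M, L)
V = np.random.rand(L, K)

# Reference: unsliced softmax(QK) @ V
E = np.exp(QK - QK.max(axis=1, keepdims=True))
ref = (E / E.sum(axis=1, keepdims=True)) @ V

Max_old = np.full((M, 1), -np.inf)
Sum_old = np.zeros((M, 1))
Out_old = np.zeros((M, K))
for s in range(0, L, chunk):                      # serially executed intra-blocks
    QK_c, V_c = QK[:, s:s + chunk], V[s:s + chunk]
    Max = np.maximum(Max_old, QK_c.max(axis=1, keepdims=True))          # aggrMax
    Exp = np.exp(QK_c - Max)
    Sum = Sum_old * np.exp(Max_old) / np.exp(Max) \
          + Exp.sum(axis=1, keepdims=True)                              # updateSum, then aggrSum
    Div = Exp / Sum
    Out = Out_old * (Sum_old / Sum) * (np.exp(Max_old) / np.exp(Max)) \
          + Div @ V_c                                                   # updateOut, then aggrSum
    Max_old, Sum_old, Out_old = Max, Sum, Out

assert np.allclose(Out_old, ref)                  # rescaled partial results match the unsliced result

The rescaling is the familiar online-softmax identity; in SpaceFusion's terms it is produced from the update paths shown in panel (d).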
[Figure residue: the SpaceFusion workflow — a deep learning model is partitioned into tensor sub-programs, SMGs are built from them, and auto-scheduling (resource-aware slicing with spatial and temporal slicing, plus SMG partitioning) produces SMG blocks.]

… dependency paths between reductions. The b.div in Figure 8 (a) is deferred to b.div in Figure 8 (b). The b.sub in Figure 8 (b) is propagated along two paths, resulting in the creation of two new s.div nodes in Figure 8 (c). Figure 8 (d) presents two subgraphs of the DFG in Figure 8 (c), showing the paths between reductions. It reveals that the output node out, tagged …
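One algebraic fact that makes this kind of deferral possible is that a per-row broadcast division commutes with a downstream row-wise dot product or GEMM, so the division can be moved past the reduction that consumes its result. A one-line NumPy check of that identity (our own illustration with arbitrary shapes, not the paper's formal rewrite rule):

import numpy as np

Exp = np.random.rand(4, 8)
Sum = Exp.sum(axis=1, keepdims=True)      # per-row scalar, as produced by an r.sum reduction
V = np.random.rand(8, 16)
# Dividing each row by its scalar before or after the GEMM gives the same result,
# so the broadcast division can be postposited past the GEMM.
assert np.allclose((Exp / Sum) @ V, (Exp @ V) / Sum)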
Figure 10. DFGs for Evaluated Subgraphs. (a) MLP Layers, (b) LSTM Cell, (c) Layernorm, (d) MHA.

[Figure 11 residue: (a) Fused MLP Layers — speedup versus the number of layers (2–20) on Volta, Ampere, and Hopper; (b) Fused LSTM Cell — speedup versus size (128–1k) on Volta, Ampere, and Hopper, comparing cuBLAS, cuBLASLt, and SpaceFusion.]

… starts from the complete input 𝐺 and iteratively partitions the last sub-SMG and merges it to the front of 𝐺𝑙, until 𝐺𝑓 is schedulable. Then we obtain a schedulable 𝐺𝑓 and a potentially schedulable 𝐺𝑙. Recursively, if 𝐺𝑙 is unschedulable, SpaceFusion will enter the next round of partitioning.

5.3 Partitioning for Candidate Schedules
Algorithm 2 describes how SpaceFusion partitions unschedulable SMGs. We further increase the exploration depth of Algorithm 2 by one level. Specifically, when a schedulable 𝐺𝑓 is found, SpaceFusion records the current partitioning results of 𝐺𝑓 and 𝐺𝑙. It then continues to partition a non-All-to-One sub-SMG 𝑠𝑢𝑏𝐺 from the current schedulable 𝐺𝑓, merges 𝑠𝑢𝑏𝐺 with 𝐺𝑙, and generates new 𝐺𝑓′ and 𝐺𝑙′. This results in two candidate schedules: (𝐺𝑓, 𝐺𝑙) and (𝐺𝑓′, 𝐺𝑙′). The reason is that non-All-to-One sub-SMGs are mostly memory-intensive. They potentially lead to performance variations …
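A toy sketch of the partitioning loop described above (our own Python rendering and a simplification, not Algorithm 2 itself): sub-SMGs are peeled off the tail of the unschedulable SMG and merged onto the front of 𝐺𝑙 until the front part 𝐺𝑓 becomes schedulable, and the leftover part is partitioned recursively.

def partition(G, schedulable):
    """G: an SMG modeled as a list of sub-SMG names; schedulable: a caller-supplied resource check.
    Returns schedulable fragments covering G, front to back (simplified output format)."""
    G_f, G_l = list(G), []
    while len(G_f) > 1 and not schedulable(G_f):
        G_l.insert(0, G_f.pop())                 # move the last sub-SMG to the front of G_l
    if not G_l:
        return [G_f]
    if schedulable(G_l):
        return [G_f, G_l]
    return [G_f] + partition(G_l, schedulable)   # next round of partitioning on G_l

# Toy usage: pretend at most two sub-SMGs fit the per-block resource budget.
print(partition(["GEMM1", "Softmax", "GEMM2", "Add", "LayerNorm"],
                schedulable=lambda g: len(g) <= 2))
# -> [['GEMM1', 'Softmax'], ['GEMM2', 'Add'], ['LayerNorm']]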
Figure 12. Fused Layernorm Performance. The x-axis represents the size of M (M=N) in the 2D input tensor to be normalized. [Residue: compared implementations PyTorch, PyTorch Op, NVIDIA Apex LN, Triton, and SpaceFusion; M from 1K to 32K on Volta, Ampere, and Hopper.]

Figure 13. Fused MHA Performance. The x-axis represents the sequence lengths of MHA across different architectures. [Residue: compared implementations PyTorch, FlashAttention, Triton FlashAttention, FlashAttention 2, and SpaceFusion; batch sizes 1 and 32; sequence lengths 64–8k on Volta, Ampere, and Hopper.]

… pattern is supported in most DL compilers[4, 6, 34, 43, 57, 66, 67]. SpaceFusion found that multiple MLP layers can be further fused for specific problem sizes³. SpaceFusion generates fusion schedules for up to 20 layers in this experiment. The performance results in Figure 11 (a) show SpaceFusion achieves a maximum speedup of 3.15x and an average of 2.35x over cuBLASLt.

… fusing MHA computations into a single cleverly designed kernel. FlashAttention in Triton implements FlashAttention …
Figure 14. The End-to-End Performance. [Residue: compared systems PyTorch, TensorRT, Kernl, BladeDISC, NNFusion, and SpaceFusion; workloads Bert, Albert, T5, ViT, and Llama2; batch sizes 1 and 32; Volta, Ampere, and Hopper.]

Figure 15. Memory and Cache Analysis. L1 (left) and L2 (middle) cache miss counts and the device memory data movement (right) are shown. Values are normalized to SpaceFusion. Lower is better. X-axis annotations: MLP(𝑁𝑢𝑚𝐿𝑎𝑦𝑒𝑟𝑠, 𝑀), LN(𝑀), MHA(𝐵𝑎𝑡𝑐ℎ𝑆𝑖𝑧𝑒, 𝑆𝑒𝑞𝐿𝑒𝑛𝑔𝑡ℎ). [Residue: series SpaceFusion, Fused Baseline, and Unfused Baseline.]
[Figure residue: (a) Ablation Study — normalized performance of Base(SS), Base+TS, Base+AS, and SpaceFusion; (b) Sensitivity for Input Sizes — normalized performance for Small, Medium, and Large inputs; workloads Bert, Albert, ViT, T5, and Llama2 at batch sizes 1 and 32. Additional residue from a neighboring figure: series PerfVolta, PerfAmpere, PerfHopper, SuVolta, SuAmpere, SuHopper.]

Table 4. Compilation Time Break Down for MHA.

Workload       TS.getPriorDim + TS.slice   enum Cfg   SS.getDims + SS.slice   Tuning    Total
MHA(32,1024)   17.31 ms                    2.63 ms    0.23 ms                 33.04 s   36.33 s
MHA(32,256)    16.39 ms                    1.25 ms    0.34 ms                 29.55 s   33.41 s
Table 6. Fusion Patterns Analysis. The total counts of discovered fusion patterns, fusion patterns involving compute-intensive (CI) operators only, fusion patterns involving memory-intensive (MI) operators only, and fusion patterns involving both CI and MI operators, are detailed.

Patterns Count                 SpaceFusion   NNFusion   BladeDISC
# Fusion Patterns Discovered   50            30         14
# CI Ops Fusion                5             3          0
# MI Ops Fusion                15            14         14
# CI and MI Ops Fusion         30            13         0

… at the intra-block level with minimal need for search-based tuning.

6.6 Fusion Patterns Analysis
SpaceFusion identified 50 distinct fused subgraphs containing at least two All-to-One mappings across 14 compiled …

… data and iteration spaces. In contrast, the other two methods present more objective representations of the inner and outer aspects of the loop hierarchy; SpaceFusion abstracts the dependencies at the space granularity with geometric spatial relations, while the polyhedral model abstracts dependencies at the iteration and statement granularity. Halide IR does not emphasize the explicit abstraction of dependencies but retains iteration-level dependency information through function-level expressions and subscript variables. As a higher-level graph-based abstraction method, SpaceFusion is capable of serving as an auto-scheduler for Halide-based and polyhedral-based systems to achieve optimization heuristics for operator fusion.

Auto-generation of operator fusion. Niu et al.[43] categorize operators, analyze the gains from fusing different operator types, and make fusion decisions via pattern matching. Jia et al.[29] fuse two parallel GEMMs with graph equivalent transformation. Zheng et al.[70, 71] explore advanced fusion methods for memory-intensive operators, utilizing shared …
… comparable or even better performance than manually-tuned fusion implementations.

Acknowledgments
We thank anonymous reviewers and our shepherd, Dr. Mangpo Phothilimthana, for their insightful suggestions. This work was funded by the National Key Research & Development Program of China (No. 2022YFB4502002), NSFC (No. 62032008), STCSM (No. 23511100100), and the HighTech Support Program from STCSM (No. 22511106200). The corresponding author is Jianguo Yao.

References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensor- …
[12] NVIDIA Corporation. Basic Linear Algebra on NVIDIA GPUs. https://developer.nvidia.com/cublas.
[13] NVIDIA Corporation. CUDA Deep Neural Network. https://developer.nvidia.com/cudnn.
[14] NVIDIA Corporation. CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass.
[15] NVIDIA Corporation. Getting Started with CUDA Graphs. https://developer.nvidia.com/blog/cuda-graphs/.
[16] NVIDIA Corporation. NVIDIA A100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/a100/.
[17] NVIDIA Corporation. NVIDIA H100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/h100/.
[18] NVIDIA Corporation. NVIDIA V100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/v100/.
[19] NVIDIA Corporation. TensorRT. https://developer.nvidia.com/tensorrt/.
[20] Meghan Cowan, Thierry Moreau, Tianqi Chen, James Bornholt, and Luis Ceze. Automatic generation of high-performance quantized machine learning kernels. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, CGO 2020, page 305–316, New York, NY, USA, 2020. Association for Computing Machinery.
[33] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
[34] Chris Leary and Todd Wang. Xla: Tensorflow, compiled. TensorFlow Dev Summit, 2(3), 2017.
[35] Xiuhong Li, Yun Liang, Shengen Yan, Liancheng Jia, and Yinghan Li. A coordinated tiling and batching framework for efficient gemm on gpus. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, PPoPP '19, page 229–241, New York, NY, USA, 2019. Association for Computing Machinery.
[36] Yun Liang, Huynh Phung Huynh, Kyle Rupnow, Rick Siow Mong Goh, and Deming Chen. Efficient gpu spatial-temporal multitasking. IEEE Transactions on Parallel and Distributed Systems, 26(3):748–760, 2015.
[37] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 2022.
[38] Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. Rammer: Enabling holistic deep learning compiler optimizations with rtasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 881–897. USENIX Association, November 2020.
[48] Haichen Shen, Jared Roesch, Zhi Chen, Wei Chen, Yong Wu, Mu Li, Vin Sharma, Zachary Tatlock, and Yida Wang. Nimble: Efficiently compiling dynamic neural networks for model inference. In Alex Smola, Alex Dimakis, and Ion Stoica, editors, Proceedings of Machine Learning and Systems 2021, MLSys 2021, virtual, April 5-9, 2021. mlsys.org, 2021.
[49] Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. Welder: Scheduling deep learning memory access via tile-graph. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 701–718, Boston, MA, July 2023. USENIX Association.
[50] Muthian Sivathanu, Tapan Chugh, Sanjay S Singapuram, and Lidong Zhou. Astra: Exploiting predictability to optimize deep learning. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 909–923, 2019.
[51] Neil C Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F Manso. The computational limits of deep learning. arXiv preprint arXiv:2007.05558, 2020.
[52] Neil C Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F Manso. The computational limits of deep learning. arXiv preprint arXiv:2007.05558, 2020.
[53] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an inter- …
… Euro-Par 2020: Parallel Processing, pages 219–233, Cham, 2020. Springer International Publishing.
[61] Michael E Wolf and Monica S Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation, pages 30–44, 1991.
[62] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Huggingface's transformers: State-of-the-art natural language processing, 2020.
[63] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision, 2020.
[64] Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, and Yibo Zhu. Bolt: Bridging the gap between auto-tuners and hardware-native performance. In Diana Marculescu, Yuejie Chi, and Carole-Jean Wu, editors, Proceedings of Machine Learning and Systems 2022, MLSys 2022, Santa Clara, CA, USA, August 29 - September 1, 2022. mlsys.org, 2022.
… ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, page 1233–1248, New York, NY, USA, 2021. Association for Computing Machinery.
[67] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. Ansor: Generating High-Performance tensor programs for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 863–879. USENIX Association, November 2020.
[68] Size Zheng, Siyuan Chen, Peidi Song, Renze Chen, Xiuhong Li, Shengen Yan, Dahua Lin, Jingwen Leng, and Yun Liang. Chimera: An analytical optimizing framework for effective compute-intensive operators fusion. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1113–1126. IEEE, 2023.
[69] Zhen Zheng, Zaifeng Pan, Dalin Wang, Kai Zhu, Wenyi Zhao, Tianyou Guo, Xiafei Qiu, Minmin Sun, Junjie Bai, Feng Zhang, Xiaoyong Du, Jidong Zhai, and Wei Lin. Bladedisc: Optimizing dynamic shape machine learning workloads via compiler approach. Proc. ACM Manag. Data, 1(3), nov 2023.
[70] Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, Shuaiwen Leon Song, and Wei Lin. Astitch: Enabling a new multi- …