0% found this document useful (0 votes)
33 views

SpaceFusion - Advanced Deep Learning Operator

SpaceFusion is an advanced deep learning operator fusion scheduler that utilizes a novel Space-Mapping Graph (SMG) to efficiently model dependencies in tensor computations. The system introduces spatial and temporal slicers to create optimized fusion schedules tailored to specific hardware configurations, achieving significant performance improvements over existing methods. Evaluations demonstrate that SpaceFusion can provide up to 8.79x speedup compared to baseline implementations for Transformer models.

Uploaded by

Nicolas Henrique
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
33 views

SpaceFusion - Advanced Deep Learning Operator

SpaceFusion is an advanced deep learning operator fusion scheduler that utilizes a novel Space-Mapping Graph (SMG) to efficiently model dependencies in tensor computations. The system introduces spatial and temporal slicers to create optimized fusion schedules tailored to specific hardware configurations, achieving significant performance improvements over existing methods. Evaluations demonstrate that SpaceFusion can provide up to 8.79x speedup compared to baseline implementations for Transformer models.

Uploaded by

Nicolas Henrique
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

SpaceFusion: Advanced Deep Learning Operator

Fusion via Space-Mapping Graph


Liang Zhu Jianguo Yao∗ Haibing Guan
Shanghai Jiao Tong University Shanghai Jiao Tong University Shanghai Jiao Tong University
Shanghai, China Shanghai, China Shanghai, China
[email protected] [email protected] [email protected]
Out [m, n]
Abstract Out = GEMM(Div,Value)
Div [m, 1] … Div [m, l] … Div [m, L] Value [m, 1] … Value [m, L]
This work proposes SpaceFusion, an advanced scheduler for … … Exp [m, l] Sum [m] … …
Div = div(Exp,Sum)

Softmax
Sum = sum(Exp,dim=0)
efficient deep learning operator fusion. First, we develop Exp [m, 1] … Exp [m, l] … Exp [m, L]
Exp = exp(QK - Max)
a novel abstraction, the Space-Mapping Graph (SMG), to … … QK [m, l] Max [m] … …
Max = max(QK,dim=0)
holistically model the spatial information of both inter- and QK [m, 1] … QK [m, l] … QK [m, L]
QK = GEMM(Query,Key)
intra-operator dependencies. Subsequently, we introduce the … … … Query [m, 1] … Query [m, K] Key [l, 1] … Key [l, K] … … …

spatial and temporal slicers to decompose the fused spaces


defined in SMGs, generating fusion schedules by analyzing Figure 1. Complex Data Dependencies in MHA. Left: the
and transforming dependencies. Finally, we present auto- dependency graph of a single element in the output tensor.

Downloaded from the ACM Digital Library on April 13, 2025.


scheduling methods that use the slicers to automatically cre- Right: the simplified computational definitions.
ate high-performance fusion schedules tailored to specific
hardware resource configurations. End-to-end performance
evaluations reveal that SpaceFusion achieves up to 8.79x
speedup (3.54x on average) over baseline implementations 1 Introduction
from Huggingface for Transformer models, and a maximum
Tensor computations defined by deep learning models mani-
of 2.21x speedup compared to the state-of-the-art manually-
fest two trends: increasing computational scale and height-
tuned implementations powered by FlashAttention.
ened computational complexity[27, 28, 37, 40, 51, 52], making
CCS Concepts: • Computing methodologies → Machine it more expensive and challenging to deploy them as efficient,
learning; Parallel computing methodologies. rapid-response inference services, whether on cloud infras-
tructures or local devices[43, 44]. Increasing computational
Keywords: Machine Learning, Operator Fusion, Deep Learn- scale necessitates the allocation of more memory resources
ing Compilation, for larger model weights, input data, output data, and inter-
mediate data, resulting in higher memory I/O overhead and
ACM Reference Format: increased memory usage. Higher computational complex-
Liang Zhu, Jianguo Yao, and Haibing Guan. 2025. SpaceFusion: ity, primarily arising from more non-element-wise opera-
Advanced Deep Learning Operator Fusion via Space-Mapping Graph tors such as Generalized Matrix Multiplication (GEMM) and
. In Twentieth European Conference on Computer Systems (EuroSys
Softmax[5], escalates the number of computation cycles and
’25), March 30–April 3, 2025, Rotterdam, Netherlands. ACM, New
York, NY, USA, 16 pages. https://doi.org/10.1145/3689031.3696087
introduces more intricate data dependencies. Operator fu-
sion techniques (also known as kernel fusion or layer fusion),
mitigate the issue brought by larger computational scale via
∗ Corresponding author. caching intermediate computation results in storage medi-
ums closer to the compute units, enabling their on-chip reuse.
This enables faster responses for inference services while
Permission to make digital or hard copies of all or part of this work for reducing memory costs. However, operator fusion faces chal-
personal or classroom use is granted without fee provided that copies lenges in fusing non-element-wise operators[43, 70].
are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights One of the main reasons is the tradeoff between inter-block
for components of this work owned by others than the author(s) must dependency-free parallelism and intra-block data locality[61].
be honored. Abstracting with credit is permitted. To copy otherwise, or In a typical GPU architecture featuring a memory hierarchy[41],
republish, to post on servers or to redistribute to lists, requires prior specific different thread blocks are restricted to exchange data through
permission and/or a fee. Request permissions from [email protected]. global memory (off-chip, slower, greater in capacity, visible
EuroSys ’25, March 30–April 3, 2025, Rotterdam, Netherlands
to all thread blocks). A fusion aiming to cache intermediate
© 2025 Copyright held by the owner/author(s). Publication rights licensed
to ACM. results in shared memory (on-chip, faster, smaller in capac-
ACM ISBN 979-8-4007-1196-1/25/03. . . $15.00 ity, visible within a thread block) requires no dependency
https://doi.org/10.1145/3689031.3696087 between scheduled thread blocks for parallelization in the

787
EuroSys ’25, March 30–April 3, 2025, Rotterdam, Netherlands Liang Zhu, Jianguo Yao, and Haibing Guan

GPU architecture[26]. For example, the Multi-Head Atten- into multiple parallel-executed SMG blocks, each indepen-
tion (MHA)[58] contains three non-element-wise operators, dent of the others. The temporal slicer (Section 4.3) slices
including two GEMMs and a Softmax. Figure 1 illustrates the an SMG block into multiple serially-executed intra-blocks
dependencies from a single element of the output tensor to to reduce the on-chip memory footprint, and to exploit the
the other tensor elements. This complex dependency pattern possible optimization opportunities. Finally, we design auto-
exhibits (1) deeply nested dependencies between elements, scheduling methods (Section 5) for SMG. The resource-aware
layer by layer, and (2) wide dependency ranges covering the slicing method (Section 5.1) leverages spatial/temporal slicers
whole range of a tensor dimension. As a result, scheduling to automate the generation of fusion schedules that adhere
fused MHA into thread blocks while naively respecting the to hardware resource constraints. The SMG partition method
wide and nested dependency pattern leads to low intra-block (Section 5.2) partitions unschedulable SMGs due to hard-
data locality, resulting in intra-block inefficiency or even ware resource constraints into smaller SMGs and resubmits
fusion failures. them to the slicing method. We identify the optimal schedule
Existing approaches to operator fusion are classified into within the generated search space.
two manually-tuned fusion and auto-tuned fusion. Manually- Evaluations reveal SpaceFusion’s ability to exploit hid-
tuned fusion[11, 22, 30, 31, 65] is the dominant approach to den fusion opportunities in intricate dependencies and de-
fusing non-element-wise operators. These methods target liver performance benefits across diverse GPU architectures.
specific tensor computations. Specific dataflow and paral- For subgraph performance, SpaceFusion achieves a max-
lelism strategies are cleverly designed by domain experts to imum speedup of 10.35x (with an average of 5.31x) over
achieve fusions with complex dependencies. Unfortunately, manually-tuned unfused baseline, and up to 4.03x speedup

Downloaded from the ACM Digital Library on April 13, 2025.


manually-tuned fusion suffers from two key limitations: lim- compared to manually-tuned fused libraries. For end-to-end
ited generality and high development cost. performance, SpaceFusion achieves a maximum speedup
Auto-tuned fusion provides a promising alternative for of 8.79x (with an average of 3.54x) to baseline implementa-
its generality and cost-effectiveness. However, fusing non- tions by Huggingface[62] for Transformer models, and up to
element-wise operators is beyond the capabilities of most 2.21x speedup to the state-of-the-art (SOTA) manually-tuned
existing deep learning (DL) frameworks and compilation implementation[31] powered by FlashAttention[22].
systems[4, 6, 29, 34, 43, 57, 66, 67]. The majority of them[6, 29,
34, 43, 67] rely on high-level abstractions, i.e., graph-based ab-
stractions like dataflow graph (DFG), targeting element-wise 2 Background and Prior Knowledge
operator fusion. These abstractions fail to provide the ab- This section discusses the background and prerequisite knowl-
straction of intra-operator dependencies and hence are inca- edge of operators for tensor computation in DL models. Fur-
pable of tackling complex dependencies defined by operators. ther, we explore the complex dependency patterns in the
Other approaches[4, 57, 66], using low-level abstractions tensor computations defined by an individual operator and
such as polyhedral models, also face limitations in efficiently multiple operators.
identifying optimal operator fusion solutions due to the vast Operators for tensor computation encompass element-
search space defined by fine-grained abstractions. Recent wise and non-element-wise operators. Element-wise opera-
works[49, 68, 70] have achieved breakthroughs in fusions of a tors, characterized by straightforward dependency patterns,
broader range of operators, but problems remain. AStitch[70] can be simply fused with other operators, e.g., by in-place
uses rule-based approaches to fuse memory-intensive (MI) inlining[6]. Non-element-wise operators, such as GEMM,
operators. Chimera[68] devises a purely analysis-based ap- Softmax, ReduceMean, and element-wise with broadcast,
proach to fuse compute-intensive (CI) operators. Neither ad- constitute the primary source of complex dependencies[43].
dresses the fusion of both MI and CI operators. Welder[49] Dependencies in a single operator. Dependencies in
leverages the tile-graph abstraction for finer-grained sched- most non-element-wise operators can be further broken
ules (tile-level) of inter-operator intermediate tensors within down. For example in GEMM, given a matrix multiplication
the memory hierarchy. However, the inability to abstract 𝐶𝑀 ×𝑁 = 𝐴𝑀 ×𝐾 𝐵 𝑁 ×𝐾 , a single element of the output tensor
Í𝐾
intra-operator dependencies limits its capacity to construct is defined as 𝑐𝑚𝑛 = 𝑘=1 𝑎𝑚𝑘 𝑏𝑛𝑘 , where 𝑎, 𝑏 and 𝑐 are the ele-
a holistic optimization space. ments in tensor 𝐴, 𝐵 and 𝐶, respectively. For each 𝑛 ∈ [1, 𝑁 ],
In this work, we present SpaceFusion, an advanced sched- a single element 𝑎𝑚𝑘 is used 𝑁 times for computing the in-
uler for efficient operator fusion. The key design is a spatial termediate elements 𝑎𝑚𝑘 𝑏 1𝑘 , 𝑎𝑚𝑘 𝑏 2𝑘 , ..., and 𝑎𝑚𝑘 𝑏 𝑁 𝑘 , which
abstraction termed Space-Mapping Graph (SMG) (Section means one input element 𝑎𝑚𝑘 is required by 𝑁 intermediate
4.1). Despite being a lightweight graph-based approach, the elements along the whole range of 𝑁 . We distinguish this
SMG abstraction effectively captures complex dependencies dependency pattern as One-to-All. A similar One-to-All de-
in multiple operators, enabling the construction of a holistic pendency exists between 𝑏𝑛𝑘 and the intermediate elements
optimization space. Then we devise two slicing methods to for each 𝑚 ∈ [1, 𝑀]. Then, an output element 𝑐𝑚𝑛 is calcu-
schedule SMGs. The spatial slicer (Section 4.2) slices an SMG lated via summing 𝐾 intermediate elements 𝑎𝑚1𝑏𝑛1 , 𝑎𝑚2𝑏𝑛2 ,

788
SpaceFusion: Advanced Deep Learning Operator Fusion via Space-Mapping Graph EuroSys ’25, March 30–April 3, 2025, Rotterdam, Netherlands

Table 1. Decoupled Dependencies in Representative 1


TileM 2 1 3 2 45

Operators. ⃝, △, and × denote the presence, potential pres- M Softmax M M GEMM M


ence, and absence of the dependencies, respectively. K (a) Softmax Tile K K (b) GEMM Tile N
1 2 3
TileMalign
Representative Operators One-to-One One-to-All All-to-One Softmax GEMM

GEMM × ⃝ ⃝ (c) Fused Softmax-GEMM via Simple Shape Alignment.


Einsum △ △ △ 2 to 3 is scheduled within a single GEMM block.
Softmax, LayerNorm, BatchNorm ⃝ ⃝ ⃝ 1 4 2 5 3 67
ReduceMax, ReduceMean × × ⃝ Softmax GEMM
Element-wise w/ broadcast ⃝ ⃝ ×
(d) Better Intra-block Locality Schedule.
Intra-op dependency analysis and transformation needed.

Table 2. A Comparative Analysis of Representative


Works for Operator Fusion. Abstraction capabilities for Figure 2. An example of Softmax-GEMM fusion. (a) and
fusion (perceiving inter-operator and intra-operator depen- (b) depict input-output tile mappings in Softmax and GEMM.
dencies), and fusion schedule capabilities (memory access (c) is a fusion via shape alignment of the intermediate tiles.
transformation, dependency transformation, memory hierar- (d) is a fusion with better locality. Serial numbers in each
chy scheduling, and hardware resource awareness) are listed. sub-figure indicate the order in which tiles were accessed.

Abstraction Capability Schedule Capability

Downloaded from the ACM Digital Library on April 13, 2025.


Year Work Inter-op Intra-op Mem Dep Mem HW Challenge 1: Identifying a suitable abstraction granularity
Dep Dep Trans Trans Hier Rsrc
that is both effective and efficient for operator fusion problems.
2020 Ansor[67] Tensor-wise × ⃝ × × ×
2021 DNNFusion[43] Tensor-wise × ⃝ × × × There are two main levels of abstraction for tensor com-
2022 AStitch[70] Tensor-wise × ⃝ × ⃝ ⃝ putations, high-level and low-level abstractions. High-level
2023 Welder[49] Tile-wise × ⃝ × ⃝ ⃝
2024 SpaceFusion Space-wise ⃝ ⃝ ⃝ ⃝ ⃝ abstractions, i.e., graph-based abstractions such as dataflow
graphs (DFGs), represent tensor computations at the oper-
ator level, where nodes denote operators and edges denote
tensor-wise inter-operator dataflow dependencies. Most DL
..., and 𝑎𝑚𝐾 𝑏𝑛𝐾 along the whole range of 𝐾. We distinguish frameworks and compilers[1, 6, 7, 29, 34, 43, 45, 67] fuse
this pattern as All-to-One. Therefore, the dependencies in operators via high-level abstractions at the graph optimiza-
GEMM can be further decoupled into One-to-All and All- tion stage of compilation, enabling easy fusion decisions
to-One dependencies. Table 1 shows decoupled dependen- through rule-based methods. However, these abstractions,
cies may exist in representative non-element-wise operators, constrained by the dependency barriers between operators,
where One-to-One indicates element-wise dependencies. where intra-operator dependencies are inaccessible at the
Complex dependencies in multiple operators. Depen- inter-operator level, are ineffective in handling situations
dencies in multiple operators are more complex, for the com- involving complex dependencies within operators.
plex dependencies that appear in each operator are nested Low-level abstractions work in a finer granularity, such
and combined, always forming a wide and deep dependency as Halide-based intermediate representation (IR)[47] with
tree, as Figure 1 corroborates. Given a simplified computa- the scheduling languages (at the function/loop level), and
tional definition of MHA shown on the right side of Figure the polyhedral model[25] (at the statement/iteration level).
1, a single element of the output tensor depends directly or These abstractions eliminate operator dependency barriers
indirectly on (2𝐿𝐾 + 4𝐾 + 2) elements from 8 tensors with through loop-level primitives, such as loop fusion. However,
6 layers nested dependencies containing 6 One-to-Alls and low-level abstractions are overly fine-grained for operator
4 All-to-Ones, where 𝐿 and 𝐾 indicate sequence length and fusion problems. They capture irrelevant or redundant in-
feature dimension, respectively. This kind of complex de- formation for fusion problems. Furthermore, the solution
pendency pattern exhibits (1) deeply nested dependencies space grows with the number of statements and the size of
between elements, layer by layer, and (2) wide dependency iteration domains in tensor programs. These problems ren-
ranges covering the whole range of a tensor dimension. der low-level abstractions inefficient for solving the operator
fusion problem, hindering the design and implementation of
3 Major Challenges methodologies for the optimal solution. Therefore, we need
We argue two challenges in generating efficient schedules for to identify an appropriate abstraction granularity between
fusing operators with complex dependencies discussed above. high-level and low-level abstractions, such that it is both
Table 2 compares representative works on operator fusion effective and efficient in solving the operator fusion problem.
to present methodology differences for dealing with fusion Challenge 2: To devise an abstraction method to capture,
problems and illustrate the challenges comprehensively. analyze, and schedule dependencies in operators holistically.

789
EuroSys ’25, March 30–April 3, 2025, Rotterdam, Netherlands Liang Zhu, Jianguo Yao, and Haibing Guan

Recent compilers employ tile-level abstractions for fusion. Input Data Space Output Data Space Iteration Space All-to-One One-to-All
# Data Spaces by Tensors Dim1:N Data Space: Query(M,-,K) Key(-,N,K)
The tile-graph in Welder[49] refines operator dependencies Query = tensor(M,K) Dim0:K
Dim2:M Key (-,N,K)
O2A(dim=1) O2A(dim=2)
Key = tensor(N,K)
to tile-wise granularity and stitches intermediate tiles via QK = Zeros(M,N)
Iteration
Space: GEMM(M,N,K)
# Iteration Space of GEMM GEMM
shape alignment, allowing precise scheduling within the for (m,n,k) in Grid(M,N,K):
Data Space:
Query (M,N,K) A2O(dim=0)
QK[m,n] += (M,-,K)
memory hierarchy. However, the intra-operator dependen- Query[m,k] * Key[n,k]
Data Space:
QK(M,N,-) QK(M,N,-)
cies remain unabstracted, replaced by the shape mappings (a) Computational Definition (b) Visualized SMG (c) SMG
from input to output tiles in an operator. Note that this com-
pressed information discards the original dependencies. Fig- Figure 3. The SMG for a Single Operator GEMM. (a) the
ures 2 (a) and (b) illustrate the tile shape mappings between definition of 𝑄𝐾 = 𝐺𝐸𝑀𝑀 (𝑄𝑢𝑒𝑟𝑦, 𝐾𝑒𝑦), (b) the visualized
the input and output tiles of Softmax and GEMM. Stemming SMG for GEMM, and (c) the SMG for GEMM are shown.
from global reductions along the 𝐾 axis, each output element
in Softmax 𝑜𝑢𝑡 [𝑚, 𝑘] incorporates all input elements within
its corresponding row 𝑖𝑛[𝑚, 0:𝐾]. Figure 2 (c) illustrates the the orange and purple nodes representing data spaces that
shape alignment of the output tile of Softmax with the input correspond to the 𝑄𝑢𝑒𝑟𝑦, 𝐾𝑒𝑦, and 𝑄𝐾 tensors mentioned in
tile of GEMM. This fusion schedules the GEMM tiles that Figure 3 (a). The node 𝐺𝐸𝑀𝑀 in black is the iteration space
could have been executed serially to be computed in a larger defined by the loop structure in Figure 3 (a).
single GEMM block, resulting in a poor intra-block locality Space Mappings abstract One-to-One, One-to-All, and
in Softmax and GEMM (e.g. the 16x256 Softmax block and All-to-One mapping relations1 between computational spaces
the 16x64x256 GEMM block for 𝑇 𝑖𝑙𝑒𝑀𝑎𝑙𝑖𝑔𝑛 =16, 𝐾=256 and as is discussed in Section 2, possessing their own geomet-

Downloaded from the ACM Digital Library on April 13, 2025.


𝑁 =64), and even fusion failures (e.g. for 𝐾=1024, the 16x1024 ric directions in the SMG. Intra-operator mappings define
intermediate tiles may not fit in the limited shared memory). relations between data and iteration spaces within a sin-
Figure 2 (d) illustrates a fusion schedule with enhanced gle operator, effectively encapsulating read and write de-
intra-block locality, featuring a new tile order and superior pendencies. Inter-operator mappings connect one operator’s
tile shapes (e.g. a 64x64x64 GEMM block) with memory over- output to another’s input, modeling the dataflow between
lapping among tiles of different colors (e.g., ❶ and ❹, ❷ and them. Inter-operator mappings are always One-to-One. For
❺), facilitating the Softmax-GEMM fusion for larger 𝐾 values. example again in Figure 3 (c), the directed edges in green
Generating this schedule for the desired tile order requires represent the One-to-All mapping. The One-to-All mapping
holistic dependency abstraction, analysis, and transforma- 𝑄𝑢𝑒𝑟𝑦 → 𝐺𝐸𝑀𝑀 denotes that data space 𝑄𝑢𝑒𝑟𝑦 is reused
tion of the reductions in both Softmax and GEMM. Therefore, by iteration space 𝐺𝐸𝑀𝑀 along the second dimension, thus,
it requires an abstraction method that encompasses multiple possessing a distinct direction vector along the second di-
operators in an optimization space, capturing and analyzing mension (dim=1). The red-directed edge is an All-to-One
both intra- and inter-operator dependencies, and transform- mapping 𝐺𝐸𝑀𝑀 → 𝑄𝐾, denoting the ReduceSum in itera-
ing them as needed to pursue a globally optimal solution. tion space 𝐺𝐸𝑀𝑀, with the direction vector along the first
dimension (dim=0). The visualized SMG in Figure 3 (b) better
4 Abstraction Methods demonstrates the geometric space concept of mappings.
In this section, we introduce a graph-based abstraction termed Compared to traditional DFGs, SMG introduces three key
Space-Mapping Graph (SMG). The following subsections pro- differences: (1) SMG incorporates dimensional information
vide a detailed introduction to SMG and explain how to slice into nodes, constructing them as geometric spaces. This is
SMGs for scheduling parallelization and serialization. fundamental to SMG’s ability to preserve critical informa-
tion. (2) SMG abstracts the iteration space. For an operator,
the iteration spaces act as intermediate nodes, decoupling
4.1 Space-Mapping Graph
direct dependencies between input and output nodes into
A Space-Mapping Graph (SMG) retains the basic structure of multiple indirect dependency mappings. This design is par-
graph-based abstraction, where the Computational Spaces are ticularly suited for representing operators where the input
the nodes of the graph and the Space Mappings are directed and output spaces are not in One-to-One mappings. (3) SMG
edges connecting the nodes. categorizes the decoupled dependency patterns into three
Computational Spaces are conceptualized into two dis- types of space mappings. These space mappings enhance
tinct categories: (1) Data spaces abstract tensors, including in- SMG’s expressiveness in handling complex dependencies
put tensors, output tensors, intermediate tensors, and weight and provide the necessary information for holistic analysis
tensors employed in tensor computations; (2) Iteration spaces and scheduling of dependencies.
model the nested loop structures of defined computations.
For example, Figure 3 shows the computational definition 1 Thiswork focuses on globally-ranged mappings as discussed in Section 2.
of matrix multiplication and its corresponding SMG. As is Fusions of partially-ranged mappings, such as 2D convolution fusion[49, 60],
shown in Figure 3 (c), this SMG comprises four nodes, with are not being discussed here.

790
SpaceFusion: Advanced Deep Learning Operator Fusion via Space-Mapping Graph EuroSys ’25, March 30–April 3, 2025, Rotterdam, Netherlands

Query(M,-,K) Key(-,N,K) QK(M,N)


A2O
Query(M,-,K)
O2A(dim=1)
Key(-,N,K) Table 3. Slicer Applications for Mappings in the Di-
(dim=0) O2A(dim=2)

O2A(dim=1) O2A(dim=2)
O2O
O2A
Max(M,-) GEMM(M,N,K) mension. ⃝ and × respectively denote the application and
Sub(M,N) (dim=0)
A2O(dim=0)

GEMM(M,N,K) O2O QK(M,N,-)


A2O non-application of slicers, while △ signifies the necessity for
O2O A2O
(dim=1)
Max(M,-,-)
A2O(dim=0) Exp(M,N)
O2O
(dim=0)
Sum(M,-)
O2O

Sub(M,N,-)
O2A further analysis to determine whether slicing is warranted.
(dim=1)
O2A
QK(M,N,-) Div(M,N) (dim=0)
O2O
A2O
Exp(M,N,-)
❶ SMG for GEMM ❷ SMG for Softmax O2O
(dim=1)
Sum(M,-,-) Mappings in the Dimension Spatial Slicer Temporal Slicer
in a 3-Dim Space(M,N,K). in a 2-Dim Space(M,N). Div(M,N,-)
O2A
(dim=1)
None ⃝ ⃝
❸ GEMM → Softmax: Connecting with One-to-One ❹ Fused SMG One-to-One × ×
to Construct a Fused Space. in a 3-Dim Space(M,N,K).
Input One-to-All ⃝ ⃝
Other One-to-All × ⃝
Figure 4. Connecting SMGs to Construct a New SMG Independent All-to-One(s) × ⃝
Dependent All-to-Ones × △
for the Fused Computational Space.
Dim1 parallel_for Block in SMG_Blocks:
Input Data Space Output Data Space All-to-One One-to-All Dim0
Dim2 # load from off-chip mem
Intermediate Data Space Iteration Space One-to-One Q_block, K = load(…), load(…)
Dim1 Query(M,-,K) Key(-,N,K) QK = GEMM(Q_block, K)
Query Key Space:Key Max = max(QK, dim=0)
O2A(dim=1)
Dim0 (-,N,K)
O2A(dim=2)
Sub = QK - Max
Dim2 GEMM1(M,N,K) SMG
GEMM Exp = exp(Sub)
Space: A2O(dim=0) Block 1
Space:Query Sum = sum(Exp, dim=0)
max (M,-,K) GEMM1 QK(M,N,-)
A2O
Div = div(Exp, Sum)
(dim=1) SMG
sub (M,N,K) O2O Max(M,-,-) Block 2 # load from off-chip mem
Space:Value O2A
V = load(…)
Sub(M,N,-)
Softmax

(-,N,K) (dim=1) Spatial Slicer


exp Space: O2O SMG Out = GEMM(Div, V)

Downloaded from the ACM Digital Library on April 13, 2025.


Out Space: A2O Block 3 store(Out)
Exp(M,N,-)
sum (M,-,K) QKSub (dim=1)
O2O Sum(M,-,-)
div Div(M,N,-)
O2A
(dim=1)
Space:
Value ExpDiv
O2A(dim=0) Value(-,N,K)
O2A(dim=2)
Figure 6. Spatial Slicers Applied along Dim2.
GEMM GEMM2(M,N,K)
Space: A2O(dim=1)
Out GEMM2(M,N,K)
Out(M,-,K)

(a) DFG (b) Visualized SMG (c) SMG boxes in Figure 5 (b) and (c). As is discussed in Section 2, a
MHA computation involves three non-element-wise oper-
Figure 5. An SMG Example for Multi-Head Attention. ators with 6 One-to-Alls and 4 All-to-Ones. The visualized
SMG in Figure 5 (b) depicts these 10 mappings and their
geometric position information. The 4 red arrows represent
Building a fused SMG via connecting SMGs of sin- the 4 All-to-Ones resulting from the reduction operations
gle operators with intermediate data space dimension in MHA. The first originates from GEMM1, the second and
alignment: Figure 4 depicts the process of constructing a third originate from Softmax, and the fourth originates from
fused space for two individual operators, GEMM and Soft- GEMM2. It is evident that the last three of these four All-to-
max, resulting in a fused SMG. ❶ SMG for GEMM defines a Ones are geometrically parallel, with their directions aligned
3-dim space. ❷ SMG for Softmax defines a 2-dim space. ❸ along Dim1, while the first All-to-One from GEMM1, towards
GEMM’s output data space 𝑄𝐾 (in purple) and Softmax’s Dim0, is orthogonal to the last three All-to-Ones.
input data space 𝑄𝐾 (in orange) are connected with a One-
to-One mapping. ❹ A fused SMG is constructed via fusing 4.2 Spatial Slicer: Exploiting SMG Parallelization
𝑄𝐾 (𝑀, 𝑁 , −) → 𝑄𝐾 (𝑀, 𝑁 ) into a single intermediate data
A spatial slicer slices an SMG along given dimensions into
space 𝑄𝐾 (𝑀, 𝑁 , −) with dimension alignment, where "−" is
multiple independent, parallelizable SMG blocks. Each SMG
a dimension placeholder for 𝐾 in GEMM. The constructed
block is to be mapped to a thread block in GPUs. To prevent
SMG defines a new 3-dim space. Employing this method, we
dataflow dependencies between SMG blocks generated after
can establish a multi-operator fusion space wherein One-to-
spatial slicing, the spatial slicer generally avoids orthogonal
One mappings construct the main dataflow across operators.
mappings. Table 3 encapsulates the varied decisions made
An illustrative example of SMG. In Figure 5, (a) DFG
when applying slicers to slice an SMG along given dimen-
of the simplified MHA, (b) visualized SMG, and (c) the corre-
sions with the presence of diverse mappings. Specifically,
sponding SMG are presented. The computation of MHA takes
spatial slicers do not slice any mappings except for the input
place within a five-dimensional fused computational space
One-to-All. The source of an input One-to-All is an input
(BatchDim, HeadDim, Dim2, Dim1, Dim0). We focus our dis-
data space abstracting the input of a kernel function, which
cussion on the three-dimensional subspace2 composed of
is stored in global memory and is visible to all thread blocks.
the final three dimensions (Dim2, Dim1, Dim0). The visual-
Consequently, slicing the input One-to-All does not induce
ized SMG fuses some of the spaces connected with One-to-
flow dependencies between the resulting blocks.
One mappings for visual simplicity, as is depicted in dotted
An illustrative example of the spatial slicer on MHA
2 BatchDim and HeadDim are not involved to enhance the visual represen- is given. The unsliced SMG depicted in Figure 5 contains
tation. Both are dimensions without dependencies on the others. mappings across all three dimensions. Among them, Dim2 is

791
EuroSys ’25, March 30–April 3, 2025, Rotterdam, Netherlands Liang Zhu, Jianguo Yao, and Haibing Guan

parallel_for Block in SMG_Blocks: qk qk qk qk Aggregation qk Update Path


Q.block K.block Q_block = load(…) Needed
… r.max r.max r.max r.max r.max
for block in blocks:
for Intra_Block in Block:
K_block = load(…) Max Max Max Max Max
GEMM b.sub b.sub e.exp e.exp e.exp e.exp e.exp e.exp
QK = GEMM(Q_block,K_block)

Postpositing
max Max.old Max = max(QK, dim=0)
Max = aggrMax(Max_old, e.exp e.exp r.sum r.sum r.sum
Temporal Slicer aggr.max v v v
Dim1 Max) Update Path
sub Simple Aggregate Sub = QK - Max
Dim0 r.sum r.sum r.dot r.dot r.dot
Dim2 Exp = exp(Sub)
exp v Tag1 Tag2 Tag1
Sum = sum(Exp, dim=0) Sum Tag2
sum Sum.old Sum = aggrSum( b.div r.dot s.div s.div s.div s.div s.div s.div

Postpositing
updateSum(Sum_old), Sum
aggr.sum
Sum) v Sum Sum Sum
div Update then Aggregate
V.block Div = div(Exp, Sum) r.dot s.div s.div s.div s.div
V_block = load(…) Out Out Out Out
GEMM Out = GEMM(Div, V_block) out out out out out
Intra- Intra- Intra- Out.old Out = aggrSum(
Block 1 Block 2 Block 3 updateOut(Out_old), (a) Original DFG (b) b.div Postposited (c) Resulted DFG (d) Aggregation Hints and
aggr.sum Out) Postpositing b.div Postpositing b.sub Postpositing Update Paths for Out and Sum
SMG Block Update then Aggregate Needed Needed Finished
… Max_old, Sum_old, Out_old
def updateSum(Sum_old): # Update function for Sum (e) Update Functions
= Max, Sum, Out return Sum_old * exp(Max_old)/exp(Max)
store(Out)
Generated via
Out def updateOut(Out_old): # Update function for Out Update Paths
return Out_old * Sum_old/Sum * exp(Max_old)/exp(Max)

Figure 7. Temporal Slicer Applied along Dim1.


Figure 8. Update Functions Generation. For better illus-
tration, figures show vector-wise DFG with both input qk
the only dimension eligible for being spatial sliced, as solely and v as vectors. Reduction, broadcast, element-wise, and
an input One-to-All resides within Dim2. As shown in Figure scalar computations are marked as r., b., e., and s..

Downloaded from the ACM Digital Library on April 13, 2025.


6, the spatial slicer, orthogonal to Dim2, slices the SMG into
three distinct SMG blocks. The resulting SMG blocks are
dependency-free, allowing for parallel scheduling. yields incorrect results. To address this issue, we propose
the Update then Aggregate (UTA) approach. The core idea is
4.3 Temporal Slicer: Exploring SMG Serialization to recursively update intermediate results that a final result
A temporal slicer attempts to partition an SMG block into depends on before aggregating.
intra-blocks that are executed serially and exhibit similar An illustrative example of the temporal slicer with
computational behaviors, to reduce the on-chip memory Update then Aggregate. We start with an SMG block sliced
footprint of the SMG block while maintaining dependency in Figure 6 to illustrate the temporal slicer with UTA ap-
satisfaction inside the block. Given that the lifespans of most plied. As is shown in Figure 7, a temporal slicer slices the
intermediate variables are confined within their respective SMG block along Dim1, resulting in three intra-blocks. Three
intra-blocks, during the sequential execution of intra-blocks, parallel All-to-Ones are sliced, derived from max, sum, and
the later intra-block effectively reuses the on-chip memory the second GEMM in the DFG. All of them operate reduc-
space allocated to the intermediate variables of the previous tions, namely ReduceMax, ReduceSum, and Dot. They form
intra-block. This strategy leads to a significant reduction in a dependency chain of max ← sum ← GEMM. As a result,
the on-chip memory footprint of an SMG block. the temporal slicer attempts to apply UTA for slicing depen-
Temporal slicers can slice a broader range of mappings dent All-to-Ones. Given that a temporal slicer partitions an
within SMG blocks. First, as is shown in Table 3, temporal SMG block into 𝐼𝑛𝑡𝑟𝑎𝐵𝑙𝑜𝑐𝑘 0, 𝐼𝑛𝑡𝑟𝑎𝐵𝑙𝑜𝑐𝑘 1, ..., 𝐼𝑛𝑡𝑟𝑎𝐵𝑙𝑜𝑐𝑘 𝑁 −1
slicers are capable of slicing all One-to-Alls. The slicing of sequentially executed intra-blocks. UTA (the second and
a One-to-All within an SMG block results in a collection of third dashed boxes in Figure 7) differs from SA (the first
One-to-Alls distributed across a series of intra-blocks, all dashed box in Figure 7) in that UTA calls Update Functions
originating from the same source space. Consequently, these (namely updateSum and updateOut in Figure 7) to update
intra-blocks access the same block of the on-chip memory the old reduction results from previous 𝐼𝑛𝑡𝑟𝑎𝐵𝑙𝑜𝑐𝑘𝑖 −1 before
data space. Second, temporal slicers can slice either a single aggregating them with the current results from 𝐼𝑛𝑡𝑟𝑎𝐵𝑙𝑜𝑐𝑘𝑖 ,
All-to-One or multiple independent All-to-Ones, which is where 𝑖 ∈ (0, 𝑁 ). The Update Functions utilize the newly
referred to as Independent All-to-One(s) in Table 3. Temporal aggregated results of the variables that the target variable de-
slicers slice each independent All-to-One into multiple local pends on to update the old target result. UTA ensures that the
All-to-Ones and then perform algebraic aggregation[24] on final results computed within the current 𝐼𝑛𝑡𝑟𝑎𝐵𝑙𝑜𝑐𝑘𝑖 are the
the results of each intra-block. We distinguish this method correct aggregate results from 𝐼𝑛𝑡𝑟𝑎𝐵𝑙𝑜𝑐𝑘 0 to 𝐼𝑛𝑡𝑟𝑎𝐵𝑙𝑜𝑐𝑘𝑖 .
as Simple Aggregate (SA). Third, temporal slicers are able to Update Functions. The key is to find the Update Func-
slice All-to-Ones with flow dependencies, which are referred tions automatically. Figure 8 shows vector-wise DFGs for
to as Dependent All-to-Ones in Table 3. In this case, multi- a partial MHA (the first GEMM removed), only compris-
ple geometrically parallel All-to-Ones form a dependency ing geometrically parallel All-to-Ones (from reductions) and
chain, where the computation of the latter depends on the One-to-Alls (from broadcasts) orthogonal to the temporal
results of the former. Performing SA on each All-to-One here slicer. First, from Figure 8 (a) to (c), Broadcast Postpositions are

792
SpaceFusion: Advanced Deep Learning Operator Fusion via Space-Mapping Graph EuroSys ’25, March 30–April 3, 2025, Rotterdam, Netherlands

performed following algebraic rules to identify the shortest Program-Preprocessing

programs
Program
dependency paths between reductions. The b.div in Figure 8

Tensor
Deep Learning Program Building

Sub-
(a) is deferred to b.div in Figure 8 (b). The b.sub in Figure 8 Model Partitioning SMG
(b) is propagated along two paths, resulting in the creation of Auto-Scheduling SMGs
two new s.div nodes in Figure 8 (c). Figure 8 (d) presents two Resource-Aware Slicing SMG Partitioning
subgraphs of the DFG in Figure 8 (c), showing the paths be-

SMGs
Blocks
Spatial Temporal SMG

SMG
tween reductions. It reveals that the output node out, tagged Slicing Slicing Partitioning

as Out in a red background, depends on two intermediate SMG Schedules


reduction results, tagged as Sum and Max, and a direct reduc- Auto-Tuning & Code Gen
tion result of the input qk, tagged as Tag1. Recursively, the
result Sum relies on the reduction result Max and an input Figure 9. System Overview for SpaceFusion.
reduction result tagged as Tag2. In Figure 8 (d), Aggregation
Hints (highlighted in blue) show the existence of the paths
between the input and the target reduction, suggesting that locality. Most of these subprograms are repetitive. SpaceFu-
the reduction needs to be aggregated. Update Paths (high- sion compiles the repetitive ones only once. For each unique
lighted in yellow) reveal computationally equivalent shortest subprogram, SpaceFusion connects the operators and con-
paths between reductions. Finally, Update Functions in Fig- structs an SMG via dimension alignment (Section 4.1). The
ure 8 (e) are generated by back-tracing the Update Paths, constructed SMG is then fed into the auto-scheduling phase.
and are inlined to the functions in Figure 7. Note that the Auto-scheduling. In the auto-scheduling phase, Space-

Downloaded from the ACM Digital Library on April 13, 2025.


modified dataflow in Figure 8 (c) is solely employed for UTA. Fusion operates in two states: slicing and partitioning. In
The original dataflow for the SMG block remains mostly the slicing state, SpaceFusion applies resource-aware slic-
unchanged, as shown in Figure 7. ing (Section 5.1) to a given SMG using spatial and temporal
Broadcast Postposition is a necessary analysis and trans- slicers (Section 4.2 and 4.3). SpaceFusion either generates
formation process when dealing with Dependent All-to- schedules for the SMG and the corresponding search space
Ones. When multiple parallel All-to-Ones form a dependency that satisfy the hardware resource constraints, or it indicates
chain, there must be One-to-All(s) interspersed between All- a scheduling failure. In the partitioning state (Section 5.2),
to-Ones to restore the original extent of the dimension that SpaceFusion partitions the unschedulable SMG into smaller
the previous one reduces so that the next All-to-One can SMGs, enabling them to re-enter the slicing state for further
continue reducing. The interleaving of One-to-All makes scheduling. SpaceFusion iterates between these two states
it difficult to capture the exact dependency relationship be- until all SMGs are assigned efficient schedules and search
tween All-to-Ones. The postposition of broadcasts abstracted spaces that respect hardware resource constraints. Finally,
in One-to-Alls leads to shorter minimal distances and more SpaceFusion passes the SMG schedules and the correspond-
transparent dependency relationships between the reduc- ing search spaces to the code generation and auto-tuning
tions abstracted in All-to-Ones. However, due to limitations modules for generating the optimal codes.
imposed by algebraic transformation rules, not all the All-to- In addition to the system design, this section introduces
One chains end up with simplification results. In this work, two representative detail designs and optimizations: candi-
the temporal slicer operates on the designated dimension date schedules generation (Section 5.3) and memory hierar-
where reductions are connected with One-to-Ones after the chy scheduling (Section 5.4).
Broadcast Postposition, as highlighted by △ in Table 3.
5.1 Resource-Aware Slicing
The resource-aware slicing method slices an SMG 𝐺 on given
5 System Design and Optimizations hardware resource configurations 𝑅𝐶 𝑓 𝑔, as is shown in Algo-
In this section, we introduce SpaceFusion, an optimizing com- rithm 1. This auto-slicing algorithm performs spatial slicing
piler that utilizes resource-aware methods to schedule SMGs (line 3∼8) first and then temporal slicing (line 9∼14).
for operator fusion, ensuring efficient execution on a given The spatial slicer first traverses all dimensions to identify
hardware platform. The system design overview of SpaceFu- those eligible for slicing (line 3). If no feasible dimension
sion is illustrated in Figure 9. SpaceFusion operates in two exists, the algorithm determines that the fused space defined
phases: program preprocessing and auto-scheduling. by the SMG cannot be scheduled for parallelization. Then, the
Program-preprocessing. The program-preprocessing spatial slicer slices all feasible dimensions in the SMG, with
phase is responsible for converting models into SMG abstrac- a set of parallelizable SMG blocks 𝐺𝑏𝑙𝑜𝑐𝑘𝑠 generated (line 4).
tions. SpaceFusion segments the tensor program defined by Finally, the sliced schedule is analyzed in a resource-aware
a deep learning model into smaller subprograms, primarily manner (line 5): if there exist feasible schedule configurations
based on model layers and unavoidable shape or layout trans- that meet the hardware resource constraints, the schedule
formations, which can disrupt or degrade the original data is considered feasible and is logged (line 6). Resource-aware

793
EuroSys ’25, March 30–April 3, 2025, Rotterdam, Netherlands Liang Zhu, Jianguo Yao, and Haibing Guan

Algorithm 1: Resource-Aware Slicing Algorithm 2: A Round of SMG Partitioning


Input : SMG to be slicing 𝐺 Input : SMG to be partitioned 𝐺
Input : Hardware resource configurations 𝑅𝐶 𝑓 𝑔 Input : Hardware resource configurations 𝑅𝐶 𝑓 𝑔
Output : Scheduled SMGs 𝑆𝑐ℎ𝐺 Output : Partitioned SMGs 𝐺 𝑓 , 𝐺𝑙
Output : Scheduled search configurations 𝑆𝑐ℎ𝐶 𝑓 𝑔 1 𝐺 𝑓 ← 𝐺;
1 𝑆𝑐ℎ𝐺, 𝑆𝑐ℎ𝐶 𝑓 𝑔 ← ∅; 2 𝐺𝑙 ← 𝑒𝑚𝑝𝑡𝑦𝑆𝑀𝐺;
2 𝑆𝑆𝑙𝑖𝑐𝑒𝑑,𝑇 𝑆𝑙𝑖𝑐𝑒𝑑 ← 𝐹𝑎𝑙𝑠𝑒; 3 while 𝐺 𝑓 .𝑠𝑖𝑧𝑒 () > 0 do
3 if 𝑆𝑆 𝐷𝑖𝑚𝑠 ← 𝑆𝑆.𝑔𝑒𝑡𝐷𝑖𝑚𝑠 (𝐺) then 4 𝑠𝑢𝑏𝐺 ← 𝐺 𝑓 .𝑔𝑒𝑡𝐿𝑎𝑠𝑡𝑆𝑢𝑏𝐺 ();
4 𝐺𝑏𝑙𝑜𝑐𝑘𝑠 ← 𝑆𝑆.𝑠𝑙𝑖𝑐𝑒 (𝐺, 𝑆𝑆 𝐷𝑖𝑚𝑠 ); 5 𝐺 𝑓 .𝑝𝑎𝑟𝑡𝑖𝑡𝑖𝑜𝑛(𝑠𝑢𝑏𝐺);
5 if 𝑐ℎ𝑒𝑐𝑘𝑅𝑠𝑟𝑐 (𝐺𝑏𝑙𝑜𝑐𝑘𝑠 , 𝑅𝐶 𝑓 𝑔) then 6 𝐺𝑙 .𝑚𝑒𝑟𝑔𝑒 (𝑠𝑢𝑏𝐺, 𝑖𝑛𝑑𝑒𝑥 = 0);
6 𝑆𝑐ℎ𝐺 .𝑎𝑝𝑝𝑒𝑛𝑑 (𝐺𝑏𝑙𝑜𝑐𝑘𝑠 ); 7 if 𝑡𝑟𝑦𝑅𝑒𝑠𝑜𝑢𝑟𝑐𝑒𝐴𝑤𝑎𝑟𝑒𝑆𝑙𝑖𝑐𝑖𝑛𝑔(𝐺 𝑓 , 𝑅𝐶 𝑓 𝑔) then
7 𝑆𝑐ℎ𝐶 𝑓 𝑔.𝑎𝑝𝑝𝑒𝑛𝑑 (𝑒𝑛𝑢𝑚𝐶 𝑓 𝑔(𝐺𝑏𝑙𝑜𝑐𝑘𝑠 )); 8 return 𝐺 𝑓 , 𝐺𝑙 ;
8 𝑆𝑆𝑙𝑖𝑐𝑒𝑑 ← 𝑇 𝑟𝑢𝑒;
9 return False;
9 if 𝑇 𝑆 𝐷𝑖𝑚 ← 𝑇 𝑆.𝑔𝑒𝑡𝑃𝑟𝑖𝑜𝑟 𝐷𝑖𝑚(𝐺𝑏𝑙𝑜𝑐𝑘𝑠 ) then
10 𝐺𝑖𝑛𝑡𝑟𝑎𝑏𝑙𝑜𝑐𝑘𝑠 ← 𝑇 𝑆.𝑠𝑙𝑖𝑐𝑒 (𝐺𝑏𝑙𝑜𝑐𝑘𝑠 ,𝑇 𝑆 𝐷𝑖𝑚 );
11 if 𝑐ℎ𝑒𝑐𝑘𝑅𝑠𝑟𝑐 (𝐺𝑖𝑛𝑡𝑟𝑎𝑏𝑙𝑜𝑐𝑘𝑠 , 𝑅𝐶 𝑓 𝑔) then
12 𝑆𝑐ℎ𝐺 .𝑎𝑝𝑝𝑒𝑛𝑑 (𝐺𝑖𝑛𝑡𝑟𝑎𝑏𝑙𝑜𝑐𝑘𝑠 ); main constraints in determining the block sizes. We establish
13 𝑆𝑐ℎ𝐶 𝑓 𝑔.𝑎𝑝𝑝𝑒𝑛𝑑 (𝑒𝑛𝑢𝑚𝐶 𝑓 𝑔(𝐺𝑖𝑛𝑡𝑟𝑎𝑏𝑙𝑜𝑐𝑘𝑠 )); the upper bound of the search space based on the available

Downloaded from the ACM Digital Library on April 13, 2025.


shared memory and register. Given that a scheduled SMG
14 𝑇 𝑆𝑙𝑖𝑐𝑒𝑑 ← 𝑇 𝑟𝑢𝑒;
is sliced by 𝑚 slicers. The slicing block sizes are denoted as
15 if 𝑆𝑆𝑙𝑖𝑐𝑒𝑑 ∨ 𝑇 𝑆𝑙𝑖𝑐𝑒𝑑 then return 𝑆𝑐ℎ𝐺, 𝑆𝑐ℎ𝐶 𝑓 𝑔 ; (𝐵 1, 𝐵 2, ..., 𝐵𝑚 ). The shared memory and register usage for
an SMG block denoted as 𝑉𝑆𝑚𝑒𝑚𝑃𝑒𝑟 𝐵𝑙𝑜𝑐𝑘 (𝐵 1, 𝐵 2, ..., 𝐵𝑚 ) and
16 return False
𝑉𝑅𝑒𝑔𝑃𝑒𝑟 𝐵𝑙𝑜𝑐𝑘 (𝐵 1, 𝐵 2, ..., 𝐵𝑚 ), are obtained at the compile time
with the parameters (𝐵 1, 𝐵 2, ..., 𝐵𝑚 ) to be tuned. The upper
bounds are given as 𝑉𝑆𝑚𝑒𝑚𝑃𝑒𝑟 𝐵𝑙𝑜𝑐𝑘 ≤ 𝐵𝑜𝑢𝑛𝑑𝑆𝑚𝑒𝑚𝑃𝑒𝑟 𝐵𝑙𝑜𝑐𝑘 and
schedule configurations, including the sliced dimensions and 𝑉𝑅𝑒𝑔𝑃𝑒𝑟 𝐵𝑙𝑜𝑐𝑘 ≤ 𝐵𝑜𝑢𝑛𝑑𝑅𝑒𝑔𝑃𝑒𝑟 𝐵𝑙𝑜𝑐𝑘 , where 𝐵𝑜𝑢𝑛𝑑𝑆𝑚𝑒𝑚𝑃𝑒𝑟 𝐵𝑙𝑜𝑐𝑘
the block sizes, are enumerated for auto-tuning (line 7). and 𝐵𝑜𝑢𝑛𝑑𝑅𝑒𝑔𝑃𝑒𝑟 𝐵𝑙𝑜𝑐𝑘 are the maximum amount of shared
After being spatial sliced, the generated SMG blocks are memory and register that a block can be allocated. Finally, a
fed to the temporal slicer. Note that the SMG blocks may or small scheduled SMG search space is created as the intersec-
may not satisfy the resource constraints during the spatial tion of the two search spaces. We generate configurations
slicing phase. Temporal slicing is attempted in both cases, via multiplier/exponential enumeration in the search space.
for we found that some SMGs that cannot satisfy the hard-
ware resource constraints during the spatial slicing become 5.2 SMG Partitioning
efficient after being temporal sliced. The temporal slicer first SpaceFusion in the slicing state may fail to generate sched-
analyzes the dimensions not spatially sliced to identify a ules for a given SMG (Algorithm 1 line 16). This is because
feasible dimension with the highest priority (line 9). A di- the SMG processed in the resource-aware slicing defines
mension with higher priority is recognized as a dimension an overly aggressive fusion schedule, where a single kernel
along which an SMG block possesses a larger volume of data fuses an excessive number of operators, resulting in the fused
space, implying a greater on-chip memory allocation for computation being unable to be scheduled in parallel or re-
the dataflow dependencies in the dimension. Consequently, quiring more resources than available. In this case, SpaceFu-
temporally slicing on a higher priority dimension results sion switches to the partition state to split the unschedulable
in a more significant on-chip memory footprint reduction. SMG into smaller SMGs. To start with, the unscheduled SMG
The temporal slicer then analyzes and transforms the de- is reorganized into sub-SMGs. A sub-SMG can be either (1)
pendency patterns in the given dimension (line 10). After an All-to-One sub-SMG composed of an iterative space with
slicing, resource-aware analysis and schedule configuration one All-to-One mapping and its neighboring data space(s),
generation are executed (line 11∼13). or (2) a non-All-to-One sub-SMG without any All-to-One
The resource-aware slicing method generates search spaces mappings, which may include One-to-One(s), One-to-All(s),
that conform to hardware resource constraints. To minimize or both. The intermediate data space between two adjacent
the search space, we focus on two levels of on-chip resource sub-SMGs is duplicated to ensure that both sub-SMGs have
in GPUs: the shared memory and the register. Computa- their own complete input/output data space(s).
tions in multiple operator fusion show two trends: an in- As is shown in Algorithm 2, a single round of partitioning
creasing number of variables and extended variable lifetimes. takes an SMG as input and generates two SMGs, denoted as
Hence, the capacity of shared memory and register are two the former SMG 𝐺 𝑓 and the latter SMG 𝐺𝑙 , guaranteeing that

794
SpaceFusion: Advanced Deep Learning Operator Fusion via Space-Mapping Graph EuroSys ’25, March 30–April 3, 2025, Rotterdam, Netherlands

In W1 W1 In1 W2 In2 In In1 In2


𝐺 𝑓 is schedulable for resource-aware slicing. Specifically, 𝐺 𝑓 mean
GEMM sub GEMM In3

Layer 1
starts from the complete input 𝐺 and iteratively partitions add
B1 GEMM GEMM sqr mask
the last sub-SMG and merges it to the front of 𝐺𝑙 , until 𝐺 𝑓 ReLU mean max
add eps
is schedulable. Then we obtain a schedulable 𝐺 𝑓 and a po- W2 add sub

Softmax
B
GEMM

Layer 2
tentially schedulable 𝐺𝑙 . Recursively, if 𝐺𝑙 is unschedulable, B2 add sqrt exp
add div W1 sum
SpaceFusion will enter the next round of partitioning. ReLU mul div
In4
ReLU B
… add GEMM
5.3 Partitioning for Candidate Schedules Out
Out Out Out
(a) MLP Layers (b) LSTM Cell (c) Layernorm (d) MHA
Algorithm 2 describes how SpaceFusion partitions unschedu-
lable SMGs. We further increase the exploration depth of
Figure 10. DFGs for Evaluated Subgraphs.
Algorithm 2 by one level. Specifically, when a schedulable
𝐺 𝑓 is found, SpaceFusion records the current partitioning
results of 𝐺 𝑓 and 𝐺𝑙 . It then continues to partition a non- Volta Ampere Hopper cuBLAS cuBLASLt SpaceFusion
4 3.5
All-to-One sub-SMG 𝑠𝑢𝑏𝐺 from the current schedulable 𝐺 𝑓 , 3
3 2.5

Speedup

Speedup
merges 𝑠𝑢𝑏𝐺 with 𝐺𝑙 , and generates new 𝐺 𝑓′ and 𝐺𝑙′ . This re- 2
2 1.5
sults in two candidate schedules: (𝐺 𝑓 , 𝐺𝑙 ) and (𝐺 𝑓′ , 𝐺𝑙′ ). The 1
0.5
1
reason is that non-All-to-One sub-SMGs are mostly memory- 0
128 256 512 1k 128 256 512 1k 128 256 512 1k
0
intensive. They potentially lead to performance variations 2 4 6 8 10 12 14 16 18 20 Volta Ampere Hopper
(a) Fused MLP Layers (b) Fused LSTM Cell

Downloaded from the ACM Digital Library on April 13, 2025.


when combined with different non-memory-intensive sub-
SMGs. We also experimented with further increasing the
depth of partition exploration, to the point of exhaustively Figure 11. Fused MLP and LSTM Cell Performance. The
enumerating all sub-SMG combinations. However, we did x-axes show (a) the number of fused MLP layers and (b) the
not observe significant performance gains. Therefore, Space- number of hidden state features.
Fusion typically does not adopt this aggressive strategy.
type (FP16). To assess the performance benefits of SpaceFu-
5.4 Scheduling for Memory Hierarchy
sion, we integrate SpaceFusion with OpenAI Triton[53] for
Modern GPU architectures typically leverage memory hier- the intra-block code generation capabilities.
archy, organized from fastest to slowest access, comprising To demonstrate the superior performance of SpaceFusion,
levels of register, shared memory, and global memory. Space- we conduct subgraph (Section 6.1) and end-to-end (Section
Fusion schedules memory hierarchy naturally via SMG ab- 6.2) performance experiments. Then, we explain observed
straction. For instance, within an intra-block or SMG block, performance gains through memory and cache analysis (Sec-
data spaces connected with One-to-One are mapped to the tion 6.3). To comprehensively characterize SpaceFusion’s
register level. Intermediate results of variable calculations performance, we perform an ablation study and two sensitiv-
within an iteration space, such as the intermediate results ity studies (Section 6.4). We analyze the system overheads on
of the accumulation of matrix multiplication, are also allo- the compilation time (Section 6.5) to demonstrate the light-
cated to the register level. Within an SMG block, both the weight nature of its SMG-based analysis and transformation
source data space of a One-to-All and the sink data space process. Finally, we conduct a comparative analysis of the
of an All-to-One are mostly mapped to the shared memory fusion patterns identified via SpaceFusion (Section 6.6) to
level, due to the repeatedly read/write access patterns, and demonstrate the generality of the fusion strategy.
the potential inter-thread communication. The input/output
data spaces of a SMG are mapped to global memory outside 6.1 Subgraph Performance
of the SMG. The intermediate data between two SMGs with We analyze the performance of SpaceFusion-generated sched-
data dependencies are mapped to the global memory level. ules for fused subgraphs in Figure 10. For each subgraph,
we treat it as a subprogram implemented as a single model
6 Evaluation layer to investigate SpaceFusion’s optimization capabilities.
This section presents comprehensive evaluations of SpaceFu- We compare SpaceFusion with multiple high-performance
sion on three NVIDIA GPUs with diverse architectures: V100 manually-tuned libraries for CUDA to illustrate the perfor-
(SM70, Volta architecture) with 32GB device memory, A100 mance gains.
(SM80, Ampere architecture) with 80GB device memory, and Multi-Layer Perceptron (MLP). In Figure 11 (a), the
H100 (SM90, Hopper architecture) with 80GB device mem- x-axis indicates the number of MLP layers being fused and
ory. We mark experimental results obtained from V100, A100, the y-axis indicates the speedup of SpaceFusion compared to
and H100 as Volta, Ampere, and Hopper. Experiments are cuBLASLt at different computational scales. The cuBLASLt
conducted on CUDA 12.2 with the half-precision float-point implements the fusion of a single-layer MLP, and this fusion

795
EuroSys ’25, March 30–April 3, 2025, Rotterdam, Netherlands Liang Zhu, Jianguo Yao, and Haibing Guan

PyTorch PyTorch Op NVIDIA Apex LN Triton SpaceFusion PyTorch FlashAttention Triton FlashAttention FlashAttention 2 SpaceFusion
10 10
Batch Size = 1
8 8
Speedup

Speedup
6 6
4 4
2
2
0
0
1K 2K 4K 8K 16K 1K 2K 4K 8K 16K 32K 1K 2K 4K 8K 16K 32K 12
Volta Ampere Hopper 10 Batch Size = 32
8

Speedup
6
Figure 12. Fused Layernorm Performance. The x-axis 4
2
represents the size of M (M=N) in the 2D input tensor to be 0
normalized. 64 128 256 512 1k 64 128 256 512 1k 2k 8k 64 128 256 512 1k 2k 8k
Volta Ampere Hopper

pattern is supported in most DL compilers[4, 6, 34, 43, 57, Figure 13. Fused MHA Performance. The x-axis repre-
66, 67]. SpaceFusion found that multiple MLP layers can be sents the sequence lengths of MHA across different architec-
further fused for specific problem sizes3 . SpaceFusion gener- tures.
ates fusion schedules for up to 20 layers in this experiment.
The performance results in Figure 11 (a) show SpaceFusion
achieves a maximum speedup of 3.15x and an average of 2.35x over cuBLASLt.
³ For GEMM, N, K ≤ 256 in these cases.

Long Short-Term Memory (LSTM). SpaceFusion's performance for a simplified LSTM cell is shown in Figure 11 (b). The x-axis indicates the number of hidden states and the y-axis indicates the speedup over the cuBLAS baseline. The cuBLAS implementation ends up with 5 unfused kernels, with each operator in Figure 10 (b) mapping to a kernel. The cuBLASLt implementation fuses the second GEMM by adding the output of the first GEMM, ending up with 4 kernels. SpaceFusion generates a single fused kernel for all operators, achieving a maximum speedup of 2.87x and an average of 2.29x compared to cuBLAS.

Layernorm (LN). Layernorm[3] is chosen to demonstrate SpaceFusion's ability to fuse operators that are not compute-intensive but still have complex mapping relations. In this experiment, SpaceFusion is compared with three SOTA fused implementations and an unfused baseline in PyTorch. PyTorch Op[8] maps the pre-defined Layernorm torch function to a fused CUDA implementation. The NVIDIA Apex[11] PyTorch extension lowers its Layernorm function to a manually-tuned fused CUDA kernel. LN Triton is a manually written fusion implementation in OpenAI Triton[55]. SpaceFusion's performance is shown in Figure 12. The x-axis indicates the shapes of the input 2D tensors (M=N) and the y-axis indicates the speedup over PyTorch. SpaceFusion outperforms in most cases, achieving an average 7.25x speedup over PyTorch, and up to 1.59x, 2.46x, and 4.03x speedups over PyTorch Op, NVIDIA Apex, and LN Triton, respectively.

Multi-Head Attention (MHA). SpaceFusion is compared with the manually-tuned library FlashAttention[22] in CUDA, FlashAttention 2[21] in CUDA, FlashAttention in Triton, and an unfused baseline in PyTorch. FlashAttention is a SOTA fused MHA implementation in CUDA that eliminates the overhead of caching intermediate data to device memory by fusing the MHA computations into a single cleverly designed kernel. FlashAttention in Triton implements FlashAttention equivalently with hand-tuned block sizes. SpaceFusion's performance for MHA is shown in Figure 13; the x-axis indicates the sequence length and the y-axis indicates the speedup over the baseline in PyTorch. Note that FlashAttention's CUDA implementation lacks compatibility with Volta, resulting in absent data for Volta. SpaceFusion achieves a maximum speedup of 10.35x, an average speedup of 5.40x over the baseline, and comparable performance to FlashAttention 2.
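To make the MHA comparison concrete, the unfused PyTorch baseline corresponds to a computation of the following form. This is a minimal illustrative sketch, not the exact benchmark code; the function name, the scaling factor, and the tensor shapes are our own assumptions.

    import torch

    def unfused_attention(q, k, v):
        # q, k, v: (batch, heads, seq_len, head_dim). Each step below runs as a
        # separate kernel in the unfused baseline, so the (seq_len, seq_len)
        # intermediate is written to and re-read from device memory in between.
        scale = q.shape[-1] ** -0.5
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale   # QK^T
        probs = torch.softmax(scores, dim=-1)                   # row-wise softmax
        return torch.matmul(probs, v)                           # weighted sum over V

    # Illustrative shapes only: batch=32, heads=16, seq_len=1024, head_dim=64.
    q = torch.randn(32, 16, 1024, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    out = unfused_attention(q, k, v)

The round trips of the score and probability matrices through device memory are the overhead that fused MHA kernels such as FlashAttention, and the single kernel generated by SpaceFusion, avoid by keeping intermediates on-chip.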

Figure 14. The End-to-End Performance. Speedup over the PyTorch baseline (y-axis) for Bert, Albert, T5, ViT, and Llama2 at batch sizes 1 and 32 on Volta, Ampere, and Hopper, comparing SpaceFusion against TensorRT, Kernl, BladeDISC, and NNFusion.

Figure 15. Memory and Cache Analysis. L1 (left) and L2 (middle) cache miss counts and the device memory data movement (right) are shown. Values are normalized to SpaceFusion. Lower is better. X-axis annotations: MLP(NumLayers, M), LN(M), MHA(BatchSize, SeqLength).

6.2 End-to-End Performance
In this section, we evaluate SpaceFusion on Transformer-based models, Bert[23], Albert[33], T5[46], ViT[63], and Llama2-7B[54], to demonstrate that SpaceFusion brings speedups for model inference in practice. SpaceFusion is compared with two types of SOTA works: (1) DL inference optimizers based on hand-tuned libraries, NVIDIA TensorRT[19] and Kernl[31], and (2) DL compilers with auto-tuning methods, BladeDISC[69] (implementing AStitch[70]) and NNFusion[38] (implementing Welder[49]), with CUDA Graphs[15] enabled to reduce the kernel launching time. Kernl is an open-source inference engine written in OpenAI Triton[53] for Transformer models. It implements fused kernels, including FlashAttention and fused LN, to achieve high performance in PyTorch. The implementations in PyTorch by Huggingface[62] are used as the baseline.

In Figure 14, SpaceFusion achieves a maximum speedup of 8.79x and an average speedup of 3.54x over PyTorch. SpaceFusion achieves performance comparable to or even better than the manually-tuned library-based inference engines, with an average speedup of 1.27x over TensorRT and 1.34x over Kernl. Both manual approaches fail to achieve exceptional performance in all cases; conversely, SpaceFusion delivers speedups consistently. SpaceFusion outperforms BladeDISC with an average speedup of 2.27x. BladeDISC's focus on fusing memory-intensive operators limits its ability to address compute-intensive operator fusion, leading to missed fusion opportunities. SpaceFusion outperforms NNFusion on Volta with an average speedup of 1.21x and a maximum speedup of 1.37x, because SpaceFusion explores a larger fusion space with dependency transformations compared to NNFusion. For MHA, SpaceFusion enhances the intra-block locality by further serializing intra-blocks, while NNFusion fails to fuse MHA with long sequence lengths. NNFusion for Ampere and Hopper, and BladeDISC for Hopper, are not fully supported, resulting in the absence of the corresponding results.

Figure 14 also demonstrates that SpaceFusion's acceleration performance varies across different models. Notably, for a batch size of 1, SpaceFusion achieves more significant speedups (2.67x∼8.79x) for Bert, Albert, T5, and ViT compared to Llama2 (1.91x∼3.02x). This pattern persists across the other four comparison groups. We attribute it to two main factors. Firstly, Llama2 employs a larger number of attention heads (#heads=32) that can be parallelized for computation, which enhances the parallelism of PyTorch and leads to better baseline performance at the small batch size. Secondly, Llama2 utilizes larger hidden and intermediate dimensions (4096 and 11008). This results in larger weight tensors and, consequently, increased data movement overhead between off-chip and on-chip memory, which cannot be eliminated through operator fusion techniques.

6.3 Memory and Cache Analysis
We perform a memory and cache performance analysis for representative subgraphs to explain the benefits of operator fusion and the speedups brought by SpaceFusion. SpaceFusion is compared to fused and unfused baselines in Figure 15. The fused baselines for each subgraph are different: cuBLASLt for MLP, PyTorch Op for LN, and FlashAttention for MHA. SpaceFusion outperforms in most cases for both cache and device memory performance, achieving up to 83.0% fewer L1 cache misses, 94.1% fewer L2 cache misses, and 96.45% less data movement in device memory. SpaceFusion's fusion methods within the memory hierarchy achieve significantly better memory and cache performance, which is one of the primary reasons for the speedups it delivers.

In this set of experiments, it can be observed that different subgraphs benefit differently from memory and cache optimizations. For example, compared to the unfused baselines, SpaceFusion reduces data movement from device memory by an average of 5.25x in the LN subgraphs, resulting in an average speedup of 8.08x. However, in the MHA subgraphs, SpaceFusion achieves an average data movement reduction of 18.98x, yet this only brings about an average speedup of 6.64x. The gains from reduced data movement are evidently weaker in MHA than in LN. This discrepancy stems from the LN subgraph's higher memory intensity compared to the MHA subgraph. As shown in Figure 10 (c), the LN subgraph is entirely composed of 9 memory-intensive (MI) operators. In contrast, although the MHA subgraph is also memory-intensive[22], it includes two compute-intensive (CI) GEMM operators, making it less memory-bound than LN. This implies that in LN, each operator spends a greater proportion of cycles stalling while input tensors are loaded from device memory. Consequently, the benefits of data movement reduction are more pronounced in LN.
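A back-of-the-envelope traffic model illustrates why fusion helps the memory-bound LN subgraph so strongly. The sketch below is our own illustration rather than SpaceFusion's analysis code; the assumption that every memory-intensive operator reads and writes one full (M, N) FP16 tensor is a deliberate simplification.

    FP16_BYTES = 2

    def unfused_traffic(num_ops, m, n, dtype_bytes=FP16_BYTES):
        # Idealized model: each memory-intensive operator in the chain reads one
        # (m, n) tensor from device memory and writes one (m, n) tensor back.
        return num_ops * 2 * m * n * dtype_bytes

    def fused_traffic(m, n, dtype_bytes=FP16_BYTES):
        # A single fused kernel reads the input once and writes the output once;
        # the intermediates stay in registers or shared memory.
        return 2 * m * n * dtype_bytes

    m = n = 4096  # illustrative 2D input shape
    ratio = unfused_traffic(num_ops=9, m=m, n=n) / fused_traffic(m=m, n=n)
    print(f"idealized data-movement reduction: {ratio:.1f}x")  # about 9x for a 9-operator chain

The idealized model predicts roughly a num_ops-fold reduction for a chain of MI operators; the measured 5.25x average reduction is lower because real operators share inputs, reductions emit small outputs, and the caches absorb part of the traffic. The qualitative conclusion is the same, however: the more memory-bound the subgraph, the more of the saved traffic translates directly into speedup.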

Figure 16. Ablation and Sensitivity Study. (a) Ablation Study; (b) Sensitivity for Input Sizes; (c) Sensitivity for Architectures.

Table 4. Compilation Time Break Down for MHA.

Workload       TS.getPriorDim + TS.slice   enum Cfg   SS.getDims + SS.slice   Tuning    Total
MHA(32,1024)   17.31 ms                    2.63 ms    0.23 ms                 33.04 s   36.33 s
MHA(32,256)    16.39 ms                    1.25 ms    0.34 ms                 29.55 s   33.41 s
(The first three columns constitute the auto-scheduling phase.)

Table 5. Compilation Time for Models.

Models   BladeDISC   TensorRT   SpaceFusion
Bert     176.2 s     141.1 s    68.4 s
ViT      155.8 s     213.4 s    76.9 s
T5       356.1 s     306.9 s    131.7 s


Sensitivity study on input sizes (Bert/Albert/T5/Llama2: 128∼1024 prompt length; ViT: 224x224∼768x768 image size) is shown in Figure 16 (b), with results normalized to the best performance for each model. At the batch size of 1, performance gains generally diminish as the input size increases, owing to the lack of parallelization. However, with a batch size of 32, performance gains become pronounced for most models as the input size increases, while T5 still exhibits a slight decrease.

Sensitivity study on architectures is conducted to illustrate the effectiveness of the auto-scheduling methods on varying architectures. The bar chart (left y-axis) in Figure 16 (c) compares SpaceFusion's performance (Perf) for the same model on different architectures, with the results normalized to Volta. At a batch size of 32, the average performance ratio of SpaceFusion across the three architectures, Volta:Ampere:Hopper = 1:2.26:4.34, indicates that SpaceFusion performs consistently well across different hardware. This ratio closely approximates the peak performance ratio of the three architectures under FP16 Tensor Core (1:2.79:6.75) but remains marginally lower[16–18]. One of the reasons is that as the compute capability increases, the elapsed time for the same workload drops sharply, so the overhead on the CPU side becomes more pronounced and dilutes the acceleration from SpaceFusion. The line chart (right y-axis) in Figure 16 (c) compares SpaceFusion's speedups (Su) over PyTorch across Volta, Ampere, and Hopper, normalizing results to Volta. SpaceFusion delivers more significant speedups for architectures with higher computing capabilities in most cases.

6.5 Compilation Time Analysis
Table 4 details the compilation time for MHA on two configurations (batch size=32, sequence length=256 and 1024). SpaceFusion's compilation time comes from two main phases, auto-scheduling and auto-tuning. For the auto-scheduling phase, the elapsed times of the three most time-intensive processes are listed. The auto-tuning phase takes up the majority of the compilation time. In this phase, SpaceFusion performs test runs of the generated configurations to measure their runtimes. For each configuration, SpaceFusion evaluates the performance by taking the median of 100 tests (with 20 warm-up runs). The search space generated by SpaceFusion is small: traversing the entire search space of MHA(32,1024) requires a mere 33.04 seconds. Table 5 compares the compilation times of BladeDISC, TensorRT, and SpaceFusion for model compilation. BladeDISC leverages mainly just-in-time (JIT) analysis and transformations to support optimizations. TensorRT requires the analysis and generation of efficient combinations of hand-tuned libraries and templates, necessitating runtime tests of a subset of these configurations. SpaceFusion compiles significantly faster, outperforming BladeDISC by an average of 2.44x and TensorRT by an average of 2.39x.

SpaceFusion's short compilation time can be attributed to several factors. First, its lightweight analysis and scheduling methods minimize the time spent on compilation tasks. Second, analytical pruning of infeasible schedules via the slicers and auto-scheduling methods eliminates a large number of non-viable options. Third, the scheduling method operates holistically on the entire fusion space, allowing multiple fused operators to share block parameters and reducing the number of parameters that require tuning. Fourth, the limited fusion opportunities for operators with complex dependencies result in a finite number of candidate schedules. Additionally, to expedite the convergence of tuning, we implemented a simple yet effective early-quit mechanism for different configurations within the same schedule search space: SpaceFusion abandons a configuration if its current test time exceeds a set proportion, denoted as α, of the current best configuration's total test time. We set α = 0.25 in this experiment. This strategy helps avoid wasting time on unpromising configurations and ensures faster progress. Finally, Triton serves as an efficient backend for generating intra-block code in SpaceFusion, enabling code optimization at the intra-block level with minimal need for search-based tuning.
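One possible reading of this measurement-and-pruning policy is sketched below. It is an illustration of the described behavior, not SpaceFusion's source code; the function and variable names are assumptions, and a real harness would time kernels with CUDA events rather than a host-side wall clock.

    import statistics
    import time

    def measure_once(run_candidate):
        # Placeholder for a single timed execution of a candidate configuration.
        start = time.perf_counter()
        run_candidate()
        return time.perf_counter() - start

    def tune(candidates, alpha=0.25, warmup=20, tests=100):
        # Pick the candidate with the lowest median runtime, abandoning a candidate
        # once its accumulated test time exceeds alpha times the total test time of
        # the best candidate seen so far. The first candidate never quits early
        # because best_total starts at infinity.
        best, best_total, best_median = None, float("inf"), float("inf")
        for cand in candidates:
            for _ in range(warmup):          # warm-up runs are not timed
                cand()
            elapsed, samples = 0.0, []
            for _ in range(tests):
                t = measure_once(cand)
                elapsed += t
                samples.append(t)
                if elapsed > alpha * best_total:   # early quit: unpromising config
                    break
            else:
                median = statistics.median(samples)
                if median < best_median:
                    best, best_total, best_median = cand, elapsed, median
        return best, best_median

Calling tune with a list of no-argument callables, one per generated configuration, returns the surviving configuration and its median runtime; abandoned configurations are never considered as the best, which is what keeps the traversal of the small search space fast.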

Table 6. Fusion Patterns Analysis. The total counts of discovered fusion patterns, fusion patterns involving compute-intensive (CI) operators only, fusion patterns involving memory-intensive (MI) operators only, and fusion patterns involving both CI and MI operators, are detailed.

Patterns Count                 SpaceFusion   NNFusion   BladeDISC
# Fusion Patterns Discovered   50            30         14
# CI Ops Fusion                5             3          0
# MI Ops Fusion                15            14         14
# CI and MI Ops Fusion         30            13         0

6.6 Fusion Patterns Analysis
SpaceFusion identified 50 distinct fused subgraphs containing at least two All-to-One mappings across 14 compiled evaluation instances from the 9 types of models and structures involved in the previous experiments, counted by distinct non-element-wise operators and distinct subgraph topologies, as shown in Table 6. NNFusion (implementing Welder) exhibits fusion capabilities for 30 distinct patterns. It misses fusion opportunities when cross-operator dependency transformations are required to enable fusion, especially for fusion patterns with both CI and MI operators. BladeDISC (implementing AStitch) achieves 14 fusion patterns, primarily targeting the fusion of MI operators. It fails to fuse the cases where CI operators arise. SpaceFusion exhibits a remarkable capability to identify and leverage fusion opportunities between CI and MI operators. This leads to a more equitable balance of computation and memory requirements for the fused tensor computations, thereby enhancing overall performance.

7 Related Work
This section introduces work related to SpaceFusion and discusses its connections to and distinctions from SpaceFusion.

General-purpose libraries/compilers. General-purpose libraries[9, 10, 12–14, 35] and compilers[2, 4, 6, 7, 20, 32, 34, 39, 42, 48, 56, 57, 59, 66, 67] provide comprehensive code optimization implementations and functionalities for DL models. Their support for operator fusion is mostly limited to fixed patterns, such as element-wise operators fused with others.

Foundational abstractions for DL compilation. The Halide IR[47] and the polyhedral model[25] serve as two foundational paradigms of low-level code abstraction, optimization, and generation for DL compilers. SpaceFusion shares similarities with them in abstracting the data and iteration domains of tensor computation. However, there are significant differences in the abstraction methodology: SpaceFusion's domain abstraction prioritizes the geometric properties of data and iteration spaces, whereas the other two methods present more objective representations of the inner and outer aspects of the loop hierarchy. SpaceFusion abstracts dependencies at the space granularity with geometric spatial relations, while the polyhedral model abstracts dependencies at the iteration and statement granularity. Halide IR does not emphasize the explicit abstraction of dependencies but retains iteration-level dependency information through function-level expressions and subscript variables. As a higher-level graph-based abstraction method, SpaceFusion is capable of serving as an auto-scheduler for Halide-based and polyhedral-based systems to realize optimization heuristics for operator fusion.

Auto-generation of operator fusion. Niu et al.[43] categorize operators, analyze the gains from fusing different operator types, and make fusion decisions via pattern matching. Jia et al.[29] fuse two parallel GEMMs with graph-equivalent transformations. Zheng et al.[70, 71] explore advanced fusion methods for memory-intensive operators, utilizing shared and global memory as a caching medium for intermediate data. Xing et al.[64] explore the fusion of two GEMMs and two CONVs via hardware-native templated search. Zheng et al.[68] make fusion decisions by analytically enumerating memory overheads under different block orders, enabling the fusion of two GEMMs or two CONVs. Shi et al.[49] fuse operators by scheduling memory access in the memory hierarchy. However, [64], [68], and [49] focus on exploring memory tile stitching within adjacent operators. They do not target analyzing and transforming dependencies for better fusion schedules, missing holistic optimizations for fusion.

Operator fusion for specific workloads or computational patterns. Wang et al.[60] propose a novel method to fuse the layers of convolutional neural networks (CNNs). Sivathanu et al.[50] explore fusing GEMMs in recurrent neural networks (RNNs). Kao et al.[30] design an optimized dataflow for fusing Multi-Head Attention (MHA) mechanisms. Also for fused MHA, Dao et al.[22] design an equivalent MHA algorithm and implement efficient fused CUDA kernels, whose benefits are more pronounced for long sequences. Zhai et al.[65] present a complete set of operator fusion schemes for MHA. Liang et al.[36] propose to fuse GPU kernels for spatial and temporal multitasking. Compared to these fusion studies, SpaceFusion targets more general computational patterns.

8 Conclusion
Generating fusion schedules for operators requires respecting complex dependencies while considering inter-block parallelism and intra-block data locality. SpaceFusion reduces the complexity of analyzing and transforming complex dependencies via new computational abstractions. By slicing the abstraction space, SpaceFusion generates efficient fusion schedules. Experiments show that SpaceFusion captures more fusion opportunities and generates fusion schedules with comparable or even better performance than manually-tuned fusion implementations.

Acknowledgments
We thank the anonymous reviewers and our shepherd, Dr. Mangpo Phothilimthana, for their insightful suggestions. This work was funded by the National Key Research & Development Program of China (No. 2022YFB4502002), NSFC (No. 62032008), STCSM (No. 23511100100), and the HighTech Support Program from STCSM (No. 22511106200). The corresponding author is Jianguo Yao.

References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[2] Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley. Learning to optimize Halide with tree search and random programs. ACM Transactions on Graphics, 38(4), 2019.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. Tiramisu: A polyhedral compiler for expressing fast and portable code. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 193–205. IEEE, 2019.
[5] John S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer Berlin Heidelberg, 1990.
[6] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
[7] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. Advances in Neural Information Processing Systems, 31, 2018.
[8] PyTorch Contributors. torch.nn.functional.layer_norm. https://pytorch.org/docs/stable/generated/torch.nn.functional.layer_norm.html.
[9] Intel Corporation. Intel oneAPI Deep Neural Network Library. https://github.com/oneapisrc/oneDNN.
[10] Intel Corporation. Intel oneAPI Math Kernel Library. https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html.
[11] NVIDIA Corporation. A PyTorch Extension: Tools for Easy Mixed Precision and Distributed Training in PyTorch. https://github.com/NVIDIA/apex.
[12] NVIDIA Corporation. Basic Linear Algebra on NVIDIA GPUs. https://developer.nvidia.com/cublas.
[13] NVIDIA Corporation. CUDA Deep Neural Network. https://developer.nvidia.com/cudnn.
[14] NVIDIA Corporation. CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass.
[15] NVIDIA Corporation. Getting Started with CUDA Graphs. https://developer.nvidia.com/blog/cuda-graphs/.
[16] NVIDIA Corporation. NVIDIA A100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/a100/.
[17] NVIDIA Corporation. NVIDIA H100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/h100/.
[18] NVIDIA Corporation. NVIDIA V100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/v100/.
[19] NVIDIA Corporation. TensorRT. https://developer.nvidia.com/tensorrt/.
[20] Meghan Cowan, Thierry Moreau, Tianqi Chen, James Bornholt, and Luis Ceze. Automatic generation of high-performance quantized machine learning kernels. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2020), pages 305–316. ACM, 2020.
[21] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.
[22] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
[23] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[24] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1:29–53, 1997.
[25] Tobias Grosser, Sven Verdoolaege, and Albert Cohen. Polyhedral AST generation is more than scanning polyhedra. ACM Transactions on Programming Languages and Systems, 37(4), 2015.
[26] Design Guide. CUDA C Programming Guide. NVIDIA, July, 29:31, 2013.
[27] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
[28] Xia Hu, Lingyang Chu, Jian Pei, Weiqing Liu, and Jiang Bian. Model complexity of deep learning: A survey. Knowledge and Information Systems, 63:2585–2619, 2021.
[29] Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. TASO: Optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 47–62, 2019.
[30] Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Amir Yazdanbakhsh, and Tushar Krishna. FLAT: An optimized dataflow for mitigating attention bottlenecks. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 295–310, 2023.
[31] kernl.ai. Machine Learning Models Optimization Environment. https://www.kernl.ai/.
[32] Jinsung Kim, Aravind Sukumaran-Rajam, Vineeth Thumma, Sriram Krishnamoorthy, Ajay Panyala, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan. A code generator for high-performance tensor contractions on GPUs. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 85–95, 2019.
[33] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
[34] Chris Leary and Todd Wang. XLA: TensorFlow, compiled. TensorFlow Dev Summit, 2(3), 2017.
[35] Xiuhong Li, Yun Liang, Shengen Yan, Liancheng Jia, and Yinghan Li. A coordinated tiling and batching framework for efficient GEMM on GPUs. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19), pages 229–241. ACM, 2019.
[36] Yun Liang, Huynh Phung Huynh, Kyle Rupnow, Rick Siow Mong Goh, and Deming Chen. Efficient GPU spatial-temporal multitasking. IEEE Transactions on Parallel and Distributed Systems, 26(3):748–760, 2015.
[37] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 2022.
[38] Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 881–897, 2020.
[39] Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 881–897, 2020.
[40] Partha Maji and Robert Mullins. On the reduction of computational complexity of deep convolutional neural networks. Entropy, 20(4):305, 2018.
[41] Ravi Nair. Evolution of memory architecture. Proceedings of the IEEE, 103(8):1331–1345, 2015.
[42] Supun Nakandala, Karla Saur, Gyeong-In Yu, Konstantinos Karanasos, Carlo Curino, Markus Weimer, and Matteo Interlandi. A tensor compiler for unified machine learning prediction serving. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 899–917, 2020.
[43] Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. DNNFusion: Accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 883–898, 2021.
[44] Andrei Paleyes, Raoul-Gabriel Urma, and Neil D. Lawrence. Challenges in deploying machine learning: A survey of case studies. ACM Computing Surveys, 55(6):1–29, 2022.
[45] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[46] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
[47] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013.
[48] Haichen Shen, Jared Roesch, Zhi Chen, Wei Chen, Yong Wu, Mu Li, Vin Sharma, Zachary Tatlock, and Yida Wang. Nimble: Efficiently compiling dynamic neural networks for model inference. In Proceedings of Machine Learning and Systems 2021 (MLSys 2021). mlsys.org, 2021.
[49] Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. Welder: Scheduling deep learning memory access via tile-graph. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 701–718, 2023.
[50] Muthian Sivathanu, Tapan Chugh, Sanjay S. Singapuram, and Lidong Zhou. Astra: Exploiting predictability to optimize deep learning. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 909–923, 2019.
[51] Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso. The computational limits of deep learning. arXiv preprint arXiv:2007.05558, 2020.
[52] Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso. The computational limits of deep learning. arXiv preprint arXiv:2007.05558, 2020.
[53] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.
[54] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models, 2023.
[55] OpenAI Triton. Layer Normalization. https://github.com/openai/triton/blob/main/python/tutorials/05-layer-norm.py.
[56] Leonard Truong, Rajkishore Barik, Ehsan Totoni, Hai Liu, Chick Markley, Armando Fox, and Tatiana Shpeisman. Latte: A language, compiler, and runtime for elegant and efficient deep neural networks. SIGPLAN Notices, 51(6):209–223, 2016.
[57] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730, 2018.
[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[59] Mohamed Wahib and Naoya Maruyama. Scalable kernel fusion for memory-bound GPU applications. In SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 191–202. IEEE, 2014.
[60] Xueying Wang, Guangli Li, Xiao Dong, Jiansong Li, Lei Liu, and Xiaobing Feng. Accelerating deep learning inference with cross-layer data reuse on GPUs. In Euro-Par 2020: Parallel Processing, pages 219–233. Springer International Publishing, 2020.
[61] Michael E. Wolf and Monica S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, pages 30–44, 1991.
[62] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing, 2020.
[63] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual Transformers: Token-based image representation and processing for computer vision, 2020.
[64] Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, and Yibo Zhu. Bolt: Bridging the gap between auto-tuners and hardware-native performance. In Proceedings of Machine Learning and Systems 2022 (MLSys 2022). mlsys.org, 2022.
[65] Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, and Yibo Zhu. ByteTransformer: A high-performance transformer boosted for variable-length inputs. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 344–355, 2023.
[66] Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, and Xuefeng Jin. AKG: Automatic kernel generation for neural processing units using polyhedral transformations. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI 2021), pages 1233–1248. ACM, 2021.
[67] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. Ansor: Generating high-performance tensor programs for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 863–879, 2020.
[68] Size Zheng, Siyuan Chen, Peidi Song, Renze Chen, Xiuhong Li, Shengen Yan, Dahua Lin, Jingwen Leng, and Yun Liang. Chimera: An analytical optimizing framework for effective compute-intensive operators fusion. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1113–1126. IEEE, 2023.
[69] Zhen Zheng, Zaifeng Pan, Dalin Wang, Kai Zhu, Wenyi Zhao, Tianyou Guo, Xiafei Qiu, Minmin Sun, Junjie Bai, Feng Zhang, Xiaoyong Du, Jidong Zhai, and Wei Lin. BladeDISC: Optimizing dynamic shape machine learning workloads via compiler approach. Proceedings of the ACM on Management of Data, 1(3), 2023.
[70] Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, Shuaiwen Leon Song, and Wei Lin. AStitch: Enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '22), pages 359–373. ACM, 2022.
[71] Zhen Zheng, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, and Wei Lin. FusionStitching: Boosting memory intensive computations for deep learning workloads, 2021.
