Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions

Chunwei Xia (SKLP, ICT, CAS; UCAS, China; University of Leeds, U.K.)
Jiacheng Zhao* (SKLP, ICT, CAS; UCAS, China)
Qianqi Sun (SKLP, ICT, CAS; UCAS, China)
Zheng Wang (University of Leeds, U.K.)
Yuan Wen (University of Aberdeen, U.K.)
Teng Yu (Thewake Systems, China)
Xiaobing Feng (SKLP, ICT, CAS; UCAS; Zhongguancun Laboratory, China)
Huimin Cui (SKLP, ICT, CAS; UCAS, China)

* Jiacheng Zhao is the corresponding author.
Abstract
Optimizing deep neural network (DNN) execution is important but becomes increasingly difficult as DNN complexity grows. Existing DNN compilers cannot effectively exploit optimization opportunities across operator boundaries, leaving room for improvement. To address this challenge, we present Souffle, an open-source compiler that optimizes DNN inference across operator boundaries. Souffle creates a global tensor dependency graph using tensor expressions, traces data flow and tensor information, and partitions the computation graph into subprograms based on dataflow analysis and resource constraints. Within a subprogram, Souffle performs local optimization via semantic-preserving transformations, finds an optimized program schedule, and improves instruction-level parallelism and data reuse. We evaluated Souffle using six representative DNN models on an NVIDIA A100 GPU. Experimental results show that Souffle consistently outperforms six state-of-the-art DNN optimizers by delivering a geometric mean speedup of up to 3.7× over TensorRT and 7.8× over TensorFlow XLA.

CCS Concepts: • Computer systems organization → Multicore architectures; Single instruction, multiple data; Neural networks; Heterogeneous (hybrid) systems; • Software and its engineering → Source code generation; Application specific development environments.

Keywords: Deep Neural Network, Compiler Optimization, Tensor Expression, GPU

ACM Reference Format:
Chunwei Xia, Jiacheng Zhao, Qianqi Sun, Zheng Wang, Yuan Wen, Teng Yu, Xiaobing Feng, and Huimin Cui. 2024. Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions. In Proceedings of ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Conference ASPLOS '24). ACM, New York, NY, USA, 15 pages. https://doi.org/XXXXXXX.XXXXXXX

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Conference ASPLOS '24, April 27 - May 1, 2024, CA, USA
© 2024 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06...$15.00
https://doi.org/XXXXXXX.XXXXXXX

1 Introduction
No day goes by without hearing about the success of deep neural networks (DNNs). Indeed, advanced DNNs have demonstrated breakthrough effectiveness in solving a wide range of tasks, from drug discovery [11, 16] and self-driving cars [28] to e-commerce [26, 59].

A DNN model is typically expressed as a computational graph using deep learning (DL) programming frameworks like TensorFlow [2] and PyTorch [41]. By separating the expression of the computational graph from the implementation of low-level operators, DL frameworks abstract away
the hardware complexity and have become the standard method for writing DNN code. However, using high-level programming abstractions presents significant challenges for low-level performance optimization, especially during model inference, when a trained model is deployed in a production environment where response time is crucial [17, 21].

Efforts have been made to perform optimizations across operator boundaries to increase parallelism, decrease memory access traffic, or utilize memory bandwidth more efficiently. One promising approach is operator/kernel fusion, which merges multiple operators into a single kernel to enable analysis and optimizations across operators. This line of research includes works using hand-crafted rules [37], loop analysis [56], or just-in-time compilation [58] to guide and perform fusions. Typically, these methods use a bottom-up strategy: they first perform operator fusion on the graph representation to merge multiple operators into a partition, and then generate an optimized kernel for each partition. However, a key challenge is determining the optimal boundaries of partitions, i.e., which operators should be fused together.

Despite the success of bottom-up approaches to operator/kernel fusion, optimization opportunities can still be overlooked. One such issue arises from separating the operator fusion and code generation stages. This can result in operators being misplaced into different kernels, leading to extra memory access overhead and preventing otherwise possible optimizations. As we will show in the paper, state-of-the-art kernel fusion methods can miss important optimization opportunities, leaving much room for improvement.

We present Souffle, a novel top-down approach for optimizing inference across DNN operator boundaries. Unlike bottom-up strategies, Souffle first processes the whole computation graph as a single, merged kernel and then divides the graph into partitions, i.e., subprograms, through a global analysis from the top level, considering data reuse in shared memory/registers and the generated instructions when determining partitioning boundaries. Each partition is organized into a kernel. Afterwards, at the bottom level, Souffle performs a series of semantic-preserving transformations for each subprogram to simplify the tensor expressions and eliminate redundant memory accesses in the corresponding kernel. To this end, Souffle introduces two new mechanisms: a tensor-expression-based global analysis to identify critical partitioning points, and a semantic-preserving transformation approach that uses affine transformations to simplify the tensor expressions of each subprogram. Compared with existing bottom-up fusion approaches, the benefit of our top-down approach is that it globally determines the kernel boundaries by considering the generated code of the kernels.

Tensor-expression-based global analysis. Souffle conducts global dependence analysis on a tensor dependency graph generated from the entire DNN model. It utilizes tensor expressions (TEs) [12] to encode dataflow information of operators and tensors. By mapping higher-level operators to simpler TEs, Souffle performs data-flow analysis and code optimization around these TEs, simplifying the complexity of analysis and optimization and resulting in better code. TEs offer concise semantics, allowing us to translate the task of analyzing and optimizing complex operators into the more manageable problem of analyzing and optimizing simpler mathematical expressions. For instance, a softmax operator can be represented by two TEs with simpler data dependence relationships: one is a one-relies-on-many TE (reduction), and the other is a one-relies-on-one TE (element-wise). Since Souffle's analysis is conducted on the TEs without making any assumptions about low-level library calls, it can optimize across complex operators, even when the operators have complex data dependencies like many-to-many, where other methods fail to do so.

Semantic-preserving transformation. After the top-level stage, the computation graph has been divided into multiple subprograms, with each subprogram mapped to a kernel. However, each subprogram contains a large number of TEs, which would introduce a large number of redundant memory accesses across these TEs. Therefore, Souffle applies affine transformations to combine multiple TEs into a single TE, eliminating the redundant memory accesses. This process is performed within the subprograms and relies on the TE-based global analysis. The transformation is fully automated and flexible because the tensor expression precisely describes the mathematical computation of operators in a simple form.

Putting it all together. Souffle first conducts data-flow analysis on the tensor dependency graph of the entire DNN model using TEs. This analysis captures essential information such as tensor shapes and live ranges across operator boundaries, allowing for precise element-wise analysis to infer data dependence. Souffle then partitions the TEs into subprograms using compiler heuristics and conducts local optimization within each subprogram using semantic-preserving mathematical transformations to reduce memory accesses. The optimized subprogram schedule is found by considering the computation characteristics of the subprogram's TEs. With precise dependence information at the TE level, Souffle can optimize memory access latency by reusing tensor buffers and improve instruction-level parallelism by overlapping memory load and arithmetic instructions. Since Souffle's code optimizations are based on subprograms of fused operators rather than individual operators, the optimization boundary of operators is eliminated.
Figure 1. How TensorRT (a), Apollo (b) and Souffle (c) map a BERT computation graph into kernels. The Souffle optimization leads to fewer GPU memory accesses and faster execution time than TensorRT and Apollo. Panels: (a) TensorRT optimization, (b) Apollo optimization, (c) Souffle (our approach), and (d) optimizations across computation-intensive kernels, contrasting unpipelined execution with Souffle's pipelined execution of two dependent GEMMs.

Evaluation. We have implemented a working prototype of Souffle on TVM; the data and code associated with this paper are openly available at https://github.com/SOUFFLE-AE/SOUFFLE.git. We evaluate Souffle on six DNN models with diverse and representative model architectures and compare it against six state-of-the-art DL optimizing and kernel fusion frameworks, including XLA [27], Ansor [57], TensorRT [38], Rammer [33], Apollo [56], and the MLIR-based IREE compiler [1]. Our evaluation, performed on an NVIDIA A100 GPU, shows that Souffle outperforms existing solutions, delivering a geometric mean speedup of up to 3.7× and 7.8× over TensorRT and XLA, respectively. Souffle is highly flexible and can fuse operators where state-of-the-art kernel fusion strategies fail. It is compatible with TensorFlow and ONNX [13] models, and can be integrated with general DL compilers like TVM [12, 57].

Contributions. This paper makes the following contributions:
• It presents a new top-down approach for identifying and exploiting optimization opportunities across operator boundaries (Sec. 5);
• It shows how to effectively leverage the global analysis to perform local optimization at the kernel level represented as tensor expressions (Sec. 6);
• It demonstrates how low-level tensor expressions can be employed to perform instruction optimizations beyond operator boundaries (Sec. 6.5).

2 Motivation

2.1 Working Example
As a motivating example, consider optimizing a standard BERT model [14] on an NVIDIA A100 GPU. This model is based on the Transformer architecture [10, 49] and uses FP16 for inference. Fig. 1 depicts how TensorRT and Apollo map operators of a simplified sub-computation graph from BERT into kernels (layout transformation kernels are omitted from Fig. 1 to aid clarity). This subgraph contains representative DNN operators: general matrix multiplication (GEMM), reshape, permutation, element-wise arithmetic operators like add or exp, and reduction operators like reduce_sum. How the compiler maps these operators to individual kernels significantly impacts performance.

Table 1. Performance for the generated kernels of Fig. 1.

                                    TensorRT   Apollo    Souffle
Total execution time (μs)           62.34      179.07    57.73
- Computation-intensive kernels     31.29      61.1      41.77
- Memory-intensive kernels          31.0       117.97    15.96
#Kernels                            7          14        1
#Bytes loaded from global (MB)      16.52      27.78     8.87

2.2 Performance Evaluation
We measure the resulting kernels using NVIDIA Nsight Compute [39]. Table 1 shows that neither TensorRT nor Apollo provides an optimal mapping for the evaluated DNN. The subgraphs created by TensorRT and Apollo in Fig. 1 load 16.52MB and 27.78MB of data from global memory, giving execution times of 62.34μs and 179.1μs, respectively. A better strategy, which is the one chosen by our approach, is to refine and map the subgraph into a single kernel. This strategy reduces the number of bytes loaded from global memory to 8.87MB with an execution time of 57.7μs, translating to 1.1× and 3.1× faster running time than TensorRT and Apollo, respectively. We want to highlight that TensorRT has been specifically tuned for Transformer-based models with closed-source, hand-optimized low-level operator implementations (like GEMM). We therefore consider the Souffle improvement over TensorRT on BERT to be significant, given that Souffle does not have access to some of the NVIDIA-optimized operator implementations used by TensorRT. Furthermore, as we will show later in Sec. 8, Souffle also significantly outperforms other DNN compilers, including XLA and IREE, on this DNN model.
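As a quick sanity check of the speedups quoted above, the following uses the execution times from Table 1 directly:

    # Speedups of the single-kernel mapping over TensorRT and Apollo (Table 1).
    tensorrt_us, apollo_us, souffle_us = 62.34, 179.07, 57.73
    print(round(tensorrt_us / souffle_us, 2))   # ~1.08, i.e., about 1.1x
    print(round(apollo_us / souffle_us, 2))     # ~3.10, i.e., about 3.1x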
2.3 Missed Opportunities
After closely examining the profiling data and the kernel fusion outcomes, we identified several optimization opportunities that TensorRT and Apollo miss:

Failure to optimize between memory- and compute-intensive kernels. As depicted in Fig. 1, part of BERT performs element-wise memory operators, e.g., reshape and permutation (element-wise memory operators 2 and 3 in Fig. 1). TensorRT and Apollo leverage manually crafted rules to fuse adjacent element-wise memory operators together, but both fail to further optimize between the fused operators and their preceding computation operators, e.g., the GEMM operators in Fig. 1. Souffle performs optimization between memory- and compute-intensive kernels, and eventually eliminates all element-wise memory operators. In summary, manually crafted rules cannot cover a diverse set of computation patterns and miss the optimization opportunity in this case.

Suboptimal fusion strategy for reduction operators. Fig. 1(a) and (b) show the suboptimal kernel fusion strategy employed by TensorRT and Apollo for reduction operators. Both strategies map the GEMM and the reduction operator to separate kernels, which requires storing the entire tensor data that the reduction operators rely on to expensive global memory before the reduction occurs. Souffle aggressively fuses reduction operators with adjacent computation-intensive operators, such as R0-2 (R for reduction operator) with GEMM0 and GEMM1, as shown in Fig. 1(c). This is achieved through a two-phase reduction approach: performing partial reduction in a per-block fashion and using atomicAdd for global reduction. As a result, the entire tensor data can be kept on-chip, with only the partial results stored in global memory. A global synchronization (e.g., grid synchronization in CUDA [40]) is inserted to synchronize running blocks, as shown in Fig. 1(c). This optimization applies to all reduction operators in Fig. 1. Moreover, Souffle can cache the output of arithmetic operator 1 on-chip for reuse in arithmetic operator 2.

Poor optimization across computation-intensive kernels. Like many other DNN frameworks, TensorRT and Apollo try to fuse multiple computation-intensive operators of the same type, but fail to capitalize on the opportunities across different types of operators. Consider Fig. 1(d), which shows how two dependent GEMM operators execute asynchronous memory copies and tensor core computations when they are grouped into kernels under two different strategies. The first is to map the GEMM operators into two separate kernels, since TensorRT and Apollo do not consider fusing compute-intensive operators; the second is to map them into a single kernel. TensorRT and Apollo use the former, and Souffle uses the latter. By putting the two GEMM operators into one single kernel (right part of Fig. 1(d)), Souffle enables pipelined execution, loading W3 of GEMM3 while computing GEMM2. Souffle is designed to exploit such cross-operator pipeline optimizations.

2.4 Our Insights
Based on the observations outlined earlier, there is a need to analyze DNN models to fuse operators, perform automatic transformations on tensor operations, and optimize within a fused kernel. A more effective kernel fusion strategy makes it possible to extract crucial tensor information such as live ranges and tensor data reuse. This information can then be used to analyze fine-grained dependencies at the element-wise level, leading to better kernel-level optimization.

3 Preliminaries
Souffle utilizes TVM's tensor expression (TE) [12] as an intermediate representation for analysis and optimization. A TE specifies how output elements are computed from input tensors. In the TE program shown in Figure 2, TE0 is an example TE for GEMM, where the rk parameter defines the reduction axis (i.e., the dimension of a tensor that will be traversed to perform the reduction), with a reduction index ranging from 0 to 64. The output tensor O0 is computed using the compute operation, which specifies the computation to be performed on each data element and the output shape. The iteration variables i and j correspond to the output shape dimensions, and their iteration domains can be inferred naturally. Essentially, TE uses a pure functional language [47] to describe tensor computation, allowing each output tensor element to be computed individually. Note that our optimizations also apply to other DSLs like Antares [34] and Tensor Comprehensions [48] with similar concise semantics and expressiveness. We choose TVM due to its popularity and established toolchain.

4 Overview of Souffle
Souffle is our open-source framework for DNN code optimization. It is designed to overcome the three limitations of existing DNN inference frameworks identified in Section 2: it enhances data reuse, optimizes reduction operations, and enables optimization across operator boundaries. Currently, it supports TensorFlow models and optimizes DNN inference on a single NVIDIA GPU, but many of our analyses and optimizations can be applied to AMD GPUs and other accelerators.

Fig. 2 shows the overall workflow of Souffle, which takes a model as input and uses TVM to lower the model down to TEs on which we perform analysis and optimization.

TE lowering. For a DNN model, Souffle first lowers each operator to its corresponding TEs to form a TE program. Fig. 2 shows that the five operators are lowered to five TEs marked TE0 to TE4.

Global computation graph analysis. The lowered TE program is passed to the Souffle analysis module. Souffle
performs a two-level analysis on the TE program. At the tensor level, Souffle extracts important tensor information like the shape, live range and computation intensity of a TE. At the element-wise level, Souffle analyzes the fine-grained dependencies between the output and input tensors of each TE, as described in Sec. 5. Fig. 2 shows the analysis results, including element-wise data dependency and computational complexity for the five TEs. At the tensor level, it finds that O0 is accessed by TE1 and TE3, which reveals a data reuse opportunity across multiple TEs.

Figure 2. Example of the Souffle workflow. A five-operator GEMM/element-wise graph is lowered to the TE program listed below (step 1). Global computation graph analysis classifies TE0, TE2 and TE4 as one-relies-on-many, compute-intensive TEs and TE1, TE3 as one-relies-on-one, memory-intensive TEs, and records that O0 is reused by TE1 and TE3 (step 2). Resource-aware partitioning places TE0-TE3 in one subprogram and TE4 in another, because TE4's 64 blocks exceed the assumed 48-block limit of the global synchronization API (step 3). TE transformation schedules TE1 and TE3 into the inner loops of TE0 and TE2 (step 4). Joint optimization emits one TensorIR function per subprogram, using ldg2s (load global to shared), wmma_16x16 (warp matrix multiply-accumulate), sts2g (store shared to global) and grid.sync(), and keeps SO0 in shared memory for reuse across TE boundaries (step 5).

    rk = te.reduce_axis((0, 64))
    TE0: O0 = te.compute((64, 64),  lambda i, j: te.sum(I0[i, rk] * W0[rk, j], axis=[rk]))
    TE1: O1 = te.compute((64, 64),  lambda i, j: te.sigmoid(O0[i, j]))
    TE2: O2 = te.compute((64, 64),  lambda i, j: te.sum(O1[i, rk] * W2[rk, j], axis=[rk]))
    TE3: O3 = te.compute((64, 64),  lambda i, j: O0[i, j] + O2[i, j])
    TE4: O4 = te.compute((64, 256), lambda i, j: te.sum(O3[i, rk] * W4[rk, j], axis=[rk]))

Resource-aware program partitioning. Souffle analyzes the tensor dependency graph and uses Ansor [57] to schedule compute-intensive TEs. It partitions the input TE program into multiple subprograms based on resource usage and transforms each subprogram into a computation kernel. For example, in Fig. 2, if the number of blocks of TE4 exceeds the max-blocks-per-wave limit, Souffle partitions the TE program into two subprograms. The first subprogram includes TE0, TE1, TE2, and TE3, while the second includes TE4.

TE transformation. The subprograms, together with the data-flow analysis and tensor information, are sent to the TE transformation module to generate semantic-preserving but optimized TEs. In Fig. 2, the computation of TE1 and TE3 is scheduled into the inner loops of TE0 and TE2, respectively. TE schedules and transformations are explained in Sec. 6.

Joint optimization and code generation. The transformed TE subprograms are fed to a schedule optimizer (Ansor [57] in our case) to generate a schedule for each TE subprogram. Next, Souffle composes the schedules within a subprogram into one single function represented in TensorIR [15] for joint optimization of instructions and data reuse within the subprogram. Finally, the optimized subprogram is passed to the back-end code generator to produce CUDA kernels. In Fig. 2, ldg2s stands for loading from global memory to shared memory, wmma_16x16 stands for warp matrix multiply-accumulate, and sts2g stands for storing from shared memory to global memory. Souffle wraps each TE's corresponding code in an if statement to match the launch dimensions and inserts global synchronization primitives (grid.sync() in this example) to synchronize data across thread blocks. SO0 is cached in shared memory and reused across operator boundaries (TE1 and TE3 in this working example). We describe these procedures in Sec. 6.5.

5 Global Computation Graph Analysis
Souffle applies two levels of analysis on the TE tensor dependency graph. The first identifies data reuse opportunities and the second extracts the element-wise data dependence.

5.1 Identifying data reuse opportunities
Tensors are often reused both in temporal and spatial dimensions, providing opportunities for exploiting data reuse to eliminate expensive memory operations. As discussed in Section 2.3, there are two types of tensor reuse in our working example shown in Fig. 1(a) and Fig. 1(b). First, the three GEMM0 operators share the same input tensor, which can be reused spatially. Fusing the three GEMM0 operators into one kernel would allow us to load the input once and reuse it three times across the GEMM0 operators. Such a reuse opportunity is common in DNNs, including recurrent neural networks [44], convolutional neural networks [29, 45, 55] and Transformer models [31, 49].
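The tensor-level pass that finds such shared tensors can be pictured with a small helper along the following lines; this is an illustrative sketch written against TVM's te API, not Souffle's actual implementation:

    from collections import defaultdict

    def shared_input_tensors(te_outputs):
        # Map every input tensor to the set of TEs (compute ops) that read it.
        # te_outputs is a list of te.Tensor objects, one per TE in the program.
        readers = defaultdict(set)
        for out in te_outputs:
            for inp in out.op.input_tensors:
                readers[inp].add(out.op.name)
        # Tensors read by more than one TE are candidates for spatial or
        # temporal reuse, i.e., the sets s(t_i) = {op_j, ..., op_k} in the text.
        return {t: ops for t, ops in readers.items() if len(ops) > 1}

Applied to the TE program of Fig. 2, such a scan would report that O0 is read by both TE1 and TE3, which is the reuse opportunity discussed above.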
Spatial data reuse optimizations apply to tensors that are consumed by operators that have no data dependencies; such operators are horizontally transformed as described in Sec. 6.1. The second type of reuse opportunity manifests in the temporal dimension. Temporal data reuse opportunities apply to tensors that are used more than once by operators that have data dependencies, and they guide the tensor reuse optimization described in Sec. 6.5. Consider again our working example in Fig. 1: the result of element-wise arithmetic operator 1 (termed A1) is used by two dependent operators, R1 and A2. Once again, accesses to global memory can be eliminated if we cache the computation output of A1 in registers/shared memory.

Souffle identifies these data reuse opportunities from the TE tensor dependency graph at the tensor level by first traversing the computation graph to gather all the tensors accessed by more than one TE. It records the set of operators, s(t_i) = {op_j, ..., op_k}, that share tensor t_i, to enable the code optimizations described in Section 6.

5.2 Intra-TE element-wise dependency analysis
Souffle captures element-wise data dependence from output to input tensors within a TE by defining the iteration space as the output shape, and the data space as the domain of iteration and reduction variables for each input tensor. This simplifies the element-wise dataflow from input to output tensors, as we only need to record the relevant input tensor elements for each element of the output tensor. The information also enables reduction operator fusion at the TE transformation stage, which other optimization tools such as TensorRT and Apollo do not support.

Our key observation is that intra-TE data dependence falls into two categories. First, for a TE without a reduction axis (see also Section 3), each output tensor element relies on only one input tensor element (termed one-relies-on-one). Second, for a TE with a reduction axis, each output element relies on all the elements along the reduction dimensions of the input tensors (termed one-relies-on-many). With this observation, we can greatly simplify the dependence analysis process compared to the source-code or operator-level analysis that other kernel fusion approaches rely on.

We use the polyhedral model notation [46] to denote element-wise dependencies from an output tensor element to input tensor element(s). Each tensor has an associated set $S = [x_0, \ldots, x_n : c_0 \wedge \ldots \wedge c_m]$ representing its data space, with $x_i$ as iteration variables and $c_j$ as loop bounds from the TEs. A relation signifies output elements depending on input tensor elements. A pair of output and input tensors is tied to a relation $R = \{[x_0, \ldots, x_n] \mapsto [y_0, \ldots, y_m] : c_0, \ldots, c_p\}$. For the TEs in Fig. 2, TE1 gives $R1 = \{O1[i, j] \mapsto O0[i, j], \; 0 \le i < 64, \; 0 \le j < 64\}$ and TE0 results in $R0 = \{O0[i, j] \mapsto I0[i, rk], \; 0 \le i < 64, \; 0 \le j < 64, \; 0 \le rk < 64\}$. TE1 is of type one-relies-on-one and TE0 is of type one-relies-on-many.

One-relies-on-one TEs. Souffle adopts quasi-affine maps [7, 36] to represent the element-wise dependency of a one-relies-on-one TE. The mapping from output to input can be expressed in the form $M\vec{v} + \vec{c}$, where $\vec{v}$ is the index vector of the output tensor, $M$ is a constant matrix from $\mathbb{Z}^{n \times m}$ and $\vec{c}$ is a constant vector from $\mathbb{Z}^m$. Here, $n$ is the output tensor's dimension and $m$ is the corresponding input tensor's dimension. Note that multiple indices of the output tensor may rely on the same index of the input tensor. For instance, relation $R1$ can be represented as:

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad 0 \le i < 64, \; 0 \le j < 64 \qquad (1)$$

One-relies-on-many TEs. For a one-relies-on-many TE, Souffle extracts the region of the input tensor accessed by combining the iteration space and the input tensor's index function. As the iteration domain of the reduction axes is a constant value, the mapping can be expressed in the form $R = \{[x_0, \ldots, x_n] \mapsto \{[y_0, \ldots, y_m], [r_0, \ldots, r_s]\} : c_0, \ldots, c_p\}$, where $[r_0, \ldots, r_s]$ is the set of reduction variables and their ranges. For instance, relation $R0$ can be expressed as $R0 = \{O0[i, j] \mapsto \{I0[i, rk], [0 \le rk < 64]\}, \; 0 \le i < 64, \; 0 \le j < 64\}$, where $\{I0[i, rk], [0 \le rk < 64]\}$ represents the set of elements with $rk$ ranging from 0 to 64. We stress that the element-wise dependency of compute-intensive operators like GEMM and convolution can be easily analyzed because the tensor expression explicitly gives the reduction axes.

TEs with one-relies-on-one dependency are transformed in Sec. 6.2, and TEs with one-relies-on-many dependency are scheduled in Sec. 6.3 and Sec. 6.4.

5.3 TE characterization
Souffle classifies a TE as memory-intensive (e.g., reduce_sum) or computation-intensive (e.g., GEMM) by estimating the compute-memory ratio of the TE. The ratio is computed by dividing the number of arithmetic instructions by the number of memory accesses; it thus represents the number of arithmetic operations per tensor element read or written. In this work, the classification threshold is empirically set to 3: a ratio less than the threshold indicates that the TE is memory-intensive.
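As a concrete illustration of the classification rule in Sec. 5.3, the sketch below applies the threshold to operator-level estimates of the two counts. Souffle derives the counts from the TE itself; here they are passed in directly to keep the sketch self-contained, and the GEMM numbers are illustrative:

    def is_memory_intensive(num_arith_ops, num_mem_accesses, threshold=3.0):
        # Compute-memory ratio: arithmetic operations per memory access.
        ratio = num_arith_ops / max(num_mem_accesses, 1)
        return ratio < threshold

    # A 64x64x64 GEMM performs about 2*M*N*K arithmetic operations over roughly
    # M*K + K*N + M*N tensor elements, so it is classified as compute-intensive,
    # while a reduce_sum over the same matrix touches one element per addition.
    M = N = K = 64
    print(is_memory_intensive(2 * M * N * K, M * K + K * N + M * N))  # False
    print(is_memory_intensive(M * N, M * N + M))                      # True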
    # Shapes: A1:(4,8), B1:(8,16), A2:(2,8), B2:(8,16)
    rk = te.reduce_axis((0, 8), name="rk")
    C1 = te.compute((4,16), lambda i,j: te.sum(A1[i,rk]*B1[rk,j], axis=[rk]))
    C2 = te.compute((2,16), lambda i,j: te.sum(A2[i,rk]*B2[rk,j], axis=[rk]))
    # Horizontal transformation
    C = te.compute((4+2, 16), lambda i, j:
        te.sum(tir.if_then_else(i<4, A1[i, rk], A2[i-4, rk]) *
               tir.if_then_else(i<4, B1[rk, j], B2[rk, j]), axis=[rk]))

Figure 3. Horizontal transformation for two GEMMs.

    # Before transformation
    A = te.placeholder((4, 8))
    B = te.compute((4,8), lambda i,j: tir.if_then_else(A[i,j]>0, A[i,j], 0))  # Relu
    C = te.compute((2,4), lambda i,j: B[2*i,j])                               # Strided_slice
    D = te.compute((4,2), lambda i,j: C[j,i])                                 # Permute
    # After transformation: a single semantic-preserving TE
    D = te.compute((4,2), lambda i,j: tir.if_then_else(A[j,2*i]>0, A[j,2*i], 0))

Figure 4. Example of vertical TE transformation.

5.4 TE Program Partitioning
Souffle tries to generate large kernels to maximize data reuse and eliminate extra kernel launches. However, using global synchronization imposes the constraint that the thread block count cannot exceed the maximum block count per wave. If this constraint cannot be satisfied, Souffle partitions the TE program into multiple TE subprograms. In Souffle, a TE subprogram serves as the fundamental unit for high-level TE transformation, middle-end schedule optimization, and back-end code generation. It can include several operators mapped to one GPU kernel.

Selection of partitioning points. We only consider compute-intensive operators as candidate partitioning points. Compute-intensive TEs typically use much more shared memory and registers than memory-intensive TEs. Excessive use of shared memory and registers pushes per-block resource occupancy up, which limits the maximum number of blocks per wave and makes global synchronization infeasible. Souffle transforms memory-intensive TEs and reuses their compute-intensive producer TE's schedule to achieve better data reuse (Sec. 6).

Getting the required resources. Souffle obtains the kernel launch dimension and the register/shared memory occupancy from the TE schedule produced by the schedule optimizer (Ansor in this work).

Partitioning algorithm. Souffle ensures that the resource constraint is satisfied during TE program partitioning using an analytical model. Given a GPU with a total of $C$ registers/shared memory, Souffle extracts the maximal launch dimension $max_{grid}$ and the maximal occupancy of registers/shared memory $max_{occ}$ over all compute-intensive TEs in the current TE subprogram. It then checks whether the constraint $max_{grid} * max_{occ} < C$ can be satisfied for all selected TEs within the subprogram. Souffle uses a greedy algorithm to partition the TE program, starting with an empty subprogram $S_j$ and using breadth-first search (BFS) to add TEs $te_i$ to $S_j$. If adding $te_i$ to $S_j$ violates the constraint, Souffle creates a new subprogram $S_{j+1}$, adds this TE to it, and repeats the process until all TEs have been allocated to a subprogram.

6 Semantic-preserving TE Transformations
After Souffle has collected the reuse and dependence information as described in Section 5, it looks for opportunities to automatically transform the TEs to improve performance within each TE subprogram. Souffle supports both horizontal and vertical TE transformations and transforms TEs within the same subprogram. Horizontal fusion fuses branches of operators into a single kernel [33, 52]; horizontal transformation in Souffle is similar to horizontal fusion and is applied to branches of independent TEs. Vertical transformation is similar to vertical fusion [37, 58] and is applied to multiple consecutive TEs with a one-relies-on-one data dependence. We stress that our horizontal and vertical transformations are more flexible than the fusion strategies used by IREE and Rammer [33], which will not fuse operators with one-relies-on-many operators into one kernel. Furthermore, the semantic-preserving transformation ensures that arithmetic operations (e.g., add, exp) are preserved while data dependence requirements are satisfied. In contrast, some transformations used by other DNN optimization approaches may not preserve the semantics; for example, TASO [20] optimizes the DNN graph by subgraph substitution and might replace an add with a concat+convolution.

6.1 Horizontal transformation for independent TEs
Souffle tries to transform multiple independent TEs into a single TE to increase parallelism. Souffle first compares the output tensor shapes of the independent TEs and tries to concatenate them into a single TE; the output tensors of multiple independent TEs are concatenated into one because each TE can only produce one output tensor. Souffle adds predicates based on the region of the output and rewrites the TE. Subsequently, Souffle assesses whether these TEs consume the same input tensor; notably, the opportunity for tensor reuse discussed in Sec. 5.1 is examined here. Consequently, the shared tensor only needs to be loaded once within the generated kernel, so both the number of kernels and the number of global memory transactions are reduced. Figure 3 gives an example of the horizontal TE transformation, where both TEs share the same reduction variable rk. The outputs of the first and second TEs have shapes (4, 16) and (2, 16) and can be concatenated along the first axis into a single tensor of shape (6, 16). If the outputs of independent TEs cannot be concatenated, Souffle adds an if_else statement to select the corresponding input tensor for the concatenated TEs, similar to Rammer [33].

6.2 Vertical transformation for one-relies-on-one TEs
Souffle vertically transforms TEs with one-relies-on-one data dependence into one TE to reduce the number of generated kernels and reuse data in registers. This is enabled by the quasi-affine map representation (Sec. 5.2). To this end, Souffle first transforms all one-relies-on-one TEs by applying the index mapping function from the child TEs to their parent TEs. It then applies the index mapping functions and creates a single semantic-preserving TE.
Assume we have $n$ TEs, $te_0, te_1, \ldots, te_i, \ldots, te_{n-1}$, where the output of $te_i$ is the input of $te_{i+1}$. The mapping function of $te_i$ can be expressed as $f_i(\vec{v}_i)$. The transformed TE's mapping function from $te_{i+1}$ to $te_i$ can then be computed as:

$$f_{i+1,i}(\vec{v}_i) = f_{i+1}(f_i(\vec{v}_i)) = M_{i+1} \times (M_i \vec{v}_i + \vec{c}_i) + \vec{c}_{i+1} \qquad (2)$$

For the example in Figure 4, the index mapping functions of the three one-relies-on-one TEs can be converted into a single, mathematically equivalent function (effectively reducing the number of TEs by 3x) as:

$$\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 2 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$

Using the method described above, Souffle iteratively refines multiple one-relies-on-one TEs from a set of consecutive TEs until no further refinement is possible. It then applies a schedule from the compute-intensive parent TE to attach the memory-intensive one-relies-on-one TEs to the compute-intensive TE, as described in the next subsection. Compared to the hand-crafted transformation rules used by TensorRT, Apollo and Ansor, our semantic-preserving transformation generalizes better.

6.3 Scheduling TEs
Souffle uses Ansor to generate optimized schedules for compute- and memory-intensive TEs. Note that we have already generated a schedule for compute-intensive TEs during TE program partitioning in Sec. 5.4. Souffle propagates the compute-intensive producer's schedule to memory-intensive TEs to maximize data reuse. For one-relies-on-one TEs, Souffle first schedules them based on their compute-intensive TE's tile size, then safely inlines them into their producer's compute statement. For one-relies-on-many TEs, Souffle reduces locally to reuse the producer's data in shared memory/registers, then uses atomic primitives to reduce across thread blocks.

6.4 Merging TE Schedules
After scheduling the TEs, Souffle merges the schedules of TEs within a subprogram to create a holistic function using TensorIR [15]. It adds predicates if the launch dimensions of TEs differ and inserts global synchronization primitives between TEs with a one-relies-on-many dependency. Finally, it performs several optimizations described in the next section.

6.5 Optimizations within a Subprogram
Souffle supports two types of optimizations within a TE subprogram. The first is instruction-level optimization, aiming to overlap asynchronous GPU memory loads with arithmetic instructions. Note that this pipelined execution is scheduled across TEs; without global data dependency analysis, the optimization cannot be done. The second is to reuse tensor buffers across TEs (and potentially across multiple original operators). Souffle performs subprogram-level optimization after the TE schedules within a subprogram have been generated by the underlying scheduler (Ansor in this work). It utilizes the global computation dependency analysis results (Sec. 5) to apply the two optimizations.

Instruction-level optimization. Souffle regroups instructions within a fused subprogram containing multiple original operators to execute memory and arithmetic instructions in parallel. This is accomplished by scheduling load/store and computation instructions for pipelined execution across operator boundaries. For instance, in the BERT model discussed in Section 2.1 and Figure 1(d), the Souffle-generated schedule issues the NVIDIA instructions LDGSTS.E.BYPASS.128 and HMMA.16818.F16 in parallel, where the former copies 128 bits from GPU global memory to shared memory and the latter computes GEMM on NVIDIA tensor cores.

Tensor reuse optimization. Souffle maximizes tensor buffer reuse across TEs with a simple software-managed cache, using a least-recently-used (LRU) policy to replace tensor buffers (e.g., matrices and vectors) in shared memory at runtime. It scans instructions linearly; if shared memory is exhausted, it spills shared memory to global memory and adds a memory barrier.

6.6 Putting it all together
Algorithm 1 outlines the TE transformation. It takes the TE program and the corresponding analysis results: OR (one-relies-on-one TEs), MR (one-relies-on-many TEs), MI (memory-intensive TEs), CI (compute-intensive TEs), TR (temporal-reuse tensor and TE tuples), and SR (spatial-reuse tensor tuples). It partitions the TEs based on the compute-intensive TEs' schedules against resource limits (lines 2-9). Then, within each partition, it applies horizontal transformation to exploit spatial reuse and vertical transformation to one-relies-on-one TEs (lines 11-12). It propagates the compute-intensive TEs' schedules to memory-intensive TEs and merges the schedules within each subprogram (lines 13-18). Finally, it optimizes temporal data reuse of tensors across original operator boundaries (lines 19-21).

6.7 Implementation Details
We implemented Souffle with 10K lines of C++ and 1K lines of Python code. We integrated Souffle with TVM [12], using Ansor [57] as its schedule optimizer; however, Souffle can work with other schedulers compatible with TEs. Souffle supports element-wise operators, broadcasts, reductions (including reduce_sum, GEMM and Conv), reorganization operators like reshape, and shuffle operators like transpose. Souffle does not support non-linear-algebra operators like TopK or Conditional. We use direct convolution, which is the default implementation in Ansor.
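To illustrate the tensor-reuse policy of Sec. 6.5, the following toy sketch models a fixed shared-memory budget with LRU eviction. The buffer names and sizes are invented for the example, and the real pass operates on TensorIR buffers rather than Python objects:

    from collections import OrderedDict

    class SharedMemoryPool:
        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.used = 0
            self.buffers = OrderedDict()          # name -> size, in LRU order

        def touch(self, name, size):
            """Reference a tensor buffer; returns the names spilled to global memory."""
            spilled = []
            if name in self.buffers:
                self.buffers.move_to_end(name)    # mark as most recently used
                return spilled
            while self.buffers and self.used + size > self.capacity:
                old, old_size = self.buffers.popitem(last=False)
                self.used -= old_size
                spilled.append(old)               # spill triggers a memory barrier
            self.buffers[name] = size
            self.used += size
            return spilled

    pool = SharedMemoryPool(capacity_bytes=48 * 1024)
    print(pool.touch("SO0", 16 * 1024), pool.touch("SI2", 40 * 1024))

A spilled buffer corresponds to writing the tensor back to global memory and inserting a memory barrier before it is next used.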
Algorithm 1: Semantic-preserving TE transformation
Input: TE program P, analysis results OR, MR, MI, CI, SR, TR
Output: Generated schedule sch
 1  sch = {}; SP = ∅;
 2  for e in BFS(P) do
 3      if e in CI then
 4          sch[e] = auto_schedule(e);
 5          if resource(sch[e]) > C then
 6              (SP_i, P) = split(P, e); SP = SP ∪ SP_i;
 7          end
 8      end
 9  end
10  for SP_i in SP do
11      SP_i = horizon_trans(e, SP_i) for e in SP_i and e in SR;
12      SP_i = verti_trans(e, SP_i) for e in SP_i and e in OR;
13      for e in SP_i and e in CI do
14          for e_i in dominated_by(e) and e_i in MI do
15              sch[(e, e_i)] = propagate_sch(e_i, sch[e], SP_i);
16              mark e_i as scheduled;
17          end
18          sch[SP_i] = sch[SP_i] ∪ sch[(e, e_i)];
19      end
20      for e in TR do
21          sch[SP_i] = optimize_tensor_reuse(e, sch[SP_i]);
22      end
23  end

Table 2. DNN models and datasets used in our evaluation.

Model (Dataset)           Model parameters
ResNeXt (ImageNet)        #layers: 101, bottleneck width: 64d
EfficientNet (ImageNet)   EfficientNet-b0 from the source publication
Swin-Trans. (ImageNet)    Base version, patch size: 4, window size: 7
BERT (SQuAD)              Base version with 12 layers, from TensorRT
LSTM (synthetic)          input length: 100, hidden size: 256, layers: 10
MMoE (synthetic)          Base model from [32]

7 Experimental Setup

7.1 Evaluation Platform and Workloads
Platform. Our evaluation platform is a GPU server with a dual-socket 20-core, 2.50 GHz Xeon Gold 6248 CPU, 768GB of DDR4 RAM, and a 40GB NVIDIA A100 GPU. The server runs Ubuntu 18.04.5 with Linux kernel 5.4.55. We use CUDA version 11.7 with "-O3" as the compiler option.

DNN workloads. We evaluated Souffle on the representative and diverse DNN workloads in Table 2. These include natural language processing (BERT [14]), computer vision (Swin-Transformer [31], Swin-Trans. for short) and knowledge discovery (MMoE [32]), which implements the latest mixture-of-experts DNN. They also include classic convolutional and recurrent networks like ResNeXt [55] and LSTM [18]. We use single-precision floating-point (FP32) for all operators, except for GEMM, for which we use half-precision floating-point (FP16) to exploit the tensor cores. We target model inference and set the batch size to one.

7.2 Competing Baselines
We compare Souffle against six strong baselines:

XLA (TensorFlow v2.10). The TensorFlow XLA compiler can fuse DNN operators like point-wise and reduction operators and performs optimizations on the fused operators. Unlike Souffle, which performs analysis and optimization on TEs, XLA performs analysis on its high-level operators (HLO).

TensorRT (v8.2). This GPU-vendor-specific framework optimizes the inference of DNNs on NVIDIA GPUs [38].

Ansor (TVM v0.8). This state-of-the-art DNN optimizer builds upon TVM. It uses auto-tuning techniques to find good tensor schedules from hand-crafted templates.

Rammer (v0.4). This DNN compiler is also known as NNFusion [33]. It generates a spatial-temporal schedule at compile time to minimize scheduling overhead and exploit hardware parallelism through inter- and intra-operator co-scheduling.

Apollo. This represents the state-of-the-art fusion framework for inference optimization [56]. Apollo considers both memory- and compute-bound tensor operators for kernel fusion and uses hand-crafted rules to exploit parallelism between independent tensor operators.

IREE (released on 30 Dec. 2022). The intermediate representation execution environment (IREE) builds upon the LLVM MLIR project [1, 30]. IREE is designed to lower DNN models to MLIR dialects to optimize model inference. It utilizes the linalg dialect to perform operator fusion, which supports loop affine fusion optimization and global analysis.

7.3 Performance Report
We use NVIDIA Nsight Compute to profile each DNN model's computation latency and record performance metrics. We found that the variance across different runs is less than 2%, and we report only the geometric mean.

8 Experimental Results
In this section, we first present the overall results of Souffle and the competing approaches, showing that Souffle outperforms all other schemes across the evaluated DNN models (Section 8.1).
Table 3. End-to-end model runtime (ms) - lower is better.

Model         XLA    Ansor   TRT    Rammer  Apollo  IREE    Ours
BERT          2.55   2.31    1.30   2.19    3.29    2.22    1.22
ResNeXt       8.91   20.50   24.82  11.69   22.80   314.8   4.43
LSTM          10.57  6.78    6.30   1.72    Failed  16.0    0.80
EfficientNet  2.96   0.91    1.21   Failed  2.3     12.33   0.66
Swin-Trans.   6.43   5.81    1.74   Failed  10.78   18.1    1.55
MMoE          0.29   0.034   0.070  Failed  0.049   0.088   0.014

Table 4. Execution time (ms) with Souffle's individual optimizations.

Model         V0     V1     V2     V3     V4
BERT          3.1    2.12   1.53   1.41   1.22
ResNeXt       29.0   5.90   4.43   4.43   4.43
LSTM          6.78   1.60   1.21   0.8    0.8
EfficientNet  4.2    0.91   0.72   0.63   0.63
Swin-Trans.   5.81   4.88   2.09   1.78   1.55
MMoE          0.05   0.019  0.016  0.014  0.014

Table 5. The number of GPU kernel calls and the global memory data transfer size (MB) of the resulting code.

              # of kernel calls           Memory transfer size (MB)
Model         TRT    Apollo  XLA   Ours   TRT    Apollo  Ours
BERT          120    240     216   24     361.8  880.5   226.8
ResNeXt       2406   1226    526   105    622.2  436.1   470.2
LSTM          662    Failed  3363  1      126.8  Failed  10.6
EfficientNet  187    273     332   66     96.4   127.4   86.6
Swin-Trans.   716    1014    3188  53     831.5  1309.0  282.9
MMoE          20     10      7     1      0.061  0.063   0.058

We then quantify the contribution of individual optimizations to the performance improvement for each DNN workload (Sections 8.2 and 8.3), compare Souffle with alternative schemes on selected workloads (Section 8.4), and discuss the negligible compilation overhead introduced by Souffle (Section 8.5).

8.1 Overall Performance
Table 3 gives the end-to-end execution time (in ms) of each DNN model running on an A100 GPU. Note that some compilers failed to compile and execute certain DNNs. Overall, Souffle outperforms the competing methods across all DNNs. Souffle builds upon TVM's Ansor, but it significantly boosts the performance of the native TVM + Ansor implementation, giving an average speedup of 3.9× (up to 8.5×) over Ansor. Furthermore, it improves on NVIDIA-tuned TensorRT by 3.7× on average (up to 7.9×), with similar performance improvements over Rammer, Apollo, and IREE. The results demonstrate that Souffle delivers consistent and robust performance improvements across DNN workloads.

Kernel or operator fusion techniques such as XLA, Rammer, and Apollo can surpass Ansor in certain scenarios, highlighting the importance of kernel fusion. However, these approaches can only merge a limited set of operators and lack efficient instruction-level optimizations across some operators, resulting in redundant computation.

Rammer relies on hand-crafted rules for operator fusion and can only merge sibling operators in the computation graph. It does not perform element-wise data dependence analysis or reuse tensor buffers, limiting its ability to optimize operators with shared input-output buffers. Similarly, XLA maps some computation-intensive operators (e.g., GEMM) to BLAS library calls and cannot merge such operators with others. XLA relies on hand-crafted rules for operator fusion and cannot optimize some computation patterns, such as merging two consecutive reduction operators in the BERT model.

Apollo also relies on loop fusion rules and can only merge two reductions with the same tile size. Moreover, it does not support schedules with global synchronization, further restricting the scheduler's optimization. IREE only performs producer-consumer fusion with parametric tile-and-fuse optimizations; as such, it cannot fuse computation-intensive operators (e.g., batch_matmul) to reduce GPU global memory accesses. By contrast, the horizontal and vertical transformations supported by Souffle are more flexible and can fuse operator patterns unsupported by IREE.

Compared to other kernel fusion techniques, Souffle can identify more data reuse opportunities by operating on TEs, which have simple and well-defined semantics and do not rely on inflexible fusion rules. Souffle can exploit data reuse across operators with differently shaped buffers and perform instruction-level optimizations for unfused operators. Souffle outperforms the competing baselines due to these advantages.

8.2 Performance Breakdown
We conducted a series of experiments to evaluate the performance benefits of Souffle's optimizations. We gradually activated our optimizations, starting from the TVM + Ansor generated code (V0) and then adding our horizontal TE transformation (V1), vertical TE transformation (V2), global synchronization via the global synchronization API (V3), and subprogram-level optimization (V4), as described in Sec. 6.1, 6.2, 6.4, and 6.5, respectively.

Table 4 reports the impact of the individual optimizations on inference time for each DNN model. Our horizontal and vertical TE transformation schemes benefit all DNN workloads, increasing SIMD parallelism and reducing memory accesses. Transformer-based BERT and Swin-Trans. also benefit from global synchronization and subprogram-level optimization, which enable overlapping loads with tensor core arithmetic instructions and tensor buffer reuse.

8.3 Analysis of Performance Advantages
We identified two reasons for Souffle's improved performance over TensorRT and Apollo: reduced GPU kernel calls
Figure 5. Example of fusion results for the EfficientNet sub-module: (a) unfused, (b) fused, (c) global-sync, (d) data-reuse.

Figure 6. EfficientNet sub-module latency breakdown: normalized speedup of the unfused, fused, global-sync and data-reuse versions over sub-modules M0 to M9 and their average.

and reduced GPU memory data transfers. We use a micro-benchmark taken from EfficientNet to illustrate the performance contribution of the two optimizations. The sub-module is the building block of EfficientNet and is repeated many times with different input sizes (marked M0 to M9). The pattern of this sub-module is common in many DNN models, and existing DNN frameworks fail to optimize it well. Fig. 5 shows four versions: (a) unfused, generating one kernel per TE; (b) fused, using Ansor's fusion; (c) Souffle's global-sync, generating the whole sub-module as one kernel but without any data reuse; and (d) Souffle's data reuse. Fig. 6 shows the normalized speedup of the four versions, with the horizontal axis listing the different sub-modules. Global-sync achieves a 1.31× speedup on average compared with Ansor's unfused version, with the performance improvement coming from the reduction in kernel calls and lightweight CUDA grid synchronization. Enabling data reuse further improves the average speedup from 1.31× to 1.84×. Souffle's reduced GPU kernel calls and increased data reuse can thus both significantly improve performance. However, it is non-trivial to separate their contributions for end-to-end models, as TE transformation and global synchronization may both reduce kernel calls and memory accesses. We therefore report the reduction in GPU kernel calls and GPU memory data transfers in the following.

Reduced GPU kernel calls. GPU kernel calls can be expensive: it takes around 2 μs to launch a kernel on an NVIDIA A100 GPU. Table 5 compares the number of kernel calls from TensorRT, Apollo, XLA and Souffle. Souffle can create large subprograms that result in fewer kernels because of its resource-aware TE program partitioning, which reduces the kernel launch overhead. For example, in BERT, Souffle reduces the number of kernels from 120 and 240 (generated by TensorRT and Apollo, respectively) to 24. Similar kernel call reductions are observed in the other DNN workloads. Operator fusion is one of the key features of XLA. Nonetheless, XLA leverages libraries such as cuBLAS to execute compute-intensive operators. Consequently, it faces limitations in fusing compute-intensive operators with memory-intensive counterparts, thereby hindering the potential reduction in kernel count. For instance, XLA generates 6 custom calls to invoke cuBLAS to run the GEMM operators of one BERT layer, whereas Souffle seamlessly propagates the schedule of compute-intensive TEs to memory-intensive TEs and generates one kernel.

Reduced GPU memory data transfers. GPU global memory data transfers are known to be expensive, and it is desirable to reduce the amount of data transferred from global memory. To do so, Souffle maximizes tensor buffer reuse through TE program partitioning (Sec. 5.4) and TE transformation (Sec. 6). Table 5 also compares the amount of GPU global memory data transfer measured by Nsight Compute for TensorRT, Apollo, and Souffle. Souffle-generated code incurs significantly fewer data transfers than TensorRT and Apollo. For example, in BERT, Souffle reduces the memory transactions from 361.8MB and 880.5MB (loaded by TensorRT and Apollo, respectively) to 226.8MB.

Consider the performance of TensorRT and Souffle again when optimizing BERT. As in Sec. 2, we classify the computation kernels in BERT into compute-intensive (like GEMM) and memory-intensive kernels (like softmax), and measure the execution latency of each kernel. Souffle is more flexible in fusing operators, which reduces the number of kernels and the kernel invocation overhead compared to TensorRT. For example, TensorRT maps a BERT layer to 10 kernels, while Souffle can partition one layer into two kernels and perform instruction-level optimization. Souffle reduces the memory-intensive kernel latency of one BERT layer from 31.0 μs (in TensorRT) to 25.5 μs by buffering intermediate results in fast memory and GPU registers.

We also examined IREE's fusion performance on BERT. IREE misses two optimization opportunities for BERT: it fuses neither the GEMM and softmax operators nor several GEMM operators with each other. IREE launches 180 kernels and takes 2.22 ms for execution. In comparison, Souffle launches 24 kernels and takes 1.22 ms.

8.4 Case Study on LSTM
Following the discussion in Sec. 8.3, we conducted studies on the LSTM model to reveal new optimization opportunities offered by Souffle, which achieves a performance improvement of 4.3× over TensorRT and 2.2× over Rammer. We compare Souffle with Rammer, the most performant baseline, as discussed in Sec. 8.3. Fig. 7 shows the fusion strategy used by Rammer and Souffle for an LSTM with 10 cells (listed vertically in Fig. 7). Each cell has its dedicated weight tensors (marked W and U in Fig. 7), hidden states (h) and output (c). In each time step t, the n-th cell performs a general matrix-vector multiplication (GEMV for short) using

8.4 Case Study on LSTM

Following the discussion in Sec. 8.3, we conducted studies on the LSTM model to reveal new optimization opportunities offered by Souffle, which achieved a performance improvement of 4.3× over TensorRT and 2.2× over Rammer. We compared Souffle with Rammer, the most performant baseline, as discussed in Sec. 8.3. Fig. 7 shows the fusion strategy used by Rammer and Souffle for an LSTM with 10 cells (listed vertically in Fig. 7). Each cell has its dedicated weight tensors (marked as 𝑊 and 𝑈 in Fig. 7), hidden state (ℎ) and output (𝑐). In each time step 𝑡, the 𝑛-th cell performs a general matrix-vector multiplication (GEMV for short) using its weight tensors (𝑊𝑛 and 𝑈𝑛), its own hidden state (ℎ𝑛) and the output (𝑐𝑛−1) of the (𝑛 − 1)-th cell; it then updates its hidden state (ℎ𝑛) and generates its output (𝑐𝑛) for the current time step. Fig. 7 also shows the fully unrolled time-step loop. The LSTM operators along the diagonal are independent, i.e., no data dependency exists between them. Both Souffle and Rammer exploit this optimization opportunity, i.e., wavefront parallelism, and fuse the GEMV computations into different blocks of a kernel.
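Writing out the dependence pattern just described (our shorthand only; the full LSTM gate equations [18] are omitted), the 𝑛-th cell at time step 𝑡 consumes its own state from step 𝑡 − 1 and the output of cell 𝑛 − 1 at step 𝑡:

```latex
\bigl(h_n^{(t)},\; c_n^{(t)}\bigr) \;=\; f\!\bigl(W_n, U_n;\; h_n^{(t-1)},\; c_{n-1}^{(t)}\bigr)
```

Under this dependence, operators with the same value of 𝑛 + 𝑡 do not depend on one another, which is the wavefront parallelism both systems exploit, while 𝑊𝑛 and 𝑈𝑛 are read at every 𝑡, which is the temporal reuse that Souffle turns into on-chip buffering.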
[Figure 7: panels (a) Rammer and (b) SOUFFLE; each panel lays out LSTM cells (n=0 to n=9, each with weights W+U, hidden state h and output c) against time steps (t=0 to t=99); legend: GEMV, reduction operator (e.g., reduce_sum), atomic add, global sync, LSTM cell, tensor data, computation kernel.]
Figure 7. How Rammer (a) and Souffle (b) map an LSTM graph into computation kernels.

With the TE-based global analysis, Souffle discovers that the weight tensors (𝑊 and 𝑈) of each LSTM cell are reused across all time steps (temporal reuse). It utilizes global synchronization and generates one kernel for the entire model, as shown in the right part of Fig. 7. On the other hand, the Rammer version needs to load the weight tensors in every wavefront, resulting in a longer execution time compared to Souffle. We measured GPU global memory data transfer and pipeline utilization for the optimized LSTM. As Table 6 shows, Souffle-optimized code reduces memory loads by orders of magnitude compared to Rammer's version (21 MB vs. 1911 MB) and increases pipeline utilization for both the load-store unit (LSU) and the fused multiply-add unit (FMA).

Table 6. GPU performance counter values for LSTM optimized by Rammer and Souffle.

Metrics                        Rammer      Souffle
GPU global memory transfers    1911.0 MB   21.11 MB
Pipeline utilization (LSU)     20.2%       35.4%
Pipeline utilization (FMA)     8.0%        19.0%
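A heavily simplified persistent-kernel sketch of this temporal weight reuse is shown below. It is our illustration rather than Souffle's generated code: the LSTM gates are replaced by a plain matrix-vector recurrence, the hidden size H and the kernel name are assumptions, and the cooperative launch uses the same mechanism as the earlier grid-sync sketch. Each block owns one cell, loads that cell's weights into shared memory once, and then iterates over all time steps inside the kernel, with a grid-wide barrier separating steps instead of per-step kernel launches.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int H = 64;  // hidden size (assumed small enough for shared memory)

__global__ void persistent_cells(const float* __restrict__ W,  // [cells][H][H]
                                 float* __restrict__ h0,       // [cells+1][H], states at even steps
                                 float* __restrict__ h1,       // [cells+1][H], states at odd steps
                                 int steps) {
    cg::grid_group grid = cg::this_grid();
    __shared__ float Ws[H * H];
    const int cell = blockIdx.x;

    // Load this cell's weights once; they stay on chip across all time steps.
    for (int i = threadIdx.x; i < H * H; i += blockDim.x)
        Ws[i] = W[cell * H * H + i];
    __syncthreads();

    for (int t = 0; t < steps; ++t) {
        const float* hin  = (t % 2 == 0) ? h0 : h1;  // previous step's states
        float*       hout = (t % 2 == 0) ? h1 : h0;  // this step's states
        // GEMV: cell n consumes slot n (written by cell n-1 in the previous
        // step; slot 0 holds the model input, initialized by the host in
        // both buffers) and produces slot n+1 for the next cell.
        for (int r = threadIdx.x; r < H; r += blockDim.x) {
            float acc = 0.0f;
            for (int c = 0; c < H; ++c)
                acc += Ws[r * H + c] * hin[cell * H + c];
            hout[(cell + 1) * H + r] = acc;
        }
        grid.sync();  // one barrier per time step instead of one launch per step
    }
}
// Cooperative launch, one block per cell (hypothetical sizes):
//   void* args[] = { &W, &h0, &h1, &steps };
//   cudaLaunchCooperativeKernel((void*)persistent_cells, dim3(10), dim3(128), args, 0, 0);
```

Because the weights never leave on-chip memory after the initial load, the per-step global-memory traffic drops to the hidden-state vectors alone, which is the effect reflected in Table 6.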
8.5 Compilation Overhead

Souffle employs Ansor and TVM for schedule search and code generation. The compilation overhead of Souffle + Ansor is mainly the time required for searching the program schedule using the native Ansor implementation. The additional overhead introduced by Souffle comes from two-level dependence analysis, model splitting, schedule tuning, and global optimization. Our measurements on the six evaluated models indicate that Souffle adds up to 63 s of overhead on top of Ansor, which is negligible compared to the hours Ansor requires for schedule search. This overhead can be reduced by using a faster optimizer like Roller [60], which is orthogonal to Souffle.

9 Discussion

Expression power of TE. Souffle relies on the expression power of tensor expressions, which currently does not cover all DNN operators; e.g., it does not support resize. Souffle maps these TE-unsupported operators to a computation kernel and uses the back-end operator library implementation, but without fusing them with other operators. Given the active developer community of TVM, we expect this limitation to be addressed by future TVM releases.

Cost model for TE program partitioning. Souffle extracts tensor information by compiling the raw TE program. This could be improved by building a cost model [53] to estimate occupancy directly from the TE program.

Reusing dynamic-shaped tensors. Certain DNN operators have unknown tensor shapes at compile time. Our current implementation does not support reusing tensors of dynamic shapes. To address this, we can generate multiple versions of a kernel and choose the appropriate one based on shape information available at execution time.
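A minimal host-side sketch of this multi-version idea is shown below (our illustration, not Souffle's API; the shape buckets, the template kernel and the launch_scale helper are hypothetical):

```cuda
// Several specializations of the same kernel, selected once the runtime shape is known.
template <int BLOCK>
__global__ void scale_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * BLOCK + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

void launch_scale(const float* in, float* out, int n) {
    // Hypothetical shape buckets; a real system would derive them from the
    // tuner's search space or from profiling.
    if (n <= (1 << 12))
        scale_kernel<128><<<(n + 127) / 128, 128>>>(in, out, n);
    else if (n <= (1 << 18))
        scale_kernel<256><<<(n + 255) / 256, 256>>>(in, out, n);
    else
        scale_kernel<512><<<(n + 511) / 512, 512>>>(in, out, n);
}
```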
Fusion in DL training. DL compilers like TensorFlow XLA also enable operator fusion in training (forward inference and backward parameter updates). Our TE-based transformation can be integrated into DL compilers to accelerate the forward and backward passes during training. However, intermediate tensors must be kept in global memory in DL training for gradient-based optimizers like Adam [24], restricting the opportunities for operator fusion. Our main focus is optimizing model inference after DNN training. Support for TE transformation in DL training is left for future work.

Slowdown. A performance slowdown can occur when Souffle extends the schedule from compute-intensive TEs to memory-intensive reduction TEs (discussed in Sec. 6.3). This introduces synchronization between blocks, potentially hampering parallelism for the reduction TEs. A potential remedy is a cost model that decides whether fusing these TEs is beneficial.

10 Related work

Loop and kernel fusion. Loop fusion is commonly used to improve the performance of CPU programs [3, 8, 9, 23]. Recent research has also utilized kernel fusion to optimize GPU programs by reducing data traffic to/from off-chip memory [42, 50]. Various domain-specific kernel fusion policies have been proposed for workloads like data center applications [54], mesh computing [6], machine learning workloads [4] and image processing [35]. Souffle leverages loop fusion to optimize DNN inference through compiler transformations, building on these previous research efforts.

Operator fusion in DNN compilers. Operator fusion can enhance performance by improving data reuse and reducing off-chip data traffic. To seek out fusion opportunities, DNNFusion classifies operators and defines rules based on their classification [37]. AStitch [58] and Rammer [33] fuse independent operators to leverage inter-operator parallelism, while DeepCuts [22] uses rules to fuse kernels based on GPU hardware parameters. Apollo [56] adopts a partition-based approach to search for fusion opportunities within sub-graphs. However, these approaches rely on hand-crafted rules that require extensive engineering effort and may miss optimization opportunities, as discussed in Sec. 2. Jeong et al. [19] proposed a dynamic programming algorithm to decide whether to pin or fuse an activation map on DNN accelerators with a global buffer. DNNFusion [37] classifies operators based on the element-wise mappings from input to output, but it cannot fuse many-to-many with many-to-many operators (like GEMM and Softmax), whereas Souffle can, further reducing kernel launch overhead. Furthermore, DNNFusion lacks a global analysis of tensor reuse and may miss the temporal and spatial data reuse opportunities that are critical to performance, as shown in Sec. 8.2. Souffle improves upon previous operator fusion techniques by utilizing control- and data-flow analysis on the tensor dependency graph to partition TEs into subprograms. TEs have clear and simple relations, and they can be combined to represent numerous DNN operators. Souffle leverages the well-defined semantics of TEs to perform precise data dependence analysis for instruction scheduling and data reuse optimization. Additionally, Souffle applies semantic-preserving transformations to refine TEs. Its optimizations also generalize better, as TEs can be combined to represent more complex operators.

Global analysis and fusion optimization. TensorFlow XLA [27] and MLIR [43] also conduct global analysis on the input program graph. XLA utilizes profitability analysis on its high-level operations intermediate representation before deciding on tiling and fusion. However, XLA relies on hand-crafted heuristics to fuse operators, which can be challenging for high-level operators. For instance, XLA's fusion heuristic cannot fuse two consecutive reduction operators in the BERT model. Moreover, as XLA operates at the operator level and some operators are mapped to low-level library calls, it cannot optimize across libraries. In contrast, Souffle takes a different approach by lowering high-level operators into lower-level tensor expressions (TEs), which have concise semantics. Operating on TEs rather than high-level operators enables Souffle to optimize flexibly across operator boundaries. Unlike XLA, Souffle can merge GEMM and Softmax operators and optimize across reduction operators. The MLIR -affine-loop-fusion pass utilizes a slicing-based method to identify producer-consumer and sibling fusion opportunities. Souffle implements a lightweight, specialized global analysis on TEs, which can be easily integrated into DNN inference engines. Moreover, Souffle offers more optimization opportunities than just fusion. For example, it enables joint optimizations across multiple compute-intensive TEs in a TE subprogram and facilitates horizontal and vertical transformations for sibling fusion.

Optimizing individual operators. Numerous compiler-based approaches exist to optimize individual operators, including TVM [12, 51], XLA [27], Tiramisu [5], and TACO [25]. These compilers often represent operators in high-level forms such as TEs or linear algebra, enabling aggressive optimization through domain-specific knowledge without complex analysis. Souffle is orthogonal to these techniques.

11 Conclusion

We have presented Souffle, a top-down compiler-based approach for improving DNN inference. Souffle identifies optimization opportunities across DNN operators by performing data-flow analysis on the entire tensor dependence graph built from tensor expressions. It groups tensor expressions into subprograms and performs local optimization through semantics-preserving transformations, instruction scheduling, and tensor buffer reuse. We evaluated Souffle on six DNN models using an NVIDIA A100 GPU and compared it to six state-of-the-art DNN optimizing frameworks. Souffle outperformed them with a speedup of up to 7.9× over TensorRT.

Acknowledgments

We thank our shepherd, Vinod Grover, and the anonymous reviewers for their constructive feedback. This work was supported in part by the National Key R&D Program of China under grant agreement 2021ZD0110101, the National Natural Science Foundation of China (NSFC) under grant agreements T2222026, 22003073, 62232015, and 62090024, the Innovation Funding of ICT CAS under grant agreement E361010, a Beijing Nova Program, and the UK Engineering and Physical Sciences Research Council (EPSRC) under grant agreement EP/X018202/1. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising from this submission.

References

[1] [n. d.]. IREE: Intermediate Representation Execution Environment. https://github.com/iree-org/iree.
[2] Martín Abadi, Paul Barham, Jianmin Chen, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, 265–283. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
[3] Aravind Acharya, Uday Bondhugula, and Albert Cohen. 2018. Polyhedral auto-transformation with no integer linear programming. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. 529–542.
[4] Arash Ashari, Shirish Tatikonda, Matthias Boehm, Berthold Reinwald, Keith Campbell, John Keenleyside, and P Sadayappan. 2015. On optimizing machine learning workloads via kernel fusion. ACM SIGPLAN Notices 50, 8 (2015), 173–182.
[5] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. 2019. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 193–205. https://doi.org/10.1109/CGO.2019.8661197
[6] Carlo Bertolli, Adam Betts, Paul HJ Kelly, Gihan R Mudalige, and Mike B Giles. 2012. Mesh independent loop fusion for unstructured mesh applications. In Proceedings of the 9th conference on Computing Frontiers. 43–52.
[7] U. Bondhugula, A. Acharya, and A Cohen. 2016. The Pluto+ Algorithm: A Practical Approach for Parallelization and Locality Optimization of Affine Loop Nests. In ACM Transactions on Programming Languages and Systems.
[8] Uday Bondhugula, Oktay Gunluk, Sanjeeb Dash, and Lakshminarayanan Renganarayanan. 2010. A model for fusion and code motion in an automatic parallelizing compiler. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques. 343–352.
[9] Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation. 101–113.
[10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
[11] Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. 2018. The rise of deep learning in drug discovery. Drug Discovery Today 23, 6 (2018), 1241–1250. https://doi.org/10.1016/j.drudis.2018.01.039
[12] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, 578–594. https://www.usenix.org/conference/osdi18/presentation/chen
[13] ONNX Runtime developers. 2021. ONNX Runtime. https://onnxruntime.ai/. Version: x.y.z.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[15] Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen. 2023. TensorIR: An Abstraction for Automatic Tensorized Program Optimization. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2023, Tor M. Aamodt, Natalie D. Enright Jerger, and Michael M. Swift (Eds.). ACM, 804–817. https://doi.org/10.1145/3575693.3576933
[16] Tianfan Fu, Cao Xiao, Cheng Qian, Lucas M. Glass, and Jimeng Sun. 2021. Probabilistic and Dynamic Molecule-Disease Interaction Modeling for Drug Discovery. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (Virtual Event, Singapore). 404–414. https://doi.org/10.1145/3447548.3467286
[17] Kim Hazelwood, Sarah Bird, et al. 2018. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 620–629. https://doi.org/10.1109/HPCA.2018.00059
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[19] Hyuk-Jin Jeong, JiHwan Yeo, Cheongyo Bahk, and JongHyun Park. 2023. Pin or Fuse? Exploiting Scratchpad Memory to Reduce Off-Chip Data Transfer in DNN Accelerators. In Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2023). Association for Computing Machinery, 224–235. https://doi.org/10.1145/3579990.3580017
[20] Zhihao Jia, Oded Padon, et al. 2019. TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP '19). 47–62. https://doi.org/10.1145/3341301.3359630
[21] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). 1–12. https://doi.org/10.1145/3079856.3080246
[22] Wookeun Jung, Thanh Tuan Dao, and Jaejin Lee. 2021. DeepCuts: a deep learning optimization framework for versatile GPU workloads. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 190–205.
[23] Ken Kennedy and Kathryn S McKinley. 1993. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In International Workshop on Languages and Compilers for Parallel Computing. Springer, 301–320.
[24] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980
[25] Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. 2017. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 943–948.
[26] Anna Larionova, Polina Kazakova, and Nikita Nikitinsky. 2019. Deep Structured Semantic Model for Recommendations in E-commerce. In Hybrid Artificial Intelligent Systems - 14th International Conference, HAIS 2019, León, Spain, September 4-6, 2019, Proceedings (Lecture Notes in Computer Science, Vol. 11734). Springer, 85–96. https://doi.org/10.1007/978-3-030-29859-3_8
[27] Chris Leary and Todd Wang. 2017. XLA: TensorFlow, compiled. TensorFlow Dev Summit.
[28] Shih-Chieh Lin, Yunqi Zhang, et al. 2018. The Architectural Implications of Autonomous Driving: Constraints and Acceleration. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018. ACM, 751–766. https://doi.org/10.1145/3173162.3173191
[29] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. 2018. Progressive Neural Architecture Search. In Computer Vision – ECCV 2018. Springer International Publishing, 19–35.
[30] Hsin-I Cindy Liu, Marius Brehler, Mahesh Ravishankar, Nicolas Vasilache, Ben Vanik, and Stella Laurenzo. 2022. TinyIREE: An ML Execution Environment for Embedded Systems From Compilation to Deployment. IEEE Micro 42, 5 (sep 2022), 9–16. https://doi.org/10.1109/MM.2022.3178068

[31] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. CoRR abs/2103.14030 (2021). arXiv:2103.14030 https://arxiv.org/abs/2103.14030
[32] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H. Chi. 2018. Modeling Task Relationships in Multi-Task Learning with Multi-Gate Mixture-of-Experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). 1930–1939. https://doi.org/10.1145/3219819.3220007
[33] Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. [n. d.]. Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 881–897. https://www.usenix.org/conference/osdi20/presentation/ma
[34] Microsoft. 2022. Antares. https://github.com/microsoft/antares/tree/latest.
[35] Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines. ACM Trans. Graph. 35, 4, Article 83 (jul 2016), 11 pages. https://doi.org/10.1145/2897824.2925952
[36] Multi-Level IR Compiler Framework committee. 2022. 'affine' Dialect.
[37] Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. 2021. DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI 2021). 883–898. https://doi.org/10.1145/3453483.3454083
[38] NVIDIA Corporation. 2021. TensorRT. https://developer.nvidia.com/tensorrt.
[39] NVIDIA Corporation. 2022. NVIDIA Nsight Compute.
[40] NVIDIA Corporation. 2023. CUDA Grid Synchronization. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#grid-synchronization.
[41] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, et al. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. CoRR abs/1912.01703 (2019). arXiv:1912.01703 http://arxiv.org/abs/1912.01703
[42] Bo Qiao, Oliver Reiche, Frank Hannig, and Jürgen Teich. 2019. From loop fusion to kernel fusion: A domain-specific approach to locality optimization. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 242–253.
[43] Tatiana Shpeisman and Chris Lattner. 2019. Mlir: Multi-level intermediate representation for compiler infrastructure.
[44] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. [n. d.]. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (Montreal, Canada) (NIPS'14). MIT Press, 3104–3112.
[45] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. 4278–4284.
[46] Sanket Tavarageri, Alexander Heinecke, Sasikanth Avancha, Bharat Kaul, Gagandeep Goyal, and Ramakrishna Upadrasta. 2021. PolyDL: Polyhedral Optimizations for Creation of High-performance DL Primitives. ACM Trans. Archit. Code Optim. 18, 1 (2021), 11:1–11:27. https://doi.org/10.1145/3433103
[47] Tianqi Chen. 2022. Working with Operators Using Tensor Expression. https://tvm.apache.org/docs/tutorial/tensor_expr_get_started.html.
[48] Nicolas Vasilache, Oleksandr Zinenko, et al. 2018. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. CoRR abs/1802.04730 (2018). arXiv:1802.04730 http://arxiv.org/abs/1802.04730
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR abs/1706.03762 (2017). arXiv:1706.03762 http://arxiv.org/abs/1706.03762
[50] Mohamed Wahib and Naoya Maruyama. 2014. Scalable kernel fusion for memory-bound GPU applications. In SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 191–202.
[51] Huanting Wang, Zhanyong Tang, et al. 2022. Automating Reinforcement Learning Architecture Design for Code Optimization. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction (Seoul, South Korea) (CC 2022). Association for Computing Machinery, New York, NY, USA, 129–143. https://doi.org/10.1145/3497776.3517769
[52] Shang Wang, Peiming Yang, Yuxuan Zheng, Xin Li, and Gennady Pekhimenko. 2021. Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models. In Proceedings of Machine Learning and Systems 2021. mlsys.org. https://proceedings.mlsys.org/paper/2021/hash/a97da629b098b75c294dffdc3e463904-Abstract.html
[53] Zheng Wang and Michael O'Boyle. 2018. Machine learning in compiler optimization. Proc. IEEE 106, 11 (2018), 1879–1901.
[54] Haicheng Wu, Gregory Diamos, Jin Wang, Srihari Cadambi, Sudhakar Yalamanchili, and Srimat Chakradhar. 2012. Optimizing data warehousing applications for GPUs using kernel fusion/fission. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, 2433–2442.
[55] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated Residual Transformations for Deep Neural Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5987–5995. https://doi.org/10.1109/CVPR.2017.634
[56] Jie Zhao, Xiong Gao, et al. 2022. Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization. In Proceedings of Machine Learning and Systems 2022. https://proceedings.mlsys.org/paper/2022/hash/069059b7ef840f0c74a814ec9237b6ec-Abstract.html
[57] Lianmin Zheng, Chengfan Jia, et al. [n. d.]. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In 14th USENIX Symposium on Operating Systems Design and Implementation, 2020. 863–879. https://www.usenix.org/conference/osdi20/presentation/zheng
[58] Zhen Zheng, Xuanda Yang, et al. 2022. AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 359–373.
[59] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2018. Deep Interest Evolution Network for Click-Through Rate Prediction. https://doi.org/10.48550/ARXIV.1809.03672
[60] Hongyu Zhu, Ruofan Wu, et al. 2022. ROLLER: Fast and Efficient Tensor Compilation for Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, 233–248. https://www.usenix.org/conference/osdi22/presentation/zhu
