the hardware complexity and have become the standard method for writing DNN code. However, using high-level programming abstractions presents significant challenges for low-level performance optimization, especially during model inference, when a trained model is deployed in a production environment where response time is crucial [17, 21].

Efforts have been made to perform optimizations across operator boundaries to increase parallelism, decrease memory access traffic, or utilize memory bandwidth more efficiently. One promising approach is operator/kernel fusion, which merges multiple operators into a single kernel to enable analysis and optimization across operators. This line of research includes works using hand-crafted rules [37], loop analysis [56], or just-in-time compilation [58] to guide and perform fusion. Typically, these methods use a bottom-up strategy: they first perform operator fusion on the graph representation to merge multiple operators into a partition, and then generate an optimized kernel for each partition. However, a key challenge is determining the optimal boundaries of partitions, i.e., which operators should be fused together.

Despite the success of bottom-up approaches to operator/kernel fusion, optimization opportunities can still be overlooked. One such issue arises from separating the operator fusion and code generation stages. This can result in operators being misplaced into different kernels, leading to extra memory access overhead and preventing otherwise possible optimizations. As we will show in this paper, state-of-the-art kernel fusion methods can miss important optimization opportunities, leaving much room for improvement.

We present Souffle, a novel top-down approach for optimizing inference across DNN operator boundaries. Unlike bottom-up strategies, Souffle first processes the whole computation graph as a single, merged kernel and then divides the graph into partitions, i.e., subprograms, through a global analysis from the top level, considering data reuse in shared memory/registers and the generated instructions when determining partitioning boundaries. Each partition is then organized into a kernel. Afterwards, at the bottom level, Souffle performs a series of semantic-preserving transformations on each subprogram to simplify its tensor expressions and eliminate redundant memory accesses in the corresponding kernel. To this end, Souffle introduces two new mechanisms: a tensor-expression-based global analysis to identify critical partitioning points, and a semantic-preserving transformation approach that uses affine transformations to simplify the tensor expressions of each subprogram. Compared with existing bottom-up fusion approaches, the benefit of our top-down approach is that it globally determines the kernel boundaries by considering the generated code of the kernels.

Tensor-expression-based global analysis. Souffle conducts global dependence analysis on a tensor dependency graph generated from the entire DNN model. It utilizes tensor expressions (TEs) [12] to encode the dataflow information of operators and tensors. By mapping higher-level operators to simpler TEs, Souffle performs data-flow analysis and code optimization around these TEs, simplifying the complexity of analysis and optimization and resulting in better code. TEs offer concise semantics, allowing us to translate the task of analyzing and optimizing complex operators into a more manageable problem of analyzing and optimizing simpler mathematical expressions. For instance, a softmax operator can be represented by two TEs with simpler data dependence relationships: one is a one-relies-on-many TE (reduction), and the other is a one-relies-on-one TE (element-wise). Since Souffle's analysis is conducted on the TEs without making any assumptions about low-level library calls, it can optimize across complex operators, even when the operators have complex data dependencies such as many-to-many, where other methods fail to do so.
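To make the softmax example concrete, the following is a minimal TVM tensor-expression sketch of this decomposition (our illustration; the tensor names and shapes are assumptions, not Souffle's actual lowering). The row-wise sum is a one-relies-on-many (reduction) TE, and the normalization is a one-relies-on-one (element-wise) TE:

```python
import tvm
from tvm import te

rows, cols = 64, 64
x = te.placeholder((rows, cols), dtype="float32", name="x")

# One-relies-on-many TE: every output element denom[i] depends on a whole
# row of x, expressed through the reduction axis k.
k = te.reduce_axis((0, cols), name="k")
denom = te.compute((rows,), lambda i: te.sum(te.exp(x[i, k]), axis=k), name="denom")

# One-relies-on-one TE: softmax[i, j] depends on exactly one element of x
# (plus the already-reduced denom[i]), so the output-to-input index map is a
# simple affine (here, identity) relation.
softmax = te.compute((rows, cols), lambda i, j: te.exp(x[i, j]) / denom[i], name="softmax")
```

A numerically stable softmax would add a row-max reduction TE as well, but the two-TE form above is enough to show the two dependence classes that Souffle distinguishes.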
Semantic-preserving transformation. After the top-level stage, the computation graph has been divided into multiple subprograms, with each subprogram mapped to a kernel. However, each subprogram still contains a large number of TEs, which would introduce many redundant memory accesses across these TEs. Therefore, Souffle applies affine transformations to combine multiple TEs into a single TE, eliminating the redundant memory accesses. This process is performed within the subprograms and relies on the TE-based global analysis. The transformation is fully automated and flexible, as the tensor expression precisely describes the mathematical computation of operators in a simple form.
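As a small, hypothetical illustration of what combining TEs buys (our sketch in TVM tensor expressions, not Souffle's transformation pass): two chained one-relies-on-one TEs materialize an intermediate tensor, whereas a single combined TE with the same semantics keeps the intermediate value in registers.

```python
import tvm
from tvm import te

n = 1024
a = te.placeholder((n,), dtype="float32", name="a")
b = te.placeholder((n,), dtype="float32", name="b")

# Two separate element-wise TEs: if each becomes its own kernel, the
# intermediate tensor t is stored to and re-loaded from global memory.
t = te.compute((n,), lambda i: a[i] + b[i], name="t")
u = te.compute((n,), lambda i: te.exp(t[i]), name="u")

# One combined TE with identical semantics: the add result never touches
# global memory, removing the redundant store/load of t.
fused = te.compute((n,), lambda i: te.exp(a[i] + b[i]), name="fused")
```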
Putting it all together. Souffle first conducts data-flow analysis on the tensor dependency graph of the entire DNN model using TEs. This analysis captures essential information such as tensor shapes and live ranges across operator boundaries, allowing for precise element-wise analysis to infer data dependences. Souffle then partitions the TEs into subprograms using compiler heuristics and conducts local optimization within each subprogram using semantic-preserving mathematical transformations to reduce memory accesses. The optimized subprogram schedule is found by considering the computation characteristics of the subprogram's TEs. With precise dependence information at the TE level, Souffle can optimize memory access latency by reusing tensor buffers and improve instruction-level parallelism by overlapping memory load and arithmetic instructions. Since Souffle's code optimizations are based on subprograms of fused operators rather than individual operators, the optimization boundary between operators is eliminated.

Evaluation. We have implemented a working prototype of Souffle¹ on TVM. We evaluate Souffle on six DNN models with diverse and representative model architectures and compare it against six state-of-the-art DL optimizing and kernel fusion frameworks, including XLA [27], Ansor [57], TensorRT [38], Rammer [33], Apollo [56], and the MLIR-based IREE compiler [1]. Our evaluation, performed on an NVIDIA A100 GPU, shows that Souffle outperforms existing solutions, delivering a geometric mean speedup of up to 3.7× and 7.8× over TensorRT and XLA, respectively. Souffle is highly flexible and can fuse operators where state-of-the-art kernel fusion strategies fail. It is compatible with TensorFlow and ONNX [13] models, and can be integrated with general DL compilers like TVM [12, 57].

¹ The data and code associated with this paper are openly available at: https://github.com/SOUFFLE-AE/SOUFFLE.git.
[Figure 1 graphic: panels (a)-(c) show the kernel mappings produced by TensorRT, Apollo, and Souffle for the BERT subgraph; panel (d) shows pipelined execution across the computation-intensive kernels (GEMM2: I2×W2 and GEMM3: O2'×W3, with loads of I2, W2, W3 and stores of O2, O2', O3), without optimization and with Souffle's optimization. The legend distinguishes element-wise memory operators (e.g., reshape), element-wise arithmetic operators (e.g., add or exp), reduction operators (e.g., reduce_sum), GEMM, atomic add, global sync, tensor data, computation kernels, and read-after-write / write-after-read dependencies.]
Figure 1. How TensorRT (a), Apollo (b) and Souffle (c) map a BERT computation graph into kernels. The Souffle optimization
leads to fewer GPU memory accesses and faster execution time than TensorRT and Apollo.
Table 1. Performance for the generated kernels of Fig. 1.

                                   TensorRT   Apollo   Souffle
Total execution time (µs)             62.34   179.07     57.73
 – Computation-intensive kernels      31.29    61.1      41.77
 – Memory-intensive kernels           31.0    117.97     15.96
#Kernels                                  7       14         1
#Bytes loaded from global (MB)        16.52    27.78      8.87

Contributions. This paper makes the following contributions:
• It presents a new top-down approach for identifying and exploiting optimization opportunities across operator boundaries (Sec. 5);
• It shows how to effectively leverage the global analysis to perform local optimization at the kernel level, represented as tensor expressions (Sec. 6);
• It demonstrates how low-level tensor expressions can be employed to perform instruction optimizations beyond operator boundaries (Sec. 6.5).

2 Motivation

2.1 Working Example

As a motivating example, consider optimizing a standard BERT model [14] on an NVIDIA A100 GPU. This model is based on the Transformer architecture [10, 49] and uses FP16 for inference. Fig. 1 depicts how TensorRT and Apollo map operators of a simplified sub-computation graph from BERT into kernels². This subgraph contains representative DNN operators like general matrix multiplication (GEMM), reshape, permutation, element-wise arithmetic operators like add or exp, and reduction operators like reduce_sum. How the compiler maps these operators to individual kernels significantly impacts performance.

² Layout transformation kernels are omitted from Fig. 1 to aid clarity.

2.2 Performance Evaluation

We measure the resulting kernels using NVIDIA Nsight Compute [39]. Table 1 shows that neither TensorRT nor Apollo provides an optimal mapping for the evaluated DNN. The subgraph created by TensorRT and Apollo in Fig. 1 loads 16.52MB and 27.78MB of data from global memory, giving execution times of 62.34µs and 179.1µs, respectively. A better strategy, which is the one chosen by our approach, is to refine and map the subgraph into a single kernel. This strategy reduces the number of bytes loaded from global memory to 8.87MB, with an execution time of 57.7µs, translating to 1.1× and 3.1× faster running times than TensorRT and Apollo, respectively. We want to highlight that TensorRT has been specifically tuned for Transformer-based models with closed-source, hand-optimized low-level operator implementations (like GEMM). Therefore, we consider the Souffle improvement over TensorRT on BERT to be significant, given that Souffle does not have access to some of the NVIDIA-optimized operator implementations used by TensorRT. Furthermore, as we will show later in Sec. 8, Souffle also significantly outperforms other DNN compilers, including XLA and IREE, on this DNN model.
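As a quick sanity check on these figures (our own arithmetic from Table 1, not an additional measurement): 62.34µs / 57.73µs ≈ 1.08, i.e., roughly 1.1× over TensorRT, and 179.07µs / 57.73µs ≈ 3.1× over Apollo; the corresponding reductions in global-memory traffic are 16.52MB / 8.87MB ≈ 1.9× and 27.78MB / 8.87MB ≈ 3.1×, respectively.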
have no data dependencies, and the operators will be horizontally transformed as described in Sec. 6.1. The second type of reuse opportunity manifests in the temporal dimension. Temporal data reuse opportunities apply to tensors that are used more than once by operators that have data dependencies, and they guide the tensor reuse optimization described in Sec. 6.5. Consider again our working example in Fig. 1: the result of element-wise arithmetic operator 1 (termed A1) is used by two dependent operators, R1 and A2. Once again, accesses to global memory can be eliminated if we cache the computation output of A1 in registers/shared memory.

Souffle identifies these data reuse opportunities from the TE tensor dependency graph at the tensor level by first traversing the computation graph to gather all the tensors accessed by more than one TE. It records the set of operators, s(t_i) = {op_j, ..., op_k}, that share tensor t_i to enable code optimizations, as described in Sec. 6.

5.2 Intra-TE element-wise dependency analysis

Souffle captures element-wise data dependence from output to input tensors within a TE by defining the iteration space as the output shape, and the data space as the domain of iteration and reduction variables for each input tensor. This simplifies the element-wise dataflow from input to output tensors, as we only need to record the relevant input tensor elements for an element of the output tensor. The information also enables reduction operator fusion at the TE transformation stage, which other optimization tools such as TensorRT and Apollo do not support.

Our key observation is that intra-TE data dependence falls into two categories. First, for a TE without a reduction axis (see also Section 3), each output tensor element relies on only one input tensor element (termed one-relies-on-one). Second, for a TE with a reduction axis, each output element relies on all the elements along the reduction dimensions of the input tensors (termed one-relies-on-many). With this observation, we can greatly simplify the dependence analysis process compared to the source-code- or operator-level analysis that other kernel fusion approaches rely on.

We use the polyhedral model notation [46] to denote element-wise dependences from an output tensor element to input tensor element(s). Each tensor has an associated set S = [x0, . . . , xn : c0 ∧ . . . ∧ cm] representing its data space, with xi as iteration variables and cj as loop bounds taken from the TEs. A relation signifies output elements depending on input tensor elements. A pair of output and input tensors is tied to a relation R = {[x0, . . . , xn] ↦ [y0, . . . , ym] : c0, . . . , cp}. For the TEs in Fig. 2, TE1 gives R1 = {O1[i, j] ↦ O0[i, j], 0 ≤ i < 64, 0 ≤ j < 64} and TE0 gives R0 = {O0[i, j] ↦ I0[i, rk], 0 ≤ i < 64, 0 ≤ j < 64, 0 ≤ rk < 64}. TE1 is of type one-relies-on-one and TE0 is of type one-relies-on-many.

One-relies-on-one TEs. Souffle adopts quasi-affine maps [7, 36] to represent the element-wise dependency of a one-relies-on-one TE. The mapping from output to input can be expressed in the form M·v + c, where v is the index vector of the output tensor, M is a constant matrix from Z^(n×m), and c is a constant vector from Z^m. Here, n is the output tensor's dimension and m is the corresponding input tensor's dimension. Note that multiple indices of the output tensor may rely on the same index of the input tensor. For instance, relation R1 can be represented as:

  [[1, 0], [0, 1]] · [i, j]^T + [0, 0]^T,  0 ≤ i < 64, 0 ≤ j < 64    (1)

One-relies-on-many TEs. For a one-relies-on-many TE, Souffle extracts the region of the input tensor accessed by combining the iteration space and the input tensor's index function. As the iteration domain of the reduction axes is a constant value, the mapping can be expressed in the form R = {[x0, . . . , xn] ↦ {[y0, . . . , ym], [r0, . . . , rs]} : c0, . . . , cp}, where [r0, . . . , rs] is the set of reduction variables and their ranges. For instance, relation R0 can be expressed as R0 = {O0[i, j] ↦ {I0[i, rk], [0 ≤ rk < 64]}, 0 ≤ i < 64, 0 ≤ j < 64}, where {I0[i, rk], [0 ≤ rk < 64]} represents the set of elements with rk ranging from 0 to 64. We stress that the element-wise dependency of compute-intensive operators like GEMM and convolution can be easily analyzed, as the tensor expression explicitly gives the reduction axes.

TEs with one-relies-on-one dependency are then transformed in Sec. 6.2, and TEs with one-relies-on-many dependency are scheduled in Sec. 6.3 and Sec. 6.4.
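The following is a small Python sketch of how this two-way classification can be derived mechanically from a TE (our illustration of the idea, not Souffle's implementation; the TensorExpr dataclass is a hypothetical stand-in for Souffle's TE representation). A TE with no reduction axes yields a one-relies-on-one relation described by an affine map (M, c), while a TE with reduction axes yields a one-relies-on-many relation that also carries the reduction variables and their ranges.

```python
from dataclasses import dataclass, field

@dataclass
class TensorExpr:
    """A simplified TE: output shape, per-input affine index maps, reduction axes."""
    name: str
    out_shape: tuple        # iteration space = output shape
    index_maps: dict        # input name -> (M, c) mapping output indices to input indices
    reduce_axes: dict = field(default_factory=dict)   # reduction var -> extent

def classify(te_expr: TensorExpr) -> str:
    """One-relies-on-one iff the TE has no reduction axis; otherwise one-relies-on-many."""
    return "one-relies-on-one" if not te_expr.reduce_axes else "one-relies-on-many"

# TE1 (element-wise copy O1[i, j] = O0[i, j]): identity map, no reduction.
te1 = TensorExpr("TE1", (64, 64), {"O0": ([[1, 0], [0, 1]], [0, 0])})
# TE0 (row reduction O0[i, j] = sum_rk I0[i, rk]): one reduction axis rk of extent 64.
te0 = TensorExpr("TE0", (64, 64), {"I0": ([[1, 0]], [0])}, reduce_axes={"rk": 64})

assert classify(te1) == "one-relies-on-one"
assert classify(te0) == "one-relies-on-many"
```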
5.3 TE characterization

Souffle classifies a TE as memory-intensive (e.g., reduce_sum) or computation-intensive (e.g., GEMM) by estimating the compute-memory ratio of the TE. The ratio is computed by dividing the number of arithmetic instructions by the number of memory accesses; it therefore represents the number of arithmetic operations per tensor element read and written. In this work, the classification threshold is empirically set to 3: a ratio below the threshold indicates that the TE is memory-intensive.
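A minimal sketch of this heuristic (our code; the instruction and memory-access counts are assumed to come from Souffle's TE analysis, here they are plain function arguments):

```python
COMPUTE_MEMORY_THRESHOLD = 3  # empirically chosen threshold from the paper

def characterize_te(num_arithmetic_ops: int, num_memory_accesses: int) -> str:
    """Label a TE by its compute-memory ratio: arithmetic ops per memory access."""
    ratio = num_arithmetic_ops / max(num_memory_accesses, 1)
    return "memory-intensive" if ratio < COMPUTE_MEMORY_THRESHOLD else "computation-intensive"

# A 64x64x64 GEMM: ~2*64^3 multiply-adds vs ~3*64^2 element accesses -> computation-intensive.
print(characterize_te(2 * 64**3, 3 * 64**2))
# A row-wise reduce_sum over 64x64: ~64^2 adds vs ~(64^2 + 64) accesses -> memory-intensive.
print(characterize_te(64**2, 64**2 + 64))
```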
5.4 TE Program Partitioning

Souffle tries to generate large kernels to maximize data reuse and eliminate extra kernel launches. However, using global synchronization imposes a constraint: the thread block count cannot exceed the maximum number of blocks per wave. If this constraint cannot be satisfied, Souffle partitions the TE program into multiple TE subprograms. In Souffle, a TE subprogram serves as the fundamental unit for high-level TE transformation, middle-end schedule optimization, and back-end code generation. It can include several operators mapped to one GPU kernel.
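The sketch below illustrates the resource-aware partitioning constraint as a simple greedy pass (our simplification, not Souffle's actual heuristic; blocks_needed and the per-wave block limit are assumed inputs, and treating block demands as additive corresponds to horizontal fusion of independent TEs into different blocks of one kernel). TEs are appended to the current subprogram until adding one more would exceed the number of thread blocks a single wave can hold, at which point a new subprogram, and thus a new kernel, is started.

```python
def partition_te_program(tes, blocks_needed, max_blocks_per_wave):
    """Greedily split a list of TEs into subprograms whose total block count
    fits within one wave, so grid-wide synchronization stays valid."""
    subprograms, current, current_blocks = [], [], 0
    for te_expr in tes:
        need = blocks_needed[te_expr]
        if current and current_blocks + need > max_blocks_per_wave:
            subprograms.append(current)          # close the current subprogram/kernel
            current, current_blocks = [], 0
        current.append(te_expr)
        current_blocks += need
    if current:
        subprograms.append(current)
    return subprograms

# Example with a per-wave capacity of 432 blocks (a made-up figure for illustration).
tes = ["gemm0", "softmax0", "gemm1", "layernorm0"]
blocks = {"gemm0": 256, "softmax0": 64, "gemm1": 256, "layernorm0": 64}
print(partition_te_program(tes, blocks, max_blocks_per_wave=432))
# -> [['gemm0', 'softmax0'], ['gemm1', 'layernorm0']]
```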
Table 3. End-to-end model runtime (ms) - lower is better.

Model          XLA     Ansor   TRT     Rammer   Apollo   IREE    Ours
BERT           2.55    2.31    1.30    2.19     3.29     2.22    1.22
ResNeXt        8.91    20.50   24.82   11.69    22.80    314.8   4.43
LSTM           10.57   6.78    6.30    1.72     Failed   16.0    0.80
EfficientNet   2.96    0.91    1.21    Failed   2.3      12.33   0.66
SwinTrans.     6.43    5.81    1.74    Failed   10.78    18.1    1.55
MMoE           0.29    0.034   0.070   Failed   0.049    0.088   0.014

Table 4. Execution time (ms) with Souffle individual optimizations.

Model          V0      V1      V2      V3      V4
BERT           3.1     2.12    1.53    1.41    1.22
ResNeXt        29.0    5.90    4.43    4.43    4.43
LSTM           6.78    1.60    1.21    0.8     0.8
EfficientNet   4.2     0.91    0.72    0.63    0.63
Swin-Trans.    5.81    4.88    2.09    1.78    1.55
MMoE           0.05    0.019   0.016   0.014   0.014
Figure 6. EfficientNet sub-module latency breakdown (sub-modules M0-M9 and their average).
and reduced GPU memory data transfers. We use a micro-benchmark taken from EfficientNet to illustrate the performance contribution of the two optimizations. The sub-module is the building block of EfficientNet and is repeated many times with different input sizes (marked M0 to M9). The pattern of this sub-module is common in many DNN models, and existing DNN frameworks fail to optimize it well. Fig. 5 shows four versions: (5a) unfused, generating one kernel per TE; (5b) fused with Ansor's fusion; (5c) Souffle's global-sync, generating the whole sub-module as one kernel but without any data reuse; and (5d) with Souffle's data reuse. Fig. 6 shows the normalized speedup of the four versions, with the horizontal axis being the different sub-modules. Global sync achieves a 1.31× speedup on average compared with Ansor's unfused version, with the performance improvements coming from the reduction in kernel calls and lightweight CUDA grid sync. Enabling data reuse further improves the speedup from 1.31× to 1.84× on average. Souffle's reduced GPU kernel calls and increased data reuse can both significantly improve performance. However, it is non-trivial to separate their performance contributions for end-to-end models, as TE transformation and global synchronization may both reduce kernel calls and memory accesses. We report the reduced GPU kernel calls and GPU memory data transfers in the following.

Reduced GPU kernel calls. GPU kernel calls can be expensive: it takes around 2µs to launch a kernel on an NVIDIA A100 GPU. Table 5 compares the number of kernel calls from TensorRT, Apollo, XLA and Souffle. Souffle can create large subprograms that result in fewer kernels because of resource-aware TE program partitioning. This optimization reduces the kernel launch overhead. For example, in BERT, Souffle reduces the number of kernels from 120 and 240 (generated by TensorRT and Apollo, respectively) to 24. Similar kernel call reductions are observed in other DNN workloads; operator fusion is one of the key optimizations behind them.
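To put the launch overhead in perspective (a back-of-the-envelope estimate of ours, not a reported measurement): at roughly 2µs per launch, going from 120 kernels to 24 on BERT avoids about (120 - 24) × 2µs ≈ 0.19ms of launch overhead relative to TensorRT, and (240 - 24) × 2µs ≈ 0.43ms relative to Apollo, a noticeable fraction of BERT's 1.22ms end-to-end runtime in Table 3.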
Reduced GPU memory data transfers. Memory data transfer is known to be expensive, and it is desirable to reduce the amount of data transferred from global memory. To do so, Souffle maximizes tensor buffer reuse through TE program partitioning (Sec. 5.4) and TE transformation (Sec. 6). Table 5 also compares the amount of GPU global memory data transfer measured by Nsight Compute for TensorRT, Apollo, and Souffle. Souffle-generated code incurs significantly fewer data transfers than TensorRT and Apollo. For example, in BERT, Souffle reduces the memory transactions from 361.8M and 880.5M bytes (loaded by TensorRT and Apollo, respectively) to 226.8M bytes.

Consider the performance of TensorRT and Souffle again when optimizing BERT. As in Sec. 2, we classify the computation kernels in BERT into compute-intensive (like GEMM) and memory-intensive kernels (like softmax). We then measure the execution latency (in clock cycles) of each kernel. Souffle is more flexible in fusing operators, which reduces the number of kernels and the kernel invocation overhead compared to TensorRT. For example, TensorRT maps a BERT layer to 10 kernels, while Souffle can partition one layer into two kernels and perform instruction-level optimization. Souffle reduces the memory-intensive kernel latency for one BERT layer from 31.0µs (in TensorRT) to 25.5µs by buffering intermediate results in fast memory and GPU registers.

We also examine IREE's fusion performance on BERT. IREE misses two optimization opportunities for BERT: it fuses neither GEMM with softmax operators nor several GEMM operators with each other. IREE launches 180 kernels and takes 2.22 ms for execution. In comparison, Souffle launches 24 kernels and takes 1.22 ms.

8.4 Case Study on LSTM

Following the discussion in Sec. 8.3, we conducted studies on the LSTM model to reveal the new optimization opportunities offered by Souffle, which achieved a performance improvement of 4.3× over TensorRT and 2.2× over Rammer. We compared Souffle with Rammer, the most performant baseline, as discussed in Sec. 8.3. Fig. 7 shows the fusion strategy used by Rammer and Souffle for an LSTM with 10 cells (listed vertically in Fig. 7). Each cell has its dedicated weight tensors (marked as W and U in Fig. 7), hidden state (h) and output (c). In each time step t, the n-th cell performs a general matrix-vector multiplication (GEMV for short) using
its weight tensors (W_n and U_n), its hidden state (h_n) and the output (c_{n-1}) of the (n-1)-th cell, updates its hidden state (h_n), and generates the output (c_n) for the current time step. Fig. 7 shows the fully unrolled time-step loop. The LSTM operators along the diagonal are independent, i.e., no data dependency exists between them. Both Souffle and Rammer exploit this optimization opportunity, i.e., the wavefront parallelism, and fuse the GEMV computations into different blocks of a kernel. With the TE-based global analysis, Souffle discovers that the weight tensors (W and U) of each LSTM cell are reused across all time steps (temporal reuse). It utilizes global synchronization and generates one kernel for the entire model, as shown in the right part of Fig. 7. In contrast, the Rammer version needs to load the weight tensors in every wavefront, resulting in a longer execution time than Souffle. We measured the GPU global memory data transfer and pipeline utilization for the optimized LSTM. As Table 6 shows, Souffle-optimized code reduces memory loads by orders of magnitude compared to Rammer's version (21MB vs. 1911MB) and increases pipeline utilization for both the load-store unit (LSU) and the fused multiply-add unit (FMA).

[Figure 7 graphic: panels (a) Rammer and (b) Souffle show the mapping of LSTM cells n=1...9 across time steps onto kernels; the legend distinguishes GEMV, reduction operators (e.g., reduce_sum), atomic add, global sync, LSTM cells, tensor data, and computation kernels.]

Figure 7. How Rammer (a) and Souffle (b) map an LSTM graph into computation kernels.

Table 6. GPU performance counter values for LSTM optimized by Rammer and Souffle.

Metrics                                Rammer     Souffle
GPU global memory trans. (in bytes)    1911.0MB   21.11MB
Pipeline Utilization (LSU)             20.2%      35.4%
Pipeline Utilization (FMA)             8.0%       19.0%
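To make the reuse pattern explicit, here is a schematic Python sketch of the stacked-LSTM computation (our simplification: the gate nonlinearities are collapsed into a single tanh, and the function arguments are hypothetical). The weight tensors W[n] and U[n] are indexed only by the cell, not by the time step, which is exactly the temporal reuse Souffle keeps on-chip across the unrolled time-step loop; cells on the same anti-diagonal (constant t + n) have no dependence on one another, which is the wavefront parallelism both systems exploit.

```python
import numpy as np

def lstm_schematic(x, W, U, hidden):
    """Schematic stacked-LSTM loop nest: the output of cell n-1 at step t feeds cell n at step t."""
    num_steps, num_cells = len(x), len(W)
    h = [np.zeros(hidden) for _ in range(num_cells)]
    out = None
    for t in range(num_steps):
        prev = x[t]                          # model input at time step t
        for n in range(num_cells):
            # GEMV with per-cell weights: W[n] and U[n] do not depend on t, so they
            # can stay resident in shared memory/registers across the time-step loop.
            z = W[n] @ prev + U[n] @ h[n]
            h[n] = np.tanh(z)                # stand-in for the real gate updates
            prev = h[n]                      # becomes c_n, consumed by cell n+1
        out = prev
    return out
```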
8.5 Compilation Overhead

Souffle employs Ansor and TVM for schedule search and code generation. The compilation overhead of Souffle + Ansor mainly comes from the time required to search for the program schedule using the native Ansor implementation. The additional overhead introduced by Souffle involves two-level dependence analysis, model splitting, schedule tuning, and global optimization. Our measurements on six evaluated models

9 Discussion

Expressive power of TEs. Souffle relies on the expressive power of tensor expressions, which currently do not support all DNN operators; e.g., resize is not supported. Souffle maps these TE-unsupported operators to a computation kernel and uses the back-end operator library implementation, but without fusing them with other operators. Given the active developer community of TVM, we expect this limitation to be addressed by future TVM releases.

Cost model for TE program partitioning. Souffle extracts tensor information by compiling the raw TE program. This could be improved by building a cost model [53] to estimate occupancy directly from the TE program.

Reusing dynamic-shaped tensors. Certain DNN operators have unknown tensor shapes at compile time. Our current implementation does not support reusing tensors of dynamic shapes. To address this, we can generate multiple versions of a kernel and choose the appropriate one based on the shape information available at execution time.

Fusion in DL training. DL compilers like TensorFlow XLA also enable operator fusion in training (forward inference and backward parameter updates). Our TE-based transformation can be integrated into DL compilers to accelerate forward and backward passes during training. However, intermediate tensors must be kept in global memory during DL training for backward gradient-based optimization like Adam [24], restricting operator fusion opportunities. Our main focus is optimizing model inference after DNN training. Support for TE transformation in DL training is left for future work.

Slowdown. Performance slowdown can occur when Souffle extends the schedule from compute-intensive TEs to memory-intensive reduction TEs (discussed in Sec. 6.3). This introduces synchronization between blocks, potentially hampering parallelism for the reduction TEs. A potential remedy is to create a cost model to decide whether fusing these TEs is beneficial.

10 Related work

Loop and kernel fusion. Loop fusion is commonly used to improve the performance of CPU programs [3, 8, 9, 23]. Recent research has also utilized kernel fusion to optimize GPU programs by reducing data traffic to/from off-chip memory [42, 50]. Various domain-specific kernel fusion policies have been proposed for workloads like data center applications [54], mesh computing [6], machine learning workloads [4] and image processing [35]. Souffle leverages loop
fusion to optimize DNN inference through compiler transformations, building on these previous research efforts.

Operator fusion in DNN compilers. Operator fusion can enhance performance by improving data reuse and reducing on-chip data traffic. To seek out fusion opportunities, DNNFusion classifies operators and defines rules based on their classification [37]. Astitch [58] and Rammer [33] fuse independent operators to leverage inter-operator parallelism, while DeepCuts [22] uses rules to fuse kernels based on GPU hardware parameters. Apollo [56] adopts a partition-based approach to search for fusion opportunities within sub-graphs. However, these approaches rely on hand-crafted rules that require extensive engineering effort and may miss optimization opportunities, as discussed in Sec. 2. Jeong et al. [19] proposed a dynamic programming algorithm to decide whether to pin or fuse an activation map on DNN accelerators with a global buffer. DNNFusion [37] classifies operators based on the element-wise mappings from input to output, but it cannot fuse many-to-many with many-to-many operators (like GEMM and Softmax), while Souffle can, further reducing the kernel launch overhead. Furthermore, DNNFusion lacks a global analysis of tensor reuse opportunities and may miss the temporal and spatial data reuse opportunities that are critical to performance, as shown in Sec. 8.2. Souffle improves upon previous operator fusion techniques by utilizing control- and data-flow analysis on the tensor dependency graph to partition TEs into subprograms. TEs have clear and simple relations, and they can be combined to represent numerous DNN operators. Souffle leverages the well-defined semantics of TEs to perform precise data dependence analysis for instruction scheduling and data reuse optimization. Additionally, Souffle applies semantic-preserving transformations to refine TEs. Its optimization capabilities generalize better because TEs can be combined to represent more complex operators.

Global analysis and fusion optimization. TensorFlow XLA [27] and MLIR [43] also conduct global analysis on the input program graph. XLA applies profitability analysis on its high-level operations intermediate representation before deciding on tiling and fusion. However, XLA relies on hand-crafted heuristics to fuse operators, which can be challenging for high-level operators. For instance, XLA's fusion heuristic cannot fuse two consecutive reduction operators in the BERT model. Moreover, as XLA operates at the operator level and some operators are mapped to low-level library calls, it cannot optimize across libraries. In contrast, Souffle takes a different approach by lowering high-level operators into lower-level tensor expressions (TEs), which have concise semantics. Operating on TEs rather than assuming high-level operators enables Souffle to optimize flexibly across operator boundaries. Unlike XLA, Souffle can merge GEMM and Softmax operators and optimize across reduction operators. The MLIR -affine-loop-fusion pass utilizes a slicing-based method to identify producer-consumer and sibling fusion opportunities. Souffle implements a lightweight, specialized global analysis on TEs, which can be easily integrated into DNN inference engines. Moreover, Souffle offers more optimization opportunities than just fusion. For example, it enables joint optimizations across multiple compute-intensive TEs in a TE subprogram and facilitates horizontal and vertical transformations for sibling fusion.

Optimizing individual operators. Numerous compiler-based approaches exist to optimize individual operators, including TVM [12, 51], XLA [27], Tiramisu [5], and TACO [25]. These compilers often represent operators in high-level forms such as TEs or linear algebra, enabling aggressive optimization without complex analysis through domain-specific knowledge. Souffle is orthogonal to these techniques.

11 Conclusion

We have presented Souffle, a top-down compiler-based approach for improving DNN inference. Souffle identifies optimization opportunities across DNN operators by performing data-flow analysis on the entire tensor dependence graph built from tensor expressions. It groups tensor expressions into subprograms and performs local optimization through semantics-preserving transformations, instruction scheduling, and tensor buffer reuse. We evaluated Souffle on six DNN models using an NVIDIA A100 GPU and compared it to six state-of-the-art DNN optimizing frameworks. Souffle outperformed them with a speedup of up to 7.9× over TensorRT.

Acknowledgments

We thank our shepherd, Vinod Grover, and the anonymous reviewers for their constructive feedback. This work was supported in part by the National Key R&D Program of China under grant agreement 2021ZD0110101, the National Natural Science Foundation of China (NSFC) under grant agreements T2222026, 22003073, 62232015, and 62090024, the Innovation Funding of ICT CAS under grant agreement E361010, a Beijing Nova Program, and the UK Engineering and Physical Sciences Research Council (EPSRC) under grant agreement EP/X018202/1. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising from this submission.

References

[1] [n. d.]. IREE: Intermediate Representation Execution Environment. https://github.com/iree-org/iree.
[2] Martín Abadi, Paul Barham, Jianmin Chen, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, 265–283. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
[3] Aravind Acharya, Uday Bondhugula, and Albert Cohen. 2018. Polyhedral auto-transformation with no integer linear programming. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. 529–542.
[4] Arash Ashari, Shirish Tatikonda, Matthias Boehm, Berthold Reinwald, Keith Campbell, John Keenleyside, and P. Sadayappan. 2015. On optimizing machine learning workloads via kernel fusion. ACM SIGPLAN Notices 50, 8 (2015), 173–182.
[5] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. 2019. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 193–205. https://doi.org/10.1109/CGO.2019.8661197
[6] Carlo Bertolli, Adam Betts, Paul H. J. Kelly, Gihan R. Mudalige, and Mike B. Giles. 2012. Mesh independent loop fusion for unstructured mesh applications. In Proceedings of the 9th Conference on Computing Frontiers. 43–52.
[7] U. Bondhugula, A. Acharya, and A. Cohen. 2016. The Pluto+ Algorithm: A Practical Approach for Parallelization and Locality Optimization of Affine Loop Nests. ACM Transactions on Programming Languages and Systems.
[8] Uday Bondhugula, Oktay Gunluk, Sanjeeb Dash, and Lakshminarayanan Renganarayanan. 2010. A model for fusion and code motion in an automatic parallelizing compiler. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. 343–352.
[9] Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation. 101–113.
[10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[11] Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. 2018. The rise of deep learning in drug discovery. Drug Discovery Today 23, 6 (2018), 1241–1250. https://doi.org/10.1016/j.drudis.2018.01.039
[12] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, 578–594. https://www.usenix.org/conference/osdi18/presentation/chen
[13] ONNX Runtime developers. 2021. ONNX Runtime. https://onnxruntime.ai/. Version: x.y.z.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[15] Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen. 2023. TensorIR: An Abstraction for Automatic Tensorized Program Optimization. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2023, Tor M. Aamodt, Natalie D. Enright Jerger, and Michael M. Swift (Eds.). ACM, 804–817. https://doi.org/10.1145/3575693.3576933
[16] Tianfan Fu, Cao Xiao, Cheng Qian, Lucas M. Glass, and Jimeng Sun. 2021. Probabilistic and Dynamic Molecule-Disease Interaction Modeling for Drug Discovery. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (Virtual Event, Singapore). 404–414. https://doi.org/10.1145/3447548.3467286
[17] Kim Hazelwood, Sarah Bird, et al. 2018. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 620–629. https://doi.org/10.1109/HPCA.2018.00059
[18] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[19] Hyuk-Jin Jeong, JiHwan Yeo, Cheongyo Bahk, and JongHyun Park. 2023. Pin or Fuse? Exploiting Scratchpad Memory to Reduce Off-Chip Data Transfer in DNN Accelerators. In Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2023). Association for Computing Machinery, 224–235. https://doi.org/10.1145/3579990.3580017
[20] Zhihao Jia, Oded Padon, et al. 2019. TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP '19). 47–62. https://doi.org/10.1145/3341301.3359630
[21] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). 1–12. https://doi.org/10.1145/3079856.3080246
[22] Wookeun Jung, Thanh Tuan Dao, and Jaejin Lee. 2021. DeepCuts: A deep learning optimization framework for versatile GPU workloads. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 190–205.
[23] Ken Kennedy and Kathryn S. McKinley. 1993. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In International Workshop on Languages and Compilers for Parallel Computing. Springer, 301–320.
[24] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980
[25] Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. 2017. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 943–948.
[26] Anna Larionova, Polina Kazakova, and Nikita Nikitinsky. 2019. Deep Structured Semantic Model for Recommendations in E-commerce. In Hybrid Artificial Intelligent Systems - 14th International Conference, HAIS 2019, León, Spain, September 4-6, 2019, Proceedings (Lecture Notes in Computer Science, Vol. 11734). Springer, 85–96. https://doi.org/10.1007/978-3-030-29859-3_8
[27] Chris Leary and Todd Wang. 2017. XLA: TensorFlow, compiled. TensorFlow Dev Summit.
[28] Shih-Chieh Lin, Yunqi Zhang, et al. 2018. The Architectural Implications of Autonomous Driving: Constraints and Acceleration. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018. ACM, 751–766. https://doi.org/10.1145/3173162.3173191
[29] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. 2018. Progressive Neural Architecture Search. In Computer Vision - ECCV 2018. Springer International Publishing, 19–35.
[30] Hsin-I Cindy Liu, Marius Brehler, Mahesh Ravishankar, Nicolas Vasilache, Ben Vanik, and Stella Laurenzo. 2022. TinyIREE: An ML Execution Environment for Embedded Systems From Compilation to