
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 32, NO. 3, MARCH 2021

The Deep Learning Compiler: A Comprehensive Survey
Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, Zhongzhi Luan, Lin Gan, Member, IEEE, Guangwen Yang, Member, IEEE, and Depei Qian

Abstract—The difficulty of deploying various deep learning (DL) models on diverse DL hardware has boosted the research and development of DL compilers in the community. Several DL compilers have been proposed by both industry and academia, such as TensorFlow XLA and TVM. Similar to traditional compilers, the DL compilers take the DL models described in different DL frameworks as input, and then generate optimized code for diverse DL hardware as output. However, none of the existing surveys has analyzed the unique design architecture of the DL compilers comprehensively. In this article, we perform a comprehensive survey of existing DL compilers by dissecting the commonly adopted design in detail, with emphasis on the DL oriented multi-level IRs and frontend/backend optimizations. We present a detailed analysis on the design of multi-level IRs and illustrate the commonly adopted optimization techniques. Finally, several insights are highlighted as the potential research directions of DL compilers. This is the first survey article focusing on the design architecture of DL compilers, which we hope can pave the road for future research towards DL compilers.

Index Terms—Neural networks, deep learning, compiler, intermediate representation, optimization

Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, Zhongzhi Luan, and Depei Qian are with the School of Computer Science and Engineering, Beihang University, Beijing 100191, China. E-mail: {lmzhhh, yi.liu, liuxiaoyan, sunqingxiao, youxin2015, hailong.yang, zhongzhi.luan, depeiq}@buaa.edu.cn.
Lin Gan and Guangwen Yang are with the Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. E-mail: {lingan, ygw}@tsinghua.edu.cn.
Manuscript received 28 Mar. 2020; revised 27 Aug. 2020; accepted 8 Oct. 2020. Date of publication 13 Oct. 2020; date of current version 28 Oct. 2020. (Corresponding author: Hailong Yang.) Recommended for acceptance by P. Balaji. Digital Object Identifier no. 10.1109/TPDS.2020.3030548.

1 INTRODUCTION

THE development of deep learning (DL) has generated a profound impact on various scientific fields. It has not only demonstrated remarkable value in artificial intelligence such as natural language processing (NLP) [1] and computer vision (CV) [2], but also proved great success in broader applications such as e-commerce [3], smart city [4] and drug discovery [5]. With the emergence of versatile deep learning models such as the convolutional neural network (CNN) [6], recurrent neural network (RNN) [7], long short-term memory (LSTM) [8] and generative adversarial network (GAN) [9], it is critical to ease the programming of diverse DL models in order to realize their wide adoption.

With the continuous efforts from both industry and academia, several popular DL frameworks have been proposed, such as TensorFlow [10], PyTorch [11], MXNet [12] and CNTK [13], in order to simplify the implementation of various DL models. Although there are strengths and weaknesses among the above DL frameworks depending on the tradeoffs in their designs, interoperability becomes important to reduce the redundant engineering efforts when supporting emerging DL models across the existing DL frameworks. To provide interoperability, ONNX [14] has been proposed, which defines a unified format for representing DL models to facilitate model conversion between different DL frameworks.

Meanwhile, the unique computing characteristics such as matrix multiplication have spurred the passion of chip architects to design customized DL accelerators for higher efficiency. Internet giants (e.g., Google TPU [15], Hisilicon NPU [16], Apple Bionic [17]), processor vendors (e.g., NVIDIA Turing [18], Intel NNP [19]), service providers (e.g., Amazon Inferentia [20], Alibaba Hanguang [21]), and even startups (e.g., Cambricon [22], Graphcore [23]) are investing tremendous workforce and capital in developing DL chips in order to boost the performance for DL models. Generally, the DL hardware can be divided into the following categories: 1) general-purpose hardware with software-hardware co-design, 2) dedicated hardware fully customized for DL models, and 3) neuromorphic hardware inspired by biological brain science. For example, general-purpose hardware (e.g., CPU, GPU) has added special hardware components such as AVX512 vector units and tensor cores to accelerate DL models. Whereas for dedicated hardware such as the Google TPU, application-specific integrated circuits (e.g., the matrix multiplication engine and high-bandwidth memory) have been designed to elevate the performance and energy efficiency to the extreme. For the foreseeable future, the design of DL hardware will become even more diverse.

To embrace the hardware diversity, it is important to map the computation to DL hardware efficiently. On general-purpose hardware, highly optimized linear algebra libraries such as the Basic Linear Algebra Subprograms (BLAS) libraries (e.g., MKL and cuBLAS) serve as the basics for efficient computation of DL models. Take the convolution operation for example: the DL frameworks convert the convolution to matrix multiplication and then invoke the GEMM function in the BLAS libraries.

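As a concrete illustration of this convolution-to-GEMM lowering, the following sketch (an illustrative NumPy implementation, not the code of any particular framework) applies a naive im2col transformation and computes the convolution as a single matrix multiplication, which a BLAS library would then execute:

    import numpy as np

    def im2col(x, kh, kw):
        # x: input feature map with shape (C, H, W); no padding, stride 1.
        C, H, W = x.shape
        out_h, out_w = H - kh + 1, W - kw + 1
        # Each column gathers one receptive field of size C*kh*kw.
        cols = np.empty((C * kh * kw, out_h * out_w), dtype=x.dtype)
        idx = 0
        for i in range(out_h):
            for j in range(out_w):
                cols[:, idx] = x[:, i:i + kh, j:j + kw].ravel()
                idx += 1
        return cols, out_h, out_w

    def conv2d_as_gemm(x, w):
        # w: filters with shape (K, C, kh, kw); returns (K, out_h, out_w).
        K, C, kh, kw = w.shape
        cols, out_h, out_w = im2col(x, kh, kw)
        # The convolution becomes one GEMM, which BLAS (e.g., MKL, cuBLAS)
        # can execute efficiently.
        out = w.reshape(K, -1) @ cols
        return out.reshape(K, out_h, out_w)

    x = np.random.rand(3, 8, 8).astype(np.float32)
    w = np.random.rand(4, 3, 3, 3).astype(np.float32)
    print(conv2d_as_gemm(x, w).shape)  # (4, 6, 6)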

Fig. 1. The overview of commonly adopted design architecture of DL compilers.

In addition, the hardware vendors have released specially optimized libraries tailored for DL computations (e.g., MKL-DNN and cuDNN), including forward and backward convolution, pooling, normalization, and activation. More advanced tools have also been developed to further speed up DL operations. For example, TensorRT [24] supports graph optimization (e.g., layer fusion) and low-bit quantization with a large collection of highly optimized GPU kernels. On dedicated DL hardware, similar libraries are also provided [22], [23]. However, the drawback of relying on the libraries is that they usually fall behind the rapid development of DL models, and thus fail to utilize the DL chips efficiently.

To address the drawback of DL libraries and tools, as well as to alleviate the burden of optimizing the DL models on each DL hardware manually, the DL community has resorted to domain-specific compilers for rescue. Rapidly, several popular DL compilers have been proposed, such as TVM [25], Tensor Comprehensions [26], Glow [27], nGraph [28] and XLA [29], from both industry and academia. The DL compilers take the model definitions described in the DL frameworks as inputs, and generate efficient code implementations on various DL hardware as outputs. The transformation between model definition and specific code implementation is highly optimized, targeting the model specification and hardware architecture. Specifically, they incorporate DL oriented optimizations such as layer and operator fusion, which enables highly efficient code generation. Moreover, existing DL compilers also leverage mature tool-chains from general-purpose compilers (e.g., LLVM [30]), which provides better portability across diverse hardware architectures. Similar to traditional compilers, DL compilers also adopt the layered design, including the frontend, the intermediate representation (IR), and the backend. However, the uniqueness of the DL compiler lies in the design of multi-level IRs and DL specific optimizations.

In this paper, we provide a comprehensive survey of existing DL compilers by dissecting the compiler design into frontend, multi-level IRs and backend, with special emphasis on the IR design and optimization methods. To the best of our knowledge, this is the first paper that provides a comprehensive survey on the design of DL compilers. Specifically, this paper makes the following contributions:

- We dissect the commonly adopted design architecture of existing DL compilers, and provide detailed analysis of the key design components such as multi-level IRs, frontend optimizations (including node-level, block-level and dataflow-level optimizations) and backend optimizations (including hardware-specific optimization, auto-tuning and optimized kernel libraries).
- We provide a comprehensive taxonomy of existing DL compilers from various aspects, which corresponds to the key components described in this survey. The target of this taxonomy is to provide guidelines about the selection of DL compilers for practitioners considering their requirements, as well as to give a thorough summary of the DL compilers for researchers.
- We have provided a quantitative performance comparison among DL compilers on CNN models, including full-fledged models and lightweight models. We have compared both end-to-end and per-layer (convolution layers, since they dominate the inference time) performance to show the effectiveness of optimizations. The evaluation scripts and results are open sourced (https://github.com/buaa-hipo/dlcompiler-comparison) for reference.
- We highlight several insights for the future development of DL compilers, including dynamic shape and pre-/post-processing, advanced auto-tuning, polyhedral model, subgraph partitioning, quantization, unified optimizations, differentiable programming


and privacy protection, which we hope can boost the research in the DL compiler community.

The rest of this paper is organized as follows. Section 2 describes the common design architecture of DL compilers. Section 3 discusses the key components of DL compilers, including multi-level IRs, frontend optimizations and backend optimizations. Section 4 presents a comprehensive taxonomy. Section 5 provides the quantitative performance comparison. Section 6 highlights the future directions for DL compiler research.

2 COMMON DESIGN ARCHITECTURE OF DL COMPILERS

The common design architecture of a DL compiler primarily contains two parts: the compiler frontend and the compiler backend, as shown in Fig. 1. The intermediate representation (IR) is spread across both the frontend and the backend. Generally, IR is an abstraction of the program and is used for program optimizations. Specifically, the DL models are translated into multi-level IRs in DL compilers, where the high-level IR resides in the frontend, and the low-level IR resides in the backend. Based on the high-level IR, the compiler frontend is responsible for hardware-independent transformations and optimizations. Based on the low-level IR, the compiler backend is responsible for hardware-specific optimizations, code generation, and compilation. Note that this survey focuses on the design principles of DL compilers. For functional and experimental comparisons of DL compilers, the readers can refer to [31], [32].

The high-level IR, also known as graph IR, represents the computation and the control flow and is hardware-independent. The design challenge of the high-level IR is the abstraction capability for the computation and the control flow, which should be able to capture and express diverse DL models. The goal of the high-level IR is to establish the control flow and the dependency between the operators and the data, as well as to provide an interface for graph-level optimizations. It also contains rich semantic information for compilation and offers extensibility for customized operators. The detailed discussion of high-level IR is presented in Section 3.1.

The low-level IR is designed for hardware-specific optimization and code generation on diverse hardware targets. Thus, the low-level IR should be fine-grained enough to reflect the hardware characteristics and represent the hardware-specific optimizations. It should also allow the use of mature third-party tool-chains in compiler backends such as Halide [33], the polyhedral model [34], and LLVM [30]. The detailed discussion of low-level IR is presented in Section 3.2.

The frontend takes a DL model from existing DL frameworks as input, and then transforms the model into the computation graph representation (e.g., graph IR). The frontend needs to implement various format transformations to support the diverse formats in different frameworks. The computation graph optimizations incorporate the optimization techniques from both general-purpose compilers and the DL specific optimizations, which reduce the redundancy and improve the efficiency upon the graph IR. Such optimizations can be classified into node-level (e.g., nop elimination and zero-dim-tensor elimination), block-level (e.g., algebraic simplification, operator fusion, and operator sinking) and dataflow-level (e.g., CSE, DCE, static memory planning, and layout transformation). After the frontend, the optimized computation graph is generated and passed to the backend. The detailed discussion of the frontend is presented in Section 3.3.

The backend transforms the high-level IR into low-level IR and performs hardware-specific optimizations. On the one hand, it can directly transform the high-level IR to third-party tool-chains such as LLVM IR to utilize the existing infrastructures for general-purpose optimizations and code generation. On the other hand, it can take advantage of the prior knowledge of both DL models and hardware characteristics for more efficient code generation, with customized compilation passes. The commonly applied hardware-specific optimizations include hardware intrinsic mapping, memory allocation and fetching, memory latency hiding, parallelization as well as loop oriented optimizations. To determine the optimal parameter setting in the large optimization space, two approaches are widely adopted in existing DL compilers: auto-scheduling (e.g., the polyhedral model) and auto-tuning (e.g., AutoTVM). The optimized low-level IR is compiled using JIT or AOT to generate code for different hardware targets. The detailed discussion of the backend is presented in Section 3.4.

3 KEY COMPONENTS OF DL COMPILERS

3.1 High-Level IR

To overcome the limitation of IR adopted in traditional compilers that constrains the expression of complex computations used in DL models, existing DL compilers leverage high-level IR (also known as graph IR) with special designs for efficient code optimizations. To better understand the graph IR used in the DL compilers, we describe the representation and implementation of graph IR as follows.

3.1.1 Representation of Graph IR

The representation of graph IR influences the expressiveness of graph IR and also decides the way the DL compilers analyze the graph IR.

DAG-Based IR. DAG-based IR is one of the most traditional ways for the compilers to build a computation graph, with nodes and edges organized as a directed acyclic graph (DAG). In DL compilers [25], [26], [27], [28], [29], the nodes of a DAG represent the atomic DL operators (convolution, pooling, etc.), and the edges represent the tensors. The graph is acyclic without loops, which differs from the data dependence graphs [35] (DDG) of generic compilers [30], [36]. With the help of the DAG computation graph, DL compilers can analyze the relationship and dependencies between various operators and use them to guide the optimizations. There are already plenty of optimizations on DDG, such as common sub-expression elimination (CSE) and dead code elimination (DCE). By combining the domain knowledge of DL with these algorithms, further optimizations can be applied to the DAG computation graph, which will be elaborated in Section 3.3. DAG-based IR is convenient for programming and compiling due to its simplicity, but it has deficiencies such as semantic ambiguity caused by the missing definition of computation scope.
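A minimal sketch of such a DAG-based graph IR is shown below (a toy Python structure for illustration only, not the IR of any specific DL compiler): operators are nodes, the tensors they consume are recorded as incoming edges, and a topological order of the DAG drives dependency-aware analyses and optimizations:

    from collections import defaultdict

    class Node:
        """A DAG node: an atomic DL operator whose incoming edges are tensors."""
        def __init__(self, name, op, inputs=()):
            self.name, self.op, self.inputs = name, op, list(inputs)

    def topological_order(nodes):
        # Count how many producers each node still waits for.
        indeg = {n.name: len(n.inputs) for n in nodes}
        consumers = defaultdict(list)
        by_name = {n.name: n for n in nodes}
        for n in nodes:
            for src in n.inputs:
                consumers[src].append(n.name)
        ready = [name for name, d in indeg.items() if d == 0]
        order = []
        while ready:
            name = ready.pop()
            order.append(name)
            for c in consumers[name]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    ready.append(c)
        return [by_name[name] for name in order]

    # conv -> relu -> pool, fed by a placeholder node producing the input tensor.
    graph = [
        Node("data", "placeholder"),
        Node("conv1", "conv2d", ["data"]),
        Node("relu1", "relu", ["conv1"]),
        Node("pool1", "max_pool", ["relu1"]),
    ]
    print([n.name for n in topological_order(graph)])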


Let-Binding-Based IR. Let-binding is one method to solve the semantic ambiguity by offering let expressions to certain functions with restricted scope, and it is used by many high-level programming languages such as JavaScript [37], F# [38], and Scheme [39]. When using the let keyword to define an expression, a let node is generated, and then it points to the operator and variable in the expression instead of just building a computational relation between variables as a DAG. In a DAG-based compiler, when a process needs to get the return value of one expression, it first accesses the corresponding node and searches related nodes, which is also known as the recursive descent technique. In contrast, the let-binding based compiler figures out all results of the variables in the let expression and builds a variable map. When a particular result is needed, the compiler looks up this map to decide the result of the expression. Among the DL compilers, the Relay IR [40] of TVM adopts both DAG-based IR and let-binding-based IR to obtain the benefits of both.

Representing Tensor Computation. Different graph IRs have different ways to represent the computation on tensors. The operators of diverse DL frameworks are translated to graph IRs according to such specific representations. And the customized operators also need to be programmed in such representations. The representation of tensor computation can be divided into the following three categories.

1) Function-based: The function-based representation just provides encapsulated operators, which is adopted by Glow, nGraph and XLA. Take the High Level Optimizer (HLO, the IR of XLA) for example: it consists of a set of functions in symbolic programming, and most of them have no side-effect. The instructions are organized into three levels, including HloModule (the whole program), HloComputation (a function), and HloInstruction (the operation). XLA uses HLO IR to represent both graph IR and operation IR so that the operation of HLO ranges from the dataflow level to the operator level.

2) Lambda expression: The lambda expression, an index formula expression, describes calculation by variable binding and substitution. Using lambda expressions, programmers can define a computation quickly without implementing a new function. TVM represents the tensor computation using the tensor expression, which is based on the lambda expression. In TVM, computational operators in tensor expression are defined by the shape of the output tensor and the lambda expression of the computing rules.

3) Einstein notation: The Einstein notation, also known as the summation convention, is a notation to express summation. Its programming simplicity is superior to lambda expression. Taking TC for example, the indexes for temporary variables do not need to be defined. The IR can figure out the actual expression by the occurrence of undefined variables based on Einstein notation. In Einstein notation, the operators need to be associative and commutative. This restriction guarantees the reduction operator can be executed in any order, making it possible for further parallelization.

3.1.2 Implementation of Graph IR

The implementation of graph IR in DL compilers fulfills the management of data and operation.

Data Representation. The data in DL compilers (e.g., inputs, weights, and intermediate data) are usually organized in the form of tensors, which are also known as multi-dimensional arrays. The DL compilers can represent tensor data directly by memory pointers, or in a more flexible way by placeholders. A placeholder contains the size for each dimension of a tensor. Alternatively, the dimension sizes of the tensor can be marked as unknown. For optimizations, the DL compilers require the data layout information. In addition, the bound of iterators should be inferred according to the placeholders.

1) Placeholder: Placeholder is widely used in symbolic programming (e.g., Lisp [41], TensorFlow [10]). A placeholder is simply a variable with explicit shape information (e.g., size in each dimension), and it will be populated with values at a later stage of the computation. It allows the programmers to describe the operations and build the computation graph without concerning the exact data elements, which helps separate the computation definition from the exact execution in DL compilers. Besides, it is convenient for the programmers to change the shape of input/output and other corresponding intermediate data by using placeholders without changing the computation definition.

2) Unknown (dynamic) shape representation: The unknown dimension size is usually supported when declaring the placeholders. For instance, TVM uses Any to represent an unknown dimension (e.g., Tensor<(Any, 3), fp32>); XLA uses None to achieve the same purpose (e.g., tf.placeholder("float", [None, 3])); nGraph uses its PartialShape class. The unknown shape representation is necessary to support dynamic models. However, to fully support dynamic models, the bound inference and dimension checking should be relaxed. In addition, an extra mechanism should be implemented to guarantee memory validity.

3) Data layout: The data layout describes how a tensor is organized in memory, and it is usually a mapping from logical indices to memory indices. The data layout usually includes the sequence of dimensions (e.g., NCHW and NHWC), tiling, padding, striding, etc. TVM and Glow represent data layout as operator parameters and require such information for computation and optimization. However, combining data layout information with operators rather than tensors enables intuitive implementation for certain operators and reduces the compilation overhead. XLA represents data layout as constraints related to its backend hardware. Relay and MLIR are going to add data layout information into their type systems for tensors.

4) Bound inference: The bound inference is applied to determine the bound of iterators when compiling DL models in DL compilers. Although the tensor representation in DL compilers is convenient to describe the inputs and outputs, it exposes special


challenges for inferring the iterator bound. The bound inference is usually performed recursively or iteratively, according to the computation graph and the known placeholders. For example, in TVM the iterators form a directed acyclic hyper-graph, where each node of the graph represents an iterator and each hyper-edge represents the relation (e.g., split, fuse or rebase) among two or more iterators. Once the bound of the root iterator is determined based on the shapes of placeholders, other iterators can be inferred according to the relations recursively.

Operators Supported. The operators supported by DL compilers are responsible for representing the DL workloads, and they are nodes of the computation graph. The operators usually include algebraic operators (e.g., +, ×, exp and topK), neural network operators (e.g., convolution and pooling), tensor operators (e.g., reshape, resize and copy), broadcast and reduction operators (e.g., min and argmin), as well as control flow operators (e.g., conditional and loop). Here, we choose three representative operators that are frequently used across different DL compilers for illustration. In addition, we discuss the case for customized operators.

1) Broadcast: The broadcast operators can replicate the data and generate new data with compatible shape. Without broadcast operators, the input tensor shapes are more constrained. For example, for an add operator, the input tensors are expected to be of the same shape. Some compilers such as XLA and Relay relax such a restriction by offering the broadcasting operator. For example, XLA allows the element-wise addition of a matrix and a vector by replicating the vector until its shape matches the matrix.

2) Control flow: Control flow is needed when representing complex and flexible models. Models such as RNN and reinforcement learning (RL) depend on recurrent relations and data-dependent conditional execution [42], which require control flow. Without supporting control flow in the graph IR of DL compilers, these models must rely on the control flow support of the host languages (e.g., if and while in Python) or static unrolling, which deteriorates the computation efficiency. Relay notices that arbitrary control flow can be implemented by recursion and pattern matching, which has been demonstrated by functional programming [40]. Therefore, it provides an if operator and recursive functions for implementing control flow. On the contrary, XLA represents control flow by special HLO operators such as while and conditional.

3) Derivative: The derivative operator of an operator Op takes the output gradients and the input data of Op as its inputs, and then calculates the gradient of Op. Although some DL compilers (e.g., TVM and TC) support automatic differentiation [43], they require the derivatives of all operators in the high-level IR when the chain rule is applied. TVM is working towards providing the derivative operators of both algebraic operators and neural network operators. The programmers can use these derivative operators for building the derivatives of customized operators. On the contrary, PlaidML can generate derivative operators automatically, even for customized operators. Notably, DL compilers unable to support derivative operators fail to provide the capability of model training.

4) Customized operators: This allows programmers to define their own operators for a particular purpose. Providing support for customized operators improves the extensibility of DL compilers. For example, when defining new operators in Glow, the programmers need to realize the logic and node encapsulation. In addition, extra efforts are needed, such as the lowering step, operation IR generation, and instruction generation, if necessary. Whereas, TVM and TC require less programming effort except for describing the computation implementation. Specifically, the users of TVM only need to describe the computation and the schedule and declare the shape of input/output tensors. Moreover, the customized operators integrate Python functions through hooks, which further reduces the programmers' burden.

3.1.3 Discussion

Nearly all DL compilers have their unique high-level IRs. However, they share similar design philosophies, such as using DAG and let-binding to build the computation graph. In addition, they usually provide convenient ways for programmers to represent tensor computation. The data and operators designed in high-level IRs are flexible and extensible enough to support diverse DL models. More importantly, the high-level IRs are hardware-independent and thus can be applied with different hardware backends.

3.2 Low-Level IR

3.2.1 Implementation of Low-Level IR

Low-level IR describes the computation of a DL model in a more fine-grained representation than that in high-level IR, which enables the target-dependent optimizations by providing interfaces to tune the computation and memory access. In this section, we classify the common implementations of low-level IRs into three categories: Halide-based IR, polyhedral-based IR, and other unique IR.

Halide-Based IR. Halide was first proposed to parallelize image processing, and it has proven to be extensible and efficient in DL compilers (e.g., TVM). The fundamental philosophy of Halide is the separation of computation and schedule. Rather than giving a specific scheme directly, the compilers adopting Halide try various possible schedules and choose the best one. The boundaries of memory references and loop nests in Halide are restricted to bounded boxes aligned to the axes. Thus, Halide cannot express the computation with complicated patterns (e.g., non-rectangular). Fortunately, the computations in DL are quite regular and can be expressed perfectly by Halide. Besides, Halide can easily parameterize these boundaries and expose them to the tuning mechanism. The original IR of Halide needs to be modified when applied to the backend of DL compilers. For example, the input shape of Halide is infinite, whereas the DL compilers need to know the exact shape of data in order to map the operator to hardware instructions. Some compilers, such as TC, require fixed-size data to ensure better temporal locality for tensor data.
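The separation of computation and schedule can be made concrete with TVM's tensor expression language, which inherits this Halide philosophy. The sketch below assumes TVM's te API (roughly as available in TVM releases around 0.8) and should be read as an illustration rather than a canonical recipe: the computation is declared once, and two different schedules are derived from it:

    import tvm
    from tvm import te

    n = 1024
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    # The computation is defined once: an output shape plus a lambda of computing rules.
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")

    # Schedule 1: the default serial loop.
    s1 = te.create_schedule(C.op)

    # Schedule 2: the same computation, but split, vectorized, and parallelized.
    s2 = te.create_schedule(C.op)
    outer, inner = s2[C].split(C.op.axis[0], factor=8)
    s2[C].vectorize(inner)
    s2[C].parallel(outer)

    # Lowering shows two different loop nests for the identical computation.
    print(tvm.lower(s1, [A, B, C], simple_mode=True))
    print(tvm.lower(s2, [A, B, C], simple_mode=True))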


TVM has improved Halide IR into an independent symbolic IR through the following efforts. It removes the dependency on LLVM and refactors the structure of both the project module and the IR design of Halide, pursuing better organization as well as accessibility for the graph IR and frontend languages such as Python. The re-usability is also improved, with a runtime dispatching mechanism implemented to add customized operators conveniently. TVM also simplifies the variable definition from string matching to pointer matching, guaranteeing that each variable has a single define location (static single assignment, SSA) [44].

Polyhedral-Based IR. The polyhedral model is an important technique adopted in DL compilers. It uses linear programming, affine transformations, and other mathematical methods to optimize loop-based codes with static control flow of bounds and branches. In contrast to Halide, the boundaries of memory references and loop nests can be polyhedrons with any shapes in the polyhedral model. Such flexibility makes polyhedral models widely used in generic compilers. However, such flexibility also prevents the integration with the tuning mechanisms. Nevertheless, due to the ability to deal with deeply nested loops, many DL compilers, such as TC and PlaidML (as the backend of nGraph), have adopted the polyhedral model as their low-level IR. The polyhedral-based IR makes it easy to apply various polyhedral transformations (e.g., fusion, tiling, sinking, and mapping), including both device-dependent and device-independent optimizations. There are many toolchains that are borrowed by polyhedral-based compilers, such as isl [45], Omega [46], PIP [47], Polylib [48], and PPL [49].

TC has its unique design in low-level IR, which combines the Halide and polyhedral models. It uses Halide-based IR to represent the computation and adopts the polyhedral-based IR to represent the loop structures. TC presents detailed expressions through abstract instances and introduces specific node types. In brief, TC uses the domain node to specify the ranges of index variables and uses the context node to describe new iterative variables that are related to hardware. It uses the band node to determine the order of iterations. A filter node represents an iterator combined with a statement instance. Set and sequence are keywords to specify the execution types (parallel and serial execution) for filters. Besides, TC uses extension nodes to describe other necessary instructions for code generation, such as the memory movement.

PlaidML uses a polyhedral-based IR (called Stripe) to represent tensor operations. It creates a hierarchy of parallelizable code by extending the nesting of parallel polyhedral blocks to multiple levels. Besides, it allows nested polyhedrons to be allocated to nested memory units, providing a way to match the computation with the memory hierarchy. In Stripe, the hardware configuration is independent of the kernel code. The tags in Stripe (known as passes in other compilers) do not change the kernel structure, but provide additional information about the hardware target for the optimization passes. Stripe splits the DL operators into tiles that fit into local hardware resources.

Other Unique IR. There are DL compilers implementing customized low-level IRs without using Halide or the polyhedral model. Upon the customized low-level IRs, they apply hardware-specific optimizations and lower them to LLVM IR.

The low-level IR in Glow is an instruction-based expression that operates on tensors referenced by addresses [27]. There are two kinds of instruction-based functions in Glow low-level IR: declare and program. The first one declares the number of constant memory regions that live throughout the lifetime of the program (e.g., input, weight, bias). The second one is a list of locally allocated regions, including functions (e.g., conv and pool) and temporary variables. Instructions can run on the global memory regions or locally allocated regions. Besides, each operand is annotated with one of the qualifiers: @in indicates that the operand reads from the buffer; @out indicates that the operand writes to the buffer; @inout indicates that the operand reads and writes to the buffer. These instructions and operand qualifiers help Glow determine when certain memory optimizations can be performed.

MLIR is highly influenced by LLVM, and it is a purer compiler infrastructure than LLVM. MLIR reuses many ideas and interfaces in LLVM, and sits between the model representation and code generation. MLIR has a flexible type system and allows multiple abstraction levels, and it introduces dialects to represent these multiple levels of abstraction. Each dialect consists of a set of defined immutable operations. The current dialects of MLIR include TensorFlow IR, XLA HLO IR, experimental polyhedral IR, LLVM IR, and TensorFlow Lite. The flexible transformations between dialects are also supported. Furthermore, MLIR can create new dialects to connect to a new low-level compiler, which paves the way for hardware developers and compiler researchers.

The HLO IR of XLA can be considered as both high-level IR and low-level IR, because HLO is fine-grained enough to represent the hardware-specific information. Besides, HLO supports hardware-specific optimizations and can be used to emit LLVM IR.

3.2.2 Code Generation Based on Low-Level IR

The low-level IR adopted by most DL compilers can be eventually lowered to LLVM IR, and benefits from LLVM's mature optimizer and code generator. Furthermore, LLVM can explicitly design custom instruction sets for specialized accelerators from scratch. However, traditional compilers may generate poor code when the computation is passed directly to LLVM IR. In order to avoid this situation, two approaches are applied by DL compilers to achieve hardware-dependent optimization: 1) perform target-specific loop transformations in the upper IR of LLVM (e.g., Halide-based IR and polyhedral-based IR), and 2) provide additional information about the hardware target for the optimization passes. Most DL compilers apply both approaches, but the emphasis differs. In general, the DL compilers that prefer frontend users (e.g., TC, TVM, XLA, and nGraph) might focus on 1), whereas the DL compilers that are more inclined to backend developers (e.g., Glow, PlaidML, and MLIR) might focus on 2).
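Both approaches can be sketched in a few lines with TVM's te API (an assumption for illustration; the exact target string depends on the local CPU and LLVM build): the schedule applies a loop transformation above LLVM IR, while the target string forwards hardware details to LLVM's optimizer and code generator:

    import tvm
    from tvm import te

    n = 1 << 16
    A = te.placeholder((n,), name="A")
    B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

    s = te.create_schedule(B.op)
    # Approach 1): a target-specific loop transformation applied above LLVM IR.
    outer, inner = s[B].split(B.op.axis[0], factor=16)
    s[B].vectorize(inner)

    # Approach 2): hardware information passed down to LLVM through the target
    # string (an assumed AVX-512 capable CPU; adjust to the machine at hand).
    mod = tvm.build(s, [A, B], target="llvm -mcpu=skylake-avx512", name="scale2")
    print(mod.get_source()[:400])  # peek at the generated LLVM IR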


Fig. 2. Example of computation graph optimizations, taken from the HLO graph of Alexnet on Volta GPU using Tensorflow XLA.

The compilation scheme in DL compilers can be mainly classified into two categories: just-in-time (JIT) and ahead-of-time (AOT). JIT compilers generate executable code on the fly, and they can optimize the code with better runtime knowledge. AOT compilers generate all executable binaries first and then execute them, and thus they have a larger scope in static analysis than JIT compilation. In addition, AOT approaches can be applied with cross-compilers for embedded platforms (e.g., C-GOOD [50]) as well as enable execution on remote machines (TVM RPC) and customized accelerators.

3.2.3 Discussion

In DL compilers, the low-level IR is a fine-grained representation of DL models, and it reflects the detailed implementation of DL models on diverse hardware. The low-level IRs include Halide-based IRs, polyhedral-based IRs, and other unique IRs. Although they differ in design, they leverage the mature compiler tool-chains and infrastructure to provide tailored interfaces for hardware-specific optimizations and code generation. The design of low-level IRs can also impact the design of new DL accelerators (e.g., TVM HalideIR and Inferentia, as well as XLA HLO and TPU).

3.3 Frontend Optimizations

After constructing the computation graph, the frontend applies graph-level optimizations. Many optimizations are easier to identify and perform at the graph level because the graph provides a global view of the computation. These optimizations are only applied to the computation graph, rather than the implementations on backends. Thus they are hardware-independent and can be applied to various backend targets.

The frontend optimizations are usually defined by passes, and can be applied by traversing the nodes of the computation graph and performing the graph transformations. The frontend provides methods to 1) capture the specific features from the computation graph and 2) rewrite the graph for optimization. Besides the pre-defined passes, the developers can also define customized passes in the frontend. Most DL compilers can determine the shape of both input tensors and output tensors of every operation once a DL model is imported and transformed into a computation graph. This feature allows DL compilers to perform optimizations according to the shape information. Fig. 2 shows an example of computation graph optimizations with TensorFlow XLA.

In this section, we classify the frontend optimizations into three categories: 1) node-level optimizations, 2) block-level (peephole, local) optimizations, and 3) dataflow-level (global) optimizations.

3.3.1 Node-Level Optimizations

The nodes of the computation graph are coarse enough to enable optimizations inside a single node. The node-level optimizations include node elimination, which eliminates unnecessary nodes, and node replacement, which replaces nodes with other lower-cost nodes.

In general-purpose compilers, Nop Elimination removes the no-op instructions, which occupy a small amount of space but specify no operation. In DL compilers, Nop Elimination is responsible for eliminating the operations lacking adequate inputs. For example, the sum node with only one input tensor can be eliminated, and the padding node with zero padding width can be eliminated.

Zero-dim-tensor elimination is responsible for removing the unnecessary operations whose inputs are zero-dimension tensors. Assume that A is a zero-dimension tensor and B is a constant tensor; then the sum operation node of A and B can be replaced with the already existing constant node B without affecting the correctness. Assume that C is a 3-dimension tensor, but the shape of one dimension is zero, such as {0, 2, 3}; then C has no element, and the argmin/argmax operation node can be eliminated.

3.3.2 Block-Level Optimizations

Algebraic Simplification. The algebraic simplification optimizations consist of 1) algebraic identification; 2) strength reduction, with which we can replace more expensive operators by cheaper ones; and 3) constant folding, with which we can replace constant expressions by their values. Such optimizations consider a sequence of nodes and then take advantage of the commutativity, associativity, and distributivity of different kinds of nodes to simplify the computation.

In addition to the typical operators (+, ×, etc.), the algebraic simplification can also be applied to DL specific operators (e.g., reshape, transpose, and pooling). The operators can be reordered and sometimes eliminated, which reduces redundancy and improves the efficiency. Here we illustrate the common cases where algebraic simplification can be applied: 1) optimization of computation order: in such case, the optimization finds and removes reshape/transpose operations according to specific characteristics. Taking the matrix multiplication (GEMM) for example, there are two matrices (e.g., A and B); both matrices are transposed (to produce A^T and B^T, respectively), and then A^T and B^T are multiplied together. However, a more efficient way to implement the GEMM is to switch the order of the arguments A and B, multiply them together, and then transpose the output of the GEMM, which reduces two transposes to just one. 2) optimization of node combination: in such case, the optimization combines multiple consecutive transpose nodes into a single node, eliminates identity transpose nodes, and optimizes transpose nodes into reshape nodes when they actually move no data. 3) optimization of ReduceMean nodes: in such case, the optimization performs the substitution of ReduceMean with an AvgPool node (e.g., in Glow) if the input of the reduce operator is 4D with the last two dimensions to be reduced.
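The transpose reordering in case 1 above can be checked numerically; the snippet below (a NumPy illustration, not the rewrite pass of any particular compiler) verifies that the rewritten subgraph with a single transpose produces the same result as the original subgraph with two transposes:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((32, 64))
    B = rng.standard_normal((48, 32))

    # Original subgraph: two transpose nodes feeding a GEMM.
    original = A.T @ B.T
    # Rewritten subgraph: one GEMM on the swapped operands plus a single transpose.
    rewritten = (B @ A).T

    print(np.allclose(original, rewritten))  # True: the rewrite is value-preserving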


Operator Fusion. Operator fusion is an indispensable optimization in DL compilers. It enables better sharing of computation, eliminates intermediate allocations, facilitates further optimization by combining loop nests [40], as well as reduces launch and synchronization overhead [26]. In TVM, the operators are classified into four categories: injective, reduction, complex-out-fusible, and opaque. When the operators are defined, their corresponding categories are determined. Targeting the above categories, TVM designs the fusion rules across operators. In TC, fusion is performed differently based on the automatic polyhedron transformations. However, how to identify and fuse more complicated graph patterns, such as blocks with multiple broadcast and reduce nodes, remains a problem. Recent works [51], [52] try to tackle this problem and propose a framework to explore and optimize aggressive fusion plans. It supports not only element-wise and reduction nodes, but also other computation/memory intensive nodes with complex dependencies.

Operator Sinking. This optimization sinks operations such as transposes below operations such as batch normalization, ReLU, sigmoid, and channel shuffle. By this optimization, many similar operations are moved closer to each other, creating more opportunities for algebraic simplification.

3.3.3 Dataflow-Level Optimizations

Common Sub-Expression Elimination (CSE). An expression E is a common sub-expression if the value of E has previously been computed and has not changed since the previous computation [53]. In this case, the value of E is computed once, and the already computed value of E can be used to avoid recomputing it in other places. The DL compilers search for common sub-expressions through the whole computation graph and replace the subsequent common sub-expressions with the previously computed results.

Dead Code Elimination (DCE). A set of code is dead if its computed results or side-effects are not used, and the DCE optimization removes the dead code. The dead code is usually not caused by programmers but is caused by other graph optimizations. Thus, DCE, as well as CSE, is applied after other graph optimizations. Other optimizations, such as dead store elimination (DSE), which removes stores into tensors that are never going to be used, also belong to DCE.

Static Memory Planning. Static memory planning optimizations are performed to reuse the memory buffers as much as possible. Usually, there are two approaches: in-place memory sharing and standard memory sharing. The in-place memory sharing uses the same memory for the input and output of an operation, and just allocates one copy of memory before computing. Standard memory sharing reuses the memory of previous operations without overlapping. The static memory planning is done offline, which allows more complicated planning algorithms to be applied. A recent work [54] first designs and performs memory-aware scheduling to minimize the peak activation memory footprint on edge devices, which presents new research directions of memory planning on memory-constrained devices.

Layout Transformation. Layout transformation tries to find the best data layouts to store tensors in the computation graph and then inserts the layout transformation nodes into the graph. Note that the actual transformation is not performed here; instead, it will be performed when evaluating the computation graph by the compiler backend.

In fact, the performance of the same operation differs across data layouts, and the best layouts also differ across hardware. For example, operations in the NCHW format on GPU usually run faster, so it is efficient to transform to the NCHW format on GPU (e.g., in TensorFlow). Some DL compilers rely on hardware-specific libraries to achieve higher performance, and the libraries may require certain layouts. Besides, some DL accelerators prefer more complicated layouts (e.g., tiled layouts). In addition, edge devices are usually equipped with heterogeneous computing units, and different units may require different data layouts for better utilization, thus layout transformation needs careful consideration. Therefore, the compilers need to provide a way to perform layout transformations across various hardware.

Not only do the data layouts of tensors have a nontrivial influence on the final performance, but the transformation operations also have a significant overhead, because they consume memory and computation resources as well.

A recent work [55] based on TVM targeting CPUs first alters the layout of all convolution operations in the computation graph to NCHW[x]c, in which c means the split sub-dimension of channel C and x indicates the split size of the sub-dimension. Then all x parameters are globally explored by auto-tuning when providing hardware details, such as cache line size, vectorization unit size, and memory access pattern, during hardware-specific optimizations.
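The NCHW[x]c packing can be expressed directly on a tensor. The sketch below is a NumPy illustration with an assumed split size x (it mirrors the idea of [55] but is not TVM's implementation): the channel dimension is split into an outer dimension and an inner, vectorization-friendly dimension of size x that becomes contiguous in memory:

    import numpy as np

    def nchw_to_nchwxc(t, x):
        # t: tensor in NCHW layout; x: split size of the channel sub-dimension.
        n, c, h, w = t.shape
        assert c % x == 0, "channel count must be divisible by the split size"
        # NCHW -> N, C//x, x, H, W -> N, C//x, H, W, x  (the trailing x is
        # contiguous, matching the vector width expected by the SIMD kernels).
        return t.reshape(n, c // x, x, h, w).transpose(0, 1, 3, 4, 2)

    act = np.arange(2 * 16 * 4 * 4, dtype=np.float32).reshape(2, 16, 4, 4)
    packed = nchw_to_nchwxc(act, x=8)
    print(packed.shape)  # (2, 2, 4, 4, 8)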


Fig. 3. Overview of hardware-specific optimizations applied in DL compilers.

3.3.4 Discussion

The frontend is one of the most important components in DL compilers, which is responsible for the transformation from DL models to high-level IR (e.g., the computation graph) and for the hardware-independent optimizations based on the high-level IR. Although the implementation of the frontend may differ in the data representation and operator definition of the high-level IR across DL compilers, the hardware-independent optimizations converge at three levels: node-level, block-level, and dataflow-level. The optimization methods at each level leverage DL specific as well as general compilation optimization techniques, which reduce the computation redundancy as well as improve the performance of DL models at the computation graph level.

3.4 Backend Optimizations

The backends of DL compilers commonly include various hardware-specific optimizations, auto-tuning techniques, and optimized kernel libraries. Hardware-specific optimizations enable efficient code generation for different hardware targets, whereas auto-tuning has been essential in the compiler backend to alleviate the manual efforts to derive the optimal parameter configurations. Besides, highly optimized kernel libraries are also widely used on general-purpose processors and other customized DL accelerators.

3.4.1 Hardware-Specific Optimization

Hardware-specific optimizations, also known as target-dependent optimizations, are applied to obtain high-performance code targeting specific hardware. One way to apply the backend optimizations is to transform the low-level IR into LLVM IR, to utilize the LLVM infrastructure to generate optimized CPU/GPU code. The other way is to design customized optimizations with DL domain knowledge, leveraging the target hardware more efficiently. Since hardware-specific optimizations are tailored for particular hardware and cannot be covered exhaustively in this paper, we present five widely adopted approaches in existing DL compilers. The overview of these hardware-specific optimizations is shown in Fig. 3, and the detailed descriptions are provided as follows.

Hardware Intrinsic Mapping. Hardware intrinsic mapping can transform a certain set of low-level IR instructions to kernels that have already been highly optimized on the hardware. In TVM, the hardware intrinsic mapping is realized in the method of extensible tensorization, which can declare the behavior of the hardware intrinsic and the lowering rule for intrinsic mapping. This method enables the compiler backend to apply hardware implementations as well as highly optimized handcraft micro-kernels to a specific pattern of operations, which results in a significant performance gain. Whereas, Glow supports hardware intrinsic mapping such as quantization. It can estimate the possible numeric range for each stage of the neural network and support profile-guided optimization to perform quantization automatically. Besides, Halide/TVM maps specific IR patterns to SIMD opcodes on each architecture to avoid the inefficiency of LLVM IR mapping when encountering vector patterns.

Memory Allocation and Fetching. Memory allocation is another challenge in code generation, especially for GPUs and customized accelerators. For example, a GPU contains primarily shared memory space (lower access latency with limited memory size) and local memory space (higher access latency with large capacity). Such a memory hierarchy requires efficient memory allocation and fetching techniques for improving data locality. To realize this optimization, TVM introduces the scheduling concept of memory scope. Memory scope schedule primitives can tag a compute stage as shared or thread-local. For compute stages tagged as shared, TVM generates code with shared memory allocation as well as cooperative data fetching, which inserts memory barriers at the proper code positions to guarantee correctness. Besides, TC provides similar features (known as memory promotion) by extending the PPCG [56] compiler. However, TC only supports limited pre-defined rules. Particularly, TVM enables special buffering in accelerators through memory scope schedule primitives.

Memory Latency Hiding. Memory latency hiding is also an important technique used in the backend by reordering the execution pipeline. As most DL compilers support parallelization on CPU and GPU, memory latency hiding can be naturally achieved by the hardware (e.g., warp context switching on GPU). But for TPU-like accelerators with a decoupled access-execute (DAE) architecture, the backend needs to perform scheduling and fine-grained synchronization to obtain correct and efficient code. To achieve better performance as well as reduce the programming burden, TVM introduces the virtual threading schedule primitive, which enables users to specify the data parallelism on a virtualized multi-thread architecture. Then TVM lowers these virtually parallelized threads by inserting necessary memory barriers and interleaves the operations from these threads into a single instruction stream, which forms a better execution pipeline of each thread to hide the memory access latency.
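The memory scope idea can be sketched with TVM's te schedule primitives (assuming the te API; the schedule below is deliberately simplified and only illustrates the primitives, whereas a realistic GPU schedule would also tile the reduction dimension and bind the cooperative loads to threads):

    import tvm
    from tvm import te

    M = N = K = 1024
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    # Tag copies of the operands with the "shared" memory scope (GPU shared memory).
    AA = s.cache_read(A, "shared", [C])
    BB = s.cache_read(B, "shared", [C])

    # Bind the outer loop to thread blocks and the inner loop to threads.
    bx, tx = s[C].split(C.op.axis[0], factor=32)
    s[C].bind(bx, te.thread_axis("blockIdx.x"))
    s[C].bind(tx, te.thread_axis("threadIdx.x"))

    # Place the shared-memory loads inside the block loop; when generating GPU
    # code, TVM emits the data fetches and synchronization barriers for these stages.
    s[AA].compute_at(s[C], bx)
    s[BB].compute_at(s[C], bx)

    print(tvm.lower(s, [A, B, C], simple_mode=True))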


Loop Oriented Optimizations. Loop oriented optimizations are also applied in the backend to generate efficient code for target hardware. Since Halide and LLVM [30] (integrated with the polyhedral method) have already incorporated such optimization techniques, some DL compilers leverage Halide and LLVM in their backends. The key techniques applied in loop oriented optimizations include loop fusion, sliding windows, tiling, loop reordering, and loop unrolling; a short schedule sketch after this list illustrates several of them.

1) Loop fusion: Loop fusion is a loop optimization technique that can fuse loops with the same boundaries for better data reuse. For compilers such as PlaidML, TVM, TC, and XLA, such optimization is performed by the Halide schedule or the polyhedral approach, while Glow applies loop fusion by its operator stacking.

2) Sliding windows: Sliding windows is a loop optimization technique adopted by Halide. Its central concept is to compute values when needed and store them on the fly for data reuse until they are no longer required. As sliding windows interleaves the computation of two loops and makes them serial, it is a tradeoff between parallelism and data reuse.

3) Tiling: Tiling splits loops into several tiles, and thus loops are divided into outer loops iterating through tiles and inner loops iterating inside a tile. This transformation enables better data locality inside a tile by fitting a tile into hardware caches. As the size of a tile is hardware-specific, many DL compilers determine the tiling pattern and size by auto-tuning.

4) Loop reordering: Loop reordering (also known as loop permutation) changes the order of iterations in a nested loop, which can optimize the memory access and thus increase the spatial locality. It is specific to data layout and hardware features. However, it is not safe to perform loop reordering when there are dependencies along the iteration order.

5) Loop unrolling: Loop unrolling can unroll a specific loop to a fixed number of copies of loop bodies, which allows the compilers to apply aggressive instruction-level parallelism. Usually, loop unrolling is applied in combination with loop split, which first splits the loop into two nested loops and then unrolls the inner loop completely.
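As a concrete illustration of tiling, loop reordering, and loop unrolling, the following minimal sketch uses TVM's tensor expression (te) API on a simple element-wise computation. The factors and the computation itself are illustrative choices rather than a tuned configuration of any compiler discussed here, and the sketch assumes a recent TVM release where the te module is available.

    import tvm
    from tvm import te

    # Element-wise addition used only to demonstrate the schedule primitives.
    n = 1024
    A = te.placeholder((n, n), name="A")
    B = te.placeholder((n, n), name="B")
    C = te.compute((n, n), lambda i, j: A[i, j] + B[i, j], name="C")

    s = te.create_schedule(C.op)
    # Tiling: split each loop into an outer tile loop and an inner intra-tile loop.
    io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=32, y_factor=32)
    # Loop reordering: iterate over tiles first, then inside each tile.
    s[C].reorder(io, jo, ii, ji)
    # Loop unrolling: fully unroll the innermost intra-tile loop.
    s[C].unroll(ji)
    print(tvm.lower(s, [A, B, C], simple_mode=True))

Printing the lowered IR shows the tiled loop nest with the innermost loop unrolled, which makes the effect of each primitive easy to inspect before code generation.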
Parallelization. As modern processors generally support multi-threading and SIMD parallelism, the compiler backend needs to exploit parallelism to maximize hardware utilization for high performance. Halide uses a schedule primitive called parallel to specify the parallelized dimension of the loop for thread-level parallelization and supports GPU parallelization by mapping loop dimensions tagged as parallel with annotations of block and thread. And it replaces a loop of size n with an n-wide vector statement, which can be mapped to hardware-specific SIMD opcodes through hardware intrinsic mapping. Stripe develops a variant of the polyhedral model called the nested polyhedral model, which introduces the parallel polyhedral block as its basic execution element of iteration. After this extension, a nested polyhedral model can detect hierarchical parallelization among levels of tiling and striding. In addition, some DL compilers rely on handcrafted libraries such as Glow's or optimized math libraries provided by hardware vendors (discussed in Section 3.4.3). Meanwhile, Glow offloads the vectorization to LLVM because the LLVM auto-vectorizer works well when the information of tensor dimension and loop trip count is provided. However, exploiting the parallelism entirely in the compiler backend makes it possible to apply more domain-specific knowledge of DL models, and thus leads to higher performance at the expense of more engineering effort.
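The thread-level and SIMD parallelism described above can also be expressed directly as schedule primitives. Below is a minimal sketch in the same TVM te style; it is an illustrative example rather than the schedule used by any particular backend.

    import tvm
    from tvm import te

    n = 4096
    A = te.placeholder((n,), name="A")
    B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

    s = te.create_schedule(B.op)
    outer, inner = s[B].split(B.op.axis[0], factor=8)
    s[B].parallel(outer)    # thread-level parallelism over the outer loop
    s[B].vectorize(inner)   # map the 8-wide inner loop to SIMD lanes
    print(tvm.lower(s, [A, B], simple_mode=True))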
3.4.2 Auto-Tuning

Due to the enormous search space for parameter tuning in hardware-specific optimizations, it is necessary to leverage auto-tuning to determine the optimal parameter configurations. Among the studied DL compilers in this survey, TVM, TC, and XLA support auto-tuning. Generally, the auto-tuning implementation includes four key components: parameterization, cost model, searching technique, and acceleration.

Parameterization. 1) Data and target: The data parameter describes the specification of the data, such as input shapes. The target parameter describes hardware-specific characteristics and constraints to be considered during optimization scheduling and code generation. For example, for the GPU target, the hardware parameters such as shared memory and register size need to be specified. 2) Optimization options: The optimization options include the optimization scheduling and corresponding parameters, such as loop oriented optimizations and tile size. In TVM, both pre-defined and user-defined scheduling, as well as parameters, are taken into consideration. Whereas, TC and XLA prefer to parameterize the optimizations that have a strong correlation with performance and can be changed later at a low cost. For example, the minibatch dimension is one of the parameters that is usually mapped to grid dimensions in CUDA and can be optimized during auto-tuning.

Cost Model. The comparison of different cost models applied in auto-tuning is as follows. 1) Black-box model: This model only considers the final execution time rather than the characteristics of the compilation task. It is easy to build a black-box model, but it easily ends up with higher overhead and a less optimal solution without the guidance of task characteristics. TC adopts this model. 2) ML-based cost model: An ML-based cost model is a statistical approach to predict performance using a machine learning method. It enables the model to update as new configurations are explored, which helps achieve higher prediction accuracy. TVM and XLA adopt this kind of model, for example, the gradient tree boosting model (GBDT) and the feedforward neural network [57] (FNN) respectively. 3) Pre-defined cost model: An approach based on a pre-defined cost model expects a perfect model built on the characteristics of the compilation task and able to evaluate the overall performance of the task. Compared to the ML-based model, the pre-defined model generates less computation overhead when applied, but requires large engineering efforts for re-building the model on each new DL model and hardware.
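To make the ML-based cost model concrete, the sketch below trains a gradient-boosted regressor to predict the runtime of a schedule configuration from a hand-picked feature vector, and it is refit as new measurements arrive. This is an illustrative stand-in rather than the actual model used by TVM or XLA, and the feature names are assumptions.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def featurize(config):
        # Hypothetical features of a schedule configuration.
        return [config["tile_x"], config["tile_y"], config["unroll"], config["vectorize"]]

    measured_features, measured_times = [], []
    model = GradientBoostingRegressor()

    def update_cost_model(config, runtime_seconds):
        # Retrain whenever a new configuration has been measured on the hardware.
        measured_features.append(featurize(config))
        measured_times.append(runtime_seconds)
        model.fit(np.array(measured_features), np.array(measured_times))

    def predict_runtime(config):
        # Rank unseen configurations without running them (call after a few updates).
        return float(model.predict(np.array([featurize(config)]))[0])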
Searching Technique. 1) Initialization and searching space determination: The initial option can either be set randomly or based on the known configurations, such as
configurations given by users or historical optimal configurations. In terms of searching space, it should be specified before auto-tuning. TVM allows developers to specify the searching space with their domain-specific knowledge and provides automatic search space extraction for each hardware target based on the computational description. In contrast, TC relies on the compilation cache and pre-defined rules. 2) Genetic algorithm (GA) [58]: GA considers each tuning parameter as a gene and each configuration as a candidate. New candidates are iteratively generated by crossover, mutation, and selection according to the fitness value, which is a metaheuristic inspired by the process of natural selection. Finally, the optimal candidate is derived. The rates of crossover, mutation, and selection are used for controlling the tradeoff between exploration and exploitation. TC adopts GA in its auto-tuning technique. 3) Simulated annealing algorithm (SA) [59]: SA is also a metaheuristic, inspired by annealing. It allows us to accept worse solutions with a decreasing probability, which can find the approximate global optimum and avoid a merely local optimum within a fixed number of iterations. TVM adopts SA in its auto-tuning technique. 4) Reinforcement learning (RL): RL learns to maximize the reward in a given environment by trading off exploration and exploitation. Chameleon [60] (built upon TVM) adopts RL in its auto-tuning technique.
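A minimal sketch of the simulated-annealing search loop described above, written generically: neighbors and measure are user-supplied (hypothetical) callbacks that enumerate nearby configurations and return their measured execution time, so this is not TVM's actual implementation.

    import math
    import random

    def simulated_annealing(initial_cfg, neighbors, measure, steps=100, t0=1.0):
        """Search for a low-cost configuration; neighbors() and measure() are user supplied."""
        current, best = initial_cfg, initial_cfg
        current_cost = best_cost = measure(initial_cfg)
        t = t0
        for _ in range(steps):
            candidate = random.choice(neighbors(current))
            cost = measure(candidate)
            # Accept worse candidates with a probability that shrinks as the temperature cools.
            if cost < current_cost or random.random() < math.exp((current_cost - cost) / t):
                current, current_cost = candidate, cost
                if cost < best_cost:
                    best, best_cost = candidate, cost
            t *= 0.95
        return best, best_cost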
Acceleration. 1) Parallelization: One direction for accelerating auto-tuning is parallelization. TC proposes a multi-thread, multi-GPU strategy considering that the genetic algorithm needs to evaluate all candidates in each generation. First, it enqueues candidate configurations and compiles them on multiple CPU threads. The generated code is evaluated on GPUs in parallel, and each candidate owns its fitness used by the parent choosing step. After finishing the whole evaluation, the new candidates are generated, and the new compilation jobs are enqueued, waiting to be compiled on the CPU. Similarly, TVM supports cross-compilation and RPC, allowing users to compile on the local machine and run the programs with different auto-tuning configurations on multiple targets. 2) Configuration reuse: Another direction for accelerating auto-tuning is to reuse the previous auto-tuning configurations. TC stores the fastest known generated code version corresponding to the given configuration in a compilation cache. The cache is queried before each kernel optimization during the compilation, and the auto-tuning is triggered upon a cache miss. Similarly, TVM produces a log file that stores the optimal configurations for all scheduled operators and queries the log file for the best configurations during compilation. It is worth mentioning that TVM performs auto-tuning for each operator in Halide IR (e.g., conv2d), and thus the optimal configurations are determined for each operator separately.
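The configuration-reuse idea can be sketched as a simple persistent cache that is queried before tuning and only falls back to the expensive search on a miss. The file name and key format are assumptions, and this is a generic illustration rather than TC's compilation cache or TVM's log-file format.

    import json
    import os

    CACHE_FILE = "autotune_cache.json"   # hypothetical on-disk cache of best-known configurations

    def lookup_or_tune(kernel_key, tune_fn):
        """Query the cache before tuning and fall back to the expensive search on a miss."""
        cache = json.load(open(CACHE_FILE)) if os.path.exists(CACHE_FILE) else {}
        if kernel_key in cache:
            return cache[kernel_key]        # reuse a previously found configuration
        best_cfg = tune_fn()                # auto-tuning runs only on a cache miss
        cache[kernel_key] = best_cfg
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)
        return best_cfg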
3.4.3 Optimized Kernel Libraries

There are several highly-optimized kernel libraries widely used to accelerate DL training and inference on various hardware. DNNL (previously MKL-DNN) from Intel, cuDNN from NVIDIA, and MIOpen from AMD are widely used libraries. Both computation-intensive primitives (e.g., convolution, GEMM, and RNN) and memory-bandwidth-limited primitives (e.g., batch normalization, pooling, and shuffle) are highly optimized according to the hardware features (e.g., AVX-512 ISA, tensor cores). Customizable data layouts are supported to make it easy to integrate into DL applications and avoid frequent data layout transformations. Besides, low-precision training and inference, including FP32, FP16, INT8, and the non-IEEE floating-point format bfloat16 [61], are also supported. Other customized DL accelerators also maintain their specific kernel libraries [22], [23].

Existing DL compilers, such as TVM, nGraph, and TC, can generate the function calls to these libraries during code generation. However, if DL compilers need to leverage the existing optimized kernel libraries, they should first transform the data layouts and fusion styles into the types that are pre-defined in the kernel libraries. Such transformation may break the optimal control flow. Moreover, the DL compilers treat the kernel libraries as black boxes. Therefore, they are unable to apply optimizations across operators (e.g., operator fusion) when invoking kernel libraries. In sum, using optimized kernel libraries achieves significant performance improvement when the computation can be satisfied by specific highly-optimized primitives; otherwise it may be constrained from further optimization and suffer from less optimal performance.
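As an example of generating calls into a vendor kernel library, TVM lets the target string request library offloading. The sketch below builds a single convolution with cuDNN offloading enabled; it assumes a recent TVM built with CUDA and cuDNN support, and the shapes are arbitrary.

    import tvm
    from tvm import relay

    # A single convolution whose execution is offloaded to cuDNN via the target string.
    data = relay.var("data", shape=(1, 64, 56, 56), dtype="float32")
    weight = relay.var("weight", shape=(64, 64, 3, 3), dtype="float32")
    conv = relay.nn.conv2d(data, weight, padding=(1, 1))
    mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

    # "-libs=cudnn" asks the CUDA backend to emit calls into the vendor kernel library.
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="cuda -libs=cudnn")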
3.4.4 Discussion

The backend is responsible for bare-metal optimizations and code generation based on the low-level IR. Although the design of backends may differ due to various low-level IRs, their optimizations can be classified into hardware-specific optimizations, auto-tuning techniques, and optimized kernel libraries. These optimizations can be performed separately or combined, to achieve better data locality and parallelization by exploiting the hardware/software characteristics. Eventually, the high-level IR of DL models is transformed into an efficient code implementation on different hardware.

4 TAXONOMY OF DL COMPILERS

The DL compilers studied in this survey include TVM, nGraph, Tensor Comprehensions (TC), Glow, and XLA. We select these compilers since they are well-known, well maintained, and most importantly, widely used. Thus, we can find enough papers, documents, and discussions from both industry and academia in order to study their designs and implementations in depth. Table 1 illustrates the taxonomy of the selected DL compilers from four perspectives, including frontend, backend, IR, and optimizations, which corresponds with the key components described in this survey.

Specifically, we provide more information about the compilers to the best of our knowledge. We not only provide whether a compiler supports a specific feature, but also describe how to use this feature through its programming interface. In addition, we also describe the developing status of specific features and the reasons why specific features are not supported in particular compilers. The target of this taxonomy is to provide guidelines about the selection of DL compilers for practitioners considering their requirements, as well as to give a thorough summary of the DL compilers for researchers.


TABLE 1: The Comparison of DL Compilers, Including TVM, nGraph, TC, Glow, and XLA

In Table 1, we present the features of each DL compiler, including developer, programming language, ONNX/framework support, training support, and quantization support in the frontend category, and we present the compilation methods and supported devices in the backend category. These features are summarized because they strongly affect the usage of DL compilers in particular scenarios. Based on these features, practitioners or researchers can easily decide which DL compiler they would like to work upon.

Table 1, together with Fig. 1, can serve as a systematic summary of this survey. Through them, readers can identify the features each compiler supports as well as the key components of each compiler. More detailed information is presented in the following sections.

5 EVALUATION

5.1 Experimental Setup

Our experiments are conducted on two GPU-equipped machines, and the hardware configuration is shown in Table 2. We evaluate the performance of TVM (v0.6.0), nGraph (0.29.0-rc.0), TC (commit fd01443), Glow (commit 7e68188) and XLA (TensorFlow 2.2.0) on CPU and GPU. We select 19 neural network models in ONNX format as our datasets, which are converted from the Torchvision2 model zoo and the GluonCV3 model zoo. These models include
full-fledged models: ResNet, DenseNet and VGG series, and lightweight models: MobileNet and MNASNet series. To import the ONNX models, as shown in Table 1, we use the built-in tvm.relay.frontend.from_onnx interface of TVM, the ngraph-onnx Python package of nGraph, the built-in ONNXModelLoader of Glow, and the tensorflow-onnx Python package of XLA. Notably, TC lacks the support of ONNX, so we only evaluate it in the following per-layer performance comparison. Each model is executed 15 times, and we report the average execution time of the last 10 executions for each compiler, because we regard the first 5 executions as the warm-up to eliminate the overhead of JIT compilation.

2. https://pytorch.org/docs/stable/torchvision/models.html
3. https://gluon-cv.mxnet.io/model_zoo/index.html
Fig. 4. The performance comparison of end-to-end inference across TVM, nGraph, Glow, and XLA on CPU and GPU.

Fig. 5. The performance comparison of convolution layers in MobileNetV2_1.0 across TVM, TC, Glow, and XLA on V100 GPU.

TABLE 2: The Hardware Configuration

5.2 End-to-End Performance Comparison

As shown in Fig. 4, we compare the performance of end-to-end inference across TVM, nGraph, Glow, and XLA. We evaluate these compilers on both CPUs (Broadwell and Skylake) and GPUs (V100 and 2080Ti). Note that we omit the comparison of TC here, because TC is closer to a kernel library than to a fully functional DL compiler, and it requires the users to implement all layers of a model with its Einstein notation manually, which leads to heavy engineering efforts for a fair comparison. Another reason is that TC only supports running on GPU, thus we cannot obtain its performance results on CPU. However, for detailed comparisons (Figs. 5 and 7), we still implement several ResNet and MobileNetV2 models in TC. In sum, we compare and analyze the performance results from the following perspectives.

Compatibility. Although nGraph and XLA claim to support ONNX, there are still compatibility problems. 1) nGraph fails to run the DenseNet121, VGG16/19 and MNASNet0_5/1_0 models due to tensors with dynamic shapes. Alternatively, we replace the DenseNet121, VGG16/19 models with the corresponding models from the ONNX model zoo,4 while the MNASNet0_5/1_0 models are not available. Besides, when we set PlaidML as the backend of nGraph on GPU, we fail to run all MobileNet models, because PlaidML cannot handle the inconsistent definition of operators across different DL frameworks. 2) XLA can run all selected models, however, the performance is quite
low. Thus, we replace the selected ONNX models with the SavedModels from the Tensorflow Hub,5 while the MNASNet0_5/1_0 models are not available. With models from Tensorflow Hub, XLA becomes two orders of magnitude faster, and the performance of XLA becomes competitive with other compilers.

4. https://github.com/onnx/models
5. https://tfhub.dev/

Fig. 6. The performance comparison of convolution layers in MobileNetV2_1.0 across TVM, nGraph, and Glow on Broadwell CPU.

Performance. From Fig. 4, we have several observations about the performance, illustrated as follows.

1) On CPU, the performance of Glow is worse than other compilers. This is because Glow does not support thread parallelism. Thus it cannot fully utilize the multi-core CPU, whereas TVM, nGraph, and XLA can leverage all CPU cores.

2) XLA has similar end-to-end inference performance for both full-fledged models (ResNet, DenseNet and VGG series) and lightweight models (MobileNet and MNASNet series). Besides, its inference performance on CPU and GPU is almost the same. It is known that XLA is embedded in the Tensorflow framework. Tensorflow contains a complicated runtime compared to TVM, nGraph, and Glow, which introduces non-trivial overhead to XLA. In addition, if we increase the batch size (set to one by default in our evaluation) and focus on the throughput of DL compilers, then the overhead of XLA can be ignored with higher throughput.

3) In general, on CPU, TVM and nGraph achieve better performance across all models than the other DL compilers, due to the limitations of Glow and XLA described above. TVM has comparable performance with nGraph on full-fledged models, while it is better than nGraph on lightweight models. nGraph relies on the DNNL (previously MKL-DNN) library for acceleration. Thus, nGraph can offload the optimized subgraphs to DNNL and benefit from DNNL's fine-grained instruction-level JIT optimizations tailored for Intel CPU.

4) The tuned TVM (tuned with 200 trials) almost achieves the best performance on both CPU and GPU across all models, especially on lightweight models (MobileNet, MNASNet series). Based on our investigation, this is because the schedules of classic operators inside these models have already been well designed by TVM developers, with the default parameters provided in TVM tophub. The default schedules and parameters can help TVM to achieve similar performance compared to other DL compilers. In addition, the performance difference between the tuned TVM and untuned TVM is negligible on CPU but quite significant on GPU (41.26x speedup on average). This is because the GPU has a more complicated thread and memory hierarchy than the CPU, thus to exploit the computation power, the GPU requires more fine-grained scheduling (e.g., tile, split, and reorder in TVM). Therefore, it is crucial to determine the optimal scheduling parameters on GPU, where the auto-tuning exhibits its effectiveness.

5.3 Per-Layer Performance Comparison

To further compare the capability of backend optimizations of DL compilers, we evaluate the per-layer (convolution layers, since they dominate the inference time) performance of ResNet50 and MobileNetV2_1.0 on V100 GPU and Broadwell CPU (single-threaded since Glow lacks multi-threading support).

Methodology. To measure the execution time of individual layers, we adopt different methods considering the DL compilers, the hardware (CPU/GPU), and the CNN models. Specifically, 1) On TVM, we re-use the logs of autotuning to extract the kernel shapes and the optimal schedule. Then we rebuild the individual convolution layers and use the time_evaluator for evaluation. 2) We extract the execution time through the tracing files of Glow. 3) We measure the execution time of hand-written kernels on TC. 4) As for nGraph, we make use of the timeline to measure the execution time on CPU. However, the timeline is not supported by its PlaidML backend (which provides GPU support through OpenCL). Besides, there are no available methods to profile the command queues within OpenCL. Therefore, we leave the profiling of the per-layer performance of nGraph on GPU for future work. 5) As for XLA, we leverage the built-in tf.profiler.experimental method for CPU performance and the DLProf [62] toolkit from Nvidia for GPU performance.

Performance. From Figs. 5, 6, 7, and 8, we have several observations about the performance, illustrated as follows.

1) nGraph achieves better performance for the convolution layers on CPU, which benefits from the co-design of hardware (Intel CPU) and software (compiler, library, and runtime). Whereas, TVM performs better on GPU across these compilers. On MobileNetV2_1.0, the performance of TVM is not stable, especially on the conv1 layer. This is because the autotuning process is affected by other processes on the same machine, and thus it tends to derive imprecise, or even harmful, scheduling parameters.

2) TC allows users to define a tensor computation kernel (e.g., convolution) by the Einstein notation without specifying the shape of input/output tensors (e.g., kernel size). Then the kernel is autotuned and stored in its compilation cache
to accelerate further autotuning and compilation. However, in our evaluation, we find the performance of TC heavily relies on the initially compiled kernels. Take MobileNetV2_1.0 for example: if we initialize the autotuning with layer c1, then c1 can perform well. But the following c*_b* layers become much slower as the layers go deeper (far away from the c1 layer). To derive consistent performance, we need to tune each kernel separately.

3) Glow falls behind other compilers in optimizing the 1 x 1 convolutions (e.g., the b*_linear layers) of MobileNetV2_1.0 as well as the depth-wise separable convolutions (e.g., c*_b*_2 layers) of ResNet50. It takes a longer time to compute these convolutions both on GPU and CPU. We notice the convolutions are usually fused with other layers (e.g., ReLU, BatchNorm) on Glow, which could explain the lower performance compared to other compilers. Moreover, on CPU, the convolutions at the end of MobileNetV2_1.0 take a much shorter time than convolutions at the beginning. According to the tracing log, we notice these convolutions are accelerated by the CPUConvDKKC8 optimization [27], which applies tiling, layout transformation, and vectorization to convolutions with specific patterns.

4) As for XLA, it can automatically compile (_XlaCompile) the eligible subgraphs from Tensorflow and replace the subgraphs with the resultant binaries (_XlaRun). In addition, the convolution layers may be clustered with other kernels, and thus their performance is not easy to measure individually. Therefore, we have counted the clustered and the non-clustered convolutions, and the data is shown in Table 3. Note that the MobileNetV2_1.0 model in Tensorflow is a little bit different from the ONNX model for the beginning and ending layers; however, the linearbottleneck layers are the same. Moreover, if a convolution is to be clustered, it can be measured at most twice until _XlaCompile finishes. Therefore, there are five extreme values in Fig. 5 (corresponding to the 5 clustered convolutions in MobileNetV2_1.0). Actually, only the clustered kernels are optimized by XLA, while the non-clustered ones are optimized by Tensorflow. Therefore, it is impossible to measure the execution time of a standalone convolution layer optimized by XLA. Consequently, we decide not to include the performance of XLA in Figs. 6, 7, and 8.

Fig. 7. The performance comparison of convolution layers in ResNet50 across TVM, TC, and Glow on V100 GPU.

Fig. 8. The performance comparison of convolution layers in ResNet50 across TVM, nGraph, and Glow on Broadwell CPU.

TABLE 3: The Number of the Clustered and Non-Clustered Convolutions of XLA on V100 GPU and Broadwell CPU

                 MobileNetV2_1.0             ResNet50
                 Clustered  Non-clustered    Clustered  Non-clustered
    V100             5          47               0          53
    Broadwell       17          35              53           0

5.4 Discussion

Through the above quantitative performance comparison across DL compilers, we can analyze in depth the coarse-grained end-to-end performance with both frontend (graph-level) and backend (operator-level) optimizations, as well as the fine-grained per-layer performance of the convolutions with backend optimizations. However, there are still open challenges to accurately measure the effectiveness of the optimizations adopted by different DL compilers. One particular difficulty during our evaluation is that the frontend and backend optimizations are usually tightly coupled in existing DL compilers, because 1) the frontend optimizations usually affect a series of operators, thus the optimized operators as the inputs to the backend optimizations differ across different compilers; 2) these
optimizations tend to be co-designed to further exploit the performance opportunities (e.g., clustering in XLA and more advanced optimizations [52], [55]). Therefore, it is difficult if not impossible to evaluate and compare specific optimizations across DL compilers individually.

To tackle this problem, we have been working on building a universal benchmarking framework for existing DL compilers to measure the per-layer performance. The fundamental idea is to extract the necessary structures and parameters of the target layers (we name them model fragments), and rebuild the layers as acceptable inputs to a particular DL compiler, which allows the compiler to apply the corresponding frontend and backend optimizations faithfully. We can then measure the performance of these optimized model fragments to understand the effectiveness of DL compilers at the layers of interest. The benchmarking framework using model fragments is scalable to customized layers (e.g., fused layers) of interest. With such a benchmarking framework available, we can derive both coarse-grained (e.g., end-to-end) and fine-grained (e.g., per-layer) performance metrics for each DL compiler, and thus compare the effectiveness of optimizations across different DL compilers at the level of interest. Currently, we have successfully experimented by extracting the target layers from state-of-the-art CNN models, such as the bottleneck of ResNet50 and the linearbottleneck of MobileNetV2_1.0. Our benchmarking framework is still under rapid development, and we hope to make it available to the community soon.
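One way to pull such a model fragment out of an existing model is onnx.utils.extract_model, which cuts a sub-model between named tensors. The file and tensor names below are assumptions, and this is only a sketch of the extraction step, not the benchmarking framework described above.

    import onnx

    # Cut a sub-model between two named tensors and save it as a standalone ONNX file.
    onnx.utils.extract_model(
        "resnet50.onnx",                 # hypothetical full model
        "resnet50_bottleneck.onnx",      # the extracted fragment
        input_names=["input_tensor"],    # boundary tensor names are assumptions
        output_names=["bottleneck_out"],
    )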
6 CONCLUSION AND FUTURE DIRECTIONS

In this survey, we present a thorough analysis of the existing DL compilers targeting their design principles. First, we take a deep dive into the common architecture adopted in the existing DL compilers, including the multi-level IR, the frontend and the backend. We present the design philosophies and reference implementations of each component in detail, with the emphasis on the unique IRs and optimizations specific to DL compilers. We provide a comprehensive taxonomy as well as a quantitative performance comparison among DL compilers. And we summarize the findings in this survey and highlight the future directions in DL compilers as follows:

Dynamic Shape and Pre/Post Processing. Dynamic models become more and more popular in the field of DL, whose input shape or even model itself may change during execution. Particularly, in the area of NLP, models may accept inputs of various shapes, which is challenging for DL compilers since the shape of data is unknown until runtime. Existing DL compilers require more research efforts to support dynamic shapes efficiently for emerging dynamic models.

In addition, as future DL models become more complex, their entire control flow may inevitably include complicated pre/post-processing procedures. Currently, most DL compilers use Python as their programming language, and the pre/post-processing could become a performance bottleneck when it is executed by the Python interpreter. Such a potential performance bottleneck has not yet been considered by existing DL compilers. Supporting the entire control flow in the DL compiler enables expressing and optimizing the pre/post-processing along with the DL models, which opens up new opportunities for performance acceleration in model deployment.

Advanced Auto-Tuning. Existing auto-tuning techniques focus on the optimization of individual operators. However, combining the local optima does not necessarily lead to the global optimum. For example, two adjacent operators that apply on different data layouts can be tuned together without introducing extra memory transformations in between. Besides, with the rise of edge computing, execution time is not the only optimization objective for DL compilers. New optimization targets should also be considered in the auto-tuning, such as memory footprint and energy consumption.

Particularly, for the ML-based auto-tuning techniques, there are several directions worth further exploring. First, the ML techniques can be applied in other stages of auto-tuning, other than the cost model. For example, in the stage of selecting compiler options and optimization schedules, ML techniques can be used to predict the promising candidates directly and to develop algorithms that determine the final configurations. Second, the ML-based auto-tuning techniques can be improved based on domain knowledge. For example, incorporating feature engineering (selecting features to represent programs) [63] in auto-tuning techniques could be a potential direction for achieving better tuning results.

Polyhedral Model. It is a promising research direction to combine the polyhedral model and auto-tuning techniques in the design of DL compilers for efficiency. On one hand, auto-tuning can be applied to minimize the overhead of polyhedral JIT compilation by reusing the previous configurations. On the other hand, the polyhedral model can be used to perform auto-scheduling, which can reduce the search space of auto-tuning.

Another challenge of applying the polyhedral model in DL compilers is to support the sparse tensor. In general, the format of a sparse tensor such as CSF [64] expresses the loop indices with index arrays (e.g., a[b[i]]) that are no longer linear. Such indirect index addressing leads to non-affine subscript expressions and loop bounds, which prohibits the loop optimization of the polyhedral model [65], [66]. Fortunately, the polyhedral community has made progress in supporting sparse tensors [67], [68], and integrating the latest advancement of the polyhedral model can increase the performance opportunities for DL compilers.
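The indirect indexing problem can be seen in a plain CSR sparse matrix-vector product, where both a loop bound and a subscript are read from index arrays (the a[b[i]] pattern), so neither is an affine function of the loop variables:

    import numpy as np

    def spmv_csr(row_ptr, cols, vals, x):
        """CSR sparse matrix-vector product with indirect (non-affine) indexing."""
        y = np.zeros(len(row_ptr) - 1)
        for i in range(len(row_ptr) - 1):
            for k in range(row_ptr[i], row_ptr[i + 1]):   # loop bounds come from an index array
                y[i] += vals[k] * x[cols[k]]              # subscript of x is the a[b[i]] pattern
        return y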
Subgraph Partitioning. DL compilers supporting subgraph partitioning can divide the computation graph into several subgraphs, and the subgraphs can be processed in different manners. Subgraph partitioning presents more research opportunities for DL compilers. First, it opens up the possibility to integrate graph libraries for optimization. Take nGraph and DNNL for example: DNNL is a DL library with graph optimizations leveraging a vast collection of highly optimized kernels. The integration of DNNL with nGraph enables DNNL to speed up the execution of the subgraphs generated by nGraph. Second, it opens up the possibility of heterogeneous and parallel execution. Once the computation graph is partitioned into subgraphs, the execution of different subgraphs can be assigned to heterogeneous hardware targets at the same time. Take the edge device for example: its computation units may consist of ARM CPU, Mali GPU, DSP, and probably NPU. Generating subgraphs from the DL compilers that utilize all computation units efficiently can deliver significant speedups for DL tasks.
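A toy sketch of the partitioning idea: consecutive operators are grouped into subgraphs according to whether a (hypothetical) accelerator supports them, and each subgraph is tagged with the device that should execute it. Real compilers partition a graph IR rather than a flat operator list, so this only illustrates the concept.

    SUPPORTED_BY_ACCELERATOR = {"conv2d", "dense", "relu"}   # hypothetical accelerator op set

    def partition(ops):
        """Greedily group consecutive ops into subgraphs tagged with the device that runs them."""
        subgraphs = []
        for op in ops:
            target = "accelerator" if op in SUPPORTED_BY_ACCELERATOR else "cpu"
            if subgraphs and subgraphs[-1][0] == target:
                subgraphs[-1][1].append(op)
            else:
                subgraphs.append((target, [op]))
        return subgraphs

    print(partition(["conv2d", "relu", "softmax", "dense"]))
    # [('accelerator', ['conv2d', 'relu']), ('cpu', ['softmax']), ('accelerator', ['dense'])]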


Quantization. Traditional quantization strategies applied in DL frameworks are based on a set of fixed schemes and datatypes with little customization for code running on different hardware. Whereas, supporting quantization in DL compilers can leverage optimization opportunities during compilation to derive more efficient quantization strategies. For example, Relay [40] provides a quantization rewriting flow that can automatically generate quantized code for various schemes.

To support quantization, there are several challenges to be solved in DL compilers. The first challenge is how to implement new quantized operators without heavy engineering efforts. The attempt from AWS points out a possible direction that uses the concept of dialect to implement new operators upon basic operators, so that the optimizations at the graph level and the operator level can be reused. The second challenge is the interaction between quantization and other optimizations during compilation. For example, determining the appropriate stage for quantization and collaborating with optimizations such as operator fusion require future research investigations.
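For intuition, the sketch below shows a plain symmetric post-training INT8 scheme of the kind such quantization flows automate: deriving a scale from the tensor range, rounding to int8, and dequantizing. It is a generic illustration, not Relay's rewriting flow.

    import numpy as np

    def quantize_int8(x):
        """Symmetric per-tensor INT8 quantization: the scale is derived from the data range."""
        scale = np.max(np.abs(x)) / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    x = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(x)
    print(np.max(np.abs(x - dequantize(q, s))))   # error stays within about half a scale step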
Unified Optimizations. Although existing DL compilers adopt similar designs in both computation graph optimizations and hardware-specific optimizations, each compiler has its own advantages in certain aspects. There is currently no way to share the state-of-the-art optimizations, as well as support for emerging hardware targets, across existing compilers. We advocate unifying the optimizations from existing DL compilers so that the best practices adopted in each DL compiler can be reused. In addition, unifying the optimizations across DL compilers can accumulate a strong force to impact the design of general-purpose and dedicated DL accelerators, and provide an environment for efficient co-design of DL compilers and hardware.

Currently, Google MLIR is a promising initiative towards such a direction. It provides the infrastructure of multi-level IRs, and contains the IR specification and toolkit to perform transformations across IRs at each level. It also provides flexible dialects, so that each DL compiler can construct its customized dialects for both high-level and low-level IRs. Through transformation across dialects, the optimizations of one DL compiler can be reused by another compiler. However, the transformation of dialects requires further research efforts to reduce the dependency on delicate design.

Differentiable Programming. Differentiable programming is a programming paradigm in which programs are differentiable throughout. Algorithms written in the differentiable programming paradigm can be automatically differentiated, which is attractive for the DL community. Many compiler projects have adopted differentiable programming, such as Myia [69], Flux [70] and Julia [71]. Unfortunately, there is little support for differentiable programming in existing DL compilers.

To support differentiable programming is quite challenging for existing DL compilers. The difficulties come not only from data structures, but also from language semantics. For example, to realize the transformation from Julia to XLA HLO IR, one of the challenges [72] is that the control flow is different between the imperative language used by Julia and the symbolic language used by XLA. In order to use HLO IR efficiently, the compiler also needs to provide operation abstractions for Julia in order to support the particular semantics of XLA, such as MapReduce and broadcast. Moreover, the semantic difference of differentiation between Julia and XLA also requires significant changes to the compiler design.

Privacy Protection. In an edge-cloud system, the DL models are usually split into two halves, with each partial model running on the edge device and the cloud service respectively, which can provide better response latency and consume less communication bandwidth. However, one of the drawbacks of the edge-cloud system is that the user privacy becomes vulnerable. The reason is that attackers can intercept the intermediate results sent from the edge devices to the cloud, and then use the intermediate results to train another model that can reveal private information deviating from the original user task.

To protect privacy in the edge-cloud system, existing approaches [73], [74], [75] propose to add noise with special statistical properties to the intermediate results, which can reduce the accuracy of the attacker task without severely deteriorating the accuracy of the user task. However, the difficulty is to determine the layer where the noise should be inserted, and identifying the optimal layer is quite labor intensive. The above difficulty presents a great opportunity for DL compilers to support privacy protection, because the compilers maintain rich information of the DL model, which can guide the noise insertion across layers automatically.
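A minimal sketch of the noise-insertion idea: perturb the intermediate activations on the edge device before they are sent to the cloud. The Laplace distribution and the scale value are illustrative assumptions; the cited approaches learn the noise distribution and the insertion point rather than fixing them.

    import numpy as np

    def protect_intermediate(activation, scale=0.1):
        """Add zero-mean Laplace noise to an intermediate tensor before offloading it to the cloud."""
        noise = np.random.laplace(loc=0.0, scale=scale, size=activation.shape)
        return activation + noise.astype(activation.dtype)

    edge_output = np.random.rand(1, 256, 14, 14).astype("float32")   # hypothetical split point
    cloud_input = protect_intermediate(edge_output)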
Training Support. In general, model training is far less supported in current DL compilers. As shown in Table 1, nGraph only supports training on the Intel NNP-T accelerator, TC only supports the auto differentiation of a single kernel, Glow has experimental training support for limited models, the training support of TVM is under development, while XLA relies on the training support of TensorFlow. In sum, current DL compilers mainly focus on bridging the gap of deploying DL models onto diverse hardware efficiently, and thus they choose inference as their primary optimization target. However, expanding the capability of DL compilers to support model training would open up a large body of research opportunities such as optimization of gradient operators and high-order auto differentiation.

ACKNOWLEDGMENTS

The authors would like to thank Jun Yang from Alibaba and Yu Xing from Xilinx for their valuable comments and suggestions. They also would like to thank all anonymous reviewers for their insightful comments and suggestions. This work was supported by National Key R&D Program of China (Grant No. 2020YFB150001), National Natural Science Foundation of China (Grant No. 62072018, 61502019, and 61732002), and SenseTime Research Fund for Young Scholars.

REFERENCES

[1] C. D. Manning, C. D. Manning, and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press, 1999.
[2] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Englewood Cliffs, NJ, USA: Prentice Hall, 2002.


[3] J.-W. Ha, H. Pyo, and J. Kim, "Large-scale item categorization in e-commerce using multiple recurrent neural networks," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2016, pp. 107-115.
[4] M. Mohammadi, A. Al-Fuqaha, M. Guizani, and J.-S. Oh, "Semisupervised deep reinforcement learning in support of IoT and smart city services," IEEE Internet Things J., vol. 5, no. 2, pp. 624-635, Apr. 2018.
[5] H. Chen, O. Engkvist, Y. Wang, M. Olivecrona, and T. Blaschke, "The rise of deep learning in drug discovery," Drug Discov. Today, vol. 23, no. 6, pp. 1241-1250, 2018.
[6] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[8] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997.
[9] I. J. Goodfellow et al., "Generative adversarial networks," 2014, arXiv:1406.2661.
[10] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Operating Syst. Des. Implementation, 2016, pp. 265-283.
[11] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 8024-8035.
[12] T. Chen et al., "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," 2015, arXiv:1512.01274.
[13] F. Seide and A. Agarwal, "CNTK: Microsoft's open-source deep-learning toolkit," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2016, pp. 2135-2135.
[14] ONNX github repository. Accessed: Feb. 4, 2020. [Online]. Available: https://github.com/onnx/onnx
[15] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 1-12.
[16] H. Liao, J. Tu, J. Xia, and X. Zhou, "DaVinci: A scalable architecture for neural network computing," in Proc. IEEE Hot Chips 31 Symp., 2019, pp. 1-44.
[17] A. Kingsley-Hughes, "A11 bionic processor," 2017.
[18] NVIDIA turing architecture. Accessed: Feb. 4, 2020. [Online]. Available: https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/
[19] Nervana neural network processor. Accessed: Feb. 4, 2020. [Online]. Available: https://www.intel.ai/nervana-nnp/
[20] AWS inferentia. Accessed: Feb. 4, 2020. [Online]. Available: https://aws.amazon.com/machine-learning/inferentia
[21] Announcing hanguang 800: Alibaba's first AI-inference chip. Accessed: Feb. 4, 2020. [Online]. Available: https://www.alibabacloud.com/blog/announcing-hanguang-800-alibabas-first-ai-inference-chip_595482
[22] S. Liu et al., "Cambricon: An instruction set architecture for neural networks," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit., 2016, pp. 393-405.
[23] Z. Jia, B. Tillman, M. Maggioni, and D. P. Scarpazza, "Dissecting the graphcore IPU architecture via microbenchmarking," 2019, arXiv:1912.03413.
[24] TensorRT github repository. Accessed: Feb. 4, 2020. [Online]. Available: https://github.com/NVIDIA/TensorRT
[25] T. Chen et al., "TVM: An automated end-to-end optimizing compiler for deep learning," in Proc. 13th USENIX Symp. Operating Syst. Des. Implementation, 2018, pp. 578-594.
[26] N. Vasilache et al., "Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions," 2018, arXiv:1802.04730.
[27] N. Rotem et al., "Glow: Graph lowering compiler techniques for neural networks," 2018, arXiv:1805.00907.
[28] S. Cyphers et al., "Intel nGraph: An intermediate representation, compiler, and executor for deep learning," 2018, arXiv:1801.08058.
[29] C. Leary and T. Wang, "XLA: Tensorflow, compiled," TensorFlow Dev Summit, 2017.
[30] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proc. Int. Symp. Code Gener. Optim., 2004, pp. 75-86.
[31] Y. Xing, J. Weng, Y. Wang, L. Sui, Y. Shan, and Y. Wang, "An in-depth comparison of compilers for deep neural networks on hardware," in Proc. IEEE Int. Conf. Embedded Softw. Syst., 2019, pp. 1-8.
[32] M. Li et al., "The deep learning compiler: A comprehensive survey," 2020, arXiv:2002.03794.
[33] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," in Proc. 34th ACM SIGPLAN Conf. Program. Lang. Des. Implementation, 2013, pp. 519-530.
[34] Polyhedral compilation. Accessed: Feb. 4, 2020. [Online]. Available: https://polyhedral.info
[35] D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe, "Dependence graphs and compiler optimizations," in Proc. 8th ACM SIGPLAN-SIGACT Symp. Princ. Program. Lang., 1981, pp. 207-218. [Online]. Available: https://doi.org/10.1145/567532.567555
[36] C. Lattner et al., "MLIR: A compiler infrastructure for the end of Moore's law," 2020, arXiv:2002.11054.
[37] D. Goodman, JavaScript Bible. Hoboken, NJ, USA: Wiley, 2007.
[38] J. Harrop, "F# for scientists," USA, Wiley-Interscience, 2008.
[39] H. Abelson et al., "Revised 5 report on the algorithmic language scheme," Higher-Order Symbolic Comput., vol. 11, no. 1, pp. 7-105, 1998.
[40] J. Roesch et al., "Relay: A high-level compiler for deep learning," 2019, arXiv:1904.08368.
[41] J. McCarthy and M. I. Levin, LISP 1.5 Programmer's Manual. Cambridge, MA, USA: MIT Press, 1965.
[42] Y. Yu et al., "Dynamic control flow in large-scale machine learning," in Proc. 13th EuroSys Conf., 2018, Art. no. 18.
[43] B. van Merriënboer, O. Breuleux, A. Bergeron, and P. Lamblin, "Automatic differentiation in ML: Where we are and where we should be going," in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 8757-8767.
[44] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, "Efficiently computing static single assignment form and the control dependence graph," ACM Trans. Program. Lang. Syst., vol. 13, no. 4, pp. 451-490, 1991.
[45] S. Verdoolaege, "isl: An integer set library for the polyhedral model," in Proc. Int. Congr. Math. Softw., 2010, pp. 299-302.
[46] W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott, "The omega library," Univ. Maryland, Tech. Rep., 1996.
[47] P. Feautrier, "Parametric integer programming," RAIRO Recherche Operationnelle, vol. 22, no. 3, pp. 243-268, 1988.
[48] V. Loechner, "PolyLib: A library for manipulating parameterized polyhedra," 1999. [Online]. Available: https://repo.or.cz/polylib.git/blob_plain/HEAD:/doc/parampoly-doc.ps.gz
[49] R. Bagnara, P. M. Hill, and E. Zaffanella, "The parma polyhedra library: Toward a complete set of numerical abstractions for the analysis and verification of hardware and software systems," Sci. Comput. Program., vol. 72, no. 1, pp. 3-21, 2008. [Online]. Available: https://doi.org/10.1016/j.scico.2007.08.001
[50] D. Kang, E. Kim, I. Bae, B. Egger, and S. Ha, "C-GOOD: C-code generation framework for optimized on-device deep learning," in Proc. Int. Conf. Comput.-Aided Des., 2018, Art. no. 105.
[51] G. Long, J. Yang, K. Zhu, and W. Lin, "FusionStitching: Deep fusion and code generation for tensorflow computations on GPUs," 2018, arXiv:1811.05213.
[52] G. Long, J. Yang, and W. Lin, "FusionStitching: Boosting execution efficiency of memory intensive computations for DL workloads," 2019, arXiv:1911.11576.
[53] A. V. Aho, R. Sethi, and J. D. Ullman, "Compilers: Principles, techniques, and tools," Addison-Wesley, 1986. [Online]. Available: https://www.worldcat.org/oclc/12285707
[54] B. H. Ahn, J. Lee, J. M. Lin, H.-P. Cheng, J. Hou, and H. Esmaeilzadeh, "Ordering Chaos: Memory-aware scheduling of irregularly wired neural networks for edge devices," in Proc. Conf. Mach. Learn. Syst., 2020, vol. 2, pp. 44-57. [Online]. Available: https://proceedings.mlsys.org/paper/2020/file/9bf31c7ff062936a96d3c8bd1f8f2ff3-Paper.pdf
[55] Y. Liu, Y. Wang, R. Yu, M. Li, V. Sharma, and Y. Wang, "Optimizing CNN model inference on CPUs," in Proc. USENIX Annu. Tech. Conf., 2019, pp. 1025-1040.
[56] S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado, and F. Catthoor, "Polyhedral parallel code generation for CUDA," ACM Trans. Archit. Code Optim., vol. 9, no. 4, pp. 54:1-54:23, Jan. 2013.
[57] S. Kaufman, P. M. Phothilimthana, and M. Burrows, "Learned TPU cost model for XLA tensor programs," in Proc. Workshop ML Syst. NeurIPS, 2019, pp. 1-6.
[58] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed. Reading, MA, USA: Addison-Wesley, 1989.


[59] D. Bertsimas et al., "Simulated annealing," Statist. Sci., vol. 8, no. 1, pp. 10-15, 1993.
[60] B. H. Ahn, P. Pilligundla, A. Yazdanbakhsh, and H. Esmaeilzadeh, "Chameleon: Adaptive code optimization for expedited deep neural network compilation," in Proc. Int. Conf. Learn. Representations, 2020.
[61] S. Wang and P. Kanwar, "BFloat16 hardware numerics definition," 2017.
[62] DLProf user-guide. Accessed: Aug. 26, 2020. [Online]. Available: https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/
[63] Z. Wang and M. O'Boyle, "Machine learning in compiler optimization," Proc. IEEE, vol. 106, no. 11, pp. 1879-1901, Nov. 2018.
[64] S. Smith and G. Karypis, "Tensor-matrix products with a compressed sparse tensor," in Proc. 5th Workshop Irregular Appl.: Archit. Algorithms, 2015, pp. 1-7.
[65] N. Vasilache, C. Bastoul, and A. Cohen, "Polyhedral code generation in the real world," in Proc. Int. Conf. Compiler Construction, 2006, pp. 185-201.
[66] C. Chen, "Polyhedra scanning revisited," in Proc. 33rd ACM SIGPLAN Conf. Program. Lang. Des. Implementation, 2012, pp. 499-508.
[67] A. Venkat, M. Shantharam, M. Hall, and M. M. Strout, "Non-affine extensions to polyhedral code generation," in Proc. Annu. IEEE/ACM Int. Symp. Code Gener. Optim., 2014, pp. 185-194.
[68] A. Venkat, M. Hall, and M. Strout, "Loop and data transformations for sparse matrix code," in Proc. 36th ACM SIGPLAN Conf. Program. Lang. Des. Implementation, 2015, pp. 521-532.
[69] B. van Merriënboer, O. Breuleux, A. Bergeron, and P. Lamblin, "Automatic differentiation in ML: Where we are and where we should be going," in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 8757-8767.
[70] M. Innes et al., "Fashionable modelling with flux," 2018, arXiv:1811.01457.
[71] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, "Julia: A fresh approach to numerical computing," SIAM Rev., vol. 59, no. 1, pp. 65-98, 2017.
[72] K. Fischer and E. Saba, "Automatic full compilation of Julia programs and ML models to cloud TPUs," 2018, arXiv:1810.09868.
[73] F. Mireshghallah, M. Taram, P. Ramrakhyani, A. Jalali, D. Tullsen, and H. Esmaeilzadeh, "Shredder: Learning noise distributions to protect inference privacy," in Proc. 25th Int. Conf. Architect. Support Program. Lang. Operating Syst., 2020, pp. 3-18.
[74] S. A. Osia, A. Taheri, A. S. Shamsabadi, K. Katevas, H. Haddadi, and H. R. Rabiee, "Deep private-feature extraction," IEEE Trans. Knowl. Data Eng., vol. 32, no. 1, pp. 54-66, Jan. 2020.
[75] R. Gao, M. Dun, H. Yang, Z. Luan, and D. Qian, "Privacy for rescue: A new testimony why privacy is vulnerable in deep models," 2019, arXiv:2001.00493.

Xiaoyan Liu is currently working toward the PhD degree with the School of Computer Science and Engineering, Beihang University, Beijing, China. She is currently working on GPU hardware extension and approximation matrix algorithms. Her research interests include HPC, performance optimization, and deep learning.

Qingxiao Sun is currently working toward the PhD degree with the School of Computer Science and Engineering, Beihang University, Beijing, China. He is currently working on GPU hardware extension and performance optimization. His research interests include computer architecture, HPC, and deep learning.

Xin You is currently working toward the PhD degree with the School of Computer Science and Engineering, Beihang University, Beijing, China. He is currently working on GPU hardware extension and performance optimization. His research interests include computer architecture, HPC, and deep learning.

Hailong Yang received the PhD degree from the School of Computer Science and Engineering, Beihang University, Beijing, China, in 2014. He is an assistant professor with the School of Computer Science and Engineering, Beihang University. He has been involved in several scientific projects such as performance analysis for big data systems and performance optimization for large scale applications. His research interests include parallel and distributed computing, HPC, performance optimization, and energy efficiency. He is a member of the China Computer Federation (CCF).
Mingzhen Li is currently working toward the PhD degree with the School of Computer Science and Engineering, Beihang University, Beijing, China. He is currently working on identifying performance opportunities for scientific applications and sparse matrix algorithms. His research interests include HPC, performance optimization, and code generation.

Yi Liu received the PhD degree from the Department of Computer Science, Xi'an Jiaotong University, Xi'an, China, in 2000. He is a professor with the School of Computer Science and Engineering, and director of the Sino-German Joint Software Institute (JSI), Beihang University, China. His research interests include computer architecture, HPC, and the new generation of network technology.

Zhongzhi Luan received the PhD degree from the School of Computer Science, Xi'an Jiaotong University, Xi'an, China. He is an associate professor of computer science and engineering, and assistant director of the Sino-German Joint Software Institute (JSI) Laboratory, Beihang University, China. Since 2003, his research interests include distributed computing, parallel computing, grid computing, HPC, and the new generation of network technology.


Lin Gan (Member, IEEE) received the PhD degree in computer science from Tsinghua University, Beijing, China. He is an assistant researcher with the Department of Computer Science and Technology, Tsinghua University, and the assistant director of the National Supercomputing Center in Wuxi. His research interests include high performance computing solutions based on hybrid platforms such as GPUs, FPGAs, and Sunway CPUs. He is the recipient of the 2016 ACM Gordon Bell Prize, the 2017 ACM Gordon Bell Prize Finalist, the 2018 IEEE-CS TCHPC Early Career Researchers Award for Excellence in HPC, and the Most Significant Paper Award in 25 Years awarded by FPL 2015, etc.

Depei Qian received the master's degree from the University of North Texas, Denton, Texas, in 1984. He is a professor with the Department of Computer Science and Engineering, Beihang University, China. He is currently serving as the chief scientist of China National High Technology Program (863 Program) on high productivity computer and service environment. His research interests include innovative technologies in distributed computing, high performance computing, and computer architecture. He is also a fellow of the China Computer Federation (CCF).

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.

Guangwen Yang (Member, IEEE) received the PhD degree in computer science from Tsinghua University, Beijing, China. He is a professor with the Department of Computer Science and Technology, Tsinghua University, and the director of the National Supercomputing Center in Wuxi. His research interests include parallel algorithms, cloud computing, and the earth system model. He has received the ACM Gordon Bell Prize in the years of 2016 and 2017, and the Most Significant Paper Award in 25 Years awarded by FPL 2015, etc.
