TVM: An Automated End-to-End Optimizing Compiler for Deep Learning


Tianqi Chen and Thierry Moreau, University of Washington; Ziheng Jiang, University of
Washington, AWS; Lianmin Zheng, Shanghai Jiao Tong University; Eddie Yan, Haichen Shen,
and Meghan Cowan, University of Washington; Leyuan Wang, UC Davis, AWS; Yuwei Hu,
Cornell; Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy, University of Washington
https://www.usenix.org/conference/osdi18/presentation/chen

This paper is included in the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18).
October 8–10, 2018 • Carlsbad, CA, USA
ISBN 978-1-939133-08-3

Open access to the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen1, Thierry Moreau1, Ziheng Jiang1,2, Lianmin Zheng3, Eddie Yan1,
Meghan Cowan1, Haichen Shen1, Leyuan Wang4,2, Yuwei Hu5, Luis Ceze1, Carlos Guestrin1, Arvind Krishnamurthy1
1 Paul G. Allen School of Computer Science & Engineering, University of Washington

2 AWS, 3 Shanghai Jiao Tong University, 4 UC Davis, 5 Cornell

Abstract

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms – such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) – requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates the optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that is competitive with state-of-the-art, hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as an FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.

1 Introduction

Deep learning (DL) models can now recognize images, process natural language, and defeat humans in challenging strategy games. There is a growing demand to deploy smart applications to a wide spectrum of devices, ranging from cloud servers to self-driving cars and embedded devices. Mapping DL workloads to these devices is complicated by the diversity of hardware characteristics, including embedded CPUs, GPUs, FPGAs, and ASICs (e.g., the TPU [21]). These hardware targets diverge in terms of memory organization, compute functional units, etc., as shown in Figure 1.

Figure 1: CPU, GPU, and TPU-like accelerators require different on-chip memory architectures and compute primitives (memory subsystems range from implicitly managed caches on CPUs, to mixed cache/scratchpad hierarchies on GPUs, to explicitly managed buffers on TPU-like accelerators; compute primitives range from scalar to vector to tensor units). This divergence must be addressed when generating optimized code.

Current DL frameworks, such as TensorFlow, MXNet, Caffe, and PyTorch, rely on a computational graph intermediate representation to implement optimizations, e.g., auto differentiation and dynamic memory management [3, 4, 9]. Graph-level optimizations, however, are often too high-level to handle hardware back-end-specific operator-level transformations. Most of these frameworks focus on a narrow class of server-class GPU devices and delegate target-specific optimizations to highly engineered and vendor-specific operator libraries. These operator-level libraries require significant manual tuning and hence are too specialized and opaque to be easily ported across hardware devices. Providing support in various DL frameworks for diverse hardware back-ends presently requires significant engineering effort. Even for supported back-ends, frameworks must make a difficult choice between: (1) avoiding graph optimizations that yield new operators not in the predefined operator library, and (2) using unoptimized implementations of these new operators.
To enable both graph- and operator-level optimizations for diverse hardware back-ends, we take a fundamentally different, end-to-end approach. We built TVM, a compiler that takes a high-level specification of a deep learning program from existing frameworks and generates low-level optimized code for a diverse set of hardware back-ends. To be attractive to users, TVM needs to offer performance competitive with the multitude of manually optimized operator libraries across diverse hardware back-ends. This goal requires addressing the key challenges described below.

Leveraging Specific Hardware Features and Abstractions. DL accelerators introduce optimized tensor compute primitives [1, 12, 21], while GPUs and CPUs continuously improve their processing elements. This poses a significant challenge in generating optimized code for a given operator description. The inputs to hardware instructions are multi-dimensional, with fixed or variable lengths; they dictate different data layouts; and they have special requirements for memory hierarchy. The system must effectively exploit these complex primitives to benefit from acceleration. Further, accelerator designs also commonly favor leaner control [21] and offload most scheduling complexity to the compiler stack. For specialized accelerators, the system therefore needs to generate code that explicitly controls pipeline dependencies to hide memory access latency – a job that hardware performs for CPUs and GPUs.

Large Search Space for Optimization. Another challenge is producing efficient code without manually tuning operators. The combinatorial choices of memory access, threading pattern, and novel hardware primitives create a huge configuration space for generated code (e.g., loop tiles and ordering, caching, unrolling) that would incur a large search cost if we implemented blackbox auto-tuning. One could adopt a predefined cost model to guide the search, but building an accurate cost model is difficult due to the increasing complexity of modern hardware. Furthermore, such an approach would require a separate cost model for each hardware type.

TVM addresses these challenges with three key modules. (1) We introduce a tensor expression language to build operators and provide program transformation primitives that generate different versions of the program with various optimizations. This layer extends Halide [32]'s compute/schedule separation concept by also separating target hardware intrinsics from transformation primitives, which enables support for novel accelerators and their corresponding new intrinsics. Moreover, we introduce new transformation primitives to address GPU-related challenges and enable deployment to specialized accelerators. We can then apply different sequences of program transformations to form a rich space of valid programs for a given operator declaration. (2) We introduce an automated program optimization framework to find optimized tensor operators. The optimizer is guided by an ML-based cost model that adapts and improves as we collect more data from a hardware back-end. (3) On top of the automatic code generator, we introduce a graph rewriter that takes full advantage of high- and operator-level optimizations.

By combining these three modules, TVM can take model descriptions from existing deep learning frameworks, perform joint high- and low-level optimizations, and generate hardware-specific optimized code for back-ends, e.g., CPUs, GPUs, and FPGA-based specialized accelerators.

This paper makes the following contributions:

• We identify the major optimization challenges in providing performance portability to deep learning workloads across diverse hardware back-ends.

• We introduce novel schedule primitives that take advantage of cross-thread memory reuse, novel hardware intrinsics, and latency hiding.

• We propose and implement a machine learning based optimization system to automatically explore and search for optimized tensor operators.

• We build an end-to-end compilation and optimization stack that allows the deployment of deep learning workloads specified in high-level frameworks (including TensorFlow, MXNet, PyTorch, Keras, and CNTK) to diverse hardware back-ends (including CPUs, server GPUs, mobile GPUs, and FPGA-based accelerators). The open-sourced TVM is in production use inside several major companies.

We evaluated TVM using real world workloads on a server-class GPU, an embedded GPU, an embedded CPU, and a custom generic FPGA-based accelerator. Experimental results show that TVM offers portable performance across back-ends and achieves speedups ranging from 1.2× to 3.8× over existing frameworks backed by hand-optimized libraries.

2 Overview

This section describes TVM by using an example to walk through its components. Figure 2 summarizes the execution steps in TVM and their corresponding sections in the paper. The system first takes as input a model from an existing framework and transforms it into a computational graph representation. It then performs high-level dataflow rewriting to generate an optimized graph. The operator-level optimization module must generate efficient code for each fused operator in this graph. Operators are specified in a declarative tensor expression language; execution details are unspecified. TVM identifies a collection of possible code optimizations for a given hardware target's operators. Possible optimizations form a large space, so we use an ML-based cost model to find optimized operators. Finally, the system packs the generated code into a deployable module.
Figure 2: System overview of TVM. The current stack supports descriptions from many deep learning frameworks and exchange formats, such as CoreML and ONNX, to target major CPU, GPU, and specialized accelerators.

End-User Example. In a few lines of code, a user can take a model from an existing deep learning framework and call the TVM API to get a deployable module:

    import tvm as t
    # Use the Keras framework as an example; import the model
    graph, params = t.frontend.from_keras(keras_model)
    target = t.target.cuda()
    graph, lib, params = t.compiler.build(graph, target, params)

This compiled runtime module contains three components: the final optimized computational graph (graph), generated operators (lib), and module parameters (params). These components can then be used to deploy the model to the target back-end:

    import tvm.runtime as t
    module = runtime.create(graph, lib, t.cuda(0))
    module.set_input(**params)
    module.run(data=data_array)
    output = tvm.nd.empty(out_shape, ctx=t.cuda(0))
    module.get_output(0, output)

TVM supports multiple deployment back-ends in languages such as C++, Java, and Python. The rest of this paper describes TVM's architecture and how a system programmer can extend it to support new back-ends.

3 Optimizing Computational Graphs

Computational graphs are a common way to represent programs in DL frameworks [3, 4, 7, 9]. Figure 3 shows an example computational graph representation of a two-layer convolutional neural network. The main difference between this high-level representation and a low-level compiler intermediate representation (IR), such as LLVM, is that the intermediate data items are large, multi-dimensional tensors. Computational graphs provide a global view of operators, but they avoid specifying how each operator must be implemented. Like LLVM IRs, a computational graph can be transformed into functionally equivalent graphs to apply optimizations. We also take advantage of shape specificity in common DL workloads to optimize for a fixed set of input shapes.

Figure 3: Example computational graph of a two-layer convolutional neural network. Each node in the graph represents an operation that consumes one or more tensors and produces one or more tensors. Tensor operations can be parameterized by attributes (e.g., padding or strides) to configure their behavior.

TVM exploits a computational graph representation to apply high-level optimizations: a node represents an operation on tensors or program inputs, and edges represent data dependencies between operations. It implements many graph-level optimizations, including: operator fusion, which fuses multiple small operations together; constant folding, which pre-computes graph parts that can be determined statically, saving execution costs; a static memory planning pass, which pre-allocates memory to hold each intermediate tensor; and data layout transformations, which transform internal data layouts into back-end-friendly forms. We now discuss operator fusion and the data layout transformation.

Operator Fusion. Operator fusion combines multiple operators into a single kernel without saving the intermediate results in memory. This optimization can greatly reduce execution time, particularly on GPUs and specialized accelerators. Specifically, we recognize four categories of graph operators: (1) injective (one-to-one map, e.g., add), (2) reduction (e.g., sum), (3) complex-out-fusable (can fuse element-wise map to output, e.g., conv2d), and (4) opaque (cannot be fused, e.g., sort). We provide generic rules to fuse these operators, as follows. Multiple injective operators can be fused into another injective operator. A reduction operator can be fused with its input injective operators (e.g., fuse scale and sum). Operators such as conv2d are complex-out-fusable, and we can fuse element-wise operators to their output. We can apply these rules to transform the computational graph into a fused version. Figure 4 demonstrates the impact of this optimization on different workloads. We find that fused operators generate up to a 1.2× to 2× speedup by reducing memory accesses.

Figure 4: Performance comparison between fused and non-fused operations (conv+bn+relu, depthwise-conv+bn+relu, rnn cell, and lstm cell workloads). TVM generates both operations. Tested on an NVIDIA Titan X.
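To make the fusion rules above concrete, the following sketch shows one way the four operator categories could drive a greedy fusion pass over a computational graph. It is an illustrative stand-in rather than TVM's actual graph rewriter; the Node representation and the can_fuse/greedy_fuse helpers are hypothetical, and the pass only merges along edges whose producer has a single consumer.

    from dataclasses import dataclass, field
    from enum import Enum

    class OpCategory(Enum):
        INJECTIVE = "injective"              # one-to-one map, e.g., add
        REDUCTION = "reduction"              # e.g., sum
        COMPLEX_OUT_FUSABLE = "complex-out"  # e.g., conv2d
        OPAQUE = "opaque"                    # e.g., sort

    @dataclass
    class Node:
        name: str
        category: OpCategory
        inputs: list = field(default_factory=list)  # producer Nodes

    def can_fuse(producer, consumer):
        # Generic rules from Section 3 (illustrative): opaque ops never fuse;
        # injective producers fuse into injective or reduction consumers;
        # element-wise consumers fuse onto a complex-out-fusable producer.
        if OpCategory.OPAQUE in (producer.category, consumer.category):
            return False
        if producer.category == OpCategory.INJECTIVE:
            return consumer.category in (OpCategory.INJECTIVE, OpCategory.REDUCTION)
        if producer.category == OpCategory.COMPLEX_OUT_FUSABLE:
            return consumer.category == OpCategory.INJECTIVE
        return False

    def greedy_fuse(nodes):
        # Union producer/consumer pairs that satisfy can_fuse; `nodes` is assumed
        # to be in topological order, and a fused edge must be its producer's
        # only outgoing edge.
        parent = {n.name: n.name for n in nodes}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        consumers = {}
        for n in nodes:
            for p in n.inputs:
                consumers.setdefault(p.name, []).append(n)
        for n in nodes:
            for p in n.inputs:
                if len(consumers.get(p.name, [])) == 1 and can_fuse(p, n):
                    parent[find(p.name)] = find(n.name)
        return {name: find(name) for name in parent}

    # conv2d -> bn -> relu collapses into a single fused group.
    conv = Node("conv2d", OpCategory.COMPLEX_OUT_FUSABLE)
    bn = Node("bn", OpCategory.INJECTIVE, [conv])
    relu = Node("relu", OpCategory.INJECTIVE, [bn])
    print(greedy_fuse([conv, bn, relu]))  # all three nodes map to the same group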
Data Layout Transformation. There are multiple ways to store a given tensor in the computational graph. The most common data layout choices are column major and row major. In practice, we may prefer to use even more complicated data layouts. For instance, a DL accelerator might exploit 4×4 matrix operations, requiring data to be tiled into 4×4 chunks to optimize for access locality.

Data layout optimization converts a computational graph into one that can use better internal data layouts for execution on the target hardware. It starts by specifying the preferred data layout for each operator given the constraints dictated by memory hierarchies. We then perform the proper layout transformation between a producer and a consumer if their preferred data layouts do not match.
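As a minimal illustration of the kind of conversion this pass inserts at such producer/consumer boundaries, the sketch below tiles a row-major matrix into 4×4 chunks (the layout a 4×4 matrix unit would prefer) and converts back. It uses plain NumPy rather than TVM, the function names are hypothetical, and it assumes dimensions divisible by the tile size.

    import numpy as np

    def to_tiled_4x4(x):
        # Row-major (M, N) -> (M//4, N//4, 4, 4), so each 4x4 block is contiguous.
        m, n = x.shape
        assert m % 4 == 0 and n % 4 == 0, "illustration assumes divisible shapes"
        return x.reshape(m // 4, 4, n // 4, 4).transpose(0, 2, 1, 3).copy()

    def from_tiled_4x4(t):
        # Inverse transformation back to the row-major layout.
        mo, no, _, _ = t.shape
        return t.transpose(0, 2, 1, 3).reshape(mo * 4, no * 4)

    a = np.arange(64, dtype=np.float32).reshape(8, 8)
    tiled = to_tiled_4x4(a)                          # accelerator-preferred layout
    assert np.array_equal(from_tiled_4x4(tiled), a)  # round trip preserves values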
While high-level graph optimizations can greatly improve the efficiency of DL workloads, they are only as effective as what the operator library provides. Currently, the few DL frameworks that support operator fusion require the operator library to provide an implementation of the fused patterns. With more network operators introduced on a regular basis, the number of possible fused kernels can grow dramatically. This approach is no longer sustainable when targeting an increasing number of hardware back-ends, since the required number of fused pattern implementations grows combinatorially with the number of data layouts, data types, and accelerator intrinsics that must be supported. It is not feasible to handcraft operator kernels for the various operations desired by a program and for each back-end. To this end, we next propose a code generation approach that can generate various possible implementations for a given model's operators.

4 Generating Tensor Operations

TVM produces efficient code for each operator by generating many valid implementations on each hardware back-end and choosing an optimized implementation. This process builds on Halide's idea of decoupling descriptions from computation rules (or schedule optimizations) [32] and extends it to support new optimizations (nested parallelism, tensorization, and latency hiding) and a wide array of hardware back-ends. We now highlight TVM-specific features.

4.1 Tensor Expression and Schedule Space

We introduce a tensor expression language to support automatic code generation. Unlike high-level computation graph representations, where the implementation of tensor operations is opaque, each operation is described in an index formula expression language.
The following code shows an example tensor expression to compute transposed matrix multiplication:

    m, n, h = t.var('m'), t.var('n'), t.var('h')
    A = t.placeholder((m, h), name='A')
    B = t.placeholder((n, h), name='B')
    k = t.reduce_axis((0, h), name='k')
    # computing rule and result shape
    C = t.compute((m, n), lambda y, x:
            t.sum(A[k, y] * B[k, x], axis=k))

Each compute operation specifies both the shape of the output tensor and an expression describing how to compute each element of it. Our tensor expression language supports common arithmetic and math operations and covers common DL operator patterns. The language does not specify the loop structure and many other execution details, and it provides flexibility for adding hardware-aware optimizations for various back-ends. Adopting the decoupled compute/schedule principle from Halide [32], we use a schedule to denote a specific mapping from a tensor expression to low-level code. Many possible schedules can perform this function.

We build a schedule by incrementally applying basic transformations (schedule primitives) that preserve the program's logical equivalence. Figure 5 shows an example of scheduling matrix multiplication on a specialized accelerator. Internally, TVM uses a data structure to keep track of the loop structure and other information as we apply schedule transformations. This information can then help generate low-level code for a given final schedule.

Figure 5: Example schedule transformations that optimize a matrix multiplication on a specialized accelerator. Each schedule primitive is shown together with the corresponding low-level loop program it produces (as comments).

    A = t.placeholder((1024, 1024))
    B = t.placeholder((1024, 1024))
    k = t.reduce_axis((0, 1024))
    C = t.compute((1024, 1024), lambda y, x:
            t.sum(A[k, y] * B[k, x], axis=k))
    s = t.create_schedule(C.op)
    # for y in range(1024):
    #   for x in range(1024):
    #     C[y][x] = 0
    #     for k in range(1024):
    #       C[y][x] += A[k][y] * B[k][x]

    # + loop tiling
    yo, xo, ko, yi, xi, ki = s[C].tile(y, x, k, 8, 8, 8)
    # for yo in range(128):
    #   for xo in range(128):
    #     C[yo*8:yo*8+8][xo*8:xo*8+8] = 0
    #     for ko in range(128):
    #       for yi in range(8):
    #         for xi in range(8):
    #           for ki in range(8):
    #             C[yo*8+yi][xo*8+xi] += A[ko*8+ki][yo*8+yi] * B[ko*8+ki][xo*8+xi]

    # + cache data on accelerator special buffer
    CL = s.cache_write(C, vdla.acc_buffer)
    AL = s.cache_read(A, vdla.inp_buffer)
    # additional schedule steps omitted …

    # + map to accelerator tensor instructions
    s[CL].tensorize(yi, vdla.gemm8x8)
    # inp_buffer AL[8][8], BL[8][8]
    # acc_buffer CL[8][8]
    # for yo in range(128):
    #   for xo in range(128):
    #     vdla.fill_zero(CL)
    #     for ko in range(128):
    #       vdla.dma_copy2d(AL, A[ko*8:ko*8+8][yo*8:yo*8+8])
    #       vdla.dma_copy2d(BL, B[ko*8:ko*8+8][xo*8:xo*8+8])
    #       vdla.fused_gemm8x8_add(CL, AL, BL)
    #     vdla.dma_copy2d(C[yo*8:yo*8+8, xo*8:xo*8+8], CL)
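For contrast with the accelerator schedule in Figure 5, the sketch below applies a different sequence of primitives to the matrix multiplication declared above to derive a CPU-flavored schedule: tiling for cache locality, vectorizing the innermost loop, and parallelizing the outermost one. It is written in the same illustrative t.* / s[C].* style as the paper's examples; exact primitive names and signatures may differ in a given TVM release.

    # Same transposed-matmul declaration as above; a CPU-oriented schedule (sketch).
    s = t.create_schedule(C.op)
    y, x = C.op.axis
    yo, yi = s[C].split(y, factor=32)   # tile the two spatial loops
    xo, xi = s[C].split(x, factor=32)
    ko, ki = s[C].split(k, factor=4)    # split the reduction loop
    s[C].reorder(yo, xo, ko, yi, ki, xi)
    s[C].vectorize(xi)                  # map the innermost loop to SIMD lanes
    s[C].parallel(yo)                   # distribute the outermost loop across threads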
Our tensor expression takes cues from Halide [32], Darkroom [17], and TACO [23]. Its primary enhancements include support for the new schedule optimizations discussed below. To achieve high performance on many back-ends, we must support enough schedule primitives to cover a diverse set of optimizations on different hardware back-ends. Figure 6 summarizes the operation code generation process and the schedule primitives that TVM supports. We reuse helpful primitives and the low-level loop program AST from Halide, and we introduce new primitives to optimize GPU and accelerator performance. The new primitives are necessary to achieve optimal GPU performance and essential for accelerators. CPUs, GPUs, and TPU-like accelerators are three important types of hardware for deep learning. This section describes new optimization primitives for CPUs, GPUs, and TPU-like accelerators, while Section 5 explains how to automatically derive efficient schedules.

Figure 6: TVM schedule lowering and code generation process (tensor expression → schedule primitives → final schedule → code lowering → low-level code). The table lists existing Halide and novel TVM scheduling primitives used to optimize schedules for CPU, GPU, and accelerator back-ends. Tensorization is essential for accelerators, but it can also be used for CPUs and GPUs. Special memory scope enables memory reuse in GPUs and explicit management of on-chip memory in accelerators. Latency hiding is specific to TPU-like accelerators.

    Schedule primitive                 CPU   GPU   Accel.
    [Halide] Loop Transformations       ✔     ✔     ✔
    [Halide] Thread Binding             ✔     ✔     ✔
    [Halide] Compute Locality           ✔     ✔     ✔
    [TVM]    Special Memory Scope             ✔     ✔
    [TVM]    Tensorization              ✔     ✔     ✔
    [TVM]    Latency Hiding                         ✔

4.2 Nested Parallelism with Cooperation

Parallelism is key to improving the efficiency of compute-intensive kernels in DL workloads. Modern GPUs offer massive parallelism, requiring us to bake parallel patterns into schedule transformations. Most existing solutions adopt a model called nested parallelism, a form of fork–join. This model requires a parallel schedule primitive to parallelize a data-parallel task; each task can be further recursively subdivided into subtasks to exploit the target architecture's multi-level thread hierarchy (e.g., thread groups in GPUs). We call this model shared-nothing nested parallelism because one working thread cannot look at the data of its sibling within the same parallel computation stage.

An alternative to the shared-nothing approach is to fetch data cooperatively. Specifically, groups of threads can cooperatively fetch the data they all need and place it into a shared memory space.¹ This optimization can take advantage of the GPU memory hierarchy and enable data reuse across threads through shared memory regions.

¹ Halide recently added shared memory support, but without a general memory scope for accelerators.
TVM supports this well-known GPU optimization using a schedule primitive to achieve optimal performance. The following GPU code example optimizes matrix multiplication:

    for thread_group (by, bx) in cross(64, 64):
        for thread_item (ty, tx) in cross(2, 2):
            local CL[8][8] = 0
            shared AS[2][8], BS[2][8]
            for k in range(1024):
                # all threads cooperatively load AS and BS in different parallel patterns
                for i in range(4):
                    AS[ty][i*4+tx] = A[k][by*64+ty*8+i*4+tx]
                for i in range(4):
                    BS[ty][i*4+tx] = B[k][bx*64+ty*8+i*4+tx]
                memory_barrier_among_threads()   # barrier inserted automatically by the compiler
                for yi in range(8):
                    for xi in range(8):
                        CL[yi][xi] += AS[yi] * BS[xi]
            for yi in range(8):
                for xi in range(8):
                    C[yo*8+yi][xo*8+xi] = CL[yi][xi]

Figure 7: Performance comparison between TVM with and without cooperative shared memory fetching on matrix multiplication workloads. Tested on an NVIDIA Titan X.

Figure 7 demonstrates the impact of this optimization. We introduce the concept of memory scopes to the schedule space so that a compute stage (AS and BS in the code) can be marked as shared. Without explicit memory scopes, automatic scope inference will mark compute stages as thread-local. The shared task must compute the dependencies of all working threads in the group. Additionally, memory synchronization barriers must be properly inserted to guarantee that shared loaded data is visible to consumers. Finally, in addition to being useful for GPUs, memory scopes let us tag special memory buffers and create special lowering rules when targeting specialized DL accelerators.
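The sketch below shows how the cooperative pattern above might be expressed with schedule primitives, reusing the A, B, C, and k declaration from Section 4.1: shared-scope cache stages for the operands, thread binding for the block/thread hierarchy, and attachment of the cooperative loads inside the reduction loop so the compiler can insert the barriers. It follows the paper's illustrative API; the thread-axis names mirror CUDA conventions, and the exact primitive signatures are assumptions.

    s = t.create_schedule(C.op)
    AS = s.cache_read(A, "shared", [C])     # stage operands in the shared memory scope
    BS = s.cache_read(B, "shared", [C])

    y, x = C.op.axis
    by, yi = s[C].split(y, factor=64)       # one thread group per 64x64 output tile
    bx, xi = s[C].split(x, factor=64)
    ty, yii = s[C].split(yi, nparts=8)      # 8x8 threads per group
    tx, xii = s[C].split(xi, nparts=8)
    s[C].reorder(by, bx, ty, tx, yii, xii)
    s[C].bind(by, t.thread_axis("blockIdx.y"))
    s[C].bind(bx, t.thread_axis("blockIdx.x"))
    s[C].bind(ty, t.thread_axis("threadIdx.y"))
    s[C].bind(tx, t.thread_axis("threadIdx.x"))

    # Attach the shared loads inside the reduction loop: all threads in the group
    # cooperatively fill AS/BS, and the compiler inserts the synchronization
    # barriers that make the loaded tiles visible before they are consumed.
    ko, ki = s[C].split(k, factor=8)
    s[AS].compute_at(s[C], ko)
    s[BS].compute_at(s[C], ko)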
4.3 Tensorization

DL workloads have high arithmetic intensity, which can typically be decomposed into tensor operators like matrix-matrix multiplication or 1D convolution. These natural decompositions have led to the recent trend of adding tensor compute primitives [1, 12, 21]. These new primitives create both opportunities and challenges for schedule-based compilation; while using them can improve performance, the compilation framework must seamlessly integrate them. We dub this tensorization: it is analogous to vectorization for SIMD architectures but has significant differences. Instruction inputs are multi-dimensional, with fixed or variable lengths, and each has different data layouts. More importantly, we cannot support a fixed set of primitives since new accelerators are emerging with their own variations of tensor instructions. We therefore need an extensible solution.

We make tensorization extensible by separating the target hardware intrinsic from the schedule with a mechanism for tensor-intrinsic declaration. We use the same tensor expression language to declare both the behavior of each new hardware intrinsic and the lowering rule associated with it. The following code shows how to declare an 8×8 tensor hardware intrinsic:

    # declare the behavior of the intrinsic
    w, x = t.placeholder((8, 8)), t.placeholder((8, 8))
    k = t.reduce_axis((0, 8))
    y = t.compute((8, 8), lambda i, j:
            t.sum(w[i, k] * x[j, k], axis=k))

    # lowering rule to generate hardware intrinsics to carry out the computation
    def gemm_intrin_lower(inputs, outputs):
        ww_ptr = inputs[0].access_ptr("r")
        xx_ptr = inputs[1].access_ptr("r")
        zz_ptr = outputs[0].access_ptr("w")
        compute = t.hardware_intrin("gemm8x8", ww_ptr, xx_ptr, zz_ptr)
        reset = t.hardware_intrin("fill_zero", zz_ptr)
        update = t.hardware_intrin("fuse_gemm8x8_add", ww_ptr, xx_ptr, zz_ptr)
        return compute, reset, update

    gemm8x8 = t.decl_tensor_intrin(y.op, gemm_intrin_lower)

Additionally, we introduce a tensorize schedule primitive to replace a unit of computation with the corresponding intrinsics. The compiler matches the computation pattern with a hardware declaration and lowers it to the corresponding hardware intrinsic.

Tensorization decouples the schedule from specific hardware primitives, making it easy to extend TVM to support new hardware architectures. The generated code of tensorized schedules aligns with practices in high-performance computing: break complex operations into a sequence of micro-kernel calls. We can also use the tensorize primitive to take advantage of handcrafted micro-kernels, which can be beneficial on some platforms. For example, we implement ultra low-precision operators for mobile CPUs that operate on data types that are one or two bits wide by leveraging a bit-serial matrix-vector multiplication micro-kernel. This micro-kernel accumulates results into progressively larger data types to minimize the memory footprint. Presenting the micro-kernel as a tensor intrinsic to TVM yields up to a 1.5× speedup over the non-tensorized version.
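The arithmetic behind such a bit-serial micro-kernel can be shown independently of the tensorized intrinsic. Each activation and weight bit-plane is packed into machine words, so a dot product reduces to a bitwise AND plus popcount per bit-plane pair, weighted by the corresponding power of two. The sketch below is plain Python over packed integers; it only illustrates the computation the handcrafted ARM micro-kernel performs, is not the micro-kernel itself, and assumes unsigned quantization.

    def popcount(x):
        return bin(x).count("1")

    def bitserial_dot(act_planes, wgt_planes):
        # act_planes[i] (wgt_planes[j]) is an integer whose binary digits hold
        # bit i (bit j) of every element of the activation (weight) vector, so
        # one AND + popcount processes a whole word of elements at once.
        total = 0
        for i, a in enumerate(act_planes):
            for j, w in enumerate(wgt_planes):
                total += (1 << (i + j)) * popcount(a & w)
        return total

    # 2-bit activations and 1-bit weights, 8 elements packed per word.
    acts = [0b10110010,   # bit 0 of each activation element
            0b01100001]   # bit 1 of each activation element
    wgts = [0b11011010]   # single weight bit-plane
    print(bitserial_dot(acts, wgts))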
4.4 Explicit Memory Latency Hiding

Latency hiding refers to the process of overlapping memory operations with computation to maximize utilization of memory and compute resources. It requires different strategies depending on the target hardware back-end. On CPUs, memory latency hiding is achieved implicitly with simultaneous multithreading [14] or hardware prefetching [10, 20]. GPUs rely on rapid context switching of many warps of threads [44]. In contrast, specialized DL accelerators such as the TPU [21] usually favor leaner control with a decoupled access-execute (DAE) architecture [35] and offload the problem of fine-grained synchronization to software.

Figure 9 shows a DAE hardware pipeline that reduces runtime latency. Compared to a monolithic hardware design, the pipeline can hide most memory access overheads and almost fully utilize compute resources. To achieve higher utilization, the instruction stream must be augmented with fine-grained synchronization operations. Without them, dependencies cannot be enforced, leading to erroneous execution. Consequently, DAE hardware pipelines require fine-grained dependence enqueuing/dequeuing operations between the pipeline stages to guarantee correct execution, as shown in Figure 9's instruction stream.

Figure 9: Decoupled Access-Execute in hardware hides most memory access latency by allowing memory and computation to overlap. Execution correctness is enforced by low-level synchronization in the form of dependence token enqueueing/dequeuing actions, which the compiler stack must insert in the instruction stream.

Programming DAE accelerators that require explicit low-level synchronization is difficult. To reduce the programming burden, we introduce a virtual threading scheduling primitive that lets programmers specify a high-level data-parallel program as they would for a hardware back-end with support for multithreading. TVM then automatically lowers the program to a single instruction stream with low-level explicit synchronization, as shown in Figure 8. The algorithm starts with a high-level multi-threaded program schedule and then inserts the necessary low-level synchronization operations to guarantee correct execution within each thread. Next, it interleaves the operations of all virtual threads into a single instruction stream. Finally, the hardware recovers the available pipeline parallelism dictated by the low-level synchronizations in the instruction stream.

Figure 8: TVM virtual thread lowering transforms a virtual thread-parallel program (annotated with explicit read-after-write and write-after-read dependence operations) into a single instruction stream; the stream contains explicit low-level synchronizations that the hardware can interpret to recover the pipeline parallelism required to hide memory access latency.
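A minimal sketch of the interleaving step in that lowering, under the simplifying assumption that each virtual thread's instruction list has already been annotated with the push/pop dependence operations: the compiler round-robins across virtual threads to produce one stream, and the explicit tokens preserve the read-after-write and write-after-read ordering. The string-based instruction representation here is hypothetical and far simpler than TVM's internal IR.

    from itertools import zip_longest

    def interleave_virtual_threads(vthread_programs):
        # Round-robin the annotated instruction lists of all virtual threads
        # into a single stream; correctness relies on the dependence tokens
        # already present in each list, so interleaving only exposes pipeline
        # parallelism across virtual threads.
        stream = []
        for step in zip_longest(*vthread_programs):
            stream.extend(instr for instr in step if instr is not None)
        return stream

    # Two virtual threads alternating loads (ld) and compute (ex), as in Figure 8.
    vt0 = ["ld.pop_dep_from(ex)", "ld.dma_copy2d(AL[0], ...)", "ld.push_dep_to(ex)",
           "ex.pop_dep_from(ld)", "ex.accumulate(AL[0], CL[0])", "ex.push_dep_to(ld)"]
    vt1 = ["ld.pop_dep_from(ex)", "ld.dma_copy2d(AL[1], ...)", "ld.push_dep_to(ex)",
           "ex.pop_dep_from(ld)", "ex.accumulate(AL[1], CL[1])", "ex.push_dep_to(ld)"]
    for instr in interleave_virtual_threads([vt0, vt1]):
        print(instr)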

Hardware Evaluation of Latency Hiding. We now demonstrate the effectiveness of latency hiding on a custom FPGA-based accelerator design, which we describe in depth in subsection 6.4. We ran each layer of ResNet on the accelerator and used TVM to generate two schedules: one with latency hiding and one without. The schedule with latency hiding parallelized the program with virtual threads to expose pipeline parallelism and therefore hide memory access latency. The results are shown in Figure 10 as a roofline diagram [47]; roofline performance diagrams provide insight into how well a given system uses computation and memory resources for different benchmarks. Overall, latency hiding improved performance on all ResNet layers. Peak compute utilization increased from 70% with no latency hiding to 88% with latency hiding.

Figure 10: Roofline [47] of an FPGA-based DL accelerator running ResNet inference. With latency hiding enabled by TVM, the performance of the benchmarks is brought closer to the roofline, demonstrating higher compute and memory bandwidth efficiency.
5 Automating Optimization

Given the rich set of schedule primitives, our remaining problem is to find optimal operator implementations for each layer of a DL model. Here, TVM creates a specialized operator for the specific input shape and layout associated with each layer. Such specialization offers significant performance benefits (in contrast to handcrafted code that would target a smaller diversity of shapes and layouts), but it also raises automation challenges. The system needs to choose the schedule optimizations – such as modifying the loop order or optimizing for the memory hierarchy – as well as schedule-specific parameters, such as the tiling size and the loop unrolling factor. Such combinatorial choices create a large search space of operator implementations for each hardware back-end. To address this challenge, we built an automated schedule optimizer with two main components: a schedule explorer that proposes promising new configurations, and a machine learning cost model that predicts the performance of a given configuration. This section describes these components and TVM's automated optimization flow (Figure 11).

Figure 11: Overview of the automated optimization framework. A schedule explorer examines the schedule space using an ML-based cost model and chooses experiments to run on a distributed device cluster (e.g., Raspberry Pi, Mali GPU, Nvidia GPU, and FPGA boards) via RPC. To improve its predictive power, the ML model is updated periodically using collected data recorded in a database.

5.1 Schedule Space Specification

We built a schedule template specification API to let a developer declare knobs in the schedule space. The template specification allows incorporation of a developer's domain-specific knowledge, as necessary, when specifying possible schedules. We also created a generic master template for each hardware back-end that automatically extracts possible knobs based on the computation description expressed using the tensor expression language. At a high level, we would like to consider as many configurations as possible and let the optimizer manage the selection burden. Consequently, the optimizer must search over billions of possible configurations for the real world DL workloads used in our experiments.
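A minimal sketch of what such a template amounts to: a set of declared knobs whose cross product defines the configuration space handed to the optimizer, plus a parameterized schedule that consumes one configuration. The knob names and the enumeration helper below are hypothetical; TVM's actual template API is richer and also supports dependencies between knobs.

    import itertools

    # Declared knobs for a tiled matrix-multiplication template (illustrative).
    KNOBS = {
        "tile_y":      [4, 8, 16, 32, 64],
        "tile_x":      [4, 8, 16, 32, 64],
        "tile_k":      [4, 8, 16],
        "unroll":      [0, 16, 64],
        "vectorize_x": [False, True],
    }

    def config_space(knobs):
        # Enumerate the cross product of all knob settings.
        names = list(knobs)
        for values in itertools.product(*(knobs[n] for n in names)):
            yield dict(zip(names, values))

    space = list(config_space(KNOBS))
    print(len(space), "candidate configurations")  # 5 * 5 * 3 * 3 * 2 = 450 here
    print(space[0])
    # Each configuration parameterizes the schedule, e.g.
    #   yo, yi = s[C].split(y, factor=cfg["tile_y"]); ...; s[C].vectorize(xi)
    # before the candidate is scored by the cost model or measured on hardware.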
5.2 ML-Based Cost Model

One way to find the best schedule from a large configuration space is through blackbox optimization, i.e., auto-tuning. This method is used to tune high-performance computing libraries [15, 46]. However, auto-tuning requires many experiments to identify a good configuration.

An alternate approach is to build a predefined cost model to guide the search for a particular hardware back-end instead of running all possibilities and measuring their performance. Ideally, a perfect cost model considers all factors affecting performance: memory access patterns, data reuse, pipeline dependencies, and threading patterns, among others. This approach, unfortunately, is burdensome due to the increasing complexity of modern hardware. Furthermore, every new hardware target requires a new (predefined) cost model.

Table 1: Comparison of automation methods. Model bias refers to inaccuracy due to modeling.

    Method                  Data Cost   Model Bias   Need Hardware Info   Learn from History
    Blackbox auto-tuning    high        none         no                   no
    Predefined cost model   none        high         yes                  no
    ML-based cost model     low         low          no                   yes

We instead take a statistical approach to solve the cost modeling problem. In this approach, a schedule explorer proposes configurations that may improve an operator's performance. For each schedule configuration, we use an ML model that takes the lowered loop program as input and predicts its running time on a given hardware back-end. The model, trained using runtime measurement data collected during exploration, does not require the user to input detailed hardware information. We update the model periodically as we explore more configurations during optimization, which improves accuracy for other related workloads as well. In this way, the quality of the ML model improves with more experimental trials. Table 1 summarizes the key differences between automation methods. ML-based cost models strike a balance between auto-tuning and predefined cost modeling and can benefit from the historical performance data of related workloads.

Machine Learning Model Design Choices. We must consider two key factors when choosing which ML model the schedule explorer will use: quality and speed. The schedule explorer queries the cost model frequently, which incurs overheads due to model prediction time and model refitting time. To be useful, these overheads must be smaller than the time it takes to measure performance on real hardware, which can be on the order of seconds depending on the specific workload/hardware target.
This speed requirement differentiates our problem from traditional hyperparameter tuning problems, where the cost of performing measurements is very high relative to model overheads and more expensive models can be used. In addition to the choice of model, we need to choose an objective function to train the model, such as the error in a configuration's predicted running time. However, since the explorer selects the top candidates based only on the relative order of the predictions (A runs faster than B), we need not predict absolute execution times directly. Instead, we use a rank objective to predict the relative order of runtime costs.

We implement several types of models in our ML optimizer. We employ a gradient tree boosting model (based on XGBoost [8]), which makes predictions based on features extracted from the loop program; these features include the memory access count and reuse ratio of each memory buffer at each loop level, as well as a one-hot encoding of loop annotations such as "vectorize", "unroll", and "parallel". We also evaluate a neural network model that uses TreeRNN [38] to summarize the loop program's AST without feature engineering. Figure 13 summarizes the workflow of the cost models. We found that tree boosting and TreeRNN have similar predictive quality. However, the former performs prediction twice as fast and costs much less time to train. As a result, we chose gradient tree boosting as the default cost model in our experiments. Nevertheless, we believe that both approaches are valuable and expect more future research on this problem.

Figure 13: Example workflow of the ML cost models. XGBoost predicts costs based on loop program features (e.g., the touched memory size of each buffer at each loop level); TreeRNN directly summarizes the AST.

On average, the tree boosting model does prediction in 0.67 ms, thousands of times faster than running a real measurement. Figure 12 compares an ML-based optimizer to blackbox auto-tuning methods; the former finds better configurations much faster than the latter.

Figure 12: Comparison of different automation methods for a conv2d operator in ResNet-18 on a TITAN X. The ML-based model starts with no training data and uses the collected data to improve itself. The Y-axis is the speedup relative to cuDNN. We observe a similar trend for other workloads.
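The rank objective pairs naturally with gradient tree boosting. The sketch below trains an XGBoost ranker on hand-made loop features (touched memory sizes, annotation flags) and uses only the predicted ordering, never absolute times. It assumes the xgboost package is available; the feature values and measured times are fabricated for illustration, and TVM's real feature extraction and training loop differ.

    import numpy as np
    import xgboost as xgb

    # One row per candidate schedule of the same operator. Columns (illustrative):
    # [touched_bytes_A, touched_bytes_B, reuse_ratio, vectorize, unroll, parallel]
    X = np.array([
        [4096, 4096, 0.25, 1, 0, 1],
        [8192, 8192, 0.10, 0, 0, 1],
        [2048, 2048, 0.50, 1, 1, 1],
        [4096, 8192, 0.20, 0, 1, 0],
    ], dtype=np.float32)
    times = np.array([1.8, 3.2, 1.1, 2.7], dtype=np.float32)  # measured ms, lower is better

    dtrain = xgb.DMatrix(X, label=times.max() - times)  # relevance: faster = higher
    dtrain.set_group([len(X)])                          # all rows form one ranking group
    model = xgb.train({"objective": "rank:pairwise", "eta": 0.3, "max_depth": 4},
                      dtrain, num_boost_round=20)

    # The explorer only needs the relative order of the predictions.
    scores = model.predict(xgb.DMatrix(X))
    print("predicted-best candidate:", int(np.argmax(scores)))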

5.3 Schedule Exploration

Once we choose a cost model, we can use it to select promising configurations on which to iteratively run real measurements. In each iteration, the explorer uses the ML model's predictions to select a batch of candidates on which to run the measurements. The collected data is then used as training data to update the model. If no initial training data exists, the explorer picks random candidates to measure.

The simplest exploration algorithm enumerates and runs every configuration through the cost model, selecting the top-k predicted performers. However, this strategy becomes intractable with large search spaces. Instead, we run a parallel simulated annealing algorithm [22]. The explorer starts with random configurations and, at each step, randomly walks to a nearby configuration. This transition is successful if the cost decreases as predicted by the cost model. It is likely to fail (reject) if the target configuration has a higher cost. The random walk tends to converge on configurations that have lower costs as predicted by the cost model. Exploration states persist across cost model updates; we continue from the last configuration after these updates.
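A compact sketch of that exploration loop: random starting points, random walks to neighboring configurations, acceptance whenever the cost model predicts an improvement (and occasionally when it does not), and walker states that survive across cost-model updates. predict_cost and random_neighbor stand in for the trained cost model and the knob-mutation logic, and the temperature handling is deliberately simplistic.

    import math
    import random

    def simulated_anneal(walker_configs, predict_cost, random_neighbor,
                         n_steps=500, temperature=1.0):
        # One batch of parallel simulated annealing guided by the cost model.
        # Returns the final walker states (so exploration can resume after the
        # model is re-trained) and all visited configurations ranked by
        # predicted cost.
        walkers = [(cfg, predict_cost(cfg)) for cfg in walker_configs]
        visited = {}
        for _ in range(n_steps):
            stepped = []
            for cfg, cost in walkers:
                cand = random_neighbor(cfg)
                cand_cost = predict_cost(cand)
                accept = cand_cost < cost or \
                    random.random() < math.exp((cost - cand_cost) / temperature)
                if accept:
                    cfg, cost = cand, cand_cost
                visited[tuple(sorted(cfg.items()))] = (cfg, cost)
                stepped.append((cfg, cost))
            walkers = stepped
        ranked = [cfg for cfg, _ in sorted(visited.values(), key=lambda t: t[1])]
        return walkers, ranked

    # The top-k of `ranked` would be compiled and measured on real hardware, and
    # the measurements appended to the cost model's training data.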
5.4 Distributed Device Pool and RPC

A distributed device pool scales up the running of on-hardware trials and enables fine-grained resource sharing among multiple optimization jobs. TVM implements a customized, RPC-based distributed device pool that enables clients to run programs on a specific type of device. We can use this interface to compile a program with the host compiler, request a remote device, run the function remotely, and access the results in the same script on the host. TVM's RPC supports dynamic upload and runs cross-compiled modules and functions that use its runtime convention. As a result, the same infrastructure can perform both single-workload optimization and end-to-end graph inference. Our approach automates the compile, run, and profile steps across multiple devices. This infrastructure is especially critical for embedded devices, which traditionally require tedious manual effort for cross-compilation, code deployment, and measurement.
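In released versions of TVM this workflow is exposed through the tvm.rpc module; the sketch below follows that general shape (connect to a tracker, request a device, upload a cross-compiled module, and time a function remotely), but the tracker address, device key, file name, and kernel name are placeholders, and API details can vary across versions.

    from tvm import rpc

    # Sketch only: host, port, device key, and file names are placeholders.
    tracker = rpc.connect_tracker("0.0.0.0", 9190)
    remote = tracker.request("rpi3b", priority=1, session_timeout=600)

    remote.upload("deploy_lib.tar")              # cross-compiled operator library
    rlib = remote.load_module("deploy_lib.tar")
    dev = remote.cpu(0)

    # Time the (hypothetical) kernel remotely; the measurement flows back to the
    # host script, where it can be logged as training data for the cost model.
    timer = rlib.time_evaluator("fused_conv2d", dev, number=10)
    # cost = timer(a_dev, w_dev, out_dev).mean   # arrays allocated on `dev`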
6 Evaluation

TVM's core is implemented in C++ (∼50k LoC). We provide language bindings to Python and Java. Earlier sections of this paper evaluated the impact of several individual optimizations and components of TVM, namely, operator fusion in Figure 4, latency hiding in Figure 10, and the ML-based cost model in Figure 12. We now focus on an end-to-end evaluation that aims to answer the following questions:

• Can TVM optimize DL workloads over multiple platforms?

• How does TVM compare to existing DL frameworks (which rely on heavily optimized libraries) on each back-end?

• Can TVM support new, emerging DL workloads (e.g., depthwise convolution, low-precision operations)?

• Can TVM support and optimize for new specialized accelerators?

To answer these questions, we evaluated TVM on four types of platforms: (1) a server-class GPU, (2) an embedded GPU, (3) an embedded CPU, and (4) a DL accelerator implemented on a low-power FPGA SoC. The benchmarks are based on real world DL inference workloads, including ResNet [16], MobileNet [19], the LSTM Language Model [48], the Deep Q Network (DQN) [28], and Deep Convolutional Generative Adversarial Networks (DCGAN) [31]. We compare our approach to existing DL frameworks, including MXNet [9] and TensorFlow [2], that rely on highly engineered, vendor-specific libraries. TVM performs end-to-end automatic optimization and code generation without the need for an external operator library.

6.1 Server-Class GPU Evaluation

We first compared the end-to-end performance of deep neural networks on TVM, MXNet (v1.1), Tensorflow (v1.7), and Tensorflow XLA on an Nvidia Titan X. MXNet and Tensorflow both use cuDNN v7 for convolution operators; they implement their own versions of depthwise convolution since it is relatively new and not yet supported by the latest libraries. They also use cuBLAS v8 for matrix multiplications. Tensorflow XLA, on the other hand, uses JIT compilation.

Figure 14 shows that TVM outperforms the baselines, with speedups ranging from 1.6× to 3.8×, due to both joint graph optimization and the automatic optimizer, which generates high-performance fused operators. DQN's 3.8× speedup results from its use of unconventional operators (4×4 conv2d, strides=2) that are not well optimized by cuDNN; the ResNet workloads are more conventional. TVM automatically finds optimized operators in both cases.

Figure 14: GPU end-to-end evaluation of TVM, MXNet, Tensorflow, and Tensorflow XLA on ResNet-18, MobileNet, LSTM LM, DQN, and DCGAN. Tested on the NVIDIA Titan X.

Table 2: Configurations of all conv2d operators in ResNet-18 and all depthwise conv2d operators in MobileNet used in the single-kernel experiments. H/W denotes height and width, IC input channels, OC output channels, K kernel size, and S stride size. All ops use "SAME" padding. All depthwise conv2d operations have channel multipliers of 1.

    Name   Operator            H,W       IC,OC     K,S
    C1     conv2d              224,224   3,64      7,2
    C2     conv2d              56,56     64,64     3,1
    C3     conv2d              56,56     64,64     1,1
    C4     conv2d              56,56     64,128    3,2
    C5     conv2d              56,56     64,128    1,2
    C6     conv2d              28,28     128,128   3,1
    C7     conv2d              28,28     128,256   3,2
    C8     conv2d              28,28     128,256   1,2
    C9     conv2d              14,14     256,256   3,1
    C10    conv2d              14,14     256,512   3,2
    C11    conv2d              14,14     256,512   1,2
    C12    conv2d              7,7       512,512   3,1

    Name   Operator            H,W       IC        K,S
    D1     depthwise conv2d    112,112   32        3,1
    D2     depthwise conv2d    112,112   64        3,2
    D3     depthwise conv2d    56,56     128       3,1
    D4     depthwise conv2d    56,56     128       3,2
    D5     depthwise conv2d    28,28     256       3,1
    D6     depthwise conv2d    28,28     256       3,2
    D7     depthwise conv2d    14,14     512       3,1
    D8     depthwise conv2d    14,14     512       3,2
    D9     depthwise conv2d    7,7       1024      3,1
To evaluate the effectiveness of operator-level optimization, we also perform a breakdown comparison for each tensor operator in ResNet and MobileNet, shown in Figure 15. We include TensorComprehensions (TC, commit: ef644ba) [42], a recently introduced auto-tuning framework, as an additional baseline.² The TC results include the best kernels it found in 10 generations × 100 population × 2 random seeds for each operator (i.e., 2000 trials per operator). 2D convolution, one of the most important DL operators, is heavily optimized by cuDNN. However, TVM can still generate better GPU kernels for most layers. Depthwise convolution is a newly introduced operator with a simpler structure [19]. In this case, both TVM and TC can find fast kernels compared to MXNet's handcrafted kernels. TVM's improvements are mainly due to its exploration of a large schedule space and an effective ML-based search algorithm.

² According to personal communication [41], TC is not yet meant to be used for compute-bound problems. However, it is still a good reference baseline to include in the comparison.

Figure 15: Relative speedup of all conv2d operators in ResNet-18 and all depthwise conv2d operators in MobileNet. Tested on a TITAN X. See Table 2 for operator configurations. We also include a weight-pretransformed Winograd [25] for 3x3 conv2d (TVM PT).

6.2 Embedded CPU Evaluation

We evaluated the performance of TVM on an ARM Cortex-A53 (quad core, 1.2GHz). We used Tensorflow Lite (TFLite, commit: 7558b085) as our baseline system. Figure 17 compares TVM operators to hand-optimized ones for ResNet and MobileNet. We observe that TVM generates operators that outperform the hand-optimized TFLite versions for both neural network workloads. This result also demonstrates TVM's ability to quickly optimize emerging tensor operators, such as depthwise convolution operators. Finally, Figure 16 shows an end-to-end comparison of three workloads, where TVM outperforms the TFLite baseline.³

³ DCGAN and LSTM results are not presented because they are not yet supported by the baseline.

Figure 16: ARM A53 end-to-end evaluation of TVM and TFLite on ResNet-18, MobileNet, and DQN.

Figure 17: Relative speedup of all conv2d operators in ResNet-18 and all depthwise conv2d operators in MobileNet. Tested on an ARM A53. See Table 2 for the configurations of these operators.

Ultra Low-Precision Operators. We demonstrate TVM's ability to support ultra low-precision inference [13, 33] by generating highly optimized operators for fixed-point data types of less than 8 bits. Low-precision networks replace expensive multiplication with vectorized bit-serial multiplication composed of bitwise-and and popcount reductions [39]. Achieving efficient low-precision inference requires packing quantized data types into wider standard data types, such as int8 or int32. Our system generates code that outperforms hand-optimized libraries from Caffe2 (commit: 39e07f7) [39].
We implemented an ARM-specific tensorization intrinsic that leverages ARM instructions to build an efficient, low-precision matrix-vector micro-kernel. We then used TVM's automated optimizer to explore the scheduling space.

Figure 18 compares TVM to the Caffe2 ultra low-precision library on ResNet for inference with 2-bit activations and 1-bit weights. Since the baseline is single-threaded, we also compare it to a single-threaded TVM version. Single-threaded TVM outperforms the baseline, particularly for the C5, C8, and C11 layers; these are convolution layers of kernel size 1×1 and stride of 2, for which the ultra low-precision baseline library is not optimized. Furthermore, we take advantage of additional TVM capabilities to produce a parallel library implementation that shows improvement over the baseline. In addition to the 2-bit+1-bit configuration, TVM can generate and optimize for other precision configurations that are unsupported by the baseline library, offering improved flexibility.

Figure 18: Relative speedup of single- and multi-threaded low-precision conv2d operators in ResNet. The baseline was a single-threaded, hand-optimized implementation from Caffe2 (commit: 39e07f7). C5 and C3 are 1x1 convolutions that have less compute intensity, resulting in less speedup from multi-threading.

6.3 Embedded GPU Evaluation

For our mobile GPU experiments, we ran our end-to-end pipeline on a Firefly-RK3399 board equipped with an ARM Mali-T860MP4 GPU. The baseline was a vendor-provided library, the ARM Compute Library (v18.03). As shown in Figure 19, we outperformed the baseline on the three available models for both float16 and float32 (DCGAN and LSTM are not yet supported by the baseline). The speedup ranged from 1.2× to 1.6×.

Figure 19: End-to-end experiment results on the Mali-T860MP4 for ResNet-18, MobileNet, and DQN. Two data types, float32 and float16, were evaluated.

6.4 FPGA Accelerator Evaluation

Vanilla Deep Learning Accelerator. We now relate how TVM tackled accelerator-specific code generation on a generic inference accelerator design we prototyped on an FPGA. We used in this evaluation the Vanilla Deep Learning Accelerator (VDLA) – which distills characteristics from previous accelerator proposals [12, 21, 27] into a minimalist hardware architecture – to demonstrate TVM's ability to generate highly efficient schedules that can target specialized accelerators. Figure 20 shows the high-level hardware organization of the VDLA architecture. VDLA is programmed as a tensor processor to efficiently execute operations with high compute intensity (e.g., matrix multiplication, high dimensional convolution). It can perform load/store operations to bring blocked 3-dimensional tensors from DRAM into a contiguous region of SRAM. It also provides specialized on-chip memories for network parameters, layer inputs (narrow data type), and layer outputs (wide data type). Finally, VDLA provides explicit synchronization control over successive loads, computes, and stores to maximize the overlap between memory and compute operations.

Methodology. We implemented the VDLA design on a low-power PYNQ board that incorporates an ARM Cortex-A9 dual-core CPU clocked at 667MHz and an Artix-7 based FPGA fabric. On these modest FPGA resources, we implemented a 16×16 matrix-vector unit clocked at 200MHz that performs products of 8-bit values and accumulates them into a 32-bit register every cycle. The theoretical peak throughput of this VDLA design is about 102.4 GOPS/s. We allocated 32kB of resources for activation storage, 32kB for parameter storage, 32kB for microcode buffers, and 128kB for the register file. These on-chip buffers are by no means large enough to provide sufficient on-chip storage for a single layer of ResNet and therefore enable a case study on effective memory reuse and latency hiding.

We built a driver library for VDLA with a C runtime API that constructs instructions and pushes them to the target accelerator for execution. Our code generation algorithm then translates the accelerator program into a series of calls into the runtime API. Adding the specialized accelerator back-end took ∼2k LoC in Python.
Figure 20: VDLA Hardware design overview.

Figure 21: We offloaded convolutions in the ResNet workload to an FPGA-based accelerator. The grayed-out bars correspond to layers that could not be accelerated by the FPGA and therefore had to run on the CPU. The FPGA provided a 40× acceleration on offloaded convolution layers over the Cortex A9.
End-to-End ResNet Evaluation. We used TVM to generate ResNet inference kernels on the PYNQ platform and offloaded as many layers as possible to VDLA. We also used it to generate schedules for both the CPU-only and the CPU+FPGA implementations. Due to its shallow convolution depth, the first ResNet convolution layer could not be efficiently offloaded to the FPGA and was instead computed on the CPU. All other convolution layers in ResNet, however, were amenable to efficient offloading. Operations like residual layers and activations were also performed on the CPU, since VDLA does not support these operations.

Figure 21 breaks down ResNet inference time into CPU-only execution and CPU+FPGA execution. Most computation was spent on the convolution layers that could be offloaded to VDLA. For those convolution layers, the achieved speedup was 40×. Unfortunately, due to Amdahl’s law, the overall performance of the FPGA-accelerated system was bottlenecked by the sections of the workload that had to be executed on the CPU. We envision that extending the VDLA design to support these other operators will help reduce cost even further. This FPGA-based experiment showcases TVM’s ability to adapt to new architectures and the hardware intrinsics they expose.
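The Amdahl’s-law effect can be made concrete with a small calculation; the offloadable fractions used below are assumed, illustrative values rather than measurements from Figure 21.

def overall_speedup(offloaded_fraction, offload_speedup):
    # Amdahl's law: the CPU-resident fraction caps the end-to-end gain.
    return 1.0 / ((1.0 - offloaded_fraction) + offloaded_fraction / offload_speedup)

print(overall_speedup(0.80, 40))   # ~4.5x end to end, despite 40x on convolutions
print(overall_speedup(0.95, 40))   # ~13.6x if more of the workload were offloaded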
7 Related Work

Deep learning frameworks [3, 4, 7, 9] provide convenient interfaces for users to express DL workloads and deploy them easily on different hardware back-ends. While existing frameworks currently depend on vendor-specific tensor operator libraries to execute their workloads, they can leverage TVM’s stack to generate optimized code for a larger number of hardware devices.

High-level computation graph DSLs are a typical way to represent and perform high-level optimizations. TensorFlow’s XLA [3] and the recently introduced DLVM [45] fall into this category. The representations of computation graphs in these works are similar, and a high-level computation graph DSL is also used in this paper. While graph-level representations are a good fit for high-level optimizations, they are too high level to optimize tensor operators under a diverse set of hardware back-ends. Prior work relies on specific lowering rules to directly generate low-level LLVM or resorts to vendor-crafted libraries. These approaches require significant engineering effort for each hardware back-end and operator-variant combination.

Halide [32] introduced the idea of separating computing and scheduling. We adopt Halide’s insights and reuse its existing useful scheduling primitives in our compiler. Our tensor operator scheduling is also related to other work on DSLs for GPUs [18, 24, 36, 37] and polyhedral-based loop transformation [6, 43]. TACO [23] introduces a generic way to generate sparse tensor operators on CPU. Weld [30] is a DSL for data processing tasks. We specifically focus on solving the new scheduling challenges of DL workloads for GPUs and specialized accelerators. Our new primitives can potentially be adopted by the optimization pipelines in these works.

High-performance libraries such as ATLAS [46] and FFTW [15] use auto-tuning to get the best performance. Tensor Comprehensions [42] applied black-box auto-tuning together with polyhedral optimizations to optimize CUDA kernels. OpenTuner [5] and existing hyperparameter-tuning algorithms [26] apply domain-agnostic search. A predefined cost model is used to automatically schedule image processing pipelines in Halide [29]. TVM’s ML model uses effective domain-aware cost modeling that considers program structure. The ML-based distributed schedule optimizer scales to a larger search space and can find state-of-the-art kernels on a large range of supported back-ends. More importantly, we provide an end-to-end stack that can take descriptions directly from DL frameworks and jointly optimize together with the graph-level stack.

Despite the emerging popularity of accelerators for deep learning [11, 21], it remains unclear how a compilation stack can be built to effectively target these devices. The VDLA design used in our evaluation provides a generic way to summarize the properties of TPU-like accelerators and enables a concrete case study on how to compile code for accelerators. Our approach could potentially benefit existing systems that compile deep learning to FPGAs [34, 40] as well. This paper provides a generic solution to effectively target accelerators via tensorization and compiler-driven latency hiding.

8 Conclusion

We proposed an end-to-end compilation stack to solve fundamental optimization challenges for deep learning across a diverse set of hardware back-ends. Our system includes automated end-to-end optimization, which has historically been a labor-intensive and highly specialized task. We hope this work will encourage additional studies of end-to-end compilation approaches and open new opportunities for DL system software-hardware co-design techniques.

Acknowledgement

We would like to thank Ras Bodik, James Bornholt, Xi Wang, Tom Anderson, and Qiao Zhang for their thorough feedback on earlier versions of this paper. We would also like to thank members of the Sampa, SAMPL, and Systems groups at the Allen School for their feedback on the work and manuscript. We would like to thank the anonymous OSDI reviewers, and our shepherd, Ranjita Bhagwan, for their helpful feedback. This work was supported in part by a Google PhD Fellowship for Tianqi Chen, ONR award #N00014-16-1-2795, NSF under grants CCF-1518703, CNS-1614717, and CCF-1723352, and gifts from Intel (under the CAPA program), Oracle, Huawei, and anonymous sources.

References

[1] NVIDIA Tesla V100 GPU Architecture: The World’s Most Advanced Data Center GPU, 2017.

[2] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[3] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016), pp. 265–283.

[4] Agarwal, A., Akchurin, E., Basoglu, C., Chen, G., Cyphers, S., Droppo, J., Eversole, A., Guenter, B., Hillebrand, M., Hoens, R., Huang, X., Huang, Z., Ivanov, V., Kamenev, A., Kranen, P., Kuchaiev, O., Manousek, W., May, A., Mitra, B., Nano, O., Navarro, G., Orlov, A., Padmilac, M., Parthasarathi, H., Peng, B., Reznichenko, A., Seide, F., Seltzer, M. L., Slaney, M., Stolcke, A., Wang, Y., Wang, H., Yao, K., Yu, D., Zhang, Y., and Zweig, G. An introduction to computational networks and the computational network toolkit. Tech. Rep. MSR-TR-2014-112, August 2014.

[5] Ansel, J., Kamil, S., Veeramachaneni, K., Ragan-Kelley, J., Bosboom, J., O’Reilly, U.-M., and Amarasinghe, S. OpenTuner: An extensible framework for program autotuning. In International Conference on Parallel Architectures and Compilation Techniques (Edmonton, Canada, August 2014).

[6] Baghdadi, R., Beaugnon, U., Cohen, A., Grosser, T., Kruse, M., Reddy, C., Verdoolaege, S., Betts, A., Donaldson, A. F., Ketema, J., Absar, J., Haastregt, S. v., Kravets, A., Lokhmotov, A., David, R., and Hajiyev, E. Pencil: A platform-neutral compute intermediate language for accelerator programming. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT) (Washington, DC, USA, 2015), PACT ’15, IEEE Computer Society, pp. 138–149.

[7] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

[8] Chen, T., and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2016), KDD ’16, ACM, pp. 785–794.

[9] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems (LearningSys’15) (2015).

[10] Chen, T.-F., and Baer, J.-L. Effective hardware-based data prefetching for high-performance processors. IEEE Transactions on Computers 44, 5 (May 1995), 609–623.

[11] Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., and Temam, O. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (Washington, DC, USA, 2014), MICRO-47, IEEE Computer Society, pp. 609–622.

[12] Chen, Y.-H., Emer, J., and Sze, V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture (Piscataway, NJ, USA, 2016), ISCA ’16, IEEE Press, pp. 367–379.

[13] Courbariaux, M., Bengio, Y., and David, J. BinaryConnect: Training deep neural networks with binary weights during propagations. CoRR abs/1511.00363 (2015).

[14] Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., Stamm, R. L., and Tullsen, D. M. Simultaneous multithreading: a platform for next-generation processors. IEEE Micro 17, 5 (Sept 1997), 12–19.

[15] Frigo, M., and Johnson, S. G. FFTW: an adaptive software architecture for the FFT. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on (May 1998), vol. 3, pp. 1381–1384.

[16] He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027 (2016).

[17] Hegarty, J., Brunhaver, J., DeVito, Z., Ragan-Kelley, J., Cohen, N., Bell, S., Vasilyev, A., Horowitz, M., and Hanrahan, P. Darkroom: Compiling high-level image processing code into hardware pipelines. ACM Trans. Graph. 33, 4 (July 2014), 144:1–144:11.

[18] Henriksen, T., Serup, N. G. W., Elsman, M., Henglein, F., and Oancea, C. E. Futhark: Purely functional GPU-programming with nested parallelism and in-place array updates. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2017), PLDI 2017, ACM, pp. 556–571.

[19] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017).

[20] Jouppi, N. P. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture (May 1990), pp. 364–373.

[21] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P.-l., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson, G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang, W., Wilcox, E., and Yoon, D. H. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (New York, NY, USA, 2017), ISCA ’17, ACM, pp. 1–12.

[22] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. Optimization by simulated annealing. Science 220, 4598 (1983), 671–680.

[23] Kjolstad, F., Kamil, S., Chou, S., Lugato, D., and Amarasinghe, S. The tensor algebra compiler. Proc. ACM Program. Lang. 1, OOPSLA (Oct. 2017), 77:1–77:29.

[24] Klöckner, A. Loo.py: transformation-based code generation for GPUs and CPUs. In Proceedings of ARRAY ’14: ACM SIGPLAN Workshop on Libraries, Languages, and Compilers for Array Programming (Edinburgh, Scotland, 2014), Association for Computing Machinery.

[25] Lavin, A., and Gray, S. Fast algorithms for convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016 (2016), pp. 4013–4021.

[26] Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Efficient hyperparameter optimization and infinitely many armed bandits. CoRR abs/1603.06560 (2016).

[27] Liu, D., Chen, T., Liu, S., Zhou, J., Zhou, S., Teman, O., Feng, X., Zhou, X., and Chen, Y. PuDianNao: A polyvalent machine learning accelerator. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2015), ASPLOS ’15, ACM, pp. 369–381.

[28] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.

[29] Mullapudi, R. T., Adams, A., Sharlet, D., Ragan-Kelley, J., and Fatahalian, K. Automatically scheduling Halide image processing pipelines. ACM Trans. Graph. 35, 4 (July 2016), 83:1–83:11.

[30] Palkar, S., Thomas, J. J., Narayanan, D., Shanbhag, A., Palamuttam, R., Pirk, H., Schwarzkopf, M., Amarasinghe, S. P., Madden, S., and Zaharia, M. Weld: Rethinking the interface between data-intensive applications. CoRR abs/1709.06416 (2017).

[31] Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).

[32] Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2013), PLDI ’13, ACM, pp. 519–530.

[33] Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision (2016), Springer, pp. 525–542.

[34] Sharma, H., Park, J., Mahajan, D., Amaro, E., Kim, J. K., Shao, C., Mishra, A., and Esmaeilzadeh, H. From high-level deep neural models to FPGAs. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on (2016), IEEE, pp. 1–12.

[35] Smith, J. E. Decoupled access/execute computer architectures. In Proceedings of the 9th Annual Symposium on Computer Architecture (Los Alamitos, CA, USA, 1982), ISCA ’82, IEEE Computer Society Press, pp. 112–119.

[36] Steuwer, M., Remmelg, T., and Dubach, C. Lift: A functional data-parallel IR for high-performance GPU code generation. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (Piscataway, NJ, USA, 2017), CGO ’17, IEEE Press, pp. 74–85.

[37] Sujeeth, A. K., Lee, H., Brown, K. J., Chafi, H., Wu, M., Atreya, A. R., Olukotun, K., Rompf, T., and Odersky, M. OptiML: An implicitly parallel domain-specific language for machine learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning (USA, 2011), ICML’11, pp. 609–616.

[38] Tai, K. S., Socher, R., and Manning, C. D. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015).

[39] Tulloch, A., and Jia, Y. High performance ultra-low-precision convolutions on mobile devices. arXiv preprint arXiv:1712.02427 (2017).

[40] Umuroglu, Y., Fraser, N. J., Gambardella, G., Blott, M., Leong, P. H. W., Jahre, M., and Vissers, K. A. FINN: A framework for fast, scalable binarized neural network inference. CoRR abs/1612.07119 (2016).

[41] Vasilache, N. Personal communication.

[42] Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. CoRR abs/1802.04730 (2018).

[43] Verdoolaege, S., Carlos Juega, J., Cohen, A., Ignacio Gómez, J., Tenllado, C., and Catthoor, F. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim. 9, 4 (Jan. 2013), 54:1–54:23.

[44] Volkov, V. Understanding Latency Hiding on GPUs. PhD thesis, University of California at Berkeley, 2016.

[45] Wei, R., Adve, V., and Schwartz, L. DLVM: A modern compiler infrastructure for deep learning systems. CoRR abs/1711.03016 (2017).

[46] Whaley, R. C., and Dongarra, J. J. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (Washington, DC, USA, 1998), SC ’98, IEEE Computer Society, pp. 1–27.

[47] Williams, S., Waterman, A., and Patterson, D. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (Apr. 2009), 65–76.

[48] Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).
