TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
Tianqi Chen1 , Thierry Moreau1 , Ziheng Jiang1,2 , Lianmin Zheng3 , Eddie Yan1
Meghan Cowan1 , Haichen Shen1 , Leyuan Wang4,2 , Yuwei Hu5 , Luis Ceze1 , Carlos Guestrin1 , Arvind Krishnamurthy1
1 Paul G. Allen School of Computer Science & Engineering, University of Washington
Abstract

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms – such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) – requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends.

[Figure 1: Hardware back-ends differ in memory hierarchy – implicitly managed caches on CPUs (L3, L2, L1D/L1I), a mix of managed caches and register files on GPUs (L2, L1/TX, RF), and explicitly managed buffers on accelerators (weight FIFO, activation buffer, accumulator register file) – and in compute primitives (scalar, vector, tensor).]
tions for diverse hardware back-ends, we take a fundamentally different, end-to-end approach. We built TVM, a compiler that takes a high-level specification of a deep learning program from existing frameworks and generates low-level optimized code for a diverse set of hardware back-ends. To be attractive to users, TVM needs to offer performance competitive with the multitude of manually optimized operator libraries across diverse hardware back-ends. This goal requires addressing the key challenges described below.

Leveraging Specific Hardware Features and Abstractions. DL accelerators introduce optimized tensor compute primitives [1, 12, 21], while GPUs and CPUs continuously improve their processing elements. This poses a significant challenge in generating optimized code for a given operator description. The inputs to hardware instructions are multi-dimensional, with fixed or variable lengths; they dictate different data layouts; and they have special requirements for memory hierarchy. The system must effectively exploit these complex primitives to benefit from acceleration. Further, accelerator designs also commonly favor leaner control [21] and offload most scheduling complexity to the compiler stack. For specialized accelerators, the system now needs to generate code that explicitly controls pipeline dependencies to hide memory access latency – a job that hardware performs for CPUs and GPUs.

Large Search Space for Optimization. Another challenge is producing efficient code without manually tuning operators. The combinatorial choices of memory access, threading pattern, and novel hardware primitives create a huge configuration space for generated code (e.g., loop tiles and ordering, caching, unrolling) that would incur a large search cost if we implement black-box auto-tuning. One could adopt a predefined cost model to guide the search, but building an accurate cost model is difficult due to the increasing complexity of modern hardware. Furthermore, such an approach would require us to build separate cost models for each hardware type.

of valid programs for a given operator declaration. (2) We introduce an automated program optimization framework to find optimized tensor operators. The optimizer is guided by an ML-based cost model that adapts and improves as we collect more data from a hardware back-end. (3) On top of the automatic code generator, we introduce a graph rewriter that takes full advantage of high- and operator-level optimizations.

By combining these three modules, TVM can take model descriptions from existing deep learning frameworks, perform joint high- and low-level optimizations, and generate hardware-specific optimized code for back-ends, e.g., CPUs, GPUs, and FPGA-based specialized accelerators.

This paper makes the following contributions:

• We identify the major optimization challenges in providing performance portability to deep learning workloads across diverse hardware back-ends.

• We introduce novel schedule primitives that take advantage of cross-thread memory reuse, novel hardware intrinsics, and latency hiding.

• We propose and implement a machine learning based optimization system to automatically explore and search for optimized tensor operators.

• We build an end-to-end compilation and optimization stack that allows the deployment of deep learning workloads specified in high-level frameworks (including TensorFlow, MXNet, PyTorch, Keras, CNTK) to diverse hardware back-ends (including CPUs, server GPUs, mobile GPUs, and FPGA-based accelerators). The open-sourced TVM is in production use inside several major companies.

We evaluated TVM using real world workloads on a server-class GPU, an embedded GPU, an embedded CPU, and a custom generic FPGA-based accelerator. Experimental results show that TVM offers portable performance across back-ends and achieves speedups ranging from 1.2× to 3.8× over existing frameworks backed by hand-optimized libraries.
Figure 2: System overview of TVM. The current stack supports descriptions from many deep learning frameworks and exchange formats, such as CoreML and ONNX, to target major CPU, GPU and specialized accelerators.

Operators are specified in a tensor expression language; execution details are unspecified. TVM identifies a collection of possible code optimizations for a given hardware target's operators. Possible optimizations form a large space, so we use an ML-based cost model to find optimized operators. Finally, the system packs the generated code into a deployable module.

End-User Example. In a few lines of code, a user can take a model from existing deep learning frameworks and call the TVM API to get a deployable module:

    import tvm as t
    # Use keras framework as example, import model
    graph, params = t.frontend.from_keras(keras_model)
    target = t.target.cuda()
    graph, lib, params = t.compiler.build(graph, target, params)

This compiled runtime module contains three components: the final optimized computational graph (graph), generated operators (lib), and module parameters (params). These components can then be used to deploy the model to the target back-end:

    import tvm.runtime as t
    module = t.create(graph, lib, t.cuda(0))
    module.set_input(**params)
    module.run(data=data_array)
    output = t.nd.empty(out_shape, ctx=t.cuda(0))
    module.get_output(0, output)

TVM supports multiple deployment back-ends in languages such as C++, Java and Python. The rest of this paper describes TVM's architecture and how a system programmer can extend it to support new back-ends.

3 Optimizing Computational Graphs

Computational graphs are a common way to represent programs in DL frameworks [3, 4, 7, 9]. Figure 3 shows an example computational graph representation of a two-layer convolutional neural network. The main difference between this high-level representation and a low-level compiler intermediate representation (IR), such as LLVM, is that the intermediate data items are large, multi-dimensional tensors. Computational graphs provide a global view of operators, but they avoid specifying how each operator must be implemented. Like LLVM IRs, a computational graph can be transformed into functionally equivalent graphs to apply optimizations. We also take advantage of shape specificity in common DL workloads to optimize for a fixed set of input shapes.

TVM exploits a computational graph representation to apply high-level optimizations: a node represents an operation on tensors or program inputs, and edges represent data dependencies between operations. It implements many graph-level optimizations, including: operator fusion, which fuses multiple small operations together; constant-folding, which pre-computes graph parts that can be determined statically, saving execution costs; a static memory planning pass, which pre-allocates memory to hold each intermediate tensor; and data layout transformations, which transform internal data layouts into back-end-friendly forms. We now discuss operator fusion and the data layout transformation.

Operator Fusion. Operator fusion combines multiple operators into a single kernel without saving the intermediate results in memory. This optimization can greatly reduce execution time, particularly in GPUs and specialized accelerators. Specifically, we recognize four categories of graph operators: (1) injective (one-to-one map, e.g., add), (2) reduction (e.g., sum), (3) complex-out-fusable (can fuse element-wise map to output, e.g., conv2d), and (4) opaque (cannot be fused, e.g., sort). We provide generic rules to fuse these operators, as follows. Multiple injective operators can be fused into another injective operator. A reduction operator can be fused with input injective operators (e.g., fuse scale and sum). Operators such as conv2d are complex-out-fusable, and we can fuse element-wise operators to its output.
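To make these fusion rules concrete, the sketch below groups a linear chain of operators using the four categories above. It is a minimal illustration of the rules as stated, not TVM's implementation; the operator list, category table, and helper names are assumptions introduced here.

    # Minimal sketch of the generic fusion rules (hypothetical helpers).
    OP_CATEGORY = {
        "add": "injective", "exp": "injective", "scale": "injective",
        "sum": "reduction",
        "conv2d": "complex-out-fusable",
        "sort": "opaque",
    }

    def can_fuse(group, cat):
        """Can an operator of category `cat` join the current fused group?"""
        last_cat = OP_CATEGORY[group[-1]]
        if cat == "injective":
            # Injective ops fuse after injective ops or onto a complex-out-fusable output.
            return last_cat in ("injective", "complex-out-fusable")
        if cat == "reduction":
            # A reduction fuses with its injective inputs (e.g., fuse scale and sum).
            return all(OP_CATEGORY[o] == "injective" for o in group)
        return False  # opaque and complex-out-fusable ops start their own group

    def fuse_chain(ops):
        """Greedily group a chain of operator names into fused kernels."""
        groups, current = [], []
        for op in ops:
            if current and can_fuse(current, OP_CATEGORY[op]):
                current.append(op)
            else:
                if current:
                    groups.append(current)
                current = [op]
        if current:
            groups.append(current)
        return groups

    # conv2d absorbs the following element-wise ops; the opaque sort stays alone.
    print(fuse_chain(["conv2d", "add", "exp", "sort"]))  # [['conv2d', 'add', 'exp'], ['sort']]
    print(fuse_chain(["scale", "sum", "add"]))           # [['scale', 'sum'], ['add']]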
[Figure 4: relative speedup of operators with and without fusion.]

    A = t.placeholder((1024, 1024))
    B = t.placeholder((1024, 1024))
    k = t.reduce_axis((0, 1024))
    C = t.compute((1024, 1024), lambda y, x:
        t.sum(A[k, y] * B[k, x], axis=k))

    # corresponding low-level code
    for y in range(1024):
        for x in range(1024):
            C[y][x] = 0
            for k in range(1024):
                C[y][x] += A[k][y] * B[k][x]

Figure 5: Example schedule transformations that optimize a matrix multiplication on a specialized accelerator.

The most common data layout choices are column major and row major. In practice, we may prefer to use even more complicated data layouts. For instance, a DL accelerator might exploit 4 × 4 matrix operations, requiring data to be tiled into 4 × 4 chunks to optimize for access locality.
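As a concrete illustration of such a tiled layout, the standalone NumPy sketch below packs a row-major matrix into contiguous 4 × 4 blocks; it is not code from the paper, and the 1024 × 1024 shape is chosen arbitrarily.

    import numpy as np

    # Pack a row-major matrix into 4x4 tiles so that each tile is contiguous in memory,
    # matching an accelerator that consumes one 4x4 block per instruction.
    def tile_4x4(matrix):
        h, w = matrix.shape
        assert h % 4 == 0 and w % 4 == 0
        # shape (h//4, w//4, 4, 4); element [i, j] of tile (ti, tj) is matrix[ti*4+i, tj*4+j]
        return matrix.reshape(h // 4, 4, w // 4, 4).transpose(0, 2, 1, 3).copy()

    a = np.arange(1024 * 1024, dtype=np.float32).reshape(1024, 1024)
    tiled = tile_4x4(a)
    assert np.array_equal(tiled[2, 3], a[8:12, 12:16])  # tile (2, 3) holds one 4x4 chunk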
Data layout optimization converts a computational graph into one that can use better internal data layouts for execution on the target hardware. It starts by specifying the preferred data layout for each operator given the constraints dictated by memory hierarchies. We then perform the proper layout transformation between a producer and a consumer if their preferred data layouts do not match.

While high-level graph optimizations can greatly improve the efficiency of DL workloads, they are only as effective as what the operator library provides. Currently, the few DL frameworks that support operator fusion require the operator library to provide an implementation of the fused patterns. With more network operators introduced on a regular basis, the number of possible fused kernels can grow dramatically. This approach is no longer sustainable when targeting an increasing number of hardware back-ends since the required number of fused pattern implementations grows combinatorially with the number of data layouts, data types, and accelerator intrinsics that must be supported. It is not feasible to handcraft operator kernels for the various operations desired by a program and for each back-end. To this end, we next propose a code generation approach that can generate various possible implementations for a given model's operators.

4 Generating Tensor Operations

TVM produces efficient code for each operator by generating many valid implementations on each hardware back-end and choosing an optimized implementation. This process builds on Halide's idea of decoupling descriptions from computation rules (or schedule optimizations) [32] and extends it to support new optimizations (nested parallelism, tensorization, and latency hiding) and a wide array of hardware back-ends. We now highlight TVM-specific features.

4.1 Tensor Expression and Schedule Space

We introduce a tensor expression language to support automatic code generation. Unlike high-level computation graph representations, where the implementation of tensor operations is opaque, each operation is described in an index formula expression language.
The following code shows an example tensor expression to compute transposed matrix multiplication:

    m, n, h = t.var('m'), t.var('n'), t.var('h')
    A = t.placeholder((m, h), name='A')
    B = t.placeholder((n, h), name='B')
    k = t.reduce_axis((0, h), name='k')
    C = t.compute((m, n),                                   # result shape
                  lambda y, x: t.sum(A[k, y] * B[k, x], axis=k))  # computing rule

Each compute operation specifies both the shape of the output tensor and an expression describing how to compute each element of it. Our tensor expression language supports common arithmetic and math operations and covers common DL operator patterns. The language does not specify the loop structure and many other execution details, and it provides flexibility for adding hardware-aware optimizations for various back-ends. Adopting the decoupled compute/schedule principle from Halide [32], we use a schedule to denote a specific mapping from a tensor expression to low-level code. Many possible schedules can perform this function.

We build a schedule by incrementally applying basic transformations (schedule primitives) that preserve the program's logical equivalence. Figure 5 shows an example of scheduling matrix multiplication on a specialized accelerator. Internally, TVM uses a data structure to keep track of the loop structure and other information as we apply schedule transformations. This information can then help generate low-level code for a given final schedule.

Our tensor expression takes cues from Halide [32], Darkroom [17], and TACO [23]. Its primary enhancements include support for the new schedule optimizations discussed below. To achieve high performance on many back-ends, we must support enough schedule primitives to cover a diverse set of optimizations on different hardware back-ends. Figure 6 summarizes the operation code generation process and schedule primitives that TVM supports. We reuse helpful primitives and the low-level loop program AST from Halide, and we introduce new primitives to optimize GPU and accelerator performance. The new primitives are necessary to achieve optimal GPU performance and essential for accelerators. CPU, GPU, TPU-like accelerators are three important types of hardware for deep learning. This section describes new optimization primitives for CPUs, GPUs and TPU-like accelerators, while section 5 explains how to automatically derive efficient schedules.

    Schedule primitives used in various hardware back-ends:

    Schedule Primitive                  CPU   GPU   Accel.
    [Halide] Loop Transformations        ✔     ✔     ✔
    [Halide] Thread Binding              ✔     ✔     ✔
    [Halide] Compute Locality            ✔     ✔     ✔
    [TVM]    Special Memory Scope              ✔     ✔
    [TVM]    Tensorization               ✔     ✔     ✔
    [TVM]    Latency Hiding                          ✔

Figure 6: TVM schedule lowering and code generation process. The table lists existing Halide and novel TVM scheduling primitives being used to optimize schedules for CPUs, GPUs and accelerator back-ends. Tensorization is essential for accelerators, but it can also be used for CPUs and GPUs. Special memory-scope enables memory reuse in GPUs and explicit management of on-chip memory in accelerators. Latency hiding is specific to TPU-like accelerators.

Figure 7: Performance comparison between TVM with and without cooperative shared memory fetching on matrix multiplication workloads. Tested on an NVIDIA Titan X.

4.2 Nested Parallelism with Cooperation

Parallelism is key to improving the efficiency of compute-intensive kernels in DL workloads. Modern GPUs offer massive parallelism, requiring us to bake parallel patterns into schedule transformations. Most existing solutions adopt a model called nested parallelism, a form of fork–join. This model requires a parallel schedule primitive to parallelize a data parallel task; each task can be further recursively subdivided into subtasks to exploit the target architecture's multi-level thread hierarchy (e.g., thread groups in GPU). We call this model shared-nothing nested parallelism because one working thread cannot look at the data of its sibling within the same parallel computation stage.

An alternative to the shared-nothing approach is to fetch data cooperatively. Specifically, groups of threads can cooperatively fetch the data they all need and place it into a shared memory space.¹ This optimization can take advantage of the GPU memory hierarchy and enable data reuse across threads through shared memory regions.

¹Halide recently added shared memory support but without general memory scope for accelerators.
TVM supports this well-known GPU optimization using a schedule primitive to achieve optimal performance. The following GPU code example optimizes matrix multiplication:

    for thread_group (by, bx) in cross(64, 64):
        for thread_item (ty, tx) in cross(2, 2):
            local CL[8][8] = 0
            shared AS[2][8], BS[2][8]
            for k in range(1024):
                # all threads cooperatively load AS and BS in different parallel patterns
                for i in range(4):
                    AS[ty][i*4+tx] = A[k][by*64+ty*8+i*4+tx]
                for i in range(4):
                    BS[ty][i*4+tx] = B[k][bx*64+ty*8+i*4+tx]
                memory_barrier_among_threads()   # barrier inserted automatically by compiler
                for yi in range(8):
                    for xi in range(8):
                        CL[yi][xi] += AS[yi] * BS[xi]
            for yi in range(8):
                for xi in range(8):
                    C[yo*8+yi][xo*8+xi] = CL[yi][xi]

Figure 7 demonstrates the impact of this optimization. We introduce the concept of memory scopes to the schedule space so that a compute stage (AS and BS in the code) can be marked as shared. Without explicit memory scopes, automatic scope inference will mark compute stages as thread-local. The shared task must compute the dependencies of all working threads in the group. Additionally, memory synchronization barriers must be properly inserted to guarantee that shared loaded data is visible to consumers. Finally, in addition to being useful to GPUs, memory scopes let us tag special memory buffers and create special lowering rules when targeting specialized DL accelerators.

4.3 Tensorization

DL workloads have high arithmetic intensity, which can typically be decomposed into tensor operators like matrix-matrix multiplication or 1D convolution. These natural decompositions have led to the recent trend of adding tensor compute primitives [1, 12, 21]. These new primitives create both opportunities and challenges for schedule-based compilation; while using them can improve performance, the compilation framework must seamlessly integrate them. We dub this tensorization: it is analogous to vectorization for SIMD architectures but has significant differences. Instruction inputs are multi-dimensional, with fixed or variable lengths, and each has different data layouts. More importantly, we cannot support a fixed set of primitives since new accelerators are emerging with their own variations of tensor instructions. We therefore need an extensible solution.

We make tensorization extensible by separating the target hardware intrinsic from the schedule with a mechanism for tensor-intrinsic declaration. We use the same tensor expression language to declare both the behavior of each new hardware intrinsic and the lowering rule associated with it. The following code shows how to declare an 8 × 8 tensor hardware intrinsic:

    # declare behavior
    w, x = t.placeholder((8, 8)), t.placeholder((8, 8))
    k = t.reduce_axis((0, 8))
    y = t.compute((8, 8), lambda i, j:
        t.sum(w[i, k] * x[j, k], axis=k))

    # lowering rule to generate hardware intrinsics to carry out the computation
    def gemm_intrin_lower(inputs, outputs):
        ww_ptr = inputs[0].access_ptr("r")
        xx_ptr = inputs[1].access_ptr("r")
        zz_ptr = outputs[0].access_ptr("w")
        compute = t.hardware_intrin("gemm8x8", ww_ptr, xx_ptr, zz_ptr)
        reset = t.hardware_intrin("fill_zero", zz_ptr)
        update = t.hardware_intrin("fuse_gemm8x8_add", ww_ptr, xx_ptr, zz_ptr)
        return compute, reset, update

    gemm8x8 = t.decl_tensor_intrin(y.op, gemm_intrin_lower)

Additionally, we introduce a tensorize schedule primitive to replace a unit of computation with the corresponding intrinsics. The compiler matches the computation pattern with a hardware declaration and lowers it to the corresponding hardware intrinsic.

Tensorization decouples the schedule from specific hardware primitives, making it easy to extend TVM to support new hardware architectures. The generated code of tensorized schedules aligns with practices in high-performance computing: break complex operations into a sequence of micro-kernel calls. We can also use the tensorize primitive to take advantage of handcrafted micro-kernels, which can be beneficial in some platforms. For example, we implement ultra low precision operators for mobile CPUs that operate on data types that are one- or two-bits wide by leveraging a bit-serial matrix vector multiplication micro-kernel. This micro-kernel accumulates results into progressively larger data types to minimize the memory footprint. Presenting the micro-kernel as a tensor intrinsic to TVM yields up to a 1.5× speedup over the non-tensorized version.
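As a usage sketch only (the schedule-construction call and axis handling follow the style of the open-source TVM API and are assumptions, not code from the paper), the declared intrinsic can be applied with the tensorize primitive:

    # Hedged sketch: pattern-match the 8x8x8 computation of y and lower it to gemm8x8.
    s = t.create_schedule(y.op)
    i, j = y.op.axis
    s[y].tensorize(i, gemm8x8)   # everything at and below axis i is replaced by the intrinsic

Tensorizing at the outermost axis here replaces the whole 8 × 8 computation; on a larger operator, the loop nest would first be tiled so that an inner 8 × 8 × 8 block matches the declared pattern.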
4.4 Explicit Memory Latency Hiding

Latency hiding refers to the process of overlapping memory operations with computation to maximize utilization of memory and compute resources. It requires different strategies depending on the target hardware back-end. On CPUs, memory latency hiding is achieved implicitly with simultaneous multithreading [14] or hardware prefetching [10, 20]. GPUs rely on rapid context switching of many warps of threads [44]. In contrast, specialized DL accelerators such as the TPU [21] usually favor leaner control with a decoupled access-execute (DAE) architecture [35] and offload the problem of fine-grained synchronization to software.

Figure 9 shows a DAE hardware pipeline that reduces runtime latency. Compared to a monolithic hardware design, the pipeline can hide most memory access overheads and almost fully utilize compute resources. To achieve higher utilization, the instruction stream must be augmented with fine-grained synchronization operations. Without them, dependencies cannot be enforced, leading to erroneous execution. Consequently, DAE hardware pipelines require fine-grained dependence enqueuing/dequeuing operations between the pipeline stages to guarantee correct execution, as shown in Figure 9's instruction stream.
    # Input: High-level Threaded Program
    for vthread tx in range(2):
        acc_buffer CL[8]
        inp_buffer AL[8]
        for k in range(128):
            ld.dma_copy2d(AL, AL[k][tx*8:tx*8+8])
            ex.accumulate(AL, CL)

    # Inject Synchronization Instructions
    for vthread tx in range(2):
        acc_buffer CL[8]
        inp_buffer AL[8]
        ex.push_dep_to(ld)                      # push WAR dependence
        for k in range(128):
            ld.pop_dep_from(ex)                 # pop WAR dependence
            ld.dma_copy2d(AL, AL[k][tx*8:tx*8+8])
            ld.push_dep_to(ex)                  # push RAW dependence
            ex.pop_dep_from(ld)                 # pop RAW dependence
            ex.accumulate(AL, CL)
            ex.push_dep_to(ld)
        ld.pop_dep_from(ex)

    # Final Single Instruction Stream
    acc_buffer CL[2][8]
    inp_buffer AL[2][8]
    ex.push_dep_to(ld)
    ex.push_dep_to(ld)
    for k in range(128):
        ld.pop_dep_from(ex)
        ld.dma_copy2d(AL[0], AL[k][0:8])
        ld.push_dep_to(ex)
        ld.pop_dep_from(ex)
        ld.dma_copy2d(AL[1], AL[k][8:16])
        ld.push_dep_to(ex)
        ex.pop_dep_from(ld)
        ex.accumulate(AL[0], CL[0])
        ex.push_dep_to(ld)
        ex.pop_dep_from(ld)
        ex.accumulate(AL[1], CL[1])
        ex.push_dep_to(ld)
    ld.pop_dep_from(ex)
    ld.pop_dep_from(ex)
Figure 8: TVM virtual thread lowering transforms a virtual thread-parallel program to a single instruction stream; the
stream contains explicit low-level synchronizations that the hardware can interpret to recover the pipeline parallelism
required to hide memory access latency.
Programming DAE accelerators that require explicit low-level synchronization is difficult. To reduce the programming burden, we introduce a virtual threading scheduling primitive that lets programmers specify a high-level data parallel program as they would a hardware back-end with support for multithreading. TVM then automatically lowers the program to a single instruction stream with low-level explicit synchronization, as shown in Figure 8. The algorithm starts with a high-level multi-threaded program schedule and then inserts the necessary low-level synchronization operations to guarantee correct execution within each thread. Next, it interleaves operations of all virtual threads into a single instruction stream. Finally, the hardware recovers the pipeline parallelism dictated by the low-level synchronizations in the instruction stream.

Hardware Evaluation of Latency Hiding. We now demonstrate the effectiveness of latency hiding on a custom FPGA-based accelerator design, which we describe in depth in subsection 6.4. We ran each layer of ResNet on the accelerator and used TVM to generate two schedules: one with latency hiding, and one without. The schedule with latency hiding parallelized the program with virtual threads to expose pipeline parallelism and therefore hide memory access latency. Results are shown in Figure 10 as a roofline diagram [47]; roofline performance diagrams provide insight into how well a given system uses computation and memory resources for different benchmarks. Overall, latency hiding improved performance on all ResNet layers. Peak compute utilization increased from 70% with no latency hiding to 88% with latency hiding.
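For reference, the roofline bound used in such diagrams [47] is simply the minimum of the compute peak and the bandwidth-limited rate. The one-line sketch below states it with illustrative numbers; the 102.4 GOPS figure is the VDLA peak from subsection 6.4, while the bandwidth value is an assumption, not a measurement from the paper.

    # Roofline model: attainable throughput for a kernel with a given arithmetic
    # intensity (operations performed per byte moved from DRAM).
    def roofline_gops(peak_gops, peak_gb_per_s, ops_per_byte):
        return min(peak_gops, peak_gb_per_s * ops_per_byte)

    print(roofline_gops(102.4, 4.0, 16))   # -> 64.0: memory-bound at this intensity
    print(roofline_gops(102.4, 4.0, 64))   # -> 102.4: compute-bound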
5 Automating Optimization

Given the rich set of schedule primitives, our remaining problem is to find optimal operator implementations for each layer of a DL model. Here, TVM creates a specialized operator for the specific input shape and layout associated with each layer. Such specialization offers significant performance benefits (in contrast to handcrafted code that would target a smaller diversity of shapes and layouts), but it also raises automation challenges. The system needs to choose the schedule optimizations – such as modifying the loop order or optimizing for the memory hierarchy – as well as schedule-specific parameters, such as the tiling size and the loop unrolling factor. Such combinatorial choices create a large search space of operator implementations for each hardware back-end. To address this challenge, we built an automated schedule optimizer with two main components: a schedule explorer that proposes promising new configurations, and a machine learning cost model that predicts the performance of a given configuration. This section describes these components and TVM's automated optimization flow (Figure 11).

Figure 11: Overview of automated optimization framework. A schedule explorer examines the schedule space using an ML-based cost model and chooses experiments to run on a distributed device cluster via RPC. To improve its predictive power, the ML model is updated periodically using collected data recorded in a database.

5.1 Schedule Space Specification

We built a schedule template specification API to let a developer declare knobs in the schedule space. The template specification allows incorporation of a developer's domain-specific knowledge, as necessary, when specifying possible schedules. We also created a generic master template for each hardware back-end that automatically extracts possible knobs based on the computation description expressed using the tensor expression language. At a high level, we would like to consider as many configurations as possible and let the optimizer manage the selection burden. Consequently, the optimizer must search over billions of possible configurations for the real world DL workloads used in our experiments.
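The sketch below illustrates the idea of a template: a set of declared knobs and the configuration space they induce. The knob names and value lists are made up for illustration and are not taken from the paper.

    import itertools

    # Hedged sketch: knobs declared by a template for a tiled matrix multiplication.
    KNOBS = {
        "tile_y":   [1, 4, 8, 16, 32],
        "tile_x":   [1, 4, 8, 16, 32],
        "order":    ["yxk", "ykx"],
        "unroll_k": [0, 1],
    }

    def configurations(knobs):
        """Yield every point in the schedule space defined by the declared knobs."""
        names = list(knobs)
        for values in itertools.product(*(knobs[n] for n in names)):
            yield dict(zip(names, values))

    space = list(configurations(KNOBS))
    print(len(space))   # 5 * 5 * 2 * 2 = 100 candidate schedules for this small template
    print(space[0])     # {'tile_y': 1, 'tile_x': 1, 'order': 'yxk', 'unroll_k': 0}

Real operator templates declare many more knobs, which is what pushes the overall space into the billions of configurations mentioned above.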
5.2 ML-Based Cost Model

One way to find the best schedule from a large configuration space is through blackbox optimization, i.e., auto-tuning. This method is used to tune high performance computing libraries [15, 46]. However, auto-tuning requires many experiments to identify a good configuration.

An alternate approach is to build a predefined cost model to guide the search for a particular hardware back-end instead of running all possibilities and measuring their performance. Ideally, a perfect cost model considers all factors affecting performance: memory access patterns, data reuse, pipeline dependencies, and threading patterns, among others. This approach, unfortunately, is burdensome due to the increasing complexity of modern hardware. Furthermore, every new hardware target requires a new (predefined) cost model.

We instead take a statistical approach to solve the cost modeling problem. In this approach, a schedule explorer proposes configurations that may improve an operator's performance. For each schedule configuration, we use an ML model that takes the lowered loop program as input and predicts its running time on a given hardware back-end. The model, trained using runtime measurement data collected during exploration, does not require the user to input detailed hardware information. We update the model periodically as we explore more configurations during optimization, which improves accuracy for other related workloads, as well. In this way, the quality of the ML model improves with more experimental trials. Table 1 summarizes the key differences between automation methods. ML-based cost models strike a balance between auto-tuning and predefined cost modeling and can benefit from the historical performance data of related workloads.

    Method Category          Data Cost   Model Bias   Need Hardware Info   Learn from History
    Blackbox auto-tuning     high        none         no                   no
    Predefined cost model    none        high         yes                  no
    ML based cost model      low         low          no                   yes

Table 1: Comparison of automation methods. Model bias refers to inaccuracy due to modeling.

Machine Learning Model Design Choices. We must consider two key factors when choosing which ML model the schedule explorer will use: quality and speed. The schedule explorer queries the cost model frequently, which incurs overheads due to model prediction time and model refitting time. To be useful, these overheads must be smaller than the time it takes to measure performance.
The features extracted from the lowered loop program include the memory access count and reuse ratio of each memory buffer at each loop level, as well as a one-hot encoding of loop annotations such as "vectorize" and "unroll".
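Putting the pieces together, the exploration loop alternates between model-guided proposal, on-device measurement, and model refitting. The sketch below is a schematic of that loop under stated assumptions: the feature extraction and measurement functions are placeholders, `space` could be the configuration list from the earlier template sketch, and a gradient-boosted regressor (e.g., scikit-learn's GradientBoostingRegressor) is one reasonable choice of `model` rather than a detail taken from this text.

    import random

    # Hedged sketch of the automated optimization loop: an explorer proposes schedule
    # configurations, an ML cost model ranks them, the best candidates are measured on
    # the target device, and the model is refit on the growing measurement database.
    def explore(space, extract_features, measure_on_device, model, rounds=10, batch=8):
        database = []                                  # (features, measured time) records
        for _ in range(rounds):
            candidates = random.sample(space, min(64, len(space)))
            if database:                               # rank candidates by predicted time
                candidates.sort(key=lambda c: model.predict([extract_features(c)])[0])
            for cfg in candidates[:batch]:             # run the most promising ones remotely
                database.append((extract_features(cfg), measure_on_device(cfg)))
            X, y = zip(*database)
            model.fit(list(X), list(y))                # periodic refit improves the model
        return min(database, key=lambda rec: rec[1])   # best measured configuration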
    Name  Operator  H, W      IC, OC     K, S
    C1    conv2d    224, 224  3, 64      7, 2
    C2    conv2d    56, 56    64, 64     3, 1
    C3    conv2d    56, 56    64, 64     1, 1
    C4    conv2d    56, 56    64, 128    3, 2
    C5    conv2d    56, 56    64, 128    1, 2
    C6    conv2d    28, 28    128, 128   3, 1
    C7    conv2d    28, 28    128, 256   3, 2
    C8    conv2d    28, 28    128, 256   1, 2
    C9    conv2d    14, 14    256, 256   3, 1
    C10   conv2d    14, 14    256, 512   3, 2
    C11   conv2d    14, 14    256, 512   1, 2
    C12   conv2d    7, 7      512, 512   3, 1

    Name  Operator           H, W      IC    K, S
    D1    depthwise conv2d   112, 112  32    3, 1
    D2    depthwise conv2d   112, 112  64    3, 2
    D3    depthwise conv2d   56, 56    128   3, 1
    D4    depthwise conv2d   56, 56    128   3, 2
    D5    depthwise conv2d   28, 28    256   3, 1
    D6    depthwise conv2d   28, 28    256   3, 2
    D7    depthwise conv2d   14, 14    512   3, 1
    D8    depthwise conv2d   14, 14    512   3, 2
    D9    depthwise conv2d   7, 7      1024  3, 1

Table 2: Configurations of all conv2d operators in ResNet-18 and all depthwise conv2d operators in MobileNet used in the single kernel experiments. H/W denotes height and width, IC input channels, OC output channels, K kernel size, and S stride size. All ops use "SAME" padding. All depthwise conv2d operations have channel multipliers of 1.

Figure 14: GPU end-to-end evaluation for TVM, MXNet, Tensorflow, and Tensorflow XLA. Tested on the NVIDIA Titan X.

function remotely, and access results in the same script on the host. TVM's RPC supports dynamic upload and runs cross-compiled modules and functions that use its runtime convention. As a result, the same infrastructure can perform a single workload optimization and end-to-end graph inference. Our approach automates the compile, run, and profile steps across multiple devices. This infrastructure is especially critical for embedded devices, which traditionally require tedious manual effort for cross-compilation, code deployment, and measurement.

works (which rely on heavily optimized libraries) on each back-end?

• Can TVM support new, emerging DL workloads (e.g., depthwise convolution, low precision operations)?

• Can TVM support and optimize for new specialized accelerators?

To answer these questions, we evaluated TVM on four types of platforms: (1) a server-class GPU, (2) an embedded GPU, (3) an embedded CPU, and (4) a DL accelerator implemented on a low-power FPGA SoC. The benchmarks are based on real world DL inference workloads, including ResNet [16], MobileNet [19], the LSTM Language Model [48], the Deep Q Network (DQN) [28] and Deep Convolutional Generative Adversarial Networks (DCGAN) [31]. We compare our approach to existing DL frameworks, including MXNet [9] and TensorFlow [2], that rely on highly engineered, vendor-specific libraries. TVM performs end-to-end automatic optimization and code generation without the need for an external operator library.
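The remote workflow sketched above (cross-compile on the host, upload, run, and time on the device) looks roughly like the following. The call names follow the open-source TVM RPC module (rpc.connect, upload, load_module, time_evaluator) and are given here as an assumption for illustration, not as code from the paper; the address, module name, and function name are hypothetical.

    # Hedged sketch: host-side script driving a remote embedded device over RPC.
    from tvm import rpc

    remote = rpc.connect("192.168.1.42", 9090)         # device registered with the tracker
    remote.upload("resnet_arm.tar")                     # cross-compiled module built on the host
    rlib = remote.load_module("resnet_arm.tar")
    ctx = remote.cpu(0)
    timer = rlib.time_evaluator("fused_conv2d", ctx, number=10)
    # a_remote, w_remote, out_remote are ndarrays previously allocated on the device
    print(timer(a_remote, w_remote, out_remote).mean)   # profile result returned to the host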
[Figure 15: relative speedup of TVM and baselines (cuDNN, TensorComprehensions, MX Kernel, TVM, TVM PT) on the conv2d operators C1–C12 and depthwise conv2d operators D1–D9.]

Figure 16: ARM A53 end-to-end evaluation of TVM and TFLite.
Figure 18: Relative speedup of single- and multi-threaded low-precision conv2d operators in ResNet. Baseline was a single-threaded, hand-optimized implementation from Caffe2 (commit: 39e07f7). C5, C3 are 1x1 convolutions that have less compute intensity, resulting in less speedup by multi-threading.

Figure 19: End-to-end experiment results on Mali-T860MP4. Two data types, float32 and float16, were evaluated.
[39]. We implemented an ARM-specific tensorization intrinsic that leverages ARM instructions to build an efficient, low-precision matrix-vector microkernel. We then used TVM's automated optimizer to explore the scheduling space.

Figure 18 compares TVM to the Caffe2 ultra low-precision library on ResNet for 2-bit activations, 1-bit weights inference. Since the baseline is single threaded, we also compare it to a single-threaded TVM version. Single-threaded TVM outperforms the baseline, particularly for C5, C8, and C11 layers; these are convolution layers of kernel size 1×1 and stride of 2 for which the ultra low-precision baseline library is not optimized. Furthermore, we take advantage of additional TVM capabilities to produce a parallel library implementation that shows improvement over the baseline. In addition to the 2-bit+1-bit configuration, TVM can generate and optimize for other precision configurations that are unsupported by the baseline library, offering improved flexibility.
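To illustrate the bit-serial idea behind that micro-kernel, the plain-Python model below computes a dot product plane by plane; it is not the ARM-intrinsic implementation, and it assumes unsigned {0,1} weights and 2-bit activations.

    # Bit-serial dot product: pack each bit-plane of the operands into integers and
    # combine popcounts of ANDed planes, weighting each activation plane by its bit value.
    def pack_bits(values, bit):
        """Pack the given bit of each value into one integer (element i -> bit i)."""
        word = 0
        for i, v in enumerate(values):
            word |= ((v >> bit) & 1) << i
        return word

    def bitserial_dot(weights, activations):
        """weights in {0,1}, activations in {0,1,2,3} (2-bit)."""
        w_plane = pack_bits(weights, 0)
        a0 = pack_bits(activations, 0)          # least significant activation plane
        a1 = pack_bits(activations, 1)          # most significant activation plane
        return bin(w_plane & a0).count("1") + 2 * bin(w_plane & a1).count("1")

    w = [1, 0, 1, 1]
    a = [3, 2, 1, 0]
    assert bitserial_dot(w, a) == sum(wi * ai for wi, ai in zip(w, a))   # both give 4

Accumulating the per-plane popcounts into a progressively wider integer type is what lets the kernel keep its memory footprint small, as described in subsection 4.3.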
6.3 Embedded GPU Evaluation

For our mobile GPU experiments, we ran our end-to-end pipeline on a Firefly-RK3399 board equipped with an ARM Mali-T860MP4 GPU. The baseline was a vendor-provided library, the ARM Compute Library (v18.03). As shown in Figure 19, we outperformed the baseline on three available models for both float16 and float32 (DCGAN and LSTM are not yet supported by the baseline). The speedup ranged from 1.2× to 1.6×.

6.4 FPGA Accelerator Evaluation

Vanilla Deep Learning Accelerator. We now relate how TVM tackled accelerator-specific code generation on a generic inference accelerator design we prototyped on an FPGA. We used in this evaluation the Vanilla Deep Learning Accelerator (VDLA) – which distills characteristics from previous accelerator proposals [12, 21, 27] into a minimalist hardware architecture – to demonstrate TVM's ability to generate highly efficient schedules that can target specialized accelerators. Figure 20 shows the high-level hardware organization of the VDLA architecture. VDLA is programmed as a tensor processor to efficiently execute operations with high compute intensity (e.g., matrix multiplication, high dimensional convolution). It can perform load/store operations to bring blocked 3-dimensional tensors from DRAM into a contiguous region of SRAM. It also provides specialized on-chip memories for network parameters, layer inputs (narrow data type), and layer outputs (wide data type). Finally, VDLA provides explicit synchronization control over successive loads, computes, and stores to maximize the overlap between memory and compute operations.

Methodology. We implemented the VDLA design on a low-power PYNQ board that incorporates an ARM Cortex A9 dual core CPU clocked at 667MHz and an Artix-7 based FPGA fabric. On these modest FPGA resources, we implemented a 16 × 16 matrix-vector unit clocked at 200MHz that performs products of 8-bit values and accumulates them into a 32-bit register every cycle. The theoretical peak throughput of this VDLA design is about 102.4GOPS/s. We allocated 32kB of resources for activation storage, 32kB for parameter storage, 32kB for microcode buffers, and 128kB for the register file. These on-chip buffers are by no means large enough to provide sufficient on-chip storage for a single layer of ResNet and therefore enable a case study on effective memory reuse and latency hiding.

We built a driver library for VDLA with a C runtime API that constructs instructions and pushes them to the target accelerator for execution. Our code generation algorithm then translates the accelerator program to a series of calls into the runtime API. Adding the specialized accelerator back-end took ∼2k LoC in Python.
[Figure 20: VDLA hardware organization, including a DRAM controller, a GEMM core, and EXE→LOAD / STORE→EXE dependence queues.]
Despite the emerging popularity of accelerators for deep learning [11, 21], it remains unclear how a compilation stack can be built to effectively target these devices. The VDLA design used in our evaluation provides a generic way to summarize the properties of TPU-like accelerators and enables a concrete case study on how to compile code for accelerators. Our approach could potentially benefit existing systems that compile deep learning to FPGA [34, 40], as well. This paper provides a generic solution to effectively target accelerators via tensorization and compiler-driven latency hiding.

8 Conclusion

We proposed an end-to-end compilation stack to solve fundamental optimization challenges for deep learning across a diverse set of hardware back-ends. Our system includes automated end-to-end optimization, which is historically a labor-intensive and highly specialized task. We hope this work will encourage additional studies of end-to-end compilation approaches and open new opportunities for DL system software-hardware co-design techniques.

Acknowledgement

We would like to thank Ras Bodik, James Bornholt, Xi Wang, Tom Anderson and Qiao Zhang for their thorough feedback on earlier versions of this paper. We would also like to thank members of the Sampa, SAMPL and Systems groups at the Allen School for their feedback on the work and manuscript. We would like to thank the anonymous OSDI reviewers, and our shepherd, Ranjita Bhagwan, for helpful feedback. This work was supported in part by a Google PhD Fellowship for Tianqi Chen, ONR award #N00014-16-1-2795, NSF under grants CCF-1518703, CNS-1614717, and CCF-1723352, and gifts from Intel (under the CAPA program), Oracle, Huawei and anonymous sources.

References

[1] NVIDIA Tesla V100 GPU Architecture: The World's Most Advanced Data Center GPU, 2017.

[2] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[3] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016), pp. 265–283.

[4] Agarwal, A., Akchurin, E., Basoglu, C., Chen, G., Cyphers, S., Droppo, J., Eversole, A., Guenter, B., Hillebrand, M., Hoens, R., Huang, X., Huang, Z., Ivanov, V., Kamenev, A., Kranen, P., Kuchaiev, O., Manousek, W., May, A., Mitra, B., Nano, O., Navarro, G., Orlov, A., Padmilac, M., Parthasarathi, H., Peng, B., Reznichenko, A., Seide, F., Seltzer, M. L., Slaney, M., Stolcke, A., Wang, Y., Wang, H., Yao, K., Yu, D., Zhang, Y., and Zweig, G. An introduction to computational networks and the computational network toolkit. Tech. Rep. MSR-TR-2014-112, August 2014.

[5] Ansel, J., Kamil, S., Veeramachaneni, K., Ragan-Kelley, J., Bosboom, J., O'Reilly, U.-M., and Amarasinghe, S. Opentuner: An extensible framework for program autotuning. In International Conference on Parallel Architectures and Compilation Techniques (Edmonton, Canada, August 2014).

[6] Baghdadi, R., Beaugnon, U., Cohen, A., Grosser, T., Kruse, M., Reddy, C., Verdoolaege, S., Betts, A., Donaldson, A. F., Ketema, J., Absar, J., Haastregt, S. V., Kravets, A., Lokhmotov, A., David, R., and Hajiyev, E. Pencil: A platform-neutral compute intermediate language for accelerator programming. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT) (Washington, DC, USA, 2015), PACT '15, IEEE Computer Society, pp. 138–149.

[7] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

[8] Chen, T., and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2016), KDD '16, ACM, pp. 785–794.

[9] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems (LearningSys'15) (2015).

[10] Chen, T.-F., and Baer, J.-L. Effective hardware-based data prefetching for high-performance processors. IEEE Transactions on Computers 44, 5 (May 1995), 609–623.

[11] Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., and Temam, O. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (Washington, DC, USA, 2014), MICRO-47, IEEE Computer Society, pp. 609–622.

[12] Chen, Y.-H., Emer, J., and Sze, V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture (Piscataway, NJ, USA, 2016), ISCA '16, IEEE Press, pp. 367–379.

[13] Courbariaux, M., Bengio, Y., and David, J. Binaryconnect: Training deep neural networks with binary weights during propagations. CoRR abs/1511.00363 (2015).
[14] Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., Stamm, R. L., and Tullsen, D. M. Simultaneous multithreading: a platform for next-generation processors. IEEE Micro 17, 5 (Sept 1997), 12–19.

[15] Frigo, M., and Johnson, S. G. Fftw: an adaptive software architecture for the fft. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on (May 1998), vol. 3, pp. 1381–1384 vol. 3.

[16] He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027 (2016).

[17] Hegarty, J., Brunhaver, J., DeVito, Z., Ragan-Kelley, J., Cohen, N., Bell, S., Vasilyev, A., Horowitz, M., and Hanrahan, P. Darkroom: Compiling high-level image processing code into hardware pipelines. ACM Trans. Graph. 33, 4 (July 2014), 144:1–144:11.

[18] Henriksen, T., Serup, N. G. W., Elsman, M., Henglein, F., and Oancea, C. E. Futhark: Purely functional gpu-programming with nested parallelism and in-place array updates. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2017), PLDI 2017, ACM, pp. 556–571.

[19] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017).

[20] Jouppi, N. P. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture (May 1990), pp. 364–373.

[21] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P.-L., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson, G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang, W., Wilcox, E., and Yoon, D. H. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (New York, NY, USA, 2017), ISCA '17, ACM, pp. 1–12.

[22] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. Optimization by simulated annealing. Science 220, 4598 (1983), 671–680.

[23] Kjolstad, F., Kamil, S., Chou, S., Lugato, D., and Amarasinghe, S. The tensor algebra compiler. Proc. ACM Program. Lang. 1, OOPSLA (Oct. 2017), 77:1–77:29.

[24] Klöckner, A. Loo.py: transformation-based code generation for GPUs and CPUs. In Proceedings of ARRAY '14: ACM SIGPLAN Workshop on Libraries, Languages, and Compilers for Array Programming (Edinburgh, Scotland, 2014), Association for Computing Machinery.

[25] Lavin, A., and Gray, S. Fast algorithms for convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016 (2016), pp. 4013–4021.

[26] Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Efficient hyperparameter optimization and infinitely many armed bandits. CoRR abs/1603.06560 (2016).

[27] Liu, D., Chen, T., Liu, S., Zhou, J., Zhou, S., Teman, O., Feng, X., Zhou, X., and Chen, Y. Pudiannao: A polyvalent machine learning accelerator. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2015), ASPLOS '15, ACM, pp. 369–381.

[28] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.

[29] Mullapudi, R. T., Adams, A., Sharlet, D., Ragan-Kelley, J., and Fatahalian, K. Automatically scheduling halide image processing pipelines. ACM Trans. Graph. 35, 4 (July 2016), 83:1–83:11.

[30] Palkar, S., Thomas, J. J., Narayanan, D., Shanbhag, A., Palamuttam, R., Pirk, H., Schwarzkopf, M., Amarasinghe, S. P., Madden, S., and Zaharia, M. Weld: Rethinking the interface between data-intensive applications. CoRR abs/1709.06416 (2017).

[31] Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).

[32] Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2013), PLDI '13, ACM, pp. 519–530.

[33] Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision (2016), Springer, pp. 525–542.

[34] Sharma, H., Park, J., Mahajan, D., Amaro, E., Kim, J. K., Shao, C., Mishra, A., and Esmaeilzadeh, H. From high-level deep neural models to fpgas. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on (2016), IEEE, pp. 1–12.

[35] Smith, J. E. Decoupled access/execute computer architectures. In Proceedings of the 9th Annual Symposium on Computer Architecture (Los Alamitos, CA, USA, 1982), ISCA '82, IEEE Computer Society Press, pp. 112–119.

[36] Steuwer, M., Remmelg, T., and Dubach, C. Lift: A functional data-parallel ir for high-performance gpu code generation. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (Piscataway, NJ, USA, 2017), CGO '17, IEEE Press, pp. 74–85.

[37] Sujeeth, A. K., Lee, H., Brown, K. J., Chafi, H., Wu, M., Atreya, A. R., Olukotun, K., Rompf, T., and Odersky, M. Optiml: An implicitly parallel domain-specific language for machine learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning (USA, 2011), ICML'11, pp. 609–616.

[38] Tai, K. S., Socher, R., and Manning, C. D. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015).
[39] Tulloch, A., and Jia, Y. High performance ultra-low-precision convolutions on mobile devices. arXiv preprint arXiv:1712.02427 (2017).

[40] Umuroglu, Y., Fraser, N. J., Gambardella, G., Blott, M., Leong, P. H. W., Jahre, M., and Vissers, K. A. FINN: A framework for fast, scalable binarized neural network inference. CoRR abs/1612.07119 (2016).

[41] Vasilache, N. Personal communication.

[42] Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. CoRR abs/1802.04730 (2018).

[43] Verdoolaege, S., Carlos Juega, J., Cohen, A., Ignacio Gómez, J., Tenllado, C., and Catthoor, F. Polyhedral parallel code generation for cuda. ACM Trans. Archit. Code Optim. 9, 4 (Jan. 2013), 54:1–54:23.

[44] Volkov, V. Understanding Latency Hiding on GPUs. PhD thesis, University of California at Berkeley, 2016.

[45] Wei, R., Adve, V., and Schwartz, L. Dlvm: A modern compiler infrastructure for deep learning systems. CoRR abs/1711.03016 (2017).

[46] Whaley, R. C., and Dongarra, J. J. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (Washington, DC, USA, 1998), SC '98, IEEE Computer Society, pp. 1–27.

[47] Williams, S., Waterman, A., and Patterson, D. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (Apr. 2009), 65–76.

[48] Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).