
arXiv:1903.01855v1 [cs.PL] 27 Feb 2019

TENSORFLOW EAGER: A MULTI-STAGE, PYTHON-EMBEDDED DSL FOR MACHINE LEARNING

Akshay Agrawal 1  Akshay Naresh Modi 1  Alexandre Passos 1  Allen Lavoie 1  Ashish Agarwal 1  Asim Shankar 1
Igor Ganichev 1  Josh Levenberg 1  Mingsheng Hong 1  Rajat Monga 1  Shanqing Cai 1

ABSTRACT

TensorFlow Eager is a multi-stage, Python-embedded domain-specific language for hardware-accelerated machine learning, suitable for both interactive research and production. TensorFlow, which TensorFlow Eager extends, requires users to represent computations as dataflow graphs; this permits compiler optimizations and simplifies deployment but hinders rapid prototyping and run-time dynamism. TensorFlow Eager eliminates these usability costs without sacrificing the benefits furnished by graphs: It provides an imperative front-end to TensorFlow that executes operations immediately and a JIT tracer that translates Python functions composed of TensorFlow operations into executable dataflow graphs. TensorFlow Eager thus offers a multi-stage programming model that makes it easy to interpolate between imperative and staged execution in a single package.

1 Authors listed in alphabetical order. Google Brain, Mountain View, CA, USA. Correspondence to: Akshay Agrawal <[email protected]>, Alexandre Passos <[email protected]>.

Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).

1 INTRODUCTION

Many contemporary libraries for machine learning share a similar structure: they provide suites of primitive operations and functions to automatically differentiate compositions thereof (see, e.g., Bergstra et al., 2010; Tokui et al., 2015; Maclaurin et al., 2015; Chen et al., 2015; Abadi et al., 2016; Paszke et al., 2017; The Gluon Team, 2017; Neubig et al., 2017; Innes, 2018; Frostig et al., 2018). These software packages in fact more closely resemble domain-specific languages (DSLs) than libraries (Innes et al., 2018). Indeed, models written using automatic differentiation software are often referred to as differentiable programs.

DSLs for differentiable programming are usually embedded in a host language (for a reference on embedded DSLs, see Hudak, 1996), and they can be roughly classified as either imperative or declarative, in the programming languages sense. Programming in an imperative DSL for differentiable programming is like programming in an imperative programming language such as Python: the construction and execution of primitive operations are inextricably tied, with each operation returning concrete numerical data. While imperative DSLs provide a natural programming paradigm, when embedded in an interpreted language like Python—which is the case for popular DSLs like Chainer (Tokui et al., 2015) and PyTorch (Paszke et al., 2017)—performance is bottlenecked on the interpreter and serialization of models is difficult. To address these problems, declarative DSLs separate the specification of models from their execution. These “define-before-run” libraries require users to stage their models as dataflow graphs, permitting compiler optimizations and the exploitation of parallelism, and simplifying deployment, distribution, and code generation (see, e.g., Bergstra et al., 2010; Abadi et al., 2016). But, because declarative DSLs prevent users from using arbitrary host-language constructs, they have steep learning curves and are not suitable for expressing models with data-dependent structures.

An ideal DSL would offer the flexibility and accessibility of imperative execution along with the many benefits of declarative programming, without either of their costs. It is with this motivation in mind that we present TensorFlow Eager, a Python-embedded DSL for differentiable programming that lets developers interpolate between imperative and staged computations in a single package. TensorFlow Eager offers a multi-stage programming model that lets users rapidly prototype programs and selectively stage parts that they wish to accelerate or serialize. It is implemented as an opt-in extension to TensorFlow, and it can be enabled by calling a single TensorFlow library function at program start-up.

To empower machine learning practitioners and researchers to be productive from the start, TensorFlow Eager executes imperatively by default. To reap the benefits of dataflow graphs, TensorFlow Eager provides a Python decorator that traces its Python function in a graph-building context, staging primitive operations to construct a dataflow graph with named inputs and outputs and returning an executable graph function. While invoking a graph function is syntactically equivalent to calling the Python function from which it was generated, the execution of graph functions bypasses Python: they are executed using a C++ dataflow runtime or are compiled to generate optimized code for CPUs, GPUs, and ASICs. Graph functions and imperative code share a lexical environment, making it simple to invoke graph functions from imperative code, create graph functions that close over imperatively constructed data, and embed imperative code in graph functions via unstaging annotations.

Our contributions are two-fold:

• Our implementation is elegant. TensorFlow Eager can be viewed as a multi-stage front-end to TensorFlow. Imperative and staged TensorFlow Eager code share a single set of primitive operations, kernels, and user-visible APIs. Not only does this sharing result in an easy-to-maintain implementation, it also lets us present a single, coherent API surface to our users that is agnostic to execution mode and lets users enjoy the rich ecosystem of tools developed for TensorFlow.

• While we are not the first in the differentiable programming community to recognize the value in bridging imperative and declarative programming, we are among the first to present this line of work in the context of multi-stage programming. This contextualization is a contribution insofar as it clarifies discourse and connects two otherwise separate communities.

The remainder of this paper is structured as follows: §2 surveys related work; §3 puts forth our design principles, which prioritize usability and researcher productivity; §4 presents our multi-stage programming model, with details on automatic differentiation, state, hardware acceleration, distribution, staging, and unstaging; §5 discusses our implementation; and §6 provides a quantitative evaluation of the performance of TensorFlow Eager on machine learning models, demonstrating that imperative TensorFlow Eager can train a ResNet-50 on a single GPU just as quickly as TensorFlow can, that staged TensorFlow Eager can train a ResNet-50 on a TPU much faster than imperative TensorFlow Eager can, and that staging yields significant speedups for models with small operations, all with minimal code changes.

2 RELATED WORK

In TensorFlow Eager, users must manually stage computations, which might require refactoring code (see §4.1). An ideal framework for differentiable programming would automatically stage computations, without programmer intervention. One way to accomplish this is to embed the framework in a compiled procedural language and implement graph extraction and automatic differentiation as compiler rewrites; this is what, e.g., DLVM, Swift for TensorFlow, and Zygote do (Wei et al., 2017; Lattner & the Swift for TensorFlow Team, 2018; Innes, 2019). Python's flexibility makes it difficult for DSLs embedded in it to use such an approach. Some projects, like AutoGraph (Moldovan et al., 2019), do operate on Python abstract syntax trees to rewrite imperative code to code that constructs dataflow graphs, but such techniques are out of the scope of this paper.

An alternative to staging computations as graphs for performance is to implement fused kernels. For example, NVIDIA provides fused cuDNN kernels for popular recurrent neural network operations that are dramatically faster than non-fused implementations (Chetlur et al., 2014). This approach, while useful, is difficult to scale, as it requires substantial programmer intervention.

TensorFlow Eager is not the first Python library to offer a multi-stage programming model. JAX (Frostig et al., 2018), a tracing-JIT compiler that generates code for heterogeneous devices via XLA (The XLA team, 2017), provides a similar programming paradigm; MXNet and Gluon also let users interpolate between imperative and staged computations, but at a level of abstraction that is higher than ours (Chen et al., 2015; The Gluon Team, 2017); and PyTorch is implementing a staging tracer that is similar to ours (PyTorch team, 2018). Outside of differentiable programming, Terra is a Lua-embedded DSL that supports code generation, and the paper in which it was introduced presents a thorough treatment of multi-stage programming that is more formal than ours (DeVito et al., 2013); as another example, OptiML is a Scala-embedded DSL for machine learning with support for staging and code generation but without support for automatic differentiation (Sujeeth et al., 2011). Outside of DSLs, there are several projects that provide just-in-time (JIT) compilation for Python, of which Numba (Lam et al., 2015) and PyPy (Bolz et al., 2009) are two examples.

Multi-stage programming is a well-studied topic in programming languages; a good reference is (Taha, 2004), and a modern design from which we drew inspiration is Scala's lightweight modular staging (Rompf & Odersky, 2010). Multi-stage programming is related to staging transformations in compilers and partial evaluation in programming languages, for which (Jørring & Scherlis, 1986) and (Jones et al., 1993) are classic references, respectively.

3 DESIGN PRINCIPLES

Our design strives to satisfy two goals: TensorFlow Eager should be immediately recognizable to Python programmers—for example, users should feel at home exploring APIs and prototyping models in IPython notebooks—and it should also provide a smooth path to testing ideas at scale and deploying models for inference on heterogeneous devices. The first two of the following three principles are in service of the former goal, while the third is in service of the latter.

Privilege imperative execution. Because Python is an imperative language, TensorFlow Eager operates in an imperative fashion by default; staged execution is opt-in and often unnecessary (see §4.1 and §6 for details).

Seamlessly embed into Python. Whereas writing TensorFlow code is an exercise in metaprogramming, imperative execution lets programmers enjoy the full extent of the host language: programmers write Pythonic code, complete with familiar language constructs like native control flow (e.g., Python if statements and while loops), recursion, arbitrary data structures, and even pdb breakpoints. And, because we implement automatic differentiation via tracing (§4.2), the programmer can differentiate through all these constructs. Host-language integration is more than just syntactic sugar—it greatly simplifies the implementation of data-dependent models like segmental recurrent neural networks and recursive neural networks (Kong et al., 2015; Socher et al., 2011).

Stage imperative code as dataflow graphs. To leverage the benefits of dataflow graphs, TensorFlow Eager provides a mechanism to trace Python functions and stage their operations as graph functions. The staging workflow is detailed in §4.1, and the mechanism is described in §4.6. TensorFlow graphs come with their own set of design principles, which are presented in (Abadi et al., 2016).

4 EXECUTION MODEL

This section presents the pillars of TensorFlow Eager's execution model. §4.1 describes imperative and staged execution, presenting a workflow that hybridizes the two; §4.2 describes our trace-based implementation of automatic differentiation; §4.3 specifies how we represent mutable state and how we support serialization; §4.4 details how TensorFlow Eager supports execution across heterogeneous devices; §4.5 presents mechanisms for distributed execution; §4.6 discusses our tracing JIT in detail; and §4.7 discusses mechanisms for escaping staged computations.

The following terminology will be used in the sequel: a tensor is a multi-dimensional, typed array, an operation is a primitive, possibly stateful function that takes tensors as inputs and produces tensors as outputs, a kernel is a device-specific implementation of an operation, and a model is a composition of primitive operations.

4.1 Multi-stage programming

TensorFlow Eager provides two ways of executing operations: imperatively or as part of a static dataflow graph. Both execution models have access to the same set of operations and kernels, but they differ in how they dispatch kernels.

Imperative execution. By default, TensorFlow Eager executes operations immediately—library functions such as tf.matmul construct operations and then immediately execute their kernels. Under this regime, TensorFlow Eager resembles a NumPy-like library for hardware-accelerated numerical computation and machine learning. Calling .numpy() on a tensor fetches a NumPy array storing the tensor's data, and tensors can be supplied to external libraries like matplotlib that expect NumPy arrays (for a reference on NumPy, see Oliphant, 2015). As an example,

import tensorflow as tf
tf.enable_eager_execution()

def select(vector):
  A = tf.constant([[1.0, 0.0]])
  return tf.matmul(A, vector)

x = tf.constant([[2.0], [-2.0]])
print(select(x))

prints

tf.Tensor(
[[ 2.]], shape=(1, 1), dtype=float32).

Staged execution. While imperative execution simplifies prototyping, the overhead of going back and forth into the Python interpreter limits its performance; representing computations as dataflow graphs before executing them not only removes this bottleneck but also allows for inter-op parallelism and optimizations like constant-folding and buffer reuse. Thus, TensorFlow Eager provides a mechanism to stage computations as dataflow graphs. In particular, we provide a decorator, function, that traces the execution of a Python function, recording all TensorFlow operations and the tensors flowing between them in a dataflow graph. function can be thought of as an opt-in, JIT compiler that generates an optimized polymorphic function for a Python function, creating concrete functions backed by dataflow graphs via a straightforward binding-time analysis at run-time. The analogy to compilers is imperfect because the traces generated by function only record TensorFlow operations and not arbitrary Python code, but it nonetheless provides an approximate mental model. One advantage of this tracing mechanism is that the underlying dataflow graph format does not need to support all the dynamism present in the Python code being traced; as long as the set of operations in the trace does not depend on Python state we can generate a correct trace.
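As a minimal sketch of this staging step (reusing the select function from the listing above, under the @tf.contrib.eager.function spelling of the decorator used in the listings of §4.6), staging is a one-line change:

import tensorflow as tf
tf.enable_eager_execution()

@tf.contrib.eager.function
def select(vector):
  # Traced on the first call; subsequent calls with the same input
  # signature execute the generated graph function, not this Python body.
  A = tf.constant([[1.0, 0.0]])
  return tf.matmul(A, vector)

x = tf.constant([[2.0], [-2.0]])
print(select(x))  # Executes the staged dataflow graph.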

Invoking a callable returned by function will execute a dataflow graph instead of the corresponding Python function. In fact, graph functions are themselves executed by an operation that takes tensors as inputs and a function name as an attribute, and these operations are automatically constructed and executed for the user. For example, if the select function defined in the previous section were decorated with @function, then select(x) would execute an operation that would in turn execute the appropriate graph function. The dataflow graph runtime, which is written in C++, automatically partitions subgraphs across devices and parallelizes operations when possible. Readers interested in the runtime should consult (Abadi et al., 2016).

The function decorator supports code generation via XLA (The XLA team, 2017). TensorFlow Eager relies upon XLA to execute code on Tensor Processing Units (TPUs) (Sato et al., 2017) (see §4.4). In addition to performance and hardware acceleration, dataflow graphs simplify distribution (§4.5) and deployment. Details about the mechanism of function are provided in §4.6.

A multi-stage workflow. Many users will find the performance of imperative execution sufficient. Purely imperative TensorFlow Eager can match the performance of graph execution when training models with sufficiently expensive kernels, like ResNet-50 (He et al., 2016) (see §6). But when imperative performance disappoints, we recommend the following multi-stage workflow, modeled after (Taha, 2004).

1. Implementation. Develop, debug, and test a single-stage imperative program.

2. Analysis. Using any profiling tool the user is familiar with, identify performance-critical blocks of operations, and express these blocks as staging-friendly Python functions or callable objects.

3. Staging. Decorate the functions identified in the previous step with @function.

With respect to the analysis step, the key fact to keep in mind is that function is not a compiler for arbitrary Python code. Rather, it is a JIT tracer that executes Python functions in a graph-building context and only records operations and tensors. In a graph-building context, operations return symbolic representations of values to be computed instead of concrete values, and non-TensorFlow Python code executes normally. Python functions that are amenable to staging are those that, when called in a graph-building context, generate a graph that encapsulates the computation of interest. This means that if a Python function executes non-TensorFlow code, then there might be semantic discrepancies between executing the Python function and executing the traced dataflow graph. For example, whereas the Python function

def add_noise():
  eye = tf.eye(5)
  randn = np.random.randn(5, 5)
  return eye + randn

will return a different output every time it is invoked, the dataflow graph generated by function(add_noise) will return the same value every time it is called, since a particular random offset generated by NumPy will be inserted into the graph as a constant. Note that if state is represented in terms of operations (e.g., if we replace the call to np.random.randn with tf.random_normal), we can preserve semantics under this tracing model. As a corollary, if a Python function f has Python side-effects (e.g., every call to it increments a global Python counter), then executing it multiple times will not necessarily be semantically equivalent to repeatedly executing the callable returned by function(f). Python functions must also be resilient to being executed multiple times, as the callable returned by function might trace its Python function multiple times (see the discussion on polymorphism in §4.6).

Because function generates graphs by tracing and not by source code analysis, it fully unrolls Python for and while loops, potentially creating large graphs. If that is a problem, the programmer might need to replace their loops with the equivalent TensorFlow control flow constructs. Similarly, the branches of if statements that are taken during tracing are baked into the emitted graphs. Conditionals that depend on the value of tensors will need to be written using tf.cond, and while loops that depend on tensor values will need to be rewritten in terms of tf.while_loop. Python functions that depend on the values of tensors in complicated ways (e.g., via data structures that depend on the values of tensors) might prove to be prohibitively difficult to stage correctly. In such cases, users might need to refactor their functions into staging-friendly and staging-unfriendly helper functions (see the discussion on escaping staged computations in §4.7 for other options).

Note that staging trades off imperative execution (and therefore interactivity) and Python integration (and therefore run-time dynamism) for performance. It is up to the programmer to decide when this trade-off is acceptable and to use staging annotations judiciously. This trade-off can be diminished by using tools like AutoGraph that operate on abstract syntax trees and rewrite Python control flow to dataflow control flow (Moldovan et al., 2019).
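To make the control-flow caveat above concrete, here is a hedged sketch (the helper function and shapes are ours, purely for illustration) of a conditional that depends on a tensor's value, written with tf.cond so that both branches are staged into the graph rather than decided once at trace time:

import tensorflow as tf
tf.enable_eager_execution()

@tf.contrib.eager.function
def clipped(x):
  # A Python `if x > 0.0:` would be evaluated during tracing and only
  # the taken branch would appear in the graph. tf.cond stages both
  # branches and selects between them when the graph function runs.
  return tf.cond(x > 0.0, lambda: x, lambda: tf.zeros_like(x))

print(clipped(tf.constant(3.0)))   # 3.0
print(clipped(tf.constant(-3.0)))  # 0.0, using the same graph function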

4.2 Automatic differentiation

We implement a variant of tracing-based reverse-mode automatic differentiation (Baydin et al., 2018), with a few changes to better support partially staged computation. Our implementation is similar to the implementations of Chainer (Tokui et al., 2015), Autograd (Maclaurin et al., 2015), and PyTorch (Paszke et al., 2017), but our API allows for more fine-grained control over which computations are traced.

The main user-visible concept in the gradient API is a tape. If a tape watches a value, operations taking this value as an input will be recorded. It is possible to differentiate any scalar that is computed while a tape is active with respect to any watched value. Tapes are composable data structures: multiple tapes can be active simultaneously, and higher-order gradients can be computed by having one tape recording while another tape computes a gradient. Listing 1 gives an example of nesting tapes to compute a second derivative.

x = tf.constant(3.0)
with tf.GradientTape() as t1:
  with tf.GradientTape() as t2:
    t1.watch(x)
    t2.watch(x)
    y = x * x
  dy_dx = t2.gradient(y, x)  # 6.0
d2y_dx2 = t1.gradient(dy_dx, x)  # 2.0

Listing 1. Tapes can be nested to compute higher-order derivatives.

Exposing the tape directly (as opposed to high-level Autograd-like gradient functions) lets users control which parts of the computation are traced for automatic differentiation, which can help limit the run-time overhead incurred in the tracing process.

The tape is tightly integrated with the logic responsible for staging code. The first time a graph function is called when a tape is both active and watching one of its inputs, we build a “forward” version of this function that returns any intermediate values needed for the backward step, in addition to its named outputs. As such, there is no meaningful change in the amount of computation or memory needed in the backward pass by staging or unstaging a particular function, leading to more predictable performance. Moreover, this ensures that if a computation was staged in the forward pass, its corresponding backward pass will also be staged.

Note that gradient computation is itself expressed as a function which executes primitive operations, so it is possible to stage it or not.

4.3 State

Like TensorFlow, TensorFlow Eager keeps program state in variables, restoring a variable's value by assigning to it from a restore operation and periodically saving it to disk by sending its value to a save operation. Variables are useful when implementing models because accessing a variable's value automatically watches it on all active tapes, as shown in Listing 2.

x = tf.Variable(3.0)
with tf.GradientTape() as t1:
  with tf.GradientTape() as t2:
    y = x * x
  dy_dx = t2.gradient(y, x)  # 6.0
d2y_dx2 = t1.gradient(dy_dx, x)  # 2.0

Listing 2. Gradient tapes automatically watch variables; compare this code to Listing 1.

In TensorFlow Eager, variables correspond to Python objects. Each variable object has its own unique storage that is deleted when Python deletes the object. This is true even for traced computations, where staged read, write, save, and restore operations may interact with variables. Staged computations reference variables by unique identifiers, which are no longer usable if the Python variable objects they reference do not exist. This correspondence ensures that TensorFlow Eager state conforms to programmer expectations, stored like any other Python state and accessible through Python identifiers.

One challenge when moving from purely staged computation to keeping state in Python objects is matching state between executions of the same program. TensorFlow uses unique names for each variable in a program, which relies on the user creating variables in a consistent order. For example, creating two copies of the same model requires special consideration when restoring the second model. TensorFlow Eager uses a graph-based matching system, where a directed graph with named edges between objects is serialized along with the program state. On restore, a greedy matching determines a correspondence between serialized Python state and the objects being restored. This matching is local in that it depends only on the objects being saved and restored, not on other parts of the program. Listing 3 and Figure 1 contain a short example.

class Net(tf.keras.Model):
  def __init__(self):
    super(Net, self).__init__()
    self.v = tf.Variable(1.)
    self.out = tf.layers.Dense(1)

  def call(self, x):
    return self.out(
        tf.nn.softplus(x * self.v))

Listing 3. Model-building code which implicitly constructs a graph with named directed edges (from attribute names), used for state matching.

[Figure 1: dependency graph for Listing 3, with nodes v, out, kernel, and bias.]

Figure 1. Visualization of the dependency graph for Listing 3, with filled-in intermediate nodes and nodes without fill containing state.
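As a usage sketch of the graph-based matching just described, TensorFlow's object-based checkpointing API serializes the object graph of Figure 1 and restores it by matching named edges; the tf.train.Checkpoint entry point and the checkpoint path below are our assumptions for illustration, not an API prescribed by this text, and the snippet continues the setup of the earlier listings.

net = Net()                       # the model from Listing 3
net(tf.constant([[1.0]]))         # run once so the Dense layer creates its kernel and bias
ckpt = tf.train.Checkpoint(model=net)   # "model" becomes a named edge in the object graph
path = ckpt.save("/tmp/net_ckpt")       # serializes the object graph and variable values

restored = Net()                  # a second copy of the same model
restored(tf.constant([[1.0]]))    # create its variables before restoring
tf.train.Checkpoint(model=restored).restore(path)
# Variables are matched by edge names (v, out/kernel, out/bias),
# not by global variable names or creation order.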

Variables are the most common type of state, but other state is similarly scoped to a Python object and matched as part of a directed graph with named edges. Examples include an iterator over input data whose position in a dataset is serialized, mutable hash tables, and, outside of traced code, even miscellaneous Python state such as NumPy arrays can use graph-based state matching.

Staging enables serializing the program for use without a Python interpreter, as in TensorFlow. A typical development workflow involves using graph-based state matching while writing and tweaking a TensorFlow Eager program, then serializing a trace for use in a production environment that executes the trace using TensorFlow's C++ API.

4.4 Devices

TensorFlow Eager makes it simple to use a variety of devices, such as CPUs, GPUs, and TPUs. During program startup, the runtime detects the devices that are available to the machine, and makes it possible to both execute operations on them and store data on them. Imperative and staged computations use the same underlying Device abstraction, which makes it possible to both execute operations on devices and store data on them. A user-visible API endpoint list_devices is exposed which lists all devices that the runtime is aware of.

All tensors exposed to the user are handles to data stored on a particular device. The runtime is also aware of how to copy data between various types of devices, and exposes this functionality through API endpoints on tensor instances.

a = tf.constant(1.0)  # stored on CPU
b = a.gpu()           # stored on GPU

Listing 4. Tensor copies between CPU and GPU.

When executing an operation, the runtime expects to have a specific device to run the operation on. TensorFlow Eager exposes a context manager, device, so that the user can control which device operations execute on. The user is not required to use this API, as the runtime is able to select a device based on the availability of kernels. When an operation has inputs on devices different from the device where the operation is executing, the runtime transparently copies the inputs to the correct device. This frees the user from having to explicitly copy tensors between various devices.

# stored on CPU
a = tf.constant(1.0)
b = tf.constant(2.0)
with tf.device("/gpu:0"):
  c = tf.add(a, b)

assert c.numpy() == 3.0

Listing 5. Executing a GPU operation with inputs on the CPU.

Because graph functions are executed via a primitive operation, it is also possible to use the device context manager to run graph functions on various devices. If operations inside the graph function are explicitly placed on another device, they override the outer device context.

Graph functions can serve as a unit of compilation for accelerators; we use this to efficiently execute code on TPUs. When a staged computation is placed on a TPU, TensorFlow Eager automatically invokes XLA to compile the graph and produce a TPU-compatible executable. TensorFlow Eager does make it possible to execute code imperatively on TPUs, but the overhead of compiling operations for TPU and dispatching the generated code is significant. When amortized over a large graph function, this overhead becomes negligible (see §6 for a quantitative example). Note that this programming model is similar to JAX (Frostig et al., 2018), which provides a Python decorator that JIT-compiles functions via tracing and XLA. Finally, compiling staged computations through XLA provides us more opportunities for optimization, including layout optimization, instruction scheduling for concurrency, and operation fusion. Techniques like tensor re-materialization can make it possible to fit a staged model into TPU memory when it would be impossible to do so on an operation-by-operation basis.

4.5 Distribution

The current system supports distributed execution with a single central server running the main (typically Python) program and several worker servers running on remote hosts. Each worker server adds its locally available devices (for example, CPUs, GPUs, or TPUs) to the pool of devices available to the main program. The main program can then execute operations or whole graph functions on remote devices through the worker servers.

The remote devices are identified by application-level names. The names contain the job name, the task inside the job, as well as the specific device available for the task, for example "/job:training/task:2/device:GPU:0". When a server is brought up to be a part of a cluster, it is given the mapping from the application-level names to specific server instances identified by DNS names or IP addresses.

To run an operation on a remote device, the user uses the same syntax as for local devices (see §4.4) but uses a remote device name instead of the local device name. Tensors produced as the result of running an operation on a remote device stay on the remote device. Users can then either perform more operations on these tensors or copy them to the central server (e.g., to use their value in an if statement).
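As a minimal sketch of this (the job and task names follow the example above, and a running cluster with that configuration is assumed), remote execution reuses the same device context manager as Listing 5:

# Runs on a worker's GPU; the job/task names must match the cluster
# configuration the worker servers were brought up with.
with tf.device("/job:training/task:2/device:GPU:0"):
  w = tf.random_normal((1024, 1024))
  y = tf.matmul(w, w)  # y stays on the remote device

# Further operations can consume y remotely; to branch on its value
# in Python, fetch it back to the central server, e.g. with y.numpy().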

Some computations running on remote devices can directly communicate and synchronize between each other. In such cases, developers need to start these computations concurrently, e.g. using Python threads.

4.6 Staging computations

The particular type of staging that TensorFlow Eager supports is similar to lightweight modular staging (Rompf & Odersky, 2010), which in turn is a form of partial evaluation (Jones et al., 1993). As stated in §4.1, we expose a user-visible API endpoint named function that takes a Python function and returns an object which, when called, executes a dataflow graph created by running the user-provided Python function in a graph-building context. In this section, we discuss the implementation of function in detail.

Polymorphism. All Python functions are polymorphic in their inputs. In contrast, graph functions are not polymorphic: they have a fixed number of inputs, which are statically typed. We bridge this semantic gap between Python functions and graph functions by implementing a trace cache, similar to the one described in JAX (Frostig et al., 2018). The object F = function(f) maintains a cache mapping from inferred input signatures to concrete graph functions. In particular, each time F is invoked, its inputs are processed and their signature is inferred: tensors are represented as abstract types (numerical type and shape tuples), while non-tensor values are encoded by object identity. This input signature, coupled with a small amount of metadata about the surrounding program state such as the requested device, becomes a key into a cache of graph functions. A cache miss triggers a trace of f on the given inputs, while a cache hit results in the reuse of a previously created graph function. In this sense, function provides ad hoc polymorphism (Strachey, 2000) or function overloading.

Not only is specializing functions on input types required for correctness, it also lets us generate optimized graphs — this kind of optimization is well-known, and, indeed, one of the primary motivations for partial evaluation (Jones et al., 1993; Taha, 2004; Rompf & Odersky, 2010).

Like JAX, function specializes on the run-time values of non-tensor arguments to let them parameterize the computation (function specializes automatically, whereas JAX makes this process a manual one). For example, it is common to write Python functions that take a boolean is_training argument that determines whether or not dropout is applied. Our implementation of binding-time analysis ensures that graph functions are specialized on the value of the boolean argument (see Listing 6 for an example).

@tf.contrib.eager.function
def lossy_matmul(W, x, training=True):
  outputs = tf.matmul(W, x)
  if training:
    outputs = tf.nn.dropout(outputs, 0.2)
  return outputs

W = tf.random_normal((3, 5))
x = tf.random_normal((5, 1))

# Executes a graph with dropout.
lossy_outputs = lossy_matmul(W, x, training=True)
# Executes a graph without dropout.
exact_outputs = lossy_matmul(W, x, training=False)

Listing 6. This code transparently makes two graph functions.

The user also has the option of specifying an input signature to eliminate input polymorphism. In this case, we guarantee that we only generate a single graph function using only the shape and numeric type information specified in the signature. This can be useful for serialization and error-checking, and for creating a single function that can handle arbitrary batch sizes or sequence lengths.

Lexical closure. function is capable of tracing Python functions that lexically close over tensors or variables — these closed-over objects are treated as “captured” inputs that are silently passed to the graph function at call-time, without programmer intervention. Variables are captured by reference and not by value, which means that graph functions are free to mutate them. Listing 7 provides an example.

v = tf.Variable(0.0)

@tf.contrib.eager.function
def mutate():
  v.assign_add(1.0)
  return v.read_value()

mutate()
assert float(v.read_value()) == 1.0
v.assign_add(1.0)
assert float(v.read_value()) == 2.0
mutate()
assert float(v.read_value()) == 3.0

Listing 7. function transparently captures closed-over tensors and variables, forwarding them to TensorFlow functions as inputs.
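As an illustrative sketch of the trace cache described under Polymorphism above (the function below is ours, purely for illustration), Python side effects run only while tracing, which makes cache misses and hits visible:

@tf.contrib.eager.function
def scale(x):
  # Python code runs only during tracing, so this print reveals when a
  # new concrete graph function is being created for a new signature.
  print("tracing with dtype", x.dtype, "and shape", x.shape)
  return 2 * x

scale(tf.constant([1.0, 2.0]))  # cache miss: traces for float32, shape (2,)
scale(tf.constant([3.0, 4.0]))  # cache hit: same signature, no print
scale(tf.constant([1, 2]))      # cache miss: int32 is a new signature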

Composition. Because graph function execution is implemented as an operation, graph functions compose naturally: the graph of a function may include a function-call operation that executes another function. For example, consider the following code block:

@tf.contrib.eager.function
def inner(a):
  return tf.nn.relu(a)

@tf.contrib.eager.function
def outer(a, b):
  return inner(tf.matmul(a, b))

outer(tf.eye(3), tf.diag([-1.0, 1.0, 2.0]))

Listing 8. Graph functions can be nested.

[Figure 2: (a) the graph generated for outer (a matmul followed by a call:inner operation that executes inner's graph function); (b) the graph generated for inner (a relu).]

Figure 2. function composes; above, the graphs for Listing 8.

The call to outer will generate two graph functions, one for inner, and one for outer that contains a call to inner's graph function. Figure 2 shows what their corresponding graphs look like.

State creation. When building machine learning models, it is common to write Python functions that create and initialize variables the first time they are called. To support this idiom, function imposes some requirements on the decorated function f. State, such as TensorFlow variables, must only be created the first time f is called; how that is accomplished is left to the implementation of f. If any variables are created in the first execution of f, then function will trace f a second time to record the behavior that will be used from then on. No variables may be created during that second trace, or any subsequent one.

4.7 Escaping staged computations

Embedding imperative code in graphs. As discussed in §4.1, staging computations requires the programmer to refactor the to-be-staged code into Python functions that, when traced, construct dataflow graphs. This process may at times seem prohibitively difficult, as it can require replacing complicated Python control flow with TensorFlow control flow or even implementing custom operations along with custom C++ kernels — indeed, this observation was one of the motivations for building TensorFlow Eager to begin with. For concreteness, say that we have a Python function that we wish to stage, and say that the function is almost entirely staging-friendly with the exception of a call to a data-dependent recursive Python function that performs some operations on tensors. In this case, we have three options: we can refactor the function into three functions, staging the code before and after the recursive call and leaving the recursive call unstaged; we can give up on staging the function if refactoring proves too onerous; or we can stage the entire function while wrapping the recursive call in a py_func, an operation that takes a Python function as an attribute and executes it imperatively, even in the context of staged code. py_func executes its Python function under a gradient tape (see §4.2) and as such it is differentiable; it also has both CPU and GPU kernels. When executing in imperative mode, wrapping a Python function in a py_func has essentially no effect. But, in staged computations, i.e. in dataflow graphs, the py_func operation is a way to embed imperative, Pythonic code into a dataflow graph. Equivalently, py_func can be viewed as a way to quickly implement custom operations using Python instead of C++.

The benefit of py_func is that it makes it easier to decorate large Python functions with @function. Disadvantages include a potential performance hit, as py_func returns control to a single-threaded Python interpreter, and the fact that graphs with py_funcs are not in general serializable.

Escaping traces. We provide a Python context manager, tf.init_scope, that pauses the trace and jumps into the imperative context. We use this scope to implement function's state-creation contract; most users, on the other hand, will have no use for it.

5 IMPLEMENTATION

We have implemented the design presented in §4, and all of our code is open source1. Because TensorFlow Eager was built as an extension to TensorFlow, the implementation is not large: staging is implemented in approximately 2000 lines of Python, automatic differentiation is split across 900 lines of Python and 600 lines of C, and the imperative runtime—i.e., the code responsible for constructing and executing operations—is implemented in approximately 4000 lines of C++. TensorFlow Eager also provides a lightweight C API that exposes our runtime, and several of our colleagues are using this API directly in their own projects.

TensorFlow Eager inherits the benefits of TensorFlow's implementation. In particular, TensorFlow Eager is cross-platform, running on the Linux, Mac OS X, Windows, Android, and iOS operating systems, and various x86, ARM, and NVIDIA GPU architectures; it executes staged computations using a dataflow executor that can run over ten thousand subgraphs in parallel and that runs kernels in parallel when possible, across multiple CPU cores or GPU streams; it provides high-level Python APIs for training models and C++ APIs for inference (see Abadi et al., 2016, §5). TensorFlow Eager also provides access to the over 900 primitive operations that TensorFlow offers.

1 https://github.com/tensorflow/tensorflow

TensorFlow Eager and TensorFlow differ slightly but significantly in their implementations of staged execution. In TensorFlow, the dataflow graph defines the union of all the computations that the author of the graph might be interested in; the actual computation to execute is defined when the programmer requests the runtime to fetch the concrete values of some set of tensors resident in the graph. This amounts to a discrepancy between what is expressed in Python and what is executed by the TensorFlow runtime. To provide a more Pythonic programming model, TensorFlow Eager represents each staged computation as a graph function, i.e., a graph with named inputs and outputs, representing the exact computation of interest. This approach still allows for graph optimizations: for example, non-stateful operations that are not reachable from the outputs of a function are pruned, just as in TensorFlow.

Graph functions provide benefits outside the realm of usability as well. Because graph functions are executed via an operation, we get function composition for free. In the context of single-coordinator distributed training, in which a single subgraph is executed by N workers, graph functions can reduce memory pressure on the coordinator: the coordinator only needs to own a graph function that contains N function-call operations (instead of N copies of a subgraph).

6 EVALUATION

TensorFlow Eager considerably simplifies rapid prototyping. This at times trades off execution speed for development ease. In this section, we present examples2 showing how we can use function to recover the speed of TensorFlow.

2 The code for these example models and others is available at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/examples.

Experimental setup. The benchmarks were run within a docker container on a machine with an Intel(R) Xeon(R) W-2135 CPU with 12 cores at 3.7GHz, 64GB of memory, and a GTX 1080 GPU with 8GB of memory. The TPU benchmark was run on a publicly available Cloud TPU. Each benchmark run was 10 iterations, and an average of 3 runs was reported. For staged computations, build and optimization times were not included, as these are one-time costs that are usually amortized over a number of runs.

ResNet-50. In Figure 3 we show the performance of training a ResNet-50 model, comparing TensorFlow Eager, TensorFlow Eager with the forward pass and gradient application staged with function, and TensorFlow. The top chart shows the raw examples per second, and the bottom chart shows the improvement that TensorFlow Eager with function and TensorFlow show over TensorFlow Eager. For smaller batch sizes, staging computations yields significant speed-ups. These speed-ups vanish as the batch size increases, since the ratio of the time spent in kernels over the time spent in Python increases. Additionally, training a ResNet doesn't benefit significantly from inter-op parallelization, so the staged computation is effectively as serial as the unstaged computation. These performance characteristics should hold true for other sufficiently large models, i.e., imperative performance will often be similar to staged performance. The code used to generate these benchmarks relies on the same Model class; converting the code to use function is simply a matter of decorating two functions.

[Figure 3: two charts of examples per second and percent improvement versus batch size (1–32) for TFE, TFE + function, and TF.]

Figure 3. Examples per second when training ResNet-50 on a GPU (top). Percent improvement over TensorFlow Eager (bottom).

ResNet-50 on TPU. It is possible to run single operations on a TPU using TensorFlow Eager. The performance of training ResNet-50 on ImageNet (Deng et al., 2009) using TensorFlow Eager versus TensorFlow Eager with function is shown in Table 1. Training the model in a per-operation fashion is slow, even at a batch size of 32; staging yields an order of magnitude improvement in examples per second. An important caveat is that these benchmarks do not exploit the hardware optimally. They are presented as illustrative of how staging lets us target accelerators like TPUs with practically no code changes. We don't present an accompanying TensorFlow benchmark for this reason.

Batch size                        1      2      4      8      16     32
TensorFlow Eager                  1.06   1.99   4.3    8.4    16.6   30.3
TensorFlow Eager with function    21.7   42.6   83.9   165.8  197.7  241.9

Table 1. Examples per second training ResNet-50 on a TPU.

L2HMC. In Figure 4 we show the performance of an L2HMC (Levy et al., 2018) implementation, comparing TensorFlow Eager, TensorFlow Eager with function, and TensorFlow on synthetic data running on the CPU. The benchmark samples from a 2-dimensional distribution, with 10 steps for the leapfrog integrator. This example highlights the trade-off between debuggability and performance: by bypassing Python overheads and via buffer reuse and other static optimizations, staging increases examples per second by at least an order of magnitude. And while the trade-off exists, it is not particularly onerous here — simply decorating a single function recovers the full performance of TensorFlow. This benchmark stages computation aggressively, essentially running the entire update as a graph function. Depending on the desired visibility into the model's execution during development, it is possible to stage less aggressively.

[Figure 4: examples per second versus number of samples (10–200) for TFE, TFE + function, and TF.]

Figure 4. Examples per second training L2HMC on a CPU.

Note. These examples were chosen as they lie at opposite ends of the trade-off between execution speed and development speed. We expect most real-world models to fall somewhere between these two, and to be able to recover performance by staging as required. TensorFlow Eager is an evolving technology, and closing the gap between imperative and staged performance is being worked on.

7 CONCLUSION

We presented TensorFlow Eager, an extension to TensorFlow that makes what was once a declarative DSL for differentiable programming into a multi-stage, imperative-first one. TensorFlow Eager's imperative-by-default behavior makes it suitable for beginners and researchers alike, and the option to stage computations as graph functions lets users trade off the interactivity and Python integration furnished by imperative execution for the benefits provided by static graphs, performance and ease of serialization among them.

Within Alphabet, dozens have adopted TensorFlow Eager. For example, some researchers use it to implement dynamic language models and reinforcement learning methods, and several internal workshops on TensorFlow Eager have been attended widely. Multiple groups are restructuring their machine learning frameworks to make TensorFlow Eager the default way of using them (examples include libraries for probabilistic machine learning and reinforcement learning), and at least one large research group has engineers dedicated to supporting TensorFlow Eager. Externally, some university courses have included TensorFlow Eager as part of their curriculum, and 48 percent of respondents to a survey distributed at the 2018 TensorFlow Developer Summit agreed with the statement, “[TensorFlow Eager] is important to me as an iterative development and debugging tool.”

TensorFlow Eager is an evolving technology. While it is well-suited for research and pedagogy alike, we are still working to provide an out-of-the-box solution for imperatively-driven distributed training. And while multi-stage programming is powerful — wrapping large Python functions in function often “does the right thing” — staging computations with dynamic control flow can require nontrivial programmer intervention. We hope to decrease this friction via AutoGraph (Moldovan et al., 2019).

Finally, TensorFlow Eager has informed the evolution of TensorFlow itself: the upcoming TensorFlow 2.0 uses our implementation to provide an imperative-first, multi-stage programming model similar to the one outlined in this paper.

ACKNOWLEDGEMENTS

We'd like to thank everyone on the TensorFlow team for their feedback and help with the design and implementation of this system. Alex Wiltschko, Pierre Sermanet, Xin Pan, Yaroslav Bulatov, Manjunath Kudlur, and Yuan Yu contributed significantly to motivating and early prototyping of TF Eager. Early users like Sergio Guadarrama, Daniel Abolafia, David Berthelot, Chen Li, and Debidatta Dwibedi, among others, were key in helping us shape the requirements of this system. A lot of it wouldn't look like it does now without important feedback from DeepMind, especially from Aedan Pope and Tom Hennigan. François Chollet was very helpful in integrating TF Eager with Keras.

REFERENCES

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, pp. 265–283, Berkeley, CA, USA, 2016. USENIX Association.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–43, 2018.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. Theano: A CPU and GPU math compiler in Python. In Proceedings of the 9th Python in Science Conference, pp. 3–10, 2010.

Bolz, C. F., Cuni, A., Fijalkowski, M., and Rigo, A. Tracing the meta-level: PyPy's tracing JIT compiler. In Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems, pp. 18–25. ACM, 2009.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv e-prints, art. arXiv:1512.01274, Dec 2015.

Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cuDNN: Efficient primitives for deep learning. arXiv e-prints, art. arXiv:1410.0759, Oct 2014.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.

DeVito, Z., Hegarty, J., Aiken, A., Hanrahan, P., and Vitek, J. Terra: A multi-stage language for high-performance computing. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 105–116, New York, NY, USA, 2013. ACM.

Frostig, R., Johnson, M. J., and Leary, C. Compiling machine learning programs via high-level tracing. In the 1st SysML Conference, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hudak, P. Building domain-specific embedded languages. ACM Computing Surveys, 28(4es), December 1996.

Innes, M. Flux: Elegant machine learning with Julia. Journal of Open Source Software, 2018.

Innes, M. Don't unroll adjoint: Differentiating SSA-form programs. In Proceedings of the 2nd SysML Conference, 2019.

Innes, M., Karpinski, S., Shah, V., Barber, D., Stenetorp, P., Besard, T., Bradbury, J., Churavy, V., Danisch, S., Edelman, A., Malmaud, J., Revels, J., and Yuret, D. On machine learning and programming languages. In the 1st SysML Conference, 2018.

Jones, N. D., Gomard, C. K., and Sestoft, P. Partial Evaluation and Automatic Program Generation. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993. ISBN 0-13-020249-5.

Jørring, U. and Scherlis, W. L. Compilers and staging transformations. In Proceedings of the 13th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pp. 86–96, New York, NY, USA, 1986. ACM.

Kong, L., Dyer, C., and Smith, N. A. Segmental recurrent neural networks. arXiv e-prints, art. arXiv:1511.06018, Nov 2015.

Lam, S. K., Pitrou, A., and Seibert, S. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pp. 7:1–7:6, New York, NY, USA, 2015. ACM.

Lattner, C. and the Swift for TensorFlow Team. Swift for TensorFlow. https://github.com/tensorflow/swift, 2018.

Levy, D., Hoffman, M. D., and Sohl-Dickstein, J. Generalizing Hamiltonian Monte Carlo with neural networks. In Proceedings of the 6th International Conference on Learning Representations, 2018.

Maclaurin, D., Duvenaud, D., and Adams, R. P. Autograd: Effortless gradients in NumPy. In the AutoML Workshop of the 32nd International Conference on Machine Learning, 2015.

Moldovan, D., Decker, J. M., Wang, F., Johnson, A. A., Lee, B. K., Nado, Z., Sculley, D., Rompf, T., and Wiltschko, A. B. AutoGraph: Imperative-style coding with graph-based performance. In Proceedings of the 2nd SysML Conference, 2019.

Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., Ballesteros, M., Chiang, D., Clothiaux, D., Cohn, T., Duh, K., Faruqui, M., Gan, C., Garrette, D., Ji, Y., Kong, L., Kuncoro, A., Kumar, G., Malaviya, C., Michel, P., Oda, Y., Richardson, M., Saphra, N., Swayamdipta, S., and Yin, P. DyNet: The dynamic neural network toolkit. arXiv e-prints, art. arXiv:1701.03980, Jan 2017.

Oliphant, T. E. Guide to NumPy. CreateSpace Independent Publishing Platform, USA, 2nd edition, 2015.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the 31st Conference on Neural Information Processing Systems, 2017.

PyTorch team. Torch Script. https://pytorch.org/docs/master/jit.html, 2018.

Rompf, T. and Odersky, M. Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs. In Proceedings of the 9th International Conference on Generative Programming and Component Engineering, pp. 127–136, New York, NY, USA, 2010. ACM.

Sato, K., Young, C., and Patterson, D. An in-depth look at Google's first Tensor Processing Unit (TPU). https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu, 2017.

Socher, R., Lin, C. C., Manning, C., and Ng, A. Y. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning, pp. 129–136, 2011.

Strachey, C. Fundamental concepts in programming languages. Higher-Order and Symbolic Computation, 13(1-2):11–49, 2000.

Sujeeth, A., Lee, H., Brown, K., Rompf, T., Chafi, H., Wu, M., Atreya, A., Odersky, M., and Olukotun, K. OptiML: an implicitly parallel domain-specific language for machine learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 609–616, 2011.

Taha, W. A gentle introduction to multi-stage programming. In Domain-Specific Program Generation, pp. 30–50. Springer, 2004.

The Gluon Team. Deep learning: The straight dope. https://gluon.mxnet.io/, 2017.

The XLA team. XLA - TensorFlow, compiled. https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html, 2017.

Tokui, S., Oono, K., and Hido, S. Chainer: a next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in the 29th Annual Conference on Neural Information Processing Systems, volume 5, pp. 1–6, 2015.

Wei, R., Schwartz, L., and Adve, V. DLVM: A modern compiler infrastructure for deep learning systems. arXiv e-prints, art. arXiv:1711.03016, Nov 2017.