TensorFlow Eager: A Multi-Stage, Python-Embedded DSL for Machine Learning
Akshay Agrawal, Akshay Naresh Modi, Alexandre Passos, Allen Lavoie, Ashish Agarwal, Asim Shankar, Igor Ganichev, Josh Levenberg, Mingsheng Hong, Rajat Monga, Shanqing Cai
ABSTRACT
…function. While invoking a graph function is syntactically equivalent to calling the Python function from which it was generated, the execution of graph functions bypasses Python: they are executed using a C++ dataflow runtime or are compiled to generate optimized code for CPUs, GPUs, and ASICs. Graph functions and imperative code share a lexical environment, making it simple to invoke graph functions from imperative code, create graph functions that close over imperatively constructed data, and embed imperative code in graph functions via unstaging annotations.

Our contributions are two-fold:

• Our implementation is elegant. TensorFlow Eager can be viewed as a multi-stage front-end to TensorFlow. Imperative and staged TensorFlow Eager code share a single set of primitive operations, kernels, and user-visible APIs. Not only does this sharing result in an easy-to-maintain implementation, it also lets us present a single, coherent API surface to our users that is agnostic to execution mode and lets users enjoy the rich ecosystem of tools developed for TensorFlow.

• While we are not the first in the differentiable programming community to recognize the value in bridging imperative and declarative programming, we are among the first to present this line of work in the context of multi-stage programming. This contextualization is a contribution insofar as it clarifies discourse and connects two otherwise separate communities.

The remainder of this paper is structured as follows: §2 surveys related work; §3 puts forth our design principles, which prioritize usability and researcher productivity; §4 presents our multi-stage programming model, with details on automatic differentiation, state, hardware acceleration, distribution, staging, and unstaging; §5 discusses our implementation; and §6 provides a quantitative evaluation of the performance of TensorFlow Eager on machine learning models, demonstrating that imperative TensorFlow Eager can train a ResNet-50 on a single GPU just as quickly as TensorFlow can, that staged TensorFlow Eager can train a ResNet-50 on a TPU much faster than imperative TensorFlow Eager can, and that staging yields significant speedups for models with small operations, all with minimal code changes.

this is what, e.g., DLVM, Swift for TensorFlow, and Zygote do (Wei et al., 2017; Lattner & the Swift for TensorFlow Team, 2018; Innes, 2019). Python's flexibility makes it difficult for DSLs embedded in it to use such an approach. Some projects, like AutoGraph (Moldovan et al., 2019), do operate on Python abstract syntax trees to rewrite imperative code to code that constructs dataflow graphs, but such techniques are out of the scope of this paper.

An alternative to staging computations as graphs for performance is to implement fused kernels. For example, NVIDIA provides fused CuDNN kernels for popular recurrent neural network operations that are dramatically faster than non-fused implementations (Chetlur et al., 2014). This approach, while useful, is difficult to scale, as it requires substantial programmer intervention.

TensorFlow Eager is not the first Python library to offer a multi-stage programming model. JAX (Frostig et al., 2018), a tracing-JIT compiler that generates code for heterogeneous devices via XLA (The XLA team, 2017), provides a similar programming paradigm; MXNet and Gluon also let users interpolate between imperative and staged computations, but at a level of abstraction that is higher than ours (Chen et al., 2015; The Gluon Team, 2017); and PyTorch is implementing a staging tracer that is similar to ours (PyTorch team, 2018). Outside of differentiable programming, Terra is a Lua-embedded DSL that supports code generation, and the paper in which it was introduced presents a thorough treatment of multi-stage programming that is more formal than ours (DeVito et al., 2013); as another example, OptiML is a Scala-embedded DSL for machine learning with support for staging and code generation but without support for automatic differentiation (Sujeeth et al., 2011). Outside of DSLs, there are several projects that provide just-in-time (JIT) compilation for Python, of which Numba (Lam et al., 2015) and PyPy (Bolz et al., 2009) are two examples.

Multi-stage programming is a well-studied topic in programming languages; a good reference is (Taha, 2004), and a modern design from which we drew inspiration is Scala's lightweight modular staging (Rompf & Odersky, 2010). Multi-stage programming is related to staging transformations in compilers and partial evaluation in programming languages, for which (Jørring & Scherlis, 1986) and (Jones et al., 1993) are classic references, respectively.
devices. The first two of the following three principles are in service of the former goal, while the third is in service of the latter.

Privilege imperative execution. Because Python is an imperative language, TensorFlow Eager operates in an imperative fashion by default; staged execution is opt-in and often unnecessary (see §4.1 and §6 for details).

Seamlessly embed into Python. Whereas writing TensorFlow code is an exercise in metaprogramming, imperative execution lets programmers enjoy the full extent of the host language: programmers write Pythonic code, complete with familiar language constructs like native control flow (e.g., Python if statements and while loops), recursion, arbitrary data structures, and even pdb breakpoints. And, because we implement automatic differentiation via tracing (§4.2), the programmer can differentiate through all these constructs. Host-language integration is more than just syntactic sugar: it greatly simplifies the implementation of data-dependent models like segmental recurrent neural networks and recursive neural networks (Kong et al., 2015; Socher et al., 2011).

Stage imperative code as dataflow graphs. To leverage the benefits of dataflow graphs, TensorFlow Eager provides a mechanism to trace Python functions and stage their operations as graph functions. The staging workflow is detailed in §4.1, and the mechanism is described in §4.6. TensorFlow graphs come with their own set of design principles, which are presented in (Abadi et al., 2016).

4 EXECUTION MODEL

This section presents the pillars of TensorFlow Eager's execution model. §4.1 describes imperative and staged execution, presenting a workflow that hybridizes the two; §4.2 describes our trace-based implementation of automatic differentiation; §4.3 specifies how we represent mutable state and how we support serialization; §4.4 details how TensorFlow Eager supports execution across heterogeneous devices; §4.5 presents mechanisms for distributed execution; §4.6 discusses our tracing JIT in detail; and §4.7 discusses mechanisms for escaping staged computations.

The following terminology will be used in the sequel: a tensor is a multi-dimensional, typed array; an operation is a primitive, possibly stateful function that takes tensors as inputs and produces tensors as outputs; a kernel is a device-specific implementation of an operation; and a model is a composition of primitive operations.

4.1 Multi-stage programming

TensorFlow Eager provides two ways of executing operations: imperatively or as part of a static dataflow graph. Both execution models have access to the same set of operations and kernels, but they differ in how they dispatch kernels.

Imperative execution. By default, TensorFlow Eager executes operations immediately: library functions such as tf.matmul construct operations and then immediately execute their kernels. Under this regime, TensorFlow Eager resembles a NumPy-like library for hardware-accelerated numerical computation and machine learning. Calling .numpy() on a tensor fetches a NumPy array storing the tensor's data, and tensors can be supplied to external libraries like matplotlib that expect NumPy arrays (for a reference on NumPy, see Oliphant, 2015). As an example,

import tensorflow as tf
tf.enable_eager_execution()

def select(vector):
    A = tf.constant([[1.0, 0.0]])
    return tf.matmul(A, vector)

x = tf.constant([[2.0], [-2.0]])
print(select(x))

prints

tf.Tensor([[2.]], shape=(1, 1), dtype=float32).
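Since imperative tensors hold concrete values, they interoperate directly with NumPy as described above. A minimal sketch reusing the example just shown (matplotlib is only an illustration of an external, NumPy-consuming library):

import matplotlib.pyplot as plt

y = select(x)             # an eager tensor holding a concrete value
array = y.numpy()         # a NumPy array storing the tensor's data
plt.plot(array.ravel())   # external libraries consume the data directly
plt.savefig("select.png")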
Staged execution. While imperative execution simplifies prototyping, the overhead of going back and forth into the Python interpreter limits its performance; representing computations as dataflow graphs before executing them not only removes this bottleneck but also allows for inter-op parallelism and optimizations like constant-folding and buffer reuse. Thus, TensorFlow Eager provides a mechanism to stage computations as dataflow graphs. In particular, we provide a decorator, function, that traces the execution of a Python function, recording all TensorFlow operations and the tensors flowing between them in a dataflow graph. function can be thought of as an opt-in JIT compiler that generates an optimized polymorphic function for a Python function, creating concrete functions backed by dataflow graphs via a straightforward binding-time analysis at run time. The analogy to compilers is imperfect because the traces generated by function only record TensorFlow operations and not arbitrary Python code, but it nonetheless provides an approximate mental model. One advantage of this tracing mechanism is that the underlying dataflow graph format does not need to support all the dynamism present in the Python code being traced; as long as the set of operations in the trace does not depend on Python state, we can generate a correct trace.

Invoking a callable returned by function will execute a dataflow graph instead of the corresponding Python function. In fact, graph functions are themselves executed by an operation that takes tensors as inputs and a function
name as an attribute, and these operations are automatically constructed and executed for the user. For example, if the select function defined in the previous section were decorated with @function, then select(x) would execute an operation that would in turn execute the appropriate graph function. The dataflow graph runtime, which is written in C++, automatically partitions subgraphs across devices and parallelizes operations when possible. Readers interested in the runtime should consult (Abadi et al., 2016).
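Concretely, staging the select function is a one-line change; a sketch, using the decorator spelling that appears in the listings later in this paper:

@tf.contrib.eager.function
def select(vector):
    A = tf.constant([[1.0, 0.0]])
    return tf.matmul(A, vector)

x = tf.constant([[2.0], [-2.0]])
print(select(x))  # runs the traced graph function via a single call operation

The printed value is the same as in the imperative version; only the dispatch path changes.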
The function decorator supports code generation via XLA (The XLA team, 2017). TensorFlow Eager relies upon XLA to execute code on Tensor Processing Units (TPUs) (Sato et al., 2017) (see §4.4). In addition to performance and hardware acceleration, dataflow graphs simplify distribution (§4.5) and deployment. Details about the mechanism of function are provided in §4.6.

A multi-stage workflow. Many users will find the performance of imperative execution sufficient. Purely imperative TensorFlow Eager can match the performance of graph execution when training models with sufficiently expensive kernels, like ResNet-50 (He et al., 2016) (see §6). But when imperative performance disappoints, we recommend the following multi-stage workflow, modeled after (Taha, 2004); a sketch of the resulting code follows the list.

1. Implementation. Develop, debug, and test a single-stage imperative program.

2. Analysis. Using any profiling tool the user is familiar with, identify performance-critical blocks of operations, and express these blocks as staging-friendly Python functions or callable objects.

3. Staging. Decorate the functions identified in the previous step with @function.
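As a sketch of steps 2 and 3, suppose profiling identifies the per-step training computation of a small model as the hot block; the code below expresses that block as a staging-friendly function and decorates it. The model, loss, and optimizer shown are illustrative placeholders rather than anything prescribed by this workflow.

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)

@tf.contrib.eager.function          # step 3: stage the performance-critical block
def train_step(features, labels):   # step 2: the block identified by profiling
    with tf.GradientTape() as tape:
        logits = model(features)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss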
With respect to the analysis step, the key fact to keep in mind is that function is not a compiler for arbitrary Python code. Rather, it is a JIT tracer that executes Python functions in a graph-building context and only records operations and tensors. In a graph-building context, operations return symbolic representations of values to be computed instead of concrete values, and non-TensorFlow Python code executes normally. Python functions that are amenable to staging are those that, when called in a graph-building context, generate a graph that encapsulates the computation of interest. This means that if a Python function executes non-TensorFlow code, then there might be semantic discrepancies between executing the Python function and executing the traced dataflow graph. For example, whereas the Python function

import numpy as np

def add_noise():
    eye = tf.eye(5)
    randn = np.random.randn(5, 5)
    return eye + randn

will return a different output every time it is invoked, the dataflow graph generated by function(add_noise) will return the same value every time it is called, since a particular random offset generated by NumPy will be inserted into the graph as a constant. Note that if state is represented in terms of operations (e.g., if we replace the call to np.random.randn with tf.random_normal), we can preserve semantics under this tracing model. As a corollary, if a Python function f has Python side-effects (e.g., every call to it increments a global Python counter), then executing it multiple times will not necessarily be semantically equivalent to repeatedly executing the callable returned by function(f). Python functions must also be resilient to being executed multiple times, as the callable returned by function might trace its Python function multiple times (see the discussion on polymorphism in §4.6).
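A minimal sketch of the trace-safe variant suggested above, with the NumPy call replaced by a TensorFlow operation so that the randomness is recorded in the graph instead of being frozen into a constant:

def add_noise_staged():
    eye = tf.eye(5)
    randn = tf.random_normal((5, 5))   # an operation, so it is part of the trace
    return eye + randn

noisy = tf.contrib.eager.function(add_noise_staged)
a = noisy()   # each call re-executes the random_normal operation,
b = noisy()   # so the two results differ, matching the imperative semantics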
Because function generates graphs by tracing and not by source code analysis, it fully unrolls Python for and while loops, potentially creating large graphs. If that is a problem, the programmer might need to replace their loops with the equivalent TensorFlow control flow constructs. Similarly, the branches of if statements that are taken during tracing are baked into the emitted graphs. Conditionals that depend on the value of tensors will need to be written using tf.cond, and while loops that depend on tensor values will need to be rewritten in terms of tf.while_loop. Python functions that depend on the values of tensors in complicated ways (e.g., via data structures that depend on the values of tensors) might prove to be prohibitively difficult to stage correctly. In such cases, users might need to refactor their functions into staging-friendly and staging-unfriendly helper functions (see the discussion on escaping staged computations in §4.7 for other options).
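For instance, a conditional whose predicate depends on a tensor value can be kept inside the trace with tf.cond; the following sketch assumes a numeric tensor argument:

@tf.contrib.eager.function
def clipped_double(x):
    # A Python `if` here would be resolved once, at trace time; tf.cond keeps
    # the branch decision in the graph, where it depends on the runtime value.
    return tf.cond(tf.reduce_sum(x) > 0,
                   lambda: 2.0 * x,
                   lambda: tf.zeros_like(x))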
Note that staging trades off imperative execution (and therefore interactivity) and Python integration (and therefore run-time dynamism) for performance. It is up to the programmer to decide when this trade-off is acceptable and to use staging annotations judiciously. This trade-off can be diminished by using tools like AutoGraph that operate on abstract syntax trees and rewrite Python control flow to dataflow control flow (Moldovan et al., 2019).

4.2 Automatic differentiation

We implement a variant of tracing-based reverse-mode automatic differentiation (Baydin et al., 2018), with a few changes to better support partially staged computation. Our implementation is similar to the implementations of Chainer (Tokui et al., 2015), Autograd (Maclaurin et al., 2015), and PyTorch (Paszke et al., 2017), but our API allows for more fine-grained control over which computations are traced.

The main user-visible concept in the gradient API is a tape. If a tape watches a value, operations taking this value as an input are recorded.
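A minimal sketch of the tape API; the explicit watch call is needed because x below is a constant (variables, as in the listing that follows, are recorded without it):

x = tf.constant([2.0, 4.0])
with tf.GradientTape() as tape:
    tape.watch(x)                # record operations that take x as an input
    y = tf.reduce_sum(x * x)
grad = tape.gradient(y, x)       # [4.0, 8.0]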
Note that gradient computation is itself expressed as a function which executes primitive operations, so it is possible to stage it or not.

x = tf.Variable(3.0)
with tf.GradientTape() as t1:
    with tf.GradientTape() as t2:
        y = x * x
    dy_dx = t2.gradient(y, x)    # 6.0
d2y_dx2 = t1.gradient(dy_dx, x)  # 2.0
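Because the gradient computation is just more primitive operations, the same kind of computation can be staged by wrapping it in the function decorator; a sketch:

@tf.contrib.eager.function
def grad_square(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = x * x
    return tape.gradient(y, x)

print(grad_square(tf.constant(3.0)))  # 6.0, computed inside a graph function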
    def call(self, x):
        return self.out(tf.nn.softplus(x * self.v))

Figure 1. Visualization of the dependency graph for Listing 3, with filled-in intermediate nodes and nodes without fill containing state.

Variables are the most common type of state, but other state
is similarly scoped to a Python object and matched as part of a directed graph with named edges. Examples include an iterator over input data whose position in a dataset is serialized and mutable hash tables; outside of traced code, even miscellaneous Python state such as NumPy arrays can use graph-based state matching.

Staging enables serializing the program for use without a Python interpreter, as in TensorFlow. A typical development workflow involves using graph-based state matching while writing and tweaking a TensorFlow Eager program, then serializing a trace for use in a production environment that executes the trace using TensorFlow's C++ API.
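As an illustration of the object-scoped, name-matched state described above, the sketch below saves and restores state with tf.train.Checkpoint; the attribute names, objects, and path are hypothetical, and tying the description to this particular API is our reading rather than something the text prescribes.

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
path = ckpt.save("/tmp/eager-ckpt")  # state matched by named edges (model, optimizer, ...)
ckpt.restore(path)                   # restored by walking the same object graph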
4.4 Devices

TensorFlow Eager makes it simple to use a variety of devices, such as CPUs, GPUs, and TPUs. During program startup, the runtime detects the devices that are available to the machine. Imperative and staged computations use the same underlying Device abstraction, which makes it possible to both execute operations on devices and store data on them. A user-visible API endpoint list_devices is exposed which lists all devices that the runtime is aware of.

All tensors exposed to the user are handles to data stored on a particular device. The runtime is also aware of how to copy data between various types of devices, and exposes this functionality through API endpoints on tensor instances.

a = tf.constant(1.0)  # stored on CPU
b = a.gpu()           # stored on GPU

Listing 4. Tensor copies between CPU and GPU.

When executing an operation, the runtime expects to have a specific device to run the operation on. TensorFlow Eager exposes a context manager, device, so that the user can control which device operations execute on. The user is not required to use this API, as the runtime is able to select a device based on the availability of kernels. When an operation has inputs on devices different from the device where the operation is executing, the runtime transparently copies the inputs to the correct device. This frees the user from having to explicitly copy tensors between various devices.

# stored on CPU
a = tf.constant(1.0)
b = tf.constant(2.0)
with tf.device("/gpu:0"):
    c = tf.add(a, b)
assert c.numpy() == 3.0

Listing 5. Executing a GPU operation with inputs on the CPU.

Because graph functions are executed via a primitive operation, it is also possible to use the device context manager to run graph functions on various devices. If operations inside the graph function are explicitly placed on another device, they override the outer device context.
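A sketch of placing a staged computation with the same context manager; the graph function is the decorated select discussed in §4.1, and the GPU name assumes such a device is available:

with tf.device("/gpu:0"):
    y = select(x)   # the function-call operation, and with it the graph
                    # function's kernels, is dispatched to the GPU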
Graph functions can serve as a unit of compilation for accelerators; we use this to efficiently execute code on TPUs. When a staged computation is placed on a TPU, TensorFlow Eager automatically invokes XLA to compile the graph and produce a TPU-compatible executable. TensorFlow Eager does make it possible to execute code imperatively on TPUs, but the overhead of compiling operations for TPU and dispatching the generated code is significant. When amortized over a large graph function, this overhead becomes negligible (see §6 for a quantitative example). Note that this programming model is similar to JAX (Frostig et al., 2018), which provides a Python decorator that JIT-compiles functions via tracing and XLA. Finally, compiling staged computations through XLA provides us more opportunities for optimization, including layout optimization, instruction scheduling for concurrency, and operation fusion. Techniques like tensor re-materialization can make it possible to fit a staged model into TPU memory when it would be impossible to do so on an operation-by-operation basis.

4.5 Distribution

The current system supports distributed execution with a single central server running the main (typically Python) program and several worker servers running on remote hosts. Each worker server adds its locally available devices (for example, CPUs, GPUs, or TPUs) to the pool of devices available to the main program. The main program can then execute operations or whole graph functions on remote devices through the worker servers.

The remote devices are identified by application-level names. The names contain the job name, the task inside the job, as well as the specific device available for the task, for example, "/job:training/task:2/device:GPU:0". When a server is brought up to be a part of a cluster, it is given the mapping from the application-level names to specific server instances identified by DNS names or IP addresses.

To run an operation on a remote device, the user uses the same syntax as for local devices (see §4.4) but uses a remote device name instead of the local device name. Tensors produced as the result of running an operation on a remote device stay on the remote device. Users can then either perform more operations on these tensors or copy them to the central server (e.g., to use their value in an if statement). Some computations running on remote devices can directly communicate and synchronize with each other. In such cases, developers need to start these computations concurrently.
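Assuming a cluster whose servers were started with the name mapping described above, remote placement reuses the device context manager with a fully qualified name; the copy-back call at the end is an assumption about how one might retrieve the value, not an API mandated here:

with tf.device("/job:training/task:2/device:GPU:0"):
    w = tf.random_normal((1024, 1024))
    y = tf.matmul(w, w)    # executed on the remote worker's GPU; y stays there
y_local = y.cpu()          # assumed copy back toward the central program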
@tf.contrib.eager.function
def lossy_matmul(W, x, training=True):
    outputs = tf.matmul(W, x)
    if training:
        outputs = tf.nn.dropout(outputs, 0.2)
    return outputs

W = tf.random_normal((3, 5))
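A plausible way to exercise the listing above; because training is a Python-level value rather than a tensor, the branch taken is baked into each trace (§4.1), and distinct values are handled by retracing, consistent with the polymorphism discussion referenced in §4.6:

x = tf.random_normal((5, 1))
train_out = lossy_matmul(W, x, training=True)   # trace that includes the dropout op
eval_out = lossy_matmul(W, x, training=False)   # a separate trace without dropout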
@tf.contrib.eager.function
def outer(a, b):
    return inner(tf.matmul(a, b))

outer(tf.eye(3), tf.diag([-1.0, 1.0, 2.0]))

Listing 8. Graph functions can be nested.
…programmer requests the runtime to fetch the concrete values of some set of tensors resident in the graph. This amounts to a discrepancy between what is expressed in Python and what is executed by the TensorFlow runtime. To provide a more Pythonic programming model, TensorFlow Eager represents each staged computation as a graph function, i.e., a graph with named inputs and outputs, representing the exact computation of interest. This approach still allows for graph optimizations: for example, non-stateful operations that are not reachable from the outputs of a function are pruned, just as in TensorFlow.

Graph functions provide benefits outside the realm of usability as well. Because graph functions are executed via an operation, we get function composition for free. In the context of single-coordinator distributed training, in which a single subgraph is executed by N workers, graph functions can reduce memory pressure on the coordinator: the coordinator only needs to own a graph function that contains N function-call operations (instead of N copies of a subgraph).

[Figure: examples / second and % improvement versus batch size, comparing TensorFlow Eager with function ("TFE + function") against TensorFlow ("TF").]

batch size                        1       2       4       8       16      32
TensorFlow Eager                  1.06    1.99    4.3     8.4     16.6    30.3
TensorFlow Eager with function    21.7    42.6    83.9    165.8   197.7   241.9
Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., Ballesteros, M., Chiang, D., Clothiaux, D., Cohn, T., Duh, K., Faruqui, M., Gan, C., Garrette, D., Ji, Y., Kong, L., Kuncoro, A., Kumar, G., Malaviya, C., Michel, P., Oda, Y., Richardson, M., Saphra, N., Swayamdipta, S., and Yin, P. DyNet: The dynamic neural network toolkit. arXiv e-prints, art. arXiv:1701.03980, Jan 2017.

Oliphant, T. E. Guide to NumPy. CreateSpace Independent Publishing Platform, USA, 2nd edition, 2015.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the 31st Conference on Neural Information Processing Systems, 2017.

Sujeeth, A., Lee, H., Brown, K., Rompf, T., Chafi, H., Wu, M., Atreya, A., Odersky, M., and Olukotun, K. OptiML: an implicitly parallel domain-specific language for machine learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 609-616, 2011.

The XLA team. XLA - TensorFlow, compiled. https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html, 2017.

Tokui, S., Oono, K., and Hido, S. Chainer: a next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in the 29th Annual Conference on Neural Information Processing Systems, volume 5, pp. 1-6, 2015.

Wei, R., Schwartz, L., and Adve, V. DLVM: A modern compiler infrastructure for deep learning systems. arXiv e-prints, art. arXiv:1711.03016, Nov 2017.