A Tour of TensorFlow


Proseminar Data Mining


Peter Goldsborough
Fakultät für Informatik
Technische Universität München
Email: [email protected]

arXiv:1610.01178v1 [cs.LG] 1 Oct 2016

Abstract: Deep learning is a branch of artificial intelligence employing deep neural network architectures that has significantly advanced the state-of-the-art in computer vision, speech recognition, natural language processing and other domains. In November 2015, Google released TensorFlow, an open source deep learning software library for defining, training and deploying machine learning models. In this paper, we review TensorFlow and put it in context of modern deep learning concepts and software. We discuss its basic computational paradigms and distributed execution model, its programming interface as well as accompanying visualization toolkits. We then compare TensorFlow to alternative libraries such as Theano, Torch or Caffe on a qualitative as well as quantitative basis and finally comment on observed use-cases of TensorFlow in academia and industry.

Index Terms: Artificial Intelligence, Machine Learning, Neural Networks, Distributed Computing, Open source software, Software packages

I. INTRODUCTION

Modern artificial intelligence systems and machine learning algorithms have revolutionized approaches to scientific and technological challenges in a variety of fields. We can observe remarkable improvements in the quality of state-of-the-art computer vision, natural language processing, speech recognition and other techniques. Moreover, the benefits of recent breakthroughs have trickled down to the individual, improving everyday life in numerous ways. Personalized digital assistants, recommendations on e-commerce platforms, financial fraud detection, customized web search results and social network feeds as well as novel discoveries in genomics have all been improved, if not enabled, by current machine learning methods.

A particular branch of machine learning, deep learning, has proven especially effective in recent years. Deep learning is a family of representation learning algorithms employing complex neural network architectures with a high number of hidden layers, each composed of simple but non-linear transformations to the input data. Given enough such transformation modules, very complex functions may be modeled to solve classification, regression, transcription and numerous other learning tasks [1].

It is noteworthy that the rise in popularity of deep learning can be traced back to only the last few years, enabled primarily by the greater availability of large data sets, containing more training examples; the efficient use of graphical processing units (GPUs) and massively parallel commodity hardware to train deep learning models on these equally massive data sets as well as the discovery of new methods such as the rectified linear unit (ReLU) activation function or dropout as a regularization technique [1]–[4].

While deep learning algorithms and individual architectural components such as representation transformations, activation functions or regularization methods may initially be expressed in mathematical notation, they must eventually be transcribed into a computer program for real world usage. For this purpose, there exist a number of open source as well as commercial machine learning software libraries and frameworks. Among these are Theano [5], Torch [6], scikit-learn [7] and many more, which we review in further detail in Section II of this paper. In November 2015, this list was extended by TensorFlow, a novel machine learning software library released by Google [8]. As per the initial publication, TensorFlow aims to be "an interface for expressing machine learning algorithms" in "large-scale [...] on heterogeneous distributed systems" [8].

The remainder of this paper aims to give a thorough review of TensorFlow and put it in context of the current state of machine learning. In detail, the paper is further structured as follows. Section II will provide a brief overview and history of machine learning software libraries, listing but not comparing projects similar to TensorFlow. Subsequently, Section III discusses in depth the computational paradigms underlying TensorFlow. In Section IV we explain the current programming interface in the various supported languages. To inspect and debug models expressed in TensorFlow, there exist powerful visualization tools, which we examine in Section V. Section VI then gives a comparison of TensorFlow and alternative deep learning libraries on a qualitative as well as quantitative basis. Before concluding our review in Section VIII, Section VII studies current real world use cases of TensorFlow in literature and industry.

II. HISTORY OF MACHINE LEARNING LIBRARIES

In this section, we aim to give a brief overview and key milestones in the history of machine learning software libraries. We begin with a review of libraries suitable for a wide range of machine learning and data analysis purposes, reaching back more than 20 years. We then perform a more focused study of recent programming frameworks suited especially to the task of deep learning. Figure 1 visualizes this section in a timeline. We wish to emphasize that this section does in no
way compare TensorFlow, as we have dedicated Section VI to this specific purpose.

A. General Machine Learning

In the following paragraphs we list and briefly review a small set of general machine learning libraries in chronological order. By "general", we mean any library whose common use cases in the machine learning and data science community include but are not limited to deep learning. As such, these libraries may be used for statistical analysis, clustering, dimensionality reduction, structured prediction, anomaly detection, shallow (as opposed to deep) neural networks and other tasks.

We begin our review with a library published 21 years before TensorFlow: MLC++ [9]. MLC++ is a software library developed in the C++ programming language providing algorithms alongside a comparison framework for a number of data mining, statistical analysis as well as pattern recognition techniques. It was originally developed at Stanford University in 1994 and is now owned and maintained by Silicon Graphics, Inc. (SGI¹). To the best of our knowledge, MLC++ is the oldest machine learning library still available today.

Following MLC++ in chronological order, OpenCV² (Open Computer Vision) was released in the year 2000 by Bradski et al. [10]. It is aimed primarily at solving learning tasks in the field of computer vision and image recognition, including a collection of algorithms for face recognition, object identification, 3D-model extraction and other purposes. It is released under a BSD license and provides interfaces in multiple programming languages such as C++, Python and MATLAB.

Another machine learning library we wish to mention is scikit-learn³ [7]. The scikit-learn project was originally developed by David Cournapeau as part of the Google Summer of Code program⁴ in 2008. It is an open source machine learning library written in Python, on top of the NumPy, SciPy and matplotlib frameworks. It is useful for a large class of both supervised and unsupervised learning problems.

The Accord.NET⁵ library stands apart from the aforementioned examples in that it is written in the C# ("C Sharp") programming language. Released in 2008, it is composed not only of a variety of machine learning algorithms, but also signal processing modules for speech and image recognition [11].

Massive Online Analysis⁶ (MOA) is an open source framework for online and offline analysis of massive, potentially infinite, data streams. MOA includes a variety of tools for classification, regression, recommender systems and other disciplines. It is written in the Java programming language and maintained by staff of the University of Waikato, New Zealand. It was conceived in 2010 [12].

The Mahout⁷ project, part of the Apache Software Foundation⁸, is a Java programming environment for scalable machine learning applications, built on top of the Apache Hadoop⁹ platform. It allows for analysis of large datasets distributed in the Hadoop Distributed File System (HDFS) using the MapReduce programming paradigm. Mahout provides machine learning algorithms for classification, clustering and filtering.

Pattern¹⁰ is a Python machine learning module we include in our list due to its rich set of web mining facilities. It comprises not only general machine learning algorithms (e.g. clustering, classification or nearest neighbor search) and natural language processing methods (e.g. n-gram search or sentiment analysis), but also a web crawler that can, for example, fetch Tweets or Wikipedia entries, facilitating quick data analysis on these sources. It was published by the University of Antwerp in 2012 and is open source.

Lastly, Spark MLlib¹¹ is an open source machine learning and data analysis platform released in 2015 and built on top of the Apache Spark¹² project [13], a fast cluster computing system. Similar to Apache Mahout, it supports processing of large scale distributed datasets and training of machine learning models across a cluster of commodity hardware. For this, it includes classification, regression, clustering and other machine learning algorithms [14].

B. Deep Learning

While the software libraries mentioned in the previous section are useful for a great variety of different machine learning and statistical analysis tasks, the following paragraphs list software frameworks especially effective in training deep learning models.

The first and oldest framework in our list suited to the development and training of deep neural networks is Torch¹³, released as early as 2002 [6]. Torch consisted originally of a pure C++ implementation and interface. Today, its core is implemented in C/CUDA while it exposes an interface in the Lua¹⁴ scripting language. For this, Torch makes use of a LuaJIT (just-in-time) compiler to connect Lua routines to the underlying C implementations. It includes, inter alia, numerical optimization routines, neural network models as well as general purpose n-dimensional array (tensor) objects.

¹ https://www.sgi.com/tech/mlc/
² http://opencv.org
³ http://scikit-learn.org/stable/
⁴ https://summerofcode.withgoogle.com
⁵ http://accord-framework.net/index.html
⁶ http://moa.cms.waikato.ac.nz
⁷ http://mahout.apache.org
⁸ http://www.apache.org
⁹ http://hadoop.apache.org
¹⁰ http://www.clips.ua.ac.be/pages/pattern
¹¹ http://spark.apache.org/mllib
¹² http://spark.apache.org/
¹³ http://torch.ch
¹⁴ https://www.lua.org
¹⁵ http://deeplearning.net/software/theano/

[Figure 1: timeline from 1992 to 2016 marking MLC++, OpenCV, Torch, scikit, Accord, Theano, MOA, Mahout, Pattern, Caffe, DL4J, cuDNN, MLlib and TensorFlow.]
Fig. 1: A timeline showing the release of machine-learning libraries discussed in Section II in the last 25 years.

Theano¹⁵, released in 2008 [5], is another noteworthy deep learning library. We note that while Theano enjoys great popularity among the machine learning community, it is, in essence, not a machine learning library at all. Rather, it is a

programming framework that allows users to declare mathematical expressions symbolically, as computational graphs. These are then optimized, eventually compiled and finally executed on either CPU or GPU devices. As such, [5] labels Theano a "mathematical compiler".

Caffe¹⁶ is an open source deep learning library maintained by the Berkeley Vision and Learning Center (BVLC). It was released in 2014 under a BSD license [15]. Caffe is implemented in C++ and uses neural network layers as its basic computational building blocks (as opposed to Theano and others, where the user must define individual mathematical operations making up layers). A deep learning model, consisting of many such layers, is stored in the Google Protocol Buffer format. While models can be defined manually in this Protocol Buffer "language", there exist bindings to Python and MATLAB to generate them programmatically. Caffe is especially well suited to the development and training of convolutional neural networks (CNNs or ConvNets), used extensively in the domain of image recognition.

While the aforementioned machine learning frameworks allowed for the definition of deep learning models in Python, MATLAB and Lua, the Deeplearning4J¹⁷ (DL4J) library also enables the Java programmer to create deep neural networks. DL4J includes functionality to create Restricted Boltzmann machines, convolutional and recurrent neural networks, deep belief networks and other types of deep learning models. Moreover, DL4J enables horizontal scalability using distributed computing platforms such as Apache Hadoop or Spark. It was released in 2014 by Adam Gibson under an Apache 2.0 open source license.

Lastly, we add the NVIDIA Deep Learning SDK¹⁸ to this list. Its main goal is to maximize the performance of deep learning algorithms on (NVIDIA) GPUs. The SDK consists of three core modules. The first, cuDNN, provides high performance GPU implementations for deep learning algorithms such as convolutions, activation functions and tensor transformations. The second is a linear algebra library, cuBLAS, enabling GPU-accelerated mathematical operations on n-dimensional arrays. Lastly, cuSPARSE includes a set of routines for sparse matrices tuned for high efficiency on GPUs. While it is possible to program in these libraries directly, there also exist bindings to other deep learning libraries, such as Torch¹⁹.

¹⁶ http://caffe.berkeleyvision.org
¹⁷ http://deeplearning4j.org
¹⁸ https://developer.nvidia.com/deep-learning-software
¹⁹ https://github.com/soumith/cudnn.torch

III. THE TENSORFLOW PROGRAMMING MODEL

In this section we provide an in-depth discussion of the abstract computational principles underlying the TensorFlow software library. We begin with a thorough examination of the basic structural and architectural decisions made by the TensorFlow development team and explain how machine learning algorithms may be expressed in its dataflow graph language. Subsequently, we study TensorFlow's execution model and provide insight into the way TensorFlow graphs are assigned to available hardware units in a local as well as distributed environment. Then, we investigate the various optimizations incorporated into TensorFlow, targeted at improving both software and hardware efficiency. Lastly, we list extensions to the basic programming model that aid the user in both computational as well as logistical aspects of training a machine learning model with TensorFlow.

A. Computational Graph Architecture

In TensorFlow, machine learning algorithms are represented as computational graphs. A computational or dataflow graph is a form of directed graph where vertices or nodes describe operations, while edges represent data flowing between these operations. If an output variable z is the result of applying a binary operation to two inputs x and y, then we draw directed edges from x and y to an output node representing z and annotate the vertex with a label describing the performed computation. Examples for computational graphs are given in Figure 2. The following paragraphs discuss the principal elements of such a dataflow graph, namely operations, tensors, variables and sessions.

1) Operations: The major benefit of representing an algorithm in the form of a graph is not only the intuitive (visual) expression of dependencies between units of a computational model, but also the fact that the definition of a node within the graph can be kept very general. In TensorFlow, nodes represent operations, which in turn express the combination or transformation of data flowing through the graph [8]. An operation can have zero or more inputs and produce zero or more outputs. As such, an operation may represent a mathematical
equation, a variable or constant, a control flow directive, a file I/O operation or even a network communication port. It may seem unintuitive that an operation, which the reader may associate with a function in the mathematical sense, can represent a constant or variable. However, a constant may be thought of as an operation that takes no inputs and always produces the same output corresponding to the constant it represents. Analogously, a variable is really just an operation taking no input and producing the current state or value of that variable. Table I gives an overview of different kinds of operations that may be declared in a TensorFlow graph.

Any operation must be backed by an associated implementation. In [8] such an implementation is referred to as the operation's kernel. A particular kernel is always specifically built for execution on a certain kind of device, such as a CPU, GPU or other hardware unit.

[Figure 2: two computational graphs. Left: $z = x + y$. Right: $y = \sigma(x^\top w + b)$, built from a dot node, a + node and a $\sigma$ node.]
Fig. 2: Examples of computational graphs. The left graph displays a very simple computation, consisting of just an addition of the two input variables x and y. In this case, z is the result of the operation +, as the annotation suggests. The right graph gives a more complex example of computing a logistic regression variable y for some example vector x, weight vector w as well as a scalar bias b. As shown in the graph, y is the result of the sigmoid or logistic function $\sigma$.

TABLE I: Examples for TensorFlow operations [8].

Category                      Examples
Element-wise operations       Add, Mul, Exp
Matrix operations             MatMul, MatrixInverse
Value-producing operations    Constant, Variable
Neural network units          SoftMax, ReLU, Conv2D
Checkpoint operations         Save, Restore

2) Tensors: In TensorFlow, edges represent data flowing from one operation to another and are referred to as tensors. A tensor is a multi-dimensional collection of homogeneous values with a fixed, static type. The number of dimensions of a tensor is termed its rank. A tensor's shape is the tuple describing its size, i.e. the number of components, in each dimension. In the mathematical sense, a tensor is the generalization of two-dimensional matrices, one-dimensional vectors and also scalars, which are simply tensors of rank zero.

In terms of the computational graph, a tensor can be seen as a symbolic handle to one of the outputs of an operation. A tensor itself does not hold or store values in memory, but provides only an interface for retrieving the value referenced by the tensor. When creating an operation in the TensorFlow programming environment, such as for the expression x + y, a tensor object is returned. This tensor may then be supplied as input to other computations, thereby connecting the source and destination operations with an edge. By these means, data flows through a TensorFlow graph.

Next to regular tensors, TensorFlow also provides a SparseTensor data structure, allowing for a more space-efficient dictionary-like representation of sparse tensors with only a few non-zero entries.

3) Variables: In a typical situation, such as when performing stochastic gradient descent (SGD), the graph of a machine learning model is executed from start to end multiple times for a single experiment. Between two such invocations, the majority of tensors in the graph are destroyed and do not persist. However, it is often necessary to maintain state across evaluations of the graph, such as for the weights and parameters of a neural network. For this purpose, there exist variables in TensorFlow, which are simply special operations that can be added to the computational graph.

In detail, variables can be described as persistent, mutable handles to in-memory buffers storing tensors. As such, variables are characterized by a certain shape and a fixed type. To manipulate and update variables, TensorFlow provides the assign family of graph operations.

When creating a variable node for a TensorFlow graph, it is necessary to supply a tensor with which the variable is initialized upon graph execution. The shape and data type of the variable are then deduced from this initializer. Interestingly, the variable itself does not store this initial tensor. Rather, constructing a variable results in the addition of three distinct nodes to the graph:

1) The actual variable node, holding the mutable state.
2) An operation producing the initial value, often a constant.
3) An initializer operation, that assigns the initial value to the variable tensor upon evaluation of the graph.

An example for this is given in Figure 3.

4) Sessions: In TensorFlow, the execution of operations and evaluation of tensors may only be performed in a special environment referred to as a session. One of the responsibilities of a session is to encapsulate the allocation and management of resources such as variable buffers. Moreover, the Session interface of the TensorFlow library provides a run routine, which is the primary entry point for executing parts or the entirety of a computational graph. This method takes as input the nodes in the graph whose tensors should be computed and returned. Moreover, an optional mapping from arbitrary nodes in the graph to respective replacement values, referred to as feed nodes, may be supplied to run as well [8].
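To make the interplay of operations, variables and sessions more concrete, the following is a minimal sketch of a dataflow graph in plain Python. It is not the TensorFlow API: the names Node, Variable, Session, constant, add and mul are hypothetical, every operation is assumed to have exactly one output (so a node doubles as the handle to its output tensor), and values are plain floats rather than multi-dimensional tensors.

```python
# Toy dataflow graph for illustration only; all names are hypothetical.

class Node:
    """An operation with zero or more inputs and a single output."""

    def __init__(self, name, kernel, inputs=()):
        self.name = name
        self.kernel = kernel            # Python callable backing the operation
        self.inputs = tuple(inputs)     # incoming edges


def constant(value, name="const"):
    # A constant is an operation without inputs and with a fixed output.
    return Node(name, lambda: value)


def add(a, b):
    return Node("add", lambda x, y: x + y, (a, b))


def mul(a, b):
    return Node("mul", lambda x, y: x * y, (a, b))


class Session:
    """Owns variable buffers and evaluates requested nodes."""

    def __init__(self):
        self.buffers = {}

    def run(self, fetch, feeds=None):
        cache = dict(feeds or {})       # fed nodes are replaced by given values

        def evaluate(node):
            if node not in cache:
                args = [evaluate(parent) for parent in node.inputs]
                cache[node] = node.kernel(*args)
            return cache[node]

        return evaluate(fetch)


class Variable:
    """Adds the three nodes from the text: variable, initial value, initializer."""

    def __init__(self, initial_value, session):
        self.session = session
        self.initial = constant(initial_value, "initial_value")
        # The variable node takes no input and yields the current state.
        self.node = Node("variable", lambda: session.buffers[self])
        self.initializer = self.assign(self.initial)

    def assign(self, value_node):
        def kernel(value):
            self.session.buffers[self] = value
            return value
        return Node("assign", kernel, (value_node,))


session = Session()
x = constant(2.0, "x")
w = Variable(3.0, session)
y = add(mul(x, w.node), constant(1.0))      # y = x * w + 1

session.run(w.initializer)                  # state must be initialized first
first = session.run(y)                      # 2 * 3 + 1
fed = session.run(y, feeds={x: 10.0})       # x replaced by a fed value
session.run(w.assign(add(w.node, constant(1.0))))   # w <- w + 1
updated = session.run(y)                    # 2 * 4 + 1
print(first, fed, updated)
```

Note that nothing is computed while the graph is being built; values only exist during a call to run, and only the variable's buffer survives between two calls.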
Upon invocation of run, TensorFlow will start at the requested output nodes and work backwards, examining the graph dependencies and computing the full transitive closure of all nodes that must be executed. These nodes may then be assigned to one or many physical execution units (CPUs, GPUs etc.) on one or many machines. The rules by which this assignment takes place are determined by TensorFlow's placement algorithm, discussed in detail in Subsection III-B. Furthermore, as there exists the possibility to specify explicit orderings of node evaluations, called control dependencies, the execution algorithm will ensure that these dependencies are maintained.

[Figure 3: a variable node v, an initial-value node i and an assign node computing v = i, whose output is $v'$.]
Fig. 3: The three nodes that are added to the computational graph for every variable definition. The first, v, is the variable operation that holds a mutable in-memory buffer containing the value tensor of the variable. The second, i, is the node producing the initial value for the variable, which can be any tensor. Lastly, the assign node will set the variable to the initializer's value when executed. The assign node also produces a tensor referencing the initialized value $v'$ of the variable, such that it may be connected to other nodes as necessary (e.g. when using a variable as the initializer for another variable).

B. Execution Model

To execute computational graphs composed of the various elements just discussed, TensorFlow divides the tasks for its implementation among four distinct groups: the client, the master, a set of workers and lastly a number of devices. When the client requests evaluation of a TensorFlow graph via a Session's run routine, this query is sent to the master process, which in turn delegates the task to one or more worker processes and coordinates their execution. Each worker is subsequently responsible for overseeing one or more devices, which are the physical processing units for which the kernels of an operation are implemented.

Within this model, there are two degrees of scalability. The first degree pertains to scaling the number of machines on which a graph is executed. The second degree refers to the fact that on each machine, there may then be more than one device, such as, for example, five independent GPUs and/or three CPUs. For this reason, there exist two "versions" of TensorFlow, one for local execution on a single machine (but possibly many devices), and one supporting a distributed implementation across many machines and many devices. Figure 4 visualizes a possible distributed setup. While the initial release of TensorFlow supported only single-machine execution, the distributed version was open-sourced on April 13, 2016 [16].

[Figure 4: a client issues run to the master, which coordinates worker A (devices GPU0, CPU0, ...) and worker B (devices CPU0, CPU1, ...).]
Fig. 4: A visualization of the different execution agents in a multi-machine, multi-device hardware configuration.

1) Devices: Devices are the smallest, most basic entities in the TensorFlow execution model. All nodes in the graph, that is, the kernel of each operation, must eventually be mapped to an available device to be executed. In practice, a device will most often be either a CPU or a GPU. However, TensorFlow supports registration of further kinds of physical execution units by the user. For example, in May 2016, Google announced its Tensor Processing Unit (TPU), a custom built ASIC (application-specific integrated circuit) optimized specifically for fast tensor computations [17]. It is thus easy to integrate new device classes as novel hardware emerges.

To oversee the evaluation of nodes on a device, a worker process is spawned by the master. As a worker process may manage one or many devices on a single machine, a device is identified not only by a name, but also an index for its worker group. For example, the first CPU in a particular group may be identified by the string "/cpu:0".

2) Placement Algorithm: To determine what nodes to assign to which device, TensorFlow makes use of a placement algorithm. The placement algorithm simulates the execution of the computational graph and traverses its nodes from input tensors to output tensors. To decide on which of the available devices $D = \{d_1, \dots, d_n\}$ to place a given node $\nu$ encountered during this traversal, the algorithm consults a cost model $C_\nu(d)$. This cost model takes into account four pieces of information to determine the optimal device $\hat{d} = \arg\min_{d \in D} C_\nu(d)$ on which to place the node during execution:

1) Whether or not there exists an implementation (kernel) for a node on the given device at all. For example, if there is no GPU kernel for a particular operation, any GPU device would automatically incur an infinite cost.
2) Estimates of the size (in bytes) for a node's input and output tensors.
3) The expected execution time for the kernel on the device.
4) A heuristic for the cost of cross-device (and possibly cross-machine) transmission of the input tensors to the operation, in the case that the input tensors have been placed on devices different from the one currently under consideration.
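The four criteria above can be sketched as a greedy placement routine. This is only an illustration under stated assumptions, not TensorFlow's actual placer: the kernel timings, the single link bandwidth, the dictionary-based graph layout and the function names cost and place are all hypothetical.

```python
import math

# Hypothetical kernel table: expected seconds per kernel and device type.
KERNEL_TIME = {
    "MatMul": {"cpu": 8e-3, "gpu": 1e-3},
    "Add": {"cpu": 1e-4, "gpu": 1e-4},
    "Save": {"cpu": 5e-3},                 # no GPU kernel is registered
}
LINK_BANDWIDTH = 1e9                       # assumed bytes per second between devices


def cost(node, device, placement, graph):
    """Toy cost model: estimated cost of running `node` on `device`."""
    kernels = KERNEL_TIME[node["op"]]
    kind = device.strip("/").split(":")[0]         # "/gpu:0" -> "gpu"
    if kind not in kernels:                        # criterion 1: no kernel
        return math.inf
    total = kernels[kind]                          # criterion 3: execution time
    for name in node["inputs"]:
        if placement[name] != device:              # criterion 4: transfer cost,
            size = graph[name]["output_bytes"]     # criterion 2: tensor size
            total += size / LINK_BANDWIDTH
    return total


def place(graph, devices):
    """Visit nodes from inputs to outputs and pick the cheapest device each time.

    `graph` is assumed to list its nodes in topological order.
    """
    placement = {}
    for name, node in graph.items():
        placement[name] = min(
            devices, key=lambda d: cost(node, d, placement, graph))
    return placement


graph = {
    "a": {"op": "MatMul", "inputs": [], "output_bytes": 4_000_000},
    "b": {"op": "Add", "inputs": ["a"], "output_bytes": 4_000_000},
    "c": {"op": "Save", "inputs": ["b"], "output_bytes": 0},
}
placement = place(graph, ["/cpu:0", "/gpu:0"])
print(placement)
```

In this example the addition stays on the GPU because moving its input would cost more than the kernel itself, while the checkpoint operation is forced onto the CPU by the missing GPU kernel.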

3) Cross-Device Execution: If the hardware configuration of the user's system provides more than one device, the placement algorithm will often distribute a graph's nodes among these devices. This can be seen as partitioning the set of nodes into classes, one per device. As a consequence, there may be cross-device dependencies between nodes that must be handled via a number of additional steps. Let us consider for this two devices A and B with particular focus on a node $\nu$ on device A. If $\nu$'s output tensor forms the input to some other operations $\alpha$, $\beta$ on device B, there initially exist cross-device edges $\nu \to \alpha$ and $\nu \to \beta$ from device A to device B. This is visualized in Figure 5a.

In practice, there must be some means of transmitting $\nu$'s output tensor from A, say a GPU device, to B, maybe a CPU device. For this reason, TensorFlow initially replaces the two edges $\nu \to \alpha$ and $\nu \to \beta$ by three new nodes. On device A, a send node is placed and connected to $\nu$. In tandem, on device B, two recv nodes are instantiated and attached to $\alpha$ and $\beta$, respectively. The send and recv nodes are then connected by two additional edges. This step is shown in Figure 5b. During execution of the graph, cross-device communication of data occurs exclusively via these special nodes. When the devices are located on separate machines, transmission between the worker processes on these machines may involve remote communication protocols such as TCP or RDMA.

Finally, an important optimization made by TensorFlow at this step is "canonicalization" of (send, receive) pairs. In the setup displayed in Figure 5b, the existence of each recv node on device B would imply allocation and management of a separate buffer to store $\nu$'s output tensor, so that it may then be fed to nodes $\alpha$ and $\beta$, respectively. However, an equivalent and more efficient transformation places only one recv node on device B, streams all output from $\nu$ to this single node, and then to the two dependent nodes $\alpha$ and $\beta$. This last and final evolution is given in Figure 5c.

[Figure 5: three panels (a), (b) and (c), each showing Device A and Device B. Panel (b) contains one send and two recv nodes, panel (c) one send and one recv node.]
Fig. 5: The three stages of cross-device communication between graph nodes in TensorFlow. Figure 5a shows the initial, conceptual connections between nodes on different devices. Figure 5b gives a more practical overview of how data is actually transmitted across devices using send and recv nodes. Lastly, Figure 5c shows the final, canonicalized setup, where there is at most one recv node per destination device.

C. Optimizations

To ensure a maximum of efficiency and performance of the TensorFlow execution model, a number of optimizations are built into the library. In this subsection, we examine three such improvements: common subgraph elimination, execution scheduling and finally lossy compression.

1) Common Subgraph Elimination: An optimization performed by many modern compilers is common subexpression elimination, whereby a compiler may possibly replace the computation of an identical value two or more times by a single instance of that computation. The result is then stored in a temporary variable and reused where it was previously recalculated. Similarly, in a TensorFlow graph, it may occur that the same operation is performed on identical inputs more than once. This can be inefficient if the computation happens to be an expensive one. Moreover, it may incur a large memory overhead given that the result of that operation must be held in memory multiple times. Therefore, TensorFlow also employs a common subexpression, or, more aptly put, common subgraph elimination pass prior to execution. For this, the computational graph is traversed and every time two or more operations of the same type (e.g. MatMul) receiving the same input tensors are encountered, they are canonicalized to only one such subgraph. The output tensor of that single operation is then redirected to all dependent nodes. Figure 6 gives an example of common subgraph elimination.

2) Scheduling: A simple yet powerful optimization is to schedule node execution as late as possible. Ensuring that the results of operations remain in memory only for the minimum required amount of time reduces peak memory consumption and can thus greatly improve the overall performance of the system. The authors of [8] note that this is especially vital on devices such as GPUs, where memory resources are scarce. Furthermore, careful scheduling also pertains to the activation of send and recv nodes, where not only memory but also network resources are contested.

3) Lossy Compression: One of the primary goals of many machine learning algorithms used for classification, recognition or other tasks is to build robust models. By "robust" we mean that an optimally trained model should ideally not change its response if it is first fed a signal and then a
noisy variation of that signal. As such, these machine learning algorithms typically do not require high precision arithmetic as provided by standard IEEE 754 32-bit floating point values. Rather, a format with 16 fewer bits of precision in the mantissa would do just as well. For this reason, another optimization performed by TensorFlow is the internal addition of conversion nodes to the computational graph, which convert such high-precision 32-bit floating-point values to truncated 16-bit representations when communicating across devices and across machines. On the receiving end, the truncated representation is converted back to 32 bits simply by filling in zeros, rather than rounding [8].

Fig. 6: An example of how common subgraph elimination is used to transform the equations $z = x + y$, $z' = x + y$, $z^2 = z \cdot z'$ to just two equations $z = x + y$ and $z^2 = z \cdot z$. This computation could theoretically be optimized further to a square operation requiring only one input (thus reducing the cost of data movement), though it is not known if TensorFlow employs such secondary canonicalization.

D. Additions to the Basic Programming Model

Having discussed the basic computational paradigms and execution model of TensorFlow, we will now review three more advanced topics that we deem highly relevant for anyone wishing to use TensorFlow to create machine learning algorithms. First, we discuss how TensorFlow handles gradient back-propagation, an essential concept for many deep learning applications. Then, we study how TensorFlow graphs support control flow. Lastly, we briefly touch upon the topic of checkpoints, as they are very useful for maintenance of large models.

1) Back-Propagation Nodes: In a large number of deep learning and other machine learning algorithms, it is necessary to compute the gradients of particular nodes of the computational graph with respect to one or many other nodes. For example, in a neural network, we may compute the cost $c$ of the model for a given example $x$ by passing that example through a series of non-linear transformations. If the neural network consists of two hidden layers represented by functions $f(x; w) = f_x(w)$ and $g(x; w) = g_x(w)$ with internal weights $w$, we can express the cost for that example as $c = (f_x \circ g_x)(w) = f_x(g_x(w))$. We would then typically calculate the gradient $dc/dw$ of that cost with respect to the weights and use it to update $w$. Often, this is done by means of the back-propagation algorithm, which traverses the graph in reverse to compute the chain rule $[f_x(g_x(w))]' = f_x'(g_x(w)) \cdot g_x'(w)$.

In [18], two approaches for back-propagating gradients through a computational graph are described. The first, which the authors refer to as symbol-to-number differentiation, receives a set of input values and then computes the numerical values of the gradients at those input values. It does so by explicitly traversing the graph first in the forward order (forward-propagation) to compute the cost, then in reverse order (back-propagation) to compute the gradients via the chain rule. Another approach, more relevant to TensorFlow, is what [18] calls symbol-to-symbol derivatives and [8] terms automatic gradient computation. In this case, gradients are not computed by an explicit implementation of the back-propagation algorithm. Rather, special nodes are added to the computational graph that calculate the gradient of each operation and thus ultimately the chain rule. To perform back-propagation, these nodes must then simply be executed like any other nodes by the graph evaluation engine. As such, this approach does not produce the desired derivatives as a numeric value, but only as a symbolic handle to compute those values.

When TensorFlow needs to compute the gradient of a particular node $\nu$ with respect to some other tensor $\alpha$, it traverses the graph in reverse order from $\nu$ to $\alpha$. Each operation $o$ encountered during this traversal represents a function depending on $\alpha$ and is one of the links in the chain $(\nu \circ \dots \circ o \circ \dots)(\alpha)$ producing the output tensor of the graph. Therefore, TensorFlow adds a gradient node for each such operation $o$ that takes the gradient of the previous link (the outer function) and multiplies it with its own gradient. At the end of the traversal, there will be a node providing a symbolic handle to the overall target derivative $d\nu/d\alpha$, which implicitly implements the back-propagation algorithm. It should now be clear that back-propagation in this symbol-to-symbol approach is just another operation, requiring no exceptional handling. Figure 7 shows how a computational graph may look before and after gradient nodes are added.

In [8] it is noted that symbol-to-symbol derivatives may incur a considerable performance cost and especially result in increased memory overhead. To see why, it is important to understand that there exist two equivalent formulations of the chain rule. The first reuses previous computations and therefore requires them to be stored longer than strictly necessary for forward-propagation. For arbitrary functions $f$,
$g$ and $h$ it is given in Equation 1:

$$\frac{df}{dw} = f'(y) \cdot g'(x) \cdot h'(w) \quad \text{with } y = g(x),\ x = h(w) \tag{1}$$

The second possibility for computing the chain rule was already shown, where each function recomputes all of its arguments and invokes every function it depends on. It is given in Equation 2 for reference:

$$\frac{df}{dw} = f'(g(h(w))) \cdot g'(h(w)) \cdot h'(w) \tag{2}$$

According to [8], TensorFlow currently employs the first approach. Given that the inner-most functions must be recomputed for almost every link of the chain if this approach is not employed, and taking into consideration that this chain may consist of many hundreds or thousands of operations, this choice seems sensible. However, on the flip side, keeping tensors in memory for long periods of time is also not optimal, especially on devices like GPUs where memory resources are scarce. For Equation 2, memory held by a tensor could in theory be freed as soon as the tensor has been processed by its graph dependencies. For this reason, in [8] the development team of TensorFlow states that recomputing certain tensors rather than keeping them in memory may be a possible performance improvement for the future.

Fig. 7: A computational graph before (7a) and after (7b) gradient nodes are added. In this symbol-to-symbol approach, the gradient $dz/dw$ is simply an operation like any other and therefore requires no special handling by the graph evaluation engine.

2) Control Flow: Some machine learning algorithms may benefit from being able to control the flow of their execution, performing certain steps only under a particular condition or repeating some computation a fixed or variable number of times. For this, TensorFlow provides a set of control flow primitives including if-conditionals and while-loops. The possibility of loops is the reason why a TensorFlow computational graph is not necessarily acyclic. If the number of iterations of a loop were fixed and known at graph compile-time, its body could be unrolled into an acyclic sequence of computations, one per loop iteration [5]. However, to support a variable number of iterations, TensorFlow is forced to jump through an additional set of hoops, as described in [8].

One aspect that must be especially cared for when introducing control flow is back-propagation. In the case of a conditional, where an if-operation returns either one or the other tensor, it must be known which branch was taken by the node during forward-propagation so that gradient nodes are added only to this branch. Moreover, when a loop body (which may be a small graph) was executed a certain number of times, the gradient computation does not only need to know the number of iterations performed, but also requires access to each intermediary value produced. This technique of stepping through a loop in reverse to compute the gradients is referred to as back-propagation through time in [5].

3) Checkpoints: Another extension to TensorFlow's basic programming model is the notion of checkpoints, which allow for persistent storage and recovery of variables. It is possible to add Save nodes to the computational graph and connect them to variables whose tensors you wish to serialize. Furthermore, a variable may be connected to a Restore operation, which deserializes a stored tensor at a later point. This is especially useful when training a model over a long period of time to keep track of the model's performance while reducing the risk of losing any progress made. Also, checkpoints are a vital element to ensuring fault tolerance in a distributed environment [8].

IV. THE TENSORFLOW PROGRAMMING INTERFACE

Having conveyed the abstract concepts of TensorFlow's computational model in Section III, we will now concretize those ideas and speak to TensorFlow's programming interface. We begin with a brief discussion of the available language interfaces. Then, we provide a more hands-on look at TensorFlow's Python API by walking through a simple practical example. Lastly, we give insight into what higher-level abstractions exist for TensorFlow's API, which are especially beneficial for rapid prototyping of machine learning models.

A. Interfaces

There currently exist two programming interfaces, in C++ and Python, that permit interaction with the TensorFlow backend. The Python API boasts a very rich feature set for creation and execution of computational graphs. As of this writing, the C++ interface (which is really just the core backend implementation) provides a comparatively much more limited API, allowing only the execution of graphs built with Python and serialized to Google's Protocol Buffer²⁰ format. While there is experimental support for also building computational graphs in C++, this functionality is currently not as extensive as in Python.

²⁰ https://developers.google.com/protocol-buffers/
²¹ http://www.numpy.org

It is noteworthy that the Python API integrates very well with NumPy²¹, a popular open source Python numeric and
scientific programming library. As such, TensorFlow tensors may be interchanged with NumPy ndarrays in many places.

B. Walkthrough

In the following paragraphs we give a step-by-step walkthrough of a practical, real-world example of TensorFlow's Python API. We will train a simple multi-layer perceptron (MLP) with one input and one output layer to classify handwritten digits in the MNIST²² dataset. In this dataset, the examples are small images of 28 × 28 pixels depicting handwritten digits in {0, ..., 9}. We receive each such example as a flattened vector of 784 gray-scale pixel intensities. The label for each example is the digit it is supposed to represent.

²² http://yann.lecun.com/exdb/mnist/

We begin our walkthrough by importing the TensorFlow library and reading the MNIST dataset into memory. For this we assume a utility module mnist_data with a method read which expects a path to extract and store the dataset. Moreover, we pass the parameter one_hot=True to specify that each label be given to us as a one-hot-encoded vector $(d_1, \dots, d_{10})^\top$ where all but the i-th component are set to zero if an example represents the digit i:

    import tensorflow as tf
    import mnist_data  # the utility module assumed above

    # Download and extract the MNIST data set.
    # Retrieve the labels as one-hot-encoded vectors.
    mnist = mnist_data.read("/tmp/mnist", one_hot=True)

Next, we create a new computational graph via the tf.Graph constructor. To add operations to this graph, we must register it as the default graph. The way the TensorFlow API is designed, library routines that create new operation nodes always attach these to the current default graph. We register our graph as the default by using it as a Python context manager in a with-as statement:

    # Create a new graph
    graph = tf.Graph()

    # Register the graph as the default one to add nodes
    with graph.as_default():
        # Add operations ...

We are now ready to populate our computational graph with operations. We begin by adding two placeholder nodes examples and labels. Placeholders are special variables that must be replaced with concrete tensors upon graph execution. That is, they must be supplied in the feed_dict argument to Session.run(), mapping tensors to replacement values. For each such placeholder, we specify a shape and data type. An interesting feature of TensorFlow at this point is that we may specify the Python keyword None for the first dimension of each placeholder shape. This allows us to later on feed a tensor of variable size in that dimension. For the column size of the example placeholder, we specify the number of features for each image, meaning the 28 × 28 = 784 pixels. The label placeholder should expect 10 columns, corresponding to the 10-dimensional one-hot-encoded vector for each label digit:

    # Using a 32-bit floating-point data type tf.float32
    examples = tf.placeholder(tf.float32, [None, 784])
    labels = tf.placeholder(tf.float32, [None, 10])

Given an example matrix $X \in \mathbb{R}^{n \times 784}$ containing $n$ images, the learning task then applies an affine transformation $X \cdot W + b$, where $W$ is a weight matrix $\in \mathbb{R}^{784 \times 10}$ and $b$ a bias vector $\in \mathbb{R}^{10}$. This yields a new matrix $Y \in \mathbb{R}^{n \times 10}$, containing the scores or logits of our model for each example and each possible digit. These scores are more or less arbitrary values and not a probability distribution, i.e. they need neither lie in [0, 1] nor sum to one. To transform the logits into a valid probability distribution, giving the likelihood Pr[x = i] that the x-th example represents the digit i, we make use of the softmax function, given in Equation 3. Our final estimates are thus calculated by $\mathrm{softmax}(X \cdot W + b)$, as shown below:

    # Draw random weights for symmetry breaking
    weights = tf.Variable(tf.random_uniform([784, 10]))
    # Slightly positive initial bias
    bias = tf.Variable(tf.constant(0.1, shape=[10]))
    # tf.matmul performs the matrix multiplication XW
    # Note how the + operator is overloaded for tensors
    logits = tf.matmul(examples, weights) + bias
    # Applies softmax to each row of the logits
    estimates = tf.nn.softmax(logits)

$$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} \tag{3}$$

Next, we compute our objective function, producing the error or loss of the model given its current trainable parameters $W$ and $b$. We do this by calculating the cross entropy $H(L, Y)_i = -\sum_j L_{i,j} \cdot \log(Y_{i,j})$ between the probability distributions of our estimates $Y$ and the one-hot-encoded labels $L$. More precisely, we consider the mean cross entropy over all examples as the loss:

    # Computes the cross-entropy and sums the rows
    cross_entropy = -tf.reduce_sum(
        labels * tf.log(estimates), [1])
    loss = tf.reduce_mean(cross_entropy)

Now that we have an objective function, we can run (stochastic) gradient descent to update the weights of our model. For this, TensorFlow provides a GradientDescentOptimizer class. It is initialized with the learning rate of the algorithm and provides an operation minimize, to which we pass our loss tensor. This is the operation we will run repeatedly in a Session environment to train our model:

    # We choose a learning rate of 0.5
    gdo = tf.train.GradientDescentOptimizer(0.5)
    optimizer = gdo.minimize(loss)

Finally, we can actually train our algorithm. For this, we enter a session environment using a tf.Session as a context manager. We pass our graph object to its constructor, so that it knows which graph to manage. To then
execute nodes, we have several options. The most general way is to call Session.run() and pass a list of tensors we wish to compute. Alternatively, we may call eval() on tensors and run() on operations directly. Before evaluating any other node, we must first ensure that the variables in our graph are initialized. Theoretically, we could run the Variable.initializer operation for each variable. However, one most often just uses the tf.initialize_all_variables() utility operation provided by TensorFlow, which in turn executes the initializer operation for each Variable in the graph. Then, we can perform a certain number of iterations of stochastic gradient descent, fetching an example and label mini-batch from the MNIST dataset each time and feeding it to the run routine. At the end, our loss will (hopefully) be small:

    with tf.Session(graph=graph) as session:
        # Execute the operation directly
        tf.initialize_all_variables().run()
        for step in range(1000):
            # Fetch next 100 examples and labels
            x, y = mnist.train.next_batch(100)
            # Ignore the result of the optimizer (None)
            _, loss_value = session.run(
                [optimizer, loss],
                feed_dict={examples: x, labels: y})
            print('Loss at step {0}: {1}'
                  .format(step, loss_value))

The full code listing for this example, along with some additional implementation to compute an accuracy metric at each time step, is given in Appendix I.

C. Abstractions

You may have observed how a relatively large amount of effort was required to create just a very simple two-layer neural network. Given that deep learning, by implication of its name, makes use of deep neural networks with many hidden layers, it may seem infeasible to create weight and bias variables, perform a matrix multiplication and addition and finally apply some non-linear activation function each time. When testing ideas for new deep learning models, scientists often wish to rapidly prototype networks and quickly exchange layers. In that case, these many steps may seem very low-level, repetitive and generally cumbersome. For this reason, there exist a number of open source libraries that abstract these concepts and provide higher-level building blocks, such as entire layers. We find PrettyTensor²³, TFLearn²⁴ and Keras²⁵ especially noteworthy. The following paragraphs give a brief overview of the first two abstraction libraries.

²³ https://github.com/google/prettytensor
²⁴ https://github.com/tflearn/tflearn
²⁵ http://keras.io

1) PrettyTensor: PrettyTensor is developed by Google and provides a high-level interface to the TensorFlow API via the Builder pattern. It allows the user to wrap TensorFlow operations and tensors into "pretty" versions and then quickly chain any number of layers operating on these tensors. For example, it is possible to feed an input tensor into a fully connected (dense) neural network layer as we did in Subsection IV-B with just a single line of code. Shown below is an example use of PrettyTensor, where a standard TensorFlow placeholder is wrapped into a library-compatible object and then fed through three fully connected layers to finally output a softmax distribution.

    import prettytensor
    import tensorflow as tf

    # tf.placeholder expects the data type first, then the shape
    examples = tf.placeholder(tf.float32, [None, 784])
    softmax = (prettytensor.wrap(examples)
               .fully_connected(256, tf.nn.relu)
               .fully_connected(128, tf.sigmoid)
               .fully_connected(64, tf.tanh)
               .softmax(10))

2) TFLearn: TFLearn is another abstraction library built on top of TensorFlow that provides high-level building blocks to quickly construct TensorFlow graphs. It has a highly modular interface and allows for rapid chaining of neural network layers, regularization functions, optimizers and other elements. Moreover, while PrettyTensor still relies on the standard tf.Session setup to train and evaluate a model, TFLearn adds functionality to easily train a model given an example batch and corresponding labels. As many TFLearn functions, such as those creating entire layers, return vanilla TensorFlow objects, the library is well suited to be mixed with existing TensorFlow code. For example, we could replace the entire setup for the output layer discussed in Subsection IV-B with just a single TFLearn method invocation, leaving the rest of our code base untouched. Furthermore, TFLearn automatically handles everything related to visualization with TensorBoard, discussed in Section V. Shown below is how we can reproduce the full 65 lines of standard TensorFlow code given in Appendix I with fewer than 10 lines using TFLearn.

    import tflearn
    import tflearn.datasets.mnist as mnist

    X, Y, validX, validY = mnist.load_data(one_hot=True)

    # Building our neural network
    input_layer = tflearn.input_data(shape=[None, 784])
    output_layer = tflearn.fully_connected(input_layer,
        10, activation='softmax')

    # Optimization
    sgd = tflearn.SGD(learning_rate=0.5)
    net = tflearn.regression(output_layer,
        optimizer=sgd)

    # Training
    model = tflearn.DNN(net)
    model.fit(X, Y, validation_set=(validX, validY))

V. VISUALIZATION OF TENSORFLOW GRAPHS

Deep learning models often employ neural networks with a highly complex and intricate structure. For example, [19] reports a deep convolutional network based on the Google Inception model with more than 36,000 individual units, while [8] states that certain long short-term memory (LSTM) architectures can span over 15,000 nodes. To maintain a clear overview of such complex networks, facilitate model
debugging and allow inspection of values on various levels of
detail, powerful visualization tools are required. TensorBoard,
a web interface for graph visualization and manipulation built
directly into TensorFlow, is an example of such a tool. In this
section, we first list a number of noteworthy features of
TensorBoard and then discuss how it is used from TensorFlow's
programming interface.
A. TensorBoard Features
The core feature of TensorBoard is the lucid visualization of
computational graphs, exemplified in Figure 8a. Graphs with
complex topologies and many layers can be displayed in a
clear and organized manner, allowing the user to understand
exactly how data flows through them. Especially useful is
TensorBoard's notion of name scopes, whereby nodes or entire
subgraphs may be grouped into one visual block, such as
a single neural network layer. Such name scopes can then
be expanded interactively to show the grouped units in more
detail. Figure 8b shows the expansion of one of the name scopes
of Figure 8a.
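The grouping behind name scopes can be sketched in a few lines of plain Python. This is only an illustration and not TensorBoard's actual implementation: it assumes the slash-separated naming that TensorFlow gives to scoped nodes (for example `layer1/weights`), and the function `group_by_scope` as well as the node names are hypothetical.

```python
from collections import defaultdict


def group_by_scope(node_names, depth=1):
    """Group node names by their first `depth` scope components."""
    groups = defaultdict(list)
    for name in node_names:
        parts = name.split("/")
        # A node that is not nested deeper than `depth` forms its own block.
        key = "/".join(parts[:depth]) if len(parts) > depth else name
        groups[key].append(name)
    return dict(groups)


# Hypothetical node names, as they might appear in a small graph.
nodes = [
    "input/examples",
    "layer1/weights",
    "layer1/bias",
    "layer1/matmul",
    "loss/cross_entropy",
    "loss/mean",
]

# Collapsed view: one visual block per top-level name scope.
collapsed = group_by_scope(nodes, depth=1)

# Expanding the "layer1" block reveals the units grouped inside it.
expanded = group_by_scope(collapsed["layer1"], depth=2)

for scope in sorted(collapsed):
    print(scope, "->", collapsed[scope])
```

Collapsing corresponds to showing only the keys of `collapsed`, while expanding a block corresponds to regrouping its members one level deeper.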
Furthermore, TensorBoard allows the user to track the
development of individual tensor values over time. For this,
you can attach two kinds of summary operations to nodes
of the computational graph: scalar summaries and histogram
summaries. Scalar summaries show the progression of a scalar
tensor value, which can be sampled at certain iteration counts.
In this way, you could, for example, observe the accuracy
or loss of your model over time. Histogram summary nodes
allow the user to track value distributions, such as those of
neural network weights or the final softmax estimates. Figures
8c and 8d give examples of scalar and histogram summaries,
respectively. Lastly, TensorBoard also allows visualization of
images. This can be useful to show the images sampled for
each mini-batch of an image classification task, or to visualize
the kernel filters of a convolutional neural network [8].
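To make the two kinds of summaries more concrete, the following plain-Python sketch mimics what each one records. It is an assumption-laden illustration rather than TensorFlow's implementation: the helper names `scalar_summary` and `histogram_summary`, the sampling interval and the bucket layout are all hypothetical.

```python
def scalar_summary(values, every=100):
    """Sample (step, value) pairs every `every` iterations."""
    return [(step, v) for step, v in enumerate(values) if step % every == 0]


def histogram_summary(values, bins=4, low=-1.0, high=1.0):
    """Count values per equal-width bucket in [low, high], clamping outliers."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        index = int((v - low) / width)
        index = min(max(index, 0), bins - 1)
        counts[index] += 1
    return counts


# A made-up loss curve that decays over 300 training steps.
losses = [1.0 / (step + 1) for step in range(300)]
points = scalar_summary(losses, every=100)

# A made-up set of weights whose distribution we want to inspect.
weights = [-0.9, -0.2, 0.1, 0.3, 0.4, 0.95]
counts = histogram_summary(weights)

print(points)
print(counts)
```

A scalar summary thus yields a curve of sampled values over the iteration count, whereas a histogram summary reduces a whole tensor to bucket counts at each sampled step.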
We note especially how interactive the TensorBoard web in-
terface is. Once your computational graph is uploaded, you can
pan and zoom the model as well as expand or contract individ-
ual name scopes. A demonstration of TensorBoard is available
at https://www.tensorflow.org/tensorboard/index.html.
Fig. 8: A demonstration of TensorBoard's graph visualization features. Figure 8a shows the complete graph, while Figure 8b displays the expansion of the first layer. Figures 8c and 8d give examples of scalar and histogram summaries, respectively.

B. TensorBoard in Practice

To integrate TensorBoard into your TensorFlow code, three steps are required. Firstly, it is wise to group nodes into name scopes. Then, you may add scalar and histogram summaries to your operations. Finally, you must instantiate a SummaryWriter object and hand it the tensors produced by the summary nodes in a session context whenever you wish to store new summaries. Rather than fetching individual summaries, it is also possible to combine all summary nodes into one via the tf.merge_all_summaries() operation.
    # Group the nodes into a name scope and attach a scalar summary
    with tf.name_scope('Variables'):
        x = tf.constant(1.0)
        y = tf.constant(2.0)
        tf.scalar_summary('z', x + y)

    # Combine all summary nodes into a single operation
    merged = tf.merge_all_summaries()

    writer = tf.train.SummaryWriter('/tmp/log', graph)

    with tf.Session(graph=graph):
        for step in range(1000):
            writer.add_summary(
                merged.eval(), global_step=step)
VI. COMPARISON WITH OTHER DEEP LEARNING FRAMEWORKS

Next to TensorFlow, there exist a number of other open source deep learning software libraries, the most popular being Theano, Torch and Caffe. In this section, we explore the qualitative as well as quantitative differences between TensorFlow and each of these alternatives. We begin with a high level qualitative comparison and examine where TensorFlow diverges or overlaps conceptually or architecturally. Then, we review a few sources of quantitative comparisons and state as well as discuss their results.

A. Qualitative Comparison

The following three paragraphs compare Theano, Torch and Caffe to TensorFlow, respectively. Table II provides an overview of the most important talking points.

1) Theano: Of the three candidate alternatives we discuss, Theano, which has a Python frontend, is most similar to TensorFlow. Like TensorFlow, Theano's programming model is declarative rather than imperative and based on computational graphs. Also, Theano employs symbolic differentiation, as does TensorFlow. However, Theano is known to have very long graph compile times as it translates Python code to C++/CUDA [5]. In part, this is due to the fact that Theano applies a number of more advanced graph optimization algorithms [5], while TensorFlow currently only performs common subgraph elimination. Moreover, Theano's visualization tools are very poor in comparison to TensorBoard. Next to built-in functionality to output plain text representations of the graph or static images, a plugin can be used to generate slightly interactive HTML visualizations. However, it is nowhere near as powerful as TensorBoard. Lastly, there is also no (out-of-the-box) support for distributing the execution of a computational graph, while this is a key feature of TensorFlow.

2) Torch: One of the principal differences between Torch and TensorFlow is the fact that Torch, while it has a C/CUDA backend, uses Lua as its main frontend. While Lua(JIT) is one of the fastest scripting languages and allows for rapid prototyping and quick execution, it is not yet very mainstream. This implies that while it may be easy to train and develop models with Torch, Lua's limited API and library ecosystem can make industrial deployment harder compared to a Python-based library such as TensorFlow (or Theano). Besides the language aspect, Torch's programming model is fundamentally quite different from TensorFlow. Models are expressed in an imperative programming style and not as declarative computational graphs. This means that the programmer must, in fact, be concerned with the order of execution of operations. It also implies that Torch does not use symbol-to-symbol, but rather symbol-to-number differentiation requiring explicit forward and backward passes to compute gradients.

3) Caffe: Caffe is most dissimilar to TensorFlow in various ways. While there exist high-level MATLAB and Python frontends to Caffe for model creation, its main interface is really the Google Protobuf language (it is more a fancy, typed version of JSON), which gives a very different experience compared to Python. Also, the basic building blocks in Caffe are not operations, but entire neural network layers. In that sense, TensorFlow can be considered fairly low-level in comparison. Like Torch, Caffe has no notion of computational graphs or symbols and thus computes derivatives via the symbol-to-number approach. Caffe is especially well suited for development of convolutional neural networks and image recognition tasks; however, it falls short in many other state-of-the-art categories supported well by TensorFlow. For example, Caffe, by design, does not support cyclic architectures, which form the basis of RNN, LSTM and other models. Caffe has no support for distributed execution²⁶.

²⁶ https://github.com/BVLC/caffe/issues/876

Library      Frontends      Style        Gradients  Distributed Execution
TensorFlow   Python, C++ †  Declarative  Symbolic   ✓ ‡
Theano       Python         Declarative  Symbolic
Torch        LuaJIT         Imperative   Explicit
Caffe        Protobuf       Imperative   Explicit
† Very limited API.
‡ Starting with TensorFlow 0.8, released in April 2016 [16].

TABLE II: A table comparing TensorFlow to Theano, Torch and Caffe in several categories.

B. Quantitative Comparison

We will now review three sources of quantitative comparisons between TensorFlow and other deep learning libraries, providing a summary of the most important results of each work. Furthermore, we will briefly discuss the overall trend of these benchmarks.

The first work, [20], authored by the Bosch Research and Technology Center in late March 2016, compares the performance of TensorFlow, Torch, Theano and Caffe (among others) with respect to various neural network architectures. Their setup involves Ubuntu 14.04 running on an Intel Xeon E5-1650 v2 CPU @ 3.50 GHz and an NVIDIA GeForce GTX Titan X/PCIe/SSE2 GPU. One benchmark we find noteworthy tests the relative performance of each library on a slightly modified reproduction of the LeNet CNN model [21]. More specifically, the authors measure the forward-propagation time, which they deem relevant for model deployment, and the back-propagation time, important for model training. We have reproduced an excerpt of their results in Table III, where we show their outcomes on (a) a CPU running 12 threads and (b) a GPU. Interestingly, for (a), TensorFlow ranks second behind Torch in both the forward and backward measure while in (b) TensorFlow's performance drops significantly, placing it last in both categories. The authors of [20] note that one reason for this may be that they used the NVIDIA cuDNN v2 library for their GPU implementation with TensorFlow while using cuDNN v3 for the others. They state that as of their writing,
this was the recommended configuration for TensorFlow²⁷.

(a) CPU (12 threads)
Library      Forward (ms)  Backward (ms)
TensorFlow   16.4          50.1
Torch        4.6           16.5
Caffe        33.7          66.4
Theano       78.7          204.3

(b) GPU
Library      Forward (ms)  Backward (ms)
TensorFlow   4.5           14.6
Torch        0.5           1.7
Caffe        0.8           1.9
Theano       0.5           1.4

TABLE III: This table shows the benchmarks performed by [20], where TensorFlow, Torch, Caffe and Theano are compared on a LeNet model reproduction [21]. IIIa shows the results performed with 12 threads each on a CPU, while IIIb gives the outcomes on a graphics chip.

The second source in our collection is the convnet-benchmarks repository on GitHub by Soumith Chintala [22], an artificial intelligence research engineer at Facebook. The commit we reference²⁸ is dated April 25, 2016. Chintala provides an extensive set of benchmarks for a variety of convolutional network models and includes many libraries, including TensorFlow, Torch and Caffe, in his measurements. Theano is not present in all tests, so we will not review its performance in this benchmark suite. The author's hardware configuration is a 6-core Intel Core i7-5930K CPU @ 3.50 GHz and an NVIDIA Titan X graphics chip running on Ubuntu 14.04. Inter alia, Chintala gives the forward and backward-propagation time of TensorFlow, Torch and Caffe for the AlexNet CNN model [23]. In these benchmarks, TensorFlow performs second-best in both measures behind Torch, with Caffe lagging relatively far behind. We reproduce the relevant results in Table IV.

Library      Forward (ms)  Backward (ms)
TensorFlow   26            55
Torch        25            46
Caffe        121           203
Theano       -             -

TABLE IV: The result of Soumith Chintala's benchmarks for TensorFlow, Torch and Caffe (not Theano) on an AlexNet ConvNet model [22], [23].

Lastly, we review the results of [5], published by the Theano development team on May 9, 2016. Next to a set of benchmarks for four popular CNN models, including the aforementioned AlexNet architecture, the work also includes results for an LSTM network operating on the Penn Treebank dataset [24]. Their benchmarks measure words processed per second for a small model consisting of a single 200-unit hidden layer with sequence length 20, and a large model with two 650-unit hidden layers and a sequence length of 50. In [5] a medium-sized model is also tested, which we ignore for our review. The authors state a hardware configuration consisting of an NVIDIA Digits DevBox with 4 Titan X GPUs and an Intel Core i7-5930K CPU. Moreover, they used cuDNN v4 for all libraries included in their benchmarks, which are TensorFlow, Torch and Theano. Results for Caffe are not given. In their benchmarks, TensorFlow performs best among all three for the small model, followed by Theano and then Torch. For the large model, TensorFlow is placed second behind Theano, while Torch remains in last place. Figure 9 shows these results, taken from [5].

[Figure 9: bar chart of throughput in 10^3 words/sec (y-axis) for Theano, Torch and TensorFlow on the Small and Large models (x-axis).]
Fig. 9: The results of [5], comparing TensorFlow, Theano and Torch on an LSTM model for the Penn Treebank dataset [24]. On the left the authors tested a small model with a single hidden layer and 200 units; on the right they use two layers with 650 units each.

When TensorFlow was first released, it performed poorly on benchmarks, causing disappointment within the deep learning community. Since then, new releases of TensorFlow have emerged, bringing with them improving results. This is reflected in our selection of works. The earliest of the three sources, [20], published in late March 2016, ranks TensorFlow as consistently uncompetitive compared to Theano, Torch and Caffe. Released about a month later, [22] ranks TensorFlow comparatively better. The latest work reviewed, [5], then places TensorFlow in first or second place for LSTMs and also other architectures discussed by the authors. We note that one reason for this upward trend is that [5] uses TensorFlow with cuDNN v4 for its GPU experiments, whereas [20] still used cuDNN v2. While we predict that TensorFlow will improve its performance on measurements similar to the ones discussed in the future, we believe that these benchmarks, even today, do not make full use of TensorFlow's potential. The reason for this is that all tests were performed on a single machine. As we reviewed in depth in Section III-B, TensorFlow was
27 As of TensorFlow 0.8, released in April 2016 and thus after the publi-
built with massively parallel distributed computing in mind.
cation of [20], TensorFlow now supports cuDNN v4, which promises better
performance on GPUs than cuDNN v3 and especially cuDNN v2. This ability is currently unique to TensorFlow among the
28 Commit sha1 hash: 84b5bb1785106e89691bc0625674b566a6c02147 popular deep learning libraries and we estimate that it would
be advantageous to its performance, particularly for large-scale models. We thus hope to see more benchmarks in literature in the future, making better use of TensorFlow's many-machine, many-device capabilities.

VII. USE CASES OF TENSORFLOW TODAY

In this section, we investigate where TensorFlow is already in use today. Given that TensorFlow was released only a little over six months ago as of this writing, its adoption in academia and industry is not yet widespread. Migration from an existing system based on some other library within small and large organizations necessarily takes time and consideration, so this is not unexpected. The one exception is, of course, Google, which has already deployed TensorFlow for a variety of learning tasks [19], [25]-[28]. We begin with a review of selected mentions of TensorFlow in literature. Then, we discuss where and how TensorFlow is used in industry.

A. In Literature

The first noteworthy mention of TensorFlow is [29], published by Szegedy, Ioffe and Vanhoucke of the Google Brain Team in February 2016. In their work, the authors use TensorFlow to improve on the Inception model [19], which achieved best performance at the 2014 ImageNet classification challenge. The authors report a 3.08% top-5 error on the ImageNet test set.

In [25], Ramsundar et al. discuss massively multitask networks for drug discovery in a joint collaboration between Stanford University and Google, published in early 2016. In this paper, the authors employ deep neural networks developed with TensorFlow to perform virtual screening of potential drug candidates. This is intended to aid pharmaceutical companies and the scientific community in finding novel medication and treatments for human diseases.

August and Ni apply TensorFlow to create recurrent neural networks for optimizing dynamical decoupling, a technique for suppressing errors in quantum memory [30]. With this, the authors aim to preserve the coherence of quantum states, which is one of the primary requirements for building universal quantum computers.

Lastly, [31] investigates the use of sequence-to-sequence neural translation models for natural language processing of multilingual media sources. For this, Barzdins et al. use TensorFlow with a sliding-window approach to character-level English to Latvian translation of audio and video content. The authors use this to segment TV and radio programs and cluster individual stories.

B. In Industry

Adoption of TensorFlow in industry is currently limited only to Google, at least to the extent that is publicly known. We have found no evidence of any other small or large corporation stating its use of TensorFlow. As mentioned, we attribute this to TensorFlow's recent release. Moreover, it is obvious that many companies would not make their machine learning methods public even if they do use TensorFlow. For this reason, we will review uses of TensorFlow only within Google, Inc.

Recently, Google has begun augmenting its core search service and accompanying PageRank algorithm [32] with a system called RankBrain [33], which makes use of TensorFlow. RankBrain uses large-scale distributed deep neural networks for search result ranking. According to [33], more than 15 percent of all search queries received on www.google.com are new to Google's system. RankBrain can suggest words or phrases with similar meaning for unknown parts of such queries.

Another area where Google applies deep learning with TensorFlow is smart email replies [27]. Google has investigated and already deployed a feature whereby its email service Inbox suggests possible replies to received email. The system uses recurrent neural networks and in particular LSTM modules for sequence-to-sequence learning and natural language understanding. An encoder maps a corpus of text to a "thought vector" while a decoder synthesizes syntactically and semantically correct replies from it, of which a selection is presented to the user.

In [26] it is reported how Google employs convolutional neural networks for image recognition and automatic text translation. As a feature integrated into its Google Translate mobile app, text in a language foreign to the user is first recognized, then translated and finally rendered on top of the original image. In this way, for example, street signs can be translated. [26] notes especially the challenge of deploying such a system onto low-end phones with slow network connections. For this, small neural networks were used and trained to discover only the most essential information in order to optimize available computational resources.

Lastly, we make note of the decision of Google DeepMind, an AI division within Google, to move from Torch7 to TensorFlow [28]. A related source, [17], states that DeepMind made use of TensorFlow for its AlphaGo²⁹ model, alongside Google's newly developed Tensor Processing Unit (TPU), which was built to integrate especially well with TensorFlow. In correspondence between the authors of this paper and a member of the Google DeepMind team, the following four reasons were given as to why TensorFlow is advantageous to DeepMind:

1) TensorFlow is included in the Google Cloud Platform³⁰, which enables easy replication of DeepMind's research.
2) TensorFlow's support for TPUs.
3) TensorFlow's main interface, Python, is one of the core languages at Google, which implies a much greater internal tool set than for Lua.
4) The ability to run TensorFlow on many GPUs.

²⁹ https://deepmind.com/alpha-go
³⁰ https://cloud.google.com/compute/

VIII. CONCLUSION

We have discussed TensorFlow, a novel open source deep learning library based on computational graphs. Its ability


to perform fast automatic gradient computation, its inherent support for distributed computation and specialized hardware as well as its powerful visualization tools make it a very welcome addition to the field of machine learning. Its low-level programming interface gives fine-grained control for neural net construction, while abstraction libraries such as TFLearn allow for rapid prototyping with TensorFlow. In the context of other deep learning toolkits such as Theano or Torch, TensorFlow adds new features and improves on others. Its performance was inferior in comparison at first, but is improving with new releases of the library.

We note that very little investigation has been done in literature to evaluate TensorFlow's qualities with respect to distributed execution. We consider this one of its principal strengths and thus encourage in-depth study by the academic community in the future.

TensorFlow has gained great popularity and strong support in the open-source community with many third-party contributions, making Google's move a sensible decision already. We believe, however, that it will not only benefit its parent company, but the greater scientific community as a whole, opening new doors to faster, larger-scale artificial intelligence.

APPENDIX I

#!/usr/bin/env python
# -*- coding: utf-8 -*-
""" A single-layer softmax MNIST-classifier. """

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

# Import the training data (MNIST)
from tensorflow.examples.tutorials.mnist import input_data

import tensorflow as tf

# Possibly download and extract the MNIST data set.
# Retrieve the labels as one-hot-encoded vectors.
mnist = input_data.read_data_sets("/tmp/mnist", one_hot=True)

# Create a new graph
graph = tf.Graph()

# Set our graph as the one to add nodes to
with graph.as_default():

    # Placeholder for input examples (None = variable dimension)
    examples = tf.placeholder(shape=[None, 784], dtype=tf.float32)
    # Placeholder for labels
    labels = tf.placeholder(shape=[None, 10], dtype=tf.float32)

    # Trainable parameters of the affine transformation
    weights = tf.Variable(tf.truncated_normal(shape=[784, 10], stddev=0.1))
    bias = tf.Variable(tf.constant(0.1, shape=[10]))

    # Apply an affine transformation to the input features
    logits = tf.matmul(examples, weights) + bias
    estimates = tf.nn.softmax(logits)

    # Compute the cross-entropy
    cross_entropy = -tf.reduce_sum(labels * tf.log(estimates),
                                   reduction_indices=[1])
    # And finally the loss
    loss = tf.reduce_mean(cross_entropy)

    # Create a gradient-descent optimizer that minimizes the loss.
    # We choose a learning rate of 0.5
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Find the indices where the predictions were correct
    correct_predictions = tf.equal(
        tf.argmax(estimates, dimension=1),
        tf.argmax(labels, dimension=1))
    accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    for step in range(1001):
        example_batch, label_batch = mnist.train.next_batch(100)
        feed_dict = {examples: example_batch, labels: label_batch}
        if step % 100 == 0:
            # Every 100 steps, also fetch and report loss and accuracy
            _, loss_value, accuracy_value = session.run(
                [optimizer, loss, accuracy],
                feed_dict=feed_dict
            )
            print("Loss at time {0}: {1}".format(step, loss_value))
            print("Accuracy at time {0}: {1}".format(step, accuracy_value))
        else:
            optimizer.run(feed_dict)

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14539
[2] V. Nair and G. E. Hinton, "Rectified linear units improve restricted boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), J. Fürnkranz and T. Joachims, Eds. Omnipress, 2010, pp. 807-814. [Online]. Available: http://www.icml2010.org/papers/432.pdf
[3] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929-1958, 2014. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html
[4] L. Rampasek and A. Goldenberg, "Tensorflow: Biology's gateway to deep learning?" Cell Systems, vol. 2, no. 1, pp. 12-14, 2016. [Online]. Available: http://dx.doi.org/10.1016/j.cels.2016.01.009
[5] The Theano Development Team, R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, Y. Bengio, A. Bergeron, J. Bergstra, V. Bisson, J. Bleecher Snyder, N. Bouchard, N. Boulanger-Lewandowski, X. Bouthillier, A. de Brébisson, O. Breuleux, P.-L. Carrier, K. Cho, J. Chorowski, P. Christiano, T. Cooijmans, M.-A. Côté, M. Côté, A. Courville, Y. N. Dauphin, O. Delalleau, J. Demouth, G. Desjardins, S. Dieleman, L. Dinh, M. Ducoffe, V. Dumoulin, S. Ebrahimi Kahou, D. Erhan, Z. Fan, O. Firat, M. Germain, X. Glorot, I. Goodfellow, M. Graham, C. Gulcehre, P. Hamel, I. Harlouchet, J.-P. Heng, B. Hidasi, S. Honari, A. Jain, S. Jean, K. Jia, M. Korobov, V. Kulkarni, A. Lamb, P. Lamblin, E. Larsen, C. Laurent, S. Lee, S. Lefrancois, S. Lemieux, N. Léonard, Z. Lin, J. A. Livezey, C. Lorenz, J. Lowin, Q. Ma, P.-A.
Manzagol, O. Mastropietro, R. T. McGibbon, R. Memisevic, B. van Merriënboer, V. Michalski, M. Mirza, A. Orlandi, C. Pal, R. Pascanu, M. Pezeshki, C. Raffel, D. Renshaw, M. Rocklin, A. Romero, M. Roth, P. Sadowski, J. Salvatier, F. Savard, J. Schlüter, J. Schulman, G. Schwartz, I. Vlad Serban, D. Serdyuk, S. Shabanian, É. Simon, S. Spieckermann, S. Ramana Subramanyam, J. Sygnowski, J. Tanguay, G. van Tulder, J. Turian, S. Urban, P. Vincent, F. Visin, H. de Vries, D. Warde-Farley, D. J. Webb, M. Willson, K. Xu, L. Xue, L. Yao, S. Zhang, and Y. Zhang, "Theano: A Python framework for fast computation of mathematical expressions," ArXiv e-prints, May 2016.
[6] R. Collobert, S. Bengio, and J. Mariéthoz, "Torch: A modular machine learning software library," 2002.
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in python," J. Mach. Learn. Res., vol. 12, pp. 2825-2830, Nov. 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=1953048.2078195
[8] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[9] R. Kohavi, D. Sommerfield, and J. Dougherty, "Data mining using MLC++, a machine learning library in C++," in Tools with Artificial Intelligence, 1996., Proceedings Eighth IEEE International Conference on, Nov 1996, pp. 234-245.
[10] G. Bradski, "The opencv library," Doctor Dobb's Journal, vol. 25, no. 11, pp. 120-126, 2000.
[11] C. R. de Souza, "A tutorial on principal component analysis with the accord.net framework," CoRR, vol. abs/1210.7463, 2012. [Online]. Available: http://arxiv.org/abs/1210.7463
[12] A. Bifet, G. Holmes, B. Pfahringer, P. Kranen, H. Kremer, T. Jansen, and T. Seidl, "Moa: Massive online analysis, a framework for stream classification and clustering," in Journal of Machine Learning Research (JMLR) Workshop and Conference Proceedings, Volume 11: Workshop on Applications of Pattern Analysis. Journal of Machine Learning Research, 2010, pp. 44-50.
[13] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud'10. Berkeley, CA, USA: USENIX Association, 2010, pp. 10-10. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863103.1863113
[14] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar, "Mllib: Machine learning in apache spark," CoRR, vol. abs/1505.06807, 2015. [Online]. Available: http://arxiv.org/abs/1505.06807
[15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," CoRR, vol. abs/1408.5093, 2014. [Online]. Available: http://arxiv.org/abs/1408.5093
[16] D. Murray, "Announcing tensorflow 0.8 - now with distributed computing support!" Google Research Blog, April 2016 (accessed May 22, 2016), http://googleresearch.blogspot.de/2016/04/announcing-tensorflow-08-now-with.html.
[17] N. Jouppi, "Google supercharges machine learning tasks with tpu custom chip," Google Cloud Platform Blog, May 2016 (accessed May 22, 2016), https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.
[18] I. Goodfellow, Y. Bengio, and A. Courville, "Deep learning," 2016, book in preparation for MIT Press. [Online]. Available: http://www.deeplearningbook.org
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper with Convolutions," ArXiv e-prints, Sep. 2014.
[20] S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah, "Comparative Study of Deep Learning Software Frameworks," ArXiv e-prints, Nov. 2015.
[21] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov 1998.
[22] S. Chintala, "convnet-benchmarks," GitHub, April 2016 (accessed May 24, 2016), https://github.com/soumith/convnet-benchmarks.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[24] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of english: The penn treebank," Comput. Linguist., vol. 19, no. 2, pp. 313-330, Jun. 1993. [Online]. Available: http://dl.acm.org/citation.cfm?id=972470.972475
[25] B. Ramsundar, S. Kearnes, P. Riley, D. Webster, D. Konerding, and V. Pande, "Massively Multitask Networks for Drug Discovery," ArXiv e-prints, Feb. 2015.
[26] O. Good, "How google translate squeezes deep learning onto a phone," Google Research Blog, Jul. 2015 (accessed: May 25, 2016), http://googleresearch.blogspot.de/2015/07/how-google-translate-squeezes-deep.html.
[27] G. Corrado, "Computer, respond to this email." Google Research Blog, Nov. 2015 (accessed: May 25, 2016), http://googleresearch.blogspot.de/2015/11/computer-respond-to-this-email.html.
[28] K. Kavukcuoglu, "Deepmind moves to tensorflow," Google Research Blog, Apr. 2016 (accessed May 24, 2016), http://googleresearch.blogspot.de/2016/04/deepmind-moves-to-tensorflow.html.
[29] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning," ArXiv e-prints, Feb. 2016.
[30] M. August and X. Ni, "Using recurrent neural networks to optimize dynamical decoupling for quantum memory," in arXiv.org, vol. quant-ph, no. arXiv:1604.00279. Technical University of Munich, Max Planck Institute for Quantum Optics, Apr. 2016. [Online]. Available: http://arxiv.org/pdf/1604.00279v1.pdf
[31] G. Barzdins, S. Renals, and D. Gosko, "Character-Level Neural Translation for Multilingual Media Monitoring in the SUMMA Project," ArXiv e-prints, Apr. 2016.
[32] L. Page, S. Brin, R. Motwani, and T. Winograd, "The pagerank citation ranking: Bringing order to the web." Stanford InfoLab, Technical Report 1999-66, November 1999, previous number = SIDL-WP-1999-0120. [Online]. Available: http://ilpubs.stanford.edu:8090/422/
[33] J. Clark, "Google turning its lucrative web search over to ai machines," Bloomberg Technology, Oct. 2015 (accessed: May 25, 2016), http://www.bloomberg.com/news/articles/2015-10-26/google-turning-its-lucrative-web-search-over-to-ai-machines.
