Large-Scale Deep Learning with TensorFlow
Jeff Dean
Google Brain team
g.co/brain
In collaboration with many other people at Google
What is the Google Brain Team?
g.co/brain
We Disseminate Our Work in Many Ways
● By publishing our work
○ See papers at research.google.com/pubs/BrainTeam.html
Document 1
… car parking available for a small fee.
… parts of our floor model inventory for sale.
Document 2
Selling all kinds of automobile and pickup truck parts,
engines, and transmissions.
Example Needs of the Future
● Which of these eye images shows symptoms of diabetic
retinopathy?
● Find me all rooftops in North America
● Describe this video in Spanish
● Find me all documents relevant to reinforcement learning for
robotics and summarize them in German
● Find a free time for everyone in the Smart Calendar project
to meet and set up a videoconference
● Robot, please fetch me a cup of tea from the snack kitchen
Growing Use of Deep Learning at Google
[Chart: growth in the # of directories containing model description files]
Across many products/areas:
Android
Apps
drug discovery
Gmail
Image understanding
Maps
Natural language
understanding
Photos
Robotics research
Speech
Translation
YouTube
… many others ...
Important Property of Neural Networks
Results get better with
more data +
bigger models +
more computation
● Core in C++
○ Very low overhead
● Different front ends for specifying/driving the computation
○ Python and C++ today, easy to add more
Computation is a dataflow graph with tensors
[Graph diagram: examples and labels feed MatMul and Xent (cross-entropy) ops]
Example TensorFlow fragment
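A minimal sketch, using assumed shapes and the 2016-era TensorFlow API, of a fragment like the one on this slide: a linear model with a cross-entropy (Xent) loss and a gradient-descent update, all built as one dataflow graph:

import tensorflow as tf

examples = tf.placeholder(tf.float32, [None, 784])   # input batch (assumed shape)
labels = tf.placeholder(tf.float32, [None, 10])      # one-hot targets

weights = tf.Variable(tf.truncated_normal([784, 10]))
biases = tf.Variable(tf.zeros([10]))

logits = tf.matmul(examples, weights) + biases       # the MatMul op
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits, labels))  # the Xent op

learning_rate = 0.01
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)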
Symbolic Differentiation
[Graph diagram: TensorFlow adds automatically derived gradient ops to the graph; they update the biases using the learning rate]
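The same gradients can also be requested explicitly; a one-line sketch against the assumed fragment above:

# Graphs of ops computing d(loss)/d(weights) and d(loss)/d(biases) are
# added to the dataflow graph automatically.
grad_w, grad_b = tf.gradients(loss, [weights, biases])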
Assign Devices to Ops
● TensorFlow inserts Send/Recv Ops to transport tensors across devices
● Recv ops pull data from Send ops
[Graph diagram: the ops (Add, Mul, Assign, Sub, the biases, and the learning rate) are split between GPU 0 and the CPU, with a Send/Recv pair inserted at every edge that crosses the device boundary]
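A minimal sketch, reusing the assumed fragment above, of pinning ops to devices; TensorFlow then inserts the Send/Recv pairs wherever a tensor crosses the boundary:

with tf.device("/gpu:0"):
    logits = tf.matmul(examples, weights) + biases   # forward math on the GPU
with tf.device("/cpu:0"):
    loss = tf.reduce_mean(                           # loss (and its Recv) on the CPU
        tf.nn.softmax_cross_entropy_with_logits(logits, labels))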
Data Parallelism
Many model replicas, each processing its own shard of the data, train against shared parameter servers:
● each model replica fetches the current parameters p from the parameter servers,
● computes an update ∆p on its own mini-batch of data,
● and sends ∆p back; the parameter servers apply p’ = p + ∆p,
● then the cycle repeats from p’ (p’’ = p’ + ∆p’, and so on).
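A minimal sketch of one replica’s loop in this scheme; parameter_server, data_shard, and compute_update are hypothetical stand-ins rather than TensorFlow API:

# Hypothetical sketch of asynchronous data parallelism: every replica
# runs this loop independently, so updates from different replicas
# interleave on the shared parameter servers.
def replica_loop(parameter_server, data_shard):
    for batch in data_shard:
        p = parameter_server.get_params()       # fetch current parameters p
        delta_p = compute_update(p, batch)      # local gradient step yields ∆p
        parameter_server.apply_update(delta_p)  # servers apply p’ = p + ∆p

With synchronous updates, the servers would instead wait for every replica’s ∆p and apply them together.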
Distributed training mechanisms
Graph structure and low-level graph primitives (queues) allow us to play with
synchronous vs. asynchronous update algorithms.
Cross-process communication is the same!
● Communication across machines over the network is abstracted identically to cross-device communication.
[Graph diagram: the same Send/Recv mechanism, now spanning /job:worker/cpu:0 and /job:ps/gpu:0 across the network]
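A minimal sketch, assuming the distributed TensorFlow runtime and made-up host addresses, of placing ops in different processes; the same Send/Recv mechanism then runs over the network:

import tensorflow as tf

# Hypothetical two-process cluster: one parameter-server job, one worker job.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222"],
})
# (Each process would start a tf.train.Server(cluster, ...) for its role.)

with tf.device("/job:ps/task:0"):
    biases = tf.Variable(tf.zeros([10]))        # lives in the ps process

with tf.device("/job:worker/task:0"):
    update = biases.assign_add(tf.ones([10]))   # Send/Recv cross the network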
Image Model Training Time
[Chart: hours to target accuracy for 1 GPU, 10 GPUs, and 50 GPUs; 2.6 hours vs. 79.3 hours (30.5X)]
Sync converges faster (time to accuracy)
See the Google Cloud Platform blog post “Google supercharges machine learning tasks with TPU custom chip” by Norm Jouppi, May 2016
Long Short-Term Memory (LSTMs):
Make Your Memory Cells Differentiable
[Hochreiter & Schmidhuber, 1997] Sigmoids
W R
WRITE? READ?
X M Y X M Y
FORGET?
F
Example: LSTM [Hochreiter et al, 1997][Gers et al, 1999]
Enables long-term dependencies to flow
Example: LSTM
for i in range(20):
  m, c = LSTMCell(x[i], mprev, cprev)
  mprev = m
  cprev = c
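A minimal sketch, with assumed dimensions, 2016-era TF ops, and made-up gate parameters W and b, of what an LSTMCell like the one above computes:

import tensorflow as tf

DIM = 2000  # assumed per-layer state size; matches the 2000-dim layers later in the deck

# One weight matrix and bias for all four gates, applied to [x, mprev].
W = tf.Variable(tf.truncated_normal([2 * DIM, 4 * DIM], stddev=0.1))
b = tf.Variable(tf.zeros([4 * DIM]))

def LSTMCell(x, mprev, cprev):
    gates = tf.matmul(tf.concat(1, [x, mprev]), W) + b
    i, f, o, g = tf.split(1, 4, gates)  # input, forget, output gates + update
    c = tf.sigmoid(f) * cprev + tf.sigmoid(i) * tf.tanh(g)  # differentiable write/forget
    m = tf.sigmoid(o) * tf.tanh(c)                          # differentiable read
    return m, c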
Example: Deep LSTM
for i in range(20):
  for d in range(4):  # d is depth
    input = x[i] if d == 0 else m[d-1]
    m[d], c[d] = LSTMCell(input, mprev[d], cprev[d])
    mprev[d] = m[d]
    cprev[d] = c[d]
Example: Deep LSTM
for i in range(20):
  for d in range(4):  # d is depth
    with tf.device("/gpu:%d" % d):
      input = x[i] if d == 0 else m[d-1]
      m[d], c[d] = LSTMCell(input, mprev[d], cprev[d])
      mprev[d] = m[d]
      cprev[d] = c[d]
[Animation: the input sequence A B C D flows through the depth-4 LSTM, with successive layers pinned to different GPUs (GPU1, GPU3, ..., GPU6) so that timesteps pipeline across devices; 2000 x 4 = 8k dims per sentence]
What are some ways that deep learning is having a significant impact at Google?
Speech Recognition
[Diagram: Acoustic Input → Deep Recurrent Neural Network → Text Output: “How cold is it outside?”]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
[Diagram: Your Photo → Deep Convolutional Neural Network → Automatic Tag: “ocean”]
Launched in 2015
Third most important search ranking signal (of 100s)
Bloomberg, Oct 2015: “Google Turning Its Lucrative Web Search Over to AI Machines”
Sequence-to-Sequence Model
[Sutskever & Vinyals & Le, NIPS 2014]
[Diagram: a Deep LSTM reads the input sequence A B C D, then emits the target sequence X Y Z Q one symbol at a time, feeding each emitted symbol back in after the __ start marker]
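A minimal pseudocode sketch of the encode-then-decode loop, reusing the LSTMCell sketch above; embed, most_probable_token, GO, and EOS are hypothetical stand-ins:

# Encode: read the input sequence, keeping only the final state.
m, c = m0, c0
for x in input_sequence:
    m, c = LSTMCell(embed(x), m, c)

# Decode: emit tokens until EOS, feeding each output back in as input.
token, outputs = GO, []
while token != EOS:
    m, c = LSTMCell(embed(token), m, c)
    token = most_probable_token(m)   # hypothetical softmax + argmax over vocab
    outputs.append(token)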
Sequence-to-Sequence Model: Machine Translation
[Sutskever & Vinyals & Le, NIPS 2014]
[Animation: the model reads the input sentence, then generates the target sentence word by word: “How”, “How tall”, “How tall are”, “How tall are you?”]
Sequence-to-Sequence Model: Machine Translation
[Sutskever & Vinyals & Le, NIPS 2014]
At inference time: beam search over possible output sequences to choose the most probable one, given the input sentence.
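A minimal sketch of beam search for this step; next_log_probs is a hypothetical stand-in for the decoder’s per-step output distribution:

import heapq

def beam_search(next_log_probs, beam_size=4, max_len=20, eos="</s>"):
    # Each hypothesis is (cumulative log probability, tokens so far).
    beam = [(0.0, [])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beam:
            if seq and seq[-1] == eos:
                candidates.append((score, seq))    # finished: carry forward
                continue
            for token, lp in next_log_probs(seq):  # hypothetical decoder call
                candidates.append((score + lp, seq + [token]))
        # Keep only the beam_size most probable hypotheses.
        beam = heapq.nlargest(beam_size, candidates, key=lambda h: h[0])
    return max(beam, key=lambda h: h[0])[1]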
Smart Reply
April 1, 2009: April Fool’s Day joke
Smart Reply - Nov 2015 (Google Research Blog)
[Diagram: Incoming Email → Small Feed-Forward Neural Network → Activate Smart Reply? (yes/no); if yes, a Deep Recurrent Neural Network produces the Generated Replies]
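A minimal sketch, with made-up feature and layer sizes, of the small feed-forward yes/no network that gates the expensive reply-generating model:

import tensorflow as tf

email = tf.placeholder(tf.float32, [None, 256])  # assumed embedding of the incoming email

W1 = tf.Variable(tf.truncated_normal([256, 64], stddev=0.1))
b1 = tf.Variable(tf.zeros([64]))
W2 = tf.Variable(tf.truncated_normal([64, 1], stddev=0.1))
b2 = tf.Variable(tf.zeros([1]))

hidden = tf.nn.relu(tf.matmul(email, W1) + b1)
p_activate = tf.sigmoid(tf.matmul(hidden, W2) + b2)  # P(activate Smart Reply)
# Only when p_activate is high does the deep recurrent network run to
# produce the suggested replies.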
Image Captioning
[Vinyals et al., CVPR 2015]
[Diagram: image features W and a start token __ seed a recurrent decoder that emits the caption “A young girl asleep” word by word]
Image Captions Research
Human: A young girl asleep on the sofa cuddling a stuffed bear.
Use Cloud-based APIs
cloud.google.com/translate
cloud.google.com/speech
cloud.google.com/vision
cloud.google.com/text
Google Cloud Vision API
https://cloud.google.com/vision/
Google Cloud ML
Scaled service for training and inference with TensorFlow
A Few TensorFlow Community Examples
(From more than 2100 results for ‘tensorflow’ on GitHub)
● DQN: github.com/nivwusquorum/tensorflow-deepq
● NeuralArt: github.com/woodrush/neural-art-tf
● Char RNN: github.com/sherjilozair/char-rnn-tensorflow
● Keras ported to TensorFlow: github.com/fchollet/keras
● Show and Tell: github.com/jazzsaxmafia/show_and_tell.tensorflow
● Mandarin translation: github.com/jikexueyuanwiki/tensorflow-zh
...
What Does the Future Hold?
Deep learning usage will continue to grow and accelerate.
g.co/brain (We’re hiring! Also check out Brain Residency program at g.co/brainresidency)
www.tensorflow.org
research.google.com/people/jeff
research.google.com/pubs/BrainTeam.html
Questions?