Large Scale Deep Learning With TensorFlow
TensorFlow
Jeff Dean
Google Brain Team
g.co/brain
In collaboration with many other people at Google
Background
Google Brain project started in 2011, with a focus on pushing state-of-the-art in neural networks. Initial emphasis: use large datasets and large amounts of computation to push the boundaries of what is possible in perception and language understanding.
We Disseminate Our Work in Many Ways
● By publishing our work
○ See papers at research.google.com/pubs/BrainTeam.html
Document 1
… car parking available for a small fee.
… parts of our floor model inventory for sale.
Document 2
Selling all kinds of automobile and pickup truck parts,
engines, and transmissions.
Example Needs of the Future
● Which of these eye images shows symptoms of diabetic
retinopathy?
● Find me all rooftops in North America
● Describe this video in Spanish
● Find me all documents relevant to reinforcement learning for
robotics and summarize them in German
● Find a free time for everyone in the Smart Calendar project
to meet and set up a videoconference
● Robot, please fetch me a cup of tea from the snack kitchen
Growing Use of Deep Learning at Google
[Chart: growth over time in the number of directories containing model description files.]
Across many products/areas:
Android
Apps
drug discovery
Gmail
Image understanding
Maps
Natural language understanding
Photos
Robotics research
Speech
Translation
YouTube
… many others ...
Overview
● Discuss TensorFlow, an open source machine learning
system
○ Our primary research and production system
○ Show real examples
○ Explain what’s happening underneath the covers
Two Generations of Distributed ML Systems
1st generation: DistBelief (2011)
2nd generation: TensorFlow (2015)
Learning from Unlabeled Images
[Figure: the optimal stimulus for a neuron, found by numerical optimization, alongside its top 48 stimuli from the test set.]
Adding Supervision
Train recurrent models that also incorporate lexical and language modeling.
Going Deeper with Convolutions: Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
http://arxiv.org/abs/1512.00567 (Rethinking the Inception Architecture for Computer Vision)
Rapid Progress in Image Recognition
ImageNet challenge classification task:
Team                              Year  Place  Error (top-5)  Params
XRCE (pre-neural-net explosion)   2011  1st    25.8%          -
Supervision (AlexNet)             2012  1st    16.4%          60M
Clarifai                          2013  1st    11.7%          65M
Models with a small number of parameters fit easily in a mobile app (using 8-bit fixed point)
What do you want in a machine learning system?
● Ease of expression: for lots of crazy ML ideas/algorithms
● Scalability: can run experiments quickly
● Portability: can run on wide variety of platforms
● Reproducibility: easy to share and reproduce research
● Production readiness: go from research to real products
Open, standard software for
general machine learning
● Core in C++
○ Very low overhead
● Different front ends for specifying/driving the computation
○ Python and C++ today, easy to add more
Computation is a dataflow graph, with tensors
[Graph: 'examples' and 'labels' tensors flowing into MatMul and Xent (cross-entropy) ops.]
Example TensorFlow fragment
[Code and graph shown on slide; the graph includes 'biases' and 'learning rate' nodes.]
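The code from this slide did not survive extraction; below is a minimal sketch, in the TF 0.x-era API, of the kind of fragment the surrounding figures suggest (a linear layer plus cross-entropy loss; all shapes and names here are illustrative assumptions).

import tensorflow as tf

# Inputs: a batch of examples and their one-hot labels.
examples = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.float32, [None, 10])
# Model parameters, including the 'biases' node from the figure.
weights = tf.Variable(tf.random_normal([784, 10]))
biases = tf.Variable(tf.zeros([10]))
# MatMul and Xent (cross-entropy) ops, as in the dataflow graph above.
logits = tf.matmul(examples, weights) + biases
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits, labels))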
Symbolic Differentiation
[Graph: TensorFlow extends the graph with gradient ops automatically; the 'biases' are updated using the 'learning rate'.]
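A sketch of what this looks like from the Python side, continuing the hypothetical fragment above: tf.gradients() walks the graph backward and adds gradient ops, and the update itself is just more ops.

# Ask TensorFlow to extend the graph with gradient computations.
grads = tf.gradients(loss, [weights, biases])
learning_rate = 0.01
# The update is built from ordinary graph ops (Mul, Sub, Assign).
update_biases = tf.assign(biases, biases - learning_rate * grads[1])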
Assign Devices to Ops
● TensorFlow inserts Send/Recv Ops to transport tensors across devices
● Recv ops pull data from Send ops
[Graph: the biases update subgraph (Add, Mul, Sub, Assign) split between GPU 0 and CPU, with Send/Recv pairs at the device boundary.]
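Placement is requested with tf.device(); TensorFlow then inserts the Send/Recv pairs shown in the figure wherever an edge crosses devices. A hedged sketch, reusing the hypothetical fragment above:

with tf.device("/gpu:0"):
  logits = tf.matmul(examples, weights) + biases  # runs on GPU 0
with tf.device("/cpu:0"):
  # Tensors crossing the GPU/CPU boundary travel via implicit Send/Recv.
  update_biases = tf.assign(biases, biases - learning_rate * grads[1])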
Today we are releasing our best image classifier trained on ImageNet data. As described in our
recent arXiv preprint at http://arxiv.org/abs/1512.00567, an ensemble of four of these models
achieves 3.5% top-5 error on the validation set of the ImageNet whole image ILSVRC2012
classification task (compared with our ensemble from last year that won the 2014 ImageNet
classification challenge with a 6.66% top-5 error rate).
In this release, we are supplying code and data files containing the trained model parameters for
running the image classifier on:
● Both desktop and mobile environments
● Employing either a C++ or Python API.
In addition, we are providing a tutorial that describes how to use the image recognition system for a
variety of use-cases.
http://www.tensorflow.org/tutorials/image_recognition/index.html
Experiment Turnaround Time and Research Productivity
● Minutes, Hours:
○ Interactive research! Instant gratification!
● 1-4 days
○ Tolerable
○ Interactivity replaced by running many experiments in parallel
● 1-4 weeks
○ High value experiments only
○ Progress stalls
● >1 month
○ Don’t even try
Data Parallelism
● Use multiple model replicas to process different
examples at the same time
○ All collaborate to update model state (parameters) in shared
parameter server(s)
● Speedups depend highly on kind of model
○ Dense models: 10-40X speedup from 50 replicas
○ Sparse models:
■ support many more replicas
■ often can use as many as 1000 replicas
Data Parallelism
[Animation: model replicas, fed from separate data shards, pull parameters p from the parameter servers and push back gradients ∆p; the servers apply p' = p + ∆p, the replicas pull p' and push ∆p', the servers apply p'' = p' + ∆p', and so on.]
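The cycle in the animation, written out as schematic pseudocode (fetch_parameters, compute_gradient, and send_to_parameter_server are hypothetical stand-ins, not TensorFlow APIs):

# Each replica, running asynchronously and in parallel:
p = fetch_parameters()                  # pull current p from the PS
delta_p = compute_gradient(p, my_data)  # local forward/backward pass
send_to_parameter_server(delta_p)       # PS applies p' = p + delta_p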
DistBelief: Separate Parameter Servers
DistBelief's parameter update rules were not expressed in the same programming model as the rest of the system.
[Diagram: in TensorFlow, the parameter server is just another job in the same graph (/job:worker/cpu:0 and /job:ps/gpu:0), with Send/Recv pairs around the biases update subgraph.]
Graph structure and low-level graph primitives (queues) allow us to play with
synchronous vs. asynchronous update algorithms.
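For example, synchronous training can be expressed with the released tf.train.SyncReplicasOptimizer, which uses queues under the hood to accumulate gradients from all replicas before applying them. A sketch, assuming a loss tensor already exists:

opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.01),
    replicas_to_aggregate=50,
    total_num_replicas=50)
train_op = opt.minimize(loss)  # gradients aggregated across replicas before the update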
Data Parallelism Considerations
Want model computation time to be large relative to time to
send/receive parameters over network
This favors models with fewer parameters that reuse each parameter multiple times in the computation.
[Chart: image model training time in hours for 1, 10, and 50 GPUs; 50 GPUs: 2.6 hours vs. 79.3 hours on 1 GPU, a 30.5X speedup.]
Synchronous converges faster (time to accuracy)
[Chart: test accuracy over time for synchronous vs. asynchronous training.]
A tour through the TensorFlow codebase
1. Expressing the graph
python: variables, optimizer
http://public.kevinrobinsonblog.com/tensorflow-codebase/
Slide credit for the codebase tour: Kevin Robinson ([email protected])
Expressing: Graphs and Ops
Graph
in math_ops.py#L1137
Expressing: Graph
Graph is built implicitly
session.py#L896
In TensorBoard: [graph visualization screenshot]
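A tiny sketch of the implicit graph: every op created from Python is appended to a default Graph object, which session.run() later serializes and ships to the runtime.

import tensorflow as tf

c = tf.constant(1.0)
d = c * 2.0  # the Mul op is added to the implicit default graph
g = tf.get_default_graph()
print([op.name for op in g.get_operations()])  # e.g. ['Const', 'mul/y', 'mul']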
Expressing: Optimizers
Optimizer fns extend the graph
optimizer.py:minimize#L155
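In other words, minimize() is graph construction, not execution: it calls compute_gradients() to add gradient ops and apply_gradients() to add update ops. A sketch, assuming a loss tensor:

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
train_op = optimizer.minimize(loss)  # extends the graph; runs nothing yet
# Equivalent, in two explicit steps:
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars)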
[Diagram: multiple worker machines, each with a CPU and several GPUs.]
Prune subgraph
PruneForReverseReachability
in graph/algorithm.cc#L122
tests in subgraph_test.cc#142
Distributing: Placing
Constraints from model
DeviceSpec in device.py#L24
By device or colocation
NodeDef in graph.proto
[Diagram: two WorkerService processes, /job:worker/task:0 and /job:worker/task:1, exchanging tensors through a rendezvous.]
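Constraints can be given as full or partial device specs; the placer fills in the rest. A small sketch (the job/task names here are illustrative):

# Pin variables to a parameter-server task, compute to a worker GPU.
with tf.device("/job:ps/task:0/cpu:0"):
  weights = tf.Variable(tf.zeros([784, 10]))
with tf.device("/job:worker/task:0/gpu:0"):
  logits = tf.matmul(examples, weights)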
Distributing: Partitioning
Partition into subgraphs
in graph_partition.cc#L883
Rewrite with Send and Recv
in sendrecv_ops.cc#L56 and #L97
Rendezvous handles coordination
in base_rendezvous_mgr.cc#L236
[Diagram: the graph split across two WorkerService processes (/job:worker/task:0), with Send/Recv pairs connected through rendezvous objects.]
A tour through the TensorFlow codebase
2. Distributing the graph
core: distributed_runtime, common_runtime
http://public.kevinrobinsonblog.com/tensorflow-codebase/
Executing: Executor
Parallelism on each worker
[Diagram: WorkerService /job:worker/task:0 receives RunGraph(graph, feed, fetches) and RecvTensor(rendezvous_key) calls and runs ops across its CPU and GPUs.]
GraphMgr::ExecuteAsync
in graph_mgr.cc#L283
ExecutorState::RunAsync
in executor.cc#L867
Executing: OpKernels
[Diagram: OpKernels running inside WorkerService /job:worker/task:0, exchanging tensors through rendezvous objects.]
OpKernels are specialized by device
CPU in conv_ops.cc#L91
GPU in conv_ops.cc#L263
adapted from matmul_op.cc#L116
OpKernels call into Stream functions
adapted from matmul_op.cc#L71
Executing: Stream functions
OpKernels call into Stream functions
in conv_ops.cc#L292 and #L417
Executing: Stream functions
Platforms provide GPU-specific implementations
cuBLAS
BlasSupport in stream_executor/blas.h#L88
DoBlasInternal in cuda_blas.cc#L429
RankBrain (deep model for search ranking)
Launched in 2015
Third most important search ranking signal (of 100s)
Bloomberg, Oct 2015: “Google Turning Its Lucrative Web Search Over to AI Machines”
Example: LSTM [Hochreiter et al, 1997][Gers et al, 1999]
Enables long-term dependencies to flow
Sequence-to-Sequence Model
[Sutskever & Vinyals & Le, NIPS 2014]
[Diagram: the encoder reads the input sequence A B C D; the decoder then emits the target sequence X Y Z Q, with __ X Y Z fed back as its inputs.]
Sequence-to-Sequence
● Active area of research
● Many groups actively pursuing RNN/LSTM
○ Montreal
○ Stanford
○ U of Toronto
○ Berkeley
○ Google
○ ...
● Further Improvements
○ Attention
○ NTM / Memory Nets
○ ...
Sequence-to-Sequence
● Translation: [Kalchbrenner et al., EMNLP 2013][Cho et al., EMNLP 2014][Sutskever & Vinyals & Le, NIPS 2014][Luong et al., ACL 2015][Bahdanau et al., ICLR 2015]
● Image captions: [Mao et al., ICLR 2015][Vinyals et al., CVPR 2015][Donahue et al., CVPR 2015][Xu et al.,
ICML 2015]
P(English | French)
P(caption | Image)
Image Captioning
[Vinyals et al., CVPR 2015]
[Diagram: the image is encoded and a decoder emits the caption “A young girl asleep”, with __ A young girl fed back as its inputs.]
Image Captions Research
Human: A young girl asleep on
the sofa cuddling a stuffed
bear.
Initial model: A close up of a
person eating a hot dog.
Human: A blue, yellow and red
train travels across the tracks
near a depot.
➢ Convex Hulls
[Diagram: given input points (x1, y1) ... (x6, y6), the network outputs the convex hull as pointers back into its input sequence: x1 x6 x5 x2 x1 / y1 y6 y5 y2 y1, i.e. points 1, 6, 5, 2, 1.]
Smart Reply - Nov 2015
Google Research Blog
[Diagram: an incoming email first goes to a small feed-forward neural network that decides “Activate Smart Reply?” (yes/no); if yes, a deep recurrent neural network produces the generated replies.]
Example: LSTM
for i in range(20):
  m, c = LSTMCell(x[i], mprev, cprev)
  mprev = m
  cprev = c
Example: Deep LSTM
for i in range(20):
  for d in range(4):  # d is depth
    input = x[i] if d == 0 else m[d-1]
    m[d], c[d] = LSTMCell(input, mprev[d], cprev[d])
    mprev[d] = m[d]
    cprev[d] = c[d]
Example: Deep LSTM
for i in range(20):
  for d in range(4):  # d is depth
    with tf.device("/gpu:%d" % d):
      input = x[i] if d == 0 else m[d-1]
      m[d], c[d] = LSTMCell(input, mprev[d], cprev[d])
      mprev[d] = m[d]
      cprev[d] = c[d]
[Animation: the deep LSTM pipelined across GPUs (GPU1 ... GPU6); the input sentence A B C D streams up through the stacked layers while the output A B C D is emitted, at 2000 dims x 4 layers = 8k dims per sentence.]
TensorFlow Queues
[Diagram: Enqueue ops feed a Queue, which performs randomization/shuffling; Dequeue ops pull batches out.]
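A sketch of the pattern in the figure, using the real queue ops (the 'example' tensor is a hypothetical input):

queue = tf.RandomShuffleQueue(
    capacity=10000, min_after_dequeue=1000,
    dtypes=[tf.float32], shapes=[[784]])
enqueue_op = queue.enqueue(example)  # run by input threads
batch = queue.dequeue_many(128)      # a shuffled training batch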
Example: Deep LSTMs
● Wrinkles
○ Bucket sentences by length using a queue per length (see the sketch after this list)
○ Dequeue when a full batch of same length has accumulated
○ N different graphs for different lengths
○ Alternative: while loop
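A minimal sketch of the bucketing wrinkle above (the bucket lengths and the padded sentence tensor are hypothetical):

# One FIFO queue per sentence length.
buckets = {n: tf.FIFOQueue(capacity=1000, dtypes=[tf.int32], shapes=[[n]])
           for n in (10, 20, 40)}
enqueue_op = buckets[20].enqueue(sentence_of_length_20)  # route by length
batch = buckets[20].dequeue_many(32)  # full batch, all the same length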
Expressing Data Parallelism
# We use the ReplicaDeviceSetter() device function to automatically
# assign Variables to the 'ps' jobs.
with tf.device("/cpu:0"):
  # Create the Mnist model.
  model = MnistModel(batch_size=16, hidden_units=200)
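The comment above names ReplicaDeviceSetter(); the released spelling is tf.train.replica_device_setter(), which assigns Variables round-robin to the 'ps' tasks. A fuller hedged sketch (MnistModel and the worker address are hypothetical):

with tf.device(tf.train.replica_device_setter(ps_tasks=3)):
  model = MnistModel(batch_size=16, hidden_units=200)
  train_op = tf.train.GradientDescentOptimizer(0.01).minimize(model.loss)

# Each worker runs the same loop; updates are asynchronous by default.
with tf.Session("grpc://worker0.example.com:2222") as sess:
  sess.run(tf.initialize_all_variables())
  for step in range(10000):
    sess.run(train_op)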
Use Cloud-based APIs
cloud.google.com/translate
cloud.google.com/speech
cloud.google.com/vision
cloud.google.com/text
Google Cloud Vision API
https://cloud.google.com/vision/
Google Cloud ML
Scaled service for training and inference w/TensorFlow
A Few TensorFlow Community Examples
(From more than 2100 results for ‘tensorflow’ on GitHub)
● DQN: github.com/nivwusquorum/tensorflow-deepq
● NeuralArt: github.com/woodrush/neural-art-tf
● Char RNN: github.com/sherjilozair/char-rnn-tensorflow
● Keras ported to TensorFlow: github.com/fchollet/keras
● Show and Tell: github.com/jazzsaxmafia/show_and_tell.tensorflow
● Mandarin translation: github.com/jikexueyuanwiki/tensorflow-zh
...
Concluding Remarks
● Model and Data Parallelism enable great ML work:
○ Neural Machine Translation: ~6x speedup on 8 GPUs
○ Inception / Imagenet: ~40x speedup on 50 GPUs
○ RankBrain: ~300X speedup on 500 machines
● TensorFlow open-source community vibrant and growing
● TensorFlow makes it easy to express ML computations
What Does the Future Hold?
Deep learning usage will continue to grow and accelerate.
Google Brain Residency Program
Applications for the class running June 2017 to May 2018 will open in Fall 2016
g.co/brainresidency
Further Reading
● Dean, et al., Large Scale Distributed Deep Networks, NIPS 2012,
research.google.com/archive/large_deep_networks_nips2012.html.
● Mikolov, Chen, Corrado & Dean. Efficient Estimation of Word Representations in Vector Space, ICLR Workshop 2013, arxiv.org/abs/1301.3781.
● Sutskever, Vinyals, & Le, Sequence to Sequence Learning with Neural Networks, NIPS,
2014, arxiv.org/abs/1409.3215.
● Vinyals, Toshev, Bengio, & Erhan. Show and Tell: A Neural Image Caption Generator.
CVPR 2015. arxiv.org/abs/1411.4555
● TensorFlow white paper, tensorflow.org/whitepaper2015.pdf (clickable links in bibliography)
g.co/brain (We’re hiring! Also check out Brain Residency program at g.co/brainresidency)
www.tensorflow.org
research.google.com/people/jeff
research.google.com/pubs/BrainTeam.html
Questions?