
Large Scale Deep Learning with TensorFlow
Jeff Dean
Google Brain Team
g.co/brain
In collaboration with many other people at Google
Background
Google Brain project started in 2011, with a focus on pushing the state of the art in neural networks. Initial emphasis:

● use large datasets, and
● large amounts of computation

to push the boundaries of what is possible in perception and language understanding
Overview
● Cover our experience from past ~5 years
○ Research: speech, images, video, robotics, language understanding,
NLP, translation, optimization algorithms, unsupervised learning, …
○ Production: deployed systems for advertising, search, GMail, Photos,
Maps, YouTube, speech recognition, image analysis, user prediction, …
● Focus on neural nets, but many techniques are more broadly applicable
What is the Google Brain Team?

● Research team focused on long-term artificial intelligence research
○ Mix of computer systems and machine learning research expertise
○ Pure ML research, and research in context of emerging ML application areas:
■ robotics, language understanding, healthcare, ...

g.co/brain
We Disseminate Our Work in Many Ways
● By publishing our work
○ See papers at research.google.com/pubs/BrainTeam.html
● By releasing TensorFlow, our core machine learning research system, as an open-source project
● By releasing implementations of our research models in TensorFlow
● By collaborating with product teams at Google to get our research into real products
What Do We Really Want?

● Build artificial intelligence algorithms and systems that learn from experience
● Use those to solve difficult problems that benefit humanity


What do I mean by understanding?

Query
[ car parts for sale ]

Document 1
… car parking available for a small fee.
… parts of our floor model inventory for sale.

Document 2
Selling all kinds of automobile and pickup truck parts,
engines, and transmissions.
Example Needs of the Future
● Which of these eye images shows symptoms of diabetic
retinopathy?
● Find me all rooftops in North America
● Describe this video in Spanish
● Find me all documents relevant to reinforcement learning for
robotics and summarize them in German
● Find a free time for everyone in the Smart Calendar project
to meet and set up a videoconference
● Robot, please fetch me a cup of tea from the snack kitchen
Growing Use of Deep Learning at Google
(Chart: # of directories containing model description files, over time)
Across many products/areas:
Android
Apps
drug discovery
Gmail
Image understanding
Maps
Natural language
understanding
Photos
Robotics research
Speech
Translation
YouTube
… many others ...
Overview
● Discuss TensorFlow, an open source machine learning
system
○ Our primary research and production system
○ Show real examples
○ Explain what’s happening underneath the covers
Two Generations of Distributed ML Systems

1st generation - DistBelief (Dean et al., NIPS 2012)


● Scalable, good for production, but not very flexible for research

2nd generation - TensorFlow (see tensorflow.org and
whitepaper 2015, tensorflow.org/whitepaper2015.pdf)
● Scalable, good for production, but also flexible for variety of research uses
● Portable across range of platforms
● Open source w/ Apache 2.0 license
Important Property of Neural Networks
Results get better with
more data +
bigger models +
more computation
Need Both Large Datasets & Large, Powerful Models
“Scaling Recurrent Neural Network Language Models”, Williams et al.
2015
arxiv.org/pdf/1502.00512v1.pdf
Large Datasets + Powerful Models
● Combination works incredibly well
● Poses interesting systems problems, though:
○ Need lots of computation
○ Want to train and do experiments quickly
○ Large-scale parallelism using distributed systems
really only way to do this at very large scale
○ Also want to easily express machine learning ideas
Basics of Deep Learning
● Unsupervised cat
● Speech
● Vision
● General trend is towards more complex models:
○ Embeddings of various kinds
○ Generative models
○ Layered LSTMs
○ Attention
Learning from Unlabeled Images

• Train on 10 million images (YouTube)
• 1000 machines (16,000 cores) for 1 week.
• 1.15 billion parameters

Learning from Unlabeled Images

Optimal stimulus by numerical optimization
Top 48 stimuli from the test set
Adding Supervision

Top stimuli for selected neurons.


Speech: Feedforward Acoustic Models
Model speech frame-by-frame, independently

Simple fully-connected networks

Deep Neural Networks for Acoustic Modeling in Speech Recognition,
Hinton et al., IEEE Signal Processing Magazine, 2012
CLDNNs
Model frequency invariance using 1D convolutions

Model time dynamics using an LSTM

Use fully connected layers on top to add depth

Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks,
Sainath et al., ICASSP ’15
Trend: LSTMs end-to-end!
Speech → Acoustics → Phonetics → Language → Text

Train recurrent models that also incorporate Lexical and Language Modeling:

Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition, H. Sak et al., 2015

Deep Speech: Scaling up end-to-end speech recognition, A. Hannun et al., 2014

Listen, Attend and Spell, W. Chan et al., 2015
CNNs for Vision: AlexNet

ImageNet Classification with Deep Convolutional Neural Networks


Krizhevsky, Sutskever and Hinton, NIPS 2012
The Inception Architecture (GoogLeNet, 2015)
Basic module, which is then
replicated many times
The Inception Architecture (GoogLeNet, 2015)

Going Deeper with Convolutions

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich

ArXiv 2014, CVPR 2015


Inception-v3 (December 2015)

arxiv.org/abs/1512.00567
Rapid Progress in Image Recognition
ImageNet challenge classification task:

Team                             Year   Place   Error (top-5)   Params
XRCE (pre-neural-net explosion)  2011   1st     25.8%
Supervision (AlexNet)            2012   1st     16.4%           60M
Clarifai                         2013   1st     11.7%           65M
MSRA                             2014   3rd     7.35%
VGG                              2014   2nd     7.32%           180M
GoogLeNet (Inception)            2014   1st     6.66%           5M
Andrej Karpathy (human)          2014   N/A     5.1%            100 trillion?
BN-Inception (Arxiv)             2015   N/A     4.9%            13M
Inception-v3 (Arxiv)             2015   N/A     3.46%           25M

Models with a small number of parameters fit easily in a mobile app (8-bit fixed point)
What do you want in a machine learning system?
● Ease of expression: for lots of crazy ML ideas/algorithms
● Scalability: can run experiments quickly
● Portability: can run on wide variety of platforms
● Reproducibility: easy to share and reproduce research
● Production readiness: go from research to real products
Open, standard software for general machine learning

Great for Deep Learning in particular

http://tensorflow.org/
First released Nov 2015, Apache 2.0 license
https://github.com/tensorflow/tensorflow
http://tensorflow.org/whitepaper2015.pdf
Preprint: arxiv.org/abs/1605.08695
Updated version to appear in OSDI 2016
Strong External Adoption

(Chart: GitHub activity for TensorFlow, launched Nov. 2015, compared with other ML frameworks launched Sep. 2013, Jan. 2012, and Jan. 2008)

50,000+ binary installs in 72 hours, 500,000+ since November 2015

Most forked new repo on GitHub in 2015 (despite only being available in Nov. ‘15)
Bloomberg Writes About Open Source Deep Learning Packages?

Source: Bloomberg. www.bloomberg.com/news/articles/2016-07-21/google-sprints-ahead-in-ai-building-blocks-leaving-rivals-wary


http://tensorflow.org/
Motivations
● DistBelief (our 1st system) was the first scalable deep
learning system, but not as flexible as we wanted for
research purposes
● Better understanding of problem space allowed us to
make some dramatic simplifications
TensorFlow: Expressing High-Level ML Computations

● Core in C++
○ Very low overhead
● Different front ends for specifying/driving the computation
○ Python and C++ today, easy to add more

C++ front end   Python front end   ...

Core TensorFlow Execution System

CPU   GPU   Android   iOS   ...
Computation is a dataflow graph

Graph of Nodes, also called Operations or ops.
Edges are N-dimensional arrays: Tensors.

(Graph: examples, weights → MatMul → Add (with biases) → Relu → Xent (with labels))
Example TensorFlow fragment

● Build a graph computing a neural net inference.


import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('MNIST_data', one_hot=True)


x = tf.placeholder("float", shape=[None, 784])
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
Computation is a dataflow graph with state

'Biases' is a variable.  Some ops compute gradients.  −= updates biases.

(Graph: biases → ... → Add → ... → Mul → −=, with the learning rate feeding Mul)
Symbolic Differentiation

● Automatically add ops to calculate symbolic gradients of variables w.r.t. loss function.
● Apply these gradients with an optimization algorithm
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
opt = tf.train.GradientDescentOptimizer(0.01)
train_step = opt.minimize(cross_entropy)
Define graph and then execute it repeatedly

● Launch the graph and run the training ops in a loop
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
for i in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
Computation is a dataflow graph, distributed

GPU 0 CPU
biases

Add ... Mul Assign


... Sub

learning rate
Assign Devices to Ops
● TensorFlow inserts Send/Recv Ops to transport tensors across devices
● Recv ops pull data from Send ops

GPU 0 CPU
biases
Send Recv
Add ... Mul Assign
... Send Recv Sub
Recv Send
Recv

learning rate Send


Send and Receive Implementations
● Different implementations depending on source/dest devices
● e.g. GPUs on same machine: local GPU → GPU copy
● e.g. CPUs on different machines: cross-machine RPC
● e.g. GPUs on different machines: RDMA
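As a concrete illustration, here is a minimal sketch (using the 2016-era TensorFlow 1.x Python API; the shapes and device names are just placeholders) of pinning two ops to different devices. The runtime partitions the graph per device and inserts the Send/Recv pair on the cut edge automatically, choosing one of the transports above:

import tensorflow as tf

# The matmul runs on GPU 0, the bias add on the CPU; the tensor produced by
# the matmul crosses the device boundary via an implicit Send/Recv pair.
with tf.device("/gpu:0"):
    x = tf.random_normal([128, 784])
    w = tf.Variable(tf.zeros([784, 10]))
    logits = tf.matmul(x, w)

with tf.device("/cpu:0"):
    b = tf.Variable(tf.zeros([10]))
    y = logits + b          # consumes a tensor produced on the GPU

# allow_soft_placement lets this fall back to CPU if no GPU is present.
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
sess.run(tf.initialize_all_variables())
print(sess.run(y).shape)    # (128, 10)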
(Chart: GitHub activity from November 2015 through June 2016)
Pre-trained Inception-v3 model released
http://googleresearch.blogspot.com/2015/12/how-to-classify-images-with-tensorflow.html

Dear TensorFlow community,

Today we are releasing our best image classifier trained on ImageNet data. As described in our
recent Arxiv preprint at http://arxiv.org/abs/1512.00567, an ensemble of four of these models
achieves 3.5% top-5 error on the validation set of the ImageNet whole image ILSVRC2012
classification task (compared with our ensemble from last year that won the 2014 ImageNet
classification challenge with a 6.66% top-5 error rate).

In this release, we are supplying code and data files containing the trained model parameters for
running the image classifier on:
● Both desktop and mobile environments
● Employing either a C++ or Python API.

In addition, we are providing a tutorial that describes how to use the image recognition system for a
variety of use-cases.
http://www.tensorflow.org/tutorials/image_recognition/index.html
Experiment Turnaround Time and Research Productivity
● Minutes, Hours:
○ Interactive research! Instant gratification!
● 1-4 days
○ Tolerable
○ Interactivity replaced by running many experiments in parallel
● 1-4 weeks
○ High value experiments only
○ Progress stalls
● >1 month
○ Don’t even try
Data Parallelism
● Use multiple model replicas to process different
examples at the same time
○ All collaborate to update model state (parameters) in shared
parameter server(s)
● Speedups depend highly on kind of model
○ Dense models: 10-40X speedup from 50 replicas
○ Sparse models:
■ support many more replicas
■ often can use as many as 1000 replicas
Data Parallelism

Parameter Servers hold the shared parameters p; Model Replicas each process their own shard of the Data.

Update cycle (repeated by every replica): fetch the current parameters p, compute a gradient ∆p on a mini-batch, and send ∆p to the parameter servers, which apply p' = p + ∆p. The next fetch returns p', producing ∆p', giving p'' = p' + ∆p', and so on.
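A minimal sketch of that update loop in plain Python/NumPy (an illustration only, not TensorFlow's or DistBelief's actual parameter-server code; the toy gradient, learning rate, and shapes are made up):

import numpy as np

# "Parameter server": holds the shared parameters.
params = np.zeros(10)

def fetch_params():
    return params.copy()          # replica pulls current p

def apply_update(delta):
    global params
    params += delta               # server applies p' = p + ∆p

def compute_gradient(p, batch):
    # Stand-in for backprop on one mini-batch: step toward the batch mean.
    return -0.01 * (p - batch.mean(axis=0))

# Each replica, asynchronously and independently of the others:
for step in range(100):
    p = fetch_params()                      # may already be stale
    batch = np.random.randn(32, 10)         # this replica's shard of data
    delta = compute_gradient(p, batch)
    apply_update(delta)                     # no coordination with other replicas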
DistBelief: Separate Parameter Servers
Parameter update rules not the same programming model as the rest of the system

Separate code for parameter servers vs. rest of system

Lacked uniformity & was more complicated
Cross process communication is the same!
● Communication across machines over the network abstracted identically to
cross device communication.

/job:worker/cpu:0 /job:ps/gpu:0
biases
Send Recv
Add ... Mul Assign
... Send Recv Sub
Recv Send
Recv

learning rate Send

No specialized parameter server subsystem!


Data Parallelism Choices
Can do this synchronously:
● N replicas equivalent to an N times larger batch size
● Pro: No gradient staleness
● Con: Less fault tolerant (requires some recovery if any single machine fails)

Can do this asynchronously:


● Pro: Relatively fault tolerant (failure in model replica doesn’t block other
replicas)
● Con: Gradient staleness means each gradient less effective

(Or hybrid: M asynchronous groups of N synchronous replicas)


Asynchronous Training
● Unlike DistBelief, no separate parameter server system:
○ Parameters are now just stateful nodes in the graph
Synchronous Variant
Synchronous vs. Asynchronous

Graph structure and low-level graph primitives (queues) allow us to play with
synchronous vs. asynchronous update algorithms.
Data Parallelism Considerations
Want model computation time to be large relative to time to send/receive parameters over network

Models with fewer parameters, that reuse each parameter multiple times in the computation:

● Mini-batches of size B reuse parameters B times

Certain model structures reuse each parameter many times within each example:

● Convolutional models tend to reuse hundreds or thousands of times per example (for different spatial positions)
● Recurrent models (LSTMs, RNNs) tend to reuse tens to hundreds of times (for unrolling through T time steps during training)
Success of Data Parallelism
● Data parallelism is really important for many of Google’s
problems (very large datasets, large models):
○ RankBrain uses 500 replicas
○ ImageNet Inception training uses 50 GPUs, ~40X
speedup
○ SmartReply uses 16 replicas, each with multiple GPUs
○ State-of-the-art on LM “One Billion Word” Benchmark
model uses both data and model parallelism on 32
GPUs
Image Model Training Time

(Chart: training progress vs. hours for 1 GPU, 10 GPUs, and 50 GPUs)

50 GPUs vs. 1 GPU: 2.6 hours vs. 79.3 hours (30.5X)
Synchronous converges faster (time to accuracy)

(Charts: test accuracy vs. training time for several configurations; e.g. 40 vs. 52 hours, 30 vs. 52 hours, and 40 vs. 50 hours)

Synchronous updates (with backup workers) trains to higher accuracy faster

Better scaling to more workers (less loss of accuracy)

Revisiting Distributed Synchronous SGD, Jianmin Chen, Rajat Monga, Samy Bengio,
Rafal Jozefowicz, ICLR Workshop 2016, arxiv.org/abs/1604.00981
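The backup-workers idea can be sketched in a few lines of plain Python (toy gradients; the choice of 8 replicas plus 2 backups is arbitrary): launch N + b replicas per step but apply only the first N gradients to finish, so a few stragglers never stall the synchronous update.

import numpy as np

N_REPLICAS, N_BACKUP = 8, 2            # launch 10 workers, wait for the fastest 8
params = np.zeros(10)

def replica_gradient(p, seed):
    rng = np.random.RandomState(seed)
    batch = rng.randn(32, 10)
    return -(p - batch.mean(axis=0))   # toy stand-in for a real gradient

for step in range(100):
    # In the real system the 10 replicas run in parallel and their gradients
    # arrive in completion order; here we just simulate taking the first N.
    grads = [replica_gradient(params, step * 100 + i)
             for i in range(N_REPLICAS + N_BACKUP)]
    fastest = grads[:N_REPLICAS]       # drop the b slowest ("backup") results
    params += 0.01 * np.mean(fastest, axis=0)   # one synchronous update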
General Computations
Although we originally built TensorFlow for our uses around deep neural networks, it’s actually quite flexible

Wide variety of machine learning and other kinds of numeric computations easily expressible in the computation graph model
Runs on Variety of Platforms
phones · single machines (CPU and/or GPUs) · distributed systems of 100s of machines and/or GPU cards · custom ML hardware
Trend: Much More Heterogeneous Hardware
General purpose CPU performance scaling has slowed significantly

Specialization of hardware for certain workloads will be more important
Tensor Processing Unit
Custom machine learning ASIC

In production use for >16 months: used on every search query, used for AlphaGo match, many other uses, ...

See Google Cloud Platform blog: “Google supercharges machine learning tasks with TPU custom chip”, by Norm Jouppi, May 2016
Extensible
● Core system defines a number of standard operations
and kernels (device-specific implementations of
operations)
● Easy to define new operators and/or kernels
A tour through the TensorFlow codebase
1. Expressing graphs core:
graph, ops, protobuf

python:
variables,
optimizer

http://public.kevinrobinsonblog.com/tensorflow-codebase/
Slide credit for this codebase tour: Kevin Robinson ([email protected])
Expressing: Graphs and Ops
Graph, Ops
Expressing: Ops

in math_ops.py#L1137

calls C++ wrappers generated by cc/BUILD#L27

OpDef interface defined in math_ops.cc#L607

Expressing: Graph
Graph is built implicitly
session.py#L896

Variables add implicit ops
variables.py#L146

In TensorBoard:
Expressing: Optimizers
Optimizer fns extend the graph
optimizer.py:minimize#L155

Trainable variables collected
variables.py#L258

Graph is extended with gradients
gradients.py#L307
Expressing: Graph
Serialized as GraphDef
graph.proto
Distributing
- Sessions in distributed runtime
- Pruning
- Placing and Partitioning



Distributing: Creating a session
tf.Session   gRPC: MasterService
gRPC: Session   CreateSession(GraphDef)
Distributing: Running a session
tf.Session   gRPC: MasterService
gRPC: Session   CreateSession(GraphDef)
RunStep(feed, fetches)

WorkerService                 WorkerService                 WorkerService
/job:worker/task:0            /job:worker/task:1            /job:worker/task:2
RunGraph(graph,feed,fetches)  RunGraph(graph,feed,fetches)  RunGraph(graph,feed,fetches)
RecvTensor(rendezvous_key)    RecvTensor(rendezvous_key)    RecvTensor(rendezvous_key)
CPU GPU GPU GPU               CPU GPU GPU GPU               CPU GPU
Distributing: Pruning
gRPC call to Session::Run
in master_session.cc#L835

Rewrite with feed and fetch
RewriteGraphForExecution
in graph/subgraph.cc#L225

Prune subgraph
PruneForReverseReachability
in graph/algorithm.cc#L122
tests in subgraph_test.cc#142
Distributing: Placing
Constraints from model
DeviceSpec in device.py#L24

By device or colocation
NodeDef in graph.proto
Distributing: Placing
Placing based on constraints
SimplePlacer::Run
in simple_placer.cc#L558
described in simple_placer.h#L31

WorkerService            WorkerService            WorkerService
/job:worker/task:0       /job:worker/task:1       /job:worker/task:2
CPU GPU GPU              CPU GPU GPU              CPU GPU
Distributing: Partitioning
Partition into subgraphs
in graph_partition.cc#L883
Rewrite with Send and Recv
in sendrecv_ops.cc#L56 and #L97
Rendezvous handles coordination
in base_rendezvous_mgr.cc#L236

WorkerService                WorkerService
/job:worker/task:0           /job:worker/task:0
rendezvous                   rendezvous
A tour through the TensorFlow codebase
2. Distributing the graph

core:
distributed_runtime
common_runtime

http://public.kevinrobinsonblog.com/tensorflow-codebase/

Executing: Executor
Parallelism on each worker

WorkerService
/job:worker/task:0
RunGraph(graph,feed,fetches)
RecvTensor(rendezvous_key)
CPU GPU GPU GPU

GraphMgr::ExecuteAsync
in graph_mgr.cc#L283

ExecutorState::RunAsync
in executor.cc#L867

Executing: OpKernels
WorkerService
/job:worker/task:0
rendezvous

Conv2D OpDef in nn_ops.cc#L221

Executing: OpKernels
Conditional build for OpKernels

CPU in conv_ops.cc#L91
GPU in conv_ops.cc#L263

Executing: OpKernels
OpKernels are specialized by device
adapted from matmul_op.cc#L116

OpKernels call into Stream functions
adapted from matmul_op.cc#L71

Executing: Stream functions
OpKernels call into Stream functions
in conv_ops.cc#L292
in conv_ops.cc#L417

Platforms provide GPU-specific implementations
cuBLAS
BlasSupport in stream_executor/blas.h#L88
DoBlasInternal in cuda_blas.cc#L429
cuDNN
DnnSupport in stream_executor/dnn.h#L544
DoConvolve in cuda_dnn.cc#L629
Session Interface
● Extend: add nodes to computation graph
● Run: execute an arbitrary subgraph
○ optionally feeding in Tensor inputs and retrieving Tensor outputs

Typically, set up a graph with one or a few Extend calls and then Run it thousands or millions of times
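For example (a minimal sketch in the same 1.x-era Python API as the earlier fragments; the tensor names are made up), Run executes only the subgraph needed to produce whatever you fetch, feeding placeholders as inputs:

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 4])
hidden = tf.nn.relu(tf.matmul(x, tf.Variable(tf.ones([4, 8]))))
output = tf.matmul(hidden, tf.Variable(tf.ones([8, 2])))

sess = tf.Session()
sess.run(tf.initialize_all_variables())

# Fetch only 'hidden': the ops producing 'output' are pruned from this Run call.
h = sess.run(hidden, feed_dict={x: [[1., 2., 3., 4.]]})

# Fetch both in one call; the shared subgraph is executed once.
h, o = sess.run([hidden, output], feed_dict={x: [[1., 2., 3., 4.]]})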
Single device performance is important, but the biggest performance improvements come from large-scale distributed systems with model and data parallelism
Experiment Turnaround Time and Research Productivity
● Minutes, Hours:
○ Interactive research! Instant gratification!
● 1-4 days
○ Tolerable
○ Interactivity replaced by running many experiments in parallel
● 1-4 weeks
○ High value experiments only
○ Progress stalls
● >1 month
○ Don’t even try
Transition
● How do you do this at scale?
● How does TensorFlow make distributed training easy?
Model Parallelism
● Best way to decrease training time: decrease the step
time
● Many models have lots of inherent parallelism
● Problem is distributing work so communication doesn’t
kill you
○ local connectivity (as found in CNNs)
○ towers with little or no connectivity between towers (e.g. AlexNet)
○ specialized parts of model active only for some examples
Exploiting Model Parallelism
On a single core: instruction parallelism (SIMD). Pretty much free.

Across cores: thread parallelism. Almost free, unless across sockets, in which case inter-socket bandwidth matters (QPI on Intel).

Across devices: for GPUs, often limited by PCIe bandwidth.

Across machines: limited by network bandwidth / latency
Model Parallelism
Using TensorFlow for Parallelism
Easy to express both model parallelism as well as data parallelism

● Very minimal changes to single device model code
Devices and Graph Placement
● Given a graph and set of devices, TensorFlow
implementation must decide which device executes
each node
Full and Partial Device Constraints (Hints)
Devices are named hierarchically:
/job:localhost/device:cpu:0
/job:worker/task:17/device:gpu:3
/job:parameters/task:4/device:cpu:0

Client can specify full or partial constraints for nodes in graph:
“Place this node on /job:localhost/device:gpu:2”
“Place this node on /device:gpu:*”
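In the Python front end these constraints are expressed with tf.device() scopes. A small sketch (the job/task names follow the examples above but are otherwise placeholders; building this graph works anywhere, while actually running it requires a cluster that defines these jobs, or soft placement):

import tensorflow as tf

# Full constraint: pin the shared variable to a parameter-server task's CPU.
with tf.device("/job:parameters/task:4/device:cpu:0"):
    weights = tf.Variable(tf.zeros([784, 10]))

# Pin the compute onto a specific worker GPU; the runtime moves 'weights'
# across the job boundary as needed via Send/Recv.
with tf.device("/job:worker/task:17/device:gpu:3"):
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(x, weights)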


Placement Algorithm
Given hints, plus a cost model (node execution time estimates and Tensor size estimates), make placement decisions

● Currently a relatively simple greedy algorithm
● Active area of work

Show CIFAR10 placement in TensorBoard.
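As a rough illustration of what "greedy placement given a cost model" can mean (a toy sketch only, not TensorFlow's SimplePlacer; all names, costs, and the bandwidth number are made up): visit nodes in topological order and put each one on the device that minimizes its estimated completion time, given the node's execution-time estimate, tensor-size estimates for its inputs, and the work already assigned to each device.

# Toy greedy placer over a hypothetical cost model.
def place(nodes, devices, exec_time, tensor_bytes, bandwidth=1e9):
    placement, busy_until = {}, {d: 0.0 for d in devices}
    for node, inputs in nodes:                    # nodes in topological order
        best_dev, best_finish = None, float("inf")
        for dev in devices:
            # Estimated transfer cost for inputs that live on another device.
            xfer = sum(tensor_bytes[i] / bandwidth
                       for i in inputs if placement[i] != dev)
            finish = busy_until[dev] + xfer + exec_time[node]
            if finish < best_finish:
                best_dev, best_finish = dev, finish
        placement[node] = best_dev
        busy_until[best_dev] = best_finish
    return placement

nodes = [("examples", []), ("weights", []), ("matmul", ["examples", "weights"])]
exec_time = {"examples": 0.1, "weights": 0.1, "matmul": 5.0}
tensor_bytes = {"examples": 4e6, "weights": 4e6, "matmul": 4e6}
print(place(nodes, ["/gpu:0", "/gpu:1"], exec_time, tensor_bytes))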


Google Photos Search

Reuse same model for completely different problems

Same basic model structure trained on different data, useful in completely different contexts

Example: given image → predict interesting pixels

www.google.com/sunroof

We have tons of vision problems

Image search, StreetView, Satellite Imagery, Translation, Robotics, Self-driving Cars, MEDICAL IMAGING

Very good results using a similar model for detecting diabetic retinopathy in retinal images
“Seeing” Go
RankBrain in Google Search Ranking

Query: “car parts for sale”
Doc: “Rebuilt transmissions …”
Query & document features → Deep Neural Network → Score for (doc, query) pair

Launched in 2015
Third most important search ranking signal (of 100s)
Bloomberg, Oct 2015: “Google Turning Its Lucrative Web Search Over to AI Machines”
Example: LSTM [Hochreiter et al., 1997][Gers et al., 1999]

Enables long-term dependencies to flow
Sequence-to-Sequence Model
Target sequence
[Sutskever & Vinyals & Le NIPS 2014] X Y Z Q

A B C D __ X Y Z

Input sequence
Sequence-to-Sequence
● Active area of research
● Many groups actively pursuing RNN/LSTM
○ Montreal
○ Stanford
○ U of Toronto
○ Berkeley
○ Google
○ ...
● Further Improvements
○ Attention
○ NTM / Memory Nets
○ ...
Sequence-to-Sequence
● Translation: [Kalchbrenner et al., EMNLP 2013][Cho et al., EMLP 2014][Sutskever & Vinyals & Le, NIPS
2014][Luong et al., ACL 2015][Bahdanau et al., ICLR 2015]

● Image captions: [Mao et al., ICLR 2015][Vinyals et al., CVPR 2015][Donahue et al., CVPR 2015][Xu et al.,
ICML 2015]

● Speech: [Chorowsky et al., NIPS DL 2014][Chan et al., arxiv 2015]


● Language Understanding: [Vinyals & Kaiser et al., NIPS 2015][Kiros et al., NIPS 2015]
● Dialogue: [Shang et al., ACL 2015][Sordoni et al., NAACL 2015][Vinyals & Le, ICML DL 2015]
● Video Generation: [Srivastava et al., ICML 2015]
● Algorithms: [Zaremba & Sutskever, arxiv 2014][Vinyals & Fortunato & Jaitly, NIPS 2015][Kaiser &
Sutskever, arxiv 2015][Zaremba et al., arxiv 2015]
How to do Image Captions?

P(English | French)  →  P(English | Image)
Image Captioning
[Vinyals et al., CVPR 2015] A young girl asleep

W __ A young girl
Image Captions Research
Human: A young girl asleep on the sofa cuddling a stuffed bear.
Model: A close up of a child holding a stuffed animal.
Model: A baby is asleep next to a teddy bear.

Human: A man outside cooking with a sub in his hand.
BestModel: A man is holding a sandwich in his hand.
InitialModel: A man cutting a cake with a knife.

Human: Someone is using a small grill to melt his sandwich.
BestModel: A person is cooking some food on a grill.
InitialModel: A pizza sitting on top of a white plate.

Human: A woman holding up a yellow banana to her face.
BestModel: A woman holding a banana up to her face.
InitialModel: A close up of a person eating a hot dog.

Human: A blue, yellow and red train travels across the tracks near a depot.
BestModel: A blue and yellow train traveling down train tracks.
InitialModel: A train that is sitting on the tracks.
Pointer Networks
➢ Goal: Mappings where outputs are (sub)sets of inputs
➢ Travelling Salesman Problem

➢ Convex Hulls

[Vinyals, Fortunato & Jaitly, NIPS 2015 ]


Pointer Networks



x1 x2 x3 x4 x5 x6 x1 x6 x5 x2 x1
y1 y2 y3 y4 y5 y6 ⇒ y1 y6 y5 y2 y1

[Vinyals, Fortunato & Jaitly, NIPS 2015 ]


Neural Conversational Models
● Take movie subtitles (~900M words) or IT HelpDesk chats
● Predict the next dialog turn from history
[Vinyals & Le, ICML DL Workshop 2015]

Example exchange:
i got to go .
no .
i get too emotional when i drink .
have another beer . i 've got to get up early .
no , you don 't . sit down .
i get too emotional when i drink .
will you have another beer ?
i 've got to go !
why ?
i got to get up early in the morning .
you 're drunk .
and emotional !
you got to go .
Smart Reply
April 1, 2009: April Fool’s Day joke

Nov 5, 2015: Launched Real Product

Feb 1, 2016: >10% of mobile Inbox replies


Google Research Blog

Smart Reply - Nov 2015

Incoming Email → Small Feed-Forward Neural Network → Activate Smart Reply? (yes/no)
Incoming Email → Deep Recurrent Neural Network → Generated Replies
Example: LSTM

for i in range(20):
  m, c = LSTMCell(x[i], mprev, cprev)
  mprev = m
  cprev = c
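The LSTMCell helper is not spelled out on the slide; a minimal sketch of what such a cell could look like, following the standard LSTM equations (not the cell used in the talk; the sizes are placeholders and the tf.concat/tf.split calls use the 2016-era argument order):

import tensorflow as tf

HIDDEN, INPUT = 1000, 1000   # placeholder sizes
W = tf.Variable(tf.random_normal([INPUT + HIDDEN, 4 * HIDDEN], stddev=0.1))
b = tf.Variable(tf.zeros([4 * HIDDEN]))

def LSTMCell(x, mprev, cprev):
    # One LSTM step: all four gate pre-activations in a single matmul.
    gates = tf.matmul(tf.concat(1, [x, mprev]), W) + b
    i, f, o, g = tf.split(1, 4, gates)   # input, forget, output gates; candidate
    c = tf.sigmoid(f) * cprev + tf.sigmoid(i) * tf.tanh(g)   # new cell state
    m = tf.sigmoid(o) * tf.tanh(c)                           # new output
    return m, c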
Example: Deep LSTM

for i in range(20):
  for d in range(4):  # d is depth
    input = x[i] if d == 0 else m[d-1]
    m[d], c[d] = LSTMCell(input, mprev[d], cprev[d])
    mprev[d] = m[d]
    cprev[d] = c[d]
Example: Deep LSTM

for i in range(20):
  for d in range(4):  # d is depth
    with tf.device("/gpu:%d" % d):
      input = x[i] if d == 0 else m[d-1]
      m[d], c[d] = LSTMCell(input, mprev[d], cprev[d])
      mprev[d] = m[d]
      cprev[d] = c[d]
(Diagram, built up over several animation frames: a deep LSTM sequence model with its layers and softmax spread across GPU1–GPU6, reading input tokens A B C D and emitting the output sequence one token per timestep)

80k softmax by 1000 dims — this is very big! Split softmax into 4 GPUs.
1000 LSTM cells, 2000 dims per timestep.
2000 x 4 = 8k dims per sentence.
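A minimal sketch of the "split the softmax into 4 GPUs" idea (again 1.x-era API with the 2016-era tf.concat argument order; the vocabulary and hidden sizes follow the numbers on the slide, everything else is a placeholder): shard the 80k-way output projection column-wise, compute each shard on its own GPU, and concatenate the logits.

import tensorflow as tf

VOCAB, HIDDEN, NUM_GPUS = 80000, 1000, 4
SHARD = VOCAB // NUM_GPUS

hidden = tf.placeholder(tf.float32, [None, HIDDEN])   # top LSTM output

logit_shards = []
for d in range(NUM_GPUS):
    with tf.device("/gpu:%d" % d):
        # Each GPU owns one column shard of the big softmax weight matrix.
        w = tf.Variable(tf.random_normal([HIDDEN, SHARD], stddev=0.05))
        bias = tf.Variable(tf.zeros([SHARD]))
        logit_shards.append(tf.matmul(hidden, w) + bias)

logits = tf.concat(1, logit_shards)      # [batch, 80000]
probs = tf.nn.softmax(logits)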
TensorFlow Queues

Input prefetching
Grouping similar examples
Randomization/Shuffling

Enqueue → Queue → Dequeue
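A small sketch of a queue used for input prefetching (tf.FIFOQueue from the same era of the API; the shapes and the single hand-fed example are made up, and in practice the enqueue side would run on its own input-reading threads):

import tensorflow as tf

# Queue holding up to 1000 example/label pairs for prefetching.
queue = tf.FIFOQueue(capacity=1000,
                     dtypes=[tf.float32, tf.int32],
                     shapes=[[784], []])

example = tf.placeholder(tf.float32, [784])
label = tf.placeholder(tf.int32, [])
enqueue_op = queue.enqueue([example, label])            # run by input threads
batch_examples, batch_labels = queue.dequeue_many(32)   # consumed by training

sess = tf.Session()
sess.run(enqueue_op, feed_dict={example: [0.0] * 784, label: 3})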
Example: Deep LSTMs
● Wrinkles
○ Bucket sentences by length using a queue per length (see the sketch below)
○ Dequeue when a full batch of same length has accumulated
○ N different graphs for different lengths
○ Alternative: while loop
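A plain-Python sketch of the bucketing wrinkle (no TensorFlow here; in the real pipeline each bucket would be a queue like the one above, and each batch would be fed to the graph built for that length):

from collections import defaultdict

BATCH_SIZE = 32
buckets = defaultdict(list)     # sentence length -> pending sentences

def add_sentence(tokens):
    """Add one tokenized sentence; return a full same-length batch if ready."""
    bucket = buckets[len(tokens)]
    bucket.append(tokens)
    if len(bucket) == BATCH_SIZE:
        batch, buckets[len(tokens)] = bucket, []
        return batch            # feed this to the graph built for this length
    return None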
Expressing Data Parallelism
# Single-device version: place the model on the local CPU.
with tf.device("/cpu:0"):
  # Create the Mnist model.
  model = MnistModel(batch_size=16, hidden_units=200)

# Get an initialized, and possibly recovered session.
sess = tf.Session()

# Train the model.
for local_step in xrange(FLAGS.max_steps):
  _, loss, step = sess.run([model.train_op, model.loss, model.global_step])
  if local_step % 1000 == 0:
    print "step %d: %g" % (step, loss)
Expressing Data Parallelism
# We use the ReplicaDeviceSetter() device function to automatically
# assign Variables to the 'ps' jobs.
with tf.device(tf.ReplicaDeviceSetter(parameter_devices=10)):
  # Create the Mnist model.
  model = MnistModel(batch_size=16, hidden_units=200)

# Create a Supervisor. It will take care of initialization, summaries,
# checkpoints, and recovery. When multiple replicas of this program are running,
# the first one, identified by --task=0, is the 'chief' supervisor (e.g., initialization, saving).
supervisor = tf.Supervisor(is_chief=(FLAGS.task == 0), saver=model.saver)

# Get an initialized, and possibly recovered session.
sess = supervisor.PrepareSession(FLAGS.master_job)

# Train the model.
for local_step in xrange(int32_max):
  _, loss, step = sess.run([model.train_op, model.loss, model.global_step])
  if step >= FLAGS.max_steps:
    break
  if local_step % 1000 == 0:
    print "step %d: %g" % (step, loss)
Combining Vision with Robotics

“Deep Learning for Robots: Learning from Large-Scale Interaction”, Google Research Blog, March 2016

“Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection”, Sergey Levine, Peter Pastor, Alex Krizhevsky, & Deirdre Quillen, Arxiv, arxiv.org/abs/1603.02199
Network Optimizations
● Neural net training very tolerant of reduced precision
● e.g. drop precision to 16 bits across network

Device A → Device B: instead of sending params at full precision
(params → Send → Recv → MatMul with Input), insert casts around the transfer
(params → ToFP16 → Send → Recv → ToFP32 → MatMul with Input)
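A sketch of what those conversions look like written out in the Python front end (toy shapes and placeholder job names; in the runtime the casts sit around the implicit Send/Recv ops, whereas here they are spelled out explicitly for illustration):

import tensorflow as tf

with tf.device("/job:ps/task:0/cpu:0"):
    params = tf.Variable(tf.random_normal([784, 256]))
    params_fp16 = tf.cast(params, tf.float16)   # halve the bytes on the wire

with tf.device("/job:worker/task:0/gpu:0"):
    x = tf.placeholder(tf.float32, [None, 784])
    w = tf.cast(params_fp16, tf.float32)        # restore fp32 for the math
    y = tf.matmul(x, w)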
Quantization for Inference
● Need even less precision for inference
● 8-bit fixed point works well, but many ways of
quantizing
● Critical for things like mobile devices
○ w/quantization, high-end smart phone can run
Inception model at >6 frames per second (fps)
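One common way to do the 8-bit step (a NumPy sketch of simple linear quantization, not the specific scheme TensorFlow ships): map each tensor's float range onto 0–255 with a scale and offset, and dequantize on the way back.

import numpy as np

def quantize_uint8(x):
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0      # avoid divide-by-zero for constant tensors
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

weights = np.random.randn(784, 10).astype(np.float32)
q, lo, scale = quantize_uint8(weights)
approx = dequantize(q, lo, scale)         # within ~scale/2 of the original values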
How Can You Get Started with Machine Learning?
Three ways, with varying complexity (each more flexible than the last, but requiring more effort):

(1) Use a Cloud-based API (Vision, Speech, etc.)
(2) Use an existing model architecture, and retrain it or fine-tune it on your dataset
(3) Develop your own machine learning models for new problems
Use Cloud-based APIs

cloud.google.com/translate

cloud.google.com/speech

cloud.google.com/vision

cloud.google.com/text
Google Cloud Vision API
https://cloud.google.com/vision/
Google Cloud ML
Scaled service for training and inference w/TensorFlow
A Few TensorFlow Community Examples
(From more than 2100 results for ‘tensorflow’ on GitHub)

● DQN: github.com/nivwusquorum/tensorflow-deepq
● NeuralArt: github.com/woodrush/neural-art-tf
● Char RNN: github.com/sherjilozair/char-rnn-tensorflow
● Keras ported to TensorFlow: github.com/fchollet/keras
● Show and Tell: github.com/jazzsaxmafia/show_and_tell.tensorflow
● Mandarin translation: github.com/jikexueyuanwiki/tensorflow-zh
...
Concluding Remarks
● Model and Data Parallelism enable great ML work:
○ Neural Machine Translation: ~6x speedup on 8 GPUs
○ Inception / Imagenet: ~40x speedup on 50 GPUs
○ RankBrain: ~300X speedup on 500 machines
● TensorFlow open-source community vibrant and growing
● TensorFlow makes it easy to express ML computations
What Does the Future Hold?
Deep learning usage will continue to grow and accelerate:

● Across more and more fields and problems:


○ robotics, self-driving vehicles, ...
○ health care
○ video understanding
○ dialogue systems
○ personal assistance
○ ...
Google Brain Residency Program

One year immersion program in deep learning research
● First class started six weeks ago, planning for next year’s class is underway

Learn to conduct deep learning research w/experts in our team
● Fixed one-year employment with salary, benefits, ...
● Goal after one year is to have conducted several research projects
● Interesting problems, TensorFlow, and access to computational resources

g.co/brainresidency
Google Brain Residency Program

Who should apply?


● people with BSc, MSc or PhD, ideally in CS, mathematics or statistics
● completed coursework in calculus, linear algebra, and probability, or equiv.
● programming experience
● motivated, hard working, and have a strong interest in deep learning

g.co/brainresidency
Google Brain Residency Program

Current class for June 2016 to May 2017


● ⅓ B.S., ⅓ M.S., ⅓ Ph.D. or postdoc
● ½ coming straight from school, ½ with some post-school working experience
● Mix of backgrounds: computer scientists, math/stats, EE, physics, comp bio, ...

Applications for class for June 2017 to May 2018 will open in Fall 2016

g.co/brainresidency
Further Reading
● Dean, et al., Large Scale Distributed Deep Networks, NIPS 2012,
research.google.com/archive/large_deep_networks_nips2012.html.
● Mikolov, Chen, Corrado & Dean. Efficient Estimation of Word Representations in Vector
Space, NIPS 2013, arxiv.org/abs/1301.3781.
● Sutskever, Vinyals, & Le, Sequence to Sequence Learning with Neural Networks, NIPS,
2014, arxiv.org/abs/1409.3215.
● Vinyals, Toshev, Bengio, & Erhan. Show and Tell: A Neural Image Caption Generator.
CVPR 2015. arxiv.org/abs/1411.4555
● TensorFlow white paper, tensorflow.org/whitepaper2015.pdf (clickable links in bibliography)

g.co/brain (We’re hiring! Also check out Brain Residency program at g.co/brainresidency)
www.tensorflow.org
research.google.com/people/jeff
research.google.com/pubs/BrainTeam.html

Questions?
