
Large Scale Distributed Systems for Training Neural Networks

Jeff Dean & Oriol Vinyals
Google
Google Brain team, in collaboration with many other teams
Background
Google Brain project started in 2011, with a focus on pushing state-of-the-art in neural networks. Initial emphasis:
● use large datasets, and
● large amounts of computation
to push boundaries of what is possible in perception and language understanding
Overview
● Cover our experience from past ~5 years
○ Research: speech, images, video, robotics, language understanding, NLP, translation, optimization algorithms, unsupervised learning, …
○ Production: deployed systems for advertising, search, GMail, Photos, Maps, YouTube, speech recognition, image analysis, user prediction, …
● Focus on neural nets, but many techniques are more broadly applicable
Overview
● Demonstrate TensorFlow, an open source machine
learning system
○ Our primary research and production system
○ Show real examples
○ Explain what’s happening underneath the covers
Outline

● Introduction to Deep Learning


● TensorFlow Basics
○ Demo
○ Implementation Overview
● Scaling Up
○ Model Parallelism
○ Data Parallelism
○ Expressing these in TensorFlow
● More complex examples
○ CNNs / Deep LSTMs
Growing Use of Deep Learning at Google
[Chart: number of unique project directories containing model description files, growing steadily over time]
Across many products/areas: Android, Apps, drug discovery, Gmail, image understanding, Maps, natural language understanding, Photos, robotics research, speech, translation, YouTube, … many others …
Deep Learning
Universal Machine Learning
[Diagram: inputs and outputs drawn from the same set of data types: speech, text, search queries, images, videos, labels, entities, words, audio, features]
Deep Learning
Universal Machine Learning
...that works better than the alternatives!
Current State-of-the-art in:
Speech Recognition
Image Recognition
Machine Translation
Molecular Activity Prediction
Road Hazard Detection
Optical Character Recognition
...
ConvNets
Some More Benefits
Deals very naturally w/ sequence data (text, speech, video, ...)
Very effective at transfer learning across tasks
Very easy to get started with a commodity GPU
A common ‘language’ across a great many fields of research

Two Generations of Distributed ML Systems

1st generation - DistBelief (Dean et al., NIPS 2012)
● Scalable, good for production, but not very flexible for research

2nd generation - TensorFlow (see tensorflow.org and the 2015 whitepaper, tensorflow.org/whitepaper2015.pdf)
● Scalable, good for production, but also flexible for a variety of research uses
● Portable across a range of platforms
● Open source w/ Apache 2.0 license
Need Both Large Datasets & Large, Powerful Models
“Scaling Recurrent Neural Network Language Models”, Williams et al. 2015
arxiv.org/pdf/1502.00512v1.pdf
Large Datasets + Powerful Models
● Combination works incredibly well
● Poses interesting systems problems, though:
○ Need lots of computation
○ Want to train and do experiments quickly
○ Large-scale parallelism using distributed systems is really the only way to do this at very large scale
○ Also want to easily express machine learning ideas
Basics of Deep Learning
● Unsupervised cat
● Speech
● Vision
● General trend is towards more complex models:
○ Embeddings of various kinds
○ Generative models
○ Layered LSTMs
○ Attention
Learning from Unlabeled Images

• Train on 10 million images (YouTube)


• 1000 machines (16,000 cores) for 1 week.
• 1.15 billion parameters
Learning from Unlabeled Images

Optimal stimulus
by numerical optimization
Top 48 stimuli from the test set
Learning from Unlabeled Images

Optimal stimulus
by numerical optimization
Top 48 stimuli from the test set
Adding Supervision

Top stimuli for selected neurons.


Speech: Feedforward Acoustic Models
Model speech frame-by-frame, independently
Simple fully-connected networks

Deep Neural Networks for Acoustic Modeling in Speech Recognition
Hinton et al., IEEE Signal Processing Magazine, 2012
CLDNNs
Model frequency invariance using 1D convolutions
Model time dynamics using an LSTM
Use fully connected layers on top to add depth

Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks
Sainath et al., ICASSP’15
Trend: LSTMs end-to-end!
[Pipeline: Speech → Acoustics → Phonetics → Language → Text]

Train recurrent models that also incorporate lexical and language modeling:

Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition, H. Sak et al., 2015
Deep Speech: Scaling up end-to-end speech recognition, A. Hannun et al., 2014
Listen, Attend and Spell, W. Chan et al., 2015
CNNs for Vision: AlexNet

ImageNet Classification with Deep Convolutional Neural Networks
Krizhevsky, Sutskever and Hinton, NIPS 2012

The Inception Architecture (GoogLeNet, 2015)
Basic module, which is then replicated many times

The Inception Architecture (GoogLeNet, 2015)
Going Deeper with Convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
ArXiv 2014, CVPR 2015

Inception-v3 (December 2015)
http://arxiv.org/abs/1512.00567
Rapid Progress in Image Recognition (ImageNet challenge classification task)

| Team                            | Year | Place | Error (top-5) | Params        |
|---------------------------------|------|-------|---------------|---------------|
| XRCE (pre-neural-net explosion) | 2011 | 1st   | 25.8%         |               |
| Supervision (AlexNet)           | 2012 | 1st   | 16.4%         | 60M           |
| Clarifai                        | 2013 | 1st   | 11.7%         | 65M           |
| MSRA                            | 2014 | 3rd   | 7.35%         |               |
| VGG                             | 2014 | 2nd   | 7.32%         | 180M          |
| GoogLeNet (Inception)           | 2014 | 1st   | 6.66%         | 5M            |
| Andrej Karpathy (human)         | 2014 | N/A   | 5.1%          | 100 trillion? |
| BN-Inception (Arxiv)            | 2015 | N/A   | 4.9%          | 13M           |
| Inception-v3 (Arxiv)            | 2015 | N/A   | 3.46%         | 25M           |

Models with a small number of parameters fit easily in a mobile app (8-bit fixed point)
Today’s News: Pre-trained Inception-v3 model released
http://googleresearch.blogspot.com/2015/12/how-to-classify-images-with-tensorflow.html

Dear TensorFlow community,

Today we are releasing our best image classifier trained on ImageNet data. As described in our recent Arxiv preprint at http://arxiv.org/abs/1512.00567, an ensemble of four of these models achieves 3.5% top-5 error on the validation set of the ImageNet whole image ILSVRC2012 classification task (compared with our ensemble from last year that won the 2014 ImageNet classification challenge with a 6.66% top-5 error rate).

In this release, we are supplying code and data files containing the trained model parameters for running the image classifier on:
● Both desktop and mobile environments
● Employing either a C++ or Python API.

In addition, we are providing a tutorial that describes how to use the image recognition system for a variety of use-cases.
http://www.tensorflow.org/tutorials/image_recognition/index.html
What do you want in a research system?
● Ease of expression: for lots of crazy ML ideas/algorithms
● Scalability: can run experiments quickly
● Portability: can run on wide variety of platforms
● Reproducibility: easy to share and reproduce research
● Production readiness: go from research to real products
TensorFlow:
Second Generation Deep Learning System
If we like it, wouldn’t the rest of the world like it, too?

Open sourced single-machine TensorFlow on Monday, Nov. 9th


● Flexible Apache 2.0 open source licensing
● Updates for distributed implementation coming soon

http://tensorflow.org/
Motivations

DistBelief (1st system):
● Great for scalability, and production training of basic kinds of models
● Not as flexible as we wanted for research purposes

Better understanding of the problem space allowed us to make some dramatic simplifications
TensorFlow: Expressing High-Level ML Computations

● Core in C++
● Different front ends for specifying/driving the computation
○ Python and C++ today, easy to add more

[Layered diagram, built up over three slides:
C++ front end | Python front end | ...
Core TensorFlow Execution System
CPU | GPU | Android | iOS | ...]
Portable
Automatically runs models on range of platforms:

from phones ...

to single machines (CPU and/or GPUs) …

to distributed systems of many 100s of GPU cards


Computation is a dataflow graph

Graph of Nodes, also called Operations or ops.
[Diagram: examples and weights feed a MatMul node; biases are added (Add), followed by Relu, then Xent together with labels]
Computation is a dataflow graph (with tensors)

Edges are N-dimensional arrays: Tensors
[Same diagram: examples, weights, biases, labels flowing through MatMul, Add, Relu, Xent]
Computation is a dataflow graph (with state)

'Biases' is a variable. Some ops compute gradients. −= updates biases.
[Diagram: biases feed Add; a gradient path through ... and Mul (scaled by the learning rate) feeds a −= node that updates biases in place]
Computation is a dataflow graph (distributed)

[Diagram: the same graph split across Device A and Device B: biases, Add, ..., Mul, −=, learning rate]
Devices: Processes, Machines, GPUs, etc.
Send and Receive Nodes (distributed)

[Diagram, built up over three slides: each edge that crosses the Device A / Device B boundary is replaced by a Send node on the sending device paired with a Recv node on the receiving device]
Devices: Processes, Machines, GPUs, etc.
Send and Receive Implementations
● Different implementations depending on source/dest devices
● e.g. GPUs on same machine: local GPU → GPU copy
● e.g. CPUs on different machines: cross-machine RPC
● e.g. GPUs on different machines: RDMA
Extensible
● Core system defines a number of standard operations
and kernels (device-specific implementations of
operations)
● Easy to define new operators and/or kernels
Session Interface
● Extend: add nodes to the computation graph
● Run: execute an arbitrary subgraph
○ optionally feeding in Tensor inputs and retrieving Tensor outputs

Typically, set up a graph with one or a few Extend calls and then Run it thousands or millions of times (see the sketch below)
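A minimal sketch of this pattern with the Python front end (TensorFlow 1.x-style graph API; the tiny linear model here is purely illustrative):

import numpy as np
import tensorflow as tf

# Extend: build the graph once (a tiny linear model).
x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
w = tf.Variable(tf.zeros([784, 10]), name="w")
y = tf.matmul(x, w, name="y")
init = tf.global_variables_initializer()

# Run: execute subgraphs of the same graph many times.
with tf.Session() as sess:
    sess.run(init)                                      # run only the init subgraph
    for _ in range(3):
        batch = np.random.rand(16, 784).astype(np.float32)
        print(sess.run(y, feed_dict={x: batch}).shape)  # run only the subgraph needed for y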
Single Process Configuration

Distributed Configuration
[Diagram: the distributed configuration connects its components across processes/machines via RPC]
Feeding and Fetching

Run(input={“b”: ...}, outputs={“f:0”})
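A hedged sketch of the same Run call in the Python client; the tensor names “b” and “f” come from the figure, and the toy op standing in for the graph between them is made up:

import tensorflow as tf

b = tf.placeholder(tf.float32, name="b")
f = tf.multiply(b, 2.0, name="f")   # stand-in for whatever subgraph connects b to f

with tf.Session() as sess:
    # Only the subgraph between the fed tensor "b:0" and the fetched "f:0" runs.
    print(sess.run("f:0", feed_dict={"b:0": 3.0}))   # 6.0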


Example: Power method for Eigenvectors
● Simple 5x5 matrix, compute result, iterated K times
● TensorBoard graph visualization
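The demo itself is not reproduced here; a minimal sketch of the same computation (power iteration on a random 5x5 matrix, TF 1.x-style API) might look like:

import numpy as np
import tensorflow as tf

K = 100
A = tf.constant(np.random.rand(5, 5).astype(np.float32))   # simple 5x5 matrix
v = tf.Variable(tf.ones([5, 1]))                            # current eigenvector estimate

# One power-method step: multiply, then renormalize.
Av = tf.matmul(A, v)
step = tf.assign(v, Av / tf.sqrt(tf.reduce_sum(tf.square(Av))))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(K):              # iterate K times
        sess.run(step)
    print(sess.run(v))              # approximate dominant eigenvector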
Under the hood: Power method
● Operators
● Kernel implementations for different devices
● Run call
● Tensor memory management
Example: Symbolic differentiation
● f(x) = xᵀ W x ; now minimize
● Show df/dx = 2Wx in the graph
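A small sketch of how the graph can compute this gradient symbolically via tf.gradients (W is made symmetric here so that df/dx is exactly 2Wx):

import numpy as np
import tensorflow as tf

W0 = tf.constant(np.random.rand(5, 5).astype(np.float32))
W = (W0 + tf.transpose(W0)) / 2.0                  # symmetric W
x = tf.Variable(tf.ones([5, 1]))

f = tf.matmul(tf.matmul(tf.transpose(x), W), x)    # f(x) = x^T W x
grad = tf.gradients(f, x)[0]                       # graph nodes computing df/dx
ref = 2.0 * tf.matmul(W, x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    g, r = sess.run([grad, ref])
    print(np.allclose(g, r))                       # True: df/dx == 2*W*x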
TensorFlow Single Device Performance
Initial measurements done by Soumith Chintala

| Benchmark                                     | Forward | Forward+Backward |
|-----------------------------------------------|---------|------------------|
| AlexNet - cuDNNv3 on Torch (Soumith)          | 32 ms   | 96 ms            |
| AlexNet - Neon (Soumith)                      | 32 ms   | 101 ms           |
| AlexNet - cuDNNv2 on Torch (Soumith)          | 70 ms   | 231 ms           |
| AlexNet - cuDNNv2 on TensorFlow 0.5 (Soumith) | 96 ms   | 326 ms           |

See https://github.com/soumith/convnet-benchmarks/issues/66
Two main factors:
(1) various overheads (nvcc doesn’t like 64-bit tensor indices, etc.)
(2) versions of convolutional libraries being used (cuDNNv2 vs. v3, etc.)
TensorFlow Single Device Performance
Prong 1: Tackling sources of overhead

| Benchmark                                               | Forward      | Forward+Backward |
|---------------------------------------------------------|--------------|------------------|
| AlexNet - cuDNNv3 on Torch (Soumith)                    | 32 ms        | 96 ms            |
| AlexNet - Neon (Soumith)                                | 32 ms        | 101 ms           |
| AlexNet - cuDNNv2 on Torch (Soumith)                    | 70 ms        | 231 ms           |
| AlexNet - cuDNNv2 on TensorFlow 0.5 (Soumith)           | 96 ms        | 326 ms           |
| AlexNet - cuDNNv2 on TensorFlow 0.5 (our machine)       | 97 ms        | 336 ms           |
| AlexNet - cuDNNv2 on TensorFlow 0.6 (our machine: soon) | 70 ms (+39%) | 230 ms (+31%)    |
TensorFlow Single Device Performance
TF 0.5 vs. 0.6 release candidate measurements (on our machine w/ Titan-X)

| Benchmark                                     | Forward       | Forward+Backward |
|-----------------------------------------------|---------------|------------------|
| AlexNet - cuDNNv2 on TensorFlow 0.5           | 97 ms         | 336 ms           |
| AlexNet - cuDNNv2 on TensorFlow 0.6 (soon)    | 70 ms (+27%)  | 230 ms (+31%)    |
| OxfordNet - cuDNNv2 on TensorFlow 0.5         | 573 ms        | 1923 ms          |
| OxfordNet - cuDNNv2 on TensorFlow 0.6 (soon)  | 338 ms (+41%) | 1240 ms (+36%)   |
| Overfeat - cuDNNv2 on TensorFlow 0.5          | 322 ms        | 1179 ms          |
| Overfeat - cuDNNv2 on TensorFlow 0.6 (soon)   | 198 ms (+39%) | 832 ms (+29%)    |

TensorFlow Single Device Performance

Prong 2: Upgrade to faster core libraries like cuDNN v3 (and/or the upcoming v4)

Won’t make it into the 0.6 release later this week, but likely in the next release
Single device performance is important, but…
the biggest performance improvements come from large-scale distributed systems with model and data parallelism
Experiment Turnaround Time and Research Productivity
● Minutes, Hours:
○ Interactive research! Instant gratification!
● 1-4 days
○ Tolerable
○ Interactivity replaced by running many experiments in parallel
● 1-4 weeks
○ High value experiments only
○ Progress stalls
● >1 month
○ Don’t even try
Transition
● How do you do this at scale?
● How does TensorFlow make distributed training easy?
Model Parallelism
● Best way to decrease training time: decrease the step
time
● Many models have lots of inherent parallelism
● Problem is distributing work so communication doesn’t
kill you
○ local connectivity (as found in CNNs)
○ towers with little or no connectivity between towers (e.g. AlexNet)
○ specialized parts of model active only for some examples
Exploiting Model Parallelism
On a single core: instruction parallelism (SIMD). Pretty much free.

Across cores: thread parallelism. Almost free, unless across sockets, in which case inter-socket bandwidth matters (QPI on Intel).

Across devices: for GPUs, often limited by PCIe bandwidth.

Across machines: limited by network bandwidth / latency
Model Parallelism
[Figure slides: a single large model partitioned across multiple devices]
Data Parallelism
● Use multiple model replicas to process different
examples at the same time
○ All collaborate to update model state (parameters) in shared
parameter server(s)
● Speedups depend highly on kind of model
○ Dense models: 10-40X speedup from 50 replicas
○ Sparse models:
■ support many more replicas
■ often can use as many as 1000 replicas
Data Parallelism
[Figure, built up over several slides: each model replica processes its own shard of the data, fetches the current parameters p from the parameter servers, computes an update ∆p, and sends it back; the parameter servers apply p' = p + ∆p, the replicas then work against p' to produce ∆p', the servers apply p'' = p' + ∆p', and so on]
Data Parallelism Choices
Can do this synchronously:
● N replicas equivalent to an N times larger batch size
● Pro: No gradient staleness
● Con: Less fault tolerant (requires some recovery if any single machine fails)

Can do this asynchronously:


● Pro: Relatively fault tolerant (failure in model replica doesn’t block other
replicas)
● Con: Gradient staleness means each gradient less effective

(Or hybrid: M asynchronous groups of N synchronous replicas)


Data Parallelism Considerations
Want model computation time to be large relative to time to send/receive parameters over the network

Models with fewer parameters, that reuse each parameter multiple times in the computation:
● Mini-batches of size B reuse parameters B times

Certain model structures reuse each parameter many times within each example:
● Convolutional models tend to reuse hundreds or thousands of times per example (for different spatial positions)
● Recurrent models (LSTMs, RNNs) tend to reuse tens to hundreds of times (for unrolling through T time steps during training)
Success of Data Parallelism
● Data parallelism is really important for many of Google’s
problems (very large datasets, large models):
○ RankBrain uses 500 replicas
○ ImageNet Inception training uses 50 GPUs, ~40X
speedup
○ SmartReply uses 16 replicas, each with multiple GPUs
○ State-of-the-art on LM “One Billion Word” Benchmark
model uses both data and model parallelism on 32
GPUs
10 vs 50 Replica Inception Synchronous Training

[Plot: training progress vs. time in hours for 50 replicas and 10 replicas. Time to reach fixed accuracy levels: 19.6 vs. 80.3 hours (4.1X) and 5.6 vs. 21.8 hours (3.9X)]
Using TensorFlow for Parallelism
Trivial to express both model parallelism as well as data
parallelism

● Very minimal changes to single device model code


Devices and Graph Placement
● Given a graph and set of devices, TensorFlow
implementation must decide which device executes
each node
Full and Partial Device Constraints (Hints)
Devices are named hierarchically:
/job:localhost/device:cpu:0
/job:worker/task:17/device:gpu:3
/job:parameters/task:4/device:cpu:0

Client can specify full or partial constraints for nodes in the graph (examples sketched below):
“Place this node on /job:localhost/device:gpu:2”
“Place this node on /device:gpu:*”
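For example (a sketch of graph construction only; actually running it requires the named jobs to exist in a cluster):

import tensorflow as tf

# Full constraint: pin this variable to a specific parameter-server CPU.
with tf.device("/job:parameters/task:4/device:cpu:0"):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")

# Partial constraint: only the job is named; the runtime picks the task/device.
with tf.device("/job:worker"):
    logits = tf.matmul(tf.ones([16, 784]), weights)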


Placement Algorithm
Given hints, plus a cost model (node execution time estimates and Tensor size estimates), make placement decisions
● Currently a relatively simple greedy algorithm
● Active area of work

Show CIFAR10 placement in TensorBoard.
Example: LSTM [Hochreiter et al, 1997]

● From research paper to code


Sequence-to-Sequence Model
[Sutskever & Vinyals & Le, NIPS 2014]
[Diagram: the encoder reads the input sequence A B C D; the decoder is fed __ X Y Z and produces the target sequence X Y Z Q]
Sequence-to-Sequence
● Active area of research
● Many groups actively pursuing RNN/LSTM
○ Montreal
○ Stanford
○ U of Toronto
○ Berkeley
○ Google
○ ...
● Further Improvements
○ Attention
○ NTM / Memory Nets
○ ...
Sequence-to-Sequence
● Translation: [Kalchbrenner et al., EMNLP 2013][Cho et al., EMLP 2014][Sutskever & Vinyals & Le, NIPS
2014][Luong et al., ACL 2015][Bahdanau et al., ICLR 2015]

● Image captions: [Mao et al., ICLR 2015][Vinyals et al., CVPR 2015][Donahue et al., CVPR 2015][Xu et al.,
ICML 2015]

● Speech: [Chorowsky et al., NIPS DL 2014][Chan et al., arxiv 2015]


● Language Understanding: [Vinyals & Kaiser et al., NIPS 2015][Kiros et al., NIPS 2015]
● Dialogue: [Shang et al., ACL 2015][Sordoni et al., NAACL 2015][Vinyals & Le, ICML DL 2015]
● Video Generation: [Srivastava et al., ICML 2015]
● Algorithms: [Zaremba & Sutskever, arxiv 2014][Vinyals & Fortunato & Jaitly, NIPS 2015][Kaiser &
Sutskever, arxiv 2015][Zaremba et al., arxiv 2015]
How to do Image Captions?

How? Replace the source sentence with an image: P(English | French) → P(English | Image)
[Vinyals et al., CVPR 2015]
[Diagram: the image is fed to the model in place of the source sentence; the decoder, fed W __ A young girl, generates "A young girl asleep"]

Human: A young girl asleep on the sofa cuddling a stuffed bear.
NIC: A close up of a child holding a stuffed animal.
NIC: A baby is asleep next to a teddy bear.
(Recent) Captioning Results
Source: http://mscoco.org/dataset/#leaderboard-cap

| Method        | Meteor     | CIDEr     | LSUN      | LSUN (2)  |
|---------------|------------|-----------|-----------|-----------|
| Google NIC    | 0.346 (1)  | 0.946 (1) | 0.273 (2) | 0.317 (2) |
| MSR Capt      | 0.339 (2)  | 0.937 (2) | 0.250 (3) | 0.301 (3) |
| UCLA/Baidu v2 | 0.325 (5)  | 0.935 (3) | 0.223 (5) | 0.252 (7) |
| MSR           | 0.331 (4)  | 0.925 (4) | 0.268 (2) | 0.322 (2) |
| MSR Nearest   | 0.318 (10) | 0.916 (5) | 0.216 (6) | 0.255 (6) |
| Human         | 0.335 (3)  | 0.910 (6) | 0.638 (1) | 0.675 (1) |
| UCLA/Baidu v1 | 0.320 (8)  | 0.896 (7) | 0.190 (9) | 0.241 (8) |
| LRCN Berkeley | 0.322 (7)  | 0.891 (8) | 0.246 (4) | 0.268 (5) |
| UofM/Toronto  | 0.323 (6)  | 0.878 (9) | 0.262 (3) | 0.272 (4) |
Human: A close up of two bananas with bottles in the background.
BestModel: A bunch of bananas and a bottle of wine.
InitialModel: A close up of a plate of food on a table.

Human: A view of inside of a car where a cat is laying down.
BestModel: A cat sitting on top of a black car.
InitialModel: A dog sitting in the passenger seat of a car.

Human: A brown dog laying in a red wicker bed.
BestModel: A small dog is sitting on a chair.
InitialModel: A large brown dog laying on top of a couch.

Human: A man outside cooking with a sub in his hand.
BestModel: A man is holding a sandwich in his hand.
InitialModel: A man cutting a cake with a knife.

Human: Someone is using a small grill to melt his sandwich.
BestModel: A person is cooking some food on a grill.
InitialModel: A pizza sitting on top of a white plate.

Human: A woman holding up a yellow banana to her face.
BestModel: A woman holding a banana up to her face.
InitialModel: A close up of a person eating a hot dog.

Human: A blue, yellow and red train travels across the tracks near a depot.
BestModel: A blue and yellow train traveling down train tracks.
InitialModel: A train that is sitting on the tracks.
Pointer Networks Teaser
➢ Goal: Mappings where outputs are (sub)sets of inputs
➢ Travelling Salesman Problem

➢ Convex Hulls
Pointer Networks

[Diagram: input points (x1, y1) … (x6, y6); the network outputs a sequence of pointers back into its own input, e.g. (x1,y1) (x6,y6) (x5,y5) (x2,y2) (x1,y1), tracing the convex hull]

Poster => Wed. 210C #22
Neural Conversational Models
● Take movie subtitles (~900M words) or IT HelpDesk chats
● Predict the next dialog from history
i got to go .
no .
i get too emotional when i drink .
have another beer . i 've got to get up early .
no , you don 't . sit down .
i get too emotional when i drink .
will you have another beer ?
i 've got to go ! [Vinyals & Le ICML DL Workshop 2015]
why ?
i got to get up early in the morning .
you 're drunk .
and emotional !
you got to go .
Smart Reply
Google Research Blog - Nov 2015

[Diagram: an incoming email goes to a small feed-forward neural network that decides whether to activate Smart Reply (yes/no); if yes, a deep recurrent neural network generates the replies]
Example: LSTM

for i in range(20):
    m, c = LSTMCell(x[i], mprev, cprev)
    mprev = m
    cprev = c
Example: Deep LSTM

for i in range(20):
    for d in range(4):  # d is depth
        input = x[i] if d == 0 else m[d-1]
        m[d], c[d] = LSTMCell(input, mprev[d], cprev[d])
        mprev[d] = m[d]
        cprev[d] = c[d]
Example: Deep LSTM

for i in range(20):
    for d in range(4):  # d is depth
        with tf.device("/gpu:%d" % d):
            input = x[i] if d == 0 else m[d-1]
            m[d], c[d] = LSTMCell(input, mprev[d], cprev[d])
            mprev[d] = m[d]
            cprev[d] = c[d]
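The LSTMCell helper above is not spelled out in the slides; a minimal, illustrative sketch of such a cell (ignoring biases and peepholes, with the four gate weight matrices packed into a single matrix W, which the slide version does not take as an argument) could be:

import tensorflow as tf

def LSTMCell(x, mprev, cprev, W):
    # W has shape [input_dim + hidden_dim, 4 * hidden_dim]: the four gates stacked.
    gates = tf.matmul(tf.concat([x, mprev], axis=1), W)
    i, f, o, g = tf.split(gates, 4, axis=1)
    c = tf.sigmoid(f) * cprev + tf.sigmoid(i) * tf.tanh(g)   # new cell state
    m = tf.sigmoid(o) * tf.tanh(c)                           # new output / hidden state
    return m, c

# Example shapes: batch 16, input 256, hidden 512.
x = tf.zeros([16, 256]); mprev = tf.zeros([16, 512]); cprev = tf.zeros([16, 512])
W = tf.Variable(tf.random_normal([256 + 512, 4 * 512]))
m, c = LSTMCell(x, mprev, cprev, W)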
[Figure, repeated as an animation across many slides: a deep LSTM sequence-to-sequence model laid out across GPU1-GPU6, reading the input A B C D and the shifted output _ A B C _. Labels from the figure: 1000 LSTM cells, 2000 dims per timestep; 2000 x 4 = 8k dims per sentence; 80k softmax by 1000 dims, which is very big, so the softmax is split across 4 GPUs.]
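A hedged sketch of how the softmax split in the figure might be expressed (shapes follow the figure's labels; this is illustrative, not the production model code):

import tensorflow as tf

batch, dims, vocab, num_gpus = 128, 1000, 80000, 4
hidden = tf.ones([batch, dims])                  # stand-in for the top LSTM layer's output

partial_logits = []
for g in range(num_gpus):
    with tf.device("/gpu:%d" % g):
        # Each GPU owns a 20k-word slice of the 80k x 1000 softmax weights.
        w_shard = tf.Variable(tf.random_normal([dims, vocab // num_gpus]))
        partial_logits.append(tf.matmul(hidden, w_shard))

logits = tf.concat(partial_logits, axis=1)       # full [batch, 80k] logits
probs = tf.nn.softmax(logits)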
TensorFlow Queues
Uses: input prefetching, grouping similar examples, randomization/shuffling
[Diagram: Enqueue ops feed a Queue; Dequeue ops drain it]
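A minimal sketch using tf.FIFOQueue (tf.RandomShuffleQueue is the shuffling variant); in a real input pipeline the enqueues would run on separate prefetching threads:

import tensorflow as tf

queue = tf.FIFOQueue(capacity=100, dtypes=[tf.float32], shapes=[[784]])

example = tf.random_normal([784])        # stand-in for a parsed input record
enqueue_op = queue.enqueue([example])
batch = queue.dequeue_many(16)           # a training step pulls a full batch

with tf.Session() as sess:
    for _ in range(32):
        sess.run(enqueue_op)             # normally done by input threads
    print(sess.run(batch).shape)         # (16, 784)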
Example: Deep LSTMs
● Wrinkles
○ Bucket sentences by length using a queue per length
○ Dequeue when a full batch of same length has
accumulated
○ N different graphs for different lengths
○ Alternative: while loop
Expressing Data Parallelism
# Single-device version: place the model on the local CPU.
with tf.device("/cpu:0"):
    # Create the Mnist model.
    model = MnistModel(batch_size=16, hidden_units=200)

# Get an initialized, and possibly recovered session.
sess = tf.Session()

# Train the model.
for local_step in xrange(FLAGS.max_steps):
    _, loss, step = sess.run([model.train_op, model.loss, model.global_step])
    if local_step % 1000 == 0:
        print "step %d: %g" % (step, loss)
Expressing Data Parallelism
# We use the ReplicaDeviceSetter() device function to automatically
# assign Variables to the 'ps' jobs.
with tf.device(tf.ReplicaDeviceSetter(parameter_devices=10)):
    # Create the Mnist model.
    model = MnistModel(batch_size=16, hidden_units=200)

# Create a Supervisor. It will take care of initialization, summaries,
# checkpoints, and recovery. When multiple replicas of this program are running,
# the first one, identified by --task=0, is the 'chief' supervisor (e.g., initialization, saving).
supervisor = tf.Supervisor(is_chief=(FLAGS.task == 0), saver=model.saver)

# Get an initialized, and possibly recovered session.
sess = supervisor.PrepareSession(FLAGS.master_job)

# Train the model.
for local_step in xrange(int32_max):
    _, loss, step = sess.run([model.train_op, model.loss, model.global_step])
    if step >= FLAGS.max_steps:
        break
    if local_step % 1000 == 0:
        print "step %d: %g" % (step, loss)
Asynchronous Training
● Unlike DistBelief, no separate parameter server system:
○ Parameters are now just stateful nodes in the graph
Synchronous Variant
Network Optimizations
● Neural net training very tolerant of reduced precision
● e.g. drop precision to 16 bits across the network

[Diagram, two slides: params on Device A are sent via Send/Recv to a MatMul on Device B; in the optimized version a ToFP16 node is inserted before Send and a ToFP32 node after Recv]
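Conceptually, the ToFP16 / ToFP32 nodes are just casts around the cross-machine edge; a sketch of the idea (in practice the conversion nodes are inserted by the runtime, not written by hand):

import tensorflow as tf

params = tf.Variable(tf.random_normal([1024, 1024]))

params_fp16 = tf.cast(params, tf.float16)        # on the sending device, before Send
# ... Send / Recv across the network: half the bytes ...
params_recv = tf.cast(params_fp16, tf.float32)   # on the receiving device, after Recv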
Quantization for Inference
● Need even less precision for inference
● 8-bit fixed point works well, but there are many ways of quantizing
● Critical for things like mobile devices
○ w/ quantization, a high-end smart phone can run the Inception model at >6 frames per second (fps)
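One simple way to quantize to 8 bits is a linear mapping of each tensor's range onto 0..255 (a sketch only; TensorFlow's quantized ops use their own schemes):

import numpy as np

def quantize_8bit(x):
    # Map [min, max] of x linearly onto the integers 0..255.
    lo, hi = x.min(), x.max()
    scale = max((hi - lo) / 255.0, 1e-8)
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

w = np.random.randn(1000).astype(np.float32)
q, lo, scale = quantize_8bit(w)
print(np.abs(dequantize(q, lo, scale) - w).max())   # small reconstruction error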
Open Source Status for Distributed TensorFlow
Multi GPU in single machine already in open source release
● See 4-GPU CIFAR10 training example in repository

Distributed implementation coming soon:
● GitHub tracking issue: github.com/tensorflow/tensorflow/issues/23
Concluding Remarks
● Model and Data Parallelism enable great ML work:
○ Neural Machine Translation: ~6x speedup on 8 GPUs
○ Inception / Imagenet: ~40x speedup on 50 GPUs
○ RankBrain: ~300X speedup on 500 machines
● A variety of different parallelization schemes are easy to
express in TensorFlow
Concluding Remarks
● Open Sourcing of TensorFlow
○ Rapid exchange of research ideas (we hope!)
○ Easy deployment of ML systems into products
○ TensorFlow community doing interesting things!
A Few TensorFlow Community Examples
● DQN: github.com/nivwusquorum/tensorflow-deepq
● NeuralArt: github.com/woodrush/neural-art-tf
● Char RNN: github.com/sherjilozair/char-rnn-tensorflow
● Keras ported to TensorFlow: github.com/fchollet/keras
● Show and Tell: github.com/jazzsaxmafia/show_and_tell.tensorflow
● Mandarin translation: github.com/jikexueyuanwiki/tensorflow-zh
...
Google Brain Residency Program

New one year immersion program in deep learning research

Learn to conduct deep learning research w/experts in our team


● Fixed one-year employment with salary, benefits, ...
● Goal after one year is to have conducted several research projects
● Interesting problems, TensorFlow, and access to computational resources
Google Brain Residency Program

Who should apply?


● people with BSc, MSc or PhD, ideally in CS, mathematics or statistics
● completed coursework in calculus, linear algebra, and probability, or equiv.
● programming experience
● motivated, hard working, and have a strong interest in deep learning
Google Brain Residency Program
Program Application & Timeline
DEADLINE: January 15, 2016
Google Brain Residency Program

For more information:


g.co/brainresidency

Contact us:
[email protected]
