Big Data Neural Networks
Apps
drug discovery
Gmail
Image understanding
Maps
Natural language understanding
Photos
Robotics research
Speech
Translation
YouTube
… many others ...
Deep Learning
Universal Machine Learning
[Diagram: Speech, Text, Search Queries, Images, Videos, Labels, Entities, Words, Audio, and Features appear as both the inputs and the outputs of the same learning machinery]
...that works better than the alternatives!
Current State-of-the-art in:
Speech Recognition
Image Recognition
Machine Translation
Molecular Activity Prediction
Road Hazard Detection
Optical Character Recognition
...
ConvNets
Some More Benefits
[Figures: the optimal stimulus found by numerical optimization, and the top 48 stimuli from the test set]
Learning from Unlabeled Images
[Figures: the optimal stimulus found by numerical optimization, and the top 48 stimuli from the test set]
Adding Supervision
Train recurrent models that also incorporate lexical and language modeling.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
http://arxiv.org/abs/1512.00567
Rapid Progress in Image Recognition (ImageNet classification challenge task)

Team                             Year  Place  Error (top-5)  Params
XRCE (pre-neural-net explosion)  2011  1st    25.8%
Supervision (AlexNet)            2012  1st    16.4%          60M
Clarifai                         2013  1st    11.7%          65M
Models with a small number of parameters fit easily in a mobile app (8-bit fixed point)
Today’s News: Pre-trained Inception-v3 model released
http://googleresearch.blogspot.com/2015/12/how-to-classify-images-with-tensorflow.html
Today we are releasing our best image classifier trained on ImageNet data. As described in our
recent Arxiv preprint at http://arxiv.org/abs/1512.00567, an ensemble of four of these models
achieves 3.5% top-5 error on the validation set of the ImageNet whole image ILSVRC2012
classification task (compared with our ensemble from last year that won the 2014 ImageNet
classification challenge with a 6.66% top-5 error rate).
In this release, we are supplying code and data files containing the trained model parameters for
running the image classifier on:
● Both desktop and mobile environments
● Employing either a C++ or Python API.
In addition, we are providing a tutorial that describes how to use the image recognition system for a
variety of use-cases.
http://www.tensorflow.org/tutorials/image_recognition/index.html
What do you want in a research system?
● Ease of expression: for lots of crazy ML ideas/algorithms
● Scalability: can run experiments quickly
● Portability: can run on wide variety of platforms
● Reproducibility: easy to share and reproduce research
● Production readiness: go from research to real products
TensorFlow:
Second Generation Deep Learning System
If we like it, wouldn’t the rest of the world like it, too?
http://tensorflow.org/
Motivations
● Core in C++
● Different front ends for specifying/driving the computation
  ○ Python and C++ today, easy to add more
Computation is a dataflow graph ... with tensors
[Graph: examples and labels flow into MatMul and Xent ops]

Computation is a dataflow graph ... with state
[Graph gains stateful nodes: biases and the learning rate]

Computation is a dataflow graph ... distributed
[Graph is partitioned across Device A and Device B; Send/Recv node pairs are inserted at the edges that cross devices (e.g. between the biases, the Add ... Mul chain, and the −= update driven by the learning rate)]
See https://github.com/soumith/convnet-benchmarks/issues/66
Two main factors:
(1) various overheads (nvcc doesn’t like 64-bit tensor indices, etc.)
(2) versions of convolutional libraries being used (cuDNNv2 vs. v3, etc.)
TensorFlow Single Device Performance
Prong 1: Tackling sources of overhead
Benchmark                                                Forward       Forward+Backward
AlexNet - cuDNNv2 on TensorFlow 0.6 (our machine: soon)  70 ms (+39%)  230 ms (+31%)
TensorFlow Single Device Performance
TF 0.5 vs. 0.6 release candidate measurements (on our machine w/ Titan-X)
Won't make it into the 0.6 release later this week, but likely in the next release
Single-device performance is important, but the biggest performance improvements come from large-scale distributed systems with model and data parallelism.
Experiment Turnaround Time and Research Productivity
● Minutes, Hours:
○ Interactive research! Instant gratification!
● 1-4 days
○ Tolerable
○ Interactivity replaced by running many experiments in parallel
● 1-4 weeks
○ High value experiments only
○ Progress stalls
● >1 month
○ Don’t even try
Transition
● How do you do this at scale?
● How does TensorFlow make distributed training easy?
Model Parallelism
● Best way to decrease training time: decrease the step time
● Many models have lots of inherent parallelism
● Problem is distributing work so communication doesn't kill you
  ○ local connectivity (as found in CNNs)
  ○ towers with little or no connectivity between towers (e.g. AlexNet)
  ○ specialized parts of model active only for some examples
Exploiting Model Parallelism
On a single core: Instruction parallelism (SIMD). Pretty much free.
Data Parallelism
Parameter Servers
[Animated diagram, built up over several slides: many model replicas process different data in parallel. Each replica fetches the current parameters p from the parameter servers, computes an update ∆p, and sends it back; the parameter servers apply p' = p + ∆p. The next replica step uses p' and produces ∆p', giving p'' = p' + ∆p', and so on.]
Data Parallelism Choices
Can do this synchronously (see the sketch after this list):
● N replicas equivalent to an N times larger batch size
● Pro: No gradient staleness
● Con: Less fault tolerant (requires some recovery if any single machine fails)
10 vs 50 Replica Inception Synchronous Training
[Plot: training progress over hours for 10 and 50 replicas; time to reach the same point: 80.3 vs. 19.6 hours (4.1X faster with 50 replicas)]
Using TensorFlow for Parallelism
Trivial to express both model parallelism as well as data parallelism
Sequence-to-Sequence
[Diagram: input sequence "A B C D" followed by "__" is mapped to the output sequence "X Y Z"]
● Active area of research
● Many groups actively pursuing RNN/LSTM
○ Montreal
○ Stanford
○ U of Toronto
○ Berkeley
○ Google
○ ...
● Further Improvements
○ Attention
○ NTM / Memory Nets
○ ...
Sequence-to-Sequence
● Translation: [Kalchbrenner et al., EMNLP 2013] [Cho et al., EMNLP 2014] [Sutskever & Vinyals & Le, NIPS 2014] [Luong et al., ACL 2015] [Bahdanau et al., ICLR 2015]
● Image captions: [Mao et al., ICLR 2015] [Vinyals et al., CVPR 2015] [Donahue et al., CVPR 2015] [Xu et al., ICML 2015]
P(English | French)
P(English | Image)
How?
[Vinyals et al., CVPR 2015]
[Diagram: an image representation W and the previous words "__ A young girl" are fed to an LSTM, which predicts the next words "A young girl asleep"]
Human: A young girl asleep on the sofa cuddling a stuffed bear.
Initial Model: A close up of a plate of food on a table.

Human: A view of inside of a car where a cat is laying down.
Initial Model: A close up of a person eating a hot dog.

Human: A blue, yellow and red train travels across the tracks near a depot.
Pointer Networks
➢ Convex Hulls
[Diagram: input points (x1, y1) ... (x6, y6); the output is a sequence of pointers back into the input, e.g. (x1, y1) → (x6, y6) → (x5, y5) → (x2, y2) → (x1, y1), tracing the convex hull]
[Diagram (Smart Reply): a small feed-forward neural network decides "Activate Smart Reply?" (yes/no); if yes, a deep recurrent neural network generates the suggested replies]
Example: LSTM
for i in range(20):
  m, c = LSTMCell(x[i], mprev, cprev)
  mprev = m
  cprev = c
Example: Deep LSTM
for i in range(20):
  for d in range(4):  # d is depth
    input = x[i] if d == 0 else m[d-1]
    m[d], c[d] = LSTMCell(input, mprev[d], cprev[d])
    mprev[d] = m[d]
    cprev[d] = c[d]
Example: Deep LSTM
for i in range(20):
  for d in range(4):  # d is depth
    with tf.device("/gpu:%d" % d):
      input = x[i] if d == 0 else m[d-1]
      m[d], c[d] = LSTMCell(input, mprev[d], cprev[d])
      mprev[d] = m[d]
      cprev[d] = c[d]
[Animated diagram, repeated across many build slides: a deep LSTM unrolled over the sequences "A B C D" and "A B C" with "_" markers, its layers placed on different GPUs (GPU1, GPU3, GPU6, ...) so that successive timesteps pipeline across devices; 2000 dims x 4 layers = 8k dims per sentence]
TensorFlow Queues
[Diagram: Enqueue → Queue (randomization/shuffling) → Dequeue]
Example: Deep LSTMs
● Wrinkles
○ Bucket sentences by length using a queue per length (see the sketch after this list)
○ Dequeue when a full batch of same length has accumulated
○ N different graphs for different lengths
○ Alternative: while loop
Expressing Data Parallelism
# We use the ReplicaDeviceSetter() device function to automatically
# assign Variables to the 'ps' jobs.
with tf.device("/cpu:0"):
  # Create the Mnist model.
  model = MnistModel(batch_size=16, hidden_units=200)
Contact us:
[email protected]