Deep Learning, where are you going?
Kyunghyun Cho
New York University
Center for Data Science, and
Courant Institute of Mathematical Sciences
Three axes of advance in machine learning
1. Network architectures
  • ConvNet
  • Highway networks/ResNet
  • LSTM/GRU
2. Learning algorithms
  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning
3. Temporal/Spatial Hierarchy
  • Thumbnails => high-res images
  • Single frame => multi-frame video clips
  • Words => phrases => sentences => …
[Figure: three axes labelled Learning Algorithms, Network Architectures, and Temporal/Spatial Hierarchy]
Network architectures
• The history of ML is a series of new/old/old-is-new-again/new-is-old-already model architectures
[Figure: a map of architectures: Hebbian-rule-based linear models (Perceptron, Hopfield network, ..), PCA, GMM, …, Hidden Markov models, state-space models (Kalman filter, …), Gaussian processes (GPR, GPC, GPL), kernel machines (kernel SVM), multilayer perceptrons, convolutional networks (ReLU, Highway, ResNet, stride-1 conv, …), recurrent networks (LSTM/GRU-RNN), and memory-enhanced RNNs (NTM, DNC, MemNet, …)]
Re-thinking a recurrent neural network
RNN as a CPU
• Registers: h
• Execution:
  1. Read the whole register h

GRU as a CPU
• Registers: h
• Execution:
  1. Select a readable subset r
  2. Read the subset r ⊙ h
  3. Select a writable subset u
  4. Update the subset: h ← u ⊙ h̃ + (1 − u) ⊙ h
[Figure: the register file h with read gate r, write gate u, and candidate content h̃]
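A minimal numpy sketch of this GRU "CPU cycle"; the weight shapes and the omission of bias terms are simplifying assumptions, not details from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU step, read as a tiny CPU cycle over the register file h (biases omitted)."""
    Wr, Ur, Wu, Uu, W, U = params
    r = sigmoid(Wr @ x + Ur @ h)            # 1. select a readable subset r
    h_read = r * h                          # 2. read the subset r ⊙ h
    u = sigmoid(Wu @ x + Uu @ h)            # 3. select a writable subset u
    h_tilde = np.tanh(W @ x + U @ h_read)   # candidate content for the selected slots
    return u * h_tilde + (1 - u) * h        # 4. update the subset: u ⊙ h̃ + (1 − u) ⊙ h

# toy usage with random weights
d_x, d_h = 4, 8
params = [np.random.randn(d_h, d_x) if i % 2 == 0 else np.random.randn(d_h, d_h)
          for i in range(6)]
h = np.zeros(d_h)
h = gru_step(np.random.randn(d_x), h, params)
```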
Making RNNs great again!
[Figure: the same architecture map as before, now reaching recurrent networks (LSTM/GRU-RNN)]
Re-thinking sequence-to-sequence learning
• Cooperation among three agents:
  1. Agent 1 (Encoder): transforms the source sentence into a set of code vectors in a memory.
  2. Agent 2 (Search): searches for relevant code vectors in the memory based on the command from Agent 3 and returns them to Agent 3.
  3. Agent 3 (Decoder): observes the current state (previously decoded symbols), commands Agent 2 to find relevant code vectors, and generates the next symbol based on them.
[Figure: the Decoder agent generates "The cat sat on", summarizes its state into a command, the Search agent inspects and selects code vectors from the memory, and the Encoder agent fills the memory from the source sentence "고양이가 매트 위에 앉았다." ("The cat sat on the mat.")]
Attention-based neural machine translation
• Model implementation
  • Agent 1 (Encoder): bidirectional GRU/LSTM-RNN
  • Agent 2 (Search): differentiable attention mechanism
  • Agent 3 (Decoder): GRU/LSTM-RNN language model
• Learning algorithm
  • Maximum likelihood: maximize the predictability of the correct translation by Agent 3 (Decoder)
  • Backpropagation through all the agents
• Now a de facto standard in machine translation
  • Google Translate, Facebook, Systran, Naver, …
(Bahdanau, Cho & Bengio, 2015)
[Figure: the same three-agent diagram as the previous slide]
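A minimal numpy sketch of one decoding step through the three agents, under assumed toy dimensions; the weight names Wa, Ua, va are illustrative, not taken from the slides or the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(memory, h_dec, Wa, Ua, va):
    """Agent 2 (Search): score every code vector against the decoder's command."""
    scores = np.array([va @ np.tanh(Wa @ h_dec + Ua @ m) for m in memory])
    alpha = softmax(scores)                 # attention weights over the memory
    return alpha @ memory                   # returned context: weighted sum of code vectors

# toy example: an encoder memory of 5 code vectors and one decoder state
d = 16
memory = np.random.randn(5, d)              # Agent 1 (Encoder) output
h_dec = np.random.randn(d)                  # Agent 3 (Decoder) state, acting as the command
Wa, Ua = np.random.randn(d, d), np.random.randn(d, d)
va = np.random.randn(d)
context = attend(memory, h_dec, Wa, Ua, va) # fed back to the decoder to generate the next symbol
```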
Attention-based neural machine translation
• Flexibility in input/output representation
  • Multilingual, character-level translation: recurrent decoder, feedforward attention, and recurrent-convolutional encoder
• Agent (Decoder) decides what to store in the memory
• Agent (Decoder) may access from and write to the memory multiple times per step
• Memory may grow or shrink
• Closer to the von Neumann architecture
• Is it good? – We'll see…
Memory nets: (Weston et al., 2014; Sukhbaatar et al., 2015; Kumar et al., 2015; Miller et al., 2016; and many more)
Neural Turing machines: (Graves et al., 2014; Graves et al., 2016; Gulcehre et al., 2016 & 2017; and many more)
[Figure: the Decoder agent inspects, selects, and writes (key, value) slots in a memory shared with the Search and Encoder agents, repeated ×n and ×m]
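A minimal sketch of such a growable (key, value) memory with repeated soft reads per decoding step; the class and its write/read methods are illustrative assumptions, not an API from the cited papers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KeyValueMemory:
    """A growable (key, value) memory the decoder can read several times per step."""
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def write(self, key, value):
        # the decoder decides what to store; the memory grows by one slot
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])

    def read(self, query):
        # soft attention over keys, weighted sum of the corresponding values
        alpha = softmax(self.keys @ query)
        return alpha @ self.values

# toy usage: write a few slots, then do two read "hops" within one decoding step
d = 8
mem = KeyValueMemory(d)
for _ in range(3):
    mem.write(np.random.randn(d), np.random.randn(d))
q = np.random.randn(d)
for _ in range(2):
    q = mem.read(q)   # the result of one hop conditions the next
```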
Three axes of advance in machine learning
1. Network architectures
  • ConvNet
  • Highway networks/ResNet
  • LSTM/GRU
2. Learning algorithms
  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning
3. Temporal/Spatial Hierarchy
  • Thumbnails => high-res images
  • Single frame => multi-frame video clips
  • Words => phrases => sentences => …
[Figure: three axes labelled Learning Algorithms, Network Architectures, and Temporal/Spatial Hierarchy]
Supervised learning
• Learner does not interact with the world
• Supervisor annotates data in advance
• Learner learns from the supervisor's feedback (reward, correct answer)
• Advantages
  • Strong learning signal
  • Offline training
• Disadvantages
  • Mismatch between training and test
[Figure: the World, Supervisor, and Learner; the supervisor collects and annotates data, the learner sends a query, receives an answer, reward, or correct answer, and updates]
Unsupervised learning
• Learner does not interact with the world
• Supervisor collects data
• No feedback from the supervisor
• Advantages
  • Potentially infinite amount of data
  • Strong learning signal
• Disadvantages
  • Where do we get the supervisor???
[Figure: the World, Supervisor, and Learner; unsupervised feature learning/extraction feeding a supervised/reinforcement-learned action selector that observes and acts on the world]
SafeDAgger
3. The SafetyNet predicts whether the learner will fail
4. If no, the learner continues
5. If yes,
  1. the supervisor intervenes
  2. the learner imitates the supervisor's behaviour
[Figure: the learner (a CNN producing steering and brake commands), a SafetyNet, and the supervisor, all observing and acting on the world]
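A hypothetical sketch of that control loop; the env, learner, supervisor, and safety_net callables below are placeholders, not the interface used in the SafeDAgger paper:

```python
def drive_one_episode(env, learner, supervisor, safety_net, imitation_data):
    """SafeDAgger-style rollout: the SafetyNet decides who controls the car."""
    obs, done = env.reset(), False
    while not done:
        if safety_net(obs):                       # SafetyNet predicts the learner would fail
            action = supervisor(obs)              # the supervisor intervenes
            imitation_data.append((obs, action))  # stored so the learner can imitate it later
        else:
            action = learner(obs)                 # the learner keeps driving
        obs, done = env.step(action)
    return imitation_data

# toy stand-ins just to exercise the loop (placeholders, not the real simulator)
class ToyEnv:
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, action):
        self.t += 1
        return float(self.t), self.t >= 5

data = drive_one_episode(ToyEnv(),
                         learner=lambda o: 0.0,
                         supervisor=lambda o: 1.0,
                         safety_net=lambda o: o > 2.0,
                         imitation_data=[])
```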
SafeDAgger: Learning
1. Initial labelled data sets: D₀ and D₀ˢ
2. Train the policy π₀ using D₀
3. Train the safety net using D₀ˢ
  • Target for the safety net given x ∈ Dˢ:
    y*ˢ = 1 if ‖π₀(x) − y*‖ > τ, and 0 otherwise
[Figure: the same learner/SafetyNet/supervisor diagram]
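A small numpy sketch of the target rule above, assuming policy(x) returns the learner's control vector, y* is the supervisor's reference control, and the threshold value is an arbitrary choice:

```python
import numpy as np

def safety_target(policy, x, y_star, tau):
    """Label x as unsafe (1) when the policy deviates from the supervisor by more than tau."""
    return 1 if np.linalg.norm(policy(x) - y_star) > tau else 0

def build_safety_dataset(policy, labelled_pairs, tau):
    """Turn supervisor-labelled pairs (x, y*) into SafetyNet training targets."""
    return [(x, safety_target(policy, x, y_star, tau)) for x, y_star in labelled_pairs]

# toy usage with a linear "policy" and random supervisor labels
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=4), rng.normal(size=2)) for _ in range(8)]
A = rng.normal(size=(2, 4))
targets = build_safety_dataset(lambda x: A @ x, pairs, tau=1.0)
```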
• The car will break down if it has crashed many times
• How much can we automate?
  1. Automatic determination of safety (SafetyNet)
  2. …?
[Figure: the same learner/SafetyNet/supervisor diagram]
Three axes of advance in machine learning
1. Network architectures
  • ConvNet
  • Highway networks/ResNet
  • LSTM/GRU
2. Learning algorithms
  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning
3. Temporal/Spatial Hierarchy
  • Thumbnails => high-res images
  • Single frame => multi-frame video clips
  • Words => phrases => sentences => …
[Figure: three axes labelled Learning Algorithms, Network Architectures, and Temporal/Spatial Hierarchy]
Awesomeness everywhere!
[Figure: many independently awesome modules: Awesome Auto-Driver, Awesome Q&A, Awesome RoboArm Controller, Awesome ASR, Awesome ConvNet, Awesome LM, Awesome Atari Meta-Player, Awesome Learner, Awesome Program Interpreter, …]
I want something like…
[Figure: an Awesome Q&A system composed from existing modules: Awesome Memory, Awesome ConvNet, Awesome ASR, Awesome LM]
• Each module is used by multiple higher-level modules
• Interactions among modules are not trivial
• Shared representation of information among different modules
• Time to go beyond end-to-end learning?
[Figure: shared modules (Awesome ConvNet, Awesome LM, Awesome ASR, Awesome Memory) feeding higher-level systems (Awesome Auto-Driver, Awesome RoboArm Controller, Awesome Q&A)]
Higher-level module
1. Receives a question via awesome LM+ASR
[Figure: a higher-level module built on top of Awesome ASR, Awesome LM, Awesome RoboArm Controller, Awesome Auto-Driver]
But, simple composition of neural networks may not work!
• Why not?
• Neural nets are good at interpreting the outputs of other neural nets
[Figure: a trainable NN inserted among the pretrained modules (Awesome ASR, Awesome LM, Awesome RoboArm Controller, Awesome Auto-Driver)]
(1) Trainable Decoding
Decoding
1. Start with a pretrained neural machine translation model
2. A trainable decoder inspects the hidden state of the pretrained decoder
3. The trainable decoder sends out the altering signal back to the pretrained model
Learning
1. Deterministic policy gradient
2. Maximize any arbitrary objective
(Gu, Cho & Li, 2017)
[Figure: a trainable decoder producing y_t | ŷ_{<t}, X sits on top of the pretrained decoder (GRU/LSTM over h_{t−1}, ŷ_{t−1}), the search module, the memory of code vectors, and the encoder over "고양이가 매트 위에 앉았다."]
• Spatio-temporal abstraction as lower-level, multi-purpose neural network modules + black-box modules
• Enables sequential, asynchronous learning at multiple scales
[Figure: a trainable NN composed with pretrained modules (Awesome ASR, Awesome LM, Awesome RoboArm Controller, Awesome Auto-Driver), producing multiple outputs]
(1) Trainable Decoding Algorithm
Models
1. Actor π : ℝ³ᵈ → ℝᵈ
  • Input: prev. hidden state h_{t−1}, prev. symbol ŷ_{t−1}, and the context c_t from the attention model
  • Output: an additive bias z_t for the hidden state
  • Example: z_t = U σ(W [h_{t−1}; E(ŷ_{t−1}); c_t] + b) + c
2. Critic R^c : ℝᵈ × ⋯ × ℝᵈ → ℝ
  • Input: a sequence of the hidden states from the decoder
  • Output: a predicted return
  • In our case, the critic estimates the full return rather than Q at each time step
(Gu, Cho & Li, 2017)
[Figure: the trainable decoder producing y_t | ŷ_{<t}, X on top of the pretrained decoder (GRU/LSTM), the attention module, the memory of code vectors, and the encoder over "고양이가 매트 위에 앉았다."]
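A minimal numpy sketch of the actor above; the sigmoid nonlinearity and the toy shapes are assumptions (the slide's nonlinearity symbol did not survive extraction):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def actor(h_prev, y_prev_emb, ctx, U, W, b, c):
    """Actor pi: R^{3d} -> R^d, producing an additive bias z_t for the decoder hidden state."""
    inp = np.concatenate([h_prev, y_prev_emb, ctx])   # [h_{t-1}; E(ŷ_{t-1}); c_t]
    return U @ sigmoid(W @ inp + b) + c               # z_t

# toy shapes: d = 4, so the concatenated input is 3d = 12
d = 4
h_prev, y_prev_emb, ctx = (np.random.randn(d) for _ in range(3))
W, b = np.random.randn(d, 3 * d), np.random.randn(d)
U, c = np.random.randn(d, d), np.random.randn(d)
z_t = actor(h_prev, y_prev_emb, ctx, U, W, b, c)
h_t_biased = np.random.randn(d) + z_t                 # the bias is added to the decoder state
```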
(1) Trainable Decoding Algorithm
Learning
1) Generate a translation given a source sentence with noise: ((h_1, z_1), …, (h_T, z_T)) and R
2) Train the critic to minimize (R^c(h_1, …, h_T) − R)²
3) Generate multiple translations with noise: ((h^1_1, z^1_1), …, (h^1_T, z^1_T)), …, ((h^M_1, z^M_1), …, (h^M_T, z^M_T))
4) Critic-aware actor learning (newly proposed):
  (1/M) Σ_{m=1}^{M} [exp(−(R^c_m − R_m)²) / Z] · ∂R^c/∂π
Inference: simply throw away the critic and use the actor
(Gu, Cho & Li, 2017)
[Figure: the same trainable decoder diagram]
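A toy numpy sketch of the critic-aware weighting in step 4, assuming the per-sample critic gradients are already available from backprop; the returns below are made-up numbers, not results from the paper:

```python
import numpy as np

# assumed toy returns for M = 4 noisy translations
R_critic = np.array([0.42, 0.35, 0.50, 0.20])   # critic's predicted returns R^c_m
R_true   = np.array([0.40, 0.10, 0.48, 0.45])   # observed returns R_m (e.g. sentence-level BLEU)

# critic-aware weights: trust a sample's critic gradient less when the critic was wrong on it
w = np.exp(-(R_critic - R_true) ** 2)
w = w / w.sum()                                  # divide by the normaliser Z

# per-sample gradients dR^c/dpi would come from backprop through the critic;
# random placeholders here just to show the weighted average over the M samples
M, d = 4, 8
grad_Rc_dpi = np.random.randn(M, d)
actor_grad = (w[:, None] * grad_Rc_dpi).sum(axis=0) / M
```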