
Deep learning, where are you going?
Kyunghyun Cho
New York University
Center for Data Science, and
Courant Institute of Mathematical Sciences
Three axes of advance in machine learning
1. Network architectures
• ConvNet
• Highway networks/ResNet
• LSTM/GRU
2. Learning algorithms
• Supervised learning
• Unsupervised learning
• Reinforcement learning
3. Temporal/Spatial Hierarchy
• Thumbnails => high-res images
• Single frame => multi-frame video clips
• Words => phrases => sentences => …

[Diagram: a triangle whose three axes are labelled Network Architectures, Learning Algorithms, and Temporal/Spatial Hierarchy]
Network architectures
• The history of ML is a series of new/old/old-is-new-again/new-is-old-already models
• Many of them can be cast into a neural net with different net architectures

[Timeline, 1950's to 2000's, captioned "Very inaccurate illustration of the history of ML": Hebbian-rule-based linear models (Perceptron, Hopfield network, …) => multilayer perceptrons, convolutional networks, hidden Markov models, PCA, GMM, … => recurrent networks, state-space models (Kalman filter, …) => kernel machines (kernel SVM), Gaussian processes (GPR, GPC, GPL) => LSTM/GRU-RNN, memory-enhanced RNN (NTM, DCN, MemNet, …), convolutional networks (ReLU, Highway, ResNet, stride-1 conv, …)]
Network architectures
1. Improved representational power
• Linear separability and nonlinear classifiers
• Deeper networks for more complex problems
• Kernel methods for infinite-dimensional projection
• Probabilistic approaches for uncertainty modelling
2. Better inductive bias
• (local) Translation, rotation invariance
• Unbounded memory storage

[The history-of-ML timeline as above]

Making Ameri.. RNNs great again!

[The history-of-ML timeline as above]

Neural networks and language processing in the late-80's and early-90's
• First wave of applying neural networks to natural language
• Text as well as speech
• Limited due to
1. Lack of computational power
2. Lack of large-scale data
3. Inability to train an RNN well

[Photos: Bob Allen, Yoshua Bengio, Risto Miikkulainen, Lonnie Chrisman, Jeffrey Elman, Jay McClelland]


Recurrent networks are difficult to train

[Quoted excerpts on the difficulty of training RNNs: Yoshua Bengio (1994); Juergen Schmidhuber (2013)]


Making Ameri.. RNNs great again!

[The history-of-ML timeline as above]


Re-thinking a recurrent neural network
tanh-RNN as a CPU
• Registers: h
• Execution:
1. Read the whole register h
2. Update the whole register:
h ← tanh(W[x] + Uh + b)
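As a concrete sketch of this "read the whole register, overwrite the whole register" view, a minimal tanh-RNN step in NumPy might look as follows; all names and dimensions are illustrative, not from the slides:

```python
import numpy as np

def tanh_rnn_step(x, h, W, U, b):
    """One tanh-RNN step: read the whole register h, then
    overwrite the whole register in a single update."""
    return np.tanh(W @ x + U @ h + b)

rng = np.random.default_rng(0)
d_in, d_h = 16, 32                     # illustrative dimensions
W = rng.normal(scale=0.1, size=(d_h, d_in))
U = rng.normal(scale=0.1, size=(d_h, d_h))
b = np.zeros(d_h)

h = np.zeros(d_h)                      # the "registers"
for x in rng.normal(size=(5, d_in)):   # a toy sequence of 5 inputs
    h = tanh_rnn_step(x, h, W, U, b)
```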
Re-thinking a recurrent neural network
GRU as a CPU
• Registers: h
• Execution:
1. Select a readable subset r
2. Read the subset r ⊙ h
3. Select a writable subset u
4. Update the subset:
h ← u ⊙ h̃ + (1 − u) ⊙ h

Clearly gated recurrent units* are much more realistic.

* By gated recurrent units, I refer to both LSTM and GRU [Hochreiter & Schmidhuber, 1997; Cho et al., 2014].
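A matching NumPy sketch of the GRU-style update just described, again with illustrative names and dimensions (the gate parameterization follows the standard GRU the slide summarizes):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One GRU step: r gates what is read from h, u selects the
    writable subset, and only that subset is overwritten."""
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])           # read gate
    u = sigmoid(p["Wu"] @ x + p["Uu"] @ h + p["bu"])           # update gate
    h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h) + p["b"])  # candidate
    return u * h_tilde + (1.0 - u) * h                         # partial update

rng = np.random.default_rng(0)
d_in, d_h = 16, 32
p = {k: rng.normal(scale=0.1, size=(d_h, d_in)) for k in ("Wr", "Wu", "W")}
p.update({k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in ("Ur", "Uu", "U")})
p.update({k: np.zeros(d_h) for k in ("br", "bu", "b")})

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h = gru_step(x, h, p)
```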
Re-thinking a recurrent neural network
• In practice, GRU or LSTM RNNs are much easier to train and work well with a wider range of hyperparameters.
Making RNNs great again!

[The history-of-ML timeline as above]


Re-thinking sequence-to-sequence learning
Sequence-to-sequence model (or encoder-decoder network)
• Encode an input sequence into a code vector z
• Decode the code vector into a target sequence

[Diagram: the encoder maps the Korean source "고양이가 매트 위에 앉았다." ("The cat sat on the mat.") into a code vector z, which the decoder expands into "The cat sat on the mat."]

(Allen, 1987; Chrisman, 1991; Neco & Forcada, 1997; Castano et al., 1997)
(Kalchbrenner & Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014)
Re-thinking sequence-to-sequence learning
• Motivated by human translators
1. Summarize what has been translated so far
2. Find a relevant part of the source
3. Write the next target symbol
4. Go to 1
• Machine learning view
1. Search for relevant info from the source, based on the current state: what has been generated so far
2. Generate the next target symbol
3. Go to 1

[Diagram: partial outputs "The cat ?", "The cat sat ?", "The cat sat on ?" each pointing back into the source "고양이가 매트 위에 앉았다."]
Re-thinking sequence-to-sequence learning
• Cooperation among three agents
1. Agent 1 (Encoder): transforms the source sentence into a set of code vectors in a memory
2. Agent 2 (Search): searches for relevant code vectors in the memory based on the command from Agent 3 and returns them to Agent 3
3. Agent 3 (Decoder): observes the current state (previously decoded symbols), commands Agent 2 to find relevant code vectors, and generates the next symbol based on them

[Diagram: the Encoder fills a memory of code vectors from the source "고양이가 매트 위에 앉았다."; the Search agent inspects and selects from that memory on command from the Decoder, which summarizes what has been generated so far ("The cat sat") and generates the next symbol ("on")]
Attention-based neural machine translation
• Model implementation
• Agent 1 (Encoder): bidirectional GRU/LSTM-RNN
• Agent 2 (Search): differentiable attention mechanism
• Agent 3 (Decoder): GRU/LSTM-RNN language model
• Learning algorithm
• Maximum likelihood: maximize the predictability of Agent 3 (Decoder)
• Backpropagation through all the agents
• Now a de facto standard in machine translation
• Google Translate, Facebook, Systran, Naver, …

[The three-agent diagram as above]

(Bahdanau, Cho & Bengio, 2015)
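As a sketch of the Search agent, here is a minimal additive ("Bahdanau-style") attention step in NumPy; the scoring form v⊤tanh(Wa s + Ua h) and all dimensions are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attend(s, H, Wa, Ua, v):
    """Search agent: score every code vector H[j] against the decoder
    state s (the 'command'), then return their weighted average."""
    scores = np.array([v @ np.tanh(Wa @ s + Ua @ h) for h in H])
    alpha = softmax(scores)   # where in the memory to look
    return alpha @ H, alpha   # context vector and attention weights

rng = np.random.default_rng(0)
d = 32
H = rng.normal(size=(7, d))   # memory: 7 encoder code vectors
s = rng.normal(size=d)        # current decoder state
Wa = rng.normal(scale=0.1, size=(d, d))
Ua = rng.normal(scale=0.1, size=(d, d))
v = rng.normal(scale=0.1, size=d)

context, alpha = attend(s, H, Wa, Ua, v)
```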
Attention-based neural machine translation
• Flexibility in input/output representation
• Multilingual, character-level translation: recurrent decoder, feedforward attention and recurrent-convolutional encoder (Lee et al., 2016; Ha et al., 2016; Johnson et al., 2016)
• Attention-based image captioning: recurrent decoder, feedforward attention and deep convolutional encoder (Xu et al., 2015; Yao et al., 2015)
Memory-augmented recurrent neural networks
• Agent (Decoder) decides what to store in the memory
• Agent (Decoder) may access from and write to the memory multiple times per step
• Memory may grow or shrink
• Closer to the von Neumann architecture
• Is it good? – We'll see…

[Diagram: the Decoder generates "on" after "The cat sat", commanding a Search agent that inspects, selects from, and writes to a slot-based (key, value) memory filled by the Encoder]

Memory nets: (Weston et al., 2014; Sukhbaatar et al., 2015; Kumar et al., 2015; Miller et al., 2016; and many more)
Neural Turing machines: (Graves et al., 2014; Graves et al., 2016; Gulcehre et al., 2016 & 2017; and many more)
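One illustrative way such a (key, value) memory might be read and written, using content-based addressing; this is a simplified stand-in, not the exact NTM/MemNet mechanism:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def memory_read(keys, values, query):
    """Content-based read: soft-match the query against all keys and
    return the correspondingly weighted mixture of values."""
    w = softmax(keys @ query)
    return w @ values

def memory_write(keys, values, query, new_value):
    """Hard write to the best-matching slot (a simplification of the
    soft, differentiable writes used by NTM-style models)."""
    slot = int(np.argmax(keys @ query))
    values = values.copy()
    values[slot] = new_value
    return values

rng = np.random.default_rng(0)
n_slots, d = 5, 16
keys = rng.normal(size=(n_slots, d))
values = rng.normal(size=(n_slots, d))
q = rng.normal(size=d)

r = memory_read(keys, values, q)                 # inspect/select
values = memory_write(keys, values, q, r + 1.0)  # write back, possibly several times per step
```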
Three axes of advance in machine learning
1. Network architectures
• ConvNet
• Highway networks/ResNet
• LSTM/GRU
2. Learning algorithms
• Supervised learning
• Unsupervised learning
• Reinforcement learning
3. Temporal/Spatial Hierarchy
• Thumbnails => high-res images
• Single frame => multi-frame video clips
• Words => phrases => sentences => …

[The three-axes triangle diagram as above]
Supervised learning
• Learner does not interact with the world
• Supervisor annotates data in advance
• Learner learns from the supervisor's feedback (reward, correct answer)
• Advantages
• Strong learning signal
• Offline training
• Disadvantages
• Mismatch between training and test

[Diagram: World => Supervisor (data collection, annotation) => Learner; the learner queries, the supervisor answers with a reward or correct answer, and the learner updates]
Unsupervised learning
• Learner does not interact with the world
• Supervisor collects data
• No feedback from the supervisor
• Advantages
• Potentially infinite amount of data
• Strong learning signal
• Disadvantages
• What's the goal of learning?

[Diagram: World => Supervisor (data collection) => Learner; the learner repeatedly queries the data and updates, with no annotation]
Reinforcement learning
• Learner directly interacts with the world
• There is no supervisor
• Learner learns from the world's weak feedback (reward alone)
• Advantages
• Online learning: perfect match between training and test
• Disadvantages
• Weak learning signal
• Non-trivial balance between exploration and exploitation

[Diagram: World <=> Learner; the learner observes, acts, and updates from the reward]
Mix-and-match: Supervised+Reinforcement learning

[Diagram: the learner observes the World through an unsupervised feature-extraction module, selects actions with a supervised/reinforcement-learned action selector, and acts; the Supervisor provides the supervised/reinforcement signal]

The obligatory LeCake!


Mix-and-match: Imitation Learning
• Learner directly interacts with the world
• Supervisor augments the reward signal from the world
• Advantages
• Match between training and test
• Strong learning signal
• Disadvantages
• Where do we get the supervisor???

[Diagram: the learner observes and acts in the World; it updates from both the world's reward and the supervisor's reward/correct action]

(Ross et al., 2011; Daume III et al., 2007; and more…)


SafeDAgger: Query-Efficient Imitation Learning
• Supervisors are expensive
• As the learner gets better, less intervention from the supervisor
• Learner learns from difficult examples
• Questions:
1. Where do we get the safety net?
2. What is the impact on the learner's performance?

[Diagram: World => SafetyNet => Supervisor => Learner; on easy observations the learner acts on its own, on difficult ones the supervisor observes, acts, and provides the correct action for the update]

(Zhang & Cho, 2017; Laskey et al., 2016)


SafeDAgger: Query-Efficient Imitation Learning
1. Learner observes the world
2. SafetyNet observes the learner
3. SafetyNet predicts whether the learner will fail
4. If no, the learner continues
5. If yes,
1. the supervisor intervenes
2. the learner imitates the supervisor's behaviour

[Diagram: a CNN learner outputs steering/brake; the SafetyNet observes it and hands control to the supervisor when it predicts failure]
SafeDAgger: Learning
1. Initial labelled data sets: D_0 and D_0^S
2. Train the policy π_0 using D_0
3. Train the safety net φ_0 using D_0^S
• Target for the safety net, given x ∈ D^S:
y_*^S = 1 if ‖π_0(x) − y_*‖ > τ, and 0 otherwise
4. Collect additional data D'_0
1. Let π_0 drive, but the expert intervenes when φ_0(x) = 1
2. Collect data: D'_0 ← D'_0 ∪ {(x, y) | φ_0(x) = 1}
5. Data aggregation: D_0 ← D_0 ∪ D'_0
6. Go to 2

[Diagram: the same CNN learner / SafetyNet / supervisor setup as above]
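A minimal sketch of this loop in Python. The helpers train, expert_label, and env.rollout are hypothetical placeholders, and tau is the discrepancy threshold τ from step 3:

```python
# A minimal SafeDAgger-style training loop. `train`, `expert_label`,
# and `env.rollout` are hypothetical placeholders for illustration;
# `tau` is the discrepancy threshold from step 3 above.

def safedagger(policy, safetynet, D, D_S, env, tau, n_iters):
    for _ in range(n_iters):
        policy = train(policy, D)  # step 2: fit the policy on D
        # Step 3: safety-net targets: x is unsafe (label 1) iff the
        # policy's action deviates from the expert's by more than tau.
        targets = [(x, float(abs(policy(x) - y) > tau)) for x, y in D_S]
        safetynet = train(safetynet, targets)
        # Step 4: let the policy drive; query the expert only on
        # states the safety net flags as unsafe.
        D_new = [(x, expert_label(x))
                 for x in env.rollout(policy) if safetynet(x) == 1]
        D = D + D_new  # step 5: aggregate and repeat
    return policy, safetynet
```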
SafeDAgger in Action

[Three video-demo slides]
SafeDAgger
• Many aspects of learning
1. Main objective
2. Constraints imposed by the world
• Cannot go beyond physical constraints
3. Safety of an agent
• Unlike in a game, you can't hit a pedestrian and continue as if nothing happened
• The car will break down if it has crashed many times
• How much can we automate?
1. Automatic determination of safety (SafetyNet)
2. …?

[Diagram: the same CNN learner / SafetyNet / supervisor setup as above]
Three axes of advance in machine learning
1. Network architectures
• ConvNet
• Highway networks/ResNet
• LSTM/GRU
2. Learning algorithms
• Supervised learning
• Unsupervised learning
• Reinforcement learning
3. Temporal/Spatial Hierarchy
• Thumbnails => high-res images
• Single frame => multi-frame video clips
• Words => phrases => sentences => …

[The three-axes triangle diagram as above]
Awesomeness everywhere!
• Awesome Auto-Driver
• Awesome Q&A
• Awesome RoboArm Controller
• Awesome ASR
• Awesome ConvNet
• Awesome LM
• Awesome Atari Meta-Player
• Awesome Learner
• Awesome Program Interpreter
I want something like…

[Diagram: Awesome Q&A, Awesome Memory, Awesome ConvNet, Awesome ASR, Awesome LM, Awesome RoboArm Controller and Awesome Auto-Driver wired together]

But, do we want to train this end-to-end?
Neural networks are modules
• Each module is used for multiple tasks
• Interactions among modules are not trivial
• Shared representation of information among different modules
• Time to go beyond end-to-end learning?

[Diagram: Awesome ConvNet, Awesome LM and Awesome ASR as shared lower-level modules, with Awesome Memory, under Awesome Auto-Driver, Awesome RoboArm Controller and Awesome Q&A]
Learning to use an NN module
• Q&A system
1. Receives a question via awesome LM+ASR
2. Retrieves relevant info from awesome memory
3. Generates a response via awesome LM
• Autonomous driving
1. Senses the environment with awesome ConvNet+ASR
2. Plans the route with awesome memory
3. Controls a car via awesome robot-arm controller

But, simple composition of neural networks may not work! Why not?

[Diagram: a higher-level module orchestrating a chain of neural networks (input => NN => output at steps t−1, t, t+1)]
Learning to use an NN module
• Why not?
• Target tasks are often unknown at training time
• Input/output with a large available training set are too rigid
• Rich information captured by the NN module must be passed along
• Internals of the NN module must allow external manipulation

Reminds us of the memory-augmented recurrent neural networks…

[Diagram: a higher-level module orchestrating a chain of neural networks, as above]
Good: NNs are totally transparent!
• NNs are not black boxes.
• We can observe every single bit inside a neural net.

Bad: NNs are not easy to understand!
• Humans are not good with high-dimensional vectors
• Distributed representation: exponential combinations of hidden units

[Figure: hidden activations of a recurrent language model (Karpathy et al., 2015)]


Learning to use an NN module
• Neural nets are good at interpreting high-dimensional input
• Neural nets are also good at predicting high-dimensional output
• Internal representation learned by a neural network is well structured
• Neural nets can be trained with an arbitrary objective [reinforcement learning]

[Diagram: a trainable NN module orchestrating the chain of neural networks, as above]

(My NSF Proposal, 2016)
(1) Simultaneous Translation
Decoding
1. Start with a pretrained neural machine translation model
2. Build a simultaneous decoder that intercepts and interprets the incoming signal
3. The simultaneous decoder forces the pretrained model to either
1. output a target symbol, or
2. wait for the next source symbol
Learning
1. Trade-off between delay and quality
2. Stochastic policy gradient (REINFORCE)
(Gu, Cho & Li, 2016)
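A sketch of what the emit-or-wait control might look like; the nmt interface (initial_state, read, write) and the trained policy agent are hypothetical placeholders, not an API from the paper:

```python
# A sketch of simultaneous greedy decoding: after each new source
# symbol, a learned policy chooses between emitting target symbols
# and waiting for more input. `nmt` (the pretrained NMT model, with
# hypothetical initial_state/read/write methods) and `agent` (the
# trained emit-or-wait policy) are illustrative placeholders.

EMIT, WAIT = 0, 1

def simultaneous_decode(nmt, agent, source_stream, eos="</s>"):
    state, output = nmt.initial_state(), []
    for src_symbol in source_stream:
        state = nmt.read(state, src_symbol)       # consume one source symbol
        while agent(state) == EMIT:               # policy decision
            tgt_symbol, state = nmt.write(state)  # force out a target symbol
            output.append(tgt_symbol)
            if tgt_symbol == eos:
                return output
    return output
```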
(1) Simultaneous Translation
(2) Trainable Decoding Algorithm
Decoding
1. Start with a pretrained neural machine translation model
2. Build a trainable decoder that intercepts and interprets the incoming signal
3. The trainable decoder sends an altering signal back to the pretrained model
Learning
1. Deterministic policy gradient
2. Maximize any arbitrary objective

[Diagram: the trainable decoder sits between the GRU/LSTM decoder (mapping h_{t−1}, ŷ_{t−1} to y_t | ŷ_{<t}, X), the attention mechanism, and the encoder's memory of code vectors]

(Gu, Cho & Li, 2017)
Learning to use an NN module
• Spatio-temporal abstraction as learning to glue together multiple lower-level, multi-purpose neural network modules + black-box modules
• Enables sequential, asynchronous learning at multiple scales
• Enables higher-level planning
• Potential framework for meta-learning
• Potential security threat?

[Diagram: a trainable NN module orchestrating the chain of neural networks, as above]
Thank you!

[The three-axes triangle diagram: Network Architectures, Learning Algorithms, Temporal/Spatial Hierarchy]
(1) Trainable Decoding Algorithm
Models
1. Actor π : R^{3d} → R^d
• Input: prev. hidden state h_{t−1}, prev. symbol ŷ_{t−1}, and context c_t from the attention model
• Output: additive bias z_t for the hidden state
• Example: z_t = U f(W[h_{t−1}; E(ŷ_{t−1}); c_t] + b) + c, with f an elementwise nonlinearity
2. Critic R^c : R^d × ⋯ × R^d → R
• Input: a sequence of the hidden states from the decoder
• Output: a predicted return
• In our case, the critic estimates the full return rather than Q at each time step

[The trainable-decoder diagram as above]

(Gu, Cho & Li, 2017)
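A NumPy sketch of this actor, with tanh standing in for the unspecified nonlinearity f; all names and dimensions are illustrative:

```python
import numpy as np

def actor(h_prev, y_emb, c_t, W, b, U, c_out):
    """Actor pi: R^{3d} -> R^d. Maps the concatenation of the previous
    hidden state, the embedding of the previous symbol, and the attention
    context to an additive bias z_t for the decoder's hidden state."""
    inp = np.concatenate([h_prev, y_emb, c_t])  # lives in R^{3d}
    return U @ np.tanh(W @ inp + b) + c_out     # z_t in R^d

rng = np.random.default_rng(0)
d = 32
W = rng.normal(scale=0.1, size=(d, 3 * d))
U = rng.normal(scale=0.1, size=(d, d))
b, c_out = np.zeros(d), np.zeros(d)

z_t = actor(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d),
            W, b, U, c_out)
# The pretrained decoder's next hidden state is then biased by z_t.
```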
(1) Trainable Decoding Algorithm
Learning
1) Generate a translation for a source sentence with noise:
((h_1, z_1), …, (h_T, z_T)) and its return R
2) Train the critic to minimize (R^c(h_1, …, h_T) − R)^2
3) Generate multiple translations with noise:
((h_1^1, z_1^1), …, (h_T^1, z_T^1)), …, ((h_1^M, z_1^M), …, (h_T^M, z_T^M))
4) Critic-aware actor learning (newly proposed):
(1/M) Σ_{m=1}^{M} (exp(−(R^c_m − R_m)^2)/Z) ∂R^c/∂π

Inference: simply throw away the critic and use the actor

[The trainable-decoder diagram as above]

(Gu, Cho & Li, 2017)
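The weighting in step 4 can be computed directly; a minimal sketch (taking the normalizer Z as the sum of the unnormalized weights, an assumption consistent with the formula above):

```python
import numpy as np

def critic_aware_weights(R_c, R):
    """Per-sample weights exp(-(R^c_m - R_m)^2) / Z: translations for
    which the critic's predicted return R^c_m is far from the observed
    return R_m contribute less to the actor's gradient."""
    w = np.exp(-(np.asarray(R_c) - np.asarray(R)) ** 2)
    return w / w.sum()  # Z taken as the sum of unnormalized weights

# M = 4 noisy translations: critic predictions vs. observed returns.
weights = critic_aware_weights([0.8, 0.5, 0.9, 0.2], [0.7, 0.1, 0.9, 0.6])
```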
