Artificial Neural Networks and Deep Learning
- Recurrent Neural Networks -
Artificial Intelligence and Robotics Laboratory
Politecnico di Milano
Sequence Modeling
So far we have considered only «static» datasets
[Figure: a static feedforward network with inputs x_1 … x_I, weights w_ji, and outputs g_1(x|w) … g_K(x|w)]
Sequence Modeling
So far we have considered only «static» datasets
[Figure: a time-indexed dataset, i.e., a sequence of input vectors X_0, X_1, X_2, X_3, …, X_t, each with components x_1 … x_I]
Sequence Modeling
Different ways to deal with «dynamic» data:
Memoryless models (fixed lag):
• Autoregressive models
• Feedforward neural networks

Models with memory (unlimited):
• Linear dynamical systems
• Hidden Markov models
• Recurrent Neural Networks
• ...

[Figure: the time-indexed sequence of input vectors X_0, X_1, X_2, X_3, …, X_t]
Memoryless Models for Sequences (1/2)
Autoregressive models
• Predict the next input from previous ones using «delay taps»

Linear models with fixed lag
• Predict the next output from previous inputs using «delay taps» (a small sketch follows)

[Figure: time-unrolled diagrams where weights W_{t-2}, W_{t-1} connect the past inputs to the current prediction X_t or Y_t]
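As an illustration of the «delay taps» idea, here is a minimal NumPy sketch (not from the slides) of a fixed-lag linear autoregressive predictor fitted by least squares; the lag p and the toy sine-wave sequence are assumptions made for the example.

```python
import numpy as np

# Toy sequence (assumption: a noisy sine wave) and a fixed lag p ("delay taps").
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 20, 200)) + 0.1 * rng.standard_normal(200)
p = 3  # number of delay taps

# Build the lagged design matrix: row t holds [x_{t-1}, ..., x_{t-p}, 1].
X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)] + [np.ones(len(x) - p)])
y = x[p:]  # target: the next value of the sequence

# Fit the fixed-lag linear model by least squares (the "memoryless" predictor).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step-ahead prediction from the last p observed values.
x_next = np.concatenate([x[-p:][::-1], [1.0]]) @ w
print("predicted next value:", x_next)
```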
Memoryless Models for Sequences (2/2)
Feedforward neural networks
• Generalize autoregressive models using non-linear hidden layers

Feedforward neural networks with delays
• Predict the next output from previous inputs and previous outputs using «delay taps»

[Figure: the same time-unrolled diagrams, now with a hidden layer between the delayed inputs/outputs and the current prediction]
Dynamical Systems (Models with Memory)
Generative models with a hidden state which cannot be observed directly
• The hidden state has some dynamics, possibly affected by noise, and produces the output
• To compute the output we need to infer the hidden state
• Inputs are treated as driving inputs

In linear dynamical systems this becomes:
• State is continuous with Gaussian uncertainty
• Transformations are assumed to be linear
• State can be estimated using Kalman filtering

Stochastic systems ...

[Figure: a state-space model unrolled in time: driving inputs X_0, X_1, …, X_t feed a chain of hidden states, which emit the outputs Y_0, Y_1, …, Y_t]
Dynamical Systems (Models with Memory)
Generative models with a hidden state which cannot be observed directly
• The hidden state has some dynamics, possibly affected by noise, and produces the output
• To compute the output we need to infer the hidden state
• Inputs are treated as driving inputs

In hidden Markov models this becomes:
• State is assumed to be discrete; state transitions are stochastic (transition matrix)
• Output is a stochastic function of the hidden states
• State can be estimated via the Viterbi algorithm

Stochastic systems ...

[Figure: the same state-space model unrolled in time]
Recurrent Neural Networks
Deterministic systems ...

Memory via recurrent connections:
• A distributed hidden state allows information to be stored efficiently
• Non-linear dynamics allow complex hidden state updates

"With enough neurons and time, RNNs can compute anything that can be computed by a computer."
(Computation Beyond the Turing Limit, Hava T. Siegelmann, 1995)

[Figure: a recurrent network with inputs x_1 … x_I, hidden units h_j^t(x^t, W^(1), c^{t-1}, V^(1)), context units c_b^t(x^t, W_B^(1), c^{t-1}, V_B) fed by the previous context c_1^{t-1} … c_B^{t-1}, and output g^t(x|w)]
Recurrent Neural Networks

Memory via recurrent connections:
• A distributed hidden state allows information to be stored efficiently
• Non-linear dynamics allow complex hidden state updates

$$g^t(x_n \mid w) = g\left(\sum_{j=0}^{J} w_{1j}^{(2)} \cdot h_j^t(\cdot) + \sum_{b=0}^{B} v_{1b}^{(2)} \cdot c_b^t(\cdot)\right)$$

$$h_j^t(\cdot) = h_j\left(\sum_{i=0}^{I} w_{ji}^{(1)} \cdot x_{i,n}^t + \sum_{b=0}^{B} v_{jb}^{(1)} \cdot c_b^{t-1}\right)$$

$$c_b^t(\cdot) = c_b\left(\sum_{i=0}^{I} w_{bi}^{(1)} \cdot x_{i,n}^t + \sum_{b'=0}^{B} v_{bb'}^{(1)} \cdot c_{b'}^{t-1}\right)$$

[Figure: the same recurrent network, annotating the output g^t(x|w), the hidden units h_j^t and the context units c_b^t fed by the previous context c_1^{t-1} … c_B^{t-1}]
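A minimal NumPy sketch of one forward pass through the equations above, assuming tanh for both the hidden and context nonlinearities, a linear output, and arbitrary small sizes I, J, B:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, B = 4, 5, 3          # assumed sizes: inputs, hidden units, context units
T = 10                     # sequence length

# Weight matrices (the bias terms are folded in via an extra "1" at index 0).
W1 = rng.standard_normal((J, I + 1)) * 0.1   # w_ji^(1): input   -> hidden
V1 = rng.standard_normal((J, B + 1)) * 0.1   # v_jb^(1): context -> hidden
WB = rng.standard_normal((B, I + 1)) * 0.1   # w_bi^(1): input   -> context
VB = rng.standard_normal((B, B + 1)) * 0.1   # v_bb'^(1): context -> context
W2 = rng.standard_normal((J + 1,)) * 0.1     # w_1j^(2): hidden  -> output
V2 = rng.standard_normal((B + 1,)) * 0.1     # v_1b^(2): context -> output

x = rng.standard_normal((T, I))              # toy input sequence
c = np.zeros(B)                              # initial context c^0

for t in range(T):
    x_t = np.concatenate(([1.0], x[t]))      # prepend bias input
    c_prev = np.concatenate(([1.0], c))      # prepend bias context
    h = np.tanh(W1 @ x_t + V1 @ c_prev)      # h_j^t
    c = np.tanh(WB @ x_t + VB @ c_prev)      # c_b^t (new context, fed back next step)
    g = W2 @ np.concatenate(([1.0], h)) + V2 @ np.concatenate(([1.0], c))  # g^t(x|w)
    print(f"t={t}  g^t = {g:.4f}")
```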
Backpropagation Through Time
[Figure: the recurrent network from the previous slide, with hidden units h_j^t and context units c_b^t, to be unrolled in time]
Backpropagation Through Time
All these weights should be the same.

[Figure: the network unrolled over several time steps; the input-to-hidden weights W and the context weights V are replicated at every step]
Backpropagation Through Time
1
• Perform network unroll for U steps 𝑤11 ℎ𝑗𝑡 𝑥 𝑡 , W 1
, 𝑐 𝑡−1 , V 1
x1
• Initialize WB , 𝑉𝐵 replicas to be the same …
• Compute gradients and update replicas 𝑤𝑗𝑖 1
with the average of their gradients xi
𝑈−1 𝑈−1
… …
1 𝜕𝐸 1 𝜕𝐸 𝑡
𝑊𝐵 = 𝑊𝐵 − 𝜂 ⋅ 𝑉 = 𝑉𝐵 − 𝜂 ⋅
𝑈 𝜕𝑊𝐵𝑡−𝑢 𝐵 𝑈 𝜕𝑉𝐵𝑡−𝑢 xI 𝑤𝐽𝐼 𝑔𝑡 𝑥 w
𝑢=0 𝑢=0
… … … … 1 (1)
… 𝑐𝑏𝑡 𝑥 𝑡 , W𝐵 , 𝑐 𝑡−1 , VB
𝑉𝐵𝑡−3 𝑉𝐵𝑡−2 𝑉𝐵𝑡−1 𝑉𝐵𝑡
13
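A hedged PyTorch sketch of truncated backpropagation through time: the network is unrolled for U steps with shared weights, autograd sums the gradient contributions of the U replicas for each shared parameter (the 1/U averaging above amounts to a rescaling of the learning rate), and the hidden state is detached between windows. The model, data and hyperparameters are assumptions for the example.

```python
import torch
import torch.nn as nn

# Assumed toy setup: predict the next value of a scalar sequence with a vanilla RNN.
torch.manual_seed(0)
rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

seq = torch.sin(torch.linspace(0, 30, 300)).reshape(1, -1, 1)  # (batch, time, features)
U = 20                                                         # truncation length (unroll steps)

h = torch.zeros(1, 1, 16)                                      # initial hidden state
for start in range(0, seq.size(1) - U - 1, U):
    x = seq[:, start : start + U, :]                           # U-step window
    y = seq[:, start + 1 : start + U + 1, :]                   # next-step targets
    out, h = rnn(x, h)                                         # unroll U steps (weights shared)
    loss = ((readout(out) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()        # gradients of the shared weights accumulate over the U replicas
    opt.step()
    h = h.detach()         # truncate: do not backpropagate into earlier windows
print("final window loss:", loss.item())
```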
How much should we go back in time?
Sometimes the output might be related to some input that happened quite long before:

    Jane walked into the room. John walked in too.
    It was late in the day. Jane said hi to <???>

However, backpropagation through time was not able to train recurrent
neural networks significantly back in time ...
This was due to not being able to backpropagate through many layers ...

[Figure: the recurrent network with hidden units h_j^t and context units c_b^t]
How much can we go back in time?
To better understand why it was not working, consider a simplified case:

$$h^t = h(v^{(1)} \cdot h^{t-1} + w^{(1)} \cdot x) \qquad y^t = g(w^{(2)} \cdot h^t)$$

Backpropagation over an entire sequence S is computed as

$$\frac{\partial E}{\partial w} = \sum_{t=1}^{S} \frac{\partial E^t}{\partial w} \qquad
\frac{\partial E^t}{\partial w} = \frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial h^t}\,\frac{\partial h^t}{\partial h^k}\,\frac{\partial h^k}{\partial w} \qquad
\frac{\partial h^t}{\partial h^k} = \prod_{i=k+1}^{t} \frac{\partial h^i}{\partial h^{i-1}} = \prod_{i=k+1}^{t} v^{(1)}\, h'\!\left(v^{(1)} \cdot h^{i-1} + w^{(1)} \cdot x\right)$$

If we consider the norm of these terms

$$\left\lVert \frac{\partial h^i}{\partial h^{i-1}} \right\rVert \le \left|v^{(1)}\right| \cdot \lVert h' \rVert \le \gamma_v \cdot \gamma_{h'} \qquad
\left\lVert \frac{\partial h^t}{\partial h^k} \right\rVert \le \left(\gamma_v \cdot \gamma_{h'}\right)^{t-k}$$

If γ_v ⋅ γ_{h'} < 1 this converges to 0 ...
With Sigmoids and Tanh we have vanishing gradients (a numeric check follows).
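A small NumPy check, as an illustration only, of how the product of Jacobian factors shrinks for a tanh unit once γ_v ⋅ γ_{h'} < 1; the scalar weights and the constant input are arbitrary assumptions.

```python
import numpy as np

# Scalar RNN from the slide: h^t = tanh(v1 * h^{t-1} + w1 * x), with a constant input x.
v1, w1, x = 0.9, 0.5, 1.0
h = 0.0
grad = 1.0  # running product of the factors dh^i/dh^{i-1}

for i in range(1, 31):
    a = v1 * h + w1 * x
    h = np.tanh(a)
    dtanh = 1.0 - np.tanh(a) ** 2         # h'(a) <= 1 for tanh
    grad *= v1 * dtanh                    # one factor of the product ∂h^t/∂h^k
    if i % 10 == 0:
        print(f"after {i:2d} steps  |∂h^t/∂h^0| ≈ {abs(grad):.2e}")  # geometric decay
```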
Which Activation Function?
Sigmoid activation function

$$g(a) = \frac{1}{1 + \exp(-a)} \qquad g'(a) = g(a)\,(1 - g(a)) \qquad
g'(0) = g(0)\,(1 - g(0)) = \frac{1}{1+\exp(0)} \cdot \frac{\exp(0)}{1+\exp(0)} = 0.25$$

Tanh activation function

$$g(a) = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)} \qquad g'(a) = 1 - g(a)^2 \qquad
g'(0) = 1 - g(0)^2 = 1 - \left(\frac{\exp(0) - \exp(0)}{\exp(0) + \exp(0)}\right)^{2} = 1$$
Dealing with Vanishing Gradient
Force all gradients to be either 0 or 1:

$$g(a) = \mathrm{ReLU}(a) = \max(0, a) \qquad g'(a) = \mathbb{1}_{a > 0}$$

Build Recurrent Neural Networks using small modules that are designed
to remember values for a long time:

$$h^t = v^{(1)} \cdot h^{t-1} + w^{(1)} \cdot x \qquad y^t = g(w^{(2)} \cdot h^t) \qquad v^{(1)} = 1$$

With v^(1) = 1 it only accumulates the input ... (a short check follows)
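A tiny NumPy check (illustration only) that a linear memory cell with v^(1) = 1 simply accumulates its inputs, so the gradient through the recurrent loop is exactly 1 at every step; the input sequence is arbitrary.

```python
import numpy as np

v1, w1 = 1.0, 1.0                  # fixed recurrent weight v^(1) = 1 (the "accumulator")
x = np.array([0.5, -1.0, 2.0, 0.0, 1.5])

h = 0.0
for t, x_t in enumerate(x, start=1):
    h = v1 * h + w1 * x_t          # h^t = h^{t-1} + x^t
    print(f"h^{t} = {h:+.2f}   (running sum = {x[:t].sum():+.2f})")
# dh^t/dh^{t-1} = v1 = 1 at every step: the gradient neither vanishes nor explodes.
```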
Long Short-Term Memories (LSTM)
Hochreiter & Schmidhuber (1997) solved the vanishing gradient problem by
designing a memory cell using logistic and linear units with
multiplicative interactions:
• Information gets into the cell whenever its "write" gate is on.
• The information stays in the cell as long as its "keep" gate is on.
• Information is read from the cell by turning on its "read" gate.

We can backpropagate through this since the loop has a fixed weight.
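Below is a minimal NumPy sketch of one step of an LSTM cell in the common modern formulation, where the slide's "write", "keep" and "read" gates correspond roughly to the input, forget and output gates; the sizes and random weights are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8                      # assumed sizes
x = rng.standard_normal(n_in)           # current input x^t
h = np.zeros(n_hid)                     # previous hidden state h^{t-1}
c = np.zeros(n_hid)                     # previous cell state  c^{t-1}

# One weight matrix per gate, acting on [x^t, h^{t-1}] (biases omitted for brevity).
z = np.concatenate([x, h])
W_i, W_f, W_o, W_g = (rng.standard_normal((n_hid, n_in + n_hid)) * 0.1 for _ in range(4))

i = sigmoid(W_i @ z)      # input ("write") gate: how much new information enters the cell
f = sigmoid(W_f @ z)      # forget ("keep") gate: how much of the old cell state is kept
o = sigmoid(W_o @ z)      # output ("read") gate: how much of the cell is exposed
g = np.tanh(W_g @ z)      # candidate cell update

c = f * c + i * g         # cell state: the additive loop that lets gradients flow
h = o * np.tanh(c)        # new hidden state
print("h^t:", np.round(h, 3))
```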
RNN vs. LSTM
[Figure: a chain of repeating vanilla RNN cells vs. a chain of LSTM cells]
Images from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short-Term Memory
[Figure: the internals of an LSTM cell]
Long Short-Term Memory
Input gate
[Figure: the LSTM cell with the input gate highlighted]
Long Short-Term Memory
Forget gate
[Figure: the LSTM cell with the forget gate highlighted]
Long Short-Term Memory
Memory gate
[Figure: the LSTM cell with the memory (cell state) update highlighted]
Long Short-Term Memory
Output gate
[Figure: the LSTM cell with the output gate highlighted]
Gated Recurrent Unit (GRU)
It combines the forget and input gates into a single “update gate.” It also
merges the cell state and hidden state, and makes some other changes.
[Figure: the internals of a GRU cell]
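For comparison, a brief PyTorch sketch (with assumed sizes) showing that nn.LSTM carries both a hidden state and a separate cell state, while nn.GRU, having merged them, carries only a hidden state:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 15, 10)                     # (batch=2, time=15, features=10), assumed sizes

lstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
gru = nn.GRU(input_size=10, hidden_size=32, batch_first=True)

out_lstm, (h_n, c_n) = lstm(x)                 # LSTM keeps a separate cell state c_n
out_gru, h_gru = gru(x)                        # GRU has merged cell and hidden state

print(out_lstm.shape, h_n.shape, c_n.shape)    # (2, 15, 32), (1, 2, 32), (1, 2, 32)
print(out_gru.shape, h_gru.shape)              # (2, 15, 32), (1, 2, 32)
```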
LSTM Networks
You can build a computation graph with continuous transformations.
[Figure: a chain of hidden (LSTM) blocks unrolled in time, mapping inputs X_0, X_1, …, X_t to outputs Y_0, Y_1, …, Y_t]
Multiple Layers and Bidirectional LSTM Networks
A computation graph in time with continuous transformations.
Hierarchical representation

[Figure: a stack of two LSTM layers followed by a ReLU layer, unrolled in time, mapping inputs X_0, X_1, …, X_t to outputs Y_0, Y_1, …, Y_t]
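A hedged PyTorch sketch of such a stack, using num_layers=2 and a small ReLU readout in place of the output layer drawn in the figure; all sizes are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 25, 8)                        # (batch, time, features), assumed sizes

# Two stacked LSTM layers, as in the figure, plus a per-time-step readout.
stack = nn.LSTM(input_size=8, hidden_size=64, num_layers=2, batch_first=True)
readout = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

out, _ = stack(x)          # out: hidden states of the top LSTM layer, shape (4, 25, 64)
y = readout(out)           # one output per time step, shape (4, 25, 1)
print(y.shape)
```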
Tips & Tricks
When conditioning on the full input sequence, bidirectional RNNs can exploit it:
• Have one RNN traverse the sequence left-to-right
• Have another RNN traverse the sequence right-to-left
• Use the concatenation of their hidden layers as the feature representation
Multiple Layers and Bidirectional LSTM Networks
A computation graph in time with continuous transformations.
Hierarchical representation

[Figure: two stacked bidirectional LSTM networks, one processing X_0 … X_t left-to-right and one processing X_t … X_0 right-to-left ("bidirectional processing")]
Tips & Tricks
When conditioning on the full input sequence, bidirectional RNNs can exploit it:
• Have one RNN traverse the sequence left-to-right
• Have another RNN traverse the sequence right-to-left
• Use the concatenation of their hidden layers as the feature representation

When initializing an RNN we need to specify the initial state:
• We could initialize it to a fixed value (such as 0)
• It is better to treat the initial state as learned parameters (see the sketch below):
  • Start off with random guesses of the initial state values
  • Backpropagate the prediction error through time all the way to the initial state values and compute the gradient of the error with respect to these
  • Update these parameters by gradient descent
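A brief PyTorch sketch (illustrative, with assumed sizes) of treating the initial state as learned parameters: registering h_0 as an nn.Parameter lets backpropagation through time reach it, so gradient descent updates it together with the weights.

```python
import torch
import torch.nn as nn

class GRUWithLearnedInit(nn.Module):
    """Toy GRU regressor whose initial hidden state is a trainable parameter."""

    def __init__(self, n_in=3, n_hid=16):
        super().__init__()
        self.gru = nn.GRU(n_in, n_hid, batch_first=True)
        self.h0 = nn.Parameter(torch.randn(1, 1, n_hid) * 0.1)  # learned initial state
        self.out = nn.Linear(n_hid, 1)

    def forward(self, x):
        h0 = self.h0.expand(1, x.size(0), -1).contiguous()      # one copy per batch element
        y, _ = self.gru(x, h0)
        return self.out(y)

torch.manual_seed(0)
model = GRUWithLearnedInit()
x, target = torch.randn(4, 12, 3), torch.randn(4, 12, 1)        # assumed toy data
loss = ((model(x) - target) ** 2).mean()
loss.backward()                                                 # the gradient reaches h0 as well
print(model.h0.grad.abs().mean())                               # non-zero: h0 is being learned
```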
Sequential Data Problems
• Fixed-sized input to fixed-sized output (e.g., image classification)
• Sequence output (e.g., image captioning: takes an image and outputs a sentence of words)
• Sequence input (e.g., sentiment analysis: a given sentence is classified as expressing positive or negative sentiment)
• Sequence input and sequence output (e.g., machine translation: an RNN reads a sentence in English and then outputs a sentence in French)
• Synced sequence input and output (e.g., video classification, where we wish to label each frame of the video)

Image credits: Andrej Karpathy
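As an illustration only (sizes assumed), the tensor shapes some of these patterns correspond to around a single nn.LSTM in PyTorch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
x = torch.randn(1, 20, 10)            # one sequence of 20 steps, 10 features each
out, (h_n, _) = lstm(x)

# Synced sequence input and output (e.g. per-frame labels): one prediction per time step.
per_step = nn.Linear(32, 5)(out)      # shape (1, 20, 5)

# Sequence input, single output (e.g. sentiment): use only the final hidden state.
single = nn.Linear(32, 2)(h_n[-1])    # shape (1, 2)

# Sequence input and sequence output (e.g. translation): the final state would be passed
# to a separate decoder RNN that generates the output sequence step by step.
print(per_step.shape, single.shape)
```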
Sequence to Sequence Learning Examples (1/3)
Image Captioning: input a single image and get a sequence of words that
describes it as output. The image has a fixed size, but the output has
varying length.
Sequence to Sequence Learning Examples (2/3)
Sentiment Classification/Analysis: input a sequence of characters or
words, e.g., a tweet, and classify the sequence into positive or negative
sentiment. The input has varying length; the output is of a fixed type and size.
Sequence to Sequence Learning Examples (3/3)
Language Translation: having some text in a particular language, e.g.,
English, we wish to translate it into another, e.g., French. Each language has
its own semantics, and the same sentence has varying lengths across languages.