RNN LSTM
Why not standard NN?
What is an RNN?
• We consider a class of recurrent networks referred to as Elman
Networks (Elman, 1990).
• A recurrent neural network (RNN) is a type of artificial neural network used for sequential data or time-series data.
Applications:
+ Language translation.
+ Natural language processing (NLP).
+ Speech recognition.
+ Image captioning.
Recurrent Neural Networks (RNN)
A family of neural architectures that apply the same weights 𝑊 repeatedly.
Types of RNN
Recurrent Neural Network Cell
[Diagram: the cell takes the previous hidden state ℎ0 and the input 𝑥1, and produces the new hidden state ℎ1 and the output 𝑦1]
ℎ1 = tanh(𝑊ℎℎ ℎ0 + 𝑊ℎ𝑥 𝑥1)
𝑦1 = softmax(𝑊ℎ𝑦 ℎ1)
Example (vocabulary a b c d e):
𝑥1 = [0 0 1 0 0] (one-hot vector for "c")
𝑦1 = [0.1, 0.05, 0.05, 0.1, 0.7] → highest probability 0.7 on "e"
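A minimal sketch of this cell in NumPy (not from the slides; the function and variable names are illustrative):

import numpy as np

def rnn_step(h_prev, x, W_hh, W_hx, W_hy):
    """One RNN cell step: new hidden state and output distribution."""
    h = np.tanh(W_hh @ h_prev + W_hx @ x)          # h1 = tanh(Whh h0 + Whx x1)
    logits = W_hy @ h
    y = np.exp(logits - logits.max())
    y = y / y.sum()                                # y1 = softmax(Why h1)
    return h, y

# Toy example over the vocabulary {a, b, c, d, e}
vocab_size, hidden_size = 5, 8
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hx = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
h1, y1 = rnn_step(np.zeros(hidden_size), np.array([0, 0, 1, 0, 0.0]), W_hh, W_hx, W_hy)
print(y1)                                          # a probability distribution over {a, b, c, d, e}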
(Unrolled) Recurrent Neural Network
[Diagram: character-level example. Inputs 𝑥1 𝑥2 𝑥3 = "c", "a", "t"; hidden states ℎ1 ℎ2 ℎ3; outputs 𝑦1 𝑦2 𝑦3 = "a", "t", <<space>>]
(Unrolled) Recurrent Neural Network
[Diagram: word-level example. Inputs 𝑥1 𝑥2 𝑥3; hidden states ℎ1 ℎ2 ℎ3; outputs 𝑦1 𝑦2 𝑦3 = "cat", "likes", "eating"]
Bidirectional Recurrent Neural Network (BRNN)
[Diagram: 𝐵𝑅𝑁𝑁 cells run over the inputs 𝑥1 𝑥2 𝑥3 = "c", "a", "t" in both directions, producing hidden states ℎ1 ℎ2 ℎ3 and outputs 𝑦1 𝑦2 𝑦3]
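If 𝐵𝑅𝑁𝑁 above denotes a bidirectional RNN, the idea can be sketched as follows (an assumption, with illustrative names, not the slides' exact formulation): run one RNN left-to-right and one right-to-left, then concatenate their hidden states at each position.

import numpy as np

def brnn_hidden_states(xs, W_f, U_f, W_b, U_b, hidden_size):
    """Run a forward and a backward RNN over the sequence and concatenate states."""
    h_fwd, h_bwd = np.zeros(hidden_size), np.zeros(hidden_size)
    fwd, bwd = [], []
    for x in xs:                                   # left-to-right pass
        h_fwd = np.tanh(W_f @ h_fwd + U_f @ x)
        fwd.append(h_fwd)
    for x in reversed(xs):                         # right-to-left pass
        h_bwd = np.tanh(W_b @ h_bwd + U_b @ x)
        bwd.append(h_bwd)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]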
Training an RNN Language Model
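A minimal sketch of the training objective, reusing the hypothetical rnn_step from the cell sketch above: unroll the network over a sequence and sum the cross-entropy of each predicted next token (names and details are illustrative).

def sequence_loss(inputs, targets, h0, W_hh, W_hx, W_hy):
    """Unroll the RNN over a sequence and average per-step cross-entropy losses."""
    h, loss = h0, 0.0
    for x, t in zip(inputs, targets):       # t: index of the true next token
        h, y = rnn_step(h, x, W_hh, W_hx, W_hy)
        loss += -np.log(y[t] + 1e-12)       # cross-entropy at this timestep
    return loss / len(inputs)

# e.g. inputs = one-hot vectors for "c", "a", "t"; targets = indices of "a", "t", "<space>".
# Training minimizes this loss with gradients computed by backpropagation through time.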
Backpropagation for RNNs
Multivariable Chain Rule
Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version
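For reference, the simple version of the multivariable chain rule, which is the rule applied when the same weights appear at every timestep:

\frac{d}{dt} f\bigl(x(t), y(t)\bigr) = \frac{\partial f}{\partial x}\,\frac{dx}{dt} + \frac{\partial f}{\partial y}\,\frac{dy}{dt}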
Backpropagation for RNNs
In practice, backpropagation is often "truncated" after ~20 timesteps for training-efficiency reasons.
Backpropagation through time
If timesteps k and t are far apart, the gradient of the loss at step t with respect to the state at step k can grow or shrink exponentially (the gradient exploding or gradient vanishing problem).
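A small numerical illustration (not from the slides): in a vanilla RNN, the gradient flowing from step t back to step k is multiplied by a Jacobian involving 𝑊ℎℎ at every intermediate step, so its norm behaves roughly like a power of the largest singular value of 𝑊ℎℎ.

import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 8, 50
g = rng.normal(size=hidden_size)                 # some upstream gradient at step t

for scale in [0.05, 0.5]:                        # small vs. large recurrent weights
    W_hh = rng.normal(scale=scale, size=(hidden_size, hidden_size))
    grad = g.copy()
    for _ in range(steps):                       # ignore the tanh' factor (<= 1) for simplicity
        grad = W_hh.T @ grad                     # one step of backpropagation through time
    print(scale, np.linalg.norm(grad))           # vanishes for the small scale, explodes for the large one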
Why is the vanishing gradient a problem?
The vanishing gradient problem for language models
Example (RNN-LM task):
Jane walked into the room. John walked in too. It was late in the day.
Jane said hi to ____
To learn from this training example, the RNN-LM needs to model the dependency between "John" on the 7th step and the target word "John" at the end.
But if the gradient is small, the model can't learn this dependency.
So the model is unable to predict similar long-distance dependencies at test time.
Vanishing/Exploding Solutions
Vanishing Gradient:
Gating mechanism (LSTM, GRU)
Attention mechanism (Transformer)
Adding skip connections through time (residual connections; see the sketch after this list)
Better Initialization
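A minimal sketch of the skip-connection idea mentioned above (illustrative, not a specific published architecture): give the hidden state a direct additive path through time.

import numpy as np

def residual_rnn_step(h_prev, x, W_hh, W_hx):
    """RNN step with a residual (skip) connection through time."""
    return h_prev + np.tanh(W_hh @ h_prev + W_hx @ x)   # identity path eases gradient flow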
Long Short-Term Memory (LSTM), 1997
Architecture of LSTM cell
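The cell diagram itself is not reproduced here; as a reference, the standard LSTM cell with a forget gate can be sketched in NumPy as follows (weight names are illustrative, biases omitted for brevity):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(h_prev, c_prev, x, W_f, W_i, W_c, W_o):
    """One LSTM step; every gate reads the concatenated [h_prev, x]."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z)              # forget gate: how much of c_prev to keep
    i = sigmoid(W_i @ z)              # input gate: how much new content to write
    c_tilde = np.tanh(W_c @ z)        # candidate cell content
    c = f * c_prev + i * c_tilde      # additive cell-state update
    o = sigmoid(W_o @ z)              # output gate: how much of the cell to expose
    h = o * np.tanh(c)                # new hidden state
    return h, c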
How does the LSTM solve the vanishing gradient problem?
The cell state is updated additively (forget gate times the old cell state, plus input gate times the new candidate), so the gradient flowing through it is scaled by the forget gate rather than repeatedly multiplied by 𝑊ℎℎ; when the forget gate is close to 1, gradients can be carried across many timesteps without vanishing.
LSTM Variations (GRU)
● Gated Recurrent Unit (GRU) (Kyunghyun Cho et al., 2014)
- Combines the forget and input gates into a single "update gate"
- Merges the cell state and the hidden state
- Simpler.
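A sketch of the standard GRU update in the same NumPy style (weight names illustrative, biases omitted); note there is no separate cell state:

import numpy as np

def gru_step(h_prev, x, W_z, W_r, W_h):
    """One GRU step; the hidden state doubles as the memory."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx)                                      # update gate (merged forget/input roles)
    r = sigmoid(W_r @ hx)                                      # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))   # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde                      # conventions differ: some swap z and (1 - z)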
Compare LSTM vs. GRU
- GRUs train faster and perform better than LSTMs when there is less training data, at least for language modeling (less clear for other tasks).
- GRUs are simpler and thus easier to modify, for example by adding new gates when the network has additional inputs. It is just less code in general.
- LSTMs should in theory remember longer sequences than GRUs
and outperform them in tasks requiring modeling long-distance
relations.
Successful Applications of LSTMs
Speech recognition: language and acoustic modeling
Sequence labeling:
+ POS Tagging (https://www.aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art))
+ NER
+ Phrase Chunking
Neural syntactic and semantic parsing
Image captioning: CNN output vector to sequence
Sequence to sequence:
+ Machine Translation (Sutskever, Vinyals, & Le, 2014)
+ Video Captioning (input sequence of CNN frame outputs)
Summary
Recurrent Neural Networks are one of the best families of deep NLP models.
The most important and powerful RNN extensions are LSTMs and GRUs.
Homework