Chapter 2
Beyond RNN
LSTM & GRU
University Year: 2024/2025
Recall on Vanilla RNN
Input Data
Inference
I/O Mapping
Loss Function
Training
Recall on Vanilla RNN: Input Data
RNNs are designed for processing sequential data (e.g. text, video, time series)
Order
Dependencies
If the data elements are simply ordered (e.g. by size or arrival time), this does not imply the data is sequential; we also need inter-element semantic/contextual relationships (e.g. to make predictions)
No: whether or not the data is sequential, the model will not leverage the data dependencies ⇒ the past has no impact on the present
Yes: if the data is sequential, the model will extract the relationships between data elements ⇒ the past influences the present
Recall on Vanilla RNN: Inference
RNNs maintain a memory (hidden state) of previous inputs
RNN vs MLP
Same matrices for all inputs!
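A minimal NumPy sketch of this recurrence, with illustrative weight names (W_xh, W_hh, b_h are not from the slides); the same matrices are reused at every timestep:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla-RNN timestep: mix the new input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Run over a whole sequence; the same weight matrices are applied at every step."""
    h = h0
    hidden_states = []
    for x_t in xs:                      # xs: sequence of input vectors
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        hidden_states.append(h)         # the hidden state acts as a memory of past inputs
    return hidden_states
```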
Use-case Examples
[Figures: example I/O mappings, e.g. a one-hidden-layer NN and text generation]
Recall on Vanilla RNN: Loss Function
RNNs are often trained to minimise the cross entropy loss over the entire vocabulary
Perplexity
(the lower the perplexity, the more confident the next-word prediction)
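A minimal NumPy sketch relating the cross-entropy loss to perplexity (the toy probabilities and function names are illustrative):

```python
import numpy as np

def cross_entropy(probs, targets):
    """Average negative log-likelihood of the target words over the sequence."""
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

# probs: (T, vocab_size) next-word distributions; targets: (T,) gold word ids
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([0, 1])

loss = cross_entropy(probs, targets)
perplexity = np.exp(loss)   # lower perplexity = more confident next-word prediction
print(loss, perplexity)
```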
Recall on Vanilla RNN: Training
RNNs are trained using backpropagation through time (BPTT)
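A minimal PyTorch sketch of what BPTT amounts to in practice: unroll the RNN over the sequence, sum the per-step losses, and let a single backward() propagate gradients through all timesteps (all sizes and names are illustrative):

```python
import torch
import torch.nn as nn

vocab, embed, hidden, T = 1000, 32, 64, 20       # illustrative sizes
emb = nn.Embedding(vocab, embed)
cell = nn.RNNCell(embed, hidden)
out = nn.Linear(hidden, vocab)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (T + 1,))       # toy token sequence
h = torch.zeros(1, hidden)
loss = 0.0
for t in range(T):                               # unroll over time
    h = cell(emb(tokens[t]).unsqueeze(0), h)     # same matrices at every step
    loss = loss + loss_fn(out(h), tokens[t + 1].unsqueeze(0))
loss.backward()                                  # backpropagation through time
```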
Assessing Vanilla RNN

Pros
● Current state uses information from earlier steps
● RNNs process input sequences of any length
● The model size is independent of the input sequence length
● The same weight matrices are applied to all timesteps

Cons
● RNNs are sequential, thus cannot be parallelized
● Long-term dependencies (i.e. information from many steps back) are hardly captured
Vanilla RNN Issues
● RNNs are sequential, thus cannot be parallelized
  ○ Transformers (Vaswani et al., 2017) [Next lecture]
[Figure: Transformer architecture]
Vanilla RNN Issues
● Long-term dependencies are hardly captured
  ○ Stack RNN cells for more memory capacity
[Figure: stacked RNN cells over sequential input data, capturing syntactic relationships]
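A one-line PyTorch sketch of such stacking (the sizes are illustrative):

```python
import torch.nn as nn

# Two stacked RNN layers: the second layer consumes the hidden states of the first
stacked_rnn = nn.RNN(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
```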
‘T’ outputs, thus ‘T’ error terms
‘t’ timesteps, thus ‘t’ derivatives
Chain Rule (Recall)
Chain Rule (Again!)
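A hedged reconstruction of the gradient decomposition these captions refer to, using the standard BPTT chain rule with a tanh activation (the notation L_t, h_t, W_hh is assumed, not copied from the slides):

```latex
% T outputs, thus T error terms
\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial W_{hh}}

% t timesteps, thus t derivatives: the chain rule runs through every earlier hidden state
\frac{\partial \mathcal{L}_t}{\partial W_{hh}}
  = \sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial h_t}
    \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right)
    \frac{\partial h_k}{\partial W_{hh}},
\qquad
\frac{\partial h_i}{\partial h_{i-1}} = \operatorname{diag}\!\left(1 - h_i^{2}\right) W_{hh}^{\top}
```

The repeated factor in the product is what makes the gradient shrink or blow up as the distance between timesteps grows.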
NUMERICAL INSTABILITY
EVEN WORSE: Too sensitive!
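A small, illustrative-only NumPy demo of this sensitivity: the norm of a product of identical Jacobians shrinks or blows up geometrically with the scale of the recurrent matrix (the sizes and scales are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 50, 64
for scale in (0.5, 1.0, 2.0):               # controls the spectral radius of W_hh
    W = scale * rng.standard_normal((H, H)) / np.sqrt(H)
    grad = np.eye(H)
    for _ in range(T):
        grad = W.T @ grad                   # repeated Jacobian factor (tanh' ignored for simplicity)
    print(scale, np.linalg.norm(grad))      # tiny for 0.5 (vanishing), huge for 2.0 (exploding)
```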
Vanishing and Exploding Gradients Problem
Quick Question
RECAP TIME!
What’s the story so far?
Major Issue for RNN
Problem
Long-term dependencies are hard to capture due to the vanishing gradient problem and the lack of complex memory
Solutions
Training / Architecture
Solving Vanishing and Exploding Gradients
● Weight Initialization: Identity, Xavier, He, etc.
etc…
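A minimal PyTorch sketch of the initialization schemes listed above, applied to a vanilla RNN's weights (the helper function and the per-parameter choices are illustrative):

```python
import torch.nn as nn

def init_rnn(rnn: nn.RNN, scheme: str = "xavier"):
    """Initialize the input-to-hidden and hidden-to-hidden weights of a vanilla RNN."""
    for name, param in rnn.named_parameters():
        if "bias" in name:
            nn.init.zeros_(param)
        elif scheme == "identity" and "weight_hh" in name:
            nn.init.eye_(param)                              # identity init of the recurrent matrix
        elif scheme == "he":
            nn.init.kaiming_uniform_(param, nonlinearity="relu")
        else:
            nn.init.xavier_uniform_(param)

rnn = nn.RNN(input_size=128, hidden_size=128)                # square recurrent matrix for identity init
init_rnn(rnn, scheme="identity")
```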
Long Short-Term Memory
“Long Short-Term”: increased memory capacity over time
Seminal Work: Hochreiter S., Schmidhuber J., “Long Short-Term Memory”, Neural Computation 9(8):1735–1780, 1997
Long Short-Term Memory
[Figure: LSTM architecture. Image from: Zhang et al., “Dive into Deep Learning”, Cambridge University Press, 2023]
Long Short-Term Memory
Input Node: integrates the new input word into the memory (similar to a vanilla RNN)
Memory Cell: produces the final memory as a weighted aggregation of past information to forget and new information to keep
Input Gate: determines whether the input is worth keeping (word relevance)
Forget Gate: assesses whether the past memory is useful for the computation of the current memory
Output Gate: separates the final memory from the hidden state, deciding what parts of the memory need to be present in the hidden state
Sigmoid: values in [0,1] & smooth function ⇒ ideal for gates (i.e. turn-on / turn-off)
Tanh: values in [-1,1] & zero-centered ⇒ balanced activations
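A minimal NumPy sketch of one LSTM step following the gate descriptions above (the parameter names and dictionary layout are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep; params holds per-gate weight triples (W, U, b)."""
    W_i, U_i, b_i = params["input_gate"]
    W_f, U_f, b_f = params["forget_gate"]
    W_o, U_o, b_o = params["output_gate"]
    W_g, U_g, b_g = params["input_node"]

    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)   # input gate: is the new input worth keeping?
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)   # forget gate: is the past memory still useful?
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)   # output gate: what to expose in the hidden state?
    g_t = np.tanh(W_g @ x_t + U_g @ h_prev + b_g)   # input node: candidate memory (like a vanilla RNN)

    c_t = f_t * c_prev + i_t * g_t                  # memory cell: weighted aggregation of old and new
    h_t = o_t * np.tanh(c_t)                        # hidden state: gated view of the memory
    return h_t, c_t
```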
LSTM Solving the Vanishing Gradient Problem
More stability thanks to the memory cell!
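A brief sketch of the standard argument (treating the gate values as constants, with the cell update c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t from the sketch above):

```latex
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\frac{\partial c_t}{\partial c_k} \approx \prod_{i=k+1}^{t} \operatorname{diag}(f_i)
```

Unlike the vanilla RNN, the gradient through the cell is not repeatedly multiplied by W_hh; when the forget gates stay close to 1, information (and gradient) can flow across many timesteps almost unchanged.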
RECAP TIME!
What’s the story so far?
Gated Recurrent Unit
¡ My Q&A Time !
Gated Recurrent Unit
6 matrices (3 fully-connected layers with 2 inputs each: the current input and the previous hidden state)
Gated Recurrent Unit
Reset Gate: removes old information from the hidden state
Update Gate: adds new information to the hidden state
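A minimal NumPy sketch of one GRU step, assuming the standard reset/update formulation (the parameter names are illustrative); note the 6 matrices, two per fully-connected layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU timestep; params holds the 3 FC layers (2 weight matrices each, 6 in total)."""
    W_r, U_r, b_r = params["reset_gate"]
    W_z, U_z, b_z = params["update_gate"]
    W_h, U_h, b_h = params["candidate"]

    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)               # reset gate: drop old information
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)               # update gate: how much new information to add
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate hidden state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                  # combined memory / hidden state
    return h_t
```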
LSTM vs GRU
Differences between LSTM and GRU
LSTM vs GRU

LSTM
● Independent memory cell state for storing information ⇒ long-term dependencies
● LSTM controls the memory cell:
  ○ Remove from the cell (forget gate)
  ○ Add to the cell (input gate)
  ○ Extract from the cell (output gate)

GRU
● Combined memory cell with hidden state ⇒ fewer parameters
● GRU controls the hidden state:
  ○ Add new information (update gate)
  ○ Remove old information (reset gate)
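A small PyTorch check of the "fewer parameters" point, comparing same-sized LSTM and GRU layers (the sizes are illustrative):

```python
import torch.nn as nn

input_size, hidden_size = 128, 256          # illustrative sizes
lstm = nn.LSTM(input_size, hidden_size)     # 4 weight blocks: input, forget, cell, output
gru = nn.GRU(input_size, hidden_size)       # 3 weight blocks: reset, update, candidate

n_params = lambda m: sum(p.numel() for p in m.parameters())
print("LSTM:", n_params(lstm))              # ~4 * hidden * (input + hidden) + biases
print("GRU: ", n_params(gru))               # roughly 3/4 of the LSTM count
```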
LSTM vs GRU — Autocomplete Task
https://distill.pub/2019/memorization-in-rnns/#ar-connectivity-nlstm
Course: Advanced Natural Language Processing
Beyond RNN
LSTM & GRU
Any Questions?