Chapter 2

The document discusses advanced concepts in Natural Language Processing, focusing on Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU). It highlights the strengths and weaknesses of RNNs, particularly in handling long-term dependencies and the vanishing gradient problem, while introducing LSTMs and GRUs as solutions with enhanced memory capabilities. The course is intended for the academic year 2024/2025.

Course: Advanced Natural Language Processing

Beyond RNN
LSTM & GRU
University Year: 2024/2025
Recall on Vanilla RNN
● Input Data
● Inference
● I/O Mapping
● Loss Function
● Training
Recall on Vanilla RNN: Input Data
RNNs are designed for processing sequential data (e.g. text, video, time series), characterised by:
● Order
● Dependencies

If the data elements are simply ordered (e.g. by size, arrival time), this does not imply the data is sequential; we also need inter-element semantic/contextual relationships (e.g. to make predictions).
Recall on Vanilla RNN: Input Data
Are these relationships necessary for task modelling?
● No → Non-sequential models (e.g. MLP): whether or not the data is sequential, the model will not leverage the data dependencies; the past has no impact on the present.
● Yes → Sequential models (e.g. RNN): if the data is sequential, the model will extract the relationships between data elements; the past influences the present.
Recall on Vanilla RNN: Inference
RNNs maintain a memory (hidden state) of previous inputs.

RNN vs MLP: the same weight matrices are applied to all inputs!

Seminal Work: Rumelhart et al., "Learning internal representations by error propagation", Tech. Rep. ICS 8504, 1985.
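The slide's equations are shown as figures; for reference, here is the usual vanilla RNN recurrence (the symbol names W_{xh}, W_{hh}, W_{hy} are an assumption, not necessarily the slide's own notation):

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)
\hat{y}_t = \mathrm{softmax}(W_{hy} h_t + b_y)

The matrices W_{xh}, W_{hh}, W_{hy} are reused at every timestep, which is the "same matrices for all inputs" point made above.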
Recall on Vanilla RNN: I/O Mapping

Use-case Examples (roughly matching the I/O mappings, from one-to-one to many-to-many):
● One-Hidden Layer NN
● Text Generation
● Sentiment Analysis
● Text Translation
● POS Tagging, NER
Recall on Vanilla RNN: Loss Function
RNNs are often trained to minimise the cross-entropy loss over the entire vocabulary.

Example: Part-of-Speech Tagging
Sentence: Tensorflow is very easy
POS Tags: NOUN VERB ADV ADJ

Predictions are thus a distribution over the set of unique tags/classes {NOUN, VERB, ADJ, ADV}: a classification task per timestep.
Recall on Vanilla RNN: Loss Function
● Loss for one timestep
● Loss for all timesteps (rather, their average)
● Perplexity (the lower the perplexity, the more confident the next-word prediction)
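The corresponding formulas appear as figures on the slide; a standard formulation (notation assumed) is:

L^{(t)} = -\sum_{c \in C} y^{(t)}_{c} \log \hat{y}^{(t)}_{c}    (loss for one timestep, C = set of classes)
L = \frac{1}{T} \sum_{t=1}^{T} L^{(t)}    (average loss over all timesteps)
\mathrm{PPL} = \exp(L)    (perplexity, with natural-log cross-entropy; equivalently 2^{L} if the log is base 2)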
Recall on Vanilla RNN: Training
RNNs are trained using backpropagation through time (BPTT)

Assessing Vanilla RNN

Pros
● Current state uses information from earlier steps
● RNNs process input sequences of any length
● The model size is independent of the input sequence length
● The same weight matrices are applied to all timesteps

Cons
● RNNs are sequential, thus cannot be parallelized
● Long-term dependencies (i.e. information from many steps back) are hardly captured
Vanilla RNN Issues

Cons
● RNNs are sequential, thus cannot be parallelized
○ Transformers (Vaswani et al., 2017) [Next lecture]
○ Minimal LSTM/GRU (Feng et al., 2024)
● Long-term dependencies (i.e. information from many steps back) are hardly captured

[Figure: Transformer Architecture]
Vanilla RNN Issues

[Figure: unrolled RNN over a sequence; red row = sequential input data, green row = unrolled RNN, blue row = output along the sequence; computation proceeds bottom-up, then left to right; figure labels: "Semantics", "Syntactics"]

Cons
● RNNs are sequential, thus cannot be parallelized
● Long-range dependencies (i.e. information from many steps back) are hardly captured [today's lecture]

Stacking RNN cells gives more memory capacity (RNNs are even general-purpose computers, i.e. Turing complete). Still, the problem persists!
Let’s Analyse…
RNNs are trained using backpropagation through time (BPTT):
● 'T' outputs, thus 'T' error terms
● 't' timesteps, thus 't' derivatives per error term
● Chain Rule (recall), and the Chain Rule again: the gradient becomes a long product of per-step derivatives
● NUMERICAL INSTABILITY, and even worse: the product is too sensitive to the magnitude of each factor!
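The chain-rule expansion the slides build up (shown there as figures) boils down to a product of Jacobians; a standard write-up, with symbols assumed to match the recurrence given earlier, is:

\frac{\partial L^{(t)}}{\partial W} = \sum_{k=1}^{t} \frac{\partial L^{(t)}}{\partial h_t} \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right) \frac{\partial h_k}{\partial W},
\qquad \frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}\big(1 - h_i^2\big)\, W_{hh}

The repeated multiplication of these Jacobians is what makes the gradient numerically unstable and highly sensitive to the magnitude of each factor.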
Vanishing and Exploding Gradients

Vanishing Gradients: gradients become extremely small
● negligible update of weights
● slow training
● harder to detect

Exploding Gradients: gradients become excessively large
● large update of weights
● instability and divergence
● easy to detect
Vanishing and Exploding Gradients Problem
[Figure: Vanishing Gradients vs Exploding Gradients]
Vanishing and Exploding Gradients Problem

Suppose all per-step gradients are upper bounded by "c".
The input sequence length (the span between timesteps k and t) can be represented by "x = t - k".

Original Study: Pascanu et al., "On the difficulty of training Recurrent Neural Networks", ICML, 2013.
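Under the assumption stated above (each factor bounded by c), the product over the span x = t - k behaves as follows (a bound in the spirit of Pascanu et al.; exact constants are omitted here):

\left\| \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right\| \le c^{\,t-k}

So the gradient contribution from a step far in the past shrinks geometrically when c < 1 and grows geometrically when c > 1.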
Exploding Gradients are NOT ALWAYS a Problem

No Effort At All <> Small Consistent Effort (Just Saying!)
Vanishing and Exploding Gradients Problem

Quick Question
Why do we make a case distinction with respect to the value 1?

Because the backpropagated gradient behaves like a product of (t - k) factors, each bounded by c: if c < 1 the product shrinks towards 0 (vanishing gradients), if c > 1 it can grow without bound (exploding gradients); c = 1 is the tipping point.
RECAP TIME !
What’s the story so far?

Major Issue for RNN

Problem
Long-term dependencies are hard to capture, due to the vanishing gradient problem and the lack of a complex memory.

Solutions
● Training
● Architecture
Solving Vanishing and Exploding Gradients (training-side solutions)
● Weight Initialization: Identity, Xavier, He, etc.
● Gradient Clipping: limit the gradients' magnitude (a small sketch follows below)
● Normalization Techniques: batch and layer normalization
● etc.
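As a concrete illustration of gradient clipping by global norm (a minimal NumPy sketch; the function name and the 5.0 threshold are illustrative assumptions, not the course's reference code):

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Global L2 norm over all gradient arrays.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Rescale every gradient so the global norm does not exceed max_norm.
    if global_norm > max_norm:
        scale = max_norm / (global_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads

# Usage: two dummy gradients whose global norm (~44.7) gets rescaled to ~5.0.
grads = [np.ones((4, 4)) * 10.0, np.ones(4) * 10.0]
clipped = clip_by_global_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))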
Major Issue for RNN
Training-side solutions mitigate the gradient issues; the architecture-side solution, covered next, is to redesign the recurrent cell itself (LSTM, GRU).
Long Short-Term Memory

LSTMs are designed to have more persistent memory, to capture long-term dependencies through a gating mechanism.

Breaking the name down:
● Long: remembering information over long sequences
● Short-Term: capturing short-term dependencies
● Memory: increased memory capacity over time

Seminal Work: Hochreiter S., Schmidhuber J., "Long Short-Term Memory", Neural Computation 9(8):1735-1780, 1997.
Long Short-Term Memory

The gating mechanism allows the network to learn when to retain and when to forget a piece of information, depending on its relevance.
Long Short-Term Memory

Step-by-Step into LSTM

Image from: Zhang et al., "Dive into Deep Learning", Cambridge University Press, 2023.
Long Short-Term Memory
Input Node: Integrates the new input word into the memory (similar to a vanilla RNN).
Long Short-Term Memory
Memory Cell: Produces the final memory as a weighted aggregation of past information to forget and new information to keep.
Input Gate: Determines whether the input is worth keeping (word relevance).
Forget Gate: Assesses whether the past memory is useful for the computation of the current memory.
Long Short-Term Memory
Output Gate: Separates the final memory from the hidden state, deciding what parts of the memory need to be present in the hidden state.
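The corresponding update equations appear as figures on the slides; the standard formulation (symbol names assumed, following the usual convention) is:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    (input gate)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    (forget gate)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    (output gate)
\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (input node / candidate memory)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t    (memory cell)
h_t = o_t \odot \tanh(c_t)    (hidden state)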
Long Short-Term Memory
Sigmoid: values in [0,1] & smooth ⇒ ideal for gates (i.e. turn-on / turn-off)
Tanh: values in [-1,1] & zero-centered ⇒ balanced activations
LSTM Solving the Vanishing Gradient Problem

More stability thanks to the memory cell!
However, LSTM only attenuates the vanishing gradient effect; it does not suppress it.
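The reason the memory cell helps, in equation form (a simplified view that keeps only the direct path through the cell state; the full derivative has extra terms through the gates' dependence on h_{t-1}):

\frac{\partial c_t}{\partial c_{t-1}} \approx f_t

When the forget gate stays close to 1, gradients can flow back through many timesteps almost unchanged; since f_t is strictly below 1 in practice, the vanishing effect is attenuated rather than removed, which is exactly the caveat above.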
RECAP TIME !
What’s the story so far?

Gated Recurrent Unit
¡ My Q&A Time !

Seminal Work: Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", EMNLP, 2014.
Gated Recurrent Unit

What are the structural differences between LSTM and GRU?

Gated Recurrent Unit

● From 3 to 2 gates ⇒ fewer parameters
● No separate cell state ⇒ merged memory (cell and hidden state combined)
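For reference, the GRU update equations (shown as figures on the following slides; notation assumed, chosen to match the Q&A below, where an update gate of all 1's keeps the old state):

r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r)    (reset gate)
z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z)    (update gate)
\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h)    (candidate state)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t    (hidden state)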
Gated Recurrent Unit

How many trainable matrices does a GRU have?

6 matrices (3 fully-connected blocks, each taking 2 inputs: x_t and h_{t-1})
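A quick worked example of the resulting parameter count (the sizes are illustrative, not from the slides): with input dimension d and hidden dimension h, each of the 3 fully-connected blocks has one h \times d matrix and one h \times h matrix, i.e.

3\,(hd + h^2) weights (plus 3h biases)

For d = 100 and h = 128: 3 \times (12800 + 16384) = 87552 weights.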
Gated Recurrent Unit

What type of model would we have if the reset gate is all 0's? or all 1's?
● If all 0's, the previous hidden state is dropped from the candidate: we get an MLP
● If all 1's, we get a (vanilla) RNN
Gated Recurrent Unit

What if the update gate is all 0's? or all 1's?
● If all 1's, the new state is simply the old state (the current input is ignored)
● If all 0's, the new state is the candidate state
LSTM vs GRU

LSTM
● Independent memory cell state for storing information ⇒ long-term dependencies
● LSTM controls the memory cell:
○ Remove from the cell (forget gate)
○ Add to the cell (input gate)
○ Extract from the cell (output gate)

GRU
● Combined memory cell and hidden state ⇒ fewer parameters
● GRU controls the hidden state:
○ Add new information (update gate)
○ Remove old information (reset gate)
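A quick way to see the "fewer parameters" point in practice, using PyTorch's built-in recurrent layers (the layer sizes are illustrative assumptions):

import torch.nn as nn

d, h = 100, 128  # illustrative input and hidden sizes
lstm = nn.LSTM(input_size=d, hidden_size=h)
gru = nn.GRU(input_size=d, hidden_size=h)

def n_params(module):
    # Total number of trainable scalars in the module.
    return sum(p.numel() for p in module.parameters())

# LSTM uses 4 gated blocks per step, GRU only 3,
# so the LSTM has roughly 4/3 as many parameters.
print("LSTM:", n_params(lstm), "GRU:", n_params(gru))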
LSTM vs GRU: Autocomplete Task

Visualising long-term contextual understanding:
https://distill.pub/2019/memorization-in-rnns/#ar-connectivity-nlstm
Course: Advanced Natural Language Processing

Beyond RNN
LSTM & GRU
Any Questions ?

University Year: 2024/2025
