
RNN & LSTM

Nguyen Van Vinh


Computer Science Department, UET,
VNU Ha Noi
How can we model sequences using neural networks?

 Which architecture? A class of neural networks designed to model sequences, allowing variable-length inputs to be handled
 Crucial in NLP problems (unlike with images) because sentences/paragraphs are variable-length, sequential inputs
Content
 Recurrent Neural Network
 The vanishing/exploding gradient problem
 LSTM
 Applications for LSTM
Sequence Data

 Time Series Data


 Natural Language

Data Source: https://dl.acm.org/doi/10.1145/2370216.2370438

Why not standard NN?
A standard feed-forward network takes fixed-size inputs and outputs and does not share parameters across time steps, so it cannot handle variable-length sequences directly.
What is RNN?
• We consider a class of recurrent networks referred to as Elman networks (Elman, 1990).
• A recurrent neural network (RNN) is a type of artificial neural network used for sequential or time-series data.

Applications:
+ Language translation.
+ Natural language processing (NLP).
+ Speech recognition.
+ Image captioning.

Recurrent Neural Networks (RNN)
 A family of neural architectures
 Apply the same weights 𝑊 repeatedly
Types of RNN

Recurrent Neural Network Cell

[Figure: the cell takes the previous hidden state ℎ0 and the input 𝑥1 and produces the new hidden state ℎ1 and the output 𝑦1]

ℎ1 = tanh(𝑊ℎℎ ℎ0 + 𝑊ℎ𝑥 𝑥1)
𝑦1 = softmax(𝑊ℎ𝑦 ℎ1)

Worked example with vocabulary a b c d e and the input character "c" as a one-hot vector:

ℎ0 = [0, 0, 0, 0, 0]
𝑥1 = [0, 0, 1, 0, 0]
ℎ1 = [0.1, 0.2, 0, −0.3, −0.1]
𝑦1 = [0.1, 0.05, 0.05, 0.1, 0.7] → the most likely next character is "e" (probability 0.7)
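To make the cell concrete, here is a minimal NumPy sketch of one step (not code from the lecture; the dimensions and random weights are made up for illustration):

```python
import numpy as np

def rnn_cell_step(h_prev, x_t, W_hh, W_hx, W_hy):
    """One step of a vanilla (Elman) RNN cell: new hidden state and output distribution."""
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)     # h1 = tanh(Whh h0 + Whx x1)
    logits = W_hy @ h_t
    y_t = np.exp(logits) / np.exp(logits).sum()   # y1 = softmax(Why h1)
    return h_t, y_t

# Toy setup: hidden size 5, vocabulary {a, b, c, d, e}, input "c" as a one-hot vector.
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(5, 5))
W_hx = rng.normal(scale=0.1, size=(5, 5))
W_hy = rng.normal(scale=0.1, size=(5, 5))
h0 = np.zeros(5)
x1 = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
h1, y1 = rnn_cell_step(h0, x1, W_hh, W_hx, W_hy)  # y1 is a distribution over {a, ..., e}
```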
(Unrolled) Recurrent Neural Network

[Figure: the cell is unrolled over three timesteps, applying the same 𝑅𝑁𝑁 weights at each step]

Character-level language model: inputs 𝑥1, 𝑥2, 𝑥3 = "c", "a", "t"; targets 𝑦1, 𝑦2, 𝑦3 = "a", "t", "<space>" (each step predicts the next character).
(Unrolled) Recurrent Neural Network

Word-level language model: inputs 𝑥1, 𝑥2, 𝑥3 = "the", "cat", "likes"; targets 𝑦1, 𝑦2, 𝑦3 = "cat", "likes", "eating" (each step predicts the next word).
(Unrolled) Recurrent Neural Network

Many-to-one example: inputs 𝑥1, 𝑥2, 𝑥3 = "the", "cat", "likes"; only the final hidden state ℎ3 is used to produce a single output 𝑦, the positive/negative sentiment rating.
Bidirectional Recurrent Neural Network

[Figure: a forward RNN and a backward RNN each read the inputs "the", "cat", "wants"; their hidden states at every timestep are combined to produce the outputs 𝑦1, 𝑦2, 𝑦3]
Stacked Recurrent Neural Network

[Figure: two RNN layers are stacked; the first layer reads the inputs "c", "a", "t" and its hidden states are fed as inputs to the second layer, whose hidden states produce the outputs 𝑦1, 𝑦2, 𝑦3]
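To connect the unrolled pictures above to code, here is a small sketch (NumPy, with illustrative dimensions; not the lecture's code) that runs one cell over a whole sequence with shared weights and uses only the last hidden state for a many-to-one prediction such as sentiment:

```python
import numpy as np

def rnn_forward(xs, h0, W_hh, W_hx):
    """Unroll a vanilla RNN over a sequence, reusing the same weights at every step."""
    h, hs = h0, []
    for x_t in xs:                       # xs: list of input vectors x_1 ... x_T
        h = np.tanh(W_hh @ h + W_hx @ x_t)
        hs.append(h)
    return hs                            # all hidden states h_1 ... h_T

# Many-to-one use: feed only the last hidden state to a classifier head.
rng = np.random.default_rng(0)
d_h, d_x = 8, 4
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hx = rng.normal(scale=0.1, size=(d_h, d_x))
w_out = rng.normal(scale=0.1, size=d_h)

xs = [rng.normal(size=d_x) for _ in range(3)]        # stand-ins for "the", "cat", "likes"
hs = rnn_forward(xs, np.zeros(d_h), W_hh, W_hx)
sentiment_score = 1 / (1 + np.exp(-w_out @ hs[-1]))  # sigmoid of the last hidden state
```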
Training an RNN Language Model
Backpropagation for RNNs
Multivariable Chain Rule

Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version
Backpropagation for RNNs

In practice, backpropagation through time is often "truncated" after ~20 timesteps for training-efficiency reasons.
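One common way to implement this truncation (a sketch assuming PyTorch; the model, dimensions, and data below are placeholders, not from the slides) is to detach the hidden state between chunks so that gradients only flow through roughly 20 steps:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=5, hidden_size=16, batch_first=True)
head = nn.Linear(16, 5)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

seq = torch.randn(1, 100, 5)               # stand-in for one long input sequence
targets = torch.randint(0, 5, (1, 100))    # stand-in next-token targets
h = torch.zeros(1, 1, 16)

chunk = 20                                  # truncate BPTT after ~20 timesteps
for start in range(0, seq.size(1), chunk):
    x = seq[:, start:start + chunk]
    y = targets[:, start:start + chunk]
    h = h.detach()                          # cut the graph: no gradient flows past this chunk
    out, h = rnn(x, h)
    loss = loss_fn(head(out).reshape(-1, 5), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```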
Backpropagation through time

If timesteps k and t are far apart, the gradient flowing from step t back to step k can grow or shrink exponentially (the gradient exploding or gradient vanishing problem)
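A tiny numerical illustration of why this happens (my own example, not from the slides): ignoring the tanh derivative, which is at most 1 and only makes shrinking worse, the gradient between two steps that are T apart contains the recurrent matrix 𝑊ℎℎ multiplied T times.

```python
import numpy as np

# Repeated multiplication by the recurrent weight matrix makes the gradient factor
# exponential in the distance between the two timesteps.
for scale in (0.5, 2.0):
    W_hh = scale * np.eye(5)                    # toy weights with spectral radius = scale
    factor = np.linalg.matrix_power(W_hh, 30)   # 30 timesteps apart
    print(f"spectral radius {scale}: ||W_hh^30|| = {np.linalg.norm(factor, 2):.2e}")
# spectral radius 0.5 -> ~9e-10 (vanishing); 2.0 -> ~1e+09 (exploding)
```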
Why is vanishing gradient a problem?
The vanishing gradient problem for language models
 Example (RNN-LM task):
Jane walked into the room. John walked in too. It was late in the day.
Jane said hi to ____
 To learn from this training example, the RNN-LM needs to model the
dependency between “John” on the 7th step and the target word “John”
at the end.
 But if the gradient is small, the model can’t learn this dependency
 So, the model is unable to predict similar long-distance dependencies
at test time
Vanishing/Exploding Solutions
 Vanishing Gradient:
 Gating mechanism (LSTM, GRU)
 Attention mechanism (Transformer)
 Adding skip connections through time (residual connections; a sketch follows after this list)
 Better Initialization
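As one illustration of the skip-connection idea from the list above (a sketch, not from the slides; the weight names are mine), the recurrence can add the previous hidden state back in, giving the gradient a short additive path through time:

```python
import numpy as np

def residual_rnn_step(h_prev, x_t, W_hh, W_hx):
    """RNN step with a residual (skip) connection through time: gradients can flow
    through the identity term h_prev as well as through the tanh branch."""
    return h_prev + np.tanh(W_hh @ h_prev + W_hx @ x_t)
```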
Long Short-Term Memory (LSTM) - 1997

LSTM: Hochreiter & Schmidhuber, 1997, https://deeplearning.cs.cmu.edu/F23/document/readings/LSTM.pdf


Architecture of LSTM cell

[Figure: the LSTM cell with its forget gate, input gate, candidate cell state, and output gate, built up step by step across the original slides]

• Conclusion (the four steps are sketched in code below):

- Step 1: Forget gate layer decides what to discard from the cell state.

- Step 2: Input gate layer decides which new candidate values to write.

- Step 3: Combine steps 1 & 2 to update the cell state.

- Step 4: Output gate emits a filtered version of the cell state as the new hidden state.

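The four steps can be written out directly. This is a minimal NumPy sketch using the standard LSTM equations with weights acting on the concatenation [h_prev; x_t]; the names and shapes are mine, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step following the four stages above."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)            # Step 1: forget gate
    i_t = sigmoid(W_i @ z + b_i)            # Step 2: input gate
    c_tilde = np.tanh(W_c @ z + b_c)        #         candidate cell values
    c_t = f_t * c_prev + i_t * c_tilde      # Step 3: additive update of the cell state
    o_t = sigmoid(W_o @ z + b_o)            # Step 4: output gate
    h_t = o_t * np.tanh(c_t)                #         new hidden state
    return h_t, c_t
```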
How does the LSTM solve the vanishing gradient problem?

- The LSTM architecture makes it easier for the network to preserve information over many timesteps, because the cell state is updated additively rather than being repeatedly multiplied by a weight matrix.
- The LSTM does not guarantee that vanishing or exploding gradients never occur.
- It does, however, give the model a much easier way to learn long-distance dependencies.

LSTM Variations (GRU)
● Gated Recurrent Unit (GRU) (Kyunghyun Cho et al., 2014)
- Combines the forget and input gates into a single "update gate"
- Merges the cell state and the hidden state
- Simpler, with fewer parameters (a sketch follows below).
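For comparison, a corresponding GRU step (again a sketch with my own notation, biases omitted) shows the merged update gate and the absence of a separate cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step (Cho et al., 2014); weights act on the concatenation [h_prev; x_t]."""
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))            # update gate (merges forget + input)
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                     # hidden state doubles as memory
```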

Compare LSTM vs. GRU

- GRUs train faster and perform better than LSTMs on less training
data if you are doing language modeling (not sure about other
tasks).
- GRUs are simpler and thus easier to modify, for example adding
new gates in case of additional input to the network. It's just less
code in general.
- LSTMs should in theory remember longer sequences than GRUs
and outperform them in tasks requiring modeling long-distance
relations.

Successful Applications of LSTMs
 Speech recognition: Language and acoustic modeling
 Sequence labeling
 POS Tagging
https://www.aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
 NER
 Phrase Chunking
 Neural syntactic and semantic parsing
 Image captioning: CNN output vector to sequence
 Sequence to Sequence
 Machine Translation (Sutskever, Vinyals, & Le, 2014)
 Video Captioning (input sequence of CNN frame outputs)

Summary
 Recurrent neural networks are one of the most important families of deep models for NLP
 LSTMs and GRUs are the most important and powerful extensions of the basic RNN
Homework

 RNN & LSTM for sentiment analysis


 IMDB corpus: the IMDB movie-review dataset contains 50,000 reviews, half positive and half negative
 Compare the results with previous methods (SVM, logistic regression); a starter sketch follows below
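As a possible starting point for the homework (a sketch assuming TensorFlow/Keras is available; the hyperparameters are illustrative, not prescribed by the course):

```python
import tensorflow as tf

# Load the 50,000-review IMDB dataset that ships with Keras, keeping the 10,000
# most frequent words, and pad/truncate every review to 200 tokens.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=10000)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=200)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=200)

# Embedding -> LSTM -> sigmoid output for binary sentiment classification.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=64, validation_split=0.1)
print("test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])
```

Swapping tf.keras.layers.LSTM for tf.keras.layers.SimpleRNN gives the plain-RNN baseline for the comparison.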
References
 Speech and Language Processing (3rd ed. draft), chapter 9
 Slides from the Stanford NLP course and other documents
Questions and Discussion!
