NLP Lecture 6
CS3126
Week-6
Recurrent Neural Networks (RNNs) and LSTM
Recap
• NLP
• Applications
• Regular expressions
• Tokenization
• Stemming
• Porter Stemmer
• Lemmatization
• Normalization
• Stopwords
• Bag-of-Words
• TF-IDF
• NER
• POS tagging
• Semantics, Distributional semantics, Word2vec
• Language models
• Neural Networks and Neural language modeling
Last Lecture
• Neural Networks
Sequential Data
Sometimes the sequence of data matters
• Text generation
• Stock price prediction
• Machine translation
• Speech recognition
Sentence
The clouds are in the .... ?
SKY
Sequence data
• The clouds are in the .... ?
• SKY
• Simple solution: N-grams?
o Hard to represent patterns with more than a few words (the number of possible patterns increases exponentially)
Where is the sequence in language?
Spoken language is a sequence of acoustic events over time
• Flow of conversations
• News feeds
• Twitter streams
Motivation
• Not all problems can be converted into ones with fixed-length inputs and outputs
Another Motivation
Recall that we made a Markov assumption:
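In standard n-gram notation (a sketch in LaTeX, not the slide's own equation), the Markov assumption referred to here is:

    P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})

That is, only the previous n-1 words are used as context; this fixed window is exactly the restriction that recurrent models are meant to remove.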
Motivation: Machine Translation
Consider the problem of machine translation:
– Input is text in one language
– Output is text from another language with the same meaning
Difference/Problems
Time will explain.
Recurrent Neural Networks (RNN)
• Any network that contains a cycle within its network connections,
meaning that the value of some unit is directly, or
indirectly, dependent on its own earlier outputs as an input.
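A minimal NumPy sketch of this recurrence (variable names such as W_x and W_h are illustrative, not taken from the slides): the same weight matrices are applied at every time step, and each hidden state feeds back into the computation of the next one.

    import numpy as np

    def rnn_forward(inputs, W_x, W_h, b, h0):
        # inputs: sequence of input vectors x_t, each of shape (d_in,)
        # W_x: input-to-hidden weights (d_h, d_in)
        # W_h: hidden-to-hidden weights (d_h, d_h), reused at every step
        # b:   bias (d_h,);  h0: initial hidden state (d_h,)
        h = h0
        states = []
        for x in inputs:
            # h_t = tanh(W_x x_t + W_h h_{t-1} + b): the "cycle" in the network
            h = np.tanh(W_x @ x + W_h @ h + b)
            states.append(h)
        return states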
Idea: Apply same weights repeatedly
A simple RNN Language Model
RNN Language Model
Credits: Slide adapted from [3]
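A compact PyTorch sketch of such a language model (layer sizes and names are illustrative assumptions): tokens are embedded, passed through the recurrent layer, and each hidden state is projected to a distribution over the vocabulary to predict the next word.

    import torch
    import torch.nn as nn

    class RNNLanguageModel(nn.Module):
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)            # token id -> vector
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # same weights at every step
            self.out = nn.Linear(hidden_dim, vocab_size)                # hidden state -> next-word logits

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) tensor of word indices
            e = self.embed(token_ids)   # (batch, seq_len, embed_dim)
            h, _ = self.rnn(e)          # (batch, seq_len, hidden_dim)
            return self.out(h)          # (batch, seq_len, vocab_size)

Training would minimize cross-entropy between these logits and the actual next token at each position.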
A simple RNN Language Model
Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version
Issues with RNN: Vanishing and Exploding Gradients
Chain Rule!!!!!!
Credits: Slide adapted from [3]
Vanishing Gradient Intuition
Chain Rule!!!!!!
Credits: Slide adapted from [3]
Vanishing Gradient Intuition
Source: “On the difficulty of training recurrent neural networks”, Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
(and supplemental materials), at http://proceedings.mlr.press/v28/pascanu13-supp.pdf
Credits: Slide adapted from [3]
Vanishing gradient proof sketch
Source: “On the difficulty of training recurrent neural networks”, Pascanu et al., 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
(and supplemental materials), at http://proceedings.mlr.press/v28/pascanu13-supp.pdf
Credits: Slide adapted from [3]
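In outline (following the Pascanu et al. analysis, with h_t = \sigma(W_h h_{t-1} + W_x x_t) and elementwise nonlinearity \sigma), backpropagating through k steps introduces the Jacobian product

    \frac{\partial h_t}{\partial h_{t-k}}
      = \prod_{j=t-k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}
      = \prod_{j=t-k+1}^{t} \operatorname{diag}\big(\sigma'(z_j)\big)\, W_h

whose norm is bounded by (\gamma \|W_h\|)^k, where \gamma bounds |\sigma'|. If \gamma \|W_h\| < 1 the gradient shrinks exponentially with the distance k (vanishing); if it is greater than 1 it can grow exponentially (exploding).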
Why is vanishing gradient a problem?
Effect of vanishing gradient on RNN
• LM task: When she tried to print her tickets, she found that the printer was
out of toner. She went to the stationery store to buy more toner. It was very
overpriced. After installing the toner into the printer, she finally printed her
________
• To learn from this training example, the RNN-LM needs to model the
dependency between “tickets” on the 7th step and the target word
“tickets” at the end.
• But if the gradient is small, the model can’t learn this dependency
• So, the model is unable to predict similar long-distance dependencies at
test time
• In practice, a simple RNN will only condition on ~7 tokens back [vague rule-of-thumb]
Source: “On the difficulty of training recurrent neural networks”, Pascanu et al., 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
Is vanishing gradient only an RNN problem?
• No! It can be a problem for all neural architectures (including feed-forward and convolutional), especially very deep ones.
• Due to the chain rule, the gradient signal becomes vanishingly small as it backpropagates through many layers
• Thus, lower layers are learned very slowly (i.e., are hard to train); see the toy sketch below
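A toy numeric sketch of the same effect in a deep feed-forward stack (assuming sigmoid activations and unit weights purely for illustration): the gradient reaching the first layer is a product of per-layer factors, each at most 0.25 for the sigmoid, so it shrinks roughly geometrically with depth.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer_factor(z, w=1.0):
        # Gradient factor contributed by one sigmoid unit: w * sigmoid'(z) <= 0.25 * |w|
        s = sigmoid(z)
        return w * s * (1.0 - s)

    grad = 1.0
    for _ in range(20):                # 20 layers deep
        grad *= layer_factor(z=0.0)    # z = 0 is the *best* case for the sigmoid (factor 0.25)
    print(grad)                        # ~9e-13: almost no learning signal reaches the bottom layers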
RNN improves perplexity
LSTMs (Long Short-Term Memory)
Long Short-Term Memory, Hochreiter & Schmidhuber, 1997
Do LSTMs solve the Vanishing Gradient Problem?
• The LSTM architecture makes it much easier for an RNN to preserve
information over many timesteps
• If the forget gate is set to 1 for a cell dimension and the input gate set to 0, then the information in that cell is preserved indefinitely (see the sketch after this list)
• In contrast, it’s harder for a vanilla RNN to learn a recurrent weight matrix
Wh that preserves info in the hidden state
• In practice, you get about 100 timesteps rather than about 7
• However, there are alternative ways of creating more direct and linear pass-
through connections in models for long distance dependencies
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
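A minimal NumPy sketch of one LSTM step (parameter names and the stacked-gate layout are illustrative assumptions), which makes the gating argument above concrete: with f ≈ 1 and i ≈ 0, the cell state is copied forward essentially unchanged.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # W: (4*d, d_in), U: (4*d, d), b: (4*d,) holding forget/input/output/candidate blocks
        d = h_prev.shape[0]
        z = W @ x + U @ h_prev + b
        f = sigmoid(z[0:d])        # forget gate
        i = sigmoid(z[d:2*d])      # input gate
        o = sigmoid(z[2*d:3*d])    # output gate
        g = np.tanh(z[3*d:4*d])    # candidate cell content
        c = f * c_prev + i * g     # f ~ 1 and i ~ 0  =>  c ~ c_prev: information preserved
        h = o * np.tanh(c)         # hidden state read out from the cell
        return h, c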
LSTM detailed visualization
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
History of Neural models in NLP
Sequence-to-Sequence Learning
Image Reference: Speech and Language Processing by Daniel Jurafsky and James H. Martin
https://arxiv.org/pdf/1409.3215
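A condensed PyTorch sketch of the encoder–decoder setup from Sutskever et al. (sizes and names are illustrative assumptions; no attention): the encoder LSTM compresses the source sentence into its final hidden and cell states, which initialize the decoder LSTM that predicts the target sentence token by token.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, embed_dim=256, hidden_dim=512):
            super().__init__()
            self.src_embed = nn.Embedding(src_vocab, embed_dim)
            self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
            self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            # Encode the source sentence; keep only its final (hidden, cell) states
            _, state = self.encoder(self.src_embed(src_ids))
            # Decode the target sentence conditioned on the encoder's summary state
            dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
            return self.out(dec_out)    # (batch, tgt_len, tgt_vocab) next-token logits

At training time tgt_ids would be the gold target shifted by one position (teacher forcing); at test time the decoder feeds its own predictions back in.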
References
[1] https://www.cs.ubc.ca/~dsuth/440/23w2/slides/9-rnn.pdf
[2] https://slazebni.cs.illinois.edu/spring17/lec02_rnn.pdf
[3] https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture06-fancy-rnn.pdf
[4] https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Reference materials
• https://vlanc-lab.github.io/mu-nlp-course/
• Lecture notes