
Foundations of NLP

CS3126

Week-6
Recurrent Neural Networks (RNNs) and LSTM
Recap
• NLP
• Applications
• Regular expressions
• Tokenization
• Stemming
• Porter Stemmer
• Lemmatization
• Normalization
• Stopwords
• Bag-of-Words
• TF-IDF
• NER
• POS tagging
• Semantics, Distributional semantics, Word2vec
• Language models
• Neural Networks and Neural language modeling

2
Last Lecture
• Neural Networks

• Feed-forward Neural Networks

• Neural language models

3
Sequential Data
Sometimes the sequence of data matters
• Text generation
• Stock price prediction
• Machine translation
• Speech recognition

Image Source: https://www.analyticsvidhya.com/blog/2019/07/openai-gpt2-text-generator-python/ 4


Sequence data
• The clouds are in the .... ?
• SKY
• Simple solution: N-grams?
  o Hard to represent patterns spanning more than a few words (the number of possible patterns increases exponentially with the context length)

• Simple solution: Feed-forward neural networks?
  o Fixed input/output size and a fixed number of steps

9
Where is sequence in language?
Spoken language is a sequence of acoustic events over time

The temporal nature of language is reflected in the metaphors we use to describe it:

• Flow of conversations
• News feeds
• Twitter streams

10
Motivation
• Not all problems can be converted into ones with fixed-length inputs and outputs

11
Another Motivation
Recall that we made a Markov assumption:

p(w_i | w_1, ..., w_{i-1}) = p(w_i | w_{i-3}, w_{i-2}, w_{i-1})

This means the model is memoryless, i.e., it has no memory of anything before the last few words.
Problem:
But sometimes long-distance context can be important:
Rob Ford told the flabbergasted reporters assembled at the press
conference that ________.

12
Motivation: Machine Translation
Consider the problem of machine translation:
– Input is text in one language
– Output is text in another language with the same meaning

13
Difference/Problems

A key difference with labeling:

• Input and output sequences may have different lengths and different word orders
• We do not just “find the Telugu word corresponding to the English word”
• We probably don’t know the output length

14
Time will explain.

Jane Austen, Persuasion


15
16
Finding structure in time

17
Recurrent Neural Networks (RNN)
• Any network that contains a cycle within its network connections,
meaning that the value of some unit is directly, or
indirectly, dependent on its own earlier outputs as an input.

Image source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 18
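A sketch of the recurrence the cycle represents (notation roughly follows Jurafsky and Martin; g and f are nonlinearities such as tanh and softmax, and the same weight matrices are reused at every time step):

h_t = g(U h_{t-1} + W x_t)
y_t = f(V h_t)

where x_t is the input at time t, h_t the hidden state, and U, W, V the recurrent, input, and output weight matrices.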


Recurrent Neural Networks (RNN)

Image source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ 19


Recurrent Neural Networks
• Have memory that keeps
track of
information observed so far
• Maps from the entire history
of previous inputs to each
output
• Handle sequential data

20
Idea: Apply same weights repeatedly

21
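To make "apply the same weights repeatedly" concrete, here is a minimal NumPy sketch (not from the lecture) that unrolls a simple Elman-style RNN over a sequence; the names W_xh, W_hh, b_h and the dimensions are illustrative assumptions:

import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    # xs: (T, d_in) inputs; W_xh: (d_h, d_in); W_hh: (d_h, d_h); b_h: (d_h,)
    # The same W_xh, W_hh, b_h are applied at every time step.
    h = np.zeros(W_hh.shape[0])                   # initial hidden state h_0
    hs = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
        hs.append(h)
    return np.stack(hs)                           # hidden states for all T steps

# Tiny usage example with random weights
rng = np.random.default_rng(0)
T, d_in, d_h = 5, 4, 3
hs = rnn_forward(rng.normal(size=(T, d_in)),
                 0.1 * rng.normal(size=(d_h, d_in)),
                 0.1 * rng.normal(size=(d_h, d_h)),
                 np.zeros(d_h))
print(hs.shape)                                   # (5, 3)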
A simple RNN Language Model

22
RNN Language Model

Credits: Slide adapted from [3] 23


RNN Language Models

24
Credits: Slide adapted from [3]
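A hedged reconstruction of the standard RNN language model equations from [3]: at each step the current word is embedded, the hidden state is updated with the shared weights, and a distribution over the vocabulary is produced:

e_t = E x_t
h_t = \sigma(W_h h_{t-1} + W_e e_t + b_1)
\hat{y}_t = \mathrm{softmax}(U h_t + b_2)

where x_t is the one-hot vector for the current word, E the embedding matrix, and \hat{y}_t the predicted distribution over the next word.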


Training an RNN language model
• However: computing the loss and gradients across the entire corpus {x_1, x_2, ..., x_T} at once is too expensive (memory-wise)!

• In practice, we treat {x_1, ..., x_T} as a sentence (or a document)

• Recall: Stochastic Gradient Descent allows us to compute the loss and gradients for a small chunk of data, and update.
• Compute the loss for a sentence (actually, a batch of sentences), compute gradients, and update the weights. Repeat on a new batch of sentences.

Credits: Slide adapted from [3] 32
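As a sketch of the objective described above (following [3]), the loss at step t is the cross-entropy between the predicted distribution \hat{y}_t and the actual next word x_{t+1}, averaged over the sequence (in practice, over a batch of sentences):

J_t(\theta) = -\log \hat{y}_t[x_{t+1}]
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J_t(\theta)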


Multivariable Chain Rule

Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version
33
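For reference, the simple version of the rule from the linked note, which backpropagation through time applies once per time step: for f(x, y) with x = x(t) and y = y(t),

\frac{d}{dt} f\big(x(t), y(t)\big) = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}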
Problems with RNNs: Vanishing and Exploding Gradients

Source: On the difficulty of training recurrent neural networks, Pascanu et al., 2013
Credits: Slide adapted from [3]


Vanishing Gradient Intuition

Chain rule!

Credits: Slide adapted from [3]


Vanishing gradient proof sketch

Source: “On the difficulty of training recurrent neural networks”, Pascanu et al., 2013. http://proceedings.mlr.press/v28/pascanu13.pdf (supplemental materials: http://proceedings.mlr.press/v28/pascanu13-supp.pdf)
Credits: Slide adapted from [3]
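As a hedged summary of the argument in Pascanu et al. (2013), with h_j = \sigma(W_h h_{j-1} + W_x x_j + b): the gradient of a later hidden state with respect to an earlier one is a product of per-step Jacobians,

\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W_h^{\top} \, \mathrm{diag}\big(\sigma'(W_h h_{j-1} + W_x x_j + b)\big)

so \left\| \frac{\partial h_t}{\partial h_k} \right\| \le \big(\gamma \|W_h\|\big)^{t-k}, where \gamma bounds |\sigma'|. If \gamma \|W_h\| < 1 this factor shrinks exponentially in the gap t - k (vanishing gradients); if the repeated factor exceeds 1, it can grow exponentially instead (exploding gradients).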
Why is vanishing gradient a problem?

43
Effect of vanishing gradient on RNN
• LM task: When she tried to print her tickets, she found that the printer was
out of toner. She went to the stationery store to buy more toner. It was very
overpriced. After installing the toner into the printer, she finally printed her
________
• To learn from this training example, the RNN-LM needs to model the
dependency between “tickets” on the 7th step and the target word
“tickets” at the end.
• But if the gradient is small, the model can’t learn this dependency
• So, the model is unable to predict similar long-distance dependencies at
test time
• In practice a simple RNN will only condition ~7 tokens back [vague rule-of-
thumb]

Credits: Slide adapted from [3] 44


Gradient Clipping: A Solution for Exploding Gradients

Source: “On the difficulty of training recurrent neural networks”, Pascanu et al., 2013. http://proceedings.mlr.press/v28/pascanu13.pdf 45
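A minimal sketch of clipping by the global gradient norm, in the spirit of Pascanu et al. (2013); the threshold of 5.0 is an arbitrary placeholder, not a value from the lecture:

import numpy as np

def clip_by_global_norm(grads, threshold=5.0):
    # grads: list of gradient arrays for all parameters
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm      # shrink the magnitude, keep the direction
        grads = [g * scale for g in grads]
    return grads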
Is vanishing gradient only an RNN problem?
• No! It can be a problem for all neural architectures (including feed-forward and convolutional), especially very deep ones.

• Due to the chain rule and the choice of nonlinearity, the gradient can become vanishingly small as it backpropagates through many layers.

• Thus, lower layers are learned very slowly (i.e., they are hard to train).

Credits: Slide adapted from [3]

46
RNN improves perplexity

47
LSTMs (Long Short-Term Memory)
Long Short-Term Memory, Hochreiter & Schmidhuber, 1997
How does the LSTM solve the vanishing gradient problem?
• The LSTM architecture makes it much easier for an RNN to preserve
information over many timesteps
• If the forget gate is set to 1 for a cell dimension and the input gate set to 0,
then the information of that cell is preserved indefinitely.
• In contrast, it’s harder for a vanilla RNN to learn a recurrent weight matrix
Wh that preserves info in the hidden state
• In practice, you get about 100 timesteps rather than about 7

• However, there are alternative ways of creating more direct and linear pass-
through connections in models for long distance dependencies

Credits: Slide adapted from [3] 49


LSTM Equations

https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 50
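The equations themselves are in the linked post (Olah, 2015); restated here in that post's notation, where \sigma is the logistic sigmoid, \odot is elementwise multiplication, and [h_{t-1}, x_t] is concatenation:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)     (candidate cell state)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   (cell state update)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(C_t)                        (hidden state)

The additive update of C_t is what lets gradients flow across many time steps when f_t stays close to 1.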
LSTM detailed visualization
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
History of Neural models in NLP

Image source: https://www.ruder.io/a-review-of-the-recent-history-of-nlp/ 52


Different variants of RNN
• Stacked RNNs
• Bi-directional RNNs
• Many more (a minimal sketch of a stacked, bidirectional LSTM follows below)

53
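As one concrete (hypothetical) illustration of these variants, the sketch below uses PyTorch's nn.LSTM, which supports stacking via num_layers and a bidirectional flag; this is not code from the lecture:

import torch
import torch.nn as nn

# A 2-layer (stacked), bidirectional LSTM over a batch of embedded token sequences
lstm = nn.LSTM(input_size=100, hidden_size=64,
               num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(8, 20, 100)        # (batch, sequence length, embedding dim)
outputs, (h_n, c_n) = lstm(x)      # outputs: (8, 20, 128) -- 64 per direction
print(outputs.shape)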
Sequence-to-Sequence learning

Image Reference: Speech and Language Processing by Daniel Jurafsky and James H. Martin 54
https://arxiv.org/pdf/1409.3215 55
References
[1] https://www.cs.ubc.ca/~dsuth/440/23w2/slides/9-rnn.pdf
[2] https://slazebni.cs.illinois.edu/spring17/lec02_rnn.pdf
[3] https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture06-fancy-rnn.pdf
[4] https://colah.github.io/posts/2015-08-Understanding-LSTMs/

53
Reference materials

• https://vlanc-lab.github.io/mu-nlp-course/

• Lecture notes

• (A) Speech and Language Processing by Daniel Jurafsky and James H. Martin
• (B) Natural Language Processing with Python (updated edition based on Python 3 and NLTK 3), Steven Bird et al., O’Reilly Media

54
