
Foundations of NLP

CS3126

Week-6
Recurrent Neural Networks (RNNs) and LSTM
Recap
• NLP
• Applications
• Regular expressions
• Tokenization
• Stemming
• Porter Stemmer
• Lemmatization
• Normalization
• Stopwords
• Bag-of-Words
• TF-IDF
• NER
• POS tagging
• Semantics, Distributional semantics, Word2vec
• Language models
• Neural Networks and Neural language modeling

2
Last Lecture
• Neural Networks

• Feed-forward Neural Networks

• Neural language models

3
Sequential Data
Sometimes the sequence of data matters
• Text generation
• Stock price prediction
• Machine translation
• Speech recognition

Image Source: https://www.analyticsvidhya.com/blog/2019/07/openai-gpt2-text-generator-python/ 4


Sequence data
• The clouds are in the .... ?
• SKY
• Simple solution: N-grams?
  o Hard to represent patterns spanning more than a few words (the number of possible patterns increases exponentially with the context length)

• Simple solution: Feed-forward neural networks?
  o Fixed input/output size and a fixed number of steps

9
Where is sequence in language?
Spoken language is a sequence of acoustic events over time

The temporal nature of language is reflected in the metaphors we use to describe it:

• Flow of conversations
• News feeds
• Twitter streams

10
Motivation
• Not all problems can be converted into ones with fixed-length inputs and outputs

11
Another Motivation
Recall that we made a Markov assumption:

p(w_i | w_1, ..., w_{i-1}) = p(w_i | w_{i-3}, w_{i-2}, w_{i-1})

This means the model is memoryless, i.e., it has no memory of anything before the last few words.
Problem:
But sometimes long-distance context can be important:
Rob Ford told the flabbergasted reporters assembled at the press
conference that ________.

12
Motivation: Machine Translation
Consider the problem of machine translation:
– Input is text in one language
– Output is text in another language with the same meaning

13
Difference/Problems

A key difference with labeling:

• Input and output sequences may have different lengths and different word orders
• We do not just “find the Telugu word corresponding to the English word”
• We probably don’t know the output length

14
Time will explain.

Jane Austen, Persuasion


15
16
Finding structure in time

17
Recurrent Neural Networks (RNN)
• Any network that contains a cycle within its network connections,
meaning that the value of some unit is directly, or
indirectly, dependent on its own earlier outputs as an input.

Image source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 18
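A sketch of the recurrence the cycle represents (notation roughly follows Jurafsky and Martin; g and f are nonlinearities such as tanh and softmax, and the same weight matrices are reused at every time step):

h_t = g(U h_{t-1} + W x_t)
y_t = f(V h_t)

where x_t is the input at time t, h_t the hidden state, and U, W, V the recurrent, input, and output weight matrices.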


Recurrent Neural Networks (RNN)

Image source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ 19


Recurrent Neural Networks
• Have memory that keeps
track of
information observed so far
• Maps from the entire history
of previous inputs to each
output
• Handle sequential data

20
Idea: Apply same weights repeatedly

21
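To make "apply the same weights repeatedly" concrete, here is a minimal NumPy sketch (not from the lecture) that unrolls a simple Elman-style RNN over a sequence; the names W_xh, W_hh, b_h and the dimensions are illustrative assumptions:

import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    # xs: (T, d_in) inputs; W_xh: (d_h, d_in); W_hh: (d_h, d_h); b_h: (d_h,)
    # The same W_xh, W_hh, b_h are applied at every time step.
    h = np.zeros(W_hh.shape[0])                   # initial hidden state h_0
    hs = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
        hs.append(h)
    return np.stack(hs)                           # hidden states for all T steps

# Tiny usage example with random weights
rng = np.random.default_rng(0)
T, d_in, d_h = 5, 4, 3
hs = rnn_forward(rng.normal(size=(T, d_in)),
                 0.1 * rng.normal(size=(d_h, d_in)),
                 0.1 * rng.normal(size=(d_h, d_h)),
                 np.zeros(d_h))
print(hs.shape)                                   # (5, 3)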
A simple RNN Language Model

22
RNN Language Model

Credits: Slide adapted from [3] 23


RNN Language Models

24
Credits: Slide adapted from [3]
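A hedged reconstruction of the standard RNN language model equations from [3]: at each step the current word is embedded, the hidden state is updated with the shared weights, and a distribution over the vocabulary is produced:

e_t = E x_t
h_t = \sigma(W_h h_{t-1} + W_e e_t + b_1)
\hat{y}_t = \mathrm{softmax}(U h_t + b_2)

where x_t is the one-hot vector for the current word, E the embedding matrix, and \hat{y}_t the predicted distribution over the next word.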


Training an RNN language model
• However: computing the loss and gradients across the entire corpus {x_1, x_2, ..., x_T} at once is too expensive (memory-wise)!

• In practice, we treat {x_1, ..., x_T} as a sentence (or a document)

• Recall: Stochastic Gradient Descent allows us to compute the loss and gradients for a small chunk of data, and update.
• Compute the loss for a sentence (actually, a batch of sentences), compute gradients, and update the weights. Repeat on a new batch of sentences.

Credits: Slide adapted from [3] 32
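As a sketch of the objective described above (following [3]), the loss at step t is the cross-entropy between the predicted distribution \hat{y}_t and the actual next word x_{t+1}, averaged over the sequence (in practice, over a batch of sentences):

J_t(\theta) = -\log \hat{y}_t[x_{t+1}]
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J_t(\theta)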


Multivariable Chain Rule

Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version
33
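For reference, the simple version of the rule from the linked note, which backpropagation through time applies once per time step: for f(x, y) with x = x(t) and y = y(t),

\frac{d}{dt} f\big(x(t), y(t)\big) = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}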
Problems with RNNs: Vanishing and Exploding Gradients

Source: On the difficulty of training recurrent neural networks, Pascanu et al., 2013
Credits: Slide adapted from [3]


Vanishing Gradient Intuition

Chain rule!

Credits: Slide adapted from [3]


Vanishing gradient proof sketch

Source: “On the difficulty of training recurrent neural networks”, Pascanu et al., 2013. http://proceedings.mlr.press/v28/pascanu13.pdf (supplemental materials: http://proceedings.mlr.press/v28/pascanu13-supp.pdf)
Credits: Slide adapted from [3]
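As a hedged summary of the argument in Pascanu et al. (2013), with h_j = \sigma(W_h h_{j-1} + W_x x_j + b): the gradient of a later hidden state with respect to an earlier one is a product of per-step Jacobians,

\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W_h^{\top} \, \mathrm{diag}\big(\sigma'(W_h h_{j-1} + W_x x_j + b)\big)

so \left\| \frac{\partial h_t}{\partial h_k} \right\| \le \big(\gamma \|W_h\|\big)^{t-k}, where \gamma bounds |\sigma'|. If \gamma \|W_h\| < 1 this factor shrinks exponentially in the gap t - k (vanishing gradients); if the repeated factor exceeds 1, it can grow exponentially instead (exploding gradients).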
Why is vanishing gradient a problem?

43
Effect of vanishing gradient on RNN
• LM task: When she tried to print her tickets, she found that the printer was
out of toner. She went to the stationery store to buy more toner. It was very
overpriced. After installing the toner into the printer, she finally printed her
________
• To learn from this training example, the RNN-LM needs to model the
dependency between “tickets” on the 7th step and the target word
“tickets” at the end.
• But if the gradient is small, the model can’t learn this dependency
• So, the model is unable to predict similar long-distance dependencies at
test time
• In practice a simple RNN will only condition ~7 tokens back [vague rule-of-
thumb]

Credits: Slide adapted from [3] 44


Gradient Clipping: A Solution for Exploding Gradients

Source: “On the difficulty of training recurrent neural networks”, Pascanu et al., 2013. http://proceedings.mlr.press/v28/pascanu13.pdf 45
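A minimal sketch of clipping by the global gradient norm, in the spirit of Pascanu et al. (2013); the threshold of 5.0 is an arbitrary placeholder, not a value from the lecture:

import numpy as np

def clip_by_global_norm(grads, threshold=5.0):
    # grads: list of gradient arrays for all parameters
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm      # shrink the magnitude, keep the direction
        grads = [g * scale for g in grads]
    return grads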
Is vanishing gradient only an RNN problem?
• No! It can be a problem for all neural architectures (including feed-forward and convolutional), especially very deep ones.

• Due to the chain rule and the choice of nonlinearity, the gradient can become vanishingly small as it backpropagates through many layers.

• Thus, lower layers are learned very slowly (i.e., they are hard to train).

Credits: Slide adapted from [3]

46
RNN improves perplexity

47
LSTMs (Long Short-Term Memory)
Long Short-Term Memory, Hochreiter & Schmidhuber, 1997
How does the LSTM solve the vanishing gradient problem?
• The LSTM architecture makes it much easier for an RNN to preserve
information over many timesteps
• If the forget gate is set to 1 for a cell dimension and the input gate set to 0,
then the information of that cell is preserved indefinitely.
• In contrast, it’s harder for a vanilla RNN to learn a recurrent weight matrix
Wh that preserves info in the hidden state
• In practice, you get about 100 timesteps rather than about 7

• However, there are alternative ways of creating more direct and linear pass-
through connections in models for long distance dependencies

Credits: Slide adapted from [3] 49


LSTM Equations

https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 50
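The equations themselves are in the linked post (Olah, 2015); restated here in that post's notation, where \sigma is the logistic sigmoid, \odot is elementwise multiplication, and [h_{t-1}, x_t] is concatenation:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)     (candidate cell state)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   (cell state update)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(C_t)                        (hidden state)

The additive update of C_t is what lets gradients flow across many time steps when f_t stays close to 1.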
LSTM detailed visualization
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
History of Neural models in NLP

Image source: https://www.ruder.io/a-review-of-the-recent-history-of-nlp/ 52


Different variants of RNN
• Stacked RNNs
• Bi-directional RNNs
• Many more (a minimal sketch of a stacked, bidirectional LSTM follows below)

53
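As one concrete (hypothetical) illustration of these variants, the sketch below uses PyTorch's nn.LSTM, which supports stacking via num_layers and a bidirectional flag; this is not code from the lecture:

import torch
import torch.nn as nn

# A 2-layer (stacked), bidirectional LSTM over a batch of embedded token sequences
lstm = nn.LSTM(input_size=100, hidden_size=64,
               num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(8, 20, 100)        # (batch, sequence length, embedding dim)
outputs, (h_n, c_n) = lstm(x)      # outputs: (8, 20, 128) -- 64 per direction
print(outputs.shape)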
Sequence-to-Sequence learning

Image Reference: Speech and Language Processing by Daniel Jurafsky and James H. Martin 54
https://arxiv.org/pdf/1409.3215 55
References
[1] https://www.cs.ubc.ca/~dsuth/440/23w2/slides/9-rnn.pdf
[2] https://slazebni.cs.illinois.edu/spring17/lec02_rnn.pdf
[3] https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture06-fancy-rnn.pdf
[4] https://colah.github.io/posts/2015-08-Understanding-LSTMs/

53
Reference materials

• https://vlanc-lab.github.io/mu-nlp-course/

• Lecture notes

• (A) Speech and Language Processing by Daniel Jurafsky and James H. Martin
• (B) Natural Language Processing with Python (updated edition based on Python 3 and NLTK 3), Steven Bird et al., O’Reilly Media

54
