
High Impact Skills Development Program

in Artificial Intelligence, Data Science, and Blockchain

Module 9: NLP & Sequential Models


Lecture 4: LSTM & GRU

Instructor: Ahsan Jalal

Review of Previous Lecture
• RNN
• Sequence Model
• Recurrence on hidden layer
• Limitations:
• Vanishing/exploding gradient problem
• Long-term dependencies
• Solutions:
• Activation function: ReLU
• Parameter initialization: weights set to the identity matrix, biases to 0
• Use more complex recurrent units with gates to control what information is passed through
Long Short-Term Memory
• LSTM networks add additional gating units to each memory cell:
• Forget gate
• Store (input) gate
• Update gate
• Output gate
• This mitigates the vanishing/exploding gradient problem and allows the network to retain state information over longer periods of time.

Standard RNN
Long Short-Term Memory (LSTM)
LSTM Network Architecture

Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM)
Forget Gate
• The forget gate computes a 0-1 value, using a logistic sigmoid output function of the input, x_t, and the previous hidden state, h_{t-1}:
• It is multiplicatively combined with the cell state, "forgetting" information wherever the gate outputs something close to 0.
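The equation itself appears only as a figure on the original slide; a standard formulation, consistent with the weight names (W_f, etc.) used later, is:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)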

Store/Input Gate
• First, determine which entries in the cell state to update by computing a 0-1 sigmoid output.
• Then, determine what amount to add or subtract from these entries by computing a tanh output (valued -1 to 1) as a function of the input and hidden state.
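Again the slide's equations are figures; the standard formulation is:

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)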

Cell State
• Maintains a vector C_t that has the same dimensionality as the hidden state, h_t.
• Information can be added to or deleted from this state vector via the forget and input gates.

Updating the Cell State
• The cell state is updated using component-wise vector multiplication to "forget" and vector addition to "input" new information.
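In the same standard notation, with \odot denoting component-wise multiplication:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t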

Cell State Example
• We want to remember the person & number of a subject noun so that it can be checked for agreement with the person & number of the verb when the verb is eventually encountered.
• The forget gate removes existing information about a prior subject when a new one is encountered.
• The input gate "adds" in the information for the new subject.

Output Gate
• The hidden state is updated based on a "filtered" version of the cell state, scaled to -1 to 1 using tanh.
• The output gate computes a sigmoid function of the input and the previous hidden state to determine which elements of the cell state to "output".
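In the same standard notation:

o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(C_t)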

Overall Network Architecture
Overall Network Architecture
• Single- or multilayer networks can compute LSTM inputs from problem inputs, and problem outputs from LSTM outputs.

[Diagram annotations:]
• O_t: e.g., a POS tag as a "one-hot" vector
• e.g., a word "embedding" with reduced dimensionality
• I_t: e.g., a word as a "one-hot" vector
LSTM Gradient Flow
LSTM Training
• Trainable with backpropagation-based optimizers such as:
• Stochastic gradient descent (randomize the order of examples in each epoch) with momentum (bias weight changes to continue in the same direction as the last update).
• The ADAM optimizer (Kingma & Ba, 2015); a training sketch follows below.
• Each cell has many parameters (W_f, W_i, W_C, W_o).
• Generally requires lots of training data.
• Requires lots of compute time, typically exploiting GPU clusters.
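A minimal PyTorch sketch of such a training setup; the framework and the names model, train_loader, loss_fn, and num_epochs are illustrative assumptions, not something prescribed by the slides:

import torch

# Assumes an LSTM-based `model`, a shuffled `train_loader`, and a suitable `loss_fn`.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Alternatively, the ADAM optimizer (Kingma & Ba, 2015):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for x_batch, y_batch in train_loader:   # example order randomized each epoch
        optimizer.zero_grad()
        loss = loss_fn(model(x_batch), y_batch)
        loss.backward()                     # backpropagation (through time)
        optimizer.step()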

General Problems Solved with LSTMs
• Sequence labeling
• Train with a supervised output at each time step, computed using a single- or multilayer network that maps the hidden state (h_t) to an output vector (O_t).
• Language modeling
• Train to predict the next input (O_t = I_{t+1}).
• Sequence (e.g. text) classification
• Train a single- or multilayer network that maps the final hidden state (h_n) to an output vector (O); a sketch of the labeling and classification cases follows below.
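A minimal PyTorch sketch of the sequence-labeling and sequence-classification cases; vocab_size, embed_size, hidden_size, num_tags, and num_classes are illustrative assumptions:

import torch.nn as nn

class LSTMTagger(nn.Module):
    """Sequence labeling: map the hidden state h_t to an output O_t at every step."""
    def __init__(self, vocab_size, embed_size, hidden_size, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_tags)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))      # h: (batch, seq_len, hidden_size)
        return self.head(h)                       # tag scores at every time step

class LSTMClassifier(nn.Module):
    """Sequence classification: map the final hidden state h_n to a single output O."""
    def __init__(self, vocab_size, embed_size, hidden_size, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, tokens):
        _, (h_n, _) = self.lstm(self.embed(tokens))   # h_n: (1, batch, hidden_size)
        return self.head(h_n[-1])                     # one label per sequence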

Sequence to Sequence Transduction (Mapping)
• The Encoder/Decoder framework maps one sequence to a "deep vector"; another LSTM then maps this vector to an output sequence:

[Diagram: I_1, I_2, ..., I_n -> Encoder LSTM -> h_n -> Decoder LSTM -> O_1, O_2, ..., O_m]

• Train the model "end to end" on input/output pairs of sequences; a code sketch follows below.
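A minimal PyTorch sketch of such an encoder/decoder pair, trained with teacher forcing; all names and sizes are illustrative assumptions rather than the exact model from the slide:

import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM compresses I_1..I_n into its final state; the decoder LSTM
    is initialized with that state and unrolls to produce O_1..O_m."""
    def __init__(self, in_vocab, out_vocab, embed_size, hidden_size):
        super().__init__()
        self.src_embed = nn.Embedding(in_vocab, embed_size)
        self.tgt_embed = nn.Embedding(out_vocab, embed_size)
        self.encoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, out_vocab)

    def forward(self, src, tgt):                         # teacher forcing at training time
        _, state = self.encoder(self.src_embed(src))     # the "deep vector" (h_n, c_n)
        out, _ = self.decoder(self.tgt_embed(tgt), state)
        return self.head(out)                            # scores over the output vocabulary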

[Diagram: the Encoder reads the input in one language; the Decoder produces the output in another language.]
Successful Applications of LSTMs
• Speech recognition: Language and acoustic modeling
• Sequence labeling
• POS Tagging
https://www.aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
• NER
• Phrase Chunking
• Neural syntactic and semantic parsing
• Image captioning: CNN output vector to sequence
• Sequence to Sequence
• Machine Translation (Sutskever, Vinyals, & Le, 2014)
• Video Captioning (input sequence of CNN frame outputs)

Bidirectional LSTM (Bi-LSTM)
• Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence classification problems.
• In problems where all time steps of the input sequence are available, bidirectional LSTMs train two LSTMs instead of one on the input sequence: the first on the input sequence as-is and the second on a reversed copy of it. This can provide additional context to the network and result in faster and fuller learning on the problem.
• To be clear, time steps in the input sequence are still processed one at a time; it is just that the network steps through the input sequence in both directions at the same time.
Bi-directional LSTM (Bi-LSTM)
• Separate LSTMs process the sequence forward and backward, and their hidden layers at each time step are concatenated to form the cell output; a code sketch follows below.

[Diagram: inputs x_{t-1}, x_t, x_{t+1}; concatenated outputs h_{t-1}, h_t, h_{t+1}]
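A minimal PyTorch sketch, assuming torch.nn.LSTM with bidirectional=True, which concatenates the forward and backward hidden states exactly as described above (all sizes are illustrative):

import torch
import torch.nn as nn

# Forward and backward LSTMs over the same sequence; their hidden states at
# each time step are concatenated, so the output size is 2 * hidden_size.
bilstm = nn.LSTM(input_size=100, hidden_size=64,
                 batch_first=True, bidirectional=True)

x = torch.randn(8, 20, 100)      # (batch, time steps, features)
out, _ = bilstm(x)
print(out.shape)                 # torch.Size([8, 20, 128])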
Example: Backward-Forward Sequence Generative Network for Multiple Lexical Constraints
https://link.springer.com/chapter/10.1007/978-3-030-49186-4_4
Gated Recurrent Unit (GRU)
• An alternative recurrent unit to the LSTM that uses fewer gates (Cho et al., 2014).
• Combines the forget and input gates into an "update" gate.
• Eliminates the separate cell state vector.

https://arxiv.org/pdf/1406.1078.pdf
Gated Recurrent Unit (GRU)
• The structure of the GRU allows it to adaptively capture dependencies from long sequences of data without discarding information from earlier parts of the sequence.
• This is achieved through its gating units, which are responsible for regulating the information to be kept or discarded at each time step.

Gated Recurrent Unit (GRU)
The GRU cell contains only two gates: the Update gate and the Reset gate. These gates are trained to selectively filter out any irrelevant information while keeping what's useful. The gates are vectors containing values between 0 and 1 which are multiplied with the input data and/or hidden state. A value of 0 in a gate vector indicates that the corresponding data in the input or hidden state is unimportant and will therefore be zeroed out. A value of 1 in a gate vector means that the corresponding data is important and will be used.

Gated Recurrent Unit (GRU)
Reset Gate
This gate is computed using both the hidden state from the previous time step and the input data at the current time step.
Gated Recurrent Unit (GRU)
Reset Gate
Mathematically, this is achieved by multiplying the previous hidden
state and current input with their respective weights and summing them
before passing the sum through a sigmoid function. The sigmoid function
will transform the values to fall between 0 and 1, allowing the gate to filter
between the less-important and more-important information in the subsequent
steps.
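Written out (the weight names W_{xr} and W_{hr} are illustrative; biases are omitted, as on the slides):

\mathrm{gate}_{reset} = \sigma(W_{xr} x_t + W_{hr} h_{t-1})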
Gated Recurrent Unit (GRU)
The previous hidden state will first be multiplied by a trainable weight and will
then undergo an element-wise multiplication with the reset vector. This
operation will decide which information is to be kept from the previous time
steps together with the new inputs. At the same time, the current input will
also be multiplied by a trainable weight before being summed with the product
of the reset vector and previous hidden state above. Lastly, a non-linear
activation tanh function will be applied to the final result to obtain r in the
equation below.
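In the same illustrative notation, with \odot denoting element-wise multiplication:

r = \tanh(W_{xh} x_t + \mathrm{gate}_{reset} \odot (W_{hh} h_{t-1}))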
Gated Recurrent Unit (GRU)
Update Gate
Just like the Reset gate, the Update gate is computed using the previous hidden state and the current input data.
Gated Recurrent Unit (GRU)
Update Gate
• Both the Update and Reset gate vectors are created using the same formula, but the weights multiplied with the input and hidden state are unique to each gate, which means that the final vectors for each gate are different. This allows the gates to serve their specific purposes.

• The Update vector then undergoes element-wise multiplication with the previous hidden state to obtain u in the equation below, which will be used to compute the final output later.

• The purpose of the Update gate is to help the model determine how much of the past information stored in the previous hidden state needs to be retained for the future.
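In the same notation:

\mathrm{gate}_{update} = \sigma(W_{xu} x_t + W_{hu} h_{t-1})
u = \mathrm{gate}_{update} \odot h_{t-1}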
Gated Recurrent Unit (GRU)
Combining the outputs
In the last step, we reuse the Update gate to obtain the updated hidden state.
Gated Recurrent Unit (GRU)
Combining the outputs
• This time, we will be taking the element-wise inverse version of the
same Update vector (1 - Update gate) and doing an element-wise multiplication
with our output from the Reset gate, r. The purpose of this operation is for
the Update gate to determine what portion of the new information should be
stored in the hidden state.
• Lastly, the result from the above operations will be summed with our output from
the Update gate in the previous step, u. This will give us our new and updated
hidden state.

• We can use this new hidden state as our output for that time step as well by
passing it through a linear activation layer.
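Putting the pieces together in the slides' naming, where r is the candidate from the Reset-gate step and u = \mathrm{gate}_{update} \odot h_{t-1}:

h_t = r \odot (1 - \mathrm{gate}_{update}) + u

A minimal NumPy sketch of one GRU time step under this formulation (the weight names are illustrative; biases are omitted, as on the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_xr, W_hr, W_xu, W_hu, W_xh, W_hh):
    """One GRU time step following the slides' formulation."""
    gate_r = sigmoid(W_xr @ x_t + W_hr @ h_prev)        # Reset gate
    gate_u = sigmoid(W_xu @ x_t + W_hu @ h_prev)        # Update gate
    r = np.tanh(W_xh @ x_t + gate_r * (W_hh @ h_prev))  # candidate ("r" on the slides)
    u = gate_u * h_prev                                  # retained past information
    return r * (1.0 - gate_u) + u                        # new hidden state h_t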
GRU vs. LSTM
• The GRU has significantly fewer parameters and trains faster.
• Experimental results comparing the two are still inconclusive: on many problems they perform the same, but each has problems on which it works better.

Attention
• For many applications, it helps to add "attention" to RNNs.
• Attention allows the network to learn to attend to different parts of the input at different time steps, shifting its focus to different aspects during processing.
• It is used in image captioning to focus on different parts of an image when generating different parts of the output sentence.
• In machine translation, it allows focusing on different parts of the source sentence when generating different parts of the translation.

Attention for Image Caption

Conclusions
• By adding "gates" to an RNN, we can mitigate the vanishing/exploding gradient problem.
• Trained LSTMs/GRUs can retain state information longer and handle long-distance dependencies.
• They have recently produced impressive results on a range of challenging NLP problems.

Further Learning

• https://medium.com/@jianqiangma/all-about-recurrent-neural-networks-9e5ae2936f6e
• https://towardsdatascience.com/natural-language-processing-from-basics-to-using-rnn-and-lstm-ef6779e4ae66
• http://colah.github.io/posts/2015-08-Understanding-LSTMs/
• https://www.youtube.com/watch?v=j_ohosux8bI
• https://towardsdatascience.com/recurrent-neural-network-head-to-toe-d58ff2f2dab3
• https://medium.com/datadriveninvestor/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients-a6784971a577
