
High Impact Skills Development Program

in Artificial Intelligence, Data Science, and Blockchain

Module 9: NLP & Sequential Models


Lecture 4: LSTM & GRU

Instructor: Ahsan Jalal

Review of Previous Lecture
• RNN
• Sequence Model
• Recurrence on hidden layer
• Limitations:
• Vanishing/exploding gradient problem
• Long-term dependencies
• Solutions:
• Activation function: ReLU
• Parameter initialization: weights set to the identity matrix, biases to 0
• Use more complex recurrent units with gates to control what information is passed through
Long Short-Term Memory
• LSTM networks add additional gating units to each memory cell:
• Forget gate
• Store (input) gate
• Update gate
• Output gate
• This mitigates the vanishing/exploding gradient problem and allows the network to retain state information over longer periods of time.

Standard RNN
Long Short-Term Memory (LSTM)
LSTM Network Architecture

Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM)
Forget Gate
• The forget gate computes a 0-1 value, using a logistic sigmoid output function of the input, x_t, and the previous hidden state, h_{t-1}:
• It is multiplicatively combined with the cell state, "forgetting" information wherever the gate outputs something close to 0.
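The equation itself appears only as a figure on the original slide; a standard formulation, consistent with the weight names (W_f, etc.) used later, is:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)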

Store/Input Gate
• First, determine which entries in the cell state to update by computing a 0-1 sigmoid output.
• Then, determine what amount to add or subtract from these entries by computing a tanh output (valued -1 to 1) as a function of the input and hidden state.
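Again the slide's equations are figures; the standard formulation is:

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)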

Cell State
• Maintains a vector C_t that has the same dimensionality as the hidden state, h_t.
• Information can be added to or deleted from this state vector via the forget and input gates.

Updating the Cell State
• The cell state is updated using component-wise vector multiplication to "forget" and vector addition to "input" new information.
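In the same standard notation, with \odot denoting component-wise multiplication:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t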

Cell State Example
• We want to remember the person & number of a subject noun so that it can be checked for agreement with the person & number of the verb when the verb is eventually encountered.
• The forget gate removes existing information about a prior subject when a new one is encountered.
• The input gate "adds" in the information for the new subject.

Output Gate
• The hidden state is updated based on a "filtered" version of the cell state, scaled to -1 to 1 using tanh.
• The output gate computes a sigmoid function of the input and the previous hidden state to determine which elements of the cell state to "output".
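In the same standard notation:

o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(C_t)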

Overall Network Architecture
Overall Network Architecture
• Single- or multilayer networks can compute LSTM inputs from problem inputs, and problem outputs from LSTM outputs.

[Diagram annotations:]
• O_t: e.g., a POS tag as a "one-hot" vector
• e.g., a word "embedding" with reduced dimensionality
• I_t: e.g., a word as a "one-hot" vector
LSTM Gradient Flow
LSTM Training
• Trainable with backpropagation-based optimizers such as:
• Stochastic gradient descent (randomize the order of examples in each epoch) with momentum (bias weight changes to continue in the same direction as the last update).
• The ADAM optimizer (Kingma & Ba, 2015); a training sketch follows below.
• Each cell has many parameters (W_f, W_i, W_C, W_o).
• Generally requires lots of training data.
• Requires lots of compute time, typically exploiting GPU clusters.
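A minimal PyTorch sketch of such a training setup; the framework and the names model, train_loader, loss_fn, and num_epochs are illustrative assumptions, not something prescribed by the slides:

import torch

# Assumes an LSTM-based `model`, a shuffled `train_loader`, and a suitable `loss_fn`.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Alternatively, the ADAM optimizer (Kingma & Ba, 2015):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for x_batch, y_batch in train_loader:   # example order randomized each epoch
        optimizer.zero_grad()
        loss = loss_fn(model(x_batch), y_batch)
        loss.backward()                     # backpropagation (through time)
        optimizer.step()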

General Problems Solved with LSTMs
• Sequence labeling
• Train with a supervised output at each time step, computed using a single- or multilayer network that maps the hidden state (h_t) to an output vector (O_t).
• Language modeling
• Train to predict the next input (O_t = I_{t+1}).
• Sequence (e.g. text) classification
• Train a single- or multilayer network that maps the final hidden state (h_n) to an output vector (O); a sketch of the labeling and classification cases follows below.
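A minimal PyTorch sketch of the sequence-labeling and sequence-classification cases; vocab_size, embed_size, hidden_size, num_tags, and num_classes are illustrative assumptions:

import torch.nn as nn

class LSTMTagger(nn.Module):
    """Sequence labeling: map the hidden state h_t to an output O_t at every step."""
    def __init__(self, vocab_size, embed_size, hidden_size, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_tags)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))      # h: (batch, seq_len, hidden_size)
        return self.head(h)                       # tag scores at every time step

class LSTMClassifier(nn.Module):
    """Sequence classification: map the final hidden state h_n to a single output O."""
    def __init__(self, vocab_size, embed_size, hidden_size, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, tokens):
        _, (h_n, _) = self.lstm(self.embed(tokens))   # h_n: (1, batch, hidden_size)
        return self.head(h_n[-1])                     # one label per sequence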

Sequence to Sequence Transduction (Mapping)
• The Encoder/Decoder framework maps one sequence to a "deep vector"; another LSTM then maps this vector to an output sequence:

[Diagram: I_1, I_2, ..., I_n -> Encoder LSTM -> h_n -> Decoder LSTM -> O_1, O_2, ..., O_m]

• Train the model "end to end" on input/output pairs of sequences; a code sketch follows below.
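A minimal PyTorch sketch of such an encoder/decoder pair, trained with teacher forcing; all names and sizes are illustrative assumptions rather than the exact model from the slide:

import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM compresses I_1..I_n into its final state; the decoder LSTM
    is initialized with that state and unrolls to produce O_1..O_m."""
    def __init__(self, in_vocab, out_vocab, embed_size, hidden_size):
        super().__init__()
        self.src_embed = nn.Embedding(in_vocab, embed_size)
        self.tgt_embed = nn.Embedding(out_vocab, embed_size)
        self.encoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, out_vocab)

    def forward(self, src, tgt):                         # teacher forcing at training time
        _, state = self.encoder(self.src_embed(src))     # the "deep vector" (h_n, c_n)
        out, _ = self.decoder(self.tgt_embed(tgt), state)
        return self.head(out)                            # scores over the output vocabulary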

[Diagram: the Encoder reads the input in one language; the Decoder produces the output in another language.]
Successful Applications of LSTMs
• Speech recognition: Language and acoustic modeling
• Sequence labeling
• POS Tagging
https://www.aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
• NER
• Phrase Chunking
• Neural syntactic and semantic parsing
• Image captioning: CNN output vector to sequence
• Sequence to Sequence
• Machine Translation (Sutskever, Vinyals, & Le, 2014)
• Video Captioning (input sequence of CNN frame outputs)

Bidirectional LSTM (Bi-LSTM)
• Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence classification problems.
• In problems where all time steps of the input sequence are available, bidirectional LSTMs train two LSTMs instead of one on the input sequence: the first on the input sequence as-is and the second on a reversed copy of it. This can provide additional context to the network and result in faster and fuller learning on the problem.
• To be clear, time steps in the input sequence are still processed one at a time; it is just that the network steps through the input sequence in both directions at the same time.
Bi-directional LSTM (Bi-LSTM)
• Separate LSTMs process the sequence forward and backward, and their hidden layers at each time step are concatenated to form the cell output; a code sketch follows below.

[Diagram: inputs x_{t-1}, x_t, x_{t+1}; concatenated outputs h_{t-1}, h_t, h_{t+1}]
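A minimal PyTorch sketch, assuming torch.nn.LSTM with bidirectional=True, which concatenates the forward and backward hidden states exactly as described above (all sizes are illustrative):

import torch
import torch.nn as nn

# Forward and backward LSTMs over the same sequence; their hidden states at
# each time step are concatenated, so the output size is 2 * hidden_size.
bilstm = nn.LSTM(input_size=100, hidden_size=64,
                 batch_first=True, bidirectional=True)

x = torch.randn(8, 20, 100)      # (batch, time steps, features)
out, _ = bilstm(x)
print(out.shape)                 # torch.Size([8, 20, 128])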
Example: Backward-Forward Sequence Generative Network for Multiple Lexical Constraints
https://link.springer.com/chapter/10.1007/978-3-030-49186-4_4
Gated Recurrent Unit (GRU)
• An alternative recurrent unit to the LSTM that uses fewer gates (Cho et al., 2014).
• Combines the forget and input gates into an "update" gate.
• Eliminates the separate cell state vector.

https://arxiv.org/pdf/1406.1078.pdf
Gated Recurrent Unit (GRU)
• The structure of the GRU allows it to adaptively capture dependencies from long sequences of data without discarding information from earlier parts of the sequence.
• This is achieved through its gating units, which are responsible for regulating the information to be kept or discarded at each time step.

Gated Recurrent Unit (GRU)
The GRU cell contains only two gates: the Update gate and the Reset gate. These gates are trained to selectively filter out any irrelevant information while keeping what's useful. The gates are vectors containing values between 0 and 1 which are multiplied with the input data and/or hidden state. A value of 0 in a gate vector indicates that the corresponding data in the input or hidden state is unimportant and will therefore be zeroed out. A value of 1 in a gate vector means that the corresponding data is important and will be used.

Gated Recurrent Unit (GRU)
Reset Gate
This gate is computed using both the hidden state from the previous time step and the input data at the current time step.
Gated Recurrent Unit (GRU)
Reset Gate
Mathematically, this is achieved by multiplying the previous hidden
state and current input with their respective weights and summing them
before passing the sum through a sigmoid function. The sigmoid function
will transform the values to fall between 0 and 1, allowing the gate to filter
between the less-important and more-important information in the subsequent
steps.
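Written out (the weight names W_{xr} and W_{hr} are illustrative; biases are omitted, as on the slides):

\mathrm{gate}_{reset} = \sigma(W_{xr} x_t + W_{hr} h_{t-1})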
Gated Recurrent Unit (GRU)
The previous hidden state will first be multiplied by a trainable weight and will
then undergo an element-wise multiplication with the reset vector. This
operation will decide which information is to be kept from the previous time
steps together with the new inputs. At the same time, the current input will
also be multiplied by a trainable weight before being summed with the product
of the reset vector and previous hidden state above. Lastly, a non-linear
activation tanh function will be applied to the final result to obtain r in the
equation below.
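In the same illustrative notation, with \odot denoting element-wise multiplication:

r = \tanh(W_{xh} x_t + \mathrm{gate}_{reset} \odot (W_{hh} h_{t-1}))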
Gated Recurrent Unit (GRU)
Update Gate
Just like the Reset gate, the Update gate is computed using the previous hidden state and the current input data.
Gated Recurrent Unit (GRU)
Update Gate
• Both the Update and Reset gate vectors are created using the same formula, but the weights multiplied with the input and hidden state are unique to each gate, which means that the final vectors for each gate are different. This allows the gates to serve their specific purposes.

• The Update vector then undergoes element-wise multiplication with the previous hidden state to obtain u in the equation below, which will be used to compute the final output later.

• The purpose of the Update gate is to help the model determine how much of the past information stored in the previous hidden state needs to be retained for the future.
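In the same notation:

\mathrm{gate}_{update} = \sigma(W_{xu} x_t + W_{hu} h_{t-1})
u = \mathrm{gate}_{update} \odot h_{t-1}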
Gated Recurrent Unit (GRU)
Combining the outputs
In the last step, we reuse the Update gate to obtain the updated hidden state.
Gated Recurrent Unit (GRU)
Combining the outputs
• This time, we will be taking the element-wise inverse version of the
same Update vector (1 - Update gate) and doing an element-wise multiplication
with our output from the Reset gate, r. The purpose of this operation is for
the Update gate to determine what portion of the new information should be
stored in the hidden state.
• Lastly, the result from the above operations will be summed with our output from
the Update gate in the previous step, u. This will give us our new and updated
hidden state.

• We can use this new hidden state as our output for that time step as well by
passing it through a linear activation layer.
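Putting the pieces together in the slides' naming, where r is the candidate from the Reset-gate step and u = \mathrm{gate}_{update} \odot h_{t-1}:

h_t = r \odot (1 - \mathrm{gate}_{update}) + u

A minimal NumPy sketch of one GRU time step under this formulation (the weight names are illustrative; biases are omitted, as on the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_xr, W_hr, W_xu, W_hu, W_xh, W_hh):
    """One GRU time step following the slides' formulation."""
    gate_r = sigmoid(W_xr @ x_t + W_hr @ h_prev)        # Reset gate
    gate_u = sigmoid(W_xu @ x_t + W_hu @ h_prev)        # Update gate
    r = np.tanh(W_xh @ x_t + gate_r * (W_hh @ h_prev))  # candidate ("r" on the slides)
    u = gate_u * h_prev                                  # retained past information
    return r * (1.0 - gate_u) + u                        # new hidden state h_t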
GRU vs. LSTM
• The GRU has significantly fewer parameters and trains faster.
• Experimental results comparing the two are still inconclusive: on many problems they perform the same, but each has problems on which it works better.

Attention
• For many applications, it helps to add "attention" to RNNs.
• Attention allows the network to learn to attend to different parts of the input at different time steps, shifting its focus to different aspects during processing.
• It is used in image captioning to focus on different parts of an image when generating different parts of the output sentence.
• In machine translation, it allows focusing on different parts of the source sentence when generating different parts of the translation.

Attention for Image Caption

Conclusions
• By adding "gates" to an RNN, we can mitigate the vanishing/exploding gradient problem.
• Trained LSTMs/GRUs can retain state information longer and handle long-distance dependencies.
• They have recently produced impressive results on a range of challenging NLP problems.

Further Learning

• https://medium.com/@jianqiangma/all-about-recurrent-neural-networks-9e5ae2936f6e
• https://towardsdatascience.com/natural-language-processing-from-basics-to-using-rnn-and-lstm-ef6779e4ae66
• http://colah.github.io/posts/2015-08-Understanding-LSTMs/
• https://www.youtube.com/watch?v=j_ohosux8bI
• https://towardsdatascience.com/recurrent-neural-network-head-to-toe-d58ff2f2dab3
• https://medium.com/datadriveninvestor/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients-a6784971a577
