RNN LSTM GRU Transformers

The document provides an overview of Recurrent Neural Networks (RNNs), their structure, and various types including LSTM and GRU, which address issues like vanishing gradients. It discusses the encoder-decoder architecture for tasks such as machine translation and the introduction of attention mechanisms to improve performance. Additionally, it highlights the transformer model, which replaces recurrent connections with attention modules to better understand and generate sequential data.


RNN

Recurrent Neural Networks (RNNs)


• Standard NN models (MLPs, CNNs) are not
able to handle sequences of data
• They accept a fixed-sized vector as input and
produce a fixed-sized vector as output
• The weights are updated independently of the
order in which the samples are processed
• RNNs are designed for modeling sequences
• Sequences in the input, in the output or in
both
• They are capable of remembering past
information
Vanilla RNN
Structure of RNN
Different Types
Design
• In an RNN the hidden layers are recurrent layers,
in which every neuron receives connections from
all neurons of the same layer at the previous time step
• The hidden layer gets its input from both the
input layer x(t) and the hidden state of the
previous time step h(t-1)
• New hidden state: h(t) = tanh(Whx x(t) + Whh h(t-1) + bh)

• Output: y(t) = Why h(t) + by (passed through a softmax when the output is a probability distribution)
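A minimal NumPy sketch of these two update equations; the dimensions, random initialization, and toy sequence are illustrative assumptions rather than values from the slides:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, W_hy, b_h, b_y):
    """One vanilla RNN time step: new hidden state and output."""
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)   # new hidden state
    y_t = W_hy @ h_t + b_y                            # output (logits)
    return h_t, y_t

# Illustrative dimensions: 4-dim input, 8-dim hidden state, 3-dim output
rng = np.random.default_rng(0)
W_hx, W_hh, W_hy = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
b_h, b_y = np.zeros(8), np.zeros(3)

h = np.zeros(8)                       # initial hidden state
for x in rng.normal(size=(5, 4)):     # a toy sequence of 5 input vectors
    h, y = rnn_step(x, h, W_hx, W_hh, W_hy, b_h, b_y)
```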
Training RNNs
• RNNs are trained by unfolding them into deep
feedforward networks, where a new layer is created
for each time step of an input sequence processed
by the network
Training a Sequence Classifier
example
• an RNN to classify MNIST images
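A hedged tf.keras sketch of such a classifier, treating each 28x28 image as a sequence of 28 rows of 28 pixels; the layer sizes and number of epochs are illustrative choices, not prescribed by the slides:

```python
import tensorflow as tf

# Each MNIST image is read as a sequence of 28 time steps (rows), 28 features each
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),                   # 28 time steps of 28 features
    tf.keras.layers.SimpleRNN(128),                   # vanilla recurrent layer
    tf.keras.layers.Dense(10, activation="softmax"),  # one output per digit class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))
```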
Training an RNN Language Model
• Get a big corpus of text, which is a sequence of
words
• Feed it into the RNN and compute the output distribution
for every step t,
i.e. predict the probability distribution of the next word,
given the words so far
• The loss function on step t is the cross-entropy
between the predicted probability distribution
and the true next word (represented as a one-hot vector):
Loss function

$$L^{(t)} = -\log\, p\big(x_{t+1} \mid x_1, \ldots, x_t\big)$$

• Average this to get the overall loss for the entire training set:

$$L = \frac{1}{T} \sum_{t=1}^{T} L^{(t)}$$
Training an RNN Language Model
• However: computing the loss and gradients across the
entire corpus is too expensive!
• In practice, each sentence (or document) is treated
as one training sequence
• Compute the loss for a sentence (actually, a batch
of sentences), compute the gradients and update the
weights. Repeat.
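A hedged PyTorch-style sketch of this per-batch loop; the model, vocabulary size, and the random toy batch are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10_000, 128, 256   # illustrative sizes

class RNNLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))    # h: (batch, seq_len, hidden_dim)
        return self.out(h)                     # logits over the vocabulary

model = RNNLanguageModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative "batch of sentences": predict token t+1 from tokens up to t
batch = torch.randint(0, vocab_size, (32, 20))          # 32 sentences, 20 tokens each
inputs, targets = batch[:, :-1], batch[:, 1:]
logits = model(inputs)                                  # (32, 19, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                         # gradients for this batch only
optimizer.step()
optimizer.zero_grad()
```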
Backpropagation Through Time (Paul Werbos)
• The overall loss L is the sum of all the loss
functions at times t = 1 to t = T:

$$L = \sum_{t=1}^{T} L^{(t)}$$

• Backpropagation computes the gradients by the chain rule:

$$\frac{\partial L^{(t)}}{\partial W} = \sum_{k=1}^{t} \frac{\partial L^{(t)}}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W}$$

• W can be either Whh or Whx


• The gradient decreases exponentially with the
number of layers in the network (or, for an RNN,
with the length of the sequence): vanishing gradients
• Thus, vanilla RNNs are unable to capture long-
term dependencies
Evaluating Language Models
• The standard evaluation metric for Language
Models is perplexity.
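Concretely, perplexity is the exponential of the average per-word cross-entropy loss L defined earlier:

$$\text{perplexity} = \exp(L) = \prod_{t=1}^{T} \left( \frac{1}{p(x_{t+1} \mid x_1, \ldots, x_t)} \right)^{1/T}$$

Lower perplexity means the model assigns higher probability to the held-out text.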
Backpropagation
BPTT

Drawback: vanishing/exploding gradients


LSTM (Long Short Term Memory)
• Proposed in 1997 by Hochreiter and Schmidhuber
as a solution to the vanishing gradient problem
• An LSTM cell stores a value (state) for either
long or short time periods
• The cell contains three gates:
– Forget gate
– Input gate
– Output gate
• Forget gate
– controls the extent to which a value remains in
the cell
• Input gate
– controls the extent to which a new value flows
into the cell
• Output gate
– controls the extent to which the value in the cell is
used to compute the output
LSTM
LSTM CELL
The idea behind LSTMs
• Cell state + gates
– Cell state stores long-term information
– Gates add/remove information to the cell state
Forget Gate

• A sigmoid layer called the “forget gate layer”
• What information are we going to forget from the cell
state?
• The sigmoid output lies in [0, 1]: 0 means “forget completely”,
1 means “keep this completely”
Input Gate
• A sigmoid layer called the “input gate layer”
decides which cell state values to update
• A tanh layer creates a vector of new candidate
values that could be added to the state
Output gate
LSTM
Gate functions
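In the widely used formulation (the one in the colah.github.io post listed under “Further reading” below), the gate functions and state updates are:

$$f_t = \sigma\big(W_f \cdot [h_{t-1}, x_t] + b_f\big) \qquad \text{(forget gate)}$$
$$i_t = \sigma\big(W_i \cdot [h_{t-1}, x_t] + b_i\big) \qquad \text{(input gate)}$$
$$\tilde{C}_t = \tanh\big(W_C \cdot [h_{t-1}, x_t] + b_C\big) \qquad \text{(candidate values)}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \qquad \text{(new cell state)}$$
$$o_t = \sigma\big(W_o \cdot [h_{t-1}, x_t] + b_o\big) \qquad \text{(output gate)}$$
$$h_t = o_t \odot \tanh(C_t) \qquad \text{(new hidden state)}$$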
LSTM -Parameters
• The input gate allows new information to flow into the
network. It has parameters Wi, bi, where i stands for input.
• The memory cell preserves the hidden units' information
across time steps. It has parameters Wc, bc, where c stands
for cell.
• The forget gate allows information which is no longer
pertinent to be discarded. It has parameters Wf, bf, where f
stands for forget.
• The output gate controls what information will be output
and what will be propagated forward as part of the new
hidden state. It has parameters Wo, bo, where o
stands for output.
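A minimal NumPy sketch of one LSTM step using these four parameter groups; the concatenated-input formulation and the dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    """One LSTM time step; each W acts on the concatenation [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)            # forget gate: keep/discard old cell content
    i = sigmoid(Wi @ z + bi)            # input gate: how much new content to write
    c_tilde = np.tanh(Wc @ z + bc)      # candidate values for the cell state
    c = f * c_prev + i * c_tilde        # new cell state
    o = sigmoid(Wo @ z + bo)            # output gate: how much of the cell to expose
    h = o * np.tanh(c)                  # new hidden state
    return h, c

# Illustrative sizes: 3-dim input, 5-dim hidden/cell state
n_in, n_h = 3, 5
rng = np.random.default_rng(1)
params = [rng.normal(scale=0.1, size=(n_h, n_h + n_in)) if k % 2 == 0 else np.zeros(n_h)
          for k in range(8)]            # [Wf, bf, Wi, bi, Wc, bc, Wo, bo]
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_in), h, c, *params)
```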
A simplified figure that shows the gradient flowing from
timestep t+1 to t looks like this:

• The green and the red paths are the two paths along which the gradient can flow back from
m(t+1) to m(t)
• m(t) is computed linearly, which means the gradient can continue to flow through m(t)
• The green path, which generates nonlinear outputs, is a difficult path for the gradient to
flow through
Backpropagation Through Time (BPTT)

https://blog.aidangomez.ca/2016/04/17/Backpropogating-an-LSTM-A-Numerical-Example/
1. Text Sentences:
2. Tokenization:
3. Encoding Text to Integers:
Each sentence is then represented by a sequence of integers corresponding to the
tokens of the words:
"this is good" → [1, 2, 3, 0, 0] (padded with zeros to maintain uniform length)
"I do not like it" → [4, 5, 6, 7, 9]
"I think it is good" → [4, 8, 9, 2, 3]
"I like it a lot" → [4, 7, 9, 10, 11]
"this is bad" → [1, 2, 12, 0, 0] (padded)

4. Word Embedding:
5. RNN Input:
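A short, self-contained Python sketch of steps 1-5, using the vocabulary implied by the integer sequences above (index 0 reserved for padding); the embedding dimension of 4 is an illustrative choice:

```python
import numpy as np

# 1. Text sentences
sentences = ["this is good", "I do not like it", "I think it is good",
             "I like it a lot", "this is bad"]

# 2./3. Tokenization and integer encoding (0 is reserved for padding)
vocab = {"this": 1, "is": 2, "good": 3, "I": 4, "do": 5, "not": 6,
         "like": 7, "think": 8, "it": 9, "a": 10, "lot": 11, "bad": 12}
max_len = 5
encoded = []
for s in sentences:
    ids = [vocab[w] for w in s.split()]
    encoded.append(ids + [0] * (max_len - len(ids)))   # pad to uniform length
# e.g. "this is good" -> [1, 2, 3, 0, 0]

# 4. Word embedding: each integer indexes a row of a (normally trainable) embedding matrix
embed_dim = 4                                           # illustrative size
embedding = np.random.default_rng(0).normal(size=(len(vocab) + 1, embed_dim))

# 5. RNN input: a (batch, time steps, features) tensor of embedded tokens
rnn_input = embedding[np.array(encoded)]                # shape (5, 5, 4)
```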
Further reading
• https://colah.github.io/posts/2015-08-Understanding-LSTMs/
GRU -Gated Recurrent Unit
• Simplifies the LSTM by combining the forget and
input gates into a single update gate z_t
• z_t controls the forgetting factor and the
decision to update the state unit
• A reset gate r_t controls which parts of the state
are used to compute the next target state
• It introduces an additional nonlinear effect in the
relationship between the past state and the future
state (see the update equations below)

Linear: the difference between terms increases or decreases by the same value each time
Non-linear: the difference between terms increases or decreases by different amounts
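In the standard formulation (same notation as the LSTM equations above, bias terms omitted for brevity), the GRU computes:

$$z_t = \sigma\big(W_z \cdot [h_{t-1}, x_t]\big) \qquad \text{(update gate)}$$
$$r_t = \sigma\big(W_r \cdot [h_{t-1}, x_t]\big) \qquad \text{(reset gate)}$$
$$\tilde{h}_t = \tanh\big(W \cdot [r_t \odot h_{t-1}, x_t]\big) \qquad \text{(candidate state)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \qquad \text{(new state)}$$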
Comparison LSTM and GRU
Example
• Consider an LSTM cell as shown. Suppose we
have a scalar-valued input sequence

https://statisticalinterference.wordpress.com/2017/06/01/lstms-in-even-more-excruciating-detail

• Assume that we initialized our weights and
biases to have the following values: wi1 = 0.5,
wc1 = 0.3, wf1 = 0.03, wo1 = 0.02, wy = 0.6
• bi = 0.01, bc = 0.05, bf = 0.002, bo = 0.001, by = 0.025
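A pure-Python sketch of the first time step with these values, assuming (for illustration only) a first input x1 = 1.0 and zero initial states h0 = c0 = 0; at the first step the recurrent-weight terms vanish, so only the listed weights matter:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Weights and biases from the example above
wi1, wc1, wf1, wo1, wy = 0.5, 0.3, 0.03, 0.02, 0.6
bi, bc, bf, bo, by = 0.01, 0.05, 0.002, 0.001, 0.025

x1, h0, c0 = 1.0, 0.0, 0.0            # x1 is an illustrative input value

f1 = sigmoid(wf1 * x1 + bf)           # forget gate
i1 = sigmoid(wi1 * x1 + bi)           # input gate
c_tilde1 = math.tanh(wc1 * x1 + bc)   # candidate cell value
c1 = f1 * c0 + i1 * c_tilde1          # new cell state
o1 = sigmoid(wo1 * x1 + bo)           # output gate
h1 = o1 * math.tanh(c1)               # new hidden state
y1 = wy * h1 + by                     # cell output used for the prediction
print(f1, i1, c1, h1, y1)
```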
Bidirectional Recurrent Neural
Network
• A Bidirectional Recurrent Neural Network (BRNN) is a type
of Recurrent Neural Network (RNN) that is designed to
improve the performance of traditional RNNs by processing
data in both forward and backward directions.

• This architecture allows the network to have information from
both past and future contexts, which can be particularly useful
for tasks where context from both directions is crucial, such as
language processing, speech recognition, and time-series
analysis.
How BRNN Works?
• A standard RNN processes input data in a sequence,
maintaining a hidden state that gets updated at each step
based on the current input and the previous hidden state.
However, a standard RNN only uses past context, which
can be a limitation for certain tasks.

• A BRNN addresses this by having two separate hidden
states: one that processes the sequence from start to end
(forward direction) and another that processes it from end
to start (backward direction). The outputs from these two
hidden states are then combined (usually concatenated) to
form the final output.
Bidirectional RNN Architecture
• Forward RNN: processes the input
sequence from t = 1 to t = T.

• Backward RNN: processes the
input sequence from t = T to t = 1.

• Concatenation: the outputs from
the forward and backward RNNs
are concatenated to form the final
output at each time step.
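A brief tf.keras sketch of this architecture; Bidirectional runs the wrapped layer forward and backward and concatenates the two outputs at each time step (the layer sizes here are illustrative assumptions):

```python
import tensorflow as tf

# Input: batches of sequences of 10-dimensional feature vectors, of any length
inputs = tf.keras.Input(shape=(None, 10))
# Forward and backward LSTMs of 32 units each; outputs are concatenated -> 64 features per step
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, return_sequences=True), merge_mode="concat")(inputs)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)   # e.g. a 5-class label per step
model = tf.keras.Model(inputs, outputs)
model.summary()
```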
Bidirectional RNN Architecture for t = 3
Bidirectional RNNs
• Extend RNNs into bidirectional models
• The repeating blocks could be any type of
RNN (vanilla RNN, LSTM, or GRU)
Machine Translation
• Task of automatically converting source text in one language
to another language
• Classical machine translation methods
– Rule-based machine translation (RBMT)
– Statistical machine translation (SMT; use of statistical
model)
• Neural Machine Translation (NMT)
– Use of neural network models to learn a statistical model
for machine translation
How?
• The core idea: a sequence-to-sequence model
• Encoder-Decoder architecture (input -> vector -> output)
• Use one RNN (the encoder) to read the input sequence one
token at a time and build a fixed-length vector representation of it
• Use another RNN (the decoder) to extract the output sequence
from that vector
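A hedged PyTorch sketch of this two-RNN scheme; the use of GRUs, the sizes, and the teacher-forced decoder input are illustrative assumptions:

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 8000, 8000, 128, 256   # illustrative sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):                 # src: (batch, src_len)
        _, h = self.rnn(self.embed(src))    # h: (1, batch, HID) = fixed-length summary
        return h

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, tgt, h):              # tgt: (batch, tgt_len), teacher forcing
        o, h = self.rnn(self.embed(tgt), h)
        return self.out(o), h               # logits: (batch, tgt_len, TGT_VOCAB)

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (4, 12))  # toy batch: 4 source sentences, 12 tokens
tgt = torch.randint(0, TGT_VOCAB, (4, 10))  # toy target-side tokens (teacher-forcing input)
logits, _ = decoder(tgt, encoder(src))
```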
Sequence to Sequence (seq2seq) and
Attention

https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
Encoder-Decoder Framework
Two RNNs for Encoder and Decoder
Training: The Cross-Entropy Loss
• The standard loss function is the cross-entropy
loss. The cross-entropy loss between the target
distribution p* and the predicted distribution p is

$$\mathrm{Loss} = -\sum_{w \in V} p^{*}(w)\,\log p(w),$$

where the sum runs over the vocabulary V. Since the target p* is a one-hot
vector for the reference token, this reduces to the negative log-probability
that the model assigns to the correct next token.
Inference: Greedy Decoding and Beam Search
• Greedy decoding: at each step, pick the most
probable token
• Beam search: keep track of the several most
probable hypotheses
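A minimal sketch of greedy decoding, written against a hypothetical step(prefix) function that returns the model's probability distribution over the next token; the BOS/EOS ids and length limit are assumptions:

```python
import numpy as np

BOS, EOS, MAX_LEN = 1, 2, 50      # assumed special-token ids and length limit

def greedy_decode(step):
    """step(prefix) -> probability distribution over the next token (hypothetical)."""
    prefix = [BOS]
    while len(prefix) < MAX_LEN:
        probs = step(prefix)
        token = int(np.argmax(probs))   # greedy: always pick the most probable token
        prefix.append(token)
        if token == EOS:
            break
    return prefix
```

Beam search replaces the single argmax with a set of the k highest-scoring prefixes, expanding each of them at every step and keeping the best k; it usually finds higher-probability outputs than greedy decoding at a higher computational cost.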
Problem of original encoder-decoder
(or seq2seq) model
• Need to compress all the necessary
information of a source sentence into a fixed-
length vector
• Very difficult to cope with long sentences,
especially when the test sequence is longer
than the sentences in the training corpus
• Extension of encoder-decoder model +
attention mechanism
Attention
• The problem of a fixed encoder representation
• A fixed source representation is suboptimal:
(i) for the encoder, it is hard to compress a big
sentence;
(ii) for the decoder, different information may
be relevant at different steps.
• Attention makes things easier by letting the decoder
refer back to the input sentence
Bottleneck
The general computation scheme
Encoder-Decoder with simple RNN
Attention Mechanism
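A compact NumPy sketch of this general scheme using dot-product scores; the dimensions are illustrative, and other score functions (e.g. additive attention) follow the same score -> softmax -> weighted-sum pattern:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Return the context vector and attention weights for one decoder step."""
    scores = encoder_states @ decoder_state       # one score per source position
    weights = softmax(scores)                     # attention weights sum to 1
    context = weights @ encoder_states            # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))                     # 6 source positions, 8-dim states
dec = rng.normal(size=8)                          # current decoder state
context, weights = attend(dec, enc)               # context: (8,), weights: (6,)
```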
Types of attention in NLP
• Scaled dot-product attention
• Multi-head attention
• Additive attention
• Location-based attention
Multi-head attention

https://jalammar.github.io/illustrated-bert/
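A NumPy sketch of scaled dot-product attention and a simple multi-head wrapper around it; the head count, dimensions, and random projection matrices are illustrative assumptions (a real model learns the projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (len_q, len_k) similarity scores
    return softmax(scores) @ V            # weighted sum of the values

def multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo):
    """Self-attention: X is (seq_len, d_model); Wq/Wk/Wv are lists of per-head projections."""
    heads = [scaled_dot_product_attention(X @ Wq[h], X @ Wk[h], X @ Wv[h])
             for h in range(num_heads)]
    return np.concatenate(heads, axis=-1) @ Wo   # concatenate heads, then project

# Illustrative sizes: sequence of 4 tokens, d_model = 8, 2 heads of size 4
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads
X = rng.normal(size=(seq_len, d_model))
Wq = [rng.normal(size=(d_model, d_head)) for _ in range(num_heads)]
Wk = [rng.normal(size=(d_model, d_head)) for _ in range(num_heads)]
Wv = [rng.normal(size=(d_model, d_head)) for _ in range(num_heads)]
Wo = rng.normal(size=(d_model, d_model))
out = multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo)   # shape (4, 8)
```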
Transformers
• A transformer model is a neural network that
learns the context of sequential data and
generates new data out of it.
• To put it simply:
• A transformer is a type of artificial intelligence
model that learns to understand and generate
human-like text by analyzing patterns in large
amounts of text data.
Transformers
• The transformer has an encoder-decoder
architecture similar to the previous RNN
models, except that all the recurrent connections
are replaced by attention modules.
• The transformer model uses N stacked self-
attention layers.
• Skip-connections help preserve the positional
and identity information from the input
sequences.
They are specifically designed to comprehend
context and meaning by analyzing the relationship
between different elements, and they rely almost
entirely on a mathematical technique called
attention to do so.
The Transformer Architecture
• This black box is composed of two main parts:
• The encoder takes in our input and outputs a matrix representation
of that input. For instance, the English sentence “How are you?”
• The decoder takes in that encoded representation and iteratively
generates an output. In our example, the translated sentence
“¿Cómo estás?”
The Transformer Architecture
• Both the encoder and the decoder are actually stacks with multiple layers (the same
number for each). All encoder layers have the same structure; the input goes into the
first one and is then passed from each layer to the next.
• All decoders present the same structure as well and get the input from the last
encoder and the previous decoder.
• The original architecture consisted of 6 encoders and 6 decoders, but we can
replicate as many layers as we want. So let’s assume N layers of each.
The Encoder
WorkFlow

• The primary function of the encoder is to transform the input
tokens into contextualized representations.

• Unlike earlier models that processed tokens independently,
the Transformer encoder captures the context of each
token with respect to the entire sequence.
STEP 1 - Input Embeddings
• The embedding only happens in the bottom-most encoder. The encoder
begins by converting input tokens - words or subwords - into vectors using
embedding layers. These embeddings capture the semantic meaning of
the tokens and convert them into numerical vectors.
• All the encoders receive a list of vectors, each of size 512 (fixed-sized).
STEP 2 - Positional Encoding
Since Transformers do not have a recurrence mechanism like RNNs, they use
positional encodings added to the input embeddings to provide
information about the position of each token in the sequence. This
allows them to understand the position of each word within the
sentence.
STEP 3 - Stack of Encoder Layers
STEP 3 - Stack of Encoder Layers
• The Transformer encoder consists of a stack of identical
layers (6 in the original Transformer model).
• The encoder layer serves to transform all input sequences
into a continuous, abstract representation that
encapsulates the learned information from the entire
sequence. This layer comprises two sub-modules:
• A multi-headed attention mechanism.
• A fully connected network.
• Additionally, it incorporates residual connections around
each sublayer, which are then followed by layer
normalization.
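In Python-like pseudocode, one encoder layer therefore computes the following; the three callables stand for the sub-modules described above, so this is a structural sketch rather than a full implementation:

```python
def encoder_layer(x, self_attention, feed_forward, layer_norm):
    """One Transformer encoder layer; the three callables are the sub-modules described above."""
    # Sub-module 1: multi-headed self-attention, wrapped in a residual connection + layer norm
    x = layer_norm(x + self_attention(x))
    # Sub-module 2: position-wise fully connected network, again with residual + layer norm
    x = layer_norm(x + feed_forward(x))
    return x
```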
STEP 4 - Output of the Encoder
• The output of the final encoder layer is a set of vectors, each
representing the input sequence with a rich contextual
understanding. This output is then used as the input for the
decoder in a Transformer model.
• This careful encoding paves the way for the decoder, guiding it to
pay attention to the right words in the input when it's time to
decode.
• Think of it like building a tower, where you can stack up N encoder
layers. Each layer in this stack gets a chance to explore and learn
different facets of attention, much like layers of knowledge. This not
only diversifies the understanding but could significantly amplify
the predictive capabilities of the transformer network.
The Decoder
WorkFlow
• The transformer has an encoder-decoder
architecture similar to the previous RNN
models,
• except all the recurrent connections are
replaced by the attention modules.
• The transformer model uses N stacked
self-attention layers.
• Skip-connections help preserve the
positional and identity information from
the input sequences.
Positional Encoding
• Unlike traditional sequence-based models (like
RNNs or LSTMs) that inherently process data
in a sequential manner, Transformer models
process all tokens in parallel.
• add positional information of an input token
in the sequence into the input embedding
vectors.
• Sine and Cosine Functions
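For position pos and embedding dimension index i (with model dimension d_model), the original Transformer uses:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right)$$

These values are simply added to the corresponding input embedding vectors.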
The Overall Flow in the Transformer Model
1. Input Processing (Encoder):
1. The input sequence (e.g., a sentence) is passed through the encoder. Each
token's embedding is enriched with positional encodings, and multi-head self-
attention followed by feed-forward layers refines the representations through
the encoder’s layers.
2. The final output of the encoder is a set of contextualized token embeddings
that capture the relationships between all tokens in the input sequence.
2. Output Generation (Decoder):
1. The decoder processes the output sequence (e.g., translated or generated
text). Initially, only a start token is provided, and the decoder generates tokens
one by one, using the encoder’s output for context.
2. The masked self-attention ensures that the decoder doesn’t look ahead, while
cross-attention enables it to attend to relevant parts of the encoder's output.
3. The output is then refined using feed-forward layers and fed into the next
layer of the decoder.
3. Final Output:
1. After passing through multiple decoder layers, the final output is passed
through a softmax layer to predict the next token in the sequence. This
process continues autoregressively until the full output sequence is generated.
BERT
• AI2 released ELMo in spring 2018, GPT was
released in summer 2018, BERT came out
October 2018
• Three major changes compared to ELMo:
‣ Transformers instead of LSTMs
(transformers in GPT as well)
‣ Bidirectional <=> Masked LM objective
instead of standard LM
‣ Fine-tune instead of freeze at test time

https://towardsml.wordpress.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/
BERT
Bidirectional Encoder Representations
from Transformers
• Mask out 15% of the tokens
and predict those [MASK] tokens.
• Bidirectional: predict
the masked token from
context on both sides.
• Transformer encoder,
with all-to-all self-attention.
• Suitable for text
representation.
BERT Architecture
• Get rid of the decoder.
• Stack a series of Transformer encoder blocks.
• Pre-train with Masked Language Modeling and
Next Sentence Prediction (on massive
datasets).
Before NN
1. Token embeddings: A [CLS] token is added to the input word
tokens at the beginning of the first sentence and a [SEP] token is
inserted at the end of each sentence.
2. Segment embeddings: A marker indicating Sentence A or Sentence B
is added to each token. This allows the encoder to distinguish
between sentences.
3. Positional embeddings: A positional embedding is added to each
token to indicate its position in the sentence.
Input text processing
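A brief sketch of this input processing using the Hugging Face transformers library (assuming it is installed; the two sentences are illustrative). The tokenizer produces the token ids and segment ids; the positional embeddings are added inside the model:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("this is good", "I like it a lot")   # sentence A, sentence B

# Tokens, roughly: ['[CLS]', 'this', 'is', 'good', '[SEP]', 'i', 'like', 'it', 'a', 'lot', '[SEP]']
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])   # segment ids: 0 for sentence A, 1 for sentence B
```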
References
• https://www.columbia.edu/~jsl2239/transformers.html
• https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html
• https://transformer-circuits.pub/2021/framework/index.html
• https://www.comet.com/site/blog/explainable-ai-for-transformers/
