RNN LSTM GRU Transformers
Training RNNs
• RNNs are trained by unfolding them into deep
feedforward networks
• where a new layer is created for each time
step of an input sequence processed by the
network
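A minimal sketch of this unfolding (sizes and weights are illustrative, not from the slides): each loop iteration plays the role of one layer of the unrolled feedforward network, and the same weight matrices are reused at every time step.

import numpy as np

np.random.seed(0)
T, n_in, n_h = 4, 3, 5                      # time steps, input size, hidden size
Wxh = np.random.randn(n_h, n_in) * 0.1      # input-to-hidden weights (shared across time)
Whh = np.random.randn(n_h, n_h) * 0.1       # hidden-to-hidden weights (shared across time)
b = np.zeros(n_h)

xs = [np.random.randn(n_in) for _ in range(T)]   # a dummy input sequence
h = np.zeros(n_h)                                # initial hidden state
for x_t in xs:                                   # one "layer" of the unrolled network per step
    h = np.tanh(Wxh @ x_t + Whh @ h + b)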
Training a Sequence Classifier
example
• an RNN to classify MNIST images
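A hedged sketch of such a classifier in PyTorch (the layer sizes and the row-by-row reading of the image are illustrative choices, not taken from the slides): each 28×28 image is treated as a sequence of 28 rows of 28 pixels, and the final hidden state feeds a 10-way classifier.

import torch
import torch.nn as nn

class MNISTRNN(nn.Module):
    def __init__(self, hidden_size=128):
        super().__init__()
        # each image row (28 pixels) is one time step, so there are 28 steps per image
        self.rnn = nn.RNN(input_size=28, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 10)      # 10 digit classes

    def forward(self, images):                    # images: (batch, 1, 28, 28)
        seq = images.squeeze(1)                   # -> (batch, 28, 28): 28 steps of 28 features
        _, h_n = self.rnn(seq)                    # h_n: (1, batch, hidden_size)
        return self.fc(h_n.squeeze(0))            # class logits

model = MNISTRNN()
logits = model(torch.randn(64, 1, 28, 28))        # a dummy batch instead of real MNIST data
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (64,)))
loss.backward()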
Training an RNN Language Model
• Get a big corpus of text which is a sequence of
words
• Feed it into the RNN; compute the output distribution for every step t,
• i.e. predict the probability distribution of every word, given the words so far
• Loss function on step t is the cross-entropy between the predicted probability distribution and the true next word (as a one-hot vector).
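In the usual notation (a standard reconstruction of the omitted formula; \hat{y}^{(t)} is the predicted distribution at step t, the one-hot target y^{(t)} puts all its mass on the true next word x_{t+1}, and the total loss averages over the T steps):

J^{(t)}(\theta) = -\sum_{w \in V} y_w^{(t)} \log \hat{y}_w^{(t)} = -\log \hat{y}_{x_{t+1}}^{(t)}, \qquad J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)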
Loss function
• the green and the red paths are the two paths along which the gradient can flow back from mt+1 to mt.
• mt is computed linearly, which means the gradient can continue to flow through mt
• the green path, which generates nonlinear outputs, is a "difficult" path for the gradient to flow through
Backpropagation Through Time
(BPTT)
https://blog.aidangomez.ca/2016/04/17/Backpropogating-an-LSTM-A-Numerical-Example/
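A minimal numerical sketch of BPTT for a vanilla RNN (sizes are illustrative, and a squared-error loss on the final hidden state is assumed purely for brevity): the forward pass caches every hidden state of the unrolled network, and the backward pass walks from the last time step back to the first, accumulating gradients into the shared weights.

import numpy as np

np.random.seed(0)
T, n_in, n_h = 4, 3, 5
Wxh = np.random.randn(n_h, n_in) * 0.1
Whh = np.random.randn(n_h, n_h) * 0.1
b = np.zeros(n_h)
xs = [np.random.randn(n_in) for _ in range(T)]
target = np.random.randn(n_h)

# Forward pass: unroll and cache the hidden states.
hs = [np.zeros(n_h)]
for t in range(T):
    hs.append(np.tanh(Wxh @ xs[t] + Whh @ hs[-1] + b))

# Backward pass through time: t = T, ..., 1 over the unrolled graph.
dWxh, dWhh, db = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros(n_h)
dh = hs[-1] - target                      # dL/dh_T for the squared-error loss
for t in reversed(range(T)):
    dz = dh * (1.0 - hs[t + 1] ** 2)      # backprop through tanh
    dWxh += np.outer(dz, xs[t])           # the shared weights receive gradient from every step
    dWhh += np.outer(dz, hs[t])
    db += dz
    dh = Whh.T @ dz                       # gradient flowing back to h_{t-1}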
1. Text Sentences:
2. Tokenization:
3. Encoding Text to Integers:
Each sentence is then represented by a sequence of integers corresponding to the
tokens of the words:
"this is good" → [1, 2, 3, 0, 0] (padded with zeros to maintain uniform length)
"I do not like it" → [4, 5, 6, 7, 9]
"I think it is good" → [4, 8, 9, 2, 3]
"I like it a lot" → [4, 7, 9, 10, 11]
"this is bad" → [1, 2, 12, 0, 0] (padded)
4. Word Embedding:
5. RNN Input:
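A brief sketch of steps 2-5 in PyTorch (the embedding size, hidden size, and choice of library are assumptions, not from the slides), using a hand-built vocabulary that reproduces the integer codes above, with index 0 reserved for padding.

import torch
import torch.nn as nn

vocab = {"<pad>": 0, "this": 1, "is": 2, "good": 3, "I": 4, "do": 5, "not": 6,
         "like": 7, "think": 8, "it": 9, "a": 10, "lot": 11, "bad": 12}

def encode(sentence, max_len=5):
    ids = [vocab[w] for w in sentence.split()]        # step 3: tokens -> integers
    return ids + [0] * (max_len - len(ids))           # pad with zeros to a uniform length

batch = torch.tensor([encode("this is good"),
                      encode("I like it a lot")])     # shape: (2, 5)

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8, padding_idx=0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

embedded = embedding(batch)     # (2, 5, 8)  - step 4: word embedding
output, h_n = rnn(embedded)     # (2, 5, 16) - step 5: the embedded sequence is the RNN input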
Further reading
• https://colah.github.io/posts/2015-08-Understanding-LSTMs/
GRU - Gated Recurrent Unit
• Simplifies the LSTM by combining the forget and input gates into a single update gate 𝑧t
• 𝑧t controls the forgetting factor and the decision to update the state unit
• Reset gates 𝑟t control which parts of the state get used to compute the next target state
• This introduces an additional nonlinear effect in the relationship between the past state and the future state (the update equations are given below)
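One standard way to write the GRU update (a sketch, not from the slide; \sigma is the logistic sigmoid, \odot is element-wise multiplication, and some references swap the roles of z_t and 1 - z_t in the last line):

z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)    (update gate)
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)    (reset gate)
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)    (candidate state)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t    (interpolation controlled by z_t)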
Linear: the difference between terms increases or decreases by the same value each time
Non-linear: the difference between terms increases or decreases by different amounts
Comparison of LSTM and GRU
Example
• Consider an LSTM cell as shown. Suppose we have a scalar-valued input sequence
https://statisticalinterference.wordpress.com/2017/06/01/lstms-in-even-more-excruciating-detail
• Assume that we initialized our weights and biases to have the following values: wi1 = 0.5, wc1 = 0.3, wf1 = 0.03, wo1 = 0.02, wy = 0.6
• bi = 0.01, bc = 0.05, bf = 0.002, bo = 0.001, by = 0.025
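A hedged numerical sketch of one forward step of the LSTM cell with the weights listed above. Assumptions not stated on the slide: the first input is x1 = 1, the initial states are h0 = c0 = 0, and the recurrent weight terms are dropped because only the input weights are listed here; see the linked post for the full worked example.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

wi1, wc1, wf1, wo1, wy = 0.5, 0.3, 0.03, 0.02, 0.6
bi, bc, bf, bo, by = 0.01, 0.05, 0.002, 0.001, 0.025
x1, h0, c0 = 1.0, 0.0, 0.0          # hypothetical input and zero initial states

i1 = sigmoid(wi1 * x1 + bi)         # input gate
f1 = sigmoid(wf1 * x1 + bf)         # forget gate
o1 = sigmoid(wo1 * x1 + bo)         # output gate
g1 = np.tanh(wc1 * x1 + bc)         # candidate cell state
c1 = f1 * c0 + i1 * g1              # new cell state
h1 = o1 * np.tanh(c1)               # new hidden state
y1 = wy * h1 + by                   # cell output

print(f"i={i1:.3f} f={f1:.3f} o={o1:.3f} c={c1:.3f} h={h1:.3f} y={y1:.3f}")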
Bidirectional Recurrent Neural
Network
• A Bidirectional Recurrent Neural Network (BRNN) is a type
of Recurrent Neural Network (RNN) that is designed to
improve the performance of traditional RNNs by processing
data in both forward and backward directions.
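A minimal sketch in PyTorch (sizes are illustrative): setting bidirectional=True runs one pass left-to-right and one right-to-left and concatenates the two hidden states at every time step.

import torch
import torch.nn as nn

brnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
x = torch.randn(2, 5, 8)            # (batch, time steps, features)
output, (h_n, c_n) = brnn(x)
print(output.shape)                 # torch.Size([2, 5, 32]): 2 * hidden_size per step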
https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
Encoder-Decoder Framework
Two RNNs for Encoder and Decoder
Training: The Cross-Entropy Loss
• The standard loss function is the cross-entropy loss between the target distribution 𝑝∗ and the predicted distribution 𝑝.
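Written out (a standard reconstruction; 𝑝∗ is one-hot on the true target token y_t, so the sum collapses to a single term):

\mathrm{Loss}(p^{*}, p) = -\sum_{w \in V} p^{*}(w) \log p(w) = -\log p(y_t)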
Inference: Greedy Decoding and Beam Search
• Greedy Decoding: At each step, pick the most probable token
• Beam Search: Keep track of several of the most probable hypotheses (see the sketch below)
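A hedged sketch of beam search (the scoring interface next_token_logprobs, the end-of-sequence token, and the beam size are hypothetical placeholders, not from the slides); greedy decoding is the special case beam_size = 1.

def beam_search(next_token_logprobs, beam_size=3, max_len=20, eos="</s>"):
    # next_token_logprobs(prefix) -> {token: log_prob} is a stand-in for the decoder.
    beams = [([], 0.0)]                               # (token list, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:          # finished hypotheses are kept as-is
                candidates.append((tokens, score))
                continue
            for tok, lp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # keep only the beam_size most probable hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams[0][0]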
Problem of original encoder-decoder
(or seq2seq) model
• Need to compress all the necessary
information of a source sentence into a fixed-
length vector
• Very difficult to cope with long sentences,
especially when the test sequence is longer
than the sentences in the training corpus
• Solution: extend the encoder-decoder model with an attention mechanism
Attention
• The Problem of the Fixed Encoder Representation
• A fixed source representation is suboptimal:
(i) for the encoder, it is hard to compress a big sentence;
(ii) for the decoder, different information may be relevant at different steps.
• Attention makes things easier by letting the decoder refer back to the input sentence
Bottleneck
The general computation scheme
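In symbols (a hedged reconstruction of the scheme, writing s_1, ..., s_m for the encoder states and h_t for the current decoder state): the decoder scores each encoder state, turns the scores into weights with a softmax, and returns the weighted sum as the attention output:

e_k^{(t)} = \mathrm{score}(h_t, s_k), \qquad a_k^{(t)} = \frac{\exp(e_k^{(t)})}{\sum_{i=1}^{m} \exp(e_i^{(t)})}, \qquad c^{(t)} = \sum_{k=1}^{m} a_k^{(t)} s_k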
Encoder-Decoder with simple RNN
Attention Mechanism
Types of attention in NLP
• Scaled dot-product attention (sketched below)
• Multi-head attention
• Additive attention
• Location-based attention
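A minimal sketch of the first type in the list above, scaled dot-product attention: Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V (tensor sizes are illustrative).

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., L_q, L_k)
    weights = F.softmax(scores, dim=-1)             # attention weights sum to 1 over the keys
    return weights @ V, weights

Q = torch.randn(1, 4, 16)   # (batch, query length, d_k)
K = torch.randn(1, 6, 16)   # (batch, key length, d_k)
V = torch.randn(1, 6, 32)   # (batch, key length, d_v)
context, weights = scaled_dot_product_attention(Q, K, V)
print(context.shape, weights.shape)   # torch.Size([1, 4, 32]) torch.Size([1, 4, 6])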
Multi-head attention
https://jalammar.github.io/illustrated-bert/
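A brief sketch using PyTorch's built-in multi-head attention (8 heads over a 64-dimensional model are illustrative sizes, not from the slides); each head runs its own scaled dot-product attention and the results are concatenated and projected.

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)              # (batch, sequence length, embed_dim)
out, attn_weights = mha(x, x, x)        # self-attention: Q = K = V = x
print(out.shape, attn_weights.shape)    # (2, 10, 64) and (2, 10, 10)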
Transformers
• A transformer model is a neural network that
learns the context of sequential data and
generates new data out of it.
• To put it simply:
• A transformer is a type of artificial intelligence
model that learns to understand and generate
human-like text by analyzing patterns in large
amounts of text data.
Transformers
• The Transformer has an encoder-decoder architecture similar to the previous RNN models, except that all the recurrent connections are replaced by attention modules.
• The transformer model uses N stacked self-attention layers.
• Skip-connections help preserve the positional and identity information from the input sequences (see the encoder-layer sketch below).
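A hedged sketch of one such layer (dimensions are illustrative): self-attention followed by a position-wise feed-forward network, each wrapped in a skip connection plus layer normalization, which is how the input's positional and identity information is carried through.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, n_heads=8, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # self-attention replaces the recurrent connection
        x = self.norm1(x + attn_out)          # skip connection preserves the input information
        x = self.norm2(x + self.ff(x))        # skip connection around the feed-forward block
        return x

x = torch.randn(2, 10, 64)
print(EncoderLayer()(x).shape)                # torch.Size([2, 10, 64])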
They are specifically designed to comprehend
context and meaning by analyzing the relationship
between different elements, and they rely almost
entirely on a mathematical technique called
attention to do so.
The Transformer Architecture
• The Transformer can be seen as a black box composed of two main parts:
• The encoder takes in our input and outputs a matrix representation
of that input. For instance, the English sentence “How are you?”
• The decoder takes in that encoded representation and iteratively
generates an output. In our example, the translated sentence
“¿Cómo estás?”
The Transformer Architecture
• Both the encoder and the decoder are actually stacks with multiple layers (the same number for each). All encoder layers have the same structure; the input enters the first one, and each layer's output is passed to the next.
• All decoder layers also share the same structure and take their input from the last encoder and from the previous decoder layer.
• The original architecture consisted of 6 encoders and 6 decoders, but we can replicate as many layers as we want. So let's assume N layers of each (see the stacking sketch below).
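A brief sketch of that stacking with PyTorch's built-in modules (d_model = 512 and 8 heads mirror the original paper, but the call is only illustrative):

import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # N = 6 encoder layers
x = torch.randn(2, 10, 512)                            # (batch, sequence, d_model)
memory = encoder(x)                                    # representation handed to the decoder
print(memory.shape)                                    # torch.Size([2, 10, 512])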
The Encoder
Workflow
https://towardsml.wordpress.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/
BERT
Bidirectional Encoder Representations
from Transformers
• Mask out 15% of the tokens and predict those [MASK] tokens.
• Bidirectional: Predict the masked token from context on both sides.
• Transformer Encoder, all-to-all self-attention
• Suitable for text representation
BERT Architecture
• Get rid of the decoder.
• Stack a series of Transformer encoder blocks.
• Pre-train with Masked Language Modeling and
Next Sentence Prediction (on massive
datasets).
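A hedged sketch of masked-language-model prediction with a pre-trained BERT, assuming the Hugging Face transformers package is available; the model fills in the [MASK] position from context on both sides.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The movie was really [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))   # top candidate fillers for [MASK]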
Before NN
1. Token embeddings: A [CLS] token is added to
the input word tokens at the beginning of the
first sentence and a [SEP] token is inserted at
the end of each sentence.
2. Segment embeddings: A marker indicating Sentence A or Sentence B is added to each token. This allows the encoder to distinguish between sentences.
3. Positional embeddings: A positional embedding is added to each token to indicate its position in the sentence.
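A brief sketch of steps 1-2 with the Hugging Face BERT tokenizer (assuming the transformers package is available): it inserts the [CLS] and [SEP] tokens and returns the Sentence A / Sentence B markers as token_type_ids.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("this is good", "I like it a lot")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'this', 'is', 'good', '[SEP]', 'i', 'like', 'it', 'a', 'lot', '[SEP]']
print(enc["token_type_ids"])    # 0s for sentence A, 1s for sentence B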
Input text processing
References
• https://www.columbia.edu/~jsl2239/transformers.html
• https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html
• https://transformer-circuits.pub/2021/framework/index.html
• https://www.comet.com/site/blog/explainable-ai-for-transformers/