Bahdanau Attention Mechanism (Also Known As Additive Attention)
Encoder:
>Takes in the source sequence (e.g., “They are watching.”).
>Processes it using recurrent layers (RNN/LSTM/GRU) to get hidden states h1, h2, h3, one for each input token.
What is Bahdanau Attention?
> introduced to overcome the limitations of the
encoder-decoder architecture
-- the challenge of compressing all input
information into a single fixed-length context vector.
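A minimal sketch of the additive scoring step in PyTorch; the layer names (W_enc, W_dec, v) and dimensions are assumptions of mine, not taken from these notes:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Sketch of Bahdanau (additive) attention:
    score(s, h_i) = v^T tanh(W_dec s + W_enc h_i)."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state:  (batch, dec_dim)          current decoder hidden state
        # enc_states: (batch, src_len, enc_dim) encoder hidden states h1..hn
        scores = self.v(torch.tanh(
            self.W_dec(dec_state).unsqueeze(1) + self.W_enc(enc_states)
        )).squeeze(-1)                           # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)
        # Context vector = weighted sum of encoder states, recomputed at every
        # decoding step instead of a single fixed-length context vector.
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        return context, weights
```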
Multi-Head Attention
>allows the model to jointly attend to information
from different representation subspaces at different
positions.
Transformer Architecture
It contains 2 macro-blocks:
1. Encoder
2. Decoder
and a linear layer.
>Embeddings are arrays of floating-point numbers. They can be used to represent different modalities (text, image, video, etc.).
>The same object (a word here) will always have the same embedding, e.g. CAT in the example given above.
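A tiny sketch (toy vocabulary and sizes of my own) showing that an embedding layer is a lookup table, so the same token always maps to the same vector:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"<PAD>": 0, "CAT": 1, "sat": 2}        # toy vocabulary (assumed)
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

ids = torch.tensor([vocab["CAT"], vocab["sat"], vocab["CAT"]])
vectors = emb(ids)                               # (3, 4) array of floats
# Both occurrences of CAT get identical embeddings:
print(torch.equal(vectors[0], vectors[2]))       # True
```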
Positional Encoding:
>want each word to carry some information about its position in the sentence. We want the model to treat words that appear close to each other as 'close' and words that are distant as 'distant'.
Example :
>Hi, I’m Vansh Kharidia, and I’m into tech.
>For even indices of the positional encoding vector (count starts from 0), we use the 1st formula, PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)); for odd indices, we use the 2nd formula, PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)).
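A minimal sketch of these two formulas in PyTorch; max_len and d_model here are values chosen for illustration:

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()            # even indices 0, 2, 4, ...
    denom = 10000 ** (i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / denom)   # even indices -> 1st (sine) formula
    pe[:, 1::2] = torch.cos(pos / denom)   # odd indices  -> 2nd (cosine) formula
    return pe                               # added element-wise to the embeddings

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # torch.Size([50, 512])
```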
Multi-head attention:
What is self-attention?
> Self-attention allowed models to relate words to each other
Multi-Head Attention:
> perform the same computation that we did for single-head attention to get Q, K and V.
> The resultant Q', K' and V' matrices are divided into 4 matrices each (as there are 4 heads). They are divided along the dmodel dimension, so each submatrix contains the entire sentence but only a subsection of the embeddings.
Each submatrix contains dk embedding dimensions (columns) for each word.
In our case, dk = dmodel/h = 512/4 = 128.
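A small sketch of this split (the 512/4 sizes follow the example above; the random Q' just stands in for the projected matrix):

```python
import torch

seq_len, d_model, h = 6, 512, 4
d_k = d_model // h                        # 512 / 4 = 128

Q_prime = torch.randn(seq_len, d_model)   # stands in for Q' = X @ W_Q

# Split along the embedding dimension: each head keeps the whole sentence
# (all seq_len rows) but only d_k of the d_model columns.
heads = Q_prime.view(seq_len, h, d_k).transpose(0, 1)   # (h, seq_len, d_k)

print(heads.shape)      # torch.Size([4, 6, 128])
print(heads[0].shape)   # one head: torch.Size([6, 128]) -> full sentence, d_k dims
```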
Why multiple heads?
>each head sees the entire sentence but only a section of the embeddings. The same word can be used as a noun in one context, an adjective in another context, an adverb in another, etc.
>We can leverage the multi-head architecture, as different heads can learn to relate the same word differently in different contexts (e.g. as a noun, adjective, etc.).
Layer Normalization (Add & Norm):
Feed Forward:
>processes each position in the sequence independently and
helps the model to learn complex representations by applying
non-linear transformations to the input
>First Linear Transformation:
- projects the input to a higher dimensional representation.
>Activation:
- a non-linear activation function is applied to introduce non-linearity into the model.
>Second Linear Transformation:
- projects the higher dimensional representation back to the original dimension.
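A minimal sketch of this feed-forward block, assuming the common d_model = 512 and d_ff = 2048 sizes and ReLU as the non-linearity (these notes don't fix those values):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # first projection: up to d_ff
        self.act = nn.ReLU()                      # non-linearity
        self.linear2 = nn.Linear(d_ff, d_model)   # second projection: back to d_model

    def forward(self, x):
        # x: (batch, seq, d_model); the same weights are applied at every position
        return self.linear2(self.act(self.linear1(x)))

ffn = FeedForward()
out = ffn(torch.randn(2, 10, 512))
print(out.shape)   # torch.Size([2, 10, 512])
```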
Decoder:
It is similar to the encoder. During training, the target sequence (i.e.,
the correct output sequence) is used as input to the decoder. However, it is
shifted to the right by one position.
>Shifting the target sequence allows the model to predict the next token based
on the previous tokens. If the target sequence is [y1, y2, y3, …, yn], it is
transformed to [<START>, y1, y2, y3, …, yn-1] before being fed into the
decoder.
We achieve this by replacing all the scores for future positions with -∞ in the seq × seq attention matrix. After the softmax function is applied, all the -∞ entries become 0.
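A small sketch of this masking step with toy scores: future positions are filled with -∞ before softmax, so their attention weights come out as 0:

```python
import torch

seq = 4
scores = torch.randn(seq, seq)                    # raw attention scores (toy values)

# Upper-triangular mask marks the "future" positions for each query row.
future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(future, float("-inf"))

weights = torch.softmax(masked_scores, dim=-1)
print(weights)   # future entries are exactly 0; each row still sums to 1
```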
Multi-Head Attention & Add and Norm:
>multi-head attention layer gets keys and values matrices from the
encoder’s output and the query from the output of the masked multi-head
attention.
Linear Layer:
>transforms the output of the previous layer to a different
dimensionality, to match the number of classes in the output
vocabulary.
>performs a matrix multiplication between the input and a weight matrix,
followed by the addition of a bias term.
>transforms the final hidden state outputs from the decoder into logits.
Softmax:
>Converts the logits from the linear layer into probabilities by applying the softmax function.
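A minimal sketch of this linear projection plus softmax; the vocab size and shapes here are assumptions for illustration:

```python
import torch
import torch.nn as nn

d_model, vocab_size, seq = 512, 10000, 7        # vocab_size is assumed
decoder_out = torch.randn(seq, d_model)          # final decoder hidden states

proj = nn.Linear(d_model, vocab_size)            # matrix multiplication + bias
logits = proj(decoder_out)                       # (seq, vocab_size)
probs = torch.softmax(logits, dim=-1)            # probabilities per position

next_token_id = probs[-1].argmax()               # e.g. greedy pick for the last position
print(logits.shape, probs.sum(dim=-1))           # each row of probs sums to 1
```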
Training a transformer:
>We convert the input (English) sentence into embeddings
-- add the positional encoding to form the encoder input of dimension seq × dmodel.
Decoder:
-- add a <SOS> token to the expected Italian translation:
> pass this input now as the decoder input. It is converted to output
embeddings and we add positional encoding to it to form the input to the
masked multi-head attention.
>pass the keys and values from the encoder output and the query from the
output of the masked multi-head attention layer to the decoder (multi-head
attention + feed forward layer). We get the decoder output.
> The decoder output is of dimension seq × dmodel; it is still an embedding.
-- The linear layer now converts the seq × dmodel matrix into a seq × vocab_size matrix, and we apply softmax to it.
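A compressed sketch of this training step using PyTorch's built-in nn.Transformer as a stand-in for the architecture described above; the sizes, token ids, and the omission of positional encoding are simplifications of mine:

```python
import torch
import torch.nn as nn

# Toy sizes/ids of my own choosing; positional encoding is omitted for brevity.
vocab_size, d_model, pad_id, sos_id = 1000, 512, 0, 1

embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=8, batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)           # the final linear layer
loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)  # applies (log-)softmax internally

src = torch.randint(2, vocab_size, (1, 6))          # English token ids (toy)
tgt = torch.randint(2, vocab_size, (1, 5))          # Italian token ids (toy)

# Shift right: the decoder input starts with <SOS>; the labels are the unshifted target.
dec_in = torch.cat([torch.full((1, 1), sos_id), tgt[:, :-1]], dim=1)
L = dec_in.size(1)
causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)

out = model(embed(src), embed(dec_in), tgt_mask=causal_mask)   # (1, 5, d_model)
logits = to_vocab(out)                                         # (1, 5, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), tgt.reshape(-1))
loss.backward()
```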
Inferencing a Transformer:
Encoder:
The encoder part stays the same as in training; we provide the input:
<SOS>I love you very much<EOS>.
Decoder:
Time Step 1:
>get the key and value matrix from the encoder output
>query matrix from the masked multi-head attention layer’s output for the
input of the decoder.
>Then, we get the decoder output, which is passed through the linear layer and softmax.
>output of the linear layer is known as logits.
-- The softmax is applied and the token from our vocabulary with the highest probability is selected as the model's predicted output.
-- We get the first token (which follows <SOS>) in the 1st time step. This first token that we generated is ti.
Time Step 2:
>2nd time step, we don’t need to recompute the encoder output as our
input (English sentence) didn’t change, so the encoder output will not
change.
>append the output of the previous step (ti) to the decoder input
sequence (<SOS> ti) and feed this as the input to the decoder layer.
>We repeat the same process as in the 1st time step: form the decoder input through the masked multi-head attention, pass it to the decoder, and convert its output by passing it through the linear and softmax layers to get the next token.
Time Step 3:
We pass <SOS> + the decoder output so far (ti amo).
We get the next token, molto.
Time Step 4:
Now we get the <EOS> token. Once we get <EOS>, we stop and no more tokens are generated.
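A sketch of this time-step loop as code; model.encode and model.decode are hypothetical helpers standing in for the encoder and decoder stacks described above, not a specific library API:

```python
import torch

def greedy_decode(model, src_ids, sos_id, eos_id, max_len=50):
    """Inference loop: the encoder output is computed once, and the decoder
    input grows by one token per time step. `model.encode` / `model.decode`
    are assumed helpers for the encoder and decoder stacks."""
    memory = model.encode(src_ids)              # keys/values, computed once
    out_ids = [sos_id]                          # time step 1 starts from <SOS>
    for _ in range(max_len):
        dec_in = torch.tensor(out_ids).unsqueeze(0)
        logits = model.decode(dec_in, memory)   # (1, len(out_ids), vocab_size)
        next_id = logits[0, -1].argmax().item() # token with the highest probability
        out_ids.append(next_id)
        if next_id == eos_id:                   # stop once <EOS> is generated
            break
    return out_ids[1:]                          # drop the leading <SOS>
```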
Inferencing Strategy:
>At every step we selected the word with the maximum softmax value; this is called a greedy strategy.
-- It usually doesn't perform very well.
Beam Search:
>At each step, instead of selecting the
word with the maximum softmax value, we choose the top B words and
evaluate all the possible next words for each of them.
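A compact sketch of one way to implement beam search with beam width B, reusing the hypothetical model.encode/model.decode helpers from the previous sketch; scoring beams by summed log-probabilities is a common choice, not something stated in these notes:

```python
import torch

def beam_search(model, src_ids, sos_id, eos_id, B=3, max_len=50):
    memory = model.encode(src_ids)
    beams = [([sos_id], 0.0)]                    # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            if ids[-1] == eos_id:                # finished beams carry over unchanged
                candidates.append((ids, score))
                continue
            logits = model.decode(torch.tensor(ids).unsqueeze(0), memory)
            log_probs = torch.log_softmax(logits[0, -1], dim=-1)
            top_lp, top_id = log_probs.topk(B)   # expand each beam by its top-B words
            for lp, tok in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((ids + [tok], score + lp))
        # keep only the B best candidates overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
        if all(ids[-1] == eos_id for ids, _ in beams):
            break
    return beams[0][0]
```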
From slide
Top-left Block:
You see the phrase: Queen and king.
Green rows represent word embeddings.
Below them are positional encodings (e.g., [0.01, 0.04, ..., 0.24]).
These are added together element-wise → input to the Transformer.
Summary
Trainable positional embeddings are an alternative to fixed sinusoidal encodings.
They allow the model to learn optimal positional patterns for a specific task.
In this formulation, position vectors are used directly in the attention score computation.
It improves flexibility but may not extrapolate well to longer sequences.
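A minimal sketch of trainable positional embeddings, with max_len, vocab size, and d_model chosen arbitrarily:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_len=512, d_model=512):
        super().__init__()
        self.tok = nn.Embedding(10000, d_model)    # toy vocab size (assumed)
        self.pos = nn.Embedding(max_len, d_model)  # trainable position vectors

    def forward(self, token_ids):                  # token_ids: (batch, seq)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)   # broadcast over the batch

emb = LearnedPositionalEmbedding()
x = emb(torch.randint(0, 10000, (2, 20)))
print(x.shape)   # torch.Size([2, 20, 512])
# Positions beyond max_len have no learned vector, which is why this variant
# may not extrapolate well to longer sequences.
```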
What is a Positionwise FFN?
>After the multi-head self-attention layer in a Transformer, each token's embedding is passed through the same feed-forward network (FFN).
Why “Positionwise”?
The same FFN (same weights) is applied to each position
(token) in the sequence independently.
In Transformers, Add & Norm is a step used after sublayers like:
Multi-head self-attention
Positionwise feed-forward networks
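A minimal sketch of the Add & Norm wrapper in its post-norm form, LayerNorm(x + Sublayer(x)); the dropout and sizes are assumptions:

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # residual add, then normalize
        return self.norm(x + self.dropout(sublayer(x)))

block = AddAndNorm()
x = torch.randn(2, 10, 512)
out = block(x, sublayer=nn.Linear(512, 512))   # any sublayer with matching output dim
print(out.shape)   # torch.Size([2, 10, 512])
```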
What is Batch Normalization?
> technique to standardize the inputs to a layer for each mini-batch.
It stabilizes and accelerates training by:
Reducing internal covariate shift (i.e., the change in the distribution of network activations due to updates in
parameters).
Helping gradients flow through the network.
Enabling faster convergence.
Providing some regularization (like a mild dropout effect).
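A small sketch contrasting BatchNorm (statistics per feature, across the mini-batch) with the LayerNorm used in Transformers (statistics per token, across its features); the tensor shapes are toy values:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16)      # (batch, features)

bn = nn.BatchNorm1d(16)     # normalizes each feature across the mini-batch
ln = nn.LayerNorm(16)       # normalizes each sample across its features

bn.train()                  # running mean/var are updated during training
x_bn = bn(x)
x_ln = ln(x)

print(x_bn.mean(dim=0))     # ~0 per feature (computed across the batch)
print(x_ln.mean(dim=1))     # ~0 per sample (computed across the features)
```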
>What is Masked Self-Attention?
-- Each token can only attend to previous or current tokens, not future tokens. This is critical in language generation tasks where the model generates one word at a time.