

Bahdanau Attention Mechanism (also known as Additive Attention)


> introduced to improve the basic encoder-decoder architecture
used in sequence-to-sequence models for tasks like machine
translation.

Components of the Diagram

Encoder:
>Takes in the source sequence (e.g., “They are watching.”).
>Processes it using recurrent layers (RNN/LSTM/GRU) to get
hidden states h1,h2,h3, one for each input token.

What is Bahdanau Attention?
> introduced to overcome the limitations of the
encoder-decoder architecture
-- the challenge of compressing all input
information into a single fixed-length context vector.

>Instead of a single context vector, Bahdanau Attention computes a different
context vector for each output word, letting the decoder focus on relevant
parts of the input during decoding.
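
As a rough sketch of the scoring step (not the paper's actual code): assume a previous decoder state s and encoder states h1..hn stacked into a matrix; the weight names W_s, W_h and v below are illustrative.

import torch

hidden_dim, seq_len = 128, 3                    # hypothetical sizes
torch.manual_seed(0)
s_prev = torch.randn(hidden_dim)                # previous decoder hidden state s_{t-1}
H = torch.randn(seq_len, hidden_dim)            # encoder hidden states h_1..h_n

# Additive (Bahdanau) scoring: e_j = v^T tanh(W_s s_{t-1} + W_h h_j)
W_s = torch.randn(hidden_dim, hidden_dim)
W_h = torch.randn(hidden_dim, hidden_dim)
v = torch.randn(hidden_dim)

scores = torch.tanh(s_prev @ W_s + H @ W_h) @ v     # one score per input token
alpha = torch.softmax(scores, dim=0)                # attention weights, sum to 1
context = alpha @ H                                 # context vector for this output step
print(alpha, context.shape)

A new context vector like this is computed for every decoding step, which is what lets the decoder focus on different input words each time.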

Multi-Head Attention
>allows the model to jointly attend to information
from different representation subspaces at different
positions.

>Instead of calculating attention once (single-head),
we split the input into multiple smaller parts (heads),
perform attention in parallel, and combine the results
(a short sketch in code follows the next list).

Why Multiple Heads?


Multiple heads let the model:
>Attend to different positions
>Capture various linguistic features (e.g., verb-object relations, word types)
>Enhance the model's capacity and expressiveness
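
A rough sketch of this split-attend-combine idea, assuming d_model = 512 and h = 4 heads (the values used later in these notes); the four projection layers are illustrative.

import torch
import torch.nn as nn

d_model, h = 512, 4
d_k = d_model // h                                   # 128 per head

W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
W_o = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(1, 6, d_model)                       # (batch, seq, d_model)
B, S, _ = x.shape

def split_heads(t):                                  # (B, S, d_model) -> (B, h, S, d_k)
    return t.view(B, S, h, d_k).transpose(1, 2)

Q, K, V = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

# each head attends over the whole sentence, but only over d_k of the 512 dims
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5        # (B, h, S, S)
attn = torch.softmax(scores, dim=-1)
heads = attn @ V                                     # (B, h, S, d_k)

# concatenate the heads and mix them with the output projection
out = W_o(heads.transpose(1, 2).reshape(B, S, d_model))
print(out.shape)                                     # torch.Size([1, 6, 512])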


Transformer Architecture

It contains 2 macro-blocks:
1. Encoder
2. Decoder
and a linear layer.

>Embeddings are an array of floating point numbers.
They can be used to represent different modalities
(text, image, video, etc.)

> the same object (word here) will always have the
same embedding. E.g. CAT in the example given above.

> convert each word into an embedding of size 512
(containing 512 floating point numbers)
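
A tiny sketch of this lookup, with a made-up vocabulary size and a made-up index standing in for CAT:

import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512          # hypothetical vocabulary size
emb = nn.Embedding(vocab_size, d_model)    # lookup table of floating point vectors

cat_id = torch.tensor([42])                # pretend index of the word "CAT"
e1, e2 = emb(cat_id), emb(cat_id)
print(e1.shape)                            # torch.Size([1, 512])
print(torch.equal(e1, e2))                 # True: same word -> same embedding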

Positional Encoding:
>want each word to carry some information about its
position in the sentence. We want the model to treat
words that appear close to each other
as ‘close’ and words that are distant as ‘distant’.

Example :
>Hi, I’m Vansh Kharidia, and I’m into tech.

Vansh and Kharidia are close to each other by seeing the
sentence, but the model doesn’t have this information.
Positional encoding is used to give this information to
the model.
-- We want the positional encoding to represent a pattern
(e.g. Vansh is followed by Kharidia) that can be learned
by the model

We add a position embedding vector of size 512 to our original embedding.
The values in the position encoding vector are calculated only once and
reused for every sentence during training and inference.

Encoder input = Embedding + Position Embedding

How are position embeddings calculated?

>For even positions in the position embedding (count starts from 0), we use
the 1st formula, and for odd positions in the position embeddings, we use
the 2nd formula.
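
The two formulas are the standard sinusoidal ones: PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)) for the even positions of the vector and PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)) for the odd ones. A small sketch of the table, computed once and reused, with an illustrative maximum length:

import torch

d_model, max_len = 512, 5000                 # max_len is an assumption for the sketch

pos = torch.arange(max_len).unsqueeze(1)     # (max_len, 1) positions in the sentence
i = torch.arange(0, d_model, 2)              # even dimension indices 0, 2, ..., 510
div = torch.pow(10_000.0, i / d_model)       # 10000^(2i/d_model)

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos / div)           # 1st formula: even positions of the vector
pe[:, 1::2] = torch.cos(pos / div)           # 2nd formula: odd positions of the vector

# computed once, then reused for every sentence:
# encoder_input = embedding + pe[:seq_len]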

Why are trigonometric functions used here?

Multi-head attention:

What is self-attention?
> Self-attention allows the model to relate words in a sequence to each other.

Multi-Head Attention:

> the encoder input embedding is passed 3 times (as Q, K, V) into
multi-head attention and once into Add & Norm (the residual path).
In this example the multi-head attention uses h = 4 heads.

> perform the same computation that we did for single-head attention for
Q, K and V

> the resultant Q’, K’ and V’ matrices are divided into 4 submatrices each (as there
are 4 heads). They are split along the dmodel dimension, so each submatrix contains
the entire sentence but only a subsection of the embedding dimensions.
Each submatrix contains dk columns (embedding dimensions) for each word.
In our case, dk = dmodel/h = 512/4 = 128.
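
A quick shape check of this split (random values stand in for Q’):

import torch

seq_len, d_model, h = 6, 512, 4
d_k = d_model // h                                        # 512 / 4 = 128

Q_prime = torch.randn(seq_len, d_model)
heads = Q_prime.view(seq_len, h, d_k).permute(1, 0, 2)    # (4, seq, 128)
print(heads.shape)   # each of the 4 submatrices still covers all 6 words,
                     # but only 128 of the 512 embedding dimensions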

Why multiple heads?
>each head sees the entire sentence but only a section of the
embeddings. The same word can be used as a noun in one context, as an
adjective in another context, as an adverb in yet another context, etc.

>We can leverage the multi-head architecture: different heads can learn to
relate the same word in different contexts (e.g. as noun, adjective, etc.).

Layer Normalization (Add & Norm):

We normalize the values so that each item has zero mean and unit variance.
We also introduce 2 learnable parameters, usually called beta and gamma.

>Gamma is multiplicative; we multiply it with the normalized value.

>Beta is additive; we add beta to the product of gamma and the
normalized value.

>Beta and gamma re-introduce some fluctuations in the data, as
forcing every value into a strictly normalized distribution may be
too restrictive for the network.

>the network will learn to tune beta and gamma to introduce these
fluctuations
-- beta and gamma control which values are amplified and
by how much.

Batch vs Layer Normalization:

>batch normalization:
-- considers the same feature across the entire batch
> layer normalization:
-- considers all the features of one item in the batch
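
A rough sketch of layer normalization with gamma and beta; the batch-norm contrast is only in which axis the statistics are taken over.

import torch

x = torch.randn(8, 512)                       # batch of 8 items, 512 features each

# Layer normalization: statistics over the features of each item
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + 1e-5)   # zero mean, unit variance per item

gamma = torch.ones(512, requires_grad=True)   # multiplicative, learned
beta = torch.zeros(512, requires_grad=True)   # additive, learned
y = gamma * x_hat + beta                      # lets the network re-introduce variation

# Batch normalization would instead take mean/var over dim=0:
# the same feature across the whole batch, rather than all features of one item.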

Feed Forward & Add and Norm:

Feed Forward:
>processes each position in the sequence independently and
helps the model to learn complex representations by applying
non-linear transformations to the input

>fully connected feed-forward network that is applied to each
position separately and identically. It consists of two linear transformations
with a ReLU activation in between

>First Linear Transformation:
- projects the input into a higher dimensional space
>ReLU Activation:
- non-linear activation function applied to introduce
non-linearity into the model.
>Second Linear Transformation:
- projects the higher dimensional representation back
to the original dimension.
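
A minimal sketch of this block; d_ff = 2048 is the inner size used in the original Transformer paper and is assumed here.

import torch
import torch.nn as nn

d_model, d_ff = 512, 2048                     # d_ff assumed from the original paper

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),                 # project up to a higher dimension
    nn.ReLU(),                                # non-linearity
    nn.Linear(d_ff, d_model),                 # project back to the original dimension
)

x = torch.randn(1, 6, d_model)                # applied to each position independently
print(ffn(x).shape)                           # torch.Size([1, 6, 512])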

Add & Norm:

>performed after the feed-forward layer.
>Residual Connection (Add):
-- the input of the FFN (sublayer) is added to its output.
This is known as a skip connection or residual connection and helps in
addressing the vanishing gradient problem, facilitating better gradient
flow through the network.
>Layer Normalization (Norm):
-- Layer normalization normalizes the summed vectors to
have zero mean and unit variance, which helps in stabilizing and
accelerating the training process
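
Putting the two steps together (a sketch; the Linear layer below is just a stand-in for the attention or feed-forward sublayer):

import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)        # stand-in for attention or the FFN

x = torch.randn(1, 6, d_model)
out = norm(x + sublayer(x))                   # Add (residual/skip connection), then Norm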

Decoder:

Output Embedding (& Positional Encoding):

It is similar to the encoder. During training, the target sequence (i.e.,
the correct output sequence) is used as input to the decoder. However, it is
shifted to the right by one position.

>Shifting the target sequence allows the model to predict the next token based
on the previous tokens. If the target sequence is [y1, y2, y3, …, yn], it is
transformed to [<START>, y1, y2, y3, …, yn-1] before being fed into the
decoder.
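
A tiny sketch of this shift, with made-up token ids:

import torch

START = 1                                         # hypothetical <START>/<SOS> id
target = torch.tensor([11, 12, 13, 14])           # [y1, y2, y3, y4]

decoder_input = torch.cat([torch.tensor([START]), target[:-1]])   # [<START>, y1, y2, y3]
labels = target                                   # the model predicts y1..y4 from it
print(decoder_input.tolist(), labels.tolist())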

Masked Multi-Head Attention & Add and Norm:

What is masked multi-head attention?

>The decoder's self-attention must be causal, meaning that the output at a certain
position can only depend on the words at the previous positions. The model
must not be able to see future words.

We achieve this by replacing the scores of all the future words with -infinity in the
seq * seq attention matrices. After the softmax function is applied, all the -infinity
entries become 0.
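
A sketch of this masking on a seq * seq score matrix:

import torch

seq = 4
scores = torch.randn(seq, seq)                            # raw attention scores

mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)   # future positions
scores = scores.masked_fill(mask, float('-inf'))          # replace future words with -inf

weights = torch.softmax(scores, dim=-1)                   # -inf entries get 0 weight
print(weights)   # upper triangle is exactly 0: no attention to future tokens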

Multi-Head Attention & Add and Norm:

>multi-head attention layer gets keys and values matrices from the
encoder’s output and the query from the output of the masked multi-head
attention.
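
A sketch of this wiring using torch.nn.MultiheadAttention, just to make explicit which tensor plays query, key and value; the shapes are illustrative.

import torch
import torch.nn as nn

d_model, h = 512, 4
cross_attn = nn.MultiheadAttention(d_model, h, batch_first=True)

enc_out = torch.randn(1, 6, d_model)      # encoder output -> keys and values
dec_x = torch.randn(1, 5, d_model)        # output of masked multi-head attention -> queries

out, _ = cross_attn(query=dec_x, key=enc_out, value=enc_out)
print(out.shape)                          # torch.Size([1, 5, 512])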

Feed Forward & Add and Norm:


> consists of the same structure and serves the
same purpose (introducing non-linearity to make the model
learn complex representations).

Linear layer and Softmax:

Linear Layer:
>transforms the output of the previous layer to a different
dimensionality, to match the number of classes in the output
vocabulary.
>performs a matrix multiplication between the input and a weight matrix,
followed by the addition of a bias term; it transforms the final hidden state
outputs from the decoder into logits
Softmax:
>Converts the logits from linear layer into probabilities
by applying the softmax function.
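
A sketch of these two steps, with a made-up vocabulary size:

import torch
import torch.nn as nn

d_model, vocab_size = 512, 32_000              # hypothetical vocabulary size
proj = nn.Linear(d_model, vocab_size)          # weight matrix multiply + bias

dec_out = torch.randn(1, 5, d_model)           # decoder hidden states
logits = proj(dec_out)                         # (1, 5, vocab_size)
probs = torch.softmax(logits, dim=-1)          # probabilities over the vocabulary
print(probs.sum(dim=-1))                       # each position sums to 1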

Training a transformer:

Encoder:
>We convert the input (English) sentence into embeddings
-- add the positional encoding to form the encoder input
of dimension seq * dmodel.

Decoder:
-- add a <SOS> token to the expected Italian translation:

- As we add the <SOS>, the output is shifted right,
as expected by the model.
Our input is of 4 tokens (including <SOS>), so we add 996
padding (<PAD>) tokens.

> pass this input now as the decoder input. It is converted to output
embeddings and we add positional encoding to it to form the input to the
masked multi-head attention.

>pass the keys and values from the encoder output and the query from the
output of the masked multi-head attention layer to the decoder (multi-head
attention + feed forward layer). We get the decoder output.

> the decoder output is of the dimension seq * dmodel,
so it is still an embedding
-- the linear layer will now convert the seq * dmodel matrix
into a seq * vocab size matrix, and we apply softmax to it.

The expected output (the label against which the loss is computed) is the
target sentence shifted left, ending with <EOS>: ti amo molto <EOS>.

Transformer Training Advantage:


>All of this happens in 1 time step for the entire sequence (unlike previous
architectures like RNNs which required n time steps). So transformers made
it very easy and fast to train extremely long sequences with great
performance.
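
A sketch of why this is one step: the loss over every position comes from a single forward pass (the logits and labels below are random stand-ins for the real model output and the expected Italian token ids).

import torch
import torch.nn.functional as F

seq_len, vocab_size = 4, 32_000
logits = torch.randn(seq_len, vocab_size, requires_grad=True)  # one forward pass, all positions
labels = torch.tensor([101, 102, 103, 2])      # e.g. ti, amo, molto, <EOS> (made-up ids)

loss = F.cross_entropy(logits, labels)         # loss for the whole sequence at once
loss.backward()                                # no per-token time steps as in an RNN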

Inferencing a Transformer:
Encoder:

The encoder part stays the same as in training; we provide the input:
<SOS>I love you very much<EOS>.

Decoder:
Time Step 1:

Instead of providing the entire translation preceded by <SOS> and followed
by padding <PAD> tokens as during training, we provide only <SOS> as the
decoder input, followed by no padding tokens during inferencing.

>get the key and value matrices from the encoder output
>get the query matrix from the masked multi-head attention layer’s output for the
decoder input.
>Then, we get the decoder output, which is passed
through the linear layer and softmax.
>the output of the linear layer is known as logits.
-- the token from our vocabulary with the highest softmax
probability is selected as the model’s predicted output.

-- we get the first token (which follows <SOS>) in the 1st time
step. This first token that we generated is ti.

Time Step 2:

>2nd time step, we don’t need to recompute the encoder output as our
input (English sentence) didn’t change, so the encoder output will not
change.

>append the output of the previous step (ti) to the decoder input
sequence (<SOS> ti) and feed this as the input to the decoder layer.

>repeat the same process as in the 1st time step: convert the decoder input
through the output embedding, positional encoding and masked multi-head attention,
pass it to the decoder, and convert its output by passing it through the linear and
softmax layers to get the next token.

Time Step 3:

we pass <SOS> + the decoder output generated so far (ti amo).
We get the next token: molto

Time Step 4:

Now, we get the <EOS> token. As we get the <EOS>, we stop and no more
tokens are generated.

Inferencing Strategy:
>At every step we selected the word with the maximum
softmax value; this is called a greedy strategy
-- it usually doesn’t perform very well.
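
A sketch of the greedy loop described in the time steps above; next_token_logits is a hypothetical stand-in for the full encoder/decoder stack.

import torch

SOS, EOS, vocab_size = 1, 2, 32_000            # made-up special token ids
max_len = 20

def next_token_logits(decoder_input):
    # stand-in for: output embedding + positional encoding -> masked MHA ->
    # cross-attention with the (cached) encoder output -> FFN -> linear layer
    return torch.randn(vocab_size)

decoder_input = [SOS]
for _ in range(max_len):
    logits = next_token_logits(torch.tensor(decoder_input))
    next_token = int(torch.argmax(torch.softmax(logits, dim=-1)))  # greedy: pick the max
    decoder_input.append(next_token)                               # append and repeat
    if next_token == EOS:                                          # stop on <EOS>
        break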

Beam Search:
>At each step, instead of selecting the
word with the maximum softmax value, we choose the top B words and
evaluate all the possible next words for each of them.
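
A simplified beam-search sketch under the same kind of dummy scoring function (B is the beam width; real implementations also handle length normalization).

import torch

B, vocab_size, max_len, SOS, EOS = 3, 100, 10, 1, 2

def next_token_log_probs(seq):
    # stand-in for running the decoder on `seq` and taking log-softmax of the logits
    return torch.log_softmax(torch.randn(vocab_size), dim=-1)

beams = [([SOS], 0.0)]                                   # (sequence, cumulative log-prob)
for _ in range(max_len):
    candidates = []
    for seq, score in beams:
        if seq[-1] == EOS:                               # keep finished beams as they are
            candidates.append((seq, score))
            continue
        log_probs = next_token_log_probs(seq)
        top = torch.topk(log_probs, B)                   # top-B next words for this beam
        for lp, tok in zip(top.values, top.indices):
            candidates.append((seq + [int(tok)], score + float(lp)))
    beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]   # keep best B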

From slide

LEFT SECTION: Fixed Sinusoidal Encoding

Top-left Block:
You see the phrase: Queen and king.
Green rows represent word embeddings.
Below them are positional encodings (e.g., [0.01, 0.04, ..., 0.24]).
These are added together element-wise → input to the Transformer.

✅ These positional encodings are:


Fixed (not learned),
Use sin and cos functions across dimensions,
Immutable during training.

This satisfies key requirements:


Encodes position uniquely.
Smooth patterns across positions.
Supports extrapolation beyond training length.

Summary
Trainable positional embeddings are an alternative to fixed sinusoidal encodings.
They allow the model to learn optimal positional patterns for a specific task.
In this formulation, position vectors are used directly in the attention score computation.
It improves flexibility but may not extrapolate well to longer sequences.
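
A sketch of the trainable alternative; unlike the sinusoidal table, these vectors are learned parameters, and positions beyond max_len are simply undefined.

import torch
import torch.nn as nn

d_model, max_len = 512, 512                      # max_len fixed in advance
pos_emb = nn.Embedding(max_len, d_model)         # learned position vectors

seq_len = 6
positions = torch.arange(seq_len)                # 0, 1, ..., seq_len-1
x = torch.randn(1, seq_len, d_model)             # token embeddings
x = x + pos_emb(positions)                       # added just like the fixed encoding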

What is a Positionwise FFN?
>After the multi-head self-attention layer in a Transformer,
each token’s embedding is passed through the same
feed-forward network (FFN)

Why “Positionwise”?
The same FFN (same weights) is applied to each position
(token) in the sequence independently.

In Transformers, Add & Norm is a step used after sublayers like:
Multi-head self-attention
Positionwise feed-forward networks


The process includes:
1. Add: Adding the input to the output of the sublayer (residual connection)
2. Norm: Applying Layer Normalization to the result

Main Objective of Normalization


The key goals are:
Stabilize training by reducing fluctuations in layer input distributions
Prevent exploding or vanishing gradients
Accelerate convergence

What is Batch Normalization?
> technique to standardize the inputs to a layer for each mini-batch.
It stabilizes and accelerates training by:
Reducing internal covariate shift (i.e., the change in the distribution of network activations due to updates in
parameters).
Helping gradients flow through the network.
Enabling faster convergence.
Providing some regularization (like a mild dropout effect).
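
A small sketch of the per-mini-batch standardization (each feature is normalized across the batch dimension):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)              # 4 features, illustrative
x = torch.randn(8, 4)                            # mini-batch of 8 items

y = bn(x)                                        # each column standardized over the batch
print(y.mean(dim=0))                             # approximately 0 per feature
print(y.std(dim=0, unbiased=False))              # approximately 1 per feature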

>What is Masked Self-Attention?
-- each position can only attend to previous or current tokens,
not future tokens. This is critical in language
generation tasks where the model generates one word
at a time.

