

Bahdanau Attention Mechanism (also known as Additive Attention)


> introduced to improve the basic encoder-decoder architecture
used in sequence-to-sequence models for tasks like machine
translation.

Components of the Diagram

Encoder:
>Takes in the source sequence (e.g., “They are watching.”).
>Processes it using recurrent layers (RNN/LSTM/GRU) to get
hidden states h1,h2,h3, one for each input token.

What is Bahdanau Attention?
> introduced to overcome the limitations of the
encoder-decoder architecture
-- the challenge of compressing all input
information into a single fixed-length context vector.

>Instead of a single context vector, Bahdanau Attention computes a different
context vector for each output word, letting the decoder focus on relevant
parts of the input during decoding.
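
As a rough sketch of the scoring step (not the paper's actual code): assume a previous decoder state s and encoder states h1..hn stacked into a matrix; the weight names W_s, W_h and v below are illustrative.

import torch

hidden_dim, seq_len = 128, 3                    # hypothetical sizes
torch.manual_seed(0)
s_prev = torch.randn(hidden_dim)                # previous decoder hidden state s_{t-1}
H = torch.randn(seq_len, hidden_dim)            # encoder hidden states h_1..h_n

# Additive (Bahdanau) scoring: e_j = v^T tanh(W_s s_{t-1} + W_h h_j)
W_s = torch.randn(hidden_dim, hidden_dim)
W_h = torch.randn(hidden_dim, hidden_dim)
v = torch.randn(hidden_dim)

scores = torch.tanh(s_prev @ W_s + H @ W_h) @ v     # one score per input token
alpha = torch.softmax(scores, dim=0)                # attention weights, sum to 1
context = alpha @ H                                 # context vector for this output step
print(alpha, context.shape)

A new context vector like this is computed for every decoding step, which is what lets the decoder focus on different input words each time.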

Multi-Head Attention
>allows the model to jointly attend to information
from different representation subspaces at different
positions.

>Instead of calculating attention once (single-head),
we split the input into multiple smaller parts (heads),
perform attention in parallel, and combine the results
(a short sketch in code follows the next list).

Why Multiple Heads?


Multiple heads let the model:
>Attend to different positions
>Capture various linguistic features (e.g., verb-object relations, word types)
>Enhance the model's capacity and expressiveness
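
A rough sketch of this split-attend-combine idea, assuming d_model = 512 and h = 4 heads (the values used later in these notes); the four projection layers are illustrative.

import torch
import torch.nn as nn

d_model, h = 512, 4
d_k = d_model // h                                   # 128 per head

W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
W_o = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(1, 6, d_model)                       # (batch, seq, d_model)
B, S, _ = x.shape

def split_heads(t):                                  # (B, S, d_model) -> (B, h, S, d_k)
    return t.view(B, S, h, d_k).transpose(1, 2)

Q, K, V = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

# each head attends over the whole sentence, but only over d_k of the 512 dims
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5        # (B, h, S, S)
attn = torch.softmax(scores, dim=-1)
heads = attn @ V                                     # (B, h, S, d_k)

# concatenate the heads and mix them with the output projection
out = W_o(heads.transpose(1, 2).reshape(B, S, d_model))
print(out.shape)                                     # torch.Size([1, 6, 512])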


Transformer Architecture

It contains 2 macro-blocks:
1. Encoder
2. Decoder
and a linear layer.

>Embeddings are an array of floating point numbers.
They can be used to represent different modalities
(text, image, video, etc.)

> the same object (word here) will always have the
same embedding. E.g. CAT in the example given above.

> convert each word into an embedding of size 512
(containing 512 floating point numbers)
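
A tiny sketch of this lookup, with a made-up vocabulary size and a made-up index standing in for CAT:

import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512          # hypothetical vocabulary size
emb = nn.Embedding(vocab_size, d_model)    # lookup table of floating point vectors

cat_id = torch.tensor([42])                # pretend index of the word "CAT"
e1, e2 = emb(cat_id), emb(cat_id)
print(e1.shape)                            # torch.Size([1, 512])
print(torch.equal(e1, e2))                 # True: same word -> same embedding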

Positional Encoding:
>want each word to carry some information about its
position in the sentence. We want the model to treat
words that appear close to each other
as ‘close’ and words that are distant as ‘distant’.

Example :
>Hi, I’m Vansh Kharidia, and I’m into tech.

Vansh and Kharidia are close to each other by seeing the
sentence, but the model doesn’t have this information.
Positional encoding is used to give this information to
the model.
-- We want the positional encoding to represent a pattern
(e.g. Vansh is followed by Kharidia) that can be learned
by the model

We add a position embedding vector of size 512 to our original embedding.
The values in the position encoding vector are calculated only once and
reused for every sentence during training and inference.

Encoder input = Embedding + Position Embedding

How are position embeddings calculated?

>For even positions in the position embedding (count starts from 0), we use
the 1st formula, and for odd positions in the position embeddings, we use
the 2nd formula.
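
The two formulas are the standard sinusoidal ones: PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)) for the even positions of the vector and PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)) for the odd ones. A small sketch of the table, computed once and reused, with an illustrative maximum length:

import torch

d_model, max_len = 512, 5000                 # max_len is an assumption for the sketch

pos = torch.arange(max_len).unsqueeze(1)     # (max_len, 1) positions in the sentence
i = torch.arange(0, d_model, 2)              # even dimension indices 0, 2, ..., 510
div = torch.pow(10_000.0, i / d_model)       # 10000^(2i/d_model)

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos / div)           # 1st formula: even positions of the vector
pe[:, 1::2] = torch.cos(pos / div)           # 2nd formula: odd positions of the vector

# computed once, then reused for every sentence:
# encoder_input = embedding + pe[:seq_len]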

Why are trigonometric functions used here?

Multi-head attention:

What is self-attention?
> Self-attention allows the model to relate words in a sequence to each other.

Multi-Head Attention:

> the encoder input embedding is passed 3 times (as Q, K, V) into
multi-head attention and once into Add & Norm (the residual path).
In this example the multi-head attention uses h = 4 heads.

> perform the same computation that we did for single-head attention for
Q, K and V

> the resultant Q’, K’ and V’ matrices are divided into 4 submatrices each (as there
are 4 heads). They are split along the dmodel dimension, so each submatrix contains
the entire sentence but only a subsection of the embedding dimensions.
Each submatrix contains dk columns (embedding dimensions) for each word.
In our case, dk = dmodel/h = 512/4 = 128.
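
A quick shape check of this split (random values stand in for Q’):

import torch

seq_len, d_model, h = 6, 512, 4
d_k = d_model // h                                        # 512 / 4 = 128

Q_prime = torch.randn(seq_len, d_model)
heads = Q_prime.view(seq_len, h, d_k).permute(1, 0, 2)    # (4, seq, 128)
print(heads.shape)   # each of the 4 submatrices still covers all 6 words,
                     # but only 128 of the 512 embedding dimensions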

Why multiple heads?
>each head sees the entire sentence but only a section of the
embeddings. The same word can be used as a noun in one context, as an
adjective in another context, as an adverb in yet another context, etc.

>We can leverage the multi-head architecture: different heads can learn to
relate the same word in different contexts (e.g. as noun, adjective, etc.).

Layer Normalization (Add & Norm):

We normalize the values so that each item has zero mean and unit variance.
We also introduce 2 learnable parameters, usually called beta and gamma.

>Gamma is multiplicative; we multiply it with the normalized value.

>Beta is additive; we add beta to the product of gamma and the
normalized value.

>Beta and gamma re-introduce some fluctuations in the data, as
forcing every value into a strictly normalized distribution may be
too restrictive for the network.

>the network will learn to tune beta and gamma to introduce these
fluctuations
-- beta and gamma control which values are amplified and
by how much.

Batch vs Layer Normalization:

>batch normalization:
-- considers the same feature across the entire batch
> layer normalization:
-- considers all the features of one item in the batch
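
A rough sketch of layer normalization with gamma and beta; the batch-norm contrast is only in which axis the statistics are taken over.

import torch

x = torch.randn(8, 512)                       # batch of 8 items, 512 features each

# Layer normalization: statistics over the features of each item
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + 1e-5)   # zero mean, unit variance per item

gamma = torch.ones(512, requires_grad=True)   # multiplicative, learned
beta = torch.zeros(512, requires_grad=True)   # additive, learned
y = gamma * x_hat + beta                      # lets the network re-introduce variation

# Batch normalization would instead take mean/var over dim=0:
# the same feature across the whole batch, rather than all features of one item.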

Feed Forward & Add and Norm:

Feed Forward:
>processes each position in the sequence independently and
helps the model to learn complex representations by applying
non-linear transformations to the input

>fully connected feed-forward network that is applied to each
position separately and identically. It consists of two linear transformations
with a ReLU activation in between

>First Linear Transformation:
- projects the input into a higher dimensional space
>ReLU Activation:
- non-linear activation function applied to introduce
non-linearity into the model.
>Second Linear Transformation:
- projects the higher dimensional representation back
to the original dimension.
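
A minimal sketch of this block; d_ff = 2048 is the inner size used in the original Transformer paper and is assumed here.

import torch
import torch.nn as nn

d_model, d_ff = 512, 2048                     # d_ff assumed from the original paper

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),                 # project up to a higher dimension
    nn.ReLU(),                                # non-linearity
    nn.Linear(d_ff, d_model),                 # project back to the original dimension
)

x = torch.randn(1, 6, d_model)                # applied to each position independently
print(ffn(x).shape)                           # torch.Size([1, 6, 512])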

Add & Norm:

>performed after the feed-forward layer.
>Residual Connection (Add):
-- the input of the FFN (sublayer) is added to its output.
This is known as a skip connection or residual connection and helps in
addressing the vanishing gradient problem, facilitating better gradient
flow through the network.
>Layer Normalization (Norm):
-- Layer normalization normalizes the summed vectors to
have zero mean and unit variance, which helps in stabilizing and
accelerating the training process
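
Putting the two steps together (a sketch; the Linear layer below is just a stand-in for the attention or feed-forward sublayer):

import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)        # stand-in for attention or the FFN

x = torch.randn(1, 6, d_model)
out = norm(x + sublayer(x))                   # Add (residual/skip connection), then Norm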

Decoder:

Output Embedding (& Positional Encoding):

It is similar to the encoder. During training, the target sequence (i.e.,
the correct output sequence) is used as input to the decoder. However, it is
shifted to the right by one position.

>Shifting the target sequence allows the model to predict the next token based
on the previous tokens. If the target sequence is [y1, y2, y3, …, yn], it is
transformed to [<START>, y1, y2, y3, …, yn-1] before being fed into the
decoder.
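
A tiny sketch of this shift, with made-up token ids:

import torch

START = 1                                         # hypothetical <START>/<SOS> id
target = torch.tensor([11, 12, 13, 14])           # [y1, y2, y3, y4]

decoder_input = torch.cat([torch.tensor([START]), target[:-1]])   # [<START>, y1, y2, y3]
labels = target                                   # the model predicts y1..y4 from it
print(decoder_input.tolist(), labels.tolist())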

Masked Multi-Head Attention & Add and Norm:

What is masked multi-head attention?

>The decoder's self-attention must be causal, meaning that the output at a certain
position can only depend on the words at the previous positions. The model
must not be able to see future words.

We achieve this by replacing the scores of all the future words with -infinity in the
seq * seq attention matrices. After the softmax function is applied, all the -infinity
entries become 0.
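
A sketch of this masking on a seq * seq score matrix:

import torch

seq = 4
scores = torch.randn(seq, seq)                            # raw attention scores

mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)   # future positions
scores = scores.masked_fill(mask, float('-inf'))          # replace future words with -inf

weights = torch.softmax(scores, dim=-1)                   # -inf entries get 0 weight
print(weights)   # upper triangle is exactly 0: no attention to future tokens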

Multi-Head Attention & Add and Norm:

>multi-head attention layer gets keys and values matrices from the
encoder’s output and the query from the output of the masked multi-head
attention.
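
A sketch of this wiring using torch.nn.MultiheadAttention, just to make explicit which tensor plays query, key and value; the shapes are illustrative.

import torch
import torch.nn as nn

d_model, h = 512, 4
cross_attn = nn.MultiheadAttention(d_model, h, batch_first=True)

enc_out = torch.randn(1, 6, d_model)      # encoder output -> keys and values
dec_x = torch.randn(1, 5, d_model)        # output of masked multi-head attention -> queries

out, _ = cross_attn(query=dec_x, key=enc_out, value=enc_out)
print(out.shape)                          # torch.Size([1, 5, 512])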

Feed Forward & Add and Norm:


> consists of the same structure and serves the
same purpose (introducing non-linearity to make the model
learn complex representations).

Linear layer and Softmax:

Linear Layer:
>transforms the output of the previous layer to a different
dimensionality, to match the number of classes in the output
vocabulary.
>performs a matrix multiplication between the input and a weight matrix,
followed by the addition of a bias term; it transforms the final hidden state
outputs from the decoder into logits
Softmax:
>Converts the logits from linear layer into probabilities
by applying the softmax function.
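
A sketch of these two steps, with a made-up vocabulary size:

import torch
import torch.nn as nn

d_model, vocab_size = 512, 32_000              # hypothetical vocabulary size
proj = nn.Linear(d_model, vocab_size)          # weight matrix multiply + bias

dec_out = torch.randn(1, 5, d_model)           # decoder hidden states
logits = proj(dec_out)                         # (1, 5, vocab_size)
probs = torch.softmax(logits, dim=-1)          # probabilities over the vocabulary
print(probs.sum(dim=-1))                       # each position sums to 1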

Training a transformer:

Encoder:
>We convert the input (English) sentence into embeddings
-- add the positional encoding to form the encoder input
of dimension seq * dmodel.

Decoder:
-- add a <SOS> token to the expected Italian translation:

- As we add the <SOS>, the output is shifted right,
as expected by the model.
Our input is of 4 tokens (including <SOS>), so we add 996
padding (<PAD>) tokens.

> pass this input now as the decoder input. It is converted to output
embeddings and we add positional encoding to it to form the input to the
masked multi-head attention.

>pass the keys and values from the encoder output and the query from the
output of the masked multi-head attention layer to the decoder (multi-head
attention + feed forward layer). We get the decoder output.

> the decoder output is of the dimension seq * dmodel,
so it is still an embedding
-- the linear layer will now convert the seq * dmodel matrix
into a seq * vocab size matrix, and we apply softmax to it.

The expected output (the label against which the loss is computed) is the
target sentence shifted left, ending with <EOS>: ti amo molto <EOS>.

Transformer Training Advantage:


>All of this happens in 1 time step for the entire sequence (unlike previous
architectures like RNNs which required n time steps). So transformers made
it very easy and fast to train extremely long sequences with great
performance.
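
A sketch of why this is one step: the loss over every position comes from a single forward pass (the logits and labels below are random stand-ins for the real model output and the expected Italian token ids).

import torch
import torch.nn.functional as F

seq_len, vocab_size = 4, 32_000
logits = torch.randn(seq_len, vocab_size, requires_grad=True)  # one forward pass, all positions
labels = torch.tensor([101, 102, 103, 2])      # e.g. ti, amo, molto, <EOS> (made-up ids)

loss = F.cross_entropy(logits, labels)         # loss for the whole sequence at once
loss.backward()                                # no per-token time steps as in an RNN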

Inferencing a Transformer:
Encoder:

The encoder part stays the same as in training; we provide the input:
<SOS>I love you very much<EOS>.

Decoder:
Time Step 1:

Instead of providing the entire translation preceded by <SOS> and followed
by padding <PAD> tokens as during training, we provide only <SOS> as the
decoder input, followed by no padding tokens during inferencing.

>get the key and value matrices from the encoder output
>get the query matrix from the masked multi-head attention layer’s output for the
decoder input.
>Then, we get the decoder output, which is passed
through the linear layer and softmax.
>the output of the linear layer is known as logits.
-- the token from our vocabulary with the highest softmax
probability is selected as the model’s predicted output.

-- we get the first token (which follows <SOS>) in the 1st time
step. This first token that we generated is ti.

Time Step 2:

>2nd time step, we don’t need to recompute the encoder output as our
input (English sentence) didn’t change, so the encoder output will not
change.

>append the output of the previous step (ti) to the decoder input
sequence (<SOS> ti) and feed this as the input to the decoder layer.

>repeat the same process as in the 1st time step: convert the decoder input
through the output embedding, positional encoding and masked multi-head attention,
pass it to the decoder, and convert its output by passing it through the linear and
softmax layers to get the next token.

Time Step 3:

we pass <SOS> + the decoder output generated so far (ti amo).
We get the next token: molto

Time Step 4:

Now, we get the <EOS> token. As we get the <EOS>, we stop and no more
tokens are generated.

Inferencing Strategy:
>At every step we selected the word with the maximum
softmax value; this is called a greedy strategy
-- it usually doesn’t perform very well.
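
A sketch of the greedy loop described in the time steps above; next_token_logits is a hypothetical stand-in for the full encoder/decoder stack.

import torch

SOS, EOS, vocab_size = 1, 2, 32_000            # made-up special token ids
max_len = 20

def next_token_logits(decoder_input):
    # stand-in for: output embedding + positional encoding -> masked MHA ->
    # cross-attention with the (cached) encoder output -> FFN -> linear layer
    return torch.randn(vocab_size)

decoder_input = [SOS]
for _ in range(max_len):
    logits = next_token_logits(torch.tensor(decoder_input))
    next_token = int(torch.argmax(torch.softmax(logits, dim=-1)))  # greedy: pick the max
    decoder_input.append(next_token)                               # append and repeat
    if next_token == EOS:                                          # stop on <EOS>
        break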

Beam Search:
>At each step, instead of selecting the
word with the maximum softmax value, we choose the top B words and
evaluate all the possible next words for each of them.
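
A simplified beam-search sketch under the same kind of dummy scoring function (B is the beam width; real implementations also handle length normalization).

import torch

B, vocab_size, max_len, SOS, EOS = 3, 100, 10, 1, 2

def next_token_log_probs(seq):
    # stand-in for running the decoder on `seq` and taking log-softmax of the logits
    return torch.log_softmax(torch.randn(vocab_size), dim=-1)

beams = [([SOS], 0.0)]                                   # (sequence, cumulative log-prob)
for _ in range(max_len):
    candidates = []
    for seq, score in beams:
        if seq[-1] == EOS:                               # keep finished beams as they are
            candidates.append((seq, score))
            continue
        log_probs = next_token_log_probs(seq)
        top = torch.topk(log_probs, B)                   # top-B next words for this beam
        for lp, tok in zip(top.values, top.indices):
            candidates.append((seq + [int(tok)], score + float(lp)))
    beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]   # keep best B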

From slide

LEFT SECTION: Fixed Sinusoidal Encoding

Top-left Block:
You see the phrase: Queen and king.
Green rows represent word embeddings.
Below them are positional encodings (e.g., [0.01, 0.04, ..., 0.24]).
These are added together element-wise → input to the Transformer.

✅ These positional encodings are:


Fixed (not learned),
Use sin and cos functions across dimensions,
Immutable during training.

This satisfies key requirements:


Encodes position uniquely.
Smooth patterns across positions.
Supports extrapolation beyond training length.

Summary
Trainable positional embeddings are an alternative to fixed sinusoidal encodings.
They allow the model to learn optimal positional patterns for a specific task.
In this formulation, position vectors are used directly in the attention score computation.
It improves flexibility but may not extrapolate well to longer sequences.
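
A sketch of the trainable alternative; unlike the sinusoidal table, these vectors are learned parameters, and positions beyond max_len are simply undefined.

import torch
import torch.nn as nn

d_model, max_len = 512, 512                      # max_len fixed in advance
pos_emb = nn.Embedding(max_len, d_model)         # learned position vectors

seq_len = 6
positions = torch.arange(seq_len)                # 0, 1, ..., seq_len-1
x = torch.randn(1, seq_len, d_model)             # token embeddings
x = x + pos_emb(positions)                       # added just like the fixed encoding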

What is a Positionwise FFN?
>After the multi-head self-attention layer in a Transformer,
each token’s embedding is passed through the same
feed-forward network (FFN)

Why “Positionwise”?
The same FFN (same weights) is applied to each position
(token) in the sequence independently.

In Transformers, Add & Norm is a step used after sublayers like:
Multi-head self-attention
Positionwise feed-forward networks


The process includes:
1. Add: Adding the input to the output of the sublayer (residual connection)
2. Norm: Applying Layer Normalization to the result

Main Objective of Normalization


The key goals are:
Stabilize training by reducing fluctuations in layer input distributions
Prevent exploding or vanishing gradients
Accelerate convergence

What is Batch Normalization?
> technique to standardize the inputs to a layer for each mini-batch.
It stabilizes and accelerates training by:
Reducing internal covariate shift (i.e., the change in the distribution of network activations due to updates in
parameters).
Helping gradients flow through the network.
Enabling faster convergence.
Providing some regularization (like a mild dropout effect).
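
A small sketch of the per-mini-batch standardization (each feature is normalized across the batch dimension):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)              # 4 features, illustrative
x = torch.randn(8, 4)                            # mini-batch of 8 items

y = bn(x)                                        # each column standardized over the batch
print(y.mean(dim=0))                             # approximately 0 per feature
print(y.std(dim=0, unbiased=False))              # approximately 1 per feature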

>What is Masked Self-Attention?
-- each position can only attend to previous or current tokens,
not future tokens. This is critical in language
generation tasks where the model generates one word
at a time.

