Positional Embeddings
Definition:
● Positional embeddings are added to the input token embeddings to inject positional
information.
Mathematical Representation:
● Each position pos in a sequence is represented as a unique embedding using sine and cosine
functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
● Here:
○ pos: Position of the token in the sequence.
○ i: Index of the sine/cosine pair (covering dimensions 2i and 2i+1 of the embedding).
○ d: Total embedding dimensionality.
Additive Combination:
● The positional embeddings (PE) are added element-wise to the input token embeddings (TE):
X = TE + PE
● This ensures that both the token meaning and its position are represented in the input to the
Transformer.
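As a minimal sketch of the two steps above (building the sinusoidal table and adding it element-wise to the token embeddings), assuming NumPy and illustrative names such as seq_len, d_model, and token_emb:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Build the (seq_len, d_model) table with PE(pos, 2i) = sin(...), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]                   # token positions
    i = np.arange(d_model // 2)[None, :]                # index of each sine/cosine pair
    angles = pos / np.power(10000.0, 2 * i / d_model)   # pos / 10000^(2i/d)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions: cosine
    return pe

# X = TE + PE: the table is added element-wise to the token embeddings.
seq_len, d_model = 128, 512
token_emb = np.random.randn(seq_len, d_model)           # placeholder token embeddings (TE)
x = token_emb + sinusoidal_positional_encoding(seq_len, d_model)
```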
Properties:
1. Deterministic:
○ Each position in a sequence has a fixed embedding, ensuring consistency during
training and inference.
2. Generalization:
○ The trigonometric functions allow generalization to sequences longer than those
seen during training because the embeddings are defined mathematically rather
than being learned.
3. Bounded Values:
○ The sine and cosine values are bounded between [−1,1], ensuring numerical stability.
4. Symmetry:
○ The distance between embeddings of neighboring positions is symmetrical.
○ For a fixed offset k, the positional encoding for position i+k is a linear transformation
of the positional encoding for position i:
PE(i+k) = T(k) · PE(i), where T(k) is a linear transformation that depends only on the offset k
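Concretely, writing ω_i = 10000^(−2i/d) for the frequency of the i-th sine/cosine pair, the pair at position p+k is obtained from the pair at position p by a fixed rotation, which is exactly the linear map T(k) above:

$$
\begin{pmatrix} PE_{(p+k,\,2i)} \\ PE_{(p+k,\,2i+1)} \end{pmatrix}
=
\begin{pmatrix} \cos(k\omega_i) & \sin(k\omega_i) \\ -\sin(k\omega_i) & \cos(k\omega_i) \end{pmatrix}
\begin{pmatrix} PE_{(p,\,2i)} \\ PE_{(p,\,2i+1)} \end{pmatrix}
$$

Since the rotation matrix depends only on k and not on p, the same transformation maps any position to the position k steps ahead.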
TRANSFORMER-XL and variations
● Processes text that goes beyond the allowed sequence length by splitting the text into
multiple segments.
● Replace all appearances of the absolute positional embedding for computing key vectors
with its relative counterpart. Essentially reflects the prior that only the relative distance
matters for where to attend.
● Since the query vector is the same for all query positions, the attentive bias towards different
words should remain the same regardless of the query position; the position-dependent query
term is therefore replaced with trainable bias vectors (u and v).
● Deliberately separate the two weight matrices for producing the content-based key vectors
and the location-based key vectors.
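For reference, the relative attention score in the Transformer-XL paper decomposes into four terms (E_x are token embeddings, R_{i−j} the sinusoidal relative encoding, u and v the trainable position-independent biases, and W_{k,E}, W_{k,R} the two separated key-projection matrices):

$$
A^{\mathrm{rel}}_{i,j}
= \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{(a)\ \text{content-content}}
+ \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{(b)\ \text{content-position}}
+ \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{(c)\ \text{global content bias}}
+ \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{(d)\ \text{global position bias}}
$$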
TUPE
● A new positional encoding method called Transformer with Untied Positional Encoding
(TUPE).
● In the self-attention module, TUPE computes the word contextual correlation and positional
correlation separately with different parameterizations and then adds them together.
● This design removes the mixed and noisy correlations over heterogeneous embeddings and
offers more expressiveness by using different projection matrices.
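A minimal NumPy sketch of the untying idea, assuming illustrative names (x for token embeddings, p for absolute positional embeddings, W/U for the separate projections); the 1/√(2d) scaling follows the TUPE paper:

```python
import numpy as np

def tupe_attention_logits(x, p, Wq, Wk, Uq, Uk):
    """Untied attention logits: the word-word and position-position
    correlations are computed with different projection matrices and
    then added together."""
    d = Wq.shape[1]
    word_term = (x @ Wq) @ (x @ Wk).T        # contextual (word-word) correlation
    pos_term = (p @ Uq) @ (p @ Uk).T         # positional correlation
    return (word_term + pos_term) / np.sqrt(2 * d)

seq_len, d_model = 16, 64
x = np.random.randn(seq_len, d_model)        # token embeddings
p = np.random.randn(seq_len, d_model)        # absolute positional embeddings
Wq, Wk, Uq, Uk = (np.random.randn(d_model, d_model) for _ in range(4))
logits = tupe_attention_logits(x, p, Wq, Wk, Uq, Uk)   # shape (seq_len, seq_len)
```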
ROTARY POSITION EMBEDDINGS
Compute Query and Key Vectors:
● For each token in the sequence, compute the query (q_m = Wq · x_m) and key (k_n = Wk · x_n)
vectors.
● Multiply the query and key vectors by a rotation matrix determined by the absolute positions
of the tokens.
● Compute the inner product between the rotated query and key vectors.
● The result of this inner product is an attention matrix that depends on the relative positions
of the tokens, allowing the model to capture positional relationships more effectively.
● Break the d-dimensional embedding into d/2 two-dimensional vectors: [(x1, x2), (x3, x4), …]
● Rotate each two-dimensional vector by an angle m·θ, where m is the token's absolute position.
● Because both the query and the key are rotated this way, their inner product depends only on
the position difference between the query and key tokens (see the sketch below).
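A minimal NumPy sketch of the rotation and of the relative-position property of the resulting scores (function and variable names are illustrative):

```python
import numpy as np

def rope_rotate(v: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each 2-D pair (v[2i], v[2i+1]) of a d-dimensional vector
    by the angle pos * theta_i, with theta_i = base**(-2i/d)."""
    d = v.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # one frequency per pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    pairs = v.reshape(-1, 2)                         # [(x1, x2), (x3, x4), ...]
    rotated = np.stack([pairs[:, 0] * cos - pairs[:, 1] * sin,
                        pairs[:, 0] * sin + pairs[:, 1] * cos], axis=-1)
    return rotated.reshape(d)

# The inner product of the rotated query and key depends only on the offset m - n:
d = 8
q, k = np.random.randn(d), np.random.randn(d)
score_a = rope_rotate(q, pos=7) @ rope_rotate(k, pos=3)    # offset 4
score_b = rope_rotate(q, pos=12) @ rope_rotate(k, pos=8)   # offset 4, positions shifted
assert np.isclose(score_a, score_b)
```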
CAPE (Continuous Augmented Positional Embeddings)
1. Mean Normalization:
○ Normalize positions by subtracting the mean position so that each sequence is
centered around zero: pos' = pos − mean(pos).
2. Global Shift:
○ Apply a random global offset to all positions.
3. Local Shift:
○ Add random noise to each token's position.
4. Global Scaling:
○ Scale all positions by a random global factor.
These steps ensure that the positional embeddings used during training are diverse and prevent
overfitting to specific positions.
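A minimal NumPy sketch of the four augmentation steps applied to the token positions before they are turned into embeddings (the sampling ranges are illustrative, not the paper's defaults):

```python
import numpy as np

def cape_augment_positions(seq_len, max_global_shift=5.0,
                           max_local_shift=0.5, max_global_scale=1.03,
                           rng=None):
    """Return augmented continuous positions: mean-normalized, then
    globally shifted, locally jittered, and globally scaled."""
    rng = rng or np.random.default_rng()
    pos = np.arange(seq_len, dtype=float)
    pos = pos - pos.mean()                                           # 1. mean normalization
    pos = pos + rng.uniform(-max_global_shift, max_global_shift)     # 2. global shift
    pos = pos + rng.uniform(-max_local_shift, max_local_shift,
                            size=seq_len)                            # 3. local shift (per token)
    log_s = np.log(max_global_scale)
    pos = pos * np.exp(rng.uniform(-log_s, log_s))                   # 4. global scaling
    return pos

augmented = cape_augment_positions(seq_len=32)
```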