Positional Embeddings

Positional embeddings are crucial for Transformers as they provide necessary positional information to understand the order of tokens, which is vital for interpreting meaning in natural language. Various methods, including relative positional embeddings and novel approaches like TUPE and Rotary Position Embeddings, enhance the model's ability to capture positional relationships effectively. The document also highlights the limitations of absolute positional embeddings and suggests training strategies to improve robustness and generalization.


Why Do We Need Positional Embeddings?

● Transformers are order-agnostic:


○ Unlike RNNs, which process tokens sequentially, Transformers process tokens in
parallel. Without positional information, the model cannot understand the order or
structure of a sentence.
● Importance of positional relationships:
○ The order of words affects the meaning in natural language. For example:
■ "The cat chased the dog" ≠ "The dog chased the cat."

Positional Embeddings in Vanilla Transformers

Definition:

● Positional embeddings are added to the input token embeddings to inject positional
information.

Mathematical Representation:

● Each position pos in a sequence is represented as a unique embedding using sine and cosine
functions:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

● Here:
○ pos: Position of the token in the sequence.
○ i: Index over dimension pairs (even dimensions use the sine, odd dimensions the cosine).
○ d: Total embedding dimensionality.

Additive Combination:

● The positional embeddings (PE) are added element-wise to the input token embeddings (TE):
X = TE + PE
● This ensures that both the token meaning and its position are represented in the input to the
Transformer.
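
A minimal NumPy sketch of the scheme above: build the sinusoidal table PE and add it to the token embeddings to form X = TE + PE. The token embeddings and sizes here are random stand-ins for illustration.

```python
import numpy as np

def sinusoidal_positional_embeddings(seq_len, d):
    """Build the (seq_len, d) sinusoidal positional embedding matrix."""
    pos = np.arange(seq_len)[:, None]                 # positions 0 .. seq_len-1
    i = np.arange(d // 2)[None, :]                    # index over dimension pairs
    angles = pos / np.power(10000.0, 2 * i / d)       # pos / 10000^(2i/d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

# Additive combination X = TE + PE (token embeddings here are random stand-ins)
seq_len, d = 16, 64
TE = np.random.randn(seq_len, d)
PE = sinusoidal_positional_embeddings(seq_len, d)
X = TE + PE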

Properties of Positional Embeddings

1. Deterministic:
○ Each position in a sequence has a fixed embedding, ensuring consistency during
training and inference.
2. Generalization:
○ The trigonometric functions allow generalization to sequences longer than those
seen during training because the embeddings are defined mathematically rather
than being learned.
3. Bounded Values:
○ The sine and cosine values are bounded between [−1,1], ensuring numerical stability.
4. Symmetry:
○ The distance between embeddings of neighboring positions is symmetrical.
○ For a fixed offset k, the positional encoding for position i+k is a linear transformation
of the positional encoding for position i:

PE(i+k) = T_k · PE(i)

○ This follows from the angle-addition identities: for each frequency w,
sin((i+k)w) = sin(iw)cos(kw) + cos(iw)sin(kw) and cos((i+k)w) = cos(iw)cos(kw) − sin(iw)sin(kw),
so every (sin, cos) pair of PE(i) is rotated by a 2×2 matrix that depends only on k.

5. Decays with Distance:
○ The similarity (dot product) between the positional embeddings of two tokens decreases as the
distance between their positions increases (see the check below).
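
A quick numeric check of the symmetry and decay properties, assuming the standard sinusoidal construction described above; the specific sizes are arbitrary.

```python
import numpy as np

# Rebuild the sinusoidal table (same construction as the sketch above)
seq_len, d = 512, 64
pos = np.arange(seq_len)[:, None]
i = np.arange(d // 2)[None, :]
angles = pos / np.power(10000.0, 2 * i / d)
pe = np.zeros((seq_len, d))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

# Symmetry: PE(p) . PE(p + k) depends only on the offset k, not on p
print([round(float(pe[p] @ pe[p + 5]), 3) for p in (0, 100, 300)])   # identical values

# Decay with distance: the similarity shrinks (with some oscillation) as k grows
print([round(float(pe[0] @ pe[k]), 3) for k in (1, 5, 20, 100)])
```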

Relative Positional Embeddings


Relative positional embeddings encode the relative position between tokens in a sequence rather
than their absolute positions. Unlike vanilla positional embeddings (which assign a unique position
to each token), relative positional embeddings focus on the distance between tokens, making them
more robust for tasks requiring relational understanding, such as language modeling, machine
translation, and tasks with long sequences.

● Considers relative positions for token pairs.
● If there are T tokens, a T × 2(T−1) pairwise embedding matrix is created:
○ Dimension 1: Position of the current word of interest.
○ Dimension 2: Positional distance from the current word.
● Instead of adding positional information element-wise to the token embeddings, relative
positional information is added to the keys and values during the attention calculation.
● It therefore kicks in only during the attention computation.

Mathematical Representation:
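
A minimal NumPy sketch of one common instantiation of this idea (in the style of Shaw et al.): learned relative-position embeddings, indexed by the distance j − i, are added to the keys and values inside single-head attention. The table shapes and parameter names are illustrative assumptions.

```python
import numpy as np

def relative_attention(X, Wq, Wk, Wv, rel_k, rel_v):
    """Single-head self-attention with learned relative-position embeddings
    added to the keys and values (Shaw et al.-style sketch)."""
    T, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # rel_k, rel_v: (2T - 1, d) tables indexed by the relative distance j - i,
    # shifted so index 0 corresponds to distance -(T - 1)
    idx = np.arange(T)[None, :] - np.arange(T)[:, None] + (T - 1)    # (T, T)
    Ak, Av = rel_k[idx], rel_v[idx]                                   # (T, T, d)
    # score_ij = q_i . (k_j + a^K_{j-i}) / sqrt(d)
    logits = (Q @ K.T + np.einsum('id,ijd->ij', Q, Ak)) / np.sqrt(d)
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # z_i = sum_j attn_ij * (v_j + a^V_{j-i})
    return attn @ V + np.einsum('ij,ijd->id', attn, Av)

# Toy usage with random parameters
T, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
rel_k, rel_v = rng.normal(size=(2 * T - 1, d)), rng.normal(size=(2 * T - 1, d))
out = relative_attention(X, Wq, Wk, Wv, rel_k, rel_v)   # (T, d)
```
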
Transformer-XL and Variations
● Processes text longer than the allowed sequence length by splitting it into multiple
segments.
● Replaces every appearance of the absolute positional embedding used for computing key
vectors with its relative counterpart, reflecting the prior that only the relative distance
matters for deciding where to attend.
● Since the query vector should be the same for all query positions, the query-side positional
terms are replaced by trainable bias vectors: the attentive bias towards different words
remains the same regardless of the query position.
● Deliberately separates the two weight matrices that produce the content-based key vectors
and the location-based key vectors.
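
For reference, a sketch of how the relative attention score in Transformer-XL decomposes into four terms (notation follows the paper: E are token embeddings, R_{i−j} is the sinusoidal relative embedding, and u, v are the trainable biases that stand in for the query-side positional terms):

```latex
A^{\mathrm{rel}}_{i,j}
  = \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{\text{content--content}}
  + \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{\text{content--position}}
  + \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{\text{global content bias}}
  + \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{\text{global position bias}}
```

The first two terms carry content-based and position-based addressing for each query; the last two are the global content and global position biases, which do not depend on the query position.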

TUPE
● A new positional encoding method called Transformer with Untied Positional Encoding
(TUPE).
● In the self-attention module, TUPE computes the word contextual correlation and positional
correlation separately with different parameterizations and then adds them together.
● This design removes the mixed and noisy correlations over heterogeneous embeddings and
offers more expressiveness by using different projection matrices.
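
A small sketch of the untied score computation, assuming single-head attention and omitting other parts of the TUPE design (such as its special handling of the [CLS] token); names and shapes are illustrative.

```python
import numpy as np

def tupe_attention_scores(X, P, Wq, Wk, Uq, Uk):
    """Untied attention scores in the spirit of TUPE: the word-word (contextual)
    correlation and the position-position correlation are computed with separate
    projection matrices and then summed. The sqrt(2d) scaling follows the paper;
    other details of the full method are omitted in this sketch."""
    T, d = X.shape
    contextual = (X @ Wq) @ (X @ Wk).T      # correlation over token embeddings
    positional = (P @ Uq) @ (P @ Uk).T      # correlation over positional embeddings
    return (contextual + positional) / np.sqrt(2 * d)

# Toy usage: X are token embeddings, P are (absolute) positional embeddings
T, d = 8, 16
rng = np.random.default_rng(0)
scores = tupe_attention_scores(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                               *(rng.normal(size=(d, d)) for _ in range(4)))
```
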
Rotary Position Embeddings (RoPE)
Compute Query and Key Vectors:

● For the token at position m, the query is q_m = W_q x_m; for the token at position n, the key is k_n = W_k x_n.

Apply Rotation Matrix:

● Multiply the query and key vectors by a rotation matrix. This matrix is determined by the
absolute positions of the tokens

Compute Inner Product:

● Compute the inner product between the rotated query and key vectors.

Generate Attention Matrix:

● The result of this inner product is an attention matrix that depends on the relative positions
of the tokens, allowing the model to capture positional relationships more effectively.

● Break the d-dimensional embeddings into d/2 vectors of length 2: [(x1, x2), (x3, x4), ...]
● Rotate each two-dimensional vector by an angle mθ, where m is the absolute position of the token.
● Because queries and keys are both rotated this way, their inner product ends up depending only
on the position difference between the query and key tokens (see the sketch below).
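
A minimal NumPy sketch of the rotation step, assuming the usual per-pair frequencies θ_i = 10000^(−2i/d); the printed check illustrates that the inner product of a rotated query and key depends only on their position offset.

```python
import numpy as np

def apply_rope(v, m, base=10000.0):
    """Rotate the query/key vector v of a token at absolute position m.
    Each 2D pair (v[2i], v[2i+1]) is rotated by the angle m * theta_i,
    with theta_i = base^(-2i/d) (the usual per-pair frequencies)."""
    d = v.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x, y = v[0::2], v[1::2]                       # split into d/2 two-dim pairs
    out = np.empty_like(v)
    out[0::2], out[1::2] = x * cos - y * sin, x * sin + y * cos
    return out

# The inner product of rotated queries/keys depends only on the offset m - n:
rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
print(apply_rope(q, 7) @ apply_rope(k, 3))    # offset 4
print(apply_rope(q, 14) @ apply_rope(k, 10))  # same offset 4 -> same value
```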

CAPE (Continuous Augmented Positional Embeddings)

Key Problems with Absolute Positional Embeddings

1. In-Domain Generalization Issues:


○ Models might learn spurious correlations between token positions and content in
the training set, reducing robustness on unseen data.
2. Out-of-Domain Generalization Issues:
○ Positional embeddings may fail for input sequences significantly longer or shorter than those
seen during training, because the corresponding positions are rarely or never observed.
Training Steps

1. Mean Normalization:
○ Normalize positions by subtracting the mean position so that each sequence is centered
around zero: p_i ← p_i − (1/T) Σ_j p_j

2. Global Shift:
○ Apply a random global offset to all positions.
3. Local Shift:
○ Add random noise to each token's position.
4. Global Scaling:
○ Scale all positions by a random global factor.

These steps ensure that the positional embeddings used during training are diverse and prevent
overfitting to specific positions.
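
A small sketch of these four augmentation steps applied to continuous token positions; the ranges and sampling distributions are illustrative assumptions rather than the paper's exact settings (the augmented positions would then be fed to a sinusoidal encoder).

```python
import numpy as np

def cape_augment_positions(seq_len, max_global_shift=5.0, max_local_shift=0.5,
                           max_global_scale=1.4, rng=None):
    """Augment continuous token positions in the spirit of CAPE. The four steps
    mirror the list above; the ranges and the uniform sampling used here are
    illustrative assumptions, not the paper's exact hyperparameters."""
    rng = np.random.default_rng() if rng is None else rng
    pos = np.arange(seq_len, dtype=np.float64)
    pos -= pos.mean()                                               # 1. mean normalization
    pos += rng.uniform(-max_global_shift, max_global_shift)         # 2. global shift
    pos += rng.uniform(-max_local_shift, max_local_shift, seq_len)  # 3. local shift (per token)
    pos *= np.exp(rng.uniform(-np.log(max_global_scale),
                              np.log(max_global_scale)))            # 4. global scaling
    return pos  # feed these continuous positions into a sinusoidal encoder

# Each call yields a differently shifted/scaled view of the same sequence
print(cape_augment_positions(8, rng=np.random.default_rng(0)))
```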
