Positional Embeddings
Definition:
● Positional embeddings are added to the input token embeddings to inject positional
information.
Mathematical Representation:
● Each position pos in a sequence is represented as a unique embedding using sine and cosine
functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
● Here:
○ pos: Position of the token in the sequence.
○ i: Index of the sine/cosine pair (covering dimensions 2i and 2i+1 of the embedding).
○ d: Total embedding dimensionality.
Additive Combination:
● The positional embeddings (PE) are added element-wise to the input token embeddings (TE):
X = TE + PE
● This ensures that both the token meaning and its position are represented in the input to the
Transformer.
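As a minimal sketch of the two steps above (building the sinusoidal table and adding it element-wise to the token embeddings), assuming NumPy and illustrative names such as seq_len, d_model, and token_emb:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Build the (seq_len, d_model) table with PE(pos, 2i) = sin(...), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]                   # token positions
    i = np.arange(d_model // 2)[None, :]                # index of each sine/cosine pair
    angles = pos / np.power(10000.0, 2 * i / d_model)   # pos / 10000^(2i/d)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions: cosine
    return pe

# X = TE + PE: the table is added element-wise to the token embeddings.
seq_len, d_model = 128, 512
token_emb = np.random.randn(seq_len, d_model)           # placeholder token embeddings (TE)
x = token_emb + sinusoidal_positional_encoding(seq_len, d_model)
```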
Properties:
1. Deterministic:
○ Each position in a sequence has a fixed embedding, ensuring consistency during
training and inference.
2. Generalization:
○ The trigonometric functions allow generalization to sequences longer than those
seen during training because the embeddings are defined mathematically rather
than being learned.
3. Bounded Values:
○ The sine and cosine values are bounded between [−1,1], ensuring numerical stability.
4. Symmetry:
○ The distance between embeddings of neighboring positions is symmetrical.
○ For a fixed offset k, the positional encoding for position i+k is a linear transformation
of the positional encoding for position i:
PE(i+k) = T(k) · PE(i), where T(k) is a linear transformation that depends only on the offset k
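Concretely, writing ω_i = 10000^(−2i/d) for the frequency of the i-th sine/cosine pair, the pair at position p+k is obtained from the pair at position p by a fixed rotation, which is exactly the linear map T(k) above:

$$
\begin{pmatrix} PE_{(p+k,\,2i)} \\ PE_{(p+k,\,2i+1)} \end{pmatrix}
=
\begin{pmatrix} \cos(k\omega_i) & \sin(k\omega_i) \\ -\sin(k\omega_i) & \cos(k\omega_i) \end{pmatrix}
\begin{pmatrix} PE_{(p,\,2i)} \\ PE_{(p,\,2i+1)} \end{pmatrix}
$$

Since the rotation matrix depends only on k and not on p, the same transformation maps any position to the position k steps ahead.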
TRANSFORMER-XL and variations
● Processes text that goes beyond the allowed sequence length by splitting the text into
multiple segments.
● Replace all appearances of the absolute positional embedding for computing key vectors
with its relative counterpart. Essentially reflects the prior that only the relative distance
matters for where to attend.
● Since the query vector is the same for all query positions, the attentive bias towards different
words should remain the same regardless of the query position; the position-dependent query
term is therefore replaced with trainable bias vectors (u and v).
● Deliberately separate the two weight matrices for producing the content-based key vectors
and the location-based key vectors.
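For reference, the relative attention score in the Transformer-XL paper decomposes into four terms (E_x are token embeddings, R_{i−j} the sinusoidal relative encoding, u and v the trainable position-independent biases, and W_{k,E}, W_{k,R} the two separated key-projection matrices):

$$
A^{\mathrm{rel}}_{i,j}
= \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{(a)\ \text{content-content}}
+ \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{(b)\ \text{content-position}}
+ \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{(c)\ \text{global content bias}}
+ \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{(d)\ \text{global position bias}}
$$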
TUPE
● A new positional encoding method called Transformer with Untied Positional Encoding
(TUPE).
● In the self-attention module, TUPE computes the word contextual correlation and positional
correlation separately with different parameterizations and then adds them together.
● This design removes the mixed and noisy correlations over heterogeneous embeddings and
offers more expressiveness by using different projection matrices.
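A minimal NumPy sketch of the untying idea, assuming illustrative names (x for token embeddings, p for absolute positional embeddings, W/U for the separate projections); the 1/√(2d) scaling follows the TUPE paper:

```python
import numpy as np

def tupe_attention_logits(x, p, Wq, Wk, Uq, Uk):
    """Untied attention logits: the word-word and position-position
    correlations are computed with different projection matrices and
    then added together."""
    d = Wq.shape[1]
    word_term = (x @ Wq) @ (x @ Wk).T        # contextual (word-word) correlation
    pos_term = (p @ Uq) @ (p @ Uk).T         # positional correlation
    return (word_term + pos_term) / np.sqrt(2 * d)

seq_len, d_model = 16, 64
x = np.random.randn(seq_len, d_model)        # token embeddings
p = np.random.randn(seq_len, d_model)        # absolute positional embeddings
Wq, Wk, Uq, Uk = (np.random.randn(d_model, d_model) for _ in range(4))
logits = tupe_attention_logits(x, p, Wq, Wk, Uq, Uk)   # shape (seq_len, seq_len)
```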
ROTARY POSITION EMBEDDINGS
Compute Query and Key Vectors:
● For each token in the sequence, compute the query (q_m = Wq · x_m) and key (k_n = Wk · x_n)
vectors.
● Multiply the query and key vectors by a rotation matrix determined by the absolute positions
of the tokens.
● Compute the inner product between the rotated query and key vectors.
● The result of this inner product is an attention matrix that depends on the relative positions
of the tokens, allowing the model to capture positional relationships more effectively.
● Break the d-dimensional embedding into d/2 two-dimensional vectors: [(x1, x2), (x3, x4), …]
● Rotate each two-dimensional vector by an angle m·θ, where m is the token's absolute position.
● Because both the query and the key are rotated this way, their inner product depends only on
the position difference between the query and key tokens (see the sketch below).
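A minimal NumPy sketch of the rotation and of the relative-position property of the resulting scores (function and variable names are illustrative):

```python
import numpy as np

def rope_rotate(v: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each 2-D pair (v[2i], v[2i+1]) of a d-dimensional vector
    by the angle pos * theta_i, with theta_i = base**(-2i/d)."""
    d = v.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # one frequency per pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    pairs = v.reshape(-1, 2)                         # [(x1, x2), (x3, x4), ...]
    rotated = np.stack([pairs[:, 0] * cos - pairs[:, 1] * sin,
                        pairs[:, 0] * sin + pairs[:, 1] * cos], axis=-1)
    return rotated.reshape(d)

# The inner product of the rotated query and key depends only on the offset m - n:
d = 8
q, k = np.random.randn(d), np.random.randn(d)
score_a = rope_rotate(q, pos=7) @ rope_rotate(k, pos=3)    # offset 4
score_b = rope_rotate(q, pos=12) @ rope_rotate(k, pos=8)   # offset 4, positions shifted
assert np.isclose(score_a, score_b)
```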
CAPE (Continuous Augmented Positional Embeddings)
1. Mean Normalization:
○ Normalize positions by subtracting the mean position so that each sequence is
centered around zero: pos' = pos − mean(pos).
2. Global Shift:
○ Apply a random global offset to all positions.
3. Local Shift:
○ Add random noise to each token's position.
4. Global Scaling:
○ Scale all positions by a random global factor.
These steps ensure that the positional embeddings used during training are diverse and prevent
overfitting to specific positions.
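A minimal NumPy sketch of the four augmentation steps applied to the token positions before they are turned into embeddings (the sampling ranges are illustrative, not the paper's defaults):

```python
import numpy as np

def cape_augment_positions(seq_len, max_global_shift=5.0,
                           max_local_shift=0.5, max_global_scale=1.03,
                           rng=None):
    """Return augmented continuous positions: mean-normalized, then
    globally shifted, locally jittered, and globally scaled."""
    rng = rng or np.random.default_rng()
    pos = np.arange(seq_len, dtype=float)
    pos = pos - pos.mean()                                           # 1. mean normalization
    pos = pos + rng.uniform(-max_global_shift, max_global_shift)     # 2. global shift
    pos = pos + rng.uniform(-max_local_shift, max_local_shift,
                            size=seq_len)                            # 3. local shift (per token)
    log_s = np.log(max_global_scale)
    pos = pos * np.exp(rng.uniform(-log_s, log_s))                   # 4. global scaling
    return pos

augmented = cape_augment_positions(seq_len=32)
```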