The Most Used Positional Encoding: RoPE
Damien Benveniste
newsletter.TheAiEdge.io
The Rotary Position Embedding (RoPE) [2] is now one of the most common strategies used
to inject relative positional information into the attention mechanism. The idea behind
RoPE is to rotate the keys and queries based on the positions of the corresponding tokens in the input
sequence. This injects the absolute positional information directly into the queries and keys.
Let’s look at a toy example to understand the logic. Let’s consider a 2-dimensional query $q_i$
and a 2-dimensional key $k_j$. To rotate 2-dimensional vectors, we use rotation matrices:

$$
q_i^{\text{rotated}} = R(i\theta)\,q_i, \qquad k_j^{\text{rotated}} = R(j\theta)\,k_j \tag{1}
$$

where:

$$
R(i\theta) = \begin{pmatrix} \cos i\theta & -\sin i\theta \\ \sin i\theta & \cos i\theta \end{pmatrix}
\quad\text{and}\quad
R(j\theta) = \begin{pmatrix} \cos j\theta & -\sin j\theta \\ \sin j\theta & \cos j\theta \end{pmatrix} \tag{2}
$$
$R$ is the standard 2-dimensional rotation matrix, with $\theta_i = i\theta$ and $\theta_j = j\theta$ being the position-specific angles
associated with $q_i$ and $k_j$. Let’s now compute the alignment score between those rotated vectors:
$$
\left(q_i^{\text{rotated}}\right)^{\!\top} k_j^{\text{rotated}}
= \big(R(i\theta)\,q_i\big)^{\!\top}\big(R(j\theta)\,k_j\big)
= q_i^{\top} R(i\theta)^{\top} R(j\theta)\,k_j \tag{3}
$$
Rotation matrices satisfy the following property:
$$
R(i\theta)^{\top} = R(i\theta)^{-1} = R(-i\theta) \tag{4}
$$
This means that taking the transpose of a rotation matrix is equivalent to rotating in the
opposite direction. Therefore, we have:
$$
R(i\theta)^{\top} R(j\theta) = R(-i\theta)\,R(j\theta) = R\big((j-i)\theta\big) \tag{5}
$$
As a consequence, the alignment score computed for the rotated vectors naturally captures the
relative positional information between them:
$$
\left(q_i^{\text{rotated}}\right)^{\!\top} k_j^{\text{rotated}} = q_i^{\top}\, R\big((j-i)\theta\big)\, k_j \tag{6}
$$
This is very reminiscent of the approach developed in Transformer-XL, but without relying on
any additional learnable parameters.
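As a quick sanity check of equation (6) (not from the original article), the following PyTorch snippet verifies numerically that the score between the rotated vectors depends only on the relative offset $j - i$:

```python
import torch

def rot2d(angle: torch.Tensor) -> torch.Tensor:
    """2-D rotation matrix R(angle) as in equation (2)."""
    c, s = torch.cos(angle), torch.sin(angle)
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])])

theta = torch.tensor(0.3)          # arbitrary base angle
i, j = 5, 9                        # token positions of the query and the key
q, k = torch.randn(2), torch.randn(2)

# Left-hand side of equation (6): rotate q and k, then take the dot product.
lhs = (rot2d(i * theta) @ q) @ (rot2d(j * theta) @ k)

# Right-hand side of equation (6): un-rotated q and k with a single
# rotation by the relative offset (j - i).
rhs = q @ (rot2d((j - i) * theta) @ k)

print(torch.allclose(lhs, rhs))    # True: the score only depends on j - i
```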
Let’s extend this logic to $d_{\text{model}}$- or $d_{\text{head}}$-dimensional vectors. Rotation matrices have a
natural extension to higher dimensions, but instead, we are going to consider each
pair of elements in the query and key vectors separately and perform pairwise rotations of those segments.
Each segment has its own 2-dimensional rotation matrix $R(p\theta_k)$, which depends on both the
position $p$ of the token in the sequence and the index $k$ of the segment within the vectors:
$$
R(p\theta_k) = \begin{pmatrix} \cos p\theta_k & -\sin p\theta_k \\ \sin p\theta_k & \cos p\theta_k \end{pmatrix} \tag{7}
$$
The angles follow a geometric progression across segments:

$$
\theta_k = 10000^{-2(k-1)/d_{\text{head}}} \tag{8}
$$

For example, with $d_{\text{head}} = 64$ (32 segment pairs), for $k = 1$ we have $\theta_1 = 1$
(highest frequency), and for $k = 32$ we have $\theta_{32} = 10000^{-0.97} \approx 1.3 \times 10^{-4}$ (lowest frequency).
This means that for small k, the rotation matrices for these dimensions complete full rotations
after small position shifts, and they are highly sensitive to small positional changes. They
tend to capture fine-grained, local relationships between nearby tokens, and the attention score
contribution from these dimensions drops quickly as tokens get farther apart. For large k, the
rotation matrices change gradually across many positions, and they help maintain coherent
relationships over longer sequences. They are useful to capture long-range dependencies and
broader structural patterns, as the attention score contribution from these dimensions remains
significant even for distant tokens.
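For a concrete feel of these frequencies, here is a small sketch (assuming the definition of $\theta_k$ in equation (8) and, as in the example above, $d_{\text{head}} = 64$):

```python
import torch

d_head = 64                                   # assumed head dimension (32 segment pairs)
k = torch.arange(1, d_head // 2 + 1)          # segment indices k = 1 .. 32
theta = 10000.0 ** (-2 * (k - 1) / d_head)    # equation (8)

# Number of positions needed for each segment to complete a full 2*pi rotation.
wavelength = 2 * torch.pi / theta

print(theta[0].item(), theta[-1].item())            # ~1.0 (fast) vs ~1.3e-4 (slow)
print(wavelength[0].item(), wavelength[-1].item())  # ~6 positions vs ~47,000 positions
```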
To perform the full rotation of every segment at once, we construct the following block-diagonal rotation
matrix:

$$
R_p =
\begin{pmatrix}
R(p\theta_1) & 0 & \cdots & 0 \\
0 & R(p\theta_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & R(p\theta_{d_{\text{head}}/2})
\end{pmatrix}
=
\begin{pmatrix}
\cos p\theta_1 & -\sin p\theta_1 & \cdots & 0 & 0 \\
\sin p\theta_1 & \cos p\theta_1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \cos p\theta_{d_{\text{head}}/2} & -\sin p\theta_{d_{\text{head}}/2} \\
0 & 0 & \cdots & \sin p\theta_{d_{\text{head}}/2} & \cos p\theta_{d_{\text{head}}/2}
\end{pmatrix} \tag{9}
$$
We can then apply it to the queries and keys for each head in a similar manner to the
2-dimensional case:

$$
q_i^{\text{rotated}} = R_i\, q_i, \qquad k_j^{\text{rotated}} = R_j\, k_j \tag{10}
$$
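For illustration only, here is a minimal sketch (not from the article) that materializes $R_p$ explicitly and applies equation (10) to a single query vector; it is only meant to make equation (9) concrete, since the matrix is very sparse:

```python
import torch

def rope_rotation_matrix(p: int, d_head: int, base: float = 10000.0) -> torch.Tensor:
    """Build the block-diagonal matrix R_p of equation (9) for position p."""
    k = torch.arange(1, d_head // 2 + 1)
    theta = base ** (-2 * (k - 1) / d_head)   # equation (8)
    angles = p * theta                        # p * theta_k for every segment
    cos, sin = torch.cos(angles), torch.sin(angles)
    R = torch.zeros(d_head, d_head)
    idx = torch.arange(0, d_head, 2)          # top-left index of each 2x2 block
    R[idx, idx] = cos
    R[idx, idx + 1] = -sin
    R[idx + 1, idx] = sin
    R[idx + 1, idx + 1] = cos
    return R

d_head = 64
q = torch.randn(d_head)
q_rotated = rope_rotation_matrix(p=7, d_head=d_head) @ q   # equation (10) for one query
```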
In practice, most implementations rely on complex numbers rather than explicit rotation matrices,
where $i = \sqrt{-1}$ is the imaginary unit. The advantage of complex numbers is the ease of formalizing rotations:
they behave very similarly to 2-dimensional vectors, and to rotate one by an
angle $\theta$, we just need to multiply it by $e^{i\theta}$: for $z = x + iy$, we get $e^{i\theta} z = (x\cos\theta - y\sin\theta) + i\,(x\sin\theta + y\cos\theta)$, which is exactly the 2-dimensional rotation of equation (2).
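To make this concrete, here is a quick numerical check (not from the article) that complex multiplication by $e^{i\theta}$ reproduces the 2×2 rotation of equation (2):

```python
import math
import torch

theta = 0.3
z = torch.randn(2)                                    # a 2-D vector (x, y) ...
zc = torch.complex(z[0], z[1])                        # ... viewed as the complex number x + iy

rotated_c = zc * torch.exp(torch.tensor(1j * theta))  # rotation = multiplication by e^{i*theta}

# The same rotation with the 2x2 matrix of equation (2).
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])
rotated_v = R @ z
print(torch.allclose(rotated_c, torch.complex(rotated_v[0], rotated_v[1])))  # True
```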
Many RoPE implementations involve reshaping the query $Q$ and key $K$ tensors to first
expose the segment-pair dimension. The tensors have dimension $N \times n_{\text{head}} \times d_{\text{head}}$, and
we reshape them into tensors of dimension $N \times n_{\text{head}} \times (d_{\text{head}}/2) \times 2$:
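The original snippet is not reproduced here, but a minimal PyTorch sketch of this reshaping step could look as follows (shapes and variable names are illustrative):

```python
import torch

N, n_head, d_head = 10, 8, 64          # illustrative sequence length, heads, head dim
q = torch.randn(N, n_head, d_head)     # query tensor (the key tensor K is treated the same way)

# Group the head dimension into d_head/2 pairs of adjacent elements.
q_pairs = q.reshape(N, n_head, d_head // 2, 2)
print(q_pairs.shape)                   # torch.Size([10, 8, 32, 2])
```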
We can then collapse the last dimension into a complex representation by interpreting each pair
of elements as the real and imaginary parts of complex numbers:
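Again as a sketch rather than the article's exact code, the complex view and the rotation by $e^{ip\theta_k}$ can be written as:

```python
import torch

N, n_head, d_head = 10, 8, 64
q = torch.randn(N, n_head, d_head)

# View each pair (x, y) as the complex number x + iy: shape N x n_head x d_head/2.
q_complex = torch.view_as_complex(q.reshape(N, n_head, d_head // 2, 2))

# Rotation factors e^{i * p * theta_k} for every position p and segment k.
k = torch.arange(1, d_head // 2 + 1)
theta = 10000.0 ** (-2 * (k - 1) / d_head)            # equation (8)
angles = torch.arange(N).unsqueeze(1) * theta         # shape N x d_head/2
rot = torch.polar(torch.ones_like(angles), angles)    # e^{i * angles}

# Broadcast over heads, rotate, then go back to the real-valued layout.
q_rotated = torch.view_as_real(q_complex * rot.unsqueeze(1)).reshape(N, n_head, d_head)
```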
RoPE has become one of the most influential innovations in modern language models,
adopted in models like LLaMA, PaLM, GPT-NeoX, and many other state-of-the-art architectures.
It is computationally efficient and can handle arbitrary sequence lengths: it does not require a
predefined maximum sequence length, and it tends to generalize well to positions beyond those
seen during training.
To extend the context window of an already trained model, a simple and effective trick is positional
interpolation [1]: rather than extrapolating to unseen position indices, the positions of a longer sequence
of length $N'$ are rescaled down into the range seen during training, and the dimensions become $N' \times d_{\text{head}}$.
The model usually requires minimal fine-tuning (∼1000 steps) to adapt to the “coarser” position information
(decimals instead of integers) on a data set with maximum lengths reaching $N'$. Performance gains can
typically continue up to 16-32× the original context length, but may plateau beyond that. By implementing
this relatively simple change and briefly fine-tuning, you can extend RoPE-based models to handle context
windows many times larger than they were originally trained on.
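As a minimal sketch of positional interpolation [1] (lengths and names are illustrative, not from the article), the only change on the RoPE side is to rescale the position indices before computing the angles:

```python
import torch

def rope_angles(positions: torch.Tensor, d_head: int, base: float = 10000.0) -> torch.Tensor:
    """Angles p * theta_k for every (possibly fractional) position p."""
    k = torch.arange(1, d_head // 2 + 1)
    theta = base ** (-2 * (k - 1) / d_head)          # equation (8)
    return positions.unsqueeze(1) * theta            # shape: len(positions) x d_head/2

d_head, trained_len, extended_len = 64, 2048, 8192   # illustrative lengths (4x extension)

# Positional interpolation: squeeze the 8192 new positions into the
# [0, 2048) range the model was trained on, yielding fractional positions.
scale = trained_len / extended_len
positions = torch.arange(extended_len) * scale       # 0.0, 0.25, 0.5, ...

angles = rope_angles(positions, d_head)              # feed these angles into RoPE as usual
```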
References
[1] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending Context Window of Large Language Models via Positional Interpolation, 2023.
[2] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding, 2023.