
The Most Used Positional Encoding: RoPE

Damien Benveniste
newsletter.TheAiEdge.io

The Rotary Position Embedding (RoPE) [2] is now one of the most common strategies used
to inject relative positional information into the attention mechanism. The idea behind
RoPE is to rotate the keys and queries based on the positions of the corresponding tokens in the input
sequence. This encodes the absolute positional information directly into the queries and keys, while, as we will see, the resulting attention scores depend only on the relative positions.
Let's look at a toy example to understand the logic. Let's consider a 2-dimensional query $q_i$
and a 2-dimensional key $k_j$. To rotate 2-dimensional vectors, we use rotation matrices:
$$q_i^{\text{rotated}} = R(i\theta)\, q_i, \qquad k_j^{\text{rotated}} = R(j\theta)\, k_j \tag{1}$$

where:

$$R(i\theta) = \begin{pmatrix} \cos i\theta & -\sin i\theta \\ \sin i\theta & \cos i\theta \end{pmatrix}
\quad \text{and} \quad
R(j\theta) = \begin{pmatrix} \cos j\theta & -\sin j\theta \\ \sin j\theta & \cos j\theta \end{pmatrix} \tag{2}$$
$R$ is the standard 2-dimensional rotation matrix, with $\theta_i = i\theta$ and $\theta_j = j\theta$ being the position-specific angles
associated with $q_i$ and $k_j$. Let's now compute the alignment score between those rotated vectors:

$$q_i^{\text{rotated}\top} k_j^{\text{rotated}} = \big(R(i\theta)\, q_i\big)^\top \big(R(j\theta)\, k_j\big) = q_i^\top R(i\theta)^\top R(j\theta)\, k_j \tag{3}$$
The rotation matrix follows this property:
$$R(i\theta)^\top = R(i\theta)^{-1} = R(-i\theta) \tag{4}$$
This means that taking the transpose of a rotation matrix is equivalent to rotating in the
opposite direction. Therefore, we have:
$$R(i\theta)^\top R(j\theta) = R(-i\theta)\, R(j\theta) = R\big((j-i)\theta\big) \tag{5}$$

As a consequence, the alignment score computed for the rotated vectors naturally captures the
relative positional information between them:
$$q_i^{\text{rotated}\top} k_j^{\text{rotated}} = q_i^\top R\big((j-i)\theta\big)\, k_j \tag{6}$$
This is very reminiscent of the approach developed in Transformer-XL, but without relying on
any learnable model parameters.
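
To make this concrete, here is a quick numerical check of Equation (6), written as a PyTorch sketch (the vectors, the base angle, and the positions are arbitrary illustrations): the alignment score between rotated queries and keys is the same for any pair of positions with the same offset $j - i$.

```python
import math
import torch

def rot2d(angle: float) -> torch.Tensor:
    """2-dimensional rotation matrix R(angle) from Equation (2)."""
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]])

theta = 0.3                                  # arbitrary base angle
q = torch.tensor([1.0, 2.0])                 # toy 2-dimensional query q_i
k = torch.tensor([0.5, -1.0])                # toy 2-dimensional key k_j

# Three position pairs that share the same relative offset j - i = 5.
for i, j in [(2, 7), (12, 17), (100, 105)]:
    score = (rot2d(i * theta) @ q) @ (rot2d(j * theta) @ k)
    print(f"i={i}, j={j}, score={score.item():.6f}")
# The three scores agree (up to floating-point rounding), as predicted by Equation (6).
```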

Let's extend this logic to $d_{model}$- or $d_{head}$-dimensional vectors. Rotation matrices have a
natural extension to higher dimensions, but instead, we are going to consider each
pair of elements in the query and key vectors separately and perform pairwise rotations of those 2-dimensional segments.

Each segment will have its own 2-dimensional rotation matrix $R(p\theta_k)$, depending on both the position $p$ of the token in the sequence and the index $k$ of the segment within the vectors:
$$R(p\theta_k) = \begin{pmatrix} \cos p\theta_k & -\sin p\theta_k \\ \sin p\theta_k & \cos p\theta_k \end{pmatrix} \tag{7}$$

where $\theta_k$ is chosen as:

$$\theta_k = 10000^{-2(k-1)/d_{head}} \tag{8}$$

with $k \in \{1, \ldots, d_{head}/2\}$. For example, if $d_{head} = 64$, for $k = 1$ we have $\theta_1 = 10000^0 = 1$
(highest frequency) and for $k = 32$ we have $\theta_{32} = 10000^{-0.97} \simeq 0.00013$ (lowest frequency).
This means that for small k, the rotation matrices for these dimensions complete full rotations
after small position shifts, and they are highly sensitive to small positional changes. They
tend to capture fine-grained, local relationships between nearby tokens, and the attention score
contribution from these dimensions drops quickly as tokens get farther apart. For large k, the
rotation matrices change gradually across many positions, and they help maintain coherent
relationships over longer sequences. They are useful to capture long-range dependencies and
broader structural patterns, as the attention score contribution from these dimensions remains
significant even for distant tokens.
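
As a small illustration of Equation (8), here is how the $\theta_k$ frequencies can be computed for $d_{head} = 64$ (a minimal sketch; the variable names are mine, and the index is 0-based in code while the text uses a 1-based $k$):

```python
import torch

d_head = 64
# theta_k = 10000^(-2(k-1)/d_head) for k = 1, ..., d_head/2 (Equation (8)),
# written here with a 0-based index.
k = torch.arange(d_head // 2, dtype=torch.float32)
theta = 10000.0 ** (-2.0 * k / d_head)

print(theta[0].item())    # 1.0       -> k = 1 in the text, highest frequency
print(theta[-1].item())   # ~1.3e-4   -> k = 32 in the text, lowest frequency
```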
To perform the full rotation of every segment at once, we construct the following rotation
matrix:
 
$$R_p = \begin{pmatrix}
R(p\theta_1) & 0 & \cdots & 0 \\
0 & R(p\theta_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & R(p\theta_{d_{head}/2})
\end{pmatrix}
= \begin{pmatrix}
\cos(p\theta_1) & -\sin(p\theta_1) & \cdots & 0 & 0 \\
\sin(p\theta_1) & \cos(p\theta_1) & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \cos(p\theta_{d_{head}/2}) & -\sin(p\theta_{d_{head}/2}) \\
0 & 0 & \cdots & \sin(p\theta_{d_{head}/2}) & \cos(p\theta_{d_{head}/2})
\end{pmatrix} \tag{9}$$

We can then apply it to the queries and keys for each head in a similar manner to the 2-dimensional case:

$$q_i^{\text{rotated}} = R_i\, q_i, \qquad k_j^{\text{rotated}} = R_j\, k_j \tag{10}$$
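
For illustration, here is a direct but unoptimized sketch of Equations (9) and (10) for a single head (the function name and shapes are assumptions, not a reference implementation); the next section explains why practical implementations avoid building $R_p$ explicitly.

```python
import math
import torch

def rope_matrix(p: int, d_head: int) -> torch.Tensor:
    """Block-diagonal rotation matrix R_p from Equation (9) -- illustration only."""
    blocks = []
    for k in range(d_head // 2):                       # segment index (0-based here)
        theta_k = 10000.0 ** (-2.0 * k / d_head)       # Equation (8)
        a = p * theta_k
        blocks.append(torch.tensor([[math.cos(a), -math.sin(a)],
                                    [math.sin(a),  math.cos(a)]]))
    return torch.block_diag(*blocks)                   # (d_head, d_head), mostly zeros

d_head = 8
q = torch.randn(d_head)
q_rotated = rope_matrix(p=5, d_head=d_head) @ q        # Equation (10) for position i = 5
```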

The Complex Number Representation


The formulation developed above is correct but not practical: it implies building a sparse
matrix $R_p$ for every position $p$ in the input sequence, and it does not lend itself to vectorized
computation. Instead, we are going to use a couple of tricks to perform this operation efficiently.
We are going to pretend that each pair of elements of the vectors is a complex number. This means
that if $(a, b)$ is a pair of elements, it represents the complex number $v$:

$$v = a + ib \tag{11}$$


where $i = \sqrt{-1}$. The advantage of complex numbers is the ease of formalizing rotations.
Complex numbers behave very similarly to 2-dimensional vectors, and to rotate one by an
angle $\theta$, we just need to multiply it by $e^{i\theta}$:

$$v^{\text{rotated}} = v\, e^{i\theta} \tag{12}$$

Many RoPE implementations involve reshaping the query $Q$ and key $K$ tensors to first
expose the segment-pair dimension. The tensors have shape $N \times n_{head} \times d_{head}$, and
we reshape them into tensors of shape $N \times n_{head} \times (d_{head}/2) \times 2$:

$$Q' = \text{Reshape}(Q), \quad \text{shape: } N \times n_{head} \times (d_{head}/2) \times 2$$
$$K' = \text{Reshape}(K), \quad \text{shape: } N \times n_{head} \times (d_{head}/2) \times 2 \tag{13}$$

We can then collapse the last dimension into a complex representation by interpreting each pair
of elements as the real and imaginary parts of complex numbers:

$$Q^* = \text{Complex}(Q'), \quad \text{shape: } N \times n_{head} \times (d_{head}/2)$$
$$K^* = \text{Complex}(K'), \quad \text{shape: } N \times n_{head} \times (d_{head}/2) \tag{14}$$
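
In PyTorch, this reshape-and-collapse step might look like the following sketch (tensor names and shapes are illustrative; `torch.view_as_complex` plays the role of the Complex operator above):

```python
import torch

N, n_head, d_head = 10, 4, 64                  # illustrative shapes
Q = torch.randn(N, n_head, d_head)

# Equation (13): expose the segment pairs as a trailing dimension of size 2.
Q_pairs = Q.reshape(N, n_head, d_head // 2, 2)

# Equation (14): interpret each (real, imaginary) pair as one complex number.
Q_star = torch.view_as_complex(Q_pairs)        # shape (N, n_head, d_head // 2), complex64
```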

If we consider a query vector $q_p^*$ in that tensor representation, it is an array of $d_{head}/2$ complex numbers:

$$q_p^* = [v_1, v_2, \ldots, v_{d_{head}/2}] \tag{15}$$

We can rotate this complex vector by applying a complex rotation element-wise:

$$q_p^{*\text{rotated}} = \big[v_1 e^{ip\theta_1},\, v_2 e^{ip\theta_2},\, \ldots,\, v_{d_{head}/2}\, e^{ip\theta_{d_{head}/2}}\big] = \big[e^{ip\theta_1},\, e^{ip\theta_2},\, \ldots,\, e^{ip\theta_{d_{head}/2}}\big] \odot q_p^* \tag{16}$$
This means that we can rotate the whole query and key tensors with a simple element-wise
product:

$$Q^{*\text{rotated}} = R \odot Q^*, \qquad K^{*\text{rotated}} = R \odot K^* \tag{17}$$
where

$$R = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
e^{i\theta_1} & e^{i\theta_2} & \cdots & e^{i\theta_{d_{head}/2}} \\
e^{i2\theta_1} & e^{i2\theta_2} & \cdots & e^{i2\theta_{d_{head}/2}} \\
\vdots & \vdots & \ddots & \vdots \\
e^{i(N-1)\theta_1} & e^{i(N-1)\theta_2} & \cdots & e^{i(N-1)\theta_{d_{head}/2}}
\end{pmatrix} \tag{18}$$

is an $N \times (d_{head}/2)$ matrix whose rows correspond to the token position $p$ and whose columns correspond to the segment index $k$ (it is broadcast across the head dimension). Deep learning frameworks like PyTorch and TensorFlow support operations with complex tensors. Once the rotation in the complex space is performed, the tensors can be reshaped back into the original format:
$$Q'^{\,\text{rotated}} = \text{Real}(Q^{*\text{rotated}}), \quad \text{shape: } N \times n_{head} \times (d_{head}/2) \times 2$$
$$K'^{\,\text{rotated}} = \text{Real}(K^{*\text{rotated}}), \quad \text{shape: } N \times n_{head} \times (d_{head}/2) \times 2$$
$$Q^{\text{rotated}} = \text{Reshape}(Q'^{\,\text{rotated}}), \quad \text{shape: } N \times n_{head} \times d_{head}$$
$$K^{\text{rotated}} = \text{Reshape}(K'^{\,\text{rotated}}), \quad \text{shape: } N \times n_{head} \times d_{head} \tag{19}$$
The overall time complexity of applying the RoPE rotation is $O(N d_{model})$, as the $R$ matrix is
applied to every head.
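
Putting the pieces together, here is a self-contained sketch of the complex-number implementation described above, assuming queries and keys of shape $N \times n_{head} \times d_{head}$ (the function name is mine, and production code would typically precompute and cache $R$ rather than rebuild it on every call):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate queries or keys with RoPE.  x has shape (N, n_head, d_head)."""
    N, n_head, d_head = x.shape

    # Equation (8): one angle theta_k per pair of dimensions (0-based k here).
    k = torch.arange(d_head // 2, dtype=torch.float32)
    theta = base ** (-2.0 * k / d_head)

    # Equation (18): R[p, k] = exp(i * p * theta_k), shape (N, d_head // 2).
    p = torch.arange(N, dtype=torch.float32)
    angles = torch.outer(p, theta)
    R = torch.polar(torch.ones_like(angles), angles)             # complex64

    # Equations (13)-(14): view x as complex pairs, shape (N, n_head, d_head // 2).
    x_star = torch.view_as_complex(x.float().reshape(N, n_head, d_head // 2, 2))

    # Equation (17): element-wise rotation, broadcasting R across the head dimension.
    x_rotated = x_star * R[:, None, :]

    # Equation (19): back to real pairs, then to the original layout.
    return torch.view_as_real(x_rotated).reshape(N, n_head, d_head)

q = torch.randn(16, 8, 64)      # N = 16 tokens, 8 heads, d_head = 64
k = torch.randn(16, 8, 64)
q_rot, k_rot = apply_rope(q), apply_rope(k)
```

In practice, implementations usually build $R$ once for the maximum sequence length and slice the first $N$ rows for each batch.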

RoPE has become one of the most influential innovations in modern language models,
adopted in models like LLaMA, PaLM, GPT-NeoX, and many other state-of-the-art architectures.
It is computationally efficient and can handle arbitrary sequence lengths: it does not require a
predefined maximum sequence length and it generalizes reasonably well to positions beyond those
seen during training.

Increasing The Context Window With RoPE


RoPE is quite flexible when it comes to extending the context window. Most sinusoidal positional
encodings tend to break down when we try to extend to sequence lengths much longer
than the ones seen during training. Instead, with RoPE, we can interpolate the token
positions and fine-tune the model to adjust to the new sequence lengths [1]. For example,
let's assume that so far we have trained the model with sequences of maximum size $N = 2048$,
but now we would like to extend the context window to $N' = 16384$ ($N' = 8N$). We start by
interpolating the token position $p$:

$$p' = p \times \frac{N}{N'} \tag{20}$$
$N'/N$ is the scaling factor, and it dictates by how much we want to extend the context window.
Since the positions in the extended window satisfy $0 \le p \le N' - 1$, the interpolated positions satisfy:

$$0 \le p' \le (N' - 1) \times \frac{N}{N'} < N \tag{21}$$

This means that when we interpolate, we never use token positions beyond the ones seen during
training; instead, we create intermediate non-integer positions. For example, if a true token position
is 4 in a new long sequence, its interpolated value will be:
$$p' = 4 \times \frac{2048}{16384} = 4 \times \frac{1}{8} = 0.5 \tag{22}$$

After interpolation, the elements of the matrix $R$ are modified as:

$$r_{pk} = e^{i p' \theta_k} \tag{23}$$

and the dimensions of $R$ become $N' \times (d_{head}/2)$. The model usually requires minimal fine-tuning ($\sim$1000
steps) to adapt to the "coarser" position information (decimals instead of integers), on a data set
with maximum lengths reaching $N'$. Performance gains can typically continue up to 16 to 32 times
the original context length, but may plateau beyond that. By implementing this relatively
simple change and briefly fine-tuning, you can extend RoPE-based models to handle context
windows many times larger than the ones they were originally trained on.
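
As a sketch of how position interpolation plugs into the complex formulation above (the function name and signature are assumptions, not the implementation from [1]), the only change is to scale the positions by $N/N'$ before building $R$:

```python
import torch

def build_rope_rotations(n_positions: int, d_head: int,
                         scale: float = 1.0, base: float = 10000.0) -> torch.Tensor:
    """Complex rotation matrix R of Equation (18), with optional position interpolation.

    scale = N / N' (e.g. 2048 / 16384 = 1/8) compresses the new, longer range of
    positions back into the range seen during training, as in Equation (20).
    """
    k = torch.arange(d_head // 2, dtype=torch.float32)
    theta = base ** (-2.0 * k / d_head)                           # Equation (8)
    p = torch.arange(n_positions, dtype=torch.float32) * scale    # interpolated p'
    return torch.polar(torch.ones(n_positions, d_head // 2), torch.outer(p, theta))

# Training used N = 2048; the extended context window is N' = 16384.
R_extended = build_rope_rotations(16384, 64, scale=2048 / 16384)  # Equation (23)
```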

References
[1] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context
window of large language models via positional interpolation, 2023.
[2] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer:
Enhanced transformer with rotary position embedding, 2023.
