
The Most Used Positional Encoding: RoPE

Damien Benveniste
newsletter.TheAiEdge.io

The Rotary Position Embedding (RoPE) [2] is now one of the most common strategies used
to inject relative positional information into the attention mechanism. The idea behind
RoPE is to rotate the keys and queries based on the positions of the corresponding tokens in the input
sequence. This encodes the absolute positional information directly into the queries and keys, while, as we will see, the resulting attention scores depend only on the relative positions.
Let's look at a toy example to understand the logic. Let's consider a 2-dimensional query $q_i$
and a 2-dimensional key $k_j$. To rotate 2-dimensional vectors, we use rotation matrices:
$$q_i^{\text{rotated}} = R(i\theta)\, q_i, \qquad k_j^{\text{rotated}} = R(j\theta)\, k_j \tag{1}$$

where:

$$R(i\theta) = \begin{pmatrix} \cos i\theta & -\sin i\theta \\ \sin i\theta & \cos i\theta \end{pmatrix}
\quad \text{and} \quad
R(j\theta) = \begin{pmatrix} \cos j\theta & -\sin j\theta \\ \sin j\theta & \cos j\theta \end{pmatrix} \tag{2}$$
$R$ is the standard 2-dimensional rotation matrix, with $\theta_i = i\theta$ and $\theta_j = j\theta$ being the position-specific angles
associated with $q_i$ and $k_j$. Let's now compute the alignment score between those rotated vectors:

$$q_i^{\text{rotated}\top} k_j^{\text{rotated}} = \big(R(i\theta)\, q_i\big)^\top \big(R(j\theta)\, k_j\big) = q_i^\top R(i\theta)^\top R(j\theta)\, k_j \tag{3}$$
The rotation matrix follows this property:
$$R(i\theta)^\top = R(i\theta)^{-1} = R(-i\theta) \tag{4}$$
This means that taking the transpose of a rotation matrix is equivalent to rotating in the
opposite direction. Therefore, we have:
$$R(i\theta)^\top R(j\theta) = R(-i\theta)\, R(j\theta) = R\big((j-i)\theta\big) \tag{5}$$

As a consequence, the alignment score computed for the rotated vectors naturally captures the
relative positional information between them:
$$q_i^{\text{rotated}\top} k_j^{\text{rotated}} = q_i^\top R\big((j-i)\theta\big)\, k_j \tag{6}$$
This is very reminiscent of the approach developed in Transformer-XL, but without relying on
any learnable model parameters.
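
To make this concrete, here is a quick numerical check of Equation (6), written as a PyTorch sketch (the vectors, the base angle, and the positions are arbitrary illustrations): the alignment score between rotated queries and keys is the same for any pair of positions with the same offset $j - i$.

```python
import math
import torch

def rot2d(angle: float) -> torch.Tensor:
    """2-dimensional rotation matrix R(angle) from Equation (2)."""
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]])

theta = 0.3                                  # arbitrary base angle
q = torch.tensor([1.0, 2.0])                 # toy 2-dimensional query q_i
k = torch.tensor([0.5, -1.0])                # toy 2-dimensional key k_j

# Three position pairs that share the same relative offset j - i = 5.
for i, j in [(2, 7), (12, 17), (100, 105)]:
    score = (rot2d(i * theta) @ q) @ (rot2d(j * theta) @ k)
    print(f"i={i}, j={j}, score={score.item():.6f}")
# The three scores agree (up to floating-point rounding), as predicted by Equation (6).
```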

Let's extend this logic to $d_{model}$- or $d_{head}$-dimensional vectors. Rotation matrices have a
natural extension to higher dimensions, but instead, we are going to consider each
pair of elements in the query and key vectors separately and perform pairwise rotations of those 2-dimensional segments.

Each segment will have its own 2-dimensional rotation matrix $R(p\theta_k)$, depending on both the position $p$ of the token in the sequence and the index $k$ of the segment within the vectors:
$$R(p\theta_k) = \begin{pmatrix} \cos p\theta_k & -\sin p\theta_k \\ \sin p\theta_k & \cos p\theta_k \end{pmatrix} \tag{7}$$

where $\theta_k$ is chosen as:

$$\theta_k = 10000^{-2(k-1)/d_{head}} \tag{8}$$

with $k \in \{1, \ldots, d_{head}/2\}$. For example, if $d_{head} = 64$, for $k = 1$ we have $\theta_1 = 10000^0 = 1$
(highest frequency) and for $k = 32$ we have $\theta_{32} = 10000^{-0.97} \simeq 0.00013$ (lowest frequency).
This means that for small k, the rotation matrices for these dimensions complete full rotations
after small position shifts, and they are highly sensitive to small positional changes. They
tend to capture fine-grained, local relationships between nearby tokens, and the attention score
contribution from these dimensions drops quickly as tokens get farther apart. For large k, the
rotation matrices change gradually across many positions, and they help maintain coherent
relationships over longer sequences. They are useful to capture long-range dependencies and
broader structural patterns, as the attention score contribution from these dimensions remains
significant even for distant tokens.
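
As a small illustration of Equation (8), here is how the $\theta_k$ frequencies can be computed for $d_{head} = 64$ (a minimal sketch; the variable names are mine, and the index is 0-based in code while the text uses a 1-based $k$):

```python
import torch

d_head = 64
# theta_k = 10000^(-2(k-1)/d_head) for k = 1, ..., d_head/2 (Equation (8)),
# written here with a 0-based index.
k = torch.arange(d_head // 2, dtype=torch.float32)
theta = 10000.0 ** (-2.0 * k / d_head)

print(theta[0].item())    # 1.0       -> k = 1 in the text, highest frequency
print(theta[-1].item())   # ~1.3e-4   -> k = 32 in the text, lowest frequency
```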
To perform the full rotation of every segment at once, we construct the following rotation
matrix:
 
$$R_p = \begin{pmatrix}
R(p\theta_1) & 0 & \cdots & 0 \\
0 & R(p\theta_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & R(p\theta_{d_{head}/2})
\end{pmatrix}
= \begin{pmatrix}
\cos(p\theta_1) & -\sin(p\theta_1) & \cdots & 0 & 0 \\
\sin(p\theta_1) & \cos(p\theta_1) & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \cos(p\theta_{d_{head}/2}) & -\sin(p\theta_{d_{head}/2}) \\
0 & 0 & \cdots & \sin(p\theta_{d_{head}/2}) & \cos(p\theta_{d_{head}/2})
\end{pmatrix} \tag{9}$$

We can then apply it to the queries and keys for each head in a similar manner to the 2-dimensional case:

$$q_i^{\text{rotated}} = R_i\, q_i, \qquad k_j^{\text{rotated}} = R_j\, k_j \tag{10}$$
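
For illustration, here is a direct but unoptimized sketch of Equations (9) and (10) for a single head (the function name and shapes are assumptions, not a reference implementation); the next section explains why practical implementations avoid building $R_p$ explicitly.

```python
import math
import torch

def rope_matrix(p: int, d_head: int) -> torch.Tensor:
    """Block-diagonal rotation matrix R_p from Equation (9) -- illustration only."""
    blocks = []
    for k in range(d_head // 2):                       # segment index (0-based here)
        theta_k = 10000.0 ** (-2.0 * k / d_head)       # Equation (8)
        a = p * theta_k
        blocks.append(torch.tensor([[math.cos(a), -math.sin(a)],
                                    [math.sin(a),  math.cos(a)]]))
    return torch.block_diag(*blocks)                   # (d_head, d_head), mostly zeros

d_head = 8
q = torch.randn(d_head)
q_rotated = rope_matrix(p=5, d_head=d_head) @ q        # Equation (10) for position i = 5
```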

The Complex Number Representation


The formulation developed above is correct but not practical: it implies building a sparse
matrix $R_p$ for every position $p$ in the input sequence, and it does not lend itself to vectorized
computation. Instead, we are going to use a couple of tricks to perform this operation efficiently.
We are going to pretend that each pair of elements of the vectors is a complex number. This means
that if $(a, b)$ is a pair of elements, it represents the complex number $v$:

$$v = a + ib \tag{11}$$


where $i = \sqrt{-1}$. The advantage of complex numbers is the ease of formalizing rotations.
Complex numbers behave very similarly to 2-dimensional vectors, and to rotate one by an
angle $\theta$, we just need to multiply it by $e^{i\theta}$:

$$v^{\text{rotated}} = v\, e^{i\theta} \tag{12}$$

Many RoPE implementations involve reshaping the query $Q$ and key $K$ tensors to first
expose the segment-pair dimension. The tensors have shape $N \times n_{head} \times d_{head}$, and
we reshape them into tensors of shape $N \times n_{head} \times (d_{head}/2) \times 2$:

$$Q' = \text{Reshape}(Q), \quad \text{shape: } N \times n_{head} \times (d_{head}/2) \times 2$$
$$K' = \text{Reshape}(K), \quad \text{shape: } N \times n_{head} \times (d_{head}/2) \times 2 \tag{13}$$

We can then collapse the last dimension into a complex representation by interpreting each pair
of elements as the real and imaginary parts of complex numbers:

$$Q^* = \text{Complex}(Q'), \quad \text{shape: } N \times n_{head} \times (d_{head}/2)$$
$$K^* = \text{Complex}(K'), \quad \text{shape: } N \times n_{head} \times (d_{head}/2) \tag{14}$$
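
In PyTorch, this reshape-and-collapse step might look like the following sketch (tensor names and shapes are illustrative; `torch.view_as_complex` plays the role of the Complex operator above):

```python
import torch

N, n_head, d_head = 10, 4, 64                  # illustrative shapes
Q = torch.randn(N, n_head, d_head)

# Equation (13): expose the segment pairs as a trailing dimension of size 2.
Q_pairs = Q.reshape(N, n_head, d_head // 2, 2)

# Equation (14): interpret each (real, imaginary) pair as one complex number.
Q_star = torch.view_as_complex(Q_pairs)        # shape (N, n_head, d_head // 2), complex64
```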

If we consider a query vector $q_p^*$ in that tensor representation, it is an array of $d_{head}/2$ complex numbers:

$$q_p^* = [v_1, v_2, \ldots, v_{d_{head}/2}] \tag{15}$$

We can rotate this complex vector by applying a complex rotation element-wise:

$$q_p^{*\text{rotated}} = \big[v_1 e^{ip\theta_1},\, v_2 e^{ip\theta_2},\, \ldots,\, v_{d_{head}/2}\, e^{ip\theta_{d_{head}/2}}\big] = \big[e^{ip\theta_1},\, e^{ip\theta_2},\, \ldots,\, e^{ip\theta_{d_{head}/2}}\big] \odot q_p^* \tag{16}$$
This means that we can rotate the whole query and key tensors with a simple element-wise
product:

$$Q^{*\text{rotated}} = R \odot Q^*, \qquad K^{*\text{rotated}} = R \odot K^* \tag{17}$$
where

$$R = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
e^{i\theta_1} & e^{i\theta_2} & \cdots & e^{i\theta_{d_{head}/2}} \\
e^{i2\theta_1} & e^{i2\theta_2} & \cdots & e^{i2\theta_{d_{head}/2}} \\
\vdots & \vdots & \ddots & \vdots \\
e^{i(N-1)\theta_1} & e^{i(N-1)\theta_2} & \cdots & e^{i(N-1)\theta_{d_{head}/2}}
\end{pmatrix} \tag{18}$$

is an $N \times (d_{head}/2)$ matrix whose rows correspond to the token position $p$ and whose columns correspond to the segment index $k$ (it is broadcast across the head dimension). Deep learning frameworks like PyTorch and TensorFlow support operations with complex tensors. Once the rotation in the complex space is performed, the tensors can be reshaped back into the original format:
$$Q'^{\,\text{rotated}} = \text{Real}(Q^{*\text{rotated}}), \quad \text{shape: } N \times n_{head} \times (d_{head}/2) \times 2$$
$$K'^{\,\text{rotated}} = \text{Real}(K^{*\text{rotated}}), \quad \text{shape: } N \times n_{head} \times (d_{head}/2) \times 2$$
$$Q^{\text{rotated}} = \text{Reshape}(Q'^{\,\text{rotated}}), \quad \text{shape: } N \times n_{head} \times d_{head}$$
$$K^{\text{rotated}} = \text{Reshape}(K'^{\,\text{rotated}}), \quad \text{shape: } N \times n_{head} \times d_{head} \tag{19}$$
The overall time complexity of applying the RoPE rotation is $O(N d_{model})$, as the $R$ matrix is
applied to every head.
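
Putting the pieces together, here is a self-contained sketch of the complex-number implementation described above, assuming queries and keys of shape $N \times n_{head} \times d_{head}$ (the function name is mine, and production code would typically precompute and cache $R$ rather than rebuild it on every call):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate queries or keys with RoPE.  x has shape (N, n_head, d_head)."""
    N, n_head, d_head = x.shape

    # Equation (8): one angle theta_k per pair of dimensions (0-based k here).
    k = torch.arange(d_head // 2, dtype=torch.float32)
    theta = base ** (-2.0 * k / d_head)

    # Equation (18): R[p, k] = exp(i * p * theta_k), shape (N, d_head // 2).
    p = torch.arange(N, dtype=torch.float32)
    angles = torch.outer(p, theta)
    R = torch.polar(torch.ones_like(angles), angles)             # complex64

    # Equations (13)-(14): view x as complex pairs, shape (N, n_head, d_head // 2).
    x_star = torch.view_as_complex(x.float().reshape(N, n_head, d_head // 2, 2))

    # Equation (17): element-wise rotation, broadcasting R across the head dimension.
    x_rotated = x_star * R[:, None, :]

    # Equation (19): back to real pairs, then to the original layout.
    return torch.view_as_real(x_rotated).reshape(N, n_head, d_head)

q = torch.randn(16, 8, 64)      # N = 16 tokens, 8 heads, d_head = 64
k = torch.randn(16, 8, 64)
q_rot, k_rot = apply_rope(q), apply_rope(k)
```

In practice, implementations usually build $R$ once for the maximum sequence length and slice the first $N$ rows for each batch.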

RoPE has become one of the most influential innovations in modern language models,
adopted in models like LLaMA, PaLM, GPT-NeoX, and many other state-of-the-art architectures.
It is computationally efficient and can handle arbitrary sequence lengths: it does not require a
predefined maximum sequence length and it generalizes reasonably well to positions beyond those
seen during training.

Increasing The Context Window With RoPE


RoPE is quite flexible when it comes to extending the context window. Most sinusoidal positional
encodings tend to break down when we try to extend to sequence lengths much longer
than the ones seen during training. Instead, with RoPE, we can interpolate the token
positions and fine-tune the model to adjust to the new sequence lengths [1]. For example,
let's assume that so far we have trained the model with sequences of maximum size $N = 2048$,
but now we would like to extend the context window to $N' = 16384$ ($N' = 8N$). We start by
interpolating the token position $p$:

$$p' = p \times \frac{N}{N'} \tag{20}$$
$N'/N$ is the scaling factor, and it dictates by how much we want to extend the context window.
Since the positions in the extended window satisfy $0 \le p \le N' - 1$, the interpolated positions satisfy:

$$0 \le p' \le (N' - 1) \times \frac{N}{N'} < N \tag{21}$$

This means that when we interpolate, we never use token positions beyond the ones seen during
training; instead, we create intermediate non-integer positions. For example, if a true token position
is 4 in a new long sequence, its interpolated value will be:
$$p' = 4 \times \frac{2048}{16384} = 4 \times \frac{1}{8} = 0.5 \tag{22}$$

After interpolation, the elements of the matrix $R$ are modified as:

$$r_{pk} = e^{i p' \theta_k} \tag{23}$$

and the dimensions of $R$ become $N' \times (d_{head}/2)$. The model usually requires minimal fine-tuning ($\sim$1000
steps) to adapt to the "coarser" position information (decimals instead of integers), on a data set
with maximum lengths reaching $N'$. Performance gains can typically continue up to 16 to 32 times
the original context length, but may plateau beyond that. By implementing this relatively
simple change and briefly fine-tuning, you can extend RoPE-based models to handle context
windows many times larger than the ones they were originally trained on.
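
As a sketch of how position interpolation plugs into the complex formulation above (the function name and signature are assumptions, not the implementation from [1]), the only change is to scale the positions by $N/N'$ before building $R$:

```python
import torch

def build_rope_rotations(n_positions: int, d_head: int,
                         scale: float = 1.0, base: float = 10000.0) -> torch.Tensor:
    """Complex rotation matrix R of Equation (18), with optional position interpolation.

    scale = N / N' (e.g. 2048 / 16384 = 1/8) compresses the new, longer range of
    positions back into the range seen during training, as in Equation (20).
    """
    k = torch.arange(d_head // 2, dtype=torch.float32)
    theta = base ** (-2.0 * k / d_head)                           # Equation (8)
    p = torch.arange(n_positions, dtype=torch.float32) * scale    # interpolated p'
    return torch.polar(torch.ones(n_positions, d_head // 2), torch.outer(p, theta))

# Training used N = 2048; the extended context window is N' = 16384.
R_extended = build_rope_rotations(16384, 64, scale=2048 / 16384)  # Equation (23)
```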

References
[1] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context
window of large language models via positional interpolation, 2023.
[2] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer:
Enhanced transformer with rotary position embedding, 2023.
