Rotary Embeddings: A Relative Revolution
TL;DR:
Rotary Positional Embedding (RoPE) is a new type of position encoding that unifies
absolute and relative approaches. Developed by Jianlin Su in a series of blog posts earlier
this year [12, 13] and in a new preprint [14], it has already garnered widespread interest in
some Chinese NLP circles. This post walks through the method as we understand it, with
the goal of bringing it to the attention of the wider academic community. In general we
have found that across a large suite of setups including regular, linear, and local self-
attention, it either matches or surpasses all other methods currently available for
injecting positional information into transformers.
Another major limitation of existing methods is that they do not work with efficient
transformers. Methods like T5's relative positional bias [10] require constructing the full
N × N attention matrix between positions, which is not possible when using many of
the efficient alternatives to softmax attention, including kernelized variants like FAVOR+
[2].
Intuition
We would like to find a positional encoding function f(x, ℓ) for an item x and its
position ℓ such that, for two items q and k at positions m and n, the inner product
between f(q, m) and f(k, n) is sensitive only to the values of q, k, and their relative
position m − n. This is related in spirit to the kernel trick: we are searching for a feature
map such that its kernel has certain properties. A key piece of information is the
geometric definition of the dot product between Euclidean vectors:
$$\mathbf{q} \cdot \mathbf{k} = \lVert \mathbf{q} \rVert \lVert \mathbf{k} \rVert \cos(\theta_{qk})$$
In plain English, the dot product between two vectors is a function of the magnitude of
individual vectors and the angle between them. With this in mind, the intuition behind
RoPE is that we can represent the token embeddings as complex numbers and their
positions as pure rotations that we apply to them. If we shift both the query and key by
the same amount, changing absolute position but not relative position, this will lead both
representations to be additionally rotated in the same manner---as we will see in the
derivation---thus the angle between them will remain unchanged and thus the dot
product will also remain unchanged. By exploiting the nature of rotations, the dot
product used in self-attention will have the property we are looking for, preserving
relative positional information while discarding absolute position.
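To make this intuition concrete, here is a tiny numpy sketch (our own illustration, not from the method itself) checking that rotating both vectors by the same angle leaves their dot product, and hence the attention logit, unchanged:

```python
import numpy as np

def rotate(v, angle):
    """Rotate a 2-D vector counterclockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ v

q = np.array([1.0, 2.0])
k = np.array([-0.5, 3.0])

# Shifting both q and k by the same absolute offset corresponds to rotating
# both by the same angle, which preserves their relative angle and dot product.
shift = 0.7
assert np.isclose(q @ k, rotate(q, shift) @ rotate(k, shift))
```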
The following is an example illustrating the core idea of RoPE—a more rigorous
derivation is presented in a subsequent section. Some arbitrary $0 < \varepsilon \leq \frac{\pi}{2N}$ is chosen, where $N$ is the maximum sequence length. When viewed elementwise on $\mathbf{q}$ and $\mathbf{k}$, with $j$ as the element index, RoPE can be viewed as follows:
$$\begin{aligned} \mathrm{RoPE}(x, m) &= xe^{mi\varepsilon} \\ \langle \mathrm{RoPE}(q_j, m), \mathrm{RoPE}(k_j, n) \rangle &= \langle q_j e^{mi\varepsilon}, k_j e^{ni\varepsilon} \rangle \\ &= q_j \overline{k_j} \, e^{mi\varepsilon} \, \overline{e^{ni\varepsilon}} \\ &= q_j \overline{k_j} \, e^{(m-n)i\varepsilon} \\ &= \mathrm{RoPE}(q_j \overline{k_j}, m - n) \end{aligned}$$
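The identity above is easy to check numerically; a minimal numpy sketch (with arbitrary values chosen for $\varepsilon$, $q_j$, and $k_j$) follows:

```python
import numpy as np

eps = 0.1          # some arbitrary epsilon in (0, pi / (2N)]
q_j = 2.0 + 1.0j   # a single complex-valued element of q
k_j = 0.5 - 1.5j   # a single complex-valued element of k

def rope(x, pos):
    """Elementwise RoPE: rotate x by pos * eps radians in the complex plane."""
    return x * np.exp(1j * pos * eps)

# The Hermitian inner product <a, b> = a * conj(b) depends only on m - n:
a = rope(q_j, 7) * np.conj(rope(k_j, 3))    # positions (7, 3), offset 4
b = rope(q_j, 11) * np.conj(rope(k_j, 7))   # positions (11, 7), offset 4
assert np.allclose(a, b)
assert np.allclose(a, rope(q_j * np.conj(k_j), 7 - 3))
```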
Visual Intuition
To see how relative position might be preserved in this transformation, we can look to an
analogous situation in classical electrodynamics.
As the wave travels through the waveplate, we can see how the magnitude of the wave is
preserved. We can also better see how the relative position may be encoded as the angle
between subsequent timesteps: the angle between timesteps, and therefore distance
along the axis of travel, is constant. This means the positional information must be
orthogonal to the amplitude in the modulated wave.
Derivation
We begin with absolute positional information: for each token, we know where it is in the
sequence. However, dot products (and therefore attention) do not preserve absolute
positional information, so if we encode that positional information in the absolute
position of the embeddings, we will lose a significant amount of information. On the
other hand, dot products do preserve relative position, so if we can encode the absolute
positional information into the token embeddings in a way that only leverages relative
positional information, that will be preserved by the attention function.
While it is common in machine learning to restrict our attention to the real numbers, for
rotary embeddings it is mathematically more convenient to use the complex numbers as
the base field for our space. Instead of working in the usual $\mathbb{R}^d$, we will work in $\mathbb{C}^{d/2}$ by considering consecutive pairs of elements of the query and key vectors to be a single complex number. Specifically, instead of viewing $\mathbf{q} = (q_1, q_2, q_3, q_4, \ldots, q_d)$ as a $d$-dimensional real vector, we view it as $\mathbf{q} = (q_1 + iq_2, q_3 + iq_4, \ldots, q_{d-1} + iq_d) \in \mathbb{C}^{d/2}$. As we will see, casting it in this fashion will make discussing the rotary embeddings easier. If $d$ is odd, we can pad it with a dummy coordinate to ensure things line up correctly; alternatively, we can simply increase $d$ by one.
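Concretely, the pairing is just a couple of lines of numpy (a sketch of the bookkeeping only, not any particular library's API):

```python
import numpy as np

d = 8
q = np.arange(1.0, d + 1)            # q = (q_1, q_2, ..., q_d) as a real vector
q_complex = q[0::2] + 1j * q[1::2]   # (q_1 + i q_2, q_3 + i q_4, ...) in C^{d/2}
assert q_complex.shape == (d // 2,)
```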
Let q and k be query and key vectors respectively and let m and n be the absolute
positions of the corresponding tokens. Let f(x, ℓ) be the function that takes the token
embedding x in position ℓ and outputs a new embedding that contains (in some fashion)
the relative positional information. Our goal is to find a "nice" function f that does this.
Once the positional information is encoded, we need to compute the inner product like
so:
$$\langle f(\mathbf{q}, m), f(\mathbf{k}, n) \rangle = g(\mathbf{q}, \mathbf{k}, m - n)$$
where $g(\mathbf{q}, \mathbf{k}, m - n)$ now represents the pre-softmax logit of the usual attention equation. Writing these three functions in exponential form gives

$$\begin{aligned} f(\mathbf{q}, m) &= R_f(\mathbf{q}, m)\,e^{i\Theta_f(\mathbf{q}, m)} \\ f(\mathbf{k}, n) &= R_f(\mathbf{k}, n)\,e^{i\Theta_f(\mathbf{k}, n)} \\ g(\mathbf{q}, \mathbf{k}, m - n) &= R_g(\mathbf{q}, \mathbf{k}, m - n)\,e^{i\Theta_g(\mathbf{q}, \mathbf{k}, m - n)} \end{aligned}$$

Equating the radial and angular components of $\langle f(\mathbf{q}, m), f(\mathbf{k}, n) \rangle = g(\mathbf{q}, \mathbf{k}, m - n)$ yields

$$\begin{aligned} R_f(\mathbf{q}, m)\,R_f(\mathbf{k}, n) &= R_g(\mathbf{q}, \mathbf{k}, m - n) \\ \Theta_f(\mathbf{q}, m) - \Theta_f(\mathbf{k}, n) &= \Theta_g(\mathbf{q}, \mathbf{k}, m - n) \end{aligned}$$

Substituting $m = n$ into the first equation gives $R_f(\mathbf{q}, m)\,R_f(\mathbf{k}, m) = R_g(\mathbf{q}, \mathbf{k}, 0) = R_f(\mathbf{q}, 0)\,R_f(\mathbf{k}, 0)$.
As the prior equation is valid for all $m$, it means that $R_f$ is independent of the value of $m$, so we can set $R_f(\mathbf{x}, y) = \mathbf{x}$. Similarly, if we denote $\Theta(\mathbf{x}) = \Theta_f(\mathbf{x}, 0)$, we obtain

$$\Theta_f(\mathbf{q}, m) - \Theta_f(\mathbf{k}, m) = \Theta_g(\mathbf{q}, \mathbf{k}, 0) = \Theta_f(\mathbf{q}, 0) - \Theta_f(\mathbf{k}, 0) = \Theta(\mathbf{q}) - \Theta(\mathbf{k})$$

which implies that $\Theta_f(\mathbf{q}, m) - \Theta(\mathbf{q}) = \Theta_f(\mathbf{k}, m) - \Theta(\mathbf{k})$ for all $\mathbf{q}, \mathbf{k}, m$. This allows us to decompose $\Theta_f$ as $\Theta_f(\mathbf{x}, y) = \Theta(\mathbf{x}) + \varphi(y)$. Examining the case of $m = n + 1$ reveals that

$$\varphi(m) - \varphi(m - 1) = \Theta_g(\mathbf{q}, \mathbf{k}, 1) + \Theta(\mathbf{k}) - \Theta(\mathbf{q})$$
Since the right-hand side does not depend on $m$, the left-hand side must not either, and so $\varphi$ is an arithmetic progression. Setting the initial values $\varphi(0) = 0$ and $\varphi(1) = \theta$, we have $\varphi(m) = m\theta$.
Putting all of these pieces together, we get the final formula for the rotary positional
embedding:
$$\begin{aligned} f(\mathbf{q}, m) &= R_f(\mathbf{q}, m)\,e^{i\Theta_f(\mathbf{q}, m)} \\ &= \mathbf{q}\,e^{i(\Theta(\mathbf{q}) + m\theta)} \\ &= \sum_{j=1}^{d/2} q_j e^{im\theta_j} \vec{e}_j \end{aligned}$$
and likewise for $\mathbf{k}$. Since computers tend to like real numbers and matrices more than complex numbers, it's convenient to convert this expression into the matrix equation
$$f(\mathbf{q}, m) = \begin{pmatrix} M_1 & & & \\ & M_2 & & \\ & & \ddots & \\ & & & M_{d/2} \end{pmatrix} \begin{pmatrix} q_1 \\ q_2 \\ \vdots \\ q_d \end{pmatrix} = \mathbf{\Theta}_m Q_m = \mathbf{\Theta}_m W_q X_m$$

where each $M_j$ is the $2 \times 2$ rotation matrix $\begin{pmatrix} \cos m\theta_j & -\sin m\theta_j \\ \sin m\theta_j & \cos m\theta_j \end{pmatrix}$ acting on the $j$-th pair of coordinates, $\mathbf{\Theta}_m$ is the resulting block-diagonal rotation matrix, $W_q$ is the learned query projection, and $X_m$ is the embedding of the token at position $m$.
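As a sanity check, the complex and matrix forms can be verified to agree with a short numpy sketch; the $\theta_j = 10000^{-2(j-1)/d}$ frequency schedule used here is the one proposed in the RoFormer paper, though any choice works for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 5                                    # embedding dimension and position
q = rng.normal(size=d)
theta = 10000.0 ** (-np.arange(0, d, 2) / d)   # theta_j for each coordinate pair

# Complex form: rotate each pair (q_{2j-1} + i q_{2j}) by m * theta_j.
q_complex = (q[0::2] + 1j * q[1::2]) * np.exp(1j * m * theta)

# Matrix form: block-diagonal matrix of 2x2 rotations by m * theta_j.
Theta_m = np.zeros((d, d))
for j, t in enumerate(theta):
    c, s = np.cos(m * t), np.sin(m * t)
    Theta_m[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, -s], [s, c]]
q_rotated = Theta_m @ q

# The two forms agree elementwise.
assert np.allclose(q_rotated[0::2], q_complex.real)
assert np.allclose(q_rotated[1::2], q_complex.imag)
```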
RoPE extends naturally to positions with more than one dimension (for example, the row and column indices of an image patch): split the embedding in half and apply an independent one-dimensional rotary embedding to each half along its own axis. The pre-softmax logit then decomposes as

$$\begin{aligned} \langle f(\mathbf{q}, m, i), f(\mathbf{k}, n, j) \rangle &= \langle f_1(\mathbf{q}_{:d/2}, m), f_1(\mathbf{k}_{:d/2}, n) \rangle + \langle f_2(\mathbf{q}_{d/2:}, i), f_2(\mathbf{k}_{d/2:}, j) \rangle \\ &= g_1(\mathbf{q}_{:d/2}, \mathbf{k}_{:d/2}, m - n) + g_2(\mathbf{q}_{d/2:}, \mathbf{k}_{d/2:}, i - j) \\ &= g(\mathbf{q}, \mathbf{k}, m - n, i - j) \end{aligned}$$

which depends only on the relative offsets $m - n$ and $i - j$ along each axis.
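A toy numpy check of the two-dimensional case (our own illustration, using a single shared $\varepsilon$ for simplicity rather than a full frequency schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps = 8, 0.1                    # complex dimension and rotation increment
q = rng.normal(size=d) + 1j * rng.normal(size=d)
k = rng.normal(size=d) + 1j * rng.normal(size=d)

def rope_2d(x, row, col):
    """Rotate the first half of x by the row position and the second half by the column position."""
    out = x.copy()
    out[:d // 2] *= np.exp(1j * row * eps)
    out[d // 2:] *= np.exp(1j * col * eps)
    return out

# The logit depends only on the relative offsets along each axis:
a = np.vdot(rope_2d(k, 3, 5), rope_2d(q, 7, 9))     # offsets (4, 4)
b = np.vdot(rope_2d(k, 10, 1), rope_2d(q, 14, 5))   # offsets (4, 4) again
assert np.allclose(a, b)
```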
After reading Jianlin Su's original blog posts [12, 13], we were curious how well such a
first-principles approach to positional encoding would stack up against existing methods.
Despite a tremendous number of papers that have come out claiming to improve the
transformer architecture, very few approaches generalize well across codebases and
tasks. However, we have found that rotary positional embeddings perform as well or
better than other positional techniques in every architecture we have tried.
Implementation
A naive implementation of rotary positional embeddings would use the block diagonal
matrix form shown earlier. In practice, implementing rotary positional embeddings this
way is highly inefficient, and more optimized forms are readily available. The original
implementations of RoPE are available in roformer and bert4keras.
GPT-NeoX (PyTorch)
Mesh Transformer JAX (JAX)
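For illustration only, a minimal (unoptimized, unbatched) numpy sketch of the elementwise form is shown below; it is not the code from the repositories above, and it assumes the $\theta_j = 10000^{-2(j-1)/d}$ schedule:

```python
import numpy as np

def apply_rotary(x, base=10000.0):
    """Apply rotary embeddings to x of shape [seq_len, dim] by rotating each
    consecutive pair (x_1, x_2), (x_3, x_4), ... by a position-dependent angle."""
    seq_len, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # theta_j, shape [dim/2]
    angles = np.outer(np.arange(seq_len), inv_freq)    # m * theta_j, shape [seq_len, dim/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Applied to queries and keys before the attention dot product:
q = apply_rotary(np.random.randn(16, 64))
k = apply_rotary(np.random.randn(16, 64))
logits = q @ k.T
```

An optimized version would precompute and cache the cos/sin tables, operate on batched tensors, and fuse the pointwise operations, which is where the speedups discussed below come from.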
Experiments
We have found rotary embeddings to be effective for many varieties of attention.
Final validation loss / ppl scores on OWT2 validation set at 55k steps (~30B tokens)
Final validation loss / ppl scores on Pile validation set at 8k steps (~8B tokens)
In smaller scale tests, we have also put RoPE head to head against other alternatives including the relative position method of Shaw et al. [11], TUPE [5], and position-infused attention [8].
Runtime
In general, we find that the runtime cost of rotary embeddings is fairly negligible. With the above implementation, applying the rotary embeddings is naively about 4-5x the cost of applying additive positional embeddings. With a fusing compiler such as TorchScript, the runtime can be reduced to about 2-2.5x that of additive positional embeddings. Concretely, for query and key tensors of shape [2048, 16, 12, 64], applying rotary embeddings takes 5.3 milliseconds, while applying additive positional embeddings takes 2.1 milliseconds.
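The numbers above come from our fused PyTorch implementation; purely as an illustration of how such a comparison can be set up (not the benchmark used for those figures), a rough CPU-only numpy sketch might look like:

```python
import time
import numpy as np

seq, batch, heads, hdim = 2048, 16, 12, 64
x = np.random.randn(seq, batch, heads, hdim).astype(np.float32)
pos_emb = np.random.randn(seq, 1, 1, hdim).astype(np.float32)

inv_freq = (10000.0 ** (-np.arange(0, hdim, 2) / hdim)).astype(np.float32)
angles = np.outer(np.arange(seq), inv_freq)                   # [seq, hdim/2]
cos = np.cos(angles)[:, None, None, :].astype(np.float32)
sin = np.sin(angles)[:, None, None, :].astype(np.float32)

def additive(x):
    # Broadcast-add a learned-style positional embedding.
    return x + pos_emb

def rotary(x):
    # Rotate consecutive feature pairs by cached position-dependent angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

for fn in (additive, rotary):
    start = time.perf_counter()
    for _ in range(10):
        fn(x)
    print(fn.__name__, f"{(time.perf_counter() - start) / 10 * 1e3:.1f} ms")
```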
Conclusion
Rotary embeddings make it possible to implement relative attention in a straightforward
and efficient manner, and we look forward to the work it inspires. Simple improvements
to the transformer architecture that carry over robustly between different types of self-
attention are few and far between [6].
Citation Information
@article{rope-paper,
title={RoFormer: Enhanced Transformer with Rotary Position Embedding},
author={Su, Jianlin and Lu, Yu and Pan, Shengfeng and Wen, Bo and Liu, Yunfeng},
journal={arXiv preprint arXiv:2104.09864},
year={2021}
}
@misc{rope-eleutherai,
title = {Rotary Embeddings: A Relative Revolution},
author = {Biderman, Stella and Black, Sid and Foster, Charles and Gao, Leo and Hallahan, Eric and He, Horace and Wang, Ben and Wang, Phil},
howpublished = {\url{blog.eleuther.ai/}},
note = {[Online; accessed ]},
year = {2021}
}
References
[1] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165, 2020.
[2] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea
Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al.
Rethinking Attention with Performers. arXiv preprint arXiv:2009.14794, 2020.
[3] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster,
Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB Dataset
of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027, 2021.
[4] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon,
Curtis Hawthorne, Andrew M Dai, Matthew D Hoffman, Monica Dinculescu, and Douglas
Eck. Music Transformer. arXiv preprint arXiv:1809.04281, 2018.
[5] Guolin Ke, Di He, and Tie-Yan Liu. Rethinking Positional Encoding in Language Pre-
training. arXiv preprint arXiv:2006.15595, 2020.
[6] Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael
Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, et al. Do
Transformer Modifications Transfer Across Implementations and Applications? arXiv
preprint arXiv:2102.11972, 2021.
[7] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa
Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan
Catanzaro, et al. Efficient Large-Scale Language Model Training on GPU Clusters. arXiv
preprint arXiv:2104.04473, 2021.
[8] Ofir Press, Noah A Smith, and Mike Lewis. Shortformer: Better Language Modeling
using Shorter Inputs. arXiv preprint arXiv:2012.15832, 2020.
[9] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini
Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
Transferable Visual Models From Natural Language Supervision. arXiv preprint
arXiv:2103.00020, 2021.
[10] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the Limits of Transfer Learning with
a Unified Text-to-Text Transformer. arXiv preprint arXiv:1910.10683, 2019.
[11] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-Attention with Relative
Position Representations. arXiv preprint arXiv:1803.02155, 2018.
[14] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced
Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864, 2021.
[15] Hao Tan and Mohit Bansal. Vokenization: Improving Language Understanding with
Contextualized, Visual-Grounded Supervision. arXiv preprint arXiv:2010.06775, 2020.
[16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv preprint
arXiv:1706.03762, 2017.