3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding
Abstract
Inspired by the Bloch Sphere representation, we propose a novel rotary position encoding on a three-dimensional sphere, named 3D Rotary Position Encoding (3D-RPE). 3D-RPE is an advanced version of the widely used 2D Rotary Position Encoding (RoPE), with two major advantages for modeling long contexts: controllable long-term decay and improved position resolution. For controllable long-term decay, 3D-RPE allows for the regulation of long-term decay within the chunk size, ensuring the modeling of relative positional information between tokens at a distant relative position. For enhanced position resolution, 3D-RPE can mitigate the degradation of position resolution caused by position interpolation on RoPE. We have conducted experiments on long-context Natural Language Understanding (NLU) and long-sequence Language Modeling (LM) tasks. From the experimental results, 3D-RPE achieved performance improvements over RoPE, especially in long-context NLU tasks.
1 Introduction
Rotary Position Encoding (RoPE) [23] is essential in Transformer-based Large Language Models (LLMs), such as the LLaMA models [24]. RoPE merges the advantages of absolute and relative positional encoding by using a rotation mechanism to represent each position. Despite its widespread use in LLMs [24, 27, 7], RoPE has notable limitations when extending LLMs with a predefined context window. The long-term decay problem of RoPE limits the model’s ability to extend positions outward in long-context tasks. Although the long-context modeling capability of LLMs can be extended through position interpolation, as more positions are inserted, RoPE encounters the challenge of decreased position resolution.
We propose a novel position encoding mechanism for transformer architecture, called 3D Rotary Position Encoding (3D-RPE), to address challenges in long-context modeling faced by LLMs using RoPE. Inspired by the Bloch Sphere representation, 3D-RPE applies rotary position encoding on a three-dimensional spherical surface, as illustrated in Figure 1(b). In contrast, RoPE employs rotation on a two-dimensional circular path, as depicted in Figure 1(a). RoPE suffers from long-term decay, as shown in Figure 1(c), implying that as the relative distance increases, the relative upper bound on token correlations at modeled relative positions will continuously decrease. 3D-RPE addresses this issue by segmenting a long sequence into chunks and setting rotation angles within and between the chunks to construct position encoding. As shown in Figure 1(d), 3D-RPE is able to control this relative upper bound through two relative positional dimensions, namely within and between chunks. Compared to Figure 1(c), this method improves the upper bound on correlations between long relative distances and alleviates the issue of long-term decay.
Position Interpolation (PI) methods [4, 18] based on RoPE are often employed to extend LLMs for modeling contexts that exceed the pre-training length. These techniques scale the position encoding during inference, allowing the originally out-of-range position encoding to fall within the trained position interval after interpolation. However, as the interpolation factor increases, PI experiences a substantial decline in positional resolution among tokens, detrimentally affecting long-context modeling performance. As illustrated in Figure 1(e), extending the pre-training length to using linear PI [4] leads to reduced positional resolution with increasing . 3D-RPE employs a 3D rotating sphere for position encoding, which supports higher positional resolution compared to the 2D circular rotation. Similarly, through linear PI extension, 3D-RPE achieves a positional resolution superior to (See Figure 1(f)). This benefit has been theoretically proven (Refer to Theorem 1 in Section 3.2.2) and corroborated by experimental results (Refer to Table 4 in Section 5.2).
We conducted experiments on long-sequence Language Modeling (LM) and long-context Natural Language Understanding (NLU) tasks. Our experimental results highlight the promising performance of the 3D-RPE method, especially in tasks requiring long-context language understanding.
The major contributions of this paper are as follows:
- A position encoding method on a 3D sphere, 3D-RPE, is provided, which can enhance the long-context modeling capability of LLMs by replacing RoPE.
- It is proved that 3D-RPE has two benefits: controllable long-term decay and mitigation of the reduction in positional resolution caused by position interpolation.
- LLMs combined with 3D-RPE achieve significant performance improvements in long-context NLU tasks.
The structure of this paper is as follows. Section 2 covers the preliminaries of 3D-RPE, namely the Bloch Sphere and RoPE. Section 3 explains the construction of 3D-RPE on a 3D rotating sphere and highlights its benefits over RoPE. Section 4 reviews related work. In Section 5, we validate the advantages of our method through experiments. Section 6 concludes with a discussion of 3D-RPE's impact.
2 Preliminaries
The analysis of 3D-RPE relies on concepts and results from the Bloch Sphere and RoPE. We introduce the Bloch Sphere in Section 2.1 and RoPE [23] in Section 2.2.
2.1 Bloch Sphere
Bloch Sphere (BS) offers a geometric depiction of the pure state of a two-level quantum mechanical system. The state vector $|\psi\rangle$ is mathematically expressed as
$$|\psi\rangle = e^{i\gamma}\left(\cos\frac{\theta}{2}\,|0\rangle + e^{i\varphi}\sin\frac{\theta}{2}\,|1\rangle\right) \tag{1}$$
where $|0\rangle$ and $|1\rangle$ are Dirac's notations, and $\gamma$, $\theta$ and $\varphi$ are rotation angles. In our work, the global phase encodes the relative positions of tokens within chunks, and a rotation angle encodes the relative positions of tokens across chunks. Some other concepts about BS are shown in Supplementary Materials A.
2.2 Rotary Position Embedding
Rotary Position Embedding (RoPE) is a commonly used relative position encoding technique in LLMs, such as LLaMA [24], GPT-J [27], and Vicuna [7]. RoPE is a rotary encoding in 2-dimensional space, which is denoted as follows:
$$f(q, m) = q\,e^{i m\theta}, \qquad f(k, n) = k\,e^{i n\theta} \tag{2}$$
$q$ and $k$ are hidden vectors from the query and key for a specific attention head in the Transformer (two-dimensional, written as complex numbers). For ease of differentiation, $q$ and $k$ can be refined later as $q_m$ and $k_n$; $i$ is the imaginary unit, and $\theta$ is the rotary angle in RoPE. $m$ and $n$ are position indexes. Then, the inner product is employed to define the self-attention score before the softmax:
$$\begin{aligned}\langle f(q_m, m),\, f(k_n, n)\rangle &= \mathrm{Re}\left[q_m e^{i m\theta}\,\left(k_n e^{i n\theta}\right)^{*}\right] \\ &= \mathrm{Re}\left[q_m k_n^{*}\, e^{i(m-n)\theta}\right]\end{aligned} \tag{3}$$
Eq. (3) is a unary function with respect to the relative position $m-n$, representing the relative position between tokens and modeling the relative positional information. Here, $\mathrm{Re}[\cdot]$ denotes the real part of a complex number and $^{*}$ denotes complex conjugation. In our study, the 3D-RPE self-attention score is a binary function of two relative positions: one between chunks and one within chunks.
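The relative-position property of RoPE can be checked numerically. The following is a minimal sketch, assuming the standard multi-frequency RoPE form from [23] (consecutive feature pairs treated as complex numbers, frequencies $\theta_j = 10000^{-2j/d}$); the function names `rope` and `score` and the dimension 64 are illustrative choices, not part of the paper.

```python
import cmath
import math
import random

def rope(x, pos, base=10000.0):
    """Rotate a real vector x (even length d) by position pos, RoPE-style.

    Each pair (x[2j], x[2j+1]) is treated as one complex number and multiplied
    by exp(i * pos * theta_j), with theta_j = base ** (-2j / d).
    """
    d = len(x)
    out = []
    for j in range(d // 2):
        theta_j = base ** (-2.0 * j / d)
        z = complex(x[2 * j], x[2 * j + 1])
        out.append(z * cmath.exp(1j * pos * theta_j))
    return out

def score(q, k, m, n):
    """Pre-softmax score: real part of the Hermitian inner product."""
    return sum(a * b.conjugate() for a, b in zip(rope(q, m), rope(k, n))).real

random.seed(0)
q = [random.gauss(0, 1) for _ in range(64)]
k = [random.gauss(0, 1) for _ in range(64)]

# Two position pairs with the same offset m - n = 7 give the same score.
near = score(q, k, 10, 3)
far = score(q, k, 1010, 1003)
print(math.isclose(near, far, rel_tol=1e-9, abs_tol=1e-9))
```

Because the rotation factors combine into $e^{i(m-n)\theta_j}$, shifting both positions by the same amount leaves the score unchanged up to floating-point error.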
3 Method
Section 3.1 introduces the new position encoding on a 3D sphere, 3D-RPE. Section 3.2 focuses on analyzing two benefits of 3D-RPE, namely controllable long-term decay and enhanced position resolution.
3.1 3D Rotary Position Encoding
For a long sequence of length and a chunk size set to , where is smaller than the pre-training length of LLM, the sequence can be divided into chunks. Here, represents the ceiling function, rounding up to the nearest integer (see Figure 2). The state vector comes from either Query or Key. Here, represents the positional index of the chunk, and indicates the positional index of the token within the chunk. This is used to calculate the new state vector by rotating the Bloch Sphere. Specifically, two rotation angles, and are defined, with governing the position encoding within the chunk’s internal tokens, and governing the position encoding between the chunks. Our position encoding method is called 3D Rotary Position Encoding, or 3D-RPE. The formal definition of 3D-RPE is provided as follows. The computational process of 3D-RPE in practice is provided in Supplementary Materials B.1.
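The chunk bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions: positions are 0-based, and the chunk index and within-chunk index are taken as the quotient and remainder of the position divided by the chunk size. The names `num_chunks`, `split_position`, and `relative_position`, and the small lengths used, are hypothetical and are not the paper's settings.

```python
def num_chunks(seq_len, chunk_size):
    """Number of chunks: ceiling of seq_len / chunk_size, in integer arithmetic."""
    return (seq_len + chunk_size - 1) // chunk_size

def split_position(pos, chunk_size):
    """Map an absolute position to (chunk index, index within the chunk)."""
    return divmod(pos, chunk_size)

def relative_position(query_pos, key_pos, chunk_size):
    """Relative position as a pair: (between chunks, within chunks)."""
    q_chunk, q_tok = split_position(query_pos, chunk_size)
    k_chunk, k_tok = split_position(key_pos, chunk_size)
    return q_chunk - k_chunk, q_tok - k_tok

# Illustrative values only.
print(num_chunks(10, 4))            # 3 chunks: sizes 4, 4, 2
print(split_position(9, 4))         # (2, 1)
print(relative_position(9, 2, 4))   # (2, -1): two chunks apart
print(relative_position(6, 5, 4))   # (0, 1): same chunk
```

When the chunk component of the pair is zero, only the within-chunk offset remains, which is the case the text later identifies with the standard RoPE formulation.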
Definition 1 (3D Rotary Position Encoding).
Let be a state vector of an attention head without position encoding, where is the dimension of the vector, which is an even number. 3D-RPE encodes into the vector , which can be formalized as:
(4) |
is the imaginary unit. equals to , where and is the first and second halves of the state vector .
In transformer-based LLMs, after applying position encoding to the state vectors from Query and Key, it is essential to compute their attention scores. For the sake of clarity and formalization, we denote the position encoding of the state vector from Query as 3d-PE and from Key as 3d-PE, where and range from to , and and range from to . The self-attention score can be obtained through the conjugate symmetric inner product of and , which are the state vectors from Query and Key,
(5) |
where , and . Let denote the -th components of . In experiments using the LLaMA2 models, the is generally set to . The self-attention score computed after applying 3d-PE is a function of both the relative position between chunks () and the relative position ().
Consequently, the self-attention score relying on 3d-PE is influenced by the relative positions at both the chunk and token levels. It is important to highlight that when and reside within the same chunk (i.e., ), Eq. (5) simplifies to the standard RoPE formulation as depicted in Eq. (3). For a detailed derivation and computation process of Eq. (5), as well as the complete formulation of Eq. (4), please refer to Supplementary Materials B.2.
3.2 Benefits of 3D-RPE
In this section, we delve into two benefits offered by 3D-RPE: the ability to control long-term decay and mitigate the reduction in positional resolution caused by position interpolation.
3.2.1 Controllable Long-term Decay
3D-RPE has the property of controllable long-term decay. Like RoPE, taking the absolute value in Eq (5) and applying the Abel transformation, we derive the upper bound of the correlation coefficients related to term dependencies as follows:
(6) | ||||
where and . For RoPE [23], the relative upper bound is given by , where (see the section 3.4.3 of RoPE [23]). By setting , the value decays as the relative position increases. For the upper bound of 3D-RPE, it is formalized as follows:
(7) |
The domains of the relative position differ between and . In , is in the range , while in , it is in . The relative positions between tokens exceeding the chunk size are constructed collaboratively using positional encoding within and across chunks. The Relative Position Matrix using 3D-RPE is shown in Figure 3.
To illustrate the advantage of controllable long-term decay, we present the results in Figure 1(c) and Figure 1(d). As shown in Figure 1(c), when the relative position exceeds approximately , begins to significantly decrease to below . This limitation of poses challenges for RoPE in modeling attention scores between tokens with longer relative distances (greater than ). In contrast, as shown in Figure 1(d), 3D-RPE employs both and , setting to keep within , thereby preventing decay over longer distances. This method ensures stays at or above for all relative positions.
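The decay of the RoPE bound can be reproduced with a short sketch. It assumes the form given in Section 3.4.3 of [23]: frequencies $\theta_i = 10000^{-2i/d}$, partial sums $S_j = \sum_{i<j} e^{\,i r \theta_i}$ for a relative distance $r$, and the relative upper bound taken as the mean of $|S_j|$ over $j = 1, \dots, d/2$. It covers only the RoPE bound, not Eq. (7); the head dimension 128 and the function name are illustrative assumptions.

```python
import cmath

def rope_relative_upper_bound(rel_pos, dim=128, base=10000.0):
    """Mean magnitude of the partial sums S_1..S_{dim/2} at distance rel_pos."""
    half = dim // 2
    partial = 0j   # running sum S_j
    total = 0.0    # accumulates |S_1| + ... + |S_half|
    for i in range(half):
        theta_i = base ** (-2.0 * i / dim)
        partial += cmath.exp(1j * rel_pos * theta_i)
        total += abs(partial)
    return total / half

# At distance 0 every term equals 1, so the bound is (half + 1) / 2.
for r in (0, 16, 256, 4096):
    print(r, round(rope_relative_upper_bound(r), 3))
```

The value at distance 0 is the maximum, since $|S_j| \le j$ with equality only when all phases align; larger distances spread the phases and shrink the partial sums.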
3.2.2 Enhanced Positional Resolution
Position Interpolation (PI) [4] has been introduced to scale down the position indices to align with the original window size, resulting in enhanced outcomes for context extension. However, as the extension length and interpolation increase, PI can lead to a reduction in relative positional resolution. 3D-RPE can be used alongside PI for long-context extensions. Compared to RoPE combined with PI, 3D-RPE has the advantage of mitigating the reduction in positional resolution caused by positional interpolation, as demonstrated in Theorem 1.
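The resolution loss can be made concrete with a minimal sketch of linear PI as described in [4]: position indices are multiplied by the ratio of the pre-training length to the extended length, so the spacing between neighbouring tokens shrinks by the same factor. The function names and the lengths below are illustrative assumptions, not the paper's settings.

```python
def interpolate_positions(seq_len, pretrain_len):
    """Linear position interpolation: squeeze seq_len indices into [0, pretrain_len)."""
    # No scaling is applied when the sequence already fits the trained window.
    scale = min(1.0, pretrain_len / seq_len)
    return [p * scale for p in range(seq_len)]

def resolution(positions):
    """Smallest gap between adjacent (scaled) positions."""
    return min(b - a for a, b in zip(positions, positions[1:]))

# Hypothetical lengths, chosen only to keep the example small.
pretrain_len = 4
for seq_len in (4, 8, 16):
    pos = interpolate_positions(seq_len, pretrain_len)
    print(seq_len, resolution(pos), max(pos))
```

With these numbers the gap between neighbouring positions halves each time the length doubles, while the largest scaled index stays below the pre-training length.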
Theorem 1 (Enhanced Position Resolution).
For a pre-trained language model with a length of and an extension length requirement of , employing linear position interpolation extension methods based on Rotary Position Encoding (RoPE) can elevate the relative positional resolution from to . Let denote the relative positional encoding resolution achieved by the method based on 3D-RPE, with chunk size , there is:
(8) |
The Proof of Theorem 1 is provided in Supplementary Materials C.
Empirically, in a training-free setting, methods that combine RoPE with interpolation show a significant increase in perplexity as the modeling length increases in language modeling tasks. In contrast, the increase in perplexity is substantially smaller when employing 3D-RPE with linear interpolation (refer to Table 4 in Section 5). This indicates that the enhanced positional resolution improves long-sequence language modeling performance.
4 Related Work
This section provides an overview of the extensive literature on position encoding in Transformers [26] and discusses context extending capabilities based on RoPE.
Position Encoding (PE): PE is important for Transformer-based language models. Earlier studies [22, 21, 28, 23] have focused on enhancing the original absolute position encoding to develop better relative position encoding, thereby improving the text modeling capabilities of language models. These works [22, 21, 28] utilized trainable position vector encoding to directly incorporate positional information into context representations. Although effective, these methods typically add positional information to contextual representations, making them unsuitable for linear self-attention architectures. RoFormer [23] introduced relative position information by rotating context representations, known as RoPE. Transformers utilizing RoPE have become a prevalent backbone in various LLM designs [24, 8, 27, 16]. Our proposed 3D-RPE differs from the two-dimensional space of RoPE by modeling the relative position information of tokens through rotation on the Bloch Sphere.
Long-context LLMs based on RoPE: To enhance the contextual capabilities of Large Language Models (LLMs) using RoPE, several positional encoding interpolation techniques have been developed. These include Linear Position Interpolation (LPI) [4], NTK-aware scaled RoPE (NTK) [17], and YaRN [18] interpolation. Positional Skip-wise Training (PoSE) [31] has notably increased sequence lengths to by amalgamating these positional interpolation strategies. Additionally, LongLoRA [5] introduced the shift-short attention mechanism, allowing for effective emulation of full attention and extending sequences up to , leveraging the LLaMA-2-7B model and LoRA's fine-tuning approach [12]. 3D-RPE further strengthens the positional relationships between distant tokens by capturing inter-chunk positional information and is compatible with existing fine-tuning techniques like LoRA to bolster long-context representation. The Dual Chunk Attention (DCA) [2] method, which enhances the use of pre-trained integer-based parameters, splits query and key sequences into chunks and uses three specialized matrices to capture the relative positions within and between these chunks. This method enhances the model's ability to process longer sequences, but it is unable to model the relative positions within distant chunks. In our work, we employ rotating positional encoding to link attention across different chunks.
5 Experiments
We evaluate the position encoding method, 3D-RPE, on LLaMA2 [24] models (specifically, LLaMA-2-7B and LLaMA-2-7B-chat), which have a pre-training context, and LLaMA-3-8B-Instruct (https://github.com/meta-llama/llama3), which has an pre-training context. Our experiments aim to explore the following aspects: 1) the effect of 3D-RPE on long-context generation, assessed using perplexity; 2) the impact of 3D-RPE on long-context understanding and generation tasks, reflected by accuracy on long-sequence natural language tasks, e.g., multi-document QA; 3) ablation studies to confirm the advantages of 3D-RPE in position interpolation.
5.1 Experimental Settings
In this section, we elaborate on the experimental setup by introducing two types of tasks (i.e., long-context language understanding and long sequence language modeling) and detailing three aspects of the configuration (i.e., training parameters, training and evaluation datasets, and baseline models).
Training Setting: For long-context Natural Language Understanding (NLU) tasks, we have fine-tuned LLaMA-2-7B-chat and LLaMA-3-8B-Instruct. The context length for these models has been extended from to and from to , respectively. The fine-tuning method follows the fine-tuning strategy of LongChat [13]. The training step is . For the long-sequence Language Modeling (LM) tasks, we have fine-tuned LLaMA-2-7B to support extended context length of tokens. The training step is . We set the per-device batch size as , and gradient accumulation step as , which means that the batch size is . We train the model with the next token prediction objective with LoRA [12].
We employed the AdamW optimizer [15] with and for all fine-tuned models. Chunk size is set to . The learning rate was set to , and a linear learning rate warmup was applied. Training was conducted on a single 4xA800 GPU machine using FlashAttention-2 [10].
Datasets: For long-context NLU tasks, we employ the LongAlpaca-12k dataset, which contains 9,000 LongQA and 3,000 short QA entries [6], and the LongAlpaca-16k-length dataset (https://github.com/dvlab-research/LongLoRA/). To evaluate the performance of 3D-RPE for long-context extension, we use LongBench [3], which includes English tasks, Chinese tasks and code tasks, with most tasks having an average context length of to tokens. We focus on the English and code tasks to evaluate our method, 3D-RPE. Additionally, the LEval [1] evaluation set, which also consists of long-context datasets, is used to verify the effectiveness of 3D-RPE. The five datasets annotated from scratch in LEval, namely Coursera, QuALiTY, CodeU, GSM, and TOEFL, are utilized.
For long-sequence LM tasks, we use RedPajama-Data [9] for fine-tuning. The dataset is a large-scale pre-training dataset (about 1.2 trillion tokens) designed to provide high-quality training data for language models, and it contains multiple data sources (e.g., GitHub, arXiv, books, C4, and Wikipedia). We sample samples from these data sources for training. For evaluation, we utilize the PG19 book corpus dataset [20], which includes 100 documents, and the Arxiv Math Proof-pile dataset (test split). Additionally, all methods evaluate perplexity using a sliding window, following [19].
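A sliding-window perplexity evaluation of the kind referred to above can be sketched as follows. This is a simplified illustration, not the evaluation script used in the paper: `logprob_fn` is a hypothetical stand-in for a language model that returns the log-probability of a token given its context, and the window and stride values are arbitrary.

```python
import math

def sliding_window_perplexity(tokens, logprob_fn, window, stride):
    """Perplexity where each token is scored with at most window - 1 context tokens.

    The sequence is processed in blocks of `stride` tokens; each block is scored
    using the tokens of the window that ends at the end of that block.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    total_nll, count = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        for t in range(start, end):
            total_nll -= logprob_fn(tokens[ctx_start:t], tokens[t])
            count += 1
    return math.exp(total_nll / count)

# Toy model (assumption): uniform distribution over a vocabulary of 50 tokens.
VOCAB = 50

def uniform_logprob(context, token):
    return -math.log(VOCAB)

tokens = [i % VOCAB for i in range(1000)]
ppl = sliding_window_perplexity(tokens, uniform_logprob, window=256, stride=128)
print(round(ppl, 6))
```

For the uniform toy model the perplexity equals the vocabulary size up to floating-point rounding, which gives a quick sanity check before plugging in a real model.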
Baseline models: For long-context NLU tasks, the fine-tuned models LongAlpaca-16k [5], LongChat-32k [14], LongLLaMA [25], and ChatGLM [11] are used as baselines. The fine-tuning-free methods from the language modeling tasks are also used as baselines in the long-context NLU tasks.
In long-sequence LM tasks, LongLoRA [5], StreamingLLM [29], Positional Interpolation (PI) [4], and NTK-Aware Scaled RoPE (NTK) [17] are selected as baselines, all based on the LLaMA-2-7B-base model. Among these baselines, PI, NTK, and StreamingLLM are fine-tuning-free methods. The fine-tuned models include LongLoRA and Activation Beacon [30]. In the ablation experiments, the training-free interpolation methods PI and NTK are used as baselines.
Methods | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot | Code |
---|---|---|---|---|---|
LLaMA-2-7B-chat | 24.90 | 22.60 | 24.70 | 60.01 | 48.10 |
LLaMA-2-7B-chat-PI | 18.98 | 17.16 | 25.03 | 49.43 | 52.73 |
LLaMA-2-7B-chat-NTK | 23.21 | 23.34 | 24.40 | 59.29 | 49.28 |
StreamingLLM | 21.47 | 22.22 | 22.20 | 50.05 | 48.00 |
ChunkLLaMA- | 24.04 | 22.98 | 21.52 | 46.31 | 49.73 |
LongChat- | 31.58 | 23.50 | 26.70 | 64.02 | 54.10 |
LongAlpaca- | 28.70 | 28.10 | 27.80 | 63.70 | 56.00 |
LongLLaMA | 30.12 | 16.37 | 24.19 | 60.31 | 66.05 |
Vicuna-v1.5-7B- | 28.01 | 18.63 | 26.01 | 66.20 | 47.30 |
ChatGLM3-6B- | 40.30 | 46.60 | 29.50 | 68.10 | 56.20 |
3D-RPE-LLaMA2-7B-Chat | 47.40 | 60.10 | 28.99 | 73.16 | 76.50 |
Models | Coursera | QuALiTY | CodeU | GSM | TOEFL |
---|---|---|---|---|---|
LLaMA-2-7B-Chat | 29.21 | 37.62 | 1.11 | 19.00 | 51.67 |
LongChat-7B-16K | 29.74 | 33.66 | 3.33 | 10.00 | 47.95 |
LLaMA2-7B-NTK | 32.71 | 33.16 | 0.00 | 19.00 | 52.78 |
Vicuna1.5-7B-16k | 38.66 | 39.60 | 5.55 | 19.00 | 55.39 |
3D-RPE-LLaMA2-7B-Chat(ours) | 39.38 | 38.11 | 2.22 | 21.01 | 57.99 |
LLaMA3-8B-Instruct* | 51.45 | 64.34 | 4.44 | 76.00 | 82.89 |
3D-RPE-LLaMA3-8B-Instruct* | 51.89 | 61.38 | 4.44 | 80.00 | 82.89 |
Methods | PG-19 | | | | Proof-Pile | | | |
---|---|---|---|---|---|---|---|---|
LLaMA2-7B-Base | 131.09 | OOM | OOM | OOM | 16.79 | OOM | OOM | OOM |
LLaMA2-7B-PI | 11.32 | 19.5 | OOM | OOM | 3.86 | 5.94 | 33.7 | OOM |
LLaMA2-7B-NTK | 10.28 | 11.5 | 37.8 | OOM | 3.98 | 5.94 | 33.7 | OOM |
StreamingLLM | 9.23 | 9.25 | 9.24 | 9.32 | 3.47 | 3.51 | 3.50 | 3.55 |
LongLoRA-32k | 7.33 | 7.16 | 7.04 | – | 2.78 | 2.61 | 2.50 | – |
LongLoRA-100k | 7.57 | 7.33 | 7.16 | 7.04 | 2.78 | 2.60 | 2.58 | 2.52 |
LongChat-32k | 8.92 | 8.85 | 8.81 | OOM | 2.98 | 2.70 | 2.65 | OOM |
Activation Beacon | 8.52 | 8.54 | 8.56 | 8.68 | 3.45 | 3.42 | 3.39 | 3.35 |
3D-RPE-LLaMA2-7B | 7.03 | 7.10 | 8.09 | 8.12 | 2.72 | 2.93 | 2.89 | 3.05 |
Models | ||||
---|---|---|---|---|
LLaMA2-7B-PI | 7.94 | 9.19 | 15.11 | |
LLaMA2-7B-NTK | 7.87 | 11.98 | 26.12 | 58.91 |
LLaMA2-7B-Yarn | 7.87 | 8.06 | 9.82 | 11.74 |
3D-RPE-LLaMA2-7B* | 7.87 | 7.90 | 7.71 | 9.34 |
5.2 Long-Context Natural Language Understanding
In this task, the LongBench [3] evaluation set was initially utilized. Five categories of tasks were included: single-document QA (3 tasks), multi-document QA (3 tasks), summarization (3 tasks), few-shot learning (3 tasks), and code completion (2 tasks). The average score for each type is reported in Table 1. The evaluation metrics followed those specified in LongBench [3], which differ across tasks and are detailed in Supplementary Material D.1. The results in Table 1 highlight our model’s significant performance advantages over baseline models in four tasks, both for models without training and those with fine-tuning. In summarization tasks, our model also achieved performance comparable to ChatGLM3-6B-. These experimental outcomes indicate that our model enhances the correlation between tokens with distant relative positions in long contexts through 3D-RPE, resulting in improved performance.
Subsequently, the LEval benchmark [1] was employed. Table 2 shows that our model, 3D-RPE-LLaMA2-7B-Chat, outperformed LLaMA2-7B-NTK on all five tasks and LongChat-7B-16K on all tasks except CodeU. Although it did not surpass Vicuna1.5-7B-16k on the QuALiTY and CodeU tasks, it excelled on the Coursera, GSM, and TOEFL tasks. Additionally, we conducted experiments on LLaMA3-8B-Instruct using a 16k context window with 3D-RPE. 3D-RPE-LLaMA3-8B-Instruct* showed performance improvements on the Coursera and GSM tasks. While 3D-RPE did not enhance performance on the CodeU, TOEFL, and QuALiTY tasks (CodeU and TOEFL are unchanged; QuALiTY drops from 64.34 to 61.38), there was no large performance decline either. These experimental results demonstrate the effectiveness of the 3D-RPE method.
5.3 Long-Sequence Language Modeling
In Table 3, we present the perplexity scores for our model, 3D-RPE-LLaMA-2-7B and baseline models on the proof-pile and PG19 test datasets. 3D-RPE-LLaMA-2-7B was fine-tuned from the LLaMA2-7B-Base model using a dataset with a context window. To evaluate performance, we set sequence lengths of , , and . We extended our model’s sequence length from to using the position extending method from PoSE [31]. The results indicate that our method outperforms train-free sequence extending models. Compared to fine-tuned models, our model shows better performance at and sequence lengths. This suggests that the new positional encoding, 3D-RPE, improves or maintains modeling performance for larger context windows () compared to smaller ones ( and ). For the and tasks, although our model did not surpass LongLoRA- and LongLoRA-, it did outperform LongChat- and Activation Beacon.
Notably, our model can further extend from to without significantly increasing perplexity values, in combination with other train-free extension methods. However, due to their specific attention mechanism, the LongLoRA models cannot be extended beyond their predefined context windows in a train-free manner. For instance, LongLoRA- cannot be further extended to .
5.4 Ablation Study
In this section, we conduct ablation studies to explore how 3D-RPE affects the linear interpolation method. We compare position interpolation methods (PI, NTK, and YaRN) with the method that combines 3D-RPE with position interpolation on the LLaMA-2-7B-Base model in a train-free manner. The experimental results can be found in Table 4. With linear positional interpolation from to and , the 3D-RPE-LLaMA2-7B* model yields improved results by mitigating the decrease in positional resolution caused by interpolation. These results are consistent with Theorem 1 in Section 3.2.2.
6 Conclusion and Future Work
In this paper, we present a novel rotary position encoding method called 3D Rotary Position Encoding (3D-RPE). Compared to RoPE, we have theoretically proved that 3D-RPE possesses two key advantages: controllable long-term decay and enhanced interpolation resolution. Experimentally, 3D-RPE has demonstrated outstanding performance in long-context Natural Language Understanding.
In the future, 3D-RPE holds promise as a foundational positional encoding strategy for LLMs, especially in the aspect of modeling long contexts. Moreover, given that 3D-RPE encapsulates positional encoding within a three-dimensional framework, it has the potential to integrate with visual data, thereby facilitating an in-depth exploration of its efficacy in synchronizing graphical and textual semantic information.
References
- [1] Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023.
- [2] Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models, 2024.
- [3] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
- [4] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
- [5] Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fine-tuning of long-context large language models. arXiv:2309.12307, 2023.
- [6] Yukang Chen, Shaozuo Yu, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Long alpaca: Long-context instruction-following models. https://github.com/dvlab-research/LongLoRA, 2023.
- [7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna, 2023.
- [8] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- [9] Together Computer. Redpajama: An open source recipe to reproduce llama training dataset. https://github.com/togethercomputer/RedPajama-Data, 2023.
- [10] Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. CoRR, 2023.
- [11] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022.
- [12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- [13] Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source llms truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
- [14] Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can open-source llms truly promise on context length? arXiv preprint arXiv:2306.04537, June 2023.
- [15] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
- [16] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
- [17] Bowen Peng and Jeffrey Quesnelle. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have, 2023.
- [18] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
- [19] Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In ICLR, 2022.
- [20] Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020.
- [21] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [22] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 464–468, 2018.
- [23] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [24] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [25] Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Miłos. Focused transformer: Contrastive training for context scaling. arXiv preprint arXiv:2307.03170, 2023.
- [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [27] Ben Wang and Aran Komatsuzaki. Gpt-j-6b: A 6 billion parameter autoregressive language model. GitHub, 2021.
- [28] Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. Encoding word order in complex embeddings. In International Conference on Learning Representations, 2020.
- [29] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv, 2023.
- [30] Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou. Soaring from 4k to 400k: Extending llm’s context with activation beacon, 2024.
- [31] Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. Pose: Efficient context window extension of llms via positional skip-wise training. arXiv preprint arXiv:2309.10400, 2023.
Appendix A Bloch Sphere
The proposed 3D Rotary Position Encoding (3D-RPE) corresponds to a Bloch Sphere. This section introduces the basic concept of the Bloch Sphere, which underlies Eq. (1) of this paper.
The Bloch Sphere is a geometric tool used to represent qubits, typically depicted in a three-dimensional polar coordinate system as a point on the sphere (see Figure 4). A single quantum state is represented by the following equation in linear algebra:
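$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle \tag{9}$$
where $\alpha$ and $\beta$ are complex numbers, i.e., $\alpha, \beta \in \mathbb{C}$. According to Euler's formula in complex analysis, the coefficients $\alpha$ and $\beta$ can be re-expressed as:
$$\begin{aligned} \alpha &= r_\alpha e^{i\phi_\alpha}, \qquad \beta = r_\beta e^{i\phi_\beta}, \\ |\psi\rangle &= e^{i\phi_\alpha}\left(r_\alpha|0\rangle + r_\beta e^{i(\phi_\beta-\phi_\alpha)}|1\rangle\right). \end{aligned} \tag{11}$$
Here $e^{i\phi_\alpha}$ is the global phase.
Considering the normalization condition $|\alpha|^2 + |\beta|^2 = 1$, we have:
$$r_\alpha^2 + r_\beta^2 = 1 \tag{12}$$
Given $r_\alpha = \cos\frac{\theta}{2}$, $r_\beta = \sin\frac{\theta}{2}$, and $\varphi = \phi_\beta - \phi_\alpha$, the state can be expressed as:
$$|\psi\rangle = e^{i\phi_\alpha}\left(\cos\frac{\theta}{2}|0\rangle + e^{i\varphi}\sin\frac{\theta}{2}|1\rangle\right) \tag{13}$$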
This gives Eq. (1) of this paper.
To remain compatible with the original 2D rotary position encoding (RoPE) of pre-trained LLMs, such as the LLaMA models, the global phase is used to model the relative positions between tokens within a chunk, while the rotation angle is used to model the relative positions between tokens across chunks.
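The Bloch Sphere parametrization can be checked numerically. The following is a minimal sketch, assuming the standard convention in which a normalized state with amplitudes `alpha` and `beta` is written as a global phase times `cos(theta/2)` and `exp(i*phi) * sin(theta/2)`. The function names `bloch_decompose` and `bloch_compose` are illustrative and do not come from the paper.

```python
import cmath
import math


def bloch_decompose(alpha: complex, beta: complex):
    """Split a qubit state alpha|0> + beta|1> into (global_phase, theta, phi).

    Convention assumed here (standard Bloch Sphere form):
        alpha = exp(i * global_phase) * cos(theta / 2)
        beta  = exp(i * (global_phase + phi)) * sin(theta / 2)
    """
    norm = math.hypot(abs(alpha), abs(beta))
    if norm == 0.0:
        raise ValueError("state vector must be non-zero")
    # Enforce the normalization condition |alpha|^2 + |beta|^2 = 1.
    alpha, beta = alpha / norm, beta / norm
    global_phase = cmath.phase(alpha)              # argument of alpha
    phi = cmath.phase(beta) - global_phase         # relative phase
    theta = 2.0 * math.atan2(abs(beta), abs(alpha))  # polar angle in [0, pi]
    return global_phase, theta, phi


def bloch_compose(global_phase: float, theta: float, phi: float):
    """Rebuild (alpha, beta) from the three angles."""
    alpha = cmath.exp(1j * global_phase) * math.cos(theta / 2)
    beta = cmath.exp(1j * (global_phase + phi)) * math.sin(theta / 2)
    return alpha, beta
```

Composing the angles returned by `bloch_decompose` recovers the original normalized amplitudes up to floating-point error, which is the content of the derivation in this appendix.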
Appendix B Supplementary Material for the Method Section
This section describes the concrete implementation of our positional encoding method (3D-RPE) and gives the derivation of the attention score calculation (Eq. (5) of this paper), which is omitted from the main text.
B.1 Implementation of 3D-RPE
In Section 3.1, we give the general form of 3D Rotary Position Encoding (3D-RPE):
and are scalar quantities in . and are shown below:
(14) |
In the concrete implementation, analogous to RoPE, for each two-dimensional subspace of , we assign angles that vary from high to low frequencies. An equivalent rotation matrix is utilized to substitute for :
(15) |
Therefore, Eq.(4) of this paper can be transformed to
where is a design form equivalent to the rotation matrix in RoPE, mainly re-mapped to correspond to specific application implementations and calculation derivations in LLMs. In the specific implementation, after the rotary position encoding of LLMs, the long sequence is chunked based on the chunk size . Then, the rotation is set on each chunk, is the position of chunk.
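The chunking step described above can be illustrated as simple index bookkeeping. The sketch below is not the authors' implementation: the names `split_positions`, `relative_offsets`, and `chunk_size` are hypothetical, and the code only shows how a token position could be split into a within-chunk index and a chunk index. It does not reproduce the rotation matrices.

```python
def split_positions(seq_len: int, chunk_size: int):
    """Return (within_chunk_index, chunk_index) lists for positions 0..seq_len-1.

    Assumption: chunks are contiguous and the within-chunk index restarts
    at 0 at the beginning of every chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    within = [pos % chunk_size for pos in range(seq_len)]
    chunk = [pos // chunk_size for pos in range(seq_len)]
    return within, chunk


def relative_offsets(i: int, j: int, chunk_size: int):
    """Return (within-chunk offset, chunk offset) between positions i and j."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return (i % chunk_size) - (j % chunk_size), (i // chunk_size) - (j // chunk_size)
```

With `chunk_size = 4`, for example, position 9 maps to within-chunk index 1 and chunk index 2, so a pair of tokens is described by two relative offsets rather than one.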
B.2 Derivation of Attention for 3D-RPE
The derivation of the attention score calculation (Eq. (5)) is as follows.
Since , we could get:
(16) | ||||
Let =3d-PE, =3d-PE. Taking the real part of the inner product of and yields:
(17) |
which is a function related to both and .
Appendix C 3D Rotary Position Encoding Resolution Enhancement
Before proving Theorem 1, we first define the positional resolution of RoPE and the positional resolution after position interpolation.
Definition 2 (Positional Interpolation Resolution).
Let and be query state and key state of the -th and -th hidden states after RoPE. Given a pre-training length , the attention score is:
(18) |
The Resolution corresponding to the initial length is . After employing linear interpolation with length , the attention score is:
(19) |
Note that the Resolution turns to and decreases as increases.
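The effect of linear interpolation on the rotation step can be illustrated with a small numerical sketch. It treats one two-dimensional RoPE subspace as a complex number and assumes the standard linear-interpolation rule, in which every position index is multiplied by the ratio of the pre-training length to the extended length. The names `rope_score`, `step_angle`, `train_len`, and `ext_len` are illustrative and are not taken from the paper.

```python
import cmath


def rope_score(q: complex, k: complex, m: int, n: int,
               theta: float, scale: float = 1.0) -> float:
    """Real part of the inner product of a rotated query and a rotated key.

    The query at position m is rotated by m * theta * scale and the key at
    position n by n * theta * scale, so the score depends only on m - n.
    """
    q_m = q * cmath.exp(1j * m * theta * scale)
    k_n = k * cmath.exp(1j * n * theta * scale)
    return (q_m * k_n.conjugate()).real


def step_angle(theta: float, train_len: int, ext_len: int) -> float:
    """Rotation per unit of relative position after linear interpolation."""
    return theta * train_len / ext_len
```

Under this assumption, extending a 4096-token window to 16384 tokens shrinks the rotation per position to one quarter of its original value, so two tokens that are 8 positions apart after interpolation receive the same score as two tokens that were 2 positions apart before it.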
As the resolution decreases, the rotation in the attention score becomes smaller, which means that positional differences are reflected less strongly. The following theorem explains how 3D-RPE mitigates this loss of resolution.
Theorem 2 (Chunk Position Encoding Resolution Enhancement).
For a pre-trained language model with a pre-training length and an extension length requirement of , employing linear position interpolation extension methods based on Rotary Position Encoding (RoPE) can elevate the relative positional resolution from to , Let denote the relative positional encoding resolution achieved by the method based on 3D-RPE, with chunk size , there is:
(20) |
Proof.
For 3D-RPE, let the chunk size and chunk number be denoted as and respectively. Prior to interpolation, the indices within a chunk range from . Linear interpolation involves evenly distributing the excess tokens across chunks. This results in new indices within the chunk, range from , where . So the attention score of and based on 3D-RPE after interpolation is:
The resolution of relative position for 3D-RPE is:
For special cases and :
(21) |
where . As long as , there is . Under normal case, the chunk size is not set to a very small number, hence is certainly established; moreover, for different interpolation lengths , we need to configure a varying number of chunks , such that . ∎
Appendix D Experimental Supplementary Materials
D.1 Evaluation Metrics
This section lists the evaluation metric used for each of the 16 LongBench tasks.
Dataset | Metric |
---|---|
Narrative QA | F1_Score |
Qasper | F1_Score |
MultiFieldQA-En | F1_Score |
Hotpot QA | F1_Score |
2WikiM QA | F1_Score |
Musique | F1_Score |
GovReport | Rouge_Score |
QMSum | Rouge_Score |
MultiNews | Rouge_Score |
Trec | Classification_Score |
Trivia QA | F1_Score |
SAMSum | Rouge_Score |
PassageRetrieval-En | Retrieval_Score |
Passage Count | Count_Score |
Lcc | Code_Sim_Score |
RepoBench-P | Code_Sim_Score |
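As an illustration of the most frequent metric in the table, the sketch below computes a token-overlap F1 score between a prediction and a reference answer. It assumes that F1_Score denotes the token-level F1 commonly used for question answering; the function name `token_f1` is hypothetical, and the benchmark's own scorer may apply additional answer normalization that is not reproduced here.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    # Simplified normalization: lowercase and split on whitespace only.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For the prediction "the cat sat" and the reference "the cat", precision is 2/3 and recall is 1, so the score is 0.8.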
D.2 Details of Experimental Results
This section reports the per-task performance for each task category in LongBench. The results are given in Table 5.
Single-Document QA | Narrative QA | Qasper | MultiFieldQA-En |
---|---|---|---|
LLaMA2-7B-chat- | 18.7 | 19.2 | 36.8 |
LongChat-v1.5-7B- | 16.9 | 27.7 | 41.4 |
InternLM-7B- | 12.1 | 16.7 | 23.4 |
Vicuna-v1.5-7B- | 19.4 | 26.1 | 38.5 |
LongLora- | 19.8 | 29.1 | 37.1 |
3D-RPE-LLaMA2-7B(ours) | 40.56 | 41.35 | 60.3 |
Multi-Document QA | Hotpot QA | 2WikiM QA | Musique |
---|---|---|---|
LLaMA2-7B-chat- | 25.4 | 32.8 | 9.4 |
LongChat-v1.5-7B- | 31.5 | 20.6 | 9.7 |
InternLM-7B- | 28.7 | 22.8 | 9.0 |
Vicuna-v1.5-7B- | 25.3 | 20.8 | 9.8 |
LongLora- | 37.01 | 30.26 | 17.14 |
3D-RPE-LLaMA2-7B(ours) | 62.49 | 58.80 | 59.01 |
Summarization | GovReport | QMSum | MultiNews |
---|---|---|---|
LLaMA2-7B-chat- | 27.3 | 20.8 | 25.8 |
LongChat-v1.5-7B- | 30.8 | 22.7 | 26.4 |
InternLM-7B- | 9.7 | 15.9 | 22.8 |
Vicuna-v1.5-7B- | 27.9 | 22.8 | 27.2 |
LongLora- | 31.53 | 24.13 | 27.74 |
3D-RPE-LLaMA2-7B(ours) | 32.01 | 25.3 | 29.68 |
Few-shot Learning | Trec | Trivia QA | SAMSum |
---|---|---|---|
LLaMA2-7B-chat- | 61.5 | 77.8 | 40.7 |
LongChat-v1.5-7B- | 63.5 | 82.3 | 34.2 |
InternLM-7B- | 52.0 | 77.8 | 21.2 |
Vicuna-v1.5-7B- | 71.5 | 86.2 | 40.8 |
LongLora- | 63.5 | 85.69 | 41.88 |
3D-RPE-LLaMA2-7B-(ours) | 89.50 | 90.00 | 40.00 |
Synthetic Tasks | Passage Count | PassageRetrieval-En |
---|---|---|
LLaMA2-7B-chat- | 2.1 | 9.8 |
LongChat-v1.5-7B- | 1.0 | 30.5 |
InternLM-7B- | 3.0 | 6.0 |
Vicuna-v1.5-7B- | 6.5 | 4.5 |
LongLora- | 3.61 | 29.75 |
3D-RPE-LLaMA2-7B-16k(ours) | 4.0 | 14.5 |
Code Completion | Lcc | RepoBench-P |
---|---|---|
LLaMA2-7B-chat- | 52.4 | 43.8 |
LongChat-v1.5-7B- | 53.0 | 55.3 |
InternLM-7B- | 44.1 | 28.8 |
Vicuna-v1.5-7B- | 51.0 | 43.5 |
LongLora- | 57.61 | 54.45 |
3D-RPE-LLaMA2-7B-(ours) | 79.10 | 73.90 |