Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

Advancing Transformer Architecture in Long-Context Large Language Models:
A Comprehensive Survey
YUNPENG HUANG, State Key Lab of Novel Software Technology, Nanjing University, China
JINGWEI XU∗ , State Key Lab of Novel Software Technology, Nanjing University, China
JUNYU LAI, State Key Lab of Novel Software Technology, Nanjing University, China
ZIXU JIANG, State Key Lab of Novel Software Technology, Nanjing University, China
TAOLUE CHEN, School of Computing and Mathematical Sciences, Birkbeck, University of London, UK
arXiv:2311.12351v2 [cs.CL] 23 Feb 2024
ZENAN LI, State Key Lab of Novel Software Technology, Nanjing University, China
YUAN YAO, State Key Lab of Novel Software Technology, Nanjing University, China
XIAOXING MA, State Key Lab of Novel Software Technology, Nanjing University, China
LIJUAN YANG, HAO CHEN, SHUPENG LI, and PENGHAO ZHAO, Baidu.inc, China
Transformer-based Large Language Models (LLMs) have been applied in diverse areas such as knowledge bases, human interfaces,
and dynamic agents, and marking a stride towards achieving Artificial General Intelligence (AGI). However, current LLMs are
predominantly pretrained on short text snippets, which compromises their effectiveness in processing the long-context prompts
that are frequently encountered in practical scenarios. This article offers a comprehensive survey of the recent advancement in
Transformer-based LLM architectures aimed at enhancing the long-context capabilities of LLMs throughout the entire model lifecycle,
from pre-training through to inference. We first delineate and analyze the problems of handling long-context input and output with
the current Transformer-based models. We then provide a taxonomy and the landscape of upgrades on Transformer architecture to
solve these problems. Afterwards, we provide an investigation on wildly used evaluation necessities tailored for long-context LLMs,
including datasets, metrics, and baseline models, as well as optimization toolkits such as libraries, frameworks, and compilers to boost
the efficacy of LLMs across different stages in runtime. Finally, we discuss the challenges and potential avenues for future research. A
curated repository of relevant literature, continuously updated, is available at https://github.com/Strivin0311/long-llms-learning.
CCS Concepts: • Computing methodologies → Neural networks; Natural language processing; Parallel algorithms; • General
and reference → Surveys and overviews; Evaluation; • Computer systems organization → Neural networks; • Software
and its engineering → Software libraries and repositories; Memory management.
∗ Corresponding author.
Authors’ addresses: Yunpeng Huang, [email protected], State Key Lab of Novel Software Technology, Nanjing University, China, 210023; Jingwei
Xu, [email protected], State Key Lab of Novel Software Technology, Nanjing University, China, 210023; Junyu Lai, [email protected],
State Key Lab of Novel Software Technology, Nanjing University, China, 210023; Zixu Jiang, [email protected], State Key Lab of Novel Software
Technology, Nanjing University, China, 210023; Taolue Chen, [email protected], School of Computing and Mathematical Sciences, Birkbeck, University
of London, London, UK; Zenan Li, [email protected], State Key Lab of Novel Software Technology, Nanjing University, China, 210023; Yuan Yao,
[email protected], State Key Lab of Novel Software Technology, Nanjing University, China, 210023; Xiaoxing Ma, [email protected], State Key Lab of
Novel Software Technology, Nanjing University, China, 210023; Lijuan Yang, [email protected]; Hao Chen, [email protected]; Shupeng Li,
[email protected]; Penghao Zhao, [email protected], Baidu.inc, Beijing, China.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on
servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Manuscript submitted to ACM
Manuscript submitted to ACM 1

2 Huang, et al.
Additional Key Words and Phrases: large language models, long context, Transformer architecture, deep learning
ACM Reference Format:

Yunpeng Huang, Jingwei Xu, Junyu Lai, Zixu Jiang, Taolue Chen, Zenan Li, Yuan Yao, Xiaoxing Ma, Lijuan Yang, Hao Chen, Shupeng
Li, and Penghao Zhao. 2024. Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
. 1, 1 (February 2024), 40 pages. https://doi.org/XXXXXXX.XXXXXXX
1 INTRODUCTION
In recent years, fueled by Transformer-based models such as BERT [61], GPT [23, 178, 179] and their variants [100, 181,
222], Natural Language Processing (NLP) has seen significantly advancement in human language understanding and
generation [128, 218], revolutionizing numerous tasks in Natural Language Understanding (NLU) such as sentiment
analysis [261], Natural Language Generation (NLG) such as document summarization [67], as well as other domains such
as computer vision [105] and autonomous driving [88]. In particular, in the wake of ChatGPT [164], PaLM2 [9], GPT4 [165,
166], Claude2 [10], etc, the Transformer-based Large Language Models (LLMs) which scale up to 1B—100B parameters
to empower emergence abilities [233] have shown a new exhilarating path towards Artificial General Intelligence
(AGI) [24], and have been rapidly adopted in a myriad of human-interactive applications such as chatbots [123, 190],
programming assistants [234, 251] and educational tutors [1, 157].
Transformer is an intricate deep neural network model, which integrates several preceding designs [11, 13, 86]
and novel components to support sequence-to-sequence language modeling, initially in machine translation [225].
Contemporary LLMs largely adopt Transformer architecture, leveraging its modules [61, 178, 181], among which they
own the success mainly due to their well-designed attention mechanism that captures global dependencies of each pair
of tokens across the whole input, enabling the model to handle sequences with intricate relations. However, its quadratic
time and space complexities pose significant computational resource challenges, limiting input text length during
training and effective context window during inference. Additionally, the lack of a robust and generalizable mechanism
for positional embeddings (PEs) leads to performance degradation and fluctuation during inference, particularly with
longer sequences or position shifting on relevant information [139].
With LLMs deeply ingrained in various applications that require long-context comprehension [114, 248] and
generation [89, 142], the demand for long-context LLMs capable of comprehending and generating extremely long
sequences effectively and efficiently becomes increasingly indispensable and urgent. Consequently, researchers have
devoted significant efforts to enhancing the Transformer architecture to address the long-context problem in LLMs,
including optimization on the efficiency of attention (Section 3), context window extension with extra memory
mechanisms (Section 4), effective length generalization with extrapolative PEs (Section 5), context pre/postprocessing
(Section 6), and other miscellaneous methods (Section 7) such as specific pretraining objectives, mixture of experts
(MoE), quantization, parallelism, etc.
Existing surveys. The field of long-context LLMs has become one of the most rapidly developing research areas on LLMs
recently, with some existing surveys [65, 112, 137, 216, 270]. [112] offers an overview of long document summarization,
but does not delve into techniques of long text modeling. [216] and [137] primarily concentrate on improving the
computational efficiency of Transformers in long-text scenarios. Although [270] underscores the challenges LLMs
face when engaging with extensive sequences, its discussed methods predominantly align with efficient Transformers,
similar to [216] and [137]. A more recent survey [65] bears the closest resemblance to our study, but is considerably less
comprehensive than ours. In particular, we review the advancement in breaking the barriers of context length across all
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey 3
stages for more intricate and scalable Transformer-based LLMs by exploring the Transformer from both an algorithmic
design and system architecture perspective.
This survey aims to present a panorama of literature on architecture evolution for scaling the effective context
window length of the state-of-the-art Transformer-based LLMs. The main contributions are as follows.
• We provide a holistic taxonomy by breaking down the Transformer architecture and then delving into the
existing methods in enhancing long-context LLMs during stages including pretraining, fine-tuning, inference
and pre/postprocessing.
• We explore the widely-used evaluation necessities, comprising datasets, metrics, and baseline specifically assessing
the long-context capabilities of LLMs, followed by some popular toolkits to optimize LLMs’ efficiency and
effectiveness for both training and inference, such as libraries, frameworks, and compilers.
• We identify key challenges to revamping the Transformer structure for handling extensive contexts, with
corresponding future directions to push the frontier.
• In light of the extremely rapid growth of this field, we build a repository that gathers relevant literature within
this specific domain. We shall update it continuously to keep pace with the latest advancements.
Organization. Section 2 gives an overview of long-context LLMs, including the preliminaries about objectives and
stages for language modeling and critical components of Transformer-based LLMs, the structure limitation analyses
for LLMs to deal with lengthy contexts and the taxonomy of existing efforts on advancing Transformer architecture.
Then, we mainly delve into the discussion of each part of methodologies from the taxonomy in next five sections 3∼7,
corresponding to related modules in Transformer architecture. In Section 8, we also summarize the necessities for
evaluating long-context capabilities and collect some popular optimization toolkits to augment LLMs’ effectiveness and
efficiency during training and inference. In Section 9, we explore the critical challenges and corresponding potential
avenues lighted up by them, as well as draw insights from existing breakthroughs. Finally, Section 10 closes this survey
with overarching conclusions regarding a panorama of the domain.
2 OVERVIEW
In this section, we start with the preliminaries (Sec. 2.1) for the fundamental language modeling objectives, typical
modeling stages, as well as critical architecture modules in Transformer-based decoder-only LLMs, depicted in Fig. 1(a).
We then briefly analyze the architecture limitations when LLMs encounter extensive context windows (Sec. 2.2). Finally,
we present a taxonomy (Sec. 2.3) of the different methods to enhance the long-context capabilities of LLMs through
architectural innovations (cf. Fig. 1(b)).
2.1 Preliminaries
Language Modeling. In a nutshell, (neural) language modeling aims to approximate the log-probability of the
occurrence of any given text, denoted as log P(𝑋 1:𝐿 ; 𝜃 ), where 𝜃 stands for the network parameters to be learned and
𝑋 1:𝐿 comprises a sequence of length 𝐿 representing natural language including words, punctuation, mathematical
symbols, etc. A significant practical hurdle for language modeling is curse of dimensionality, i.e., the support of the
probability distribution grows exponentially as 𝐿 increases. LLMs employ variations such as masked language modeling
(MLM) and causal language modeling (CLM). The former is to predict masked tokens based on the bidirectional remaining
unmasked tokens, i.e.,
∑︁
MLM : arg max log P(𝑥𝑖 | 𝑋 1:𝑖 −1,𝑖+1:𝐿 ; 𝜃 ) (1)
𝜃
𝑖∈M
4 Huang, et al.
Fig. 1. The overview of the survey: (a) The typical architecture anatomy diagram of contemporary Transformer-based decoder-
only LLMs, with the legend on the far top right; (b) The taxonomy of methodologies for enhancing Transformer architecture
modules (corresponding to (a) by color): Efficient Attention (submodule of attention kernel), Long-Term Memory (targeting KV
cache), Extrapolative PEs (against the positional embedding module), Context Processing (related to context pre/post-processing),
and Miscellaneous (general for the whole Decoder Block as well as the Loss module).
which maximizes the conditional probability of the 𝑖-th token 𝑥𝑖 given all the others, where M denotes the index set
of the masked tokens. In contrast, the objective of CLM is to predict the next token, i.e., maximize the conditional
probability of each token, given the unidirectional preceding ones
𝐿
∑︁
CLM : arg max log P(𝑥𝑖 | 𝑋 1:𝑖-1 ; 𝜃 ) (2)
𝜃
𝑖=1
In this setup, casual LLMs can effectively leverage the temporal dependencies inherent in natural language sequences,
enabling LLMs to generate coherent and contextually relevant text.
Modeling Stages. Typically, LLMs often undergo a multi-stage modeling process. Initially, during the preprocessing
stage, raw text data is segmented and tokenized into individual (sub)words, viz., tokens predefined in a vocabulary,
using algorithms, e.g., BPE [193]. Then, in the pretraining stage, the model is trained on vast text corpora, with the
MLM or CLM objectives, to capture semantic patterns and linguistic structures of natural language. Once pretrained,
the model proceeds to the fine-tuning stage, where it is further trained with a few epochs on task-specific data with
extra heads to learn sometimes. Finally, the finetuned model is deployed in downstream scenarios to predict expected
answers in inference mode. Particularly, the casual LLMs are pretrained and finetuned with the same CLM objective but
a different corpus. During the inference step, the model predicts from the probability distribution of the vocabulary by
some decoding strategy such as greedy search, beam search, nucleus sampling [227], to generate contextually coherent
responses to prompts in a token-by-token autoregressive paradigm.
Decoder Block. The vanilla Transformer architecture [225] mainly comprises an Encoder and a Decoder, each stacked
with multiple identical blocks. The skeleton of each block is mostly compatible with the one in Fig. 1(a). In general,
the first block takes the tokenized sequence encoded by a word embedding layer, followed by a multi-head scaled-dot
self-attention (MHA) layer with an attention mask corresponding to specific language modeling objectives and a
feed-forward network (FFN) layer. Both the MHA and FFN layers are enriched with layer normalization [11] and
residual connections [86] at every entrance/exit of the block. Then, each higher-level block takes the output hidden
states from the previous block as input, represents them with the MHA and FFN layers, and feeds them to the next block.
The final hidden state outputted from the last block is fed into a linear layer called language modeling head, and the
output logits is transformed into a probability distribution over the target vocabulary through the softmax operation.
Note that the slight difference between the Encoder and Decoder blocks is that the latter additionally interfaces with
the Encoder’s output via a cross-attention (CA) layer before feeding into the FFN layer.
Such a binary structure was originally designed for sequence-to-sequence modeling in machine translation tasks.
Subsequently, several variations have been proposed aiming at more general language modeling objectives such as
MLM and CLM. The BERT series [61, 141] harnesses only the Encoder with MLM to enhance bidirectional information,
serving as a discriminative model. Conversely, the GPT series [23, 178, 179] utilizes only the Decoder with CLM,
focusing on unidirectional generative models. T5 [181] and BART [125] variants, however, treat each NLP task as a
text-to-text conversion, leveraging both Encoder and Decoder. The decoder-only generative model architecture has
recently become the predominant choice for current LLMs. Notable examples include Llama [221, 222], OPT [262],
Bloom [236], GLM [66, 258], and Mistral [4, 100], among others.
Attention Mechanism. The attention mechanism [13], as the core design of the Transformer implemented in the
MHA layer, computes a weighted representation of each token in the input sequence based on its relevance to others.
Specifically, as illustrated in Fig. 1(a), the word-embedded token sequence 𝑋 ∈ R𝐿×𝑑𝑖𝑛 , concatenating long contexts and
user prompts with total length 𝐿, gives rises to three embedding matrices, i.e., a linear projection layer query 𝑄 ∈ R𝐿×𝑑𝑞 ,
key 𝐾 ∈ R𝐿×𝑑𝑘 and value 𝑉 ∈ R𝐿×𝑑 𝑣

𝑄, 𝐾, 𝑉 := split 𝑋 × 𝑊𝑞,𝑘,𝑣 , 𝑊𝑞,𝑘,𝑣 ∈ R𝑑𝑖𝑛 × (𝑑𝑞 +𝑑𝑘 +𝑑 𝑣 )
Then, for the attention kernel operations
𝑃
𝑃 := 𝑄 × 𝐾 T, 𝐴 := softmax[ √︁ ⊙ 𝑀], 𝑂 := (𝐴 × 𝑉 ) × 𝑊𝑜 , 𝑊𝑜 ∈ R𝑑 𝑣 ×𝑑𝑜
𝑑𝑘
Namely, the unnormalized relevance matrix 𝑃 ∈ R𝐿×𝐿 each entry measures the relevance of the corresponding pair
√︁
of tokens. The normalized attention score matrix 𝐴 ∈ R𝐿×𝐿 is computed as a scaling operation by factor 𝑑𝑘 , an
element-wise mask operation with 𝑀 ∈ R𝐿×𝐿 , and a row-wise softmax. Finally, the output hidden states 𝑂 ∈ R𝐿×𝑑𝑜 are
generated by a weighted sum of 𝑉 with attention weights in each row of 𝐴, usually with an extra linear transformation.
Note that the embedding dimensions of 𝑄, 𝐾, 𝑉 , 𝑂 are not necessarily the same. Though subscripts are used to
distinguish them for generality, by default we set 𝑑 = 𝑑𝑞 = 𝑑𝑘 = 𝑑 𝑣 = 𝑑𝑜 in the rest of the paper. The mask matrix 𝑀 is
typically used for masking padding tokens to align all batched input sequences and also applies casual mask operation of
causal language modeling for generative LLMs. Furthermore, to capture diverse relationships, the model often employs
multi-head attention instead of single-head one, performing the attention process in parallel with differently weighted
𝑄ℎ , 𝐾ℎ , 𝑉ℎ sets by dividing learnable parameters like 𝑊𝑞,𝑘,𝑣 ∈ R𝑑𝑖𝑛 × (3×𝑑 ) into 𝑊𝑞,𝑘,𝑣
𝑚ℎ ∈ R𝑑𝑖𝑛 × (3×𝐻 ×𝑑ℎ𝑒𝑎𝑑 ) , where 𝐻

6 Huang, et al.
denotes the number of heads. Similar to embedding dimensions, the number of heads can be specific for 𝑄, 𝐾, 𝑉 , which
vary in different LLMs, yet we consider them the same by default.
Positional Embeddings. Unlike recurrent neural networks (RNNs) [253], Transformers process input tokens in parallel
as a bag-of-words and lack an inherent sense of sequence order. To preserve the sequential information, the vanilla
Transformer presents a novel Sinusoidal PE (SinPE) [225].
 sin(𝑛𝜃 0 ) 
 
 
 cos(𝑛𝜃 0 ) 
 
 
 
 sin(𝑛𝜃 1 ) 
 
  2
SinPE(𝑛) :=  cos(𝑛𝜃 )  , 𝑤ℎ𝑒𝑟𝑒 𝜃 = 𝑏𝑎𝑠𝑒 − 𝑑 , 𝑛 ∈ {0, 1, · · · , 𝐿 − 1}
1 (3)
 
 

 .. 

 . 
 𝑑

 -1
 sin(𝑛𝜃 2 ) 

 
𝑑
cos(𝑛𝜃 2 -1 ) 
 
 
Here 𝑏𝑎𝑠𝑒 is a large integer manually set as 10, 000 (according to the original paper without further explanation), and 𝑑
is the unit embedding dimension of hidden states.
Some variants have recently emerged, including trainable embeddings [34] to learn an embedding mapping and
relative embeddings [195] based on relative positions. For instance, Rotary PE (RoPE) [207] applies a rotation operation
on a complex field instead of an addition to 𝑄, 𝐾 based on absolute positions, where it shares the same basis function as
SinPE.
 (0) 
𝑅𝑛 
 
 (1) 
cos(𝑛𝜃 𝑖 ) − sin(𝑛𝜃 𝑖 ) 
 𝑅𝑛   
(𝑖 )
RoPE(𝑛) :=   , 𝑤ℎ𝑒𝑟𝑒 𝑅𝑛 :=  (4)
  
..  sin(𝑛𝜃 𝑖 ) 𝑖


 .

  cos(𝑛𝜃 ) 
 𝑑

 ( -1) 

 𝑅𝑛 2 
Observe the properties
||𝑅𝑖 q|| = ||q||, 𝑃𝑖,𝑗 := ⟨𝑅𝑖 q, 𝑅 𝑗 k⟩ = qT 𝑅𝑖T 𝑅 𝑗 k = qT 𝑅 𝑗 −𝑖 k
RoPEs ensure the magnitude of q, k remains unchanged (due to unitary transformation), and for every entry in 𝑃,
i.e., each pair of q, k, will only be tagged with embeddings in terms of their relative distance in the sequence. RoPE
provides a more stable scheme to handle longer sequences. It captures relative positional patterns with absolute position
awareness, thus widely used in state-of-the-art open-source LLMs like LLama and GLM.
It is worth noting that SinPEs are initially applied on the word embeddings before entering the Encoder or Decoder
blocks by addition. In contrast, as shown in Fig. 1(a), RoPEs are applied to 𝑄, 𝐾 in each attention layer before the kernel
operations by equivalent element-wise vector multiplication to save registered buffer memory.
Key-Value Cache. In a narrow sense, the Key-Value (KV) cache is a list of tensors that stores the k, v embeddings for
all previous tokens in the attention layer for each block, utilized and updated during the autoregressive generation
process of causal LLMs. As shown in Fig. 1(a), before the first token is generated, all KV caches are initialized empty
and will be filled with 𝐿 (key, value) pairs after the heave attention computation with 𝐿 queries and 𝐿 keys. Then, the
first generated token will also be considered as input, extending the whole sequence to 𝐿 + 1 tokens. To avoid redundant
calculations, the real input will contain only the latest generated token, deriving one new triplet of (query, key, value).
But to compute equivalently, the new query has to attend and apply to all 𝐿 + 1 previous keys and values. Thus, the
new (key, value) has to concatenate with past 𝐿 pairs stored in the KV cache and update themselves into it for the next
generated token to attend. However, in a broad sense, we can consider the KV cache as the memory storage of LLMs,
whose occupation grows linearly as the generated tokens increase. That directly causes one of the limitations below
about the lack of efficient memory and suggests the approaches to enhance the long-term memory mechanisms for
LLMs in Section 4.
2.2 Limitations
Attention Complexity. In typical scenarios where 𝐿 ≫ 𝑑, the computational complexity of MHA can be concisely
summarized as follows. It involves 𝑂 (𝐿 2𝑑) time complexity, comprising 𝑂 (𝐿𝑑 2 ) for QKV projection, 𝑂 (𝐿 2𝑑) for the
computation of 𝑃, 𝑂 (𝐿 2 ) for the softmax operation to obtain 𝐴, 𝑂 (𝐿 2𝑑) for the multiplication of 𝐴 and 𝑉 , and 𝑂 (𝐿𝑑 2 )
for the output projection of 𝑂. It incurs 𝑂 (𝐿 2 ) space complexity, involving 𝑂 (𝐿𝑑) for embeddings of 𝑄, 𝐾, 𝑉 , 𝑂 and
additional 𝑂 (𝐿 2 ) buffers for storing weights 𝑃 and 𝐴. Consequently, both temporal and spatial computational costs
exhibit a quadratic increase with the expansion of the sequence length, which can be burdensome for both training and
inference.
In-context Memory. LLMs lack an explicit memory mechanism, relying solely on the KV cache to store representations
of all previous tokens in a list. This design implies that once querying is completed in one call, the Transformer does
not retain or recall any previous states or sequences in subsequent calls unless the entire history is reloaded token by
token into the KV cache. Consequently, the Transformer possesses only an in-context working memory during each
call, as opposed to an inherent memory mechanism such as Long Short-Term Memory (LSTM) [253]. This statelessness
offers computational advantages in terms of parallelism but presents challenges in tasks like chatbot applications [109],
where long-term memory retention is essential.
Max-Length Constraint. During the training phase, engineers typically need to determine a crucial hyperparameter
max-length (𝐿𝑚𝑎𝑥 throughout this paper), which represents the upper bound on sequence length for any training
sample in a batch. It is commonly set as 1K, 2K, or 4K based on the available computational resources to avoid
Out-of-Memory (OOM) errors on GPUs. However, during inference, LLMs service providers must either restrict the
length of user prompts or automatically truncate them to align with the predefined 𝐿𝑚𝑎𝑥 . Notice that none of the
Transformer modules inherently require such restrictions since all learned weights depend solely on dimension sizes,
hence Transformers theoretically can process sequences of any length. Unfortunately, current Language Models have
shown noticeable performance degradation when handling input sequences exceeding 𝐿𝑚𝑎𝑥 , often resulting in repetitive
and implausible outputs.
2.3 Taxonomy
There are multiple avenues to explore for advancing the Transformer structure to endow LLMs with long-context
capabilities, such as reducing attention complexity during training, designing efficient memory mechanisms, and
enhancing the ability for length extrapolation where the model is trained on short sequences but tested on longer ones
during inference [173]. In this survey, we provide a comprehensive review of recent advancements in methodologies
aimed at improving the long-context capabilities of LLMs throughout various stages. A taxonomy is given in Fig. 1(b)
where these methods are categorized into five main classes:
8 Huang, et al.
• Efficient Attention (Section 3). This class of methods focuses on implementing efficient attention mechanisms
with reduced computational costs, even achieving linear-time complexity. Thereby 𝐿𝑚𝑎𝑥 in the pretraining stage
can be increased, and so for the effective context length boundary of LLMs during inference.
• Long-Term Memory (Section 4). This class of methods aims to design explicit memory mechanisms so the
limitation of the in-context working memory can be addressed.
• Extrapolative PEs (Section 5). This class of methods improves the extrapolative properties of existing positional
encoding schemes.
• Context Processing (Section 6). This class of methods wraps off-the-shelf LLMs with additional context
pre/postprocessing. They ensure that the input fed to LLMs in each call always meets the maximum length
requirement and breaks the context window limit by introducing multiple calling overheads.
• Miscellaneous (Section 7). This class includes various methods that do not naturally fit into the previous four
categories, offering a broader perspective on advancing long-context capabilities in LLMs.
3 EFFICIENT ATTENTION
The first category of methods is to optimize attention mechanisms, especially the kernel operations that are the
computational bottleneck of the Transformer. This approach enables the expansion of the context length boundary for
LLMs during inference by directly increasing the hyperparameter 𝐿𝑚𝑎𝑥 in the pretraining stage. We further categorize
these methods into five distinct strategies, each with a specific focus: Local Attention (Sec. 3.1), Hierarchical Attention
(Sec. 3.2), Sparse Attention (Sec. 3.3), Approximated Attention (Sec. 3.4) and IO-Aware Attention (Sec. 3.5).
3.1 Local Attention

The traditional attention mechanism is characterized by its global and full attention nature, wherein every token is
expected to attend to every other token, resulting in quadratic time and space complexities. Considering the significance
of local context in certain applications [246], various approaches have been introduced to implement local attention
mechanisms in recent years. These mechanisms restrict each token’s attention to its neighboring tokens only. Variations
of these approaches arise from the heuristics to determine what qualifies as a token’s neighbor, as depicted in Fig. 2.
Block-wise Attention. One straightforward approach to local attention involves segmenting the input sequence into
non-overlapping blocks. As proposed in BlockBERT [176], tokens are restricted to attending only to others within
the same block of fixed size 𝐵. This block-wise attention requires full attention calculations within each 𝐵 × 𝐵 block
for 𝐵𝐿 iterations, resulting in time complexity 𝑂 (𝐿𝐵𝑑) and memory complexity 𝑂 (𝐿𝐵). However, this approach limits
the global receptive field, potentially hindering long-term dependency modeling. To mitigate this, Bi-BloSAN [198]
introduces inter-block attention for capturing long-range dependencies. Sinkhorn [214] employs a differentiable ranking
network to sort blocks, enabling quasi-global receptive fields. SPADE [271] augments state space models (SSMs) [77] to
address long-range dependency limitations. Additionally, Landmark Attention [159] introduces a landmark token for
each block to select relevant neighbors, facilitating block-wise representations. In the fine-tuning stage, LongLoRA [38]
introduces shift short attention (S2 -Attn) atop LoRA [87], shifting tokens by half the block size in half of the attention
heads to ensure information flow between neighboring blocks.
Sliding Window Attention. Inspired by convolutional neural networks (CNNs) [116, 122], another approach is to use
sliding-window techniques, as demonstrated in Longformer [17]. In this method, each token is assigned a consecutive
fixed window and is allowed to attend only to the previous adjacent 𝑤 ≪ 𝐿 tokens as its neighbors. To extend the
receptive field similar to dilated convolution [252], the window is dilated with gaps of size dilation 𝑑, enabling each
token to attend to tokens as far as 𝑤 × 𝑑 + 1 away. To aggregate global information without additional computation,
global attention is also applied to a few pre-selected positions where special tokens like [CLS] are located, decreasing
computation complexity to 𝑂 (𝐿𝑤𝑑). Besides, Funnel-Transformer [52] employs a strided mean pooling strategy on each
window to compress the sequence dimension of hidden layers, while Sequence-AltUp [16] further captures contextual
information in skipped tokens with its lightweight predictor.
Global-Local Hybrid Attention. A similar global-local attention mechanism is adopted in ETC [6] and LongT5 [78],
which construct auxiliary global tokens explicitly or implicitly to represent segment information with global attention,
while applying local attention only to source tokens (This hierarchical organization of attention receptive fields is
further discussed in Sec. 3.2). To avoid tuning, LongLM [103] employs grouped attention for out-of-window tokens and
standard attention for those within the neighbor window. StreamLLM [240] observes the phenomenon of attention
sink that maintaining KV of initial tokens during inference largely recovers sliding window attention performance,
and adding a placeholder token during pretraining further improves streaming deployment. This phenomenon arises
from the strong attention towards initial tokens as a "sink" even if they are not semantically important. Similarly,
Lm-infinite [82] proposes a Λ-shaped mask and positional distance constraint to keep attending to starting tokens.
LSH Attention. Besides the direct positional adjacency, Reformer [110] utilizes a neighbor token selection mechanism
based on k-Nearest-Neighbor (kNN) and Locality-Sensitive Hashing (LSH) algorithms [96]. LSH attention allows each
query q𝑖 to attend to a set of keys 𝑆𝑖 := {k 𝑗 ≤𝑖 : ℎ(q𝑖 ) = ℎ(k 𝑗 )} within a single hash bucket. The hashing function ℎ is
designed to assign the same hash with high probability to two vectors that are similar and vice versa. This approach
ensures that each token can access a fixed number 𝐾 ≪ 𝐿 of neighboring keys, and the primary computational cost of
LSH attention arises from bucket sorting, with a complexity of 𝑂 (𝐿 log 𝐿𝑑).
Fig. 2. The visualization of various typical local causal attention mechanisms. As the legend on the right indicates, tokens are
distinguished by colors, with shades denoting attention to themselves (darker) or attention to the preceding others (lighter).
3.2 Hierarchical Attention

The global token techniques [6, 17, 78] and the inter-block attention [198] mentioned above are essentially introducing
hierarchical features to self-attention to compensate with more global information from the higher-level attention
while keeping the low computation cost from the low-level local attention at the same time. From this view, more work
has explored various hierarchical mechanisms that introduce a structured hierarchy into self-attention, leveraging
higher-level global information and lower-level local attention for multi-scaled contextual receptive fields.
10 Huang, et al.
Two-Level Hierarchy. HAN [249] pioneers the use of a two-level attention mechanism. It first applies self-attention
to word features to obtain a sentence representation, then employs self-attention on sentence-level features to generate
document-level features. This hierarchical approach improves efficiency and performance in document classification
tasks. Subsequently, similar hierarchical attention mechanisms have led to significant advancements in other document-
level tasks, including machine translation [153, 187, 239], and document summarization [48, 259, 264].
Multi-Level Hierarchy. In contrast to the typical binary level structure above, BPT [250] introduces a more elaborated
fine-to-coarse attention mechanism that operates on multi-scale spans via binary partitioning. Token nodes can attend
to smaller-scale spans for close context and to larger-scale spans for distant context. This approach formalizes the
hierarchical structure as a graph neural network and updates it using graph self-attention [226]. A simpler variation is
adopted in Adaptive Span Transformer [208], which employs a soft attention masking function to non-increasingly
map relative distances to real values in the range [0, 1]. This function controls the span of attention for each head,
allowing the model to attend to different context spans.
Building on the prior studies [2, 106] that indicate a hierarchical low-rank structure in attention matrix across
NLP tasks, H-Transformer-1D [269] introduces hierarchical attention, partitioning the matrix into blocks with varied
low-rank ranges for diverse approximation levels. This reduces runtime and memory complexity to 𝑂 (𝐿𝑑), where the
number of hierarchy levels 𝑀 is typically set to log (𝐿/2). Viewing full-attention as a conditional expectation over
embeddings at each location, Combiner [186] approximates this conditional distribution with structured factorization
on token regions. Tokens can then attend to others either directly or through indirect attention to abstractions, which
are conditional expectations from corresponding factorized local regions. This approach also leverages sparse attention
patterns, as will be discussed in Sec. 3.3, to provide sub-quadratic low computation and memory complexity while
maintaining full-attention expressiveness.
Generally speaking, hierarchical attention mechanisms derive from the same principles of contextual locality
present in natural languages as local attention. However, they incorporate a more elaborated structure, often designed
heuristically, to strike a balance between capturing long-range contextual dependencies and maintaining low-level
computational efficiency.
3.3 Sparse Attention

While some approaches have introduced heuristics for achieving locality and hierarchical structure within self-attention,
another direction explores the sparsity patterns inherent in full attention matrices [43, 254]. These methods aim to
introduce a sparse attention mask, denoted as 𝑀 S , where each row 𝑖 assigns a sparse set of indices S𝑖 ⊆ { 𝑗 | 𝑗 < 𝑖} that
the 𝑖-th token attends to. These sparsity-based attention mechanisms offer both computational efficiency and the ability
to capture global context information [151]. Fig. 3 provides a visualization of these sparse attention mechanisms.
Fig. 3. The visualization of some typical causal sparse attention patterns. The legend on the right distinguishes token types based on
their colors, where darker shades indicate attending to themselves while lighter ones represent attention to other previous tokens.

Fixed Sparsity Patterns. To start with Sparse Transformer [43], it draws inspiration from attention patterns
learned on CIFAR-10 [115] and proposes a row-column factorized attention scheme. This approach results in faster
computations while still maintaining global context awareness. Formally, it employs a chosen stride 𝑙 that is close to
√
𝐿. Each query q𝑖 applies one row attention for local context information (i.e., local attention) and another column
attention that summarizes previous locations and propagates information to all future tokens, resembling a form of
global attention. The authors provide two specific patterns for row and column attention, corresponding to the stride
and fixed ones respectively illustrated in Fig. 3(c) and Fig. 3(d). This strategy reduces the total computational complexity
√ √ √
to 𝑂 (𝐿 𝐿𝑑). The intuition of Sparse Transformer is to attribute a stride 𝑙 ≈ 𝐿 to equally distribute 𝐿 tokens to attend
to. In contrast, LogSparse [129] employs an exponentially sparse attention pattern by dispatching only log 𝐿 tokens
for each location to attend to. This method ensures that any pair of tokens can eventually exchange information with
each other through a path spanning log 𝐿 layers. This results in an overall memory usage of 𝑂 (𝐿(log 𝐿) 2 ). The recent
LongNet [62] further improves computational efficiency by introducing dilated attention, which expands the attentive
field exponentially as the distance between tokens increases. It incorporates mixed dilate rates to model both long
and short-range dependencies, ultimately reducing the computation complexity to 𝑂 (𝐿𝑑) while successfully scaling to
sequences of up to one billion tokens.
Adaptive Sparsity Patterns. Instead of fixed sparse indices set only dependent on locations, some approaches
seek sparsity adaptively in a learnable manner, taking into account embedding values. Expire-Span [209] introduces
a learnable scalar in the range [0, 1] for each previous token, allowing the model to retain tokens with the most
important information while expiring those that are no longer relevant, similar to the forget gate in LSTM-based
RNNs [253]. Routing Transformer [188] leverages k-means clustering to identify the top-k most relevant centroid
vectors in 𝑄, 𝐾 and assigns each query to the keys with the same cluster membership, reducing the overall complexity
√
of attention to 𝑂 (𝐿 𝐿𝑑). Inspired by Differentiable Architecture Search (DARTS) [138], SparseBERT [199] introduces a
differentiable attention mask using Gumbel relaxation techniques [145], allowing the model to learn to guide attention
pattern selection by importance. It incorporates a predefined sparsity ratio 𝜌, resulting in computational complexity of
𝑂 ((1 − 𝜌 2 )𝐿 2𝑑).
Graph Sparsification. Furthermore, some other works treat full attention as a fully connected graph, with nodes
representing embeddings of each token and edges denoting connections through attention. These approaches frame
sparsity as a graph sparsification problem. For instance, Star-Transformer [79] introduces a star-shaped topology, where
each satellite node attends to local neighbors with a ring connection and a virtual relay node with the radial connection.
In contrast, BigBird [256] incorporates sparsity based on random graph theory, allowing each query to attend to a
random number of keys with a fixed probability. It also absorbs sliding-window local attention with window size 𝑤 and
global token techniques in its design. This approach reduces the quadratic dependency to linear complexity, specifically
𝑂 ((𝑤 + 𝑟 + 1)𝐿𝑑), stacked with three efficient attention mechanisms.
3.4 Approximated Attention

In addition to the heuristics to restrict full attention computation, some research approximates attention based on the
sparsity or low-rankness of attention matrices with linear complexity, albeit at the cost of precision. We introduce
several of these approximation techniques below.
Notice that we do not distinguish whether the methods are employed for BERT-like encoder-only LLMs or GPT-like
decoder-only LLMs in previous sections since most of them can be trivially transferred from the BERT setting to the
GPT setting with a causal attention mask. However, the casual mask is often non-trivial for many approximation
12 Huang, et al.
strategies. So to facilitate later discussions, we first define a general weighted causal function 𝜉 w in Eq. 5, where w ∈ R𝐿
represents a weights vector for each row. This function will substitute the causal attention mask operation. Thus, we
omit the mask 𝑀 in all attention equations below for simplification.
 𝑖
∑︁ 𝐿
T T
 
𝜉 w (𝑄, 𝐾, 𝑉 ) := 𝑤𝑖 · q𝑖 k 𝑗 v 𝑗  (5)

 𝑗=1 
 𝑖=1
Low-Rank Approximation. Linformer [229] employs Singular Value Decomposition (SVD) to approximate the
attention matrix 𝐴 with a low-rank matrix 𝐴e This approach involves two learnable projection matrices 𝐸 and 𝐹 of
dimensions 𝐿 × 𝑘, where 𝑘 = 𝑂 ( 𝜖𝑑2 ) ≪ 𝐿. The process includes projecting 𝐾, 𝑉 using 𝐸, 𝐹 respectively, followed by
standard MHA kernel on 𝑄 with the projected 𝐾, e 𝑉e. According to the properties proved in Linformer, this low-rank
technique approximates full attention with linear complexity 𝑂 (𝐿𝑘𝑑) while allowing for an error of 𝜖.
Nested Attention. Luna [144] decouples the attention kernel into two nested attention approaches, both of which
have linear complexity relative to 𝐿. Specifically, it firstly applies pack attention as Eq. 6 to get packed context 𝑆,
e where 𝑆
is an extra side-input sequence with constant length 𝑘 ≪ 𝐿. The activation function elu(·) is the exponential linear
unit [47]. Then it secondly applies unpack attention as Eq. 7 to get unpacked output 𝑂,
e with the causal mask function
defined in Eq. 5. Afterward, they pass 𝑂
e and 𝑆e to the next attention layer, denoted as 𝑋 and 𝑆, to propagate packed
contextual information via 𝑆 without leakage of future information. Notice that the pack attention can be regarded as a
generalization of the linear projection in Linformer [229] with the same complexity 𝑂 (𝐿𝑘𝑑) but the advantage to model
sequences with various lengths since the projection matrices are not dependent to 𝐿 as projection matrices 𝐸 and 𝐹 .
!
𝑄𝑠 × 𝐾 T
𝐴𝑠 := elu √︁ , 𝑆e := 𝐴𝑠 × 𝑉 , 𝑤ℎ𝑒𝑟𝑒 𝑄𝑠 := 𝑆 × 𝑊𝑞 (6)
𝑑𝑘

𝐴𝑢 := softmax 𝜉 w𝑖𝑛𝑣 (𝑄, 𝑉 , 𝐴𝑠T ) , 𝑂
e := 𝜉 w𝑖𝑛𝑣 (𝐴𝑢 , 𝐴𝑠T, 𝑉 ), 𝑤ℎ𝑒𝑟𝑒 w𝑖𝑛𝑣 := 𝑖 -1 𝐿

𝑖=1 (7)
Kernelized Approximation. Except for low-rankness prior, some works are based on generalized kernelizable attention
in Eq. 8, where the kernel function K (·, ·) : R𝑑 × R𝑑 → 𝑅+ is applied row-wise to each pair of q𝑖 , k 𝑗 in 𝑄, 𝐾, and 𝐷 is
the normalization factor defined in Eq. 9. From this view, the vanilla softmax attention implements a specific kernel
𝑄𝐾 T
function as K (𝑄, 𝐾) = exp( √ ), which explicitly derives a 𝐿 × 𝐿 attention matrix. But suppose we carefully choose
𝑑𝑘
another kernel function to be factorizable as the condition in the second step of Eq. 8 and 9, then simply apply the
associative property. In that case, we can compute matrix multiplication of 𝐾, 𝑉 and 𝐾, 1𝐿 ahead with lower complexity
𝑂 (𝐿𝑑 2 ). However, the drawback is that similar to Luna [144], one has to compute the matrix multiplication iteratively
across each query q𝑖 , which does not fully make use of the parallelism, as shown in the third step of Eq. 8 and Eq. 9.
K (𝑄,𝐾 )=𝑄 eT
e× 𝐾
𝑂 := 𝐷 −1 × K (𝑄, 𝐾) × 𝑉 ================== 𝐷 −1 × 𝑄 e× 𝐾 ======= 𝐷 −1 × 𝜉 1𝐿 (𝑄,
eT × 𝑉 ==𝑐𝑎𝑢𝑠𝑎𝑙 e 𝑉)
e 𝐾, (8)
𝑎𝑠𝑠𝑜𝑐𝑖𝑎𝑡𝑖𝑣𝑒
K (𝑄,𝐾 )=𝑄 eT
e× 𝐾 h i h i
𝑤ℎ𝑒𝑟𝑒 𝐷 := diag [K (𝑄, 𝐾) × 1𝐿 ] ================== diag 𝑄e× 𝐾e × 1𝐿 ==𝑐𝑎𝑢𝑠𝑎𝑙
======= diag 𝜉 1𝐿 (𝑄, e 1𝐿 )
e 𝐾, (9)
𝑎𝑠𝑠𝑜𝑐𝑖𝑎𝑡𝑖𝑣𝑒
For instance, Linear Transformer [104] designs a simple feature map 𝜑𝐿𝑖 based on the 𝑒𝑙𝑢 kernel in Eq. 10. It avoids
the quadratic attention matrix and reduces the time (resp. space) complexity to 𝑂 (𝐿𝑑 2 ) (resp. 𝑂 (𝐿𝑑)). In contrast,
Performer [45] achieves unbiased and low-variance estimation based on orthogonal random features (ORFs) mapping
𝜑𝑃𝑒 (·) : R𝑑 → R𝑟 in Eq. 11, where 𝑟 = 𝑚 × 𝑙 ≪ 𝐿, 𝑏 1, · · · , 𝑏𝑙 : R → R are 𝑙 basis functions, ℎ : R𝑑 → R+ is certain

magnitude measurement, and 𝜔 1, · · · , 𝜔𝑚 ∈ R𝑑 are 𝑚 orthogonal random features 𝑖.𝑖.𝑑 sampled from a distribution D ∈
P (R𝑑 ). Performer provides multiple configurations of these parameterized functions, like ℎ := 1, 𝑙 := 1, D := N (0, I𝑑 )
as PNG-kernels [46], and ℎ := 1, 𝑙 := 2, 𝑏 1 := sin, 𝑏 2 := 𝑐𝑜𝑠, D := N (0, 𝜎 2 I𝑑 ) as shift-invariant Gaussian kernel [182],
adopted in RFA [171]. Hence, the time (resp. space) complexity can be reduced to 𝑂 (𝐿𝑟𝑑) (resp. 𝑂 ((𝑑 + 𝑟 )𝐿)).
K𝐿𝑖 (q, k) := 𝜑𝐿𝑖 (q) × 𝜑𝐿𝑖 (k) T, where 𝜑𝐿𝑖 (x) = elu(x) + 1 (10)
h i ℎ(x) h i
K𝑃𝑒 (q, k) := E𝜔 𝜑𝑃𝑒 (q) × 𝜑𝑃𝑒 (k) T , where 𝜑𝑃𝑒 (x) = √ 𝑏 1 (𝜔 1T x), . . . , 𝑏 1 (𝜔𝑚
T
x), . . . , 𝑏𝑙 (𝜔 1T x), . . . , 𝑏𝑙 (𝜔𝑚
T
x)
𝑚
(11)
The works above pioneer a novel approach to enhancing attention efficiency by treating it as a kernel machine,
spawning further kernel strategies such as Fourier Attention [161] and Primal Attention [39].
Sparse-Kernelized Hybrid. Furthermore, inspired by Robust-PCA [28], the recent Scatterbrain [31] provides a more
accurate yet efficient approximation by combining LSH-based sparse matrices 𝑆 like Reformer’s [110] and low-rank
kernelized decomposition with randomized feature maps like Performer’s [45], as simplified in Eq. 12, where we omitted
the normalization step and the causal mask applying the function.

e := 𝑄
𝑂 e×𝐾eT + 𝑆 × 𝑉 = 𝑄 e × (𝐾 eT × 𝑉 ) + 𝑆 × 𝑉 (12)
Not only does this method unify two approximation techniques to achieve linear time complexity of 𝑂 (𝐿𝑟𝑑) with
higher precision, but it also offers flexibility to leverage various low-rank and sparse approximation methods as
sub-components, more than just the example combination of Reformer and Performer.
3.5 IO-Aware Attention

All of the methods above in pursuit of efficient attention can be considered as trading off high attention quality for
low computation complexity, based on some theoretical or empirical properties of attention matrix and NLP tasks,
including locality, sparsity, low-rankness, and other heuristic or mathematical tricks. In comparison, these IO-aware
attention mechanisms below collectively represent efforts to optimize attention computations by considering the
memory bottleneck while preserving the exactness of attention kernel calculations.
Memory-Efficient Attention. This simple method is firstly proposed in [177], which utilizes the lazy softmax
algorithm [97] and tracks normalization factor to compute standard and numerically stable attention by sequentially
processing each single/chunked-query attention. Such a simple method only needs constant working memory with
respect to sequence length, while the time complexity is still quadratic.
Flash Attention. Furthermore, the recent work Flash Attention [54, 55] manages to reduce time and memory
consumption while applying exact attention computation by making it IO-aware between GPU high bandwidth
memory (HBM) and GPU on-chip SRAM. It has already reached wide adoption and has been incorporated directly into
Pytorch v2.0 [175]. To be more specific, for forward pass, it utilizes the tiling technique [244] to decompose large
softmax attention into smaller blocks, loading block by block from HBM [99] to SRAM, to perform all the attention
computation steps on-chip to reduce HBM accesses, and incrementally update the output back to HBM by rescaling,
leveraging the online softmax technique [156]. As Eq. 13 illustrates, 𝑃 (𝑛) := 𝑄𝑖𝑛 × 𝐾 𝑗T ∈ R𝐵𝑟 ×𝐵𝑐 denotes one pair of
𝑛
block-wise matrix multiplication from 𝑄, 𝐾, which is tiny enough to be loaded and computed in SRAM, and the
rescaling factor 𝛼 (𝑛) only comprises some statistics about 𝑃 (𝑛) . As for backward pass in training stages, it can both
decrease the required memory and speed up due to reduced HBM accesses as well, by just storing the output 𝑂 and
14 Huang, et al.
normalization statistics to recompute the intermediate results 𝑃, 𝐴 in SRAM, similar to the gradient checkpointing
technique [37, 201]. The authors analyze that flash attention only requires 𝑂 (𝐿 2𝑑 2 𝑀 −1 ) HBM accesses compared to
standard 𝑂 (𝐿𝑑 + 𝐿 2 ), where 𝑀 ≫ 𝑑 2 is the size of SRAM, which leads to both faster execution (up to 7.6× speedup on
GPT2 [179]) and lower memory footprint (up to 20× more memory efficient), according to the experiments results [55].
h i 𝑉 (1) 
(1) (1) (1)
𝑂 := softmax 𝑃 (1) 𝑃 (2)  (2)  = 𝛼 softmax(𝑃 )𝑉
  + 𝛼 (2) softmax(𝑃 (2) )𝑉 (2) (13)
𝑉 
 
SCFA. Although Flash Attention can be easily extended to support block-sparse structures [55], it may lack flexibility for
handling other sparse strategies with irregular structures and arbitrary attention masks. The effort of SCFA [168] extends
the Flash Attention GPU kernel to accommodate a broad range of attention sparsity patterns, including key/query
dropping and hashing-based attention like Reformer [110]. This extension leads to a training speedup of 2.0 to 3.3 times
without sacrificing perplexity, according to the report from the paper.
Paged Attention. While Flash Attention has effectively tackled the training memory bottleneck, LLMs still face
challenges related to the memory consumption of the KV cache during inference, which grows dynamically with
batched requests. Recognizing the memory wastage due to fragmentation and redundancy, vLLM proposes Paged
Attention [119]. This technique efficiently manages KV cache memory to minimize waste and allows flexible sharing
across batched requests, drawing inspiration from memory paging techniques [107] in virtual memory operating
systems [57].
4 LONG-TERM MEMORY
The Transformer architecture often struggles with capturing long-term dependencies due to in-context working
memory, as highlighted in Sec. 2.2. Researchers have explored two main avenues to address this challenge without
compromising the advantages of full attention. Inspired by RNNs, some introduced recurrent mechanisms into attention
by incorporating internal memory caches accessible through attention layers. This approach enables the model to
maintain and retrieve information over longer sequences, compensating for the inherent lack of built-in long-term
memory. An alternative approach involves leveraging existing models as interfaces to external knowledge bases, such
as specific documents or datasets. During inference, the model can read from these knowledge bases to enrich its
contextual input and write to them from the user’s response to refresh its long-term memory. By integrating external
knowledge in this manner, the model gains access to a broader range of context, enhancing its ability to handle long-term
dependencies effectively.
4.1 Internal MemoryCache

Recalling the temporality of natural language representations instead of the success of full parallelism in Transformer,
we introduce the concept of Internal MemoryCache based on recurrence mechanisms. It divides the long text into a
stream of fixed-length segments and enhances the query 𝑄𝑡𝑛 of the current 𝑡-th segment in the 𝑛-th layer with more
contextual information 𝐾 e𝑛 , 𝑉e𝑛 . This contextual information is obtained from cached or distilled information from
𝑡 𝑡
previous segments, stored in a memory cache denoted as 𝑀𝑒𝑚, as shown in Eq. 14. To facilitate later explanations,
we assume that each segment has the same length 𝑙, and the models consist of 𝑁 layers of transformer blocks. The
operator [◦] represents the concatenation operation along the length dimension. It is worth noting that the variables
in the memory cache 𝑀𝑒𝑚 are usually detached from the computation graph, eliminating the need for gradient
computation, which we denote with a hat accent, such as 𝑋

b.

𝑄𝑡𝑛 , 𝐾
e𝑡𝑛 , 𝑉e𝑡𝑛 := 𝑋𝑡𝑛𝑊𝑞 , 𝑋
e𝑡𝑛𝑊𝑘 , 𝑋
e𝑡𝑛𝑊𝑣 , 𝑤ℎ𝑒𝑟𝑒 𝑋𝑡𝑛 := 𝑂𝑡𝑛−1, 𝑋
e𝑡𝑛 := Mem(𝑛, 𝑡, . . .) ◦ 𝑂𝑡𝑛−1 (14)
Segment-Level Recurrence. The segment-level recurrence is initially introduced into Transformer from Transformer-
XL [53]. As illustrated in Eq. 15, it caches the output of 𝑚 previous consecutive segments in the last layer and concatenates
them into the current segment in the present layer to extend the context for the current query. Such a mechanism
allows for extending the largest possible dependency distance to 𝑂 (𝑁𝑚𝑙), where 𝑚 can be set as far as GPU memory
allows. Building upon Transformer-XL, Segatron [14] introduces the segment-aware mechanism by enhancing the
token-level PEs combined with sentence-level and even paragraph-level ones. To further extend the dependency with
multi-grained memory caching, Compressive Transformer [180] stores the first FIFO fine-grained memory queue for
𝑚 1 previous segments as Transformer-XL does. However, instead of discarding old memory, it applies a compression
function 𝑓𝑐 with the rate 𝑐 to compress it along the length dimension and pushes it into a secondary FIFO coarse-grained
compressive memory queue of size 𝑚 2 . Combining these two types of memories, one can obtain the longest context
dependency as 𝑂 (𝑁𝑙 (𝑚 1 + 𝑐𝑚 2 )), as shown in Eq. 16.
h i
Mem𝑋 𝐿 (𝑛, 𝑡, 𝑚) := 𝑂 b𝑡𝑛−1 b𝑛−1
−𝑚 ◦ . . . ◦ 𝑂 𝑡 −1 (15)

Mem𝐶𝑜𝑚𝑝 (𝑛, 𝑡, 𝑚 1, 𝑚 2, 𝑐) := Mem 𝑓𝑐 ◦ Mem𝑋 𝐿 (𝑛, 𝑡, 𝑚 1 ) , (16)
h i
𝑤ℎ𝑒𝑟𝑒 Mem 𝑓𝑐 := 𝑓𝑐 (𝑂 b𝑡𝑛−1 b𝑛−1
−𝑚 1 −𝑚 2 ) ◦ . . . ◦ 𝑓𝑐 (𝑂 𝑡 −𝑚 1 −1 )
Retrospective Recurrence. Notice that both Transformer-XL and Compressive Transformer deploy a shifting-one-
layer-downwards recurrence by default, thus the maximum effective context length is limited by 𝑁 . To address it, similar
to Feedback Transformer [69], ERNIE-Doc [64] proposes an enhanced recurrence mechanism, a drop-in replacement
by concatenating the output hidden states of previous segments in the same layer, instead of the last layer, simply
formalized as Eq. 18. In this manner, not only the maximum effective context length can be implicitly expanded,
but also the past higher-level representations can be exploited to enrich future lower-level representations as well.
Additionally, it employs a retrospective feed mechanism by feeding the segments twice, where the first time only
skims each segment while the second one retrospects to enable bi-directional information flow, which resembles the
mechanism in READTWICE [257].
e(𝑠) = 𝐵 T Φ(𝑠), 𝑠.𝑡 . 𝑋

Mem∞ := 𝑋 e(𝑠𝑖 ) ≈ 𝑋𝑖 , 𝑠𝑖 := 𝑖/𝐿, ∀𝑖 ∈ [1, . . . , 𝐿] (17)
b𝑛
Mem𝐸𝑟𝑛𝑖𝑒 (𝑛, 𝑡) := 𝑂 (18)
𝑡 −1
Continuous-Signal Memory. To draw a comparison with LSTM [253], we can view the compressed memory in
Compressive Transformer as a finite-sized discrete version of the long-term cell memory in LSTM, while the first queue
stores the short-term one. To achieve unbounded long-term memory like LSTM, as Eq. 17 suggests, ∞-former [148]
transfers the 𝐿 token-wise discrete embeddings 𝑋 ∈ R𝐿×𝑑 into a continuous signal 𝑋
e(𝑠) : [0, 1] → R𝑑 . This signal
is expressed as a linear combination of 𝑚 radial basis functions (RBFs), denoted as Φ(𝑠) ∈ R𝑚 with the coefficient
matrix 𝐵 ∈ R𝑚×𝑑 , which is fitted by multivariate ridge regression [22]. This continuous signal representation allows
for unbounded context representation with fixed memory storage, independent of the context length, similar to LSTM.
However, as the memory cache is stored as a continuous signal, it cannot simply prepend to the current segment but
16 Huang, et al.
has to transform back to embeddings via continuous attention [147].

e𝑡0 ), 𝑋
e𝑡0 := 𝑋𝑡𝑚𝑒𝑚 ◦ 𝑋𝑡0 ◦ 𝑋𝑡𝑚𝑒𝑚 ,

e𝑡𝑁 := Transformer(𝑋
𝑂 (19)
h i
𝑤ℎ𝑒𝑟𝑒 𝑋𝑡𝑚𝑒𝑚 := 𝑂𝑡𝑤𝑟𝑖𝑡𝑒 𝑟𝑒𝑎𝑑 𝑁 𝑤𝑟𝑖𝑡𝑒 := e𝑁
−1 , 𝑂 𝑡 −1 ◦ 𝑂 𝑡 −1 ◦ 𝑂 𝑡 −1 𝑂𝑡 −1
h i
e𝑡𝑁 , 𝑉e𝑡𝑁 ) := retr(𝑄𝑡𝑁 , 𝑚, 𝑘) ◦ (𝐾𝑡𝑁 , 𝑉𝑡𝑁 ) ,
(𝐾 (20)
h i
𝑤ℎ𝑒𝑟𝑒 retr(𝑄𝑡𝑁 , 𝑚, 𝑘) := kNN 𝑄𝑡𝑁 , {(𝐾𝑡𝑁−𝜏 , 𝑉𝑡𝑁−𝜏 )}𝑚
𝜏=1
Alternate Cache Designs. RMT [25] formalizes the memory cache as special [mem] tokens, prepended both at the
start and the end of each segment, as shown in Eq. 19. After processing each segment, the read/write tokens will be
split from the output embeddings, and the write tokens will be taken as the [mem] tokens for the next segment. By
leveraging such a recurrence mechanism with global memory tokens, RMT is demonstrated to scale effective context
size to 1M tokens [26]. Memorizing Transformer [238] applies (key, value) memory cache only for the top attention
layer, but with a large cache size without compression. Besides, instead of a simple FIFO cache to read memory, they
use kNN algorithm to retrieve top-k most similar (key, value) pairs for each query to prepend to the local ones, as Eq. 20
indicates. In contrast, Memformer [237] reads and writes the memory cache fully leveraging variants of self-attention
with a forgetting mechanism to retrieve and retain the most significant information through long-range time steps.
4.2 External MemoryBank

The discussed mechanisms enhance the vanilla stateless Transformer with sequential recurrence by prepending
additional hidden states from an internal memory cache. However, they present drawbacks. Firstly, minor changes
in the memory mechanism may require full model retraining, underutilizing pretrained LLMs already proficient in
capturing contextual dependencies, albeit not long enough. Secondly, as noted in Memorizing Transformer [238], they
frequently face the challenge of memory staleness, where older hidden states in the memory cache may diverge in
distribution from the latest ones during training, thus hampering memory augmentation effectiveness.
As a solution, another retrieval-augmented mechanism, often named Retrieval-Augmented Generation (RAG) [126],
decouples the model from its long-term memory storage by using a contextual information encoder to store long
sequences as segmented embedding vectors to an external memory bank. During queries, the model retrieves information
from this memory bank based on specific criteria and dynamically concatenates it to form in-context working memory.
Cosine-Based Retrieval Criteria. LangChain [29, 220] is an open-source framework tailored for applications like
chatbots. It ingests user-specified local documentation in standard readable formats, vectorizing it via off-the-shelf
LLMs into a memory bank. During interactions, it retrieves top-relevant contexts using dot-product cosine similarity
between user prompt embeddings and stored vectors, prepending them to the prompt for generating responses. This
pipeline efficiently leverages off-the-shelf LLMs, enhancing their long-term memory with a cost-effective, flexible, and
dynamic mechanism. Similar works have also developed external memory banks for chatbot applications [30] and SQL
generation [224].
Heuristic Retrieval Criteria. Except for cosine similarity, RETRO [21] retrieves from the BERT-embedded KV memory
bank via 𝑘NN search based on 𝐿2 distance. Similarly utilizing 𝑘NN search, Unlimiformer [18] enables any existing
pretrained encoder-decoder transformer to index unlimited input sequences for decoder retrieval of top-𝑘 keys for
cross-attention. In contrast, SiliconFriend [266] proposes the Memory Bank mechanism for tracking long-chat history,
providing specialized responses through dialogue logging, event distillation, user personality awareness, and memory
refreshment. RecurrentGPT [267] facilitates recurrent prompting and defines the recurrent computation graph with
ChatGPT by simulating long-/short-term memory mechanism in LSTM [253]. Moreover, RecallM [120] organizes and
updates memory as a dynamic concept-aware knowledge graph for improved continual learning and temporal reasoning
during chat. Inspired by Davidsonian semantics [56], Ret-LLM [158] stores/writes and retrieves/reads knowledge as
triplets ⟨𝐴, 𝐵, 𝑅⟩ (each means "A and B have a relationship of R"), utilizing finetuned Alpaca [213] to follow the
instructions in memory read/write operations. Inspired by traditional operating systems, MemGPT [167] implements a
hierarchical memory management system, swapping contexts between "main memory" in the chat history and "disk
storage" in the bank via function calls.
Learnable Retrieval Criteria. Despite these heuristic designs, REALM [81] pre-trains a latent neural knowledge
retriever using MLM as the learning signal, which takes charge of retrieving knowledge from a large textual corpus.
LongMem [230] trains another Transformer-based SideNet to decouple the memory retrieval and fusion process from
the pretrained LLMs, which are only responsible for encoding the (key, value) pairs into the memory bank. Recently,
FOT [223] proposes a novel contrast training procedure across batches of documents, reshaping the KV space to address
distraction issue as the size of a 𝑘NN-lookup memory bank increases during inference.
5 EXTRAPOLATIVE PES
Recognizing the need to push the inference length boundary beyond 𝐿𝑚𝑎𝑥 , the research community has made significant
efforts in this direction. Notably, according to [8], they have determined that distractors are the primary cause of failures
in length generalization in the case of parity task. These issues, however, can be mitigated considerably through
approaches such as scratchpad prompting [163]. Nevertheless, in this section, our focus remains on the undeniable role
that current PEs play in length generalization in more general scenarios.
5.1 Enhancing Understanding

Before entering the concrete approaches, we would love to provide some insights below to enhance the understanding
of this minute but essential design in Transformer for sequential modeling tasks.
Rethinking PEs as 𝛽-Encoding. Su [203] revisits the sine and cosine basis functions of SinPE and RoPE, considering
them as approximated terms for the 𝛽-encoding system to represent any position number 𝑛, as shown in Eq. 21. This
approach employs 𝑑2 fixed 𝛽-bits, where 𝛽 := 𝜃 −1 = 𝑏𝑎𝑠𝑒 2/𝑑 represents the power basis of the wavelength or period of
𝑑/2
the trigonometric basis functions, which increases as a geometric series {𝛽 𝑖 }𝑖=0 with the dimension 𝑖 goes deeper.
To gain a deeper understanding of this concept, we can draw a comparison between Eq. 21 and Eq. 3-4. It becomes
evident that the 𝑖-th 𝛽-bit of the representation of 𝑛 involves the division of the 𝑖-th power of 𝛽, followed by some sort
of periodic operations ( mod in Eq. 21 and sin, cos in Eq. 3, 4).
  ⌈log𝛽 𝑛⌉−1


 𝑛 


𝑛(𝛽) := ⌊ ⌋ mod 𝛽 (21)
 𝛽𝑖
 

 𝑖=0
Length Extrapolation Dilemma. Before the era of Transformers, RNN-based language models were trained on
shorter sequences but were expected to generalize effectively to longer contexts, a phenomenon referred to as length
extrapolation or length generalization [154, 155]. Unfortunately, recent studies [8, 35, 173] have highlighted a significant
18 Huang, et al.
shortcoming of length extrapolation ability for Transformer-based language models. This causes the insufficient context
length limit during inference when applying to real-world applications, as analyzed in Sec. 2.2.
In the original Transformer paper [225], there is little discussion regarding the design insights or theoretical
interpretation of their SinPE. This has led many researchers to question its necessity and effectiveness, especially the
blame on the extrapolation deficit, which also points to the same trigonometry-based RoPE [207]. To understand the
lousy extrapolation caused by current trigonometric PEs, we investigate and summarize two insights from distinct
views as follows.
• From a mathematical view, as Su [206] explains in his blog, extrapolation, which involves inferring the whole
from local information, depends on the high-order smoothness of the function. However, these PEs are designed
as combinations of high-frequency oscillatory trigonometric basis functions to accommodate sufficient positional
information. This choice makes it challenging for the models to generalize without specific learning during
training stages.
• From a training view, due to the wavelength or period of the basis functions increasing exponentially, proportional
𝑑/2
to {𝛽 𝑖 }𝑖=0 , training samples constrained by currently supported 𝐿𝑚𝑎𝑥 are typically too short for the rear low-
frequency dimensions to span the entire periodic cycle. This suggests only a few dimensions perceive complete
periodic information, thus receiving sufficient training for extrapolation, and the boundary is defined as critical
dimension in [140] (e.g., for Llama2-4k [222], the critical dimension is only 92). Consequently, direct extrapolation
becomes prone to failure when relying on these poor-learned low-frequency components.
5.2 Attention Bias

As alternative mechanisms to explicitly encoding positional information, attention bias have been explored to capture
the sequentiality and temporality of natural language incorporated into the attention kernel. As shown in Eq. 22, the
attention bias is depicted as a matrix, denoted as 𝐵, added to the unnormalized attention weights matrix 𝑃 before
applying the softmax operation. Each element of this matrix, indexed by (𝑖, 𝑗), carries positional information encoded
by a function B. Thus, it is reasonable to regard the attention bias as a form of relative PEs.
𝑃e := 𝑃 + 𝐵, 𝐵 ∈ R𝐿×𝐿 , 𝑤ℎ𝑒𝑟𝑒 𝐵𝑖 𝑗 := B (𝑖, 𝑗), ∀𝑖, 𝑗 ∈ {0, 1, . . . , 𝐿 − 1} (22)
Early approaches like T5 [181] employ learnable attention bias, denoted as B𝜃 (𝑖, 𝑗), which is independent for
each head in each attention layer. However, they did not explicitly address the problem of length extrapolation. The
breakthrough in recognizing and addressing the extrapolation problem comes with ALiBi [173]. ALiBi introduces
a negative causal attention bias heuristically, as shown in Eq. 23, where 𝜆 (ℎ) is a head-specific slope fixed before
training and decreases geometrically with the head index ℎ. ALiBi successfully maintains low perplexity levels when
extrapolating inference tokens beyond 𝐿𝑚𝑎𝑥 up to 16×.
Following the success of ALiBi, several variants emerged in the quest to improve extrapolative PEs for Transformer-
based LLMs. KERPLE [41] extended the ALiBi-style attention bias by considering it as a composition triangle kernel to
self-attention. Two extra learnable scalar parameters were introduced to generalize the bias kernel, as shown in Eq. 24.
The authors of Sandwich [42] reused the Sinusoidal PEs to form the attention bias in a RoPE-style, as illustrated in
Eq. 25, with 𝜆 as a hyper-parameter to tune. Interestingly, another method discussed by Su [206] in his blog utilizes a
super-baseline approach during inference, as illustrated in Eq. 26. This method relies on a local causal attention mask,
where each query attends to keys whose distances have not exceeded 𝐿𝑚𝑎𝑥 while still applying RoPE. According to
Su’s experiments, this approach proves to be simple, low-cost, and performs sufficiently well compared to the more
elaborate designs mentioned earlier, thus referred to as a super-baseline.
(ℎ) 1 1
B𝐴𝐿𝑖𝐵𝑖 (𝑖, 𝑗) := −𝜆 (ℎ) · |𝑖 − 𝑗 |, 𝜆 (ℎ) := 𝑜𝑟 (23)
2ℎ 2ℎ/2

 −𝑟 1 log(1 + 𝑟 2 |𝑖 − 𝑗 |), 𝑟 1, 𝑟 2 > 0


BKERPLE (𝑖, 𝑗) := (24)
 −𝑟 1 |𝑖 − 𝑗 |𝑟 2 , 𝑟 1 > 0, 𝑟 2 ∈ (0, 2]


B𝑆𝑎𝑛𝑑𝑤𝑖𝑡𝑐ℎ (𝑖, 𝑗) := 𝜆 · ⟨SinPE(𝑖), SinPE( 𝑗)⟩ (25)

 0,

 |𝑖 − 𝑗 | ∈ [0, 𝑚𝑎𝑥-𝑙𝑒𝑛𝑔𝑡ℎ]
B𝑠𝑢𝑝𝑒𝑟 -𝑏𝑎𝑠𝑒𝑙𝑖𝑛𝑒 (𝑖, 𝑗) := (26)
 −∞, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒


5.3 Extended RoPE

RoPE, as introduced in Sec. 2.1, is a widely-used positional encoding scheme utilized in popular LLMs such as Llama and
GLM. It offers advantages such as relative distance decay, training stability, compatibility with linear attention, and better
length extrapolation capabilities compared to the traditional SinPE, as demonstrated in various experiments [35, 173],
albeit not that satisfactory. Therefore, several research works have aimed to extend RoPE using various strategies to
enhance its length extrapolation capabilities.
Scaling Strategies. Recent approaches in the community have gained prominence by scaling to extrapolate inference
context length with minimal or no finetuning [20, 35]. Aside from simple modifications to the base parameter [20,
189, 242], LEX [211] introduces XPOS, an extended causal RoPE incorporating an additional exponential decay term,
represented by Eq. 27, where 𝛾 ∈ (0, 1) is a scalar hyper-parameter. Similar techniques are utilized in PermuteFormer [33]
to adapt to Performer [45]. Additionally, Positional Interpolation (PI) [35] employs linear scaling on each position
number from 𝑛 to 𝜅𝑛 , densifying the representation space to extend the length boundary by 𝜅 times (cf. Eq. 28). This
strategy proves experimentally more stable and requires fewer finetuning steps than direct extrapolation.
However, linear scaling may hinder the network’s ability to distinguish the order and positions of closely spaced
tokens, compressing their distances by a ratio of 𝜅. Drawing from the Neural Tangent Kernel theory (NTK) [95], which
suggests that deep neural networks struggle with learning high-frequency information when input dimension is low
and corresponding embeddings lack high-frequency components, NTK-aware Scaling RoPE (NTK-RoPE) [20] combines
high-frequency extrapolation and low-frequency interpolation. It scales 𝛽 using coefficient 𝑐𝜅 to achieve equivalence
during interpolation by a ratio of 𝜅 for the lowest frequency term while maintaining scale for terms with higher
frequency (see Eq. 29). Surprisingly, this nonlinear scaling can be directly applied to LLMs pretrained with RoPE, like
Llama, without further finetuning to extend the context length boundary, adopted in CodeLlama [189].
Inspired by NTK-RoPE, several enhanced scaling methods have emerged. To avoid performance degradation when 𝐿 is
still within the 𝐿𝑚𝑎𝑥 , Dynamic-NTK [68, 170] delays applying the NTK scaling trick until 𝐿 exceeds the supported context
length 𝐿𝑚𝑎𝑥 , gradually increasing ratio 𝜅 as 𝐿 grows, adopted in Qwen-7B [102] and updated Llama2 [75]. To generalize
(0) (1) (𝑑/2−1)
𝛽-scaling across dimensions, NTK-mix RoPE [204] introduces multiple coefficients 𝑐𝜅 ≥ 𝑐𝜅 ≥ . . . ≥ 𝑐𝜅 for 𝛽
to interpolate less as the frequency increases. In contrast, NTK-by-parts [19, 170] avoids interpolating higher frequency
dimensions while always interpolating lower ones. YaRN [170], an extension of NTK-RoPE, combines NTK-by-parts
with a length scaling trick that scales 𝑄, 𝐾 by a constant temperature factor 𝑡. Giraffe [169] introduces Power Scaling
(Eq. 30), where the exponent 𝜅 > 0 controls the decay ratio of low frequencies, ensuring high-frequency elements are
20 Huang, et al.
less affected than poorly learned low-frequency ones. Despite manual designs, CLEX [32] employs a neural ordinary
differential equation (ODE) to learn continuous scaling as a dynamical system.
XPOS : 𝑃𝑖,𝑗 := ⟨e k 𝑗 ⟩ = 𝛾 𝑖 − 𝑗 (qT 𝑅 𝑗 −𝑖 k), where e

q𝑖 , e k 𝑗 := 𝛾 − 𝑗 (𝑅 𝑗 k), 𝑖 ≥ 𝑗
q𝑖 := 𝛾 𝑖 (𝑅𝑖 q), e (27)
PI : 𝑃e𝑖,𝑗 := ⟨𝑅𝑖/𝜅 q, 𝑅 𝑗/𝜅 k⟩ = qT 𝑅 𝑗 −𝑖 k (28)
𝜅
𝑛 𝑛/𝜅
NTK : 𝛽e := 𝑐𝜅 · 𝛽, 𝑠.𝑡 . = ⇒ 𝑐𝜅 = 𝜅 2/(𝑑 −2) (29)
𝛽e𝑑/2−1 𝛽 𝑑/2−1
Power Scaling : 𝛽e𝑖 := 𝛽 𝑖 /(1 − 2𝑖/𝑑)𝜅 (30)
Truncation Strategies. Based upon the idea of high-frequency extrapolation and low-frequency interpolation from
NTK-RoPE, Su further proposes two simple truncation strategies in his blog [205], named ReRoPE and Leaky ReRoPE
after the activation function Rectified Linear Unit (ReLU) and its leaky variant. As shown in Eq. 31, the main idea behind
this Rectified Truncation approach is to set a local window with size 𝑤, and for each token, no scaling is applied as
long as the tokens attend are inside the window. However, linear scaling, akin to Leaky ReLU, increases the position
by step 1/𝜅 when the token is located outside the window (Leaky-ReRoPE). This method combines high-frequency
extrapolation and low-frequency interpolation more directly and ensures that 𝐿𝑚𝑎𝑥 is not exceeded by carefully tuning
𝑤 and 𝜅. Furthermore, suppose the ratio 𝜅 is set to infinity. In that case, it applies a constant PE of position number 𝑤
to any pair of (q𝑖 , k 𝑗 ) as long as |𝑖 − 𝑗 | ≥ 𝑤, potentially accommodating infinite contexts (ReRoPE). According to Su’s
and our elementary experiments, ReRoPE performs very well without finetuning on perplexity metric and QA tasks,
even outperforming NTK-based schemes.
However, Leaky ReRoPE and ReRoPE entail two stages of scaling without a linear transformation to bridge their
gap. Consequently, they require two attention matrix computations per stage and use a boolean matrix to merge
them, significantly increasing inference cost and limiting the effective length boundary. Moreover, they are currently
incompatible with Flash Attention [54, 55] to mitigate high computational costs. To adapt ReRoPE with Flash Attention,
we have re-implemented the Flash Attention forward kernel to incorporate ReRoPE based on Triton [219], somewhat
alleviating its computational burden.1 Additionally, Giraffe [169] introduces another truncation strategy, called Basis
Truncation, depicted in Eq. 32, where 𝑎, 𝑏 are cutoff thresholds. This approach retains high-frequency basis components
while reducing low-frequency elements to near-zero values (𝜌 ≈ 0), simplifying extrapolation for low-frequency
components.
Rectified Truncation : 𝑃e𝑖,𝑗 := ⟨𝑅𝛼 (𝑖,𝑗,𝑤,𝜅 ) q, k⟩, where (31)
 min{𝑖 − 𝑗, 𝑤 + 𝑖 − 𝑗𝜅−𝑤 }, 0 < 𝜅 < ∞ (Leaky ReRoPE)




𝛼 (𝑖, 𝑗, 𝑤, 𝜅) :=
 min{𝑖 − 𝑗, 𝑤 }
 𝜅 → ∞ (ReRoPE)


𝜃 𝑖 , 𝜃 𝑖 ≥ 𝑏





Basis Truncation : 𝜃e𝑖 := 𝜌, 𝜃 𝑖 ∈ (𝑎, 𝑏) (32)



 0, 𝜃 𝑖 ≤ 𝑎


1 The experimental implementation is available at: https://github.com/Strivin0311/long-llms-learning/blob/main/notebooks/flash_rerope.py.

Rearrangement Strategies.
Based on the insights from Sec. 5.1, it is evident that rear position embeddings are updated less frequently than front
ones, potentially leading to improperly trained rear positions. Recent works address this issue effectively. SHAPE [111]
achieves shift invariance by randomly shifting absolute positions during training. Random Padding [212] balances
updating times across all positions by moving an arbitrary number of padding tokens to the front during finetuning.
Randomized PE [191] trains with a randomly sub-sampled set of positions from a broader range than the sequence
length, enhancing robustness. PoSE [268] finetunes models to adapt all relative positions of the target context window
by adding a distinct skipping bias term to position indices of training samples to simulate longer inputs.
In summary, research on extrapolative PEs is a promising and rapidly developing field, aiming to enhance the LLMs’
ability to infer long contexts in real-world scenarios with an available 𝐿𝑚𝑎𝑥 setting during training.
6 CONTEXT PROCESSING
Many methods propose intricate designs around the attention module in the Transformer architecture. In contrast, there
exist simpler approaches that treat pretrained LLMs as black-box or gray-box models and handle long-context inputs by
making multiple calls to the model, ensuring that each call respects the 𝐿𝑚𝑎𝑥 limitation. While these approaches don’t
enhance the LLMs’ inherent ability to process long contexts, they leverage the models’ in-context learning capabilities,
albeit with increased computation and potentially less precise answers.
6.1 Context Selection

To fit long segments within the context window of LLMs while preserving relevant information, some works partition
lengthy texts into segments and select specific ones based on predefined criteria. They vary in defining selection criteria
with corresponding scores, either sorting simultaneously or picking iteratively.
LangChain [29] employs three strategies for handling retrieved context that exceeds 𝐿𝑚𝑎𝑥 , one of which is referred
to Map Rerank, querying LLMs to output answers and confidence scores independently for each segment, and select
the answer with the highest confidence score as the final output. CogLTX [63] introduces a multi-step reasoning
mechanism, MemRecall, where two models sequentially score context segments in a coarse-to-fine manner during
each reasoning step. The top-k segments with the highest scores are added to the final candidate queue, deferring
the remaining segments to the next reasoning step until the queue is filled. LoBART [146] uses ROUGE-2 score [74]
to select the top-k contexts during training and trains an additional Hierarchical RNN model [44, 48] for generating
surrogate priority scores for context selection during inference.
6.2 Context Aggregation

In contrast to selection-based methods, some approaches consider contributions from all context segments to the final
answer rather than selecting one. Initially, they extract relevant information from each segment individually and then
employ various fusion strategies to aggregate the retrieved information, arriving at the final answer. These approaches
vary in two key aspects: how information is extracted from each segment and the fusion strategies used to integrate
information across all segments.
Fusion in Decoder. For LLMs like T5 and BART, a category of methods known as Fusion-in-Decoder (FiD) [93]
leverages the Encoder to extract information as embedded hidden states and the Decoder to attend to all contextualized
representations to generate the final output. In SLED [92], for example, contextualized representations are created
by overlapping a small portion of each segment with neighboring segments, forming context paddings. Each segment
22 Huang, et al.
is then independently encoded through the Encoder, and the resulting embeddings are concatenated to generate
embeddings for the entire extended document, excluding the context paddings. Finally, the Decoder integrates these
locally contextualized embeddings through cross-attention, achieving coherent fusion of information.
Map Reduce and Refinement. LangChain [29] introduces two additional aggregation techniques. The first, Map
Reduce, involves processing each segment simultaneously to obtain answers in parallel. These answers are then
synthesized into a final summary by another LLM. In contrast, the second approach Refine progressively refines answers
throughout the processing of each segment. Answers from previous segments are cascaded with the current segment,
serving as prompts for further refinement. This iterative process continues until the final segment is processed.
Parallel Context Windows. PCW [83, 185] handles long-context inputs similarly. It partitions the extended context
into multiple smaller windows, each with a maximum length 𝐶 := 𝐿 − 𝑇 , where 𝐿 is the total context length and 𝑇 is
the length of task-related tokens in the query, along with the maximum number of new tokens to be generated. Within
each window, tokens attend to each other in parallel, with their position indices isolated within the range of [0, 𝐶 − 1].
Task-related tokens, including the last context position indices within the range [𝐶, 𝐿 − 1], attend to all context tokens,
aggregating parallel information from each window to generate the final answer.
Similarly, NBCE [202] treats parallel context windows as a series of independent conditions 𝐶 1, 𝐶 2, . . . , 𝐶𝑛 ,
approximating the logarithmic posterior probability log P(𝑇 |𝐶 1, 𝐶 2, . . . , 𝐶𝑛 ) for token generation. It leverages the Naive
Bayes algorithm [232] to simplify the formulation. As deduced in Eq. 33, P(𝑇 |𝐶𝑖 ) represents the likelihood conditioned
on the 𝑖-th context window, P(𝑇 ) represents the prior, and const is a constant depending solely on 𝐶 1, 𝐶 2, . . . , 𝐶𝑛 . Su
extends this formulation to a more general case (cf. Eq. 34), introducing a hyper-parameter 𝜇 and a pooling operation,
denoted as pool, which can be either average or max. In this extended form, the original Naive Bayes formula is a
specific instance where 𝜇 is set to 𝑛 − 1 and pool is defined as average.
Î𝑛 P(𝑇 |𝐶 )P(𝐶 )
 P(𝑇 )
 𝑖 𝑖 
P(𝑇 ) 𝑛
 ∑︁
 𝑖=1 
log P(𝑇 |𝐶 1, . . . , 𝐶𝑛 ) ====== log 
 = log P(𝑇 |𝐶𝑖 ) − (𝑛 − 1) log P(𝑇 ) + 𝑐𝑜𝑛𝑠𝑡 (33)
𝑁𝐵  P(𝐶 1, . . . , 𝐶𝑛 ) 
 𝑖=1
 
 
⇒ (𝜇 + 1)pool[log P(𝑇 |𝐶𝑖 )] − 𝜇 log P(𝑇 ) + 𝑐𝑜𝑛𝑠𝑡 (34)
NBCE and PCW can be applied to any readily available LLMs but assume negligible relationships among context
windows, treating them uniformly and unordered. Their performance may suffer with tightly interconnected or excessive
windows to process in parallel.
6.3 Context Compression

Despite the selection or aggregation of source long contexts, some works focus on directly compressing the (hidden)
sequence length under the 𝐿𝑚𝑎𝑥 constraint. They aim to produce more condensed and abstract representations of the
long raw contexts before feeding them into LLMs, through either learning embedded alternatives (soft compression) or
filtering out redundancies (hard compression) based on various scores computed by pretrained models.
Soft Compression. [235] optimizes a few soft prompt tokens to significantly compress the original prompt while
retaining abstract sentiments. To avoid optimizing for every new context, AutoCompressor [40] compresses embedded
long segmented contexts into summary vectors by training an RMT-based pretrained model [25] using a simple
unsupervised objective. For instruction-following, [160] learns a gist model to compress instructions into prefixed gist
tokens, which can be reused for trained tasks and also generalized to novel ones.
Hard Compression. [134] utilizes self-information values [27] to filter out redundant or non-essential tokens.
LLMLingua [101] employs a coarse-to-fine compression process to rephrase various components in prompts, such as
instructions and demonstrations, based on perplexity values. [71] initially segments long contexts into topic-based
chunks using graph representation, followed by summarizing semantic-relevant sentences within each chunk.
7 MISCELLANEOUS
This section provides a concise overview of miscellaneous solutions that extend the previously discussed four categories,
offering a broader perspective on enhancing the effective context window of LLMs or optimizing the efficiency when
using off-the-shelf LLMs. It is worth noting that the literature covered here may not be exhaustive or specific to
Transformer-based models. Many of these techniques are applicable universally to any model equipped with deep
neural networks, albeit particularly crucial for large-scale LLMs.
Specific Objectives. In contrast to conventional pretraining objectives like MLM or CLM (discussed in Sec. 2.1), recent
research explores tailored approaches to adapt pretraining for specific tasks, aiming to enhance LLMs’ effectiveness in
capturing intricate long-range dependencies and discourse structures in longer texts compared to shorter ones [65]. For
instance, XLNet [247] introduces a permutation objective that excels in various NLP tasks. ERNIE-Doc [64] extends
this approach to long documents with a Segment-Reordering Objective to model long-range relationships. DANCE [76]
employs a divide-and-conquer preprocessing strategy for summarization tasks, breaking the long document and its
summary into multiple source-target pairs. PEGASUS [260] introduces the Gap Sentence Generation (GSG) objective for
abstractive summarization, while PRIMERA [241] extends it across multi-documents using the Entity Pyramid method.
Mixture of Experts. MoE [73, 94, 197] augments giant LLMs by replacing the dense FFN layer with a MoE layer,
incorporating multiple specialized experts where each excels in specific input types or tasks. A dynamic gating mechanism
selects the most suitable expert for a given input, which can be implemented through various ways, including task-
optimized expert modules [36, 94], sparse activation [80, 98, 197], sharding across multiple devices [85, 124], and adapting
mixture weights through training to determine the contribution of each expert [174]. Then, routing mechanisms select
the top-k experts for each token based on their gate values. [197] sets 𝑘 > 1 to determine the number of experts, while
Switch Transformer [70] suggests 𝑘 = 1 to both preserve model quality and reduce routing computation, known as Switch
Routing. Finally, the output is obtained through a weighted summation of contributions from selected experts [197]. MoE
techniques significantly enhance versatility, reduce computational demands, and elevate the efficiency and effectiveness
of modeling large-scale contexts [49, 127, 162, 192], already adopted by Mixtral [4].
Parallelism. Leveraging modern aggregated GPU memory within and across nodes, recent research has introduced
various parallelism strategies to scale up model sizes and extend sequence length. We summarize commonly used
parallelism paradigms with brief introductions as follows.
(a) Data Parallelism (DP) [133], widely integrated into PyTorch [91], is the most commonly used way to accelerate
training in a distributed manner across multiple devices. It replicates the model on each device to generate
gradients independently and communicates them at each iteration to maintain consistency.
(b) Tensor Parallelism (TP) [200] introduces tensor splitting, where individual layers of the model are horizontally
partitioned over multiple devices.
(c) Pipeline Parallelism (PP) [84, 90] splits the model layers vertically along the batch dimension into different
partitions of micro-batches on separate devices. Each device processes one micro-batch received from the
previous one in a pipeline fashion.
24 Huang, et al.
(d) Sequence Parallelism (SP) [113, 131] divides the input sequence into multiple chunks and feeds each chunk into
its corresponding device. It incorporates ring-style communication for computing the attention output.
(e) Expert Parallelism (EP) [70], as discussed earlier in MoE, places different experts on different GPUs and executes
them in parallel. Classic all-to-all communication primitives are often used to implement this form of
parallelism [183].
Furthermore, the integration of various parallelism strategies are prevailing in distributed environments, including
3D parallelism by DeepSpeed [217], which combines (a)–(c) with ZeRO [184, 228], AutoParallelism by Colossal-AI [130],
FSDP by PyTorch [265], and promising 4D parallelism [132, 243] by adopting (d) into existing mechanisms to both scale
parameters as well as sequence length.
Weight Compression. Various methods enhance memory efficiency in large-scale LLMs through weight compression
techniques, including pruning [143, 150, 210], factorization [121], quantization [255], partitioning [172], and
distillation [231]. Among them, quantization strategies particularly play a crucial role in practical deployment of
massive LLMs, by reducing parameter precision [58, 59, 118], thereby alleviating memory demands and accelerating
both training [60, 108, 135, 245] and inference [12, 72, 136]. Additionally, simpler approaches exist to mitigate the large
KV cache during inference, such as Multi-Query Attention (MQA) [196] and Grouped-Query Attention (GQA) [5],
which save the number of heads for KV and distribute them equally across multiple queries, applied to GLM and PaLM.
8 EVALUATION NECESSITY & OPTIMIZATION TOOLKIT

In this section, we explore evaluation necessities for assessing long-context capabilities of LLMs, including datasets,
metrics, and baseline models. Additionally, we investigate popular optimization toolkits, such as libraries, frameworks,
and compilers, to enhance LLM efficiency and effectiveness during development. Detailed information is organized in
table-format in Appendix A, B, C, and D, with a concise overview as follows.
Datasets. We provide a collection of evaluation datasets commonly used to assess the long-context capabilities of
LLMs, including benchmark suites like LRA [215], SCROLLS [194], LEval [7], LongBench [15], InfiniteBench [263], and
other single-task ones. In Tab. 1 of Appendix A, detailed information on each dataset is available, covering language,
task types, length statistics, quality, splits, and more. To ensure ongoing relevance, we maintain an updated version
of this table on our GitHub repository, accessible via the following link: https://github.com/Strivin0311/long-llms-
learning/blob/main/evaluation/datasets.md.
Metrics. We offer a summary of nine categories of general evaluation metrics commonly employed across ten NLP
task types, encompassing language modeling, question answering, summarization, math solving, code generation, and
open-ended writing, among others. For detailed metrics specific to each task, readers can refer to Tab. 2 in Appendix B.
Also, an accessible version of these metrics is available on our github repository at https://github.com/Strivin0311/long-
llms-learning/blob/main/evaluation/metrics.md.
Baselines. We gather a list of pretrained/finetuned LLMs commonly referenced in the literature, serving as baselines
for evaluating long-context capabilities across various downstream tasks, such as Claude2 and GPT4 (close-sourced),
along with LLongMA [170] and LongChat [50] (open-sourced). In Tab. 3 of Appendix C, we present an overview of
these models, including their basic information (e.g. 𝐿𝑚𝑎𝑥 ), statistics (e.g. parameter size, memory occupancy), and
relevant links (e.g. Hugging Face, GitHub, blog / paper). For updates on the latest state-of-the-art baselines, please refer
to https://github.com/Strivin0311/long-llms-learning/blob/main/evaluation/baselines.md.
Toolkit. We collect a diverse array of valuable toolkits at Tab. 4 in Appendix D, including libraries like vLLM [119],
compilers like Triton [219] and frameworks such as DeepSpeed [152] and Megatron-LM [200], to optimize the efficiency
and effectiveness of LLMs across their development lifecycle. Updates and additional details can be accessed via
https://github.com/Strivin0311/long-llms-learning/blob/main/toolkits.
9 DISCUSSION
Considerable progress has been achieved as we discussed in Sections 3, 4, 5, 6, yet several challenges persist. In this
section, we explore these key challenges and suggest potential avenues for future research and development aimed at
enhancing long-context capabilities of LLMs, especially architectural advancements for Transformers.
Attention Trade-off. As discussed in Section 3, efficient attention methods involve a trade-off between maintaining full-
scale attention dependencies and achieving higher attention score precision to mitigate computational demands. With
longer contexts, capturing global dependencies while preserving relevancy becomes crucial. Balancing computational
efficiency with attention precision remains a key challenge in long-context LLMs. Recent innovations like Flash
Attention [54, 55] offer IO-aware solutions, significantly improving efficiency in runtime and memory usage without
sacrificing attention precision. Integrating these solutions with existing strategies, and fuse them to GPU kernels with
tools like Triton [219] (cf. SCFA [168]), presents promising avenues for practical application.
Memory Efficacy and Efficiency. As outlined in Sec. 2.1, 2.2, we have identified limitations stemming from the
absence of explicit memory mechanisms, relying solely on in-context working memory, and the significant increase
in KV cache memory consumption during extended context interactions. These challenges emphasize the need for
more effective and efficient memory mechanisms in Transformer-based LLMs. The long-term memory mechanisms
discussed in Section 4 face constraints due to additional memory overhead from intricate heuristic design, potentially
leading to performance degradation over time. To address this, researchers explore more efficient strategies for memory
organization and read/write throughput enhancement, drawing on recent advancements like Paged Attention [119].
Length Extrapolation Mining. In Section 5, we analyze challenges in length extrapolation in Transformer-based
models, focusing on positional embeddings. And we overview recent breakthroughs, including extended strategies
applied to RoPE [19, 20, 32, 68, 170, 205], which show promise in addressing this limitation. However, these advancements
often rely on simplified observations of positional embedding properties and heuristic adjustments. This prompts us to
question the theoretical foundations of modeling sequentiality with high-dimensional embeddings and explore the
potential resurgence of learnable embeddings with many hyper-parameters. Future research, exemplified in CLEX [32],
should delve deeper into this area, especially in developing a robust theoretical framework for modeling sequentiality
in Transformer settings.
Specific yet Universal Objective. While we have discussed objectives tailored for long-text modeling, many are
limited to certain tasks or compatible only with the MLM objective rather than the more common CLM objective
nowadays. This underscores the need for universally applicable causal language modeling objectives that can capture
long-range dependencies effectively from the early stages of training. Aligning such objectives with an effective PE
scheme, as mentioned earlier, could achieve this.
Reliable Metric Demand. In Section 8, we explored various evaluation metrics. However, our experience highlights
significant disparities between commonly used metrics and human judgment [117]. With LLMs rapidly deployed in
real-world scenarios, there is a growing need for more dependable metrics [149], especially in generative tasks where
precise ground truth is elusive. One promising approach is leveraging state-of-the-art LLMs like GPT4 as substitutes for
human judges, though high associated costs challenge wider adoption in the research community.
26 Huang, et al.
10 CONCLUSION
In this survey, we comprehensively navigate the landscape of architectural advancement in Transformer-based LLMs
to enhance the capabilities of handling extensive context windows across various development stages with a holistic
taxonomy that categorizes these methodologies targeting different module designs in Transformer. Then, we explore
evaluation necessities specific to long-text tasks and some optimization toolkits that integrate many tools to augment
LLMs’ efficiency and efficacy. We further identify key challenges with corresponding future directions. In addition, our
repository ensures that readers stay updated with the latest research in this dynamic field. As LLMs continue to evolve
rapidly, we sincerely hope our survey serves as a valuable resource for researchers seeking to harness their power in
building powerful long-context LLMs, ultimately advancing the pursuit of the era of AGI.
ACKNOWLEDGMENTS
We would like to express our gratitude to Hao Gao, Zenan Li, Linyun Liu, etc for their helpful discussions and feedback
during the early stages of this paper. Then we acknowledge the generous support from the Baidu AI Cloud Group
(ACG), and genuine appreciation to Dou Shen, the Executive Vice President and Head of ACG, for the great idea and
gracious invitation to the first session of Baidu ACG Summer Camp. This opportunity has been instrumental in shaping
our research and providing valuable experiences.
REFERENCES
[1] Alaa Abd-Alrazaq, Rawan AlSaad, Dari Alhuwail, Arfan Ahmed, Padraig Mark Healy, Syed Latifi, Sarah Aziz, Rafat Damseh, Sadam Alabed Alrazak,
Javaid Sheikh, et al. 2023. Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions. JMIR Med Educ (2023).
[2] Jader Abreu, Luis Fred, David Macêdo, and Cleber Zanchettin. 2019. Hierarchical attentional hybrid neural networks for document classification.
In International Conference on Artificial Neural Networks. Springer, 396–402.
[3] Abhaya Agarwal and Alon Lavie. 2008. Meteor, m-bleu and m-ter: Evaluation metrics for high-correlation with human rankings of machine
translation output. In Proceedings of the Third Workshop on Statistical Machine Translation. 115–118.
[4] Mistral AI. 2023. Mixtral of Experts: A High-Quality Sparse Mixture-of-Experts. https://mistral.ai/news/mixtral-of-experts/.
[5] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized
Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245 (2023).
[6] Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li
Yang. 2020. ETC: Encoding long and structured inputs in transformers. arXiv preprint arXiv:2004.08483 (2020).
[7] Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. 2023. L-Eval: Instituting Standardized Evaluation
for Long Context Language Models. arXiv:2307.11088 [cs.CL]
[8] Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and
Behnam Neyshabur. 2022. Exploring length generalization in large language models. Advances in Neural Information Processing Systems (2022).
[9] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey,
Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403 (2023).
[10] Anthropic. 2023. Model Card and Evaluations for Claude Models. https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf.
[11] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
[12] Hicham Badri and Appu Shaji. 2023. Half-Quadratic Quantization of Large Machine Learning Models. https://mobiusml.github.io/hqq_blog/
[13] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473 (2014).
[14] He Bai, Peng Shi, Jimmy Lin, Yuqing Xie, Luchen Tan, Kun Xiong, Wen Gao, and Ming Li. 2021. Segatron: Segment-aware transformer for language
modeling and understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 12526–12534.
[15] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. 2023. LongBench:
A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv preprint arXiv:2308.14508 (2023).
[16] Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, and Xin Wang. 2023. Alternating Updates for Efficient Transformers.
arXiv:2301.13310 [cs.LG]
[17] Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020).
[18] Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew R Gormley. 2023. Unlimiformer: Long-range transformers with unlimited length input.
arXiv preprint arXiv:2305.01625 (2023).
[19] bloc97. 2023. Add NTK-Aware interpolation "by parts" correction, 2023. https://github.com/jquesnelle/yarn/pull/1.
[20] bloc97. 2023. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity
degradation. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/.
[21] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste
Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International conference on
machine learning. PMLR, 2206–2240.
[22] Philip J Brown and James V Zidek. 1980. Adaptive multivariate ridge regression. The Annals of Statistics 8, 1 (1980), 64–74.
[23] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
[24] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott
Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of Artificial General Intelligence: Early experiments with
GPT-4. arXiv:2303.12712 [cs.CL]
[25] Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. 2022. Recurrent memory transformer. Advances in Neural Information Processing Systems 35
(2022), 11079–11091.
[26] Aydar Bulatov, Yuri Kuratov, and Mikhail S Burtsev. 2023. Scaling Transformer to 1M tokens and beyond with RMT. arXiv preprint arXiv:2304.11062
(2023).
[27] Razvan Bunescu and Oseremen O Uduehi. 2022. Distribution-Based Measures of Surprise for Creative Language: Experiments with Humor and
Metaphor. In Proceedings of the 3rd Workshop on Figurative Language Processing (FLP). 68–78.
[28] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. 2011. Robust principal component analysis? Journal of the ACM (JACM) (2011).
[29] Harrison Chase. 2022. LangChain. https://github.com/langchain-ai/langchain.
[30] Chatchat-Space. 2023. Langchain-Chatchat: A LLM application aims to implement knowledge and search engine based QA based on Langchain
and open-source or remote LLM API. https://github.com/chatchat-space/Langchain-Chatchat.
[31] Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. 2021. Scatterbrain: Unifying sparse and low-rank attention
approximation. arXiv preprint arXiv:2110.15343 (2021).
[32] Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, and Lidong Bing. 2023. CLEX: Continuous Length Extrapolation for Large Language
Models. arXiv preprint arXiv:2310.16450 (2023).
[33] Peng Chen. 2021. Permuteformer: Efficient relative position encoding for long sequences. arXiv preprint arXiv:2109.02377 (2021).
[34] Pu-Chin Chen, Henry Tsai, Srinadh Bhojanapalli, Hyung Won Chung, Yin-Wen Chang, and Chun-Sung Ferng. 2021. A simple and effective
positional encoding for transformers. arXiv preprint arXiv:2104.08698 (2021).
[35] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending context window of large language models via positional
interpolation. arXiv preprint arXiv:2306.15595 (2023).
[36] Tianlong Chen, Xuxi Chen, Xianzhi Du, Abdullah Rashwan, Fan Yang, Huizhong Chen, Zhangyang Wang, and Yeqing Li. 2023. AdaMV-MoE:
Adaptive Multi-Task Vision Mixture-of-Experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 17346–17357.
[37] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174
(2016).
[38] Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2023. Longlora: Efficient fine-tuning of long-context large
language models. arXiv preprint arXiv:2309.12307 (2023).
[39] Yingyi Chen, Qinghua Tao, Francesco Tonin, and Johan A. K. Suykens. 2023. Primal-Attention: Self-attention through Asymmetric Kernel SVD in
Primal Representation. arXiv:2305.19798 [cs.LG]
[40] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. 2023. Adapting Language Models to Compress Contexts. arXiv preprint
arXiv:2305.14788 (2023).
[41] Ta-Chung Chi, Ting-Han Fan, Peter J Ramadge, and Alexander Rudnicky. 2022. Kerple: Kernelized relative positional embedding for length
extrapolation. Advances in Neural Information Processing Systems 35 (2022), 8386–8399.
[42] Ta-Chung Chi, Ting-Han Fan, Alexander Rudnicky, and Peter Ramadge. 2023. Dissecting transformer length extrapolation via the lens of receptive
field analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 13522–13537.
[43] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509
(2019).
[44] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning
phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[45] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz
Mohiuddin, Lukasz Kaiser, et al. 2020. Rethinking attention with performers. arXiv preprint arXiv:2009.14794 (2020).
[46] Krzysztof M Choromanski, Mark Rowland, and Adrian Weller. 2017. The unreasonable effectiveness of structured random orthogonal embeddings.
Advances in neural information processing systems 30 (2017).
[47] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and accurate deep network learning by exponential linear units (elus).

28 Huang, et al.
[48] Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware
attention model for abstractive summarization of long documents. arXiv preprint arXiv:1804.05685 (2018).
[49] Róbert Csordás, Piotr Piękos, and Kazuki Irie. 2023. SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention. arXiv preprint
arXiv:2312.07987 (2023).
[50] Anze Xie Ying Sheng Lianmin Zheng Joseph E. Gonzalez Ion Stoica Xuezhe Ma Dacheng Li*, Rulin Shao* and Hao Zhang. 2023. How Long Can
Open-Source LLMs Truly Promise on Context Length? https://lmsys.org/blog/2023-06-29-longchat
[51] Hongliang Dai, Siliang Tang, Fei Wu, and Yueting Zhuang. 2018. Entity mention aware document representation. Information Sciences 430 (2018),
216–227.
[52] Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-transformer: Filtering out sequential redundancy for efficient language processing.
Advances in neural information processing systems 33 (2020), 4271–4282.
[53] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models
beyond a fixed-length context. arXiv preprint arXiv:1901.02860 (2019).
[54] Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023).
[55] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness.
Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
[56] Donald Davidson. 1967. The Logical Form of Action Sentences. In The Logic of Decision and Action, Nicholas Rescher (Ed.). Univ. of Pittsburgh
Press, 81–120.
[57] Peter J Denning. 1970. Virtual memory. ACM Computing Surveys (CSUR) 2, 3 (1970), 153–189.
[58] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
arXiv:2208.07339 [cs.LG]
[59] Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 2022. 8-bit Optimizers via Block-wise Quantization. arXiv:2110.02861 [cs.LG]
[60] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. arXiv preprint
arXiv:2305.14314 (2023).
[61] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805 (2018).
[62] Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, and Furu Wei. 2023. Longnet: Scaling transformers to
1,000,000,000 tokens. arXiv preprint arXiv:2307.02486 (2023).
[63] Ming Ding, Chang Zhou, Hongxia Yang, and Jie Tang. 2020. Cogltx: Applying bert to long texts. Advances in Neural Information Processing Systems
33 (2020), 12792–12804.
[64] Siyu Ding, Junyuan Shang, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE-Doc: A retrospective long-document
modeling transformer. arXiv preprint arXiv:2012.15688 (2020).
[65] Zican Dong, Tianyi Tang, Lunyi Li, and Wayne Xin Zhao. 2023. A Survey on Long Text Modeling with Transformers. arXiv:2302.14502 [cs.CL]
[66] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. Glm: General language model pretraining with
autoregressive blank infilling. arXiv preprint arXiv:2103.10360 (2021).
[67] Wafaa S El-Kassas, Cherif R Salama, Ahmed A Rafea, and Hoda K Mohamed. 2021. Automatic text summarization: A comprehensive survey. Expert
systems with applications 165 (2021), 113679.
[68] emozilla. 2023. Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning. https://reddit.com/r/
LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/.
[69] Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, and Sainbayar Sukhbaatar. 2020. Addressing some limitations of transformers with
feedback memory. arXiv preprint arXiv:2002.09402 (2020).
[70] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.
The Journal of Machine Learning Research 23, 1 (2022), 5232–5270.
[71] Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, and Wei Han. 2023. Extending Context Window of Large Language Models via
Semantic Compression. arXiv preprint arXiv:2312.09571 (2023).
[72] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained
Transformers. arXiv:2210.17323 [cs.LG]
[73] Yao Fu Jinjie Ni Zangwei Zheng Wangchunshu Zhou Fuzhao Xue, Zian Zheng and Yang You. 2023. OpenMoE: Open Mixture-of-Experts Language
Models. https://github.com/XueFuzhao/OpenMoE.
[74] Kavita Ganesan. 2018. Rouge 2.0: Updated and improved measures for evaluation of summarization tasks. arXiv preprint arXiv:1803.01937 (2018).
[75] gante. 2023. Llama/GPTNeoX: add RoPE scaling. https://github.com/huggingface/transformers/pull/24653.
[76] Alexios Gidiotis and Grigorios Tsoumakas. 2020. A divide-and-conquer approach to the summarization of long documents. IEEE/ACM Transactions
on Audio, Speech, and Language Processing 28 (2020), 3029–3040.
[77] Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396
(2021).
[78] Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang. 2021. LongT5: Efficient text-to-text
transformer for long sequences. arXiv preprint arXiv:2112.07916 (2021).
[79] Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-transformer. arXiv preprint arXiv:1902.09113
(2019).
[80] Shashank Gupta, Subhabrata Mukherjee, Krishan Subudhi, Eduardo Gonzalez, Damien Jose, Ahmed H. Awadallah, and Jianfeng Gao. 2022. Sparsely
Activated Mixture-of-Experts are Robust Multi-Task Learners. arXiv:2204.07689 [cs.LG]
[81] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. REALM: Retrieval augmented language model pre-training. In
International conference on machine learning. PMLR, 3929–3938.
[82] Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. 2023. Lm-infinite: Simple on-the-fly length generalization for large
[83] Yaru Hao, Yutao Sun, Li Dong, Zhixiong Han, Yuxian Gu, and Furu Wei. 2022. Structured Prompting: Scaling In-Context Learning to 1,000 Examples.
arXiv:2212.06713 [cs.CL]
[84] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. Pipedream: Fast and
efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377 (2018).
[85] Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. 2022. FasterMoE: modeling and optimizing training of
large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
120–134.
[86] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference
on computer vision and pattern recognition. 770–778.
[87] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation
of large language models. arXiv preprint arXiv:2106.09685 (2021).
[88] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. 2023. Planning-
oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17853–17862.
[89] Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. 2021. Efficient Attentions for Long Document Summarization.
arXiv:2104.02112 [cs.CL]
[90] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al.
2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019).
[91] Sagar Imambi, Kolla Bhanu Prakash, and GR Kanagachidambaresan. 2021. PyTorch. Programming with TensorFlow: Solution for Edge Computing
Applications (2021), 87–104.
[92] Maor Ivgi, Uri Shaham, and Jonathan Berant. 2023. Efficient long-text understanding with short-text models. Transactions of the Association for
Computational Linguistics 11 (2023), 284–299.
[93] Gautier Izacard and Edouard Grave. 2020. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint
arXiv:2007.01282 (2020).
[94] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation 3, 1
(1991), 79–87.
[95] Arthur Jacot, Franck Gabriel, and Clément Hongler. 2018. Neural tangent kernel: Convergence and generalization in neural networks. Advances in
neural information processing systems 31 (2018).
[96] Omid Jafari, Preeti Maurya, Parth Nagarkar, Khandker Mushfiqul Islam, and Chidambaram Crushev. 2021. A survey on locality sensitive hashing
algorithms and their applications. arXiv preprint arXiv:2102.08942 (2021).
[97] Hanhwi Jang, Joonsung Kim, Jae-Eon Jo, Jaewon Lee, and Jangwoo Kim. 2019. Mnnfast: A fast and scalable system architecture for memory-
augmented neural networks. In Proceedings of the 46th International Symposium on Computer Architecture. 250–263.
[98] Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Lukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, and Jonni Kanerva. 2021.
Sparse is enough in scaling transformers. Advances in Neural Information Processing Systems 34 (2021), 9895–9907.
[99] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. 2018. Dissecting the NVIDIA volta GPU architecture via microbenchmarking.
[100] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna
Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
[101] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. Llmlingua: Compressing prompts for accelerated inference of large
[102] JianxinMa. 2023. Introducing Qwen-7B: Open foundation and human-aligned models (of the state-of-the-arts). https://github.com/QwenLM/Qwen/
blob/main/tech\_memo.md.
[103] Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. 2024. LLM Maybe LongLM:
Self-Extend LLM Context Window Without Tuning. arXiv preprint arXiv:2401.01325 (2024).
[104] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are rnns: Fast autoregressive transformers with
linear attention. In International conference on machine learning. PMLR, 5156–5165.
[105] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2022. Transformers in vision: A
survey. ACM computing surveys (CSUR) 54, 10s (2022), 1–41.

30 Huang, et al.
[106] Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. arXiv
preprint arXiv:1805.04623 (2018).
[107] Tom Kilburn, David BG Edwards, Michael J Lanigan, and Frank H Sumner. 1962. One-level storage system. IRE Transactions on Electronic Computers
(1962), 223–235.
[108] Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee. 2024. Memory-efficient fine-tuning
of compressed large language models via sub-4-bit integer quantization. Advances in Neural Information Processing Systems 36 (2024).
[109] Jin K Kim, Michael Chua, Mandy Rickard, and Armando Lorenzo. 2023. ChatGPT and large language model (LLM) chatbots: the current state of
acceptability and a proposal for guidelines on utilization in academic medicine. Journal of Pediatric Urology (2023).
[110] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 (2020).
[111] Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui. 2021. SHAPE: Shifted absolute position embedding for transformers. arXiv preprint
arXiv:2109.05644 (2021).
[112] Huan Yee Koh, Jiaxin Ju, Ming Liu, and Shirui Pan. 2022. An empirical survey on long document summarization: Datasets, models, and metrics.
ACM computing surveys 55, 8 (2022), 1–35.
[113] Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Reducing
Activation Recomputation in Large Transformer Models. arXiv:2205.05198 [cs.LG]
[114] Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2017. The NarrativeQA
Reading Comprehension Challenge. arXiv:1712.07040 [cs.CL]
[115] Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. , 32–33 pages. https://www.cs.toronto.edu/~kriz/learning-features-
2009-TR.pdf
[116] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural
information processing systems 25 (2012).
[117] Tatsuki Kuribayashi, Yohei Oseki, Takumi Ito, Ryo Yoshida, Masayuki Asahara, and Kentaro Inui. 2021. Lower perplexity is not always human-like.
[118] Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, and Tijmen Blankevoort. 2022. Fp8 quantization: The power of the
exponent. Advances in Neural Information Processing Systems 35 (2022), 14651–14662.
[119] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E Gonzalez, Hao Zhang, and Ion Stoica. 2023.
Efficient memory management for large language model serving with pagedattention. arXiv preprint arXiv:2309.06180 (2023).
[120] Brandon Kynoch and Hugo Latapie. 2023. RecallM: An Architecture for Temporal Context Understanding and Question Answering. arXiv preprint
arXiv:2307.02738 (2023).
[121] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised
learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
[122] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11
(1998), 2278–2324.
[123] Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, and Kangwook Lee. 2023. Prompted LLMs as Chatbot Modules for Long
Open-domain Conversation. arXiv preprint arXiv:2305.04533 (2023).
[124] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen.
2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020).
[125] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart:
Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461
(2019).
[126] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim
Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33
(2020), 9459–9474.
[127] Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. 2023. Accelerating distributed { MoE } training and inference with lina. In 2023
USENIX Annual Technical Conference (USENIX ATC 23). 945–959.
[128] Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2022. Pretrained Language Models for Text Generation: A Survey.
arXiv:2201.05273 [cs.CL]
[129] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. 2019. Enhancing the locality and breaking the
memory bottleneck of transformer on time series forecasting. Advances in neural information processing systems 32 (2019).
[130] Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, and Yang You. 2023. Colossal-ai: A unified deep
learning system for large-scale parallel training. In Proceedings of the 52nd International Conference on Parallel Processing. 766–775.
[131] Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. 2021. Sequence parallelism: Long sequence training from system
perspective. arXiv preprint arXiv:2105.13120 (2021).
[132] Shenggui Li, Fuzhao Xue, Yongbin Li, and Yang You. 2021. Sequence parallelism: Making 4d parallelism possible. arXiv preprint arXiv:2105.13120
(2021).

[133] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al.
2020. Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704 (2020).
[134] Yucheng Li, Bo Dong, Chenghua Lin, and Frank Guerin. 2023. Compressing context to enhance inference efficiency of large language models.
[135] Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and Tuo Zhao. 2023. Loftq: Lora-fine-tuning-aware
quantization for large language models. arXiv preprint arXiv:2310.08659 (2023).
[136] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Chuang Gan, and Song Han. 2023. AWQ: Activation-aware Weight Quantization for
LLM Compression and Acceleration. arXiv:2306.00978 [cs.CL]
[137] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022. A survey of transformers. AI Open (2022).
[138] Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
[139] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How
language models use long contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173.
[140] Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, and Dahua Lin. 2023. Scaling Laws of RoPE-based Extrapolation. arXiv preprint
arXiv:2310.05209 (2023).
[141] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[142] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al.
2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
[143] Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2024. Llm-pruner: On the structural pruning of large language models. Advances in neural
information processing systems 36 (2024).
[144] Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. 2021. Luna: Linear unified nested attention.
Advances in Neural Information Processing Systems 34 (2021), 2441–2453.
[145] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. arXiv
preprint arXiv:1611.00712 (2016).
[146] Potsawee Manakul and Mark JF Gales. 2021. Long-span summarization via local attention and content selection. arXiv preprint arXiv:2105.03801
(2021).
[147] André Martins, António Farinhas, Marcos Treviso, Vlad Niculae, Pedro Aguiar, and Mario Figueiredo. 2020. Sparse and continuous attention
mechanisms. Advances in Neural Information Processing Systems 33 (2020), 20989–21001.
[148] Pedro Henrique Martins, Zita Marinho, and André FT Martins. 2021. ∞-former: Infinite Memory Transformer. arXiv preprint arXiv:2109.00301
(2021).
[149] Clara Meister and Ryan Cotterell. 2021. Language model evaluation beyond perplexity. arXiv preprint arXiv:2106.00085 (2021).
[150] Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? Advances in neural information processing systems 32
(2019).
[151] Microsoft. 2020. DeepSpeed Sparse Attention: Powering 10x longer sequences with 6x faster execution. https://www.microsoft.com/en-us/research/
blog/deepspeed-extreme-scale-model-training-for-everyone/.
[152] Microsoft. 2020. Extreme Speed and Scale for DL Training and Inference. https://github.com/microsoft/DeepSpeed.
[153] Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical
attention networks. arXiv preprint arXiv:1809.01576 (2018).
[154] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model.. In
Interspeech, Vol. 2. Makuhari, 1045–1048.
[155] Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology
Workshop (SLT). IEEE, 234–239.
[156] Maxim Milakov and Natalia Gimelshein. 2018. Online normalizer calculation for softmax. arXiv:1805.02867 [cs.PF]
[157] Silvia Milano, Joshua A McGrane, and Sabina Leonelli. 2023. Large language models challenge the future of higher education. Nature Machine
Intelligence 5, 4 (2023), 333–334.
[158] Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze. 2023. RET-LLM: Towards a General Read-Write Memory for Large Language
Models. arXiv preprint arXiv:2305.14322 (2023).
[159] Amirkeivan Mohtashami and Martin Jaggi. 2023. Landmark Attention: Random-Access Infinite Context Length for Transformers. arXiv preprint
arXiv:2305.16300 (2023).
[160] Jesse Mu, Xiang Li, and Noah Goodman. 2024. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems
36 (2024).
[161] Tan Nguyen, Minh Pham, Tam Nguyen, Khai Nguyen, Stanley Osher, and Nhat Ho. 2022. Fourierformer: Transformer meets generalized fourier
integral theorem. Advances in Neural Information Processing Systems 35 (2022), 29319–29335.
[162] Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, and Bin Cui. 2023. FlexMoE: Scaling Large-scale
Sparse Pre-trained Model Training via Dynamic Device Placement. Proceedings of the ACM on Management of Data 1, 1 (2023), 1–19.

32 Huang, et al.
[163] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten
Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114
(2021).
[164] OpenAI. 2022. OpenAI: Introducing ChatGPT. https://openai.com/blog/chatgpt.
[165] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[166] OpenAI. 2023. OpenAI: GPT-4, 2023. https://openai.com/research/gpt-4.
[167] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2024. MemGPT: Towards LLMs as
Operating Systems. arXiv:2310.08560 [cs.AI]
[168] Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, and François Fleuret. 2023. Faster Causal Attention Over Large Sequences Through Sparse Flash
Attention. arXiv preprint arXiv:2306.01160 (2023).
[169] Arka Pal, Deep Karkhanis, Manley Roberts, Samuel Dooley, Arvind Sundararajan, and Siddartha Naidu. 2023. Giraffe: Adventures in expanding
context lengths in llms. arXiv preprint arXiv:2308.10882 (2023).
[170] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023. Yarn: Efficient context window extension of large language models. arXiv
preprint arXiv:2309.00071 (2023).
[171] Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A Smith, and Lingpeng Kong. 2021. Random feature attention. arXiv preprint
arXiv:2103.02143 (2021).
[172] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean.
2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems 5 (2023).
[173] Ofir Press, Noah A Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv
preprint arXiv:2108.12409 (2021).
[174] Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. 2023. From sparse to soft mixtures of experts. arXiv preprint arXiv:2308.00951
(2023).
[175] Pytorch. 2023. PyTorch 2.0. https://pytorch.org/get-started/pytorch-2.0/.
[176] Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. 2019. Blockwise self-attention for long document understanding.
[177] Markus N Rabe and Charles Staats. 2021. Self-attention Does Not Need O(n2) Memory. arXiv preprint arXiv:2112.05682 (2021).
[178] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
[179] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask
learners. OpenAI blog 1, 8 (2019), 9.
[180] Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. 2019. Compressive transformers for long-range sequence modelling.
[181] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring
the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[182] Ali Rahimi and Benjamin Recht. 2007. Random features for large-scale kernel machines. Advances in neural information processing systems 20
(2007).
[183] Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022.
Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International Conference on Machine
Learning. PMLR, 18332–18346.
[184] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models.
In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16.
[185] Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, and Yoav
Shoham. 2023. Parallel Context Windows for Large Language Models. arXiv:2212.10947 [cs.CL]
[186] Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, and Bo Dai. 2021. Combiner: Full attention transformer
with sparse computation cost. Advances in Neural Information Processing Systems 34 (2021), 22470–22482.
[187] Tobias Rohde, Xiaoxia Wu, and Yinhan Liu. 2021. Hierarchical learning for generation with long source sequences. arXiv preprint arXiv:2104.07545
(2021).
[188] Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Efficient content-based sparse attention with routing transformers.
Transactions of the Association for Computational Linguistics 9 (2021), 53–68.
[189] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al.
2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
[190] Jürgen Rudolph, Shannon Tan, and Samson Tan. 2023. War of the chatbots: Bard, Bing Chat, ChatGPT, Ernie and beyond. The new AI gold rush
and its impact on higher education. Journal of Applied Learning and Teaching 6, 1 (2023).
[191] Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, and Joel Veness. 2023. Randomized
Positional Encodings Boost Length Generalization of Transformers. arXiv preprint arXiv:2305.16843 (2023).
[192] Cicero Nogueira dos Santos, James Lee-Thorp, Isaac Noble, Chung-Ching Chang, and David Uthus. 2023. Memory Augmented Language Models
through Mixture of Word Experts. arXiv preprint arXiv:2311.10768 (2023).
[193] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint
arXiv:1508.07909 (2015).
[194] Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, and Omer Levy. 2022.
SCROLLS: Standardized CompaRison Over Long Language Sequences. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 12007–12021. https://aclanthology.org/2022.emnlp-
main.823
[195] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155
(2018).
[196] Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150 (2019).
[197] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural
networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017).
[198] Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. 2018. Bi-directional block self-attention for fast and memory-efficient
sequence modeling. arXiv preprint arXiv:1804.00857 (2018).
[199] Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, and James Tin-Yau Kwok. 2021. Sparsebert: Rethinking the importance
analysis in self-attention. In International Conference on Machine Learning. PMLR, 9547–9557.
[200] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion
Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL]
[201] Nimit S Sohoni, Christopher R Aberger, Megan Leszczynski, Jian Zhang, and Christopher Ré. 2019. Low-memory neural network training: A
technical report. arXiv preprint arXiv:1904.10631 (2019).
[202] Jianlin Su. 2023. NBCE: Handling Length in Context Expansion of LLM with Naive Bayes. https://spaces.ac.cn/archives/9617.
[203] Jianlin Su. 2023. Transformer Upgrade Roadmap: 10. RoPE is a beta-base Encoding. https://spaces.ac.cn/archives/9675.
[204] Jianlin Su. 2023. Transformer Upgrade Roadmap: 11. Taking beta-base Encoding to the Limit. https://spaces.ac.cn/archives/9706.
[205] Jianlin Su. 2023. Transformer Upgrade Roadmap: 12. ReRoPE for Infinite Extrapolation? https://spaces.ac.cn/archives/9708.
[206] Jianlin Su. 2023. Transformer Upgrade Roadmap: 7. Length Extrapolation and Local Attention. https://spaces.ac.cn/archives/9431.
[207] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. Roformer: Enhanced transformer with rotary position
embedding. arXiv preprint arXiv:2104.09864 (2021).
[208] Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. arXiv preprint
arXiv:1905.07799 (2019).
[209] Sainbayar Sukhbaatar, Da Ju, Spencer Poff, Stephen Roller, Arthur Szlam, Jason Weston, and Angela Fan. 2021. Not all memories are created equal:
Learning to forget by expiring. In International Conference on Machine Learning. PMLR, 9902–9912.
[210] Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2023. A Simple and Effective Pruning Approach for Large Language Models.
arXiv:2306.11695 [cs.CL]
[211] Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. 2022. A length-
extrapolatable transformer. arXiv preprint arXiv:2212.10554 (2022).
[212] Mingxu Tao, Yansong Feng, and Dongyan Zhao. 2023. A Frustratingly Easy Improvement for Position Embeddings via Random Padding. arXiv
preprint arXiv:2305.04859 (2023).
[213] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpaca: A
strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alpaca. html
3, 6 (2023), 7.
[214] Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020. Sparse sinkhorn attention. In International Conference on Machine Learning.
PMLR, 9438–9447.
[215] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2021.
Long Range Arena : A Benchmark for Efficient Transformers. In International Conference on Learning Representations. https://openreview.net/
forum?id=qVyeW-grC2k
[216] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient Transformers: A Survey. arXiv:2009.06732 [cs.LG]
[217] DeepSpeed Team and Rangan Majumder. 2020. DeepSpeed: Extreme-scale model training for everyone.
[218] M Therasa and G Mathivanan. 2022. Survey of Machine Reading Comprehension Models and its Evaluation Metrics. In 2022 6th International
Conference on Computing Methodologies and Communication (ICCMC). IEEE, 1006–1013.
[219] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. In
Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 10–19.
[220] Oguzhan Topsakal and Tahir Cetin Akinci. 2023. Creating large language model applications utilizing langchain: A primer on developing llm apps
fast. In International Conference on Applied Engineering and Natural Sciences, Vol. 1. 1050–1056.
[221] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric
Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[222] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava,
Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
34 Huang, et al.
[223] Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Miłoś. 2023. Focused Transformer:
Contrastive Training for Context Scaling. arXiv:2307.03170 [cs.CL]
[224] Vanna-AI. 2023. Vanna: an open-source Python RAG framework for SQL generation and related functionality. https://github.com/vanna-ai/vanna.
[225] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is
all you need. Advances in neural information processing systems 30 (2017).
[226] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio, et al. 2017. Graph attention networks. stat
1050, 20 (2017), 10–48550.
[227] Patrick von Platen. 2020. How to generate text: using different decoding methods for language generation with Transformers. https://huggingface.
co/blog/how-to-generate.
[228] Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, and Yuxiong He. 2023.
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training. arXiv preprint arXiv:2306.10209 (2023).
[229] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint
arXiv:2006.04768 (2020).
[230] Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. 2023. Augmenting Language Models with Long-Term
Memory. arXiv preprint arXiv:2306.07174 (2023).
[231] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. Minilm: Deep self-attention distillation for task-agnostic
compression of pre-trained transformers. Advances in Neural Information Processing Systems 33 (2020), 5776–5788.
[232] Geoffrey I Webb, Eamonn Keogh, and Risto Miikkulainen. 2010. Naive Bayes. Encyclopedia of machine learning 15, 1 (2010), 713–714.
[233] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler,
et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022).
[234] Michel Wermelinger. 2023. Using GitHub Copilot to solve simple programming problems. In Proceedings of the 54th ACM Technical Symposium on
Computer Science Education V. 1. 172–178.
[235] David Wingate, Mohammad Shoeybi, and Taylor Sorensen. 2022. Prompt compression and contrastive conditioning for controllability and toxicity
reduction in language models. arXiv preprint arXiv:2210.03162 (2022).
[236] BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha
Luccioni, François Yvon, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022).
[237] Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, and Zhou Yu. 2020. Memformer: A memory-augmented transformer for
sequence modeling. arXiv preprint arXiv:2010.06891 (2020).
[238] Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing transformers. arXiv preprint arXiv:2203.08913 (2022).
[239] Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. Lite transformer with long-short range attention. arXiv preprint arXiv:2004.11886
(2020).
[240] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv
preprint arXiv:2309.17453 (2023).
[241] Wen Xiao, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. 2021. PRIMERA: Pyramid-based masked sentence pre-training for multi-document
summarization. arXiv preprint arXiv:2110.08499 (2021).
[242] Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman,
Barlas Oguz, et al. 2023. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039 (2023).
[243] xrsrke. 2024. pipegoose: Large-scale 4D parallelism pre-training for ‘transformers‘. https://github.com/xrsrke/pipegoose
[244] Chang Xu, Steven R Kirk, and Samantha Jenkins. 2009. Tiling for performance tuning on different models of GPUs. In 2009 Second International
Symposium on Information Science and Engineering. IEEE, 500–504.
[245] Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhensu Chen, Xiaopeng Zhang, and Qi Tian. 2023. QA-LoRA:
Quantization-Aware Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2309.14717 (2023).
[246] Baosong Yang, Longyue Wang, Derek F Wong, Shuming Shi, and Zhaopeng Tu. 2021. Context-aware self-attention networks for natural language
processing. Neurocomputing 458 (2021), 157–169.
[247] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining
for language understanding. Advances in neural information processing systems 32 (2019).
[248] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A
dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600 (2018).
[249] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification.
In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies.
[250] Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, and Zheng Zhang. 2019. Bp-transformer: Modelling long-range context via binary partitioning.
[251] Burak Yetiştiren, Işık Özsoy, Miray Ayerdem, and Eray Tüzün. 2023. Evaluating the Code Quality of AI-Assisted Code Generation Tools: An
Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv preprint arXiv:2304.10778 (2023).
[252] Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).

[253] Yong Yu, Xiaosheng Si, Changhua Hu, and Jianxun Zhang. 2019. A review of recurrent neural networks: LSTM cells and network architectures.
Neural computation 31, 7 (2019), 1235–1270.
[254] Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. 2020. O (n) connections are expressive
enough: Universal approximability of sparse transformers. Advances in Neural Information Processing Systems 33 (2020), 13783–13794.
[255] Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8bert: Quantized 8bit bert. In 2019 Fifth Workshop on Energy Efficient
Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS). IEEE, 36–39.
[256] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan
Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. Advances in neural information processing systems 33 (2020), 17283–17297.
[257] Yury Zemlyanskiy, Joshua Ainslie, Michiel de Jong, Philip Pham, Ilya Eckstein, and Fei Sha. 2021. Readtwice: Reading very large documents with
memories. arXiv preprint arXiv:2105.04241 (2021).
[258] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b:
An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414 (2022).
[259] Haopeng Zhang, Xiao Liu, and Jiawei Zhang. 2022. Hegel: Hypergraph transformer for long document summarization. arXiv preprint arXiv:2210.04126
(2022).
[260] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization.
In International Conference on Machine Learning. PMLR, 11328–11339.
[261] Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and
Knowledge Discovery 8, 4 (2018), e1253.
[262] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin,
et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
[263] Xinrong Zhang, Yingfa Chen, Shengding Hu, Qihao Wu, Junhao Chen, Zihang Xu, Zhenning Dai, Xu Han, Shuo Wang, Zhiyuan Liu, and Maosong
Sun. 2023. InfiniteBench: 128k Long-Context Benchmark for Language Models.
[264] Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document
summarization. arXiv preprint arXiv:1905.06566 (2019).
[265] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. 2023.
Pytorch FSDP: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023).
[266] Wanjun Zhong, Lianghong Guo, Qiqi Gao, and Yanlin Wang. 2023. MemoryBank: Enhancing Large Language Models with Long-Term Memory.
[267] Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, and Mrinmaya Sachan. 2023.
RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text. arXiv preprint arXiv:2305.13304 (2023).
[268] Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. 2023. PoSE: Efficient Context Window Extension of LLMs
via Positional Skip-wise Training. arXiv preprint arXiv:2309.10400 (2023).
[269] Zhenhai Zhu and Radu Soricut. 2021. H-transformer-1d: Fast one-dimensional hierarchical attention for sequences. arXiv preprint arXiv:2107.11906
(2021).
[270] Atabay Ziyaden, Amir Yelenov, and Alexandr Pak. 2021. Long-context Transformers: A survey. In 2021 5th Scientific School Dynamics of Complex
Networks and their Applications (DCNA). 215–218. https://doi.org/10.1109/DCNA53427.2021.9587279
[271] Simiao Zuo, Xiaodong Liu, Jian Jiao, Denis Charles, Eren Manavoglu, Tuo Zhao, and Jianfeng Gao. 2022. Efficient long sequence modeling via state
space augmented transformer. arXiv preprint arXiv:2212.08136 (2022).

36 Huang, et al.
A DATASETS
Table 1. Basic information about existing datasets specific for various NLP tasks with long-text inputs.
Lengths
Task Task Types Quality
Dataset Language (kilo words) Splits Count Format
Amount
Human Model
LM MCQA ExtQA Summ Class Match Math Code OpenW MT Avg Min Max
Labeled Assisted
ArXiv + PubMed en 1 ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ 5.2 0 157.3 ✓ ✗ train/test/val 322K/13.1K/13.1K jsonl
BigPatent en 1 ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ 3.2 0.2 83.2 ✓ ✗ train/test/val 1.2M/67.1K/67.1K json
BookSum en 1 ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ 4.5 0.04 115.8 ✓ ✗ train/test/val 9.6K/1.4K/1.5K csv
CAIL2019-SCM zh 1 ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ 2.0 1.8 2.6 ✓ ✗ train/test/val 5.1K/1.5K/1.5K jsonl
ChapterBreak en 1 ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ 25.4 2.3 405.8 ✓ ✗ train 9.6K json
CNN/DailyMail en 1 ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ 0.8 0 2.9 ✓ ✗ test 312K txt
ContractNLI en 1 ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ 2.0 0.5 8.7 ✓ ✗ train/test/dev 423/123/61 json
DuLeMon zh 1 ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ 0.6 0.3 1.4 ✓ ✗ train/test/dev 25.4K/1.1K/1.1K jsonl
ECtHR en 1 ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ 2.2 0.01 51.3 ✓ ✗ train/test/dev 7.3K/3K/1.3K jsonl
GovReport en 1 ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ 43.5 0.2 1386.2 ✓ ✗ test 19.4K json
HotpotQA en 1 ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ 0.9 0.01 2.0 ✓ ✗ train/dev 90K/14.8K json
InfiniteBench en/zh 12 ✗ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✗ ✗ 71.1 0.1 560.3 ✓ ✗ test 3.9K jsonl
LCC-Python py 1 ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ 1.4 0.2 23.3 ✓ ✗ train/test/val 100K/10K/10K parquet
LEval en 20 ✗ ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✗ 9.2 2.0 137.5 ✓ ✓ test 537 jsonl
LongAlpaca en 1 ✓ ✗ ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✗ 6.7 0 32.7 ✓ ✗ train 12K json
LongBench en/zh 21 ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✗ ✗ 7.2 0.1 44.2 ✓ ✓ test 8.4K jsonl
LongChat-Lines en 1 ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ 2.6 0.6 5.6 ✓ ✗ test 700 parquet
LOT zh 4 ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✓ ✗ 0.2 0.06 0.5 ✓ ✗ train/test/dev 35.2K/2.4K/1.8K jsonl
LRA - AAN en 1 ✗ ✗ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✗ 4.7 0.02 55.5 ✓ ✗ train/test/dev 147K/17.4K/18K tsv
LRA - ListOps en 1 ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ 3 0.01 5.9 ✓ ✗ train/test/dev 96K/2K/2K tsv
MuLD en 6 ✗ ✗ ✓ ✓ ✓ ✗ ✗ ✗ ✗ ✓ 27.7 0 359.1 ✓ ✗ train/test/val 155.9K/14.4K/11.6K jsonl
MultiNews en 1 ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ 2.1 0.1 464.2 ✓ ✗ train/test/val 45.0K/5.6K/5.6K txt
Multi-Session Chat en 1 ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ 0.3 0.1 1.2 ✓ ✗ train/test/val 17.9K/2.5K/3K parquet
Nature Questions en 1 ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ 9.8 0.2 169.3 ✓ ✗ train/dev 307K/7.8K json
NewsGroups en 1 ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ 0.3 0 11.8 ✓ ✗ test 20K txt
NewsRoom en 1 ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ 0.7 0 178.5 ✓ ✗ train/test/dev 995.0K/108.9K/108.8K jsonl
OpenChat-ShareGPT4-Clean en 1 ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ 1.6 0 152.8 ✓ ✓ train 80.2K json
ProofNet en 1 ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ 0.2 0.05 0.7 ✓ ✗ test/val 186/185 jsonl
QMSum en 1 ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ 10.8 1.7 26.8 ✓ ✗ train/test/val 162/35/35 jsonl
SCROLLS en 7 ✗ ✓ ✓ ✓ ✗ ✗ ✓ ✗ ✗ ✗ 33.0 0.2 356.1 ✓ ✗ train/test/val 89.7K/17.5K/12.3K jsonl
SQuAD en 1 ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ 0.1 0.02 0.7 ✓ ✗ train/val 87.6K/10.6K parquet
SummScreen en 1 ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ 7.3 1.6 24.0 ✓ ✗ train/test/dev 22.6K/2.1K/2.1K jsonl
Synthetic-Persona-Chat en 1 ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ 0.4 0.05 0.8 ✓ ✓ train/test/val 8.9K/968/1K csv
THUCnews zh 1 ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ 0.9 0 79.5 ✓ ✗ test 836K txt
UltraChat en 1 ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ 1.0 0.03 3.6 ✓ ✓ train 1.4M jsonl
WikiQA-AlteredNumericQA en 1 ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ 4.0 0.8 11.2 ✓ ✗ test 1.8K parquet
WikiQA-FreeFormQA en 1 ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ ✗ 3.8 0.6 11.5 ✓ ✗ test 2.4K parquet
WMT14 EN-CS en/cs 1 ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓ 0.04 0 3.6 ✓ ✗ train/test/cal 1M/3K/3K sgm
XSum en 1 ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗ 0.4 0 29.2 ✓ ✗ train/test/val 204K/11.3K/11.3K summary
Note 1: We sort datasets at each row in the alphabetical order and use slash "/" to separate the multiple contents in any single cell.
Note 2: The presence of common dirty data may result in extremely short samples, thus many datasets in the table containing samples with a minimum length approaching zero.
The meta information regarding the columns in Tab. 1 is as follows:

• Language: Language information is represented using abbreviations such as en for English, zh for Chinese, and
py for Python.
• Task Amount and Types: We categorize the common NLP tasks into ten types, including language modeling
(LM), multi-choice question-answering (MCQA), extractive question-answering with information retrieval
(ExtQA), document summarization (Summ), text classification (Class), text-pair matching (Match), math problem
solving and reasoning (Math), code tasks (Code), open-ended writing (OpenW ), and machine translation (MT ).
• Lengths: Average (avg), minimum (min), and maximum (max) sample lengths are provided in kilo "words"2 for
each dataset, where "words" are defined based on sample content3 .
2 Inthe context of our study, "words" are approximately considered to be separated by spaces in English and code, while individual Chinese characters are
treated as words, to avoid the inconsistency by different tokenizers.
3 For example, if one typical sample has the prompt template like "Read this { context } , and answer the question below: { question } ", we will calculate the
number of words in both context and question part, ignoring the fixed remaining part in the template.
• Quality: Quality assessment is simply based on two dimensions: Human Labeled (labels generated by humans)
and Model Assisted (prompts or labels generated by off-the-shelf LLMs), since the lack of quantitative oracles.
• Splits: This indicates dataset partitioning, including conventional triple-split formats like train/test/val, a single
test split for evaluation, a single train split for training/finetuning, etc.
• Count: Provides statistics on the number of samples for each split (one unit "K"/"M" equals 1,000/1,000,000
samples).
• Format: Tags the file format of samples, including jsonl, json, csv, txt, tsv, parquet, and more.
B METRICS
Table 2. Some common metrics adopted for evaluation on each specific NLP task type as depicted in Appendix. A.
Metric Types
Task Types
CE/PPL BPC/BPW Acc/F1 EM ROUGE-1/-2/-L BLEU/METEOR/TER EntMent Pass@k Human/Model Judge
LM ✓ ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✓
MCQA ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✓ ✗
ExtQA ✗ ✗ ✓ ✓ ✓ ✓ ✓ ✗ ✓
Summ ✗ ✗ ✗ ✗ ✓ ✓ ✓ ✗ ✓
Class ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✗ ✗
Match ✗ ✗ ✓ ✗ ✗ ✗ ✗ ✓ ✗
Math ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✓
Code ✗ ✗ ✓ ✓ ✓ ✓ ✓ ✓ ✓
OpenW ✗ ✗ ✓ ✓ ✓ ✓ ✓ ✓ ✓
MT ✗ ✗ ✗ ✗ ✗ ✓ ✓ ✓ ✓
Note: The ✗ in the table does not imply that a specific metric cannot be applied to a task. Rather, it suggests that the metric might be less commonly used or that there could be
more suitable alternatives.
We provide a concise introduction to these metrics:

• CE/PPL (Cross-Entropy/Perplexity): CE quantifies the 𝐾𝐿 divergence between predicted distributions and
the true distribution from the training corpus. PPL measures how well a language model predicts a sequence,
simply formalized as exp(𝑙𝑜𝑠𝑠), where 𝑙𝑜𝑠𝑠 denotes the cross-entropy loss for the test set.
• BPC/BPW (Bits per Character/Bits per Word): They measure the average number of bits required to encode
characters or words, i.e. assess the efficiency of a model to compress text, simply calculated by avgT (𝑙𝑜𝑠𝑠), where
𝑇 is the number of characters or words respectively.
• Acc/F1 (Accuracy/F1 Score): Accuracy measures correct predictions in tasks with objective answers like
classification and MCQA, while F1 balances precision and recall, as a more robust accuracy score.
• EM (Exact Matching): Evaluates exact sequence matches, crucial for tasks like code completion.
• ROUGE-1/-2/-L [74]: Assess text similarity using n-grams overlapping, typically setting n=1 (unigram), 2
(bigram), and L (longest). They are widely used in tasks that EM may fail, such as summarization.
• BLEU/METEOR/TER [3]: These metrics are specific in machine translation tasks. BLEU measures the overlap
of generated and reference translations based on n-grams. METEOR evaluates translation quality by considering
various linguistic factors. TER quantifies the edit distance between the generated and reference translations.
38 Huang, et al.
• EntMent (Entity Mention) [51]: Evaluates coverage and correctness of important entities mentioned in the
generated output text, especially for summarization.
• Pass@k: Evaluates if the generated answer ranks within the top-k provided answers, commonly used in code
generation and some math tasks with multiple possible solutions.
• Human/Model Judge: Involves human or power models like GPT-4 to score text quality based on fluency,
coherence, and other subjective criteria suitable for tasks like story generation.
C BASELINES
Table 3. Basic information for some long-context models widely-used as baselines.
Model Open Source Base Main Usage Main Lang 𝑳 𝒎𝒂𝒙 (k) Param Size (B) Mem Occ (GB) Disk Occ (GB) Links
Anima-7B-100k ✓ Llama2 chat zh 100 6.7 12.6 12.6 hf | github
ChatGLM2-6B-32k ✓ GLM chat zh 32 6.2 11.7 11.6 hf | github
ChatGLM3-6B-32k ✓ GLM chat zh 32 6.2 11.7 11.6 hf | github
Chinese-Alpaca2-7B-16k ✓ Llama2 instruct zh 16 6.9 25.9 12.9 hf | github
Chinese-Llama2-7B-16k ✓ Llama2 chat zh 16 6.9 26.3 12.9 hf | github
Chinese-Mixtral ✓ Mixtral chat zh 32 46.7 175.0 87.0 hf | github
Chinese-Mixtral-Instruct ✓ Mixtral instruct zh 32 46.7 175.0 87.0 hf | github
Claude2 ✗ Claude chat en 100 ? ? ? acc | home
CodeLlama-7B ✓ Llama2 code py 16 6.7 25.6 12.6 hf | home | paper
Giraffe-13B-32k-v3 ✓ Llama2 instruct en 32 13.0 48.6 24.2 hf | github | paper
Giraffe-v2-70B-32k ✓ Llama2 instruct en 32 69.0 227.4 128.5 hf | github | paper
GPT3.5-Turbo-16k ✗ GPT3 chat en 16 ? ? ? acc | home | doc
GPT4 ✗ GPT4 chat en 8 ? ? ? acc | home | doc
GPT4-32k ✗ GPT4 chat en 32 ? ? ? acc | home | doc
GPT4-Turbo ✗ GPT4 chat en 128 ? ? ? acc | home | doc
InternLM-Chat-7B ✓ Llama2 chat en 200 6.7 12.6 12.6 hf | github
Llama2-7B-32k ✓ Llama2 chat en 32 6.7 12.6 12.6 hf | home
Llama2-7B-Instruct-32k ✓ Llama2 instruct en 32 6.7 12.6 12.6 hf | home
LLongMA2-7B-16k-flash ✓ Llama2 chat en 16 6.7 12.6 12.6 hf | paper
LongChat-v1.5-7B-32k ✓ Llama2 chat en 32 6.7 12.6 12.6 hf | github | blog
Mistral-7B-v0.1 ✓ Mistral chat en 32 7.2 28.0 13.5 hf | paper
Mistral-7B-Instruct-v0.2 ✓ Mistral instruct en 32 7.2 28.0 13.5 hf | paper
Mixtral-8x7B-v0.1 ✓ Mixtral chat en 32 46.7 175.0 87.0 hf | blog
Mixtral-8x7B-Instruct-v0.1 ✓ Mixtral instruct en 32 46.7 175.0 87.0 hf | blog
MPT-7B-Storywriter ✓ MPT gen en 65 6.6 12.4 12.4 hf | blog
NeuralChat-7B-v3.1 ✓ Mistral chat en 32 7.2 28.0 13.5 hf | blog
OpenHermes2.5-7B ✓ Mistral chat en 32 7.2 28.0 13.5 hf | github
QWen-7B ✓ QWen chat zh 32 7.7 14.4 14.4 hf | paper
Vicuna-v1.5-7B-16k ✓ Llama2 chat en 16 6.7 12.6 12.6 hf | github | blog
WizardCoder-Python-7B-v1.0 ✓ Llama2 code py 16 6.7 12.8 12.6 hf | github
WizardMath-7B-v1.1 ✓ Mistral math en 32 7.2 14.0 13.5 hf | github
XGen-7B-Instruct-8k ✓ Llama2 instruct en 8 6.7 12.6 12.6 hf | paper
Note: The rows are basically sort by the model names in the alphabetical order, and we use question mark "?" to indicate unknown information for any cell.
Some meta information about Tab. 3 is interpreted as follows:

• Open Source: Indicates whether the model is open-sourced (✓) or closed-sourced that can be accessible only
through official remote API (✗).
• Base: Specifies the base modeling structure upon which the long-context model is built.
• Main Usage: Highlights the primary usage and capability of the model, categorized as instruct for instruction-
following, code/math for code/math-related tasks, gen for text generation, and chat for general-purpose tasks
through chat-like interaction.
• Main Lang: Indicates the primary language the model can understand4 , considering natural language,
programming language, etc.
• 𝑳 𝒎𝒂𝒙 : Represents the maximum context length handled by the model, measured in tokens (one unit "k" equals
1024 tokens).
• Statistics: Provides statistics about the model, including the number of parameters, memory footprint, and disk
storage. All models are loaded with precision to float16 onto Nvidia GPU(s) without any quantization.
• Links: Includes publication links for accessing and learning more about the model, with hf indicating the
Hugging Face hub for open-sourced models and acc representing official access for closed-sourced ones.
D TOOLKITS
We offer a detailed explanation of Tab. 4 as follows:
Type: This column specifies the usage type of each toolkit, including:
• Library: Typically found as GitHub projects, these toolkits offer functional implementations of specific tasks or
algorithms.
• Framework: Usually encompass a whole systematic pipeline, consisting of multiple interconnected modules
designed to support various aspects of LLMs.
• System: Offer a complete environment that comes pre-configured with all the necessary components and settings
to facilitate the deployment of LLMs.
• Compiler: Fuse operations and compile them into optimized GPU kernels with specific programming languages
to accelerate the execution of LLMs.
Stages: We categorize the whole LLM lifecycle simply into four stages as follows:
• Pretraining: LLMs undergo unsupervised training on large-scale datasets to learn basic language modeling.
• Finetuning: LLMs are further trained in a supervised manner on full/partial parameters to adapt them to specific
tasks or align them with human values.
• Inference: Involves feeding prompts into LLMs and generating outputs iteratively using various control strategies.
• Application: Off-the-shelf and even black-box LLMs are utilized for context-aware tasks, often involving domain-
specific local documents.
Utilities: For each toolkit, we highlight diverse utilities with concise keywords to indicate core techniques w.r.t the
corresponding stages. Readers can refer to the toolkit links for more detailed information on these utilities if lost on
any keyword.
Received XX XXXX 20XX; revised XX XXXX 20XX; accepted XX XXXX 20XX
4 Modelsare typically pretrained on multi-language corpora and may be finetuned for specific languages as needed. So we choose the most suitable one
corresponding to its application objectives.
40 Huang, et al.
Table 4. The toolkits summary for enhancing LLMs efficiency and effectiveness across different stages.
Utilities for Stages

Toolkit Type
Pretraining Finetuning Inference Application
Integrate TorchRun, FSDP
CPU/GPU/ TPUs/Apple Silicon
Accelerate library DeepSpeed, Megatron-LM
Auto Device Management
Local SGD
AutoGen framework Multi-Config Inference Multi-Agent Conversation
8bit Optimizers
4bit/8bit
BitsandBytes library 8bit Matrix Multiplication QLoRA
Quantization
QLoRA
Integrate DP, PP,
1D/2D/2.5D/3D TP,
Colossal-AI library
ZERO
Auto Parallelism
DP, PP, TP,
DeepSpeed framework ZERO, Offload,
Sparse Attention Kernel
DeepSpeed-MII framework Dynamic SplitFuse
FlashAttention library Kernel-Fused Flash-Attention
TP
Optimized Architectures
HuggingFace TGI system
Continuous Batching
Quantization
Prompt Management
LangChain framework Memory Management
Agent Management
Integrate Langchain
LangChain-Chatchat framework
RAG
Integrate LoRA, QLoRA,
Llama-Factory library PPO, DPO,
Reward Modeling
DP, PP, SP, ZERO
Megatron-LM framework
Activation Checkpointing
Iteration-level scheduling
Orca system
Selective Batching
LoRA,
PEFT library Prefix-Tuning,
Prompt-Tuning
Petals framework Multi-Party Distributed Collaboration
PrivateGPT framework Private RAG
Pytorch FSDP framework Fully Sharded Data Parallel
Integrate Flash-Attention,
Pytorch SDPA function
Memory-Efficient Attention
Python API for
TensorRT-LLM library TensorRT Engines
In-flight Batching
Python API for
Triton compiler
GPU Kernels
vLLM library Paged Attention
xFormers library Memory-Efficient Attention
Note: We sort the toolkits in each row in the alphabetical order.

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

Uploaded by

Copyright:

Available Formats

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

Uploaded by

Copyright:

Available Formats

Advancing Transformer Architecture in Long-Context Large Language Models:

Manuscript submitted to ACM 1

ACM Reference Format:

Then, for the attention kernel operations

Manuscript submitted to ACM

3.1 Local Attention

3.2 Hierarchical Attention

3.3 Sparse Attention

Manuscript submitted to ACM

3.4 Approximated Attention

Manuscript submitted to ACM

3.5 IO-Aware Attention

4.1 Internal MemoryCache

computation, which we denote with a hat accent, such as 𝑋

e(𝑠) = 𝐵 T Φ(𝑠), 𝑠.𝑡 . 𝑋

has to transform back to embeddings via continuous attention [147].

4.2 External MemoryBank

5.1 Enhancing Understanding

5.2 Attention Bias

𝑃e := 𝑃 + 𝐵, 𝐵 ∈ R𝐿×𝐿 , 𝑤ℎ𝑒𝑟𝑒 𝐵𝑖 𝑗 := B (𝑖, 𝑗), ∀𝑖, 𝑗 ∈ {0, 1, . . . , 𝐿 − 1} (22)

5.3 Extended RoPE

XPOS : 𝑃𝑖,𝑗 := ⟨e k 𝑗 ⟩ = 𝛾 𝑖 − 𝑗 (qT 𝑅 𝑗 −𝑖 k), where e

Rectified Truncation : 𝑃e𝑖,𝑗 := ⟨𝑅𝛼 (𝑖,𝑗,𝑤,𝜅 ) q, k⟩, where (31)

 min{𝑖 − 𝑗, 𝑤 + 𝑖 − 𝑗𝜅−𝑤 }, 0 < 𝜅 < ∞ (Leaky ReRoPE)

1 The experimental implementation is available at: https://github.com/Strivin0311/long-llms-learning/blob/main/notebooks/flash_rerope.py.

6.1 Context Selection

6.2 Context Aggregation

6.3 Context Compression

8 EVALUATION NECESSITY & OPTIMIZATION TOOLKIT

Manuscript submitted to ACM

Manuscript submitted to ACM

Manuscript submitted to ACM

Manuscript submitted to ACM

Manuscript submitted to ACM

Manuscript submitted to ACM

The meta information regarding the columns in Tab. 1 is as follows:

We provide a concise introduction to these metrics:

Table 3. Basic information for some long-context models widely-used as baselines.

Some meta information about Tab. 3 is interpreted as follows:

Received XX XXXX 20XX; revised XX XXXX 20XX; accepted XX XXXX 20XX

Utilities for Stages

Manuscript submitted to ACM

You might also like