
Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention

Zhen Qin 1  Weigao Sun 2  Dong Li 2  Xuyang Shen 2  Weixuan Sun 2  Yiran Zhong 2

1 TapTap, 2 OpenNLPLab, Shanghai AI Lab. Correspondence to: Yiran Zhong <[email protected]>. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

arXiv:2405.17381v2 [cs.CL] 20 Jun 2024

Abstract

We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption. Due to the issue with cumulative summation operations (cumsum), previous linear attention implementations cannot achieve their theoretical advantage in a causal setting. However, this issue can be effectively solved by utilizing different attention calculation strategies to compute the different parts of attention. Specifically, we split the attention calculation into intra-blocks and inter-blocks and use conventional attention computation for intra-blocks and linear attention kernel tricks for inter-blocks. This eliminates the need for cumsum in the linear attention calculation. Furthermore, a tiling technique is adopted in both the forward and backward procedures to take full advantage of the GPU hardware. To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our Lightning Attention. We conduct rigorous testing on standard and self-collected datasets with varying model sizes and sequence lengths. TNL is notably more efficient than other language models. In addition, benchmark results indicate that TNL performs on par with state-of-the-art LLMs utilizing conventional transformer structures. The source code is released at github.com/OpenNLPLab/TransnormerLLM.

1. Introduction

Linear attention has emerged as a potentially viable alternative to conventional softmax attention over the last five years (Bahdanau et al., 2016; de Brébisson & Vincent, 2016). However, despite its promise, none of the current leading large language models (Touvron et al., 2023a;b; Zeng et al., 2022; Black et al., 2022; Almazrouei et al., 2023; Team et al., 2023; Wang & Komatsuzaki, 2021; Baichuan, 2023; Jiang et al., 2023) have adopted linear attention mechanisms. There are two possible reasons for this: 1) Inferior performance: there is a notable performance gap between existing linear attention-based models (Katharopoulos et al., 2020; Qin et al., 2022b) and state-of-the-art softmax attention-based models (Touvron et al., 2023a;b) in language modeling. 2) Slow training speed: existing linear attention models frequently struggle with slow training speeds due to the use of cumulative summation operations (cumsum) (Hua et al., 2022). As a result, these models (Hua et al., 2022) often adopt conventional attention computation during practical use, losing the theoretical advantages of linear attention.

In this paper, we address the aforementioned issues of linear attention and propose a new linear attention-based model that outperforms softmax attention-based models in terms of accuracy and efficiency in language modeling.

Training speed. We introduce Lightning Attention, the first linear attention implementation that enables linear attention to realize its theoretical computational benefits. To achieve linear computational complexity, the core idea is to leverage the "kernel trick" to accelerate the attention matrix computation, i.e., compute the product of keys and values first to circumvent the n×n query-key matrix multiplication. However, the slow cumsum operation is needed during the calculation in causal language modeling. To solve this dilemma, we apply the concept of "divide and conquer" to perform the calculation. Specifically, our attention calculation is divided into intra-blocks and inter-blocks. The conventional attention calculation is applied to intra-blocks, while the "kernel trick" is utilized for inter-blocks. We also leverage tiling techniques in both the forward and backward processes to maximize GPU hardware performance and tailor the technique used in FlashAttention (Dao et al., 2022a; Dao, 2023) to our Lightning Attention to make it IO-friendly. As a result, Lightning Attention maintains a constant training speed with increasing sequence length under fixed memory consumption, as shown in Fig. 1.


[Figure 1: four panels — TGS on 1B Models, TGS on 3B Models, Loss on 1B Models, Loss on 3B Models — comparing HGRN, TNN, LLaMA-FA2, and TNL-LA across increasing sequence lengths and over 30 billion training tokens.]

Figure 1. Training speed and accuracy comparison. We compare TNL's training speed and losses with state-of-the-art transformer models (LLaMA with FlashAttention-2) and efficient non-transformer models (HGRN (Qin et al., 2023c) and TNN (Qin et al., 2023a)). TNL achieves the lowest training losses and maintains consistent training speed regardless of sequence length.

Accuracy. As the adage goes, a good horse often needs a good spur. We propose a novel architecture, TransNormerLLM (TNL), which is specifically designed for Lightning Attention in order to enhance its performance. TNL evolves from the previous linear attention architecture TransNormer (Qin et al., 2022a) by making advanced modifications that include positional embedding, linear attention acceleration, a gating mechanism, and tensor normalization. Specifically, we use LRPE (Qin et al., 2023b) together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. A gating mechanism is utilized to smooth training, and a new tensor normalization scheme is proposed to accelerate the model while preserving its accuracy. We also implement an efficient model parallel schema for TransNormerLLM, enabling seamless deployment on large-scale clusters and facilitating expansion to even more extensive models. As shown in Fig. 1, TNL achieves the lowest training loss among the existing efficient transformer structures (Qin et al., 2023a;c) as well as SOTA transformer models (Touvron et al., 2023b).

We perform a comprehensive evaluation of Lightning Attention across a diverse range of sequence lengths to assess its accuracy and compare its computational speed and memory utilization with FlashAttention-2 (Dao, 2023). Lightning Attention exhibits a notable advantage in computational speed and memory consumption compared to its counterparts without compromising performance. We also validate our model design through a series of ablations and train models with sizes of 44M, 385M, 1B, 7B, and 15B on standard or our self-collected datasets. Benchmark results demonstrate that TNL not only matches the performance of SOTA Transformer-based LLMs but is also significantly faster.

2. Related Work

2.1. Efficient Language Modeling

New efficient model architectures are being explored to address the high time complexity of the traditional transformer structure. Four promising alternatives, including linear transformers, state space models, long convolution, and linear recurrence, are being developed to replace self-attention modules for long sequence modeling.

Linear Attention Linear attention decomposes softmax attention into the inner product of hidden representations, allowing it to use the "kernel trick", where the product of keys and values is computed first to avoid the quadratic n × n matrix. Different methods utilize various hidden representations. For example, Katharopoulos et al. (2020) use 1+elu as an activation function, Qin et al. (2022b) use the cosine function to approximate the properties of softmax, and Choromanski et al. (2021); Zheng et al. (2022; 2023) approximate softmax through theoretical approaches. Although its theoretical complexity is O(nd²), the actual computational efficiency of linear attention becomes low when used in causal attention due to the need for cumsum operations (Hua et al., 2022). Moreover, most linear attention variants still exhibit a certain performance gap compared to traditional Transformers (Katharopoulos et al., 2020; Liu et al., 2022).

State Space Model State space models are based on the state space equation for sequence modeling (Gu et al., 2022b), using special initialization (Gu et al., 2020; 2022c), diagonalization assumptions (Gupta et al., 2022), and mixed techniques (Dao et al., 2022b) to achieve performance comparable to Transformers. Due to the characteristics of the state space equation, inference can be conducted with constant complexity (Gu et al., 2022b), whereas training can be slow compared with FlashAttention.

Long Convolution Long convolution models (Qin et al., 2023a; Fu et al., 2023) utilize a kernel size equal to the input sequence length, facilitating a wider context compared to traditional convolutions. Training these models involves the Fast Fourier Transform (FFT) algorithm, reducing the computational complexity to O(n log n). However, long convolution models need to cache all historical computations for causal convolution inference, making them less


ideal for processing long sequences compared to RNNs.

Linear RNN Linear RNNs (Orvieto et al., 2023a; Qin et al., 2023c), in contrast, stand out as more suitable replacements for transformers in long-sequence modeling. A notable example is the HGRN (Qin et al., 2023c) model, a linear RNN-based LLM that has shown competitive performance against similarly scaled GPT models.

2.2. IO-aware Attention

The FlashAttention series (Dao et al., 2022a; Dao, 2023) focuses on system-level optimizations for the efficient implementation of the standard attention operator on GPU platforms. These approaches employ tiling strategies to minimize the volume of memory reads/writes between the GPU's high bandwidth memory (HBM) and on-chip SRAM. Although these methods optimize the IO communication in attention calculation and are faster than previous softmax attention implementations, their theoretical computation complexity remains O(n²d), making them unsuitable for long sequence language modeling.

3. Lightning Attention

3.1. Preliminary

We first recall the formulation of linear attention and then introduce our proposed Lightning Attention. In the case of NormAttention within TransNormer (Qin et al., 2022a), attention computation deviates from the conventional Transformer structure (Vaswani et al., 2017) by eschewing the costly softmax and scaling operations. The NormAttention mechanism can be expressed as follows:

O = Norm((QK⊤)V),  (1)

where Q, K, and V ∈ R^{n×d} are the query, key, and value matrices, respectively, with n the sequence length and d the feature dimension. The equation can be transformed into its linear variant using right matrix multiplication:

O = Norm(Q(K⊤V)).  (2)

The linear formulation enables efficient recurrent prediction with O(nd²) complexity during training. Additionally, linear attention guarantees a constant computation complexity of O(d²) regardless of the sequence length. This is achieved by recurrently updating K⊤V, eliminating the need for repeated computation of the entire attention matrix. In contrast, standard softmax attention has a complexity of O(nd²) during inference.

Nevertheless, when dealing with causal prediction tasks, the effectiveness of the right product is compromised, leading to the requirement for the computation of cumsum (Hua et al., 2022). This impediment hinders the potential for highly efficient parallel computation. In this section, we show that the requirement of cumsum can be eliminated by leveraging the concept of "divide and conquer" in the linear attention calculation. For convenience, Norm will be ignored in the subsequent discussion.

There are two computational approaches to handling the causal scenario. One is using conventional attention computation (the Left Product), which involves computing QK⊤ first. The complete calculation formula is as follows:

O = [(QK⊤) ⊙ M]V,  (3)

where M_ts = 1 if t ≥ s, otherwise 0. The complete algorithm is detailed in Algorithm 1. Note that this algorithm is parallelizable, but its time complexity is O(n²d). The other option is to compute k_t v_t⊤ first (the Right Product), which leverages a recursive formula for computation:

kv_0 = 0,  kv_t = kv_{t−1} + k_t v_t⊤,  o_t⊤ = q_t⊤ kv_t.  (4)

The complete algorithm is detailed in Algorithm 2. This algorithm has a time complexity of O(nd²), but it is not GPU-friendly, making it slower than the first approach.

Algorithm 1 Linear Attention Left Product
Input: Q, K, V ∈ R^{n×d}.
Initialize mask M ∈ R^{n×n}, where M_ts = 1, if t ≥ s, else 0.
Load Q, K, M from HBM, compute S = (QK⊤) ⊙ M, write S to HBM.
Load S, V from HBM, compute O = SV, write O to HBM.
Return O.

Algorithm 2 Linear Attention Right Product
Input: Q, K, V ∈ R^{n×d}.
Initialize kv = 0 ∈ R^{d×d}.
for t = 1, . . . , n do
  Load q_t, k_t, v_t ∈ R^{d×1} from HBM to on-chip SRAM.
  On chip, compute kv = kv + k_t v_t⊤.
  On chip, compute o_t⊤ = q_t⊤ kv.
  Write o_t⊤ to HBM as the t-th row of O.
end for
Return O.
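To make the contrast concrete, the following PyTorch sketch (our illustration, not the released kernel) implements the Left Product of Eq. (3)/Algorithm 1 and the Right Product of Eq. (4)/Algorithm 2 on small random inputs; the two outputs agree up to floating-point error, while their costs scale as O(n²d) and O(nd²), respectively.

import torch

def left_product(Q, K, V):
    # Algorithm 1: O = [(Q K^T) * M] V with a causal mask M; parallel but O(n^2 d).
    n = Q.shape[0]
    M = torch.tril(torch.ones(n, n))        # M_ts = 1 if t >= s, else 0
    return ((Q @ K.T) * M) @ V

def right_product(Q, K, V):
    # Algorithm 2: kv_t = kv_{t-1} + k_t v_t^T, o_t^T = q_t^T kv_t; O(n d^2) but sequential.
    n, d = Q.shape
    kv = torch.zeros(d, d)
    O = torch.empty(n, d)
    for t in range(n):
        kv = kv + torch.outer(K[t], V[t])   # accumulate k_t v_t^T
        O[t] = kv.T @ Q[t]                  # o_t = kv_t^T q_t
    return O

n, d = 16, 8
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
assert torch.allclose(left_product(Q, K, V), right_product(Q, K, V), atol=1e-5)

The loop over t in the right product is exactly the cumsum that prevents it from parallelizing well on GPUs.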


3.2. Linear Attention with Tiling

We use a tiling technique to compute linear attention in a causal setting. Specifically, we first divide Q, K, V into two blocks by rows:

X = [X1; X2] (stacked by rows),  X1 ∈ R^{m×d},  X2 ∈ R^{(n−m)×d},  X ∈ {Q, K, V}.

Then, by unfolding Eq. 3, we get (note that kv_0 = 0):

kv_s = kv_0 + Σ_{j=1}^{s} k_j v_j⊤,  s = 1, . . . , m,
o_s⊤ = q_s⊤ kv_s = q_s⊤ kv_0 + q_s⊤ Σ_{j=1}^{s} k_j v_j⊤.  (5)

In block form, we have:

O1 = Q1 kv_0 + [(Q1 K1⊤) ⊙ M]V1 ≜ Q1 KV0 + [(Q1 K1⊤) ⊙ M]V1.  (6)

The above formula shows that the forward causal linear attention can be divided into two parts:

• The computation within the block, [(Q1 K1⊤) ⊙ M]V1 (intra blocks), can use the Left Product;
• The computation between blocks, Q1 KV0 (inter blocks), can use the Right Product.

It is worth noting that the second block can be computed using the same idea:

kv_{m+t} = kv_m + Σ_{j=m+1}^{m+t} k_j v_j⊤,  t = 1, . . . , n − m,
o_{m+t}⊤ = q_{m+t}⊤ kv_{m+t},
O2 = Q2 kv_m + [(Q2 K2⊤) ⊙ M]V2 ≜ Q2 KV1 + [(Q2 K2⊤) ⊙ M]V2.  (7)

Note that to compute the second block, we have to use KV1 = kv_m, which can be computed by:

KV1 = KV0 + Σ_{j=1}^{m} k_j v_j⊤ = KV0 + K1⊤ V1,  (8)

where KV0 = kv_0. By using the above strategy to divide the matrix into multiple blocks, we obtain the Lightning Attention Forward Pass. A more detailed derivation can be found in Appendix C.

For the backward propagation, according to Katharopoulos et al. (2020), we can rewrite the process as:

dq_t⊤ = do_t⊤ kv_t⊤,  dk_t⊤ = v_t⊤ dkv_t⊤,  dv_t⊤ = k_t⊤ dkv_t,
dkv_{n+1} = 0 ∈ R^{d×d},  dkv_{t−1} = dkv_t + q_{t−1} do_{t−1}⊤.

Therefore, the calculation of the backward propagation is consistent with the forward Eq. 4, and the Lightning Attention Backward Pass can also be obtained using the tiling technique. A detailed proof can be found in Appendix C.

Algorithm 3 Lightning Attention Forward Pass
Input: Q, K, V ∈ R^{n×d}, block size B.
Divide X into T = n/B blocks X1, X2, . . . , XT of size B × d each, where X ∈ {Q, K, V, O}.
Initialize mask M ∈ R^{B×B}, where M_ts = 1, if t ≥ s, else 0.
Initialize KV = 0 ∈ R^{d×d}.
for t = 1, . . . , T do
  Load Qt, Kt, Vt ∈ R^{B×d} from HBM to on-chip SRAM.
  On chip, compute Ointra = [(Qt Kt⊤) ⊙ M]Vt.
  On chip, compute Ointer = Qt(KV).
  On chip, compute KV = KV + Kt⊤ Vt.
  Write Ot = Ointra + Ointer to HBM as the t-th block of O.
end for
Return O.

[Figure 2: schematic of one Lightning Attention iteration — Q, K, V stored in HBM; blocks Qt, Kt, Vt copied to SRAM; on chip, Ointra = (Qt Kt⊤ ⊙ M)Vt (intra block), Ointer = Qt · KV (inter block), Ot = Ointra + Ointer, and KV = KV + Kt⊤ Vt; Ot written back to HBM, looping over the n dimension.]

Figure 2. The structural framework of Lightning Attention is detailed in its algorithmic schematic. During the t-th iteration, the tiling blocks of matrices Qt, Kt, Vt are transferred from High Bandwidth Memory (HBM) to Static Random-Access Memory (SRAM). Within the SRAM, the outputs Ointra and Ointer are computed independently, followed by an update to the KV matrix. Subsequently, the final output Ot, which is the sum of Ointra and Ointer, is written back from SRAM to HBM.

3.3. Complexity Analysis

Theorem 3.1. The time complexity of Lightning Attention is O(nd² + nBd).¹

Proof of Theorem 3.1. For the forward pass, according to Algorithm 3, each intra part's time complexity is O(B²d), each inter part's time complexity is O(Bd²), and the time complexity of updating KV is O(Bd²), so the time complexity of each loop iteration is O(B²d + Bd²). Since we loop T = n/B times, the total time complexity is O((B²d + Bd²)n/B) = O(nd² + nBd). Because the computation of the backward pass is similar to that of the forward pass, the time complexity of the backward pass is also O(nd² + nBd).

¹ We choose B ≈ d in practice, so the time complexity is O(nd²).
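The block decomposition of Eqs. (6)-(8) can be written compactly in PyTorch; the sketch below (ours, for exposition only) mirrors Algorithm 3 but omits the HBM/SRAM management and the Triton tiling, and assumes the sequence length is a multiple of the block size B.

import torch

def lightning_forward(Q, K, V, B=32):
    # For each block t: O_t = [(Q_t K_t^T) * M] V_t   (intra, left product)
    #                       + Q_t KV                  (inter, right product),
    # where KV accumulates K_s^T V_s over all earlier blocks s < t.
    n, d = Q.shape
    assert n % B == 0, "sketch assumes n is a multiple of the block size"
    M = torch.tril(torch.ones(B, B))             # causal mask within a block
    KV = torch.zeros(d, d)                       # running state, cf. Eq. (8)
    O = torch.empty(n, d)
    for start in range(0, n, B):
        Qt, Kt, Vt = Q[start:start+B], K[start:start+B], V[start:start+B]
        O_intra = ((Qt @ Kt.T) * M) @ Vt
        O_inter = Qt @ KV
        O[start:start+B] = O_intra + O_inter
        KV = KV + Kt.T @ Vt                      # update the state for later blocks
    return O

n, d = 128, 32
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
reference = ((Q @ K.T) * torch.tril(torch.ones(n, n))) @ V   # Eq. (3), left product
assert torch.allclose(lightning_forward(Q, K, V), reference, atol=1e-3)

Each iteration costs O(B²d) for the intra part and O(Bd²) for the inter part and the state update, which is exactly the count used in the proof of Theorem 3.1.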


Algorithm 4 Lightning Attention Backward Pass
Input: Q, K, V, dO ∈ R^{n×d}, block size B.
Divide X into T = n/B blocks X1, X2, . . . , XT of size B × d each, where X ∈ {Q, K, V}.
Divide dX into T = n/B blocks dX1, dX2, . . . , dXT of size B × d each, where X ∈ {Q, K, V, O}.
Initialize mask M ∈ R^{B×B}, where M_ts = 1, if t ≥ s, else 0.
Initialize KV = 0, dKV = 0 ∈ R^{d×d}.
for t = 1, . . . , T do
  Load Kt, Vt, Ot, dOt ∈ R^{B×d} from HBM to on-chip SRAM.
  On chip, compute dQintra = [(dOt Vt⊤) ⊙ M]Kt.
  On chip, compute dQinter = dOt KV⊤.
  On chip, compute KV = KV + Kt⊤ Vt.
  Write dQt = dQintra + dQinter to HBM as the t-th block of dQ.
end for
for t = T, . . . , 1 do
  Load Qt, Kt, Vt, Ot, dOt ∈ R^{B×d} from HBM to on-chip SRAM.
  On chip, compute dKintra = [(dOt Vt⊤) ⊙ M]⊤ Qt.
  On chip, compute dKinter = Vt dKV⊤.
  On chip, compute dVintra = [(Qt Kt⊤) ⊙ M]⊤ dOt.
  On chip, compute dVinter = Kt dKV.
  On chip, compute dKV = dKV + Qt⊤ dOt.
  Write dKt = dKintra + dKinter, dVt = dVintra + dVinter to HBM as the t-th blocks of dK, dV.
end for
Return dQ, dK, dV.

3.4. Exact IO-aware Implementation

Lightning Attention employs the above tiling methodology throughout its whole computation process and leverages distinct approaches to optimize the utilization of memory bandwidth between HBM and SRAM within a GPU. Specifically, in each iteration t, matrices Qt, Kt, Vt undergo segmentation into blocks, which are subsequently transferred to SRAM for computation. The intra- and inter-block operations are segregated, with intra-blocks employing the left product and inter-blocks utilizing the right product. This approach optimally exploits the computational and memory efficiencies associated with the right product, enhancing overall execution speed. The intermediate activation KV is iteratively saved and accumulated within SRAM. Subsequently, the outputs of intra-blocks and inter-blocks are summed within SRAM, and the results are written back to HBM. The structure of Lightning Attention is illustrated in Fig. 2. The intricate details of the Lightning Attention implementation are explained through Algorithm 3 for the forward pass and Algorithm 4 for the backward pass.

4. TransNormerLLM

4.1. The Overall Structure

Our structure is based on the findings of TransNormer (Qin et al., 2022a) but has custom modifications to balance efficiency and performance. We illustrate the overall structure in Fig. 3. The input X is updated through two consecutive steps: 1) it undergoes Gated Linear Attention (GLA) with the application of SimpleRMSNorm (SRMSNorm) normalization; 2) it goes through the Simple Gated Linear Unit (SGLU) with SRMSNorm normalization. We apply Pre-norm for both modules.

[Figure 3: block diagram of a TNL layer — X → SRMSNorm → GLA (with gated inputs U, Q, K, V) → residual Add, then SRMSNorm → SGLU (with inputs U, V) → residual Add.]

Figure 3. Architecture overview of TransNormerLLM (TNL). Each transformer block is composed of a Gated Linear Attention (GLA) for token mixing and a Simple Gated Linear Unit (SGLU) for channel mixing. We apply Pre-norm for both modules.

4.2. Custom Modification

In this section, we outline the key designs and inspiration behind each custom modification, including positional encoding, gating mechanisms, and tensor normalization.

Position Encoding In TransNormer, DiagAttention is used at the lower layers to avoid dilution issues. However, this leads to a lack of global interaction between tokens. In TNL, we leverage LRPE (Qin et al., 2023b) with exponential decay (Press et al., 2022; Qin et al., 2023a; Peng et al., 2023b) to address this issue, retaining full attention at the lower layers. The expression of our position encoding is as follows:

a_ts = q_t⊤ k_s λ^{t−s} exp(iθ(t−s)),  (9)

which we call LRPE-d — Linearized Relative Positional Encoding with exponential decay. Similar to the original LRPE, we set θ to be learnable. We empirically find that rather than applying LRPE-d to every layer, applying it to the first layer and keeping other layers with exponential decay can speed up training by approximately 15-20% with only a subtle effect on performance.

Note that this position encoding is fully compatible with Linear Attention, as it can be decomposed with respect to s and t separately. The value of λ for the h-th head in the l-th layer (assuming there are a total of H heads and L layers) is given by:

λ = exp(−(8h/H) × (1 − l/L)).  (10)

Here, 8h/H corresponds to the decay rate of the h-th head, while 1 − l/L corresponds to the decay rate of the l-th layer. The term 1 − l/L ensures that the Theoretical Receptive Field (TRF) (Qin et al., 2024) at the lower layers is smaller compared to the higher layers, which aligns with TransNormer's motivation.


We choose λ to be non-learnable since we empirically found that gradients become unstable when λ is learnable, leading to NaN values. Note that this positional encoding is still compatible with Lightning Attention, with the specific algorithm detailed in Appendix A.

Gating Mechanism A gate can enhance the performance of the model and smooth the training process. In TNL, we adopt the approach from Flash (Hua et al., 2022) and use Gated Linear Attention (GLA) in token mixing:

O = Norm(QK⊤V) ⊙ U,  Q = ϕ(XWq),  K = ϕ(XWk),  V = XWv,  U = XWu.  (11)

We choose ϕ to be the Swish (Ramachandran et al., 2017) activation function, as we empirically find that it outperforms other activation functions.

To further accelerate the model, we propose Simple GLU (SGLU), which removes the activation function from the original GLU structure, as the gate itself can introduce non-linearity. Therefore, our channel mixing becomes:

O = [V ⊙ U]Wo,  V = XWv,  U = XWu.  (12)

We empirically find that not using an activation function in the GLU does not lead to any performance loss.

Tensor Normalization The original NormAttention introduced in TransNormer (Qin et al., 2022a) is as follows:

O = Norm(QK⊤V).  (13)

In TransNormerLLM, we replace the original RMSNorm with a new simple normalization function called SimpleRMSNorm, abbreviated as SRMSNorm:

SRMSNorm(x) = x / (∥x∥₂ / √d).  (14)

We empirically find that using SRMSNorm does not lead to any performance loss.

5. Experiments

We carried out thorough experiments on TNL models and Lightning Attention. We implemented our models on the Metaseq framework (Zhang et al., 2022) with PyTorch (Paszke et al., 2019). Lightning Attention was implemented in Triton (Tillet et al., 2019). All experiments were conducted on A100 80G GPU clusters. The assessment of our work is divided into three main sections: I) we evaluate the efficiency and accuracy of the Lightning Attention module; II) we benchmark our TNL models' performance on standard small-scale corpora and LLM benchmarks and compare their training and inference speeds with SOTA models; III) we provide an ablation study on the design of TNL.

5.1. Lightning Attention Evaluation

Since our Lightning Attention is an exact implementation of norm linear attention (Qin et al., 2022a), we compared the speed and memory usage between its original PyTorch implementation (named Vanilla) and our Lightning Attention. As a reference, we have also included FlashAttention-2 (Dao, 2023) (named Flash2), which is currently the SOTA implementation of softmax attention. As shown in Fig. 4, Lightning Attention shows remarkable linear growth of processing time in both the forward and backward passes, whereas Vanilla and Flash2 exhibit quadratic growth. In terms of memory footprint, Vanilla tends to rapidly exhaust memory resources. Lightning Attention shows a similar trend to Flash2 but requires less memory.

Table 1. Results on Wikitext-103 (TNN (Qin et al., 2023a)'s setting). ↓ means lower is better.

Category    Model        PPL (val)↓  PPL (test)↓  Params (M)
Attn-based  Transformer  24.40       24.78        44.65
Attn-based  FLASH        25.92       26.70        42.17
Attn-based  1+elu        27.44       28.05        44.65
Attn-based  Performer    62.50       63.16        44.65
Attn-based  cosFormer    26.53       27.06        44.65
Attn-based  TN1          24.43       25.00        44.64
Attn-based  TN2          24.50       25.05        44.64
MLP-based   Syn(D)       31.31       32.43        46.75
MLP-based   Syn(R)       33.68       34.78        44.65
MLP-based   gMLP         28.08       29.13        47.83
RNN-based   S4           38.34       39.66        45.69
RNN-based   DSS          39.39       41.07        45.73
RNN-based   GSS          29.61       30.74        43.84
RNN-based   RWKV         24.31       25.07        46.23
RNN-based   LRU          29.86       31.12        46.24
RNN-based   HGRN         24.14       24.82        46.25
FFT-based   TNN          23.98       24.67        48.68
Ours        TNL          23.46       24.03        45.45

5.2. TNL Evaluation

Performance Evaluation In Table 1, we present an evaluation across various 40M models on a standard dataset. This includes models based on attention/linear attention mechanisms (Vaswani et al., 2017; Dao et al., 2022a; Katharopoulos et al., 2020; Qin et al., 2022b;a), MLPs (Multi-Layer Perceptrons) (Tay et al., 2021; Liu et al., 2021), RNNs (Recurrent Neural Networks) (Gu et al., 2022a; Gupta et al., 2022; Mehta et al., 2022; Peng et al., 2023b; Orvieto et al., 2023b), FFTs (Fast Fourier Transforms) (Qin et al., 2023a), and our model. TNL records the lowest perplexity on the test set after training on the Wikitext-103 dataset.

We also scaled up our model to 1B and 3B parameters and compared its training loss with top-tier LLM structures such as LLaMA-FA2 (Touvron et al., 2023a; Dao, 2023), HGRN (Qin et al., 2023c), and TNN (Qin et al., 2023a). For a fair comparison, we retrain all models on the same 30B corpus and plot the training losses in Fig. 1. TNL achieved the lowest training losses at both 1B and 3B parameters.
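For reference, the TNL components of Section 4 can be combined as in the following short PyTorch sketch (ours: the module names are hypothetical, a naive masked linear attention stands in for the Lightning Attention kernel, and LRPE-d and multi-head splitting are omitted). It follows Eq. (14) for SRMSNorm, Eq. (11) for GLA with Swish gates, Eq. (12) for SGLU, and the pre-norm residual layout of Fig. 3.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SRMSNorm(nn.Module):
    # Eq. (14): SRMSNorm(x) = x / (||x||_2 / sqrt(d)); no learnable scale (the eps is ours, for stability).
    def forward(self, x):
        d = x.shape[-1]
        return x / (x.norm(dim=-1, keepdim=True) / d ** 0.5 + 1e-6)

def causal_linear_attention(q, k, v):
    # Naive stand-in for the Lightning Attention kernel: [(q k^T) * M] v with a causal mask.
    n = q.shape[-2]
    mask = torch.tril(torch.ones(n, n, device=q.device))
    return ((q @ k.transpose(-1, -2)) * mask) @ v

class GLA(nn.Module):
    # Eq. (11): O = Norm(attention(Q, K, V)) * U, with Q = phi(X Wq), K = phi(X Wk),
    # V = X Wv, U = X Wu, and phi = Swish.
    def __init__(self, d):
        super().__init__()
        self.wq, self.wk, self.wv, self.wu = (nn.Linear(d, d, bias=False) for _ in range(4))
        self.norm = SRMSNorm()

    def forward(self, x):
        q, k = F.silu(self.wq(x)), F.silu(self.wk(x))
        v, u = self.wv(x), self.wu(x)
        return self.norm(causal_linear_attention(q, k, v)) * u

class SGLU(nn.Module):
    # Eq. (12): O = [V * U] Wo with V = X Wv, U = X Wu; no activation inside the GLU.
    def __init__(self, d, hidden):
        super().__init__()
        self.wv = nn.Linear(d, hidden, bias=False)
        self.wu = nn.Linear(d, hidden, bias=False)
        self.wo = nn.Linear(hidden, d, bias=False)

    def forward(self, x):
        return self.wo(self.wv(x) * self.wu(x))

class TNLBlock(nn.Module):
    # Pre-norm residual layout of Fig. 3: x + GLA(SRMSNorm(x)), then x + SGLU(SRMSNorm(x)).
    def __init__(self, d, hidden):
        super().__init__()
        self.norm1, self.norm2 = SRMSNorm(), SRMSNorm()
        self.gla, self.sglu = GLA(d), SGLU(d, hidden)

    def forward(self, x):
        x = x + self.gla(self.norm1(x))
        return x + self.sglu(self.norm2(x))

x = torch.randn(2, 16, 64)                  # (batch, sequence length, feature dimension)
print(TNLBlock(64, 128)(x).shape)           # torch.Size([2, 16, 64])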


[Figure 4: four panels — forward-pass runtime, backward-pass runtime, forward memory footprint, backward memory footprint — for Vanilla, Flash2, and Lightning at sequence lengths from 1K to 128K; runtime in ms, memory in GB.]

Figure 4. Comparative Analysis of Speed and Memory Usage: Vanilla represents norm linear attention in PyTorch (Qin et al., 2022a), Flash2 represents FlashAttention-2. Left two sub-figures: runtime in milliseconds for the forward and backward pass across varying sequence lengths. Right two sub-figures: memory utilization (in GB) during the forward and backward pass at different sequence lengths.

[Figure 5: bar chart of inference throughput (tokens/s) for Pythia 6.9B, Baichuan 7B, Baichuan2 7B, LLaMA 7B, LLaMA2 7B, ChatGLM2 6B, ChatGLM3 6B, and TNL 7B, with TNL reaching 2819.6 tokens/s.]

Figure 5. Inference Throughput Comparison. We measure the inference throughput of various 7B LLM models on an A100 80G GPU. Batch sizes for models are chosen to optimize GPU utilization without exceeding memory limits. Each model is tested with a 512-token input prompt and can generate up to 1024 new tokens. Reported throughput is averaged over 20 attempts.

Efficiency Evaluation In Fig. 1, we present a comparative analysis of training speeds under the same corpora and hardware setups. This comparison encompasses four variants: TNL, LLaMA-FA2 (Touvron et al., 2023a; Dao, 2023), HGRN (Qin et al., 2023c), and TNN (Qin et al., 2023a). Our findings show that during both the forward and backward passes, the TGS (tokens per GPU per second) for TNL remains consistently high, while the other three models exhibit a rapid decline when the sequence length is scaled from 1K to 128K. This pattern suggests that Lightning Attention offers a significant advancement in managing extremely long sequence lengths in LLMs.

Inference Evaluation We conduct an inference throughput comparison on various 7B large language models using their standard codebases from Huggingface, as detailed in Fig. 5. TNL with Lightning Attention demonstrates a significant advantage, achieving a throughput rate up to 11× higher than transformer-structure models.

Benchmark Results In order to validate the effectiveness of TNL, we pretrain 385M, 1B, 7B, and 15B models on self-collected datasets (the details of the data are in Appendix D) and test on commonsense reasoning tasks, MMLU (Hendrycks et al., 2021), C-Eval (Huang et al., 2023), and SCROLLS (Shaham et al., 2022). For comparison, we selected several open-source models as competitors, including Transformer-based models such as OPT (Zhang et al., 2022), Pythia (Biderman et al., 2023), BLOOM (Workshop et al., 2023), GPT-Neo (Black et al., 2022), Falcon (Almazrouei et al., 2023), LLaMA (Touvron et al., 2023a;b), OpenLLAMA (Geng & Liu, 2023), Baichuan (Baichuan, 2023), ChatGLM (Zeng et al., 2022; Du et al., 2022), and the non-Transformer model RWKV (Peng et al., 2023a). It can be observed in Table 2 and Table 3 that, compared to these models, TNL remains highly competitive.

• We report BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2019), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2019), ARC easy and challenge (Clark et al., 2018), and OpenBookQA (Mihaylov et al., 2018). We report 0-shot results for all benchmarks using LM-Eval-Harness (Gao et al., 2021). All of our models achieve competitive performance compared to existing state-of-the-art LLMs, showcasing a remarkable ability to comprehend and apply commonsense reasoning.

• We report the overall results for MMLU (Hendrycks et al., 2021) and C-Eval (Huang et al., 2023). Official scripts were used for evaluating MMLU and C-Eval, with all evaluations conducted in a 5-shot setup. In comparison to top-tier open-source models available in the industry, our models have demonstrated matched performance in both English and Chinese benchmarks.

• On the SCROLLS (Shaham et al., 2022) benchmark, we assess large language models of up to 1 billion parameters pre-trained with a sequence length of 2048. We present zero-shot performance results for all benchmarks using LM-Eval-Harness (Gao et al., 2021). For generation tasks within SCROLLS, we employ greedy search with hyper-parameters top_k set to 5 and top_p set to 1. Our models consistently match or surpass the performance of existing state-of-the-art LLMs in these tasks.


Table 2. Performance Comparison on Commonsense Reasoning and Aggregated Benchmarks. For a fair comparison, we report
competing methods’ results reproduced by us using their released models. Official results are denoted in italics. PS: parameter size
(billion). T: tokens (billion). HS: HellaSwag. WG: WinoGrande.

Model  PS (B)  T (B)  BoolQ (acc)  PIQA (acc)  HS (acc_norm)  WG (acc)  ARC-e (acc)  ARC-c (acc_norm)  OBQA (acc_norm)  MMLU (acc, 5-shot)  C-Eval (acc, 5-shot)
OPT 0.35 0.30 57.74 64.58 36.69 52.49 44.02 23.89 28.20 26.02 25.71
Pythia 0.40 0.30 60.40 67.08 40.52 53.59 51.81 24.15 29.40 25.99 24.81
RWKV 0.43 - - 67.52 40.90 51.14 52.86 25.17 32.40 24.85 -
TNL 0.39 1.0 62.14 66.70 46.27 54.46 55.43 27.99 32.40 25.90 25.24
OPT 1.3 0.3 57.77 71.71 53.70 59.35 57.24 29.69 33.20 24.96 25.32
Pythia 1.4 0.3 60.73 70.67 47.18 53.51 56.99 26.88 31.40 26.55 24.25
RWKV 1.5 - - 72.36 52.48 54.62 60.48 29.44 34.00 25.77 -
Falcon 1.0 0.35 61.38 75.14 61.50 60.30 63.38 32.17 35.60 25.28 25.66
TNL 1.0 1.2 63.27 72.09 56.49 60.38 63.68 35.24 36.60 27.10 26.01
OPT 6.7 0.3 66.18 76.22 67.21 65.19 65.66 34.64 37.20 24.57 25.32
Pythia 6.9 0.3 63.46 75.14 63.92 60.77 67.34 35.41 37.00 24.64 26.40
RWKV 7.4 - - 76.06 65.51 61.01 67.80 37.46 40.20 24.96 -
Falcon 7.2 1.5 73.73 79.38 76.3 67.17 74.62 43.60 43.80 27.79 22.92
Baichuan2 7.0 2.6 72.72 76.50 72.17 68.35 75.17 42.32 39.60 54.16 54.00
ChatGLM2 7.1 1.4 77.65 69.37 50.51 57.62 59.13 34.30 37.00 45.46 52.55
OpenLLaMAv2 6.7 1.0 72.20 78.84 74.51 65.67 72.39 41.30 41.00 41.29 30.01
LLaMA1 6.7 1.0 76.50 79.80 76.10 70.10 72.80 47.60 57.20 35.10 25.72
LLaMA2 6.7 2.0 77.68 78.07 76.02 68.98 76.30 46.33 44.20 45.30 33.20
TNL 6.8 1.4 75.87 80.09 75.21 66.06 75.42 44.40 63.40 43.10 43.18
OPT 13 0.3 65.93 75.84 69.83 65.19 67.00 35.75 38.80 24.68 22.23
Pythia 12 0.3 65.72 76.17 68.85 66.22 70.62 38.23 41.00 25.51 22.99
RWKV 14 - 70.12 78.51 71.49 64.48 72.35 40.87 41.00 26.49 26.49
Baichuan2 13 2.6 79.20 77.31 75.27 70.01 77.36 47.01 43.80 57.02 59.63
OpenLLaMAv2 13 1.0 72.29 77.58 72.07 70.09 75.42 43.86 43.00 43.43 25.95
LLaMA1 13 1.0 77.95 79.16 79.06 72.61 77.40 47.70 44.80 47.62 32.13
LLaMA2 13 2.0 80.61 79.11 79.35 72.38 79.34 48.98 35.20 55.70 38.34
TNL 15 2.0 76.64 81.56 82.18 75.61 77.61 50.51 46.40 60.06 53.01

5.3. TNL Ablation

We conducted an extensive ablation analysis on various components of TNL, including positional encoding, gating mechanisms, GLA activation functions, GLU activation functions, and normalization functions.

Positional Encoding: in our experiment comparing various PE strategies—Mix, Absolute Positional Encoding (APE), LRPE, Exponential Decay, and LRPE-d—our approach and LRPE-d demonstrated superior performance. We chose the Mix method for its ability to enhance training speed by up to 20%, despite being slightly less effective than LRPE-d.

Table 4. Exploration of Positional Encoding. LRPE-d leads to the most optimal outcome.
PE Methods   Params  Updates  Loss   PPL
Mix          385M    100K     2.248  4.770
APE          386M    100K     2.387  5.253
Exp-Decay    385M    100K     2.267  4.834
LRPE         385M    100K     2.287  4.899
LRPE-d       385M    100K     2.236  4.728

We also perform ablations on the decay temperature 1 − l/L in Eq. 10. The perplexity of TNL is reduced by adding the decay temperature, as shown in Table 5.

Table 5. Ablations on decay temperature. The results with the decay temperature proved to be superior.
Temperature      Params  Updates  Loss   PPL
w/ temperature   385M    100K     2.248  4.770
w/o temperature  385M    100K     2.258  4.804

Gating Mechanism: we further investigate the impact of integrating a gating mechanism. According to the data presented in Table 6, enabling the gate decreased the loss value from 2.263 to 2.248.

Table 6. Ablations on gating mechanism. The performance with the gate proved to be superior.
Gate      Params  Updates  Loss   PPL
w/ gate   385M    100K     2.248  4.770
w/o gate  379M    100K     2.263  4.820
TNL with Lightning Attention

Table 3. Performance Comparison on SCROLLS (Shaham et al., 2022): A review of models up to 1 billion parameters on 2048
pre-training sequence length. PS: parameter size (billion). T: tokens (billion).

Model  PS (B)  T (B)  GovRep (ROUGE-1/2/L)  SumScr (ROUGE-1/2/L)  QMSum (ROUGE-1/2/L)  Qspr (F1)  Nrtv (F1)  QALT (EM)  CNLI (EM)  Avg
OPT 0.35 0.30 2.52/0.53/2.24 7.72/0.68/6.52 8.05/1.79/6.6 13.13 10.13 29.05 9.16 7.55
Pythia 0.40 0.30 4.96/1.19/4.06 2.03/0.2/1.79 7.51/1.43/6.08 15.27 8.24 28.57 15.24 7.43
RWKV 0.43 - 1.63/0.4/1.49 0.94/0.11/0.76 10.19/2.26/8.06 13.16 9.76 26.32 16.49 7.04
TNL 0.39 1.0 3.67/1.16/3.14 8.27/0.82/6.91 13.62/3.29/10.95 14.29 11.69 28.14 17.36 9.48
OPT 1.3 0.3 5.7/2.09/4.41 10.17/0.82/8.29 12.36/3.15/9.85 18.37 13.42 29.15 12.44 10.02
Pythia 1.4 0.3 4.03/1.25/3.33 8.34/0.87/6.97 13.17/3.4/10.92 16.09 11.91 28.72 9.06 9.08
Falcon 1.0 0.35 2.74/0.67/2.37 10.95/1.28/8.66 13.29/3.09/10.58 16.17 12.91 29.19 14.75 9.74
TNL 1.0 1.2 6.81/2.30/5.25 12.28/1.23/9.27 14.60/3.51/11.62 15.02 14.66 28.72 37.32 12.51

Normalization Functions: our study involved testing various normalization techniques—SRMSNorm, RMSNorm, and LayerNorm—on TNL, finding little difference in their effectiveness. However, we enhanced SRMSNorm using Triton, resulting in notable improvements in processing speed for larger dimensions.

Table 7. Exploration of Normalization Function. The deviation in results among the following normalization functions is minimal.
Norm Type  Params  Updates  Loss   PPL
SRMSNorm   385M    100K     2.248  4.770
RMSNorm    385M    100K     2.247  4.766
LayerNorm  385M    100K     2.247  4.765

GLA Activation Functions: in our study on the GLA (Gated Linear Attention) mechanism, we evaluated activation functions, finding Swish and 1+elu to perform similarly, as detailed in Table 8. However, due to NaN issues with 1+elu in our 7B model, we opted for Swish.

Table 8. Ablations on GLA activation functions. The results obtained from different activation functions were virtually identical.
GLA Act  Params  Updates  Loss   PPL
Swish    385M    100K     2.248  4.770
No Act   385M    100K     2.283  4.882
1+elu    385M    100K     2.252  4.767

GLU Activation Functions: our experiment additionally involved removing the activation function from the Gated Linear Units (GLU), showing minimal effect on outcomes as per Table 9. Therefore, we opted for the Simple Gated Linear Units (SGLU) configuration in our model.

Table 9. Ablations on GLU activation functions. The exclusion of the activation function had no negative impact on the results.
GLU Act  Params  Updates  Loss   PPL
No Act   385M    100K     2.248  4.770
Swish    385M    100K     2.254  4.788

6. Conclusion

We introduced Lightning Attention, the first linear attention implementation that unleashes the full power of linear attention. As a result, our Lightning Attention can handle various sequence lengths at a constant speed under a constant memory footprint. The main concept is to divide the calculation of attention into intra-blocks and inter-blocks, while applying distinct computation techniques to perform the calculation. A new architecture, TNL, that is tailored for Lightning Attention is presented. TNL outperforms existing efficient language models in terms of both efficiency and accuracy and achieves competitive performance compared to state-of-the-art large language models using conventional transformer architectures.

Acknowledgement

This work is partially supported by the National Key R&D Program of China (NO.2022ZD0160100). We thank Songlin Yang for the helpful discussions.

Impact Statement

The introduction of Lightning Attention and its accompanying architecture, TNL, heralds significant shifts in machine learning, particularly in language model efficiency and accessibility. By addressing the limitations of linear attention in varying sequence lengths without increasing memory consumption, this advancement democratizes access to state-of-the-art language models, potentially reducing the computational and environmental footprint of large-scale AI systems. Ethically, it underscores a move towards more sustainable AI practices, yet raises questions about the proliferation of powerful language models and their societal impacts, including concerns over privacy, misinformation, and the digital divide.


References Dao, T., Fu, D. Y., Saab, K. K., Thomas, A. W.,


Rudra, A., and Ré, C. Hungry hungry hippos: To-
Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A.,
wards language modeling with state space models.
Cojocaru, R., Debbah, M., Goffinet, E., Heslow, D., Lau-
CoRR, abs/2212.14052, 2022b. doi: 10.48550/arXiv.
nay, J., Malartic, Q., et al. Falcon-40b: an open large
2212.14052. URL https://doi.org/10.48550/
language model with state-of-the-art performance. Tech-
arXiv.2212.14052.
nical report, Technical report, Technology Innovation
Institute, 2023. de Brébisson, A. and Vincent, P. A cheap linear attention
mechanism with fast lookups and fixed-size representa-
Bahdanau, D., Cho, K., and Bengio, Y. Neural machine tions, 2016.
translation by jointly learning to align and translate, 2016.
Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., and
Baichuan. Baichuan 2: Open large-scale language models. Tang, J. Glm: General language model pretraining with
arXiv preprint arXiv:2309.10305, 2023. URL https: autoregressive blank infilling, 2022.
//arxiv.org/abs/2309.10305.
Fu, D. Y., Epstein, E. L., Nguyen, E., Thomas, A. W.,
Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., Zhang, M., Dao, T., Rudra, A., and Ré, C. Simple
O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., hardware-efficient long convolutions for sequence model-
Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., ing. CoRR, abs/2302.06646, 2023. doi: 10.48550/arXiv.
and van der Wal, O. Pythia: A suite for analyzing large 2302.06646. URL https://doi.org/10.48550/
language models across training and scaling, 2023. arXiv.2302.06646.
Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster,
Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. C., Golding, L., Hsu, J., McDonell, K., Muennighoff,
Piqa: Reasoning about physical commonsense in natural N., et al. A framework for few-shot language model
language, 2019. evaluation. Version v0. 0.1. Sept, 2021.
Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, Geng, X. and Liu, H. Openllama: An open repro-
L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, duction of llama. URL: https://github. com/openlm-
J., et al. Gpt-neox-20b: An open-source autoregressive research/open_llama, 2023.
language model. arXiv preprint arXiv:2204.06745, 2022.
Gu, A., Dao, T., Ermon, S., Rudra, A., and Re, C. Hippo:
Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, Recurrent memory with optimal polynomial projections,
X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mo- 2020.
hiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J.,
Gu, A., Goel, K., and Ré, C. Efficiently modeling long se-
and Weller, A. Rethinking attention with performers. In
quences with structured state spaces. In The International
International Conference on Learning Representations,
Conference on Learning Representations (ICLR), 2022a.
2021. URL https://openreview.net/forum?
id=Ua6zuk0WRH. Gu, A., Goel, K., and Ré, C. Efficiently modeling long
sequences with structured state spaces. In The Tenth
Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, International Conference on Learning Representations,
M., and Toutanova, K. Boolq: Exploring the surprising ICLR 2022, Virtual Event, April 25-29, 2022. OpenRe-
difficulty of natural yes/no questions, 2019. view.net, 2022b. URL https://openreview.net/
forum?id=uYLFoz1vlAC.
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A.,
Schoenick, C., and Tafjord, O. Think you have solved Gu, A., Gupta, A., Goel, K., and Ré, C. On the parameteri-
question answering? try arc, the ai2 reasoning challenge, zation and initialization of diagonal state space models,
2018. 2022c.

Dao, T. Flashattention-2: Faster attention with bet- Gupta, A., Gu, A., and Berant, J. Diagonal state spaces are
ter parallelism and work partitioning. arXiv preprint as effective as structured state spaces, 2022.
arXiv:2307.08691, 2023. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M.,
Song, D., and Steinhardt, J. Measuring massive multitask
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAt- language understanding, 2021.
tention: Fast and memory-efficient exact attention with
IO-awareness. In Advances in Neural Information Pro- Hua, W., Dai, Z., Liu, H., and Le, Q. V. Transformer quality
cessing Systems, 2022a. in linear time. arXiv preprint arXiv:2202.10447, 2022.


Huang, Y., Bai, Y., Zhu, Z., Zhang, J., Zhang, J., Su, T., Liu, Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho,
J., Lv, C., Zhang, Y., Lei, J., Fu, Y., Sun, M., and He, J. S., Cao, H., Cheng, X., Chung, M., Grella, M., GV, K. K.,
C-eval: A multi-level multi-discipline chinese evaluation He, X., Hou, H., Kazienko, P., Kocon, J., Kong, J., Kop-
suite for foundation models, 2023. tyra, B., Lau, H., Mantri, K. S. I., Mom, F., Saito, A.,
Tang, X., Wang, B., Wind, J. S., Wozniak, S., Zhang, R.,
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Zhang, Z., Zhao, Q., Zhou, P., Zhu, J., and Zhu, R.-J.
Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, Rwkv: Reinventing rnns for the transformer era, 2023a.
G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-
A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho,
T., and Sayed, W. E. Mistral 7b, 2023. S., Cao, H., Cheng, X., Chung, M., Grella, M., GV, K. K.,
He, X., Hou, H., Kazienko, P., Kocon, J., Kong, J., Kop-
Kalamkar, D., Mudigere, D., Mellempudi, N., Das, D., tyra, B., Lau, H., Mantri, K. S. I., Mom, F., Saito, A.,
Banerjee, K., Avancha, S., Vooturi, D. T., Jammala- Tang, X., Wang, B., Wind, J. S., Wozniak, S., Zhang, R.,
madaka, N., Huang, J., Yuen, H., et al. A study of Zhang, Z., Zhao, Q., Zhou, P., Zhu, J., and Zhu, R.-J.
bfloat16 for deep learning training. arXiv preprint Rwkv: Reinventing rnns for the transformer era, 2023b.
arXiv:1905.12322, 2019.
Press, O., Smith, N., and Lewis, M. Train short, test long:
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Attention with linear biases enables input length extrapo-
Transformers are rnns: Fast autoregressive transformers lation. In International Conference on Learning Represen-
with linear attention. In International Conference on tations, 2022. URL https://openreview.net/
Machine Learning, pp. 5156–5165. PMLR, 2020. forum?id=R8sQPpGCv0.
Liu, H., Dai, Z., So, D., and Le, Q. V. Pay attention to mlps.
Qin, Z., Han, X., Sun, W., Li, D., Kong, L., Barnes, N.,
Advances in Neural Information Processing Systems, 34:
and Zhong, Y. The devil in linear transformer. In Pro-
9204–9215, 2021.
ceedings of the 2022 Conference on Empirical Methods
Liu, Z., Li, D., Lu, K., Qin, Z., Sun, W., Xu, J., and Zhong, in Natural Language Processing, pp. 7025–7041, Abu
Y. Neural architecture search on efficient transformers Dhabi, United Arab Emirates, December 2022a. Associ-
and beyond. arXiv preprint arXiv:2207.13955, 2022. ation for Computational Linguistics. URL https://
aclanthology.org/2022.emnlp-main.473.
Mehta, H., Gupta, A., Cutkosky, A., and Neyshabur, B.
Long range language modeling via gated state spaces. Qin, Z., Sun, W., Deng, H., Li, D., Wei, Y., Lv, B.,
arXiv preprint arXiv:2206.13947, 2022. Yan, J., Kong, L., and Zhong, Y. cosformer: Rethink-
ing softmax in attention. In International Conference
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, on Learning Representations, 2022b. URL https:
E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., //openreview.net/forum?id=Bl8CQrx2Up4.
Venkatesh, G., et al. Mixed precision training. arXiv
preprint arXiv:1710.03740, 2017. Qin, Z., Han, X., Sun, W., He, B., Li, D., Li, D., Dai, Y.,
Kong, L., and Zhong, Y. Toeplitz neural network for se-
Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a quence modeling. In The Eleventh International Confer-
suit of armor conduct electricity? a new dataset for open ence on Learning Representations, 2023a. URL https:
book question answering, 2018. //openreview.net/forum?id=IxmWsm4xrua.
Orvieto, A., Smith, S. L., Gu, A., Fernando, A., Gulcehre, Qin, Z., Sun, W., Lu, K., Deng, H., Li, D., Han, X., Dai, Y.,
C., Pascanu, R., and De, S. Resurrecting recurrent neural Kong, L., and Zhong, Y. Linearized relative positional
networks for long sequences, 2023a. encoding. Transactions on Machine Learning Research,
2023b.
Orvieto, A., Smith, S. L., Gu, A., Fernando, A., Gülçehre,
Ç., Pascanu, R., and De, S. Resurrecting recurrent neural Qin, Z., Yang, S., and Zhong, Y. Hierarchically gated recur-
networks for long sequences. CoRR, abs/2303.06349, rent neural network for sequence modeling. In NeurIPS,
2023b. doi: 10.48550/arXiv.2303.06349. URL https: 2023c.
//doi.org/10.48550/arXiv.2303.06349.
Qin, Z., Zhong, Y., and Deng, H. Exploring transformer
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., extrapolation. In Proceedings of the AAAI Conference on
Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, Artificial Intelligence, 2024.
L., et al. Pytorch: An imperative style, high-performance
deep learning library. Advances in neural information Ramachandran, P., Zoph, B., and Le, Q. V. Searching for
processing systems, 32, 2019. activation functions, 2017.


Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
Winogrande: An adversarial winograd schema challenge L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. At-
at scale, 2019. tention is all you need. Advances in neural information
processing systems, 30, 2017.
Sap, M., Rashkin, H., Chen, D., LeBras, R., and Choi, Y.
Socialiqa: Commonsense reasoning about social interac- Wang, B. and Komatsuzaki, A. Gpt-j-6b: A 6 billion param-
tions, 2019. eter autoregressive language model, 2021.

Workshop, B., :, Scao, T. L., Fan, A., Akiki, C., Pavlick, E.,
Shaham, U., Segal, E., Ivgi, M., Efrat, A., Yoran, O., Haviv,
Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon,
A., Gupta, A., Xiong, W., Geva, M., Berant, J., et al.
F., Gallé, M., Tow, J., Rush, A. M., Biderman, S., Webson,
Scrolls: Standardized comparison over long language
A., Ammanamanchi, P. S., Wang, T., Sagot, B., Muen-
sequences. arXiv preprint arXiv:2201.03533, 2022.
nighoff, N., del Moral, A. V., Ruwase, O., Bawden, R.,
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, Bekman, S., McMillan-Major, A., Beltagy, I., Nguyen,
J., and Catanzaro, B. Megatron-lm: Training multi- H., Saulnier, L., Tan, S., Suarez, P. O., Sanh, V., Lau-
billion parameter language models using model paral- rençon, H., Jernite, Y., Launay, J., Mitchell, M., Raffel,
lelism. arXiv preprint arXiv:1909.08053, 2019. C., Gokaslan, A., Simhi, A., Soroa, A., Aji, A. F., Alfassy,
A., Rogers, A., Nitzav, A. K., Xu, C., Mou, C., Emezue,
Tay, Y., Bahri, D., Metzler, D., Juan, D.-C., Zhao, Z., and C., Klamm, C., Leong, C., van Strien, D., Adelani, D. I.,
Zheng, C. Synthesizer: Rethinking self-attention for Radev, D., Ponferrada, E. G., Levkovizh, E., Kim, E.,
transformer models. In International conference on ma- Natan, E. B., Toni, F. D., Dupont, G., Kruszewski, G.,
chine learning, pp. 10183–10192. PMLR, 2021. Pistilli, G., Elsahar, H., Benyamina, H., Tran, H., Yu,
I., Abdulmumin, I., Johnson, I., Gonzalez-Dios, I., de la
Team, M. N. et al. Introducing mpt-7b: A new standard for Rosa, J., Chim, J., Dodge, J., Zhu, J., Chang, J., Fro-
open-source, commercially usable llms, 2023. URL www. hberg, J., Tobing, J., Bhattacharjee, J., Almubarak, K.,
mosaicml. com/blog/mpt-7b. Accessed, pp. 05–05, 2023. Chen, K., Lo, K., Werra, L. V., Weber, L., Phan, L., al-
lal, L. B., Tanguy, L., Dey, M., Muñoz, M. R., Masoud,
Tillet, P., Kung, H.-T., and Cox, D. D. Triton: an inter- M., Grandury, M., Šaško, M., Huang, M., Coavoux, M.,
mediate language and compiler for tiled neural network Singh, M., Jiang, M. T.-J., Vu, M. C., Jauhar, M. A.,
computations. Proceedings of the 3rd ACM SIGPLAN Ghaleb, M., Subramani, N., Kassner, N., Khamis, N.,
International Workshop on Machine Learning and Pro- Nguyen, O., Espejel, O., de Gibert, O., Villegas, P., Hen-
gramming Languages, 2019. derson, P., Colombo, P., Amuok, P., Lhoest, Q., Har-
liman, R., Bommasani, R., López, R. L., Ribeiro, R.,
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, Osei, S., Pyysalo, S., Nagel, S., Bose, S., Muhammad,
M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., S. H., Sharma, S., Longpre, S., Nikpoor, S., Silberberg,
Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lam- S., Pai, S., Zink, S., Torrent, T. T., Schick, T., Thrush,
ple, G. Llama: Open and efficient foundation language T., Danchev, V., Nikoulina, V., Laippala, V., Lepercq,
models. arXiv preprint arXiv:2302.13971, 2023a. V., Prabhu, V., Alyafeai, Z., Talat, Z., Raja, A., Heinzer-
ling, B., Si, C., Taşar, D. E., Salesky, E., Mielke, S. J.,
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, Lee, W. Y., Sharma, A., Santilli, A., Chaffin, A., Stiegler,
A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., A., Datta, D., Szczechla, E., Chhablani, G., Wang, H.,
Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, Pandey, H., Strobelt, H., Fries, J. A., Rozen, J., Gao, L.,
M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Sutawika, L., Bari, M. S., Al-shaibani, M. S., Manica, M.,
Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, Nayak, N., Teehan, R., Albanie, S., Shen, S., Ben-David,
A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, S., Bach, S. H., Kim, T., Bers, T., Fevry, T., Neeraj, T.,
V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Thakker, U., Raunak, V., Tang, X., Yong, Z.-X., Sun,
Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Z., Brody, S., Uri, Y., Tojarieh, H., Roberts, A., Chung,
Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, H. W., Tae, J., Phang, J., Press, O., Li, C., Narayanan,
I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, D., Bourfoune, H., Casper, J., Rasley, J., Ryabinin, M.,
K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Mishra, M., Zhang, M., Shoeybi, M., Peyrounette, M.,
Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Patry, N., Tazi, N., Sanseviero, O., von Platen, P., Cor-
Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, nette, P., Lavallée, P. F., Lacroix, R., Rajbhandari, S.,
M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., Gandhi, S., Smith, S., Requena, S., Patil, S., Dettmers,
and Scialom, T. Llama 2: Open foundation and fine-tuned T., Baruwa, A., Singh, A., Cheveleva, A., Ligozat, A.-L.,
chat models, 2023b. Subramonian, A., Névéol, A., Lovering, C., Garrette, D.,


Tunuguntla, D., Reiter, E., Taktasheva, E., Voloshina, E., haylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D.,
Bogdanov, E., Winata, G. I., Schoelkopf, H., Kalo, J.-C., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer,
Novikova, J., Forde, J. Z., Clive, J., Kasai, J., Kawamura, L. Opt: Open pre-trained transformer language models,
K., Hazan, L., Carpuat, M., Clinciu, M., Kim, N., Cheng, 2022.
N., Serikov, O., Antverg, O., van der Wal, O., Zhang,
R., Zhang, R., Gehrmann, S., Mirkin, S., Pais, S., Shav- Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M.,
rina, T., Scialom, T., Yun, T., Limisiewicz, T., Rieser, V., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., et al.
Protasov, V., Mikhailov, V., Pruksachatkun, Y., Belinkov, Pytorch fsdp: experiences on scaling fully sharded data
Y., Bamberger, Z., Kasner, Z., Rueda, A., Pestana, A., parallel. arXiv preprint arXiv:2304.11277, 2023.
Feizpour, A., Khan, A., Faranak, A., Santos, A., Hevia, Zheng, L., Wang, C., and Kong, L. Linear complexity ran-
A., Unldreaj, A., Aghagol, A., Abdollahi, A., Tammour, domized self-attention mechanism. In International Con-
A., HajiHosseini, A., Behroozi, B., Ajibade, B., Saxena, ference on Machine Learning, pp. 27011–27041. PMLR,
B., Ferrandis, C. M., McDuff, D., Contractor, D., Lansky, 2022.
D., David, D., Kiela, D., Nguyen, D. A., Tan, E., Bay-
lor, E., Ozoani, E., Mirza, F., Ononiwu, F., Rezanejad, Zheng, L., Yuan, J., Wang, C., and Kong, L. Efficient
H., Jones, H., Bhattacharya, I., Solaiman, I., Sedenko, I., attention via control variates. In International Conference
Nejadgholi, I., Passmore, J., Seltzer, J., Sanz, J. B., Dutra, on Learning Representations, 2023. URL https://
L., Samagaio, M., Elbadri, M., Mieskes, M., Gerchick, openreview.net/forum?id=G-uNfHKrj46.
M., Akinlolu, M., McKenna, M., Qiu, M., Ghauri, M.,
Burynok, M., Abrar, N., Rajani, N., Elkott, N., Fahmy,
N., Samuel, O., An, R., Kromann, R., Hao, R., Alizadeh,
S., Shubber, S., Wang, S., Roy, S., Viguier, S., Le, T.,
Oyebade, T., Le, T., Yang, Y., Nguyen, Z., Kashyap,
A. R., Palasciano, A., Callahan, A., Shukla, A., Miranda-
Escalada, A., Singh, A., Beilharz, B., Wang, B., Brito, C.,
Zhou, C., Jain, C., Xu, C., Fourrier, C., Periñán, D. L.,
Molano, D., Yu, D., Manjavacas, E., Barth, F., Fuhrimann,
F., Altay, G., Bayrak, G., Burns, G., Vrabec, H. U., Bello,
I., Dash, I., Kang, J., Giorgi, J., Golde, J., Posada, J. D.,
Sivaraman, K. R., Bulchandani, L., Liu, L., Shinzato, L.,
de Bykhovetz, M. H., Takeuchi, M., Pàmies, M., Castillo,
M. A., Nezhurina, M., Sänger, M., Samwald, M., Cullan,
M., Weinberg, M., Wolf, M. D., Mihaljcic, M., Liu, M.,
Freidank, M., Kang, M., Seelam, N., Dahlberg, N., Broad,
N. M., Muellner, N., Fung, P., Haller, P., Chandrasekhar,
R., Eisenberg, R., Martin, R., Canalli, R., Su, R., Su, R.,
Cahyawijaya, S., Garda, S., Deshmukh, S. S., Mishra,
S., Kiblawi, S., Ott, S., Sang-aroonsiri, S., Kumar, S.,
Schweter, S., Bharati, S., Laud, T., Gigant, T., Kainuma,
T., Kusa, W., Labrak, Y., Bajaj, Y. S., Venkatraman, Y.,
Xu, Y., Xu, Y., Xu, Y., Tan, Z., Xie, Z., Ye, Z., Bras,
M., Belkada, Y., and Wolf, T. Bloom: A 176b-parameter
open-access multilingual language model, 2023.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y.
Hellaswag: Can a machine really finish your sentence?,
2019.

Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M.,
Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. Glm-130b:
An open bilingual pre-trained model. arXiv preprint
arXiv:2210.02414, 2022.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M.,
Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mi-

13
TNL with Lightning Attention

Appendix

A. Linear Attention with decay

TransNormerLLM uses LRPE-d positional encoding, which has the following format:

a_{ts} = q_t^⊤ k_s λ^{t−s} exp(iθ(t−s)).  (15)

According to (Qin et al., 2023b), LRPE can be decomposed into q and k, so we consider the following simplified form:

a_{ts} = q_t^⊤ k_s λ^{t−s},
o_t^⊤ = Σ_{s=1}^{t} a_{ts} v_s^⊤ = Σ_{s=1}^{t} q_t^⊤ k_s λ^{t−s} v_s^⊤ ≜ q_t^⊤ kv_t.  (16)

We call this Linear Attention with decay and prove that it is equivalent to the recurrence form

kv_0 = 0,  kv_t = λ kv_{t−1} + k_t v_t^⊤,  o_t^⊤ = q_t^⊤ kv_t.  (17)

We prove by induction that the state produced by the recurrence (17) equals kv_t = Σ_{s=1}^{t} λ^{t−s} k_s v_s^⊤ from (16).

Base case (n = 1):

kv_1 = k_1 v_1^⊤,  (18)

which is exactly one step of the recurrence starting from kv_0 = 0. Assume the statement holds for n = m − 1, i.e., the recurrence state at step m − 1 equals kv_{m−1}. Then, for n = m:

kv_m = Σ_{s=1}^{m} k_s λ^{m−s} v_s^⊤
     = λ Σ_{s=1}^{m−1} k_s λ^{m−1−s} v_s^⊤ + k_m v_m^⊤
     = λ kv_{m−1} + k_m v_m^⊤,  (19)

which is exactly the recurrence step, so the statement holds for n = m. Therefore, by induction, the statement holds for all n ≥ 1.

B. Lightning Attention with decay

We extend Lightning Attention to accommodate Linear Attention with decay. The complete algorithm can be found in Algorithms 5 and 6, and the proof of correctness is provided in Appendix C.

Algorithm 5 Lightning Attention (with decay) Forward Pass
Input: Q, K, V ∈ R^{n×d}, decay rate λ ∈ R^+, block size B.
Divide X into T = n/B blocks X_1, X_2, ..., X_T of size B × d each, where X ∈ {Q, K, V, O}.
Initialize mask M ∈ R^{B×B}, where M_{ts} = λ^{t−s} if t ≥ s, else 0.
Initialize Λ = diag{λ, λ^2, ..., λ^B} ∈ R^{B×B}.
Initialize KV = 0 ∈ R^{d×d}.
for t = 1, ..., T do
  Load Q_t, K_t, V_t ∈ R^{B×d} from HBM to on-chip SRAM.
  On chip, compute O_intra = [(Q_t K_t^⊤) ⊙ M] V_t.
  On chip, compute O_inter = Λ Q_t (KV).
  On chip, compute KV = λ^B KV + (λ^B Λ^{−1} K_t)^⊤ V_t.
  Write O_t = O_intra + O_inter to HBM as the t-th block of O.
end for
return O.

C. Proofs

Here we discuss linear attention with decay directly, because vanilla linear attention is the case of λ = 1.

C.0.1. Forward Pass

During the forward pass of Linear Attention with decay, the t-th output can be formulated as

o_t^⊤ = q_t^⊤ Σ_{s≤t} λ^{t−s} k_s v_s^⊤.  (20)

In a recursive form, the above equation can be rewritten as

kv_0 = 0 ∈ R^{d×d},  kv_t = λ kv_{t−1} + k_t v_t^⊤,  o_t^⊤ = q_t^⊤ (kv_t),  (21)

where

kv_t = Σ_{s≤t} λ^{t−s} k_s v_s^⊤.  (22)

To perform tiling, let us write the equations in block form. Given the total sequence length n and block size B, X is divided into T = n/B blocks {X_1, X_2, ..., X_T} of size B × d each, where X ∈ {Q, K, V, O}.

We first define

KV_0 = 0 ∈ R^{d×d},  KV_t = Σ_{s≤tB} λ^{tB−s} k_s v_s^⊤.  (23)

Given KV_t, the output at position tB + r of the (t+1)-th block, with 1 ≤ r ≤ B, is

o_{tB+r}^⊤ = q_{tB+r}^⊤ Σ_{s≤tB+r} λ^{tB+r−s} k_s v_s^⊤
           = q_{tB+r}^⊤ ( Σ_{s=tB+1}^{tB+r} λ^{tB+r−s} k_s v_s^⊤ + λ^r Σ_{s≤tB} λ^{tB−s} k_s v_s^⊤ )
           = q_{tB+r}^⊤ Σ_{s=tB+1}^{tB+r} λ^{tB+r−s} k_s v_s^⊤ + λ^r q_{tB+r}^⊤ KV_t.  (24)

Rewritten in matrix form, we have

O_{t+1} = [(Q_{t+1} K_{t+1}^⊤) ⊙ M] V_{t+1}  (Intra Block)  +  Λ Q_{t+1} (KV_t)  (Inter Block),  (25)

where

M_{ts} = λ^{t−s} if t ≥ s, else 0,  Λ = diag{λ, λ^2, ..., λ^B}.  (26)

The KV state at the (t+1)-th block boundary can be written as

KV_{t+1} = Σ_{s≤(t+1)B} λ^{(t+1)B−s} k_s v_s^⊤
         = λ^B Σ_{s≤tB} λ^{tB−s} k_s v_s^⊤ + Σ_{s=tB+1}^{(t+1)B} λ^{(t+1)B−s} k_s v_s^⊤
         = λ^B KV_t + (diag{λ^{B−1}, ..., 1} K_{t+1})^⊤ V_{t+1}
         = λ^B KV_t + (λ^B Λ^{−1} K_{t+1})^⊤ V_{t+1}.  (27)

The complete expression of the forward pass of Lightning Attention with decay is given in Algorithm 5.
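To make the tiling above concrete, the following is a minimal, unfused PyTorch sketch of the forward pass (single head, no batch dimension, n divisible by B). The function name lightning_forward_ref and the closing check against the masked quadratic form are ours and serve only as a reference for Algorithm 5, not as the released Triton kernel:

import torch

def lightning_forward_ref(Q, K, V, lam: float, B: int):
    """Block-wise forward pass of linear attention with decay (cf. Algorithm 5)."""
    n, d = Q.shape
    T = n // B                                  # assumes n is divisible by the block size
    idx = torch.arange(1, B + 1, dtype=Q.dtype)
    diff = idx[:, None] - idx[None, :]
    M = torch.where(diff >= 0, lam ** diff, torch.zeros_like(diff))   # M[t, s] = lam^(t-s) for t >= s
    lam_r = (lam ** idx)[:, None]               # Lambda = diag{lam, ..., lam^B}
    lam_rev = (lam ** (B - idx))[:, None]       # lam^B * Lambda^{-1} = diag{lam^(B-1), ..., 1}
    KV = torch.zeros(d, d, dtype=Q.dtype)
    O = torch.empty_like(Q)
    for t in range(T):
        Qt, Kt, Vt = (X[t * B:(t + 1) * B] for X in (Q, K, V))
        O_intra = ((Qt @ Kt.T) * M) @ Vt        # masked quadratic form inside the block
        O_inter = (lam_r * Qt) @ KV             # contribution of all previous blocks
        KV = (lam ** B) * KV + (lam_rev * Kt).T @ Vt
        O[t * B:(t + 1) * B] = O_intra + O_inter
    return O

# Check against o_t = sum_{s<=t} lam^(t-s) (q_t . k_s) v_s computed with the full decay mask.
n, d, B, lam = 64, 16, 16, 0.9
Q, K, V = (torch.randn(n, d, dtype=torch.float64) for _ in range(3))
pos = torch.arange(n, dtype=torch.float64)
mask = torch.where(pos[:, None] >= pos[None, :],
                   lam ** (pos[:, None] - pos[None, :]),
                   torch.zeros(n, n, dtype=torch.float64))
assert torch.allclose(lightning_forward_ref(Q, K, V, lam, B), ((Q @ K.T) * mask) @ V)

Each block is visited once and the only state carried across blocks is the d × d matrix KV, which is what keeps the memory footprint fixed; the actual Lightning Attention kernel performs the same computation in Triton with Q_t, K_t, V_t and KV held in on-chip SRAM.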

Algorithm 6 Lightning Attention (with decay) Backward Pass
Input: Q, K, V, dO ∈ R^{n×d}, decay rate λ ∈ R^+, block size B.
Divide X into T = n/B blocks X_1, X_2, ..., X_T of size B × d each, where X ∈ {Q, K, V}.
Divide dX into T = n/B blocks dX_1, dX_2, ..., dX_T of size B × d each, where X ∈ {Q, K, V, O}.
Initialize mask M ∈ R^{B×B}, where M_{ts} = λ^{t−s} if t ≥ s, else 0.
Initialize Λ = diag{λ, λ^2, ..., λ^B} ∈ R^{B×B}.
Initialize KV = 0, dKV = 0 ∈ R^{d×d}.
for t = 1, ..., T do
  Load K_t, V_t, O_t, dO_t ∈ R^{B×d} from HBM to on-chip SRAM.
  On chip, compute dQ_intra = [(dO_t V_t^⊤) ⊙ M] K_t.
  On chip, compute dQ_inter = Λ dO_t (KV)^⊤.
  On chip, compute KV = λ^B KV + (λ^B Λ^{−1} K_t)^⊤ V_t.
  Write dQ_t = dQ_intra + dQ_inter to HBM as the t-th block of dQ.
end for
for t = T, ..., 1 do
  Load Q_t, K_t, V_t, O_t, dO_t ∈ R^{B×d} from HBM to on-chip SRAM.
  On chip, compute dK_intra = [(dO_t V_t^⊤) ⊙ M]^⊤ Q_t.
  On chip, compute dK_inter = (λ^B Λ^{−1} V_t)(dKV)^⊤.
  On chip, compute dV_intra = [(Q_t K_t^⊤) ⊙ M]^⊤ dO_t.
  On chip, compute dV_inter = (λ^B Λ^{−1} K_t) dKV.
  On chip, compute dKV = λ^B dKV + (Λ Q_t)^⊤ dO_t.
  Write dK_t = dK_intra + dK_inter and dV_t = dV_intra + dV_inter to HBM as the t-th blocks of dK and dV.
end for
return dQ, dK, dV.

C.0.2. Backward Pass

For the backward pass, let us consider the reverse process. Given do_t, we have

dq_t^⊤ = do_t^⊤ kv_t^⊤ ∈ R^{1×d},
dk_t^⊤ = v_t^⊤ dkv_t^⊤ ∈ R^{1×d},
dv_t^⊤ = k_t^⊤ dkv_t ∈ R^{1×d},  (28)
dkv_t = Σ_{s≥t} λ^{s−t} q_s do_s^⊤ ∈ R^{d×d}.

By writing dkv_t in a recursive form, we get

dkv_{n+1} = 0 ∈ R^{d×d},  dkv_{t−1} = λ dkv_t + q_{t−1} do_{t−1}^⊤.  (29)

To facilitate the understanding of tiling, let us consider the above equations in block form. Given the total sequence length n and block size B, X is divided into T = n/B blocks {X_1, X_2, ..., X_T} of size B × d each, where X ∈ {Q, K, V, O, dO}.

We first define

dKV_{T+1} = 0 ∈ R^{d×d},  dKV_t = Σ_{s>tB} λ^{s−tB} q_s do_s^⊤.  (30)

Then for position tB + r of the (t+1)-th block, with 1 ≤ r ≤ B, we have

dq_{tB+r}^⊤ = do_{tB+r}^⊤ Σ_{s≤tB+r} λ^{tB+r−s} v_s k_s^⊤
            = do_{tB+r}^⊤ ( Σ_{s=tB+1}^{tB+r} λ^{tB+r−s} v_s k_s^⊤ + λ^r Σ_{s≤tB} λ^{tB−s} v_s k_s^⊤ )
            = do_{tB+r}^⊤ Σ_{s=tB+1}^{tB+r} λ^{tB+r−s} v_s k_s^⊤ + λ^r do_{tB+r}^⊤ KV_t^⊤.  (31)

In matrix form, we have

dQ_{t+1} = [(dO_{t+1} V_{t+1}^⊤) ⊙ M] K_{t+1}  (Intra Block)  +  Λ dO_{t+1} (KV_t^⊤)  (Inter Block).  (32)

Since the recursion of dK_t steps from t + 1 to t, given dKV_{t+1}, the entry of dK at position (t − 1)B + r of the t-th block, with 0 < r ≤ B, is

dk_{(t−1)B+r}^⊤ = v_{(t−1)B+r}^⊤ Σ_{s≥(t−1)B+r} λ^{s−(t−1)B−r} do_s q_s^⊤
               = v_{(t−1)B+r}^⊤ ( Σ_{s=(t−1)B+r}^{tB} λ^{s−(t−1)B−r} do_s q_s^⊤ ) + v_{(t−1)B+r}^⊤ λ^{B−r} ( Σ_{s>tB} λ^{s−tB} do_s q_s^⊤ )
               = v_{(t−1)B+r}^⊤ Σ_{s=(t−1)B+r}^{tB} λ^{s−(t−1)B−r} do_s q_s^⊤ + λ^{B−r} v_{(t−1)B+r}^⊤ dKV_t^⊤.  (33)

In matrix form, we get

dK_t = [(dO_t V_t^⊤) ⊙ M]^⊤ Q_t  (Intra Block)  +  λ^B Λ^{−1} V_t (dKV_t^⊤)  (Inter Block).  (34)

Considering dV for the t-th block, i.e., at positions (t − 1)B + r with 0 < r ≤ B, we have

dv_{(t−1)B+r}^⊤ = k_{(t−1)B+r}^⊤ Σ_{s≥(t−1)B+r} λ^{s−(t−1)B−r} q_s do_s^⊤
               = k_{(t−1)B+r}^⊤ ( Σ_{s=(t−1)B+r}^{tB} λ^{s−(t−1)B−r} q_s do_s^⊤ ) + k_{(t−1)B+r}^⊤ λ^{B−r} ( Σ_{s>tB} λ^{s−tB} q_s do_s^⊤ )
               = k_{(t−1)B+r}^⊤ Σ_{s=(t−1)B+r}^{tB} λ^{s−(t−1)B−r} q_s do_s^⊤ + λ^{B−r} k_{(t−1)B+r}^⊤ dKV_t.  (35)

In matrix form, we get

dV_t = [(Q_t K_t^⊤) ⊙ M]^⊤ dO_t  (Intra Block)  +  λ^B Λ^{−1} K_t (dKV_t)  (Inter Block).  (36)

Finally, the recursive relation for dKV_t is

dKV_t = Σ_{s>tB} λ^{s−tB} q_s do_s^⊤
      = λ^B Σ_{s>(t+1)B} λ^{s−(t+1)B} q_s do_s^⊤ + Σ_{s=tB+1}^{(t+1)B} λ^{s−tB} q_s do_s^⊤
      = λ^B dKV_{t+1} + (Λ Q_{t+1})^⊤ dO_{t+1}.  (37)

Algorithm 6 describes the backward pass of Lightning Attention with decay in more detail.
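The backward pass of Algorithm 6 can be prototyped in the same style before committing to a fused kernel. The sketch below mirrors the two scans of the algorithm (a left-to-right scan carrying KV for dQ, and a right-to-left scan carrying dKV for dK and dV); the function name lightning_backward_ref is ours, and the outputs can be checked against torch.autograd.grad applied to the masked quadratic form used in the forward sketch:

import torch

def lightning_backward_ref(Q, K, V, dO, lam: float, B: int):
    """Block-wise backward pass of linear attention with decay (cf. Algorithm 6)."""
    n, d = Q.shape
    T = n // B
    idx = torch.arange(1, B + 1, dtype=Q.dtype)
    diff = idx[:, None] - idx[None, :]
    M = torch.where(diff >= 0, lam ** diff, torch.zeros_like(diff))
    lam_r = (lam ** idx)[:, None]               # diag{lam, ..., lam^B}
    lam_rev = (lam ** (B - idx))[:, None]       # diag{lam^(B-1), ..., 1}
    dQ, dK, dV = torch.empty_like(Q), torch.empty_like(K), torch.empty_like(V)
    KV = torch.zeros(d, d, dtype=Q.dtype)
    for t in range(T):                          # forward scan: dQ needs the past
        s = slice(t * B, (t + 1) * B)
        Kt, Vt, dOt = K[s], V[s], dO[s]
        dQ[s] = ((dOt @ Vt.T) * M) @ Kt + (lam_r * dOt) @ KV.T
        KV = (lam ** B) * KV + (lam_rev * Kt).T @ Vt
    dKV = torch.zeros(d, d, dtype=Q.dtype)
    for t in reversed(range(T)):                # reverse scan: dK and dV need the future
        s = slice(t * B, (t + 1) * B)
        Qt, Kt, Vt, dOt = Q[s], K[s], V[s], dO[s]
        dK[s] = ((dOt @ Vt.T) * M).T @ Qt + (lam_rev * Vt) @ dKV.T
        dV[s] = ((Qt @ Kt.T) * M).T @ dOt + (lam_rev * Kt) @ dKV
        dKV = (lam ** B) * dKV + (lam_r * Qt).T @ dOt
    return dQ, dK, dV

As in the forward case, only the two d × d states KV and dKV cross block boundaries, so both scans run in time linear in the sequence length with a fixed amount of carried state.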

D. Corpus

We gather an extensive corpus of publicly accessible text from the internet, totaling over 700 TB. The collected data are processed by our data preprocessing procedure, as shown in Fig. 6, leaving a 6 TB cleaned corpus with roughly 2 trillion tokens. We categorize our data sources to provide better transparency and understanding. The specifics of these categories are outlined in Table 10.

Figure 6. Data Preprocess Procedure. (Diagram: academic writings, books, code, and web data flow through rule-based filtering and deduplication, then through the self-clean scheme of model-based filtering, human evaluation, and an evaluation model, repeated several times to produce the training data.) The collected data undergoes a process of rule-based filtering and deduplication, followed by our self-clean data processing strategy: model-based filtering, human evaluation, and evaluation model. After several iterations of the above cycle, we obtain high-quality training data at around 2T tokens.

D.1. Data Preprocessing

Our data preprocessing procedure consists of three steps: 1) rule-based filtering, 2) deduplication, and 3) a self-cleaning scheme. Before being added to the training corpus, the cleaned corpus is evaluated by humans.

Rule-based filtering  The rules we use to filter the collected data are listed as follows:

• Removal of HTML Tags and URLs: The initial step in our process is the elimination of HTML tags and web URLs from the text. This is achieved through regular expression techniques that identify these patterns and remove them, ensuring the language model focuses on meaningful textual content.

• Elimination of Useless or Abnormal Strings: Subsequently, the cleaned dataset undergoes a second layer of refinement where strings that do not provide value, such as aberrant strings or garbled text, are identified and excised. This process relies on predefined rules that categorize certain string patterns as non-contributing elements.

• Deduplication of Punctuation Marks: We address the problem of redundant punctuation marks in the data. Multiple consecutive punctuation marks can distort the natural flow and structure of sentences when training the model. We employ a rule-based system that trims these duplications down to a single instance of each punctuation mark.

• Handling Special Characters: Unusual or special characters that are not commonly part of the language's text corpus are identified and either removed or replaced with a standardized representation.

• Number Standardization: Numerical figures may be presented in various formats across different texts. These numbers are standardized into a common format to maintain consistency.

• Preservation of Markdown/LaTeX Formats: While removing non-textual elements, exceptions are made for texts in Markdown and LaTeX formats. Given their structured nature and ubiquitous use in academia and documentation, preserving these formats can enhance the model's ability to understand and generate similarly formatted text.

Deduplication  To ensure the uniqueness of our data and avert the risk of overfitting, we employ an efficient deduplication strategy at the document or line level using MinHash and Locality-Sensitive Hashing (LSH) algorithms. This combination of MinHash and LSH balances computational efficiency and accuracy in the deduplication process, providing a robust mechanism for data deduplication and text watermark removal.

Self-cleaning scheme  Our data self-cleaning process involves an iterative loop of the following three steps to continuously refine and enhance the quality of our dataset. An issue with using model-based data filters is that the filtered data will have a distribution similar to that of the evaluation model, which may significantly reduce the diversity of the training data. Assuming that the majority of the pre-processed data is of high quality, we can train an evaluation model on the entire set of pre-processed data; the model then automatically smooths the data manifold distribution and filters out low-quality data while retaining most of the diversity.

The self-cleaning scheme unfolds as follows:

• Evaluation Model: We train a 385M model on the pre-processed corpus to act as a data quality filter.

• Model-Based Data Filtering: We use the evaluation model to assess each piece of data with perplexity. Only data achieving a score above a certain threshold is preserved for the next step; low-quality data are weeded out at this stage.

• Human Evaluation: We sample a small portion of the filtered data and manually evaluate its quality.

These steps are repeated in cycles, with each iteration improving the overall quality of the data and ensuring that the resulting model is trained on relevant, high-quality text. This self-cleaning process provides a robust mechanism for maintaining data integrity, thereby enhancing the performance of the resulting language model.
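The deduplication and model-based filtering steps above are described at the level of design choices; the two sketches below are rough illustrations of how such steps can be realized, not descriptions of the actual pipeline. The first uses the datasketch library for MinHash plus LSH near-duplicate detection; the shingle size, number of permutations, and Jaccard threshold are illustrative values:

from datasketch import MinHash, MinHashLSH

def shingles(text: str, k: int = 5):
    """Character k-shingles; word n-grams work as well."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for sh in shingles(text):
        m.update(sh.encode("utf-8"))
    return m

# Keep a document only if no previously kept document is near-identical.
lsh = MinHashLSH(threshold=0.8, num_perm=128)   # Jaccard threshold: illustrative choice
kept = []
for doc_id, text in enumerate(["example document one", "example document one!", "something else"]):
    sig = minhash(text)
    if lsh.query(sig):                           # non-empty result: a near-duplicate is already indexed
        continue
    lsh.insert(str(doc_id), sig)
    kept.append(doc_id)
print(kept)                                      # the near-duplicate second document is dropped

The second sketch scores each document with the perplexity of a small causal language model standing in for the 385M evaluation model; the checkpoint path and the cutoff are placeholders, since the paper does not report the actual scoring threshold:

import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder checkpoint standing in for the 385M evaluation model trained on the corpus.
tokenizer = AutoTokenizer.from_pretrained("path/to/eval-model-385m")
model = AutoModelForCausalLM.from_pretrained("path/to/eval-model-385m").eval()

@torch.no_grad()
def perplexity(text: str, max_len: int = 2048) -> float:
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_len).input_ids
    loss = model(ids, labels=ids).loss           # mean token-level cross-entropy
    return math.exp(loss.item())

PPL_CUTOFF = 30.0                                # illustrative cutoff, not the paper's value
def keep(text: str) -> bool:
    return perplexity(text) < PPL_CUTOFF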

Table 10. Statistics of our corpus. For each category, we list the number of epochs performed on the subset when training on the 2 trillion tokens, as well as the number of tokens and disk sizes. The lower table breaks the corpus down by language.

Dataset             Epochs   Tokens   Disk size
Academic Writings   1.53     200 B    672 GB
Books               2.49     198 B    723 GB
Code                0.44     689 B    1.4 TB
Encyclopedia        1.51     5 B      18 GB
Filtered Webpages   1.00     882 B    3.1 TB
Others              0.63     52 B     154 GB
Total               -        2026 B   6 TB

Language   Tokens   Disk size
English    743 B    2.9 TB
Chinese    555 B    1.7 TB
Code       689 B    1.4 TB
Others     39 B     89 GB
Total      2026 B   6 TB

D.2. Tokenization

We tokenize the data with the Byte-Pair Encoding (BPE) algorithm. Notably, to enhance compatibility with Chinese-language content, a significant number of common and uncommon Chinese characters have been incorporated into our vocabulary. In cases where vocabulary items are not present in the dictionary, the words are broken down into their constituent UTF-8 characters. This strategy ensures comprehensive coverage and flexibility for diverse linguistic input during model training.

E. Distributed System Optimization

We optimize our system to execute large-scale pre-training for TNL effectively. We employ fully sharded data parallelism (FSDP) (Zhao et al., 2023), activation checkpointing (Shoeybi et al., 2019), and automatic mixed precision (AMP) (Micikevicius et al., 2017) to reduce the memory footprint and expedite computation, and we use BFloat16 (Kalamkar et al., 2019) to enhance training stability. We also implement model parallelism tailored to Lightning Attention. Inspired by Megatron-LM (Shoeybi et al., 2019) model parallelism, which handles the self-attention and MLP blocks independently, we apply model parallelism to SGLU and GLA separately. The details of our model parallelism strategies are elaborated below.

SGLU Model Parallelism  Recall the SGLU structure in (12):

O = [(X W_v) ⊙ (X W_u)] W_o.  (38)

Its model parallelism adaptation is

[O'_1, O'_2] = X[W_{v1}, W_{v2}] ⊙ X[W_{u1}, W_{u2}] = [X W_{v1}, X W_{v2}] ⊙ [X W_{u1}, X W_{u2}],  (39)

which splits the weight matrices W_v and W_u along their columns and obtains an output matrix that is also split along its columns. The split output [O'_1, O'_2] is then multiplied by a second weight matrix that is split along its rows:

O = [O'_1, O'_2][W_{o1}, W_{o2}]^⊤ = O'_1 W_{o1} + O'_2 W_{o2}.  (40)

Similar to model parallelism in Megatron-LM, this procedure splits the three general matrix multiplies (GEMMs) inside the SGLU block across multiple GPUs and introduces only a single all-reduce collective communication operation in each of the forward and backward passes.

GLA Model Parallelism  Recall the GLA block in (11); its model parallelism version is

[O_1, O_2] = SRMSNorm(Q K^⊤ V) ⊙ U,  (41)

where

Q = [ϕ(X W_{q1}), ϕ(X W_{q2})],  K = [ϕ(X W_{k1}), ϕ(X W_{k2})],  V = X[W_{v1}, W_{v2}],  U = X[W_{u1}, W_{u2}].  (42)

Note that in our implementation we use a combined QKVU projection to improve computation efficiency for linear attention. The resulting split output matrix [O_1, O_2] is again multiplied by a weight matrix split along its rows, similar to (40).

F. Additional TNL Ablation

Transformer vs. TNL  We carried out a series of comparative experiments between TNL and Transformer across different model sizes; the results are shown in Table 11. Under identical configurations, TNL consistently outperforms Transformer: at 385M parameters, TNL improves over Transformer by 5%, and at 1B parameters the advantage grows to 9%.

Table 11. Transformer vs TNL. TNL performs better than Transformer at sizes of 385M and 1B under identical configurations, by 5% and 9%, respectively.

Method             Updates   Loss    PPL
Transformer-385M   100K      2.362   5.160
TNL-385M           100K      2.248   4.770
Transformer-1B     100K      2.061   4.765
TNL-1B             100K      1.896   3.729

Table 12. TransNormer vs TNL. TNL performs better than TransNormer.

Method           Params   Updates   Loss    PPL
TNL              385M     100K      2.248   4.770
TransNormer-T1   379M     100K      2.290   4.910
TransNormer-T2   379M     100K      2.274   4.858

We also compare the original TransNormer with the improved TNL; the results are shown in Table 12. TNL improves over TransNormer-T1 and TransNormer-T2 by 2% and 1%, respectively.

Speed of Normalization Functions  We enhanced SRMSNorm using Triton, resulting in notable improvements in processing speed for larger dimensions, as shown in Fig. 7, outperforming the conventional PyTorch implementation.
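Three short sketches illustrate the pieces discussed above; each is an illustration under stated assumptions rather than the actual implementation. First, the tokenizer toolkit is not named in the appendix; one way to obtain a BPE vocabulary with the byte-level fallback described in D.2 is SentencePiece, with file paths, vocabulary size, and coverage below being illustrative choices:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_sample.txt",        # placeholder path to a text sample
    model_prefix="tnl_bpe",           # hypothetical output prefix
    model_type="bpe",
    vocab_size=64000,                 # illustrative vocabulary size
    character_coverage=0.9995,        # keeps common and many uncommon Chinese characters
    byte_fallback=True,               # out-of-vocabulary pieces decompose into UTF-8 bytes
)

sp = spm.SentencePieceProcessor(model_file="tnl_bpe.model")
print(sp.encode("Lightning Attention 支持中文", out_type=str))

Second, the column/row split of Eqs. (39)-(40) can be verified numerically on a single device: shard W_v and W_u by columns and W_o by rows, form the two partial outputs, and sum them, which is exactly the quantity combined by the single all-reduce (the GLA projections in Eq. (42) follow the same column-split pattern); the sizes are illustrative:

import torch

torch.manual_seed(0)
d, h, n = 64, 256, 8
X = torch.randn(n, d, dtype=torch.float64)
Wv, Wu, Wo = (torch.randn(*s, dtype=torch.float64) for s in ((d, h), (d, h), (h, d)))

O_full = ((X @ Wv) * (X @ Wu)) @ Wo              # unsharded SGLU: O = [(X Wv) * (X Wu)] Wo

Wv1, Wv2 = Wv.chunk(2, dim=1)                    # column split, Eq. (39)
Wu1, Wu2 = Wu.chunk(2, dim=1)
Wo1, Wo2 = Wo.chunk(2, dim=0)                    # row split, Eq. (40)

O1 = ((X @ Wv1) * (X @ Wu1)) @ Wo1               # partial output on "GPU 0"
O2 = ((X @ Wv2) * (X @ Wu2)) @ Wo2               # partial output on "GPU 1"
assert torch.allclose(O_full, O1 + O2)           # the sum is what the single all-reduce produces

Third, a plain PyTorch reference for SRMSNorm, the baseline against which the fused Triton kernel in Fig. 7 is benchmarked; this assumes SRMSNorm is RMS normalization without a learnable gain, and the epsilon is an illustrative choice:

import torch
from torch import nn

class SRMSNorm(nn.Module):
    """Simple RMSNorm: normalize by the root-mean-square of the features, with no learnable gain."""
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps                             # illustrative epsilon

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms

x = torch.randn(2, 4096, 3072)                     # (batch, sequence, feature), as in the benchmark
print(SRMSNorm()(x).shape)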

Figure 7. Performance Evaluation of SRMSNorm Implementation. (Four panels compare the runtime of the Triton-based and PyTorch SRMSNorm implementations for the forward and backward passes.) The upper figures exhibit the runtime comparison of the forward pass (left section) and backward pass (right section) for different sequence lengths, with a fixed feature dimension of 3072. The lower two figures illustrate the runtime comparison for various feature dimensions, with a fixed sequence length of 4096.
