Lightning Attention

Zhen Qin¹  Weigao Sun²  Dong Li²  Xuyang Shen²  Weixuan Sun²  Yiran Zhong²
[Figure 1: left panels plot training speed against sequence length (1K–128K); right panels plot training loss against billions of tokens, for TNL-LA, LLaMA-FA2, HGRN, and TNN.]
Figure 1. Training speed and accuracy comparison. We compare TNL’s training speed and losses with state-of-the-art transformer
models (LLaMA with FlashAttention-2) and efficient non-transformer models (HGRN (Qin et al., 2023c) and TNN (Qin et al., 2023a)).
TNL achieves the lowest training losses and maintains consistent training speed regardless of sequence length.
good spur. We propose a novel architecture, TransNormerLLM (TNL), which is specifically designed for Lightning Attention in order to enhance its performance. TNL evolves from the previous linear attention architecture TransNormer (Qin et al., 2022a) through advanced modifications that include positional embedding, linear attention acceleration, a gating mechanism, and tensor normalization. Specifically, we use LRPE (Qin et al., 2023b) together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. A gating mechanism is utilized to smooth training, and a new tensor normalization scheme is proposed to accelerate the model while preserving its accuracy. We also implement an efficient model parallel schema for TransNormerLLM, enabling seamless deployment on large-scale clusters and facilitating expansion to even more extensive models. As shown in Fig. 1, TNL achieves the lowest training loss among the existing efficient transformer structures (Qin et al., 2023a;c) as well as SOTA transformer models (Touvron et al., 2023b).

We perform a comprehensive evaluation of Lightning Attention across a diverse range of sequence lengths to assess its accuracy and compare its computational speed and memory utilization with FlashAttention-2 (Dao, 2023). Lightning Attention exhibits a notable advantage in computational speed and memory consumption compared to its counterparts without compromising performance. We also validate our model design through a series of ablations and train models with sizes of 44M, 385M, 1B, 7B, and 15B on standard or our self-collected datasets. Benchmark results demonstrate that TNL not only matches the performance of SOTA Transformer-based LLMs but is also significantly faster.

2. Related Work

2.1. Efficient Language Modeling

New efficient model architectures are being explored to address the high time complexity of the traditional transformer structure. Four promising alternatives, namely linear transformers, state space models, long convolution, and linear recurrence, are being developed to replace self-attention modules for long sequence modeling.

Linear Attention. Linear attention decomposes softmax attention into the inner product of hidden representations, allowing it to use the "kernel trick", in which the product of keys and values is computed first to avoid the quadratic n × n matrix. Different methods utilize various hidden representations. For example, Katharopoulos et al. (2020) use 1+elu as an activation function, Qin et al. (2022b) use the cosine function to approximate the properties of softmax, and Choromanski et al. (2021); Zheng et al. (2022; 2023) approximate softmax through theoretical approaches. Although its theoretical complexity is O(nd²), the actual computational efficiency of linear attention becomes low when used in causal attention due to the need for cumsum operations (Hua et al., 2022). Moreover, most linear attention variants still exhibit a certain performance gap compared to traditional Transformers (Katharopoulos et al., 2020; Liu et al., 2022).

State Space Model. The State Space Model is based on the state space equation for sequence modeling (Gu et al., 2022b), using special initialization (Gu et al., 2020; 2022c), diagonalization assumptions (Gupta et al., 2022), and mixed techniques (Dao et al., 2022b) to achieve performance comparable to Transformers. Due to the characteristics of the state space equation, inference can be conducted with constant complexity (Gu et al., 2022b), whereas the training speed can be slow compared with FlashAttention.

Long Convolution. Long convolution models (Qin et al., 2023a; Fu et al., 2023) utilize a kernel size equal to the input sequence length, facilitating a wider context compared to traditional convolutions. Training these models involves the Fast Fourier Transform (FFT) algorithm, reducing the computational complexity to O(n log n). However, long convolution models need to cache all historical computations for causal convolution inference, making them less ideal for processing long sequences compared to RNNs.
Linear RNN. Linear RNNs (Orvieto et al., 2023a; Qin et al., 2023c), in contrast, stand out as more suitable replacements for transformers in long-sequence modeling. A notable example is HGRN (Qin et al., 2023c), a linear RNN-based LLM that has shown competitive performance against similarly scaled GPT models.

2.2. IO-aware Attention

The FlashAttention series (Dao et al., 2022a; Dao, 2023) focuses on system-level optimizations for the efficient implementation of the standard attention operator on GPU platforms. These approaches employ tiling strategies to minimize the volume of memory reads/writes between the GPU's high bandwidth memory (HBM) and on-chip SRAM. Although these methods optimize the IO communication in attention calculation and are faster than previous softmax attention implementations, their theoretical computation complexity remains O(n²d), making them unsuitable for long sequence language modeling.

3. Lightning Attention

3.1. Preliminary

We first recall the formulation of linear attention and then introduce our proposed Lightning Attention. In the case of NormAttention within TransNormer (Qin et al., 2022a), attention computation deviates from the conventional Transformer structure (Vaswani et al., 2017) by eschewing the costly softmax and scaling operations. The NormAttention mechanism can be expressed as follows:

    O = Norm((QK⊤)V),    (1)

where Q, K, and V ∈ R^{n×d} are the query, key, and value matrices, respectively, with n for sequence length and d for feature dimension. The equation can be transformed into its linear variant using right matrix multiplication:

    O = Norm(Q(K⊤V)).    (2)

The linear formulation enables efficient recurrent prediction with O(nd²) complexity during training. Additionally, linear attention guarantees a constant computation complexity of O(d²) regardless of the sequence length. This is achieved by recurrently updating K⊤V, eliminating the need for repeated computation of the entire attention matrix. In contrast, standard softmax attention has a complexity of O(nd²) during inference.

Nevertheless, when dealing with causal prediction tasks, the effectiveness of the right product is compromised, leading to the requirement for the computation of cumsum (Hua et al., 2022). This impediment hinders the potential for highly efficient parallel computation. In this section, we show that the requirement of cumsum can be eliminated by leveraging the concept of "divide and conquer" in linear attention calculation. For convenience, Norm will be ignored in the subsequent discussion.

There are two computational approaches to handling the causal scenario. One is using conventional attention computation (the Left Product), which involves computing QK⊤ first. The complete calculation formula is as follows:

    O = [(QK⊤) ⊙ M]V,    (3)

where M_{ts} = 1 if t ≥ s, otherwise 0. The complete algorithm is detailed in Algorithm 1. Note that this algorithm is parallelizable, but its time complexity is O(n²d). The other option is to compute k_t v_t⊤ first (the Right Product), which leverages a recursive formula for computation:

    kv_0 = 0,   kv_t = kv_{t−1} + k_t v_t⊤,   o_t⊤ = q_t⊤ kv_t.    (4)

The complete algorithm is detailed in Algorithm 2. This algorithm has a time complexity of O(nd²), but it is not GPU-friendly, making it slower than the first approach.

Algorithm 1 Linear Attention Left Product
    Input: Q, K, V ∈ R^{n×d}.
    Initialize mask M ∈ R^{n×n}, where M_{ts} = 1 if t ≥ s, else 0.
    Load Q, K, M from HBM, compute S = (QK⊤) ⊙ M, write S to HBM.
    Load S, V from HBM, compute O = SV, write O to HBM.
    Return O.

Algorithm 2 Linear Attention Right Product
    Input: Q, K, V ∈ R^{n×d}.
    Initialize kv = 0 ∈ R^{d×d}.
    for t = 1, . . . , n do
        Load q_t, k_t, v_t ∈ R^{d×1} from HBM to on-chip SRAM.
        On chip, compute kv = kv + k_t v_t⊤.
        On chip, compute o_t⊤ = q_t⊤ kv.
        Write o_t⊤ to HBM as the t-th row of O.
    end for
    Return O.
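To make the two computation orders concrete, here is a minimal PyTorch sketch (our illustration, not the paper's released code) of Algorithms 1 and 2. Both produce the same causal output, but the left product materializes the n × n score matrix while the right product only carries a d × d state.

```python
import torch

def left_product(q, k, v):
    """Algorithm 1: causal linear attention via the masked (QK^T)V order, O(n^2 d)."""
    n = q.shape[0]
    mask = torch.tril(torch.ones(n, n, dtype=q.dtype))   # M_ts = 1 if t >= s, else 0
    scores = (q @ k.T) * mask                             # n x n masked score matrix
    return scores @ v

def right_product(q, k, v):
    """Algorithm 2: causal linear attention via the recursive kv state, O(n d^2)."""
    d = q.shape[1]
    kv = torch.zeros(d, d, dtype=q.dtype)                 # kv_0 = 0
    out = torch.empty_like(v)
    for t in range(q.shape[0]):
        kv = kv + torch.outer(k[t], v[t])                 # kv_t = kv_{t-1} + k_t v_t^T
        out[t] = q[t] @ kv                                # o_t^T = q_t^T kv_t
    return out

q, k, v = (torch.randn(128, 32) for _ in range(3))
assert torch.allclose(left_product(q, k, v), right_product(q, k, v), rtol=1e-3, atol=1e-3)
```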
3.2. Linear Attention with Tiling

We use a tiling technique to compute linear attention in a causal setting. Specifically, we first divide Q, K, V into two blocks by rows:

    X = [X_1; X_2],   X_1 ∈ R^{m×d},   X_2 ∈ R^{(n−m)×d},   X ∈ {Q, K, V}.

Then, by unfolding Eq. 3, we get (note that kv_0 = 0):

    kv_s = kv_0 + Σ_{j=1}^{s} k_j v_j⊤,   s = 1, . . . , m,
    o_s⊤ = q_s⊤ kv_s = q_s⊤ kv_0 + q_s⊤ Σ_{j=1}^{s} k_j v_j⊤.    (5)
In block form, we have:

    O_1 = Q_1 kv_0 + [(Q_1 K_1⊤) ⊙ M]V_1
        ≜ Q_1 KV_0 + [(Q_1 K_1⊤) ⊙ M]V_1.    (6)

The tiled forward computation iterates over blocks of size B, with the state KV initialized to zero:

    for t = 1, . . . , T do
        Load Q_t, K_t, V_t ∈ R^{B×d} from HBM to on-chip SRAM.
        On chip, compute O_intra = [(Q_t K_t⊤) ⊙ M]V_t.
        On chip, compute O_inter = Q_t (KV).
        On chip, compute KV = KV + K_t⊤ V_t.
        Write O_t = O_intra + O_inter to HBM as the t-th block of O.
    end for
    Return O.
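Below is a minimal PyTorch sketch of the block-wise loop above for the non-decay case (our illustration, not the paper's Triton kernel): each block contributes an intra-block term computed with the causal mask and an inter-block term computed from the running d × d state KV.

```python
import torch

def lightning_attention_forward(q, k, v, block_size=64):
    """Tiled causal linear attention: O_t = [(Q_t K_t^T) ⊙ M]V_t + Q_t KV, then KV += K_t^T V_t."""
    n, d = q.shape
    kv = torch.zeros(d, d, dtype=q.dtype)
    out = torch.empty_like(v)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        qt, kt, vt = q[start:end], k[start:end], v[start:end]
        mask = torch.tril(torch.ones(end - start, end - start, dtype=q.dtype))
        o_intra = ((qt @ kt.T) * mask) @ vt   # contributions from within the current block
        o_inter = qt @ kv                     # contributions from all previous blocks
        out[start:end] = o_intra + o_inter
        kv = kv + kt.T @ vt                   # update the running K^T V state
    return out
```

For any block size this matches the left- and right-product computations above; in the actual kernel the state KV and the current block are kept in on-chip SRAM so that each element of Q, K, V is read from HBM only once.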
4.2. Custom Modification

In this section, we outline the key designs and inspiration behind each custom modification, including positional encoding, the gating mechanism, and tensor normalization.

The Theoretical Receptive Field (TRF) (Qin et al., 2024) at the lower layers is smaller compared to the higher layers, which aligns with TransNormer's motivation. We choose λ to be non-learnable since we empirically found that gradients become unstable
when λ is learnable, leading to NaN values. Note that this positional encoding is still compatible with Lightning Attention, with the specific algorithm detailed in Appendix B.

Gating Mechanism. Gating can enhance the performance of the model and smooth the training process. In TNL, we adopt the approach from FLASH (Hua et al., 2022) and use Gated Linear Attention (GLA) in token mixing:

    O = Norm(QK⊤V) ⊙ U,
    Q = ϕ(XW_q),   K = ϕ(XW_k),   V = XW_v,   U = XW_u.    (11)

We choose ϕ to be the Swish (Ramachandran et al., 2017) activation function, as we empirically find that it outperforms other activation functions.

To further accelerate the model, we propose Simple GLU (SGLU), which removes the activation function from the original GLU structure, as the gate itself can introduce non-linearity. Therefore, our channel mixing becomes:

    O = [V ⊙ U]W_o,   V = XW_v,   U = XW_u.    (12)

We empirically find that not using an activation function in GLU will not lead to any performance loss.
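As an illustration of Eqs. (11) and (12), the following is a hedged PyTorch sketch (module and parameter names are ours, not from the released code) of the gated token-mixing and SGLU channel-mixing blocks, with the attention and Norm operators passed in as callables.

```python
import torch
import torch.nn as nn

class GatedLinearAttention(nn.Module):
    """Token mixing of Eq. (11): O = Norm(attention(Q, K, V)) ⊙ U (sketch)."""
    def __init__(self, dim, attention, norm):
        super().__init__()
        self.wq, self.wk = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.wv, self.wu = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.attention, self.norm = attention, norm    # e.g. Lightning Attention and SRMSNorm
        self.act = nn.SiLU()                           # Swish as the phi activation

    def forward(self, x):
        q, k = self.act(self.wq(x)), self.act(self.wk(x))
        v, u = self.wv(x), self.wu(x)
        return self.norm(self.attention(q, k, v)) * u  # gate the normalized attention output

class SGLU(nn.Module):
    """Channel mixing of Eq. (12): O = [V ⊙ U] W_o, a GLU without activation."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.wv = nn.Linear(dim, hidden_dim, bias=False)
        self.wu = nn.Linear(dim, hidden_dim, bias=False)
        self.wo = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.wo(self.wv(x) * self.wu(x))
```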
Tensor Normalization. The original NormAttention introduced in TransNormer (Qin et al., 2022a) is as follows:

    O = Norm(QK⊤V).    (13)

In TransNormerLLM, we replace the original RMSNorm with a new simple normalization function called SimpleRMSNorm, abbreviated as SRMSNorm:

    SRMSNorm(x) = x / (∥x∥₂ / √d).    (14)

We empirically find that using SRMSNorm does not lead to any performance loss.
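A minimal PyTorch sketch of Eq. (14) (our illustration of the reference formula; the dedicated SRMSNorm implementation benchmarked in Fig. 7 is separate):

```python
import torch

def srmsnorm(x, eps=1e-6):
    """SRMSNorm(x) = x / (||x||_2 / sqrt(d)), applied over the last dimension."""
    d = x.shape[-1]
    norm = x.norm(2, dim=-1, keepdim=True)    # ||x||_2 per token
    return x / (norm / d ** 0.5 + eps)        # eps (our addition) guards against division by zero
```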
Table 1. Results on Wikitext-103 (TNN (Qin et al., 2023a)'s setting). ↓ means lower is better.

                Model         PPL (val)↓   PPL (test)↓   Params (M)
    Attn-based  Transformer   24.40        24.78         44.65
                FLASH         25.92        26.70         42.17
                1+elu         27.44        28.05         44.65
                Performer     62.50        63.16         44.65
                cosFormer     26.53        27.06         44.65
                TN1           24.43        25.00         44.64
                TN2           24.50        25.05         44.64
    MLP-based   Syn(D)        31.31        32.43         46.75
                Syn(R)        33.68        34.78         44.65
                gMLP          28.08        29.13         47.83
    RNN-based   S4            38.34        39.66         45.69
                DSS           39.39        41.07         45.73
                GSS           29.61        30.74         43.84
                RWKV          24.31        25.07         46.23
                LRU           29.86        31.12         46.24
                HGRN          24.14        24.82         46.25
    FFT-based   TNN           23.98        24.67         48.68
    Ours        TNL           23.46        24.03         45.45

We compare the norm linear attention implemented in PyTorch (named Vanilla) and our Lightning Attention. As a reference, we have also included FlashAttention-2 (Dao, 2023) (named Flash2), which is currently the SOTA implementation of softmax attention. As shown in Fig. 4, Lightning Attention shows remarkable linear growth of processing time in both forward and backward passes, whereas Vanilla and Flash2 exhibit quadratic growth. In terms of memory footprint, Vanilla tends to rapidly exhaust memory resources. Lightning Attention shows a similar trend to Flash2 but requires less memory.

5.2. TNL Evaluation
Figure 4. Comparative Analysis of Speed and Memory Usage. Vanilla denotes norm linear attention implemented in PyTorch (Qin et al., 2022a); Flash2 denotes FlashAttention-2. Left two sub-figures: runtime in milliseconds for the forward and backward passes across varying sequence lengths. Right two sub-figures: memory utilization (in GB) during the forward and backward passes at different sequence lengths.
[Figure 5: bar chart of inference throughput for Pythia 6.9B, Baichuan 7B, Baichuan2 7B, LLaMA 7B, LLaMA2 7B, ChatGLM2 6B, ChatGLM3 6B, and TNL 7B.]

Figure 5. Inference Throughput Comparison. We measure the inference throughput of various 7B LLM models on an A100 80G GPU. Batch sizes for models are chosen to optimize GPU utilization without exceeding memory limits. Each model is tested with a 512-token input prompt and can generate up to 1024 new tokens. Reported throughput is averaged over 20 attempts.

... hardware setups. This comparison encompasses four variants: TNL, LLaMA-FA2 (Touvron et al., 2023a; Dao, 2023), HGRN (Qin et al., 2023c), and TNN (Qin et al., 2023a). Our findings show that during both the forward and backward passes, the TGS (tokens per GPU per second) for TNL remains consistently high, while the other three models exhibit a rapid decline when the sequence length is scaled from 1K to 128K. This pattern suggests that Lightning Attention offers a significant advancement in managing extremely long sequence lengths in LLMs.

Inference Evaluation. We conduct an inference throughput comparison of various 7B large language models using their standard codebases from Huggingface, as detailed in Fig. 5. TNL with Lightning Attention demonstrates a significant advantage, achieving a throughput rate up to 11× higher than that of Transformer-structured models.
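For reference, a measurement of this kind can be sketched with the Hugging Face transformers API as follows (the model name, batch size, and prompt here are placeholders, not the paper's exact harness):

```python
import time, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"          # placeholder; swap in the model under test
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

prompt_ids = torch.randint(0, tok.vocab_size, (8, 512)).cuda()   # batch of 512-token prompts
runs, new_tokens = 20, 1024

torch.cuda.synchronize(); start = time.time()
for _ in range(runs):
    model.generate(prompt_ids, max_new_tokens=new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start
# assumes the full 1024 tokens are generated each run ("up to 1024" in the protocol above)
print("throughput (tokens/s):", runs * prompt_ids.shape[0] * new_tokens / elapsed)
```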
Benchmark Results. In order to validate the effectiveness of TNL, we pretrain 385M, 1B, 7B, and 15B models on self-collected datasets (the details of the data are in Appendix D) and test them on the Commonsense Reasoning Task, MMLU (Hendrycks et al., 2021), C-Eval (Huang et al., 2023), and SCROLLS (Shaham et al., 2022). For comparison, we select several open-source models, including GPT-Neo (Black et al., 2022), Falcon (Almazrouei et al., 2023), LLaMA (Touvron et al., 2023a;b), OpenLLaMA (Geng & Liu, 2023), Baichuan (Baichuan, 2023), ChatGLM (Zeng et al., 2022; Du et al., 2022), and the non-Transformer model RWKV (Peng et al., 2023a). It can be observed in Table 2 and Table 3 that, compared to these models, TNL remains highly competitive.

• We report BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2019), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2019), ARC easy and challenge (Clark et al., 2018), and OpenBookQA (Mihaylov et al., 2018). We report 0-shot results for all benchmarks using LM-Eval-Harness (Gao et al., 2021). All of our models achieve competitive performance compared to existing state-of-the-art LLMs, showcasing a remarkable ability to comprehend and apply commonsense reasoning.

• We report the overall results for MMLU (Hendrycks et al., 2021) and C-Eval (Huang et al., 2023). Official scripts were used for evaluating MMLU and C-Eval, with all evaluation results being conducted in a 5-shot setup. In comparison to top-tier open-source models available in the industry, our models have demonstrated matched performance in both English and Chinese benchmarks.

• On the SCROLLS (Shaham et al., 2022) benchmark, we assess large language models of up to 1 billion parameters pre-trained with a sequence length of 2048. We present zero-shot performance results for all benchmarks using the LM-Eval-Harness (Gao et al., 2021). For generation tasks within SCROLLS, we employ greedy search with hyper-parameters top_k set to 5 and top_p set to 1. Our models consistently match or surpass the performance of existing state-of-the-art LLMs in these tasks.
Table 2. Performance Comparison on Commonsense Reasoning and Aggregated Benchmarks. For a fair comparison, we report
competing methods’ results reproduced by us using their released models. Official results are denoted in italics. PS: parameter size
(billion). T: tokens (billion). HS: HellaSwag. WG: WinoGrande.
Table 3. Performance Comparison on SCROLLS (Shaham et al., 2022): A review of models up to 1 billion parameters on 2048
pre-training sequence length. PS: parameter size (billion). T: tokens (billion).
References

Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022a.

Gupta, A., Gu, A., and Berant, J. Diagonal state spaces are as effective as structured state spaces, 2022.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding, 2021.

Hua, W., Dai, Z., Liu, H., and Le, Q. V. Transformer quality in linear time. arXiv preprint arXiv:2202.10447, 2022.

Huang, Y., Bai, Y., Zhu, Z., Zhang, J., Zhang, J., Su, T., Liu, J., Lv, C., Zhang, Y., Lei, J., Fu, Y., Sun, M., and He, J. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models, 2023.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b, 2023.

Kalamkar, D., Mudigere, D., Mellempudi, N., Das, D., Banerjee, K., Avancha, S., Vooturi, D. T., Jammalamadaka, N., Huang, J., Yuen, H., et al. A study of bfloat16 for deep learning training. arXiv preprint arXiv:1905.12322, 2019.

Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020.

Liu, H., Dai, Z., So, D., and Le, Q. V. Pay attention to mlps. Advances in Neural Information Processing Systems, 34:9204–9215, 2021.

Liu, Z., Li, D., Lu, K., Qin, Z., Sun, W., Xu, J., and Zhong, Y. Neural architecture search on efficient transformers and beyond. arXiv preprint arXiv:2207.13955, 2022.

Mehta, H., Gupta, A., Cutkosky, A., and Neyshabur, B. Long range language modeling via gated state spaces. arXiv preprint arXiv:2206.13947, 2022.

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018.

Orvieto, A., Smith, S. L., Gu, A., Fernando, A., Gulcehre, C., Pascanu, R., and De, S. Resurrecting recurrent neural networks for long sequences, 2023a.

Orvieto, A., Smith, S. L., Gu, A., Fernando, A., Gülçehre, Ç., Pascanu, R., and De, S. Resurrecting recurrent neural networks for long sequences. CoRR, abs/2303.06349, 2023b. doi: 10.48550/arXiv.2303.06349. URL https://doi.org/10.48550/arXiv.2303.06349.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Cao, H., Cheng, X., Chung, M., Grella, M., GV, K. K., He, X., Hou, H., Kazienko, P., Kocon, J., Kong, J., Koptyra, B., Lau, H., Mantri, K. S. I., Mom, F., Saito, A., Tang, X., Wang, B., Wind, J. S., Wozniak, S., Zhang, R., Zhang, Z., Zhao, Q., Zhou, P., Zhu, J., and Zhu, R.-J. Rwkv: Reinventing rnns for the transformer era, 2023a.

Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Cao, H., Cheng, X., Chung, M., Grella, M., GV, K. K., He, X., Hou, H., Kazienko, P., Kocon, J., Kong, J., Koptyra, B., Lau, H., Mantri, K. S. I., Mom, F., Saito, A., Tang, X., Wang, B., Wind, J. S., Wozniak, S., Zhang, R., Zhang, Z., Zhao, Q., Zhou, P., Zhu, J., and Zhu, R.-J. Rwkv: Reinventing rnns for the transformer era, 2023b.

Press, O., Smith, N., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0.

Qin, Z., Han, X., Sun, W., Li, D., Kong, L., Barnes, N., and Zhong, Y. The devil in linear transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 7025–7041, Abu Dhabi, United Arab Emirates, December 2022a. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.473.

Qin, Z., Sun, W., Deng, H., Li, D., Wei, Y., Lv, B., Yan, J., Kong, L., and Zhong, Y. cosformer: Rethinking softmax in attention. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=Bl8CQrx2Up4.

Qin, Z., Han, X., Sun, W., He, B., Li, D., Li, D., Dai, Y., Kong, L., and Zhong, Y. Toeplitz neural network for sequence modeling. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=IxmWsm4xrua.

Qin, Z., Sun, W., Lu, K., Deng, H., Li, D., Han, X., Dai, Y., Kong, L., and Zhong, Y. Linearized relative positional encoding. Transactions on Machine Learning Research, 2023b.

Qin, Z., Yang, S., and Zhong, Y. Hierarchically gated recurrent neural network for sequence modeling. In NeurIPS, 2023c.

Qin, Z., Zhong, Y., and Deng, H. Exploring transformer extrapolation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.

Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions, 2017.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. Winogrande: An adversarial winograd schema challenge at scale, 2019.

Sap, M., Rashkin, H., Chen, D., LeBras, R., and Choi, Y. Socialiqa: Commonsense reasoning about social interactions, 2019.

Shaham, U., Segal, E., Ivgi, M., Efrat, A., Yoran, O., Haviv, A., Gupta, A., Xiong, W., Geva, M., Berant, J., et al. Scrolls: Standardized comparison over long language sequences. arXiv preprint arXiv:2201.03533, 2022.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Tay, Y., Bahri, D., Metzler, D., Juan, D.-C., Zhao, Z., and Zheng, C. Synthesizer: Rethinking self-attention for transformer models. In International Conference on Machine Learning, pp. 10183–10192. PMLR, 2021.

Team, M. N. et al. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. URL www.mosaicml.com/blog/mpt-7b. Accessed 2023-05-05.

Tillet, P., Kung, H.-T., and Cox, D. D. Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023b.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Wang, B. and Komatsuzaki, A. Gpt-j-6b: A 6 billion parameter autoregressive language model, 2021.

Workshop, B., Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., et al. Bloom: A 176b-parameter open-access multilingual language model, 2023.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence?, 2019.

Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L. Opt: Open pre-trained transformer language models, 2022.

Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., et al. Pytorch fsdp: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.

Zheng, L., Wang, C., and Kong, L. Linear complexity randomized self-attention mechanism. In International Conference on Machine Learning, pp. 27011–27041. PMLR, 2022.

Zheng, L., Yuan, J., Wang, C., and Kong, L. Efficient attention via control variates. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=G-uNfHKrj46.
    ⋯ = λ kv_{m−1} + k_m v_m⊤ = kv_m,    (19)

the statement holds. Therefore, by induction, the statement holds for all n ≥ 1.

B. Lightning Attention with decay

We extend Lightning Attention to accommodate linear attention with decay. The complete algorithm can be found in Algorithms 5 and 6, and the proof of correctness is provided in Appendix C.

C. Proofs

Here we discuss linear attention with decay directly, because vanilla linear attention is the case of λ = 1.

We first define

    KV_0 = 0 ∈ R^{d×d},   KV_t = Σ_{s≤tB} λ^{tB−s} k_s v_s⊤.    (23)

Given KV_t, the output at position tB + r of the (t + 1)-th block, with 1 ≤ r ≤ B, is

    o_{tB+r}⊤ = q_{tB+r}⊤ Σ_{s≤tB+r} λ^{tB+r−s} k_s v_s⊤
              = q_{tB+r}⊤ ( Σ_{s=tB+1}^{tB+r} λ^{tB+r−s} k_s v_s⊤ + λ^r Σ_{s≤tB} λ^{tB−s} k_s v_s⊤ )
              = q_{tB+r}⊤ Σ_{s=tB+1}^{tB+r} λ^{tB+r−s} k_s v_s⊤ + λ^r q_{tB+r}⊤ KV_t.    (24)
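To make the block decomposition in Eq. (24) concrete, here is a small PyTorch sketch (ours, using 0-based indexing rather than the paper's 1-based indexing) of the decayed forward pass: an intra-block term with decay mask λ^(i−j), an inter-block term scaled by a per-row decay, and a KV state update scaled by λ^B.

```python
import torch

def lightning_attention_decay(q, k, v, lam, block=64):
    """Block-wise causal linear attention with decay lam (lam = 1 recovers the vanilla case)."""
    n, d = q.shape
    kv = torch.zeros(d, d, dtype=q.dtype)             # decayed sum of k_s v_s^T up to the previous block
    out = torch.empty_like(v)
    idx = torch.arange(block, dtype=q.dtype)
    diff = (idx[:, None] - idx[None, :]).clamp(min=0)
    decay_mask = torch.tril(lam ** diff)              # lam^(i-j) for i >= j inside a block
    for start in range(0, n, block):
        qt, kt, vt = q[start:start + block], k[start:start + block], v[start:start + block]
        b = qt.shape[0]
        o_intra = ((qt @ kt.T) * decay_mask[:b, :b]) @ vt
        o_inter = (lam ** (idx[:b] + 1))[:, None] * (qt @ kv)   # rows decay by lam^(i+1) w.r.t. the state
        out[start:start + b] = o_intra + o_inter
        kv = (lam ** b) * kv + ((lam ** (b - 1 - idx[:b]))[:, None] * kt).T @ vt
    return out
```

With lam = 1 this reduces to the non-decay tiled loop of Section 3.2.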
In matrix form, we have

    O_{t+1} = [(Q_{t+1} K_{t+1}⊤) ⊙ M] V_{t+1}   (intra block)
            + Λ Q_{t+1} (KV_t),                   (inter block)    (25)

where

    M_{ts} = λ^{t−s} if t ≥ s, otherwise 0,    (26)
    Λ = diag{1, . . . , λ^{B−1}}.

And the KV at the (t + 1)-th block can be written as

    KV_{t+1} = Σ_{s≤(t+1)B} λ^{(t+1)B−s} k_s v_s⊤
             = λ^B Σ_{s≤tB} λ^{tB−s} k_s v_s⊤ + Σ_{s=tB+1}^{(t+1)B} λ^{(t+1)B−s} k_s v_s⊤
             = λ^B KV_t + diag{λ^{B−1}, . . . , 1} K_t⊤ V_t
             = λ^B KV_t + λ^B Λ^{−1} K_t⊤ V_t.    (27)

The complete expression of the forward pass of Lightning Attention with decay can be found in Algorithm 5.

For the backward pass, the gradient with respect to the query at position tB + r is

    dq_{tB+r}⊤ = do_{tB+r}⊤ Σ_{s=tB+1}^{tB+r} λ^{tB+r−s} v_s k_s⊤ + λ^r do_{tB+r}⊤ Σ_{s≤tB} λ^{tB−s} v_s k_s⊤
               = do_{tB+r}⊤ Σ_{s=tB+1}^{tB+r} λ^{tB+r−s} v_s k_s⊤ + λ^r do_{tB+r}⊤ KV_t⊤.    (31)

In matrix form, we have

    dQ_{t+1} = [(dO_{t+1} V_{t+1}⊤) ⊙ M] K_{t+1}   (intra block)
             + Λ dO_{t+1} (KV_t⊤).                  (inter block)    (32)

Since the recursion of dK_t steps from t + 1 to t, given KV_{t+1}, dK_t for the t-th block, i.e., at positions (t − 1)B +
[Figure 6 diagram: data sources (academic writings, books, code, web) pass through rule-based filtering and deduplication, then an iterative self-clean scheme (model-based filtering, human evaluation, evaluation model, repeated ×N) to produce the training data.]
Figure 6. Data Preprocess Procedure. The collected data undergoes a process of rule-based filtering and deduplication, followed by our
self-clean data processing strategy: model-based filtering, human evaluation, and evaluation model. After several iterations of the above
cycle, we obtain high-quality training data at around 2T tokens.
• Handling Special Characters: Unusual or special characters that are not commonly part of the language's text corpus are identified and either removed or replaced with a standardized representation.

• Number Standardization: Numerical figures may be presented in various formats across different texts. These numbers are standardized into a common format to maintain consistency.

• Preservation of Markdown/LaTeX Formats: While removing non-textual elements, exceptions are made for texts in Markdown and LaTeX formats. Given their structured nature and ubiquitous use in academia and documentation, preserving these formats can enhance the model's ability to understand and generate similarly formatted text.

Deduplication. To ensure the uniqueness of our data and avert the risk of overfitting, we employ an efficient deduplication strategy at the document or line level using MinHash and Locality-Sensitive Hashing (LSH) algorithms. This combination of MinHash and LSH ensures a balance between computational efficiency and accuracy in the deduplication process, providing a robust mechanism for data deduplication and text watermark removal.
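A minimal sketch of this style of near-duplicate detection using the datasketch library (our illustration; the shingle size, permutation count, and threshold are placeholder choices, not the paper's settings):

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128, shingle=5):
    """Build a MinHash signature from character shingles of a document (or a line)."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - shingle + 1, 1)):
        m.update(text[i:i + shingle].encode("utf8"))
    return m

docs = {"doc1": "the quick brown fox jumps over the lazy dog",
        "doc2": "the quick brown fox jumps over the lazy dog!",
        "doc3": "an entirely different document"}

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # approximate Jaccard threshold for near-duplicates
kept = []
for key, text in docs.items():
    sig = minhash_of(text)
    if lsh.query(sig):          # a similar document was already kept, so drop this one
        continue
    lsh.insert(key, sig)
    kept.append(key)
print(kept)                     # doc2 is dropped as a near-duplicate of doc1
```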
Self-cleaning scheme. Our data self-cleaning process involves an iterative loop of the following three steps to continuously refine and enhance the quality of our dataset. An issue with using model-based data filters is that the filtered data will have a similar distribution to the evaluation model, which may have a significant impact on the diversity of the training data. Assuming that the majority of the pre-processed data is of high quality, we can train an evaluation model on the entire set of pre-processed data; the model will then automatically smooth the data manifold distribution and filter out low-quality data while retaining the majority of the diversity.

The self-cleaning scheme unfolds as follows:

• Evaluation Model: We train a 385M model on the pre-processed corpus to act as a data quality filter.

• Model-Based Data Filtering: We use the evaluation model to assess each piece of data with perplexity. Only data achieving a score above a certain threshold is preserved for the next step; low-quality data are weeded out at this stage (see the sketch after this list).

• Human Evaluation: We sample a small portion of the filtered data and manually evaluate its quality.
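A hedged sketch of the perplexity-based filtering step (the scoring model, cutoff, and helper names here are illustrative stand-ins; the paper's evaluation model is a 385M model trained on the pre-processed corpus):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                      # stand-in for the evaluation model
eval_model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text):
    ids = tok(text, return_tensors="pt", truncation=True).input_ids
    loss = eval_model(ids, labels=ids).loss                      # mean token cross-entropy
    return math.exp(loss.item())

PPL_CUTOFF = 80.0   # illustrative threshold: low perplexity corresponds to a high quality score

corpus = ["A well formed paragraph about language modeling.",
          "asdf qwer zxcv 1234 !!!! ????"]
filtered = [doc for doc in corpus if perplexity(doc) < PPL_CUTOFF]   # documents above the cutoff are dropped
```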
These steps are repeated in cycles, with each iteration improving the overall quality of the data and ensuring the resulting model is trained on relevant, high-quality text. This self-cleaning process provides a robust mechanism for maintaining data integrity, thereby enhancing the performance of the resulting language model.

D.2. Tokenization

We tokenize the data with the Byte-Pair Encoding (BPE) algorithm. Notably, to enhance compatibility with Chinese
Figure 7. Performance Evaluation of SRMSNorm Implementation. The upper figures exhibit the runtime comparison of the forward
pass (left section) and backward pass (right section) for different sequence lengths, with a fixed feature dimension of 3072. The lower two
figures illustrate the runtime comparison for various feature dimensions, with a fixed sequence length of 4096.