
Under review as a conference paper at ICLR 2024

TRANSFORMER-VQ: LINEAR-TIME TRANSFORMERS VIA VECTOR QUANTIZATION

Lucas D. Lingle
Independent Researcher
[email protected]

ABSTRACT

We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In large-scale experiments, Transformer-VQ is shown highly competitive in quality, with strong results on Enwik8 (0.99 bpb), PG-19 (26.6 ppl), and ImageNet64 (3.16 bpb). Code: https://github.com/transformer-vq/transformer_vq

1 INTRODUCTION

Figure 1: Minibatch of generated samples from our unconditional ImageNet64 model; nucleus 1.0.

Transformer (Vaswani et al., 2017) language models would ideally scale to long sequences, since their predictive abilities often improve as context length increases (Dai et al., 2019; Kaplan et al., 2020). Unfortunately, the standard transformer uses a self-attention mechanism with a quadratic time complexity with respect to sequence length. This limits the practicality of applying transformers to very long sequences, since increasing the sequence length by a factor of 10^n increases the attention computations by a factor of 100^n. Transformer variants that overcome this efficiency bottleneck have the potential to facilitate new long-context applications and enable new breakthroughs.
Up to this point, a variety of efficient transformers (Tay et al., 2020b) have been proposed to scale
to long sequences. Techniques include sparsity (Child et al., 2019; Ye et al., 2019; Beltagy et al.,
2020; Kitaev et al., 2020; Qiu et al., 2020; Roy et al., 2021; Tay et al., 2020a; Sukhbaatar et al.,
2021; Wu et al., 2022; Liu et al., 2023; Zhang et al., 2023), compression (Liu et al., 2018; Rae et al.,
2020; Ainslie et al., 2020; Zhu et al., 2021; Ren et al., 2021; Nawrot et al., 2021; 2023), low-rank
approximations (Wang et al., 2020; Vyas et al., 2020; Katharopoulos et al., 2020; Xiong et al., 2021;
Tay et al., 2021; Choromanski et al., 2021), and cross-attention operations (Dai et al., 2019; Ma et al.,
2021; Hutchins et al., 2022; Hawthorne et al., 2022). Other efficient sequence models have also been
proposed (Gu et al., 2022; Lee-Thorp et al., 2022; Poli et al., 2023; Peng et al., 2023).
In this paper, we present Transformer-VQ, a transformer decoder whose dense self-attention is computable in linear time with respect to sequence length. This is made possible through a combination of vector-quantized keys, localized positional biases, and a truncation-free yet fixed-size cache mechanism. Beyond its efficiency, Transformer-VQ is also simple to sample from, as it does not require any periodic operations beyond those occurring at every token.


2 PRELIMINARIES
2.1 NOTATION

The real numbers are denoted by R and the extended real numbers R ∪ {−∞, ∞} by R̄. Zero-based indices are used for all tensors. When indexing a matrix M along the first axis, we use M_i to denote a column vector and M_{i,:} to denote a row vector. The functions LN(·), Softmax(·), Concat(·) denote LayerNorm (Ba et al., 2016), softmax, and concatenation, each applied row-wise. The symbols ≜, ∝, ⊙, exp(·), δ_{a,b}, SG(·) denote equality by definition, proportionality, element-wise product, element-wise exponentiation, the Kronecker delta function, and the stop-gradient operator.
We assume familiarity with transformers (Vaswani et al., 2017), and use the notation Dm to denote the model width, H to denote the number of attention heads per layer, Dk to denote the query/key vector width, Dv to denote the value vector width, and Df to denote the feedforward fan-out width.

2.2 VECTOR QUANTIZATION

Vector quantization (VQ) is a technique used extensively throughout this work. Here we briefly
review VQ, motivate its use in self-attention, and discuss the VQ scheme introduced for representation
learning by van den Oord et al. (2017). All proofs are given in Appendix A.

2.3 VECTOR QUANTIZERS AND CODEBOOKS

Definition 2.1. A vector quantizer is a function VQ(·; C) with domain R^D and codomain R^D. For an input x, its output x̂ is given by

    z ≜ arg min_s ||x − C_s||_2        (1)
    x̂ ≜ C_z                            (2)

where C ∈ R^{S×D} is known as the codebook. The row indices {0, . . . , S − 1} of C are called shortcodes, and the rows themselves are called codewords.
Theorem 2.2 (Based on Guo et al. (2019)). Let q ∈ R^D be a random variable with E_q[qq^⊤] ∝ I_D, and let k ∈ R^D be a random variable independent of q. Let φ : R^D → R^D be a deterministic function. Then

    E_{q,k} ||q^⊤k − q^⊤φ(k)||^2 ∝ E_k ||k − φ(k)||^2.        (3)

Corollary 2.3. Let the conditions of Theorem 2.2 hold. Given the constraint that φ(R^D) = {C_s}_{s=0}^{S−1}, the choice φ(·) = VQ(·; C) minimizes E_{q,k} ||q^⊤k − q^⊤φ(k)||^2.
Corollary 2.4. Let the conditions of Theorem 2.2 hold. With k̂ = VQ(k; C), we have

    arg min_C E_{q,k} ||q^⊤k − q^⊤k̂||^2 = arg min_C E_k ||k − k̂||^2.        (4)

Remark 2.5. Since finding the global minimizer C* = arg min_C E_k ||k − k̂||^2 can be expensive, we approximate it using a minibatch variant of streaming k-means, as in van den Oord et al. (2017).

2.4 VECTOR-QUANTIZED REPRESENTATION LEARNING

Definition 2.6 (Based on van den Oord et al. (2017)). A vector quantizer with straight-through estimator is a function STVQ(·; C) with domain R^D and codomain R^D. For an input x, its output x̂ is given by

    z ≜ arg min_s ||x − C_s||_2        (5)
    x̂ ≜ x + SG(C_z − x).               (6)

Remark 2.7. For any x ∈ R^D, STVQ(x; C) evaluates to VQ(x; C). However, the computed Jacobian of the quantizer w.r.t. its input will now be an identity matrix everywhere, instead of a zero matrix almost everywhere. Intuitively, when using STVQ, gradients are 'transplanted' onto the unquantized vectors from their quantized counterparts.
Remark 2.8. We overload the notation STVQ(·; C) to operate row-wise on matrix-valued inputs.
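For concreteness, a minimal JAX sketch of row-wise STVQ as defined above (function and variable names are ours, not taken from the open-sourced implementation):

```python
import jax
import jax.numpy as jnp

def stvq(x, codebook):
    """Row-wise vector quantization with a straight-through gradient estimator.

    x:        [T, D] unquantized vectors (e.g., attention keys).
    codebook: [S, D] codewords C.
    Returns (x_hat, z): quantized vectors [T, D] and shortcodes [T].
    """
    # Squared Euclidean distances between every row of x and every codeword.
    dists = (
        jnp.sum(x ** 2, axis=-1, keepdims=True)
        - 2.0 * x @ codebook.T
        + jnp.sum(codebook ** 2, axis=-1)
    )  # [T, S]
    z = jnp.argmin(dists, axis=-1)          # Eq. (5): shortcodes
    c_z = codebook[z]                       # nearest codewords
    # Eq. (6): forward pass returns c_z; the backward pass treats the quantizer
    # as the identity w.r.t. x (gradients are 'transplanted' onto x).
    x_hat = x + jax.lax.stop_gradient(c_z - x)
    return x_hat, z
```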


3 TRANSFORMER-VQ

We now propose Transformer-VQ, a decoder-only transformer that can compute dense self-attention
in linear time. Proofs for all theoretical results are given in Appendix A.

3.1 QUADRATIC-TIME FORMULATION

Definition 3.1. Vector-Quantized Self-Attention is a function VQAttn(·; C, W_{Q,K,V,G,O}) with domain R^{T×Dm} and codomain R^{T×Dm}. For an input X, its output Y is defined via

    X̃ ≜ LN(X) ∈ R^{T×Dm}                       (7)
    Q ≜ τ^{−0.5} LN(X̃ W_Q) ∈ R^{T×Dk}           (8)
    K ≜ τ^{−0.5} LN(X̃ W_K) ∈ R^{T×Dk}           (9)
    V ≜ ϕ_v(X̃ W_V) ∈ R^{T×Dv}                   (10)
    G ≜ ϕ_g(X̃ W_G) ∈ R^{T×Dv}                   (11)
    K̂ ≜ STVQ(K; C) ∈ R^{T×Dk}                   (12)
    W ≜ ϕ_w(Q K̂^⊤ + B) ∈ R^{T×T}                (13)
    O ≜ (W V) ⊙ G ∈ R^{T×Dv}                    (14)
    Y ≜ X + O W_O ∈ R^{T×Dm}                    (15)
where τ is a fixed constant, ϕv , ϕg , ϕw are element-wise or row-wise nonlinearities, the query/key
LayerNorms use unit gain and zero bias, and STVQ(·; C) denotes row-wise application of vector-
quantization with a straight-through gradient estimator (van den Oord et al., 2017).
Remark 3.2. Our attention mechanism is applied to a gated activation unit (GAU) design inspired by
Hua et al. (2022). GAU is a single-headed gated attention mechanism and generally uses Dk = 128,
Dv = 2Dm , with two GAUs replacing a single transformer layer. This yields a similar parameter
count and compute requirement as the transformer layer, assuming the latter uses Dm ≫ 128,
Dk = Dv = Dm /H, and Df = 4Dm .
Remark 3.3. Prior work has also applied LayerNorm or similar to the queries and keys in attention
(Henry et al., 2020; Roy et al., 2021; Zhu et al., 2021; Wu et al., 2022; Hutchins et al., 2022), generally
finding it to improve numerical stability and convergence.
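To make Definition 3.1 concrete, the following is a minimal sketch of the quadratic-time formulation in JAX, assuming row-wise softmax for ϕ_w and SiLU for ϕ_v and ϕ_g (the GAU gating nonlinearity); weight matrices are passed explicitly, names are ours, and the sketch reuses stvq() from Section 2.4. This is not the linear-time algorithm of Section 3.3.

```python
import jax
import jax.numpy as jnp

def layernorm(x, eps=1e-6):
    # Parameter-free LayerNorm; Definition 3.1 uses unit gain and zero bias for the
    # query/key LayerNorms (the input LayerNorm would normally carry learned parameters).
    mu = jnp.mean(x, axis=-1, keepdims=True)
    var = jnp.var(x, axis=-1, keepdims=True)
    return (x - mu) / jnp.sqrt(var + eps)

def vq_attn_quadratic(x, codebook, w_q, w_k, w_v, w_g, w_o, b, tau):
    """Eqs. (7)-(15) with phi_w = row-wise softmax and phi_v = phi_g = SiLU.

    x: [T, Dm] inputs; b: [T, T] bias matrix B; tau: fixed temperature constant.
    Uses stvq() from the sketch in Section 2.4.
    """
    x_tilde = layernorm(x)                           # (7)
    q = tau ** -0.5 * layernorm(x_tilde @ w_q)       # (8)
    k = tau ** -0.5 * layernorm(x_tilde @ w_k)       # (9)
    v = jax.nn.silu(x_tilde @ w_v)                   # (10)
    g = jax.nn.silu(x_tilde @ w_g)                   # (11)
    k_hat, _ = stvq(k, codebook)                     # (12)
    w = jax.nn.softmax(q @ k_hat.T + b, axis=-1)     # (13)
    o = (w @ v) * g                                  # (14)
    return x + o @ w_o                               # (15)
```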

3.2 WARMUP: LINEAR-TIME ENCODER ATTENTION

Theorem 3.4. Suppose B_{i,j} = 0 for all i, j, and ϕ_w is an element-wise nonlinearity. Then the attention weights in Definition 3.1 can be factored:

    W ≜ ϕ_w(Q K̂^⊤ + B)        (16)
      = ϕ_w(Q K̂^⊤)            (17)
      = ϕ_w(Q C^⊤) ∆           (18)

where ϕ_w(Q C^⊤) ∈ R^{T×S}, ∆ ∈ R^{S×T}, and ∆_{s,t} ≜ δ_{s,z_t}. Here, δ_{·,·} denotes the Kronecker delta function and z_t is the VQ shortcode for timestep t.
Theorem 3.5. Suppose B_{i,j} = 0 for all i, j, and ϕ_w is the row-wise softmax nonlinearity. Then the attention weights in Definition 3.1 can be factored:

    W ≜ ϕ_w(Q K̂^⊤ + B)                                    (19)
      = ϕ_w(Q K̂^⊤)                                        (20)
      = Diag(exp(Q C^⊤) ∆ 1)^{−1} exp(Q C^⊤) ∆             (21)

where 1 ∈ R^T, Diag(exp(Q C^⊤) ∆ 1)^{−1} exp(Q C^⊤) ∈ R^{T×S}, ∆ ∈ R^{S×T}, and ∆_{s,t} ≜ δ_{s,z_t}. Here, δ_{·,·} denotes the Kronecker delta function and z_t is the VQ shortcode for timestep t.
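The factorization in Theorem 3.5 is easy to verify numerically; the following self-contained sketch (names ours) compares the two sides of Eqs. (20)-(21) on random inputs:

```python
import jax
import jax.numpy as jnp

def check_softmax_factorization(key, T=8, S=4, D=16):
    # Random queries, codebook, and shortcodes; K_hat = C[z] by construction.
    kq, kc, kz = jax.random.split(key, 3)
    q = jax.random.normal(kq, (T, D))
    codebook = jax.random.normal(kc, (S, D))
    z = jax.random.randint(kz, (T,), 0, S)
    k_hat = codebook[z]

    lhs = jax.nn.softmax(q @ k_hat.T, axis=-1)            # Eq. (20)

    delta = jax.nn.one_hot(z, S).T                        # [S, T], Delta_{s,t} = delta_{s, z_t}
    a = jnp.exp(q @ codebook.T) @ delta                   # exp(Q C^T) Delta, shape [T, T]
    rhs = a / jnp.sum(a, axis=-1, keepdims=True)          # Eq. (21)

    return jnp.max(jnp.abs(lhs - rhs))                    # ~1e-6, i.e., the two forms agree
```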


3.3 LINEAR-TIME DECODER ATTENTION

Theorem 3.6. Let L be a divisor of T. Suppose B_{i,j} = −∞ for j > i, and B_{i,j} = 0 for j < i − L. Let ∆ ∈ R^{S×T} with ∆_{s,t} ≜ δ_{s,z_t}, same as before. Let ϕ_w be an element-wise nonlinearity with ϕ_w(−∞) = 0. For a tensor M, let M(...,n,...) denote the slice M_{...,nL:(n+1)L,...}. For a specific tensor, if an axis is not sliced over, each ellipsis will be replaced by the appropriate number of ':'. Then the product WV in Definition 3.1 can be computed using the recursion:

    U(n) ≜ U(n − 1) + ∆(:,n) V(n,:)   if n ≥ 0,   and U(n) ≜ 0 otherwise        (22)

    [WV](n,:) = ϕ_w(Q(n,:) C^⊤) U(n − 2)                                         (23)
              + ϕ_w(Q(n,:) [K̂(n−1,:)]^⊤ + B(n,n−1)) V(n−1,:)                     (24)
              + ϕ_w(Q(n,:) [K̂(n,:)]^⊤ + B(n,n)) V(n,:)                           (25)

where any tensor slice M(...,n,...) is defined as a zero tensor of width L in the sliced dimension(s) if any block slice index n is less than zero.
Theorem 3.7. Let L be a divisor of T. Suppose B_{i,j} = −∞ for j > i, and B_{i,j} = 0 for j < i − L. Let ∆ ∈ R^{S×T} with ∆_{s,t} ≜ δ_{s,z_t}, same as before. Suppose ϕ_w is the row-wise softmax nonlinearity. Let the block tensor slice notation from Theorem 3.6 apply. Let 1 ∈ R^T. Let A ≜ exp(Q K̂^⊤ + B). Then the product WV in Definition 3.1 can be computed using the recursions:

    U(n) ≜ U(n − 1) + ∆(:,n) V(n,:)   if n ≥ 0,   and U(n) ≜ 0 otherwise        (26)

    L(n) ≜ L(n − 1) + ∆(:,n) 1(n)     if n ≥ 0,   and L(n) ≜ 0 otherwise        (27)

    [AV](n,:) = exp(Q(n,:) C^⊤) U(n − 2)                                         (28)
              + exp(Q(n,:) [K̂(n−1,:)]^⊤ + B(n,n−1)) V(n−1,:)                     (29)
              + exp(Q(n,:) [K̂(n,:)]^⊤ + B(n,n)) V(n,:)                           (30)

    [A1](n) = exp(Q(n,:) C^⊤) L(n − 2)                                           (31)
            + exp(Q(n,:) [K̂(n−1,:)]^⊤ + B(n,n−1)) 1(n−1)                         (32)
            + exp(Q(n,:) [K̂(n,:)]^⊤ + B(n,n)) 1(n)                               (33)

    [WV](n,:) = Diag([A1](n))^{−1} [AV](n,:).                                    (34)
Remark 3.8. Intuitively, Theorem 3.7 shows that VQ-Attention is computable by processing the
sequence in blocks of length L, applying two steps to each block. The first step is to form the
corresponding block of ∆ and use it to sum the value vectors and shortcode indicators into the
appropriate rows of the ‘cache’ variables U(n), L(n). The second step is to incorporate U(n), L(n)
directly into the retrieval process with the help of the codebook C.
Remark 3.9. Theorem 3.7 provides an algorithm to compute VQ-Attention from the queries, keys,
values, gates, and codebook in O(L(S + 2L)(Dk + Dv )) time per query block, and therefore
O(T (S + 2L)(Dk + Dv )) time per sequence.
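For illustration, with T = 8192, L = 512, and S = 512, the factor T(S + 2L) is 8192 · 1536 ≈ 1.3 × 10^7, whereas quadratic attention scales with T^2 ≈ 6.7 × 10^7; the gap widens linearly as T grows with S and L held fixed.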
Remark 3.10. In the experiments, we take ϕ_w to be the row-wise softmax, and use the relative positional biases from Dai et al. (2019) for the band of nonzero biases in B. We also rely on a numerically stable reformulation of Theorem 3.7, in which the logarithms of the counts L(n − 2) are moved inside the exponentials exp(Q(n,:) C^⊤) appearing in [AV](n,:) and [A1](n).
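For concreteness, here is a minimal JAX sketch of the block recursion in Theorem 3.7 using jax.lax.scan. It computes only the product WV (the gating, output projection, and straight-through codebook path are omitted), exponentiates directly rather than using the numerically stable reformulation just mentioned, and all names are ours rather than from the open-sourced implementation.

```python
import jax
import jax.numpy as jnp

def vq_attn_wv_linear(q, k_hat, z, v, b_local, codebook):
    """Blockwise computation of softmax(Q K_hat^T + B) V via Theorem 3.7.

    q, k_hat, v: [T, Dk], [T, Dk], [T, Dv]; z: [T] shortcodes of the keys.
    b_local: [T//L, L, 2L] biases [B(n,n-1), B(n,n)] for each block n, with
             -inf above the diagonal of B(n,n) and, for n = 0, -inf in the
             first L columns (no previous block exists).
    """
    S = codebook.shape[0]
    n_blocks, block_len = b_local.shape[0], b_local.shape[1]
    T, dv = v.shape
    blk = lambda m: m.reshape(n_blocks, block_len, *m.shape[1:])
    qb, kb, vb, db = blk(q), blk(k_hat), blk(v), blk(jax.nn.one_hot(z, S))
    # Keys/values of the previous block for each block (zeros for block 0).
    kb_prev = jnp.concatenate([jnp.zeros_like(kb[:1]), kb[:-1]])
    vb_prev = jnp.concatenate([jnp.zeros_like(vb[:1]), vb[:-1]])

    def step(carry, xs):
        u_prev2, u_prev1, l_prev2, l_prev1 = carry   # U(n-2), U(n-1), L(n-2), L(n-1)
        q_n, k_n, v_n, d_n, b_n, k_prev, v_prev = xs
        scores_cache = jnp.exp(q_n @ codebook.T)                              # [L, S]
        scores_local = jnp.exp(q_n @ jnp.concatenate([k_prev, k_n]).T + b_n)  # [L, 2L]
        v_local = jnp.concatenate([v_prev, v_n])                              # [2L, Dv]
        av = scores_cache @ u_prev2 + scores_local @ v_local                  # Eqs. (28)-(30)
        a1 = scores_cache @ l_prev2 + jnp.sum(scores_local, axis=-1)          # Eqs. (31)-(33)
        u_n = u_prev1 + d_n.T @ v_n                                           # Eq. (26)
        l_n = l_prev1 + jnp.sum(d_n, axis=0)                                  # Eq. (27)
        return (u_prev1, u_n, l_prev1, l_n), av / a1[:, None]                 # Eq. (34)

    init = (jnp.zeros((S, dv)), jnp.zeros((S, dv)), jnp.zeros(S), jnp.zeros(S))
    _, wv = jax.lax.scan(step, init, (qb, kb, vb, db, b_local, kb_prev, vb_prev))
    return wv.reshape(T, dv)
```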

3.4 LEARNING ALGORITHM

3.4.1 TRAINING LOSS


Let θ denote the set of non-codebook parameters of a transformer with N VQ-Attention layers, and let C = {C^(ℓ)}_{ℓ=0}^{N−1} denote the set of the layers' codebooks. For autoregressive modeling of a sequence X = {x_t}_{t=0}^{T}, we define the Transformer-VQ training loss as

    L(X; θ, C) = L_CE(X; θ, C) + β L_VQ(X; θ, C)                                    (35)

where β > 0 is a hyperparameter known as the commit loss coefficient, and

    L_CE(X; θ, C) ≜ −(1/T) Σ_{t=0}^{T−1} ln p(x_{t+1} | x_{≤t}, θ, C)                (36)

    L_VQ(X; θ, C) ≜ (1/T) Σ_{t=0}^{T−1} Σ_{ℓ=0}^{N−1} ||K_t^(ℓ) − SG(C^(ℓ)_{z_t})||_2^2.   (37)

Thus, the training loss is the average next-token cross-entropy loss, plus the average token’s commit-
ment losses (van den Oord et al., 2017), summed over layer codebooks. Non-codebook parameters
θ receive a gradient from both loss terms. Following van den Oord et al. (2017), codebooks are
parameterized via smoothed quantizer statistics.
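For completeness, a minimal sketch of one smoothed-statistics (EMA / streaming k-means) codebook update for a single layer, in the spirit of van den Oord et al. (2017); the decay value, epsilon, and all names here are illustrative rather than taken from our released configuration:

```python
import jax
import jax.numpy as jnp

def ema_codebook_update(codebook_sums, codebook_counts, k, z, gamma=0.99, eps=1e-5):
    """One streaming k-means (EMA) update of a single layer's codebook.

    codebook_sums:   [S, D] smoothed sums of keys assigned to each codeword.
    codebook_counts: [S]    smoothed assignment counts.
    k: [T, D] keys in the current update window; z: [T] their shortcodes.
    Returns updated (sums, counts, codebook).
    """
    S = codebook_counts.shape[0]
    onehot = jax.nn.one_hot(z, S)                    # [T, S]
    batch_counts = jnp.sum(onehot, axis=0)           # [S]
    batch_sums = onehot.T @ k                        # [S, D]
    counts = gamma * codebook_counts + (1.0 - gamma) * batch_counts
    sums = gamma * codebook_sums + (1.0 - gamma) * batch_sums
    # Codewords are the smoothed means; eps avoids division by zero for
    # codewords that have not yet been assigned any keys.
    codebook = sums / (counts[:, None] + eps)
    return sums, counts, codebook
```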

3.4.2 TRAINING UPDATES


Instead of updating on the full sequence loss given above, we generally update every K query blocks,
where LK ≪ T , which resembles a strategy used in prior works (Dai et al., 2019; Wu et al., 2022;
Hutchins et al., 2022).
Each update is obtained by backpropagating through a window of LK timesteps, with gradients
computed from the corresponding terms in the per-token average losses above. Codebooks are also
updated every K query blocks.
When K = 1, using Theorem 3.7 is an efficient alternative to using a non-differentiable long-range
key-value cache. When K > 1, a learning signal is sent through any value vectors added to the
compressed cache within the backpropagation window.
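Schematically, this update procedure can be sketched as follows; loss_fn, the optimizer, and all names here are placeholders rather than the released training code, and the cache entering each window is treated as a constant via stop-gradient so that backpropagation covers only the LK timesteps of the window:

```python
import jax
import optax  # assumption: any gradient-based optimizer library would do here

def train_on_sequence(params, opt_state, cache, x, loss_fn, optimizer, block_len, k_blocks):
    """Schematic update loop: one gradient step per window of K query blocks.

    loss_fn(params, cache, window) -> (mean per-token loss, new cache) is a
    placeholder assumed to run VQ-Attention over the window, carrying the
    compressed cache forward between windows.
    """
    window_len = block_len * k_blocks
    num_windows = x.shape[0] // window_len
    for w in range(num_windows):
        window = x[w * window_len:(w + 1) * window_len]
        # Incoming cache is constant for this window; gradients do not cross windows.
        cache = jax.tree_util.tree_map(jax.lax.stop_gradient, cache)
        (loss, cache), grads = jax.value_and_grad(loss_fn, has_aux=True)(params, cache, window)
        updates, opt_state = optimizer.update(grads, opt_state, params)
        params = optax.apply_updates(params, updates)
        # Codebooks would also be refreshed here, e.g., via the EMA sketch above.
    return params, opt_state, cache
```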

4 RELATED WORK
4.1 HIERARCHICAL ATTENTION

Combiner (Ren et al., 2021) proposes an approximation of softmax using a simple graphical model,
and parameterizes its internal probabilities using max-pooling over query/key features, enabling
decoder-only self-attention in subquadratic time. H-Transformer-1D (Zhu & Soricut, 2021) uses
average-pooling operations over queries/keys to reduce the complexity of encoder-only self-attention.
Transformer-LS (Zhu et al., 2021) uses dynamic projections to downsample long-range features in
transformers by a user-specified factor. Hourglass Transformer (Nawrot et al., 2021) and MegaByte
(Yu et al., 2023) eschew pooling in favor of convolutions or reshaping for temporal downsampling,
and apply these techniques to reduce computation in the interior layers of decoder-only transformers.
Transformer-VQ differs from these works in that it uses vector quantization (VQ), a well-understood
method for compression, instead of newly-designed heuristic methods. In addition, it does not rely
on token contiguity to guide the compression process. Instead, it utilizes an equivalence to dense
attention. Notably, Transformer-VQ is easier to sample from compared to previous hierarchical
attention models; since the cache update logic can be equivalently applied every token instead of
every L tokens, there are no sporadic ‘feature consolidation’ operations required during sampling.

4.2 KERNELIZABLE ATTENTION

Kernelizable attention (Katharopoulos et al., 2020; Choromanski et al., 2021) computes query and key
features and applies the same nonlinearity to both of them separately, omitting additional nonlinearities
when computing attention weights. By using the associativity of matrix multiplication, kernelized
attention reduces attention to linear complexity. Transformer-VQ is distinguished from kernelizable
attention through an asymmetric treatment of queries and keys, a deterministic equivalence to softmax-
based attention, training stability, and strong quantitative results on long-context autoregressive
modeling benchmarks.


Clustering attention (Vyas et al., 2020) uses vector-quantized queries and is also kernelizable. However, it requires learning per-layer codebooks for each sequence and uses a modified form of Lloyd's iterations based on Hamming distance and locality-sensitive hashing. This yields a complex algorithm that is only suitable for non-causal attention and is slow on TPUs. Transformer-VQ is strongly differentiated from clustering attention by its simplicity, applicability to decoder-only tasks, efficiency on TPUs, and large-scale experimental validation.

4.3 COMPRESSIVE ATTENTION

Compressive Transformers (Rae et al., 2020) directly learn a compression function for long-range
features. LUNA (Ma et al., 2021) and Recurrent Transformers (Bulatov et al., 2022; Hutchins
et al., 2022) use cross-attention to compress long-range features into a recurrent state. Notably,
our model implements a kind of block-recurrent mechanism for its cache, but is significantly more
parameter-efficient than the mechanisms proposed by Ma et al. (2021); Hutchins et al. (2022).
More generally, Transformer-VQ differs from compressive/recurrent transformers in that it has an
equivalence to quadratic-time attention over vector-quantized keys. In other words, if the keys are
already vector-quantized, the Transformer-VQ cache losslessly reduces the cost to linear time.
Perceivers (Jaegle et al., 2021; Hawthorne et al., 2022) use cross-attention to attend to long sequences,
and compute self-attention over only a narrow stack of ‘latents’. Transformer-VQ differs from
Perceivers in that it computes dense self-attention in linear time, instead of just cross-attention. Thus,
while Perceivers’ long-range layers incur a quadratic time complexity during sampling, Transformer-
VQ generates sequences in linear time.

4.4 GATED SEQUENCE MODELS

Gated attention was introduced in FLASH (Hua et al., 2022) as a fusion of attention sublayers
(Vaswani et al., 2017) and GLU-based MLP sublayers (Shazeer, 2020). Various gating mechanisms
have previously been used to stabilize training of transformers (Parisotto et al., 2019) and other
sequence models including S4 (Gu et al., 2022), GSS (Mehta et al., 2022), MEGA (Ma et al., 2023)
and RWKV (Peng et al., 2023). Transformer-VQ uses the original gating formulation from Hua et al.
(2022), and develops a new attention mechanism.

4.5 VQ, K-MEANS, AND BEYOND

Ideas relating to k-means, vector quantization, and/or codebooks have also been applied in transform-
ers for sparse attention (Roy et al., 2021; Wang et al., 2021; 2022), feature learning (Mao et al., 2022;
Roy et al., 2022), sparsely-activated MLPs (Lample et al., 2019), and expert selection (Roller et al.,
2021). These works generally feature codebooks or similar within a transformer architecture. Several
works also have proposed models that feature a codebook somewhere outside a transformer, e.g.,
when transformers are priors for VQ-VAEs (Kaiser et al., 2018; Dhariwal et al., 2020; Ramesh et al.,
2021; Lee et al., 2022; Zhou et al., 2022). Transformer-VQ uses one codebook within each layer and,
in contrast to all of the aforementioned works, computes dense self-attention in linear time.
Transformer-VQ is not directly related to methods that quantize the weights of a transformer, e.g., Dettmers et al. (2022); Dettmers & Zettlemoyer (2023); Frantar et al. (2023). Such methods are typically applied after training, and do not reduce the complexity of self-attention. However, we expect these approaches may prove complementary during inference.

5 EXPERIMENTS

For all experiments, we use a TPU v3-128 accelerator (Jouppi et al., 2017). Hyperparameters
follow Appendix B unless specifically noted. For efficient training on TPUs, Transformer-VQ was
implemented using Jax (Bradbury et al., 2018) and Flax (Heek et al., 2023).


5.1 ABLATION STUDIES

5.1.1 CODEBOOK SIZE

Larger codebook sizes may allow more flexible attention patterns and could improve the fidelity of
the gradients, both of which are likely to benefit model quality at the expense of additional wall time.
To investigate, we ablate the codebook size S using the Enwik8 dataset (see § 5.2.1), and report the
lowest validation bits-per-byte (BPB, lower is better) obtained by each model in Table 1.

Table 1: Codebook size ablations.

    Setting     Val. BPB    Rel. Time
    S = 256     1.010       0.927
    S = 512     1.005       1.0
    S = 1024    1.000       1.109

Table 2: Long-range cache ablations.

    Setting                 Val. BPB    Rel. Time
    No long-range cache     1.026       0.836
    Long-range cache        1.010       0.927

Table 1 confirms the intuition that larger codebooks improve the prediction quality (lower BPB) in
return for additional wall time per training step. In particular, increasing the codebook size by a factor
of two appears to decrease the validation BPB by about 0.005 and increase the wall time per step by a
factor of about 1.1. A formal characterization of the scaling laws (Kaplan et al., 2020) for codebook
size could be an interesting direction for future work.

5.1.2 LONG-RANGE CACHE

Since our model has several architectural differences from most prior works, the benefit of the
long-range cache must be shown directly. To investigate, we train a model with the long-range cache
omitted, using codebook size S = 256. We report the validation BPB for Enwik8 in Table 2.
As shown in Table 2, removing the long-range cache reduces the wall time per step by a factor
of about 1.1, but leads to a significant drop in quality (higher bits-per-byte). This confirms the
importance of our long-range cache mechanism.

5.2 QUANTITATIVE RESULTS

To assess the ability of Transformer-VQ to learn long-range dependencies, we now conduct a series
of large-scale experiments, benchmarking on several long-range autoregressive modeling tasks. For
fair comparison, we only benchmark against models (a) trained without using any extra data or
augmentation, and (b) evaluated with fixed parameters. In all cases, we use codebook size S = 512.

5.2.1 ENWIK8

Enwik8 is a byte-level language modeling dataset consisting of 100 million bytes of unprocessed English-language Wikipedia articles (Mahoney, 2011), with long-term dependencies that may span tens of thousands of bytes. Per convention, it is split into train, validation, and test sets of 90 million, 5 million, and 5 million bytes, respectively (Child et al., 2019; Rae et al., 2020).
For this dataset, we trained a Transformer-VQ with 190M parameters, smaller than the model by Dai et al. (2019). We report test bits-per-byte (BPB) in Table 3. Transformer-VQ obtains a BPB of 0.99, notably matching the result of Transformer-XL (Dai et al., 2019), while using an entirely different cache mechanism not based on position and also shorter in length at test time.

Table 3: Test bits-per-byte on Enwik8.

    Model                                       BPB
    Ma et al. (2023) - Mega                     1.02
    Dai et al. (2019) - XL                      0.99
    Child et al. (2019) - Sparse                0.99
    Beltagy et al. (2020) - Longform.           0.99
    Roy et al. (2021) - Routing                 0.99
    Sukhbaatar et al. (2019a) - Adapt. Sp.      0.98
    Sukhbaatar et al. (2019b) - All-Attn.       0.98
    Nawrot et al. (2021) - Hourglass            0.98
    Rae et al. (2020) - Compress.               0.97
    Zhu et al. (2021) - Long-Short              0.97
    Fan et al. (2020b) - Feedback               0.96
    Lei (2021) - SRU++                          0.95
    Sukhbaatar et al. (2021) - Expire Sp.       0.95
    Lutati et al. (2023) - Focus Attn.          0.94
    Transformer-VQ                              0.99


Table 4: Test word-level perplexity on PG-19.

    Model                                       WLP
    Yu et al. (2023) - MegaByte                 36.4
    Rae et al. (2020) - XL                      36.3
    Rae et al. (2020) - Compressive             33.6
    Roy et al. (2021) - Routing                 33.2
    Hawthorne et al. (2022) - Perceiver AR      28.9
    Hutchins et al. (2022) - Block-Recur.       26.5
    Transformer-VQ                              26.6

Table 5: Validation bits-per-byte on ImageNet64.

    Model                                       BPB
    Ren et al. (2021) - Combiner                3.42
    Kingma et al. (2021) - VDM                  3.40
    Hawthorne et al. (2022) - Perceiver AR      3.40
    Yu et al. (2023) - MegaByte                 3.40
    Grcic et al. (2021) - DenseFlow             3.35
    Lipman et al. (2023) - Flow Matching        3.31
    Hazami et al. (2022) - Efficient VDVAE      3.30
    Transformer-VQ                              3.16

On Enwik8, we found overfitting was a significant issue, and due to the compressive cache mechanism, using attention dropout was not possible. Sweeping over the residual dropout rate, weight decay coefficient, and layerdrop (Fan et al., 2020a) rate, we found a setting yielding good generalization. Nonetheless, Transformer-VQ does fall short of state-of-the-art here, as several works using complex recurrence or forgetting mechanisms obtain better Enwik8 results.

5.2.2 PG-19

PG-19 is an open-vocabulary language modeling dataset consisting of 11 gigabytes of text from over 28,000 freely-available Project Gutenberg books published prior to 1919 (Rae et al., 2020). The average number of words per book is nearly 70,000, enabling the learning of long-term dependencies, especially in novels (Sun et al., 2021; Hutchins et al., 2022).
For this dataset, we trained a Transformer-VQ with 1.3B parameters, similar to the largest model by
Hutchins et al. (2022). Since PG-19 is an open-vocabulary dataset, we first learned a SentencePiece
vocabulary (Kudo & Richardson, 2018) of size 32,000 using the BPE method. Following the
calculations of Rae et al. (2020), we report the test set word-level perplexity (WLP) in Table 4.
Transformer-VQ obtains a WLP of 26.6, very close to the state-of-the-art by Hutchins et al. (2022).
Interestingly, since our Transformer-VQ design is equivalent to using dense self-attention with vector-
quantized keys, our strong result shows that models using long-range attention only (no recurrence)
can also be highly competitive on PG-19, which reaffirms the efficacy of standalone self-attention as
a method for sequence processing at scale.

5.2.3 IMAGENET64

ImageNet64 is an image dataset consisting of over 1.2 million images downsampled to 64x64 resolution (Chrabaszcz et al., 2017; Deng et al., 2009). Flattening the images yields an autoregressive density estimation task on sequences of over 12,000 bytes each. Note that since the official test set is not public for this dataset, we report results on the official validation set; for validation during training, we used a held-out set of about 80,000 examples from the training split.
For this dataset, we trained a Transformer-VQ with 1.2B parameters, similar to the PG-19 model. We
report the bits-per-byte on the official validation set in Table 5. Several of the earlier baselines used
an earlier variant of downsampled ImageNet prepared by van den Oord et al. (2016) with a different
downsampling algorithm. Since that variant has been unavailable through official channels for about
a year, we used the newer variant following Lipman et al. (2023). We emphasize that our results
using the newer variant cannot be directly compared with baselines using the earlier variant; however,
due to several reporting ambiguities, Table 5 does not symbolically distinguish variants used.
Transformer-VQ obtains a BPB of 3.16, significantly improving on prior results reported by Hazami
et al. (2022); Lipman et al. (2023). Our model has 7x more parameters than the one by Hazami et al.
(2022), but thanks to the large dataset it showed no signs of overfitting. Our favorable results on this
dataset show that the Transformer-VQ architecture can be directly applied to other modalities beyond
natural language, which we attribute to its efficient emulation of the standard transformer’s flexible
attention patterns.


5.3 QUALITATIVE ANALYSIS

We provide extensive samples for all models in Appendix C.

5.3.1 IMAGENET64

Figure 2: Minibatch of generated samples from our unconditional ImageNet64 model; nucleus 0.999.

We generate batches of 128 sequences using nucleus sampling (Holtzman et al., 2020). Figures 1-2
show a subset of samples with the same indices from two batches with different nucleus settings.
Many of the samples between the two batches are perceptually similar, which is a consequence of
using the same random seed to directly observe the impact of the nucleus sampling hyperparameter.
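For reference, nucleus sampling retains only the smallest set of tokens whose cumulative probability reaches the nucleus value p; a generic sketch of the per-step filtering (not our released sampler) is:

```python
import jax
import jax.numpy as jnp

def nucleus_sample(key, logits, p=0.999):
    """Sample one token id from the smallest set of tokens whose total
    probability mass is at least p (Holtzman et al., 2020)."""
    probs = jax.nn.softmax(logits)
    order = jnp.argsort(-probs)                # most probable tokens first
    sorted_probs = probs[order]
    cumulative = jnp.cumsum(sorted_probs)
    # Keep every token whose preceding cumulative mass is below p; this keeps the
    # minimal prefix whose total mass reaches p.
    keep = cumulative - sorted_probs < p
    filtered = jnp.where(keep, sorted_probs, 0.0)
    filtered = filtered / jnp.sum(filtered)
    idx = jax.random.categorical(key, jnp.log(filtered))
    return order[idx]
```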
In Figure 1, we observe that our unconditional ImageNet64 model can synthesize sequences of over
12,000 bytes and appears to be capable of depicting relatively high-fidelity ocean water, shorelines,
leaves, insects, trees, animals, people, mountains, and architecture.
The model does make some mistakes, particularly involving perspective or object identity. For
instance, in the second row of Figure 1, the rightmost image appears to be a bird wearing a shell,
while in the first row of Figure 2, the rightmost image appears to be a wooden galleon with legs. It
is unclear if these effects are due to vector quantization or lack of image-specific inductive biases.
Interestingly, we have not used separate embeddings to specify the row, column, and color channel
to the model, which is in contrast to some prior works (Child et al., 2019; Hawthorne et al., 2022).
Finally, while some mistakes dissipate when using nucleus 0.999, some new ones do appear; one
possible explanation is that using a fixed nucleus is suboptimal for images.

5.3.2 PG-19
In Figure 3, we observe that our PG-19 model can synthesize relatively high-quality text, maintaining
a consistent tone, remaining on topic, and generating reasonably coherent content. These qualitative
observations were found to hold for the vast majority of the samples we generated.
The excerpt shown was preceded by a book title ‘Elementary Photography’, a non-existent author’s
name, and publisher information. Though this information was synthesized by the model, it suggests
the model may be amenable to generating text on a particular topic simply by conditioning on a
prompt, similar to larger language models.

6 CONCLUSION

Transformer-VQ is a decoder-only transformer architecture that computes softmax-based dense self-attention in linear time with respect to sequence length. Its efficient attention is enabled by vector-quantized keys and a new truncation-free fixed-size cache. Our large-scale experiments show Transformer-VQ is an efficient and flexible autoregressive model with successful applications to byte-level language modeling, open-vocabulary language modeling, and image synthesis. Future work directions include formal scaling laws, scaling to even larger models, and applying Transformer-VQ to long-context program synthesis and reinforcement learning tasks.


REPRODUCIBILITY STATEMENT

To facilitate reproducibility, our attention mechanism is described mathematically in Section 3, our hyperparameters and other implementation details are given in Appendix B, and our implementation is open-sourced at the link in the abstract.

ACKNOWLEDGMENTS
We thank the anonymous reviewers for helpful feedback on this work, and acknowledge the Python
community, especially the Jax ecosystem contributors, for effective libraries used in this project. This
work was generously supported by Cloud TPUs from Google’s TPU Research Cloud (TRC).

REFERENCES
Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham,
Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. ETC: Encoding long and struc-
tured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP), pp. 268–284, Online, November 2020. Associ-
ation for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.19. URL https:
//aclanthology.org/2020.emnlp-main.19.
Jimmy Lei Ba, Jamie Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. URL https:
//arxiv.org/abs/1607.06450.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer.
CoRR, abs/2004.05150, 2020. URL https://arxiv.org/abs/2004.05150.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal
Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and
Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL
http://github.com/google/jax.
Aydar Bulatov, Yuri Kuratov, and Mikhail Burtsev. Recurrent memory transformer. In Alice H. Oh,
Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information
Processing Systems, 2022. URL https://openreview.net/forum?id=Uynr3iPhksa.
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse
transformers. CoRR, abs/1904.10509, 2019. URL http://arxiv.org/abs/1904.10509.
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea
Gane, Tamás Sarlós, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser,
David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with
performers. In International Conference on Learning Representations, 2021. URL https:
//openreview.net/forum?id=Ua6zuk0WRH.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh,
Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam
Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James
Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Lev-
skaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin
Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph,
Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M.
Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon
Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark
Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean,
Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022. URL
https://arxiv.org/abs/2204.02311.
Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an
alternative to the CIFAR datasets. CoRR, abs/1707.08819, 2017. URL http://arxiv.org/
abs/1707.08819.


Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov.
Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the
57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, Florence,
Italy, jul 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1285. URL
https://aclanthology.org/P19-1285.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hier-
archical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition,
pp. 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: k-bit inference scaling laws, 2023.
URL http://arxiv.org/abs/2212.09720.
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix
multiplication for transformers at scale. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave,
and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL
https://openreview.net/forum?id=dXiGWqBoxaD.
Prafulla Dhariwal, Heewoo Jun, Christine McLeavey Paine, Jong Wook Kim, Alec Radford, and Ilya
Sutskever. Jukebox: A generative model for music, 2020. URL https://arxiv.org/abs/
2005.00341.
Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network
function approximation in reinforcement learning. CoRR, abs/1702.03118, 2017. URL http:
//arxiv.org/abs/1702.03118.
Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with
structured dropout. In International Conference on Learning Representations, 2020a. URL
https://openreview.net/forum?id=SylO2yStDr.
Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, and Sainbayar Sukhbaatar. Addressing
some limitations of transformers with feedback memory, 2020b. URL https://arxiv.org/
abs/2002.09402.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate quantization
for generative pre-trained transformers. In The Eleventh International Conference on Learning
Representations, 2023. URL https://openreview.net/forum?id=tcbBPnfwxS.
Matej Grcic, Ivan Grubisic, and Sinisa Segvic. Densely connected normalizing flows. In
M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Ad-
vances in Neural Information Processing Systems, volume 34, pp. 23968–23982. Curran Asso-
ciates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/
2021/file/c950cde9b3f83f41721788e3315a14a3-Paper.pdf.
Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured
state spaces. In International Conference on Learning Representations, 2022. URL https:
//openreview.net/forum?id=uYLFoz1vlAC.
Ruiqi Guo, Quan Geng, David Simcha, Felix Chern, Sanjiv Kumar, and Xiang Wu. New loss
functions for fast maximum inner product search. CoRR, abs/1908.10396, 2019. URL http:
//arxiv.org/abs/1908.10396.
Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Ma-
teusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah
Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, Joao Carreira, and Jesse Engel. General-
purpose, long-context autoregressive modeling with Perceiver AR. In Kamalika Chaudhuri,
Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceed-
ings of the 39th International Conference on Machine Learning, volume 162 of Proceed-
ings of Machine Learning Research, pp. 8535–8558. PMLR, 17–23 Jul 2022. URL https:
//proceedings.mlr.press/v162/hawthorne22a.html.
Louay Hazami, Rayhane Mama, and Ragavan Thurairatnam. Efficient-VDVAE: Less is more, 2022.
URL http://arxiv.org/abs/2203.13751.


Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas
Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2023. URL
http://github.com/google/flax.
Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key
normalization for transformers. In Findings of the Association for Computational Linguistics:
EMNLP 2020, pp. 4246–4253, Online, November 2020. Association for Computational Linguistics.
doi: 10.18653/v1/2020.findings-emnlp.379. URL https://aclanthology.org/2020.
findings-emnlp.379.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text
degeneration. In International Conference on Learning Representations, 2020. URL https:
//openreview.net/forum?id=rygGQyrFvH.
Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer quality in linear time. In
Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato
(eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of
Proceedings of Machine Learning Research, pp. 9099–9117. PMLR, 17–23 Jul 2022. URL
https://proceedings.mlr.press/v162/hua22a.html.
DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. Block-recurrent
transformers. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.),
Advances in Neural Information Processing Systems, 2022. URL https://openreview.
net/forum?id=uloenYmLCAo.
Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Car-
reira. Perceiver: General perception with iterative attention. In Marina Meila and Tong Zhang
(eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of
Proceedings of Machine Learning Research, pp. 4651–4664. PMLR, 18–24 Jul 2021. URL
https://proceedings.mlr.press/v139/jaegle21a.html.
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa,
Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford
Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir
Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug
Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander
Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon,
James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean,
Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray
Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan
Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham,
Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma,
Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon.
In-datacenter performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News,
45(2):1–12, jun 2017. ISSN 0163-5964. doi: 10.1145/3140659.3080246. URL https://doi.
org/10.1145/3140659.3080246.
Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam
Shazeer. Fast decoding in sequence models using discrete latent variables. In Jennifer Dy and
Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning,
volume 80 of Proceedings of Machine Learning Research, pp. 2390–2399. PMLR, 10–15 Jul 2018.
URL https://proceedings.mlr.press/v80/kaiser18a.html.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.
CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are
RNNs: Fast autoregressive transformers with linear attention. In Hal Daumé III and Aarti Singh
(eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of
Proceedings of Machine Learning Research, pp. 5156–5165. PMLR, 13–18 Jul 2020. URL
https://proceedings.mlr.press/v119/katharopoulos20a.html.


Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models.
In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Ad-
vances in Neural Information Processing Systems, volume 34, pp. 21696–21707. Curran Asso-
ciates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/
2021/file/b578f2a52a0229873fefc2a4b06377fa-Paper.pdf.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In
International Conference on Learning Representations, 2020. URL https://openreview.
net/forum?id=rkgNKkHtvB.

Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword
tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71, Brussels,
Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-2012.
URL https://aclanthology.org/D18-2012.

Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve
Jegou. Large memory layers with product keys. In H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems,
volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/
paper/2019/file/9d8df73a3cfbf3c5b47bc9b50f214aff-Paper.pdf.

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive
image generation using residual quantization. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), pp. 11523–11532, June
2022. URL https://openaccess.thecvf.com/content/CVPR2022/html/
Lee_Autoregressive_Image_Generation_Using_Residual_Quantization_
CVPR_2022_paper.html.

James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon. FNet: Mixing tokens with
Fourier transforms. In Proceedings of the 2022 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pp. 4296–4313,
Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/
2022.naacl-main.319. URL https://aclanthology.org/2022.naacl-main.319.

Tao Lei. When attention meets fast recurrence: Training language models with reduced com-
pute. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, pp. 7633–7648, Online and Punta Cana, Dominican Republic, November 2021.
Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.602. URL
https://aclanthology.org/2021.emnlp-main.602.

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow
matching for generative modeling. In The Eleventh International Conference on Learning Repre-
sentations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and
Noam Shazeer. Generating wikipedia by summarizing long sequences. In International Confer-
ence on Learning Representations, 2018. URL https://openreview.net/forum?id=
Hyg0vbWC-.

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios
Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance
hypothesis for LLM KV cache compression at test time, 2023. URL http://arxiv.org/
abs/2305.17118.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Confer-
ence on Learning Representations, 2019. URL https://openreview.net/forum?id=
Bkg6RiCqY7.

Shahar Lutati, Itamar Zimerman, and Lior Wolf. Focus your attention (with adaptive IIR filters),
2023. URL http://arxiv.org/abs/2305.14952.


Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettle-
moyer. LUNA: Linear unified nested attention. In A. Beygelzimer, Y. Dauphin, P. Liang, and
J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL
https://openreview.net/forum?id=GWRkOYr4jxQ.
Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan
May, and Luke Zettlemoyer. Mega: Moving average equipped gated attention. In The Eleventh
International Conference on Learning Representations, 2023. URL https://openreview.
net/forum?id=qNLe3iq2El.
Matt Mahoney. Large text compression benchmark, 2011. URL: http://mattmahoney.net/
dc/text.html.
Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, and Irfan Essa.
Discrete representations strengthen vision transformer robustness. In International Confer-
ence on Learning Representations, 2022. URL https://openreview.net/forum?id=
8hWs60AZcWk.
Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. Long range language modeling
via gated state spaces, 2022. URL http://arxiv.org/abs/2206.13947.
Piotr Nawrot, Szymon Tworkowski, Michal Tyrolski, Lukasz Kaiser, Yuhuai Wu, Christian Szegedy,
and Henryk Michalewski. Hierarchical transformers are more efficient language models. CoRR,
abs/2110.13711, 2021. URL https://arxiv.org/abs/2110.13711.
Piotr Nawrot, Jan Chorowski, Adrian Łańcucki, and Edoardo M. Ponti. Efficient transformers with
dynamic token pooling, 2023. URL http://arxiv.org/abs/2211.09761.
Emilio Parisotto, H. Francis Song, Jack W. Rae, Razvan Pascanu, Çaglar Gülçehre, Siddhant M.
Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, Matthew M.
Botvinick, Nicolas Heess, and Raia Hadsell. Stabilizing transformers for reinforcement learning.
CoRR, abs/1910.06764, 2019. URL http://arxiv.org/abs/1910.06764.
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin
Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw
Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri,
Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stansilaw Wozniak,
Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, and Rui-Jie Zhu. RWKV:
Reinventing RNNs for the transformer era, 2023. URL http://arxiv.org/abs/2305.
13048.
Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua
Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional
language models, 2023. URL http://arxiv.org/abs/2302.10866.
Jiezhong Qiu, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang. Blockwise self-
attention for long document understanding. In Findings of the Association for Computational
Linguistics: EMNLP 2020, pp. 2555–2565, Online, November 2020. Association for Computational
Linguistics. doi: 10.18653/v1/2020.findings-emnlp.232. URL https://aclanthology.
org/2020.findings-emnlp.232.

Markus N. Rabe and Charles Staats. Self-attention does not need o(n2 ) memory. CoRR,
abs/2112.05682, 2021. URL https://arxiv.org/abs/2112.05682.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
Sutskever. Language models are unsupervised multitask learners, 2019. https:
//cdn.openai.com/better-language-models/language_models_are_
unsupervised_multitask_learners.pdf Last visited on 2023/09/07.
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Com-
pressive transformers for long-range sequence modelling. In International Conference on Learning
Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH.


Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song,
John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan,
Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks,
Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron
Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu,
Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen
Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro,
Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-
Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas
Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li,
Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy,
Chris Jones, James Bradbury, Matthew J. Johnson, Blake A. Hechtman, Laura Weidinger, Iason
Gabriel, William S. Isaac, Edward Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol
Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu,
and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training gopher.
CoRR, abs/2112.11446, 2021. URL https://arxiv.org/abs/2112.11446.

Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. CoRR,
abs/1710.05941, 2017. URL http://arxiv.org/abs/1710.05941.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen,
and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021. URL
https://arxiv.org/abs/2102.12092.

Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans,
and Bo Dai. Combiner: Full attention transformer with sparse computation cost. In
M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Ad-
vances in Neural Information Processing Systems, volume 34, pp. 22470–22482. Curran Asso-
ciates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/
2021/file/bd4a6d0563e0604510989eb8f9ff71f5-Paper.pdf.

Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason E Weston. Hash layers for large sparse
models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in
Neural Information Processing Systems, 2021. URL https://openreview.net/forum?
id=lMgDDWb1ULW.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse
attention with routing transformers. Transactions of the Association for Computational Linguistics,
9:53–68, 2021. doi: 10.1162/tacl_a_00353. URL https://aclanthology.org/2021.tacl-1.4.

Aurko Roy, Rohan Anil, Guangda Lai, Benjamin Lee, Jeffrey Zhao, Shuyuan Zhang, Shibo Wang,
Ye Zhang, Shen Wu, Rigel Swavely, Yu Tao, Phuong Dao, Christopher Fifty, Zhifeng Chen,
and Yonghui Wu. N-Grammer: Augmenting transformers with latent n-grams, 2022. URL
https://arxiv.org/abs/2207.06366.

Noam Shazeer. GLU variants improve transformer, 2020. URL https://arxiv.org/abs/2002.05202.

Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost.
CoRR, abs/1804.04235, 2018. URL http://arxiv.org/abs/1804.04235.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive attention span
in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, pp. 331–335, Florence, Italy, July 2019a. Association for Computational Linguistics.
doi: 10.18653/v1/P19-1032. URL https://aclanthology.org/P19-1032.

Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Hervé Jégou, and Armand Joulin.
Augmenting self-attention with persistent memory. CoRR, abs/1907.01470, 2019b. URL
http://arxiv.org/abs/1907.01470.


Sainbayar Sukhbaatar, Da Ju, Spencer Poff, Stephen Roller, Arthur Szlam, Jason Weston, and Angela
Fan. Not all memories are created equal: Learning to forget by expiring. In Marina Meila and Tong
Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume
139 of Proceedings of Machine Learning Research, pp. 9902–9912. PMLR, 18–24 Jul 2021. URL
https://proceedings.mlr.press/v139/sukhbaatar21a.html.

Simeng Sun, Kalpesh Krishna, Andrew Mattarella-Micke, and Mohit Iyyer. Do long-range language
models actually use long-range context? In Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, pp. 807–822, Online and Punta Cana, Dominican
Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.
emnlp-main.62. URL https://aclanthology.org/2021.emnlp-main.62.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. CoRR,
abs/2002.11296, 2020a. URL https://arxiv.org/abs/2002.11296.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. CoRR,
abs/2009.06732, 2020b. URL https://arxiv.org/abs/2009.06732.

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer:
Rethinking self-attention for transformer models. In Marina Meila and Tong Zhang (eds.),
Proceedings of the 38th International Conference on Machine Learning, volume 139 of Pro-
ceedings of Machine Learning Research, pp. 10183–10192. PMLR, 18–24 Jul 2021. URL
https://proceedings.mlr.press/v139/tay21a.html.

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learn-
ing. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and
R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran As-
sociates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/
7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf.

Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural net-
works. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd
International Conference on Machine Learning, volume 48 of Proceedings of Machine Learn-
ing Research, pp. 1747–1756, New York, New York, USA, 20–22 Jun 2016. PMLR. URL
https://proceedings.mlr.press/v48/oord16.html.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon,
U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett
(eds.), Advances in Neural Information Processing Systems, volume 30. Curran Asso-
ciates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Apoorv Vyas, Angelos Katharopoulos, and François Fleuret. Fast transformers with clustered
attention. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Ad-
vances in Neural Information Processing Systems, volume 33, pp. 21665–21674. Curran As-
sociates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/
f6a8dd1c954c8506aadc764cc32b895e-Paper.pdf.

Ningning Wang, Guobing Gan, Peng Zhang, Shuai Zhang, Junqiu Wei, Qun Liu, and Xin Jiang.
ClusterFormer: Neural clustering attention for efficient and effective transformer. In Proceedings of
the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pp. 2390–2402, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:
10.18653/v1/2022.acl-long.170. URL https://aclanthology.org/2022.acl-long.
170.

Shuohang Wang, Luowei Zhou, Zhe Gan, Yen-Chun Chen, Yuwei Fang, Siqi Sun, Yu Cheng,
and Jingjing Liu. Cluster-former: Clustering-based sparse transformer for question answering.
In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3958–
3968, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.
findings-acl.346. URL https://aclanthology.org/2021.findings-acl.346.


Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with
linear complexity. CoRR, abs/2006.04768, 2020. URL https://arxiv.org/abs/2006.
04768.

Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing
transformers. In International Conference on Learning Representations, 2022. URL https:
//openreview.net/forum?id=TrjbxzRcnf-.

Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and
Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. Pro-
ceedings of the AAAI Conference on Artificial Intelligence, 35(16):14138–14148, May 2021.
doi: 10.1609/aaai.v35i16.17664. URL https://ojs.aaai.org/index.php/AAAI/
article/view/17664.

Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, and Zheng Zhang. BP-Transformer: Modelling
long-range context via binary partitioning. CoRR, abs/1911.04070, 2019. URL http://arxiv.
org/abs/1911.04070.

Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis.
Megabyte: Predicting million-byte sequences with multiscale transformers, 2023. URL http:
//arxiv.org/abs/2305.07185.

Biao Zhang and Rico Sennrich. Root mean square layer normalization. In H. Wal-
lach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett
(eds.), Advances in Neural Information Processing Systems, volume 32. Curran Asso-
ciates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/
1e8a19426224ca89e83cef47f1e7f53b-Paper.pdf.

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song,
Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-
hitter oracle for efficient generative inference of large language models, 2023. URL https:
//arxiv.org/abs/2306.14048.

Shangchen Zhou, Kelvin C.K. Chan, Chongyi Li, and Chen Change Loy. Towards robust blind face
restoration with codebook lookup transformer. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave,
and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL
https://openreview.net/forum?id=XdDl3bFUNn5.

Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar,
and Bryan Catanzaro. Long-short transformer: Efficient transformers for language and vision.
In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Ad-
vances in Neural Information Processing Systems, volume 34, pp. 17723–17736. Curran As-
sociates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/
9425be43ba92c2b4454ca7bf602efad8-Paper.pdf.

Zhenhai Zhu and Radu Soricut. H-transformer-1D: Fast one-dimensional hierarchical attention for
sequences. In Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume
1: Long Papers), pp. 3801–3815, Online, August 2021. Association for Computational Lin-
guistics. doi: 10.18653/v1/2021.acl-long.294. URL https://aclanthology.org/2021.
acl-long.294.


A PROOFS FOR THEORETICAL RESULTS


A.1 PROOF OF THEOREM 2.2

Proof. This proof is based on Guo et al. (2019). We have

    E_{q,k} [q⊤k − q⊤φ(k)]²                                  (38)
      = E_{q,k} [q⊤(k − φ(k))]²                              (39)
      = E_{q,k} (k − φ(k))⊤ qq⊤ (k − φ(k))                   (40)
      = E_k (k − φ(k))⊤ E_q[qq⊤] (k − φ(k))                  (41)
      ∝ E_k (k − φ(k))⊤ I_D (k − φ(k))                       (42)
      = E_k ||k − φ(k)||².                                   (43)
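The identity can be checked numerically with a small Monte Carlo experiment. The sketch below is illustrative only (toy shapes and names of our choosing, not taken from the released code): it draws q from an isotropic Gaussian so that E_q[qq⊤] = I_D, and compares the empirical average of (q⊤(k − φ(k)))² against ||k − φ(k)||² for a fixed residual.

```python
# Illustrative Monte Carlo check of the identity above (our own toy sketch,
# not from the released code): with isotropic q, E_q[(q^T d)^2] = ||d||^2
# for the fixed residual d = k - phi(k).
import jax
import jax.numpy as jnp

D = 16
k_key, phi_key, q_key = jax.random.split(jax.random.PRNGKey(0), 3)
k = jax.random.normal(k_key, (D,))
phi_k = k + 0.1 * jax.random.normal(phi_key, (D,))  # stand-in for a quantized key
d = k - phi_k

q = jax.random.normal(q_key, (1_000_000, D))        # q ~ N(0, I_D), so E[qq^T] = I_D
mc_estimate = jnp.mean((q @ d) ** 2)                # Monte Carlo E_q[(q^T d)^2]
print(mc_estimate, jnp.sum(d ** 2))                 # the two values should be close
```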

A.2 PROOF OF COROLLARY 2.3

Proof. Under the constraint φ(R^D) = {C_s}_{s=0}^{S−1}, it can be seen that for any given k, the assignment
φ(k) ≜ argmin_{c ∈ {C_s}_{s=0}^{S−1}} ||k − c||² minimizes ||k − φ(k)||². Thus, under the constraint
φ(R^D) = {C_s}_{s=0}^{S−1}, the function φ*(·) ≜ VQ(·; C) is a minimizer of E_k ||k − φ(k)||², and thus
a minimizer of E_{q,k} ||q⊤k − q⊤φ(k)||².
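Concretely, the minimizer φ* is a nearest-codeword lookup. The following JAX sketch is ours and is intentionally minimal: the function name and shapes are placeholders, and the straight-through gradient estimator and codebook EMA updates used during training are omitted.

```python
# A minimal sketch (ours, not the released implementation) of the nearest-codeword
# quantizer phi* characterized above; straight-through gradients and codebook
# EMA updates are omitted.
import jax.numpy as jnp

def vq(k, codebook):
    """Map each key in k [..., D] to its nearest codeword in codebook [S, D]."""
    d2 = (
        jnp.sum(k ** 2, axis=-1, keepdims=True)   # ||k||^2, shape [..., 1]
        - 2.0 * (k @ codebook.T)                  # cross terms, shape [..., S]
        + jnp.sum(codebook ** 2, axis=-1)         # ||c||^2, shape [S]
    )
    z = jnp.argmin(d2, axis=-1)                   # shortcodes
    return codebook[z], z                         # quantized keys and their codes
```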

A.3 PROOF OF THEOREM 3.4

Proof. When ϕw is an element-wise nonlinearity, ϕw(c) is well-defined for any scalar c. Then,

    [ϕw(QC⊤)∆]_{i,j} = ϕw(QC⊤)_{i,:} ∆_{:,j}                          (44)
                     = Σ_{s=0}^{S−1} ϕw(QC⊤)_{i,s} ∆_{s,j}            (45)
                     = Σ_{s=0}^{S−1} ϕw(Q_{i,:} C_{s,:}⊤) δ_{s,z_j}   (46)
                     = ϕw(Q_{i,:} C_{z_j,:}⊤)                         (47)
                     = ϕw(Q_{i,:} K̂_{j,:}⊤)                           (48)
                     = [ϕw(QK̂⊤)]_{i,j}.                               (49)

A.4 PROOF OF THEOREM 3.5

Proof. By Theorem 3.4, exp(QK̂⊤) = exp(QC⊤)∆. From the definition of Softmax,

    Softmax(QK̂⊤) = Diag(exp(QK̂⊤)1)⁻¹ exp(QK̂⊤)                        (50)
                  = Diag(exp(QC⊤)∆1)⁻¹ exp(QC⊤)∆.                      (51)
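Both factorizations are easy to sanity-check numerically. The sketch below is ours (toy shapes, no positional bias term, illustrative variable names): it confirms that exp(QK̂⊤) = exp(QC⊤)∆ and that the softmax over quantized scores can be computed entirely through the S-column factor.

```python
# Toy numerical check (ours) of the factorizations in Theorems 3.4-3.5; no
# positional bias term, and shapes are illustrative only.
import jax
import jax.numpy as jnp

S, T, D = 8, 16, 4
kq, kc, kz = jax.random.split(jax.random.PRNGKey(0), 3)
Q = jax.random.normal(kq, (T, D))
C = jax.random.normal(kc, (S, D))             # codebook
z = jax.random.randint(kz, (T,), 0, S)        # shortcodes of the keys
K_hat = C[z]                                  # quantized keys, K_hat[j] = C[z_j]
Delta = jax.nn.one_hot(z, S).T                # [S, T], Delta[s, j] = 1 iff z_j = s

lhs = jnp.exp(Q @ K_hat.T)                    # attention scores via quantized keys
rhs = jnp.exp(Q @ C.T) @ Delta                # same scores via the S-column factor
assert jnp.allclose(lhs, rhs, rtol=1e-4)

# Theorem 3.5: the softmax normalizer also factors through Delta.
direct = jax.nn.softmax(Q @ K_hat.T, axis=-1)
denom = jnp.exp(Q @ C.T) @ (Delta @ jnp.ones(T))   # Diag argument, shape [T]
factored = rhs / denom[:, None]
assert jnp.allclose(direct, factored, rtol=1e-4)
```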

A.5 PROOF OF THEOREM 3.6

Proof. For n = 0, 1 the result follows by inspection.

For n ≥ 2, we have

    U(n − 2) = Σ_{j=0}^{(n−1)L−1} ∆_{:,j} V_{j,:}                      (52)
             = ∆(:,0:n−1) V(0:n−1,:).                                  (53)

Thus,

    ϕw(Q(n,:) C⊤) U(n − 2)                                             (54)
      = ϕw(Q(n,:) C⊤) ∆(:,0:n−1) V(0:n−1,:)                            (55)
      = W(n,0:n−1) V(0:n−1,:),                                         (56)

where the last line follows from a similar argument as used in the proof of Theorem 3.4. Since the
sum of the three terms is W(n,0:n−1) V(0:n−1,:) + W(n,n−1) V(n−1,:) + W(n,n) V(n,:), the result
follows.

A.6 PROOF OF THEOREM 3.7

Proof. Recall that we defined A ≜ exp(QK̂⊤ + B). The proposed expression for [AV](n,:) follows
from Theorem 3.6 with ϕw(·) = exp(·). The proposed expression for [A1](n) follows by the same
substitution argument applied to [AV](n,:), with V replaced by the all-ones vector 1. Normalizing
[AV](n,:) by [A1](n) and iterating over n thus yields all blocks of the product WV when the
nonlinearity ϕw is row-wise softmax.
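To make the recursion concrete, the sketch below is an illustrative reduction of ours: it drops the positional bias B and everything outside the attention kernel, uses toy shapes, and only verifies that the per-code cache U (updated one block at a time) reproduces the unnormalized attention of block n over all blocks older than n − 1, at O(LS) rather than O(L · (n−1)L) cost per block.

```python
# Illustrative sketch (ours) of the fixed-size cache from Theorems 3.6-3.7,
# dropping the positional bias B and everything outside the attention kernel.
# U holds per-code sums of values; counts holds per-code token counts.
import jax
import jax.numpy as jnp

S, L, D, Dv, n_blocks = 8, 4, 4, 6, 5
kq, kc, kz, kv = jax.random.split(jax.random.PRNGKey(0), 4)
Q = jax.random.normal(kq, (n_blocks, L, D))
C = jax.random.normal(kc, (S, D))
z = jax.random.randint(kz, (n_blocks, L), 0, S)
V = jax.random.normal(kv, (n_blocks, L, Dv))

U = jnp.zeros((S, Dv))      # corresponds to U(n - 2)
counts = jnp.zeros((S,))    # per-code counts for the normalizer
for n in range(2, n_blocks):
    # Fold block n-2 (the block leaving the local window) into the cache.
    delta = jax.nn.one_hot(z[n - 2], S).T          # [S, L]
    U = U + delta @ V[n - 2]
    counts = counts + delta.sum(axis=-1)

    # Exact unnormalized attention of block n over blocks 0..n-2, for reference.
    K_hat_pre = C[z[: n - 1].reshape(-1)]          # [(n-1)L, D]
    V_pre = V[: n - 1].reshape(-1, Dv)             # [(n-1)L, Dv]
    exact_num = jnp.exp(Q[n] @ K_hat_pre.T) @ V_pre
    exact_den = jnp.exp(Q[n] @ K_hat_pre.T).sum(axis=-1)

    # Cached computation: O(L*S) per block instead of O(L*(n-1)*L).
    cached_num = jnp.exp(Q[n] @ C.T) @ U
    cached_den = jnp.exp(Q[n] @ C.T) @ counts
    assert jnp.allclose(exact_num, cached_num, rtol=1e-4)
    assert jnp.allclose(exact_den, cached_den, rtol=1e-4)
```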

B TRAINING DETAILS
B.1 HYPERPARAMETERS

Per-dataset hyperparameters are provided below.

Table 6: Hyperparameters.

Name                     Symbol       Enwik8    PG-19      ImageNet64
sequence length          T            8192      8192       12288
update frequency         K            4         4          3
block length             L            512       512        512
model dimension          Dm           768       2048       2048
key dimension            Dk           128       128        128
value dimension          Dv           1536      4096       4096
num codes                S            512       512        512
num layers               N            48        48         48
sinusoid dropout rate    p_dropsin    0.2       0.1        0.1
residual dropout rate    p_dropres    0.5       0.1        0.0
layerdrop rate           p_droplyr    0.3       0.1        0.0
weight decay                          0.0002    0.0        0.0
optimizer                             adamw     adafactor  adafactor
total steps                           125000    500000     500000
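For concreteness, a hypothetical config container mirroring the Enwik8 column could look like the following; the class and field names are ours and do not match the released configuration files.

```python
# A hypothetical config container mirroring the Enwik8 column of Table 6; the
# class and field names are ours and do not match the released configuration.
from dataclasses import dataclass

@dataclass
class Enwik8Config:
    sequence_len: int = 8192      # T
    update_freq: int = 4          # K
    block_len: int = 512          # L
    d_model: int = 768            # Dm
    d_key: int = 128              # Dk
    d_value: int = 1536           # Dv
    n_codes: int = 512            # S
    n_layers: int = 48            # N
    p_dropsin: float = 0.2
    p_dropres: float = 0.5
    p_droplyr: float = 0.3
    weight_decay: float = 2e-4
    optimizer: str = "adamw"
    total_steps: int = 125_000
```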

B.2 IMPLEMENTATION

Weights and token embeddings were initialized following Chowdhery et al. (2022). For small datasets,
the classifier layer omits LayerNorm and is independently parameterized. For larger datasets, the
classifier layer uses LayerNorm and its projection is tied with the token embedding table, then scaled
down by a large constant. For image datasets, absolute sinusoidal position embeddings, scaled by a
trainable scalar, were added to the token embeddings (Hua et al., 2022; Vaswani et al., 2017). We
used a maximum angular wavelength of 10^5 for all sinusoidal embeddings.
We used the pre-norm placement of LayerNorm (Radford et al., 2019), and always used the RMS
LayerNorm variant (Zhang & Sennrich, 2019). For the activations, we used ϕw = Softmax and
ϕv = ϕg = SiLU, the self-gated activation (Elfwing et al., 2017; Ramachandran et al., 2017). Several
models use LayerDrop for regularization (Fan et al., 2020a), and following the Transformer-XL
codebase (Dai et al., 2019) models apply dropout to the flipped sinusoidal embeddings used for (local)
relative positional biases.
We used float32 parameters, with bfloat16 precision for most computations (Rae et al., 2021). For
the AdamW optimizer (Loshchilov & Hutter, 2019), we used gradient clip 0.1, max learning rate
α = 0.0004 and hyperparameters β1 = 0.9, β2 = 0.98, ϵ = 10^−9. For the Adafactor optimizer
(Shazeer & Stern, 2018), we used relative stepsizes, update clip 1.0, max learning rate α = 0.01,
and hyperparameters β̂1 = 0.0, β̂2,t = 1 − t^−0.8. We used weight decay with a constant schedule
throughout training and omit decay on any one-dimensional parameter tensors (Radford et al., 2019).
The codebook commit coefficient was always β = 0.0001 and codebook EMA rate was always
γ = 0.99. Learning rates were linearly warmed up for 10,000 steps, then decayed by a 10x factor
using a cosine schedule. All models were trained with a global batch size of 128 sequences.
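As a concrete illustration, an optax setup matching the Enwik8/AdamW description above might look roughly like the following. This is our sketch, not the released training code; in particular, it assumes global-norm gradient clipping and optax's combined warmup-cosine schedule.

```python
# A rough optax sketch (ours, not the released training code) of the Enwik8
# AdamW setup described above; we assume global-norm gradient clipping and
# optax's combined warmup-cosine schedule.
import jax
import optax

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=4e-4,        # max learning rate
    warmup_steps=10_000,    # linear warmup
    decay_steps=125_000,    # total training steps
    end_value=4e-5,         # cosine decay by a 10x factor
)
optimizer = optax.chain(
    optax.clip_by_global_norm(0.1),
    optax.adamw(
        learning_rate=schedule,
        b1=0.9, b2=0.98, eps=1e-9,
        weight_decay=2e-4,
        # Omit decay on one-dimensional parameter tensors (biases, norm scales).
        mask=lambda params: jax.tree_util.tree_map(lambda p: p.ndim > 1, params),
    ),
)
```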

C GENERATED SAMPLES
C.1 EXTENSIVE SAMPLES

Samples can be browsed at the double-blind anonymized URLs below. We emphasize that the
samples have not been curated in any way and may include factual hallucinations and biased content.

Table 7: Generated samples.

Dataset      Link
Enwik8       https://www.dropbox.com/sh/vu0dvw2bcglerwg/AADTQ9B4imAyEIc1Oo849v3ua?dl=0
PG-19        https://www.dropbox.com/sh/12civha5ulukulz/AAATnHL91RVax5kIb7QgS9ywa?dl=0
ImageNet64   https://www.dropbox.com/sh/xqr0q2e9seoz5wn/AADFnl1LWCaddC2CYRP3QSvpa?dl=0

C.2 PG-19 EXCERPT

No effort has been made to explain elementary methods of
photography, for the reason that such explanation has been
found in the publications of every leading technical journal.
The endeavor has been to present what is necessary to the
amateur and the professional photographer, together with
suggestions of how to make apparatus for the student, and to
give a chapter on lens building. The author is fully aware of
the imperfections in the methods described, and would like to
emphasize the necessity of studying these methods carefully
before attempting to use them, if it is desired to make
satisfactory photographs. The most essential point in
photography is the study of light. It is impossible to have
success in photography unless the operator knows what light is
. The writer believes that much may be done to advance the art
of photography by the use of simple apparatus. The student
must not overlook the fact that some simple apparatus is
necessary in order to get good results. A lens is necessary to
bring the image on the sensitive plate up to the focus of the
lens. This lens is very expensive and only a few can be had
of the best makers.

Figure 3: Sample excerpt from our PG-19 model, generated with nucleus 0.8.

C.3 IMAGENET64 BATCH

(Figure: a batch of generated samples from our ImageNet64 model.)
