Image and Video Tokenization with Binary Spherical Quantization (Zhao et al., 2024)
Abstract
We propose a new transformer-based image and video tokenizer with Binary Spher-
ical Quantization (BSQ). BSQ projects the high-dimensional visual embedding
to a lower-dimensional hypersphere and then applies binary quantization. BSQ
is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary
token dimensions, and (3) compact: compressing visual data by up to 100× with
minimal distortion. Our tokenizer uses a transformer encoder and decoder with
simple block-wise causal masking to support variable-length videos as input. The
resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on im-
age and video reconstruction benchmarks with 2.4× throughput compared to the
best prior methods. Furthermore, by learning an autoregressive prior for adaptive
arithmetic coding, BSQ-ViT achieves video compression results comparable to
state-of-the-art video codecs. BSQ-ViT also enables masked language models to
achieve image synthesis quality competitive with GAN- and diffusion-based methods.
1 Introduction
Learned discrete image and video tokenization allows for state-of-the-art visual compression [1, 2, 3],
recognition [4, 5, 6, 7] and generation [8, 9, 10]. These models follow a proven recipe from large
language modeling [11, 12, 13]: tokenize inputs and outputs into discrete units and learn an auto-
regressive model to predict the tokenized stream one token at a time. The most widely used approach
for image encoding is the Vector-Quantized Variational Auto-Encoder (VQ-VAE) [8], which encodes
inputs into continuous latent embeddings and maps them to a learned codebook through nearest-neighbor
lookup. However, VQ-VAE-style approaches have two drawbacks: First, most image encoders are
built upon convolutional networks (CNN) [9, 14]. Adapting spatial convolution for images to spatial-
temporal convolution for videos requires non-trivial architectural changes [15, 16, 17] with increased
computational cost. Treating videos as a sequence of images leads to a suboptimal quantization [16].
Second, vector quantization (VQ) scales poorly with the codebook size. The runtime scales linearly
with the codebook size, and the codebook easily overfits on smaller datasets [17]. This is especially
troubling for video inputs, as they rely on larger codebooks to represent both static visual patterns
and dynamic motion patterns.
This paper proposes a unified visual tokenizer based on a Vision Transformer and Binary Spherical
Quantization (BSQ). The Transformer-based encoder-decoder leverages a block-wise causal mask and
uses only visual tokens from the current or past timestamps for reconstruction (Figure 3). BSQ first
projects the high-dimensional visual embedding of the transformer encoder to a lower-dimensional
hypersphere and then applies binary quantization. The transformer encoder, decoder, and BSQ are
seamlessly integrated into the VQ-GAN [9] framework and trained end-to-end.
2 Related Work
Visual Tokenization. VQ-VAE [8] introduced the concept of discrete tokenized bottlenecks in
auto-encoder architectures. Recent improvements include better training objectives [20, 9], increasing
VQ codebook usage [4, 21], replacing VQ with product quantization (PQ) [3] or scalar quantization
(SQ) [22], and employing stronger generative models [9, 10]. Image tokenizers are trivially extended
to video by tokenizing individual frames [23, 24]. However, this ignores dynamic motions and leads
to suboptimal tokenization: The same visual information is compressed repeatedly across frames.
Video Tokenization. Dedicated video tokenizers make better use of temporal correlations in the input
signal. VideoGPT [25] proposes 3D (de-)convolutions in VQ-VAE for video generation. TATS [15]
replaces zero padding with replicate padding to mitigate the temporal corruption when video length
varies. Yu et al. introduce central inflation of pretrained 2D convolutional filters to 3D [16] and
further make them causal [17]. Phenaki [23] adopts a factorized causal video vision Transformer [26]
(C-ViViT), which improves efficiency but sacrifices the ability to model complex motion across time.
Neural Compression. Since Shannon established the fundamental source coding theorem [27], it
has formed the basis of lossless compression [28, 29, 30, 31] with probabilistic models including
RNN [32, 33], CNN [34, 8], VAE [35, 36], and Transformers [37, 38]. L3C [39] presents a fast
hierarchical probabilistic model for lossless image compression. LMIC [38] shows that LLMs
trained primarily on text, e.g. Llama 2 [13] and Chinchilla [40], are general-purpose compressors
for text, images, and audio. However, these LLMs are too big and slow to make this compression
practical. Our tokenizer presents a lighter-weight alternative: Tokenization performs initial local
lossy compression, while a lightweight and thus computationally efficient sequence model (∼300M)
compresses the global video structure.
Video compression. Most high-performing modern video compression methods rely on hybrid coders
that combine transform coding [41, 42] and motion compensation [43, 44]. This design carries over to
most of the recently popularized learning-based solutions [45, 46, 47, 48]. VCT [49] proposes
a Transformer-based temporal entropy model to learn motion implicitly. However, VCT requires
a heavily-engineered image compression model [50] and has a short temporal context window. In
this work, we show that a learned video tokenizer combined with an arithmetic coder modeled by a
sequence model achieves competitive compression results without explicitly modeling motion.
3 Preliminaries
A tokenization-based compression algorithm has three basic steps: A visual tokenizer, i.e. VQ-
VAE [8] or LFQ [17], translates raw visual inputs to a discrete set of tokens and back. A sequence
model then predicts an auto-regressive probability distribution over these discrete tokens. Finally,
arithmetic coding translates this distribution into a compressed representation.
Visual Tokenization. VQ-VAE [8] introduced the concept of learning discrete visual representation
with an auto-encoder architecture and a bottleneck module in between with vector quantization (VQ).
Given a video X ∈ R^{T×H×W×3}, an encoder E produces a set of d-dimensional latent embeddings
Z = E(X) ∈ R^{(T/q × H/p × W/p)×d} with a spatial-temporal downsample factor of q×p×p. The bottleneck
module q then transforms the real-valued latent embeddings into some discrete tokens ẑ = q(z).
In Vector Quantization (VQ), the quantizer q_VQ assigns each z ∈ Z to the closest entry in a learnable
codebook C = [c_1 · · · c_K] ∈ R^{K×d}:
    ẑ = q_VQ(z) = c_k = arg min_{c_k̂ ∈ C} ∥z − c_k̂∥_2 .   (1)
Here, K is the vocabulary size of the codebook and the integer k is the discretized token representation
of z which can be stored in ⌈log(K)⌉ bits. A decoder G maps the discretized tokens back into a visual
representation X̂ = G(Ẑ). The entire network (E, G, and q) is end-to-end trainable and minimizes an
MSE loss LMSE = ∥X̂ − X∥2 using straight-through estimator [51] to propagate gradients through
the quantization bottleneck. More recent quantizers rely on a perceptual LLPIPS and adversarial
LGAN loss for better visual quality [9]
    min_{E,G,q} E_X [ L_VQ(E, G, q) + η L_LPIPS(E, G, q) + λ L_GAN(E, G, q) ] ,   (2)
where the quantization loss term LVQ emulates online clustering to learn ck . The main issue
with VQ-VAE is that Vector Quantization scales poorly with increasing vocabulary size K [17].
Remedies include using a smaller code dimension [4], introducing stochasticity [52], reviving “dead”
codevectors [21], and regularizing with a commitment loss [8]:
Lcommit (ẑ, z) = ∥ sg(ẑ) − z∥, (3)
where sg(·) denotes the stop-gradient operation.
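For concreteness, the following is a minimal PyTorch-style sketch of the VQ bottleneck in Eq. (1) with a straight-through estimator and the commitment loss in Eq. (3); the function name, shapes, and loss weighting are illustrative assumptions, not the paper's released code.

import torch
import torch.nn.functional as F

def vq_bottleneck(z, codebook, beta=0.25):
    """z: (N, d) encoder outputs; codebook: (K, d) learnable code vectors."""
    # Nearest-neighbor lookup (Eq. 1): squared L2 distance to every code.
    d2 = (z ** 2).sum(-1, keepdim=True) - 2 * z @ codebook.t() + (codebook ** 2).sum(-1)
    k = d2.argmin(dim=-1)                       # integer tokens, storable in ceil(log K) bits each
    z_q = codebook[k]                           # assigned codes c_k
    # Commitment loss (Eq. 3) pulls the encoder toward its assigned code; the
    # codebook term emulates online clustering of the codes toward encoder outputs.
    loss = beta * F.mse_loss(z, z_q.detach()) + F.mse_loss(z_q, z.detach())
    # Straight-through estimator: forward pass uses z_q, gradients flow to z unchanged.
    z_q = z + (z_q - z).detach()
    return z_q, k, loss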
Lookup-Free Quantization (LFQ) [17] uses a fixed implicit codebook C_LFQ = {−1, 1}^L given as the
corners of a hypercube in L-dimensional space. The best vector quantizer for this implicit codebook
is the binary quantization qLF Q (z) = sign(z). To optimize for an effective latent code and encourage
usage of the implicit codebook, Yu et al. [17] use an additional entropy objective [53]:
Lentropy = E [H(q(z))] − γH [E [q(z)]] , (4)
where both entropy terms rely on a soft quantization [2]
    q̂(c|z) = exp(−τ(c − z)²) / Σ_{c′∈C_LFQ} exp(−τ(c′ − z)²)   (5)
to guarantee the loss is differentiable. The final loss LLF Q is a combination of LMSE , Lcommit ,
LLPIPS , LGAN , and Lentropy . The main computational bottleneck in LFQ is the entropy optimization
of a higher-dimensional codebook, as it involves summation over 2L implicit codebook entries.
Both VQ-VAE and LFQ lossily compress visual inputs X into N discrete tokens [k1 , . . . , kN ], where
ki ∈ {1, . . . K}, in N ⌈log K⌉ bits. Neither tokenization strategy exploits the global image or video
structure well. A sequence model with lossless arithmetic coding better fits this global structure.
Arithmetic Coding (AC) [29, 30, 54] offers a way of constructing a bitstream with near-optimal
length by leveraging the statistical properties of the coding distribution. Given a distribution over
token streams P_t : {1, · · · , K}^N ↦ (0, 1], arithmetic coding encodes the token stream in
⌈− log P_t(k_1, . . . , k_N)⌉ + 1 bits. The most common token distribution is an auto-regressive model
Pt (k1 , . . . , kN ) = Pt (k1 )Pt (k2 |k1 ) . . . Pt (kN |k1 , . . . , kN −1 ) (6)
for which efficient incremental encoding and decoding algorithms exist [49].
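To make the code-length statement concrete, here is a small sketch (ours) that evaluates ⌈− log₂ P_t(k_1, . . . , k_N)⌉ + 1 for an auto-regressive model supplied as a callable; the uniform toy model is purely illustrative.

import math

def ac_code_length(tokens, cond_prob):
    """Bits used by arithmetic coding under an auto-regressive model
    cond_prob(k, prefix) = P(k_n = k | k_1, ..., k_{n-1})."""
    log_p = 0.0
    for n, k in enumerate(tokens):
        log_p += math.log2(cond_prob(k, tokens[:n]))
    return math.ceil(-log_p) + 1

# A memoryless model over K = 16 tokens costs 4 bits per token, plus one.
uniform = lambda k, prefix: 1.0 / 16
print(ac_code_length([3, 7, 15, 0], uniform))   # -> 17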
4.1 Binary Spherical Quantization (BSQ)
Binary Spherical Quantization (BSQ) optimizes over an implicit codebook C_BSQ = {−1/√L, 1/√L}^L,
a hypercube projected onto the unit sphere. Each corner c_k ∈ C_BSQ of the hypercube corresponds to a
unique token k. The quantizer works as follows: it projects some high-dimensional latent embedding
z to a lower-dimensional unit hypersphere u, applies binary quantization per axis û = sign(u),
and back-projects to the quantized vector in the original latent space x̂, as shown in Figure 1a.
Specifically, we start with an encoded visual input z = E(x) ∈ R^d. We first linearly project the
latent embedding to L dimensions, v = Linear(z) ∈ R^L, where L ≪ d. Next, we project v onto the
unit sphere, u = v/|v|, and perform binary quantization on each dimension of u independently,
û = (1/√L) sign(u), where sign(·) is the sign function. To keep outputs on the unit sphere, we map
sign(0) → 1. We use a Straight-Through Estimator (STE) [51] to make the operator differentiable,
sign_STE(x) = sg(sign(x) − x) + x, where sg(·) denotes the stop-gradient operation. Finally, we
back-project the quantized û to the d-dimensional space, ẑ = Linear(û) ∈ R^d.
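A minimal PyTorch sketch of this bottleneck (down-projection, ℓ2 normalization, per-axis binarization with an STE, and back-projection); layer names, shapes, and defaults below are our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BSQ(nn.Module):
    def __init__(self, d=768, L=18):
        super().__init__()
        self.L = L
        self.proj_down = nn.Linear(d, L)    # z in R^d -> v in R^L, with L << d
        self.proj_up = nn.Linear(L, d)      # quantized u_hat in R^L -> z_hat in R^d

    def forward(self, z):
        v = self.proj_down(z)
        u = F.normalize(v, dim=-1)                              # u = v / |v|, on the unit sphere
        # Per-axis binary quantization; (u >= 0) maps sign(0) to +1 as in the text.
        u_hard = (2 * (u >= 0).float() - 1) / self.L ** 0.5
        u_hat = u + (u_hard - u).detach()                       # straight-through estimator
        return self.proj_up(u_hat), u, u_hat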
BSQ has a few appealing properties: As with LFQ, the implicit codebook entry is parameter-free and
easy to compute. Unlike LFQ, a soft quantization of BSQ has a simple probabilistic interpretation,
which leads to efficient entropy computation in an entropy loss Lentropy . Finally, BSQ’s quantization
error is bounded, which empirically leads to much faster and better convergence than LFQ.
Efficient implicit code assignment. At inference time, we map a projected embedding v to a token index
through simple binarization, k = Σ_{i=1}^{L} 1[v_i > 0] 2^{i−1}, where 1[·] is the indicator function. The inverse
mapping uses bit-shift and bitwise AND operations.
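A small sketch (our indexing convention) of the bit-packing and its inverse:

import torch

def bits_to_index(v):
    """v: (..., L) pre-quantization values; returns k = sum_i 1[v_i > 0] * 2^(i-1)."""
    L = v.shape[-1]
    weights = 2 ** torch.arange(L, device=v.device)          # 2^0, 2^1, ..., 2^(L-1)
    return ((v > 0).long() * weights).sum(dim=-1)

def index_to_bits(k, L):
    """Inverse mapping via bit-shift and bitwise AND; returns corners in {-1/sqrt(L), +1/sqrt(L)}^L."""
    bits = (k.unsqueeze(-1) >> torch.arange(L, device=k.device)) & 1
    return (2 * bits.float() - 1) / L ** 0.5

# Round trip: re-quantizing the reconstructed corner recovers the same token index.
v = torch.randn(4, 18)
k = bits_to_index(v)
assert torch.equal(bits_to_index(index_to_bits(k, 18)), k)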
Soft BSQ and entropy. To best use the entire range of the implicit codebook C_BSQ, we use the
entropy loss L_entropy = E_u[H(q(u))] − γ H[E_u[q(u)]] [53]. To compute this entropy loss we first
derive a soft quantization scheme [2]. Since both the codebook entries and the inputs to the quantizer
are unit vectors, the soft quantization is a distribution
    q̂(c|u) = exp(τ c⊤u) / Σ_{c′∈C_BSQ} exp(τ c′⊤u) = Π_{d=1}^{L} σ(2τ c_d u_d) ,   (7)
where σ is a sigmoid function, and the overall soft quantizer is independent along each dimension.
See Sec. C.1 for a derivation. This form allows for an efficient computation of the first entropy term
" L #
X
Eu [H(q̂(c|u))] = Eu H(q̂(cd |ud )) . (8)
d=1
See Sec. C.2 for a derivation. Instead of reasoning over distributions over the entire codebook,
which is exponentially large, we instead treat each dimension independently. The resulting entropy
computation is linear to the dimension L of the bottleneck.
Unfortunately, the second entropy term cannot directly use the same independence assumption, as
dimensions in the expected value Eu [q̂(c|u)] are correlated through the distribution of u. We find the
closest factorized distribution q̃(c) = Π_{d=1}^{L} q̃(c_d) to E_u[q̂(c|u)], and instead minimize the entropy
of the approximate distribution. As we will show in Sec. C.3, the best approximation in terms of the
KL-divergence is q̃(c_d) = E_{u_d}[q̂(c_d|u_d)]. The final approximate entropy term to maximize is
    H(E_u[q̂(c|u)]) ≈ H(q̃(c)) = Σ_{d=1}^{L} H(E_{u_d}[q̂(c_d|u_d)]) .   (9)
As we will show in Sec. C.3 this approximation is an upper bound to the true entropy, but empirically
closely tracks the true entropy. This entropy term is again efficient for evaluation.
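Putting Eq. (7)-(9) together, both entropy terms reduce to sums of per-dimension binary entropies. The following is a sketch of the resulting loss (natural-log entropies and our variable names; τ and γ follow the notation above):

import torch

def bsq_entropy_loss(u, tau=1.0, gamma=1.0, eps=1e-8):
    """u: (N, L) unit vectors before binarization.
    Returns E_u[H(q(c|u))] - gamma * H(q_tilde(c)) using the factorization
    q_hat(c_d = +1/sqrt(L) | u) = sigmoid(2 * tau * u_d / sqrt(L)), cf. Eq. (7)."""
    L = u.shape[-1]
    p = torch.sigmoid(2 * tau * u / L ** 0.5)          # (N, L), prob. of the positive corner per axis
    def bin_entropy(q):
        return -(q * (q + eps).log() + (1 - q) * (1 - q + eps).log())
    first = bin_entropy(p).sum(dim=-1).mean()          # Eq. (8): E_u[ sum_d H(q(c_d|u_d)) ]
    second = bin_entropy(p.mean(dim=0)).sum()          # Eq. (9): sum_d H(E_u[q(c_d|u_d)])
    return first - gamma * second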
Quantization error in BSQ. Most quantizers use pass-through gradient estimates during training [17,
8, 9]. Though simple to implement, it assumes that the gradients for an unquantized u and quantized
û bottleneck are almost the same, which only holds if the quantization error d(u, û) = ∥u − û∥ is
small. As we show in Sec. C.4, this is true for BSQ
    E_u[d(u, û)] < √(2 − 2/√L) < √2 .   (10)
Relation to other quantization methods. BSQ is closely connected to many concepts introduced in
information and coding theories. LFQ [17] uses the same binarization technique as BSQ but does not
normalize its output. This leads to an unbounded quantization error and does not allow for as simple
a soft quantization for entropy computation. A pictorial comparison between LFQ and BSQ is
shown in Figure 2 and a summary is provided in Table 7. Spherical Vector Quantization (SVQ) [55]
also ensures all code vectors have a pre-defined radius. However, SVQ assumes a variety of radii,
which have to be encoded by an additional gain quantizer. In our case, the source code is the output
of a learned encoder E. Therefore, the unit radius assumption is sound, and the gain quantizer can be
avoided. Pyramid Vector Quantization (PVQ) [56] assumes all code vectors have a constant ℓ1 norm,
but the ℓ1 normalized centroids partition the hypersphere less uniformly than ℓ2 .
4.2 Tokenization Network with Causal Video Transformer
We propose to use Vision Transformer (ViT) [57] to model both the encoder and decoder due to its
better computational efficiency and higher reconstruction quality.
Video Transformer. We start from ViT-VQGAN [4] and extend it to take videos as input. We divide
an input video X ∈ RT ×H×W ×3 into non-overlapping patches of size 1 × p × p, xi ∈ R1×p×p×3 .
The visual tokens are flattened into a 1D sequence, linearly projected, and passed through a stack of
N Transformer Encoder layers to yield the latent representation (z_1, · · · , z_N). The decoder uses
the same architecture, maps the latent embeddings ẑ back to pixel space, and regroups them into
the original shape: (x̂_1, · · · , x̂_N) = MLP(TransformerDecoder(ẑ_1, · · · , ẑ_N)), where MLP is a
decoding head with a two-layer MLP, i.e. Linear ◦ Tanh ◦ Linear.
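A schematic sketch of this data flow for a single frame, with stock nn.TransformerEncoder blocks standing in for the actual encoder/decoder stacks; all hyperparameters below are illustrative assumptions.

import torch
import torch.nn as nn

class ViTTokenizerSketch(nn.Module):
    def __init__(self, p=8, d=512, n_layers=4):
        super().__init__()
        self.p = p
        self.embed = nn.Linear(3 * p * p, d)          # linear projection of flattened patches
        block = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, n_layers)
        # The "decoder" is another stack of self-attention blocks, not an autoregressive decoder.
        self.decoder = nn.TransformerEncoder(block, n_layers)
        self.head = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, 3 * p * p))

    def forward(self, x):                             # x: (B, H, W, 3)
        B, H, W, _ = x.shape
        patches = x.reshape(B, H // self.p, self.p, W // self.p, self.p, 3)
        patches = patches.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, 3 * self.p * self.p)
        z = self.encoder(self.embed(patches))         # latent tokens (z_1, ..., z_N)
        # ... the quantization bottleneck (e.g. BSQ) would be applied to z here ...
        x_hat = self.head(self.decoder(z))            # Linear o Tanh o Linear decoding head
        return x_hat.reshape(B, H // self.p, W // self.p, self.p, self.p, 3) \
                    .permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, 3)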
Blockwise Causal Attention. During training, we always assume the input video has T frames,
which might not hold at inference. Padding shorter video segments to T frames works but wastes a
lot of bits, especially in the context of compression. To handle variable-length videos, we propose a
simple blockwise causal masked attention analogous to causal attention in language modeling [58]. It
specifies that only those tokens at time t or earlier can be used for reconstructing the visual tokens at
time t ∈ {1, · · · , T }.
    (z_{(t−1)·(H/p)·(W/p)+1}, · · · , z_{t·(H/p)·(W/p)}) = TransformerEncoder(x_1, · · · , x_{t·(H/p)·(W/p)}),   (11)
    (ẑ_{(t−1)·(H/p)·(W/p)+1}, · · · , ẑ_{t·(H/p)·(W/p)}) = q_BSQ(z_{(t−1)·(H/p)·(W/p)+1}, · · · , z_{t·(H/p)·(W/p)}),   (12)
    (x̂_{(t−1)·(H/p)·(W/p)+1}, · · · , x̂_{t·(H/p)·(W/p)}) = MLP(TransformerDecoder(ẑ_1, · · · , ẑ_{t·(H/p)·(W/p)})).   (13)
This can be efficiently implemented with a blockwise causal attention mask, i.e. a blockwise
lower-triangular matrix, as shown in Figure 3. When T = 1, the proposed encoder-decoder reduces to a ViT with
a full attention mask. Therefore, we can easily train it using a mixture of images and videos.
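A small sketch of how such a mask can be constructed; it follows the boolean-mask convention of nn.MultiheadAttention/nn.TransformerEncoder (True marks positions a query may not attend to), which is an assumption about the surrounding implementation.

import torch

def blockwise_causal_mask(T, tokens_per_frame):
    """(T*n, T*n) boolean mask: entry (i, j) is True if the token at position i
    must NOT attend to position j, i.e. if j belongs to a later frame than i."""
    n = tokens_per_frame
    frame_id = torch.arange(T * n) // n        # frame index of every flattened token
    return frame_id.unsqueeze(1) < frame_id.unsqueeze(0)

# With T = 1 the mask is all False, i.e. plain full attention over image patches.
print(blockwise_causal_mask(3, 2).int())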
We use a factorized spatial-temporal position embedding to encode temporal position information.
Specifically, we add a set of zero-initialized temporal position embeddings PE_t ∈ R^{T×d} to the
original spatial position embedding PE_s ∈ R^{N×d} of the image tokenizer, i.e. PE[i, :, :] =
PE_t[i, None, :] + PE_s[None, :, :].
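A tiny sketch of the factorized embedding; in practice the spatial table would come from the pretrained image tokenizer, and here it is randomly initialized for illustration.

import torch
import torch.nn as nn

class FactorizedPE(nn.Module):
    def __init__(self, T, N, d):
        super().__init__()
        self.pe_s = nn.Parameter(torch.randn(N, d) * 0.02)    # spatial PE (from the image tokenizer)
        self.pe_t = nn.Parameter(torch.zeros(T, d))           # temporal PE, zero-initialized

    def forward(self, tokens):                                # tokens: (B, T, N, d)
        return tokens + self.pe_t[None, :, None, :] + self.pe_s[None, None, :, :]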
Training the Video Tokenizer from an Image Tokenizer. Due to the lack of diversity in existing
video datasets, we first train an image tokenizer on image data and then fine-tune it into a video
tokenizer on video data.
Table 1: Image reconstruction results on COCO2017 and ImageNet-1K (256 × 256). The “data” column
refers to the training data: CC for CC3M, YF for YFCC100M, OImg for OpenImages, LAION for LAION-5B,
IN for ImageNet, and “?” for unknown source. The “arch.” column shows the encoder/decoder architecture: C
for ConvNets with Self-Attention, and T-B for ViT-Base. The “# bits” column refers to the effective number of
bits per token defined in Sec. 5.1. # is obtained by multiplying the latent dimension with the precision. The “TP”
column means the inference throughput (images/second) per GPU. † The number is taken from the paper. Note
that STDs of PSNR, SSIM, and LPIPS are computed across samples instead of multiple runs.
COCO2017 val ImageNet-1k val
Method Data Arch. Quant. Param. # bits TP↑ PSNR↑ SSIM↑ LPIPS↓ rFID↓ PSNR↑ SSIM↑ LPIPS↓ rFID↓
DALL-E dVAE [20] CC+YF C VQ 98M 13 34.0 25.15 .7497 .3014 55.07 25.46 .7385 .3127 36.84
±3.49 ±.1124 ±.1221 ±3.93 ±.1343 ±.1480
MaskGIT [10] IN-1k C VQ 54M 10 37.6 17.52 .4194 .2057 8.90 17.93 .4223 .2018 2.23
±2.75 ±.1619 ±.0473 ±2.93 ±.1827 ±.0543
ViT-VQGAN [4] IN-1k T-B VQ 182M 13 †7.5 - - - - - - - †1.55
SD-VAE 1.x [72] OImg C VQ 68M 10 22.4 21.78 .6139 .1042 6.79 22.12 .6046 .1039 1.52
±3.41 ±.1430 ±.0345 ±3.79 ±.1663 ±.0409
SD-VAE 1.x [72] OImg C VQ 68M 14 22.4 22.54 .6470 .0905 6.07 22.82 .6354 .0912 1.23
±3.55 ±.1409 ±.0323 ±3.97 ±.1644 ±.0390
SD-VAE 1.x [72] OImg C KL 68M 64# 22.4 21.68 .6375 .0985 5.94 21.99 .6275 .0980 1.35
±3.32 ±.1375 ±.0309 ±3.74 ±.1600 ±.0371
SD-VAE 2.x [14] OImg+LAION C KL 84M 64# 18.9 24.82 .7202 .0694 4.63 25.08 .7054 .0731 0.78
±3.64 ±.1241 ±.0344 ±4.11 ±.1469 ±.0448
SDXL-VAE [14] OImg+LAION+? C KL 84M 64# 18.9 25.11 .7433 .0623 4.23 25.38 .7276 .0666 0.72
±3.91 ±.1240 ±.0289 ±4.41 ±.1469 ±.0373
Ours IN-1k T-B BSQ 174M 18 45.1 25.08 .7662 .0744 5.81 25.36 .7578 .0761 1.14
±3.57 ±.0993 ±.0295 ±4.02 ±.1163 ±.0358
Ours IN-1k T-B BSQ 174M 36 45.1 27.64 .8485 .0412 3.42 27.88 .8410 .0432 0.41
±3.74 ±.0704 ±.0199 ±4.26 ±.0821 ±.0253
Ours (w/. EMA) IN-1k T-B BSQ 174M 36 45.1 27.92 .8526 .0380 3.34 28.14 .0814 .0400 0.45
±3.78 ±.0698 ±.0187 ±4.32 ±.0814 ±.0237
5 Experiments
We train the image tokenization model on the training set of ImageNet ILSVRC2012 [63] and evaluate
the image reconstruction result on the validation set of MS-COCO [64] and ImageNet, denoted by
COCO 2017val and ImageNet-1k respectively. We fine-tune the video tokenization model on UCF-
101 [65] and conduct video compression experiments on two standard benchmarks, i.e. MCL-JCV
and UVG. We leave dataset statistics and implementation details in Sec. E.
Evaluation metrics. For image/video tokenization, we report a perceptual metric (LPIPS-
AlexNet) [59], PSNR, SSIM [66], and Fréchet Inception/Video Distance (FID/FVD) [67, 68]. To
distinguish them from the generation metrics, we denote them as rFID/rFVD. For generation, we report FID, Inception
Score (IS) [69], and improved precision and recall (IPR, Prec, and Rec) [70]. For compression, we
report PSNR and MS-SSIM [71] under different levels of bits per pixel (bpp).
Table 2: Video reconstruction results on UCF-101 (split 1).
UCF-101 train UCF-101 val
Method Backbone Quantizer Param. # bits PSNR↑ SSIM↑ LPIPS↓ rFVD↓ PSNR↑ SSIM↑ LPIPS↓ rFVD↓
(Image Tokenizer, w/o adapting to videos)
Ours ViT VQ 174M 14 25.64 .8142 .1120 357 25.58 .8120 .1146 382
Ours ViT BSQ 174M 18 25.86 .8273 .1089 326 25.83 .8259 .1108 342
(Image Tokenizer → Video Tokenizer)
MaskGIT [10] 2D CNN VQ 53M 10 21.5 .685 0.114 216 - - - -
TATS [15] 3D CNN VQ 32M 14 - - - 162 - - - -
MAGVIT-L [16] 3D CNN VQ 158M 10 22.0 .701 .0990 25 - - - -
MAGVIT-v2 [17] C.-3D CNN LFQ 158M 18 - - .0694 16.12 - - - -
MAGVIT-v2 [17] C.-3D CNN LFQ N/A (>158M) 18 - - .0537 8.62 - - - -
Ours non-BC ViT VQ 174M 14 33.06 .9518 .0223 9.16 32.92 .9506 .0228 12.79
Ours BC ViT VQ 174M 14 32.81 .9496 .0236 10.76 32.68 .9484 .0241 14.17
Ours BC ViT BSQ 174M 18 32.08 .9421 .0244 8.08 31.49 .9357 .0276 11.62
Ours BC ViT BSQ 174M 36 33.80 .9606 .0159 4.10 33.55 .9588 .0167 6.21
For VQ-based models, we count # bits as ⌈log K⌉, where K is the
codebook size; For KL-regularized models (SD-VAE 2.x and XL), since the latent is continuous,
we count # bits as the latent dimension multiplied by the numeric precision (here we use 16 since
the checkpoint is stored in FP16). For our BSQ, # bits is L because each latent channel is binary.
We summarize the key observations as follows. (1) BSQ efficiently compresses image patches
into a small amount of bits. It reconstructs images better in all metrics using fewer bits per token
than the second-best method (SDXL-VAE). (2) BSQ is also computationally efficient. Although
the ViT-based backbone doubles the parameters, our method yields a 2.4× higher throughput than
SDXL-VAE. MaskGIT runs at a comparable speed but reconstructs significantly worse because of a
small codebook size (1024) and more spatial downsampling (16×). (3) BSQ is generalizable across
different domains of images. ImageNet is relatively object-centric while COCO is more scene-centric.
Though trained on ImageNet only, our method does well on the scene-centric COCO too. It even
works better than SD-VAE 1.x/2.x trained on the similarly scene-centric OpenImages dataset [73].
Video Reconstruction. We present the video reconstruction on both UCF-101 training and validation
subsets in Table 2. First, we use the image tokenizer to reconstruct the video frame by frame. BSQ
works slightly better than VQ but neither is comparable to the specialized video tokenizers fine-tuned
on video data shown in the lower half of Table 2. Second, we finetune the image tokenizer on videos
and see significant improvements. For example, our 18-bit BSQ with causal ViT reduces rFVD from
342 to 11.62 and improves PSNR from 25.83 to 31.49 dB. The compared prior methods include:
(1) MaskGIT [10], which is a fine-tuned 2D-CNN based tokenizer, (2) TATS [15], which uses a 3D
CNN with replicate padding, (3) MAGVIT [16], whose 3D CNN is initialized by zero-inflating 2D
filters, and (4) MAGVIT-v2 [17], which makes the 3D CNN causal. Since most methods do not release
checkpoints, we take their reported numbers directly. Our models with all configurations outperform
MAGVIT-v2 with a comparable number of parameters (174M vs. 158M) by a large margin. The
best-performing MAGVIT-v2 uses a larger backbone and achieves a rFVD of 8.62. Our causal
BSQ-ViT with L = 18 achieves an 8.08 rFVD and halves the LPIPS. For BSQ with L = 36, our
method further improves the reconstruction metrics.
We also show the effect of using block-wise causal masks. The non-causal variant (non-BC) works
slightly better on all metrics because now the model can look at all visual patches within the temporal
context window. This result resembles the observations in video compression that using bidirectional
predicted pictures (B-frames) benefits compression quality given the same group of pictures (GoP).
Table 3: Image generation results on ImageNet-1K (128 × 128). † The number is taken from the paper.
Category Method # steps FID↓ IS↑ Prec↑ Rec↑
GAN BigGAN [18] 1 6.02 145.8 0.86 0.35
Diffusion ADM [19] 1,000 5.91 93.3 0.70 0.65
Masked LM VQ 12 †9.4 - - -
Masked LM FSQ [22] 12 †8.5 - - -
Masked LM BSQ (Ours) 32 5.44 139.6 0.80 0.50
Figure 4: Video compression results on MCL-JCV 640×360 and UVG 1920×1080. The four panels report
(a) PSNR on MCL-JCV, (b) MS-SSIM on MCL-JCV, (c) PSNR on UVG, and (d) MS-SSIM on UVG as a
function of bpp, comparing H.264 (medium), HEVC (medium), VCT [49], MAGVIT [16], MAGVIT-v2 [17],
and our method with and without arithmetic coding (AC).
Image Generation. Our BSQ-ViT tokenizer can be seamlessly integrated into existing generative
models for visual generation. We follow MaskGIT [10], a masked language modeling approach. At
training time, the underlying masked language model (masked LM) learns to predict the masked
tokens given a random proportion of unmasked tokens, like BERT [74]. At inference time, the model
repeats decoding in a non-autoregressive way [75] for several steps and progressively decodes from
an all-masked canvas to visually plausible contents following a pre-defined unmasking schedule.
Unlike MaskGIT, whose VQ-VAE has K = 1024, BSQ-ViT has an effective vocabulary size of 2^L
with L = 18, resulting in a slow embedding lookup. We fix this by dividing each token into groups
and treating the sub-tokens independently, with a rationale similar to Sec. 4.1, and increase the
number of decoding steps accordingly. Table 3 shows that the masked LM with BSQ outperforms
those with VQ and FSQ reported in [22]. Our method achieves results comparable to other generation
paradigms such as GAN-based [18] and diffusion-based [19] approaches. We leave qualitative results
to Sec. F.
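One way to realize this token grouping is to split every L-bit BSQ index into sub-indices of L/g bits each, so that each embedding table has only 2^{L/g} rows; a sketch with g = 2 groups (the exact grouping used in our experiments may differ):

import torch

def split_token(k, L=18, groups=2):
    """Split an L-bit token id into `groups` sub-token ids of L/groups bits each."""
    b = L // groups
    mask = (1 << b) - 1
    return torch.stack([(k >> (g * b)) & mask for g in range(groups)], dim=-1)

def merge_token(sub, L=18, groups=2):
    b = L // groups
    k = torch.zeros_like(sub[..., 0])
    for g in range(groups):
        k = k | (sub[..., g] << (g * b))
    return k

k = torch.randint(0, 2 ** 18, (5,))
assert torch.equal(merge_token(split_token(k)), k)   # lossless round trip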
Video Compression. We compare the compression results on MCL-JCV and UVG in Figure 4. Simply
flattening the video token sequence into a bitstream achieves an MS-SSIM of 0.9818 at 0.2333 bpp.
Although this is not great, we use an auto-regressive model to predict the conditional probability
such that the bpp is reduced by 41%. This leads to a better tradeoff than standard video codecs,
including both H.264 and HEVC.

Table 4: Comparisons of encoding/decoding speed. † The number does not include the image encoder, according to [49].
Method Resolution Encode EC Decode FPS
VCT [49] 1920×1080 †494 ms 30.5 ms 168 ms 1.4
H.264 1920×1080 - - - 2.6
Ours 1920×1080 55.8 ms 42.2 ms 64.8 ms 6.1
VCT [49] 640×360 †22.2 ms 4.24 ms 10.1 ms 27.3
H.264 640×360 - - - 22.4
Ours 640×360 6.2 ms 4.69 ms 7.2 ms 55.2
On UVG 1080P, our model is comparable to H.264 while being worse than HEVC and VCT [49].
Note that our model trains on UCF-101 which only has 9K 320×240 video clips encoded in MPEG-4
while VCT has been trained on a million high-resolution Internet video clips. We hypothesize that
the gap will be mitigated by adding more diverse videos and removing compression artifacts from
the training videos. Nevertheless, we show the potential advantage of our method in encoding and
decoding speed in Table 4. Due to the simplicity of the Transformer-based encoder and decoder, our
method runs faster than VCT.
For ablation studies, we train an ImageNet image tokenizer with resolution 128×128 with p = 8,
although our conclusions generally hold for higher resolution, e.g. 256×256 in Sec. 5.1.
BSQ vs VQ. Table 5 shows that BSQ and VQ follow a similar trend: better reconstruction for
increased L. Since K = 218 results in an out-of-memory issue, we try a smaller K = 216 = 65536
for VQ. The gain for VQ already diminishes even though the small bottleneck dimension of 8 still
guarantees full code usage. In contrast, BSQ consistently works better on all metrics when L = 18.
Table 6: Ablation studies of the loss design.
(a) Leave-one-out ablations for training losses. (b) Group size. (L = 18)
6 Conclusions
We present a new transformer-based image and video tokenizer with Binary Spherical Quantization
(BSQ). The transformer-based architecture effortlessly integrates image and video tokenization over
an arbitrary time horizon. The Binary Spherical Quantization allows for efficient and effective training
of the quantized bottleneck. Our results indicate that the proposed tokenizer runs at a faster speed,
reconstructs with higher fidelity, and in combination with a sequence model offers a strong baseline
for lossy video compression and image synthesis.
References
[1] Thomas J Daede, Nathan E Egge, Jean-Marc Valin, Guillaume Martres, and Timothy B Terriberry. Daala:
A perceptually-driven next generation video codec. arXiv preprint arXiv:1603.03129, 2016.
[2] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini,
and Luc V Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations.
NeurIPS, 2017.
[3] Alaaeldin El-Nouby, Matthew J Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, and Hervé Jégou.
Image compression with product quantized masked image modeling. TMLR, 2023.
[4] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu,
Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. In ICLR,
2022.
[5] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In
ICLR, 2022.
[6] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image
BERT pre-training with online tokenizer. In ICLR, 2022.
[7] Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang,
Luowei Zhou, and Lu Yuan. BEVT: BERT pretraining of video transformers. In CVPR, 2022.
[8] Aaron Van Den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In
NeurIPS, 2017.
[9] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis.
In CVPR, 2021.
[10] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative
image transformer. In CVPR, 2022.
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.
NeurIPS, 2020.
[12] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman,
Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv
preprint arXiv:2303.08774, 2023.
[13] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and
fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[14] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna,
and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv
preprint arXiv:2307.01952, 2023.
[15] Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi
Parikh. Long video generation with time-agnostic VQGAN and time-sensitive transformer. In ECCV,
2022.
[16] Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann,
Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. MAGVIT: Masked generative video transformer. In CVPR,
2023.
[17] Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng,
Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion–tokenizer is key
to visual generation. In ICLR, 2024.
[18] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural
image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[19] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. NeurIPS,
2021.
[20] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and
Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
[21] Chuanxia Zheng and Andrea Vedaldi. Online clustered codebook. In ICCV, 2023.
[22] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization:
VQ-VAE made simple. In ICLR, 2024.
[23] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Moham-
mad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video
generation from open domain textual descriptions. In ICLR, 2022.
[24] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz,
Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video
diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[25] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using
vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
[26] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT:
A video vision transformer. In ICCV, 2021.
[27] Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal,
27(3):379–423, 1948.
[28] David A Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE,
40(9):1098–1101, 1952.
[29] Richard Clark Pasco. Source coding algorithms for fast data compression. PhD thesis, Stanford University
CA, 1976.
[30] Jorma Rissanen and Glen G Langdon. Arithmetic coding. IBM Journal of research and development,
23(2):149–162, 1979.
[31] Jarek Duda. Asymmetric numeral systems. arXiv preprint arXiv:0902.0271, 2009.
[32] Tomáš Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of
Technology, 2012.
[33] Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, and Idoia Ochoa. Deepzip: Lossless data compression
using recurrent neural networks. In DCC, 2019.
[34] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional
image generation with pixelcnn decoders. In NeurIPS, 2016.
[35] James Townsend, Tom Bird, and David Barber. Practical lossless compression with latent variables using
bits back coding. In ICLR, 2019.
[36] James Townsend, Thomas Bird, Julius Kunze, and David Barber. Hilloc: Lossless image compression with
hierarchical latent variable models. In ICLR, 2020.
[37] Fabrice Bellard. Lossless data compression with neural networks. https://bellard.org/nncp/nncp.pdf,
2019.
[38] Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher
Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, et al. Language
modeling is compression. In ICLR, 2024.
[39] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Practical full
resolution learned lossless image compression. In CVPR, 2019.
[40] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of
compute-optimal large language model training. In NeurIPS, 2022.
[41] Vivek K Goyal. Theoretical foundations of transform coding. IEEE Signal Processing Magazine, 18(5):9–
21, 2001.
[42] Johannes Ballé, Philip A Chou, David Minnen, Saurabh Singh, Nick Johnston, Eirikur Agustsson, Sung Jin
Hwang, and George Toderici. Nonlinear transform coding. IEEE Journal of Selected Topics in Signal
Processing, 15(2):339–353, 2020.
[43] Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC
video coding standard. TCSVT, 2003.
[44] Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the High Efficiency
Video Coding (HEVC) standard. TCSVT, 2012.
[45] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. Dvc: An end-to-end
deep video compression framework. In CVPR, 2019.
[46] Oren Rippel, Sanjay Nair, Carissa Lew, Steve Branson, Alexander G Anderson, and Lubomir Bourdev.
Learned video compression. In ICCV, 2019.
[47] Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Balle, Sung Jin Hwang, and George Toderici.
Scale-space flow for end-to-end optimized video compression. In CVPR, 2020.
[48] Jiahao Li, Bin Li, and Yan Lu. Deep contextual video compression. In NeurIPS, 2021.
[49] Fabian Mentzer, George Toderici, David Minnen, Sung-Jin Hwang, Sergi Caelles, Mario Lucic, and Eirikur
Agustsson. VCT: A video compression transformer. In NeurIPS, 2022.
[50] Dailan He, Ziming Yang, Weikun Peng, Rui Ma, Hongwei Qin, and Yan Wang. ELIC: Efficient learned
image compression with unevenly grouped space-channel contextual adaptive coding. In CVPR, 2022.
[51] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through
stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[52] Yuhta Takida, Takashi Shibuya, WeiHsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka,
Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, and Yuki Mitsufuji. SQ-VAE: Variational bayes
on discrete representation with self-annealed stochastic quantization. In ICML, 2022.
[53] Aren Jansen, Daniel PW Ellis, Shawn Hershey, R Channing Moore, Manoj Plakal, Ashok C Popat, and
Rif A Saurous. Coincidence, categorization, and consolidation: Learning to recognize sounds with minimal
supervision. In ICASSP, 2020.
[54] Ian H Witten, Radford M Neal, and John G Cleary. Arithmetic coding for data compression. Communica-
tions of the ACM, 30(6):520–540, 1987.
[55] Jon Hamkins and Kenneth Zeger. Gaussian source coding with spherical codes. IEEE Transactions on
Information Theory, 48(11):2980–2989, 2002.
[56] Thomas Fischer. A pyramid vector quantizer. IEEE Transactions on Information Theory, 32(4):568–583,
1986.
[57] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth
16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
[59] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable
effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[60] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[61] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial
networks. In CVPR, 2019.
[62] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional
adversarial networks. In CVPR, 2017.
[63] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.
IJCV, 115:211–252, 2015.
[64] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár,
and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[65] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action
classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[66] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error
visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[67] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs
trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS, 2017.
[68] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and
Sylvain Gelly. Fvd: A new metric for video generation. In ICLR Workshop, 2019.
[69] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved
techniques for training GANs. NeurIPS, 2016.
[70] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision
and recall metric for assessing generative models. NeurIPS, 2019.
[71] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality
assessment. In ACSSC, 2003.
[72] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution
image synthesis with latent diffusion models. In CVPR, 2022.
[73] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab
Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified
image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
[74] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In NAACL, 2019.
[75] Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural
machine translation. In ICLR, 2018.
[76] Haiqiang Wang, Weihao Gan, Sudeng Hu, Joe Yuchieh Lin, Lina Jin, Longguang Song, Ping Wang,
Ioannis Katsavounidis, Anne Aaron, and C-C Jay Kuo. MCL-JCV: A JND-based H.264/AVC video quality
assessment dataset. In ICIP, 2016.
[77] Alexandre Mercat, Marko Viitanen, and Jarno Vanne. UVG dataset: 50/120fps 4K sequences for video
codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference, pages
297–302, 2020.
[78] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[79] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,
2022.
[80] Jean Bégaint, Fabien Racapé, Simon Feltman, and Akshay Pushparaja. CompressAI: a pytorch library and
evaluation platform for end-to-end compression research. arXiv preprint arXiv:2011.03029, 2020.
Table 7: Comparing BSQ and LFQ.
                     LFQ [17]                                         BSQ (Ours)
Quantized output     v̂ = sign(v)                                      û = (1/√L) sign(u) = (1/√L) sign(v/|v|)
STE gradient         ∂v̂_i/∂v_i = 1                                    ∂û_i/∂v_i = (1/√L)(1 − v_i²/|v|²)
Quantization error   E_v[d(v, v̂)] = ∞ (unbounded)                     E_u[d(u, û)] < √(2 − 2/√L) < √2 (upper-bounded, see Sec. C.4)
Training objective   L_MSE, L_commit, L_LPIPS, L_GAN,                  L_MSE, L_LPIPS, L_GAN,
                     L_entropy = H[p(c|v)] − H[E_v[p(c|v)]]            L_entropy = H[p(c|u)] − Ĥ[E_u[p(c|u)]]
Starting from the initial interval I0 = [0, 1), the AC encoder recursively partitions the interval into a
series of sub-intervals I_n = [l_n, u_n) such that I_n ⊂ I_{n−1} ⊂ · · · ⊂ I_0, where I_n is determined by I_{n−1}
and ρ(y|x_{<n}):
    I_n(y) = [ l_{n−1} + (u_{n−1} − l_{n−1}) Σ_{y′=1}^{y−1} ρ(y′|x_{<n}) ,  l_{n−1} + (u_{n−1} − l_{n−1}) Σ_{y′=1}^{y} ρ(y′|x_{<n}) ) ,   (14)
and the encoder sets I_n = I_n(x_n).
Any number in the final interval I_N suffices to represent the encoded sequence. To obtain
the final bit stream, we take a binary fraction λ = Σ_{i=1}^{C} b_i 2^{−i}, b_i ∈ {0, 1}, in I_N such that
l_N ≤ λ < u_N. The bit stream {b_1, . . . , b_C} is the encoding result with a length of C bits.
The AC decoder takes in λ, starts with I_0, and performs a similar interval-partitioning process. At
the n-th step, the decoder queries the model ρ_n(y|x_{<n}), calculates the sub-intervals for all possible
values of y using Eq. (14), and decodes the output x_n for which λ ∈ I_n(x_n). The decoder recovers
the encoded token sequence by continuing with I_{n+1} based on the decoded x_n and repeating for
N steps.
In practice, the encoder and the decoder can be implemented efficiently with fixed-length integer
numbers and operate incrementally for arbitrarily long input sequences.
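A toy floating-point version of this recursion (ours) makes the encode/decode symmetry explicit; real codecs use the fixed-length integer arithmetic mentioned above, so this sketch is only reliable for short streams.

def ac_encode(tokens, cond_prob, K):
    """tokens: list of ints in {1..K}; cond_prob(prefix) returns a list of K probabilities."""
    low, high = 0.0, 1.0
    for n, x in enumerate(tokens):
        p = cond_prob(tokens[:n])
        cum_lo = sum(p[:x - 1])
        low, high = low + (high - low) * cum_lo, low + (high - low) * (cum_lo + p[x - 1])
    bits, lam = [], 0.0                                   # emit lambda = sum_i b_i 2^{-i} in [low, high)
    while not (low <= lam < high):
        b = 1 if lam + 2.0 ** -(len(bits) + 1) < high else 0
        bits.append(b)
        lam += b * 2.0 ** -len(bits)
    return bits

def ac_decode(bits, cond_prob, K, num_tokens):
    lam = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))
    low, high, out = 0.0, 1.0, []
    for _ in range(num_tokens):
        p, cum = cond_prob(out), 0.0
        for x in range(1, K + 1):                         # find the sub-interval containing lambda
            nxt = cum + p[x - 1]
            if lam < low + (high - low) * nxt:
                low, high = low + (high - low) * cum, low + (high - low) * nxt
                out.append(x)
                break
            cum = nxt
    return out

# Memoryless toy model over K = 4 tokens; decoding recovers the stream exactly.
probs = lambda prefix: [0.5, 0.25, 0.125, 0.125]
stream = [1, 3, 2, 1, 4]
assert ac_decode(ac_encode(stream, probs, 4), probs, 4, len(stream)) == stream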
In Sec. 4.1, we have introduced the mechanism of BSQ and briefly discussed the connections and
differences with LFQ. We summarize them in Table 7. Note that the STE gradient in BSQ is anisotropic
and is more likely to be a good estimate because the quantization error is upper-bounded regardless
of L. This property explains why a commitment loss like L_commit(û, u) is not needed in BSQ but is
useful for LFQ.
C Proofs
    Σ_{c∈C} e^{τ u⊤c} = Σ_{c∈C} Π_{d=1}^{L} e^{τ u_d c_d} = Π_{d=1}^{L} Σ_{c_d∈Ω} e^{τ u_d c_d} .   (15)
Proof. With τ dropped for simplicity of notation,
    Σ_{c∈C} e^{u⊤c} = Σ_{c∈C} Π_{k=1}^{L} e^{u_k c_k}
                   = Σ_{c_1∈Ω} Σ_{c_2∈Ω} · · · Σ_{c_L∈Ω} Π_{d=1}^{L} e^{u_d c_d}
                   = Σ_{c_1∈Ω} Σ_{c_2∈Ω} · · · Σ_{c_L∈Ω} e^{u_L c_L} Π_{d=1}^{L−1} e^{u_d c_d}
                   = Σ_{c_1∈Ω} Σ_{c_2∈Ω} · · · Σ_{c_{L−1}∈Ω} ( Π_{d=1}^{L−1} e^{u_d c_d} ) ( Σ_{c_L∈Ω} e^{u_L c_L} )
                   = · · ·
                   = ( Σ_{c_1∈Ω} e^{u_1 c_1} ) ( Σ_{c_2∈Ω} e^{u_2 c_2} ) · · · ( Σ_{c_L∈Ω} e^{u_L c_L} ) = Π_{d=1}^{L} Σ_{c_d∈Ω} e^{u_d c_d} . □
    q̂(ĉ|u) = e^{τ u⊤ĉ} / Σ_{c∈C} e^{τ u⊤c}
             = Π_{d=1}^{L} e^{τ u_d ĉ_d} / Π_{d=1}^{L} Σ_{c_d∈{−1/√L, 1/√L}} e^{τ u_d c_d}   (using Eq. (15))
             = Π_{d=1}^{L} e^{τ u_d ĉ_d} / ( e^{τ u_d ĉ_d} + e^{−τ u_d ĉ_d} )   (since c_d = ±1/√L = ±ĉ_d)
             = Π_{d=1}^{L} σ(2τ u_d ĉ_d) .
It follows that
    H[q̂(c|u)] = Σ_{d=1}^{L} H(σ(2τ u_d c_d)) . □
    Q(c) = E_u[q̂(c|u)] = (1/N) Σ_u q̂(c|u) = (1/N) Σ_u Π_{k=1}^{L} σ(2 u_k c_k) .
Unlike c, u does not factorize like Eq. (15). This would require us to compute Q(c) as a full
distribution over 2^L states, which is slow (O(L × 2^L)) and easily overfits. Instead, we approximate
Q(c) by a factorized distribution q̃(c) = Π_{d=1}^{L} q̃_d(c_d), where c_d ∈ Ω for Ω = {−1/√L, 1/√L}, using
an M-projection. We again omit τ for notational brevity.
The minimizer of the above projection satisfies ∂D(Q∥q̃)/∂q̃_d = 0:
    q̃_d(c_d)* = Σ_{c_{−d}} Q(c) = Σ_{c_{−d}} E_u[q̂(c|u)]
              = E_u[ Σ_{c_{−d}} Π_k σ(2 u_k c_k) ]
              = E_u[ σ(2 u_d c_d) Σ_{c_{−d}} Π_{k≠d} σ(2 u_k c_k) ]
              = E_u[ σ(2 u_d c_d) Π_{k≠d} Σ_{c_k∈Ω} σ(2 u_k c_k) ]   (each inner sum equals 1)
              = E_u[ σ(2 u_d c_d) ] .
By the nature of the above derivation, the cross entropy H(Q, q̃) = H(q̃) equals the entropy of the
approximation. This means that D(Q∥q̃) = H(q̃) − H(Q) ≥ 0, and the entropy of the approximation
is an upper bound on the true entropy, H(q̃) ≥ H(Q).
In practice, this bound is relatively tight. The most adversarial distribution P(u) puts P(u = (1/√L)·1) = 1/2
and P(u = −(1/√L)·1) = 1/2, where 1 is the all-ones vector: all inputs are maximally correlated, but the
factorized distribution is not. Figure 5 shows an empirical estimate of this approximation error for various
values of τ. In practice, we use τ = 1/100, which has little to no approximation error.
We consider the ℓ2-distance d(u, û) = ∥u − û∥. A simple (but loose) bound is
    E_u[d(u, û)] ≤ max_u d(u, û) = √(2 − 2/√L) < √2 ,   (16)
Figure 5: Empirical estimation of the approximation error with respect to τ at different bottleneck
dimensions L.
where the maximum is attained when u lies on a coordinate axis, u = [0, · · · , 0, 1, 0, · · · , 0] (a single 1
preceded by n zeros and followed by L−1−n zeros). To achieve a tighter bound, we
first expand the definition,
    E_u[d(u, û)] = ( ∫_{S^{L−1}} d(u, û) dS^{L−1}V ) / ( ∫_{S^{L−1}} dS^{L−1}V ) ,   (17)
where S^{L−1} = {x ∈ R^L : ∥x∥ = 1} denotes the unit L-sphere of radius 1 and dS^{L−1}V denotes its
surface area element. We further define a hyperspherical coordinate system that is analogous to the
spherical coordinate system for 3D Euclidean space or the polar coordinate system for 2D space to
represent the surface area element.
    u_1 = cos(φ_1),
    u_2 = sin(φ_1) cos(φ_2),
    u_3 = sin(φ_1) sin(φ_2) cos(φ_3),
    · · ·
    u_{L−1} = sin(φ_1) sin(φ_2) · · · sin(φ_{L−2}) cos(φ_{L−1}),
    u_L = sin(φ_1) sin(φ_2) · · · sin(φ_{L−2}) sin(φ_{L−1}),
    (surface area element)  dS^{L−1}V = sin^{L−2}(φ_1) sin^{L−3}(φ_2) · · · sin(φ_{L−2}) dφ_1 · · · dφ_{L−1},
    (surface area)  S_{L−1} = ∫_{S^{L−1}} dS^{L−1}V = 2π^{L/2} / Γ(L/2) .
Due to symmetry, we consider the subarea A^{L−1} where ∀i ∈ {1, · · · , L}, u_i > 0; every point in it is
quantized to c_1 = û_1 = (1/√L)·1, where 1 is the all-ones vector. The unit hypersphere S^{L−1} is covered
by 2^L such interchangeable subareas. Computing Eq. (17) is therefore equivalent to
    E_u[d(u, û)] = ( ∫_{A^{L−1}} d(u, û) dS^{L−1}V ) / ( ∫_{A^{L−1}} dS^{L−1}V ) .   (18)
We expand the numerator in Eq. (18) as follows:
    ∫_0^{π/2} · · · ∫_0^{π/2} dS^{L−1}V { [cos(φ_1) − 1/√L]² + [sin(φ_1) cos(φ_2) − 1/√L]² + · · ·
        + [sin(φ_1) sin(φ_2) · · · sin(φ_{L−2}) cos(φ_{L−1}) − 1/√L]²
        + [sin(φ_1) sin(φ_2) · · · sin(φ_{L−2}) sin(φ_{L−1}) − 1/√L]² }^{1/2} .   (19)
The expression under the square root consists of L squared terms. The constant terms sum to L · (1/L) = 1.
Summing all the quadratic terms and repeatedly using sin²(θ) + cos²(θ) = 1 gives
    cos²(φ_1) + sin²(φ_1) cos²(φ_2) + · · · + Π_{j=1}^{L−2} sin²(φ_j) sin²(φ_{L−1}) + Π_{j=1}^{L−2} sin²(φ_j) cos²(φ_{L−1}) = 1 .
Therefore, we have
    E_u[d(u, û)] < ( 2 Γ(L/2) / (√π Γ((L−1)/2)) ) ∫_0^{π/2} [ 2 − (2/√L) cos(φ_1) ]^{1/2} sin^{L−2}(φ_1) dφ_1 .   (22)
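The looser bound of Eq. (16) is easy to sanity-check numerically; a quick Monte-Carlo sketch (ours) draws points uniformly on the unit sphere and measures the mean distance to their BSQ corners:

import torch

def mean_quantization_error(L, n=200_000):
    u = torch.nn.functional.normalize(torch.randn(n, L), dim=-1)   # uniform on the unit sphere
    u_hat = (2 * (u >= 0).float() - 1) / L ** 0.5                  # nearest corner of C_BSQ
    return (u - u_hat).norm(dim=-1).mean().item()

for L in (4, 18, 36):
    loose_bound = (2 - 2 / L ** 0.5) ** 0.5
    print(L, round(mean_quantization_error(L), 3), "<", round(loose_bound, 3), "< sqrt(2)")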
D Dataset Overview
ImageNet-1k has 1.28M training images and 50,000 validation images; COCO 2017val has 5,000
images.
UCF101 has 13,320 video clips and three train-val splits. Following prior works [16], we consider
split-1 which has 9,537 clips for training and 3,783 for validation.
The MCL-JCV dataset [76] consists of thirty 1080P (1,920×1,080) video sequences with 24∼30
FPS. The Open Ultra Video Group (UVG) dataset [77] consists of sixteen 4K (3,840×2,160) test
video sequences captured at 50/120 FPS. Following prior works [47], we report the performance on a
subset of seven videos in YUV 8bit format at 120 FPS under the resolution of 1,920×1,080.
Figure 6: Quantization error as a function of the bottleneck dimension L (L ranging from 5 to 35).
E Implementation Details
Training Image Tokenizers. We train the image tokenizer with a batch size of 32 per GPU. We
use the AdamW optimizer [78] with (β_1, β_2) = (0.9, 0.99) and a weight decay of 1 × 10−4. The base
learning rate is 4 × 10−7 (or a total learning rate of 1 × 10−4) and follows a half-period cosine
annealing schedule. The model is trained for 1M steps which amounts to 200 epochs over the entire
ImageNet-1k training set. We did not heavily study the effect of loss weights. Instead, we keep γ = 1
in the entropy terms. We use a perceptual loss weight of 0.1 and an adversarial loss weight of 0.1
throughout the experiments.
Training Video Tokenizers. We finetune the video tokenizer with a batch size of 32 per GPU. The
optimization schedule follows the image-based one but trains for fewer iterations. The network
is initialized from the ImageNet-pretraining checkpoint and undergoes another 500K steps which
amounts to 1600 epochs over UCF-101 split-1 train.
Training a Masked Language Model for Generation. The masked LM is a standard post-LN
Transformer with 24 layers and a hidden dimension of 768 following MaskGIT [10]. We train the
masked LM on 2 nodes of 8× GPUs (16 in total) with a total batch size of 1024 for 1M steps. We
use the AdamW optimizer with (β_1, β_2) = (0.9, 0.96) and a weight decay of 0.045. At inference time, we
use a cosine unmasking schedule in MaskGIT [10] and set the sampling temperature to 15. We
use classifier-free guidance [79]: At training, we replace 20% of the class condition labels with
the mask token so that the model learns an unconditional distribution simultaneously. Let ℓc be
class-conditioned logits and ℓ∅ be unconditional logits. During inference, we interpolate logits using
ℓ′ = ℓc + α(ℓc − ℓ∅ ), where α = 0.5.
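The guidance step amounts to a one-line logit interpolation; a minimal sketch (tensor shapes are whatever the masked LM produces):

import torch

def guided_logits(logits_cond, logits_uncond, alpha=0.5):
    """Classifier-free guidance at inference: l' = l_c + alpha * (l_c - l_0)."""
    return logits_cond + alpha * (logits_cond - logits_uncond)

# alpha = 0 recovers plain conditional sampling; larger alpha strengthens class adherence.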
Training an Auto-Regressive Model for Arithmetic Coding. The auto-regressive model is a
Transformer with 24 layers and a hidden dimension of 768. We train this model on 8× GPUs with a
total batch size of 64. We use the AdamW optimizer with (β_1, β_2) = (0.9, 0.96) and a weight decay of 0.045.
Hardware. The hardware for training is 8×GPU-servers with NVIDIA A5000 (24GB). Pre-training
an image tokenizer and fine-tuning a video tokenizer in the full schedule is done across two servers
with distributed training and takes around 5 days. Training the AR model for AC is done on an
8×GPU server and takes around 1 week. When measuring the tokenizer’s throughput and the
compression runtime, we use a server with 4× A5000 GPU and 1× AMD Ryzen Threadripper PRO
5975WX 32-Core CPU (64 threads).
F Qualitative Results
In Figure 7, we show reconstructed images produced by the proposed BSQ-ViT in comparison to the
best prior work, SDXL-VAE [14]. We can see that our method is able to preserve more details about
high-frequency texture and fine-grained shape/geometry. BSQ-ViT often shows better reconstruction
results for characters.
19
Table 8: Image reconstruction results on COCO2017 and ImageNet-1K (256 × 256). The settings strictly
follow Table 1 except that all images are resized with bilinear interpolation.
COCO2017 val ImageNet-1k val
Method Data Arch. Quant. Param. # bits TP↑ PSNR↑ SSIM↑ LPIPS↓ rFID↓ PSNR↑ SSIM↑ LPIPS↓ rFID↓
DALL-E dVAE [20] CC+YF C VQ 98M 13 34.0 26.97 .0837 .2544 48.60 27.31 .7943 .2544 32.63
±3.41 ±.0922 ±.1057 ±3.81 ±.1114 ±.1057
MaskGIT [10] IN-1k C VQ 54M 10 37.5 18.21 .4596 .1930 8.47 18.63 .4619 .1884 1.98
±2.74 ±0.1606 ±.0444 ±2.90 ±.1812 ±.0497
ViT-VQGAN [4] IN-1k T-B VQ 182M 13 †7.5 - - - - - - - †1.55
SD-VAE 1.x [72] OImg C VQ 68M 10 22.4 23.29 .6705 .0949 6.49 23.65 .6615 .0940 1.40
±3.34 ±.1316 ±.0313 ±3.69 ±.1540 ±.0367
SD-VAE 1.x [72] OImg C VQ 68M 14 22.4 24.17 .7042 .0814 5.75 24.48 .6931 .0814 1.13
±3.50 ±.1276 ±.0289 ±3.98 ±.1502 ±.0289
SD-VAE 1.x [72] OImg C KL 68M 64 22.4 23.21 .6930 .0908 5.94 23.54 .6835 .0899 1.22
±3.24 ±.1249 ±.04282 ±3.62 ±.1465 ±.0337
SD-VAE 2.x [14] OImg+LAION C KL 84M 64 18.9 26.62 .7722 .0584 4.26 26.90 .7592 .0609 0.70
±3.64 ±.1086 ±.0273 ±4.09 ±.1300 ±.0349
SDXL-VAE [14] OImg+LAION+? C KL 84M 64 18.9 27.08 .7953 .0541 3.93 27.37 .7814 .0574 0.67
±3.88 ±.1066 ±.0250 ±4.36 ±.1282 ±.0320
Ours IN-1k T-B BSQ 174M 18 45.1 26.89 .8133 .0652 5.41 27.78 .8171 .0633 0.99
±3.47 ±.0851 ±.0255 ±3.99 ±.0987 ±.0307
Ours IN-1k T-B BSQ 174M 36 45.1 29.85 .8862 .0341 3.07 30.12 .8803 .0355 0.36
±3.65 ±.0570 ±.0163 ±4.13 ±.0670 ±.0207
Ours (w/. EMA) IN-1k T-B BSQ 174M 36 45.1 30.19 .8904 .0314 3.07 30.45 .8843 .0329 0.42
±3.69 ±.0561 ±.0153 ±4.19 ±.0661 ±.0194
In Figure 8, we show sampled results produced by a Masked LM with the proposed BSQ-ViT in
comparison to existing methods, BigGAN [18] and ADM [19]. We also plot the samples from the
ground-truth ILSVRC2012 validation set for reference. Our method produces competitive results
with state-of-the-art methods.
Following SSF [47], we used FFmpeg3 to produce the evaluation metrics for H.264 and HEVC. We
use the commands provided in CompressAI [80].
ffmpeg -y -s:v $RESOLUTION -i $FILE.yuv -c:v h264 -crf $CRF -preset medium \
-bf 0 -pix_fmt yuv420p -threads 4 $FILE.mp4
where $RESOLUTION ∈ {1920x1080, 640x360} and $CRF ∈ {17, 20, 22, 27, 32, 37, 42, 47}.
H Limitations
The proposed tokenizer has been tested on images with a 256×256 or 128×128 resolution and videos
with a 128×128 resolution. Training a visual tokenizer on higher-resolution inputs and variable
aspect ratios remains unexplored. Also, the training dataset is limited to ImageNet-1k and UCF-101.
Scaling the proposed model to larger-scale visual contents remains an interesting problem to study.
I Broader Impacts
The video compression application illustrated in the paper may be useful to reduce the storage and
transmission cost of video data. Also, the proposed visual tokenization model runs more efficiently
than prior ones, resulting in potential energy savings. Both of them will ultimately benefit society.
3 https://ffmpeg.org/
Figure 7: Reconstruction results of BSQ-ViT (right) compared to the original image (left) and SDXL-VAE [14]
(middle). The three images are taken from COCO 2017val which are more scene-centric compared to ImageNet
data that our model is trained on.
BigGAN ADM BSQ-ViT+Masked-LM (Ours) Groundtruth
Figure 8: Sampled generation results of BSQ-ViT + Masked-LM (second column from left) compared to
BigGAN [18] (right), ADM [19] (second column from right) and the original images (left). Classes are 1:
goldfish, 279: arctic fox, 323: monarch butterfly, 417: balloon.
LPIPS is based on the implementation7 licensed under a BSD-2-Clause license.
SSIM and MS-SSIM are based on the PyTorch implementation8 licensed under an MIT license.
Generation Metrics (FID, Inception Score, Precision, and Recall) are reported using a TensorFlow
implementation9 licensed under an MIT license.
7 https://github.com/richzhang/PerceptualSimilarity
8 https://github.com/VainF/pytorch-msssim
9 https://github.com/openai/guided-diffusion/tree/main/evaluations
10 https://github.com/openai/DALL-E
11 https://github.com/CompVis/latent-diffusion
12 https://huggingface.co/stabilityai/sd-vae-ft-mse
13 https://huggingface.co/stabilityai/sdxl-vae
14 https://github.com/openai/guided-diffusion
15 https://github.com/google-research/maskgit/tree/main
16 https://github.com/InterDigitalInc/CompressAI