Image and Video Tokenization with Binary Spherical Quantization (Zhao et al., 2024)
Abstract
We propose a new transformer-based image and video tokenizer with Binary Spher-
ical Quantization (BSQ). BSQ projects the high-dimensional visual embedding
to a lower-dimensional hypersphere and then applies binary quantization. BSQ
is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary
token dimensions, and (3) compact: compressing visual data by up to 100× with
minimal distortion. Our tokenizer uses a transformer encoder and decoder with
simple block-wise causal masking to support variable-length videos as input. The
resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on im-
age and video reconstruction benchmarks with 2.4× throughput compared to the
best prior methods. Furthermore, by learning an autoregressive prior for adaptive
arithmetic coding, BSQ-ViT achieves video compression results comparable to
state-of-the-art video codecs. BSQ-ViT also enables masked language models to
achieve image synthesis quality competitive with GAN- and diffusion-based methods.
1 Introduction
Learned discrete image and video tokenization allows for state-of-the-art visual compression [1, 2, 3],
recognition [4, 5, 6, 7] and generation [8, 9, 10]. These models follow a proven recipe from large
language modeling [11, 12, 13]: tokenize inputs and outputs into discrete units and learn an auto-
regressive model to predict the tokenized stream one token at a time. The most widely used approach
for image encoding is the Vector-Quantized Variational Auto-Encoder (VQ-VAE) [8], which encodes
inputs into continuous latent embeddings and maps them to a learned codebook through nearest-neighbor
lookup. However, VQ-VAE-style approaches have two drawbacks: First, most image encoders are
built upon convolutional networks (CNN) [9, 14]. Adapting spatial convolution for images to spatial-
temporal convolution for videos requires non-trivial architectural changes [15, 16, 17] with increased
computational cost. Treating videos as a sequence of images leads to a suboptimal quantization [16].
Second, vector quantization (VQ) scales poorly with the codebook size. The runtime scales linearly
with the codebook size, and the codebook easily overfits on smaller datasets [17]. This is especially
troubling for video inputs, as they rely on larger codebooks to represent both static visual patterns
and dynamic motion patterns.
This paper proposes a unified visual tokenizer based on a Vision Transformer and Binary Spherical
Quantization (BSQ). The Transformer-based encoder-decoder leverages a block-wise causal mask and
uses only visual tokens from the current or past timestamps for reconstruction (Figure 3). BSQ first
projects the high-dimensional visual embedding of the transformer encoder to a lower-dimensional
hypersphere and then applies binary quantization. The transformer encoder, decoder, and BSQ are
seamlessly integrated into the VQ-GAN [9] framework and trained end-to-end.
2 Related Work
Visual Tokenization. VQ-VAE [8] introduced the concept of discrete tokenized bottlenecks in
auto-encoder architectures. Recent improvements include better training objectives [20, 9], increasing
VQ codebook usage [4, 21], replacing VQ with product quantization (PQ) [3] or scalar quantization
(SQ) [22], and employing stronger generative models [9, 10]. Image tokenizers are trivially extended
to video by tokenizing individual frames [23, 24]. However, this ignores dynamic motions and leads
to suboptimal tokenization: The same visual information is compressed repeatedly across frames.
Video Tokenization. Dedicated video tokenizers make better use of temporal correlations in the input
signal. VideoGPT [25] proposes 3D (de-)convolutions in VQ-VAE for video generation. TATS [15]
replaces zero padding with replicate padding to mitigate the temporal corruption when video length
varies. Yu et al. introduce central inflation of pretrained 2D convolutional filters to 3D [16] and
further make them causal [17]. Phenaki [23] adopts a factorized causal video vision Transformer [26]
(C-ViViT), which improves efficiency but sacrifices the ability to model complex motion across time.
Neural Compression. Since Shannon established the fundamental source coding theorem [27], it
has formed the basis of lossless compression [28, 29, 30, 31] with probabilistic models including
RNN [32, 33], CNN [34, 8], VAE [35, 36], and Transformers [37, 38]. L3C [39] presents a fast
hierarchical probabilistic model for lossless image compression. LMIC [38] shows that LLMs
trained primarily on text, e.g. Llama 2 [13] and Chinchilla [40], are general-purpose compressors
for text, images, and audio. However, these LLMs are too big and slow to make this compression
practical. Our tokenizer presents a lighter-weight alternative: Tokenization performs initial local
lossy compression, while a lightweight and thus computationally efficient sequence model (∼300M)
compresses the global video structure.
Video compression. Most high-performing modern video compression methods rely on hybrid coders
that combine transform coding [41, 42] and motion compensation [43, 44]. This design carries over to
most of the recently popularized learning-based solutions [45, 46, 47, 48]. VCT [49] proposes
a Transformer-based temporal entropy model to learn motion implicitly. However, VCT requires
a heavily-engineered image compression model [50] and has a short temporal context window. In
this work, we show that a learned video tokenizer combined with an arithmetic coder modeled by a
sequence model achieves competitive compression results without explicitly modeling motion.
3 Preliminaries
A tokenization-based compression algorithm has three basic steps: A visual tokenizer, i.e. VQ-
VAE [8] or LFQ [17], translates raw visual inputs to a discrete set of tokens and back. A sequence
model then predicts an auto-regressive probability distribution over these discrete tokens. Finally,
arithmetic coding translates this distribution into a compressed representation.
Visual Tokenization. VQ-VAE [8] introduced the concept of learning discrete visual representation
with an auto-encoder architecture and a bottleneck module in between with vector quantization (VQ).
Given a video X ∈ R^{T×H×W×3}, an encoder E produces a set of d-dimensional latent embeddings
Z = E(X) ∈ R^{(T/q × H/p × W/p)×d} with a spatial-temporal downsample factor of q×p×p. The bottleneck
module q then transforms the real-valued latent embeddings into some discrete tokens ẑ = q(z).
In Vector Quantization (VQ), the quantizer q_VQ assigns each z ∈ Z to the closest entry in a learnable
codebook C = [c_1 · · · c_K] ∈ R^{K×d}:
    ẑ = q_VQ(z) = c_k = arg min_{c_k̂ ∈ C} ∥z − c_k̂∥_2 .   (1)
Here, K is the vocabulary size of the codebook and the integer k is the discretized token representation
of z which can be stored in ⌈log(K)⌉ bits. A decoder G maps the discretized tokens back into a visual
representation X̂ = G(Ẑ). The entire network (E, G, and q) is end-to-end trainable and minimizes an
MSE loss LMSE = ∥X̂ − X∥2 using straight-through estimator [51] to propagate gradients through
the quantization bottleneck. More recent quantizers rely on a perceptual LLPIPS and adversarial
LGAN loss for better visual quality [9]
    min_{E,G,q} E_X [ L_VQ(E, G, q) + η L_LPIPS(E, G, q) + λ L_GAN(E, G, q) ] ,   (2)
where the quantization loss term LVQ emulates online clustering to learn ck . The main issue
with VQ-VAE is that Vector Quantization scales poorly with increasing vocabulary size K [17].
Remedies include using a smaller code dimension [4], introducing stochasticity [52], reviving “dead”
codevectors [21], and regularizing with a commitment loss [8]:
Lcommit (ẑ, z) = ∥ sg(ẑ) − z∥, (3)
where sg(·) denotes the stop-gradient operation.
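For concreteness, the following is a minimal PyTorch-style sketch of the VQ bottleneck in Eq. (1) with a straight-through estimator and the commitment loss in Eq. (3); the function name, shapes, and loss weighting are illustrative assumptions, not the paper's released code.

import torch
import torch.nn.functional as F

def vq_bottleneck(z, codebook, beta=0.25):
    """z: (N, d) encoder outputs; codebook: (K, d) learnable code vectors."""
    # Nearest-neighbor lookup (Eq. 1): squared L2 distance to every code.
    d2 = (z ** 2).sum(-1, keepdim=True) - 2 * z @ codebook.t() + (codebook ** 2).sum(-1)
    k = d2.argmin(dim=-1)                       # integer tokens, storable in ceil(log K) bits each
    z_q = codebook[k]                           # assigned codes c_k
    # Commitment loss (Eq. 3) pulls the encoder toward its assigned code; the
    # codebook term emulates online clustering of the codes toward encoder outputs.
    loss = beta * F.mse_loss(z, z_q.detach()) + F.mse_loss(z_q, z.detach())
    # Straight-through estimator: forward pass uses z_q, gradients flow to z unchanged.
    z_q = z + (z_q - z).detach()
    return z_q, k, loss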
Lookup-Free Quantization (LFQ) [17] uses a fixed implicit codebook C_LFQ = {−1, 1}^L given as the
corners of a hypercube in L-dimensional space. The best vector quantizer for this implicit codebook
is the binary quantization qLF Q (z) = sign(z). To optimize for an effective latent code and encourage
usage of the implicit codebook, Yu et al. [17] use an additional entropy objective [53]:
Lentropy = E [H(q(z))] − γH [E [q(z)]] , (4)
where both entropy terms rely on a soft quantization [2]
    q̂(c|z) = exp(−τ(c − z)²) / Σ_{c′∈C_LFQ} exp(−τ(c′ − z)²)   (5)
to guarantee the loss is differentiable. The final loss LLF Q is a combination of LMSE , Lcommit ,
LLPIPS , LGAN , and Lentropy . The main computational bottleneck in LFQ is the entropy optimization
of a higher-dimensional codebook, as it involves summation over 2L implicit codebook entries.
Both VQ-VAE and LFQ lossily compress visual inputs X into N discrete tokens [k1 , . . . , kN ], where
ki ∈ {1, . . . K}, in N ⌈log K⌉ bits. Neither tokenization strategy exploits the global image or video
structure well. A sequence model with lossless arithmetic coding better fits this global structure.
Arithmetic Coding (AC) [29, 30, 54] offers a way of constructing a bitstream with near-optimal
length by leveraging the statistical properties of the coding distribution. Given a distribution over
token streams P_t : {1, · · · , K}^N ↦ (0, 1], arithmetic coding encodes the token stream in
⌈− log P_t(k_1, . . . , k_N)⌉ + 1 bits. The most common token distribution is an auto-regressive model
Pt (k1 , . . . , kN ) = Pt (k1 )Pt (k2 |k1 ) . . . Pt (kN |k1 , . . . , kN −1 ) (6)
for which efficient incremental encoding and decoding algorithms exist [49].
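To make the code-length statement concrete, here is a small sketch (ours) that evaluates ⌈− log₂ P_t(k_1, . . . , k_N)⌉ + 1 for an auto-regressive model supplied as a callable; the uniform toy model is purely illustrative.

import math

def ac_code_length(tokens, cond_prob):
    """Bits used by arithmetic coding under an auto-regressive model
    cond_prob(k, prefix) = P(k_n = k | k_1, ..., k_{n-1})."""
    log_p = 0.0
    for n, k in enumerate(tokens):
        log_p += math.log2(cond_prob(k, tokens[:n]))
    return math.ceil(-log_p) + 1

# A memoryless model over K = 16 tokens costs 4 bits per token, plus one.
uniform = lambda k, prefix: 1.0 / 16
print(ac_code_length([3, 7, 15, 0], uniform))   # -> 17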
4.1 Binary Spherical Quantization (BSQ)
Binary Spherical Quantization (BSQ) optimizes over an implicit codebook C_BSQ = {−1/√L, 1/√L}^L,
a hypercube projected onto the unit sphere. Each corner c_k ∈ C_BSQ of the hypercube corresponds to a
unique token k. The quantizer works as follows: it projects some high-dimensional latent embedding
z to a lower-dimensional unit hypersphere u, applies binary quantization per axis û = sign(u),
and back-projects to the quantized vector in the original latent space x̂, as shown in Figure 1a.
Specifically, we start with an encoded visual input z = E(x) ∈ R^d. We first linearly project the
latent embedding to L dimensions, v = Linear(z) ∈ R^L, where L ≪ d. Next, we project v onto the
unit sphere, u = v/|v|, and perform binary quantization on each dimension of u independently,
û = (1/√L) sign(u), where sign(·) is the sign function. To keep outputs on the unit sphere, we map
sign(0) → 1. We use a Straight-Through Estimator (STE) [51] to make the operator differentiable,
sign_STE(x) = sg(sign(x) − x) + x, where sg(·) denotes the stop-gradient operation. Finally, we
back-project the quantized û to the d-dimensional space, ẑ = Linear(û) ∈ R^d.
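A minimal PyTorch sketch of this bottleneck (down-projection, ℓ2 normalization, per-axis binarization with an STE, and back-projection); layer names, shapes, and defaults below are our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BSQ(nn.Module):
    def __init__(self, d=768, L=18):
        super().__init__()
        self.L = L
        self.proj_down = nn.Linear(d, L)    # z in R^d -> v in R^L, with L << d
        self.proj_up = nn.Linear(L, d)      # quantized u_hat in R^L -> z_hat in R^d

    def forward(self, z):
        v = self.proj_down(z)
        u = F.normalize(v, dim=-1)                              # u = v / |v|, on the unit sphere
        # Per-axis binary quantization; (u >= 0) maps sign(0) to +1 as in the text.
        u_hard = (2 * (u >= 0).float() - 1) / self.L ** 0.5
        u_hat = u + (u_hard - u).detach()                       # straight-through estimator
        return self.proj_up(u_hat), u, u_hat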
BSQ has a few appealing properties: As with LFQ, the implicit codebook entry is parameter-free and
easy to compute. Unlike LFQ, a soft quantization of BSQ has a simple probabilistic interpretation,
which leads to efficient entropy computation in an entropy loss Lentropy . Finally, BSQ’s quantization
error is bounded, which empirically leads to much faster and better convergence than LFQ.
Efficient implicit code assignment. At inference time, we map a projected embedding v to a token index
through simple binarization, k = Σ_{i=1}^{L} 1[v_i > 0] 2^{i−1}, where 1[·] is the indicator function. The inverse
mapping uses bit-shift and bitwise AND operations.
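A small sketch (our indexing convention) of the bit-packing and its inverse:

import torch

def bits_to_index(v):
    """v: (..., L) pre-quantization values; returns k = sum_i 1[v_i > 0] * 2^(i-1)."""
    L = v.shape[-1]
    weights = 2 ** torch.arange(L, device=v.device)          # 2^0, 2^1, ..., 2^(L-1)
    return ((v > 0).long() * weights).sum(dim=-1)

def index_to_bits(k, L):
    """Inverse mapping via bit-shift and bitwise AND; returns corners in {-1/sqrt(L), +1/sqrt(L)}^L."""
    bits = (k.unsqueeze(-1) >> torch.arange(L, device=k.device)) & 1
    return (2 * bits.float() - 1) / L ** 0.5

# Round trip: re-quantizing the reconstructed corner recovers the same token index.
v = torch.randn(4, 18)
k = bits_to_index(v)
assert torch.equal(bits_to_index(index_to_bits(k, 18)), k)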
Soft BSQ and entropy. To best use the entire range of the implicit codebook C_BSQ, we use the
entropy loss L_entropy = E_u[H(q(u))] − γ H[E_u[q(u)]] [53]. To compute this entropy loss we first
derive a soft quantization scheme [2]. Since both the codebook entries and the inputs to the quantizer
are unit vectors, the soft quantization is a distribution
    q̂(c|u) = exp(τ c⊤u) / Σ_{c′∈C_BSQ} exp(τ c′⊤u) = Π_{d=1}^{L} σ(2τ c_d u_d) ,   (7)
where σ is a sigmoid function, and the overall soft quantizer is independent along each dimension.
See Sec. C.1 for a derivation. This form allows for an efficient computation of the first entropy term
" L #
X
Eu [H(q̂(c|u))] = Eu H(q̂(cd |ud )) . (8)
d=1
See Sec. C.2 for a derivation. Instead of reasoning over distributions over the entire codebook,
which is exponentially large, we instead treat each dimension independently. The resulting entropy
computation is linear to the dimension L of the bottleneck.
Unfortunately, the second entropy term cannot directly use the same independence assumption, as
dimensions in the expected value Eu [q̂(c|u)] are correlated through the distribution of u. We find the
closest factorized distribution q̃(c) = Π_{d=1}^{L} q̃(c_d) to E_u[q̂(c|u)], and instead minimize the entropy
of the approximate distribution. As we will show in Sec. C.3, the best approximation in terms of the
KL-divergence is q̃(c_d) = E_{u_d}[q̂(c_d|u_d)]. The final approximate entropy term to maximize is
    H(E_u[q̂(c|u)]) ≈ H(q̃(c)) = Σ_{d=1}^{L} H(E_{u_d}[q̂(c_d|u_d)]) .   (9)
As we will show in Sec. C.3 this approximation is an upper bound to the true entropy, but empirically
closely tracks the true entropy. This entropy term is again efficient for evaluation.
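Putting Eq. (7)-(9) together, both entropy terms reduce to sums of per-dimension binary entropies. The following is a sketch of the resulting loss (natural-log entropies and our variable names; τ and γ follow the notation above):

import torch

def bsq_entropy_loss(u, tau=1.0, gamma=1.0, eps=1e-8):
    """u: (N, L) unit vectors before binarization.
    Returns E_u[H(q(c|u))] - gamma * H(q_tilde(c)) using the factorization
    q_hat(c_d = +1/sqrt(L) | u) = sigmoid(2 * tau * u_d / sqrt(L)), cf. Eq. (7)."""
    L = u.shape[-1]
    p = torch.sigmoid(2 * tau * u / L ** 0.5)          # (N, L), prob. of the positive corner per axis
    def bin_entropy(q):
        return -(q * (q + eps).log() + (1 - q) * (1 - q + eps).log())
    first = bin_entropy(p).sum(dim=-1).mean()          # Eq. (8): E_u[ sum_d H(q(c_d|u_d)) ]
    second = bin_entropy(p.mean(dim=0)).sum()          # Eq. (9): sum_d H(E_u[q(c_d|u_d)])
    return first - gamma * second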
Quantization error in BSQ. Most quantizers use pass-through gradient estimates during training [17,
8, 9]. Though simple to implement, it assumes that the gradients for an unquantized u and quantized
û bottleneck are almost the same, which only holds if the quantization error d(u, û) = ∥u − û∥ is
small. As we show in Sec. C.4, this is true for BSQ
    E_u[d(u, û)] < √(2 − 2/√L) < √2 .   (10)
Relation to other quantization methods. BSQ is closely connected to many concepts introduced in
information and coding theories. LFQ [17] uses the same binarization technique as BSQ but does not
normalize its output. This leads to an unbounded quantization error and does not allow for as simple
a soft quantization for entropy computation. A pictorial comparison between LFQ and BSQ is
shown in Figure 2 and a summary is provided in Table 7. Spherical Vector Quantization (SVQ) [55]
also ensures all code vectors have a pre-defined radius. However, SVQ assumes a variety of radii,
which have to be encoded by an additional gain quantizer. In our case, the source code is the output
of a learned encoder E. Therefore, the unit radius assumption is sound, and the gain quantizer can be
avoided. Pyramid Vector Quantization (PVQ) [56] assumes all code vectors have a constant ℓ1 norm,
but the ℓ1 normalized centroids partition the hypersphere less uniformly than ℓ2 .
4.2 Tokenization Network with Causal Video Transformer
We propose to use Vision Transformer (ViT) [57] to model both the encoder and decoder due to its
better computational efficiency and higher reconstruction quality.
Video Transformer. We start from ViT-VQGAN [4] and extend it to take videos as input. We divide
an input video X ∈ RT ×H×W ×3 into non-overlapping patches of size 1 × p × p, xi ∈ R1×p×p×3 .
The visual tokens are flattened into a 1D sequence, linearly projected, and passed through a stack of
N Transformer Encoder layers to yield the latent representation (z_1, · · · , z_N). The decoder uses
the same architecture, maps the latent embeddings ẑ back to pixel space, and regroups them into
the original shape: (x̂_1, · · · , x̂_N) = MLP(TransformerDecoder(ẑ_1, · · · , ẑ_N)), where MLP is a
decoding head with a two-layer MLP, i.e. Linear ◦ Tanh ◦ Linear.
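A schematic sketch of this data flow for a single frame, with stock nn.TransformerEncoder blocks standing in for the actual encoder/decoder stacks; all hyperparameters below are illustrative assumptions.

import torch
import torch.nn as nn

class ViTTokenizerSketch(nn.Module):
    def __init__(self, p=8, d=512, n_layers=4):
        super().__init__()
        self.p = p
        self.embed = nn.Linear(3 * p * p, d)          # linear projection of flattened patches
        block = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, n_layers)
        # The "decoder" is another stack of self-attention blocks, not an autoregressive decoder.
        self.decoder = nn.TransformerEncoder(block, n_layers)
        self.head = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, 3 * p * p))

    def forward(self, x):                             # x: (B, H, W, 3)
        B, H, W, _ = x.shape
        patches = x.reshape(B, H // self.p, self.p, W // self.p, self.p, 3)
        patches = patches.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, 3 * self.p * self.p)
        z = self.encoder(self.embed(patches))         # latent tokens (z_1, ..., z_N)
        # ... the quantization bottleneck (e.g. BSQ) would be applied to z here ...
        x_hat = self.head(self.decoder(z))            # Linear o Tanh o Linear decoding head
        return x_hat.reshape(B, H // self.p, W // self.p, self.p, self.p, 3) \
                    .permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, 3)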
Blockwise Causal Attention. During training, we always assume the input video has T frames,
which might not hold at inference. Padding shorter video segments to T frames works but wastes a
lot of bits, especially in the context of compression. To handle variable-length videos, we propose a
simple blockwise causal masked attention analogous to causal attention in language modeling [58]. It
specifies that only those tokens at time t or earlier can be used for reconstructing the visual tokens at
time t ∈ {1, · · · , T }.
    (z_{(t−1)·(H/p)·(W/p)+1}, · · · , z_{t·(H/p)·(W/p)}) = TransformerEncoder(x_1, · · · , x_{t·(H/p)·(W/p)}),   (11)
    (ẑ_{(t−1)·(H/p)·(W/p)+1}, · · · , ẑ_{t·(H/p)·(W/p)}) = q_BSQ(z_{(t−1)·(H/p)·(W/p)+1}, · · · , z_{t·(H/p)·(W/p)}),   (12)
    (x̂_{(t−1)·(H/p)·(W/p)+1}, · · · , x̂_{t·(H/p)·(W/p)}) = MLP(TransformerDecoder(ẑ_1, · · · , ẑ_{t·(H/p)·(W/p)})).   (13)
This can be efficiently implemented with a blockwise causal attention mask, i.e. a blockwise
lower-triangular matrix, as shown in Figure 3. When T = 1, the proposed encoder-decoder reduces to a ViT with
a full attention mask. Therefore, we can easily train it using a mixture of images and videos.
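A small sketch of how such a mask can be constructed; it follows the boolean-mask convention of nn.MultiheadAttention/nn.TransformerEncoder (True marks positions a query may not attend to), which is an assumption about the surrounding implementation.

import torch

def blockwise_causal_mask(T, tokens_per_frame):
    """(T*n, T*n) boolean mask: entry (i, j) is True if the token at position i
    must NOT attend to position j, i.e. if j belongs to a later frame than i."""
    n = tokens_per_frame
    frame_id = torch.arange(T * n) // n        # frame index of every flattened token
    return frame_id.unsqueeze(1) < frame_id.unsqueeze(0)

# With T = 1 the mask is all False, i.e. plain full attention over image patches.
print(blockwise_causal_mask(3, 2).int())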
We use a factorized spatial-temporal position embedding to encode temporal position information.
Specifically, we add a set of zero-initialized temporal position embeddings PE_t ∈ R^{T×d} to the
original spatial position embedding PE_s ∈ R^{N×d} of the image tokenizer, i.e. PE[i, :, :] =
PE_t[i, None, :] + PE_s[None, :, :].
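A tiny sketch of the factorized embedding; in practice the spatial table would come from the pretrained image tokenizer, and here it is randomly initialized for illustration.

import torch
import torch.nn as nn

class FactorizedPE(nn.Module):
    def __init__(self, T, N, d):
        super().__init__()
        self.pe_s = nn.Parameter(torch.randn(N, d) * 0.02)    # spatial PE (from the image tokenizer)
        self.pe_t = nn.Parameter(torch.zeros(T, d))           # temporal PE, zero-initialized

    def forward(self, tokens):                                # tokens: (B, T, N, d)
        return tokens + self.pe_t[None, :, None, :] + self.pe_s[None, None, :, :]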
Training the Video Tokenizer from an Image Tokenizer. Due to the lack of diversity in existing
video datasets, we first train an image tokenizer on image data and then fine-tune it into a video
tokenizer on video data.
Table 1: Image reconstruction results on COCO2017 and ImageNet-1K (256 × 256). The “data” column
refers to the training data: CC for CC3M, YF for YFCC100M, OImg for OpenImages, LAION for LAION-5B,
IN for ImageNet, and “?” for unknown source. The “arch.” column shows the encoder/decoder architecture: C
for ConvNets with Self-Attention, and T-B for ViT-Base. The “# bits” column refers to the effective number of
bits per token defined in Sec. 5.1. # is obtained by multiplying the latent dimension with the precision. The “TP”
column means the inference throughput (images/second) per GPU. † The number is taken from the paper. Note
that STDs of PSNR, SSIM, and LPIPS are computed across samples instead of multiple runs.
COCO2017 val ImageNet-1k val
Method Data Arch. Quant. Param. # bits TP↑ PSNR↑ SSIM↑ LPIPS↓ rFID↓ PSNR↑ SSIM↑ LPIPS↓ rFID↓
DALL-E dVAE [20] CC+YF C VQ 98M 13 34.0 25.15 .7497 .3014 55.07 25.46 .7385 .3127 36.84
±3.49 ±.1124 ±.1221 ±3.93 ±.1343 ±.1480
MaskGIT [10] IN-1k C VQ 54M 10 37.6 17.52 .4194 .2057 8.90 17.93 .4223 .2018 2.23
±2.75 ±.1619 ±.0473 ±2.93 ±.1827 ±.0543
ViT-VQGAN [4] IN-1k T-B VQ 182M 13 †7.5 - - - - - - - †1.55
SD-VAE 1.x [72] OImg C VQ 68M 10 22.4 21.78 .6139 .1042 6.79 22.12 .6046 .1039 1.52
±3.41 ±.1430 ±.0345 ±3.79 ±.1663 ±.0409
SD-VAE 1.x [72] OImg C VQ 68M 14 22.4 22.54 .6470 .0905 6.07 22.82 .6354 .0912 1.23
±3.55 ±.1409 ±.0323 ±3.97 ±.1644 ±.0390
SD-VAE 1.x [72] OImg C KL 68M 64# 22.4 21.68 .6375 .0985 5.94 21.99 .6275 .0980 1.35
±3.32 ±.1375 ±.0309 ±3.74 ±.1600 ±.0371
SD-VAE 2.x [14] OImg+LAION C KL 84M 64# 18.9 24.82 .7202 .0694 4.63 25.08 .7054 .0731 0.78
±3.64 ±.1241 ±.0344 ±4.11 ±.1469 ±.0448
SDXL-VAE [14] OImg+LAION+? C KL 84M 64# 18.9 25.11 .7433 .0623 4.23 25.38 .7276 .0666 0.72
±3.91 ±.1240 ±.0289 ±4.41 ±.1469 ±.0373
Ours IN-1k T-B BSQ 174M 18 45.1 25.08 .7662 .0744 5.81 25.36 .7578 .0761 1.14
±3.57 ±.0993 ±.0295 ±4.02 ±.1163 ±.0358
Ours IN-1k T-B BSQ 174M 36 45.1 27.64 .8485 .0412 3.42 27.88 .8410 .0432 0.41
±3.74 ±.0704 ±.0199 ±4.26 ±.0821 ±.0253
Ours (w/. EMA) IN-1k T-B BSQ 174M 36 45.1 27.92 .8526 .0380 3.34 28.14 .0814 .0400 0.45
±3.78 ±.0698 ±.0187 ±4.32 ±.0814 ±.0237
5 Experiments
We train the image tokenization model on the training set of ImageNet ILSVRC2012 [63] and evaluate
the image reconstruction result on the validation set of MS-COCO [64] and ImageNet, denoted by
COCO 2017val and ImageNet-1k respectively. We fine-tune the video tokenization model on UCF-
101 [65] and conduct video compression experiments on two standard benchmarks, i.e. MCL-JCV
and UVG. We leave dataset statistics and implementation details in Sec. E.
Evaluation metrics. For image/video tokenization, we report a perceptual metric (LPIPS-
AlexNet) [59], PSNR, SSIM [66], and Fréchet Inception/Video Distance (FID/FVD) [67, 68]. To
distinguish them from the generation metrics, we denote them as rFID/rFVD. For generation, we report FID, Inception
Score (IS) [69], and improved precision and recall (IPR, Prec, and Rec) [70]. For compression, we
report PSNR and MS-SSIM [71] under different levels of bits per pixel (bpp).
Table 2: Video reconstruction results on UCF-101 (split 1).
UCF-101 train UCF-101 val
Method Backbone Quantizer Param. # bits PSNR↑ SSIM↑ LPIPS↓ rFVD↓ PSNR↑ SSIM↑ LPIPS↓ rFVD↓
(Image Tokenizer, w/o adapting to videos)
Ours ViT VQ 174M 14 25.64 .8142 .1120 357 25.58 .8120 .1146 382
Ours ViT BSQ 174M 18 25.86 .8273 .1089 326 25.83 .8259 .1108 342
(Image Tokenizer → Video Tokenizer)
MaskGIT [10] 2D CNN VQ 53M 10 21.5 .685 0.114 216 - - - -
TATS [15] 3D CNN VQ 32M 14 - - - 162 - - - -
MAGVIT-L [16] 3D CNN VQ 158M 10 22.0 .701 .0990 25 - - - -
MAGVIT-v2 [17] C.-3D CNN LFQ 158M 18 - - .0694 16.12 - - - -
MAGVIT-v2 [17] C.-3D CNN LFQ N/A (>158M) 18 - - .0537 8.62 - - - -
Ours non-BC ViT VQ 174M 14 33.06 .9518 .0223 9.16 32.92 .9506 .0228 12.79
Ours BC ViT VQ 174M 14 32.81 .9496 .0236 10.76 32.68 .9484 .0241 14.17
Ours BC ViT BSQ 174M 18 32.08 .9421 .0244 8.08 31.49 .9357 .0276 11.62
Ours BC ViT BSQ 174M 36 33.80 .9606 .0159 4.10 33.55 .9588 .0167 6.21
For VQ-based models, we count # bits as ⌈log K⌉, where K is the
codebook size; For KL-regularized models (SD-VAE 2.x and XL), since the latent is continuous,
we count # bits as the latent dimension multiplied by the numeric precision (here we use 16 since
the checkpoint is stored in FP16). For our BSQ, # bits is L because each latent channel is binary.
We summarize the key observations as follows. (1) BSQ efficiently compresses image patches
into a small amount of bits. It reconstructs images better in all metrics using fewer bits per token
than the second-best method (SDXL-VAE). (2) BSQ is also computationally efficient. Although
the ViT-based backbone doubles the parameters, our method yields a 2.4× higher throughput than
SDXL-VAE. MaskGIT runs at a comparable speed but reconstructs significantly worse because of a
small codebook size (1024) and more spatial downsampling (16×). (3) BSQ is generalizable across
different domains of images. ImageNet is relatively object-centric while COCO is more scene-centric.
Though trained on ImageNet only, our method does well on the scene-centric COCO too. It even
works better than SD-VAE 1.x/2.x trained on the similarly scene-centric OpenImages dataset [73].
Video Reconstruction. We present the video reconstruction on both UCF-101 training and validation
subsets in Table 2. First, we use the image tokenizer to reconstruct the video frame by frame. BSQ
works slightly better than VQ but neither is comparable to the specialized video tokenizers fine-tuned
on video data shown in the lower half of Table 2. Second, we finetune the image tokenizer on videos
and see significant improvements. For example, our 18-bit BSQ with causal ViT reduces rFVD from
342 to 11.62 and improves PSNR from 25.83 to 31.49 dB. The compared prior methods include:
(1) MaskGIT [10], which is a fine-tuned 2D-CNN based tokenizer, (2) TATS [15], which uses a 3D
CNN with replicate padding, (3) MAGVIT [16], whose 3D CNN is initialized by zero-inflating 2D
filters, and (4) MAGVIT-v2 [17], which makes the 3D CNN causal. Since most methods do not release
checkpoints, we take their reported numbers directly. Our models with all configurations outperform
MAGVIT-v2 with a comparable number of parameters (174M vs. 158M) by a large margin. The
best-performing MAGVIT-v2 uses a larger backbone and achieves a rFVD of 8.62. Our causal
BSQ-ViT with L = 18 achieves an 8.08 rFVD and halves the LPIPS. For BSQ with L = 36, our
method further improves the reconstruction metrics.
We also show the effect of using block-wise causal masks. The non-causal variant (non-BC) works
slightly better on all metrics because now the model can look at all visual patches within the temporal
context window. This result resembles the observations in video compression that using bidirectional
predicted pictures (B-frames) benefits compression quality given the same group of pictures (GoP).
Table 3: Image generation results on ImageNet-1K (128 × 128). † The number is taken from the paper.
Category Method # steps FID↓ IS↑ Prec↑ Rec↑
GAN BigGAN [18] 1 6.02 145.8 0.86 0.35
Diffusion ADM [19] 1,000 5.91 93.3 0.70 0.65
Masked LM VQ 12 †9.4 - - -
Masked LM FSQ [22] 12 †8.5 - - -
Masked LM BSQ (Ours) 32 5.44 139.6 0.80 0.50
Figure 4: Video compression results on MCL-JCV 640×360 and UVG 1920×1080. The four panels report
(a) PSNR on MCL-JCV, (b) MS-SSIM on MCL-JCV, (c) PSNR on UVG, and (d) MS-SSIM on UVG as a
function of bpp, comparing H.264 (medium), HEVC (medium), VCT [49], MAGVIT [16], MAGVIT-v2 [17],
and our method with and without arithmetic coding (AC).
Image Generation. Our BSQ-ViT tokenizer can be seamlessly integrated into existing generative
models for visual generation. We follow MaskGIT [10], a masked language modeling approach. At
training time, the underlying masked language model (masked LM) learns to predict the masked
tokens given a random proportion of unmasked tokens, like BERT [74]. At inference time, the model
repeats decoding in a non-autoregressive way [75] for several steps and progressively decodes from
an all-masked canvas to visually plausible contents following a pre-defined unmasking schedule.
Unlike MaskGIT, whose VQ-VAE has K = 1024, BSQ-ViT has an effective vocabulary size of 2^L
with L = 18, resulting in a slow embedding lookup. We fix this by dividing each token into groups
and treating the sub-tokens independently, with a rationale similar to Sec. 4.1, and increase the
number of decoding steps accordingly. Table 3 shows that the masked LM with BSQ outperforms
those with VQ and FSQ reported in [22]. Our method achieves results comparable to other generation
paradigms such as GAN-based [18] and diffusion-based [19] approaches. We leave qualitative results
to Sec. F.
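One way to realize this token grouping is to split every L-bit BSQ index into sub-indices of L/g bits each, so that each embedding table has only 2^{L/g} rows; a sketch with g = 2 groups (the exact grouping used in our experiments may differ):

import torch

def split_token(k, L=18, groups=2):
    """Split an L-bit token id into `groups` sub-token ids of L/groups bits each."""
    b = L // groups
    mask = (1 << b) - 1
    return torch.stack([(k >> (g * b)) & mask for g in range(groups)], dim=-1)

def merge_token(sub, L=18, groups=2):
    b = L // groups
    k = torch.zeros_like(sub[..., 0])
    for g in range(groups):
        k = k | (sub[..., g] << (g * b))
    return k

k = torch.randint(0, 2 ** 18, (5,))
assert torch.equal(merge_token(split_token(k)), k)   # lossless round trip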
Video Compression. We compare the compression results on MCL-JCV and UVG in Figure 4. Simply
flattening the video token sequence into a bitstream achieves an MS-SSIM of 0.9818 at 0.2333 bpp.
Although this is not great, we use an auto-regressive model to predict the conditional probability
such that the bpp is reduced by 41%. This leads to a better tradeoff than standard video codecs,
including both H.264 and HEVC.

Table 4: Comparisons of encoding/decoding speed. † The number does not include the image encoder, according to [49].
Method Resolution Encode EC Decode FPS
VCT [49] 1920×1080 †494 ms 30.5 ms 168 ms 1.4
H.264 1920×1080 - - - 2.6
Ours 1920×1080 55.8 ms 42.2 ms 64.8 ms 6.1
VCT [49] 640×360 †22.2 ms 4.24 ms 10.1 ms 27.3
H.264 640×360 - - - 22.4
Ours 640×360 6.2 ms 4.69 ms 7.2 ms 55.2
On UVG 1080P, our model is comparable to H.264 while being worse than HEVC and VCT [49].
Note that our model trains on UCF-101 which only has 9K 320×240 video clips encoded in MPEG-4
while VCT has been trained on a million high-resolution Internet video clips. We hypothesize that
the gap will be mitigated by adding more diverse videos and removing compression artifacts from
the training videos. Nevertheless, we show the potential advantage of our method in encoding and
decoding speed in Table 4. Due to the simplicity of the Transformer-based encoder and decoder, our
method runs faster than VCT.
For ablation studies, we train an ImageNet image tokenizer with resolution 128×128 with p = 8,
although our conclusions generally hold for higher resolution, e.g. 256×256 in Sec. 5.1.
BSQ vs VQ. Table 5 shows that BSQ and VQ follow a similar trend: better reconstruction for
increased L. Since K = 218 results in an out-of-memory issue, we try a smaller K = 216 = 65536
for VQ. The gain for VQ already diminishes even though the small bottleneck dimension of 8 still
guarantees full code usage. In contrast, BSQ consistently works better on all metrics when L = 18.
Table 6: Ablation studies of the loss design.
(a) Leave-one-out ablations for training losses. (b) Group size. (L = 18)
6 Conclusions
We present a new transformer-based image and video tokenizer with Binary Spherical Quantization
(BSQ). The transformer-based architecture effortlessly integrates image and video tokenization over
an arbitrary time horizon. The Binary Spherical Quantization allows for efficient and effective training
of the quantized bottleneck. Our results indicate that the proposed tokenizer runs at a faster speed,
reconstructs with higher fidelity, and in combination with a sequence model offers a strong baseline
for lossy video compression and image synthesis.
References
[1] Thomas J Daede, Nathan E Egge, Jean-Marc Valin, Guillaume Martres, and Timothy B Terriberry. Daala:
A perceptually-driven next generation video codec. arXiv preprint arXiv:1603.03129, 2016.
[2] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini,
and Luc V Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations.
NeurIPS, 2017.
[3] Alaaeldin El-Nouby, Matthew J Muckley, Karen Ullrich, Ivan Laptev, Jakob Verbeek, and Hervé Jégou.
Image compression with product quantized masked image modeling. TMLR, 2023.
[4] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu,
Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. In ICLR,
2022.
[5] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In
ICLR, 2022.
[6] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image
BERT pre-training with online tokenizer. In ICLR, 2022.
[7] Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang,
Luowei Zhou, and Lu Yuan. BEVT: BERT pretraining of video transformers. In CVPR, 2022.
[8] Aaron Van Den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In
NeurIPS, 2017.
[9] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis.
In CVPR, 2021.
[10] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative
image transformer. In CVPR, 2022.
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.
NeurIPS, 2020.
[12] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman,
Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv
preprint arXiv:2303.08774, 2023.
[13] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and
fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[14] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna,
and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv
preprint arXiv:2307.01952, 2023.
[15] Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi
Parikh. Long video generation with time-agnostic VQGAN and time-sensitive transformer. In ECCV,
2022.
[16] Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann,
Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. MAGVIT: Masked generative video transformer. In CVPR,
2023.
[17] Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng,
Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion–tokenizer is key
to visual generation. In ICLR, 2024.
[18] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural
image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[19] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. NeurIPS,
2021.
[20] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and
Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
[21] Chuanxia Zheng and Andrea Vedaldi. Online clustered codebook. In ICCV, 2023.
[22] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization:
VQ-VAE made simple. In ICLR, 2024.
[23] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Moham-
mad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video
generation from open domain textual descriptions. In ICLR, 2022.
[24] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz,
Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video
diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[25] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using
vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
[26] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT:
A video vision transformer. In ICCV, 2021.
[27] Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal,
27(3):379–423, 1948.
[28] David A Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE,
40(9):1098–1101, 1952.
[29] Richard Clark Pasco. Source coding algorithms for fast data compression. PhD thesis, Stanford University
CA, 1976.
[30] Jorma Rissanen and Glen G Langdon. Arithmetic coding. IBM Journal of research and development,
23(2):149–162, 1979.
[31] Jarek Duda. Asymmetric numeral systems. arXiv preprint arXiv:0902.0271, 2009.
[32] Tomáš Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of
Technology, 2012.
[33] Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, and Idoia Ochoa. Deepzip: Lossless data compression
using recurrent neural networks. In DCC, 2019.
[34] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional
image generation with pixelcnn decoders. In NeurIPS, 2016.
[35] James Townsend, Tom Bird, and David Barber. Practical lossless compression with latent variables using
bits back coding. In ICLR, 2019.
[36] James Townsend, Thomas Bird, Julius Kunze, and David Barber. Hilloc: Lossless image compression with
hierarchical latent variable models. In ICLR, 2020.
[37] Fabrice Bellard. Lossless data compression with neural networks. https://bellard.org/nncp/nncp.pdf,
2019.
[38] Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher
Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, et al. Language
modeling is compression. In ICLR, 2024.
[39] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Practical full
resolution learned lossless image compression. In CVPR, 2019.
[40] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of
compute-optimal large language model training. In NeurIPS, 2022.
[41] Vivek K Goyal. Theoretical foundations of transform coding. IEEE Signal Processing Magazine, 18(5):9–
21, 2001.
[42] Johannes Ballé, Philip A Chou, David Minnen, Saurabh Singh, Nick Johnston, Eirikur Agustsson, Sung Jin
Hwang, and George Toderici. Nonlinear transform coding. IEEE Journal of Selected Topics in Signal
Processing, 15(2):339–353, 2020.
[43] Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC
video coding standard. TCSVT, 2003.
[44] Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the High Efficiency
Video Coding (HEVC) standard. TCSVT, 2012.
[45] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. Dvc: An end-to-end
deep video compression framework. In CVPR, 2019.
[46] Oren Rippel, Sanjay Nair, Carissa Lew, Steve Branson, Alexander G Anderson, and Lubomir Bourdev.
Learned video compression. In ICCV, 2019.
[47] Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Balle, Sung Jin Hwang, and George Toderici.
Scale-space flow for end-to-end optimized video compression. In CVPR, 2020.
[48] Jiahao Li, Bin Li, and Yan Lu. Deep contextual video compression. In NeurIPS, 2021.
[49] Fabian Mentzer, George Toderici, David Minnen, Sung-Jin Hwang, Sergi Caelles, Mario Lucic, and Eirikur
Agustsson. VCT: A video compression transformer. In NeurIPS, 2022.
[50] Dailan He, Ziming Yang, Weikun Peng, Rui Ma, Hongwei Qin, and Yan Wang. ELIC: Efficient learned
image compression with unevenly grouped space-channel contextual adaptive coding. In CVPR, 2022.
[51] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through
stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[52] Yuhta Takida, Takashi Shibuya, WeiHsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka,
Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, and Yuki Mitsufuji. SQ-VAE: Variational bayes
on discrete representation with self-annealed stochastic quantization. In ICML, 2022.
[53] Aren Jansen, Daniel PW Ellis, Shawn Hershey, R Channing Moore, Manoj Plakal, Ashok C Popat, and
Rif A Saurous. Coincidence, categorization, and consolidation: Learning to recognize sounds with minimal
supervision. In ICASSP, 2020.
[54] Ian H Witten, Radford M Neal, and John G Cleary. Arithmetic coding for data compression. Communica-
tions of the ACM, 30(6):520–540, 1987.
[55] Jon Hamkins and Kenneth Zeger. Gaussian source coding with spherical codes. IEEE Transactions on
Information Theory, 48(11):2980–2989, 2002.
[56] Thomas Fischer. A pyramid vector quantizer. IEEE Transactions on Information Theory, 32(4):568–583,
1986.
[57] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth
16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
[59] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable
effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[60] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[61] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial
networks. In CVPR, 2019.
[62] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional
adversarial networks. In CVPR, 2017.
[63] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.
IJCV, 115:211–252, 2015.
[64] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár,
and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[65] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action
classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[66] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error
visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[67] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs
trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS, 2017.
[68] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and
Sylvain Gelly. Fvd: A new metric for video generation. In ICLR Workshop, 2019.
[69] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved
techniques for training GANs. NeurIPS, 2016.
[70] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision
and recall metric for assessing generative models. NeurIPS, 2019.
[71] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality
assessment. In ACSSC, 2003.
[72] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution
image synthesis with latent diffusion models. In CVPR, 2022.
[73] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab
Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified
image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
[74] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In NAACL, 2019.
[75] Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural
machine translation. In ICLR, 2018.
[76] Haiqiang Wang, Weihao Gan, Sudeng Hu, Joe Yuchieh Lin, Lina Jin, Longguang Song, Ping Wang,
Ioannis Katsavounidis, Anne Aaron, and C-C Jay Kuo. MCL-JCV: A JND-based H.264/AVC video quality
assessment dataset. In ICIP, 2016.
[77] Alexandre Mercat, Marko Viitanen, and Jarno Vanne. UVG dataset: 50/120fps 4K sequences for video
codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference, pages
297–302, 2020.
[78] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[79] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,
2022.
[80] Jean Bégaint, Fabien Racapé, Simon Feltman, and Akshay Pushparaja. CompressAI: a pytorch library and
evaluation platform for end-to-end compression research. arXiv preprint arXiv:2011.03029, 2020.
Table 7: Comparing BSQ and LFQ.
                     LFQ [17]                                         BSQ (Ours)
Quantized output     v̂ = sign(v)                                      û = (1/√L) sign(u) = (1/√L) sign(v/|v|)
STE gradient         ∂v̂_i/∂v_i = 1                                    ∂û_i/∂v_i = (1/√L)(1 − v_i²/|v|²)
Quantization error   E_v[d(v, v̂)] = ∞ (unbounded)                     E_u[d(u, û)] < √(2 − 2/√L) < √2 (upper-bounded, see Sec. C.4)
Training objective   L_MSE, L_commit, L_LPIPS, L_GAN,                  L_MSE, L_LPIPS, L_GAN,
                     L_entropy = H[p(c|v)] − H[E_v[p(c|v)]]            L_entropy = H[p(c|u)] − Ĥ[E_u[p(c|u)]]
Starting from the initial interval I0 = [0, 1), the AC encoder recursively partitions the interval into a
series of sub-intervals I_n = [l_n, u_n) such that I_n ⊂ I_{n−1} ⊂ · · · ⊂ I_0, where I_n is determined by I_{n−1}
and ρ(y|x_{<n}):
    I_n(y) = [ l_{n−1} + (u_{n−1} − l_{n−1}) Σ_{y′=1}^{y−1} ρ(y′|x_{<n}) ,  l_{n−1} + (u_{n−1} − l_{n−1}) Σ_{y′=1}^{y} ρ(y′|x_{<n}) ) ,   (14)
and the encoder sets I_n = I_n(x_n).
Any number in the final interval I_N suffices to represent the encoded sequence. To obtain
the final bit stream, we take a binary fraction λ = Σ_{i=1}^{C} b_i 2^{−i}, b_i ∈ {0, 1}, in I_N such that
l_N ≤ λ < u_N. The bit stream {b_1, . . . , b_C} is the encoding result with a length of C bits.
The AC decoder takes in λ, starts with I_0, and performs a similar interval-partitioning process. At
the n-th step, the decoder queries the model ρ_n(y|x_{<n}), calculates the sub-intervals for all possible
values of y using Eq. (14), and decodes the output x_n for which λ ∈ I_n(x_n). The decoder recovers
the encoded token sequence by continuing with I_{n+1} based on the decoded x_n and repeating for
N steps.
In practice, the encoder and the decoder can be implemented efficiently with fixed-length integer
numbers and operate incrementally for arbitrarily long input sequences.
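A toy floating-point version of this recursion (ours) makes the encode/decode symmetry explicit; real codecs use the fixed-length integer arithmetic mentioned above, so this sketch is only reliable for short streams.

def ac_encode(tokens, cond_prob, K):
    """tokens: list of ints in {1..K}; cond_prob(prefix) returns a list of K probabilities."""
    low, high = 0.0, 1.0
    for n, x in enumerate(tokens):
        p = cond_prob(tokens[:n])
        cum_lo = sum(p[:x - 1])
        low, high = low + (high - low) * cum_lo, low + (high - low) * (cum_lo + p[x - 1])
    bits, lam = [], 0.0                                   # emit lambda = sum_i b_i 2^{-i} in [low, high)
    while not (low <= lam < high):
        b = 1 if lam + 2.0 ** -(len(bits) + 1) < high else 0
        bits.append(b)
        lam += b * 2.0 ** -len(bits)
    return bits

def ac_decode(bits, cond_prob, K, num_tokens):
    lam = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))
    low, high, out = 0.0, 1.0, []
    for _ in range(num_tokens):
        p, cum = cond_prob(out), 0.0
        for x in range(1, K + 1):                         # find the sub-interval containing lambda
            nxt = cum + p[x - 1]
            if lam < low + (high - low) * nxt:
                low, high = low + (high - low) * cum, low + (high - low) * nxt
                out.append(x)
                break
            cum = nxt
    return out

# Memoryless toy model over K = 4 tokens; decoding recovers the stream exactly.
probs = lambda prefix: [0.5, 0.25, 0.125, 0.125]
stream = [1, 3, 2, 1, 4]
assert ac_decode(ac_encode(stream, probs, 4), probs, 4, len(stream)) == stream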
In Sec. 4.1, we have introduced the mechanism of BSQ and briefly discussed the connections and
differences with LFQ. We summarize them in Table 7. Note that the STE gradient in BSQ is anisotropic
and is more likely to be a good estimate because the quantization error is upper-bounded regardless
of L. This property explains why a commitment loss like L_commit(û, u) is not needed in BSQ but is
useful for LFQ.
C Proofs
    Σ_{c∈C} e^{τ u⊤c} = Σ_{c∈C} Π_{d=1}^{L} e^{τ u_d c_d} = Π_{d=1}^{L} Σ_{c_d∈Ω} e^{τ u_d c_d} .   (15)
Proof. With τ dropped for simplicity of notation,
    Σ_{c∈C} e^{u⊤c} = Σ_{c∈C} Π_{k=1}^{L} e^{u_k c_k}
                   = Σ_{c_1∈Ω} Σ_{c_2∈Ω} · · · Σ_{c_L∈Ω} Π_{d=1}^{L} e^{u_d c_d}
                   = Σ_{c_1∈Ω} Σ_{c_2∈Ω} · · · Σ_{c_L∈Ω} e^{u_L c_L} Π_{d=1}^{L−1} e^{u_d c_d}
                   = Σ_{c_1∈Ω} Σ_{c_2∈Ω} · · · Σ_{c_{L−1}∈Ω} ( Π_{d=1}^{L−1} e^{u_d c_d} ) ( Σ_{c_L∈Ω} e^{u_L c_L} )
                   = · · ·
                   = ( Σ_{c_1∈Ω} e^{u_1 c_1} ) ( Σ_{c_2∈Ω} e^{u_2 c_2} ) · · · ( Σ_{c_L∈Ω} e^{u_L c_L} ) = Π_{d=1}^{L} Σ_{c_d∈Ω} e^{u_d c_d} . □
    q̂(ĉ|u) = e^{τ u⊤ĉ} / Σ_{c∈C} e^{τ u⊤c}
             = Π_{d=1}^{L} e^{τ u_d ĉ_d} / Π_{d=1}^{L} Σ_{c_d∈{−1/√L, 1/√L}} e^{τ u_d c_d}   (using Eq. (15))
             = Π_{d=1}^{L} e^{τ u_d ĉ_d} / ( e^{τ u_d ĉ_d} + e^{−τ u_d ĉ_d} )   (since c_d = ±1/√L = ±ĉ_d)
             = Π_{d=1}^{L} σ(2τ u_d ĉ_d) .
It follows that
    H[q̂(c|u)] = Σ_{d=1}^{L} H(σ(2τ u_d c_d)) . □
    Q(c) = E_u[q̂(c|u)] = (1/N) Σ_u q̂(c|u) = (1/N) Σ_u Π_{k=1}^{L} σ(2 u_k c_k) .
Unlike c, u does not factorize like Eq. (15). This would require us to compute Q(c) as a full
distribution over 2^L states, which is slow (O(L × 2^L)) and easily overfits. Instead, we approximate
Q(c) by a factorized distribution q̃(c) = Π_{d=1}^{L} q̃_d(c_d), where c_d ∈ Ω for Ω = {−1/√L, 1/√L}, using
an M-projection. We again omit τ for notational brevity.
The minimizer of the above projection satisfies ∂D(Q∥q̃)/∂q̃_d = 0:
    q̃_d(c_d)* = Σ_{c_{−d}} Q(c) = Σ_{c_{−d}} E_u[q̂(c|u)]
              = E_u[ Σ_{c_{−d}} Π_k σ(2 u_k c_k) ]
              = E_u[ σ(2 u_d c_d) Σ_{c_{−d}} Π_{k≠d} σ(2 u_k c_k) ]
              = E_u[ σ(2 u_d c_d) Π_{k≠d} Σ_{c_k∈Ω} σ(2 u_k c_k) ]   (each inner sum equals 1)
              = E_u[ σ(2 u_d c_d) ] .
By the nature of the above derivation, the cross entropy H(Q, q̃) = H(q̃) equals the entropy of the
approximation. This means that D(Q∥q̃) = H(q̃) − H(Q) ≥ 0, and the entropy of the approximation
is an upper bound on the true entropy, H(q̃) ≥ H(Q).
In practice, this bound is relatively tight. The most adversarial distribution P(u) puts P(u = (1/√L)·1) = 1/2
and P(u = −(1/√L)·1) = 1/2, where 1 is the all-ones vector: all inputs are maximally correlated, but the
factorized distribution is not. Figure 5 shows an empirical estimate of this approximation error for various
values of τ. In practice, we use τ = 1/100, which has little to no approximation error.
We consider the ℓ2-distance d(u, û) = ∥u − û∥. A simple (but loose) bound is
    E_u[d(u, û)] ≤ max_u d(u, û) = √(2 − 2/√L) < √2 ,   (16)
Figure 5: Empirical estimation of the approximation error with respect to τ at different bottleneck
dimensions L.
where the maximum is attained when u lies on a coordinate axis, u = [0, · · · , 0, 1, 0, · · · , 0] (a single 1
preceded by n zeros and followed by L−1−n zeros). To achieve a tighter bound, we
first expand the definition,
    E_u[d(u, û)] = ( ∫_{S^{L−1}} d(u, û) dS^{L−1}V ) / ( ∫_{S^{L−1}} dS^{L−1}V ) ,   (17)
where S^{L−1} = {x ∈ R^L : ∥x∥ = 1} denotes the unit L-sphere of radius 1 and dS^{L−1}V denotes its
surface area element. We further define a hyperspherical coordinate system that is analogous to the
spherical coordinate system for 3D Euclidean space or the polar coordinate system for 2D space to
represent the surface area element.
    u_1 = cos(φ_1),
    u_2 = sin(φ_1) cos(φ_2),
    u_3 = sin(φ_1) sin(φ_2) cos(φ_3),
    · · ·
    u_{L−1} = sin(φ_1) sin(φ_2) · · · sin(φ_{L−2}) cos(φ_{L−1}),
    u_L = sin(φ_1) sin(φ_2) · · · sin(φ_{L−2}) sin(φ_{L−1}),
    (surface area element)  dS^{L−1}V = sin^{L−2}(φ_1) sin^{L−3}(φ_2) · · · sin(φ_{L−2}) dφ_1 · · · dφ_{L−1},
    (surface area)  S_{L−1} = ∫_{S^{L−1}} dS^{L−1}V = 2π^{L/2} / Γ(L/2) .
Due to symmetry, we consider the subarea A^{L−1} where ∀i ∈ {1, · · · , L}, u_i > 0; every point in it is
quantized to c_1 = û_1 = (1/√L)·1, where 1 is the all-ones vector. The unit hypersphere S^{L−1} is covered
by 2^L such interchangeable subareas. Computing Eq. (17) is therefore equivalent to
    E_u[d(u, û)] = ( ∫_{A^{L−1}} d(u, û) dS^{L−1}V ) / ( ∫_{A^{L−1}} dS^{L−1}V ) .   (18)
We expand the numerator in Eq. (18) as follows:
    ∫_0^{π/2} · · · ∫_0^{π/2} dS^{L−1}V { [cos(φ_1) − 1/√L]² + [sin(φ_1) cos(φ_2) − 1/√L]² + · · ·
        + [sin(φ_1) sin(φ_2) · · · sin(φ_{L−2}) cos(φ_{L−1}) − 1/√L]²
        + [sin(φ_1) sin(φ_2) · · · sin(φ_{L−2}) sin(φ_{L−1}) − 1/√L]² }^{1/2} .   (19)
The expression under the square root consists of L squared terms. The constant terms sum to L · (1/L) = 1.
Summing all the quadratic terms and repeatedly using sin²(θ) + cos²(θ) = 1 gives
    cos²(φ_1) + sin²(φ_1) cos²(φ_2) + · · · + Π_{j=1}^{L−2} sin²(φ_j) sin²(φ_{L−1}) + Π_{j=1}^{L−2} sin²(φ_j) cos²(φ_{L−1}) = 1 .
Therefore, we have
    E_u[d(u, û)] < ( 2 Γ(L/2) / (√π Γ((L−1)/2)) ) ∫_0^{π/2} [ 2 − (2/√L) cos(φ_1) ]^{1/2} sin^{L−2}(φ_1) dφ_1 .   (22)
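The looser bound of Eq. (16) is easy to sanity-check numerically; a quick Monte-Carlo sketch (ours) draws points uniformly on the unit sphere and measures the mean distance to their BSQ corners:

import torch

def mean_quantization_error(L, n=200_000):
    u = torch.nn.functional.normalize(torch.randn(n, L), dim=-1)   # uniform on the unit sphere
    u_hat = (2 * (u >= 0).float() - 1) / L ** 0.5                  # nearest corner of C_BSQ
    return (u - u_hat).norm(dim=-1).mean().item()

for L in (4, 18, 36):
    loose_bound = (2 - 2 / L ** 0.5) ** 0.5
    print(L, round(mean_quantization_error(L), 3), "<", round(loose_bound, 3), "< sqrt(2)")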
D Dataset Overview
ImageNet-1k has 1.28M training images and 50,000 validation images; COCO 2017val has 5,000
images.
UCF101 has 13,320 video clips and three train-val splits. Following prior works [16], we consider
split-1 which has 9,537 clips for training and 3,783 for validation.
The MCL-JCV dataset [76] consists of thirty 1080P (1,920×1,080) video sequences with 24∼30
FPS. The Open Ultra Video Group (UVG) dataset [77] consists of sixteen 4K (3,840×2,160) test
video sequences captured at 50/120 FPS. Following prior works [47], we report the performance on a
subset of seven videos in YUV 8bit format at 120 FPS under the resolution of 1,920×1,080.
Figure 6: Quantization error as a function of the bottleneck dimension L (L ranging from 5 to 35).
E Implementation Details
Training Image Tokenizers. We train the image tokenizer with a batch size of 32 per GPU. We
use the AdamW optimizer [78] with (β_1, β_2) = (0.9, 0.99) and a weight decay of 1 × 10−4. The base
learning rate is 4 × 10−7 (or a total learning rate of 1 × 10−4) and follows a half-period cosine
annealing schedule. The model is trained for 1M steps which amounts to 200 epochs over the entire
ImageNet-1k training set. We did not heavily study the effect of loss weights. Instead, we keep γ = 1
in the entropy terms. We use a perceptual loss weight of 0.1 and an adversarial loss weight of 0.1
throughout the experiments.
Training Video Tokenizers. We finetune the video tokenizer with a batch size of 32 per GPU. The
optimization schedule follows the image-based one but trains for fewer iterations. The network
is initialized from the ImageNet-pretraining checkpoint and undergoes another 500K steps which
amounts to 1600 epochs over UCF-101 split-1 train.
Training a Masked Language Model for Generation. The masked LM is a standard post-LN
Transformer with 24 layers and a hidden dimension of 768 following MaskGIT [10]. We train the
masked LM on 2 nodes of 8× GPUs (16 in total) with a total batch size of 1024 for 1M steps. We
use the AdamW optimizer with (β_1, β_2) = (0.9, 0.96) and a weight decay of 0.045. At inference time, we
use a cosine unmasking schedule in MaskGIT [10] and set the sampling temperature to 15. We
use classifier-free guidance [79]: At training, we replace 20% of the class condition labels with
the mask token so that the model learns an unconditional distribution simultaneously. Let ℓc be
class-conditioned logits and ℓ∅ be unconditional logits. During inference, we interpolate logits using
ℓ′ = ℓc + α(ℓc − ℓ∅ ), where α = 0.5.
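The guidance step amounts to a one-line logit interpolation; a minimal sketch (tensor shapes are whatever the masked LM produces):

import torch

def guided_logits(logits_cond, logits_uncond, alpha=0.5):
    """Classifier-free guidance at inference: l' = l_c + alpha * (l_c - l_0)."""
    return logits_cond + alpha * (logits_cond - logits_uncond)

# alpha = 0 recovers plain conditional sampling; larger alpha strengthens class adherence.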
Training an Auto-Regressive Model for Arithmetic Coding. The auto-regressive model is a
Transformer with 24 layers and a hidden dimension of 768. We train this model on 8× GPUs with a
total batch size of 64. We use the AdamW optimizer with (β_1, β_2) = (0.9, 0.96) and a weight decay of 0.045.
Hardware. The hardware for training is 8×GPU-servers with NVIDIA A5000 (24GB). Pre-training
an image tokenizer and fine-tuning a video tokenizer in the full schedule is done across two servers
with distributed training and takes around 5 days. Training the AR model for AC is done on an
8×GPU server and takes around 1 week. When measuring the tokenizer’s throughput and the
compression runtime, we use a server with 4× A5000 GPU and 1× AMD Ryzen Threadripper PRO
5975WX 32-Core CPU (64 threads).
F Qualitative Results
In Figure 7, we show reconstructed images produced by the proposed BSQ-ViT in comparison to the
best prior work, SDXL-VAE [14]. We can see that our method is able to preserve more details about
high-frequency texture and fine-grained shape/geometry. BSQ-ViT often shows better reconstruction
results for characters.
19
Table 8: Image reconstruction results on COCO2017 and ImageNet-1K (256 × 256). The settings strictly
follow Table 1 except that all images are resized with bilinear interpolation.
COCO2017 val ImageNet-1k val
Method Data Arch. Quant. Param. # bits TP↑ PSNR↑ SSIM↑ LPIPS↓ rFID↓ PSNR↑ SSIM↑ LPIPS↓ rFID↓
DALL-E dVAE [20] CC+YF C VQ 98M 13 34.0 26.97 .0837 .2544 48.60 27.31 .7943 .2544 32.63
±3.41 ±.0922 ±.1057 ±3.81 ±.1114 ±.1057
MaskGIT [10] IN-1k C VQ 54M 10 37.5 18.21 .4596 .1930 8.47 18.63 .4619 .1884 1.98
±2.74 ±0.1606 ±.0444 ±2.90 ±.1812 ±.0497
ViT-VQGAN [4] IN-1k T-B VQ 182M 13 †7.5 - - - - - - - †1.55
SD-VAE 1.x [72] OImg C VQ 68M 10 22.4 23.29 .6705 .0949 6.49 23.65 .6615 .0940 1.40
±3.34 ±.1316 ±.0313 ±3.69 ±.1540 ±.0367
SD-VAE 1.x [72] OImg C VQ 68M 14 22.4 24.17 .7042 .0814 5.75 24.48 .6931 .0814 1.13
±3.50 ±.1276 ±.0289 ±3.98 ±.1502 ±.0289
SD-VAE 1.x [72] OImg C KL 68M 64 22.4 23.21 .6930 .0908 5.94 23.54 .6835 .0899 1.22
±3.24 ±.1249 ±.04282 ±3.62 ±.1465 ±.0337
SD-VAE 2.x [14] OImg+LAION C KL 84M 64 18.9 26.62 .7722 .0584 4.26 26.90 .7592 .0609 0.70
±3.64 ±.1086 ±.0273 ±4.09 ±.1300 ±.0349
SDXL-VAE [14] OImg+LAION+? C KL 84M 64 18.9 27.08 .7953 .0541 3.93 27.37 .7814 .0574 0.67
±3.88 ±.1066 ±.0250 ±4.36 ±.1282 ±.0320
Ours IN-1k T-B BSQ 174M 18 45.1 26.89 .8133 .0652 5.41 27.78 .8171 .0633 0.99
±3.47 ±.0851 ±.0255 ±3.99 ±.0987 ±.0307
Ours IN-1k T-B BSQ 174M 36 45.1 29.85 .8862 .0341 3.07 30.12 .8803 .0355 0.36
±3.65 ±.0570 ±.0163 ±4.13 ±.0670 ±.0207
Ours (w/. EMA) IN-1k T-B BSQ 174M 36 45.1 30.19 .8904 .0314 3.07 30.45 .8843 .0329 0.42
±3.69 ±.0561 ±.0153 ±4.19 ±.0661 ±.0194
In Figure 8, we show sampled results produced by a Masked LM with the proposed BSQ-ViT in
comparison to existing methods, BigGAN [18] and ADM [19]. We also plot the samples from the
ground-truth ILSVRC2012 validation set for reference. Our method produces competitive results
with state-of-the-art methods.
Following SSF [47], we used FFmpeg3 to produce the evaluation metrics for H.264 and HEVC. We
use the commands provided in CompressAI [80].
ffmpeg -y -s:v $RESOLUTION -i $FILE.yuv -c:v h264 -crf $CRF -preset medium \
-bf 0 -pix_fmt yuv420p -threads 4 $FILE.mp4
where $RESOLUTION ∈ {1920x1080, 640x360} and $CRF ∈ {17, 20, 22, 27, 32, 37, 42, 47}.
H Limitations
The proposed tokenizer has been tested on images with a 256×256 or 128×128 resolution and videos
with a 128×128 resolution. Training a visual tokenizer on higher-resolution inputs and variable
aspect ratios remains unexplored. Also, the training dataset is limited to ImageNet-1k and UCF-101.
Scaling the proposed model to larger-scale visual contents remains an interesting problem to study.
I Broader Impacts
The video compression application illustrated in the paper may be useful to reduce the storage and
transmission cost of video data. Also, the proposed visual tokenization model runs more efficiently
than prior ones, resulting in potential energy savings. Both of them will ultimately benefit society.
3 https://ffmpeg.org/
Figure 7: Reconstruction results of BSQ-ViT (right) compared to the original image (left) and SDXL-VAE [14]
(middle). The three images are taken from COCO 2017val which are more scene-centric compared to ImageNet
data that our model is trained on.
BigGAN ADM BSQ-ViT+Masked-LM (Ours) Groundtruth
Figure 8: Sampled generation results of BSQ-ViT + Masked-LM (second column from left) compared to
BigGAN [18] (right), ADM [19] (second column from right) and the original images (left). Classes are 1:
goldfish, 279: arctic fox, 323: monarch butterfly, 417: balloon.
LPIPS is based on the implementation7 licensed under a BSD-2-Clause license.
SSIM and MS-SSIM are based on the PyTorch implementation8 licensed under an MIT license.
Generation Metrics (FID, Inception Score, Precision, and Recall) are reported using a TensorFlow
implementation9 licensed under an MIT license.
7 https://github.com/richzhang/PerceptualSimilarity
8 https://github.com/VainF/pytorch-msssim
9 https://github.com/openai/guided-diffusion/tree/main/evaluations
10 https://github.com/openai/DALL-E
11 https://github.com/CompVis/latent-diffusion
12 https://huggingface.co/stabilityai/sd-vae-ft-mse
13 https://huggingface.co/stabilityai/sdxl-vae
14 https://github.com/openai/guided-diffusion
15 https://github.com/google-research/maskgit/tree/main
16 https://github.com/InterDigitalInc/CompressAI