2201.01741v2 - Understanding Entropy Coding With Asymmetric Numeral Systems (ANS) - Statistician Perspective
Robert Bamler
University of Tübingen
Department of Computer Science
Maria-von-Linden-Straße 6
72076 Tübingen, Germany
[email protected]
Abstract
1 Introduction
Effective data compression is becoming increasingly important. Digitization in business and science
raises the demand for data storage, and the proliferation of remote working arrangements makes
high-quality videoconferencing indispensable. Recently, novel compression methods that employ
probabilistic machine-learning models have been shown to outperform more traditional compression
methods for images and videos [Ballé et al., 2018, Minnen et al., 2018, Yang et al., 2020a, Agustsson
et al., 2020, Yang et al., 2021]. Machine learning provides new methods for the declarative task of
expressing complex probabilistic models, which are essential for data compression (see Section 2.1).
However, many researchers who come to data compression from a machine learning background
struggle to understand the more procedural (i.e., algorithmic) parts of a compression pipeline.
This paper discusses an integral algorithmic part of lossless and lossy data compression called entropy
coding. We focus on the Asymmetric Numeral Systems (ANS) entropy coder [Duda et al., 2015],
which is the method of choice in many recent proposals of machine-learning based compression
methods [Townsend et al., 2019, Kingma et al., 2019, Yang et al., 2020a, Theis and Ho, 2021].
We intend to make the internals of ANS more approachable to statisticians and machine learning
researchers by presenting them from a new perspective based on latent variable models and the
so-called bits-back trick [Wallace, 1990, Hinton and Van Camp, 1993], which we explain below. We
show that this new perspective allows us to come up with novel variants of ANS that enable research
on new compression methods which combine declarative with procedural tasks.
1 Software library at https://bamler-lab.github.io/constriction and analyzed in Section 5.
2 Course modules 5, 6, and 7 at https://robamler.github.io/teaching/compress21
This paper is written in an educational style that favors real code examples over pseudocode, culmi-
nating in a working demo implementation of ANS in the Python programming language (Listing 7
on page 12). The paper is structured as follows: Section 2 reviews the relevant information theory
and introduces the concept of stream codes. Section 3 presents our statistical perspective on the ANS
algorithm and guides the reader to a full implementation. In Section 4, we show that understanding
the internals of ANS allows us to create new variations on it for special use cases. Section 5 presents
and empirically evaluates an open-source compression library that is more run-time efficient and
easier to use than our demo implementation in Python. We provide concluding remarks in Section 6.
2 Background
In Subsection 2.1, we briefly review the relevant theory of lossless compression. In Subsection 2.2,
we summarize how the Asymmetric Numeral Systems (ANS) method differs from other stream codes
(like Arithmetic Coding or Range Coding) and from symbol codes (like Huffman Coding).
We formalize the problem of lossless compression and state the theoretical bounds from the source
coding theorem. Readers who are already familiar with source coding may opt to skip this section.
Problem Statement. A lossless compression algorithm (or “code”) C maps any message x from
some discrete message space X to a bit string C(x). We want to make these bit strings C(x) as
short as possible while still keeping the mapping C injective, so that a decoder can reconstruct the
message x by inverting C. We refer to the length of the bit string C(x) as the bitrate RC (x). Since
RC (x) may depend on x, a common (albeit not the only reasonable) goal in the lossless compression
literature is to find a code C that minimizes the expected bitrate EP [RC (X)]. Here, P is a probability
distribution over X , i.e., a probabilistic model of the data source (called the “entropy model” for
short). Further, EP [ · ] is the expectation value, and capital letters like X denote random variables.
To minimize EP [RC (X)], a code C exploits knowledge of P to map probable messages to short bit
strings while mapping less probable messages to longer bit strings (to avoid clashes that would make
C non-injective). Thus, the bitrate RC (x) varies across messages x ∈ X . In order to prevent C from
encoding information about x into the length rather than the contents of C(x), we will assume in the
following that C is uniquely decodable. This means that it must be possible to reconstruct a sequence
of messages from the (delimiter-free) concatenation of their compressed representations.
Theoretical Bounds. The source coding theorem [Shannon, 1948] makes two fundamental
statements—a negative and a positive one—about the expected bitrate, relating it to the entropy
HP [X] := EP [− log2 P (X)] of the message X under the entropy model P . The negative part of the
theorem states that the expected bitrate of any uniquely decodable lossless code cannot be smaller
than the entropy of the message. The positive part states that, if we allow for up to one bit of overhead
over the entropy, then there indeed always exists a uniquely decodable lossless code. More formally,
∀ entropy models P , ∀ uniquely decodable lossless codes C: EP [RC (X)] ≥ HP [X]; and (1)
∀ entropy models P , ∃ uniquely decodable lossless code C: EP [RC (X)] < HP [X] + 1. (2)
Thus, the expected bitrate of an optimal code CP∗ lies in the half-open interval [HP [X], HP [X] + 1).
A so-called entropy coder takes an entropy model P as input and constructs a near-optimal code for
it. The model P and the choice of entropy coder together comprise a compression method on which
two parties have to agree before they can meaningfully exchange any compressed data.
Information Content. The upper bound in Eq. 2 is actually a corollary of a more powerful state-
ment. Recall that the entropy HP [X] = EP [− log2 P (X)] is an expectation over the quantity
“− log2 P (X)”. This quantity is called the information content of X. Eq. 2 thus relates two expecta-
tions under P , and it turns out that there exists a code (the so-called Shannon Code [Shannon, 1948])
where the inequality holds not only in expectation but indeed for all messages individually, i.e.,
∀ entropy models P , ∃ uniquely decodable lossless code C: ∀x ∈ X : RC (x) < − log2 P (X = x) + 1. (3)
Figure 1: Comparison between (a) symbol codes (here: Shannon Coding) and (b) stream codes (here:
Arithmetic Coding). Symbol codes assign an integer number of bits to each symbol xi , which leads
to a significant overhead in bitrate if the information content per symbol (− log2 P (Xi = xi )) is low.
For large message spaces (e.g., the space of all conceivable HD images), P (X = x) is usually very
small for any particular x, and the “+1” on the right-hand side of Eq. 3 is negligible compared to the
information content − log2 P (X = x). In this regime, the information content of a message x is thus
a very accurate estimate of its bitrate RCP∗ (x) under a (hypothetical) optimal code CP∗ .
Trivial Example: Uniform Entropy Model. To obtain some intuition for Eqs. 1-3, we briefly
consider a trivial case, from which our discussion of ANS in Section 3 below will start. Consider a
finite message space X with a uniform entropy model P , i.e., P (X = x) = 1/|X | ∀x ∈ X . Thus,
each message x ∈ X has information content − log2 P (X = x) = log2 |X |, and we have HP [X] =
log2 |X |. We can trivially find a uniquely decodable code Cunif. that satisfies Eqs. 2-3 for this model P :
we simply enumerate all x ∈ X with integer labels from {0, . . . , |X | − 1}, which we then express in
binary. The binary representation of the largest label (|X | − 1) works out to be ⌈log2 |X |⌉ bits long,
where ⌈·⌉ denotes rounding up to the nearest integer. To ensure unique decodability, we pad shorter bit
strings to the same length with leading zeros. Thus, RCunif. (x) = ⌈log2 |X |⌉ = ⌈− log2 P (X = x)⌉
for all x ∈ X , which satisfies Eq. 3 and therefore also Eq. 2. According to Eq. 1, this trivial code
Cunif. has less than one bit of overhead over the optimal code for such a uniform entropy model.
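For concreteness, here is a minimal sketch of this trivial code in Python (the message space below is a made-up example; Cunif. only requires some agreed-upon enumeration of X ):

    messages = ["foo", "bar", "baz", "qux", "quux"]   # example X with |X| = 5
    num_bits = (len(messages) - 1).bit_length()       # = ceil(log2 |X|) = 3

    def encode_uniform(message):
        # Label the message by its index and express the label in binary,
        # padded with leading zeros to a fixed length of num_bits bits:
        return format(messages.index(message), f"0{num_bits}b")

    assert encode_uniform("foo") == "000" and encode_uniform("quux") == "100"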
In practice, however, the entropy model P is rarely a uniform distribution. For example, in a
compression method for natural language, P should assign higher probability to this paragraph
than to some random gibberish. Any such non-uniform P has entropy HP [X] < log2 |X | and thus,
according to Eq. 2, there exists a better code for P than the trivial Cunif. . In principle, one could find an
optimal code for any entropy model P (over a finite X ) via the famous Huffman algorithm [Huffman,
1952] (knowledge of this algorithm is not required to understand the present paper). However, directly
applying the Huffman algorithm to entire messages would usually be prohibitively computationally
expensive. The next section provides an overview of computationally feasible alternatives.
Practical lossless entropy coding methods trade off some compression performance for better runtime
and memory efficiency. There are two common approaches: symbol codes and stream codes [MacKay, 2003]. Both assume that a message x ≡ (x1 , . . . , xk ) is a sequence (with length k) of
symbols xi ∈ X ∀i ∈ {1, . . . , k}, with some discrete set (“alphabet”) X. Further, both approaches
assume that the entropy model P (X) = ∏_{i=1}^{k} Pi (Xi ) is a product of per-symbol entropy models Pi .3
Thus, the entropy of the message X is the sum of all symbol entropies: HP [X] = ∑_{i=1}^{k} HPi [Xi ].
Constructing an optimal code over the message space X = X^k by Huffman coding would require
O(|X|^k) runtime (and memory) since Huffman coding builds a tree over all x ∈ X . Both symbol
and stream codes reduce the runtime complexity to O(k), but they differ in their precise practical
run-time cost and, more fundamentally, in their overhead over the minimal expected bitrate (Eq. 1).
Symbol Codes. Symbol codes are illustrated in Figure 1 (a). They simply apply a uniquely
decodable code (e.g., a Huffman code) independently to each symbol xi of the message, and then
concatenate the resulting bit strings. This makes the runtime linear in k but it increases the bitrate.
Even if we use an optimal code CP∗i for each symbol xi , the expected bitrate of each symbol can
3 The Pi may form an autoregressive model, i.e., each Pi may be conditioned on all Xj with j < i, so that the factorization of P (X) = ∏i Pi (Xi ) does not pose any formal restriction on P .
exceed the symbol’s entropy HPi [Xi ] by up to one bit (see Eq. 2). Since this overhead now applies
per symbol, the overhead in expected bitrate for the entire message now grows linearly in k.
Symbol codes are used in many classical compression methods such as gzip [Deutsch, 1996]
and png [Boutell and Lane, 1997]. Here, the entropy per symbol is large enough that an overhead of
up to one bit per symbol can be tolerated. By contrast, recently proposed machine-learning based
compression methods [Ballé et al., 2017, 2018, Yang et al., 2020b,a] often spread the total entropy of
the message over a larger number k of symbols, resulting in much lower entropy per symbol. For
example, with about 0.3 bits of entropy per symbol, a symbol code (which needs at least one full bit
per symbol) would have an overhead of more than 200%, thus rendering the method useless.
Stream Codes. Stream codes reduce the overhead of symbol codes without sacrificing computa-
tional efficiency by amortizing over symbols. Similar to symbol codes, stream codes also encode and
decode one symbol at a time, and their runtime cost is linear in the message length k. However, stream
codes do not directly map each individual symbol to a bit string. Instead, for most encoded symbols,
a stream code does not immediately emit any compressed bits but instead merely updates some
internal encoder state that accumulates information. Only every once in a while (when the internal
state overflows) does the encoder emit some compressed bits that “pack” the information content of
several symbols more tightly together than in a symbol code (see illustration in Figure 1 (b)).
The most well-known stream codes are Arithmetic Coding (AC) [Rissanen and Langdon, 1979,
Pasco, 1976] and its more efficient variant Range Coding (RC) [Martin, 1979]. This paper focuses
on a more recently proposed (and currently less known) stream code called Asymmetric Numeral
Systems (ANS) [Duda et al., 2015]. In practice, all three of these stream codes usually have negligible
overhead over the fundamental lower bound in Eq. 1 (see Section 5.2 for empirical results). From a
user’s point of view, the main difference between these stream codes is that AC and RC operate as a
queue (“first in first out”, i.e., symbols are decoded in the same order in which they were encoded)
while ANS operates as a stack (“last in first out”). A queue simplifies compression with autoregressive
models while a stack simplifies compression with latent variable models and is generally the simpler
data structure for more involved use cases (since reads and writes happen only at a single position).
Further, ANS provides some more subtle advantages over AC and RC:
• ANS is better suited than AC and RC for advanced compression techniques such as bits-
back coding [Wallace, 1990, Hinton and Van Camp, 1993] because (i) AC and RC treat
encoding and decoding asymmetrically with different internal coder states that satisfy
different invariants, which complicates switching between encoding and decoding; and
(ii) while encoding with RC is, of course, injective, it is difficult to make it surjective (and
thus to make decoding arbitrary bit strings injective). ANS has neither of these two issues.
• While the main idea of AC and RC is simple, details are somewhat involved due to a number
of edge cases. By contrast, a complete ANS coder can be implemented in just a few lines of
code (see Listing 7 in Section 3.3 below). Further, this paper provides a new perspective on
ANS that conceptually splits the algorithm into three parts. Combining these parts in new
ways allows us to design new variants of ANS for special use cases (see Section 4.3).
• As a minor point, decoding with AC and RC is somewhat slow because it updates an internal
state that is larger than that of the encoder, and because it involves integer division (which is
slow on standard hardware [Fog, 2021]). By contrast, decoding with ANS is a very simple
operation (assuming that evaluating the model Pi is cheap, as in a small lookup table).
We present the basic ANS algorithm in Section 3, discuss possible extensions to it in Section 4, and
empirically compare its compression performance and runtime efficiency to AC and RC in Section 5.
1 class UniformCoder: # optimal code for uniformly distributed symbols
2 def __init__(self, number=0): # constructor with an optional argument
3 self.number = number
4
5 def push(self, symbol, base): # Encodes a symbol ∈ {0, . . . , base − 1}.
6 self.number = self.number * base + symbol
7
8 def pop(self, base): # Decodes a symbol ∈ {0, . . . , base − 1}.
9 symbol = self.number % base # “ %” denotes modulo.
10 self.number //= base # “ //” denotes integer division.
11 return symbol

Listing 1: A simple stream code that is optimal for sequences of uniformly distributed symbols; it serves as a building block for our implementation of ANS in Sections 3.2-3.3.
We introduce our first example of a stream code: a simple code that provides optimal compression
performance for sequences of uniformly distributed symbols. This code will serve as a fundamental
building block in our implementation of Asymmetric Numeral Systems in Sections 3.2-3.3 below.
As shown in Section 2.1, we can obtain an optimal lossless compression code Cunif. for a uniform
entropy model over a finite message space X by using a bijective mapping from X to {0, . . . , |X |−1}
and expressing the resulting integer in binary. Such a bijective mapping can be done efficiently if
messages are sequences (x1 , . . . , xk ) of symbols xi ∈ X from a finite alphabet X: we simply interpret
the symbols as the digits of a number in the positional numeral system of base |X|. For example, if
the alphabet size is |X| = 10 then we can assume, without loss of generality, that X = {0, 1, . . . , 9}.
To encode, e.g., the message x = (3, 6, 5) we thus simply parse it as the decimal number 365, which
we then express in binary: 365 = (101101101)2 . We defer the issue of unique decodability to
Section 3.3. Despite its simplicity, this code has three important properties, which we highlight next.
Property 1: “Stack” Semantics. Encoding and decoding traverse symbols in a message in opposite
order. For encoding, the simplest way to parse a sequence of digits xi ∈ X into a number in the
positional numeral system of base |X| goes as follows: initialize a variable with zero; then, for each
digit, multiply the variable by |X| and add the digit. For example, if |X| = 10 and x = (3, 6, 5): starting from 0, we obtain 0 · 10 + 3 = 3, then 3 · 10 + 6 = 36, and finally 36 · 10 + 5 = 365.
Thus, the simplest implementation of Cunif. has “stack” semantics (last-in-first-out): conceptually,
encoding pushes information onto the top of a stack of compressed data, and decoding pops the latest
encoded information off the top of the stack. Listing 1 implements these push and pop operations
in Python as methods on a class UniformCoder. A simple usage example is given in Listing 2 (all
code examples in this paper are also available online4 as an executable jupyter notebook.)
Property 2: Amortization. Before we explore more advanced usages of our UniformCoder from
Listing 1, we illustrate how this stream code amortizes compressed bits over multiple symbols (see
Section 2.2). Consider again encoding the message x = (3, 6, 5) with the alphabet X = {0, . . . , 9}.
4 https://github.com/bamler-lab/understanding-ans/blob/main/code-examples.ipynb
1 coder = UniformCoder() # Defined in Listing 1.
2
3 # Encode the message x = (3, 6, 5):
4 coder.push(3, base=10) # base=10 means that the alphabet is X = {0, . . . , 9}.
5 coder.push(6, base=10)
6 coder.push(5, base=10)
7 print(f"Encoded number: {coder.number}") # Prints: “Encoded number: 365”
8 print(f"In binary: {coder.number:b}") # Prints: “In binary: 101101101”
9
10 # Decode the encoded symbols (in reverse order):
11 print(f"Decoded '{coder.pop(base=10)}'.") # Prints: “Decoded '5'.”
12 print(f"Decoded '{coder.pop(base=10)}'.") # Prints: “Decoded '6'.”
13 print(f"Decoded '{coder.pop(base=10)}'.") # Prints: “Decoded '3'.”
Listing 2: Encoding and decoding the message x = (3, 6, 5) using our UniformCoder from Listing 1;
digits are encoded from left to right but decoded from right to left, i.e., the coder has stack semantics.
Listing 3: Encoding and decoding the message x = (3, 6, 12, 4) where the first two symbols come
from the alphabet {0, . . . , 9} and the last two symbols come from the alphabet {0, . . . , 14}.
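Following the pattern of Listing 2, a minimal sketch of such code (assuming the UniformCoder from Listing 1) could read:

    coder = UniformCoder()        # Defined in Listing 1.
    coder.push(3, base=10)        # First two symbols use the alphabet {0, ..., 9}.
    coder.push(6, base=10)
    coder.push(12, base=15)       # Last two symbols use the alphabet {0, ..., 14}.
    coder.push(4, base=15)

    # Decode in reverse order, using the same base for each symbol as above:
    assert [coder.pop(base=15), coder.pop(base=15),
            coder.pop(base=10), coder.pop(base=10)] == [4, 12, 6, 3]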
Modifying one of the symbols in this message obviously changes the compressed bit string, e.g.,
Cunif. (3, 6, 5) = 101101101 ;
Cunif. (3, 6, 6) = 101101110 ; (4)
Cunif. (3, 7, 5) = 101110111 .
Note that the penultimate bit changes from “0” to “1” when we change the third symbol from x3 = 5
to x3 = 6 (second line), and also when we change the second symbol from x2 = 6 to x2 = 7 (third line). Thus,
this bit depends on both x2 and x3 . This is a crucial difference to symbol codes (e.g., Huffman Codes),
which attribute each bit in the compressed bit string to exactly one symbol only (see Figure 1 (a)).
Amortization is what allows stream codes to reach lower expected bitrates than symbol codes.
Property 3: Varying Alphabet Sizes. Our implementation of ANS in Sections 3.2-3.3 below will
exploit an additional property of the UniformCoder from Listing 1: each symbol xi of the message
x = (x1 , . . . , xk ) may come from an individual alphabet Xi , with individual alphabet size |Xi |. We
just have to set the base when encoding or decoding each symbol xi to the corresponding alphabet
size |Xi |. For example, in Listing 3 we encode and decode the example message x = (3, 6, 12, 4) using
the alphabet X1 = X2 = {0, . . . , 9} for the first two symbols and the alphabet X3 = X4 = {0, . . . , 14}
for the last two symbols. The example works because the method pop of UniformCoder (Listing 1)
performs the exact inverse of push: calling push(symbol, base) followed by pop(base) with
identical base (and with symbol ∈ {0, . . . , base − 1}) restores the coder’s internal state, regardless
of its history. This generalized use of a UniformCoder still maps bijectively between the message
space (which is now X = X1 × X2 × · · · × Xk ) and the integers {0, . . . , |X | − 1}. It therefore still
provides optimal compression performance, assuming that all symbols are uniformly distributed over
their respective alphabets. In the next section, we lift this restriction to uniform entropy models.

Figure 2: Partitioning of the range {0, . . . , n − 1} into disjoint subranges Zi (xi ), xi ∈ Xi , see Eq. 6. The size of each subrange, |Zi (xi )| = mi (xi ), is proportional to the probability Qi (Xi = xi ) (Eq. 5).

Figure 3: Encoding a symbol xi with ANS using the bits-back trick. The net bitrate is log2 n − log2 mi (xi ) = − log2 Qi (Xi = xi ).
In this section, we use the UniformCoder introduced in the last section (Listing 1) as a building
block for a more general stream code that is no longer limited to uniform entropy models. This leads
us to a precursor of ANS that provides very close to optimal compression performance for arbitrary
entropy models, but that is too slow for practical use. We will speed it up in Section 3.3 below.
Approximated Entropy Model. The UniformCoder from Listing 1 minimizes the expected bi-
trate EP [RCunif. (X)] only if the entropy model P is a uniform distribution. We now consider the more
general case where P (X) = ∏_{i=1}^{k} Pi (Xi ) still factorizes over all symbols, but each Pi is now an
arbitrary distribution over a finite alphabet Xi . The ANS algorithm approximates each Pi with a
distribution Qi that represents the probability Qi (Xi = xi ) of each symbol xi in fixed-point precision,

Qi (Xi = xi ) = mi (xi ) / n   ∀xi ∈ Xi ,   where n = 2^precision . (5)
Here, precision is a positive integer (typically ≲ 32) that controls a trade-off between computational
cost and accuracy of the approximation. Further, mi (xi ) for xi ∈ Xi are integers that should ideally
be chosen by minimizing the Kullback-Leibler (KL) divergence DKL (Pi || Qi ), which measures the
overhead (in expected bitrate) due to approximation errors. In practice, however, exact minimization
of this KL-divergence is rarely necessary as simpler heuristics (e.g., by lazily rounding the cumulative
distribution function) usually already lead to a negligible overhead for reasonable precisions. Note
however, that setting mi (xi ) = 0 for any xi ∈ Xi makes it impossible to encode xi with ANS, and that
correctness of the ANS algorithm relies on Qi being properly normalized, i.e., ∑_{xi ∈ Xi} mi (xi ) = n.
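As an illustration of such a heuristic, the following sketch (ours, not one of the paper's listings; function name and clamping details are assumptions) rounds the cumulative distribution function while ensuring that all mi (xi ) are nonzero and sum to n:

    def quantize(probs, precision):
        # Turn probabilities into integer weights m_i(x_i) with sum(m) == n and
        # m_i(x_i) >= 1 for every symbol (assumes n >= len(probs)):
        n = 2 ** precision
        cdf = [0]
        cumulative = 0.0
        for i, p in enumerate(probs[:-1]):
            cumulative += p
            rounded = round(cumulative * n)
            rounded = max(rounded, cdf[-1] + 1)               # keep m_i(x_i) >= 1
            rounded = min(rounded, n - (len(probs) - 1 - i))  # leave room for the rest
            cdf.append(rounded)
        cdf.append(n)   # The quantized CDF must end exactly at n (normalization).
        return [cdf[i + 1] - cdf[i] for i in range(len(probs))]

    assert quantize([7/16, 3/16, 6/16], precision=4) == [7, 3, 6]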
ANS uses the approximated entropy models Qi for its encoding and decoding operations. While
the operations turn out to be very compact, it is often somewhat puzzling to users why they work,
why they provide near-optimal compression performance, and how they could be extended to satisfy
potential additional constraints for special use cases. In the following, we aim to make ANS more
approachable by proposing a new perspective on it that expresses Qi as a latent variable model.
Latent Variable Model. Since the integers mi (xi ) in Eq. 5 sum to n for xi ∈ Xi , they define a
partitioning of the integer range {0, . . . , n − 1} into |Xi | disjoint subranges, see Figure 2: for each
symbol xi ∈ Xi , we define a subrange Zi (xi ) ⊆ {0, . . . , n − 1} of size |Zi (xi )| = mi (xi ) as follows,
Zi (xi ) := { ∑_{x'i < xi} mi (x'i ), . . . , (∑_{x'i ≤ xi} mi (x'i )) − 1 } , (6)
where we assumed some ordering on the alphabet Xi (to define what “x'i < xi ” means). By construction, the subranges Zi (xi ) are pairwise disjoint for different xi and they cover the entire range
{0, . . . , n − 1} (see Figure 2). Imagine now we want to draw a random sample xi from Qi . A simple
way to do this is to draw an auxiliary integer zi from a uniform distribution over {0, . . . , n − 1},
and then identify the unique symbol xi ∈ Xi that satisfies zi ∈ Zi (xi ). This procedure describes the
so-called generative process of a latent variable model Qi (Xi , Zi ) = Q(Zi ) Qi (Xi | Zi ) where
Q(Zi = zi ) = 1/n   ∀zi ∈ {0, . . . , n − 1}   and   Qi (Xi = xi | Zi = zi ) = { 1 if zi ∈ Zi (xi ); 0 else } . (7)
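In code, this generative process could be sketched as follows (an illustration, not one of the paper's listings):

    import random

    def sample_symbol(m, precision):
        # Draw z_i uniformly from {0, ..., n-1}, then return the unique symbol
        # x_i whose subrange Z_i(x_i) contains z_i (see Figure 2):
        n = 2 ** precision
        assert sum(m) == n
        z = random.randrange(n)
        for symbol, m_symbol in enumerate(m):
            if z < m_symbol:
                return symbol
            z -= m_symbol

    # With m = [7, 3, 6] and precision = 4, sample_symbol returns, e.g., symbol 1
    # with probability m[1] / n = 3/16, i.e., it samples from Q_i in Eq. 5.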
1 class SlowAnsCoder: # Has near-optimal bitrates but high runtime cost.
2 def __init__(self, precision, compressed=0):
3 self.n = 2**precision # See Eq. 5 (“ **” denotes exponentiation).
4 self.stack = UniformCoder(compressed) # Defined in Listing 1.
5
6 def push(self, symbol, m): # Encodes one symbol.
7 z = self.stack.pop(base=m[symbol]) + sum(m[0:symbol])
8 self.stack.push(z, base=self.n)
9
10 def pop(self, m): # Decodes one symbol.
11 z = self.stack.pop(base=self.n)
12 # Find the unique symbol that satisfies z ∈ Zi (symbol) (real deploy-
13 # ments should use a more efficient method than linear search):
14 for symbol, m_symbol in enumerate(m):
15 if z >= m_symbol:
16 z -= m_symbol
17 else:
18 break
19 self.stack.push(z, base=m_symbol)
20 return symbol
21
22 def get_compressed(self):
23 return self.stack.number
Listing 4: A coder with near-optimal compression performance for arbitrary entropy models due to
the bits-back trick (Section 3.2), but with poor runtime performance. Usage example in Listing 5.
Using |Zi (xi )| = mi (xi ), one can easily verify that the approximated entropy model from Eq. 5 is
the marginal distribution of the latent variable model from Eq. 7, i.e., our use of the same name Qi for
both models is justified since we indeed have Qi (Xi = xi ) = ∑_{zi =0}^{n−1} Qi (Xi = xi , Zi = zi ) ∀xi ∈ Xi .
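Spelled out, this verification is a one-line computation (using Eqs. 6-7 and |Zi (xi )| = mi (xi )):

    Q_i(X_i = x_i)
      = \sum_{z_i=0}^{n-1} Q(Z_i = z_i) \, Q_i(X_i = x_i \mid Z_i = z_i)
      = \sum_{z_i \in \mathcal{Z}_i(x_i)} \frac{1}{n}
      = \frac{|\mathcal{Z}_i(x_i)|}{n}
      = \frac{m_i(x_i)}{n} .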
The uniform prior distribution Q(Zi ) in Eq. 7 suggests that we could build a stream code by utilizing
our UniformCoder from Listing 1. We begin with a naive approach upon which we then improve:
to encode a message x = (x1 , . . . , xk ), we simply identify each symbol xi by some zi that we
may choose arbitrarily from Zi (xi ), and we encode the resulting sequence (z1 , . . . , zk ) using a
UniformCoder with alphabet {0, . . . , n − 1}. On the decoder side, we decode (z1 , . . . , zk ) and we
recover the message x by identifying, for each i, the unique symbol xi that satisfies zi ∈ Zi (xi ).
The Bits-Back Trick. Unfortunately, the above naive approach suffers from poor compression
performance. Encoding zi with the uniform prior model Q(Zi ) increases the (amortized) bitrate by
zi ’s information content, − log2 Q(Zi = zi ) = − log2 (1/n) = precision. By contrast, an optimal
code would spend only − log2 Qi (Xi = xi ) = precision − log2 mi (xi ) bits on symbol xi (see
Eq. 5). Thus, our naive approach has an overhead of log2 mi (xi ) bits for each symbol xi . The
overhead arises because the encoder can choose zi arbitrarily among any element of Zi (xi ), and this
choice injects some information into the compressed bit string that is discarded on the decoder side.
This suggests a simple solution, called the bits-back trick [Wallace, 1990, Hinton and Van Camp,
1993]: we should somehow utilize the log2 mi (xi ) bits of information contained in our choice of
zi ∈ Zi (xi ), e.g., by communicating through it some part of the previously encoded symbols.
Listing 4 implements this bits-back trick in a class SlowAnsCoder (a precursor to an AnsCoder
that we will introduce in the next section). Like our UniformCoder, a SlowAnsCoder operates as
a stack. The method push for encoding accepts a symbol xi from the (implied) alphabet Xi =
{0, . . . , len(m) − 1}, where the additional method argument “m” is a list of all the integers mi (x'i )
for x'i ∈ Xi , i.e., m specifies a model Qi (Eq. 5), and m[symbol] is mi (xi ). The method body is
very brief: Line 7 in Listing 4 picks a zi ∈ Zi (xi ) and Line 8 encodes zi onto a stack (which is a
UniformCoder, see Line 4) using a uniform entropy model over {0, . . . , n − 1}. The interesting part
1 # Specify an approximated entropy model via precision and mi (xi ) from Eq. 5:
2 precision = 4 # For demonstration; deployments should use higher precision.
3 m = [7, 3, 6] # Sets Qi (Xi = 0) = 7/16, Qi (Xi = 1) = 3/16, and Qi (Xi = 2) = 6/16.
4
5 # Encode a message in reversed order so we can decode it in forward order:
6 example_message = [2, 0, 2, 1, 0]
7 encoder = SlowAnsCoder(precision) # Also works with AnsCoder (Listing 7).
8 for symbol in reversed(example_message):
9 encoder.push(symbol, m) # We could use a different m for each symbol.
10 compressed = encoder.get_compressed()
11
12 # We could actually reuse the encoder for decoding, but let's pretend that
13 # decoding occurs on a different machine that receives only “compressed”.
14 decoder = SlowAnsCoder(precision, compressed)
15 reconstructed = [decoder.pop(m) for _ in range(5)]
16 assert reconstructed == example_message # Verify correctness.
Listing 5: Simple usage example for our SlowAnsCoder from Listing 4; the same example also
works with our more efficient AnsCoder from Listing 7. For simplicity, we use the same entropy
model Qi (Lines 2-3) for all symbols, but we could easily use an individual model for each symbol.
is how we pick zi in Line 7: we decode zi from stack using the alphabet Zi (xi ).5 Note that, at this
point, we haven’t actually encoded zi yet; we just decode it from whatever data has accumulated on
stack from previously encoded symbols (if any). This gives us no control over the value of zi except
that it is guaranteed to lie within Zi (xi ), which is all we need. Crucially, decoding zi consumes data
from stack (see Figure 3), thus reducing the (amortized) bitrate as we analyze below.
The method pop for decoding inverts each step of push, and it does so in reverse order because
we are using a stack. Line 11 in Listing 4 decodes zi using the alphabet {0, . . . , n − 1} (thus
inverting Line 8) and Line 19 encodes zi using the alphabet Zi (xi ) (thus inverting Line 7). The latter
step recovers the information that we communicate through our choice of zi ∈ Zi (xi ) in push. The
method pop may appear more complicated than push because it has to find the unique symbol xi that
satisfies zi ∈ Zi (xi ), which is implemented here—for demonstration purposes—with a simple linear
search (Lines 14-18). This search simultaneously subtracts ∑_{x'i < xi} mi (x'i ) from zi before encoding
it, which inverts the part of Line 7 that adds sum(m[0:symbol]) to the value decoded from stack.
Listing 5 shows a usage example for SlowAnsCoder. For simplicity, we encode each symbol with
the same model Qi here, but it would be straight-forward to use an individual model for each symbol.
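To make the bookkeeping concrete, here is a short numeric trace (values computed by hand, assuming precision = 4, i.e., n = 16, and the demo model m = [7, 3, 6] from Listing 5):

    m = [7, 3, 6]
    coder = SlowAnsCoder(precision=4)   # internal number starts at 0
    coder.push(2, m)   # z = 0 % 6 + (7 + 3) = 10;  number: 0  -> 0*16 + 10 = 10
    coder.push(0, m)   # z = 10 % 7 + 0     = 3;    number: 10 -> (10 // 7)*16 + 3 = 19
    assert coder.get_compressed() == 19  # = 0b10011

    assert coder.pop(m) == 0   # number: 19 -> (19 // 16)*7 + (19 % 16) = 10
    assert coder.pop(m) == 2   # number: 10 -> (10 // 16)*6 + (10 % 16 - 10) = 0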
Correctness and Compression Performance. Many researchers are initially confused by the
bits-back trick. It may help to separately analyze its correctness and its compression performance.
Correctness means that the method pop is the exact inverse of push, i.e., calling push(symbol, m)
followed by pop(m) returns symbol and restores the SlowAnsCoder into its original state (assuming
that all method arguments are valid, i.e., 0 ≤ symbol < len(m) and sum(m) = n = 2^precision ).
This can easily be verified by following the implementation in Listing 4 step by step. Crucially,
it holds for any original state of the SlowAnsCoder, even if the coder starts out empty (i.e., with
compressed = 0). Some readers may find it easier to verify this by reference to Listing 6, which
reimplements the SlowAnsCoder from Listing 4 in a self-contained way, i.e., without explicitly using
a UniformCoder from Listing 1 because all method calls to stack are manually inlined.
The compression performance of a SlowAnsCoder is analyzed more easily in Listing 4. Line 7 of
the method push decodes zi from stack with a uniform entropy model over Zi (xi ), i.e., with the
model Qi (Zi = zi | Xi = xi ) = 1/mi (xi ) ∀zi ∈ Zi (xi ). This reduces the (amortized) bitrate by the
information content − log2 Qi (Zi = zi | Xi = xi ) = log2 mi (xi ) bits, provided that at least this much
data is available on stack (which holds for all but the first encoded symbol in a message). Line 8 then
encodes onto stack with a uniform entropy model over {0, . . . , n − 1}, which increases the bitrate
by log2 n. Thus, in total, encoding (pushing) a symbol xi contributes log2 n − log2 mi (xi ) bits (see Figure 3),
5 Technically, this is done by decoding with the shifted alphabet {0, . . . , mi (xi ) − 1} and then shifting the decoded value back by adding sum(m[0:symbol]), which is Python notation for ∑_{x'i < xi} mi (x'i ).
1 class SlowAnsCoder: # Equivalent to Listing 4, just more self-contained.
2 def __init__(self, precision, compressed=0):
3 self.n = 2**precision # See Eq. 5 (“**” denotes exponentiation).
4 self.compressed = compressed # (== stack.number in SlowAnsCoder)
5
6 def push(self, symbol, m): # Encodes one symbol.
7 z = self.compressed % m[symbol] + sum(m[0:symbol])
8 self.compressed //= m[symbol] # “//” denotes integer division.
9 self.compressed = self.compressed * self.n + z
10
11 def pop(self, m): # Decodes one symbol.
12 z = self.compressed % self.n
13 self.compressed //= self.n # “//” denotes integer division.
14 # Identify symbol and subtract sum(m[0:symbol]) from z:
15 for symbol, m_symbol in enumerate(m):
16 if z >= m_symbol:
17 z -= m_symbol
18 else:
19 break # We found the symbol that satisfies z ∈ Zi (symbol).
20 self.compressed = self.compressed * m_symbol + z
21 return symbol
22
23 def get_compressed(self):
24 return self.compressed

Listing 6: Reimplementation of the SlowAnsCoder from Listing 4 in a self-contained way, i.e., without explicitly using a UniformCoder from Listing 1; all method calls to stack are manually inlined.
which is precisely the symbol’s information content, − log2 Qi (Xi = xi ) (see Eq. 5). Therefore, the
SlowAnsCoder implements an optimal code for Q, except for the first symbol of the message, which
always contributes log2 n = precision bits (this constant overhead is negligible for long messages).
However, the SlowAnsCoder turns out to be slow (as its name suggests), which we address next.
We speed up the SlowAnsCoder introduced in the last section (Listings 4 or 6), which finally leads
us to the variant of ANS that is commonly used in practice (also called “streaming ANS”). While the
SlowAnsCoder provides near-optimal compression performance, it is not a very practical algorithm
because the runtime cost for encoding a message of length k scales quadratically rather than linearly
in k. This is because SlowAnsCoder represents the entire compressed bit string as a single integer
(compressed in Listing 6), which would therefore become extremely large (typically millions of bits
long) in practical applications such as image compression.6 The runtime cost of arithmetic operations
involving compressed (Lines 7, 8, and 20 in Listing 6) thus scales linearly with the amount of data
that has been compressed so far, leading to an overall cost of O(k²) for a sequence of k symbols.
To speed up the algorithm, we first observe that not all arithmetic operations on large integers are
necessarily slow. Lines 9, 12, and 13 in Listing 6 perform multiplication, modulo, and integer division
between compressed and n. Since n = 2^precision is a power of two (Eq. 5), this suggests storing
the bits that represent the giant integer compressed in a dynamically sized array (aka “vector”)
of precision-bit chunks (called “words” below) so that these arithmetic operations simplify to
appending a zero word to the vector, inspecting the last word in the vector, and deleting the last word,
respectively (analogous to how “×10”, “mod 10”, and “⌊·/10⌋” are trivial operations in the decimal
system). A good vector implementation performs these operations in constant (amortized) runtime.
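The following sketch (ours, for illustration) spells out this correspondence between arithmetic with n = 2^precision and operations on a vector of precision-bit words:

    precision = 4
    n = 2 ** precision

    def to_words(number):
        # Base-n digits of `number`, most significant word first, so that the
        # *last* word of the vector holds the least significant bits:
        words = []
        while number != 0:
            words.insert(0, number % n)
            number //= n
        return words

    number = 0b0110_1110_1001   # words: [0b0110, 0b1110, 0b1001]
    assert to_words(number * n) == [0b0110, 0b1110, 0b1001, 0b0000]  # append a zero word
    assert number % n == 0b1001                                      # inspect the last word
    assert to_words(number // n) == [0b0110, 0b1110]                 # delete the last word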
Unfortunately, the remaining arithmetic operations (Lines 7, 8, and 20 in Listing 6) cannot be reduced
to a constant runtime cost because restricting mi (xi ) to a power of two for all xi ∈ Xi would severely
limit the expressivity of the approximated entropy model Q (Eq. 5). The form of ANS that is used
6 The reason why our SlowAnsCoder works at all is because Python seamlessly switches to a “big int” representation for large numbers. A similarly naive implementation in C++ or Rust would overflow very quickly.
Figure 4: Streaming ANS with a growable bulk and a finite-capacity head. Encoding and decoding
operate on head, but if encoding a symbol would overflow head then we first transfer the least
significant precision bits of head to bulk. More general configurations are possible, see Section 4.1.
in practice (sometimes called “streaming ANS”) therefore employs a hybrid approach that splits
the compressed bit string into a bulk and a head part (see Figure 4). The bulk holds most of the
compressed data as a growable vector of fixed-sized words while the head has a relatively small fixed
capacity (e.g., 64 bits). Most of the time, encoding and decoding operate only on head and thus have
a constant runtime due to the bounded size of head. Only when head overflows or underflows certain
thresholds (see below) do we transfer some data between head and the end of bulk. These transfers
also have a constant (amortized) runtime because we always transfer entire words (see below).
Invariants. We present here the simplest variant of streaming ANS, deferring generalizations to
Section 4.1. In this variant, bulk is a vector of precision-bit words (i.e., unsigned integers smaller
than 2^precision ), and head can hold up to 2×precision bits (see Figure 4). The algorithm is very
similar to our SlowAnsCoder from Listing 6 except that all arithmetic operations now act on head
instead of compressed, and the encoder sometimes flushes some data from head to bulk while the
decoder sometimes refills data from bulk to head. Obviously, the encoder and decoder have to agree
on exactly when such data transfers occur. This is done by upholding the following invariants:
(i) head < 2^(2×precision) , i.e., head is a 2×precision-bit unsigned integer; and
(ii) head ≥ 2^precision if bulk is not empty.
Any violation of these invariants triggers a data transfer between head and bulk that restores them.
The Streaming ANS Algorithm. Listing 7 shows the full implementation of our final AnsCoder,
which is an evolution on our SlowAnsCoder implementation from Listing 6 (and which can be used
as a replacement for SlowAnsCoder in the usage example of Listing 5). Method push checks upfront
(on Line 13) if encoding directly onto head would lead to an overflow that would violate invariant (i).
We show in Appendix A.1 that this is the case exactly if head >> precision ≥ mi (xi ), where “>>”
denotes a right bit-shift (i.e., integer division by 2^precision ). If this is the case, then the encoder
transfers the least significant precision bits from head to bulk (Lines 15-16). Since we assume
that both invariants are satisfied at method entry, head initially contains at most 2×precision bits
(invariant (i)), so transferring precision bits out of it leads to a temporary violation of invariant (ii)
(but both invariants hold again on method exit as we show in Appendix A.2). The temporary violation
of invariant (ii) is on purpose: since transfers from head to bulk are the only operations during
encoding that lead to a temporary violation of invariant (ii), we can detect and invert such transfers
on the decoder side by simply checking for invariant (ii), see Lines 37-39 in Listing 7.
The method get_compressed exports the entire compressed data by concatenating bulk and head
into a list of precision-bit words, which can be written to a file or network socket (e.g., by splitting
each word into four 8-bit bytes if precision = 32). Before we analyze the AnsCoder in more
detail, we emphasize its simplicity: Listing 7 is a complete implementation of a computationally
efficient entropy coder with very near-optimal compression performance (see below). By contrast, a
complete implementation of arithmetic coding or range coding [Rissanen and Langdon, 1979, Pasco,
1976] would be much more involved due to a number of corner cases in those algorithms.
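The following short trace (values computed by hand, assuming precision = 4 and the demo model m = [7, 3, 6] from Listing 5) illustrates when the AnsCoder from Listing 7 transfers data between head and bulk:

    m = [7, 3, 6]
    coder = AnsCoder(precision=4)
    for symbol in [2, 0, 1, 2]:
        coder.push(symbol, m)
    # The first three pushes fit into head (head = 104 = 0b110_1000 afterwards).
    # The fourth push would violate invariant (i), so the encoder first flushes
    # the least significant 4 bits of head (0b1000) to bulk:
    assert coder.bulk == [0b1000] and coder.head == 0b1_1010

    # Decoding pops the symbols in reverse order; it refills head from bulk
    # exactly when invariant (ii) would otherwise be violated:
    assert [coder.pop(m) for _ in range(4)] == [2, 1, 0, 2]
    assert coder.bulk == [] and coder.head == 0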
Compression Performance. The separation between bulk and head reduces the runtime cost of
ANS from quadratic to linear in the message length k, but it introduces a small compression overhead.
When the encoder transfers data from bulk to head it simply chops off precision bits from the
binary representation of head. This is well-motivated as we encode onto head with an optimal
code (with respect to Q), and so one might assume that each valid bit in head carries one full bit of
entropy. However, this is not quite correct: as discussed in Section 2.1, even with an optimal code,
1 class AnsCoder:
2 def __init__(self, precision, compressed=[]):
3 self.precision = precision
4 self.mask = (1 << precision) - 1 # (a string of precision one-bits)
5 self.bulk = compressed.copy() # (We will mutate bulk below.)
6 self.head = 0
7 # Establish invariant (ii):
8 while len(self.bulk) != 0 and (self.head >> precision) == 0:
9 self.head = (self.head << precision) | self.bulk.pop()
10
11 def push(self, symbol, m): # Encodes one symbol.
12 # Check if encoding directly onto head would violate invariant (i):
13 if (self.head >> self.precision) >= m[symbol]:
14 # Transfer one word of compressed data from head to bulk:
15 self.bulk.append(self.head & self.mask) # (“ &” is bitwise and)
16 self.head >>= self.precision
17 # At this point, invariant (ii) is definitely violated,
18 # but the operations below will restore it.
19
20 z = self.head % m[symbol] + sum(m[0:symbol])
21 self.head //= m[symbol]
22 self.head = (self.head << self.precision) | z # (This is
23 # equivalent to “ self.head * n + z”, just slightly faster.)
24
25 def pop(self, m): # Decodes one symbol.
26 z = self.head & self.mask # (same as “ self.head % n” but faster)
27 self.head >>= self.precision # (same as “ //= n” but faster)
28 for symbol, m_symbol in enumerate(m):
29 if z >= m_symbol:
30 z -= m_symbol
31 else:
32 break # We found the symbol that satisfies z ∈ Zi (symbol).
33 self.head = self.head * m_symbol + z
34
35 # Restore invariant (ii) if it is violated (which happens exactly
36 # if the encoder transferred data from head to bulk at this point):
37 if (self.head >> self.precision) == 0 and len(self.bulk) != 0:
38 # Transfer data back from bulk to head (“ |” is bitwise or):
39 self.head = (self.head << self.precision) | self.bulk.pop()
40
41 return symbol
42
43 def get_compressed(self):
44 compressed = self.bulk.copy() # (We will mutate compressed below.)
45 head = self.head
46 # Chop head into precision-sized words and append to compressed:
47 while head != 0:
48 compressed.append(head & self.mask)
49 head >>= self.precision
50 return compressed
Listing 7: A complete streaming ANS entropy coder in Python. For a usage example, see Listing 5
(replace SlowAnsCoder with AnsCoder). This implementation is written in Python for demonstration
purpose only. Real deployments should be implemented in a more efficient, compiled language (see,
e.g., the constriction library presented in Section 5.1, which also provides Python bindings).
the expected length of the compressed bit string can exceed the entropy by up to almost 1 bit. Thus,
each valid bit in the binary representation of head carries slightly less than one bit of entropy.
According to Benford’s Law [Hill, 1995], the most significant (left) bits in head carry lowest entropy
while less significant (right) bits are nearly uniformly distributed and thus indeed carry close to one
bit of entropy each. This is why the transfer from head to bulk (Lines 15-16 in Listing 7) takes the
least significant precision bits of head (see Figure 4). We can make their entropies arbitrarily close
to one (and thus the overhead arbitrarily small) by increasing precision since a transfer occurs only
if these bits are “buried below” at least an additional precision − (− log2 Qi (Xi = xi )) valid bits.
In practice, streaming ANS has very close to optimal compression performance (see empirical results
in Section 5.2 below). But a small bitrate-overhead over a hypothetical optimal coder comes from:
1. the linear (in k) overhead due to Benford's Law just discussed (shrinks as precision increases);
2. a linear approximation overhead of ∑_{i=1}^{k} DKL (Pi || Qi ) (shrinks as precision increases); and
3. a constant overhead of at most precision bits due to the first symbol in the bits-back trick.
In practice, it is easy to find a precision that makes all three overheads negligibly small (see
Section 5.2). But there can be additional constraints, e.g., memory alignment, the size of lookup tables
for Qi (see Section 5.2), and the size of jump tables for random-access decoding (see Section 4.2).
4 Variations on ANS
We generalize the basic streaming ANS algorithm from Section 3.3. The generalizations in Sub-
sections 4.1 and 4.2 are straight-forward. Subsection 4.3 is more advanced, and it builds on our
reinterpretation of ANS as bits-back coding with positional numeral systems presented in Section 3.2.
Section 3.3 and Listing 7 present the simplest variant of streaming ANS. More general variants are
possible. The bitlength of all words on bulk may be any integer word_size ≥ precision, and
the head may have a more general bitlength head_capacity ≥ precision + word_size. This
includes the special case from Section 3.3 for word_size = precision and head_capacity =
2×precision, but it also admits more general setups with the following generalized invariants:
(i’) head < 2^head_capacity , i.e., head can always be represented in head_capacity bits; and
(ii’) head ≥ 2^(head_capacity − word_size) if bulk is not empty (a violation of this invariant means
precisely that we can transfer one word from bulk to head without violating invariant (i’)).
Setting head_capacity larger than precision + word_size reduces overhead 1 from Section 3.3.
We analyze this improvement empirically in Section 5.2. Setting word_size larger than precision
1 class SeekableAnsCoder: # Adds random-access decoding to Listing 7.
2 # __init__, push, pop, and get_compressed same as in Listing 7 ...
3
4 def checkpoint(self): # Records a point to which we can seek later.
5 return (len(self.bulk), self.head)
6
7 def seek(self, checkpoint): # Jumps to a previously taken checkpoint.
8 position, head = checkpoint
9 if position > len(self.bulk): # “raise” throws an exception.
10 raise ValueError("This simple demo can only seek forward.")
11 self.bulk = self.bulk[0:position] # Truncates bulk.
12 self.head = head
13
14 # Usage example:
15 precision = 4 # For demonstration; deployments should use higher precision.
16 m = [7, 3, 6] # Same demo model as in Listing 5.
17 coder = SeekableAnsCoder(precision)
18 message = [2, 0, 2, 1, 0, 1, 2, 2, 2, 1, 0, 2, 1, 2, 0, 0, 1, 1, 1, 2]
19
20 for symbol in reversed(message[10:20]): # Encode second half of message.
21 coder.push(symbol, m)
22 checkpoint = coder.checkpoint() # Record a checkpoint.
23 for symbol in reversed(message[0:10]): # Encode first half of message.
24 coder.push(symbol, m)
25
26 assert coder.pop(m) == message[0] # Decode first symbol.
27 assert coder.pop(m) == message[1] # Decode second symbol.
28 coder.seek(checkpoint) # Jump to 11th symbol.
29 assert [coder.pop(m) for _ in range(10)] == message[10:20] # Decode rest.
Listing 8: Streaming ANS with random-access decoding (see Section 4.2). This simple demo
implementation can only seek forward since both seeking and decoding consume compressed data.
To allow seeking backwards, one could use a cursor into immutable compressed data instead.
may be motivated by more technical reasons, e.g., memory alignment of words on bulk and the
memory footprint and thus cache friendliness of lookup tables for the quantile function of Qi .
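To make the generalization concrete, here is a sketch (ours, not taken from the paper or its appendix) of how the overflow check in the method push from Listing 7 generalizes; it assumes additional attributes self.word_size ≥ self.precision and self.head_capacity ≥ self.precision + self.word_size:

    def push(self, symbol, m):   # Generalized variant of push from Listing 7.
        # Flush one word if encoding onto head would violate invariant (i'):
        if (self.head >> (self.head_capacity - self.precision)) >= m[symbol]:
            self.bulk.append(self.head & ((1 << self.word_size) - 1))
            self.head >>= self.word_size
        z = self.head % m[symbol] + sum(m[0:symbol])
        self.head //= m[symbol]
        self.head = (self.head << self.precision) | z

With word_size = precision and head_capacity = 2×precision, this reduces to Lines 11-23 of Listing 7.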
When decoding a compressed message, it is sometimes desirable to quickly jump to some specific
position within the message (e.g., when skipping forward in a compressed video stream). Such
random access is trivial in symbol codes as the decoder only needs to know the correct offset into the
compressed bit string, which may be stored for selected potential jump targets as meta data (“jump
table”) within some container format. For a stream code like ANS, jumping (or “seeking”) to a
specific position within the message requires knowledge not only of the offset within the compressed
bit string but also of the internal state that the decoder would have if it arrived at this position without
seeking. ANS makes it particularly easy to obtain this target decoder state during encoding since—as
opposed to, e.g., arithmetic coding—the encoder and decoder in ANS are the same data structure
with the same internal state (i.e., head). Listing 8 shows an example of random access with ANS.
We now discuss a more specialized variation on ANS that may not be immediately relevant to most
readers, but which demonstrates how useful a deep understanding of the ANS algorithm can be for
developing new ideas. This is an advanced section. First-time readers might prefer to skip to Section 5
for empirical results and practical advice in more common use cases before coming back here.
Streaming ANS as presented in Listing 7 is a highly effective algorithm for entropy coding with a
fixed model. But the situation becomes trickier if modeling and entropy coding cannot be separated as
1 precision = 4 # For demonstration; deployments should use higher precision.
2 m_orig = [7, 3, 6] # Same demo entropy model as in Listing 5.
3 m_mod = [6, 4, 6] # (Slightly) modified entropy model compared to m_orig.
4 compressed = [0b1001, 0b1110, 0b0110, 0b1110] # Some example bit string.
5
6 # Case 1: decode 4 symbols using entropy model m_orig for all symbols:
7 decoder = AnsCoder(precision, compressed) # AnsCoder defined in Listing 7.
8 case1 = [decoder.pop(m_orig) for _ in range(4)]
9
10 # Case 2: change the entropy model, but *only* for the first symbol:
11 decoder = AnsCoder(precision, compressed) # “compressed” hasn't changed.
12 case2 = [decoder.pop(m_mod)] + [decoder.pop(m_orig) for _ in range(3)]
13
14 print(f"case1 = {case1}") # Prints: “case1 = [0, 1, 0, 2]”
15 print(f"case2 = {case2}") # Prints: “case2 = [1, 1, 2, 0]”
16 # We changed the entropy model only for the first decoded symbol, yet
17 # case1 and case2 differ not only in the first symbol but also in the
18 # third and fourth symbols, which were decoded with identical models.
Listing 9: Non-local effects of entropy models (see Section 4.3); changing the entropy model for the
first decoded symbol of a sequence leads to a ripple effect that may affect all subsequent symbols.
clearly. For example, novel deep-learning based compression methods often employ probabilistic
models over continuous spaces, and thus the models have to be discretized in some way before they
can be used as entropy models. Advanced discretization methods might benefit from an end-to-end
optimization on the encoder side that optimizes through the entropy coder. Unfortunately, optimizing
a function that involves an entropy coder like ANS is extremely difficult since ANS packs as much
information content into as few bits as possible, which leads to highly discontinuous behavior.
For example, Listing 9 demonstrates that ANS reacts in a very irregular and non-local way to even
tiny changes of the entropy model. The example decodes a sequence of symbols from an example bit
string compressed. Such a sequence of decoding operations could be part of a higher-level encoder
that employs the bits-back trick on some latent variable model with a high-dimensional latent space.
In this case, the encoder would still be allowed to optimize over certain model parameters (or, e.g.,
discretization settings) provided that it appends the final choice of those parameters to the compressed
bit string before transmitting it. To demonstrate the effect of optimizing over model parameters,
Listing 9 decodes from the same bit string twice.7 The employed entropy models differ slightly
between these two iterations, but only for the first symbol. Yet, the decoded sequences differ not only
on the first symbol but also on subsequent symbols that were decoded with identical entropy models.
This ripple effect is no surprise: changing the entropy model for the first symbol affects not only the
immediately decoded symbol but also the internal coder state after decoding the first symbol, which
in turn affects how the coder proceeds to decoding subsequent symbols. Our deeper understanding of
ANS as a form of bits-back coding (see Section 3.2) allows us to pinpoint the problem more precisely
as it splits the process of decoding a symbol xi into three steps: (1) decoding a number zi from the
compressed data using the (fixed) alphabet {0, . . . , n − 1}; (2) identifying the unique symbol xi
that satisfies zi ∈ Zi (xi ); and (3) encoding zi back onto the compressed data, this time using the
(model-dependent) alphabet Zi (xi ). Note that step (1) is independent of the entropy model. The only
reason why changing the entropy model for one symbol affects subsequent symbols is step (3), which
“leaks” information about the current entropy model to the stack of compressed data.
This detailed understanding allows us to come up with an alternative entropy coder that does not
exhibit the ripple effect shown in Listing 9. We can prevent the coder from leaking information
about entropy models from one symbol to the next by using two separate stacks for the decoding and
encoding operations in steps (1) and (3) above. Listing 10 sketches an implementation of such an
entropy coder. The method pop decodes zi ∈ {0, . . . , n − 1} from compressed (Line 16), identifies
the symbol xi (Lines 17-21), and then encodes zi ∈ Zi (xi ) onto a different stack remainders
7 Although lists are passed by reference in Python, the first decoder in Listing 9 does not mutate compressed since the constructor of AnsCoder performs a copy of the provided bit string, see Line 5 in Listing 7.
1 class ChainCoder: # Prevents the non-local effect shown in Listing 9.
2 def __init__(self, precision, compressed, remainders=[]):
3 """Initializes a ChainCoder for decoding from `compressed`."""
4 self.precision = precision
5 self.mask = (1 << precision) - 1
6 self.compressed = compressed.copy() # pop decodes from here.
7 self.remainders = remainders.copy() # pop encodes onto here.
8 self.remainders_head = 0
9 # Establish invariant (ii):
10 while len(self.remainders) != 0 and \
11 (self.remainders_head >> precision) == 0:
12 self.remainders_head <<= self.precision
13 self.remainders_head |= self.remainders.pop()
14
15 def pop(self, m): # Decodes one symbol.
16 z = self.compressed.pop() # Always read a full word from compressed.
17 for symbol, m_symbol in enumerate(m):
18 if z >= m_symbol:
19 z -= m_symbol
20 else:
21 break # We found the symbol that satisfies z ∈ Zi (symbol).
22
23 self.remainders_head = self.remainders_head * m_symbol + z
24 if (self.remainders_head >> (2 * self.precision)) != 0:
25 # Invariant (i) is violated. Flush one word to remainders.
26 self.remainders.append(self.remainders_head & self.mask)
27 self.remainders_head >>= self.precision
28 # It can easily be shown that invariant (i) is restored here.
29
30 return symbol
31
32 def push(self, symbol, m): # Encodes one symbol.
33 if len(self.remainders) != 0 and \
34 self.remainders_head < (m[symbol] << self.precision):
35 self.remainders_head <<= self.precision
36 self.remainders_head |= self.remainders.pop()
37 # Invariant (i) is now violated but will be restored below.
38
39 z = self.remainders_head % m[symbol] + sum(m[0:symbol])
40 self.remainders_head //= m[symbol]
41 self.compressed.append(z)
Listing 10: Sketch of an entropy coder that is similar to ANS but that prevents the non-local effect
demonstrated in Listing 9 by using separate stacks of data for reading and writing (see Section 4.3).
(Lines 23-28). If we use this coder to decode a sequence of symbols (as in the example of Listing 9),
then any changes to an entropy model for a single symbol affect only that symbol and the data that
accumulates on remainders, but it has no effect on any subsequently decoded symbols. Once all
symbols are decoded, one can concatenate compressed and remainders in an appropriate way.
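For example, re-running the experiment from Listing 9 with a ChainCoder (a usage sketch, reusing the variable names from Listing 9) confirms that the change stays local:

    m_orig, m_mod = [7, 3, 6], [6, 4, 6]              # As in Listing 9.
    compressed = [0b1001, 0b1110, 0b0110, 0b1110]

    coder1 = ChainCoder(precision=4, compressed=compressed)
    case1 = [coder1.pop(m_orig) for _ in range(4)]

    coder2 = ChainCoder(precision=4, compressed=compressed)
    case2 = [coder2.pop(m_mod)] + [coder2.pop(m_orig) for _ in range(3)]

    assert case1[1:] == case2[1:]   # Only the first decoded symbol may differ.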
As a technical remark, Listing 10 implements the simplest streaming configuration for such an
entropy coder, with word_size = precision and head_capacity = 2 × precision. More
general configurations analogous to Section 4.1 are possible and left as an exercise to the reader (hint:
setting word_size > precision requires introducing a separate compressed_head).
In summary, our new perspective on ANS as bits-back coding with a UniformCoder allowed us to
resolve a non-locality issue of ANS. While this specific issue is unlikely to be immediately relevant
to most readers, other limitations of ANS might be, and the ability to split ANS semantically into the
three subtasks of the bits-back trick may help overcome those as well. This concludes our discussion
of variations and generalizations of ANS. The next section provides some practical guidance on the
streaming configuration based on empirically observed compression performances and runtime costs.
Data compression combines the procedural (i.e., algorithmic) task of entropy coding with the
declarative task of probabilistic modeling. Recently, this separation of tasks has started to manifest
itself also in the division of labor among researchers: systems and signal processing researchers are
continuing to optimize compression algorithms for real hardware while machine learning researchers
have started to introduce new ideas for the design of powerful probabilistic models (e.g., [Toderici
et al., 2017, Minnen et al., 2018, Agustsson et al., 2020, Yang et al., 2020c, 2021]). Unfortunately,
these two communities traditionally use vastly different software stacks, which seems to be slowing
down idea transfer and thus might be part of the reason why machine-learning based compression
methods are still hardly used in the wild despite their proven superior compression performance.
Along with this paper, the author is releasing constriction,8 an open-source library of entropy
coders that intends to bridge the gap between systems and machine learning researchers by pro-
viding first-class support for both the Python and Rust programming languages. To demonstrate
constriction in a real deployment, we point readers to The Linguistic Flux Capacitor,9 a web
application that uses constriction in a WebAssembly module to execute the neural compression
method proposed in [Yang et al., 2020b] completely on the client side, i.e., in the user’s web browser.
Machine learning researchers can use constriction in their typical workflow through its Python
API. It provides a consistent interface to various entropy coders, but it deliberately dictates the precise
configuration of each coder in a way that prioritizes compression performance (at the cost of some
runtime efficiency, see Section 5.2). This allows machine learning researchers to quickly pick a coder
that fits their model architecture (e.g., a range coder for autoregressive models, or ANS for bits-back
coding with latent variable models) without having to learn a new way of representing entropy models
and bit strings for each coder type, and without having to research complex coder configurations.
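For orientation, the sketch below illustrates roughly what such a workflow can look like in Python. The module paths, class names, and method names used here (constriction.stream.model.Categorical, constriction.stream.stack.AnsCoder, encode_reverse, decode, get_compressed) are assumptions made for illustration only and may not match the API of the constriction version discussed in this paper; readers should consult the library's documentation for the authoritative interface.

# Hedged sketch of an ANS round trip through constriction's Python API.
# All class and method names below are assumptions and may differ from the
# installed version of the library.
import numpy as np
import constriction

probabilities = np.array([0.2, 0.5, 0.3], dtype=np.float64)
model = constriction.stream.model.Categorical(probabilities)  # assumed name
symbols = np.array([2, 0, 1, 1], dtype=np.int32)

encoder = constriction.stream.stack.AnsCoder()   # ANS operates as a stack
encoder.encode_reverse(symbols, model)           # assumed method name
compressed = encoder.get_compressed()            # assumed method name

decoder = constriction.stream.stack.AnsCoder(compressed)
decoded = decoder.decode(model, 4)               # assumed method name
assert np.all(decoded == symbols)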
Once a successful prototype of a new compression method is implemented and evaluated in Python,
constriction’s Rust API simplifies the process of turning it into a self-contained product (i.e.,
an executable, static library or WebAssembly module). By default, constriction’s Rust and
Python APIs are binary compatible, i.e., data encoded with one can be decoded with the other, which
simplifies debugging. In addition, the Rust API optionally admits control (via compile-time type
parameters) over coder details (streaming configurations, see Section 4.1, and data providers). This
allows practitioners to tune a working prototype for optimal runtime and memory efficiency. The next
section provides some guidance for these choices by comparing empirical bitrates and runtime costs.
8
https://bamler-lab.github.io/constriction
9
https://robamler.github.io/linguistic-flux-capacitor
5.2 Empirical Results and Practical Advice
We analyze the bitrates and runtime cost of Asymmetric Numeral Systems (ANS) with various
streaming configurations (see Section 4.1), and we compare to Arithmetic Coding (AC) [Rissanen
and Langdon, 1979, Pasco, 1976] and Range Coding (RC) [Martin, 1979]. We identify a streaming
configuration where both ANS and RC are consistently effective over a large range of entropies, and
we observe that they are both considerably faster than AC. Based on these findings, we provide
practical advice for when and how to use which entropy coding algorithm.
Benchmark Data. We test all entropy coders on the data that are decoded in The Linguistic Flux
Capacitor web app (see footnote 9 on page 17), which are the discretized model parameters of
the natural language model from [Bamler and Mandt, 2017]. We chose these discretized model
parameters as benchmark data because they arose in a real use case and they cover a wide spectrum of
entropies per symbol spanning four orders of magnitude. The model parameters consist of 209 slices
of 3 million parameters each. We treat the symbols within each slice as i.i.d. and use the empirical
symbol frequencies within each slice as the entropy model when encoding and decoding the respective
slice. Before discretization, the model parameters were transformed such that each slice represents
the difference from other slices of varying distance (similar to I- and B-frames in a video codec).
This leads to a wide range of entropies, ranging from about 0.001 to 10 bits per symbol from the
lowest to highest entropy slice (with an overall average of 0.41 bits per symbol).
Experiment Setup. We use the proposed constriction library through its Rust API for
both ANS and RC, and the third-party Rust library arcode [Burgess] for AC. In the follow-
ing, we specify the streaming configuration (see Section 4.1) for ANS and RC by the triple
“precision / word_size / head_capacity”. These parameters can be set in constriction’s
Rust API at compile time so that the compiler can generate optimized code for each configura-
tion. The library defines two presets: “default” (24 / 32 / 64), which is recommended for initial
prototyping (and therefore exposed through the Python API), and “small” (12 / 16 / 32), which is
optimized for runtime and memory efficiency. In addition to these two preset configurations, we
also experiment with 32 / 32 / 64 and 16 / 16 / 32, which match the simpler setup of Section 3.3 (i.e.,
precision = word_size). For AC, the only tuning parameter is the fixed-point precision. We
report results with precision = 63 (the highest precision supported by arcode) since lowering the
precision did not improve runtime performance. Our benchmark code and data is available online.10
The reported runtime for each slice is the median over 10 measurements on an Intel Core i7-7500U
CPU (2.70 GHz) running Ubuntu Linux 21.10. We ran an entire batch of experiments (encoding
and decoding all slices with all tested entropy coders) in the inner loop and repeated this procedure
10 times, randomly shuffling the order of experiments for each of the 10 batches. For each experiment,
we encoded/decoded the entire slice twice in a row and measured only the runtime of the second
run, so as to minimize variations in memory caches and branch predictions. We calculated and saved
a trivial checksum (bitwise xor) of all decoded symbols to ensure that no parts of the decoding
operations were optimized out. The benchmark code was compiled with constriction version 0.1.4,
arcode version 0.2.3, and Rust version 1.57 in --release mode and ran in a single thread.
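For concreteness, the following Python sketch mirrors the timing protocol described above. It is only a schematic rendering (the actual benchmark is implemented in Rust); each entry of experiments is assumed to be a zero-argument callable that encodes/decodes one slice with one coder twice in a row and returns the runtime of the second run.

# Schematic rendering of the measurement protocol (the real benchmark is in Rust).
import random
import statistics

def benchmark(experiments, n_batches=10):
    """`experiments`: list of zero-argument callables; each one runs the
    encode/decode of one slice twice and returns the runtime of the second run."""
    runtimes = [[] for _ in experiments]
    for _ in range(n_batches):
        order = list(range(len(experiments)))
        random.shuffle(order)                   # random order within each batch
        for i in order:
            runtimes[i].append(experiments[i]())
    return [statistics.median(ts) for ts in runtimes]  # median over the batches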
Results 1: Bitrates. Table 1 shows aggregate empirical results over all slices of the benchmark
data; a bitrate overhead of 1 % in the second column would mean that the number of bits produced
by a coder when encoding the entire benchmark data is 1.01 times the total information content of
the benchmark data. We observe that arithmetic coding has the lowest bitrate overhead as expected,
but the “default” preset for both ANS and RC in constriction has negligible (< 0.1 %) overhead
too on this data set. Interestingly, setting precision slightly smaller than word_size is beneficial
for overall compression performance, suggesting a Benford’s Law contribution to the overhead (see
Section 3.3). Reducing the precision increases the overhead, and the third column of Table 1
reveals that the overhead in the low-precision regime is largely due to the approximation error
DKL (P || Q). Figure 5 breaks down the overhead for each of the 209 slices in the data set, plotting
them as a function of the entropy in each slice. We observe that the “default” preset (red plus markers)
consistently provides the best bitrates within each method over a wide range of entropies, which is
why constriction’s Rust and Python APIs guide users to use this preset for initial prototyping.
10
https://github.com/bamler-lab/understanding-ans
Table 1: Empirical compression performances and runtimes of Asymmetric Numeral Systems (ANS),
Range Coding (RC), and Arithmetic Coding (AC); ANS and RC use the proposed constriction
library, which admits arbitrary streaming configurations (see Section 4.1). Python bindings are
provided for both ANS and RC for the configurations labeled as “default” in this table, which both
have negligible (< 0.1 %) bitrate overhead while being considerably faster than arithmetic coding.
Method (precision / word_size / head_capacity) | Bitrate overhead over Eq. 1: total | of which due to DKL(P || Q) | Encoding runtime [ns per symbol] | Decoding runtime [ns per symbol]
ANS (24 / 32 / 64) (“default”) | 0.0015 % | 2.6×10^−6 % | 24.2 | 6.1
ANS (32 / 32 / 64) | 0.0593 % | < 10^−8 % | 24.2 | 6.9
ANS (16 / 16 / 32) | 0.2402 % | 0.1235 % | 19.8 | 6.4
ANS (12 / 16 / 32) (“small”) | 3.9567 % | 3.9561 % | 19.8 | 6.9
RC (24 / 32 / 64) (“default”) | 0.0237 % | 2.6×10^−6 % | 16.6 | 14.3
RC (32 / 32 / 64) | 1.6089 % | < 10^−8 % | 16.7 | 14.8
RC (16 / 16 / 32) | 3.4950 % | 0.1235 % | 16.9 | 9.4
RC (12 / 16 / 32) (“small”) | 4.5807 % | 3.9561 % | 16.8 | 9.4
Arithmetic Coding (AC) | 0.0004 % | n/a | 43.2 | 85.6
[Figure 5: two log-log panels, “Asymmetric Numeral Systems (ANS)” (left) and “Range Coding (RC) & Arithmetic Coding” (right); x-axis: entropy [bits per symbol], y-axis: overhead [bits per symbol]; markers for the configurations (precision / word_size / head_capacity) 24 / 32 / 64 (“default” preset), 32 / 32 / 64, 16 / 16 / 32, and 12 / 16 / 32 (“small” preset), plus Arithmetic Coding (AC; using the arcode crate); reference lines indicate 1 % and 0.1 % relative overhead.]
Figure 5: Empirical compression performances (bitrate overhead over Eq. 1) of various entropy
coders as a function of entropy per symbol in each slice of the benchmark data. The “default” presets
(red) for ANS and RC have a consistently low overhead over a wide range of entropies. The “small”
presets (blue) may be advantageous in memory or runtime-constrained situations (see also Figure 6).
[Figure 6: panels “Asymmetric Numeral Systems (ANS)” (left) and “Range Coding (RC) & Arithmetic Coding” (right), showing encoder runtime [ns per symbol] (top) and decoder runtime [ns per symbol] (bottom) against entropy [bits per symbol]; markers for the same configurations as in Figure 5, plus additional markers for decoding with tabulated entropy models in the 16 / 16 / 32 and 12 / 16 / 32 (“small” preset) configurations.]
Figure 6: Measured runtimes for encoding (top) and decoding (bottom) as a function of the entropy
per symbol. Models with low precision admit tabulating the mapping z_i ↦ x_i : z_i ∈ Z_i(x_i) for all
z_i ∈ {0, ..., 2^precision − 1}, which speeds up decoding in the high-entropy regime (gray markers).
Results 2: Runtimes. The last two columns in Table 1 list the runtime cost (in nanoseconds per
symbol) for encoding and decoding, averaged over the entire benchmark data. We observe that ANS
provides the fastest decoder while RC provides the fastest encoder. In the “default” preset, decoding
with ANS is more than twice as fast compared to RC. AC is much slower than both ANS and RC for
both encoding and decoding in our experiments. While the precise runtimes reported here should be
taken with a grain of salt since they depend on the specific implementation of each algorithm, one can
generally expect AC to be considerably slower than ANS and RC since it reads and writes compressed
data bit by bit, which is a poor fit for modern hardware. Indeed, Figure 6, which plots runtimes as a
function of entropy in the respective slice of the benchmark data, reveals that AC (orange crosses)
slows down in the high-entropy regime, i.e., when it has to read or write many bits per symbol.
While the encoder runtimes for ANS and RC (upper panels in Figure 6) depend only weakly on the
entropy, decoding with ANS and RC slows down in the high-entropy regime. The main computational
cost in this regime turns out to be the task of identifying the symbol x_i that satisfies z_i ∈ Z_i(x_i)
(corresponding to Lines 28-32 in Listing 7). For categorical entropy models like the ones used here,
constriction performs a binary search by default, which tends to require more iterations in models
with high entropy. The library therefore provides an alternative way to represent categorical entropy
models that tabulates the mapping z_i ↦ x_i for all z_i ∈ {0, ..., n − 1} upfront, leading to a constant
lookup time. Since the size of such lookup tables is proportional to n = 2^precision, they are only
viable for low precisions (gray crosses and Y-shaped markers in the bottom panels in Figure 6). We
observe that tabulated entropy models indeed speed up decoding in the high-entropy regime but they
come at a cost in the low entropy regime (likely because large lookup tables are not cache friendly).
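To illustrate the idea behind such tabulated models (this is a minimal sketch and not constriction's actual implementation), a decoder lookup table for a categorical model m can be precomputed as follows:

# Minimal sketch (not constriction's actual implementation): precompute the
# symbol for every quantile z in {0, ..., n - 1} so that decoding replaces the
# binary search by a single table lookup.
def build_decoder_table(m):
    """`m` is a list of counts that sum to n = 2**precision."""
    table = []
    for symbol, m_symbol in enumerate(m):
        table.extend([symbol] * m_symbol)  # symbol owns m_symbol quantiles
    return table

m = [3, 1, 4]                              # example model with n = 8 (precision = 3)
assert build_decoder_table(m) == [0, 0, 0, 1, 2, 2, 2, 2]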
Practical Advice. In summary, the Asymmetric Numeral Systems (ANS) algorithm reviewed in
this paper, as well as Range Coding (RC), both provide very close to optimal bitrates while being
considerably faster than Arithmetic Coding. While the precise bitrates and runtimes differ slightly
between ANS and RC, in practice, it is usually more important to pick an entropy coder that fits the
model architecture: RC operates as a queue (first-in-first-out), which simplifies compression with
autoregressive models, while ANS operates as a stack (last-in-first-out), which simplifies bits-back
coding with latent variable models. Both ANS and RC can be configured by parameters that trade
off compression performance against runtime and memory efficiency. Users of the constriction
library are advised to use the configuration from the “default” preset for initial prototyping (as guided
by the API), and to tune the configuration only once they have implemented a working prototype.
6 Conclusion
We provided an educational discussion of the Asymmetric Numeral Systems (ANS) entropy coder,
explaining the algorithm’s internals using concepts that are familiar to many statisticians and machine-
learning scientists. This allowed us to understand Asymmetric Numeral Systems as a generalization
of Positional Numeral Systems, and as an application of the bits-back trick. Splitting up ANS into the
three distinct steps of the bits-back trick allowed us to generalize the method by combining these three
steps in new ways. We hope that more idea transfer like this between the procedural (algorithmic)
and the declarative (modeling) subcommunities within the field of compression research will spark
novel ideas for compression methods in the future.
From a more practical perspective, we presented constriction, a new open-source software library
that provides a collection of effective and efficient entropy coders and adapters for defining complex
entropy models. The library is intended to simplify compression research (by providing a catalog of
several different entropy coders within a single consistent framework) and to speed up the transition
from research code to self-contained software products (by providing binary compatible entropy
coders for both Python and Rust). We showed empirically that the entropy coders in constriction
have very close to optimal compression performance while being much faster than Arithmetic Coding.
Acknowledgments
The author thanks Tim Zhenzhong Xiao for stimulating discussions and Stephan Mandt for important
feedback. This work was supported by the German Federal Ministry of Education and Research
(BMBF): Tübingen AI Center, FKZ: 01IS18039A. Robert Bamler is a member of the Machine
Learning Cluster of Excellence, funded by the Deutsche Forschungsgemeinschaft (DFG, German
Research Foundation) under Germany’s Excellence Strategy – EXC number 2064/1 – Project number
390727645. The author thanks the International Max Planck Research School for Intelligent Systems
(IMPRS-IS) for support.
References
Jarek Duda, Khalid Tahboub, Neeraj J Gadgil, and Edward J Delp. The use of asymmetric numeral
systems as an accurate replacement for Huffman coding. In 2015 Picture Coding Symposium (PCS),
pages 65–69. IEEE, 2015.
J Townsend, T Bird, and D Barber. Practical lossless compression with latent variables using bits
back coding. In 7th International Conference on Learning Representations, ICLR 2019, volume 7.
International Conference on Learning Representations (ICLR), 2019.
Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational im-
age compression with a scale hyperprior. In International Conference on Learning Representations,
2018.
David Minnen, Johannes Ballé, and George D Toderici. Joint autoregressive and hierarchical priors for
learned image compression. Advances in Neural Information Processing Systems, 31:10771–10780,
2018.
Yibo Yang, Robert Bamler, and Stephan Mandt. Improving inference for neural image compression.
Advances in Neural Information Processing Systems, 33, 2020a.
Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Balle, Sung Jin Hwang, and George
Toderici. Scale-space flow for end-to-end optimized video compression. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8503–8512, 2020.
Ruihan Yang, Yibo Yang, Joseph Marino, and Stephan Mandt. Insights from generative modeling for
neural video compression. arXiv preprint arXiv:2107.13136, 2021.
Friso Kingma, Pieter Abbeel, and Jonathan Ho. Bit-swap: Recursive bits-back coding for lossless
compression with hierarchical latent variables. In International Conference on Machine Learning,
pages 3408–3417. PMLR, 2019.
Lucas Theis and Jonathan Ho. Importance weighted compression. In Neural Compression: From
Information Theory to Applications–Workshop at ICLR 2021, 2021.
Chris S Wallace. Classification by minimum-message-length inference. In International Conference
on Computing and Information, pages 72–81. Springer, 1990.
Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the
description length of the weights. In Proceedings of the sixth annual conference on Computational
learning theory, pages 5–13, 1993.
Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical
journal, 27(3):379–423, 1948.
David A Huffman. A method for the construction of minimum-redundancy codes. Proceedings of
the IRE, 40(9):1098–1101, 1952.
David J. C. MacKay. Information theory, inference and learning algorithms.
Cambridge university press, 2003.
Peter Deutsch. RFC 1952: GZIP file format specification version 4.3, 1996.
Thomas Boutell and T Lane. PNG (Portable Network Graphics) specification version 1.0. Network
Working Group, pages 1–102, 1997.
Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression.
International Conference on Learning Representations, 2017.
Yibo Yang, Robert Bamler, and Stephan Mandt. Variational bayesian quantization. In International
Conference on Machine Learning, pages 10670–10680. PMLR, 2020b.
Jorma Rissanen and Glen G Langdon. Arithmetic coding. IBM Journal of research and development,
23(2):149–162, 1979.
Richard Clark Pasco. Source coding algorithms for fast data compression. PhD thesis, Stanford
University CA, 1976.
G Nigel N Martin. Range encoding: an algorithm for removing redundancy from a digitised message.
In Proc. Institution of Electronic and Radio Engineers International Conference on Video and
Data Recording, page 48, 1979.
Agner Fog. Instruction tables: Lists of instruction latencies, throughputs and micro-operation
breakdowns for Intel, AMD and VIA CPUs. Copenhagen University College of Engineering, 2021.
URL https://www.agner.org/optimize/instruction_tables.pdf. Accessed: 2021-11-
26.
Theodore P Hill. A statistical derivation of the significant-digit law. Statistical science, pages
354–363, 1995.
George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and
Michele Covell. Full resolution image compression with recurrent neural networks. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5306–5314, 2017.
Ruihan Yang, Yibo Yang, Joseph Marino, and Stephan Mandt. Hierarchical autoregressive modeling
for neural video compression. In International Conference on Learning Representations, 2020c.
Robert Bamler and Stephan Mandt. Dynamic word embeddings. In International conference on
Machine learning, pages 380–389. PMLR, 2017.
Chris Burgess. arcode (Rust crate). URL https://crates.io/crates/arcode.
Accessed: 2021-12-14.
A Appendix: Proof of Correctness of Streaming ANS
We prove that the AnsCoder from Listing 7 of the main text implements a correct encoder/decoder
pair, i.e., that its methods push and pop are inverse to each other. The proof uses that the AnsCoder
upholds the two invariants (i) and (ii) from Section 3.3 of the main text, which we also prove.
Assumptions and Problem Statement. We assume that the AnsCoder is used correctly: the
constructor (“__init__”) must be called with a positive integer precision and its optional argument
compressed, if given, must be a list of nonnegative integers that are all smaller than n = 2^precision.
Once the AnsCoder is constructed, its methods push and pop may be called in arbitrary order and
repetition to bring the AnsCoder into some original state S0 . In this initial setup phase that establishes
state S0 , the employed entropy models (argument m in both push and pop) may vary arbitrarily
across method calls and do not have to match between any of the push and pop calls, but they do
have to specify valid entropy models (i.e., each m must be a nonempty list of nonnegative integers
that sum to n). Further, the argument symbol of each push call must be a nonnegative integer that
is smaller than the length of the corresponding m, and it must satisfy m[symbol] ≠ 0 (ANS cannot
encode symbols with zero probability under the approximated entropy model Qi ). Once state S0 is
established, we consider two scenarios:
(a) we call push(symbol, m) with an arbitrary symbol ∈ {0, ..., len(m) − 1}, followed by pop(m)
with the same (valid) model m; or
(b) we call pop(m), assign the return value to a variable symbol, and then call push(symbol, m)
with the returned symbol and the same (valid) model m.
We show that, after executing either of the above two scenarios, the AnsCoder ends up back in
state S0 . For scenario (a), we further show that the final pop returns the symbol that was used in
the preceding push. In Subsection A.1, we prove these claims under the assumption that the two
invariants (i) and (ii) from Section 3.3 of the main text hold at entry of all methods. In Subsection A.2,
we show that AnsCoder indeed upholds these two invariants (for completeness, the invariants are:
(i) head < n² (always) and (ii) head ≥ n if bulk is not empty, where n = 2^precision).
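As a concrete (hypothetical) illustration of these claims, the following snippet exercises both scenarios on the AnsCoder from Listing 7 of the main text. The constructor arguments and the attribute names head and bulk are assumed to match that listing, and the model and the compressed data are made up.

# Hypothetical illustration of Scenarios (a) and (b); assumes the AnsCoder
# from Listing 7 with attributes `head` and `bulk` as described in the text.
coder = AnsCoder(4, [3, 11])        # precision = 4, made-up compressed words
m = [7, 3, 6]                       # valid model: counts sum to 2**4 == 16
state_before = (coder.head, list(coder.bulk))

# Scenario (a): push followed by pop with the same model.
coder.push(1, m)
assert coder.pop(m) == 1            # pop returns the symbol that was pushed ...
assert (coder.head, list(coder.bulk)) == state_before  # ... and restores the state.

# Scenario (b): pop followed by push with the same model.
symbol = coder.pop(m)
coder.push(symbol, m)
assert (coder.head, list(coder.bulk)) == state_before  # state restored again.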
To simplify the discussion, we conceptually split the method push into two parts,
conditional_flush (Lines 13-16 in Listing 7 of the main text) followed by push_onto_head
(Lines 20-22). Similarly, we conceptually split the method pop into pop_from_head
(Lines 26-33) followed by conditional_refill (Lines 37-39). The parts push_onto_head and
pop_from_head are analogous to the push and pop method, respectively, of the SlowAnsCoder
(Listing 6), and it is easy to see that they are inverse to each other (regardless of whether or not
invariants (i) and (ii) hold). The trickier part is to show that the conditional flushing and refilling
happen consistently, i.e., either none or both of them occur.
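For reference, the following sketch spells out this decomposition in code. It is a paraphrase reconstructed from the operations described in this appendix and in Section 3.3 of the main text, not a verbatim copy of Listing 7; attribute names and the simple configuration word_size = precision follow the conventions of Listings 7 and 10.

# Sketch of the AnsCoder with push/pop split into the four conceptual parts
# used in this proof (a paraphrase of Listing 7, not a verbatim copy).
class DecomposedAnsCoder:
    def __init__(self, precision, compressed=[]):
        self.precision = precision
        self.mask = (1 << precision) - 1          # == n - 1 where n = 2**precision
        self.bulk = compressed.copy()
        self.head = 0
        # Establish invariant (ii), as the constructor of Listing 7 does:
        while len(self.bulk) != 0 and (self.head >> precision) == 0:
            self.head = (self.head << precision) | self.bulk.pop()

    def push(self, symbol, m):
        # conditional_flush (Lines 13-16): make room on head if necessary.
        if (self.head >> self.precision) >= m[symbol]:
            self.bulk.append(self.head & self.mask)
            self.head >>= self.precision
        # push_onto_head (Lines 20-22): same update as in SlowAnsCoder.push.
        z = self.head % m[symbol] + sum(m[0:symbol])
        self.head = ((self.head // m[symbol]) << self.precision) | z

    def pop(self, m):
        # pop_from_head (Lines 26-33): same update as in SlowAnsCoder.pop.
        z = self.head & self.mask
        for symbol, m_symbol in enumerate(m):
            if z >= m_symbol:
                z -= m_symbol
            else:
                break
        self.head = (self.head >> self.precision) * m_symbol + z
        # conditional_refill (Lines 37-39): re-establish invariant (ii).
        if len(self.bulk) != 0 and (self.head >> self.precision) == 0:
            self.head = (self.head << self.precision) | self.bulk.pop()
        return symbol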
for violation of invariant (ii), so conditional_refill also becomes a no-op in this case, and we
trivially end up in state S0. If, on the other hand, the condition for conditional_flush is satisfied,
then it is easy to see that S′ violates invariant (ii): Line 15 ensures that bulk is not empty and Line 16
performs a right bit-shift of head by precision, which sets head_S′ = ⌊head_S0 / 2^precision⌋ (where
subscripts denote the coder state in which we evaluate a variable). Since S0 satisfies invariant (i)
by assumption, we have head_S0 < 2^(2×precision), and so head_S′ < 2^precision, which (temporarily)
violates invariant (ii). This means that the condition for conditional_refill is satisfied, and it
is easy to see that Line 39 inverts Lines 15-16. Thus, Scenario (a) restores the AnsCoder into its
original state S0.
Scenario (b): Calling pop before push leads to the following state transition:

S0 --pop_from_head--> S1 --conditional_refill--> S2 --conditional_flush--> S3 --push_onto_head--> S4.   (A1)
We show that state S4 = S0 . Since push_onto_head is the inverse of pop_from_head, it suffices
to show that S3 = S1 . Note that we can assume the invariants only about S0 . The intermediate
state S1 may violate invariant (ii), but it is easy to see that it satisfies invariant (i): Lines 26-33 set
head_S1 = ⌊head_S0 / n⌋ × m_i(x_i) + z_i′   (A2)

with z_i′ = z_i − Σ_{x_i′ < x_i} m_i(x_i′), where z_i = (head_S0 mod n) is the value that gets initialized
on Line 26. Since Lines 28-32 find the symbol x_i that satisfies z_i ∈ Z_i(x_i), we have
z_i < Σ_{x_i′ ≤ x_i} m_i(x_i′) (see Eq. 6 of the main text) and thus z_i′ < m_i(x_i). Further, we have
head_S0 < n² by invariant (i). Thus, we find the upper bound

head_S1 ≤ ⌊(n² − 1) / n⌋ × m_i(x_i) + (m_i(x_i) − 1)
        = (n − 1) m_i(x_i) + m_i(x_i) − 1
        = n m_i(x_i) − 1.   (A3)

Since the m_i(x_i′) are nonnegative for all x_i′ and sum to n, we have m_i(x_i) ≤ n, and so Eq. A3
implies head_S1 < n², and thus S1 satisfies invariant (i). We now distinguish two cases depending
on whether S1 satisfies invariant (ii). If S1 satisfies invariant (ii), then conditional_refill is a
no-op and S2 = S1. In the next step, conditional_flush checks whether ⌊head_S1 / n⌋ ≥ m_i(x_i)
on Line 13, which cannot be the case due to Eq. A3, and thus conditional_flush is also a no-op
and we have S3 = S2 = S1. If, on the other hand, S1 does not satisfy invariant (ii), then bulk was
not empty at the entry of the method pop, and thus the assumption that the original state S0 satisfies
invariant (ii) implies head_S0 ≥ n. Line 39 then sets head to the new value head_S2 = head_S1 × n + b
with some b ≥ 0. In the next step, conditional_flush checks whether ⌊head_S2 / n⌋ ≥ m_i(x_i),
which is now indeed the case since ⌊head_S2 / n⌋ ≥ ⌊head_S1 × n / n⌋ = head_S1, which is at least
m_i(x_i) according to Eq. A2 since head_S0 ≥ n in this case. The flushing operation on Lines 15-16
is thus executed and it inverts the refilling operation from Line 39, and we again have S3 = S1 , as
claimed. Thus, Scenario (b) also restores the AnsCoder into its original state S0 .
The above proof of correctness relied on the assumption that invariants (i) and (ii) from Section 3.3 of
the main text always hold at the entry of methods push and pop. We now show that this assumption
is justified by proving that the constructor initializes an AnsCoder into a state that satisfies both
invariants, and that all methods uphold the invariants (i.e., both invariants hold on method exit
provided that they hold on method entry).
Constructor. It is easy to see that both invariants are satisfied when the constructor finishes: the
loop condition on Line 8 of Listing 7 checks if invariant (ii) is violated, i.e., if bulk is not empty and
head < n (the latter is equivalent to head >> precision == 0, where >> denotes a right bit-shift, i.e.,
integer division by n = 2^precision). The loop thus only terminates once invariant (ii) is satisfied
(it is guaranteed to terminate since the loop body on Line 9 pops a word from bulk, and the loop
terminates at the latest once bulk is empty). Invariant (i) is satisfied throughout the constructor since
head is initialized as zero (which is smaller than n²) and the only statement that mutates head in the
constructor is Line 9. This statement is only executed if head has at most precision valid bits, and
it increases the number of valid bits by at most precision (due to the left bit-shift <<), so it can never
lead to more than 2 × precision valid bits, which would be necessary to violate invariant (i).
Encoding. We now analyze the method push in Listing 7 of the main text. Assuming that both
invariants hold at method entry, we denote by h and h′ the value of head at method entry and exit,
respectively, and we distinguish two cases depending on whether or not the if-condition on Line 13
is met.
Decoding. We finally analyze the method pop in Listing 7 of the main text. We show that
calling pop(m) upholds invariants (i) and (ii) (regardless of whether or not it follows a call of
push(symbol, m) with the same m). We denote the value of head at entry of the method pop by the
capital letter H so that there is no confusion with the discussion of push. Once the program flow
reaches Line 37, head takes the (intermediate) value

H̃ = ⌊H / n⌋ × m_i(x_i) + z_i′   (A11)

where z_i′ ∈ {0, ..., m_i(x_i) − 1} and x_i is the decoded symbol, which implies that Z_i(x_i) is not empty
and therefore m_i(x_i) = |Z_i(x_i)| ≥ 1. It is easy to see that H̃ < n², i.e., H̃ satisfies invariant (i),
since the largest we can make H̃ in Eq. A11 is by setting H = n² − 1 (the largest value allowed by
invariant (i)), m_i(x_i) = n, and z_i′ = n − 1. This leads to the upper bound

H̃ ≤ ⌊(n² − 1) / n⌋ × n + (n − 1) = (n − 1) × n + (n − 1) = n² − 1 < n².   (A12)
The if-condition on Line 37 checks if invariant (ii) is violated. We again distinguish two cases:
• If invariant (ii) holds on Line 37 then the method exits, and thus both invariants hold on
method exit.
• If invariant (ii) does not hold on Line 37 then the if-branch is taken and Line 39 mutates
head into the final state

  H′ = H̃ × n + b   (A13)

  where b ∈ {0, ..., n − 1} is the word of data that was transferred from bulk. We can see
  directly that H′ satisfies invariant (i) since, as H̃ violates invariant (ii), we have H̃ ≤ n − 1
  and therefore H′ ≤ (n − 1) × n + b = n² + (b − n) < n² since b < n. Regarding
  invariant (ii), we first note that invariant (ii) can only be violated on Line 37 if bulk at
  method entry is nonempty. Thus, the assumption that invariant (ii) is satisfied at method
  entry implies that H ≥ n. This allows us to see that H̃ ≠ 0 since both ⌊H / n⌋ and m_i(x_i)
  are nonzero in Eq. A11. Inserting into Eq. A13 leads to H′ ≥ n, i.e., invariant (ii) is also
  satisfied at method exit.
In summary, we proved that the AnsCoder from Listing 7 of the main text upholds the two invariants
stated in Section 3.3 of the main text. Using this property, we proved in Section A.1 that the AnsCoder
is a correct encoder/decoder, i.e., that pop is both the left- and the right-inverse of push.