Practical Implementations of Arithmetic Coding
Paul G. Howard and Jeffrey Scott Vitter
Brown University
Department of Computer Science
Technical Report No. 92-18
Revised version, April 1992
(Formerly Technical Report No. CS-91-45)
Appears in Image and Text Compression,
James A. Storer, ed., Kluwer Academic Publishers, Norwell, MA, 1992, pages 85-112.
A shortened version appears in the proceedings of the
International Conference on Advances in Communication and Control (COMCON 3),
Victoria, British Columbia, Canada, October 16-18, 1991.
Practical Implementations of
Arithmetic Coding
Abstract
We provide a tutorial on arithmetic coding, showing how it provides nearly
optimal data compression and how it can be matched with almost any probabilistic
model. We indicate the main disadvantage of arithmetic coding, its slowness, and
give the basis of a fast, space-efficient, approximate arithmetic coder with only
minimal loss of compression efficiency. Our coder is based on the replacement of
arithmetic by table lookups coupled with a new deterministic probability
estimation scheme.
Index terms : Data compression, arithmetic coding, adaptive modeling, analysis
of algorithms, data structures, low precision arithmetic.
1 A similar version of this paper appears in Image and Text Compression, James A. Storer, ed.,
Kluwer Academic Publishers, Norwell, MA, 1992, 85-112. A shortened version of this paper appears
in the proceedings of the International Conference on Advances in Communication and Control
(COMCON 3), Victoria, British Columbia, Canada, October 16-18, 1991.
2 Support was provided in part by NASA Graduate Student Researchers Program grant NGT-50420
and by a National Science Foundation Presidential Young Investigators Award grant with
matching funds from IBM. Additional support was provided by a Universities Space Research
Association/CESDIS associate membership.
3 Support was provided in part by National Science Foundation Presidential Young Investigator
Award CCR-9047466 with matching funds from IBM, by NSF research grant CCR-9007851, by
Army Research Office grant DAAL03-91-G-0035, and by the Office of Naval Research and the
Defense Advanced Research Projects Agency under contract N00014-91-J-4052, ARPA Order No. 8225.
Additional support was provided by a Universities Space Research Association/CESDIS associate
membership.
used for coding. (This reduces the coding efficiency of those methods by narrowing
the range of possible models.) Much of the current research in arithmetic coding
concerns finding approximations that increase coding speed without compromising
compression efficiency. The most common method is to use an approximation to
the multiplication operation [10,27,29,43]; in this paper we present an alternative
approach using table lookups and approximate probability estimation.
Another disadvantage of arithmetic coding is that it does not in general produce a
prefix code. This precludes parallel coding with multiple processors. In addition, the
potentially unbounded output delay makes real-time coding problematical in critical
applications, but in practice the delay seldom exceeds a few symbols, so this is not a
major problem. A minor disadvantage is the need to indicate the end of the file.
One final minor problem is that arithmetic codes have poor error resistance,
especially when used with adaptive models [5]. A single bit error in the encoded file
causes the decoder's internal state to be in error, making the remainder of the decoded
file wrong. In fact this is a drawback of all adaptive codes, including Ziv-Lempel codes
and adaptive Huffman codes [12,15,18,26,55,56]. In practice, the poor error resistance
of adaptive coding is unimportant, since we can simply apply appropriate error
correction coding to the encoded file. More complicated solutions appear in [5,20], in
which errors are made easy to detect, and upon detection of an error, bits are changed
until no errors are detected.
Overview of this paper. In Section 2 we give a tutorial on arithmetic coding.
We include an introduction to modeling for text compression. We also restate several
important theorems from [22] relating to the optimality of arithmetic coding in theory
and in practice.
In Section 3 we present some of our current research into practical ways of improving
the speed of arithmetic coding without sacrificing much compression efficiency.
The center of this research is a reduced-precision arithmetic coder, supported by
efficient data structures for text modeling.
Figure 1: Subdivision of the current interval based on the probability of the input
symbol a_i that occurs next.
1. We begin with a "current interval" [L, H) initialized to [0, 1).
2. For each symbol of the file, we perform two steps (see Figure 1):
(a) We subdivide the current interval into subintervals, one for each possible
alphabet symbol. The size of a symbol's subinterval is proportional to the
estimated probability that the symbol will be the next symbol in the file,
according to the model of the input.
(b) We select the subinterval corresponding to the symbol that actually occurs
next in the file, and make it the new current interval.
3. We output enough bits to distinguish the final current interval from all other
possible final intervals.
The length of the final subinterval is clearly equal to the product of the probabilities
of the individual symbols, which is the probability p of the particular sequence of
symbols in the file. The final step uses almost exactly −lg p bits to distinguish the
file from all other possible files. We need some mechanism to indicate the end of the
file, either a special end-of-file symbol coded just once, or some external indication of
the file's length.
In step 2, we need to compute only the subinterval corresponding to the symbol a_i
that actually occurs. To do this we need two cumulative probabilities,
P_C = Σ_{k=1}^{i−1} p_k and P_N = Σ_{k=1}^{i} p_k. The new subinterval is
[L + P_C(H − L), L + P_N(H − L)). The need to maintain and supply cumulative
probabilities requires the model to have a complicated data structure; Moffat [35]
investigates this problem, and concludes for a multi-symbol alphabet that binary
search trees are about twice as fast as move-to-front lists.
Example 1: We illustrate a non-adaptive code, encoding the file containing the
symbols bbb using arbitrary fixed probability estimates p_a = 0.4, p_b = 0.5, and
p_EOF = 0.1. Encoding proceeds as follows:
  Current interval   Action      a                 b                 EOF               Input
  [0.000, 1.000)     Subdivide   [0.000, 0.400)    [0.400, 0.900)    [0.900, 1.000)    b
  [0.400, 0.900)     Subdivide   [0.400, 0.600)    [0.600, 0.850)    [0.850, 0.900)    b
  [0.600, 0.850)     Subdivide   [0.600, 0.700)    [0.700, 0.825)    [0.825, 0.850)    b
  [0.700, 0.825)     Subdivide   [0.700, 0.750)    [0.750, 0.812)    [0.812, 0.825)    EOF
  [0.812, 0.825)

The final interval (without rounding) is [0.8125, 0.825), which in binary is
approximately [0.11010 00000, 0.11010 01100). We can uniquely identify this interval
by outputting 1101000. According to the fixed model, the probability p of this
particular file is (0.5)^3 (0.1) = 0.0125 (exactly the size of the final interval) and the
code length (in bits) should be −lg p = 6.322. In practice we have to output 7 bits. □
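To make step 2 concrete, the following minimal sketch (Python, using exact rational
arithmetic; the symbol ordering, probabilities, and input are those of Example 1, and
all names are ours) narrows the current interval once per symbol and reproduces the
final interval [0.8125, 0.825).

    from fractions import Fraction

    def encode_interval(symbols, probs, order):
        """Step 2 of the algorithm: narrow [low, high) once per input symbol."""
        low, high = Fraction(0), Fraction(1)
        for s in symbols:
            width = high - low
            below = order[:order.index(s)]                     # symbols ordered before s
            p_c = sum((probs[x] for x in below), Fraction(0))  # cumulative probability P_C
            p_n = p_c + probs[s]                               # cumulative probability P_N
            low, high = low + p_c * width, low + p_n * width
        return low, high

    probs = {'a': Fraction(2, 5), 'b': Fraction(1, 2), 'EOF': Fraction(1, 10)}
    low, high = encode_interval(['b', 'b', 'b', 'EOF'], probs, ['a', 'b', 'EOF'])
    print(low, high)    # 13/16 33/40, i.e. the final interval [0.8125, 0.825)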
The idea of arithmetic coding originated with Shannon in his seminal 1948 paper
on information theory [54]. It was rediscovered by Elias about 15 years later, as
briefly mentioned in [1].
Implementation details. The basic implementation of arithmetic coding described
above has two major difficulties: the shrinking current interval requires the use of
high precision arithmetic, and no output is produced until the entire file has been
read. The most straightforward solution to both of these problems is to output each
leading bit as soon as it is known, and then to double the length of the current
interval so that it reflects only the unknown part of the final interval. Witten,
Neal, and Cleary [58] add a clever mechanism for preventing the current interval from
shrinking too much when the endpoints are close to 1/2 but straddle 1/2. In that
case we do not yet know the next output bit, but we do know that whatever it is, the
following bit will have the opposite value; we merely keep track of that fact, and
expand the current interval symmetrically about 1/2. This follow-on procedure may be
repeated any number of times, so the current interval size is always longer than 1/4.
Mechanisms for incremental transmission and fixed precision arithmetic have been
developed through the years by Pasco [40], Rissanen [48], Rubin [52], Rissanen and
Langdon [49], Guazzo [19], and Witten, Neal, and Cleary [58]. The bit-stuffing idea
of Langdon and others at IBM that limits the propagation of carries in the additions
is roughly equivalent to the follow-on procedure described above.
We now describe in detail how the coding and interval expansion work. This
process takes place immediately after the selection of the subinterval corresponding
to an input symbol.
We repeat the following steps (illustrated schematically in Figure 2, and sketched in
code after the list) as many times as possible:
a. If the new subinterval is not entirely within one of the intervals [0, 1/2),
[1/4, 3/4), or [1/2, 1), we stop iterating and return.
Figure 2: Interval expansion process. (a) No expansion. (b) Interval in [0, 1/2).
(c) Interval in [1/2, 1). (d) Interval in [1/4, 3/4) (follow-on case).
b. If the new subinterval lies entirely within [0, 1/2), we output 0 and any 1s left
over from previous symbols; then we double the size of the interval [0, 1/2),
expanding toward the right.
c. If the new subinterval lies entirely within [1/2, 1), we output 1 and any 0s left
over from previous symbols; then we double the size of the interval [1/2, 1),
expanding toward the left.
d. If the new subinterval lies entirely within [1/4, 3/4), we keep track of this fact
for future output; then we double the size of the interval [1/4, 3/4), expanding in
both directions away from the midpoint.
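A minimal sketch of steps a-d (Python; endpoints are exact fractions, `bits` is the
output list and `follow` the count of pending follow-on bits, all names ours). Running
it on the intervals of Example 2 below reproduces the outputs and expansions shown
there.

    from fractions import Fraction

    HALF, QUARTER, THREE_QUARTERS = Fraction(1, 2), Fraction(1, 4), Fraction(3, 4)

    def emit(bits, b, follow):
        """Output bit b followed by `follow` bits of the opposite value (steps b, c)."""
        bits.append(b)
        bits.extend([1 - b] * follow)

    def expand(low, high, bits, follow):
        """Repeat steps a-d as long as the interval lies in one of the three ranges."""
        while True:
            if high <= HALF:                                    # step b: interval in [0, 1/2)
                emit(bits, 0, follow); follow = 0
                low, high = 2 * low, 2 * high
            elif low >= HALF:                                   # step c: interval in [1/2, 1)
                emit(bits, 1, follow); follow = 0
                low, high = 2 * low - 1, 2 * high - 1
            elif QUARTER <= low and high <= THREE_QUARTERS:     # step d: follow-on case
                follow += 1
                low, high = 2 * low - HALF, 2 * high - HALF
            else:                                               # step a: no expansion possible
                return low, high, follow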
Example 2: We encode the same file as in Example 1, this time performing the
interval expansion and incremental output described above:

  Current interval   Action            a                 b                 EOF               Input
  [0.00, 1.00)       Subdivide         [0.00, 0.40)      [0.40, 0.90)      [0.90, 1.00)      b
  [0.40, 0.90)       Subdivide         [0.40, 0.60)      [0.60, 0.85)      [0.85, 0.90)      b
  [0.60, 0.85)       Output 1
                     Expand [1/2, 1)
  [0.20, 0.70)       Subdivide         [0.20, 0.40)      [0.40, 0.65)      [0.65, 0.70)      b
  [0.40, 0.65)       follow
                     Expand [1/4, 3/4)
  [0.30, 0.80)       Subdivide         [0.30, 0.50)      [0.50, 0.75)      [0.75, 0.80)      EOF
  [0.75, 0.80)       Output 10
                     Expand [1/2, 1)
  [0.50, 0.60)       Output 1
                     Expand [1/2, 1)
  [0.00, 0.20)       Output 0
                     Expand [0, 1/2)
  [0.00, 0.40)       Output 0
                     Expand [0, 1/2)
  [0.00, 0.80)       Output 0

The "follow" output in the sixth line indicates the follow-on procedure: we keep
track of our knowledge that the next output bit will be followed by its opposite; this
"opposite" bit is the 0 output in the ninth line. The encoded file is 1101000, as
before. □
Clearly the current interval contains some information about the preceding inputs;
this information has not yet been output, so we can think of it as the coder's state. If
a is the length of the current interval, the state holds −lg a bits not yet output. In the
basic method (illustrated by Example 1) the state contains all the information about
the output, since nothing is output until the end. In the implementation illustrated
by Example 2, the state always contains fewer than two bits of output information,
since the length of the current interval is always more than 1/4. The final state in
Example 2 is [0, 0.8), which contains −lg 0.8 ≈ 0.322 bits of information.
Use of integer arithmetic. In practice, the arithmetic can be done by storing
the current interval in sufficiently long integers rather than in floating point or exact
rational numbers. (We can think of Example 2 as using the integer interval [0, 100)
by omitting all the decimal points.) We also use integers for the frequency counts
used to estimate symbol probabilities. The subdivision process involves selecting
non-overlapping intervals (of length at least 1) with lengths approximately proportional
to the counts. To encode symbol a_i we need two cumulative counts,
C = Σ_{k=1}^{i−1} c_k and N = Σ_{k=1}^{i} c_k, and the sum T of all counts,
T = Σ_{k=1}^{n} c_k. (Here and elsewhere we denote the alphabet size by n.)
The new subinterval is [L + ⌊C(H − L)/T⌋, L + ⌊N(H − L)/T⌋).
(In this discussion we continue to use half-open intervals as in the real arithmetic case.
In implementations [58] it is more convenient to subtract 1 from the right endpoints
and use closed intervals. Moffat [36] considers the calculation of cumulative frequency
counts for large alphabets.)
Example 3: Suppose that at a certain point in the encoding we have symbol counts
c_a = 4, c_b = 5, and c_EOF = 1 and current interval [25, 89) from the full interval
[0, 128). Let the next input symbol be b. The cumulative counts for b are C = 4 and
N = 9, and T = 10, so the new interval is
[25 + ⌊4(89 − 25)/10⌋, 25 + ⌊9(89 − 25)/10⌋) = [50, 82); we then
increment the follow-on count and expand the interval once about the midpoint 64,
giving [36, 100). It is possible to maintain higher precision, truncating (and adjusting
to avoid overlapping subintervals) only when the expansion process is complete; this
makes it possible to prove a tight analytical bound on the lost compression caused by
the use of integer arithmetic, as we do in [22], restated as Theorem 1 below. In practice
this refinement makes the coding more difficult without improving compression. □
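A minimal sketch of the integer subdivision and one expansion step (Python; the
quarter points of the full interval [0, N) are N/4, N/2, and 3N/4, and the function
names are ours). Run on the counts of Example 3 it reproduces [50, 82) and then
[36, 100).

    def subdivide_int(low, high, C, N_cum, T):
        """Integer subdivision: C and N_cum are the cumulative counts below and
        through the coded symbol, T is the total count, as in the text."""
        width = high - low
        return low + C * width // T, low + N_cum * width // T

    def expand_once(low, high, N):
        """One expansion step for the integer interval [low, high) within [0, N)."""
        if high <= N // 2:                                 # in [0, N/2): output 0
            return 2 * low, 2 * high, '0'
        if low >= N // 2:                                  # in [N/2, N): output 1
            return 2 * low - N, 2 * high - N, '1'
        if low >= N // 4 and high <= 3 * N // 4:           # follow-on case
            return 2 * low - N // 2, 2 * high - N // 2, 'follow'
        return low, high, None                             # no expansion possible

    low, high = subdivide_int(25, 89, C=4, N_cum=9, T=10)  # -> (50, 82)
    print(low, high)
    print(expand_once(low, high, N=128))                   # -> (36, 100, 'follow')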
Analysis. In [22] we prove a number of theorems about the code lengths of files
coded with arithmetic coding. Most of the results involve the use of arithmetic coding
in conjunction with various models of the input; these will be discussed in Section 2.3.
Here we note two results that apply to implementations of the arithmetic coder. The
first shows that using integer arithmetic has negligible effect on code length.
Theorem 1 If we use integers from the range [0, N) and use the high precision
algorithm for scaling up the subrange, the code length is provably bounded by
4/(N ln 2) bits per input symbol more than the ideal code length for the file.
For a typical value N = 65,536, the excess code length is less than 10^{−4} bit per
input symbol.
The second result shows that if we indicate end-of-file by encoding a special symbol
just once for the entire file, the additional code length is negligible.
Theorem 2 The use of a special end-of-file symbol when coding a file of length t
using integers from the range [0, N) results in additional code length of less than
8t/(N ln 2) + lg N + 7 bits.
Again the extra code length is negligible, less than 0.01 bit per input symbol for
a typical 100,000 byte file.
Since we seldom know the exact probabilities of the process that generated an
input file, we would like to know how errors in the estimated probabilities affect the
code length. We can estimate the extra code length by a straightforward asymptotic
analysis. The average code length L for symbols produced by a given model in a
given state is given by

    L = − Σ_{i=1}^{n} p_i lg q_i,
where p_i is the actual probability of the ith alphabet symbol and q_i is its estimated
probability. The optimal average code length for symbols in the state is the entropy
of the state, given by

    H = − Σ_{i=1}^{n} p_i lg p_i.
The excess code length is E = L − H; if we let d_i = q_i − p_i and expand
asymptotically in d, we obtain

    E = Σ_{i=1}^{n} ( (1 / (2 ln 2)) (d_i^2 / p_i) + O(d_i^3 / p_i^2) ).     (1)

(This corrects a similar derivation in [5], in which the factor of 1/ln 2 is omitted.)
The vanishing of the linear terms means that small errors in the probabilities used
by the coder lead to very small increases in code length. Because of this property,
any coding method that uses approximately correct probabilities will achieve a code
length close to the entropy of the underlying source. We use this fact in Section 3.1
to design a class of fast approximate arithmetic coders with small compression loss.
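As a quick numerical check of Equation (1), the sketch below (Python; the probability
vectors are arbitrary illustrative values) compares the exact excess code length with
the quadratic term of the expansion.

    from math import log2, log

    def excess(p, q):
        """Exact excess code length E = -sum p_i lg q_i + sum p_i lg p_i (bits/symbol)."""
        return sum(pi * (log2(pi) - log2(qi)) for pi, qi in zip(p, q) if pi > 0)

    def excess_quadratic(p, q):
        """Leading term of Equation (1): sum d_i^2 / (2 ln 2 * p_i)."""
        return sum((qi - pi) ** 2 / (2 * log(2) * pi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.40, 0.50, 0.10]    # actual probabilities
    q = [0.38, 0.51, 0.11]    # slightly wrong estimates
    print(excess(p, q), excess_quadratic(p, q))   # both about 0.0016 bit per symbol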
of the reduced accuracy of the model. On the other hand, it introduces a locality of
reference (or recency) effect, which often improves compression. We now discuss and
quantify the locality effect.
In most text files we find that most of the occurrences of at least some words are
clustered in one part of the file. We can take advantage of this locality by assigning
more weight to recent occurrences of a symbol in an adaptive model. In practice there
are several ways to do this:
- Periodically restarting the model. This often discards too much information
to be effective, although Cormack and Horspool find that it gives good results
when growing large dynamic Markov models [11].
- Using a sliding window on the text [26]. This requires excessive computational
resources.
- Recency rank coding [7,13,53]. This is simple but corresponds to a rather coarse
model of recency.
- Exponential aging (giving exponentially increasing weights to successive symbols)
[12,38]. This is moderately difficult to implement because of the changing
weight increments, although our probability estimation method in Section 3.4
uses an approximate form of this technique.
- Periodic scaling [58]. This is simple to implement, fast and effective in operation,
and amenable to analysis. It also has the computationally desirable property
of keeping the symbol weights small. In effect, scaling is a practical version of
exponential aging (a sketch of count scaling follows this list).
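A minimal sketch of periodic scaling as used in the analysis below: counts are halved
and rounded up whenever their sum reaches 2B (Python; the block size B, the count
dictionary, and the sample input are illustrative).

    def update_count(counts, symbol, B):
        """Increment the symbol's count; when the total reaches 2B, halve all
        counts, rounding up, so roughly the last B symbols dominate the estimates."""
        counts[symbol] = counts.get(symbol, 0) + 1
        if sum(counts.values()) >= 2 * B:
            for s in counts:
                counts[s] = (counts[s] + 1) // 2     # halve and round up
        return counts

    counts = {}
    for ch in "abracadabra":
        update_count(counts, ch, B=4)
    print(counts)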
Analysis of scaling. In [22] we give a precise characterization of the effect of scaling
on code length, in terms of an elegant notion we introduce called weighted entropy.
The weighted entropy of a file at the end of the mth block, denoted by H_m, is the
entropy implied by the probability distribution at that time, computed according to
the scaling model described above.
We prove the following theorem for a file compressed using arithmetic coding and
a zero-order adaptive model with scaling. All counts are halved and rounded up when
the sum of the counts reaches 2B; in effect, we divide the file into b blocks of length B.
Theorem 4 Let L be the compressed length of a file. Then we have

    B ( Σ_{m=1}^{b} H_m + (H_b − H_0) − t(k/B) )
        < L <
    B ( Σ_{m=1}^{b} H_m + (H_b − H_0) + t ( (k/B) lg(B/k_min) + O(k^2/B^2) ) ).
Table 1: PPM escape probabilities (p_esc) and symbol probabilities (p_i). The number
of symbols that have occurred j times is denoted by n_j.

           PPMA         PPMB           PPMC         PPMP                      PPMX
  p_esc    1/(t+1)      k/t            k/(t+k)      n_1/t − n_2/t^2 + ...     n_1/t
  p_i      c_i/(t+1)    (c_i − 1)/t    c_i/(t+k)
file is approximately a Poisson process. See Table 1 for formulas for the probabilities
used by the different methods, and see [5] or [6] for a detailed description of the PPM
method. In Section 3.5 we indicate two methods that provide improved estimation
of the escape probability.
contexts with only a small loss of compression efficiency. All these components can
be combined into a fast, space-efficient text coder.
The value of the cutoff probability in state [0, 4) is clearly between 1/2 and 3/4. If
this were an exact coder, the subintervals of length 3 would correspond to
−lg(3/4) ≈ 0.415 bits of output information stored in the state, and we would choose
the cutoff to be 1/lg 3 ≈ 0.631 to minimize the extra code length. But because of the
approximate arithmetic, the optimal cutoff value depends on the distribution of
Prob{0}; if Prob{0} is uniformly distributed on (0, 1), we find analytically that the
excess code length is minimized when the cutoff is (15 − √97)/8 ≈ 0.644. Fortunately,
the amount of excess code length is not very sensitive to the cutoff value; in the
uniform distribution case any value from about 0.55 to 0.73 gives less than one
percent extra code length. □
Arithmetic coding does not mandate any particular assignment of subintervals to
input symbols; all that is required is that subinterval lengths be proportional to sym-
bol probabilities and that the decoder make the same assignment as the encoder. In
Example 4 we uniformly assigned the left subinterval to symbol 0. By preventing the
longer subinterval from straddling the midpoint whenever possible, we can sometimes
obtain a simpler coder that never requires the follow-on procedure; it may also use
fewer states.
Example 5: This coder assigns the right subinterval to 0 in lines 4 and 7 of Example 4,
eliminating the need for the follow-on procedure; otherwise it is the same as
Example 4. (Here "cutoff" denotes the cutoff probability discussed above.)

                                        0 input               1 input
  State     Range of Prob{0}            Output  Next state    Output  Next state
  [0, 4)    0 < p < 1 − cutoff          00      [0, 4)        -       [1, 4)
            1 − cutoff ≤ p ≤ cutoff     0       [0, 4)        1       [0, 4)
            cutoff < p < 1              -       [0, 3)        11      [0, 4)
  [0, 3)    0 < p < 1/2                 10      [0, 4)        0       [0, 4)
            1/2 ≤ p < 1                 0       [0, 4)        10      [0, 4)
  [1, 4)    0 < p < 1/2                 01      [0, 4)        1       [0, 4)
            1/2 ≤ p < 1                 1       [0, 4)        01      [0, 4)

□
Langdon and Rissanen [29] suggest identifying the symbols as the more probable
symbol (MPS) and less probable symbol (LPS) rather than as 1 and 0. By doing this
we can often combine transitions and eliminate states.
Example 6 : We modify Example 5 to use the MPS/LPS idea. We are able to reduce
the coder to just two states.
This coder is easily programmed and extremely fast. Its only shortcoming is that
on average high-probability symbols require 1/4 bit (corresponding to
Prob{MPS} = 2^{−1/4} ≈ 0.841) no matter how high the actual probability is. □
Design of a class of reduced-precision coders. We now present a very flexible
yet simple coder design incorporating most of the features just discussed. We
choose N to be any power of 2. All states in the coder are of the form [k, N), so the
number of states is only N/2. (Intervals with k ≥ N/2 will produce output, and the
interval will be expanded.) In every state [k, N) we include the maximally unbalanced
subdivision (at k + 1), which corresponds to values of Prob{MPS} between (N − 2)/N
and (N − 1)/N. We include a nearly balanced subdivision so that we will not lose
efficiency when Prob{MPS} ≈ 1/2. In addition, we locate other subdivision points
such that the subinterval expansion that follows each input symbol leaves the coder
in a state of the form [k, N), and we choose one or more of them to correspond to
intermediate values of Prob{MPS}. For simplicity we denote state [k, N) by k.
We always allow the interval [k, N) to be divided at k + 1; if the LPS occurs we
output the lg N bits of k and move to state 0, while if the MPS occurs we simply
move to state k + 1; then if the new state is N/2 we output a 1 and move to state 0.
The other permitted subdivisions are given in the following table (a code sketch of
the maximally unbalanced case follows the table). In some cases additional output
and expansion may be possible. It may not be necessary to include all subdivisions
in the coder.
  Range of       Subdivision                  LPS input              MPS input
  states k       LPS           MPS            Output   Next state    Output   Next state
  [0, N/2)       [k, N/2)      [N/2, N)       0        2k            1        0
  [0, N/4)       [k, N/4)      [N/4, N)       00       4k            -        N/4
  [N/8, N/4)     [k, 3N/8)     [3N/8, N)      0f       4k − N/2      -        3N/8
  [N/4, 3N/8)    [k, 3N/8)     [3N/8, N)      010      8k − 2N       -        3N/8
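As an illustration, here is a minimal sketch (Python; the input is assumed to be
already classified into MPS/LPS decisions, and the function name is ours) of the
maximally unbalanced subdivision rule described above; the table's other subdivisions
could be added as further cases.

    def encode_unbalanced(decisions, N):
        """Code a sequence of 'MPS'/'LPS' decisions using only the subdivision at k+1.
        The state is k, representing the current interval [k, N); N is a power of 2."""
        lgN = N.bit_length() - 1
        bits = []
        k = 0
        for d in decisions:
            if d == 'LPS':                      # LPS subinterval is [k, k+1)
                bits.extend(int(b) for b in format(k, '0%db' % lgN))   # the lg N bits of k
                k = 0
            else:                               # MPS subinterval is [k+1, N)
                k += 1
                if k == N // 2:                 # interval [N/2, N): output 1 and expand
                    bits.append(1)
                    k = 0
        return bits, k

    bits, k = encode_unbalanced(['MPS', 'MPS', 'LPS', 'MPS'], N=8)
    print(bits, k)    # [0, 1, 0] and final state 1: the LPS in state 2 emits the 3 bits of 2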
a theoretical basis for selecting the probabilities. Often there are practical
considerations limiting our choices, but we can show that it is reasonable to expect that
choosing only a few probabilities will give close to optimal compression.
For a binary alphabet, we can use Equation (1) to compute E(p, q), the extra code
length resulting from using estimates q and 1 − q for actual probabilities p and 1 − p,
respectively. For any desired maximum excess code length ε, we can partition the
space of possible probabilities to guarantee that the use of approximate probabilities
will never add more than ε to the code length of any event. We select partitioning
probabilities P_0, P_1, ... and estimated probabilities Q_0, Q_1, .... Each probability
Q_i is used to encode all events whose probability p is in the range P_i < p ≤ P_{i+1}.
We compute the partition, which we call an ε-partition, as follows:
1. Set i := 0 and Q_0 := 1/2.
2. Find the value of P_{i+1} (greater than Q_i) such that E(P_{i+1}, Q_i) = ε. We will
use Q_i as the estimated probability for all probabilities p such that
Q_i < p ≤ P_{i+1}.
3. Find the value of Q_{i+1} (greater than P_{i+1}) such that E(P_{i+1}, Q_{i+1}) = ε.
After we compute P_{i+2} in step 2 of the next iteration, we will use Q_{i+1} as the
estimate for all probabilities p such that P_{i+1} < p ≤ P_{i+2}.
We increment i and repeat steps 2 and 3 until P_{i+1} or Q_{i+1} reaches 1. The values
for p < 1/2 are symmetrical with those for p > 1/2.
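The following sketch (Python; the bisection routine and its tolerances are ours)
follows steps 1-3 above using the binary-alphabet excess code length E(p, q); for
ε = 0.05 it reproduces, to within rounding, the boundaries of Example 11 below.

    from math import log2

    def E(p, q):
        """Excess code length (bits) when probability p is coded with estimate q."""
        def term(x, y):
            return x * (log2(x) - log2(y)) if x > 0 else 0.0
        return term(p, q) + term(1 - p, 1 - q)

    def solve(f, lo, hi):
        """Find a root of f in (lo, hi) by bisection; f changes sign on the interval."""
        for _ in range(200):
            mid = (lo + hi) / 2
            if (f(lo) < 0) == (f(mid) < 0):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    def epsilon_partition(eps):
        """Partition points P_i and estimates Q_i for p >= 1/2 (symmetric below 1/2)."""
        P, Q = [0.5], [0.5]
        while True:
            if -log2(Q[-1]) <= eps:           # even p = 1 costs at most eps extra: done
                P.append(1.0)
                return P, Q
            P.append(solve(lambda x: E(x, Q[-1]) - eps, Q[-1], 1.0 - 1e-12))   # step 2
            Q.append(solve(lambda x: E(P[-1], x) - eps, P[-1], 1.0 - 1e-12))   # step 3

    P, Q = epsilon_partition(0.05)
    print([round(x, 4) for x in P])   # ~ [0.5, 0.6309, 0.8579, 0.987, 1.0]
    print([round(x, 4) for x in Q])   # ~ [0.5, 0.7499, 0.9324, 0.9997]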
Example 11: We show the ε-partition for ε = 0.05 bit per binary input symbol.

  Range of actual probabilities    Probability to use
  [0.0000, 0.0130)                 0.0003
  [0.0130, 0.1427)                 0.0676
  [0.1427, 0.3691)                 0.2501
  [0.3691, 0.6309)                 0.5000
  [0.6309, 0.8579)                 0.7499
  [0.8579, 0.9870)                 0.9324
  [0.9870, 1.0000)                 0.9997

Thus by using only 7 probabilities we can guarantee that the excess code length does
not exceed 0.05 bit for each binary decision coded. □
We might wish to limit the relative error so that the code length can never exceed
the optimal by more than a factor of 1 + δ. We can begin to compute these
δ-partitions using a procedure similar to that for ε-partitions, but unfortunately the
process does not terminate, since δ-partitions are not finite. As P approaches 1, the
optimal average code length grows very small, so to obtain a small relative loss Q
must be very close to P. Nevertheless, we can obtain a partial δ-partition.
Example 12: We show part of the δ-partition for δ = 0.05; the maximum relative
error is 5 percent.

  Range of actual probabilities    Probability to use
  ...                              ...
  [0.0033, 0.0154)                 0.0069
  [0.0154, 0.0573)                 0.0291
  [0.0573, 0.1670)                 0.0982
  [0.1670, 0.3722)                 0.2555
  [0.3722, 0.6278)                 0.5000
  [0.6278, 0.8330)                 0.7445
  [0.8330, 0.9427)                 0.9018
  [0.9427, 0.9846)                 0.9709
  [0.9846, 0.9967)                 0.9931
  ...                              ...

□
In practice we will use an approximation to an ε-partition or a δ-partition for
values of Prob{MPS} up to the maximum probability representable by our coder.
Figure 3: Steps in the development of a compressed tree. (a) Complete binary tree.
(b) Linear representation. (c) Compressed tree.
least double their size. In the compressed tree we collapse the breadth-first linear
representation of the complete binary tree by omitting nodes with zero probability.
If k different symbols have non-zero probability, the compressed tree representation
requires at most k lg(2n/k) − 1 nodes.
Example 13: Suppose we have the following probability distribution for an 8-symbol
alphabet:

  Symbol         a    b    c      d      e      f    g      h
  Probability    0    0    1/8    1/4    1/8    0    1/8    3/8

We can represent this distribution by the tree in Figure 3(a), rounding probabilities
and expressing them as multiples of 0.01. We show the linear representation in
Figure 3(b) and the compressed tree representation in Figure 3(c). □
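A minimal sketch of the construction (Python; following Figure 3, we assume each
internal node stores the probability, as a percentage, that the next symbol lies in its
left subtree, and that the linear representation lists these values in breadth-first
order; the function name is ours).

    def linear_representation(probs):
        """Breadth-first list of left-branch probabilities (in percent) for the
        complete binary tree over the alphabet; None marks a zero-probability node."""
        levels = [probs[:]]                            # levels[0] holds the leaf weights
        while len(levels[-1]) > 1:                     # sum sibling pairs upward
            prev = levels[-1]
            levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
        nodes = []
        for d in range(len(levels) - 1, 0, -1):        # from the root level downward
            children = levels[d - 1]
            for i, total in enumerate(levels[d]):
                nodes.append(round(100 * children[2 * i] / total) if total > 0 else None)
        return nodes

    probs = [0, 0, 1/8, 1/4, 1/8, 0, 1/8, 3/8]         # a b c d e f g h (Example 13)
    linear = linear_representation(probs)
    print(linear)                                      # [38, 0, 20, None, 33, 100, 25]
    print([x for x in linear if x is not None])        # compressed tree: [38, 0, 20, 33, 100, 25]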
Traversing the compressed tree is mainly a matter of keeping track of omitted
nodes. We do not have to process each node of the tree: for the first lg n − 2 levels
we have to process each node; but when we reach the desired node in the next-to-lowest
level we have enough information to directly index the desired node of the
lowest level. The operations are very simple, involving only one test and one or
two increment operations at each node, plus a few extra operations at each level.
Including the capability of adding new symbols to the tree makes the algorithm only
slightly more complicated.
estimate assuming a uniform a priori distribution for the true underlying
probability P.
We recall that for values of P near 1/2 we do not require a very accurate estimate,
since any value will give about the same code length; hence we do not need many
states in this probability region. When P is closer to 1, we would like our estimate
to be more accurate, to allow the arithmetic coder to give near-optimal compression,
so we assign states more densely for larger P. Unfortunately, in this case estimation
by any means is difficult, because occurrences of the LPS are so infrequent. We also
note that the underlying probability of any branch in the coding tree may change at
any time, and we would like our estimate to adapt accordingly.
To handle the small-sample cases, we reserve a number of states simply to count
occurrences when t is small, using Equation (2) to estimate the probabilities. We do
the same for larger values of t when c is 0, 1, t − 1, or t, to provide fast convergence
to extreme values of P.
We can show that if the underlying probability P does not change, the expected
value of the estimate p_k after k events is given by

    E(p_k) = P + (p_0 − P) f^k,

which converges to P for all f, 0 ≤ f < 1. The rapid convergence of E(p_k) when
f = 0 is misleading, since in that case the estimate is always 0 or 1, depending only
on the preceding event. The expected value is clearly P, but the estimator is useless.
A value of f near 1 provides resistance to random fluctuations in the input, but
the estimate converges slowly, both initially and when the underlying P changes. A
careful choice of f would depend on a detailed analysis like that performed by Flajolet
for the related problem of approximate counting [16,17]. We make a more pragmatic
decision. We know that periodic scaling is an approximation to exponential aging
and we can show that a scaling factor of f corresponds to a scaling block size B of
approximately f ln 2/(1 − f). Since B = 16 works well for scaling [58], we choose
f = 0.96.
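As a sketch of the idea, the update rule below, p <- f*p + (1 − f)*bit, is one standard
form of exponential aging whose expectation satisfies the recurrence above; it stands
in for the table-driven estimator of Section 3.4, and the simulated inputs are
illustrative. The last line checks that f = 0.96 corresponds to a block size of about 16.

    import random
    from math import log

    def aged_estimate(bits, p0=0.5, f=0.96):
        """Exponentially aged estimate of Prob{1}: p <- f*p + (1-f)*bit.
        Its expectation after k events is P + (p0 - P) * f**k."""
        p = p0
        for b in bits:
            p = f * p + (1 - f) * b
        return p

    random.seed(1)
    bits = [1 if random.random() < 0.9 else 0 for _ in range(500)]    # P = 0.9
    bits += [1 if random.random() < 0.2 else 0 for _ in range(500)]   # P drops to 0.2
    print(aged_estimate(bits[:500]))    # typically close to 0.9
    print(aged_estimate(bits))          # has moved down toward 0.2

    f = 0.96
    print(f * log(2) / (1 - f))         # about 16.6: the equivalent block size B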
Table 2: Comparison of PPMC and PPMD. Compression figures are in bits per input
symbol.

  File     Text?   PPMC   PPMD   Improvement using PPMD
  bib      Yes     2.11   2.09    0.02
  book1    Yes     2.65   2.63    0.02
  book2    Yes     2.37   2.35    0.02
  news     Yes     2.91   2.90    0.01
  paper1   Yes     2.48   2.46    0.02
  paper2   Yes     2.45   2.42    0.03
  paper3   Yes     2.70   2.68    0.02
  paper4   Yes     2.93   2.91    0.02
  paper5   Yes     3.01   3.00    0.01
  paper6   Yes     2.52   2.50    0.02
  progc    Yes     2.48   2.47    0.01
  progl    Yes     1.87   1.85    0.02
  progp    Yes     1.82   1.80    0.02
  geo      No      5.11   5.10    0.01
  obj1     No      3.68   3.70   -0.02
  obj2     No      2.61   2.61    0.00
  pic      No      0.95   0.94    0.01
  trans    No      1.74   1.72    0.02
that is, the occurrence of a symbol for the first time in the context, is also treated as
a "symbol," with its own count. When a letter occurs for the first time, its weight
becomes 1; the escape count is incremented by 1, so the total weight increases by 2.
At all other times the total weight increases by 1.
We have developed a new method, which we call PPMD, which is similar to PPMC
except that it makes the treatment of new symbols more consistent by adding 1/2
instead of 1 to both the escape count and the new symbol's count when a new symbol
occurs; hence the total weight always increases by 1. We have compared PPMC and
PPMD on the Bell-Cleary-Witten corpus [5] (including the four papers not described
in the book). Table 2 shows that for text files PPMD compresses consistently about
0.02 bit per character better than PPMC. The compression results for PPMC differ
from those reported in [5] because of implementation differences; we used versions of
PPMC and PPMD that were identical except for the escape probability calculations.
PPMD has the added advantage of making analysis more tractable by making the
code length independent of the appearance order of symbols in the context.
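A minimal sketch of the two update rules (Python; only the per-context counts and
the resulting escape and symbol probabilities are modeled, not the full PPM
mechanism, and the class and sample input are ours).

    from fractions import Fraction

    class ContextCounts:
        """Counts for one PPM context under the PPMC or PPMD update rule."""
        def __init__(self, method='PPMD'):
            self.method = method
            self.counts = {}                 # symbol -> weight
            self.escape = Fraction(0)

        def update(self, symbol):
            inc = Fraction(1, 2) if self.method == 'PPMD' else Fraction(1)
            if symbol not in self.counts:    # first occurrence in this context
                self.counts[symbol] = inc    # new symbol's initial weight
                self.escape += inc           # escape count grows by the same amount
            else:
                self.counts[symbol] += 1

        def probabilities(self):
            total = sum(self.counts.values()) + self.escape
            return self.escape / total, {s: c / total for s, c in self.counts.items()}

    ctx = ContextCounts('PPMD')
    for s in "aabca":
        ctx.update(s)
    print(ctx.probabilities())   # escape 3/10; a: 1/2, b: 1/10, c: 1/10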
Even in the worst case, when the symbols from the k colliding contexts in bucket
b are mutually disjoint, the additional code length is only H_b = H(p_1, p_2, p_3, ..., p_k),
the entropy of the ensemble of probabilities of occurrence of the contexts. We show
this by conceptually dividing the bucket into disjoint subtrees corresponding to the
various contexts, and noting that the cost of identifying an individual symbol is just
L_C = −lg p_i, the cost of identifying the context that occurred, plus L_S, the cost of
identifying the symbol in its own context. Hence the extra cost is just L_C, and the
average extra cost is Σ_{i=1}^{k} −p_i lg p_i = H_b. The maximum value of H_b is lg k,
so in buckets that contain data from only two contexts, the extra code length is at
most 1 bit per input symbol.
In fact, when the number of colliding contexts in a bucket is large enough that
H_b is significant, the symbols in the bucket, representing a combination of a number
of contexts, will be a microcosm of the entire file; the bucket's average code length
will approximately equal the 0-order entropy of the file. Lelewer and Hirschberg [31]
apply hashing with collision resolution in a similar high-order scheme.
4 Conclusion
We have shown the details of an implementation of arithmetic coding and have pointed
out its advantages (flexibility and near-optimality) and its main disadvantage
(slowness). We have developed a fast coder, based on reduced-precision arithmetic
coding, which gives only minimal loss of compression efficiency; we can use the concept
of ε-partitions to find the probabilities to include in the coder to keep the compression
loss small. In a companion paper [24], in which we refer to this fast coding method
as quasi-arithmetic coding, we give implementation details and performance analysis
for both binary and multi-symbol alphabets. We prove analytically that the loss in
compression efficiency compared with exact arithmetic coding is negligible.
We introduce the compressed tree, a new data structure for efficiently representing
a multi-symbol alphabet by a series of binary choices. Our new deterministic
probability estimation scheme allows fast updating of the model stored in the compressed
tree using only one byte for each node; the model can provide the reduced-precision
coder with the probabilities it needs. Choosing one of our two new methods for
computing the escape probability enables us to use the highly effective PPM algorithm,
and use of a hashed Markov model keeps space and time requirements manageable
even for a high-order model.
References
[1] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, NY,
1963.
[20] M. E. Hellman, "Joint Source and Channel Encoding," Proc. Seventh Hawaii
International Conf. System Sci., 1974.
[21] R. N. Horspool, "Improving LZW," in Proc. Data Compression Conference, J. A.
Storer & J. H. Reif, eds., Snowbird, Utah, Apr. 8-11, 1991, 332-341.
[22] P. G. Howard & J. S. Vitter, "Analysis of Arithmetic Coding for Data Compression,"
Information Processing and Management 28 (1992), 749-763.
[23] P. G. Howard & J. S. Vitter, "New Methods for Lossless Image Compression Using
Arithmetic Coding," Information Processing and Management 28 (1992), 765-779.
[24] P. G. Howard & J. S. Vitter, "Design and Analysis of Fast Text Compression
Based on Quasi-Arithmetic Coding," in Proc. Data Compression Conference, J. A.
Storer & M. Cohn, eds., Snowbird, Utah, Mar. 30-Apr. 1, 1993, 98-107.
[25] D. A. Huffman, "A Method for the Construction of Minimum Redundancy Codes,"
Proceedings of the Institute of Radio Engineers 40 (1952), 1098-1101.
[26] D. E. Knuth, "Dynamic Huffman Coding," J. Algorithms 6 (June 1985), 163-180.
[27] G. G. Langdon, "Probabilistic and Q-Coder Algorithms for Binary Source
Adaptation," in Proc. Data Compression Conference, J. A. Storer & J. H. Reif, eds.,
Snowbird, Utah, Apr. 8-11, 1991, 13-22.
[28] G. G. Langdon, "A Note on the Ziv-Lempel Model for Compressing Individual
Sequences," IEEE Trans. Inform. Theory IT-29 (Mar. 1983), 284-287.
[29] G. G. Langdon & J. Rissanen, "Compression of Black-White Images with
Arithmetic Coding," IEEE Trans. Comm. COM-29 (1981), 858-867.
[30] F. T. Leighton & R. L. Rivest, "Estimating a Probability Using Finite Memory,"
IEEE Trans. Inform. Theory IT-32 (Nov. 1986), 733-742.
[31] D. A. Lelewer & D. S. Hirschberg, "Streamlining Context Models for Data
Compression," in Proc. Data Compression Conference, J. A. Storer & J. H. Reif, eds.,
Snowbird, Utah, Apr. 8-11, 1991, 313-322.
[32] V. S. Miller & M. N. Wegman, "Variations on a Theme by Ziv and Lempel," in
Combinatorial Algorithms on Words, A. Apostolico & Z. Galil, eds., NATO ASI
Series #F12, Springer-Verlag, Berlin, 1984, 131-140.
[33] J. L. Mitchell & W. B. Pennebaker, "Optimal Hardware and Software Arithmetic
Coding Procedures for the Q-Coder," IBM J. Res. Develop. 32 (Nov. 1988), 727-736.
[34] A. M. Moffat, "Predictive Text Compression Based upon the Future Rather than
the Past," Australian Computer Science Communications 9 (1987), 254-261.
[35] A. M. Moffat, "Word-Based Text Compression," Software-Practice and Experience
19 (Feb. 1989), 185-198.
[36] A. M. Moffat, "Linear Time Adaptive Arithmetic Coding," IEEE Trans. Inform.
Theory IT-36 (Mar. 1990), 401-406.
[37] A. M. Moffat, "Implementing the PPM Data Compression Scheme," IEEE Trans.
Comm. COM-38 (Nov. 1990), 1917-1921.
[38] K. Mohiuddin, J. J. Rissanen & M. Wax, "Adaptive Model for Nonstationary
Sources," IBM Technical Disclosure Bulletin 28 (Apr. 1986), 4798-4800.
[39] D. S. Parker, "Conditions for the Optimality of the Huffman Algorithm," SIAM
J. Comput. 9 (Aug. 1980), 470-489.
[40] R. Pasco, "Source Coding Algorithms for Fast Data Compression," Stanford Univ.,
Ph.D. Thesis, 1976.
[41] W. B. Pennebaker & J. L. Mitchell, "Probability Estimation for the Q-Coder,"
IBM J. Res. Develop. 32 (Nov. 1988), 737-752.
[42] W. B. Pennebaker & J. L. Mitchell, "Software Implementations of the Q-Coder,"
IBM J. Res. Develop. 32 (Nov. 1988), 753-774.
[43] W. B. Pennebaker, J. L. Mitchell, G. G. Langdon & R. B. Arps, "An Overview of
the Basic Principles of the Q-Coder Adaptive Binary Arithmetic Coder," IBM J.
Res. Develop. 32 (Nov. 1988), 717-726.
[44] J. Rissanen, "Modeling by Shortest Data Description," Automatica 14 (1978),
465-471.
[45] J. Rissanen, "A Universal Prior for Integers and Estimation by Minimum
Description Length," Ann. Statist. 11 (1983), 416-432.
[46] J. Rissanen, "Universal Coding, Information, Prediction, and Estimation," IEEE
Trans. Inform. Theory IT-30 (July 1984), 629-636.
[47] J. Rissanen & G. G. Langdon, "Universal Modeling and Coding," IEEE Trans.
Inform. Theory IT-27 (Jan. 1981), 12-23.
[48] J. J. Rissanen, "Generalized Kraft Inequality and Arithmetic Coding," IBM J.
Res. Develop. 20 (May 1976), 198-203.
[49] J. J. Rissanen & G. G. Langdon, "Arithmetic Coding," IBM J. Res. Develop. 23
(Mar. 1979), 146-162.
[50] J. J. Rissanen & K. M. Mohiuddin, "A Multiplication-Free Multialphabet
Arithmetic Code," IEEE Trans. Comm. 37 (Feb. 1989), 93-98.
[51] C. Rogers & C. D. Thomborson, "Enhancements to Ziv-Lempel Data Compression,"
Dept. of Computer Science, Univ. of Minnesota, Technical Report TR 89-2,
Duluth, Minnesota, Jan. 1989.
[52] F. Rubin, "Arithmetic Stream Coding Using Fixed Precision Registers," IEEE
Trans. Inform. Theory IT-25 (Nov. 1979), 672-675.
[53] B. Y. Ryabko, "Data Compression by Means of a Book Stack," Problemy Peredachi
Informatsii 16 (1980).
[54] C. E. Shannon, "A Mathematical Theory of Communication," Bell Syst. Tech. J.
27 (July 1948), 398-403.
[55] J. S. Vitter, "Dynamic Huffman Coding," ACM Trans. Math. Software 15 (June
1989), 158-167; also appears as Algorithm 673, Collected Algorithms of ACM, 1989.
[56] J. S. Vitter, "Design and Analysis of Dynamic Huffman Codes," Journal of the
ACM 34 (Oct. 1987), 825-845.
[57] I. H. Witten & T. C. Bell, "The Zero Frequency Problem: Estimating the
Probabilities of Novel Events in Adaptive Text Compression," IEEE Trans. Inform.
Theory IT-37 (July 1991), 1085-1094.
[58] I. H. Witten, R. M. Neal & J. G. Cleary, "Arithmetic Coding for Data Compression,"
Comm. ACM 30 (June 1987), 520-540.
[59] J. Ziv & A. Lempel, "A Universal Algorithm for Sequential Data Compression,"
IEEE Trans. Inform. Theory IT-23 (May 1977), 337-343.
[60] J. Ziv & A. Lempel, "Compression of Individual Sequences via Variable Rate
Coding," IEEE Trans. Inform. Theory IT-24 (Sept. 1978), 530-536.