COMPRESSION AND CODING ALGORITHMS
THE KLUWER INTERNATIONAL SERIES
IN ENGINEERING AND COMPUTER SCIENCE
by
Alistair Moffat
The University of Melbourne, Australia
and
Andrew Turpin
Curtin University of Technology, Australia
Preface vii
2 Fundamental Limits 15
2.1 Information content 15
2.2 Kraft inequality 17
2.3 Human compression 19
2.4 Mechanical compression systems 20
3 Static Codes 29
3.1 Unary and binary codes 29
3.2 Elias codes 32
3.3 Golomb and Rice codes 36
3.4 Interpolative coding 42
3.5 Making a choice 48
4 Minimum-Redundancy Coding 51
4.1 Shannon-Fano codes 51
4.2 Huffman coding 53
4.3 Canonical codes 57
4.4 Other decoding methods 63
4.5 Implementing Huffman's algorithm 66
4.6 Natural probability distributions 70
4.7 Artificial probability distributions 78
4.8 Doing the housekeeping chores 81
4.9 Related material 88
5 Arithmetic Coding 91
5.1 Origins of arithmetic coding 92
5.2 Overview of arithmetic coding 93
5.3 Implementation of arithmetic coding 98
5.4 Variations 113
5.5 Binary arithmetic coding 118
5.6 Approximate arithmetic coding 122
5.7 Table-driven arithmetic coding 127
5.8 Related material 130
9 What Next? 253
References 257
Index 271
Preface
None of us is comfortable with paying more for a service than the minimum
we believe it should cost. It seems wantonly wasteful, for example, to pay $5
for a loaf of bread that we know should only cost $2, or $10,000 more than the
sticker price of a car. And the same is true for communications costs - which
of us has not received our monthly phone bill and gone "ouch"? Common to
these cases is that we are not especially interested in reducing the amount of
product or service that we receive. We do want to purchase the loaf of bread
or the car, not half a loaf or a motorbike; and we want to make the phone calls
recorded on our bill. But we also want to pay as little as possible for the desired
level of service, to somehow get the maximal "bang for our buck".
That is what this book is about - figuring out how to minimize the "buck"
cost of obtaining a certain amount of "bang". The "bang" we are talking about
is the transmission of messages, just as in the case of a phone bill; and the
"buck" we seek to minimize is the dollar cost of sending that information. This
is the process of data compression; of seeking the most economical represen-
tation possible for a source message. The only simplification we make when
discussing compression methods is to suppose that bytes of storage or commu-
nications capacity and bucks of money are related, and that if we can reduce
the number of bytes of data transmitted, then the number of bucks spent will
be similarly minimal.
Data compression has emerged as an important enabling technology in a
wide variety of communications and storage applications, ranging from "disk
doubling" operating systems that provide extra storage space; to the facsim-
ile standards that facilitate the flow of business information; and to the high-
definition video and audio standards that allow maximal use to be made of
scarce satellite transmission bandwidth. Much has been written about data
compression - indeed, we can immediately recommend two excellent books,
only one of which involves either of us as an author [Bell et al., 1990, Witten
et al., 1999] - and as a research area data compression is relatively mature.
As a consequence of that maturity, it is now widely agreed that compres-
sion arises from the conjunction of two quite distinct activities, modeling and
coding.
Acknowledgements
One of the nice things about writing a book is getting to name names without
fear of being somehow unacademic or too personal. Here are some names,
people who in some way or another contributed to the existence of this work.
Research collaborators come first. There are many, as it has been our good
fortune to enjoy the friendship and assistance of a number of talented and gen-
erous people. Ian Witten has provided enthusiasm and encouragement over
more years than are worth counting, and lent a strategic nudge to this project at
a delicate moment. Lang Stuiver devoted considerable energy to his investiga-
tion of arithmetic coding, and much of Chapter 5 is a result of his efforts. Lang
also contributed to the interpolative coding mechanism described in Chapter 3.
Justin Zobel has been an accomplice for many years, and has contributed to
this book by virtue of his own interests [Zobel, 1997]. Others that we have
enjoyed interacting with include Abe Bookstein, Bill Teahan, Craig Nevill-
Manning, Darryl Lovato, Glen Langdon, Hugh Williams, Jeff Vitter, Jesper
Larsson, Jim Storer, John Cleary, Julien Seward, Jyrki Katajainen, Mahesh
Naik, Marty Cohn, Michael Schindler, Neil Sharman, Paul Howard, Peter Fen-
wick, Radford Neal, Suzanne Bunton, Tim C. Bell, and Tomi Klein. We have
also benefited from the research work undertaken by a very wide range of other
people. To those we have not mentioned explicitly by name - thank you.
Mike Liddell, Raymond Wan, Tim A.H. Bell, and Yugo Kartono Isal un-
dertook proofreading duties with enthusiasm and care. Many other past and
present students at the University of Melbourne have also contributed: Alwin
Ngai, Andrew Bishop, Gary Eddy, Glen Gibb, Mike Ciavarella, Linh Huynh,
Owen de Kretser, Peter Gill, Tetra Lindarto, Trefor Morgan, Tony Wirth, Vo
Ngoc Anh, and Wayne Salamonsen. We also thank the Australian Research
Council, for their funding of the various projects we have been involved in;
our two Departments, who have provided environments in which projects such
as this are feasible; Kluwer, who took it out of our hands and into yours; and
Gordon Kraft, who provided useful information about his father.
Family come last in this list, but first where it counts. Aidan, Allison, Anne,
Finlay, Kate, and Thau Mee care relatively little for compression, coding, and
algorithms, but they know something far more precious - how to take us away
from our keyboards and help us enjoy the other fun things in the world. It is
because of their influence that we plant our tongues in our cheeks and suggest
that you, the reader, take a minute now to look out your window. Surely there
is a nice leafy spot outside somewhere for you to do your reading?
Alistair Moffat, Melbourne, Australia
Andrew Turpin, Perth, Australia
Chapter 1
One of the paradoxes of modern computer systems is that despite the spiraling
decrease in storage costs there is an ever increasing emphasis on data compres-
sion. We use compression daily, often without even being aware of it, when we
use facsimile machines, communication networks, digital cellular telephones,
world-wide web browsers, and DVD players. Indeed, on some computer sys-
tems, the moment we access a file from disk we make use of compression
technology; and not too far in the future are computer architectures that store
executable code in compressed form in main memory in lines of a few hundred
bytes, decompressing it only when brought into cache.
less important has been that the bandwidth limitation imposed by twisted-pair
connections was greatly reduced by the contemporary development of elegant
bi-level (binary) image compression mechanisms. The electronic technology is
what has made facsimile transmission possible, but it is compression technol-
ogy that has kept costs low and made the facsimile machine an indispensable
tool for business and private use.
Similarly, within the last decade the use of compression has served to con-
tain the cost of cellular telephone and satellite television transmission, and has
made both of these technologies accessible to consumers at modest prices. Fi-
nally, the last few years have seen the explosion of the world-wide web net-
work. Which of us has not waited long minutes for pages to load, images to be
visible, and animations to commence? We blame the delays on a multitude of
reasons, but there is usually one single contributing factor - too much data to
be moved, and insufficient channel capacity to carry it. The obvious solution is
to spend more money to increase the bandwidth, but we could also reduce the
amount of data to be transmitted. With compression, it is possible to reduce
the amount of data transmitted, but not make any sacrifice in the amount of
information conveyed.
The third motivating force for compression is the endless search for im-
proved program speed, and this is perhaps the most subtle of the three factors.
Consider the typical personal computer of a decade ago, perhaps around 1990.
In addition to about 100 MB of hard disk, with its 15 millisecond seek time
and a 1 MB per second peak transfer rate, such a computer had a processor
of perhaps 33 MHz clock rate and 1 or 4 MB of memory. Now on the equiv-
alent personal computer the processor will operate more than ten times faster
(950 MHz is a current entry-level specification, and that is sure to have changed
again by the time you are reading this), and the memory capacity will also have
grown by a factor of thirty or more to around 128-256 MB. Disk capacities
have also exploded over the period in question, and the modern entry-level
computer may well have 20 GB of disk, two hundred times more than was
common just a few years ago. But disk speeds have not grown at the same
rate, and it is unlikely that the disk on a modern entry-level computer operates
any more than twice as quickly as the 1990 machine - 10 millisecond seek
times and 2 MB per second transfer rates are still typical, and for CD-ROM
drives seek and transfer times are even greater. That is, the limitations on me-
chanical technology have severely damped growth in disk speeds even though
capacity has increased greatly. Hence, it is now more economical than ever
before to trade-off processor time against reduced disk transfer times and file
sizes, the latter of which reduces average seek times too. Indeed, given the
current balance between disk and processor speeds, compression actually im-
proves overall response time in some applications. This effect will become
more marked as processors continue to improve; and only when fast solid-state
storage devices of capacity to rival disks are available will it be necessary to
again evaluate the trade-offs involved for and against compression. Once this
occurs, however, an identical trade-off will be possible with respect to cache
and main memory, rather than main memory and disk.
These three factors combine to make compression a fundamental enabling
technology in this digital age. Like any technology, we can, if we prefer, ig-
nore the details. Which of us truly understands the workings of the internal
combustion engine in our automobile? Indeed, which of us really even fully
grasps the exact details of the sequence of operations that allows the electric
light to come on when we flick the switch? And, just as there are mechanical
and electrical engineers who undertake to provide these two technologies to us
in a black box form, so too there are compression engineers that undertake to
provide black box compression systems that others may make use of to attain
the benefits outlined above. If we wish to make use of the technology in this
way without becoming intimate with the details, then no one will be scornful.
But, in the same way that some people regard tinkering with the family car
as a hobby rather than a chore, so too can an understanding of compression be
interesting. And for the student studying computer science, compression is one
of just a small handful of areas in which the development in an abstract way of
algorithms and data structures can address an immediate pragmatic need.
This book is intended for both groups of people - those who want to un-
derstand compression because it is a core technology in a field that they seek to
make their profession, and those who want to understand compression because
it interests them. And, of course, it is the hope of the authors that some of the
interest and excitement that prompted the writing of this book will rub off onto
its readers - in both of these categories.
A very simple model of text is that there is no correlation between adjacent sym-
bols: that it is a stream of independent characters. Such a model is referred
to as a zero-order character-based model. A more sophisticated model might
assume that the data is a stream of English words that repeat in certain pre-
dictable ways; or that each of the preceding individual characters can be used
to bias (or condition) the probabilities assigned to the next character.
The second important operation is probability estimation, or statistics gath-
ering: the process of assigning a probability to each of the possible "next" sym-
bols in the input stream that is being compressed, given a particular model of
the data. For example, a very simple approach is to assert that all possible next
symbols are equi-probable. While attractive for its lack of complexity, such
an approach does not necessarily result in very good compression. A more
principled approach is to retain a historical count of the number of times each
possible symbol has appeared in each particular state of the model, and use the
ratio of a symbol's count to the total number of times a state has previously
occurred as an estimate of the symbol probability in that state.
The third of the three principal operations is that of coding. Given a prob-
ability distribution for the symbols in a defined source alphabet, and a symbol
drawn from that alphabet, the coder communicates to the waiting decoder the
identifier corresponding to that symbol. The coder is required to make use of
a specified channel alphabet (normally, but not always, the binary values zero
and one), and to make as efficient use as possible of the channel capacity sub-
ject to whatever other constraints are enforced by the particular application.
For example, one very simple coding method is unary, in which the number
one is coded as "0", the number two as "10", the number three as "110", and so
on. However such a coding method makes no use of the probabilities that have
been estimated by the statistics component of the compression system, and,
presuming that the probabilities are being reliably estimated, a more compact
message will usually result if probabilities are taken into account.
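To make the unary example concrete, the following short Python fragment (our illustration,
not code from the book) builds unary codewords as strings of bits, and reproduces the
codewords "0", "10", and "110" for the numbers one, two, and three:

def unary(x):
    """Unary codeword for integer x >= 1: (x - 1) one-bits followed by a zero-bit."""
    assert x >= 1
    return "1" * (x - 1) + "0"

for x in range(1, 6):
    print(x, unary(x))   # 1 -> "0", 2 -> "10", 3 -> "110", ...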
Figure 1.1 shows this relationship between the modeling, statistics, and
coding modules. A sequence of source symbols is processed by the encoder,
and each in turn is represented as a sequence of bits and transmitted to the de-
coder. A probability distribution against which each symbol should be coded is
supplied by the statistics module, evaluated only after the model has indicated
a context that is likely to provide an accurate prediction. After each symbol
has been coded, the statistics module may update its probability estimates for
that context, and the model may update the structural information it retains,
possibly even introducing one or more new contexts. At the decoding end, the
stream of bits must be rendered back into a stream of symbol identifiers, and
exactly identical statistics and structural modifications carried out in order for
the decoder to faithfully reproduce, in parallel, the actions of the encoder.
[Figure 1.1: the structure of a compression system. The ENCODER and the DECODER each
contain a model, a statistics module, and a coder. In each, the model passes a context
identifier to the statistics module and applies structural modifications; the statistics
module passes symbol probabilities to the coder and applies probability modifications.
Source symbols enter the encoder, are transmitted as an encoded bitstream
("10011010001..."), and are reproduced as symbols by the decoder.]
as part of the compressed message, the statistics module might use fixed prob-
abilities gleaned from an off-line inspection of a large volume of representative
text. Or if, for some reason, variable-length codes cannot be used, then symbol
numbers can be transmitted in, for example, a flat binary code.
A further point to be noted in connection with Figure 1.1 is that in some cir-
cumstances the probability estimation component will sit more naturally with
the modeler, and in others will be naturally combined with the coder. Differ-
ent combinations of model and coder will result in different placements of the
statistics module, with the exact placement usually driven by implementation
concerns. Nevertheless, in a logical sense, the three components exist in some
form or another in all compression systems.
1.3 Terminology
The problem of coding is as follows. A source alphabet of n symbols, together
with a set of probabilities P = [p_1, p_2, ..., p_n], is given, where it is assumed
that Σ_{i=1}^{n} p_i = 1. The coding module must
decide on a code, which is a representation for each symbol using strings over
a defined channel alphabet, usually {0, 1}. Also supplied to the coder is a
single index x, indicating the symbol s_x that is to be coded. Normally, s_x
will be a symbol drawn from a longer message, that is, s_x = M[j] for some
1 ≤ j ≤ m = |M|, but it is simpler at first to suppose that s_x is an isolated
symbol. Where there is no possible ambiguity we will also refer to "symbol
x", meaning symbol s_x, the xth symbol of the alphabet. The code for each
possible symbol s_i must be decided in advance of x being known, as otherwise
it is impossible for the decoder - which must eventually be able to recover the
corresponding symbol s_x - to make the same allocation of codewords.
Often the underlying probabilities, p_i, are not exactly known, and prob-
ability estimates are derived from the given message M. For example, in a
message of m symbols, if the ith symbol appears v_i times, then the relation-
ship p_i = v_i/m might be assumed. We call these the self-probabilities of M.
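As an illustration of self-probabilities (our sketch, not taken from the book, using a
made-up message), the following Python fragment derives p_i = v_i/m from a message given
as a list of symbols:

from collections import Counter

def self_probabilities(message):
    """Return a mapping from each distinct symbol to its self-probability v_i / m."""
    m = len(message)
    counts = Counter(message)               # v_i for each distinct symbol
    return {s: v / m for s, v in counts.items()}

print(self_probabilities(list("aaab")))     # {'a': 0.75, 'b': 0.25}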
For most applications the alphabet of source symbols is the set of contigu-
ous integers 1, 2, ..., n, so that s_i = i. Any situations in which this assumption
is not valid will be noted as they are discussed. Similarly, in most situations it
may be assumed that the symbol ordering is such that p_1 ≥ p_2 ≥ ... ≥ p_{n-1} ≥ p_n.
Table 1.1: Three simple prefix-free codes, and their expected cost in bits per symbol.
The expected cost of a code C relative to a probability distribution P is

    E(C, P) = Σ_{i=1}^{n} p_i × |c_i|,    (1.1)
where |c_i| is the cost of the ith codeword. The usual measure of cost is length -
how many symbols of the channel alphabet are required. But other definitions
are possible, and are considered in Section 7.3 on page 209. For some purposes
the exact codewords being used are immaterial, and of sole interest is the cost.
To this end we define |C| = [|c_1|, |c_2|, ..., |c_n|] as a notational convenience.
Consider, for example, the coding problem summarized in Table 1.1. In
this example n = 6, the source alphabet is denoted by S = [1, 2, 3, 4, 5, 6], the
corresponding probabilities p_i are listed in the second column, and the channel
alphabet is assumed to be {0, 1}. The third, fourth, and fifth columns of the
table list three possible assignments of codewords. Note how, in each of the
codes, no codeword is a prefix of any of the other codewords. Such codes are
known as prefix-free, and, as will be described in Chapter 2, this is a critically
important property. One can imagine, for example, the difficulties that would
occur in decoding the bitstream "001 ... " if one symbol had the codeword "00"
and another symbol the codeword "001".
The first code, in the column headed "Code 1", is a standard binary rep-
resentation using ⌈log2 n⌉ = 3 bits for each of the codewords. In terms of
the notation described above, we would thus have |C| = [3, 3, 3, 3, 3, 3]. This
code is not complete, as there are prefixes (over the channel alphabet) that are
unused. In the example, none of the codewords start with "11", an omission
that implies that some conciseness is sacrificed by this code.
Code 2 is a complete code, formed from Code 1 by shortening some of
the codewords to ⌊log2 n⌋ = 2 bits, while still retaining the prefix-free prop-
erty. By assigning the shorter codewords to the most frequent source symbols,
a substantial reduction in the expected codeword length E(C, P) from 3.00 to
2.22 bits per symbol is achieved. Furthermore, because the code is both prefix-
free and complete, every semi-infinite (that is, infinite to the right) string over
the channel alphabet can be unambiguously decoded. For example, the string
"011110001 ... " can only have been generated by the source symbol sequence
2,6,1,2, .... On the other hand, with Code 1, the string "011110001 ... " can-
not be decoded, even though Code 1 is prefix-free.
The third code further adjusts the lengths of the codewords, and reduces E,
the expected codeword length, to 1.75 bits per symbol. Code 3 is a minimum-
redundancy code (such codes are often known as Huffman codes although, as will
be demonstrated in Chapter 4, the two are not strictly the same), and for this prob-
ability distribution there is no allocation of discrete codewords over {0, 1} that
reduces the expected codeword length below 1.75 bits per symbol.
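The expected codeword length of Equation 1.1 is easily computed. The sketch below is ours,
and uses an invented six-symbol probability distribution rather than the exact values of
Table 1.1; it shows that a flat three-bit code always costs 3.00 bits per symbol, while a
prefix-free code with skewed lengths is cheaper only because the short codewords are
assigned to the probable symbols.

def expected_cost(P, lengths):
    """E(C, P): sum of p_i times |c_i|, in bits per symbol."""
    assert abs(sum(P) - 1.0) < 1e-9
    return sum(p * length for p, length in zip(P, lengths))

P = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]                   # hypothetical, not Table 1.1's values
print(round(expected_cost(P, [3, 3, 3, 3, 3, 3]), 2))  # 3.0
print(round(expected_cost(P, [1, 2, 3, 4, 5, 5]), 2))  # 2.1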
So an obvious question is this: given the column labeled p_i, how can the
column labeled "Code 3" be computed? And when might the use of Code 2 or
Code 1 be preferred? For example, Code 2 has no codeword longer than three
bits. Is it the cheapest "no codeword longer than three bits" code that can be
devised? If these questions are in your head, then read on: they illustrate the
flavor of this book, and will be answered before you get to its end.
Finally in this section, note that there is another whole family of coding
methods that in effect use bit-fractional codes, and with such an arithmetic
coder it is possible to represent the alphabet and probability distribution of
Table 1.1 in an average of 1.65 bits per symbol, better than can be obtained if
each codeword must be of integral length. We consider arithmetic coding in
detail in Chapter 5.
There are many places in our daily lives where we also use codes of var-
ious types. Table 1.2 shows some examples. You may wish to add others
from your own experience. Note that there is no suggestion that these coding
regimes are "good", or unambiguously decodeable, or even result in compres-
sion - although it is worth noting that Morse code was certainly designed with
compression in mind [Bell et al., 1990]. Nevertheless, they illustrate the idea
of assigning a string over a defined channel alphabet to a concept expressed in
some source alphabet; the very essence of coding.
The books by Held [1983], Wei [1987], Anderson and Mohan [1991], Hoff-
man [1997], Sayood [2000], and Salomon [2000] are further useful counter-
points to the material on coding presented below, as is the survey article by
Lelewer and Hirschberg [1987].
The information-theoretic aspects of data compression have been studied
for even longer than its algorithmic facets, and the standard references for this
work are Shannon and Weaver [1949], Hamming [1986], and Gray [1990];
with another recent contribution coming from Golomb et al. [1994].
Finally, for an algorithmic treatment, the four texts already cited above
all provide some coverage of compression [Cormen et al., 2001, Gonnet and
Baeza-Yates, 1991, Sedgewick, 1990, Storer, 2002]; and Graham et al. [1989]
provide an excellent encyclopedia of mathematical techniques for discrete do-
mains, many of which are relevant to the design and analysis of compression
systems.
upper bound a "minimal" function that satisfies the definition, and 0 (n log n)
is regarded as a much more accurate description of h than is 0(n 2 ). Note that
the use of the constant function g(n) = 1 is perfectly reasonable, and if f(n)
is described as being 0(1) then in the limit f is bounded above by a constant.
It is also necessary sometimes to reason about lower bounds, and to assert
that some function grows at least as quickly as some other function. Function
f(n) is Ω(g(n)) if g(n) is O(f(n)). Equality of functional growth rate is
expressed similarly - function f(n) is Θ(g(n)) if f(n) is O(g(n)) and g(n) is
O(f(n)). Note, however, that it is conventional where there is no possibility
of confusion for O to be used instead of Θ - if an algorithm is described as
being O(n log n) without further qualification it may usually be assumed that
the time taken by the algorithm is Θ(n log n).
The final functional comparator that it is convenient to make use of is a
"strictly less than" relationship: f(n) is o(g(n)) if f(n) is O(g(n)) but g(n) is
not O(f(n)). For example the function h(n) = n + n/log n can be described
as being "n + o(n)", meaning in the case of this example that the constant
factor on the dominant term is known, and the next most important term is
strictly sublinear. Similarly, a function that is o(1) has zero as a limiting value
as n gets large. Note that for this final definition to make sense we presume
both f and g to be monotonic and thus well-behaved.
Knowledge of the asymptotic growth rate of the running time of some al-
gorithm is a requirement if the algorithm is to be claimed to be "useful", and
algorithmic descriptions that omit an analysis should usually be considered to
be incomplete. To see the dramatic effect that asymptotic running time can
have upon the usefulness of an algorithm consider, for example, two mecha-
nisms for sorting - Selectionsort and Mergesort [Knuth, 1973]. Selectionsort
is an intuitively attractive algorithm, and is easy to code. Probably all of us
have made use of a selection-like sorting process as "human computers" from
a relatively early age: it seems very natural to isolate the smallest item in the
list, and then the second smallest in the remainder, and so on. But Selection-
sort is not an asymptotically efficient method. It sorts a list of n objects in
O(n^2) time, assuming that objects can be compared and exchanged in O(1)
time. Mergesort is somewhat harder to implement, and unless a rather complex
mechanism is employed, has the disadvantage of requiring O(n) extra work
space. Nor is it especially intuitive. Nevertheless, it operates in time that is
O(n log n). Now suppose that both Selectionsort and Mergesort require 1 sec-
ond to sort a list of 1,000 objects. From such a basis the two asymptotic growth
rates can be used to estimate the time taken to sort a list of (say) 1,000,000 ob-
jects. Since the number of objects increases by a factor of 1,000, the time taken
by the Selectionsort increases by a factor of 1,000 squared, which is 1,000,000.
That is, the estimated time for the Selectionsort will be 1 × 10^6 seconds, about
11 days. On the other hand, the time of the Mergesort will increase by a factor
of about 2,000, and the sort will complete in 35 minutes or so. The asymptotic
time requirement of an algorithm has a very large impact upon its usability - an
impact for which no amount of new and expensive hardware can possibly com-
pensate. Brute force does have its place in the world, but only when ingenuity
has been tried and been unsuccessful.
Another important consideration is the memory space required by some
methods. If two alternative mechanisms for solving some problem both take
O(n) time, but one requires 5n words of storage to perform its calculations
and the other takes n words, then it is likely that the second method is more
desirable. As shall be seen in the body of this book, such a scenario can occur,
and efficient use of memory resources can be just as important a consideration
as execution-time analysis. A program can often be allowed to run for 10%
more time than we would ideally desire, and a result still obtained. But if
it requires 10% more memory than the machine being used has available, it
might be simply impossible to get the desired answers.
To actually perform an analysis of some algorithm, an underlying machine
model must be assumed. That is, the set of allowable operations - and the time
cost of each - must be defined. The cost of storing data must also be specified.
For example, in some applications it may be appropriate to measure storage
by the bit, as it makes no sense to just count words. Indeed, in some ways
compression is such an application, for it is pointless to ask how many words
are required to represent a message if each word can store an arbitrary inte-
ger. On the other hand, when discussing the memory cost of the algorithm that
generates the code, it is appropriate for the most part to assume that each word
of memory can store any integer as large as is necessary to execute the algo-
rithm. In most cases this requirement means that the largest value manipulated
is the sum of the source frequencies. That is, if a code is being designed for a
set of n integer symbol frequencies v_i it is assumed that quantities as large as
U = Σ_{i=1}^{n} v_i can be stored in a single machine word.
It will also be supposed throughout the analyses in this book that compar-
ison and addition operations on values in the range 1 ... U can be effected in
O(1) time per operation; and similarly that the ith element in an array of as
many as n values can be accessed and updated in O(1) time. Such a machine
model is known in algorithms literature as a random access machine. We also
restrict our attention to sequential computations. There have been a large num-
ber of parallel machine models described in the research literature, but none
are as ubiquitous as the single processor RAM machine model.
An analysis must also specify whether it is the worst case that is being con-
sidered, or the average case, where the average is taken over some plausible
probability distribution, or according to some reasonable randomness assump-
tion for the input data. Worst case analyses are the stronger of the two, but
in some cases the average behavior of an algorithm is considerably better than
its worst case behavior, and the assumptions upon which that good behavior is
predicated might be perfectly reasonable (for example, Quicksort).
Finally in this introductory chapter we introduce a small number of stan-
dard mathematical results that are used in the remainder of the book.
For various reasons it is necessary to work with factorials, and an expansion
due to James Stirling is useful [Graham et al., 1989, page 112]:

    n! ≈ √(2πn) (n/e)^n,
This latter expression means that another useful approximation can be derived:
When n_1 ≪ n_2 (n_1 is much smaller than n_2) Equation 1.4 can be further
simplified to
(1.5)
The Fibonacci series is also of use in the analysis of some coding algo-
rithms. It is defined by the basis F(1) = 1, F(2) = 1, and thereafter by the
recurrence F(n + 2) = F(n + 1) + F(n). The first few terms from n = 1 are
1, 1, 2, 3, 5, 8, 13, 21, 34. The Fibonacci numbers have a fascinating relation-
ship with the "golden ratio" ¢ defined by the quadratic equation
PAGE 14 COMPRESSION AND CODING ALGORITHMS
cp = 1 + J5 ~ 1.618.
2
The ratio between successive terms in the Fibonacci sequence approaches φ in
the limit, and a closed form for F(n) is

    F(n) = ⌊φ^n/√5 + 1/2⌋.
A closely related function is defined by F'(1) = 2, F'(2) = 3, and thereafter
by F'(n + 2) = F'(n + 1) + F'(n) + 1. The first few terms from n = 1 of this
faster-growing sequence are 2, 3, 6, 10, 17, 28, 46, 75. The revised function is,
however, still closely related to the golden ratio, and it can be shown that

    F'(n) = F(n) + F(n + 2) - 1 ≈ (φ^n + φ^{n+2})/√5 - 1 = φ^{n+1} - 1,

with the final equality the result of one of the many identities involving φ, in
this case that (φ^2 + 1)/√5 = φ.
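As a quick check on the closed form (a sketch of ours, not part of the original text), the
recurrence and the rounded φ^n/√5 expression can be compared directly in Python:

import math

PHI = (1 + math.sqrt(5)) / 2

def fib(n):
    """F(1) = F(2) = 1, and F(n + 2) = F(n + 1) + F(n)."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

for n in range(1, 10):
    assert math.floor(PHI ** n / math.sqrt(5) + 0.5) == fib(n)
print([fib(n) for n in range(1, 10)])   # [1, 1, 2, 3, 5, 8, 13, 21, 34]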
Sorting was used as an example earlier in this section, and many of the code
construction methods discussed in this book assume that the input probability
list is sorted. There are several sorting algorithms that operate in O(n log n)
time in either the average case or the worst case. Mergesort was mentioned
as being one method that operates in O(n log n) time. Heapsort also oper-
ates in the same time bound, and has the added advantage of not requiring
O(n) extra space. Heapsort is also a useful illustration of the use of the pri-
ority queue data structure after which it is named. Finally amongst sorting
algorithms, Hoare's Quicksort [Hoare, 1961, 1962] can be implemented to op-
erate extremely quickly on average [Bentley and McIlroy, 1993] and, while the
O(n log n) analysis is only for the average case, it is relatively robust. Much
of the advantage of Quicksort compared to Heapsort is a result of the largely
sequential operation. On modern cache-based architectures, sequential rather
than random access of items in the array being sorted will automatically bring a
performance gain. Descriptions of all of these sorting algorithms can be found
in, for example, the text of Cormen et al. [2001].
Chapter 2
Fundamental Limits
The previous chapter introduced the coding problem: that of assigning some
codewords or bit-patterns C to a set of n symbols that have a probability dis-
tribution given by P = [p_1, ..., p_n]. This chapter explores some lines in the
sand which cannot be crossed when designing codes. The first is a lower bound
on the expected length of a code: Shannon's entropy limit. The second restric-
tion applies to the lengths of codewords, and is generally referred to as the
Kraft inequality. Both of these limits serve to keep us honest when devising
new coding schemes. Both limits also provide clues on how to construct codes
that come close to reaching them. We can also obtain experimental bounds on
compressibility by using human models and experience, and this area is briefly
considered in Section 2.3. The final section of this chapter then shows the
application of these limits to some simple compression systems.
    I(s_i) = -log2 p_i.    (2.1)
That is, the amount of information conveyed by symbol s_i is the negative loga-
rithm of its probability. The multiplication by minus one means that the smaller
the probability of a symbol and the greater the surprise when it occurs, the
greater the amount of information conveyed. Shannon's original definition did
not specify that the base of the logarithm should be two, but if the base is
two, as he observed, then I(s_i) is a quantity in bits, which is very useful when
discussing coding problems over the binary channel alphabet. For example,
referring back to Table 1.1 on page 7, symbol s_1 has probability 0.67 and in-
formation content of approximately 0.58 bits, and symbol s_6, with p_6 = 0.04,
has I(s_6) = 4.64 bits.
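The two figures just quoted follow directly from Equation 2.1, as a one-line Python check
(ours) confirms:

import math

def info_content(p):
    """I(s_i) = -log2(p_i), in bits."""
    return -math.log2(p)

print(round(info_content(0.67), 2))   # 0.58 bits
print(round(info_content(0.04), 2))   # 4.64 bits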
This definition of information has a number of nice properties. If a symbol
is certain to occur then it conveys no information: p_i = 1, and I(s_i) = 0. As
the probability of a symbol decreases, its information content increases; the
logarithm is a continuous, monotonic function. In the limit, when the probabil-
ity of a symbol or event is zero, if that event does occur, we are rightly entitled
to express an infinite amount of surprise. ("Snow in Perth", the newspaper
headlines would blare across the world.) Another consequence of Shannon's
definition is that when a sequence of independent symbols occurs, the informa-
tion content of the sequence is the sum of the individual information contents.
For example, if the sequence s_i s_j occurs with probability p_i p_j, it has informa-
tion content I(s_i s_j) = I(s_i) + I(s_j). Shannon [1948] details several more
such properties.
Given that I(s_i) is a measure of the information content of a single symbol
in bits, and the decoder need only know the information in a symbol in order to
reproduce that symbol, a code should be able to be devised such that the code-
word for s_i contains I(s_i) bits. Of course, we could make this claim for any
definition of I(s_i), even if it did not share the nice properties above. However,
Shannon's "Fundamental Theorem of a Noiseless Channel" [Shannon, 1948],
elevates I(s_i) from a convenient function to a fundamental limit.
Consider the expected codeword length of a code C derived from proba-
bility distribution P, where each symbol has (somehow!) a codeword of length
I(s_i). Let H(P) be the expected cost of such a code:

    H(P) = -Σ_{i=1}^{n} p_i log2 p_i.    (2.2)
that all codewords can be one bit long (or indeed, zero bits); but this is not very
useful in a practical sense, as symbols must be disambiguated during decoding.
A more pertinent question is: how short can codewords be so that the code is
uniquely decipherable?
If each symbol s_i has a probability that is a negative power of two, say
p_i = 2^{-k_i}, then I(s_i) = k_i is a whole number. So setting each codeword to
a string of |c_i| = k_i bits results in a code whose expected codeword length
equals Shannon's bound and thus cannot be improved. This observation was
considered by L. G. "Jake" Kraft [1949], who noted that in such a situation

    Σ_{i=1}^{n} 2^{-k_i} ≤ 1,

and that it is indeed possible to assign a code C = [c_1, c_2, ..., c_n] in which
|c_i| = k_i, and in which no codeword is a prefix of any other codeword - that
is, the code can be prefix-free. Once such a code has been calculated, a mes-
sage composed of codewords from C can be decoded from left to right one bit
at a time without ambiguity. This relationship can be inverted, and is then a
requirement for all prefix-free codes: if the quantity

    K(C) = Σ_{i=1}^{n} 2^{-|c_i|}    (2.3)
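The Kraft sum is easily evaluated from a list of codeword lengths. The short sketch below is
ours; it relies on the standard form of the Kraft result, namely that a prefix-free code with
lengths |c_i| exists exactly when K(C) ≤ 1, and that the code is complete when K(C) = 1.

def kraft_sum(lengths):
    """K(C): sum over i of 2^(-|c_i|)."""
    return sum(2.0 ** -length for length in lengths)

print(kraft_sum([3, 3, 3, 3, 3, 3]))   # 0.75: feasible, but not complete
print(kraft_sum([1, 2, 3, 4, 5, 5]))   # 1.00: feasible and complete
print(kraft_sum([1, 1, 2]))            # 1.25: no prefix-free code has these lengths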
Nevertheless, we can guess what comes next, and either form a "nope, try
again" sequence of answers that indicates a ranking of the alternatives even if
not their probabilities, or hypothesize nominal wagers on whether or not we
are right. Implicit probabilities can then be estimated, and an approximation of
the underlying information content of text calculated. Using human subjects,
researchers have done exactly this experiment. Shannon [1951] and Cover and
King [1978] undertook seminal work in this area; these, and a number of related
investigations, are summarized by Bell et al. [1990, page 93].
Assuming that English prose is composed of 26 case-less alphabetic letters,
plus "space" as a single delimiting character, the outcome of these investiga-
tions is that the information content of text is about 1.3 bits per letter. How
close to this limit actual compression systems can get is, of course, the great
unknown. As we shall see in Chapter 8, there is still a gap between the perfor-
mance of the best mechanical systems and the performance attributed to human
modelers, and although the gap continues to close as modeling techniques be-
come more sophisticated, there remains a considerable way to go.
Recall that a compression system (Figure 1.1 on page 5) consists of three com-
ponents - modeling, probability estimation, and coding. Armed with our def-
inition of entropy, and Shannon's fundamental theorem stating that given a
model of data we cannot devise a code that has an expected length better than
entropy, we can explore the effect of different models of the verse on the best
compression obtainable with that model.
A very simple model of the verse is to assume that a symbol is a single
character, and a correspondingly simple way of estimating symbol probabilities
is to assert that all possible characters of the complete character set are equally
likely.
On most computing platforms, the set of possible characters is defined by
the American Standard Code for Information Interchange (ASCII), which was
introduced in 1968 by the United States of America Standards Institute. The in-
ternational counterpart of ASCII is known as ISO 646. ASCII contains 128 al-
phabetic, numeric, punctuation, and control characters, although on most com-
puting platforms an extension to ASCII, formally known as ISO 8859-1, the
"Latin Alphabet No. I", is employed. The extension allows for 128 charac-
ters that are not typically part of the English language, and, according to the
Linux manual pages, "provides support for Afrikaans, Basque, Catalan, Dan-
ish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic,
Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, and Swedish". The
ISO 8859-1 characters are also the first 256 characters of ISO 10646, better
known as Unicode, the coding scheme supported by the Java programming
language. It is convenient and commonplace to refer to the ISO 8859-1 exten-
sion as ASCII, and that is the approach we adopt throughout the remainder of
this book. Hence there are 256 possible characters that can occur in a text.
Returning to the simple compression system, in which each of the 256 char-
acters is a symbol and all are equally likely, we have p_i = 1/256 = 0.003906
for all 1 ≤ i ≤ 256. Equation 2.2 indicates that the entropy of this probability
distribution is

    H(P) = -Σ_{i=1}^{256} (1/256) log2 (1/256) = 8.00
bits per symbol, a completely unsurprising result, even though the high entropy
value indicates each symbol is somewhat surprising. This is an example of a
static system, in which the probability estimates are independent of the actual
message to be compressed. The advantage of a static probability estimator is
that both the encoder and decoder "know" the attributes being employed, and
it is not necessary to include any probability information in the transmitted
message.
A slightly more sophisticated compression system is one which uses the
same character-based model, but estimates the probabilities by restricting the
alphabet to those characters that actually occur in the data. Blake's verse has
25 unique characters, including the newline character that marks the end of
each line. Column one of Table 2.1 shows the unique characters in the verse.
Assuming that each of the 25 symbols is equally likely, we now have p_i =
1/25 = 0.04 for all 1 ≤ i ≤ 25, and a model entropy of

    H(P) = -Σ_{i=1}^{25} (1/25) log2 (1/25) = 4.64
bits per symbol, almost half the entropy of the static model.
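Both entropy figures can be reproduced with a few lines of Python (our sketch, not code from
the book):

import math

def entropy(P):
    """H(P) = -sum of p_i * log2(p_i), in bits per symbol (Equation 2.2)."""
    return -sum(p * math.log2(p) for p in P if p > 0)

print(entropy([1 / 256] * 256))            # 8.0 bits per symbol
print(round(entropy([1 / 25] * 25), 2))    # 4.64 bits per symbol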
While it is tempting to claim that by altering the probability estimation
technique a 3.36 bits per character decrease in the space required to store the
message has been attained (assuming, of course, that we can devise a code that
actually requires 4.64 bits per symbol), there is a problem to be dealt with. Un-
like the first compression scheme, this one alters its set of symbols depending
on the input data, meaning that the set of symbols in use in each message must
be declared to the decoder before decompression can be commenced. This
compression system makes use of a semi-static estimator. It examines the mes-
sage in a preliminary pass to derive the symbol set, and includes a description
of those symbols in a prelude to the compressed data. The decoder reads the
prelude, re-creates the code, and only then commences decoding. That is, we
must include the cost of describing the code to be used for this particular mes-
sage. For example, we might use the first compression system to represent the
unique symbols. The prelude also needs to include as its first data item a count
of the number of distinct symbols, which can be at most 256. All up, for the ex-
ample message the prelude consumes 8 bits for the count of the alphabet size,
plus 25 x 8 bits per symbol for the symbol descriptions, a total of 208 bits. If we
spread this cost over all 128 symbols in the message, the total expected code-
word length using this model is 4.64 + 208/128 = 6.27 bits per symbol, rather
more than we first thought. Note that we now must concern ourselves with the
representation of the prelude as well as the representation of the message - our
suggested approach of using eight-bit codes to describe the subalphabet being
used may not be terribly effective.
A third compression system that intuitively should lead to a decrease in the
number of bits per symbol required to store the message is the same character-
based model, but now coupled with a semi-static estimator based upon the
self-probabilities of the characters in the message. That is, if symbol s_i occurs
v_i times, and there are a total of m symbols in the message, we take p_i =
v_i/m. Column two of Table 2.1 shows the frequency of occurrence of each
character in the verse, and column five the corresponding self-probabilities of
the symbols. Calculating the entropy of the resultant probability distribution
gives 4.22 bits per symbol as a lower bound. Similarly, the quantity
    -Σ_{i=1}^{n} v_i log2 (v_i/m)    (2.4)
Table 2.1: A character-based model and three different probability estimation tech-
niques for the verse from Blake's Milton. In the columns marked "Static" and "ASCII"
the entropy, average codeword length, and Kraft sum are calculated over the full
n = 256 characters in the ASCII character set. In the other columns they are cal-
culated over the n = 25 characters that appear in the message.
only a list of symbols, which we calculated we can do in 208 bits using the
first compression system, but also some indication of the probability of those
symbols. If we allow 4 bits per unique symbol to convey its frequency (quite
possibly an underestimate, but it will suffice for now), then a further 25 x
4 = 100 bits are required in the prelude. The total expected codeword length,
assuming that codes can be devised to meet the entropy bound involved, is now
(208 + 100)/128 + 4.22 = 6.63 bits per symbol; worse than the previous
simpler compression system.
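The arithmetic behind these comparisons is simple enough to capture in a small helper of our
own devising; it reproduces the roughly 6.27 and 6.63 bits-per-symbol totals quoted above.

def total_cost_per_symbol(prelude_bits, message_bits_per_symbol, m):
    """Prelude cost amortized over the m message symbols, plus the per-symbol message cost."""
    return prelude_bits / m + message_bits_per_symbol

# Equally likely subalphabet: 8 + 25 * 8 = 208 prelude bits, 4.64 bits per symbol.
print(total_cost_per_symbol(208, 4.64, 128))          # 6.265, about 6.27
# Self-probabilities: a further 25 * 4 = 100 prelude bits, 4.22 bits per symbol.
print(total_cost_per_symbol(208 + 100, 4.22, 128))    # 6.62625, about 6.63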
This example highlights one of the most difficult problems in designing a
compression system: when to stop modeling and start coding. As is the case in
this example, while the message itself has a lower cost with the more accurate
semi-static probability estimates, the cost of transmitting the information that
allowed that economy exceeded the gain that accrued from using the more
accurate estimates. As a very crude rule, the shorter the message, the more
likely it is that a simple code will result in the most compact overall package.
Conversely, the use of complex models and codes, with many parameters, can
usually be justified for long messages.
Another way of looking at these two components - model description, and
codes relative to the model - is that the first component describes an "average"
message, and the second component describes how the particular message in
question is different from that average. And what we are interested in is min-
imizing the cost of the total package, even if, in our heart, we know that the
model being used is somehow too simple. In the same way, in our real life we
sometimes allow small untruths to be propagated if it is just too tedious to ex-
plain the complete facts. (When we turn on the light switch, do electrons really
start "running through the wire"?)
The trade-off between model and message-relative-to-model is studied as
the minimum message length principle. The minimum message length idea is
a formalism of what has been introduced in this discussion, namely that the
best description for some behavior is the one that minimizes the combined
cost of a general summary of the situation, or average arrangement; plus a list
of the exceptions to the general summary, specifying this particular situation.
The not-unrelated area of machine learning also deals with models of data,
and minimizing the cost of dealing with the exceptions to the model [Witten
and Frank, 1999]. In the compression environment we are able to evaluate
competing models and the cost of using them to represent messages in a very
pragmatic way - by coupling them with an effective coder, and then counting
the number of output bits produced.
The three compression systems considered thus far have parsed the mes-
sage into character symbols, and treated them as if they were emitted by a
memoryless source. All three use a zero-order character-based model, and
the only difference between them is their mechanism for estimating probabil-
ities. But we can also change the model to create new compression systems.
A first-order model estimates the probability of each symbol in the context of
one previous symbol, to exploit any inter-symbol dependencies that may be
present. For example, consider the context established by the character "i". In
Blake's verse, "i" is followed by just three different characters:
the entropy bound of 4.64 bits per symbol for this probability distribution.
The final compression system for which we devise a code is the third one
described earlier, the semi-static model using self-probabilities. The probabil-
ity distribution P derived by this estimator is reflected in the fifth column of
Table 2.1, headed "MR". Using a technique described in Chapter 4, it is pos-
sible to devise a minimum-redundancy code based on P that has an expected
length of 4.26 bits per symbol, which is again close, but still not equal to, the
entropy bound. This is the code "MR" depicted in the final column of Table 2.1.
What then is the notion that we have been trying to convey in this section?
In essence, it is this: there are myriad choices when it comes to designing a
compression system, and care is required with each of them. One must choose
a model, a probability estimation technique, and finally a coding method. If
the modeling or coding method must transmit parameters in the form of a pre-
lude, then a representation for the prelude must also be chosen. (There is also
a corresponding choice to be made for an adaptive estimator, but that problem
is deferred until Chapter 6.) The probability estimator must be chosen con-
sidering the attributes of the model, and the coder must be chosen taking into
account the attributes of the probability estimator.
Using the entropy and Kraft measures allows fine tuning of coding methods
without the need to actually perform the encoding and decoding. Of course the
ultimate test of any compression scheme is in the final application of a working
program and a count of the number of bits produced on a corpus of standard
test files. One piece of advice we can pass on - learned the hard way! - is that
you should never, ever, laud the benefits of your compression scheme without
first implementing a decompressor and verifying that the output of the decom-
pressor is identical to the input to the compressor, across a wide suite of test
files. Excellent compression is easily achieved if the decoder is not sent all of
the components necessary for reassembly! Indeed, this is exactly the principle
behind the lossy compression techniques used for originally-analog messages
such as image and audio data. Lossy methods deliberately suppress some of the
information contained in the original, and aim only to transmit sufficient con-
tent that when an approximate message is reconstructed, the viewer or listener
will not feel cheated. Lossy modeling techniques are beyond the scope of this
book, and with the exception of a brief discussion in Section 8.5 on page 251,
are not considered.
Chapter 3
Static Codes
The simplest coding methods are those that ignore or make only minimal use
of the supplied probabilities. In doing so, their compression effectiveness may
be relatively poor, but the simple and regular codewords that they assign can
usually be encoded and decoded extremely quickly. Moreover, some compres-
sion applications are such that the source probabilities Pi have a distribution to
which the regular nature of these non-parameterized codes is well suited.
This chapter is devoted to such coding methods. As will be demonstrated,
they are surprisingly versatile, and are essential components of a coding toolkit.
We suppose throughout that a message M is to be coded, consisting of m
integers x_i, each drawn from a source alphabet S = [1 ... n], where n is the
size of the alphabet. We also assume that the probability distribution is non-
increasing, so that p_1 ≥ p_2 ≥ ... ≥ p_n. Some of the codes discussed allow an
infinite source alphabet, and in these cases the probabilities are assumed to be
p_1 ≥ p_2 ≥ ... ≥ p_i ≥ ... > 0 over the source alphabet S = [1 ...].
Algorithm 3.1
Use a unary code to represent symbol x, where 1 ≤ x.
unary_encode(x)
1: while x > 1 do
2:     put_one_bit(1)
3:     set x ← x - 1
4: put_one_bit(0)
Algorithm 3.2
Use a minimal binary code to represent symbol x, where 1 ≤ x ≤ n.
minimal_binary_encode(x, n)
1: set b ← ⌈log2 n⌉
2: set d ← 2^b - n
3: if x > d then
4:     put_one_integer(x - 1 + d, b)
5: else
6:     put_one_integer(x - 1, b - 1)

Return a value x assuming a minimal binary code for 1 ≤ x ≤ n.
minimal_binary_decode(n)
1: set b ← ⌈log2 n⌉
2: set d ← 2^b - n
3: set x ← get_one_integer(b - 1)
4: if (x + 1) > d then
5:     set x ← 2 × x + get_one_bit()
6:     set x ← x - d
7: return x + 1
Use "div" and "mod" operations to isolate and represent the nbits low-order
bits of binary number x.
pULone_integer(x, nbits)
1: for i +- nbits - 1 down to 0 do
2: set b +- (x div 2i) mod 2
3: pULoneJJit(b)
Return an nbits-bit binary integer 0 ~ x < 2nbits constructed from the next
nbits input bits.
geLone_integer(nbits)
1: set x+-O
2: for i +- nbits - 1 down to 0 do
3: set x+-2 x x + geLone_bitO
4: return x
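A direct transcription of the minimal binary encoder into Python (our sketch, producing
codewords as strings of bits rather than driving a real bit stream, and assuming n ≥ 2) can
be handy when checking codeword assignments by hand:

def minimal_binary(x, n):
    """Minimal binary codeword for symbol x, 1 <= x <= n, as a string of bits."""
    b = (n - 1).bit_length()         # ceiling of log2(n) for n >= 2
    d = 2 ** b - n                   # number of symbols that receive the shorter codewords
    if x > d:
        return format(x - 1 + d, "0{}b".format(b))
    return format(x - 1, "0{}b".format(b - 1))

print([minimal_binary(x, 5) for x in range(1, 6)])
# ['00', '01', '10', '110', '111']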
Table 3.1: Elias, Golomb, and Rice codes. The blanks in the codewords are to assist
the reader, and do not appear in the coded bitstream.
non-zero bit of every binary code is a "1" and need not be stored, hence the
subtraction when coding the binary part.
In the algorithms literature this coding method is known as exponential
and binary search, and was described by Bentley and Yao [1976]. To see how
exponential and binary search operates, suppose a key must be located in a
sorted array of unknown size. Probes to the 1st, 3rd, 7th, 15th, 31st (and so
on) entries of the array are then made, searching for a location - any location
- at which the stored value is greater than the search key. Once such an upper
bound is determined, an ordinary constrained binary search is performed. If the
key is eventually determined to be in location x, then llog2 x J + 1 probes will
have been made during the exponential part of the search, and at most llog2 x J
probes during the binary search - corresponding closely to the number of bits
required by the Elias C y code. In the same way, a normal binary search over a
sorted set corresponds to the use of a binary code to describe the index of the
item eventually found by the search.
Another way to look at these two searching processes is to visualize them as
part of the old "I'm thinking of a number, it's between 1 and 128" game. Most
people would more naturally use n = 100 as the upper bound, but n = 128
is a nice round number for our purposes here. We all know that to minimize
the number of yes/no questions in such a game, we must halve the range of
options with each question, and the most obvious way of doing so is to ask, as
a first question, "Is it bigger than 64?" Use of a halving strategy guarantees
that the number can be identified in ⌈log2 n⌉ questions - which in the example
is seven. When the puzzle is posed in this form, the binary search undertaken
during the questioning corresponds exactly to a binary code - a "yes" answer
yields another "1" bit, and a "no" answer another "0" bit. When all bits have
been specified, we have a binary description of the number 0 ≤ x - 1 < n.
In the same framework, a unary code corresponds to the approach to this
problem adopted by young children - "Is it bigger than 1?", "Is it bigger than
2?", "Is it bigger than 3?", and so on: a linear search.
The Elias C_γ is also a searching strategy, this time to the somewhat more
challenging puzzle "I'm thinking of a positive number, but am not going to tell
you any more than that". We still seek to halve the possible range with each
question, but because the range is infinite, can no longer assume that all values
in the range are equi-probable. And nor do we wish to use a linear search, for
fear that it will take all day (or all year!) to find the mystery number. In the
Elias code the first question is "Is it bigger than 1?", as a "no" answer gives
a one-bit representation for the answer x = 1: the codeword "0" shown in
the first row in Table 3.1. And if the answer is "yes", we ask "is it bigger
than 3"; and if "yes" again, "is it bigger than 7", "bigger than 15", and so on.
Eventually a "no" will be forthcoming, and a binary convergence phase can be
entered. Hence the name "exponential and binary search" - the questions fall
into two sets, and the first set is used to establish the magnitude of the number.
In the second Elias code shown in Table 3.1, the prefix part is coded using
C_γ rather than unary, and the codeword for x requires 1 + 2⌊log2 log2 2x⌋ +
⌊log2 x⌋ bits. This gives rise to the C_δ code, which also corresponds to an
algorithm for unbounded searching in a sorted array.
The amazing thing about the Elias codes is that they are shorter than the
equivalent unary codes at all but a small and finite number of codewords. The
Cγ code is longer than unary only when x = 2 or x = 4, and in each case
by only one bit. Similarly, the Cδ code is longer than Cγ only when x ∈
[2 . . . 3, 8 . . . 15]. On the other hand, for large values of x both Elias codes are
not just better than unary, but exponentially better.
Algorithm 3.3 shows how the two Elias codes are implemented. Given
this description, it is easy to see how further codes in the same family are
recursively constructed: the next member in the sequence uses Cδ to represent
the prefix part, and requires approximately log2 x + log2 log2 x bits.
Algorithm 3.3
Use Elias's Cγ code to represent symbol x, where 1 ≤ x.
elias_gamma_encode(x)
1: set b ← 1 + ⌊log2 x⌋
2: unary_encode(b)
3: put_one_integer(x − 2^(b−1), b − 1)
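The following Python sketch mirrors Algorithm 3.3, emitting and consuming strings of "0" and "1" characters rather than a real bit stream; the helper names follow the pseudocode, but the string-based interface is an assumption made here for illustration.

def unary_encode(x):
    # x >= 1 is coded as x - 1 "1" bits followed by a terminating "0".
    return "1" * (x - 1) + "0"

def elias_gamma_encode(x):
    # 1 <= x: a unary prefix giving b = 1 + floor(log2 x), then the b - 1
    # bits of x that follow its leading 1-bit.
    b = x.bit_length()
    return unary_encode(b) + bin(x)[3:]

def elias_gamma_decode(bits, pos=0):
    # Inverse of elias_gamma_encode(); returns (x, new_pos).
    b = 1
    while bits[pos] == "1":
        b, pos = b + 1, pos + 1
    pos += 1                           # skip the terminating "0"
    x = 1
    for _ in range(b - 1):
        x, pos = 2 * x + int(bits[pos]), pos + 1
    return x, pos

# elias_gamma_encode(5) gives "11001", and elias_gamma_decode("11001") gives
# (5, 5); the codeword has the expected 1 + 2*floor(log2 5) = 5 bits.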
The Elias codes are sometimes called universal codes. To see why, consider
the assumed probability distribution P in which p1 ≥ p2 ≥ · · · ≥ pn. Because
of the probability ordering, px is less than or equal to 1/x for all 1 ≤ x ≤ n,
since if not, for some value x we must have Σ_{j=1}^{x} pj > Σ_{j=1}^{x} (1/x) = 1, which
contradicts the assumption that the probabilities sum to one. But if px ≤ 1/x, then
in a zero-redundancy code the codeword for symbol x is at least log2 x bits
long (Equation 2.1 on page 16). As a counterpoint to this lower limit, the
Elias codes offer codewords that are log2 x + f(x) bits long, where f(x) is
Θ(log x) for Cγ, and is o(log x) for Cδ. That is, the cost of using the Elias
codes is within a multiplicative constant factor and a secondary additive term
of the entropy for any probability-sorted distribution. They are universal in the
sense of being fixed codes that are provably "not too bad" on any decreasing
probability distribution.
Because they result in reasonable codewords for small values of x and log-
arithmically short codewords for large values of x, the Elias Cδ and Cγ codes
have been used with considerable success in the compression of indexes for
text database systems [Bell et al., 1993, Witten et al., 1999].
Both Elias codes can also be viewed as dividing the positive integers into a
sequence of buckets of sizes (1, 2, 4, 8, 16, . . .), that is, buckets which grow
exponentially in size. The difference between them
is that unary is used as the bucket selector code in Cγ, while Cγ is used as the
selector in the Cδ code.
Another important class of codes - the Golomb codes [1966] - use a fixed-
size bucket, of size specified by a parameter b, combined with a unary selector:
(b,b,b,b, ... ).
Algorithm 3.4 illustrates the actions of encoding and decoding using a Golomb
code. Note the use of the minimal binary code to represent the value within
each bucket, with the short codewords assigned to the least values. Note also
that for simplicity of description a "div" operation, which generates the integer
quotient of the division (so that 17 div 5 = 3) has been used in the encoder, and
Algorithm 3.4
Use a Golomb code to represent symbol x, where 1 ≤ x, and b is the
parameter of the Golomb code.
golomb_encode(x, b)
1: set q ← (x − 1) div b and r ← x − q × b
2: unary_encode(q + 1)
3: minimal_binary_encode(r, b)
a multiply in both encoder and decoder. All three of these operations can be
replaced by loops that do repeated subtraction (in the encoder) and addition (in
the decoder); and because each loop iteration is responsible for the generation
or consumption of one compressed bit, the inefficiency introduced is small.
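A direct transcription of Algorithm 3.4 into Python might look as follows. The minimal binary coder follows the convention described above (short codewords to the least values), and the string-of-bits interface is again only an illustrative assumption.

def minimal_binary_encode(v, n):
    # Code v in 1..n; the first s = 2^ceil(log2 n) - n values get the short,
    # (k-1)-bit, codewords and the remainder get k bits.
    k = (n - 1).bit_length()
    s = (1 << k) - n
    if v <= s:
        return format(v - 1, "0%db" % (k - 1))
    return format(v - 1 + s, "0%db" % k) if k > 0 else ""

def golomb_encode(x, b):
    # Algorithm 3.4: unary quotient, then a minimal binary remainder in 1..b.
    q = (x - 1) // b
    r = x - q * b
    return "1" * q + "0" + minimal_binary_encode(r, b)

# With b = 5, golomb_encode(17, 5) gives "1110" + "01": a quotient of q = 3
# coded in unary, followed by the remainder r = 2 in two bits.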
Rice codes [1979] are a special case of Golomb codes, in which the pa-
rameter b is chosen to be 2^k for some integer k. This admits a particularly
simple implementation, in which the value x to be coded is first shifted right
k bits to get a value that is unary coded, and then the low-order k bits of the
original value x are transmitted as a k-bit binary value. The final two columns
of Table 3.1 show examples of Golomb and Rice codewords. The last column,
showing a Rice code with k = 2, is also a Golomb code with b = 4. Also
worth noting is that a Rice code with parameter k = 0, which corresponds to a
Golomb code with b = 1, is identical to the unary code described in Section 3.1
on page 29.
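In Python, and again assuming a string-of-bits interface, the shift-and-mask view of a Rice code can be sketched like this; the treatment of x − 1 rather than x reflects the 1 ≤ x convention used throughout this chapter.

def rice_encode(x, k):
    # Golomb code with b = 2**k: the high-order part of x - 1 in unary,
    # the low-order k bits in binary.
    y = x - 1
    q, r = y >> k, y & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, "0%db" % k) if k > 0 else "")

def rice_decode(bits, k, pos=0):
    # Inverse of rice_encode(); returns (x, new_pos).
    q = 0
    while bits[pos] == "1":
        q, pos = q + 1, pos + 1
    pos += 1
    r = int(bits[pos:pos + k], 2) if k > 0 else 0
    return (q << k) + r + 1, pos + k

# rice_encode(3, 2) gives "010": an empty unary part (just the terminating "0")
# followed by the two low-order bits of 3 - 1.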
Both Golomb and Rice codes have received extensive use in compression
applications. Golomb codes in particular have one property that makes them
very useful. Consider a sequence of independent tosses of a biased coin - a
sequence of Bernoulli trials with probability of success given by p. Let px be
the probability of the next success taking place after exactly x trials, with p1 =
p, p2 = (1 − p)p, p3 = (1 − p)^2 p, and, in general, P = [(1 − p)^(x−1) p | 1 ≤ x].
If P has this property for some fixed value p, it is a geometric distribution, and
a Golomb code with parameter b chosen so that (1 − p)^b ≈ 1/2, that is, with
b approximately (log_e 2)/p, is a minimum-redundancy code for that distribution. This
surprising result was first noted by Gallager and Van Voorhis [1975].
To understand the relationship between Golomb codes and geometric dis-
tributions, consider the codewords for two symbols x and x + b, where b is
the parameter controlling the Golomb code. Because x and x + b differ by b,
the codewords for these two symbols must differ in length by 1 - after all, that
is how the code is constructed. Hence, if |c_x| is the length of the codeword
for x, then |c_{x+b}| = |c_x| + 1, and, by virtue of the codeword assigned, the
inferred probability of x + b must be half the inferred probability of x. But we
also know that p_x = (1 − p)^(x−1) p, that p_{x+b} = (1 − p)^(x+b−1) p, and thus that
p_{x+b}/p_x = (1 − p)^b. Putting these two relationships together suggests that b
should be chosen to satisfy (1 − p)^b ≈ 1/2.
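In code, a sketch of this parameter choice (using simple rounding; the exact integer condition is the one established by Gallager and Van Voorhis, and the function name is an assumption made here) could be:

import math

def golomb_parameter(p):
    # Choose b so that (1 - p)**b is as close as possible to 1/2; for small p
    # this is approximately (log_e 2) / p, or about 0.69 / p.
    return max(1, round(-1.0 / math.log2(1.0 - p)))

# golomb_parameter(0.1) returns 7, and golomb_parameter(0.01) returns 69.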
To derive this bound, suppose at first that (by luck) b = (log_e 2)(B/m) turns
out to be a power of two. The bits in the Golomb codes can be partitioned
into three components: the binary components of the m codewords, which,
when b is a power of two, always amount to exactly m log2 b bits; the m "0"
bits with which the m unary components terminate; and the at most (B -
T)/b bits in unary codes that are "1", where T is the sum of the m binary
components. To understand the final contribution, recall that each "1" bit in
any unary component indicates an additional gap of b, and that the sum of all
of the gaps cannot exceed B - T once T units have been accounted for in the
binary components.
When b is a power of two, the smallest possible value for T is m, as every
binary component - the remainder r in function golomb_encode() - is at least
one. Adding in the constraint that b = (log_e 2)(B/m) and simplifying shows
that the total number of bits consumed cannot exceed
When b is not a power of two, the binary part of the code is either ⌊log2 b⌋
or ⌈log2 b⌉ bits long. When it is the former, Equation 3.2 continues to hold.
But to obtain a worst-case bound, we must presume the latter.
Suppose that b is not a power of two, and that g = 2^⌈log2 b⌉ is the next
power of two greater than b. Then the worst that can happen is that each binary
component is s + 1, where s = g − b is the number of short codewords as-
signed by the minimal binary code. That is, the worst case is when each binary
component causes the first of the long codewords to be emitted. In this case
quantity T must, on a per gap basis, decrease by s, as the first long codeword
corresponds to a binary component of s + 1. Compared to Equation 3.2, the
net bit increase per gap is given by
bits per symbol, where the second line follows from the first because the sum
Σ_{i=1}^{∞} p_i = 1, and the expected value of the geometric distribution is given by
Σ_{i=1}^{∞} i p_i = 1/p; the third line follows from the second because log2(1 − p) ≈
−p log2 e when p is small compared to 1; and the fourth line follows from the
third because (1 − p) ≈ 1 when p is small compared to 1.
Equation 3.3 gives a value that is rather less than the bound of Equation 3.1,
and if a random m-subset of the integers 1 ... B is to be coded, a Golomb code
will require, in an expected sense, approximately m(1.5 + log2(B/m)) bits.
But there is no inconsistency between this result and that of Equation 3.1 - the
sequences required to drive the Golomb code to its worst-case behavior are far
from geometric, and it is unsurprising that there is a non-trivial difference be-
tween the expected behavior on random sequences and the worst-case behavior
on pathological sequences. If anything, the surprise is on the upside - even
with malicious intent and a pathological sequence, only half a bit per gap of
damage can be done to a Golomb code, compared to the random situation that
it handles best.
The Golomb code again corresponds to an array searching algorithm, a
mechanism noted and described by Hwang and Lin [1972]. In terms of the
"I'm thinking of a number" game, the Golomb code is the correct attack on
the puzzle posed as "I'm thinking of m distinct numbers, all between 1 and B;
what is the smallest?"
Rice codes have similar properties to Golomb codes. When coding a set of
m integers that sum to B the parameter k should be set to k = ⌊log2(B/m)⌋,
which corresponds to b = 2^⌊log2(B/m)⌋. The worst case cost of a Rice code is
the same as that of a Golomb code:
bits can be required, but never more. The worst case arises when the binary
component of each Rice codeword corresponds to a remainder of r = 1, so a
worst case sequence of length 100 could consist of 99 repetitions of 1, followed
by 6,300, which pushes a Rice code to a total of 796 bits. On the "Golomb-
bad" sequence [21,21,21, ... ,4320] discussed earlier, the Rice code requires
734 bits; and on the "Rice-bad" [1,1,1, ... ,6300] sequence the Golomb code
requires 743 bits. In general, if the worst-case number of bits in the coded out-
put must be bounded, a Rice code should be preferred; if the average (assuming
m random values) length of the coded sequence is to be minimized, a Golomb
code should be used.
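For the subset-coding setting just described - m values with a known bound B on their sum - the two parameter choices can be computed directly; the helper below is a sketch written for this discussion rather than a function from the text.

import math

def subset_parameters(m, B):
    # Golomb: b approximately (log_e 2) * (B / m); Rice: k = floor(log2(B / m)).
    b = max(1, round(math.log(2.0) * B / m))
    k = max(0, math.floor(math.log2(B / m)))
    return b, k

# For the m = 100, B = 6,399 example used above this gives b = 44 and k = 5
# (a Rice bucket of 2**5 = 32); with those parameters the 734- and 743-bit costs
# quoted for the two pathological sequences are reproduced.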
Rice codes also have one other significant property compared to Golomb
codes: the space of possible parameter values is considerably smaller. If a
tabulation technique is being used to determine the parameter for each symbol
in the message on the fly, Rice codes are the method of choice.
Generalizations of Elias and Golomb codes have also been described, and
used successfully in situations in which geometrically-growing buckets are re-
quired, but with a first bucket containing more than one item. For example,
Teuhola [1978] describes a method for compressing full-text indexes that is
controlled by a vector of this form, with a parameter setting the size of the first bucket.
A related issue arises when the source alphabet is symmetric about zero,
containing symbols . . . , −2, −1, 0, +1, +2, . . . ; such an alphabet can be
handled by mapping each symbol x onto a non-negative integer

    f(x) = 0 if x = 0;  2x − 1 if x > 0;  −2x if x < 0.
The modified alphabet S′ can then be handled by any of the static codes de-
scribed in this chapter, with Rice and Golomb codes being particularly appro-
priate in many situations. But the symmetry inherent in the original probability
distribution is no longer handled properly. For example, a Rice code with k = 1
assigns the codewords "00", "01", "100", "101", "1100", and "1101" to sym-
bols 0, +1, −1, +2, −2, and +3 respectively, and is biased in favor of the positive
values. To avoid this difficulty, a code based upon an explicit selector vector
can be used, with an initial bucket containing an odd number of codewords, and
then subsequent buckets each containing an even number of codewords. The
Elias codes already have this structure, but might place too much emphasis on
x = 0 for some applications.
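A small sketch of the mapping f and its inverse (the function names are assumptions made here), together with the way it would be combined with the one-origin codes of this chapter:

def fold_signed(x):
    # The mapping f described above: 0, +1, -1, +2, -2, ... become 0, 1, 2, 3, 4, ...
    if x == 0:
        return 0
    return 2 * x - 1 if x > 0 else -2 * x

def unfold_signed(v):
    # Inverse of fold_signed().
    if v == 0:
        return 0
    return (v + 1) // 2 if v % 2 == 1 else -(v // 2)

# Because the codes in this chapter number their source symbols from 1, symbol x
# would be transmitted as, for example, rice_encode(fold_signed(x) + 1, k).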
Consider, for example, the message M = [1, 1, 1, 2, 2, 2, 2, 4, 3, 1, 1, 1]. The
first step is to form the list L of cumulative sums of the symbols in M:
L = [1, 2, 3, 5, 7, 9, 11, 15, 18, 19, 20, 21].
The list L is then encoded using a recursive mechanism that follows the struc-
ture of a preorder traversal of a balanced binary tree. First the root of the tree,
corresponding to the central item of the list L, is encoded; and then the left
subtree is recursively encoded (that is, the list of items in L that are to the left
of the central item); and then the right subtree is encoded. This sequence of
operations is illustrated in the pseudo-code of Algorithm 3.5.
Consider the example list L. It contains m = 12 items. Suppose that m is
known to the decoder, and also that the final cumulative sum L[12] is less than
or equal to the bound B = 21. The reasonableness of these assumptions will be
discussed below. The middle item of L (at h = 6) is L[6] = 9, and is the first
value coded. The smallest possible value for L[6] is 6, and the largest possible
value is 15. These bounds follow because if there are m = 12 symbols in
total in the list, there must be m1 = 5 values prior to the 6th, and m2 =
6 values following the 6th. Thus the middle value of the list of cumulative
sums can be encoded as a binary integer 6 ≤ L[h] ≤ 15. Since there are
Algorithm 3.5
Use an interpolative binary code to represent the m symbol message M,
where 1 ≤ M[i] for 1 ≤ i ≤ m.
interpolative_encode_block(M, m)
1: set L[1] ← M[1]
2: for i ← 2 to m do
3:     set L[i] ← L[i − 1] + M[i]
4: if an upper bound B ≥ L[m] is not agreed with the decoder then
5:     set B ← L[m], and encode(B)
6: recursive_interpolative_encode(L, m, 1, B)
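The recursive step that Algorithm 3.5 calls can be sketched in Python as follows. The 0-based indexing and the string-of-bits result are implementation choices made here rather than part of the pseudocode; with these conventions the example list L comes out as exactly the 18-bit "Code 1" sequence of Table 3.2.

def minimal_binary_encode(v, n):
    # As in Algorithm 3.2: v in 1..n, short codewords to the smallest values.
    if n == 1:
        return ""                       # a forced value costs no bits at all
    k = (n - 1).bit_length()
    s = (1 << k) - n
    if v <= s:
        return format(v - 1, "0%db" % (k - 1))
    return format(v - 1 + s, "0%db" % k)

def recursive_interpolative_encode(L, left, right, lo, hi):
    # Code the cumulative sums L[left..right] (inclusive), all known to lie
    # between lo and hi: middle value first, then the two halves recursively.
    if left > right:
        return ""
    h = (left + right) // 2
    m1, m2 = h - left, right - h        # items before and after the middle one
    low, high = lo + m1, hi - m2        # tightest possible range for L[h]
    bits = minimal_binary_encode(L[h] - low + 1, high - low + 1)
    bits += recursive_interpolative_encode(L, left, h - 1, lo, L[h] - 1)
    bits += recursive_interpolative_encode(L, h + 1, right, L[h] + 1, hi)
    return bits

# recursive_interpolative_encode([1, 2, 3, 5, 7, 9, 11, 15, 18, 19, 20, 21],
#                                0, 11, 1, 21)
# returns "011000110111010101", a total of 18 bits.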
Figure 3.1: Example of interpolative coding applied to the sequence M. Gray re-
gions correspond to possible values for each cumulative sum in L. Vertical solid lines
show the demarcation points between different recursive calls at the same level in the
preorder traversal of the underlying tree.
15 - 6 + 1 = 10 values in this range, either a three bit or a four bit code is used.
The first number-line in Figure 3.1, and the first row of Table 3.2 show this step,
including the range used for the binary code, and the actual output bitstream
"011" (column "Code 1") generated by a minimal binary coder (Algorithm 3.2
on page 31) when coding the 4th of 10 possible values. Column "Code 2" will
be discussed shortly.
Once the middle value of the list L has been coded, the m1 = 5 values to
the left are treated recursively. Now the maximum possible value (that is, the
upper bound on L[5]) is given by L[6] − 1 = 8. The sublist in question contains
5 values, the middle one of which is L[3] = 3. In this subproblem there must
be two values to the left of center, and two to the right, and so 3 ≤ L[3] ≤ 6
is established. That is, L[3] is one of four possible values, and is coded in two
bits - "00". The left-hand half of the second number-line of Figure 3.1 shows
this situation, and the second row of Table 3.2 (again, in the column "Code 1")
Table 3.2: Binary interpolative coding the cumulative list L. Each row shows the result
of coding one of the L[h] values in the range lo + m1 . . . hi − m2. Using a minimal
binary code (column "Code 1") the coded sequence is "011 00 01 10 111 010 101", a
total of 18 bits. Using the centered minimal binary code described in Algorithm 3.5,
the coded sequence is (column "Code 2") "001 10 11 0 100 110 01", a total of 16 bits.
were used by the encoder, and so decoding can always take place successfully.
Consider again the example shown in Figure 3.1 and Table 3.2. Quite amaz-
ingly, using a minimal binary code the total message of 12 symbols is coded
in just 18 bits, an average of 1.5 bits per symbol. This value should be com-
pared with the 21 bits required by a Golomb code for the list M (using b = 1,
which is the most appropriate choice of parameter) and the 26 bits required by
the Elias Cγ code. Indeed, this coding method gives every appearance of being
capable of real magic, as the self-information (Equation 2.4 on page 22) of the
message M is 1.63 bits per symbol, or a minimum of 20 bits overall. Unfortu-
nately, there are two reasons why this seeming paradox is more sleight-of-hand
than true magic.
The first is noted in steps 4 and 5 of function interpolative_encode_block()
in Algorithm 3.5: it is necessary for the decoder to know not just the number
of symbols m that are to be decoded, but also an upper bound B for L[m], the
sum of the symbol values. In Algorithm 3.5 transmission of this latter value, if
it cannot be assumed known, is performed using a generic encode() function,
and in an implementation would be accomplished using the Cδ code or some
similar mechanism for arbitrary integers. For the example list, a Cδ code for
L[m] = 21 requires that 9 additional bits be transmitted. On the other hand,
an exact coder (of the kind that will be discussed in Chapters 4 and 5) that can
exploit the actual probability distribution must know either the probabilities,
[6/12,4/12,1/12,1/12], or codewords calculated from those probabilities, if
it is to code at the entropy-based lower bound. So it should also be charged
for parameters, making the entropy-based bound unrealizable. One could also
argue that all methods must know m if they are to stop decoding after the
correct number of symbols. These issues cloud the question as to which code
is "best", particularly for short messages where the prelude overheads might be
a substantial fraction of the message bits. The issue of charging for parameters
needed by the decoder will be considered in greater detail in later chapters.
The other reason for the discrepancy between the actual performance of
the interpolative code in the example and what might be expected from Equa-
tion 2.4 is that the numbers in the cumulative message L are clustered at
the beginning and end, and the interpolative code is especially good at ex-
ploiting localized patterns of this kind. Indeed, the interpolative code was
originally devised as a mechanism for coding when the frequent symbols are
likely to occur in a clustered manner [Moffat and Stuiver, 2000]. For the list
M' = [2,1,2,1,2,1,2,1,3,1,4,1], which has the same self-entropy as M but
no clustering, the interpolative method (using the minimal binary code assumed
by "Code 1") generates the sequence "011 10 10 0 0 1 100 01 0 11 00", a total
of 20 bits, and the magic is gone.
The performance of interpolative_encode_block() can be slightly improved
by observing that the shorter binary codes should be given to a block of symbols
at the center of each coding range rather than at the beginning. For example,
when the range is six, the minimal binary coder allocates codewords of length
[2,2,3,3,3,3]. But in this application there is no reason to favor small values
over large. Indeed, the middle value in a list of cumulative sums is rather more
likely to be around half of the final value than it is to be near either of the
extremities. That is, the codeword lengths [3,3,2,2,3,3] for the six possible
values are more appropriate. This alteration is straightforward to implement,
and is the reason for the introduction of functions centered_binary_in_range()
and centered_minimal_binary_encode() in Algorithm 3.5. The latter function
rotates the domain by an amount calculated to make the first of the desired
short codewords map to integer 1, and then uses minimal_binary_encode() to
represent the resultant mapped values.
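A sketch of the rotation, with the same string-of-bits conventions as before; the amount of rotation used here is chosen so that, with the minimal binary coder shown earlier, the short codewords land in the middle of the range.

def minimal_binary_encode(v, n):
    if n == 1:
        return ""
    k = (n - 1).bit_length()
    s = (1 << k) - n
    if v <= s:
        return format(v - 1, "0%db" % (k - 1))
    return format(v - 1 + s, "0%db" % k)

def centered_minimal_binary_encode(v, n):
    # Rotate 1..n so that the block of short codewords sits at the centre of
    # the range, then apply the ordinary minimal binary code.
    k = (n - 1).bit_length()
    s = (1 << k) - n                    # number of short codewords
    c = (n - s) // 2 + 1                # the value that should map to 1
    return minimal_binary_encode((v - c) % n + 1, n)

# For n = 6 the codeword lengths become [3, 3, 2, 2, 3, 3]; substituting this
# function into the interpolative sketch above yields the 16-bit "Code 2"
# sequence of Table 3.2.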
The column headed "Code 2" of Table 3.2 shows the effect of using a cen-
tered minimal binary code. The codeword for L[5] = 7 becomes one bit shorter
when coded in the range 6 to 8, and the codeword for L[8] = 15 also falls into
the middle section of its allowed range and receives a codeword one bit shorter.
Using the full implementation of the interpolative code the example message
M can thus be transmitted in 16 bits. Message M' is similarly reduced to
19 bits. Again, both encodings are subject to the assumption that the decoder
knows that B = 21 is an upper bound for L[m].
Moffat and Stuiver [2000] give an analysis of the interpolative code, and
show that for m integers summing to not more than B the cost of the code -
not counting the cost of pre-transmitting B - is never more than
bits. This is a worst-case limit, and holds for all combinations of m and B, and,
once m and B are fixed, for any set of m distinct integers summing to B or less.
Using the same m = 100 and B = 6,399 values employed above, one obvi-
ous bad sequence for the interpolative code is [1,129,1,129,1,129, ... ,1,28].
This sequence requires 840 bits when represented with the interpolative code,
which is 2.40 + log2(B/m) bits per symbol, and is close to the bound of Equation 3.4. It is not
clear whether other sequences exist for which the constant is greater than 2.4.
Finally, as an additional heuristic that improves the measured performance
of the interpolative code when the probability distribution is biased in favor
of small values, a "reverse centered minimal binary code" should be used at
the lowest level of recursion when m = 1 in recursive_interpolative_encode()
(Algorithm 3.5 on page 43). Allocating the short codewords to the low and high
values in the range is the correct assignment when a single value is being coded
if p1 is significantly higher than the other probabilities. Unfortunately, the
example list M fails to show this effect, and use of a reverse centered minimal
binary code when m = 1 on the example list M adds back the two bits saved
through the use of the centered binary code.
The Cγ code, for example, corresponds to the implied probability distribution

    p_x = 2^−(1 + 2 log2 x) = 1/(2x^2).
Another well-known distribution is the Zipf distribution [Zipf, 1949]. The
rationale for this distribution is the observation that in nature the most frequent
happening is often approximately twice as likely as the second most frequent,
three times as likely as the third most frequent, and so on. Hence, a Zipf distri-
bution over an alphabet of n symbols is given by
    p_x = 1/(Z x), where Z = Σ_{j=1}^{n} 1/j = log_e n + O(1).
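The constant Z and the resulting entropy are easily checked numerically; the following short sketch (written for this discussion) reproduces the entropy values listed for the Zipf rows of Table 3.3.

import math

def zipf_probabilities(n):
    # p_x = 1 / (Z * x), with Z the n-th harmonic number.
    Z = sum(1.0 / j for j in range(1, n + 1))
    return [1.0 / (Z * x) for x in range(1, n + 1)]

def entropy(P):
    # Entropy of the distribution P, in bits per symbol.
    return -sum(p * math.log2(p) for p in P)

# entropy(zipf_probabilities(5)) is about 2.06 and entropy(zipf_probabilities(50))
# about 4.61, matching the Zipf5 and Zipf50 rows of Table 3.3.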
List          n   P                                     Entropy
Uniform50    50   [0.02, 0.02, 0.02, 0.02, 0.02, ...]      5.64
Geometric50  50   [0.10, 0.09, 0.08, 0.07, 0.07, ...]      4.64
Zipf50       50   [0.22, 0.11, 0.07, 0.06, 0.05, ...]      4.61
Zipf5         5   [0.44, 0.22, 0.15, 0.11, 0.09]           2.06
Skew5         5   [0.80, 0.10, 0.05, 0.03, 0.02]           1.07
Veryskew3     3   [0.97, 0.02, 0.01]                       0.22
(a)
Table 3.3: Compression of random sequences: (a) six representative probability distri-
butions and the entropy (bits per symbol) of those distributions; and (b) performance
of five coding methods (bits per symbol) for random lists of 1,000 symbols drawn
from those distributions, with the best result for each sequence highlighted in gray. In
the case of the binary code, the parameter n is included in the cost and is transmitted
using Cδ; in the case of the Golomb code, the parameter b is included in the cost and
is transmitted using Cγ; and in the case of the interpolative code the value Σ x_i − m
is included in the cost, and transmitted using Cδ. The value of m is assumed free of
charge in all cases. The interpolative code implementation uses a centered minimal
binary code when m > 1, and a reverse centered minimal binary code when m = 1.
compression performance (in bits per symbol) of the five main coding mecha-
nisms described in this chapter for random sequences of 1,000 symbols drawn
from the six distributions. The cost of any required coding parameters are in-
cluded; note that, because of randomness, the self-information of the generated
sequences can differ from the entropy of the distribution used to generate that
sequence. This is how the Golomb code "beats" the entropy limit on file Geo-
metric50.
Unsurprisingly, the minimal binary code is well-suited to the uniform dis-
tribution. It also performs well on the Zipf5 distribution, mainly because it allo-
cates two-bit codewords to the three most frequent symbols. On the other hand,
the fifty-symbol Zipf50 probability arrangement is best handled by a Golomb
code (as it turns out, with b = 8, which makes it a Rice code). In this case the
Zipfian probabilities can be closely approximated by a geometric distribution.
The Golomb code is a clear winner on the Geometric50 sequence, as expected.
The two skew probability distributions are best handled by the interpolative
coder. For the sequence Veryskew3 the average cost per symbol is less than
a third of a bit - more than two thirds of the symbols are deterministically
predicted, and get coded as empty strings. This is a strength of the interpolative
method: it achieves excellent compression when the entropy of the source is
very low. The interpolative code also performs reasonably well on all of the
other distributions, scoring three second places over the remaining four files.
Finally, note the behavior of the two true universal codes, Cγ and Cδ. Both
perform tolerably well on all of the probability distributions except for Uni-
form50, and are reliable defaults. Moreover, their performance (and also that
of the Golomb code) would be improved if use was made of the known bound
on n, the alphabet size (50, 5, or 3 for the lists tested). As implemented for
the experiments, these three methods handle arbitrarily large integers, and so
waste a certain fraction of their possible codewords on symbols that cannot oc-
cur. For example, when n = 5 a truncated Cγ code yields codeword lengths
of |c| = [1, 3, 3, 3, 3] (instead of |c| = [1, 3, 3, 5, 5]), and on lists Zipf5 and
Skew5 gives compression of 2.12 bits per symbol and 1.40 bits per symbol
respectively.
A similar modification might also be made to the Golomb code, if the max-
imum symbol value were isolated and transmitted prior to the commencement
of coding. But while such tweaking is certainly possible, and in many cases
serves to improve performance, it is equally clear from these results that there
is no universal solution - a static code may ignore the probability distribution
and still get acceptable compression, but if good compression is required re-
gardless of the distribution, a more general mechanism for devising codes must
be used.
Chapter 4
Minimum-Redundancy Coding
We now turn to the more general case illustrated by the "Code 3" column in
Table 1.1 on page 7. It is the best of the three listed codes because, somehow,
its set of codeword lengths better matches the probability distribution than do
the other two sets. Which forces the question: given a sorted list of symbol
probabilities, how can a set of prefix-free codewords be assigned that is best
for that data? And what is really meant by "best"?
The second question is the easier to answer. Let P be a probability dis-
tribution, and C a prefix-free code over the channel alphabet {0, 1}. Further,
let E(C, P) be the expected codeword length for C, calculated using Equa-
tion 1.1 on page 7. Then C is a minimum-redundancy code for distribution P
if E(C, P) ≤ E(C′, P) for every n-symbol prefix-free code C′. That is, a
code is minimum-redundancy for a probability distribution if no other prefix-
free code exists that requires strictly fewer bits per symbol on average. Note
that designing a minimum-redundancy code is not as simple as just choosing
short codewords for all symbols, as the Kraft inequality serves as a balancing
requirement, tending to make at least some of the codewords longer. It is the
tension between the Kraft requirement and the need for the code to have a low
expected length that determines the exact shape of the resultant code.
Now consider the first question. Given an arbitrary set of symbol proba-
bilities, how can we generate a minimum-redundancy code? This chapter is
devoted to the problem of finding such prefix codes, and using them for encod-
ing and decoding.
Figure 4.1: Example of the use of the Shannon-Fano algorithm for the proba-
bility distribution P = [0.67, 0.11, 0.07, 0.06, 0.05, 0.04] to obtain the code C =
["0", "100", "101", "110", "1110", "1111"].
Figure 4.2: Example of the use of Huffman's greedy algorithm for the input prob-
ability distribution P = [0.67, 0.11, 0.07, 0.06, 0.05, 0.04] to obtain the code C =
["0", "100", "110", "111", "1010", "1011"]. At each step the newly created package is
indicated in gray.
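The package-merging process of Figure 4.2 can be imitated with a few lines of Python and a heap. The sketch below is an illustrative rendering (and far less economical than the implementations developed later in this chapter); it returns only the codeword lengths, which is all that is needed to construct a minimum-redundancy code.

import heapq
from itertools import count

def huffman_code_lengths(P):
    # Repeatedly package the two least-weight items; every symbol inside a new
    # package moves one level further from the eventual root.
    ties = count()                       # keeps heap entries comparable
    heap = [(p, next(ties), [i]) for i, p in enumerate(P)]
    heapq.heapify(heap)
    depth = [0] * len(P)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            depth[s] += 1
        heapq.heappush(heap, (p1 + p2, next(ties), syms1 + syms2))
    return depth

# huffman_code_lengths([0.67, 0.11, 0.07, 0.06, 0.05, 0.04]) returns
# [1, 3, 3, 3, 4, 4], the codeword lengths of the code shown in Figure 4.2.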
still minimum-redundancy. In Figure 4.2 the prefix selector bits are assigned
according to the rule "one for the symbols in the less probable package and
zero for the symbols in the more probable package", but this is arbitrary, and
a fresh choice can be made at every stage. Over the n − 1 merging opera-
tions there are thus 2^(n−1) distinct Huffman codes, all of which are minimum-
redundancy. Indeed, a very important point is that any assignment of prefix-
free codewords that has the same codeword lengths as a Huffman code is a
minimum-redundancy code, but that not all minimum-redundancy codes are
one of the 2^(n−1) Huffman codes. That is, there may be additional minimum-
redundancy codes that cannot be achieved via Huffman's algorithm, and for
efficiency reasons we might - and indeed will - deliberately choose to use a
minimum-redundancy code that is not a Huffman code. For example, the third
code in Table 1.1 cannot be the result of a strict application of Huffman's algo-
rithm. This notion is explored below in Section 4.3.
One further point is worth noting, and that is the handling of ties. Con-
sider the probabilities P = [0.4, 0.2, 0.2, 0.1, 0.1]. Both of |c| = [2, 2, 2, 3, 3]
and |c| = [1, 2, 3, 4, 4] result in an expected codeword length of 2.20 bits per
symbol. In this case the difference is not just a matter of labelling; instead,
it arises from the manner in which the least weight package is chosen when
there is more than one package of minimal weight. Schwartz [1964] showed
that if ties are resolved by always preferring a package that contains just one
node - that is, by favoring packages containing a single symbol x for which the
tentative code is still marked as λ - then the resultant code will have the short-
est possible maximum codeword length. This strategy works because it defers
the merging of any current multi-symbol packages, thereby delaying as long as
possible the further extension of the codewords in those packages, which must,
by construction, already be non-empty.
The sequence of mergings performed by Huffman's algorithm leads di-
rectly to a tree-based visualization. For example, Figure 4.3a shows the code
tree associated with the Huffman code constructed in Figure 4.2. Indeed, any
prefix-free code can be regarded as a code tree, and Figure 4.3b shows part
of the infinite tree corresponding to the Golomb code (with b = 5) shown in
Table 3.1 on page 33.
Visualization of a code as a tree is helpful in the sense of allowing the
prefix-free nature of the code to be seen: in the tree there is a unique path from
the root to each leaf, and the internal nodes do not represent source symbols.
The Huffman tree also suggests an obvious encoding and decoding strategy:
explicitly build the code tree, and then traverse it edge by edge, emitting bits in
the encoder, and in the decoder using input bits to select edges. Although cor-
rect, this tree-based approach is not particularly efficient. The space consumed
by the tree might be large, and the cost of an explicit pointer access-and-follow
Figure 4.3: Examples of code trees: (a) a Huffman code; and (b) a Golomb code with
b = 5. Leaves are shown in white, and are labelled with their symbol number. Internal
package nodes are gray. The second tree is infinite.
operation per bit makes encoding and decoding relatively slow. By way of con-
trast, the procedures described in Algorithm 3.4 on page 37 have already shown
that explicit construction of a code tree is unnecessary for encoding and decod-
ing Golomb codes. Below we shall see that for minimum-redundancy coding
we can also eliminate the explicit code tree, and that minimum-redundancy en-
coding and decoding can be achieved with compact and fast loops using only
small amounts of storage. We also describe a mechanism that can be used to
construct Huffman codes simply and economically.
Huffman's algorithm has other applications outside the compression do-
main. Suppose that a set of n sorted files is to be pairwise merged to make
a single long sorted file. Suppose further that the ith file initially contains v_i
records, and that in total there are m = Σ_{i=1}^{n} v_i records. Finally, suppose (as
is the case for the standard merging algorithm) that the cost of merging lists
containing v_s and v_t records is O(v_s + v_t) time. The question at issue is de-
termination of a sequence of two-file mergings so as to minimize the total cost
of the n-way merge; the answer is to take p_i = v_i/m, and apply Huffman's
method to the n resulting weights. The length of the ith codeword |c_i| then
indicates the number of merges in which the ith of the original files should be
involved, and any sequence of pairwise merges that results in records from file
i participating in |c_i| merge steps is a minimum-cost merge. The more general
problem of sorting lists that contain some amount of pre-existing order - where
order might be expressed by mechanisms other than by counting the number
of sorted runs - has also received attention [Moffat and Petersson, 1992, Pe-
tersson and Moffat, 1995], and it is known that the best that can be done in the
n-way merging problem is
comparisons. The similarity between this and the formulation given earlier for
self-information (Equation 2.4 on page 22) is no coincidence.
array, which records the symbol number that corresponds to the first of the ℓ-
bit codewords. These two arrays are shown in the third and fourth columns of
Table 4.1b. The final column of Table 4.1b will be discussed below. If w_ℓ is
the number of ℓ-bit codewords, then the array base is described by
    base[ℓ] = 0 if ℓ = 1, and base[ℓ] = 2 × (base[ℓ − 1] + w_{ℓ−1}) otherwise.
Using this notation, the kth of the ℓ-bit codewords is the ℓ low-order bits of the
value base[ℓ] + (k − 1) when it is expressed as a binary integer. For example,
in Table 4.1a the first four-bit codeword is for symbol number five, which is
the value of offset[4] in Table 4.1b; and the code for that symbol is "1110",
which is the binary representation of the decimal value 14 stored in base[4].
By using these two arrays, the codeword corresponding to any symbol can be
calculated by first determining the length of the required codeword using the
offset array, and then its value by performing arithmetic on the corresponding
base value. The resultant canonical encoding process is shown as function
canonical_encode() in Algorithm 4.1. Note that a sentinel value offset[L + 1] =
n + 1 is required to ensure that the while loop in canonical_encode() always
terminates. The offset array is scanned sequentially to determine the codeword
length; this is discussed further below.
The procedure followed by function canonical_encode() is simple, and fast
to execute. It also requires only a small amount of memory: 2L + O(1) words
for the arrays, plus a few scalars - and the highly localized memory access pat-
tern reduces cache misses, contributing further to the high speed of the method.
In particular, there is no explicit codebook or Huffman tree as would be re-
quired for a non-canonical code. The canonical mechanism does, however,
require that the source alphabet be probability-sorted, and so for applications
in which this is not true, an n word mapping table is required to convert a raw
symbol number into an equivalent probability-sorted symbol number. Finally,
note also that the use of linear search to establish the value of ℓ is not a dom-
inant cost, since each execution of the while loop corresponds to exactly one
bit in a codeword. On the other hand, the use of an array indexed by symbol
number x that stores the corresponding codeword length may be an attractive
trade between decreased encoding time and increased memory.
Consider now the actions of the decoder. Let V be an integer variable
storing L as yet unprocessed bits from the input stream, where L is again the
length of a longest codeword. Since none of the codewords is longer than
L, integer V uniquely identifies both the length ℓ of the next codeword to be
decoded, and also the symbol x to which that codeword corresponds. That is,
a lookup table indexed by V, storing symbol numbers and lengths, suffices for
decoding. For the example code and L = 4, the lookup table has 16 entries,
Algorithm 4.1
Use a canonical code to represent symbol x, where 1 ≤ x ≤ n, assuming
arrays base and offset have been previously calculated.
canonical_encode(x)
1: set ℓ ← 1
2: while x ≥ offset[ℓ + 1] do
3:     set ℓ ← ℓ + 1
4: set c ← (x − offset[ℓ]) + base[ℓ]
5: put_one_integer(c, ℓ)
Return a value x assuming a canonical code, and assuming that arrays base,
offset, and lj_limit have been previously calculated. Variable V is the current
L-bit buffer of input bits, where L is the length of a longest codeword.
canonical_decode()
1: set ℓ ← 1
2: while V ≥ lj_limit[ℓ] do
3:     set ℓ ← ℓ + 1
4: set c ← right_shift(V, L − ℓ) and V ← V − left_shift(c, L − ℓ)
5: set x ← (c − base[ℓ]) + offset[ℓ]
6: set V ← left_shift(V, ℓ) + get_one_integer(ℓ)
7: return x
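The following Python sketch builds the three arrays from a list of codeword lengths and then applies them. The array-construction details - in particular the expression used for lj_limit - are one reading of the description in the text, with symbols numbered 1 to n in probability-sorted order and a string-of-bits interface assumed for illustration.

def build_canonical_tables(lengths):
    # lengths[x - 1] is the codeword length of symbol x, non-decreasing in x.
    L = max(lengths)
    w = [0] * (L + 2)                          # w[l]: number of l-bit codewords
    for l in lengths:
        w[l] += 1
    base, offset = [0] * (L + 2), [0] * (L + 2)
    offset[1] = 1
    for l in range(2, L + 2):
        base[l] = 2 * (base[l - 1] + w[l - 1])
        offset[l] = offset[l - 1] + w[l - 1]   # offset[L + 1] ends up as n + 1
    # lj_limit[l]: smallest L-bit V inconsistent with a codeword of l bits or less.
    lj_limit = [0] * (L + 1)
    for l in range(1, L + 1):
        lj_limit[l] = (base[l] + w[l]) << (L - l)
    return base, offset, lj_limit, L

def canonical_encode(x, base, offset, L):
    l = 1
    while x >= offset[l + 1]:
        l += 1
    return format(base[l] + (x - offset[l]), "0%db" % l)

def canonical_decode_all(bits, base, offset, lj_limit, L, count):
    bits += "0" * L                            # padding so V always holds L bits
    V, pos, out = int(bits[:L], 2), L, []
    for _ in range(count):
        l = 1
        while l < L and V >= lj_limit[l]:
            l += 1
        c = V >> (L - l)                       # the l-bit codeword just recognized
        out.append(c - base[l] + offset[l])
        V = ((V << l) & ((1 << L) - 1)) + int(bits[pos:pos + l], 2)
        pos += l
    return out

# For the lengths [1, 3, 3, 3, 4, 4] of the running example, base[4] is 14 and
# offset[4] is 5, so canonical_encode(5, ...) yields "1110".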
of which the first eight (indexed from zero to seven) indicate symbol 1 and a
one-bit code.
The problem with this exhaustive approach is the size of the lookup table.
Even for a small alphabet, such as the set of ASCII characters, the longest
codeword could well be 15-20 bits (see Section 4.9 on page 88), and so large
amounts of memory might be required. For large source alphabets, such as
English words, codeword lengths of 30 bits or more may be encountered. For-
tunately, it is possible to substantially collapse the lookup table while still re-
taining most of the speed.
Consider the column headed lj_limit in Table 4.1b. Each of the entries
in this column corresponds to the smallest value of V (again, with L = 4)
that is inconsistent with a codeword of ℓ bits or less. For example, the value
lj_limit[1] = 8 indicates that if the first unresolved codeword in V is one
bit long, then V must be less than eight. The values in the lj_limit array are
calculated from the array base:
Table 4.2: The array start for the example canonical code, for z = 1, z = 2, and
z = 3. The choice of z can be made by the decoder when the message is decoded, and
does not affect the encoder in any way.
To see how the start array is used, suppose that V contains the four bits
"1100". Then a two-bit start table (the third column of Table 4.2) indexed by
V2 = "11" (three in decimal) indicates that the smallest value that ℓ can take
is 3, and the linear search in function canonical_decode() can be commenced
from that value - there is no point in considering smaller ℓ values for that prefix.
Indeed, any time that the ℓ value so indicated is less than or equal to the value
of z that determines the size of the start table, the result of the linear search is
completely determined, and no inspections of lj_limit are required at all. The
tests on lj_limit are also avoided when the start table indicates that the smallest
possible codeword length is L, the length of a longest codeword.
That is, step 1 of function canonical_decode() in Algorithm 4.1 can be
replaced by initializing ℓ to the value start[right_shift(V, L − z)], and the search
of steps 2 and 3 of function canonical_decode() should be guarded to ensure
that if ℓ ≤ z the while loop does not execute.
The speed improvement of this tactic arises for two complementary rea-
sons. First, the linear search is accelerated and in many cases completely
circumvented, with the precise gain depending upon the value z and its re-
lationship to L. Second, it is exactly the frequent symbols that get the greatest
benefit, since they are the ones with the short codes. Using this mechanism the
number of inspections of lj_limit might be less than one per symbol: a reduction
achieved without the 2^L-word memory overhead of a full lookup table.
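One way to build the start array is simply to run the usual lj_limit search on the smallest V value consistent with each z-bit prefix, as in the following sketch (written for this discussion, using the tables of the previous sketch):

def build_start_table(lj_limit, L, z):
    # start[j] is the smallest codeword length consistent with j being the
    # first z bits of the L-bit buffer V.
    start = []
    for j in range(1 << z):
        V = j << (L - z)                 # smallest V with that prefix
        l = 1
        while l < L and V >= lj_limit[l]:
            l += 1
        start.append(l)
    return start

# For the running example (L = 4) and z = 2 this gives [1, 1, 3, 3]: the prefixes
# "00" and "01" can only begin the one-bit codeword, while "10" and "11" imply a
# codeword of at least three bits.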
As a final remark, it should be noted that the differences between Algo-
rithm 4.1 and the original presentation [Moffat and Turpin, 1997] of the table-
based method are the result of using non-increasing probabilities here, rather
than the non-decreasing probabilities assumed by Moffat and Turpin.
Table 4.3: Example of finite-state machine decoding with k = 2 and five states, s1
to s5. Each table entry indicates the next state, and, in gray, the symbols to be out-
put as part of that transition. The second heading row shows the partial codeword
corresponding to each state.
"00" completes the codeword for symbol five, and also completes a codeword
for symbol one. After the codeword for symbol one, there are no remaining
unresolved bits. Hence, the entry in the table for the combination of state s5
and input of "00" shows a move to s1 (symbol λ denotes the empty string)
and the output of symbol 5 followed by symbol 1. Note that this method does
not require that the code be canonical. Any minimum-redundancy code can be
processed in this way.
The drawback of the method is memory space. At a minimum, a list of 2k
"next state" pointers must be maintained at each of the nodes in the finite-state
machine, where k is the number of bits processed in each operation. That is, the
total storage requirement is O(n 2^k). In a typical character-based application
(n = 100 and k = 8, say) this memory requirement is manageable. But when
the alphabet is larger - say n = 100,000 - the memory consumed is unreason-
able, and very much greater than is required by function canonical_decode().
Nor is the speed advantage as great as might be supposed: the large amount of
memory involved, and the pointer-chasing that is performed in that memory,
means that on modern cache-based architectures the tight loops and compact
structure of function canonical_decode() are faster. Choueka et al. also de-
scribe variants of their method that reduce the memory space at the expense of
increased running time, but it seems unlikely that these methods can compare
with canonical decoding.
A related mechanism is advocated by Hashemian [1995], who recommends
the use of a canonical code, together with a sequence of k-bit tables that speed
the decoding process. Each table is indexed directly by the next k bits of the
input stream, and each entry in each table indicates either a symbol number and
the number of bits (of the k that are currently being considered) that must be
used to complete the codeword for this symbol; or a new table number to use
to continue the decoding. Table 4.4 shows the tables that arise for the example
[Figure 4.4 plots decode memory space (bytes, logarithmic scale) against decode
speed (Mb/min) for the methods CKP k = 8, CKP k = 4, Huffman tree, Hashemian
k = 4, Canonical, and Canonical+start.]
Figure 4.4: Decode speed and decode memory space for minimum-redundancy de-
coding methods, using a zero-order character-based model with n = 96.
canonical code of Table 4.1a (page 58) when k = 2. A similar method has
been described by Bassiouni and Mukherjee [1995]. Because all of the short
codewords in a canonical code are lexicographically adjacent, this mechanism
saves a large fraction of the memory of the brute-force approach, but is not as
compact - or as fast - as the method captured in function canonical_decode().
Figure 4.4, based on data reported by Moffat and Turpin [1997], shows the
comparative speed and memory space required by several of these decoding
mechanisms when coupled with a zero-order character-based model and exe-
cuted on a Sun SPARC computer to process a 510 MB text file. The method
of Choueka et al. [1985] is fast for both k = 4 and k = 8, but is beaten by a
small margin by the canonical method when augmented by an eight-bit start
array to accelerate the linear search, as was illustrated in Table 4.2 on page 62.
Furthermore, when k = 8 the CKP method requires a relatively large amount
of memory. The slowest of the methods is the explicit tree-based decoder, de-
noted "Huffman tree" in Figure 4.4.
Several of the mechanisms shown in Figure 4.4 need an extra mapping in
the encoder and decoder that converts the original alphabet of symbols into a
probability-sorted alphabet of ordinal symbol numbers. The amount of mem-
ory required is model dependent, and varies from application to application.
In the character-based model employed in the experiments summarized in Fig-
ure 4.4, two 256-entry arrays are sufficient. More complex models with larger
source alphabets require more space.
Algorithm 4.2
Calculate codeword lengths for a minimum-redundancy code for the symbol
frequencies in array P, where P[1] ≥ P[2] ≥ · · · ≥ P[n]. Three passes are
made: the first, operating from n down to 1, assigns parent pointers for
multi-symbol packages; the second, operating from 1 to n, assigns codeword
lengths to these packages; the third, operating from 1 to n, converts these
internal node depths to a corresponding set of leaf depths.
calculate_huffman_code(P, n)
1: set r ← n and s ← n
2: for x ← n down to 2 do
3:     if s < 1 or (r > x and P[r] < P[s]) then
4:         set P[x] ← P[r], P[r] ← x, and r ← r − 1
5:     else
6:         set P[x] ← P[s] and s ← s − 1
7:     if s < 1 or (r > x and P[r] < P[s]) then
8:         set P[x] ← P[x] + P[r], P[r] ← x, and r ← r − 1
9:     else
10:        set P[x] ← P[x] + P[s] and s ← s − 1
Figure 4.5: Example of the use of function calculate_huffman_code() for the proba-
bility distribution P = [67, 11, 7, 6, 5, 4] to obtain |c| = [1, 3, 3, 3, 4, 4].
shows the weight of the two components contributing to each package, and the
row marked (c) shows the final weight of each package after step 10. Row
(d) shows the parent pointers stored at the end of steps 1 to 10 of function
calculate_huffman_code() once the loop has completed. The values stored in
the first two positions of the array at this time have no relevance to the sub-
sequent computation, and are not shown. Note that in Chapter 2, we used v_i
to denote the unnormalized probability of symbol s_i, and p_i (or equivalently,
P[i]) to denote the normalized probability, p_i = v_i/m, where m is the length of
the message. We now blur the distinction between these two concepts, and we
use p_i (and P[i]) interchangeably to indicate either normalized or unnormal-
ized probabilities. Where the difference is important, we will indicate which
we mean. For consistency of types, in Algorithm 4.2 the array P passed as an
argument is assumed to contain unnormalized integer probabilities. Figure 4.5
thus shows the previous normalized probabilities scaled by a factor of 100.
The second pass at steps 11 to 13 - operating from left to right - converts
these parent pointers into internal node depths. The root node of the tree is
represented in location 2 of the array, and it has no parent; every other node
points to its parent, which is to the left of that node, with a smaller index.
Setting P[2] to zero, and thereafter setting P[x] to be one greater than the
depth of its parent, that is, to P[P[x]] + 1, is thus a correct labelling. Row (e)
of Figure 4.5 shows the depths that result. There is an internal node at depth
0, the root; one at depth 1 (the other child of the root is a leaf); two at depth 2
(and hence no leaves at this level); and one internal node at depth 3.
The final pass at steps 14 to 20 of function calculate_huffman_code() con-
verts the n − 1 internal node depths into n leaf node depths. This is again
performed in a left to right scan, counting how many nodes are available (vari-
able a) at each depth d, how many have been used as internal nodes at this depth
(variable u), and assigning the rest as leaves of depth d at pointer x. Row (f) of
Figure 4.5 shows the final set of codeword lengths, ready for the construction
of a canonical code.
Note that the presentation of function calculate_huffman_code() in Algo-
rithm 4.2 assumes in several places that the Boolean guards on "if" and "while"
statements are evaluated only as far as is necessary to determine the outcome:
in the expression "A and B" the clause B will be evaluated only if A is de-
termined to be true; and in the expression "A or B" the clause B will be
evaluated only if A is determined to be false.
In the case when the input probabilities are not already sorted there are
two alternative procedures that can be used to develop a minimum-redundancy
code. The first is obvious - simply sort the probabilities, using an additional
n-word index array to record the eventual permutation, and then use the in-
place process of Algorithm 4.2. Sorting an n-element array takes O(n log n)
time, which dominates the cost of actually computing the codeword lengths. In
terms of memory space, n words suffice for the index array, and so the total
cost is n + O(1) additional words over and above the n words used to store the
symbol frequencies.
Alternatively, the codeword lengths can be computed by a direct appli-
cation of Huffman's algorithm. In this case the appropriate data structure to
use is a heap - a partially-ordered implicit tree stored in an array. Sedgewick
[1990], for example, gives an implementation of Huffman's algorithm using a
heap that requires 5n + O(1) words of memory in total; and if the mechanism
may be destructive and overwrite the original symbol frequencies (which is the
modus operandi of the inplace method in function calculate_huffman_code())
then a heap-based process can be implemented in a total of n + O(1) additional
words [Witten et al., 1999], matching the memory required by the in-place al-
ternative described in Algorithm 4.2. Asymptotically, the running time of the
two alternatives is the same. Using a heap priority queue structure a total of
O(n log n) time is required to process an n symbol alphabet, since on a total of
2n − 4 different occasions the minimum of a set of as many as n values must
be determined and modified (either removed or replaced), and with a heap each
such operation takes O(log n) time.
Table 4.5: Statistics for the word and non-word messages generated when 510 MB
of English-language newspaper text with embedded SGML markup is parsed using
a word-based model. Note that p_i is the unnormalized probability of symbol s_i in a
probability-sorted alphabet.
by applying a word-based model to 510 MB of text drawn from the Wall Street
Journal, which is part of the large TREC corpus [Harman, 1995]. The values in
the table illustrate the validity of Zipf's observation - the n = 289,101 distinct
words correspond to just r = 5,411 different word frequencies.
Consider how such a probability distribution might be represented. In Sec-
tion 4.5 it was assumed that the symbol frequencies were stored in an n element
array. Suppose instead that they are stored as an r element array of pairs (p; f),
where p is a symbol frequency and f is the corresponding number of times
that symbol frequency p appears in the probability distribution. For data ac-
cumulated by counting symbol occurrences, this representation will then be
significantly more compact - about 11,000 words of memory versus 290,000
words for the WSJ.Words data of Table 4.5. More importantly, the condensed
representation can be processed faster than an array representation.
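Forming the condensed representation is straightforward; the sketch below (written for this discussion) orders the pairs by decreasing frequency, as assumed by Algorithm 4.3.

from collections import Counter

def to_runlength_pairs(freqs):
    # Condense a list of symbol frequencies into (p; f) pairs: p is a frequency
    # value and f the number of symbols that have it, largest p first.
    return sorted(Counter(freqs).items(), reverse=True)

# to_runlength_pairs([6, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1]) returns
# [(6, 1), (3, 2), (2, 4), (1, 5)], the distribution coded in Figure 4.6.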
What happens when Huffman's algorithm is applied to such distributions?
For the WSJ.Words probability distribution (Table 4.5), in which there are
more than 96,000 symbols that have p_i = 1, the first 48,000 steps of Huffman's
method (Algorithm 4.2 on page 67) each combine two symbols of weight 1
into a package of weight 2. But with a condensed, or runlength representation,
all that is required is that the pair (1; 96,000) be bulk-packaged to make the
pair (2; 48,000). That is, in one step all of the unit-frequency symbols can be
packaged. More generally, if p is the current least package weight, and there are
f packages of that weight - that is, the pair (p; f) has the smallest p component
of all outstanding pairs - then the next f/2 steps of Huffman's algorithm can
be captured in the single replacement of (p; f) by (2p; f/2). We will discuss
the problem caused by odd f components below. The first part of the process
is shown in Algorithm 4.3, in which a queue of (p; f) pairs is maintained, with
each pair recording a package weight p and a repetition counter f, and with the
Algorithm 4.3
Calculate codeword lengths for a minimum-redundancy code for the symbol
frequencies in array P, where P = [(p_i; f_i)] is a list of r pairs such that
p_1 > p_2 > · · · > p_r and Σ_{i=1}^{r} f_i = n, the number of symbols in the
alphabet. This algorithm shows the packaging phase of the algorithm. The
initial list of packages is the list of symbol weights and the frequency of
occurrence of each of those weights.
calculate_runlength_code(P, r, n)
1: while the packaging phase is not completed do
2:     set child1 ← remove_minimum(P), and let child1 be the pair (p; f)
3:     if f = 1 and P is now empty then
4:         the packaging phase is completed, so exit the loop and commence the
           extraction phase (Algorithm 4.4) to calculate the codeword lengths
5:     else if f > 1 is even then
6:         create a pair new with the value (2 × p; f/2) and insert new into P in
           the correct position, with new marked as "internal"
7:         set new.first_child ← child1 and new.other_child ← child1
8:     else if f > 1 is odd then
9:         create a pair new with the value (2 × p; (f − 1)/2) and insert new into
           P in the correct position, with new marked "internal"
10:        set new.first_child ← child1 and new.other_child ← child1
11:        insert the pair (p; 1) at the head of P
12:    else if f = 1 and P is not empty then
13:        set child2 ← remove_minimum(P), and let child2 be the pair (q; g)
14:        create a pair new with the value (p + q; 1) and insert new into P in the
           correct position, with new marked "internal"
15:        set new.first_child ← child1, and new.other_child ← child2
16:        if g > 1 then
17:            insert the pair (q; g − 1) at the head of P
Algorithm 4.4
Continuation of function calculate_runlength_code(P, r, n) from
Algorithm 4.3. In this second phase the directed acyclic graph generated in
the first phase is traversed, and a pair of depths and occurrence counts
assigned to each of the nodes.
queue ordered by increasing p values. At each cycle of the algorithm the pair
with the least p value is removed from the front of the queue and processed.
Processing of a pair consists of doing one of three things.
First, if f = 1 and there are no other pairs in P then the packaging phase
of the process is finished, and the first stage of the algorithm terminates. This
possibility is handled in steps 3 and 4. The subsequent process of extracting
the codeword lengths is described in Algorithm 4.4, and is discussed below.
Second, if f > 1 then the algorithm can form one or more new packages all
of the same weight, as outlined above. If f is even, this is straightforward, and
is described in steps 6 and 7. When f is odd, not all of the packages represented
by the current pair are consumed, and in this case (steps 9 to 11) the current
pair (p; f) is replaced by two pairs, the second of which is a reduced pair (p; 1)
that will be handled during a subsequent iteration of the main loop.
The final possibility is that f = 1 and P is not empty. In this case the
single package represented by the pair (p; f) must be combined with a single
package taken from the second pair in the queue P. Doing so may or may not
exhaust this second pair, since it too might represent several packages. These
various situations are handled by steps 13 to 17.
When the queue has been exhausted, and the last remaining pair has a rep-
etition count of f = 1, a directed acyclic graph structure of child pointers has
been constructed. There is a single root pair with no parents, which corresponds
to the root of the Huffman code tree for the input probabilities; and every other
node in the graph is the child of at least one other node. Because each node has
two children (marked by the pointers first_child and other_child) there may be
multiple paths from the root node to each other node, and each of these possi-
ble paths corresponds to one codeword, of bit-length equal to the length of that
path. Hence, a simple way of determining the codeword lengths is to exhaus-
tively explore every possible path in the graph with a recursive procedure. Such
a procedure would, unfortunately, completely negate all of the saving achieved
by using pairs, since there would be exactly one path explored for every symbol
in the alphabet, and execution would of necessity require Ω(n) time.
Instead, a more careful process is used, and each node in the graph is vis-
ited just twice. Algorithm 4.4 gives details of this technique. The key to the
improved mechanism is the observation that each node in the graph (represent-
ing one pair) can only have two different depths associated with it, a conse-
quence of the sibling property noted by Gallager [1978] and described in Sec-
tion 6.4. Hence, if the nodes are visited in exactly the reverse order that they
were created, each internal node can propagate its current pair of depths and
their multiplicities to both of its children. The first time each node is accessed
it is assigned the lesser of the two depths it might have, because that depth cor-
responds to the shortest of the various paths to that node. At any subsequent
accesses via other parents of the same depth as this parent (steps 8 to 10) the
two counters are incremented by the multiplicity of the corresponding counters
for that parent. On the other hand, step 12 caters for the case when the child is
already labelled with the same depth as the parent that is now labelling it. In
this case the parent must have an other_count of zero, and only the first_count
needs to be propagated, becoming an other_count (that is, a count of nodes at
depth one greater than indicated by the depth of that node) at the child node.
The result of this procedure is that three values are associated with each
of the original (p; f) pairs of the input probability distribution, which are the
only nodes in the structure not marked as being "internal". The first of these
is the depth of that pair, and all of the f symbols in the original source alpha-
bet that are of probability p are to have codewords of length either depth or
depth + 1. The exact number of each length is stored in the other two fields
that are calculated for each pair: first_count is the number of symbols that
should be assigned codewords of length depth, and other_count is the number
of symbols that should be assigned codewords of length depth + 1. That is,
f = first_count + other_count.
Figure 4.6 shows the action of calculate_runlength_code() on the probabil-
ity distribution P = [(6; 1), (3; 2), (2; 4), (1; 5)], which has a total of n = 12
symbols in r = 4 runlengths. The edges from each node to its children are also
shown. The four white nodes in the structure are the leaf nodes corresponding
to the original runs; and the gray nodes represent internal packages. The pro-
cessing moves from right to left, with the gray node labelled "p = 2; f = 2" the
first new node created. The root of the entire structure is the leftmost internal
node.
How much time is saved compared with the simpler O(n) time method
of function calculate_huffman_code() in Algorithm 4.2? The traversal phase
shown in Algorithm 4.4 clearly takes O(1) time for each node produced during
the first packaging phase, since the queue operations all take place in a sorted
list with sequential insertions. To bound the running time of the whole method
it is thus sufficient to determine a limit on the number of nodes produced during
the first phase shown in Algorithm 4.3. Each iteration requires the formation of
exactly one new node. Each iteration does not, however, necessarily result in
the queue P getting any shorter, since in some circumstances an existing node
is retained as well as a new node being added. Instead of using the length of P
as a monotonically decreasing quantity, consider instead the value
Φ(P), defined in terms of a quantity Δ(P) that takes one value when the pair
(p; f) at the head of P has a repetition count of f = 1, and another value
otherwise.
94,000, which is about 1/3 of the value of n; and the function Φ(P) is about
50,000. Moreover, the analysis is pessimistic, since some steps decrease Φ by
more than one. Experimentation with an implementation of the method shows
that the calculation of a minimum-redundancy code for the WSJ.Words distri-
bution can be carried out with the formation of just 30,000 pairs.
At this point the reader may well be thinking that the runlength mech-
anism is interesting, but not especially useful, since it is only valid if the
source probability distribution is supplied as a list of runlengths, and that is
unlikely to happen. In fact, it is possible to convert a probability-sorted ar-
ray representation of the kind assumed for function calculate_huffman_code()
(Algorithm 4.2 on page 67) into the runlength representation used by function
calculate_runlength_code() in time exactly proportional to Φ(P). Moreover,
the conversion process is an interesting application of the Elias Cγ code de-
scribed in Section 3.2 on page 32.
Suppose that P is an array of n symbol frequency counts in some message
M, sorted so that P[i] ≥ P[i + 1]. Suppose also that a value j has been
determined for which P[j − 1] > P[j]. To find the number of entries in P that
have the same frequency as P[j] we examine the entries P[j + 1], P[j + 3],
P[j + 7], P[j + 15] and so on in an exponential manner, until one is found
that differs from P[j]. A binary search can then be used to locate the last entry
P[j'] that has the same value as P[j]. That is, an exponential and binary search
should be used. If P[j] contains the kth distinct value in P, then the kth pair
(pk; fk) of the runlength representation for P must thus be (P[j]; j' − j + 1).
The exponential and binary searching process then resumes from P[j' + 1].
The cost of determining a pair (pk; fk) is approximately 2 log2 fk - the
cost in bits of the Cγ code for integer fk - meaning that the total cost of the r
searches required to find all of the runlengths is

Σ_{k=1}^{r} (1 + 2 log2 fk) = O(r + r log(n/r)),
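As a concrete illustration, the following C sketch (our own names and 0-origin indexing, not those used in the pseudocode of this chapter) extracts the run pairs from a non-increasing frequency array by probing P[j + 1], P[j + 3], P[j + 7], and so on, and then binary searching; the sample frequencies correspond to the runlength list P = [(6; 1), (3; 2), (2; 4), (1; 5)] used in Figure 4.6.

    #include <stdio.h>

    /* Given P[0..n-1] sorted into non-increasing order, return the index of
       the last entry equal to P[j]: probe P[j+1], P[j+3], P[j+7], ... until a
       different value (or the end of the array) is met, then binary search. */
    static int find_run_end(const int P[], int n, int j)
    {
        int off = 1;                          /* probe offsets 1, 3, 7, 15, ... */
        while (j + off < n && P[j + off] == P[j])
            off = 2 * off + 1;
        int lo = j + off / 2;                 /* last probed index known to match */
        int hi = (j + off < n) ? j + off : n - 1;
        while (lo < hi) {                     /* binary search for the last match */
            int mid = (lo + hi + 1) / 2;
            if (P[mid] == P[j])
                lo = mid;
            else
                hi = mid - 1;
        }
        return lo;
    }

    int main(void)
    {
        /* frequencies corresponding to the runlength list
           P = [(6; 1), (3; 2), (2; 4), (1; 5)] of Figure 4.6 */
        int P[] = { 6, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1 };
        int n = (int)(sizeof P / sizeof P[0]);
        int j = 0;
        while (j < n) {
            int j2 = find_run_end(P, n, j);
            printf("(p = %d; f = %d)\n", P[j], j2 - j + 1);
            j = j2 + 1;
        }
        return 0;
    }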
Algorithm 4.5
Calculate codeword lengths for a minimum-redundancy code for the symbol
frequencies in array P, where P = [(pi; fi)] is a list of r pairs, with each pi
an integral power of two, and p1 > p2 > ... > pr. In each tuple fi is the
corresponding repetition count, and Σ_{i=1}^{r} fi = n.
calculate_twopower_code(P, r, n)
1: for d ← 0 to ⌊log2 m⌋ do
2: set symbols[d] ← 0, and packages[d] ← 0
3: set total[d] ← 0, and irregular[d] ← "not used"
4: for each (pi; fi) in P do
5: set symbols[log2 pi] ← fi
6: set total[log2 pi] ← fi
7: for d ← 0 to ⌊log2 m⌋ do
8: set packages[d + 1] ← total[d] div 2
9: set total[d + 1] ← total[d + 1] + packages[d + 1]
10: set total[d] ← total[d] − 2 × packages[d + 1]
11: if total[d] > 0 and irregular[d] = "not used" then
12: determine the smallest g > d such that total[g] > 0
13: set total[g] ← total[g] − 1
14: set irregular[g] ← 2^g + 2^d
15: else if total[d] > 0 then
16: set irregular[d + 1] ← irregular[d] + 2^d
17: else if irregular[d] ≠ "not used" then
18: determine the smallest g > d such that total[g] > 0
19: set total[g] ← total[g] − 1
20: set irregular[g] ← 2^g + irregular[d]
21: for d ← ⌊log2 m⌋ down to 1 do
22: propagate node depths from level d to level d − 1, assigning symbols[d − 1]
codeword lengths at level d − 1
• a list of n integers, each between 1 and n_max, indicating the subalphabet
of [1 ... n_max] that appears in this block of the message; and
Only after all of these values are in the hands of the decoder may the encoder
- based upon a code derived solely from the transmitted information, and no
other knowledge of the source message or block - start emitting codewords.
Algorithm 4.6 details the actions of the encoder, and shows how the prelude
components are calculated and then communicated to the waiting decoder. The
first step is to calculate the symbol frequencies in the block at hand. Since
Algorithm 4.6
Use a minimum-redundancy code to represent the m-symbol message M,
where 1 ≤ M[i] ≤ n_max for 1 ≤ i ≤ m. Assume initially that table[i] = 0
for 1 ≤ i ≤ n_max.
mr_encode_block(M, m)
1: set n ← 0
2: for i ← 1 to m do
3: set x ← M[i]
4: if table[x] = 0 then
5: set n ← n + 1 and syms_used[n] ← x
6: set table[x] ← table[x] + 1
7: sort syms_used[1 ... n] using table[syms_used[i]] as the sort keys, so that
table[syms_used[1]] ≥ table[syms_used[2]] ≥ ... ≥ table[syms_used[n]]
8: use function calculate_huffman_code() to replace table[x] by the
corresponding codeword length, for x ∈ {syms_used[i] | 1 ≤ i ≤ n}
9: set L ← table[syms_used[n]]
10: sort syms_used[1 ... n] so that
syms_used[1] < syms_used[2] < ... < syms_used[n]
11: set n_max ← syms_used[n]
12: set w[i] ← the number of codewords of length i in table
13: set base[1] ← 0, offset[1] ← 1, and offset[L + 1] ← n + 1
14: for i ← 2 to L do
15: set base[i] ← 2 × (base[i − 1] + w[i − 1])
16: set offset[i] ← offset[i − 1] + w[i − 1]
17: use function elias_delta_encode() to encode m, n_max, n, and L
18: use function interpolative_encode() to encode syms_used[1 ... n]
19: for i ← 1 to n do
20: unary_encode((L + 1) − table[syms_used[i]])
21: for i ← 2 to L do
22: set w[i] ← offset[i]
23: for i ← 1 to n do
24: set sym ← syms_used[i] and code_len ← table[sym]
25: set table[sym] ← w[code_len]
26: set w[code_len] ← w[code_len] + 1
27: for i ← 1 to m do
28: canonical_encode(table[M[i]]), using base and offset
29: for i ← 1 to n do
30: set table[syms_used[i]] ← 0
all that is known is that n_max is an upper bound on each of the m integers
in the input message, an array of n_max entries is used to accumulate symbol
frequencies. At the same time (steps 1 to 6 of function mr_encode_block())
the value of n - the number of symbols actually used in this block - is noted.
Array table serves multiple purposes in function mr_encode_block(). In this
first phase, it accumulates symbol frequencies.
Once the block has been processed, the array of symbols - syms_used -
is sorted into non-increasing frequency order (step 7) in the first of two sort-
ing steps that are employed. Any sorting method such as Quicksort can be
used. The array table of symbol frequencies is next converted into an array
of codeword lengths by function calculate_huffman_code() (Algorithm 4.2 on
page 67). After the calculation of codeword lengths, array syms_used is sorted
into a third ordering, this time based upon symbol number. Quicksort is again
an appropriate mechanism.
From the array of codeword lengths the L-element arrays base and offset
used during the encoding are constructed (steps 12 to 16), and the prelude sent
to the decoder (steps 17 to 20). Elias's Cδ, the interpolative code of Section 3.4,
and the unary code all have a part to play in the prelude. Sending the codeword
lengths as differences from L + 1 using unary is particularly effective, since
there can be very few short codewords in a code, and there will almost inevitably be
many long ones. We might also use a minimum-redundancy code recursively
to transmit the set of n codeword lengths, but there is little to be gained - a
minimum-redundancy code would look remarkably like a unary code for the
expected distribution of codeword lengths, and there must be a base to the
recursion at some point or another.
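The prelude construction in steps 13 to 16 of Algorithm 4.6 amounts to two simple recurrences; the C sketch below is a minimal rendering of just that part, seeded with the counts for the codeword lengths [1, 3, 3, 3, 4, 4] that appear as an example elsewhere in this book. The array bound MAX_LEN and the 1-origin indexing are our own conventions.

    #include <stdio.h>

    #define MAX_LEN 32

    int main(void)
    {
        /* w[i] is the number of codewords of length i; these counts correspond
           to the example codeword lengths [1, 3, 3, 3, 4, 4], so L = 4, n = 6 */
        int L = 4, n = 6;
        int w[MAX_LEN + 2] = { 0 };
        int base[MAX_LEN + 2] = { 0 };
        int offset[MAX_LEN + 2] = { 0 };
        w[1] = 1; w[3] = 3; w[4] = 2;

        base[1] = 0;                /* step 13 of mr_encode_block() */
        offset[1] = 1;
        offset[L + 1] = n + 1;
        for (int i = 2; i <= L; i++) {
            /* steps 14 to 16: the first code of length i is twice the value
               just beyond the last code of length i-1, and the offsets
               accumulate the counts of the shorter codewords */
            base[i] = 2 * (base[i - 1] + w[i - 1]);
            offset[i] = offset[i - 1] + w[i - 1];
        }
        for (int i = 1; i <= L; i++)
            printf("length %d: base = %d, offset = %d\n", i, base[i], offset[i]);
        return 0;
    }

Running it gives base = [0, 2, 4, 14] and offset = [1, 2, 2, 5], which is consistent with a canonical assignment in which, for example, the three codewords of length 3 are the 3-bit values 4, 5, and 6.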
Array w is then used to note the offset value for each different codeword
length, so that a pass through the set of symbols in symbol order (steps 21
to 26) can be used to set the mapping between source symbol numbers in the
sparse alphabet of M and the dense probability-sorted symbols in [1 ... n] used
for the actual canonical encoding. This is the third use of array table - it now
holds, for each source symbol x that appears in the message (or in this block
of it), the integer that will be coded in its stead.
After all this preparation, we are finally ready (steps 27 to 28) to use func-
tion canonical_encode() (Algorithm 4.1 on page 60) to send the m symbols that
comprise M, using the mapping stored in table. Then, as a last clean-up stage,
the array table is returned to the pristine all-zeroes state assumed at the com-
mencement of mr_encode_block(). This step requires O(n) time if completed
at the end of the function, versus the O(n_max) time that would be required if it
was initialized at the beginning of the function.
In total, there are two O(m)-time passes over the message M; a num-
ber of O(n)-time passes over the compact source alphabet stored in array
4.8. HOUSEKEEPING PAGE 85
syms_used; and two O(n log n)-time sorting steps. Plus, during the calls to
function canonical_encode(), a total of c output bits are generated, where c ≥
m. Hence, a total of O(m + n log n + c) = O(n log n + c) time is required
for each m-element block of a multi-block message, where n is the number
of symbols used in that block, and c is the number of output bits. A one-off
initialization charge of O(n_max) time to set array table to zero prior to the first
block of the message must also be accounted for, but can be amortized over
all of the blocks of the message, provided that n_max ≤ Σm, the length of the
complete source message.
In terms of space, the n_max-word array table is used and then, to save space,
re-used two further times. The only other large array is syms_used, in which
n words are used, but for which n_max words must probably be allocated. All
of the other arrays are only L or L + 1 words long, and consume a minimal
amount of space. That is, the total space requirement, excluding the m-word
buffer M passed as an argument, is 2 n_max + O(L) words. No trees are used,
nor any tree pointers.
If n ≪ n_max, both table and syms_used are used only sparsely, and other
structures might be warranted if memory space is important. For example,
syms_used might be allocated dynamically and resized when required, and ar-
ray table might be replaced by a dictionary structure such as a hash table or
search tree. These substitutions increase execution time, but might save mem-
ory space when n_max is considerably larger than n and the subalphabet used in
each block of the message is not dense.
Algorithm 4.7 details the inverse transformation that takes place in the de-
coder. The two n-element arrays syms_used and table are again used, and the
operations largely mirror the corresponding steps in the encoder. As was the
case in the encoder, array table serves three distinct purposes - first to record
the lengths of the codewords; then to note symbol numbers in the probability-
sorted alphabet; and finally to record which symbols have been processed dur-
ing the construction of the inverse mapping. This latter step is one not required
in the encoder. The prelude is transmitted in symbol-number order, but the
decoder mapping table - which converts a transmitted symbol identifier in the
probability-sorted alphabet back into an original symbol number - must be the
inverse of the encoder's mapping table. Hence steps 12 to 21. This complex
code visits each entry in syms_used in an order dictated by the cycles in the
permutation defined by array table, and assigns to it the corresponding symbol
number in the sparse alphabet. Once the inverse mapping has been constructed,
function canonical_decode() (Algorithm 4.1 on page 60) is used to decode each
of the m symbols in the compressed message block.
Despite the nested loops, steps 12 to 21 require O(n) time in total, since
each symbol is moved only once, and stepped over once. Moreover, none of
Algorithm 4.7
Decode and return an m-symbol message M using a minimum-redundancy
code.
mr_decode_block()
1: use function elias_delta_decode() to decode m, n_max, n, and L
2: interpolative decode the list of n symbol numbers into syms_used[1 ... n]
3: for i ← 1 to n do
4: set table[i] ← (L + 1) − unary_decode()
5: set w[i] ← the number of codewords of length i in table
6: construct the canonical coding tables base, offset, and lj_limit from w
7: for i ← 2 to L do
8: set w[i] ← offset[i]
9: for i ← 1 to n do
10: set sym ← syms_used[i] and code_len ← table[i]
11: set table[i] ← w[code_len] and w[code_len] ← w[code_len] + 1
12: set start ← 1
13: while start ≤ n do
14: set from ← start and sym ← syms_used[start]
15: while table[from] ≠ "done" do
16: set i ← table[from]
17: set table[from] ← "done"
18: swap sym and syms_used[i]
19: set from ← i
20: while start ≤ n and table[start] = "done" do
21: set start ← start + 1
22: set V ← get_one_integer(L)
23: for i ← 1 to m do
24: set c ← canonical_decode(), using V, base, offset, and lj_limit
25: set M[i] ← syms_used[c]
26: return m and M
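Steps 12 to 21 translate into C almost line for line. In the sketch below the arrays are 1-origin (index 0 unused), the value zero stands in for the "done" marker, and the contents of syms_used and table are small illustrative values rather than the output of the earlier steps.

    #include <stdio.h>

    #define DONE 0   /* stands in for the "done" marker of Algorithm 4.7 */

    int main(void)
    {
        /* Illustrative input only: syms_used[1..n] lists the subalphabet in
           symbol-number order, and table[i] gives the position in the
           probability-sorted alphabet that was assigned to syms_used[i]. */
        int n = 4;
        int syms_used[] = { 0, 10, 20, 30, 40 };   /* index 0 unused */
        int table[]     = { 0,  3,  1,  4,  2 };

        /* steps 12 to 21: follow each permutation cycle, moving every symbol
           into the slot named by its code index and marking visited entries */
        int start = 1;
        while (start <= n) {
            int from = start;
            int sym = syms_used[start];
            while (table[from] != DONE) {
                int i = table[from];
                table[from] = DONE;
                int tmp = sym; sym = syms_used[i]; syms_used[i] = tmp;
                from = i;
            }
            while (start <= n && table[start] == DONE)
                start++;
        }

        for (int c = 1; c <= n; c++)
            printf("code %d -> symbol %d\n", c, syms_used[c]);
        return 0;
    }

After the loop, syms_used[c] holds the original symbol number for each code index c, which is exactly the inverse mapping that canonical_decode() requires.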
Figure 4.8: Cost of the prelude components for subalphabet selection and code-
word lengths, and cost of codewords, for the files WSJ.Words (on the left) and
WSJ.NonWords (on the right), for different block sizes. Each file contains approxi-
mately m = 86 × 10^6 symbols, and is described in more detail in Table 4.5.
the other steps require more than O(n) time except for the canonical decoding.
Hence the decoder operates faster than the encoder, in a total of O(m + n + c) =
O(n + c) time, where c ≥ m is again the number of bits in the compressed
message. The space requirement is 2n + O(L) words, regardless of n_max.
Figure 4.8 summarizes the overall compression effectiveness achieved by
function mr_encode_block() on the files WSJ.Words and WSJ.NonWords (de-
scribed in Table 4.5 on page 71) for block sizes varying from m = 10^3 to
m = 10^6. When each block is very small, a relatively large fraction of the
compressed message is consumed by the prelude. But the codes within each
block are more succinct, since they are over a smaller subalphabet, and overall
compression effectiveness suffers by less than might be thought. At the other
extreme, when each block is a million or more symbols, the cost of transmitting
the prelude is an insignificant overhead.
However, encoding efficiency - with its n log n factor per block - suffers
considerably on small blocks. In the same experiments, it took more than ten
times longer to encode WSJ.Words with m = 10^3 than it did with m = 10^5,
because there are 100 times as many sorting operations performed, and each
involves considerably more than 1/100 of the number of symbols. Decoding
speed was much less affected by block size, and even with relatively large
block sizes, the decoder operates more than four times faster than the encoder,
E(C, P) − H(P) ≤ p1            if p1 ≥ 0.5,
E(C, P) − H(P) ≤ p1 + 0.086    if p1 < 0.5.
The bound when p1 ≥ 0.5 cannot be tightened, but several authors have re-
duced the bounds when p1 < 0.5. Dietrich Manstetten [1992] summarizes
previous work, and gives a general method for calculating the redundancy of a
minimum-redundancy prefix code as a function of p1. Manstetten also gives a
graph of the tightest possible bounds on the number of bits per symbol required
by a minimum-redundancy code, again as a function of p1.
Another area of analysis that has received attention is the maximum code-
word length L assigned in a minimum-redundancy code. This is of particular
relevance to successful implementation of function canonical_decode() in Al-
gorithm 4.1 on page 60, where V is a buffer containing the next L bits of the
compressed input stream. If allowance must be made in an implementation for
L to be larger than the number of bits that can be stored in a single machine
word, the speed of canonical decoding is greatly compromised.
Given that most (currently) popular computers have a word size of 32 bits,
what range of message lengths can we guarantee to be able to handle within
an L = 32 bit limit on maximum codeword length? The obvious answer -
that messages of length m = 2^32 ≈ 4 × 10^9 symbols can be handled with-
out problem - is easily demonstrated to be incorrect. For example, setting the
unnormalized probability pi of symbol si to F(n − i + 1), an element in the
Fibonacci sequence that was defined in Section 1.5 on page 10, gives a code
in which symbols s_{n−1} and s_n have codewords that are n − 1 bits long. The
intuition behind this observation is simple: Huffman's algorithm packages the
smallest two probabilities at each stage of processing, beginning with two sin-
gleton packages. If the sum of these two packages is equal to the weight of the
next unprocessed symbol, at every iteration a new internal node will be created,
with the leaf as one of its children, and the previous package as the other child.
The final code tree will be a stick, with codeword lengths |ci| = [1, 2, ..., n − 1, n − 1]. Hence,
if P = [F(n), F(n − 1), ..., F(1)], then L = n − 1 [Buro, 1993].
This bodes badly for canonical_decode(), since it implies that L > 32 is
possible on an alphabet of as few as n = 34 symbols. But there is good news
too: a Fibonacci-derived probability distribution on n = 34 symbols does
still require a message length of m = Σ_{i=1}^{34} F(i) = F(36) − 1 > 14.9 million
symbols. It is extremely unlikely that a stream of more than 14 million symbols
would contain only 34 distinct symbols, and that those symbols would occur
with probabilities according to a Fibonacci sequence.
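The figures quoted here are easily checked; the following C sketch assumes the usual definition F(1) = F(2) = 1 and verifies that the Fibonacci weights on 34 symbols force a message of more than 14.9 million symbols.

    #include <stdio.h>

    int main(void)
    {
        /* F(1) = F(2) = 1; verify that sum_{i=1}^{34} F(i) = F(36) - 1 */
        long long F[37], sum = 0;
        F[1] = F[2] = 1;
        for (int i = 3; i <= 36; i++)
            F[i] = F[i - 1] + F[i - 2];
        for (int i = 1; i <= 34; i++)
            sum += F[i];
        printf("sum of F(1)..F(34) = %lld\n", sum);        /* 14,930,351 */
        printf("F(36) - 1          = %lld\n", F[36] - 1);  /* 14,930,351 */
        return 0;
    }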
While the Fibonacci-based probability distribution leads to codewords of
length L = n − 1, it is not the sequence that minimizes m = Σ_{i=1}^{n} pi, the
message length required to cause those long codewords to be generated. That
privilege falls to a probability distribution derived from the modified Fibonacci
sequence F' described in Section 1.5:
Arithmetic Coding
Given that the bit is the unit of stored data, it appears impossible for codewords
to occupy fractional bits. And given that a minimum-redundancy code as de-
scribed in Chapter 4 is the best that can be done using integral-length code-
words, it would thus appear that a minimum-redundancy code obtains com-
pression as close to the entropy as can be achieved.
Surprisingly, while true for the coding of a single symbol, this reasoning
does not hold when streams of symbols are to be coded, and it is the latter
situation which is the normal case in a compression system. Provided that the
coded form of the entire message is an integral number of bits long, there is no
requirement that every bit of the encoded form be assigned exclusively to one
symbol or another. For example, if five equi-probable symbols are represented
somehow in a total of three bits, it is not unreasonable to simplify the situation
and assert that each symbol occupies 0.6 bits. The output must obviously be
"lumpy" - bits might only be emitted after the second, fourth, and fifth sym-
bols of the input message, or possibly not until all of the symbols in the input
message have been considered. However, if the coder has some kind of internal
state, and if after each symbol is coded the state is updated, then the total code
for each symbol can be thought of as being the output bits produced as a re-
sult of that symbol being processed, plus the change in potential of the internal
state, positive or negative. Since the change in potential might be bit-fractional
in some way, it is quite conceivable for a coder to represent a symbol of prob-
ability pi in the ideal amount (Equation 2.1 on page 16) of −log2 pi bits. At
the end of the stream the internal state must be represented in some way, and
converted to an integral number of bits. But if the extra cost of the rounding
can be amortized over many symbols, the per-symbol cost is inconsequential.
Arithmetic coding is an effective mechanism for achieving exactly such a
"bit sharing" approach to compression, and is the topic of this chapter. The ori-
gins of the ideas embodied in an arithmetic coder are described in Section 5.1.
Sections 5.2 and 5.3 give an overview of the method, and then a detailed im-
plementation. A number of variations on the basic theme are explored in Sec-
tion 5.4, ideas which are exploited when binary arithmetic coding is considered
in Section 5.5. Finally, Sections 5.6 and 5.7 examine a number of approximate
arithmetic coding schemes, in which some inexactness in the coded represen-
tation is allowed, in order to increase the speed of encoding and decoding.
formed as Witten et al. would have liked - others around the world typed it in
too. The CACM implementation was revised approximately ten years later in
a followup paper that appeared in ACM Transactions on Information Systems
[Moffat et aI., 1998], and that TOIS implementation is the basis for much of the
presentation in this chapter.
Paul Howard and Jeff Vitter have also considered arithmetic coding in some
depth (see their 1994 paper in a special "Data Compression" issue of Proceed-
ings of the IEEE for an overview), and one of their several contributions is
examined in Section 5.7.
Algorithm 5.1
Use an idealized arithmetic coder to represent the m-symbol message M,
where 1 ≤ M[i] ≤ n_max for 1 ≤ i ≤ m. Normalized symbol probabilities
are assumed to be given by the static vector P, with Σ P[i] = 1.
ideal_arithmetic_encode(M, m)
1: set L ← 0 and R ← 1
2: for i ← 1 to m do
3: set s ← M[i]
4: set L ← L + R × Σ_{j=1}^{s−1} P[j]
5: set R ← R × P[s]
6: transmit V, where V is the shortest (fewest bits) binary fractional number
that satisfies L ≤ V < L + R
Decode and return an m-symbol message assuming an idealized arithmetic
coder.
ideal_arithmetic_decode(m)
1: set L ← 0 and R ← 1
2: let V be the fractional value transmitted by the encoder
3: for i ← 1 to m do
4: determine s such that R × Σ_{j=1}^{s−1} P[j] ≤ V − L < R × Σ_{j=1}^{s} P[j]
5: set L ← L + R × Σ_{j=1}^{s−1} P[j]
6: set R ← R × P[s]
7: set M[i] ← s
8: return M
Figure 5.1: Encoding a symbol s and narrowing the range: (a) allocation of probability
space to symbol s within the range [0,1); (b) mapping probability space [0, 1) onto the
current [L, L + R) interval; and (c) restriction to the new [L, L + R) interval.
Figure 5.1c the values of L and L + R are updated, and reflect a new reduced
interval that corresponds to having encoded the symbol s.
The same process is followed for each symbol of the message M. At any
given point in time the internal potential of the coder is given by - log2 R. The
potential is a measure of the eventual cost of coding the message, and counts
bits. If R' is used to denote the new value of R after an execution of step 5, then
R' = R × P[s], and −log2 R' = (−log2 R) + (−log2 P[s]). That is, each
iteration of the "for" loop increases the potential by exactly the information
content of the symbol being coded.
At the end of the message the transmitted code is any number V such that
L ≤ V < L + R. By this time R = Π_{i=1}^{m} P[M[i]], where M[i] is the ith of the
m input symbols. The potential has thus increased to −Σ_{i=1}^{m} log2 P[M[i]],
and to guarantee that the number V is within the specified range between L
and L + R, it must be at least this many bits long. For example, consider the
sequence of L and R values that arises when the message
M = [1, 2, 1, 1, 1, 5, 1, 1, 2, 1]
is coded using the static probability distribution
P = [0.67, 0.11, 0.07, 0.06, 0.05, 0.04]
that was used as an example in Section 1.3 on page 6 and again in Chapter 4.
Table 5.1 shows - in both decimal and binary - the values that the two state
variables take during the encoding of this message, starting from their initial
values of zero and one respectively.
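The narrowing process itself is easy to reproduce; the C sketch below applies steps 4 and 5 of ideal_arithmetic_encode() to the example message using ordinary double-precision arithmetic, and prints the decimal L, R, and L + R columns of Table 5.1 (agreement is to within floating-point rounding in the last digit or two).

    #include <stdio.h>

    int main(void)
    {
        /* static probabilities and message from the worked example */
        double P[] = { 0.67, 0.11, 0.07, 0.06, 0.05, 0.04 };
        int M[] = { 1, 2, 1, 1, 1, 5, 1, 1, 2, 1 };
        int m = 10;
        double L = 0.0, R = 1.0;

        printf("%2d %10.8f %10.8f %10.8f\n", 0, L, R, L + R);
        for (int i = 0; i < m; i++) {
            int s = M[i];                  /* symbols are numbered from 1 */
            double cum = 0.0;
            for (int j = 1; j < s; j++)    /* sum of P[1..s-1], as in step 4 */
                cum += P[j - 1];
            L = L + R * cum;               /* step 4 of ideal_arithmetic_encode() */
            R = R * P[s - 1];              /* step 5 */
            printf("%2d %10.8f %10.8f %10.8f\n", i + 1, L, R, L + R);
        }
        return 0;
    }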
i   M[i]   L            R            L+R          L (binary)                 L+R (binary)
0   -      0.00000000   1.00000000   1.00000000   0.0000000000000000000000   1.0000000000000000000000
1   1      0.00000000   0.67000000   0.67000000   0.0000000000000000000000   0.1010101110000101001000
2   2      0.44890000   0.07370000   0.52260000   0.0111001011101011000111   0.1000010111001001000111
3   1      0.44890000   0.04937900   0.49827900   0.0111001011101011000111   0.0111111110001111001110
4   1      0.44890000   0.03308393   0.48198393   0.0111001011101011000111   0.0111101101100011010011
5   1      0.44890000   0.02216623   0.47106623   0.0111001011101011000111   0.0111100010010111110011
6   5      0.46907127   0.00110831   0.47017958   0.0111100000010101000100   0.0111100001011101101100
7   1      0.46907127   0.00074257   0.46981384   0.0111100000010101000100   0.0111100001000101101110
8   1      0.46907127   0.00049752   0.46956879   0.0111100000010101000100   0.0111100000110101101010
9   2      0.46940461   0.00005473   0.46945934   0.0111100000101010111010   0.0111100000101110011111
10  1      0.46940461   0.00003667   0.46944128   0.0111100000101010111010   0.0111100000101101010011

Table 5.1: Example of arithmetic coding: representing the message M = [1,2,1,1,1,5,1,1,2,1]
assuming the static probability distribution P = [0.67,0.11,0.07,0.06,0.05,0.04].
L+R 0.0111100000101101010011
V 0.0111100000101100
L 0.0111100000101010111010.
At the conclusion of the processing R has the value 3.67 × 10^−5, the product of
the probabilities of the symbols in the message. The minimum number of bits
required to separate L and L + R is thus given by ⌈−log2 R⌉ = ⌈14.74⌉ =
15, one less than the number of bits calculated above for V. A minimum-
redundancy code for the same set of probabilities would have codeword lengths
of [1, 3, 3, 3, 4, 4] (Figure 4.2 on page 54) for a message length of 17 bits. The
one bit difference between the arithmetic code and the minimum-redundancy
code might seem a relatively small amount to get excited about, but when the
message is long, or when one symbol has a very high probability, an arithmetic
code can be much more compact than a minimum-redundancy code. As an
extreme situation, consider the case when n = 2, P = [0.999, 0.001], and a
message containing 999 "1"s and one "2" is to be coded. At the end of the
message R = 3.7 × 10^−4, and V will contain just ⌈−log2 3.7 × 10^−4⌉ = 12 or
⌈−log2 3.7 × 10^−4⌉ + 1 = 13 bits, far fewer than the 1,000 bits necessary with
a minimum-redundancy code. On average, each symbol in this hypothetical
message is coded in just 0.013 bits!
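A few lines of C (using the standard mathematics library) confirm the arithmetic behind this example.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double R = pow(0.999, 999) * 0.001;   /* product of the probabilities */
        printf("R = %.6g\n", R);                           /* about 3.7e-4 */
        printf("ceil(-log2 R) = %.0f bits\n", ceil(-log2(R)));   /* 12 */
        return 0;
    }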
There are workarounds to prefix codes that give improved compression ef-
fectiveness, such as grouping symbols together into blocks over a larger alpha-
bet, in which individual probabilities are smaller and the redundancy reduced;
or extracting runs of "1" symbols and then using a Golomb code; or using the
interpolative code. But they cannot compare with the sheer simplicity and el-
egance of arithmetic coding. As a further point in its favor, arithmetic coding
is relatively unaffected by the extra demands that arise when the probability
estimates are adjusted adaptively - a subject to be discussed in Chapter 6.
There are, however, considerable drawbacks to arithmetic coding as pre-
sented in Algorithm 5.1. First, and most critical, is the need for arbitrary pre-
cision real arithmetic. If the compressed message ends up being (say) 125 kB
long, then L and R must be maintained to more than one million bits of pre-
Algorithm 5.2
Arithmetically encode the range [l/t, h/t) using fixed-precision integer
arithmetic. The state variables L and R are modified to reflect the new
range, and then renormalized to restore the initial and final invariants
2^(b−2) < R ≤ 2^(b−1), 0 ≤ L < 2^b − 2^(b−2), and L + R ≤ 2^b.
arithmetic_encode(l, h, t)
1: set r ← R div t
2: set L ← L + r × l
3: if h < t then
4: set R ← r × (h − l)
5: else
6: set R ← R − r × l
7: while R ≤ 2^(b−2) do
8: if L + R ≤ 2^(b−1) then
9: bit_plus_follow(0)
10: else if 2^(b−1) ≤ L then
11: bit_plus_follow(1)
12: set L ← L − 2^(b−1)
13: else
14: set bits_outstanding ← bits_outstanding + 1
15: set L ← L − 2^(b−2)
16: set L ← 2 × L and R ← 2 × R
Write the bit x (value 0 or 1) to the output bitstream, plus any outstanding
following bits, which are known to be of opposite polarity.
bit_plus_follow(x)
1: put_one_bit(x)
2: while bits_outstanding > 0 do
3: put_one_bit(1 − x)
4: set bits_outstanding ← bits_outstanding − 1
Table 5.2: Corresponding values for arithmetic coding, real-number interpretation and
scaled integer interpretation.
Figure 5.2: Renormalization in arithmetic coding: (a) when L + R ≤ 0.5; (b) when
0.5 ≤ L; and (c) when R < 0.25 and L < 0.5 < L + R.
In the first case (Figure 5.2a) the next output bit is clearly a zero, as both L
and L + R are less than 0.5. Hence, in this situation the correct procedure is to
generate an unambiguous "0" bit, and scale L and R by doubling them.
The second case (Figure 5.2b) handles the situation when the next bit is
definitely a one. This is indicated by L (and hence L + R also) being greater
than or equal to 0.5. Once the bit is output L should be translated downward
by 0.5, and then L and R doubled, as for the first case.
The third case, at steps 14 and 15, and shown in Figure 5.2c, is somewhat
more complex. When R ≤ 0.25 and L and L + R are on opposite sides of 0.5,
the polarity of the immediately next output bit cannot be known, as it depends
upon future symbols that have not yet been coded. What is known is that the
bit after that immediately next bit will be of opposite polarity to the next bit,
because every binary number between 0.25 and 0.75 starts either with "01" or
with "10". Hence, in this third case, the renormalization can still take place,
provided a note is made using the variable bits_outstanding to output an
additional opposite bit the next time a bit of unambiguous polarity is produced.
In this third case L is translated by 0.25 before L and R are doubled. As the
final part of this puzzle, each time a bit is output at step 1 of function
bit_plus_follow() it is followed up by the bits_outstanding opposite bits still
extant.
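For reference, the renormalization loop (steps 7 to 16 of Algorithm 5.2) and bit_plus_follow() carry over to C almost verbatim. The sketch below assumes b = 32, keeps L and R in 64-bit variables so that L + R cannot overflow, and uses a put_one_bit() that simply prints each bit; the starting values of L and R in main() are illustrative only.

    #include <stdio.h>
    #include <stdint.h>

    #define B        32
    #define HALF     ((uint64_t)1 << (B - 1))   /* 2^(b-1), "0.5"  */
    #define QUARTER  ((uint64_t)1 << (B - 2))   /* 2^(b-2), "0.25" */

    static uint64_t L, R;
    static unsigned bits_outstanding;

    static void put_one_bit(int x) { putchar('0' + x); }

    static void bit_plus_follow(int x)
    {
        put_one_bit(x);
        while (bits_outstanding > 0) {          /* opposite-polarity bits owed */
            put_one_bit(1 - x);
            bits_outstanding--;
        }
    }

    /* steps 7 to 16 of arithmetic_encode(): emit bits (or note outstanding
       ones) until the invariant R > 2^(b-2) is restored */
    static void renormalize(void)
    {
        while (R <= QUARTER) {
            if (L + R <= HALF) {
                bit_plus_follow(0);
            } else if (L >= HALF) {
                bit_plus_follow(1);
                L -= HALF;
            } else {
                bits_outstanding++;
                L -= QUARTER;
            }
            L *= 2;
            R *= 2;
        }
    }

    int main(void)
    {
        /* illustrative values only: a state with R below 2^(b-2) */
        L = 0x50000000u;
        R = 0x08000000u;
        renormalize();
        printf("\n");
        return 0;
    }

With the values shown, the loop emits the bits 0101 and leaves R above 2^(b−2) again.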
The purpose of Algorithm 5.2 is to show how a single symbol is processed
in the arithmetic coder. To code a whole message, some initialization is re-
quired, plus a loop that iterates over the symbols in the message. Function
arithmetic_encode_block() in Algorithm 5.3 shows a typical calling sequence
that makes use of arithmetic_encode() to code an entire message M. It serves
the same purpose, and offers the same interface, as function mr_encode_block()
in Algorithm 4.6 on page 83. For the moment, consider only the encoding func-
Algorithm 5.3
Use an arithmetic code to represent the m-symbol message M, where
1 ≤ M[i] ≤ n_max for 1 ≤ i ≤ m.
arithmetic_encode_block(M, m)
1: for s ← 0 to n_max do
2: set cum_prob[s] ← 0
3: for i ← 1 to m do
4: set s ← M[i]
5: set cum_prob[s] ← cum_prob[s] + 1
6: use function elias_delta_encode() to encode m and n_max
7: for s ← 1 to n_max do
8: elias_delta_encode(1 + cum_prob[s])
9: set cum_prob[s] ← cum_prob[s − 1] + cum_prob[s]
10: start_encode()
11: for i ← 1 to m do
12: set s ← M[i]
13: arithmetic_encode(cum_prob[s − 1], cum_prob[s], m)
14: finish_encode()
Algorithm 5.4
Return an integer target in the range 0 ≤ target < t that falls within the
interval [l, h) that was used at the corresponding call to arithmetic_encode().
decode_target(t)
1: set r ← R div t
2: return min{t − 1, D div r}
Adjust the decoder's state variables R and D to reflect the changes made in
the encoder during the corresponding call to arithmetic_encode(), assuming
that r has been set by a prior call to decode_target().
arithmetic_decode(l, h, t)
1: set D ← D − r × l
2: if h < t then
3: set R ← r × (h − l)
4: else
5: set R ← R − r × l
6: while R ≤ 2^(b−2) do
7: set R ← 2 × R and D ← 2 × D + get_one_bit()
Algorithm 5.5
Initialize the encoder's state variables.
start_encode()
1: set L ← 0, R ← 2^(b−1), and bits_outstanding ← 0
Push any unconsumed bits back to the input bitstream. For the version of
finish_encode() described here, no action is necessary on the part of
finish_decode().
finish_decode()
1: do nothing
ple, but heavy handed. Function finish_encode() simply outputs all of the bits
of L, that is, another b bits, compared to the small number of bits that was
required in the example shown in Table 5.1. There are two main reasons for
advocating this brute-force approach. The first is the use of the transformation
D = V - L in the decoder, which must similarly be able to calculate how many
bits should be flushed from its state variables if it is to remain synchronized. If
L and R are maintained explicitly in the decoder, then it can perform the same
calculation (whatever that might end up being) as does the encoder, and so a
variable number of termination bits can be used. But maintaining L as well as
either V or D slows down the decoder, and rather than accept this penalty the
number of termination bits is made independent of the exact values of L and
R. Any other fixed number of bits known to be always sufficient could also
be used. For example, the encoder might send the first three bits of L + R/2,
which can be shown to always be enough.
The second reason for preferring the simplistic termination mechanism is
that the compressed file might contain a number of compressed messages, each
handled by independent calls to arithmetic_encode_block(). Indeed, the arith-
metic codes in the file might be interleaved with codes using another quite dif-
ferent mechanism. For example, in a multi-block situation the Elias Cδ codes
for the P[s] + 1 values end up being interleaved with the arithmetic codes.
Unless care is taken, the buffer D might, at the termination of coding, contain
bits that belong to the next component of the compressed file. If so, those bits
should be processed by quite different routines - such as elias_delta_decode().
When finish_encode() writes all b bits of L, and the decoder reads no more
beyond the current value of D, it guarantees that when the decoder terminates
the next bit returned by function get_one_bit() will be the first bit of the next
component of the file.
In cases where the compressed file only contains one component it is pos-
sible to terminate in just three bits. In some cases as few as one bit might be
sufficient - consider the two cases R = 2^(b−1) (that is, 0.5) for L = 0 and
L = 2^(b−1). In the first case a single "0" bit is adequate, and in the second case
a single "1" bit suffices. Similarly, two bits of termination is often enough: as
an example, consider L = "011 ... " and L + R = "110 ... ", in which case
termination with "10" gives a value always in range, regardless of what noise
bits follow on behind. Note the degree of suspicion with which this is done. It
would be quite imprudent to assume that all subsequent bits inspected by the
decoder beyond those explicitly written by the encoder will be zeros. In the
language C, for example, erroneously reading when there are no bytes remain-
ing in the file returns "1" bits, as the EOF marker is represented as the value
−1, which in two's complement form is stored as a word which contains all "1"
bits. This uncertainty is why we insist that the termination bits must be such
Algorithm 5.6
Initialize the encoder's state variables. Note that with this assignment the
encoding/decoding invariant 0.25 < R ≤ 0.5 is no longer guaranteed.
frugal_start_encode()
1: set L ← 0, R ← 2^b − 1, and bits_outstanding ← 0
Flush the encoder so that all information is in the output bitstream, using as
few extra bits as possible.
frugal_finish_encode()
1: for nbits ← 1 to 3 do
2: set roundup ← 2^(b−nbits) − 1
3: set bits ← (L + roundup) div 2^(b−nbits)
4: set value ← bits × 2^(b−nbits)
5: if L ≤ value and value + roundup ≤ L + (R − 1) then
6: put_one_integer(bits, nbits), using bit_plus_follow()
7: return
that no matter what bit values are inadvertently used by the decoder after all of
the emitted bits are consumed, decoding works correctly.
Function frugal_finish_encode() in Algorithm 5.6 gives an exact calculation
that determines a minimal set of termination bits. Note the care with which the
calculation at step 5 is engineered: the computation must be carried out in an
order that eliminates any possibility of overflow, even if the architecture uses
b-bit words for its integer arithmetic.
Over any realistic compression run the extra 30 or so bits involved in func-
tion finish_encode() compared to function frugal_finish_encode() are a com-
pletely negligible overhead. On the other hand, if the file does consist of multi-
ple short components, and function frugal_finish_encode() is to be used, a very
much more complex regime is required in which the final contents of D must
be pushed back into the input stream by a function frugal_finish_decode() and
made available to subsequent calls to get_one_bit(). How easily this can be
done - and what effect it has upon decoding throughput - will depend upon the
language used in an actual implementation.
Let us now return to the example message M that was compressed in the
example of Table 5.1, and see how it is handled in the integer-based implemen-
tation of arithmetic coding. For reasons that are discussed below, it is desirable
to permute the alphabet so that the most probable symbol is the last. Doing so
gives a message M' = [6,1,6,6,6,4,6,6,1,6] to be coded against an integer
cumulative frequency distribution cum_prob = [0, 2, 2, 2, 3, 3, 10]. Suppose
further that b = 7 is being used in Algorithm 5.2, and hence that 0 ≤ L < 127
and renormalization must achieve 32 < R. Table 5.3 shows the sequence of
values taken on by L, R, and r; and the sequence of bits emitted during the
execution of the renormalization loop when message M' is coded. Note that it
is assumed that the bit-frugal version of start_encode() has been used.
A "?" entry for a bit indicates that the renormalization loop has iterated
and that bits_outstanding has been incremented rather than a bit actually being
produced; and "x" shows the location where that bit is inserted. Hence, the
emitted bitstream in the example is "011000001100", including the termina-
tion bits. In this case, with L = 40 and R = 56 after the last message symbol,
function frugal_finish_encode() calculates that nbits = 2 is the least number of
disambiguating bits possible, and that they should be "10". That is, transmis-
sion of the message M', which is equivalent to the earlier example message M,
has required a total of 12 bits.
To this must be added the cost of the prelude. Using the mechanism sug-
gested in Algorithm 5.3, the prelude takes 4 + 1 + 1 + 4 + 1 + 8 = 19 bits for
the six Cδ codes, not counting the cost of the values m and n_max. In contrast,
when coding the same message the minimum-redundancy prelude represen-
tation suggested in Algorithm 4.6 on page 83 requires 9 bits for subalphabet
selection, including an allowance of 4 bits for a Cδ code for n = 3; and then
4 bits for codeword lengths - a total of 13 bits. Subalphabet selection is done
implicitly in Algorithm 5.3 through the use of "plus one" symbol frequencies.
The interpolative code might be used in the arithmetic environment for explicit
subalphabet selection, and a Golomb or interpolative code used for the non-
zero symbol frequencies rather than the presumed Cδ code. But the second
component of the prelude - codeword lengths in a minimum-redundancy code,
or symbol frequencies in an arithmetic code - is always going to be cheaper in
the minimum-redundancy code. More information is contained in the set of ex-
act symbol frequencies that led to a set of codeword lengths than is contained in
the lengths that result, as the lengths can be computed from the frequencies, but
not vice-versa. Hence the comments made earlier about remembering to fac-
tor in the cost of transmitting the prelude if absolute best compression is to be
achieved for short messages. For the short example message M, the unary code
described in Section 3.1 on page 29 is probably "absolute best", as it requires
no prelude and has a total cost of just 16 bits. Unfortunately, short messages
are never a compelling argument in favor of complex coding mechanisms!
In the fixed-precision decoder, the variable D is initialized to the first b =
7 bits of the message, that is, to "0110000", which is 48 in decimal. The
decoder then calculates r = R/t = 127/10 = 12, and a target of D/r =
48/12 = 4, which must correspond to symbol 6, as cum_prob[5] = 3 and
cum_prob[6] = 10. Once the symbol number is identified, the decoder adjusts
its state variables D and R to their new values of D = 12 ("0001100" in seven-
bit binary) and R = 91, and undertakes a renormalization step, which in this
case - exactly as happened in the encoder at the same time - does nothing. The
second value of r is then calculated to be r = 91/10 = 9; the second target
is then D/r = 12/9 = 1; the second symbol is found to be s = 1; and D
and R are again modified, to D = 12 and R = 18. This time R gets doubled
in the renormalization loop. At the same time D, which is still 12, or binary
"0001100", is also doubled, and another bit (the next "0") from the compressed
stream shifted in, to make D = "0011000" = 24. The process continues in the
same vein until the required m symbols have been decoded. Notice how the
fact that some bits were delayed in the encoder is completely immaterial in the
decoder - it can always see the full set of needed bits - and so there is no need
in the decoder to worry about outstanding bits.
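The whole walk-through can be reproduced mechanically. The C sketch below hard-codes the emitted bitstream "011000001100", the cum_prob array, t = 10, and b = 7; assumes the frugal starting value R = 127; supplies zero bits once the twelve emitted bits are exhausted (which, as discussed above, is exactly the situation the termination bits must survive); and uses a linear scan where a real decoder would search more cleverly. It prints the permuted message M' = [6, 1, 6, 6, 6, 4, 6, 6, 1, 6].

    #include <stdio.h>

    #define B 7
    static const char *bits = "011000001100";  /* output of the worked example */
    static int pos = 0;
    static unsigned D, R;

    static int get_one_bit(void)
    {
        /* past the end of the emitted stream, any value will do; use zeros */
        return bits[pos] ? bits[pos++] - '0' : 0;
    }

    static unsigned decode_target(unsigned t, unsigned *r)
    {
        *r = R / t;                               /* r = R div t */
        unsigned target = D / *r;
        return target < t - 1 ? target : t - 1;   /* min{t-1, D div r} */
    }

    static void arithmetic_decode(unsigned l, unsigned h, unsigned t, unsigned r)
    {
        D -= r * l;
        R = (h < t) ? r * (h - l) : R - r * l;
        while (R <= (1u << (B - 2))) {            /* restore R > 2^(b-2) */
            R = 2 * R;
            D = 2 * D + get_one_bit();
        }
    }

    int main(void)
    {
        /* cum_prob for M' = [6,1,6,6,6,4,6,6,1,6]: symbol s occupies the
           interval [cum_prob[s-1], cum_prob[s]) out of t = 10 */
        unsigned cum_prob[] = { 0, 2, 2, 2, 3, 3, 10 };
        unsigned t = 10, m = 10;

        R = (1u << B) - 1;                        /* frugal start: R = 127 */
        D = 0;
        for (int i = 0; i < B; i++)               /* prime D with the first b bits */
            D = 2 * D + get_one_bit();

        for (unsigned i = 0; i < m; i++) {
            unsigned r, target = decode_target(t, &r);
            unsigned s = 1;
            while (cum_prob[s] <= target)         /* linear search here; binary */
                s++;                              /* search is used in practice */
            printf("%u ", s);
            arithmetic_decode(cum_prob[s - 1], cum_prob[s], t, r);
        }
        printf("\n");                             /* prints: 6 1 6 6 6 4 6 6 1 6 */
        return 0;
    }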
Now consider the efficiency of the processes we have described. In the
encoder the [l, h) interval is found by direct lookup in the array cum_prob.
Hence the cost of encoding a message M of m symbols over an alphabet of
n symbols onto an output code sequence of c bits is O(n + c + m), that is,
essentially linear in the inputs and outputs. (Note that with arithmetic coding
we cannot assume that c ≥ m.) To this must be added the time required in
the model for the recognition of symbols and the conversion into a stream of
integers, but those costs are model dependent and are not considered here.
In the decoder the situation is somewhat more complex. The cum_prob ar-
ray is again used, but is now searched rather than directly accessed. Fortunately
the array is sorted, allowing the use of binary search for target values. This
means that the total decoding time for the same message is O(n + c + m log n),
where the first two terms are again for the cost of computing cum_prob and pro-
cessing bits respectively. Compared to the minimum-redundancy coders dis-
cussed in Chapter 4, encoding is asymptotically faster, and decoding is asymp-
totically slower. Section 6.6 returns to this issue of searching in the cum_prob
array, and describes improved structures that allow the overall decoding time
to be reduced to O(n + c + m), at the expense of an additional n words of extra
memory space.
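Because cum_prob is non-decreasing, the symbol identification step is a textbook binary search; a minimal C sketch (with our own function name) is shown below, using the cum_prob array from the earlier example.

    #include <stdio.h>

    /* Return the symbol s in 1..n with cum_prob[s-1] <= target < cum_prob[s],
       where cum_prob[0..n] is non-decreasing and cum_prob[0] = 0. */
    static unsigned find_symbol(const unsigned cum_prob[], unsigned n,
                                unsigned target)
    {
        unsigned lo = 1, hi = n;
        while (lo < hi) {
            unsigned mid = lo + (hi - lo) / 2;
            if (cum_prob[mid] <= target)
                lo = mid + 1;          /* the symbol lies strictly above mid */
            else
                hi = mid;
        }
        return lo;
    }

    int main(void)
    {
        unsigned cum_prob[] = { 0, 2, 2, 2, 3, 3, 10 };
        printf("%u %u %u\n", find_symbol(cum_prob, 6, 1),
                             find_symbol(cum_prob, 6, 2),
                             find_symbol(cum_prob, 6, 4));   /* prints: 1 4 6 */
        return 0;
    }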
In terms of memory space, arithmetic coding is more economical than
minimum-redundancy coding in both the encoder and decoder. Just one ar-
ray of n_max words is required in each, where n_max is the externally-stipulated
maximum symbol index. If n, the number of symbols that actually appear,
is very much smaller than n_max and the subalphabet is sparse, then other data
structures might be required. As is the case with function mr_encode_block(),
an array implementation is only appropriate when the subalphabet is dense.
Consider now the compression effectiveness of arithmetic coding. In the
discussion earlier it was suggested that the number of emitted bits c to represent
pn log2 [ pn (r + 1) / (1 + pn r) ] + (1 − pn) log2 [ (r + 1) / r ],    (5.1)
where pn is the true probability of the symbol that is allocated the truncation
excess (step 3 of Algorithm 5.2 on page 99). This means that the compression
loss is never greater than approximately log2 e / 2^(b−f−2), and is monotonically
decreasing as pn increases. Hence, if the error is to be minimized, the alphabet
should be ordered so that the symbol sn is the most likely, in contrast to the
arrangement assumed throughout Chapter 3 and Chapter 4. This is why in
b − f   Worst-case error (bits/symbol)   Average-case error (bits/symbol)
2       1.000                            0.500
4       0.322                            0.130
6       0.087                            0.033
8       0.022                            0.008
10      0.006                            0.002

Table 5.4: Limiting worst-case and average-case errors, bits per symbol, as pn → 0.
the example of Table 5.3 on page 108 the message compressed was M' =
[6, 1, 6, 6, 6, 4, 6, 6, 1, 6] rather than M = [1, 2, 1, 1, 1, 5, 1, 1, 2, 1].
Moffat et al. also showed that R, which is constrained in the range 2^(b−2) <
R ≤ 2^(b−1), can be assumed to have a density function that is proportional to
1/R, and hence that the bound of Equation 5.1 is pessimistic, as R is larger than
its minimal value a non-trivial fraction of the time. Table 5.4 gives numeric
values for the worst-case and average-case errors, assuming that the source is
true to the observed frequency distribution, and that pn is close to zero, the
worst that can happen.
If the coder is organized so that symbol sn is the most probable symbol
then the bits-per-symbol error bound of Equation 5.1 can be used to derive an
upper bound on the relative error, as an alphabet of n symbols and maximum
probability pn must have an entropy (Equation 2.2 on page 17) of at least
a lower bound achieved when as many as possible of the other symbols have
the same probability as symbol sn. (Note that for simplicity it is assumed in
this calculation that x log x = 0 when x = 0.) Figure 5.3, taken from Moffat
et al. [1998], shows the relative redundancy as a function of log2 pn for various
values of b − f. The vertical axis is expressed as a percentage redundancy
relative to the entropy of the distribution. As can be seen, when b − f ≥ 6
the relative redundancy is just a few percent, and effective coding results, even
on the extremely skew distributions that are not handled well by minimum-
redundancy coding. Note also that when pn is close to 1, the compression loss
diminishes rapidly to zero, regardless of the value of b − f.
To put these values into a concrete setting, suppose that b = 32, possible
with almost all current hardware. Working with b − f = 8 allows the sum
of the frequency symbol counts t to be as large as 2^(32−8) = 2^24 ≈ 16 × 10^6,
with a compression loss of less than 0.01 bits per symbol on average. That
is, function arithmetic_encode_block() can process messages of up to m =
Figure 5.3: Upper bound on relative redundancy: the excess coding cost as a percent-
age of entropy, plotted as a function of log2 pn and of b − f, assuming sn is the most
probable symbol. Taken from Moffat et al. [1998].
5.4 Variations
The first area where there is scope for modification is in the renormalization
regime. The mechanism illustrated in Algorithm 5.2 is due to Witten et al.
[1987], and the decoder arrangement of Algorithm 5.4 (using D = V - L) was
described by Moffat et al. [1998]. The intention of the renormalization process
is to allow incremental output of bits, and the use of fixed-precision arithmetic;
and other solutions have been developed.
One problem with the renormalization method described above is that it is
potentially bursty. If by chance the value of bits_outstanding becomes large,
starvation might take place in the decoding process, which may be problem-
atic in a communications channel or other tightly-clocked hardware device. A
solution to this problem is the bit stuffing technique used in a number of IBM
hardware devices [Langdon and Rissanen, 1984]. Suppose that an output reg-
ister logically to the left of L is maintained, and a bit from L is moved into this
PAGE 114 COMPRESSION AND CODING ALGORITHMS
register each time R is doubled. When the register becomes full it is written,
and then inspected. If upon inspection it is discovered to be all "1" bits, then
instead of the register's bit-counter being set back to zero, which would mean
that all bit positions in the register are vacant, it is set to one, which creates a
dummy "0" bit in the most significant position. Processing then continues, but
now any carry out of the most significant bit of L will enter the register, and
either stop at a more recent "0" bit, or propagate into the dummy bit. Either
way, there is no need for the encoder to renege upon or delay delivery of any
of the earlier values of the register.
In the decoder, if an all-ones word is processed, then the first bit of the
following word is inspected. If that bit is also one, then an unaccounted-for
carry must have taken place, and the decoder can adjust its state variables ac-
cordingly. If the lead bit of the following word is a zero, it is simply discarded.
This mechanism avoids the possible problems of starvation, but does have the
drawback of making the decoder more complex than was described above. This
is essentially the only drawback, as the redundancy introduced by the method
is very small. For example, if the register is 16 bits wide then an extra bit will
be introduced each time the register contains 16 "1" bits. If the output from a
coder is good, it should be an apparently random stream of ones and zeros, and
so an extra bit will be inserted approximately every 2 × 2^16 bytes, giving an
expansion of just 0.0001%.
A different variation is to change the output unit from bits to bytes, a sug-
gestion due to Michael Schindler [1998]. As described above, arithmetic cod-
ing operates in a bit-by-bit manner. But there is no reason why R cannot be
allowed to become even smaller before renormalization takes place, so that one
byte at a time of L can be isolated. Algorithm 5.7 shows how the encoder is
modified to implement this.
The key difference between Algorithm 5.7 and the previous version of
arithmetic_encode() is that at step 5 the renormalization loop now executes
only when R ≤ 2^(b−8), that is, when there are eight leading zero bits in R and
hence eight bits of L that are, subject to possible later carry, available for out-
put. The carry situation itself is detected prior to this at step 4. If any previous
zero bits have to be recanted then the normalized value of L will exceed 1.0,
which corresponds still to 2^b. In this case the carry is propagated via the use of
function byte_carry(), and L is decreased by 1.0 to bring it back into the nor-
mal range. Note that the fact that L ≥ 2^b is now possible means that if w is the
word size of the hardware being used, then b ≤ w − 1 must be used, whereas
previously b ≤ w was safe. On the other hand, now that b < w, it is possible
to allow R to be as large as 1.0 rather than the 0.5 maximum maintained in
Algorithm 5.2, so there is no net effect on the number of bits available for R,
which still has as many as w - 1 bits of precision.
Algorithm 5.7
Arithmetically encode the range [l/t, h/t) using fixed-precision integer
arithmetic and byte-by-byte output. The bounds at each call are now
2^(b−8) < R ≤ 2^b, 0 ≤ L < 2^b, and L + R ≤ 2^(b+1). With the carry test written
as it is here, b must be at least one less than the maximum number of bits
used to represent integers, since transient values of L larger than 2^b may be
calculated. This means that range R should be initialized to 2^b, which can
now be represented. With a modified carry test, b = w can be achieved to
allow the decoder to also be fully byte-aligned.
arithmetic_encode_bytewise(l, h, t)
1: execute steps 1 to 6 of Algorithm 5.2 on page 99
2: if L ≥ 2^b then
3: set L ← L − 2^b
4: byte_carry()
5: while R ≤ 2^(b−8) do
6: set byte ← right_shift(L, b − 8)
7: byte_plus_prev(byte)
8: set L ← L − left_shift(byte, b − 8)
9: set L ← left_shift(L, 8) and R ← left_shift(R, 8)
Algorithm 5.8
Execute a carry into the bitstream represented by last_non_ff_byte and
number_ff_bytes.
byte_carry()
1: set last_non_ff_byte ← last_non_ff_byte + 1
2: while number_ff_bytes > 0 do
3: put_one_byte(last_non_ff_byte)
4: set last_non_ff_byte ← "00"
5: set number_ff_bytes ← number_ff_bytes − 1
Byte-oriented output from an arithmetic coder, with provision for carry.
byte_plus_prev(byte)
1: if this is the first time this function is called then
2: set last_non_ff_byte ← byte and number_ff_bytes ← 0
3: else if byte = "FF" then
4: set number_ff_bytes ← number_ff_bytes + 1
5: else
6: put_one_byte(last_non_ff_byte)
7: while number_ff_bytes > 0 do
8: put_one_byte("FF")
9: set number_ff_bytes ← number_ff_bytes − 1
10: set last_non_ff_byte ← byte
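The FF-counting logic is compact in C. In the sketch below put_one_byte() simply prints hexadecimal values, and flush_pending() is a hypothetical helper, not part of Algorithm 5.8, added only so that the demonstration can emit the byte still being held back at the end.

    #include <stdio.h>

    static int      have_prev = 0;      /* has byte_plus_prev() been called yet? */
    static unsigned last_non_ff_byte;
    static unsigned number_ff_bytes;

    static void put_one_byte(unsigned b) { printf("%02X ", b); }

    /* byte-oriented output with provision for carry, as in Algorithm 5.8 */
    static void byte_plus_prev(unsigned byte)
    {
        if (!have_prev) {
            last_non_ff_byte = byte;
            number_ff_bytes = 0;
            have_prev = 1;
        } else if (byte == 0xFF) {
            number_ff_bytes++;          /* hold back runs of FF bytes */
        } else {
            put_one_byte(last_non_ff_byte);
            while (number_ff_bytes > 0) {
                put_one_byte(0xFF);
                number_ff_bytes--;
            }
            last_non_ff_byte = byte;
        }
    }

    /* a carry out of L: increment the buffered byte, and turn the pending
       FF bytes into 00 bytes (the last of them stays buffered) */
    static void byte_carry(void)
    {
        last_non_ff_byte++;
        while (number_ff_bytes > 0) {
            put_one_byte(last_non_ff_byte);
            last_non_ff_byte = 0x00;
            number_ff_bytes--;
        }
    }

    /* hypothetical: emit whatever is still being held back */
    static void flush_pending(void)
    {
        put_one_byte(last_non_ff_byte);
        while (number_ff_bytes > 0) {
            put_one_byte(0xFF);
            number_ff_bytes--;
        }
    }

    int main(void)
    {
        byte_plus_prev(0x12);
        byte_plus_prev(0xFF);
        byte_plus_prev(0xFF);
        byte_carry();
        byte_plus_prev(0x80);
        flush_pending();
        printf("\n");                   /* prints: 13 00 00 80 */
        return 0;
    }

The sample run shows a carry converting the pending bytes 12 FF FF into 13 00 00 before the next byte is buffered.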
Use of b = 31 meets the constraints that were discussed above, but introduces a problem in the decoder - the call to start_decode() reads b = 31 bits into D, and then all subsequent input operations require 8 bits. That is, while we have achieved a byte-aligned encoder, the decoder always reads in split bytes of 1 bit plus 7 bits. To remedy this, and allow b = 32 even on a 32-bit machine, the test for "L ≥ 2^b" in function arithmetic_encode_bytewise() must be further refined. In some languages - C being one of them, at least for unsigned operands - overflow in integer arithmetic does not raise any kind of exception, and all that happens is that carry bits are lost out of the high end of the word. The net effect is that the computed answer is correct modulo 2^w, where w is the word size. If integer overflow truncation may be assumed, then when a carry has occurred, the new value L' calculated by step 2 of function arithmetic_encode() (Algorithm 5.2 on page 99) will in fact be less than the old value of L. To achieve a full b = w = 32 byte-aligned coder, the old L is retained, and not updated to the new L' value until after the carry condition has been tested: "if L' < L then", and so on.
With or without the additional modification just described, byte-aligned arithmetic coding suffers from the drawback that the number of bits f that can be used for frequency counts must become smaller. The requirement that max{t} ≤ min{R} now means that about seven fewer bits are available for frequency counts than previously. In some applications this restriction may prove problematic; in others it may not, and the additional speed of byte-by-byte output is a considerable attraction.
A compromise approach between byte-alignment and bit-versatility is of-
fered in a proposal by Stuiver and Moffat [1998]. Drawing on the ideas of
table-driven processing that were discussed in Section 4.3, they suggest that a
k-bit prefix of R be used to index a table of 2^k entries indicating how many
bits of L need to be shifted out. For example, if the most significant 8 bits of
R are used to index the shift table, then as much as one byte at a time can be
moved, and the number of actual bit shifting operations is reduced by a factor
of two or more. This method allows f to be as large as b - 2 again, if large
values of t are desired, but it is a little slower than the byte-aligned mechanism
of Schindler.
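One possible way of constructing such a shift table is sketched below. The table size k = 8 and the helper names are illustrative rather than taken from Stuiver and Moffat's description, and the exact number of positions shifted in a given coder also depends on the bounds it maintains for R.

#include <stdio.h>

/* The top K bits of R index a table recording how many leading zero
   bits that prefix contains; that count tells the renormalization loop
   how many positions of L and R can be dealt with in one step.  A
   prefix of zero means "shift K and look again". */

#define K 8
static int shift_table[1 << K];

static int leading_zeros(int v)        /* zeros above the top set bit */
{
    int z = K;
    while (v != 0) { v >>= 1; z--; }
    return z;
}

void build_shift_table(void)
{
    for (int v = 0; v < (1 << K); v++)
        shift_table[v] = leading_zeros(v);
}

/* Usage, for a coder with b-bit state R:
       shift = shift_table[R >> (b - K)];
   after which L and R are shifted left by that many places and the top
   'shift' bits of L emitted, with carry handling exactly as before. */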
As a further option, it is possible to use floating point arithmetic to obtain higher precision. For example, a "double" under the IEEE floating point standard contains a mantissa that is 52 bits long [Goldberg, 1991], so an exact representation for integers up to 2^52 - 1 can be obtained, compared to the more usual 2^32 - 1 that is available in integer arithmetic on most popular architectures.
The structure used for calculating cumulative frequencies is also a component of arithmetic coding which can be replaced by another mechanism. For static coding, which is the paradigm assumed in this chapter, a cum_prob array is adequate, unless the subalphabet is a sparse subset of [1 ... n_max]. For adaptive coding a more elegant structure is required, an issue discussed in detail in Section 6.6 on page 157.
Algorithm 5.9
Arithmetically encode binary value bit, where "0" and "1" bits have previously been observed c0 and c1 times respectively.
binary_arithmetic_encode(c0, c1, bit)
1: if c0 < c1 then
2:     set LPS ← 0 and cLPS ← c0
3: else
4:     set LPS ← 1 and cLPS ← c1
5: set r ← R div (c0 + c1)
6: set rLPS ← r × cLPS
7: if bit = LPS then
8:     set L ← L + R - rLPS and R ← rLPS
9: else
10:    set R ← R - rLPS
11: renormalize L and R, as for the non-binary case
Return a binary value bit, where "0" and "1" bits have previously been observed c0 and c1 times. There is no need to explicitly calculate a target.
binary_arithmetic_decode(c0, c1)
1: if c0 < c1 then
2:     set LPS ← 0 and cLPS ← c0
3: else
4:     set LPS ← 1 and cLPS ← c1
5: set r ← R div (c0 + c1)
6: set rLPS ← r × cLPS
7: if D ≥ (R - rLPS) then
8:     set bit ← LPS, D ← D - (R - rLPS), and R ← rLPS
9: else
10:    set bit ← 1 - LPS and R ← R - rLPS
11: renormalize D and R, as for the non-binary case
12: return bit
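The encoding half of Algorithm 5.9 might be transliterated into C as follows. Renormalization (and any carry handling) is assumed to be supplied elsewhere, exactly as step 11 of the pseudocode assumes, so this is a sketch of the range-splitting step only.

#include <stdint.h>

extern void renormalize(uint32_t *L, uint32_t *R);   /* assumed elsewhere */

/* One binary coding step: c0 and c1 are the observed counts of "0" and
   "1" bits; L and R are the b-bit coder state. */
void binary_arithmetic_encode(uint32_t *L, uint32_t *R,
                              uint32_t c0, uint32_t c1, int bit)
{
    int      LPS  = (c0 < c1) ? 0 : 1;     /* less probable symbol   */
    uint32_t cLPS = (c0 < c1) ? c0 : c1;
    uint32_t r    = *R / (c0 + c1);        /* one division per step  */
    uint32_t rLPS = r * cLPS;              /* range given to the LPS */

    if (bit == LPS) {                      /* LPS: top of the range  */
        *L = *L + *R - rLPS;
        *R = rLPS;
    } else {                               /* MPS: bottom of range,  */
        *R = *R - rLPS;                    /* and the truncation excess */
    }
    renormalize(L, R);
}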
and that they are further symbolized as being either the more probable sym-
bol (MPS) or the less probable symbol (LPS). This identification allows two
savings. It means, as was suggested in Section 5.3, that the truncation excess
can always be allocated to the MPS to minimize the compression inefficiency;
and it also means that the coding of the MPS is achieved with slightly fewer
operations than is the LPS. Finally, note that the MPS receives the truncation
excess, but is coded at the bottom of the [L, L + R) range.
Binary arithmetic coders have one other perhaps surprising application, and that is to code multi-symbol alphabets [Howard, 1997, Moffat et al., 1994]. To see how this can be, suppose that the source alphabet S has n symbols. Suppose also that the symbol identifiers are assigned as the leaves of a complete binary tree of n - 1 internal nodes and hence n leaves. The simplest arrangement is a balanced tree of n leaves and depth ⌈log2 n⌉, but in fact there is no need for any particular structure to the tree. Indeed, it can be a stick - a degenerate tree - if that arrangement should prove to be appropriate for some reason. Finally, suppose that each of the internal nodes of this tree is assigned a pair of conditional probabilities, calculated as follows. Let p_l be the sum of the probabilities of all of the symbols represented in the left subtree of the node, and p_r the sum of the probabilities of the symbols represented in the right subtree. Then the probability assigned to the left subtree is p_l/(p_l + p_r) and the probability assigned to the right subtree is p_r/(p_l + p_r).
To represent a particular symbol the tree is traversed from the root, at each node coding a binary choice "go left" or "go right" based upon the associated probabilities p_l and p_r. The overall code for the symbol is then the sum of the incremental codes that drive the tree traversal. Because the sum of the logarithms of the probabilities is the same as the logarithm of their product, and the product of the various conditional probabilities telescopes to p_s when symbol s is being coded, the net cost for symbol s is -log2 p_s bits.
Given that this works with any n-leaf tree, the obvious question to ask is
how should the tree be structured, and how should the symbols be assigned
to the leaves of the tree, so the process is efficient. This question has three
answers, depending upon the criterion by which "efficient" is to be decided.
If efficiency is determined by simplicity, then there are two obvious trees
to use. The first is a stick, that is, a tree with one leaf at depth one, one at
depth two, one at depth three, and so on. This is the tree that corresponds in a
prefix-code sense to the unary code described in Section 3.1 on page 29. Each
binary arithmetic code emitted during the transmission of a symbol number s
can then be thought of as a biased bit of a unary code for s, where the bias is by
exactly the right amount so that a zero-redundancy code for s results. The other
obvious choice of tree is a balanced binary tree. In this case the mechanism
can be thought of as coding, bit by bit, the binary representation of the symbol
Figure 5.4: Example of binary arithmetic coding used to deal with a multi-symbol alphabet. In this example the source alphabet is S = [1 ... 6], with symbol frequencies P = [7,2,0,0,1,0], and the tree is based upon the structure of a minimal binary code.
number s, again with each bit biased by exactly the right amount. This tree has
the advantage of requiring almost the same number of binary arithmetic coding
steps to transmit each symbol, and minimizes the worst case number of steps
needed to code one symbol.
Figure 5.4 shows the tree that results if the alphabet S = [1,2,3,4,5,6] with frequencies P = [7,2,0,0,1,0] is handled via a minimal binary tree. To code symbol s = 2, for example, the left branch out of the root node is taken, and a code of -log2(9/10) bits generated, then the right branch is taken to the leaf node 2, and a code of -log2(2/9) bits generated, for a total codelength (assuming no compression loss) of -log2(2/10), as required. Note that probabilities of 0/1 and even 0/0 are generated but are not problematic, as they correspond to symbols that do not appear in this particular message. Probabilities of 1/1 correspond to the emission of no bits.
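The telescoping of the path costs is easily checked numerically for this example; the few lines of C below simply evaluate the two-step and one-step code lengths for symbol 2, assuming no per-step compression loss.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double step1  = -log2(9.0 / 10.0);   /* root: go left           */
    double step2  = -log2(2.0 /  9.0);   /* internal node: go right */
    double direct = -log2(2.0 / 10.0);   /* one multi-symbol step   */

    printf("tree path: %.4f + %.4f = %.4f bits\n", step1, step2, step1 + step2);
    printf("direct:    %.4f bits\n", direct);
    return 0;
}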
The second possible measure of efficiency is to minimize the average num-
ber of calls to function binary_arithmetic_encode(). It should come as no sur-
prise to the reader (hopefully!) that the correct tree structure is a Huffman tree,
as this minimizes the weighted path length over all binary trees for the given
set of probabilities. The natural consequence of this is that, as far as is possi-
ble, the conditional binary probabilities used at each step will be approximately
0.5, as in a Huffman tree each node represents a single bit, and that single bit
carries approximately one bit of information.
The third possible measure of efficiency is the hardest to minimize, and
that is compression effectiveness. In any practical arithmetic coder each binary
coding step introduces some small amount of compression loss, and these must
be aggregated to get an overall compression loss for the source symbol. For
example, some binary arithmetic coders are closest to optimal when the proba-
bility distribution is extremely skew - an arrangement that is likely to occur if
a unary-structured tree is used on a decreasing-probability alphabet.
The idea of using binary arithmetic coding to stipulate a path through a tree
can also be applied to infinite trees. For example, each node of the infinite tree
that corresponds to the Elias C_γ code can also be assigned a biased bit and then
used for arithmetic coding.
In practical terms, there are two drawbacks to using a tree-structured binary
coder - time and effectiveness. Unless the probability distribution is strongly
biased in favor of one symbol, multiple binary coding steps will be required on
average, and there will be little or no time saving compared to a single multi-
alphabet computation. And because compression redundancy is introduced at
each coding step, it is also likely that the single multi-alphabet code will be
more effective. What the tree-structured coder does offer is an obvious route to
adaptation, as the two counts maintained at each node are readily altered. But
adaptive probability estimation is also possible in a multi-alphabet setting, and
the issue of adapting symbol probability distributions will be taken up in detail
in Section 6.6 on page 157.
This modified mapping had the advantage of working with b-bit integer arithmetic, and, provided b - f was not too small and the truncation excess was allocated to the most probable symbol, of not causing too much compression loss.
f(x, t, R) = x,        if x < d,
           = 2x - d,   otherwise,        (5.4)
Algorithm 5.10
Use a simple mapping from [0 ... t] to [0 ... R] as part of an arithmetic coder. The while loop is required to ensure R/2 < t ≤ R prior to the mapping process.
approximate_arithmetic_encode(l, h, t)
1: while t ≤ R/2 do
2:     set l ← 2 × l, h ← 2 × h, and t ← 2 × t
3: set d ← 2 × t - R
4: set L ← L + max{l, 2 × l - d}
5: set R ← max{h, 2 × h - d} - max{l, 2 × l - d}
6: renormalize L and R, as described previously
where d = 2t - R is the number of values in the range [0, t) that are allocated
single units in [0, R).
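A direct rendering of the mapping of Equation 5.4 in C might look like the fragment below. The inverse shown with it is one plausible form of the corresponding decoder-side step, not necessarily the exact calculation performed by approximate_decode_target(); the function names are illustrative.

/* Stretch [0, t] onto [0, R] when R/2 < t <= R: the first d = 2t - R
   values keep single units, and the remaining t - d values get double
   units. */
unsigned long approx_map(unsigned long x, unsigned long t, unsigned long R)
{
    unsigned long d = 2 * t - R;
    return (x < d) ? x : 2 * x - d;
}

/* One way the inverse might look: values below d map back unchanged,
   the rest are halved (after adding d) so that both members of a
   doubled pair return the same x. */
unsigned long approx_unmap(unsigned long v, unsigned long t, unsigned long R)
{
    unsigned long d = 2 * t - R;
    return (v < d) ? v : (v + d) / 2;
}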
The easiest way to ensure that R/2 < t ≤ R in arithmetic_encode_block() (Algorithm 5.3) is to scale the frequency counts P[s] so that their total t equals 2^(b-2), the lower limit for R. Use of the initialization R = 2^(b-1) in function start_encode() then ensures that the constraint is always met. This scaling approach is tantamount to performing pre-division and pre-multiplication, and so the multiplicative operations are not avoided entirely; nevertheless, they are performed per alphabet symbol per block, rather than per symbol transmitted. If control over the block size is possible, another way of achieving the necessary relationship between t and R is to choose m = 2^(b-2). This choice forces t = m to be the required fixed value without any scaling being necessary, at the cost of restricting the set of messages that can be handled.
A more general way of meeting the constraint on the value of t is illustrated in Algorithm 5.10. Now all of l, h, and t are scaled by a power of two sufficiently large to ensure that the constraint is met. The coding then proceeds using the mapping function of Equation 5.4. Also illustrated is the function approximate_decode_target(), which is identical in purpose to the earlier function decode_target(), but scales t before applying the inverse of the approximate mapping. The remaining function, approx_arithmetic_decode(), makes
Table 5.5: Cost of using arithmetic coding to compress file WSJ.Words (Table 4.5 on page 71) using exact symbol frequencies and approximate symbol frequencies, expressed as bits per symbol of the source file. In the case of approximate frequencies, each symbol was assigned to the bucket indicated by ⌊log2 p_i⌋, and each symbol in that bucket assigned the frequency ⌊1.44 × 2^⌊log2 p_i⌋⌋. Two different block sizes are reported: 1,000 symbols per block, and 1,000,000 symbols per block.
The threshold value p that separates the use of q and q + 1 will thus be such that the expected cost of using q/R as the probability estimate is equal to the expected cost of using (q + 1)/R:

p log2 (q/R) + (1 - p) log2 ((R - q)/R) = p log2 ((q + 1)/R) + (1 - p) log2 ((R - q - 1)/R),        (5.5)

which solves to give

p = log2 ((R - q - 1)/(R - q)) / log2 ( (q/(q + 1)) × ((R - q - 1)/(R - q)) ).
Table 5.6: Transition table for state [2,10) in a table-driven binary arithmetic coder
with b = 4. Each row corresponds to one probability range. When a symbol is en-
coded, the indicated bits are emitted and the state changed to the corresponding next
state. Bits shown as "1" indicate that bits-outstanding should be incremented.
Our sketch of table-driven arithmetic coding has been brief; Howard and
Vitter [1994b] give a detailed example that shows the action of their quasi-
arithmetic binary coding process. Howard and Vitter also describe how the
mechanism can lead to a practical implementation that requires a manageable
amount of space. Variants that operate on multi-symbol source alphabets are
also possible, and are correspondingly more complex.
Adaptive Coding
In the three previous chapters it has been assumed that the probability distri-
bution is fixed, and that both encoder and decoder share knowledge of either
the actual symbol frequencies within the message, or of some underlying dis-
tribution that may be assumed to be representative of the message. While there
was some discussion of alternative ways of representing the prelude in a semi-
static system such as those of Algorithm 4.6 on page 83 and Algorithm 5.3
on page 102, we acted as if the only problem worth considering was that of
assigning a set of codewords.
There are two other aspects to be considered when designing a compression
system. Chapter 1 described compression as three cooperating processes, with
coding being but one of them. A model must also be chosen, and a mechanism
put in place for statistics (or probability) estimation. Modeling is considered in
Chapter 8, which discusses approaches that have been proposed for identifying
structure in messages. This chapter examines the third of the three components
in a compression system - how probability estimates are derived.
probabilities that is loaded into both encoder and decoder prior to compression
of each message. Each message can then be handled using the same set of
fixed probabilities, and, provided that the messages compressed in this way
are typical of the training text and a good coding method is used, compression
close to the message self-information should be achieved.
One famous static code was devised by Samuel Morse in the 1830s for use
with the then newly-invented telegraph machine. Built around two symbols -
the "dot" and the "dash" - and intended for English text (rather than, say, nu-
meric data), the Morse code assigns short code sequences to the vowels, and
longer codewords to the rarely used consonants. For example, in Morse code
the letter "E" (Morse code uses an alphabet of 48 symbols including some
punctuation and message control, and does not distinguish upper-case from
lower-case) is assigned the code ".", while the letter "Q" has the code "- - . -".
Morse code has another unusual property that we shall consider further in Sec-
tion 7.3 on page 209, which is that one of the symbols costs more to transmit
than does the other, as a dash is notionally the time duration of three dots. That
is, in an ideal code based upon dots and dashes we should design the codewords
so that there are rather more dots than dashes in the encoded message. Only
then will the total duration of the encoded message be minimized.
Because there is no prelude transmitted, static codes can outperform semi-
static codes, even when the probability estimates derived from the training text
differ from those of the actual message. For example, suppose that the dis-
tribution P = [0.67,0.11,0.07,0.06,0.05,0.04] has been derived from some
training text, and the message M = [1,1,1,5,5,3,1,4,1,6] is to be transmit-
ted. Ignoring termination overheads, an arithmetic code using the distribution
P will encode M in 20.13 bits. An arithmetic code using the message-derived
semi-static probability distribution P′ = [0.5, 0.0, 0.1, 0.1, 0.2, 0.1] requires fewer bits: 19.61, to be precise. But unless the probability distribution P′, or, more to the point, the difference between P and P′, can be expressed in less than 20.13 - 19.61 = 0.52 bits, the static code yields better compression.
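Ignoring termination overheads, figures such as these can be checked with a few lines of C; the fragment below reproduces the 19.61-bit cost of coding M under the semi-static distribution P′, and the same loop with any other distribution gives the corresponding cost for that distribution.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double Pprime[] = {0.5, 0.0, 0.1, 0.1, 0.2, 0.1};  /* symbols 1..6 */
    int    M[]      = {1, 1, 1, 5, 5, 3, 1, 4, 1, 6};
    double bits = 0.0;

    for (int j = 0; j < 10; j++)
        bits += -log2(Pprime[M[j] - 1]);   /* cost of the j-th symbol */
    printf("%.2f bits\n", bits);
    return 0;
}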
The drawback of static coding is that sometimes the training text is not
representative of the message, not even in a vague manner, and when this hap-
pens the use of incorrect probabilities means that data expansion takes place
rather than data compression. For example, using Morse code, which is static,
to represent a table of numeric data can result in an expensive representation
compared to alternative codes using the same channel alphabet.
That is, in order to always obtain a good representation, the symbol proba-
bilities estimated by the statistics module should be close - whatever that means
- to the true probabilities, where "true" usually means the self-probabilities
derived from the current message rather than within the universe of all possi-
ble messages. This is why semi-static coding is attractive. Knowledge of the
n × ( 4 + log2 ( m × n_max / n^2 ) )        (6.1)
bits, and a reasonable estimate of the average cost is n bits less than this. For
example, when a zero-order character-based model is being used for typical
English text stored using the ASCII encoding, we have n_max = 256, with n ≈
100 distinct symbols used. On a message of m = 50,000 symbols the prelude
cost is thus approximately 1,400 bits, or about 0.03 bits per symbol, a relatively
small overhead compared to the approximately 5 bits per symbol required to
actually code the message with respect to this model. With this simple model
it is clear that for all but very short sequences the cost of explicitly encoding
the statistics is regained through improved compression compared to the use of
static probabilities derived from training text.
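The arithmetic is easily packaged up; the small program below (names illustrative) evaluates Equation 6.1 and the "n bits less" average-case figure for the zero-order character example just given.

#include <math.h>
#include <stdio.h>

/* Prelude cost of Equation 6.1; n symbols used out of n_max possible,
   message length m. */
double prelude_bits(double n, double n_max, double m)
{
    return n * (4.0 + log2(m * n_max / (n * n)));
}

int main(void)
{
    double worst = prelude_bits(100, 256, 50000);
    printf("worst case about %.0f bits, average about %.0f bits\n",
           worst, worst - 100);           /* roughly 1,400 bits */
    return 0;
}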
Now consider a somewhat more complex model. For example, suppose that
pairs of characters are to be coded as single tokens - a zero-order bigram model.
Such a model may be desirable because it will probably operate faster than
a character-based model; and it should also yield better compression, as this
model is compression-equivalent to one which encodes half of the characters
with zero-order predictions, and the other half (in an alternating manner) using
first-order character-based predictions.
In such a model the universe of possible symbols is n_max = 65,536, of
which perhaps n = 5,000 appear in a 50 kB file. Now the prelude costs approx-
imately 45,000 bits, or 0.9 bits per symbol of the original file. If the bigrams
used are a non-random subset of the 65,536 that are possible, or if the symbol
frequencies follow any kind of natural distribution, then an interpolative code
may generate a smaller prelude than the Golomb code assumed in these calcu-
lations. Nevertheless, the cost of the prelude is likely to be considerable, and
might significantly erode the compression gain that arises through the use of
the more powerful model.
Another way of thinking about this effect is to observe that for every model
there is a "break even point" message length, at which the cost of transmitting
the statistics of the model begins to be recouped by improved compression
compared to (say) a zero-order character-based model. And the more complex
the model, the more likely it is that very long messages must be processed
before the break even point is attained. Bookstein and Klein [1993] quantified
this effect for a variety of natural languages when processed with a zero-order
character-based model.
Figure 6.1: Adaptive probability estimation in Blake's Milton, assuming that each of the 128 standard ASCII characters is assigned an initial frequency count of 1. Each bar represents one character; and the height of the bar indicates the implied information content assigned to that occurrence of that character. Black bars represent blank characters, the most common character in this fragment of text; light gray bars represent occurrences of the second most frequent character, the letter "r".
attainable value, whereas the 19.4 is achievable only after addition of a prelude.
Figure 6.1 shows a similar computation for a longer message, the 128 char-
acters of Blake's verse from Milton, already used as an example in Chapter 2.
In the figure, the cost in bits of arithmetically coding each letter is shown, again
assuming that an initial false frequency count of one is assigned to each symbol
(with n_max = 128). Occurrences of two symbols - blank, and "r" - are picked
out in different shades, to show the way in which the probability estimates con-
verge to appropriate values. This adaptive estimator has a total cost of 738.0
bits (assuming a perfect entropy coder), while the self-information is 540.7
bits. Again, the difference corresponds to the cost of implicitly smearing the
prelude component across the transmission. With n = 25, an explicit prelude
would cost approximately 192.8 bits (using Equation 6.1). Remarkably, this
is almost exactly the cost difference between the adaptive and the semi-static
probability estimators.
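The estimator being illustrated is easy to simulate. The sketch below charges each character of a short piece of text -log2 of its current estimated probability, starting every one of the 128 ASCII symbols with a false count of one; the input string is merely the opening words of the verse, used for illustration, and is not the 128-character fragment behind Figure 6.1.

#include <math.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *msg = "And did those feet in ancient time";  /* illustrative */
    int    count[128];
    long   total = 128;              /* n_max false counts of one */
    double bits  = 0.0;

    for (int i = 0; i < 128; i++)
        count[i] = 1;

    for (const char *p = msg; *p != '\0'; p++) {
        int c = (unsigned char)*p % 128;
        double cost = -log2((double)count[c] / (double)total);
        printf("'%c' costs %.2f bits\n", *p, cost);
        bits += cost;
        count[c]++;                  /* adapt only after coding */
        total++;
    }
    printf("total %.1f bits for %zu characters\n", bits, strlen(msg));
    return 0;
}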
We now have two alternative paradigms for setting the statistics that con-
trol the compression - semi-static estimation, and adaptive estimation. One
requires two passes over the source message, but is fast when actually coding;
the other operates in an on-line manner, but must cope with evolving probabil-
ity estimates, and thus an evolving code. If the cost of paying for the statistics
must be added to the cost of using them, which yields the better compression?
-log2 ( (c_M[j] + 1) / (j + n_max - 1) )

bits, where c_M[j] counts the previous occurrences of symbol M[j], and where, as before, p_s denotes the total number of occurrences of symbol s in the message. Summed over all of the symbols in the message, the cost is

Σ_{j=1}^{m} -log2 ( (c_M[j] + 1) / (j + n_max - 1) )        (6.2)

= -log2 Π_{j=1}^{m} ( (c_M[j] + 1) / (j + n_max - 1) )

= -log2 ( Π_{s=1}^{n_max} p_s! / Π_{j=1}^{m} (j + n_max - 1) )

= log2 ( m! / Π_{s=1}^{n_max} p_s! ) + log2 ( (m + n_max - 1)! / (m! (n_max - 1)!) )        (6.3)
bits. Now compare Equations 6.3 and 6.4. The common term is the cost of sending the message assuming that the statistics are known. The second term in Equation 6.3 is the cost of learning the statistics, by identifying n_max - 1 "boundary" values out of a set of m + n_max - 1 values in total (Equation 1.5 on page 13). In the enumerative code we must add on the cost of the prelude; one way of coding it would cost the same as the second term in Equation 6.3.
That is, assigning a false count of one to each of the n_max symbols in an adaptive estimator corresponds almost exactly to the "add one to each of the frequencies" method for representing the prelude described in Algorithm 5.3 on page 102, the sole difference being that n_max - 1 appears because the n_max-th value is fully determined once the first n_max - 1 are known. A similar result holds for the "subalphabet plus frequencies" prelude mechanism of Algorithm 4.6 on page 83, the cost of which is captured in Equation 6.1 on page 134. No matter how the prelude is transmitted, an adaptive model from a standing start achieves almost exactly the same compressed size as does the decrementing-frequency semi-static model from a flying start, once the prelude cost is factored in to the latter.
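The correspondence can be confirmed numerically, under the assumption that Equation 6.3 has the two-term form shown above. The program below sums the per-symbol costs of Equation 6.2 for a small illustrative message and compares the total with that two-term form, using lgamma() so that the factorials stay in floating point; the message and alphabet size are arbitrary choices.

#include <math.h>
#include <stdio.h>

static double log2fact(double x) { return lgamma(x + 1.0) / log(2.0); }

int main(void)
{
    int M[] = {1, 1, 3, 2, 1, 3, 3, 1, 2, 1};   /* illustrative message */
    int m = 10, n_max = 4;
    int count[5] = {0}, p[5] = {0};

    /* left-hand side: Equation 6.2, accumulated symbol by symbol */
    double adaptive = 0.0;
    for (int j = 1; j <= m; j++) {
        int s = M[j - 1];
        adaptive += -log2((count[s] + 1.0) / (j + n_max - 1.0));
        count[s]++;
    }

    /* right-hand side: "message given statistics" plus "statistics" */
    for (int j = 0; j < m; j++)
        p[M[j]]++;
    double message = log2fact(m);
    for (int s = 1; s <= n_max; s++)
        message -= log2fact(p[s]);
    double stats = log2fact(m + n_max - 1) - log2fact(m) - log2fact(n_max - 1);

    printf("adaptive %.6f, closed form %.6f\n", adaptive, message + stats);
    return 0;
}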
The analysis that led to Equation 6.4 supposed that the frequency counts
were used in a decrementing manner, which in fact would require adaptation
of the codes. On the other hand, the speed of a semi-static coder comes about
through the use of a fixed set of codes. If an initial probability distribution is
used throughout the coding of the message, the length of the coded message is
given by
Σ_{j=1}^{m} -log2 P_M[j] = -log2 ( Π_{s=1}^{n_max} (p_s)^{p_s} / m^m )

≈ -log2 ( Π_{s=1}^{n_max} p_s! / m! ) - (log2 e) ( Σ_{s=1}^{n_max} p_s - m ) + (1/2) ( Σ_{s=1}^{n_max} log2(2π p_s) - log2(2π m) )

< -log2 ( Π_{s=1}^{n_max} p_s! / m! ) + (1/2) Σ_{s=1}^{n_max} log2(2π p_s)        (6.5)
where the approximation at the second line is the result of applying Equation 1.3 on page 13, and where (p_s)^{p_s} is taken to be 1 when p_s = 0. That is, once the cost of the statistics is also allowed for, a true semi-static estimator requires slightly more bits to code a message than does an adaptive estimator or an enumerative estimator. Cleary and Witten [1984a] consider the relationship between adaptive, semi-static, and enumerative estimators in rather more detail
than is given in these few pages, and reach the same conclusion - that intrin-
sically there is no compression loss associated with using an adaptive code. If
anything, the converse is true: the need for transmission of the details of the
probability distribution means that the self-information of a message is a bound
that we might hope to approach when the message is long, but can never equal;
and an adaptive estimator is likely to get closer to the self-information than is
a semi-static estimator using a non-adapting probability distribution.
ity to be sensitive to the size of the alphabet discovered so far. The drawback
of method B - and the reason we have not included it in the experiments that
are described shortly - is that use of the secondary model twice for each sym-
bol can add considerably to the overall cost. For example, with a word-based
model, spelling each word out in full at both its first and second appearances is
a burden to be avoided.
To eliminate the double cost inherent in method B, method C [Moffat,
1990] treats the sequence of flag bits as a message in its own right, so if n
novel flags have been transmitted in a sequence of m such tokens then the cor-
rect estimator for the probability of a novel symbol is n/m. And if a symbol
is not novel, then a probability estimate is already available. The drawback of
this approach is the need to encode each non-novel token in a two-step manner,
first the flag, and then a code for the symbol. That is, two arithmetic coding
steps are required for each known symbol, corresponding to the two factors in
the probability calculation. Nor is it possible to pre-calculate the probabilities
and do a single coding step, as the resultant value of m^2 in the denominator is
likely to require more bits of precision than are available to express probabili-
ties unless m is very small. (This issue was discussed in detail in Section 5.3
on page 98.) As it is described in the table, method C also suffers from the
problem of needing a special case when m = n as well as when m = O.
For these two reasons the one-step mechanism labelled C' in the second
part of Table 6.2 is an attractive alternative [Moffat, 1990]. In this formulation
the first factor is included additively, and so the escape probability is slightly
less than it should be, but not by much.
Method D resulted from work by Paul Howard and Jeff Vitter [1992b].
Rather than adding two "units" when a novel symbol appears (both nand m
increase in the denominator of method C') and only one when a symbol repeats,
method D adds two units for every symbol. When a novel symbol is coded the
two units are shared between p_s and n, as for method C. But when a repeat symbol is coded, method D awards both units to p_s, where method C would
have added only one. In results based upon a PPM-style model (see Section 8.2
on page 221) Howard and Vitter found a small but consistent improvement
when method D was used in place of method C.
The last of the listed escape estimators - method X - was a product of a
study that explicitly investigated the zero-frequency problem [Witten and Bell,
1991]. What all escape probability estimators are really trying to predict is the
number of symbols of frequency zero, as this is the pool of symbols that a novel
symbol - should one appear - must be drawn from. And a plausible approxi-
mation of the number of symbols of frequency zero is likely to be the number t_1 of symbols of frequency one, a quantity that is known to both encoder and
decoder. Like method C, non-novel symbols must be coded in two steps if this
Figure 6.2: Escape probabilities: (a) for character bigrams in the Wall Street Journal text; (b) for WSJ.Words; and (c) for WSJ.NonWords.
the best estimators, and perform well even against method O. Method D also
has the implementation advantage of requiring that slightly less information be
maintained.
The recorded per-symbol costs listed in Table 6.3 are relatively low com-
pared to the cost of the message itself. For example, the self-entropy of the
word stream is more than 11 bits per character, and so the cost of the escape
flag is less than 0.3%. However, the low per-symbol cost in this example should
not be interpreted as meaning that the choice of escape estimator is academic
for practical purposes, as in the three examples the small values are primarily
a consequence of the extremely long messages. In other situations in which an
escape probability might be used - such as in the compression system described
in Section 8.2 - each particular sub-message, which is the set of symbols coded
in one conditioning context, might only be a few tens of symbols long, and the
overall message the result of interleaving tens or hundreds of thousands of such
sub-messages. In such applications the choice of escape estimator is of critical
importance - not only is the length of each sub-message small, but the expected
cost within the context is quite likely to be under one bit per symbol.
Finally in this section, note that Aberg et al. [1997] have developed a gen-
eral parameterized version of the escape probability estimator that allows the
estimation mechanism to be modified as the message is accumulated. Their
results show a further very slight gain compared to method D on certain test
files, but this "method E" mechanism is complex, and not included in our ex-
periments here.
That Huffman trees have the sibling property follows immediately from
Huffman's algorithm - it greedily chooses the subtrees with smallest weight as
siblings at each packaging stage, and generates a code tree. Hence, if nodes
are numbered in order of their packaging by Huffman's algorithm, then they
are numbered in reverse order of a sibling list. Consider the construction of the
Huffman code in Figure 4.2 on page 54, the resulting tree of which is shown in
the top panel of Figure 4.3 on page 56. The first step packages leaves 5 and 6,
so these two nodes will form the last two entries in the sibling list. The second step packages leaves 3 and 4, so they form the next to last entries in the list.
Continuing in this manner yields the tree in Figure 6.3a, where each node is
annotated below with the reverse order in which it is packaged by Huffman's
algorithm. The weight of each package is noted inside each node, assuming
that the probabilities resulted from the processing of a stream of 100 symbols
so far. The numbers above the white leaf nodes indicate the symbol number
currently associated with that leaf. Listing the weights of the nodes in the
reverse order that they are processed with Huffman's algorithm yields
[100,67,33,20,13,11,9,7,6,5,4] ,
which has sibling nodes in adjacent positions. Using the same logic but in
reverse, any code tree which has the sibling property can be generated by Huff-
man's algorithm, so is a Huffman tree. The existence of a sibling list for a code
tree is both a necessary and sufficient condition for the tree to be a Huffman
tree.
The basic idea of dynamic Huffman coding algorithms is to preserve the
sibling property when a leaf has its weight increased by one, by finding its new
position in the sibling list, and updating its position in the tree accordingly. For
example, consider the changes to the tree in Figure 6.3a if symbol 2 has its
frequency incremented from 11 to 12. Firstly the node itself must increase its
weight, altering the sibling list to
[100,67,33,20,13, 12,9,7,6,5,4],
with the altered weight highlighted. The list remains a sibling list, so no re-
structuring of the tree is necessary. Now the frequency increment must be
propagated up the tree to the parent of symbol 2. The sibling list becomes
[100,67,33,21,13,12,9,7,6,5,4] ,
again with no violation of the sibling property. This process continues until the root is reached, with a final sibling list of

[101, 67, 34, 21, 13, 12, 9, 7, 6, 5, 4].
Figure 6.3: Huffman trees with each node labelled with its position in the sibling list
(below), and with its symbol number (above, white leaf nodes only). The four panels
illustrate the situation: (a) prior to any increments; (b) after the frequency of symbol 2
is incremented from 11 to 12, and then to 13; (c) after nodes 5 and 6 in the sibling list
are exchanged; (d) after the frequency of symbol 2 is incremented in its new position.
A second appearance of, and thus increment to, symbol 2 results in further updates to the sibling list, which becomes

[102, 67, 35, 22, 13, 13, 9, 7, 6, 5, 4],

again with the changes highlighted. This sibling list corresponds to the tree
shown in Figure 6.3b. For both of these first two increments, the sibling prop-
erty holds at each stage. The node weights evolve, but the tree and code remain
unchanged.
Now consider what actions are required if symbol 2 is coded and then in-
cremented a third time. If the weight of symbol 2 were to be increased from 13
to 14, the list of node weights becomes
[102,67,35,22,13,14,9,7,6,5,4].
But this list is not a sibling list - the weights are not in non-increasing order-
and the underlying tree cannot be a Huffman tree. To ensure that the increment
we wish to perform will be "safe", it is necessary to swap the fifth and sixth
elements before carrying out the update. In general, the node about to be incre-
mented should be swapped with the leftmost node of the same weight. Only
then can the list of weights be guaranteed to still be non-increasing after the
increment. In the example, after the subtrees rooted at positions 5 and 6 in the
sibling list are swapped, we get the tree shown in Figure 6.3c. The increment
for symbol 2 can now take place; and the sibling list becomes
[102,67,35,22, 14,13,9,7,6,5,4].
As before, the ancestors of the node for symbol 2 must also have their counters
incremented. Neither of these involve further violations of the sibling property.
But if further violations had taken place, they would be dealt with in exactly
the same manner: by finding the leftmost (smallest index) node in the sibling
list with the same weight; swapping it with the node in question; and then
incrementing the weight of the node in its new position. In the example, the
final sibling list after the third increment to symbol 2 is
[103,67, 36,22,14,13,9,7,6,5,4],
and this time the code has adjusted in response to the changing set of self-
probabilities. Figure 6.3d shows the tree that will be used to code the next
symbol in the message.
An overview of the process used to increment a node's weight by one is
given by function sibling_increment() in Algorithm 6.1. The loop processes
each ancestor of i, incrementing each corresponding weight by one. The root
Algorithm 6.1
Increase the weight of tree node L[i] by one, where L is an array of tree nodes from a Huffman tree in sibling list order: L[2j] and L[2j + 1] are siblings for 1 ≤ j < n, and the weight of L[i] is not less than the weight of L[i + 1] for all 1 ≤ i < 2n - 1. This algorithm also alters the structure of L so that it continues to represent a Huffman tree.
sibling_increment(i)
1: while i ≠ 1 do
2:     find the smallest index j ≤ i such that the weight of L[j] is equal to the weight of L[i]
3:     swap the subtrees rooted at L[i] and L[j]
4:     add one to the weight of L[j]
5:     set i ← parent of L[j]
6: add one to the weight of L[1], the root of the tree
of the tree is assumed to be in node L[1]. Before the weight of L[i] can be incremented the leftmost node L[j] of the same weight in the sibling list is located, and the two nodes swapped (steps 2 and 3). This ensures that when one is added to the weight in step 4, the sibling list remains in non-increasing order. There is a danger when swapping tree nodes that a child will swap with one of its ancestors, thereby destroying the structure of the tree. Fortunately, this algorithm only swaps nodes of identical weight, and so neither node can be an ancestor of the other, as ancestors must have weights that are strictly greater than their children in our Huffman tree. This is a complication in the FGK algorithm, as it uses an escape symbol of weight zero.
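The C sketch below replays the three increments to symbol 2 traced above, storing the eleven-node sibling list of Figure 6.3a in parallel arrays. A simple linear scan stands in for the bucket-and-leader structure described in the following paragraphs, and the particular left/right child assignments are one consistent reading of the figure; only the weight sequence matters for the trace.

#include <stdio.h>

#define NODES 11
static int weight[NODES + 1]  = {0, 100, 67, 33, 20, 13, 11, 9, 7, 6, 5, 4};
static int parent[NODES + 1]  = {0,   0,  1,  1,  3,  3,  4, 4, 5, 5, 7, 7};
static int left_c[NODES + 1]  = {0,   2,  0,  4,  6,  8,  0, 10, 0, 0, 0, 0};
static int right_c[NODES + 1] = {0,   3,  0,  5,  7,  9,  0, 11, 0, 0, 0, 0};
/* symbol number held at each leaf position, 0 for internal nodes */
static int symbol[NODES + 1]  = {0,   0,  1,  0,  0,  0,  2, 0, 3, 4, 5, 6};
static int indexof[7]         = {0,   2,  6,  8,  9, 10, 11};  /* symbol -> position */

static void swap_nodes(int i, int j)      /* exchange the subtrees at i and j */
{
    int t;
    t = left_c[i];  left_c[i]  = left_c[j];  left_c[j]  = t;
    t = right_c[i]; right_c[i] = right_c[j]; right_c[j] = t;
    t = symbol[i];  symbol[i]  = symbol[j];  symbol[j]  = t;
    if (symbol[i]) indexof[symbol[i]] = i;          /* leaves moved, fix index */
    if (symbol[j]) indexof[symbol[j]] = j;
    if (left_c[i]) { parent[left_c[i]] = i; parent[right_c[i]] = i; }
    if (left_c[j]) { parent[left_c[j]] = j; parent[right_c[j]] = j; }
}

static void sibling_increment(int i)
{
    while (i != 1) {
        int j = i;                          /* leftmost node of equal weight */
        while (j > 1 && weight[j - 1] == weight[i])
            j--;
        if (j != i)
            swap_nodes(i, j);
        weight[j] += 1;
        i = parent[j];
    }
    weight[1] += 1;                         /* finally, the root */
}

int main(void)
{
    for (int k = 0; k < 3; k++) {           /* symbol 2 is coded three times */
        sibling_increment(indexof[2]);
        for (int i = 1; i <= NODES; i++)
            printf("%d ", weight[i]);
        printf("\n");
    }
    return 0;
}

Running it prints the three sibling lists quoted in the worked example, ending with [103, 67, 36, 22, 14, 13, 9, 7, 6, 5, 4].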
At this level of detail, Algorithm 6.1 is simple and elegant. The loop iterates exactly once for each node on the path from the leaf node representing the symbol to be incremented, to the root, so the number of iterations equals the number of compressed bits processed. If each step runs in O(1) time, then the entire algorithm is on-line. To execute step 2 in O(1) time requires a supporting data structure. Consider, for example, a call to sibling_increment(15) to increment the frequency of the last element of the sibling list

[8, 4, 4, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1].
is maintained that allows O(1) time access to these leader elements. At first it seems that each node can contain a pointer to the leftmost node in the sibling list that shares its weight: the leader node. However, if the leader itself has its weight incremented, it is necessary to update all of the pointers in the nodes to its right that were pointing to that leader: an O(n) time operation. To avoid this difficulty, an extra level of indirection is used, with each node recording which bucket it belongs to, where nodes in a bucket all have identical weight; and with an auxiliary data structure recording the leader of each bucket. Then, if a leader is incremented, an O(1) time update of this auxiliary structure is all that is required to record a bucket's new leader. This structure must be dynamic, as buckets are created and deleted throughout the coding process, so a doubly-linked list is used. In the example just given, the list of bucket pointers would be

B = [1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4],

and the doubly-linked list of leaders would be

D = [1, 2, 4, 8].

That is, element L[i] should be swapped with element D[B[i]] before it has its weight incremented.
The Huffman tree is stored in the same array of nodes used to hold the
sibling list, by adding further pointers that allow the necessary threading. Al-
gorithm 6.2 supplies the considerable detail of adaptive Huffman encoding and
decoding suggested by the sketch of Algorithm 6.1. In these algorithms, and
their supporting functions in Algorithm 6.3, a list of 2n - 1 tree nodes L is
maintained, with each node containing six fields:
• left, right, and parent to represent pointers in the Huffman tree that is threaded through the list;
• weight, the frequency count associated with the node;
• symbol, the symbol number stored at the node, for leaf nodes; and
• bucket, a pointer to the bucket of nodes that share this node's weight.
It is assumed that the escape symbol is s_0, and that the Huffman tree is initialized to contain n = 2 symbols: symbol s_1, and the escape symbol, both with an initial weight of one.
During encoding it is necessary to locate a symbol's leaf in L, so that a
codeword can be generated by following parent pointers from the leaf to the
root of the tree. To facilitate leaf discovery, an array index is maintained such
Algorithm 6.2
Use an adaptive Huffman code to represent symbol x, where 1 ≤ x, updating the Huffman tree to reflect an increase of one in the frequency of x.
adaptive_huffman_encode(x)
1: if index[x] = "not yet used" then
2:     adaptive_huffman_output(0), the codeword for the escape symbol
3:     encode x using some agreed auxiliary mechanism
4:     adaptive_huffman_increment(index[0])
5:     adaptive_huffman_add(x)
6: else
7:     adaptive_huffman_output(x)
8:     adaptive_huffman_increment(index[x])
Algorithm 6.3
Add one to the weight of node L[i] and its ancestors, updating L so that it remains a sibling list, maintaining the tree structure and bucket pointers.
adaptive_huffman_increment(i)
1: while i ≠ 1 do
2:     set b ← L[i].bucket and j ← b.leader
3:     swap the left, right, and symbol fields of L[i] and L[j]
4:     set index[L[i].symbol] ← j and index[L[j].symbol] ← i
5:     set L[L[i].left].parent ← L[L[i].right].parent ← i
6:     set L[L[j].left].parent ← L[L[j].right].parent ← j
7:     set L[j].weight ← L[j].weight + 1
8:     if L[j].weight = L[j - 1].weight then
9:         set L[j].bucket ← L[j - 1].bucket
10:    else
11:        add a new bucket d with d.leader ← j to the bucket list B
12:        set L[j].bucket ← d
13:    if L[j + 1].weight = L[j].weight - 1 then
14:        set b.leader ← j + 1
15:    else
16:        remove bucket b from the bucket list B
17:    set i ← L[j].parent
Add a new symbol s to the underlying Huffman tree by making the final leaf node L[2n - 1] an internal node with two leaves: the old L[2n - 1] and the new symbol s with weight one.
adaptive_huffman_add(s)
1: set all components of L[2n] ← the matching components of L[2n - 1]
2: set L[2n + 1].left ← L[2n + 1].right ← 0, L[2n + 1].weight ← 1, L[2n + 1].symbol ← s, and L[2n - 1].symbol ← "internal"
3: set b ← L[2n - 1].bucket and j ← b.leader
4: adaptive_huffman_increment(2n - 1)
5: set index[L[2n].symbol] ← 2n and index[s] ← 2n + 1
6: set L[2n].parent ← L[2n + 1].parent ← j, L[j].left ← 2n, and L[j].right ← 2n + 1
7: if L[2n].weight = 1 then
8:     set L[2n + 1].bucket ← L[2n].bucket
9: else
10:    add a new bucket d with d.leader ← 2n + 1 to bucket list B
11:    set L[2n + 1].bucket ← d
12: set n ← n + 1
that node L[index[x]] is the leaf representing symbol x. It is assumed that all elements of index are initialized to "not yet used", except for index[0] = 2 and index[1] = 3. Note in function adaptive_huffman_output() that the bits in each codeword are generated in reverse order, so are buffered in variable w before being output. A final quirk in these algorithms is that the weight of the root of the tree, element L[1], is never changed by adaptive_huffman_increment(), and remains throughout at its initial value of zero. This allows it to act as a sentinel for the comparison at step 8 in function adaptive_huffman_increment(), and prevents the leader of L[2]'s bucket from being set to the root.
The resource cost of adaptive Huffman coding is non-trivial. The sibling
list structure requires 6 words in each of 2n - 1 nodes, and the index array
and list of leaders further add to the cost. The total memory requirement of more than 13 words per alphabet symbol is daunting for all but small alphabets. Nor is the method especially quick. Provided that the
increments are by units, linear time behavior is assured. But the large number
of checking operations required for every output bit means that the constant of
proportionality is high, and in practice execution is relatively slow.
More generally, we might wish to adjust the weight of a symbol by any
arbitrary amount, positive or negative. Decrementing a weight by unity can be
achieved in a similar manner to the incrementing, but along with maintaining
a leader for each bucket, a trailer must also be kept, which is the index of the
rightmost element in each bucket. To decrement a weight, the symbol's node
is swapped with the trailer, its weight decremented, and then parent pointers
followed as for incrementing.
Provided that all weights are integral, incrementing by an arbitrary amount
can be achieved by calling function adaptive_hujJman_increment{) as many
times as is necessary, being careful to make sure each call affects the leaf for
the desired symbol in its current position, which may change after each call.
This latter is, in effect, the algorithm of Cormack and Horspool [1984], except
that they make one further observation that allows the process to be faster in
practice: if the difference between the weight of a node and the node directly
to the left of the node's leader is greater than one, then the weight can be incre-
mented by the smaller of that difference and the amount required, without the
need for further movement of the node.
This latter realization suggests that a useful heuristic is to reorder the alphabet so that the most probable symbols are allocated the smallest symbol indices, and then run the cumulative probabilities backwards towards zero in an array rev_cum_prob. When the most probable symbol is coded only one value needs to be incremented; when the second most likely symbol is coded the loop iterates twice, and so on. Rearranging the symbol ordering must itself be done on the fly, as no a priori information is available about the symbol probabilities. Two more arrays are needed to maintain the necessary information about the permuted alphabet - an array symbol_to_index that stores the current alphabet location of each symbol, and an array index_to_symbol that stores the inverse permutation. Coding a symbol s then consists of fetching its h value as rev_cum_prob[symbol_to_index[s]], and taking its l value to be the next value stored in rev_cum_prob, assuming rev_cum_prob[n + 1] = 0. Then, to increment the count of symbol s, it is exchanged with the leftmost symbol of the same frequency, and the same increment loop as was assumed above is used to add one to all of the rev_cum_prob values at and to the left of its new location. In this case the search for the leftmost object of the same frequency can be carried out by binary search, as the search is only performed once per update. This is a useful contrast to Algorithm 6.2, in which the analogous search for a leader is performed once per output bit rather than once per symbol.
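In outline, the scheme just described might look like the following C sketch. The array names follow the text where it supplies them; the alphabet size, the helper freq(), the initialization with counts of one, and the function names bounds() and increment() are illustrative choices rather than details of any published implementation.

#include <stdio.h>

#define N 6                         /* illustrative alphabet size */

/* rev_cum_prob runs "backwards": entry i holds the total frequency of
   the symbols at index positions i..N, so rev_cum_prob[1] is t and
   rev_cum_prob[N+1] is zero, with the most probable symbols at the
   smallest indices. */
static long rev_cum_prob[N + 2];
static int  symbol_to_index[N + 1];
static int  index_to_symbol[N + 1];

static long freq(int i) { return rev_cum_prob[i] - rev_cum_prob[i + 1]; }

void init(void)
{
    for (int i = 1; i <= N; i++) {
        symbol_to_index[i] = i;
        index_to_symbol[i] = i;
    }
    for (int i = 1; i <= N + 1; i++)
        rev_cum_prob[i] = N + 1 - i;         /* every symbol starts at one */
}

/* bounds needed by the arithmetic coder for symbol s */
void bounds(int s, long *l, long *h, long *t)
{
    int i = symbol_to_index[s];
    *h = rev_cum_prob[i];
    *l = rev_cum_prob[i + 1];
    *t = rev_cum_prob[1];
}

/* add one to the frequency of symbol s, preserving the ordering */
void increment(int s)
{
    int  i = symbol_to_index[s];
    long f = freq(i);

    /* binary search for the leftmost index of the same frequency */
    int lo = 1, hi = i;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (freq(mid) <= f) hi = mid; else lo = mid + 1;
    }

    /* swap symbols only; the cumulative values are unchanged because
       the two frequencies are equal */
    int other = index_to_symbol[lo];
    index_to_symbol[lo] = s;     symbol_to_index[s] = lo;
    index_to_symbol[i]  = other; symbol_to_index[other] = i;

    for (int k = 1; k <= lo; k++)            /* the increment loop */
        rev_cum_prob[k] += 1;
}

int main(void)
{
    long l, h, t;
    init();
    increment(3); increment(3); increment(1);
    bounds(3, &l, &h, &t);
    printf("symbol 3: l=%ld h=%ld t=%ld\n", l, h, t);
    return 0;
}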
This mechanism was introduced as part of the seminal arithmetic coding
implementation given by Witten et al. [1987]. It works relatively well for
character-based models for which n = 256, and for files with a "typical" distri-
bution of character frequencies, such as English text stored in ASCII. On aver-
age about ten cum-prob values are incremented for each character coded when
such a model is used, a considerable saving compared to the simpler mecha-
nism. The saving is not without a cost though, and now 3n words of memory
are required for the statistics data structure compared with the n words used by
the approach assumed in Algorithm 5.2 on page 99. Nor does it cope well with
more uniform symbol distributions. On object files Witten et al. measured an
average of 35 loop iterations per symbol, still using a character-based model.
Worse, the structure is simply unable to cope with the demands of more complex models. In a word-based model of the kind considered earlier
in this chapter, there are thousands of symbols in the alphabet, and even using
the frequency-sorted approach the per-symbol cost of incrementing cum-prob
values is very high. Taking an asymptotic approach, to code a message of m
symbols over an alphabet of size n might take as much as O(mn) time - and
if the n distinct symbols are equi-probable then this will certainly happen. For
example, the cyclic message
is represented in approximately O(m log n) bits, but takes O(mn) time to pro-
cess. For this message, c, the number of bits produced, is strictly sublinear
in the time taken, making on-line real-time compression impossible to guaran-
tee. Fortunately, other data structures have been developed for maintaining the
cumulative frequencies.
Figure 6.4: Maintaining cumulative frequencies with a Fenwick tree. In the example,
the unnormalized probabilities P = [15,11,7,6,11,12,8,1,4] are assumed to be the
result of previous symbols having been transmitted. Row (e) then shows the changes
that take place when P[3] is increased from 7 to 8. There is no requirement that the
source alphabet be probability-sorted.
where, as before, array P[k] represents the frequency of the kth symbol of the alphabet at this particular point in time. Row (d) in Figure 6.4 records the values actually stored in the fen_prob array.
It is also useful to define two further functions on array positions: forw(s) returns the next value in the same power of two sequence as s, and back(s) returns the previous one:

forw(s) = s + size(s)
back(s) = s - size(s).

The values stored in the array fen_prob are thus also given by

fen_prob[s] = cum_prob[s] - cum_prob[back(s)],

where cum_prob[s] is the nominal value required for the value h to be passed to the function arithmetic_encode(). For example, symbol s = 6 has a corresponding h value of 62 - row (b), stored as lbound(7) - but the value of fen_prob[6] is 23, being the sum of the frequencies of symbols 5 and 6. To convert the values stored in fen_prob into a cumulative probability, the sequence dictated by the back() function is summed, until an index of zero is reached.
Algorithm 6.4
Return the cumulative frequency of the symbols prior to s in the alphabet, assuming that fen_prob[1 ... n] is a Fenwick tree data structure.
fenwick_get_lbound(s)
1: set l ← 0 and i ← s - 1
2: while i ≠ 0 do
3:     set l ← l + fen_prob[i] and i ← back(i)
4: return l
fen_prob[6] + fen_prob[4] = (cum_prob[6] - cum_prob[4]) + (cum_prob[4] - cum_prob[0]) = cum_prob[6],

where just two terms are involved, as back(6) = 4 and then back(4) = 0. More generally, by virtue of the way array fen_prob is constructed, the sequence of fen_prob values given by

fen_prob[s], fen_prob[back(s)], fen_prob[back(back(s))], and so on,

when summed contains exactly the terms necessary to telescope the sum and yield cum_prob[s].
Algorithm 6.4 gives a formal description of the adaptive coding process. Taking as its argument a symbol identifier s, function fenwick_get_lbound() sums a sequence of fen_prob values to obtain cum_prob[s - 1], which is the l value required to arithmetically encode symbol s. As each fen_prob value is added into the sum the cum_prob values that are not required cancel each other out, and the right result is calculated.
size(s) = s AND (2^w - s)
works on any binary architecture, provided only that s < 2^w. For example, assuming that 1 ≤ s < 2^4 = 16, size(4) and size(10) are calculated to be 4 and 2 respectively:

                        s = 4    s = 10
    s in binary:         0100      1010
    16 - s in binary:    1100      0110
    result after AND:    0100      0010
    result in decimal:      4         2
The cost of calculating size() is thus O(1), and the overall time taken to process symbol s is O(log n). Compared to the sorted array method of Witten et al., this mechanism represents a rare instance of algorithmic development in which both space and time are saved. Using it, the overall cost of adaptively maintaining statistics for message M containing m symbols is O(m log n) time.
One further function is required in the decoder, illustrated in Algorithm 6.5. The target value returned by arithmetic_decode_target() (described in Algorithm 5.4 on page 104) must be located in the array fen_prob. This is accomplished by a binary search variant based around powers of two. The first location inspected is the largest power of two less than or equal to n, the current alphabet size. That is, the search starts at position 2^⌊log2 n⌋. If the value stored at this position is greater than target, then the desired symbol cannot have a greater index, and the search focuses on the first section of the array. On the other hand, if the target is larger than this middle value the search can move right, looking for a diminished target. Once the desired symbol number s has been determined, function fenwick_get_lbound() is used to determine the bound l for the arithmetic decoding function, and fenwick_get_and_increment_count() is used to determine c and to increment the frequency of symbol s. Both of these latter two functions are shared with the encoder.
The attentive reader might still, however, be disappointed, as O(m log n) time can still be superlinear in the number of bits emitted. Consider, for example, the probabilities
Algorithm 6.5
Return the greatest symbol number s that, if passed as argument to function fenwick_get_lbound(), would return a value less than or equal to target.
fenwick_get_symbol(target)
1: set s ← 0 and mid ← 2^⌊log2 n⌋
2: while mid ≥ 1 do
3:     if s + mid ≤ n and fen_prob[s + mid] ≤ target then
4:         set target ← target - fen_prob[s + mid]
5:         set s ← s + mid
6:     set mid ← mid/2
7: return s + 1
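For reference, the three Fenwick tree operations might be coded in C as below, using the frequencies of Figure 6.4 as a check. The names follow the pseudocode; the count-and-increment routine is obtained here as a difference of two lower bounds, which is one convenient formulation rather than necessarily the calculation the text's fenwick_get_and_increment_count() performs.

#include <stdio.h>

#define N 9                       /* alphabet size of the running example */

static long fen_prob[N + 1];      /* fen_prob[s] = sum of P[back(s)+1 .. s] */

static int size(int s) { return s & -s; }     /* equals s AND (2^w - s) */
static int back(int s) { return s - size(s); }
static int forw(int s) { return s + size(s); }

long fenwick_get_lbound(int s)                /* cum_prob[s-1]          */
{
    long l = 0;
    for (int i = s - 1; i != 0; i = back(i))
        l += fen_prob[i];
    return l;
}

long fenwick_get_and_increment_count(int s)   /* frequency of s, then +1 */
{
    long c = fenwick_get_lbound(s + 1) - fenwick_get_lbound(s);
    for (int i = s; i <= N; i = forw(i))
        fen_prob[i] += 1;
    return c;
}

int fenwick_get_symbol(long target, int n)    /* the decoder's search    */
{
    int s = 0, mid = 1;
    while (2 * mid <= n) mid *= 2;            /* largest power of two <= n */
    for (; mid >= 1; mid /= 2) {
        if (s + mid <= n && fen_prob[s + mid] <= target) {
            target -= fen_prob[s + mid];
            s += mid;
        }
    }
    return s + 1;
}

int main(void)                                /* rebuild Figure 6.4's counts */
{
    long P[N + 1] = {0, 15, 11, 7, 6, 11, 12, 8, 1, 4};
    for (int s = 1; s <= N; s++)
        for (long k = 0; k < P[s]; k++)
            fenwick_get_and_increment_count(s);
    printf("lbound(7) = %ld, fen_prob[6] = %ld, symbol for target 61 = %d\n",
           fenwick_get_lbound(7), fen_prob[6], fenwick_get_symbol(61, N));
    return 0;                                 /* prints 62, 23, and 6 */
}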
where it is presumed that P[k] = 0 for k > n. This change means that a different calculation strategy must be employed. Now to find the equivalent cum_prob value for some symbol a two stage process is used, shown as function fast_get_lbound() in Algorithm 6.6. In the first stage sums are accumulated, starting at fast_prob[1], and doubling the index p at each stage until all frequencies prior to the desired symbol number s have been included in the total, plus possibly some additional values to the right of s. The first stage is accomplished in steps 1 to 3 of function fast_get_lbound().
The first loop sets variable p to the first power of two greater than s. That is, the first loop in function fast_get_lbound() calculates p = 2^⌈log2(s+1)⌉. Taking l to be the sum of fast_prob values at the powers of two up to and including the value stored at fast_prob[p/2] means that l also includes all of the values of P to the right of s through to but not including the next power of two at p. The excess, from s + 1 to p - 1, must be subtracted off the preliminary value of l; doing so is the task of the second phase of the calculation, at steps 4 to 6 of function fast_get_lbound(). Note that the processing steps forwards from s, but only as far as the next power of two.
Algorithm 6.6
Return the cumulative frequency of the symbols prior to s in the alphabet, assuming that fast_prob[1 ... n] is a modified Fenwick tree data structure.
fast_get_lbound(s)
1: set l ← 0 and p ← 1
2: while p ≤ s do
3:     set l ← l + fast_prob[p] and p ← 2 × p
4: set q ← s
5: while q ≠ p and q ≤ n do
6:     set l ← l - fast_prob[q] and q ← forw(q)
7: return l
Return the frequency of symbol s using the modified Fenwick tree fast_prob[1 ... n]; after determining the value, add one to the stored frequency of s.
fast_get_and_increment_count(s)
1: set c ← fast_prob[s] and q ← s + 1
2: set z ← min(forw(s), n + 1)
3: while q < z do
4:     set c ← c - fast_prob[q] and q ← forw(q)
5: set p ← s
6: while p > 0 do
7:     set fast_prob[p] ← fast_prob[p] + 1 and p ← back(p)
8: return c
Return the greatest symbol number s that, if passed as argument to function fast_get_lbound(), would return a value less than or equal to target.
fast_get_symbol(target)
1: set p ← 1
2: while 2 × p ≤ n and fast_prob[p] ≤ target do
3:     set target ← target - fast_prob[p] and p ← 2 × p
4: set s ← p and mid ← p/2 and e ← 0
5: while mid ≥ 1 do
6:     if s + mid ≤ n then
7:         set e ← e + fast_prob[s + mid]
8:     if fast_prob[s] - e ≤ target then
9:         set target ← target - (fast_prob[s] - e)
10:        set s ← s + mid and e ← 0
11:    set mid ← mid/2
12: return s
Figure 6.5: Maintaining cumulative frequencies with a modified Fenwick tree. In the
example, the unnormalized probabilities P = [15,11,7,6,11,12,8,1,4] are assumed
to be the result of previous symbols having been transmitted. Row (e) then shows the
changes that take place when P[3] is increased from 7 to 8. The source alphabet need
not be probability-sorted, but a superior bound on execution time is possible if it is.
Algorithm 6.6 also includes the two other functions required in the encoder
and decoder. Function fast_get_and_increment_count() serves the same purpose
as its namesake in Algorithm 6.4. The while loop that calculates the current
frequency count of symbol s again requires just O(1) time on average per call,
where the average is taken over the symbols in 1...n. The second section of
fast_get_and_increment_count() then increments the count of symbol s (steps 5
to 7). All of the values that must be incremented as a result of symbol s being
coded again lie within the region [p, 2p), and updating them takes O(log s)
time. Including calls to both of these encoding functions, the cost of adaptively
encoding symbol s is decreased from O(log n) to O(log s).
Figure 6.5 shows the same coding situation used as an example when the
unmodified Fenwick tree was being discussed. To code symbol s = 3 the sum
15 + 18 is calculated in the loop of steps 1 to 3 in function fast_get_lbound(),
and then the second loop at steps 4 to 6 subtracts 7 to yield the required cu-
mulative sum of l = 26, the starting point of the probability range allocated to
the third symbol. The frequency of symbol s = 3 is then determined to be 7
by the first part of function fast_get_and_increment_count(), which goes on to
increment locations two and three of fast_prob to record that symbol 3 has
occurred another time.
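The following Python sketch mirrors Algorithm 6.6 for the modified structure.
The tree contents shown for the worked example are reconstructed here on the
assumption (implied by Algorithm 6.6 and by the 15 + 18 − 7 = 26 calculation
just described) that fast_prob[i] holds the sum of the frequencies of symbols i
through forw(i) − 1, truncated at n; the array is 1-based, with index 0 unused.

    def lsb(i):
        return i & -i                        # forw(i) = i + lsb(i), back(i) = i - lsb(i)

    def fast_get_lbound(fast, n, s):
        # cumulative frequency of symbols 1 .. s-1
        l, p = 0, 1
        while p <= s:                        # add the entries stored at the powers of two
            l += fast[p]
            p *= 2
        q = s
        while q != p and q <= n:             # subtract the excess from s rightwards
            l -= fast[q]
            q += lsb(q)
        return l

    def fast_get_and_increment_count(fast, n, s):
        # frequency of symbol s, then add one to its stored count
        c, q, z = fast[s], s + 1, min(s + lsb(s), n + 1)
        while q < z:
            c -= fast[q]
            q += lsb(q)
        p = s
        while p > 0:                         # entries on the path back towards zero
            fast[p] += 1
            p -= lsb(p)
        return c

    # Worked example: P = [15,11,7,6,11,12,8,1,4] stored as a modified tree.
    fast = [0, 15, 18, 7, 37, 11, 20, 8, 5, 4]
    print(fast_get_lbound(fast, 9, 3))               # 15 + 18 - 7 = 26
    print(fast_get_and_increment_count(fast, 9, 3))  # 7; entries 2 and 3 are incremented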
depends on the particular application in which the coder is being used. For
many purposes the non-permuted structure - either in original form or in mod-
ified form - will be adequate, as it would be unusual for a message over an
alphabet of n symbols to have a self-information that is o(log n) bits per sym-
bol. Note also that if the source alphabet is approximately probability-ordered,
but not exactly ordered, the modified structure may have an advantage over
the original Fenwick tree. For example, in the word-based model used as an
example several times in this chapter, words encountered early in the text and
assigned low symbol numbers will typically repeat at shorter intervals than
words encountered for the first time late in the source text and assigned high
symbol numbers. Moffat [1999] reports experiments that quantify this effect,
and concludes that the use of the mapping tables to guarantee linear-time en-
coding is probably unnecessary, but that the modified structure does offer better
compression throughput than the original Fenwick tree.
There is one further operation that must be supported by all data structures
for maintaining cumulative frequencies, and that is periodic scaling. Most
arithmetic coders operate with a specified level of precision for the symbol
frequency counts that cannot be exceeded. For example, the implementation
described in Section 5.3 stipulates that the total of the frequency counts t may
not exceed 2^f for some integer f. This restriction means that an adaptive coder
must monitor the sum of the frequency counts, and when the limit is reached,
take some remedial action.
One possible action would be to reset the statistics data structure to the ini-
tial bland state, in which every symbol is equally likely. This has the advantage
of being simple to implement, and it might also be that a dramatic "amnesia"
of the previous part of the message is warranted. For example, it is conceiv-
able that the nature of the message changes markedly at fixed intervals, and
that these changes can be exploited by the compression system. More usual,
however, is a partial amnesia, in which the weight given to previous statistics
is decayed, and recent information is allowed to count for more than historical
records. This effect is achieved by periodically halving the symbol frequency
counts, making sure that no symbol is assigned zero as a frequency. That is, if
p_s is the frequency of symbol s, then after the count scaling the new value p′_s of
symbol s is given by (p_s + 1) div 2. When symbol s occurs again, the addition
of 1 to p′_s is then worth two of the previous occurrences, and the probability
distribution more quickly migrates to a new arrangement should the nature of
the message have changed.
Algorithmically this raises the question as to how such a scaling operation
should be accomplished, and how long it takes. In the cum-prob array of Witten
et aI., scaling is a simple linear-time operation requiring a single scan through
the array. With careful attention to detail, both thefen_prob andfasLprob struc-
6.6. CUMULATIVE STATISTICS PAGE 167
tures can also be scaled in O(n) time. Algorithm 6.7 gives the details for the
jasLprob data structure.
In Algorithm 6.7 the function fast_scaling() makes use of two basic func-
tions. The first, fast_to_probs(), takes a fast_prob array and converts it into a
simple array of symbol frequencies. It does this in situ and in O(n) time. To
see that the second of these two claims is correct, note that despite the nested
loop structure of the function, the number of subtraction operations performed
is exactly n − 1. Then the counts are halved, and finally a similar function
probs_to_fast() is used to rebuild the statistics structure. The total cost is, as
required, O(n) time.
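One way the two conversion functions might be realized is sketched below in
Python, again assuming the fast_prob layout used above. Working left to right
in fast_to_probs() means that every entry subtracted still holds its tree value;
working right to left in probs_to_fast() means that every entry added back has
already been restored. Each array position is visited by an inner loop at most
once over the whole scan, so both conversions run in O(n) time.

    def lsb(i):
        return i & -i

    def fast_to_probs(fast, n):
        # convert the modified Fenwick tree into plain frequencies, in place
        for i in range(1, n + 1):
            q, end = i + 1, min(i + lsb(i), n + 1)
            while q < end:
                fast[i] -= fast[q]
                q += lsb(q)

    def probs_to_fast(fast, n):
        # rebuild the tree from plain frequencies, in place
        for i in range(n, 0, -1):
            q, end = i + 1, min(i + lsb(i), n + 1)
            while q < end:
                fast[i] += fast[q]
                q += lsb(q)

    def fast_scaling(fast, n):
        # approximately halve every frequency, as in Algorithm 6.7
        fast_to_probs(fast, n)
        for s in range(1, n + 1):
            fast[s] = (fast[s] + 1) // 2
        probs_to_fast(fast, n)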
Statistics scaling raises one interesting issue. Suppose that count halving
takes place every k symbols. Then the total number of halvings to encode an
m-symbol message is m/k. At a cost of O(n) time per scaling operation, the
total contribution is O(mn/k) = O(mn), which asymptotically dominates the
O(n + m + c) running time of the adaptive coder, seemingly negating all of the
effort spent in this section to avoid the O(mn) cost of using a simple cum_prob
array. Here we have an example of a situation in which it is erroneous to rely
too heavily upon asymptotic analysis. Count halving does dominate, but when
k is large - as it usually is - the actual contribution to the running time is
small. Another way of shedding light upon this result is to observe that almost
certainly k should be larger than n, as otherwise it may not be possible for
every alphabet symbol to have a non-zero probability. Under the additional
requirement that k ≥ n the O(mn/k) time for scaling becomes O(m).
A few paragraphs ago we suggested that count scaling meets two needs: the
requirement that the total frequency count going into the arithmetic coder be
bounded at f bits, and the desire to give more emphasis to recent symbol occur-
rences than ancient ones, thereby allowing the probability estimates to evolve.
Scaling the frequency counts in the way shown in Algorithm 6.7 achieves both
these aims, but in a rather lumpy manner. For example, in a straightforward
implementation, no aging at all will take place until 2^f symbols have been
processed, which might be a rather daunting requirement when (say) f = 25.
It thus makes sense to separate these two needs, and address them indepen-
dently rather than jointly. To this end, one further refinement has been devel-
oped. Suppose that the current sum of the frequency counts is t, and we wish
to maintain a continuous erosion of the impact of old symbols. To be precise,
suppose that we wish the influence of a symbol that just occurred to be exactly
twice that of one that occurred d symbols ago in the message. Quantity d is the
decay rate, or half-life of the probability estimates.
One way of arranging the required decay would be to multiply each fre-
quency by (1 - x) for some small positive value x after each coding step, and
then add one to the count of the symbol that just appeared. With x chosen
Algorithm 6.7
Approximately halve each of the frequencies stored in the modified Fenwick
tree fast_prob[1...n].

fast_scaling()
1: fast_to_probs(fast_prob, n)
2: for s ← 1 to n do
3:     set fast_prob[s] ← (fast_prob[s] + 1) div 2
4: probs_to_fast(fast_prob, n)
suitably, by the time d steps have taken place, the old total t can be forced to have
effectively halved in weight. The problem with this approach is that O(n) time is
required at each coding step, as all n probability estimates are adjusted.
More economical is to add a slowly growing increment of (1 + x)^t at time t
to the count of the symbol that occurred, and leave the other counts untouched.
The desired relative ratios between the "before" and "after" probabilities still
hold, so the effect is the same. The value of x is easily determined: if, after d
steps, the increment is to be twice as big as it is now, we require

    (1 + x)^d = 2.

The approximation log_e(1 + x) ≈ x when x is close to zero implies that x ≈
(log_e 2)/d. For example, when we expect a distribution to be stable, a long
half-life d is appropriate, perhaps d = 10,000 or more. In this case, x ≈
0.000069 - that is, each frequency increment is 1.000069 times the last
one, and after d = 10,000 such steps the frequency increment is 2. On the other
hand, if the distribution is expected to fluctuate rapidly, with considerable local
variation, we should choose d to be perhaps 100. In this case each frequency
increment will be 1.0069 times larger than its predecessor.
Since one of the assumptions throughout the discussion of arithmetic cod-
ing has been that the frequency estimates are maintained as integers, this raises
the obvious problem of roundoff errors. To reduce these, we scale all quanti-
ties by some suitable factor, and in essence retain fractional precision after the
halving process described in Algorithm 6.7, which is still required to ensure
that t ≤ 2^f.
To see how this works, let us take a concrete example. Suppose that we
have an alphabet of n = 100 symbols, have decided that we should work
with a half-life of d = 1,000 symbols, thus x ≈ 0.00069, and must operate
with an arithmetic coder with f = 24. If we wish to assign an initial false
count to each symbol in the alphabet, any value less than 2^f/n ≈ 167,000
will suffice, provided that the same value is also used for the first increment.
So we can certainly initialize each of the n = 100 frequency counts to (say)
p_i = 10,000. Then, after the first symbol in the message is coded, an increment
of 10,000 is used. The second increment is bigger, ⌈10,000(1 + x)⌉ = 10,007;
and the third bigger again, ⌈10,000(1 + x)^2⌉ = 10,014; and so on. When
the total of all the frequency counts reaches 2^f, all of them are numerically
halved according to Algorithm 6.7, and so too is the current increment, thereby
retaining all relativities. In this way the constraint of the arithmetic coder is
met; the unnormalized probability estimates are maintained as integers; and
the normalized probability estimates are smoothly decayed in importance as
the symbols they represent fade into the past.
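A small Python sketch of this bookkeeping appears below. The class name and
its interface are invented for illustration only; in practice the counts would live
inside whichever cumulative-statistics structure the coder is using.

    import math

    class DecayingCounts:
        # n symbols, half-life d, and an f-bit limit on the total frequency count
        def __init__(self, n=100, d=1000, f=24, first_increment=10_000):
            self.x = math.log(2) / d                    # so that (1 + x)^d = 2
            self.limit = 1 << f
            self.increment = float(first_increment)
            self.counts = [first_increment] * (n + 1)   # 1-based; index 0 unused
            self.total = first_increment * n

        def code(self, s):
            step = round(self.increment)                # 10,000, then 10,007, 10,014, ...
            self.counts[s] += step                      # only the coded symbol is touched
            self.total += step
            self.increment *= 1 + self.x                # the next increment is a little bigger
            if self.total >= self.limit:                # the arithmetic coder's limit is reached
                for i in range(1, len(self.counts)):
                    self.counts[i] = (self.counts[i] + 1) // 2
                self.increment /= 2                     # halve the increment too, keeping relativities
                self.total = sum(self.counts[1:])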
There is only one small hiccup in this process, which is that the use of
permutation vectors to guarantee linear time performance in the modified Fen-
wick tree requires that increments are by unit amounts. That is, with the non-
unit increments we are now proposing to use, the best we can be assured of
is O(logn) time per operation. But even without this additional complication,
we had accepted that the permutation vectors were only warranted in special
situations; and given the discussion just concluded, we can now quite defini-
tively assert that the modified Fenwick tree data structure, without permutation
vectors, but with decaying probability estimates, is the best general-purpose
structure for maintaining the statistics of an adaptive arithmetic coder.
In this section we have seen how to accommodate messages in which the
symbol probabilities slowly drift. The next two sections consider messages
that contain even more violent shifts in symbol usage patterns, and describe
techniques for "smoothing" such discontinuous messages so that they can be
coded economically. Then, after those two sections, we return to the notion
of evolving probability distributions, and show how combining a coder with
a small half-life with a set of coders with a long half-life can yield improved
compression effectiveness for non-stationary messages.
pappoppp#kkk##ddcptrrr#ccp#leefeeiiiepee#s#.e
(a) As a string of characters
113 99 2 1 113 2 1 1 39
110 1 1 2 1 104 1 104 5
117 116 1 1 6 5 1 5 3
113 108 1 109 2 1 112 1 1
2 6 2 1 6 117 2 60 4
(c) As MTF values
Figure 6.6: A possible message to be compressed, shown as: (a) a string of charac-
ters; (b) the corresponding integer ASCII values; and (c) after application of the MTF
transformation. Section 8.3 explains the origins of the message.
1986, Ryabko, 1987]. Figure 6.6c shows the effect the MTF transformation has
upon the example string of Figure 6.6a. The last character is transformed into
the integer 4, as character "e" (ASCII code 101, in the final position in Fig-
ure 6.6b) last appeared 5 characters previously, with just 3 distinct intervening
characters.
There is a marked difference between the "before" and "after" strings.
In the original sequence the most common letter is "p" (ASCII code 112),
which appears 8 times; now the most common symbol is 1, which appears 16
times. The probability distribution also appears to be more consistent and sta-
ble, and as a consequence is rather more amenable to arithmetic or minimum-
redundancy coding. For a wide range of input sequences the MTF transforma-
tion is likely to result in a probability-sorted transformed message, and the very
large number of "1" symbols that appear in the output sequence when there is
localized repetition in the input sequence means that good compression should
be obtained, even with static coding methods such as the interpolative code
(Section 3.4 on page 42). That is, application of the MTF has the effect of
smoothing wholesale changes in symbol frequencies, and when the message
is composed of sections of differing statistics, the MTF allows symbols to be
Algorithm 6.8
Perform an MTF transformation on the message M[1...m], assuming that
each symbol M[i] is in the range 1...n.

mtf_transform(M, m)
1: for s ← 1 to n do
2:     set T[s] ← s
3: for i ← 1 to m do
4:     set s ← M[i], pending ← s, and t ← 1
5:     while T[t] ≠ s do
6:         swap pending and T[t], and set t ← t + 1
7:     set M'[i] ← t
8:     set T[t] ← pending
9: return M'
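For readers who prefer running code, the following Python function performs the
same transformation as Algorithm 6.8, using a list in place of the explicit
shuffling loop; the example call uses a three-symbol alphabet.

    def mtf_transform(msg, n):
        # msg contains symbols in 1..n; the output is the list of 1-based MTF ranks
        table = list(range(1, n + 1))
        out = []
        for s in msg:
            t = table.index(s)           # 0-based position of s in the current list
            out.append(t + 1)            # the MTF value is the 1-based rank
            del table[t]
            table.insert(0, s)           # move s to the front of the list
        return out

    print(mtf_transform([3, 3, 1, 3, 2, 2], 3))   # [3, 1, 2, 2, 3, 1]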
reminiscent of the linear search in a cum_prob array that was discussed in con-
nection with arithmetic coding - it is suitable for small alphabets, or for very
skew probability distributions, but inefficient otherwise. As was the case with
arithmetic coding, we naturally ask if there is a better way; and again, the an-
swer is yes.
Bentley et al. noted that the MTF operations can be carried out efficiently
using a splay tree, a particularly elegant data structure devised by Sleator and
Tarjan [1985]. A splay tree is a self-adjusting binary search tree with good
amortized efficiency for sufficiently long sequences of operations. In particular,
the amortized cost for each access, insertion or deletion operation on a specified
node in an n-node splay tree is O(log n) operations and time. Splay trees also
exhibit some of the behavior of finger search trees, and are ideally suited to the
task of MTF calculation. We are unable to do full justice here to splay trees,
and the interested reader is referred to, for example, Kingston [1990]. But, as
a very crude description, a splay tree is a binary search tree that is adjusted via
edge rotations after any access to any item within the tree, with the net effect
of the adjustments being that the node accessed is moved to the root of the tree.
That node now has a greatly shortened search path; other nodes that shared
several ancestors with the accessed node also benefit from shorter subsequent
search paths. In addition, the tree is always a search tree, so that nodes to the
left of the root always store items that have key values less than that stored at
the root, and so on for each node in the tree.
To use a splay tree to accomplish the MTF transformation, we start with an
array of tree nodes that can be directly indexed by symbol number. Each node
in the array contains the pointers necessary to manipulate the splay tree, which
is built using these nodes in a timestamp ordering. That is, the key used to lo-
cate items in the splay tree is the index in the message at which that symbol last
appeared. Each splay tree node also stores a count of the number of items in its
right subtree within the splay tree. To calculate an MTF value for some symbol
s, the tree node for that symbol is identified by accessing the array of nodes,
using s as a subscript. The tree is then splayed about that node, an operation
which carries out a sequence of edge rotations (plus the corresponding pointer
adjustments, and the corresponding alterations to the "right subtree size" field)
and results in the node representing symbol s becoming the root of the tree.
The MTF value can now be read directly - it is one greater than the number
of elements in the right subtree, as those nodes represent symbols with times-
tamps greater than node s. Finally, the node for s is detached from the tree,
given a new most-recently-accessed timestamp, and then reinserted. The inser-
tion process is carried out by concatenating the left and right subtrees and then
making that combined tree the left subtree of the root. This final step leaves the
node representing symbol s at the root of the tree, and all other nodes in its left
a, a, a, a, ... , a, b, b, b, b, ... , b
in which the number of as is equal to the number of bs. The zero-order self-
information of this sequence is one bit per symbol, but after an MTF transfor-
mation, and even assuming that the initial state of the MTF list has b in the first
position and a in the second position, the output sequence
file according to the same model. The splay coder also had the advantage
of being considerably faster in execution than the control in Jones's experi-
ments, which was an implementation of Vitter's [1987] adaptive minimum-
redundancy coder. Jones also noted that the splay coder only required about
one quarter of the memory space of Vitter's coder - 3n words for an alphabet
of n symbols rather than 13n words.
Moffat et al. [1994] also experimented with splay coding, and found that,
while it executes quickly compared to other adaptive techniques, for homoge-
neous input files it typically generates a compressed bitstream approximately
15% longer than the self-information. In summary, splay coding provides an
interesting point in the spectrum of possible coding methods: it is probably too
ineffective to be used as a default coding mechanism, but is considerably faster
than adaptive minimum-redundancy coding and also a little faster than adaptive
arithmetic coding [Moffat et al., 1994]. Static coding methods (see Chapter 3)
also provide fast operation at the expense of compression effectiveness, but the
advantage of the splay coder is that it exploits localized frequency variations
on non-homogeneous messages.
The second use to which splay trees can be put is in maintaining the cumu-
lative frequencies required by an adaptive arithmetic coder. The Fenwick tree,
described in Section 6.6, requires n words of storage and gives O(log n)-time
performance; while, assuming that the alphabet is not probability-sorted and
thus that permutation vectors are required, the modified Fenwick tree requires
3n words and gives O(log s)-time calculation, where s is the rank of the sym-
bol being coded. The same O(log s)-time bound is offered by a splay tree in
an amortized sense, at a slightly increased memory cost.
To achieve this performance, the symbols in the source alphabet are stored
in a splay tree in normal key order, so that an in-order traversal of the tree
yields the alphabet in sorted order. In addition to tree pointers, each node in
the tree also records the total weight of its left subtree, and from these values
the required cumulative frequencies can be calculated while the tree is being
searched for a symbol. The splaying operation that brings that accessed symbol
to the root must also modify the left-subtree-count fields of all of the nodes it af-
fects. However, the modifications to symbol frequencies need not be by single
units, and a half-life decaying strategy can be incorporated into this structure.
Sleator and Tarjan [1985] prove a result that they call the "Static Optimality
Theorem" for splay trees; namely, that a splay tree is at most a constant factor
more expensive for a sequence of searches than a static optimal binary search
tree built somehow "knowing" the access frequencies. For a sequence of m
accesses to a tree of n items, where the ith item is accessed v_i times, this result
bounds the total cost of the accesses at

    O(n + m + Σ_{i=1}^{n} v_i log2(m/v_i)).
The sum inside the parentheses is exactly the self-information of the sequence
(Equation 2.4 on page 22). Hence the claimed linearity - with unit increments
to symbol frequencies, the cost of adaptive arithmetic coding using a splay tree
to manage cumulative frequencies is O(n + m + c), where n is the number of
symbols in the alphabet, m is the length of the message, and c is the number of
bits generated.
Because it yields the same cumulative frequencies as a Fenwick tree, and
is coupled with the same arithmetic coder as would be used with a Fenwick
tree, compression effectiveness is identical. But experimentally the splay tree
used in this way is markedly slower than a Fenwick tree [Moffat et al., 1994].
Compared to a Fenwick tree and a non-permuted modified Fenwick tree it also
uses more memory space - around 4n words for an alphabet of n symbols.
That is, despite the asymptotic superiority of the splay tree, its use in this way
is not recommended.
[Figure 6.7 diagram: a selector array whose entries index a set of secondary
buckets; bucket[3] holds symbols 4 to 7, and bucket[4] holds symbols 8 to 15.]
Figure 6.7: Structured arithmetic coding. The selector component involves a small
alphabet and a small half-life, and adapts rapidly. Within the secondary buckets the
half-life is greater, and adaptation is slower.
allows the selector to rapidly adjust to gross changes in the number of active
symbols, and half lives of as little as 10 can be tolerated. On the other hand
a larger half-life within each bucket means that these distributions will adjust
more slowly, but will be more stable. The combination of fast-adapting selector
and slow-adapting refinement works well when processing the outcome of an
MTF transformation on inputs of the type shown in Figure 6.6 on page 171.
in which the code remained unchanged after the first two increments to symbol
82. Longo and Galasso [1982] formalized this observation by proving that if
P = [(10; 1), (7; 1), (5; 3), (3; 1), (2; 1), (1; 2)]
maps to

    P′ = [(8; 1), (4; 4), (2; 2), (1; 2)].

This is exactly the distribution used as an example in Figure 4.7 on page 80.
If the approximate code is rebuilt after every symbol is processed, and k
is a fixed value, the running time will be O(m log m), perhaps fast enough to
be useful. But while the code generated from P′ is minimum-redundancy with
respect to P′, it is not guaranteed to be minimum-redundancy with respect to
the true probabilities, P. How much compression is lost?
Assume, for the moment, that a perfect code is being used, in which the
codeword length for symbol s_i is equal to the information content of that sym-
bol, I(s_i) = −log2 p_i. Assuming that p_i is an unnormalized self-probability,
the compression loss when symbol s_i is coded is

    δ_i = (−log2(p′_i/m′)) − (−log2(p_i/m))
        < log2(p_i/p′_i),                                        (6.7)

where m′ = Σ_{i=1}^{n} p′_i, and, by construction, m′ ≤ m. By the definition of p′_i,
the ratio p_i/p′_i cannot exceed T, so δ_i < log2 T = 1/k. If k = 1 is used,
the compression loss is at most one bit per symbol, plus the cost of using a
minimum-redundancy code instead of a perfect code (Section 4.9 on page 88).
If k = 2 is used, the compression loss is at most half a bit.
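The rounding that underlies this analysis can be made concrete with a small
sketch. The function below (its name is ours) takes p′ to be p rounded down to
the nearest power of T = 2^{1/k}, which reproduces the P to P′ mapping shown
above for k = 1.

    def approx_prob(p, k=1):
        # largest power of T = 2^(1/k) that does not exceed the count p
        T = 2 ** (1 / k)
        j = 0
        while T ** (j + 1) <= p:
            j += 1
        return T ** j

    print([approx_prob(p) for p in [10, 7, 5, 5, 5, 3, 2, 1, 1]])
    # [8.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 1.0, 1.0]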
This first analysis examined a symbol in isolation, and assumed the worst
possible ratio between p_i and p′_i. As a balance, there must also be times when p_i
and p′_i are close together; and in an amortized sense, the compression loss must
be less than 1/k bits per symbol. For example, consider a message that includes
occurrences of some symbol x, and that k = 1. The first occurrence of x is
coded using a subsidiary model, after transmission of an escape symbol. The
next will be coded with true probability p_x = 1 and approximate probability
p′_x = 1. The third occurrence will be coded with p_x = 2 and p′_x = 2, the fourth
with p_x = 3 and p′_x = 2, the fifth with p_x = 4 and p′_x = 4, and so on. Suppose
in total that x appears 20 times in the message. If we sum Equation 6.7 for each
of these twenty occurrences, we get a tighter upper bound on the compression
loss of

    log2(1/1) + log2(2/2) + log2(3/2) + log2(4/4) + log2(5/4) + ... + log2(20/16) = 7.08

bits, which averages 0.35 bits per symbol for the 20 symbols. That is, the amor-
tized per-symbol loss is considerably less than the 1 bit per symbol predicted
by the log2 T worst-case bound.
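The 7.08-bit total can be checked directly; in the snippet below p′_x is p_x
rounded down to a power of two, as in the k = 1 discussion above.

    import math

    # per-occurrence loss log2(p_x / p'_x) for a symbol that appears 20 times
    loss = sum(math.log2(i / 2 ** (i.bit_length() - 1)) for i in range(1, 21))
    print(round(loss, 2), round(loss / 20, 2))        # 7.08 and 0.35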
Turpin and Moffat [2001] showed that, over a message of m symbols, the
Table 6.4: Upper bounds on compression loss in bits per symbol when encoding a
message using ideal codewords based on a geometric frequency distribution with base
T, rather than self-probabilities. Taken from Turpin and Moffat [2001].
bits per symbol, where T = 2^{1/k} is again the base of the sequence of permitted
unnormalized probabilities.
The bound on compression loss can be further refined if it is assumed that
the self-probabilities gathered over the first m symbols accurately represent
the chance of occurrence of each symbol in the upcoming (m + 1)st position
of the message. If they do, s_i will occur with probability p_i/m, the true self-
probability, and will incur a loss of δ_i. Forming a weighted sum over the whole
source alphabet gives an expected loss for the (m + 1)st symbol of [Turpin and
Moffat, 2001]:

    Δ_{m+1} = Σ_{i=1}^{n} δ_i · p_i/m.
Rebuilding the code after every symbol would, as already noted, require
O(m log m) time. But because the code is based on P′, rather than the true self-
probabilities P, it only needs rebuilding when P′ changes - which only hap-
pens when the approximate frequency p′_i for a symbol takes a quantum jump.
That is, unless the message is made up entirely of unique symbols, the number
of calls to function calculate_twopower_code() is considerably less than m.
Consider the sequence of code rebuilds triggered by symbol x. When it
first occurs, the code is reconstructed to incorporate the new symbol. Upon
its next occurrence, the true probability p_x of symbol x rises from 1 to 2, and
causes an increase in p′_x from T^0 to at least T^1 and a second code construc-
tion attributable to x. However, the next occurrence of x may not lead to an
increase in p′_x. In general, if s_i occurs p_i times in the whole message it can
only have triggered 1 + ⌊log_T p_i⌋ code rebuilds. Each of these rebuilds re-
quires O(log_T m) time. Turpin and Moffat showed that, when summed over
all n symbols in the alphabet, the cost of calculating the minimum-redundancy
codes is

    O(log_T m) · Σ_{i=1}^{n} (1 + ⌊log_T p_i⌋) = O(k^2 (m + c)),

where, as before, c is the number of bits generated, m is the number of symbols
in the input message, and k is the parameter controlling the approximation.
That is, forcing self-probabilities into a geometric distribution with base 2^{1/k}
for some positive integer constant k means that the time taken to completely
rebuild the code each time P′ changes is no more than the time taken to process
the inputs and outputs of the compression system. The whole process is on-
line. This is better than our initial goal of O(m log m) time, and equals the
time bound of adaptive Huffman coding and adaptive arithmetic coding.
But wait, there's more! Now that code generation is divorced from de-
pendence on a Huffman tree, the canonical coding technique from Section 4.3
on page 57 can be used for the actual codeword manipulation. Algorithm 6.9
shows the initialization of the required data structures, and the two functions to
encode and decode a symbol using canonical coding based on a geometric ap-
proximation of the true self-probability. Both of the coding functions make use
of two auxiliary functions, twopower_add() and twopower_increment(), which
alter the data structures to reflect a unit increase in the true probability of the
symbol just coded. The data structures used are:
Algorithm 6.9
Use a canonical code based on a geometric approximation to the
self-probabilities of the m symbols processed so far, to code symbol x,
where 1 ≤ x, updating the code if necessary. The geometric approximation
has base T = 2^{1/k} for a fixed integer k ≥ 1.

twopower_encode(x)
1: if index[x] = "not yet used" then
2:     canonical_encode(index[0]), the codeword for the escape symbol
3:     encode x using some agreed auxiliary mechanism
4:     twopower_increment(0)
5:     twopower_add(x)
6: else
7:     canonical_encode(index[x])
8:     twopower_increment(x)

Return a value assuming a canonical code based on a geometric
approximation of the self-probabilities of the m symbols so far decoded.
The geometric approximation has base T = 2^{1/k} for a fixed integer k ≥ 1.
Once the symbol is decoded, update the appropriate data structures.

twopower_decode()
1: set x ← S[canonical_decode()]
2: twopower_increment(x)
3: if x = 0, the escape symbol, then
4:     decode the new symbol x using the agreed auxiliary mechanism
5:     twopower_add(x)
6: return x
Algorithm 6.10
Add symbol x, where 1 ≤ x, as the nth symbol into S, with initial weight
one, then recalculate the code.

twopower_add(x)
1: set S[n + 1] ← x and index[x] ← n + 1
2: set n ← n + 1 and m ← m + 1
3: set weight[x] ← 1, and f[0] ← f[0] + 1
4: if f[0] = 1, meaning this is the only symbol in the first bucket, then
5:     set leader[0] ← x
6: calculate_twopower_code(), using P = [(T^i; f[i])] and r = ⌊log_T m⌋
binary alphabets are also important, especially for applications such as bi-level
image compression.
The binary arithmetic coding routines presented in Section 5.5 on page 118
are readily modified to deal with adaptive models. All that is required is that
the counts Co and Cl of the number of zero bits and one bits seen previously in
this context be maintained as a pair of scalars; once this is done, the adaptation
follows directly. The table-driven binary arithmetic coder of Section 5.7 can
also be used in an adaptive setting. Indeed, the table-driven coder provides a
hint as to how an even more specialized implementation might function, based
upon a small number of discrete states, and migration between those states. It
is such a coder - and its intrinsically coupled probability estimation regime -
that is the subject of this section.
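A bare-bones version of such an adaptive binary context might look as follows
in Python; the choice of starting both counts at one, so that neither probability
estimate is ever zero, is an assumption of this sketch rather than something
prescribed above.

    class BinaryContext:
        # adaptive statistics for one binary context: counts of 0-bits and 1-bits seen so far
        def __init__(self):
            self.c = [1, 1]                  # assumed initial counts, avoiding zero probabilities

        def prob_of_zero(self):
            return self.c[0] / (self.c[0] + self.c[1])

        def update(self, bit):
            self.c[bit] += 1                 # adaptation is just a scalar increment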
The Q-coder had its origins in two IBM research laboratories [Mitchell and
Pennebaker, 1988, Pennebaker et al., 1988], and has continued to be developed
there and elsewhere [Slattery and Mitchell, 1998]. The basic idea is exactly
as we have already described for multi-alphabet arithmetic coding: a lower
extreme L for the coding range (called the C register in much of the relevant
literature) and a width R (the A register) are adjusted each time a bit is coded,
with the bit being either the MPS (more probable symbol) for this context,
or the LPS (less probable symbol). But rather than perform a calculation to
determine the splitting point as a fraction of A, a fixed quantity Qe is subtracted
from A if an MPS is coded, and if an LPS is coded, A is set to Qe. That is, A is
assumed to be 1, making any scaling multiplication irrelevant; and Qe can be
thought of as being an estimate of the probability of the LPS, which is always
less than 0.5. To minimize the rounding error inherent in the assumption that A
is one, the normalization regime is designed so that logically 0.75 ≤ A < 1.5.
The value of Qe depends upon the estimated probability of the LPS, and is one
of a finite number of predefined values. In the original Q-coder, A and C are
manipulated as 13-bit quantities, and the Q values are all 12-bit values; the
later QM-coder added three more bits of precision.
When A drops below the equivalent of the logical value 0.75, renormaliza-
tion is required. This always happens when an LPS is coded, as Qe < 0.5; and
will sometimes happen when an MPS is coded. The renormalization process is
the same as before: the most significant bit of C is passed to the output buffer-
ing process, and dropped from C; and then both C and A are doubled. The
fact that A is normalized within a different range to that considered in Chap-
ter 5 is immaterial, as it is the doubling of A that corresponds to a bit, not any
particular value of A. Carry bits must still be handled, as C, after a range-
narrowing step, might become greater than one. In the Q-coder a bit-stuffing
regime is used; the later QM-coder manages carries via a mechanism similar to
the byte-counting routines shown in Algorithm 5.8 on page 116.
Table 6.5: Some of the rows of the original 12-bit Q-coder table.

     e    Qe (hex)   Qe (dec)   Renorm. LPS   Renorm. MPS   Exch. LPS
     0      AC1       0.5041         0            +1            1
     1      A81       0.4924        -1            +1            0
     2      A01       0.4690        -1            +1            0
     3      901       0.4221        -1            +1            0
    10      381       0.1643        -2            +1            0
    20      059       0.0163        -2            +1            0
    28      003       0.0006        -3            +1            0
    29      001       0.0002        -2             0            0
The only other slight twist is that when A < 1.0 we might be in the position
of having Qe > A - Qe, that is, of estimating the LPS probability to be greater
than the MPS probability, despite the fact that Qe < 0.5 is the current estimate
of the LPS probability. If this situation arises, the MPS and LPS are temporarily
switched. The decoder can make the same adjustment, and no extra information
need be transmitted.
Because the MPS is the more probable symbol, it saves time if the proba-
bility estimates are adjusted only when renormalization takes place, rather than
after every input bit is processed. That is, reassessment of the probabilities is
carried out after every LPS, and after any MPS that triggers the output of a
bit. It is this re-estimation process that makes the Q-coder particularly inno-
vative. Rather than accumulate counters in each context, a single index e is
maintained. The value of Qe is stored in a fixed table, as are a number of other
pre-computed values. Table 6.5 shows some of the rows in the original 12-bit
Q-coder table.
Before any bits are processed, e, the index into the table that represents
a quantized probability estimate, is initialized to O. This assignment means
that both MPS and LPS are considered to be approximately equally likely, a
sensible starting point. Then, for each incoming bit, the corresponding Qe
value is taken from the table, and used to modify A and C, as described in
the previous paragraphs. The second and third columns of the table show the
12-bit Qe values in hexadecimal and in decimal. Because the maximum value
of A is 1.5, the scaling regime used for A maps 1.5 to the maximum that can be
stored in a 13-bit integer, namely 8,191, or "1FFF" in hexadecimal. The value
1.0 then corresponds to the integer 5,461, or "1555" in hexadecimal.
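The following much-simplified Python sketch shows how the A register and the
estimation index e might evolve, using the handful of table rows listed above.
The C register, carry handling, bit output, and the conditional MPS/LPS
exchange are all omitted; the renormalization threshold 0x1000 is taken here as
the scaled equivalent of the logical 0.75; and the final column of the table is
interpreted as an instruction to invert the sense of the MPS when an LPS occurs
in that state. The clamping of e is only needed because the table excerpt is
truncated.

    # (Qe in hex, move on LPS renormalization, move on MPS renormalization, exchange flag)
    QE = [
        (0x0AC1, 0, +1, 1),     # e = 0
        (0x0A81, -1, +1, 0),    # e = 1
        (0x0A01, -1, +1, 0),    # e = 2
        (0x0901, -1, +1, 0),    # e = 3
    ]
    A_MIN = 0x1000              # assumed scaled value of the logical 0.75

    def code_bit(A, e, mps, bit):
        # returns the new (A, e, mps) and the number of renormalization doublings
        qe, lps_move, mps_move, exch = QE[e]
        renorms = 0
        if bit == mps:
            A -= qe                              # MPS: remove the LPS slice from the interval
            if A < A_MIN:                        # re-estimate only when renormalizing
                e = min(e + mps_move, len(QE) - 1)
                while A < A_MIN:
                    A <<= 1
                    renorms += 1
        else:
            A = qe                               # LPS: the interval becomes the slice of width Qe
            if exch:
                mps = 1 - mps                    # the estimate has crossed one half: flip the MPS
            e = max(e + lps_move, 0)
            while A < A_MIN:                     # an LPS always forces renormalization
                A <<= 1
                renorms += 1
        return A, e, mps, renorms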
[Figure 6.8 (plot): compression effectiveness, in bits per symbol, for the
"Encoding words" experiments, plotted against parameter values from 8 to 512.]
7 Additional Constraints
The coding algorithms presented so far have focussed on minimizing the ex-
pected length of a code, and on fast coding speed once a code has been devised.
Furthermore, all codes have used a binary channel alphabet. This chapter ex-
amines other coding problems.
First, we examine code generation when a limit is imposed on the length
of the codewords. Applying a limit is of practical use in data compression sys-
tems where fast decoding is essential. When all codewords fit within a single
word of memory (usually 32 bits, sometimes 64 bits), canonical decoding (Al-
gorithm 4.1 on page 60) can be used. If the limit cannot be guaranteed, slower
decoding methods become necessary. Section 7.1 examines the problem of
length-limited coding.
The second problem, discussed in Section 7.2, is that of generating alpha-
betic codes, where a lexicographic ordering of the symbols by their codewords
must match the original order in which the symbols were presented to the cod-
ing system. When an alphabetic code is used to compress records in a database,
the compressed database can be sorted into the same order as would be gener-
ated if the records were decompressed and then sorted. Alphabetic code trees
also correspond to optimal binary search trees, which have application in a va-
riety of searching problems. The assumption that the symbols are sorted by
probability is no longer appropriate in this scenario.
The third area we examine in detail (Section 7.3) is the problem of find-
ing codes for non-binary channel alphabets. Unequal letter-cost coding is the
task of determining a code when the symbols in the channel alphabet can no
longer be presumed to be of unit cost. For example, in an effort to minimize
power consumption in a new communications device based upon some novel
technology, we may seek to calculate a code taking into account (say) that a
zero bit takes 10% more energy to transmit than does a one bit. In such a case,
the code should be biased in favor of one bits - but must still also contain zero
Table 7.1: An incomplete code with K(C) < 1, and four possible complete codes that
have K(C) = 1, when a length limit of L = 4 is imposed and the underlying source
probabilities are P = [10,8,6,3,1,1,1,1]. The total cost is given by Σ_{i=1}^{n} p_i · |c_i|.
merge technique are possible, and are canvassed after we present the underly-
ing algorithm.
Like many code generation algorithms, reverse package merge builds on
the greedy design paradigm. In Huffman's algorithm (Section 4.2 on page 53)
codeword lengths increase from zero, while the Kraft sum K (C) decreases
from its initial value of n down to one, with, at each step, the least cost ad-
justment chosen from a set of possible changes. In reverse package merge all
codeword lengths are initially set to L bits, and the initial value of the Kraft
sum, K(C) = n × 2^{-L} ≤ 1, is increased with each greedy choice.
This initial position may well be a length-limited code. If L = log2 n, and
n is an exact power of two, then
other options, are shown in Table 7.1, along with the total bit cost of each. The
observation that the largest decreases in length should be assigned to the most
probable symbols means only these four codes need be considered.
Reverse package merge constructs L lists, with the jth list containing sets
of codeword decrements that increase K(C) by 2^{-j}. Within each list, items
are ordered by their impact on the total cost. In the example, initially |c_i| = 4
for all i, and K(C) = 0.5. A 2^{-1} increase in K(C) is required, so lists are
formed iteratively until the j = 1 list is available.
The process starts with list L. The only way to obtain a 2^{-4} increase in
K(C) is to decrease a codeword length from 4 to 3. There are eight possible
codewords to choose from, corresponding to symbols one through eight, and a
unit decrease in the length of the codeword c_i reduces the total code cost by p_i.
The first list generated by reverse package merge is thus

    2^{-4}:  10_1   8_2   6_3   3_4   1_5   1_6   1_7   1_8,

where the subscript denotes the corresponding symbol number, and the value
is the reduction in the total code cost that results if that codeword is shortened.
This is a list of all possible ways we can increase K(C) by 2^{-4}, ordered by
decreasing impact upon total cost.
Now consider how a 2^{-3} increase in K(C) could be obtained. Either two
codewords can be shortened from length 4 to length 3, a 2 × (2^{-3} − 2^{-4}) = 2^{-3}
change in K(C); or individual codewords that have already been reduced to
length 3 could be reduced to 2 bits, a change of 2^{-2} − 2^{-3} = 2^{-3}. The impact
on the total cost of choosing a pair of 2^{-4} items can be found by adding two
costs from list j = 4. In this example, the biggest reduction is gained by short-
ening 10_1 and 8_2 by one bit each, giving a cost saving of 18, where the absence
of a subscript indicates that the element is a package formed from two elements
in the previous list. The next largest reduction can be gained by packaging 6_3
and 3_4 to get 9. Note that 8_2 and 6_3 are not considered for packaging, as 8_2 was
already combined with the larger value 10_1. Continuing in this manner creates
four packages, each of which corresponds to a 2^{-3} increase in K(C):

    2^{-4}:  10_1   8_2   6_3   3_4   1_5   1_6   1_7   1_8
    2^{-3}:  [18]  10_1   [9]   8_2   6_3   3_4   [2]   [2]   1_5   1_6   1_7   1_8

In this and the next set of lists, the square brackets denote packages created by
combining pairs of objects from the previous list.
The same "package, then merge" routine is done twice more to get a full
set of L = 4 lists:

    2^{-4}:  10_1   8_2   6_3   3_4   1_5   1_6   1_7   1_8
    2^{-3}:  [18]  10_1   [9]   8_2   6_3   3_4   [2]   [2]   1_5   1_6   1_7   1_8
    2^{-2}:  [28]  [17]  10_1   [9]   8_2   6_3   [4]   3_4   [2]   [2]   1_5   1_6   1_7   1_8
    2^{-1}:  [45]  [19]  [14]  10_1   8_2   [7]   6_3   [4]   3_4   [2]   [2]   1_5   1_6   1_7   1_8
Each entry in the last 2^{-1} list represents a basket of codeword length adjust-
ments that have a combined impact of 0.5 on K(C). For example, the first
package, of weight 45, represents two packages at the 2^{-2} level; they in turn
represent two leaves at the 2^{-3} level and two packages at that level; and, finally,
those two packages represent four leaves at the 2^{-4} level.
Once the lists are constructed, achieving some desired amount of increase
to K(C) is simply a matter of selecting the necessary packages off the front of
some subset of these lists, and shortening the corresponding codewords. In the
example, a 0.5 increase in K(C) is desired. To obtain that increase, the package
in list 2^{-1} is expanded. As was noted in the previous paragraph, this package
was constructed by hypothesizing four codewords being shortened from 4 bits
to 3 bits, and two codewords being shortened from 3 bits to 2 bits. The set
of lengths for the length-limited code is thus |C| = [2, 2, 3, 3, 4, 4, 4, 4]. The
exhaustive listing of sensible combinations in Table 7.1 confirms that this code
is indeed the best.
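The arithmetic behind the example is easily verified; the snippet below checks
that the chosen lengths form a complete code and computes their total cost, the
quantity tabulated in Table 7.1.

    P = [10, 8, 6, 3, 1, 1, 1, 1]
    lengths = [2, 2, 3, 3, 4, 4, 4, 4]
    kraft = sum(2 ** -l for l in lengths)            # 1.0, so the code is complete
    cost = sum(p * l for p, l in zip(P, lengths))    # 79 bits in total
    print(kraft, cost)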
It may appear contradictory that the first two symbols have their codeword
lengths decreased from 3 bits to 2 bits before they have their lengths decreased
from 4 bits to 3. But there is no danger of the latter not occurring, as a package
containing original symbols from list 2^{-j} always has a greater weight than the
symbols themselves in list 2^{-j+1}, so the package will be selected before the
original symbols. For example, the package 18 in list 2^{-3} contains the original
symbols 10_1 and 8_2 from list 2^{-4}, and 18 appears before both in list 2^{-3}.
Not so easily dismissed is another problem: what if an element at the front
of some list is selected as part of the plan to increase K(C), but appears as
a component of a package in a subsequent list that is also required as part of
making K(C) = 1? In the example, only a single package was needed to bring
K(C) to 1.0; but in general, multiple packages are required. For example,
consider generating an L = 4 limited code for the slightly smaller alphabet
P = [10, 8, 6, 3, 1, 1, 1]. When n = 7 and 1 − K(C) = 1 − 7 × 2^{-4} =
0.5625, packages are required from the 2^{-1} list (0.5), and the 2^{-4} list (the
other 0.0625). But the first element in the 2^{-1} list contains the first element in
the 2^{-4} list, and the codeword for symbol s_1 can hardly be shortened from 4
bits to 3 bits twice.
To avoid this conflict, any elements to be removed from a list as part of the
K(C) increment must be taken before that list is packaged and merged into the
next list. In the n = 7 example, the first element of 2^{-4} must be consumed
before the list 2^{-3} is constructed, and excluded from further packaging. The
table of lists for P = [10, 8, 6, 3, 1, 1, 1] is thus

    2^{-4}:  {10_1}  {8_2   6_3}   3_4   1_5   1_6   1_7
    2^{-3}:  {[14]  10_1   8_2   6_3}   [4]   3_4   [2]   1_5   1_6   1_7
    2^{-2}:  {[24]  [14]}  10_1   8_2   [7]   6_3   [3]   3_4   [2]   1_5   1_6   1_7
    2^{-1}:  [38]  [18]  [13]  10_1   8_2   [6]   6_3   [3]   3_4   [2]   1_5   1_6   1_7

where the braces mark the two bordered regions, showing the elements
encompassed by the two critical packages rather than the packages themselves.
The increment of 2^{-4} (item 10_1) is taken first, and the remainder of that
list left available for packaging; then the list 2^{-3} is constructed, and no
packages are removed from the front of it as part of the K(C) growth; then the
2^{-2} list is constructed, and again no packages are required out of it; and
finally the 2^{-1} list is formed, and one package is removed from it, to bring
K(C) to 1.0. Working backwards, that one package corresponds to two packages
in the 2^{-2} list; which expand to one package and three leaves in the 2^{-3}
list; and that one package expands to two leaves in the 2^{-4} list, namely,
items 8_2 and 6_3. In this case the final code is |C| = [2, 2, 2, 4, 4, 4, 4],
as symbols 10_1, 8_2, and 6_3 all appear twice within the marked regions.
Astute readers will by now have realized that at most one element can be
required from each list to contribute to the increase in K(C), and that the ex-
haustive enumeration of packages shown in the two examples is perhaps exces-
sive. Even if a package is required from every list, at most one object will be
removed from list 2^{-1}, at most three from list 2^{-2}, at most seven from list 2^{-3},
and so on; and that is the worst that can happen. If not all lists are required
to contribute to lifting K(C) to 1.0, then even fewer packages are inspected.
In the most recent example only two such head-of-list packages are consumed,
and it is only necessary to calculate 14 list entries:

    2^{-4}:  10_1   8_2   6_3   3_4   1_5   1_6   1_7
    2^{-3}:  [14]  10_1   8_2   6_3
    2^{-2}:  [24]  [14]
    2^{-1}:  [38]
Larmore and Hirschberg [1990] constructed the lists in the opposite order
to that shown in these examples, and had no choice but to fully evaluate all
L lists, giving rise to an O(nL) time and space requirement. Reversing the
list calculation process, and then only evaluating list items that have at least
some chance of contributing to the solution, saves O(n log n) time and space,
to give a resource cost for the reverse package merge algorithm that is O(n(L −
log2 n + 1)). A curious consequence of this bound is that if L is regarded as
being constant - for example, a rationale for L = 32 was discussed above -
then the cost of constructing a length-limited code grows less than linearly in
n. As n becomes larger, the constraining force L becomes tighter, but (per
element of P) the length-limited code becomes easier to find.
Algorithms 7.1 and 7.2 provide a detailed description of the reverse pack-
age merge process. Queue value[j] is used to store the weight of each item in
list j, and queue type[j] to store either a "package" flag, to indicate that the cor-
responding item in value[j] is a package; or, if that item is a leaf, the matching
symbol number - the subscript from the examples. Variable excess is first set
to the amount by which K(C) must be increased; from this, the set of packages
that must be consumed is calculated: b_j is one if a package is required from list
2^{-j}, and zero if not. The maximum number ℓ_j of objects that must be formed
in list 2^{-j} is then calculated at steps 8 to 10, by adding b_j to twice the number
of objects required in list 2^{-j+1}, but ensuring that the length is no longer than
the maximum number of packages possible in that list.
The first list, for 2^{-L}, is easy to form: it is just the first ℓ_L symbols from
P, as there are no packages. If one of these objects is required as part of the
K(C) adjustment, that object is extracted at step 16. The first ℓ_j elements
of list j are then iteratively constructed from the symbol probabilities p_i and
list value[j + 1], which, by construction, must have enough elements for all
required packages. Once each list is constructed, its first package is extracted
if it is required as part of the K(C) adjustment.
Function take_package() in Algorithm 7.2 is the recursive heart of the code
construction process. Each time it is called, one object is consumed from the
indicated list. If that object is a leaf, then the corresponding symbol has its
codeword shortened by one bit. If the object is a package, then two objects
must be consumed from the previous list, the ones that were used to construct
the package. Those recursive calls will - eventually - result in a correct com-
bination of codeword lengths being shortened.
In an implementation it is only necessary to store value[j + 1] during the
generation of value[j], as only the information in the type[j] queue is required
by function take_package(). With a little fiddling, value[j + 1] can be overwrit-
ten by value[j]. In a similar vein, it is possible to store type[j] as a bit vector,
with a one-bit indicating a package and a zero-bit indicating an original sym-
bol. This reduces the space requirements to O(n) words for queue value, and
O(n(L − log2 n + 1)) bits for the L queues that comprise type. Furthermore, if,
as hypothesized, L is less than or equal to the word size of the machine being
Algorithm 7.1
Calculate codeword lengths for a length-limited code for the n symbol
frequencies in P, subject to the constraint that c_i ≤ L, where in this
algorithm c_i is the length of the code assigned to the ith symbol.

reverse_package_merge(P, n, L)
1: set excess ← 1 − n × 2^{−L} and P_L ← n
2: for j ← 1 to L do
3:     if excess ≥ 0.5 then
4:         set b_j ← 1 and excess ← excess − 0.5
5:     else
6:         set b_j ← 0
7:     set excess ← 2 × excess and P_{L−j} ← ⌊P_{L−j+1}/2⌋ + n
8: set ℓ_1 ← b_1
9: for j ← 2 to L do
10:    set ℓ_j ← min{P_j, 2 × ℓ_{j−1} + b_j}
11: for i ← 1 to n do
12:    set c_i ← L
13: for t ← 1 to ℓ_L do
14:    append p_t to queue value[L] and append t to queue type[L]
15: if b_L = 1 then
16:    take_package(L)
17: for j ← L − 1 down to 1 do
18:    set i ← 1
19:    for t ← 1 to ℓ_j do
20:        set pack_wght ← the sum of the next two unused items in queue
           value[j + 1]
21:        if pack_wght > p_i then
22:            append pack_wght to queue value[j]
23:            append "package" to queue type[j]
24:        else
25:            append p_i to queue value[j] and append i to queue type[j]
26:            set i ← i + 1 and retain pack_wght for the next loop iteration
27:    if b_j = 1 then
28:        take_package(j)
29: return [c_1, ..., c_n]
Algorithm 7.2
Decrease the codeword lengths indicated by the first element in type[j],
recursively accessing other lists if that first element is a package.

take_package(j)
1: set x ← the element at the head of queue type[j]
2: if x = "package" then
3:     take_package(j + 1)
4:     take_package(j + 1)
5: else
6:     set c_x ← c_x − 1
7: remove and discard the first elements of queues value[j] and type[j]
used, then the L bit vectors storing type in total occupy n - log2 n + 1 words of
memory. That is, under quite reasonable assumptions, the space requirement
of reverse package merge is O(n).
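To make the procedure concrete, here is a compact and deliberately unoptimized
Python rendering. It builds each list in full, represents a package as a nested
pair of its two constituents, and follows the rule described above of taking any
required head-of-list element before that list is packaged into the next one;
none of the lazy evaluation, bit-vector, or in-place economies just discussed are
attempted. Ties are broken in favour of packages, which matches the ordering
used in the examples.

    def reverse_package_merge(P, L):
        # P: symbol weights, non-increasing; L: length limit, with len(P) <= 2^L
        n = len(P)
        lengths = [L] * n
        excess = 1.0 - n * 2.0 ** -L
        b = []
        for _ in range(L):                       # binary expansion of the required increase
            b.append(1 if excess >= 0.5 else 0)
            excess = (excess - 0.5 * b[-1]) * 2

        def take(item):                          # the equivalent of take_package()
            weight, payload = item
            if isinstance(payload, int):
                lengths[payload] -= 1            # a leaf: shorten that codeword by one bit
            else:
                take(payload[0])
                take(payload[1])

        current = [(p, i) for i, p in enumerate(P)]          # the 2^-L list
        for j in range(L, 0, -1):
            if b[j - 1]:
                take(current[0])                 # consume the head of list 2^-j ...
                current = current[1:]            # ... before it can be packaged
            if j > 1:
                packages = [(current[t][0] + current[t + 1][0],
                             [current[t], current[t + 1]])
                            for t in range(0, len(current) - 1, 2)]
                leaves = [(p, i) for i, p in enumerate(P)]
                current = sorted(packages + leaves, key=lambda item: -item[0])
        return lengths

    print(reverse_package_merge([10, 8, 6, 3, 1, 1, 1, 1], 4))   # [2, 2, 3, 3, 4, 4, 4, 4]
    print(reverse_package_merge([10, 8, 6, 3, 1, 1, 1], 4))      # [2, 2, 2, 4, 4, 4, 4]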
If space is at a premium, it is possible for the reverse package merge algo-
rithm to be implemented in O(L^2) space over and above the n words required
to store the input probabilities [Katajainen et al., 1995]. There are two key
observations that allow the further improvement. The first is that while each
package is a binary tree, it is only necessary to store the number of leaves on
each level of the tree, rather than the entire tree. The second is that it is not
necessary to store all of the trees at any one time: only a single tree in each list
is required, and trees can be constructed lazily as and when they are needed,
rather than all at once. In total this lazy reverse package merge algorithm stores
L vertical cross-sections of the lists, each with O(L) items, so requires O(L^2)
words of memory.
If speed is crucial when generating optimal length-limited codes, the run-
length techniques of Section 4.6 on page 70 can also be employed, to make a
lazy reverse runlength package merge [Turpin and Moffat, 1996]. The result-
ing implementation is not pretty, and no pleasure at all to debug, but runs in
O((r + r log(n/r))L) time.
Liddell and Moffat [2002] have devised a further implementation, which
rather than forming packages from the symbol probabilities, uses Huffman's
algorithm to create the packages that would be part of a minimum-redundancy
code, and then rearranges these to form the length-limited code. This mecha-
nism takes O(n(LH - L + 1)) time, where LH is the length of a longest unre-
stricted minimum-redundancy codeword for the probability distribution being
processed. This algorithm is most efficient when the length-limit is relatively
relaxed.
Several approximate algorithms have also been invented. The first of these,
as mentioned above, is due to Fraenkel and Klein [1993]. They construct a
minimum-redundancy code; shorten all of the too-long codewords; and then
lengthen sufficient other codewords that K (C) ~ 1 again, but without being
able to guarantee that the code so formed is minimal. Milidiú and Laber [2000]
take another approach with their WARM-UP algorithm, and show that a length-
limited code results if all of the small probabilities are boosted to a single larger
value, and then a minimum-redundancy code calculated. They search for the
smallest such threshold value, and in doing so, are able to quickly find codes
that experimentally are minimal or very close to being minimal, but cannot be
guaranteed to be minimal.
Liddell and Moffat [2001] have also described an approximate algorithm.
Their method also adjusts the symbol probabilities; and then uses an approx-
imate minimum-redundancy code calculation process to generate a code in
which the longest codeword length is bounded as a function of the smallest
source probability. This mechanism operates in O(n) time and space, and
again generates codes that experimentally are very close to those produced by
the package merge process.
The compression loss caused by length-limiting a prefix code is generally
very small. The expected codeword length E( C, P) for a length-limited code
C can never be less than that of a minimum-redundancy code, and will only
be greater when L is less than the length of a longest codeword in the corre-
sponding minimum-redundancy code. Milidiú and Laber [2001] have shown
that the compression loss introduced by using a length-limited code, rather than
a minimum-redundancy prefix code, is bounded by

    φ^(1 − L + ⌈log2(n + ⌈log2 n⌉ − L)⌉),

where φ is the golden ratio (1 + √5)/2. Some values for this bound when
L = 32 are shown in Table 7.2. In practice, there is little loss when length-
limits are applied, even quite strict ones. For example, applying a length-limit
of 20 to the WSJ.Words message (Table 4.5 on page 71) still allows a code with
an expected cost of under 11.4 bits per symbol, compared to the 11.2 bits per
symbol attained by a minimum-redundancy code (see Figure 6.8 on page 191).
Table 7.2: Upper bound on compression loss (bits per symbol) compared to a
minimum-redundancy code, when a limit of L = 32 bits is placed on the length of
codewords.

        n       Upper bound on loss
      256             0.00002
     10^3             0.00004
     10^4             0.00028
     10^5             0.00119
     10^6             0.00503
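Evaluated directly, the bound reproduces the tabulated values; the function name
below is ours.

    import math

    def loss_bound(n, L=32):
        # the bound phi^(1 - L + ceil(log2(n + ceil(log2 n) - L)))
        phi = (1 + math.sqrt(5)) / 2
        exponent = 1 - L + math.ceil(math.log2(n + math.ceil(math.log2(n)) - L))
        return phi ** exponent

    for n in (256, 10**3, 10**4, 10**5, 10**6):
        print(n, round(loss_bound(n), 5))       # matches Table 7.2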
which the original order of the input symbols should be preserved, meaning
that the source alphabet may not be permuted.
One such situation is that of alphabetic coding. Suppose that some ordering
≺ of the source symbols is defined, such that i < j implies that s_i ≺ s_j. Sup-
pose also that we wish to extend ≺ to codewords in the natural lexicographic
manner, so that if i < j and s_i ≺ s_j, we require c_i ≺ c_j. In plain English:
if the codewords are sorted, the order that results is the same as if the source
symbols had been sorted. Needless to say, for a given probability distribution
P that may not be assumed to be non-increasing, we seek the alphabetic code
which minimizes the expected cost E(C, P) over all alphabetic codes C.
All three codes listed as examples in Table 1.1 on page 7 are, by luck,
alphabetic codes. The ordering of the symbols in the source alphabet is

    s_1 ≺ s_2 ≺ s_3 ≺ s_4 ≺ s_5 ≺ s_6,
Algorithm 7.3
Calculate a code tree for the n symbol frequencies in P, from which
codeword lengths for an alphabetic code can be extracted. Distribution P
may not be assumed to be probability-sorted. The notation key(x) refers to
the weight of element x in the queues and in the global heap.
calculate_alphabetic_code(P, n)
1: for i ← 1 to n do
2:     set L[i] ← a leaf package of weight p_i
3: for i ← 1 to n − 1 do
4:     create a new priority queue q containing i and i + 1
5:     set key(i) ← p_i and key(i + 1) ← p_{i+1}
6:     set key(q) ← p_i + p_{i+1}
7:     add q to the global heap
8: while more than one package remains do
9:     set q ← the queue at the root of the global heap
10:    set (i1, i2) ← the candidate pair of q, with i1 < i2
11:    set L[i1] ← a package containing L[i1] and L[i2]
12:    set key(i1) ← key(i1) + key(i2), repositioning i1 in q if necessary
13:    remove i2 from q
14:    if L[i1] was a leaf package then
15:        let r be the other priority queue containing i1
16:        remove i1 from r and merge queues r and q
17:        remove queue r from the global heap
18:    if L[i2] was a leaf package then
19:        let r be the other priority queue containing i2
20:        remove i2 from r and merge queues r and q
21:        remove queue r from the global heap
22:    establish a new candidate pair for q
23:    restore the heap ordering in the global heap
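
For small alphabets the cost produced by Algorithm 7.3 can be cross-checked against
a brute-force alternative: the simple O(n^3) dynamic program over contiguous ranges
of symbols, which computes the expected cost of an optimal alphabetic code directly.
The Python sketch below is not the Hu-Tucker-Knuth method, and is far too slow for
large alphabets, but it illustrates the problem being solved; the demonstration call uses
the symbol frequencies of the worked example that follows.

    def alphabetic_code_cost(P):
        # Expected cost E(C, P) of an optimal alphabetic code for weights P,
        # found by dynamic programming over contiguous ranges of leaves.
        n = len(P)
        prefix = [0] * (n + 1)                 # prefix sums for O(1) range weights
        for i, p in enumerate(P):
            prefix[i + 1] = prefix[i] + p
        # cost[i][j] = weighted depth sum of the best tree over leaves i..j
        cost = [[0] * n for _ in range(n)]
        for span in range(1, n):
            for i in range(n - span):
                j = i + span
                w = prefix[j + 1] - prefix[i]
                best = min(cost[i][k] + cost[k + 1][j] for k in range(i, j))
                cost[i][j] = best + w
        return cost[0][n - 1] / prefix[n]      # weights need not sum to one

    P = [4, 22, 5, 4, 1, 3, 2, 2, 4, 8, 5, 6, 1, 8, 3, 8, 7, 9, 1, 11, 4, 1, 3, 2, 4]
    print(alphabetic_code_cost(P))
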
P = [4, 22, 5, 4, 1, 3, 2, 2, 4, 8, 5, 6, 1, 8, 3, 8, 7, 9, 1, 11, 4, 1, 3, 2, 4],
there are three queues with a candidate pair that has a key value of 4: the one
associated with symbol 5, the one associated with symbol 7, and the one associated
with symbol 22. In this case the candidate pair associated with symbol 5 takes
precedence, and is processed first. If, when breaking ties in such a manner,
even the indices of the first items in the candidate pairs are equal, the candidate
pair with the leftmost second component should be preferred.
Figure 7.1 shows the initial queues, and the first two packaging steps of
function calculate_alphabetic_code() on the character self-probabilities from
Blake's verse. In each panel, the list of packages stored in L appears first.
Under each package is the corresponding priority queue, drawn as a list, that
includes the index of that package, and any to its right with which it may now be
packaged. Only the key values are shown in each priority queue, as the indices
can be inferred from their position in the figure; and by convention, each queue
is shown associated with the leftmost leaf represented in that queue. The root
of the global heap is shown pointing to the next candidate pair to be packaged:
Figure 7.1: First three steps of the Hu-Tucker-Knuth algorithm for generating an al-
phabetic code from the character self-probabilities in Blake's verse: (a) before any
packages are formed; (b) after the first package is formed; and (c) after the second
package is formed. Queues are shown associated with their leftmost candidate leaf.
the priority queue with the smallest sum of the first two elements. Figure 7.1b
shows the results of packaging L[5] and L[6] in the first iteration of the while
loop in calculate_alphabetic_code(). Both of these elements are leaves, so the
previous queue and the next queue are merged with the new queue containing
the package formed at this step. That is, queues
[1 → 4] [1 → 3] [2 → 3]
Figure 7.2: Alphabetic code tree for the character self-probabilities in Blake's Milton.
Table 7.3: Constructing codes with unequal letter costs. Different assignments of
dots and dashes to the probability distribution P = [0.45,0.35,0.15,0.05]. Dots are
assumed to have a cost of 1; dashes a cost of 3.
So far in this book we have assumed a natural mapping from bits in Shannon's
informational sense to bits in the binary digit sense. Implicit in this assumption
is recognition that the bits emitted by a coder are both a stream of symbols
drawn from a binary channel alphabet, and also a description of the information
present in the message.
Now consider the situation when this mapping is made explicit. A measure
of the cost of transmitting a bit of information using the channel alphabet must
be introduced. For lack of a better word, we will use units denoted as dollars
- which represent elapsed time, or power consumed, or some other criterion. As
a very simple example, if each symbol in a binary channel alphabet costs two
dollars - that is, r = 2 and D = [2,2] - then it will clearly cost two dollars per
information bit to transmit a message.
In the general case there are r channel symbols, and their dollar costs are
all different. Just as an input symbol carries information as a function of its
probability, we can also regard the probability of each channel symbol as being
an indication of the information that it can carry through the channel. For ex-
ample, if the ith channel symbol appears in the output stream with probability
qi, then each appearance carries I(qi) bits of information. In this case the rate
at which the ith channel symbol carries information is given by I(qi)/di bits
per dollar, as each appearance of this channel symbol costs di dollars.
In a perfect code every channel symbol must carry information at the same
rate [Shannon and Weaver, 1949]. Hence, if a codeword assignment is to be
efficient, it must generate a set of channel probabilities that results in
    qi = t^di,  where  t^d1 + t^d2 + · · · + t^dr = 1.    (7.1)
An equation of this form always has a single positive real root between zero
and one. For example, when D = [1,1], which is the usual binary channel
alphabet, t = 0.5 is the root of the equation t + t = 1. Similarly, when
D = [2,2]. t = .Jf12 : : : 0.71 is the root of the equation t 2 + t 2 = 1. And as
a third example, Morse code uses the channel alphabet defined by r = 2 and
D = [1,3], and t is the root of t l + t 3 = 1, which is t ::::: 0.68. Hence, in
this third case, ql = t l ::::: 0.68, and q2 = t 3 ::::: 0.32; that is, the assignment
of codewords should generate an output stream of around 68% dots and 32%
dashes.
Given that qi = t^di, the expected transmission cost T(D) for the channel
alphabet described by D, measured in dollars per bit of information, is
    T(D) = Σ_{i=1..r} qi · di/I(qi) = 1/I(t),
bit of information through the channel. And for the simplified Morse example,
in which D = [1,3], T(D) ~ 1.81, and each bit of information passed through
the channel costs $1.81.
If the message symbols arrive at the coder each carrying on average H(P)
bits of information, then the minimum cost of representing that message is
H(P) · T(D) dollars per symbol.
Returning to the example of Table 7.3, we can now at least calculate the re-
dundancy of the costs listed: the entropy H(P) of the source probability dis-
tribution is approximately 1.68, and so the best we can do with a Morse code
is H(P) · T(D) ≈ $3.04 per symbol. If nothing else, this computation can be
used to reassure us that Code 3 in Table 7.3 is pretty good.
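
The arithmetic behind these figures is easily mechanized. A minimal Python sketch
- using bisection to solve Equation 7.1, one of the methods mentioned later in this
section - computes t, T(D), and the minimum dollar cost for the Morse example:

    import math

    def channel_root(D, eps=1e-12):
        # Solve Equation 7.1: find t in (0, 1) with sum(t**d for d in D) == 1,
        # by simple bisection (Newton-Raphson would also serve).
        lo, hi = 0.0, 1.0
        while hi - lo > eps:
            mid = (lo + hi) / 2
            if sum(mid ** d for d in D) > 1.0:
                hi = mid                    # sum too large, so t must be smaller
            else:
                lo = mid
        return (lo + hi) / 2

    def dollars_per_bit(D):
        # T(D) = 1 / I(t), the expected cost in dollars of each information bit
        t = channel_root(D)
        return 1.0 / (-math.log2(t))

    D = [1, 3]                              # dots cost $1, dashes $3
    t = channel_root(D)
    print(t, t ** 1, t ** 3)                # approximately 0.68, 0.68, and 0.32
    print(dollars_per_bit(D))               # approximately 1.81 dollars per bit

    P = [0.45, 0.35, 0.15, 0.05]            # the distribution of Table 7.3
    H = -sum(p * math.log2(p) for p in P)   # approximately 1.68 bits per symbol
    print(H * dollars_per_bit(D))           # approximately 3.04 dollars per symbol
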
But it is still not at all obvious how the codewords should be assigned to
source symbols so as to achieve (or closely approximate) qi = t di for each of
the r channel symbols. Many authors have considered this problem [Abrahams
and Lipman, 1992, Altenkamp and Mehlhorn, 1980, Golin and Young, 1994,
Karp, 1961, Krause, 1962, Mehlhorn, 1980], but it was only relatively recently
that a generalized equivalent of Huffman's famous algorithm was proposed
[Bradford et al., 1998].
Less complex solutions to some restricted problems are also known. When
all channel costs are equal, and D = [1,1, ... ,1] for an r symbol channel
alphabet, Huffman's algorithm is easily extended. Rather than take the two
least-weight symbols at each packaging stage, the r least-weight symbols are
packaged. There is a single complication to be dealt with, and that is that it
may not be possible for the resulting radix-r tree to be complete. That is, there
are likely to be unused codewords. To ensure that these unused codes are as
deep as possible in the tree, dummy symbols of probability zero must be added
to the alphabet to make the total number of symbols one greater than a multiple
of r - 1. For example, when r = 7 and n = 22, the code tree will have
25 = 6 x 4 + 1 leaves, and three dummy symbols should be added before
packaging commences. An application that uses radix-256 byte-aligned codes
is described in Section 8.4. Perl et al. [1975] considered the reverse constrained
situation - when all symbols in the source alphabet have equal probability, and
the channel symbols are of variable cost. As noted, Bradford et al. [1998] have
provided a solution to the general unequal-unequal problem.
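
Returning to the equal-cost, radix-r case, the packaging process can be sketched in a
few lines of Python. The sketch below (an illustration only, with invented weights in
the demonstration call) adds zero-weight dummy symbols so that the number of leaves
is one more than a multiple of r − 1, forcing any unused codewords to the deepest
level of the tree, and then repeatedly packages the r least-weight items:

    import heapq

    def radix_huffman_lengths(P, r):
        # Codeword lengths for a radix-r Huffman code over symbol weights P.
        n = len(P)
        dummies = (1 - n) % (r - 1)          # e.g. r = 7 and n = 22 gives 3
        # heap entries: (weight, tie-breaker, list of (symbol, depth)) packages
        heap = [(w, i, [(i, 0)]) for i, w in enumerate(P)]
        heap += [(0, n + i, []) for i in range(dummies)]
        heapq.heapify(heap)
        counter = n + dummies
        while len(heap) > 1:
            group, weight = [], 0
            for _ in range(r):               # package the r least-weight items
                w, _, leaves = heapq.heappop(heap)
                weight += w
                group.extend((sym, depth + 1) for sym, depth in leaves)
            heapq.heappush(heap, (weight, counter, group))
            counter += 1
        lengths = [0] * n
        for sym, depth in heap[0][2]:
            lengths[sym] = depth
        return lengths

    print(radix_huffman_lengths([20, 17, 6, 3, 2, 2, 1, 1, 1, 1], r=3))
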
There is also an elegant - and rather surprising - solution to the general
problem using arithmetic coding, and it is this approach that we prefer to de-
scribe here. The key observation is that if an arithmetic decoder is supplied
with a stream of random bits, it generates an output "message" in which sym-
bols appear with a frequency governed by the probability distribution used.
[Figure 7.3 shows the pipeline: message → encode using P → decode using Q →
channel symbols → encode using Q → decode using P → message.]
Figure 7.3: Arithmetic coding when the channel alphabet probability distribution is
specified by Q = [qi].
impact upon execution costs. The only other cost is solution of Equation 7.1,
so that Q can be computed. Methods such as bisection or Newton-Raphson
converge rapidly, and in any case, need to be executed only when the channel
costs change.
Compression Systems
This chapter resumes the discussion of compression systems that was started in
Chapters 1 and 2, but then deferred while we focussed on coding. Three state-
of-the-art compression systems are described in detail, and the modeling and
coding mechanisms they incorporate examined. Unfortunately, one chapter is
not enough space to do justice to the wide range of compression models and ap-
plications that have been developed over the last twenty-five years, and our cov-
erage is, of necessity, rather limited. For example, we have chosen as our main
examples three mechanisms that are rather more appropriate for text than for,
say, image or sound data. Nevertheless, the three mechanisms chosen - sliding
window compression, the PPM method, and the Burrows-Wheeler transform
- represent a broad cross section of current methods, and each provides inter-
esting trade-offs between implementation complexity, execution-time resource
cost, and compression effectiveness. And because they are general methods,
they can still be used for non-text data, even if they do not perform as well as
methods that are expressly designed for particular types of other data. Lossy
modeling techniques for non-text data, such as gray-scale images, are touched
upon briefly in Section 8.4; Pennebaker and Mitchell [1993], Salomon [2000],
and Sayood [2000] give further details of such compression methods.
Suppose that the first w symbols of some m-symbol message M have been
encoded, and may be assumed by the encoder to be known to the decoder.
Symbols beyond this point, M[w + 1 ... m], are yet to be transmitted to the
decoder. To get them there, the sequence of shared symbols in M[1 ... w] is
searched to find a location that matches some prefix M[w + 1 ... w + ℓ] of
the pending symbols. For example, suppose a match of length ℓ is detected,
commencing at location w − c + 1 for some offset 1 ≤ c ≤ w:
Algorithm 8.1
Transmit the sequence M[1 ... m] using an LZ77 mechanism.
lz77_encode_block(M, m)
1: encode m using some agreed method
2: set w ← 0
3: while w < m do
4:     locate a match for M[w + 1 ...] in M[w − window_size + 1 ... w], such
       that M[w − c + 1 ... w − c + ℓ] = M[w + 1 ... w + ℓ]
5:     if ℓ ≥ copy_threshold then
6:         encode length ℓ − copy_threshold + 2 using some agreed method
7:         encode offset c using some agreed method
8:         set w ← w + ℓ
9:     else
10:        encode length 1 using the agreed method
11:        encode M[w + 1] using some agreed method
12:        set w ← w + 1

Decode and return the LZ77-encoded sequence M[1 ... m].
lz77_decode_block(M, m)
1: decode m using the agreed method
2: set w ← 0
3: while w < m do
4:     decode ℓ using the agreed method
5:     if ℓ > 1 then
6:         set ℓ ← ℓ + copy_threshold − 2
7:         decode offset c using the agreed method
8:         set M[w + 1 ... w + ℓ] ← M[w − c + 1 ... w − c + ℓ]
9:         set w ← w + ℓ
10:    else
11:        decode M[w + 1] using the agreed method
12:        set w ← w + 1
13: return M and m
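
A compact Python rendering of the same scheme is sketched below. It is a
simplification of Algorithm 8.1 - the tuples are kept as Python objects rather than
being entropy coded, and the window size, copy threshold, and chain-length limit are
illustrative values only - but it shows the greedy parsing, the byte-at-a-time copying
that permits overlapping matches, and the hash-on-two-characters match finder that
is discussed shortly.

    def lz77_encode(M, window=4096, min_copy=3, max_chain=64):
        # Greedy LZ77 parsing into ('literal', ch) and ('copy', offset, length).
        chains = {}                      # two-char prefix -> earlier positions
        tuples, w, m = [], 0, len(M)
        while w < m:
            best_len, best_off = 0, 0
            for pos in reversed(chains.get(M[w:w + 2], [])[-max_chain:]):
                if w - pos > window:
                    break                # older positions are all out of window
                length = 0
                while w + length < m and M[pos + length] == M[w + length]:
                    length += 1          # may run past w: overlapping match
                if length > best_len:
                    best_len, best_off = length, w - pos
            step = best_len if best_len >= min_copy else 1
            if best_len >= min_copy:
                tuples.append(('copy', best_off, best_len))
            else:
                tuples.append(('literal', M[w]))
            for k in range(w, min(w + step, m - 1)):
                chains.setdefault(M[k:k + 2], []).append(k)
            w += step
        return tuples

    def lz77_decode(tuples):
        out = []
        for t in tuples:
            if t[0] == 'literal':
                out.append(t[1])
            else:
                _, off, length = t
                for _ in range(length):      # byte-at-a-time copy handles
                    out.append(out[-off])    # overlapping (repeating) matches
        return ''.join(out)

    text = "how#now#brown#cow.how#now#brown#cow."
    assert lz77_decode(lz77_encode(text)) == text
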
plus fifteen valid copy lengths (from 3 ... 17); and a twelve-bit offset code also
stored in binary, which allows a window of 4,096 bytes. Ross Williams [1991b]
has also examined this area, and, in the same way as do Fiala and Greene, de-
scribes a range of possible codes, and examines the trade-offs they supply.
The advantage of simple codes is that the stored tuples are byte aligned, and
bit manipulation operations during encoding and decoding are avoided. Fast
throughput is the result, especially in the decoder.
In the widely-used GZIP implementation, Jean-loup Gailly [1993] em-
ploys two separate semi-static minimum-redundancy codes, one for the copy
offsets, and a second one for a combination of raw characters and copy lengths.
The latter of these two is used first in each tuple, and while it is somewhat con-
fusing to code both characters and lengths - each length is, for coding purposes,
incremented by the size of the character set - from the same probability distri-
bution, the conflation of these two allows economical transmission of the criti-
cal binary flag that indicates whether the next code is a raw symbol or a copy.
The two minimum-redundancy codes are calculated over blocks of 64 kB of the
source message: in the context of Algorithm 8.1, this means that a complete
set of ℓ (and M[w + 1]) values and c offsets are accumulated, codes are con-
structed based upon their self-probabilities, and then the tuples (some of which
contain no second component) are coded, making it a textbook application of
the techniques described in Section 4.8 on page 81.
Gailly's implementation also supports a command-line flag to indicate how
much effort should be spent looking for long matches. Use of gzip -9 gives
better compression than does gzip -1, but takes longer to encode messages.
Decoding time is unaffected by this choice, and is fast in all situations. Fast de-
compression is one of the key attributes of the LZ77 paradigm - the extremely
simple operations involved, and the fact that most output operations involve a
phrase of several characters, mean that decoding is rapid indeed, even when, as
is the case with GZIP, minimum-redundancy codes are used.
The speed of any LZ77 encoder is dominated by the cost of finding prefix
matches. Bell and Kulp [1993] have considered this problem, as have the im-
plementors of the many software systems based upon the LZ77 technique; and
their consensus is that hashing based upon a short prefix of each string is the
best compromise. For example, the first two characters of the lookahead buffer,
M[w + 1] and M[w + 2], can be used to identify a linked list of locations at
which those two characters appear. That list is then searched, and the longest
match found within the permitted duration of the search used. As was noted
above, the search need not be exhaustive. One way of eliminating the risk of
lengthy searches is to only allow a fixed number of strings in each of these
linked lists, or to only search a fixed number of the entries of any list. Control
of the searching cost, trading quality of match against expense of search, is one
two simpler sequences, each of which is then coded by a zero-order coder pre-
suming that all conditioning has been exploited by the model. The distinction
between the modeling and coding components in a compression system such
as GZIP is then quite clear; and the Ziv-Lempel component of GZIP supplies
a modeling strategy, not a coding mechanism.
a particular order? In seminal work published in 1984, John Cleary and Ian
Witten tackled this question, and in doing so proposed a significant step for-
ward in terms of modeling. Exploiting the ability of the then newly-developed
arithmetic coder to properly deal with small alphabets and symbol probabilities
close to one, they invented a mechanism that tries to use high-order predictions
if they are available, and drops gracefully back to lower order predictions if
they are not. Algorithm 8.2 summarizes their prediction by partial matching
(PPM) mechanism.
The crux of the PPM process lies in two key steps of the subsidiary func-
tion ppm_encode_symbol(), which attempts to code one symbol M[s] of the
original message in a context of a specified order. Those two steps embody
the fundamental dichotomy that is faced at each call: either the symbol M[s]
has a non-zero probability in this context, and can thus be coded successfully
(step 9); or it has a zero probability, and must be handled in a context that is
one symbol shorter (step 12). In the former of these two cases no further ac-
tion is required except to increment the frequency count P[M[s]]; in the latter
case, the recursive call must be preceded by transmission of an escape symbol
to tell the decoder to shift down a context, and then followed by an increment
to both the probability of escape and the probability P[M[s]]. Prior to mak-
ing this fundamental choice a little housekeeping is required: if the order is
less than zero, the symbol M[s] should just be sent as an unweighted ASCII
code (steps 1 and 2); if the indicated context does not yet have a probability
distribution associated with it, one must be created (steps 4 to 7); and when the
probability distribution P is newly created and knows nothing, the first symbol
encountered must automatically be handled in a shorter context, without even
transmission of an escape (step 11).
Because there are so many conditioning classes employed, and because
so many of them are used just a few times during the duration of any given
message, an appropriate choice of escape method (Section 6.3 on page 139) is
crucial to the success of any PPM implementation. Algorithm 8.2 shows the
use of method D - an increment of 2 is made to the frequency P[M[s]] when
symbol M[s] is available in the distribution P; and when it is not, a combined
increment of 2 is shared between P[M[s]] and P[escape]. In their original
presentation of PPM, Cleary and Witten report experiments with methods A
and B. Methods C and D were then developed as part of subsequent inves-
tigations into the PPM paradigm [Howard and Vitter, 1992b, Moffat, 1990];
method D - to make PPMD - is now accepted as being the most appropriate
choice in this application [Teahan, 1998]. Note that the use of the constants 1
and 2 is symbolic; they can, of course, be more general increments that are aged
according to some chosen half-life, as was discussed at the end of Section 6.6.
Table 8.1 traces the action of a first-order PPM implementation (where
Algorithm 8.2
Transmit the sequence M[1 ... m] using a PPM model of order max_order.
ppm_encode_block(M, m, max_order)
1: encode m using some appropriate method
2: set U[x] ← 1 for all symbols x in the alphabet, and U[escape] ← 0
3: for s ← 1 to max_order do
4:     ppm_encode_symbol(s, s − 1)
5: for s ← max_order + 1 to m do
6:     ppm_encode_symbol(s, max_order)

Try to code the single symbol M[s] in the conditioning class established by
the string M[s − order ... s − 1]. If the probability of M[s] is zero in this
context, recursively escape to a lower order model. Escape probabilities are
calculated using method D (Table 6.2 on page 141).
ppm_encode_symbol(s, order)
1: if order < 0 then
2:     encode M[s] using distribution U, and set U[M[s]] ← 0
3: else
4:     set P ← the probability distribution associated with the conditioning
       class for string M[s − order ... s − 1]
5:     if P does not yet exist then
6:         create a new probability distribution P for M[s − order ... s − 1]
7:         set P[x] ← 0 for all symbols x, including escape
8:     if P[M[s]] > 0 then
9:         encode M[s] using distribution P, and set P[M[s]] ← P[M[s]] + 2
10:    else
11:        if P[escape] > 0 then
12:            encode escape using distribution P
13:        ppm_encode_symbol(s, order − 1)
14:        set P[M[s]] ← 1, and P[escape] ← P[escape] + 1
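
The cost accounting implied by Algorithm 8.2 can be sketched in a few lines of
Python. The sketch below accumulates ideal code lengths (−log2 of each coding
probability) instead of driving a real arithmetic coder and, like Algorithm 8.2 itself,
applies no exclusions between contexts, so on the string of Table 8.1 its total will be
slightly larger than 106.1 bits - the discussion below notes that exclusions are worth
about 1.5 bits on that example.

    import math
    from collections import defaultdict

    def ppmd_cost(M, max_order):
        # Total ideal code length, in bits, of the PPMD scheme of Algorithm 8.2:
        # method-D escapes, update exclusions, no exclusions between contexts,
        # and a uniform order -1 context U over 256 symbols.
        contexts = defaultdict(lambda: defaultdict(int))   # context -> counts
        unseen = set(chr(x) for x in range(256))           # symbols with U[x] = 1
        bits = 0.0

        def encode(s, order):
            nonlocal bits
            if order < 0:
                bits += math.log2(len(unseen))             # uniform code in U
                unseen.discard(M[s])
                return
            P = contexts[M[s - order:s]]                   # created on demand
            total = sum(P.values())
            if P[M[s]] > 0:
                bits += -math.log2(P[M[s]] / total)
                P[M[s]] += 2
            else:
                if P['escape'] > 0:
                    bits += -math.log2(P['escape'] / total)
                encode(s, order - 1)
                P[M[s]] = 1
                P['escape'] += 1

        for s in range(len(M)):
            encode(s, min(s, max_order))
        return bits

    print(ppmd_cost("how#now#brown#cow.", max_order=1))
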
Table 8.1: Calculating probabilities for the string "how#now#brown#cow." using the
PPM algorithm with escape method D (PPMD) and max_order = 1. Context λ is the
zero-order context; and in context U all ASCII symbols have an initial probability of
1/256. The total cost is 106.1 bits.
remove the excluded symbols from contention. For example, when processing
the "#" after the word "brown", the # in the A context is assigned an adjusted
probability of 1/(20-5), and is coded in -10g2(l/15) = 3.91 bits. The use of
a decrementing frequency count in the U context is also a form of exclusions.
In this case the exclusions are permanent rather than temporary, since there is
just one context that escapes into U.
As presented in Algorithm 8.2, function ppm_encode_symbol() does not
allow for exclusions except in context U - the implementation becomes rather
longer than one page if the additional complexity is incorporated. Without
exclusions, a certain amount of information capacity is wasted, making the
output message longer than is strictly necessary. In Table 8.1, taking out all of
the exclusions adds 1.5 bits to the total cost of the message. On the other hand,
calculating exclusions has an impact upon compression throughput, and they
are only beneficial when contexts shorter than max_order are being used, which
tends to only happen while the model is still learning which symbols appear in
which contexts. Once the model has stabilized into a situation in which most
symbols are successfully predicted in the longest context, no exclusions will be
applied, even if they are allowed.
The second subtle point to be noted in connection with Algorithm 8.2 is
called update exclusions [Moffat, 1990]. When the "w" at the end of the word
"now" is successfully predicted in the context "0", its frequency in that context
is incremented, and the probability distribution P"o" for that context changes
from p.·o .. [escape, "w"] = [1,1] to p.·o.. [escape, "w"] = [1,3]. At the same
time, it is tempting to also change the zero-order probability distribution for
context A, since another "w" has appeared in that context too. Prior to that "w"
the probability distribution P).. for context A is
"h" "0"
P ).. [escape " '" "w" "#" "n"] = [5 , 1, 3, 3" 1 1] .
In fact, we do not make this change, the rationale being that P_λ should not
be influenced by any subsequent "w" after "o" combinations, since they will
never require a "w" to be predicted in context λ. That is, we modify probability
distributions to reflect what is actually transmitted, rather than the frequency
distributions that would be arrived at via a static analysis. In Table 8.1, the
final probability distribution for context λ is
    P_λ[escape, "h", "o", "w", "#", "n", "b", "r", "c", "."] = [9, 1, 7, 1, 3, 3, 1, 1, 1, 1],
reflecting the frequencies of the symbols that context λ was called upon to deal
with, not the frequencies of the symbols in the message.
Algorithm 8.2 as described already includes update exclusions, and since
full updating makes the implementation slower, there is no reason to try and
maintain "proper" conditional statistics in a PPM implementation.
An implementation issue deliberately left vague in Algorithm 8.2 is the
structure used to store the many contexts that must be maintained, plus their
associated probability distributions. The contexts are just strings of characters,
so any dictionary data structure, perhaps a binary search tree, could be used.
But the sequence of context searches performed is dictated by the string, and
use of a tailored data structure allows a more efficient implementation. Fig-
ure 8.2a shows such a context tree for the string "how#now#brown#", against
which a probability for the next letter of the example, character "c", must be
estimated. A first-order PPM model is again assumed.
Each node in the context tree represents the context corresponding to the
concatenation of the node labels on the path from the root of the tree through
to that node. For example, the lowest leftmost node in Figure 8.2a corresponds
to the string "ho", or more precisely, the appearance of "0" in the first-order
context "h". All internal nodes maintain a set of children, one child node for
each distinct character that has appeared to date in that context. In addition,
every node, including the leaves, has an escape pointer that leads to the node
representing the context formed by dropping the leftmost, or least significant
character, from the context. In Figure 8.2a escape pointers are represented with
dotted lines, but not all are shown. The escape pointer from the root node λ
leads to the special node for context U.
At any given moment in time, such as the one captured in the snapshot
of Figure 8.2a, one internal node of depth max_order in the context tree is
identified by a current pointer. Other levels in the tree also need to be indexed
from time to time, and so current can be regarded as an array of pointers, with
current [max_order] always the deepest active context node. In the example,
context "#" is the current[l] node.
To code a symbol, the current[max_order] node is used as a context, and
one of two actions carried out: if it has any children, a search is carried out for
the next symbol; and if there are no children, the pointer current[max_order − 1]
is set to the escape pointer from node current[max_order], and then that node
is similarly examined. The child search - whenever it takes place - has two
outcomes: either the symbol being sought is found as the label of a child, and
that fact is communicated to the decoder; or it is not, and the transfer to the next
lower current node via the escape pointer must be communicated. Either way,
an arithmetic code is emitted using the unnormalized probabilities indicated by
the current pointer for that level of the tree.
Figure 8.2: Examples of first-order context trees used in a PPM implementation: (a)
after the string "how#now#brown#" has been processed; and (b) after the subsequent
"c" has been processed. Not all escape pointers and frequency counts are shown.
In the example, the upcoming "c" is first searched for at node current[l] =
"#", and again at current[O] = A. In both cases an escape is emitted, first with
a probability of 3/6 (see Table 8.1), and then, allowing for the exclusions on
"n" and "b", with probability 7/(22 - 4). The two escapes take the site of
operations to node U, at which time the "c" is successfully coded. Two new
nodes, representing contexts "c", and "#c" are then added as children of the
nodes at which the escape codes were emitted. The deepest current pointer
is then set to the new node for "c", ready for the next letter in the message.
Figure 8.2b shows the outcome of this sequence of operations.
When the next symbol is available as a child of the current[max_order]
node, the only processing that is required is for the appropriate leaf to be se-
lected by an arithmetic code, and current[max_order] to be updated to the des-
tination of the escape pointer out of that child node. No other values of current
are required, but if they are, they can be identified in an on-demand manner
by tracing escape pointers from current[max_order]. That is, when the model
is operating well, the per-symbol cost is limited to one child search step, one
arithmetic coding step, and one pointer dereference.
Figure 8.2 shows direct links from each node to its various children. But
each node has a differing number of children, and in most programming lan-
guages the most economical way to deal with the set of children is to install
them in a dynamic data structure such as a hash table or a list. For character-
based PPM implementations, a linked list is appropriate, since for the major-
ity of contexts the number of different following symbols is small. A linked
list structure for the set of children can be accelerated by the use of a physi-
cal move-to-front process to ensure that frequently accessed items are located
early in the searching process. For larger alphabets, the set of children might
be maintained as a tree or a hash table. These latter two structures make it
considerably more challenging to implement exclusions, since the cumulative
frequency counts for the probability distribution are not easily calculated if
some symbols must be "stepped over" because they are excluded.
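
One concrete - and entirely illustrative - layout for the nodes of such a context tree,
assuming a linked list of children with physical move-to-front reordering, might be
sketched as follows; the field names are ours, not those of any particular implementation.

    class ContextNode:
        # A node of the PPM context tree: a list of [symbol, count, child]
        # entries for the symbols seen in this context, an escape count, and
        # an escape pointer to the node for the context one character shorter.
        __slots__ = ('children', 'escape_count', 'escape_ptr')

        def __init__(self, escape_ptr=None):
            self.children = []
            self.escape_count = 0
            self.escape_ptr = escape_ptr

        def total(self):
            return self.escape_count + sum(c[1] for c in self.children)

        def find(self, symbol):
            # linear search, with a physical move-to-front so that frequently
            # coded symbols are met early on subsequent searches
            for i, entry in enumerate(self.children):
                if entry[0] == symbol:
                    self.children.insert(0, self.children.pop(i))
                    return entry
            return None
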
Standing back from the structures shown in Figure 8.2, the notion of state-
machine-based compression system becomes apparent. We can imagine an
interwoven set of states, corresponding to identifying contexts of differing
lengths; and of symbols driving the model from state to state via the edges
of the machine. This is exactly what the DMC compression algorithm does
[Cormack and Horspool, 1987]. Space limits preclude a detailed examination
of DMC here, and the interested reader is referred to the descriptions of Bell
et al. [1990] and Witten et al. [1999]. Suzanne Bunton [1997a,b] considers
state-based compression systems in meticulous detail in her doctoral disser-
tation, and shows that PPM and DMC are essentially variants of the same
fundamental process.
The idea embodied in the PPM process is a general one, and a relatively
large number of authors have proposed a correspondingly wide range of PPM
variants; rather more than we can hope to record here. Nevertheless, there are
a number of versions that are worthy of comment.
Lelewer and Hirschberg [1991] observed that it is not necessary for the
contexts to cascade incrementally. For example, escaping from an order-three
prediction might jump directly to an order-one state, bypassing the order-two
contexts.
Another possibility is for the contexts to be concatenated, so as to form one
long chain of symbol "guesses", in decreasing order of estimated probability
[Fenwick, 1997, 1998, Howard and Vitter, 1994b]. The selection of the appro-
priate member of the chain can then be undertaken using a binary arithmetic
coder. Yokoo [1997] describes a scheme along these lines that orders all pos-
sible next characters based upon the similarity of their preceding context to the
current context of characters.
Another development related to PPM is the PPM* method of Cleary and
Teahan [1997]. In PPM* there is no upper limit set on the length of the con-
ditioning context, and full information is maintained at all times. To estimate
the probability of the next character, the shortest deterministic context is em-
ployed first, where a context is deterministic if it predicts just one character.
If there is no deterministic context, then the longest available context is used
instead. In the experiments of Cleary and Teahan, PPM* attains slightly bet-
ter compression effectiveness than does a comparable PPM implementation.
But Cleary and Teahan also note that their implementation of PPM* consumes
considerably more space and time than PPM, and it may be that some of the
compression gain is a consequence of the use of more memory. Suzanne Bun-
ton [1997a,b] has studied the PPM and PPM* mechanisms, as well as other
related modeling techniques, and describes an implementation that captures
the salient points of a wide range of PPM-like alternatives. Åberg et al. [1997]
have also experimented with probability estimation in the context of PPM.
Another important model which we do not have space to describe is context
tree weighting [Willems et al., 1995, 1996]. In broad terms, the idea of context
tree weighting is that the evidence accumulated by multiple prior conditioning
contexts is smoothly combined into a single estimation, whereas in the PPM
paradigm the estimators are cascaded, with low-order estimators being used
only when high-order estimators have already failed.
By employing pre-conditioning of contexts in a PPM-based model, and by
selecting dynamically amongst a set of PPM models for different types of text,
Teahan and Harper [2001] have also obtained excellent compression for files
containing English text.
Exactly how good is PPM? Figure 8.3 shows the compression rate attained
[Plot; legend: order 1, order 2, order 3, order 4, order 6.]
Figure 8.3: Compressing the file bible.txt with various PPMD mechanisms.
261], and recent experiments - and recent implementations, such as the PPMZ
of Charles Bloom [1996] (see also www.cbloom.com/src/ppmz.html)-
continue to confirm that superiority.
The drawback of PPM-based methods is the memory space consumed.
Each node in the context tree stores four components - a pointer to the next
sibling; an escape pointer; an unnormalized probability; and the sum of the fre-
quency counts of its children - and requires four words of memory. The number
of nodes in use cannot exceed max_order x m, and will tend to be rather less
than this in practice; nevertheless, the number of distinct five character strings
in a text processed with max_order = 4 might still be daunting.
One way of controlling the memory requirement is to make available a pre-
determined number of nodes, and when that limiting value is reached, decline
to allocate more. The remaining part of the message is processed using a model
which is structurally static, but still adaptive in terms of probability estimates.
Another option is to release the memory being used by the data structure, and
start again with a clean slate. The harsh nature of this second strategy can be
moderated by retaining a circular buffer of recent text, and using it to boot-strap
states in the new structure - for example, the zero-order or first-order predic-
tions in the new context tree might be initialized based upon a static analysis
of a few kilobytes of retained text. As is amply demonstrated by Figure 8.3, as
little as 10 kB of priming text is enough to allow a flying start.
Table 8.2 shows the results of experiments with the most abrupt of these
strategies, the trash-and-start-again approach. Each column in the table shows
the compression rates attained in bits per character when the PPMD imple-
mentation was limited to that much memory; the rows correspond to differ-
ent values of max_order. To obtain the best possible compression on the file,
32 MB of memory is required - eight times more than is occupied by the file it-
self. Smaller amounts of memory still allow compression to proceed, but use of
ambitious choices of max_order in small amounts of memory adversely affects
compression effectiveness. On the other hand, provided a realistic max_order
is used, excellent compression is attainable in a context tree occupying as little
as 1 MB of memory.
ital Equipment Corporation nearly ten years after PPM [Burrows and Wheeler,
1994], the BWT has since been the subject of intensive investigation into how
it should best be exploited for compression purposes [Balkenhol et ai., 1999,
Chapin, 2000, Effros, 2000, Fenwick, 1996a, Sadakane, 1998, Schindler, 1997,
Volf and Willems, 1998, Wirth and Moffat, 2001], and is now the basis for a
number of commercial compression and archiving tools.
Before examining the modeling and coding aspects of the method, the
fundamental operation employed by the BWT needs to be understood. Fig-
ure 8.4 illustrates - in a somewhat exaggerated manner - the operations that
take place in the encoder. The message in the example is the simple string
"how#now#brown#cow."; for practical use the message is, of course, thousands
or millions of characters long.
The first step is to create all rotations of the source message, as illustrated in
Figure 8.4a. For a message of m symbols there are m rotated forms, including
the original message. In an actual implementation these rotated versions are
not formed explicitly, and it suffices to simply create an array of m pointers,
one to each of the characters of the message.
The second step, shown in Figure 8.4b, is to sort the set of permutations us-
ing a reverse-lexicographic ordering on the characters of the strings, starting at
the second-to-last character and moving leftward through the strings. Hence, in
how#now#brown#cow.    ow.how#now#brown#:c    c
ow#now#brown#cow.h    ow#brown#cow.how#:n    n
w#now#brown#cow.ho    rown#cow.how#now#:b    b
#now#brown#cow.how    ow#now#brown#cow.:h    h*
now#brown#cow.how#    own#cow.how#now#b:r    r
ow#brown#cow.how#n    w.how#now#brown#c:o    o
w#brown#cow.how#no    w#now#brown#cow.h:o    o
#brown#cow.how#now    w#brown#cow.how#n:o    o
brown#cow.how#now#    cow.how#now#brown:#    #
rown#cow.how#now#b    .how#now#brown#co:w    w
own#cow.how#now#br    #now#brown#cow.ho:w    w
wn#cow.how#now#bro    #brown#cow.how#no:w    w
n#cow.how#now#brow    n#cow.how#now#bro:w    w
#cow.how#now#brown    wn#cow.how#now#br:o    o
cow.how#now#brown#    how#now#brown#cow:.    .
ow.how#now#brown#c    now#brown#cow.how:#    #
w.how#now#brown#co    brown#cow.how#now:#    #
.how#now#brown#cow    #cow.how#now#brow:n    n
       (a)                    (b)              (c)
Figure 8.4b, the three rotated forms that have "#" as their second-to-last sym-
bol appear first, and those three are ordered by the third-to-last characters, then
fourth-to-last characters, and so on, to get the ordering shown. For clarity, the
last character of each rotated form is separated by a colon from the earlier ones
that provide the sort ordering. As for the first step, pointers are manipulated
rather than multiple rotated strings, and during the sorting process the original
message string remains unchanged. Only the array of pointers is altered.
The third step of the BWT is to isolate the column of "last" characters (the
ones to the right of the colon in Figure 8.4b) in the list of sorted strings, and
transmit them to the decoder in the order in which they now appear. Since
there are m rotated forms, there will be m characters in this column, and hence
m characters to be transmitted. Indeed, exactly the same set of m charac-
ters must be transmitted as appears in the original message, since no charac-
ters have been introduced, and every column of the matrix of strings, includ-
ing the last, contains every character of the input message. In the example,
the string "cnbhrooo#wwwwo.##n" listed in Figure 8Ac must be transmit-
ted to the decoder. Also transmitted is the position in that string of the first
 c ← 1       1 ← # ← 9       6 ← c ← 1
 n ← 2       2 ← # ← 16      8 ← n ← 2
 b ← 3       3 ← # ← 17      5 ← b ← 3
 h*← 4       4 ← . ← 15      7 ← h*← 4
 r ← 5       5 ← b ← 3      14 ← r ← 5
 o ← 6       6 ← c ← 1      10 ← o ← 6
 o ← 7       7 ← h*← 4      11 ← o ← 7
 o ← 8       8 ← n ← 2      12 ← o ← 8
 # ← 9       9 ← n ← 18      1 ← # ← 9
 w ← 10     10 ← o ← 6      15 ← w ← 10
 w ← 11     11 ← o ← 7      16 ← w ← 11
 w ← 12     12 ← o ← 8      17 ← w ← 12
 w ← 13     13 ← o ← 14     18 ← w ← 13
 o ← 14     14 ← r ← 5      13 ← o ← 14
 . ← 15     15 ← w ← 10      4 ← . ← 15
 # ← 16     16 ← w ← 11      2 ← # ← 16
 # ← 17     17 ← w ← 12      3 ← # ← 17
 n ← 18     18 ← w ← 13      9 ← n ← 18
   (a)           (b)             (c)
Figure 8.5: Decoding the message "cnbhrooo#wwwwo.##n": (a) permuted string re-
ceived by decoder, with position numbers appended; (b) after stable sorting by charac-
ter, with a further number prepended; and (c) after reordering back to received order.
To decode, start at the indicated starting position to output "h"; then follow the links
to character 7 and output "0"; character 11 and output "w"; and so on until position 4
is returned to.
character of the message, which is why the character "h" has been tagged
with an asterisk in Figure 8.4c. That is, the actual message transmitted is
("cnbhrooo#wwwwo.##n", 4). Since this permuted message contains the same
symbols as the original and in the same ratios, it may not be clear yet how
any benefit has been gained. On the other hand if the reader has already ab-
sorbed Section 6.7, which discusses recency transformations, they will have an
inkling as to what will eventually happen to the permuted text. Either way, for
the moment let us just presume that compression is going to somehow result.
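
A direct - and, because it builds explicit sort keys, quadratic-time and purely
illustrative - Python rendering of the forward transform as just described is sketched
below; a real implementation sorts an array of pointers into M instead.

    def bwt_forward(M):
        # Return (permuted text, first): rotations are identified by the index
        # of their final character, and ordered by reading leftward from the
        # second-to-last character, cyclically.
        m = len(M)
        def key(i):
            return ''.join(M[(i - 1 - k) % m] for k in range(m))
        order = sorted(range(m), key=key)
        permuted = ''.join(M[i] for i in order)
        first = order.index(0) + 1           # 1-based, as in the text
        return permuted, first

    print(bwt_forward("how#now#brown#cow."))  # ('cnbhrooo#wwwwo.##n', 4)
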
To "decode" the permuted message it must be unscrambled by inverting
the permutation, and perhaps the most surprising thing about the BWT is not
that it results in compression, but that it can be reversed simply and quickly.
Figure 8.5 continues with the example of Figure 8.4, and shows the decoding
process. Again, the description given is intended to be illustrative, and as we
shall see below, in an actual implementation the process is rather more terse
and correspondingly more efficient than that shown initially.
Algorithm 8.3
Calculate and return the inverse of the Burrows-Wheeler transformation,
where M'[1 ... m] is the permuted text over an alphabet of n symbols, and
first is the starting position.
bwt_inverse(M', m, first)
1: for i ← 1 to n do
2:     set freq[i] ← 0
3: for c ← 1 to m do
4:     set freq[M'[c]] ← freq[M'[c]] + 1
5: for i ← 2 to n do
6:     set freq[i] ← freq[i] + freq[i − 1]
7: for c ← m down to 1 do
8:     set next[c] ← freq[M'[c]] and freq[M'[c]] ← freq[M'[c]] − 1
9: set s ← first
10: for i ← 1 to m do
11:    set M[i] ← M'[s]
12:    set s ← next[s]
13: return M
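
Algorithm 8.3 translates almost line for line into Python; the only change in the
sketch below is the use of 0-based indexing, so the starting position 4 of the example
of Figure 8.5 becomes index 3.

    def bwt_inverse(Mp, first):
        # Invert the BWT following Algorithm 8.3, with 0-based indexing:
        # Mp is the permuted text and first the 0-based starting position.
        m = len(Mp)
        symbols = sorted(set(Mp))
        freq = {x: 0 for x in symbols}
        for ch in Mp:                          # steps 1-4: occurrence counts
            freq[ch] += 1
        running = 0
        for x in symbols:                      # steps 5-6: cumulative counts
            running += freq[x]
            freq[x] = running
        nxt = [0] * m
        for c in range(m - 1, -1, -1):         # steps 7-8: build the next[] links
            freq[Mp[c]] -= 1
            nxt[c] = freq[Mp[c]]
        s, out = first, []                     # steps 9-12: follow the links
        for _ in range(m):
            out.append(Mp[s])
            s = nxt[s]
        return ''.join(out)

    print(bwt_inverse("cnbhrooo#wwwwo.##n", 3))   # how#now#brown#cow.
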
[Figure 8.6 shows sixteen consecutive rows of the BWT-permuted text for file
bible.txt: (a) the preceding contexts, all of which end in "d#the#hea"; and (b) the
corresponding permuted characters r v r v v v v v v v r d t d t t, with MTF values
1 4 2 2 1 1 1 1 1 1 2 3 4 2 2 1.]
    d    147                     1    294
    l      4                     2    134
    p      8                     3    113
    r    137                     4     78
    t    157                     5     23
    v    191                     6      2
    Total symbols       644      Total symbols       644
    Entropy             2.10     Entropy             1.99
    Minimum-redundancy  2.25     Minimum-redundancy  2.08
             (a)                          (b)
Figure 8.7: Frequencies of symbols and costs of coding according to those fre-
quencies in the context "#the#hea" in the Bible: (a) distribution of character val-
ues, the zero-order entropy of that distribution, and the expected cost of a minimum-
redundancy code for that distribution; and (b) distribution of the corresponding MTF
values, the zero-order entropy of that distribution, and the expected cost of a minimum-
redundancy code for that distribution.
by the MTF transformation described in Section 6.7 on page 170, and it is MTF
values for the permuted sequence that are actually transmitted.
On small examples such as the one in Figure 8.4 the clustering of symbols
is definitely noticeable. For more substantial source messages it becomes even
more compelling. Figure 8.6 shows part of the result of applying the BWT
to file bible.txt. A small fraction of the permuted message text is shown,
together with the implicit contexts that form the sort key and the MTF values
generated. In Figure 8.6 all of the preceding strings end in "#the#hea" (indeed,
in the section shown they all actually end in "d#the#hea"), and in total the
4,047,392 characters in this particular version of the Bible contain exactly 644
occurrences of this "#the#hea" context. In those 644 occurrences there are
just six different characters that follow, namely "d", "l", "p", "r", "t", and
"v". Figure 8.7a shows the frequencies of these six characters, and the self-
information of their frequency distribution. If the character frequencies shown
in Figure 8.7a are used to directly drive an arithmetic coder to transmit the
644 symbols, compression of 2.10 bits per character results. With a minimum-
redundancy code, 2.25 bits per character is attainable.
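
The MTF recoding itself takes only a few lines of Python. In the sketch below the
ranks are 1-based, as in Figure 8.7b, and the recency list is assumed to start out in
sorted order; the first few ranks therefore differ from those shown in Figure 8.6,
where the list carries over its state from the preceding portion of the permuted text.

    def mtf_encode(text, alphabet=None):
        # Move-to-front recoding: each character is replaced by its 1-based
        # rank in a recency list, and is then moved to the front of that list.
        recency = list(alphabet if alphabet is not None else sorted(set(text)))
        ranks = []
        for ch in text:
            r = recency.index(ch)
            ranks.append(r + 1)
            recency.insert(0, recency.pop(r))
        return ranks

    # the run of permuted characters for the "#the#hea" contexts of Figure 8.6
    print(mtf_encode("rvrvvvvvvvrdtdtt"))
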
The alternative - using the MTF transformation to generate an equivalent
2 And in Figure 6.6 on page 171: the reader is now invited to apply function bwt_inverse()
and determine the corresponding source message M that generated Figure 6.6a, starting with
the eighth character.
[Log-scale plot of frequency against MTF value (1 to 64); legend: observed for
Bible, gamma code, unary code.]
Figure 8.8: Frequency of MTF ranks for BWT-permuted text for file bible.txt,
together with MTF frequencies presumed by the codeword lengths used in the unary
and Cγ codes.
ments in the BWT mechanism. Since the permuted text is essentially the con-
catenation of the outputs from whatever number of contexts are present in the
underlying state machine, it is appropriate to try to parse the permuted text into
same-context sections, and then code each section with a zero-order entropy
coder rather than via the MTF transformation. One obvious way of sectioning
the permuted text is to simply assert that each run of like characters is the result
of the underlying machine being in a single context. With only one symbol in
the output alphabet for each context, the coding of the string emitted by that
context is achieved by simply describing its length. The subalphabet selec-
tion component of the housekeeping details for that state (see Section 4.8 on
page 81) can still be accomplished through the use of an MTF selector value.
This combination leads to the permuted text being represented as a sequence
of "MTF value, number of repetitions of that symbol" pairs, and if these two
components are independently coded using a structured arithmetic coder, com-
pression on the Bible of 1.70 bits per character is possible. By way of contrast,
on the same file an implementation of PPM (with max_order = 6, and using
escape method D and 64 MB of memory for the data structure) obtains a com-
pression effectiveness of 1.56 bits per character, so the sectioning heuristic is
still somewhat simplistic.
By using more elegant sectioning techniques based upon the self-entropy
of a small sliding window in the permuted text, better compression can be
obtained [Wirth and Moffat, 2001]. For example, the permuted text shown in
Figure 8.6 might be regarded as being emitted from two different states - one
giving rise to the string "rvrvvvvvvvr", and then a second generating "dtdtt".
(It might equally be possible that no such distinction would be made, since
the string of characters prior to the section in the figure is "vdtrvvdrrvvtdr",
and the string following is "vtvdtvrrdvvrrddvdrr".) Each of the sections then
becomes a miniature coding problem - exactly as it is in each context of a PPM
compression system.
The best of the BWT-based compression mechanisms are very good indeed.
Michael Schindler's [1997] SZIP program compresses the file bible.txt to
1.53 bits per character (see the results listed at corpus.canterbury.ac.nz/
results/large.html), and the public-domain BZIP2 program of Ju-
lian Seward and Jean-loup Gailly [Seward and Gailly, 1999] is not far behind,
with a compression rate of 1.67 bits per character. (At time of writing, the best
compression rate we know of for bible.txt is 1.47 bits per character, ob-
tained by Charles Bloom's PPMZ program, and reported to the authors by
Jorma Tarhio and Hannu Peltola.) Note that the compression effectiveness of
BWT systems is influenced by the choice of block size used going into the
BWT; in the case of SZIP, the block size is 4.3 MB, whereas BZIP2 uses a
smaller 900 kB block size. One common theme amongst these "improved"
BWT implementations is the use of ranking heuristics that differ slightly from
simple MTF. Chapin [2000] analyses a number of different ranking mech-
anisms for BWT compression, concluding that the most effective compres-
sion is achieved by employing a mechanism that switches between two differ-
ent rankers, depending on the nature of the text being compressed [Volf and
Willems, 1998]. Deorowicz [2000] has also investigated BWT variants.
The implementation of BZIP2 draws together many of the themes that have
been discussed in this section and in this book. After the BWT and MTF trans-
formations the ranks are segmented into "blocklets" of 50 values. Each block-
let is then entropy coded using one of six semi-static minimum-redundancy
codes derived for that block of (typically) 900 kB. The six semi-static codes
are transmitted at the beginning of the block, and are chosen by an iterative
clustering process that seeds six different probability distributions; evaluates
codes for those six distributions; assigns each blocklet to the code that mini-
mizes the cost of transmitting that blocklet; and then reevaluates the six dis-
tributions based upon the ranks in the blocklets assigned to that distribution.
This process typically converges to a fixed point within a small number of iter-
ations; in the coded message each blocklet is then preceded by a selector code
that indicates which of the six minimum-redundancy codes is being used. The
use of multiple probability distributions and thus codes allows sensitivity to the
nature of the localized probability distribution within each segment; while the
use of semi-static minimum-redundancy coding allows for quick encoding and
this is less than the 68% recorded for a character-based BWT (Figure 8.8), it
is still a strong correlation. Isal et al. [2002] report a compression rate of 1.49
bits per character on file bible.txt using a word-based parsing scheme, a
BWT, an MTF-like ranking heuristic, and a structured arithmetic coder. Word-
based PPM schemes have also been considered [Moffat, 1989], but the fact
that all contexts are simultaneously incomplete mitigates against an efficient
implementation, and the BWT approach has a definite advantage in terms of
resource usage.
The drawback of the BWT - word-based or character-based - is that it
must operate off-line. A whole block of message data must be available before
any code for any symbol in that block can be emitted. On the other hand, a
PPM mechanism can emit the codewords describing a symbol as soon as that
symbol is made available to the encoder, and so the latency is limited to that of
the arithmetic coder.
In summary, BWT-based compression mechanisms are based upon a sim-
ple transformation that can be both computed and inverted using relatively
modest amounts of memory and CPU time, and yet which has the power to
rival the compression effectiveness achieved by the best of the context-based
schemes. It is little wonder that in the eight years since it was first developed the
BWT has been employed in both a range of commercial compression systems
such as the STUFFIT product of Aladdin Systems Inc., a widely used compres-
sion and archiving tool that operates across a broad range of hardware platforms
(see www.aladdinsys.com); and freely available software tools such as the
BZIP2 program described above, which was developed by Julian Seward in
collaboration with Jean-loup Gailly (sourceware.cygnus.com/bzip2/).
And just in case the reader wishes to confirm their calculations: the BWT
string in Figure 6.6a is derived from the source message
"peter#piper#picked#a#peck#of#pickled#peppers."
The first of the non-text data types is bi-level (black and white) images. The
raw form of a bi-level image can be thought of as a sequence of bytes, with each
byte storing eight pixel values from one raster row of the image. Because of
this linearization, pixels that are adjacent in the image, but in different raster
rows, are widely separated in the file representing that image.
Clearly, a two-dimensional context structure is appropriate. In work that is
intimately connected to the genesis of arithmetic coding, Rissanen and Lang-
don [1981] show that use of a two-dimensional template of pixels above, and
to the left of, the pixel in question provides a suitable conditioning context.
That is, each pixel can be coded in a conditioning context established by a
small number of neighboring pixels (typically seven or ten) drawn from quite
different parts of the linear file representing the image.
Such techniques have been central to all subsequent bi-level image com-
pression techniques, including several standards. The coder required in such
a system must deal with perhaps thousands of conditioning contexts, and a
source alphabet of cardinality two, usually expressed as the two choices "the
next symbol is indeed the one estimated by the model as being the more proba-
ble symbol (MPS)", and "no, the next symbol is the one estimated as being the
less probable symbol (LPS)". Unsurprisingly, this combination of requirements
leads naturally to binary arithmetic coding, either with explicit probability esti-
mation based upon symbol occurrence counts in the various contexts employed,
or with probability estimation based upon the Q-coder mechanism described in
Section 6.11 on page 186.
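
A sketch of the basic idea follows; the seven-pixel template is illustrative only, rather
than that of any particular standard, and the per-context counts stand in for the
probability estimates that would drive a binary arithmetic coder.

    def template_context(image, row, col):
        # Pack seven previously-coded neighboring pixels (two to the left on
        # the current row, five on the row above) into an integer in 0..127;
        # pixels outside the image are treated as white (0).
        template = [(0, -1), (0, -2),
                    (-1, -2), (-1, -1), (-1, 0), (-1, 1), (-1, 2)]
        ctx = 0
        for dr, dc in template:
            r, c = row + dr, col + dc
            inside = 0 <= r < len(image) and 0 <= c < len(image[0])
            ctx = (ctx << 1) | (image[r][c] if inside else 0)
        return ctx

    def context_counts(image):
        # adaptive estimation: per-context counts of white and black decisions
        counts = {}
        for row in range(len(image)):
            for col in range(len(image[0])):
                ctx = template_context(image, row, col)
                zeros, ones = counts.get(ctx, (1, 1))
                # (an arithmetic coder would code image[row][col] here, with
                #  estimated probability ones / (zeros + ones) of a black pixel)
                if image[row][col]:
                    ones += 1
                else:
                    zeros += 1
                counts[ctx] = (zeros, ones)
        return counts
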
A specific subset of bi-level images are the textual images obtained when
pages that are dominated by printed text are scanned and stored as bi-level
images. One important application of textual image compression is in facsim-
ile storage and transmission, a compression application that has dramatically
changed the way the world operates.
Early fax standards (see Witten et al. [1999, Chapter 6] for details) were
strictly scan-line oriented, and either ignored the fact that the conditioning con-
texts were two dimensional, or allowed just one prior scan line to be used as
a guide to the transmission of the next one. A simple runlength model was
applied to generate two interleaved streams of symbol numbers, describing the
lengths of the runs of white pixels and black pixels across each scan line; and
the symbols so generated were transmitted using a static prefix code that was
hardwired into the machine at the time it was designed. The set of codewords
were generated by analysis of a set of standard test images, plus manual tun-
ing to make sure that extreme cases were also handled in a reasonable man-
ner. Given the technology available at the time, and the overreaching desire to
make the devices affordable to a wide cross-section of consumers, this Group 3
scheme represents a stunning engineering achievement.
Two decades later, memory capacity and CPU cycles are vastly cheaper,
and a different engineering trade-off is possible, since only transmission band-
width is still a scarce resource. Now textual images can be segmented into the
individual marks comprising them, and a library of such marks transmitted,
plus a list of the locations on the page at which they appear. That is, a com-
pression model is implemented that knows that many of the connected sections
of black pixels in the bi-level image form repetitive patterns that differ only
around the fringes. As a consequence, better compression is obtained. Paul
Howard [1997] describes one such scheme; it only operates sensibly on inputs
that have the particular structure the model is assuming, but when it gets such
data, it performs much better than do the three general-purpose mechanisms
described in the first three sections of this chapter.
Another important category is signal data, such as that generated by the
analog to digital conversion of speech or music. If, for example, each symbol
in the message is an integer-quantized value of some underlying waveform,
then consecutive values in the message will be highly correlated. The standard
way of handling such signals is known as DPCM, or differential pulse code
modulation. In a DPCM model, a number of previous values of the signal are
combined in some way - perhaps by calculating a weighted mean, or by fitting
some other curve - and used to predict the next quantized value. The difference
between the predicted value and the actual value is then coded as an error
value, or residual. If the predictions are accurate, the errors will be small. If
the predictions are not as accurate, for example, when the signal is volatile, the
errors will be larger. In both cases, the errors can be expected to have a mean
of zero, and are usually assumed to conform to a symmetric distribution that is
strongly biased in favor of small values. Golomb and Rice codes (Section 3.3
on page 36) are ideal for such applications, since the parameter b (or k for a
Rice code) that controls the code can be adjusted as a function of the measured
local volatility of the signal. A good illustration of this approach is provided by
Howard and Vitter [1993], who describe a mechanism for compressing gray-
scale images that uses a family of Rice codes in conjunction with adaptive
estimation of an appropriate value of k.
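As a rough sketch of the DPCM-plus-Rice-code combination (not the actual scheme of Howard and Vitter [1993]), the fragment below assumes the simplest possible predictor, the previous sample; maps signed residuals onto non-negative integers with a zigzag mapping; and adapts k crudely from a running mean of the mapped residuals seen so far. All of those particular choices are illustrative assumptions.

    # Sketch only: previous-value prediction, zigzag mapping of signed
    # residuals, and a Rice code whose parameter k tracks the local mean
    # residual magnitude. Real schemes use better predictors and estimators.
    def rice_encode(n, k):
        """Rice code of non-negative n: unary quotient, then k binary bits."""
        bits = "1" * (n >> k) + "0"
        if k:
            bits += format(n & ((1 << k) - 1), "0{}b".format(k))
        return bits

    def zigzag(e):
        """Map 0, -1, +1, -2, +2, ... onto 0, 1, 2, 3, 4, ..."""
        return 2 * e if e >= 0 else -2 * e - 1

    def dpcm_rice(samples):
        prev, total, count, out = 0, 4, 1, []
        for x in samples:
            m = zigzag(x - prev)                            # residual against prediction
            k = max(0, (total // count).bit_length() - 1)   # roughly log2 of the mean
            out.append(rice_encode(m, k))
            total, count, prev = total + m, count + 1, x
        return "".join(out)

    print(dpcm_rice([10, 12, 13, 13, 11, 14]))    # a short bit string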
Another special type of message is natural language text. The first three sections of this chapter described three general-purpose compression schemes, and the fact that English text was used as an example message was convenient, but not a prerequisite to their success.
But if we know that the message is (say) an ASCII representation of English
text, then models which exploit the resultant structure can be employed. For
example, word-based models [Bentley et al., 1986, Moffat, 1989] have been applied to text-retrieval systems with considerable success [Bell et al., 1993,
Zobel and Moffat, 1995]. Parsing a large volume of text into a sequence of
interleaved words and non-words, and building a static dictionary for each of
the two different types of tokens, permits a semi-static minimum-redundancy
code to be used, allowing fast decompression, and non-sequential decoding of
small fragments of the original text. The latter facility is important when a
small number of documents drawn from a large collection are to be returned
in response to content-based queries. The inverted indexes used to facilitate
such queries are also amenable to compression, and Golomb codes and other
static coding methods (Chapter 3) have been used in this domain. Witten et al.
[1999] give extensive coverage of this area; other relevant work includes that
of Williams and Zobel [1999], who consider more general data, but realize
many of the same benefits. English text can also be parsed in more sophis-
ticated ways, taking into account general rules for sentence structure [Teahan
and Cleary, 1996, 1997, 1998].
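A minimal sketch of the interleaved word and non-word parse that underlies such models is given below; the definition of a "word" as a maximal run of alphanumeric characters, and the use of simple frequency counts in place of real code construction, are assumptions made for illustration.

    import re

    # Sketch: split text into a strictly alternating sequence of word and
    # non-word tokens, and accumulate a separate frequency dictionary for each
    # of the two token types; a semi-static code would then be built per type.
    def parse_words(text):
        tokens = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)
        words, nonwords = {}, {}
        for tok in tokens:
            target = words if tok[0].isalnum() else nonwords
            target[tok] = target.get(tok, 0) + 1
        return tokens, words, nonwords

    tokens, words, nonwords = parse_words("the cat sat on the mat, the end.")
    print(words)      # {'the': 3, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1, 'end': 1}
    print(nonwords)   # {' ': 6, ', ': 1, '.': 1}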
A special case of structured text is program source code. Because source code conforms to strict grammar rules, probability estimation based upon a push-down automaton rather than a finite-state machine is possible. For example, because the braces "{" and "}" must match up in a syntactically correct C program, the probability of a right brace "}" can be set to (very close to) zero at certain parts of the message. For every non-terminal symbol in the grammar the probabilities of each of the grammar rules that rewrite that non-terminal can be estimated adaptively; and the parse tree can then be transmitted as an arithmetic-coded tree traversal [Cameron, 1988, Katajainen et al., 1986]. Comments must be handled separately, of course - there is nothing to stop comments from containing unmatched braces.
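The push-down idea can be hinted at with a few lines of code; the bare brace counter below is a deliberate simplification (a real syntax-directed coder conditions on the full parser stack and grammar), and the probability values and function name are assumptions made for illustration only.

    # Sketch: force the probability of "}" to (nearly) zero whenever no brace
    # is currently open. A real syntax-directed coder would condition on the
    # whole parser stack, and comments and string literals need separate
    # handling, as noted in the text.
    def close_brace_probability(depth, usual=0.05, epsilon=1e-6):
        """Probability assigned to '}' as the next symbol."""
        return usual if depth > 0 else epsilon

    depth, probs = 0, []
    for ch in "if (x) { y = 1; }":
        probs.append((ch, close_brace_probability(depth)))  # fed to the coder
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
    print(probs[7], probs[16])    # ('{', 1e-06) ('}', 0.05)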
Returning to compression of English, one variant on the word-based pars-
ing scheme is worth particular mention, because it illustrates another useful
technique - adding a small amount of redundancy back into the compressed
message in order to obtain a particular benefit. de Moura et al. [2000] parse English text into spaceless words by assuming that each word is by default fol-
lowed by a single blank character. An explicit non-word token is only required
when the token between consecutive words is not a single blank character. The
non-words are coded from the same dictionary of strings as the words, and just
one probability distribution is maintained. For example, the string
"mary#had#a#little#lamb,#little#lamb,#little#lamb."
is transformed into the message
"1,2,3,4,5,6,4,5,6,4,5,7"
against the dictionary of strings
1: "mary" 4: "little" 6: ",#"
2: "had" 5: "lamb" 7:
3: "a"
Symbols 6 and 7 are the only non-word tokens explicitly required by
this message. The decoder can correctly reverse the transformation: consecu-
tive decoded tokens composed entirely of alphabetics (or whatever other subset
of the character set is being used to define "words") get a single space inserted
between them. Use of spaceless words avoids the interleaving of symbols from
two different probability distributions, and neatly sidesteps the issues caused
by the fact that an explicit non-word distribution tends to be very skew as a
result of the dominance of the single-space non-word token "#".
To code the stream of tokens, de Moura et al. [2000] suggest use of a semi-
static radix-256 minimum-redundancy code, so that all codewords are byte
aligned. Such a sizeable channel alphabet would normally result in consid-
erable degradation in compression effectiveness, but the large source alphabet
inherent in word-based models, and the fact that the spaceless words proba-
bility distribution is relatively uniform (the most frequent word in English is
"the", with a typical probability of around 5%) means that the loss is small.
Searching in the compressed text for a particular word from the compres-
sion dictionary is then remarkably easy: the set of byte codewords for the
word (or phrase of concatenated compression tokens) is formed, and a stan-
dard pattern-matching algorithm employed (see Cormen et al. [2001]).
Use of variable length codewords does, however, mean that false matches
might result, where the pattern-matching software reports that a particular byte
sequence has been located, but in fact it appears by coincidence as a result of
the juxtaposition of fragments of two unrelated codewords. For example, if the
codeword for "bird" is the three byte sequence "57, 0, 164" (with each byte in
the codeword expressed as an integer between 0 and 255), the code for "in" is
"188, 57", the code for "the" is "0", and the word "hand" has the three-byte
code "164, 45, 142", then the compressed form of the phrase "in#the#hand"
includes within it the code for "bird", and the pattern-matcher has no option
but to declare a possible "bird" that is "in#the#hand", only to then have the
potential match declined when alignment checking is undertaken.
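The false-match scenario just described is easy to reproduce; in the sketch below the byte values are taken directly from the example, Python's built-in substring search stands in for the standard pattern matcher, and the codeword assignment is of course contrived rather than a real radix-256 code.

    # Byte-aligned codewords make compressed-domain search a plain substring
    # search, but fragments of adjacent codewords can produce false matches.
    codewords = {
        "bird": bytes([57, 0, 164]),
        "in":   bytes([188, 57]),
        "the":  bytes([0]),
        "hand": bytes([164, 45, 142]),
    }

    def compress(tokens):
        return b"".join(codewords[t] for t in tokens)

    text = compress(["in", "the", "hand"])
    pattern = compress(["bird"])
    print(text.find(pattern))    # 1: "bird" reported inside "in the hand",
                                 # and rejected only by a later alignment check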
To eliminate the cost of false match checking, de Moura et al. [2000] pro-
pose an alternative that is as simple as it is effective - instead of using a radix-
256 code, they use a radix-128 code. These seven-bit codes are then fitted
into bytes, and the remaining bit of each byte used as an alignment indicator.
For example, the first byte in each codeword might have the tag bit set, and
the remaining bytes in each codeword would then have the tag bit off. Pat-
tern searching is accomplished by forming the compressed form of the pattern
word or phrase, including setting the tag bits appropriately, and using a stan-
dard pattern matcher. The tag bits ensure that matches are only reported when
the alignment is correct, and the false matches are eliminated at the expense of making the compressed message approximately 8/7 times as long - around 14% longer - than if a radix-256 code were used.
"how#now#brown#cow#howl#now#brown#owl."
The pair "ow" appears eight times, and is the first replaced. If we use upper case
letters to indicate new symbol numbers (the primitives, or base letters, occupy
the symbol values 0 to 255, so in an implementation the first new symbol is
actually numbered 256), it is reduced to:
"hA#nA#brAn#cA#hAl#nA#brAn#Al."
The combination "A#" now appears four times, and is replaced next:
"hBnBbrAn#cBhAl#nBbrAn#Al."
The full set of reductions applied, together with the underlying string each
pairwise replacement represents, is:
"hBIcBhE#IE."
                              Minimum-      Simple
                              redundancy    binary
    Phrase hierarchy              0.22        0.22
    Reduced message prelude       0.05        0.19
    Reduced message code          1.56        1.60
    Total                         1.83        2.01
Table 8.3: Compression effectiveness (bits per character relative to the source mes-
sage) of RE-PAIR with two different coders: a minimum-redundancy coder, as outlined
in Algorithm 4.6 on page 83; and a coder based upon a simple binary code, taken from
Moffat and Wan [2001]. The source data is 510 MB of newspaper-like text from the
Wall Street Journal, processed in RE-PAIR blocks of 10 MB, with an average phrase
length of 9.97 characters. The minimum-redundancy coder processed the same 51
blocks; the simple binary coder broke the reduced message into 375 blocks, averaging
142,650 symbols per block.
block of the reduced message are coded as fixed double-byte integers relative
to a simple subalphabet mapping table of 65,536 entries.
Table 8.3 shows how surprisingly well this approach performs. The change
from a minimum-redundancy code to a simple binary code of itself adds only
0.04 bits per character to the message cost. The benefit that accrues through
the use of highly localized codes - simple ones, but never mind - is all but
enough to match the power of the minimum-redundancy coder. Only in the
representation of the prelude is the simple scheme expensive, partly because
of the use of simple codes there too, and partly because the n = 65,536 per-
block limit means that there are many more blocks for which preludes must be
provided. Overall, the simple coder is less than 10% worse than the minimum-
redundancy coder.
To search, the alternating "prelude, then codes" nature of the blocks is ex-
ploited. To look for locations at which phrase x appears, the prelude for each
coding block is decoded (the nibble-aligned codes also facilitate rapid decod-
ing), and checked for the presence of x in the subalphabet. If x appears in
this block, the mapped integer it corresponds to is now known, and the block
searched integer-by-integer for that value. If the block prelude indicates that
x does not appear, the remainder of the block is skipped completely, and pro-
cessing continues with the prelude component of the next block. That is, the
blocked nature of the reduced message, and the fact that the prelude is a suc-
cinct summary of the subalphabet of each block, allows the reduced message
to be searched without each phrase number being handled.
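The block-skipping search can be sketched in a few lines; the representation of a block as a (prelude, codes) pair, with the prelude a mapping from phrases onto local integers, is an assumption made for illustration and is not the nibble-aligned coder of Moffat and Wan [2001].

    # Sketch of prelude-directed searching: each block carries a prelude that
    # maps phrases onto local integers, so a block whose prelude does not
    # mention the wanted phrase is skipped without touching its codes.
    def search_blocks(blocks, phrase):
        hits = []
        for block_no, (prelude, codes) in enumerate(blocks):
            local = prelude.get(phrase)          # is the phrase in this subalphabet?
            if local is None:
                continue                         # skip the whole block
            hits.extend((block_no, i) for i, c in enumerate(codes) if c == local)
        return hits

    blocks = [
        ({"the": 0, "cat": 1}, [0, 1, 0]),       # "cat" is local symbol 1 here
        ({"dog": 0, "sat": 1}, [1, 0, 1, 0]),    # "cat" absent: block skipped
    ]
    print(search_blocks(blocks, "cat"))          # [(0, 1)]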
It is the sheer simplicity of this scheme that defines our closing argument.
What Next?
It is much harder to write the last chapter of a book than to write the first one.
At the beginning, the objective is clear - you must set the scene, be full of
effusive optimism, and paint a rosy picture of what the book is intended to
achieve. But by the time your reader reaches the final chapter there is no scope
for lily-gilding, and for better or for worse, they have formed an opinion as to
the validity and usefulness of your work.
This book is no exception. We claimed at the beginning that coding was an
area of algorithm design rich in theory and practical techniques. In the ensu-
ing chapters we have described many coding techniques, and shown how they
couple with different predictive models to make useful compression systems.
But our coverage, while comprehensive, has not been exhaustive - to capture
everything would require a decade or more of patient writing, by which time
there would be a decade of new research to be incorporated.
So how do we finish our book? What parting information do we convey to
our readers in the event their thirst has not yet been sated?
One important thing we can do is point at further sources of information.
We have deliberately adopted a writing style with interlaced citations to the
research literature - we believe the inventors of the many clever ideas we have
described are deserving of that recognition. The first port of call for readers
wishing more precision, therefore, is the original literature. They will find de-
tails and rationales that we have had no choice but to gloss over in the interests
of brevity and breadth.
Another obvious point we should make in closing is to reiterate our in-
tended coverage - this book is much more about coding than it is about mod-
eling, and both are important in the design of a compression system. There are
good books about modeling already (listed in Section 1.4 on page 9). Some-
one might even rise to the challenge, and write an updated book devoted to
modeling with the express purpose of complementing this one.
www.cs.mu.oz.au/caca/
The home page for this book. Includes an errata listing and all of the following
URLs.
www.faqs.org/faqs/compression-faq/
A detailed set of answers to frequently asked compression questions.
Maintainer: Jean-loup Gailly.
corpus.canterbury.ac.nz
Home page of the Canterbury Corpus, including the Large Canterbury Corpus
[Arnold and Bell, 1997]. Lists compression effectiveness and throughput
results for a wide range of compression systems, both public systems and
research prototypes. Owner: Tim C. Bell.
compression.ca
Home page of the Archive Compression Test. Lists compression effectiveness
and throughput results for a wide range of public and commercial compression
systems, including those that have archiving functions. Owner: Jeff Gilchrist.
www.dogma.net/DataCompression/
A great deal of compression-related information, including (at the
Benchmarks.shtml page), a set of performance figures for compression
systems. Owner: Mark Nelson.
www.internz.com/compression-pointers.html
A listing of compression resources, including people, software, and projects.
Owner: Stuart Inglis.
www.rasip.fer.hr/research/compress/
The Data Compression Reference Center page. Includes (at the algorithms
subpage) descriptions of a number of compression systems. Project manager:
Mario Kovac.
www.cs.brandeis.edu/~dcc/
Home page of the annual IEEE Data Compression Conference, held in
Snowbird, Utah, in late March or early April each year. Conference chairs: Jim
Storer and Marty Cohn.
directory.google.com/Top/Computers/Algorithms/Compression/
Typical listing maintained by a search engine. Owner: Google Inc.
www.cs.mu.oz.au/~alistair/abstracts/
Research papers contributed to by the first author.
www.computing.edu.au/~andrew/pubs/
Research papers contributed to by the second author.
Another place to which we can point the reader is the web. The web has
the advantage of being fluid, and so provides the opportunity for relatively
inexpensive revision and extension. There are many web sites that might be of
interest to the reader who has made it this far, including the web page for this
book at www.cs.mu.oz.au/caca/. Table 9.1 lists a few of the more useful ones.
Several of the web pages listed in the table include detailed assessments of
various compression systems. The availability of these evaluations has allowed
us one luxury in this book that previous compression texts have not enjoyed:
we are free from the need to provide tables of compression performance. Some
readers will feel that we have evaded the issue, and should give concrete advice.
But the truth is, there are many competing constraints that dictate the choice of
compression mechanism, and to come out with a table that summarizes the at-
tributes of compression systems into a set of single numbers is to oversimplify
that choice. For example, in Chapter 8 we mentioned performance figures for
a range of systems on the file bible.txt, taken from the Large Canterbury
Corpus. But simply comparing these numbers is rather unfair to some of the
systems involved. For example, the PPMD implementation we tested obtained
a compression on bible.txt of 1.56 bits per character, but did so using (Ta-
ble 8.2 on page 233) 32 MB of memory. In comparison, the Burrows-Wheeler
implementation BZIP2 "only" achieves 1.67 bits per character, but does so in
around 5 MB of execution-time memory. At the very least, for any compres-
sion comparison to be fair, similar extents of memory should be used, and with
memory limited to around 4 MB, the PPM implementation tested drops to a
compression rate of 1.66 bits per character.
The PPM implementation also executes slower than does BZIP2, and
when computational resources are taken into account, BZIP2 is the best com-
promise for a range of applications. But we are still not willing to declare
a winner, for BZIP2 operates in an off-line manner, and produces no output
bits until the last input byte of an entire block has been digested. Alternative, on-line compression systems start emitting bits as soon as the first input symbol is available (though possibly subject to the latency inherent in arithmetic coding); PPM is on-line. This on-going balancing act between constraints and
performance is why we have avoided compression league tables. Choosing a
compression system is like buying a car - selecting the most fuel-economical
one (that is, the most effective system) may not actually be the best way of
meeting our transport needs.
That there are many competing factors means, of course, that the last and
most important resource we can point out is you, the reader. If you have read
your way through this book, you have, we fervently hope, learned a great deal.
At the very least, we expect you to be able to make an intelligent choice of
coder and compression system for a particular application that you have in
mind. And if you have seen things in these chapters, or in the papers we have
cited, and thought to yourself "That can't be right? Surely it will work better
if... ?" then you are well on the way to making your own discoveries. Don't un-
derestimate your own ability to invent new compression and coding algorithms,
the " ... " in the previous sentence. Most people start their compression careers
by tinkering - messing about in small programs, so to speak - and maybe, just
maybe, this book will have given you the confidence to have a go. Best of all, if
you put down the book and roll up your sleeves, then we have succeeded both
with our attempt to convey our infectious enthusiasm for this topic, and also in
providing an answer to the "what next?" question with which we opened this
discussion. Have fun ...
Bibliography
J. Aberg. A Universal Source Coding Perspective on PPM. PhD thesis, Lund University, Sweden, October 1999. [p. 9]
J. Aberg, Y. M. Shtarkov, and B. J. M. Smeets. Towards understanding and improving escape probabilities in PPM. In Storer and Cohn [1997], pages 22-31. [p. 145, 230]
J. Abrahams. Codes with monotonic codeword lengths. Information Processing & Management, 30(6):759-764, 1994. [p. 214]
J. Abrahams and M. J. Lipman. Zero-redundancy coding for unequal code symbol costs. IEEE Trans. on Information Theory, 38:1583-1586, September 1992. [p. 212]
N. Abramson. Information Theory and Coding. McGraw Hill, New York, 1963. [p. 92]
D. Altenkamp and K. Mehlhorn. Codes: Unequal probabilities, unequal letter costs. J. of the ACM, 27(3):412-427, July 1980. [p. 212]
J. B. Anderson and S. Mohan. Source and Channel Coding: An Algorithmic Approach. Kluwer Academic, 1991. Int. Series in Engineering and Computer Science. [p. 10]
A. Apostolico and S. Lonardi. Off-line compression by greedy textual substitution. Proc. IEEE, 88(11):1733-1744, November 2000. [p. 248]
R. Arnold and T. Bell. A corpus for the evaluation of lossless compression algorithms. In Storer and Cohn [1997], pages 201-210. [p. 254]
B. Balkenhol, S. Kurtz, and Y. M. Shtarkov. Modifications of the Burrows and Wheeler data compression algorithm. In Storer and Cohn [1999], pages 188-197. [p. 233]
M. A. Bassiouni and A. Mukherjee. Efficient decoding of compressed data. J. of the American Society for Information Science, 46(1):1-8, January 1995. [p. 65]
T. C. Bell. Better OPM/L text compression. IEEE Trans. on Communications, COM-34:1176-1182, December 1986a. [p. 218]
T. C. Bell. A Unifying Theory and Improvements for Existing Approaches to Text Compression. PhD thesis, University of Canterbury, Christchurch, New Zealand, 1986b. [p. 9]
T. C. Bell, J. G. Cleary, and I. H. Witten. Text Compression. Prentice-Hall, Englewood Cliffs, New Jersey, 1990. [p. vii, 8, 9, 17, 20, 220, 229, 232]
T. C. Bell and D. Kulp. Longest-match string searching for Ziv-Lempel compression. Software - Practice and Experience, 23(7):757-772, July 1993. [p. 219]
T. C. Bell, A. Moffat, C. G. Nevill-Manning, I. H. Witten, and J. Zobel. Data compression in full-text retrieval systems. J. of the American Society for Information Science, 44(9):508-531, October 1993. [p. 36, 245]
T. C. Bell and I. H. Witten. The relationship between greedy parsing and symbolwise text compression. J. of the ACM, 41(4):708-724, July 1994. [p. 220]
T. C. Bell, I. H. Witten, and J. G. Cleary. Modeling for text compression. Computing Surveys, 21(4):557-592, December 1989. [p. 9]
J. Bentley and D. McIlroy. Data compression using long common strings. In Storer and Cohn [1999], pages 287-295. [p. 248]
J. Bentley, D. Sleator, R. Tarjan, and V. Wei. A locally adaptive data compression scheme. Communications of the ACM, 29(4):320-330, April 1986. [p. 171, 173, 245]
J. Bentley and A. C-C. Yao. An almost optimal algorithm for unbounded searching. Information Processing Letters, 5(3):82-87, August 1976. [p. 33]
J. L. Bentley and M. D. McIlroy. Engineering a sorting function. Software - Practice and Experience, 23(11):1249-1265, November 1993. [p. 14, 70]
J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. Eighth Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 360-369, New Orleans, LA, January 1997. www.cs.princeton.edu/~rs/strings/. [p. 242]
W. Blake. Milton. Shambhala Publications Inc., Boulder, Colorado, 1978. With commentary by K. P. Easson and R. R. Easson. [p. 20]
C. Bloom. New techniques in context modeling and arithmetic coding. In Storer and Cohn [1996], page 426. [p. 232]
A. Bookstein and S. T. Klein. Is Huffman coding dead? Computing, 50(4):279-296, 1993. [p. 103, 134]
L. Bottou, P. G. Howard, and Y. Bengio. The Z-Coder adaptive binary coder. In Storer and Cohn [1998], pages 13-22. [p. 130]
P. Bradford, M. J. Golin, L. L. Larmore, and W. Rytter. Optimal prefix-free codes for unequal letter costs: Dynamic programming with the Monge property. In G. Bilardi, G. F. Italiano, A. Pietracaprina, and G. Pucci, editors, Proc. 6th Ann. European Symp. on Algorithms, volume 1461, pages 43-54, Venice, Italy, August 1998. LNCS Volume 1461. [p. 212]
S. Bunton. On-Line Stochastic Processes in Data Compression. PhD thesis, University of Washington, March 1997a. [p. 9, 229, 230]
S. Bunton. Semantically motivated improvements for PPM variants. The Computer J., 40(2/3):76-93, 1997b. [p. 229, 230]
M. Buro. On the maximum length of Huffman codes. Information Processing Letters, 45(5):219-223, April 1993. [p. 90]
M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, May 1994. [p. 233]
R. D. Cameron. Source encoding using syntactic information source models. IEEE Trans. on Information Theory, IT-34(4):843-850, July 1988. [p. 246]
A. Cannane and H. E. Williams. General-purpose compression for efficient retrieval. J. of the American Society for Information Science and Technology, 52(5):430-437, March 2001. [p. 248]
B. Chapin. Switching between two on-line list update algorithms for higher compression of Burrows-Wheeler transformed data. In Storer and Cohn [2000], pages 183-192. [p. 233, 241]
P. Fenwick. Block sorting text compression. In K. Ramamohanarao, editor, Proc. 19th Australasian Computer Science Conf., pages 193-202, Melbourne, January 1996a. [p. 233]
P. Fenwick. The Burrows-Wheeler transform for block sorting text compression: Principles and improvements. The Computer J., 39(9):731-740, September 1996b. [p. 177]
P. Fenwick. Symbol ranking text compression with Shannon recodings. J. of Universal Computer Science, 3(2):70-85, 1997. [p. 230]
P. Fenwick. Symbol ranking text compressors: Review and implementation. Software - Practice and Experience, 28(5):547-559, 1998. [p. 230]
G. Feygin, P. G. Gulak, and P. Chow. Minimizing excess code length and VLSI complexity in the multiplication free approximation of arithmetic coding. Information Processing & Management, 30(6):805-816, November 1994. [p. 123]
E. R. Fiala and D. H. Greene. Data compression with finite windows. Communications of the ACM, 32(4):490-505, April 1989. [p. 218]
A. S. Fraenkel and S. T. Klein. Bidirectional Huffman coding. The Computer J., 33(4):296-307, 1990. [p. 214]
A. S. Fraenkel and S. T. Klein. Bounding the depth of search trees. The Computer J., 36(7):668-678, 1993. [p. 194, 202]
J. L. Gailly. Gzip program and documentation, 1993. Source code available from ftp://prep.ai.mit.edu/pub/gnu/gzip-*.tar. [p. 219]
R. G. Gallager. Variations on a theme by Huffman. IEEE Trans. on Information Theory, IT-24(6):668-674, November 1978. [p. 74, 89, 146]
R. G. Gallager and D. C. Van Voorhis. Optimal source codes for geometrically distributed integer alphabets. IEEE Trans. on Information Theory, IT-21(2):228-230, March 1975. [p. 38]
A. M. Garsia and M. L. Wachs. A new algorithm for minimum cost binary trees. SIAM J. on Computing, 6(4):622-642, December 1977. [p. 207]
B. Girod. Bidirectionally decodable streams of prefix code-words. IEEE Communications Letters, 3(8):245-247, August 1999. [p. 214]
D. Goldberg. What every computer scientist should know about floating-point arithmetic. Computing Surveys, 23(1):5-48, March 1991. [p. 117]
M. J. Golin and N. Young. Prefix codes: equiprobable words, unequal letter costs. In S. Abiteboul and E. Shamir, editors, Proc. 21st Int. Coll. on Automata, Languages and Programming, pages 605-617, Jerusalem, July 1994. Springer-Verlag. LNCS 820. [p. 212]
S. W. Golomb. Run-length encodings. IEEE Trans. on Information Theory, IT-12(3):399-401, July 1966. [p. 36]
S. W. Golomb, R. E. Peile, and R. A. Scholtz. Basic Concepts in Information Theory and Coding: The Adventures of Secret Agent 00111. Plenum, April 1994. [p. 10]
G. Gonnet and R. Baeza-Yates. Handbook of Data Structures and Algorithms. Addison-Wesley, Reading, Massachusetts, second edition, 1991. [p. 5, 10]
U. Graf. Dense coding - A fast alternative to arithmetic coding. In Proc. Compression and Complexity of Sequences. IEEE Computer Society Press, Los Alamitos, California, July 1997. [p. 123]
D. A. Huffman. A method for the construction of minimum-redundancy codes. Proc. Inst. Radio Engineers, 40(9):1098-1101, September 1952. [p. x, 5, 53]
F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly ordered sets. SIAM J. on Computing, 1(1):31-39, 1972. [p. 40]
R. Y. K. Isal and A. Moffat. Parsing strategies for BWT compression. In Storer and Cohn [2001], pages 429-438. [p. 174, 242]
R. Y. K. Isal and A. Moffat. Word-based block-sorting text compression. In M. Oudshoorn, editor, Proc. 24th Australian Computer Science Conf., pages 92-99, Gold Coast, Australia, February 2001b. IEEE Computer Society, Los Alamitos, CA. [p. 174]
R. Y. K. Isal, A. Moffat, and A. C. H. Ngai. Enhanced word-based block-sorting text compression. In M. Oudshoorn, editor, Proc. 25th Australasian Computer Science Conf., Melbourne, Australia, February 2002. [p. 174, 242, 243]
M. Jakobsson. Huffman coding in bit-vector compression. Information Processing Letters, 7(6):304-307, October 1978. [p. 41]
D. W. Jones. Application of splay trees to data compression. Communications of the ACM, 31(8):996-1007, August 1988. [p. 175, 176]
R. M. Karp. Minimum-redundancy coding for the discrete noiseless channel. IEEE Trans. on Information Theory, IT-7(1):27-38, January 1961. [p. 212]
J. Karush. A simple proof of an inequality of McMillan. Institute of Radio Engineers Trans. on Information Theory, IT-7(2):118, April 1961. [p. 18]
J. Katajainen, A. Moffat, and A. Turpin. A fast and space-economical algorithm for length-limited coding. In J. Staples, P. Eades, N. Katoh, and A. Moffat, editors, Proc. Int. Symp. on Algorithms and Computation, pages 12-21, Cairns, Australia, December 1995. Springer-Verlag. LNCS 1004. [p. 201]
J. Katajainen, M. Penttonen, and J. Teuhola. Syntax-directed compression of program files. Software - Practice and Experience, 16(3):269-276, March 1986. [p. 246]
J. H. Kingston. Algorithms and Data Structures: Design, Correctness, Analysis. Addison-Wesley, Reading, MA, 1990. [p. 173]
M. Klawe and B. Mumey. Upper and lower bounds on constructing alphabetic binary trees. In Proc. 4th ACM-SIAM Symp. on Discrete Algorithms, pages 185-193, 1993. [p. 207]
S. T. Klein. Efficient optimal recompression. The Computer J., 40(2/3):117-126, 1997. [p. 133]
D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, Massachusetts, 1973. [p. 10, 11, 203, 205, 207]
D. E. Knuth. Dynamic Huffman coding. J. of Algorithms, 6(2):163-180, June 1985. [p. 146]
L. G. Kraft. A device for quantizing, grouping, and coding amplitude modulated pulses. Master's thesis, MIT, Cambridge, Massachusetts, 1949. [p. 18]
R. M. Krause. Channels which transmit letters of unequal duration. Information and Control, 5:13-24, 1962. [p. 212]
E. S. Laber. Códigos de Prefixo: Algoritmos e Cotas. PhD thesis, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), Departamento de Informática, July 1999. In Portuguese. [p. 9]
K. Mehlhorn. An efficient algorithm for constructing nearly optimal prefix codes. IEEE Trans. on Information Theory, IT-26(5):513-517, 1980. [p. 212]
R. L. Milidiu and E. S. Laber. The WARM-UP algorithm: A Lagrangean construction of length restricted Huffman codes. SIAM J. on Computing, 30(5):1405-1426, 2000. [p. 202]
R. L. Milidiu and E. S. Laber. Bounding the inefficiency of length-restricted prefix codes. Algorithmica, 31(4):513-529, 2001. [p. 202]
R. L. Milidiu, E. S. Laber, and A. A. Pessoa. Bounding the compression loss of the FGK algorithm. J. of Algorithms, 32(2):195-211, 1999. [p. 146]
R. L. Milidiu, A. A. Pessoa, and E. S. Laber. Three space-economical algorithms for calculating minimum-redundancy prefix codes. IEEE Trans. on Information Theory, 47(6):2185-2198, September 2001. [p. 88]
V. Miller and M. Wegman. Variations on a theme by Ziv and Lempel. In A. Apostolico and Z. Galil, editors, Combinatorial Algorithms on Words, Volume 12, NATO ASI Series F, pages 131-140, Berlin, 1985. Springer-Verlag. [p. 220]
S. Mitarai, M. Hirao, T. Matsumoto, A. Shinohara, M. Takeda, and S. Arikawa. Compressed pattern matching for SEQUITUR. In Storer and Cohn [2001], pages 469-478. [p. 248]
J. L. Mitchell and W. B. Pennebaker. Software implementations of the Q-coder. IBM J. of Research and Development, 32:753-774, 1988. [p. 187]
A. Moffat. Word based text compression. Software - Practice and Experience, 19(2):185-198, February 1989. [p. 243, 245]
A. Moffat. Implementing the PPM data compression scheme. IEEE Trans. on Communications, 38(11):1917-1921, November 1990. [p. 140, 142, 222, 226]
A. Moffat. An improved data structure for cumulative probability tables. Software - Practice and Experience, 29(7):647-659, 1999. [p. 161, 166]
A. Moffat and J. Katajainen. In-place calculation of minimum-redundancy codes. In S. G. Akl, F. Dehne, and J.-R. Sack, editors, Proc. Workshop on Algorithms and Data Structures, pages 393-402. Springer-Verlag, LNCS 955, August 1995. Source code available from www.cs.mu.oz.au/~alistair/inplace.c. [p.6~
A. Moffat, R. M. Neal, and I. H. Witten. Arithmetic coding revisited. ACM Trans. on Information Systems, 16(3):256-294, July 1998. Source code available from www.cs.mu.oz.au/~alistair/arith_coder/. [p. 93, 98, 111-113]
A. Moffat and O. Petersson. An overview of adaptive sorting. Australian Computer J., 24(2):70-77, May 1992. [p. 57]
A. Moffat, N. Sharman, I. H. Witten, and T. C. Bell. An empirical evaluation of coding methods for multi-symbol alphabets. Information Processing & Management, 30(6):791-804, November 1994. [p. 120, 140, 176, 177, 190]
A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Information Retrieval, 3(1):25-47, July 2000. [p. 42, 46, 47]
A. Moffat and A. Turpin. On the implementation of minimum-redundancy prefix codes. IEEE Trans. on Communications, 45(10):1200-1207, October 1997. [p. 57, 62, 65]
A. Moffat and R. Wan. Re-Store: A system for compressing, browsing, and searching large documents. In Proc. Symp. String Processing and Information Retrieval, pages 162-174, Laguna de San Rafael, Chile, November 2001. [p. 249, 250]
A. Moffat and J. Zobel. Parameterised compression for sparse bitmaps. In N. J. Belkin, P. Ingwersen, and A. M. Pejtersen, editors, Proc. 15th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 274-285, Copenhagen, June 1992. ACM Press, New York. [p. 41]
H. Nakamura and S. Murashima. Data compression by concatenations of symbol pairs. In Proc. IEEE Int. Symp. on Information Theory and its Applications, pages 496-499, Victoria, BC, Canada, September 1996. [p. 248]
M. Nelson and J. L. Gailly. The Data Compression Book: Featuring Fast, Efficient Data Compression Techniques in C. IDG Books Worldwide, Redwood City, California, second edition, 1995. [p. 9]
C. G. Nevill-Manning. Inferring Sequential Structure. PhD thesis, University of Waikato, New Zealand, 1996. [p. 9]
C. G. Nevill-Manning and I. H. Witten. Compression and explanation using hierarchical grammars. The Computer J., 40(2/3):103-116, 1997. [p. 248]
C. G. Nevill-Manning and I. H. Witten. On-line and off-line heuristics for inferring hierarchies of repetitions in sequences. Proc. IEEE, 88(11):1745-1755, November 2000. [p. 248]
R. Pasco. Source coding algorithms for fast data compression. PhD thesis, Stanford University, CA, 1976. [p. 92]
W. B. Pennebaker and J. L. Mitchell. JPEG: Still Image Data Compression Standard. Van Nostrand Reinhold, New York, 1993. [p. 190, 215, 251]
W. B. Pennebaker, J. L. Mitchell, G. G. Langdon, and R. B. Arps. An overview of the basic principles of the Q-Coder adaptive binary arithmetic coder. IBM J. of Research and Development, 32(6):717-726, November 1988. [p. 92, 187]
Y. Perl, M. R. Garey, and S. Even. Efficient generation of optimal prefix code: equiprobable words using unequal cost letters. J. of the ACM, 22:202-214, April 1975. [p. 212]
A. A. Pessoa. Construção eficiente de códigos livres de prefixo. Master's thesis, Departamento de Informática, PUC-Rio, September 1999. In Portuguese. [p. 9]
O. Petersson and A. Moffat. A framework for adaptive sorting. Discrete Applied Mathematics, 59(2):153-179, 1995. [p. 57]
R. F. Rice. Some practical universal noiseless coding techniques. Technical Report 79-22, Jet Propulsion Laboratory, Pasadena, California, March 1979. [p. 37]
J. Rissanen. Generalised Kraft inequality and arithmetic coding. IBM J. of Research and Development, 20:198-203, May 1976. [p. 92]
J. Rissanen and G. G. Langdon. Arithmetic coding. IBM J. of Research and Development, 23(2):149-162, March 1979. [p. 92]
J. Rissanen and G. G. Langdon. Universal modeling and coding. IEEE Trans. on Information Theory, IT-27(1):12-23, January 1981. [p. 3, 244]
J. Rissanen and K. M. Mohiuddin. A multiplication-free multialphabet arithmetic code. IEEE Trans. on Communications, 37(2):93-98, February 1989. [p. 123]
J. J. Rissanen. Arithmetic codings as number representations. Acta. Polytech. Scandinavica, Math 31:44-51, 1979. [p. 92]
M. Schindler. A fast block-sorting algorithm for lossless data compression. In Storer and Cohn [1997], page 469. [p. 174, 233, 241]
M. Schindler. A fast renormalisation for arithmetic coding. In Storer and Cohn [1998], page 572. [p. 114, 117]
E. S. Schwartz. An optimum encoding with minimum longest code and total number of digits. Information and Control, 7:37-44, 1964. [p. 55]
E. S. Schwartz and B. Kallick. Generating a canonical prefix encoding. Communications of the ACM, 7(3):166-169, March 1964. [p. 57]
R. Sedgewick. Algorithms in C. Addison-Wesley, Reading, Massachusetts, 1990. [p. 5, 10, 69]
J. Seward. On the performance of BWT sorting algorithms. In Storer and Cohn [2000], pages 173-182. [p. 242]
J. Seward. Space-time tradeoffs in the inverse B-W transform. In Storer and Cohn [2001], pages 439-448. [p. 242]
J. Seward and J. L. Gailly. Bzip2 program and documentation, 1999. sourceware.cygnus.com/bzip2/. [p. 241]
D. Sleator and R. Tarjan. Self-adjusting binary search trees. J. of the ACM, 32(3):652-686, July 1985. [p. 173, 174, 176]
R. J. Solomonoff. A formal theory of inductive inference. Parts I and II. Information and Control, 7:1-22 and 224-254, 1964. [p. 248]
J. A. Storer. Data Compression: Methods and Theory. Computer Science Press, Rockville, Maryland, 1988. [p. 9]
J. A. Storer and M. Cohn, editors. Proc. 1993 IEEE Data Compression Conf., March 1993. IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 1996 IEEE Data Compression Conf., April 1996. IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 1997 IEEE Data Compression Conf., March 1997. IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 1998 IEEE Data Compression Conf., March 1998. IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 1999 IEEE Data Compression Conf., March 1999. IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 2000 IEEE Data Compression Conf., March 2000. IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 2001 IEEE Data Compression Conf., March 2001. IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and J. H. Reif, editors. Proc. 1991 IEEE Data Compression Conf., April 1991. IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and T. G. Szymanski. Data compression via textual substitution. J. of the ACM, 29:928-951, 1982. [p. 218]
L. Stuiver and A. Moffat. Piecewise integer mapping for arithmetic coding. In Storer and Cohn [1998], pages 3-12. [p. 117, 123, 125, 129, 190, 191]
H. Tanaka. Data structure of the Huffman codes and its application to efficient encoding and decoding. IEEE Trans. on Information Theory, IT-33(1):154-156, January 1987. [p. 63]
W. J. Teahan. Modelling English Text. PhD thesis, University of Waikato, New Zealand, 1998. [p. 9, 222]
W. J. Teahan and J. G. Cleary. The entropy of English using PPM-based models. In Storer and Cohn [1996], pages 53-62. [p. 246]
W. J. Teahan and J. G. Cleary. Models of English text. In Storer and Cohn [1997], pages 12-21. [p. 246]
W. J. Teahan and J. G. Cleary. Tag based models of English text. In Storer and Cohn [1998], pages 43-52. [p. 246]
W. J. Teahan and D. J. Harper. Combining PPM models using a text mining approach. In Storer and Cohn [2001], pages 153-162. [p. 230]
A. Turpin and A. Moffat. Comment on "Efficient Huffman decoding" and "An efficient finite-state machine implementation of Huffman decoders". Information Processing Letters, 68(1):1-2, 1998. [p. 63]
A. Turpin and A. Moffat. Adaptive coding for data compression. In J. Edwards, editor, Proc. 22nd Australasian Computer Science Conf., pages 63-74, Auckland, January 1999. Springer-Verlag, Singapore. [p. 190]
A. Turpin and A. Moffat. Housekeeping for prefix coding. IEEE Trans. on Communications, 48(4):622-628, April 2000. Source code available from www.cs.mu.oz.au/~alistair/mr_coder/. [p. 82]
A. Turpin and A. Moffat. On-line adaptive canonical prefix coding with bounded compression loss. IEEE Trans. on Information Theory, 47(1):88-98, January 2001. [p. 78, 81, 181-183]
J. van Leeuwen. On the construction of Huffman trees. In Proc. 3rd Int. Coll. on Automata, Languages, and Programming, pages 382-410, Edinburgh University, Scotland, July 1976. Edinburgh University. [p. 66]
D. C. Van Voorhis. Constructing codes with bounded codeword lengths. IEEE Trans. on Information Theory, IT-20(2):288-290, March 1974. [p. 194]
D. C. Van Voorhis. Constructing codes with ordered codeword lengths. IEEE Trans. on Information Theory, IT-21(1):105-106, January 1975. [p. 214]
J. S. Vitter. Design and analysis of dynamic Huffman codes. J. of the ACM, 34(4):825-845, October 1987. [p. 146, 176]
J. S. Vitter. Algorithm 673: Dynamic Huffman coding. ACM Trans. on Mathematical Software, 15(2):158-167, June 1989. [p. 146]
P. A. J. Volf and F. M. J. Willems. Switching between two universal source coding algorithms. In Storer and Cohn [1998], pages 491-500. [p. 233, 241]
V. Wei. Data Compression: Theory and Algorithms. Academic Press/Harcourt Brace Jovanovich, Orlando, Florida, 1987. [p. 10]
T. A. Welch. A technique for high performance data compression. IEEE Computer, 17(6):8-20, June 1984. [p. 220]
F. M. J. Willems, Yu. M. Shtarkov, and Tj. J. Tjalkens. Context tree weighting method: Basic properties. IEEE Trans. on Information Theory, 32(4):526-532, July 1995. [p. 230]
F. M. J. Willems, Yu. M. Shtarkov, and Tj. J. Tjalkens. Context weighting for general finite context sources. IEEE Trans. on Information Theory, 33(5):1514-1520, September 1996. [p. 230]
H. E. Williams and J. Zobel. Compressing integers for fast file access. The Computer J., 42(3):193-201, 1999. [p. 246]
R. N. Williams. Adaptive Data Compression. Kluwer Academic, Norwell, Massachusetts, 1991a. [p. 9]
R. N. Williams. An extremely fast Ziv-Lempel data compression algorithm. In Storer and Reif [1991], pages 362-371. [p. 219]
A. I. Wirth. Symbol-driven compression of Burrows Wheeler transformed text. Master's thesis, The University of Melbourne, Australia, September 2000. [p. 9]
A. I. Wirth and A. Moffat. Can we do without ranks in Burrows Wheeler transform compression? In Storer and Cohn [2001], pages 419-428. [p. 233, 240]
W. D. Withers. The ELS-coder: A rapid entropy coder. In Storer and Cohn [1997], page 475. [p. 130]
I. H. Witten and T. C. Bell. The zero frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Trans. on Information Theory, 37(4):1085-1094, July 1991. [p. 140, 142]
I. H. Witten, T. C. Bell, A. Moffat, C. G. Nevill-Manning, T. C. Smith, and H. Thimbleby. Semantic and generative models for lossy text compression. The Computer J., 37(2):83-87, April 1994. [p. 251]
I. H. Witten and E. Frank. Machine Learning: Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 1999. [p. 24]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, second edition, 1999. [p. vii, 9, 36, 69, 229, 246]
I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520-541, June 1987. [p. 92, 93, 98, 111, 113, 156, 161, 166]
J. G. Wolff. An algorithm for the segmentation of an artificial language analogue. British J. of Psychology, 66(1):79-90, 1975. [p. 248]
J. G. Wolff. Recoding of natural language for economy of transmission or storage. The Computer J., 21:42-44, 1978. [p. 248]
H. Yokoo. Data compression using a sort-based context similarity measure. The Computer J., 40(2/3):94-102, 1997. [p. 230]
G. K. Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Reading, Mass., 1949. [p. 48, 70]
J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. on Information Theory, IT-23(3):337-343, May 1977. [p. 215]
J. Ziv and A. Lempel. Compression of individual sequences via variable rate coding. IEEE Trans. on Information Theory, IT-24(5):530-536, September 1978. [p. 218, 220]
J. Zobel. Writing for Computer Science: The Art of Effective Communication. Springer-Verlag, Singapore, 1997. [p. xii]
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software - Practice and Experience, 25(8):891-903, August 1995. [p. 57, 245]
Index
twopower_add(), 185
twopower_decode(), 184
twopower_encode(), 184
twopower_increment(), 185
twopower_initialize(), 184
Z coder, 130
zero-frequency problem, 139-146, 222
zero-order model, 4, 5, 24, 65, 103, 134,
140
zero-redundancy code, 17, 30, 32, 36, 111,
120
Zipf distribution, 48, 70