
COMPRESSION AND CODING ALGORITHMS

THE KLUWER INTERNATIONAL SERIES
IN ENGINEERING AND COMPUTER SCIENCE

by

Alistair Moffat
The University of Melbourne, Australia

and

Andrew Turpin
Curtin University of Technology, Australia

Springer Science+Business Media, LLC


ISBN 978-1-4613-5312-6 ISBN 978-1-4615-0935-6 (eBook)
DOI 10.1007/978-1-4615-0935-6

Library of Congress Cataloging-in-Publication Data

A C.I.P. Catalogue record for this book is available from the Library of Congress.

Copyright © 2002 by Springer Science+Business Media New York


Originally published by Kluwer Academic Publishers in 2002
Softcover reprint of the hardcover 1st edition 2002
All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical, photo-
copying, recording, or otherwise, without the prior written permission of the
publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.


Contents

Preface vii

1 Data Compression Systems 1
  1.1 Why compression? 1
  1.2 Fundamental operations 3
  1.3 Terminology 6
  1.4 Related material 9
  1.5 Analysis of algorithms 10

2 Fundamental Limits 15
  2.1 Information content 15
  2.2 Kraft inequality 17
  2.3 Human compression 19
  2.4 Mechanical compression systems 20

3 Static Codes 29
  3.1 Unary and binary codes 29
  3.2 Elias codes 32
  3.3 Golomb and Rice codes 36
  3.4 Interpolative coding 42
  3.5 Making a choice 48

4 Minimum-Redundancy Coding 51
  4.1 Shannon-Fano codes 51
  4.2 Huffman coding 53
  4.3 Canonical codes 57
  4.4 Other decoding methods 63
  4.5 Implementing Huffman's algorithm 66
  4.6 Natural probability distributions 70
  4.7 Artificial probability distributions 78
  4.8 Doing the housekeeping chores 81
  4.9 Related material 88

5 Arithmetic Coding 91
  5.1 Origins of arithmetic coding 92
  5.2 Overview of arithmetic coding 93
  5.3 Implementation of arithmetic coding 98
  5.4 Variations 113
  5.5 Binary arithmetic coding 118
  5.6 Approximate arithmetic coding 122
  5.7 Table-driven arithmetic coding 127
  5.8 Related material 130

6 Adaptive Coding 131
  6.1 Static and semi-static probability estimation 131
  6.2 Adaptive probability estimation 135
  6.3 Coping with novel symbols 139
  6.4 Adaptive Huffman coding 145
  6.5 Adaptive arithmetic coding 154
  6.6 Maintaining cumulative statistics 157
  6.7 Recency transformations 170
  6.8 Splay tree coding 175
  6.9 Structured arithmetic coding 177
  6.10 Pseudo-adaptive coding 179
  6.11 The Q-coder 186
  6.12 Making a choice 190

7 Additional Constraints 193
  7.1 Length-limited coding 194
  7.2 Alphabetic coding 202
  7.3 Alternative channel alphabets 209
  7.4 Related material 214

8 Compression Systems 215
  8.1 Sliding window compression 215
  8.2 Prediction by partial matching 221
  8.3 Burrows-Wheeler transform 232
  8.4 Other compression systems 243
  8.5 Lossy modeling 251

9 What Next? 253

References 257

Index 271
Preface

None of us is comfortable with paying more for a service than the minimum
we believe it should cost. It seems wantonly wasteful, for example, to pay $5
for a loaf of bread that we know should only cost $2, or $10,000 more than the
sticker price of a car. And the same is true for communications costs - which
of us has not received our monthly phone bill and gone "ouch"? Common to
these cases is that we are not especially interested in reducing the amount of
product or service that we receive. We do want to purchase the loaf of bread
or the car, not half a loaf or a motorbike; and we want to make the phone calls
recorded on our bill. But we also want to pay as little as possible for the desired
level of service, to somehow get the maximal "bang for our buck".
That is what this book is about - figuring out how to minimize the "buck"
cost of obtaining a certain amount of "bang". The "bang" we are talking about
is the transmission of messages, just as in the case of a phone bill; and the
"buck" we seek to minimize is the dollar cost of sending that information. This
is the process of data compression; of seeking the most economical represen-
tation possible for a source message. The only simplification we make when
discussing compression methods is to suppose that bytes of storage or commu-
nications capacity and bucks of money are related, and that if we can reduce
the number of bytes of data transmitted, then the number of bucks spent will
be similarly minimal.
Data compression has emerged as an important enabling technology in a
wide variety of communications and storage applications, ranging from "disk
doubling" operating systems that provide extra storage space; to the facsim-
ile standards that facilitate the flow of business information; and to the high-
definition video and audio standards that allow maximal use to be made of
scarce satellite transmission bandwidth. Much has been written about data
compression - indeed, we can immediately recommend two excellent books,
only one of which involves either of us as an author [Bell et al., 1990, Witten
et al., 1999] - and as a research area data compression is relatively mature.
As a consequence of that maturity, it is now widely agreed that compres-
sion arises from the conjunction of two quite distinct activities, modeling and

coding (and, as we shall see, the interfacing activity of probability estimation


or statistics maintenance). The modeling task splits the data into symbols, and
attempts to infer a set of probability distributions that predict how the message
to be compressed is made up of those symbols. The objective is to predict the
message with 100% accuracy, as all that remains to be transmitted is the dif-
ference between the model - which is a kind of theory - and the message in
question. Hence, if an appropriate model can be determined for the data being
represented, good compression will result, as the residual difference between
model and message will be small. For example, in English text a letter "q" is
usually followed by the letter "u", so a good model will somehow learn that
relationship and use it to refine its probability distributions. Modeling is the
public face of data compression, as it is where the creativity and intuition are
generally agreed to lie. If a data compression system were an ocean liner, the
model would correspond to the bridge - good all-round views, and control over
course and speed.
On the other hand, coding represents the engine room of a compression
system, and like its namesake on an ocean liner, requires a certain amount of
sweat, steam, and grease in order to operate. The coder generates the sequence
of bits that represent the symbols and probabilities asserted by the model, and
then, at the decoding end of the pipeline, reproduces a stream of instructions
that tells the model what symbols should be emitted in order to recreate the
original source message. Coding requires - at face value at least - that a rel-
atively straightforward task be carried out, with little more to be done than a
mechanical translation process from probabilities to bits. Indeed, coding seem-
ingly has little scope for creative and innovative mechanisms, and tends to be
buried deep in the bowels of a compression system, just as is the engine room
of an ocean liner.
Because of this relatively unglamorous role, little attention has been fo-
cussed on the task of coding. The two books mentioned above both devote
more space to modeling than they do to coding, and the same is true of the work
of other authors. Indeed, there is a widely held belief that the coding problem
is completely solved, and that off-the-shelf packages are available that obviate
any need for the compression system developer (or researcher, or student, or
interested computing professional) to know much about coding.
Nothing could be further from the truth. As an area of study, source cod-
ing has a surprising depth and richness. Indeed, in some ways the intellectual
endeavor that has been invested in this area perhaps rivals the energy that has
been put into another fundamental problem in computing, that of sorting. Just
as there are dozens of sorting algorithms, so too there are dozens of source cod-
ing algorithms, each with its own distinctive features and applications. There
is also, again as is the case with sorting, a gaggle of specialized coding sub-

problems that can be handled elegantly and economically by correspondingly


specialized techniques. And there are some truly beautiful structures and anal-
yses. To say that "Huffman coding is how we do coding" (an assertion implicit
in the treatment of compression given by a number of textbooks) is as false as
saying "Bubblesort is how we do sorting". And, moving forward by thirty odd
years, to say that "arithmetic coding is how we do coding" is only slightly less
naive than saying "Mergesort is how we do sorting". Just as a computing pro-
fessional is expected to have a detailed understanding of Bubblesort, Heapsort,
Quicksort, and Mergesort, together with an appreciation of the applications in
which each should be preferred, so too should a computing professional - and
certainly one professing knowledge of data compression techniques - have an
appreciation of a range of coding mechanisms. The one or two coding mecha-
nisms described in most texts should be regarded as a start, nothing more.
Hence this book. It examines in detail a wide range of mechanisms that
have been proposed for coding, covering nearly fifty years of methods and al-
gorithms, with an emphasis on the practicalities of implementation and execu-
tion. The book includes descriptions of recent improvements to widely-known
methods such as minimum-redundancy (Huffman) coding and arithmetic cod-
ing, as well as coding problems with additional constraints, such as length-
limited coding, alphabetic coding, and unequal letter-cost coding. It concludes
with a chapter that examines three state-of-the-art compression systems, de-
scribing for each the type of coder employed and the reasons for that choice.
Our intention is to be practical, realistic, detailed, and informative. Most of
the techniques described have been tested, with compression and speed results
reported where it is appropriate to do so. Implementations of a number of the
principal mechanisms are available on the Internet, and can be used by those
who seek compression, but are willing to forgo the details.
We also believe that this book has a role as a text for graduate and advanced
undergraduate teaching. A suggested lecture schedule covering a 24 lecture-
hour subject is available on the book's website at www.cs.mu.oz.au/caca;
and we have included sufficient advanced material that even the keenest of
graduate students will be challenged.
The remainder of this preface gives more details of the contents of the
chapters that follow. So if your curiosity has already been pricked, feel free to
go now to Chapter 1, it is where you will be shortly in any case. If you are
not yet sold - if you are sceptical of the claim that coding is as fascinating as
sorting, and almost as important - read on.

Outline of the book


Chapter 1 further motivates the need for compression, and explains in more
detail the distinction between coding and modeling. It then defines the coding
problem, and gives a number of simple examples of compression systems and
how they might employ different coders. Chapter 1 concludes with a section
that contains a modest amount of mathematical background that is required for
some of the later analyses.
Chapter 2 then investigates fundamental limits on what can be achieved by
a coder. These are the unbreakable rules that allow us to gauge how good a par-
ticular coder is, as we need only compare the actual length of its output with
the entropy-based lower bound. And, perhaps surprisingly, these fundamental
limits can also guide us in the implementation of effective coders. Ramamoha-
narao (Rao) Kotagiri, a colleague at the University of Melbourne, has observed
that one of the enduring (and perhaps also endearing) desires of the human race
is to condense multi-dimensional data sets into a single scalar value, so that al-
ternative methods can be compared. In the field of compression, this desire can
be sated, as if we ignore any resource implications, all behavior is measured in
bits and bytes, and the same is true of the lower bounds.
The class of static coders - those that make no attempt to manage a prob-
ability distribution on alphabet symbols - is described in Chapter 3. While it
seems counter-intuitive that these mechanisms can possibly be useful, they are
surprisingly versatile, and by virtue of their simplicity and lack of controlling
parameters, can sometimes yield better compression effectiveness than their
more principled stable-mates. They also tend to be robust and reliable, just as a
family sedan is in many ways a more practical choice for day-to-day motoring
needs than is an exotic two-seater sports car.
Nevertheless, more principled methods result in better compression effec-
tiveness whenever the cost of sending the symbol probabilities can be spread
over sufficiently many message symbols. Chapter 4 examines the family of
minimum-redundancy codes: those that assign a discrete bit-pattern to each
of the symbols in the alphabet, and do so in a systematic manner so as to
minimize overall message length. The best-known of all compression meth-
ods - Huffman coding, so named because of the famous paper authored by
David Huffman in 1952 - is one example of a coding algorithm in this cat-
egory. Chapter 4 gives details of the implementation of Huffman coding, and
shows that minimum-redundancy coding is a far more efficient process than the
follow-pointers-through-a-code-tree approach suggested by most textbooks.
If the restriction that all codewords must be discrete bits is lifted, we get the
family of arithmetic codes, the subject of Chapter 5. The singular advantage
of arithmetic codes is that they can very closely approximate the lower bound

on the length of the compressed representation that was mentioned earlier in


connection with Chapter 2. Amazing as it may seem at first, it is possible for
symbols in a message to contribute less than one bit to the compressed output
bitstream. Indeed, if the probability of a symbol is sufficiently close to 1 as
to warrant such a short codeword, it might contribute only 0.1, or 0.01, or
0.001 bits - whatever is appropriate. Chapter 5 also considers in detail the
implementation of arithmetic coding, and describes variants that trade a small
amount of compression effectiveness for increased execution speed.
Chapter 6 examines the problem of adaptive coding. The preceding two
chapters presume that the probabilities of symbols in the source alphabet are
known, and that all that is necessary is to calculate a code. In fact, in many
situations the symbol probabilities must be inferred from the message, and
moreover, inferred in an on-demand manner, in which the code for each mes-
sage symbol is finalized before that symbol contributes to the observed symbol
probabilities. Coding in this way allows one-pass compression, an important
consideration in many applications. Considerable complexity is introduced,
however, as the codewords used must be maintained on the fly. Chapter 6 ex-
amines algorithms for manipulating such codeword sets, and considers the del-
icate issue of whether static codes or adaptive codes yield better compression
effectiveness.
Chapter 7 broadens the quest for codes, and asks a number of what-if ques-
tions: what if no codeword may be longer than L bits for some limit L; what
if the codewords must be assigned to symbols in lexicographic order; what if
the channel alphabet (the set of output symbols that may be used to represent
the message) is non-binary; and what if the symbols in the channel alphabet
each have different costs of transmission. Unsurprisingly, when constraints are
added, the codes are harder to find.
The last chapter closes the circle. Chapters 1 and 2 discuss compression
systems at a high level; Chapter 8 returns to that theme, and dissects a num-
ber of recent high-performance compression techniques, describing the models
that they embody, and then the coders with which those models are coupled.
The intention is to explain why the particular combination of model and coder
employed in that product is appropriate, and to provide sufficient explanation
of the model that the interested reader will be able to benefit. The mechanisms
covered include the LZ77 sliding window approach embodied in GZIP; the
Prediction by Partial Matching mechanism used in the PPM family of compres-
sion systems; and the Burrows-Wheeler Transform (BWT) approach exploited
by BZIP2. In dealing in detail with these complete compression systems, it is
hoped that the reader will be provided with the framework in which to design
and implement their own compression system for whatever application they
have at hand. And that they will enjoy doing so.

Acknowledgements
One of the nice things about writing a book is getting to name names without
fear of being somehow unacademic or too personal. Here are some names,
people who in some way or another contributed to the existence of this work.
Research collaborators come first. There are many, as it has been our good
fortune to enjoy the friendship and assistance of a number of talented and gen-
erous people. Ian Witten has provided enthusiasm and encouragement over
more years than are worth counting, and lent a strategic nudge to this project at
a delicate moment. Lang Stuiver devoted considerable energy to his investiga-
tion of arithmetic coding, and much of Chapter 5 is a result of his efforts. Lang
also contributed to the interpolative coding mechanism described in Chapter 3.
Justin Zobel has been an accomplice for many years, and has contributed to
this book by virtue of his own interests [Zobel, 1997]. Others that we have
enjoyed interacting with include Abe Bookstein, Bill Teahan, Craig Nevill-
Manning, Darryl Lovato, Glen Langdon, Hugh Williams, Jeff Vitter, Jesper
Larsson, Jim Storer, John Cleary, Julien Seward, Jyrki Katajainen, Mahesh
Naik, Marty Cohn, Michael Schindler, Neil Sharman, Paul Howard, Peter Fen-
wick, Radford Neal, Suzanne Bunton, Tim C. Bell, and Tomi Klein. We have
also benefited from the research work undertaken by a very wide range of other
people. To those we have not mentioned explicitly by name - thank you.
Mike Liddell, Raymond Wan, Tim A.H. Bell, and Yugo Kartono Isal un-
dertook proofreading duties with enthusiasm and care. Many other past and
present students at the University of Melbourne have also contributed: Alwin
Ngai, Andrew Bishop, Gary Eddy, Glen Gibb, Mike Ciavarella, Linh Huynh,
Owen de Kretser, Peter Gill, Tetra Lindarto, Trefor Morgan, Tony Wirth, Vo
Ngoc Anh, and Wayne Salamonsen. We also thank the Australian Research
Council, for their funding of the various projects we have been involved in;
our two Departments, who have provided environments in which projects such
as this are feasible; Kluwer, who took it out of our hands and into yours; and
Gordon Kraft, who provided useful information about his father.
Family come last in this list, but first where it counts. Aidan, Allison, Anne,
Finlay, Kate, and Thau Mee care relatively little for compression, coding, and
algorithms, but they know something far more precious - how to take us away
from our keyboards and help us enjoy the other fun things in the world. It is
because of their influence that we plant our tongues in our cheeks and suggest
that you, the reader, take a minute now to look out your window. Surely there
is a nice leafy spot outside somewhere for you to do your reading?
Alistair Moffat, Melbourne, Australia
Andrew Turpin, Perth, Australia
Chapter 1

Data Compression Systems

One of the paradoxes of modern computer systems is that despite the spiraling
decrease in storage costs there is an ever increasing emphasis on data compres-
sion. We use compression daily, often without even being aware of it, when we
use facsimile machines, communication networks, digital cellular telephones,
world-wide web browsers, and DVD players. Indeed, on some computer sys-
tems, the moment we access a file from disk we make use of compression
technology; and not too far in the future are computer architectures that store
executable code in compressed form in main memory in lines of a few hundred
bytes, decompressing it only when brought into cache.

1.1 Why compression?


There are several reasons for the increasing reliance on compression. Most
obvious is that our demand for on-line storage is insatiable, and growing at the
same fast rate as is storage capacity. As little as five years ago a 100 MB disk
was an abundance of storage; now we routinely commit 100 MB of storage to
a single application, and expect that the system as a whole will have storage
capacity measured in gigabytes. A few years ago home computers were used
primarily for text-based applications; now, with the advent of digital still and
movie cameras, and on-line sources for high-quality sound data, we routinely
expect to be able to hold large collections of personal multi-media documents.
A second driving force for the increase in the use of compression is the
strict boundedness of some communications channels. A good example of this
phenomenon has been the flooding of the fax machine into every far reach of
the telephone network over the last twenty years. The most obvious cause of
this pervasiveness and acceptance is that no special communications link is
required, as an ordinary twisted pair telephone connection suffices. But no


less important has been that the bandwidth limitation imposed by twisted-pair
connections was greatly reduced by the contemporary development of elegant
bi-level (binary) image compression mechanisms. The electronic technology is
what has made facsimile transmission possible, but it is compression technol-
ogy that has kept costs low and made the facsimile machine an indispensable
tool for business and private use.
Similarly, within the last decade the use of compression has served to con-
tain the cost of cellular telephone and satellite television transmission, and has
made both of these technologies accessible to consumers at modest prices. Fi-
nally, the last few years have seen the explosion of the world-wide web net-
work. Which of us has not waited long minutes for pages to load, images to be
visible, and animations to commence. We blame the delays on a multitude of
reasons, but there is usually one single contributing factor - too much data to
be moved, and insufficient channel capacity to carry it. The obvious solution is
to spend more money to increase the bandwidth, but we could also reduce the
amount of data to be transmitted. With compression, it is possible to reduce
the amount of data transmitted, but not make any sacrifice in the amount of
information conveyed.
The third motivating force for compression is the endless search for im-
proved program speed, and this is perhaps the most subtle of the three factors.
Consider the typical personal computer of a decade ago, perhaps around 1990.
In addition to about 100 MB of hard disk, with its 15 millisecond seek time
and a 1 MB per second peak transfer rate, such a computer had a processor
of perhaps 33 MHz clock rate and 1 or 4 MB of memory. Now on the equiv-
alent personal computer the processor will operate more than ten times faster
(950 MHz is a current entry-level specification, and that is sure to have changed
again by the time you are reading this), and the memory capacity will also have
grown by a factor of thirty or more to around 128-256 MB. Disk capacities
have also exploded over the period in question, and the modern entry-level
computer may well have 20 GB of disk, two hundred times more than was
common just a few years ago. But disk speeds have not grown at the same
rate, and it is unlikely that the disk on a modern entry-level computer operates
any more than twice as quickly as the 1990 machine - 10 millisecond seek
times and 2 MB per second transfer rates are still typical, and for CD-ROM
drives seek and transfer times are even greater. That is, the limitations on me-
chanical technology have severely damped growth in disk speeds even though
capacity has increased greatly. Hence, it is now more economical than ever
before to trade-off processor time against reduced disk transfer times and file
sizes, the latter of which reduces average seek times too. Indeed, given the
current balance between disk and processor speeds, compression actually im-
proves overall response time in some applications. This effect will become

more marked as processors continue to improve; and only when fast solid-state
storage devices of capacity to rival disks are available will it be necessary to
again evaluate the trade-offs involved for and against compression. Once this
occurs, however, an identical trade-off will be possible with respect to cache
and main memory, rather than main memory and disk.
These three factors combine to make compression a fundamental enabling
technology in this digital age. Like any technology, we can, if we prefer, ig-
nore the details. Which of us truly understands the workings of the internal
combustion engine in our automobile? Indeed, which of us really even fully
grasps the exact details of the sequence of operations that allows the electric
light to come on when we flick the switch? And, just as there are mechanical
and electrical engineers who undertake to provide these two technologies to us
in a black box form, so too there are compression engineers that undertake to
provide black box compression systems that others may make use of to attain
the benefits outlined above. If we wish to make use of the technology in this
way without becoming intimate with the details, then no one will be scornful.
But, in the same way that some people regard tinkering with the family car
as a hobby rather than a chore, so too can an understanding of compression be
interesting. And for the student studying computer science, compression is one
of just a small handful of areas in which the development in an abstract way of
algorithms and data structures can address an immediate pragmatic need.
This book is intended for both groups of people - those who want to un-
derstand compression because it is a core technology in a field that they seek to
make their profession, and those who want to understand compression because
it interests them. And, of course, it is the hope of the authors that some of the
interest and excitement that prompted the writing of this book will rub off onto
its readers - in both of these categories.

1.2 Fundamental operations


Given that compression plays an increasingly important role in our interactions
with computers and communications systems, it is interesting to examine the
tools used in typical compression methods.
In general, any compression system - by which is meant a program that
is used to minimize the cost of storing messages containing a specified type
of data in some specified storage format - carries out three fundamental op-
erations [Rissanen and Langdon, 1981]. (A bibliography detailing all cited
references appears on page 257.)
The first of these operations is modeling: the process of learning, or making
assumptions about, the structure of the data being compressed. For example, a

very simple model of text is that there is no correlation between adjacent sym-
bols: that it is a stream of independent characters. Such a model is referred
to as a zero-order character-based model. A more sophisticated model might
assume that the data is a stream of English words that repeat in certain pre-
dictable ways; or that each of the preceding individual characters can be used
to bias (or condition) the probabilities assigned to the next character.
The second important operation is probability estimation, or statistics gath-
ering: the process of assigning a probability to each of the possible "next" sym-
bols in the input stream that is being compressed, given a particular model of
the data. For example, a very simple approach is to assert that all possible next
symbols are equi-probable. While attractive for its lack of complexity, such
an approach does not necessarily result in very good compression. A more
principled approach is to retain a historical count of the number of times each
possible symbol has appeared in each particular state of the model, and use the
ratio of a symbol's count to the total number of times a state has previously
occurred as an estimate of the symbol probability in that state.
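
To make the counting approach concrete, here is a minimal sketch in Python (ours, not code from the book); the function name and the use of a single zero-order state are our own simplifications.

```python
from collections import Counter

def zero_order_estimates(message):
    """Estimate a probability for each distinct symbol as its relative
    frequency: count / message length (a single zero-order state)."""
    counts = Counter(message)
    m = len(message)
    return {symbol: count / m for symbol, count in counts.items()}

# 's' appears 4 times in the 11 symbols of "mississippi",
# so its estimated probability is 4/11, about 0.36.
print(zero_order_estimates("mississippi"))
```
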
The third of the three principal operations is that of coding. Given a prob-
ability distribution for the symbols in a defined source alphabet, and a symbol
drawn from that alphabet, the coder communicates to the waiting decoder the
identifier corresponding to that symbol. The coder is required to make use of
a specified channel alphabet (normally, but not always, the binary values zero
and one), and to make as efficient use as possible of the channel capacity sub-
ject to whatever other constraints are enforced by the particular application.
For example, one very simple coding method is unary, in which the number
one is coded as "0", the number two as "10", the number three as "110", and so
on. However such a coding method makes no use of the probabilities that have
been estimated by the statistics component of the compression system, and,
presuming that the probabilities are being reliably estimated, a more compact
message will usually result if probabilities are taken into account.
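
As an illustration only (not taken from the book), the unary code just described can be expressed in a few lines of Python; the function names are ours, and error handling is omitted.

```python
def unary_encode(x):
    """Represent the integer x >= 1 as (x - 1) '1' bits followed by a '0':
    1 -> "0", 2 -> "10", 3 -> "110", and so on."""
    assert x >= 1
    return "1" * (x - 1) + "0"

def unary_decode(bits):
    """Recover a sequence of integers from a unary-coded bitstream."""
    symbols, run = [], 0
    for b in bits:
        if b == "1":
            run += 1
        else:            # each '0' terminates one codeword
            symbols.append(run + 1)
            run = 0
    return symbols

print(unary_encode(3))            # prints 110
print(unary_decode("0100110"))    # prints [1, 2, 1, 3]
```
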
Figure 1.1 shows this relationship between the modeling, statistics, and
coding modules. A sequence of source symbols is processed by the encoder,
and each in turn is represented as a sequence of bits and transmitted to the de-
coder. A probability distribution against which each symbol should be coded is
supplied by the statistics module, evaluated only after the model has indicated
a context that is likely to provide an accurate prediction. After each symbol
has been coded, the statistics module may update its probability estimates for
that context, and the model may update the structural information it retains,
possibly even introducing one or more new contexts. At the decoding end, the
stream of bits must be rendered back into a stream of symbol identifiers, and
exactly identical statistics and structural modifications carried out in order for
the decoder to faithfully reproduce, in parallel, the actions of the encoder.

[Figure: the encoder and the decoder each contain a model, a statistics module,
and a coder. In each, the model passes a context identifier to the statistics
module, which supplies symbol probabilities to the coder; structural modification
and probability modification paths feed back in the opposite direction. Source
symbols enter the encoder, the coder transmits an encoded bitstream
(10011010001...) to the decoder, and the decoded symbols emerge at the other end.]

Figure 1.1: Modeling, statistics, and coding modules.

To give a concrete example of the three-way division of responsibility il-


lustrated in Figure 1.1, consider classical "Huffman coding" as it is described
in most general-purpose algorithmics texts [Cormen et al., 2001, Gonnet and
Baeza-Yates, 1991, Sedgewick, 1990, Storer, 2002]. As we shall see below,
"Huffman coding" is more correctly the name for a particular coding technique
rather than the name of a compression system as it has been defined here, but
for the moment we shall persist with the misnomer.
Textbook descriptions usually suppose that the relative frequencies (hence
probabilities) of a set of symbols are known, and that the universe of messages
to be compressed is the set of strings emitted by a one-state Markov source
with these probabilities. In other words, a zero-order model is assumed with
(in the terms of Figure 1.1) neither structural nor probability modification. The
statistics are assumed to be either known a priori in some clairvoyant manner,
or to have been accumulated by a pre-scan of the sequence to be compressed,
often without a clear indication of how the decoder can possibly know the same
probabilities. Finally, these textbook presentations go on to apply Huffman's
algorithm [Huffman, 1952] to the set of probabilities to devise a minimum-
redundancy code, which is the coding part of the system.
Any or all of these three components can be replaced, resulting in a dif-
ferent system. If we seek better compression, the model can be extended to
a first-order one, in which the most recent symbol establishes a conditioning
context for the next. If we wish to avoid making two passes over the source
data, or need to eliminate the cost of pre-transmitting the symbol probabilities

as part of the compressed message, the statistics module might use fixed prob-
abilities gleaned from an off-line inspection of a large volume of representative
text. Or if, for some reason, variable-length codes cannot be used, then symbol
numbers can be transmitted in, for example, a flat binary code.
A further point to be noted in connection with Figure 1.1 is that in some cir-
cumstances the probability estimation component will sit more naturally with
the modeler, and in others will be naturally combined with the coder. Differ-
ent combinations of model and coder will result in different placements of the
statistics module, with the exact placement usually driven by implementation
concerns. Nevertheless, in a logical sense, the three components exist in some
form or another in all compression systems.

1.3 Terminology
The problem of coding is as follows. A source alphabet of n symbols

    S = [s_1, s_2, ..., s_n]

and a corresponding set of probability estimates

    P = [p_1, p_2, ..., p_n]

are given, where it is assumed that \sum_{i=1}^{n} p_i = 1. The coding module must
decide on a code, which is a representation for each symbol using strings over
a defined channel alphabet, usually {0, 1}. Also supplied to the coder is a
single index x, indicating the symbol s_x that is to be coded. Normally, s_x
will be a symbol drawn from a longer message, that is, s_x = M[j] for some
1 ≤ j ≤ m = |M|, but it is simpler at first to suppose that s_x is an isolated
symbol. Where there is no possible ambiguity we will also refer to "symbol
x", meaning symbol s_x, the xth symbol of the alphabet. The code for each
possible symbol s_i must be decided in advance of x being known, as otherwise
it is impossible for the decoder - which must eventually be able to recover the
corresponding symbol s_x - to make the same allocation of codewords.
Often the underlying probabilities, p_i, are not exactly known, and prob-
ability estimates are derived from the given message M. For example, in a
message of m symbols, if the ith symbol appears v_i times, then the relation-
ship p_i = v_i/m might be assumed. We call these the self-probabilities of M.
For most applications the alphabet of source symbols is the set of contigu-
ous integers 1, 2, ..., n, so that s_i = i. Any situations in which this assumption
is not valid will be noted as they are discussed. Similarly, in most situations it
may be assumed that the symbol ordering is such that p_1 ≥ p_2 ≥ ... ≥ p_{n-1} ≥ p_n
(or vice-versa, in non-decreasing order).

  s_i    p_i     Code 1    Code 2    Code 3
   1     0.67    000       00        0
   2     0.11    001       01        100
   3     0.07    010       100       101
   4     0.06    011       101       110
   5     0.05    100       110       1110
   6     0.04    101       111       1111
  Expected length 3.00      2.22      1.75

Table 1.1: Three simple prefix-free codes, and their expected cost in bits per symbol.

Applications in which the alphabet may not be assumed to be probability-sorted
will be noted as they arise.
Later in the book we shall see coding methods that do not explicitly assign a
discrete representation to each symbol, but in this introductory section we adopt
a slightly simplified approach. Suppose that some coder assigns the codewords
C = [c_1, c_2, ..., c_n] to the symbols of the alphabet, where each c_i is a string
over the channel (or output) alphabet. Then the expected codeword length for
the code, denoted E, is given by

    E(C, P) = \sum_{i=1}^{n} p_i \cdot |c_i| ,        (1.1)

where |c_i| is the cost of the ith codeword. The usual measure of cost is length -
how many symbols of the channel alphabet are required. But other definitions
are possible, and are considered in Section 7.3 on page 209. For some purposes
the exact codewords being used are immaterial, and of sole interest is the cost.
To this end we define |C| = [|c_1|, |c_2|, ..., |c_n|] as a notational convenience.
Consider, for example, the coding problem summarized in Table 1.1. In
this example n = 6, the source alphabet is denoted by S = [1, 2, 3, 4, 5, 6], the
corresponding probabilities p_i are listed in the second column, and the channel
alphabet is assumed to be {0, 1}. The third, fourth, and fifth columns of the
table list three possible assignments of codewords. Note how, in each of the
codes, no codeword is a prefix of any of the other codewords. Such codes are
known as prefix-free, and, as will be described in Chapter 2, this is a critically
important property. One can imagine, for example, the difficulties that would
occur in decoding the bitstream "001 ... " if one symbol had the codeword "00"
and another symbol the codeword "001".
The first code, in the column headed "Code 1", is a standard binary rep-
resentation using ⌈log_2 n⌉ = 3 bits for each of the codewords. In terms of
the notation described above, we would thus have |C| = [3, 3, 3, 3, 3, 3]. This

code is not complete, as there are prefixes (over the channel alphabet) that are
unused. In the example, none of the codewords start with "11", an omission
that implies that some conciseness is sacrificed by this code.
Code 2 is a complete code, formed from Code 1 by shortening some of
the codewords to ⌊log_2 n⌋ = 2 bits, while still retaining the prefix-free prop-
erty. By assigning the shorter codewords to the most frequent source symbols,
a substantial reduction in the expected codeword length E(C, P) from 3.00 to
2.22 bits per symbol is achieved. Furthermore, because the code is both prefix-
free and complete, every semi-infinite (that is, infinite to the right) string over
the channel alphabet can be unambiguously decoded. For example, the string
"011110001 ... " can only have been generated by the source symbol sequence
2,6,1,2, .... On the other hand, with Code 1, the string "011110001 ... " can-
not be decoded, even though Code 1 is prefix-free.
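
Because Code 2 is prefix-free and complete, decoding can proceed as a greedy left-to-right scan; the following sketch (ours, not the book's) reproduces the decoding of "011110001" quoted above.

```python
def decode_prefix_free(bits, codewords):
    """Greedily decode a bitstream with a prefix-free code; codeword i
    (counting from zero) represents source symbol i + 1."""
    table = {c: i + 1 for i, c in enumerate(codewords)}
    symbols, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in table:       # a complete codeword has been recognized
            symbols.append(table[buffer])
            buffer = ""
    return symbols

code2 = ["00", "01", "100", "101", "110", "111"]
print(decode_prefix_free("011110001", code2))    # prints [2, 6, 1, 2]
```
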
The third code further adjusts the lengths of the codewords, and reduces E,
the expected codeword length, to 1.75 bits per symbol. Code 3 is a minimum-
redundancy code (which are often known as Huffman codes, although, as will
be demonstrated in Chapter 4, they are not strictly the same), and for this prob-
ability distribution there is no allocation of discrete codewords over {0, 1} that
reduces the expected codeword length below 1.75 bits per symbol.
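
A small calculation (ours, not the book's) confirms the expected costs quoted in the last row of Table 1.1, by applying Equation 1.1 to each of the three codes with cost measured as codeword length in bits.

```python
P = [0.67, 0.11, 0.07, 0.06, 0.05, 0.04]
codes = {
    "Code 1": ["000", "001", "010", "011", "100", "101"],
    "Code 2": ["00", "01", "100", "101", "110", "111"],
    "Code 3": ["0", "100", "101", "110", "1110", "1111"],
}

def expected_length(P, C):
    """Equation 1.1, with |c_i| taken to be the codeword length in bits."""
    return sum(p * len(c) for p, c in zip(P, C))

for name, C in codes.items():
    print(name, round(expected_length(P, C), 2))
# Prints 3.0, 2.22, and 1.75 bits per symbol, matching the last row of Table 1.1.
```
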
So an obvious question is this: given the column labeled p_i, how can the
column labeled "Code 3" be computed? And when might the use of Code 2 or
Code 1 be preferred? For example, Code 2 has no codeword longer than three
bits. Is it the cheapest "no codeword longer than three bits" code that can be
devised? If these questions are in your head, then read on: they illustrate the
flavor of this book, and will be answered before you get to its end.
Finally in this section, note that there is another whole family of coding
methods that in effect use bit-fractional codes, and with such an arithmetic
coder it is possible to represent the alphabet and probability distribution of
Table 1.1 in an average of 1.65 bits per symbol, better than can be obtained if
each codeword must be of integral length. We consider arithmetic coding in
detail in Chapter 5.
There are many places in our daily lives where we also use codes of var-
ious types. Table 1.2 shows some examples. You may wish to add others
from your own experience. Note that there is no suggestion that these coding
regimes are "good", or unambiguously decodeable, or even result in compres-
sion - although it is worth noting that Morse code was certainly designed with
compression in mind [Bell et al., 1990]. Nevertheless, they illustrate the idea
of assigning a string over a defined channel alphabet to a concept expressed in
some source alphabet; the very essence of coding.

  Domain            Example            Meaning
  Phone number      +61-8-92663014     Andrew's office phone
  Car registration  OOK828             Alistair's car
  Product Code      9310310000067      Carton of milk
  Video Gcode       797                Channel 2, 7:00-7:30pm
  Flight number     QF101              Qantas flight from Melbourne to Los Angeles
  Credit card       4567555566667777   Neither Alistair's nor Andrew's Visa card
  Morse code        ... --- ...        Help!

Table 1.2: Examples of code usage from everyday life.

1.4 Related material


Chapter 8 examines a number of compression systems, including their mod-
eling components, but this book is more about coding than modeling. For
more detail of modeling methods, the reader is encouraged to consult alterna-
tive sources. Bell et al. [1990] (see also Bell et al. [1989]) describe in detail
the roles of modeling and coding, as well as giving examples of many mod-
ern compression systems; their work has been the standard academic reference
for more than a decade. The presentation here is designed to update, extend,
and complement their book. Nelson and Gailly [1995] provide implementa-
tion details of a number of compression methods; and Witten et al. [1999] de-
scribe the application of compression to the various components of a full-text
information system, including gray-scale, bi-level, and textual images. Storer
[1988] examines in detail systems that make use of dictionary-based models;
and Williams [1991a] examines predictive character-based models in depth.
There have also been a number of special journal issues covering text and image
compression, including: Information Processing and Management, November
1992 and again in November 1994; Proceedings of the IEEE, June 1994 and
November 2000; and The Computer Journal, 1997.
The dissertations of Tim Bell [1986b], Paul Howard [1993], Craig Nevill-
Manning [1996], Suzanne Bunton [1997a], Bill Teahan [1998], Jesper Larsson
[1999], Jan Aberg [1999], Kunihiko Sadakane [1999], and Tony Wirth [2000]
are further invaluable resources for those interested in modeling techniques.
Investigation of various aspects of coding can be found in the dissertations of
Artur Alves Pessoa [1999] and Eduardo Sany Laber [1999], and of the second
author [Turpin, 1998].

The books by Held [1983], Wei [1987], Anderson and Mohan [1991], Hoff-
man [1997], Sayood [2000], and Salomon [2000] are further useful counter-
points to the material on coding presented below, as is the survey article by
Lelewer and Hirschberg [1987].
The information-theoretic aspects of data compression have been studied
for even longer than its algorithmic facets, and the standard references for this
work are Shannon and Weaver [1949], Hamming [1986], and Gray [1990];
with another recent contribution coming from Golomb et al. [1994].
Finally, for an algorithmic treatment, the four texts already cited above
all provide some coverage of compression [Cormen et al., 2001, Gonnet and
Baeza-Yates, 1991, Sedgewick, 1990, Storer, 2002]; and Graham et al. [1989]
provide an excellent encyclopedia of mathematical techniques for discrete do-
mains, many of which are relevant to the design and analysis of compression
systems.

1.5 Analysis of algorithms


An important feature of this book is that it is not only a compendium of coding
techniques, but also describes the algorithms used to achieve those codes. The
field of analysis of algorithms is well-established in many other domains -
which of us, for example, is unaware of Donald Knuth's 1973 work on sorting
and searching algorithms - and in this book we also pay particular attention to
the computational efficiency of the algorithms described. This section provides
an introduction to the tools and techniques used in the design and analysis of
algorithms, and introduces a number of mathematical identities of particular
use in the analysis of source coding methods. Readers already familiar with
algorithm analysis techniques, and willing to return to this chapter when the
subsequent use of mathematical identities requires them to do so, may skip the
next few pages and move directly to Chapter 2. Readers settling in for the long
haul may wish to read this section now.
In comparing two different methods for solving some problem we are in-
terested both in the asymptotic complexity of the two methods in question and
their empirical behavior. These are usually correlated (but not always), and
a method that is efficient in theory is often efficient in practice. To describe
asymptotic efficiency it is usual to use the so-called "big Oh" notation. A func-
tion f(n) over the positive integers is said to be O(g(n)) if constants k and n_0
exist such that f(n) ≤ k · g(n) whenever n ≥ n_0. That is, except for a finite
number of initial values, the function f grows, to within a constant factor, no
faster than the function g. For example, the function h(n) = n log_2 n - n + 1
is O(n log n). Strictly speaking, O(g(n)) is a set of functions and f is a mem-
ber of the set, so that h(n) ∈ O(n^2) too. It is, however, usual to give as an
upper bound a "minimal" function that satisfies the definition, and O(n log n)
is regarded as a much more accurate description of h than is O(n^2). Note that
the use of the constant function g(n) = 1 is perfectly reasonable, and if f(n)
is described as being O(1) then in the limit f is bounded above by a constant.
It is also necessary sometimes to reason about lower bounds, and to assert
that some function grows at least as quickly as some other function. Function
f(n) is Ω(g(n)) if g(n) is O(f(n)). Equality of functional growth rate is
expressed similarly - function f(n) is Θ(g(n)) if f(n) is O(g(n)) and g(n) is
O(f(n)). Note, however, that it is conventional where there is no possibility
of confusion for O to be used instead of Θ - if an algorithm is described as
being O(n log n) without further qualification it may usually be assumed that
the time taken by the algorithm is Θ(n log n).
The final functional comparator that it is convenient to make use of is a
"strictly less than" relationship: f(n) is o(g(n)) if f(n) is O(g(n)) but g(n) is
not O(f(n)). For example the function h(n) = n + n / log n can be described
as being "n + o(n)", meaning in the case of this example that the constant
factor on the dominant term is known, and the next most important term is
strictly sublinear. Similarly, a function that is o(1) has zero as a limiting value
as n gets large. Note that for this final definition to make sense we presume
both f and g to be monotonic and thus well-behaved.
Knowledge of the asymptotic growth rate of the running time of some al-
gorithm is a requirement if the algorithm is to be claimed to be "useful", and
algorithmic descriptions that omit an analysis should usually be considered to
be incomplete. To see the dramatic effect that asymptotic running time can
have upon the usefulness of an algorithm consider, for example, two mecha-
nisms for sorting - Selectionsort and Mergesort [Knuth, 1973]. Selectionsort
is an intuitively attractive algorithm, and is easy to code. Probably all of us
have made use of a selection-like sorting process as "human computers" from
a relatively early age: it seems very natural to isolate the smallest item in the
list, and then the second smallest in the remainder, and so on. But Selection-
sort is not an asymptotically efficient method. It sorts a list of n objects in
O(n^2) time, assuming that objects can be compared and exchanged in O(1)
time. Mergesort is somewhat harder to implement, and unless a rather complex
mechanism is employed, has the disadvantage of requiring O(n) extra work
space. Nor is it especially intuitive. Nevertheless, it operates in time that is
O(n log n). Now suppose that both Selectionsort and Mergesort require 1 sec-
ond to sort a list of 1,000 objects. From such a basis the two asymptotic growth
rates can be used to estimate the time taken to sort a list of (say) 1,000,000 ob-
jects. Since the number of objects increases by a factor of 1,000, the time taken
by the Selectionsort increases by a factor of 1,000 squared, which is 1,000,000.
That is, the estimated time for the Selectionsort will be 1 × 10^6 seconds, about

11 days. On the other hand, the time of the Mergesort will increase by a factor
of about 2,000, and the sort will complete in 35 minutes or so. The asymptotic
time requirement of an algorithm has a very large impact upon its usability - an
impact for which no amount of new and expensive hardware can possibly com-
pensate. Brute force does have its place in the world, but only when ingenuity
has been tried and been unsuccessful.
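
That extrapolation is easy to mechanize; the sketch below (ours, with illustrative names) scales a one-second measurement at n = 1,000 by the growth functions n^2 and n log n, and reproduces the days-versus-minutes contrast just described.

```python
import math

def scaled_time(baseline_seconds, n_base, n_new, growth):
    """Estimate the running time at n_new from a measured time at n_base,
    assuming the cost grows in proportion to the function 'growth'."""
    return baseline_seconds * growth(n_new) / growth(n_base)

n_base, n_new = 1_000, 1_000_000      # both sorts take 1 second at n = 1,000

selection = scaled_time(1.0, n_base, n_new, lambda n: n * n)
merge = scaled_time(1.0, n_base, n_new, lambda n: n * math.log2(n))

print(f"Selectionsort: about {selection / 86400:.1f} days")   # about 11.6 days
print(f"Mergesort: about {merge / 60:.0f} minutes")           # about 33 minutes,
                                                              # close to the figure quoted above
```
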
Another important consideration is the memory space required by some
methods. If two alternative mechanisms for solving some problem both take
O(n) time, but one requires 5n words of storage to perform its calculations
and the other takes n words, then it is likely that the second method is more
desirable. As shall be seen in the body of this book, such a scenario can occur,
and efficient use of memory resources can be just as important a consideration
as execution-time analysis. A program can often be allowed to run for 10%
more time than we would ideally desire, and a result still obtained. But if
it requires 10% more memory than the machine being used has available, it
might be simply impossible to get the desired answers.
To actually perform an analysis of some algorithm, an underlying machine
model must be assumed. That is, the set of allowable operations - and the time
cost of each - must be defined. The cost of storing data must also be specified.
For example, in some applications it may be appropriate to measure storage
by the bit, as it makes no sense to just count words. Indeed, in some ways
compression is such an application, for it is pointless to ask how many words
are required to represent a message if each word can store an arbitrary inte-
ger. On the other hand, when discussing the memory cost of the algorithm that
generates the code, it is appropriate for the most part to assume that each word
of memory can store any integer as large as is necessary to execute the algo-
rithm. In most cases this requirement means that the largest value manipulated
is the sum of the source frequencies. That is, if a code is being designed for a
set of n integer symbol frequencies v_i, it is assumed that quantities as large as
U = \sum_{i=1}^{n} v_i can be stored in a single machine word.
It will also be supposed throughout the analyses in this book that compar-
ison and addition operations on values in the range 1 ... U can be effected in
O(1) time per operation; and similarly that the ith element in an array of as
many as n values can be accessed and updated in O(1) time. Such a machine
model is known in algorithms literature as a random access machine. We also
restrict our attention to sequential computations. There have been a large num-
ber of parallel machine models described in the research literature, but none
are as ubiquitous as the single processor RAM machine model.
An analysis must also specify whether it is the worst case that is being con-
sidered, or the average case, where the average is taken over some plausible
probability distribution, or according to some reasonable randomness assump-
tion for the input data. Worst case analyses are the stronger of the two, but
in some cases the average behavior of an algorithm is considerably better than
its worst case behavior, and the assumptions upon which that good behavior is
predicated might be perfectly reasonable (for example, Quicksort).
Finally in this introductory chapter we introduce a small number of stan-
dard mathematical results that are used in the remainder of the book.
For various reasons it is necessary to work with factorials, and an expansion
due to James Stirling is useful [Graham et al., 1989, page 112]:

    n! \approx \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n ,

where n! = n × (n − 1) × (n − 2) × ... × 2 × 1. Taking logs, Stirling's approx-
imation means that

    \log_2 n! \approx n \log_2 n - n \log_2 e + (\log_2 2 \pi n)/2 .        (1.2)

This latter expression means that another useful approximation can be derived:

    \log_2 n^n \approx \log_2 n! + n \log_2 e - (\log_2 2 \pi n)/2 .        (1.3)

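To see how accurate Equation 1.2 is, the following check (ours, not from the book) compares the exact value of log_2 n! with Stirling's estimate for a few values of n.

```python
import math

def log2_factorial_exact(n):
    """Compute log2(n!) directly, as a sum of logarithms."""
    return sum(math.log2(k) for k in range(1, n + 1))

def log2_factorial_stirling(n):
    """Right-hand side of Equation 1.2."""
    return (n * math.log2(n) - n * math.log2(math.e)
            + math.log2(2 * math.pi * n) / 2)

for n in (10, 100, 1000):
    print(n, round(log2_factorial_exact(n), 3),
          round(log2_factorial_stirling(n), 3))
# Even at n = 10 the two values differ by little more than 0.01 bits.
```
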
Also in this vein, the number of different combinations of n_1 objects of one
type and n_2 objects of a second type is given by

    \binom{n_1 + n_2}{n_1} = \frac{(n_1 + n_2)!}{n_1! \, n_2!} .

Using Equation 1.2, it then follows that

    \log_2 \binom{n_1 + n_2}{n_1} \approx (n_1 + n_2) \log_2 (n_1 + n_2)
        - n_1 \log_2 n_1 - n_2 \log_2 n_2 .        (1.4)

When n_1 ≪ n_2 (n_1 is much smaller than n_2) Equation 1.4 can be further
simplified to

    \log_2 \binom{n_1 + n_2}{n_1} \approx n_1 \log_2 \frac{n_2}{n_1} + n_1 \log_2 e .        (1.5)

The Fibonacci series is also of use in the analysis of some coding algo-
rithms. It is defined by the basis F(1) = 1, F(2) = 1, and thereafter by the
recurrence F(n + 2) = F(n + 1) + F(n). The first few terms from n = 1 are
1, 1, 2, 3, 5, 8, 13, 21, 34. The Fibonacci numbers have a fascinating relation-
ship with the "golden ratio" \phi defined by the quadratic equation

    \phi^2 = \phi + 1 .

That is, \phi is the root of x^2 - x - 1 = 0, and so is given by

    \phi = \frac{1 + \sqrt{5}}{2} \approx 1.618 .

The ratio between successive terms in the Fibonacci sequence approaches \phi in
the limit, and a closed form for F(n) is

    F(n) = \left\lfloor \frac{\phi^n}{\sqrt{5}} + \frac{1}{2} \right\rfloor .

A closely related function is defined by F'(1) = 2, F'(2) = 3, and thereafter
by F'(n + 2) = F'(n + 1) + F'(n) + 1. The first few terms from n = 1 of this
faster-growing sequence are 2, 3, 6, 10, 17, 28, 46, 75. The revised function is,
however, still closely related to the golden ratio, and it can be shown that

    F'(n) = F(n + 2) + F(n) - 1 ,

and, since F(n + 2) \approx \phi^2 F(n) when n is large,

    F'(n) \approx (\phi^2 + 1) F(n) \approx (\phi^2 + 1) \frac{\phi^n}{\sqrt{5}} = \phi^{n+1} ,

with the final equality the result of one of the many identities involving \phi, in
this case that (\phi^2 + 1)/\sqrt{5} = \phi.
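
These recurrences and identities are easy to verify numerically; the short script below (ours, not from the book) checks the closed form for F(n) and the relationship between F'(n) and F(n) over the first few terms.

```python
import math

phi = (1 + math.sqrt(5)) / 2

def fib(n):
    """F(1) = F(2) = 1, and F(n + 2) = F(n + 1) + F(n)."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

def fib_dash(n):
    """F'(1) = 2, F'(2) = 3, and F'(n + 2) = F'(n + 1) + F'(n) + 1."""
    a, b = 2, 3
    for _ in range(n - 1):
        a, b = b, a + b + 1
    return a

for n in range(1, 15):
    assert fib(n) == math.floor(phi ** n / math.sqrt(5) + 0.5)   # closed form
    assert fib_dash(n) == fib(n + 2) + fib(n) - 1                # identity above

print([fib(n) for n in range(1, 10)])        # 1, 1, 2, 3, 5, 8, 13, 21, 34
print([fib_dash(n) for n in range(1, 9)])    # 2, 3, 6, 10, 17, 28, 46, 75
```
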
Sorting was used as an example earlier in this section, and many of the code
construction methods discussed in this book assume that the input probability
list is sorted. There are several sorting algorithms that operate in O(n log n)
time in either the average case or the worst case. Mergesort was mentioned
as being one method that operates in O(n log n) time. Heapsort also oper-
ates in the same time bound, and has the added advantage of not requiring
O(n) extra space. Heapsort is also a useful illustration of the use of the pri-
ority queue data structure after which it is named. Finally amongst sorting
algorithms, Hoare's Quicksort [Hoare, 1961, 1962] can be implemented to op-
erate extremely quickly on average [Bentley and McIlroy, 1993] and, while the
O(n log n) analysis is only for the average case, it is relatively robust. Much
of the advantage of Quicksort compared to Heapsort is a result of the largely
sequential operation. On modern cache-based architectures, sequential rather
than random access of items in the array being sorted will automatically bring a
performance gain. Descriptions of all of these sorting algorithms can be found
in, for example, the text of Cormen et al. [2001].
Chapter 2

Fundamental Limits

The previous chapter introduced the coding problem: that of assigning some
codewords or bit-patterns C to a set of n symbols that have a probability dis-
tribution given by P = [p_1, ..., p_n]. This chapter explores some lines in the
sand which cannot be crossed when designing codes. The first is a lower bound
on the expected length of a code: Shannon's entropy limit. The second restric-
tion applies to the lengths of codewords, and is generally referred to as the
Kraft inequality. Both of these limits serve to keep us honest when devising
new coding schemes. Both limits also provide clues on how to construct codes
that come close to reaching them. We can also obtain experimental bounds on
compressibility by using human models and experience, and this area is briefly
considered in Section 2.3. The final section of this chapter then shows the
application of these limits to some simple compression systems.

2.1 Information content


The aim of coding is to achieve the best compression possible; to minimize
the cost of storing or transmitting some message. If we step back from prob-
abilities and bit-patterns, this amounts to removing all spurious data from a
message and leaving only the core information. We seek to transmit exactly
the components required by the decoder to faithfully reconstruct the message,
and nothing more. So how much crucial information is there in a message?
Of course the answer depends on the message and the recipient. Having your
name and address appear in a list is perfectly plausible if the list represents
the electoral roll for your municipality, as everyone's name and address will
appear; but if the list represents the houses in your municipality that are sched-
uled for demolition, or are to pay additional taxes, you might be rather more
taken aback.


From this example, it seems that the quantity of information is somehow


linked to the amount of surprise a message elicits. An informative message
causes amazement, while a message with low information content is relatively
unsurprising, in the same way that a weather report of a 38° day (100° Fahren-
heit) is rather more surprising in Antarctica than it is in Australia. And, in the
limit, if you are certain of the content of a message then it contains no infor-
mation at all: the weather report "In Perth it did not snow today" is essentially
devoid of any information, as it never snows in Perth. In a coding context,
this latter example amounts to coding a symbol SI with probability PI = 1, in
which case the decoder already knows the resulting message, and nothing need
be stored or transmitted.
Given the probability of an event or symbol, therefore, it should be possi-
ble to produce a measure of the information represented by that event. Claude
Shannon [1948] drew on existing observations that a measure of information
should be logarithmic in nature, and defined the amount of information con-
tained in a symbol Si of probability Pi to be

I(s_i) = -\log_2 p_i .    (2.1)

That is, the amount of information conveyed by symbol Si is the negative loga-
rithm of its probability. The multiplication by minus one means that the smaller
the probability of a symbol and the greater the surprise when it occurs, the
greater the amount of information conveyed. Shannon's original definition did
not specify that the base of the logarithm should be two, but if the base is
two, as he observed, then I(si) is a quantity in bits, which is very useful when
discussing coding problems over the binary channel alphabet. For example,
referring back to Table 1.1 on page 7, symbol s1 has probability 0.67 and in-
formation content of approximately 0.58 bits, and symbol S6, with P6 = 0.04,
has I(s6) = 4.64 bits.
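In code, this calculation is a one-liner; the following C sketch simply reproduces the two values just quoted.

/* Illustrative sketch: Shannon's information content I(s) = -log2(p). */
#include <stdio.h>
#include <math.h>

static double info_bits(double p) { return -log2(p); }

int main(void) {
    printf("p = 0.67 -> I = %.2f bits\n", info_bits(0.67));   /* about 0.58 */
    printf("p = 0.04 -> I = %.2f bits\n", info_bits(0.04));   /* about 4.64 */
    return 0;
}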
This definition of information has a number of nice properties. If a symbol
is certain to occur then it conveys no information: pi = 1, and I(si) = 0. As
the probability of a symbol decreases, its information content increases; the
logarithm is a continuous, monotonic function. In the limit, when the probabil-
ity of a symbol or event is zero, if that event does occur, we are rightly entitled
to express an infinite amount of surprise. ("Snow in Perth", the newspaper
headlines would blare across the world.) Another consequence of Shannon's
definition is that when a sequence of independent symbols occurs, the informa-
tion content of the sequence is the sum of the individual information contents.
For example, if the sequence SiSj occurs with probability PiPj, it has informa-
tion content I(sisj) = I(si) + I(sj). Shannon [1948] details several more
such properties.
Given that I(si) is a measure of the information content of a single symbol

in bits, and the decoder need only know the information in a symbol in order to
reproduce that symbol, a code should be able to be devised such that the code-
word for si contains I(si) bits. Of course, we could make this claim for any
definition of I(si), even if it did not share the nice properties above. However,
Shannon's "Fundamental Theorem of a Noiseless Channel" [Shannon, 1948],
elevates I(si) from a convenient function to a fundamental limit.
Consider the expected codeword length of a code C derived from proba-
bility distribution P, where each symbol has (somehow!) a codeword of length
I(si). Let H(P) be the expected cost of such a code:
H(P) = -\sum_{i=1}^{n} p_i \log_2 p_i .    (2.2)

Shannon dubbed quantity H(P) the entropy of the probability distribution, a


term used for H(P) in statistical mechanics. He went on to prove that it is
not possible to devise a code that has lower expected cost than H(P). That is,
given a probability distribution P, all unambiguous codes C must obey

H(P) \le E(C, P).


A simple inductive proof is given by Bell et al. [1990, page 47].
In Table 1.1 on page 7 an example probability distribution P was intro-
duced, and three possible codes given. The calculation -(0.67 × log2 0.67 +
· · · + 0.04 × log2 0.04) reveals that the entropy H(P) of the probability distri-
bution P is 1.65 bits per symbol. Relative to this lower limit, the three codes in
Table 1.1 are respectively 82%, 35%, and 6% inefficient. Our goal is to design
codes that are zero-redundancy and have H(P) = E(C, P) - provided that
they can be computed within reasonable resource limits.
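Such entropy and redundancy figures are straightforward to compute. The C sketch below gives the two calculations in their general form; the probability and codeword-length arrays shown are placeholders rather than the actual entries of Table 1.1.

/* Illustrative sketch: entropy H(P) of Equation 2.2, and the expected codeword
   length E(C,P) of a code given as a list of codeword lengths. */
#include <stdio.h>
#include <math.h>

static double entropy(const double p[], int n) {
    double h = 0.0;
    for (int i = 0; i < n; i++)
        if (p[i] > 0.0) h -= p[i] * log2(p[i]);
    return h;
}

static double expected_cost(const double p[], const int len[], int n) {
    double e = 0.0;
    for (int i = 0; i < n; i++) e += p[i] * len[i];
    return e;
}

int main(void) {
    double p[] = {0.5, 0.25, 0.125, 0.125};     /* hypothetical distribution P */
    int len[]  = {1, 2, 3, 3};                  /* hypothetical lengths |c_i|  */
    printf("H(P) = %.3f bits, E(C,P) = %.3f bits\n",
           entropy(p, 4), expected_cost(p, len, 4));
    return 0;
}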
It would seem that with one stroke of his pen, Shannon solved the coding
problem. An optimal, or zero-redundancy, code is one that has |ci| = I(si).
Alert readers will have noticed, however, that I(si) is not always a whole num-
ber. It is not immediately obvious how a symbol si of probability pi = 0.3
can be represented by a codeword of I(si) = 1.74 bits; and even if, based
upon some unspecified advice, we decide that Si is to be represented in 2 bits,
a mechanism for calculating codewords is still required. Shannon provided
a bound below which we cannot go; this book, on the other hand, describes
mechanisms that let us approach that bound from above.

2.2 Kraft inequality


If we seek to minimize the expected cost of a code (Equation 1.1 on page 7), the
obvious question is this: how short can codewords be? The flippant answer is

that all codewords can be one bit long (or indeed, zero bits); but this is not very
useful in a practical sense, as symbols must be disambiguated during decoding.
A more pertinent question is: how short can codewords be so that the code is
uniquely decipherable?
If each symbol si has a probability that is a negative power of two, say
pi = 2^{-ki}, then I(si) = ki is a whole number. So setting each codeword to
a string of |ci| = ki bits results in a code whose expected codeword length
equals Shannon's bound and thus cannot be improved. This observation was
considered by L. G. "Jake" Kraft [1949], who noted that in such a situation

\sum_{i=1}^{n} 2^{-k_i} \le 1 ,

and that it is indeed possible to assign a code C = [c1, c2, ..., cn] in which
|ci| = ki, and in which no codeword is a prefix of any other codeword - that
is, the code can be prefix-free. Once such a code has been calculated, a mes-
sage composed of codewords from C can be decoded from left to right one bit
at a time without ambiguity. This relationship can be inverted, and is then a
requirement for all prefix-free codes: if the quantity
K(C) = \sum_{i=1}^{n} 2^{-|c_i|}    (2.3)

is greater than 1, then the code cannot be prefix-free. As an obvious example,


suppose that n codewords ci are all of length |ci| = 1. Then K(C) = n/2, and
a prefix-free code is only possible when n ≤ 2.
McMillan [1956] extended this result and showed that if K(C) ≤ 1 for
some code C, then there always exists another code C' for which E(C', P) =
E(C, P), with codewords such that |C'| = |C| (recall that we define |C| =
[|c1|, ..., |cn|]), and such that the codewords in C' are prefix-free. That is,
if we find a code C that complies with the Kraft inequality, we can morph it
into a same-cost code C' that is prefix-free and thus uniquely decodeable in a
left-to-right manner. A variant of McMillan's proof is given by Karush [1961].
Looking again at Table 1.1 on page 7, the Kraft sums K (C) for the three
example codes are 0.75, 1.00, and 1.00 respectively. It was remarked at the
time that Code 1 in that table was not complete, in that not every semi-infinite
string could be decoded. The Kraft inequality has now provided a definition of
this property: a set of codewords C is complete if and only if it is prefix-free
and its Kraft sum K(C) equals one.
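Checking a proposed set of codeword lengths against the Kraft inequality is equally mechanical; the following C sketch evaluates K(C) from a list of lengths, which here are placeholders rather than the lengths of the codes in Table 1.1.

/* Illustrative sketch: the Kraft sum K(C) of Equation 2.3. K(C) <= 1 is
   necessary for a prefix-free code, and K(C) = 1 means the code is complete. */
#include <stdio.h>
#include <math.h>

static double kraft_sum(const int len[], int n) {
    double k = 0.0;
    for (int i = 0; i < n; i++) k += pow(2.0, -len[i]);
    return k;
}

int main(void) {
    int len[] = {1, 2, 3, 3};               /* hypothetical codeword lengths */
    double k = kraft_sum(len, 4);
    printf("K(C) = %.2f -> %s\n", k,
           k > 1.0 ? "cannot be prefix-free" :
           k == 1.0 ? "complete" : "prefix-free possible, but not complete");
    return 0;
}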
We can now reformulate our desire for good codes as being a quest for
a set of codewords C = [c1, c2, ..., cn] derived from a set of probabilities
P = [p1, p2, ..., pn], such that K(C) ≤ 1, and the expected cost given by

Equation 1.1 on page 7 is minimized. When working with a binary channel


alphabet the latter constraint means that the code will be complete, that is, that
K (C) = 1. Moreover, the work of McMillan relieves us of some of the pres-
sure to find actual codewords, and instead makes it sufficient to compute code-
word lengths. The process whereby a set of codeword lengths can be converted
into a prefix-free code is considered in detail in Section 4.3 on page 57.

2.3 Human compression


By now we hope that the sceptical reader has already asked themself the im-
portant question: "Well, yes, ok, I can believe all of that, but where do those
probabilities come from?"
For the most part - and certainly in all of this book except for this section
- they are derived from the message that is to be compressed, or from some
universe of messages of which the particular message in question is a member.
But there is another way in which probabilities can be derived, and that is based
upon human experience and knowledge. For example, consider the text:
"If you don't hurry up, you're going to be ... "
What comes next? Most people suggest words like "late" and "left" [behind],
in essence, giving them much higher probabilities than, for example, words
such as "cold", "hungry", and "sick". But each of these three low-probability
words fits neatly into one of the next three sentences:
"If you don't put on a jacket, you're going to be ... "
"If you don't start eating, you're going to be ... "
"If you don't stop eating, you're going to be ... "
These examples demonstrate the incredible complexity of the mental model of
English that we have built since we were young. We can immediately match
the three words with the three sentences, and be remarkably confident in our
predictions. Somehow the word "jacket" means that five words later we are
more receptive to "cold"; while "eating", "hungry", and "sick" somehow rein-
force each other over exactly the same four intervening words, but only if we
take into account "start" and "stop" appropriately.
It seems a forlorn hope that we might ever build a compression system
based upon human estimates of likelihood. We would find it extremely dif-
ficult to assign actual probabilities that summed to one. Worse, it would be
impossible to then assign the same probabilities when we wanted to decode
the compressed message, and equally ridiculous to expect that anyone else in
the entire world has the same set of probabilities - derived from background
information and accumulated experience - as we did at the time of encoding.

Nevertheless, we can guess what comes next, and either form a "nope, try
again" sequence of answers that indicates a ranking of the alternatives even if
not their probabilities, or hypothesize nominal wagers on whether or not we
are right. Implicit probabilities can then be estimated, and an approximation of
the underlying information content of text calculated. Using human subjects,
researchers have done exactly this experiment. Shannon [1951] and Cover and
King [1978] undertook seminal work in this area; these, and a number of related
investigations, are summarized by Bell et al. [1990, page 93].
Assuming that English prose is composed of 26 case-less alphabetic letters,
plus "space" as a single delimiting character, the outcome of these investiga-
tions is that the information content of text is about 1.3 bits per letter. How
close to this limit actual compression systems can get is, of course, the great
unknown. As we shall see in Chapter 8, there is still a gap between the perfor-
mance of the best mechanical systems and the performance attributed to human
modelers, and although the gap continues to close as modeling techniques be-
come more sophisticated, there remains a considerable way to go.

2.4 Mechanical compression systems


In the remainder of this chapter we examine how the fundamental limits de-
scribed above interact in several example compression systems. In order to
allow comparisons, the following verse from William Blake's Milton [Blake,
1978] is used as a message to be represented:

Bring me my bow of burning gold!


Bring me my arrows of desire!
Bring me my spear! O clouds unfold!
Bring me my chariot of fire!

Recall that a compression system (Figure 1.1 on page 5) consists of three com-
ponents - modeling, probability estimation, and coding. Armed with our def-
inition of entropy, and Shannon's fundamental theorem stating that given a
model of data we cannot devise a code that has an expected length better than
entropy, we can explore the effect of different models of the verse on the best
compression obtainable with that model.
A very simple model of the verse is to assume that a symbol is a single
character, and a correspondingly simple way of estimating symbol probabilities
is to assert that all possible characters of the complete character set are equally
likely.
On most computing platforms, the set of possible characters is defined by
the American Standard Code for Information Interchange (ASCII), which was

introduced in 1968 by the United States of America Standards Institute. The in-
ternational counterpart of ASCII is known as ISO 646. ASCII contains 128 al-
phabetic, numeric, punctuation, and control characters, although on most com-
puting platforms an extension to ASCII, formally known as ISO 8859-1, the
"Latin Alphabet No. I", is employed. The extension allows for 128 charac-
ters that are not typically part of the English language, and, according to the
Linux manual pages, "provides support for Afrikaans, Basque, Catalan, Dan-
ish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic,
Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, and Swedish". The
ISO 8859-1 characters are also the first 256 characters of ISO 10646, better
known as Unicode, the coding scheme supported by the Java programming
language. It is convenient and commonplace to refer to the ISO 8859-1 exten-
sion as ASCII, and that is the approach we adopt throughout the remainder of
this book. Hence there are 256 possible characters that can occur in a text.
Returning to the simple compression system, in which each of the 256 char-
acters is a symbol and all are equally likely, we have pi = 1/256 = 0.003906
for all 1 ≤ i ≤ 256. Equation 2.2 indicates that the entropy of this probability
distribution is

H(P) = -\sum_{i=1}^{256} \frac{1}{256} \log_2 \frac{1}{256} = 8.00

bits per symbol, a completely unsurprising result, even though the high entropy
value indicates each symbol is somewhat surprising. This is an example of a
static system, in which the probability estimates are independent of the actual
message to be compressed. The advantage of a static probability estimator is
that both the encoder and decoder "know" the attributes being employed, and
it is not necessary to include any probability information in the transmitted
message.
A slightly more sophisticated compression system is one which uses the
same character-based model, but estimates the probabilities by restricting the
alphabet to those characters that actually occur in the data. Blake's verse has
25 unique characters, including the newline character that marks the end of
each line. Column one of Table 2.1 shows the unique characters in the verse.
Assuming that each of the 25 symbols is equally likely, we now have pi =
1/25 = 0.04 for all 1 ≤ i ≤ 25, and a model entropy of

H(P) = -\sum_{i=1}^{25} \frac{1}{25} \log_2 \frac{1}{25} = 4.64

bits per symbol, almost half the entropy of the static model.
While it is tempting to claim that by altering the probability estimation
technique a 3.36 bits per character decrease in the space required to store the
message has been attained (assuming, of course, that we can devise a code that
actually requires 4.64 bits per symbol), there is a problem to be dealt with. Un-
like the first compression scheme, this one alters its set of symbols depending
on the input data, meaning that the set of symbols in use in each message must
be declared to the decoder before decompression can be commenced. This
compression system makes use of a semi-static estimator. It examines the mes-
sage in a preliminary pass to derive the symbol set, and includes a description
of those symbols in a prelude to the compressed data. The decoder reads the
prelude, re-creates the code, and only then commences decoding. That is, we
must include the cost of describing the code to be used for this particular mes-
sage. For example, we might use the first compression system to represent the
unique symbols. The prelude also needs to include as its first data item a count
of the number of distinct symbols, which can be at most 256. All up, for the ex-
ample message the prelude consumes 8 bits for the count of the alphabet size,
plus 25 x 8 bits per symbol for the symbol descriptions, a total of 208 bits. If we
spread this cost over all 128 symbols in the message, the total expected code-
word length using this model is 4.64 + 208/128 = 6.27 bits per symbol, rather
more than we first thought. Note that we now must concern ourselves with the
representation of the prelude as well as the representation of the message - our
suggested approach of using eight-bit codes to describe the subalphabet being
used may not be terribly effective.
A third compression system that intuitively should lead to a decrease in the
number of bits per symbol required to store the message is the same character-
based model, but now coupled with a semi-static estimator based upon the
self-probabilities of the characters in the message. That is, if symbol Si occurs
Vi times, and there are a total of m symbols in the message, we take Pi =
vi/m. Column two of Table 2.1 shows the frequency of occurrence of each
character in the verse, and column five the corresponding self-probabilities of
the symbols. Calculating the entropy of the resultant probability distribution
gives 4.22 bits per symbol as a lower bound. Similarly, the quantity
-\sum_{i=1}^{n} v_i \log_2 \frac{v_i}{m}    (2.4)

is the zero-order self-information of the message, and represents a lower limit


on the number of bits required to represent the message if the symbols are
independent.
Unfortunately, in this third compression system we again have to attach
a prelude to the compressed data. Moreover, this prelude must contain not

                      Probability estimate                Corresponding code
si         vi     Static    Semi-       Semi-        ASCII       Binary    MR
                            static I    static II
newline     4     1/256     1/25         4/128       00001010    00000     10100
space      22     1/256     1/25        22/128       00100000    00001     000
!           5     1/256     1/25         5/128       00100001    00010     10101
B           4     1/256     1/25         4/128       01000010    00011     10110
O           1     1/256     1/25         1/128       01001111    00100     1111100
a           3     1/256     1/25         3/128       01100001    00101     111010
b           2     1/256     1/25         2/128       01100010    00110     111011
c           2     1/256     1/25         2/128       01100011    00111     111100
d           4     1/256     1/25         4/128       01100100    01000     10111
e           8     1/256     1/25         8/128       01100101    01001     0100
f           5     1/256     1/25         5/128       01100110    01010     11000
g           6     1/256     1/25         6/128       01100111    01011     0101
h           1     1/256     1/25         1/128       01101000    01100     1111101
i           8     1/256     1/25         8/128       01101001    01101     0110
l           3     1/256     1/25         3/128       01101100    01110     11001
m           8     1/256     1/25         8/128       01101101    01111     0111
n           7     1/256     1/25         7/128       01101110    10000     1000
o           9     1/256     1/25         9/128       01101111    10001     1001
p           1     1/256     1/25         1/128       01110000    10010     1111110
r          11     1/256     1/25        11/128       01110010    10011     001
s           4     1/256     1/25         4/128       01110011    10100     11010
t           1     1/256     1/25         1/128       01110100    10101     1111111
u           3     1/256     1/25         3/128       01110101    10110     11011
w           2     1/256     1/25         2/128       01110111    10111     111101
y           4     1/256     1/25         4/128       01111001    11000     11100
H(P)              8.00      4.64         4.22
E(C,P)            8.00      5.00         4.26
K(C)              1.00      0.78         1.00

Table 2.1: A character-based model and three different probability estimation tech-
niques for the verse from Blake's Milton. In the columns marked "Static" and "ASCII"
the entropy, average codeword length, and Kraft sum are calculated over the full
n = 256 characters in the ASCII character set. In the other columns they are cal-
culated over the n = 25 characters that appear in the message.

only a list of symbols, which we calculated we can do in 208 bits using the
first compression system, but also some indication of the probability of those
symbols. If we allow 4 bits per unique symbol to convey its frequency (quite
possibly an underestimate, but it will suffice for now), then a further 25 x
4 = 100 bits are required in the prelude. The total expected codeword length,
assuming that codes can be devised to meet the entropy bound involved, is now
(208 + 100)/128 + 4.22 = 6.63 bits per symbol; worse than the previous
simpler compression system.
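The bookkeeping in the last two paragraphs can be captured in a few lines; the C sketch below repeats the prelude arithmetic for the two semi-static schemes, using the bit counts assumed in the text.

/* Illustrative sketch: prelude plus message cost, in bits per symbol, for the
   two semi-static schemes applied to the 128-symbol verse. */
#include <stdio.h>

int main(void) {
    double m = 128.0;                        /* symbols in the message */
    double symbols_prelude = 8 + 25 * 8;     /* alphabet count plus 25 symbols */
    double freqs_prelude   = 25 * 4;         /* 4 bits per symbol frequency */

    /* Semi-static I: equally likely symbols, entropy 4.64 bits per symbol. */
    printf("semi-static I:  %.2f bits per symbol\n", 4.64 + symbols_prelude / m);
    /* Semi-static II: self-probabilities, entropy 4.22 bits per symbol. */
    printf("semi-static II: %.2f bits per symbol\n",
           4.22 + (symbols_prelude + freqs_prelude) / m);
    return 0;
}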
This example highlights one of the most difficult problems in designing a
compression system: when to stop modeling and start coding. As is the case in
this example, while the message itself has a lower cost with the more accurate
semi-static probability estimates, the cost of transmitting the information that
allowed that economy exceeded the gain that accrued from using the more
accurate estimates. As a very crude rule, the shorter the message, the more
likely it is that a simple code will result in the most compact overall package.
Conversely, the use of complex models and codes, with many parameters, can
usually be justified for long messages.
Another way of looking at these two components - model description, and
codes relative to the model - is that the first component describes an "average"
message, and the second component describes how the particular message in
question is different from that average. And what we are interested in is min-
imizing the cost of the total package, even if, in our heart, we know that the
model being used is somehow too simple. In the same way, in our real life we
sometimes allow small untruths to be propagated if it is just too tedious to ex-
plain the complete facts. (When we turn on the light switch, do electrons really
start "running through the wire"?)
The trade-off between model and message-relative-to-model is studied as
the minimum message length principle. The minimum message length idea is
a formalism of what has been introduced in this discussion, namely that the
best description for some behavior is the one that minimizes the combined
cost of a general summary of the situation, or average arrangement; plus a list
of the exceptions to the general summary, specifying this particular situation.
The not-unrelated area of machine learning also deals with models of data,
and minimizing the cost of dealing with the exceptions to the model [Witten
and Frank, 1999]. In the compression environment we are able to evaluate
competing models and the cost of using them to represent messages in a very
pragmatic way - by coupling them with an effective coder, and then counting
the number of output bits produced.
The three compression systems considered thus far have parsed the mes-
sage into character symbols, and treated them as if they were emitted by a
memoryless source. All three use a zero-order character-based model, and

the only difference between them is their mechanism for estimating probabil-
ities. But we can also change the model to create new compression systems.
A first-order model estimates the probability of each symbol in the context of
one previous symbol, to exploit any inter-symbol dependencies that may be
present. For example, consider the context established by the character "i". In
Blake's verse, "i" is followed by just three different characters:

"n" with probability 5/8 = 0.625
"o" with probability 1/8 = 0.125
"r" with probability 2/8 = 0.250.
The self-information of this particular context is thus 10.4 bits. Summing the
self-information for all 25 contexts, and dividing by the length of the message,
gives an overall first-order self-information of 1.48 bits per symbol; consider-
ably less than the self-information of 4.22 bits per symbol calculated using the
simpler zero-order model. Again, however, the prelude has increased in com-
plexity and thus size. Now we must transmit a matrix of probabilities, with
entry (r, c) in the matrix indicating the probability with which symbol r occurs
in context c. Even if only a single bit is used on average for each entry in the
matrix M, the prelude requires 25 x 25/128 = 4.88 bits per symbol, coun-
teracting any gain obtained by using the first-order model. On the other hand,
when the number of model parameters is small in comparison to the message
size, the prelude forms only a tiny percentage of the compressed message. It is
only on long messages that higher order models yield savings.
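As a small illustration of the first-order calculation, the following C sketch computes the cost of the context established by "i", using the three conditional counts listed above.

/* Illustrative sketch: self-information of the context "i", whose successors
   in the verse are n (5 times), o (once), and r (twice), out of 8 occurrences. */
#include <stdio.h>
#include <math.h>

int main(void) {
    int counts[] = {5, 1, 2};
    int total = 8;
    double bits = 0.0;
    for (int i = 0; i < 3; i++)
        bits -= counts[i] * log2((double) counts[i] / total);
    printf("context \"i\" costs %.1f bits\n", bits);   /* about 10.4 */
    return 0;
}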
There is an important point that arises out of this discussion of first-order
models, which is that, once the context in which each symbol will be coded has
been determined, we can mentally partition the message, and regard the sym-
bols processed within each context as being a zero-order stream. That is, any
complex model can be dealt with by imagining a set of zero-order probability
estimators and coders operating in parallel, one for each of the contexts in use
in the model, with the multiple coders either writing in an interleaved manner
to a single output file, or each writing to its own output file. For the bulk of
this book we thus concentrate on coding messages presumed to be from a zero-
order stream; only in Chapter 8 will we again return to the issue of modeling.
Throughout the discussion of entropy and models we have assumed that a
code could be devised to actually reach the entropy bound calculated for each
model. Let us now tum our attention to the coding aspects of a compression
system, and see if that is indeed the case. As might be expected, there are static
codes and semi-static codes that correspond to static and semi-static probability
estimators. There is also a third category that we have not yet introduced:
adaptive probability estimation, and adaptive coding techniques.

Suppose that P^k is the self-probability distribution of the first k - 1 sym-
bols in some m-symbol message M, and that P^{m+1} is thus the self-probability
distribution of the entire message. Then a static code is one that is calculated
based only upon P^1 - that is, the probabilities used in the code are determined
before any symbols of the message are inspected, and are not changed there-
after. The next chapter discusses several of these codes in detail. At the other
extreme, a semi-static code is one calculated based upon P^{m+1}, the probabil-
ities for the whole message. Finally, an adaptive code is one in which the kth
symbol of the message is coded using a code C^k derived from the probabil-
ity distribution P^k. That is, the set of codewords might change after each and
every symbol is coded.
The first of the compression systems presented above used a static prob-
ability estimator, and assumed all possible characters to be equally likely. A
static code is thus appropriate, and this is precisely what ASCII involves: char-
acters are directly mapped to corresponding eight-bit codewords. Column six
of Table 2.1 on page 23 shows the ASCII codewords for characters in the verse
from Milton. The expected length of the code is eight bits per symbol, so this
code achieves the entropy bound of the static model. The Kraft sum K(C)
(Equation 2.3) is 256 × 2^{-8} = 1, ensuring that the ASCII code is uniquely
decipherable. Note that because of its ubiquity, ASCII is assumed as a baseline
representation for storing data on a computer.
The second compression system discussed above was a semi-static mecha-
nism that assumed that the characters that do occur in the text are equally likely.
Column four of Table 2.1 shows the probabilities P generated by this estima-
tion technique. As the set of symbols varies from input to input, but is fixed
for anyone data file, it makes sense to couple this estimator with a semi-static
code. Once the model is fixed, and P determined, the coder assigns the code C
and then gets on with the business of coding. One possible code for the example
message is shown in column seven of Table 2.1. This code is a simple binary
code, where a codeword of length ⌈log2 n⌉ bits is assigned to each symbol.
The expected length of this code is 5 bits per symbol, and it does not achieve
the entropy bound for this model and estimator of 4.64 bits per symbol. The
binary code does obey the Kraft inequality, with K(C) ≤ 1. Indeed, the fact
that K(C) = 0.78 is strictly less than one indicates that there is room for im-
provement, and that the expected length can be decreased while still preserving
unique decipherability. One obvious improvement is for the codeword for "y"
to be shortened to "11", which is allowable because no other codeword shares
that prefix. This change increases the Kraft sum to 1, the maximum allowed,
and decreases the expected length to 4.88 bits per symbol. Other improvements
of this type are possible, and a minimal binary code (described in Section 3.1
on page 29) can reduce the expected length to 4.72 bits per symbol, still above

the entropy bound of 4.64 bits per symbol for this probability distribution.
The final compression system for which we devise a code is the third one
described earlier, the semi-static model using self-probabilities. The probabil-
ity distribution P derived by this estimator is reflected in the fifth column of
Table 2.1, headed "MR". Using a technique described in Chapter 4, it is pos-
sible to devise a minimum-redundancy code based on P that has an expected
length of 4.26 bits per symbol, which is again close, but still not equal to, the
entropy bound. This is the code "MR" depicted in the final column of Table 2.1.
What then is the notion that we have been trying to convey in this section?
In essence, it is this: there are myriad choices when it comes to designing a
compression system, and care is required with each of them. One must choose
a model, a probability estimation technique, and finally a coding method. If
the modeling or coding method must transmit parameters in the form of a pre-
lude, then a representation for the prelude must also be chosen. (There is also
a corresponding choice to be made for an adaptive estimator, but that problem
is deferred until Chapter 6.) The probability estimator must be chosen con-
sidering the attributes of the model, and the coder must be chosen taking into
account the attributes of the probability estimator.
Using the entropy and Kraft measures allows fine tuning of coding methods
without the need to actually perform the encoding and decoding. Of course the
ultimate test of any compression scheme is in the final application of a working
program and a count of the number of bits produced on a corpus of standard
test files. One piece of advice we can pass on -learned the hard way! - is that
you should never, ever, laud the benefits of your compression scheme without
first implementing a decompressor and verifying that the output of the decom-
pressor is identical to the input to the compressor, across a wide suite of test
files. Excellent compression is easily achieved if the decoder is not sent all of
the components necessary for reassembly! Indeed, this is exactly the principle
behind the lossy compression techniques used for originally-analog messages
such as image and audio data. Lossy methods deliberately suppress some of the
information contained in the original, and aim only to transmit sufficient con-
tent that when an approximate message is reconstructed, the viewer or listener
will not feel cheated. Lossy modeling techniques are beyond the scope of this
book, and with the exception of a brief discussion in Section 8.5 on page 251,
are not considered.
Chapter 3

Static Codes

The simplest coding methods are those that ignore or make only minimal use
of the supplied probabilities. In doing so, their compression effectiveness may
be relatively poor, but the simple and regular codewords that they assign can
usually be encoded and decoded extremely quickly. Moreover, some compres-
sion applications are such that the source probabilities Pi have a distribution to
which the regular nature of these non-parameterized codes is well suited.
This chapter is devoted to such coding methods. As will be demonstrated,
they are surprisingly versatile, and are essential components of a coding toolkit.
We suppose throughout that a message M is to be coded, consisting of m
integers Xi, each drawn from a source alphabet S = [1 ... n], where n is the
size of the alphabet. We also assume that the probability distribution is non-
increasing, so that p1 ≥ p2 ≥ · · · ≥ pn. Some of the codes discussed allow an
infinite source alphabet, and in these cases the probabilities are assumed to be
p1 ≥ p2 ≥ · · · ≥ pi ≥ · · · > 0 over the source alphabet S = [1 ...].

3.1 Unary and binary codes


Unary and binary coding are the simplest coding methods of all. In a unary
coder, the symbol x is represented as x - 1 "1" bits, followed by a single "0"
bit, which can be thought of as a sentinel to mark the end of each codeword.
The first few unary codewords are thus "0", "10", "110", "1110", and so on.
As described, unary is an infinite code, and arbitrarily large values can be
represented. There is no requirement that an alphabet size n be determined
(and known to the decoder) prior to the commencement of coding of the mes-
sage. On the other hand, when the alphabet is finite, and n is known to the
decoder (perhaps by way of some pre-transmission as part of a prelude), the
nth codeword can be truncated at n - 1 "1" bits.


Algorithm 3.1
Use a unary code to represent symbol x, where 1 ≤ x.
unary_encode(x)
1: while x > 1 do
2:     put_one_bit(1)
3:     set x ← x - 1
4: put_one_bit(0)

Return a value x assuming a unary code for 1 ≤ x.
unary_decode()
1: set x ← 1
2: while get_one_bit() = 1 do
3:     set x ← x + 1
4: return x

Algorithm 3.1 shows the process of unbounded unary coding. As in all of


the pseudo-code in this book, two elementary routines are assumed: an encod-
ing function pULone_bit{b) that writes the bit b to an output bitstream; and, in
the decoder, a function geLone_bitO that returns either zero or one, being the
next unprocessed bit from the compressed bitstream.
Given that the codeword for x is exactly x bits long, unary is a zero-
redundancy code for the infinite distribution given by P = [1/2, 1/4, 1/8, ...],
and the truncated unary code is zero-redundancy for the similarly skewed finite
distribution P = [1/2, 1/4, 1/8, ..., 2^{-(n-1)}, 2^{-(n-1)}]. While it might at first
seem that no distribution could possibly be this biased in favor of small values,
we shall encounter exactly such a requirement shortly, and will use unary as a
component of a more elegant and more versatile code.
At the other extreme from the unary-ideal skewed distribution shown in
the previous paragraph is the uniform or "flat" distribution, and for this kind
of distribution the binary code, already illustrated in both Table 1.1 on page 7
and Table 2.1 on page 23, is the appropriate choice. In Table 1.1, Code 1 is a
simple binary code, in which every symbol is assigned a codeword of exactly
⌈log2 n⌉ bits. The second code in that table, Code 2, is a minimal binary code,
and is more efficient than Code 1, as all prefixes are used. In general, for an
alphabet of n symbols, a minimal binary code contains 2^⌈log2 n⌉ - n codewords
that are ⌊log2 n⌋ bits long, and the remaining 2n - 2^⌈log2 n⌉ are ⌈log2 n⌉ bits
long. If the average codeword length is to be minimized, the shorter codewords
should be allocated to the more probable symbols. When the probabilities are
non-increasing, as we have assumed, this means that the shorter codes should
be allocated to the symbols at the beginning of the alphabet. For example,

Algorithm 3.2
Use a minimal binary code to represent symbol x, where 1 ≤ x ≤ n.
minimal_binary_encode(x, n)
1: set b ← ⌈log2 n⌉
2: set d ← 2^b - n
3: if x > d then
4:     put_one_integer(x - 1 + d, b)
5: else
6:     put_one_integer(x - 1, b - 1)

Return a value x assuming a minimal binary code for 1 ≤ x ≤ n.
minimal_binary_decode(n)
1: set b ← ⌈log2 n⌉
2: set d ← 2^b - n
3: set x ← get_one_integer(b - 1)
4: if (x + 1) > d then
5:     set x ← 2 × x + get_one_bit()
6:     set x ← x - d
7: return x + 1

Use "div" and "mod" operations to isolate and represent the nbits low-order
bits of binary number x.
put_one_integer(x, nbits)
1: for i ← nbits - 1 down to 0 do
2:     set b ← (x div 2^i) mod 2
3:     put_one_bit(b)

Return an nbits-bit binary integer 0 ≤ x < 2^nbits constructed from the next
nbits input bits.
get_one_integer(nbits)
1: set x ← 0
2: for i ← nbits - 1 down to 0 do
3:     set x ← 2 × x + get_one_bit()
4: return x

when n = 5 the alphabet S = [1, 2, 3, 4, 5] is assigned the codewords C =
["00", "01", "10", "110", "111"].
Algorithm 3.2 details the actions required of the encoder and decoder when
a minimal binary code is used to represent a symbol x in an alphabet of n sym-
bols S = [1 ... n]. When n is a power of two the minimal binary code is zero-
redundancy for the distribution P = [1/n, 1/n, ..., 1/n], and it is minimum-
redundancy (see Chapter 4 for a definition) for the same probability distribution
when n is not a power of two.
The pseudo-code of Algorithm 3.2 makes use of calls to evaluate binary
logarithms. Needless to say, these should be evaluated only when necessary -
perhaps just once, as soon as n is known, rather than during the coding of each
symbol as intimated in the pseudo-code. A software loop over the possible
values of the integer-valued logarithm will usually execute faster than a call
to a floating point log() function in a mathematics library. Note that the two
low-level functions put_one_integer() and get_one_integer() - which are also
used in several other codes - operate on integers x ≥ 0 rather than x ≥ 1. On
the other hand, function minimal_binary_encode() is couched in terms of an
argument x ≥ 1: it is coding a symbol in the alphabet, not an integer value.
The other point to note in connection with Algorithm 3.2 is that in some
situations an alternative form is required, in which the shorter (or longer) code-
words are allocated equally to symbols increasing from 1 and symbols decreas-
ing from n. For example, when n = 5 the codeword lengths might be delib-
erately chosen to be |C| = [3, 2, 2, 2, 3] rather than the |C| = [2, 2, 2, 3, 3]
arrangement generated by function minimal_binary_encode(). One application
in which this alternative arrangement is required is the FELICS gray-scale im-
age compression method of Howard and Vitter [1993]; another is considered in
Section 3.4 below. That is, in some situations the more probable symbols are
in the middle of the alphabet.

3.2 Elias codes


In 1975 Peter Elias proposed a family of codes with behavior that is an elegant
compromise between unary and binary. All of the members in the family are
infinite codes, and all have the property that the codeword for x is O(log x) bits
long. The second and third columns of Table 3.1 illustrate two of these codes.
In the first code, Cγ, each codeword consists of two parts. The first part
is a unary code for the binary magnitude of x, that is, the number of bits in
x, which is 1 + ⌊log2 x⌋, and takes 1 + ⌊log2 x⌋ bits. The second part of each
codeword is a binary code for x within the range established by the unary part,
taking a further ⌊log2 x⌋ bits. That is, 1 + ⌊log2 x⌋ is coded in unary, and then
x - 2^⌊log2 x⌋ is coded in binary, consuming 1 + 2⌊log2 x⌋ bits in total. The first

x    Elias Cγ code    Elias Cδ code    Golomb code, b = 5    Rice code, k = 2
1    0                0                0 00                  0 00
2    10 0             100 0            0 01                  0 01
3    10 1             100 1            0 10                  0 10
4    110 00           101 00           0 110                 0 11
5    110 01           101 01           0 111                 10 00
6    110 10           101 10           10 00                 10 01
7    110 11           101 11           10 01                 10 10
8    1110 000         11000 000        10 10                 10 11
9    1110 001         11000 001        10 110                110 00

Table 3.1: Elias, Golomb, and Rice codes. The blanks in the codewords are to assist
the reader, and do not appear in the coded bitstream.

non-zero bit of every binary code is a "1" and need not be stored, hence the
subtraction when coding the binary part.
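A compact C sketch of the Cγ construction just described is shown below; it builds each codeword as a string of '0' and '1' characters, so its output can be checked directly against the second column of Table 3.1.

/* Illustrative sketch: form the Elias C_gamma codeword for x as a text string,
   with a unary magnitude part followed by the trailing binary bits of x. */
#include <stdio.h>

static void elias_gamma(unsigned x, char *out) {
    int b = 0;                                   /* b = floor(log2 x) */
    for (unsigned t = x; t > 1; t >>= 1) b++;
    for (int i = 0; i < b; i++) *out++ = '1';    /* unary part: 1 + b bits */
    *out++ = '0';
    for (int i = b - 1; i >= 0; i--)             /* binary part: b bits */
        *out++ = (char) ('0' + ((x >> i) & 1));
    *out = '\0';
}

int main(void) {
    char buf[64];
    for (unsigned x = 1; x <= 9; x++) {
        elias_gamma(x, buf);
        printf("%u -> %s\n", x, buf);
    }
    return 0;
}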
In the algorithms literature this coding method is known as exponential
and binary search, and was described by Bentley and Yao [1976]. To see how
exponential and binary search operates, suppose a key must be located in a
sorted array of unknown size. Probes to the 1st, 3rd, 7th, 15th, 31st (and so
on) entries of the array are then made, searching for a location - any location
- at which the stored value is greater than the search key. Once such an upper
bound is determined, an ordinary constrained binary search is performed. If the
key is eventually determined to be in location x, then ⌊log2 x⌋ + 1 probes will
have been made during the exponential part of the search, and at most ⌊log2 x⌋
probes during the binary search - corresponding closely to the number of bits
required by the Elias Cγ code. In the same way, a normal binary search over a
sorted set corresponds to the use of a binary code to describe the index of the
item eventually found by the search.
Another way to look at these two searching processes is to visualize them as
part of the old "I'm thinking of a number, it's between 1 and 128" game. Most
people would more naturally use n = 100 as the upper bound, but n = 128
is a nice round number for our purposes here. We all know that to minimize
the number of yes/no questions in such a game, we must halve the range of
options with each question, and the most obvious way of doing so is to ask, as
a first question, "Is it bigger than 64?" Use of a halving strategy guarantees
that the number can be identified in ⌈log2 n⌉ questions - which in the example
is seven. When the puzzle is posed in this form, the binary search undertaken
during the questioning corresponds exactly to a binary code - a "yes" answer

yields another "1" bit, and a "no" answer another "0" bit. When all bits have
been specified, we have a binary description of the number 0 ≤ x - 1 < n.
In the same framework, a unary code corresponds to the approach to this
problem adopted by young children - "Is it bigger than 1?", "Is it bigger than
2?", "Is it bigger than 3?", and so on: a linear search.
The Elias Cγ is also a searching strategy, this time to the somewhat more
challenging puzzle "I'm thinking of a positive number, but am not going to tell
you any more than that". We still seek to halve the possible range with each
question, but because the range is infinite, can no longer assume that all values
in the range are equi-probable. And nor do we wish to use a linear search, for
fear that it will take all day (or all year!) to find the mystery number. In the
Elias code the first question is "Is it bigger than I?", as a "no" answer gives
a one-bit representation for the answer x = 1: the codeword "0" shown in
the first row in Table 3.1. And if the answer is "yes", we ask "is it bigger
than 3"; and if "yes" again, "is it bigger than 7", "bigger than 15", and so on.
Eventually a "no" will be forthcoming, and a binary convergence phase can be
entered. Hence the name "exponential and binary search" - the questions fall
into two sets, and the first set is used to establish the magnitude of the number.
In the second Elias code shown in Table 3.1, the prefix part is coded using
Cγ rather than unary, and the codeword for x requires 1 + 2⌊log2 log2 2x⌋ +
⌊log2 x⌋ bits. This gives rise to the Cδ code, which also corresponds to an
algorithm for unbounded searching in a sorted array.
The amazing thing about the Elias codes is that they are shorter than the
equivalent unary codes at all but a small and finite number of codewords. The
Cγ code is longer than unary only when x = 2 or x = 4, and in each case
by only one bit. Similarly, the Cδ code is longer than Cγ only when x ∈
[2 ... 3, 8 ... 15]. On the other hand, for large values of x both Elias codes are
not just better than unary, but exponentially better.
Algorithm 3.3 shows how the two Elias codes are implemented. Given
this description, it is easy to see how further codes in the same family are
recursively constructed: the next member in the sequence uses Cδ to represent
the prefix part, and requires approximately

log2 x + log2 log2 x + O(log log log x)

bits to represent integer x. The difference between this and Cδ is, however,
only evident for extremely large values of x, and for practical use Cδ is almost
always sufficient. For example, when coding the number one billion (that is,
10^9), the Cγ code requires 59 bits, the Cδ code 39 bits, and the next code in
the family also requires 39 bits, as both Cγ and Cδ require 9 bits for the prefix
number 1 + ⌊log2 10^9⌋ = 30.

Algorithm 3.3
Use Elias's Cγ code to represent symbol x, where 1 ≤ x.
elias_gamma_encode(x)
1: set b ← 1 + ⌊log2 x⌋
2: unary_encode(b)
3: put_one_integer(x - 2^{b-1}, b - 1)

Return a value x assuming Elias's Cγ code for 1 ≤ x.
elias_gamma_decode()
1: set b ← unary_decode()
2: set x ← get_one_integer(b - 1)
3: return 2^{b-1} + x

Use Elias's Cδ code to represent symbol x, where 1 ≤ x.
elias_delta_encode(x)
1: set b ← 1 + ⌊log2 x⌋
2: elias_gamma_encode(b)
3: put_one_integer(x - 2^{b-1}, b - 1)

Return a value x assuming Elias's Cδ code for 1 ≤ x.
elias_delta_decode()
1: set b ← elias_gamma_decode()
2: set x ← get_one_integer(b - 1)
3: return 2^{b-1} + x

The Elias codes are sometimes called universal codes. To see why, consider
the assumed probability distribution P in which p1 ≥ p2 ≥ · · · ≥ pn. Because
of the probability ordering, px is less than or equal to 1/x for all 1 ≤ x ≤ n,
since if not, for some value x we must have \sum_{j=1}^{x} p_j > \sum_{j=1}^{x} (1/x) = 1, which
contradicts the assumption that the probabilities sum to one. But if px ≤ 1/x, then
in a zero-redundancy code the codeword for symbol x is at least log2 x bits
long (Equation 2.1 on page 16). As a counterpoint to this lower limit, the
Elias codes offer codewords that are log2 x + f(x) bits long, where f(x) is
Θ(log x) for Cγ, and is o(log x) for Cδ. That is, the cost of using the Elias
codes is within a multiplicative constant factor and a secondary additive term
of the entropy for any probability-sorted distribution. They are universal in the
sense of being fixed codes that are provably "not too bad" on any decreasing
probability distribution.
Because they result in reasonable codewords for small values of x and log-
arithmically short codewords for large values of x, the Elias Cδ and Cγ codes
have been used with considerable success in the compression of indexes for
text database systems [Bell et al., 1993, Witten et al., 1999].

3.3 Golomb and Rice codes


Both the Cγ and Cδ codes are examples of a wider class of codes that consist of
a selector part that indicates a range of values that collectively form a bucket,
and a binary part that indicates a precise value within the specified bucket. One
way of categorizing such codes is to give a vector describing the sizes of the
buckets used during the selection process. For example, both Cγ and Cδ base
their selection process upon the bucket sizes

(1, 2, 4, 8, ..., 2^k, ...),

that is, buckets which grow exponentially in size. The difference between them
is that unary is used as the bucket selector code in Cγ, while Cγ is used as the
selector in the Cδ code.
Another important class of codes - the Golomb codes [1966] - use a fixed-
size bucket, of size specified by a parameter b, combined with a unary selector:

(b,b,b,b, ... ).

Algorithm 3.4 illustrates the actions of encoding and decoding using a Golomb
code. Note the use of the minimal binary code to represent the value within
each bucket, with the short codewords assigned to the least values. Note also
that for simplicity of description a "div" operation, which generates the integer
quotient of the division (so that 17 div 5 = 3) has been used in the encoder, and

Algorithm 3.4
Use a Golomb code to represent symbol x, where 1 ≤ x, and b is the
parameter of the Golomb code.
golomb_encode(x, b)
1: set q ← (x - 1) div b and r ← x - q × b
2: unary_encode(q + 1)
3: minimal_binary_encode(r, b)

Return a value x assuming a Golomb code for 1 ≤ x with parameter b.
golomb_decode(b)
1: set q ← unary_decode() - 1
2: set r ← minimal_binary_decode(b)
3: return r + q × b

a multiply in both encoder and decoder. All three of these operations can be
replaced by loops that do repeated subtraction (in the encoder) and addition (in
the decoder); and because each loop iteration is responsible for the generation
or consumption of one compressed bit, the inefficiency introduced is small.
Rice codes [1979] are a special case of Golomb codes, in which the pa-
rameter b is chosen to be 2^k for some integer k. This admits a particularly
simple implementation, in which the value x to be coded is first shifted right
k bits to get a value that is unary coded, and then the low-order k bits of the
original value x are transmitted as a k-bit binary value. The final two columns
of Table 3.1 show examples of Golomb and Rice codewords. The last column,
showing a Rice code with k = 2, is also a Golomb code with b = 4. Also
worth noting is that a Rice code with parameter k = 0, which corresponds to a
Golomb code with b = 1, is identical to the unary code described in Section 3.1
on page 29.
Both Golomb and Rice codes have received extensive use in compression
applications. Golomb codes in particular have one property that makes them
very useful. Consider a sequence of independent tosses of a biased coin - a
sequence of Bernoulli trials with probability of success given by p. Let px be
the probability of the next success taking place after exactly x trials, with p1 =
p, p2 = (1 - p)p, p3 = (1 - p)^2 p, and, in general, P = [(1 - p)^{x-1} p | 1 ≤ x].
If P has this property for some fixed value p, it is a geometric distribution, and
a Golomb code with parameter b chosen as

b = \left\lceil \frac{-\log_e(2 - p)}{\log_e(1 - p)} \right\rceil \approx (\log_e 2)\,\frac{1}{p} \approx 0.69 \times \frac{1}{p}
is a minimal-redundancy code (see Chapter 4 for a definition). This somewhat

surprising result was first noted by Gallager and Van Voorhis [1975].
To understand the relationship between Golomb codes and geometric dis-
tributions, consider the codewords for two symbols x and x + b, where b is
the parameter controlling the Golomb code. Because x and x + b differ by b,
the codewords for these two symbols must differ in length by 1 - after all, that
is how the code is constructed. Hence, if |c_x| is the length of the codeword
for x, then |c_{x+b}| = |c_x| + 1, and, by virtue of the codeword assigned, the
inferred probability of x + b must be half the inferred probability of x. But we
also know that p_x = (1 - p)^{x-1} p, that p_{x+b} = (1 - p)^{x+b-1} p, and thus that
p_{x+b}/p_x = (1 - p)^b. Putting these two relationships together suggests that b
should be chosen to satisfy

\frac{p_{x+b}}{p_x} = (1 - p)^b = 0.5 .
Taking natural logarithms and then solving for b yields
b = \frac{\log_e 0.5}{\log_e(1 - p)} \approx \frac{\log_e 0.5}{-p} = (\log_e 2)\,\frac{1}{p} ,

as required, where the approximation \log_e(1 - p) \approx -p is valid when p ≪ 1.
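In an implementation the parameter choice is a single expression; the following C sketch evaluates b = ⌈-loge(2-p)/loge(1-p)⌉ for a few values of p and compares it with the 0.69/p approximation.

/* Illustrative sketch: the Golomb parameter b for a geometric distribution with
   success probability p, and the small-p approximation 0.69/p. */
#include <stdio.h>
#include <math.h>

static int golomb_parameter(double p) {
    return (int) ceil(-log(2.0 - p) / log(1.0 - p));
}

int main(void) {
    for (double p = 0.5; p > 0.001; p /= 10.0)
        printf("p = %.3f -> b = %d  (0.69/p = %.1f)\n",
               p, golomb_parameter(p), 0.69 / p);
    return 0;
}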
Another way of looking at this result is to suppose that a sorted set of m
randomly chosen integer values in the range 1 to B are given. Then the m gaps
between consecutive integers can be considered to be drawn from a geometric
distribution, and applying the Golomb code to the m gaps is effective, provided
that the parameter b is chosen to be (loge 2)(B/m). Conversely, if we start
with m integers x_i and compute B = \sum_{i=1}^{m} x_i as their sum, then provided b
is chosen as (log_e 2)(B/m), the total cost of Golomb coding the m original
values x_i is limited by

m\left(2 + \log_2 \frac{B}{m}\right) .    (3.1)

To derive this bound, suppose at first that (by luck) b = (loge 2)(B/m) turns
out to be a power of two. The bits in the Golomb codes can be partitioned
into three components: the binary components of the m codewords, which,
when b is a power of two, always amount to exactly m log2 b bits; the m "0"
bits with which the m unary components terminate; and the at most (B -
T)/b bits in unary codes that are "1", where T is the sum of the m binary
components. To understand the final contribution, recall that each "1" bit in
any unary component indicates an additional gap of b, and that the sum of all

of the gaps cannot exceed B - T once T units have been accounted for in the
binary components.
When b is a power of two, the smallest possible value for T is m, as every
binary component - the remainder r in function golomb_encode() - is at least
one. Adding in the constraint that b = (loge 2)(B/m) and simplifying shows
that the total number of bits consumed cannot exceed

m\left(\log_2(2e\log_e 2) + \log_2\frac{B}{m} - \frac{\log_2 e}{B/m}\right) \approx m\left(1.91 + \log_2\frac{B}{m}\right) .    (3.2)

When b is not a power of two, the binary part of the code is either ⌊log2 b⌋
or ⌈log2 b⌉ bits long. When it is the former, Equation 3.2 continues to hold.
But to obtain a worst-case bound, we must presume the latter.
Suppose that b is not a power of two, and that g = 2^⌈log2 b⌉ is the next
power of two greater than b. Then the worst that can happen is that each binary
component is s + 1, where s = g - b is the number of short codewords as-
signed by the minimal binary code. That is, the worst case is when each binary
component causes the first of the long codewords to be emitted. In this case
quantity T must, on a per gap basis, decrease by s, as the first long codeword
corresponds to a binary component of s + 1. Compared to Equation 3.2, the
net bit increase per gap is given by

\log_2 g - \log_2 b - \frac{s}{b} = \log_2 \frac{g}{b} - \frac{g}{b} + 1 .

When x = g/b is constrained to 1 ≤ x ≤ 2, the function log2 x - x + 1 is
maximized at x = log2 e, with maximum value given by

\log_2 \frac{2 \log_2 e}{e} \approx 0.086 .
Combining this result with that of Equation 3.2 yields Equation 3.1.
The analysis suggests how to construct pathological sequences that push
the Golomb code to behavior that matches the upper bound of Equation 3.1.
We choose a power of two (say, g = 64), calculate b = g log_e 2 = 44,
calculate s = g - b = 20, and create a sequence of m - 1 repetitions of s + 1
followed by a final value to generate a total B that forces the desired value of b.
For example, a sequence of 99 repetitions of 21, followed by the number 4,320
is Golomb coded into 797 bits, which is 1.97 bits per symbol in excess of the
value log2(B/m) = log2(6,399/100) = 6.00.
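The worst-case accounting is easy to reproduce. The C sketch below counts the bits that a Golomb code with parameter b spends on each value (unary selector plus minimal binary remainder, as in Algorithm 3.4); applied to the sequence just described it gives the 797-bit total.

/* Illustrative sketch: bit cost of Golomb coding each value in a sequence. */
#include <stdio.h>

static int ceil_log2(int n) {
    int b = 0;
    while ((1 << b) < n) b++;
    return b;
}

static int golomb_bits(int x, int b) {
    int q = (x - 1) / b, r = x - q * b;          /* remainder r is 1..b */
    int hi = ceil_log2(b), d = (1 << hi) - b;    /* d remainders get hi-1 bits */
    return (q + 1) + (r > d ? hi : hi - 1);      /* unary plus minimal binary */
}

int main(void) {
    int total = 0;
    for (int i = 0; i < 99; i++) total += golomb_bits(21, 44);
    total += golomb_bits(4320, 44);
    printf("total = %d bits\n", total);          /* 797 */
    return 0;
}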
Earlier we commented that the Golomb code is a minimum-redundancy
code for a geometric distribution. The entropy of the geometric distribution
with parameter p is given by
\sum_{i=1}^{\infty} -p_i \log_2 p_i = -\sum_{i=1}^{\infty} (1-p)^{i-1} p \log_2\left((1-p)^{i-1} p\right)
                                    = \log_2 \frac{1}{p} - \left(\frac{1}{p} - 1\right) \log_2(1-p)
                                    \approx \log_2 \frac{1}{p} + (1 - p) \log_2 e
                                    \approx \log_2 \frac{1}{p} + 1.44    (3.3)

bits per symbol, where the second line follows from the first because the sum
Σ_{i=1}^{∞} p_i = 1, and the expected value of the geometric distribution is given by
Σ_{i=1}^{∞} i p_i = 1/p; the third line follows from the second because log2(1 - p) ≈
-p log2 e when p is small compared to 1; and the fourth line follows from the
third because (1 - p) ≈ 1 when p is small compared to 1.
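To make the approximation concrete, here is a small Python check (our illustration, not from the text): it evaluates the exact entropy via the closed form implied by the second line of the derivation, and compares it with the log2(1/p) + 1.44 estimate of Equation 3.3.

    from math import log2

    def geometric_entropy(p):
        # Exact entropy of p_i = (1-p)^(i-1) p, rewritten from the second line
        # of the derivation: log2(1/p) - (1/p - 1) log2(1-p).
        return log2(1 / p) - (1 / p - 1) * log2(1 - p)

    p = 1 / 64
    print(geometric_entropy(p))      # about 7.43 bits per symbol
    print(log2(1 / p) + 1.44)        # the Equation 3.3 approximation, 7.44
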
Equation 3.3 gives a value that is rather less than the bound of Equation 3.1,
and if a random m-subset of the integers 1 . . . B is to be coded, a Golomb code
will require, in an expected sense, approximately m(1.5 + log2(B/m)) bits.
But there is no inconsistency between this result and that of Equation 3.1 - the
sequences required to drive the Golomb code to its worst-case behavior are far
from geometric, and it is unsurprising that there is a non-trivial difference be-
tween the expected behavior on random sequences and the worst-case behavior
on pathological sequences. If anything, the surprise is on the upside - even
with malicious intent and a pathological sequence, only half a bit per gap of
damage can be done to a Golomb code, compared to the random situation that
it handles best.
The Golomb code again corresponds to an array searching algorithm, a
mechanism noted and described by Hwang and Lin [1972]. In terms of the
"I'm thinking of a number" game, the Golomb code is the correct attack on
the puzzle posed as "I'm thinking of m distinct numbers, all between 1 and B;
what is the smallest?"
Rice codes have similar properties to Golomb codes. When coding a set of
m integers that sum to B the parameter k should be set to k = ⌊log2(B/m)⌋,
which corresponds to b = 2^⌊log2(B/m)⌋. The worst case cost of a Rice code is
the same as that of a Golomb code:

    m (2 + log2(B/m))

bits can be required, but never more. The worst case arises when the binary
component of each Rice codeword corresponds to a remainder of r = 1, so a
worst case sequence of length 100 could consist of 99 repetitions of 1, followed
by 6,300, which pushes a Rice code to a total of 796 bits. On the "Golomb-
bad" sequence [21,21,21, ... ,4320] discussed earlier, the Rice code requires
734 bits; and on the "Rice-bad" [1,1,1, ... ,6300] sequence the Golomb code
requires 743 bits. In general, if the worst-case number of bits in the coded out-
put must be bounded, a Rice code should be preferred; if the average (assuming
m random values) length of the coded sequence is to be minimized, a Golomb
code should be used.
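The cross-comparison in the previous paragraph can be reproduced with a short Python sketch (again ours, with invented helper names); a Rice codeword for x with parameter k is a unary quotient followed by exactly k binary bits.

    def rice_length(x, k):
        # Rice codeword for x >= 1: unary quotient, then exactly k binary bits.
        return (x - 1) // 2 ** k + 1 + k

    def golomb_length(x, b):
        # As sketched earlier: unary quotient plus a minimal binary remainder.
        q, k = (x - 1) // b, (b - 1).bit_length()
        r, short = x - q * b, 2 ** k - b
        return q + 1 + (k - 1 if r <= short else k)

    golomb_bad = [21] * 99 + [4320]     # worst case for the Golomb code, b = 44
    rice_bad   = [1] * 99 + [6300]      # worst case for the Rice code, k = 5
    print(sum(rice_length(x, 5) for x in golomb_bad))    # 734
    print(sum(rice_length(x, 5) for x in rice_bad))      # 796
    print(sum(golomb_length(x, 44) for x in rice_bad))   # 743
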
Rice codes also have one other significant property compared to Golomb
codes: the space of possible parameter values is considerably smaller. If a
tabulation technique is being used to determine the parameter for each symbol
in the message on the fly, Rice codes are the method of choice.
Generalizations of Elias and Golomb codes have also been described, and
used successfully in situations in which geometrically-growing buckets are re-
quired, but with a first bucket containing more than one item. For example,
Teuhola [1978] describes a method for compressing full-text indexes that is
controlled by the vector

    (b, 2b, 4b, . . . , 2^k b, . . .) .


Jakobsson [1978] and Moffat and Zobel [1992] have also suggested different
schemes for breaking values into selectors and buckets.
Another interesting generalization is the application of Golomb and Rice
codes to doubly-infinite source alphabets. For example, suppose the source
alphabet is given by S = [. . . , -2, -1, 0, +1, +2, . . .] and that the probability
distribution is symmetric, with p_{-x} = p_x, and p_x ≥ p_y when 0 ≤ x < y. The
standard technique to deal with this situation is to map the alphabet onto a new
alphabet S' = [1, 2, . . .] via the conversion function f:

    f(x) = 1         if x = 0,
           2x        if x > 0,
           -2x + 1   if x < 0.

The modified alphabet S' can then be handled by any of the static codes de-
scribed in this chapter, with Rice and Golomb codes being particularly appro-
priate in many situations. But the symmetry inherent in the original probability
distribution is no longer handled properly. For example, a Rice code with k = 1
assigns the codewords "00", "01", "100", "101", "1100", and "1101" to sym-
bols 0, +1, -1, +2, -2, and +3, respectively, and is biased in favor of the positive
values. To avoid this difficulty, a code based upon an explicit selector vector
can be used, with an initial bucket containing an odd number of codewords, and
then subsequent buckets each containing an even number of codewords. The
Elias codes already have this structure, but might place too much emphasis on
x = 0 for some applications.
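A sketch of the mapping, as reconstructed above, together with its inverse (the function names fold() and unfold() are ours):

    def fold(x):
        # Map ..., -2, -1, 0, +1, +2, ... onto 1, 2, 3, ... with 0 first and
        # each positive value just ahead of its negative counterpart.
        if x == 0:
            return 1
        return 2 * x if x > 0 else -2 * x + 1

    def unfold(y):
        # Inverse of fold().
        if y == 1:
            return 0
        return y // 2 if y % 2 == 0 else -(y - 1) // 2

    print([fold(x) for x in (0, 1, -1, 2, -2)])    # [1, 2, 3, 4, 5]
    print([unfold(y) for y in range(1, 6)])        # [0, 1, -1, 2, -2]
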
3.4 Interpolative coding


As a final example of a non-parameterized code, this section describes the bi-
nary interpolative coding mechanism of Moffat and Stuiver [2000]. The novel
feature of this method is that it assigns codewords to symbols in a dynamic
manner, rather than using a static assignment of codewords. That is, the entire
message is treated in a holistic way, and the codewords used at any point of
the message depend upon codes already assigned to symbols both prior and
subsequent to the symbol currently being processed. In this sense it is not re-
ally a static code (which is what this chapter is about); but nor is it any more
parameterized than the Golomb code, despite its sensitivity to the contents of
the message being coded. Indeed, so great is its flexibility that appending a
symbol to a message and then compressing the augmented message can result
in a compressed form of the same length; and the individual codewords used
to represent the symbols in a message can, in some circumstances, be zero bits
long.
The best way to explain the interpolative code is with an example. Consider
the message
M = [1, 1, 1, 2, 2, 2, 2, 4, 3, 1, 1, 1]
over the alphabet S = [1,2,3,4], in which - as is assumed to be the case
throughout this chapter - the symbols are ordered so that the probabilities are
non-increasing. The first stage of the process is to transform the message into
a list L of m cumulative sums of the symbol identifiers,

L = [1,2,3,5,7,9,11,15,18,19,20,21].
The list L is then encoded using a recursive mechanism that follows the struc-
ture of a preorder traversal of a balanced binary tree. First the root of the tree,
corresponding to the central item of the list L, is encoded; and then the left
subtree is recursively encoded (that is, the list of items in L that are to the left
of the central item); and then the right subtree is encoded. This sequence of
operations is illustrated in the pseudo-code of Algorithm 3.5.
Consider the example list L. It contains m = 12 items. Suppose that m is
known to the decoder, and also that the final cumulative sum L[12] is less than
or equal to the bound B = 21. The reasonableness of these assumptions will be
discussed below. The middle item of L (at h = 6) is L[6] = 9, and is the first
value coded. The smallest possible value for L[6] is 6, and the largest possible
value is 15. These bounds follow because if there are m = 12 symbols in
total in the list, there must be m1 = 5 values prior to the 6th, and m2 =
6 values following the 6th. Thus the middle value of the list of cumulative
sums can be encoded as a binary integer 6 ≤ L[h] ≤ 15. Since there are
Algorithm 3.5
Use an interpolative binary code to represent the m symbol message M,
where 1 ≤ M[i] for 1 ≤ i ≤ m.
interpolative_encode_block(M, m)
1: set L[1] ← M[1]
2: for i ← 2 to m do
3:   set L[i] ← L[i - 1] + M[i]
4: if an upper bound B ≥ L[m] is not agreed with the decoder then
5:   set B ← L[m], and encode(B)
6: recursive_interpolative_encode(L, m, 1, B)

Recursively process sections of a list. Argument L[1 . . . m] is a sorted list of
m strictly increasing integers, all in the range lo ≤ L[i] < L[i + 1] ≤ hi.
recursive_interpolative_encode(L, m, lo, hi)
1: if m = 0 then
2:   return
3: set h ← (m + 1) div 2
4: set m1 ← h - 1 and m2 ← m - h
5: set L1 ← L[1 . . . (h - 1)] and L2 ← L[(h + 1) . . . m]
6: centered_binary_in_range(L[h], lo + m1, hi - m2)
7: recursive_interpolative_encode(L1, m1, lo, L[h] - 1)
8: recursive_interpolative_encode(L2, m2, L[h] + 1, hi)

Encode lo ≤ x ≤ hi using a binary code.
centered_binary_in_range(x, lo, hi)
1: centered_minimal_binary_encode(x - lo + 1, hi - lo + 1)

Encode 1 ≤ x ≤ n using a centered minimal binary code.
centered_minimal_binary_encode(x, n)
1: set long ← 2 × n - 2^⌈log2 n⌉
2: set x ← x - long div 2
3: if x < 1 then
4:   set x ← x + n
5: minimal_binary_encode(x, n)
[Figure 3.1 consists of four number-lines spanning the range 1 to 21, one for each level of the recursion. The first marks L[6]=9; the second L[3]=3 and L[9]=18; the third L[1]=1, L[4]=5, L[7]=11, and L[11]=20; and the fourth L[2]=2, L[5]=7, L[8]=15, L[10]=19, and L[12]=21.]

Figure 3.1: Example of interpolative coding applied to the sequence M. Gray re-
gions correspond to possible values for each cumulative sum in L. Vertical solid lines
show the demarcation points between different recursive calls at the same level in the
preorder traversal of the underlying tree.

15 - 6 + 1 = 10 values in this range, either a three bit or a four bit code is used.
The first number-line in Figure 3.1, and the first row of Table 3.2 show this step,
including the range used for the binary code, and the actual output bitstream
"011" (column "Code 1") generated by a minimal binary coder (Algorithm 3.2
on page 31) when coding the 4th of 10 possible values. Column "Code 2" will
be discussed shortly.
Once the middle value of the list L has been coded, the ml = 5 values to
the left are treated recursively. Now the maximum possible value (that is, the
upper bound on L[5]) is given by L[6]- 1 = 8. The sublist in question contains
5 values, the middle one of which is L[3] = 3. In this subproblem there must
be two values to the left of center, and two to the right, and so 3 ~ L[3] ~ 6
is established. That is, L[3] is one of four possible values, and is coded in two
bits - "00". The left-hand half of the second number-line of Figure 3.1 shows
this situation, and the second row of Table 3.2 (again, in the column "Code 1")
   h   L[h]   lo + m1   hi - m2   Code 1   Code 2

   6     9        6        15      011      001
   3     3        3         6      00       10
   1     1        1         1      λ        λ
   2     2        2         2      λ        λ
   4     5        4         7      01       11
   5     7        6         8      10       0
   9    18       12        18      111      100
   7    11       10        16      010      110
   8    15       12        17      101      01
  11    20       20        20      λ        λ
  10    19       19        19      λ        λ
  12    21       21        21      λ        λ
       Total number of bits         18       16

Table 3.2: Binary interpolative coding the cumulative list L. Each row shows the result
of coding one of the L[h] values in the range lo + m1 . . . hi - m2. Using a minimal
binary code (column "Code 1") the coded sequence is "011 00 01 10 111 010 101", a
total of 18 bits. Using the centered minimal binary code described in Algorithm 3.5,
the coded sequence is (column "Code 2") "001 10 11 0 100 110 01", a total of 16 bits.

shows the code assigned to L[3].


Next the sublist [1, 2] must be coded. In this sublist the first value is known
to be at least 1 and the second to be at most L[3] - 1 = 2. The left-most parts
of the third and fourth number-lines of Figure 3.1, and the third and fourth
lines of Table 3.2, show the calls to function recursive_interpolative_encode()
that do this. In both cases the possible range of values is just one symbol
wide - the bounds are 1 ≤ L[1] ≤ 1 and 2 ≤ L[2] ≤ 2 respectively - and
a natural consequence of using a minimal binary code in such a situation is
that no selection is necessary, and codewords zero bits long are generated. The
"Code 1" column of Table 3.2 notes these as empty codewords, denoted λ. The
balance of the figure and table shows the remainder of the processing of the left
sublist (h = 4 and h = 5), and then the recursive processing of the right sublist
created from the original problem, starting with its middle item at h = 9. Of
the five further symbols in the right sublist, three are deterministically known
without any codeword being required.
The code for the message M is the concatenation of these individual codes,
in the order in which they are generated by the preorder traversal of the cumu-
lative list L. Hence, the first value decoded is L[6], the second L[3], the third
L[I], and so on. At each stage the decoder knows the bounds on each value that
were used by the encoder, and so decoding can always take place successfully.
Consider again the example shown in Figure 3.1 and Table 3.2. Quite amazingly,
using a minimal binary code the total message of 12 symbols is coded
in just 18 bits, an average of 1.5 bits per symbol. This value should be com-
pared with the 21 bits required by a Golomb code for the list M (using b = 1,
which is the most appropriate choice of parameter) and the 26 bits required by
the Elias Cγ code. Indeed, this coding method gives every appearance of being
capable of real magic, as the self-information (Equation 2.4 on page 22) of the
message M is 1.63 bits per symbol, or a minimum of 20 bits overall. Unfortu-
nately, there are two reasons why this seeming paradox is more sleight-of-hand
than true magic.
The first is noted in steps 4 and 5 of function interpolative_encode_block()
in Algorithm 3.5: it is necessary for the decoder to know not just the number
of symbols m that are to be decoded, but also an upper bound B for L[m], the
sum of the symbol values. In Algorithm 3.5 transmission of this latter value, if
it cannot be assumed known, is performed using a generic encode() function,
and in an implementation would be accomplished using the Cδ code or some
similar mechanism for arbitrary integers. For the example list, a Cδ code for
L[m] = 21 requires that 9 additional bits be transmitted. On the other hand,
an exact coder (of the kind that will be discussed in Chapters 4 and 5) that can
exploit the actual probability distribution must know either the probabilities,
[6/12,4/12,1/12,1/12], or codewords calculated from those probabilities, if
it is to code at the entropy-based lower bound. So it should also be charged
for parameters, making the entropy-based bound unrealizable. One could also
argue that all methods must know m if they are to stop decoding after the
correct number of symbols. These issues cloud the question as to which code
is "best", particularly for short messages where the prelude overheads might be
a substantial fraction of the message bits. The issue of charging for parameters
needed by the decoder will be considered in greater detail in later chapters.
The other reason for the discrepancy between the actual performance of
the interpolative code in the example and what might be expected from Equa-
tion 2.4 is that the numbers in the cumulative message L are clustered at
the beginning and end, and the interpolative code is especially good at ex-
ploiting localized patterns of this kind. Indeed, the interpolative code was
originally devised as a mechanism for coding when the frequent symbols are
likely to occur in a clustered manner [Moffat and Stuiver, 2000]. For the list
M' = [2, 1, 2, 1, 2, 1, 2, 1, 3, 1, 4, 1], which has the same self-entropy as M but
no clustering, the interpolative method (using the minimal binary code assumed
by "Code 1") generates the sequence "011 10 10 0 0 1 100 01 0 11 00", a total
of 20 bits, and the magic is gone.
The performance of interpolative_encode_block() can be slightly improved
by observing that the shorter binary codes should be given to a block of symbols
at the center of each coding range rather than at the beginning. For example,
when the range is six, the minimal binary coder allocates codewords of length
[2,2,3,3,3,3]. But in this application there is no reason to favor small values
over large. Indeed, the middle value in a list of cumulative sums is rather more
likely to be around half of the final value than it is to be near either of the
extremities. That is, the codeword lengths [3,3,2,2,3,3] for the six possible
values are more appropriate. This alteration is straightforward to implement,
and is the reason for the introduction of functions centered_binary_in_range()
and centered_minimal_binary_encode() in Algorithm 3.5. The latter function
rotates the domain by an amount calculated to make the first of the desired
short codewords map to integer 1, and then uses minimal_binary_encode() to
represent the resultant mapped values.
The column headed "Code 2" of Table 3.2 shows the effect of using a cen-
tered minimal binary code. The codeword for L[5] = 7 becomes one bit shorter
when coded in the range 6 to 8, and the codeword for L[8] = 15 also falls into
the middle section of its allowed range and receives a codeword one bit shorter.
Using the full implementation of the interpolative code the example message
M can thus be transmitted in 16 bits. Message M' is similarly reduced to
19 bits. Again, both encodings are subject to the assumption that the decoder
knows that B = 21 is an upper bound for L[m].
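The bit counts quoted in this example can be reproduced with the following Python sketch (our code, with invented helper names; it counts bits only, rather than emitting them), using either a plain or a centered minimal binary code at each step.

    def minimal_binary_length(x, n):
        # Bits needed by a minimal binary code for 1 <= x <= n.
        if n == 1:
            return 0
        k = (n - 1).bit_length()              # ceil(log2 n)
        return k - 1 if x <= 2 ** k - n else k

    def centered_length(x, n):
        # Rotate so that the short codewords cover the middle of the range.
        if n == 1:
            return 0
        x -= (2 * n - 2 ** (n - 1).bit_length()) // 2
        return minimal_binary_length(x if x >= 1 else x + n, n)

    def interpolative_bits(L, lo, hi, length):
        # Total bits for the sorted list L, all values in lo..hi.
        m = len(L)
        if m == 0:
            return 0
        h = (m + 1) // 2                      # middle item, 1-based
        mid, m1, m2 = L[h - 1], h - 1, m - h
        bits = length(mid - (lo + m1) + 1, (hi - m2) - (lo + m1) + 1)
        return (bits + interpolative_bits(L[:h - 1], lo, mid - 1, length)
                     + interpolative_bits(L[h:], mid + 1, hi, length))

    M = [1, 1, 1, 2, 2, 2, 2, 4, 3, 1, 1, 1]
    L = [sum(M[:i]) for i in range(1, len(M) + 1)]
    print(interpolative_bits(L, 1, 21, minimal_binary_length))   # 18
    print(interpolative_bits(L, 1, 21, centered_length))         # 16
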
Moffat and Stuiver [2000] give an analysis of the interpolative code, and
show that for m integers summing to not more than B the cost of the code -
not counting the cost of pre-transmitting B - is never more than

    m (2.58 + log2(B/m))                                          (3.4)

bits. This is a worst-case limit, and holds for all combinations of m and B, and,
once m and B are fixed, for any set of m distinct integers summing to B or less.
Using the same m = 100 and B = 6,399 values employed above, one obvi-
ous bad sequence for the interpolative code is [1, 129, 1, 129, 1, 129, . . . , 1, 28].
This sequence requires 840 bits when represented with the interpolative code,
which is 2.40 + log2(B/m) bits per symbol, and is close to the bound of Equation 3.4. It is not
clear whether other sequences exist for which the constant is greater than 2.4.
Finally, as an additional heuristic that improves the measured performance
of the interpolative code when the probability distribution is biased in favor
of small values, a "reverse centered minimal binary code" should be used at
the lowest level of recursion when m = 1 in recursive_interpolative_encode()
(Algorithm 3.5 on page 43). Allocating the short codewords to the low and high
values in the range is the correct assignment when a single value is being coded
if p1 is significantly higher than the other probabilities. Unfortunately, the
example list M fails to show this effect, and use of a reverse centered minimal
binary code when m = 1 on the example list M adds back the two bits saved
through the use of the centered binary code.

3.5 Making a choice


All of the methods described in this chapter require approximately the same
computational resources, and none of them require a detailed parameterization
of the source probability distribution. Nor do any of them require large amounts
of memory, although the interpolative method does have the disadvantage of
operating in a non-sequential manner, which requires that the source message
M be buffered in the encoder, and a stack of O(log m) elements be used in the
decoder.
We are thus essentially free, in any given application, to use the code that
yields the best compression. That, in turn, is determined purely by the distance
between the actual probability distribution generated by the compression sys-
tem, and the implicit distribution assumed by the coding method. For example,
Golomb codes are known to be very effective for probability distributions that
are a geometric series, and a minimal binary code is clearly well-matched to
a uniform (or flat) distribution of probabilities. Similarly, the Elias Cγ code -
which allocates a codeword of approximately 1 + 2 log2 x bits to the xth sym-
bol of the alphabet - is an ideal choice when the probability distribution is such
that the probability p_x of the xth symbol is

    p_x = 2^{-(1 + 2 log2 x)} = 1/(2x^2) .
Another well-known distribution is the Zipf distribution [Zipf, 1949]. The
rationale for this distribution is the observation that in nature the most frequent
happening is often approximately twice as likely as the second most frequent,
three times as likely as the third most frequent, and so on. Hence, a Zipf distri-
bution over an alphabet of n symbols is given by

    p_x = 1/(x × Z),  where  Z = Σ_{j=1}^{n} 1/j = loge n + O(1).
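For instance, the following Python fragment (ours) builds the Zipf distribution for a small alphabet and confirms the entropy value quoted for Zipf5 in Table 3.3a below.

    from math import log2

    def zipf(n):
        # Zipf distribution over n symbols: p_x proportional to 1/x.
        Z = sum(1 / j for j in range(1, n + 1))
        return [1 / (x * Z) for x in range(1, n + 1)]

    P = zipf(5)
    print([round(p, 2) for p in P])               # [0.44, 0.22, 0.15, 0.11, 0.09]
    print(round(-sum(p * log2(p) for p in P), 2)) # 2.06 bits per symbol
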

As another test of coder performance, recall from Chapter 1 that an ef-


fective compression system makes high-probability predictions, since if it did
not, it could not be attaining good compression. Because of this, very skew
probability distributions are also of practical importance.
Table 3.3a shows six representative probability distributions drawn from
four categories - uniform, geometric, Zipfian, and skew. Table 3.3b lists the
  List          n    P                                     Entropy
  Uniform50     50   [0.02, 0.02, 0.02, 0.02, 0.02, ...]   5.64
  Geometric50   50   [0.10, 0.09, 0.08, 0.07, 0.07, ...]   4.64
  Zipf50        50   [0.22, 0.11, 0.07, 0.06, 0.05, ...]   4.61
  Zipf5          5   [0.44, 0.22, 0.15, 0.11, 0.09]        2.06
  Skew5          5   [0.80, 0.10, 0.05, 0.03, 0.02]        1.07
  Veryskew3      3   [0.97, 0.02, 0.01]                    0.22
                             (a)

  List          Binary   Elias Cγ   Elias Cδ   Golomb   Interp.
  Uniform50      5.72      8.72       8.54      6.09      6.47
  Geometric50    5.21      5.49       5.91      4.58      4.81
  Zipf50         5.26      5.13       5.38      4.75      4.79
  Zipf5          2.20      2.51       2.87      2.43      2.39
  Skew5          2.05      1.50       1.65      1.37      1.19
  Veryskew3      1.03      1.06       1.09      1.04      0.26
                             (b)

Table 3.3: Compression of random sequences: (a) six representative probability distri-
butions and the entropy (bits per symbol) of those distributions; and (b) performance
of five coding methods (bits per symbol) for random lists of 1,000 symbols drawn
from those distributions, with the best result for each sequence highlighted in gray. In
the case of the binary code, the parameter n is included in the cost and is transmitted
using Cδ; in the case of the Golomb code, the parameter b is included in the cost and
is transmitted using Cγ; and in the case of the interpolative code the value Σ x_i - m
is included in the cost, and transmitted using Cδ. The value of m is assumed free of
charge in all cases. The interpolative code implementation uses a centered minimal
binary code when m > 1, and a reverse centered minimal binary code when m = 1.
compression performance (in bits per symbol) of the five main coding mecha-
nisms described in this chapter for random sequences of 1,000 symbols drawn
from the six distributions. The cost of any required coding parameters is in-
cluded; note that, because of randomness, the self-information of the generated
sequences can differ from the entropy of the distribution used to generate that
sequence. This is how the Golomb code "beats" the entropy limit on file Geo-
metric50.
Unsurprisingly, the minimal binary code is well-suited to the uniform dis-
tribution. It also performs well on the Zipf5 distribution, mainly because it allo-
cates two-bit codewords to the three most frequent symbols. On the other hand,
the fifty-symbol Zipf50 probability arrangement is best handled by a Golomb
code (as it turns out, with b = 8, which makes it a Rice code). In this case the
Zipfian probabilities can be closely approximated by a geometric distribution.
The Golomb code is a clear winner on the Geometric50 sequence, as expected.
The two skew probability distributions are best handled by the interpolative
coder. For the sequence Veryskew3 the average cost per symbol is less than
a third of a bit - more than two thirds of the symbols are deterministically
predicted, and get coded as empty strings. This is a strength of the interpolative
method: it achieves excellent compression when the entropy of the source is
very low. The interpolative code also performs reasonably well on all of the
other distributions, scoring three second places over the remaining four files.
Finally, note the behavior of the two true universal codes, Cγ and Cδ. Both
perform tolerably well on all of the probability distributions except for Uni-
form50, and are reliable defaults. Moreover, their performance (and also that
of the Golomb code) would be improved if use was made of the known bound
on n, the alphabet size (50, 5, or 3 for the lists tested). As implemented for
the experiments, these three methods handle arbitrarily large integers, and so
waste a certain fraction of their possible codewords on symbols that cannot oc-
cur. For example, when n = 5 a truncated Cγ code yields codeword lengths
of |C| = [1, 3, 3, 3, 3] (instead of |C| = [1, 3, 3, 5, 5]), and on lists Zipf5 and
Skew5 gives compression of 2.12 bits per symbol and 1.40 bits per symbol
respectively.
A similar modification might also be made to the Golomb code, if the max-
imum symbol value were isolated and transmitted prior to the commencement
of coding. But while such tweaking is certainly possible, and in many cases
serves to improve performance, it is equally clear from these results that there
is no universal solution - a static code may ignore the probability distribution
and still get acceptable compression, but if good compression is required re-
gardless of the distribution, a more general mechanism for devising codes must
be used.
Chapter 4

Minimum-Redundancy Coding

We now turn to the more general case illustrated by the "Code 3" column in
Table 1.1 on page 7. It is the best of the three listed codes because, somehow,
its set of codeword lengths better matches the probability distribution than do
the other two sets. Which forces the question: given a sorted list of symbol
probabilities, how can a set of prefix-free codewords be assigned that is best
for that data? And what is really meant by "best"?
The second question is the easier to answer. Let P be a probability dis-
tribution, and C a prefix-free code over the channel alphabet {0, 1}. Further,
let E(C, P) be the expected codeword length for C, calculated using Equa-
tion 1.1 on page 7. Then C is a minimum-redundancy code for distribution P
if E(C, P) ≤ E(C', P) for every n symbol prefix-free code C'. That is, a
code is minimum-redundancy for a probability distribution if no other prefix-
free code exists that requires strictly fewer bits per symbol on average. Note
that designing a minimum-redundancy code is not as simple as just choosing
short codewords for all symbols, as the Kraft inequality serves as a balancing
requirement, tending to make at least some of the codewords longer. It is the
tension between the Kraft requirement and the need for the code to have a low
expected length that determines the exact shape of the resultant code.
Now consider the first question. Given an arbitrary set of symbol proba-
bilities, how can we generate a minimum-redundancy code? This chapter is
devoted to the problem of finding such prefix codes, and using them for encod-
ing and decoding.

4.1 Shannon-Fano codes


The first success in solving this problem was the result of independent dis-
coveries by Claude Shannon [1948] and Robert Fano [1949], and is known

[Figure 4.1 shows the recursive splitting of the probabilities 0.67, 0.11, 0.07, 0.06, 0.05, and 0.04, with each split labelled by the "0" and "1" bits assigned to its two halves.]

Figure 4.1: Example of the use of the Shannon-Fano algorithm for the proba-
bility distribution P = [0.67, 0.11, 0.07, 0.06, 0.05, 0.04] to obtain the code C =
["0", "100", "101", "110", "1110", "1111"].

as Shannon-Fano coding. The motivation for their algorithm is clear: if zero


bits and one bits are to be equally useful, then each bit position in a codeword
should correspond to a choice between packages of symbols of roughly the
same probability, or weight. To achieve this packaging, the sorted list of sym-
bol probabilities is broken into two parts, with each part having a probability
as close to 0.5 as possible. All of the symbols in one of the packages are then
assigned a "I" bit as the first bit of their codewords, and similarly the symbols
in the other package are assigned a "0" prefix bit. The two packages are then
subdivided recursively: each is broken into subpackages of weight as close as
possible to half of the weight of the parent package. Figure 4.1 shows this pro-
cess for the example probability distribution of Table 1.1 on page 7, namely,
the probability distribution P = [0.67, 0.11, 0.07, 0.06, 0.05, 0.04].
The code generated in Figure 4.1 is exactly that listed in the "Code 3"
column of Table 1.1, and, in this particular case, happens to be minimum-
redundancy. To see that the Shannon-Fano algorithm is not always effective,
consider the probabilities P = [0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]. For this distri-
bution a set of codewords described by |C| = [2, 2, 3, 3, 3, 4, 4] is generated,
for an expected codeword length of E(C, P) = 2.70 bits per symbol. Now
consider the prefix-free code |C'| = [2, 3, 3, 3, 3, 3, 3]. The expected code-
word length E(C', P) is 2.60 bits per symbol, so the code C is not minimum-
approach is forced to assign a two bit codeword to symbol 2, when in fact
symbol 2 should have a codeword of the same length as symbols 3 through 7.
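As a quick check of these two figures, expected codeword length is just the probability-weighted sum of the lengths; the following short Python calculation (ours) also confirms that both codes satisfy the Kraft inequality.

    P = [0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
    for lengths in ([2, 2, 3, 3, 3, 4, 4], [2, 3, 3, 3, 3, 3, 3]):
        expected = sum(p * l for p, l in zip(P, lengths))     # E(C, P)
        kraft = sum(2 ** -l for l in lengths)                 # must not exceed 1
        print(round(expected, 2), kraft)
    # prints 2.7 1.0 then 2.6 1.0
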
4.2 Huffman coding


It was a few years after Shannon and Fano published their approaches that a stu-
dent at MIT by the name of David Huffman devised a general-purpose mecha-
nism for determining minimum-redundancy codes [Huffman, 1952]. The story
of the development is somewhat of a legend in itself: Fano, a faculty member,
offered to waive the exam for any members of his graduate class that could
solve the problem of finding a minimum-redundancy code. Huffman tackled
the task with gusto, but limited success; and on the evening before the final
exam, threw his last attempt in the trash can and set to studying. But shortly
thereafter, he realized that he had in fact developed the germ of an idea. He
refined his bottom-up approach into the algorithm we know today, submitted
it to Fano, and was duly relieved of the burden of sitting his final exam.¹ His
algorithm is now one of the best-known methods in the discipline of comput-
ing, and is described in books covering both compression-related topics and
algorithmics-related areas.
The basic idea of the method is, with hindsight, extremely simple. Rather
than use the top-down approach of the Shannon-Fano technique, a bottom-up
mechanism is employed. To begin, every symbol is assigned a codeword that
is zero bits long. Unless n = 1, this violates the Kraft inequality (Equation 2.3
on page 18), so the code cannot be prefix-free. All of these individual symbols
are considered to be in packages, initially of size one; and the weight of each
package is taken to be the sum of the weights of the symbols in the package.
At each stage of the algorithm the two lowest weight packages are com-
bined into one, and a selector bit prefixed to the codewords of all of the symbols
involved: a "0" to the codes for the symbols in one of the two packages, and a
"I" to the codes for the symbols in the other package. This simultaneously re-
duces the value K (C) (by lengthening some of the codewords) and reduces the
number of packages (since two have been merged into one). The process is then
repeated, using the modified set of packages and package weights. At exactly
the point at which a prefix-free code becomes possible (with K(C) = 1), the
number of packages becomes one, and the process terminates. Codewords for
each of the symbols in the alphabet have then been constructed. Figure 4.2 il-
lustrates this process for the example probability distribution used in Figure 4.1.
Recall that the symbol λ denotes the empty string.
In each stage of Figure 4.2 the gray-highlighted weight indicates the newly
created package. The final codes calculated differ from those listed in Table 1.1
on page 7, but any code that assigns symbols the same codeword lengths as
they have in a Huffman code has the same cost as the Huffman code, so is
¹ Huffman apparently told this story to a number of people. Glen Langdon, who was a col-
league of Huffman at UCSC for several years, confirmed this version to the authors.
  weight   tentative codes               weight   tentative codes

  0.67     c1 = λ                        0.67     c1 = λ
  0.11     c2 = λ                        0.11     c2 = λ
  0.07     c3 = λ                        0.09     c5 = 0, c6 = 1
  0.06     c4 = λ                        0.07     c3 = λ
  0.05     c5 = λ                        0.06     c4 = λ
  0.04     c6 = λ

  (a) initial packages,                  (b) after the first step,
      K(C) = 6                               K(C) = 5

  weight   tentative codes               weight   tentative codes
  0.67     c1 = λ                        0.67     c1 = λ
  0.13     c3 = 0, c4 = 1                0.20     c2 = 0, c5 = 10,
  0.11     c2 = λ                                 c6 = 11
  0.09     c5 = 0, c6 = 1                0.13     c3 = 0, c4 = 1

  (c) after the second step,             (d) after the third step,
      K(C) = 4                               K(C) = 3

  weight   tentative codes               weight   tentative codes
  0.67     c1 = λ                        1.00     c1 = 0, c2 = 100,
  0.33     c2 = 00, c3 = 10,                      c3 = 110,
           c4 = 11, c5 = 010,                     c4 = 111,
           c6 = 011                               c5 = 1010,
                                                  c6 = 1011

  (e) after the fourth step,             (f) after the final step,
      K(C) = 2                               K(C) = 1

Figure 4.2: Example of the use of Huffman's greedy algorithm for the input prob-
ability distribution P = [0.67, 0.11, 0.07, 0.06, 0.05, 0.04] to obtain the code C =
["0", "100", "110", "111", "1010", "1011"]. At each step the newly created package is
indicated in gray.
still minimum-redundancy. In Figure 4.2 the prefix selector bits are assigned
according to the rule "one for the symbols in the less probable package and
zero for the symbols in the more probable package", but this is arbitrary, and
a fresh choice can be made at every stage. Over the n - 1 merging opera-
tions there are thus 2^{n-1} distinct Huffman codes, all of which are minimum-
redundancy. Indeed, a very important point is that any assignment of prefix-
free codewords that has the same codeword lengths as a Huffman code is a
minimum-redundancy code, but that not all minimum-redundancy codes are
one of the 2^{n-1} Huffman codes. That is, there may be additional minimum-
efficiency reasons we might - and indeed will - deliberately choose to use a
minimum-redundancy code that is not a Huffman code. For example, the third
code in Table 1.1 cannot be the result of a strict application of Huffman's algo-
rithm. This notion is explored below in Section 4.3.
One further point is worth noting, and that is the handling of ties. Con-
sider the probabilities P = [0.4, 0.2, 0.2, 0.1, 0.1]. Both of |C| = [2, 2, 2, 3, 3]
and |C| = [1, 2, 3, 4, 4] result in an expected codeword length of 2.20 bits per
symbol. In this case the difference is not just a matter of labelling; instead,
it arises from the manner in which the least weight package is chosen when
there is more than one package of minimal weight. Schwartz [1964] showed
that if ties are resolved by always preferring a package that contains just one
node - that is, by favoring packages containing a single symbol x for which the
tentative code is still marked as λ - then the resultant code will have the short-
est possible maximum codeword length. This strategy works because it defers
the merging of any current multi-symbol packages, thereby delaying as long as
possible the further extension of the codewords in those packages, which must,
by construction, already be non-empty.
The sequence of mergings performed by Huffman's algorithm leads di-
rectly to a tree-based visualization. For example, Figure 4.3a shows the code
tree associated with the Huffman code constructed in Figure 4.2. Indeed, any
prefix-free code can be regarded as a code tree, and Figure 4.3b shows part
of the infinite tree corresponding to the Golomb code (with b = 5) shown in
Table 3.1 on page 33.
Visualization of a code as a tree is helpful in the sense of allowing the
prefix-free nature of the code to be seen: in the tree there is a unique path from
the root to each leaf, and the internal nodes do not represent source symbols.
The Huffman tree also suggests an obvious encoding and decoding strategy:
explicitly build the code tree, and then traverse it edge by edge, emitting bits in
the encoder, and in the decoder using input bits to select edges. Although cor-
rect, this tree-based approach is not particularly efficient. The space consumed
by the tree might be large, and the cost of an explicit pointer access-and-follow
[Figure 4.3 contains two code-tree diagrams, (a) and (b), as described in the caption.]

Figure 4.3: Examples of code trees: (a) a Huffman code; and (b) a Golomb code with
b = 5. Leaves are shown in white, and are labelled with their symbol number. Internal
package nodes are gray. The second tree is infinite.
operation per bit makes encoding and decoding relatively slow. By way of con-
trast, the procedures described in Algorithm 3.4 on page 37 have already shown
that explicit construction of a code tree is unnecessary for encoding and decod-
ing Golomb codes. Below we shall see that for minimum-redundancy coding
we can also eliminate the explicit code tree, and that minimum-redundancy en-
coding and decoding can be achieved with compact and fast loops using only
small amounts of storage. We also describe a mechanism that can be used to
construct Huffman codes simply and economically.
Huffman's algorithm has other applications outside the compression do-
main. Suppose that a set of n sorted files is to be pairwise merged to make
a single long sorted file. Suppose further that the ith file initially contains v_i
records, and that in total there are m = Σ_{i=1}^{n} v_i records. Finally, suppose (as
is the case for the standard merging algorithm) that the cost of merging lists
containing v_s and v_t records is O(v_s + v_t) time. The question at issue is de-
termination of a sequence of two-file mergings so as to minimize the total cost
of the n-way merge; the answer is to take p_i = v_i/m, and apply Huffman's
method to the n resulting weights. The length of the ith codeword |c_i| then
indicates the number of merges in which the ith of the original files should be
involved, and any sequence of pairwise merges that results in records from file
i participating in |c_i| merge steps is a minimum-cost merge. The more general
problem of sorting lists that contain some amount of pre-existing order - where
order might be expressed by mechanisms other than by counting the number
of sorted runs - has also received attention [Moffat and Petersson, 1992, Pe-
tersson and Moffat, 1995], and it is known that the best that can be done in the
n-way merging problem is approximately

    Σ_{i=1}^{n} v_i log2 (m / v_i)

comparisons. The similarity between this and the formulation given earlier for
self-information (Equation 2.4 on page 22) is no coincidence.

4.3 Canonical codes


Techniques for efficiently using minimum-redundancy codes once they have
been calculated have also received attention in the research literature. The
mechanism presented in this section is that of canonical coding [Connell, 1973,
Hirschberg and Lelewer, 1990, Schwartz and Kallick, 1964, Zobel and Mof-
fat, 1995], and our presentation here is based upon that of Moffat and Turpin
[1997]. Other techniques for the implementation of fast encoding and decoding
are considered in the next section.
  i   c_i         ℓ   w_ℓ   base[ℓ]   offset[ℓ]   lj_limit[ℓ]

  1   0           1    1       0          1            8
  2   100         2    0       2          2            8
  3   101         3    3       4          2           14
  4   110         4    2      14          5           16
  5   1110        5    0       -          7            -
  6   1111

      (a)                           (b)

Table 4.1: Example of canonical assignment of codewords, with L = 4: (a) the
codewords assigned when |C| = [1, 3, 3, 3, 4, 4]; and (b) the arrays base, offset, and
lj_limit.

In a canonical code, the codewords c_i are assigned to meet two criteria:


first, they must be of the lengths specified by Huffman's algorithm; and sec-
ond, when they are sorted by length, the codewords themselves must be lexi-
cographically ordered. This latter property is known in some of the literature
as the numerical sequence property. Table 4.1a shows the canonical code cal-
culated for the example probability distribution used in Figure 4.2 on page 54.
Shannon-Fano codes always have the numerical sequence property (see Fig-
ure 4.1 on page 52 to understand why), and for this reason canonical codes
have sometimes been called Huffman-Shannon-Fano codes [Connell, 1973].
The process that assigns canonical codewords is straightforward. The first
of the shortest codewords is assigned a code of all zeros. Then, to calculate the
codeword for the next symbol in the source alphabet, we just add one to the
binary value of the previous codeword, and, if necessary, shift left to obtain the
required codeword length. Hence in Table 4.1a, to get the "100" codeword for
symbol 2, we take the codeword for symbol 1, which is "0"; add one to it, to
get "1", and then shift left by two bits, since a three-bit code is required rather
than the previous one-bit code. The final code in Table 2.1 on page 23 (column
"MR") is also a canonical code. Two symbols - the space character, and the
letter "r" - are assigned three-bit codewords, "000" and "001" respectively.
The set of six four-bit codewords then runs from "0100" to "1001", and the
last, longest, codeword is the seven-bit sequence "1111111".
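That add-one-and-shift rule is easy to express in a few lines of Python (our sketch, not the book's implementation); given the sorted codeword lengths it regenerates the codewords of Table 4.1a.

    def canonical_codewords(lengths):
        # lengths[i] is the codeword length of symbol i+1, in non-decreasing order.
        codes, code = [], 0
        for i, ell in enumerate(lengths):
            if i > 0:
                # add one to the previous codeword, then shift left to length ell
                code = (code + 1) << (ell - lengths[i - 1])
            codes.append(format(code, '0%db' % ell))
        return codes

    print(canonical_codewords([1, 3, 3, 3, 4, 4]))
    # ['0', '100', '101', '110', '1110', '1111']
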
Because of the ordering of both the codeword lengths and the codewords
themselves, it is possible to index the list of codewords using a small number
of L-entry tables, where L is the length of a longest codeword, L = |c_n|. The
first of these tables is the base table, which records, for each codeword length
ℓ, the integer value of the first ℓ-bit codeword. The second table is the offset
array, which records the symbol number that corresponds to the first of the ℓ-
bit codewords. These two arrays are shown in the third and fourth columns of
Table 4.1b. The final column of Table 4.1b will be discussed below. If w_ℓ is
the number of ℓ-bit codewords, then the array base is described by

    base[ℓ] = 0                              if ℓ = 1,
              2 × (base[ℓ - 1] + w_{ℓ-1})    otherwise.

Using this notation, the kth of the ℓ-bit codewords is the ℓ low-order bits of the
value base[ℓ] + (k - 1) when it is expressed as a binary integer. For example,
in Table 4.1a the first four bit codeword is for symbol number five, which is
the value of offset[4] in Table 4.1b; and the code for that symbol is "1110",
which is the binary representation of the decimal value 14 stored in base[4].
By using these two arrays, the codeword corresponding to any symbol can be
calculated by first determining the length of the required codeword using the
offset array, and then its value by performing arithmetic on the corresponding
base value. The resultant canonical encoding process is shown as function
canonical_encode() in Algorithm 4.1. Note that a sentinel value offset[L + 1] =
n + 1 is required to ensure that the while loop in canonical_encode() always
terminates. The offset array is scanned sequentially to determine the codeword
length; this is discussed further below.
The procedure followed by function canonical_encode() is simple, and fast
to execute. It also requires only a small amount of memory: 2L + O(1) words
for the arrays, plus a few scalars - and the highly localized memory access pat-
tern reduces cache misses, contributing further to the high speed of the method.
In particular, there is no explicit codebook or Huffman tree as would be re-
quired for a non-canonical code. The canonical mechanism does, however,
require that the source alphabet be probability-sorted, and so for applications
in which this is not true, an n word mapping table is required to convert a raw
symbol number into an equivalent probability-sorted symbol number. Finally,
note also that the use of linear search to establish the value of ℓ is not a dominant
cost, since each execution of the while loop corresponds to exactly one
bit in a codeword. On the other hand, the use of an array indexed by symbol
number x that stores the corresponding codeword length may be an attractive
trade between decreased encoding time and increased memory.
Consider now the actions of the decoder. Let V be an integer variable
storing L as yet unprocessed bits from the input stream, where L is again the
length of a longest codeword. Since none of the codewords is longer than
L, integer V uniquely identifies both the length f of the next codeword to be
decoded, and also the symbol x to which that codeword corresponds. That is,
a lookup table indexed by V, storing symbol numbers and lengths, suffices for
decoding. For the example code and L = 4, the lookup table has 16 entries,
Algorithm 4.1
Use a canonical code to represent symbol x, where 1 ≤ x ≤ n, assuming
arrays base and offset have been previously calculated.
canonical_encode(x)
1: set ℓ ← 1
2: while x ≥ offset[ℓ + 1] do
3:   set ℓ ← ℓ + 1
4: set c ← (x - offset[ℓ]) + base[ℓ]
5: put_one_integer(c, ℓ)

Return a value x assuming a canonical code, and assuming that arrays base,
offset, and lj_limit have been previously calculated. Variable V is the current
L-bit buffer of input bits, where L is the length of a longest codeword.
canonical_decode()
1: set ℓ ← 1
2: while V ≥ lj_limit[ℓ] do
3:   set ℓ ← ℓ + 1
4: set c ← right_shift(V, L - ℓ) and V ← V - left_shift(c, L - ℓ)
5: set x ← (c - base[ℓ]) + offset[ℓ]
6: set V ← left_shift(V, ℓ) + get_one_integer(ℓ)
7: return x
of which the first eight (indexed from zero to seven) indicate symbol 1 and a
one-bit code.
The problem with this exhaustive approach is the size of the lookup table.
Even for a small alphabet, such as the set of ASCII characters, the longest
codeword could well be 15-20 bits (see Section 4.9 on page 88), and so large
amounts of memory might be required. For large source alphabets, such as
English words, codeword lengths of 30 bits or more may be encountered. For-
tunately, it is possible to substantially collapse the lookup table while still re-
taining most of the speed.
Consider the column headed lj_limit in Table 4.1b. Each of the entries
in this column corresponds to the smallest value of V (again, with L = 4)
that is inconsistent with a codeword of ℓ bits or less. For example, the value
lj_limit[1] = 8 indicates that if the first unresolved codeword in V is one-
bit long, then V must be less than eight. The values in the lj_limit array are
calculated from the array base:

    lj_limit[ℓ] = 2^L                           if ℓ = L,
                  base[ℓ + 1] × 2^{L-ℓ-1}       otherwise.


Given this array and the window V, decoding is a matter of determining ℓ
by a linear (or other) search for the value V in the array lj_limit, and then
reversing the codeword arithmetic performed by the encoder. The complete
process is described in function canonical_decode() in Algorithm 4.1. Note
the total absence of explicit bit-by-bit decoding using a Huffman tree. While
a tree-based mechanism is useful in textbook descriptions of Huffman coding,
there is no need for it in practice.
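A compact Python rendering of this decoding loop (ours; it works on a string of bits for clarity rather than on a true L-bit shift register) reproduces the behaviour of canonical_decode() for the code of Table 4.1.

    L = 4
    base     = {1: 0, 2: 2, 3: 4, 4: 14}
    offset   = {1: 1, 2: 2, 3: 2, 4: 5}
    lj_limit = {1: 8, 2: 8, 3: 14, 4: 16}

    def decode(bits):
        # Repeatedly: load an L-bit window V, search lj_limit for the codeword
        # length, then reverse the encoder's arithmetic to recover the symbol.
        out, pos, padded = [], 0, bits + '0' * L
        while pos < len(bits):
            V = int(padded[pos:pos + L], 2)
            ell = 1
            while V >= lj_limit[ell]:
                ell += 1
            c = V >> (L - ell)
            out.append(c - base[ell] + offset[ell])
            pos += ell
        return out

    print(decode('0' '100' '101' '110' '1110' '1111'))   # [1, 2, 3, 4, 5, 6]
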
As described in step 2 of Algorithm 4.1, the decoder performs a linear
search in order to determine ℓ. This does not dominate the cost of the com-
putation, as one input bit is processed for each iteration of the searching loop.
Nevertheless, faster searching methods are possible, since the array lj_limit is
sorted. Possibilities include binary search and optimal binary search. A fur-
ther alternative - a hybrid of the brute-force table-based mechanism described
above and the linear search of Algorithm 4.1 - is to use a "fast start" linear
search in which a small table is employed to eliminate some of the unneces-
sary searching effort. Suppose that 2^z words of memory are available for this
purpose, where z < L. Then an array start can be initialized to store, for each
possible z-bit prefix of V (denoted V_z) the minimum length ℓ of any codeword
that commences with that prefix. Table 4.2 lists three possible start arrays for
the example code shown in Table 4.1. In this example increasing z from z = 1
to z = 2 results in no extra discrimination, but the z = 3 table for the same
data completely determines codeword lengths.
  V_z   start[V_z]   start[V_z]   start[V_z]
          z = 1        z = 2        z = 3
   0        1            1            1
   1        3            1            1
   2                     3            1
   3                     3            1
   4                                  3
   5                                  3
   6                                  3
   7                                  4

Table 4.2: The array start for the example canonical code, for z = 1, z = 2, and
z = 3. The choice of z can be made by the decoder when the message is decoded, and
does not affect the encoder in any way.

To see how the start array is used, suppose that V contains the four bits
"1100". Then a two-bit start table (the third column of Table 4.2) indexed by
V_2 = "11" (three in decimal) indicates that the smallest value that ℓ can take
is 3, and the linear search in function canonical_decode() can be commenced
from that value - there is no point in considering smaller ℓ values for that prefix.
Indeed, any time that the ℓ value so indicated is less than or equal to the value
of z that determines the size of the start table, the result of the linear search is
completely determined, and no inspections of lj_limit are required at all. The
tests on lj_limit are also avoided when the start table indicates that the smallest
possible codeword length is L, the length of a longest codeword.
That is, step 1 of function canonical_decode() in Algorithm 4.1 can be
replaced by initializing ℓ to the value start[right_shift(V, L - z)], and the search
of steps 2 and 3 of function canonical_decode() should be guarded to ensure
that if ℓ ≤ z the while loop does not execute.
The speed improvement of this tactic arises for two complementary rea-
sons. First, the linear search is accelerated and in many cases completely
circumvented, with the precise gain depending upon the value z and its re-
lationship to L. Second, it is exactly the frequent symbols that get the greatest
benefit, since they are the ones with the short codes. Using this mechanism the
number of inspections of lj_limit might be less than one per symbol: a reduction
achieved without the 2^L-word memory overhead of a full lookup table.
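The start table itself is simple to build; the following Python sketch (ours) derives it from the codeword set by taking, for each z-bit prefix, the length of the shortest consistent codeword, and reproduces the z = 3 column of Table 4.2.

    def start_table(codewords, z):
        # For each z-bit value, the minimum length of any codeword that is
        # consistent with (a prefix of, or prefixed by) that z-bit pattern.
        table = []
        for v in range(2 ** z):
            prefix = format(v, '0%db' % z)
            table.append(min(len(c) for c in codewords
                             if prefix.startswith(c) or c.startswith(prefix)))
        return table

    print(start_table(['0', '100', '101', '110', '1110', '1111'], 3))
    # [1, 1, 1, 1, 3, 3, 3, 4]
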
As a final remark, it should be noted that the differences between Algo-
rithm 4.1 and the original presentation [Moffat and Turpin, 1997] of the table-
based method are the result of using non-increasing probabilities here, rather
than the non-decreasing probabilities assumed by Moffat and Turpin.
  Input   s1          s2          s3          s4          s5
  bits    λ           "1"         "10"        "11"        "111"

  00      s1, 1 1     s1, 2       s1, 2 1     s1, 4 1     s1, 5 1
  01      s2, 1       s1, 3       s2, 2       s2, 4       s2, 5
  10      s3          s1, 4       s1, 3 1     s1, 5       s1, 6 1
  11      s4          s5          s2, 3       s1, 6       s2, 6

Table 4.3: Example of finite-state machine decoding with k = 2 and five states, s1
to s5. Each table entry indicates the next state, and, in gray, the symbols to be out-
put as part of that transition. The second heading row shows the partial codeword
corresponding to each state.

4.4 Other decoding methods


Other authors have considered the problem of efficiently decoding minimum-
redundancy codes. Choueka et al. [1985] showed that the Huffman decoding
process can be considered as moves in a finite state machine of n - 1 states, one
for each of the internal nodes in an explicit Huffman tree; and that each bit in
the compressed input triggers a transition from one state to another and possibly
the emission of an output symbol. Given this framework, it is then natural to
consider extended edge transitions, in which k bits at a time trigger the change
of state, and more than one symbol might be emitted at each transition. The
advantage of this CKP mechanism is that only k-bits-at-a-time operations are
required in the decoder. For example, with k = 8, the compressed text is pro-
cessed a byte at a time, and the two shift operations per symbol that are used by
function canonical_decode() (together with the multiple calls to get_one_bit()
that are associated with those shift operations) are avoided. Sieminski [1988]
and Tanaka [1987] independently described a similar state-based decoder, and
other authors have followed suit (see Turpin and Moffat [1998]). Table 4.3
shows the state machine that is generated when this approach is used for the
minimum-redundancy code of Table 4.1a (page 58) and k = 2.
Each row of Table 4.3 shows the transitions applied for one two-bit input
combination. The columns correspond to the n - 1 states of the machine,
and are labelled s1 to s5. Each state can also be thought of as representing
one internal node in the corresponding Huffman tree (Figure 4.3a on page 56),
which in turn is equivalent to a partially completed prefix of bits that have
been consumed but not yet fully resolved into a codeword. The prefixes that
correspond to each state are listed in the second heading row of the table. State
s5, for example, corresponds to the situation when "111" has been observed
in the input string but not yet turned into a codeword. In state s5 the input
  Input   Table 1       Table 2       Table 3

  00      1, use 1      2, use 1      4, use 1
  01      1, use 1      2, use 1      4, use 1
  10      Table 2       3, use 1      5, use 2
  11      Table 3       3, use 1      6, use 2

Table 4.4: Example of explicit table-driven canonical decoding with k = 2. Decoding
commences in Table 1, and at each step either outputs a symbol (indicated in gray)
and consumes the specified number of bits of the two bits in the current window, or
consumes both bits and shifts to the indicated table. After a symbol is output, execution
resumes from Table 1.

"00" completes the codeword for symbol five, and also completes a codeword
for symbol one. After the codeword for symbol one, there are no remaining
unresolved bits. Hence, the entry in the table for the combination of state 85
and input of "00" shows a move to 81 (symbol>' denotes the empty string)
and the output of symbol 5 followed by symboll. Note that this method does
not require that the code be canonical. Any minimum-redundancy code can be
processed in this way.
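The transition table can be generated mechanically from the codeword set, as in this Python sketch (our construction, not the CKP authors' code): the states are the proper prefixes of the codewords, and each k-bit input is simulated one bit at a time, emitting a symbol whenever a codeword is completed.

    codes = {'0': 1, '100': 2, '101': 3, '110': 4, '1110': 5, '1111': 6}
    # the states are the proper prefixes of the codewords
    states = sorted({c[:i] for c in codes for i in range(len(c))},
                    key=lambda s: (len(s), s))

    def step(state, bits):
        # Consume k input bits from the given state; return (next state, outputs).
        out = []
        for b in bits:
            state += b
            if state in codes:
                out.append(codes[state])
                state = ''
        return state, out

    print(len(states))                 # 5 states, as in Table 4.3
    print(step('111', '00'))           # ('', [5, 1]) -- the s5/"00" entry
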
The drawback of the method is memory space. At a minimum, a list of 2^k
"next state" pointers must be maintained at each of the nodes in the finite-state
machine, where k is the number of bits processed in each operation. That is, the
total storage requirement is O(n 2^k). In a typical character-based application
(n = 100 and k = 8, say) this memory requirement is manageable. But when
the alphabet is larger - say n = 100,000 - the memory consumed is unreason-
able, and very much greater than is required by function canonical_decode().
Nor is the speed advantage as great as might be supposed: the large amount of
memory involved, and the pointer-chasing that is performed in that memory,
means that on modern cache-based architectures the tight loops and compact
structure of function canonical_decode() are faster. Choueka et al. also de-
scribe variants of their method that reduce the memory space at the expense of
increased running time, but it seems unlikely that these methods can compare
with canonical decoding.
A related mechanism is advocated by Hashemian [1995], who recommends
the use of a canonical code, together with a sequence of k-bit tables that speed
the decoding process. Each table is indexed directly by the next k bits of the
input stream, and each entry in each table indicates either a symbol number and
the number of bits (of the k that are currently being considered) that must be
used to complete the codeword for this symbol; or a new table number to use
to continue the decoding. Table 4.4 shows the tables that arise for the example
[Figure 4.4 is a scatter plot of decode memory (bytes, on a logarithmic scale from about 100 to 1,000,000) against decode speed (Mb/min), with one point for each of: CKP with k = 8, CKP with k = 4, Hashemian with k = 4, Huffman tree, Canonical, and Canonical+start.]

Figure 4.4: Decode speed and decode memory space for minimum-redundancy de-
coding methods, using a zero-order character-based model with n = 96.

canonical code of Table 4.1a (page 58) when k = 2. A similar method has
been described by Bassiouni and Mukherjee [1995]. Because all of the short
codewords in a canonical code are lexicographically adjacent, this mechanism
saves a large fraction of the memory of the brute-force approach, but is not as
compact - or as fast - as the method captured in function canonical_decode().
Figure 4.4, based on data reported by Moffat and Turpin [1997], shows the
comparative speed and memory space required by several of these decoding
mechanisms when coupled with a zero-order character-based model and exe-
cuted on a Sun SPARC computer to process a 510 MB text file. The method
of Choueka et al. [1985] is fast for both k = 4 and k = 8, but is beaten by a
small margin by the canonical method when augmented by an eight-bit start
array to accelerate the linear search, as was illustrated in Table 4.2 on page 62.
Furthermore, when k = 8 the CKP method requires a relatively large amount
of memory. The slowest of the methods is the explicit tree-based decoder, de-
noted "Huffman tree" in Figure 4.4.
Several of the mechanisms shown in Figure 4.4 need an extra mapping in
the encoder and decoder that converts the original alphabet of symbols into a
probability-sorted alphabet of ordinal symbol numbers. The amount of mem-
ory required is model dependent, and varies from application to application.
In the character-based model employed in the experiments summarized in Fig-
ure 4.4, two 256-entry arrays are sufficient. More complex models with larger
source alphabets require more space.

4.5 Implementing Huffman's algorithm


Now consider exactly how the calculation illustrated in Figure 4.2 on page 54
should be implemented. As was discussed in Section 4.3, it is desirable to
arrange the process so that codeword lengths are calculated rather than code-
words themselves. What we seek is an algorithm, and the data structures
that go with it, that takes as input an n element array of sorted symbol fre-
quencies (that is, unnormalized probabilities) and computes the corresponding
minimum-redundancy codeword lengths. This is exactly what the mechanism
described in Algorithm 4.2 does. What may be surprising is that the calcula-
tion is done using only a fixed amount of temporary memory - there are no
auxiliary arrays, because the computation is performed in-situ.
This implementation is due to Moffat and Katajainen [1995], based upon
an earlier observation made by van Leeuwen [1976]. If the input probabili-
ties are in sorted order, then at any stage the least weight item is the smaller
of the next unprocessed leaf and the next unprocessed multi-symbol package,
where "next" for leaves is determined by the original sorted order, and "next"
for packages is dictated by the order in which they were formed. Both of these
candidates are easily identified, and two linear queues suffice during the algo-
rithm. One records the original symbol weights, in sorted order. The second
contains packages, also in sorted order. Newly formed packages are appended
at the tail of this second queue, and least-weight unprocessed packages are
always available at the head.
To this end, function calculate_huffman_code() manipulates several logically distinct arrays of values, but does so in a manner that allows them all to
coexist in the same physical storage without corrupting each other. During its
operation it makes three linear scans over the array.
In the first scan (steps 1 to 10), which operates from right to left, two
activities take place in tandem. The first is the packaging operation, which
takes symbol weights at pointer s and compares them with package weights at
pointer r to form new packages that are stored at pointer x. Simultaneously,
the second activity that takes place during this phase is that any packages at r
(but not symbols at s) that do get combined into larger packages are no longer
required, so before pointer r is shifted, a pointer to the parent package - which
is at x - is stored at P[r].
At the end of this first phase the array contains a set of parent pointers
for the internal (non leaf) nodes of the calculated Huffman tree. Figure 4.5
illustrates the various values stored in array P during this computation. The
first row, marked (a), shows the original symbol weights. For consistency with
the earlier examples these are listed as fractional values, but in practice would
be stored as unnormalized integer probabilities. Row (b) of Figure 4.5 then

Algorithm 4.2
Calculate codeword lengths for a minimum-redundancy code for the symbol frequencies in array P, where P[1] ≥ P[2] ≥ ... ≥ P[n]. Three passes are made: the first, operating from n down to 1, assigns parent pointers for multi-symbol packages; the second, operating from 1 to n, assigns codeword lengths to these packages; the third, operating from 1 to n, converts these internal node depths to a corresponding set of leaf depths.
calculate_huffman_code(P, n)
1: set r ← n and s ← n
2: for x ← n down to 2 do
3:   if s < 1 or (r > x and P[r] < P[s]) then
4:     set P[x] ← P[r], P[r] ← x, and r ← r − 1
5:   else
6:     set P[x] ← P[s] and s ← s − 1
7:   if s < 1 or (r > x and P[r] < P[s]) then
8:     set P[x] ← P[x] + P[r], P[r] ← x, and r ← r − 1
9:   else
10:    set P[x] ← P[x] + P[s] and s ← s − 1
11: set P[2] ← 0
12: for x ← 3 to n do
13:   set P[x] ← P[P[x]] + 1
14: set a ← 1, u ← 0, d ← 0, r ← 2, and x ← 1
15: while a > 0 do
16:   while r ≤ n and P[r] = d do
17:     set u ← u + 1 and r ← r + 1
18:   while a > u do
19:     set P[x] ← d, x ← x + 1, and a ← a − 1
20:   set a ← 2 × u, d ← d + 1, and u ← 0
21: return P

location                                               1     2     3     4     5     6
(a) original symbol weights                           67    11     7     6     5     4
(b) first weight assigned by steps 3 to 6                   33    13     9     6     4
    second weight added by steps 7 to 10                    67    20    11     7     5
(c) combined weight                                        100    33    20    13     9
(d) final parent pointers after loop at steps 1 to 10              2     3     3     4
(e) internal node depths after steps 11 to 13                0     1     2     2     3
(f) codeword lengths after steps 14 to 20              1     3     3     3     4     4

Figure 4.5: Example of the use of function calculate_huffman_code() for the probability distribution P = [67, 11, 7, 6, 5, 4] to obtain |C| = [1, 3, 3, 3, 4, 4].

shows the weight of the two components contributing to each package, and the
row marked (c) shows the final weight of each package after step 10. Row
(d) shows the parent pointers stored at the end of steps 1 to 10 of function
calculate_huffman_code() once the loop has completed. The values stored in the first two positions of the array at this time have no relevance to the subsequent computation, and are not shown. Note that in Chapter 2, we used Vi to denote the unnormalized probability of symbol Si, and Pi (or equivalently, P[i]) to denote the normalized probability, Pi = Vi/m, where m is the length of
the message. We now blur the distinction between these two concepts, and we
use Pi (and P[i]) interchangeably to indicate either normalized or unnormal-
ized probabilities. Where the difference is important, we will indicate which
we mean. For consistency of types, in Algorithm 4.2 the array P passed as an
argument is assumed to contain unnormalized integer probabilities. Figure 4.5
thus shows the previous normalized probabilities scaled by a factor of 100.
The second pass at steps 11 to 13 - operating from left to right - converts
these parent pointers into internal node depths. The root node of the tree is
represented in location 2 of the array, and it has no parent; every other node
points to its parent, which is to the left of that node, with a smaller index.

Setting P[2] to zero, and thereafter setting P[x] to be one greater than the
depth of its parent, that is, to P[P[x]] + 1, is thus a correct labelling. Row (e)
of Figure 4.5 shows the depths that result. There is an internal node at depth
0, the root; one at depth 1 (the other child of the root is a leaf); two at depth 2
(and hence no leaves at this level); and one internal node at depth 3.
The final pass at steps 14 to 20 of function calculate_huffman_code() converts the n − 1 internal node depths into n leaf node depths. This is again
performed in a left to right scan, counting how many nodes are available (vari-
able a) at each depth d, how many have been used as internal nodes at this depth
(variable u), and assigning the rest as leaves of depth d at pointer x. Row (f) of
Figure 4.5 shows the final set of codeword lengths, ready for the construction
of a canonical code.
Note that the presentation of function calculate_huffman_code() in Algorithm 4.2 assumes in several places that the Boolean guards on "if" and "while"
statements are evaluated only as far as is necessary to determine the outcome:
in the expression "A and B" the clause B will be evaluated only if A is de-
termined to be true; and that in the expression "A or B" the clause B will be
evaluated only if A is determined to be false.
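As an illustration of how directly the pseudocode maps onto real code, the following C routine is a near-literal transcription of Algorithm 4.2. It is a sketch written for this discussion rather than the authors' reference implementation; in particular, the handling of the degenerate n = 1 case is an assumption added here.

/* In-place calculation of minimum-redundancy codeword lengths, after
 * Moffat and Katajainen [1995]; a transcription of Algorithm 4.2.  On
 * entry P[1..n] holds symbol frequencies with P[1] >= P[2] >= ... >= P[n];
 * on return P[i] is the codeword length of the ith symbol. */
void calculate_huffman_code(unsigned long P[], int n)
{
    int r, s, x, a, u, d;

    if (n == 1) {                    /* assumed convention for a one-symbol */
        P[1] = 0;                    /* alphabet in this sketch             */
        return;
    }

    /* Phase 1: package, leaving parent pointers for internal nodes. */
    r = n; s = n;
    for (x = n; x >= 2; x--) {
        if (s < 1 || (r > x && P[r] < P[s])) {
            P[x] = P[r]; P[r] = x; r--;
        } else {
            P[x] = P[s]; s--;
        }
        if (s < 1 || (r > x && P[r] < P[s])) {
            P[x] += P[r]; P[r] = x; r--;
        } else {
            P[x] += P[s]; s--;
        }
    }

    /* Phase 2: convert parent pointers into internal node depths. */
    P[2] = 0;
    for (x = 3; x <= n; x++)
        P[x] = P[P[x]] + 1;

    /* Phase 3: convert internal node depths into leaf depths. */
    a = 1; u = 0; d = 0; r = 2; x = 1;
    while (a > 0) {
        while (r <= n && P[r] == (unsigned long) d) { u++; r++; }
        while (a > u) { P[x] = d; x++; a--; }
        a = 2 * u; d++; u = 0;
    }
}

Called with P = [67, 11, 7, 6, 5, 4] and n = 6 (arrays indexed from 1), the routine leaves P = [1, 3, 3, 3, 4, 4], matching row (f) of Figure 4.5.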
In the case when the input probabilities are not already sorted there are
two alternative procedures that can be used to develop a minimum-redundancy
code. The first is obvious - simply sort the probabilities, using an additional
n-word index array to record the eventual permutation, and then use the in-
place process of Algorithm 4.2. Sorting an n-element array takes O(n log n) time, which dominates the cost of actually computing the codeword lengths. In terms of memory space, n words suffice for the index array, and so the total cost is n + O(1) additional words over and above the n words used to store the
symbol frequencies.
Alternatively, the codeword lengths can be computed by a direct appli-
cation of Huffman's algorithm. In this case the appropriate data structure to
use is a heap - a partially-ordered implicit tree stored in an array. Sedgewick
[1990], for example, gives an implementation of Huffman's algorithm using a
heap that requires 5n + O(1) words of memory in total; and if the mechanism may be destructive and overwrite the original symbol frequencies (which is the modus operandi of the in-place method in function calculate_huffman_code()) then a heap-based process can be implemented in a total of n + O(1) additional words [Witten et al., 1999], matching the memory required by the in-place alternative described in Algorithm 4.2. Asymptotically, the running time of the two alternatives is the same. Using a heap priority queue structure a total of O(n log n) time is required to process an n symbol alphabet, since on a total of 2n − 4 different occasions the minimum of a set of as many as n values must
be determined and modified (either removed or replaced), and with a heap each

such operation requires O(log n) time.


Of these two alternatives, the first is quicker - use of an explicit sorting
step, followed by a call to function calculate_huffman_code(). This is because calculate_huffman_code() operates in a sequential manner, as does a well-designed implementation of Quicksort [Bentley and McIlroy, 1993], generally accepted as being the fastest comparison-based sorting method. On modern cache-based architectures, sequential access can be considerably faster than
the random memory reference pattern necessary for maintenance of a heap data
structure.

4.6 Natural probability distributions


Function calculate_huffman_code() assumes that the symbol probabilities are
integer frequency counts accumulated by observation of some particular mes-
sage M that is to be represented. That is, it is assumed that the probabilities Pi
being manipulated are unnormalized integers; that the ith symbol of the alphabet Si appears Pi times in the message M; and that m = |M|, the length of the sequence, is equal to Σ_{i=1}^{n} Pi.
An interesting observation is that the number of distinct Pi values that are
possible is relatively small. Even if there is one symbol for which Pi = 1, a
second for which Pi = 2, another for which Pi = 3, and so on, the number of
distinct Pi values must be small relative to m, the length of the sequence. In
this pathological case, since Σ Pi = m, it must be that n ≈ √(2m), and thus that there are at most √(2m) distinct Pi values.
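To put a number on this (our arithmetic, not a figure from the text): for a message of m = 86,785,488 symbols, the size of the WSJ.Words data examined shortly, √(2m) is approximately 13,200, so even this extreme construction admits only about 13,200 distinct frequency values; the r = 5,411 reported in Table 4.5 sits comfortably below that ceiling.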
Moreover, the situation described in the previous paragraph is extreme.
Given an alphabet of size n and a message to be coded of length m ~ n, it
is clear that r, the number of distinct Pi values, is likely to be smaller than
√(2m). Moreover, the value of r becomes more and more tightly constrained as
n gets large relative to m. For example, consider a zero-order character-based
model applied to a large volume of (say) English text. Then it is likely that n,
the number of distinct characters used, and r, the number of distinct character
frequencies, are of similar magnitude. Both will be about 100 or so, while m
might be anywhere between several thousand and several million. On the other
hand, if the model is based on words rather than characters, then n is likely
to be several hundred thousand, and m possibly only one order of magnitude
bigger. In this case, there will (indeed, must) be a very large number of symbols
for which Pi = 1, a smaller (but still large) number of symbols for which
Pi = 2, and so on. That is, r will be small relative to n. This kind of "inverse
likelihood" or Zipfian distribution was used to generate test data in Section 3.5
on page 48, and is common across a range of natural domains [Zipf, 1949].
Table 4.5 shows some statistics for the word and non-word sequences derived

Parameter                     Name    WSJ.Words     WSJ.NonWords
Total symbols                 m       86,785,488    86,958,743
Distinct symbols              n       289,101       8,912
Distinct Pi values            r       5,411         690
Symbols for which Pi = 1      fr      96,111        3,523
Maximum frequency             p1      3,687,748     61,121,088
Maximum probability           p1/m    4.24%         70.28%

Table 4.5: Statistics for the word and non-word messages generated when 510 MB of English-language newspaper text with embedded SGML markup is parsed using a word-based model. Note that Pi is the unnormalized probability of symbol Si in a probability-sorted alphabet.

by applying a word-based model to 510 MB of text drawn from the Wall Street
Journal, which is part of the large TREC corpus [Harman, 1995]. The values in the table illustrate the validity of Zipf's observation - the n = 289,101 distinct
words correspond to just r = 5,411 different word frequencies.
Consider how such a probability distribution might be represented. In Sec-
tion 4.5 it was assumed that the symbol frequencies were stored in an n element
array. Suppose instead that they are stored as an r element array of pairs (p; f),
where p is a symbol frequency and f is the corresponding number of times that symbol frequency p appears in the probability distribution. For data ac-
cumulated by counting symbol occurrences, this representation will then be
significantly more compact - about 11,000 words of memory versus 290,000
words for the WSJ.Words data of Table 4.5. More importantly, the condensed
representation can be processed faster than an array representation.
What happens when Huffman's algorithm is applied to such distributions?
For the WSJ.Words probability distribution (Table 4.5), in which there are
more than 96,000 symbols that have Pi = 1, the first 48,000 steps of Huffman's
method (Algorithm 4.2 on page 67) each combine two symbols of weight 1
into a package of weight 2. But with a condensed, or runlength representation,
all that is required is that the pair (1; 96,000) be bulk-packaged to make the
pair (2; 48,000). That is, in one step all of the unit-frequency symbols can be
packaged. More generally, if p is the current least package weight, and there are f packages of that weight - that is, the pair (p; f) has the smallest p component of all outstanding pairs - then the next f/2 steps of Huffman's algorithm can be captured in the single replacement of (p; f) by (2p; f/2). We will discuss the problem caused by odd f components below. The first part of the process is shown in Algorithm 4.3, in which a queue of (p; f) pairs is maintained, with each pair recording a package weight p and a repetition counter f, and with the

Algorithm 4.3
Calculate codeword lengths for a minimum-redundancy code for the symbol frequencies in array P, where P = [(pi; fi)] is a list of r pairs such that p1 > p2 > ... > pr and Σ_{i=1}^{r} fi = n, the number of symbols in the alphabet. This algorithm shows the packaging phase of the algorithm. The initial list of packages is the list of symbol weights and the frequency of occurrence of each of those weights.
calculate_runlength_code(P, r, n)
1: while the packaging phase is not completed do
2:   set child1 ← remove_minimum(P), and let child1 be the pair (p; f)
3:   if f = 1 and P is now empty then
4:     the packaging phase is completed, so exit the loop and commence the extraction phase (Algorithm 4.4) to calculate the codeword lengths
5:   else if f > 1 is even then
6:     create a pair new with the value (2 × p; f/2) and insert new into P in the correct position, with new marked as "internal"
7:     set new.first_child ← child1 and new.other_child ← child1
8:   else if f > 1 is odd then
9:     create a pair new with the value (2 × p; (f − 1)/2) and insert new into P in the correct position, with new marked "internal"
10:    set new.first_child ← child1 and new.other_child ← child1
11:    insert the pair (p; 1) at the head of P
12:  else if f = 1 and P is not empty then
13:    set child2 ← remove_minimum(P), and let child2 be the pair (q; g)
14:    create a pair new with the value (p + q; 1) and insert new into P in the correct position, with new marked "internal"
15:    set new.first_child ← child1, and new.other_child ← child2
16:    if g > 1 then
17:      insert the pair (q; g − 1) at the head of P

Algorithm 4.4
Continuation of function calculate_runlength_code(P, r, n) from Algorithm 4.3. In this second phase the directed acyclic graph generated in the first phase is traversed, and a pair of depths and occurrence counts assigned to each of the nodes.
1: let root be the last node taken from P
2: set root.depth ← 0, root.first_count ← 1, and root.other_count ← 0
3: for all other nodes pair in the directed acyclic graph rooted at root do
4:   set pair.depth ← 0, pair.first_count ← 0, and pair.other_count ← 0
5: for each descendant node pair in the acyclic graph rooted at root do
6:   set child ← pair.first_child and d ← pair.depth
7:   if child.depth = 0 or child.depth = d + 1 then
8:     set child.depth ← d + 1
9:     add pair.first_count to child.first_count
10:    add pair.other_count to child.other_count
11:  else if child.depth = d then
12:    add pair.first_count to child.other_count
13:  repeat steps 6 to 12 once, with child ← pair.other_child
14: for each non-internal node pair in the acyclic graph rooted at root do
15:   generate pair.first_count codewords of length pair.depth
16:   generate pair.other_count codewords of length pair.depth + 1
17: return the resultant set of codeword lengths

queue ordered by increasing p values. At each cycle of the algorithm the pair
with the least p value is removed from the front of the queue and processed.
Processing of a pair consists of doing one of three things.
First, if f = 1 and there are no other pairs in P then the packaging phase
of the process is finished, and the first stage of the algorithm terminates. This
possibility is handled in steps 3 and 4. The subsequent process of extracting
the codeword lengths is described in Algorithm 4.4, and is discussed below.
Second, if f > 1 then the algorithm can form one or more new packages all
of the same weight, as outlined above. If f is even, this is straightforward, and
is described in steps 6 and 7. When f is odd, not all of the packages represented
by the current pair are consumed, and in this case (steps 9 to 11) the current
pair (p; f) is replaced by two pairs, the second of which is a reduced pair (p; 1)
that will be handled during a subsequent iteration of the main loop.
The final possibility is that f = 1 and P is not empty. In this case the
single package represented by the pair (p; f) must be combined with a single
package taken from the second pair in the queue P. Doing so may or may not
exhaust this second pair, since it too might represent several packages. These
various situations are handled by steps 13 to 17.
When the queue has been exhausted, and the last remaining pair has a rep-
etition count of f = 1, a directed acyclic graph structure of child pointers has
been constructed. There is a single root pair with no parents, which corresponds
to the root of the Huffman code tree for the input probabilities; and every other
node in the graph is the child of at least one other node. Because each node has
two children (marked by the pointers first_child and other_child) there may be
multiple paths from the root node to each other node, and each of these possi-
ble paths corresponds to one codeword, of bit-length equal to the length of that
path. Hence, a simple way of determining the codeword lengths is to exhaus-
tively explore every possible path in the graph with a recursive procedure. Such
a procedure would, unfortunately, completely negate all of the saving achieved
by using pairs, since there would be exactly one path explored for every symbol
in the alphabet, and execution would of necessity require Ω(n) time.
Instead, a more careful process is used, and each node in the graph is vis-
ited just twice. Algorithm 4.4 gives details of this technique. The key to the
improved mechanism is the observation that each node in the graph (represent-
ing one pair) can only have two different depths associated with it, a conse-
quence of the sibling property noted by Gallager [1978] and described in Sec-
tion 6.4. Hence, if the nodes are visited in exactly the reverse order that they
were created, each internal node can propagate its current pair of depths and
their multiplicities to both of its children. The first time each node is accessed
it is assigned the lesser of the two depths it might have, because that depth cor-
responds to the shortest of the various paths to that node. At any subsequent

Figure 4.6: Example of the use of function calculate_runlength_code() on the runlength probability distribution P = [(6; 1), (3; 2), (2; 4), (1; 5)] to generate a code (also in runlength form) |C| = [(2; 1), (3; 3), (4; 4), (5; 4)]. White nodes are original symbols, and are examined from right to left; gray nodes are internal packages, and are both generated and considered from right to left. [Diagram not reproduced; it shows the root at the left, the original symbols at the right, and the packages formed between them.]

accesses via other parents of the same depth as this parent (steps 8 to 10) the two counters are incremented by the multiplicity of the corresponding counters for that parent. On the other hand, step 12 caters for the case when the child is already labelled with the same depth as the parent that is now labelling it. In this case the parent must have an other_count of zero, and only the first_count needs to be propagated, becoming an other_count (that is, a count of nodes at depth one greater than indicated by the depth of that node) at the child node.
The result of this procedure is that three values are associated with each of the original (p; f) pairs of the input probability distribution, which are the only nodes in the structure not marked as being "internal". The first of these is the depth of that pair, and all of the f symbols in the original source alphabet that are of probability p are to have codewords of length either depth or depth + 1. The exact number of each length is stored in the other two fields that are calculated for each pair: first_count is the number of symbols that should be assigned codewords of length depth, and other_count is the number of symbols that should be assigned codewords of length depth + 1. That is, f = first_count + other_count.
Figure 4.6 shows the action of calculate_runlength_code() on the probabil-
ity distribution P = [(6; 1), (3; 2), (2; 4), (1; 5)], which has a total of n = 12

symbols in r = 4 runlengths. The edges from each node to its children are also
shown. The four white nodes in the structure are the leaf nodes corresponding
to the original runs; and the gray nodes represent internal packages. The pro-
cessing moves from right to left, with the gray node labelled "p = 2; f = 2" the
first new node created. The root of the entire structure is the leftmost internal
node.
How much time is saved compared with the simpler O(n) time method of function calculate_huffman_code() in Algorithm 4.2? The traversal phase shown in Algorithm 4.4 clearly takes O(1) time for each node produced during
the first packaging phase, since the queue operations all take place in a sorted
list with sequential insertions. To bound the running time of the whole method
it is thus sufficient to determine a limit on the number of nodes produced during
the first phase shown in Algorithm 4.3. Each iteration requires the formation of
exactly one new node. Each iteration does not, however, necessarily result in
the queue P getting any shorter, since in some circumstances an existing node
is retained as well as a new node being added. Instead of using the length of P
as a monotonically decreasing quantity, consider instead the value

Φ(P) = Δ(P) + Σ_{(p;f)∈P} (1 + 3·log2 f)

where

Δ(P) = { 1   when the pair (p; f) at the head of P has f = 1,
       { 0   otherwise.

Quantity Φ(P) is positive at all times during calculate_runlength_code(). Furthermore, it can be shown that each execution of steps 6 and 7, steps 9 to 11, or steps 13 to 17 in Algorithm 4.3, decreases Φ(P) by at least 1. Hence, Φ(P), where P is now the initial list of (p; f) pairs given as input to the whole process, is an upper bound on the number of new nodes created, and on the running time of the entire process. This result means that if there are r pairs (pk; fk) in the runlength representation of the probability distribution P, then the running time of function calculate_runlength_code() is bounded by
1 + Σ_{k=1}^{r} (1 + 3·log2 fk) = O(r + r·log(n/r))

where, as before, r is the number of distinct Pi values, n = Σ_{k=1}^{r} fk is the number of symbols in the source alphabet, and m = Σ_{k=1}^{r} pk·fk is the length of the message being processed.
To give some concrete figures, for the WSJ.Words probability distribution summarized in Table 4.5, the expression 3(r + r log2(n/r)) has the value 94,000, which is about 1/3 of the value of n; and the function Φ(P) is about 50,000. Moreover, the analysis is pessimistic, since some steps decrease Φ by
more than one. Experimentation with an implementation of the method shows
that the calculation of a minimum-redundancy code for the WSJ.Words distri-
bution can be carried out with the formation of just 30,000 pairs.
At this point the reader may well be thinking that the runlength mech-
anism is interesting, but not especially useful, since it is only valid if the
source probability distribution is supplied as a list of runlengths, and that is
unlikely to happen. In fact, it is possible to convert a probability-sorted array representation of the kind assumed for function calculate_huffman_code() (Algorithm 4.2 on page 67) into the runlength representation used by function calculate_runlength_code() in time exactly proportional to Φ(P). Moreover, the conversion process is an interesting application of the Elias Cγ code described in Section 3.2 on page 32.
Suppose that P is an array of n symbol frequency counts in some message M, sorted so that P[i] ≥ P[i + 1]. Suppose also that a value j has been determined for which P[j − 1] > P[j]. To find the number of entries in P that have the same frequency as P[j] we examine the entries P[j + 1], P[j + 3], P[j + 7], P[j + 15], and so on in an exponential manner, until one is found that differs from P[j]. A binary search can then be used to locate the last entry P[j'] that has the same value as P[j]. That is, an exponential and binary search should be used. If P[j] contains the kth distinct value in P, then the kth pair (pk; fk) of the runlength representation for P must thus be (P[j]; j' − j + 1). The exponential and binary searching process then resumes from P[j' + 1].
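A possible coding of this conversion is sketched below in C; the pair type, the function name, and the exact bookkeeping are inventions for the purpose of illustration, but the probe schedule follows the P[j + 1], P[j + 3], P[j + 7], ... pattern just described.

/* Convert a probability-sorted frequency array P[1..n], P[i] >= P[i+1],
 * into runlength pairs (p; f) using exponential-and-binary search.
 * Returns r, the number of pairs written to runs[0..r-1].  Illustrative
 * sketch only; the names are not drawn from the book's implementation. */
typedef struct { unsigned long p; int f; } pair_t;

int frequencies_to_runlengths(const unsigned long P[], int n, pair_t runs[])
{
    int r = 0;                       /* pairs produced so far              */
    int j = 1;                       /* first entry of the current run     */
    while (j <= n) {
        int step = 2, lo = j, hi = j + 1;   /* probes j+1, j+3, j+7, ...   */
        while (hi <= n && P[hi] == P[j]) {
            lo = hi;                 /* P[lo] is known to match P[j]       */
            step *= 2;
            hi = j + step - 1;
        }
        if (hi > n)
            hi = n + 1;              /* the run might extend to P[n]       */
        while (lo + 1 < hi) {        /* binary search: P[lo] matches,      */
            int mid = (lo + hi) / 2; /* P[hi] (if it exists) does not      */
            if (P[mid] == P[j])
                lo = mid;
            else
                hi = mid;
        }
        runs[r].p = P[j];            /* the kth distinct frequency value   */
        runs[r].f = lo - j + 1;      /* and its repetition count           */
        r++;
        j = lo + 1;                  /* resume searching after this run    */
    }
    return r;
}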
The cost of determining a pair (pk; fk) is approximately 2·log2 fk - the cost in bits of the Cγ code for integer fk - meaning that the total cost of the r searches required to find all of the runlengths is

Σ_{k=1}^{r} (1 + 2·log2 fk) = O(r + r·log(n/r)),

the same asymptotic cost as function calculate_runlength_code().


In the natural distribution arising from a message M that is m symbols long, it must take at least O(m) time to simply count the frequencies of the symbols, and then the symbol frequencies must be sorted (taking O(n log n) time if a comparison-based sorting method is used) into non-increasing order so that the runlength collection process can be applied. Once the code has been constructed it will also take at least O(m) ≥ O(n) time to encode the message M, since every symbol of M must be handled. Hence, if we are starting from
scratch with a natural distribution, the best we can possibly hope for is O(m)
time, and so the O(n) time array-based codeword length calculation may as
well be used - exactly the scepticism assumed of the reader a few paragraphs

earlier. There is, however, one very useful application of a runlength-based


code computation, and that is when the symbol weights are artificially con-
strained to certain values. The next section develops this theme.

4.7 Artificial probability distributions


In Section 3.3 on page 36 it was observed that, with the correct choice of pa-
rameter b, a Golomb code is minimum-redundancy for the infinite alphabet
given by the geometric distribution Px = (1 − p)^(x−1) p. It is also easy to calculate a minimum-redundancy code for the uniform distribution Px = 1/n: just use a minimal binary code. These are two artificial distributions for which a minimum-redundancy code can be calculated quickly. This section considers
another such special case.
Suppose that a list of symbol weights is given in which all values are integer
powers of two. For example, suppose that P = [(8; 1), (4; 4), (2; 2), (1; 2)] is
the distribution of symbol frequencies (in run length form) for which a code is
required. Each step of Huffman's procedure generates fresh packages, and, in
general, the packages formed can be of arbitrary weight.
But in a probability distribution constrained to integral powers of two, most
of the packages have weights that are also powers of two. To be precise, if
d is a non-negative integer, then when the initial weights are all powers of
two, there can be at most one package of weight p such that 2^d < p < 2^(d+1). Moreover, there is no need to distinguish between the packages of weight 2^d for some integer d and the original symbols of weight 2^d, and hence no need
to create new nodes that differentiate between them. Algorithm 4.5 details
the process necessary to take a probability distribution P in runlength form in
which all weights are powers of two, and calculate corresponding codeword
lengths [Turpin and Moffat, 2001].
In the algorithm, the array entry symbols[d] records the multiplicity of weight 2^d in the source probabilities, and is initialized from the pairs in the distribution P. Similarly, packages[d] notes the number of packages of weight 2^d, each of which is created from symbols of weight less than 2^d. The total count of packages of weight 2^d is denoted by total[d]. One further array is used - irregular[d] - to record the weights of the packages that are not powers of two. Initially there are no irregular packages, and irregular[d] is set to "not used" for all d.
During each iteration, the symbols and packages currently at level d are converted into packages at level d + 1 (steps 8 to 10). Halving total[d] gives the number of new packages created at level d + 1. But either or both of a regular and an irregular package might remain unprocessed at level d, and must be handled. If level d has an odd total but not an irregular package (steps 11

Algorithm 4.5
Calculate codeword lengths for a minimum-redundancy code for the symbol frequencies in array P, where P = [(pi; fi)] is a list of r pairs, with each pi an integral power of two, and p1 > p2 > ... > pr. In each tuple fi is the corresponding repetition count, and Σ_{i=1}^{r} fi = n.
calculate_twopower_code(P, r, n)
1: for d ← 0 to ⌊log2 m⌋ do
2:   set symbols[d] ← 0, and packages[d] ← 0
3:   set total[d] ← 0, and irregular[d] ← "not used"
4: for each (pi; fi) in P do
5:   set symbols[log2 pi] ← fi
6:   set total[log2 pi] ← fi
7: for d ← 0 to ⌊log2 m⌋ do
8:   set packages[d + 1] ← total[d] div 2
9:   set total[d + 1] ← total[d + 1] + packages[d + 1]
10:  set total[d] ← total[d] − 2 × packages[d + 1]
11:  if total[d] > 0 and irregular[d] = "not used" then
12:    determine the smallest g > d such that total[g] > 0
13:    set total[g] ← total[g] − 1
14:    set irregular[g] ← 2^g + 2^d
15:  else if total[d] > 0 then
16:    set irregular[d + 1] ← irregular[d] + 2^d
17:  else if irregular[d] ≠ "not used" then
18:    determine the smallest g > d such that total[g] > 0
19:    set total[g] ← total[g] − 1
20:    set irregular[g] ← 2^g + irregular[d]
21: for d ← ⌊log2 m⌋ down to 1 do
22:   propagate node depths from level d to level d − 1, assigning symbols[d − 1] codeword lengths at level d − 1

Figure 4.7: Example of the use of function calculate_twopower_code() on the runlength probability distribution P = [(8; 1), (4; 4), (2; 2), (1; 2)] to obtain (also in runlength form) |C| = [(2; 1), (3; 4), (4; 4)]. Labels inside oval symbol and package nodes represent multiplicities of weight 2^d; labels inside circular irregular nodes represent weights, always between 2^d and 2^(d+1). [Diagram not reproduced; it tabulates the symbols[d], packages[d], and irregular[d] entries for levels d = 4 down to d = 0.]

to 14), an irregular package must be created at level g, where g is the next level with any packages or symbols available for combination. The weight of the new irregular package is thus 2^d + 2^g, and one of the objects at level g must be noted as having been consumed. Similarly, if both a regular package and an irregular package remain at level d, they can be combined to make an irregular package at level d + 1 (steps 15 and 16). Finally, if an irregular package is available at level d, but no regular package or symbol (steps 17 to 20), a combining package must again be located and used.
Figure 4.7 shows the computation of packages that takes place for the ex-
ample probability distribution P = [(8; 1), (4; 4), (2; 2), (1; 2)]. In the figure,
nodes corresponding to pairs in the list of symbols are represented as white
ovals, while packages are gray ovals. In the first step, working from right to
left, the two symbols of weight one at d = 0 are combined to make a pack-
age of weight two. Two of the three packages (including original symbols) of
weight two at d = 1 are then combined to make a package of weight four;
the third begins the chain of irregular packages by being joined with a package
from level g = 2. Each subsequent level also has an irregular package, but
there is at most one per level, and so it is possible to record (as is shown in the
figure) the weight of that package within the array element irregular[d].
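As a quick check on the example (our arithmetic): the total weight of the input is m = 1×8 + 4×4 + 2×2 + 2×1 = 30, so the last package formed has weight 30 and ⌊log2 30⌋ = 4 levels suffice, which is why the levels in Figure 4.7 run from d = 0 to d = 4; and the resulting code |C| = [(2; 1), (3; 4), (4; 4)] satisfies the Kraft equality, since 2^−2 + 4·2^−3 + 4·2^−4 = 1.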

Once the right-to-left packaging process is completed, a left-to-right labelling stage similar to that already detailed in Algorithm 4.4 is required to calculate the lengths of the paths to the original symbols. In the case of the example, this generates the output code (which can for brevity also be described in a runlength form) |C| = [(2; 1), (3; 4), (4; 4)], that is, one codeword of length two; four of length three; and four of length four.
Both the packaging and labelling processes are extremely fast. The last package generated, regular or irregular, will have weight m, where m is the length of the source message that gave rise to the probability distribution, and so exactly ⌊log2 m⌋ levels are required. Furthermore, each step in the right-to-left packaging process, and each step in the left-to-right labelling process, takes O(1) time. Total time is thus O(log m). To put this into perspective, the WSJ.Words data has m = 86 × 10^6, and ⌊log2 m⌋ < 27. Including both phases, fewer than 60 loop iterations are required.
A generalization of this mechanism is possible. Suppose that k is an integer, and that T is the kth root of two, T = 2^(1/k). Suppose further that the input probability distribution is such that all weights are integer powers of T. For example, with k = 1 we have the situation already described; when k = 2 we allow weights in
[1, √2, 2, 2√2, 4, 4√2, 8, ...] ≈ [1, 1.41, 2, 2.83, 4, 5.66, 8, ...],
and so on. Then by adopting a similar algorithm - in which symbols or packages at level d combine to form new packages at level d + k - a minimum-redundancy code can be constructed in log_T m steps, that is, in O(k log m) time and space [Turpin and Moffat, 2001].
By now, however, the sceptical reader will be actively scoffing - why on
earth would a probability distribution consist solely of weights that are powers
of some integer root of two? The answer is simple: because we might force
them to be! And while this may seem implausible, we ask for patience - all is
revealed in Section 6.10 on page 179.

4.8 Doing the housekeeping chores


It is tempting to stop at this point, and say "ok, that's everything you need
to know about implementing minimum-redundancy coding". But there is one
more important aspect that we have not yet examined - how to pull all the
various parts together into a program that actually gets the job done. That is
what this section is about - doing the coding housekeeping so that everything
is neat, tidy, and operational. In particular, we now step back from the previous
assumptions that the source alphabet is probability-sorted and consists of inte-
gers in S = [1 ... n], all of which have non-zero probability. We also admit that

it will not be possible to know the symbol occurrence frequencies in advance,


and that they cannot be assumed to be available free of charge to the decoder.
Finally, we must also allow for the fact that buffering concerns will mandate
the sectioning of the source message into manageable blocks. Our system must
fully process each block before reading the next.
To be precise, we now suppose that the m-symbol source message might be
a block of a larger message; that within each block the n-element subalphabet
is some subset of S = [1 ... nmax], where nmax is the maximum symbol number
that appears in that block; and that the symbol probabilities within each block
are completely unordered. In short, we assume only that the message is a
stream of integers, and that the implementation must cope with all possible
remaining variations. The description we give is based upon the approach of
Turpin and Moffat [2000].
The first thing that must be transmitted from encoder to decoder as part of
each block is a prelude that describes the structure of the code to be used for
that block, and some other attributes of the message being represented. Be-
fore decoding can commence the decoder must know, either implicitly through
the design of the compression system, or explicitly via transmission from the
encoder, all of:

• an integer m, the length of this block of the source message;

• an integer nmax, an upper bound on the maximum symbol identifier that occurs in this block of the message;

• an integer n ≤ nmax, the number of distinct integers in [1 ... nmax] that appear in this block of the message;

• an integer L, the maximum length of any codeword in this block of the message;

• a list of n integers, each between 1 and nmax, indicating the subalphabet of [1 ... nmax] that appears in this block of the message; and

• a list of n integers, each between 1 and L, indicating the corresponding codeword lengths in bits for the symbols in the subalphabet of this block of the message.

Only after all of these values are in the hands of the decoder may the encoder
- based upon a code derived solely from the transmitted information, and no
other knowledge of the source message or block - start emitting codewords.
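As a small (and entirely hypothetical) illustration of the prelude, consider a block M = [9, 3, 9, 20, 9, 3, 7] declared against a universe of nmax = 20 possible symbols. Then m = 7, the subalphabet is the n = 4 symbols {3, 7, 9, 20}, the symbol frequencies are 2, 1, 3, and 1, and a minimum-redundancy code assigns codeword lengths of 2, 3, 1, and 3 bits respectively, so that L = 3. The prelude for this block therefore carries m = 7, nmax = 20, n = 4, and L = 3; the subalphabet list 3, 7, 9, 20; and the corresponding codeword lengths 2, 3, 1, 3.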
Algorithm 4.6 details the actions of the encoder, and shows how the prelude
components are calculated and then communicated to the waiting decoder. The
first step is to calculate the symbol frequencies in the block at hand. Since

Algorithm 4.6
Use a minimum-redundancy code to represent the m symbol message M, where 1 ≤ M[i] ≤ nmax for 1 ≤ i ≤ m. Assume initially that table[i] = 0 for 1 ≤ i ≤ nmax.
mr_encode_block(M, m)
1: set n ← 0
2: for i ← 1 to m do
3:   set x ← M[i]
4:   if table[x] = 0 then
5:     set n ← n + 1 and syms_used[n] ← x
6:   set table[x] ← table[x] + 1
7: sort syms_used[1 ... n] using table[syms_used[i]] as the sort keys, so that
   table[syms_used[1]] ≥ table[syms_used[2]] ≥ ... ≥ table[syms_used[n]]
8: use function calculate_huffman_code() to replace table[x] by the
   corresponding codeword length, for x ∈ {syms_used[i] | 1 ≤ i ≤ n}
9: set L ← table[syms_used[n]]
10: sort syms_used[1 ... n] so that
    syms_used[1] < syms_used[2] < ... < syms_used[n]
11: set nmax ← syms_used[n]
12: set w[i] ← the number of codewords of length i in table
13: set base[1] ← 0, offset[1] ← 1, and offset[L + 1] ← n + 1
14: for i ← 2 to L do
15:   set base[i] ← 2 × (base[i − 1] + w[i − 1])
16:   set offset[i] ← offset[i − 1] + w[i − 1]
17: use function elias_delta_encode() to encode m, nmax, n, and L
18: use function interpolative_encode() to encode syms_used[1 ... n]
19: for i ← 1 to n do
20:   unary_encode((L + 1) − table[syms_used[i]])
21: for i ← 2 to L do
22:   set w[i] ← offset[i]
23: for i ← 1 to n do
24:   set sym ← syms_used[i] and code_len ← table[sym]
25:   set table[sym] ← w[code_len]
26:   set w[code_len] ← w[code_len] + 1
27: for i ← 1 to m do
28:   canonical_encode(table[M[i]]), using base and offset
29: for i ← 1 to n do
30:   set table[syms_used[i]] ← 0

all that is known is that nmax is an upper bound on each of the m integers in the input message, an array of nmax entries is used to accumulate symbol frequencies. At the same time (steps 1 to 6 of function mr_encode_block()) the value of n - the number of symbols actually used in this block - is noted. Array table serves multiple purposes in function mr_encode_block(). In this first phase, it accumulates symbol frequencies.
Once the block has been processed, the array of symbols - syms_used - is sorted into non-increasing frequency order (step 7) in the first of two sorting steps that are employed. Any sorting method such as Quicksort can be used. The array table of symbol frequencies is next converted into an array of codeword lengths by function calculate_huffman_code() (Algorithm 4.2 on page 67). After the calculation of codeword lengths, array syms_used is sorted into a third ordering, this time based upon symbol number. Quicksort is again an appropriate mechanism.
From the array of codeword lengths the L-element arrays base and offset used during the encoding are constructed (steps 12 to 16), and the prelude sent to the decoder (steps 17 to 20). Elias's Cδ code, the interpolative code of Section 3.4, and the unary code all have a part to play in the prelude. Sending the codeword lengths as differences from L + 1 using unary is particularly effective, since there can be very few short codewords in a code, and will almost inevitably be many long ones. We might also use a minimum-redundancy code recursively to transmit the set of n codeword lengths, but there is little to be gained - a minimum-redundancy code would look remarkably like a unary code for the expected distribution of codeword lengths, and there must be a base to the recursion at some point or another.
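For instance, for the codeword lengths [1, 3, 3, 3, 4, 4] of Figure 4.5, L = 4, and the values actually passed to unary_encode() are (L + 1) − |ci|, that is, 4, 2, 2, 2, 1, 1; whatever bit-level convention is used for the unary code, those six values cost 4 + 2 + 2 + 2 + 1 + 1 = 12 bits in total, with each of the many long codewords accounted for by just one or two bits.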
Array w is then used to note the offset value for each different codeword length, so that a pass through the set of symbols in symbol order (steps 21 to 26) can be used to set the mapping between source symbol numbers in the sparse alphabet of M and the dense probability-sorted symbols in [1 ... n] used for the actual canonical encoding. This is the third use of array table - it now holds, for each source symbol x that appears in the message (or in this block of it), the integer that will be coded in its stead.
After all this preparation, we are finally ready (steps 27 to 28) to use function canonical_encode() (Algorithm 4.1 on page 60) to send the m symbols that comprise M, using the mapping stored in table. Then, as a last clean-up stage, the array table is returned to the pristine all-zeroes state assumed at the commencement of mr_encode_block(). This step requires O(n) time if completed at the end of the function, versus the O(nmax) time that would be required if it was initialized at the beginning of the function.
In total, there are two O(m)-time passes over the message M; a num-
ber of O(n)-time passes over the compact source alphabet stored in array

syms_used; and two O(n log n)-time sorting steps. Plus, during the calls to function canonical_encode(), a total of c output bits are generated, where c ≥ m. Hence, a total of O(m + n log n + c) = O(n log n + c) time is required for each m-element block of a multi-block message, where n is the number of symbols used in that block, and c is the number of output bits. A one-off initialization charge of O(nmax) time to set array table to zero prior to the first block of the message must also be accounted for, but can be amortized over all of the blocks of the message, provided that nmax ≤ Σ m, the length of the complete source message.
In terms of space, the nmax-word array table is used and then, to save space, re-used two further times. The only other large array is syms_used, in which n words are used, but for which nmax words must probably be allocated. All of the other arrays are only L or L + 1 words long, and consume a minimal amount of space. That is, the total space requirement, excluding the m-word buffer M passed as an argument, is 2·nmax + O(L) words. No trees are used, nor any tree pointers.
If n ≪ nmax, both table and syms_used are used only sparsely, and other structures might be warranted if memory space is important. For example, syms_used might be allocated dynamically and resized when required, and array table might be replaced by a dictionary structure such as a hash table or search tree. These substitutions increase execution time, but might save memory space when nmax is considerably larger than n and the subalphabet used in each block of the message is not dense.
Algorithm 4.7 details the inverse transformation that takes place in the decoder. The two n-element arrays syms_used and table are again used, and the operations largely mirror the corresponding steps in the encoder. As was the case in the encoder, array table serves three distinct purposes - first to record the lengths of the codewords; then to note symbol numbers in the probability-sorted alphabet; and finally to record which symbols have been processed during the construction of the inverse mapping. This latter step is one not required in the encoder. The prelude is transmitted in symbol-number order, but the decoder mapping table - which converts a transmitted symbol identifier in the probability-sorted alphabet back into an original symbol number - must be the inverse of the encoder's mapping table. Hence steps 12 to 21. This complex code visits each entry in syms_used in an order dictated by the cycles in the permutation defined by array table, and assigns to it the corresponding symbol number in the sparse alphabet. Once the inverse mapping has been constructed, function canonical_decode() (Algorithm 4.1 on page 60) is used to decode each of the m symbols in the compressed message block.
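The cycle-following logic of steps 12 to 21 can be easier to digest in code form; the following C sketch (with "done" represented by the illustrative marker value 0, which cannot be a legal position) performs the same in-place inversion.

/* In-place inversion of the symbol mapping, mirroring steps 12 to 21 of
 * mr_decode_block().  On entry syms_used[i] is the ith symbol in
 * subalphabet order and table[i] is that symbol's position in the
 * probability-sorted alphabet; on exit syms_used[c] is the original symbol
 * number for probability-sorted identifier c.  Illustrative sketch only. */
#define DONE 0                     /* assumed marker; positions are >= 1    */

void invert_mapping(int syms_used[], int table[], int n)
{
    int start;
    for (start = 1; start <= n; start++) {
        int from = start;
        int sym = syms_used[start];
        while (table[from] != DONE) {
            int i = table[from];   /* destination of sym                    */
            int tmp;
            table[from] = DONE;    /* mark this slot as processed           */
            tmp = syms_used[i];    /* displace the current occupant ...     */
            syms_used[i] = sym;
            sym = tmp;             /* ... and carry it around the cycle     */
            from = i;
        }
    }
}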
Despite the nested loops, steps 12 to 21 require O(n) time in total, since
each symbol is moved only once, and stepped over once. Moreover, none of

Algorithm 4.7
Decode and return an m-symbol message M using a minimum-redundancy code.
mr_decode_block()
1: use function elias_delta_decode() to decode m, nmax, n, and L
2: interpolative decode the list of n symbol numbers into syms_used[1 ... n]
3: for i ← 1 to n do
4:   set table[i] ← (L + 1) − unary_decode()
5: set w[i] ← the number of codewords of length i in table
6: construct the canonical coding tables base, offset, and lj_limit from w
7: for i ← 2 to L do
8:   set w[i] ← offset[i]
9: for i ← 1 to n do
10:   set sym ← syms_used[i] and code_len ← table[i]
11:   set table[i] ← w[code_len] and w[code_len] ← w[code_len] + 1
12: set start ← 1
13: while start ≤ n do
14:   set from ← start and sym ← syms_used[start]
15:   while table[from] ≠ "done" do
16:     set i ← table[from]
17:     set table[from] ← "done"
18:     swap sym and syms_used[i]
19:     set from ← i
20:   while start ≤ n and table[start] = "done" do
21:     set start ← start + 1
22: set V ← get_one_integer(L)
23: for i ← 1 to m do
24:   set c ← canonical_decode(), using V, base, offset, and lj_limit
25:   set M[i] ← syms_used[c]
26: return m and M

Figure 4.8: Cost of the prelude components for subalphabet selection and codeword lengths, and cost of codewords, for the files WSJ.Words (on the left) and WSJ.NonWords (on the right), for different block sizes. Each file contains approximately m = 86 × 10^6 symbols, and is described in more detail in Table 4.5. [Plots not reproduced; each panel shows the cost contributed by the codewords, the code lengths, and the subalphabet components, for block sizes ranging from 1,000 to 1,000,000 symbols.]

the other steps require more than O(n) time except for the canonical decoding. Hence the decoder operates faster than the encoder, in a total of O(m + n + c) = O(n + c) time, where c ≥ m is again the number of bits in the compressed message. The space requirement is 2n + O(L) words, regardless of nmax.
Figure 4.8 summarizes the overall compression effectiveness achieved by function mr_encode_block() on the files WSJ.Words and WSJ.NonWords (described in Table 4.5 on page 71) for block sizes varying from m = 10^3 to m = 10^6. When each block is very small, a relatively large fraction of the
compressed message is consumed by the prelude. But the codes within each
block are more succinct, since they are over a smaller subalphabet, and overall
compression effectiveness suffers by less than might be thought. At the other
extreme, when each block is a million or more symbols, the cost of transmitting
the prelude is an insignificant overhead.
However, encoding efficiency - with its n log n factor per block - suffers
considerably on small blocks. In the same experiments, it took more than ten times longer to encode WSJ.Words with m = 10^3 than it did with m = 10^5,
because there are 100 times as many sorting operations performed, and each
involves considerably more than 1/100 of the number of symbols. Decoding
speed was much less affected by block size, and even with relatively large
block sizes, the decoder operates more than four times faster than the encoder,

a testament to the speed of the start and lj_base-assisted canonical decoding process.
The implementation used for these experiments is available from the web page for this book at www.cs.mu.oz.au/caca.

4.9 Related material


Function calculate_huffman_code() (Algorithm 4.2 on page 67) is a space and time efficient method for calculating a minimum-redundancy prefix code. But it is destructive: it replaces the input probabilities P with the codeword lengths |C|. If we wish to keep a copy of the array P - perhaps so that we can assess the quality of the code that is generated - a space overhead of n words of memory is required. Milidiú et al. [2001] considered this problem, and showed that it is possible to generate codeword lengths in a form suitable for canonical coding in a non-destructive manner using just O(L) space above and beyond the n words in P, where L is the length of a longest codeword. Like function calculate_huffman_code(), their mechanism operates in O(n) time when P is probability-sorted. It is based upon a combination of runlength-based code construction (Algorithm 4.3 on page 72) and a novel technique they call homogenization. Milidiú et al. show that, in certain circumstances, a sequence of probabilities in P can be replaced by the arithmetic mean of those values, with the prefix code generated from the revised probabilities still being minimum-redundancy with respect to the original P. In terms of a Huffman tree, homogenization permits minimal-binary subtrees to be constructed in advance, and then manipulated in a holistic manner by assigning all of the weight to the root of that subtree. Milidiú et al. note in their paper that their algorithm is complex, and we give no details here. It is currently the most efficient non-destructive algorithm for constructing minimum-redundancy prefix codes.
Liddell and Moffat [2001] have also considered the problem of calculating
a prefix code, and give an O(n)-time algorithm that quickly determines an ap-
proximate code by assigning a codeword length of |ci| = ⌈−log2 Pi⌉ to each symbol - which guarantees that K(C) ≤ 1 - and then calculates a subset of the symbols to have their codewords shortened by one bit so as to increase K(C) to one. The codes generated are not minimum-redundancy, but for practical
purposes the redundancy is small, and the mechanism for partitioning the sym-
bols into the two classes, long codewords and short codewords, can be updated
in an incremental manner in the face of evolving probabilities. Section 6.10 on
page 179 describes how this flexibility can be exploited.
This chapter has focussed entirely on the minimum-redundancy codes -
their calculation, and their use. They offer the best compression over all prefix

codes. But what is the relationship between being "minimum-redundancy" and


Shannon's entropy limit, H(P), for the probability distribution P?
The redundancy of a code is the difference between the expected cost per
symbol, E(C, P), and the average information per symbol, H(P). The redun-
dancy of a Shannon-Fano code is bounded above by one, measured in bits per
symbol. This worst case is realized, for example, by a two symbol alphabet
in which the probability of one symbol approaches one, and H(P) approaches
zero. The best any prefix code can do on this two symbol alphabet is assign a
one bit codeword to each symbol, hence E(C, P) = 1, and the redundancy of
C is close to one bit per symbol.
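To attach numbers to this (our arithmetic): with P = [0.99, 0.01] the entropy is H(P) ≈ 0.081 bits per symbol, while the best prefix code still costs E(C, P) = 1 bit per symbol, a redundancy of roughly 0.92 bits per symbol.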
Under certain conditions the bound on redundancy can be tightened for minimum-redundancy prefix codes. Gallager [1978] showed that if p1 is the largest probability, then the redundancy of a minimum-redundancy code is bounded by

E(C, P) − H(P) ≤ { 2 − H(p1) − p1   if p1 ≥ 0.5,
                 { p1 + 0.086       if p1 < 0.5.
The bound when p1 ≥ 0.5 cannot be tightened, but several authors have reduced the bounds when p1 < 0.5. Dietrich Manstetten [1992] summarizes previous work, and gives a general method for calculating the redundancy of a minimum-redundancy prefix code as a function of p1. Manstetten also gives a graph of the tightest possible bounds on the number of bits per symbol required by a minimum-redundancy code, again as a function of p1.
Another area of analysis that has received attention is the maximum code-
word length L assigned in a minimum-redundancy code. This is of particular
relevance to successful implementation of function canonical_decode() in Algorithm 4.1 on page 60, where V is a buffer containing the next L bits of the
compressed input stream. If allowance must be made in an implementation for
L to be larger than the number of bits that can be stored in a single machine
word, the speed of canonical decoding is greatly compromised.
Given that most (currently) popular computers have a word size of 32 bits,
what range of message lengths can we guarantee to be able to handle within
an L = 32 bit limit on maximum codeword length? The obvious answer - that messages of length m = 2^32 ≈ 4 × 10^9 symbols can be handled without problem - is easily demonstrated to be incorrect. For example, setting the
unnormalized probability Pi of symbol Si to F(n - i + 1), an element in the
Fibonacci sequence that was defined in Section 1.5 on page 10, gives a code
in which symbols Sn-l and Sn have codewords that are n - 1 bits long. The
intuition behind this observation is simple: Huffman's algorithm packages the
smallest two probabilities at each stage of processing, beginning with two sin-
gleton packages. If the sum of these two packages is equal to the weight of the
next unprocessed symbol, at every iteration a new internal node will be created,

with the leaf as one of its children, and the previous package as the other child.
The final code tree will be a stick, with |C| = [1, 2, ..., n − 1, n − 1]. Hence, if P = [F(n), F(n − 1), ..., F(1)], then L = n − 1 [Buro, 1993].
This bodes badly for canonical_decode(), since it implies that L > 32 is possible on an alphabet of as few as n = 34 symbols. But there is good news too: a Fibonacci-derived self-probability distribution on n = 34 symbols does still require a message length of m = Σ_{i=1}^{34} F(i) = F(36) − 1 > 14.9 million
symbols. It is extremely unlikely that a stream of more than 14 million symbols
would contain only 34 distinct symbols, and that those symbols would occur
with probabilities according to a Fibonacci sequence.
While the Fibonacci based probability distribution leads to codewords of
length L = n − 1, it is not the sequence that minimizes m = Σ_{i=1}^{n} Pi, the
message length required to cause those long codeword to be generated. That
privilege falls to a probability distribution derived from the modified Fibonacci
sequence F' described in Section 1.5:

P = [F' (n - 1) - F' (n - 2), ... , F' (n - i) - F' (n - i-I), ... ,


F' (3) - F'(2), F' (2) - F'(l), F' (1) - F' (0),1],

which clearly sums to m = F' (n - 1). For example, when n = 6, a L = 5


bit code can be forced by a message of just m = 17 symbols, with symbol
frequencies given by P = [7,4,3,1,1,1]. In this case there are no ties of
weights that could be resolved in favor of a shorter overall code, and the only
possible minimum-redundancy code is lei = [1,2,3,4,5,5].
That is, it is possible for a message of m = F' (n - 1) to force minimum-
redundancy codewords to n - 1 bits, and as was demonstrated in Section 1.5,

F'(n - 1) = F(n + 1) + F(n - 1) - 1 ~ ¢n.

So it is conceivable that a codeword of length L = 33 bits might be required


on a message of as few as F' (33) = 12.8 million symbols, slightly less than
the F(36) - 1 = 14.9 million symbols indicated by the ordinary Fibonacci
sequence. It seems extraordinarily unlikely that such a message would occur
in practice. Nevertheless, in critical applications a length-limiting mechanism
of the type discussed in Section 7.1 on page 194 should be employed, in order
to bound the length of the code.
Chapter 5

Arithmetic Coding

Given that the bit is the unit of stored data, it appears impossible for codewords
to occupy fractional bits. And given that a minimum-redundancy code as de-
scribed in Chapter 4 is the best that can be done using integral-length code-
words, it would thus appear that a minimum-redundancy code obtains com-
pression as close to the entropy as can be achieved.
Surprisingly, while true for the coding of a single symbol, this reasoning
does not hold when streams of symbols are to be coded, and it is the latter
situation which is the normal case in a compression system. Provided that the
coded form of the entire message is an integral number of bits long, there is no
requirement that every bit of the encoded form be assigned exclusively to one
symbol or another. For example, if five equi-probable symbols are represented
somehow in a total of three bits, it is not unreasonable to simplify the situation
and assert that each symbol occupies 0.6 bits. The output must obviously be
"lumpy" - bits might only be emitted after the second, fourth, and fifth sym-
bols of the input message, or possibly not until all of the symbols in the input
message have been considered. However, if the coder has some kind of internal
state, and if after each symbol is coded the state is updated, then the total code
for each symbol can be thought of as being the output bits produced as a re-
sult of that symbol being processed, plus the change in potential of the internal
state, positive or negative. Since the change in potential might be bit-fractional
in some way, it is quite conceivable for a coder to represent a symbol of prob-
ability p_i in the ideal amount (Equation 2.1 on page 16) of -log2 p_i bits. At
the end of the stream the internal state must be represented in some way, and
converted to an integral number of bits. But if the extra cost of the rounding
can be amortized over many symbols, the per-symbol cost is inconsequential.
Arithmetic coding is an effective mechanism for achieving exactly such a
"bit sharing" approach to compression, and is the topic of this chapter. The ori-
gins of the ideas embodied in an arithmetic coder are described in Section 5.1.
Sections 5.2 and 5.3 give an overview of the method, and then a detailed im-
plementation. A number of variations on the basic theme are explored in Sec-
tion 5.4, ideas which are exploited when binary arithmetic coding is considered
in Section 5.5. Finally, Sections 5.6 and 5.7 examine a number of approximate
arithmetic coding schemes, in which some inexactness in the coded represen-
tation is allowed, in order to increase the speed of encoding and decoding.

5.1 Origins of arithmetic coding


An important source describing the history of arithmetic coding is the tutorial
of Langdon [1984], which details the discoveries that led to the concept of
arithmetic coding as we know it today. Curiously, one of the first mentions of
the possibility of such a coding method was by Shannon in 1948. Shannon did
not capitalize on his observation that if probabilities in his code were regarded
as high precision binary numbers, then unambiguous decoding of messages
would be possible. Shortly thereafter David Huffman developed his algorithm,
and the focus of attention was diverted.
Several other authors explored the ideas required for arithmetic coding,
including (according to Abramson [1963]) Peter Elias in the early 1960s. But
Elias went on instead to develop the family of codes described in Chapter 3, and
it was only after another long lull that it became clear, through the independent
work of Rissanen [1976] and Pasco [1976], that arithmetic coding could be
carried out using finite precision arithmetic. Once that observation had been
made, developments flowed quickly [Guazzo, 1980, Rissanen and Langdon,
1979, Rissanen, 1979, Rubin, 1979].
Two important threads of investigation evolved. The first, with a hardware
slant, was based around work carried out at IBM by a number of people in-
cluding Ron Arps, Glen Langdon, Joan Mitchell, Jorma Rissanen, and Bill
Pennebaker. Their approach led to fast binary arithmetic coders for applica-
tions such as bi-level image compression, and, more generally, representation
of non-binary data as a sequence of binary choices [Pennebaker et al., 1988].
Their work continues to be used, and finds application in a number of compres-
sion standards. Section 6.11 discusses that approach to arithmetic coding.
The other thread of development was software-focussed, and led to a stir
of attention with the publication in 1987 of a complete C implementation in
Communications of the ACM [Witten et al., 1987] - a journal with (at the time)
a significant readership amongst academics and the wider computing commu-
nity. The first author of this book remembers typing in the code from a preprint
of the paper (recall that 1987 was pre-web, and in Australia and New Zealand,
pre-internet too), to explore this wonderful new concept. Judging by the fol-
lowup correspondence in CACM - some of it perhaps not quite as well informed
as Witten et al. would have liked - others around the world typed it in
too. The CACM implementation was revised approximately ten years later in
a followup paper that appeared in ACM Transactions on Information Systems
[Moffat et al., 1998], and that TOIS implementation is the basis for much of the
presentation in this chapter.
Paul Howard and Jeff Vitter have also considered arithmetic coding in some
depth (see their 1994 paper in a special "Data Compression" issue of Proceed-
ings of the IEEE for an overview), and one of their several contributions is
examined in Section 5.7.

5.2 Overview of arithmetic coding


The key to arithmetic coding is the notion of state, internal information that is
carried forward from the coding of one symbol to influence the coding of the
next. There are several different mechanisms whereby this state is represented;
in this presentation the approach of Moffat et al. [1998] is used, in which the
internal state of the coder is recorded using two variables L and R.
These two variables record the Lower end of a bounding interval, and the
width or Range of that interval. In this section it is assumed that both L and
R are real-valued numbers between zero and one. In an implementation they
are typically scaled by some appropriate power of two, and approximated by
integers. Section 5.3 describes an implementation of arithmetic coding that
adopts such a convention. The use of integers rather than floating point values
allows faster computation, and makes the underlying software less dependent
on the vagaries of particular hardware architectures. But to get started, it is
easier to think of L and R as taking on arbitrary real values.
The fundamental operations that take place in a simplified or "ideal" arithmetic
coder are described in Algorithm 5.1. Initially L = 0 and R = 1. To
code each of the symbols in the message, L and R are adjusted by increasing
L and decreasing R. Moreover, they are adjusted in exactly the proportion that
P[s], the probability of the symbol being coded, bears to the total set of probabilities.
Space is also proportionately allocated for every other symbol of the
alphabet prior to s and after s. Choosing the sth subrange for the new L and
R when s is coded is the change of internal state that was discussed above.
This range narrowing process, which takes place in steps 4 and 5 of function
ideal_arithmetic_encode() of Algorithm 5.1, is illustrated in Figure 5.1.
In Figure 5.1a the set of symbol probabilities - which sum to one - are
laid out in some order on the real interval [0,1). Symbol s appears somewhere
in this ordering, and is allocated a zone (the gray region) of width equal to its
probability. Figure 5.1b then shows the interval [0,1) that contains s being
mapped onto the current coding interval, defined by [L, L + R). Finally, in
Algorithm 5.1
Use an idealized arithmetic coder to represent the m-symbol message M,
where 1 ≤ M[i] ≤ n_max for 1 ≤ i ≤ m. Normalized symbol probabilities
are assumed to be given by the static vector P, with Σ_{i=1}^{n_max} P[i] = 1.
ideal_arithmetic_encode(M, m)
1: set L ← 0 and R ← 1
2: for i ← 1 to m do
3:   set s ← M[i]
4:   set L ← L + R × Σ_{j=1}^{s-1} P[j]
5:   set R ← R × P[s]
6: transmit V, where V is the shortest (fewest bits) binary fractional number
   that satisfies L ≤ V < L + R

Decode and return an m-symbol message assuming an idealized arithmetic
coder.
ideal_arithmetic_decode(m)
1: set L ← 0 and R ← 1
2: let V be the fractional value transmitted by the encoder
3: for i ← 1 to m do
4:   determine s such that R × Σ_{j=1}^{s-1} P[j] ≤ V - L < R × Σ_{j=1}^{s} P[j]
5:   set L ← L + R × Σ_{j=1}^{s-1} P[j]
6:   set R ← R × P[s]
7:   set M[i] ← s
8: return M
Figure 5.1: Encoding a symbol s and narrowing the range: (a) allocation of probability
space to symbol s within the range [0,1); (b) mapping probability space [0, 1) onto the
current [L, L + R) interval; and (c) restriction to the new [L, L + R) interval.

Figure 5.1c the values of L and L + R are updated, and reflect a new reduced
interval that corresponds to having encoded the symbol s.
The same process is followed for each symbol of the message M. At any
given point in time the internal potential of the coder is given by - log2 R. The
potential is a measure of the eventual cost of coding the message, and counts
bits. If R' is used to denote the new value of R after an execution of step 5, then
R' = R × P[s], and -log2 R' = (-log2 R) + (-log2 P[s]). That is, each
iteration of the "for" loop increases the potential by exactly the information
content of the symbol being coded.
At the end of the message the transmitted code is any number V such that
L ≤ V < L + R. By this time R = ∏_{i=1}^{m} P[M[i]], where M[i] is the ith of the
m input symbols. The potential has thus increased to -Σ_{i=1}^{m} log2 P[M[i]],
and to guarantee that the number V is within the specified range between L
and L + R, it must be at least this many bits long.
sequence of L and R values that arises when the message

M = [1,2,1,1,1,5,1,1,2,1]

is coded according to the static probability distribution

P = [0.67,0.11,0.07,0.06,0.05,0.04]

that was used as an example in Section 1.3 on page 6 and again in Chapter 4.
Table 5.1 shows - in both decimal and binary - the values that the two state
variables take during the encoding of this message, starting from their initial
values of zero and one respectively.
                 Decimal                                    Binary
 i  M[i]  L           R           L+R          L                         L+R
 0   -    0.00000000  1.00000000  1.00000000   0.0000000000000000000000  1.0000000000000000000000
 1   1    0.00000000  0.67000000  0.67000000   0.0000000000000000000000  0.1010101110000101001000
 2   2    0.44890000  0.07370000  0.52260000   0.0111001011101011000111  0.1000010111001001000111
 3   1    0.44890000  0.04937900  0.49827900   0.0111001011101011000111  0.0111111110001111001110
 4   1    0.44890000  0.03308393  0.48198393   0.0111001011101011000111  0.0111101101100011010011
 5   1    0.44890000  0.02216623  0.47106623   0.0111001011101011000111  0.0111100010010111110011
 6   5    0.46907127  0.00110831  0.47017958   0.0111100000010101000100  0.0111100001011101101100
 7   1    0.46907127  0.00074257  0.46981384   0.0111100000010101000100  0.0111100001000101101110
 8   1    0.46907127  0.00049752  0.46956879   0.0111100000010101000100  0.0111100000110101101010
 9   2    0.46940461  0.00005473  0.46945934   0.0111100000101010111010  0.0111100000101110011111
10   1    0.46940461  0.00003667  0.46944128   0.0111100000101010111010  0.0111100000101101010011

Table 5.1: Example of arithmetic coding: representing the message M = [1,2,1,1,1,5,1,1,2,1] assuming the static probability distribution
P = [0.67, 0.11, 0.07, 0.06, 0.05, 0.04].
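The decimal columns of Table 5.1 can be reproduced with a few lines of code. The fragment below is only an illustration written for this discussion (real coders use the integer arithmetic of Section 5.3, and double precision may differ from the table in the last printed digit); it simply applies steps 4 and 5 of Algorithm 5.1 to the example message:

    #include <stdio.h>

    int main(void) {
        double P[] = {0.67, 0.11, 0.07, 0.06, 0.05, 0.04};
        int M[] = {1, 2, 1, 1, 1, 5, 1, 1, 2, 1};
        double L = 0.0, R = 1.0;
        for (int i = 0; i < 10; i++) {
            int s = M[i];
            double cum = 0.0;
            for (int j = 0; j < s - 1; j++)     /* sum of P[1..s-1] */
                cum += P[j];
            L = L + R * cum;                    /* step 4 of Algorithm 5.1 */
            R = R * P[s - 1];                   /* step 5 of Algorithm 5.1 */
            printf("%2d  %d  %.8f  %.8f  %.8f\n", i + 1, s, L, R, L + R);
        }
        return 0;
    }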
As each symbol is coded, R gets smaller, and L and L + R move closer
together. By the time the 10 symbols of the example message M have been
fully coded, the quantities L and L + R agree to four decimal digits, and to
thirteen binary digits. This arrangement is shown in the last line of Table 5.1.
Any quantity V that lies between L and L + R must have exactly the same
prefix, so thirteen bits of the compressed representation of the message are
immediately known. Moreover, three more bits must be added to V before a
number is achieved that, irrespective of any further bits that follow in the coded
bitstream, is always between L and L + R:

L+R 0.0111100000101101010011
V 0.0111100000101100
L 0.0111100000101010111010.

At the conclusion of the processing R has the value 3.67 × 10^-5, the product of
the probabilities of the symbols in the message. The minimum number of bits
required to separate L and L + R is thus given by ⌈-log2 R⌉ = ⌈14.74⌉ =
15, one less than the number of bits calculated above for V. A minimum-
redundancy code for the same set of probabilities would have codeword lengths
of [1, 3, 3, 3, 4, 4] (Figure 4.2 on page 54) for a message length of 17 bits. The
one bit difference between the arithmetic code and the minimum-redundancy
code might seem a relatively small amount to get excited about, but when the
message is long, or when one symbol has a very high probability, an arithmetic
code can be much more compact than a minimum-redundancy code. As an
extreme situation, consider the case when n = 2, P = [0.999, 0.001], and a
message containing 999 "1"s and one "2" is to be coded. At the end of the
message R = 3.7 × 10^-4, and V will contain just ⌈-log2 3.7 × 10^-4⌉ = 12 or
⌈-log2 3.7 × 10^-4⌉ + 1 = 13 bits, far fewer than the 1,000 bits necessary with
a minimum-redundancy code. On average, each symbol in this hypothetical
message is coded in just 0.013 bits!
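Those two bit counts are easy to verify. The following throwaway C check (not from the book; it merely evaluates the ceilings quoted above) prints 15 for the first example and 12 for the second:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double R1 = 3.67e-5;                    /* product of probabilities, Table 5.1 */
        double R2 = pow(0.999, 999) * 0.001;    /* 999 "1"s and one "2" */
        printf("%.0f\n", ceil(-log2(R1)));      /* prints 15 */
        printf("%.0f\n", ceil(-log2(R2)));      /* prints 12 */
        return 0;
    }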
There are workarounds to prefix codes that give improved compression ef-
fectiveness, such as grouping symbols together into blocks over a larger alpha-
bet, in which individual probabilities are smaller and the redundancy reduced;
or extracting runs of "1" symbols and then using a Golomb code; or using the
interpolative code. But they cannot compare with the sheer simplicity and el-
egance of arithmetic coding. As a further point in its favor, arithmetic coding
is relatively unaffected by the extra demands that arise when the probability
estimates are adjusted adaptively - a subject to be discussed in Chapter 6.
There are, however, considerable drawbacks to arithmetic coding as pre-
sented in Algorithm 5.1. First, and most critical, is the need for arbitrary pre-
cision real arithmetic. If the compressed message ends up being (say) 125 kB
long, then L and R must be maintained to more than one million bits of precision,
a substantial imposition and one that is likely to result in impossibly
expensive processing. The fact that in Algorithm 5.1 there is no on-the-fly
generation of bits is a second problem, as it means that decoding on a commu-
nications line cannot commence until the entire message has been digested by
the encoder. Fortunately, both of these difficulties can be solved, and eminently
practical arithmetic coders are possible. The next section gives details of one
such implementation.

5.3 Implementation of arithmetic coding


Suppose that we do wish to implement a viable arithmetic coder. For accuracy
and repeatability of computation across a variety of hardware platforms it is
desirable that Land R be represented as integers and that all calculations be
integer-valued. It is also necessary for Land R to be registers of some moder-
ate length, so that they can be manipulated efficiently on typical architectures
without extended precision arithmetic being required. Finally, it is highly de-
sirable for bits to be emitted as soon as they are determined, to avoid buffering
and synchronization problems between encoder and decoder; yet doing so must
not introduce any need for the encoder to recant bits because some value had
to be revised in the light of subsequent information.
There are a number of different ways that these problems have been tackled.
Here we present the TOIS implementation [Moffat et al., 1998], which relies
heavily upon the earlier 1987 work of Ian Witten, Radford Neal, and John
Cleary - the CACM implementation. Both L and R are taken to be integers
of some fixed number of bits, b, that can be conveniently manipulated by the
underlying hardware. For typical modern hardware, b will thus be either less
than or equal to 32, or less than or equal to 64. Both L and R must lie in the
range 0 ≤ L, R < 2^b. The actual values stored in L and R are then assumed
to be fractional values, normalized by 2^b, so that their interpreted values are in
the range 0 to 1. Table 5.2 shows some pairs of equivalent values that arise in
the integer-valued implementation. The algorithms will constrain the value of
R from below as well as above, and one loop invariant is that R > 2^(b-2), which
(Table 5.2) corresponds to 0.25 in scaled terms.
Algorithm 5.2 gives details of a function arithmetic_encode(l, h, t) that encodes
one symbol. The three parameters l, h, and t describe the location of the coded
symbol s in the probability range. For accuracy and repeatability of computation,
they are also stored as integers. One pervasive way of estimating these
values for each symbol is to undertake a pre-scan of the message M, and accumulate
the frequency of each of the symbols it contains. Hence, if P[j] is the
unnormalized self-probability in M of the jth symbol in the n-symbol alphabet,
then when symbol s is to be coded, the parameters passed to the encoding
Algorithm 5.2
Arithmetically encode the range [l/t, h/t) using fixed-precision integer
arithmetic. The state variables L and R are modified to reflect the new
range, and then renormalized to restore the initial and final invariants
2^(b-2) < R ≤ 2^(b-1), 0 ≤ L < 2^b - 2^(b-2), and L + R ≤ 2^b.
arithmetic_encode(l, h, t)
1: set r ← R div t
2: set L ← L + r × l
3: if h < t then
4:   set R ← r × (h - l)
5: else
6:   set R ← R - r × l
7: while R ≤ 2^(b-2) do
8:   if L + R ≤ 2^(b-1) then
9:     bit_plus_follow(0)
10:  else if 2^(b-1) ≤ L then
11:    bit_plus_follow(1)
12:    set L ← L - 2^(b-1)
13:  else
14:    set bits_outstanding ← bits_outstanding + 1
15:    set L ← L - 2^(b-2)
16:  set L ← 2 × L and R ← 2 × R

Write the bit x (value 0 or 1) to the output bitstream, plus any outstanding
following bits, which are known to be of opposite polarity.
bit_plus_follow(x)
1: put_one_bit(x)
2: while bits_outstanding > 0 do
3:   put_one_bit(1 - x)
4:   set bits_outstanding ← bits_outstanding - 1
Fractional value in function           Integer scaled equivalent in
ideal_arithmetic_encode()              function arithmetic_encode()
(Algorithm 5.1)                        (Algorithm 5.2)
1.00                                   2^b
0.50                                   2^(b-1)
0.25                                   2^(b-2)
0.00                                   0

Table 5.2: Corresponding values for arithmetic coding, real-number interpretation and
scaled integer interpretation.

routine are l = Σ_{j=1}^{s-1} P[j], h = l + P[s], and t = Σ_{j=1}^{n} P[j] = m. The
coder normalizes these into true probabilities, and allocates the numeric range
[l/t, h/t) to s. The range narrowing process is effected by steps 1 to 6. Note
carefully the order of the operations. Although it is computationally more precise
to perform the multiplications before the divisions (as this minimizes the
relative truncation error), doing so involves at least one extra multiplicative
operation. More importantly, doing the multiplication first can lead to severe
restrictions upon the number of bits that can be used to represent t, and thus the
frequency counts from which the source probabilities are derived. The order in
which the multiplicative operations are carried out is one of the key differences
between the CACM implementation and the later TOIS one. The issue of the
truncation error is examined in detail below.
Note also the care that is taken to make sure that every unit of the initial
interval [L, L + R) is allocated to one source symbol or another; this is the
purpose of the "if' statement at step 3. If there are gaps in the allocation of the
subintervals - which the truncation in step 1 would usually cause were it not
for the "if' statement - then compression "leakage" results, and the compressed
output might be needlessly large.
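To see the effect in numbers, take R = 127 and t = 10, the values that arise in the example of Table 5.3 below, so that r = R div t = 12. The truncating division leaves 127 - 10 × 12 = 7 units of the interval unaccounted for; the "if" statement hands all seven of them to the final symbol, whose subrange becomes R - r × l = 127 - 12 × 3 = 91 units rather than the r × (h - l) = 12 × 7 = 84 units it would otherwise receive.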
Once the range has been narrowed in correspondence to the encoded symbol
s, the constraint that R > 2^(b-2) (that is, R > 0.25 in scaled terms) is
checked and, if necessary, restored. This is the purpose of the loop at step 7
of function arithmetic_encode(). Each iteration of the loop doubles R and as a
consequence is responsible for writing one bit of output to the compressed message.
When R is doubled, the internal potential of the coder given by -log2 R
decreases by one bit - the bit that is moved to the output stream.
There are three possible values for that bit: definitely a "0", definitely a "1",
and "hmmm, too early to say yet, let's wait and see". These three cases are han-
dled at step 9, step 11, and step 14 respectively of function arithmetic_encode().
The three cases are also illustrated in Figure 5.2.
Figure 5.2: Renormalization in arithmetic coding: (a) when L + R ≤ 0.5; (b) when
0.5 ≤ L; and (c) when R < 0.25 and L < 0.5 < L + R.

In the first case (Figure 5.2a) the next output bit is clearly a zero, as both L
and L + R are less than 0.5. Hence, in this situation the correct procedure is to
generate an unambiguous "0" bit, and scale L and R by doubling them.
The second case (Figure 5.2b) handles the situation when the next bit is
definitely a one. This is indicated by L (and hence L + R also) being greater
than or equal to 0.5. Once the bit is output L should be translated downward
by 0.5, and then L and R doubled, as for the first case.
The third case, at steps 14 and 15, and shown in Figure 5.2c, is somewhat
more complex. When R ≤ 0.25 and L and L + R are on opposite sides of 0.5,
the polarity of the immediately next output bit cannot be known, as it depends
upon future symbols that have not yet been coded. What is known is that the
bit after that immediately next bit will be of opposite polarity to the next bit,
because all binary numbers in the range 0.25 < L to L + R < 0.75 start
either with "01" or with "10". Hence, in this third case, the renormalization
can still take place, provided a note is made using the variable bits_outstanding
to output an additional opposite bit the next time a bit of unambiguous polarity
is produced. In this third case L is translated by 0.25 before L and R are
doubled. As the final part of this puzzle, each time a bit is output at step 1
of function bit_plus_follow() it is followed up by the bits_outstanding opposite
bits still extant.
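Pulling the three cases together, the fragment below is one possible direct C transcription of Algorithm 5.2. It is a sketch written for this discussion rather than the TOIS reference code: the macro names and the choice b = 32 are assumptions, 64-bit intermediates are used so that L + R cannot overflow, and the bit-level output routine put_one_bit() is assumed to exist elsewhere.

    #include <stdint.h>

    #define B        32
    #define HALF     ((uint64_t)1 << (B - 1))
    #define QUARTER  ((uint64_t)1 << (B - 2))

    extern void put_one_bit(int bit);        /* assumed output routine */

    static uint64_t L = 0, R = HALF;         /* coder state, scaled by 2^B */
    static uint64_t bits_outstanding = 0;

    static void bit_plus_follow(int x) {
        put_one_bit(x);
        while (bits_outstanding > 0) {       /* opposite-polarity bits */
            put_one_bit(1 - x);
            bits_outstanding -= 1;
        }
    }

    void arithmetic_encode(uint64_t l, uint64_t h, uint64_t t) {
        uint64_t r = R / t;                  /* truncating division */
        L = L + r * l;
        if (h < t)
            R = r * (h - l);
        else
            R = R - r * l;                   /* last symbol takes the excess */
        while (R <= QUARTER) {               /* renormalize */
            if (L + R <= HALF) {
                bit_plus_follow(0);          /* next bit is certainly "0" */
            } else if (HALF <= L) {
                bit_plus_follow(1);          /* next bit is certainly "1" */
                L -= HALF;
            } else {
                bits_outstanding += 1;       /* polarity not yet known */
                L -= QUARTER;
            }
            L *= 2;
            R *= 2;
        }
    }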
The purpose of Algorithm 5.2 is to show how a single symbol is processed
in the arithmetic coder. To code a whole message, some initialization is re-
quired, plus a loop that iterates over the symbols in the message. Function
arithmetic_encode_block() in Algorithm 5.3 shows a typical calling sequence
that makes use of arithmetic_encode() to code an entire message M. It serves
the same purpose, and offers the same interface, as function mr_encode_block()
in Algorithm 4.6 on page 83. For the moment, consider only the encoding func-
Algorithm 5.3
Use an arithmetic code to represent the m-symbol message M, where
1 ≤ M[i] ≤ n_max for 1 ≤ i ≤ m.
arithmetic_encode_block(M, m)
1: for s ← 0 to n_max do
2:   set cum_prob[s] ← 0
3: for i ← 1 to m do
4:   set s ← M[i]
5:   set cum_prob[s] ← cum_prob[s] + 1
6: use function elias_delta_encode() to encode m and n_max
7: for s ← 1 to n_max do
8:   elias_delta_encode(1 + cum_prob[s])
9:   set cum_prob[s] ← cum_prob[s - 1] + cum_prob[s]
10: start_encode()
11: for i ← 1 to m do
12:   set s ← M[i]
13:   arithmetic_encode(cum_prob[s - 1], cum_prob[s], m)
14: finish_encode()

Decode and return an m-symbol message M using an arithmetic code.
arithmetic_decode_block()
1: use function elias_delta_decode() to decode m and n_max
2: set cum_prob[0] ← 0
3: for s ← 1 to n_max do
4:   set cum_prob[s] ← elias_delta_decode() - 1
5:   set cum_prob[s] ← cum_prob[s - 1] + cum_prob[s]
6: start_decode()
7: for i ← 1 to m do
8:   set target ← decode_target(m)
9:   determine s such that cum_prob[s - 1] ≤ target < cum_prob[s]
10:   arithmetic_decode(cum_prob[s - 1], cum_prob[s], m)
11:   set M[i] ← s
12: finish_decode()
13: return m and M
tion. The decoder arithmetic_decode_block() will be discussed shortly.
As was the case with minimum-redundancy coding, a prelude describing
the code being used must be sent to the decoder before the actual message can
be transmitted. For our purposes (Chapter 2 explained why this is reasonable),
it is assumed that the symbols in the message are uncorrelated, and may be
coded with respect to their zero-order self-probabilities. In Algorithm 5.3, the
prelude consists of a list of symbol frequencies, with non-appearing symbols
indicated by a false frequency count. In a sense, the use of the Cδ code for
these "frequency plus one" values means that a one-bit overhead is being paid
for each symbol not in the subalphabet; a three-bit overhead is being paid for
symbols that appear in the message once; a zero-bit overhead for symbols that
appear twice in the message; a one-bit overhead for a symbol that appears
thrice; a zero-bit overhead for symbols of frequency four, five, and six; and
so on. Symbols that appear many times in the message will incur a negligible
overhead, as the likelihood of the Cδ codes for x and x + 1 being of different
lengths becomes small once x is larger than 10 or so.
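One way to read that accounting is via the Cδ codeword lengths given in Chapter 3, which for small arguments are |Cδ(1)| = 1, |Cδ(2)| = |Cδ(3)| = 4, and |Cδ(4)| through |Cδ(7)| equal to 5 bits: the overhead for a symbol of frequency x is the cost of coding x + 1 rather than x, so a non-appearing symbol pays the full |Cδ(1)| = 1 bit, a symbol of frequency one pays |Cδ(2)| - |Cδ(1)| = 3 bits, frequency two pays |Cδ(3)| - |Cδ(2)| = 0 bits, frequency three pays |Cδ(4)| - |Cδ(3)| = 1 bit, and frequencies four, five, and six pay nothing, exactly as listed above.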
The prelude arrangement in Algorithm 5.3 is different to the prelude rep-
resentation shown in function mr_encode_block(), which uses the interpolative
code to indicate the subalphabet. There is no special reason for this difference
except to show an alternative, and the interpolative code might be preferable
in the arithmetic coder if the subalphabet is sparse and if a second array is
available to hold the indices of the symbols that have non-zero frequencies.
In the minimum-redundancy environment, mr_encode_block() also codes
a set of codeword lengths, integers in [1 ... L]. Here we code exact symbol
frequencies instead, which take more space because of the more detailed in-
formation they represent. Hence, this component of the prelude is more costly
to represent in the arithmetic coder than in the minimum-redundancy coder,
and on short messages the extra precision of the arithmetic coder might mean
that the minimum-redundancy coder can obtain a more compact representation.
That is, we might have a situation where the arithmetic coding system gener-
ates a shorter code than a minimum-redundancy coder does when presented
with the same message, but not so much shorter that the extra cost of the more
detailed prelude is recouped. This apparent paradox should not be overlooked
if "absolute best" compression effectiveness is required on short messages, and
is quantified by Bookstein and Klein [1993].
Once the prelude has been transmitted (in this case, using an Elias code),
the symbol frequency counts are processed to make an array of cumulative
frequencies in array cum_prob, with cum_prob[0] = 0 used as a sentinel value.
The cum_prob array allows easy access to the l and h values needed to code
each of the m symbols in the message M. The more challenging case, in which
the probability distribution is adjusted after each symbol is transmitted, is dis-
Algorithm 5.4
Return an integer target in the range 0 ≤ target < t that falls within the
interval [l, h) that was used at the corresponding call to arithmetic_encode().
decode_target(t)
1: set r ← R div t
2: return (min{t - 1, D div r})

Adjust the decoder's state variables R and D to reflect the changes made in
the encoder during the corresponding call to arithmetic_encode(), assuming
that r has been set by a prior call to decode_target().
arithmetic_decode(l, h, t)
1: set D ← D - r × l
2: if h < t then
3:   set R ← r × (h - l)
4: else
5:   set R ← R - r × l
6: while R ≤ 2^(b-2) do
7:   set R ← 2 × R and D ← 2 × D + get_one_bit()

cussed in detail in Chapter 6. As will be seen at that time, a cum_prob array
should almost certainly not be used for adaptive coding - there are alternative
structures that can be updated rather more efficiently. The initialization and
termination functions start_encode() and finish_encode() are discussed shortly.
Now consider the task of decoding a block of coded symbol numbers.
Algorithm 5.3 describes a function arithmetic_decode_block() that receives the
prelude, builds a mirror-image cum_prob array, and then uses it to recover the
encoded integers using the functions decode_target() and arithmetic_decode(),
both of which are defined in Algorithm 5.4. Compared to the idealized decoder
in Algorithm 5.1 on page 94 there are two changes to note. First and
most obvious is that decoding involves the use of two functions. Function
decode_target() calculates an integer between 0 and t - 1 from the state variables,
corresponding to the next symbol s that was encoded. The value returned
from decode_target() lies somewhere in the range cum_prob[s] to cum_prob[s +
1] - 1, with the exact value depending on what symbols were coded after this
one. To resolve this uncertainty, arithmetic_decode_block() must search the
cum_prob array to determine the symbol s that corresponds to the target value.
Only then can the second function, arithmetic_decode(), be called, the purpose
of which is to mimic the bounds adjustment that took place in the encoder at
the time this symbol was coded.
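Continuing the encoder sketch given earlier, the two decoder routines of Algorithm 5.4 might be transcribed into C as follows. Again this is only a sketch under the same assumed b = 32 setup, kept separate from the encoder sketch; get_one_bit() is assumed to return the next bit of the compressed stream, and start_decode() would first fill D with the leading B input bits.

    #include <stdint.h>

    #define B        32
    #define HALF     ((uint64_t)1 << (B - 1))
    #define QUARTER  ((uint64_t)1 << (B - 2))

    extern int get_one_bit(void);            /* assumed input routine */

    static uint64_t R = HALF, D = 0, r = 0;  /* decoder state */

    uint64_t decode_target(uint64_t t) {
        r = R / t;                           /* remembered for arithmetic_decode() */
        uint64_t target = D / r;
        return target < t - 1 ? target : t - 1;   /* min{t - 1, D div r} */
    }

    void arithmetic_decode(uint64_t l, uint64_t h, uint64_t t) {
        D = D - r * l;
        if (h < t)
            R = r * (h - l);
        else
            R = R - r * l;
        while (R <= QUARTER) {               /* mirror the encoder's renormalization */
            R = 2 * R;
            D = 2 * D + (uint64_t)get_one_bit();
        }
    }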
Algorithm 5.5
Initialize the encoder's state variables.
start_encode()
1: set L ← 0, R ← 2^(b-1), and bits_outstanding ← 0

Flush the encoder so that all information is in the output bitstream.
finish_encode()
1: put_one_integer(L, b), using bit_plus_follow() rather than put_one_bit()

Initialize the decoder's state variables.
start_decode()
1: set R ← 2^(b-1)
2: set D ← get_one_integer(b)

Push any unconsumed bits back to the input bitstream. For the version of
finish_encode() described here, no action is necessary on the part of
finish_decode().
finish_decode()
1: do nothing

The other significant alteration between the decoding process of Algorithm
5.1 and the corresponding routines in Algorithm 5.4 is the use of the
state variables D and R rather than V, L, and R. The transformation between
the two alternatives is simply that D = V - L; but the two-variable version offers
a simpler renormalization loop (steps 6 and 7 in Algorithm 5.4). Note also
that D and R in the decoder must be maintained to the same b bits of precision
as L and R in the encoder.
The only remaining components in this top-down progression of functions
are the four support routines described in Algorithm 5.5. The first two initialize
the state variables and terminate the encoding process by emitting the "extra"
bits needed to completely disambiguate the output; the second two perform the
matching tasks in the decoder. Note the initialization of R to 2^(b-1), which
corresponds (Table 5.2 on page 100) to the value 0.5 rather than the 1.0 used in
Algorithm 5.1. This means that unless the first bit of output is trapped and discarded
(and then re-inserted by the decoder), it will always be a "0". This small
penalty seems a reasonable price to pay to guarantee the asserted precondition
in function arithmetic_encode() that 0.25 < R ≤ 0.5, which would be violated
by the more correct initial assignment R ← 2^b - 1. In Section 5.6 below we
will make use of this restriction on R.
The termination mechanism described in function finish_encode() is simple,
but heavy handed. Function finish_encode() simply outputs all of the bits
of L, that is, another b bits, compared to the small number of bits that was
required in the example shown in Table 5.1. There are two main reasons for
advocating this brute-force approach. The first is the use of the transformation
D = V - L in the decoder, which must similarly be able to calculate how many
bits should be flushed from its state variables if it is to remain synchronized. If
L and R are maintained explicitly in the decoder, then it can perform the same
calculation (whatever that might end up being) as does the encoder, and so a
variable number of termination bits can be used. But maintaining L as well as
either V or D slows down the decoder, and rather than accept this penalty the
number of termination bits is made independent of the exact values of L and
R. Any other fixed number of bits known to be always sufficient could also
be used. For example, the encoder might send the first three bits of L + R/2,
which can be shown to always be enough.
The second reason for preferring the simplistic termination mechanism is
that the compressed file might contain a number of compressed messages, each
handled by independent calls to arithmetic_encode_block(). Indeed, the arith-
metic codes in the file might be interleaved with codes using another quite dif-
ferent mechanism. For example, in a multi-block situation the Elias Cδ codes
for the P[s] + 1 values end up being interleaved with the arithmetic codes.
Unless care is taken, the buffer D might, at the termination of coding, contain
bits that belong to the next component of the compressed file. If so, those bits
should be processed by quite different routines - such as elias_delta_decode().
When finish_encode() writes all b bits of L, and the decoder reads no more
beyond the current value of D, it guarantees that when the decoder terminates
the next bit returned by function get_one_bit() will be the first bit of the next
component of the file.
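The bit-level I/O routines put_one_bit(), put_one_integer(), get_one_bit(), and get_one_integer() are assumed throughout this chapter but not spelled out. One possible byte-buffered realization in C is sketched below; it is an illustration only (writing to stdout and reading from stdin), and it deliberately delivers "1" bits once the input is exhausted, in keeping with the caution about end-of-file behavior discussed shortly.

    #include <stdio.h>
    #include <stdint.h>

    static int out_byte = 0, out_bits = 0;   /* pending output bits */
    static int in_byte = 0, in_bits = 0;     /* unread input bits */

    void put_one_bit(int bit) {
        out_byte = (out_byte << 1) | (bit & 1);
        if (++out_bits == 8) {               /* a full byte is ready */
            putchar(out_byte);
            out_byte = 0;
            out_bits = 0;
        }
    }

    void put_one_integer(uint64_t value, int nbits) {
        for (int i = nbits - 1; i >= 0; i--) /* most significant bit first */
            put_one_bit((int)((value >> i) & 1));
    }

    int get_one_bit(void) {
        if (in_bits == 0) {
            int c = getchar();
            in_byte = (c == EOF) ? 0xFF : c; /* past EOF, deliver "1" bits */
            in_bits = 8;
        }
        in_bits -= 1;
        return (in_byte >> in_bits) & 1;
    }

    uint64_t get_one_integer(int nbits) {
        uint64_t v = 0;
        for (int i = 0; i < nbits; i++)
            v = (v << 1) | (uint64_t)get_one_bit();
        return v;
    }

A complete implementation would also flush any final partial output byte and handle I/O errors; those details are omitted from the sketch.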
In cases where the compressed file only contains one component it is pos-
sible to terminate in just three bits. In some cases as few as one bit might be
sufficient - consider the two cases R = 2^(b-1) (that is, 0.5) for L = 0 and
L = 2^(b-1). In the first case a single "0" bit is adequate, and in the second case
a single "I" bit suffices. Similarly, two bits of termination is often enough: as
an example, consider L = "011 ... " and L + R = "110 ... ", in which case
termination with "10" gives a value always in range, regardless of what noise
bits follow on behind. Note the degree of suspicion with which this is done. It
would be quite imprudent to assume that all subsequent bits inspected by the
decoder beyond those explicitly written by the encoder will be zeros. In the
language C, for example, erroneously reading when there are no bytes remaining
in the file returns "1" bits, as the EOF marker is represented as the value
-1, which in two's complement form is stored as a word which contains all "1"
bits. This uncertainty is why we insist that the termination bits must be such
Algorithm 5.6
Initialize the encoder's state variables. Note that with this assignment the
encoding/decoding invariant 0.25 < R ≤ 0.5 is no longer guaranteed.
frugal_start_encode()
1: set L ← 0, R ← 2^b - 1, and bits_outstanding ← 0

Flush the encoder so that all information is in the output bitstream, using as
few extra bits as possible.
frugal_finish_encode()
1: for nbits ← 1 to 3 do
2:   set roundup ← 2^(b-nbits) - 1
3:   set bits ← (L + roundup) div 2^(b-nbits)
4:   set value ← bits × 2^(b-nbits)
5:   if L ≤ value and value + roundup ≤ L + (R - 1) then
6:     put_one_integer(bits, nbits), using bit_plus_follow()
7:     return

that no matter what bit values are inadvertently used by the decoder after all of
the emitted bits are consumed, decoding works correctly.
Function frugal_finish_encode() in Algorithm 5.6 gives an exact calculation
that determines a minimal set of termination bits. Note the care with which the
calculation at step 5 is engineered: the computation must be carried out in an
order that eliminates any possibility of overflow, even if the architecture uses
b-bit words for its integer arithmetic.
Over any realistic compression run the extra 30 or so bits involved in function
finish_encode() compared to function frugal_finish_encode() are a completely
negligible overhead. On the other hand, if the file does consist of multiple
short components, and function frugal_finish_encode() is to be used, a very
much more complex regime is required in which the final contents of D must
be pushed back into the input stream by a function frugal_finish_decode() and
made available to subsequent calls to get_one_bit(). How easily this can be
done - and what effect it has upon decoding throughput - will depend upon the
language used in an actual implementation.
Let us now return to the example message M that was compressed in the
example of Table 5.1, and see how it is handled in the integer-based implementation
of arithmetic coding. For reasons that are discussed below, it is desirable
to permute the alphabet so that the most probable symbol is the last. Doing so
gives a message M' = [6,1,6,6,6,4,6,6,1,6] to be coded against an integer
cumulative frequency distribution cum_prob = [0,2,2,2,3,3,10]. Suppose
further that b = 7 is being used in Algorithm 5.2, and hence that 0 ≤ L < 127
Symbol           Before           Output     After
  s     r        L      R         bits       L      R
  6     12       0      127                  36     91
  1     9        36     91                   36     18*
                                  0          72     36
  6     3        72     36                   81     27*
                                  1          34     54
  6     5        34     54                   49     39
  6     3        49     39                   58     30*
                                  ?          52     60
  4     6        52     60                   64     6*
                                  1x         0      12*
                                  0          0      24*
                                  0          0      48
  6     4        0      48                   12     36
  6     3        12     36                   21     27*
                                  0          42     54
  1     5        42     54                   42     10*
                                  0          84     20*
                                  1          40     40
  6     4        40     40                   52     28*
                                  ?          40     56
frugal_finish_encode()            1x0

Table 5.3: Example of arithmetic coding (function arithmetic_encode_block() in Algorithm
5.3 on page 102) using integer arithmetic with b = 7. The message M' =
[6,1,6,6,6,4,6,6,1,6] with symbol probabilities given by the vector cum_prob =
[0,2,2,2,3,3,10] is coded with t = 10 throughout. Values of "After, R" that are
below 2^(b-2) = 32 are marked with an asterisk; each such value causes an iteration in the
renormalization loop and the consequent output of one bit, shown in the "output bits"
column of the next row. Symbol "?" denotes output bits of unknown value that cannot
yet be generated. Symbol "x" shows where these bits appear in the output sequence, at
which time they take the opposite value to the immediately prior bit of known value.
In this example, all "x" bits are value "0". A total of 12 bits are generated, including
two termination bits.
and renormalization must achieve 32 < R. Table 5.3 shows the sequence of
values taken on by L, R, and r; and the sequence of bits emitted during the
execution of the renormalization loop when message M' is coded. Note that it
is assumed that the bit-frugal version of start_encode() has been used.
A "?" entry for a bit indicates that the renormalization loop has iterated
and that bits_outstanding has been incremented rather than a bit actually being
produced; and "x" shows the location where that bit is inserted. Hence, the
emitted bitstream in the example is "011000001100", including the termination
bits. In this case, with L = 40 and R = 56 after the last message symbol,
function frugal_finish_encode() calculates that nbits = 2 is the least number of
disambiguating bits possible, and that they should be "10". That is, transmission
of the message M', which is equivalent to the earlier example message M,
has required a total of 12 bits.
To this must be added the cost of the prelude. Using the mechanism suggested
in Algorithm 5.3, the prelude takes 4 + 1 + 1 + 4 + 1 + 8 = 19 bits for
the six Cδ codes, not counting the cost of the values m and n_max. In contrast,
when coding the same message the minimum-redundancy prelude representation
suggested in Algorithm 4.6 on page 83 requires 9 bits for subalphabet
selection, including an allowance of 4 bits for a Cδ code for n = 3; and then
4 bits for codeword lengths - a total of 13 bits. Subalphabet selection is done
implicitly in Algorithm 5.3 through the use of "plus one" symbol frequencies.
The interpolative code might be used in the arithmetic environment for explicit
subalphabet selection, and a Golomb or interpolative code used for the non-zero
symbol frequencies rather than the presumed Cδ code. But the second
component of the prelude - codeword lengths in a minimum-redundancy code,
or symbol frequencies in an arithmetic code - is always going to be cheaper in
the minimum-redundancy code. More information is contained in the set of ex-
act symbol frequencies that led to a set of codeword lengths than is contained in
the lengths that result, as the lengths can be computed from the frequencies, but
not vice-versa. Hence the comments made earlier about remembering to fac-
tor in the cost of transmitting the prelude if absolute best compression is to be
achieved for short messages. For the short example message M, the unary code
described in Section 3.1 on page 29 is probably "absolute best", as it requires
no prelude and has a total cost of just 16 bits. Unfortunately, short messages
are never a compelling argument in favor of complex coding mechanisms!
In the fixed-precision decoder, the variable D is initialized to the first b =
7 bits of the message, that is, to "0110000", which is 48 in decimal. The
decoder then calculates r = R/t = 127/10 = 12, and a target of D/r =
48/12 = 4, which must correspond to symbol 6, as cum_prob[5] = 3 and
cum_prob[6] = 10. Once the symbol number is identified, the decoder adjusts
its state variables D and R to their new values of D = 12 ("0001100" in seven-bit
binary) and R = 91, and undertakes a renormalization step, which in this
case - exactly as happened in the encoder at the same time - does nothing. The
second value of r is then calculated to be r = 91/10 = 9; the second target
is then D/r = 12/9 = 1; the second symbol is found to be s = 1; and D
and R are again modified, to D = 12 and R = 18. This time R gets doubled
in the renormalization loop. At the same time D, which is still 12, or binary
"0001100", is also doubled, and another bit (the next "0") from the compressed
stream shifted in, to make D = "0011000" = 24. The process continues in the
same vein until the required m symbols have been decoded. Notice how the
fact that some bits were delayed in the encoder is completely immaterial in the
decoder - it can always see the full set of needed bits - and so there is no need
in the decoder to worry about outstanding bits.
Now consider the efficiency of the processes we have described. In the
encoder the [l, h) interval is found by direct lookup in the array cum_prob.
Hence the cost of encoding a message M of m symbols over an alphabet of
n symbols onto an output code sequence of c bits is O(n + c + m), that is,
essentially linear in the inputs and outputs. (Note that with arithmetic coding
we cannot assume that c ≥ m.) To this must be added the time required in
the model for the recognition of symbols and the conversion into a stream of
integers, but those costs are model dependent and are not considered here.
In the decoder the situation is somewhat more complex. The cum_prob
array is again used, but is now searched rather than directly accessed. Fortunately
the array is sorted, allowing the use of binary search for target values. This
means that the total decoding time for the same message is O(n + c + m log n),
where the first two terms are again for the cost of computing cum_prob and processing
bits respectively. Compared to the minimum-redundancy coders discussed
in Chapter 4, encoding is asymptotically faster, and decoding is asymptotically
slower. Section 6.6 returns to this issue of searching in the cum_prob
array, and describes improved structures that allow the overall decoding time
to be reduced to O(n + c + m), at the expense of an additional n words of extra
memory space.
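The binary search itself is straightforward. A minimal C version (a sketch with an assumed function name and types, not code from the book) that locates the s satisfying cum_prob[s-1] ≤ target < cum_prob[s] is:

    /* Symbols are numbered 1..n; cum_prob[0] = 0 is the sentinel. */
    int find_symbol(const unsigned cum_prob[], int n, unsigned target) {
        int lo = 1, hi = n;                /* the answer lies in lo..hi */
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (cum_prob[mid] <= target)
                lo = mid + 1;              /* s must be above mid */
            else
                hi = mid;
        }
        return lo;
    }

For the running example, cum_prob = [0,2,2,2,3,3,10] and target = 4 returns s = 6, matching the hand decoding above.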
In terms of memory space, arithmetic coding is more economical than
minimum-redundancy coding in both the encoder and decoder. Just one array
of n_max words is required in each, where n_max is the externally-stipulated
maximum symbol index. If n, the number of symbols that actually appear,
is very much smaller than n_max and the subalphabet is sparse, then other data
structures might be required. As is the case with function mr_encode_block(),
an array implementation is only appropriate when the subalphabet is dense.
Consider now the compression effectiveness of arithmetic coding. In the
discussion earlier it was suggested that the number of emitted bits c to represent
message M was bounded by

    ⌈-log2 R⌉ ≤ c ≤ ⌈-log2 R⌉ + 1 ≤ -log2 R + 2

where

    R = ∏_{i=1}^{m} P[M[i]]

is the product of the probabilities of the m symbols comprising the message, a
relationship that is considered in detail by Witten et al. [1987] and Howard and
Vitter [1992a]. Hence, in an amortized sense, the cost c_k of one appearance of
the kth symbol in the alphabet is given by

    c_k = -log2 P[k] + O(1/m),

and for long messages arithmetic codes come arbitrarily close to Shannon's
lower bound (Equation 2.1 on page 16) for coding. That is, the idealized arith-
metic coding mechanism of Algorithm 5.1 is zero-redundancy to within an
asymptotically decreasing additive term.
Unfortunately, in Algorithm 5.2 the situation is not quite so favorable. A
fixed precision approximation to R is used, and non-trivial truncation errors
are allowed in the computations performed, which means that the subinterval
selected at each coding step is no longer guaranteed to be in the correct ratio to
the interval being processed.
These problems were considered by Moffat et al. [1998]. They parameter-
ized the coder in terms of b, the number of bits used to represent L and R; and
f, the number of bits used to represent t, the sum of the frequency counts. The
truncation error arises when r = R/t is computed. Hence, the larger the value
of b (and thus R) or the smaller the value of f (and thus t), the smaller the
relative truncation error. At the coarsest extreme, when b - f = 2 and t = 2^f,
the quotient r is always just one, and the truncation error substantial. Moffat
et al. [1998] derive bounds on the compression inefficiency, assuming that the
source is true to the observed statistics, and show that the compression loss,
measured in bits per symbol, is given by

    p_n log2 ( p_n(r + 1) / (1 + p_n r) ) + (1 - p_n) log2 ( (r + 1)/r ),     (5.1)
where p_n is the true probability of the symbol that is allocated the truncation
excess (step 3 of Algorithm 5.2 on page 99). This means that the compression
loss is never greater than approximately log2 e / 2^(b-f-2), and is monotonically
decreasing as p_n increases. Hence, if the error is to be minimized, the alphabet
should be ordered so that the symbol s_n is the most likely, in contrast to the
arrangement assumed throughout Chapter 3 and Chapter 4. This is why in
b - f    Worst-case error    Average-case error
         (bits/symbol)       (bits/symbol)
  2          1.000               0.500
  4          0.322               0.130
  6          0.087               0.033
  8          0.022               0.008
 10          0.006               0.002

Table 5.4: Limiting worst-case and average-case errors, bits per symbol, as p_n → 0.

the example of Table 5.3 on page 108 the message compressed was M' =
[6,1,6,6,6,4,6,6,1,6] rather than M = [1,2,1,1,1,5,1,1,2,1].
Moffat et al. also showed that R, which is constrained in the range 2^(b-2) <
R ≤ 2^(b-1), can be assumed to have a density function that is proportional to
1/R, and hence that the bound of Equation 5.1 is pessimistic, as R is larger than
its minimal value a non-trivial fraction of the time. Table 5.4 gives numeric
values for the worst-case and average-case errors, assuming that the source is
true to the observed frequency distribution, and that p_n is close to zero, the
worst that can happen.
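The worst-case column of Table 5.4 can be checked by evaluating Equation 5.1 directly at the smallest possible quotient r = 2^(b-f-2) and a vanishingly small p_n. The following C sketch was written for this discussion, with an arbitrarily chosen p_n = 10^-9 standing in for "close to zero":

    #include <math.h>
    #include <stdio.h>

    /* Equation 5.1: loss in bits per symbol for quotient r and probability pn */
    static double loss(double pn, double r) {
        return pn * log2(pn * (r + 1) / (1 + pn * r))
             + (1 - pn) * log2((r + 1) / r);
    }

    int main(void) {
        for (int bf = 2; bf <= 10; bf += 2) {
            double r = ldexp(1.0, bf - 2);   /* smallest possible r is 2^(b-f-2) */
            printf("b-f = %2d   worst-case loss = %.3f bits/symbol\n",
                   bf, loss(1e-9, r));
        }
        return 0;
    }

The printed values - 1.000, 0.322, 0.087, 0.022, and 0.006 - match the worst-case column of Table 5.4.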
If the coder is organized so that symbol s_n is the most probable symbol
then the bits-per-symbol error bound of Equation 5.1 can be used to derive an
upper bound on the relative error, as an alphabet of n symbols and maximum
probability p_n must have an entropy (Equation 2.2 on page 17) of at least
the lower bound achieved when as many as possible of the other symbols have
the same probability as symbol s_n. (Note that for simplicity it is assumed in
this calculation that x log x = 0 when x = 0.) Figure 5.3, taken from Moffat
et al. [1998], shows the relative redundancy as a function of log2 p_n for various
values of b - f. The vertical axis is expressed as a percentage redundancy
relative to the entropy of the distribution. As can be seen, when b - f ≥ 6
the relative redundancy is just a few percent, and effective coding results, even
on the extremely skew distributions that are not handled well by minimum-redundancy
coding. Note also that when p_n is close to 1, the compression loss
diminishes rapidly to zero, regardless of the value of b - f.
To put these values into a concrete setting, suppose that b = 32, possible
with almost all current hardware. Working with b - f = 8 allows the sum
of the frequency symbol counts t to be as large as 2^(32-8) = 2^24 ≈ 16 × 10^6,
with a compression loss of less than 0.01 bits per symbol on average. That
is, function arithmetic_encode_block() can process messages of up to m =
Figure 5.3: Upper bound on relative redundancy: the excess coding cost as a percentage
of entropy, plotted as a function of log2 p_n and of b - f (curves for b - f = 2, 3, 4,
5, and 6), assuming s_n is the most probable symbol. Taken from Moffat et al. [1998].

16 × 10^6 symbols before the limit on t is breached, which is hardly onerous.
On the other hand, the CACM implementation requires that all of t, n, and m
be less than 2^14 = 16,384, a considerable restriction.
This discussion completes our description of one particular implementation
of arithmetic coding. Using it as a starting point, the next section examines a
number of design alternatives.

5.4 Variations
The first area where there is scope for modification is in the renormalization
regime. The mechanism illustrated in Algorithm 5.2 is due to Witten et al.
[1987], and the decoder arrangement of Algorithm 5.4 (using D = V - L) was
described by Moffat et al. [1998]. The intention of the renormalization process
is to allow incremental output of bits, and the use of fixed-precision arithmetic;
and other solutions have been developed.
One problem with the renormalization method described above is that it is
potentially bursty. If by chance the value of bits_outstanding becomes large,
starvation might take place in the decoding process, which may be problem-
atic in a communications channel or other tightly-clocked hardware device. A
solution to this problem is the bit stuffing technique used in a number of IBM
hardware devices [Langdon and Rissanen, 1984]. Suppose that an output reg-
ister logically to the left of L is maintained, and a bit from L is moved into this
register each time R is doubled. When the register becomes full it is written,
and then inspected. If upon inspection it is discovered to be all "1" bits, then
instead of the register's bit-counter being set back to zero, which would mean
that all bit positions in the register are vacant, it is set to one, which creates a
dummy "0" bit in the most significant position. Processing then continues, but
now any carry out of the most significant bit of L will enter the register, and
either stop at a more recent "0" bit, or propagate into the dummy bit. Either
way, there is no need for the encoder to renege upon or delay delivery of any
of the earlier values of the register.
In the decoder, if an all-ones word is processed, then the first bit of the
following word is inspected. If that bit is also one, then an unaccounted-for
carry must have taken place, and the decoder can adjust its state variables ac-
cordingly. If the lead bit of the following word is a zero, it is simply discarded.
This mechanism avoids the possible problems of starvation, but does have the
drawback of making the decoder more complex than was described above. This
is essentially the only drawback, as the redundancy introduced by the method
is very small. For example, if the register is 16 bits wide then an extra bit will
be introduced each time the register contains 16 "1" bits. If the output from a
coder is good, it should be an apparently random stream of ones and zeros, and
so an extra bit will be inserted approximately every 2 × 2^16 bytes, giving an
expansion of just 0.0001%.
A different variation is to change the output unit from bits to bytes, a sug-
gestion due to Michael Schindler [1998]. As described above, arithmetic cod-
ing operates in a bit-by-bit manner. But there is no reason why R cannot be
allowed to become even smaller before renormalization takes place, so that one
byte at a time of L can be isolated. Algorithm 5.7 shows how the encoder is
modified to implement this.
The key difference between Algorithm 5.7 and the previous version of
arithmetic_encode() is that at step 5 the renormalization loop now executes
only when R ≤ 2^(b-8), that is, when there are eight leading zero bits in R and
hence eight bits of L that are, subject to possible later carry, available for output.
The carry situation itself is detected prior to this at step 4. If any previous
zero bits have to be recanted then the normalized value of L will exceed 1.0,
which corresponds still to 2^b. In this case the carry is propagated via the use of
function byte_carry(), and L is decreased by 1.0 to bring it back into the normal
range. Note that the fact that L ≥ 2^b is now possible means that if w is the
word size of the hardware being used, then b ≤ w - 1 must be used, whereas
previously b ≤ w was safe. On the other hand, now that b < w, it is possible
to allow R to be as large as 1.0 rather than the 0.5 maximum maintained in
Algorithm 5.2, so there is no net effect on the number of bits available for R,
which still has as many as w - 1 bits of precision.
Algorithm 5.7
Arithmetically encode the range [l/t, h/t) using fixed-precision integer
arithmetic and byte-by-byte output. The bounds at each call are now
2^(b-8) < R ≤ 2^b, 0 ≤ L < 2^b, and L + R ≤ 2^(b+1). With the carry test written
as it is here, b must be at least one less than the maximum number of bits
used to represent integers, since transient values of L larger than 2^b may be
calculated. This means that range R should be initialized to 2^b, which can
now be represented. With a modified carry test, b = w can be achieved to
allow the decoder to also be fully byte-aligned.
arithmetic_encode_bytewise(l, h, t)
1: execute steps 1 to 6 of Algorithm 5.2 on page 99
2: if L ≥ 2^b then
3:   set L ← L - 2^b
4:   byte_carry()
5: while R ≤ 2^(b-8) do
6:   set byte ← right_shift(L, b - 8)
7:   byte_plus_prev(byte)
8:   set L ← L - left_shift(byte, b - 8)
9:   set L ← left_shift(L, 8) and R ← left_shift(R, 8)

The output routines required in the byte-aligned renormalization process
are shown in Algorithm 5.8. A variable counting the number of "all one" bytes
is maintained, together with the most recent byte that contained a zero bit:
number_ff_bytes and last_non_ff_byte respectively. If an output byte, taken from
the most significant eight bits of L, is all ones ("FF" in hexadecimal) then one
is added to the variable number_ff_bytes, and no actual output takes place.
On the other hand, if a non-FF byte is generated by the renormalization
loop, then the byte buffered in last_non_ff_byte is written, followed by another
number_ff_bytes of "FF", after which the new byte is installed as the replacement
value for last_non_ff_byte. Once this sequence has been performed
number_ff_bytes is reset to zero. Finally, if L becomes greater than 1.0 and
a carry is necessary, one is added to last_non_ff_byte, which is then written;
number_ff_bytes - 1 of "00" (hexadecimal) are written ("00" rather than "FF",
as the carry must have moved through all of these bytes); number_ff_bytes is
set to zero; and last_non_ff_byte is set to "00". That is, all of the "FF" bytes get
turned to "00" bytes by virtue of the carry, but one of them must be retained as
the last_non_ff_byte. Note that a carry can only ever take place into a previously
unaffected "0" bit, and so the initialization at step 2 is safe, even if the first byte
generated by the coder is an "FF" byte.

Algorithm 5.8
Execute a carry into the bitstream represented by last_non_ff_byte and
number_ff_bytes.
byte_carry()
1: set last_non_ff_byte ← last_non_ff_byte + 1
2: while number_ff_bytes > 0 do
3:     put_one_byte(last_non_ff_byte)
4:     set last_non_ff_byte ← "00"
5:     set number_ff_bytes ← number_ff_bytes − 1
Byte-oriented output from an arithmetic coder, with provision for carry.
byte_plus_prev(byte)
1: if this is the first time this function is called then
2:     set last_non_ff_byte ← byte and number_ff_bytes ← 0
3: else if byte = "FF" then
4:     set number_ff_bytes ← number_ff_bytes + 1
5: else
6:     put_one_byte(last_non_ff_byte)
7:     while number_ff_bytes > 0 do
8:         put_one_byte("FF")
9:         set number_ff_bytes ← number_ff_bytes − 1
10:    set last_non_ff_byte ← byte

Use of b = 31 meets the constraints that were discussed above, but in-
troduces a problem in the decoder - the call to start_decode() reads b = 31
bits into D, and then all subsequent input operations require 8 bits. That is,
while we have achieved a byte-aligned encoder, the decoder always reads in
split bytes of 1 bit plus 7 bits. To remedy this, and allow b = 32 even on a
32-bit machine, the test for "L ≥ 2^b" in function arithmetic_encode_bytewise()
must be further refined. In some languages - C being one of them - overflow
in integer arithmetic does not raise any kind of exception, and all that happens
is that carry bits are lost out of the high end of the word. The net effect is
that the computed answer is correct, modulo 2^w, where w is the word size. If
integer overflow truncation may be assumed, then when a carry has occurred,
the new value L' calculated by step 2 of function arithmetic_encode() (Algo-
rithm 5.2 on page 99) will in fact be less than the old value of L. To achieve
a full b = w = 32 byte-aligned coder, the old L is retained, and not updated
to the new L' value until after the carry condition has been tested: "if L' < L
then", and so on.
With or without the additional modification described in the previous para-
graph, byte-aligned arithmetic coding suffers from the drawback that the num-
ber of bits f that can be used for frequency counts must become smaller. The
requirement that max{t} ≤ min{R} now means that about seven fewer bits
are available for frequency counts than previously. In some applications this re-
striction may prove problematic; in others it may not, and the additional speed
of byte-by-byte output determination is a considerable attraction.
A compromise approach between byte-alignment and bit-versatility is of-
fered in a proposal by Stuiver and Moffat [1998]. Drawing on the ideas of
table-driven processing that were discussed in Section 4.3, they suggest that a
k-bit prefix of R be used to index a table of 2^k entries indicating how many
bits of L need to be shifted out. For example, if the most significant 8 bits of
R are used to index the shift table, then as much as one byte at a time can be
moved, and the number of actual bit shifting operations is reduced by a factor
of two or more. This method allows f to be as large as b - 2 again, if large
values of t are desired, but it is a little slower than the byte-aligned mechanism
of Schindler.
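As a rough sketch of the table-driven idea (and not the authors' code), the table
can simply record, for each possible value of the top eight bits of R, how many
leading zero bits that prefix contains; the number of renormalization shifts that
can safely be performed in one step follows from that count, so L and R can be
shifted by several places at once and the corresponding top bits of L emitted as a
group. The names below are invented for the sketch.

    #include <stdint.h>

    /* lz8[p] is the number of leading zero bits in the 8-bit value p,
       with lz8[0] = 8.  Indexed by the most significant byte of R, it
       bounds the number of renormalization shifts available in one step. */
    static uint8_t lz8[256];

    static void init_lz8(void)
    {
        lz8[0] = 8;
        for (int p = 1; p < 256; p++) {
            int z = 7;
            for (int v = p; v > 1; v >>= 1)
                z--;               /* z = 7 - floor(log2 p) */
            lz8[p] = (uint8_t)z;
        }
    }

With b-bit state variables, a single probe of lz8[R >> (b − 8)] accounts for at
most eight shift positions, which is why the method moves at most one byte at a
time; the exact number of shifts actually performed also depends on the
normalization bound in force.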
As a further option, it is possible to use floating point arithmetic to ob-
tain higher precision. For example, a "double" under the IEEE floating point
standard contains a 52-bit mantissa (53 bits of significance including the implied
leading bit) [Goldberg, 1991], so an exact representation for integers up to 2^53
can be obtained, compared to the more usual 2^32 − 1 that is available in integer
arithmetic on most popular architectures.
The structure used for calculating cumulative frequencies is also a compo-
nent of arithmetic coding which can be replaced by another mechanism. For
PAGE 118 COMPRESSION AND CODING ALGORITHMS

static coding, which is the paradigm assumed in this chapter, a cum-prob array
is adequate, unless the subalphabet is a sparse subset of [1 ... nmax]. For adap-
tive coding a more elegant structure is required, an issue discussed in detail in
Section 6.6 on page 157.

5.5 Binary arithmetic coding


Because arithmetic coding is particularly effective on very skewed probabil-
ity distributions, it is hardly surprising that many of the applications in which
it has made an impact are based upon binary alphabets. For example, in the
compression of bi-level images a context-based model [Langdon and Rissa-
nen, 1981] might have some states that are so confident in their predictions
that a probability distribution P = [0.999,0.001] is appropriate. For such a
skewed distribution the entropy is just 0.011 bits per symbol - a compression
rate that clearly cannot be attained with a minimum-redundancy coder. Indeed,
if a binary alphabet is employed then every coding step involves a skew dis-
tribution, as one or the other of the two symbols must have a probability that
is at least 50%. Furthermore, in many binary-alphabet applications it is not
appropriate for the usual blocking techniques to be employed, such as Golomb
coding runs of zero symbols. These techniques rely on the sequence of bits be-
ing drawn from the same probability distribution, which is not possible when,
for example, context-based bi-level image compression is being undertaken.
That is, some applications have an intrinsic need for bit-by-bit coding in which
the probabilities of "0" and "1" may vary enormously from one bit to the next.
Because of the need for bit-by-bit coding, and the importance of image
compression in the commercial arena, the field of binary arithmetic coding has
received as much - if not more - attention in the research literature as has multi-
symbol arithmetic coding. In this section we examine some of the mechanisms
that have been described for binary arithmetic coding. These mechanisms are
not restricted to the domain of image compression, and can be used in any
application in which a binary alphabet arises. For example, the DMC text
compression mechanism [Cormack and Horspool, 1987] also makes use of a
binary source alphabet and a context-driven probability estimation technique.
Algorithm 5.9 shows the encoding and decoding routines that arise when
the processes described in Section 5.3 are restricted to a binary alphabet. The
calculation of cumulative frequencies is now trivial, and because there is only
one splitting point to be determined in the [L, L + R) range, a number of the
multiplicative operations are avoided. Moreover, the use of a binary alpha-
bet means that in the decoder there is no need to calculate an explicit target.
Note that the symbol identifiers are assumed to be zero and one, rather than
being numbered from one, as is assumed in the earlier sections of this chapter,

Algorithm 5.9
Arithmetically encode binary value bit, where "0" and "1" bits have
previously been observed c0 and c1 times respectively.
binary_arithmetic_encode(c0, c1, bit)
1: if c0 < c1 then
2:     set LPS ← 0 and c_LPS ← c0
3: else
4:     set LPS ← 1 and c_LPS ← c1
5: set r ← R div (c0 + c1)
6: set r_LPS ← r × c_LPS
7: if bit = LPS then
8:     set L ← L + R − r_LPS and R ← r_LPS
9: else
10:    set R ← R − r_LPS
11: renormalize L and R, as for the non-binary case
Return a binary value bit, where "0" and "1" bits have previously been
observed c0 and c1 times. There is no need to explicitly calculate a target.
binary_arithmetic_decode(c0, c1)
1: if c0 < c1 then
2:     set LPS ← 0 and c_LPS ← c0
3: else
4:     set LPS ← 1 and c_LPS ← c1
5: set r ← R div (c0 + c1)
6: set r_LPS ← r × c_LPS
7: if D ≥ (R − r_LPS) then
8:     set bit ← LPS, D ← D − (R − r_LPS), and R ← r_LPS
9: else
10:    set bit ← 1 − LPS and R ← R − r_LPS
11: renormalize D and R, as for the non-binary case
12: return bit

and that they are further symbolized as being either the more probable sym-
bol (MPS) or the less probable symbol (LPS). This identification allows two
savings. It means, as was suggested in Section 5.3, that the truncation excess
can always be allocated to the MPS to minimize the compression inefficiency;
and it also means that the coding of the MPS is achieved with slightly fewer
operations than is the LPS. Finally, note that the MPS receives the truncation
excess, but is coded at the bottom of the [L, L + R) range.
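For concreteness, the encoder half of Algorithm 5.9 might be rendered in C as
below; the acval type, the externally maintained L and R, and renormalize_LR()
are stand-ins for the machinery of the earlier algorithms rather than names from
any particular implementation.

    #include <stdint.h>

    typedef uint32_t acval;           /* b-bit coder quantities, b <= 32 assumed */

    extern acval L, R;                /* coder state, as in Algorithm 5.2 */
    extern void renormalize_LR(void); /* as for the non-binary case       */

    void binary_arithmetic_encode(acval c0, acval c1, int bit)
    {
        int   LPS   = (c0 < c1) ? 0 : 1;     /* the less probable symbol      */
        acval c_LPS = (c0 < c1) ? c0 : c1;
        acval r     = R / (c0 + c1);         /* r = R div t, with t = c0 + c1 */
        acval r_LPS = r * c_LPS;

        if (bit == LPS) {                    /* LPS: top part of the range        */
            L += R - r_LPS;
            R  = r_LPS;
        } else {                             /* MPS: keeps the truncation excess  */
            R -= r_LPS;
        }
        renormalize_LR();
    }

The decoder is symmetric, testing D against R − r_LPS instead of being told the
bit, exactly as in the second half of Algorithm 5.9.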
Binary arithmetic coders have one other perhaps surprising application, and
that is to code multi-symbol alphabets [Howard, 1997, Moffat et al., 1994]. To
see how this can be, suppose that the source alphabet S has n symbols. Sup-
pose also that the symbol identifiers are assigned as the leaves of a complete
binary tree of n − 1 internal nodes and hence n leaves. The simplest arrange-
ment is a balanced tree of n leaves and depth ⌈log2 n⌉, but in fact there is no
need for any particular structure to the tree. Indeed, it can be a stick - a degen-
erate tree - if that arrangement should prove to be appropriate for some reason.
Finally, suppose that each of the internal nodes of this tree is assigned a pair
of conditional probabilities, calculated as follows. Let p_l be the sum of the
probabilities of all of the symbols represented in the left subtree of the node,
and p_r the sum of the probabilities of the symbols represented in the right sub-
tree. Then the probability assigned to the left subtree is p_l/(p_l + p_r) and the
probability assigned to the right subtree is p_r/(p_l + p_r).
To represent a particular symbol the tree is traversed from the root, at each
node coding a binary choice "go left" or "go right" based upon the associated
probabilities p_l and p_r. The overall code for the symbol is then the sum of
the incremental codes that drive the tree traversal. Because the sum of the
logarithms of the probabilities is the same as the logarithm of their product,
and the product of the various conditional probabilities telescopes to p_s when
symbol s is being coded, the net cost for symbol s is −log2 p_s bits.
Given that this works with any n-leaf tree, the obvious question to ask is
how should the tree be structured, and how should the symbols be assigned
to the leaves of the tree, so the process is efficient. This question has three
answers, depending upon the criterion by which "efficient" is to be decided.
If efficiency is determined by simplicity, then there are two obvious trees
to use. The first is a stick, that is, a tree with one leaf at depth one, one at
depth two, one at depth three, and so on. This is the tree that corresponds in a
prefix-code sense to the unary code described in Section 3.1 on page 29. Each
binary arithmetic code emitted during the transmission of a symbol number s
can then be thought of as a biased bit of a unary code for s, where the bias is by
exactly the right amount so that a zero-redundancy code for s results. The other
obvious choice of tree is a balanced binary tree. In this case the mechanism
can be thought of as coding, bit by bit, the binary representation of the symbol

Figure 5.4: Example of binary arithmetic coding used to deal with a multi-symbol
alphabet. In this example the source alphabet is S = [1 ... 6], with symbol frequencies
P = [7,2,0,0,1,0], and the tree is based upon the structure of a minimal binary code.

number s, again with each bit biased by exactly the right amount. This tree has
the advantage of requiring almost the same number of binary arithmetic coding
steps to transmit each symbol, and minimizes the worst case number of steps
needed to code one symbol.
Figure 5.4 shows the tree that results if the alphabet S = [1,2,3,4,5,6]
with frequencies P = [7,2,0,0,1,0] is handled via a minimal binary tree. To
code symbol s = 2, for example, the left branch out of the root node is taken,
and a code of −log2(9/10) bits generated, then the right branch is taken to the
leaf node 2, and a code of −log2(2/9) bits generated, for a total codelength
(assuming no compression loss) of −log2(2/10), as required. Note that prob-
abilities of 0/1 and even 0/0 are generated but are not problematic, as they
correspond to symbols that do not appear in this particular message. Probabil-
ities of 1/1 correspond to the emission of no bits.
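A traversal of this kind might be driven by code along the following lines; the
node structure, its field names, and the idea of supplying the root-to-leaf path
explicitly are inventions of this sketch, not something prescribed by the text.

    #include <stdint.h>

    /* binary coder of Algorithm 5.9 (see the earlier sketch) */
    void binary_arithmetic_encode(uint32_t c0, uint32_t c1, int bit);

    struct node {
        uint32_t     left_count;    /* total frequency of symbols in left subtree  */
        uint32_t     right_count;   /* total frequency of symbols in right subtree */
        struct node *left, *right;  /* children; NULL at a leaf                    */
    };

    /* path[0 .. path_len-1] holds the branch decisions from the root to the
       leaf for the symbol being coded: 0 for left, 1 for right. */
    void encode_symbol(struct node *root, const int *path, int path_len)
    {
        struct node *t = root;
        for (int i = 0; i < path_len; i++) {
            /* the "0"/"1" counts handed to the binary coder are just the
               left and right subtree frequencies at this node */
            binary_arithmetic_encode(t->left_count, t->right_count, path[i]);
            t = (path[i] == 0) ? t->left : t->right;
        }
    }

Because the product of the conditional probabilities along the path telescopes to
p_s, the total code length is −log2 p_s bits, as argued above.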
The second possible measure of efficiency is to minimize the average num-
ber of calls to function binary_arithmetic_encode(). It should come as no sur-
prise to the reader (hopefully!) that the correct tree structure is a Huffman tree,
as this minimizes the weighted path length over all binary trees for the given
set of probabilities. The natural consequence of this is that, as far as is possi-
ble, the conditional binary probabilities used at each step will be approximately
0.5, as in a Huffman tree each node represents a single bit, and that single bit
carries approximately one bit of information.
The third possible measure of efficiency is the hardest to minimize, and
that is compression effectiveness. In any practical arithmetic coder each binary
coding step introduces some small amount of compression loss, and these must
be aggregated to get an overall compression loss for the source symbol. For

example, some binary arithmetic coders are closest to optimal when the proba-
bility distribution is extremely skew - an arrangement that is likely to occur if
a unary-structured tree is used on a decreasing-probability alphabet.
The idea of using binary arithmetic coding to stipulate a path through a tree
can also be applied to infinite trees. For example, each node of the infinite tree
that corresponds to the Elias Cγ code can also be assigned a biased bit and then
used for arithmetic coding.
In practical terms, there are two drawbacks to using a tree-structured binary
coder - time and effectiveness. Unless the probability distribution is strongly
biased in favor of one symbol, multiple binary coding steps will be required on
average, and there will be little or no time saving compared to a single multi-
alphabet computation. And because compression redundancy is introduced at
each coding step, it is also likely that the single multi-alphabet code will be
more effective. What the tree-structured coder does offer is an obvious route to
adaptation, as the two counts maintained at each node are readily altered. But
adaptive probability estimation is also possible in a multi-alphabet setting, and
the issue of adapting symbol probability distributions will be taken up in detail
in Section 6.6 on page 157.

5.6 Approximate arithmetic coding


The process of arithmetic coding is controlled by a mapping from the range
[0, t) of cumulative symbol frequencies to the range [0, R) set by the current
values of the state variables. In the development above, we first presumed that
this mapping could be achieved seamlessly by defining the mapping function
f(x, t, R) to be

    f(x, t, R) = (x / t) × R.                                        (5.2)

To code a symbol represented by the triple (l, h, t), we then computed

    L' = L + f(l, t, R)
    R' = f(h, t, R) − f(l, t, R).
But we also noted that the mapping function of Equation 5.2 was not attainable
in practice because L and R were to be manipulated as integer-valued variables.
Hence, in Section 5.3, we modified the mapping function and used instead

    f(x, t, R) = { x × (R div t)    if x ≠ t,
                 { R                otherwise.                       (5.3)

This modified mapping had the advantage of working with b-bit integer arith-
metic, and, provided b - f was not too small and the truncation excess was
allocated to the most probable symbol, not causing too much compression loss.
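In C the modified mapping and one coding step might look as follows; this is a
sketch only, reusing the text's names l, h, t, L, and R, with b-bit unsigned
arithmetic assumed and renormalization left to the routine described earlier.

    #include <stdint.h>

    typedef uint32_t acval;

    extern acval L, R;                         /* coder state, as in Algorithm 5.2 */

    /* One coding step for the symbol described by the triple (l, h, t), using
       the mapping of Equation 5.3: the truncation excess R - t*(R div t) is
       given to the value x = t at the top of the range. */
    void arithmetic_encode_step(acval l, acval h, acval t)
    {
        acval r    = R / t;                    /* R div t                      */
        acval low  = l * r;                    /* f(l, t, R); l < t always     */
        acval high = (h == t) ? R : h * r;     /* excess goes to the top       */
        L += low;                              /* L' = L + f(l, t, R)          */
        R  = high - low;                       /* R' = f(h, t, R) - f(l, t, R) */
        /* renormalization of L and R follows, as described in Section 5.3 */
    }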

Other mapping functions are possible, and may be desirable if different
constraints are brought to bear. For example, the CACM implementation uses
the alternative mapping f(x, t, R) = (x × R) div t, which suffers from less
rounding error because the multiplication is performed first, but restricts the
values of t that can be managed without overflow.
One problem with arithmetic coding, especially during the earlier years of
its development, was slow speed of execution. This was caused by a number of
factors, including the amount of computation performed for each bit of output,
and the fact that multiplication and division operations are required for each
input symbol. The first of these two expenses can be reduced by the use of byte-
wise renormalization; the second is more problematic. For example, the CACM
implementation requires four multiplicative operations per symbol encoded,
and the TOIS implementation described in Section 5.3 requires three, or two if
the most probable symbol is coded.
On machines without full hardware support, these multiplicative operations
can be expensive. For example, a machine which the first author used in the
early 1990s implemented all integer multiplicative operations in software, and
an integer division took approximately 50 times longer than an integer addition.
To eliminate the multiplicative operations from the mapping and replace
them by less expensive operations, several other mapping functions have been
proposed. For example, because the value r = R div t calculated in Equa-
tion 5.3 has only b - f bits of precision, it can be calculated in O(b - f) time
using a shift/test/add loop. That is, when b - f is small, on certain architectures
r might be more speedily computed by not dividing R by t.
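One way to realize such a loop is the restoring shift/subtract division sketched
below, which produces the quotient one bit at a time and never iterates more
than b − f times; the function name and the explicit quotient_bits parameter
are inventions of the sketch.

    #include <stdint.h>

    /* Compute r = R div t, given that the quotient is known to fit in
       quotient_bits (here, b - f) bits, using only shifts, comparisons, and
       subtractions.  Assumes t << i does not overflow, which holds when
       t < 2^f and b is no larger than the word size. */
    static uint32_t divide_R_by_t(uint32_t R, uint32_t t, int quotient_bits)
    {
        uint32_t r = 0;
        for (int i = quotient_bits - 1; i >= 0; i--) {
            uint32_t trial = t << i;           /* t * 2^i */
            if (trial <= R) {
                R -= trial;
                r |= (uint32_t)1 << i;
            }
        }
        return r;
    }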
Several other schemes have been proposed [Chevion et al., 1991, Feygin
et al., 1994, Graf, 1997, Rissanen and Mohiuddin, 1989], and all share the same
desire to eliminate the multiplicative operations. In this section we describe an
approximate mapping suggested by Stuiver and Moffat [1998]. Suppose that t,
the total of the symbol frequency counts, is constrained such that R/2 < t ≤
R. Then mapping [0, t) to [0, R) is not unlike the problem of minimal binary
coding considered in Section 3.1 on page 29. There, some codewords were
set to be one bit shorter than the others; here, some values in the domain [0, t)
should be allocated two units of the range [0, R). For example, when t = 5 and
R = 7, three of the five integer values in the domain must be allocated single
integers in the range, and two of the values in the domain can be allocated
double units in the range. If all of the single units are allocated first, this gives
the mapping f(0) → 0, f(1) → 1, f(2) → 2, f(3) → 3, f(4) → 5, and
f(5) → 7. More generally, the mapping is given by
f(5) -+ 7. More generally, the mapping is given by

    f(x, t, R) = { x           if x < d,
                 { 2x − d      otherwise,                            (5.4)

Algorithm 5.10
Use a simple mapping from [0 ... t] to [0 ... R] as part of an arithmetic
coder. The while loop is required to ensure R/2 < t ≤ R prior to the
mapping process.
approximate_arithmetic_encode(l, h, t)
1: while t ≤ R/2 do
2:     set l ← 2 × l, h ← 2 × h, and t ← 2 × t
3: set d ← 2 × t − R
4: set L ← L + max{l, 2 × l − d}
5: set R ← max{h, 2 × h − d} − max{l, 2 × l − d}
6: renormalize L and R, as described previously
Return a decoding target by inverting the mapping.
approximate_decode_target(t)
1: set bits ← 0
2: while t ≤ R/2 do
3:     set bits ← bits + 1 and t ← 2 × t
4: set d ← 2 × t − R
5: set target ← min{D, (D + d)/2}
6: return right_shift(target, bits)

where d = 2t − R is the number of values in the range [0, t) that are allocated
single units in [0, R).
The easiest way to ensure that R/2 < t ≤ R in arithmetic_encode_block()
(Algorithm 5.3) is to scale the frequency counts P[s] so that their total t equals
2^(b−2), the lower limit for R. Use of the initialization R = 2^(b−1) in function
start_encode() then ensures that the constraint is always met. This scaling ap-
proach is tantamount to performing pre-division and pre-multiplication, and so
the multiplicative operations are not avoided entirely; nevertheless, they are
performed per alphabet symbol per block, rather than per symbol transmitted.
If control over the block size is possible, another way of achieving the neces-
sary relationship between t and R is to choose m = 2^(b−2). This choice forces
t = m to be the required fixed value without any scaling being necessary, at
the cost of restricting the set of messages that can be handled.
A more general way of meeting the constraint on the value of t is illus-
trated in Algorithm 5.10. Now all of l, h, and t are scaled by a power of two
sufficiently large to ensure that the constraint is met. The coding then proceeds
using the mapping function of Equation 5.4. Also illustrated is the function
approximate_decode_target(), which is identical in purpose to the earlier func-
tion decode_target(), but scales t before applying the inverse of the approx-
imate mapping. The remaining function, approx_arithmetic_decode(), makes

use of the value bits calculated in approximate_decode_target(), and is left for
the reader to construct.
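The core of the approximate scheme is then just a pair of branch-and-shift
operations; the sketch below (names invented) shows the forward mapping of
Equation 5.4 and the inversion used when forming the decoding target, assuming
that R/2 < t ≤ R has already been arranged by the doubling loop of Algorithm 5.10.

    #include <stdint.h>

    /* Forward mapping of Equation 5.4, with d = 2*t - R; equivalent to
       max(x, 2*x - d), so no multiplications or divisions are needed. */
    static uint32_t approx_map(uint32_t x, uint32_t d)
    {
        return (x < d) ? x : 2 * x - d;
    }

    /* Inverse used by the decoder: recover x from a value y in [0, R);
       equivalent to min(y, (y + d) / 2), as in approximate_decode_target(). */
    static uint32_t approx_unmap(uint32_t y, uint32_t d)
    {
        return (y < d) ? y : (y + d) / 2;
    }

The second form works because the two range units allocated to a value x ≥ d
are 2x − d and 2x − d + 1, and integer division by two maps both back to x.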
As in Equation 5.3, the truncation excess is allocated at the top of the
range. But it is no longer possible to stipulate that all of the truncation ex-
cess is granted to a single symbol, and it makes no difference which symbols
are actually placed at the top of the range. All that can really be said is that no
symbol is mapped to less than half of the range it would have been given under
the mapping of Equation 5.2. Hence, in the worst case, as much as one bit per
symbol of compression loss might be introduced.
On average the compression loss is less than this. Stuiver and Moffat show
that if the symbols in the message are independent, and if the probability esti-
mates being used are accurate for the underlying message (which is always the
case if they are based upon the symbol frequencies within the message) then
the average loss of compression effectiveness is

log2 210g 2 e ::::: 0.086


e
bits per symbol coded. In the context of Table 5.4 on page 112 this suggests that
the approximation gives comparable compression to the mechanism of function
arithmetic_encode() when b − f = 5; but Table 5.4 was drawn up assuming that
pn → 0, which is not usually the case when pn is the largest probability. Actual
experiments suggest behavior closer to that obtained with b − f = 3 or b − f =
4. Nevertheless, the compression loss is relatively small when the entropy
of the distribution is greater than one (that is, non-binary alphabets), and on
machines for which multiplication and division operations are expensive, the
altered mapping provides a useful performance boost.
Readers should note, however, that the cost ratio between additive type op-
erations and multiplicative type operations has been greatly reduced in recent
years. In experiments with Pentium hardware late in the 1990s the differential
between a shift/add implementation of Algorithm 5.2 and one based upon mul-
tiplications was very slim. Direct use of multiplications also has the advantage
of being considerably simpler to implement and thus debug. Unless there are
specialized needs not considered here, the reader is cautioned that there may
now be no speed advantage to a shift/add arithmetic coder.
Another way in which approximation can be used in arithmetic coding
is via the use of inexact symbol frequencies. Paradoxically, this approxima-
tion can actually help compression effectiveness. It was noted already that the
cost of storing exact symbol frequencies in the prelude was high compared
to the codeword lengths stored in a minimum-redundancy prelude. This ob-
servation suggests that the cost of storing the prelude can be reduced if more
coarse-grained frequencies are stored, and the probabilities approximated. In

Component and                 Blocksize 1,000         Blocksize 1,000,000
Code                          Exact     Approx.       Exact     Approx.
Auxiliary information:
  Cδ                          0.04      0.04          0.00      0.00
Subalphabet selection:
  interpolative               3.36      3.36          0.10      0.10
Symbol frequencies:
  interpolative               0.83      0.64          0.15      0.10
Message codewords:
  arithmetic                  8.22      8.24          10.95     10.98
Total cost                    12.46     12.29         11.20     11.17

Table 5.5: Cost of using arithmetic coding to compress file WSJ.words (Table 4.5
on page 71) using exact symbol frequencies and approximate symbol frequencies, ex-
pressed as bits per symbol of the source file. In the case of approximate frequencies,
each symbol was assigned to the bucket indicated by ⌊log2 p_i⌋, and each symbol in
that bucket assigned the frequency ⌊1.44 × 2^⌊log2 p_i⌋⌋. Two different block sizes are
reported: 1,000 symbols per block, and 1,000,000 symbols per block.

a minimum-redundancy code the probabilities can be thought of as being ap-
proximated by negative powers of two. In an arithmetic code an analogous
approximation is to represent each symbol frequency p_i by 2^⌊log2 p_i⌋ (or, as we
shall see in a few minutes, some other closely related value), and in the pre-
lude transmit the rounded-off value 1 + ⌊log2 p_i⌋ rather than p_i. The message
component can be expected to grow, as imprecise codes are being used. But
the growth might be more than compensated by the prelude saving.
Table 5.5 illustrates this effect by applying arithmetic_encode_block() to
the file WSJ.words that was introduced in Table 4.5 on page 71. To create the
table, the file was compressed in blocks of 1,000 and then 1,000,000 symbols,
and the cost of storing the various components of the prelude and message bits
summed over the blocks, first using exact symbol frequencies, and then using
approximate symbol frequencies. In this case approximate symbol frequencies
p′_i were calculated as p′_i = ⌊1.44 × 2^⌊log2 p_i⌋⌋, with the value 1 + ⌊log2 p_i⌋
transmitted in the prelude rather than p_i. The multiplication by 1.44 places
p′_i at the mean value within each range, assuming a 1/x probability density
function on symbol frequencies.
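The bucketing can be computed with a few lines of integer arithmetic; the sketch
below (function names invented) returns the transmitted bucket number
1 + ⌊log2 p⌋ and the representative frequency ⌊1.44 × 2^⌊log2 p⌋⌋ that both
encoder and decoder then use in place of the exact count.

    #include <stdint.h>

    /* Bucket number 1 + floor(log2 p) for a frequency p >= 1. */
    static int bucket(uint32_t p)
    {
        int b = 1;
        while (p > 1) {
            p >>= 1;
            b++;
        }
        return b;
    }

    /* Representative frequency floor(1.44 * 2^(bucket_no - 1)) for a bucket. */
    static uint32_t representative(int bucket_no)
    {
        uint64_t base = (uint64_t)1 << (bucket_no - 1);   /* 2^floor(log2 p)       */
        return (uint32_t)((144 * base) / 100);            /* avoids floating point */
    }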
As can be seen from the table, for both small block sizes and large block
sizes, the use of approximate frequencies results in a measurable saving in the
cost of the prelude, and a smaller consequent increase in the cost of the mes-
sage bits. Overall, better compression effectiveness is achieved by using the

approximate frequencies than by using the exact frequencies. In terms of the
minimum message length principle, it appears that recording symbol frequen-
cies exactly is a less appropriate model than merely recording their magnitude
to some base.
Nor is there any particular reason to use frequency buckets that are powers
of two. A finer-grained approximation is possible, for example, if the Fibonacci
series 1,2,3,5,8,13, ... is used to control the bucket boundaries, which is
equivalent to taking logarithms base φ; logarithms base four or ten could sim-
ilarly be used to obtain a coarser approximation and an even smaller prelude.
The fidelity of the approximation, and the cost of transmitting bucket iden-
tifiers, can be traded against each other, and it is possible that further slight
savings might be garnered by use of a base either less than or greater than two.
Finally in this section, it should be noted that Chapter 6 considers adap-
tive coding mechanisms, and in an adaptive environment there is no explicit
transmission of the prelude at all.
The software used in these experiments is available from the web site for
this book at www.cs.mu.oz.au/caca.

5.7 Table-driven arithmetic coding


All of the preceding discussion in this chapter treated arithmetic coding as a
process that operates on numbers. But once we accept that integer arithmetic
is to be used, we can also think of arithmetic coding as being a state-based
process. Howard and Vitter [1992b, 1994b] investigated this notion, and our
presentation here is derived from their work; another state-based mechanism is
considered in Section 6.11.
Suppose that an arithmetic coder is operating with b = 4, and the normal-
ization requirements in force are that 0 ≤ L < 16, that 4 < R ≤ 8, and that
L + R ≤ 16. At any given moment the internal state of the coder is specified
by a combination of L and R. Applying the various constraints maintained by
the normalization process, the [L, L + R) internal state is always one of:

    [ 0, 5)   [ 0, 6)   [ 0, 7)   [ 0, 8)
    [ 1, 6)   [ 1, 7)   [ 1, 8)   [ 1, 9)
    [ 2, 7)   [ 2, 8)   [ 2, 9)   [ 2,10)
        ...
    [ 8,13)   [ 8,14)   [ 8,15)   [ 8,16)
    [ 9,14)   [ 9,15)   [ 9,16)
    [10,15)   [10,16)
    [11,16),

a total of 42 possibilities. Suppose also, for simplicity, that a message over a
binary alphabet S = [0, 1] is to be coded. Each bit of the message causes the
values of L and R to be modified, and possibly some bits to be output, or in the
case of changes to bits_outstanding, queued for future output. That is, depend-
ing upon the current state, the estimated probability that the next bit is a "0",
and the actual value of the next bit, L will be adjusted upward by some amount,
range-narrowing will reduce R by some amount, and then renormalization will
alter both, outputting a bit for each loop iteration. After the renormalization,
the system must again be in one of the 42 states that satisfies the constraints on
L and R.
For example, suppose that the machine is currently in state [2,10), and a
bit "0" is to be coded, with estimated probability 0.1 - or any other probabil-
ity less than, for this state, a threshold value somewhere between 0.125 and
0.250. Then the range narrowing step creates the intermediate combination
[2,3), which is always subsequently expanded to the unstable state [4,6) with
the emission of a "0" bit; then to the unstable state [8,12) with the emission
of a "0" bit; then to [0,8), one of the stable states, with the emission of a "1"
bit. So if we are in state [2,10) and the next bit is a "0" bit with probability
0.1, the sequence is ordained - emission of "001", and transfer to state [0,8).
Similarly, in the same state, and with the same probability estimate, a "1" input
with probability 0.9 drives the state machine to state [3,10), and no bits are
output.
To determine the probability thresholds that apply to each state, suppose
that the true probability of some bit being "0" is p, and that the true probability
of it being "1" is thus (1 − p). Suppose also that we wish to decide whether to
approximate the probability by q/R or by (q + 1)/R, where R is, as always,
the "width" of the current state. If this bit gets coded with probability q/R, the
expected cost will be

    −p log2 (q/R) − (1 − p) log2 ((R − q)/R).

The threshold value p that separates the use of q and q + 1 will thus be such
that the expected cost of using q/R as the probability estimate is equal to the
expected cost of using (q + 1)/R:

    p log2 (q/R) + (1 − p) log2 ((R − q)/R)
        = p log2 ((q + 1)/R) + (1 − p) log2 ((R − q − 1)/R).

This equality is attained when

    p = log2 ((R − q − 1)/(R − q)) / log2 ((q/(q + 1)) · ((R − q − 1)/(R − q))).       (5.5)

Probability     Input "0"                    Probability     Input "1"
of "0"          Output    Next state         of "1"          Output    Next state
0.000-0.182     001       [ 0, 8)            0.818-1.000               [ 3,10)
0.182-0.310     00        [ 8,16)            0.690-0.818               [ 4,10)
0.310-0.437     0         [ 4,10)            0.563-0.690               [ 5,10)
0.437-0.563     0         [ 4,12)            0.437-0.563     ?         [ 4,12)
0.563-0.690               [ 2, 7)            0.310-0.437     ?         [ 6,12)
0.690-0.818               [ 2, 8)            0.182-0.310     10        [ 0, 8)
0.818-1.000               [ 2, 9)            0.000-0.182     100       [ 8,16)

Table 5.6: Transition table for state [2,10) in a table-driven binary arithmetic coder
with b = 4. Each row corresponds to one probability range. When a symbol is en-
coded, the indicated bits are emitted and the state changed to the corresponding next
state. Bits shown as "?" indicate that bits_outstanding should be incremented.

In the example situation, with R = 8, the threshold probability between the
use of q = 1 and q = 2 is log2(6/7) / log2(6/14) ≈ 0.182, clearly greater than
the 0.1 probability of "0" that was assumed during the discussion above.
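A few lines of C suffice to tabulate these thresholds; the throwaway program
below evaluates Equation 5.5 for each admissible q when R = 8, and reproduces
the boundaries 0.182, 0.310, 0.437, and so on that appear in Table 5.6.

    #include <math.h>
    #include <stdio.h>

    /* Threshold probability separating the use of q and q + 1, Equation 5.5. */
    static double threshold(int q, int R)
    {
        double num = log2((double)(R - q - 1) / (R - q));
        double den = log2(((double)q / (q + 1)) * ((double)(R - q - 1) / (R - q)));
        return num / den;
    }

    int main(void)
    {
        int R = 8;                     /* width of the state [2,10) in the text */
        for (int q = 1; q <= R - 2; q++)
            printf("q = %d vs %d: boundary at p = %.3f\n", q, q + 1, threshold(q, R));
        return 0;
    }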
Equation 5.5 allows calculation of the set of threshold probabilities cor-
responding to each of the R different values that are possible, and hence, for
each of the 42 states listed above, allows the complete sequence of events to be
pre-determined. The actual compression process is then a matter of, for each
bit to be coded, using whatever probability estimate is appropriate for that bit
to determine a transition number out of the current state, and going to the in-
dicated new state and outputting the indicated bits, possibly including any that
are outstanding from previous transitions.
That is, all of the computation is done in advance, and execution-time arith-
metic avoided. In addition, low probability symbols that generate multi-bit
output strings are handled in a manner not unlike the R-prefix mechanism of
Stuiver and Moffat that was outlined above. Both of these factors result in
speed improvements. Table 5.6 shows further detail of the transitions out of
state [2,10) in this system. Each of the other 41 states has a similar table con-
taining between four and seven rows. Coding starts in state [0,8).
A drawback of this coarse-grained approach is that the effectiveness of the
coding is still determined by the value of b, and thus the approximate probabil-
ities inherent in Equation 5.5. For typical text-based applications values such
as b = 5 or b = 6 are sufficient to ensure minimal compression loss, but when
binary probability distributions are very skew, other forms of table-driven cod-
ing are superior. One such mechanism will be described in Section 6.11 on
page 186.

Our sketch of table-driven arithmetic coding has been brief; Howard and
Vitter [1994b] give a detailed example that shows the action of their quasi-
arithmetic binary coding process. Howard and Vitter also describe how the
mechanism can lead to a practical implementation that requires a manageable
amount of space. Variants that operate on multi-symbol source alphabets are
also possible, and are correspondingly more complex.

5.8 Related material


The Z-coder of Bottou et al. [1998] provides another alternative mechanism for
binary alphabet compression. A generalization of Golomb coding, Z-coding
retains the speed and simplicity of Golomb coding, but allows the sub-bit com-
pression rates for binary source alphabets that are normally associated only
with arithmetic coding. The key idea is to collect together runs of MPS sym-
bols, and emit a code for the length of the run only when an LPS breaks that
run. In Golomb coding, all bits must have the same estimated MPS probability,
which is why it is such a useful static code; in the Z-coder the estimated MPS
probability of each bit in the run can be different.
Another binary alphabet coder is the ELS mechanism of Withers [1997]
(see also www.pegasusimaging.com/ELSCODER.PDF). It works with frac-
tional bytes in the same way as a minimum-redundancy coder, except that the
fractions need not be eighths. It retains internal state to track fractional bytes
not yet emitted, and makes use of tables to control the transitions between
states. Like the Z-coder, and like the binary arithmetic coding routines de-
scribed above, it is capable of very good bit rates when the MPS has a high
probability.
A third binary alphabet coding scheme, developed prior to either of these
two mechanisms, is the IBM Q-coder. Because it also includes an interesting
probability estimation technique, we delay discussion of it until Section 6.11
on page 186.
Chapter 6

Adaptive Coding

In the three previous chapters it has been assumed that the probability distri-
bution is fixed, and that both encoder and decoder share knowledge of either
the actual symbol frequencies within the message, or of some underlying dis-
tribution that may be assumed to be representative of the message. While there
was some discussion of alternative ways of representing the prelude in a semi-
static system such as those of Algorithm 4.6 on page 83 and Algorithm 5.3
on page 102, we acted as if the only problem worth considering was that of
assigning a set of codewords.
There are two other aspects to be considered when designing a compression
system. Chapter 1 described compression as three cooperating processes, with
coding being but one of them. A model must also be chosen, and a mechanism
put in place for statistics (or probability) estimation. Modeling is considered in
Chapter 8, which discusses approaches that have been proposed for identifying
structure in messages. This chapter examines the third of the three components
in a compression system - how probability estimates are derived.

6.1 Static and semi-static probability estimation


One mechanism for estimating probabilities is to do exactly as we have already
assumed in Algorithms 4.6 and 5.3 - count the frequencies of the symbols in
the message, and send them (or an approximation thereof) to the decoder, so
that it can calculate an identical code. Such methods are semi-static, and will
be examined shortly.
But there is an even simpler way of estimating probabilities, and that is
for the encoder and decoder to assume a distribution typical of, but perhaps
not exactly identical to, the message being transmitted. For example, a large
quantity of English text might be processed to accumulate a table of symbol


probabilities that is loaded into both encoder and decoder prior to compression
of each message. Each message can then be handled using the same set of
fixed probabilities, and, provided that the messages compressed in this way
are typical of the training text and a good coding method is used, compression
close to the message self-information should be achieved.
One famous static code was devised by Samuel Morse in the 1830s for use
with the then newly-invented telegraph machine. Built around two symbols -
the "dot" and the "dash" - and intended for English text (rather than, say, nu-
meric data), the Morse code assigns short code sequences to the vowels, and
longer codewords to the rarely used consonants. For example, in Morse code
the letter "E" (Morse code uses an alphabet of 48 symbols including some
punctuation and message control, and does not distinguish upper-case from
lower-case) is assigned the code ".", while the letter "Q" has the code "- - . -".
Morse code has another unusual property that we shall consider further in Sec-
tion 7.3 on page 209, which is that one of the symbols costs more to transmit
than does the other, as a dash is notionally the time duration of three dots. That
is, in an ideal code based upon dots and dashes we should design the codewords
so that there are rather more dots than dashes in the encoded message. Only
then will the total duration of the encoded message be minimized.
Because there is no prelude transmitted, static codes can outperform semi-
static codes, even when the probability estimates derived from the training text
differ from those of the actual message. For example, suppose that the dis-
tribution P = [0.67,0.11,0.07,0.06,0.05,0.04] has been derived from some
training text, and the message M = [1,1,1,5,5,3,1,4,1,6] is to be transmit-
ted. Ignoring termination overheads, an arithmetic code using the distribution
P will encode M in 20.13 bits. An arithmetic code using the message-derived
semi-static probability distribution P′ = [0.5, 0.0, 0.1, 0.1, 0.2, 0.1] requires
fewer bits: 19.61, to be precise. But unless the probability distribution P′, or,
more to the point, the difference between P and P′, can be expressed in less
than 20.13 - 19.61 = 0.52 bits, the static code yields better compression.
The drawback of static coding is that sometimes the training text is not
representative of the message, not even in a vague manner, and when this hap-
pens the use of incorrect probabilities means that data expansion takes place
rather than data compression. For example, using Morse code, which is static,
to represent a table of numeric data can result in an expensive representation
compared to alternative codes using the same channel alphabet.
That is, in order to always obtain a good representation, the symbol proba-
bilities estimated by the statistics module should be close - whatever that means
- to the true probabilities, where "true" usually means the self-probabilities
derived from the current message rather than within the universe of all possi-
ble messages. This is why semi-static coding is attractive. Knowledge of the

probability distribution is achieved by allowing the encoder to make a prelim-
inary inspection of the message to accumulate symbol probabilities, usually as
frequency counts. These are then communicated to the decoder, using a mech-
anism of the type proposed in Algorithms 4.6 and 5.3. Compression systems
that adopt this strategy have the advantage that they always use the correct sym-
bol probabilities. In addition, the fact that the set of codewords is fixed allows
high encoding and decoding rates.
There are, however, two substantial drawbacks to semi-static probability
estimation. The first is that it is not possible to have one-pass compression.
In applications in which a file, perhaps stored on disk, is to be compressed, a
two-pass compression process may be tolerable. Indeed, for file-compression
applications such as transparent "disk doublers", two-pass compression may
be highly desirable, as semi-static codes can usually be decompressed very
quickly, and speed of access to compressed files in such an application is of
critical importance. Another useful paradigm in which two passes might be
made is when a first crude-but-fast compression technique is used to capture a
file written to disk, and then later, when the machine is idle and CPU cycles
are freely available, a sophisticated-but-slow mechanism used to recompress
the file. Such approaches become especially attractive when both the crude
and the sophisticated compression mechanisms are such that they can share the
same decoding process. Klein [1997] considers a mechanism that allows this
trade-off.
On the other hand, some applications are intrinsically one-pass. For exam-
ple, compression applied to a communications line cannot assume the luxury
of two passes, as there is no beginning or end to the message, and encoding
must be done in real time.
The second disadvantage of semi-static coding is the potential loss of com-
pression effectiveness because of the cost of transmitting the model parameters.
The example above, with probability distribution P and message M, allowed
just half a bit to transmit the prelude, and appears rather trivial because of the
brevity of the message M and the fact that the presumed static distribution was
a close match to the self-probabilities of the message. But the situation does
not necessarily improve when longer messages are being handled.
The most general situation is when a message of m symbols over the alpha-
bet S = [1 ... nmaxl is to be represented, with n of the n max possible symbols
actually appearing. The first component of the prelude - subalphabet selec-
tion - involves identification of n of n max integers, and using a Golomb code
on the differences between successive integers in the set costs not more than
n(2 + log2(nmax/n)) bits (Equation 3.1 on page 38). Similarly, the cost of n
symbol frequencies summing to m is bounded above by n(2 + log2(m/n))

bits. That is, an upper bound of the cost of the prelude is

    n (4 + log2 ((m · nmax) / n^2))                                  (6.1)

bits, and a reasonable estimate of the average cost is n bits less than this. For
example, when a zero-order character-based model is being used for typical
English text stored using the ASCII encoding, we have nmax = 256, with n ≈
100 distinct symbols used. On a message of m = 50,000 symbols the prelude
cost is thus approximately 1,400 bits, or about 0.03 bits per symbol, a relatively
small overhead compared to the approximately 5 bits per symbol required to
actually code the message with respect to this model. With this simple model
it is clear that for all but very short sequences the cost of explicitly encoding
the statistics is regained through improved compression compared to the use of
static probabilities derived from training text.
Now consider a somewhat more complex model. For example, suppose that
pairs of characters are to be coded as single tokens - a zero-order bigram model.
Such a model may be desirable because it will probably operate faster than
a character-based model; and it should also yield better compression, as this
model is compression-equivalent to one which encodes half of the characters
with zero-order predictions, and the other half (in an alternating manner) using
first-order character-based predictions.
In such a model the universe of possible symbols is nmax = 65,536, of
which perhaps n = 5,000 appear in a 50 kB file. Now the prelude costs approx-
imately 45,000 bits, or 0.9 bits per symbol of the original file. If the bigrams
used are a non-random subset of the 65,536 that are possible, or if the symbol
frequencies follow any kind of natural distribution, then an interpolative code
may generate a smaller prelude than the Golomb code assumed in these calcu-
lations. Nevertheless, the cost of the prelude is likely to be considerable, and
might significantly erode the compression gain that arises through the use of
the more powerful model.
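As a quick check on these figures, the following throwaway program evaluates
the bound of Equation 6.1 for both models, taking the 50 kB file to contain
m = 25,000 bigram tokens; the printed bounds, and the average estimates
obtained by subtracting n, are consistent with the roughly 1,400 and 45,000 bits
quoted above.

    #include <math.h>
    #include <stdio.h>

    /* Upper bound of Equation 6.1 on the prelude cost, in bits. */
    static double prelude_bound(double n, double m, double n_max)
    {
        return n * (4.0 + log2(m * n_max / (n * n)));
    }

    int main(void)
    {
        /* zero-order character model: n = 100, m = 50,000, nmax = 256 */
        double c = prelude_bound(100, 50000, 256);
        /* zero-order bigram model: n = 5,000, m = 25,000, nmax = 65,536 */
        double b = prelude_bound(5000, 25000, 65536);
        printf("character model: %.0f bits (average estimate %.0f)\n", c, c - 100);
        printf("bigram model:    %.0f bits (average estimate %.0f)\n", b, b - 5000);
        return 0;
    }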
Another way of thinking about this effect is to observe that for every model
there is a "break even point" message length, at which the cost of transmitting
the statistics of the model begins to be recouped by improved compression
compared to (say) a zero-order character-based model. And the more complex
the model, the more likely it is that very long messages must be processed
before the break even point is attained. Bookstein and Klein [1993] quantified
this effect for a variety of natural languages when processed with a zero-order
character-based model.

Symbol      Symbol probabilities at that time
coded       P1        P2        P3        P4        P5        P6
1           *1/6      1/6       1/6       1/6       1/6       1/6
1           *2/7      1/7       1/7       1/7       1/7       1/7
1           *3/8      1/8       1/8       1/8       1/8       1/8
5           4/9       1/9       1/9       1/9       *1/9      1/9
5           4/10      1/10      1/10      1/10      *2/10     1/10
3           4/11      1/11      *1/11     1/11      3/11      1/11
1           *4/12     1/12      2/12      1/12      3/12      1/12
4           5/13      1/13      2/13      *1/13     3/13      1/13
1           *5/14     1/14      2/14      2/14      3/14      1/14
6           6/15      1/15      2/15      2/15      3/15      *1/15

Table 6.1: Probabilities used during the adaptive zero-order character-based coding of
the message M = [1,1,1,5,5,3,1,4,1,6], assuming that all symbols have an initial
count of one. The starred entry in each row is the probability actually used to code
that symbol. The overall probability of the message is given by (1/6) · (2/7) · (3/8) ·
(1/9) · (2/10) · (1/11) · (4/12) · (1/13) · (5/14) · (1/15) = 2.202 × 10^−8, with an
information content of 25.4 bits.

6.2 Adaptive probability estimation


The third major mechanism for statistics transmission - and the main focus
of this chapter - is designed to avoid the drawbacks of semi-static probability
estimation. In an adaptive coder there is no pre-transmission of statistics, and
no need for two passes. Instead, both encoder and decoder assume some initial
bland distribution of symbol probabilities, and then, in the course of transmit-
ting the message, modify their knowledge until, paradoxically, just after the last
symbol of the message is dealt with, and when it is too late for the informa-
tion to be exploited, they both know exactly the complete set of probabilities
(according to the selected model) governing the composition of the encoded
message. Table 6.1 gives an example of such a transmission, using the same
ten-symbol message used as an example earlier in this chapter.
In the example, each of the nmax (in this case, six) symbols in the source
alphabet is given a false frequency of one prior to any symbols being coded,
so that initially all symbols have the same non-zero probability of 1/nmax. Af-
ter each symbol is coded, the probability estimates are modified, by counting
another occurrence of the symbol just transmitted. Hence, the second "1" is
coded with a probability of 2/7, and the third with a probability of 3/8. Sum-
ming negative logarithms yields a total message cost of 25.4 bits, rather more
than the 19.6 bit self-information for the message. But the 25.4 bits is now an

[Figure 6.1 here: one bar per character, showing the information content in bits
of that character under the adaptive estimate, plotted against character number;
blank characters and the letter "r" are highlighted.]

Figure 6.1: Adaptive probability estimation in Blake's Milton, assuming that each of
the 128 standard ASCII characters is assigned an initial frequency count of 1. Each bar
represents one character; and the height of the bar indicates the implied information
content assigned to that occurrence of that character. Black bars represent blank char-
acters, the most common character in this fragment of text; light gray bars represent
occurrences of the second most frequent character, the letter "r".

attainable value, whereas the 19.6 is achievable only after addition of a prelude.
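The arithmetic of Table 6.1 is easily reproduced; the short program below
performs the adaptive estimation for the example message and prints the 25.4
bit total.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int    count[7] = { 0, 1, 1, 1, 1, 1, 1 }; /* initial count of 1 for symbols 1..6 */
        int    total    = 6;                       /* nmax, the initial sum of the counts */
        int    M[10]    = { 1, 1, 1, 5, 5, 3, 1, 4, 1, 6 };
        double bits     = 0.0;

        for (int j = 0; j < 10; j++) {
            int s = M[j];
            bits += -log2((double)count[s] / total); /* cost of coding s right now */
            count[s]++;                              /* then adjust the estimates  */
            total++;
        }
        printf("%.1f bits\n", bits);                 /* prints 25.4 */
        return 0;
    }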
Figure 6.1 shows a similar computation for a longer message, the 128 char-
acters of Blake's verse from Milton, already used as an example in Chapter 2.
In the figure, the cost in bits of arithmetically coding each letter is shown, again
assuming that an initial false frequency count of one is assigned to each symbol
(with nmax = 128). Occurrences of two symbols - blank, and "r" - are picked
out in different shades, to show the way in which the probability estimates con-
verge to appropriate values. This adaptive estimator has a total cost of 738.0
bits (assuming a perfect entropy coder), while the self-information is 540.7
bits. Again, the difference corresponds to the cost of implicitly smearing the
prelude component across the transmission. With n = 25, an explicit prelude
would cost approximately 192.8 bits (using Equation 6.1). Remarkably, this
is almost exactly the cost difference between the adaptive and the semi-static
probability estimators.
We now have two alternative paradigms for setting the statistics that con-
trol the compression - semi-static estimation, and adaptive estimation. One
requires two passes over the source message, but is fast when actually coding;
the other operates in an on-line manner, but must cope with evolving probabil-
ity estimates, and thus an evolving code. If the cost of paying for the statistics
must be added to the cost of using them, which yields the better compression?

Consider first an adaptive estimator. As was the case in the examples of
Table 6.1 and Figure 6.1, assume that each of the nmax symbols in the alphabet
is assigned an initial count of one, and thus that the initial sum of the frequency
counts is given by nmax. As each symbol is coded the current probability for
that symbol is used, and then, as a result of the frequency count for that symbol
being incremented, all of the probabilities are adjusted. That is, the first symbol
of the m symbol message is always assigned a probability of 1/nmax. Suppose
that the jth symbol in the message is given by M[j], and that p_s^j is the number
of times symbol s appears in the (j − 1)-symbol prefix of M that has been
coded until now. Then an arithmetic code for this jth symbol will require

    -\log_2 \frac{p^j_{M[j]} + 1}{j + n_{max} - 1}
bits. Summed over all of the symbols in the message, the cost is

    -\sum_{j=1}^{m} \log_2 \frac{p^j_{M[j]} + 1}{j + n_{max} - 1}                                            (6.2)

        = -\log_2 \prod_{j=1}^{m} \frac{p^j_{M[j]} + 1}{j + n_{max} - 1}

        = -\log_2 \frac{\prod_{s=1}^{n_{max}} \prod_{j=1}^{p_s} j}{\prod_{j=1}^{m} (j + n_{max} - 1)}

        = -\log_2 \frac{\prod_{s=1}^{n_{max}} p_s!}{(m + n_{max} - 1)! / (n_{max} - 1)!}

        = -\log_2 \frac{\prod_{s=1}^{n_{max}} p_s!}{m!} + \log_2 \frac{(m + n_{max} - 1)!}{m! \, (n_{max} - 1)!}    (6.3)

bits, where p_s = p_s^{m+1} is the frequency of symbol s in the whole message.
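Equation 6.3 can be checked against the example of Table 6.1, for which
p = [5, 0, 1, 1, 2, 1], m = 10, and nmax = 6; the little program below evaluates
the two terms (the cost given the statistics, and the cost of learning them) and
recovers the 25.4 bits obtained earlier by summing the per-symbol costs.

    #include <math.h>
    #include <stdio.h>

    static double log2_factorial(int n)
    {
        double r = 0.0;
        for (int i = 2; i <= n; i++)
            r += log2((double)i);
        return r;
    }

    int main(void)
    {
        int p[] = { 5, 0, 1, 1, 2, 1 };
        int n_max = 6, m = 10;

        double term1 = log2_factorial(m);            /* -log2( prod p_s! / m! )       */
        for (int s = 0; s < n_max; s++)
            term1 -= log2_factorial(p[s]);
        double term2 = log2_factorial(m + n_max - 1) /* log2 of the second term       */
                     - log2_factorial(m)             /* of Equation 6.3, the cost of  */
                     - log2_factorial(n_max - 1);    /* learning the statistics       */
        printf("%.2f + %.2f = %.2f bits\n", term1, term2, term1 + term2);
        return 0;
    }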


The alternative to adaptive coding is a pre-transmission of the symbol fre-
quencies. But these symbol frequencies need not be utilized in a static manner,
and decrementing adaptation results in more precise probability estimates be-
ing used at all times - an approach known as enumerative coding. Suppose that
the exact symbol frequencies p_s in the message are known to both encoder and
decoder in advance. To code the first symbol M[1] a probability of p_{M[1]}/m
should be used. The frequency of symbol M[1] can then be decremented, as
one of the appearances of this symbol has now been accounted for. Hence, the
second symbol can be coded with probability (p_{M[2]} − p^2_{M[2]})/(m − 1). Taking
logarithms to account for arithmetic coding and summing over the full message
gives

    \sum_{j=1}^{m} -\log_2 \frac{p_{M[j]} - p^j_{M[j]}}{m - j + 1} = -\log_2 \frac{\prod_{s=1}^{n_{max}} p_s!}{m!}        (6.4)

bits. Now compare Equations 6.3 and 6.4. The common term is the cost of
sending the message assuming that the statistics are known. The second term
in Equation 6.3 is the cost of learning the statistics by identifying nmax − 1
values that sum to m + nmax − 1, by identifying nmax − 1 "boundary" values
out of a set of m + nmax − 1 values in total (Equation 1.5 on page 13). In the
enumerative code we must add on the cost of the prelude; one way of coding it
would cost the same as the second term in Equation 6.3.
That is, assigning a false count of one to each of the nmax symbols in an
adaptive estimator corresponds almost exactly to the "add one to each of the
frequencies" method for representing the prelude described in Algorithm 5.3
on page 102, the sole difference being that nmax − 1 appears because the
nmax-th value is fully determined once the first nmax − 1 are known. A simi-
lar result holds for the "subalphabet plus frequencies" prelude mechanism of
Algorithm 4.6 on page 83, the cost of which is captured in Equation 6.1 on
page 134. No matter how the prelude is transmitted, an adaptive model from
a standing start achieves almost exactly the same compressed size as does the
decrementing-frequency semi-static model from a flying start, once the prelude
cost is factored in to the latter.
The analysis that led to Equation 6.4 supposed that the frequency counts
were used in a decrementing manner, which in fact would require adaptation
of the codes. On the other hand, the speed of a semi-static coder comes about
through the use of a fixed set of codes. If an initial probability distribution is
used throughout the coding of the message, the length of the coded message is
given by

    \sum_{j=1}^{m} -\log_2 \frac{p_{M[j]}}{m} = -\log_2 \frac{\prod_{s=1}^{n_{max}} (p_s)^{p_s}}{m^m}

        \approx -\log_2 \frac{\prod_{s=1}^{n_{max}} p_s!}{m!} - (\log_2 e)\left(\sum_{s=1}^{n_{max}} p_s - m\right)
                + \frac{1}{2}\left(\sum_{s=1}^{n_{max}} \log_2(2\pi p_s) - \log_2(2\pi m)\right)

        < -\log_2 \frac{\prod_{s=1}^{n_{max}} p_s!}{m!} + \frac{1}{2}\sum_{s=1}^{n_{max}} \log_2(2\pi p_s)                 (6.5)

where the approximation at the second line is the result of applying Equa-
tion 1.3 on page 13, and where (p_s)^{p_s} is taken to be 1 when p_s = 0. That is,
once the cost of the statistics is also allowed for, a true semi-static estimator
requires slightly more bits to code a message than does an adaptive estimator
or an enumerative estimator. Cleary and Witten [1984a] consider the relationship
between adaptive, semi-static, and enumerative estimators in rather more detail

than is given in these few pages, and reach the same conclusion - that intrin-
sically there is no compression loss associated with using an adaptive code. If
anything, the converse is true: the need for transmission of the details of the
probability distribution means that the self-information of a message is a bound
that we might hope to approach when the message is long, but can never equal;
and an adaptive estimator is likely to get closer to the self-information than is
a semi-static estimator using a non-adapting probability distribution.

6.3 Coping with novel symbols


In the analysis of the previous section it was supposed for the adaptive coder
that each symbol in the alphabet would be given an initial frequency count
of 1 to "get it started". Analyzing the adaptive mechanisms for accumulat-
ing symbol probabilities under this assumption shows that there is a relatively
small cost to the use of an adaptive scheme. Unfortunately this assumption,
while persuasive, is not the full story. Consider again the example used in
the introduction to this chapter, in which a compression model based upon
character bigrams was considered. In such a model the alphabet is potentially
n_max = 65,536, but is unlikely to become that large on any actual data, for ex-
actly the reasons that prompt the use of the model - characters are correlated.
But when n ≪ n_max, assigning an initial frequency of one to each possible
symbol has the detrimental effect of considerably slowing down the learning
of the source statistics. For example, even after 32,768 bigrams have been pro-
cessed - or 64 kB of text - there are still at least 32,768 bigrams that have not
appeared and still have an unnormalized Ps of one. The aggregate probability of
all of these "different from those already seen" symbols is, therefore, estimated
to be at least 32,768/(65,536 + 32,768) = 1/3. Worse, assuming (as we did
above) that about 5,000 bigrams do actually appear, the probability of "an un-
seen bigram" is estimated as approximately 60,000/(65,536 + 32,768) ~ 0.6.
But intuitively we would expect, by the time 64 kB of text has been processed
and we have encountered only 5,000 distinct bigrams, that the probability that
the next bigram is novel and not previously encountered is very much smaller
than 0.6. Assigning false counts of one is clearly an expensive way of boot-
strapping the probability estimates, and an adaptive model which somehow
does not make use of false counts might yield considerably better compres-
sion effectiveness in this bigram scenario. And in a semi-static environment,
the same considerations suggest that the analogous "frequency counts plus 1"
method used in Algorithm 5.3 on page 102 for representing the prelude is inef-
fective when n ≪ n_max.
Even more problematic are models in which the size of the alphabet is sim-
ply not known at the time coding is commenced. One example is a word-based
model. English is one of the natural languages in which the segmentation of
machine-stored text into words is fairly straightforward. (Not all natural lan-
guages are so cooperative, unfortunately). One simply defines, for example, a
word to be a contiguous sequence of alphabetic letters, hyphens, and apostro-
phes. A word-based model then considers the input stream to be an alternating
sequence of words and non-words. But what is the probability of a word? What
even is the number of words that might be used? Attempting to resolve these
two questions is an extraordinarily complex issue, for a document might easily
contain "words" not known to any dictionary, and perhaps not known to any
person either. This document, for example, now contains the word "plokijuh"
(look at your keyboard to see where it came from), and it may be that this is the
only document to contain it (twice - once in the text and once in the index) in
the history of (non-child) literature.¹ Given this possibility, it is clear that no a
priori enumeration of the alphabet or subalphabet used in a word-based com-
pression system can be made, and the use of false counts is not just wasteful, it
is impossible.
Instead, in an adaptive word-based model the words are spelled out explic-
itly to the decoder at their first appearance. Each word code (and non-word
code, but for the sake of simplicity we will describe the process as if a stream
of words alone is being coded) in the compressed representation is preceded by
a binary flag that indicates whether or not the word has previously appeared.
When the flag says "yes, the word has appeared before and is already known
to the decoder", all that is then transmitted is a code for the word. But when
the flag says "no, the word is novel", the decoder knows to expect more infor-
mation, typically a code to specify the length of the word, and then zero-order
codes for the characters comprising the word, in both cases using appropriate
probabilities. The new word can then be added to the vocabulary maintained
by both encoder and decoder, and, at subsequent appearances, be represented
by its symbol number. That is, in an adaptive modeling situation the number
of symbols in the current subalphabet might change as well as the probabilities
assigned to those symbols.
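
The control flow just described is summarized in the following C sketch. The
coding steps themselves are stubbed out with a trace() call so that the example is
self-contained, and all of the names (encode_word(), vocab_lookup(), and so on)
are invented for the illustration rather than taken from any actual implementation;
only the flag-then-spell-out protocol matters here.

    /* Control-flow sketch of the adaptive word-based encoder: a flag per
     * word, then either the word's symbol number or its length and letters. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_WORDS 1000
    static char vocab[MAX_WORDS][32];
    static int vocab_size = 0;

    static void trace(const char *model, long value) {
        printf("code %ld against the %s model\n", value, model);
    }

    static int vocab_lookup(const char *w) {      /* -1 if never seen before */
        for (int i = 0; i < vocab_size; i++)
            if (strcmp(vocab[i], w) == 0)
                return i;
        return -1;
    }

    static void encode_word(const char *w) {
        int id = vocab_lookup(w);
        if (id >= 0) {
            trace("flag", 0);                     /* "seen before"           */
            trace("word", id);                    /* its symbol number       */
        } else {
            trace("flag", 1);                     /* "novel": the escape     */
            trace("length", (long)strlen(w));     /* spell the word out      */
            for (const char *c = w; *c != '\0'; c++)
                trace("character", *c);           /* zero-order characters   */
            strcpy(vocab[vocab_size++], w);       /* decoder mirrors the add */
        }
    }

    int main(void) {
        encode_word("the"); encode_word("cat"); encode_word("the");
        return 0;
    }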
This uncertainty raises a vexatious question: what should the probability
of a novel word be so that the per-word prefix flag can be economically coded?
Known as the zero-frequency problem, this issue requires the determination
of a probability that the next symbol in a stream has not occurred previously,
when the message is over an alphabet of unknown cardinality. It is a problem
which has received considerable attention in the literature [Cleary and Witten,
1984b, Howard and Vitter, 1992b, Moffat, 1990, Moffat et al., 1994, Witten and
Bell, 1991], and several plausible schemes have been described for assigning
a probability to the escape symbol, so called because its appearance in the
compressed stream indicates that the subsidiary character-level model should
be temporarily used. Table 6.2 lists some of these various mechanisms. Note
that in all cases m ≥ 1 is assumed. When m = 0 the escape probability can
always be taken to be 1.0, as the first symbol in the message must always be
novel.

¹Sadly, the word does exist on the web: www.google.com reports several occurrences,
none of them in English prose.

    Method      Probability of novel symbol    Probability of symbol s
    A           1/(m + 1)                      p_s/(m + 1)
    B           n/m                            (p_s - 1)/m
    C           n/m                            ((m - n)/m) · (p_s/m)
    D           n/(2m)                         (2p_s - 1)/(2m)
    X           t_1/m                          ((m - t_1)/m) · (p_s/m)
    C'          n/(m + n)                      p_s/(m + n)
    X'          (t_1 + 1)/(m + t_1 + 1)        p_s/(m + t_1 + 1)

Table 6.2: Estimating the probability of a novel symbol, where m is the number of
symbols coded so far, n is the number of distinct symbols encountered, p_s is the num-
ber of occurrences of symbol s that have occurred, t_1 is the number of symbols that
have occurred once, and the (m + 1)-th symbol is about to be coded. In all methods
except B, each symbol in the underlying alphabet must be defined once in a secondary
model. In method B, each underlying symbol must be defined twice.
In Table 6.2 it is assumed that m symbols of the message have been pro-
cessed to date; that in those m symbols there have been n distinct symbols; and
that the sth of these n symbols has been observed p_s times. It is also assumed
that t_1 = |{s | p_s = 1}| is the number of symbols that have occurred exactly
once in the part of the message observed to date. Finally, note that p_s is used
as a convenient shorthand for p^{m+1}_s, the number of times the sth symbol of the
alphabet appears in the first m symbols of the message.
The formulae listed in the table combine these parameters in different ways.
Method A is the simplest. It simply allocates a count of one to the escape
symbol, superficially akin to the false "starter" count of one that was assumed
for each symbol in the example in Table 6.1. Method A allows novel symbols to
appear, but, as we shall see below, consistently underestimates their probability
in most practical applications.
Cleary and Witten [1984b] recognized the problems with method A, and
suggested a second approach, method B. They proposed that a symbol not be
taken seriously until it had occurred twice, and that it should be encoded in
the secondary model at its first two appearances, allowing the escape probability
to be sensitive to the size of the alphabet discovered so far. The drawback
of method B - and the reason we have not included it in the experiments that
are described shortly - is that use of the secondary model twice for each sym-
bol can add considerably to the overall cost. For example, with a word-based
model, spelling each word out in full at both its first and second appearances is
a burden to be avoided.
To eliminate the double cost inherent in method B, method C [Moffat,
1990] treats the sequence of flag bits as a message in its own right, so if n
novel flags have been transmitted in a sequence of m such tokens then the cor-
rect estimator for the probability of a novel symbol is n/m. And if a symbol
is not novel, then a probability estimate is already available. The drawback of
this approach is the need to encode each non-novel token in a two-step manner,
first the flag, and then a code for the symbol. That is, two arithmetic coding
steps are required for each known symbol, corresponding to the two factors in
the probability calculation. Nor is it possible to pre-calculate the probabilities
and do a single coding step, as the resultant value of m² in the denominator is
likely to require more bits of precision than are available to express probabili-
ties unless m is very small. (This issue was discussed in detail in Section 5.3
on page 98.) As it is described in the table, method C also suffers from the
problem of needing a special case when m = n as well as when m = 0.
For these two reasons the one-step mechanism labelled C' in the second
part of Table 6.2 is an attractive alternative [Moffat, 1990]. In this formulation
the first factor is included additively, and so the escape probability is slightly
less than it should be, but not by much.
Method D resulted from work by Paul Howard and Jeff Vitter [1992b].
Rather than adding two "units" when a novel symbol appears (both n and m
increase in the denominator of method C') and only one when a symbol repeats,
method D adds two units for every symbol. When a novel symbol is coded the
two units are shared between Ps and n, as for method C. But when a repeat
symbol is coded, method D awards both units to Ps, where method C would
have added only one. In results based upon a PPM-style model (see Section 8.2
on page 221) Howard and Vitter found a small but consistent improvement
when method D was used in place of method C.
The last of the listed escape estimators - method X - was a product of a
study that explicitly investigated the zero-frequency problem [Witten and Bell,
1991]. What all escape probability estimators are really trying to predict is the
number of symbols of frequency zero, as this is the pool of symbols that a novel
symbol - should one appear - must be drawn from. And a plausible approxi-
mation of the number of symbols of frequency zero is likely to be the number
t_1 of symbols of frequency one, a quantity that is known to both encoder and
decoder. Like method C, non-novel symbols must be coded in two steps if this
mechanism is adhered to slavishly, and so a one-step approximation, method
X', is also of interest, and is listed in the second section of the table. Note the
addition of one, to avoid the problem of t_1 = 0.

[Figure 6.2, not reproduced here: three log-log plots of escape probability
against symbol number, comparing methods A, C', D, and X' with the observed
rate of novel symbols.]

Figure 6.2: Escape probabilities: (a) for character bigrams in the Wall Street Journal
text; (b) for WSJ.Words; and (c) for WSJ.NonWords.

    Method      Bigrams     Words    Non-words
    A              0.42     79.12         3.60
    C'             0.35     32.71         2.12
    D              0.29     30.89         2.09
    X'             0.26     31.04         2.10
    O              0.27     31.23         2.10

Table 6.3: Compression performance of novel symbol probability estimators, mea-
sured in bits per thousand symbols over 510 MB of English text from the Wall Street
Journal data described in Table 4.5 on page 71.
Figure 6.2 plots numeric values for four escape probability estimators for
each of three different alphabets when compressing the 510 MB Wall Street
Journal text described in Table 4.5 on page 71. Also plotted is the actual ob-
served occurrence rate of novel symbols, averaged over intervals each a power
of two wide from 2^k to 2^(k+1) - 1 for integral k. This curve gives an indication
of the suitability of each of the estimators. Figure 6.2a shows the novel-event
probabilities associated with a model based upon the use of character bigrams,
an approach already used as an example earlier in this chapter. Figure 6.2b
and Figure 6.2c show the corresponding behavior of the word and non-word
distributions generated by a zero-order word-based model on the same data.
In each of the three graphs the line marked with "O" symbols represents the
observed frequency of novel symbols. Worth noting is that the observed escape
frequency generally declines as more and more symbols are processed, but can
also increase. Also worth noting is that all of methods C', D, and X' provide
reasonably good approximations for all three of the sample distributions, but
that method X' appears to better track the variability of the observed frequency.
Method A is not suitable for any of these three example distributions.
Table 6.3 lists the overall cost, in bits per thousand symbols, of coding
the escape flag for the three test files for the main escape methods listed in
Table 6.2. Also listed as a fifth method is method O, which uses as an escape
probability the observed frequency of novel symbols over the previous interval,
where an interval is a power of two symbols wide. For example, the 256th to
511 th symbols in the stream would be coded using an escape probability based
upon the observed novel symbol count during the coding of the 128th to 255th
symbols of the message, and so on. Methods D and X' share the honors as
the best estimators, and perform well even against method O. Method D also
has the implementation advantage of requiring that slightly less information be
maintained.
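
For reference, the one-step estimators of Table 6.2 are trivial to evaluate from the
running statistics. The C fragment below is illustrative only: the function names
are not from the text, and the t_1 figure in the example is invented, but the m and
n values correspond to the bigram scenario discussed earlier in the chapter.

    /* Escape probabilities for methods A, C', D, and X' of Table 6.2,
     * computed from m (symbols coded), n (distinct symbols seen), and
     * t1 (symbols seen exactly once).  When m = 0 the escape probability
     * is taken to be 1.0, so that case is not handled here.               */
    #include <stdio.h>

    typedef struct { long m, n, t1; } stats_t;

    double escape_A (stats_t s) { return 1.0 / (s.m + 1); }
    double escape_Cd(stats_t s) { return (double)s.n / (s.m + s.n); }          /* C' */
    double escape_D (stats_t s) { return (double)s.n / (2.0 * s.m); }
    double escape_Xd(stats_t s) { return (s.t1 + 1.0) / (s.m + s.t1 + 1.0); }  /* X' */

    int main(void) {
        /* after 64 kB of text: 32,768 bigrams, about 5,000 of them distinct;
         * the count of once-only bigrams, t1 = 2,000, is an invented figure */
        stats_t s = { 32768, 5000, 2000 };
        printf("A  %.5f\n", escape_A(s));
        printf("C' %.5f\n", escape_Cd(s));
        printf("D  %.5f\n", escape_D(s));
        printf("X' %.5f\n", escape_Xd(s));
        return 0;
    }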
The recorded per-symbol costs listed in Table 6.3 are relatively low com-
pared to the cost of the message itself. For example, the self-entropy of the
word stream is more than 11 bits per character, and so the cost of the escape
flag is less than 0.3%. However, the low per-symbol cost in this example should
not be interpreted as meaning that the choice of escape estimator is academic
for practical purposes, as in the three examples the small values are primarily
a consequence of the extremely long messages. In other situations in which an
escape probability might be used - such as in the compression system described
in Section 8.2 - each particular sub-message, which is the set of symbols coded
in one conditioning context, might only be a few tens of symbols long, and the
overall message the result of interleaving tens or hundreds of thousands of such
sub-messages. In such applications the choice of escape estimator is of critical
importance - not only is the length of each sub-message small, but the expected
cost within the context is quite likely to be under one bit per symbol.
Finally in this section, note that Åberg et al. [1997] have developed a gen-
eral parameterized version of the escape probability estimator that allows the
estimation mechanism to be modified as the message is accumulated. Their
results show a further very slight gain compared to method D on certain test
files, but this "method E" mechanism is complex, and not included in our ex-
periments here.

6.4 Adaptive Huffman coding


Now we turn to the first of the on-line coding methods discussed in this chapter
- adaptive Huffman coding, the process of maintaining a Huffman code in
the face of changing symbol probabilities. Unlike the minimum-redundancy
coding schemes described in Chapter 4, all adaptive Huffman coding schemes
proposed to date rely on an explicit tree data structure, and traverse the tree
bit-by-bit to encode and decode symbols. For this reason we refer to these
techniques as being "adaptive Huffman codes" rather than "adaptive minimum-
redundancy codes"; but to whet the appetite of the reader, Section 6.10 below
describes a pseudo-adaptive technique that does not require an explicit tree.
The fundamental idea in adaptive Huffman coding is that if a leaf node
corresponding to a symbol has its probability changed, all nodes on the path in
the Huffman tree from that leaf back to the root must also have their weights
changed. The reassignment of node weights might then force a reorganization
of the tree, so that it resembles a tree that could be generated by Huffman's
greedy algorithm using the new weights.
Jeff Vitter [1987] details the earlier contributions of Faller [1973], Gallager


[1978], and Knuth [1985]. All three techniques assume that self-probabilities
are being used as estimators. Vitter examined this previous "FGK" mechanism
in detail, and improved the worst case compression cost for adaptive Huffman
codes. His work culminated in a detailed implementation being made available
[Vitter, 1989]. Like the FGK mechanism, Vitter's program assumes that the
weights of symbols that appear as leaves in the Huffman tree change by single
units, which is the situation if unnormalized self-probabilities are employed
by the compression system. This assumption allows the algorithms to perform
an 0(1) amount of work for each bit output by the encoder, or input by the
decoder, so the total time required to process an m symbol message that yields
a c bit output is O(n + m + c) = O(n + c).
The more general case, in which symbol probabilities can change by any
amount, is discussed by Cormack and Horspool [1984], and uses the same ba-
sic ideas. However, the linear execution time can no longer be guaranteed for
the general case, and as much as O(n²) time per symbol might be required to
update the tree, for a total worst case running time of O(mn²). As Cormack
and Horspool observed, however, this worst case will only appear in patho-
logical situations, and in general the running time will be linear, despite the
absence of a guarantee.
The principal difference between the FGK algorithm and Vitter's method is
in the treatment of novel symbols. The earlier methods augmented the source
alphabet with a special symbol of probability zero, which, in a Huffman tree,
still results in the assignment of a codeword. This leaf then is replaced by an
internal node with two children each time a new symbol is encountered in the
message. One of the new children remains as the escape symbol, with its count
of zero; the other is assigned to the new symbol, and given a count of one.
This approach most closely corresponds to escape method A in Table 6.2 on
page 141.
The alternative is to include an explicit escape symbol and increment its
count each time it is used - method C' in Table 6.2. The description of adaptive
Huffman coding that follows uses this second approach, with every symbol in
the tree having a positive, non-zero weight, and one of those symbols being the
escape symbol. Along with Vitter, Miliditi et al. [1999] have also considered
the exact bit cost of adaptive Huffman coding.
The basic concept underlying adaptive Huffman coding is generally at-
tributed to Gallager [1978] - although Vitter [1987] notes that it was also dis-
cussed by Faller [1973] - and is known as the sibling property: every node
except the root has a sibling, and a sibling list can be created in which the
nodes in the code tree are written in order of non-increasing weight such that
sibling nodes are adjacent.
That Huffman trees have the sibling property follows immediately from
Huffman's algorithm - it greedily chooses the subtrees with smallest weight as
siblings at each packaging stage, and generates a code tree. Hence, if nodes
are numbered in order of their packaging by Huffman's algorithm, then they
are numbered in reverse order of a sibling list. Consider the construction of the
Huffman code in Figure 4.2 on page 54, the resulting tree of which is shown in
the top panel of Figure 4.3 on page 56. The first step packages leaves 5 and 6,
so these two nodes will form the last two entries in the sibling list. The second
step packages leaves 3 and 4, so they form the next to last entries in the list.
Continuing in this manner yields the tree in Figure 6.3a, where each node is
annotated below with the reverse order in which it is packaged by Huffman's
algorithm. The weight of each package is noted inside each node, assuming
that the probabilities resulted from the processing of a stream of 100 symbols
so far. The numbers above the white leaf nodes indicate the symbol number
currently associated with that leaf. Listing the weights of the nodes in the
reverse order that they are processed with Huffman's algorithm yields

[100,67,33,20,13,11,9,7,6,5,4] ,

which has sibling nodes in adjacent positions. Using the same logic but in
reverse, any code tree which has the sibling property can be generated by Huff-
man's algorithm, so is a Huffman tree. The existence of a sibling list for a code
tree is both a necessary and sufficient condition for the tree to be a Huffman
tree.
The basic idea of dynamic Huffman coding algorithms is to preserve the
sibling property when a leaf has its weight increased by one, by finding its new
position in the sibling list, and updating its position in the tree accordingly. For
example, consider the changes to the tree in Figure 6.3a if symbol 2 has its
frequency incremented from 11 to 12. Firstly the node itself must increase its
weight, altering the sibling list to

[100,67,33,20,13, 12,9,7,6,5,4],

with the altered weight highlighted. The list remains a sibling list, so no re-
structuring of the tree is necessary. Now the frequency increment must be
propagated up the tree to the parent of symbol 2. The sibling list becomes

[100,67,33,21,13,12,9,7,6,5,4] ,

again with no violation of the sibling property. This process continues until the
root is reached, with a final sibling list of

[101, 67, 34, 21, 13, 12, 9, 7, 6, 5, 4].


[Figure 6.3: four tree diagrams, (a) to (d), not reproduced here; see the caption below.]
Figure 6.3: Huffman trees with each node labelled with its position in the sibling list
(below), and with its symbol number (above, white leaf nodes only). The four panels
illustrate the situation: (a) prior to any increments; (b) after the frequency of symbol 2
is incremented from 11 to 12, and then to 13; (c) after nodes 5 and 6 in the sibling list
are exchanged; (d) after the frequency of symbol 2 is incremented in its new position.
A second appearance of, and thus increment to, symbol 2 results in further
updates to the sibling list, which becomes

[102, 67, 35, 22, 13, 13, 9, 7, 6, 5, 4],

again with the changes highlighted. This sibling list corresponds to the tree
shown in Figure 6.3b. For both of these first two increments, the sibling prop-
erty holds at each stage. The node weights evolve, but the tree and code remain
unchanged.
Now consider what actions are required if symbol 2 is coded and then in-
cremented a third time. If the weight of symbol 2 were to be increased from 13
to 14, the list of node weights becomes

[102,67,35,22,13,14,9,7,6,5,4].

But this list is not a sibling list - the weights are not in non-increasing order-
and the underlying tree cannot be a Huffman tree. To ensure that the increment
we wish to perform will be "safe", it is necessary to swap the fifth and sixth
elements before carrying out the update. In general, the node about to be incre-
mented should be swapped with the leftmost node of the same weight. Only
then can the list of weights be guaranteed to still be non-increasing after the
increment. In the example, after the subtrees rooted at positions 5 and 6 in the
sibling list are swapped, we get the tree shown in Figure 6.3c. The increment
for symbol 2 can now take place; and the sibling list becomes

[102,67,35,22, 14,13,9,7,6,5,4].

As before, the ancestors of the node for symbol 2 must also have their counters
incremented. Neither of these involve further violations of the sibling property.
But if further violations had taken place, they would be dealt with in exactly
the same manner: by finding the leftmost (smallest index) node in the sibling
list with the same weight; swapping it with the node in question; and then
incrementing the weight of the node in its new position. In the example, the
final sibling list after the third increment to symbol 2 is

[103,67, 36,22,14,13,9,7,6,5,4],

and this time the code has adjusted in response to the changing set of self-
probabilities. Figure 6.3d shows the tree that will be used to code the next
symbol in the message.
An overview of the process used to increment a node's weight by one is
given by function sibling_increment() in Algorithm 6.1.

Algorithm 6.1
Increase the weight of tree node L[i] by one, where L is an array of tree
nodes from a Huffman tree in sibling list order: L[2j] and L[2j + 1] are
siblings for 1 ≤ j < n, and the weight of L[i] is not less than the weight of
L[i + 1] for all 1 ≤ i < 2n. This algorithm also alters the structure of L so
that it continues to represent a Huffman tree.
sibling_increment(i)
1: while i ≠ 1 do
2:     find the smallest index j ≤ i such that the weight of L[j] is equal to
       the weight of L[i]
3:     swap the subtrees rooted at L[i] and L[j]
4:     add one to the weight of L[j]
5:     set i ← parent of L[j]
6: add one to the weight of L[1], the root of the tree

The loop processes each ancestor of i, incrementing each corresponding weight
by one. The root of the tree is assumed to be in node L[1]. Before the weight
of L[i] can be
incremented the leftmost node L[j] of the same weight in the sibling list is
located, and the two nodes swapped (steps 2 and 3). This ensures that when
one is added to the weight in step 4, the sibling list remains in non-increasing
order. There is a danger when swapping tree nodes that a child will swap with
one of its ancestors, therefore destroying the structure of the tree. Fortunately,
this algorithm only swaps nodes of identical weight, and so neither node can be
an ancestor of the other, as ancestors must have weights that are strictly greater
than their children in our Huffman tree. This is a complication in the FGK
algorithm, as it uses an escape symbol of weight zero.
At this level of detail, Algorithm 6.1 is simple and elegant. The loop iterates
exactly once for each node on the path from the leaf node representing the
symbol to be incremented, to the root, so the number of iterations equals the
number of compressed bits processed. If each step runs in 0(1) time, then the
entire algorithm is on-line. To execute step 2 in 0(1) time requires a supporting
data structure. Consider, for example, a call to increment(15) to increment the
frequency of the last element of the sibling list

[8,4,4,2,2,2,2,1,1,1,1,1,1,1,1].

In this case element i = 15 must be swapped with element j = 8, the high-


lighted value, before it is incremented to 2. That is, the two nodes being
swapped might be O(n) entries apart in the sibling list, and the cost of a linear
search too great. Binary search of the list for j would require O(log n) time,
but is also too costly to undertake for every output bit. Instead, a data structure
is maintained that allows O(1) time access to these leader elements. At first it
seems that each node can contain a pointer to the leftmost node in the sibling
list that shares its weight: the leader node. However, if the leader itself has its
weight incremented, it is necessary to update all of the pointers in the nodes
to its right that were pointing to that leader: an O(n) time operation. To avoid
this difficulty, an extra level of indirection is used, with each node recording
which bucket it belongs to, where nodes in a bucket all have identical weight;
and with an auxiliary data structure recording the leader of each bucket. Then,
if a leader is incremented, an O(1) time update of this auxiliary structure is
all that is required to update a bucket's new leader. This structure must be dy-
namic, as buckets are created and deleted throughout the coding process, so a
doubly-linked list is used. In the example just given, the list of bucket pointers
would be

    [1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4],

and the doubly-linked list of leaders would be

    D = [1, 2, 4, 8].

That is, element L[i] should be swapped with element D[L[i].bucket] before it has its
weight incremented.
The Huffman tree is stored in the same array of nodes used to hold the
sibling list, by adding further pointers that allow the necessary threading. Al-
gorithm 6.2 supplies the considerable detail of adaptive Huffman encoding and
decoding suggested by the sketch of Algorithm 6.1. In these algorithms, and
their supporting functions in Algorithm 6.3, a list of 2n - 1 tree nodes L is
maintained, with each node containing six fields:

• left, right, and parent to represent pointers in the Huffman tree that is
threaded through the list;

• weight to hold the node's self-probability;

• symbol, which holds the symbol associated with a leaf, or "internal" if
the node is an internal node; and

• bucket, which holds a pointer to the element's bucket.

It is assumed that the escape symbol is symbol 0, and that the Huffman tree is
initialized to contain n = 2 symbols: symbol 1, and the escape symbol, both with
an initial weight of one.
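
Before looking at the detailed pseudocode, it may help to see the node and
bucket records laid out as C declarations. This is only one possible rendering,
with invented names (index_of, MAX_SYMBOLS, INTERNAL); the book's own
implementation may differ in detail.

    #define MAX_SYMBOLS 1024         /* an assumed bound, for illustration      */
    #define INTERNAL    (-1)         /* marks an internal (non-leaf) node       */

    struct bucket {                  /* one bucket per distinct weight          */
        int leader;                  /* leftmost sibling-list position having   */
        struct bucket *prev, *next;  /* that weight; doubly linked, so buckets  */
    };                               /* can be created and deleted in O(1) time */

    struct node {                    /* one entry of the sibling list L         */
        int left, right, parent;     /* tree pointers, as sibling-list indices  */
        long weight;                 /* the node's unnormalized probability     */
        int symbol;                  /* the symbol at a leaf, or INTERNAL       */
        struct bucket *bucket;       /* the bucket this node belongs to         */
    };

    struct node L[2 * MAX_SYMBOLS];  /* sibling list; L[1] is the root          */
    int index_of[MAX_SYMBOLS];       /* index_of[x] locates symbol x's leaf     */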
During encoding it is necessary to locate a symbol's leaf in L, so that a
codeword can be generated by following parent pointers from the leaf to the
root of the tree. To facilitate leaf discovery, an array index is maintained such
that node L[index[x]] is the leaf representing symbol x; it is assumed that all
elements of index are initialized to "not yet used", except for index[0] = 2 and
index[1] = 3.
Algorithm 6.2
Use an adaptive Huffman code to represent symbol x, where 1 ≤ x,
updating the Huffman tree to reflect an increase of one in the frequency of x.
adaptive_huffman_encode(x)
1: if index[x] = "not yet used" then
2:     adaptive_huffman_output(0), the codeword for the escape symbol
3:     encode x using some agreed auxiliary mechanism
4:     adaptive_huffman_increment(index[0])
5:     adaptive_huffman_add(x)
6: else
7:     adaptive_huffman_output(x)
8:     adaptive_huffman_increment(index[x])

Return a value s assuming an adaptive Huffman code, and update the
Huffman tree to reflect an increase of one in the frequency of s.
adaptive_huffman_decode()
1: set i ← 1, the root of the tree
2: while L[i].symbol = "internal" do
3:     if get_one_bit() = 0 then
4:         set i ← L[i].left
5:     else
6:         set i ← L[i].right
7: adaptive_huffman_increment(i)
8: if L[i].symbol = 0, the escape symbol, then
9:     decode the new symbol s using the agreed auxiliary mechanism
10:    adaptive_huffman_add(s)
11: else
12:    set s ← L[i].symbol
13: return s

Output the codeword for symbol s, where 0 ≤ s, by accumulating the
codeword in reverse while traversing from the leaf corresponding to s back
to the root.
adaptive_huffman_output(s)
1: set i ← index[s], w ← 0, and nbits ← 0
2: while L[i].parent ≠ 0 do
3:     if i is odd then
4:         set w ← w + 2^nbits
5:     set nbits ← nbits + 1 and i ← L[i].parent
6: put_one_integer(w, nbits)
Algorithm 6.3
Add one to the weight of node L[i] and its ancestors, updating L so that it
remains a sibling list, maintaining the tree structure and bucket pointers.
adaptive_huffman_increment(i)
1: while i ≠ 1 do
2:     set b ← L[i].bucket and j ← b.leader
3:     swap the left, right, and symbol fields of L[i] and L[j]
4:     set index[L[i].symbol] ← j and index[L[j].symbol] ← i
5:     set L[L[i].left].parent ← L[L[i].right].parent ← i
6:     set L[L[j].left].parent ← L[L[j].right].parent ← j
7:     set L[j].weight ← L[j].weight + 1
8:     if L[j].weight = L[j - 1].weight then
9:         set L[j].bucket ← L[j - 1].bucket
10:    else
11:        add a new bucket d with d.leader ← j to the bucket list B
12:        set L[j].bucket ← d
13:    if L[j + 1].weight = L[j].weight - 1 then
14:        set b.leader ← j + 1
15:    else
16:        remove bucket b from the bucket list B
17:    set i ← L[j].parent

Add a new symbol s to the underlying Huffman tree by making the final leaf
node L[2n - 1] an internal node with two leaves: the old L[2n - 1] and the
new symbol s with weight one.
adaptive_huffman_add(s)
1: set all components of L[2n] ← the matching components of L[2n - 1]
2: set L[2n + 1].left ← L[2n + 1].right ← 0, L[2n + 1].weight ← 1,
   L[2n + 1].symbol ← s, and L[2n - 1].symbol ← "internal"
3: set b ← L[2n - 1].bucket and j ← b.leader
4: adaptive_huffman_increment(2n - 1)
5: set index[L[2n].symbol] ← 2n and index[s] ← 2n + 1
6: set L[2n].parent ← L[2n + 1].parent ← j, L[j].left ← 2n, and
   L[j].right ← 2n + 1
7: if L[2n].weight = 1 then
8:     set L[2n + 1].bucket ← L[2n].bucket
9: else
10:    add a new bucket d with d.leader ← 2n + 1 to bucket list B
11:    set L[2n + 1].bucket ← d
12: set n ← n + 1
Note in function adaptive_huffman_output() that the bits in each
codeword are generated in reverse order, so are buffered in variable w before
being output. A final quirk in these algorithms is that the weight of the root of
the tree, element L[1], is never changed by adaptive_huffman_increment(), and
remains throughout at its initial value of zero. This allows it to act as a sentinel
for the comparison at step 8 in function adaptive_huffman_increment(), and
prevents the leader of L[2]'s bucket from being set to the root.
The resource cost of adaptive Huffman coding is non-trivial. The sibling
list structure requires 6 words in each of 2n - 1 nodes, and the index array
and list of leaders further add to the cost. The total memory requirement of
in excess of 13 words per alphabet symbol is a daunting requirement on all
but small alphabets. Nor is the method especially quick. Provided that the
increments are by units, linear time behavior is assured. But the large number
of checking operations required for every output bit means that the constant of
proportionality is high, and in practice execution is relatively slow.
More generally, we might wish to adjust the weight of a symbol by any
arbitrary amount, positive or negative. Decrementing a weight by unity can be
achieved in a similar manner to the incrementing, but along with maintaining
a leader for each bucket, a trailer must also be kept, which is the index of the
rightmost element in each bucket. To decrement a weight, the symbol's node
is swapped with the trailer, its weight decremented, and then parent pointers
followed as for incrementing.
Provided that all weights are integral, incrementing by an arbitrary amount
can be achieved by calling function adaptive_huffman_increment() as many
times as is necessary, being careful to make sure each call affects the leaf for
the desired symbol in its current position, which may change after each call.
This latter is, in effect, the algorithm of Cormack and Horspool [1984], except
that they make one further observation that allows the process to be faster in
practice: if the difference between the weight of a node and the node directly
to the left of the node's leader is greater than one, then the weight can be incre-
mented by the smaller of that difference and the amount required, without the
need for further movement of the node.

6.5 Adaptive arithmetic coding


Given the complexity of the mechanisms needed to make adaptive Huffman
coding possible, it is a relief indeed to turn to adaptive arithmetic coding. As
we shall see in this section, adaptive arithmetic coding is easier to implement,
faster to execute, requires less memory, and gives superior compression
effectiveness. There is no real contest between the two alternatives - adaptive
arithmetic coding wins hands down.
The implementation of arithmetic coding presented in Section 5.3 is easily
used as the basis of an adaptive arithmetic coder. Indeed, the functions de-
scribed in Algorithms 5.2 and 5.4 can be used in an adaptive coder without
further modification. In a call to function arithmetic_encode(l, h, t) the three
arguments l, h, and t describe the probability range [l/t, h/t) allocated to the
symbol that is to be coded. Hence, to achieve adaptive coding all that must be
added is the facility for the (l, h, t) triples to be calculated in an incremental
manner.
Keeping the value of t up to date is easy - it is just m, the number of
symbols so far processed, plus any adjustment required by the particular escape
probability estimator being used. Similarly, h is l + p^j_{M[j]}, where p^j_{M[j]} is again
the number of times that M[j], the jth symbol of the message, has appeared in
the first j - 1 symbols of the message; and l is a cumulative sum of such values
over the part of the alphabet prior to M[j]. If we adopt the "false count of
one" mechanism of assigning non-zero probabilities illustrated in Table 6.1 on
page 135, then these values will be inflated by 1, but that can be achieved by an
appropriate initialization of the structure used to store cumulative frequencies.
In Section 5.3 an array cum_prob was assumed that explicitly stored the
corresponding h value for each symbol, and the l value for symbol s is available
as the h value for symbol s - 1. To adjust the cum_prob array to account for
the occurrence of symbol M[j], each of cum_prob[M[j]], cum_prob[M[j] + 1],
cum_prob[M[j] + 2], and so on, must be incremented, through to cum_prob[n].
That is, to make an adaptive arithmetic coder, all that is required is a loop
that increments all cum_prob values at and subsequent to that of the symbol
being coded. Note that in the remainder of this chapter it is not necessary to
distinguish between n, the number of symbols with non-zero probabilities, and
n_max, the number of symbols available in the source alphabet, and for notational
brevity we now use n to indicate the number of symbols in the source alphabet,
including those that currently have zero probability.
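
A minimal C sketch of this arrangement is shown below, with the arithmetic
coding step replaced by a printf() so that the example is self-contained; the
array name cum_count, the alphabet size, and the false count of one per symbol
are assumptions of the illustration, not the book's code. The point to notice is
the update loop, which touches every entry from the coded symbol through to
the end of the array.

    /* Simple O(n)-per-symbol adaptive statistics: cum_count[s] holds the h
     * value for symbol s, so l is cum_count[s-1] and t is cum_count[N].    */
    #include <stdio.h>

    #define N 4                      /* size of the source alphabet          */
    static long cum_count[N + 1];    /* cum_count[0] = 0 is a sentinel       */

    static void init_counts(void) {
        for (int s = 0; s <= N; s++)
            cum_count[s] = s;        /* every symbol starts with a count of 1 */
    }

    static void encode_adaptive(int s) {                  /* 1 <= s <= N     */
        long l = cum_count[s - 1], h = cum_count[s], t = cum_count[N];
        printf("arithmetic_encode(%ld, %ld, %ld)\n", l, h, t);
        for (int k = s; k <= N; k++)                      /* the O(n) update */
            cum_count[k] += 1;
    }

    int main(void) {
        int message[] = {2, 3, 2, 2, 1};
        init_counts();
        for (int j = 0; j < 5; j++)
            encode_adaptive(message[j]);
        return 0;
    }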
The simple cum_prob mechanism for maintaining cumulative frequency
counts results in effective compression. It is not, however, efficient. On an
alphabet of n symbols each frequency update will take O(n) time, completely
dominating the cost of doing the arithmetic coding itself. For example, in a
character-based coder with an alphabet of n = 256 symbols an average of 128
loop iterations will be necessary for each character coded, close to 30 times the
number of iterations required in the renormalization loop while generating bits.
Indeed, rather more than 128 loop iterations will be needed for a wide range
of input file categories, as ASCII byte codes less than 128 tend to occur more
frequently.
This latter realization suggests that a useful heuristic is to reorder the al-
phabet so that the most probable symbols are allocated the smallest symbol
indices, and then run the cumulative probabilities backwards towards zero in
an array rev_cum-prob. When the most probable symbol is coded only one
value needs to be incremented; when the second most likely symbol is coded
the loop iterates twice, and so on. Rearranging the symbol ordering must itself
be done on the fly, as no a priori information is available about the symbol
probabilities. Two more arrays are needed to maintain the necessary informa-
tion about the permuted alphabet - an array symbol_to_index that stores the
current alphabet location of each symbol, and an array index_to_symbol that
stores the inverse permutation. Coding a symbol s then consists of fetching its
h value as rev_cum_prob[symbol_to_index[s]], and taking its l value to be the
next value stored in rev_cum_prob, assuming rev_cum_prob[n + 1] = 0. Then,
to increment the count of symbol s, it is exchanged with the leftmost symbol of
the same frequency, and the same increment loop as was assumed above is used
to add one to all of the cum_prob values at and to the left of its new location. In
this case the search for the leftmost object of the same frequency can be carried
out by binary search, as the search is only performed once per update. This is
a useful contrast to Algorithm 6.2, in which the analogous search for a leader
is performed once per output bit rather than once per symbol.
This mechanism was introduced as part of the seminal arithmetic coding
implementation given by Witten et al. [1987]. It works relatively well for
character-based models for which n = 256, and for files with a "typical" distri-
bution of character frequencies, such as English text stored in ASCII. On aver-
age about ten cum_prob values are incremented for each character coded when
such a model is used, a considerable saving compared to the simpler mecha-
nism. The saving is not without a cost though, and now 3n words of memory
are required for the statistics data structure compared with the n words used by
the approach assumed in Algorithm 5.2 on page 99. Nor does it cope well with
more uniform symbol distributions. On object files Witten et al. measured an
average of 35 loop iterations per symbol, still using a character-based model.
Worse, the structure is completely impotent to deal with the demands of
more complex models. In a word-based model of the kind considered earlier
in this chapter, there are thousands of symbols in the alphabet, and even using
the frequency-sorted approach the per-symbol cost of incrementing cum_prob
values is very high. Taking an asymptotic approach, to code a message of m
symbols over an alphabet of size n might take as much as O(mn) time - and
if the n distinct symbols are equi-probable then this will certainly happen. For
example, the cyclic message

M = [1,2,3, ... ,n - 1, n, 1,2,3, ... ,n - 1, n, 1,2,3, ... ,n - 1, n, ... ]


is represented in approximately O(m log n) bits, but takes O(mn) time to pro-
cess. For this message, c, the number of bits produced, is strictly sublinear
in the time taken, making on-line real-time compression impossible to guaran-
tee. Fortunately, other data structures have been developed for maintaining the
cumulative frequencies.

6.6 Maintaining cumulative statistics


Peter Fenwick [1994] provided the first of the two structures described in this
section. His tree-based mechanism requires O(m log n) time to code a message
of m symbols against an alphabet of size n. The number of bits produced might
still be sublinear in the time taken; nevertheless it represents an enormous im-
provement compared to the mechanism described in the previous section. What
is perhaps most surprising is that this efficiency is accomplished in less mem-
ory space than is required by the permuted alphabet method - just n words are
required, the same as for the static coder as it was described in Algorithm 5.3
on page 102.
The best way to grasp the essence of the Fenwick structure is through an
illustration. Figure 6.4 supposes that a set of n = 9 symbols is being manip-
ulated, with (at this point in time) P = [15,11, 7, 6,11,12,8, 1,4] as a set of
unnormalized probabilities.
The first row of Figure 6.4 - marked (a) - shows the frequencies of the
nine symbols. These frequencies are not stored explicitly, and are shown only
to aid the explanation. Similarly, row (b) shows the nominal cumulative fre-
quencies up to, but not including, the specified symbol, summing from left to
right. These are the values we would like to be returned from a get_lbound()
function; but they are not stored either. What is actually stored at each position
of the array is the frequency of the corresponding symbol plus some number of
previous symbols, shown in row (c) as a range of values. The rule for deciding
what should be summed is as follows. Suppose that the sth value fen_prob[s]
in the data structure is to be calculated. Define size(s) to be the largest power
of two that evenly divides s,

    size(s) = max{2^v | v ≥ 0 and s mod 2^v = 0}.

For example, size(3) = 1, size(6) = 2, and size(8) = 8. Then fen_prob[s] is
the sum of size(s) consecutive frequency values,

    fen_prob[s] = \sum_{k = s - size(s) + 1}^{s} P[k],
    location                                      1    2    3    4    5    6    7    8    9
    (a) symbol frequencies, P                    15   11    7    6   11   12    8    1    4
    (b) desired lbound values                     0   15   26   33   39   50   62   70   71
    (c) range of values for fen_prob            1-1  1-2  3-3  1-4  5-5  5-6  7-7  1-8  9-9
    (d) values stored in fen_prob                15   26    7   39   11   23    8   71    4
    (e) revised values after symbol 3 is coded   15   26    8   40   11   23    8   72    4

Figure 6.4: Maintaining cumulative frequencies with a Fenwick tree. In the example,
the unnormalized probabilities P = [15,11,7,6,11,12,8,1,4] are assumed to be the
result of previous symbols having been transmitted. Row (e) then shows the changes
that take place when P[3] is increased from 7 to 8. There is no requirement that the
source alphabet be probability-sorted.

where, as before, array P[k] represents the frequency of the kth symbol of the
alphabet at this particular point in time. Row (d) in Figure 6.4 records the
values actually stored in the fen_prob array.
It is also useful to define two further functions on array positions: forw(s)
returns the next value in the same power of two sequence as s, and back(s)
returns the previous one:

    forw(s) = s + size(s)
    back(s) = s - size(s).

The values stored in the array fen_prob are thus also given by

    fen_prob[s] = cum_prob[s] - cum_prob[back(s)],

where cum_prob[s] is the nominal value required for the value h to be passed
to the function arithmetic_encode(). For example, symbol s = 6 has a cor-
responding h value of 62 - row (b), stored as lbound(7) - but the value of
fen_prob[6] is 23, being the sum of the frequencies of symbols 5 and 6. To con-
vert the values stored in fen_prob into a cumulative probability, the sequence
dictated by the back() function is summed, until an index of zero is reached.
Algorithm 6.4
Return the cumulative frequency of the symbols prior to s in the alphabet,
assuming that fen_prob[1 ... n] is a Fenwick tree data structure.
fenwick_get_lbound(s)
1: set l ← 0 and i ← s - 1
2: while i ≠ 0 do
3:     set l ← l + fen_prob[i] and i ← back(i)
4: return l

Return the frequency of symbol s in the Fenwick tree fen_prob[1 ... n]; after
determining the value, add one to the stored frequency of s.
fenwick_get_and_increment_count(s)
1: set z ← back(s) and c ← fen_prob[s] and p ← s - 1
2: while p ≠ z do
3:     set c ← c - fen_prob[p] and p ← back(p)
4: set p ← s
5: while p ≤ n do
6:     set fen_prob[p] ← fen_prob[p] + 1 and p ← forw(p)
7: return c

Taking the same value of s = 6, the calculation performed is

    fen_prob[6] + fen_prob[4]
        = (cum_prob[6] - cum_prob[4]) + (cum_prob[4] - cum_prob[0])
        = cum_prob[6],

where just two terms are involved as back(6) = 4 and then back(4) = 0. More
generally, by virtue of the way array fen_prob is constructed, the sequence of
fen_prob values given by

    fen_prob[s], fen_prob[back(s)], fen_prob[back(back(s))], ...

when summed contains exactly the terms necessary to telescope the sum and
yield cum_prob[s].
Algorithm 6.4 gives a formal description of the adaptive coding process.
Taking as its argument a symbol identifier s, function fenwick_get_lbound()
sums a sequence of fen_prob values to obtain cum_prob[s - 1], which is the l
value required to arithmetically encode symbol s. As each fen_prob value is
added into the sum the cum_prob values that are not required cancel each other
out, and the right result is calculated.
A call to fenwick_get_lbound(s + 1) could then be used to calculate the cor-
responding h value; but it is more efficient to calculate it using the mechanism
shown in the first while loop of function fenwick_get_and_increment_count(),
also shown in Algorithm 6.4. If the argument s is presumed to be equally
likely to be any value in 1 ... n then half the time this while loop never iterates
at all; a quarter of the time it needs to iterate once; and so on. On average it
iterates just once, and requires O(1) expected time.
Then, to update the frequency of symbol s, a sequence of changes must
be made to fen_prob values to the right of s. This is accomplished by the
second loop in function fenwick_get_and_increment_count(). The values that
must be incremented are exactly those described by the sequence s, forw(s),
forw(forw(s)), and so on, until n, the length of the array, is exceeded.
That is, the operations that arithmetically encode symbol s start in the vicin-
ity of the sth position of the array and step back to the origin in ever increasing
leaps calculating l, and then return to the sth location and step forward in ever
increasing leaps, incrementing the stored values. Row (e) in Figure 6.4 on
page 158 shows the revised values of the example fen_prob array after symbol
s = 3 is coded. Only the values stored at locations 3, 4, and 8 of the array need
to be altered, yet subsequent reference to any of the other symbols to the right
of s = 3 results in the calculation of revised [l, h) intervals.
The structure clearly requires just n words of memory, where n is an up-
per bound on the largest symbol identifier appearing in the message M. How
long does it take to process each symbol? The important observation is that
size(back(s)) ≥ 2 × size(s), for all symbols s. That is, the amount skipped
backwards at least doubles at every loop iteration. Hence, at most ⌈log2 s⌉
loop iterations are required by the while loop in function fenwick_get_lbound(),
where s is the symbol number being processed. Similarly, size(forw(s)) ≥
2 × size(s), and so the increment phase of the operation (the second while loop
in function fenwick_get_and_increment_count()) takes at most ⌈log2(n - s)⌉
loop iterations. The total number of loop iterations is thus less than 2⌈log2 n⌉.
Every iteration involves a call to either back() or forw(), each one of which,
the way they were described above, requires in turn an evaluation of the size()
function. Fortunately, one of the observations made by Fenwick is that size can
be calculated in O(1) time using logical operations and addition. For example,
if two's complement arithmetic is being used, then

    size(s) = s AND (-s)

correctly calculates the size() function, where AND is assumed to be a bitwise
logical conjunction operation. More generally,

    size(s) = s AND (2^w - s)

works on any binary architecture, provided only that s < 2^w. For example,
assuming that 1 ≤ s < 2^4 = 16, size(4) and size(10) are calculated to be 4 and
2 respectively:

                            s = 4      s = 10
    s in binary:             0100        1010
    16 - s in binary:        1100        0110
    result after AND:        0100        0010
    result in decimal:          4           2

The cost of calculating size() is thus O(1), and the overall time taken to process
symbol s is O(log n). Compared to the sorted array method of Witten et al.,
this mechanism represents a rare instance of algorithmic development in which
both space and time are saved. Using it, the overall cost of adaptively main-
taining statistics for message M containing m symbols is O(m log n) time.
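
The complete set of operations is compact when written out. The following C
sketch was prepared for this description and is not the book's code; the names
fen, lowbit(), and so on are invented. It builds the fen_prob array of Figure 6.4,
computes an l value, performs the decoder's power-of-two search (Algorithm 6.5,
discussed below), and applies an increment; the bitwise expression s & (-s)
supplies size(s) as just explained.

    #include <stdio.h>

    #define N 9                         /* number of symbols, as in Figure 6.4 */
    static long fen[N + 1];             /* fen[1..N]; fen[0] is unused         */

    static int lowbit(int s) { return s & (-s); }        /* computes size(s)   */

    /* sum of the counts of symbols 1..s-1, stepping back towards the origin   */
    static long fen_get_lbound(int s) {
        long l = 0;
        for (int i = s - 1; i != 0; i -= lowbit(i))
            l += fen[i];
        return l;
    }

    /* add one to the count of symbol s, stepping forward in increasing leaps  */
    static void fen_increment(int s) {
        for (int p = s; p <= N; p += lowbit(p))
            fen[p] += 1;
    }

    /* decoder search: greatest s whose lbound is less than or equal to target */
    static int fen_get_symbol(long target) {
        int s = 0, mid;
        for (mid = 1; 2 * mid <= N; mid *= 2)
            ;                           /* mid = largest power of two <= N     */
        for (; mid >= 1; mid /= 2)
            if (s + mid <= N && fen[s + mid] <= target) {
                target -= fen[s + mid];
                s += mid;
            }
        return s + 1;
    }

    int main(void) {
        long P[N + 1] = {0, 15, 11, 7, 6, 11, 12, 8, 1, 4};
        for (int s = 1; s <= N; s++)    /* build row (d) of Figure 6.4         */
            for (int p = s; p <= N; p += lowbit(p))
                fen[p] += P[s];
        printf("lbound(6) = %ld\n", fen_get_lbound(6));       /* prints 50     */
        printf("symbol(45) = %d\n", fen_get_symbol(45));      /* prints 5      */
        fen_increment(3);               /* the change shown in row (e)         */
        printf("lbound(4) = %ld\n", fen_get_lbound(4));       /* prints 34     */
        return 0;
    }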
One further function is required in the decoder, illustrated in Algorithm 6.5.
The target value returned by arithmetic_decode_target() (described in Algo-
rithm 5.4 on page 104) must be located in the array fen_prob. This is accom-
plished by a binary search variant based around powers of two. The first loca-
tion inspected is the largest power of two less than or equal to n, the current
alphabet size. That is, the search starts at position 2^⌊log2 n⌋. If the value stored
at this position is greater than target, then the desired symbol cannot have a
greater index, and the search focuses on the first section of the array. On the
other hand, if the target is larger than this middle value the search can move
right, looking for a diminished target. Once the desired symbol number s has
been determined, function fenwick_get_lbound() is used to determine the bound
l for the arithmetic decoding function, and fenwick_get_and_increment_count()
is used to determine c and to increment the frequency of symbol s. Both of
these latter two functions are shared with the encoder.
The attentive reader might still, however, be disappointed, as O(m log n)
time can still be superlinear in the number of bits emitted. Consider, for exam-
ple, the probabilities

    P = [1/2, 1/4, ..., 1/2^i, ..., 1/2^{n-1}, 1/2^{n-1}].

Then a message of length m has an expected coded length of approximately 2m
bits irrespective of the value of n, yet the time taken to generate that bitstream
using a Fenwick tree is Θ(m log n). In particular, the reader may recall that
adaptive Huffman coding does take time linear in the number of bits produced.
Can similar linear-time behavior be achieved for adaptive arithmetic coding?
Pleasingly, the answer is "yes" - it is possible to modify the Fenwick tree
data structure and obtain asymptotic linearity [Moffat, 1999]. Consider again
Algorithm 6.5
Return the greatest symbol number s that, if passed as argument to function
fenwick_get_lbound(), would return a value less than or equal to target.
fenwick_get_symbol(target)
1: set s ← 0 and mid ← 2^⌊log2 n⌋
2: while mid ≥ 1 do
3:     if s + mid ≤ n and fen_prob[s + mid] ≤ target then
4:         set target ← target - fen_prob[s + mid]
5:         set s ← s + mid
6:     set mid ← mid/2
7: return s + 1

the actions carried out by function fenwick_get_lbound(). When the cumulative
frequency count is being computed, the structure is traversed in a leftward di-
rection using back(). Then, when the structure is being updated, the traversal
is to the right, using forw(). Both of these traversals extend from the initial
access point through to the end of the array, and in combination give rise to
the O(log n) time requirement. Suppose instead that an array fast_prob stores
sums that extend forward in the notional cum_prob array,

    fast_prob[s] = \sum_{k=s}^{s+size(s)-1} P[k],

where it is presumed that P[k] = 0 for k > n. This change means that a
different calculation strategy must be employed. Now to find the equivalent
cum_prob value for some symbol s a two stage process is used, shown as func-
tion fast_get_lbound() in Algorithm 6.6. In the first stage sums are accumu-
lated, starting at fast_prob[1], and doubling the index p at each stage until all
frequencies prior to the desired symbol number s have been included in the
total, plus possibly some additional values to the right of s. The first stage is
accomplished in steps 1 to 3 of function fast_get_lbound().
The first loop sets variable p to the first power of two greater than s. That is,
the first loop in function fast_get_lbound() calculates p = 2^⌈log2(s+1)⌉. Taking
l to be the sum of fast_prob values at the powers of two up to and including the
value stored at fast_prob[p/2] means that l also includes all of the values of P
to the right of s through to but not including the next power of two at p. The
excess, from s + 1 to p - 1, must be subtracted off the preliminary value of
l; doing so is the task of the second phase of the calculation, at steps 4 to 6 of
function fast_get_lbound(). Note that the processing steps forwards from s, but
only as far as the next power of two.
Algorithm 6.6
Return the cumulative frequency of the symbols prior to s in the alphabet,
assuming that fast_prob[1 ... n] is a modified Fenwick tree data structure.
fast_get_lbound(s)
1: set l ← 0 and p ← 1
2: while p ≤ s do
3:     set l ← l + fast_prob[p] and p ← 2 × p
4: set q ← s
5: while q ≠ p and q ≤ n do
6:     set l ← l - fast_prob[q] and q ← forw(q)
7: return l

Return the frequency of symbol s using the modified Fenwick tree
fast_prob[1 ... n]; after determining the value, add one to the stored
frequency of s.
fast_get_and_increment_count(s)
1: set c ← fast_prob[s] and q ← s + 1
2: set z ← min(forw(s), n + 1)
3: while q < z do
4:     set c ← c - fast_prob[q] and q ← forw(q)
5: set p ← s
6: while p > 0 do
7:     set fast_prob[p] ← fast_prob[p] + 1 and p ← back(p)
8: return c

Return the greatest symbol number s that, if passed as argument to function
fast_get_lbound(), would return a value less than or equal to target.
fast_get_symbol(target)
1: set p ← 1
2: while 2 × p ≤ n and fast_prob[p] ≤ target do
3:     set target ← target - fast_prob[p] and p ← 2 × p
4: set s ← p and mid ← p/2 and e ← 0
5: while mid ≥ 1 do
6:     if s + mid ≤ n then
7:         set e ← e + fast_prob[s + mid]
8:     if fast_prob[s] - e ≤ target then
9:         set target ← target - (fast_prob[s] - e)
10:        set s ← s + mid and e ← 0
11:    set mid ← mid/2
12: return s

location              1    2    3    4    5    6    7    8    9

(a) symbol frequencies, P
                     15   11    7    6   11   12    8    1    4
(b) desired lbound values
                      0   15   26   33   39   50   62   70   71
(c) range of values for fast_prob
                    1-1  2-3  3-3  4-7  5-5  6-7  7-7  8-9  9-9
(d) values stored in fast_prob
                     15   18    7   37   11   20    8    5    4
(e) revised values after symbol 3 is coded
                     15   19    8   37   11   20    8    5    4

Figure 6.5: Maintaining cumulative frequencies with a modified Fenwick tree. In the
example, the unnormalized probabilities P = [15, 11, 7, 6, 11, 12, 8, 1, 4] are assumed
to be the result of previous symbols having been transmitted. Row (e) then shows the
changes that take place when P[3] is increased from 7 to 8. The source alphabet need
not be probability-sorted, but a superior bound on execution time is possible if it is.

Algorithm 6.6 also includes the two other functions required in the encoder
and decoder. Function fast_get_and_increment_count() serves the same purpose
as its namesake in Algorithm 6.4. The while loop that calculates the current
frequency count of symbol s again requires just O(1) time on average per call,
where the average is taken over the symbols in 1...n. The second section of
fast_get_and_increment_count() then increments the count of symbol s (steps 5
to 7). All of the values that must be incremented as a result of symbol s being
coded again lie within the region [p, 2p), where p = 2^⌊log₂ s⌋ is the largest
power of two not exceeding s, and updating them takes O(log s) time. Including
calls to both of these encoding functions, the cost of adaptively encoding
symbol s is decreased from O(log n) to O(log s).
Figure 6.5 shows the same coding situation used as an example when the
unmodified Fenwick tree was being discussed. To code symbol s = 3 the sum
15 + 18 is calculated in the loop of steps 1 to 3 in function fast_get_lbound(),
and then the second loop at steps 4 to 6 subtracts 7 to yield the required
cumulative sum of l = 26, the starting point of the probability range allocated to
the third symbol. The frequency of symbol s = 3 is then determined to be 7
by the first part of function fast_get_and_increment_count(), which goes on to
increment locations two and three of fast_prob to record that symbol 3 has
occurred another time.
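The same trace can be reproduced in C. The sketch below is a direct
transliteration of fast_get_lbound() and fast_get_and_increment_count() from
Algorithm 6.6, populated with the Figure 6.5 frequencies; the construction loop
and the main() driver are illustrative additions rather than part of the
algorithms themselves.

/* Sketch of the modified Fenwick tree of Algorithm 6.6 in C, using the
 * frequencies of Figure 6.5. */
#include <stdio.h>

#define N 9
static int fast_prob[N + 1];               /* 1-origin */

static int forw(int s) { return s + (s & (-s)); }
static int back(int s) { return s - (s & (-s)); }

/* cumulative frequency of symbols 1..s-1 */
static int fast_get_lbound(int s) {
    int l = 0, p = 1, q;
    while (p <= s) {                       /* stage 1: sums at powers of two */
        l += fast_prob[p];
        p *= 2;
    }
    for (q = s; q != p && q <= N; q = forw(q))
        l -= fast_prob[q];                 /* stage 2: subtract the excess */
    return l;
}

/* frequency of symbol s, then add one to its stored count */
static int fast_get_and_increment_count(int s) {
    int c = fast_prob[s], q, z, p;
    z = forw(s) < N + 1 ? forw(s) : N + 1;
    for (q = s + 1; q < z; q = forw(q))
        c -= fast_prob[q];
    for (p = s; p > 0; p = back(p))
        fast_prob[p] += 1;
    return c;
}

int main(void) {
    int P[N + 1] = {0, 15, 11, 7, 6, 11, 12, 8, 1, 4};
    /* fast_prob[s] covers symbols s .. s + lowbit(s) - 1, capped at N */
    for (int s = 1; s <= N; s++)
        for (int k = s; k <= N && k < s + (s & (-s)); k++)
            fast_prob[s] += P[k];
    printf("lbound(3) = %d\n", fast_get_lbound(3));              /* 26      */
    printf("count(3)  = %d\n", fast_get_and_increment_count(3)); /* 7       */
    printf("fast_prob[2] = %d, fast_prob[3] = %d\n",
           fast_prob[2], fast_prob[3]);                          /* 19 and 8 */
    return 0;
}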

Decoding uses the same array, and is accomplished by fast_get_symbol()
in Algorithm 6.6. The decoder starts at p = 1 and searches in an exponential
and binary manner (described in Section 3.2 on page 32) for the target value.
During the second loop variable e records the excess count now known to be to
the right of the eventual symbol.
In an actual implementation it is sensible to make n a power of two and
suffer a modest space overhead in order to avoid some of the range tests. There
is no need for the extra symbols so introduced to have non-zero probabilities,
and compression is unaffected. The array fast_prob can also be allowed to grow
as the coding takes place using, for example, the C function realloc(), which
allocates a fresh extent of memory and copies the previous contents of the array
into it. There is no requirement that n be fixed in advance, and this is why we
have not distinguished in these recent sections between n and n_max.
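A minimal sketch of that growth strategy, assuming a power-of-two capacity and
the illustrative names ensure_capacity and capacity (neither of which comes from
the text), might look as follows. Because each fast_prob entry covers a forward
range that does not depend on n, and the frequencies of symbols beyond the old
alphabet size are zero, the existing entries remain valid after the array grows.

#include <stdlib.h>
#include <string.h>

static int *fast_prob = NULL;
static int capacity = 0;          /* once allocated, kept at a power of two */

/* ensure that symbols 1..s can be stored, zero-filling any new entries */
static int ensure_capacity(int s) {
    if (s < capacity)
        return 1;
    int newcap = capacity ? capacity : 2;
    while (newcap <= s)
        newcap *= 2;
    int *tmp = realloc(fast_prob, (size_t)newcap * sizeof *tmp);
    if (tmp == NULL)
        return 0;                 /* allocation failed */
    memset(tmp + capacity, 0, (size_t)(newcap - capacity) * sizeof *tmp);
    fast_prob = tmp;
    capacity = newcap;
    return 1;
}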
At face value, the reorganization of the Fenwick tree makes both encoding
and decoding somewhat more complex. Why then is it of interest? Suppose (as
we did in some of the earlier chapters) that the alphabet is probability-sorted,
with symbol one the most frequent, symbol two the second most frequent, and
so on. The normalized probability of symbol s, where s is an index into the
fast_prob array, must be less than 1/s, as each of the symbols prior to s in the
permuted alphabet has probability no smaller than that of symbol s. Hence,
when s is coded at least log₂ s bits are generated. That is, the number of bits
emitted is at least ⌈log₂ s⌉ − 1, and so the total computation time - which is
proportional to ⌈log₂ s⌉ - is at most O(c + 1), where c is the number of bits
emitted to code this symbol. Summed over all of the m symbols in the message,
the total time spent maintaining the probability estimates is thus O(m + n + c),
where c is now the overall length of the compressed message.
Linear-time encoding and decoding on non-sorted distributions is achieved
by allocating two further arrays in each of the encoder and decoder, mapping
raw symbol numbers to and from probability-sorted symbol numbers. A sim-
ilar mapping was used in the reverse_cum_freq array considered earlier; in the
implementation of adaptive Huffman coding; and in Section 4.8 for minimum-
redundancy coding. To maintain the probabilities in sorted order thus requires
one further step for each symbol coded - prior to incrementing the frequency
of symbol s, we adjust the two mapping arrays so that symbol s is the leftmost
symbol with that frequency count. Incrementing symbol s will then not alter
the decreasing probability arrangement. This component of the running time
is also O(log s), as the new location of symbol s must lie between one and its
old location, and can be determined with a binary search over that interval. The
actual swap operation involves alteration of two values in each of the two index
arrays, and takes O(1) time.
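The following C sketch shows one possible realization of that bookkeeping,
using the illustrative array names freq, sym_at, and rank_of rather than any
names from the text; the binary search locates the leftmost rank holding the
symbol's current frequency, the two mapping arrays are adjusted, and only then
is the frequency increased.

#include <stdio.h>

#define N 9
static int freq[N + 1];      /* freq[r]: frequency of the rank-r symbol */
static int sym_at[N + 1];    /* sym_at[r]: which symbol holds rank r    */
static int rank_of[N + 1];   /* rank_of[s]: current rank of symbol s    */

/* promote symbol s to the leftmost rank holding its frequency, then add one */
static void increment(int s) {
    int cur = rank_of[s], f = freq[cur];
    int lo = 1, hi = cur;
    while (lo < hi) {               /* leftmost rank r with freq[r] == f */
        int mid = (lo + hi) / 2;
        if (freq[mid] > f)
            lo = mid + 1;
        else
            hi = mid;
    }
    int other = sym_at[lo];         /* symbol currently at that rank */
    sym_at[lo] = s;  sym_at[cur] = other;
    rank_of[s] = lo; rank_of[other] = cur;
    freq[lo] += 1;                  /* cannot disturb the non-increasing order */
}

int main(void) {
    /* the Figure 6.5 frequencies, pre-sorted into rank order */
    int rank_freq[N + 1] = {0, 15, 12, 11, 11, 8, 7, 6, 4, 1};
    int order[N + 1]     = {0, 1, 6, 2, 5, 7, 3, 4, 9, 8};
    for (int r = 1; r <= N; r++) {
        freq[r] = rank_freq[r];
        sym_at[r] = order[r];
        rank_of[order[r]] = r;
    }
    increment(5);                   /* symbol 5: frequency 11 -> 12 */
    printf("symbol 5: rank %d, frequency %d\n",
           rank_of[5], freq[rank_of[5]]);   /* rank 3, frequency 12 */
    return 0;
}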
Whether the extra space consumed by the two mapping arrays is warranted
depends on the particular application in which the coder is being used. For
many purposes the non-permuted structure - either in original form or in mod-
ified form - will be adequate, as it would be unusual for a message over an
alphabet of n symbols to have a self-information that is o{log n) bits per sym-
bol. Note also that if the source alphabet is approximately probability-ordered,
but not exactly ordered, the modified structure may have an advantage over
the original Fenwick tree. For example, in the word-based model used as an
example several times in this chapter, words encountered early in the text and
assigned low symbol numbers will typically repeat at shorter intervals than
words encountered for the first time late in the source text and assigned high
symbol numbers. Moffat [1999] reports experiments that quantify this effect,
and concludes that the use of the mapping tables to guarantee linear-time en-
coding is probably unnecessary, but that the modified structure does offer better
compression throughput than the original Fenwick tree.
There is one further operation that must be supported by all data structures
for maintaining cumulative frequencies, and that is periodic scaling. Most
arithmetic coders operate with a specified level of precision for the symbol
frequency counts that cannot be exceeded. For example, the implementation
described in Section 5.3 stipulates that the total of the frequency counts t may
not exceed 2^f for some integer f. This restriction means that an adaptive coder
must monitor the sum of the frequency counts, and when the limit is reached,
take some remedial action.
One possible action would be to reset the statistics data structure to the ini-
tial bland state, in which every symbol is equally likely. This has the advantage
of being simple to implement, and it might also be that a dramatic "amnesia"
of the previous part of the message is warranted. For example, it is conceiv-
able that the nature of the message changes markedly at fixed intervals, and
that these changes can be exploited by the compression system. More usual,
however, is a partial amnesia, in which the weight given to previous statistics
is decayed, and recent information is allowed to count for more than historical
records. This effect is achieved by periodically halving the symbol frequency
counts, making sure that no symbol is assigned zero as a frequency. That is, if
p_s is the frequency of symbol s, then after the count scaling the new value p′_s of
symbol s is given by (p_s + 1) div 2. When symbol s occurs again, the addition
of 1 to p′_s is then worth two of the previous occurrences, and the probability
distribution more quickly migrates to a new arrangement should the nature of
the message have changed.
Algorithmically this raises the question as to how such a scaling operation
should be accomplished, and how long it takes. In the cum_prob array of Witten
et al., scaling is a simple linear-time operation requiring a single scan through
the array. With careful attention to detail, both the fen_prob and fast_prob
structures can also be scaled in O(n) time. Algorithm 6.7 gives the details for the
fast_prob data structure.
In Algorithm 6.7 the function fast_scaling() makes use of two basic
functions. The first, fast_to_probs(), takes a fast_prob array and converts it
into a simple array of symbol frequencies. It does this in situ and in O(n) time.
To see that the second of these two claims is correct, note that despite the
nested loop structure of the function, the number of subtraction operations
performed is at most n − 1. Then the counts are halved, and finally a similar
function probs_to_fast() is used to rebuild the statistics structure. The total
cost is, as required, O(n) time.
Statistics scaling raises one interesting issue. Suppose that count halving
takes place every k symbols. Then the total number of halvings to encode an
m-symbol message is m/k. At a cost of O(n) time per scaling operation, the
total contribution is O(mn/k) = O(mn), which asymptotically dominates the
O(n + m + c) running time of the adaptive coder, seemingly negating all of the
effort spent in this section to avoid the O(mn) cost of using a simple cum_prob
array. Here we have an example of a situation in which it is erroneous to rely
too heavily upon asymptotic analysis. Count halving does dominate, but when
k is large - as it usually is - the actual contribution to the running time is
small. Another way of shedding light upon this result is to observe that almost
certainly k should be larger than n, as otherwise it may not be possible for
every alphabet symbol to have a non-zero probability. Under the additional
requirement that k ≥ n the O(mn/k) time for scaling becomes O(m).
A few paragraphs ago we suggested that count scaling meets two needs: the
requirement that the total frequency count going into the arithmetic coder be
bounded at f bits, and the desire to give more emphasis to recent symbol occur-
rences than ancient ones, thereby allowing the probability estimates to evolve.
Scaling the frequency counts in the way shown in Algorithm 6.7 achieves both
these aims, but in a rather lumpy manner. For example, in a straightforward
implementation, no aging at all will take place until 21 symbols have been
processed, which might be a rather daunting requirement when (say) f = 25.
It thus makes sense to separate these two needs, and address them indepen-
dently rather than jointly. To this end, one further refinement has been devel-
oped. Suppose that the current sum of the frequency counts is t, and we wish
to maintain a continuous erosion of the impact of old symbols. To be precise,
suppose that we wish the influence of a symbol that just occurred to be exactly
twice that of one that occurred d symbols ago in the message. Quantity d is the
decay rate, or half-life of the probability estimates.
One way of arranging the required decay would be to multiply each frequency
by (1 − x) for some small positive value x after each coding step, and
then add one to the count of the symbol that just appeared. With x chosen
suitably, by the time d steps have taken place, the old total t can be forced to
have effectively halved in weight.

Algorithm 6.7
Approximately halve each of the frequencies stored in the modified Fenwick
tree fast_prob[1...n].
fast_scaling()
1: fast_to_probs(fast_prob, n)
2: for s ← 1 to n do
3:     set fast_prob[s] ← (fast_prob[s] + 1) div 2
4: probs_to_fast(fast_prob, n)

Convert the modified Fenwick tree fast_prob[1...n] into a simple array of
symbol frequencies.
fast_to_probs()
1: set p ← 2^⌊log₂ n⌋
2: while p > 1 do
3:     set s ← p
4:     while s + p/2 ≤ n do
5:         set fast_prob[s] ← fast_prob[s] − fast_prob[s + p/2]
6:         set s ← s + p
7:     set p ← p/2

Convert the array fast_prob[1...n] from a simple array of symbol
frequencies into a modified Fenwick tree.
probs_to_fast()
1: set p ← 2
2: while p ≤ n do
3:     set s ← p
4:     while s + p/2 ≤ n do
5:         set fast_prob[s] ← fast_prob[s] + fast_prob[s + p/2]
6:         set s ← s + p
7:     set p ← 2 × p
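The three functions of Algorithm 6.7 translate directly into C. In the sketch
below the tree is loaded with the Figure 6.5 values, scaled once, and printed;
the initialization values and the main() driver are illustrative additions.

#include <stdio.h>

#define N 9
static int fast_prob[N + 1];

static void fast_to_probs(void) {
    int p, s;
    for (p = 1; 2 * p <= N; p *= 2)        /* p = 2^floor(log2 N) */
        ;
    for (; p > 1; p /= 2)
        for (s = p; s + p / 2 <= N; s += p)
            fast_prob[s] -= fast_prob[s + p / 2];
}

static void probs_to_fast(void) {
    int p, s;
    for (p = 2; p <= N; p *= 2)
        for (s = p; s + p / 2 <= N; s += p)
            fast_prob[s] += fast_prob[s + p / 2];
}

static void fast_scaling(void) {
    fast_to_probs();
    for (int s = 1; s <= N; s++)
        fast_prob[s] = (fast_prob[s] + 1) / 2;
    probs_to_fast();
}

int main(void) {
    /* the packed form of P = [15,11,7,6,11,12,8,1,4], as in Figure 6.5 */
    int init[N + 1] = {0, 15, 18, 7, 37, 11, 20, 8, 5, 4};
    for (int s = 1; s <= N; s++)
        fast_prob[s] = init[s];
    fast_scaling();
    for (int s = 1; s <= N; s++)
        printf("%d ", fast_prob[s]);
    printf("\n");   /* prints 8 10 4 19 6 10 4 3 2, the re-packed tree for
                       the halved frequencies [8,6,4,3,6,6,4,1,2] */
    return 0;
}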

The problem of this approach is that O(n) time is
required at each coding step, as all n probability estimates are adjusted.
More economical is to add a slowly growing increment of (1 + x)^t at time t
to the count of the symbol that occurred, and leave the other counts untouched.
The desired relative ratios between the "before" and "after" probabilities still
hold, so the effect is the same. The value of x is easily determined: if, after d
steps the increment is to be twice as big as it is now, we require

    (1 + x)^d = 2.
The approximation log_e(1 + x) ≈ x when x is close to zero implies that x ≈
(log_e 2)/d. For example, when we expect a distribution to be stable, a long
half-life d is appropriate, perhaps d = 10,000 or more. In this case, x ≈
0.000069 - that is, each frequency increment is 1.000069 times the last
one, and after d = 10,000 such steps the frequency increment has doubled. On the other
hand, if the distribution is expected to fluctuate rapidly, with considerable local
variation, we should choose d to be perhaps 100. In this case each frequency
increment will be 1.0069 times larger than its predecessor.
Since one of the assumptions throughout the discussion of arithmetic cod-
ing has been that the frequency estimates are maintained as integers, this raises
the obvious problem of roundoff errors. To reduce these, we scale all quanti-
ties by some suitable factor, and in essence retain fractional precision after the
halving process described in Algorithm 6.7, which is still required to ensure
that t ≤ 2^f.
To see how this works, let us take a concrete example. Suppose that we
have an alphabet of n = 100 symbols, have decided that we should work
with a half-life of d = 1,000 symbols, thus 1 + x ≈ 1.00069, and must operate
with an arithmetic coder with f = 24. If we wish to assign an initial false
count to each symbol in the alphabet, any value less than 2^f/n ≈ 167,000
will suffice, provided that the same value is also used for the first increment.
So we can certainly initialize each of the n = 100 frequency counts to (say)
p_i = 10,000. Then, after the first symbol in the message is coded, an increment
of 10,000 is used. The second increment is bigger, ⌈10,000(1 + x)⌉ = 10,007;
and the third bigger again, ⌈10,000(1 + x)²⌉ = 10,014; and so on. When
the total of all the frequency counts reaches 2^f, all of them are numerically
halved according to Algorithm 6.7, and so too is the current increment, thereby
retaining all relativities. In this way the constraint of the arithmetic coder is
met; the unnormalized probability estimates are maintained as integers; and
the normalized probability estimates are smoothly decayed in importance as
the symbols they represent fade into the past.
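The arithmetic of this example is easily checked. The following C sketch uses
the same illustrative parameters, with a simplified halving test standing in for
the full Algorithm 6.7 machinery, and reproduces the increment sequence 10,000,
10,007, 10,014.

#include <math.h>
#include <stdio.h>

int main(void) {
    const int n = 100, f = 24;
    const double d = 1000.0;
    const double x = log(2.0) / d;          /* (1 + x)^d is approximately 2 */
    double increment = 10000.0;
    long long total = (long long)n * 10000; /* initial false counts */

    for (int step = 1; step <= 3; step++) {
        printf("step %d: increment = %.0f\n", step, ceil(increment));
        total += (long long)ceil(increment);
        increment *= 1.0 + x;
        if (total >= (1LL << f)) {          /* scaling trigger: halve the   */
            total /= 2;                     /* running total and increment  */
            increment /= 2.0;               /* (the tree itself would be    */
        }                                   /* halved with Algorithm 6.7)   */
    }
    return 0;   /* prints increments 10000, 10007, 10014 */
}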

There is only one small hiccup in this process, which is that the use of
permutation vectors to guarantee linear time performance in the modified Fen-
wick tree requires that increments are by unit amounts. That is, with the non-
unit increments we are now proposing to use, the best we can be assured of
is O(log n) time per operation. But even without this additional complication,
we had accepted that the permutation vectors were only warranted in special
situations; and given the discussion just concluded, we can now quite defini-
tively assert that the modified Fenwick tree data structure, without permutation
vectors, but with decaying probability estimates, is the best general-purpose
structure for maintaining the statistics of an adaptive arithmetic coder.
In this section we have seen how to accommodate messages in which the
symbol probabilities slowly drift. The next two sections consider messages
that contain even more violent shifts in symbol usage patterns, and describe
techniques for "smoothing" such discontinuous messages so that they can be
coded economically. Then, after those two sections, we return to the notion
of evolving probability distributions, and show how combining a coder with
a small half-life with a set of coders with a long half-life can yield improved
compression effectiveness for non-stationary messages.

6.7 Recency transformations


The half-life approach that was described in the previous section is one way of
handling evolving distributions. But sometimes the shifts in symbol probabil-
ities are more dramatic than is encompassed by the words "drift" and "evolu-
tion": sometimes the probabilities exhibit quite sudden discontinuities, possi-
bly over relatively small spans. Consider the character sequence in Figure 6.6a.
This message is just 45 characters long, but even in its brevity has some strange
characteristics. For example, all of the "r" characters appear in a single clus-
ter, as do all of the "k"s. The exact origins of this message will be discussed
in Section 8.3, and for now the reader is asked to accept an assurance that
the message is genuine, and should be coded somehow. The problem is that
over quite short distances the message exhibits remarkably different character-
istics, and it would be quite inappropriate to try and represent this string using
a single probability distribution, even one allowed to evolve. Revolution in the
estimates is called for rather than evolution.
To deal with such rapidly varying statistics, a ranking transformation is
used. The most widely known such transformation is the move-to-front (or
MTF) transformation, but there are other similar mechanisms. The idea of the
move-to-front transformation is very simple - each symbol occurrence in the
sequence is replaced by the integer one greater than the number of distinct sym-
bols that have appeared since the last occurrence of this symbol [Bentley et al.,
1986, Ryabko, 1987].

pappoppp#kkk##ddcptrrr#ccp#leefeeiiiepee#s#.e
(a) As a string of characters

112  97 112 112 111 112 112 112  35
107 107 107  35  35 100 100  99 112
116 114 114 114  35  99  99 112  35
108 101 101 102 101 101 105 105 105
101 112 101 101  35 115  35  46 101
(b) As integer ASCII values

113  99   2   1 113   2   1   1  39
110   1   1   2   1 104   1 104   5
117 116   1   1   6   5   1   5   3
113 108   1 109   2   1 112   1   1
  2   6   2   1   6 117   2  60   4
(c) As MTF values

Figure 6.6: A possible message to be compressed, shown as: (a) a string of charac-
ters; (b) the corresponding integer ASCII values; and (c) after application of the MTF
transformation. Section 8.3 explains the origins of the message.

Figure 6.6c shows the effect the MTF transformation has
upon the example string of Figure 6.6a. The last character is transformed into
the integer 4, as character "e" (ASCII code 101, in the final position in Fig-
ure 6.6b) last appeared 5 characters previously, with just 3 distinct intervening
characters.
There is a marked difference between the "before" and "after" strings.
In the original sequence the most common letter is "p" (ASCII code 112),
which appears 9 times; now the most common symbol is 1, which appears 15
times. The probability distribution also appears to be more consistent and sta-
ble, and as a consequence is rather more amenable to arithmetic or minimum-
redundancy coding. For a wide range of input sequences the MTF transforma-
tion is likely to result in a probability-sorted transformed message, and the very
large number of "1" symbols that appear in the output sequence when there is
localized repetition in the input sequence means that good compression should
be obtained, even with static coding methods such as the interpolative code
(Section 3.4 on page 42). That is, application of the MTF has the effect of
smoothing wholesale changes in symbol frequencies, and when the message
is composed of sections of differing statistics, the MTF allows symbols to be
dealt with one level removed from their actual values.

Algorithm 6.8
Perform an MTF transformation on the message M[1...m], assuming that
each symbol M[i] is in the range 1...n.
mtf_transform(M, m)
1: for s ← 1 to n do
2:     set T[s] ← s
3: for i ← 1 to m do
4:     set s ← M[i], pending ← s, and t ← 1
5:     while T[t] ≠ s do
6:         swap pending and T[t], and set t ← t + 1
7:     set M′[i] ← t
8:     set T[t] ← pending
9: return M′

Invert the MTF transformation.
mtf_inverse(M′, m)
1: for s ← 1 to n do
2:     set T[s] ← s
3: for i ← 1 to m do
4:     set t ← M′[i] and M[i] ← T[t]
5:     while t ≠ 1 do
6:         set T[t] ← T[t − 1] and t ← t − 1
7:     set T[1] ← M[i]
8: return M

Algorithm 6.8 shows one way in which the MTF transformation can be
implemented, using an array T initialized with an identity transformation, and
a linear search to locate the current array position of each symbol s in the source
message. The inverse MTF transformation, shown as function mtf_inverse(), is
also easy to implement using an array.
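A C transliteration of Algorithm 6.8 is shown below, applied to the first nine
characters of Figure 6.6; symbols are taken to be ASCII codes plus one so that
they fall in the range 1...n assumed by the algorithm, and the main() driver is
an illustrative addition.

#include <stdio.h>
#include <string.h>

#define ALPHA 257                  /* symbols 1..256 */

static void mtf_transform(const int *M, int *Mprime, int m) {
    int T[ALPHA];
    for (int s = 1; s < ALPHA; s++)
        T[s] = s;
    for (int i = 0; i < m; i++) {
        int s = M[i], pending = s, t = 1;
        while (T[t] != s) {        /* shift the list while searching for s */
            int tmp = pending; pending = T[t]; T[t] = tmp;
            t++;
        }
        Mprime[i] = t;
        T[t] = pending;
    }
}

static void mtf_inverse(const int *Mprime, int *M, int m) {
    int T[ALPHA];
    for (int s = 1; s < ALPHA; s++)
        T[s] = s;
    for (int i = 0; i < m; i++) {
        int t = Mprime[i];
        M[i] = T[t];
        while (t != 1) {           /* shift down and move the symbol to front */
            T[t] = T[t - 1];
            t--;
        }
        T[1] = M[i];
    }
}

int main(void) {
    const char *text = "pappoppp#";
    int m = (int)strlen(text), M[64], out[64], back[64];
    for (int i = 0; i < m; i++)
        M[i] = (unsigned char)text[i] + 1;
    mtf_transform(M, out, m);
    for (int i = 0; i < m; i++)
        printf("%d ", out[i]);     /* 113 99 2 1 113 2 1 1 39 */
    printf("\n");
    mtf_inverse(out, back, m);
    printf("%s\n", memcmp(M, back, m * sizeof(int)) == 0 ? "round trip ok"
                                                         : "mismatch");
    return 0;
}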
As implemented in Algorithm 6.8 the cost of the forward and reverse MTF
transformations is O(Σ_{i=1}^{m} t_i), where t_i is the transformed equivalent
of symbol M[i] and is, by definition, one greater than the number of distinct
symbols since the previous appearance of symbol M[i] in the message.
The MTF computation has the potential to be expensive and to dominate
the cost of entropy coding the resultant stream of integers, particularly so if
the transformation is to be applied to messages over a non-character alphabet
for which n ≫ 256, or other situations in which the average value of
(Σ t_i)/m ≫ 1. The computation proposed in Algorithm 6.8 is, in some ways,
reminiscent of the linear search in a cum_prob array that was discussed in con-
nection with arithmetic coding - it is suitable for small alphabets, or for very
skew probability distributions, but inefficient otherwise. As was the case with
arithmetic coding, we naturally ask if there is a better way; and again, the an-
swer is yes.
Bentley et al. noted that the MTF operations can be carried out efficiently
using a splay tree, a particularly elegant data structure devised by Sleator and
Tarjan [1985]. A splay tree is a self-adjusting binary search tree with good
amortized efficiency for sufficiently long sequences of operations. In particular,
the amortized cost for each access, insertion or deletion operation on a specified
node in an n-node splay tree is O(log n) operations and time. Splay trees also
exhibit some of the behavior of finger search trees, and are ideally suited to the
task of MTF calculation. We are unable to do full justice here to splay trees,
and the interested reader is referred to, for example, Kingston [1990]. But, as
a very crude description, a splay tree is a binary search tree that is adjusted via
edge rotations after any access to any item within the tree, with the net effect
of the adjustments being that the node accessed is moved to the root of the tree.
That node now has a greatly shortened search path; other nodes that shared
several ancestors with the accessed node also benefit from shorter subsequent
search paths. In addition, the tree is always a search tree, so that nodes to the
left of the root always store items that have key values less than that stored at
the root, and so on for each node in the tree.
To use a splay tree to accomplish the MTF transformation, we start with an
array of tree nodes that can be directly indexed by symbol number. Each node
in the array contains the pointers necessary to manipulate the splay tree, which
is built using these nodes in a timestamp ordering. That is, the key used to lo-
cate items in the splay tree is the index in the message at which that symbol last
appeared. Each splay tree node also stores a count of the number of items in its
right subtree within the splay tree. To calculate an MTF value for some symbol
s, the tree node for that symbol is identified by accessing the array of nodes,
using s as a subscript. The tree is then splayed about that node, an operation
which carries out a sequence of edge rotations (plus the corresponding pointer
adjustments, and the corresponding alterations to the "right subtree size" field)
and results in the node representing symbol s becoming the root of the tree.
The MTF value can now be read directly - it is one greater than the number
of elements in the right subtree, as those nodes represent symbols with times-
tamps greater than node s. Finally, the node for s is detached from the tree,
given a new most-recently-accessed timestamp, and then reinserted. The inser-
tion process is carried out by concatenating the left and right subtrees and then
making that combined tree the left subtree of the root. This final step leaves the
node representing symbol s at the root of the tree, and all other nodes in its left
subtree, which is correct as they all now have smaller timestamps.


The complete sequence of steps requires O(log t_i) amortized time, where
t_i is the MTF rank that is emitted [Sleator and Tarjan, 1985]. This compares
very favorably with the O(t_i)-time linear search suggested in Algorithm 6.8.
A similar computation allows the reverse transformation - from MTF value to
symbol number - to also be carried out in O(log t_i) amortized operations and
time. When the source alphabet is large, use of the splay tree implementation
allows very fast computation of the MTF transformation [Isal and Moffat,
2001b].
How much difference can the MTF transformation make? In sequences
where symbols are used intensively over a short period, and then have long
periods of disuse, the MTF transformation has the potential to create an output
sequence with self-information considerably less than the self-information of
the input sequence. As an extreme example, consider the sequence

a, a, a, a, ... , a, b, b, b, b, ... , b

in which the number of as is equal to the number of bs. The zero-order self-
information of this sequence is one bit per symbol, but after an MTF transfor-
mation, and even assuming that the initial state of the MTF list has b in the first
position and a in the second position, the output sequence

2,1,1,1, ... ,1,2,1,1,1, ... ,1

has a zero-order self-information close to zero.
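That claim is easy to check numerically. The short C sketch below computes the
zero-order self-information per symbol of the two count distributions, for an
assumed message length of m = 1000; the value of m is an arbitrary illustrative
choice.

#include <math.h>
#include <stdio.h>

static double bits_per_symbol(const long counts[], int k, long m) {
    double h = 0.0;
    for (int i = 0; i < k; i++)
        if (counts[i] > 0)
            h -= (double)counts[i] / m * log2((double)counts[i] / m);
    return h;
}

int main(void) {
    long m = 1000;
    long before[2] = {m / 2, m / 2};     /* m/2 a's and m/2 b's         */
    long after[2]  = {m - 2, 2};         /* MTF output: 1's and two 2's */
    printf("before MTF: %.3f bits/symbol\n", bits_per_symbol(before, 2, m));
    printf("after MTF:  %.3f bits/symbol\n", bits_per_symbol(after, 2, m));
    return 0;   /* approximately 1.000 and 0.021 */
}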


There is no need to slavishly apply the MTF rule when seeking to cap-
ture localized repetition in sequences, and provided the decoder uses the same
scheme as the encoder, any promotion mechanism may be used. If anything,
MTF is a rather aggressive policy - the symbol at the front of the list is al-
ways displaced, even if the new symbol is one that has not been used for a
very long time. To moderate the rapid promotion, the move-one-from-front
(MIFF) strategy promotes into the second location unless the item was already
in the second location. An alternative protocol has been described by Michael
Schindler [1997] who sends a newly accessed symbol to the second position if
the symbol at the head of the list was the last one accessed prior to that; another
variant is the MIFF2 heuristic, which sends the new symbol to the second po-
sition if the head was accessed last time or the time before, unless the symbol
accessed was already the second symbol in which case it goes to the top. Other
approximate recency heuristics have been devised by Isal and Moffat [2001a]
and Isal et al. [2002].

6.8 Splay tree coding


The previous section discussed the use of splay trees as a way of implementing
the MTF transformation. In a concurrent development, Doug Jones [1988]
proposed two other uses for splay trees in coding.
In the first, the splay tree is used as a hybrid between the MTF transforma-
tion described in the previous section and the adaptive Huffman code described
in Section 6.4 on page 145. Recall that a code tree represents the symbols of
the source alphabet at its leaves, and edges out of the internal nodes correspond
to bits in the codeword for a symbol. In a static or semi-static code the struc-
ture of the tree is fixed and does not change as the message is coded. In an
adaptive Huffman code the structure of the tree is updated after each symbol in
the message is coded, and each symbol is represented with respect to a set of
codes that is minimal for the prior part of the message. The adjustments made
to the code tree are a consequence of the frequency of the symbol just coded
now being one greater than previously.
Jones's proposal is that the code tree be adjusted instead via a splaying-like
operation called semi-splaying that brings the leaf corresponding to the trans-
mitted symbol approximately halfway to the root. Moving a leaf closer to the
root automatically moves other symbols further from the root; the net effect is
that the codewords assigned to a particular symbol are short when that symbol
is being heavily used, and long when it is infrequently used. The transition
between these two extremes depends directly upon symbol frequencies over
relatively short spans of the message. For example, if the source alphabet has
n symbols, then log2 n consecutive repetitions of any symbol is sufficient to
guarantee a one-bit codeword for that symbol at the next step.
The result is a single data structure that both provides an adaptive pre-
fix code and is also sensitive to local frequency variations within the source
message. In addition, it is both relatively straightforward to implement, and
reasonably fast to execute. Jones gives compression results that show that
a character-based splay coder yields inferior compression results to a con-
ventional character-based compression system using an adaptive minimum-
redundancy coder when applied to typical text files, and notes that on such
files the symbol usage is relatively homogeneous, and the splay tree settles
into a relatively stable arrangement that is not as good as a Huffman tree. On
the other hand, for files storing images, Jones found that the splay coder gave
superior compression effectiveness, as the localized pattern of light and dark
areas in typical images meant that the MTF-like recency effects over short seg-
ments of symbols were present for the splay coder to exploit. That is, the
non-homogeneous nature of gray-level images meant that the splay coder was
able to obtain compression effectiveness better than the self-information of the
file according to the same model. The splay coder also had the advantage
of being considerably faster in execution than the control in Jones's experi-
ments, which was an implementation of Vitter's [1987] adaptive minimum-
redundancy coder. Jones also noted that the splay coder only required about
one quarter of the memory space of Vitter's coder - 3n words for an alphabet
of n symbols rather than 13n words.
Moffat et al. [1994] also experimented with splay coding, and found that,
while it executes quickly compared to other adaptive techniques, for homoge-
neous input files it typically generates a compressed bitstream approximately
15% longer than the self-information. In summary, splay coding provides an
interesting point in the spectrum of possible coding methods: it is probably too
ineffective to be used as a default coding mechanism, but is considerably faster
than adaptive minimum-redundancy coding and also a little faster than adaptive
arithmetic coding [Moffat et al., 1994]. Static coding methods (see Chapter 3)
also provide fast operation at the expense of compression effectiveness, but the
advantage of the splay coder is that it exploits localized frequency variations
on non-homogeneous messages.
The second use to which splay trees can be put is in maintaining the cumu-
lative frequencies required by an adaptive arithmetic coder. The Fenwick tree,
described in Section 6.6, requires n words of storage and gives O(log n)-time
performance; while, assuming that the alphabet is not probability-sorted and
thus that permutation vectors are required, the modified Fenwick tree requires
3n words and gives O(log s)-time calculation, where s is the rank of the symbol
being coded. The same O(log s)-time bound is offered by a splay tree in
an amortized sense, at a slightly increased memory cost.
To achieve this performance, the symbols in the source alphabet are stored
in a splay tree in normal key order, so that an in-order traversal of the tree
yields the alphabet in sorted order. In addition to tree pointers, each node in
the tree also records the total weight of its left subtree, and from these values
the required cumulative frequencies can be calculated while the tree is being
searched for a symbol. The splaying operation that brings that accessed symbol
to the root must also modify the left-subtree-count fields of all of the nodes it af-
fects. However, the modifications to symbol frequencies need not be by single
units, and a half-life decaying strategy can be incorporated into this structure.
Sleator and Tarjan [1985] prove a result that they call the "Static Optimality
Theorem" for splay trees; namely, that a splay tree is at most a constant factor
more expensive for a sequence of searches than a static optimal binary search
tree built somehow "knowing" the access frequencies. For a sequence of m
accesses to a tree of n items, where the ith item is accessed Vi times, this result
implies that the total cost of the m searches is

    O(n + m + Σ_{i=1}^{n} v_i log₂(m/v_i)).

The sum inside the parentheses is exactly the self-information of the sequence
(Equation 2.4 on page 22). Hence the claimed linearity - with unit increments
to symbol frequencies, the cost of adaptive arithmetic coding using a splay tree
to manage cumulative frequencies is O(n + m + c), where n is the number of
symbols in the alphabet, m is the length of the message, and c is the number of
bits generated.
Because it yields the same cumulative frequencies as a Fenwick tree, and
is coupled with the same arithmetic coder as would be used with a Fenwick
tree, compression effectiveness is identical. But experimentally the splay tree
used in this way is markedly slower than a Fenwick tree [Moffat et al., 1994].
Compared to a Fenwick tree and a non-permuted modified Fenwick tree it also
uses more memory space - around 4n words for an alphabet of n symbols.
That is, despite the asymptotic superiority of the splay tree, its use in this way
is not recommended.

6.9 Structured arithmetic coding


In Chapter 2 we introduced the notion of conditioning, and suggested that a
first or higher order model should be used when the symbols in the message
are not independent. We shall return to this notion in Section 8.2. But there is
also another way in which symbols in a message might be related that is not
fully captured by direct conditioning.
Consider the output of an MTF transformation. If at some location in the
source message there are, for example, ten symbols active, then the MTF output
will be dominated by the symbols 1 to 10. If the message then segues into a
section in which there are 20 active symbols, the MTF output will also change
in nature.
The first warning of such a shift might be the coding of an MTF value of
(say) 13. We should thus regard such a symbol as an alarm, indicating that
perhaps symbols 11 and 12, and 14 and 15, and perhaps others too, should
be given a probability boost. That is, because the MTF symbols are drawn
from an alphabet - the positive integers - in which there is a definite notion of
"continuity" and "nearness", it makes sense for the occurrence of one symbol to
also be used as positive evidence as to the likelihood of other nearby symbols.
Figure 6.7 shows how this sharing effect is attained, using a mechanism known
as structured arithmetic coding [Fenwick, 1996b].

[Figure 6.7 (diagram): a selector distribution over the magnitudes 1 to 5 feeds
secondary distributions, with bucket[2] covering symbols 2-3, bucket[3] covering
4-7, bucket[4] covering 8-15, and bucket[5] covering 16-31.]

Figure 6.7: Structured arithmetic coding. The selector component involves a small
alphabet and a small half-life, and adapts rapidly. Within the secondary buckets the
half-life is greater, and adaptation is slower.

In a structured arithmetic coder each symbol is transmitted in two parts: a


magnitude, and an offset that identifies the symbol within a bucket of symbols
that all share the same magnitude. In Figure 6.7, a binary decomposition has
been assumed, but other decompositions can also be used, and there is nothing
magical about the arrangement illustrated.
To code symbol x, the selector probability distribution is used to transmit
the value 1 + ⌊log₂ x⌋, in an arrangement that has marked similarities with
the Elias Cγ code described in Section 3.2 on page 32. Then, once the binary
magnitude of the value x has been coded, the offset x − 2^⌊log₂ x⌋ is transmitted
using one of several bucket probability distributions. For example, suppose
x = 13 is to be coded. The value 1 + ⌊log₂ 13⌋ = 4 is coded using the selector
distribution, and symbol four within that distribution is given a probability
increment. Then the offset of 13 − 8 = 5 is coded using the bucket[4]
probability distribution, and the fifth symbol within that distribution given a
smaller probability increment. No secondary estimator is required for bucket[1]:
in this case, symbol 1 has been unambiguously indicated, and further refinement
is unnecessary. This sits particularly well with the way in which the MTF
transformation generates a large number of "1" outputs. For speed of coding, it
might even be appropriate for the selector component to be transmitted using
a binary arithmetic code and a unary equivalent of the binary-tree structure
shown in Figure 5.4 on page 121.
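The decomposition itself involves nothing more than extracting the binary
magnitude of x. A minimal C sketch of that step, with the arithmetic coding of
the selector and offset omitted, is shown below; all names are illustrative.

#include <stdio.h>

/* selector value: 1 + floor(log2 x), computed without floating point */
static int selector(unsigned int x) {
    int b = 0;
    while (x > 1) {
        x >>= 1;
        b++;
    }
    return 1 + b;
}

int main(void) {
    unsigned int x = 13;
    int sel = selector(x);                       /* bucket[4]   */
    unsigned int offset = x - (1u << (sel - 1)); /* 13 - 8 = 5  */
    printf("x = %u: selector %d, offset %u\n", x, sel, offset);
    printf("reassembled: %u\n", (1u << (sel - 1)) + offset);  /* 13 */
    return 0;
}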
The increment awarded to a symbol in the selector distribution boosts the
probabilities of all of the symbols in the corresponding bucket - in the case of
the example, symbols [8 ... 15]. The effect is further amplified by the small
number of distinct items in the selector distribution. Use of a modest half-life
allows the selector to rapidly adjust to gross changes in the number of active
symbols, and half lives of as little as 10 can be tolerated. On the other hand
a larger half-life within each bucket means that these distributions will adjust
more slowly, but will be more stable. The combination of fast-adapting selector
and slow-adapting refinement works well when processing the outcome of an
MTF transformation on inputs of the type shown in Figure 6.6 on page 171.

6.10 Pseudo-adaptive coding


All of the adaptive techniques discussed thus far have focussed on altering
the statistics and code after processing every symbol. While the algorithms
discussed are efficient in an asymptotic sense, requiring O(m + n + c) time
to generate c bits for a message of m symbols over an alphabet of size n, in
practice they are inevitably slower than their static counterparts. Per symbol
there is just more work to be done in an adaptive scheme.
To make adaptive schemes faster, the amount of computation per symbol
and per bit must be reduced. An obvious way to reduce the amount of compu-
tation is to maintain accurate adaptive statistics, but only update the code when
it becomes bad relative to the statistics: a pseudo-adaptive code. Using such
a scheme we can employ the fast encoding and decoding techniques described
in Chapter 4, choosing parameters or rebuilding the code whenever it strays
too far from the statistics. By choosing a suitable definition of "strays too far",
compression levels can be traded against coding speed in a controlled fashion.
First let us consider the total execution cost of a prefix coding scheme that
uses canonical coding (Algorithm 4.1 on page 60), and entirely reconstructs the
code whenever P, the list of adaptive symbol statistics, changes. In order to
use canonical coding, symbols must be in non-increasing order of probability.
This can be achieved in O(1) time per symbol coded using a bucket and leader
mechanism similar to that employed by adaptive Huffman coding. Symbols
are stored in non-increasing sorted order, and the leftmost symbol of each run
of symbols with the same probability (a bucket) is marked as the leader of that
bucket. When a symbol's frequency is increased it is swapped with the leader
of its current bucket, and leaders are updated appropriately. For a message
of m symbols, O(m) time is sufficient to keep the unnormalized probability
distribution ordered.
When the probability distribution P is estimated via self-probabilities, it
changes after every symbol is processed. Blind reuse of Algorithm 4.2 on
page 67 after every message symbol would thus require a total of O(mn) time.
The trick is not to be so blind. A change to P might not necessarily cause a
change in the derived minimum-redundancy code - the old code might still be
appropriate for the new P. We already saw this effect in Figure 6.3 on page 148,
in which the code remained unchanged after the first two increments to symbol
s_2. Longo and Galasso [1982] formalized this observation by proving that if

    E(C, P) − H(P) ≤ 0.247 × 2^(−|c_n|)                                (6.6)

holds, where |c_n| is the length of a longest codeword in C, then C is a Huffman
code for P. So while P changes during the pseudo-adaptive coding process,
as long as Equation 6.6 continues to hold there is no need to change the
underlying code C.
Still, in the worst case - for example, when a message consists entirely of
symbols that occur only once - the coding process requires m code rebuilds,
and consumes a total of O(mn) time. Fabris [1989] strengthens Equation 6.6,
but at the expense of computational simplicity. The repeated computation of
H(P) implied by Equation 6.6 is also a non-trivial cost.
Another possible strategy for speeding up this pseudo-adaptive coding pro-
cess is to use a faster code generation algorithm. A practical improvement in
running time might be realized if function calculate_runlength_code() (Algorithm
4.3 on page 72) is employed to generate the code, an O(r + r log(n/r))
process. This code generation algorithm requires the probabilities in the form
of r runlengths: P = [(p_i; f_i)], where probability p_i occurs f_i times, and
m = Σ_{i=1}^{r} p_i f_i. Maintaining this list during coding is easy if unnormalized
self-probabilities are used, in which case p_i is a count of a symbol's frequency
of occurrence. After a symbol of frequency p_i is coded, f_i is decremented and
f_j is incremented, where p_j = p_i + 1: an O(1) time operation. This is even
easier to deal with than the buckets-with-leaders scheme, as the run length group-
ings automatically record the buckets that were explicitly maintained with the
leaders.
Even using this code generation scheme, however, there is still a chance
that the number of distinct probabilities, r, will be close to the number of
source symbols, n, and that the running time might be O(mn). An alterna-
tive code generation algorithm is function calculate_twopower_code() (Algorithm
4.5 on page 79). When presented with a runlength formatted probability
list P = [(p_i; f_i)], in which each p_i is an integer power of T = 2^{1/k} for
some integer k > 0, this function constructs a minimum-redundancy code in
O(log_T m) = O(k log₂ m) time.
To make use of this algorithm, the probability distribution must be quan-
tized. The unnormalized symbol frequencies P = [(p_i; f_i)] (in runlength
form) generated by the estimator are modified to P′ = [(p′_i; f′_i)], where p′_i =
T^⌊log_T p_i⌋, and f′_i is the total number of symbols that map to the approximate
probability p′_i. For example, assuming that k = 1 and T = 2, the probability
distribution

P = [(10; 1), (7; 1), (5; 3), (3; 1), (2; 1), (1; 2)]
maps to
P′ = [(8; 1), (4; 4), (2; 2), (1; 2)].
This is exactly the distribution used as an example in Figure 4.7 on page 80.
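The quantization step can be sketched in a few lines of C. The code below rounds
each frequency of the example distribution down to a power of T = 2 (that is,
k = 1) and merges runs that now coincide; the array names are illustrative.

#include <stdio.h>

int main(void) {
    int p[] = {10, 7, 5, 3, 2, 1};     /* runlength form of P */
    int f[] = { 1, 1, 3, 1, 1, 2};
    int r = 6;

    int qp[8], qf[8], out = 0;
    for (int i = 0; i < r; i++) {
        int approx = 1;
        while (2 * approx <= p[i])     /* largest power of 2 <= p[i] */
            approx *= 2;
        if (out > 0 && qp[out - 1] == approx)
            qf[out - 1] += f[i];       /* merge runs that coincide */
        else {
            qp[out] = approx;
            qf[out] = f[i];
            out++;
        }
    }
    for (int i = 0; i < out; i++)
        printf("(%d; %d) ", qp[i], qf[i]);
    printf("\n");                      /* (8; 1) (4; 4) (2; 2) (1; 2) */
    return 0;
}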
If the approximate code is rebuilt after every symbol is processed, and k
is a fixed value, the running time will be O(m log m), perhaps fast enough to
be useful. But while the code generated from P′ is minimum-redundancy with
respect to P′, it is not guaranteed to be minimum-redundancy with respect to
the true probabilities, P. How much compression is lost?
Assume, for the moment, that a perfect code is being used, in which the
codeword length for symbol s_i is equal to the information content of that
symbol, I(s_i) = −log₂ p_i. Assuming that p_i is an unnormalized self-probability,
the compression loss when symbol s_i is coded is

    δ_i = (−log₂(p′_i/m′)) − (−log₂(p_i/m))
        ≤ log₂(p_i/p′_i),                                              (6.7)

where m′ = Σ_{i=1}^{n} p′_i, and, by construction, m′ ≤ m. By the definition of
p′_i, the ratio p_i/p′_i cannot exceed T, so δ_i < log₂ T = 1/k. If k = 1 is used,
the compression loss is at most one bit per symbol, plus the cost of using a
minimum-redundancy code instead of a perfect code (Section 4.9 on page 88).
If k = 2 is used, the compression loss is at most half a bit.
This first analysis examined a symbol in isolation, and assumed the worst
possible ratio between p_i and p′_i. As a balance, there must also be times when
p_i and p′_i are close together; and in an amortized sense, the compression loss
must be less than 1/k bits per symbol. For example, consider a message that
includes occurrences of some symbol x, and that k = 1. The first occurrence of x
is coded using a subsidiary model, after transmission of an escape symbol. The
next will be coded with true probability p_x = 1 and approximate probability
p′_x = 1. The third occurrence will be coded with p_x = 2 and p′_x = 2, the fourth
with p_x = 3 and p′_x = 2, the fifth with p_x = 4 and p′_x = 4, and so on. Suppose
in total that x appears 20 times in the message. If we sum Equation 6.7 for each
of these twenty occurrences, we get a tighter upper bound on the compression
loss of

    log₂(1/1) + log₂(2/2) + log₂(3/2) + log₂(4/4) + log₂(5/4) + ⋯ + log₂(20/16) = 7.08

bits, which averages 0.35 bits per symbol for the 20 symbols. That is, the
amortized per-symbol loss is considerably less than the 1 bit per symbol predicted
by the log₂ T worst-case bound.
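The sum is easily verified. The following C sketch accumulates the per-occurrence
loss log₂(v/2^⌊log₂ v⌋) over the twenty occurrences; the loop bound of 20 is the
example's, everything else is an illustrative addition.

#include <math.h>
#include <stdio.h>

int main(void) {
    double total = 0.0;
    for (int v = 1; v <= 20; v++) {
        int approx = 1;
        while (2 * approx <= v)
            approx *= 2;               /* 2^floor(log2 v) */
        total += log2((double)v / approx);
    }
    printf("total %.2f bits, %.2f bits per symbol\n", total, total / 20.0);
    return 0;                          /* 7.08 bits, 0.35 bits per symbol */
}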

    T          Worst case    Amortized case    Expected case
               (log₂ T)                        (Δ_{m+1})
    2             1             0.557             0.086
    2^{1/2}       1/2           0.264             0.022
    2^{1/3}       1/3           0.173             0.010
    2^{1/4}       1/4           0.129             0.005
    2^{1/5}       1/5           0.102             0.003

Table 6.4: Upper bounds on compression loss in bits per symbol when encoding a
message using ideal codewords based on a geometric frequency distribution with base
T, rather than self-probabilities. Taken from Turpin and Moffat [2001].

Turpin and Moffat [2001] showed that, over a message of m symbols, the
compression loss due to the frequency approximation is bounded above by

    (T log_e T − T + 1) / ((T − 1) log_e 2) + O((n + n log_T(m/n)) / m)      (6.8)

bits per symbol, where T = 2^{1/k} is again the base of the sequence of permitted
unnormalized probabilities.
The bound on compression loss can be further refined if it is assumed that
the self-probabilities gathered over the first m symbols accurately represent
the chance of occurrence of each symbol in the upcoming (m + 1)st position
of the message. If they do, s_i will occur with probability p_i/m, the true
self-probability, and will incur a loss of δ_i. Forming a weighted sum over the
whole source alphabet gives an expected loss for the (m + 1)st symbol of [Turpin
and Moffat, 2001]:

    Δ_{m+1} = Σ_{i=1}^{n} δ_i · p_i/m
            < log₂((T − 1) / (e log_e T)) + (log₂ T) / (T − 1).              (6.9)
Values of the three upper bounds on compression loss (Equations 6.7, 6.8,
and 6.9) for a range of values of T are given in Table 6.4. The loss incurred by
forcing P into this special geometric form is small, and can be controlled by
choosing a suitable value for k.
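The tabulated values can be reproduced from the closed forms. The C sketch below
evaluates the leading term of Equation 6.8 and the bound of Equation 6.9 for
k = 1...5; the O(·) term of Equation 6.8 is ignored, so the figures are the
asymptotic per-symbol bounds only.

#include <math.h>
#include <stdio.h>

int main(void) {
    const double e = exp(1.0);
    for (int k = 1; k <= 5; k++) {
        double T = pow(2.0, 1.0 / k);
        double amortized = (T * log(T) - T + 1.0) / ((T - 1.0) * log(2.0));
        double expected  = log2((T - 1.0) / (e * log(T))) + log2(T) / (T - 1.0);
        printf("k = %d: worst %.3f, amortized %.3f, expected %.3f\n",
               k, 1.0 / k, amortized, expected);
    }
    return 0;   /* the k = 1 row gives 1.000, 0.557, 0.086, as in Table 6.4 */
}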
There is a further benefit to be gained from the use of P′ rather than
P. Use of the approximation was motivated above by noting that function
calculate_twopower_code() executes in O(log m) time, so even if the probability
distribution changes after every symbol, the total compression time is
O(m log m). But because the code is based on P′, rather than the true
self-probabilities P, it only needs rebuilding when P′ changes - which only
happens when the approximate frequency p′_i for a symbol takes a quantum jump.
That is, unless the message is made up entirely of unique symbols, the number
of calls to function calculate_twopower_code() is considerably less than m.
Consider the sequence of code rebuilds triggered by symbol x. When it
first occurs, the code is reconstructed to incorporate the new symbol. Upon
its next occurrence, the true probability p_x of symbol x rises from 1 to 2, and
causes an increase in p′_x from T^0 to at least T^1 and a second code
construction attributable to x. However, the next occurrence of x may not lead
to an increase in p′_x. In general, if a symbol x occurs p_x times in the whole
message it can only have triggered 1 + ⌊log_T p_x⌋ code rebuilds. Each of these
rebuilds requires O(log_T m) time. Turpin and Moffat showed that, when summed
over all n symbols in the alphabet, the cost of calculating the minimum-redundancy
codes is

    O(log_T m) · Σ_{i=1}^{n} (1 + ⌊log_T p_i⌋) = O(k²(m + c)),

where, as before, c is the number of bits generated, m is the number of symbols
in the input message, and k is the parameter controlling the approximation.
That is, forcing self-probabilities into a geometric distribution with base 2^{1/k}
for some positive integer constant k means that the time taken to completely
rebuild the code each time P′ changes is no more than the time taken to process
the inputs and outputs of the compression system. The whole process is on-
line. This is better than our initial goal of O(m log m) time, and equals the
time bound of adaptive Huffman coding and adaptive arithmetic coding.
But wait, there's more! Now that code generation is divorced from de-
pendence on a Huffman tree, the canonical coding technique from Section 4.3
on page 57 can be used for the actual codeword manipulation. Algorithm 6.9
shows the initialization of the required data structures, and the two functions to
encode and decode a symbol using canonical coding based on a geometric ap-
proximation of the true self-probability. Both of the coding functions make use
of two auxiliary functions twopower_add() and twopower_increment() which
alter the data structures to reflect a unit increase in the true probability of the
symbol just coded. The data structures used are:

• weight[x], the self-probability of symbol x;


• S, a list of the n symbols in the current alphabet, sorted in non-increasing
order of ⌊log_T(weight[x])⌋, which determines the approximate probability for
symbol x;

• index[x], the position of symbol x in S;

• leader[i], the smallest index in S of the symbols with approximate
probability ⌊T^i⌋; and

• f[i], the number of symbols with approximate probability ⌊T^i⌋.

Algorithm 6.9
Use a canonical code based on a geometric approximation to the
self-probabilities of the m symbols processed so far, to code symbol x,
where 1 ≤ x, updating the code if necessary. The geometric approximation
has base T = 2^{1/k} for a fixed integer k ≥ 1.
twopower_encode(x)
1: if index[x] = "not yet used" then
2:     canonical_encode(index[0]), the codeword for the escape symbol
3:     encode x using some agreed auxiliary mechanism
4:     twopower_increment(0)
5:     twopower_add(x)
6: else
7:     canonical_encode(index[x])
8:     twopower_increment(x)

Return a value assuming a canonical code based on a geometric
approximation of the self-probabilities of the m symbols so far decoded.
The geometric approximation has base T = 2^{1/k} for a fixed integer k ≥ 1.
Once the symbol is decoded, update the appropriate data structures.
twopower_decode()
1: set x ← S[canonical_decode()]
2: twopower_increment(x)
3: if x = 0, the escape symbol, then
4:     decode the new symbol x using the agreed auxiliary mechanism
5:     twopower_add(x)
6: return x

Initialize data structures for a two-symbol alphabet containing the escape
symbol and the first symbol, "1", in the message. Each has a weight of one.
twopower_initialize()
1: set n ← 2 and m ← 2
2: set S[1] ← 0 and S[2] ← 1
3: set index[0] ← 1, index[1] ← 2, and index[i] ← "not yet used" for i > 1
4: set leader[0] ← 1
5: set weight[0] ← 1 and weight[1] ← 1
6: set f[0] ← 2

Algorithm 6.10
Add symbol x, where 1 < x, as the nth symbol into S, with initial weight
one, then recalculate the code.
twopower_add(x)
1: set S[n + 1] ← x and index[x] ← n + 1
2: set n ← n + 1 and m ← m + 1
3: set weight[x] ← 1, and f[0] ← f[0] + 1
4: if f[0] = 1, meaning this is the only symbol in the first bucket, then
5:     set leader[0] ← index[x]
6: calculate_twopower_code(), using P = [(T^i; f[i])] and r = ⌊log_T m⌋

Increment the weight of symbol x, where 0 ≤ x ≤ n, updating f, the sorted
list of symbols S, and the associated data structures index and leader.
Recalculate the code if the approximate probability distribution changes.
twopower_increment(x)
1: set oldb ← ⌊log_T(weight[x])⌋ and newb ← ⌊log_T(weight[x] + 1)⌋
2: set weight[x] ← weight[x] + 1 and m ← m + 1
3: if newb > oldb then
4:     set y ← leader[oldb], the position of the leader of x's old bucket
5:     swap S[index[x]] and S[y]
6:     set index[S[index[x]]] ← index[x], and then set index[x] ← y
7:     set leader[oldb] ← leader[oldb] + 1
8:     set f[oldb] ← f[oldb] − 1 and f[newb] ← f[newb] + 1
9:     if f[newb] = 1 then
10:        set leader[newb] ← y
11: if x ≠ 0, meaning the increment is not for the escape symbol, then
12:     calculate_twopower_code() using P = [(T^i; f[i])] and r = ⌊log_T m⌋



The data structures are initialized to contain the escape symbol (symbol
"0") and the first symbol of the message (presumed to be "1"), both with
unnormalized probabilities of one. New symbols get added to position n + 1 of the
array S, with a frequency count of one, as shown in twopower_add(). When a
symbol's probability is incremented in twopower_increment(), a check is made
to see if the approximate probability changes: whether it has moved to a new
bucket. If it has, the appropriate swapping in S and index takes place, and the
leaders are updated. Once all incrementing housekeeping has been performed,
the code is recalculated. Note that the code is not recalculated when
incrementing the probability of the escape symbol, as a recomputation will shortly
be required in function twopower_add() anyway.
When implementing twopower_increment() the number of calls to log() in
the first step can be reduced by storing a trigger frequency for each symbol: the
next weight at which the symbol's approximate probability changes. This value
only needs to be calculated when the symbol moves up a bucket, and triggers a
code recalculation.
Table 6.4 demonstrated that the upper bound on compression loss from
approximating symbol probabilities is small. The asymptotic worst case run-
ning time of twopower_encode() and twopower_decode() is equal to that of
other adaptive Huffman coding and adaptive arithmetic coding schemes, at
O(n + m + c) = O(n + c). In practice, however, the twopower coding method
is faster than both, as it is able to build on the fast canonical coding algorithms
of Chapter 4. Section 6.12 on page 190 gives experimental results comparing
the compression and throughput performance of the adaptive coders examined
in this chapter, and the pseudo-adaptive coder is included.
Liddell and Moffat [2001] have extended the pseudo-adaptive technique
to length-limited codes (described in Section 7.1 on page 194). The codes
produced admit errors in two different ways: because of the rounding of the
source statistics to powers of T; and because of the use of an approximate
length-limited code. But in practice the compression loss is small, and high
throughput rates are achieved.

6.11 The Q-coder


The adaptive mechanisms we have examined so far in this chapter have primar-
ily been for situations in which the source alphabet has multiple symbols. But
binary alphabets are also important, especially for applications such as bi-level
image compression.
The binary arithmetic coding routines presented in Section 5.5 on page 118
are readily modified to deal with adaptive models. All that is required is that
the counts c_0 and c_1 of the number of zero bits and one bits seen previously in
this context be maintained as a pair of scalars; once this is done, the adaptation
follows directly. The table-driven binary arithmetic coder of Section 5.7 can
also be used in an adaptive setting. Indeed, the table-driven coder provides a
hint as to how an even more specialized implementation might function, based
upon a small number of discrete states, and migration between those states. It
is such a coder - and its intrinsically coupled probability estimation regime -
that is the subject of this section.
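As a concrete illustration of the first of those options, the fragment below maintains the two counts for one context and derives a probability estimate from them. The add-one smoothing used here is just one plausible choice of estimator, offered as a sketch rather than a prescription.

class BinaryContext:
    """Adaptive zero-order model for one context of a binary alphabet."""

    def __init__(self):
        self.c0 = 0          # zero bits seen so far in this context
        self.c1 = 0          # one bits seen so far in this context

    def probability_of_one(self):
        # add-one smoothing, so that neither symbol ever has probability zero
        return (self.c1 + 1) / (self.c0 + self.c1 + 2)

    def update(self, bit):
        # called after each bit is coded, by encoder and decoder alike,
        # so that both maintain identical estimates
        if bit == 1:
            self.c1 += 1
        else:
            self.c0 += 1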
The Q-coder had its origins in two IBM research laboratories [Mitchell and
Pennebaker, 1988, Pennebaker et al., 1988], and has continued to be developed
there and elsewhere [Slattery and Mitchell, 1998]. The basic idea is exactly
as we have already described for multi-alphabet arithmetic coding: a lower
extreme L for the coding range (called the C register in much of the relevant
literature) and a width R (the A register) are adjusted each time a bit is coded,
with the bit being either the MPS (more probable symbol) for this context,
or the LPS (less probable symbol). But rather than perform a calculation to
determine the splitting point as a fraction of A, a fixed quantity Qe is subtracted
from A if an MPS is coded, and if an LPS is coded, A is set to Qe. That is, A is
assumed to be 1, making any scaling multiplication irrelevant; and Qe can be
thought of as being an estimate of the probability of the LPS, which is always
less than 0.5. To minimize the rounding error inherent in the assumption that A
is one, the normalization regime is designed so that logically 0.75 ≤ A < 1.5.
The value of Qe depends upon the estimated probability of the LPS, and is one
of a finite number of predefined values. In the original Q-coder, A and C are
manipulated as 13-bit quantities, and the Q values are all 12-bit values; the
later QM-coder added three more bits of precision.
When A drops below the equivalent of the logical value 0.75, renormaliza-
tion is required. This always happens when an LPS is coded, as Qe < 0.5; and
will sometimes happen when an MPS is coded. The renormalization process is
the same as before: the most significant bit of C is passed to the output buffer-
ing process, and dropped from C; and then both C and A are doubled. The
fact that A is normalized within a different range to that considered in Chap-
ter 5 is immaterial, as it is the doubling of A that corresponds to a bit, not any
particular value of A. Carry bits must still be handled, as C, after a range-
narrowing step, might become greater than one. In the Q-coder a bit-stuffing
regime is used; the later QM-coder manages carries via a mechanism similar to
the byte-counting routines shown in Algorithm 5.8 on page 116.

 e    Qe (hex.)   Qe (dec.)   Renorm. LPS   Renorm. MPS   Exch. LPS
 0    AC1         0.5041       0            +1            1
 1    A81         0.4924      −1            +1            0
 2    A01         0.4690      −1            +1            0
 3    901         0.4221      −1            +1            0
 ...
 10   381         0.1643      −2            +1            0
 ...
 20   059         0.0163      −2            +1            0
 ...
 28   003         0.0006      −3            +1            0
 29   001         0.0002      −2             0            0

Table 6.5: Partial table of Q-coder transitions.

The only other slight twist is that when A < 1.0 we might be in the position
of having Qe > A - Qe, that is, of estimating the LPS probability to be greater
than the MPS probability, despite the fact that Qe < 0.5 is the current estimate
of the LPS probability. If this situation arises, the MPS and LPS are temporarily
switched. The decoder can make the same adjustment, and no extra information
need be transmitted.
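The toy fragment below illustrates the interval arithmetic just described, with A held as an ordinary floating-point value in the logical range [0.75, 1.5); the C register, carry handling, bit output, and the 13-bit fixed-point scaling of the real coder are all omitted, so this is an illustration of the idea rather than the Q-coder itself.

def narrow(A, is_mps, Qe):
    """Narrow the interval width A for one coded bit, then renormalize.
    Returns the new width and the number of doublings performed, which is
    the number of code bits the full coder would emit at this step."""
    if A - Qe < Qe:          # conditional exchange: the nominal LPS share is
        is_mps = not is_mps  # the larger one, so temporarily swap the roles
    A = A - Qe if is_mps else Qe
    doublings = 0
    while A < 0.75:          # renormalize back into the range [0.75, 1.5)
        A *= 2.0
        doublings += 1
    return A, doublings

# coding an MPS with the state-3 estimate Qe = 0.4221, starting from A = 1.0,
# leaves a width of 0.5779, which one doubling brings back to 1.1558
print(narrow(1.0, True, 0.4221))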
Because the MPS is the more probable symbol, it saves time if the proba-
bility estimates are adjusted only when renormalization takes place, rather than
after every input bit is processed. That is, reassessment of the probabilities is
carried out after every LPS, and after any MPS that triggers the output of a
bit. It is this re-estimation process that makes the Q-coder particularly inno-
vative. Rather than accumulate counters in each context, a single index e is
maintained. The value of Qe is stored in a fixed table, as are a number of other
pre-computed values. Table 6.5 shows some of the rows in the original 12-bit
Q-coder table.
Before any bits are processed, e, the index into the table that represents
a quantized probability estimate, is initialized to O. This assignment means
that both MPS and LPS are considered to be approximately equally likely, a
sensible starting point. Then, for each incoming bit, the corresponding Qe
value is taken from the table, and used to modify A and C, as described in
the previous paragraphs. The second and third columns of the table show the
12-bit Qe values in hexadecimal and in decimal. Because the maximum value
of A is 1.5, the scaling regime used for A maps 1.5 to the maximum that can be
stored in a 13-bit integer, namely 8,191, or "1FFF" in hexadecimal. The value
1.0 then corresponds to the integer 5,461, or "1555" in hexadecimal. This is

the initial value for A.


The Qe values are similarly scaled. Hence, in the first row of Table 6.5,
the hexadecimal value "AC1", which is 2,753, corresponds to 2,753/5,461 =
0.5041 in decimal, as shown in the third column of the table. The fact that
A is scaled differently than in the previous coders we have described is also
irrelevant when it comes to the bitstream produced - so long as a bit is output
each time A is doubled, the code will be accurate.
After the range-narrowing process, driven by the Qe entries in the table,
renormalization may be required, indicated by A < 0.75 = "1000" in scaled
hexadecimal. If the bit processed is an MPS and no renormalization is required,
the index e is left unchanged, and the next input bit is processed. But if a renor-
malization of A and C takes place - because the input symbol was an LPS, or
because the input symbol was an MPS and a renormalization was triggered by
an accumulation of MPS events - the value of e is modified, by adding or sub-
tracting the offset stored in the two columns headed "Renorm." For example, if
e is 2, and an MPS renormalization takes place, e is adjusted by +1 (from the
column "Renorm. MPS") to 3, and as a consequence, the probability estimate
Qe of the LPS is decreased from 0.4690 to 0.4221. Similarly, if e is 2 and an
LPS renormalization takes place, e is decreased to 1 and the LPS probability is
increased to 0.4924. When the index e is 29 the LPS probability is at the lowest
possible 12-bit value, and the "Renorm. MPS" increment is zero. Similarly, at
the other end of the table, when e is zero and an LPS is coded, no change in e
takes place. But the "1" entry in the column "Exch. LPS" in this row denotes
another special situation - in this case the MPS and LPS are swapped as part
of an LPS renormalization, which is the logical equivalent of making e nega-
tive. Starting in state 29, with the lowest Qe, a sequence of 18 consecutive LPS
symbols causes the probability estimator to conclude that the MPS and LPS
need to be swapped.
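A sketch of this estimation machinery appears below, using only the rows of Table 6.5 that are reproduced in the table; the full coder has thirty states, so transitions that land between the listed rows are not covered by this fragment, and the function name is an invention of the sketch.

# each entry: (Qe, change to e on LPS renorm., change on MPS renorm., exchange flag)
QROWS = {
    0:  (0.5041,  0, +1, 1),
    1:  (0.4924, -1, +1, 0),
    2:  (0.4690, -1, +1, 0),
    3:  (0.4221, -1, +1, 0),
    10: (0.1643, -2, +1, 0),
    20: (0.0163, -2, +1, 0),
    28: (0.0006, -3, +1, 0),
    29: (0.0002, -2,  0, 0),
}

def update_estimate(e, mps, was_mps, renormalized):
    """Move the state index e, and possibly flip the MPS sense, after a bit
    has been coded; as in the Q-coder, nothing changes unless the bit caused
    a renormalization."""
    if not renormalized:
        return e, mps
    Qe, d_lps, d_mps, exchange = QROWS[e]
    if was_mps:
        return e + d_mps, mps
    if exchange:                  # an LPS renormalization in state 0 swaps the
        mps = 1 - mps             # roles of MPS and LPS instead of moving e
    return e + d_lps, mps

# an MPS renormalization in state 2 moves to state 3 (Qe drops to 0.4221);
# an LPS renormalization in state 2 moves back to state 1 (Qe rises to 0.4924)
print(update_estimate(2, 0, True, True), update_estimate(2, 0, False, True))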
The corresponding QM-coder table is larger, and includes states that are
intended for transitional use only - they are on a path from the initial state, but
thereafter will never get used.
The table-driven probability estimation mechanism means that throughput
can be extremely high. When an MPS is coded, just a table lookup, a sub-
traction, and a test are required; plus, periodically, a renormalization, and a
table-based adjustment to e. When an LPS is coded the renormalization al-
ways takes place. But there are no divisions to estimate probabilities, and no
multiplications to establish coding ranges.
What is lost is coding precision. Use of 12-bit probability approximations,
and use of just 30 different estimated LPS probabilities, means that in a math-
ematical sense there is quantifiable compression redundancy compared to an
ideal coder. But the fact that the probability estimation process is quick to ad-

just to localized variations in the bit probabilities more than compensates on


most inputs, and in practice the Q-coder is both fast and effective - so much so
that in a range of image compression standards it is employed to compress non-
binary sources as well as binary sources, using the tree-structured approach de-
scribed in Section 5.5 (see the example in Figure 5.4 on page 121). Pennebaker
and Mitchell [1993] provide more detail of the Q-coder and QM-coder, and of
the image compression applications in which they have been used with such
considerable success.

6.12 Making a choice


This chapter has covered the main adaptive coding techniques avail-
able for today's compression system designers. But which is the best? Un-
surprisingly, the answer depends somewhat on your priorities: speed, memory
consumption, or compression effectiveness. It also depends upon the type of
message being processed.
Figure 6.8, derived from experiments carried out by Turpin and Moffat
[1999], shows the performance of the main adaptive coders discussed in this
chapter. Other experiments are reported by Moffat et al. [1994]. The graphs
show the performance of arithmetic coding, adaptive Huffman coding, and
pseudo-adaptive coding based on function twopower_encodeO, when applied
to the two Wall Street Journal probability distributions described in Table 4.5
on page 71. When some buffering of the message is permissible during en-
coding, the semi-static minimum-redundancy coding technique described in
Section 4.8 on page 81 can be used, and is also included in Figure 6.8. We also
experimented with the frequency coder of Ryabko [1992], but compression
levels were at least one bit per symbol worse than the other methods described
in this chapter, and throughput was comparable to that of adaptive Huffman
coding.
One of the most striking facts from these experiments is that adaptive
Huffman coding ("H") is slower than arithmetic coding using a Fenwick tree
("A") on WSJ . Words. This further confirms the earlier results (Figure 4.4 on
page 65) that a bit-by-bit traversal of a Huffman tree is slow, even for static
coding; and that the algorithm for maintaining the Huffman tree has non-trivial
additional overhead. For the lower entropy WSJ.NonWords file, adaptive Huff-
man coding executes more quickly than arithmetic coding, but gives worse
compression. In general arithmetic coding should be preferred over adaptive
Huffman coding.
The points marked "S" represent an adaptive implementation of the approx-
imate arithmetic coding mechanism of Stuiver and Moffat [1998] (Section 5.6
on page 122). It is always faster than the standard arithmetic coder; and yields

[Figure 6.8, four panels - Encoding words, Decoding words, Encoding non-words, and Decoding non-words - each plotting bits per symbol against speed in millions of symbols per minute for the labelled coders.]

Figure 6.8: Compression throughput and compression effectiveness for a range of
coding mechanisms. "H" indicates adaptive Huffman coding, "A" indicates adaptive
arithmetic coding, "S" indicates the approximate arithmetic coding method of Stuiver
and Moffat [1998] (Section 5.6 on page 122), "Gx" indicates pseudo-adaptive coding
with a geometric approximation base T = 2^{1/x}, and "Bx" indicates semi-static canon-
ical coding with a block size of 10^x symbols, with "Bm" indicating that the entire input
is processed as a single block.

improved compression on file WSJ.NonWords, an indication that, by luck, its


"approximate" probability estimates are actually more accurate than the "pre-
cise" estimates maintained by the standard coder.
The pseudo-adaptive twopower coder appears in Figure 6.8 as the points
"Gk" ("G" for geometric), where k is the integer root of two used to approx-
imate the actual symbol probabilities during encoding (T = 2^{1/k}). As pre-
dicted by the compression loss analysis in Section 6.10, increasing k decreases
compression loss; but only by a small amount. On WSJ.Words, k also has a
large effect on running time, but not for the lower entropy WSJ.NonWords.
Using k = 1, the twopower coder is both reasonably fast and acceptably ef-
fective. With this choice twopower offers at least double the throughput of
both adaptive Huffman and arithmetic coding, while providing compression
effectiveness similar to that of adaptive Huffman coding. While the complex
development of twopower may have seemed to the reader to be absurdly arti-
ficial, Figure 6.8 shows quite clearly why we persisted: it is the fastest of the
on-line adaptive coding techniques.
The final collection of points in the graphs compares the adaptive coders
with the semi-static minimum-redundancy coder of Section 4.8 on page 81.
Use of a semi-static probability estimation technique and a static minimum-
redundancy code provides good compression - especially when the blocks are
moderately large - and high throughput rates. The drawback of this technique
is that it requires the message to be buffered in blocks during encoding, as two
passes are made over each block. It must operate off-line, although decoding
requires no symbol buffering. Points labelled "Bx" represent the performance
of this semi-static minimum-redundancy coder when the message is broken up
into blocks of 10^x symbols; and the point "Bm" indicates use of a single block
containing all m message symbols. Treating the entire message as a single
block is fast, but does not necessarily give the best compression, as localized
variations are ignored.
The upshot of these experiments is that, if buffering of symbols can be tol-
erated during encoding, semi-static minimum-redundancy coding is the method
of choice. Moderate block sizes should be used if there is localized variation
in the message; or if the probability distribution within the message is stable,
then blocks should be as large as possible. If one-pass coding is required then
twopower is the fastest method, but also moderately complex to implement,
and adaptive arithmetic coding might be preferable because of its simplicity.
Finally, if compression levels are the sole priority, then adaptive arithmetic
coding should be used, especially when the probability distribution is skew.
Chapter 7

Additional Constraints

The coding algorithms presented so far have focussed on minimizing the ex-
pected length of a code, and on fast coding speed once a code has been devised.
Furthermore, all codes have used a binary channel alphabet. This chapter ex-
amines other coding problems.
First, we examine code generation when a limit is imposed on the length
of the codewords. Applying a limit is of practical use in data compression sys-
tems where fast decoding is essential. When all codewords fit within a single
word of memory (usually 32 bits, sometimes 64 bits), canonical decoding (Al-
gorithm 4.1 on page 60) can be used. If the limit cannot be guaranteed, slower
decoding methods become necessary. Section 7.1 examines the problem of
length-limited coding.
The second problem, discussed in Section 7.2, is that of generating alpha-
betic codes, where a lexicographic ordering of the symbols by their codewords
must match the original order in which the symbols were presented to the cod-
ing system. When an alphabetic code is used to compress records in a database,
the compressed database can be sorted into the same order as would be gener-
ated if the records were decompressed and then sorted. Alphabetic code trees
also correspond to optimal binary search trees, which have application in a va-
riety of searching problems. The assumption that the symbols are sorted by
probability is no longer appropriate in this scenario.
The third area we examine in detail (Section 7.3) is the problem of find-
ing codes for non-binary channel alphabets. Unequal letter-cost coding is the
task of determining a code when the symbols in the channel alphabet can no
longer be presumed to be of unit cost. For example, in an effort to minimize
power consumption in a new communications device based upon some novel
technology, we may seek to calculate a code taking into account (say) that a
zero bit takes 10% more energy to transmit than does a one bit. In such a case,
the code should be biased in favor of one bits - but must still also contain zero

bits. Similarly, the engineers might also ask us to provide a minimum-cost


code over a multi-symbol channel alphabet - three or more different channel
symbols perhaps all of differing cost, where cost might be measured by energy
consumption, or by the length of time taken to actually convey the symbols.

7.1 Length-limited coding


Because of the ubiquitous use of minimum-redundancy codes in practical data
compression software, an important variation on the standard prefix coding
problem is that of devising a code when an upper limit is placed on the length
of codewords. Such length-limited codes guarantee the integrity of applica-
tions employing canonical coding, as they allow function canonical_decode()
in Algorithm 4.1 on page 60 to assume that each codeword fits into a single
word, even in the face of pathological probability distributions such as those
derived from the Fibonacci sequence (Section 4.9 on page 88). Another use
for length-limited codes follows from the observation that the underlying code
tree has bounded depth, because the path length to the deepest leaf is limited.
Applications that need to search a weighted tree, but only for a limited number
of steps, can employ the algorithms presented here to build their search trees.
Given a list P of sorted probabilities p_1 ≥ p_2 ≥ · · · ≥ p_n, the length-
limited coding problem is to generate a prefix code C such that |c_i| ≤ L, for
all 1 ≤ i ≤ n and some fixed integer L ≥ ⌈log₂ n⌉, and such that E(C, P) ≤
E(C', P) over all n-symbol prefix codes C' in which no codeword is longer
than L bits.
A first approach to the problem was given by Hu and Tan [1972]. Their
algorithm requires O(n² 2^L) time and space, and was improved upon relatively
soon after by Van Voorhis [1974], who employed dynamic programming to
construct a depth-limited tree in O(Ln²) time and space.
Two groups independently found improved solutions in the late 1980s. An
approximate algorithm, where the code is not guaranteed to have the small-
est possible expected codeword length but is constrained to the length limit,
was devised by Fraenkel and Klein in 1989. Their mechanism is fast, run-
ning in O(n) time, and is conceptually simple. But by the time that work was
published [Fraenkel and Klein, 1993], a faster algorithm for determining an
optimal solution had also appeared - the package merge mechanism of Lar-
more and Hirschberg [1990], which requires O(nL) time and space. In this
section we describe the reverse package merge mechanism [Turpin and Mof-
fat, 1995], which is derived from the package merge approach, but operates in
O(n(L − log₂ n + 1)) time and space - faster when the code is tightly con-
strained and L ≈ log₂ n. Several other variations to the fundamental package

  |C|                   Total cost
  [4,4,4,4,4,4,4,4]     124
  [3,3,3,3,3,3,3,3]      93
  [2,3,3,3,3,3,4,4]      85
  [2,2,3,3,4,4,4,4]      79
  [1,3,4,4,4,4,4,4]      86

Table 7.1: An incomplete code with K(C) < 1, and four possible complete codes that
have K(C) = 1, when a length limit of L = 4 is imposed and the underlying source
probabilities are P = [10,8,6,3,1,1,1,1]. The total cost is given by Σ_{i=1}^{n} p_i · |c_i|.

merge technique are possible, and are canvassed after we present the underly-
ing algorithm.
Like many code generation algorithms, reverse package merge builds on
the greedy design paradigm. In Huffman's algorithm (Section 4.2 on page 53)
codeword lengths increase from zero, while the Kraft sum K(C) decreases
from its initial value of n down to one, with, at each step, the least cost ad-
justment chosen from a set of possible changes. In reverse package merge all
codeword lengths are initially set to L bits, and the initial value of the Kraft
sum, K(C) = n × 2^{−L} ≤ 1, is increased with each greedy choice.
This initial position may well be a length-limited code. If L = log₂ n, and
n is an exact power of two, then

    K(C) = n × 2^{−log₂ n} = n/n = 1

cannot be improved. For example, if P = [10,8,6,3,1,1,1,1], and the code-
words are constrained to be at most L = 3 bits long, then setting |c_i| = L = 3 is
the best that can be done. The total cost of the resultant code is 93 bits (3.0 bits
per symbol), compared with a total cost of 77 bits for a minimum-redundancy
code (|C| = [2,2,2,3,5,5,5,5]). But this is an unusual case - the code is
already complete. More generally, given n probabilities, with n not a power of
two, and a Kraft sum of K(C) = n × 2^{−L} < 1, the code |C| = [L, . . . , L] is not
complete, and some symbols should have their codewords shortened.
Suppose that the same distribution P = [10,8,6,3,1,1,1,1] is to be coded
with L = 4 instead of L = 3. Initially all codeword lengths are set to 4,
making K(C) = 0.5, and creating a spare 0.5 in the Kraft sum that must be
spent somehow. For example, all codewords could be shortened to length 3,
achieving an 8 × (2^{−3} − 2^{−4}) = 0.5 increase. Alternatively, five codewords
could be shortened by one bit, and one codeword shortened by two, also giving
rise to a 5 × (2^{−3} − 2^{−4}) + (2^{−2} − 2^{−4}) = 0.5 increase. These, plus two
other options, are shown in Table 7.1, along with the total bit cost of each. The
observation that the largest decreases in length should be assigned to the most
probable symbols means only these four codes need be considered.
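The Kraft sums and total costs quoted in Table 7.1 are easily checked; the short fragment below does so for the five candidate sets of codeword lengths, using the source weights given above.

P = [10, 8, 6, 3, 1, 1, 1, 1]
candidates = [[4] * 8,
              [3] * 8,
              [2, 3, 3, 3, 3, 3, 4, 4],
              [2, 2, 3, 3, 4, 4, 4, 4],
              [1, 3, 4, 4, 4, 4, 4, 4]]
for lengths in candidates:
    kraft = sum(2.0 ** -l for l in lengths)          # K(C)
    cost = sum(p * l for p, l in zip(P, lengths))    # sum of p_i * |c_i|
    print(lengths, kraft, cost)
# the all-4s code has K(C) = 0.5; the other four are complete, with total
# costs 93, 85, 79, and 86 bits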
Reverse package merge constructs L lists, with the jth list containing sets
of codeword decrements that increase K(C) by 2^{−j}. Within each list, items
are ordered by their impact on the total cost. In the example, initially |c_i| = 4
for all i, and K(C) = 0.5. A 2^{−1} increase in K(C) is required, so lists are
formed iteratively until the j = 1 list is available.

The process starts with list L. The only way to obtain a 2^{−4} increase in
K(C) is to decrease a codeword length from 4 to 3. There are eight possible
codewords to choose from, corresponding to symbols one through eight, and a
unit decrease in the length of the codeword c_i reduces the total code cost by p_i.
The first list generated by reverse package merge is thus

    2^{−4}:  10_1  8_2  6_3  3_4  1_5  1_6  1_7  1_8,

where the subscript denotes the corresponding symbol number, and the value
is the reduction in the total code cost that results if that codeword is shortened.
This is a list of all possible ways we can increase K(C) by 2^{−4}, ordered by
decreasing impact upon total cost.
Now consider how a 2^{−3} increase in K(C) could be obtained. Either two
codewords can be shortened from length 4 to length 3, a 2 × (2^{−3} − 2^{−4}) = 2^{−3}
change in K(C); or individual codewords that have already been reduced to
length 3 could be reduced to 2 bits, a change of 2^{−2} − 2^{−3} = 2^{−3}. The impact
on the total cost of choosing a pair of 2^{−4} items can be found by adding two
costs from list j = 4. In this example, the biggest reduction is gained by short-
ening 10_1 and 8_2 by one bit each, giving a cost saving of 18, where the absence
of a subscript indicates that the element is a package formed from two elements
in the previous list. The next largest reduction can be gained by packaging 6_3
and 3_4 to get 9. Note that 8_2 and 6_3 are not considered for packaging, as 8_2 was
already combined with the larger value 10_1. Continuing in this manner creates
four packages, each of which corresponds to a 2^{−3} increase in K(C):

    [18]  [9]  [2]  [2].

A choice of any of these packages is equivalent to choosing two elements from
list 2^{−4} and shortening their codewords from 4 bits to 3 bits.
The list of costs of the second type of change, in which codewords are
shortened from 3 bits to 2, is exactly the original probability list again. In order
to get a complete list of all ways in which we can increase K(C) by 2^{−3}, these
two lists should be merged:

    2^{−3}:  [18]  10_1  [9]  8_2  6_3  3_4  [2]  [2]  1_5  1_6  1_7  1_8.

In this and the next set of lists, entries shown in square brackets denote packages
created by combining pairs of objects from the previous list.
The same "package, then merge" routine is done twice more to get a full
set of L = 4 lists:

2- 4 : 34 15 16 17 18
2- 3 : 82 63 34 ~ ~ 15 16 17 18
2- 2 : W 82 63 [i] 34 ~ ~ 15 16 17 18
2- 1 : 101 8 2 [1J 63 [i] 34 ~ ~ 15 16 17 18 .

Each entry in the last 2- 1 list represents a basket of codeword length adjust-
ments that have a combined impact of 0.5 on K(C). For example, the first
package, of weight 45, represents two packages at the 2- 2 level; they in turn
represent two leaves at the 2- 3 level and two packages at that level; and, finally,
those two packages represent four leaves at the 2- 4 level.
Once the lists are constructed, achieving some desired amount of increase
to K (C) is simply a matter of selecting the necessary packages off the front of
some subset of these lists, and shortening the corresponding codewords. In the
example, a 0.5 increase in K (C) is desired. To obtain that increase, the package
in list 2- 1 is expanded. As was noted in the previous paragraph, this package
was constructed by hypothesizing four codewords being shortened from 4 bits
to 3 bits, and two codewords being shortened from 3 bits to 2 bits. The set
of lengths for the length-limited code is thus ICI = [2,2,3,3,4,4,4,4]. The
exhaustive listing of sensible combinations in Table 7.1 confirms that this code
is indeed the best.
It may appear contradictory that the first two symbols have their codeword
lengths decreased from 3 bits to 2 bits before they have their lengths decreased
from 4 bits to 3. But there is no danger of the latter not occurring, as a package
containing original symbols from list 2- j always has a greater weight than the
symbols themselves in list 2- j +1 , so the package will be selected before the
original symbols. For example, the package 18 in list 2- 3 contains the original
symbols 10 1 and 82 from list 2- 4 , and 18 appears before both in list 2- 3 .
Not so easily dismissed is another problem: what if an element at the front
of some list is selected as part of the plan to increase K(C), but appears as
a component of a package in a subsequent list that is also required as part of
making K(C) = 1? In the example, only a single package was needed to bring
K(C) to 1.0; but in general, multiple packages are required. For example,
consider generating an L = 4 limited code for the slightly smaller alphabet
P = [10,8,6,3,1,1,1]. When n = 7 and 1 − K(C) = 1 − 7 × 2^{−4} =
0.5625, packages are required from the 2^{−1} list (0.5), and the 2^{−4} list (the
other 0.0625). But the first element in the 2^{−1} list contains the first element in
the 2^{−4} list, and the codeword for symbol s_1 can hardly be shortened from 4
bits to 3 bits twice.
bits to 3 bits twice.
To avoid this conflict, any elements to be removed from a list as part of the
K (C) increment must be taken before that list is packaged and merged into the
next list. In the n = 7 example, the first element of 2- 4 must be consumed
before the list 2- 3 is constructed, and excluded from further packaging. The
table of lists for P = [10,8,6,3,1,1,1] is thus
2- 4 : 15 16 17
2- 3 : 4 34 2 15 16 17
2- 2 : 82 63 3 34 2 15 16 17
2- 1 : 82 6 63 3 34 2 15 16 17,
where the two bordered regions now show the elements encompassed by the
two critical packages, rather than the packages themselves. The increment of
2- 4 (item Wd is taken first, and the remainder of that list left available for
packaging; then the list 2- 3 is constructed, and no packages are removed from
the front of it as part of the K (C) growth; then the 2- 2 list is constructed, and
again no packages are required out of it; and finally the 2- 1 list is formed, and
one package is removed from it, to bring K(C) to 1.0. Working backwards,
that one package corresponds to two packages in the 2- 2 list; which expand to
one package and three leaves in the 2- 3 list; and that one package expands to
two leaves in the 2- 4 list, namely, items 82 and 63. In this case the final code
is ICI = [2,2,2,4,4,4,4], as symbols 10 1 ,82, and 63 all appear twice to the
left of the final boundary.
Astute readers will by now have realized that at most one element can be
required from each list to contribute to the increase in K(C), and that the ex-
haustive enumeration of packages shown in the two examples is perhaps exces-
sive. Even if a package is required from every list, at most one object will be
removed from list 2^{−1}, at most three from list 2^{−2}, at most seven from list 2^{−3},
and so on; and that is the worst that can happen. If not all lists are required
to contribute to lifting K(C) to 1.0, then even fewer packages are inspected.
In the most recent example only two such head-of-list packages are consumed,
and it is only necessary to calculate 14 list entries:

    2^{−4}:  10_1  8_2  6_3  3_4  1_5  1_6  1_7
    2^{−3}:  [14]  10_1  8_2  6_3
    2^{−2}:  [24]  [14]
    2^{−1}:  [38].

Larmore and Hirschberg [1990] constructed the lists in the opposite order
to that shown in these examples, and had no choice but to fully evaluate all
L lists, giving rise to an O(nL) time and space requirement. Reversing the
list calculation process, and then only evaluating list items that have at least
some chance of contributing to the solution, saves O(n log n) time and space,
to give a resource cost for the reverse package merge algorithm that is O(n(L −
log₂ n + 1)). A curious consequence of this bound is that if L is regarded as
being constant - for example, a rationale for L = 32 was discussed above -
then the cost of constructing a length-limited code grows less than linearly in
n. As n becomes larger, the constraining force L becomes tighter, but (per
element of P) the length-limited code becomes easier to find.
Algorithms 7.1 and 7.2 provide a detailed description of the reverse pack-
age merge process. Queue value[j] is used to store the weight of each item in
list j, and queue type[j] to store either a "package" flag, to indicate that the cor-
responding item in value[j] is a package; or, if that item is a leaf, the matching
symbol number - the subscript from the examples. Variable excess is first set
to the amount by which K(C) must be increased; from this, the set of packages
that must be consumed is calculated: b_j is one if a package is required from list
2^{−j}, and zero if not. The maximum number ℓ_j of objects that must be formed
in list 2^{−j} is then calculated at steps 8 to 10, by adding b_j to twice the number
of objects required in list 2^{−j+1}, but ensuring that the length is no longer than
the maximum number of packages possible in that list.

The first list, for 2^{−L}, is easy to form: it is just the first ℓ_L symbols from
P, as there are no packages. If one of these objects is required as part of the
K(C) adjustment, that object is extracted at step 16. The first ℓ_j elements
of list j are then iteratively constructed from the symbol probabilities p_i and
list value[j + 1], which, by construction, must have enough elements for all
required packages. Once each list is constructed, its first package is extracted
if it is required as part of the K(C) adjustment.
Function take_package() in Algorithm 7.2 is the recursive heart of the code
construction process. Each time it is called, one object is consumed from the
indicated list. If that object is a leaf, then the corresponding symbol has its
codeword shortened by one bit. If the object is a package, then two objects
must be consumed from the previous list, the ones that were used to construct
the package. Those recursive calls will - eventually - result in a correct com-
bination of codeword lengths being shortened.
In an implementation it is only necessary to store value[j + 1] during the
generation of value[j], as only the information in the type[j] queue is required
by function take_package(). With a little fiddling, value[j + 1] can be overwritten
by value[j]. In a similar vein, it is possible to store type[j] as a bit vector,
with a one-bit indicating a package and a zero-bit indicating an original sym-
bol. This reduces the space requirements to O(n) words for queue value, and
O(n(L − log₂ n + 1)) bits for the L queues that comprise type. Furthermore, if,
as hypothesized, L is less than or equal to the word size of the machine being

Algorithm 7.1
Calculate codeword lengths for a length-limited code for the n symbol
frequencies in P, subject to the constraint that c_i ≤ L, where in this
algorithm c_i is the length of the codeword assigned to the ith symbol.
reverse_package_merge(P, n, L)
1: set excess ← 1 − n × 2^{−L} and P_L ← n
2: for j ← 1 to L do
3:   if excess ≥ 0.5 then
4:     set b_j ← 1 and excess ← excess − 0.5
5:   else
6:     set b_j ← 0
7:   set excess ← 2 × excess and P_{L−j} ← ⌊P_{L−j+1}/2⌋ + n
8: set ℓ_1 ← b_1
9: for j ← 2 to L do
10:   set ℓ_j ← min{P_j, 2 × ℓ_{j−1} + b_j}
11: for i ← 1 to n do
12:   set c_i ← L
13: for t ← 1 to ℓ_L do
14:   append p_t to queue value[L] and append t to queue type[L]
15: if b_L = 1 then
16:   take_package(L)
17: for j ← L − 1 down to 1 do
18:   set i ← 1
19:   for t ← 1 to ℓ_j do
20:     set pack_wght ← the sum of the next two unused items in queue value[j + 1]
21:     if pack_wght > p_i then
22:       append pack_wght to queue value[j]
23:       append "package" to queue type[j]
24:     else
25:       append p_i to queue value[j] and append i to queue type[j]
26:       set i ← i + 1 and retain pack_wght for the next loop iteration
27:   if b_j = 1 then
28:     take_package(j)
29: return [c_1, . . . , c_n]

Algorithm 7.2
Decrease codeword lengths indicated by the first element in type[j],
recursively accessing other lists if that first element is a package.
take_package(j)
1: set x ← the element at the head of queue type[j]
2: if x = "package" then
3:   take_package(j + 1)
4:   take_package(j + 1)
5: else
6:   set c_x ← c_x − 1
7: remove and discard the first elements of queues value[j] and type[j]

used, then the L bit vectors storing type together occupy only O(n) words of
memory. That is, under quite reasonable assumptions, the space requirement
of reverse package merge is O(n).
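For readers who prefer running code to pseudocode, the fragment below is a compact, unoptimized rendering of the same process. It builds every list in full, so it has none of the lazy evaluation or bit-vector economies just discussed, and its representation of items as (weight, symbol-list) pairs is a convenience of this sketch rather than part of Algorithms 7.1 and 7.2.

def reverse_package_merge(p, L):
    """Return length-limited codeword lengths for the non-increasing list p."""
    n = len(p)
    assert n <= 2 ** L, "no prefix code exists with codewords of at most L bits"
    lengths = [L] * n
    slack = 2 ** L - n          # Kraft slack in units of 2^-L; its binary
                                # digits say which lists surrender a head item
    prev = []                   # the list built on the previous iteration
    for j in range(L, 0, -1):   # build lists 2^-L, 2^-(L-1), ..., 2^-1
        # a package pairs two adjacent items of the previous list; choosing
        # an item shortens every symbol it mentions by one bit
        packages = [(prev[i][0] + prev[i + 1][0], prev[i][1] + prev[i + 1][1])
                    for i in range(0, len(prev) - 1, 2)]
        leaves = [(w, [s]) for s, w in enumerate(p)]
        current = sorted(packages + leaves, key=lambda item: -item[0])
        if (slack >> (L - j)) & 1:        # this list contributes one head item
            weight, symbols = current.pop(0)
            for s in symbols:
                lengths[s] -= 1
        prev = current                    # remaining items feed the next list
    return lengths

print(reverse_package_merge([10, 8, 6, 3, 1, 1, 1, 1], 4))  # [2, 2, 3, 3, 4, 4, 4, 4]
print(reverse_package_merge([10, 8, 6, 3, 1, 1, 1], 4))     # [2, 2, 2, 4, 4, 4, 4]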
If space is at a premium, it is possible for the reverse package merge algo-
rithm to be implemented in O(L²) space over and above the n words required
to store the input probabilities [Katajainen et al., 1995]. There are two key
observations that allow the further improvement. The first is that while each
package is a binary tree, it is only necessary to store the number of leaves on
each level of the tree, rather than the entire tree. The second is that it is not
necessary to store all of the trees at any one time: only a single tree in each list
is required, and trees can be constructed lazily as and when they are needed,
rather than all at once. In total this lazy reverse package merge algorithm stores
L vertical cross-sections of the lists, each with O(L) items, so requires O(L²)
words of memory.

If speed is crucial when generating optimal length-limited codes, the run-
length techniques of Section 4.6 on page 70 can also be employed, to make a
lazy reverse runlength package merge [Turpin and Moffat, 1996]. The result-
ing implementation is not pretty, and no pleasure at all to debug, but runs in
O((r + r log(n/r))L) time.
Liddell and Moffat [2002] have devised a further implementation, which
rather than forming packages from the symbol probabilities, uses Huffman's
algorithm to create the packages that would be part of a minimum-redundancy
code, and then rearranges these to form the length-limited code. This mecha-
nism takes O(n(L_H − L + 1)) time, where L_H is the length of a longest unre-
stricted minimum-redundancy codeword for the probability distribution being
processed. This algorithm is most efficient when the length-limit is relatively
relaxed.

Several approximate algorithms have also been invented. The first of these,
as mentioned above, is due to Fraenkel and Klein [1993]. They construct a
minimum-redundancy code; shorten all of the too-long codewords; and then
lengthen sufficient other codewords that K(C) ≤ 1 again, but without being
able to guarantee that the code so formed is minimal. Milidiú and Laber [2000]
take another approach with their WARM-UP algorithm, and show that a length-
limited code results if all of the small probabilities are boosted to a single larger
value, and then a minimum-redundancy code calculated. They search for the
smallest such threshold value, and in doing so, are able to quickly find codes
that experimentally are minimal or very close to being minimal, but cannot be
guaranteed to be minimal.
Liddell and Moffat [2001] have also described an approximate algorithm.
Their method also adjusts the symbol probabilities; and then uses an approx-
imate minimum-redundancy code calculation process to generate a code in
which the longest codeword length is bounded as a function of the smallest
source probability. This mechanism operates in O(n) time and space, and
again generates codes that experimentally are very close to those produced by
the package merge process.
The compression loss caused by length-limiting a prefix code is generally
very small. The expected codeword length E(C, P) for a length-limited code
C can never be less than that of a minimum-redundancy code, and will only
be greater when L is less than the length of a longest codeword in the corre-
sponding minimum-redundancy code. Milidiú and Laber [2001] have shown
that the compression loss introduced by using a length-limited code, rather than
a minimum-redundancy prefix code, is bounded by

    φ^{1 − L + ⌈log₂(n + ⌈log₂ n⌉ − L)⌉},

where φ is the golden ratio (1 + √5)/2. Some values for this bound when
L = 32 are shown in Table 7.2. In practice, there is little loss when length-
limits are applied, even quite strict ones. For example, applying a length-limit
of 20 to the WSJ.Words message (Table 4.5 on page 71) still allows a code with
an expected cost of under 11.4 bits per symbol, compared to the 11.2 bits per
symbol attained by a minimum-redundancy code (see Figure 6.8 on page 191).
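Evaluating this bound is straightforward; the short calculation below, for L = 32, reproduces the entries of Table 7.2.

import math

phi = (1 + math.sqrt(5)) / 2
L = 32
for n in (256, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    exponent = 1 - L + math.ceil(math.log2(n + math.ceil(math.log2(n)) - L))
    print(n, round(phi ** exponent, 5))
# prints 0.00002, 0.00004, 0.00028, 0.00119, and 0.00503 bits per symbol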

7.2 Alphabetic coding


All of the coding mechanisms described up to this point have been independent
of the exact values of the symbols being coded, and we have made use of this
flexibility in a number of ways. For example, we have several times assumed
that the input distribution is probability-sorted. But there are also situations in

    n        Upper bound on loss
    256          0.00002
    10^3         0.00004
    10^4         0.00028
    10^5         0.00119
    10^6         0.00503

Table 7.2: Upper bound on compression loss (bits per symbol) compared to a
minimum-redundancy code, when a limit of L = 32 bits is placed on the length of
codewords.

which the original order of the input symbols should be preserved, meaning
that the source alphabet may not be permuted.
One such situation is that of alphabetic coding. Suppose that some ordering
≺ of the source symbols is defined, such that i < j implies that s_i ≺ s_j. Sup-
pose also that we wish to extend ≺ to codewords in the natural lexicographic
manner, so that if i < j and s_i ≺ s_j, we require c_i ≺ c_j. In plain English:
if the codewords are sorted, the order that results is the same as if the source
symbols had been sorted. Needless to say, for a given probability distribution
P that may not be assumed to be non-increasing, we seek the alphabetic code
which minimizes the expected cost E(C, P) over all alphabetic codes C.
All three codes listed as examples in Table 1.1 on page 7 are, by luck,
alphabetic codes. The ordering of the symbols in the source alphabet is

    s_1 ≺ s_2 ≺ s_3 ≺ s_4 ≺ s_5 ≺ s_6,

and in Code 3, for example, we have

    "0" ≺ "100" ≺ "101" ≺ "110" ≺ "1110" ≺ "1111".

A canonical minimum-redundancy code for a probability-sorted alphabet is al-
ways an alphabetic code. On the other hand, the tree codes derivable by a strict
application of Huffman's algorithm are not guaranteed to be alphabetic, even if
the source distribution is probability-sorted. So given an unsorted probability
distribution, how do we generate an alphabetic code?
Hu and Tucker [1971] provided an algorithm to solve this problem, which
they showed could run in O(n²) time. Knuth [1973] subsequently improved
the implementation, to O(n log n) time, using a leftist-tree data structure. This
HTK algorithm generates a non-alphabetic binary tree from which codeword
lengths for each symbol can be read. Once the codeword lengths are known, al-
phabetic codewords are readily assigned using a variant of the canonical code-
word process described in Section 4.3 on page 57.

Algorithm 7.3
Calculate a code tree for the n symbol frequencies in P, from which
codeword lengths for an alphabetic code can be extracted. Distribution P
may not be assumed to be probability-sorted. The notation key(x) refers to
the weight of element x in the queues and in the global heap.
calculate_alphabetic_code(P, n)
1: for i ← 1 to n do
2:   set L[i] ← a leaf package of weight p_i
3: for i ← 1 to n − 1 do
4:   create a new priority queue q containing i and i + 1
5:   set key(i) ← p_i and key(i + 1) ← p_{i+1}
6:   set key(q) ← p_i + p_{i+1}
7:   add q to the global heap
8: while more than one package remains do
9:   set q ← the queue at the root of the global heap
10:   set (i_1, i_2) ← the candidate pair of q, with i_1 < i_2
11:   set L[i_1] ← a package containing L[i_1] and L[i_2]
12:   set key(i_1) ← key(i_1) + key(i_2), repositioning i_1 in q if necessary
13:   remove i_2 from q
14:   if L[i_1] was a leaf package then
15:     let r be the other priority queue containing i_1
16:     remove i_1 from r and merge queues r and q
17:     remove queue r from the global heap
18:   if L[i_2] was a leaf package then
19:     let r be the other priority queue containing i_2
20:     remove i_2 from r and merge queues r and q
21:     remove queue r from the global heap
22:   establish a new candidate pair for q
23:   restore the heap ordering in the global heap

The tree construction algorithm is depicted in Algorithm 7.3. At each itera-
tion of the while loop, the two smallest available subtrees are packaged, just as
in Huffman's earlier algorithm. But there are two crucial differences. First, the
newly formed package is not immediately inserted into a sorted list of packages
according to its weight, but rather replaces its leftmost constituent. This can be
seen in step 11, where the package derived from L[i_1] and L[i_2] replaces L[i_1].
The second crucial difference is that only packages not separated by leaves
are allowed to combine. To select the two packages whose sum is minimal,
and that are not separated by a leaf, Knuth uses a set of priority queues to store
the indices of the packages in L that may be joined. Each priority queue in
the set represents one collection of packages, within which any object may join
with any other. That queue is in turn represented by its candidate pair, the two
packages of smallest weight, with a key equal to the sum of the weights of those
two packages. In addition, a global heap is used to allow rapid identification of
the queue with the candidate pair that has the smallest key. That least candidate
pair is then extracted (step 10 of Algorithm 7.3) and joined together (step 11)
into a new package, and all structures updated.
Initially all packages are leaves, so each priority queue holds just two items:
itself, and its right neighbor. These two items thus form the candidate pair for
each queue, and are used to initialize the global heap. Care must be taken when
modifying the heap that not only is the heap ordering maintained, but also that
objects with equal keys are arranged such that the one whose queue contains
the leftmost packages in L (lowest index) is preferred. For example, when
initializing the queues on the list of probabilities from Blake's verse

P = [4,22,5,4,1,3,2,2,4,8,5,6,1,8,3,8,7,9,1,11,4,1,3,2,4],

there are three queues with a candidate pair that has a key value of 4: that
associated with symbol 5, that associated with symbol 7, and that associated
with symbol 22. In this case, the candidate pair associated with symbol 5 takes
precedence, and is processed first. If, when breaking ties in such a manner,
even the indices of the first item in a candidate pair are equal, the candidate
pair with the leftmost second component should be preferred.
Figure 7.1 shows the initial queues, and the first two packaging steps of
function calculate_alphabetic_code() on the character self-probabilities from
Blake's verse. In each panel, the list of packages stored in L appears first.
Under each package is the corresponding priority queue, drawn as a list, that
includes the index of that package, and any to its right with which it may now be
packaged. Only the key values are shown in each priority queue, as the indices
can be inferred from their position in the figure; and by convention, each queue
is shown associated with the leftmost leaf represented in that queue. The root
of the global heap is shown pointing to the next candidate pair to be packaged:

[Figure 7.1, three panels (a), (b), and (c), not reproduced here: each shows the array L of packages with the associated priority queues drawn beneath.]

Figure 7.1: First three steps of the Hu-Tucker-Knuth algorithm for generating an al-
phabetic code from the character self-probabilities in Blake's verse: (a) before any
packages are formed; (b) after the first package is formed; and (c) after the second
package is formed. Queues are shown associated with their leftmost candidate leaf.

the priority queue with the smallest sum of the first two elements. Figure 7.1b
shows the results of packaging L[5] and L[6] in the first iteration of the while
loop in calculate_alphabetic_code(). Both of these elements are leaves, so the
previous queue and the next queue are merged with the new queue containing
the package formed at this step. That is, queues

    [1 → 4]    [1 → 3]    [2 → 3]

become the single queue

    [2 → 4 → 4].

Figure 7.1c shows the result of the second iteration, during which items L[7]
and L[8] get packaged. Both parts are again leaves, so three queues merge.
The root of the global heap now has a key of four, corresponding to the queue
containing elements L[22] and L[23], the next smallest candidate pair.
The sequence of packagings generated by this process does not of itself
generate an alphabetic tree: there are edges that cross over other edges. But
the codeword lengths can be used, and the leaves numbered off, assigning a
string of zero-bits to symbol s_1, and then incrementing for each subsequent
codeword, shifting right or left as necessary.
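One way of carrying out that numbering-off is sketched below: codewords are assigned in symbol order by adding one to the previous codeword and then shifting to the new length, with the shift rounded up whenever a codeword becomes shorter. The routine is a generic rendering of the canonical-style assignment referred to here, not a transcription of the book's own code.

def alphabetic_codewords(lengths):
    """Assign lexicographically increasing codewords to the given lengths,
    which are listed in source-symbol order (as produced by Algorithm 7.3)."""
    codes = []
    code, prev = 0, lengths[0]
    for length in lengths:
        code += 1 if codes else 0          # increment, except for the first symbol
        if length > prev:
            code <<= length - prev         # lengthening: shift left
        elif length < prev:
            shift = prev - length          # shortening: shift right, rounding up
            code = (code + (1 << shift) - 1) >> shift   # to stay prefix-free
        codes.append(format(code, "0{}b".format(length)))
        prev = length
    return codes

# the lengths of Code 3 in Table 1.1 give back its codewords
print(alphabetic_codewords([1, 3, 3, 3, 4, 4]))
# ['0', '100', '101', '110', '1110', '1111']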
Figure 7.2 shows the proto-alphabetic tree generated by Algorithm 7.3 for
the character self-probabilities in Milton. Notice how the juxtaposition of p_1 =
4 and p_2 = 22 forces a compromise on codeword lengths: the codeword for
symbol one is two bits shorter than the codewords for other symbols of the
same frequency, a situation that is not possible in a minimum-redundancy code.
The final alphabetic codeword for each symbol is listed below the tree.
If the queues are implemented as leftist-trees then O(log n) time is required
for each merge in steps 16 and 20 [Knuth, 1973]. As at most O(n) merges are
performed, the total time required to maintain the data structures is O(n log n).
The global heap adds a similar cost, and does not dominate the time spent on
merging operations.

Several other asymptotically faster algorithms have been proposed to gen-
erate alphabetic codes, but impose certain restrictions on P. Larmore and
Przytycka [1994] discuss an algorithm for constructing an alphabetic code in
O(n √(log n)) time using a Cartesian tree data structure when P is a list of in-
tegers. Garsia and Wachs [1977] give an O(n) time algorithm to generate
an alphabetic code when max_{1≤i,j≤n} {p_i/p_j} ≤ 2.
Klawe and Mumey [1993] extend this result, providing an algorithm for the
case max_{1≤i,j≤n} {p_i/p_j} ≤ k for some constant k. Larmore and Przytycka
[1998] have also undertaken work in this area.
[Figure 7.2 occupies a full page and is not reproduced here.]

Figure 7.2: Alphabetic code tree for the character self-probabilities in Blake's Milton.

7.3 Alternative channel alphabets


Thus far our focus has been on producing codes for a binary channel alphabet,
where the "0" and "I" bits are considered equal in tenns of cost, and cost is
measured in time, money or some other external quantity. This may not always
be the case. For example, consider the code developed by Samuel Morse in
1835 for telegraphic transmissions. Morse uses a channel alphabet of "dots"
and "dashes", where a dash is, by definition, three times longer than a dot. The
full definition of Morse code also stipulates rules for the transmission of a one-
dot space between characters, and a longer space between words, but we will
ignore this additional complication here, as the additional cost is constant for
any given message, and not influenced by the codes assigned to the characters
comprising the message, once blanks have been stripped out. That is, if mes-
sages composed of characters are to be coded in a minimum-duration manner,
a code must be constructed in which the cost of a dash should be taken to be
three times the cost of a dot.
A more immediate application than Morse coding arises in data commu-
nications channels. In such a channel it is usual for a number of 8-bit combi-
nations to be reserved for channel control, and these must be "escaped" when
they appear in the data being transmitted (as must the escape byte itself, should
it appear naturally). Hence, the cost of emitting these special characters from a
coder should be taken to be twice the cost of the other non-special 8-bit combi-
nations if overall channel capacity is to be maximized. In this case the channel
alphabet consists of 256 symbols, and some nominated subset of these cost two
time units each while all other channel symbols cost one time unit each. That
is, we might be asked to deal with coding situations that not only have unequal
letter costs, but have multi-symbol channel alphabets.
Most generally, suppose we are given D = [d_1, d_2, . . . , d_r] as a set of costs
for the r symbols in the channel alphabet, as well as the source probability
distribution P. For example, suppose we must compose codewords of Morse's
dots and dashes for the probability distribution P = [0.45,0.35,0.15,0.05].
Table 7.3 lists some plausible codes, and the expected cost (in "dot units") of
each. The five codes are all sensible, in that codewords have been assigned to
the alphabet in increasing cost order. That the codewords should be ordered
by increasing cost when the alphabet is ordered by decreasing cost is clear, as
if not, two codewords can be exchanged to reduce the expected cost. Once
that is taken into account, these five codes are the only candidates; and by
enumeration, it is clear that Code 3 has the least cost.
The codes shown in Table 7.3 are all complete. But not even that can be
taken for granted. Consider the channel alphabet described by r = 4 and
D = [1,1,5,9]. For the source probability distribution shown in Table 7.3

  p_i      Code 1   Code 2   Code 3   Code 4   Code 5
  0.45     .        .        ..       ..       -
  0.35     -.       -..      -        .-       ...
  0.15     --.      --       .-.      -.       .-
  0.05     ---      -.-      .--      --       ..-
  Expected cost   3.35     3.45     3.05     3.20     3.25

Table 7.3: Constructing codes with unequal letter costs. Different assignments of
dots and dashes to the probability distribution P = [0.45,0.35,0.15,0.05]. Dots are
assumed to have a cost of 1; dashes a cost of 3.

- which, after all, has just n = 4 symbols in it - the obvious code is C_1 =
["0", "1", "2", "3"]. This assignment leads to an expected cost of 2.0 cost-units
per source symbol. But the incomplete code C_2 = ["0", "10", "11", "2"], that
leaves symbol "3" unused, attains a better cost of 1.7 cost-units per symbol.

What if we wanted a Morse code for the probability distribution

    P = [4,22,5,4,1,3,2,2,4,8,5,6,1,8,3,8,7,9,1,11,4,1,3,2,4],

with n = 25, derived from Blake's verse? How then should we proceed? In-
deed, what is the equivalent in this situation of the entropy-based lower bound?
Recall from Chapter 2 that Shannon defined the information content of
symbol s_i to be I(s_i) = −log₂ p_i, where information is measured in bits. The
average information content per symbol over an alphabet of symbols is the
entropy of that probability distribution:

    H(P) = − Σ_{i=1}^{n} p_i log₂ p_i.

So far in this book we have assumed a natural mapping from bits in Shannon's
informational sense to bits in the binary digit sense. Implicit in this assumption
is recognition that the bits emitted by a coder are both a stream of symbols
drawn from a binary channel alphabet, and also a description of the information
present in the message.
Now consider the situation when this mapping is made explicit. A measure
of the cost of transmitting a bit of information using the channel alphabet must
be introduced. For lack of a better word, we will use units denoted as dollars
- which represent elapsed time, or power consumed, or some other criterion. As
a very simple example, if each symbol in a binary channel alphabet costs two
dollars - that is, r = 2 and D = [2,2] - then it will clearly cost two dollars per
information bit to transmit a message.

In the general case there are r channel symbols, and their dollar costs are
all different. Just as an input symbol carries information as a function of its
probability, we can also regard the probability of each channel symbol as being
an indication of the information that it can carry through the channel. For ex-
ample, if the ith channel symbol appears in the output stream with probability
q_i, then each appearance carries I(q_i) bits of information. In this case the rate
at which the ith channel symbol carries information is given by I(q_i)/d_i bits
per dollar, as each appearance of this channel symbol costs d_i dollars.

In a perfect code every channel symbol must carry information at the same
rate [Shannon and Weaver, 1949]. Hence, if a codeword assignment is to be
efficient, it must generate a set of channel probabilities that results in

    I(q_i)/d_i = I(q_j)/d_j

for all i, j such that 1 ≤ i, j ≤ r. This requirement can only be satisfied if

    q_i = t^{d_i}

for some positive constant t. Moreover, as Σ_{i=1}^{r} q_i = 1 by definition of q_i, the
value of t is defined by the equation

    Σ_{i=1}^{r} t^{d_i} = 1.                                        (7.1)

An equation of this form always has a single positive real root between zero
and one. For example, when D = [1,1], which is the usual binary channel
alphabet, t = 0.5 is the root of the equation t + t = 1. Similarly, when
D = [2,2], t = √(1/2) ≈ 0.71 is the root of the equation t² + t² = 1. And as
a third example, Morse code uses the channel alphabet defined by r = 2 and
D = [1,3], and t is the root of t¹ + t³ = 1, which is t ≈ 0.68. Hence, in
this third case, q_1 = t¹ ≈ 0.68, and q_2 = t³ ≈ 0.32; that is, the assignment
of codewords should generate an output stream of around 68% dots and 32%
dashes.
Given that q_i = t^{d_i}, the expected transmission cost T(D) for the channel
alphabet described by D, measured in dollars per bit of information, is

    T(D) = Σ_{i=1}^{r} q_i · d_i / I(q_i) = 1 / I(t),

which follows as Σ q_i = 1. For the usual equal-cost binary case, D = [1,1],
so T(D) = 1/I(0.5) = 1. When D = [2,2], the average per-symbol channel
cost is T(D) = 1/I(√(1/2)) = 2, and it costs a minimum of $2 to transmit each

bit of information through the channel. And for the simplified Morse example,
in which D = [1,3], T(D) ~ 1.81, and each bit of information passed through
the channel costs $1.81.
If the message symbols arrive at the coder each carrying on average H(P)
bits of information, then the minimum cost of representing that message is

$$H(P) \cdot T(D) = \frac{-\sum_{i=1}^{n} p_i \log_2 p_i}{-\log_2 t} = \sum_{i=1}^{n} p_i \log_t p_i.$$

Returning to the example of Table 7.3, we can now at least calculate the re-
dundancy of the costs listed: the entropy H(P) of the source probability dis-
tribution is approximately 1.68, and so the best we can do with a Morse code
is H(P) · T(D) ≈ $3.04 per symbol. If nothing else, this computation can be
used to reassure us that Code 3 in Table 7.3 is pretty good.
But it is still not at all obvious how the codewords should be assigned to
source symbols so as to achieve (or closely approximate) qi = t di for each of
the r channel symbols. Many authors have considered this problem [Abrahams
and Lipman, 1992, Altenkamp and Mehlhorn, 1980, Golin and Young, 1994,
Karp, 1961, Krause, 1962, Mehlhorn, 1980], but it was only relatively recently
that a generalized equivalent of Huffman's famous algorithm was proposed
[Bradford et al., 1998].
Less complex solutions to some restricted problems are also known. When
all channel costs are equal, and D = [1,1, ... ,1] for an r symbol channel
alphabet, Huffman's algorithm is easily extended. Rather than take the two
least-weight symbols at each packaging stage, the r least-weight symbols are
packaged. There is a single complication to be dealt with, and that is that it
may not be possible for the resulting radix-r tree to be complete. That is, there
are likely to be unused codewords. To ensure that these unused codes are as
deep as possible in the tree, dummy symbols of probability zero must be added
to the alphabet to make the total number of symbols one greater than a multiple
of r - 1. For example, when r = 7 and n = 22, the code tree will have
25 = 6 x 4 + 1 leaves, and three dummy symbols should be added before
packaging commences. An application that uses radix-256 byte-aligned codes
is described in Section 8.4. Perl et al. [1975] considered the reverse constrained
situation - when all symbols in the source alphabet have equal probability, and
the channel symbols are of variable cost. As noted, Bradford et al. [1998] have
provided a solution to the general unequal-unequal problem.
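
To make the packaging step concrete for the equal-cost case just described, here is a
small Python sketch; it is our illustration rather than code from the text, the function
name is arbitrary, and it returns only codeword lengths (tree depths), which is all that
is needed to construct a canonical radix-r code.

import heapq

def radix_r_code_lengths(probs, r):
    # Pad with dummy zero-probability symbols so that the number of leaves
    # is one greater than a multiple of r - 1, keeping the radix-r tree
    # complete and pushing any unused codewords as deep as possible.
    n = len(probs)
    dummies = 0
    while (n + dummies - 1) % (r - 1) != 0:
        dummies += 1
    # Each heap entry is (weight, list of (symbol, depth-so-far)) pairs.
    heap = [(p, [(i, 0)]) for i, p in enumerate(probs)]
    heap += [(0.0, []) for _ in range(dummies)]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Package the r least-weight items into a single new item.
        weight, members = 0.0, []
        for _ in range(r):
            w, group = heapq.heappop(heap)
            weight += w
            members += [(sym, depth + 1) for sym, depth in group]
        heapq.heappush(heap, (weight, members))
    lengths = [0] * n
    for sym, depth in heap[0][1]:
        lengths[sym] = depth
    return lengths

For example, with r = 7 and n = 22 equiprobable symbols the function adds the three
dummies mentioned above and assigns four codewords of one radix-7 digit and eighteen
of two digits.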
There is also an elegant - and rather surprising - solution to the general
problem using arithmetic coding, and it is this approach that we prefer to de-
scribe here. The key observation is that if an arithmetic decoder is supplied
with a stream of random bits, it generates an output "message" in which sym-
bols appear with a frequency governed by the probability distribution used.

[Figure 7.3 appears here. It shows the source message entering a combined encoder,
in which it is encoded using P and the resulting code value V is immediately decoded
using Q to produce the channel symbols; the combined decoder then encodes those
channel symbols using Q to recover V, and decodes V using P to regenerate the
message.]
Figure 7.3: Arithmetic coding when the channel alphabet probability distribution is
specified by Q = [qi].

Earlier we calculated a distribution Q = [qi] describing the desired frequencies


of the channel symbols if the channel was to be used to capacity. To generate a
stream of symbols that matches that distribution, we simply take the bitstream
generated by an arithmetic encoder controlled by the source probability dis-
tribution P, and feed it directly into an arithmetic decoder controlled by the
channel distribution Q. Figure 7.3 shows this arrangement. The code value
V that is transferred between the two halves of the encoder is a concise rep-
resentation of the information distilled from the source message. The channel
decoding stage then converts that information into a message over the channel
alphabet.
To reconstruct the source sequence the process is reversed - the transmitted
message is first encoded using distribution Q, thereby recovering the informa-
tion content of the message (the binary value V); and then that information is
decoded using the source probabilities P.
In the middle, the channel transfers a sequence of symbols over the channel
alphabet. If P is a good approximation of the true source probabilities, then the
value V handed on to the channel decoder will resemble a stream of random
bits, as each will be carrying one full bit's worth of information. The message
passed over the channel will thus exactly match the probability distribution
required for the maximum channel throughput rate. On the other hand, if P
is not an accurate summary of the source statistics, then the channel symbol
distribution cannot be guaranteed, and the expected cost per symbol will rise -
exactly as for the binary channel case.
This mechanism requires approximately twice the computation of the arith-
metic coding process described in Chapter 5. Nevertheless, it is sufficiently
fast that it is eminently practical. Adaptive computation of the source statis-
tics (Section 6.5 on page 154) is also straightforward, and has little additional

impact upon execution costs. The only other cost is solution of Equation 7.1,
so that Q can be computed. Methods such as bisection or Newton-Raphson
converge rapidly, and in any case, need to be executed only when the channel
costs change.
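
For completeness, a small Python sketch of that root-finding step follows; it is our
illustration, and the function name and convergence tolerance are arbitrary. The
function f(t) = Σ t^di - 1 is increasing on [0, 1], negative at t = 0 and positive at
t = 1 whenever r ≥ 2, so bisection is guaranteed to converge to the single root of
Equation 7.1.

def solve_channel_root(D, eps=1e-12):
    # Find the root t in (0, 1) of sum(t**d for d in D) = 1 by bisection.
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if sum(mid ** d for d in D) < 1.0:
            lo = mid   # f(mid) < 0, so the root lies above mid
        else:
            hi = mid   # f(mid) >= 0, so the root lies at or below mid
    return (lo + hi) / 2.0

# The simplified Morse channel D = [1, 3] gives t of about 0.68, and hence
# channel probabilities q = [t**1, t**3], approximately [0.68, 0.32].
t = solve_channel_root([1, 3])
q = [t ** d for d in [1, 3]]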

7.4 Related material


There are many more constrained coding problems - length-limited alphabetic
codes, for example [Laber et al., 1999, Larmore and Przytycka, 1994]. This
section briefly summarizes two other coding variations.
The first is monotonic coding, where, given an unsorted probability distri-
bution P, a code C must be generated such that |c1| ≤ |c2| ≤ ... ≤ |cn|, and
the expected codeword length is minimized over all codes with this property.
That is, the lengths of the generated codewords must be in non-decreasing or-
der. If the probabilities are in non-increasing order, Huffman's algorithm (Sec-
tion 4.2 on page 53) solves this coding problem in O(n) time, provided ties
are broken appropriately. When P is not sorted, however, the fastest algorithm
proposed to date requires O(n^2) time [Van Voorhis, 1975]. Note that a mono-
tonic code can always be assigned to have the alphabetic property, but it may
not be a minimal alphabetic code for that probability distribution. Abrahams
[1994] explores some properties of monotonic codes.
A second variation is bi-directional coding [Fraenkel and Klein, 1990, and
references therein]. Here the objective is to design a code that allows codeword
recognition in either direction, which is of practical use when the compressed
text is indexed, and decompression might commence from any point within the
compressed stream. To show a window of text around the position of inter-
est, backwards as well as forwards decoding might be required. Unfortunately
minimum-redundancy affix codes are not guaranteed to exist for all P. Girod
[1999] gives a pragmatic solution that requires the addition of a small number
of additional bits to the coded message.
Chapter 8

Compression Systems

This chapter resumes the discussion of compression systems that was started in
Chapters 1 and 2, but then deferred while we focussed on coding. Three state-
of-the-art compression systems are described in detail, and the modeling and
coding mechanisms they incorporate examined. Unfortunately, one chapter is
not enough space to do justice to the wide range of compression models and ap-
plications that have been developed over the last twenty-five years, and our cov-
erage is, of necessity, rather limited. For example, we have chosen as our main
examples three mechanisms that are rather more appropriate for text than for,
say, image or sound data. Nevertheless, the three mechanisms chosen - sliding
window compression, the PPM method, and the Burrows-Wheeler transform
- represent a broad cross section of current methods, and each provides inter-
esting trade-offs between implementation complexity, execution-time resource
cost, and compression effectiveness. And because they are general methods,
they can still be used for non-text data, even if they do not perform as well as
methods that are expressly designed for particular types of other data. Lossy
modeling techniques for non-text data, such as gray-scale images, are touched
upon briefly in Section 8.4; Pennebaker and Mitchell [1993], Salomon [2000],
and Sayood [2000] give further details of such compression methods.

8.1 Sliding window compression


Sliding window modeling is one of those strikingly simple ideas that everyone
grasps as soon as it is explained. Yet, as a testament to the ingenuity of Jacob
Ziv and Abraham Lempel, the two researchers who pioneered this area, it is
worth noting that their ground breaking paper [Ziv and Lempel, 1977] was
not published until Huffman's work was approaching its twenty-fifth birthday.
Perhaps the idea isn't that simple at all!


Suppose that the first w symbols of some m-symbol message M have been
encoded, and may be assumed by the encoder to be known to the decoder.
Symbols beyond this point, M[w + 1 ... m], are yet to be transmitted to the
decoder. To get them there, the sequence of shared symbols in M[1 ... w] is
searched to find a location that matches some prefix M[w + 1 ... w + ℓ] of
the pending symbols. For example, suppose a match of length ℓ is detected,
commencing at location w - c + 1 for some offset 1 ≤ c ≤ w:

M[w - c + 1 ... w - c + ℓ] = M[w + 1 ... w + ℓ].

Then the next ℓ characters in the message are communicated to the decoder by
the tuple (c, ℓ).
To decode, the tuple (c, ℓ) is accepted from the encoder, and the ℓ symbols
at locations M[w - c + 1 ... w - c + ℓ] are copied to M[w + 1 ... w + ℓ]. Pointer
w is then advanced by ℓ positions, and the process repeated. If the message is
repetitive in any way, common sequences will exist, and multiple symbols will
be represented by each two-integer tuple.
There are two practical problems that need to be addressed before this
mechanism is viable. The first is that we have not yet explained how the first
symbol, M[I], gets transmitted. More to the point, we have not yet explained
how any symbol gets transmitted the first time it appears in M. Ziv and Lempel
resolved this difficulty by transmitting triples rather than tuples, with the third
element in each triple the character M[w + ℓ + 1] that failed to match. This
approach transmits novel symbols as triples (?, 0, M[w + 1]) in which no use
is made of the field marked ? The drawback of including a symbol in every
phrase is that some compression effectiveness may be lost - that same symbol
might naturally sit at the beginning of the next phrase and not require separate
transmission at all.
An alternative is to code the match length first, and follow it by either a
copy offset c, or a code for the next character M[w + 1], but not both. This
is the process described in Algorithm 8.1. Algorithm 8.1 also deals with the
second of the two practical problems: the need to ensure that each tuple is
indeed more economical to represent than the sequence it replaces. Hence, if no
match can be found, or if the available matches are all shorter than some lower
bound copy_threshold, a value of ℓ = 1 is coded, and a raw code for the single
symbol M[w + 1] emitted. Finally, in order for copy_threshold to be accurately
estimated, and to allow suitable coding methods to be employed, it is usual for
the values of offset c to be bounded to some maximum window_size, typically a
power of two between 2^12 = 4,096 and 2^16 = 65,536. The current contents of
the window can then be maintained in a circular array of window_size entries,
and the number of bits required by c values strictly controlled.

Algorithm 8.1
Transmit the sequence M[1 ... m] using an LZ77 mechanism.
lz77_encode_block(M, m)
1: encode m using some agreed method
2: set w ← 0
3: while w < m do
4:     locate a match for M[w + 1 ...] in M[w - window_size + 1 ... w], such
       that M[w - c + 1 ... w - c + ℓ] = M[w + 1 ... w + ℓ]
5:     if ℓ ≥ copy_threshold then
6:         encode length ℓ - copy_threshold + 2 using some agreed method
7:         encode offset c using some agreed method
8:         set w ← w + ℓ
9:     else
10:        encode length 1 using the agreed method
11:        encode M[w + 1] using some agreed method
12:        set w ← w + 1

Decode and return the LZ77-encoded sequence M[1 ... m].
lz77_decode_block(M, m)
1: decode m using the agreed method
2: set w ← 0
3: while w < m do
4:     decode ℓ using the agreed method
5:     if ℓ > 1 then
6:         set ℓ ← ℓ + copy_threshold - 2
7:         decode offset c using the agreed method
8:         set M[w + 1 ... w + ℓ] ← M[w - c + 1 ... w - c + ℓ]
9:         set w ← w + ℓ
10:    else
11:        decode M[w + 1] using the agreed method
12:        set w ← w + 1
13: return M and m
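
The same scheme can be rendered as a short, unoptimized Python sketch. This is our
illustration rather than a production implementation: it returns the parsed phrases as
explicit tuples instead of coding them with the agreed methods, it uses a naive quadratic
search of the window, and the two constants are merely example settings.

COPY_THRESHOLD = 3      # illustrative value, matching Figure 8.1
WINDOW_SIZE = 4096      # illustrative window size

def lz77_encode_block(M):
    # Greedy parse of the string M into ('copy', length, offset) and
    # ('raw', symbol) tokens.
    out, w, m = [], 0, len(M)
    while w < m:
        best_len, best_off = 0, 0
        for off in range(1, min(w, WINDOW_SIZE) + 1):   # naive window search
            start = w - off
            length = 0
            # Matches may overlap the lookahead, so length can exceed offset.
            while w + length < m and M[start + length] == M[w + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, off
        if best_len >= COPY_THRESHOLD:
            out.append(('copy', best_len, best_off))
            w += best_len
        else:
            out.append(('raw', M[w]))
            w += 1
    return out

def lz77_decode_block(tokens):
    # Rebuild the message from the token stream produced above.
    M = []
    for token in tokens:
        if token[0] == 'copy':
            _, length, off = token
            for _ in range(length):
                M.append(M[len(M) - off])   # element-wise copy handles overlap
        else:
            M.append(token[1])
    return ''.join(M)

# lz77_decode_block(lz77_encode_block("how#now#brown#cow.")) recovers the
# original string.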

Figure 8.1: LZ77 parsing of Blake's Milton, assuming copy_threshold = 3. Newline
characters are represented by ".". Each gray section is represented as a pointer to
an earlier occurrence in the message; white sections are transmitted as raw characters
after ℓ = 1 flags. The seven gray sections cover, respectively, 4, 12, 4, 14, 17, 4, and 5
characters, or 47% of the message.

Further alternative mechanisms are described by Storer and Szymanski


[1982] and by Bell [1986a]. All are known as LZ77 techniques, to distin-
guish them from an alternative dictionary-based modeling technique (LZ78)
described by the same authors a year later [Ziv and Lempel, 1978].
One important point is that there is no requirement that the absolute best
match be found. In fact any match can be used at each coding stage without the
decoder losing synchronization with the encoder, and we are free in the encoder
to choose whether to expend considerable computational resources looking for
good matches, or to take a quick and dirty approach. As an example of the
latter strategy we might simply accept the first match encountered of length
greater than four. Nor is there any requirement that the decoder be advised of
the particular strategy used on any file. Of course, finding the longest possible
match is likely to lead to better compression, but the process is still correct even
if sub-optimal matches are used.
Figure 8.1 shows the application of Algorithm 8.1 to the verse from Blake's
Milton, assuming that copy_threshold = 3. The strings colored gray, including
three of the four newline characters, are transmitted to the decoder as phrases,
each requiring one tuple. The remaining characters are sent as special phrases
of length one. With a longer message, and a greater amount of history in the
window, the great majority of characters are sent as phrases, and good com-
pression effectiveness can be achieved. Phrases can be surprisingly long even
in text that at face value is non-repetitive, and on typical English prose the
average length of each phrase is in the vicinity of 6-10 characters.
Any appropriate coding mechanisms can be used in conjunction with func-
tion Iz77_encode_hlockO. For example, implementations have made use of
Elias codes, binary codes, minimum-redundancy codes, and even unary codes
for either or both of the stream of match lengths ℓ and the stream of copy off-
sets c. In their simplest implementation, Fiala and Greene [1989] use a four-bit
nibble to hold a binary code for the length ℓ, which allows the "no match" code

plus fifteen valid copy lengths (from 3 ... 17); and a twelve-bit offset code also
stored in binary, which allows a window of 4,096 bytes. Ross Williams [1991b]
has also examined this area, and, in the same way as do Fiala and Greene, de-
scribes a range of possible codes, and examines the trade-offs they supply.
The advantage of simple codes is that the stored tuples are byte aligned, and
bit manipulation operations during encoding and decoding are avoided. Fast
throughput is the result, especially in the decoder.
In the widely-used GZIP implementation, Jean-loup Gailly [1993] em-
ploys two separate semi-static minimum-redundancy codes, one for the copy
offsets, and a second one for a combination of raw characters and copy lengths.
The latter of these two is used first in each tuple, and while it is somewhat con-
fusing to code both characters and lengths - each length is, for coding purposes,
incremented by the size of the character set - from the same probability distri-
bution, the conflation of these two allows economical transmission of the criti-
cal binary flag that indicates whether the next code is a raw symbol or a copy.
The two minimum-redundancy codes are calculated over blocks of 64 kB of the
source message: in the context of Algorithm 8.1, this means that a complete
set of ℓ (and M[w + 1]) values and c offsets are accumulated, codes are con-
structed based upon their self-probabilities, and then the tuples (some of which
contain no second component) are coded, making it a textbook application of
the techniques described in Section 4.8 on page 81.
Gailly's implementation also supports a command-line flag to indicate how
much effort should be spent looking for long matches. Use of gzip -9 gives
better compression than does gzip -1, but takes longer to encode messages.
Decoding time is unaffected by this choice, and is fast in all situations. Fast de-
compression is one of the key attributes of the LZ77 paradigm - the extremely
simple operations involved, and the fact that most output operations involve a
phrase of several characters, mean that decoding is rapid indeed, even when, as
is the case with GZIP, minimum-redundancy codes are used.
The speed of any LZ77 encoder is dominated by the cost of finding prefix
matches. Bell and Kulp [1993] have considered this problem, as have the im-
plementors of the many software systems based upon the LZ77 technique; and
their consensus is that hashing based upon a short prefix of each string is the
best compromise. For example, the first two characters of the lookahead buffer,
M[w + 1] and M[w + 2], can be used to identify a linked list of locations at
which those two characters appear. That list is then searched, and the longest
match found within the permitted duration of the search used. As was noted
above, the search need not be exhaustive. One way of eliminating the risk of
lengthy searches is to only allow a fixed number of strings in each of these
linked lists, or to only search a fixed number of the entries of any list. Control
of the searching cost, trading quality of match against expense of search, is one

of the variables adjusted by the GZIP command-line flag mentioned earlier.
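
A sketch of that searching strategy, in Python, appears below. The class name, the
chain limit of 32, and the window size are illustrative choices of ours, not those of any
particular implementation. An encoder would call longest_match(w) before coding
each phrase, and insert(w) for each position as w advances past it.

from collections import defaultdict, deque

class HashChainMatcher:
    def __init__(self, M, window_size=4096, max_chain=32):
        self.M = M
        self.window_size = window_size
        self.max_chain = max_chain
        self.chains = defaultdict(deque)   # (c1, c2) -> recent positions

    def insert(self, w):
        # Index position w under its two-character prefix, keeping only a
        # bounded number of the most recent candidates.
        if w + 1 < len(self.M):
            key = (self.M[w], self.M[w + 1])
            self.chains[key].appendleft(w)
            if len(self.chains[key]) > self.max_chain:
                self.chains[key].pop()     # forget the oldest candidate

    def longest_match(self, w):
        # Return (length, offset) of the best match found for M[w...].
        best_len, best_off = 0, 0
        if w + 1 >= len(self.M):
            return best_len, best_off
        for start in self.chains[(self.M[w], self.M[w + 1])]:
            if w - start > self.window_size:
                continue                   # candidate has slid out of range
            length = 0
            while (w + length < len(self.M)
                   and self.M[start + length] == self.M[w + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, w - start
        return best_len, best_off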


A large number of variations on the sliding window approach have been
described: far too many to be properly enumerated here. However there are
some common threads that warrant mention. In a second paper, Ziv and Lem-
pel [1978] describe another modeling mechanism in which the dictionary is
stored explicitly, and phrases in the dictionary are indexed. The dictionary
grows adaptively as each phrase is used, by adding a one-character extension
to an existing phrase to make a new phrase. The COMPRESS program, written
by Spencer Thomas, Jim McKie, Steve Davies, Ken Turkowski, James Woods,
and Joe Orost, is an implementation of Terry Welch's [1984] variant of the
LZ78 approach. COMPRESS was, for many years, the standard Unix compres-
sion tool, and played an important role in early Internet mail and file transfer
protocols. The work of Thomas and his collaborators endures in the compres-
sion fraternity not just because of the benchmarks they set for compression
performance in the mid 1980s, but also because the source code of COMPRESS
appears as the file progc in the Calgary Corpus that was constructed at the
time Tim Bell, John Cleary, and Ian Witten were preparing their 1990 book.
Miller and Wegman [1985] also considered an LZ78-like mechanism, but
constructed new phrases by concatenating previous phrases rather than extend-
ing them by one character. Other explicitly dictionary-based modeling mecha-
nisms are discussed in Section 8.4 on page 243.
Another theme that has been explored by a number of researchers and im-
plementors is that of "smart" parsing. When looking for a match, it is not
necessarily true that taking the longest match available at this step maximizes
overall compression. Hence, if some amount of lookahead is built into the
match-searching process, slightly better compression can be attained than is
available with the obvious greedy parsing method.
The last extension to LZ77 is that of conditioning. Each phrase in an LZ77
system starts with a character that is essentially modeled using zero-order prob-
abilities [Bell and Witten, 1994]. Gutmann and Bell [1994] showed that the last
character of the previous phrase could be used to bias the choice of the next
phrase, allowing better compression effectiveness at the expense of greater
complexity. This notion of biasing estimates based upon previous characters
will be explored in detail in the next section.
However, before we end this section, there is one common misconception
that warrants discussion. LZ77 is a modeling technique, not a coding tech-
nique. We have seen texts that describe the whole field of compression in
terms of "three coding methods - Huffman coding, arithmetic coding, and Ziv-
Lempel coding". We disagree with this categorization: it is both simplistic
and inaccurate. The discussion above was explicit in describing the manner
in which the LZ77 technique partitions a complex sequence of symbols into

two simpler sequences, each of which is then coded by a zero-order coder pre-
suming that all conditioning has been exploited by the model. The distinction
between the modeling and coding components in a compression system such
as GZIP is then quite clear; and the Ziv-Lempel component of GZIP supplies
a modeling strategy, not a coding mechanism.

8.2 Prediction by partial matching


In Section 2.4 on page 20 a compression system was introduced in which the
symbol probabilities - where a symbol is an ASCII character - were condi-
tioned upon the immediately prior symbol. For example, in English text the
letter "q" is almost always followed by a ''u'', so "u" should be assigned a high
probability in the "q" context. In total this compression system needs 256 con-
texts, one for each possible preceding ASCII letter; and each context has an
alphabet of as many as 256 symbols.
For the short example message in Section 2.4 the cost of the model parame-
ters made this first-order model more expensive to use than a zero-order model,
which has just a single context, or conditioning class. But for longer messages
- where long means a few hundred bytes or more - the more complex model is
likely to pay for itself whenever adjacent symbols are correlated.
There is no reason to stop at just one conditioning character. For example,
after the pair "th", an "e" seems like a good bet, probably a better bet than
would be argued for if only "h" was known. Using such a second order model,
with its as many as 256^2 = 65,536 conditioning classes, we would expect even
more accurate probabilities to be accumulated, and thus better compression to
result. Of course, there are more model parameters to be estimated, so longer
messages are required before the break-even point is reached, at which the
model pays for its parameters through reduced message costs.
The same is true for third-order models, fourth-order models, and so on.
As the order of the model increases the number of conditioning classes grows
exponentially, and compression effectiveness can be expected to continue to
improve - but only on messages that are long enough that accurate statistics
are accumulated, and only on messages that have this level of inter-symbol
correlation. Indeed, it is possible to go too far, and with a high-order model it
might be that the symbol frequency counts get so fragmented across a multitude
of conditioning classes that compression effectiveness degrades because the
probability distributions used are less accurate than the consolidated statistics
available in a lower-order model. Having too many contexts for the evidence
to support is a problem known as statistics dilution.
How then can contextual information be employed, if we don't know how
long the message will be, and don't wish to commit in advance to a model of

a particular order? In seminal work published in 1984, John Cleary and Ian
Witten tackled this question, and in doing so proposed a significant step for-
ward in terms of modeling. Exploiting the ability of the then newly-developed
arithmetic coder to properly deal with small alphabets and symbol probabilities
close to one, they invented a mechanism that tries to use high-order predictions
if they are available, and drops gracefully back to lower order predictions if
they are not. Algorithm 8.2 summarizes their prediction by partial matching
(PPM) mechanism.
The crux of the PPM process lies in two key steps of the subsidiary func-
tion ppm_encode_symbol(), which attempts to code one symbol M[s] of the
original message in a context of a specified order. Those two steps embody
the fundamental dichotomy that is faced at each call: either the symbol M[s]
has a non-zero probability in this context, and can thus be coded successfully
(step 9); or it has a zero probability, and must be handled in a context that is
one symbol shorter (step 12). In the former of these two cases no further ac-
tion is required except to increment the frequency count P[M[s]]; in the latter
case, the recursive call must be preceded by transmission of an escape symbol
to tell the decoder to shift down a context, and then followed by an increment
to both the probability of escape and the probability P[M[s]]. Prior to mak-
ing this fundamental choice a little housekeeping is required: if the order is
less than zero, the symbol M[s] should just be sent as an unweighted ASCII
code (steps 1 and 2); if the indicated context does not yet have a probability
distribution associated with it, one must be created (steps 4 to 7); and when the
probability distribution P is newly created and knows nothing, the first symbol
encountered must automatically be handled in a shorter context, without even
transmission of an escape (step 11).
Because there are so many conditioning classes employed, and because
so many of them are used just a few times during the duration of any given
message, an appropriate choice of escape method (Section 6.3 on page 139) is
crucial to the success of any PPM implementation. Algorithm 8.2 shows the
use of method D - an increment of 2 is made to the frequency P[M[s]] when
symbol M[s] is available in the distribution P; and when it is not, a combined
increment of 2 is shared between P[M[s]] and P[escape]. In their original
presentation of PPM, Cleary and Witten report experiments with methods A
and B. Methods C and D were then developed as part of subsequent inves-
tigations into the PPM paradigm [Howard and Vitter, 1992b, Moffat, 1990];
method D - to make PPMD - is now accepted as being the most appropriate
choice in this application [Teahan, 1998]. Note that the use of the constants 1
and 2 is symbolic; they can, of course, be more general increments that are aged
according to some chosen half-life, as was discussed at the end of Section 6.6.
Table 8.1 traces the action of a first-order PPM implementation (where

Algorithm 8.2
Transmit the sequence M[1 ... m] using a PPM model of order max_order.
ppm_encode_block(M, m, max_order)
1: encode m using some appropriate method
2: set U[x] ← 1 for all symbols x in the alphabet, and U[escape] ← 0
3: for s ← 1 to max_order do
4:     ppm_encode_symbol(s, s - 1)
5: for s ← max_order + 1 to m do
6:     ppm_encode_symbol(s, max_order)

Try to code the single symbol M[s] in the conditioning class established by
the string M[s - order ... s - 1]. If the probability of M[s] is zero in this
context, recursively escape to a lower order model. Escape probabilities are
calculated using method D (Table 6.2 on page 141).
ppm_encode_symbol(s, order)
1: if order < 0 then
2:     encode M[s] using distribution U, and set U[M[s]] ← 0
3: else
4:     set P ← the probability distribution associated with the conditioning
       class for string M[s - order ... s - 1]
5:     if P does not yet exist then
6:         create a new probability distribution P for M[s - order ... s - 1]
7:         set P[x] ← 0 for all symbols x, including escape
8:     if P[M[s]] > 0 then
9:         encode M[s] using distribution P, and set P[M[s]] ← P[M[s]] + 2
10:    else
11:        if P[escape] > 0 then
12:            encode escape using distribution P
13:        ppm_encode_symbol(s, order - 1)
14:        set P[M[s]] ← 1, and P[escape] ← P[escape] + 1
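
As an illustration only, the following Python sketch mirrors the probability estimation
of Algorithm 8.2 with escape method D, but accumulates ideal code lengths (in bits)
rather than driving an arithmetic coder, and, like Algorithm 8.2, it does not apply
exclusions; the names and the assumption of an 8-bit character alphabet are ours.

import math
from collections import defaultdict

ALPHABET_SIZE = 256   # assumes M is a string of 8-bit characters

def ppm_cost(M, max_order=1):
    contexts = defaultdict(lambda: defaultdict(int))  # context string -> counts
    U = {chr(c): 1 for c in range(ALPHABET_SIZE)}     # uniform fallback context
    bits = 0.0

    def encode_symbol(s, order):
        nonlocal bits
        if order < 0:
            total = sum(U.values())
            bits += -math.log2(U[M[s]] / total)   # unweighted code from U
            U[M[s]] = 0                           # never needed from U again
            return
        P = contexts[M[s - order:s]]
        total = sum(P.values())
        if P[M[s]] > 0:
            bits += -math.log2(P[M[s]] / total)   # step 9 of Algorithm 8.2
            P[M[s]] += 2
        else:
            if P['escape'] > 0:                   # steps 11 and 12
                bits += -math.log2(P['escape'] / total)
            encode_symbol(s, order - 1)           # step 13
            P[M[s]] += 1                          # step 14, method D update
            P['escape'] += 1

    for s in range(len(M)):
        encode_symbol(s, min(s, max_order))
    return bits

# Without exclusions, ppm_cost("how#now#brown#cow.", 1) is a little larger
# than the 106.1 bits of Table 8.1, which does apply exclusions.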

M[s]   Context   P[esc.]   P[M[s]]   Σx P[x]   Exc.   Bits    Total

"h"    U         0         1         256              8.00    8.00
"o"    λ         1         0         2                1.00
       U         0         1         255              7.99    8.99
"w"    λ         2         0         4                1.00
       U         0         1         254              7.99    8.99
"#"    λ         3         0         6                1.00
       U         0         1         253              7.98    8.98
"n"    λ         4         0         8                1.00
       U         0         1         252              7.98    8.98
"o"    λ         5         1         10               3.32    3.32
"w"    "o"       1         1         2                1.00    1.00
"#"    "w"       1         1         2                1.00    1.00
"b"    "#"       1         0         2                1.00
       λ         5         0         12        1      1.14
       U         0         1         251              7.97    10.11
"r"    λ         6         0         14               1.22
       U         0         1         250              7.97    9.19
"o"    λ         7         3         16               2.42    2.42
"w"    "o"       1         3         4                0.42    0.42
"n"    "w"       1         0         4                2.00
       λ         7         1         18        1      4.09    6.09
"#"    "n"       1         0         2                1.00
       λ         7         1         20        5      3.91    4.91
"c"    "#"       2         0         4                1.00
       λ         7         0         22        4      1.36
       U         0         1         249              7.96    10.32
"o"    λ         8         5         24               2.26    2.26
"w"    "o"       1         5         6                0.26    0.26
"."    "w"       2         0         6                1.58
       λ         8         0         26        6      1.32
       U         0         1         248              7.95    10.86

Table 8.1: Calculating probabilities for the string "how#now#brown#cow." using the
PPM algorithm with escape method D (PPMD) and max_order = 1. Context λ is the
zero-order context; and in context U all ASCII symbols have an initial probability of
1/256. The total cost is 106.1 bits.

max_order = 1) on the simple message "how#now#brown#cow." Each row of
the table corresponds to one call to the arithmetic coder, caused by an execution
of either step 9 or step 12 in function ppm_encode_symbol(). The first symbol
of the message, "h", cannot be coded in the zero-order context (string λ) be-
cause that context has not been used at all, so after an implicit escape, the "h"
is coded against the uniform distribution U, in which all novel ASCII symbols
are initially equally likely. A total of eight output bits are generated.
The second letter, "0", cannot be attempted in the context "h", since that
context has never been used before; but can be attempted, without success, in
the context λ (with one output bit generated to code an escape); and is finally
successfully coded in the fallback context U, for a total of just under nine bits.
This same pattern continues for several more characters, during which the zero-
order context is slowly building up a repertoire of symbols. When the second
"0" (in "now") is encountered, the zero-order context A can be used directly,
and a code of 3.32 bits is generated. The "w" following that second "0" is the
first to actually be attempted in a first-order context, and is coded successfully
in just one bit without recourse to the zero-order context. The "#" that follows
is also coded in a first-order context - the model has learned that "#" after "w"
is a combination worth noting.
The cheapest symbol in the entire message is the final "w", which again
follows an "0". That combination has been seen three prior times, and the
probability estimator is very confident in its predictions. On the other hand, the
most expensive symbol in the message is the final period "."; that character has
not been seen before at all, and the model is somewhat surprised when the "w"
is not followed by a "#", or if not a "#", then at least an "n".
There are two rather subtle points about the example that should be noted.
The first issue does not arise until the "b" is being coded. The first context
examined when coding "b" is the "#" context; but "b" is not available, and
an escape is transmitted, costing one bit. The next context to be tried is λ.
But given that processing arrived in this context by escaping from the context
"#", and given that "n" is available in the "#" context, we can be sure that the
next symbol is not an "n". That is, letter "n" should be temporarily excluded
from consideration while context λ is used [Cleary and Witten, 1984b]. To
achieve this, the frequency Pλ["n"] is temporarily reduced to zero, and the total
frequency count (used in the call to the arithmetic coder) reduced by the same
amount. Once the symbol has been coded the counts for excluded symbols
are reinstated, since they might not be excluded next time this state is used.
The column headed "Exc." in Table 8.1 shows the amount by which the total
frequency count for a state is reduced to allow for the exclusions being applied;
and the bit counts in the final two columns show the cost in bits of that encoding
step assuming that the frequency count Lx P[x] for the context is modified to

remove the excluded symbols from contention. For example, when processing
the "#" after the word "brown", the # in the A context is assigned an adjusted
probability of 1/(20-5), and is coded in -10g2(l/15) = 3.91 bits. The use of
a decrementing frequency count in the U context is also a form of exclusions.
In this case the exclusions are permanent rather than temporary, since there is
just one context that escapes into U.
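
That adjustment can be captured in a one-line calculation; the following fragment is
our illustration only, and reproduces the cost of the "#" coded after "brown".

import math

def coding_cost(count, total, excluded=0):
    # Ideal cost, in bits, of a symbol of frequency count in a context whose
    # total has been reduced by the counts of temporarily excluded symbols.
    return -math.log2(count / (total - excluded))

# Escape from context "n" (probability 1/2), then "#" in the zero-order
# context with the excluded count of 5 removed from the total of 20:
bits = coding_cost(1, 2) + coding_cost(1, 20, excluded=5)   # 1.00 + 3.91 bits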
As presented in Algorithm 8.2, function ppm_encode_symbol() does not
allow for exclusions except in context U - the implementation becomes rather
longer than one page if the additional complexity is incorporated. Without
exclusions, a certain amount of information capacity is wasted, making the
output message longer than is strictly necessary. In Table 8.1, taking out all of
the exclusions adds 1.5 bits to the total cost of the message. On the other hand,
calculating exclusions has an impact upon compression throughput, and they
are only beneficial when contexts shorter than max_order are being used, which
tends to only happen while the model is still learning which symbols appear in
which contexts. Once the model has stabilized into a situation in which most
symbols are successfully predicted in the longest context, no exclusions will be
applied, even if they are allowed.
The second subtle point to be noted in connection with Algorithm 8.2 is
called update exclusions [Moffat, 1990]. When the "w" at the end of the word
"now" is successfully predicted in the context "0", its frequency in that context
is incremented, and the probability distribution P"o" for that context changes
from p.·o .. [escape, "w"] = [1,1] to p.·o.. [escape, "w"] = [1,3]. At the same
time, it is tempting to also change the zero-order probability distribution for
context A, since another "w" has appeared in that context too. Prior to that "w"
the probability distribution P).. for context A is

P ).. [escape, "h" , "0" , "w" " "#" "n"] = [5 , 1, 3 , 1" 1 1] .

If λ is faithfully recording the zero-order self-probabilities in the message, then
the occurrence of "w" should cause a change to

Pλ[escape, "h", "o", "w", "#", "n"] = [5, 1, 3, 3, 1, 1].

In fact, we do not make this change, the rationale being that Pλ should not
be influenced by any subsequent "w" after "o" combinations, since they will
never require a "w" to be predicted in context λ. That is, we modify probability
distributions to reflect what is actually transmitted, rather than the frequency
distributions that would be arrived at via a static analysis. In Table 8.1, the
final probability distribution for context λ is

Pλ[escape, "h", "o", "w", "#", "n", "b", "r", "c", "."] = [9, 1, 7, 1, 3, 3, 1, 1, 1, 1],

reflecting the frequencies of the symbols that context λ was called upon to deal
with, not the frequencies of the symbols in the message.
Algorithm 8.2 as described already includes update exclusions, and since
full updating makes the implementation slower, there is no reason to try and
maintain "proper" conditional statistics in a PPM implementation.
An implementation issue deliberately left vague in Algorithm 8.2 is the
structure used to store the many contexts that must be maintained, plus their
associated probability distributions. The contexts are just strings of characters,
so any dictionary data structure, perhaps a binary search tree, could be used.
But the sequence of context searches performed is dictated by the string, and
use of a tailored data structure allows a more efficient implementation. Fig-
ure 8.2a shows such a context tree for the string "how#now#brown#", against
which a probability for the next letter of the example, character "c", must be
estimated. A first-order PPM model is again assumed.
Each node in the context tree represents the context corresponding to the
concatenation of the node labels on the path from the root of the tree through
to that node. For example, the lowest leftmost node in Figure 8.2a corresponds
to the string "ho", or more precisely, the appearance of "0" in the first-order
context "h". All internal nodes maintain a set of children, one child node for
each distinct character that has appeared to date in that context. In addition,
every node, including the leaves, has an escape pointer that leads to the node
representing the context formed by dropping the leftmost, or least significant
character, from the context. In Figure 8.2a escape pointers are represented with
dotted lines, but not all are shown. The escape pointer from the root node λ
leads to the special node for context U.
At any given moment in time, such as the one captured in the snapshot
of Figure 8.2a, one internal node of depth max_order in the context tree is
identified by a current pointer. Other levels in the tree also need to be indexed
from time to time, and so current can be regarded as an array of pointers, with
current [max_order] always the deepest active context node. In the example,
context "#" is the current[l] node.
To code a symbol, the current[max_order] node is used as a context, and
one of two actions carried out: if it has any children, a search is carried out for
the next symbol; and if there are no children, the pointer current[max_order - 1]
is set to the escape pointer from node current[max_order], and then that node
is similarly examined. The child search - whenever it takes place - has two
outcomes: either the symbol being sought is found as the label of a child, and
that fact is communicated to the decoder; or it is not, and the transfer to the next
lower current node via the escape pointer must be communicated. Either way,
an arithmetic code is emitted using the unnormalized probabilities indicated by
the current pointer for that level of the tree.

O.1 •..• 9.a.b.c.d •..• z.A •..•Z.=.+.(.) •..

- - - - current[1]

(a)

O.1 •..• 9.a.b.c.d ...• z.A •..• Z.=.+.(.) •..

current[1] - - - - -

(b)

Figure 8.2: Examples of first-order context trees used in a PPM implementation: (a)
after the string "how#now#brown#" has been processed; and (b) after the subsequent
"c" has been processed. Not all escape pointers and frequency counts are shown.

In the example, the upcoming "c" is first searched for at node current[1] =
"#", and again at current[0] = λ. In both cases an escape is emitted, first with
a probability of 3/6 (see Table 8.1), and then, allowing for the exclusions on
"n" and "b", with probability 7/(22 - 4). The two escapes take the site of
operations to node U, at which time the "c" is successfully coded. Two new
nodes, representing contexts "c", and "#c" are then added as children of the
nodes at which the escape codes were emitted. The deepest current pointer
is then set to the new node for "c", ready for the next letter in the message.
Figure 8.2b shows the outcome of this sequence of operations.
When the next symbol is available as a child of the current[max_order]
node, the only processing that is required is for the appropriate leaf to be se-
lected by an arithmetic code, and current[max_order] to be updated to the des-
tination of the escape pointer out of that child node. No other values of current
are required, but if they are, they can be identified in an on-demand manner
by tracing escape pointers from current[max_order]. That is, when the model
is operating well, the per-symbol cost is limited to one child search step, one
arithmetic coding step, and one pointer dereference.
Figure 8.2 shows direct links from each node to its various children. But
each node has a differing number of children, and in most programming lan-
guages the most economical way to deal with the set of children is to install
them in a dynamic data structure such as a hash table or a list. For character-
based PPM implementations, a linked list is appropriate, since for the major-
ity of contexts the number of different following symbols is small. A linked
list structure for the set of children can be accelerated by the use of a physi-
cal move-to-front process to ensure that frequently accessed items are located
early in the searching process. For larger alphabets, the set of children might
be maintained as a tree or a hash table. These latter two structures make it
considerably more challenging to implement exclusions, since the cumulative
frequency counts for the probability distribution are not easily calculated if
some symbols must be "stepped over" because they are excluded.
Standing back from the structures shown in Figure 8.2, the notion of state-
machine-based compression system becomes apparent. We can imagine an
interwoven set of states, corresponding to identifying contexts of differing
lengths; and of symbols driving the model from state to state via the edges
of the machine. This is exactly what the DMC compression algorithm does
[Cormack and Horspool, 1987]. Space limits preclude a detailed examination
of DMC here, and the interested reader is referred to the descriptions of Bell
et al. [1990] and Witten et al. [1999]. Suzanne Bunton [1997a,b] considers
state-based compression systems in meticulous detail in her doctoral disser-
tation, and shows that PPM and DMC are essentially variants of the same
fundamental process.

The idea embodied in the PPM process is a general one, and a relatively
large number of authors have proposed a correspondingly wide range of PPM
variants; rather more than we can hope to record here. Nevertheless, there are
a number of versions that are worthy of comment.
Lelewer and Hirschberg [1991] observed that it is not necessary for the
contexts to cascade incrementally. For example, escaping from an order-three
prediction might jump directly to an order-one state, bypassing the order-two
contexts.
Another possibility is for the contexts to be concatenated, so as to form one
long chain of symbol "guesses", in decreasing order of estimated probability
[Fenwick, 1997, 1998, Howard and Vitter, 1994b]. The selection of the appro-
priate member of the chain can then be undertaken using a binary arithmetic
coder. Yokoo [1997] describes a scheme along these lines that orders all pos-
sible next characters based upon the similarity of their preceding context to the
current context of characters.
Another development related to PPM is the PPM* method of Cleary and
Teahan [1997]. In PPM* there is no upper limit set on the length of the con-
ditioning context, and full information is maintained at all times. To estimate
the probability of the next character, the shortest deterministic context is em-
ployed first, where a context is deterministic if it predicts just one character.
If there is no deterministic context, then the longest available context is used
instead. In the experiments of Cleary and Teahan, PPM* attains slightly bet-
ter compression effectiveness than does a comparable PPM implementation.
But Cleary and Teahan also note that their implementation ofPPM* consumes
considerably more space and time than PPM, and it may be that some of the
compression gain is a consequence of the use of more memory. Suzanne Bun-
ton [1997a,b] has studied the PPM and PPM* mechanisms, as well as other
related modeling techniques, and describes an implementation that captures
the salient points of a wide range of PPM-like alternatives. Aberg et al. [1997]
have also experimented with probability estimation in the context of PPM.
Another important model which we do not have space to describe is context
tree weighting [Willems et al., 1995, 1996]. In broad terms, the idea of context
tree weighting is that the evidence accumulated by multiple prior conditioning
contexts is smoothly combined into a single estimation, whereas in the PPM
paradigm the estimators are cascaded, with low-order estimators being used
only when high-order estimators have already failed.
By employing pre-conditioning of contexts in a PPM-based model, and by
selecting dynamically amongst a set of PPM models for different types of text,
Teahan and Harper [2001] have also obtained excellent compression for files
containing English text.
Exactly how good is PPM? Figure 8.3 shows the compression rate attained

[Figure 8.3 appears here: a plot of compression rate against the length of the prefix of
bible.txt in bytes (from 10 to 1,000,000 on a logarithmic scale), with one curve per
model order; the legible legend entries are order 1, order 2, order 3, order 4, and order 6.]

Figure 8.3: Compressing the file bible.txt with various PPMD mechanisms.

by six different PPMD mechanisms on a text file containing a representation


of the King James Bible¹. This file, bible.txt, is a little over 4 MB long. To
generate the graph, prefixes from the file of different lengths were taken and
passed into a suite of PPMD programs (with exclusions implemented) with
differing values of max_order and no cap placed upon memory consumption.
For each different prefix length, the average compression rate attained over
the whole of that prefix is plotted as a function of the length of the prefix,
for each of the different PPM mechanisms. The smooth manner in which the
low-order curves peel away from the on-going improvement represented by
the high-order curves is a vindication of the claim that in PPM we have the
best of both worlds: there is limited compression loss on short files, while
on long files high-order models are free to realize their full potential. The
region around 10 kB where compression worsens shows the drawback of having
a model become over-confident in its predictions. That section of bible.txt
corresponds to the end of the book of Genesis, and the start of Exodus; and
a new set of names and other proper nouns must be absorbed into the model
probability estimates.
Despite this blip, good compression is achieved. Over the 4 MB of file
bible.txt, the sixth-order PPMD tested in these experiments obtains an av-
erage compression of 1.56 bits per character, remarkably close to the human-
entropy bounds for English text that were commented upon in Section 2.3. For
more than ten years PPM -based compression mechanisms have been the tar-
get against which other techniques have been measured [Bell et al., 1990, page
¹Part of the Large Canterbury Corpus, available from corpus.canterbury.ac.nz/.

261], and recent experiments - and recent implementations, such as the PPMZ
of Charles Bloom [1996] (see also www.cbloom.com/src/ppmz.html)-
continue to confirm that superiority.
The drawback of PPM-based methods is the memory space consumed.
Each node in the context tree stores four components - a pointer to the next
sibling; an escape pointer; an unnormalized probability; and the sum of the fre-
quency counts of its children - and requires four words of memory. The number
of nodes in use cannot exceed max_order x m, and will tend to be rather less
than this in practice; nevertheless, the number of distinct five character strings
in a text processed with max_order = 4 might still be daunting.
One way of controlling the memory requirement is to make available a pre-
determined number of nodes, and when that limiting value is reached, decline
to allocate more. The remaining part of the message is processed using a model
which is structurally static, but still adaptive in terms of probability estimates.
Another option is to release the memory being used by the data structure, and
start again with a clean slate. The harsh nature of this second strategy can be
moderated by retaining a circular buffer of recent text, and using it to boot-strap
states in the new structure - for example, the zero-order or first-order predic-
tions in the new context tree might be initialized based upon a static analysis
of a few kilobytes of retained text. As is amply demonstrated by Figure 8.3, as
little as 10 kB of priming text is enough to allow a flying start.
Table 8.2 shows the results of experiments with the most abrupt of these
strategies, the trash-and-start-again approach. Each column in the table shows
the compression rates attained in bits per character when the PPMD imple-
mentation was limited to that much memory; the rows correspond to differ-
ent values of max_order. To obtain the best possible compression on the file,
32 MB of memory is required - eight times more than is occupied by the file it-
self. Smaller amounts of memory still allow compression to proceed, but use of
ambitious choices of max_order in small amounts of memory adversely affects
compression effectiveness. On the other hand, provided a realistic max_order
is used, excellent compression is attainable in a context tree occupying as little
as 1 MB of memory.

8.3 Burrows-Wheeler transform


The Burrows-Wheeler transform (BWT) is currently the nearest challenger to
the dominance of PPM in terms of compression effectiveness; and it has a
number of implementation advantages that make it a pragmatic choice. The
BWT is an innovative - and rather surprising - mechanism for permuting the
characters in an input message so as to allow more effective prediction of char-
acters and hence better compression. Developed by two researchers at the Dig-

max_order                  Memory limit (MB)
                1      2      4      8     16     32     64
    1        3.38
    2        2.44
    3        1.91   1.90
    4        1.85   1.71   1.66
    5        2.02   1.81   1.67   1.60   1.58
    6        2.18   1.96   1.78   1.66   1.59   1.56
    7        2.31   2.09   1.90   1.76   1.66   1.58   1.56

Table 8.2: Compression effectiveness in bits per character achieved by a memory-
constrained PPMD implementation on the file bible.txt. When only limited main
memory is available, smaller values of max_order should be used. The rightmost entry
in each row corresponds to an amount of memory that was sufficient that no model
rebuildings were required; adding further memory results in identical compression
effectiveness. The gray entry shows the best performance in each column. No gray
box is drawn in the 64 MB column because the order-six mechanism with 32 MB
slightly outperforms the order-seven implementation with 64 MB, even though both
suffer no model rebuildings within those memory limits.

ital Equipment Corporation nearly ten years after PPM [Burrows and Wheeler,
1994], the BWT has since been the subject of intensive investigation into how
it should best be exploited for compression purposes [Balkenhol et al., 1999,
Chapin, 2000, Effros, 2000, Fenwick, 1996a, Sadakane, 1998, Schindler, 1997,
Volf and Willems, 1998, Wirth and Moffat, 2001], and is now the basis for a
number of commercial compression and archiving tools.
Before examining the modeling and coding aspects of the method, the
fundamental operation employed by the BWT needs to be understood. Fig-
ure 8.4 illustrates - in a somewhat exaggerated manner - the operations that
take place in the encoder. The message in the example is the simple string
"how#now#brown#cow."; for practical use the message is, of course, thousands
or millions of characters long.
The first step is to create all rotations of the source message, as illustrated in
Figure 8.4a. For a message of m symbols there are m rotated forms, including
the original message. In an actual implementation these rotated versions are
not formed explicitly, and it suffices to simply create an array of m pointers,
one to each of the characters of the message.
The second step, shown in Figure 8.4b, is to sort the set of permutations us-
ing a reverse-lexicographic ordering on the characters of the strings, starting at
the second-to-last character and moving leftward through the strings. Hence, in

how#now#brown#cow.    ow.how#now#brown#:c    c
ow#now#brown#cow.h    ow#brown#cow.how#:n    n
w#now#brown#cow.ho    rown#cow.how#now#:b    b
#now#brown#cow.how    ow#now#brown#cow.:h    h*
now#brown#cow.how#    own#cow.how#now#b:r    r
ow#brown#cow.how#n    w.how#now#brown#c:o    o
w#brown#cow.how#no    w#now#brown#cow.h:o    o
#brown#cow.how#now    w#brown#cow.how#n:o    o
brown#cow.how#now#    cow.how#now#brown:#    #
rown#cow.how#now#b    .how#now#brown#co:w    w
own#cow.how#now#br    #now#brown#cow.ho:w    w
wn#cow.how#now#bro    #brown#cow.how#no:w    w
n#cow.how#now#brow    n#cow.how#now#bro:w    w
#cow.how#now#brown    wn#cow.how#now#br:o    o
cow.how#now#brown#    how#now#brown#cow:.    .
ow.how#now#brown#c    now#brown#cow.how:#    #
w.how#now#brown#co    brown#cow.how#now:#    #
.how#now#brown#cow    #cow.how#now#brow:n    n
       (a)                    (b)            (c)

Figure 8.4: Example of the Burrows-Wheeler transformation on the source message
"how#now#brown#cow.": (a) rotated strings before being sorted; (b) rotated strings
after being sorted, with the last character in each string separated by a colon from the
corresponding sort key; and (c) transformed message, with first character asterisked.

Figure 8.4b, the three rotated forms that have "#" as their second-to-last sym-
bol appear first, and those three are ordered by the third-to-last characters, then
fourth-to-last characters, and so on, to get the ordering shown. For clarity, the
last character of each rotated form is separated by a colon from the earlier ones
that provide the sort ordering. As for the first step, pointers are manipulated
rather than multiple rotated strings, and during the sorting process the original
message string remains unchanged. Only the array of pointers is altered.
The third step of the BWT is to isolate the column of "last" characters (the
ones to the right of the colon in Figure 8.4b) in the list of sorted strings, and
transmit them to the decoder in the order in which they now appear. Since
there are m rotated forms, there will be m characters in this column, and hence
m characters to be transmitted. Indeed, exactly the same set of m charac-
ters must be transmitted as appears in the original message, since no charac-
ters have been introduced, and every column of the matrix of strings, includ-
ing the last, contains every character of the input message. In the example,
the string "cnbhrooo#wwwwo.##n" listed in Figure 8Ac must be transmit-
ted to the decoder. Also transmitted is the position in that string of the first

c ← 1       1 ← # ← 9       6 ← c ← 1
n ← 2       2 ← # ← 16      8 ← n ← 2
b ← 3       3 ← # ← 17      5 ← b ← 3
h*← 4       4 ← . ← 15      7 ← h*← 4
r ← 5       5 ← b ← 3      14 ← r ← 5
o ← 6       6 ← c ← 1      10 ← o ← 6
o ← 7       7 ← h*← 4      11 ← o ← 7
o ← 8       8 ← n ← 2      12 ← o ← 8
# ← 9       9 ← n ← 18      1 ← # ← 9
w ← 10     10 ← o ← 6      15 ← w ← 10
w ← 11     11 ← o ← 7      16 ← w ← 11
w ← 12     12 ← o ← 8      17 ← w ← 12
w ← 13     13 ← o ← 14     18 ← w ← 13
o ← 14     14 ← r ← 5      13 ← o ← 14
. ← 15     15 ← w ← 10      4 ← . ← 15
# ← 16     16 ← w ← 11      2 ← # ← 16
# ← 17     17 ← w ← 12      3 ← # ← 17
n ← 18     18 ← w ← 13      9 ← n ← 18
  (a)           (b)              (c)

Figure 8.5: Decoding the message "cnbhrooo#wwwwo.##n": (a) permuted string re-
ceived by decoder, with position numbers appended; (b) after stable sorting by charac-
ter, with a further number prepended; and (c) after reordering back to received order.
To decode, start at the indicated starting position to output "h"; then follow the links
to character 7 and output "o"; character 11 and output "w"; and so on until position 4
is returned to.

character of the message, which is why the character "h" has been tagged
with an asterisk in Figure 8.4c. That is, the actual message transmitted is
("cnbhrooo#wwwwo.##n", 4). Since this permuted message contains the same
symbols as the original and in the same ratios, it may not be clear yet how
any benefit has been gained. On the other hand if the reader has already ab-
sorbed Section 6.7, which discusses recency transformations, they will have an
inkling as to what will eventually happen to the permuted text. Either way, for
the moment let us just presume that compression is going to somehow result.
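
For readers who want to experiment, the forward transform of Figure 8.4 can be
expressed in a few lines of Python. This sketch is ours (an actual implementation
manipulates an array of pointers and a specialized sort); it follows the same convention
as the figure, sorting each rotation on its first m - 1 characters read from right to left.

def bwt_forward(M):
    m = len(M)
    rotations = [M[j:] + M[:j] for j in range(m)]
    # Sort key: all but the last character of the rotation, reversed.
    order = sorted(range(m), key=lambda j: rotations[j][:-1][::-1])
    permuted = ''.join(rotations[j][-1] for j in order)
    # The rotation starting at M[1] ends with M[0]; its 1-based position in
    # the sorted list is the value transmitted as "first".
    first = order.index(1 % m) + 1
    return permuted, first

# bwt_forward("how#now#brown#cow.") returns ("cnbhrooo#wwwwo.##n", 4),
# matching Figure 8.4c.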
To "decode" the permuted message it must be unscrambled by inverting
the permutation, and perhaps the most surprising thing about the BWT is not
that it results in compression, but that it can be reversed simply and quickly.
Figure 8.5 continues with the example of Figure 8.4, and shows the decoding
process. Again, the description given is intended to be illustrative, and as we
shall see below, in an actual implementation the process is rather more terse
and correspondingly more efficient than that shown initially.

Algorithm 8.3
Calculate and return the inverse of the Burrows-Wheeler transformation,
where M'[1...m] is the permuted text over an alphabet of n symbols, and
first is the starting position.
bwt_inverse(M', m, first)
 1: for i ← 1 to n do
 2:     set freq[i] ← 0
 3: for c ← 1 to m do
 4:     set freq[M'[c]] ← freq[M'[c]] + 1
 5: for i ← 2 to n do
 6:     set freq[i] ← freq[i] + freq[i - 1]
 7: for c ← m down to 1 do
 8:     set next[c] ← freq[M'[c]] and freq[M'[c]] ← freq[M'[c]] - 1
 9: set s ← first
10: for i ← 1 to m do
11:     set M[i] ← M'[s]
12:     set s ← next[s]
13: return M

First, the characters of the received message are numbered, as shown in
Figure 8.5a. The character-number pairs are then sorted by character value,
using a stable sorting mechanism that retains the number ordering in the case of
ties on character. The resulting list is again numbered sequentially, as indicated
by the leftmost column in Figure 8.5b. Finally, the list of number-character-
number triples is returned to the original ordering, which is still indicated by
the third component of each triple. This leads to the arrangement shown in
Figure 8.5c.
We are now ready to decode. The transmission included explicit indication
that the first symbol of the source message was the fourth of the permuted
text, so the "h" can be written. Then the arrows in Figure 8.Sc are followed.
According to the arrows the symbol after the fourth is the seventh, and so the
seventh symbol in the table, which is an "0", is output, and the arrows followed
again to the eleventh symbol, which is "w"; then the sixteenth, "#"; and so on.
This simple process of following the arrows proceeds until the string reaches
its starting point again, with the fifteenth symbol (the period, ".") leading to the
fourth. This sequence recreates the original message "how#now#brown#cow."
In an actual implementation the sorting and indexing is done rather more
elegantly than is indicated in Figure 8.5. All that is required is that the frequency
of each of the different alphabet symbols be established, and then the symbols
of the message numbered off from the entries in a table of cumulative
frequencies. This process is sketched as function bwt_inverse() in Algorithm 8.3.


Figure 8.6: Part of the result of applying the Burrows-Wheeler transformation to a
version of the King James Bible: (a) sorted strings and following characters; and (b)
the corresponding MTF values for the characters in the permuted text. Blank characters
are indicated by #, newline characters by |, and void characters prior to the start of the
string by >. Characters with a gray background are common to the previous string.

Three linear-time scans of the permuted message are required: one to
count symbol frequencies and establish a table freq of cumulative frequencies
(steps 3 to 6); a second to assign the next value for each symbol in the permuted
text (the value indicated in the leftmost column of Figure 8.5c); and then a third
(steps 10 to 12) to unscramble the text and reproduce the original message. The
terse simplicity of this decoding process is an important contributing factor to
the speed and popularity of the method.
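
Readers who want to run the inversion directly can use the following C rendering of Algorithm 8.3. It is a sketch of our own, assuming a byte alphabet (n = 256) and zero-based array indexing, so the argument first is one less than the 1-based value used in the figures; the name bwt_inverse() is ours and is not part of any library.

#include <stdlib.h>

/* A sketch of Algorithm 8.3 in C: invert the BWT of an m-byte message.
   perm[0..m-1] is the permuted text, and first is the 0-based position of the
   first character of the original message within perm. Caller frees result. */
unsigned char *bwt_inverse(const unsigned char *perm, int m, int first)
{
    int freq[256] = { 0 };
    int *next = malloc((size_t)m * sizeof *next);
    unsigned char *msg = malloc((size_t)m + 1);

    for (int c = 0; c < m; c++)          /* count symbol frequencies */
        freq[perm[c]]++;
    for (int i = 1; i < 256; i++)        /* convert to cumulative frequencies */
        freq[i] += freq[i - 1];
    for (int c = m - 1; c >= 0; c--)     /* link each position to its successor */
        next[c] = --freq[perm[c]];
    for (int i = 0, s = first; i < m; i++) {   /* follow the links */
        msg[i] = perm[s];
        s = next[s];
    }
    msg[m] = '\0';
    free(next);
    return msg;
}

Calling this function on the permuted string of Figure 8.5 with first set to 3 (the zero-based equivalent of 4) reproduces "how#now#brown#cow.".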
Now it is clear that the BWT can be reversed, let us return to the problem
of transmitting the permuted message M'. Since it contains all of the origi-
nal symbols, with exactly the same frequencies as they appear in the original
message, it might at first appear that nothing has been gained by the transfor-
mation. But consider again the text shown in Figure 8.4b and Figure 8.4c. All
of the "w" characters in the permuted message appear as a single run, and do
so as a direct consequence of the fact that they all follow "0" characters in the
source message. This clustering of like symbols is exactly the effect exploited

d                    147          1                    294
l                      4          2                    134
p                      8          3                    113
r                    137          4                     78
t                    157          5                     23
v                    191          6                      2
Total symbols        644          Total symbols        644
Entropy              2.10         Entropy              1.99
Minimum-redundancy   2.25         Minimum-redundancy   2.08
        (a)                               (b)

Figure 8.7: Frequencies of symbols and costs of coding according to those fre-
quencies in the context "#the#hea" in the Bible: (a) distribution of character val-
ues, the zero-order entropy of that distribution, and the expected cost of a minimum-
redundancy code for that distribution; and (b) distribution of the corresponding MTF
values, the zero-order entropy of that distribution, and the expected cost of a minimum-
redundancy code for that distribution.

by the MTF transformation described in Section 6.7 on page 170, and it is MTF
values for the permuted sequence that are actually transmitted.
On small examples such as the one in Figure 8.4 the clustering of symbols
is definitely noticeable. For more substantial source messages it becomes even
more compelling. Figure 8.6 shows part of the result of applying the BWT
to file bible. txt. A small fraction of the permuted message text is shown,
together with the implicit contexts that form the sort key and the MTF values
generated. In Figure 8.6 all of the preceding strings end in "#the#hea" (indeed,
in the section shown they all actually end in "d#the#hea"), and in total the
4,047,392 characters in this particular version of the Bible contain exactly 644
occurrences of this "#the#hea" context. In those 644 occurrences there are
just six different characters that follow, namely "d", "1", "p", "r", "t", and
"v". Figure 8.7a shows the frequencies of these six characters, and the self-
information of their frequency distribution. If the character frequencies shown
in Figure 8.7a are used to directly drive an arithmetic coder to transmit the
644 symbols, compression of 2.10 bits per character results. With a minimum-
redundancy code, 2.25 bits per character is attainable.
The alternative - using the MTF transformation to generate an equivalent
2 And in Figure 6.6 on page 171: the reader is now invited to apply function bwt_inverse()
and determine the corresponding source message M that generated Figure 6.6a, starting with
the eighth character.

message of symbol ranks - is summarized in Figure 8.7b. Quite remarkably, the
six characters that appear in this eight-character context are all already within
the first six positions of the MTF array, moved there by previous steps; and
the largest MTF value required over the 644 "#the#hea" contexts is just 6.
As can be seen, for this particular context the MTF codes provide the more
economical representation. Indeed, the MTF distribution is sufficiently skewed
that the minimum-redundancy code presumed in the last row of Figure 8.7b is
a unary code.
Over the whole Bible the MTF transformation reduces the permuted stream
of characters to a stream of integers that has a self-information of 1.99 bits per
character, and which can be coded in 2.10 bits per character by a minimum-
redundancy code. Figure 8.8 shows the probability distribution for the MTF
values generated for the BWT-permuted Bible, and their strongly decreasing
nature. Figure 8.8 also shows the implied frequency distributions for which
the unary and Cγ codes would be optimal, calculated using the relationship
p_i = m/2^{l_i} between frequency p_i and codelength l_i for the ith distinct symbol
in a message totalling m symbols.
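
To make the relationship concrete, the following small C program (an illustration of our own; the constant 4,047,392 is simply the character count of the Bible file quoted earlier) prints the frequencies that the unary and Cγ codeword lengths imply for the first few MTF ranks.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double m = 4047392.0;                 /* symbols in the BWT-permuted Bible */
    for (int i = 1; i <= 8; i++) {
        int len_unary = i;                                    /* unary codeword length */
        int len_gamma = 1 + 2 * (int)floor(log2((double)i));  /* Elias C-gamma length */
        printf("rank %d: unary implies %8.0f, gamma implies %8.0f\n",
               i, m / pow(2.0, len_unary), m / pow(2.0, len_gamma));
    }
    return 0;
}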
It is clear that the MTF transformation successfully captures the bulk of
the temporal locality present in the permuted sequence, and that representing
the MTF values with an arithmetic coder, or even a Cγ coder, yields excellent
compression. Any residual local variations correspond to different segments
in the BWT string, with their different subalphabets; and use of a structured
arithmetic coder (Section 6.9 on page 177) allows compression rates superior
to the self-information.
There is a strong correspondence between the BWT and the PPM-style
models described earlier in this chapter [Cleary and Teahan, 1997, Effros,
1999]. Suppose, for example, that an order-8 PPM model is being used, and
that a context for "#the#hea" has been created. Over the complete message that
context is visited exactly 644 times, and the stream of characters coded in that
context, while permuted compared to the BWT fragment shown in Figure 8.6,
would nevertheless contribute to the output bitstream a number of bits deter-
mined by the frequencies listed in Figure 8.7a. For example, there would be
exactly six uses of the escape symbol in this context, one for each of the novel
successor characters. The only difference between a PPM-based compression
system and a BWT-based one is that an MTF transformation is used with the
BWT to implicitly recognize and exploit contexts, and only one context is ac-
tive at any given time; whereas in a PPM implementation, the contexts are
formed explicitly, context shifts are signalled formally to the decoder by the
use of escape codes, and a large number of contexts are concurrently active,
since none are ever dispensed with.
The observation that the two methods are related leads to further improve-

[Plot: MTF value (1 to 64) on the horizontal axis against frequency (10 to 1,000,000, logarithmic scale) on the vertical axis, showing the observed frequencies for the Bible together with the frequencies implied by the gamma and unary codes.]

Figure 8.8: Frequency of MTF ranks for BWT-permuted text for file bible.txt,
together with MTF frequencies presumed by the codeword lengths used in the unary
and Cγ codes.

ments in the BWT mechanism. Since the permuted text is essentially the con-
catenation of the outputs from whatever number of contexts are present in the
underlying state machine, it is appropriate to try to parse the permuted text into
same-context sections, and then code each section with a zero-order entropy
coder rather than via the MTF transformation. One obvious way of sectioning
the permuted text is to simply assert that each run of like characters is the result
of the underlying machine being in a single context. With only one symbol in
the output alphabet for each context, the coding of the string emitted by that
context is achieved by simply describing its length. The subalphabet selec-
tion component of the housekeeping details for that state (see Section 4.8 on
page 81) can still be accomplished through the use of an MTF selector value.
This combination leads to the permuted text being represented as a sequence
of "MTF value, number of repetitions of that symbol" pairs, and if these two
components are independently coded using a structured arithmetic coder, com-
pression on the Bible of 1.70 bits per character is possible. By way of contrast,
on the same file an implementation of PPM (with max_order = 6, and using
escape method D and 64 MB of memory for the data structure) obtains a com-
pression effectiveness of 1.56 bits per character, so the sectioning heuristic is
still somewhat simplistic.
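
For concreteness, the run/MTF pairing described two paragraphs above can be mimicked in a few lines. The sketch below is our own, operates on the short example string from Figure 8.5 rather than on the Bible, and omits the subsequent entropy coding of the pairs.

#include <stdio.h>
#include <string.h>

/* Return the MTF rank of c and move it to the front of the list. */
static int mtf_rank(unsigned char c, unsigned char *list)
{
    int r = 0;
    while (list[r] != c)
        r++;
    memmove(list + 1, list, (size_t)r);
    list[0] = c;
    return r;
}

int main(void)
{
    const char *permuted = "cnbhrooo#wwwwo.##n";
    unsigned char list[256];
    for (int i = 0; i < 256; i++)
        list[i] = (unsigned char)i;        /* initial MTF list: byte order */
    for (int i = 0; permuted[i] != '\0'; ) {
        int j = i;
        while (permuted[j + 1] == permuted[i])
            j++;                           /* extend the run of identical symbols */
        printf("(%d,%d) ", mtf_rank((unsigned char)permuted[i], list), j - i + 1);
        i = j + 1;
    }
    putchar('\n');
    return 0;
}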
By using more elegant sectioning techniques based upon the self-entropy
of a small sliding window in the permuted text, better compression can be
obtained [Wirth and Moffat, 2001]. For example, the permuted text shown in

Figure 8.6 might be regarded as being emitted from two different states - one
giving rise to the string "rvrvvvvvvvr", and then a second generating "dtdtt".
(It might equally be possible that no such distinction would be made, since
the string of characters prior to the section in the figure is "vdtrvvdrrvvtdr",
and the string following is "vtvdtvrrdvvrrddvdrr".) Each of the sections then
becomes a miniature coding problem - exactly as it is in each context of a PPM
compression system.
The best of the BWT-based compression mechanisms are very good indeed.
Michael Schindler's [1997] SZIP program compresses the file bible.txt to
1.53 bits per character (see the results listed at corpus.canterbury.ac.nz/
results/large.html), and the public-domain BZIP2 program of Julian
Seward and Jean-loup Gailly [Seward and Gailly, 1999] is not far behind,
with a compression rate of 1.67 bits per character. (At time of writing, the best
compression rate we know of for bible.txt is 1.47 bits per character, ob-
tained by Charles Bloom's PPMZ program, and reported to the authors by
Jorma Tarhio and Hannu Peltola.) Note that the compression effectiveness of
BWT systems is influenced by the choice of block size used going into the
BWT; in the case of SZIP, the block size is 4.3 MB, whereas BZIP2 uses a
smaller 900 kB block size. One common theme amongst these "improved"
BWT implementations is the use of ranking heuristics that differ slightly from
simple MTF. Chapin [2000] analyses a number of different ranking mech-
anisms for BWT compression, concluding that the most effective compres-
sion is achieved by employing a mechanism that switches between two differ-
ent rankers, depending on the nature of the text being compressed [Volf and
Willems, 1998]. Deorowicz [2000] has also investigated BWT variants.
The implementation ofBZIP2 draws together many of the themes that have
been discussed in this section and in this book. After the BWT and MTF trans-
formations the ranks are segmented into "blocklets" of 50 values. Each block-
let is then entropy coded using one of six semi-static minimum-redundancy
codes derived for that block of (typically) 900 kB. The six semi-static codes
are transmitted at the beginning of the block, and are chosen by an iterative
clustering process that seeds six different probability distributions; evaluates
codes for those six distributions; assigns each blocklet to the code that mini-
mizes the cost of transmitting that blocklet; and then reevaluates the six dis-
tributions based upon the ranks in the blocklets assigned to that distribution.
This process typically converges to a fixed point within a small number of iter-
ations; in the coded message each blocklet is then preceded by a selector code
that indicates which of the six minimum-redundancy codes is being used. The
use of multiple probability distributions and thus codes allows sensitivity to the
nature of the localized probability distribution within each segment; while the
use of semi-static minimum-redundancy coding allows for quick encoding and

decoding compared to alternatives such as structured arithmetic coding.
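
The assignment step of that clustering loop can be sketched as follows. The function, the array shapes, and the use of static codeword lengths as the cost measure are simplifications of our own, not the exact BZIP2 code.

#define NCODES 6
#define NSYMS  258   /* illustrative alphabet size for the post-MTF symbols */

/* Assign each blocklet to whichever of the six codes encodes it most cheaply.
   freq[b][s] is the count of symbol s in blocklet b, and len[c][s] is the
   codeword length of symbol s under candidate code c. */
void assign_blocklets(int nblocklets, int freq[][NSYMS],
                      int len[NCODES][NSYMS], int assign[])
{
    for (int b = 0; b < nblocklets; b++) {
        long best = -1;
        for (int c = 0; c < NCODES; c++) {
            long cost = 0;
            for (int s = 0; s < NSYMS; s++)
                cost += (long)freq[b][s] * len[c][s];
            if (best < 0 || cost < best) {
                best = cost;
                assign[b] = c;
            }
        }
    }
}

A complete iteration would then recount the symbol frequencies within each of the six groups and rebuild the six codes before repeating.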


Considerable thought has also been put into the efficiency of BZIP2 [Se-
ward, 2000, 2001]. As we have seen, the BWT decoder is simpler to imple-
ment than it is to understand. But what about the BWT encoder? The way it
was described above, it should also be simple to implement. It forms an ar-
ray of character pointers, calls a sorting function to order the array of pointers,
and reads off the characters that follow each of the pointers in the sorted array,
sending them to either an MTF coder, or a mechanism for isolating states and
coding with respect to those states.
The sorting stage is quite difficult to implement efficiently. Certainly the
BWT can be effected by calling a library Quicksort function such as the stan-
dard C function qsort(). But the m objects being ordered are strings of m
characters, and the normally-quoted O(m log m) expected or worst-case run-
ning time of an efficient sorting algorithm is actually a count on the number of
string comparisons, each of which might, in a pathological situation, require
O(m) character comparisons. For example, if the input message contains long
repeated patterns - and after all, this is exactly what we hope any compres-
sion system will find - then each string comparison required during the sorting
process may involve tens or hundreds of character comparisons, greatly slow-
ing the BWT. Ultimately, when called upon to sort m identical strings, each
of which contains m identical characters (which happens if the input message
to the BWT consists of m identical characters) even a well-engineered imple-
mentation of Quicksort may require excessive time.
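
To make the efficiency concern concrete, here is a deliberately naive encoder of our own that does exactly what the paragraph above warns about: it hands the rotation start points to the library qsort() and lets every comparison walk along the two rotations. The comparator follows the convention of Figure 8.4, ranking each position by the characters that precede it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const unsigned char *text;    /* the message being transformed */
static int m;                        /* its length */

/* Order two positions by the characters that precede them (most recent first),
   as in Figure 8.4; a single comparison may inspect up to m - 1 characters. */
static int cmp_context(const void *a, const void *b)
{
    int i = *(const int *)a, j = *(const int *)b;
    for (int k = 1; k < m; k++) {
        int ci = text[(i - k + m) % m], cj = text[(j - k + m) % m];
        if (ci != cj)
            return ci - cj;
    }
    return 0;
}

int main(void)
{
    text = (const unsigned char *)"how#now#brown#cow.";
    m = (int)strlen((const char *)text);
    int *pos = malloc((size_t)m * sizeof *pos);
    int first = 0;
    for (int i = 0; i < m; i++)
        pos[i] = i;
    qsort(pos, (size_t)m, sizeof *pos, cmp_context);
    for (int i = 0; i < m; i++) {
        putchar(text[pos[i]]);       /* emits "cnbhrooo#wwwwo.##n" */
        if (pos[i] == 0)
            first = i + 1;           /* 1-based position of the tagged "h" */
    }
    printf("\nfirst = %d\n", first); /* prints 4 */
    free(pos);
    return 0;
}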
Clearly, for the special case of sorting strings to form such a suffix ar-
ray, a better mechanism is required. One such technique, due to Bentley and
Sedgewick [1997], is ternary Quicksort, which examines just one next charac-
ter per string at each partitioning step, and performs a three-way split to addi-
tionally isolate a subsection of the original array in which all strings agree in
this character position. That is, if a three-way partitioning is undertaken based
upon the kth character, in the next set of recursive calls one of the partitions
processes character k + 1, and two continue examining position k.
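
A compact sketch of that three-way partitioning, written for an array of NUL-terminated strings (for example, the suffixes of a message to which a unique terminator has been appended), is shown below. The code is ours rather than Bentley and Sedgewick's, and the pivot selection is deliberately simplistic.

#include <stdio.h>

static void str_swap(const unsigned char **a, const unsigned char **b)
{
    const unsigned char *t = *a; *a = *b; *b = t;
}

/* Ternary string quicksort: sort s[0..n-1] by the characters at position depth
   and beyond, examining just one character per partitioning step. */
static void ternary_qsort(const unsigned char **s, int n, int depth)
{
    if (n <= 1)
        return;
    int lt = 0, i = 1, gt = n - 1;
    int pivot = s[0][depth];
    while (i <= gt) {                       /* three-way partition on one character */
        int c = s[i][depth];
        if (c < pivot)       str_swap(&s[lt++], &s[i++]);
        else if (c > pivot)  str_swap(&s[i], &s[gt--]);
        else                 i++;
    }
    ternary_qsort(s, lt, depth);                     /* smaller at this position */
    if (pivot != 0)                                  /* equal: advance to depth + 1 */
        ternary_qsort(s + lt, gt - lt + 1, depth + 1);
    ternary_qsort(s + gt + 1, n - gt - 1, depth);    /* larger at this position */
}

int main(void)
{
    const unsigned char *s[] = {
        (const unsigned char *)"now#brown", (const unsigned char *)"how#now",
        (const unsigned char *)"how#brown", (const unsigned char *)"cow.",
    };
    ternary_qsort(s, 4, 0);
    for (int i = 0; i < 4; i++)
        puts((const char *)s[i]);
    return 0;
}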
More specific algorithms for generating suffix arrays have been developed
by Manber and Myers [1990], Larsson [1996, 1999], and Sadakane [1998].
BZIP2 uses a digital sorting process documented in the work of Seward [2000].
The BWT has also been applied to symbol streams representing words
rather than characters [Isal and Moffat, 2001a, Isal et al., 2002] and small ad-
ditional compression gains result. Words in English are also correlated, and
while the conditioning is less pronounced than at the character level, there is,
nevertheless, an effect to be exploited - consider "text compression", "Prime
Minister", and "United States", for example - and around 25% of the post-
BWT symbols are found at rank position number one in the MTF list. While

this is less than the 68% recorded for a character-based BWT (Figure 8.8), it
is still a strong correlation. Isal et al. [2002] report a compression rate of 1.49
bits per character on file bible. txt using a word-based parsing scheme, a
BWT, an MTF-like ranking heuristic, and a structured arithmetic coder. Word-
based PPM schemes have also been considered [Moffat, 1989], but the fact
that all contexts are simultaneously incomplete militates against an efficient
implementation, and the BWT approach has a definite advantage in terms of
resource usage.
The drawback of the BWT - word-based or character-based - is that it
must operate off-line. A whole block of message data must be available before
any code for any symbol in that block can be emitted. On the other hand, a
PPM mechanism can emit the codewords describing a symbol as soon as that
symbol is made available to the encoder, and so the latency is limited to that of
the arithmetic coder.
In summary, BWT-based compression mechanisms are based upon a sim-
ple transformation that can be both computed and inverted using relatively
modest amounts of memory and CPU time, and yet which has the power to
rival the compression effectiveness achieved by the best of the context-based
schemes. It is little wonder that in the eight years since it was first developed the
BWT has been employed in both a range of commercial compression systems
such as the STUFFIT product of Aladdin Systems Inc., a widely used compres-
sion and archiving tool that operates across a broad range of hardware platforms
(see www.aladdinsys.com); and freely available software tools such as the
BZIP2 program described above, which was developed by Julian Seward in
collaboration with Jean-loup Gailly (sourceware.cygnus.com/bzip2/).
And just in case the reader wishes to confirm their calculations: the BWT
string in Figure 6.6a is derived from the source message

"peter#piper#picked#a#peck#of#pickled#peppers."

8.4 Other compression systems


The three compression systems described in this chapter are general-purpose,
in that all they assume of the input message is that it is a stream of bytes. And
since everything stored on a computer is represented as a stream of bytes one
way or another, all three can be used on any data.
But tailored mechanisms for specific data types can do better. This section
briefly considers a range of specific data types, and compression systems that
target these types. Our intention is to show the variety of modeling techniques
that have been developed over the years, and the way in which each places
different requirements on the coder used to support it.

The first of the non-text data types is bi-level (black and white) images. The
raw form of a bi-level image can be thought of as a sequence of bytes, with each
byte storing eight pixel values from one raster row of the image. Because of
this linearization, pixels that are adjacent in the image, but in different raster
rows, are widely separated in the file representing that image.
Clearly, a two-dimensional context structure is appropriate. In work that is
intimately connected to the genesis of arithmetic coding, Rissanen and Lang-
don [1981] show that use of a two-dimensional template of pixels above, and
to the left of, the pixel in question provides a suitable conditioning context.
That is, each pixel can be coded in a conditioning context established by a
small number of neighboring pixels (typically seven or ten) drawn from quite
different parts of the linear file representing the image.
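
By way of illustration, a conditioning context of this kind might be computed as follows. The seven-pixel template used here is hypothetical, chosen only to show the shape of the computation, and is not the template of any particular standard.

/* Treat out-of-image positions as white (0). */
int pixel_at(const unsigned char *img, int rows, int cols, int r, int c)
{
    if (r < 0 || c < 0 || r >= rows || c >= cols)
        return 0;
    return img[r * cols + c] ? 1 : 0;
}

/* Combine seven previously coded neighbors of pixel (r, c) into one of 128
   conditioning contexts; every template offset precedes (r, c) in raster order. */
int context_of(const unsigned char *img, int rows, int cols, int r, int c)
{
    static const int dr[] = { -2, -2, -1, -1, -1,  0,  0 };
    static const int dc[] = { -1,  0, -1,  0,  1, -2, -1 };
    int ctx = 0;
    for (int i = 0; i < 7; i++)
        ctx = (ctx << 1) | pixel_at(img, rows, cols, r + dr[i], c + dc[i]);
    return ctx;
}

The binary coder would then maintain a separate probability estimate for each of the 128 possible context values.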
Such techniques have been central to all subsequent bi-level image com-
pression techniques, including several standards. The coder required in such
a system must deal with perhaps thousands of conditioning contexts, and a
source alphabet of cardinality two, usually expressed as the two choices "the
next symbol is indeed the one estimated by the model as being the more proba-
ble symbol (MPS)", and "no, the next symbol is the one estimated as being the
less probable symbol (LPS)". Unsurprisingly, this combination of requirements
leads naturally to binary arithmetic coding, either with explicit probability esti-
mation based upon symbol occurrence counts in the various contexts employed,
or with probability estimation based upon the Q-coder mechanism described in
Section 6.11 on page 186.
A specific subset of bi-level images are the textual images obtained when
pages that are dominated by printed text are scanned and stored as bi-level
images. One important application of textual image compression is in facsim-
ile storage and transmission, a compression application that has dramatically
changed the way the world operates.
Early fax standards (see Witten et al. [1999, Chapter 6] for details) were
strictly scan-line oriented, and either ignored the fact that the conditioning con-
texts were two dimensional, or allowed just one prior scan line to be used as
a guide to the transmission of the next one. A simple runlength model was
applied to generate two interleaved streams of symbol numbers, describing the
lengths of the runs of white pixels and black pixels across each scan line; and
the symbols so generated were transmitted using a static prefix code that was
hardwired into the machine at the time it was designed. The set of codewords
were generated by analysis of a set of standard test images, plus manual tun-
ing to make sure that extreme cases were also handled in a reasonable man-
ner. Given the technology available at the time, and the overreaching desire to
make the devices affordable to a wide cross-section of consumers, this Group 3
scheme represents a stunning engineering achievement.

Two decades later, memory capacity and CPU cycles are vastly cheaper,
and a different engineering trade-off is possible, since only transmission band-
width is still a scarce resource. Now textual images can be segmented into the
individual marks comprising them, and a library of such marks transmitted,
plus a list of the locations on the page at which they appear. That is, a com-
pression model is implemented that knows that many of the connected sections
of black pixels in the bi-level image form repetitive patterns that differ only
around the fringes. As a consequence, better compression is obtained. Paul
Howard [1997] describes one such scheme; it only operates sensibly on inputs
that have the particular structure the model is assuming, but when it gets such
data, it performs much better than do the three general-purpose mechanisms
described in the first three sections of this chapter.
Another important category is signal data, such as that generated by the
analog to digital conversion of speech or music. If, for example, each symbol
in the message is an integer-quantized value of some underlying waveform,
then consecutive values in the message will be highly correlated. The standard
way of handling such signals is known as DPCM, or differential pulse code
modulation. In a DPCM model, a number of previous values of the signal are
combined in some way - perhaps by calculating a weighted mean, or by fitting
some other curve - and used to predict the next quantized value. The difference
between the predicted value and the actual value is then coded as an error
value, or residual. If the predictions are accurate, the errors will be small. If
the predictions are not as accurate, for example, when the signal is volatile, the
errors will be larger. In both cases, the errors can be expected to have a mean
of zero, and are usually assumed to conform to a symmetric distribution that is
strongly biased in favor of small values. Golomb and Rice codes (Section 3.3
on page 36) are ideal for such applications, since the parameter b (or k for a
Rice code) that controls the code can be adjusted as a function of the measured
local volatility of the signal. A good illustration of this approach is provided by
Howard and Vitter [1993], who describe a mechanism for compressing gray-
scale images that uses a family of Rice codes in conjunction with adaptive
estimation of an appropriate value of k.
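
The shape of such a scheme is easy to sketch: previous-sample prediction, folding of the signed residual into a non-negative integer, and a Rice code. The sample values below are invented, and a real coder would adapt k to the measured local volatility rather than fixing it.

#include <stdio.h>

/* Rice code with parameter k: quotient v >> k in unary, then k low-order bits.
   Bits are simply printed here in place of a real output bitstream. */
static void put_bit(int b) { putchar(b ? '1' : '0'); }

static void rice_encode(unsigned v, int k)
{
    for (unsigned q = v >> k; q > 0; q--)
        put_bit(1);
    put_bit(0);
    for (int i = k - 1; i >= 0; i--)
        put_bit((v >> i) & 1);
}

int main(void)
{
    int signal[] = { 100, 102, 105, 104, 104, 99, 97, 98 };   /* toy samples */
    int n = (int)(sizeof signal / sizeof signal[0]);
    int k = 2;                        /* Rice parameter, fixed for this sketch */
    int prev = signal[0];             /* first sample assumed sent verbatim */
    for (int i = 1; i < n; i++) {
        int e = signal[i] - prev;                       /* DPCM residual */
        unsigned u = e >= 0 ? 2u * (unsigned)e          /* fold sign: 0,1,-1,2,-2,... */
                            : 2u * (unsigned)(-e) - 1;  /* map to 0,2,1,4,3,...       */
        rice_encode(u, k);
        putchar(' ');
        prev = signal[i];
    }
    putchar('\n');
    return 0;
}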
Another special type of message is natural language text. The first three
sections of this chapter showed three general-purpose compression schemes,
and the fact that English text was used as an example message was convenient,
but not a prerequisite to their success.
But if we know that the message is (say) an ASCII representation of English
text, then models which exploit the resultant structure can be employed. For
example, word-based models [Bentley et al., 1986, Moffat, 1989] have been
applied to text-retrieval systems with considerable success [Bell et al., 1993,
Zobel and Moffat, 1995]. Parsing a large volume of text into a sequence of

interleaved words and non-words, and building a static dictionary for each of
the two different types of tokens, permits a semi-static minimum-redundancy
code to be used, allowing fast decompression, and non-sequential decoding of
small fragments of the original text. The latter facility is important when a
small number of documents drawn from a large collection are to be returned
in response to content-based queries. The inverted indexes used to facilitate
such queries are also amenable to compression, and Golomb codes and other
static coding methods (Chapter 3) have been used in this domain. Witten et al.
[1999] give extensive coverage of this area; other relevant work includes that
of Williams and Zobel [1999], who consider more general data, but realize
many of the same benefits. English text can also be parsed in more sophis-
ticated ways, taking into account general rules for sentence structure [Teahan
and Cleary, 1996, 1997, 1998].
A special case of structured text is program source code. Because source
code conforms to strict grammar rules, probability estimations based upon a
push-down automaton rather than a finite-state machine is possible. For exam-
ple, because the braces "{" and "}" must match up in a syntactically correct
C program, the probability of a right brace "}" can be set to (very close to)
zero at certain parts of the message. For every non-terminal symbol in the
grammar the probabilities of each of the grammar rules that rewrite that non-
terminal can be estimated adaptively; and the parse tree can then be transmitted
as an arithmetic-coded tree traversal [Cameron, 1988, Katajainen et al., 1986].
Comments must be handled separately, of course - there is nothing to stop
comments having unmatched parentheses.
Returning to compression of English, one variant on the word-based pars-
ing scheme is worth particular mention, because it illustrates another useful
technique - adding a small amount of redundancy back into the compressed
message in order to obtain a particular benefit. de Moura et al. [2000] parse
English text into spaceless words by assuming that each word is by default fol-
lowed by a single blank character. An explicit non-word token is only required
when the token between consecutive words is not a single blank character. The
non-words are coded from the same dictionary of strings as the words, and just
one probability distribution is maintained. For example, the string
"mary#had#a#little#lamb,#little#lamb,#little#lamb."
is transformed into the message
"1,2,3,4,5,6,4,5,6,4,5,7"
against the dictionary of strings
1: "mary"     4: "little"     6: ",#"
2: "had"      5: "lamb"       7: "."
3: "a"

Symbols number 6 and 7 are the only non-word tokens explicitly required by
this message. The decoder can correctly reverse the transformation: consecu-
tive decoded tokens composed entirely of alphabetics (or whatever other subset
of the character set is being used to define "words") get a single space inserted
between them. Use of spaceless words avoids the interleaving of symbols from
two different probability distributions, and neatly sidesteps the issues caused
by the fact that an explicit non-word distribution tends to be very skew as a
result of the dominance of the single-space non-word token "#".
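
A toy decoder for this example is sketched below. It assumes, as a guess from the context, that the seventh dictionary entry is the final full stop, and it prints "#" for the implicit blank, as elsewhere in this chapter.

#include <ctype.h>
#include <stdio.h>

/* A token is a "word" if it consists entirely of alphabetic characters. */
static int is_word(const char *t)
{
    for (; *t; t++)
        if (!isalpha((unsigned char)*t))
            return 0;
    return 1;
}

int main(void)
{
    /* dictionary and token stream from the example above; entry 7 assumed "." */
    const char *dict[] = { "", "mary", "had", "a", "little", "lamb", ",#", "." };
    int tokens[] = { 1, 2, 3, 4, 5, 6, 4, 5, 6, 4, 5, 7 };
    int prev_was_word = 0;
    for (int i = 0; i < (int)(sizeof tokens / sizeof tokens[0]); i++) {
        const char *t = dict[tokens[i]];
        if (prev_was_word && is_word(t))
            putchar('#');                  /* reinsert the implicit single blank */
        fputs(t, stdout);
        prev_was_word = is_word(t);
    }
    putchar('\n');   /* prints mary#had#a#little#lamb,#little#lamb,#little#lamb. */
    return 0;
}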
To code the stream of tokens, de Moura et al. [2000] suggest use of a semi-
static radix-256 minimum-redundancy code, so that all codewords are byte
aligned. Such a sizeable channel alphabet would normally result in consid-
erable degradation in compression effectiveness, but the large source alphabet
inherent in word-based models, and the fact that the spaceless words proba-
bility distribution is relatively uniform (the most frequent word in English is
"the", with a typical probability of around 5%) means that the loss is small.
Searching in the compressed text for a particular word from the compres-
sion dictionary is then remarkably easy: the set of byte codewords for the
word (or phrase of concatenated compression tokens) is formed, and a stan-
dard pattern-matching algorithm employed (see Cormen et al. [2001]).
Use of variable length codewords does, however, mean that false matches
might result, where the pattern-matching software reports that a particular byte
sequence has been located, but in fact it appears by coincidence as a result of
the juxtaposition of fragments of two unrelated codewords. For example, if the
codeword for "bird" is the three byte sequence "57, 0, 164" (with each byte in
the codeword expressed as an integer between 0 and 255), the code for "in" is
"188, 57", the code for "the" is "0", and the word "hand" has the three-byte
code "164, 45, 142", then the compressed form of the phrase "in#the#hand"
includes within it the code for "bird", and the pattern-matcher has no option
but to declare a possible "bird" that is "in#the#hand", only to then have the
potential match declined when alignment checking is undertaken.
To eliminate the cost of false match checking, de Moura et al. [2000] pro-
pose an alternative that is as simple as it is effective - instead of using a radix-
256 code, they use a radix-128 code. These seven-bit codes are then fitted
into bytes, and the remaining bit of each byte used as an alignment indicator.
For example, the first byte in each codeword might have the tag bit set, and
the remaining bytes in each codeword would then have the tag bit off. Pat-
tern searching is accomplished by forming the compressed form of the pattern
word or phrase, including setting the tag bits appropriately, and using a stan-
dard pattern matcher. The tag bits ensure that matches are only reported when
the alignment is correct, and the false matches are eliminated at the expense of
making the compressed message approximately 8/7 times as long (about 14% longer) than if a

radix-256 code was used.
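
The tag-bit idea takes only a few lines of code; the function below and its three-digit codeword are hypothetical, and serve only to show how the spare bit marks codeword boundaries.

#include <stdio.h>

/* Emit one radix-128 codeword given its 7-bit digits, most significant first.
   The first byte carries a set tag bit and continuation bytes a clear one, so a
   byte-oriented pattern matcher can only align on codeword boundaries. */
static void emit_codeword(const unsigned char *digits, int len, FILE *out)
{
    for (int i = 0; i < len; i++) {
        unsigned char byte = (unsigned char)(digits[i] & 0x7F);
        if (i == 0)
            byte |= 0x80;                 /* tag bit: start of a codeword */
        fputc(byte, out);
    }
}

int main(void)
{
    unsigned char word[] = { 57, 0, 100 };   /* hypothetical codeword digits */
    emit_codeword(word, 3, stdout);
    return 0;
}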


Many other authors have considered the problem of searching compressed
text. Some of the proposed mechanisms are tantamount to decoding the text
looking for the string in question, building up the same data structures as would
be required by an adaptive decoder. Whether or not such schemes operate faster
than a "fully decode the text and pipe the output into a pattern matcher" method
is a question that is sometimes lost in the complexity of the proposed solution.
The radix-128 mechanism of de Moura et al. [2000] passes this test, as does
Manber's [1997] simpler bigram mechanism. Mitarai et al. [2001] also report
positive experiments, and their work can be taken as a starting point if the
reader wishes to pursue the topic further.
The final compression system we examine in this chapter also provides
searching capabilities, and has been saved until last in part because the coding
mechanism employed is amongst the simplest ones considered anywhere in this
book. Our objective in describing this mechanism at such a late stage of our
presentation is to remind the reader that humble methods are often attractive,
and that it is always worth trying to "keep it simple, stupid".
This last compression system is the off-line mechanism known as RE-
PAIR, for recursive pairing [Larsson and Moffat, 2000]; related mechanisms
have been described by Solomonoff [1964], Wolff [1975, 1978], Rubin [1976],
Nakamura and Murashima [1996], Bentley and McIlroy [1999], Apostolico
and Lonardi [2000], Nevill-Manning and Witten [1997, 2000], and Cannane
and Williams [2001]. In RE-PAIR, a block of the message is reduced to a
sequence in which there are no duplicate pairs of consecutive symbols, by the
repeated application of a replacement rule. The replacement rule is simple - the
most frequently occurring pair of adjacent symbols is replaced everywhere by
a new symbol that represents that coupling. For example, consider the string:

"how#now#brown#cow#howl#now#brown#owl."

The pair "ow" appears eight times, and is the first replaced. If we use upper case
letters to indicate new symbol numbers (the primitives, or base letters, occupy
the symbol values 0 to 255, so in an implementation the first new symbol is
actually numbered 256), it is reduced to:

"hA#nA#brAn#cA#hAl#nA#brAn#Al."

The combination "A#" now appears four times, and is replaced next:

"hBnBbrAn#cBhAl#nBbrAn#Al."

The full set of reductions applied, together with the underlying string each
pairwise replacement represents, is:

256: A → ow = "ow"        261: F → nB = "now#"
257: B → A# = "ow#"       262: G → CA = "brow"
258: C → br = "br"        263: H → GD = "brown#"
259: D → n# = "n#"        264: I → FH = "now#brown#"
260: E → Al = "owl"

and the final reduced message is (the integer-symbol equivalent of):

"hBIcBhE#IE."

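The replacement rule itself is easy to prototype. The brute-force sketch below is our own and mimics the reduction above (ties between equally frequent pairs may be broken differently); it is emphatically not the efficient pairing structure described by Larsson and Moffat [2000].

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXSYM 1024                       /* enough symbols for this small example */

int main(void)
{
    const char *src = "how#now#brown#cow#howl#now#brown#owl.";
    int n = (int)strlen(src);
    int *msg = malloc((size_t)n * sizeof *msg);
    int left[MAXSYM], right[MAXSYM];      /* expansion rules for the new symbols */
    int next_sym = 256;

    for (int i = 0; i < n; i++)
        msg[i] = (unsigned char)src[i];
    while (next_sym < MAXSYM) {
        int best_a = -1, best_b = -1, best_count = 1;
        for (int i = 0; i + 1 < n; i++) {           /* brute-force pair counting */
            int count = 0;
            for (int j = 0; j + 1 < n; j++)
                if (msg[j] == msg[i] && msg[j + 1] == msg[i + 1])
                    count++;
            if (count > best_count) {
                best_count = count; best_a = msg[i]; best_b = msg[i + 1];
            }
        }
        if (best_a < 0)                             /* no pair occurs twice: done */
            break;
        left[next_sym] = best_a; right[next_sym] = best_b;
        int k = 0;
        for (int i = 0; i < n; i++) {               /* replace pairs left to right */
            if (i + 1 < n && msg[i] == best_a && msg[i + 1] == best_b) {
                msg[k++] = next_sym; i++;
            } else {
                msg[k++] = msg[i];
            }
        }
        n = k;
        next_sym++;
    }
    printf("%d symbols remain, %d rules created\n", n, next_sym - 256);
    free(msg);
    return 0;
}
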
A RE-PAIR-compressed message consists of two components: a descrip-
tion of the phrase hierarchy that describes the set of strings into which the
source message has been partitioned; and a list of phrase numbers relative to
that hierarchy, the reduced message. In their description of the technique, Lars-
son and Moffat [2000] give results for a range of coding techniques for these
two parts, and conclude that the interpolative code is an appropriate choice for
the phrase hierarchy, and that a semi-static minimum-redundancy coder han-
dles the reduced message well.
In a followup investigation, Moffat and Wan [2001] consider the task of
searching a RE-PAIR-compressed message, looking for the locations at which
a phrase (or phrases) appears. The issues are thus the same as with the space-
less words model, and a byte-aligned minimum-redundancy code could be
used. But the byte-aligned code still has the drawback of being variable-length,
whereas for fastest possible processing, a fixed-length code is required. The re-
duced message produced by RE-PAIR also has the advantage of having a more
uniform probability distribution than is created by the spaceless words model.
For these reasons, Moffat and Wan turned to a simple binary code, in which
all codewords are exactly sixteen bits long. This choice dictates a subalphabet
of not more than n = 65,536 symbols in each coding block, which, in exper-
iments, typically meant that each coding block was approximately 130,000 to
150,000 symbols long.
Much of the compression in this system arises by virtue of the blocks be-
ing relatively short, and the consequent use of localized subalphabets (see Fig-
ure 4.8 on page 87 and the related discussion); some of the compression is
a consequence of the reduced amount of information required in the prelude.
Section 2.4 on page 20 noted that it is the cost of the "prelude plus codes" pack-
age that governs the eventual compression ratio, not just the bits consumed by
the codes. Hence, if frequently-occurring symbols are assigned short represen-
tations in the prelude, compression can still be obtained. This is exactly what
Moffat and Wan do: the representation of the prelude is via a simple nibble-
aligned Golomb-like code that favours the low values in the subalphabet which
occur in a high proportion of the coding blocks; while the symbols in each

                                Minimum-redundancy    Simple binary
Phrase hierarchy                       0.22               0.22
Reduced message prelude                0.05               0.19
Reduced message code                   1.56               1.60
Total                                  1.83               2.01

Table 8.3: Compression effectiveness (bits per character relative to the source mes-
sage) of RE-PAIR with two different coders: a minimum-redundancy coder, as outlined
in Algorithm 4.6 on page 83; and a coder based upon a simple binary code, taken from
Moffat and Wan [2001]. The source data is 510 MB of newspaper-like text from the
Wall Street Journal, processed in RE-PAIR blocks of 10 MB, with an average phrase
length of 9.97 characters. The minimum-redundancy coder processed the same 51
blocks; the simple binary coder broke the reduced message into 375 blocks, averaging
142,650 symbols per block.

block of the reduced message are coded as fixed double-byte integers relative
to a simple subalphabet mapping table of 65,536 entries.
Table 8.3 shows how surprisingly well this approach performs. The change
from a minimum-redundancy code to a simple binary code of itself adds only
0.04 bits per character to the message cost. The benefit that accrues through
the use of highly localized codes - simple ones, but never mind - is all but
enough to match the power of the minimum-redundancy coder. Only in the
representation of the prelude is the simple scheme expensive, partly because
of the use of simple codes there too, and partly because the n = 65,536 per-
block limit means that there are many more blocks for which preludes must be
provided. Overall, the simple coder is less than 10% worse than the minimum-
redundancy coder.
To search, the alternating "prelude, then codes" nature of the blocks is ex-
ploited. To look for locations at which phrase x appears, the prelude for each
coding block is decoded (the nibble-aligned codes also facilitate rapid decod-
ing), and checked for the presence of x in the subalphabet. If x appears in
this block, the mapped integer it corresponds to is now known, and the block
searched integer-by-integer for that value. If the block prelude indicates that
x does not appear, the remainder of the block is skipped completely, and pro-
cessing continues with the prelude component of the next block. That is, the
blocked nature of the reduced message, and the fact that the prelude is a suc-
cinct summary of the subalphabet of each block, allows the reduced message
to be searched without each phrase number being handled.
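
In code, the per-block search is only a few lines. The data layout assumed below, a decoded subalphabet array plus an array of 16-bit code values, is our simplification of the actual nibble-aligned prelude.

#include <stddef.h>
#include <stdint.h>

/* Count occurrences of global phrase number x within one coding block.
   subalpha[i] is the global phrase that local code value i stands for, and
   codes[] is the block's sequence of fixed 16-bit codewords. */
int search_block(uint32_t x, const uint32_t *subalpha, size_t nsub,
                 const uint16_t *codes, size_t ncodes)
{
    uint16_t mapped = 0;
    int present = 0;
    for (size_t i = 0; i < nsub; i++)        /* scan the decoded prelude */
        if (subalpha[i] == x) {
            mapped = (uint16_t)i;
            present = 1;
            break;
        }
    if (!present)
        return 0;                            /* x absent: skip this block entirely */
    int count = 0;
    for (size_t i = 0; i < ncodes; i++)      /* fixed-width scan, no decoding needed */
        if (codes[i] == mapped)
            count++;
    return count;
}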
It is the sheer simplicity of this scheme that defines our closing argument.

In this book we have described an extensive range of coding techniques, many
of them rather complex, and tailored for quite specific combinations of con-
straints. But for many applications straightforward schemes can also be inter-
esting, and if careful attention is paid to the details, can provide compression
performance only marginally worse than complex schemes, and at a fraction of
the resource - and human - cost. This lesson is worth remembering.

8.5 Lossy modeling


For many kinds of data lossy compression can be tolerated. In a lossy sys-
tem, the decompressed output generated by the decoder is not guaranteed to
be exactly faithful to the original input handed to the encoder. For many kinds
of data such loss of fidelity cannot be contemplated. But when the original
source data is a digital representation of some naturally occurring continuous
phenomenon - recorded sound, or digitized images - then loss has already oc-
curred at the moment the data is sampled, and allowing additional controlled
loss may have no impact at all on the usefulness of the data, yet considerably
reduce the cost of storing that data.
This book is not about lossy compression, and for a description of lossy
techniques, the reader is referred to Pennebaker and Mitchell [1993], Salomon
[2000], or Sayood [2000]. The point that we do wish to make in this section is
that lossy compression schemes require lossless coders - the loss is managed
in the model, not the coder. In a crude sense, the "lossy" component of a lossy
compression scheme consists of representing a set of coefficients to some de-
gree of precision, where, as low-order bits are removed from the coefficients,
the fidelity of the representation is slowly eroded. Once a fidelity criterion
has been established, and some amount of precision in the coefficients decided
upon, the coder must represent those values exactly. In particular, the coder
is not free to introduce more rounding errors, since the encoder will be antic-
ipating the actions of the decoder, and perhaps adjusting its future behavior
knowing exactly the quality of the information it has just sent.
The reduction in precision allows better bit rates in the coder, since low
precision numbers have fewer possible values than do high-precision numbers.
A good choice of coder will automatically obtain this gain; once the model has
determined the precision that it wishes to use, the coder must be exact. That
is, the techniques we have described in Chapters 3 to 7 are just as relevant to
lossy compression as they are to the lossless compression schemes we have
described in this chapter.
On a final whimsical note, we observe that lossy compression for text has
been considered and found to have some modest area of application [Witten
et al., 1994].
Chapter 9

What Next?

It is much harder to write the last chapter of a book than to write the first one.
At the beginning, the objective is clear - you must set the scene, be full of
effusive optimism, and paint a rosy picture of what the book is intended to
achieve. But by the time your reader reaches the final chapter there is no scope
for lily-gilding, and for better or for worse, they have formed an opinion as to
the validity and usefulness of your work.
This book is no exception. We claimed at the beginning that coding was an
area of algorithm design rich in theory and practical techniques. In the ensu-
ing chapters we have described many coding techniques, and shown how they
couple with different predictive models to make useful compression systems.
But our coverage, while comprehensive, has not been exhaustive - to capture
everything would require a decade or more of patient writing, by which time
there would be a decade of new research to be incorporated.
So how do we finish our book? What parting information do we convey to
our readers in the event their thirst has not yet been sated?
One important thing we can do is point at further sources of information.
We have deliberately adopted a writing style with interlaced citations to the
research literature - we believe the inventors of the many clever ideas we have
described are deserving of that recognition. The first port of call for readers
wishing more precision, therefore, is the original literature. They will find de-
tails and rationales that we have had no choice but to gloss over in the interests
of brevity and breadth.
Another obvious point we should make in closing is to reiterate our in-
tended coverage - this book is much more about coding than it is about mod-
eling, and both are important in the design of a compression system. There are
good books about modeling already (listed in Section 1.4 on page 9). Some-
one might even rise to the challenge, and write an updated book devoted to
modeling with the express purpose of complementing this one.


www.cs.mu.oz.au/caca/
The home page for this book. Includes an errata listing and all of the following
URLs.
www.faqs.org/faqs/compression-faq/
A detailed set of answers to frequently asked compression questions.
Maintainer: Jean-loup Gailly.
corpus.canterbury.ac.nz
Home page of the Canterbury Corpus, including the Large Canterbury Corpus
[Arnold and Bell, 1997]. Lists compression effectiveness and throughput
results for a wide range of compression systems, both public systems and
research prototypes. Owner: Tim C. Bell.
compression.ca
Home page of the Archive Compression Test. Lists compression effectiveness
and throughput results for a wide range of public and commercial compression
systems, including those that have archiving functions. Owner: Jeff Gilchrist.
www.dogma.net/DataCompression/
A great deal of compression-related information, including (at the
Benchmarks.shtml page), a set of performance figures for compression
systems. Owner: Mark Nelson.
www.internz.com/compression-pointers.html
A listing of compression resources, including people, software, and projects.
Owner: Stuart Inglis.
www.rasip.fer.hr/research/compress/
The Data Compression Reference Center page. Includes (at the algorithms
subpage) descriptions of a number of compression systems. Project manager:
Mario Kovac.
www.cs.brandeis.edu/~dcc/
Home page of the annual IEEE Data Compression Conference, held in
Snowbird, Utah, in late March or early April each year. Conference chairs: Jim
Storer and Marty Cohn.
directory.google.com/Top/Computers/Algorithms/Compression/
Typical listing maintained by a search engine. Owner: Google Inc.
www.cs.mu.oz.au/~alistair/abstracts/
Research papers contributed to by the first author.
www.computing.edu.au/~andrew/pubs/
Research papers contributed to by the second author.

Table 9.1: Web-based resources for compression and coding.



Another place to which we can point the reader is the web. The web has
the advantage of being fluid, and so provides the opportunity for relatively
inexpensive revision and extension. There are many web sites that might be of
interest to the reader who has made it this far, including the web page for this
book at www.cs.mu.oz.au/caca/. Table 9.1 lists a few of the more useful
ones.
Several of the web pages listed in the table include detailed assessments of
various compression systems. The availability of these evaluations has allowed
us one luxury in this book that previous compression texts have not enjoyed:
we are free from the need to provide tables of compression performance. Some
readers will feel that we have evaded the issue, and should give concrete advice.
But the truth is, there are many competing constraints that dictate the choice of
compression mechanism, and to come out with a table that summarizes the at-
tributes of compression systems into a set of single numbers is to oversimplify
that choice. For example, in Chapter 8 we mentioned performance figures for
a range of systems on the file bible.txt, taken from the Large Canterbury
Corpus. But simply comparing these numbers is rather unfair to some of the
systems involved. For example, the PPMD implementation we tested obtained
a compression on bible.txt of 1.56 bits per character, but did so using (Ta-
ble 8.2 on page 233) 32 MB of memory. In comparison, the Burrows-Wheeler
implementation BZIP2 "only" achieves 1.67 bits per character, but does so in
around 5 MB of execution-time memory. At the very least, for any compres-
sion comparison to be fair, similar extents of memory should be used, and with
memory limited to around 4 MB, the PPM implementation tested drops to a
compression rate of 1.66 bits per character.
The PPM implementation also executes slower than does BZIP2, and
when computational resources are taken into account, BZIP2 is the best com-
promise for a range of applications. But we are still not willing to declare
a winner, for BZIP2 operates in an off-line manner, and produces no output
bits until the last input byte of an entire block has been digested. The alterna-
tive, on-line compression systems, start emitting bits as soon as the first input
symbol is available (but possibly subject to the latency inherent in arithmetic
coding). PPM is on-line. This on-going balancing act between constraints and
performance is why we have avoided compression league tables. Choosing a
compression system is like buying a car - selecting the most fuel-economical
one (that is, the most effective system) may not actually be the best way of
meeting our transport needs.
That there are many competing factors means, of course, that the last and
most important resource we can point out is you, the reader. If you have read
your way through this book, you have, we fervently hope, learned a great deal.
At the very least, we expect you to be able to make an intelligent choice of

coder and compression system for a particular application that you have in
mind. And if you have seen things in these chapters, or in the papers we have
cited, and thought to yourself "That can't be right? Surely it will work better
if... ?" then you are well on the way to making your own discoveries. Don't un-
derestimate your own ability to invent new compression and coding algorithms,
the " ... " in the previous sentence. Most people start their compression careers
by tinkering - messing about in small programs, so to speak - and maybe, just
maybe, this book will have given you the confidence to have a go. Best of all, if
you put down the book and roll up your sleeves, then we have succeeded both
with our attempt to convey our infectious enthusiasm for this topic, and also in
providing an answer to the "what next?" question with which we opened this
discussion. Have fun ...
Bibliography

J. Åberg. A Universal Source Coding Perspective on PPM. PhD thesis, Lund University, Sweden, October 1999. [p. 9]

J. Åberg, Y. M. Shtarkov, and B. J. M. Smeets. Towards understanding and improving escape probabilities in PPM. In Storer and Cohn [1997], pages 22-31. [p. 145, 230]

J. Abrahams. Codes with monotonic codeword lengths. Information Processing & Management, 30(6):759-764, 1994. [p. 214]

J. Abrahams and M. J. Lipman. Zero-redundancy coding for unequal code symbol costs. IEEE Trans. on Information Theory, 38:1583-1586, September 1992. [p. 212]

N. Abramson. Information Theory and Coding. McGraw Hill, New York, 1963. [p. 92]

D. Altenkamp and K. Mehlhorn. Codes: Unequal probabilities, unequal letter costs. J. of the ACM, 27(3):412-427, July 1980. [p. 212]

J. B. Anderson and S. Mohan. Source and Channel Coding: An Algorithmic Approach. Kluwer Academic, 1991. Int. Series in Engineering and Computer Science. [p. 10]

A. Apostolico and S. Lonardi. Off-line compression by greedy textual substitution. Proc. IEEE, 88(11):1733-1744, November 2000. [p. 248]

R. Arnold and T. Bell. A corpus for the evaluation of lossless compression algorithms. In Storer and Cohn [1997], pages 201-210. [p. 254]

B. Balkenhol, S. Kurtz, and Y. M. Shtarkov. Modifications of the Burrows and Wheeler data compression algorithm. In Storer and Cohn [1999], pages 188-197. [p. 233]

M. A. Bassiouni and A. Mukherjee. Efficient decoding of compressed data. J. of the American Society for Information Science, 46(1):1-8, January 1995. [p. 65]

T. C. Bell. Better OPM/L text compression. IEEE Trans. on Communications, COM-34:1176-1182, December 1986a. [p. 218]

T. C. Bell. A Unifying Theory and Improvements for Existing Approaches to Text Compression. PhD thesis, University of Canterbury, Christchurch, New Zealand, 1986b. [p. 9]

T. C. Bell, J. G. Cleary, and I. H. Witten. Text Compression. Prentice-Hall, Englewood Cliffs, New Jersey, 1990. [p. vii, 8, 9, 17, 20, 220, 229, 232]

T. C. Bell and D. Kulp. Longest-match string searching for Ziv-Lempel compression. Software - Practice and Experience, 23(7):757-772, July 1993. [p. 219]

T. C. Bell, A. Moffat, C. G. Nevill-Manning, I. H. Witten, and J. Zobel. Data compression in full-text retrieval systems. J. of the American Society for Information Science, 44(9):508-531, October 1993. [p. 36, 245]

T. C. Bell and I. H. Witten. The relationship between greedy parsing and symbolwise text
compression. J. of the ACM, 41(4):708-724, July 1994. !p.220)
T. C. Bell, I. H. Witten, and 1. G. Cleary. Modeling for text compression. Computing Surveys,
21(4):557-592, December 1989. !p.9]
1. Bentley and D. McIlroy. Data compression using long common strings. In Storer and Cohn
[1999], pages 287-295. !p.248]
J. Bentley, D. Sleator, R. Tarjan, and V. Wei. A locally adaptive data compression scheme.
Communications of the ACM, 29(4):320-330, April 1986. !po 171, 173, 245]
J. Bentley and A. C-C. Yao. An almost optimal algorithm for unbounded searching. Information
Processing Letters, 5(3):82-87, August 1976. !p.33]
J. L. Bentley and M. D. McIlroy. Engineering a sorting function. Software - Practice and
Experience, 23(11): 1249-1265, November 1993. !po 14, 70]
J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc.
Eighth Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 360-369, New Orleans, LA,
January 1997. www.cs.princeton . edurrs/ strings/. !p.242]
W. Blake. Milton. Shambhala Publications Inc., Boulder, Colorado, 1978. With commentary by
K. P. Easson and R. R. Easson. !po 20]
C. Bloom. New techniques in context modeling and arithmetic coding. In Storer and Cohn
[1996], page 426. !p.232)
A. Bookstein and S. T. Klein. Is Huffman coding dead? Computing, 50(4):279-296, 1993.
!po 103, 134)

L. Bottou, P. G. Howard, and Y. Bengio. The Z-Coder adaptive binary coder. In Storer and Cohn
[1998], pages 13-22. [p. 130]
P. Bradford, M. J. Golin, L. L. Larmore, and W. Rytter. Optimal prefix-free codes for unequal
letter costs: Dynamic programming with the Monge property. In G. Bilardi, G. F. Italiano,
A. Pietracaprina, and G. Pucci, editors, Proc. 6th Ann. European Symp. on Algorithms, vol-
ume 1461, pages 43-54, Venice, Italy, August 1998. LNCS Volume 1461. [p. 212]
S. Bunton. On-Line Stochastic Processes in Data Compression. PhD thesis, University of
Washington, March 1997a. [p. 9, 229, 230]
S. Bunton. Semantically motivated improvements for PPM variants. The Computer J., 40(2/3):
76-93, 1997b. [p. 229, 230]
M. Buro. On the maximum length of Huffman codes. Information Processing Letters, 45(5):
219-223, April 1993. [p. 90]
M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical
Report 124, Digital Equipment Corporation, Palo Alto, California, May 1994. [p. 233]
R. D. Cameron. Source encoding using syntactic information source models. IEEE Trans. on
Information Theory, IT-34(4):843-850, July 1988. [p. 246]
A. Cannane and H. E. Williams. General-purpose compression for efficient retrieval. J. of
the American Society for Information Science and Technology, 52(5):430-437, March 2001.
[p. 248]

B. Chapin. Switching between two on-line list update algorithms for higher compression of
Burrows-Wheeler transformed data. In Storer and Cohn [2000], pages 183-192. [p. 233, 241]
D. Chevion, E. D. Karnin, and E. Walach. High efficiency, multiplication free approximation of
arithmetic coding. In Storer and Reif [1991], pages 43-52. [p. 123]
Y. Choueka, S. T. Klein, and Y. Perl. Efficient variants of Huffman codes in high level languages.
In Proc. 8th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Re-
trieval, pages 122-130, Montreal, Canada, June 1985. ACM, New York. [p. 63-65]
J. G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. The Computer J., 40(2/3):
67-75, 1997. [p. 230, 239]
J. G. Cleary and I. H. Witten. A comparison of enumerative and adaptive codes. IEEE Trans.
on Information Theory, IT-30(2):306-315, March 1984a. [p. 138]
J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string match-
ing. IEEE Trans. on Communications, COM-32(4):396-402, April 1984b. [p. 140, 141, 222,
225]

J. B. Connell. A Huffman-Shannon-Fano code. Proc. IEEE, 61(7):1046-1047, July 1973.
[p. 57, 58]
G. V. Cormack and R. N. Horspool. Algorithms for adaptive Huffman codes. Information
Processing Letters, 18(3):159-165, March 1984. [p. 146, 154]
G. V. Cormack and R. N. Horspool. Data compression using dynamic Markov modeling. The
Computer J., 30(6):541-550, December 1987. [p. 118, 229]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT
Press, Cambridge, Massachusetts, second edition, 2001. [p. 5, 10, 14, 247]
T. M. Cover and R. C. King. A convergent gambling estimate of the entropy of English. IEEE
Trans. on Information Theory, IT-24(4):413-421, July 1978. [p.20]
E. S. de Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and flexible word searching
on compressed text. ACM Trans. on Information Systems, 18(2):113-139, 2000. [p. 246-248]
S. Deorowicz. Improvements to Burrows-Wheeler compression algorithm. Software - Practice
and Experience, 30(13):1465-1483, November 2000. [p. 241]
M. Effros. Universal lossless source coding with the Burrows Wheeler transform. In Storer and
Cohn [1999], pages 178-187. [p.239]
M. Effros. PPM performance with BWT complexity: A new method for lossless data compres-
sion. In Storer and Cohn [2000], pages 203-212. [p.233]
P. Elias. Universal codeword sets and representations of the integers. IEEE Trans. on Informa-
tion Theory, IT-21(2):194-203, March 1975. [p.32]
F. Fabris. The Longo-Galasso criterion for Huffman codes: An extended version. Alta Fre-
quenza, LVIII(1):35-38, 1989. [p. 180]
N. Faller. An adaptive system for data compression. In Record of the 7th Asilomar Conf. on
Circuits, Systems, and Computers, pages 593-597, 1973. [p. 146]
R. M. Fano. The transmission of information. Technical Report 65, Research Laboratory of
Electronics, M.I.T., Cambridge, MA, 1949. [p. 51]
P. Fenwick. A new data structure for cumulative probability tables. Software - Practice and
Experience, 24(3):327-336, March 1994. Errata published in 24(7):677, July 1994. [p.157]
P. Fenwick. Block sorting text compression. In K. Ramamohanarao, editor, Proc. 19th Aus-
tralasian Computer Science Conf., pages 193-202, Melbourne, January 1996a. [p. 233]
P. Fenwick. The Burrows-Wheeler transform for block sorting text compression: Principles and
improvements. The Computer J., 39(9):731-740, September 1996b. [p. 177]
P. Fenwick. Symbol ranking text compression with Shannon recodings. J. of Universal Com-
puter Science, 3(2):70-85, 1997. [p. 230]
P. Fenwick. Symbol ranking text compressors: Review and implementation. Software - Practice
and Experience, 28(5):547-559, 1998. [p.230]
G. Feygin, P. G. Gulak, and P. Chow. Minimizing excess code length and VLSI complexity
in the multiplication free approximation of arithmetic coding. Information Processing &
Management, 30(6):805-816, November 1994. [p.123]
E. R. Fiala and D. H. Greene. Data compression with finite windows. Communications of the
ACM, 32(4):490-505, April 1989. [p. 218]
A. S. Fraenkel and S. T. Klein. Bidirectional Huffman coding. The Computer J., 33(4):296-307,
1990. [p. 214]
A. S. Fraenkel and S. T. Klein. Bounding the depth of search trees. The Computer J., 36(7):
668-678, 1993. [p. 194, 202]
J. L. Gailly. Gzip program and documentation, 1993. Source code available from
ftp://prep.ai.mit.edu/pub/gnu/gzip-*.tar. [p. 219]
R. G. Gallager. Variations on a theme by Huffman. IEEE Trans. on Information Theory, IT-24
(6):668-674, November 1978. [p. 74, 89, 146]
R. G. Gallager and D. C. Van Voorhis. Optimal source codes for geometrically distributed integer
alphabets. IEEE Trans. on Information Theory, IT-21(2):228-230, March 1975. [p.38]
A. M. Garsia and M. L. Wachs. A new algorithm for minimum cost binary trees. SIAM J. on
Computing, 6(4):622-642, December 1977. [p. 207]
B. Girod. Bidirectionally decodable streams of prefix code-words. IEEE Communications Let-
ters, 3(8):245-247, August 1999. [p.214]
D. Goldberg. What every computer scientist should know about floating-point arithmetic. Com-
puting Surveys, 23(1):5-48, March 1991. [p.117]
M. J. Golin and N. Young. Prefix codes: equiprobable words, unequal letter costs. In S. Abite-
boul and E. Shamir, editors, Proc. 21st Int. Coll. on Automata, Languages and Programming,
pages 605-617, Jerusalem, July 1994. Springer-Verlag. LNCS 820. [p. 212]
S. W. Golomb. Run-length encodings. IEEE Trans. on Information Theory, IT-12(3):399-401,
July 1966. [p.36]
S. W. Golomb, R. E. Peile, and R. A. Scholtz. Basic Concepts in Information Theory and
Coding: The Adventures of Secret Agent 00111. Plenum, April 1994. [p. 10]
G. Gonnet and R. Baeza-Yates. Handbook of Data Structures and Algorithms. Addison-Wesley,
Reading, Massachusetts, second edition, 1991. [p.5, 10]
U. Graf. Dense coding - A fast alternative to arithmetic coding. In Proc. Compression and Com-
plexity of Sequences. IEEE Computer Society Press, Los Alamitos, California, July 1997.
[p.123]
R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics: A Foundation for Com-
puter Science. Addison-Wesley, Reading, Massachusetts, 1989. [p. 10, 13]
R. M. Gray. Entropy and Information Theory. Springer-Verlag, November 1990. [p. 10]
M. Guazzo. A general minimum-redundancy source-coding algorithm. IEEE Trans. on Infor-
mation Theory, IT-26:15-25, January 1980. [p. 92]
P. C. Gutmann and T. C. Bell. A hybrid approach to text compression. In J. A. Storer and
M. Cohn, editors, Proc. 1994 IEEE Data Compression Conf., pages 225-233. IEEE Com-
puter Society Press, Los Alamitos, California, March 1994. [p. 220]
R. W. Hamming. Coding and Information Theory. Prentice-Hall, second edition, 1986. [p. 10]
D. K. Harman. Overview of the second text retrieval conference (TREC-2). Information Pro-
cessing & Management, 31(3):271-289, May 1995. [p.71]
R. Hashemian. High speed search and memory efficient Huffman coding. IEEE Trans. on
Communications, 43(10):2576-2581, October 1995. [p. 64]
G. Held. Data Compression Techniques and Applications - Hardware and Software Consider-
ations. John Wiley and Sons, 1983. [p. 10]
D. S. Hirschberg and D. A. Lelewer. Efficient decoding of prefix codes. Communications of the
ACM, 33(4):449-459, April 1990. [p.57]
C. A. R. Hoare. Algorithms 63 and 64: Partition and quicksort. Communications of the ACM,
4:321, 1961. [p.14]
C. A. R. Hoare. Quicksort. The Computer J., 4:10-15, 1962. [p. 14]
R. Hoffman. Data Compression in Digital Systems. Chapman and Hall, New York, 1997. [p. 10]
P. G. Howard. The Design and Analysis of Efficient Lossless Data Compression Systems. PhD
thesis, Brown University, Rhode Island, 1993. Available as Technical Report CS-93-28. [p.9]
P. G. Howard. Text image compression using soft pattern matching. The Computer J., 40(2/3):
146-156, 1997. [p. 120, 245]
P. G. Howard and J. S. Vitter. Analysis of arithmetic coding for data compression. Information
Processing & Management, 28(6):749-763, 1992a. [p. 111]
P. G. Howard and J. S. Vitter. Practical implementations of arithmetic coding. In J. A. Storer,
editor, Image and Text Compression, pages 85-112. Kluwer Academic, Norwell, Massachusetts,
1992b. [p. 127, 140, 142, 222]
P. G. Howard and J. S. Vitter. Fast and efficient lossless image compression. In Storer and Cohn
[1993], pages 351-360. [p. 32, 245]
P. G. Howard and J. S. Vitter. Arithmetic coding for data compression. Proc. IEEE, 82(6):
857-865, June 1994a. [p.93]
P. G. Howard and J. S. Vitter. Design and analysis of fast text compression based on quasi-
arithmetic coding. Information Processing & Management, 30(6):777-790, 1994b. [p. 127,
130, 230]
T. C. Hu and K. C. Tan. Path length of binary search trees. SIAM J. of Applied Mathematics, 22
(2):225-234, March 1972. [p. 194]
T. C. Hu and A. C. Tucker. Optimal computer search trees and variable length alphabetic codes.
SIAM J. of Applied Mathematics, 21:514-532, 1971. [p. 203]
D. A. Huffman. A method for the construction of minimum-redundancy codes. Proc. Inst. Radio
Engineers, 40(9): 1098-1101, September 1952. [p. x, 5, 53]
F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly ordered sets.
SIAM J. on Computing, 1(1):31-39, 1972. [p.40]
R. Y. K. Isal and A. Moffat. Parsing strategies for BWT compression. In Storer and Cohn
[2001], pages 429-438. [p. 174, 242]
R. Y. K. Isal and A. Moffat. Word-based block-sorting text compression. In M. Oudshoorn,
editor, Proc. 24th Australian Computer Science Conf., pages 92-99, Gold Coast, Australia,
February 2001b. IEEE Computer Society, Los Alamitos, CA. [p. 174]
R. Y. K. Isal, A. Moffat, and A. C. H. Ngai. Enhanced word-based block-sorting text compres-
sion. In M. Oudshoorn, editor, Proc. 25th Australasian Computer Science Conf., Melbourne,
Australia, February 2002. [p. 174, 242, 243]
M. Jakobsson. Huffman coding in bit-vector compression. Information Processing Letters, 7
(6):304-307, October 1978. [p.41]
D. W. Jones. Application of splay trees to data compression. Communications of the ACM, 31
(8):996-1007, August 1988. [p.175, 176]
R. M. Karp. Minimum-redundancy coding for the discrete noiseless channel. IEEE Trans. on
Information Theory, IT-7(1):27-38, January 1961. [p.212]
J. Karush. A simple proof of an inequality of McMillan. Institute of Radio Engineers Trans. on
Information Theory, IT-7(2):118, April 1961. [p.18]
J. Katajainen, A. Moffat, and A. Turpin. A fast and space-economical algorithm for length-
limited coding. In J. Staples, P. Eades, N. Katoh, and A. Moffat, editors, Proc. Int. Symp.
on Algorithms and Computation, pages 12-21, Cairns, Australia, December 1995. Springer-
Verlag. LNCS 1004. [p.201]
J. Katajainen, M. Penttonen, and J. Teuhola. Syntax-directed compression of program files.
Software - Practice and Experience, 16(3):269-276, March 1986. [p.246]
J. H. Kingston. Algorithms and Data Structures: Design, Correctness, Analysis. Addison-
Wesley, Reading, MA, 1990. [p.173]
M. Klawe and B. Mumey. Upper and lower bounds on constructing alphabetic binary trees. In
Proc. 4th ACM-SIAM Symp. on Discrete Algorithms, pages 185-193, 1993. [p.207]
S. T. Klein. Efficient optimal recompression. The Computer J., 40(2/3): 117-126, 1997. [p.133]
D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-
Wesley, Reading, Massachusetts, 1973. [p. 10, 11, 203, 205, 207]
D. E. Knuth. Dynamic Huffman coding. J. of Algorithms, 6(2): 163-180, June 1985. [p. 146]
L. G. Kraft. A device for quantizing, grouping, and coding amplitude modulated pulses. Mas-
ter's thesis, MIT, Cambridge, Massachusetts, 1949. [p.18]
R. M. Krause. Channels which transmit letters of unequal duration. Information and Control,
5:13-24, 1962. [p.212]
E. S. Laber. Códigos de Prefixo: Algoritmos e Cotas. PhD thesis, Pontifícia Universidade
Católica do Rio de Janeiro (PUC-Rio), Departamento de Informática, July 1999. In Por-
tuguese. [p. 9]
E. S. Laber, R. L. Milidiú, and A. A. Pessoa. Practical constructions of L-restricted alphabetic
prefix codes. In Proc. Symp. String Processing and Information Retrieval, pages 115-119,
1999. [p. 214]
G. G. Langdon. An introduction to arithmetic coding. IBM J. of Research and Development, 28
(2): 135-149, March 1984. [p.92]
G. G. Langdon and J. Rissanen. Compression of black-white images with arithmetic coding.
IEEE Trans. on Communications, COM-29(6):858-867, June 1981. [p. 118]
G. G. Langdon and J. Rissanen. Method and means for carry-over control in the high order to
low order pairwise combining of digits in a decodeable set of relatively shifted finite number
strings, July 1984. United States Patent Number 4,463,342. [p. 113]
L. L. Larmore and D. S. Hirschberg. A fast algorithm for optimal length-limited Huffman codes.
J. of the ACM, 37(3):464-473, July 1990. [p. 194, 198]
L. L. Larmore and T. M. Przytycka. A fast algorithm for optimum height-limited alphabetic
binary trees. SIAM J. on Computing, 23(6):1283-1312, December 1994. [p. 207, 214]
L. L. Larmore and T. M. Przytycka. The optimal alphabetic tree problem revisited. J. of Algo-
rithms, 28(1):1-20, 1998. [p.207]
N. J. Larsson. Extended application of suffix trees to data compression. In Storer and Cohn
[1996], pages 190-199. [p. 242]
N. J. Larsson. Structures of String Matching and Data Compression. PhD thesis, Lund Univer-
sity, Sweden, September 1999. [p. 9, 242]
N. J. Larsson and A. Moffat. Offline dictionary-based compression. Proc. IEEE, 88(11):1722-
1732, November 2000. [p. 248, 249]
D. A. Lelewer and D. S. Hirschberg. Data compression. Computing Surveys, 19(3):261-296,
September 1987. [p. 10]
D. A. Lelewer and D. S. Hirschberg. Streamlining context models for data compression. In
Storer and Reif [1991], pages 313-322. [p.230]
M. Liddell and A. Moffat. Length-restricted coding in static and dynamic frameworks. In Storer
and Cohn [2001], pages 133-142. [p.88, 186, 202]
M. Liddell and A. Moffat. Incremental calculation of optimal length-restricted codes. Submitted,
January 2002. [p.201]
G. Longo and G. Galasso. An application of informational divergence to Huffman codes. IEEE
Trans. on Information Theory, 28(1):36-43, January 1982. [p. 180]
U. Manber. A text compression scheme that allows fast searching directly in the compressed
file. ACM Trans. on Information Systems, 15(2):124-136, April 1997. [p. 248]
U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. In Proc.
Symp. Discrete Algorithms, pages 319-327, San Francisco, 1990. [p. 242]
D. Manstetten. Tight upper bounds on the redundancy of Huffman codes. IEEE Trans. on
Information Theory, 38(1):144-151, January 1992. [p. 89]
B. McMillan. Two inequalities implied by unique decipherability. Inst. Radio Engineers Trans.
on Information Theory, IT-2:115-116, December 1956. [p. 18, 19]
K. Mehlhorn. An efficient algorithm for constructing nearly optimal prefix codes. IEEE Trans.
on Information Theory, IT-26(5):513-517, 1980. [p.212]
R. L. Milidiú and E. S. Laber. The WARM-UP algorithm: A Lagrangean construction of length
restricted Huffman codes. SIAM J. on Computing, 30(5):1405-1426, 2000. [p. 202]
R. L. Milidiú and E. S. Laber. Bounding the inefficiency of length-restricted prefix codes.
Algorithmica, 31(4):513-529, 2001. [p. 202]
R. L. Milidiú, E. S. Laber, and A. A. Pessoa. Bounding the compression loss of the FGK
algorithm. J. of Algorithms, 32(2):195-211, 1999. [p. 146]
R. L. Milidiú, A. A. Pessoa, and E. S. Laber. Three space-economical algorithms for calculating
minimum-redundancy prefix codes. IEEE Trans. on Information Theory, 47(6):2185-2198,
September 2001. [p. 88]
V. Miller and M. Wegman. Variations on a theme by Ziv and Lempel. In A. Apostolico and
Z. Galil, editors, Combinatorial Algorithms on Words, Volume 12, NATO ASI Series F, pages
131-140, Berlin, 1985. Springer-Verlag. [p.220]
S. Mitarai, M. Hirao, T. Matsumoto, A. Shinohara, M. Takeda, and S. Arikawa. Compressed
pattern matching for SEQUITUR. In Storer and Cohn [2001], pages 469-478. [p.248]
J. L. Mitchell and W. B. Pennebaker. Software implementations of the Q-coder. IBM J. of
Research and Development, 32:753-774, 1988. [p. 187]
A. Moffat. Word based text compression. Software - Practice and Experience, 19(2):185-198,
February 1989. [p. 243, 245]
A. Moffat. Implementing the PPM data compression scheme. IEEE Trans. on Communications,
38(11):1917-1921, November 1990. [p. 140, 142, 222, 226]
A. Moffat. An improved data structure for cumulative probability tables. Software - Practice
and Experience, 29(7):647-659, 1999. [p. 161, 166]
A. Moffat and J. Katajainen. In-place calculation of minimum-redundancy codes. In S. G. Akl,
F. Dehne, and J.-R. Sack, editors, Proc. Workshop on Algorithms and Data Structures, pages
393-402. Springer-Verlag, LNCS 955, August 1995. Source code available from
www.cs.mu.oz.au/~alistair/inplace.c. [p. 6?]

A. Moffat, R. M. Neal, and I. H. Witten. Arithmetic coding revisited. ACM Trans. on Information
Systems, 16(3):256-294, July 1998. Source code available from
www.cs.mu.oz.au/~alistair/arith_coder/. [p. 93, 98, 111-113]
A. Moffat and O. Petersson. An overview of adaptive sorting. Australian Computer J., 24(2):
70-77, May 1992. [p. 57]
A. Moffat, N. Sharman, I. H. Witten, and T. C. Bell. An empirical evaluation of coding meth-
ods for multi-symbol alphabets. Information Processing & Management, 30(6):791-804,
November 1994. [p. 120, 140, 176, 177, 190]
A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Informa-
tion Retrieval, 3(1):25-47, July 2000. [p. 42, 46, 47]
A. Moffat and A. Turpin. On the implementation of minimum-redundancy prefix codes. IEEE
Trans. on Communications, 45(10):1200-1207, October 1997. [p. 57, 62, 65]
A. Moffat and R. Wan. Re-Store: A system for compressing, browsing, and searching large
documents. In Proc. Symp. String Processing and Information Retrieval, pages 162-174,
Laguna de San Rafael, Chile, November 2001. [p. 249, 250]
A. Moffat and J. Zobel. Parameterised compression for sparse bitmaps. In N. J. Belkin, P. In-
gwersen, and A. M. Pejtersen, editors, Proc. 15th Ann. Int. ACM SIGIR Conf. on Research
and Development in Information Retrieval, pages 274-285, Copenhagen, June 1992. ACM
Press, New York. [p. 41]
H. Nakamura and S. Murashima. Data compression by concatenations of symbol pairs. In Proc.
IEEE Int. Symp. on Information Theory and its Applications, pages 496-499, Victoria, BC,
Canada, September 1996. [p. 248]
M. Nelson and J. L. Gailly. The Data Compression Book: Featuring Fast, Efficient Data Com-
pression Techniques in C. IDG Books Worldwide, Redwood City, California, second edition,
1995. [p. 9]
C. G. Nevill-Manning. Inferring Sequential Structure. PhD thesis, University of Waikato, New
Zealand, 1996. [p. 9]
C. G. Nevill-Manning and I. H. Witten. Compression and explanation using hierarchical gram-
mars. The Computer J., 40(2/3):103-116, 1997. [p. 248]
C. G. Nevill-Manning and I. H. Witten. On-line and off-line heuristics for inferring hierarchies
of repetitions in sequences. Proc. IEEE, 88(11):1745-1755, November 2000. [p. 248]
R. Pasco. Source coding algorithms for fast data compression. PhD thesis, Stanford University,
CA, 1976. [p. 92]
W. B. Pennebaker and J. L. Mitchell. JPEG: Still Image Data Compression Standard. Van
Nostrand Reinhold, New York, 1993. [p. 190, 215, 251]
W. B. Pennebaker, J. L. Mitchell, G. G. Langdon, and R. B. Arps. An overview of the basic prin-
ciples of the Q-Coder adaptive binary arithmetic coder. IBM J. of Research and Development,
32(6):717-726, November 1988. [p. 92, 187]
Y. Perl, M. R. Garey, and S. Even. Efficient generation of optimal prefix code: equiprobable
words using unequal cost letters. J. of the ACM, 22:202-214, April 1975. [p. 212]
A. A. Pessoa. Construção eficiente de códigos livres de prefixo. Master's thesis, Departamento
de Informática, PUC-Rio, September 1999. In Portuguese. [p. 9]
O. Petersson and A. Moffat. A framework for adaptive sorting. Discrete Applied Mathematics,
59(2):153-179, 1995. [p.57]
R. F. Rice. Some practical universal noiseless coding techniques. Technical Report 79-22, Jet
Propulsion Laboratory, Pasadena, California, March 1979. [p. 37]
J. Rissanen. Generalised Kraft inequality and arithmetic coding. IBM J. of Research and Devel-
opment, 20:198-203, May 1976. [p. 92]
J. Rissanen and G. G. Langdon. Arithmetic coding. IBM J. of Research and Development, 23
(2):149-162, March 1979. [p. 92]
J. Rissanen and G. G. Langdon. Universal modeling and coding. IEEE Trans. on Information
Theory, IT-27(1):12-23, January 1981. [p. 3, 244]
J. Rissanen and K. M. Mohiuddin. A multiplication-free multialphabet arithmetic code. IEEE
Trans. on Communications, 37(2):93-98, February 1989. [p. 123]
J. J. Rissanen. Arithmetic codings as number representations. Acta. Polytech. Scandinavica,
Math 31:44-51, 1979. [p. 92]
F. Rubin. Experiments in text file compression. Communications of the ACM, 19(11):617-623,
November 1976. [p. 248]
F. Rubin. Arithmetic stream coding using fixed precision registers. IEEE Trans. on Information
Theory, IT-25(6):672-675, November 1979. [p. 92]
B. Y. Ryabko. Technical correspondence on 'A locally adaptive data compression scheme'.
Communications of the ACM, 30:792, September 1987. [p. 171]
B. Y. Ryabko. A fast on-line adaptive code. IEEE Trans. on Information Theory, 4(38):1400-
1404, July 1992. [p. 190]
K. Sadakane. A fast algorithm for making suffix arrays and for Burrows-Wheeler transforma-
tion. In Storer and Cohn [1998], pages 129-138. [p. 233, 242]
K. Sadakane. Unifying Text Search and Compression: Suffix Sorting, Block Sorting and Suffix
Arrays. PhD thesis, The University of Tokyo, December 1999. [p.9]
D. Salomon. Data Compression: The Complete Reference. Springer-Verlag, second edition,
2000. [p. 10, 215, 251]
K. Sayood. Introduction to Data Compression. Morgan Kaufmann, second edition, March 2000.
[p. 10, 215, 251]

M. Schindler. A fast block-sorting algorithm for lossless data compression. In Storer and Cohn
[1997], page 469. [p. 174, 233, 241]
M. Schindler. A fast renormalisation for arithmetic coding. In Storer and Cohn [1998], page
572. [p. 114, 117]

E. S. Schwartz. An optimum encoding with minimum longest code and total number of digits.
Information and Control, 7:37-44, 1964. [p.55]
E. S. Schwartz and B. Kallick. Generating a canonical prefix encoding. Communications of the
ACM, 7(3):166-169, March 1964. [p.57]
R. Sedgewick. Algorithms in C. Addison-Wesley, Reading, Massachusetts, 1990. [p. 5, 10, 69]
J. Seward. On the performance of BWT sorting algorithms. In Storer and Cohn [2000], pages
173-182. [p. 242]
J. Seward. Space-time tradeoffs in the inverse B-W transform. In Storer and Cohn [2001], pages
439-448. [p. 242]
J. Seward and J. L. Gailly. Bzip2 program and documentation, 1999.
sourceware.cygnus.com/bzip2/. [p. 241]

C. E. Shannon. A mathematical theory of communication. Bell Systems Technical J., 27:379-
423, 623-656, 1948. [p. 16, 17, 51, 92, 210]
C. E. Shannon. Prediction and entropy of printed English. Bell Systems Technical J., 30:55,
1951. [p. 20]
C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. The University of
Illinois Press, Urbana, Illinois, 1949. [p. 10, 211]
A. Sieminski. Fast decoding of the Huffman codes. Information Processing Letters, 26(5):
237-241, May 1988. [p.63]
M. J. Slattery and J. L. Mitchell. The Qx-coder. IBM J. of Research and Development, 42(6),
1998. www.research.ibm.com/journal/rd/426/mitchell.html. [p. 187]
D. Sleator and R. Tarjan. Self-adjusting binary search trees. J. of the ACM, 32(3):652-686, July
1985. [p. 173, 174, 176]

R. J. Solomonoff. A formal theory of inductive inference. Parts I and II. Information and
Control, 7: 1-22 and 224-254, 1964. [p.248]
J. A. Storer. Data Compression: Methods and Theory. Computer Science Press, Rockville,
Maryland, 1988. [p.9]

J. A. Storer. An Introduction to Data Structures and Algorithms. Birkhäuser Springer, Boston,
MA, 2002. [p. 5, 10]

J. A. Storer and M. Cohn, editors. Proc. 1993 IEEE Data Compression Conf., March 1993.
IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 1996 IEEE Data Compression Conf., April 1996. IEEE
Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 1997 IEEE Data Compression Conf., March 1997.
IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 1998 IEEE Data Compression Conf., March 1998.
IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 1999 IEEE Data Compression Conf., March 1999.
IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 2000 IEEE Data Compression Conf., March 2000.
IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and M. Cohn, editors. Proc. 2001 IEEE Data Compression Conf., March 2001.
IEEE Computer Society Press, Los Alamitos, California.
J. A. Storer and J. H. Reif, editors. Proc. 1991 IEEE Data Compression Conf., April 1991. IEEE
Computer Society Press, Los Alamitos, California.
J. A. Storer and T. G. Szymanski. Data compression via textual substitution. J. of the ACM, 29:
928-951, 1982. [p. 218]

L. Stuiver and A. Moffat. Piecewise integer mapping for arithmetic coding. In Storer and Cohn
[1998], pages 3-12. [p. 117, 123, 125, 129, 190, 191]

H. Tanaka. Data structure of the Huffman codes and its application to efficient encoding and
decoding. IEEE Trans. on Information Theory, IT-33(1):154-156, January 1987. [p.63]

W. J. Teahan. Modelling English Text. PhD thesis, University of Waikato, New Zealand, 1998.
[p. 9, 222]
W. J. Teahan and J. G. Cleary. The entropy of English using PPM-based models. In Storer and
Cohn [1996], pages 53-62. [p. 246]
W. J. Teahan and J. G. Cleary. Models of English text. In Storer and Cohn [1997], pages 12-21.
[p. 246]
W. J. Teahan and J. G. Cleary. Tag based models of English text. In Storer and Cohn [1998],
pages 43-52. [p. 246]
W. J. Teahan and D. J. Harper. Combining PPM models using a text mining approach. In Storer
and Cohn [2001], pages 153-162. [p. 230]
J. Teuhola. A compression method for clustered bit-vectors. Information Processing Letters, 7
(6):308-311, October 1978. [p. 41]
A. Turpin. Efficient Prefix Coding. PhD thesis, University of Melbourne, Australia, 1998. [p. 9]
A. Turpin and A. Moffat. Practical length-limited coding for large alphabets. The Computer J.,
38(5):339-347, 1995. [p. 194]
A. Turpin and A. Moffat. Efficient implementation of the package-merge paradigm for generat-
ing length-limited codes. In P. Eades and M. E. Houle, editors, Proc. CATS'96 (Computing:
The Australasian Theory Symp.), pages 187-195, University of Melbourne, January 1996.
[p.201]

A. Turpin and A. Moffat. Comment on "Efficient Huffman decoding" and "An efficient finite-
state machine implementation of Huffman decoders". Information Processing Letters, 68(1):
1-2, 1998. [p.63]
A. Turpin and A. Moffat. Adaptive coding for data compression. In J. Edwards, editor, Proc.
22nd Australasian Computer Science Conf., pages 63-74, Auckland, January 1999. Springer-
Verlag, Singapore. [p. 190]
A. Turpin and A. Moffat. Housekeeping for prefix coding. IEEE Trans. on Communications, 48
(4):622-628, April 2000. Source code available from
www.cs.mu.oz.au/~alistair/mr_coder/. [p. 82]
A. Turpin and A. Moffat. On-line adaptive canonical prefix coding with bounded compression
loss. IEEE Trans. on Information Theory, 47(1):88-98, January 2001. [p. 78, 81, 181-183]
J. van Leeuwen. On the construction of Huffman trees. In Proc. 3rd Int. Coll. on Automata,
Languages, and Programming, pages 382-410, Edinburgh University, Scotland, July 1976.
Edinburgh University. [p. 66]
D. C. Van Voorhis. Constructing codes with bounded codeword lengths. IEEE Trans. on Infor-
mation Theory, IT-20(2):288-290, March 1974. [p.194]
D. C. Van Voorhis. Constructing codes with ordered codeword lengths. IEEE Trans. on Infor-
mation Theory, IT-21(1):105-106, January 1975. [p.214]
J. S. Vitter. Design and analysis of dynamic Huffman codes. J. of the ACM, 34(4):825-845,
October 1987. [p. 146, 176]
J. S. Vitter. Algorithm 673: Dynamic Huffman coding. ACM Trans. on Mathematical Software,
15(2):158-167, June 1989. [p.146]
P. A. J. Volf and F. M. J. Willems. Switching between two universal source coding algorithms.
In Storer and Cohn [1998], pages 491-500. [p. 233, 241]
V. Wei. Data Compression: Theory and Algorithms. Academic Press/Harcourt Brace Jo-
vanovich, Orlando, Florida, 1987. [p. 10]
T. A. Welch. A technique for high performance data compression. IEEE Computer, 17(6):8-20,
June 1984. [p.220]
F. M. J. Willems, Yu. M. Shtarkov, and Tj. J. Tjalkens. Context tree weighting method: Basic
properties. IEEE Trans. on Information Theory, 32(4):526-532, July 1995. [p. 230]
F. M. J. Willems, Yu. M. Shtarkov, and Tj. J. Tjalkens. Context weighting for general finite
context sources. IEEE Trans. on Information Theory, 33(5):1514-1520, September 1996.
[p. 230]
H. E. Williams and J. Zobel. Compressing integers for fast file access. The Computer J., 42(3):
193-201, 1999. [p.246]
R. N. Williams. Adaptive Data Compression. Kluwer Academic, Norwell, Massachusetts,
1991a. [p.9]
R. N. Williams. An extremely fast Ziv-Lempel data compression algorithm. In Storer and Reif
[1991], pages 362-371. [p.219]
A. I. Wirth. Symbol-driven compression of Burrows Wheeler transformed text. Master's thesis,
The University of Melbourne, Australia, September 2000. [p. 9]
A. I. Wirth and A. Moffat. Can we do without ranks in Burrows Wheeler transform compression?
In Storer and Cohn [2001], pages 419-428. [p. 233, 240]
W. D. Withers. The ELS-coder: A rapid entropy coder. In Storer and Cohn [1997], page 475.
[p. 130]

I. H. Witten and T. C. Bell. The zero frequency problem: Estimating the probabilities of novel
events in adaptive text compression. IEEE Trans. on Information Theory, 37(4):1085-1094,
July 1991. [p. 140, 142]
I. H. Witten, T. C. Bell, A. Moffat, C. G. Nevill-Manning, T. C. Smith, and H. Thimbleby.
Semantic and generative models for lossy text compression. The Computer J., 37(2):83-87,
April 1994. [p.251]
I. H. Witten and E. Frank. Machine Learning: Learning Tools and Techniques with Java Imple-
mentations. Morgan Kaufmann, San Francisco, 1999. [p.24]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing
Documents and Images. Morgan Kaufmann, San Francisco, second edition, 1999. [p. vii, 9,
36, 69, 229, 246]

I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic coding for data compression. Communi-
cations of the ACM, 30(6):520-541, June 1987. [p. 92, 93, 98, 111, 113, 156, 161, 166]
J. G. Wolff. An algorithm for the segmentation of an artificial language analogue. British J. of
Psychology, 66(1):79-90, 1975. [p.248]
J. G. Wolff. Recoding of natural language for economy of transmission or storage. The Computer
J., 21:42-44, 1978. [p.248]

H. Yokoo. Data compression using a sort-based context similarity measure. The Computer J.,
40(2/3):94-102, 1997. [p.230]

G. K. Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Reading,
Mass., 1949. [p.48, 70]
J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. on
Information Theory, IT-23(3):337-343, May 1977. [p.215]
J. Ziv and A. Lempel. Compression of individual sequences via variable rate coding. IEEE
Trans. on Information Theory, IT-24(5):530-536, September 1978. [p. 218, 220]
J. Zobel. Writing for Computer Science: The Art of Effective Communication. Springer-Verlag,
Singapore, 1997. [p. xii]
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software - Practice
and Experience, 25(8):891-903, August 1995. [p. 57, 245]
Index

C, see code
Cδ, 34, 46, 48-50, 84, 109
Cγ, 32, 48-50, 77, 122, 178, 239
D, see channel alphabet
E(), see expected codeword length
F(), see Fibonacci series
H(), see entropy
K(), see Kraft inequality
L, see maximum codeword length
λ, 45, 53, 225
nmax, 82
O(), 10
o(), 11
Ω(), 11
P, see probability distribution
φ, 76
φ, see golden ratio
S, see source alphabet
Θ(), 11
adaptive code, xi, 26, 135-136
adaptive_huffman_add(), 153
adaptive_huffman_decode(), 152
adaptive_huffman_encode(), 152
adaptive_huffman_increment(), 153
adaptive_huffman_output(), 152
alphabet
  binary, see arithmetic code, binary
  channel, see channel alphabet
  source, see source alphabet
  subalphabet selection, 82, 103, 249
alphabetic code, 193, 202-207
analysis of algorithms, 10-13
approximate_arithmetic_encode(), 124
approximate_decode_target(), 124
Archive Compression Test (ACT), 254
arithmetic code, x-xii, 8, 91ff., 132, 255
  adaptive, 154-157, 190-192, 214
  approximate mapping functions, 122-125, 190-192
  binary, 92, 118-122, 130, 178, 186-190, 230, 244
  bit stuffing for carry control, 113, 187
  bytewise renormalization, 114-117
  CACM implementation, 93, 98, 100, 113, 123
  effectiveness, 110-113
  efficiency, 110
  for unequal channel letters, 212-214
  idealized, 93-98
  implementation using integer arithmetic, 98-110
  origins, 92-93
  prelude representation, 103, 109, 125-127
  quasi, 130
  renormalization, 100-101, 109, 113-117
  shift/add implementation, 123-125
  structured, 177-179, 239, 242, 243
  table-driven, 127-130, 186-190
  termination bits, 105-107
  TOIS implementation, 93, 98, 123
arithmetic_decode(), 104
arithmetic_decode_block(), 102
arithmetic_encode(), 99
arithmetic_encode_block(), 102
arithmetic_encode_bytewise(), 115
artificial distribution, 78-81
ASCII, 20-21, 26, 134, 155
back(), 158
Bernoulli trial, 37
Bible, King James, 231, 238, 240, 255
big-O notation, 10-11
bigram model, 134, 139, 144, 248
binary code, 6, 26, 30-33, 48-50, 218
  centered, 46-47
  minimal, see minimal binary code
  reverse centered, 47
  simple, 30, 249-250
binary search, 33, 61, 77, 110, 150, 161
binary search tree, 173, 176, 227
  degenerate, see stick
binary_arithmetic_decode(), 119
binary_arithmetic_encode(), 119
bit stuffing, see arithmetic code, bit stuffing for carry control
bit_plus_follow(), 99
blocking, 82, 192
Bubblesort, ix
Burrows-Wheeler transformation, xi, 232-243, 255
  word-based, 242-243
bwt_inverse(), 236
byte_carry(), 116
byte_plus_prev(), 116
BZIP2, xi, 241-242, 255
cache, 14, 59, 64, 70
calculate_alphabetic_code(), 204
calculate_huffman_code(), 67
calculate_runlength_code(), 72, 73
calculate_twopower_code(), 79
Calgary corpus, 220
canonical code, 57-62, 64-65, 69, 84, 88, 89, 179, 183, 193, 194, 203
canonical_decode(), 60
canonical_encode(), 60
Canterbury corpus, 231, 238, 254, 255
centered_binary_in_range(), 43
centered_minimal_binary_encode(), 43
channel alphabet, 4, 6, 7, 193, 247
character-based model, 24, 64, 65, 70, 134, 140, 156, 175
code, 6
  as a tree, 55, 145, 175
  bi-directional, 214
  use in daily life, 8
codeword, 7
coding, 4
  example, 7-8
  in blocks, 82
  relationship with modeling, 4-6
  terminology, 6-7
  theses about, 9
combinations, 13
complete code, 8, 18, 195, 209, 212
COMPRESS, 220
compression
  books about, 9
  rationale for, 1-3
  systems, xi, 3-6, 20-27, 215ff.
conditioning class, 221
constrained code, xi, 193ff.
context tree, 227, 232
context tree weighting, 230
decode_target(), 104
directed acyclic graph, 74
DMC, 118, 229
DPCM, 245
dynamic code, see adaptive code
dynamic Huffman coding, see Huffman code, adaptive
Elias code, 32-36, 41, 48-50, 77, 92, 122, 218
elias_delta_decode(), 35
elias_delta_encode(), 35
elias_gamma_decode(), 35
elias_gamma_encode(), 35
entropy, 15, 17, 36, 39-40, 89, 210
  of natural language, 19-20, 231
enumerative code, 137-139
escape probability, 139-145, 222, 227, 239
  method A, 141, 146
  method B, 141-142
  method C, 142, 146
  method D, 142, 240
  method E, 145
  method O, 144
  method X, 142-144
expected codeword length, 7
exponential search, 33, 34, 77, 165
facsimile compression, 1-2, 244
factorial, 13
fast_get_and_increment_count(), 163
fast_get_bound(), 163
fast_get_symbol(), 163
fast_scaling(), 168
fast_to_probs(), 168
Fenwick tree, 157-161, 176, 190
  modified, 161-166, 170, 176
fenwick_get_and_increment_count(), 159
fenwick_get_bound(), 159
fenwick_get_symbol(), 162
Fibonacci series, 13-14, 89, 127, 194
finish_decode(), 105
finish_encode(), 105
finite-state machine, 63, 229, 246
first-order model, 5, 25, 222-225
forw(), 158
frugal_finish_encode(), 107
frugal_start_encode(), 107
geometric distribution, 37, 39, 48, 78
  entropy, 39
get_one_bit(), 30
get_one_integer(), 31
golden ratio, 13, 127, 202
Golomb code, 36-41, 48-50, 55, 57, 78, 97, 130, 133, 245, 246, 249
  choice of b, 37-38
  worst-case cost, 38-40
golomb_decode(), 37
golomb_encode(), 37
GZIP, xi, 219
heap, 14, 69
Heapsort, ix, 14
Huffman code, ix, 5, 8, 53-57, 203, 215
  adaptive, 145-154, 179
  algorithm, see minimum-redundancy code
  canonical, see canonical code
  for merging, 57
  handling ties, 55, 214
  infinite, 37, 55
  number of codes, 53-55
  radix T, 212, 247-248
  tree, 55, 59, 61, 63, 65, 88, 121, 145
human compression, 19-20, 231
ideal_arithmetic_decode(), 94
ideal_arithmetic_encode(), 94
image compression, 2, 175, 187, 243-245
infinite code, 29, 32, 37, 55
  doubly, 41, 245
information, 15-17, 210-212
interpolative code, xii, 42-50, 84, 97, 103, 134, 171, 249
  worst-case cost, 47
interpolative_encode_block(), 43
inverted index, 36, 41, 246
Kraft inequality, 15, 17-19, 51, 53, 88, 195
length-limited code, 90, 186, 194-202
  approximate, 201-202
  effectiveness, 202
  using runlengths, 201
linear search, 34, 59, 61
lossy compression, 27, 251
LZ77, xi, 215-221
  algorithm, 215-218
  coding methods, 218-219
  match detection, 219-220
  variations, 220
lz77_decode_block(), 217
lz77_encode_block(), 217
LZ78, 218, 220
Markov source, 5
maximum codeword length, 58, 82, 89-90, 194-202
Mergesort, ix, 11, 14, 57
merging, 57
Milton, 20, 136, 205, 218
minimal binary code, 26, 30, 39, 46-48, 78, 123
minimal_binary_decode(), 31
minimal_binary_encode(), 31
minimum message length, 24, 127
minimum-redundancy code, x, 8, 27, 32, 37, 51, 55, 103, 201, 218, 241, 246, 249
  adaptive, 175
  algorithm, 66-70
  approximate, 88, 179-186, 192
  byte aligned, 247-248
  decoding, 63-65
  effectiveness, 88-89
  efficiency, 69-70, 76-77, 81, 84-88, 154
  homogenization, 88
  housekeeping, 81-87
  length of, 89-90
  prelude representation, 82-84, 87, 103, 109
  using runlengths, 71-78, 88, 180-181
model
  BWT, see Burrows-Wheeler transformation
  DMC, see DMC
  first-order, see first-order model
  lossy, see lossy compression
  off-line, 192, 243, 248, 255
  on-line, 136, 150, 183, 255
  PPM, see PPM
  word-based, see word-based model
  zero-order, see zero-order model
model of computation, 12-13
modeling, 3
  theses about, 9
monotonic code, 214
Morse code, 132, 209
move-to-front transformation, 170-175, 177, 229, 237-239
mr_decode_block(), 86
mr_encode_block(), 83
mtf_inverse(), 172
mtf_transform(), 172
noiseless channel, 17
numerical sequence property, 58
optimal binary search, 61, 176, 193
package-merge algorithm, see length-limited code
plokijuh, 140, 274
PPM, xi, 221-232, 239, 255
  data structure, 227-229
  efficiency, 230-232
  exclusions, 225-226, 229
  memory space, 232
  update exclusions, 226-227
  variants, 229-230
  word-based, 243
ppm_encode_block(), 223
PPMZ, 232, 241
prediction by partial matching, see PPM
prefix-free code, 7, 18, 51
prelude, 22, 29, 46, 82, 87, 103, 109, 125, 132-134, 249
probability distribution, 6-7, 51
  aging of, 166-170, 222
  artificial, see artificial distribution
  conversion to runlength form, 77-78
  entropy of, see entropy
  geometric, see geometric distribution
  implicit, 30, 48, 239
  infinite, see infinite code
  runlength representation, 71, 78, 201
  self-probabilities, 6, 22, 132, 146
  skew, see skew distribution
  uniform, see uniform distribution
  Zipf, see Zipf distribution
probability estimation, 4, 131ff.
probs_to_fast(), 168
program compression, 246
pseudo-adaptive code, 179-186, 192
push-down automaton, 246
put_one_bit(), 30
put_one_integer(), 31
Q coder, 130, 186-190
Quicksort, ix, 13, 14, 70, 84, 242
  ternary, 242
random access machine, 12
ranking transformation, see move-to-front transformation
redundancy, 89
RE-PAIR, 248-249
reverse_package_merge(), 200
Rice code, 37-41, 245
searching, 33-34, 40
searching compressed text, 247-250
Selectionsort, 11
self-information, 22
semi-static code, 22, 26, 131-134, 192, 246
Shannon-Fano code, 51-52, 89
sibling property, 74, 146
sibling_increment(), 150
signal compression, 245
size(), 158
skew distribution, 30, 48, 118, 192, 239, 247
sliding window, see LZ77
sorting, viii-ix, 11-12, 14, 57, 69, 77, 242
source alphabet, 4, 6, 29, 61
spaceless words, 246
splay tree
  as a code, 175-176
  for MTF, 173-174
  for probability estimation, 176-177
start_decode(), 105
start_encode(), 105
static code, x, 21, 26, 29ff., 131-132
  evaluation, 48-50
statistics dilution, 221
stick, 90, 120
Stirling's approximation, 13
STUFFIT, 243
subalphabet, 22, 82, 249
suffix array, 242
text compression, 245-251, see also LZ77, PPM, Burrows-Wheeler transformation
text-retrieval system, 36, 41, 245, 246
two's complement arithmetic, 106, 160
twopower_add(), 185
twopower_decode(), 184
twopower_encode(), 184
twopower_increment(), 185
twopower_initialize(), 184
unary code, 4, 29-30, 32, 34, 84, 178, 218, 239
unary_decode(), 30
unary_encode(), 30
unequal letter-cost code, 193, 209-214
uniform distribution, 30, 48, 78, 249
unique decipherability, 18
universal code, 36, 50
word-based model, 61, 70, 71, 140, 242-243, 245-250
Z coder, 130
zero-frequency problem, 139-146, 222
zero-order model, 4, 5, 24, 65, 103, 134, 140
zero-redundancy code, 17, 30, 32, 36, 111, 120
Zipf distribution, 48, 70
