
Variable-length Codes for Data Compression

David Salomon

Variable-length Codes
for Data Compression

Professor David Salomon (emeritus)
Computer Science Department
California State University
Northridge, CA 91330-8281
USA
email: david.salomon@csun.edu

British Library Cataloguing in Publication Data


A catalogue record for this book is available from the British Library
Library of Congress Control Number:

ISBN 978-1-84628-958-3 e-ISBN 978-1-84628-959-0


Printed on acid-free paper.
© Springer-Verlag London Limited 2007

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permit-
ted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored
or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in
the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright
Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a
specific statement, that such names are exempt from the relevant laws and regulations and therefore free
for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information
contained in this book and cannot accept any legal responsibility or liability for any errors or omissions
that may be made.
9 8 7 6 5 4 3 2 1
Springer Science+Business Media
springer.com
To the originators and developers of the codes.
Apostolico, Capocelli, Elias, Fenwick, Fraenkel,
Golomb, Huffman, Klein, Pigeon, Rice, Stout,
Tsai, Tunstall, Villasenor, Wang, Wen, Wu,
Yamamoto, and others.

To produce a mighty book, you must choose a mighty theme.


—Herman Melville
Preface
The dates of most of the important historical events are known, but not always very
precisely. We know that Kublai Khan, grandson of Genghis Khan, founded the Yuan
dynasty in 1280 (it lasted until 1368), but we don’t know precisely (i.e., the month,
day and hour) when this act took place. A notable exception to this state of affairs is
the modern age of telecommunications, a historical era whose birth is known precisely,
up to the minute. On Friday, 24 May 1844, at precisely 9:45 in the morning, Samuel
Morse inaugurated the age of modern telecommunications by sending the first telegraphic
message in his new code. The message was sent over an experimental line funded by
the American Congress from the Supreme Court chamber in Washington, DC to the B
& O railroad depot in Baltimore, Maryland. Taken from the Bible (Numbers 23:23),
the message was “What hath God wrought?” It had been suggested to Morse by Annie
Ellsworth, the young daughter of a friend. It was prerecorded on a paper tape, was sent
to a colleague in Baltimore, and was then decoded and sent back by him to Washington.
An image of the paper tape can be viewed at [morse-tape 06].

Morse was born near Boston and was educated at Yale.


We would expect the inventor of the telegraph (and of such a
sophisticated code) to have been a child prodigy who tinkered
with electricity and gadgets from an early age (the electric
battery was invented when Morse was nine years old). In-
stead, Morse became a successful portrait painter with more
than 300 paintings to his credit. It wasn’t until 1837 that
the 46-year-old Morse suddenly quit his painting career and
started thinking about communications and tinkering with
electric equipment. It is not clear why he made such a drastic
career change at such an age, but it is known that two large,
wall-size paintings that he made for the Capitol building in
Washington, DC were ignored by museum visitors and rejected by congressmen. It may
have been this disappointment that gave us the telegraph and the Morse code.
Given this background, it is easy to imagine how the 53-year-old Samuel Morse
felt on that fateful day, Friday, 24 May 1844, as he sat hunched over his mysterious
apparatus, surrounded by a curious crowd of onlookers, some of whom had only a vague
idea of what he was trying to demonstrate. He must have been very anxious, because
his telegraph project, his career, and his entire future depended on the success of this
one test. The year before, the American Congress had awarded him $30,000 to prepare this
historic test and prove the value of the electric telegraph (and thus also confirm the
ingenuity of Yankees), and here he was now, dependent on the vagaries of his batteries, on
the new, untested 41-mile-long telegraph line, and on a colleague in Baltimore.
Fortunately, all went well. The friend in Baltimore received the message, decoded
it, and resent it within a few minutes, to the great relief of Morse and to the amazement
of the many congressmen assembled around him.

The Morse code, with its quick dots and dashes (Table 1), was extensively used for
many years, first for telegraphy, and beginning in the 1890s, for early radio communi-
cations. The development of more advanced communications technologies in the 20th
century displaced the Morse code, which is now largely obsolete. Today, it is used for
emergencies, for navigational radio beacons, land mobile transmitter identification, and
by continuous wave amateur radio operators.

A  .-      N  -.      1  .----   Period           .-.-.-
B  -...    O  ---     2  ..---   Comma            --..--
C  -.-.    P  .--.    3  ...--   Colon            ---...
Ch ----    Q  --.-    4  ....-   Question mark    ..--..
D  -..     R  .-.     5  .....   Apostrophe       .----.
E  .       S  ...     6  -....   Hyphen           -....-
F  ..-.    T  -       7  --...   Dash             -..-.
G  --.     U  ..-     8  ---..   Parentheses      -.--.-
H  ....    V  ...-    9  ----.   Quotation marks  .-..-.
I  ..      W  .--     0  -----
J  .---    X  -..-
K  -.-     Y  -.--
L  .-..    Z  --..
M  --
Table 1: The Morse Code for English.

Our interest in the Morse code is primarily with a little-known aspect of this code.
In addition to its advantages for telecommunications, the Morse code is also an early
example of text compression. The various dot-dash codes developed by Morse (and
possibly also by his associate, Alfred Vail) have different lengths, and Morse intuitively
assigned the shortest codes (a single dot and a single dash) to the letters E and T, and the
longer codes of four dots and dashes to Q, X, Y, and Z. The even longer codes of five dots
and dashes were assigned to the ten digits, and the longest codes (six dots and dashes)
became those of the punctuation marks. Morse also specified that the signal for error
is eight consecutive dots, in response to which the receiving operator should delete the
last word received.

It is interesting to note that Morse was not the first to think of compression (in
terms of time saving) by means of a code. The well-known Braille code for the blind
was developed by Louis Braille in the 1820s and is still in common use today. It consists
of groups (or cells) of 3 × 2 dots each, embossed on thick paper. Each of the six dots
in a group may be flat or raised, implying that the information content of a group is
equivalent to six bits, resulting in 64 possible groups. The letters, digits, and common
punctuation marks do not require all 64 codes, which is why the remaining groups may
be used to code common words—such as and, for, and of—and common strings of
letters—such as ound, ation, and th.
The Morse code has another feature that makes it relevant to us. Because the
individual codes have different lengths, there must be a way to identify the end of a
code. Morse solved this problem by requiring accurate relative timing. If the duration
of a dot is taken to be one unit, then that of a dash is three units, the space between
the dots and dashes of one character is one unit, the space between characters is three
units, and the interword space is six units (five for automatic transmission). This book
is concerned with the use of variable-length codes to compress digital data. With these
codes, it is important not to have any extra spaces. In fact, there is no such thing as a
space, because computers use only zeros and 1’s. Thus, when a string of data symbols is
compressed by assigning short codes (that are termed “codewords”) to the symbols, the
codewords (whose lengths vary) are concatenated into a long binary string without any
spaces or separators. Such variable-length codes must therefore be designed to allow for
unambiguous reading. Somehow, the decoder should be able to read bits and identify
the end of each codeword. Such codes are referred to as uniquely decodable or uniquely
decipherable (UD).
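
To make the idea concrete, here is a minimal Python sketch of a decoder for a prefix code
(prefix codes are the subject of Section 1.2). The four-codeword table is hypothetical and
chosen only for illustration; because no codeword is the prefix of another, the decoder can
identify the end of every codeword without any separators.

# A minimal sketch of unambiguous decoding with a hypothetical prefix code.
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
DECODE = {bits: sym for sym, bits in CODE.items()}

def encode(symbols):
    # Concatenate the codewords with no spaces or separators.
    return "".join(CODE[s] for s in symbols)

def decode(bitstring):
    symbols, current = [], ""
    for bit in bitstring:
        current += bit
        if current in DECODE:     # no codeword is a prefix of another, so this match is final
            symbols.append(DECODE[current])
            current = ""
    assert current == "", "trailing bits do not form a codeword"
    return symbols

bits = encode("ABACD")            # '0100110111'
print(bits, decode(bits))         # the decoder recovers A, B, A, C, D
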
Variable-length codes have become important in many areas of computer science.
This book is a survey of this important topic. It presents the principles underlying
this type of code and describes the important classes of variable-length codes. Many
examples illustrate the applications of these codes to data compression. The book is
devoted to the codes, which is why it describes very few actual compression algorithms.
Notice that many important (and some not so important) methods, algorithms, and
techniques for compressing data are described in detail in [Salomon 06].
The term representation is central to our discussion. A number can be represented
in decimal, binary, or any other number base (or number system, see Section 2.18).
Mathematically, a representation is a bijection (or a bijective function) of an infinite,
countable set S1 of strings onto another set S2 of strings (in practice, S2 consists of
binary strings, but it may also be ternary or based on other number systems), such
that any concatenation of any elements of S2 is UD. The elements of S1 are called data
symbols and those of S2 are codewords. Set S1 is an alphabet and set S2 is a code.
An interesting example is the standard binary notation. We normally refer to it as the
binary representation of the integers, but according to the definition above it is not a
representation because it is not UD. It is easy to see, for example, that a string of binary
codewords that starts with 11 can be either two consecutive 1’s or the code of 3.

A function f : X → Y is said to be bijective if for every y ∈ Y there is exactly one
x ∈ X such that f(x) = y.

Figure 3.19 and Table 3.22 list several variable-length UD codes assigned to the 26
letters of the English alphabet.
This book is aimed at readers who have a basic knowledge of data compression
and who want to know more about the specific codes used by the various compression
algorithms. The necessary mathematical background includes logarithms, polynomials,
a bit of calculus and linear algebra, and the concept of probability. This book is not
intended as a guide to software implementors and has no programs. Errors, mistypes,
comments, and questions should be sent to the author’s email address below.
It is my pleasant duty to acknowledge the substantial help and encouragement I
have received from Giovanni Motta and Cosmin Truţa for their painstaking efforts.
They read drafts of the text, found many errors and misprints, and provided valuable
comments and suggestions that improved this book and made it what it is. Giovanni
also wrote part of Section 2.12.
If, by any chance, I have omitted anything more or less proper or necessary, I beg
forgiveness, since there is no one who is without fault and circumspect in all matters.
—Leonardo Fibonacci, Liber Abaci (1202)

david.salomon@csun.edu                                          David Salomon

The Preface is the most important part of the book. Even reviewers read a preface.
—Philip Guedalla
Contents
Preface vii
Introduction 1
1 Basic Codes 9
1.1 Codes, Fixed- and Variable-Length 9
1.2 Prefix Codes 12
1.3 VLCs, Entropy, and Redundancy 13
1.4 Universal Codes 18
1.5 The Kraft–McMillan Inequality 19
1.6 Tunstall Code 21
1.7 Schalkwijk’s Coding 23
1.8 Tjalkens–Willems V-to-B Coding 28
1.9 Phased-In Codes 31
1.10 Redundancy Feedback (RF) Coding 33
1.11 Recursive Phased-In Codes 37
1.12 Self-Delimiting Codes 40
1.13 Huffman Coding 42
2 Advanced Codes 69
2.1 VLCs for Integers 69
2.2 Start-Step-Stop Codes 71
2.3 Start/Stop Codes 73
2.4 Elias Codes 74
2.5 Levenstein Code 80
2.6 Even–Rodeh Code 81
2.7 Punctured Elias Codes 82
2.8 Other Prefix Codes 83
2.9 Ternary Comma Code 86
2.10 Location Based Encoding (LBE) 87
2.11 Stout Codes 89
2.12 Boldi–Vigna (ζ) Codes 91
2.13 Yamamoto’s Recursive Code 94
2.14 VLCs and Search Trees 97
2.15 Taboo Codes 100
2.16 Wang’s Flag Code 105
2.17 Yamamoto Flag Code 106
2.18 Number Bases 110
2.19 Fibonacci Code 112
2.20 Generalized Fibonacci Codes 116
2.21 Goldbach Codes 120
2.22 Additive Codes 126
2.23 Golomb Code 129
2.24 Rice Codes 136
2.25 Subexponential Code 138
2.26 Codes Ending with “1” 139
3 Robust Codes 143
3.1 Codes For Error Control 143
3.2 The Free Distance 149
3.3 Synchronous Prefix Codes 150
3.4 Resynchronizing Huffman Codes 156
3.5 Bidirectional Codes 159
3.6 Symmetric Codes 168
3.7 VLEC Codes 170
Summary and Unification 177
Bibliography 179
Index 187

An adequate table of contents serves as a synopsis or headline display of the design or structural pattern of the body of the report.
—D. E. Scates and C. V. Good, Methods of Research
Introduction
The discipline of data compression has its origins in the 1950s and 1960s and has ex-
perienced rapid growth in the 1980s and 1990s. Currently, data compression is a vast
field encompassing many approaches and techniques. A student of this field realizes
quickly that the various compression algorithms in use today are based on and require
knowledge of diverse physical and mathematical concepts and topics, some of which are
included in the following, incomplete list: Fourier transforms, finite automata, Markov
processes, the human visual and auditory systems—statistical terms, distributions, and
concepts—Unicode, XML, convolution, space-filling curves, Voronoi diagrams, interpo-
lating polynomials, Fibonacci numbers, polygonal surfaces, data structures, the Van-
dermonde determinant, error-correcting codes, fractals, the Pascal triangle, fingerprint
identification, and analog and digital video.
Faced with this complexity, I decided to try and classify in this short introduction
most (but not all) of the approaches to data compression in four classes as follows:
(1) block-to-block codes, (2) block-to-variable codes, (3) variable-to-block codes, and
(4) variable-to-variable codes (the term “fixed” is sometimes used instead of “block”).
Other approaches to compression, such as mathematical transforms (orthogonal or
wavelet) and the technique of arithmetic coding, are not covered here. Following is
a short description of each class.

Block-to-block codes constitute a class of techniques that input n bits of raw data
at a time, perform a computation, and output the same number of bits. Such a process
results in no compression; it only transforms the data from its original format to a
format where it becomes easy to compress. Thus, this class consists of transforms. The
discrete wavelet, discrete cosine, and linear prediction are examples of transforms that
are commonly used as the first step in the compression of various types of data. Here is
a short description of linear prediction.
Audio data is common in today’s computers. We all have mp3, FLAC, and other
types of compressed audio files in our computers. A typical lossless audio compression
technique consists of three steps. (1) The original sound is sampled (digitized). (2) The
audio samples are converted, in a process that employs linear prediction, to small
numbers called residues. (3) The residues are replaced by variable-length codes. The
last step is the only one that produces compression.


Linear prediction of audio samples is based on the fact that most audio samples
are similar to their near neighbors. One second of audio is normally converted to many
thousands of audio samples (44,100 samples per second is typical), and adjacent samples
tend to be similar because sound rarely varies much in pitch or frequency during one
second. If we denote the current audio sample by s(t), then linear prediction computes a
predicted value ŝ(t) from the p immediately-preceding samples by a linear combination
of the form

                        ŝ(t) = Σ_{i=1}^{p} aᵢ s(t − i).

Parameter p depends on the specific algorithm and may also be user-controlled. The
parameters aᵢ are linear coefficients that are also determined by the algorithm.
If the prediction is done properly, the difference (which is termed residue or residual)
e(t) = s(t)− ŝ(t) will almost always be a small (positive or negative) number, although in
principle it could be about as large as s(t) or −s(t). The difference between the various
linear prediction methods is in the number p of previous samples that they employ and
in the way they determine the linear coefficients ai .
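
The following Python sketch illustrates the computation of residues under the assumption of
fixed, hand-picked coefficients; an actual codec such as FLAC chooses the order p and the
coefficients adaptively, so this is only a schematic of the prediction step.

# Linear-prediction sketch with fixed coefficients (illustrative values only).
def residues(samples, coeffs):
    p = len(coeffs)
    out = list(samples[:p])                      # the first p samples are sent verbatim
    for t in range(p, len(samples)):
        pred = sum(coeffs[i] * samples[t - 1 - i] for i in range(p))
        out.append(samples[t] - round(pred))     # residue e(t) = s(t) - predicted value
    return out

samples = [100, 102, 105, 107, 108, 110]
print(residues(samples, [2.0, -1.0]))            # second-order predictor 2s(t-1) - s(t-2)
# [100, 102, 1, -1, -1, 1]: small residues, suitable for short variable-length codes
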

Block-to-variable codes are the most important of the four types discussed here.
Each symbol of the input alphabet is assigned a variable-length code according to its
frequency of occurrence (or, equivalently, its probability) in the data. Compression
is achieved if short codes are assigned to commonly-occurring (high probability) symbols
and long codes are assigned to rare symbols. Many statistical compression methods
employ this type of coding, most notably the Huffman method (Section 1.13). The
difference between the various methods is mostly in how they compute or estimate the
probabilities of individual data symbols. There are three approaches to this problem,
namely static codes, a two-pass algorithm, and adaptive methods.
Static codes. It is possible to construct a set of variable-length codes and perma-
nently assign each code to a data symbol. The result is a static code table that is built
into both encoder and decoder. To construct such a table, the developer has to analyze
large quantities of data and determine the probability of each symbol. For example,
someone who develops a compression method that employs this approach to compress
text has to start by selecting a number of representative “training” documents, count
the number of times each text character appears in those documents, compute frequen-
cies of occurrence, and use this fixed, static statistical model to assign variable-length
codewords to the individual characters. A compression method based on a static code
table is simple, but the results (the compression ratio for a given text file) depend on
how much the data resembles the statistics of the training documents.
A two-pass algorithm. The idea is to read the input data twice. The first pass simply
counts symbol frequencies and the second pass performs the actual compression by
replacing each data symbol with a variable-length codeword. In between the two passes,
the code table is constructed by utilizing the symbols’ frequencies in the particular
data being compressed (the statistical model is taken from the data itself). Such a
method features very good compression, but is slow because reading a file from an input
device, even a fast disk, is slower than memory-based operations. Also, the code table
is constructed individually for each data file being compressed, so it has to be included
in the compressed file, for the decoder’s use. This reduces the compression ratio but not
significantly, because a code table typically contains one variable-length code for each
of the 128 ASCII characters or for each of the 256 8-bit bytes, so its total length is only
a few hundred bytes.
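
Here is a minimal Python sketch of the two-pass idea. The first pass counts symbol
frequencies, a Huffman code (Section 1.13) is built from the counts, and the second pass
replaces each symbol with its codeword. The names and structure are illustrative only, and
in practice the code table must accompany the compressed output.

import heapq, itertools
from collections import Counter

def huffman_code(data):
    freq = Counter(data)                         # pass 1: symbol frequencies from the data itself
    tie = itertools.count()                      # tie-breaker so the heap never compares dicts
    heap = [[f, next(tie), {s: ""}] for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)          # merge the two least-frequent groups
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, [f1 + f2, next(tie), merged])
    return heap[0][2]                            # symbol -> variable-length codeword

def compress(data):
    table = huffman_code(data)
    bits = "".join(table[s] for s in data)       # pass 2: replace symbols by codewords
    return table, bits

table, bits = compress("this is an example of a two-pass method")
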
An adaptive method starts with an empty code table, or with a tentative table, and
modifies the table as more data is read and processed. Initially, the codes assigned to
the data symbols are inappropriate and are not based on the (unknown) probabilities of
the data symbols. But as more data is read, the encoder acquires better statistics of the
data and exploits it to improve the codes (the statistical model adapts itself gradually
to the data that is being read and compressed). Such a method has to be designed to
permit the decoder to mimic the operations of the encoder and modify the code table
in lockstep with it.
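
The sketch below shows only this lockstep principle, not any particular published method:
encoder and decoder start from the same tentative table (all counts equal), and after every
symbol they apply identical updates, so at each step they derive the same ranking of symbols
and therefore agree on which codeword stands for which symbol.

# Hypothetical adaptive model: both sides keep identical counts and rankings.
class AdaptiveModel:
    def __init__(self, alphabet):
        self.counts = {s: 1 for s in alphabet}   # tentative table: all symbols equally likely

    def ranking(self):
        # Most frequent symbol first; ties broken alphabetically on both sides.
        return sorted(self.counts, key=lambda s: (-self.counts[s], s))

    def update(self, symbol):
        self.counts[symbol] += 1                 # the same update is applied on both sides

enc, dec = AdaptiveModel("abc"), AdaptiveModel("abc")
for s in "bbbac":
    r = enc.ranking().index(s)     # the encoder would emit a codeword for rank r (e.g., unary)
    assert dec.ranking()[r] == s   # the decoder maps the same rank back to the same symbol
    enc.update(s)
    dec.update(s)
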
A simple statistical model assigns variable-length codes to symbols based on sym-
bols’ probabilities. It is possible to improve the compression ratio significantly by basing
the statistical model on probabilities of pairs or triplets of symbols (digrams and tri-
grams), instead of probabilities of individual symbols. The result is an n-order statistical
compression method where the previous n symbols are used to predict (i.e., to assign
a probability to) the current symbol. The PPM (prediction by partial matching) and
DMC (dynamic Markov coding) methods are examples of this type of algorithm.
It should be noted that arithmetic coding, an important statistical compression
method, is included in this class, but operates differently. Instead of assigning codes to
individual symbols (bits, ASCII codes, Unicodes, bytes, etc.), it assigns one long code
to the entire input file.

Variable-to-block codes is a term that refers to a large group of compression techniques
where the input data is divided into chunks of various lengths and each chunk of
data symbols is encoded by a fixed-size code. The most important members of this group
are run-length encoding and the various LZ (dictionary-based) compression algorithms.
A dictionary-based algorithm saves bits and pieces of the input data in a special
buffer called a dictionary. When the next item is read from the input file, the algorithm
tries to locate it in the dictionary. If the item is found in the dictionary, the algorithm
outputs a token with a pointer to the item plus other information such as the length of
the item. If the item is not in the dictionary, the algorithm adds it to the dictionary
(based on the assumption that once an item has appeared in the input, it is likely that
it will appear again) and outputs the item either in raw format or as a special, literal
token. Compression is achieved if a large item is replaced by a short token. Quite a
few dictionary-based algorithms are currently known. They have been developed by
many scientists and researchers, but are all based on the basic ideas and pioneering
work of Jacob Ziv and Abraham Lempel, described in [Ziv and Lempel 77] and [Ziv and
Lempel 78].
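
The sketch below follows this outline in the spirit of LZ78; it is not the exact algorithm of
[Ziv and Lempel 78], and the token format, a (dictionary index, literal character) pair, is
chosen only for illustration.

def lz_encode(text):
    dictionary = {"": 0}                              # phrase -> index; index 0 is the empty phrase
    tokens, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:                 # keep extending while the phrase is known
            phrase += ch
        else:
            tokens.append((dictionary[phrase], ch))   # pointer to the longest match plus a literal
            dictionary[phrase + ch] = len(dictionary) # add the new item: it may well appear again
            phrase = ""
    if phrase:
        tokens.append((dictionary[phrase], ""))       # flush a final, already-known phrase
    return tokens

print(lz_encode("ababababa"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a')]
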
A well-designed dictionary-based algorithm can achieve high compression because a
given item tends to appear many times in a data file. In a text file, for example, the same
words and phrases may appear many times. Words that are common in the language
and phrases that have to do with the topic of the text, tend to appear again and again. If
they are kept in the dictionary, then more and more phrases can be replaced by tokens,
thereby resulting in good compression.

The differences between the various LZ dictionary methods are in how the dictionary
is organized and searched, in the format of the tokens, in the way the algorithm handles
items not found in the dictionary, and in the various improvements it makes to the basic
method. The many variants of the basic LZ approach employ additional techniques such
as a circular buffer, a binary search tree, variable-length codes or dynamic Huffman
coding to encode the individual fields of the token, and other tricks of the programming
trade. Sophisticated dictionary organization eliminates duplicates (each data symbol is
stored only once in the dictionary, even if it is part of several items), implements fast
search (binary search or a hash table instead of slow linear search), and may discard
unused items from time to time in order to regain space.
The other important group of variable-to-block codes is run-length encoding (RLE).
We know that data can be compressed because the common data representations are
redundant, and one type of redundancy is runs of identical symbols. Text normally
does not feature long runs of identical characters (the only examples that immediately
come to mind are runs of spaces and of periods), but images, especially monochromatic
(black and white) images, may have long runs of identical pixels. Also, an audio file
may have silences, and even one-tenth of a second’s worth of silence typically translates to
4,410 identical audio samples.
A typical run-length encoder identifies runs of the same symbol and replaces each
run with a token that includes the symbol and the length of the run. If the run is shorter
than a token, the raw symbols are output, but the encoder has to make sure that the
decoder can distinguish between tokens and raw symbols.
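
A minimal sketch of such an encoder appears below. The escape byte 0xFF and the three-byte
run threshold are arbitrary choices made for this illustration; they are one simple way of
letting the decoder distinguish tokens from raw symbols.

ESC = 0xFF                                           # arbitrary escape value for this sketch

def rle_encode(data):
    out, i = bytearray(), 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1                                   # measure the run length (capped at 255)
        if j - i >= 3 or data[i] == ESC:             # long run, or a symbol that must be escaped
            out += bytes([ESC, j - i, data[i]])      # token: escape, run length, symbol
        else:
            out += data[i:j]                         # short run: output the raw symbols
        i = j
    return bytes(out)

print(rle_encode(b"aaaaabcc").hex())                 # 'ff0561626363': a run of five 'a', then raw bytes
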
Since runs of identical symbols are not common in many types of data, run-length
encoding is often only one part of a larger, more sophisticated compression algorithm.

Variable-to-variable codes is the general name used for compression methods that
select variable-length chunks of input symbols and compress each chunk by replacing it
with a variable-length code.
A simple example of variable-to-variable codes is run-length encoding combined
with Golomb codes, especially when the data to be compressed is binary. Imagine a
long string of 0’s and 1’s where one value (say, 0) occurs more often than the other
value. This value is referred to as the more probable symbol (MPS), while the other
value becomes the less probable symbol (LPS). Such a string tends to have runs of the MPS, and Section 2.23
shows that the Golomb codes are the best candidate to compress such runs. Each run
has a different length, and the various Golomb codewords also have different lengths,
turning this application into an excellent example of variable-to-variable codes.
Other examples of variable-to-variable codes are hybrid methods that consist of
several parts. A hybrid compression program may start by reading a chunk of input and
looking it up in a dictionary. If a match is found, the chunk may be replaced by a token,
which is then further compressed (in another part of the program) by RLE or variable-
length codes (perhaps Huffman or Golomb). The performance of such a program may
not be spectacular, but it may produce good results for many different types of data.
Thus, hybrids tend to be general-purpose algorithms that can deal successfully with
text, images, video, and audio data.
This book starts with several introductory sections (Sections 1.1 through 1.6) that
discuss information theory concepts such as entropy and redundancy, and concepts that
are used throughout the text, such as prefix codes, complete codes, and universal codes.
The remainder of the text deals mostly with block-to-variable codes, although its
first part deals with the Tunstall codes and other variable-to-block codes. It concen-
trates on the codes themselves, not on the compression algorithms. Thus, the individual
sections describe various variable-length codes and classify them according to their struc-
ture and organization. The main techniques employed to design variable-length codes
are the following:

The phased-in codes (Section 1.9) are a slight extension of fixed-size codes and
may contribute a little to the compression of a set of consecutive integers by changing
the representation of the integers from fixed n bits to either n or n − 1 bits (recursive
phased-in codes are also described).
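
For concreteness, here is a sketch of the basic phased-in idea in its common truncated-binary
form for the n integers 0 through n − 1; Section 1.9 describes the codes themselves, and the
variant presented there may differ in its details.

# Phased-in (truncated binary) sketch: with k = floor(log2 n) and u = 2^(k+1) - n,
# the first u integers receive k-bit codes and the remaining n - u receive k+1 bits.
def phased_in(i, n):
    k = n.bit_length() - 1
    u = (1 << (k + 1)) - n                      # number of short codewords
    if i < u:
        return format(i, "0{}b".format(k)) if k > 0 else ""
    return format(i + u, "0{}b".format(k + 1))  # long codewords start where the short ones end

print([phased_in(i, 5) for i in range(5)])
# ['00', '01', '10', '110', '111']: for n = 5, three 2-bit codes and two 3-bit codes
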
Self-delimiting codes. These are intuitive variable-length codes—mostly due to
Gregory Chaitin, the originator of algorithmic information theory—where a code signals
its end by means of extra flag bits. The self-delimiting codes of Section 1.12 are inefficient
and are not used in practice.
Prefix codes. Such codes can be read unambiguously (they are uniquely decodable,
or UD codes) from a long string of codewords because they have a special property (the
prefix property) which is stated as follows: Once a bit pattern is assigned as the code
of a symbol, no other codes can start with that pattern. The most common examples of
prefix codes are the Huffman codes (Section 1.13). Other important examples are the
unary, start-step-stop, and start/stop codes (Sections 2.2 and 2.3, respectively).
Codes that include their own length. One way to construct a UD code for the
integers is to start with the standard binary representation of an integer and prepend to
it its length L1 . The length may also have variable length, so it has to be encoded in some
way or have its length L2 prepended. The length of an integer n equals approximately
log n (where the logarithm base is the same as the number base of n), which is why
such methods are often called logarithmic ramp representations of the integers. The
most common examples of this type of codes are the Elias codes (Section 2.4), but
other types are also presented. They include the Levenstein code (Section 2.5), Even–
Rodeh code (Section 2.6), punctured Elias codes (Section 2.7), the ternary comma code
(Section 2.9), Stout codes (Section 2.11), Boldi–Vigna (zeta) codes (Section 2.12), and
Yamamoto’s recursive code (Section 2.13).
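
As a concrete instance, the following sketch produces the Elias gamma code in its common
formulation: a run of zeros announces the length of the binary representation of n, which
follows immediately. It is meant only as an illustration of the logarithmic-ramp idea;
Section 2.4 treats the Elias codes in detail.

# Elias gamma sketch: L-1 zeros, then the L-bit binary representation of n (n >= 1).
def elias_gamma(n):
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary

print([elias_gamma(n) for n in range(1, 8)])
# ['1', '010', '011', '00100', '00101', '00110', '00111']
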
Suffix codes (codes that end with a special flag). Such codes limit the propagation
of an error and are therefore robust. An error in a codeword affects at most that
codeword and the one or two codewords following it. Most other variable-length codes
sacrifice data integrity to achieve short codes, and are fragile because a single error can
propagate indefinitely through a sequence of concatenated codewords. The taboo codes
of Section 2.15 are UD because they reserve a special string (the taboo) to indicate the
end of the code. Wang’s flag code (Section 2.16) is also included in this category.
Note. The term “suffix code” is ambiguous. It may refer to codes that end with a
special bit pattern, but it also refers to codes where no codeword is the suffix of another
codeword (the opposite of prefix codes). The latter meaning is used in Section 3.5, in
connection with bidirectional codes.

Flag codes. A true flag code differs from the suffix codes in one interesting aspect.
Such a code may include the flag inside the code, as well as at its right end. The only
example of a flag code is Yamamoto’s code, Section 2.17.
Codes based on special number bases or special number sequences. We normally
use decimal numbers, and computers use binary numbers, but any integer greater than
1 can serve as the basis of a number system and so can noninteger (real) numbers. It is
also possible to construct a sequence of numbers (real or integer) that act as weights
of a numbering system. The most important examples of this type of variable-length
codes are the Fibonacci (Section 2.19), Goldbach (Section 2.21), and additive codes
(Section 2.22).
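
The following sketch shows the Fibonacci code in its common formulation: the Zeckendorf
representation of n (a sum of nonconsecutive Fibonacci numbers) is written with the smallest
weight first and an extra 1 is appended, so every codeword ends in 11. The exact variant
described in Section 2.19 may differ in its details.

# Fibonacci code sketch, using the weights 1, 2, 3, 5, 8, ...
def fibonacci_code(n):
    a, b, fibs = 1, 2, []
    while a <= n:
        fibs.append(a)                          # Fibonacci weights not exceeding n
        a, b = b, a + b
    bits = ["0"] * len(fibs)
    for i in range(len(fibs) - 1, -1, -1):      # greedy Zeckendorf decomposition
        if fibs[i] <= n:
            bits[i] = "1"
            n -= fibs[i]
    return "".join(bits) + "1"                  # the appended 1 creates the 11 terminator

print([fibonacci_code(n) for n in range(1, 8)])
# ['11', '011', '0011', '1011', '00011', '10011', '01011']
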
The Golomb codes of Section 2.23 are designed in a special way. An integer param-
eter m is selected and is used to encode an arbitrary integer n in two steps. In the first
step, two integers q and r (for quotient and remainder) are computed from n such that
n can be fully reconstructed from them. In the second step, q is encoded in unary and
is followed by the binary representation of r, whose length is implied by the parameter
m. The Rice code of Section 2.24 is a special case of the Golomb codes where m is an
integer power of 2. The subexponential code (Section 2.25) is related to the Rice codes.
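
Here is a sketch of the two-step idea for the Rice special case m = 2^k, where the remainder
always occupies exactly k bits; in the general Golomb code, r is encoded with a phased-in
code when m is not a power of two. The unary convention used here, a run of 1's terminated
by a 0, is one of several in use.

# Rice code sketch (Golomb code with m = 2**k): unary quotient, then a k-bit remainder.
def rice_code(n, k):
    q, r = n >> k, n & ((1 << k) - 1)           # q = n // 2**k, r = n % 2**k
    return "1" * q + "0" + format(r, "0{}b".format(k))

print([rice_code(n, 2) for n in range(8)])
# ['000', '001', '010', '011', '1000', '1001', '1010', '1011']
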
Codes ending with “1” are the topic of Section 2.26. In such a code, all the code-
words end with a 1, a feature that makes them the natural choice in special applications.
Variable-length codes are designed for data compression, which is why implementors
select the shortest possible codes. Sometimes, however, data reliability is a concern, and
longer codes may help detect and isolate errors. Thus, Chapter 3 discusses robust codes.
Section 3.3 presents synchronous prefix codes. These codes are useful in applications
where it is important to limit the propagation of errors. Bidirectional (or reversible)
codes (Sections 3.5 and 3.6) are also designed for increased reliability by allowing the
decoder to read and identify codewords either from left to right or in reverse.
The following is a short discussion of terms that are commonly used in this book.

Source. A source of data items can be a file stored on a disk, a file that is input
from outside the computer, text input from a keyboard, or a program that generates
data symbols to be compressed or processed in some way. In a memoryless source, the
probability of occurrence of a data symbol does not depend on its context. The term
i.i.d. (independent and identically distributed) refers to a set of sources that have the
same probability distribution and are mutually independent.
Alphabet. This is the set of symbols that an application has to deal with. An
alphabet may consist of the 128 ASCII codes, the 256 8-bit bytes, the two bits, or any
other set of symbols.
Random variable. This is a function that maps the results of random experiments
to numbers. For example, selecting many people and measuring their heights is a ran-
dom variable. The number of occurrences of each height can be used to compute the
probability of that height, so we can talk about the probability distribution of the ran-
dom variable (the set of probabilities of the heights). A special important case is a
discrete random variable. The set of all values that such a variable can assume is finite
or countably infinite.

Compressed stream (or encoded stream). A compressor (or encoder) compresses
data and generates a compressed stream. This is often a file that is written on a disk
or is stored in memory. Sometimes, however, the compressed stream is a string of bits
that are transmitted over a communications line.
The acronyms MSB and LSB refer to most-significant-bit and least-significant-bit,
respectively.
The notation 1ⁱ0ʲ indicates a bit string of i consecutive 1's followed by j zeros.

Understanding is, after all, what science is all about—and science is a great deal more than mere mindless computation.
—Roger Penrose, Shadows of the Mind (1996)
