Application of Splay Trees to Data Compression

DOUGLAS W. JONES

Edgar H. Sibley
Panel Chair

The splay-prefix algorithm is one of the simplest and fastest adaptive data
compression algorithms based on the use of a prefix code. The data
structures used in the splay-prefix algorithm can also be applied to
arithmetic data compression. Applications of these algorithms to encryption
and image processing are suggested.
Data compression algorithms can improve the efficiency with which data is stored or transmitted by reducing the amount of redundant data. A compression algorithm takes a source text as input and produces a corresponding compressed text, while an expansion algorithm takes the compressed text as input and produces the original source text as output.¹ Most compression algorithms view the source text as consisting of a sequence of letters selected from an alphabet. The redundancy of a representation of a string S is L(S) − H(S), where L(S) is the length of the representation, in bits, and H(S) is its entropy, a measure of information content, also expressed in bits. No compression algorithm can compress a string to fewer bits than its entropy without information loss. If the source text is drawn one letter at a time from a random source using alphabet A, the entropy is given by:

    H(S) = C(S) Σ_{c∈A} p(c) log₂ (1/p(c))

where C(S) is the number of letters in the string, and p(c) is the static probability of obtaining any particular letter c. If the frequency of each letter c in the string S is used as an estimate of p(c), H(S) is called the self-entropy of S. In this paper, H_s(S) will be used to signify the self-entropy of a string, computed under the assumption that it was produced by a static source.

Static probability models do not provide very good characterizations of many sources. For example, in English text, the letter u is less common than e, so a static probability model would incorrectly predict that qe would be more common than qu. Markov probability models allow very good characterization of such sources. A Markov source has many states, and undergoes a random state change as each letter is drawn. Each state is associated with a probability distribution that determines the next state and the next letter produced. When a Markov source producing English-like text emits a q, it would enter a state in which u is the most likely output. Further discussion of entropy, static sources, and Markov sources can be found in most books on information theory [2].

Although there are a number of ad hoc approaches to data compression, for example, run-length encoding, there are also a number of systematic approaches. Huffman codes are among the oldest of the systematic approaches to data compression. Adaptive Huffman compression algorithms require the use of tree balancing schemes which can also be applied to the data structures required by adaptive arithmetic compression algorithms. There is sufficient similarity between the balancing objectives of these schemes and those achieved by splay trees to try splay trees in both contexts.
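As a sketch of the definition above (in Python; the function name is ours, not the paper's), the self-entropy H_s(S) follows directly from the letter frequencies of S:

```python
from collections import Counter
from math import log2

def self_entropy(s):
    """H_s(S): the length, in bits, that an ideal static code would
    need for S, using each letter's frequency in S as p(c)."""
    n = len(s)
    # C(S) * sum over letters of p(c) * log2(1/p(c))
    return n * sum((k / n) * log2(n / k) for k in Counter(s).values())
```

For example, a string of two equiprobable letters such as "AABB" has a self-entropy of one bit per letter, while a string of one repeated letter has a self-entropy of zero.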
¹ Such algorithms are noiseless; in this paper, approximate or noisy algorithms will not be considered.

© 1988 ACM 0001-0782/88/0800-0996 $1.50

Splay trees are usually considered forms of lexicographically ordered binary search trees, but the trees used in data compression need not have a static order. The removal of the ordering constraint allows the basic splaying operation to be considerably simplified. The
resulting algorithms are extremely fast and compact. When applied to Huffman codes, splaying leads to a locally adaptive compression algorithm that is remarkably simple as well as fast, although it does not achieve optimal compression. When applied to arithmetic codes, the result is near optimal in compression and asymptotically optimal in time.

PREFIX CODES
The most widely studied data compression algorithms are probably those based on Huffman codes. In a Huffman code, each source letter is represented in the compressed text by a variable length code. Common source letters are represented by short codes, while uncommon ones are represented by long codes. The codes used in the compressed text must obey the prefix property, that is, a code used in the compressed text may not be a prefix of any other code.

Prefix codes may be thought of as trees, with each leaf of the tree associated with one letter in the source alphabet. Figure 1 illustrates a prefix code tree for a 4 letter alphabet. The prefix code for a letter can be read by following the path from the root of the tree to the letter and associating a 0 with each left branch followed and a 1 with each right branch followed. The code tree for a Huffman code is a weight balanced tree, where each leaf is weighted with the letter frequency and internal nodes have no intrinsic weight. The example tree would be optimal if the frequencies of the letters A, B, C, and D were 0.125, 0.125, 0.25, and 0.5, respectively.

[FIGURE 1. A Tree Representation of a Prefix Code. The tree shown encodes A = 000, B = 001, C = 01, D = 1.]

Conventional Huffman codes require either prior knowledge of the letter frequencies or two passes through the data to be compressed, one to obtain the letter frequencies, and one to perform the actual compression. In the latter case, the letter frequencies must be included with the compressed text in order to allow for later expansion. Adaptive compression algorithms operate in one pass. In adaptive Huffman codes, the code used for each letter in the source text being compressed is based on the frequencies of all letters up to but not including that letter. The basis for efficient implementation of adaptive Huffman codes was established by Gallager [3]; Knuth published a practical version of an adaptive algorithm [5]; and Vitter has developed an optimal adaptive Huffman algorithm [10].

Vitter's optimal adaptive Huffman code always comes within one bit per source letter of the optimal static Huffman code, and it is usually within a few percent of H_s. Furthermore, static Huffman codes are always within one bit per source letter of H_s (Huffman codes achieve this limit only when, for all letters, p(c) = 2^(−k) for some integer k). These same bounds can be applied to Markov sources if a different (static or adaptive) Huffman tree is used for each source state inferred from the source text. There are compression algorithms that can improve on these bounds. The Ziv-Lempel algorithm, for example, assigns fixed length words in the compressed text to varying length strings from the source [11], while arithmetic compression can, in effect, utilize fractional bits in the encoding of source letters [12].

Applying Splaying to Prefix Codes
Splay trees were first described in 1983 [8], and more details were presented in 1985 [9]. Splay trees were originally intended as a form of self-balancing binary search trees, but they have also been shown to be among the fastest known priority queue implementations [4]. When a node in a splay tree is accessed, the tree is splayed: that is, the accessed node becomes the root, and all nodes to the left of it form a new left subtree, while all nodes to the right form a new right subtree. Splaying is accomplished by following the path from the old root to the target node, making only local changes along the way, so the cost of splaying is proportional to the length of the path followed.

Tarjan and Sleator [9] showed that splay trees are statically optimal. In other words, if the keys of the nodes to be accessed are drawn from a static probability distribution, the access speeds of a splay tree and of a statically balanced tree optimized for that distribution should differ by a constant factor when amortized over a sufficiently long series of accesses. Since a Huffman tree is an example of a statically balanced tree, this suggests that splaying should be applicable to data compression, and that the compressed code resulting from a splayed prefix code tree should be within a constant factor of the size achievable by using a Huffman code.

As originally described, splaying applies to trees where data is stored in the internal nodes, not the leaves. Prefix code trees carry all of their data in the leaves, with nothing in the internal nodes. There is a variant of splaying, however, called semi-splaying, which is applicable to prefix code trees. In semi-splaying, the target node is not moved to the root, nor are its children modified; instead, the path from the root to the target is simply shortened by a factor of two. Semi-splaying has been shown to achieve the same theoretical bounds as splaying, within a constant factor.

Both splaying and semi-splaying are complicated in a lexicographic tree when a zig-zag path is followed in the interior of the tree, but they are easy when the path to the target node stays entirely on the left or right edge
of the tree (called the zig-zig case in [8] and [9]). This simple case is illustrated in Figure 2. The effect of semi-splaying along the path from the root (node w) to leaf node A is to rotate each successive pair of internal nodes so that the path length from the root to the leaf node is halved. In the process, the nodes in each pair that were farthest from the root stay on the new path (nodes x and z), while those that were closest move off the path (nodes w and y).

While the semi-splaying operation preserves the lexicographic ordering of all nodes in the tree, this is not important in a prefix code tree. With prefix codes, all that matters is that the tree used by the compression routine to compress any letter of the source text exactly match the tree used by the expansion routine to expand that letter. Any transformation of the tree is allowed between successive letters, as long as both routines perform the same transformations in the same order.

The lack of a lexicographic ordering constraint allows a great simplification to the semi-splaying operation by eliminating the need to consider the zig-zag case. This can be done by inspecting the nodes on the path from the root to the target leaf and exchanging those which are right children with their siblings. This will be called twisting the tree. After this modification, the new prefix code for the target leaf is all zeros and the target leaf is the leftmost leaf. In Figure 3, the tree has been twisted to allow easy semi-splaying around leaf C. Fortunately, this change does not disturb any of the performance bounds that have been proven for semi-splaying. The proof of this follows trivially since the potential function used in [9] to prove these performance bounds does not depend on the order of the subtrees of a node.

A second simplification arises when we consider that not only can left and right siblings be exchanged at will, but all internal nodes in the prefix code tree are anonymous and carry no information. This allows the rotations used in semi-splaying to be replaced by operations requiring the exchange of only two links in the tree; we will call these operations semi-rotations. Figure 4 shows a semi-rotation. A semi-rotation has the same effect on the distances of each leaf from the root as a full rotation, but it destroys the lexicographic ordering and involves cutting and grafting only 2 branches of the tree, while a full rotation involves cuts and grafts on 4 branches.

There is actually no need to twist the tree prior to applying semi-rotations. Instead, the semi-rotations can be applied along the path from the root to the target leaf as if that path were the left-most path. For example, […]

[FIGURES 2–4. Tree diagrams illustrating the zig-zig case of semi-splaying (Figure 2), the tree twisted around leaf C (Figure 3), and a semi-rotation (Figure 4).]
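To make the prefix property discussed above concrete before turning to the implementation, here is a small sketch (in Python; the code table and function names are ours) that encodes and decodes with the static Figure 1 code:

```python
CODE = {"A": "000", "B": "001", "C": "01", "D": "1"}  # the Figure 1 code

def encode(text):
    """Concatenate the code for each letter."""
    return "".join(CODE[c] for c in text)

def decode(bits):
    """Scan left to right; the prefix property guarantees that the
    first complete code found is the intended one, so no lookahead
    or length field is needed."""
    inv = {v: k for k, v in CODE.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inv:
            out.append(inv[buf])
            buf = ""
    return "".join(out)
```

For instance, "DACB" encodes to "100001001", and decoding that bit string recovers "DACB" without any separators between the variable length codes.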
The Splay-Prefix Algorithm
The code presented here will be in the style of Pascal, with constant valued expressions substituted for constants where that improves readability. The data structures required by this code will be constructed using only arrays, even though the logical structure might be more clearly expressed using records and pointers. This is in keeping with the form of presentation used in earlier work in this area [5, 10]. It allows easy expression in older but widely used languages such as Fortran, and it allows compact pointer representations.²

² In [9], an extra bit was needed per node in the triangular representation to distinguish left only children from right only children; since a prefix code tree is a complete binary tree, this bit is not needed here.

Each internal node in the code tree must allow access to both of its children, as well as to its parent. The declarations are the same as those given later for the cumulative frequency tree, but without the freq array:

    const
      maxchar = ... {maximum source character code};
      succmax = maxchar + 1;
      twicemax = 2 * maxchar + 1;
      root = 1;
    type
      codetype = 0 .. maxchar {source character code range};
      bit = 0 .. 1;
      upindex = 1 .. maxchar;
      downindex = 1 .. twicemax;
    var
      up: array [downindex] of upindex;
      left, right: array [upindex] of downindex;

The indices 1 through maxchar reference the internal nodes of the tree, while the indices succmax through twicemax are used to reference leaves. Note that the root of the tree always has an undefined parent, and is always stored at location 1. The letter corresponding to a leaf can be computed from the index of that leaf by subtracting maxchar + 1.

If the end of a source document can be inferred from context, the source alphabet can be encoded directly in the range codetype, and the largest code allowed in a
source document can be maxchar. If this is not the case, the range codetype must be expanded by one to include a special end-of-file character; this means that maxchar will be one greater than the largest character representation.

The following routine will initialize the code tree. This builds a balanced code tree, but in fact, any initial tree would suffice as long as the same initial tree is used for both compression and expansion.

    procedure initialize;
    var
      i: downindex;
      j: upindex;
    begin
      for i := 2 to twicemax
        do up[i] := i div 2;
      for j := 1 to maxchar do begin
        left[j] := 2 * j;
        right[j] := 2 * j + 1;
      end;
    end {initialize};

After each letter is compressed or expanded, using the current version of the code tree, the tree must be splayed around the code for that letter. The following procedure does this, using bottom-up splaying.

    procedure splay(plain: codetype);
    var
      a, b: downindex {children of nodes to semi-rotate};
      c, d: upindex {pair of nodes to semi-rotate};
    begin
      a := plain + succmax;
      repeat {walk up the tree semi-rotating pairs}
        c := up[a];
        if c <> root then begin {a pair remains}
          d := up[c];
          {exchange children of pair}
          b := left[d];
          if c = b then begin
            b := right[d];
            right[d] := a;
          end else begin
            left[d] := a;
          end;
          if a = left[c] then begin
            left[c] := b;
          end else begin
            right[c] := b;
          end;
          up[a] := d;
          up[b] := c;
          a := d;
        end else begin {handle odd node at end}
          a := c;
        end;
      until a = root;
    end {splay};

To compress a letter from the source text, the letter must be encoded using the code tree, and then transmitted. Since encoding is done by following a path from a leaf to the root of the tree, the code bits are produced in the reverse order from the order in which they must be transmitted. To correct this, the compress routine uses a local stack from which bits are popped one at a time and passed to the transmit routine.

    procedure compress(plain: codetype);
    var
      sp: 1 .. succmax;
      stack: array [upindex] of bit;
      a: downindex;
    begin
      {encode}
      a := plain + succmax;
      sp := 1;
      repeat {walk up the tree pushing bits}
        stack[sp] := ord(right[up[a]] = a);
        sp := sp + 1;
        a := up[a];
      until a = root;
      repeat {transmit}
        sp := sp - 1;
        transmit(stack[sp]);
      until sp = 1;
      splay(plain);
    end {compress};

To expand a letter, successive bits must be read from the compressed text using the receive function. Each bit determines one step on the path from the root of the tree to the leaf representing the expanded letter.

    function expand: codetype;
    var
      a: downindex;
    begin
      a := root;
      repeat {once for each bit on the path}
        if receive = 0
          then a := left[a]
          else a := right[a];
      until a > maxchar;
      splay(a - succmax);
      expand := a - succmax;
    end {expand};

The main programs for compression and expansion are trivial, consisting of a call to the initialize routine, followed by successive calls to compress or expand for each letter processed.

Performance of the Splay-Prefix Algorithm
In practice, splay-tree based prefix codes are not optimal, but they have some useful properties. Primary among these are speed, simple code, and compact data structures. The splay-prefix algorithm requires only 3 arrays, while Vitter's Algorithm A for computing an optimal adaptive prefix code requires 11 arrays [10]. Assuming that the source character set uses 8 bits per character and that end-of-file must be signalled by a
character outside the 8 bit range, maxchar = 256 and all array entries can be directly represented in either 9 or 10 bits (two bytes on most machines).³ The static storage requirements for the splay-prefix algorithm are about 9.7k bits (or 2k bytes on most machines). A similar approach to storing the arrays used by Algorithm A requires about 57k bits (or 10k bytes on most machines).

³ Changes to the coding standards allowing array indices to run from 0 to 255 instead of 1 to 256 would reduce the storage requirements of both the splay-prefix algorithm and Algorithm A.

Other commonly used compression algorithms require even more memory; for example, Welch recommends using a 4096 entry hash table with 20 bits per entry to implement Ziv-Lempel compression [11], for a total of almost 82k bits (or 12k bytes on most machines). The widely used compress command on Berkeley UNIX systems uses a Ziv-Lempel code based on a table of up to 64k entries of at least 24 bits each, for a total of 1572k bits (196k bytes on most machines).

Table I shows how Vitter's Algorithm A and the splay-prefix algorithm performed when used on a variety of test data. In all cases, an alphabet of 256 distinct letters was used, augmented with a reserved end-of-file mark. For all files, compressed output of Algorithm A was within 5 percent of H_s, and was usually within 2 percent. For all files, the compressed output of the splay algorithm was never more than 20 percent larger than H_s, and was sometimes much smaller.

[TABLE I. Performance of Algorithm A and the splay-prefix algorithm on the 13 test files; columns: file, type, bytes, H_s bits, Algorithm A bits, splay bits.]

The test data includes a C program (file 1), two Pascal programs (files 2, 3), and an early draft of this text (file 4). All 4 text files use the ASCII character set, with tabs replacing most groups of 8 leading blanks, and few if any trailing blanks. For all of these files, Algorithm A produced results that were about 60 percent of the original size, and the splay algorithm produced results that were about 70 percent of the original size. This was the worst compression performance observed for the splay-prefix algorithm relative to Algorithm A.

Two M68000 object files were compressed (files 5, 6), as well as a file of TeX output in DVI format (file 7). These files have less redundancy than the text files, and thus, neither compression method was able to reduce their size as effectively. Nonetheless, both compression methods managed to usefully compress the data, and the splay algorithm produced results that were about 10 percent larger than those produced by Algorithm A.

Three digitized images of human faces were compressed (files 8, 9, 10); these have varying numbers of pixels, but all were digitized using 16 grey levels, and stored one pixel per byte. For these files, Algorithm A produced results that were about 40 percent of the original size, while the splay-prefix algorithm produced results only 25 percent of the original size, or about 60 percent of H_s. At first, this may appear to be impossible, since H_s is an information theoretic limit, but the splay-prefix algorithm passes this limit by exploiting the Markov characteristics of some sources.

The final 3 files were artificially created to explore the class of sources where the splay-prefix algorithm excels (files 11, 12, 13); all contain equal numbers of each of the 256 character codes, so H_s is the same for all three, and is equal to the length of the string in bits. In file 11, the entire character set is repeated 64 times; the splay-prefix algorithm performed marginally better than H_s. In file 12, the character set is repeated 64 times but the bits of each character are reversed; this prevents splaying from improving on H_s. The key difference between these two is that in file 11, successive characters are likely to come from the same subtree of the code tree, while in file 12, this is unlikely. In file 13, the character set is repeated 7 times, but in each copy of the character set after the second, each character is repeated twice as many times as in the previous copy; the file ends with a run of 32 a's followed by a run of 32 b's, and so forth. Here, the splay-prefix algorithm was able to exploit long runs of repeated characters, so the result was only 25 percent of H_s; on the other hand, Algorithm A never found any character to be more than twice as common as any other, so equal length codes were used throughout.

When a character is repeated, the splay-prefix algorithm assigns successively shorter codes to each repetition; after at most log₂ n repetitions of a letter from an
n letter alphabet, splaying will assign a 1 bit code to that letter. This explains the excellent results of splaying applied to file 13. Furthermore, if letters from one subtree of the code tree are repeatedly referenced, splaying will shorten the codes for all letters in that subtree. This explains why splaying performed well when applied to file 11.

In the image data, it was rare for more than a few consecutive pixels of any scan line to have the same intensity, but within each textured region of the image, a different static probability distribution could be used to describe the distribution of intensities. As the splay-prefix algorithm compresses successive pixels in a scan line, it assigns short codes to the pixel intensities which are common in the current context. When it crosses from one textured region to another, short codes are quickly assigned to intensities common in the new region, while the codes for now-unused intensities slowly grow longer. As a result of this behavior, the splay-prefix algorithm is locally adaptive. The splay-prefix algorithm and the similar locally adaptive algorithms should be able to achieve reasonable compression results for any Markov source that stays in each state long enough for the algorithm to adapt to that state.

Other locally adaptive data compression algorithms have been proposed by Knuth [5] and by Bentley, et al. [1]. Knuth proposed a locally adaptive Huffman algorithm where the code used for any letter was determined by the n most recent letters; this approach is computationally slightly more difficult than simple adaptive Huffman algorithms, but the appropriate value of n depends on the frequency of state changes in the source. Bentley, et al. propose using the move-to-front heuristic to organize a list of recently used words (assuming that the source text has lexical structure) in conjunction with a locally adaptive Huffman code for encoding slot numbers in the list. This locally adaptive Huffman code involves periodically reducing the weights on all letters in the Huffman tree by multiplying by a constant less than one. A similar approach is used [12] in the context of arithmetic codes. In many respects, the periodic reduction of the weights of all letters in an adaptive Huffman or arithmetic code should result in adaptive behavior very similar to that of the splay compression algorithm described here.

The small data structures required by the splay-prefix algorithm allow Markov models to be constructed with a relatively large number of states; for example, models with more than 96 states can be represented in the 196k byte space used by the compress command under Berkeley UNIX. Furthermore, the code presented here can be converted to a Markov model by adding one variable, state, and by adding a state dimension to each of the 3 arrays representing the code tree. The code trees for all of the states can be identically initialized, and one statement needs to be added at the end of the splay routine to change the state based on the previous letter (or in more complex models, on the previous letter and the previous state).

For a system with n states, and where the previous letter was c, it is easy to use the value c mod n to determine the next state. This Markov model blindly lumps every nth letter in the alphabet into one state. Values of n varying from 1 to 64 were tried in compressing a text file, an object code file, and a digitized image (file 8). The results of these experiments are presented in Figure 6. For object code, a 64 state model was sufficient to outperform the Ziv-Lempel based compress command and a 4 state model was sufficient to pass H_s. For the text file, a 64 state model came close to the performance of the compress command, and an 8 state model was sufficient to pass H_s. For the image data (file 8), a 16 state model was sufficient to outperform the compress command and all models significantly outperformed H_s. Markov models with fewer than 8 states were less effective than a simple static model applied to the image data, with the worst case being 3 states. This is because the use of a Markov model interferes with the locally adaptive behavior of the splay-prefix algorithm.

[FIGURE 6. Performance of the Splay-Prefix Algorithm with a Markov Model. Compressed size versus the number of states in the Markov model (2, 4, 8, 16, 32, 64) for object, text, and image data, compared with H_s and UNIX compress.]

Both Algorithm A and the splay-prefix algorithm have run-times proportional to the size of the output, and in both cases, the output is of worst-case length O(H_s); thus both should run in worst-case time O(H_s). The constant factors differ because the splay-prefix algorithm performs less work per bit of output, but produces more bits of output in the worst case. For the 13 files in Table I, Algorithm A produced output at an average rate of 3k bits per second, while the splay-prefix algorithm produced output at better than 4k bits per second; thus, the splay algorithm was always significantly faster. These times were measured on an M68010 based Hewlett Packard Series 200 workstation under the HP-UX operating system, with both algorithms written in Pascal to similar coding standards.
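As a cross-check on the Pascal routines above, they can be transliterated into Python (a sketch: the class name, bit-list interface, and method signatures are ours, not the paper's) and the compress/expand round trip verified:

```python
class SplayPrefixCoder:
    """Transliteration of the paper's Pascal. Array layout follows the
    text: internal nodes 1..maxchar, leaves succmax..twicemax, root = 1;
    source letters are 0..maxchar."""
    def __init__(self, maxchar):
        self.maxchar = maxchar
        self.succmax = maxchar + 1
        twicemax = 2 * maxchar + 1
        self.up = [0] * (twicemax + 1)   # up[i] = parent of node i
        self.left = [0] * self.succmax   # children of internal nodes
        self.right = [0] * self.succmax
        for i in range(2, twicemax + 1):
            self.up[i] = i // 2
        for j in range(1, maxchar + 1):
            self.left[j] = 2 * j
            self.right[j] = 2 * j + 1

    def splay(self, plain):
        """Semi-rotate pairs of internal nodes along the leaf-to-root path."""
        a = plain + self.succmax
        while a != 1:
            c = self.up[a]
            if c != 1:                         # a pair remains
                d = self.up[c]
                b = self.left[d]               # exchange children of pair
                if c == b:
                    b = self.right[d]
                    self.right[d] = a
                else:
                    self.left[d] = a
                if a == self.left[c]:
                    self.left[c] = b
                else:
                    self.right[c] = b
                self.up[a], self.up[b] = d, c
                a = d
            else:                              # handle odd node at end
                a = c

    def compress(self, plain, out):
        """Append the current code for letter `plain` to the bit list `out`."""
        a, stack = plain + self.succmax, []
        while a != 1:                          # walk up, pushing bits
            stack.append(int(self.right[self.up[a]] == a))
            a = self.up[a]
        while stack:
            out.append(stack.pop())            # emit in transmission order
        self.splay(plain)

    def expand(self, bits):
        """Consume bits from the iterator `bits` and return one letter."""
        a = 1
        while a <= self.maxchar:               # one step per received bit
            a = self.left[a] if next(bits) == 0 else self.right[a]
        self.splay(a - self.succmax)
        return a - self.succmax
```

Because compress and expand each end by splaying around the letter just processed, an encoder and a separately initialized decoder keep identical trees, which is all the algorithm requires for the round trip to succeed.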
ARITHMETIC CODES
The compressed text resulting from arithmetic data compression is viewed as a binary fraction, and each letter in the alphabet is associated with a different subrange of the half open interval [0, 1). The source text can be viewed as a textual representation of this fraction using a number system where each letter in the alphabet is used as a digit, but the range of values associated with each letter has a width depending on the frequency of that letter. The first letter of the compressed text (the most significant "digit") can be decoded by finding the letter associated with the subrange bounding the fraction that represents the text. After determining each letter of the source text, the fraction can be rescaled to remove that letter; this is done by subtracting the base of the letter's subrange and dividing by the width of the subrange. Once this is done, the next letter can be decoded.

As an example of an arithmetic code, consider the 4 letter alphabet (A, B, C, D) with the probabilities (0.125, 0.125, 0.25, 0.5). The interval [0, 1) could be subdivided as follows:

    A = [0, 0.125), B = [0.125, 0.25),
    C = [0.25, 0.5), D = [0.5, 1)

This subdivision is easily derived from the cumulative probabilities of each letter and its predecessors in the alphabet. Given the compressed text 0.6 (represented as a decimal fraction), the first letter must be D because it is in the range [0.5, 1). Rescaling gives:

    (0.6 − 0.5)/0.5 = 0.2

Thus, the second letter must be B because it is in the range [0.125, 0.25). Rescaling gives:

    (0.2 − 0.125)/0.125 = 0.6

This implies that the third letter is D, and that, lacking any information about the length of the message, it could be the repeating string DBDBDB . . .

The primary problem with arithmetic codes is the high precision arithmetic required by interpreting the entire bit pattern that represents the compressed text as a number. This problem was solved in 1979 [6]. The compression efficiency of a static arithmetic code will equal H_s only if infinite precision arithmetic is used. The finite precision of most machines, however, is sufficient to allow extremely good compression. Integer variables 16 bits long, with 32 bit products and dividends, are sufficient to allow adaptive arithmetic compression to within a few percent of the limit, and the result is almost always slightly better than Vitter's optimal adaptive Huffman code.

As with Huffman codes, static arithmetic codes require either two passes or prior knowledge of the letter frequencies. Adaptive arithmetic codes require an efficient algorithm for maintaining and updating the running frequency and cumulative frequency information as letters are processed. The simplest way of doing this is to associate a counter with each letter that is incremented each time the letter or any of its successors in the alphabet are encountered. With this approach, the frequency of a letter is the difference between its counter and its predecessor's counter. This simple approach can take O(n) time to process a letter from an n letter alphabet. In Witten, Neal and Cleary's C implementation of an arithmetic data compression algorithm [12], the average performance was improved by using a move-to-front organization, thus reducing the number of counters that must be updated each time a letter is processed.

Further improvement in the worst-case performance for updating the cumulative frequency distribution requires a radical departure from the simple data structures used in [12]. The requirements that this data structure must meet are best examined by expressing it as an abstract data type with the following five operations: initialize, update, findletter, findrange, and maxrange. The initialize operation sets the frequency of all letters to one; any nonzero value would do, as long as the encode and decode algorithms use the same initial frequencies. An initial frequency of zero would assign an empty range to a character, thus preventing it from being transmitted or received.

The update(c) operation increments the frequency of the letter c. The findletter and findrange functions are inverses, and update may perform any reordering of the alphabet as long as it maintains this inverse relationship. At any point in time, findletter(f, c, min, max) will return the letter c and the associated cumulative frequency range [min, max), where this range contains f. The inverse function, findrange(c, min, max) will return the values for min and max when given the letter c. The maxrange function returns the sum of the frequencies of all letters in the alphabet, and is needed to scale the cumulative frequencies into the interval [0, 1).

Applying Splaying to Arithmetic Codes
The key to implementation of the cumulative frequency data structures, with worst case behavior better than O(n) per operation on an n letter alphabet, is to organize the letters of the alphabet as leaves in a tree. Each leaf in this tree can be weighted with the frequency of the corresponding letters, and each internal node can be weighted with the sum of the weights of all children. Figure 7 illustrates such a tree for the

[FIGURE 7. A Cumulative Frequency Tree. Leaves carry letter/frequency pairs (for example, A/1 and B/1); each internal node carries the sum of its children's weights.]
4 letter alphabet (A, B, C, D) with the probabilities (0.125, 0.125, 0.25, 0.5) and the frequencies (1, 1, 2, 4). The maxrange function is trivial to compute on such a tree; it simply returns the weight on the root. The update and findrange functions can be computed by traversing a path in the tree from a leaf to the root, and the findletter function can be computed by traversing a path from the root to a leaf.

The data structures for representing the cumulative frequency tree are essentially the same as those already presented for representing a prefix code tree, with the addition of an array to hold the frequency of each node in the structure:

    const
      maxchar = ... {maximum source character code};
      succmax = maxchar + 1;
      twicemax = 2 * maxchar + 1;
      root = 1;
    type
      codetype = 0 .. maxchar {source character code range};
      bit = 0 .. 1;
      upindex = 1 .. maxchar;
      downindex = 1 .. twicemax;
    var
      left, right: array [upindex] of downindex;
      up: array [downindex] of upindex;
      freq: array [downindex] of integer;

Initialization of this structure involves not only building the tree data structure, but initializing the frequencies of each leaf and internal node as follows:

    procedure initialize;
    var
      d: downindex;
      u: upindex;
    begin
      for d := succmax to twicemax do freq[d] := 1;
      for u := maxchar downto 1 do begin
        left[u] := 2 * u;
        right[u] := (2 * u) + 1;
        freq[u] := freq[left[u]] + freq[right[u]];
        up[left[u]] := u;
        up[right[u]] := u;
      end;
    end {initialize};

To find a letter and its cumulative frequency range […] with each node on the path. This leads to the following code:

    procedure findsymbol(f: integer; var c: codetype;
                         var a, b: integer);
    var
      i: downindex;
      t: integer;
    begin
      i := root;
      a := 0;
      b := freq[root];
      repeat
        t := a + freq[left[i]];
        if f < t then begin {left turn}
          i := left[i];
          b := t;
        end else begin {right turn}
          i := right[i];
          a := t;
        end;
      until i > maxchar;
      c := i - succmax;
    end {findsymbol};

To find the cumulative frequency range associated with a letter, the process illustrated for findsymbol must be reversed. Initially, the only information known about the letter at node i in the tree is the frequency of that letter, freq(i). From this, the range [0, freq(i)) can be inferred; this would be the range associated with the letter if it were the only letter in the alphabet. Given that the range [a, b) is associated with some leaf in the context of the subtree rooted at i, the range associated with that leaf in the context of up[i] can be computed. If i is a left child, this is simply [a, b); if i is a right child, this is [a + d, b + d), where d = freq[up[i]] − freq[i], or equivalently, d = freq[left[up[i]]]. This leads to the following:

    procedure findrange(c: codetype; var a, b: integer);
    var
      i: downindex;
      d: integer;
    begin
      i := c + succmax;
      a := 0;
      b := freq[i];
      repeat
when given a particular cumulative frequency, the tree if right[up[i]] = i then begin (i is right child]
must be entered at the root and traversed towards that d := freq[left[up[i]]];
letter, keeping a running account of the frequency a := a + d;
range represented by the current branch of the tree. b := b + d;
The range associated with the root is [0, freq[root]), end;
which must contain f. When at a particular node i in i := up[i];
the tree associated with range [a, b), where a - b = until i = root;
freqlli], the ranges associated with the two subtrees will end (findrange);
be [a, a + freq[left[i]]) and [a + freq[left[i]], b); these
subranges are disjoint and the path down the tree will If not for the problem of maintaining appropriate bal-
be such that f is contained in the subranges associated ance in the cumulative frequency tree, the update
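Since the Pascal code above is spread across declarations and three procedures, a compact runnable sketch may help. The following Python transliteration is mine, not the author's: the class name CumFreqTree is invented, and the update method shown is the plain leaf-to-root walk described in the text, with no attempt at rebalancing.

```python
# Sketch of the cumulative frequency tree: an array-based complete binary
# tree with node 1 as the root, leaves at indices succmax..twicemax, and
# letter c stored at leaf c + succmax, mirroring the Pascal declarations.
class CumFreqTree:
    def __init__(self, nletters):
        self.maxchar = nletters - 1            # letters are 0..maxchar
        self.succmax = self.maxchar + 1
        self.twicemax = 2 * self.maxchar + 1
        self.root = 1
        size = self.twicemax + 1
        self.left = [0] * size
        self.right = [0] * size
        self.up = [0] * size
        self.freq = [0] * size
        # initialize: every leaf starts with frequency 1,
        # internal weights are built bottom-up
        for d in range(self.succmax, self.twicemax + 1):
            self.freq[d] = 1
        for u in range(self.maxchar, 0, -1):
            self.left[u] = 2 * u
            self.right[u] = 2 * u + 1
            self.freq[u] = self.freq[self.left[u]] + self.freq[self.right[u]]
            self.up[self.left[u]] = u
            self.up[self.right[u]] = u

    def maxrange(self):
        return self.freq[self.root]            # total of all frequencies

    def findsymbol(self, f):
        """Walk root-to-leaf; return (letter, a, b) with a <= f < b."""
        i, a, b = self.root, 0, self.freq[self.root]
        while i <= self.maxchar:               # while i is an internal node
            t = a + self.freq[self.left[i]]
            if f < t:                          # left turn
                i, b = self.left[i], t
            else:                              # right turn
                i, a = self.right[i], t
        return i - self.succmax, a, b

    def findrange(self, c):
        """Walk leaf-to-root; return the range [a, b) for letter c."""
        i = c + self.succmax
        a, b = 0, self.freq[i]
        while i != self.root:
            if self.right[self.up[i]] == i:    # i is a right child: shift
                d = self.freq[self.left[self.up[i]]]
                a, b = a + d, b + d
            i = self.up[i]
        return a, b

    def update(self, c):
        """Increment letter c's frequency and the weight of each ancestor."""
        i = c + self.succmax
        while True:
            self.freq[i] += 1
            if i == self.root:
                break
            i = self.up[i]
```

Rebuilding the Figure 7 example (frequencies 1, 1, 2, 4) from the all-ones initial tree amounts to one update of C and three of D, after which findsymbol and findrange are exact inverses of each other.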
FIGURE 8. Semi-Rotation in a Cumulative Frequency Tree

[...] length, but with larger constant factors. The empirical results presented earlier suggest that these constant factors are more than compensated for by the simplicity of the splay-prefix algorithm.

In the context of the splay-prefix algorithm, the splay operation did not need to manipulate any information in the internal nodes of the tree. When splaying is used as part of the update operation, each semi-rotate operation must preserve the invariants governing the weights of nodes in the tree. In Figure 8, the tree is semi-rotated about A; as a result, the weight of x is reduced by the weight of A and increased by the weight of C. At the same time, since this is part of an iterative traversal of the path from A to the root, the weight of A is incremented. The resulting code is shown as: [...]

The code ignores the problem of overflow in the frequency counters. Arithmetic data compression repeatedly uses computations of the form a*b/c, and as a result, the limit on the precision of computation is set by the storage allowed for intermediate products and dividends, not for integer variables. Many 32 bit machines impose a limit of 32 bits on products and dividends, and thus impose an effective 16 bit limit on the integers a, b, and c in the above expression. When this constraint is propagated through the code for arithmetic data compression, the net effect is a limit of 16383 on the maximum value returned by maxrange, or freq[root]. As a result, unless all files being compressed are shorter than 16383 bytes, all frequencies in the data structure must be periodically rescaled to force them into this range. An easy way to do this is to divide all frequencies by a small constant such as two, rounding up to prevent any frequencies from dropping to zero.

Leaves in the cumulative frequency tree can be easily rescaled by division by two, but internal nodes are not as easily rescaled because of the difficulty of propagating rounding decisions up the tree. As a result, the easiest thing to do is rebuild the tree, as shown in the following code:
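A Python rendering of this rebuild pass may make it concrete; it operates on the same left, right, up, and freq arrays as the Pascal rescale procedure that follows, and is a transliteration rather than the author's code.

```python
# Sketch of the rescale pass: halve each leaf frequency, rounding up so
# that no frequency drops to zero, then rebuild the balanced internal
# nodes bottom-up rather than trying to propagate rounding up the tree.
def rescale(freq, left, right, up, maxchar):
    succmax = maxchar + 1
    twicemax = 2 * maxchar + 1
    for d in range(succmax, twicemax + 1):     # leaves
        freq[d] = (freq[d] + 1) // 2
    for u in range(maxchar, 0, -1):            # internal nodes
        left[u] = 2 * u
        right[u] = 2 * u + 1
        freq[u] = freq[left[u]] + freq[right[u]]
        up[left[u]] = u
        up[right[u]] = u
```

Note that rebuilding the pointer arrays restores the balanced tree shape, discarding any reshaping previously done by splaying, exactly as the Pascal version does.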
procedure rescale;
var
  d: downindex;
  u: upindex;
begin
  for d := succmax to twicemax do
    freq[d] := (freq[d] + 1) div 2;
  for u := maxchar downto 1 do begin
    left[u] := 2 * u;
    right[u] := (2 * u) + 1;
    freq[u] := freq[left[u]] + freq[right[u]];
    up[left[u]] := u;
    up[right[u]] := u;
  end;
end {rescale};

Performance of Arithmetic Codes
The above routines were incorporated into a Pascal transliteration of Witten, Neal, and Cleary's algorithm [12]. As expected, there was no significant difference between the compressed text resulting from the original and from the modified arithmetic compression algorithm. Usually, the compressed texts that resulted from the two algorithms were exactly the same length.

FIGURE 9. Performance of Arithmetic Compression Algorithms (time per source byte versus entropy; series: Original, move-to-front; Modified, splay tree)

Figure 9 shows the speed of the two arithmetic compression algorithms as a function of H. Time is shown in milliseconds per source byte, and entropy is shown in bits per source byte. The files with 2 bits/byte and 8 bits/byte were artificially created; the others were an image file digitized using 16 grey levels (3.49 bits/byte), a text file (4.91 bits/byte), and an M68000 object file (6.02 bits/byte). Time was measured on an HP9836CU workstation under HP-UX.

As shown in Figure 9, splaying applied to a cumulative frequency tree only outperforms the move-to-front algorithm employed by Witten, Neal, and Cleary [12] when the data to be compressed has an entropy of more than about 6.5 bits/byte. Below this, the move-to-front method always slightly outperforms splaying. Thus, splaying or other approaches to balance the cumulative frequency tree are probably not justified for compressing data using a 256 letter alphabet. For larger alphabets, on the other hand, this data suggests that splaying may well be the best approach.

CONCLUSIONS
The splay-prefix algorithm presented here is probably the simplest and fastest adaptive data compression algorithm based on the use of a prefix code. Its outstanding characteristics are a very small storage requirement and locally adaptive behavior. When large amounts of memory are available, use of the splay-prefix algorithm with a Markov model frequently allows more effective data compression than competing algorithms that use the same amount of memory.

The advantages of the splay-prefix algorithm were most effectively demonstrated when it was applied to compressing image data. The locally adaptive character of the algorithm allowed it to compress an image to fewer bits than the self-entropy of the image measured assuming a static source. Finally, a simple Markov model using the splay-prefix algorithm frequently allowed compression superior to the widely used Ziv-Lempel algorithm which uses a comparable amount of memory.

Arithmetic data compression algorithms can be made to run in O(H) time by using a cumulative frequency tree balanced by the splaying heuristic for the statistical model required by the algorithm. This bound is new, but the simple move-to-front heuristic is more effective for the small alphabets typically used.⁴

⁴ Alistair Moffat of the University of Melbourne has independently achieved the same performance using a data structure derived from the implicit heap of heapsort.

Both the splay-prefix algorithm and the use of splaying to manage the cumulative frequency tree provide useful illustrations of the utility of splaying to manage trees which are not lexicographically organized. The notion of twisting a tree prior to splaying to eliminate the need for the zig-zag case may be applicable to other nonlexicographic trees, as may the notion of semi-rotation for balancing such a tree. As an example, these techniques should be applicable to merge trees where a binary tree of 2-way merges is used to construct an n-way merge; Saraswat appears to have used similar ideas in developing his Prolog implementation of merge trees [7].

It is interesting to note that, as with other adaptive compression schemes, the loss of one bit from the stream of compressed data is catastrophic! This suggests that it would be interesting to search for ways of recovering from such a loss; yet it also suggests the use of such compression schemes in cryptography. It is well known that compressing a message before it is encrypted increases the difficulty of breaking the code simply because code breaking relies on redundancy in the encrypted text and compression reduces this redundancy. The new possibility, introduced by the compression algorithms described here, is to use the initial state of the prefix code tree or the initial state of the
cumulative frequency tree as a key for direct encryption during compression. The arithmetic compression algorithm could further complicate the work of a code breaker because letter boundaries do not necessarily fall between bits.

The key space for such an encryption algorithm is huge. For an n letter alphabet, there are n! permutations allowed on the leaves of the tree, times C(n-1) trees with n - 1 internal nodes, where C(i) = (2i)!/(i!(i + 1)!) is the ith Catalan number. This product simplifies to (2(n - 1))!/(n - 1)!. For n = 257 (a 256 letter alphabet augmented with an end-of-file character), this is 512!/256!, or somewhat less than 2^2200. A compact integer representation of a key from this space would occupy 675 eight-bit bytes; clearly, such large keys may pose problems. One practical solution would simply involve starting with an initial balanced tree, as in the compression algorithms presented here, and then splaying this tree about each of the letters in a key string provided by the user; most users are unlikely to provide key strings as long as 675 bytes, and it takes keys longer than this to allow splaying to move the tree into all possible configurations, but even short key strings should provide a useful degree of encryption.

Received 11/87; accepted 2/88

Author's Present Address: Douglas W. Jones, Dept. of Computer Science, University of Iowa, Iowa City, IA 52242. Internet address: [email protected]

REFERENCES
1. Bentley, J.L., Sleator, D.D., Tarjan, R.E., and Wei, V.K. A locally adaptive data compression scheme. Commun. ACM 29, 4 (Apr. 1986), 320-330.
2. Gallager, R.G. Information Theory and Reliable Communication. John Wiley & Sons, New York, 1968.
3. Gallager, R.G. Variations on a theme by Huffman. IEEE Trans. Inform. Theory IT-24, 6 (Nov. 1978), 666-674.
4. Jones, D.W. An empirical comparison of priority queue and event set implementations. Commun. ACM 29, 4 (Apr. 1986), 300-311.
5. Knuth, D.E. Dynamic Huffman coding. J. Algorithms 6, 2 (Feb. 1985), 163-180.
6. Rubin, F. Arithmetic stream coding using fixed precision registers. IEEE Trans. Inform. Theory IT-25, 6 (Nov. 1979), 672-675.
7. Saraswat, V. Merge trees using splaying, or how to splay in parallel and bottom-up. PROLOG Digest 5, 22 (Mar. 27, 1987).
8. Sleator, D.D., and Tarjan, R.E. Self-adjusting binary trees. In Proceedings of the ACM SIGACT Symposium on Theory of Computing (Boston, Mass., Apr. 25-27). ACM, New York, 1983, pp. 235-245.
9. Tarjan, R.E., and Sleator, D.D. Self-adjusting binary search trees. J. ACM 32, 3 (July 1985), 652-686.
10. Vitter, J.S. Two papers on dynamic Huffman codes. Tech. Rep. CS-85-33, Brown University Computer Science, Providence, R.I., revised Dec. 1986.
11. Welch, T.A. A technique for high-performance data compression. IEEE Comput. 17, 6 (June 1984), 8-19.
12. Witten, I.H., Neal, R.M., and Cleary, J.G. Arithmetic coding for data compression. Commun. ACM 30, 6 (June 1987), 520-540.

CR Categories and Subject Descriptors: E.1 [Data Structures]: trees; E.2 [Data Storage Representations]: linked representations; E.4 [Coding and Information Theory]: data compaction and compression
General Terms: Algorithms, Performance
Additional Key Words and Phrases: Adaptive algorithms, arithmetic codes, data compression, splay trees, prefix codes

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.