3F7 - FTR - Improving Arithmetic Codes
DATA COMPRESSION
Full Technical Report
Oliver Jones
Christ’s College
Contents
1 Introduction
2 Methodology
3 Benchmarking Results
8 Conclusions
8.1 Further Work
1 Introduction
Data compression, or coding, is the act of reducing the storage size of digital data through algorithmically defined codewords. Such a process is crucial in the digital age, where a transmission channel needs to be occupied for as little time as possible, or where vast amounts of information need to be archived and stored without requiring significant hard drive or memory space on a digital device.
This Full Technical Report details how the algorithms laid out in the initial 3F7 lab can be modified to improve their performance. The main focus is on making them adaptive, that is, creating a compression algorithm that needs no prior knowledge of a probability distribution and instead estimates it on the fly. Some consideration is also given to improving upon the independent and identically distributed (i.i.d.) assumptions of the algorithms. As Shannon-Fano coding is largely of academic interest (the optimality of Huffman coding makes it redundant), this report does not cover modifications to the Shannon-Fano algorithm developed initially, and instead focuses on Huffman and Arithmetic coding.
2 Methodology
Code created on top of that provided in the lab is given as an appendix, and the entire source can be found at github.com/falcoso/CamZip. An IPython notebook is also provided to demonstrate the code. The focus of the analysis in this report is the compression performance of the algorithms; complexity is considered only comparatively between algorithms, and the specific implementations provided in the Appendix may not be the most runtime-efficient due to development time constraints.
All algorithms developed were benchmarked against a number of files that can be found at corpus.canterbury.ac.nz, summarised in Table 1.
For the adaptive algorithms presented in the benchmarking, N was taken to be 1% of the file length (in reality such a varying N is not ideal, as this information would also need to be passed to the decoder) and α = 0.5. References to 'Entropy' mean the entropy of the file assuming an i.i.d. model and 'Markov Entropy' refers to the entropy of the file assuming a Markov chain model.
The Adaptive Huffman algorithm referenced in Table 2 is the Vitter algorithm. The FGK algorithm provided does not decay its distributions so will be strictly worse performing (discussed further in Section 7). Discrepancies between the functionality of the algorithms (estimators, escape symbols, decayed distributions, etc.) are down to a focus on having an example of each type of functionality within a family of compression algorithms, rather than each algorithm being equally capable.
3 Benchmarking Results
Table 2: Benchmarking results for each algorithm across the corpus files.
File Stat. Huff Ad. Huff Stat. Arith Ad. Arith Cont. Arith 7zip
a.txt 0 7 2 9 3 952
aaa.txt 0 1.00006 0.00002 0.12741 0.00035 0.01704
alice29.txt 4.55529 4.57744 4.51289 4.54987 3.50206 2.61798
alphabet.txt 4.7692 4.81154 4.70045 4.81522 0.00039 0.01984
asyoulik.txt 4.84465 4.87094 4.80812 4.85613 3.41797 2.8514
bib 5.23171 5.27439 5.20069 5.26043 3.36444 2.20965
bible.txt 4.38495 4.37417 4.34275 4.33401 3.26911 1.7564
book1 4.56181 4.56293 4.52715 4.53015 3.58458 2.71718
book2 4.82339 4.78802 4.79264 4.75693 3.74529 2.2256
E.coli 2 2.24123 1.99982 2.0021 1.98143 2.04769
fields.c 5.0409 5.41865 5.0078 5.28951 2.95354 2.21776
grammar.lsp 4.66434 5.8358 4.63236 5.68584 2.81349 2.93684
lcet10.txt 4.65373 4.61886 4.62271 4.59406 3.55983 2.28365
news 5.22699 5.1962 5.18963 5.15768 4.09201 2.53349
paper1 5.01669 5.00581 4.98301 4.95265 3.64673 2.60928
paper2 4.6341 4.68534 4.60144 4.65941 3.52279 2.65959
paper3 4.68974 4.80705 4.66511 4.78743 3.5556 2.9427
paper4 4.73258 5.02047 4.69983 5.11712 3.4796 3.28948
paper5 4.97281 5.27137 4.93617 5.28836 3.52853 3.31738
paper6 5.04349 4.9556 5.00953 4.91613 3.61207 2.63902
plrabn12.txt 4.5196 4.52427 4.47713 4.49096 3.44257 2.80933
progc 5.23365 5.32577 5.19904 5.19767 3.60428 2.54818
progl 4.79936 4.72936 4.77009 4.71938 3.21208 1.68294
progp 4.89496 4.86004 4.86879 4.84953 3.18828 1.69141
random.txt 6 6.03048 5.9995 6.08046 5.97128 6.15168
trans 5.53526 5.40153 5.50101 5.3593 3.41452 1.47511
world192.txt 4.99637 4.9783 4.95309 4.93916 3.75947 1.6592
xargs.1 4.92382 6.07476 4.89875 5.87958 3.2018 3.55051
However, it is clear that many files, such as English text, are not i.i.d. but will instead vary based on the text before them, for example P(X_n = u | X_{n-1} = q) ≈ 1. Take a simple Markov chain as the assumed distribution of the data: the probability of a character only depends on the character before it (i.e. the source has limited memory). From the definition of mutual information between the current symbol X_n and the previous symbol X_{n-1}:

I(X_n ; X_{n-1}) = H(X_n) − H(X_n | X_{n-1})
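The 'Entropy' and 'Markov Entropy' figures referred to in the Methodology can be estimated directly from symbol and symbol-pair counts. Below is a minimal sketch of such an estimate, assuming the file is read as a sequence of bytes; the function names and structure are illustrative and not taken from the provided source.

import math
from collections import Counter

def iid_entropy(data: bytes) -> float:
    """Entropy in bits/symbol assuming an i.i.d. source model."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def markov_entropy(data: bytes) -> float:
    """Conditional entropy H(X_n | X_{n-1}) assuming a first-order Markov model."""
    prev_counts = Counter(data[:-1])                 # counts of the conditioning symbol
    pair_counts = Counter(zip(data[:-1], data[1:]))  # counts of adjacent symbol pairs
    total = len(data) - 1
    h = 0.0
    for (prev, _), c in pair_counts.items():
        p_pair = c / total                           # P(X_{n-1}, X_n)
        p_cond = c / prev_counts[prev]               # P(X_n | X_{n-1})
        h -= p_pair * math.log2(p_cond)
    return h

The mutual information in the expression above is then approximately the difference between these two estimates.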
the decoding function with the length of the original file so that it knows when to terminate, but there may be many applications in which the real file length is not known.
1. For an alphabet of length K, create a list of K − 1 sibling pairs; that is, a list of adjacent nodes on a tree. Each sibling pair should hold five numbers:
• 2 counters, one for each sibling node in the pair, incremented when that node is traversed
• A Forward Pointer that points to the parent node of the sibling pair, with an
additional bit to say whether it is a 0 or 1 traversal
• 2 Backward Pointers that point to the nodes in the pair (either another sibling
pair or a leaf node)
2. Create a list of the alphabet and, for each symbol, a pointer to its location within the sibling list. This is effectively a list of leaf nodes.
For the algorithm to work as intended it is important that the list of sibling pairs is consistently maintained such that the counts of any pair higher (or lower, depending on the ordering) in the list than the current pair are both equal to or greater than the current pair's counts. If not, pointers are swapped around to make this the case.
To make the code's implementation clear with reference to the above algorithm, a SiblingPair() class with attributes for each of the appropriate pointers is used. An additional bit was also added to the back pointers to indicate when a back pointer references the alphabet pointer list, so that the algorithm can switch between the two accordingly. The provided algorithm uses a Laplacian estimator (Section 6) and does not decay the distribution.
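As an illustration, a container of this kind might look like the minimal sketch below; the attribute names are assumptions for the purposes of this example rather than those of the provided SiblingPair() class.

class SiblingPair:
    """One pair of adjacent sibling nodes in the adaptive Huffman tree."""

    def __init__(self):
        self.count = [0, 0]        # traversal counters, one per sibling (0-branch, 1-branch)
        self.fp = (-1, 0)          # forward pointer: (index of parent pair, bit taken to reach it)
        self.bp = [(-1, True),     # backward pointers: (index, is_leaf) for each sibling;
                   (-1, True)]     # is_leaf=True means the index refers to the alphabet list

# The alphabet list then maps each symbol to the sibling pair (and side) holding its leaf,
# e.g. leaf_pointers[symbol] = (pair_index, bit).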
The algorithm is fairly straightforward to implement but has the drawback that it will not always generate a tree of minimum weight, i.e. the depth of the tree can grow faster than necessary after a series of new symbols. An alternative, but more complex, algorithm is the Vitter algorithm.
Each node in the tree has a given order, with the order of the root node being the highest and the order of the NULL node being the lowest (see Section 6). All nodes of the same weight are considered to be in the same 'weight class', and the order of every node in a higher weight class is greater than the order of every node in the current weight class. When a weight is to be incremented, the node is first swapped with the node of highest order in its weight class; when its weight is then incremented it sits at the bottom of the next weight class up.
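A minimal sketch of that increment-and-swap step is given below, assuming the nodes are held in a list sorted by order; the node fields and helper name are illustrative, not those of the provided implementation.

def increment_weight(nodes, i):
    """Increment the weight of nodes[i] while preserving the weight-class ordering.

    `nodes` is assumed to be sorted by order (index 0 = lowest order) and each
    node to have a `weight` attribute. Hypothetical sketch, not the report's code.
    """
    # Find the highest-order node that shares the current node's weight class.
    j = i
    while j + 1 < len(nodes) and nodes[j + 1].weight == nodes[i].weight:
        j += 1

    # Swap the node to the top of its weight class...
    nodes[i], nodes[j] = nodes[j], nodes[i]

    # ...so that after incrementing it sits at the bottom of the next class up.
    nodes[j].weight += 1
    return j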
Figure 1: Trees for the encoding of the string 'abacabdabaceabacabdfg' with Zero node (source: http://www.stringology.org; blue, black and yellow numbers indicate the order, weight and codeword of each node respectively)
5.4 Complexity
Once a tree is generated in the static algorithm, a dictionary codebook is generated, so only a simple lookup is required to encode, and an extended tree to decode. The adaptive algorithm requires a tree in both cases, so that each node counter can be incremented as it is traversed and the tree re-created on decoding. Traversing the tree is much like a binary search, similar to what happens when a dictionary is indexed by its key, but Python is naturally optimised for lookups within a dictionary compared to indexing multiple items in a list. In a lower-level language (where compression algorithms are much more commonly implemented for speed), this will become less of an issue.
The real added complexity to the Adaptive algorithm is the shuffling of the trees after
each symbol is encoded or decoded. Particularly at the start when the tree will be changing
rapidly as each symbol is encountered for the first time, multiple nodes on the tree will be
re-arranged as the counts change, slowing down the compression process. It is also worth noting that the choice of N (see Section 7) will have an effect on the runtime, as it determines how many times the sibling list is decayed over the encoding and decoding process - compare the runtime of the implementation when N = 100 vs. N = 10000.
5.5 Corruptibility
As demonstrated in the initial lab, the static Huffman algorithm is very robust to corruption of the data. The change of a bit will change the current symbol and possibly one or two of the following symbols on decoding, but the decoder quickly re-synchronises with the code and the rest of the decoding is completely unaffected. Adaptive Huffman, however, is much more susceptible to corruption. As the tree is generated and the weights incremented the tree will change, so changing a bit in the compressed code changes the weights on the tree during decoding from what they were at the same point in the encoding, and the rest of the file becomes unreadable.
This will not always happen, however - towards the end of the file, when the estimated distribution approaches the real distribution (particularly for long files), the Huffman tree will not change for every symbol encoded, so the decoder may be able to cope provided the weight classes are sufficiently different from one another.
In the case of a Huffman algorithm, this will be at the lowest order on the tree, and when encountered it will spawn two more nodes, with the 0 node being the new location of the escape symbol and the 1 node being the new symbol that has been encountered.
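A minimal sketch of that spawning step, written against a hypothetical node type rather than the provided classes, might look as follows.

class Node:
    """Hypothetical adaptive Huffman tree node, for illustration only."""
    def __init__(self, symbol=None, weight=0):
        self.symbol = symbol          # None for internal/escape nodes
        self.weight = weight
        self.children = [None, None]  # [0-branch, 1-branch]

def spawn_from_escape(escape, new_symbol):
    """Split the escape leaf when a previously unseen symbol arrives."""
    new_escape = Node()                       # 0 child: the new location of the escape symbol
    new_leaf = Node(symbol=new_symbol)        # 1 child: leaf for the newly encountered symbol
    escape.children = [new_escape, new_leaf]  # the old escape leaf becomes an internal node
    escape.symbol = None
    return new_escape, new_leaf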
As this method requires the transmission of the escape symbol and the block code of the new symbol, the compression ratio will naturally be worse for files whose size is comparable to the size of the alphabet contained within them, compared to initialising a redundant alphabet as described above. The two approaches could, of course, be combined, with an initial estimator and an escape symbol, so that the scope of the algorithm is not reduced.
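For reference, the Laplacian (add-one) estimator mentioned above assigns every symbol a non-zero probability from the start; a minimal sketch, with illustrative names, is shown below.

def laplace_estimate(counts: dict, symbol, alphabet_size: int) -> float:
    """Biased (add-one) probability estimate: every symbol starts with a pseudo-count of 1."""
    total = sum(counts.values()) + alphabet_size
    return (counts.get(symbol, 0) + 1) / total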
τ = N / (1 − α)    (4)
For Adaptive Huffman codes, even though α is not an integer, the counts themselves still need to be maintained as integers, so that incrementing the weight of a node does not cause it to skip several weight classes, which may include the parent of the pair. This is maintained naturally by applying a floor or ceiling function after all the counts have been multiplied by α.
The choice between floor and ceiling depends on whether symbols that have not occurred for a while, and may be clogging up the tree, should be removed. Ceiling means the minimum value of a count is always 1 once the symbol appears on the tree, so it is never removed. Floor will always reduce a count by at least 1, so there is a point where a count reduces to 0 and the symbol is merged back into the NULL symbol. The provided Vitter implementation offers either behaviour through a boolean option, remove.
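A minimal sketch of such a decay step, operating on a plain dictionary of counts rather than the tree itself (names and structure are assumptions, not the provided implementation), is shown below.

import math

def decay_counts(counts: dict, alpha: float, remove: bool) -> dict:
    """Scale all symbol counts by alpha, keeping them as integers.

    With remove=True a floor is used, so counts can reach 0 and the symbol can be
    dropped (merged back into the NULL/escape node); with remove=False a ceiling
    keeps every seen symbol at a count of at least 1.
    """
    rounder = math.floor if remove else math.ceil
    decayed = {s: rounder(c * alpha) for s, c in counts.items()}
    if remove:
        decayed = {s: c for s, c in decayed.items() if c > 0}
    return decayed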
Figure 2 shows the variation in compression rate with the time constant for different N and α for an Adaptive Huffman algorithm. While the adaptive compression of Hamlet cannot beat the assumed i.i.d. source entropy, the compression of a terminal transcript improves on this compression limit by a significant amount, implying that its distribution changes over the file. Both sets of data appear to follow a trend that, by empirical inspection, is of the form:
Figure 2: Compression ratios of Adaptive Huffman against time constant as N and α are varied. (a) Hamlet with and without node removal; (b) Canterbury 'trans' with node removal.
8 Conclusions
Adaptive coding methods allow data to be decoded with no prior knowledge of the distribution of the encoded data. This removes the need to send an additional uncompressed dictionary with a coded file and, with proper decay mechanisms, allows the code to adapt to time-varying distributions.
While Arithmetic methods do beat Huffman methods, as they come arbitrarily close to the file's distribution entropy, the nature of the algorithm means that it can only work on files of a fixed length, whereas Huffman codes can be used to compress streaming data.
Adaptive methods often need an estimator of the initial distribution or an additional symbol to indicate when a new symbol is encountered. These add overheads to the file size that a first-pass algorithm would not, but their effect can be mitigated with the decay mechanisms previously mentioned.
The main drawback of adaptive methods, however, is their susceptibility to corruption: while Arithmetic coding is already susceptible, Adaptive Huffman methods also become unreadable if the decoder does not receive the exact compressed file.
8.1 Further Work
• Encode the information of N and α into the compressed file so that it does not need to be shared with the decoder separately. One solution would be to use Elias Gamma coding to give N first, then a second Elias Gamma integer M for the precision of α, and then M further bits giving α as a sum of negative powers of 2 (see the sketch after this list).
• Combine the Adaptive and Contextual algorithms into a single programme that calculates conditional probabilities on the fly. Further work here could look at PPM methods.
• Add escape symbols to the Adaptive Arithmetic algorithm to compare its results against
a biased estimator.
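By way of illustration, an Elias Gamma encoder for the integers in that scheme could look like the minimal sketch below; the function name is illustrative.

def elias_gamma(n: int) -> str:
    """Elias Gamma code for a positive integer: floor(log2 n) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias Gamma is defined for positive integers only")
    binary = bin(n)[2:]              # binary representation of n, beginning with a 1
    return "0" * (len(binary) - 1) + binary

# For example, encoding N = 100 gives '0000001100100' (six zeros then 1100100),
# which the decoder can delimit without knowing N in advance.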
References
[1] G. V. Cormack and R. N. S. Horspool. "Data Compression Using Dynamic Markov Modelling". In: Comput. J. 30.6 (Dec. 1987), pp. 541–550. issn: 0010-4620. doi: 10.1093/comjnl/30.6.541. url: http://dx.doi.org/10.1093/comjnl/30.6.541.
[2] P. Elias. "Universal codeword sets and representations of the integers". In: Information Theory, IEEE Transactions on 21.2 (1975), pp. 194–203. issn: 0018-9448.
[3] R. Gallager. "Variations on a theme by Huffman". In: Information Theory, IEEE Transactions on 24.6 (1978), pp. 668–674. issn: 0018-9448.
[4] J. J. Rissanen. "Generalized Kraft Inequality and Arithmetic Coding". In: IBM Journal of Research and Development 20.3 (May 1976), pp. 198–203. issn: 0018-8646. doi: 10.1147/rd.203.0198.
[5] J. Vitter. "Design and analysis of dynamic Huffman codes". In: Journal of the ACM (JACM) 34.4 (1987), pp. 825–845. issn: 1557-735X.
[6] I. Witten and T. Bell. "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression". In: IEEE Transactions on Information Theory 37 (July 1991), pp. 1085–1094. doi: 10.1109/18.87000.
Appendices
A Python Code
Attached with this report is a set of Python files written on top of those used in the initial lab. The start of each Python file contains a docstring summarising the functionality of the algorithm (e.g. whether it decays distributions, which estimators are used, etc.). The manifest is as follows:
• adaptive_arithmetic.py - All the functions required for the Adaptive Arithmetic algorithm.
• context_arithmetic.py - All the functions required for the Contextual Arithmetic algorithm.