Arithmetic Code Discussion and Implementation
The rest of this page discusses arithmetic coding and the results of
my efforts so far.
Algorithm Overview
Arithmetic coding is similar to Huffman coding; they both achieve their compression by reducing the
average number of bits required to represent a symbol.
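Given an alphabet in which every symbol has a known probability of
occurrence, and the probabilities sum to 1, each symbol can be assigned
its own non-overlapping range of values between 0 and 1.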
For example, suppose we have an alphabet 'a', 'b', 'c', 'd', and 'e'
with probabilities of occurrence of 30%,
15%, 25%, 10%, and 20%. We can
choose the following range assignments to each symbol based on its
probability:
Symbol  Probability  Range
a       30%          [0.00, 0.30)
b       15%          [0.30, 0.45)
c       25%          [0.45, 0.70)
d       10%          [0.70, 0.80)
e       20%          [0.80, 1.00)
TABLE 1. Sample Symbol Ranges
Where square brackets '[' and ']' mean the adjacent number is
included and parentheses '(' and ')' mean the adjacent number is
excluded.
Range assignments like the ones in this table can then be used for
encoding and decoding strings of symbols in the alphabet. Algorithms
using ranges for coding are often referred to as range coders.
Encoding Strings
By assigning each symbol its own unique probability range, it's
possible to encode a single symbol by its range. Using this approach,
we could encode a string as a series of probability ranges, but that
doesn't compress anything. Instead, additional symbols may be encoded
by restricting the current probability range by the range of the new
symbol being encoded. The pseudo code below illustrates how additional
symbols may be added to an encoded string by restricting the string's
range bounds.
lower bound = 0
upper bound = 1
while there are still symbols to encode
    current range = upper bound - lower bound
    upper bound = lower bound +
        (current range × upper bound of new symbol)
    lower bound = lower bound +
        (current range × lower bound of new symbol)
end while
Any value between the computed lower and upper probability bounds now
encodes the input string.
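Example:
Encode 'a'
current range = 1 - 0 = 1
upper bound = 0 + (1 × 0.30) = 0.30
lower bound = 0 + (1 × 0.00) = 0.00
Encode 'c'
current range = 0.30 - 0.00 = 0.30
upper bound = 0.00 + (0.30 × 0.70) = 0.210
lower bound = 0.00 + (0.30 × 0.45) = 0.135
Encode 'e'
current range = 0.210 - 0.135 = 0.075
upper bound = 0.135 + (0.075 × 1.00) = 0.210
lower bound = 0.135 + (0.075 × 0.80) = 0.195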
The string "ace" may be encoded by any value within the
probability range [0.195, 0.210).
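The narrowing of the range for "ace" can be checked mechanically. What follows is a minimal Python sketch, assuming the ranges from Table 1. The names TABLE_1 and encode_range are made up for this illustration, and exact fractions stand in for the unlimited precision that the pseudo code takes for granted.

```python
from fractions import Fraction

# Table 1 ranges as exact fractions: symbol -> (lower bound, upper bound).
TABLE_1 = {
    "a": (Fraction(0, 100), Fraction(30, 100)),
    "b": (Fraction(30, 100), Fraction(45, 100)),
    "c": (Fraction(45, 100), Fraction(70, 100)),
    "d": (Fraction(70, 100), Fraction(80, 100)),
    "e": (Fraction(80, 100), Fraction(100, 100)),
}


def encode_range(text, ranges=TABLE_1):
    """Return the (lower, upper) bounds of the range that encodes text."""
    lower, upper = Fraction(0), Fraction(1)
    for symbol in text:
        current_range = upper - lower
        sym_lower, sym_upper = ranges[symbol]
        # Compute the upper bound first; both updates use the old lower bound.
        upper = lower + current_range * sym_upper
        lower = lower + current_range * sym_lower
    return lower, upper


lower, upper = encode_range("ace")
print(lower, upper)
```

For "ace" the function returns 39/200 and 21/100, which is the range [0.195, 0.210).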
Decoding Strings
The decoding process must start with an encoded value representing
a string. By definition, the encoded value lies within the lower and
upper probability range bounds of the string it represents. Since the
encoding process keeps restricting ranges (without shifting), the
initial value also falls within the range of the first encoded symbol.
Successive encoded symbols may be identified by removing the scaling
applied by the known symbol. To do this, subtract out the lower
probability range bound of the known symbol, and divide by the size of
the symbol's range.
encoded value = encoded input
while string is not fully decoded
    identify the symbol whose range contains encoded value
    current range =
        upper bound of new symbol -
        lower bound of new symbol
    encoded value =
        (encoded value - lower bound of new symbol)
        ÷ current range
end while
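Example:
Using the ranges from Table 1, decode the three symbol string encoded
as 0.20, a value within [0.195, 0.210).
0.20 lies within [0.00, 0.30), so the first symbol is 'a'.
encoded value = (0.20 - 0.00) ÷ 0.30 ≈ 0.667
0.667 lies within [0.45, 0.70), so the second symbol is 'c'.
encoded value ≈ (0.667 - 0.45) ÷ 0.25 ≈ 0.867
0.867 lies within [0.80, 1.00), so the third symbol is 'e'.
In case you were sleeping, this is the string that was encoded in the
encoding example.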
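The decoding loop can be sketched in Python in the same spirit. This is only an illustration: it assumes the Table 1 ranges, decode_value is a hypothetical name, and the number of symbols is passed in because the encoded value alone does not say when to stop. The input 0.20 is used because it lies within [0.195, 0.210).

```python
from fractions import Fraction

# Table 1 ranges as exact fractions: symbol -> (lower bound, upper bound).
TABLE_1 = {
    "a": (Fraction(0, 100), Fraction(30, 100)),
    "b": (Fraction(30, 100), Fraction(45, 100)),
    "c": (Fraction(45, 100), Fraction(70, 100)),
    "d": (Fraction(70, 100), Fraction(80, 100)),
    "e": (Fraction(80, 100), Fraction(100, 100)),
}


def decode_value(encoded_value, length, ranges=TABLE_1):
    """Decode length symbols from a value in [0, 1)."""
    value = Fraction(encoded_value)
    decoded = []
    for _ in range(length):
        # Identify the symbol whose range contains the current value.
        for symbol, (sym_lower, sym_upper) in ranges.items():
            if sym_lower <= value < sym_upper:
                break
        else:
            raise ValueError("encoded value is outside [0, 1)")
        decoded.append(symbol)
        # Remove the scaling applied by the symbol just decoded.
        value = (value - sym_lower) / (sym_upper - sym_lower)
    return "".join(decoded)


print(decode_value(Fraction(1, 5), 3))
```

Any other value in [0.195, 0.210), including the lower bound itself, decodes to the same three symbols.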
There are two issues I've glossed over with this example: knowing
when to stop, and the required computational precision. The section on
implementation addresses both of these issues.
Implementation
This section discusses some of the issues of implementing an
arithmetic encoder/decoder and the
approach that I took to handle those
issues. As I have already stated, my implementation is intended to be
easy to develop and follow, not necessarily optimal.
What is a Symbol
One of the first questions that needs to be resolved before you start
is "What is a symbol?". For my
implementation a symbol is any
8 bit combination as well as an End Of File (EOF) marker.
This means
that there are 257 possible symbols in any encoded stream.
Infinite Precision
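The range bounds are fractions between 0 and 1, so they can be written
as binary fractions, where the first bit after the point is worth 1/2,
the second 1/4, and so on.
Example:
.00000... = 0
.10000... = 1/2
.01000... = 1/4
.11000... = 3/4
.11111... = 1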
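The values in this example can be reproduced with a few lines of Python. The helper name binary_fraction is made up for this illustration, and it only handles a finite number of bits, so a run of ones approaches 1 without reaching it.

```python
from fractions import Fraction


def binary_fraction(bits):
    """Value of a finite string of bits written after the binary point."""
    return sum(Fraction(int(bit), 2 ** position)
               for position, bit in enumerate(bits, start=1))


for bits in ("0", "1", "01", "11", "1" * 16):
    print("." + bits, "=", binary_fraction(bits))
```

With 16 bits, the all-ones pattern is worth 65535/65536, which is as close to 1 as a 16 bit bound can get.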
Fortunately for us, we don't need to operate on all the bits at once.
As additional symbols are encoded, the lower and upper range bounds
converge. As the bounds converge, their most significant bits stop
changing. By choosing the precision of each symbol's range bounds, we
can also limit the number of bits required for each computation.
As you read through this section, you will learn how it is possible
to achieve good compression results using 16 bit computations.
The next step is to scale all of the symbol counts so that they can
be used in N bit computations. To do this, symbol counts must be
scaled to use no more than N - 2 bits. In the case of 16 bit
computations, counts must be 14 bits or less. You'll see why in the
next section.
Use the following steps to scale the counts to be used with N bit
integer math:
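One simple way to meet the N - 2 bit limit is repeated halving. The Python sketch below is an assumption made for illustration and is not necessarily the procedure used by the implementation. It takes the limit to apply to the total of all counts, and it rounds up so that a symbol that occurs at least once never ends up with a count of zero. The name scale_counts is hypothetical.

```python
def scale_counts(counts, n_bits=16):
    """Halve counts until their total fits in n_bits - 2 bits."""
    limit = 1 << (n_bits - 2)
    scaled = list(counts)
    if sum(1 for count in scaled if count) >= limit:
        raise ValueError("too many distinct symbols for this precision")
    while sum(scaled) >= limit:
        # Round up so a count of one never drops to zero.
        scaled = [(count + 1) // 2 for count in scaled]
    return scaled


print(scale_counts([40000, 20000, 1, 0]))
```

Halving keeps the relative sizes of the large counts roughly intact, at the cost of slightly overstating the probability of very rare symbols.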
Now we have a scaled count for each symbol. We need to convert the
scaled counts to range bounds on a
probability line. For an alphabet
with symbols S0, S1, ... Sn and scaled
counts c0, c1, ... cn, we can define
ranges as follows:
Symbol  Range
S0      [0, c0)
S1      [upper range of S0, upper range of S0 + c1)
S2      [upper range of S1, upper range of S1 + c2)
.       .
.       .
.       .
Sn      [upper range of Sn-1, upper range of Sn-1 + cn)
TABLE 2. Scaled Symbol Ranges
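Table 2 amounts to a running total of the scaled counts. A minimal Python sketch, with build_ranges as a made-up name and the counts chosen to mirror the percentages in Table 1:

```python
from itertools import accumulate


def build_ranges(scaled_counts):
    """Turn scaled counts c0..cn into [lower, upper) ranges as in Table 2."""
    uppers = list(accumulate(scaled_counts))
    lowers = [0] + uppers[:-1]
    return list(zip(lowers, uppers))


for symbol, bounds in zip("abcde", build_ranges([30, 15, 25, 10, 20])):
    print(symbol, bounds)
```

The upper bound of the last symbol equals the sum of all the counts, which is why that sum has to respect the N - 2 bit limit.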
For my adaptive model we start with scaled symbols, but the N - 2 bit
restriction still applies. It's
possible to add the effect of a new
symbol that would make the symbol count too large to be represented
by
N - 2 bits (see Adaptive Model and Symbol Range
Updates). When this happens, I just rescale the
current model for
half the symbol count.
Use the following steps to scale an adaptive model to be used with N bit
integer math:
The following pseudo code modifies the standard form of the encoding
algorithm for a range scale of [0,
∑ci):
lower bound = 0
upper bound = ∑ci
end while
There are a few important things to know about this algorithm:
The following pseudo code modifies the standard form of the decoding
algorithm for a range scale of [0,
∑ci):
encoded value = encoded input
lower bound = 0
upper bound = ∑ci
// remove scaling
end while
Lower Range Bound = 1011010111001101
Shift out the MSB (1) and write it to the encoded output
Underflow
Now that we have a rule for shifting bits through our N bit variables,
we can do everything with N bit math. Well, almost. It turns out
there's one thing I overlooked. What happens when the lower and upper
range bounds start to converge around 1/2 (0111... and 1000...)? This
condition is called an underflow condition. When an underflow
condition occurs the MSBs of the range bounds will never match.
The good news is that there's a fairly simple way to handle underflow.
We can recognize that an underflow condition is pending when the lower
range bound starts with the bits 01 and the upper range bound starts
with the bits 10. What we have to do in this situation is to remove the
second bit from both range bounds, shift the other bits left, and
remember that we had an underflow condition. We may still have an
underflow condition, so this process may need to be repeated. When we
finally get an opportunity to shift out the MSB, follow it with the
underflow bit(s) of the opposite value of the converged MSB.
Example:
This is an underflow condition; remove the second MSB and shift all the
bits over.
Write out the MSB (1) and underflow bit (0), and shift in the new LSB.
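The MSB shifting and underflow rules fit together as in the following Python sketch. It follows a commonly used 16 bit formulation and is not a transcription of the C or Python sources: every name in it is hypothetical, bits are kept in a plain list rather than written to a file, an EOF symbol with a count of one marks the end, and the total of the counts is assumed to fit in 14 bits.

```python
PRECISION = 16                       # N bit math
MAX_VALUE = (1 << PRECISION) - 1     # all ones: 0xFFFF
MSB = 1 << (PRECISION - 1)           # 0x8000
SECOND_MSB = 1 << (PRECISION - 2)    # 0x4000
EOF = "EOF"                          # hypothetical end-of-stream symbol


def build_ranges(counts):
    """Map each symbol to its [lower, upper) cumulative count range."""
    ranges, lower = {}, 0
    for symbol, count in counts.items():
        ranges[symbol] = (lower, lower + count)
        lower += count
    return ranges, lower


def narrow(low, high, symbol_range, total):
    """Restrict the inclusive bounds [low, high] to one symbol's range."""
    span = high - low + 1
    sym_low, sym_high = symbol_range
    return (low + span * sym_low // total,
            low + span * sym_high // total - 1)


def encode(symbols, counts):
    ranges, total = build_ranges(counts)
    low, high, pending, bits = 0, MAX_VALUE, 0, []
    for symbol in list(symbols) + [EOF]:
        low, high = narrow(low, high, ranges[symbol], total)
        while True:
            if (low & MSB) == (high & MSB):
                # MSBs converged: write the bit, then any underflow bits.
                bit = 1 if low & MSB else 0
                bits += [bit] + [1 - bit] * pending
                pending = 0
            elif (low & SECOND_MSB) and not (high & SECOND_MSB):
                # Underflow (01... and 10...): drop the second MSB.
                pending += 1
                low &= SECOND_MSB - 1
                high |= SECOND_MSB
            else:
                break
            low = (low << 1) & MAX_VALUE           # shift in a 0
            high = ((high << 1) & MAX_VALUE) | 1   # shift in a 1
    # Flush enough bits to pin the value inside the final range.
    bit = 1 if low & SECOND_MSB else 0
    return bits + [bit] + [1 - bit] * (pending + 1)


def decode(bits, counts):
    ranges, total = build_ranges(counts)
    stream = iter(bits)
    low, high, code = 0, MAX_VALUE, 0
    for _ in range(PRECISION):
        code = (code << 1) | next(stream, 0)       # pad with zeros past the end
    decoded = []
    while True:
        span = high - low + 1
        target = ((code - low + 1) * total - 1) // span
        symbol = next(s for s, (lo, hi) in ranges.items() if lo <= target < hi)
        if symbol == EOF:
            return decoded
        decoded.append(symbol)
        low, high = narrow(low, high, ranges[symbol], total)
        while True:
            if (low & MSB) == (high & MSB):
                pass                               # matching MSB is shifted out below
            elif (low & SECOND_MSB) and not (high & SECOND_MSB):
                code ^= SECOND_MSB                 # same underflow fix as the encoder
                low &= SECOND_MSB - 1
                high |= SECOND_MSB
            else:
                break
            low = (low << 1) & MAX_VALUE
            high = ((high << 1) & MAX_VALUE) | 1
            code = ((code << 1) & MAX_VALUE) | next(stream, 0)


MODEL = {"a": 30, "b": 15, "c": 25, "d": 10, "e": 20, EOF: 1}
encoded = encode("ace", MODEL)
print(encoded, decode(encoded, MODEL))
```

The decoder mirrors every shift the encoder makes, which is what keeps its copy of the bounds in step with the bits it reads.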
The arithmetic coding algorithm is well suited for both static and
adaptive probability models. Encoders
and decoders using adaptive
probability models start with a fixed model and use a set of rules to adjust
the model as symbols are encoded/decoded. Encoders and decoders that use
static symbol probability
models start with a model that doesn't change
during the encoding process. Static probability models may
be constant
regardless of what is being encoded or they may be generated based on the
encoded input.
The file header is a section of data prior to the encoded output that
includes scaled range counts for each encoded symbol
(except the EOF whose count is always one). Since I've scaled my
counts to all be 14 bit values, I cheat a little and just write 14 bit
values to the encoded file. Handling the 14 bit values is only slightly
more complicated than handling 16 bit values, and it makes the file 512
bits smaller.
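The 512 bit saving is 2 bits for each of the 256 counts. The Python sketch below shows one way 14 bit values could be packed back to back. The bit order and layout are assumptions made for illustration and may not match the actual file format, and pack_counts and unpack_counts are made-up names.

```python
def pack_counts(counts, width=14):
    """Pack counts into bytes, width bits each, most significant bit first."""
    packed = 0
    for count in counts:
        if not 0 <= count < (1 << width):
            raise ValueError("count does not fit in %d bits" % width)
        packed = (packed << width) | count
    total_bits = width * len(counts)
    padding = -total_bits % 8              # zero bits to fill the last byte
    return (packed << padding).to_bytes((total_bits + padding) // 8, "big")


def unpack_counts(data, number_of_counts, width=14):
    """Reverse pack_counts for a known number of counts."""
    packed = int.from_bytes(data, "big")
    packed >>= len(data) * 8 - width * number_of_counts   # drop the padding
    mask = (1 << width) - 1
    return [(packed >> (width * (number_of_counts - 1 - index))) & mask
            for index in range(number_of_counts)]


header = pack_counts([1000] * 256)
print(len(header), "bytes instead of", 256 * 2)
```

For 256 counts this gives 448 bytes rather than 512, a difference of 64 bytes, or 512 bits.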
As its name would suggest, the adaptive model updates with the data
stream. I chose to update my model after a symbol is encoded or decoded.
After a symbol is encoded or decoded, its upper bound and the bounds of
every symbol after it must be incremented by one. On average that will
be 128 updates per encode/decode in a 256 symbol alphabet. Others have
tried to reduce the number of updates required by placing the more
common symbols near the end of the list of ranges. Laziness and my
inability to see a great performance benefit kept me from trying that
approach.
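The update rule can be sketched in a few lines of Python. The model here is assumed to be a list of cumulative upper bounds, one per symbol, which is an illustrative choice rather than the layout used in the actual sources. update_model and symbol_range are hypothetical names.

```python
def update_model(upper_bounds, symbol):
    """Record one more occurrence of symbol (an index into upper_bounds)."""
    # The symbol's own upper bound and every bound after it move up by one.
    for index in range(symbol, len(upper_bounds)):
        upper_bounds[index] += 1


def symbol_range(upper_bounds, symbol):
    """Return the [lower, upper) range currently assigned to symbol."""
    lower = upper_bounds[symbol - 1] if symbol > 0 else 0
    return lower, upper_bounds[symbol]


bounds = [1, 2, 3, 4]            # four symbols, each seen once
update_model(bounds, 1)
print(bounds, symbol_range(bounds, 1))
```

The loop touches every entry from the symbol to the end of the list, which is where the average of 128 updates for a 256 symbol alphabet comes from.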
ArEncodeFile
Declaration:
int ArEncodeFile(FILE *inFile, FILE *outFile,
const model_t model);
Description:
This routine generates a list of arithmetic code ranges for an
input file and then uses the ranges to
write out an encoded version
of that file.
Parameters:
inFile
- The file stream to be encoded. It must be opened and it must also
be rewindable if a static
model is used. If NULL, stdin will be
used.
outFile
- The file stream receiving the encoded results. It must be opened
as binary. If NULL,
stdout will be used.
model
- model_t type value indicating whether a static model
or a dynamic model is to be used.
Effects:
inFile is arithmetically encoded and written to
outFile. Neither file is closed after exit.
Returned:
0 for success, non-zero for failure. errno will be set
in the event of a failure.
Decoding Data
ArDecodeFile
Declaration:
int ArDecodeFile(FILE *inFile, FILE *outFile,
const model_t model);
Description:
This routine opens an arithmetically encoded file, reads its
header, and builds a list of probability
ranges which it then uses
to decode the rest of the file.
Parameters:
inFile
- The file stream containing the encoded input. It must be opened
as binary. If NULL, stdin
will be used.
outFile
- The file stream receiving the decoded results. It must be opened
as binary. If NULL,
stdout will be used.
model
- model_t type value indicating whether a static model
or a dynamic model is to be used.
Effects:
The arithmetically encoded file inFile is decoded and
the results are written to outFile. Neither file
is
closed after exit.
Returned:
0 for success, non-zero for failure. errno will be set
in the event of a failure.
Portability
All of the C source code that I have provided is written in strict
ANSI C. I would expect it to build
correctly on any machine with an
ANSI C compiler. I have tested the code compiled with gcc on several
Linux
distributions as well as mingw on Windows XP. (Un)fortunately I don't
have a modern Windows
system to test.
There are some compile time options that offer minimal speed-up when
compiling code for a little endian
target, but the little endian code is
disabled by default.
The Python code was tested using Python 2.6 on Linux and Windows XP.
It's possible that minor tweaks
will be required to get the code to run
properly using other versions of Python.
Further Information
I found Arturo Campos' articles on arithmetic coding to be a
huge help. Unfortunately all that is left of them is the
Wayback Machine archive.
Mark Nelson's article on arithmetic coding was also a huge help.
Both authors' articles provided most of the information that I used to
implement the algorithm.
Actual Software
I am releasing my implementations of the arithmetic coding algorithm
under the LGPL.
The source code repositories are available on GitHub:
C Version https://github.com/michaeldipperstein/arcode
Python Version https://github.com/michaeldipperstein/arcode-py