Bachelor Thesis
Design of a Python-subset Compiler in Rust targeting ZPAQL
Kai Lüke
[email protected]
Abstract
The compressed data container format ZPAQ embeds decompression algorithms as ZPAQL bytecode in the archive. This work contributes a Python-subset compiler written in Rust for the assembly language ZPAQL and discusses design decisions and improvements. Along the way it explains ZPAQ and some theoretical and practical properties of context mixing compression by the example of compressing digits of π. As use cases for the compiler it shows a lossless compression algorithm for PNM image data, an LZ77 variant ported to Python from ZPAQL to measure compiler overhead and, as the most complex case, an implementation of the Brotli algorithm. It aims to make the development of algorithms for ZPAQ more accessible and to stimulate the discussion whether the current specification limits the suitability of ZPAQ as a universal standard for compressed archives.
Contents

1 Introduction
    1.1 Preceding History
    1.2 Motivation
    1.3 Research Question
    …
    7.1 Exposed API
    …
    7.7 Debugging
    …
Bibliography
A Tutorial
B Visualizations
1 Introduction
It takes time until new and incompatible data compression algorithms become widely adopted in software. Furthermore, different kinds of input data are often best handled with different compression techniques that exploit knowledge about the data.
The ZPAQ standard format for compressed data is a container format which also holds the needed decompression algorithms. They can be specified through a context mixing model of several predictors combined with a bytecode which computes the context data for them (used for arithmetic coding), a bytecode for postprocessing (used for transformations or stand-alone algorithms), or any combination of both.
Arithmetic coding spreads symbols over a number range that is partitioned according to the probability distribution, so that a symbol which is likely to be encoded gets a bigger part. The whole message is encoded as one number. Every time a symbol has been chosen, the (possibly updated) probability distribution is applied again to partition that symbol's part of the number range, and this partition is the one the next symbol is chosen from. When the last symbol has been processed, the number range is very narrow compared to the beginning and every number in it now represents the whole message. So the number with the shortest binary representation can be selected, and a decoder can repeat the same steps by choosing the symbol whose part contains this number and then partitioning that part again according to the probability distribution. In practice, one either has to use a special end-of-message symbol or specify the message length beforehand to define an end to this process.
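The following minimal Python sketch illustrates this interval narrowing with a fixed (non-adaptive) probability distribution and floating point numbers. A real coder like the one in ZPAQ works bitwise with integer ranges and adaptive predictions, so the function names and the toy model here are only illustrative.

    # Sketch of arithmetic coding: narrow a number range per symbol according to
    # a fixed probability distribution (dict order defines the partitioning).
    def encode(message, probs):
        lo, hi = 0.0, 1.0
        for sym in message:
            width = hi - lo
            cum = 0.0
            for s, p in probs.items():
                if s == sym:                       # take this symbol's part of the range
                    hi = lo + (cum + p) * width
                    lo = lo + cum * width
                    break
                cum += p
        return (lo + hi) / 2                       # any number in the final range encodes the message

    def decode(code, length, probs):
        out, lo, hi = [], 0.0, 1.0
        for _ in range(length):
            width = hi - lo
            cum = 0.0
            for s, p in probs.items():
                if lo + (cum + p) * width > code:  # code falls into this symbol's part
                    out.append(s)
                    hi = lo + (cum + p) * width
                    lo = lo + cum * width
                    break
                cum += p
        return out

    probs = {"a": 0.8, "b": 0.1, "<end>": 0.1}     # static toy model
    msg = ["a", "a", "b", "a", "<end>"]
    assert decode(encode(msg, probs), len(msg), probs) == msg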
The history which led to the development of ZPAQ is briefly recounted in this chapter, followed by the motivation for writing a compiler for ZPAQL and the research question of this work. In the following chapter the whole picture of ZPAQ compression is visualized and the building blocks, i.e. context mixing and the bytecode, are explained in detail. General theory of data compression and its limits are briefly noted afterwards. The main part is preceded by a short introduction to compiler construction, the implementation language Rust and the source language Python. One chapter outlines the conditions and difficulties of ZPAQL as a compiler target. The following chapters cover the chosen Python-subset (6), the developed API and compiler internals (7), example programs (8) and finally the evaluation (9).
References to compressors and project websites or more detailed documentation and short remarks are located in footnotes. All websites were last accessed on August 19th, 2016. Academic publications are listed in the bibliography.
1.1 Preceding History
Parts of PAQ variants made it into the recent context mixing compressor cmix, which leads the Large Text Compression Benchmark and the Silesia Open Source Compression Benchmark, but at the cost of around 30 GB of RAM usage.
    ZPAQ is intended to replace PAQ and its variants (PAQ8, PAQ9A, LPAQ, LPQ1, etc) with similar or better compression in a portable, standard format. Current versions of PAQ break archive compatibility with each compression improvement. ZPAQ is intended to fix that.
The development of the ZPAQ archiver started in early 2009 and defined the first version of The ZPAQ Open Standard Format for Highly Compressed Data [8] after some months. The main idea is to move the layout of the context mixing tree as well as the algorithm for context computation and the one for postprocessing into the archive before each block of compressed data. There are nine components to choose from for the tree, mostly context models coming from PAQ. In addition the algorithms are provided as bytecode for a minimal virtual machine. Hence the algorithm implementation is mostly independent of the decompressor implementation and compatibility is preserved when improvements are made. Also, depending on the input data an appropriate compression method can be chosen. The main program using libzpaq is an incremental journaling backup utility which supports deduplication and encryption and provides various compression levels using LZ77, BWT (Burrows-Wheeler Transform) and context models [9]. But there are also the reference decoders unzpaq and tiny_unzpaq, a simple pipeline application zpipe and a development tool zpaqd.
The fastqz compressor [10] for Sanger FASTQ format DNA strings and quality scores also uses the ZPAQ format and was submitted to the Pistoia Alliance Sequence Squeeze Competition.
1.2 Motivation
The development of ZPAQ continued, especially for the use case of the incremental archiver program. But the appearance of new algorithms for ZPAQ stayed at a rather low level, as did the number of authors, despite the fact that it offers a good environment for research on context mixing methods and for crafting special solutions for use cases with a known type of data, because one can build upon existing parts. The leading reason could be that the assembly language ZPAQL with its few registers is not very accessible and that programs are hard to grasp or easily get unmanageable when complex tasks like parsing a header should be accomplished. Therefore, the first question is whether another language can support ZPAQ's popularity.
http://mattmahoney.net/dc/text.html
http://mattmahoney.net/dc/silesia.html
http://www.byronknoll.com/cmix.html
http://mattmahoney.net/dc/zpaq.html
http://mattmahoney.net/dc/zpaqutil.html
http://www.pistoiaalliance.org/projects/sequence-squeeze/
If that is the case, then a compiler for a well-known programming language could help to overcome the obstacle of learning ZPAQL for implementing new algorithms.
The second question is whether the design decisions of the ZPAQ specification allow arbitrary compression algorithms to be used with ZPAQ, even if they bring their own encoder or a dictionary, or whether ZPAQ is rather meant to be a platform for algorithms that use the predefined prediction components like (I)CM and (I)SSE together with the built-in arithmetic coder.
(Indirect) context model and (indirect) secondary symbol estimation, four of the nine ZPAQ components.
http://mattmahoney.net/dc/unzpaq206.cpp
http://mattmahoney.net/dc/tiny_unzpaq.cpp
http://mattmahoney.net/dc/zpaq715.zip
http://mattmahoney.net/dc/zpaqd715.zip
whether arithmetic coding is used and which predicting components make up the context mixing tree. Each component has arguments which also determine memory use. If needed, the bytecode hcomp is embedded to compute context data for the components of the context mixing tree for each byte. All components give bit predictions for the partially decoded byte (these are passed up the tree) and are trained afterwards with the correct bit which was decoded based on the root (i.e. the last) node's probability for each bit.
The optionally arithmetic coded data, which comes from all segment content (not the segment filename or comment) in the block, can start with an embedded pcomp bytecode or declare that no pcomp bytecode is present. Therefore, the hcomp section can already be used for context computation to compress the pcomp bytecode (0 for empty or 1 followed by the length and the bytecode). The pcomp code is used for postprocessing, be it a simple transform or the decompression of LZ codes. It gets each decoded byte as input and outputs a number of bytes not necessarily equal to the input.
That means there are four possible combinations for a block (from the decompressor's perspective): no compression, only context mixing with arithmetic coding, only postprocessing of the stored data, or context mixing with subsequent postprocessing. The chosen selection applies to all (file) segments in the block.
The following charts illustrate the named parts and their relation to each other for a sample compression use case. The transform for x86 machine code enhances compressibility by converting relative addresses to static addresses after each CALL and JMP instruction (0xE8 and 0xE9). It is applied to the two input files, an x86 executable binary and a shared library. Therefore a ZPAQL pcomp program needs to be supplied in the archive block to revert that transform. Encoding takes place based on the probability distribution of 1 and 0 for each bit of the current byte, as provided as prediction by the root node of the simple context mixing tree.
The hcomp program is loaded into the ZPAQL VM and computes contexts for the two components. The ISSE maps the context to a bit history which is used as context for a learning mixer that should improve the probability provided by the first component, a CM (Context Model) which should learn good predictions for the given context. The whole model and the hcomp bytecode are also embedded into the archive block. The two files are stored as two segments in the block (like a solid archive). Because the preprocessor might be any external program or may be included in the compressing archiver, and is of no use for decompression, it is not mentioned in the archive at all.
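As an illustration of such a transform, the following Python sketch converts the relative 32-bit offsets after 0xE8/0xE9 opcodes into absolute addresses and back. The details of the real E8E9 transform in zpaq (range checks, block offsets) may differ, so this is only a sketch of the idea that the pcomp program has to revert.

    # Sketch of an E8/E9 transform: after a CALL (0xE8) or JMP (0xE9) opcode the
    # 32-bit little-endian relative offset is replaced by offset + position, so
    # identical call targets produce identical byte sequences.
    def e8e9_forward(data: bytes) -> bytes:
        out, i = bytearray(data), 0
        while i + 5 <= len(out):
            if out[i] in (0xE8, 0xE9):
                rel = int.from_bytes(out[i+1:i+5], "little")
                absolute = (rel + i) & 0xFFFFFFFF          # make the target absolute
                out[i+1:i+5] = absolute.to_bytes(4, "little")
                i += 5
            else:
                i += 1
        return bytes(out)

    def e8e9_inverse(data: bytes) -> bytes:                # what pcomp has to undo
        out, i = bytearray(data), 0
        while i + 5 <= len(out):
            if out[i] in (0xE8, 0xE9):
                absolute = int.from_bytes(out[i+1:i+5], "little")
                rel = (absolute - i) & 0xFFFFFFFF          # back to a relative offset
                out[i+1:i+5] = rel.to_bytes(4, "little")
                i += 5
            else:
                i += 1
        return bytes(out)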
Decompression takes place in reverse manner and hcomp is loaded into the ZPAQL VM to compute the
context data for the components of the model. They supply the predictions to the arithmetic coder and are
corrected afterwards. For the reverse transform of each segment pcomp is read from the decoded stream and
loaded in another VM. Then the two segments follow in the decoding step and go through the postprocessing
transform before they are written out as files.
The tool zpaqd only supports the streaming format and can be used to construct this example setup by writing a configuration file for it and then adding the two files as segments into a block. But besides the algorithms that are already defined for compression in libzpaq for the levels 1 to 5 (LZ77, BWT and context mixing), it also offers the ability to specify a customized model (E8E9, LZ77 transformations or also word models are supported) given as argument, so that the above configuration can also be brought to life with something like zpaq a [archive] [files] -method s8.4c22.0.255.255i3 (the notation is documented in libzpaq.h and the zpaq man page). Here the first 8 accounts for 2^8 MB = 256 MB blocks, so that both segments should fit into one block (yet the zpaq application uses the API in a way that creates an additional block); then an order-2 CM and an order-3 ISSE are chained. The resulting configuration including the two ZPAQL programs stored in the archive can be listed with zpaqd l [archive].
For a more general view cf. the compression workflow of the zpaq end-user archiver as described in the article mentioned [9]. It selects one of its predefined algorithms based on their performance for the data and uses deduplication through the journaling format.
http://mattmahoney.net/dc/zpaqdoc.html
MATCH is a model that takes the following byte of a found string match as prediction until a mismatch happens. Therefore it keeps its own history buffer, and the prediction is also varied depending on the length of the match. The match is found based on the (higher order) context hash input. This works because the search does not take place in the history buffer; instead a table maps each context hash input to the last element in the history buffer.
AVG outputs a non-adaptive weighted average of the predictions of two other components. It does not
receive context data.
MIX maps a context and the masked partially decoded byte to weights for producing an averaged prediction
of some other components. Afterwards the weights are updated to reduce the prediction error.
MIX2 is a simpler MIX and can only take two predictions as input.
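The weight update of such a mixer is a logistic mixing step as known from PAQ. The following Python sketch shows the principle; the learning rate, the fixed point scaling and the exact update rule of the ZPAQ MIX component differ, so all constants and names here are illustrative.

    import math

    def stretch(p):                 # map a probability (0,1) to the real line
        return math.log(p / (1.0 - p))

    def squash(x):                  # inverse of stretch
        return 1.0 / (1.0 + math.exp(-x))

    class Mixer:
        """Sketch of a context-selected logistic mixer (constants illustrative)."""
        def __init__(self, n_inputs, n_contexts, lr=0.002):
            self.w = [[0.0] * n_inputs for _ in range(n_contexts)]
            self.lr = lr

        def mix(self, ctx, probs):
            self.ctx, self.st = ctx, [stretch(p) for p in probs]
            self.p = squash(sum(w * s for w, s in zip(self.w[ctx], self.st)))
            return self.p

        def update(self, bit):      # train the weights towards the decoded bit
            err = bit - self.p
            self.w[self.ctx] = [w + self.lr * err * s
                                for w, s in zip(self.w[self.ctx], self.st)]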
SSE stands for secondary symbol estimation (introduced by Dmitry Shkarin, also known as Adaptive Probability Map in PAQ7 or LPAQ): it receives a context as input and takes the prediction of another component, which is then quantized. For the given context this quantized prediction and the next closest quantization value are mapped to predictions which result in an interpolation of both. Initially this map is the identity, but the update step corrects the prediction of the closer-to-original quantized value the same way as in the CM update phase.
A typical place for SSE is to adjust the output of a mixer using a low order (0 or 1) context. SSE components
may be chained in series with contexts typically in increasing order. Or they may be in parallel with
independent contexts, and the results mixed or averaged together. [7]
ISSE is an indirect secondary symbol estimator that, as an SSE, refines the prediction of another component
based on a context which is mapped to a bit history like in an ICM. That bit history is set as context for an
adaptive MIX to select the weights to combine the original prediction with a fixed prediction.
Generally, the best compression is obtained when each ISSE context contains the lower order context of
its input. [7]
is present and if yes, then its length and bytecode. Afterwards the data of the segments follows, without any separation. Except for the program counter, which is set to 0, and the A register, which is used for the input, all state is preserved between the calls.
The postprocessor pcomp is run for each decoded byte of the block to revert the preprocessing and puts out the data via the OUT instruction. After each segment it is invoked with 2^32 - 1 as input to mark the end.
There is an assembly language for the bytecode which is also used in the table describing the corresponding opcodes in the specification [8]. In this assembly language whitespace can only occur between opcode bytes in order to visualize that a=b is a 1-byte opcode while a= 123 is a 2-byte opcode. Comments are written in brackets.
Operations on the 32-bit registers and on elements of H are modulo 2^32 and interpreted as positive numbers in comparisons. Index access into M and H is modulo the size of M or H and denoted as *B for M[B], *C for M[C] and *D for H[D]. Because M holds bytes, operations on *B and *C are modulo 256 and swapping via *B<>A or *C<>A alters only the lower byte of A [8].
Instructions:

    ERROR           cause an execution error
    X++             increment X by 1
    X--             decrement X by 1
    X!              flip all bits of X
    X=0             set X to 0
    X<>A            swap X with A (X is not A)
    X=X             set X to X (e.g. A=B)
    X= N            set X to the constant 0 <= N <= 255
    A+=X            add X to A
    A-=X            subtract X from A
    A*=X            multiply A by X
    A/=X            divide A by X (set A = 0 if X = 0)
    A%=X            A = A mod X (set A = 0 if X = 0)
    A&=X            binary AND with X
    A&~=X           binary AND with NOT X
    A|=X            binary OR with X
    A^=X            binary XOR with X
    A<<=X           shift A left by X
    A>>=X           shift A right by X
    A+= N, A-= N, A*= N, A/= N, A%= N, A&= N,
    A&~= N, A|= N, A^= N, A<<= N, A>>= N
                    the same operations with a constant N
    A==X            F = 1 if A = X, otherwise F = 0
    A<X             F = 1 if A < X, otherwise F = 0
    A>X             F = 1 if A > X, otherwise F = 0
    A== N, A< N, A> N
                    the same comparisons with a constant N
    A=R N, B=R N, C=R N, D=R N
                    set A, B, C or D to R[N]
    R=A N           set R[N] to A
    HALT            end the current execution of the program
    OUT             output A (used in pcomp)
    HASH            A = (A + *B + 512) * 773
    HASHD           *D = (*D + A + 512) * 773
    JMP I           add -128 <= I <= 127 to PC relative to the following instruction, so I = 0 has
                    no effect and I = -1 is an endless loop (in the specification a positive N is
                    used, so PC += ((N + 128) mod 256) - 128)
    JT N            jump if F = 1
    JF N            jump if F = 0
    LJ L            long jump to the absolute position 0 <= L <= 65535 (3-byte instruction)
the if/else/endif and do/while/until/forever statements are macros for conditional jumps, e.g. a> 255 if … endif. The statements can also be interleaved, e.g. to write a do-while-loop or a continue-jump as do … if … forever … endif.
In a ZPAQ config file the two sections for hcomp and pcomp are written after the context mixing model configuration. The pcomp section is optional. Comments can appear everywhere within brackets.
Syntax (where N < 256 and 2^HH, 2^HM, 2^PH and 2^PM define the sizes of H and M for each section):

    comp HH HM PH PM N
      (I COMPONENT (ARG)+ )*
    hcomp
      (ZPAQL_INSTR)*
      halt
    (pcomp (PREPROCESSOR_COMMAND)? ;
      (ZPAQL_INSTR)*
      halt
    )?
    end
2.4 Examples: A simple Context Model and LZ1 with a Context Model
The following example configuration is based on fast.cfg from the utility site and can be used for text compression. It adaptively combines (independently of contexts, just based on the success of the last prediction) the prediction of a direct order-1 context model with the prediction of an order-4 ISSE which refines the prediction of an order-2 ICM. The arguments for the components are documented in the specification [8].
comp 2 2 0 0 4 (hh hm ph pm n)
  (where H gets the size 2^hh in hcomp or 2^ph in pcomp,
   M 2^hm or 2^pm and n is the number of
   context-mixing components)
0 cm 19 4
1 icm 16
hcomp
r=a 2 (R2 = A, input byte in R2)
d=0
a<<= 9 *d=a (H[D] = A) (set context to actual byte)
(leaving first 9 bits free for the partially decoded byte)
a=r 2 (A = R2)
http://mattmahoney.net/dc/zpaqutil.html, config files with a post instead of pcomp are in the old format of the
Level 1 specification
To demonstrate the compression phases and the parts involved in detail, the LZ1 configuration from the utility site is chosen, but the BWT.1 examples are also worth a look.
The LZ1 configuration relies on a preprocessor lzpre.cpp which turns the input data into a compressed LZ77-variant representation of codes for match copies and literal strings. This is further compressed through arithmetic coding with probabilities provided by an ICM (indirect context model).
The contexts are always hashed from two values x and y as follows. For the first byte of the offset number of a match, the length of the match and the current state (2-4, i.e. 1-3 bytes to follow as offset number) are used as context. For the remaining bytes of an offset number (or a new code if no bytes are remaining) the previous context and the current state (previous state - 1, i.e. 0-2 bytes to follow) are used as context. For the first literal of a literal string the number of literals and the state 5 are used as context. For the following literals the current literal and the state 5 are used as context. For a new code after a literal string, instead of a hash of the first value just 0 and the current state (1) are used as context. The bytecode of pcomp is not specially handled.
To revert the LZ1 compression pcomp parses the literal and match codes and maintains a 16 MB = 2^24 byte buffer in M.
(lz1.cfg
hcomp
(c=state: 0=init , 1=expect LZ77 literal or match code ,
2..4= expect n-1 offset bytes ,
5..68= expect n-4 literals)
b=a (save input)
else
a=b a&= 63 a+= 5 c=a (literal length)
*d=0 a=b hashd
endif
else
a== 5 if (end of literal)
c= 1 *d=0
else
a== 0 if (init)
c= 124 *d=0 (5+ length of postprocessor)
endif
endif
(model parse state as context)
else
(LZ77 decoder: b=i, c=c d=state r1=len r2=off
state = d = 0 = expect literal or match code
1 = decoding a literal with len bytes left
2 = expecting last offset byte of a match
3,4 = expecting 2,3 match offset bytes
Input format:
00llllll: literal of length llllll+1 = 1..64 to follow
01lllooo oooooooo: length lll=5..12, offset o=1..2048
10llllll oooooooo oooooooo: l=1..64 offset =1..65536
11llllll oooooooo oooooooo oooooooo: 1..64, 1..2^24)
endif
else
a== 1 if (writing literal)
a=c *b=a b++ out
a=r 1 a-- a== 0 if d=0 endif r=a 1 (if (--len==0) state =0)
else
a> 2 if (reading offset)
a=r 2 a<<= 8 a|=c r=a 2 d-- (off=off <<8|c, --state)
else (state==2, write match)
a=r 2 a<<= 8 a|=c c=a a=b a-=c a-- c=a (c=i-off -1)
d=r 1 (d=len)
do (copy and output d=len bytes)
a=*c *b=a out c++ b++
d-- a=d a> 0 while
(d=state=0. off, len don't matter)
endif
endif
endif
endif
halt
end
For comparison a Python port for usage with the zpaqlpy compiler can be found in test/lz1.py. It differs
in processing the pcomp bytecode with the current opcode byte as context for the next.
and indeed make this swap in order to maintain the bijection, ending up with |C(s2)| > |s2| because |s1| > |s2|. So while C compresses s1 it expands s2, and this holds for each iteration when we change our compression scheme.
Luckily, most data which is relevant to us and interesting for compression has patterns, and other data where we do not understand the patterns appears to be random and is out of scope for compression algorithms. If we do not have any knowledge about the data except that its symbols are equally distributed with probability p = 1/|A| for each symbol of the alphabet A, the best we can do is to use an optimal code reaching the Shannon entropy H = -Σ p·log2(p) = log2(|A|) bits per symbol as coding length. For |A| = 256 this would be 8 bits as usual and we can simply store the data instead of encoding it again.
In general, given that the distribution is known, we can choose e.g. a non-adaptive arithmetic encoder to almost reach the limit of H bits per symbol. But adaptive arithmetic coding with PPM and others can even give a better average because the distribution is adjusted by exploiting patterns. Therefore, to craft a well performing algorithm for the expected input data, knowledge about the patterns is needed in order to give good predictions and go below the Shannon entropy as average size.
To define the lower limit that can be reached, the concept of algorithmic information or Kolmogorov complexity of a string was developed. Basically it is the length of the shortest program in a fixed language that produces this string. The language choice only influences a constant difference because an interpreter could be written. When comparing different compressors in a benchmark it is common to include the decompressor size in the measurement because it could also hold data or generate it via computation.
We want to allow every string to be compressed, resulting in the same number of compressed strings, which also have to differ from each other in order to be decompressible. Another common approach to prove that there is no universal lossless compression is a counting argument which uses the pigeonhole principle.
Using this principle with ZPAQ and its pcomp section, the first million digits of π in form of the ~1 MB text file pi.txt from the Canterbury Miscellaneous Corpus can be compressed to a 114 byte ZPAQ archive which consists of no stored data and a postprocessing step which computes π to the given precision and outputs it as text. This extreme case of the earlier mentioned knowledge about the data can serve as a bridge between Kolmogorov complexity and adaptive arithmetic coding. For faster execution of the ZPAQ model we only take the first ten thousand digits of π. Normally the limit most compressors would stay above (because the digits are equally distributed) is

    10000 * 10 * (-0.1 * log2(0.1)) / 8 = 10000 * log2(10) / 8 ≈ 4152 bytes

instead of 10 KB.
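This limit is just the order-0 Shannon bound, which can be checked with a few lines of Python (an illustrative helper, not part of the compiler):

    import math
    from collections import Counter

    def order0_limit(data: bytes) -> float:
        """Order-0 Shannon limit in bytes for coding `data` with a fixed model."""
        counts = Counter(data)
        n = len(data)
        bits = -sum(c * math.log2(c / n) for c in counts.values())
        return bits / 8

    # e.g. 10000 uniformly distributed decimal digits:
    # 10000 * log2(10) / 8 is roughly 4152 bytes instead of 10000 bytes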
With a ZPAQ context model we can, instead of generating the digits in the pcomp phase, also use the next expected digit as context, so that the predictor will quickly learn that e.g. character 3 comes in context 3. But the prediction cannot be 100 % for one symbol as other symbols could occur and there has to be a probability greater than zero assigned to them. Also whether the end of the message is reached is encoded as a special symbol. So the range of the arithmetic encoder still gets narrowed when a perfectly predicted digit is encoded, but on such a small level that still only 121 bytes are needed for the ZPAQ archive consisting of the CM model configuration, the hcomp bytecode and the arithmetically coded ten thousand digits of π.
That shows that we can go much beyond the entropy limit down to Kolmogorov complexity by using context modeling and adaptive arithmetic coding. And still the context model is usable for all input data, in contrast to computing π in pcomp. The overhead of the fact that 100 % cannot be used when predicting seems to be linear in the message size and can be observed when compressing 50 or 200 MB of zeros with a CM, resulting in around 0.0000022 bytes per input byte.
(instead of generating the digits in the pcomp phase, use the next expected digit as context.
To compress: zpaqd cinst pi10k.cfg pi10k.zpaq pi10000.txt)
comp 0 14 0 0 1
0 cm 13 0
hcomp
ifnot (only first run)
(Compute pi to 10000 digits in M using the formula:
pi=4; for (d=r1*20/3;d>0;--d) pi=pi*d/(2*d+1)+2;
where r1 is the number of base 100 digits.
The precision is 1 bit per iteration so 20/3
(multiply M *= d, carry in c)
b=r 1 c=0
do
b--
http://corpus.canterbury.ac.nz/descriptions/#misc
http://mattmahoney.net/dc/pi.cfg
end
Even if this configuration is usable for other input than π, it does not give good compression. It can be merged with the general text model from Listing 2.1 to change between using the CM for order-1 contexts and for the expected-digit contexts every time the start of π is detected, until a mismatch is found. This way all occurrences of π are coded with only a few bits.
(mixed_pi2.cfg:
use the next expected digit as context for the CM or a general text model as in fast.cfg
0 cm 18 0
1 icm 16
2 isse 19 1 (order 4)
3 mix2 0 0 2 24 0 (moderate adapting mixer between CM and ISSE based on which predicts
better)
hcomp
r=a 2
a=r 0
a== 0 if (only first run)
(Compute pi to 10000 digits using the formula:
pi=4; for (d=r1*20/3;d>0;--d) pi=pi*d/(2*d+1)+2;
where r1 is the number of base 100 digits.
do
(multiply M *= d, carry in c)
b=r 1 c=0
do
b-a=*b a*=d a+=c c=a a%= 10 *b=a
do
a=c a*= 10 a+=*b c=a a/=d *b=a
a=c a%=d c=a
a=r 1 b++ a>b while
a=d a>>= 1 d=a
(add 2)
b=0 a= 2 a+=*b *b=a
d-- a=d a== 0 until
c= 2 (point to 4 of 3.14)
a= 1
r=a 0
a<<= 14 a-- (last element of ring buffer)
b=a
a-= 4 (first element of ring bufer , pointer in r3)
r=a 3
halt (input 0 came from pcomp , also to restart c=2 is enough)
endif
(CM part)
d=0
a=r 2
a-= 48
c-a==*c
c++
hash d= 1 *d=a
b-d=a (save hash) a=r 3 a>b if b++ b++ b++ b++ endif a=d
hash b-d=a (save hash) a=r 3 a>b if b++ b++ b++ b++ endif a=d
hash d= 2 *d=a
halt
end
For real-life use cases it is often not possible to give perfect predictions. Good contexts can help to bring order into the statistics about previous data. Besides the manual approach, heuristic context models can be generated for data by calculating the data's autocorrelation function [12] or, as done in PAQ, by recognizing two-dimensional strides for tabular or graphical data.
Even compressed JPEG photos can be further compressed by 10-30 % by predicting the Huffman-coded DCT coefficients when using the decoded values as contexts (done in PAQ7-8, Stuffit, PackJPG, WinZIP [7]). The ZPAQ configuration jpg_test2.cfg uses a preprocessor to expand Huffman codes to DCT coefficients and later uses them as contexts. The PackJPG approach continues to be developed by Dropbox under the name lepton and supports progressive besides baseline JPEGs.
Overall, modeling and prediction are an AI problem, because e.g. for a given sentence start a likely following word has to be provided, or for a picture with a missing area a prediction of how it is going to continue. Remarkable results have been accomplished by using PAQ8 as a machine learning tool, e.g. for building a game AI with it serving as a classifier, for interactive input text prediction, text classification, shape recognition and lossy image compression [13].
http://mattmahoney.net/dc/zpaqutil.html
https://github.com/dropbox/lepton
can just be fully exposed as data structures in the source language and also be used as arrays for other means with low abstraction costs. This keeps the structure of handwritten ZPAQL programs close to those in the source language. But in order to keep variables in a stack, R with its 256 elements is not enough, so expanding H seems to be a good solution. To model the repetitive execution of hcomp and pcomp they could be defined as functions in the source program (think main function) and it would also be possible to pass the input byte as argument, which also keeps the similarity to a handwritten ZPAQL source.
As the runtime code abstractions for providing the mentioned read-API are not too high, the similarity to original ZPAQL files is more a cosmetic design decision. And if a context-set-API which halts and continues execution through runtime code is present, then the hcomp and pcomp functions could be replaced by main functions which are entered only once and thus hide the fact that execution starts from the beginning for every input byte. Still, dynamic memory management on top of H and M seems to be costly, and thus departing too far from ZPAQL and adding more complicated data structures could hurt performance too much.
It would help if the source program were executable standalone, without being compiled to ZPAQL, to ease debugging by staying outside the ZPAQL VM as long as possible.
It would also be helpful if most complicated operations, like saving, restoring and other memory management, were already solved on the IR level before ZPAQL instructions are generated by the compiler.
Part of the template                                                                      Editable?
Definition of the ZPAQ configuration header data (memory size, context mixing
components) and optionally functions and variables used by both hcomp and pcomp          yes
API functions for input and output, initialization of memory                             no
hcomp function                                                                            yes
pcomp function                                                                            yes
Code for standalone execution of the Python file analog to running a ZPAQL
configuration                                                                             no
Prog
funcdef
Parameters          ( Typedargslist? )
Typedargslist
Tfpdef              NAME (: test)?
stmt                simple_stmt | compound_stmt
simple_stmt         small_stmt
expr_stmt
https://docs.python.org/3/reference/grammar.html
store_assign
augassign           += | -= | *= | @= | //= | /= | %= | &= | |= | ^= | <<= | >>= | **=
pass_stmt           pass
flow_stmt
break_stmt          break
continue_stmt       continue
return_stmt         return test
global_stmt
nonlocal_stmt
compound_stmt
if_stmt
while_stmt
suite
test                or_test
test_nocond         or_test
or_test             and_test
not_test
comparison
comp_op
expr                xor_expr (| xor_expr)*
xor_expr            and_expr (^ and_expr)*
and_expr            shift_expr
shift_op            << | >>
arith_expr
t_op                + | -
term
f_op                * | @ | / | % | //
factor
power
atom_expr           atom
dictorsetmaker      dictorsetmaker_t (, dictorsetmaker_t)* ,?
dictorsetmaker_t    test : test
arglist             test (, test)* ,?
The semantics of the language elements as described in the reference stay mostly the same, even if the usable feature set is reduced as stated before. In particular, one has to be aware of integer overflows, which are absent in Python but present in ZPAQL, and thus all computations are modulo 4294967296, i.e. 2^32. The exception are bit shift operations with a shift of more than 32 bits: in this case the Python-subset will do a shift by 32 bits. To achieve the semantics of (v << X) % 2**32 or (v >> X) % 2**32 with X > 31 the resulting value should directly be set to 0 instead. Also division by 0 and modulo 0 do not fail in ZPAQL but result in a value of 0.
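When prototyping in plain Python, these ZPAQL semantics can be mimicked with small helper functions. The following sketch is only an illustration of the rules stated above, not part of the generated runtime:

    MASK = (1 << 32) - 1

    def add(a, b):      # all ZPAQL arithmetic wraps modulo 2**32
        return (a + b) & MASK

    def shl(a, n):      # for shifts larger than 31 the result should be set to 0
        return 0 if n > 31 else (a << n) & MASK

    def div(a, b):      # division (and modulo) by zero yields 0 instead of failing
        return (a // b) & MASK if b != 0 else 0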
For a wrong Python input the current compiler might behave in a different way and even accept it without a failure. Therefore it is required that the input is a valid Python program which runs without exceptions.
This requirement is also important because the current compiler does not check array boundaries, so index%len(hH) or index&((1<<hh)-1) should be used e.g. for a ring buffer, because after the original size of H the stack follows. If run as a plain Python file, an exception is thrown anyway because Python checks array boundaries.
https://docs.python.org/3/reference/index.html
In general, instead of using len() for dynamically allocated arrays as well, special functions like len_hH() are used to visibly expose their types and to do runtime checks already in Python. NONE is a shortcut for 0 - 1 = 4294967295, i.e. 2^32 - 1.
Other functions                           Description
c = read_b()                              Read one input byte, might leave VM execution to get the next
                                          input byte before return
push_b(c)
c = peek_b()
out(c)
error()
aref = alloc_pH(asize),
aref = array_pH(intaddr),
len_pH(aref),
free_pH(aref), …

If backend implementations addr_alloc_pH(size), addr_free_pH(addr), … are defined then dynamic memory management is available through the API functions alloc_pM and free_pM. The cast
array_pH(numbervar) can be used to save a type check in ZPAQL at runtime. Also in plain Python the cast from an address is needed after an array reference was itself stored into an array and thus became an address number, and is then retrieved as a number again instead of a reference. In general, there are no boxed types, but by context a variable is used as an address.
The last addressable starting point for any list is 2147483647 == (1<<31) - 1 because the compiler uses the 32nd bit to distinguish between pointers to H and M.
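A sketch of this tagging scheme in plain Python could look as follows; the helper names are hypothetical and the actual encoding in the compiler may differ in details:

    M_TAG = 1 << 31                      # the 32nd bit marks pointers into M

    def make_ref_H(addr):                # plain addresses (< 2**31) point into H
        return addr

    def make_ref_M(addr):
        return addr | M_TAG

    def deref(ref, H, M):
        if ref & M_TAG:
            return M[ref & (M_TAG - 1)]  # strip the tag bit to get the M address
        return H[ref]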
The provided implementations of addr_alloc_pM, addr_free_pM, … can be found in the template (run ./zpaqlpy --emit-template or see src/template.rs). The returned pointer is expected to point at the first element of the array. The entry before the first element is used to store whether this memory section is free or not. Before that the length of the array is stored, i.e. in H[aref - 2] for arrays in H and in the four bytes M[aref - 5]…M[aref - 2] holding the 32-bit length for arrays in M.
Beside these constraints the implementations are free in how they find a free region. The example uses getter and setter functions for the 32-bit length value stored as four bytes in M. For allocation it skips over the blocks from the beginning until a sufficiently sized block is found. If this block is bigger, then the rest of it is kept free and might be merged with the next block if that one is also free. The same happens when a block is freed again; then it is even merged with the previous block if that is free.
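The following Python sketch shows such a first-fit scan over blocks with a length field and a free flag in front of each array, simulated on a plain list standing in for H. It omits the block splitting and merging described above, and the exact header layout, flag values and function names of the real template may differ:

    # Stand-ins for addr_alloc_* / addr_free_*: header layout assumed as
    # [length][in-use flag][data...], so H[aref - 2] is the length.
    def first_fit_alloc(H, heap_start, size):
        a = heap_start
        while a + 2 + size <= len(H):
            length, in_use = H[a], H[a + 1]
            if length == 0:                      # untouched memory: claim it
                H[a], H[a + 1] = size, 1
                return a + 2
            if not in_use and length >= size:    # reuse a big enough free block
                H[a + 1] = 1
                return a + 2
            a += 2 + length                      # skip over this block
        return 0                                 # out of memory

    def mark_free(H, aref):
        H[aref - 1] = 0                          # mark the block as free again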
external calls. It is assumed that the input is UTF-8 without a BOM (Byte Order Mark).
The parser is built with a parser generator out of the grammar specification. It constructs the AST with each production. The parser library lalrpop was chosen in LALR(1) recursive ascent mode for that purpose. The elements of the produced AST are based on the abstract grammar of the Python module ast but simplified (see src/ast.rs) and give a structured representation of the source program. In fact it is not a tree but a list of statements. They can be one of FunctionDef which holds the function body as a list of statements, Return, Assign and AugAssign which hold the expressions that are concerned, While and If which hold their body and else-part as a list of statements and the test as an expression, Global, Pass, Break, Continue and Expr which encapsulates an expression. These expressions can be one of BoolOpE to evaluate AND and OR
https://github.com/nikomatsakis/lalrpop
https://docs.python.org/3.5/library/ast.html#abstract-grammar
over expressions, BinOp for the evaluation of an arithmetic expression over two expressions, UnaryOpE for a unary operation, Compare for comparisons over expressions, Call for a function call including arguments as expressions, Num for an integer, NameConstant for True and False, Name for a variable or Subscript for index access to an array variable. The precedence of and and or is resolved during parsing into a binary tree of BoolOpE elements. Contrary to that, one has to be aware that with Compare the semantics of a == b == c and (a == b) == c differ: the middle operand is split up and passed to the next comparison, and the results are evaluated and merged with and until the result cannot be True anymore, so this was better left to the IR generator.
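A small plain-Python example of the difference:

    # A chained comparison is equivalent to joining the single comparisons with
    # `and` (the middle operand is reused), which differs from the parenthesized form:
    a, b, c = 2, 2, 1
    chained = a == b == c                    # False: (a == b) and (b == c)
    paren = (a == b) == c                    # True: (a == b) is True, and True == 1
    assert chained == ((a == b) and (b == c))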
The ZPAQL VM offers the registers A, B, C, D, the 256 registers R0-R255 and the arrays H and M, while the IR gives only control over R, H and M and no other registers. There are no new data structures added. So the other registers can be used for address computation and temporary calculations when the IR is converted to ZPAQL. That means the temporary variables t0-t255 of the IR are a direct mapping to the registers R0-R255. Because the input byte is in A at the beginning of hcomp/pcomp execution, the IR relies on the guarantee that t255 = A before the first instruction.
stmt
var            t | H[ t ] | H[t0+ x ] | H[t252+ x ] | H[ x ] | M[ t ] | M[ x ] | x
op             + | - | * | / | // | % | ** | << | >> | | | ^ | & | or | and | == | != | < | <= | > | >=
uop            ! | ~ | -
t              t0 | … | t255
x              0 | … | 4294967295
label          [a-z_0-9~A-Z]+
comment        # … \n
The var on the left side of an assignment cannot be a number. The operators or and and differ from the binary versions | and & as they represent the semantics of Python or and and, i.e. they evaluate to the original value and not simply to the boolean choices of 1 and 0 for True and False, while the binary operators use bitwise AND and OR. The operator !v tests against v == 0 while ~v inverts the bits.
The choices made might not be the best and a totally different IR is possible, which could introduce variables or stack operations. For efficiency it might be interesting to go in the direction of static single assignment with linear scan register allocation or an algorithm with graph-coloring register allocation. Currently local Python variables are on the stack and temporary variables are in R because together they could be more than 256 and anyway local variables need to be stored on the stack before a call. But they could be joined in a common pool and the current limit of at most 256 temporary variables in R could be widened by using elements of H as needed. Maybe even LLVM as a very popular IR could be used, with its many optimization passes available.
and returns to the caller with the newly acquired byte when the bytecode is run again.
The original size 2^hh of H is extended to hold the stack.
There are no IR instructions for calling a function or for stack operations on H, but there are helping meta instructions. These will be converted to simple IR instructions and have been defined for handling blocks, saving and loading variables on the stack in H, calls and returns that comply with a calling convention, predefined IR code sections for the runtime API and the jump table to continue execution after a return. The initial IR code is responsible for either calling the Python functions hcomp(c)/pcomp(c) with the new input byte or continuing execution if it was interrupted through a read_b() API call. Also it sets the base pointer for the first run and defines the API function read_b.
The function traverse() in src/gen_ir.rs generates IR instructions for the given (part of an) AST. It is used recursively and consults the symbol table for free temporary variables, the position of local variables and the mapping of global variables. Also it uses the recursive function evaluate() which takes only an expression part of an AST and returns IR instructions and the IR variable that holds the value after these instructions have been executed. Returning the instructions makes testing easier. If pcomp(c)/hcomp(c) only contain a pass then the whole pcomp/hcomp section is omitted.
The temporary variables have to be saved on the stack before a call. Also the current base pointer and then the return ID for the jump table need to be saved there as part of the calling convention. The new base pointer in t0 points at the return ID. Arguments passed come afterwards and the called function will address them via H[t0 + x]. On return the previous base pointer will be restored to t0 and the return ID is copied to t2 for the jump table while the return value is in t1, before the jump to the code for the jump table is done in order to return to after the call instruction.
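As an illustration, one stack frame in H could be pictured like this (a sketch based on the convention described above, not an exact memory dump):

    ...                                      (stack grows towards higher addresses)
    saved temporary variables of the caller
    saved base pointer of the caller
    return ID for the jump table             <- new base pointer t0 points here
    argument 1  (= H[t0 + 1])
    argument 2  (= H[t0 + 2])
    ...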
It should be possible to treat local variables like temporary variables to reduce code size and stack usage via a compiler flag. Also the calling convention could be changed to avoid the stack, and globals could be stored in tX.
For now the compiler passes IR code around as vectors (lists), but for a more idiomatic Rust style iterators could be used.
The meta instructions MarkTempVarStart and MarkTempVarEnd are inserted around each function scope. Saving and restoring temporary variables is done through the macros StoreTempVars{identifiers} and LoadTempVars{identifiers}. The optimization pass goes through the IR instructions in reverse order to reason about the lifetimes and removes from the load and store macro instructions those identifiers which are not live, i.e. not referred to after the load.
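A much simplified Python sketch of this reverse scan (it only tracks uses and ignores that other instructions may also define temporaries; the instruction representation is assumed, not the compiler's own types):

    # Each instruction is a tuple (opcode, list_of_temporaries); StoreTempVars and
    # LoadTempVars carry the identifiers to save or restore.
    def strip_dead_saves(instructions):
        live, out = set(), []
        for ins in reversed(instructions):
            if ins[0] in ("StoreTempVars", "LoadTempVars"):
                out.append((ins[0], [t for t in ins[1] if t in live]))
            else:
                live |= set(ins[1])          # temporaries used later stay live
                out.append(ins)
        out.reverse()
        return out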
Compiler books have much more to offer and there are many classical optimizations which could be looked at for further improvements. Caching could be used and constant expressions evaluated during IR generation. This is currently just done for a row of assignments on allocated arrays in M.
taken. The function emit_zpaql() in src/gen_zpaql.rs takes IR code and yields ZPAQL code by calling assign_var_to_a(var) for the operands (if needed saving one value away from A first) and applying the operator on them. The result is moved from A to the target by calling assign_a_to_var(var). In order to load a number into A, calc_number(value) is used, and for values greater than 255 multiple bit shifts are needed. All of these helper functions emit ZPAQL code which is combined as the result of each function, in a similar way to the IR generation.
Then a more advanced solution was chosen with pattern matching to avoid the ubiquitous transfer over A and to detect an augmented assignment like +=1 and simply increase the value at its location. Therefore the helper function gen_loc_for_var() was introduced to get the location handle, and the other helpers were widened to operate with the notion of a location, meaning registers or memory positions accessed through pointers instead of only A. Tracking recent values of variables in the registers or memory locations for reuse was also helpful in order to produce less code output. Yet it is now a more delicate issue to make changes, because the cache entries concerning the location and the variable need, at least, to be invalidated for a correct output.
After the code generation a special optimization takes place to simplify byte assignments on arrays in M. They have been produced in the IR generation if a row of successive assignments, like in an initialization, was detected. Thus the pointer variable is always just increased and not even saved back until all assignments are finished.
Resolving labels to positions for the jump destinations is done as the last step before the ZPAQL assembly code is written out. The opcode size of each instruction is known and thus the label positions can be held in a hash table. A second pass through the code can then replace the virtual GoTo placeholder instruction with the real long jump.
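A simplified sketch of the two passes over a toy instruction representation (tuples instead of the compiler's own types; the 3-byte size of LJ is taken from the instruction table in chapter 2):

    # Entries are ("label", name), ("goto", name) or ("op", size_in_bytes, text).
    LJ_SIZE = 3                                   # an LJ long jump is a 3-byte opcode

    def resolve_labels(code):
        positions, pc = {}, 0
        for ins in code:                          # pass 1: collect label positions
            if ins[0] == "label":
                positions[ins[1]] = pc
            else:
                pc += LJ_SIZE if ins[0] == "goto" else ins[1]
        out = []
        for ins in code:                          # pass 2: patch GoTo placeholders
            if ins[0] == "goto":
                out.append(("op", LJ_SIZE, "lj %d" % positions[ins[1]]))
            elif ins[0] != "label":
                out.append(ins)
        return out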
7.7 Debugging
It is important to run and test the plain Python version, which contains assertions, before trying to compile it to ZPAQL. For the Python runtime a compare option was introduced to test the correctness of the pcomp code, which should restore the preprocessor input. Each output byte is compared to the expected output and a debugger shell is spawned if a mismatch occurs.
Because compiler optimizations are likely to introduce bugs there is a small test suite in the Makefile. It compares the output of the Python code for pcomp and hcomp to the output of running the compiled ZPAQL code. Mostly zpaqd r CFG p is used, but also a ZPAQ VM implementation in Rust included in the compiler (option --run-hcomp), because zpaqd r CFG h does not print H[0], …, H[n-1].
http://netpbm.sourceforge.net/doc/ppm.html
http://www.modejong.com/blog/post15_zpaql_grayscale/index.html
https://developers.google.com/speed/webp/docs/webp_lossless_bitstream_specification#subtract_green_transform
To measure the compression ratio four image files were converted to PNM in test/ and compressed (the 512 pixel high peppers.pnm, monarch.pnm and kodim23.pnm and the 1020 pixel high rafale.pnm). FLIF is a new lossless format.
(Figure: the test images (a) kodim23, (b) peppers, (c) monarch, (d) rafale)
Method                                                                          Size (bytes)
ZPAQ archive with bmp_j4c.cfg (in BMP format, uses special color transform)    1785405
…                                                                               1908258
FLIF                                                                            1914297
WebP                                                                            2256458
PNG                                                                             2938153
For now it does not use e.g. delta coding or other predictors besides the average of neighbor values. Also it could provide another predicted color value which is estimated from the recent development. Beside that, the color transform should depend on the image, or a way needs to be found to move its advantage out of the preprocessor into the predictors. Like the mentioned grayscale model it uses an additional context for the mixer depending on image noise.
By using Python the model can easily be improved, and writing it in ZPAQL would have been a bigger and more bug-prone effort. The bytecode overhead as of writing is around 3 KB.
For the large 53 MB photo canon24.pnm it would reach the 6th place with 21,326,964 bytes (89 seconds for (de)compression) in the benchmark for lossless picture compression published in a paper about BWT image compression [15]. For kodim23.pnm it would reach the 4th place with an archive size of 369,519 bytes. An overall benchmark for a huge number of files like in the linked benchmarks has not been done up to now.
http://www.maximumcompression.com/data/bmp.php
http://flif.info/
https://web.archive.org/web/20140702040431/http://www.squeezechart.com/canon24.pnm
http://www.squeezechart.com/bitmap.html
http://imagecompression.info/gralic/LPCB.html
The overhead is less significant for bigger archives, and trying to reduce it by applying context mixing to the whole block might not always improve the result because the Brotli data could be expanded in coding size. For the four sample PNM files it saves 133 KB, which is almost the size of the dictionary and bytecode (from 3615 KB with no arithmetic coding down to 3482 KB in total). For the enwik8 benchmark (100 MB Wikipedia dump), however, this would not even make sense because mfast.cfg alone performs better by reaching 24720 KB instead of the 30303 KB reached when used together with the Brotli data and the dictionary/bytecode overhead. The same Brotli data without arithmetic coding through a context model can be stored in around 30416 KB; depending on the Brotli compressor options it might go down to around 26000 KB.
Decompression is much slower than with the C or Rust Brotli implementations. The ZPAQL implementation needs 118 seconds for enwik8 instead of 0.5 or 2.1 seconds for the reference C implementation and the one of Mark Adler. The memory management is not very efficient, especially dealing with byte arrays in M is expensive, and the compiler could sometimes produce better code by avoiding the intermediate variables. Beside that, ZPAQL is JIT-compiled by libzpaq in a rather simple way, and so an advanced JIT backend on
10 https://github.com/google/brotli
11 https://github.com/ende76/brotli-rs
12 Used for the Hutter Prize compression challenge, comparison here: http://mattmahoney.net/dc/text.html
13 https://github.com/madler/brotli
9 Evaluation
What follows in this chapter is a review of the outcome of the work on the compiler so far, as well as an attempt to provide one possible answer to the question whether or how well ZPAQ is suited to be a general standard for data compression.
9.3 Analysis of the generated Code and Comparison with handwritten Code of LZ1
Runtime Benchmarks of the LZ1 Configuration in ZPAQL and Python for the four PNMs

ZPAQ config                   Time (s)    Archive size (bytes)    Time (s)
zpaqlpy lz1.cfg               1.6         3350016                 1.85
handwritten lz1.orig.cfg      1.3         3349062                 1.53

# Python code    # IR code
The compiled file starts with the runtime code which decides whether it is the first run. For the first run it sets the base pointer and executes the global definitions before calling the pcomp/hcomp function with the input. If initialization has already taken place, the code decides whether a read was interrupted to get a new input byte or whether just the pcomp/hcomp function should be called again with the new input. As stated in the previous chapters, all Python variables are on the stack and thus it takes more opcodes to do calculations on them. The code does not use custom arrays but instead uses M as history buffer, like the original code. Every time the function returns, the jump table is used to jump behind the call and then the execution halts.
It is an advantage that changes can be made more quickly in Python and that the code is more comprehensible. But the compiled code uses a stack while that is not really necessary, and instead of using the jump table for the return it could just use the halt instruction. So for simple programs like this it would be desirable if the compiler offered a mode without a stack, holding the variables in R.
10 Conclusion
It was shown that it is possible to have a compiler for ZPAQL which is also helpful for the development of new algorithms. To write the compiler, large parts of the Python grammar have been ported to the format of the LALR(1) parser library lalrpop. The core functionality of the Python module tokenize has been ported to Rust. As a goal for a source input which should be supported, a Rust implementation of Brotli decompression has been ported to Python. A compilation scheme from a Python-subset to ZPAQL including an IR has been designed and implemented. The compiler is platform independent as long as a Rust compiler is available. It features an implementation of the ZPAQ VM to print the calculated context data for each input byte, which can then be compared with the calculated values of the Python source to expose ZPAQL-specific semantics or even compiler bugs.
As a result it can be said that more and different approaches should be tried to reach an acceptable performance for a large and complex code base. The ZPAQ specification at its current point does not offer various features which might be considered for future improvements, from additional instructions up to ways of being even more flexible about how prediction takes place. But most important would be a variable jump instruction.
Compilation can be configured with command line arguments to some extent. The pcomp or hcomp part can be disabled, so no changes to the input file are needed to vary between using only context mixing, only a preprocessor or both. Documentation can be printed via arguments as well. Four example source files are provided: a small context mixing compression for run-length encoded data (see the appendix tutorial), the LZ1 port and the PNM model which already showed good results with little effort. An extreme case is the Brotli algorithm which needed many compiler optimizations to fit under the 64 KB bytecode limit and utilizes dynamic memory allocation.
While the compiler was developed two bugs in ZPAQ tools were found and are already resolved through
two new releases. One was a simple crash in zpaqd and the more serious one a wrong instruction in the
x86 JIT code of libzpaq which caused a miscomputation in the Brotli decompressor.
For most results there have been measurements which can be repeated because all needed tools are published
as free/libre software.
While PAQ and its internals were covered and modified in publications, ZPAQ was often only used as just another compressor instead of as a platform for compression algorithms. This work exposed the crucial part of ZPAQ, which is the embedding of the two bytecodes for hcomp and pcomp in the archive.
Bibliography
[1] J. Cleary and I. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402, Apr 1984. ISSN 0090-6778. doi: 10.1109/TCOM.1984.1096090. http://dx.doi.org/10.1109/TCOM.1984.1096090.
[2] G. V. Cormack and R. N. S. Horspool. Data compression using dynamic markov modelling. Comput. J., 30(6):541–550, December 1987. ISSN 0010-4620. doi: 10.1093/comjnl/30.6.541. http://dx.doi.org/10.1093/comjnl/30.6.541.
[3] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653–664, May 1995. ISSN 0018-9448. doi: 10.1109/18.382012. http://dx.doi.org/10.1109/18.382012.
[4] Matthew V. Mahoney. Fast text compression with neural networks. In James N. Etheredge and Bill Z. Manaris, editors, Proceedings of the Thirteenth International Florida Artificial Intelligence Research Society Conference, May 22-24, 2000, Orlando, Florida, USA, pages 230–234. AAAI Press, 2000. ISBN 1-57735-113-4. http://www.aaai.org/Library/FLAIRS/2000/flairs00-044.php.
[5] Matthew V. Mahoney. The PAQ1 data compression program. 2002. https://cs.fit.edu/~mmahoney/compression/paq1.pdf.
[6] Matthew V. Mahoney. Adaptive weighing of context models for lossless data compression, Florida Tech. Technical Report CS-2005-16. 2005. https://cs.fit.edu/~mmahoney/compression/cs200516.pdf.
[8] Matthew V. Mahoney. The ZPAQ open standard format for highly compressed data - Level 2. 2016. http://mattmahoney.net/dc/zpaq206.pdf.
[10] James K. Bonfield and Matthew V. Mahoney. Compression of FASTQ and SAM format sequencing data. PLoS ONE, 8(3):1–10, 03 2013. doi: 10.1371/journal.pone.0059190. http://dx.doi.org/10.1371%2Fjournal.pone.0059190.
[11] Jyrki Alakuijala and Zoltan Szabadka. Brotli Compressed Data Format. RFC 7932, July 2016. https://rfc-editor.org/rfc/rfc7932.txt.
[12] John Scoville. Fast autocorrelated context models for data compression. CoRR, abs/1305.5486, 2013. http://arxiv.org/abs/1305.5486.
[13] Byron Knoll and Nando de Freitas. A machine learning perspective on predictive coding with PAQ. CoRR, abs/1108.3298, 2011. http://arxiv.org/abs/1108.3298.
[14] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison Wesley, 2006. ISBN 978-0321486813.
[15] Aftab Khan and Ashfaq Khan. Lossless colour image compression using RCT for bi-level BWCA. Signal, Image and Video Processing, 10(3):601–607, 2016. ISSN 1863-1711. doi: 10.1007/s11760-015-0783-3. http://dx.doi.org/10.1007/s11760-015-0783-3.
A Tutorial
A context mixing model with a preprocessor for run-length encoding is written. Three components are used to form the network. Create a new template which will then be modified at the beginning and in the pcomp/hcomp sections:
$ ./zpaqlpy --emit-template > rle_model.py
$ chmod +x rle_model.py
First the size of the arrays H and M for each section, hcomp and pcomp, needs to be specified:
hh = 2  # i.e. size is 2**2 = 4, because hH[0], …, hH[2] are the inputs for the components
One component should give predictions based on the byte value and the second component based on the run
length, both give predictions for the next count and the next value. Then the context-mixing components
are combined to a network:
n = len({
    0: "cm 19 22",
    1: "cm 19 22",
    2: "mix2 1 0 1 30 0",
    # will mix 0 and 1 together, context table size 2**1 with AND-0 masking of the
    # partly decoded byte which is added to the context, learning rate 30
})
Each component gets its context input from its entry in hH after each run of the hcomp function, which is called for each input byte of the preprocessed data. That data either is to be stored through arithmetic coding in the compression phase or is retrieved through decoding in the decompression phase, with the following postprocessing done by calls of the pcomp function.
The context-mixing network is written to the archive in byte representation, as well as the bytecode for hcomp and pcomp (if they are used). The preprocessor command is needed when the compiled file is used
with zpaqd if a pcomp section is present. As the preprocessor might be any external program or may be included in the compressing archiver, and is of no use for decompression, it is not mentioned in the archive at all. This way we specify a preprocessor:
pcomp_invocation = "./simple_rle"
$ chmod +x simple_rle # create the preprocessor as executable file and fill it as follows
#!/usr/bin/env python3
import sys
input = sys.argv[1]
output = sys.argv[2]
with open(input, mode="rb") as fi:
    with open(output, mode="wb") as fo:
        last = None
        count = 0
        data = []
        for a in fi.read():
            if a != last or count == 255:
                if last != None:
                    data.append(last)
                    data.append(count)
                last = a
                count = 1  # start counting
            else:
                count += 1  # continue counting
        if last != None:
            data.append(last)
            data.append(count)
        fo.write(bytes(data))
def pcomp(c):
    global case_loading, last
    if c == NONE:
        case_loading = False
        last = NONE
        return
    if not case_loading:
        # c is byte to load
        case_loading = True
        last = c
    else:
        case_loading = False
        while c > 0:
            c -= 1
            out(last)
We can already try it, even if hcomp does not compute the context data yet (so compression is not good):
$ ./zpaqlpy rle_model.py
$ ./zpaqd c rle_model.cfg archive.zpaq FILE FILE FILE
last_value = 0
last_counter = 0
def hcomp(c):
last_counter = c
else:
last_value = c
# first part of the context for the first CM is the byte replicated and
# the second is whether we are at a counter (then we predict for a byte) or vice versa
# again shift to side because of the xor with the partially decoded byte
hH[2] = at_counter + 0
We need to compile again before we run the final ZPAQ configuration file:
$ ./zpaqlpy rle_model.py
$ ./zpaqd c rle_model.cfg archive.zpaq FILE FILE FILE
B Visualizations
B.1 Arithmetic Coding
Figure B.6: The parts from source reading to writing a config file