
The power of Algorithms!

Lectures on algorithmic pearls

Paolo Ferragina, Università di Pisa

These notes have been taken by the master students of the course on “Algorithm Engineering” within the Laurea Magistrale in Informatica and Networking.

P.F.
Contents

1  From BWT to bzip


1
From BWT to bzip

Alessandro Adamou (adamou AT cs unibo it)
Paolo Parisen Toldin (paolo.parisentoldin AT gmail com)

1.1  The Burrows-Wheeler transform
     Forward Transform • Backward Transform
1.2  Two simple compressors: MTF and RLE
     Move-To-Front Transform • Run Length Encoding
1.3  Implementation
     Construction with suffix arrays
1.4  Theoretical results and compression boosting
     Entropy
1.5  Some experimental tests

The following notes describe a lossless text compression technique devised by Michael Burrows and David Wheeler in 1994 at the DEC Systems Research Center. This technique, which combines simple compression algorithms with an input transformation algorithm (the so-called Burrows-Wheeler Transform, or BWT), offered a revolutionary alternative to dictionary-based approaches and is currently employed in widely used compressors such as bzip2.

These notes are structured as follows: we begin by describing the algorithm, both the forward transform and the backward transform, in Section 1.1. As we will show, the BWT alone is not a compression algorithm, so two simple compressors, Move-To-Front and Run-Length Encoding, are introduced in Section 1.2. How these algorithms can be combined to boost textual compression is discussed with examples in Section 1.3. Section 1.4 introduces the notion of entropy and discusses the performance of the BWT with respect to this notion. The notes conclude with a description of experiments comparing the implementation of the BWT in the bzip2 compressor with several implementations of dictionary-based Lempel-Ziv algorithm variants, along with a discussion of the findings.

1.1 The Burrows-Wheeler transform


The Burrows-Wheeler Transform (BWT) [4] is not a compression algorithm per se,
as it does not tamper with the size, or the encoding, of its input. What the BWT does
perform is what we call block-sorting, in that the algorithm processes its input data by
reordering them block-wise (i.e. the input is divided into blocks before being processed)
in a convenient manner. What a “convenient manner” is, in the case of the BWT, will be
shown later, but for the time being, suffice it to say that the resulting output lends itself
to greater and more effective compression by simple algorithms. Two such algorithms are Move-To-Front and Run-Length Encoding, the former optionally followed by the latter, both described in Section 1.2.


The Burrows-Wheeler Transform consists of a pair of inverse operations: a forward transform rearranges the input text so that the resulting form exhibits a degree of “locality” in the symbol ordering; a backward transform reconstructs the original input text, and is hence the inverse of the forward transform.

1.1.1 Forward Transform


Let s = s s ...sn be an input string on n characters drawn from an alphabet Σ, with a total
ordering defined on the symbols in Σ.
Given s, the forward transform proceeds as follows:

1. Let s$ be the string resulting from the concatenation of s with a single occurrence of symbol $, where $ ∉ Σ and $ is smaller than any other symbol in the alphabet, according to the total ordering defined on Σ.¹
2. Consider the matrix M of size (n + 1) × (n + 1), whose rows contain all the cyclic left shifts of string s$. M is also called the rotation matrix of s.
3. Let M′ be the matrix obtained by sorting the rows of M left-to-right according to the ordering defined on alphabet Σ ∪ {$}. Recall that $ is smaller than any other symbol in Σ and, by construction, appears only once; therefore the sort operation always moves the last row of M to position 0.
4. Let F and L be the first and last columns of M′ respectively. Let l̂ be the string obtained by reading column L top-to-bottom, skipping character $. Let r be the index of $ in L.
5. Take bw(s) = (l̂, r) as the output of the algorithm.

Note that M is a conceptual matrix, in that there is no actual need to build it entirely:
only the first and last columns are required for applying the forward transform. Section
1.3.1 will show an example of BWT application that does not require the rotation matrix
to be actually constructed in full.
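
To make the procedure concrete, here is a minimal C sketch of the forward transform: it sorts the shift amounts of the rows of M rather than materializing the matrix itself (function and variable names such as bwt_forward are our own):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *rot_base;   /* the text s$, shared with the comparator */
static size_t rot_n;           /* length of s$ */

/* Compares two cyclic left shifts of s$ lexicographically. */
static int cmp_rot(const void *a, const void *b) {
    size_t i = *(const size_t *)a, j = *(const size_t *)b;
    for (size_t k = 0; k < rot_n; k++) {
        char ci = rot_base[(i + k) % rot_n];
        char cj = rot_base[(j + k) % rot_n];
        if (ci != cj) return (unsigned char)ci - (unsigned char)cj;
    }
    return 0;
}

/* Computes bw(s): l_hat receives the last column L of M' without the
   sentinel, and the returned value is r, the index of $ in L. */
size_t bwt_forward(const char *s, char *l_hat) {
    rot_n = strlen(s) + 1;                    /* +1 for the sentinel */
    char *t = malloc(rot_n + 1);
    sprintf(t, "%s$", s);                     /* build s$; ASCII '$' is smaller than letters */
    rot_base = t;

    size_t *rot = malloc(rot_n * sizeof *rot);
    for (size_t i = 0; i < rot_n; i++) rot[i] = i;  /* row i of M = shift by i */
    qsort(rot, rot_n, sizeof *rot, cmp_rot);        /* sorting rows: M -> M' */

    size_t r = 0, out = 0;
    for (size_t i = 0; i < rot_n; i++) {
        /* the last character of row i of M' precedes the start of the shift */
        char c = t[(rot[i] + rot_n - 1) % rot_n];
        if (c == '$') r = i;                  /* remember where the sentinel lands */
        else l_hat[out++] = c;
    }
    l_hat[out] = '\0';
    free(rot); free(t);
    return r;
}

int main(void) {
    char out[64];
    size_t r = bwt_forward("abracadabra", out);
    printf("bw(s) = (%s, %zu)\n", out, r);    /* prints (ardrcaaaabb, 3) */
    return 0;
}

Sorting integer shift amounts keeps memory linear, although each comparison may still probe up to n + 1 characters; Section 1.3.1 shows how suffix arrays make the construction practical.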

Remark
An alternative formulation of the algorithm, less frequent yet still present in the literature [5], constructs matrix M′ by sorting the rows of M right-to-left (i.e. starting from the character at index n) in step 3. Then, in step 4, it takes string f̂ instead of l̂, where f̂ is obtained by reading column F top-to-bottom skipping character $, and sets r to the index of $ in F instead of L. The output is then bw(s) = (f̂, r). This formulation is dual to the one given above: although its output differs from the one generated by left-to-right sorting, it exhibits the same properties of reversibility and compressibility, to be illustrated below.

Example 1.1

We will now show an example of the Burrows-Wheeler Transform applied to an input block which, historically, is known to transform into a highly compressible string.

¹ The step that concatenates character $ to the initial string was not part of the original version of the algorithm as described by Burrows and Wheeler. It is introduced here with the intent to simplify the presentation. Due to its role, $ is sometimes called a sentinel character.

Let us have s = abracadabra. Figure 1.1 shows, on its left side, how matrix M is constructed for the given input. Starting from row 0, which reads abracadabra$, i.e. the original string concatenated with $, each subsequent row contains the same string cyclically left-shifted by one.

FIGURE 1.1: Example of the forward Burrows-Wheeler transform applied to the input block s = abracadabra, resulting in the transformed string ardrcaaaabb. The cyclic shift ending with $ is placed at index r = 3 in the sorted matrix M′.

The rows of M are then sorted lexicographically from left to right, thus obtaining matrix M′. Because the last row of M (row 11, which reads $abracadabra) is the only one to begin with $, which is the lowest-ordered symbol in the (extended) alphabet, it becomes the first row of M′. The next five rows are the ones beginning with a (in the given order: a$abracadabr, abra$abracad, abracadabra$, acadabra$abr and adabra$abrac), then the two shifts beginning with b (i.e. bra$abracada and bracadabra$a), and so on. The order in which rows starting with the same character appear in M′ has an extremely important property, which will be shown when discussing the backward transform in the next section.

The sorted matrix M′ is shown on the right side of Figure 1.1. If we read the first column F of M′, we obtain the string $aaaaabbcdrr, which is itself sorted lexicographically, while the last column L of M′ reads ard$rcaaaabb. We therefore obtain l̂ by excluding the single occurrence of $, thus having l̂ = ardrcaaaabb. Since $ appears as the fourth element of L, which is of length n + 1 = 12, we take r = 3. We finally obtain bw(s) = (ardrcaaaabb, 3).

Notice how both occurrences of b end up juxtaposed in the transformed string, as are four of the five occurrences of a. The following sections will show how this layout favors text compression.

1.1.2 Backward Transform



We can observe, both by construction and from the example provided, that each column of the sorted cyclic-shift matrix M′ contains a permutation of s$. In particular, its first column F, being fully sorted, would be the most compressible transformation of the original input block. However, it would be impossible to reconstruct the original string from F, even knowing the distribution of its alphabet Σ. The Burrows-Wheeler transform represents a trade-off: it retains the properties that make the transformed string reversible, while remaining highly compressible.
In order to prove these properties more formally, let us define a useful function that tells us how to locate in M′ the predecessor of a character at a given index in s.

DEFINITION 1.1 For 1 ≤ i ≤ n, let s[k_i, n − 1] denote the suffix of s prefixing row i of M′. Let Ψ(i) be the index of the row prefixed by s[k_i + 1, n − 1].

The function Ψ(i) is not defined for i = 0, because row 0 of M′ begins with $, which does not occur in s at all, so that row cannot be prefixed by a suffix of s.
For example, in Figure 1.1 we have Ψ(2) = 6. Row 2 of M′ is prefixed by abra, followed by $. We look for the single row of M′ that is prefixed by the immediately shorter suffix of s (bra), and find that row 6 (which reads bra$abracada) is the one. Similarly, Ψ(11) = 4, since row 11 of M′ is prefixed by racadabra and row 4 is the one prefixed by acadabra.

Property 1
For each i, L[i] is the immediate predecessor of F[i] in s. Formally, for 1 ≤ i ≤ n, L[Ψ(i)] = F[i].
Proof. Since each row of M′ contains a cyclic shift of s$, the last character of the row prefixed by s[k_i + 1, n − 1] is s[k_i]. Because the index of that row is what we defined to be Ψ(i) in Definition 1.1, this implies the claim L[Ψ(i)] = s[k_i] = F[i].

Intuitively, this property follows from the fact that every row of M and M′ is a cyclic shift of the original string concatenated with $: taking the two extremes of any row, the symbol at the right extreme is immediately followed (in s$) by the symbol at the left extreme. For the index j such that L[j] = $, F[j] is the first character of s.

Property 2
All the occurrences of the same symbol in L maintain the same relative order as in F. Using function Ψ: if 1 ≤ i < j ≤ n and F[i] = F[j], then Ψ(i) < Ψ(j). This means that the first occurrence of a symbol in L maps to the first occurrence of that symbol in F, the second occurrence in L maps to the second occurrence in F, and so on, regardless of what other symbols separate two occurrences.
Proof. Given two strings s and t, we shall use the notation s ≺ t to indicate that s lexicographically precedes t.
Let s[k_i, n − 1] (resp. s[k_j, n − 1]) denote the suffix of s prefixing row i (resp. row j). The hypothesis i < j implies that s[k_i, n − 1] ≺ s[k_j, n − 1]. The hypothesis F[i] = F[j] implies s[k_i] = s[k_j], hence it must be that s[k_i + 1, n − 1] ≺ s[k_j + 1, n − 1]. The thesis follows since, by construction, Ψ(i) (resp. Ψ(j)) is the lexicographic position of the row prefixed by s[k_i + 1, n − 1] (resp. s[k_j + 1, n − 1]).

In order to re-obtain s, we need to map each symbol of the L column of M′ to its corresponding occurrence in the sorted column F. This way, we are able to recover the permutation of rows that generated matrix M′ from M. This operation is what we call LF-mapping, and it is essentially analogous to determining the values of function Ψ for matrix M′. The process is rather straightforward for symbols that occur only once, as is the case of $, c and d in abracadabra$. However, when it comes to symbols a, b and r, which occur several times in the string, the problem is no longer trivial. It can however be solved thanks to the properties we proved [7], which hold for the sorted matrix M′.
Let us now illustrate a simple algorithm for performing the LF-mapping, which will be the starting point for the actual backward transform algorithm shown later in this section. The LF-mapping is stored in an array LF with |LF| = n + 1, one entry per symbol of L.
The algorithm uses an auxiliary vector C, with |C| = |Σ ∪ {$}|. For each symbol c, C[c] initially stores the total number of occurrences in L of symbols smaller than c. Vector C is indeed indexed by symbol rather than by integer.²
Given this initial content of C, the LF-mapping is performed as follows.

1 for (int i = 0; i <= n; i++) {
2     LF[i] = C[L[i]];
3     C[L[i]]++;
4 }

This algorithm scans the entire transformed column L, of length n + 1 (lines 1, 4). When a symbol is encountered, the algorithm looks up the entry of C keyed by that symbol; that value is assigned to the element of LF with the same index as the current position in L (line 2). Having consumed one occurrence of that symbol, its counter in C is incremented by one (line 3). This means that, at any instant, C contains for each symbol c the number of occurrences of symbols smaller than c, plus the occurrences of c itself encountered so far. Therefore, once the algorithm has completed, C will contain, for each c, the total number of occurrences of symbols smaller than or equal to c, while array LF will contain the LF-mapping that can be used straight away by the backward transform algorithm we now illustrate.
Given the LF-mapping algorithm and the fundamental properties shown earlier, we are able to reconstruct s backwards starting from the transformed output bw(s), which, we recall, consists of a permuted s plus an index r. Having bw(s) is equivalent to having the L column, since the latter can be obtained by inserting symbol $ at position r and right-shifting the remainder by one. Before presenting the algorithm itself, it is helpful to work through an example.

Example 1.2
Let us show how to re-obtain the original input block starting from its BWT-transformed block (ardrcaaaabb, 3).
Since r = 3, we insert the sentinel character $ at position 3 and right-shift the remainder by one. We obtain the last column L of M′, which reads ard$rcaaaabb top-to-bottom. By sorting this column lexicographically, we obtain column F, which reads $aaaaabbcdrr. There is no need to compute any other permutation in M′.
We start reconstructing s$ from the last character $. Its index in F is 0, and L[0] = a, so we prepend this character to the string we are rebuilding, thus obtaining a$.

² By custom, the term “array” is used when referring to linear data structures whose elements are indexed by integer ranges, while the term “vector” is used for structures whose indices belong to arbitrary sets, as is the case of C in this lecture.

L[0] contains the first occurrence of a, so we also locate the first occurrence of a in F (by Property 2 we know that this relative order is preserved). Its index in F is 1 and L[1] = r. The rebuilt string now reads ra$.
L[1] contains the first occurrence of r, and its index in F is 10. Since L[10] = b, we prepend it to the output and obtain bra$.
Note that this process of locating in L the predecessor of each character is equivalent to determining the values of function Ψ in M′, albeit without having to know the prefixes of each row of M′.
By following this rationale, we eventually rebuild the entire string abracadabra$, and by stripping the sentinel character, the original input abracadabra is reconstructed.

Having intuitively demonstrated how it works, let us now illustrate and explain the algorithm itself. An auxiliary array T, with |T| = |L|, is used for storing the output of the algorithm.

1 Compute LF[0, n];
2 k = 0;
3 i = n;
4 while (i > 0) {
5     T[i] = L[k];
6     k = LF[k];
7     i--;
8 }

Given the array LF produced by the LF-mapping (line 1), we begin our scan (lines 2, 4) starting from the first element. Due to Property 1 and the fact that F is sorted starting with symbol $, the first symbol of L is the final symbol of the original string, and LF[0] tells us the index in L of its immediate predecessor.
We fill the new array T starting from its last element (lines 3, 5) and proceeding backwards (line 7). The array LF is not scanned linearly: each subsequent index to be read, both in L (for the actual symbol) and in LF (for the next index), is determined by the value of LF at the previously visited index.
Once the algorithm has returned, the original string s can be read from T[1..n] left-to-right; position T[0] is never written, and ideally corresponds to the sentinel $.
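
Putting the two fragments together, the following self-contained C sketch (with fixed-size buffers for brevity; names are ours) inverts the transform of Example 1.2:

#include <stdio.h>
#include <string.h>

/* Rebuilds s from bw(s) = (l_hat, r): L is obtained by reinserting $ at
   position r, then C and LF are computed as above, and T is filled right
   to left by walking the LF-mapping. */
void bwt_backward(const char *l_hat, size_t r, char *out) {
    size_t n = strlen(l_hat);                /* |s| */
    char L[128], T[128];
    int C[256] = {0}, cnt[256] = {0}, LF[128];

    memcpy(L, l_hat, r);                     /* L = l_hat[0..r-1] $ l_hat[r..n-1] */
    L[r] = '$';
    memcpy(L + r + 1, l_hat + r, n - r);

    for (size_t i = 0; i <= n; i++) cnt[(unsigned char)L[i]]++;
    for (int c = 1; c < 256; c++)            /* C[c] = #occurrences of symbols < c */
        C[c] = C[c - 1] + cnt[c - 1];

    for (size_t i = 0; i <= n; i++) {        /* LF-mapping over all n+1 symbols */
        LF[i] = C[(unsigned char)L[i]];
        C[(unsigned char)L[i]]++;
    }

    size_t k = 0;                            /* row 0 of M' ends with the last char of s */
    for (size_t i = n; i > 0; i--) {
        T[i] = L[k];
        k = LF[k];
    }
    memcpy(out, T + 1, n);                   /* T[1..n] spells s */
    out[n] = '\0';
}

int main(void) {
    char s[128];
    bwt_backward("ardrcaaaabb", 3, s);
    printf("%s\n", s);                       /* prints abracadabra */
    return 0;
}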

1.2 Two simple compressors: MTF and RLE


Let us now focus on two simple data compression algorithms that, now that the BWT has been described in detail, come in very useful for building up the final bzip2 compressor. These compression algorithms are called Move-To-Front (MTF) and Run-Length Encoding (RLE).

1.2.1 Move-To-Front Transform


The MTF [3] (Move-To-Front) transformation is a technique used to improve the performance of entropy encoding. It is based on the idea that every character is replaced with its index in a list initialized with the given alphabet.
Given a string s and an alphabet Σ, the algorithm produces a string s_MTF over the alphabet Σ_MTF = {0, …, |Σ| − 1}.

At the beginning, a list l is initialized to the alphabet Σ and s_MTF to the empty string. At each step we consider, in order, a symbol σ of s and append the index i of σ in l to the string s_MTF. Then we modify l by moving the element σ to the front of the list.
In the example presented in Figure 1.2 we consider the string “bananacocco” and show all the steps needed to encode it with MTF.

s: “bananacocco”
Σ: {a, b, c, n, o}

step  σ    list l (before the step)   i    s_MTF (after the step)
1     b    {a, b, c, n, o}            1    1
2     a    {b, a, c, n, o}            1    11
3     n    {a, b, c, n, o}            3    113
4     a    {n, a, b, c, o}            1    1131
5     n    {a, n, b, c, o}            1    11311
6     a    {n, a, b, c, o}            1    113111
7     c    {a, n, b, c, o}            3    1131113
8     o    {c, a, n, b, o}            4    11311134
9     c    {o, c, a, n, b}            1    113111341
10    c    {c, o, a, n, b}            0    1131113410
11    o    {c, o, a, n, b}            1    11311134101
end        {o, c, a, n, b}                 11311134101

FIGURE 1.2: MTF example

As we can infer from the example in Figure 1.2, MTF assigns codes with lower values to symbols that are more frequent [5]. We can also notice two locally homogeneous zones in the considered string: “banana” and “cocco”. Indeed, these two substrings show redundancy over a few symbols only. For this reason, the output string contains low values within each homogeneous zone, while higher values signal a change of zone.
The reverse operation is easy to compute: having the initial list l, we can re-obtain s by performing the MTF steps in reverse, as shown in Figure 1.3.
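
A compact C sketch of both directions, keeping the list l as a plain character array, might look as follows (function names are ours):

#include <stdio.h>
#include <string.h>

/* Encodes s over the alphabet sigma: each symbol is replaced by its current
   index in the list l, which is then updated by moving the symbol to front. */
void mtf_encode(const char *sigma, const char *s, int *codes) {
    char l[64];
    strcpy(l, sigma);                          /* working list, front = index 0 */
    for (size_t k = 0; s[k]; k++) {
        int i = (int)(strchr(l, s[k]) - l);    /* index of the symbol in l */
        codes[k] = i;
        memmove(l + 1, l, i);                  /* shift the first i entries right */
        l[0] = s[k];                           /* ...and move the symbol to front */
    }
}

/* The inverse: each code is a position in l, and the same update is applied. */
void mtf_decode(const char *sigma, const int *codes, size_t n, char *out) {
    char l[64];
    strcpy(l, sigma);
    for (size_t k = 0; k < n; k++) {
        char c = l[codes[k]];
        out[k] = c;
        memmove(l + 1, l, codes[k]);
        l[0] = c;
    }
    out[n] = '\0';
}

int main(void) {
    const char *s = "bananacocco";
    int codes[32]; char back[32];
    mtf_encode("abcno", s, codes);
    for (size_t k = 0; s[k]; k++) printf("%d", codes[k]);   /* 11311134101 */
    printf("\n");
    mtf_decode("abcno", codes, strlen(s), back);
    printf("%s\n", back);                                   /* bananacocco */
    return 0;
}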

Theoretical analysis
If the message exhibits locality of reference, MTF performs better than Huffman coding, because a word has a short encoding when it is used frequently. Conversely, if the message does not exhibit any kind of homogeneity, MTF performs worse than Huffman coding; indeed, letters used rarely get long encodings [3].
Let ρ_MTF(X) be the average number of bits per symbol used to compress a sequence X with the MTF scheme, and let ρ_H(X) be the number of bits per symbol used to compress X with Huffman coding.

s_MTF: “11311134101”
Σ: {a, b, c, n, o}

step  i    list l (before the step)   s (after the step)
1     1    {a, b, c, n, o}            b
2     1    {b, a, c, n, o}            ba
3     3    {a, b, c, n, o}            ban
4     1    {n, a, b, c, o}            bana
5     1    {a, n, b, c, o}            banan
6     1    {n, a, b, c, o}            banana
7     3    {a, n, b, c, o}            bananac
8     4    {c, a, n, b, o}            bananaco
9     1    {o, c, a, n, b}            bananacoc
10    0    {c, o, a, n, b}            bananacocc
11    1    {c, o, a, n, b}            bananacocco
end        {o, c, a, n, b}            bananacocco

FIGURE 1.3: MTF inverted example

Let f(i) be the number of bits needed to encode the integer i (every symbol can be mapped to an integer). We will use γ-coding, so f(i) ≤ 2 log₂ i + 1.
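
For concreteness, here is a small C sketch of the γ-code just mentioned: it emits ⌊log₂ i⌋ zeros followed by the binary representation of i, for a total of 2⌊log₂ i⌋ + 1 ≤ 2 log₂ i + 1 bits:

#include <stdio.h>

/* Number of bits of the Elias gamma code of i >= 1. */
int gamma_len(unsigned i) {
    int b = 0;
    for (unsigned v = i; v > 1; v >>= 1) b++;   /* b = floor(log2 i) */
    return 2 * b + 1;
}

/* Prints the gamma code: b zeros, then the b+1 bits of i, MSB first. */
void gamma_emit(unsigned i) {
    int b = 0;
    for (unsigned v = i; v > 1; v >>= 1) b++;
    for (int k = 0; k < b; k++) putchar('0');
    for (int k = b; k >= 0; k--) putchar(((i >> k) & 1) ? '1' : '0');
}

int main(void) {
    gamma_emit(7); printf("  (%d bits)\n", gamma_len(7));   /* 00111  (5 bits) */
    gamma_emit(8); printf("  (%d bits)\n", gamma_len(8));   /* 0001000  (7 bits) */
    return 0;
}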

THEOREM 1.1 Let N_a be the number of occurrences of symbol a in X, and let N = |X|, so that N_a/N represents the empirical probability of symbol a. Then:

$$\rho_{MTF}(X) \;\le\; \sum_{a \in S} \frac{N_a}{N} \, f\!\left(\frac{N}{N_a}\right)$$

Proof
Let t₁, …, t_{N_a} be the times at which symbol a is sent. When a occurs at time t₁, its position in the list is at most t₁; when a occurs at time t_i, with i > 1, its position is at most t_i − t_{i−1}. Hence the cost of encoding the first a is at most f(t₁), and the cost of encoding the i-th a is at most f(t_i − t_{i−1}).
Let R_a(X) be the total number of bits used to transmit the N_a occurrences of symbol a. We have:

$$R_a(X) \;\le\; f(t_1) + \sum_{i=2}^{N_a} f(t_i - t_{i-1})$$

Applying Jensen's inequality (f is concave), and noting that the sum telescopes to t_{N_a} ≤ N:

$$R_a(X) \;\le\; N_a \, f\!\left(\frac{1}{N_a}\Big(t_1 + \sum_{i=2}^{N_a}(t_i - t_{i-1})\Big)\right) \;=\; N_a \, f\!\left(\frac{t_{N_a}}{N_a}\right) \;\le\; N_a \, f\!\left(\frac{N}{N_a}\right)$$

Summing the bound on R_a(X) over every a ∈ S and dividing by N we reach the thesis:

$$\rho_{MTF}(X) \;\equiv\; \sum_{a \in S} \frac{R_a(X)}{N} \;\le\; \sum_{a \in S} \frac{N_a}{N} \, f\!\left(\frac{N}{N_a}\right)$$

THEOREM 1.2 The following inequality holds:

$$\rho_{MTF}(X) \;\le\; 2\,\rho_H(X) + 1$$

Proof We start from the result of Theorem 1.1 and substitute for f the γ-coding length:

$$\rho_{MTF}(X) \;\le\; \sum_{a \in S} \frac{N_a}{N} f\!\left(\frac{N}{N_a}\right) \;\le\; \sum_{a \in S} \frac{N_a}{N}\left(2\log_2\frac{N}{N_a} + 1\right) \;=\; 2\sum_{a \in S} \frac{N_a}{N}\log_2\frac{N}{N_a} + 1 \;=\; 2\,H_0(X) + 1 \tag{1.1}$$

and the thesis follows since H₀(X) ≤ ρ_H(X).

1.2.2 Run Length Encoding


RLE [4] (Run-Length Encoding) is a lossless encoding algorithm that compresses sequences of the same symbol in a string into the length of the sequence plus one occurrence of the symbol itself.
It was invented and used for compressing data for fax transmission: a sheet of paper is viewed as a binary (i.e. monochromatic) bitmap with many white pixels and very few black pixels.
Suppose we have to compress the following string, which represents part of a monochromatic bitmap (where W stands for “white” and B for “black”):

WWWWWWWWWWWBWWWWWWWWWWWWBBBBBWWWWWW

We can take the first run of W and compress it as follows:

$$\underbrace{WWWWWWWWWWW}_{11W}\;BWWWWWWWWWWWWBBBBBWWWWWW$$

We can proceed in the same fashion until the end of the line is reached, achieving the following compression:

$$\underbrace{WWWWWWWWWWW}_{11W}\;\underbrace{B}_{1B}\;\underbrace{WWWWWWWWWWWW}_{12W}\;\underbrace{BBBBB}_{5B}\;\underbrace{WWWWWW}_{6W}$$

We can thus encode the initial string as ⟨11, W⟩, ⟨1, B⟩, ⟨12, W⟩, ⟨5, B⟩, ⟨6, W⟩. It is easy to see that the encoding is lossless and simple to reverse. In the previous example we have shown the general scheme, even though, in this case, we could have encoded the string in a more convenient way: ⟨W, 11⟩, ⟨−, 1⟩, ⟨−, 12⟩, ⟨−, 5⟩, ⟨−, 6⟩. Indeed, if |Σ| = 2 we can simply emit the run lengths, since the two symbols of the alphabet alternate.
RLE can perform better or worse than the Huffman scheme: this depends on the message
we want to encode.
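
A minimal C sketch of the general pair-emitting scheme described above (the textual ⟨symbol, length⟩ output is chosen only for readability):

#include <stdio.h>

/* Emits one <symbol, length> pair per maximal run of equal symbols. */
void rle_encode(const char *s) {
    for (size_t i = 0; s[i]; ) {
        size_t j = i;
        while (s[j] && s[j] == s[i]) j++;   /* extend the current run */
        printf("<%c,%zu>", s[i], j - i);
        i = j;
    }
    printf("\n");
}

int main(void) {
    /* the fax-like line used in the example above */
    rle_encode("WWWWWWWWWWWBWWWWWWWWWWWWBBBBBWWWWWW");
    /* prints <W,11><B,1><W,12><B,5><W,6> */
    return 0;
}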

Example 1.3
Here is an example of how RLE can perform better than the Huffman scheme. Suppose we want to encode the following string s, with Σ = {B, W, C}:

WWWWWWWWBBBBBBBBCCCCCCC

RLE encodes s as ⟨W, 8⟩, ⟨B, 8⟩, ⟨C, 7⟩. We associate 0 to B, 10 to W and 11 to C, and we use Elias γ-coding for the run lengths, so 7 is coded as 00111 and 8 as 0001000. The resulting bit string is:

100001000000010001100111

We now encode s with Huffman coding (we omit the construction of the code). Associating 0 to W, 10 to B and 11 to C, the result of encoding string s with the Huffman scheme is:

00000000101010101010101011111111111111

The result proves that |C_Huffman| = 38 > |C_RLE| = 24.

Example 1.4
Here is now an example of how RLE can perform worse than the Huffman scheme. Suppose we want to encode the following string s, with Σ = {A, B, C}:

ABCAABCABCAB

According to the RLE scheme, we encode s as:

⟨A, 1⟩, ⟨B, 1⟩, ⟨C, 1⟩, ⟨A, 2⟩, ⟨B, 1⟩, ⟨C, 1⟩, ⟨A, 1⟩, ⟨B, 1⟩, ⟨C, 1⟩, ⟨A, 1⟩, ⟨B, 1⟩

We associate to A the encoding 0, 10 to B and 11 to C. Even if we encode the run lengths 1 and 2 with the single bits 0 and 1, respectively, we cannot make a shorter encoding than:

00100110011001100010011000100

Here, the length of the encoding of s is 29.
The Huffman scheme assigns the codes 0, 10, 11, according to their probabilities, respectively to A, B and C, yielding:

0101100101101011010

We can therefore conclude that |C_Huffman(s)| = 19 < |C_RLE(s)| = 29.

THEOREM 1.3 For any binary string s = a₁^{l₁} a₂^{l₂} … a_k^{l_k}, with a_i ∈ {0, 1} and a_i ≠ a_{i+1}, we have:

$$|C_{RLE}(s)| \;=\; 1 + \sum_{i=1}^{k} |C_{PF}(l_i)|$$

where |C_RLE(s)| is the length of the encoding of the string s with RLE and |C_PF(l_i)| is the length of the encoding of l_i with a prefix code.

Proof Proving the equality is easy, as it derives directly from the definition. Let us analyze the addenda on the right-hand side of the equation.
• The +1 accounts for the first bit a₁ (recall that s is a binary string). We do not need to transmit the symbols of the subsequent runs, because we know that they alternate.
• The summation adds up the lengths of the run lengths (the exponents), coded with the prefix code.

As we can deduce from the examples above, RLE [5] works well when the runs of the same symbol are very long.

1.3 Implementation
Having seen how the BWT algorithm [4] works, let us now focus on how it is used. We assume from now on that the reader is familiar with the previous sections.
Take a string s and a substring w, and suppose that w appears t times within the main string s. As we have seen, in the matrix created by the BWT there will be t consecutive rows prefixed by the substring w. Let us call these rows r_w + 1, r_w + 2, …, r_w + t, and let ŝ = bw(s).
All of these t rows end with the symbols that precede the occurrences of w in s.

Example 1.5
Recall the example in Figure 1.1. The substring w ≡ bra appears twice within the main string abracadabra.
As we would expect [11], in the sorted BWT matrix we find two consecutive rows prefixed by the substring bra: in Figure 1.1, these are rows 6 and 7 of M′.

If some patterns in the string s occur more frequently than others, then for each such pattern w_i there will be several rows that differ from each other by only a few characters (as we can see in the example in Figure 1.4).

M (cyclic left shifts of s$):    M′ (rows of M, sorted):

abracadacabrapica$               $abracadacabrapica
bracadacabrapica$a               a$abracadacabrapic
racadacabrapica$ab               abracadacabrapica$
acadacabrapica$abr               abrapica$abracadac
cadacabrapica$abra               acabrapica$abracad
adacabrapica$abrac               acadacabrapica$abr
dacabrapica$abraca               adacabrapica$abrac
acabrapica$abracad               apica$abracadacabr
cabrapica$abracada               bracadacabrapica$a
abrapica$abracadac               brapica$abracadaca
brapica$abracadaca               ca$abracadacabrapi
rapica$abracadacab               cabrapica$abracada
apica$abracadacabr               cadacabrapica$abra
pica$abracadacabra               dacabrapica$abraca
ica$abracadacabrap               ica$abracadacabrap
ca$abracadacabrapi               pica$abracadacabra
a$abracadacabrapic               racadacabrapica$ab
$abracadacabrapica               rapica$abracadacab

FIGURE 1.4: Example of BWT applied to abracadacabrapica.

For this reason, the string ŝ is locally homogeneous. As we can see from the examples in Figures 1.1 and 1.4, the string ŝ contains several runs of equal characters.
The basic idea is to exploit this property [5] by applying the Move-To-Front (MTF) encoding to ŝ: if ŝ is locally homogeneous, then mtf(ŝ) consists of small numbers.

Example 1.6

Consider the example in Figure 1.4 and let us encode the string ŝ.
In this case, ŝ ≡ accdrcraaiaaapabb (we omit the computation).
If we apply the MTF to ŝ we find the following string: 02036213051006160.

Since the MTF output applied to ŝ is skewed toward small values, we can finally apply Huffman [8] or arithmetic coding [9] to the result, obtaining good compression.

1.3.1 Construction with suffix arrays


Given how the Burrows-Wheeler forward transform employs a rotation matrix M, as well as its sorted version M′, to transform an entire input block, the reader might reasonably doubt how spatially efficient this algorithm can be. Indeed, with n the length of the input block, and assuming that each character is stored in a single byte, the construction of M would require (n + 1)² bytes, since the sentinel character $ is also part of the rotations. That is why most compressors that actually implement the BWT exploit some “tricks” in order to avoid constructing the actual matrix. One such “trick” involves the use of so-called suffix arrays.

suffix index sorted suffix value


abracadabra$ 0 $ 11
bracadabra$ 1 a$ 10
racadabra$ 2 abra$ 7
acadabra$ 3 abracadabra$ 0
cadabra$ 4 acadabra$ 3
adabra$ 5 adabra$ 5
dabra$ 6 bra$ 8
abra$ 7 bracadabra$ 1
bra$ 8 cadabra$ 4
ra$ 9 dabra$ 6
a$ 10 ra$ 9
$ 11 racadabra$ 2

FIGURE 1.5: Construction of the suffix array for string abracadabra$.

The suffix array of a string s is an array of integers, where each integer gives the starting position of a suffix of s when all the suffixes are sorted lexicographically. It is, in fact, a permutation of the indices of all the characters in the string.
For example, the suffix array for string abracadabra$ is {11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2}, since the first suffix in lexicographical order is $ and it starts at index 11 in the string; the second suffix is a$ and it starts at index 10; the third is abra$ at index 7, and so on.
Figure 1.5 shows how this array is constructed. For convenience, the first column lists all the suffixes of the string from longest to shortest, with their starting indices in the second column. The third column shows the same suffixes sorted lexicographically. In order to obtain the suffix array value (fourth column) for a certain suffix, one finds the row containing that same suffix in the first column: the index in the second column is the value to be filled in.
Sorting suffixes is equivalent to sorting the rotations in M to obtain M′, if we ignore the symbols following the sentinel character $ [1]. Therefore, if s is the original input string, T denotes the array of characters in s$, and A_T denotes the suffix array of T, then we can compute column L of M′ as follows:

$$L[i] = \begin{cases} T[A_T[i] - 1] & \text{if } A_T[i] \neq 0 \\ \$ & \text{otherwise} \end{cases}$$
Informally, this means that every position of L is filled with the character of T immediately preceding the corresponding suffix. If, however, that suffix is the whole string (i.e. the suffix array holds value 0 at that index), $ is used instead.
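
In code, the formula translates into a single scan of the suffix array; the following C sketch (names are ours) reproduces column L of Figure 1.6:

#include <stdio.h>

/* Fills L from the text T = s$ and its suffix array A, applying
   L[i] = T[A[i] - 1] when A[i] != 0, and L[i] = $ otherwise. */
void bwt_from_sa(const char *T, const int *A, int n, char *L) {
    for (int i = 0; i < n; i++)
        L[i] = (A[i] != 0) ? T[A[i] - 1] : '$';
    L[n] = '\0';
}

int main(void) {
    const char *T = "abracadabra$";
    int A[] = {11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2};  /* from Figure 1.5 */
    char L[16];
    bwt_from_sa(T, A, 12, L);
    printf("%s\n", L);   /* prints ard$rcaaaabb, the last column of M' */
    return 0;
}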
Figure 1.6 shows how to obtain bw(s) given s and its suffix array A_T. The first four columns repeat, for convenience, the table from Figure 1.5. The fifth column shows what the corresponding rows of the sorted matrix M′ would look like: this column is not actually needed, but is displayed here to show that every string in it is prefixed by the corresponding suffix in the third column. Finally, the last column shows the transformed input, i.e. the last column of M′, as obtained by applying the above formula.
The construction of suffix arrays is itself computationally challenging, albeit far less so than building the actual rotation matrix M′. Many algorithms with linear space complexity and pseudo-linear time complexity have been proposed over the years. One such algorithm, presented in [10], uses O(n) bytes and, although bounded by O(n log n) time in the worst case, runs in Θ(n) time in the majority of cases. For an extensive classification of suffix array construction algorithms, see [13].

suffix           index   sorted suffix   value   M′             L

abracadabra$     0       $               11      $abracadabra   a
bracadabra$      1       a$              10      a$abracadabr   r
racadabra$       2       abra$           7       abra$abracad   d
acadabra$        3       abracadabra$    0       abracadabra$   $
cadabra$         4       acadabra$       3       acadabra$abr   r
adabra$          5       adabra$         5       adabra$abrac   c
dabra$           6       bra$            8       bra$abracada   a
abra$            7       bracadabra$     1       bracadabra$a   a
bra$             8       cadabra$        4       cadabra$abra   a
ra$              9       dabra$          6       dabra$abraca   a
a$               10      ra$             9       ra$abracadab   b
$                11      racadabra$      2       racadabra$ab   b

FIGURE 1.6: Application of the BWT from the suffix array for string abracadabra$.

1.4 Theoretical results and compression boosting

1.4.1 Entropy
Let us first define the concept of entropy. Information entropy is a measure of the uncertainty associated with a random variable; in other (and simpler) words, entropy is a measure of information content [6].

DEFINITION 1.2 [Entropy] The entropy H of a discrete random variable X over values x₁, …, xₙ is defined as:

H(X) ≡ E(I(X))

where E is the expected value and I is the self-information of X, i.e. I(x) = −log p(x). If we denote by p the probability mass function of X, then we can rewrite the definition as follows:

$$H(X) \;\equiv\; -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$
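
When the probabilities p(x_i) are taken to be the empirical symbol frequencies of a string, this definition yields the 0-order empirical entropy H₀ used below; a small C sketch (link with -lm):

#include <stdio.h>
#include <string.h>
#include <math.h>

/* 0-order empirical entropy: -sum over symbols of (n_c/n) log2 (n_c/n). */
double h0(const char *s) {
    int cnt[256] = {0};
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++) cnt[(unsigned char)s[i]]++;
    double h = 0.0;
    for (int c = 0; c < 256; c++) {
        if (cnt[c] == 0) continue;
        double p = (double)cnt[c] / (double)n;
        h -= p * log2(p);
    }
    return h;
}

int main(void) {
    printf("%.3f bits/symbol\n", h0("abracadabra"));   /* about 2.040 */
    return 0;
}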

Burrows-Wheeler performance
An encoding algorithm is considered better than another if it uses fewer bits. For this reason, it is important to study the theoretical performance of the BW transform. The Burrows-Wheeler Transform (Section 1.1) makes no hypothesis on the source text but, potentially, it exploits all the input symbols previously encountered.
Given a string s, we need to investigate the k-order empirical entropy of s:

$$H_k(s) \;=\; \frac{1}{|s|} \sum_{w \in A^k} |w_s| \, H_0(w_s)$$

where w_s denotes the string formed by the symbols that follow the occurrences of the context w in s, and |s| is the length of the string s.
Now, consider compressing all the w_s blocks output by the BW transform with a 0-order statistical compressor (for the moment we do not consider using the MTF algorithm). We shall use ρ_bw to denote the number of bits per symbol required by this kind of compression. We know that the number of bits per symbol required by a 0-order statistical compressor is bounded by

H₀ ≤ ρ ≤ H₀ + µ     (1.2)

Since we compress each w_s block separately, the length of the whole compressed output is the sum of the lengths of the individually compressed blocks. We can infer the following inequality:

$$\rho_{bw} \;\le\; \frac{1}{|s|} \sum_{w_s \in bw(s)} |w_s| \, (H_0(w_s) + \mu) \tag{1.3}$$

where |s| is the length of the input (indeed we want the number of bits per symbol) and the w_s are the blocks output by the BW transform.
Let us denote by f_{w_s}(c) the frequency of symbol c in the block w_s and by A(w_s) the alphabet used in the string w_s. We can rewrite formula 1.3 in the following way:

$$\rho_{bw} \;\le\; \frac{1}{|s|} \sum_{w_s \in bw(s)} |w_s| \left( \sum_{c \in A(w_s)} f_{w_s}(c) \log_2 \frac{1}{f_{w_s}(c)} + \mu \right) \tag{1.4}$$

Indeed, H₀(w_s) is the entropy of the block w_s: we have simply expanded H₀(w_s) with its definition. We can do better. Since f_{w_s}(c) is the frequency of the symbol c within the block w_s, it can be rewritten as the conditional frequency f(c|w): whenever the symbol c is contained in the block w_s, it means that c follows the context w in the input.

THEOREM 1.4 The following inequality holds:

ρ_bw ≤ H_k(s) + µ

Proof
Starting from equation 1.4, we replace f_{w_s}(c) with the conditional frequency and carry the µ term out of the sum. We have:

$$\rho_{bw} \;\le\; \mu + \frac{1}{|s|} \sum_{w_s \in bw(s)} |w_s| \sum_{c \in A(w_s)} f(c|w) \log_2 \frac{1}{f(c|w)}$$

and by the definition of H₀ we have:

$$\rho_{bw} \;\le\; \mu + \frac{1}{|s|} \sum_{w_s \in bw(s)} |w_s| \, H_0(w_s)$$

Since the blocks output by the BW transform are exactly the strings w_s associated with the contexts w ∈ A^k, the sum can be rewritten over all contexts:

$$\rho_{bw} \;\le\; \mu + \frac{1}{|s|} \sum_{w \in A^k} |w_s| \, H_0(w_s) \;=\; \mu + H_k(s)$$

Here |w_s| H₀(w_s) is the number of bits needed by our 0-order compressor to represent w_s; summing over all possible strings w_s yields the thesis.

1.5 Some experimental tests


We will now show some results of testing an implementation of a BWT-based compressor
against a few other well-known compression algorithms. All implementations were tested
on a GNU/Linux platform.

• Bzip (combination of the BWT with multiple encoding schemes, including MTF, RLE and Huffman): for this algorithm we shall use the “bzip2” package from http://www.bzip.org, version 1.0.5-r1.
• LZMA (Lempel–Ziv–Markov chain algorithm): we shall use the “lzip” package from http://www.nongnu.org, version 1.10.
• LZ77 (original Lempel-Ziv algorithm): since there are no direct implementations of this algorithm, we will use the “lzop” package (Lempel–Ziv–Oberhumer), version 1.02 rc1-r1, the latest implementation of the LZO1A compression algorithm. It compresses a block of data into matches (using an LZ77 sliding dictionary) and runs of non-matching literals. LZO1A handles long matches and long literal runs, so it produces good results on highly redundant data and deals acceptably with non-compressible data. Some ideas are borrowed from the LZRW and LZV compression algorithms [12]. We chose this implementation because it is used in commercial and mission-critical settings (it has been used by NASA [2]).
• LZ77: for this algorithm we shall also test the DEFLATE implementation, which combines LZ77 with Huffman coding. We shall use the “zip” package from http://www.info-zip.org/, version 3.0. It is a lossless data compression algorithm.

All the tests were run in RAM (using ramfs) under the following configuration:

• AMD Athlon(tm) X2 Dual-Core QL-64
• 2048 MB PC2-6400 SODIMM RAM
• GNU/Linux Gentoo distribution, using zen-sources-2.6.33 p1 with BFS scheduling.

We shall test all the implementations on three different kinds of input:

• all the cantiche of the Divine Comedy: this file represents text with some degree of correlation.
• a raw monochromatic uncompressed image: we chose this kind of file because it contains a lot of “similar” information (black and white pixels), so we expected a high compression rate.
• the gcc-4.4.3 package: we test the algorithms on this package in order to assess how they behave when compressing a large number of files distributed across a deep, highly branched directory structure.

All the tests were run 100 times and we report the average result.
From Figures 1.7, 1.8 and 1.9 we can draw some conclusions. First of all, lzma performs poorly in these tests: it always has the worst compression time, and its compression rate is not correspondingly better. bzip2 sits somewhere in the middle: it sometimes takes a long time to encode, but it reaches a very good compression rate; considering decompression time, however, bzip2 slows down considerably as the size of the input grows.
Perhaps the best overall solution is zip: it takes a short time to encode and decode, and reaches a very good compression rate.
compression time in seconds (lower is better)

         Divine Comedy   monochromatic image   Gcc
Bzip     0.143           0.048                 150.20
Lzma     0.648           1.285                 547.90
Lzo      0.022           0.000*                3.84
Zip      0.082           0.049                 24.49

* less than the clock resolution of the machine

FIGURE 1.7: Encoding test (lower is better)

compression rate (higher is better)

         Divine Comedy   monochromatic image   Gcc
Bzip     68%             99%                   85%
Lzma     69%             99%                   87%
Lzo      35%             96%                   64%
Zip      59%             98%                   80%

FIGURE 1.8: Ratio of compression (higher is better)


decompression time in seconds (lower is better)

         Divine Comedy   monochrome image   Gcc
Bzip     0.073           0.009              34.42
Lzma     0.020           0.010              8.89
Lzo      0.000*          0.000*             1.70
Zip      0.001           0.009              3.54

* less than the clock resolution of the machine

FIGURE 1.9: Decoding test (lower is better)

lzo is the fastest algorithm we tested; unfortunately, its compression rate is by far the worst.
It is not easy to declare a single best algorithm: the right choice depends on what the user needs (compression rate, encoding time, decoding time).

References
[1] D. Adjeroh, T. Bell, and A. Mukherjee. The Burrows-Wheeler Transform: Data
Compression, Suffix Arrays, and Pattern Matching. Springer series in statistics.
Springer, 2008.
[2] National Aeronautics and Space Administration. http://marsrovers.jpl.nasa.gov/home/.
[3] Jon Louis Bentley, Daniel D. Sleator, Robert E. Tarjan, and Victor K. Wei. A locally
adaptive data compression scheme. Commun. ACM, 29(4):320–330, 1986.
[4] Michael Burrows and David J. Wheeler. A block-sorting lossless data compression
algorithm. Technical Report 124, Digital Systems Research Center (SRC), 1994.
[5] Stefano Cataudella and Antonio Gulli. BW transform and its applications. Tutorial
on BW-transform, 2003.
[6] Paolo Ferragina and Giovanni Manzini. Boosting textual compression. In Ming-Yang Kao, editor, Encyclopedia of Algorithms. Springer US, 2008.
[7] Paolo Ferragina and Giovanni Manzini. Burrows-Wheeler transform. In Ming-Yang Kao, editor, Encyclopedia of Algorithms. Springer US, 2008.
[8] D.A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
[9] Glen G. Langdon. An introduction to arithmetic coding. IBM J. Res. Dev., 28(2):135–
149, 1984.
[10] Giovanni Manzini and Paolo Ferragina. Engineering a lightweight suffix array con-
struction algorithm. Algorithmica, 40(1):33–50, 2004.
[11] Shoshana Neuburger. Review of The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching by Donald Adjeroh, Timothy Bell and Amar Mukherjee (Springer, 2008). SIGACT News, 41(1):21–24, 2010.
[12] Markus Franz Xaver Johannes Oberhumer. ftp://ftp.matsusaka-u.ac.jp/pub/compression/oberhumer/lzo1a-announce.txt, 1996.
[13] Simon J. Puglisi, William F. Smyth, and Andrew Turpin. A taxonomy of suffix ar-
ray construction algorithms. In Jan Holub and Milan Simánek, editors, Stringology,
pages 1–30. Department of Computer Science and Engineering, Faculty of Electrical
Engineering, Czech Technical University, 2005.
