23 Landau
Burrows-Wheeler Based
Compression
Haim Kaplan
Shir Landau
Elad Verbin
Our Results
1. Improve the bounds of one of the main
BWT based compression algorithms
2. New technique for worst case analysis of
BWT based compression algorithms
using the Local Entropy
3. Interesting results concerning
compression of integer strings
The Burrows-Wheeler Transform
(1994)
Given a string S, the Burrows-Wheeler
Transform creates a permutation of S that
is locally homogeneous.
S → BWT → S′ (locally homogeneous)
Empirical Entropy - Intuition
Order-k entropy
H_k(s): Lower bound for compression with order-k
contexts – the codeword representing each
symbol depends on the k symbols preceding it
MISSISSIPPI
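As a rough illustration of the order-k statistic, the sketch below computes the order-0 empirical entropy and the quantity nH_k(s) for the example string. It assumes the common definition in which nH_k(s) sums, over every length-k context w, the order-0 cost of the symbols that immediately follow w; the function names `h0` and `n_hk` are hypothetical, and the first k symbols (which have no full context) are simply skipped.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """Order-0 empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(s).values())


def n_hk(s, k):
    """n * H_k(s) under the assumed definition: group each symbol by the
    k symbols preceding it, then add up |group| * H_0(group)."""
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    return sum(len(group) * h0(group) for group in followers.values())


s = "mississippi"
print(round(h0(s), 3))       # bits per symbol with no context
print(round(n_hk(s, 1), 3))  # total bits when conditioning on one symbol
```

Conditioning on the preceding symbol lowers the total here because, for instance, every "m" in the string is followed by the same symbol.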
S = b a a c b
Initial list: a b c
MTF(S) = 1 1 0 2 2
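The move-to-front example above can be reproduced with a short sketch. It assumes the list starts as a, b, c and that positions are counted from 0; the names `mtf_encode` and `mtf_decode` are hypothetical.

```python
def mtf_encode(s, alphabet):
    """Replace each symbol by its current position in the list,
    then move that symbol to the front of the list."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out


def mtf_decode(codes, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    table = list(alphabet)
    out = []
    for i in codes:
        ch = table.pop(i)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


print(mtf_encode("baacb", "abc"))  # [1, 1, 0, 2, 2]
```

Runs of equal symbols turn into runs of zeros, which is why MTF pairs well with a locally homogeneous input.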
Main Bounds (Manzini 1999)
2. Convexity
The Local Entropy – Property 1
1. The entropy hierarchy:
We prove: For each k:
LE(BWT(s)) ≤ nH_k(s) + O(1)
a b a a a b a b
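The slides do not spell out the definition of the local entropy, so the sketch below assumes the usual one: LE(s) is the sum of log2(m + 1) over the move-to-front codes m of s. Both that definition and the name `local_entropy` are assumptions made for illustration, as is the initial list a, b.

```python
from math import log2


def local_entropy(s, alphabet):
    """Assumed definition: sum of log2(rank + 1) over the MTF ranks of s."""
    table = list(alphabet)
    total = 0.0
    for ch in s:
        rank = table.index(ch)
        total += log2(rank + 1)
        table.insert(0, table.pop(rank))  # move-to-front update
    return total


# Example string from the slide; its MTF ranks are 0 1 1 0 0 1 1 1.
print(local_entropy("abaaabab", "ab"))  # 5.0
```

Under this definition a symbol that repeats costs nothing, so strings with long homogeneous stretches have small LE.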
Convexity – Why do we need it?
Ferragina, Giancarlo, Manzini and Sciortino, JACM 2005:
String S → FGMS(s), with a bound in terms of nH*_k(s) and g_k
nH_0(s) ≤ μ∙SL(s) + log(ζ(μ))∙n
• Strange conclusion… we get an upper bound for
the order-0 algorithm with a term that depends
on the values of the integers.
• This is true for all strings but is especially
interesting for strings with smaller integers.
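A quick numeric check can make a bound of this kind concrete. The sketch assumes SL(s) is the sum of log2(x + 1) over the non-negative integers x in s, and checks the form nH_0(s) ≤ μ∙SL(s) + n∙log2(ζ(μ)) for the single value μ = 2, where ζ(2) = π²/6. The function names are hypothetical, and a few passing inputs are of course not a proof.

```python
from collections import Counter
from math import log2, pi


def n_h0(xs):
    """n * H_0 of a sequence, in bits."""
    n = len(xs)
    return sum(c * log2(n / c) for c in Counter(xs).values())


def sl(xs):
    """Assumed definition of SL: sum of log2(x + 1) over the integers."""
    return sum(log2(x + 1) for x in xs)


def sl_bound_mu2(xs):
    """Right-hand side mu*SL + n*log2(zeta(mu)) for mu = 2."""
    return 2 * sl(xs) + len(xs) * log2(pi ** 2 / 6)


for xs in ([0, 0, 1, 0, 2, 1, 0, 0], list(range(16)), [0] * 10):
    print(round(n_h0(xs), 2), "<=", round(sl_bound_mu2(xs), 2))
```

The right-hand side shrinks as the integers get smaller, which matches the remark that the bound is most interesting for strings of small integers.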
A lower bound for SL
Theorem: For any algorithm A, any μ > 1,
and any C such that C < log(ζ(μ)),
there exists a string S of length n for
which:
|A(S)| > μ∙SL(S) + C∙n
Our Results - Summary
• New improved bounds for BWMTF
We question the
effectiveness of nH_k(s).
Is there a better statistic?
Anybody want to guess??
Creating a Huffman encoding
• For each encoding unit (letter, in this
example), associate a frequency (number of
times it occurs)
A = 0
B = 100
C = 1010
D = 1011
R = 11
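A minimal sketch of how such a code table could be built from frequencies is shown below. The frequencies are hypothetical (the letter counts of "ABRACADABRA", which the slide does not state), and `huffman_code` is an illustrative name. Because several symbols tie, the codewords produced may differ from the table above while having the same total cost.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Build a prefix code by repeatedly merging the two least frequent
    subtrees; returns a dict mapping symbol -> bit string."""
    if len(freqs) == 1:
        return {sym: "0" for sym in freqs}
    tiebreak = count()  # unique counter so the dicts are never compared
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical frequencies, assumed for illustration only.
freqs = {"A": 5, "B": 2, "R": 2, "C": 1, "D": 1}
code = huffman_code(freqs)
total_bits = sum(freqs[sym] * len(code[sym]) for sym in freqs)
print(code, total_bits)
```

With these assumed frequencies the table on the slide costs 5∙1 + 2∙3 + 1∙4 + 1∙4 + 2∙2 = 23 bits, and any Huffman code for the same frequencies has that same total.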
The Burrows-Wheeler Transform
(1994)
Given a string S = banana#, list all rotations of S and sort the rows:

Rotations    Sorted rows
banana#      # banan a
anana#b      a #bana n
nana#ba      a na#ba n
ana#ban      a nana# b
na#bana      b anana #
a#banan      n a#ban a
#banana      n ana#b a

The Burrows-Wheeler Transform is the last column of the sorted rows: annb#aa
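The construction above can be sketched directly: build every rotation, sort them, and read off the last character of each row. This is a naive version for illustration, with `bwt` a hypothetical name, and it assumes that the end marker # sorts before every letter (true for ASCII).

```python
def bwt(s):
    """Burrows-Wheeler Transform via explicitly sorted rotations."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


print(bwt("banana#"))  # annb#aa
```

Sorting full rotations takes quadratic space, so this form is only meant to mirror the table, not to be used on large inputs.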
Suffix Arrays and the BWT
Suffix Array    Sorted rows    The Index of BWT
7               # banan a      6
6               a #bana n      5
4               a na#ba n      3
2               a nana# b      1
1               b anana #      7
5               n a#ban a      4
3               n ana#b a      2

So all we need to get the BWT is the suffix array!
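The relationship in the table can be sketched as follows: each BWT character is the one just before the corresponding suffix, wrapping around for the suffix that starts at position 1. Positions are 1-based to match the slide, the function names are hypothetical, and the suffix array is built by naive sorting, assuming S ends with a unique marker # that sorts before every letter.

```python
def suffix_array(s):
    """1-based starting positions of the suffixes of s in sorted order
    (naive construction, for illustration only)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def bwt_from_suffix_array(s, sa):
    """BWT character for suffix i is the symbol at 1-based position i - 1,
    wrapping to the last symbol when i == 1."""
    return "".join(s[i - 2] for i in sa)  # s[-1] handles the wrap-around


s = "banana#"
sa = suffix_array(s)
print(sa)                            # [7, 6, 4, 2, 1, 5, 3]
print(bwt_from_suffix_array(s, sa))  # annb#aa
```

In practice a linear-time or near-linear suffix array construction would replace the naive sort; the lookup step stays the same.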