

A Simpler Analysis of

Burrows-Wheeler Based
Compression
Haim Kaplan
Shir Landau
Elad Verbin
Our Results
1. Improved bounds for one of the main BWT-based compression algorithms
2. A new technique for worst-case analysis of BWT-based compression algorithms, using the Local Entropy
3. Interesting results concerning compression of integer strings
The Burrows-Wheeler Transform (1994)
Given a string S, the Burrows-Wheeler Transform creates a permutation of S that is locally homogeneous.

S → BWT → S′   (S′ is locally homogeneous)
Empirical Entropy - Intuition

The Problem: Given a string S, encode each symbol in S using a fixed codeword…
Order-0 Entropy (Shannon 48)
H0(s): Maximum compression we can get using
only frequencies and no context information
Example: Huffman Code
[Figure: Huffman code tree, with each left branch labeled 0 and each right branch labeled 1]
Order-k entropy
H_k(s): Lower bound for compression with order-k contexts – the codeword representing each symbol depends on the k symbols preceding it

Example: MISSISSIPPI
Context 1 for s: "isis"    Context 1 for i: "mssp"

Traditionally, the compression ratio of compression algorithms is measured using H_k(s).
History
The Main Burrows-Wheeler Compression Algorithm (Burrows, Wheeler 1994):

String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-Front) → RLE (Run-Length Encoding) → Order-0 Encoding → Compressed String S′
MTF
Given a string S = baacb over alphabet Σ = {a,b,c,d}:

S      = b a a c b
MTF(S) = 1 1 0 2 2

(Each output value is the symbol's current position in the recency list, which starts as a,b,c,d; after a symbol is encoded, it is moved to the front of the list.)
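The slide's example can be reproduced with a minimal Python sketch (the helper name is my own):

```python
def mtf_encode(s, alphabet):
    # Move-to-Front: emit each symbol's current position in the
    # recency list, then move that symbol to the front.
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

print(mtf_encode("baacb", "abcd"))  # → [1, 1, 0, 2, 2]
```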
Main Bounds (Manzini 1999)

BW_MTF(s) ≤ 8nH_k(s) + 0.08n + g_k

• g_k is a constant dependent on the context length k and the size of the alphabet
• these are worst-case bounds
…Now we are ready to begin
Some Intuition…
• H_0 – "measures" frequency
• H_k – "measures" frequency and context
→ We want a statistic that measures local similarity in a string, and specifically in the BWT of the string
Some Intuition…
• The more similar the contexts are in the original string, the more local similarity its BWT will exhibit…
• The more local similarity found in the BWT of the string, the smaller the numbers we get in MTF…

→ The solution: Local Entropy
The Local Entropy - Definition
We define, given a string s:

s (original string) → MTF → MTF(s) = s1 s2 … sn (integer sequence)

The local entropy of s (Bentley, Sleator, Tarjan, Wei, 86):

LE(s) = Σ_{i=1}^{n} log(s_i + 1)
The Local Entropy - Definition

LE(s) = Σ_{i=1}^{n} log(s_i + 1)

Note: LE(s) = number of bits needed to write the MTF sequence in binary.

Example: MTF(s) = 311
→ LE(s) = 4
→ MTF(s) in binary = 1111

In a dream world, we would like to compress s down to LE(s)…
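A minimal Python sketch of this computation, taking logs base 2 to match the bit-count interpretation (the function name is my own):

```python
import math

def local_entropy(mtf_seq):
    # LE = sum of log2(s_i + 1): the number of bits needed to
    # write each MTF value in plain binary.
    return sum(math.log2(x + 1) for x in mtf_seq)

print(local_entropy([3, 1, 1]))  # → 4.0  (log2 4 + log2 2 + log2 2)
```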
The Local Entropy – Properties
We use two properties of LE:

1. The entropy hierarchy

2. Convexity
The Local Entropy – Property 1
1. The entropy hierarchy:
We prove: For each k:

LE(BWT(s)) ≤ nH_k(s) + O(1)

→ Any upper bound that we get for BWT with LE holds for H_k(s) as well.
The Local Entropy – Property 2
2. Convexity:

LE(s1 s2) ≤ LE(s1) + LE(s2) + O(1)

→ This means that a partition of a string s does not improve the Local Entropy of s.
Convexity
• Cutting the input string into parts doesn't influence much: only O(|Σ|) positions per part are affected

a b a a | a b a b
Convexity – Why do we need it?
Ferragina, Giancarlo, Manzini and Sciortino, JACM 2005:

FGMS(s) ≤ nH*_k(s) + g_k

String S → BWT (Burrows-Wheeler transform) → Booster (partition of BWT(S)) → RHC (variation of Huffman encoding) → Compressed String S′
Using LE and its properties we get our bounds

Theorem: For every μ > 1, where ζ(μ) = 1/1^μ + 1/2^μ + 1/3^μ + …:

BW_MTF(s) ≤ μ·LE(BWT(s)) + log ζ(μ)·n        (our LE bound)
          ≤ μ·nH_k(s) + log ζ(μ)·n + g_k     (our H_k bound)
Our bounds
We get an improvement of the known bounds:

BW_MTF(s) ≤ 4.45nH_k(s) + 0.08n + g_k
BW_MTF(s) ≤ 8nH_k(s) + 0.006n + g_k

As opposed to the known bound (Manzini, 1999):

BW_MTF(s) ≤ 8nH_k(s) + 0.08n + g_k

Our Test Results

File Name    | bzip2   | Our bound using LE | Our H_k bound | Manzini's bound 8nH_k(s)+0.08n+g_k
alice29.txt  | 345568  | 396813             | 766940        | 2328219
asyoulik.txt | 316552  | 367874             | 683171        | 2141646
cp.html      | 61056   | 69858              | 105033        | 295714
fields.c     | 24312   | 25713              | 43379         | 119210
grammar.lsp  | 10264   | 10234              | 16054         | 45134
lcet10.txt   | 861184  | 1021440            | 1967240       | 5867291
plrabn12.txt | 1164360 | 1391310            | 2464440       | 8198976
xargs.1      | 14096   | 13858              | 22317         | 64673

*The files are non-binary files from the Canterbury corpus. bzip2 results are also taken from the corpus. The size is indicated in bytes.
How is LE related to compression of integer sequences?
• We mentioned a "dream world", but what about reality? How close can we come to LE(BWT(s))?

Problem: Compress an integer sequence s close to its sum of logs:

SL(s) = Σ_{x∈s} log(x + 1)

Notice that for any s: LE(s) = SL(MTF(s))
Compressing Integer Sequences

• Universal Encodings of Integers: prefix-free encodings for the integers (e.g. Fibonacci encoding).
• Doing some math, it turns out that order-0 encoding is good. Not only good: it is best!
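As an illustration of a universal prefix-free code, here is a minimal sketch of Fibonacci (Zeckendorf) encoding, which gives short codewords to small integers; the function name is my own:

```python
def fib_encode(n):
    # Fibonacci code: write n as a sum of non-consecutive Fibonacci
    # numbers (Zeckendorf), one bit per Fibonacci number, then append
    # a '1' so every codeword ends in "11" (making the code prefix-free).
    assert n >= 1
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number exceeding n
    bits = ['0'] * len(fibs)
    r = n
    for i in range(len(fibs) - 1, -1, -1):  # greedy, largest first
        if fibs[i] <= r:
            bits[i] = '1'
            r -= fibs[i]
    return ''.join(bits) + '1'

print([fib_encode(n) for n in (1, 2, 3, 4)])  # → ['11', '011', '0011', '1011']
```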
The order-0 math
• Theorem: For any string s of length n over the integer alphabet {1,2,…,h} and for any μ > 1:

nH_0(s) ≤ μ·SL(s) + log ζ(μ)·n

• Strange conclusion… we get an upper bound on the order-0 algorithm with a term dependent on the values of the integers.
• This is true for all strings, but is especially interesting for strings with small integers.
A lower bound for SL
Theorem: For any algorithm A, any μ > 1, and any C such that C < log(ζ(μ)), there exists a string S of length n for which:

|A(S)| > μ·SL(S) + C·n
Our Results - Summary
• New improved bounds for BW_MTF
• Local Entropy (LE)
• New bounds for compression of integer strings
Open Issues

We question the effectiveness of nH_k(s).
Is there a better statistic?

Anybody want to guess??
Creating a Huffman encoding
• For each encoding unit (letter, in this example), associate a frequency (number of times it occurs)
• Create a binary tree whose children are the encoding units with the smallest frequencies
  – The frequency of the root is the sum of the frequencies of the leaves
• Repeat this procedure until all the encoding units are in the binary tree
Example
Assume that relative frequencies are:
A: 40
B: 20
C: 10
D: 10
R: 20
Example, cont.
• Assign 0 to left branches, 1 to right branches
• Each encoding is a path from the root

A=0
B = 100
C = 1010
D = 1011
R = 11
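The procedure above can be sketched in Python with a binary heap. Code assignments depend on how frequency ties are broken, so this sketch may not reproduce the exact codes on the slide, but the total encoded size (220 bits for these frequencies) is optimal either way:

```python
import heapq
import itertools

def huffman_codes(freqs):
    # Repeatedly merge the two lowest-frequency nodes into one,
    # then read codes off the root-to-leaf paths (0 = left, 1 = right).
    counter = itertools.count()  # tie-breaker so heapq never compares trees
    heap = [(f, next(counter), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                        # leaf: an encoding unit
            codes[node] = prefix or '0'
    walk(heap[0][2], '')
    return codes

freqs = {'A': 40, 'B': 20, 'C': 10, 'D': 10, 'R': 20}
codes = huffman_codes(freqs)
print(sum(f * len(codes[s]) for s, f in freqs.items()))  # → 220
```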
The Burrows-Wheeler Transform (1994)
Given a string S = banana#

Rotations:          Sort the rows →    Sorted rotations (last column = BWT):
banana#                                # banan | a
anana#b                                a #bana | n
nana#ba                                a na#ba | n
ana#ban                                a nana# | b
na#bana                                b anana | #
a#banan                                n a#ban | a
#banana                                n ana#b | a

BWT(S) = annb#aa
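The rotation-sorting construction above can be written directly as a short sketch (fine for examples; real implementations build a suffix array instead, as the next slide shows):

```python
def bwt(s):
    # Sort all cyclic rotations of s and take the last column.
    # '#' sorts before the letters, acting as the end-of-string marker.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rotations)

print(bwt("banana#"))  # → annb#aa
```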
Suffix Arrays and the BWT

So all we need to get the BWT is the suffix array!

Suffix | Sorted    | BWT  | Index of
Array  | rotation  | char | BWT char
  7    | # banan a |  a   |   6
  6    | a #bana n |  n   |   5
  4    | a na#ba n |  n   |   3
  2    | a nana# b |  b   |   1
  1    | b anana # |  #   |   7
  5    | n a#ban a |  a   |   4
  3    | n ana#b a |  a   |   2
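A minimal sketch of the suffix-array route: the i-th row of the sorted rotation matrix begins with the i-th smallest suffix, so the BWT character is simply the character preceding that suffix (with wraparound for the suffix starting at position 0, which Python's negative indexing handles for free):

```python
def bwt_from_suffix_array(s):
    # Suffix array: starting positions of the suffixes in sorted order.
    # (Naive O(n^2 log n) construction, enough for a demonstration.)
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT[i] = character preceding the i-th smallest suffix;
    # s[-1] wraps to the end-of-string marker when sa[i] == 0.
    return ''.join(s[i - 1] for i in sa)

print(bwt_from_suffix_array("banana#"))  # → annb#aa
```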
