
7 Compression

Sebastian Wild
16 March 2020

version 2020-04-20 15:47


Outline

7 Compression
7.1 Context
7.2 Character Encodings
7.3 Huffman Codes
7.4 Run-Length Encoding
7.5 Lempel-Ziv-Welch
7.6 Move-to-Front Transformation
7.7 Burrows-Wheeler Transform
7.1 Context
Overview
• Unit 4–6: How to work with strings
  • finding substrings
  • finding approximate matches
  • finding repeated parts
  • ...

• Unit 7–8: How to store strings
  • computer memory: must be binary
  • how to compress strings (save space)
  • how to robustly transmit over noisy channels → Unit 8

1
Terminology
• source text: string S ∈ Σ_S★ to be stored / transmitted, where Σ_S is some alphabet

• coded text: encoded data C ∈ Σ_C★ that is actually stored / transmitted;
  usually we use Σ_C = {0, 1}

• encoding: algorithm mapping source texts to coded texts

• decoding: algorithm mapping coded texts back to the original source text

2
What is a good encoding scheme?
• Depending on the application, goals can be
  • efficiency of encoding/decoding
  • resilience to errors/noise in transmission
  • security (encryption)
  • integrity (detect modifications made by third parties)
  • size

• Focus in this unit: size of coded text

  Encoding schemes that (try to) minimize the size of coded texts perform data compression.

• We will measure the compression ratio:

      (|C| · lg |Σ_C|) / (|S| · lg |Σ_S|),  which for Σ_C = {0, 1} equals  |C| / (|S| · lg |Σ_S|)

  < 1 means successful compression
  = 1 means no compression
  > 1 means “compression” made it bigger!? (yes, that happens . . . )

3
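The ratio is easy to compute directly; the following small Python helper (an illustrative sketch, not part of the slides; the function name is made up) mirrors the formula above for binary coded texts:

    import math

    def compression_ratio(source_len, sigma_source, coded_bits):
        # coded size in bits divided by source size in bits (Sigma_C = {0,1})
        return coded_bits / (source_len * math.log2(sigma_source))

    # 8 characters over a 4-letter alphabet, compressed into 14 bits:
    print(compression_ratio(8, 4, 14))  # 0.875, i.e. ~88% -> successful compression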
Types of Data Compression
• Logical vs. Physical
  • Logical compression uses the meaning of the data;
    only applies to a certain domain, e. g., sound recordings
  • Physical compression only knows the (physical) bits in the data, not the meaning behind them

• Lossy vs. Lossless
  • lossy compression can only decode approximately;
    the exact source text S is lost
  • lossless compression always decodes S exactly

• For media files, lossy, logical compression is useful (e. g., JPEG, MPEG)

• We will concentrate on physical, lossless compression algorithms.
  These techniques can be used for any application.
4
What makes data compressible?
• Physical, lossless compression methods mainly exploit
  two types of redundancies in source texts:

  1. uneven character frequencies
     some characters occur more often than others → Part I
  2. repetitive texts
     different parts in the text are (almost) identical → Part II

• There is no such thing as a free lunch!
  Not everything is compressible (→ tutorials)
  → focus on versatile methods that often work
5
Part I
Exploiting character frequencies
7.2 Character Encodings
Character encodings
• Simplest form of encoding: encode each source character individually

• encoding function E : Σ_S → Σ_C★
  • typically |Σ_S| ≫ |Σ_C|, so we need several bits per character
  • for c ∈ Σ_S, we call E(c) the codeword of c

• fixed-length code: |E(c)| is the same for all c ∈ Σ_S

• variable-length code: not all codewords have the same length

6
Fixed-length codes
• fixed-length codes are the simplest type of character encodings

• Example: ASCII (American Standard Code for Information Interchange, 1963)


0000000 NUL 0010000 DLE 0100000 0110000 0 1000000 @ 1010000 P 1100000 ‘ 1110000 p
0000001 SOH 0010001 DC1 0100001 ! 0110001 1 1000001 A 1010001 Q 1100001 a 1110001 q
0000010 STX 0010010 DC2 0100010 " 0110010 2 1000010 B 1010010 R 1100010 b 1110010 r
0000011 ETX 0010011 DC3 0100011 # 0110011 3 1000011 C 1010011 S 1100011 c 1110011 s
0000100 EOT 0010100 DC4 0100100 $ 0110100 4 1000100 D 1010100 T 1100100 d 1110100 t
0000101 ENQ 0010101 NAK 0100101 % 0110101 5 1000101 E 1010101 U 1100101 e 1110101 u
0000110 ACK 0010110 SYN 0100110 & 0110110 6 1000110 F 1010110 V 1100110 f 1110110 v
0000111 BEL 0010111 ETB 0100111 ’ 0110111 7 1000111 G 1010111 W 1100111 g 1110111 w
0001000 BS 0011000 CAN 0101000 ( 0111000 8 1001000 H 1011000 X 1101000 h 1111000 x
0001001 HT 0011001 EM 0101001 ) 0111001 9 1001001 I 1011001 Y 1101001 i 1111001 y
0001010 LF 0011010 SUB 0101010 * 0111010 : 1001010 J 1011010 Z 1101010 j 1111010 z
0001011 VT 0011011 ESC 0101011 + 0111011 ; 1001011 K 1011011 [ 1101011 k 1111011 {
0001100 FF 0011100 FS 0101100 , 0111100 < 1001100 L 1011100 \ 1101100 l 1111100 |
0001101 CR 0011101 GS 0101101 - 0111101 = 1001101 M 1011101 ] 1101101 m 1111101 }
0001110 SO 0011110 RS 0101110 . 0111110 > 1001110 N 1011110 ^ 1101110 n 1111110 ~
0001111 SI 0011111 US 0101111 / 0111111 ? 1001111 O 1011111 _ 1101111 o 1111111 DEL

• 7 bits per character

• just enough for English letters and a few symbols (plus control characters)

• Example: Hello ↦→ 1001000 1100101 1101100 1101100 1101111
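
A two-line Python sketch (illustrative only) reproduces this fixed-length encoding, since ASCII coincides with the first 128 Unicode code points:

    def ascii7_encode(s):
        # 7-bit fixed-length code: every character becomes exactly 7 bits
        return " ".join(format(ord(c), "07b") for c in s)

    print(ascii7_encode("Hello"))  # 1001000 1100101 1101100 1101100 1101111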


Fixed-length codes – Discussion
+ Encoding & decoding as fast as it gets

− Unless all characters are equally likely, it wastes a lot of space

− inflexible (how to support adding a new character?)

8
Variable-length codes
• to gain more flexibility, we have to allow different lengths for codewords

• actually an old idea: Morse code

  https://commons.wikimedia.org/wiki/File:Morse-code-tree.svg
  https://commons.wikimedia.org/wiki/File:International_Morse_Code.svg

9
Variable-length codes – UTF-8
• Modern example: UTF-8 encoding of Unicode:
  the default encoding for text files, XML, HTML since 2009
  • encodes any Unicode character (137 994 as of May 2019, and counting)
  • uses 1–4 bytes (codeword lengths: 8, 16, 24, or 32 bits)
  • every ASCII character is encoded in 1 byte, with leading bit 0 followed by the 7 ASCII bits
  • non-ASCII characters start with 2–4 1s indicating the total number of bytes,
    followed by a 0 and 3–5 payload bits.
    The remaining bytes each start with 10 followed by 6 payload bits.

  Char. number range     UTF-8 octet sequence
  (hexadecimal)          (binary)
  0000 0000-0000 007F    0xxxxxxx
  0000 0080-0000 07FF    110xxxxx 10xxxxxx
  0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
  0001 0000-0010 FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

• For English text, most characters use only 8 bits,
  but we can include any Unicode character as well.
10
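Python’s built-in UTF-8 codec makes these codeword lengths easy to inspect; a small sketch (the example characters are arbitrary):

    for ch in ["A", "é", "€", "🙂"]:
        b = ch.encode("utf-8")
        print(ch, len(b), "byte(s):", " ".join(f"{byte:08b}" for byte in b))
    # A   1 byte(s): 01000001
    # é   2 byte(s): 11000011 10101001
    # €   3 byte(s): 11100010 10000010 10101100
    # 🙂  4 byte(s): 11110000 10011111 10011001 10000010

Note how every leading byte matches a row of the table above and every continuation byte starts with 10.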
Pitfall in variable-length codes

• Suppose we have the following code:    c     a  n   b    s
                                         E(c)  0  10  110  100

• Happily encode text S = banana with the coded text C = 110 0 10 0 10 0
                                                          b  a n  a n  a

• But C = 1100100100 decodes both to banana and to bass: 110 0 100 100
                                                          b  a  s   s

  → not a valid code . . . (cannot tolerate ambiguity) – but how should we have known?

• E(n) = 10 is a (proper) prefix of E(s) = 100
  → leaves the decoder wondering whether to stop after reading 10 or to continue

• Require a prefix-free code: no codeword is a prefix of another.
  prefix-free =⇒ instantaneously decodable
11
Code tries
• From now on we only consider prefix-free codes E:
  E(c) is not a prefix of E(c′) for any c, c′ ∈ Σ_S.

• Example:   c     A   E    N    O    T   ␣
             E(c)  01  101  001  100  11  000

• Any prefix-free code corresponds to a (code) trie (trie of codewords)
  with the characters of Σ_S at the leaves: left edges carry 0-bits, right edges 1-bits;
  here the leaves (by codeword) are ␣ = 000, N = 001, A = 01, O = 100, E = 101, T = 11.
  There is no need for end-of-string symbols $ here (already prefix-free!)

• Encode AN␣ANT → 010010000100111

• Decode 111000001010111 → TO␣EAT
12
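Decoding a prefix-free code amounts to walking the code trie bit by bit; the sketch below (illustrative, using a plain dictionary in place of an explicit trie) decodes the example code:

    CODE = {"A": "01", "E": "101", "N": "001", "O": "100", "T": "11", "␣": "000"}
    INV = {v: k for k, v in CODE.items()}   # codeword -> character (trie leaves)

    def decode(bits):
        out, prefix = [], ""
        for b in bits:
            prefix += b              # follow one trie edge
            if prefix in INV:        # reached a leaf
                out.append(INV[prefix])
                prefix = ""          # restart at the root
        assert prefix == "", "input ended in the middle of a codeword"
        return "".join(out)

    print(decode("111000001010111"))  # TO␣EAT

Prefix-freeness is what makes the greedy “emit as soon as a codeword matches” rule correct.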
Who decodes the decoder?
• Depending on the application, we have to store/transmit the used code!
• We distinguish:
  • fixed coding: code agreed upon in advance, not transmitted (e. g., Morse, UTF-8)
  • static coding: code depends on the message, but stays the same for the entire message;
    it must be transmitted (e. g., Huffman codes → next)
  • adaptive coding: code depends on the message and changes during encoding;
    implicitly stored within the message (e. g., LZW → below)

13
7.3 Huffman Codes
Character frequencies
• Goal: find a character encoding that produces short coded text

• Convention here: fix Σ_C = {0, 1} (binary codes), abbreviate Σ = Σ_S

• Observation: some letters occur more often than others.
  Typical English prose:

  e 12.70%   d 4.25%   p 1.93%
  t  9.06%   l 4.03%   b 1.49%
  a  8.17%   c 2.78%   v 0.98%
  o  7.51%   u 2.76%   k 0.77%
  i  6.97%   m 2.41%   j 0.15%
  n  6.75%   w 2.36%   x 0.15%
  s  6.33%   f 2.23%   q 0.10%
  h  6.09%   g 2.02%   z 0.07%
  r  5.99%   y 1.97%

• Want shorter codes for more frequent characters!


14
Huffman coding

• Given: Σ and weights w : Σ → ℝ≥0   (e. g., frequencies / probabilities)

• Goal: prefix-free code E (= code trie) for Σ that minimizes the coded text length,
  i. e., a code trie minimizing  Σ_{c∈Σ} w(c) · |E(c)|

• If we use w(c) = #occurrences of c in S,
  this is the character encoding with the smallest possible |C|
  → best possible character-wise encoding

• Quite ambitious! Is this efficiently possible?
15
Huffman’s algorithm
• Actually, yes! A greedy/myopic approach succeeds here.

Huffman’s algorithm:
1. Find two characters a, b with lowest weights.
   We will encode them with the same prefix, plus one distinguishing bit,
   i. e., E(a) = u0 and E(b) = u1 for a bitstring u ∈ {0, 1}★ (u to be determined)
2. (Conceptually) replace a and b by a single character “ab”
   with w(ab) = w(a) + w(b).
3. Recursively apply Huffman’s algorithm on the smaller alphabet.
   This in particular determines u = E(ab).

• efficient implementation using a (min-oriented) priority queue (see the sketch below)
  • start by inserting all characters with their weight as key
  • step 1 uses two deleteMin calls
  • step 2 inserts a new character with the sum of the old weights as key
16
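A minimal Python sketch of this priority-queue implementation (using heapq; note that it breaks ties arbitrarily via a counter, not by the COMP 526 rule defined below):

    import heapq
    from itertools import count

    def huffman(weights):
        # weights: dict char -> weight; returns dict char -> codeword
        tiebreak = count()   # makes heap entries comparable
        heap = [(w, next(tiebreak), {c: ""}) for c, w in weights.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w0, _, code0 = heapq.heappop(heap)   # two lowest-weight trees
            w1, _, code1 = heapq.heappop(heap)
            merged = {c: "0" + u for c, u in code0.items()}   # prepend distinguishing bit
            merged.update({c: "1" + u for c, u in code1.items()})
            heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
        return heap[0][2]

    print(huffman({"E": 1, "L": 2, "O": 1, "S": 4}))
    # {'S': '0', 'L': '10', 'E': '110', 'O': '111'} – same codeword lengths (1, 2, 3, 3)
    # as the code derived in the example below, hence equally optimal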
Huffman’s algorithm – Example
• Example text: S = LOSSLESS → Σ_S = {E, L, O, S}

• Character frequencies: E: 1, L: 2, O: 1, S: 4

• Trace: merge E (1) and O (1) into a tree of weight 2 (E left, O right);
  merge that tree (2) with L (2) into a tree of weight 4 (tree left, L right);
  merge that tree (4) with S (4) into the final tree of weight 8 (tree left, S right).

• Huffman tree (code trie for the Huffman code), reading 0 for left and 1 for right:
  S = 1, L = 01, E = 000, O = 001

  LOSSLESS → 01001110100011     compression ratio: 14 / (8 · lg 4) = 14/16 ≈ 88%

17
Huffman tree – tie breaking
• The above procedure is ambiguous:
  • which characters to choose when weights are equal?
  • which subtree goes left, which goes right?

• For COMP 526: always use the following rule:
  1. To break ties when selecting the two characters,
     first use the smallest letter according to the alphabetical order,
     or the tree containing the smallest alphabetical letter.
  2. When combining two trees of different values,
     place the lower-valued tree on the left (corresponding to a 0-bit).
  3. When combining trees of equal value,
     place the one containing the smallest letter to the left.

18
Huffman code – Optimality
Theorem 7.1 (Optimality of Huffman’s Algorithm)
Given Σ and w : Σ → ℝ≥0, Huffman’s algorithm computes codewords E : Σ → {0, 1}★ with
minimal expected codeword length ℓ(E) = Σ_{c∈Σ} w(c) · |E(c)| among all prefix-free codes
for Σ.

Proof sketch: by induction over σ = |Σ|

• Consider any optimal prefix-free code E★ (as its code trie).
• code trie ⇒ ∃ two sibling leaves x, y at largest depth D
• swap characters between leaves so that the two lowest-weight characters a, b sit in x, y
  (that can only make ℓ smaller, so the code stays optimal)
• any optimal code for Σ′ = Σ \ {a, b} ∪ {ab} yields an optimal code for Σ
  by replacing leaf ab with an internal node with children a and b.
• the recursive call yields an optimal code for Σ′ by the inductive hypothesis,
  so Huffman’s algorithm finds an optimal code for Σ.
19
Entropy
Definition 7.2 (Entropy)
Given probabilities p_1, . . . , p_n (for outcomes 1, . . . , n of a random variable), the entropy of
the distribution is defined as

    H(p_1, . . . , p_n)  =  − Σ_{i=1}^n p_i · lg p_i  =  Σ_{i=1}^n p_i · lg(1/p_i)

• entropy is a measure of the information content of a distribution
  • more precisely: the expected number of bits (Yes/No questions) required
    to nail down the random value
  • would ideally encode value i using lg(1/p_i) bits;
    that is not always possible – cannot use 1.5 bits . . . but:

Theorem 7.3 (Entropy bounds for Huffman codes)
For any Σ = {a_1, . . . , a_σ} and w : Σ → ℝ≥0 and its Huffman code E, we have

    H(w(a_1)/W, . . . , w(a_σ)/W)  ≤  ℓ(E)  ≤  H(w(a_1)/W, . . . , w(a_σ)/W) + 1

where W = w(a_1) + · · · + w(a_σ).
20
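For the LOSSLESS example from above the lower bound is tight; a quick check in Python (the entropy helper is an illustrative sketch):

    import math

    def entropy(weights):
        W = sum(weights)
        return sum(w / W * math.log2(W / w) for w in weights if w > 0)

    print(entropy([1, 2, 1, 4]))   # 1.75, for E, L, O, S in LOSSLESS
    # Huffman's expected codeword length there was 14/8 = 1.75, so l(E) = H exactly
    # (equality holds because every probability 1/8, 2/8, 1/8, 4/8 is a power of 1/2)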
Clicker Question

When is Huffman coding more efficient than a fixed-length encoding?

A  always  ✗
B  when H ≈ lg(σ)
C  when H < lg(σ)
D  when H < lg(σ) − 1  ✓
E  when H ≈ 1

pingo.upb.de/622222
21
Encoding with Huffman code
• The overall encoding procedure is as follows:
  • Pass 1: count character frequencies in S
  • construct the Huffman code E (as above)
  • store the Huffman code in C (details omitted)
  • Pass 2: encode each character in S using E and append the result to C

• Decoding works as follows:
  • decode the Huffman code E from C (details omitted)
  • decode S character by character from C using the code trie.

• Note: decoding is much simpler/faster!

22
Huffman coding – Discussion
• running time complexity: O(σ log σ) to construct the code
  • build PQ + σ · (2 deleteMins and 1 insert)
  • can do Θ(σ) time when the characters are already sorted by weight
  • time for encoding: O(n + |C|)

• many variations in use (tie-breaking rules, estimated frequencies, adaptive encoding, . . . )

+ optimal prefix-free character encoding

+ very fast decoding
  (but only on error-free input: in the worst case, a single transmission error
  can corrupt ALL remaining characters of the text!)

− needs 2 passes over the source text for encoding
  → one-pass variants are possible, but more complicated

− have to store the code alongside the coded text
23
Part II
Compressing repetitive texts

Beyond Character Encoding
• Many “natural” texts show repetitive redundancy

  All work and no play makes Jack a dull boy. All work and no play makes Jack a dull
  boy. All work and no play makes Jack a dull boy. All work and no play makes Jack
  a dull boy. All work and no play makes Jack a dull boy. All work and no play makes
  Jack a dull boy. All work and no play makes Jack a dull boy. All work and no play
  makes Jack a dull boy. All work and no play makes Jack a dull boy. All work and no
  play makes Jack a dull boy. All work and no play makes Jack a dull boy. All work and
  no play makes Jack a dull boy. All work and no play makes Jack a dull boy. All work
  and no play makes Jack a dull boy. All work and no play makes Jack a dull boy. All
  work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.

• character-by-character encoding will not capture such repetitions
  → Huffman won’t compress this very much

• Have to encode whole phrases of S by a single codeword
24
7.4 Run-Length Encoding
Run-Length encoding
• simplest form of repetition: runs of characters (the same character repeated)

• here: only consider Σ_S = {0, 1} (work on a binary representation)
  • can be extended to larger alphabets

  Example input (a binary image, stored row by row – long runs of 0s and 1s):
  000000000000000000000000000000000000000
  000000000000000000000000000000000000000
  000000000000000000000000000000000000000
  000101100100000111111000000000011111000
  001111111110001111111110000001111111000
  001111011010001110001111000011100000000
  001100000000000000000111000111000000000
  001100000000000000000011001110000000000
  001100000000000000000011001110000000000
  001101100000000000000111001100111110000
  001111111100000000000111001111111111000
  001110111110000000001110001111100111100
  000000000111000000011100001110000001110
  000000000111000000011000001110000001100
  000000000011000000110000000110000001110
  000000000011000001110000001110000001100
  000000000111000111000000000110000001110
  000000000110000111000000000111000011100
  001101111110001111011101000011111111000
  011111111100011111111111100001111110000
  000101100000001010011001000000100100000
  000000000000000000000000000000000000000
  000000000000000000000000000000000000000

• run-length encoding (RLE): use runs as phrases,
  e. g., S = 00000 111 0000 has runs of lengths 5, 3, 4

• We have to store
  • the first bit of S (either 0 or 1)
  • the length of each run
  • Note: we don’t have to store the bit for later runs since runs must alternate.

• The example becomes: 0, 5, 3, 4

• Question: how to encode a run length k in binary? (k can be arbitrarily large!)

25
Clicker Question

How would you encode a run length k that can be arbitrarily large?

pingo.upb.de/622222
26
Elias codes
• Need a prefix-free encoding for ℕ = {1, 2, 3, . . .}
  • must allow arbitrarily large integers
  • must know when to stop reading

• But that’s simple! Just use unary encoding!
  7 ↦→ 00000001   3 ↦→ 0001   0 ↦→ 1   30 ↦→ 0000000000000000000000000000001

  Much too long
  (wasn’t the whole point of RLE to get rid of long runs??)

• Refinement: Elias gamma code
  • store the length ℓ of the binary representation in unary
  • followed by the binary digits themselves
  • little tricks:
    • always ℓ ≥ 1, so store ℓ − 1 instead
    • the binary representation always starts with 1 → don’t need the terminating 1 in unary

• Elias gamma code = ℓ − 1 zeros, followed by the binary representation

  Examples: 1 ↦→ 1,  3 ↦→ 011,  5 ↦→ 00101,  30 ↦→ 000011110
27
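A direct Python transcription of the gamma code (a sketch; the function names are ours):

    def gamma_encode(k):
        # Elias gamma code of k >= 1: (l - 1) zeros, then k in binary
        assert k >= 1
        b = bin(k)[2:]                  # binary representation, starts with 1
        return "0" * (len(b) - 1) + b

    def gamma_decode_stream(bits, i=0):
        # decode one gamma codeword starting at bits[i]; return (k, next position)
        j = i
        while bits[j] == "0":           # count leading zeros -> l - 1
            j += 1
        l = j - i + 1
        return int(bits[j:j + l], 2), j + l

    print(gamma_encode(30))                  # 000011110
    print(gamma_decode_stream("001110101"))  # (7, 5): '00' + '111' encodes 7

Because no codeword is a prefix of another (“read zeros until the first 1, then that many more bits”), gamma codes can be concatenated freely – exactly what RLE needs.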
Clicker Question

Decode the first number in Elias gamma code (at the beginning)
of the following bitstream:

000110111011100110.

pingo.upb.de/622222
28
Run-length encoding – Examples
• Encoding:
  S = 11111110010000000000000000000011111111111
  (runs: 7 ones, 2 zeros, 1 one, 20 zeros, 11 ones)

  First bit 1, then the Elias gamma codes of the run lengths:
  k = 7  → 00111
  k = 2  → 010
  k = 1  → 1
  k = 20 → 000010100
  k = 11 → 0001011

  C = 1 00111 010 1 000010100 0001011 = 10011101010000101000001011

  Compression ratio: 26/41 ≈ 63%

• Decoding:
  C = 00001101001001010

  First bit b = 0, then read gamma codes (ℓ − 1 zeros, then ℓ binary digits),
  flipping b after every run:
  ℓ = 3 + 1, k = 13 → 0000000000000
  ℓ = 2 + 1, k = 4  → 1111
  ℓ = 0 + 1, k = 1  → 0
  ℓ = 1 + 1, k = 2  → 11

  S = 00000000000001111011

29
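Putting the pieces together, a sketch of the whole binary RLE codec (it assumes gamma_encode and gamma_decode_stream from the Elias-code sketch above are in scope):

    def rle_encode(s):
        out = [s[0]]                            # first bit of S
        i = 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1                          # scan to the end of the run
            out.append(gamma_encode(j - i))     # gamma code of the run length
            i = j
        return "".join(out)

    def rle_decode(c):
        bit, i, out = c[0], 1, []
        while i < len(c):
            k, i = gamma_decode_stream(c, i)
            out.append(bit * k)
            bit = "1" if bit == "0" else "0"    # runs alternate
        return "".join(out)

    s = "11111110010000000000000000000011111111111"
    assert rle_encode(s) == "10011101010000101000001011"   # 26 vs. 41 bits
    assert rle_decode(rle_encode(s)) == s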
Run-length encoding – Discussion
• extensions to larger alphabets are possible (must store the next character then)

• used in some image formats (e. g., TIFF)

+ fairly simple and fast

+ can compress n bits to Θ(log n)!
  (for the extreme case of a constant number of runs)

− negligible compression for many common types of data
  • no compression until run lengths k ≥ 6
  • expansion when run lengths k = 2 or 4
30
7.5 Lempel-Ziv-Welch
Warmup

https://www.flickr.com/photos/quintanaroo/2742726346

https://classic.csunplugged.org/text-compression/
31
Clicker Question

Write down the second-to-last line of the above poem!

pingo.upb.de/622222
32
Lempel-Ziv Compression
• Huffman and RLE mostly take advantage of frequent or repeated single characters.

• Observation: certain substrings are much more frequent than others.
  • in English text: the, be, to, of, and, a, in, that, have, I
  • in HTML: “<a href”, “<img src”, “<br/>”

• Lempel-Ziv stands for a family of adaptive compression algorithms.
  • Idea: store repeated parts by reference!
  • each codeword refers to
    • either a single character in Σ_S,
    • or a substring of S (that both encoder and decoder have already seen).

• Variants of Lempel-Ziv compression
  • “LZ77” original version (“sliding window”)
    derivatives: LZSS, LZFG, LZRW, LZP, DEFLATE, . . .
    DEFLATE used in (pk)zip, gzip, PNG
  • “LZ78” second (slightly improved) version
    derivatives: LZW, LZMW, LZAP, LZY, . . .
    LZW used in compress, GIF
33
Lempel-Ziv-Welch
• here: Lempel-Ziv-Welch (LZW) (arguably the “cleanest” variant of Lempel-Ziv)

• variable-to-fixed encoding
  • all codewords have k bits (typical: k = 12) → fixed length
  • but they represent a variable portion of the source text!

• maintain a dictionary D with 2^k entries → codewords = indices into the dictionary
  • initially, the first |Σ_S| entries encode single characters (rest is empty)
  • add a new entry to D after each step:
    Encoding: after encoding a substring x of S,
    add xc to D, where c is the character that follows x in S.

    Example (S = hannah␣bans␣bananas): after encoding the phrase x = ban,
    the next character is c = a, so add xc = bana to the dictionary.

  • → new codeword in D
  • D actually stores codewords for x and c, not the expanded string
34
LZW encoding – Example
Input: YO!␣YOU!␣YOUR␣YOYO!   Σ_S = ASCII character set (0–127)

  phrase  Y   O   !   ␣   YO   U   !␣   YOU  R   ␣Y   O   YO   !
  C =     89  79  33  32  128  85  130  132  82  131  79  128  33

New dictionary entries (D initially holds the single ASCII characters,
e. g., 32 = ␣, 33 = !, 79 = O, 82 = R, 85 = U, 89 = Y):

  128 YO    132 YOU   136 R␣
  129 O!    133 U!    137 ␣YO
  130 !␣    134 !␣Y   138 OY
  131 ␣Y    135 YOUR  139 YO!

35
LZW encoding – Code
procedure LZWencode(S[0..n))
    x := ε                   // previous phrase, initially empty
    C := ε                   // output, initially empty
    D := dictionary, initialized with codes for c ∈ Σ_S   // stored as trie
    k := |Σ_S|               // next free codeword
    for i := 0, . . . , n − 1 do
        c := S[i]
        if D.containsKey(xc) then
            x := xc
        else
            C := C · D.get(x)    // append codeword for x
            D.put(xc, k)         // add xc to D, assigning next free codeword
            k := k + 1; x := c
    end for
    C := C · D.get(x)
    return C

36
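The same algorithm in runnable Python (a sketch: the dictionary is keyed by strings instead of the trie the pseudocode suggests, which is shorter to write but slower due to string copying):

    def lzw_encode(s, sigma=128):
        d = {chr(i): i for i in range(sigma)}   # codes 0..sigma-1 = single characters
        k = sigma                               # next free codeword
        x, out = "", []
        for c in s:
            if x + c in d:
                x = x + c                       # extend the current phrase
            else:
                out.append(d[x])                # emit codeword for x
                d[x + c] = k                    # add xc to the dictionary
                k += 1
                x = c
        out.append(d[x])                        # emit the last phrase
        return out

    print(lzw_encode("YO! YOU! YOUR YOYO!"))
    # [89, 79, 33, 32, 128, 85, 130, 132, 82, 131, 79, 128, 33] – as in the example above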
LZW decoding
• The decoder has to replay the process of growing the dictionary!

• Decoding: after decoding a substring y of S, add xc to D,
  where x is the previously encoded/decoded substring of S,
  and c = y[0] (the first character of y)

  Example (S = hannah␣bans␣bananas): after decoding y = an (inside bananas),
  the previous phrase was x = ban, so add xc = bana to the dictionary.

• Note: only start adding to D after the second substring of S is decoded
37
LZW decoding – Example
• Same idea: build the dictionary while reading the string.

• Example input: 67 65 78 32 66 129 133
  (D initially holds the single ASCII characters, e. g., 32 = ␣, 65 = A, 66 = B, 67 = C, 78 = N, 83 = S)

  input  decodes to   new entry: Code #   String (human)   String (computer)
  67     C
  65     A            128        CA       67, A
  78     N            129        AN       65, N
  32     ␣            130        N␣       78, ␣
  66     B            131        ␣B       32, B
  129    AN           132        BA       66, A
  133    ???          133

38
LZW decoding – Bootstrapping
• example: we want to decode 133, but it is not yet in the dictionary!
  the decoder is “one step behind” in creating the dictionary

• the problem occurs if we want to use a code that we are just about to build.
• But then we actually know what is going on:
  • Situation: we decode using k in the very step that will define k.
  • the decoder knows the last phrase x and needs the phrase y = D[k] = xc.
    1. en/decode x.
    2. store D[k] := xc
    3. the next phrase y equals D[k]
    → D[k] = xc = x · x[0]  (all known: c is the first character of y, which is the first character of x)

  In the running example (CAN␣BANANAS): x = AN, so D[133] = AN · A = ANA.
39
LZW decoding – Code
procedure LZWdecode(C[0..m))
    D := dictionary [0..2^d) → Σ_S^+, initialized with codes for c ∈ Σ_S   // stored as array
    k := |Σ_S|              // next unused codeword
    q := C[0]               // first codeword
    y := D[q]               // look up meaning of q in D
    S := y                  // output, initially the first phrase
    for j := 1, . . . , m − 1 do
        x := y              // remember last decoded phrase
        q := C[j]           // next codeword
        if q == k then
            y := x · x[0]   // bootstrap case
        else
            y := D[q]
        S := S · y          // append decoded phrase
        D[k] := x · y[0]    // store new phrase
        k := k + 1
    end for
    return S
40
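And the decoder in Python (a sketch matching the pseudocode, including the bootstrap case q == k; the round-trip check assumes lzw_encode from the earlier sketch):

    def lzw_decode(codes, sigma=128):
        d = {i: chr(i) for i in range(sigma)}
        k = sigma
        y = d[codes[0]]                        # first phrase
        s = [y]
        for q in codes[1:]:
            x = y                              # remember last decoded phrase
            y = x + x[0] if q == k else d[q]   # bootstrap: D[k] = x + x[0]
            s.append(y)
            d[k] = x + y[0]                    # store new phrase
            k += 1
        return "".join(s)

    print(lzw_decode([67, 65, 78, 32, 66, 129, 133, 83]))   # CAN BANANAS
    assert lzw_decode(lzw_encode("YO! YOU! YOUR YOYO!")) == "YO! YOU! YOUR YOYO!"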
LZW decoding – Example continued
• Example input: 67 65 78 32 66 129 133 83

  input  decodes to   new entry: Code #   String (human)   String (computer)
  67     C
  65     A            128        CA       67, A
  78     N            129        AN       65, N
  32     ␣            130        N␣       78, ␣
  66     B            131        ␣B       32, B
  129    AN           132        BA       66, A
  133    ANA          133        ANA      129, A
  83     S            134        ANAS     133, S

  → S = CAN␣BANANAS
41
Clicker Question

How many phrases will LZW create on S = aⁿ, a run of n copies of a?

A  ∼ n            F  Θ(log n)
B  ∼ n/2          G  Θ(log log n)
C  ∼ n/4          H  2
D  Θ(n / log n)   I  1
E  Θ(√n)  ✓

pingo.upb.de/622222
42
LZW – Discussion
• As presented, LZW uses the coded alphabet Σ_C = [0..2^d).
  → can apply another encoding to the code numbers, e. g., Huffman

• need a rule for when the dictionary is full; different options:
  • increment d → longer codewords
  • “flush” the dictionary and start from scratch → limits extra space usage
  • often: reserve a codeword to trigger a flush at any time

• encoding and decoding both run in linear time (assuming |Σ_S| constant)

+ fast encoding & decoding

+ works in the streaming model (no random access, no backtracking on the input needed)

+ significant compression for many types of data

− captures only local repetitions (with a bounded dictionary)
43
Compression summary

                    Huffman codes          Run-length encoding       Lempel-Ziv-Welch

  code type         fixed-to-variable      variable-to-variable      variable-to-fixed
  passes            2-pass                 1-pass                    1-pass
  drawback          must send dictionary   can be worse than ASCII   can be worse than ASCII
  on English text   60% compression        bad on text               45% compression
  strength          optimal binary         good on long runs         good on English text
                    character encoding     (e. g., pictures)
  usage             rarely used directly   rarely used directly      frequently used
  found in          pkzip, JPEG, MP3       fax machines,             GIF, part of PDF,
                                           old picture formats       Unix compress
44
Part III
Text Transforms

Text transformations
• compression is effective if we have one of the following:
  • long runs → RLE
  • frequently used characters → Huffman
  • many (local) repeated substrings → LZW

• but these methods can be frustratingly “blind” to other “obvious” redundancies
  • LZW: repetition too distant → dictionary already flushed
  • Huffman: changing probabilities (local clusters) → averaged out globally
  • RLE: run of alternating pairs of characters → not a run

• Enter: text transformations
  • invertible functions of the text
  • do not by themselves reduce the space usage
  • but help compressors “see” existing redundancy
  • used as pre-/postprocessing in a compression pipeline
45
7.6 Move-to-Front Transformation
Move to Front
• Move to Front (MTF) is a heuristic for self-adjusting linked lists
  • unsorted linked list of objects
  • whenever an element is accessed, it is moved to the front of the list
    (leaving the relative order of the other elements unchanged)
  • the list “learns” probabilities of access to objects,
    making access to frequently requested ones cheaper

• Here: use such a list for storing the source alphabet Σ_S
  • to encode c, access it in the list
  • encode c using its (old) position in the list
  • then apply MTF to the list
  • codewords are integers, i. e., Σ_C = [0..σ)

• clusters of few characters → many small numbers
46
Clicker Question

Assume a MTF list currently contains the items X Y Z A B C, and we


now access A. What is the list content after the MTF rule has been
applied?

pingo.upb.de/622222
47
MTF – Code

• Transform (encode):

  procedure MTF-encode(S[0..n))
      L := list containing Σ_S (sorted order)
      C := ε
      for i := 0, . . . , n − 1 do
          c := S[i]
          p := position of c in L
          C := C · p
          move c to front of L
      end for
      return C

• Inverse transform (decode):

  procedure MTF-decode(C[0..m))
      L := list containing Σ_S (sorted order)
      S := ε
      for j := 0, . . . , m − 1 do
          p := C[j]
          c := character at position p in L
          S := S · c
          move c to front of L
      end for
      return S

• Important: encoding and decoding produce the same accesses to the list
48
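Both directions in Python (a sketch; plain Python lists stand in for the linked list, so index/insert cost O(σ) rather than being pointer operations):

    def mtf_encode(s, alphabet):
        L = list(alphabet)              # source alphabet in sorted order
        out = []
        for c in s:
            p = L.index(c)              # (old) position of c in L
            out.append(p)
            L.insert(0, L.pop(p))       # move c to the front
        return out

    def mtf_decode(code, alphabet):
        L = list(alphabet)              # decoder replays the same list updates
        out = []
        for p in code:
            c = L[p]
            out.append(c)
            L.insert(0, L.pop(p))
        return "".join(out)

    import string
    c = mtf_encode("INEFFICIENCIES", string.ascii_uppercase)
    print(c)   # [8, 13, 6, 7, 0, 3, 6, 1, 3, 4, 3, 3, 3, 18] – see the example below
    assert mtf_decode(c, string.ascii_uppercase) == "INEFFICIENCIES"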
MTF – Example

initial list 𝐿 (positions 0–25):
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

𝑆 = INEFFICIENCIES

𝐶 = 8 13 6 7 0 3 6 1 3 4 3 3 3 18

list 𝐿 after encoding:
  S E I C N F A B D G H J K L M O P Q R T U V W X Y Z

• What does a run in 𝑆 encode to in 𝐶?
• What does a run in 𝐶 mean about the source 𝑆?
MTF – Discussion
• MTF itself does not compress the text (if we store codewords with fixed length)
• its prime use is as part of a longer pipeline
• two simple ideas for encoding the codewords:
  • Elias gamma code → smaller numbers get shorter codewords;
    works well for text with a small “local effective” alphabet
  • Huffman code (better compression, but needs 2 passes)
• but: most effective after BWT (→ next)
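For the first idea, a small sketch of an Elias-gamma-style encoder (one common convention among several; since gamma codes only cover integers ≥ 1, the MTF position 𝑝 is encoded as 𝑝 + 1 here, which is an assumption of this sketch, not something fixed by the notes):

    def elias_gamma(n):
        """Gamma code of n >= 1: (len-1) zeros, then n in binary."""
        b = bin(n)[2:]                   # binary digits of n, no leading zeros
        return "0" * (len(b) - 1) + b

    def encode_positions(code):
        # shift by 1 so the frequent MTF output 0 gets the 1-bit code "1"
        return "".join(elias_gamma(p + 1) for p in code)

    print(encode_positions([0, 0, 0, 3]))   # '1' + '1' + '1' + '00100'

Runs of 0 in the MTF output thus collapse into runs of single 1-bits, which is exactly where the savings come from.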
7.7 Burrows-Wheeler Transform
Burrows-Wheeler Transform
• The Burrows-Wheeler Transform (BWT) is a sophisticated text-transformation technique:
  • the coded text has the same letters as the source, just in a different order
  • But: the coded text is (typically) more compressible with MTF(!)

• The encoding algorithm needs all of 𝑆 (no streaming possible).
  → BWT is a block compression method.

• BWT followed by MTF, RLE, and Huffman is the algorithm used by the bzip2 program.
  It achieves the best compression on English text of any algorithm we have seen:

    4047392  bible.txt
    1191071  bible.txt.gz
     888604  bible.txt.7z
     845635  bible.txt.bz2
BWT transform
• cyclic shift of a string: 𝑇 = time␣flies␣quickly␣ → flies␣quickly␣time␣
  [figure: the string written around a circle; a cyclic shift only changes the starting point]
• add an end-of-word character $ to 𝑆 (as in Unit 6)
  → can recover the original string from any cyclic shift

• The Burrows-Wheeler Transform proceeds in three steps:
  1. Place all cyclic shifts of 𝑆 in a list 𝐿
  2. Sort the strings in 𝐿 lexicographically
  3. 𝐵 is the list of trailing characters (last column, top-down) of each string in 𝐿
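The three steps can be transcribed literally; a naive Python sketch (it materializes all 𝑛 shifts, so it is for illustration only — an efficient method follows). Note it relies on $ comparing smaller than every text character, which holds for letters in ASCII but not, e.g., for the space character:

    def bwt_naive(s):
        """Textbook BWT: sort all cyclic shifts of s, return the last column."""
        assert s.endswith("$")           # unique, smallest end-of-word marker
        n = len(s)
        L = sorted(s[i:] + s[:i] for i in range(n))   # steps 1 + 2
        return "".join(row[-1] for row in L)          # step 3

    print(bwt_naive("abracadabra$"))     # ard$rcaaaabb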
BWT transform – Example

𝑆 = alf␣eats␣alfalfa$

1. Write all cyclic shifts      2. Sort the cyclic shifts

   alf␣eats␣alfalfa$              $alf␣eats␣alfalfa
   lf␣eats␣alfalfa$a              ␣alfalfa$alf␣eats
   f␣eats␣alfalfa$al              ␣eats␣alfalfa$alf
   ␣eats␣alfalfa$alf              a$alf␣eats␣alfalf
   eats␣alfalfa$alf␣              alf␣eats␣alfalfa$
   ats␣alfalfa$alf␣e              alfa$alf␣eats␣alf
   ts␣alfalfa$alf␣ea              alfalfa$alf␣eats␣
   s␣alfalfa$alf␣eat              ats␣alfalfa$alf␣e
   ␣alfalfa$alf␣eats              eats␣alfalfa$alf␣
   alfalfa$alf␣eats␣              f␣eats␣alfalfa$al
   lfalfa$alf␣eats␣a              fa$alf␣eats␣alfal
   falfa$alf␣eats␣al              falfa$alf␣eats␣al
   alfa$alf␣eats␣alf              lf␣eats␣alfalfa$a
   lfa$alf␣eats␣alfa              lfa$alf␣eats␣alfa
   fa$alf␣eats␣alfal              lfalfa$alf␣eats␣a
   a$alf␣eats␣alfalf              s␣alfalfa$alf␣eat
   $alf␣eats␣alfalfa              ts␣alfalfa$alf␣ea

3. Extract the last column of the sorted shifts:  𝐵 = asff$f␣e␣lllaaata
Clicker Question

What is the relation between suffix array 𝐿[0..𝑛] and BWT 𝐵[0..𝑛] of a string 𝑇[0..𝑛)$?

A  𝐿 can be very easily computed from 𝐵 and 𝑇
B  𝐵 can be very easily computed from 𝐿 and 𝑇   ✓
C  Both A and B
D  Neither A nor B

pingo.upb.de/622222
BWT – Implementation & Properties
Compute the BWT efficiently:
• cyclic shifts of 𝑆 correspond to suffixes of 𝑆
  → BWT is essentially suffix sorting!
• 𝐵[𝑖] = 𝑆[𝐿[𝑖] − 1]  (𝐿 = suffix array!)  (if 𝐿[𝑖] = 0, 𝐵[𝑖] = $)
• Can compute 𝐵 in 𝑂(𝑛) time

   𝑟   sorted cyclic shift   𝐿[𝑟]
   0   $alf␣eats␣alfalfa      16
   1   ␣alfalfa$alf␣eats       8
   2   ␣eats␣alfalfa$alf       3
   3   a$alf␣eats␣alfalf      15
   4   alf␣eats␣alfalfa$       0
   5   alfa$alf␣eats␣alf      12
   6   alfalfa$alf␣eats␣       9
   7   ats␣alfalfa$alf␣e       5
   8   eats␣alfalfa$alf␣       4
   9   f␣eats␣alfalfa$al       2
  10   fa$alf␣eats␣alfal      14
  11   falfa$alf␣eats␣al      11
  12   lf␣eats␣alfalfa$a       1
  13   lfa$alf␣eats␣alfa      13
  14   lfalfa$alf␣eats␣a      10
  15   s␣alfalfa$alf␣eat       7
  16   ts␣alfalfa$alf␣ea       6

Why does the BWT help?
• sorting groups characters by what follows them
  • Example: lf is always preceded by a
• 𝐵 has local clusters of characters → that makes MTF effective
• repeated substrings in 𝑆 → runs of characters in 𝐵 → picked up by RLE
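With the suffix array, steps 1 and 2 collapse into suffix sorting and 𝐵 can be read off directly. A short Python sketch (the sorted(..., key=...) call is a naive stand-in for the linear-time suffix-array construction assumed above):

    def bwt(s):
        """BWT via B[i] = S[L[i] - 1]; s must end with the sentinel '$'."""
        assert s.endswith("$")
        L = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
        # Python's s[-1] is the last character, i.e. '$', so the special
        # case L[i] == 0 needs no extra code:
        return "".join(s[i - 1] for i in L)

    assert bwt("abracadabra$") == "ard$rcaaaabb"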
Inverse BWT
• Great: we can compute the BWT efficiently and it helps compression.
  But how can we decode it? (It is not even obvious that it is invertible at all!)

• “Magic” solution:
  1. Create array 𝐷[0..𝑛] of pairs: 𝐷[𝑟] = (𝐵[𝑟], 𝑟).
  2. Sort 𝐷 stably with respect to the first entry.
  3. Use 𝐷 as a linked list with (char, next entry).

Example:  𝐵 = ard$rcaaaabb

   𝑟     𝐷          sorted 𝐷 (char, next)
   0    (a , 0)     ($ , 3)
   1    (r , 1)     (a , 0)
   2    (d , 2)     (a , 6)
   3    ($ , 3)     (a , 7)
   4    (r , 4)     (a , 8)
   5    (c , 5)     (a , 9)
   6    (a , 6)     (b , 10)
   7    (a , 7)     (b , 11)
   8    (a , 8)     (c , 5)
   9    (a , 9)     (d , 2)
  10    (b , 10)    (r , 1)
  11    (b , 11)    (r , 4)

Following the next pointers, starting from the unique $ entry, spells out
𝑆 = abracadabra$ character by character.
Inverse BWT – The magic revealed
• The inverse BWT is very easy to compute:
  • only sort the individual characters in 𝐵 (not suffixes)
  • 𝑂(𝑛) with counting sort

• but why does this work!?
  • decode char by char
  • can find the unique $ → starting row
  • to get the next char, we need
    (i) the char in the first column of the current row
    (ii) the row holding that char’s copy in the BWT (last column)
  • then we can walk through and decode
  • for (i): first column = characters of 𝐵 in sorted order
  • for (ii): the relative order of equal characters is the same in both columns!
    → the 𝑖th a in the first column = the 𝑖th a in the BWT
    → stably sorting the pairs (𝐵[𝑟], 𝑟) by first entry is enough

Example (𝑇 = bananaban$):

  𝑟   𝐿[𝑟]   shift starting at 𝐿[𝑟]   𝐵[𝑟]
  0    9     $bananaba                 n
  1    5     aban$bana                 n
  2    7     an$banana                 b
  3    3     anaban$ba                 n
  4    1     ananaban$                 b
  5    6     ban$banan                 a
  6    0     bananaban                 $
  7    8     n$bananab                 a
  8    4     naban$ban                 a
  9    2     nanaban$b                 a
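The whole argument condenses into a few lines of code; a minimal Python sketch (sorting the tuples (𝐵[𝑟], 𝑟) is automatically stable here, since ties in the character are broken by the original index 𝑟):

    def inverse_bwt(B):
        """Recover S (ending in '$') from its BWT B."""
        # steps 1 + 2: pairs (B[r], r), sorted by character
        D = sorted((c, r) for r, c in enumerate(B))
        # step 3: walk the linked list, starting from the unique '$' row;
        # the stored index r points to the next row in sorted order
        i = next(r for c, r in D if c == "$")
        out = []
        for _ in range(len(B)):
            c, nxt = D[i]
            out.append(c)
            i = nxt
        return "".join(out)

    assert inverse_bwt("ard$rcaaaabb") == "abracadabra$"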
BWT – Discussion
• Running time: Θ(𝑛)
  • encoding uses suffix sorting
  • decoding only needs counting sort
  • decoding is much simpler & faster (but same Θ-class)

• drawbacks: typically slower than other methods,
  and needs access to the entire text (or must be applied to blocks independently)
• benefit: the BWT–MTF–RLE–Huffman pipeline tends to give the best compression
Clicker Question

Consider 𝑇 = have_had_hadnt_hasnt_havent_has_what$.
The BWT is 𝐵 = tedtttshhhhhhhaavv____w$_edsaaannnaa_.
How can we explain the long run of hs in 𝐵?

A  h is the most frequent character
B  h always appears at the beginning of a word
C  almost all words start with h
D  h is always followed by a
E  all as are preceded by h   ✓
F  h is the 4th character in the alphabet

pingo.upb.de/622222
Bigger Example   (left column: all cyclic shifts of 𝑇; right column: the same shifts, sorted)

have_had_hadnt_hasnt_havent_has_what$ $have_had_hadnt_hasnt_havent_has_what
ave_had_hadnt_hasnt_havent_has_what$h _had_hadnt_hasnt_havent_has_what$have
ve_had_hadnt_hasnt_havent_has_what$ha _hadnt_hasnt_havent_has_what$have_had
e_had_hadnt_hasnt_havent_has_what$hav _has_what$have_had_hadnt_hasnt_havent
_had_hadnt_hasnt_havent_has_what$have _hasnt_havent_has_what$have_had_hadnt
had_hadnt_hasnt_havent_has_what$have_ _havent_has_what$have_had_hadnt_hasnt
ad_hadnt_hasnt_havent_has_what$have_h _what$have_had_hadnt_hasnt_havent_has
d_hadnt_hasnt_havent_has_what$have_ha ad_hadnt_hasnt_havent_has_what$have_h
_hadnt_hasnt_havent_has_what$have_had adnt_hasnt_havent_has_what$have_had_h
hadnt_hasnt_havent_has_what$have_had_ as_what$have_had_hadnt_hasnt_havent_h
adnt_hasnt_havent_has_what$have_had_h asnt_havent_has_what$have_had_hadnt_h
dnt_hasnt_havent_has_what$have_had_ha at$have_had_hadnt_hasnt_havent_has_wh
nt_hasnt_havent_has_what$have_had_had ave_had_hadnt_hasnt_havent_has_what$h
t_hasnt_havent_has_what$have_had_hadn avent_has_what$have_had_hadnt_hasnt_h
_hasnt_havent_has_what$have_had_hadnt d_hadnt_hasnt_havent_has_what$have_ha
hasnt_havent_has_what$have_had_hadnt_ dnt_hasnt_havent_has_what$have_had_ha
asnt_havent_has_what$have_had_hadnt_h e_had_hadnt_hasnt_havent_has_what$hav
snt_havent_has_what$have_had_hadnt_ha ent_has_what$have_had_hadnt_hasnt_hav
nt_havent_has_what$have_had_hadnt_has had_hadnt_hasnt_havent_has_what$have_
t_havent_has_what$have_had_hadnt_hasn hadnt_hasnt_havent_has_what$have_had_
_havent_has_what$have_had_hadnt_hasnt has_what$have_had_hadnt_hasnt_havent_
havent_has_what$have_had_hadnt_hasnt_ hasnt_havent_has_what$have_had_hadnt_
avent_has_what$have_had_hadnt_hasnt_h hat$have_had_hadnt_hasnt_havent_has_w
vent_has_what$have_had_hadnt_hasnt_ha have_had_hadnt_hasnt_havent_has_what$
ent_has_what$have_had_hadnt_hasnt_hav havent_has_what$have_had_hadnt_hasnt_
nt_has_what$have_had_hadnt_hasnt_have nt_has_what$have_had_hadnt_hasnt_have
t_has_what$have_had_hadnt_hasnt_haven nt_hasnt_havent_has_what$have_had_had
_has_what$have_had_hadnt_hasnt_havent nt_havent_has_what$have_had_hadnt_has
has_what$have_had_hadnt_hasnt_havent_ s_what$have_had_hadnt_hasnt_havent_ha
as_what$have_had_hadnt_hasnt_havent_h snt_havent_has_what$have_had_hadnt_ha
s_what$have_had_hadnt_hasnt_havent_ha t$have_had_hadnt_hasnt_havent_has_wha
_what$have_had_hadnt_hasnt_havent_has t_has_what$have_had_hadnt_hasnt_haven
what$have_had_hadnt_hasnt_havent_has_ t_hasnt_havent_has_what$have_had_hadn
hat$have_had_hadnt_hasnt_havent_has_w t_havent_has_what$have_had_hadnt_hasn
at$have_had_hadnt_hasnt_havent_has_wh ve_had_hadnt_hasnt_havent_has_what$ha
t$have_had_hadnt_hasnt_havent_has_wha vent_has_what$have_had_hadnt_hasnt_ha
$have_had_hadnt_hasnt_havent_has_what what$have_had_hadnt_hasnt_havent_has_

• 𝐵 = tedtttshhhhhhhaavv____w$_edsaaannnaa_
• MTF yields 8, 5, 5, 2, 0, 0, 8, 7, 0, 0, 0, 0, 0, 0, 7, 0, 9, 0, 8, 0, 0, 0, 10, 9, 2, 9, 9, 8, 7, 0, 0, 10, 0, 0, 1, 0, 5
Summary of Compression Methods

Huffman   Variable-width, single-character encoding (optimal in this case)
RLE       Variable-width, multiple-character encoding
LZW       Adaptive, fixed-width, multiple-character encoding;
          augments dictionary with repeated substrings
MTF       Adaptive, transforms to smaller integers;
          should be followed by variable-width integer encoding
BWT       Block compression method; should be followed by MTF
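As a closing illustration, the bzip2-style pipeline from the table can be strung together from the sketches above (a toy version only: real bzip2 works blockwise, uses a bounded RLE variant, and Huffman-codes the final stream). It reuses bwt() and mtf_encode() from the earlier sketches:

    def rle_ints(xs):
        """Run-length encode a sequence of integers as (value, length) pairs."""
        out = []
        for x in xs:
            if out and out[-1][0] == x:
                out[-1][1] += 1
            else:
                out.append([x, 1])
        return [tuple(p) for p in out]

    def toy_pipeline(text, alphabet):
        b = bwt(text)                  # clusters equal characters together
        m = mtf_encode(b, alphabet)    # turns clusters into runs of 0
        return rle_ints(m)             # collapses those runs
        # a real compressor would now Huffman-code this stream

    print(toy_pipeline("abracadabra$", "$abcdr"))
    # [(1, 1), (5, 2), (3, 1), (2, 1), (5, 1), (4, 1), (0, 3), (5, 1), (0, 1)]

On such a tiny input nothing actually shrinks; the gains of the pipeline appear on long texts with many repeated substrings, as in the bible.txt comparison earlier in this unit.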
