
7 Compression

Sebastian Wild
16 March 2020

version 2020-04-20 15:47


Outline

7 Compression
7.1 Context
7.2 Character Encodings
7.3 Huffman Codes
7.4 Run-Length Encoding
7.5 Lempel-Ziv-Welch
7.6 Move-to-Front Transformation
7.7 Burrows-Wheeler Transform
7.1 Context
Overview
• Unit 4–6: How to work with strings
  • finding substrings
  • finding approximate matches
  • finding repeated parts
  • ...

• Unit 7–8: How to store strings
  • computer memory: must be binary
  • how to compress strings (save space)
  • how to robustly transmit over noisy channels → Unit 8

1
Terminology
• source text: string S ∈ Σ_S★ to be stored / transmitted, where Σ_S is some alphabet

• coded text: encoded data C ∈ Σ_C★ that is actually stored / transmitted;
  usually we use Σ_C = {0, 1}

• encoding: algorithm mapping source texts to coded texts

• decoding: algorithm mapping coded texts back to the original source text

2
What is a good encoding scheme?
• Depending on the application, goals can be
  • efficiency of encoding/decoding
  • resilience to errors/noise in transmission
  • security (encryption)
  • integrity (detect modifications made by third parties)
  • size

• Focus in this unit: size of coded text

  Encoding schemes that (try to) minimize the size of coded texts perform data compression.

• We will measure the compression ratio:

      (|C| · lg |Σ_C|) / (|S| · lg |Σ_S|),  which for Σ_C = {0, 1} equals  |C| / (|S| · lg |Σ_S|)

  < 1 means successful compression
  = 1 means no compression
  > 1 means “compression” made it bigger!? (yes, that happens . . . )

3
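The ratio is easy to compute directly; the following small Python helper (an illustrative sketch, not part of the slides; the function name is made up) mirrors the formula above for binary coded texts:

    import math

    def compression_ratio(source_len, sigma_source, coded_bits):
        # coded size in bits divided by source size in bits (Sigma_C = {0,1})
        return coded_bits / (source_len * math.log2(sigma_source))

    # 8 characters over a 4-letter alphabet, compressed into 14 bits:
    print(compression_ratio(8, 4, 14))  # 0.875, i.e. ~88% -> successful compression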
Types of Data Compression
• Logical vs. Physical
  • Logical compression uses the meaning of the data;
    only applies to a certain domain, e. g., sound recordings
  • Physical compression only knows the (physical) bits in the data, not the meaning behind them

• Lossy vs. Lossless
  • lossy compression can only decode approximately;
    the exact source text S is lost
  • lossless compression always decodes S exactly

• For media files, lossy, logical compression is useful (e. g., JPEG, MPEG)

• We will concentrate on physical, lossless compression algorithms.
  These techniques can be used for any application.
4
What makes data compressible?
• Physical, lossless compression methods mainly exploit
  two types of redundancies in source texts:

  1. uneven character frequencies
     some characters occur more often than others → Part I
  2. repetitive texts
     different parts in the text are (almost) identical → Part II

• There is no such thing as a free lunch!
  Not everything is compressible (→ tutorials)
  → focus on versatile methods that often work
5
Part I
Exploiting character frequencies
7.2 Character Encodings
Character encodings
• Simplest form of encoding: encode each source character individually

• encoding function E : Σ_S → Σ_C★
  • typically |Σ_S| ≫ |Σ_C|, so we need several bits per character
  • for c ∈ Σ_S, we call E(c) the codeword of c

• fixed-length code: |E(c)| is the same for all c ∈ Σ_S

• variable-length code: not all codewords have the same length

6
Fixed-length codes
• fixed-length codes are the simplest type of character encodings

• Example: ASCII (American Standard Code for Information Interchange, 1963)


0000000 NUL 0010000 DLE 0100000 0110000 0 1000000 @ 1010000 P 1100000 ‘ 1110000 p
0000001 SOH 0010001 DC1 0100001 ! 0110001 1 1000001 A 1010001 Q 1100001 a 1110001 q
0000010 STX 0010010 DC2 0100010 " 0110010 2 1000010 B 1010010 R 1100010 b 1110010 r
0000011 ETX 0010011 DC3 0100011 # 0110011 3 1000011 C 1010011 S 1100011 c 1110011 s
0000100 EOT 0010100 DC4 0100100 $ 0110100 4 1000100 D 1010100 T 1100100 d 1110100 t
0000101 ENQ 0010101 NAK 0100101 % 0110101 5 1000101 E 1010101 U 1100101 e 1110101 u
0000110 ACK 0010110 SYN 0100110 & 0110110 6 1000110 F 1010110 V 1100110 f 1110110 v
0000111 BEL 0010111 ETB 0100111 ’ 0110111 7 1000111 G 1010111 W 1100111 g 1110111 w
0001000 BS 0011000 CAN 0101000 ( 0111000 8 1001000 H 1011000 X 1101000 h 1111000 x
0001001 HT 0011001 EM 0101001 ) 0111001 9 1001001 I 1011001 Y 1101001 i 1111001 y
0001010 LF 0011010 SUB 0101010 * 0111010 : 1001010 J 1011010 Z 1101010 j 1111010 z
0001011 VT 0011011 ESC 0101011 + 0111011 ; 1001011 K 1011011 [ 1101011 k 1111011 {
0001100 FF 0011100 FS 0101100 , 0111100 < 1001100 L 1011100 \ 1101100 l 1111100 |
0001101 CR 0011101 GS 0101101 - 0111101 = 1001101 M 1011101 ] 1101101 m 1111101 }
0001110 SO 0011110 RS 0101110 . 0111110 > 1001110 N 1011110 ^ 1101110 n 1111110 ~
0001111 SI 0011111 US 0101111 / 0111111 ? 1001111 O 1011111 _ 1101111 o 1111111 DEL

• 7 bits per character

• just enough for English letters and a few symbols (plus control characters)

• Example: Hello ↦→ 1001000 1100101 1101100 1101100 1101111
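
A two-line Python sketch (illustrative only) reproduces this fixed-length encoding, since ASCII coincides with the first 128 Unicode code points:

    def ascii7_encode(s):
        # 7-bit fixed-length code: every character becomes exactly 7 bits
        return " ".join(format(ord(c), "07b") for c in s)

    print(ascii7_encode("Hello"))  # 1001000 1100101 1101100 1101100 1101111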


Fixed-length codes – Discussion
+ Encoding & decoding as fast as it gets

− Unless all characters are equally likely, it wastes a lot of space

− inflexible (how to support adding a new character?)

8
Variable-length codes
• to gain more flexibility, we have to allow different lengths for codewords

• actually an old idea: Morse code

  https://commons.wikimedia.org/wiki/File:Morse-code-tree.svg
  https://commons.wikimedia.org/wiki/File:International_Morse_Code.svg

9
Variable-length codes – UTF-8
• Modern example: UTF-8 encoding of Unicode:
  the default encoding for text files, XML, HTML since 2009
  • encodes any Unicode character (137 994 as of May 2019, and counting)
  • uses 1–4 bytes (codeword lengths: 8, 16, 24, or 32 bits)
  • every ASCII character is encoded in 1 byte, with leading bit 0 followed by the 7 ASCII bits
  • non-ASCII characters start with 2–4 1s indicating the total number of bytes,
    followed by a 0 and 3–5 payload bits.
    The remaining bytes each start with 10 followed by 6 payload bits.

  Char. number range     UTF-8 octet sequence
  (hexadecimal)          (binary)
  0000 0000-0000 007F    0xxxxxxx
  0000 0080-0000 07FF    110xxxxx 10xxxxxx
  0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
  0001 0000-0010 FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

• For English text, most characters use only 8 bits,
  but we can include any Unicode character as well.
10
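Python’s built-in UTF-8 codec makes these codeword lengths easy to inspect; a small sketch (the example characters are arbitrary):

    for ch in ["A", "é", "€", "🙂"]:
        b = ch.encode("utf-8")
        print(ch, len(b), "byte(s):", " ".join(f"{byte:08b}" for byte in b))
    # A   1 byte(s): 01000001
    # é   2 byte(s): 11000011 10101001
    # €   3 byte(s): 11100010 10000010 10101100
    # 🙂  4 byte(s): 11110000 10011111 10011001 10000010

Note how every leading byte matches a row of the table above and every continuation byte starts with 10.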
Pitfall in variable-length codes

• Suppose we have the following code:    c     a  n   b    s
                                         E(c)  0  10  110  100

• Happily encode text S = banana with the coded text C = 110 0 10 0 10 0
                                                          b  a n  a n  a

• But C = 1100100100 decodes both to banana and to bass: 110 0 100 100
                                                          b  a  s   s

  → not a valid code . . . (cannot tolerate ambiguity) – but how should we have known?

• E(n) = 10 is a (proper) prefix of E(s) = 100
  → leaves the decoder wondering whether to stop after reading 10 or to continue

• Require a prefix-free code: no codeword is a prefix of another.
  prefix-free =⇒ instantaneously decodable
11
Code tries
• From now on we only consider prefix-free codes E:
  E(c) is not a prefix of E(c′) for any c, c′ ∈ Σ_S.

• Example:   c     A   E    N    O    T   ␣
             E(c)  01  101  001  100  11  000

• Any prefix-free code corresponds to a (code) trie (trie of codewords)
  with the characters of Σ_S at the leaves: left edges carry 0-bits, right edges 1-bits;
  here the leaves (by codeword) are ␣ = 000, N = 001, A = 01, O = 100, E = 101, T = 11.
  There is no need for end-of-string symbols $ here (already prefix-free!)

• Encode AN␣ANT → 010010000100111

• Decode 111000001010111 → TO␣EAT
12
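Decoding a prefix-free code amounts to walking the code trie bit by bit; the sketch below (illustrative, using a plain dictionary in place of an explicit trie) decodes the example code:

    CODE = {"A": "01", "E": "101", "N": "001", "O": "100", "T": "11", "␣": "000"}
    INV = {v: k for k, v in CODE.items()}   # codeword -> character (trie leaves)

    def decode(bits):
        out, prefix = [], ""
        for b in bits:
            prefix += b              # follow one trie edge
            if prefix in INV:        # reached a leaf
                out.append(INV[prefix])
                prefix = ""          # restart at the root
        assert prefix == "", "input ended in the middle of a codeword"
        return "".join(out)

    print(decode("111000001010111"))  # TO␣EAT

Prefix-freeness is what makes the greedy “emit as soon as a codeword matches” rule correct.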
Who decodes the decoder?
• Depending on the application, we have to store/transmit the used code!
• We distinguish:
  • fixed coding: code agreed upon in advance, not transmitted (e. g., Morse, UTF-8)
  • static coding: code depends on the message, but stays the same for the entire message;
    it must be transmitted (e. g., Huffman codes → next)
  • adaptive coding: code depends on the message and changes during encoding;
    implicitly stored within the message (e. g., LZW → below)

13
7.3 Huffman Codes
Character frequencies
• Goal: find a character encoding that produces short coded text

• Convention here: fix Σ_C = {0, 1} (binary codes), abbreviate Σ = Σ_S

• Observation: some letters occur more often than others.
  Typical English prose:

  e 12.70%   d 4.25%   p 1.93%
  t  9.06%   l 4.03%   b 1.49%
  a  8.17%   c 2.78%   v 0.98%
  o  7.51%   u 2.76%   k 0.77%
  i  6.97%   m 2.41%   j 0.15%
  n  6.75%   w 2.36%   x 0.15%
  s  6.33%   f 2.23%   q 0.10%
  h  6.09%   g 2.02%   z 0.07%
  r  5.99%   y 1.97%

• Want shorter codes for more frequent characters!


14
Huffman coding

• Given: Σ and weights w : Σ → ℝ≥0   (e. g., frequencies / probabilities)

• Goal: prefix-free code E (= code trie) for Σ that minimizes the coded text length,
  i. e., a code trie minimizing  Σ_{c∈Σ} w(c) · |E(c)|

• If we use w(c) = #occurrences of c in S,
  this is the character encoding with the smallest possible |C|
  → best possible character-wise encoding

• Quite ambitious! Is this efficiently possible?
15
Huffman’s algorithm
• Actually, yes! A greedy/myopic approach succeeds here.

Huffman’s algorithm:
1. Find two characters a, b with lowest weights.
   We will encode them with the same prefix, plus one distinguishing bit,
   i. e., E(a) = u0 and E(b) = u1 for a bitstring u ∈ {0, 1}★ (u to be determined)
2. (Conceptually) replace a and b by a single character “ab”
   with w(ab) = w(a) + w(b).
3. Recursively apply Huffman’s algorithm on the smaller alphabet.
   This in particular determines u = E(ab).

• efficient implementation using a (min-oriented) priority queue (see the sketch below)
  • start by inserting all characters with their weight as key
  • step 1 uses two deleteMin calls
  • step 2 inserts a new character with the sum of the old weights as key
16
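A minimal Python sketch of this priority-queue implementation (using heapq; note that it breaks ties arbitrarily via a counter, not by the COMP 526 rule defined below):

    import heapq
    from itertools import count

    def huffman(weights):
        # weights: dict char -> weight; returns dict char -> codeword
        tiebreak = count()   # makes heap entries comparable
        heap = [(w, next(tiebreak), {c: ""}) for c, w in weights.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w0, _, code0 = heapq.heappop(heap)   # two lowest-weight trees
            w1, _, code1 = heapq.heappop(heap)
            merged = {c: "0" + u for c, u in code0.items()}   # prepend distinguishing bit
            merged.update({c: "1" + u for c, u in code1.items()})
            heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
        return heap[0][2]

    print(huffman({"E": 1, "L": 2, "O": 1, "S": 4}))
    # {'S': '0', 'L': '10', 'E': '110', 'O': '111'} – same codeword lengths (1, 2, 3, 3)
    # as the code derived in the example below, hence equally optimal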
Huffman’s algorithm – Example
• Example text: S = LOSSLESS → Σ_S = {E, L, O, S}

• Character frequencies: E: 1, L: 2, O: 1, S: 4

• Trace: merge E (1) and O (1) into a tree of weight 2 (E left, O right);
  merge that tree (2) with L (2) into a tree of weight 4 (tree left, L right);
  merge that tree (4) with S (4) into the final tree of weight 8 (tree left, S right).

• Huffman tree (code trie for the Huffman code), reading 0 for left and 1 for right:
  S = 1, L = 01, E = 000, O = 001

  LOSSLESS → 01001110100011     compression ratio: 14 / (8 · lg 4) = 14/16 ≈ 88%

17
Huffman tree – tie breaking
• The above procedure is ambiguous:
  • which characters to choose when weights are equal?
  • which subtree goes left, which goes right?

• For COMP 526: always use the following rule:
  1. To break ties when selecting the two characters,
     first use the smallest letter according to the alphabetical order,
     or the tree containing the smallest alphabetical letter.
  2. When combining two trees of different values,
     place the lower-valued tree on the left (corresponding to a 0-bit).
  3. When combining trees of equal value,
     place the one containing the smallest letter to the left.

18
Huffman code – Optimality
Theorem 7.1 (Optimality of Huffman’s Algorithm)
Given Σ and w : Σ → ℝ≥0, Huffman’s algorithm computes codewords E : Σ → {0, 1}★ with
minimal expected codeword length ℓ(E) = Σ_{c∈Σ} w(c) · |E(c)| among all prefix-free codes
for Σ.

Proof sketch: by induction over σ = |Σ|

• Consider any optimal prefix-free code E★ (as its code trie).
• code trie ⇒ ∃ two sibling leaves x, y at largest depth D
• swap characters between leaves so that the two lowest-weight characters a, b sit in x, y
  (that can only make ℓ smaller, so the code stays optimal)
• any optimal code for Σ′ = Σ \ {a, b} ∪ {ab} yields an optimal code for Σ
  by replacing leaf ab with an internal node with children a and b.
• the recursive call yields an optimal code for Σ′ by the inductive hypothesis,
  so Huffman’s algorithm finds an optimal code for Σ.
19
Entropy
Definition 7.2 (Entropy)
Given probabilities p_1, . . . , p_n (for outcomes 1, . . . , n of a random variable), the entropy of
the distribution is defined as

    H(p_1, . . . , p_n)  =  − Σ_{i=1}^n p_i · lg p_i  =  Σ_{i=1}^n p_i · lg(1/p_i)

• entropy is a measure of the information content of a distribution
  • more precisely: the expected number of bits (Yes/No questions) required
    to nail down the random value
  • would ideally encode value i using lg(1/p_i) bits;
    that is not always possible – cannot use 1.5 bits . . . but:

Theorem 7.3 (Entropy bounds for Huffman codes)
For any Σ = {a_1, . . . , a_σ} and w : Σ → ℝ≥0 and its Huffman code E, we have

    H(w(a_1)/W, . . . , w(a_σ)/W)  ≤  ℓ(E)  ≤  H(w(a_1)/W, . . . , w(a_σ)/W) + 1

where W = w(a_1) + · · · + w(a_σ).
20
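For the LOSSLESS example from above the lower bound is tight; a quick check in Python (the entropy helper is an illustrative sketch):

    import math

    def entropy(weights):
        W = sum(weights)
        return sum(w / W * math.log2(W / w) for w in weights if w > 0)

    print(entropy([1, 2, 1, 4]))   # 1.75, for E, L, O, S in LOSSLESS
    # Huffman's expected codeword length there was 14/8 = 1.75, so l(E) = H exactly
    # (equality holds because every probability 1/8, 2/8, 1/8, 4/8 is a power of 1/2)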
Clicker Question

When is Huffman coding more efficient than a fixed-length encoding?

A  always  ✗
B  when H ≈ lg(σ)
C  when H < lg(σ)
D  when H < lg(σ) − 1  ✓
E  when H ≈ 1

pingo.upb.de/622222
21
Encoding with Huffman code
• The overall encoding procedure is as follows:
  • Pass 1: count character frequencies in S
  • construct the Huffman code E (as above)
  • store the Huffman code in C (details omitted)
  • Pass 2: encode each character in S using E and append the result to C

• Decoding works as follows:
  • decode the Huffman code E from C (details omitted)
  • decode S character by character from C using the code trie.

• Note: decoding is much simpler/faster!

22
Huffman coding – Discussion
• running time complexity: O(σ log σ) to construct the code
  • build PQ + σ · (2 deleteMins and 1 insert)
  • can do Θ(σ) time when the characters are already sorted by weight
  • time for encoding: O(n + |C|)

• many variations in use (tie-breaking rules, estimated frequencies, adaptive encoding, . . . )

+ optimal prefix-free character encoding

+ very fast decoding
  (but only on error-free input: in the worst case, a single transmission error
  can corrupt ALL remaining characters of the text!)

− needs 2 passes over the source text for encoding
  → one-pass variants are possible, but more complicated

− have to store the code alongside the coded text
23
Part II
Compressing repetitive texts

Beyond Character Encoding
• Many “natural” texts show repetitive redundancy

  All work and no play makes Jack a dull boy. All work and no play makes Jack a dull
  boy. All work and no play makes Jack a dull boy. All work and no play makes Jack
  a dull boy. All work and no play makes Jack a dull boy. All work and no play makes
  Jack a dull boy. All work and no play makes Jack a dull boy. All work and no play
  makes Jack a dull boy. All work and no play makes Jack a dull boy. All work and no
  play makes Jack a dull boy. All work and no play makes Jack a dull boy. All work and
  no play makes Jack a dull boy. All work and no play makes Jack a dull boy. All work
  and no play makes Jack a dull boy. All work and no play makes Jack a dull boy. All
  work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.

• character-by-character encoding will not capture such repetitions
  → Huffman won’t compress this very much

• Have to encode whole phrases of S by a single codeword
24
7.4 Run-Length Encoding
Run-Length encoding
• simplest form of repetition: runs of characters (the same character repeated)

• here: only consider Σ_S = {0, 1} (work on a binary representation)
  • can be extended to larger alphabets

  Example input (a binary image, stored row by row – long runs of 0s and 1s):
  000000000000000000000000000000000000000
  000000000000000000000000000000000000000
  000000000000000000000000000000000000000
  000101100100000111111000000000011111000
  001111111110001111111110000001111111000
  001111011010001110001111000011100000000
  001100000000000000000111000111000000000
  001100000000000000000011001110000000000
  001100000000000000000011001110000000000
  001101100000000000000111001100111110000
  001111111100000000000111001111111111000
  001110111110000000001110001111100111100
  000000000111000000011100001110000001110
  000000000111000000011000001110000001100
  000000000011000000110000000110000001110
  000000000011000001110000001110000001100
  000000000111000111000000000110000001110
  000000000110000111000000000111000011100
  001101111110001111011101000011111111000
  011111111100011111111111100001111110000
  000101100000001010011001000000100100000
  000000000000000000000000000000000000000
  000000000000000000000000000000000000000

• run-length encoding (RLE): use runs as phrases,
  e. g., S = 00000 111 0000 has runs of lengths 5, 3, 4

• We have to store
  • the first bit of S (either 0 or 1)
  • the length of each run
  • Note: we don’t have to store the bit for later runs since runs must alternate.

• The example becomes: 0, 5, 3, 4

• Question: how to encode a run length k in binary? (k can be arbitrarily large!)

25
Clicker Question

How would you encode a run length k that can be arbitrarily large?

pingo.upb.de/622222
26
Elias codes
• Need a prefix-free encoding for ℕ = {1, 2, 3, . . .}
  • must allow arbitrarily large integers
  • must know when to stop reading

• But that’s simple! Just use unary encoding!
  7 ↦→ 00000001   3 ↦→ 0001   0 ↦→ 1   30 ↦→ 0000000000000000000000000000001

  Much too long
  (wasn’t the whole point of RLE to get rid of long runs??)

• Refinement: Elias gamma code
  • store the length ℓ of the binary representation in unary
  • followed by the binary digits themselves
  • little tricks:
    • always ℓ ≥ 1, so store ℓ − 1 instead
    • the binary representation always starts with 1 → don’t need the terminating 1 in unary

• Elias gamma code = ℓ − 1 zeros, followed by the binary representation

  Examples: 1 ↦→ 1,  3 ↦→ 011,  5 ↦→ 00101,  30 ↦→ 000011110
27
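A direct Python transcription of the gamma code (a sketch; the function names are ours):

    def gamma_encode(k):
        # Elias gamma code of k >= 1: (l - 1) zeros, then k in binary
        assert k >= 1
        b = bin(k)[2:]                  # binary representation, starts with 1
        return "0" * (len(b) - 1) + b

    def gamma_decode_stream(bits, i=0):
        # decode one gamma codeword starting at bits[i]; return (k, next position)
        j = i
        while bits[j] == "0":           # count leading zeros -> l - 1
            j += 1
        l = j - i + 1
        return int(bits[j:j + l], 2), j + l

    print(gamma_encode(30))                  # 000011110
    print(gamma_decode_stream("001110101"))  # (7, 5): '00' + '111' encodes 7

Because no codeword is a prefix of another (“read zeros until the first 1, then that many more bits”), gamma codes can be concatenated freely – exactly what RLE needs.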
Clicker Question

Decode the first number in Elias gamma code (at the beginning)
of the following bitstream:

000110111011100110.

pingo.upb.de/622222
28
Run-length encoding – Examples
• Encoding:
  S = 11111110010000000000000000000011111111111
  (runs: 7 ones, 2 zeros, 1 one, 20 zeros, 11 ones)

  First bit 1, then the Elias gamma codes of the run lengths:
  k = 7  → 00111
  k = 2  → 010
  k = 1  → 1
  k = 20 → 000010100
  k = 11 → 0001011

  C = 1 00111 010 1 000010100 0001011 = 10011101010000101000001011

  Compression ratio: 26/41 ≈ 63%

• Decoding:
  C = 00001101001001010

  First bit b = 0, then read gamma codes (ℓ − 1 zeros, then ℓ binary digits),
  flipping b after every run:
  ℓ = 3 + 1, k = 13 → 0000000000000
  ℓ = 2 + 1, k = 4  → 1111
  ℓ = 0 + 1, k = 1  → 0
  ℓ = 1 + 1, k = 2  → 11

  S = 00000000000001111011

29
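Putting the pieces together, a sketch of the whole binary RLE codec (it assumes gamma_encode and gamma_decode_stream from the Elias-code sketch above are in scope):

    def rle_encode(s):
        out = [s[0]]                            # first bit of S
        i = 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1                          # scan to the end of the run
            out.append(gamma_encode(j - i))     # gamma code of the run length
            i = j
        return "".join(out)

    def rle_decode(c):
        bit, i, out = c[0], 1, []
        while i < len(c):
            k, i = gamma_decode_stream(c, i)
            out.append(bit * k)
            bit = "1" if bit == "0" else "0"    # runs alternate
        return "".join(out)

    s = "11111110010000000000000000000011111111111"
    assert rle_encode(s) == "10011101010000101000001011"   # 26 vs. 41 bits
    assert rle_decode(rle_encode(s)) == s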
Run-length encoding – Discussion
• extensions to larger alphabets are possible (must store the next character then)

• used in some image formats (e. g., TIFF)

+ fairly simple and fast

+ can compress n bits to Θ(log n)!
  (for the extreme case of a constant number of runs)

− negligible compression for many common types of data
  • no compression until run lengths k ≥ 6
  • expansion when run lengths k = 2 or 4
30
7.5 Lempel-Ziv-Welch
Warmup

https://www.flickr.com/photos/quintanaroo/2742726346

https://classic.csunplugged.org/text-compression/
31
Clicker Question

Write down the second-to-last line of the above poem!

pingo.upb.de/622222
32
Lempel-Ziv Compression
• Huffman and RLE mostly take advantage of frequent or repeated single characters.

• Observation: certain substrings are much more frequent than others.
  • in English text: the, be, to, of, and, a, in, that, have, I
  • in HTML: “<a href”, “<img src”, “<br/>”

• Lempel-Ziv stands for a family of adaptive compression algorithms.
  • Idea: store repeated parts by reference!
  • each codeword refers to
    • either a single character in Σ_S,
    • or a substring of S (that both encoder and decoder have already seen).

• Variants of Lempel-Ziv compression
  • “LZ77” original version (“sliding window”)
    derivatives: LZSS, LZFG, LZRW, LZP, DEFLATE, . . .
    DEFLATE used in (pk)zip, gzip, PNG
  • “LZ78” second (slightly improved) version
    derivatives: LZW, LZMW, LZAP, LZY, . . .
    LZW used in compress, GIF
33
Lempel-Ziv-Welch
• here: Lempel-Ziv-Welch (LZW) (arguably the “cleanest” variant of Lempel-Ziv)

• variable-to-fixed encoding
  • all codewords have k bits (typical: k = 12) → fixed length
  • but they represent a variable portion of the source text!

• maintain a dictionary D with 2^k entries → codewords = indices into the dictionary
  • initially, the first |Σ_S| entries encode single characters (rest is empty)
  • add a new entry to D after each step:
    Encoding: after encoding a substring x of S,
    add xc to D, where c is the character that follows x in S.

    Example (S = hannah␣bans␣bananas): after encoding the phrase x = ban,
    the next character is c = a, so add xc = bana to the dictionary.

  • → new codeword in D
  • D actually stores codewords for x and c, not the expanded string
34
LZW encoding – Example
Input: YO!␣YOU!␣YOUR␣YOYO!   Σ_S = ASCII character set (0–127)

  phrase  Y   O   !   ␣   YO   U   !␣   YOU  R   ␣Y   O   YO   !
  C =     89  79  33  32  128  85  130  132  82  131  79  128  33

New dictionary entries (D initially holds the single ASCII characters,
e. g., 32 = ␣, 33 = !, 79 = O, 82 = R, 85 = U, 89 = Y):

  128 YO    132 YOU   136 R␣
  129 O!    133 U!    137 ␣YO
  130 !␣    134 !␣Y   138 OY
  131 ␣Y    135 YOUR  139 YO!

35
LZW encoding – Code
procedure LZWencode(S[0..n))
    x := ε                   // previous phrase, initially empty
    C := ε                   // output, initially empty
    D := dictionary, initialized with codes for c ∈ Σ_S   // stored as trie
    k := |Σ_S|               // next free codeword
    for i := 0, . . . , n − 1 do
        c := S[i]
        if D.containsKey(xc) then
            x := xc
        else
            C := C · D.get(x)    // append codeword for x
            D.put(xc, k)         // add xc to D, assigning next free codeword
            k := k + 1; x := c
    end for
    C := C · D.get(x)
    return C

36
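The same algorithm in runnable Python (a sketch: the dictionary is keyed by strings instead of the trie the pseudocode suggests, which is shorter to write but slower due to string copying):

    def lzw_encode(s, sigma=128):
        d = {chr(i): i for i in range(sigma)}   # codes 0..sigma-1 = single characters
        k = sigma                               # next free codeword
        x, out = "", []
        for c in s:
            if x + c in d:
                x = x + c                       # extend the current phrase
            else:
                out.append(d[x])                # emit codeword for x
                d[x + c] = k                    # add xc to the dictionary
                k += 1
                x = c
        out.append(d[x])                        # emit the last phrase
        return out

    print(lzw_encode("YO! YOU! YOUR YOYO!"))
    # [89, 79, 33, 32, 128, 85, 130, 132, 82, 131, 79, 128, 33] – as in the example above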
LZW decoding
• The decoder has to replay the process of growing the dictionary!

• Decoding: after decoding a substring y of S, add xc to D,
  where x is the previously encoded/decoded substring of S,
  and c = y[0] (the first character of y)

  Example (S = hannah␣bans␣bananas): after decoding y = an (inside bananas),
  the previous phrase was x = ban, so add xc = bana to the dictionary.

• Note: only start adding to D after the second substring of S is decoded
37
LZW decoding – Example
• Same idea: build the dictionary while reading the string.

• Example input: 67 65 78 32 66 129 133
  (D initially holds the single ASCII characters, e. g., 32 = ␣, 65 = A, 66 = B, 67 = C, 78 = N, 83 = S)

  input  decodes to   new entry: Code #   String (human)   String (computer)
  67     C
  65     A            128        CA       67, A
  78     N            129        AN       65, N
  32     ␣            130        N␣       78, ␣
  66     B            131        ␣B       32, B
  129    AN           132        BA       66, A
  133    ???          133

38
LZW decoding – Bootstrapping
• example: we want to decode 133, but it is not yet in the dictionary!
  the decoder is “one step behind” in creating the dictionary

• the problem occurs if we want to use a code that we are just about to build.
• But then we actually know what is going on:
  • Situation: we decode using k in the very step that will define k.
  • the decoder knows the last phrase x and needs the phrase y = D[k] = xc.
    1. en/decode x.
    2. store D[k] := xc
    3. the next phrase y equals D[k]
    → D[k] = xc = x · x[0]  (all known: c is the first character of y, which is the first character of x)

  In the running example (CAN␣BANANAS): x = AN, so D[133] = AN · A = ANA.
39
LZW decoding – Code
procedure LZWdecode(C[0..m))
    D := dictionary [0..2^d) → Σ_S^+, initialized with codes for c ∈ Σ_S   // stored as array
    k := |Σ_S|              // next unused codeword
    q := C[0]               // first codeword
    y := D[q]               // look up meaning of q in D
    S := y                  // output, initially the first phrase
    for j := 1, . . . , m − 1 do
        x := y              // remember last decoded phrase
        q := C[j]           // next codeword
        if q == k then
            y := x · x[0]   // bootstrap case
        else
            y := D[q]
        S := S · y          // append decoded phrase
        D[k] := x · y[0]    // store new phrase
        k := k + 1
    end for
    return S
40
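And the decoder in Python (a sketch matching the pseudocode, including the bootstrap case q == k; the round-trip check assumes lzw_encode from the earlier sketch):

    def lzw_decode(codes, sigma=128):
        d = {i: chr(i) for i in range(sigma)}
        k = sigma
        y = d[codes[0]]                        # first phrase
        s = [y]
        for q in codes[1:]:
            x = y                              # remember last decoded phrase
            y = x + x[0] if q == k else d[q]   # bootstrap: D[k] = x + x[0]
            s.append(y)
            d[k] = x + y[0]                    # store new phrase
            k += 1
        return "".join(s)

    print(lzw_decode([67, 65, 78, 32, 66, 129, 133, 83]))   # CAN BANANAS
    assert lzw_decode(lzw_encode("YO! YOU! YOUR YOYO!")) == "YO! YOU! YOUR YOYO!"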
LZW decoding – Example continued
• Example input: 67 65 78 32 66 129 133 83

  input  decodes to   new entry: Code #   String (human)   String (computer)
  67     C
  65     A            128        CA       67, A
  78     N            129        AN       65, N
  32     ␣            130        N␣       78, ␣
  66     B            131        ␣B       32, B
  129    AN           132        BA       66, A
  133    ANA          133        ANA      129, A
  83     S            134        ANAS     133, S

  → S = CAN␣BANANAS
41
Clicker Question

How many phrases will LZW create on S = aⁿ, a run of n copies of a?

A  ∼ n            F  Θ(log n)
B  ∼ n/2          G  Θ(log log n)
C  ∼ n/4          H  2
D  Θ(n / log n)   I  1
E  Θ(√n)  ✓

pingo.upb.de/622222
42
LZW – Discussion
• As presented, LZW uses the coded alphabet Σ_C = [0..2^d).
  → can apply another encoding to the code numbers, e. g., Huffman

• need a rule for when the dictionary is full; different options:
  • increment d → longer codewords
  • “flush” the dictionary and start from scratch → limits extra space usage
  • often: reserve a codeword to trigger a flush at any time

• encoding and decoding both run in linear time (assuming |Σ_S| constant)

+ fast encoding & decoding

+ works in the streaming model (no random access, no backtracking on the input needed)

+ significant compression for many types of data

− captures only local repetitions (with a bounded dictionary)
43
Compression summary

                    Huffman codes          Run-length encoding       Lempel-Ziv-Welch

  code type         fixed-to-variable      variable-to-variable      variable-to-fixed
  passes            2-pass                 1-pass                    1-pass
  drawback          must send dictionary   can be worse than ASCII   can be worse than ASCII
  on English text   60% compression        bad on text               45% compression
  strength          optimal binary         good on long runs         good on English text
                    character encoding     (e. g., pictures)
  usage             rarely used directly   rarely used directly      frequently used
  found in          pkzip, JPEG, MP3       fax machines,             GIF, part of PDF,
                                           old picture formats       Unix compress
44
Part III
Text Transforms

Text transformations
• compression is effective if we have one of the following:
  • long runs → RLE
  • frequently used characters → Huffman
  • many (local) repeated substrings → LZW

• but these methods can be frustratingly “blind” to other “obvious” redundancies
  • LZW: repetition too distant → dictionary already flushed
  • Huffman: changing probabilities (local clusters) → averaged out globally
  • RLE: run of alternating pairs of characters → not a run

• Enter: text transformations
  • invertible functions of the text
  • do not by themselves reduce the space usage
  • but help compressors “see” existing redundancy
  • used as pre-/postprocessing in a compression pipeline
45
7.6 Move-to-Front Transformation
Move to Front
• Move to Front (MTF) is a heuristic for self-adjusting linked lists
  • unsorted linked list of objects
  • whenever an element is accessed, it is moved to the front of the list
    (leaving the relative order of the other elements unchanged)
  • the list “learns” probabilities of access to objects,
    making access to frequently requested ones cheaper

• Here: use such a list for storing the source alphabet Σ_S
  • to encode c, access it in the list
  • encode c using its (old) position in the list
  • then apply MTF to the list
  • codewords are integers, i. e., Σ_C = [0..σ)

• clusters of few characters → many small numbers
46
Clicker Question

Assume a MTF list currently contains the items X Y Z A B C, and we


now access A. What is the list content after the MTF rule has been
applied?

pingo.upb.de/622222
47
MTF – Code

• Transform (encode):

  procedure MTF-encode(S[0..n))
      L := list containing Σ_S (sorted order)
      C := ε
      for i := 0, . . . , n − 1 do
          c := S[i]
          p := position of c in L
          C := C · p
          move c to front of L
      end for
      return C

• Inverse transform (decode):

  procedure MTF-decode(C[0..m))
      L := list containing Σ_S (sorted order)
      S := ε
      for j := 0, . . . , m − 1 do
          p := C[j]
          c := character at position p in L
          S := S · c
          move c to front of L
      end for
      return S

• Important: encoding and decoding produce the same accesses to the list
48
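Both directions in Python (a sketch; plain Python lists stand in for the linked list, so index/insert cost O(σ) rather than being pointer operations):

    def mtf_encode(s, alphabet):
        L = list(alphabet)              # source alphabet in sorted order
        out = []
        for c in s:
            p = L.index(c)              # (old) position of c in L
            out.append(p)
            L.insert(0, L.pop(p))       # move c to the front
        return out

    def mtf_decode(code, alphabet):
        L = list(alphabet)              # decoder replays the same list updates
        out = []
        for p in code:
            c = L[p]
            out.append(c)
            L.insert(0, L.pop(p))
        return "".join(out)

    import string
    c = mtf_encode("INEFFICIENCIES", string.ascii_uppercase)
    print(c)   # [8, 13, 6, 7, 0, 3, 6, 1, 3, 4, 3, 3, 3, 18] – see the example below
    assert mtf_decode(c, string.ascii_uppercase) == "INEFFICIENCIES"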
MTF – Example

initial list 𝐿 (positions 0–25):
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

𝑆 = INEFFICIENCIES

𝐶 = 8 13 6 7 0 3 6 1 3 4 3 3 3 18

list 𝐿 after encoding:
  S E I C N F A B D G H J K L M O P Q R T U V W X Y Z

• What does a run in 𝑆 encode to in 𝐶?
• What does a run in 𝐶 mean about the source 𝑆?
MTF – Discussion
• MTF itself does not compress the text (if we store codewords with fixed length)
• its prime use is as part of a longer pipeline
• two simple ideas for encoding the codewords:
  • Elias gamma code → smaller numbers get shorter codewords;
    works well for text with a small “local effective” alphabet
  • Huffman code (better compression, but needs 2 passes)
• but: most effective after BWT (→ next)
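For the first idea, a small sketch of an Elias-gamma-style encoder (one common convention among several; since gamma codes only cover integers ≥ 1, the MTF position 𝑝 is encoded as 𝑝 + 1 here, which is an assumption of this sketch, not something fixed by the notes):

    def elias_gamma(n):
        """Gamma code of n >= 1: (len-1) zeros, then n in binary."""
        b = bin(n)[2:]                   # binary digits of n, no leading zeros
        return "0" * (len(b) - 1) + b

    def encode_positions(code):
        # shift by 1 so the frequent MTF output 0 gets the 1-bit code "1"
        return "".join(elias_gamma(p + 1) for p in code)

    print(encode_positions([0, 0, 0, 3]))   # '1' + '1' + '1' + '00100'

Runs of 0 in the MTF output thus collapse into runs of single 1-bits, which is exactly where the savings come from.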
7.7 Burrows-Wheeler Transform
Burrows-Wheeler Transform
• The Burrows-Wheeler Transform (BWT) is a sophisticated text-transformation technique:
  • the coded text has the same letters as the source, just in a different order
  • But: the coded text is (typically) more compressible with MTF(!)

• The encoding algorithm needs all of 𝑆 (no streaming possible).
  → BWT is a block compression method.

• BWT followed by MTF, RLE, and Huffman is the algorithm used by the bzip2 program.
  It achieves the best compression on English text of any algorithm we have seen:

    4047392  bible.txt
    1191071  bible.txt.gz
     888604  bible.txt.7z
     845635  bible.txt.bz2
BWT transform
• cyclic shift of a string: 𝑇 = time␣flies␣quickly␣ → flies␣quickly␣time␣
  [figure: the string written around a circle; a cyclic shift only changes the starting point]
• add an end-of-word character $ to 𝑆 (as in Unit 6)
  → can recover the original string from any cyclic shift

• The Burrows-Wheeler Transform proceeds in three steps:
  1. Place all cyclic shifts of 𝑆 in a list 𝐿
  2. Sort the strings in 𝐿 lexicographically
  3. 𝐵 is the list of trailing characters (last column, top-down) of each string in 𝐿
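The three steps can be transcribed literally; a naive Python sketch (it materializes all 𝑛 shifts, so it is for illustration only — an efficient method follows). Note it relies on $ comparing smaller than every text character, which holds for letters in ASCII but not, e.g., for the space character:

    def bwt_naive(s):
        """Textbook BWT: sort all cyclic shifts of s, return the last column."""
        assert s.endswith("$")           # unique, smallest end-of-word marker
        n = len(s)
        L = sorted(s[i:] + s[:i] for i in range(n))   # steps 1 + 2
        return "".join(row[-1] for row in L)          # step 3

    print(bwt_naive("abracadabra$"))     # ard$rcaaaabb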
BWT transform – Example

𝑆 = alf␣eats␣alfalfa$

1. Write all cyclic shifts      2. Sort the cyclic shifts

   alf␣eats␣alfalfa$              $alf␣eats␣alfalfa
   lf␣eats␣alfalfa$a              ␣alfalfa$alf␣eats
   f␣eats␣alfalfa$al              ␣eats␣alfalfa$alf
   ␣eats␣alfalfa$alf              a$alf␣eats␣alfalf
   eats␣alfalfa$alf␣              alf␣eats␣alfalfa$
   ats␣alfalfa$alf␣e              alfa$alf␣eats␣alf
   ts␣alfalfa$alf␣ea              alfalfa$alf␣eats␣
   s␣alfalfa$alf␣eat              ats␣alfalfa$alf␣e
   ␣alfalfa$alf␣eats              eats␣alfalfa$alf␣
   alfalfa$alf␣eats␣              f␣eats␣alfalfa$al
   lfalfa$alf␣eats␣a              fa$alf␣eats␣alfal
   falfa$alf␣eats␣al              falfa$alf␣eats␣al
   alfa$alf␣eats␣alf              lf␣eats␣alfalfa$a
   lfa$alf␣eats␣alfa              lfa$alf␣eats␣alfa
   fa$alf␣eats␣alfal              lfalfa$alf␣eats␣a
   a$alf␣eats␣alfalf              s␣alfalfa$alf␣eat
   $alf␣eats␣alfalfa              ts␣alfalfa$alf␣ea

3. Extract the last column of the sorted shifts:  𝐵 = asff$f␣e␣lllaaata
Clicker Question

What is the relation between suffix array 𝐿[0..𝑛] and BWT 𝐵[0..𝑛] of a string 𝑇[0..𝑛)$?

A  𝐿 can be very easily computed from 𝐵 and 𝑇
B  𝐵 can be very easily computed from 𝐿 and 𝑇   ✓
C  Both A and B
D  Neither A nor B

pingo.upb.de/622222
BWT – Implementation & Properties
Compute the BWT efficiently:
• cyclic shifts of 𝑆 correspond to suffixes of 𝑆
  → BWT is essentially suffix sorting!
• 𝐵[𝑖] = 𝑆[𝐿[𝑖] − 1]  (𝐿 = suffix array!)  (if 𝐿[𝑖] = 0, 𝐵[𝑖] = $)
• Can compute 𝐵 in 𝑂(𝑛) time

   𝑟   sorted cyclic shift   𝐿[𝑟]
   0   $alf␣eats␣alfalfa      16
   1   ␣alfalfa$alf␣eats       8
   2   ␣eats␣alfalfa$alf       3
   3   a$alf␣eats␣alfalf      15
   4   alf␣eats␣alfalfa$       0
   5   alfa$alf␣eats␣alf      12
   6   alfalfa$alf␣eats␣       9
   7   ats␣alfalfa$alf␣e       5
   8   eats␣alfalfa$alf␣       4
   9   f␣eats␣alfalfa$al       2
  10   fa$alf␣eats␣alfal      14
  11   falfa$alf␣eats␣al      11
  12   lf␣eats␣alfalfa$a       1
  13   lfa$alf␣eats␣alfa      13
  14   lfalfa$alf␣eats␣a      10
  15   s␣alfalfa$alf␣eat       7
  16   ts␣alfalfa$alf␣ea       6

Why does the BWT help?
• sorting groups characters by what follows them
  • Example: lf is always preceded by a
• 𝐵 has local clusters of characters → that makes MTF effective
• repeated substrings in 𝑆 → runs of characters in 𝐵 → picked up by RLE
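With the suffix array, steps 1 and 2 collapse into suffix sorting and 𝐵 can be read off directly. A short Python sketch (the sorted(..., key=...) call is a naive stand-in for the linear-time suffix-array construction assumed above):

    def bwt(s):
        """BWT via B[i] = S[L[i] - 1]; s must end with the sentinel '$'."""
        assert s.endswith("$")
        L = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
        # Python's s[-1] is the last character, i.e. '$', so the special
        # case L[i] == 0 needs no extra code:
        return "".join(s[i - 1] for i in L)

    assert bwt("abracadabra$") == "ard$rcaaaabb"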
Inverse BWT
• Great: we can compute the BWT efficiently and it helps compression.
  But how can we decode it? (It is not even obvious that it is invertible at all!)

• “Magic” solution:
  1. Create array 𝐷[0..𝑛] of pairs: 𝐷[𝑟] = (𝐵[𝑟], 𝑟).
  2. Sort 𝐷 stably with respect to the first entry.
  3. Use 𝐷 as a linked list with (char, next entry).

Example:  𝐵 = ard$rcaaaabb

   𝑟     𝐷          sorted 𝐷 (char, next)
   0    (a , 0)     ($ , 3)
   1    (r , 1)     (a , 0)
   2    (d , 2)     (a , 6)
   3    ($ , 3)     (a , 7)
   4    (r , 4)     (a , 8)
   5    (c , 5)     (a , 9)
   6    (a , 6)     (b , 10)
   7    (a , 7)     (b , 11)
   8    (a , 8)     (c , 5)
   9    (a , 9)     (d , 2)
  10    (b , 10)    (r , 1)
  11    (b , 11)    (r , 4)

Following the next pointers, starting from the unique $ entry, spells out
𝑆 = abracadabra$ character by character.
Inverse BWT – The magic revealed
• The inverse BWT is very easy to compute:
  • only sort the individual characters in 𝐵 (not suffixes)
  • 𝑂(𝑛) with counting sort

• but why does this work!?
  • decode char by char
  • can find the unique $ → starting row
  • to get the next char, we need
    (i) the char in the first column of the current row
    (ii) the row holding that char’s copy in the BWT (last column)
  • then we can walk through and decode
  • for (i): first column = characters of 𝐵 in sorted order
  • for (ii): the relative order of equal characters is the same in both columns!
    → the 𝑖th a in the first column = the 𝑖th a in the BWT
    → stably sorting the pairs (𝐵[𝑟], 𝑟) by first entry is enough

Example (𝑇 = bananaban$):

  𝑟   𝐿[𝑟]   shift starting at 𝐿[𝑟]   𝐵[𝑟]
  0    9     $bananaba                 n
  1    5     aban$bana                 n
  2    7     an$banana                 b
  3    3     anaban$ba                 n
  4    1     ananaban$                 b
  5    6     ban$banan                 a
  6    0     bananaban                 $
  7    8     n$bananab                 a
  8    4     naban$ban                 a
  9    2     nanaban$b                 a
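The whole argument condenses into a few lines of code; a minimal Python sketch (sorting the tuples (𝐵[𝑟], 𝑟) is automatically stable here, since ties in the character are broken by the original index 𝑟):

    def inverse_bwt(B):
        """Recover S (ending in '$') from its BWT B."""
        # steps 1 + 2: pairs (B[r], r), sorted by character
        D = sorted((c, r) for r, c in enumerate(B))
        # step 3: walk the linked list, starting from the unique '$' row;
        # the stored index r points to the next row in sorted order
        i = next(r for c, r in D if c == "$")
        out = []
        for _ in range(len(B)):
            c, nxt = D[i]
            out.append(c)
            i = nxt
        return "".join(out)

    assert inverse_bwt("ard$rcaaaabb") == "abracadabra$"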
BWT – Discussion
• Running time: Θ(𝑛)
  • encoding uses suffix sorting
  • decoding only needs counting sort
  • decoding is much simpler & faster (but same Θ-class)

• drawbacks: typically slower than other methods,
  and needs access to the entire text (or must be applied to blocks independently)
• benefit: the BWT–MTF–RLE–Huffman pipeline tends to give the best compression
Clicker Question

Consider 𝑇 = have_had_hadnt_hasnt_havent_has_what$.
The BWT is 𝐵 = tedtttshhhhhhhaavv____w$_edsaaannnaa_.
How can we explain the long run of hs in 𝐵?

A  h is the most frequent character
B  h always appears at the beginning of a word
C  almost all words start with h
D  h is always followed by a
E  all as are preceded by h   ✓
F  h is the 4th character in the alphabet

pingo.upb.de/622222
Bigger Example   (left column: all cyclic shifts of 𝑇; right column: the same shifts, sorted)

have_had_hadnt_hasnt_havent_has_what$ $have_had_hadnt_hasnt_havent_has_what
ave_had_hadnt_hasnt_havent_has_what$h _had_hadnt_hasnt_havent_has_what$have
ve_had_hadnt_hasnt_havent_has_what$ha _hadnt_hasnt_havent_has_what$have_had
e_had_hadnt_hasnt_havent_has_what$hav _has_what$have_had_hadnt_hasnt_havent
_had_hadnt_hasnt_havent_has_what$have _hasnt_havent_has_what$have_had_hadnt
had_hadnt_hasnt_havent_has_what$have_ _havent_has_what$have_had_hadnt_hasnt
ad_hadnt_hasnt_havent_has_what$have_h _what$have_had_hadnt_hasnt_havent_has
d_hadnt_hasnt_havent_has_what$have_ha ad_hadnt_hasnt_havent_has_what$have_h
_hadnt_hasnt_havent_has_what$have_had adnt_hasnt_havent_has_what$have_had_h
hadnt_hasnt_havent_has_what$have_had_ as_what$have_had_hadnt_hasnt_havent_h
adnt_hasnt_havent_has_what$have_had_h asnt_havent_has_what$have_had_hadnt_h
dnt_hasnt_havent_has_what$have_had_ha at$have_had_hadnt_hasnt_havent_has_wh
nt_hasnt_havent_has_what$have_had_had ave_had_hadnt_hasnt_havent_has_what$h
t_hasnt_havent_has_what$have_had_hadn avent_has_what$have_had_hadnt_hasnt_h
_hasnt_havent_has_what$have_had_hadnt d_hadnt_hasnt_havent_has_what$have_ha
hasnt_havent_has_what$have_had_hadnt_ dnt_hasnt_havent_has_what$have_had_ha
asnt_havent_has_what$have_had_hadnt_h e_had_hadnt_hasnt_havent_has_what$hav
snt_havent_has_what$have_had_hadnt_ha ent_has_what$have_had_hadnt_hasnt_hav
nt_havent_has_what$have_had_hadnt_has had_hadnt_hasnt_havent_has_what$have_
t_havent_has_what$have_had_hadnt_hasn hadnt_hasnt_havent_has_what$have_had_
_havent_has_what$have_had_hadnt_hasnt has_what$have_had_hadnt_hasnt_havent_
havent_has_what$have_had_hadnt_hasnt_ hasnt_havent_has_what$have_had_hadnt_
avent_has_what$have_had_hadnt_hasnt_h hat$have_had_hadnt_hasnt_havent_has_w
vent_has_what$have_had_hadnt_hasnt_ha have_had_hadnt_hasnt_havent_has_what$
ent_has_what$have_had_hadnt_hasnt_hav havent_has_what$have_had_hadnt_hasnt_
nt_has_what$have_had_hadnt_hasnt_have nt_has_what$have_had_hadnt_hasnt_have
t_has_what$have_had_hadnt_hasnt_haven nt_hasnt_havent_has_what$have_had_had
_has_what$have_had_hadnt_hasnt_havent nt_havent_has_what$have_had_hadnt_has
has_what$have_had_hadnt_hasnt_havent_ s_what$have_had_hadnt_hasnt_havent_ha
as_what$have_had_hadnt_hasnt_havent_h snt_havent_has_what$have_had_hadnt_ha
s_what$have_had_hadnt_hasnt_havent_ha t$have_had_hadnt_hasnt_havent_has_wha
_what$have_had_hadnt_hasnt_havent_has t_has_what$have_had_hadnt_hasnt_haven
what$have_had_hadnt_hasnt_havent_has_ t_hasnt_havent_has_what$have_had_hadn
hat$have_had_hadnt_hasnt_havent_has_w t_havent_has_what$have_had_hadnt_hasn
at$have_had_hadnt_hasnt_havent_has_wh ve_had_hadnt_hasnt_havent_has_what$ha
t$have_had_hadnt_hasnt_havent_has_wha vent_has_what$have_had_hadnt_hasnt_ha
$have_had_hadnt_hasnt_havent_has_what what$have_had_hadnt_hasnt_havent_has_

• 𝐵 = tedtttshhhhhhhaavv____w$_edsaaannnaa_
• MTF yields 8, 5, 5, 2, 0, 0, 8, 7, 0, 0, 0, 0, 0, 0, 7, 0, 9, 0, 8, 0, 0, 0, 10, 9, 2, 9, 9, 8, 7, 0, 0, 10, 0, 0, 1, 0, 5
Summary of Compression Methods

Huffman   Variable-width, single-character encoding (optimal in this case)
RLE       Variable-width, multiple-character encoding
LZW       Adaptive, fixed-width, multiple-character encoding;
          augments dictionary with repeated substrings
MTF       Adaptive, transforms to smaller integers;
          should be followed by variable-width integer encoding
BWT       Block compression method; should be followed by MTF
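As a closing illustration, the bzip2-style pipeline from the table can be strung together from the sketches above (a toy version only: real bzip2 works blockwise, uses a bounded RLE variant, and Huffman-codes the final stream). It reuses bwt() and mtf_encode() from the earlier sketches:

    def rle_ints(xs):
        """Run-length encode a sequence of integers as (value, length) pairs."""
        out = []
        for x in xs:
            if out and out[-1][0] == x:
                out[-1][1] += 1
            else:
                out.append([x, 1])
        return [tuple(p) for p in out]

    def toy_pipeline(text, alphabet):
        b = bwt(text)                  # clusters equal characters together
        m = mtf_encode(b, alphabet)    # turns clusters into runs of 0
        return rle_ints(m)             # collapses those runs
        # a real compressor would now Huffman-code this stream

    print(toy_pipeline("abracadabra$", "$abcdr"))
    # [(1, 1), (5, 2), (3, 1), (2, 1), (5, 1), (4, 1), (0, 3), (5, 1), (0, 1)]

On such a tiny input nothing actually shrinks; the gains of the pipeline appear on long texts with many repeated substrings, as in the bible.txt comparison earlier in this unit.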
