Notes 07 Compression
Sebastian Wild
16 March 2020
7 Compression
7.1 Context
7.2 Character Encodings
7.3 Huffman Codes
7.4 Run-Length Encoding
7.5 Lempel-Ziv-Welch
7.6 Move-to-Front Transformation
7.7 Burrows-Wheeler Transform
7.1 Context
Overview
▶ Units 4–6: How to work with strings
  • finding substrings
  • finding approximate matches
  • finding repeated parts
  • ...
Terminology
▶ source text: string 𝑆 ∈ Σ𝑆★ to be stored / transmitted, where Σ𝑆 is some alphabet
▶ coded text: encoded data 𝐶 ∈ Σ𝐶★ that is actually stored / transmitted; usually Σ𝐶 = {0, 1}
▶ encoding: algorithm mapping source texts to coded texts
What is a good encoding scheme?
▶ Depending on the application, goals can be
  • efficiency of encoding/decoding
  • resilience to errors/noise in transmission
  • security (encryption)
  • integrity (detect modifications made by third parties)
  • size
Types of Data Compression
▶ Logical vs. Physical
  • Logical compression uses the meaning of the data;
    only applies to a certain domain, e. g., sound recordings
  • Physical compression only knows the (physical) bits in the data, not the meaning behind them
▶ For media files, lossy, logical compression is useful (e. g., JPEG, MPEG)
What makes data compressible?
▶ Physical, lossless compression methods mainly exploit two types of redundancies in source texts:
  • uneven character frequencies (⇝ Part I)
  • repeated substrings (⇝ Part II)
Part I
Exploiting character frequencies
7.2 Character Encodings
Character encodings
▶ Simplest form of encoding: Encode each source character individually
Fixed-length codes
▶ fixed-length codes are the simplest type of character encodings: every codeword has the same length
▶ e. g., 7-bit ASCII: just enough for English letters and a few symbols (plus control characters)
Variable-length codes
▶ to gain more flexibility, we have to allow different lengths for codewords
▶ classic example: Morse code
  (image: https://commons.wikimedia.org/wiki/File:International_Morse_Code.svg)
Variable-length codes – UTF-8
▶ Modern example: UTF-8 encoding of Unicode:
  default encoding for text files, XML, HTML since 2009
▶ Encodes any Unicode character (137 994 as of May 2019, and counting)
▶ uses 1–4 bytes (codeword lengths: 8, 16, 24, or 32 bits)
▶ Every ASCII character is encoded in 1 byte, with leading bit 0 followed by the 7 bits of ASCII
▶ Non-ASCII characters start with a byte beginning with 2–4 1s that indicate the total number
  of bytes, followed by a 0 and 3–5 payload bits.
  The remaining bytes each start with 10 followed by 6 payload bits.
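To see this byte structure concretely, here is a small Python check (a sketch; the non-ASCII example characters are arbitrary):

# Inspect the UTF-8 codewords of a few characters, bit by bit.
for ch in ("A", "é", "🙂"):
    print(ch, [f"{byte:08b}" for byte in ch.encode("utf-8")])
# A  ['01000001']                                        1 byte, leading bit 0
# é  ['11000011', '10101001']                            2 bytes: 110..., then 10...
# 🙂 ['11110000', '10011111', '10011001', '10000010']    4 bytes: 11110..., then 10... three times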
Pitfall in variable-length codes
▶ Suppose we have the following code:

  𝑐      a   n    b    s
  𝐸(𝑐)   0   10   110  100

▶ this code is ambiguous: 𝐸(n) = 10 is a prefix of 𝐸(s) = 100,
  so, e. g., 1000 decodes both as sa (100 · 0) and as naa (10 · 0 · 0)
Code tries
▶ From now on, we only consider prefix-free codes 𝐸:
  𝐸(𝑐) is not a prefix of 𝐸(𝑐′) for any 𝑐, 𝑐′ ∈ Σ𝑆.
▶ Example:

  𝑐      A   E    N    O    T   ␣
  𝐸(𝑐)   01  101  001  100  11  000

  (code trie with leaves ␣, N, O, E, A, T; no need for an end-of-string symbol $ here,
  since the code is already prefix-free)
▶ Encode AN␣ANT
▶ Decode 111000001010111
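A minimal sketch of both exercises in Python; scanning bit by bit and resetting after each match is equivalent to walking the code trie (the identifiers are mine, not from the notes):

CODE = {"A": "01", "E": "101", "N": "001", "O": "100", "T": "11", "␣": "000"}
DECODE = {w: c for c, w in CODE.items()}  # unambiguous because the code is prefix-free

def decode(bits: str) -> str:
    out, buf = [], ""
    for b in bits:
        buf += b                  # descend one edge of the code trie
        if buf in DECODE:         # reached a leaf
            out.append(DECODE[buf])
            buf = ""              # back to the root
    assert buf == "", "input ended in the middle of a codeword"
    return "".join(out)

print("".join(CODE[c] for c in "AN␣ANT"))  # 010010000100111
print(decode("111000001010111"))           # TO␣EAT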
Who decodes the decoder?
▶ Depending on the application, we have to store/transmit the used code!
▶ We distinguish:
  • fixed coding: code agreed upon in advance, not transmitted (e. g., Morse, UTF-8)
  • static coding: code depends on the message, but stays the same for the entire message;
    it must be transmitted (e. g., Huffman codes ⇝ next)
  • adaptive coding: code depends on the message and changes during encoding;
    implicitly stored within the message (e. g., LZW ⇝ below)
7.3 Huffman Codes
Character frequencies
▶ Goal: Find a character encoding that produces a short coded text

Huffman coding
▶ Given: weights 𝑤(𝑐) for all 𝑐 ∈ Σ (e. g., character frequencies / probabilities)
▶ Goal: prefix-free code 𝐸 (= code trie) for Σ that minimizes the coded text length,
  i. e., a code trie minimizing ∑𝑐∈Σ 𝑤(𝑐) · |𝐸(𝑐)|
Huffman’s algorithm
▶ Can we find such a code trie greedily? Actually, yes! A greedy/myopic approach succeeds here.
Huffman’s algorithm:
1. Find two characters a, b with lowest weights.
   ▶ We will encode them with the same prefix, plus one distinguishing bit,
     i. e., 𝐸(a) = 𝑢0 and 𝐸(b) = 𝑢1 for a bitstring 𝑢 ∈ {0, 1}★ (𝑢 to be determined)
2. Replace a and b by a single character of weight 𝑤(a) + 𝑤(b) and repeat until one
   character remains; unrolling the merges yields the code trie.
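A runnable sketch of the merging process in Python (my own helper, using a binary heap as the priority queue mentioned in the discussion below; tie-breaking is arbitrary):

import heapq
from itertools import count

def huffman_code(weights: dict[str, float]) -> dict[str, str]:
    """Build an optimal prefix-free code by repeatedly merging the two lightest tries."""
    ties = count()  # tie-breaker so heap entries are always comparable
    heap = [(w, next(ties), c) for c, w in weights.items()]
    heapq.heapify(heap)
    code = {c: "" for c in weights}
    while len(heap) > 1:
        wa, _, a = heapq.heappop(heap)   # the two lowest-weight subtries a, b ...
        wb, _, b = heapq.heappop(heap)
        for c in a:
            code[c] = "0" + code[c]      # ... get a common parent: prepend the
        for c in b:
            code[c] = "1" + code[c]      # distinguishing bit to all their leaves
        heapq.heappush(heap, (wa + wb, next(ties), a + b))
    return code

print(huffman_code({"E": 1, "L": 2, "O": 1, "S": 4}))
# {'E': '110', 'L': '10', 'O': '111', 'S': '0'}: codeword lengths 3, 2, 3, 1,
# i.e., 14 bits in total for LOSSLESS (the 0/1 labels depend on tie-breaking)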
Huffman’s algorithm – Example
▶ Example text: 𝑆 = LOSSLESS ⇝ Σ𝑆 = { E, L, O, S } with weights 1, 2, 1, 4
▶ merge E (weight 1) and O (weight 1) ⇝ subtree {E, O} of weight 2
▶ merge L (weight 2) and {E, O} (weight 2) ⇝ subtree {L, E, O} of weight 4
▶ merge S (weight 4) and {L, E, O} (weight 4) ⇝ code trie:
  S at depth 1, L at depth 2, E and O at depth 3
▶ LOSSLESS → 01001110100011; compression ratio: 14 / (8 · lg 4) = 14/16 ≈ 88%
Huffman tree – tie breaking
▶ The above procedure is ambiguous:
  • which characters to choose when weights are equal?
  • which subtree goes left, which goes right?
▶ any choice yields an optimal code; encoder and decoder only have to agree on a convention
Huffman code – Optimality
Theorem 7.1 (Optimality of Huffman’s Algorithm)
Given Σ and 𝑤 : Σ → ℝ≥0, Huffman’s algorithm computes codewords 𝐸 : Σ → {0, 1}★ with
minimal expected codeword length ℓ(𝐸) = ∑𝑐∈Σ 𝑤(𝑐) · |𝐸(𝑐)| among all prefix-free codes
for Σ. ◀
Entropy
Definition 7.2 (Entropy)
Given probabilities 𝑝1, . . . , 𝑝𝑛 (for outcomes 1, . . . , 𝑛 of a random variable), the entropy of the
distribution is defined as

  H(𝑝1, . . . , 𝑝𝑛) = − ∑𝑖=1..𝑛 𝑝𝑖 lg 𝑝𝑖 = ∑𝑖=1..𝑛 𝑝𝑖 lg(1/𝑝𝑖) ◀
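A tiny numeric illustration of the definition (a sketch; the example distributions are mine):

from math import log2

def entropy(ps: list[float]) -> float:
    """H(p1, ..., pn) = sum of p * lg(1/p); zero-probability outcomes contribute 0."""
    return sum(p * log2(1 / p) for p in ps if p > 0)

print(entropy([0.5, 0.5]))            # 1.0   (one fair coin flip)
print(entropy([0.25] * 4))            # 2.0   (= lg 4: uniform over 4 outcomes)
print(entropy([4/8, 2/8, 1/8, 1/8]))  # 1.75  (character frequencies of LOSSLESS;
                                      # Huffman also uses 14/8 = 1.75 bits/char there)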
Encoding with Huffman code
▶ The overall encoding procedure is as follows:
  • Pass 1: Count character frequencies in 𝑆
  • Construct Huffman code 𝐸 (as above)
  • Store the Huffman code in 𝐶 (details omitted)
  • Pass 2: Encode each character in 𝑆 using 𝐸 and append the result to 𝐶
Huffman coding – Discussion
▶ running time: 𝑂(𝜎 log 𝜎) to construct the code
  (build the priority queue, then 𝜎 · (2 deleteMins and 1 insert))
▶ can achieve Θ(𝜎) time when the characters are already sorted by weight
▶ time for encoding: 𝑂(𝑛 + |𝐶|)
Part II
Compressing repetitive texts
Beyond Character Encoding
▶ Many “natural” texts show repetitive redundancy
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull
boy. All work and no play makes Jack a dull boy. All work and no play makes Jack
a dull boy. All work and no play makes Jack a dull boy. All work and no play makes
Jack a dull boy. All work and no play makes Jack a dull boy. All work and no play
makes Jack a dull boy. All work and no play makes Jack a dull boy. All work and no
play makes Jack a dull boy. All work and no play makes Jack a dull boy. All work and
no play makes Jack a dull boy. All work and no play makes Jack a dull boy. All work
and no play makes Jack a dull boy. All work and no play makes Jack a dull boy. All
work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
7.4 Run-Length Encoding
Run-Length encoding
▶ simplest form of repetition: runs of characters (the same character repeated)
▶ Example: 𝑆 = 000001110000 consists of the runs 00000, 111, 0000
▶ We have to store
  • the first bit of 𝑆 (either 0 or 1)
  • the length of each run
▶ Note: we don’t have to store the bit for later runs, since runs must alternate.
▶ Example becomes: 0, 5, 3, 4
Elias codes
▶ Need a prefix-free encoding for ℕ = {1, 2, 3, . . .}
  • must allow arbitrarily large integers
  • must know when to stop reading
▶ Elias gamma code: to encode 𝑘 ≥ 1, write ⌊lg 𝑘⌋ zeros, followed by the binary
  representation of 𝑘 (⌊lg 𝑘⌋ + 1 bits, leading bit 1)
▶ to decode, count the leading zeros (say ℓ of them), then read the next ℓ + 1 bits
  as a binary number
▶ Exercise: Decode the first number in Elias gamma code (at the beginning)
  of the following bitstream: 000110111011100110.
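A sketch of both directions in Python (the helper names are mine); the decoder also answers the exercise above:

def gamma_encode(k: int) -> str:
    """Elias gamma: floor(lg k) zeros, then k in binary (leading bit 1)."""
    b = bin(k)[2:]                      # binary representation of k
    return "0" * (len(b) - 1) + b

def gamma_decode(bits: str, i: int = 0) -> tuple[int, int]:
    """Decode one number starting at position i; return (value, next position)."""
    l = 0
    while bits[i + l] == "0":           # count the leading zeros
        l += 1
    return int(bits[i + l : i + 2 * l + 1], 2), i + 2 * l + 1

print(gamma_encode(7), gamma_encode(20))   # 00111 000010100
print(gamma_decode("000110111011100110"))  # (13, 7): 3 zeros, then 1101 = 13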
Run-length encoding – Examples
▶ Encoding:
  𝑆 = 11111110010000000000000000000011111111111
  runs: 7 × 1, 2 × 0, 1 × 1, 20 × 0, 11 × 1
  𝐶 = 1 · 00111 · 010 · 1 · 000010100 · 0001011
    = 10011101010000101000001011
▶ Decoding:
  𝐶 = 00001101001001010
  first bit 𝑏 = 0; then: ℓ = 3, 𝑘 = 13 (⇝ 13 zeros); ℓ = 2, 𝑘 = 4 (⇝ 4 ones);
  ℓ = 0, 𝑘 = 1 (⇝ 1 zero); ℓ = 1, 𝑘 = 2 (⇝ 2 ones)
  𝑆 = 00000000000001111011
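The whole scheme as a runnable Python sketch (reusing gamma_encode/gamma_decode from above, repeated here so the snippet is self-contained):

def gamma_encode(k: int) -> str:
    b = bin(k)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits: str, i: int) -> tuple[int, int]:
    l = 0
    while bits[i + l] == "0":
        l += 1
    return int(bits[i + l : i + 2 * l + 1], 2), i + 2 * l + 1

def rle_encode(s: str) -> str:
    out, i = [s[0]], 0                   # store the first bit explicitly
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                       # scan to the end of the current run
        out.append(gamma_encode(j - i))  # run length in Elias gamma code
        i = j
    return "".join(out)

def rle_decode(c: str) -> str:
    b, i, out = c[0], 1, []
    while i < len(c):
        k, i = gamma_decode(c, i)
        out.append(b * k)
        b = "1" if b == "0" else "0"     # runs alternate, so no bit is stored
    return "".join(out)

assert rle_encode("11111110010000000000000000000011111111111") == "10011101010000101000001011"
assert rle_decode("00001101001001010") == "00000000000001111011"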
Run-length encoding – Discussion
▶ extensions to larger alphabets are possible (must store the next character then)
7.5 Lempel-Ziv-Welch
Warmup
(images: https://www.flickr.com/photos/quintanaroo/2742726346,
https://classic.csunplugged.org/text-compression/)
Lempel-Ziv Compression
▶ Huffman and RLE mostly take advantage of frequent or repeated single characters.
▶ Lempel-Ziv methods additionally exploit repeated substrings, by maintaining a dictionary
  of parts of the text seen so far.
Lempel-Ziv-Welch
▶ here: Lempel-Ziv-Welch (LZW) (arguably the “cleanest” variant of Lempel-Ziv)
▶ variable-to-fixed encoding
  • all codewords have 𝑘 bits (typical: 𝑘 = 12) ⇝ fixed-length
  • but they represent a variable portion of the source text!
▶ Example: 𝑆 = hannahbansbananas — after some prefix has already been encoded,
  the current phrase 𝒙 plus the next character 𝒄 becomes a new codeword in 𝐷
  • 𝐷 actually stores codewords for 𝑥 and 𝑐, not the expanded string
LZW encoding – Example
Input: YO!␣YOU!␣YOUR␣YOYO!    Σ𝑆 = ASCII character set (0–127)

phrases:  Y   O   !   ␣   YO   U   !␣   YOU  R   ␣Y   O   YO   !
𝐶      =  89  79  33  32  128  85  130  132  82  131  79  128  33

new dictionary entries: 128 = YO, 129 = O!, 130 = !␣, 131 = ␣Y, 132 = YOU,
133 = U!, 134 = !␣Y, 135 = YOUR, 136 = R␣, 137 = ␣YO, 138 = OY, 139 = YO!
LZW encoding – Code
procedure LZWencode(𝑆[0..𝑛)):
    𝑥 := 𝜀    // previous phrase, initially empty
    𝐶 := 𝜀    // output, initially empty
    𝐷 := dictionary, initialized with codes for all 𝑐 ∈ Σ𝑆    // stored as trie
    𝑘 := |Σ𝑆 |    // next free codeword
    for 𝑖 := 0, . . . , 𝑛 − 1 do
        𝑐 := 𝑆[𝑖]
        if 𝐷.containsKey(𝑥 · 𝑐) then
            𝑥 := 𝑥 · 𝑐
        else
            𝐶 := 𝐶 · 𝐷.get(𝑥)    // append codeword for 𝑥
            𝐷.put(𝑥 · 𝑐, 𝑘)    // add 𝑥𝑐 to 𝐷, assigning the next free codeword
            𝑘 := 𝑘 + 1;  𝑥 := 𝑐
    end for
    𝐶 := 𝐶 · 𝐷.get(𝑥)    // flush the codeword for the last phrase
    return 𝐶
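The same procedure as runnable Python, assuming plain ASCII input (a plain space stands in for ␣) and unbounded code numbers — a sketch; real implementations cap codewords at 𝑘 bits:

def lzw_encode(s: str) -> list[int]:
    """LZW over the ASCII alphabet; returns the list of code numbers."""
    d = {chr(i): i for i in range(128)}   # dictionary, initialized with ASCII
    k = 128                               # next free codeword
    x, out = "", []
    for c in s:
        if x + c in d:
            x += c                        # extend the current phrase
        else:
            out.append(d[x])              # emit the codeword for x
            d[x + c] = k                  # learn the new phrase xc
            k += 1
            x = c
    if x:
        out.append(d[x])                  # flush the last phrase
    return out

print(lzw_encode("YO! YOU! YOUR YOYO!"))
# [89, 79, 33, 32, 128, 85, 130, 132, 82, 131, 79, 128, 33]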
LZW decoding
▶ Decoder has to replay the process of growing the dictionary!
▶ Decoding: after decoding a substring 𝑦 of 𝑆, add 𝑥𝑐 to 𝐷,
  where 𝑥 is the previously encoded/decoded substring of 𝑆,
  and 𝑐 = 𝑦[0] (the first character of 𝑦)
▶ Example (𝑆 = hannahbansbananas): having just decoded 𝑦 = an after 𝑥 = ban,
  add 𝑥𝑐 = bana to the dictionary
LZW decoding – Example
▶ Same idea: build the dictionary while reading the string.
▶ 𝐷 is initialized with the ASCII codes (32 = ␣, 65 = A, 66 = B, 67 = C, 78 = N, 83 = S, . . .)
▶ Input: 67 65 78 32 66 129 133 . . .

  input   decodes to   new entry: Code #, String (human), String (computer)
  67      C            —
  65      A            128, CA, (67, A)
  78      N            129, AN, (65, N)
  32      ␣            130, N␣, (78, ␣)
  66      B            131, ␣B, (32, B)
  129     AN           132, BA, (66, A)
  133     ???          ⇝ code 133 is requested, but not yet in the dictionary!
LZW decoding – Bootstrapping
▶ example: Want to decode 133, but it is not yet in the dictionary!
▶ the problem occurs if we want to use a code that we are just about to build.
▶ But then we actually know what is going on:
  • Situation: decode using 𝑘 in the step that will define 𝑘.
  • decoder knows the last phrase 𝑥, needs the phrase 𝑦 = 𝐷[𝑘] = 𝑥𝑐.
  1. en/decode 𝑥.
  2. store 𝐷[𝑘] := 𝒙𝒄.
  3. the next phrase 𝑦 equals 𝐷[𝑘].
  ⇝ 𝐷[𝑘] = 𝒙𝒄 = 𝒙 · 𝒙[0] (all known!)
  (in the example 𝑆 = CAN␣BANANAS: 𝑥 = AN, 𝑐 = A, so 𝑦 = ANA)
LZW decoding – Code
procedure LZWdecode(𝐶[0..𝑚)):
    𝐷 := dictionary [0..2^𝑑) → Σ𝑆⁺, initialized with codes for all 𝑐 ∈ Σ𝑆    // stored as array
    𝑘 := |Σ𝑆 |    // next unused codeword
    𝑞 := 𝐶[0]    // first codeword
    𝑦 := 𝐷[𝑞]    // look up the meaning of 𝑞 in 𝐷
    𝑆 := 𝑦    // output, initially the first phrase
    for 𝑗 := 1, . . . , 𝑚 − 1 do
        𝑥 := 𝑦    // remember the last decoded phrase
        𝑞 := 𝐶[𝑗]    // next codeword
        if 𝑞 == 𝑘 then
            𝑦 := 𝑥 · 𝑥[0]    // bootstrap case
        else
            𝑦 := 𝐷[𝑞]
        𝑆 := 𝑆 · 𝑦    // append the decoded phrase
        𝐷[𝑘] := 𝑥 · 𝑦[0]    // store the new phrase
        𝑘 := 𝑘 + 1
    end for
    return 𝑆
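And as runnable Python, mirroring lzw_encode from the earlier sketch (same assumptions):

def lzw_decode(codes: list[int]) -> str:
    """Invert lzw_encode by replaying the dictionary growth."""
    d = {i: chr(i) for i in range(128)}    # same initial dictionary as the encoder
    k = 128                                # next unused codeword
    y = d[codes[0]]
    out = [y]
    for q in codes[1:]:
        x = y                              # last decoded phrase
        y = x + x[0] if q == k else d[q]   # bootstrap case: D[k] = x + x[0]
        out.append(y)
        d[k] = x + y[0]                    # store the phrase the encoder just learned
        k += 1
    return "".join(out)

print(lzw_decode([67, 65, 78, 32, 66, 129, 133, 83]))  # CAN BANANAS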
LZW decoding – Example continued
▶ Example: 67 65 78 32 66 129 133 83

  input   decodes to   new entry: Code #, String (human), String (computer)
  133     ANA          133, ANA, (129, A)    ← bootstrap: 𝑦 = 𝑥 · 𝑥[0] with 𝑥 = AN
  83      S            134, ANAS, (133, S)

▶ 𝑆 = CAN␣BANANAS
LZW – Discussion
▶ As presented, LZW uses the coded alphabet Σ𝐶 = [0..2^𝑑).
  • use another encoding for the code numbers ↦ binary, e. g., Huffman
▶ encoding and decoding both run in linear time (assuming |Σ𝑆 | is constant)
Compression summary
▶ Huffman: part of pkzip, JPEG, MP3
▶ RLE: fax machines, old picture formats
▶ LZW: GIF, part of PDF, Unix compress
Part III
Text Transforms
Text transformations
▶ compression is effective if we have one of the following:
  • long runs ⇝ RLE
  • frequently used characters ⇝ Huffman
  • many (local) repeated substrings ⇝ LZW
7.6 Move-to-Front Transformation
Move to Front
▶ Move to Front (MTF) is a heuristic for self-adjusting linked lists
  • unsorted linked list of objects
  • whenever an element is accessed, it is moved to the front of the list
    (leaving the relative order of the other elements unchanged)
  • the list “learns” the access probabilities of the objects and
    makes access to frequently requested ones cheaper
MTF – Code
▶ as a text transformation: encode 𝑆[𝑖] as the current position of that character in the list,
  output the position, and move the character to the front; decoding replays exactly the
  same list updates
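A sketch of both directions in Python, with the list initialized to A–Z as in the example below (the function names are mine):

import string

def mtf_encode(s: str) -> list[int]:
    lst = list(string.ascii_uppercase)   # initial list: A, B, ..., Z
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)                    # emit the current position ...
        lst.insert(0, lst.pop(i))        # ... and move the character to the front
    return out

def mtf_decode(codes: list[int]) -> str:
    lst = list(string.ascii_uppercase)
    out = []
    for i in codes:
        ch = lst.pop(i)                  # position i names the character ...
        out.append(ch)
        lst.insert(0, ch)                # ... and triggers the same move
    return "".join(out)

print(mtf_encode("INEFFICIENCIES"))
# [8, 13, 6, 7, 0, 3, 6, 1, 3, 4, 3, 3, 3, 18]
assert mtf_decode(mtf_encode("INEFFICIENCIES")) == "INEFFICIENCIES"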
MTF – Example
▶ initial list:
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
  A B C D E F G H I J K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
▶ 𝑆 = INEFFICIENCIES
▶ 𝐶 = 8 13 6 7 0 3 6 1 3 4 3 3 3 18
  (e. g., the first I is at position 8 and moves to the front: I A B C D E F G H J . . . Z;
  the final list is S E I C N F A B D G H J K L M O P Q R T U V W X Y Z)
▶ What does a run in 𝑆 encode to in 𝐶?
MTF – Discussion
▶ MTF itself does not compress the text (if we store the codewords with fixed length)
▶ but repeated and clustered characters become small numbers — a run in 𝑆 becomes a run
  of 0s in 𝐶 — which RLE and Huffman then compress well
7.7 Burrows-Wheeler Transform
Burrows-Wheeler Transform
▶ Burrows-Wheeler Transform (BWT) is a sophisticated text-transformation technique.
  • the coded text has the same letters as the source, just in a different order
  • But: the coded text is (typically) more compressible with MTF(!)
▶ BWT followed by MTF, RLE, and Huffman is the algorithm used by the bzip2 program.
  It achieves the best compression on English text of any algorithm we have seen:
    4047392  bible.txt
    1191071  bible.txt.gz
     888604  bible.txt.7z
     845635  bible.txt.bz2
BWT transform
▶ cyclic shift of a string: e. g., 𝑇 = time␣flies␣quickly␣ ⇝ flies␣quickly␣time␣
  (figure: the string written around a circle; a cyclic shift just starts reading
  at a different position)
▶ add an end-of-word character $ to 𝑆 (as in Unit 6)
▶ then the original string can be recovered from any cyclic shift
BWT transform – Example
𝑆 = alf␣eats␣alfalfa$

1. Write all cyclic shifts:        2. Sort the cyclic shifts:
   alf␣eats␣alfalfa$                  $alf␣eats␣alfalfa
   lf␣eats␣alfalfa$a                  ␣alfalfa$alf␣eats
   f␣eats␣alfalfa$al                  ␣eats␣alfalfa$alf
   ␣eats␣alfalfa$alf                  a$alf␣eats␣alfalf
   eats␣alfalfa$alf␣                  alf␣eats␣alfalfa$
   ats␣alfalfa$alf␣e                  alfa$alf␣eats␣alf
   ts␣alfalfa$alf␣ea                  alfalfa$alf␣eats␣
   s␣alfalfa$alf␣eat                  ats␣alfalfa$alf␣e
   ␣alfalfa$alf␣eats                  eats␣alfalfa$alf␣
   alfalfa$alf␣eats␣                  f␣eats␣alfalfa$al
   lfalfa$alf␣eats␣a                  fa$alf␣eats␣alfal
   falfa$alf␣eats␣al                  falfa$alf␣eats␣al
   alfa$alf␣eats␣alf                  lf␣eats␣alfalfa$a
   lfa$alf␣eats␣alfa                  lfa$alf␣eats␣alfa
   fa$alf␣eats␣alfal                  lfalfa$alf␣eats␣a
   a$alf␣eats␣alfalf                  s␣alfalfa$alf␣eat
   $alf␣eats␣alfalfa                  ts␣alfalfa$alf␣ea

3. Extract the last column:
   𝐵 = asff$f␣e␣lllaaata
Clicker Question
What is the relation between the suffix array 𝐿[0..𝑛] and the BWT 𝐵[0..𝑛]
of a string 𝑇[0..𝑛)$?
A  𝐿 can be very easily computed from 𝐵 and 𝑇
B  𝐵 can be very easily computed from 𝐿 and 𝑇
C  Both A and B
D  Neither A nor B
BWT – Implementation & Properties
Compute the BWT efficiently:
▶ cyclic shifts of 𝑆 ≙ suffixes of 𝑆
▶ BWT is essentially suffix sorting!
▶ 𝐵[𝑖] = 𝑆[𝐿[𝑖] − 1], where 𝐿 is the suffix array (if 𝐿[𝑖] = 0, 𝐵[𝑖] = $)
▶ Can compute 𝐵 in 𝑂(𝑛) time

Why does the BWT help?
▶ sorting groups characters by what follows them
▶ Example: lf is always preceded by a
▶ ⇝ 𝐵 has local clusters of characters
▶ ⇝ that makes MTF effective
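A naive but runnable sketch in Python (sorting all cyclic shifts explicitly; a suffix array would replace the sort for 𝑂(𝑛) time):

def bwt(t: str) -> str:
    """BWT by sorting all cyclic shifts of t$ and reading off the last column."""
    s = t + "$"                                            # end-of-word character
    shifts = sorted(s[i:] + s[:i] for i in range(len(s)))  # steps 1 + 2
    return "".join(row[-1] for row in shifts)              # step 3: last column

print(bwt("abracadabra"))  # ard$rcaaaabb: the example inverted on the next slides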
Inverse BWT
▶ Great, we can compute the BWT efficiently, and it helps compression. But how can we decode it?
▶ “Magic” solution:
  1. Create an array 𝐷[0..𝑛] of pairs: 𝐷[𝑟] = (𝐵[𝑟], 𝑟).
  2. Sort 𝐷 stably with respect to the first entry.
  3. Use 𝐷 as a linked list with (char, next entry).
▶ Example: 𝐵 = ard$rcaaaabb

  𝑟    unsorted    sorted: (char, next)
  0    (a, 0)      ($, 3)
  1    (r, 1)      (a, 0)
  2    (d, 2)      (a, 6)
  3    ($, 3)      (a, 7)
  4    (r, 4)      (a, 8)
  5    (c, 5)      (a, 9)
  6    (a, 6)      (b, 10)
  7    (a, 7)      (b, 11)
  8    (a, 8)      (c, 5)
  9    (a, 9)      (d, 2)
  10   (b, 10)     (r, 1)
  11   (b, 11)     (r, 4)

▶ start at the $-entry (position 0) and follow the next pointers, outputting each
  visited character: 𝑆 = abracadabra$
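The same steps in Python (a sketch; sorting the tuples (char, 𝑟) is already the required stable sort, since ties are broken by the original position 𝑟):

def inverse_bwt(b: str) -> str:
    """Invert the BWT: stably sort (char, index) pairs, then chase the links."""
    d = sorted((ch, r) for r, ch in enumerate(b))  # D[r] = (B[r], r), sorted
    out = []
    p = 0                        # the $ entry: $ is smallest, so it sorts to row 0
    for _ in range(len(b)):
        p = d[p][1]              # follow the "next entry" link ...
        out.append(d[p][0])      # ... and output that entry's character
    return "".join(out)

print(inverse_bwt("ard$rcaaaabb"))  # abracadabra$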
Inverse BWT – The magic revealed
▶ Inverse BWT is very easy to compute:
  • only sort the individual characters in 𝐵 (not suffixes)
  • 𝑂(𝑛) with counting sort
▶ Why does it work? Two observations:
  • (i): the first column of the sorted cyclic shifts = the characters of 𝐵 in sorted order
  • (ii): occurrences of the same character appear in the same relative order in the
    first column and in 𝐵, i. e., the 𝑖th a in the first column = the 𝑖th a in the BWT
  ⇝ stably sorting (𝐵[𝑟], 𝑟) by the first entry is enough
BWT – Discussion
▶ Running time: Θ(𝑛)
  • encoding uses suffix sorting
  • decoding only needs counting sort
  ⇝ decoding much simpler & faster (but same Θ-class)
Clicker Question
Consider 𝑇 = have_had_hadnt_hasnt_havent_has_what$.
The BWT is 𝐵 = tedtttshhhhhhhaavv____w$_edsaaannnaa_.
How can we explain the long run of hs in 𝐵?
D  h is always followed by a
Bigger Example
(all cyclic shifts of 𝑇; unsorted in the left column, sorted in the right column)
have_had_hadnt_hasnt_havent_has_what$ $have_had_hadnt_hasnt_havent_has_what
ave_had_hadnt_hasnt_havent_has_what$h _had_hadnt_hasnt_havent_has_what$have
ve_had_hadnt_hasnt_havent_has_what$ha _hadnt_hasnt_havent_has_what$have_had
e_had_hadnt_hasnt_havent_has_what$hav _has_what$have_had_hadnt_hasnt_havent
_had_hadnt_hasnt_havent_has_what$have _hasnt_havent_has_what$have_had_hadnt
had_hadnt_hasnt_havent_has_what$have_ _havent_has_what$have_had_hadnt_hasnt
ad_hadnt_hasnt_havent_has_what$have_h _what$have_had_hadnt_hasnt_havent_has
d_hadnt_hasnt_havent_has_what$have_ha ad_hadnt_hasnt_havent_has_what$have_h
_hadnt_hasnt_havent_has_what$have_had adnt_hasnt_havent_has_what$have_had_h
hadnt_hasnt_havent_has_what$have_had_ as_what$have_had_hadnt_hasnt_havent_h
adnt_hasnt_havent_has_what$have_had_h asnt_havent_has_what$have_had_hadnt_h
dnt_hasnt_havent_has_what$have_had_ha at$have_had_hadnt_hasnt_havent_has_wh
nt_hasnt_havent_has_what$have_had_had ave_had_hadnt_hasnt_havent_has_what$h
t_hasnt_havent_has_what$have_had_hadn avent_has_what$have_had_hadnt_hasnt_h
_hasnt_havent_has_what$have_had_hadnt d_hadnt_hasnt_havent_has_what$have_ha
hasnt_havent_has_what$have_had_hadnt_ dnt_hasnt_havent_has_what$have_had_ha
asnt_havent_has_what$have_had_hadnt_h e_had_hadnt_hasnt_havent_has_what$hav
snt_havent_has_what$have_had_hadnt_ha ent_has_what$have_had_hadnt_hasnt_hav
nt_havent_has_what$have_had_hadnt_has had_hadnt_hasnt_havent_has_what$have_
t_havent_has_what$have_had_hadnt_hasn hadnt_hasnt_havent_has_what$have_had_
_havent_has_what$have_had_hadnt_hasnt has_what$have_had_hadnt_hasnt_havent_
havent_has_what$have_had_hadnt_hasnt_ hasnt_havent_has_what$have_had_hadnt_
avent_has_what$have_had_hadnt_hasnt_h hat$have_had_hadnt_hasnt_havent_has_w
vent_has_what$have_had_hadnt_hasnt_ha have_had_hadnt_hasnt_havent_has_what$
ent_has_what$have_had_hadnt_hasnt_hav havent_has_what$have_had_hadnt_hasnt_
nt_has_what$have_had_hadnt_hasnt_have nt_has_what$have_had_hadnt_hasnt_have
t_has_what$have_had_hadnt_hasnt_haven nt_hasnt_havent_has_what$have_had_had
_has_what$have_had_hadnt_hasnt_havent nt_havent_has_what$have_had_hadnt_has
has_what$have_had_hadnt_hasnt_havent_ s_what$have_had_hadnt_hasnt_havent_ha
as_what$have_had_hadnt_hasnt_havent_h snt_havent_has_what$have_had_hadnt_ha
s_what$have_had_hadnt_hasnt_havent_ha t$have_had_hadnt_hasnt_havent_has_wha
_what$have_had_hadnt_hasnt_havent_has t_has_what$have_had_hadnt_hasnt_haven
what$have_had_hadnt_hasnt_havent_has_ t_hasnt_havent_has_what$have_had_hadn
hat$have_had_hadnt_hasnt_havent_has_w t_havent_has_what$have_had_hadnt_hasn
at$have_had_hadnt_hasnt_havent_has_wh ve_had_hadnt_hasnt_havent_has_what$ha
t$have_had_hadnt_hasnt_havent_has_wha vent_has_what$have_had_hadnt_hasnt_ha
$have_had_hadnt_hasnt_havent_has_what what$have_had_hadnt_hasnt_havent_has_
▶ 𝐵 = tedtttshhhhhhhaavv____w$_edsaaannnaa_
▶ MTF yields 8, 5, 5, 2, 0, 0, 8, 7, 0, 0, 0, 0, 0, 0, 7, 0, 9, 0, 8, 0, 0, 0, 10, 9, 2, 9, 9, 8, 7, 0, 0, 10, 0, 0, 1, 0, 5
Summary of Compression Methods
▶ Huffman: exploits uneven character frequencies (optimal prefix-free code)
▶ RLE: exploits long runs (Elias codes for the run lengths)
▶ LZW: exploits (local) repeated substrings (adaptive dictionary)
▶ MTF & BWT: transformations that make the above more effective (combined in bzip2)