55DataCompression PDF
Algorithms
http://algs4.cs.princeton.edu
5.5 DATA COMPRESSION
‣ introduction
‣ run-length coding
‣ Huffman compression
‣ LZW compression
Data compression
bitstream B (0110110101...) → Compress → compressed version C(B) (1101011111...) → Expand → original bitstream B (0110110101...)
Food for thought
boolean isEmpty()            is the bitstream empty?
void close()                 close the bitstream
API for static methods that read from a bitstream on standard input

A key feature of the abstraction is that, in marked contrast to StdIn, the data on standard input is not necessarily aligned on byte boundaries. If the input stream is a single byte, a client could read it 1 bit at a time with 8 calls to readBoolean(). The close() method is not essential, but, for clean termination, clients should call close() to indicate that no more bits are to be read. As with StdIn/StdOut, we use the following complementary API for writing bitstreams to standard output:

public class BinaryStdOut
void write(boolean b)        write the specified bit
void write(char c)           write the specified 8-bit char
void write(char c, int r)    write the r least significant bits of the specified char
[similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)]
void close()                 close the bitstream
API for static methods that write to a bitstream on standard output
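To make the "not aligned on byte boundaries" point concrete, here is a minimal sketch of how a writer in the spirit of BinaryStdOut might pack bits into bytes. BitWriter is a hypothetical stand-in, not the algs4 class: it writes to an in-memory buffer so the example is self-contained, and padding the last byte with 0 bits on close() is an assumption of this sketch.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-in for a BinaryStdOut-style writer; it collects
// output in memory instead of writing to standard output.
final class BitWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int buffer;   // bits accumulated so far
    private int n;        // number of bits currently in buffer (0..7)

    // Write a single bit.
    void write(boolean bit) {
        buffer = (buffer << 1) | (bit ? 1 : 0);
        if (++n == 8) flushByte();
    }

    // Write the r least significant bits of c, most significant first.
    void write(char c, int r) {
        if (r < 1 || r > 16) throw new IllegalArgumentException("r = " + r);
        for (int i = r - 1; i >= 0; i--)
            write(((c >>> i) & 1) == 1);
    }

    // Write c as an 8-bit char.
    void write(char c) {
        write(c, 8);
    }

    // Pad the final partial byte with 0 bits (an assumption of this
    // sketch) and return everything written.
    byte[] close() {
        while (n != 0) write(false);
        return out.toByteArray();
    }

    private void flushByte() {
        out.write(buffer);   // writes the low-order 8 bits
        buffer = 0;
        n = 0;
    }

    public static void main(String[] args) {
        BitWriter w = new BitWriter();
        w.write(true);          // 1 bit
        w.write('A', 7);        // the 7 low-order bits of 'A'
        w.write((char) 5, 3);   // 3 bits, so the stream ends mid-byte
        for (byte b : w.close())
            System.out.printf("%02X ", b & 0xFF);
        System.out.println();
    }
}
```

The client writes 11 bits in total, so the output is two bytes, and only close() decides what happens to the unused 5 bits of the second byte.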
Writing binary data
...ters, the table at right is useful for reference.
[Table: hex-to-ASCII conversion; row 1 reads DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US]
Universal data compression
Pf 1. [by contradiction]
・Suppose you have a universal data compression algorithm U
Pf 2. [by counting]
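One way to spell out the counting argument, as a sketch: fix a length n (the choice of n is arbitrary; n = 1000 works as an illustration) and compare the number of n-bit strings with the number of strictly shorter bitstrings.

```latex
\[
\#\{\text{bitstrings of length } n\} = 2^{n},
\qquad
\#\{\text{bitstrings of length} < n\} = \sum_{k=0}^{n-1} 2^{k} = 2^{n} - 1 .
\]
```

Since 2^n - 1 < 2^n, a lossless (one-to-one) compressor cannot map every n-bit string to a strictly shorter bitstring; at least one input does not shrink.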
Undecidability
1000000 bits
A difficult file to compress: one million (pseudo-) random bits
A. Quite a bit.
Run-length encoding
David Huffman
Variable-length codes
Issue. Ambiguity.
SOS ?
V7 ?
IAMIE ?
EEWNI ?
Variable-length codes
Ex 1. Fixed-length code.
Ex 2. Append special stop char to each codeword.
Ex 3. General prefix-free code.

Codeword table
key value
! 101
A 0
B 1111
C 110
D 100
R 1110

Compressed bitstring
011111110011001000111111100101 30 bits
A B R A C A D A B R A !

Compressed bitstring (second code)
11000111101011100110001111101 29 bits
A B R A C A D A B R A !

Two prefix-free codes
[Figure: trie representation of each code]
Prefix-free codes: trie representation
[Figure: codeword table (key, value) and trie representation for two prefix-free codes, with compressed bitstrings of 30 bits and 29 bits for A B R A C A D A B R A !]
Prefix-free codes: compression and expansion
Compression.
・Method 1: start at leaf; follow path up to the root; print bits in reverse.
・Method 2: create ST of key-value pairs.
Expansion.
・Start at root.
・Go left if bit is 0; go right if 1.
・If leaf node, print char and return to root.
[Figure: codeword table (! 101, A 0, B 1111, C 110, D 100, R 1110) and its trie representation; compressed bitstring 011111110011001000111111100101, 30 bits]
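The compression and expansion steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the course implementation: PrefixFreeSketch and its methods are hypothetical names, the codeword table is the first one from the slides, and bits are kept as a String of '0'/'1' chars for readability rather than as a real bitstream.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: all names here are hypothetical.
final class PrefixFreeSketch {
    // Trie node: a leaf holds a char; an internal node has children.
    static final class Node {
        char ch;
        Node left, right;          // left = bit 0, right = bit 1
        boolean isLeaf() { return left == null && right == null; }
    }

    // The first codeword table shown on the slides.
    static Map<Character, String> exampleCode() {
        Map<Character, String> code = new LinkedHashMap<>();
        code.put('!', "101");
        code.put('A', "0");
        code.put('B', "1111");
        code.put('C', "110");
        code.put('D', "100");
        code.put('R', "1110");
        return code;
    }

    // Insert each codeword as a root-to-leaf path in the trie.
    static Node buildTrie(Map<Character, String> code) {
        Node root = new Node();
        for (Map.Entry<Character, String> e : code.entrySet()) {
            Node x = root;
            for (char bit : e.getValue().toCharArray()) {
                if (bit == '0') {
                    if (x.left == null) x.left = new Node();
                    x = x.left;
                } else {
                    if (x.right == null) x.right = new Node();
                    x = x.right;
                }
            }
            x.ch = e.getKey();
        }
        return root;
    }

    // Method 2: look up each char in a symbol table of codewords.
    // Assumes every char of text has an entry in the table.
    static String compress(String text, Map<Character, String> code) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) bits.append(code.get(c));
        return bits.toString();
    }

    // Expansion: start at root, follow one bit at a time, and at each
    // leaf print the char and return to the root. Assumes valid input.
    static String expand(String bits, Map<Character, String> code) {
        Node root = buildTrie(code);
        StringBuilder text = new StringBuilder();
        Node x = root;
        for (char bit : bits.toCharArray()) {
            x = (bit == '0') ? x.left : x.right;
            if (x.isLeaf()) {
                text.append(x.ch);
                x = root;
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> code = exampleCode();
        String bits = compress("ABRACADABRA!", code);
        System.out.println(bits + " (" + bits.length() + " bits)");
        System.out.println(expand(bits, code));
    }
}
```

Because no codeword is a prefix of another, the expansion loop never needs a delimiter or lookahead: reaching a leaf is the only signal that a codeword has ended.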
Huffman coding overview
Compression.
・Read message.
・Build best prefix-free code for message. How?
・Write prefix-free code (as a trie) to file.
・Compress message using prefix-free code.
Expansion.
・Read prefix-free code (as a trie) from file.
・Read compressed message and expand using trie.
Huffman trie node data type
Prefix-free codes: expansion
Shannon-Fano codes
Shannon-Fano algorithm:
・Partition symbols S into two subsets S0 and S1 of (roughly) equal freq.
・Recur in S0 and S1.
char freq encoding
A 5 0...
C 1 0...
B 2 1...
D 1 1...
! 1 1...
input
A B R A C A D A B R A !
Huffman algorithm demo
char freq encoding
A 5 0
B 2 111
C 1 1011
D 1 100
R 2 110
! 1 1010
[Figure: Huffman trie for the encoding table; leaves A, D, R, B, !, C, with 0 = left branch and 1 = right branch]
Huffman codes
Huffman algorithm:
・Count frequency freq[i] for each char i in input.
・Start with one node corresponding to each char i (with weight freq[i]).
・Repeat until single trie formed:
– select two tries with min weight freq[i] and freq[j]
– merge into single trie with weight freq[i] + freq[j]
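The merge loop above can be sketched with a priority queue. This is a hedged illustration rather than the slides' own code: HuffmanSketch and its methods are hypothetical names, java.util.PriorityQueue stands in for whatever min-priority queue the course uses, and the alphabet size R = 256 is an assumption.

```java
import java.util.PriorityQueue;

// Illustrative sketch; names and the choice of R = 256 are assumptions.
final class HuffmanSketch {
    static final int R = 256;   // assumed alphabet size (extended ASCII)

    // Trie node ordered by weight, so the queue yields min-weight tries.
    static final class Node implements Comparable<Node> {
        final char ch;
        final int freq;
        final Node left, right;

        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch;
            this.freq = freq;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() { return left == null && right == null; }

        @Override
        public int compareTo(Node that) {
            return Integer.compare(this.freq, that.freq);
        }
    }

    // Count frequencies, start with one node per char, then repeatedly
    // merge the two tries of minimum weight until one trie remains.
    // Assumes text is non-empty and all chars are below R.
    static Node buildTrie(String text) {
        int[] freq = new int[R];
        for (char c : text.toCharArray()) freq[c]++;

        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (char c = 0; c < R; c++)
            if (freq[c] > 0) pq.add(new Node(c, freq[c], null, null));

        while (pq.size() > 1) {
            Node x = pq.poll();
            Node y = pq.poll();
            pq.add(new Node('\0', x.freq + y.freq, x, y));
        }
        return pq.poll();
    }

    // Codeword of each char = path from root to its leaf (0 left, 1 right).
    static String[] codeTable(String text) {
        String[] st = new String[R];
        fill(st, buildTrie(text), "");
        return st;
    }

    private static void fill(String[] st, Node x, String path) {
        if (x.isLeaf()) { st[x.ch] = path; return; }
        fill(st, x.left, path + '0');
        fill(st, x.right, path + '1');
    }

    // Total number of bits needed to encode text with its Huffman code.
    static int encodedLength(String text) {
        String[] st = codeTable(text);
        int bits = 0;
        for (char c : text.toCharArray()) bits += st[c].length();
        return bits;
    }

    public static void main(String[] args) {
        String text = "ABRACADABRA!";
        String[] st = codeTable(text);
        for (char c = 0; c < R; c++)
            if (st[c] != null) System.out.println(c + " " + st[c]);
        System.out.println(encodedLength(text) + " bits");
    }
}
```

The individual codewords depend on how ties between equal weights are broken, so they may differ from a table built by hand, but the total encoded length is the same for every valid Huffman code of the input.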
Applications:
Constructing a Huffman encoding trie: Java implementation
Huffman encoding summary
Implementation.
・Pass 1: tabulate char frequencies and build trie.
・Pass 2: encode file by traversing trie or lookup table.
LZW compression demo
input   A B R A C A D A B R A B R A B R A
matches A B R A C A D AB RA BR ABR A
value   41 42 52 41 43 41 44 81 83 82 88 41 80

codeword table
key value
⋮ ⋮
A 41
B 42
C 43
D 44
⋮ ⋮
AB 81
BR 82
RA 83
AC 84
CA 85
AD 86
DA 87
ABR 88
RAB 89
BRA 8A
ABRA 8B
Lempel-Ziv-Welch compression
LZW compression.
・Create ST associating W-bit codewords with string keys.
・Initialize ST with codewords for single-char keys.
・Find longest string s in ST that is a prefix of unscanned part of input. [longest prefix match]
・Write the W-bit codeword associated with s.
・Add s + c to ST, where c is next char in the input.
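The steps above can be sketched as follows. This is a simplified illustration, not the course code: LzwCompressSketch is a hypothetical name, a HashMap with one-char-at-a-time extension stands in for the trie, and the parameters (R = 128 ASCII chars, W = 8 bits, codeword 0x80 reserved for end-of-file) are assumptions chosen to match the demo values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; names and parameter choices are assumptions.
final class LzwCompressSketch {
    static final int R = 128;       // assumed alphabet: 7-bit ASCII
    static final int W = 8;         // codeword width in bits
    static final int L = 1 << W;    // number of codewords = 2^W

    // Returns the codeword sequence for input (assumes ASCII chars only).
    static List<Integer> compress(String input) {
        // Initialize ST with codewords for single-char keys.
        Map<String, Integer> st = new HashMap<>();
        for (int i = 0; i < R; i++) st.put(String.valueOf((char) i), i);
        int code = R + 1;           // R itself (0x80) is reserved for EOF

        List<Integer> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // Longest prefix match: extend one char at a time. This is
            // valid because every prefix of an ST key is also in the ST.
            int end = pos + 1;
            while (end < input.length()
                    && st.containsKey(input.substring(pos, end + 1)))
                end++;
            String s = input.substring(pos, end);
            out.add(st.get(s));     // write the codeword for s

            // Add s + c to ST, where c is the next char, if room remains.
            if (end < input.length() && code < L)
                st.put(input.substring(pos, end + 1), code++);
            pos = end;
        }
        out.add(R);                 // end-of-file codeword
        return out;
    }

    // Format codewords as two-digit hex, as in the demo tables.
    static String toHex(List<Integer> codes) {
        StringBuilder sb = new StringBuilder();
        for (int c : codes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(compress("ABRACADABRABRABRA")));
    }
}
```

A trie makes the longest-prefix search proportional to the match length; the HashMap version repeats substring work and is used here only to keep the sketch short.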
Q. How to represent LZW compression code table?
A. A trie to support longest prefix match.
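[Figure: trie for the code table. Level 1: A 41, B 42, C 43, D 44, R 52. Level 2: AB 81, AC 84, AD 86, BR 82, CA 85, DA 87, RA 83. Level 3: ABR 88, BRA 8A, RAB 89. Level 4: ABRA 8B.]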
LZW expansion demo
value  41 42 52 41 43 41 44 81 83 82 88 41 80
output A B R A C A D AB RA BR ABR A

codeword table
key value
⋮ ⋮
41 A
42 B
43 C
44 D
⋮ ⋮
81 AB
82 BR
83 RA
84 AC
85 CA
86 AD
87 DA
88 ABR
89 RAB
8A BRA
8B ABRA
LZW expansion
・Read a W-bit key.
・Find associated string value in ST and write it out.
・Update ST.
Q. How to represent LZW expansion code table?
A. An array of size 2^W.

key value
65 A
66 B
67 C
68 D
⋮ ⋮
129 AB
130 BR
131 RA
132 AC
133 CA
134 AD
135 DA
136 ABR
137 RAB
138 BRA
139 ABRA
⋮ ⋮
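A minimal sketch of the expansion loop, using an array of size 2^W as the code table. The names are hypothetical and the parameters (R = 128, W = 8, codeword 0x80 as end-of-file) are assumptions matching the demo values. The one subtle line handles a codeword that arrives before its table entry has been completed.

```java
// Illustrative sketch; names and parameter choices are assumptions.
final class LzwExpandSketch {
    static final int R = 128;       // assumed alphabet: 7-bit ASCII
    static final int W = 8;         // codeword width in bits
    static final int L = 1 << W;    // code table size = 2^W

    // Expands a codeword sequence, stopping at the EOF codeword R.
    // Assumes the sequence was produced by a matching LZW compressor.
    static String expand(int... codes) {
        String[] st = new String[L];
        int i;                      // next available codeword value
        for (i = 0; i < R; i++) st[i] = String.valueOf((char) i);
        st[i++] = "";               // slot R is reserved for EOF

        if (codes.length == 0 || codes[0] == R) return "";
        StringBuilder out = new StringBuilder();
        String val = st[codes[0]];
        for (int k = 1; k < codes.length; k++) {
            out.append(val);                 // write current string
            int codeword = codes[k];         // read a W-bit key
            if (codeword == R) return out.toString();
            String s = st[codeword];
            // Tricky case: the codeword is the entry being built right
            // now, so its string must be val plus val's first char.
            if (i == codeword) s = val + val.charAt(0);
            if (i < L) st[i++] = val + s.charAt(0);   // update ST
            val = s;
        }
        out.append(val);            // input ended without an EOF codeword
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand(0x41, 0x42, 0x52, 0x41, 0x43, 0x41, 0x44,
                                  0x81, 0x83, 0x82, 0x88, 0x41, 0x80));
        System.out.println(expand(0x41, 0x42, 0x81, 0x83, 0x80));
    }
}
```

The expander stays one step behind the compressor: each new table entry needs the first char of the next string, which is why a codeword can refer to the entry that is still being built.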
LZW tricky case: compression
input   A B A B A B A
matches A B AB ABA
value   41 42 81 83 80

codeword table
key value
⋮ ⋮
A 41
B 42
C 43
D 44
⋮ ⋮
AB 81
BA 82
ABA 83
LZW tricky case: expansion
codeword table
key value
⋮ ⋮
41 A
42 B
43 C
44 D
⋮ ⋮
81 AB
82 BA
83 ABA
LZW implementation details
LZW in the real world
・LZW.
・Deflate / zlib = LZ77 variant + Huffman.
Lossless data compression benchmarks
1999 RK 1.89
Lossless compression.
・Represent fixed-length symbols with variable-length codes. [Huffman]
・Represent variable-length symbols with fixed-length codes. [LZW]