55DataCompression PDF


Algorithms
ROBERT SEDGEWICK | KEVIN WAYNE
FOURTH EDITION

5.5 DATA COMPRESSION
‣ introduction
‣ run-length coding
‣ Huffman compression
‣ LZW compression

http://algs4.cs.princeton.edu

Data compression

Compression reduces the size of a file:


・To save space when storing it.
・To save time when transmitting it.
・Most files have lots of redundancy.
Who needs compression?
・Moore's law: # transistors on a chip doubles every 18–24 months.
・Parkinson's law: data expands to fill space available.
・Text, images, sound, video, …
“ Every day, we create 2.5 quintillion bytes of data—so much that
90% of the data in the world today has been created in the last
two years alone. ” — IBM report on big data (2011)

Basic concepts ancient (1950s), best technology recently developed.


Applications

Generic file compression.


・Files: GZIP, BZIP, 7z.
・Archivers: PKZIP.
・File systems: NTFS, HFS+, ZFS.
Multimedia.
・Images: GIF, JPEG.
・Sound: MP3.
・Video: MPEG, DivX™, HDTV.
Communication.
・ITU-T T4 Group 3 Fax.
・V.42bis modem.
・Skype.
Databases. Google, Facebook, ....
Lossless compression and expansion

Message. Binary data B we want to compress.
Compress. Generates a "compressed" representation C(B), which uses fewer bits (you hope).
Expand. Reconstructs original bitstream B.

bitstream B      → Compress →   compressed version C(B)   → Expand →   original bitstream B
0110110101...                   1101011111...                          0110110101...

Basic model for data compression

Compression ratio. Bits in C(B) / bits in B.

Ex. 50–75% or better compression ratio for natural language.

Food for thought

Data compression has been omnipresent since antiquity:
・Number systems.
・Natural languages.
・Mathematical notation. Ex: $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$

has played a central role in communications technology,
・Grade 2 Braille.
・Morse code.
・Telephone system.
[Figure: Braille cells spelling "b r a i l l e", with the sample words "but rather a I like like every"]

and is part of modern life.
・MP3.
・MPEG.

Q. What role will it play in the future?

Data representation: genomic code

Genome. String over the alphabet { A, C, T, G }.

Goal. Encode an N-character genome: A T A G A T G C A T A G . . .

Standard ASCII encoding.
・8 bits per char.
・8 N bits.

char  hex  binary
A     41   01000001
C     43   01000011
T     54   01010100
G     47   01000111

Two-bit encoding.
・2 bits per char.
・2 N bits.

char  binary
A     00
C     01
T     10
G     11

Fixed-length code. k-bit code supports alphabet of size $2^k$.


Amazing but true. Some genomic databases in 1990s used ASCII.
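
The two-bit encoding can be sketched in a few lines. This is only an illustration: the class name `TwoBitGenome` and its methods are hypothetical, and it packs into a plain `byte[]` rather than using the `BinaryStdOut` library introduced next.

```java
// Sketch: store a genome over { A, C, T, G } at 2 bits per character.
// Class and method names are hypothetical, not part of the course library.
final class TwoBitGenome {
    // Position in this string is the 2-bit code: A=00, C=01, T=10, G=11.
    private static final String ALPHABET = "ACTG";

    // Pack four characters per byte, first character in the two high-order bits.
    static byte[] pack(String genome) {
        byte[] out = new byte[(genome.length() + 3) / 4];
        for (int i = 0; i < genome.length(); i++) {
            int code = ALPHABET.indexOf(genome.charAt(i));
            if (code < 0)
                throw new IllegalArgumentException("not a genome char: " + genome.charAt(i));
            int shift = 6 - 2 * (i % 4);
            out[i / 4] |= (byte) (code << shift);
        }
        return out;
    }

    // Recover the first n characters; n must be stored separately,
    // because the last byte may contain padding bits.
    static String unpack(byte[] packed, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            int shift = 6 - 2 * (i % 4);
            int code = (packed[i / 4] >> shift) & 0x3;
            sb.append(ALPHABET.charAt(code));
        }
        return sb.toString();
    }
}
```

A 12-character genome such as ATAGATGCATAG then occupies 3 bytes instead of 12.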
Reading and writing binary data

Binary standard input and standard output. Libraries to read and write bits from standard input and to standard output.

Binary input and output. Most systems nowadays, including Java, base their I/O on 8-bit bytestreams, so we might decide to read and write bytestreams to match I/O formats with the internal representations of primitive types, encoding an 8-bit char with 1 byte, a 16-bit short with 2 bytes, a 32-bit int with 4 bytes, and so forth. Since bitstreams are the primary abstraction for data compression, we go a bit further to allow clients to read and write individual bits, intermixed with data of various types (primitive types and String). The goal is to minimize the necessity for type conversion in client programs and also to take care of operating-system conventions for representing data. We use the following API for reading a bitstream from standard input:

public class BinaryStdIn
    boolean readBoolean()       read 1 bit of data and return as a boolean value
    char    readChar()          read 8 bits of data and return as a char value
    char    readChar(int r)     read r bits of data and return as a char value
    [similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)]
    boolean isEmpty()           is the bitstream empty?
    void    close()             close the bitstream
API for static methods that read from a bitstream on standard input

A key feature of the abstraction is that, in marked contrast to StdIn, the data on standard input is not necessarily aligned on byte boundaries. If the input stream is a single byte, a client could read it 1 bit at a time with 8 calls to readBoolean(). The close() method is not essential, but, for clean termination, clients should call close() to indicate that no more bits are to be read. As with StdIn/StdOut, we use the following complementary API for writing bitstreams to standard output:

public class BinaryStdOut
    void write(boolean b)       write the specified bit
    void write(char c)          write the specified 8-bit char
    void write(char c, int r)   write the r least significant bits of the specified char
    [similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)]
    void close()                close the bitstream
API for static methods that write to a bitstream on standard output

Writing binary data

Date representation. Four different ways to represent 12/31/1999.

A character stream (StdOut)
    StdOut.print(month + "/" + day + "/" + year);

    00110001 00110010 00101111 00110011 00110001 00101111 00110001 00111001 00111001 00111001
    1        2        /        3        1        /        1        9        9        9
    80 bits

Three ints (BinaryStdOut)
    BinaryStdOut.write(month);
    BinaryStdOut.write(day);
    BinaryStdOut.write(year);

    00000000000000000000000000001100 00000000000000000000000000011111 00000000000000000000011111001111
    12                               31                               1999
    96 bits

Two chars and a short (BinaryStdOut)
    BinaryStdOut.write((char) month);
    BinaryStdOut.write((char) day);
    BinaryStdOut.write((short) year);

    00001100 00011111 0000011111001111
    12       31       1999
    32 bits

A 4-bit field, a 5-bit field, and a 12-bit field (BinaryStdOut)
    BinaryStdOut.write(month, 4);
    BinaryStdOut.write(day, 5);
    BinaryStdOut.write(year, 12);

    1100 11111 011111001111 000
    12   31    1999
    21 bits (+ 3 bits for byte alignment at close)

Four ways to put a date onto standard output

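
The 21-bit layout in the last variant can be reproduced with shifts and masks on an ordinary int. This is a sketch under the assumption that month fits in 4 bits, day in 5, and year in 12; the class `PackedDate` is hypothetical and does not use `BinaryStdOut`.

```java
// Sketch: pack a date into 4 + 5 + 12 = 21 bits of an int.
// Hypothetical helper class; assumes month < 16, day < 32, year < 4096.
final class PackedDate {
    static int pack(int month, int day, int year) {
        return (month << 17) | (day << 12) | year;
    }

    static int month(int packed) { return (packed >>> 17) & 0xF; }
    static int day(int packed)   { return (packed >>> 12) & 0x1F; }
    static int year(int packed)  { return packed & 0xFFF; }

    // Render the 21-bit field as 0 and 1 characters, left-padded with zeros.
    static String bits(int packed) {
        StringBuilder sb = new StringBuilder(Integer.toBinaryString(packed));
        while (sb.length() < 21) sb.insert(0, '0');
        return sb.toString();
    }
}
```

For 12/31/1999, `bits(pack(12, 31, 1999))` gives 110011111011111001111, the same 21 bits shown in the figure before padding.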
Binary dumps

Q. How to examine the contents of a bitstream?

… when working with small inputs. We use a slightly more complicated version that just prints the count when the width argument is 0 (see Exercise 5.5.X). The similar client HexDump groups the data into 8-bit bytes and prints each as two hexadecimal digits that each represent 4 bits. The client PictureDump displays the bits in a Picture. You can download HexDump and PictureDump from the booksite. Typically, we use piping and redirection at the command-line level when working with binary files: we can pipe the output of an encoder to BinaryDump, HexDump, or PictureDump, or redirect it to a file.

Standard character stream
    % more abra.txt
    ABRACADABRA!

Bitstream represented as 0 and 1 characters
    % java BinaryDump 16 < abra.txt
    0100000101000010
    0101001001000001
    0100001101000001
    0100010001000001
    0100001001010010
    0100000100100001
    96 bits

Bitstream represented with hex digits
    % java HexDump 4 < abra.txt
    41 42 52 41
    43 41 44 41
    42 52 41 21
    12 bytes

Bitstream represented as pixels in a Picture
    % java PictureDump 16 6 < abra.txt
    [Figure: 16-by-6 pixel window, magnified]
    96 bits

Four ways to look at a bitstream

ASCII encoding. When you HexDump a bitstream that contains ASCII-encoded characters, the table below is useful for reference. Given a 2-digit hex number, use the first hex digit as a row index and the second hex digit as a column reference to find the character that it encodes. For example, 31 encodes the digit 1, 4A encodes the letter J, and so forth. This table is for 7-bit ASCII, so the first hex digit must be 7 or less. Hex numbers starting with 0 and 1 (and the numbers 20 and 7F) correspond to non-printing control characters.

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0    NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
1    DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
2    SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

Hexadecimal to ASCII conversion table

Universal data compression

US Patent 5,533,051 on "Methods for Data Compression", which is capable
of compressing all files.

Slashdot reports of the Zero Space Tuner™ and BinaryAccelerator™.

“ ZeoSync has announced a breakthrough in data compression
that allows for 100:1 lossless compression of random data. If
this is true, our bandwidth problems just got a lot smaller.… ”

Universal data compression

Proposition. No algorithm can compress every bitstring.

Pf 1. [by contradiction]
・Suppose you have a universal data compression algorithm U that can compress every bitstream.
・Given bitstring $B_0$, compress it to get smaller bitstring $B_1$.
・Compress $B_1$ to get a smaller bitstring $B_2$.
・Continue until reaching bitstring of size 0.
・Implication: all bitstrings can be compressed to 0 bits!

Pf 2. [by counting]
・Suppose your algorithm can compress all 1,000-bit strings.
・$2^{1000}$ possible bitstrings with 1,000 bits.
・Only $1 + 2 + 4 + \dots + 2^{998} + 2^{999}$ can be encoded with ≤ 999 bits.
・Similarly, only 1 in $2^{499}$ bitstrings can be encoded with ≤ 500 bits!

[Figure: bitstring B passed repeatedly through U. Universal data compression?]

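
The counting step can be checked numerically. The sketch below sums the number of bitstrings of each length below n and compares it with the number of n-bit strings; `CountingArgument` is a hypothetical name and `BigInteger` is used only because $2^{1000}$ does not fit in a long.

```java
import java.math.BigInteger;

// Sketch: there are fewer bitstrings of length < n than of length exactly n,
// so no lossless scheme can map every n-bit string to a shorter one.
final class CountingArgument {
    // 2^0 + 2^1 + ... + 2^(n-1): number of bitstrings of length 0 through n-1.
    static BigInteger shorterStrings(int n) {
        BigInteger total = BigInteger.ZERO;
        for (int k = 0; k < n; k++)
            total = total.add(BigInteger.ONE.shiftLeft(k));
        return total;
    }

    // 2^n: number of bitstrings of length exactly n.
    static BigInteger stringsOfLength(int n) {
        return BigInteger.ONE.shiftLeft(n);
    }
}
```

For n = 1000 the first count is exactly one less than the second, so at least one 1,000-bit string has no shorter encoding.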
Undecidability

% java RandomBits | java PictureDump 2000 500
[Figure: PictureDump output, 1000000 bits]
A difficult file to compress: one million (pseudo-) random bits

public class RandomBits
{
    public static void main(String[] args)
    {
        int x = 11111;
        for (int i = 0; i < 1000000; i++)
        {
            x = x * 314159 + 218281;
            BinaryStdOut.write(x > 0);
        }
        BinaryStdOut.close();
    }
}

Rdenudcany in Enlgsih lnagugae

Q. How mcuh rdenudcany is in the Enlgsih lnagugae?

“ ... randomising letters in the middle of words [has] little or no


effect on the ability of skilled readers to understand the text. This
is easy to denmtrasote. In a pubiltacion of New Scnieitst you
could ramdinose all the letetrs, keipeng the first two and last two
the same, and reibadailty would hadrly be aftcfeed. My ansaylis
did not come to much beucase the thoery at the time was for
shape and senqeuce retigcionon. Saberi's work sugsegts we may
have some pofrweul palrlael prsooscers at work. The resaon for
this is suerly that idnetiyfing coentnt by paarllel prseocsing
speeds up regnicoiton. We only need the first and last two letetrs
to spot chganes in meniang. ” — Graham Rawlinson

A. Quite a bit.

5.5 DATA COMPRESSION
‣ run-length coding

Run-length encoding

Simple type of redundancy in a bitstream. Long runs of repeated bits.


0000000000000001111111000000011111111111   40 bits

Representation. 4-bit counts to represent alternating runs of 0s and 1s:
15 0s, then 7 1s, then 7 0s, then 11 1s.

1111 0111 0111 1011   16 bits (instead of 40)
 15    7    7   11

Q. How many bits to store the counts?


A. We'll use 8 (but 4 in the example above).

Q. What to do when run length exceeds max count?


A. If longer than 255, intersperse runs of length 0.

Applications. JPEG, ITU-T T4 Group 3 Fax, ...
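
A minimal sketch of the scheme above, operating on strings of 0 and 1 characters so that it runs without the course I/O library. The class `RunLengthSketch`, its methods, and the count-width parameter `w` are assumptions for illustration; the course implementation works on `BinaryStdIn` / `BinaryStdOut` with 8-bit counts.

```java
// Sketch: run-length coding with w-bit counts over strings of '0' / '1'.
// Runs alternate, starting with 0s; a run longer than 2^w - 1 is split
// by interspersing a run of length 0 of the other bit.
final class RunLengthSketch {
    static String compress(String bits, int w) {
        int max = (1 << w) - 1;
        StringBuilder out = new StringBuilder();
        char expected = '0';
        int run = 0;
        for (int i = 0; i < bits.length(); i++) {
            char b = bits.charAt(i);
            if (b != expected) {              // run ended: emit count, switch bit
                out.append(count(run, w));
                run = 0;
                expected = b;
            } else if (run == max) {          // run too long: emit max, then 0
                out.append(count(max, w)).append(count(0, w));
                run = 0;
            }
            run++;
        }
        out.append(count(run, w));
        return out.toString();
    }

    static String expand(String counts, int w) {
        StringBuilder out = new StringBuilder();
        char bit = '0';
        for (int i = 0; i + w <= counts.length(); i += w) {
            int run = Integer.parseInt(counts.substring(i, i + w), 2);
            for (int j = 0; j < run; j++) out.append(bit);
            bit = (bit == '0') ? '1' : '0';
        }
        return out.toString();
    }

    // w-bit binary representation of value, left-padded with zeros.
    private static String count(int value, int w) {
        StringBuilder sb = new StringBuilder(Integer.toBinaryString(value));
        while (sb.length() < w) sb.insert(0, '0');
        return sb.toString();
    }
}
```

With w = 4, the 40-bit example compresses to 1111011101111011, and a run of twenty 0s becomes the counts 15, 0, 5.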


Run-length encoding: Java implementation

public class RunLength
{
    private final static int R   = 256;   // maximum run-length count
    private final static int lgR = 8;     // number of bits per count

    public static void compress()
    {  /* see textbook */  }

    public static void expand()
    {
        boolean bit = false;
        while (!BinaryStdIn.isEmpty())
        {
            int run = BinaryStdIn.readInt(lgR);   // read 8-bit count from standard input
            for (int i = 0; i < run; i++)
                BinaryStdOut.write(bit);          // write 1 bit to standard output
            bit = !bit;
        }
        BinaryStdOut.close();                     // pad 0s for byte alignment
    }
}

5.5 DATA COMPRESSION
‣ Huffman compression

[Photo: David Huffman]
Variable-length codes

Use different number of bits to encode different chars.

Ex. Morse code: • • • − − − • • •

Issue. Ambiguity.
    SOS ?
    V7 ?
    IAMIE ?
    EEWNI ?
(codeword for S is a prefix of codeword for V)

In practice. Use a medium gap to separate codewords.

Variable-length codes

Q. How do we avoid ambiguity?
A. Ensure that no codeword is a prefix of another.

Ex 1. Fixed-length code.
Ex 2. Append special stop char to each codeword.
Ex 3. General prefix-free code.

Codeword tables

key   code 1   code 2
!     101      101
A     0        11
B     1111     00
C     110      010
D     100      100
R     1110     011

Compressed bitstring for A B R A C A D A B R A !
code 1:  011111110011001000111111100101   30 bits
code 2:  11000111101011100110001111101    29 bits

[Figure: trie representation of each code, with chars in the leaves]

Two prefix-free codes
Prefix-free codes: trie representation

Q. How to represent the prefix-free code?
A. A binary trie!
・Chars in leaves.
・Codeword is path from root to leaf.

[Figure: codeword table and trie representation of a prefix-free code for A B R A C A D A B R A ! (! 101, A 0, B 1111, C 110, D 100, R 1110); compressed bitstring 011111110011001000111111100101, 30 bits]
Prefix-free codes: compression and expansion

Compression.
・Method 1: start at leaf; follow path up to the root; print bits in reverse.
・Method 2: create ST of key-value pairs.

Expansion.
・Start at root.
・Go left if bit is 0; go right if 1.
・If leaf node, print char and return to root.

[Figure: codeword table and trie representation of a prefix-free code for A B R A C A D A B R A ! (! 101, A 0, B 1111, C 110, D 100, R 1110); compressed bitstring 011111110011001000111111100101, 30 bits]
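
Both directions can be sketched on strings of 0 and 1 characters. The code below uses Method 2 (a symbol table) for compression and a trie walk for expansion; the class `PrefixFreeSketch` and its helpers are hypothetical, the trie is rebuilt from the codeword table for convenience, and the input is assumed to be a valid encoding.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: compress with a codeword table, expand by walking a binary trie.
final class PrefixFreeSketch {
    private static final class Node {
        char ch;
        Node left, right;
        boolean isLeaf() { return left == null && right == null; }
    }

    // The first example code from the slides.
    static Map<Character, String> exampleCode() {
        Map<Character, String> code = new HashMap<>();
        code.put('!', "101");  code.put('A', "0");    code.put('B', "1111");
        code.put('C', "110");  code.put('D', "100");  code.put('R', "1110");
        return code;
    }

    // Method 2: look up each char's codeword in the symbol table.
    static String compress(String text, Map<Character, String> code) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) out.append(code.get(c));
        return out.toString();
    }

    // Start at root; go left on 0, right on 1; at a leaf, emit char and restart.
    static String expand(String bits, Map<Character, String> code) {
        Node root = buildTrie(code);
        StringBuilder out = new StringBuilder();
        Node x = root;
        for (int i = 0; i < bits.length(); i++) {
            x = (bits.charAt(i) == '0') ? x.left : x.right;
            if (x.isLeaf()) { out.append(x.ch); x = root; }
        }
        return out.toString();
    }

    // Insert each codeword as a root-to-leaf path.
    private static Node buildTrie(Map<Character, String> code) {
        Node root = new Node();
        for (Map.Entry<Character, String> e : code.entrySet()) {
            Node x = root;
            for (char b : e.getValue().toCharArray()) {
                if (b == '0') { if (x.left == null) x.left = new Node(); x = x.left; }
                else          { if (x.right == null) x.right = new Node(); x = x.right; }
            }
            x.ch = e.getKey();
        }
        return root;
    }
}
```

Compressing ABRACADABRA! with the example code yields the 30-bit string shown in the figure, and expanding that string returns the original text.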
Huffman coding overview

Dynamic model. Use a custom prefix-free code for each message.

Compression.
・Read message.
・Build best prefix-free code for message. How?
・Write prefix-free code (as a trie) to file.
・Compress message using prefix-free code.
Expansion.
・Read prefix-free code (as a trie) from file.
・Read compressed message and expand using trie.

Huffman trie node data type

private static class Node implements Comparable<Node>
{
    private final char ch;    // used only for leaf nodes
    private final int freq;   // used only for compress
    private final Node left, right;

    // initializing constructor
    public Node(char ch, int freq, Node left, Node right)
    {
        this.ch    = ch;
        this.freq  = freq;
        this.left  = left;
        this.right = right;
    }

    // is Node a leaf?
    public boolean isLeaf()
    {  return left == null && right == null;  }

    // compare Nodes by frequency (stay tuned)
    public int compareTo(Node that)
    {  return this.freq - that.freq;  }
}

Prefix-free codes: expansion

public void expand()
{
    Node root = readTrie();           // read in encoding trie
    int N = BinaryStdIn.readInt();    // read in number of chars

    for (int i = 0; i < N; i++)
    {
        // expand codeword for ith char
        Node x = root;
        while (!x.isLeaf())
        {
            if (!BinaryStdIn.readBoolean())
                x = x.left;
            else
                x = x.right;
        }
        BinaryStdOut.write(x.ch, 8);
    }
    BinaryStdOut.close();
}

Running time. Linear in input size N.

Prefix-free codes: how to transmit

Q. How to write the trie?
A. Write preorder traversal of trie; mark leaf and internal nodes with a bit.

private static void writeTrie(Node x)
{
    if (x.isLeaf())
    {
        BinaryStdOut.write(true);
        BinaryStdOut.write(x.ch, 8);
        return;
    }
    BinaryStdOut.write(false);
    writeTrie(x.left);
    writeTrie(x.right);
}

[Figure: using preorder traversal to encode a trie as a bitstream. Internal nodes numbered 1 to 5 in preorder; leaves A D ! C R B.]
01010000010010100010001000010101010000110101010010101000010

Note. If message is long, overhead of transmitting trie is small.

Prefix-free codes: how to transmit

Q. How to read in the trie?
A. Reconstruct from preorder traversal of trie.

private static Node readTrie()
{
    if (BinaryStdIn.readBoolean())
    {
        char c = BinaryStdIn.readChar(8);
        return new Node(c, 0, null, null);
    }
    Node x = readTrie();
    Node y = readTrie();
    return new Node('\0', 0, x, y);   // arbitrary value (value not used with internal nodes)
}

[Figure: using preorder traversal to encode a trie as a bitstream. Internal nodes numbered 1 to 5 in preorder; leaves A D ! C R B.]
01010000010010100010001000010101010000110101010010101000010

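
The write and read steps can be exercised together as a round trip. The sketch below mirrors writeTrie() and readTrie() but works on a string of 0 and 1 characters instead of `BinaryStdIn` / `BinaryStdOut`; the class `TrieSerializationSketch`, the cursor array, and `exampleTrie()` are assumptions made for illustration.

```java
// Sketch: preorder serialization of a trie to a string of '0' / '1' chars.
// Leaf = '1' followed by the 8-bit char; internal node = '0', then both subtries.
final class TrieSerializationSketch {
    static final class Node {
        final char ch;
        final Node left, right;
        Node(char ch, Node left, Node right) { this.ch = ch; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    static void writeTrie(Node x, StringBuilder out) {
        if (x.isLeaf()) {
            out.append('1');
            String b = Integer.toBinaryString(x.ch & 0xFF);
            for (int i = b.length(); i < 8; i++) out.append('0');   // pad to 8 bits
            out.append(b);
            return;
        }
        out.append('0');
        writeTrie(x.left, out);
        writeTrie(x.right, out);
    }

    // pos is a one-element cursor shared by the recursive calls.
    static Node readTrie(String bits, int[] pos) {
        if (bits.charAt(pos[0]++) == '1') {
            char c = (char) Integer.parseInt(bits.substring(pos[0], pos[0] + 8), 2);
            pos[0] += 8;
            return new Node(c, null, null);
        }
        Node left = readTrie(bits, pos);
        Node right = readTrie(bits, pos);
        return new Node('\0', left, right);
    }

    static String serialize(Node root) {
        StringBuilder sb = new StringBuilder();
        writeTrie(root, sb);
        return sb.toString();
    }

    // Trie with the shape shown in the figure: leaves A D ! C R B in preorder.
    static Node exampleTrie() {
        Node a = new Node('A', null, null), d = new Node('D', null, null);
        Node bang = new Node('!', null, null), c = new Node('C', null, null);
        Node r = new Node('R', null, null), b = new Node('B', null, null);
        Node n4 = new Node('\0', bang, c);
        Node n3 = new Node('\0', d, n4);
        Node n5 = new Node('\0', r, b);
        Node n2 = new Node('\0', n3, n5);
        return new Node('\0', a, n2);
    }
}
```

A trie with 6 leaves and 5 internal nodes serializes to 6 × 9 + 5 = 59 bits, and reading the string back and serializing again gives the same string.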
Shannon-Fano codes

Q. How to find best prefix-free code?

Shannon-Fano algorithm:
・Partition symbols S into two subsets $S_0$ and $S_1$ of (roughly) equal freq.
・Codewords for symbols in $S_0$ start with 0; for symbols in $S_1$ start with 1.
・Recur in $S_0$ and $S_1$.

$S_0$ = codewords starting with 0      $S_1$ = codewords starting with 1
char  freq  encoding                   char  freq  encoding
A     5     0...                       B     2     1...
C     1     0...                       D     1     1...
                                       R     2     1...
                                       !     1     1...

Problem 1. How to divide up symbols?
Problem 2. Not optimal!
Huffman algorithm demo

・Count frequency for each character in input.

input: A B R A C A D A B R A !

char  freq  encoding
A     5     0
B     2     111
C     1     1011
D     1     100
R     2     110
!     1     1010

[Figure: Huffman trie for the input, with leaves A, D, !, C, R, B]
Huffman codes

Q. How to find best prefix-free code?

Huffman algorithm:
・Count frequency freq[i] for each char i in input.
・Start with one node corresponding to each char i (with weight freq[i]).
・Repeat until single trie formed:
– select two tries with min weight freq[i] and freq[j]
– merge into single trie with weight freq[i] + freq[j]
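
The merge loop above can be sketched with the standard library's `java.util.PriorityQueue` standing in for the course's MinPQ. The class `HuffmanSketch` and its methods are hypothetical, and the sketch assumes a non-empty text with at least two distinct chars.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Sketch: build a Huffman trie and count the bits needed to encode the text.
final class HuffmanSketch {
    static final class Node implements Comparable<Node> {
        final char ch;
        final int freq;
        final Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        public int compareTo(Node that) { return Integer.compare(this.freq, that.freq); }
    }

    static Node buildTrie(String text) {
        // count frequency of each char
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        // start with one singleton trie per char
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue(), null, null));

        // repeatedly merge the two tries of minimum weight
        while (pq.size() > 1) {
            Node x = pq.poll();
            Node y = pq.poll();
            pq.add(new Node('\0', x.freq + y.freq, x, y));
        }
        return pq.poll();
    }

    // Record the root-to-leaf path of every char as its codeword.
    static void collect(Node x, String prefix, Map<Character, String> code) {
        if (x.isLeaf()) { code.put(x.ch, prefix); return; }
        collect(x.left, prefix + '0', code);
        collect(x.right, prefix + '1', code);
    }

    static int encodedBits(String text) {
        Map<Character, String> code = new HashMap<>();
        collect(buildTrie(text), "", code);
        int bits = 0;
        for (char c : text.toCharArray()) bits += code.get(c).length();
        return bits;
    }
}
```

For ABRACADABRA! the total is 28 bits. Ties between equal weights may be broken differently than in the demo, so individual codewords can differ, but the total length of an optimal code does not.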

Applications:

Constructing a Huffman encoding trie: Java implementation

private static Node buildTrie(int[] freq)
{
    // initialize PQ with singleton tries
    MinPQ<Node> pq = new MinPQ<Node>();
    for (char i = 0; i < R; i++)
        if (freq[i] > 0)
            pq.insert(new Node(i, freq[i], null, null));

    // merge two smallest tries
    while (pq.size() > 1)
    {
        Node x = pq.delMin();
        Node y = pq.delMin();
        // '\0': not used for internal nodes; x.freq + y.freq: total frequency; x, y: two subtries
        Node parent = new Node('\0', x.freq + y.freq, x, y);
        pq.insert(parent);
    }

    return pq.delMin();
}

Huffman encoding summary

Proposition. [Huffman 1950s] Huffman algorithm produces an optimal prefix-free code (no prefix-free code uses fewer bits).
Pf. See textbook.

Implementation.
・Pass 1: tabulate char frequencies and build trie.
・Pass 2: encode file by traversing trie or lookup table.

Running time. Using a binary heap ⇒ N + R log R, where N is the input size and R is the alphabet size.

Q. Can we do better? [stay tuned]

5.5 DATA COMPRESSION
‣ LZW compression

[Photos: Abraham Lempel, Jacob Ziv]


Statistical methods

Static model. Same model for all texts.


・Fast.
・Not optimal: different texts have different statistical properties.
・Ex: ASCII, Morse code.
Dynamic model. Generate model based on text.
・Preliminary pass needed to generate model.
・Must transmit the model.
・Ex: Huffman code.
Adaptive model. Progressively learn and update model as you read text.
・More accurate modeling produces better compression.
・Decoding must start from beginning.
・Ex: LZW.

LZW compression demo

input     A  B  R  A  C  A  D  A  B  R  A  B  R  A  B  R  A
matches   A  B  R  A  C  A  D  AB    RA    BR    ABR      A
value     41 42 52 41 43 41 44 81    83    82    88       41 80

LZW compression for A B R A C A D A B R A B R A B R A

key  value    key  value    key   value
⋮    ⋮        AB   81       DA    87
A    41       BR   82       ABR   88
B    42       RA   83       RAB   89
C    43       AC   84       BRA   8A
D    44       CA   85       ABRA  8B
⋮    ⋮        AD   86

codeword table
Lempel-Ziv-Welch compression

LZW compression.
・Create ST associating W-bit codewords with string keys.
・Initialize ST with codewords for single-char keys.
・Find longest string s in ST that is a prefix of unscanned part of input.
・Write the W-bit codeword associated with s. longest prefix match
・Add s + c to ST, where c is next char in the input.
Q. How to represent LZW compression code table?
A. A trie to support longest prefix match.
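[Figure: trie for the codeword table. First level: A 41, B 42, C 43, D 44, R 52. Under A: B 81, C 84, D 86. Under B: R 82. Under C: A 85. Under D: A 87. Under R: A 83. Third level: AB → R 88, BR → A 8A, RA → B 89. Fourth level: ABR → A 8B.]
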
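
The compression steps can be sketched as follows. For simplicity this version uses a `HashMap` with repeated substring lookups instead of a trie, returns the codewords as a list of ints instead of writing W-bit values, and assumes 7-bit ASCII input with W = 8 and codeword 80 (hex) reserved as an end-of-file marker, matching the demo. The class `LzwCompressSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: LZW compression of a 7-bit ASCII string into 8-bit codewords.
final class LzwCompressSketch {
    static final int R = 128;   // number of input chars; R is also the EOF codeword
    static final int L = 256;   // number of codewords = 2^W with W = 8

    static List<Integer> compress(String input) {
        // initialize ST with codewords for single-char keys
        Map<String, Integer> st = new HashMap<>();
        for (int i = 0; i < R; i++) st.put(String.valueOf((char) i), i);
        int code = R + 1;

        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // find longest key s in ST that is a prefix of the unscanned input;
            // extending one char at a time works because every prefix of a key is a key
            int end = i + 1;
            while (end < input.length() && st.containsKey(input.substring(i, end + 1)))
                end++;
            out.add(st.get(input.substring(i, end)));

            // add s + c to ST, where c is the next char in the input
            if (end < input.length() && code < L)
                st.put(input.substring(i, end + 1), code++);
            i = end;
        }
        out.add(R);   // end-of-file codeword
        return out;
    }
}
```

On ABRACADABRABRABRA this produces 41 42 52 41 43 41 44 81 83 82 88 41 80 (hex), the same sequence as the demo.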
LZW expansion demo

value     41 42 52 41 43 41 44 81    83    82    88       41 80
output    A  B  R  A  C  A  D  AB    RA    BR    ABR      A

LZW expansion for 41 42 52 41 43 41 44 81 83 82 88 41 80

key  value    key  value    key  value
⋮    ⋮        81   AB       87   DA
41   A        82   BR       88   ABR
42   B        83   RA       89   RAB
43   C        84   AC       8A   BRA
44   D        85   CA       8B   ABRA
⋮    ⋮        86   AD

codeword table
LZW expansion

LZW expansion.
・Create ST associating string values with W-bit keys.
・Initialize ST to contain single-char values.
・Read a W-bit key.
・Find associated string value in ST and write it out.
・Update ST.

Q. How to represent LZW expansion code table?
A. An array of size $2^W$.

key   value
⋮     ⋮
65    A
66    B
67    C
68    D
⋮     ⋮
129   AB
130   BR
131   RA
132   AC
133   CA
134   AD
135   DA
136   ABR
137   RAB
138   BRA
139   ABRA
⋮     ⋮

LZW tricky case: compression

input     A  B  A  B  A  B  A
matches   A  B  AB    ABA
value     41 42 81    83       80

LZW compression for ABABABA

key  value    key  value
⋮    ⋮        AB   81
A    41       BA   82
B    42       ABA  83
C    43
D    44
⋮    ⋮

codeword table
LZW tricky case: expansion

value     41 42 81    83       80
output    A  B  AB    ABA
(need to know which key has value 83 before it is in ST!)

LZW expansion for 41 42 81 83 80

key  value    key  value
⋮    ⋮        81   AB
41   A        82   BA
42   B        83   ABA
43   C
44   D
⋮    ⋮

codeword table
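
A sketch of expansion that handles the tricky case. When the codeword just read is the one about to be added to the table, its string must be the previous string followed by that string's own first char. As assumptions for illustration, the hypothetical class `LzwExpandSketch` takes the codewords as an int array, uses W = 8 with 80 (hex) as the end-of-file codeword, and expects a valid codeword sequence.

```java
// Sketch: LZW expansion using an array of 2^W strings.
final class LzwExpandSketch {
    static final int R = 128;   // number of input chars; R is also the EOF codeword
    static final int L = 256;   // number of codewords = 2^W with W = 8

    static String expand(int[] codes) {
        // initialize ST to contain single-char values
        String[] st = new String[L];
        for (int i = 0; i < R; i++) st[i] = String.valueOf((char) i);
        int next = R + 1;   // next codeword to be assigned

        StringBuilder out = new StringBuilder();
        if (codes.length == 0 || codes[0] == R) return "";
        String val = st[codes[0]];
        out.append(val);

        for (int k = 1; k < codes.length && codes[k] != R; k++) {
            int codeword = codes[k];
            // tricky case: codeword is not in ST yet, so it must be
            // the previous string plus its own first char
            String s = (codeword == next) ? val + val.charAt(0) : st[codeword];
            if (next < L) st[next++] = val + s.charAt(0);   // update ST
            out.append(s);
            val = s;
        }
        return out.toString();
    }
}
```

Expanding 41 42 81 83 80 gives ABABABA: codeword 83 is resolved as AB + A at the moment it is read.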
LZW implementation details

How big to make ST?


・How long is message?
・Whole message similar model?
・[many other variations]
What to do when ST fills up?
・Throw away and start over. [GIF]
・Throw away when not effective. [Unix compress]
・[many other variations]
Why not put longer substrings in ST?
・[many variations have been developed]

LZW in the real world

Lempel-Ziv and friends.
・LZ77.
・LZ78.
・LZW.
・Deflate / zlib = LZ77 variant + Huffman.

LZ77 not patented ⇒ widely used in open source.
LZW patent #4,558,302 expired in U.S. on June 20, 2003.

Unix compress, GIF, TIFF, V.42bis modem: LZW.
zip, 7zip, gzip, jar, png, pdf: deflate / zlib.
iPhone, Sony Playstation 3, Apache HTTP server: deflate / zlib.

Lossless data compression benchmarks

year   scheme            bits / char
1967   ASCII             7.00
1950   Huffman           4.70
1977   LZ77              3.94
1984   LZMW              3.32
1987   LZH               3.30
1987   move-to-front     3.24
1987   LZB               3.18
1987   gzip              2.71
1988   PPMC              2.48
1994   SAKDC             2.47
1994   PPM               2.34
1995   Burrows-Wheeler   2.29   (next programming assignment)
1997   BOA               1.99
1999   RK                1.89

data compression using Calgary corpus

Data compression summary

Lossless compression.
・Represent fixed-length symbols with variable-length codes. [Huffman]
・Represent variable-length symbols with fixed-length codes. [LZW]

Lossy compression. [not covered in this course]


・JPEG, MPEG, MP3, …
・FFT, wavelets, fractals, …
Theoretical limits on compression. Shannon entropy: $H(X) = -\sum_{i}^{n} p(x_i) \lg p(x_i)$

Practical compression. Use extra knowledge whenever possible.
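
The Shannon entropy bound can be evaluated for a concrete message. The sketch below estimates $p(x_i)$ from the character frequencies of the text itself, which is an assumption about the model (order-0, per-character); the class `EntropySketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: Shannon entropy, in bits per char, of a text's empirical char distribution.
final class EntropySketch {
    static double entropy(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        double h = 0.0;
        for (int count : freq.values()) {
            double p = (double) count / text.length();
            h -= p * (Math.log(p) / Math.log(2));   // lg p = ln p / ln 2
        }
        return h;
    }
}
```

For ABRACADABRA! this is about 2.28 bits per char, or roughly 27.4 bits for the 12 chars, slightly below the 28 bits used by the Huffman code for the same message.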
