
Algorithms for Data Science

CSOR W4246
Eleni Drinea
Computer Science Department
Columbia University
Tuesday, September 29, 2015

Outline

1 Recap
2 Data compression
3 Symbol codes and optimal lossless compression
4 Prefix codes
5 Prefix codes and trees
6 The Huffman algorithm

Today

1 Recap

Review of the last lecture

DFS applications
- Cycle detection
- Topological sorting
- Strongly connected components in directed graphs

Today

2 Data compression

Motivation

Data compression: find compact representations of data

Data compression standards
- jpeg for image transmission
- mp3 for audio content, mpeg2 for video transmission
- utilities: gzip, bzip2

All of the above use the Huffman algorithm as a basic building block.

Data representation

- Chromosome maps: sequences of hundreds of millions of bases (symbols from {A, C, G, T}).
- Goal: store a chromosome map with 200 million bases.

How do we represent a chromosome map?
- Encode every symbol that appears in the sequence separately by a fixed-length binary string.
- Codeword c(x) for symbol x: a binary string encoding x.

Example code

- Alphabet A = {A, C, G, T} with 4 symbols
- Encode each symbol with 2 bits

  alphabet symbol x    codeword c(x)
  A                    00
  C                    01
  G                    10
  T                    11

- Output of encoding: the concatenation of the codewords for every symbol in the input sequence

Example
- Input sequence: ACGTAA
- Output: c(A)c(C)c(G)c(T)c(A)c(A) = 000110110000
- Total length of encoding = 6 × 2 = 12 bits.
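This fixed-length scheme is a few lines of code. A minimal Python sketch (the function name and dictionary layout are ours, not from the lecture):

```python
# Fixed-length symbol code C0 from the slides.
C0 = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(seq, code):
    """Concatenate the codeword of every input symbol."""
    return "".join(code[x] for x in seq)

assert encode("ACGTAA", C0) == "000110110000"  # 6 symbols x 2 bits = 12 bits
```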

Today

3 Symbol codes and optimal lossless compression

Symbol codes

Symbol code: a set of codewords where every input symbol is encoded separately.

Examples of symbol codes
- C0 = {00, 01, 10, 11} is a symbol code for {A, C, G, T}.
- ASCII encoding system: every character and special symbol on the computer keyboard is encoded by a different 7-bit binary string.

Remark 1.
C0 and ASCII are fixed-length symbol codes: each codeword has the same length.

Unique decodability

Decoding C0:
- read 2 bits of the output;
- print the symbol corresponding to this codeword;
- continue with the next 2 bits.

Remark 2.
- This decoding algorithm works for ASCII (replace 2 by 7).
- C0, ASCII: distinct input sequences have distinct encodings.

Definition 1.
A symbol code is uniquely decodable if, for any two distinct input sequences, their encodings are distinct.
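The fixed-width decoder is equally short; a sketch of the procedure above (names are illustrative, and the C0 table is restated so the snippet is self-contained):

```python
def decode_fixed(bits, code, width=2):
    """Invert a fixed-length symbol code by reading `width` bits at a time."""
    inverse = {cw: x for x, cw in code.items()}
    return "".join(inverse[bits[i:i + width]] for i in range(0, len(bits), width))

C0 = {"A": "00", "C": "01", "G": "10", "T": "11"}
assert decode_fixed("000110110000", C0) == "ACGTAA"
```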

Lossless compression

- Lossless compression: compress and decompress without errors.
- Uniquely decodable codes allow for lossless compression.
- A symbol code achieves optimal lossless compression when it produces an encoding of minimum length for its input (among all uniquely decodable symbol codes).
- Huffman algorithm: provides a symbol code that achieves optimal lossless compression.

Fixed-length vs variable-length codes

Chromosome map consists of 200 million bases as follows:

  alphabet symbol x    frequency freq(x)
  A                    110 million
  C                    5 million
  G                    25 million
  T                    60 million

- A appears much more often than the other symbols. It might be best to encode A with fewer bits.
- Unlikely that the fixed-length encoding C0 is optimal.

Today

4 Prefix codes

Prefix codes

Variable-length encodings

Code C1
  alphabet symbol x    codeword c(x)
  A                    0
  C                    00
  G                    10
  T                    1

- C1 is not uniquely decodable! E.g., 101110: it decodes both as 10·1·1·10 = GTTG and as 1·0·1·1·1·0 = TATTTA.

Prefix codes

Variable-length encodings

Code C2
  alphabet symbol x    codeword c(x)
  A                    0
  C                    110
  G                    111
  T                    10

- C2 is uniquely decodable.
- C2 is such that no codeword is a prefix of another.

Definition 2 (prefix codes).
A symbol code is a prefix code if no codeword is a prefix of another.
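The prefix property is easy to test mechanically; a quadratic-time Python sketch (the helper name is ours):

```python
def is_prefix_code(codewords):
    """True iff no codeword is a prefix of a different codeword."""
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

assert not is_prefix_code(["0", "00", "10", "1"])   # C1: 0 is a prefix of 00
assert is_prefix_code(["0", "110", "111", "10"])    # C2
```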

Decoding prefix codes

1. Scan the binary string from left to right until you've seen enough bits to match a codeword.
2. Output the symbol corresponding to this codeword.
   - Since no other codeword is a prefix of this codeword or contains it as a prefix, this sequence of bits cannot be used to encode any other symbol.
3. Continue starting from the next bit of the bit string.

Thus prefix codes allow for
- unique decoding;
- fast decoding (the end of a codeword is instantly recognizable).

Examples of prefix codes: C0, C2
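A direct Python rendering of this left-to-right scan; a sketch under the same dictionary representation as before (the prefix property guarantees the buffer matches at most one codeword):

```python
def decode_prefix(bits, code):
    """Scan left to right, emitting a symbol whenever the buffer matches a codeword."""
    inverse = {cw: x for x, cw in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:          # prefix property: at most one codeword can match
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

C2 = {"A": "0", "C": "110", "G": "111", "T": "10"}
assert decode_prefix("0110111100", C2) == "ACGTA"
```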

Prefix codes and optimal lossless compression

- Decoding a prefix code is fast.
- Would like to focus on prefix codes (rather than all uniquely decodable symbol codes) for achieving optimal lossless compression.
- Information theory guarantees that this costs nothing: by the Kraft–McMillan inequality, any uniquely decodable code can be replaced by a prefix code with the same codeword lengths.
- So we can solely focus on prefix codes for optimal compression.

Compression gains from variable-length prefix codes

Chromosome map: do we gain anything by using C2 instead of C0 when compressing the map of 200 million bases?

  symbol x    freq(x)        C0: c(x)    C2: c(x)
  A           110 million    00          0
  C           5 million      01          110
  G           25 million     10          111
  T           60 million     11          10

- C0: 2 bits × 200 million symbols = 400 million bits
- C2: 1×110 + 3×5 + 3×25 + 2×60 = 320 million bits
- Improvement of 20% in this example

The optimal prefix code problem

Input:
- Alphabet A = {a_1, ..., a_n}
- Set P = {p_1, ..., p_n} of probabilities over A such that p_i = Pr[a_i]

Output: a binary prefix code C = {c(a_1), c(a_2), ..., c(a_n)} for (A, P), where codeword c(a_i) has length ℓ_i and the expected length of the code

  L(C) = Σ_{a_i ∈ A} p_i ℓ_i

is minimum among all prefix codes.

Example

Chromosome example

  symbol x    Pr(x)      C0: c(x)    C2: c(x)
  A           110/200    00          0
  C           5/200      01          110
  G           25/200     10          111
  T           60/200     11          10

- L(C0) = 2
- L(C2) = 1×(110/200) + 3×(5/200) + 3×(25/200) + 2×(60/200) = 320/200 = 1.6
- Coming up: C2 is the output of the Huffman algorithm, hence an optimal encoding for (A, P).
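Expected length is a direct translation of the formula above; a sketch with the tables hard-coded (layout is our own):

```python
def expected_length(code, prob):
    """L(C) = sum over symbols x of Pr[x] * len(c(x))."""
    return sum(prob[x] * len(cw) for x, cw in code.items())

P = {"A": 110/200, "C": 5/200, "G": 25/200, "T": 60/200}
C0 = {"A": "00", "C": "01", "G": "10", "T": "11"}
C2 = {"A": "0", "C": "110", "G": "111", "T": "10"}
assert abs(expected_length(C0, P) - 2.0) < 1e-9
assert abs(expected_length(C2, P) - 1.6) < 1e-9
```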

Today

5 Prefix codes and trees

Prefix codes and trees

- A binary tree T is a rooted tree such that each node that is not a leaf has at most two children.
- Binary tree for a prefix code: a branch to the left represents a 0 in the encoding and a branch to the right a 1.

Code C0
  x    c(x)
  A    00
  C    01
  G    10
  T    11

[Figure: binary tree T0 for prefix code C0 — a complete tree of depth 2 whose four leaves, left to right, are A, C, G, T with probabilities Pr[A], Pr[C], Pr[G], Pr[T].]

Code C2
  x    c(x)
  A    0
  C    110
  G    111
  T    10

[Figure: binary tree T2 for prefix code C2 — leaf A (Pr[A]) hangs off the root's 0-branch; the 1-branch leads to leaf T (Pr[T]) on its 0-branch and to an internal node whose children are leaves C (Pr[C]) and G (Pr[G]).]

Properties of binary trees representing prefix codes

1. Where do alphabet symbols appear in the tree?
2. What do codewords correspond to in the tree?
3. Consider the tree corresponding to the optimal prefix code. Can it have internal nodes with one child?

Properties of binary trees representing prefix codes

1. Symbols must appear at the leaves of the tree T (why?). T has n leaves.
2. Codewords c(a_i) are given by root-to-leaf paths. Recall that ℓ_i is the length of the codeword c(a_i) for input symbol a_i. Therefore, ℓ_i corresponds to the depth of a_i in T (we assume that the root is at depth 0). Can rewrite the expected length of the prefix code as:

   L(C) = Σ_{a_i ∈ A} p_i ℓ_i = Σ_{1 ≤ i ≤ n} p_i depth_T(a_i) = L(T).

3. Optimal tree must be full: all internal nodes must have exactly two children (why?).

More on optimal tree

Claim 1.
There is an optimal prefix code, with corresponding tree T, in which the two lowest-frequency characters are assigned to leaves that are siblings in T at maximum depth.

Proof.
By an exchange argument: start with a tree for an optimal prefix code and transform it into T.

Proof of Claim 1

- Let T* be the tree for the optimal prefix code.
- Let α, β be the two symbols with the smallest probabilities, that is, Pr[α] ≤ Pr[β] ≤ Pr[s] for all s ∈ A − {α, β}.
- Let x and y be the two siblings at maximum depth in T*.

[Figure: tree T* with sibling leaves x, y at maximum depth and leaf α higher up; tree T' is obtained from T* by exchanging the labels of α and x.]

We want L(T') ≤ L(T*).

Proof of Claim 1 (continued)

[Figure: tree T'' is obtained from T' by exchanging the labels of β and y; after both exchanges, α and β are sibling leaves at maximum depth.]

We want L(T'') ≤ L(T').

How do the expected lengths of the two trees compare?

L(T*) − L(T')
  = Σ_{a_i ∈ A} Pr[a_i]·depth_T*(a_i) − Σ_{a_i ∈ A} Pr[a_i]·depth_T'(a_i)
  = Pr[α]·depth_T*(α) + Pr[x]·depth_T*(x) − Pr[α]·depth_T'(α) − Pr[x]·depth_T'(x)
  = Pr[α]·depth_T*(α) + Pr[x]·depth_T*(x) − Pr[α]·depth_T*(x) − Pr[x]·depth_T*(α)
  = (Pr[α] − Pr[x])·(depth_T*(α) − depth_T*(x)) ≥ 0

The third equality follows from the exchange: depth_T'(α) = depth_T*(x) and depth_T'(x) = depth_T*(α), while all other depths are unchanged. The final inequality holds because both factors are ≤ 0: Pr[α] ≤ Pr[x] (α has minimum probability) and depth_T*(α) ≤ depth_T*(x) (x is at maximum depth).

- Similarly, exchanging β and y in T' yields L(T') − L(T'') ≥ 0.
- Hence L(T*) − L(T'') ≥ 0.
- Since T* is optimal, it must be L(T*) = L(T''). So T'' is also optimal.
- The claim follows by setting T to be T''.

Building the optimal tree

Claim 1 tells us how to build the optimal tree greedily!

1. Find the two symbols with the lowest probabilities.
2. Remove them from the alphabet and replace them with a new meta-character with probability equal to the sum of their probabilities.
   - Idea: this meta-character will be the parent of the two deleted symbols in the tree.
3. Recursively construct the optimal tree using this process.

Greedy algorithms: make a local (myopic) decision at every step that optimizes some criterion.

Today

6 The Huffman algorithm

Huffman algorithm

Huffman(A, P)
  if |A| = 2 then
    return the tree encoding one symbol using 0 and the other using 1
  end if
  Let α and β be the two symbols with the lowest probabilities
  Let γ be a new meta-character with probability Pr[α] + Pr[β]
  Let A1 = A − {α, β} + {γ}
  Let P1 be the new set of probabilities over A1
  T1 = Huffman(A1, P1)
  return T as follows: replace leaf node γ in T1 by an internal node, and add two children labelled α and β below γ

Remark 3.
Output of the Huffman procedure is a binary tree T; the code for (A, P) is the prefix code corresponding to T.
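Below is a direct Python transcription of this recursive procedure; a sketch under our own conventions, not the textbook's code: a leaf is a symbol string, an internal node is a (left, right) tuple, and the meta-character γ is represented by the tuple of its two children, so the "replace leaf γ" step happens automatically when the recursion returns.

```python
def huffman(P):
    """P maps symbols to probabilities, |P| >= 2. Returns a binary tree."""
    if len(P) == 2:
        a, b = P                             # base case: one symbol gets 0, the other 1
        return (a, b)
    alpha, beta = sorted(P, key=P.get)[:2]   # two lowest-probability symbols
    gamma = (alpha, beta)                    # meta-character = future parent of alpha, beta
    P1 = {x: p for x, p in P.items() if x not in (alpha, beta)}
    P1[gamma] = P[alpha] + P[beta]           # Pr[gamma] = Pr[alpha] + Pr[beta]
    return huffman(P1)                       # leaf gamma already carries its two children

def codewords(T, prefix=""):
    """Codewords are root-to-leaf paths: left edge = 0, right edge = 1."""
    if not isinstance(T, tuple):
        return {T: prefix}
    left, right = T
    return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}

P = {"A": 110/200, "C": 5/200, "G": 25/200, "T": 60/200}
print(codewords(huffman(P)))   # lengths A=1, T=2, C=3, G=3: optimal, matching L = 1.6
```

The particular 0/1 labels may differ from C2, but the codeword lengths (and hence the expected length) are the same.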

Example: recursive Huffman for chromosome map

Recursive call 1: Huffman({A, C, G, T}, {110/200, 5/200, 25/200, 60/200})
Recursive call 2: Huffman({A, CG, T}, {110/200, 30/200, 60/200})
Recursive call 3: Huffman({A, CGT}, {110/200, 90/200})

[Figure: the tree is built as the recursion unwinds.
End of rec. call 3: root with 0-child A (110/200) and 1-child CGT (90/200).
End of rec. call 2: leaf CGT is replaced by an internal node with children T (60/200) and CG (30/200).
End of rec. call 1: leaf CG is replaced by an internal node with children C (5/200) and G (25/200).]

Correctness

Proof: by induction on the size of the alphabet n ≥ 2.

- Base case. For n = 2, Huffman is optimal.
- Hypothesis. Assume that Huffman returns the optimal prefix code for alphabets of size n.
- Induction Step. Let A be an alphabet of size n + 1, P the corresponding set of probabilities.
  Let T1 be the tree returned by our algorithm for (A1, P1), where A1, P1, T1 are as in the pseudocode. By the hypothesis, T1 is optimal. Let T be the final tree returned for (A, P) by our algorithm. We claim that T is optimal.
  We will prove the claim by contradiction. Suppose T* is the optimal tree for (A, P) with L(T*) < L(T).

Replacing α, β by some γ′ in the optimal tree T*

W.l.o.g., suppose T* is an optimal tree where α, β appear as siblings at maximum depth (Claim 1 guarantees there is such an optimal tree).
Replace α, β in T* by a meta-character γ′ with Pr[γ′] = Pr[α] + Pr[β] to obtain the tree T1* on n leaves with expected length L(T1*).

[Figure: T* and T1* side by side; in T1*, the sibling leaves α and β of T* are replaced by a single leaf γ′ with probability Pr[α] + Pr[β]. Dashed edges and nodes belong to T* but not T1*.]

Expressing L(T*) in terms of L(T1*)

Shorthand notation: d_T(a_i) = depth_T(a_i).

L(T*) = Σ_{a_i ∈ A} Pr[a_i] d_T*(a_i)
      = Σ_{a_i ∈ A−{α,β}} Pr[a_i] d_T*(a_i) + (Pr[α] + Pr[β]) d_T*(α)
      = Σ_{a_i ∈ A−{α,β}} Pr[a_i] d_T*(a_i) + (Pr[α] + Pr[β]) (d_T*(α) − 1) + Pr[α] + Pr[β]
      = Σ_{a_i ∈ A−{α,β}+{γ′}} Pr[a_i] d_T1*(a_i) + Pr[α] + Pr[β]
      = L(T1*) + Pr[α] + Pr[β]    (1)

(The second equality uses d_T*(β) = d_T*(α), since α and β are siblings; the fourth uses that γ′ sits at depth d_T*(α) − 1 in T1*.)

Correctness (cont'd)

- Similarly, we obtain that

  L(T) = L(T1) + Pr[α] + Pr[β]    (2)

- By the induction hypothesis, we have L(T1) ≤ L(T1*), since T1 is optimal for alphabets with n characters and probabilities given by P1. Hence

  L(T*) = L(T1*) + Pr[α] + Pr[β] ≥ L(T1) + Pr[α] + Pr[β] = L(T),    (3)

  where the first equality follows from equation (1), the inequality from the induction hypothesis and the last equality from equation (2).

- But equation (3) contradicts the assumption L(T*) < L(T). Hence T must be optimal.

Implementation and running time

1. Straightforward implementation: O(n²) time
2. Store the alphabet symbols in a priority queue implemented as a binary min-heap, with their probabilities as keys
   - Operations: Initialize (O(n)), Extract-min (O(log n)), Insert (O(log n))
   - Total time: O(n log n)

For an iterative implementation of Huffman, see your textbook.
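A heap-based sketch of the O(n log n) approach using Python's standard-library binary min-heap, heapq. This is one way to realize the priority-queue scheme above, not necessarily the textbook's code; the tie-breaking counter is our implementation detail (it avoids comparing trees when probabilities are equal).

```python
import heapq

def huffman_heap(P):
    """Iterative Huffman keyed on probability. Returns dict symbol -> codeword."""
    # Heap entries: (probability, tie-breaker, tree); a tree is a symbol or a (left, right) pair.
    heap = [(p, i, x) for i, (x, p) in enumerate(P.items())]
    heapq.heapify(heap)                                      # Initialize: O(n)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)                      # Extract-min: O(log n)
        p2, _, t2 = heapq.heappop(heap)                      # the two lowest probabilities
        heapq.heappush(heap, (p1 + p2, counter, (t1, t2)))   # Insert merged meta-character
        counter += 1
    _, _, root = heap[0]
    code, stack = {}, [(root, "")]
    while stack:                                             # read codewords off root-to-leaf paths
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:
            code[node] = prefix
    return code

P = {"A": 110/200, "C": 5/200, "G": 25/200, "T": 60/200}
print(huffman_heap(P))   # codeword lengths: A=1, T=2, C=3, G=3
```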

Example: iterative Huffman for chromosome map

Input (A, P)
  symbol x    Pr(x)
  A           110/200
  C           5/200
  G           25/200
  T           60/200

- Step 1: merge the two lowest-probability symbols C (5/200) and G (25/200) into meta-character CG with probability 30/200.
- Step 2: merge CG (30/200) and T (60/200) into CGT with probability 90/200.
- Step 3: merge CGT (90/200) and A (110/200) into the root.

Output code
  symbol x    c(x)
  A           0
  C           111
  G           110
  T           10

Beyond Huffman coding

- The Huffman algorithm provides an optimal symbol code.
- Codes that encode larger blocks of input symbols might do better.
- Storage on noisy media: what if a bit of the output of the compressor is flipped?
  - Decompression cannot carry through.
  - Need error correction on top of compression.