Algorithms For Data Science: CSOR W4246
Eleni Drinea
Computer Science Department
Columbia University
Tuesday, September 29, 2015
Outline
1 Recap
2 Data compression
3 Symbol codes and optimal lossless compression
4 Prefix codes
5 Prefix codes and trees
6 The Huffman algorithm
1 Recap
DFS applications
- Cycle detection
- Topological sorting
- Strongly connected components in directed graphs
2 Data compression
Motivation
Data representation
Example code C0

symbol x   codeword c(x)
A          00
C          01
G          10
T          11
3 Symbol codes and optimal lossless compression
Symbol codes
Symbol code: a set of codewords where every input symbol is
encoded separately.
Examples of symbol codes
- C0
- ASCII
Remark 1.
C0 and ASCII are fixed-length symbol codes: each codeword has
the same length.
Unique decodability
Decoding C0
- Split the encoded bit string into consecutive blocks of 2 bits
and map each block back to its symbol.
Remark 2.
- Every fixed-length code is uniquely decodable.
Definition 1.
A symbol code is uniquely decodable if, for any two distinct
input sequences, their encodings are distinct.
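As a minimal sketch in Python (the symbol labels A, C, G, T follow the C0 table on these slides), decoding a fixed-length code amounts to splitting the bit string into equal-width blocks:

```python
# Sketch: encoding and decoding with the fixed-length code C0.
C0 = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(s, code=C0):
    """Concatenate the codeword of every input symbol."""
    return "".join(code[x] for x in s)

def decode_fixed(bits, code=C0, width=2):
    """Split the bit string into width-bit blocks and invert the code."""
    inv = {cw: x for x, cw in code.items()}
    return "".join(inv[bits[i:i + width]] for i in range(0, len(bits), width))

print(encode("ACGT"))            # -> "00011011"
print(decode_fixed("00011011"))  # -> "ACGT"
```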
Lossless compression
Chromosome example: 200 million nucleotides with the following
symbol frequencies.

symbol x   frequency freq(x)
A          110 million
C          5 million
G          25 million
T          60 million
4 Prefix codes
Prefix codes
Variable-length encodings
Code C1

alphabet symbol x   codeword c(x)
A                   0
C                   00
G                   10
T                   1

- C1 is not uniquely decodable: for example, the encoding 00 may
decode as AA or as C.
Code C2

alphabet symbol x   codeword c(x)
A                   0
C                   110
G                   111
T                   10

- No codeword of C2 is a prefix of another codeword; this property
allows unique decoding.
C2 is uniquely decodable.
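The prefix property can be checked mechanically. A small sketch using codes C1 and C2 from the tables above (function names are my own):

```python
# Sketch: test the prefix-free property and decode greedily.
C1 = {"A": "0", "C": "00", "G": "10", "T": "1"}    # not prefix-free
C2 = {"A": "0", "C": "110", "G": "111", "T": "10"}  # prefix-free

def is_prefix_free(code):
    """True iff no codeword is a prefix of another codeword."""
    words = list(code.values())
    return not any(u != v and v.startswith(u) for u in words for v in words)

def decode_prefix(bits, code):
    """Greedy left-to-right decoding; unambiguous for prefix codes."""
    inv = {cw: x for x, cw in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return "".join(out)

print(is_prefix_free(C1))           # False: c(A)=0 is a prefix of c(C)=00
print(is_prefix_free(C2))           # True
print(decode_prefix("011010", C2))  # -> "ACT"
```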
Code C0
x   c(x)
A   00
C   01
G   10
T   11

Code C2
x   c(x)
A   0
C   110
G   111
T   10
Input:
- Alphabet A = {a1, . . . , an}
- A set P of probabilities: Pr(x) for every symbol x in A

Example
Chromosome example
Input

symbol x   Pr(x)
A          110/200
C          5/200
G          25/200
T          60/200
Code C0
x   c(x)
A   00
C   01
G   10
T   11

Code C2
x   c(x)
A   0
C   110
G   111
T   10

Expected encoding lengths (bits per symbol):
L(C0) = 2
L(C2) = 1·(110/200) + 3·(5/200) + 3·(25/200) + 2·(60/200) = 1.6
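The two expected lengths can be verified directly; a small sketch using the probabilities and code tables above:

```python
# Sketch: expected encoding length L(C) = sum over x of Pr(x) * |c(x)|,
# for the chromosome probabilities and codes C0, C2 on these slides.
P = {"A": 110/200, "C": 5/200, "G": 25/200, "T": 60/200}
C0 = {"A": "00", "C": "01", "G": "10", "T": "11"}
C2 = {"A": "0", "C": "110", "G": "111", "T": "10"}

def expected_length(code, probs):
    """Average number of bits per input symbol under `code`."""
    return sum(probs[x] * len(cw) for x, cw in code.items())

print(expected_length(C0, P))  # 2 bits/symbol
print(expected_length(C2, P))  # 1.6 bits/symbol
```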
5 Prefix codes and trees
[Trees for codes C0 and C2: each codeword is the sequence of edge
labels (0 = left, 1 = right) on the root-to-leaf path of its symbol,
and each leaf is labeled with its symbol's probability Pr[A], Pr[C],
Pr[G], Pr[T]. In the tree for C0 all four leaves are at depth 2; in
the tree for C2 the leaf for A is at depth 1, T at depth 2, and C, G
at depth 3.]
Claim 1.
There is an optimal prefix code, with corresponding tree T , in
which the two lowest frequency characters are assigned to leaves
that are siblings in T at maximum depth.
Proof.
By an exchange argument: start with a tree for an optimal
prefix code and transform it into T .
Proof of Claim 1
Let T be a tree for an optimal prefix code, let x and y be the two
lowest-frequency characters, and let a and b be the characters at two
sibling leaves of maximum depth in T.
- Exchange x with a to obtain T': since Pr[x] <= Pr[a] and x moves to
a leaf at least as deep as its old one, the expected encoding length
does not increase.
- Exchange y with b in T' to obtain T''; by the same argument the
expected encoding length again does not increase.
So T'' is also optimal, and in T'' the two lowest-frequency
characters x and y are sibling leaves at maximum depth.
6 The Huffman algorithm
Huffman algorithm
Huffman(A, P)
if |A| = 2 then
    Encode one symbol using 0 and the other using 1, and return
    the corresponding tree
end if
Let y and z be the two symbols with the lowest probabilities
Let ω be a new meta-character with probability Pr[y] + Pr[z]
Let A1 = A − {y, z} + {ω}
Let P1 be the new set of probabilities over A1
T1 = Huffman(A1, P1)
return T as follows: replace leaf node ω in T1 by an internal
node, and add two children labelled y and z below ω.

Remark 3.
The output of the Huffman procedure is a binary tree T; the code for
(A, P) is the prefix code corresponding to T.
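The recursion above can be sketched iteratively with a min-heap, a common implementation choice. Variable names are my own, and ties may break differently than on the slides, so the 0/1 labels of individual codewords can differ while the codeword lengths agree:

```python
# Sketch: Huffman's algorithm via a min-heap. Each step merges the two
# lowest-probability (meta-)characters, mirroring the recursion above.
import heapq
from itertools import count

def huffman(probs):
    """Return a prefix code {symbol: codeword} for the given probabilities."""
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {x: ""}) for x, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # lowest probability
        p2, _, c2 = heapq.heappop(heap)  # second lowest
        # Merge into a meta-character: prepend 0 on one side, 1 on the other.
        merged = {x: "0" + w for x, w in c1.items()}
        merged.update({x: "1" + w for x, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

P = {"A": 110/200, "C": 5/200, "G": 25/200, "T": 60/200}
code = huffman(P)
print(code)  # codeword lengths: A:1, T:2, C:3, G:3
```

On the chromosome input this reproduces the codeword lengths 1, 2, 3, 3 and hence the optimal expected length of 1.6 bits per symbol.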
[Example run on the chromosome input: C (5/200) and G (25/200) are
merged into meta-character CG (30/200); then CG and T (60/200) are
merged into CGT (90/200); finally CGT and A (110/200) are merged at
the root, with A on the 0-edge and CGT on the 1-edge.]
Correctness
Proof: by induction on the size of the alphabet n >= 2.
Let T be the tree returned by Huffman(A, P) and T1 the tree returned
by the recursive call Huffman(A1, P1). In T, the symbols y and z are
sibling leaves below the node corresponding to ω, so
dT(y) = dT(z) = dT1(ω) + 1. Hence

L(T) = Σ_{ai in A} Pr[ai] dT(ai)
     = Σ_{ai in A−{y,z}} Pr[ai] dT(ai) + (Pr[y] + Pr[z]) dT(y)
     = Σ_{ai in A−{y,z}} Pr[ai] dT(ai)
       + (Pr[y] + Pr[z]) (dT(y) − 1) + Pr[y] + Pr[z]
     = Σ_{ai in A−{y,z}+{ω}} Pr[ai] dT1(ai) + Pr[y] + Pr[z]
     = L(T1) + Pr[y] + Pr[z]                                    (1)
Correctness (contd)
- By Claim 1, there is an optimal tree T* for (A, P) in which y and z
are sibling leaves at maximum depth. Collapsing y and z into their
parent, relabelled ω, yields a tree T1* for (A1, P1) with

L(T*) = L(T1*) + Pr[y] + Pr[z]                                  (2)

By the induction hypothesis, T1 is optimal for (A1, P1), so
L(T1) <= L(T1*). Combining (1) and (2),
L(T) = L(T1) + Pr[y] + Pr[z] <= L(T1*) + Pr[y] + Pr[z] = L(T*),
so T is optimal.
Construction steps on the chromosome input:
Step 1: merge C (5/200) and G (25/200) into CG (30/200).
Step 2: merge CG (30/200) and T (60/200) into CGT (90/200).
Step 3: merge CGT (90/200) and A (110/200) at the root.

Output code

symbol x   c(x)
A          0
C          111
G          110
T          10
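As a sanity check, the output code round-trips losslessly. A sketch (the input string GATTACA is my own illustration, not from the slides):

```python
# Sketch: encode/decode round trip with the output code above.
code = {"A": "0", "C": "111", "G": "110", "T": "10"}
inv = {w: x for x, w in code.items()}

def encode(s):
    """Concatenate the codeword of every input symbol."""
    return "".join(code[x] for x in s)

def decode(bits):
    """Greedy decoding; unambiguous because the code is prefix-free."""
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return "".join(out)

s = "GATTACA"
print(encode(s))               # -> "1100101001110" (13 bits; C0 would use 14)
print(decode(encode(s)) == s)  # -> True
```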