Lec 13 Compress
IITB India
cbna CS213/293 Data Structure and Algorithms 2023 Instructor: Ashutosh Gupta IITB India 1
Topic 13.1
Data compression
You must have used Zip, which reduces the space used by a file.
Fixed-length vs. Variable-length encoding
▶ Fixed-length encoding. Example: An 8-bit ASCII code encodes each character in a text file.
▶ We may save space by assigning fewer bits to the characters that occur more often.
Example: Variable-length encoding
Example 13.1
Consider text: “agra”
▶ The text has only three distinct characters. Let us use the encoding a = “0”, g = “10”, and r = “11”.
The text needs six bits.
▶ 010110
Exercise 13.1
Are the six bits sufficient?
Commentary: If the encoding depends on the text content, we also need to record the encoding along with the text.
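The encoding step of Example 13.1 can be sketched in a few lines of Python. This is only an illustration: the names `CODE` and `encode` are hypothetical, and the code table is the one given in the example.

```python
# Code table from Example 13.1 (the names CODE and encode are illustrative).
CODE = {"a": "0", "g": "10", "r": "11"}

def encode(text, code):
    """Concatenate the codeword of every character in text."""
    return "".join(code[ch] for ch in text)

bits = encode("agra", CODE)
print(bits, len(bits))   # prints: 010110 6
print(8 * len("agra"))   # prints: 32 (bits needed by an 8-bit fixed-length code)
```

The six bits cover only the text itself; the table `CODE` would still have to be stored alongside them.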
Example: decoding variable-length encoding
Example 13.2
Consider encoding a = “0”, g = “10”, and r = “11” and the following encoding of a text.
101100001110
The text is “graaaarg”.
We scan the encoding from the left. As soon as a match is found, we start matching the next
symbol.
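The left-to-right scan can be sketched as follows. This is a minimal sketch that assumes no codeword is a prefix of another; the function name `decode` is hypothetical.

```python
# Code table from Example 13.2 (the name decode is illustrative).
CODE = {"a": "0", "g": "10", "r": "11"}

def decode(bits, code):
    """Scan bits from the left; emit a character as soon as a codeword matches."""
    inverse = {word: ch for ch, word in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:      # a match is found
            out.append(inverse[current])
            current = ""            # start matching the next symbol
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

print(decode("101100001110", CODE))  # prints: graaaarg
```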
Example: decoding bad variable-length encoding
Example 13.3
Consider encoding a = “0”, g = “01”, and r = “11” and the following encoding of a text.
0111000011001
We cannot tell if the text starts with a “g” or an “a”.
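The ambiguity arises because the codeword of a (“0”) is a prefix of the codeword of g (“01”). The check below is a small sketch of that test, assuming the prefix condition means that no codeword is a prefix of another; the function name `is_prefix_free` is hypothetical.

```python
def is_prefix_free(code):
    """Return True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )

good = {"a": "0", "g": "10", "r": "11"}   # encoding of Example 13.2
bad = {"a": "0", "g": "01", "r": "11"}    # encoding of Example 13.3
print(is_prefix_free(good))  # prints: True
print(is_prefix_free(bad))   # prints: False
```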
Encoding trie
Definition 13.1
An encoding trie is a binary trie that has the following properties.
▶ Each terminating leaf is labeled with an encoded character.
▶ The left child of a node is labeled 0 and the right child of a node is labeled 1.
[Figure: an encoding trie with leaves C, A, R, D, and B.]
Character encoding/codewords: C = 00, A = 010, R = 011, D = 10, and B = 11.
Exercise 13.2
Show: An encoding trie ensures that the prefix condition is not violated.
Example: Decoding from a Trie
Encoding: 01011011010000101001011011010
Text: ABRACADABRA
[Figure: the encoding trie with codewords C = 00, A = 010, R = 011, D = 10, and B = 11.]
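Decoding with a trie amounts to walking from the root along the bits and restarting at the root whenever a leaf is reached. The sketch below builds the trie from the codewords C = 00, A = 010, R = 011, D = 10, B = 11; the class `Node` and the functions `build_trie` and `decode_with_trie` are hypothetical names chosen for illustration.

```python
class Node:
    def __init__(self):
        self.child = {}    # maps "0" / "1" to a child Node
        self.char = None   # set only on terminating leaves

def build_trie(code):
    """Insert every codeword into a binary trie."""
    root = Node()
    for ch, word in code.items():
        node = root
        for b in word:
            node = node.child.setdefault(b, Node())
        node.char = ch
    return root

def decode_with_trie(bits, root):
    """Follow the bits from the root; emit a character at each leaf."""
    out, node = [], root
    for b in bits:
        node = node.child[b]
        if node.char is not None:
            out.append(node.char)
            node = root
    return "".join(out)

CODE = {"C": "00", "A": "010", "R": "011", "D": "10", "B": "11"}
trie = build_trie(CODE)
print(decode_with_trie("01011011010000101001011011010", trie))  # prints: ABRACADABRA
```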
Encoding length
Example 13.4
Let us encode ABRACADABRA using the following two tries.
[Figure: two encoding tries. Left trie: C = 00, A = 010, R = 011, D = 10, B = 11. Right trie: A = 00, C = 010, D = 011, R = 10, B = 11.]
Since we know the label of an internal node by observing that a node is a left or right child, we will
not write the labels.
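[Figure: a trie drawn without edge labels, with leaves A, R, and B at depth 2 and leaves C and D at depth 3.]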
Commentary: We can assign any bit to a node as long as the sibling will use a different bit.
Topic 13.2
Optimal compression
Design principle: We encode a character that occurs more often with fewer bits.
Frequency
Definition 13.2
The frequency $f_c$ of a character $c$ in a text $T$ is the number of times $c$ occurs in $T$.
Example 13.5
The frequencies of the characters in ABRACADABRA are as follows.
▶ $f_A = 5$
▶ $f_B = 2$
▶ $f_R = 2$
▶ $f_C = 1$
▶ $f_D = 1$
Character encoding length
Definition 13.3
The encoding length $l_c$ of a character $c$ in a trie is the number of bits needed to encode $c$.
Example 13.6
Weighted path length == number of encoded bits
The total number of bits needed to store a text is
$$\sum_{c \in \text{Leaves}} f_c \, l_c.$$
Example 13.7
The number of bits needed for ABRACADABRA using the left trie is the following sum.
$$f_A l_A + f_C l_C + f_D l_D + f_R l_R + f_B l_B = 5 \cdot 2 + 1 \cdot 3 + 1 \cdot 3 + 2 \cdot 2 + 2 \cdot 2 = 24$$
[Figure: trie with leaves A, R, and B at depth 2 and leaves C and D at depth 3.]
Is this the best trie for compression? How can we find the best trie?
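The weighted path length can be checked mechanically for the two tries of Example 13.4. In the sketch below, the codewords of the second trie (A = 00, C = 010, D = 011, R = 10, B = 11) are an assumption: they are read off the figure with left = 0 and right = 1. The function name `weighted_path_length` is hypothetical.

```python
from collections import Counter

def weighted_path_length(text, code):
    """Sum of f_c * l_c over the characters that occur in text."""
    freq = Counter(text)
    return sum(freq[c] * len(code[c]) for c in freq)

TEXT = "ABRACADABRA"
first = {"C": "00", "A": "010", "R": "011", "D": "10", "B": "11"}
second = {"A": "00", "C": "010", "D": "011", "R": "10", "B": "11"}  # assumed codewords
print(weighted_path_length(TEXT, first))   # prints: 29
print(weighted_path_length(TEXT, second))  # prints: 24
```

The first value agrees with the 29-bit encoding shown for the trie of Definition 13.1, and the second with the sum in Example 13.7.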
Huffman encoding
Example: Huffman encoding
Example 13.8
After initialization. (Trees are written as (left, right):frequency.)
A:5  B:2  R:2  C:1  D:1
After combining C and D:
A:5  B:2  R:2  (C:1, D:1):2
Example: Huffman encoding(2)
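After combining R with the tree of C and D:
A:5  B:2  (R:2, (C:1, D:1):2):4
After combining B with the resulting tree:
A:5  (B:2, (R:2, (C:1, D:1):2):4):6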
Example: Huffman encoding(3)
After the final recursive step:
(A:5, (B:2, (R:2, (C:1, D:1):2):4):6):11
We scrub the frequency labels.
(A, (B, (R, (C, D))))
Exercise 13.3
How many bits do we need to encode ABRACADABRA?
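The construction in Example 13.8 repeatedly combines the two trees of minimum frequency. The following is a minimal sketch of that idea using a binary heap; it is not necessarily the pseudocode used in the lecture. The function name `huffman_code` is hypothetical, the text is assumed to be non-empty, and ties are broken by insertion order, so the resulting tree may differ in shape from the one drawn above while having the same total length.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(text):
    """Return a dict mapping each character of a non-empty text to its codeword."""
    freq = Counter(text)
    tie = count()  # tie-breaker so that heapq never compares two trees
    heap = [(f, next(tie), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # the two trees of minimum frequency
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))
    _, _, tree = heap[0]

    code = {}
    def walk(node, prefix):
        if isinstance(node, str):         # leaf: a single character
            code[node] = prefix or "0"    # a one-character text still needs one bit
        else:                             # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(tree, "")
    return code

code = huffman_code("ABRACADABRA")
total = sum(len(code[ch]) for ch in "ABRACADABRA")
print(sorted(code.items()), total)
```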
Topic 13.3
Minimum weighted path length
Definition 13.4
Given frequencies $f_{c_1}, \ldots, f_{c_k}$, the minimum weighted path length $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k})$ is the weighted path length of an encoding trie for which the sum is minimum.
Commentary: The definition of MWPL does not mention a specific trie. It is a property of the frequency distribution.
A recursive relation
Theorem 13.1
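$$\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k}) \le f_{c_1} + f_{c_2} + \mathrm{MWPL}(f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k})$$
Proof.
Consider a witness trie $T$ for $\mathrm{MWPL}(f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k})$.
There is a node in $T$ labeled with $f_{c_1} + f_{c_2}$ with a terminal child.
We construct a trie for $f_{c_1}, \ldots, f_{c_k}$ by giving that node two children, $c_1$ and $c_2$. Both are one level deeper than the node, so the weighted path length of the new trie is $f_{c_1} + f_{c_2} + \mathrm{MWPL}(f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k})$. Since MWPL is the minimum over all tries, the inequality follows. Hence proved.
[Figure: the witness trie $T$ (left) and the constructed trie with children $c_1$ and $c_2$ (right).]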
Reverse recursive relation
Theorem 13.2
If $f_{c_1}$ and $f_{c_2}$ are the minimum two, $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k}) = f_{c_1} + f_{c_2} + \mathrm{MWPL}(f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k})$.
Proof.
There is a witness for $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k})$ where the parents of $c_1$ and $c_2$ are siblings. (Why?)
We construct a tree for frequencies $f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k}$ such that the weighted path length of the tree is $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k}) - f_{c_1} - f_{c_2}$.
[Figure: the witness (left) and the constructed tree (right).]
Therefore, $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k}) \ge f_{c_1} + f_{c_2} + \mathrm{MWPL}(f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k})$.
Together with Theorem 13.1, this gives the equality.
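As a worked illustration, Theorem 13.2 can be applied repeatedly to the frequencies of Example 13.5, each time combining the two smallest values. The last step uses the fact that a trie with two leaves places both at depth 1, so $\mathrm{MWPL}(5, 6) = 5 + 6 = 11$.

```latex
\begin{align*}
\mathrm{MWPL}(5, 2, 2, 1, 1) &= (1 + 1) + \mathrm{MWPL}(5, 2, 2, 2) \\
                             &= 2 + (2 + 2) + \mathrm{MWPL}(5, 2, 4) \\
                             &= 2 + 4 + (2 + 4) + \mathrm{MWPL}(5, 6) \\
                             &= 2 + 4 + 6 + 11 = 23
\end{align*}
```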
Commentary: In the proof, the induction step is non-trivial to understand.
Correctness of BuildTree
Theorem 13.3
Huffman($f_{c_1}, \ldots, f_{c_k}$) always returns a tree that is a witness of $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k})$.
Proof.
We prove this inductively.
In the call Encode($T_1, \ldots, T_k$), we assume $T_i$ is a witness of the respective MWPL. (For which frequencies?)
Base case:
Trivial. There is a single tree and we return the tree.
Induction step:
Since we are updating trees by combining trees with minimum weight, we have the following due to the previous theorem.
$$\mathrm{MWPL}(T_1.value, \ldots, T_k.value) = T_i.value + T_j.value + \mathrm{MWPL}(T_i.value + T_j.value, \ldots)$$
▶ Left-hand side: We will have the witness of the frequencies due to the construction.
▶ MWPL term on the right-hand side: witness returned due to the induction hypothesis.
Practical Huffman
When we compress a file, we do not compute the frequencies for the entire file in one go.
Exercise 13.4
How many bits are needed per character for 8 characters if frequencies are all equal?
DEFLATE
In addition to an encoding trie, the Linux utility gzip uses the LZ77 algorithm for compression; the combination is called DEFLATE.
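DEFLATE can be tried out from Python through the standard `zlib` module, which wraps the zlib library. The snippet below is only a usage sketch; the sample data and the compression level 9 are arbitrary choices.

```python
import zlib

data = b"ABRACADABRA" * 100          # highly repetitive sample input
packed = zlib.compress(data, 9)      # DEFLATE at the highest compression level
assert zlib.decompress(packed) == data
print(len(data), len(packed))        # original size and compressed size in bytes
```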
Topic 13.4
LZ77
Repeated string
In LZ77, we check whether a string repeats within a sliding window over the input stream.
A repeated occurrence is replaced by a reference, which is a pair of the offset and the length of the string.
The references are treated as additional symbols of the input stream.
Example 13.9
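Before encoding ABRACADABRA into a trie, the string will be transformed to
ABRACAD[7, 4]
The reference [7, 4] stands for the 4 characters that begin 7 positions back, i.e., the earlier occurrence of ABRA.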
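The replacement of repeated strings by references can be sketched with a greedy search over the window. This is a simplified illustration, not the exact procedure used by gzip: the function names `lz77_tokens` and `lz77_expand`, the window size of 32, and the minimum match length of 3 are all assumptions made for the example.

```python
def lz77_tokens(text, window=32, min_len=3):
    """Greedy LZ77 sketch: emit literals or (offset, length) references."""
    out, i = [], 0
    while i < len(text):
        best_len, best_off = 0, 0
        for start in range(max(0, i - window), i):   # candidates inside the window
            length = 0
            while (i + length < len(text)
                   and text[start + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - start
        if best_len >= min_len:
            out.append((best_off, best_len))         # reference to earlier text
            i += best_len
        else:
            out.append(text[i])                      # literal character
            i += 1
    return out

def lz77_expand(tokens):
    """Inverse of lz77_tokens: copy referenced characters one at a time."""
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            off, length = t
            for _ in range(length):
                out.append(out[-off])
        else:
            out.append(t)
    return "".join(out)

tokens = lz77_tokens("ABRACADABRA")
print(tokens)               # prints: ['A', 'B', 'R', 'A', 'C', 'A', 'D', (7, 4)]
print(lz77_expand(tokens))  # prints: ABRACADABRA
```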
Topic 13.5
Tutorial problems
Single-bit Huffman code
Exercise 13.5
a. In a Huffman code instance, show that if there is a character with frequency greater than $\frac{2}{5}$, then there is a codeword of length 1.
b. Show that if all frequencies are less than $\frac{1}{3}$, then there is no codeword of length 1.
Predictable text
Exercise 13.6
Suppose that there is a source that has three characters a,b,c. The output of the source cycles in
the order of a,b,c followed by a again, and so on. In other words, if the last output was a b, then
the next output will either be a b or a c. Each letter is equally probable. Is the Huffman code the
best possible encoding? Are there any other possibilities? What would be the pros and cons of
this?
Compute Huffman code tree
Exercise 13.7
Given the following frequencies, compute the Huffman code tree.
a 20    d 7     g 8     j 4
b 6     e 25    h 8     k 2
c 6     f 1     i 12    l 1
End of Lecture 13