
CS213/293 Data Structure and Algorithms 2023

Lecture 13: Compression

Instructor: Ashutosh Gupta

IITB India

Compile date: 2023-09-24

cbna CS213/293 Data Structure and Algorithms 2023 Instructor: Ashutosh Gupta IITB India 1
Topic 13.1

Data compression

Data compression

You must have used Zip, which reduces the space used by a file.

How does Zip work?

Fixed-length vs. Variable-length encoding

▶ Fixed-length encoding. Example: An 8-bit ASCII code encodes each character in a text file.

▶ Variable-length encoding: different characters may be encoded with different numbers of bits.

▶ We may save space by assigning fewer bits to the characters that occur more often.

▶ In exchange, some rare characters may need codewords longer than 8 bits.

Example: Variable-length encoding

Example 13.1
Consider text: “agra”

▶ In a text file with 8-bit ASCII, the text takes 32 bits of space.
▶ 01100001 01100111 01110010 01100001

▶ There are only three distinct characters. Let us use the encoding a = “0”, g = “10”, and r = “11”.
The text then needs only six bits.
▶ 010110

Exercise 13.1
Are the six bits sufficient?
Commentary: If the encoding depends on the text content, we also need to record the encoding along with the text.
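
The bit counts of Example 13.1 can be reproduced with a short sketch. The helper names `fixed_length_bits` and `variable_length_bits` are illustrative and not part of the lecture. The sketch only counts the bits of the text and ignores the space needed to record the code table.

```python
def fixed_length_bits(text: str) -> str:
    """Encode every character with its 8-bit ASCII code."""
    return "".join(format(ord(ch), "08b") for ch in text)


def variable_length_bits(text: str, code: dict) -> str:
    """Encode every character with its codeword from the given table."""
    return "".join(code[ch] for ch in text)


# Code table from Example 13.1.
code = {"a": "0", "g": "10", "r": "11"}

fixed = fixed_length_bits("agra")
variable = variable_length_bits("agra", code)

print(len(fixed), fixed)        # 32 01100001011001110111001001100001
print(len(variable), variable)  # 6 010110
```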

Example: decoding variable-length encoding

Example 13.2
Consider the encoding a = “0”, g = “10”, and r = “11”, and the following encoded text.

101100001110
The text is “graaaarg”.

We scan the encoded bits from the left. As soon as the bits read so far match a codeword, we output
its character and start matching the next symbol.
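
The left-to-right scan can be sketched as follows. The function name `decode` is an assumption made for illustration, and the sketch assumes that the code table satisfies the prefix condition.

```python
def decode(bits: str, code: dict) -> str:
    """Scan bits from the left and emit a character at every codeword match."""
    inverse = {word: ch for ch, word in code.items()}
    out = []
    current = ""  # bits read since the last match
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""  # start matching the next symbol
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


code = {"a": "0", "g": "10", "r": "11"}
print(decode("101100001110", code))  # graaaarg
```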

Example: decoding bad variable-length encoding

Example 13.3
Consider the encoding a = “0”, g = “01”, and r = “11”, and the following encoded text.

0111000011001
We cannot tell whether the text starts with a “g” or an “a”, because the codeword of a is a prefix of the codeword of g.

Prefix condition: The encoding of a character cannot be a prefix of the encoding of another character.
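
A direct way to test the prefix condition is to compare every pair of codewords. This is a minimal sketch; the name `is_prefix_free` is illustrative, and the quadratic comparison is chosen for clarity rather than speed.

```python
def is_prefix_free(code: dict) -> bool:
    """Return True if no codeword is a prefix of another codeword."""
    words = list(code.values())
    for i, u in enumerate(words):
        for j, v in enumerate(words):
            if i != j and v.startswith(u):
                return False
    return True


good = {"a": "0", "g": "10", "r": "11"}  # encoding of Example 13.2
bad = {"a": "0", "g": "01", "r": "11"}   # encoding of Example 13.3

print(is_prefix_free(good))  # True
print(is_prefix_free(bad))   # False
```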

Encoding trie

Definition 13.1
An encoding trie is a binary trie that has the following properties.
▶ Each terminating leaf is labeled with an encoded character.
▶ The left child of a node is labeled 0 and the right child of a node is labeled 1.

[Figure: an encoding trie with the leaves C, A, R, D, and B.]
Character encoding/codewords: C = 00, A = 010, R = 011, D = 10, and B = 11.

Exercise 13.2
Show: An encoding trie ensures that the prefix condition is not violated.
Example: Decoding from a Trie

[Figure: the encoding trie of Definition 13.1, with C = 00, A = 010, R = 011, D = 10, and B = 11.]

Encoding: 01011011010000101001011011010
Text: ABRACADABRA
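
Decoding with a trie amounts to walking from the root and restarting at every leaf. The sketch below represents the trie as nested dictionaries, which is a representation chosen here for brevity and not the one used in the lecture. The names `build_trie` and `decode_with_trie` are illustrative, and the codewords are assumed to satisfy the prefix condition.

```python
def build_trie(code: dict) -> dict:
    """Build a nested-dict trie: inner nodes are dicts, leaves are characters."""
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = ch  # the last bit leads to the leaf
    return root


def decode_with_trie(bits: str, root: dict) -> str:
    """Follow one edge per bit; on reaching a leaf, emit it and restart."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a terminating leaf
            out.append(node)
            node = root
    return "".join(out)


code = {"C": "00", "A": "010", "R": "011", "D": "10", "B": "11"}
trie = build_trie(code)
print(decode_with_trie("01011011010000101001011011010", trie))  # ABRACADABRA
```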

Encoding length

Example 13.4
Let us encode ABRACADABRA using the following two tries.

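[Figure: two encoding tries.]
Left trie: C = 00, A = 010, R = 011, D = 10, B = 11.
Right trie: A = 00, C = 010, D = 011, R = 10, B = 11.

Encoding with the left trie (29 bits): 01011011010 0001010 01011011010
Encoding with the right trie (24 bits): 00111000 01000011 00111000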
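
The two encodings can be compared with a few lines of code. The table names `code_29` and `code_24` are illustrative; they hold the codewords read off the 29-bit trie and the 24-bit trie, respectively.

```python
def encode(text: str, code: dict) -> str:
    """Concatenate the codewords of the characters of the text."""
    return "".join(code[ch] for ch in text)


# Codewords of the two tries of Example 13.4.
code_29 = {"C": "00", "A": "010", "R": "011", "D": "10", "B": "11"}
code_24 = {"A": "00", "C": "010", "D": "011", "R": "10", "B": "11"}

bits_29 = encode("ABRACADABRA", code_29)
bits_24 = encode("ABRACADABRA", code_24)

print(len(bits_29), bits_29)  # 29 01011011010000101001011011010
print(len(bits_24), bits_24)  # 24 001110000100001100111000
```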
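Drawing tries without labels

Since the label of a node is determined by whether it is a left or a right child, we will
not write the labels.

[Figure: the right trie of Example 13.4 drawn without labels; the leaves A, R, B are at depth 2 and the leaves C, D are at depth 3.]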

Commentary: We can assign any bit to a node as long as the sibling will use a different bit.

Topic 13.2

Optimal compression

Optimal compression

Different tries will result in different compression levels.

Design principle: We encode a character that occurs more often with fewer bits.

Frequency

Definition 13.2
The frequency $f_c$ of a character $c$ in a text $T$ is the number of times $c$ occurs in $T$.

Example 13.5
The frequencies of the characters in ABRACADABRA are as follows.
▶ $f_A = 5$
▶ $f_B = 2$
▶ $f_R = 2$
▶ $f_C = 1$
▶ $f_D = 1$

Character encoding length

Definition 13.3
The encoding length $l_c$ of a character $c$ in a trie is the number of bits needed to encode $c$.

Example 13.6
In the 24-bit trie of Example 13.4 (A = 00, C = 010, D = 011, R = 10, B = 11), the encoding lengths of the characters are as follows.
▶ $l_A = 2$
▶ $l_B = 2$
▶ $l_R = 2$
▶ $l_C = 3$
▶ $l_D = 3$

Weighted path length == number of encoded bits
The total number of bits needed to store a text is
$$\sum_{c \in \text{Leaves}} f_c \, l_c.$$

Example 13.7
The number of bits needed for ABRACADABRA using the 24-bit trie of Example 13.4 (leaves A, R, B at depth 2 and C, D at depth 3) is the following sum.
$$f_A \cdot l_A + f_C \cdot l_C + f_D \cdot l_D + f_R \cdot l_R + f_B \cdot l_B = 5 \cdot 2 + 1 \cdot 3 + 1 \cdot 3 + 2 \cdot 2 + 2 \cdot 2 = 24$$

Is this the best trie for compression? How can we find the best trie?
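
The claim that the weighted path length equals the number of encoded bits can be checked directly. This is a small sketch; `weighted_path_length` is an illustrative name, and the code table is the one with encoding lengths 2, 2, 2, 3, 3 used in the example.

```python
from collections import Counter


def weighted_path_length(text: str, code: dict) -> int:
    """Sum f_c * l_c over the characters c that occur in the text."""
    freq = Counter(text)  # f_c for every character c
    return sum(freq[c] * len(code[c]) for c in freq)


code = {"A": "00", "C": "010", "D": "011", "R": "10", "B": "11"}
text = "ABRACADABRA"

total = weighted_path_length(text, code)
encoded = "".join(code[c] for c in text)

print(total, len(encoded))  # 24 24
```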
Huffman encoding

Algorithm 13.1: Huffman(Integers f_{c_1}, ..., f_{c_k})

for i ∈ [1, k] do
    N := CreateNode(c_i, Null, Null);
    T_i := CreateNode(f_{c_i}, N, Null);
return BuildTree(T_1, ..., T_k)

Algorithm 13.2: BuildTree(Nodes T_1, ..., T_k)

if k == 1 then
    return T_1
Find T_i and T_j (i ≠ j) such that T_i.value and T_j.value are the two minimum values;
T_i := CreateNode(T_i.value + T_j.value, T_i, T_j);
return BuildTree(T_1, ..., T_{j−1}, T_{j+1}, ..., T_k)
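
A possible Python rendering of the two algorithms is sketched below. It departs from the pseudocode in two ways that are choices made here, not part of the lecture: the two minimum trees are found with a binary heap (`heapq`) instead of a scan, and the recursion is replaced by a loop. A tree is represented as either a character (leaf) or a pair of subtrees, so characters are assumed not to be tuples. Ties are broken by insertion order, so the resulting trie may differ in shape from a trie built by hand while having the same weighted path length.

```python
import heapq
from itertools import count


def huffman_code(freq: dict) -> dict:
    """Return a codeword for every character, given its frequency."""
    tiebreak = count()  # unique counter, so trees are never compared directly
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)

    # Repeatedly join the two trees with the minimum values.
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))

    _, _, tree = heap[0]
    code = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: left is 0, right is 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf: record the codeword
            code[node] = prefix

    walk(tree, "")
    return code


freq = {"A": 5, "B": 2, "R": 2, "C": 1, "D": 1}
code = huffman_code(freq)
print(code)
```

With a single character the sketch returns an empty codeword, which mirrors the base case of BuildTree, where the single tree is returned unchanged.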

Example: Huffman encoding

Example 13.8
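After initialization, the forest has one tree per character.

[Figure: five trees with the root values 5, 2, 2, 1, 1 over the leaves A, B, R, C, D, respectively.]

We choose the two nodes labeled with 1 to join and create a larger tree.

[Figure: four trees with the root values 5 (A), 2 (B), 2 (R), and 2, where the new root 2 has the children 1 (C) and 1 (D).]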

Example: Huffman encoding(2)

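After the next recursive step:

[Figure: three trees with the root values 5 (A), 2 (B), and 4, where the root 4 has the children 2 (R) and 2 (with the children 1 (C) and 1 (D)).]

After another recursive step:

[Figure: two trees with the root values 5 (A) and 6, where the root 6 has the children 2 (B) and the tree rooted at 4 from the previous step.]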

Example: Huffman encoding(3)
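After the final recursive step:

[Figure: a single tree with the root value 11, whose children are 5 (A) and the tree rooted at 6 from the previous step.]

We scrub the frequency labels.

[Figure: the resulting encoding trie; A is at depth 1, B at depth 2, R at depth 3, and C and D at depth 4.]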

Exercise 13.3
How many bits do we need to encode ABRACADABRA?
Topic 13.3

Proof of optimality of Huffman encoding

Minimum weighted path length

Definition 13.4
Given frequencies $f_{c_1}, \ldots, f_{c_k}$, the minimum weighted path length $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k})$ is the
smallest weighted path length over all encoding tries for these frequencies. A trie that attains the minimum is called a witness.

Commentary: The value of MWPL does not depend on any particular trie. It is a property of the frequency distribution.

A recursive relation
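Theorem 13.1
$\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k}) \le f_{c_1} + f_{c_2} + \mathrm{MWPL}(f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k})$
Proof.
Consider a witness trie $T$ for $\mathrm{MWPL}(f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k})$.
There is a node in $T$ labeled with $f_{c_1} + f_{c_2}$ with a terminal child (below left).
We replace the terminal child by two nodes labeled $f_{c_1}$ and $f_{c_2}$ with the terminal children $c_1$ and $c_2$ (below right).
Since $c_1$ and $c_2$ are one level deeper than the replaced terminal, the result is a trie for $f_{c_1}, \ldots, f_{c_k}$ whose weighted path length is
$f_{c_1} + f_{c_2} + \mathrm{MWPL}(f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k})$. The minimum cannot exceed this value. Hence proved.

[Figure: left, the node $f_{c_1} + f_{c_2}$ with a terminal child; right, the same node with the children $f_{c_1}$ and $f_{c_2}$, which have the terminal children $c_1$ and $c_2$.]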
Reverse recursive relation
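Theorem 13.2
If $f_{c_1}$ and $f_{c_2}$ are the two minimum frequencies, then $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k}) = f_{c_1} + f_{c_2} + \mathrm{MWPL}(f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k})$.
Proof.
There is a witness for $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k})$ where the parents of $c_1$ and $c_2$ are siblings. (Why?) (below left)
We construct a tree for the frequencies $f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k}$ such that the weighted path length of the
tree is $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k}) - f_{c_1} - f_{c_2}$ (below right).

[Figure: left, the node $f_{c_1} + f_{c_2}$ with the children $f_{c_1}$ and $f_{c_2}$, which have the terminal children $c_1$ and $c_2$; right, the node $f_{c_1} + f_{c_2}$ with a terminal child.]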
Therefore, $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k}) \ge f_{c_1} + f_{c_2} + \mathrm{MWPL}(f_{c_1} + f_{c_2}, f_{c_3}, \ldots, f_{c_k})$. Together with Theorem 13.1, the equality follows.
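
The recursive relation of Theorem 13.2 gives a way to compute the minimum weighted path length without building a trie. The sketch below is an illustration under one assumption that is not stated in the lecture: for a single frequency the value is taken to be 0, since the only character needs no bits. The name `mwpl` is illustrative.

```python
def mwpl(freqs: list) -> int:
    """Minimum weighted path length, following the recursion of Theorem 13.2."""
    freqs = sorted(freqs)
    if len(freqs) <= 1:
        return 0  # assumption: a single character is encoded with zero bits
    a, b, rest = freqs[0], freqs[1], freqs[2:]  # a and b are the two minimums
    return a + b + mwpl([a + b] + rest)


# Frequencies of A, B, R, C, D in ABRACADABRA.
print(mwpl([5, 2, 2, 1, 1]))
```

Sorting on every call keeps the sketch short; it is not meant to be efficient.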
Correctness of BuildTree
Commentary: In the proof, the induction step is non-trivial to understand.
Theorem 13.3
Huffman($f_{c_1}, \ldots, f_{c_k}$) always returns a tree that is a witness of $\mathrm{MWPL}(f_{c_1}, \ldots, f_{c_k})$.
Proof.
We prove this inductively.
In the call BuildTree($T_1, \ldots, T_k$), we assume each $T_i$ is a witness of the respective MWPL. (For which frequencies?)

Base case:
Trivial. There is a single tree and we return the tree.

Induction step:
Since we update the trees by combining the two trees with minimum weight, we have the following due
to the previous theorem.

$$\mathrm{MWPL}(T_1.\mathit{value}, \ldots, T_k.\mathit{value}) = T_i.\mathit{value} + T_j.\mathit{value} + \mathrm{MWPL}(T_i.\mathit{value} + T_j.\mathit{value}, \ldots)$$

For the left-hand side, we will have the witness of the frequencies due to the construction. For the MWPL term on the right-hand side, the witness is returned by the recursive call due to the induction hypothesis.
Practical Huffman

When we compress a file, we do not compute the frequencies for the entire file in one go.

▶ We compute an encoding trie for each block of bytes.
▶ We check whether the data allows compression. If it does not, we do not compress the file.
▶ If the file is small, we use a precomputed encoding trie.

Exercise 13.4
How many bits are needed per character for 8 characters if frequencies are all equal?

DEFLATE

In addition to encoding tries (Huffman encoding), the Linux utility gzip uses the LZ77 algorithm for compression.

The combined algorithm is called DEFLATE.
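
DEFLATE can be tried out from Python through the standard `zlib` module, which wraps the zlib library. The sketch below is only a round-trip demonstration; the repeated input is an arbitrary choice made here to give the compressor something to find, and the exact compressed size depends on the library version.

```python
import zlib

# An artificial input with many repetitions.
data = b"ABRACADABRA" * 100

compressed = zlib.compress(data)        # DEFLATE stream in a zlib wrapper
restored = zlib.decompress(compressed)  # inverse operation

print(len(data), len(compressed))
print(restored == data)  # True
```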

Topic 13.4

LZ77

Repeated string

In LZ77, we search whether a string is repeated within a sliding window on the input stream.

A repeated occurrence is replaced by a reference, which is a pair of the offset and the length of the
string.

The references are treated as additional symbols in the input stream.
Example 13.9
Before encoding ABRACADABRA with a trie, the string is transformed to

ABRACAD[7, 4]

Here [7, 4] stands for the second ABRA: it repeats the string that starts 7 characters earlier and has length 4.
We run Huffman on the above string.
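
The replacement of repeated strings by references can be sketched with a naive search over the window. This is an illustration, not the matching strategy of gzip: the function name `lz77_tokens`, the greedy longest-match rule, and the parameter values (a 32 KiB window and a minimum match length of 3) are assumptions made here. Short repeats, such as the single A in ABRACAD, are kept as literals.

```python
def lz77_tokens(text: str, window: int = 32 * 1024, min_len: int = 3) -> list:
    """Turn text into literals and (offset, length) references."""
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_off = 0, 0
        # Try every start position inside the sliding window.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_len:
            tokens.append((best_off, best_len))  # reference to earlier text
            i += best_len
        else:
            tokens.append(text[i])               # literal character
            i += 1
    return tokens


print(lz77_tokens("ABRACADABRA"))
# ['A', 'B', 'R', 'A', 'C', 'A', 'D', (7, 4)]
```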

Topic 13.5

Tutorial problems

Single-bit Huffman code

Exercise 13.5
a. In a Huffman code instance, show that if there is a character with frequency greater than $\frac{2}{5}$,
then there is a codeword of length 1.
b. Show that if all frequencies are less than $\frac{1}{3}$, then there is no codeword of length 1.

Predictable text

Exercise 13.6
Suppose that there is a source that has three characters a,b,c. The output of the source cycles in
the order of a,b,c followed by a again, and so on. In other words, if the last output was a b, then
the next output will either be a b or a c. Each letter is equally probable. Is the Huffman code the
best possible encoding? Are there any other possibilities? What would be the pros and cons of
this?

Compute Huffman code tree

Exercise 13.7
Given the following frequencies, compute the Huffman code tree.
a 20    d 7     g 8     j 4
b 6     e 25    h 8     k 2
c 6     f 1     i 12    l 1
End of Lecture 13
