MLSP_LAB_EXP2
Grading Rubric
For each row, tick the best applicable: Below Expectation / Lacking in Some / Meets all Expectation.

Criterion (Points)
Completeness of the report
Organization of the report (5 pts)
Quality of figures (5 pts)
Building the min-heap (20 pts)
Generating the code book and compressed bit stream (25 pts)
Ability to decode the Huffman tree from its in-order traversal representation (20 pts)
Ability to recover the dataset (25 pts)
TOTAL (100 pts)
1 Introduction
Compression of a signal involves multiple kinds of approaches. Since a sample of a signal may be represented in a dimension much higher than its intrinsic dimension, one family of approaches applies operations that reduce its dimension. Principal component analysis (PCA) and singular value decomposition (SVD) are two such approaches for reducing the dimensionality of a signal toward its intrinsic dimension. Alternatively, operations like the discrete Fourier transform (DFT) and the discrete cosine transform (DCT) compact the energy of the signal by reducing the effective number of samples in the transformed domain that are needed to recover the original-domain signal. Such approaches are generally referred to as energy-compacting transforms and offer lossy compression of the signal. The other family of operations reduces the number of bits required to represent a sample. Since neither the number of samples nor their dimension is affected, these transforms typically do not lead to any loss of energy of the signal. This family of operations is known as lossless compression.
In this experiment we implement a mechanism of lossless compression of a vector representing a set of samples, using the principle of the entropy of the sample values.
2 Huffman coding
The method of Huffman coding is adopted to minimize the number of bits required to represent a sample; the lower limit is the entropy of the values which make up the samples.
Let us consider a vector X constituted of the samples X = [x_0, x_1, ..., x_i, ..., x_{N-1}], where x_i ∈ R^{D×1}. As an example, consider the string X = {'h', 'e', 'l', 'l', 'o'}. If every character is represented as 8-bit ASCII, then we have x_i ∈ B^{8×1}, where B = {0, 1} denotes the set of bits; thus X ∈ B^{8×5}, and the average number of bits per sample is 8. However, since the number of unique symbols in this case is only 4, one may say that 2 bits per symbol would suffice. This number of bits per symbol would of course vary with the constituents of X. Huffman coding is one way to evolve a reduced code representation for each sample; interestingly, the number of bits per sample is not fixed, and this variable-length representation is what reduces the average number of bits per sample.
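As a quick check on this lower bound, the following minimal Python sketch (the helper name entropy_per_symbol is ours) computes the empirical entropy of the example string:

    import math
    from collections import Counter

    def entropy_per_symbol(samples):
        # Empirical entropy in bits per sample: the lower limit on the
        # average code length achievable by any lossless symbol code.
        counts = Counter(samples)
        total = len(samples)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    X = ['h', 'e', 'l', 'l', 'o']
    print(entropy_per_symbol(X))  # ~1.922 bits/sample, far below the 8 bits of ASCII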
[Figure 1: Step-wise construction of the Huffman tree: (a) Step 1, (b) Step 2, (c) Step 3.]
The resulting tree can be represented as in Fig. 2. Now, using this tree, we can find the bit code for each symbol, as in Table 1.
Figure 2: Huffman tree created for the dataset in the example. # denotes a non-leaf node, and the symbol present at a leaf node is mentioned appropriately, viz. x = 'l', etc.
u_i    Bit code
'h'    110
'e'    111
'l'    0
'o'    10

Table 1: Bit code for each symbol u_i generated using the Huffman tree.
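Below is a minimal Python sketch of building the Huffman tree with a min-heap (heapq) and deriving the code book; it assumes a tuple-based tree representation (a leaf is a symbol, a non-leaf is a (left, right) pair), and all names are illustrative. Note that heap tie-breaking may yield a code assignment different from Table 1, yet with the same optimal average length.

    import heapq
    from collections import Counter

    def build_huffman_tree(samples):
        # Min-heap entries are (frequency, tie_breaker, node); a node is
        # either a leaf symbol or a (left, right) pair for a non-leaf node.
        counts = Counter(samples)
        heap = [(f, i, sym) for i, (sym, f) in enumerate(counts.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # two least-frequent nodes
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, tie, (left, right)))
            tie += 1
        return heap[0][2]

    def code_book(node, prefix=''):
        # Walk the tree: append '0' on the left branch, '1' on the right.
        if not isinstance(node, tuple):          # leaf: a symbol
            return {node: prefix or '0'}
        left, right = node
        book = code_book(left, prefix + '0')
        book.update(code_book(right, prefix + '1'))
        return book

    tree = build_huffman_tree('hello')
    print(code_book(tree))  # an optimal code; tie-breaking may differ from Table 1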
2.4 Decoding the bit stream
In order to recover the original form of the dataset, stored as X̂, from the received bit stream Y, we again have to employ the Huffman tree. We start by reading one bit at a time and traversing the Huffman tree until we reach a leaf node, at which point the symbol at that leaf is emitted and the traversal restarts from the root, as sketched below. With this example, we would see the following steps being adopted.
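A minimal Python sketch of this traversal, assuming the tuple-based tree and code_book of the earlier sketch:

    def decode(bits, tree):
        # Walk the Huffman tree over the bit stream Y: go left on '0',
        # right on '1'; emit the symbol at a leaf and restart at the root.
        out, node = [], tree
        for b in bits:
            node = node[0] if b == '0' else node[1]
            if not isinstance(node, tuple):      # reached a leaf
                out.append(node)
                node = tree
        return out

    bits = ''.join(code_book(tree)[s] for s in 'hello')
    print(''.join(decode(bits, tree)))           # 'hello'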
2.5.1 Encoder
The encoder would perform the following tasks:
6. Encode the Huffman tree with its in-order traversal, also known as the code book (a sketch follows below).
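A minimal Python sketch of such a serialization, assuming the tuple-based tree of the earlier sketches and the '#' marker convention of Fig. 2; how the decoder consumes this representation is part of the exercise.

    def in_order(node):
        # In-order traversal of the Huffman tree: '#' for a non-leaf node
        # (the convention of Fig. 2), the symbol itself at a leaf.
        if not isinstance(node, tuple):
            return [node]
        left, right = node
        return in_order(left) + ['#'] + in_order(right)

    # For the tree of Fig. 2 this yields ['l', '#', 'o', '#', 'h', '#', 'e'].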
2.5.3 Decoder
The decoder would perform the following set of steps to recover the dataset:
3 Experiments
3.1 Dataset 1
3.1.1 Generation of the dataset
Generate a lower-case alphabetic and numeric string of a given length N, consisting of random characters without blank spaces, line breaks, or special characters. You can use the string and random libraries in Python to achieve this, as sketched below. This string denotes the dataset X. Store the generated string as a .txt file.
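A minimal sketch of such a generator using the string and random libraries (file names are illustrative):

    import random
    import string

    def generate_txt(n, path):
        # Random lower-case letters and digits only: no blank spaces,
        # line breaks, or special characters.
        symbols = string.ascii_lowercase + string.digits
        with open(path, 'w') as f:
            f.write(''.join(random.choice(symbols) for _ in range(n)))

    for N in (50, 100, 500, 1000, 5000):
        for t in range(10):
            generate_txt(N, f'data_N{N}_t{t}.txt')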
1. Generate t = 10 different .txt files for each of N = {50, 100, 500, 1000, 5000}.
2. Use the encoder to encode each of the files and store each as a separate .huf file.
3. Compute the compression factor for each file as sizeof(<fileName>.txt) / sizeof(<fileName>.huf), where the sizeof(·) operator returns the size of the file in bytes. Report the variation of compression factors for each N using a notched box plot (a sketch of this measurement and plot follows this list). Here the x-axis should represent N as specified above and the y-axis should represent the compression factor.
4. Use the decoder to decode each of the .huf files to obtain the decoded .txt file.
5. Measure the MSE between the original .txt file and the decoded .txt file.
6. Measure the time to encode and decode each file, and report the times separately as notched box plots for the encoder and the decoder. Here the x-axis should represent N as specified above and the y-axis should represent the computation time in seconds. Encoder and decoder times are to be shown as stacked/grouped items for each N.
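A minimal sketch of the compression-factor measurement and the notched box plot, assuming the file naming of the earlier sketch and using os.path.getsize and matplotlib:

    import os
    import matplotlib.pyplot as plt

    Ns = [50, 100, 500, 1000, 5000]
    # Compression factor per file: sizeof(.txt) / sizeof(.huf), in bytes.
    factors = [[os.path.getsize(f'data_N{N}_t{t}.txt') /
                os.path.getsize(f'data_N{N}_t{t}.huf') for t in range(10)]
               for N in Ns]

    plt.boxplot(factors, notch=True)
    plt.xticks(range(1, len(Ns) + 1), [str(N) for N in Ns])
    plt.xlabel('N')
    plt.ylabel('Compression factor')
    plt.savefig('compression_factor.png')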
3.2 Dataset 2
3.2.1 Preparation of the dataset
Here we would strive to apply Huffman coding to grayscale images of human faces of size 64 × 64 from the Olivetti faces dataset¹, with N = 400 samples. Since each image I ∈ Z^{64×64×1} has an 8-bit grayscale representation, we can alternatively write it as X ∈ B^{8×4096}, where B = {0, 1}.
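A minimal sketch of fetching the dataset and storing each image as an 8-bit grayscale .bmp, assuming scikit-learn and Pillow are available (fetch_olivetti_faces returns the 400 images as 64 × 64 floats in [0, 1]; file names are illustrative):

    import numpy as np
    from PIL import Image
    from sklearn.datasets import fetch_olivetti_faces

    faces = fetch_olivetti_faces()        # 400 images, 64 x 64, floats in [0, 1]
    for i, img in enumerate(faces.images):
        # Map back to 8-bit grayscale and store as .bmp.
        Image.fromarray((img * 255).astype(np.uint8), mode='L').save(f'face_{i:03d}.bmp')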
1. Store each image in the dataset as a .bmp file (see the sketch above) and then perform the following.
2. Use the encoder to encode each image in .bmp format and store each as a separate .huf file.
3. Compute the compression factor for each file as sizeof(<fileName>.bmp) / sizeof(<fileName>.huf), where the sizeof(·) operator returns the size of the file in bytes. Report the variation of compression factors over all images using a notched box plot. Here the y-axis should represent the compression factor.
4. Use the decoder to decode each of the .huf files to obtain the decoded .bmp file.
5. Measure the MSE between the original .bmp file and the decoded .bmp file (see the sketch after this list).
6. Measure the time to encode and decode each file, and report the times separately as notched box plots for the encoder and the decoder. Here the y-axis should represent the computation time in seconds. Encoder and decoder times are to be shown as stacked/grouped items.
7. Plot the compression factor on the x-axis and the encoder time on the y-axis over all images as a scatter plot. Repeat the same with the decoder time on the y-axis, and comment on your observations.
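A minimal sketch of the MSE measurement (an MSE of 0 confirms lossless recovery); the scatter plot is sketched in comments, assuming per-image lists of compression factors and encoder times collected during the run:

    import numpy as np
    from PIL import Image

    def mse(a_path, b_path):
        # Mean squared error between two grayscale images;
        # an MSE of 0 confirms that the coder is lossless.
        a = np.asarray(Image.open(a_path), dtype=np.float64)
        b = np.asarray(Image.open(b_path), dtype=np.float64)
        return np.mean((a - b) ** 2)

    # Scatter plot sketch, assuming lists `factors` and `enc_times`
    # were collected during encoding:
    # import matplotlib.pyplot as plt
    # plt.scatter(factors, enc_times)
    # plt.xlabel('Compression factor'); plt.ylabel('Encoder time (s)')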
¹ https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_olivetti_faces.html