Lec 3 Huffman Coding


Department of Electrical Engineering

Indian Institute of Technology Jodhpur


EEL3080: Data Communication Networks
2022-23 Second Semester (January - May 2023)

Unit 2 - Part 3

Information Source Coding


Arun Kumar Singh
10 January 2023

Today we will consider some examples of lossless source coding algorithms to conclude the discussion
on the idea of binary representation of information signals.

1 Huffman Codes
1.1 Binary Huffman Codes
Consider the source S with symbols s1, s2, · · · , sq and symbol probabilities P1, P2, · · · , Pq. Let the
symbols be ordered so that P1 ≥ P2 ≥ · · · ≥ Pq. By regarding the last two symbols of S as combined
into one symbol, we obtain a new source from S containing only q − 1 symbols. We refer to this new
source as a reduction of S. The symbols of this reduction of S may be reordered, and again we may
combine the two least probable symbols to form a reduction of this reduction of S. By proceeding in
this manner, we construct a sequence of sources, each containing one fewer symbol than the previous
one, until we arrive at a source with only two symbols.
Construction of a sequence of reduced sources, as illustrated in Figure 1, is the first step in the
construction of a compact instantaneous code for the original source S.

Figure 1: A source and its reduction

The second step is the recognition of a compact binary instantaneous code for the last reduced
source. The final step is to assign a compact instantaneous code for the source immediately preceding
the reduced source and work backward along the sequence of sources until we arrive at a compact
instantaneous code for the original source.
Figure 2 illustrates an example of the construction of a compact code. First we formed a sequence
of reduced sources from the original source S. Then we assigned the words 0 and 1 to the last source
in the sequence (S4 in this case). Finally, we worked our way back from S4 to S through the reduced
sources.

Figure 2: Synthesis of a compact code

It may sometimes be unnecessary to form a sequence of reductions of the original source all the way
to a source with only two symbols. This is so since we need only form reductions until we find the
first reduction for which we have a compact code. Once we have a compact code for any reduction of a
source, we may start working backward from this compact code. This point is illustrated in Figure 3.

Figure 3: Synthesis of a compact code
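The backward construction described above can be captured in a short program. The following is a minimal Python sketch (not part of the original notes); the function name, the example probabilities and the use of the heapq module are our own illustrative choices.

import heapq
from itertools import count

def huffman_code(probs):
    """probs: dict mapping symbol -> probability. Returns dict symbol -> binary codeword."""
    tiebreak = count()  # unique counter so the heap never compares the dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Combine the two least probable symbols (one reduction step) ...
        p0, _, code0 = heapq.heappop(heap)
        p1, _, code1 = heapq.heappop(heap)
        # ... and, working backward, prepend 0/1 to the codewords of each half.
        merged = {sym: "0" + w for sym, w in code0.items()}
        merged.update({sym: "1" + w for sym, w in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"s1": 0.4, "s2": 0.2, "s3": 0.2, "s4": 0.1, "s5": 0.1}
code = huffman_code(probs)
print(code)
print(sum(probs[s] * len(w) for s, w in code.items()))  # expected length: 2.2 bits/symbol here

Because of the tie-breaking noted below, the tree shape may differ between runs of different implementations, but the expected length of the resulting code is the same.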

1.2 r-ary Huffman Codes


To form an r-ary Huffman code, we shall combine the source symbols r at a time in order to form one
symbol in the reduced source. We would like the last source in the sequence to have exactly r symbols
(this will allow us to construct the trivial compact code for this source). The last source will have r
symbols if and only if the original source has r + α(r − 1) symbols, where α is an integer. Therefore, if
the original source does not have r + α(r − 1) symbols, we add 'dummy' symbols to the source until this
number is reached. The dummy symbols are assumed to have probability 0, and so they may be ignored
after the code is formed. Figure 4 illustrates an example of the construction of a quaternary compact code.
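As a quick check of the r + α(r − 1) condition, the number of dummy symbols to add can be computed directly; the short Python sketch below is our own illustration (the function name is arbitrary).

def num_dummy_symbols(q, r):
    """Dummy symbols needed so that q + d = r + alpha*(r - 1) for some integer alpha >= 0."""
    # Equivalent condition: (q + d - 1) must be a multiple of (r - 1).
    return (-(q - 1)) % (r - 1)

print(num_dummy_symbols(6, 4))   # 1, since 6 + 1 = 4 + 1*(4 - 1)
print(num_dummy_symbols(11, 4))  # 2, since 11 + 2 = 4 + 3*(4 - 1)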

1.3 Properties of Huffman Codes


Non-uniqueness:
In a trivial way, because the 0/1 labels on any pair of branches in a code tree can be reversed,
there are in general multiple different encodings that all have the same expected length. In fact, there
may be multiple optimal codes for a given set of symbol probabilities, and depending on how ties are
broken, Huffman coding can produce different non-isomorphic code trees, i.e., trees that look different
structurally and are not just relabelings of a single underlying tree. The two code trees shown in Figure 2
and Figure 5 are two non-isomorphic Huffman code trees, both optimal.
Optimality:

Figure 4: A quaternary compact code

Figure 5: Synthesis of another compact code

Huffman codes are optimal in the sense that no other instantaneous (prefix-free) code has a shorter
expected length, provided the symbols in the messages are drawn independently from a fixed, known
probability distribution. The optimal (Huffman) code obtained is also a compact code.
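As a concrete check, consider a four-symbol source with probabilities 0.4, 0.3, 0.2, 0.1. The Huffman procedure assigns codeword lengths 1, 2, 3, 3, so the expected length is L = 0.4(1) + 0.3(2) + 0.2(3) + 0.1(3) = 1.9 bits/symbol, while the entropy is H(S) ≈ 1.846 bits/symbol. No prefix-free code can do better than H(S), and the Huffman code stays within 1 bit of it, consistent with H(S) ≤ L(opt) ≤ H(S) + 1.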

1.4 Huffman coding with grouped symbols


By applying Huffman coding to the nth extension of the information source, some gain in compression
can be achieved, but at the cost of increased encoding and decoding complexity.
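As a quick illustration, consider a binary source with symbol probabilities 0.9 and 0.1. Its entropy is about 0.469 bits/symbol, yet any code for single symbols needs at least 1 bit/symbol. Applying Huffman coding to the second extension (pairs of symbols with probabilities 0.81, 0.09, 0.09, 0.01 and codewords 0, 10, 110, 111) gives an expected length of 1.29 bits per pair, i.e., 0.645 bits per original symbol.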
With Huffman's algorithm it is possible to find an optimal code, which is no more than 1 bit away
from the entropy bound. However, this approach has two significant problems: first, it assumes that the
symbol selection is independent and from a fixed, known distribution at each position in all messages,
and second, it requires knowledge of the individual symbol probabilities. For example, if we consider
normal text or images, the first assumption would require that individual letters or pixels are
independent of their neighbors. In reality most sources are much more complex than this. Also, in texts
or images, the distribution of letters or pixels depends on the language or content. So none of the
assumptions made in Huffman's algorithm is very realistic. In the next section we will look at algorithms
that can perform compression for sources whose statistics are unknown. These types of algorithms are
called universal source coding algorithms.

2 Dictionary Coding
The class of universal coding techniques includes different versions of dictionary coding algorithms. To
understand the source coding problem better, let us consider the problem of digital representation of
the text of a book written in, say, English. There are several possible approaches:

2.1 Fixed Dictionary
One approach is to use a predefined dictionary for the encoding. One can analyze a few books
and estimate the probabilities of different letters of the alphabet. Then, treat each letter as a symbol
and apply Huffman coding to compress the document of interest. This may work quite well for some
English texts. However, the algorithm will not work as well if it is used on, for example, a list of names,
program code or even an image, which has a completely different distribution of characters.

2.2 Adaptive Dictionary


It is only in very specific cases that the statistics of the source are known in enough detail to allow
the code to work from a fixed dictionary. Instead, it is often more efficient to use an adaptive dictionary
that is built from the data that is to be compressed. A direct generalization of the fixed dictionary
encoding is to have a two-pass algorithm. In the first pass, count how often each symbol (or pair
of symbols, or triple, whatever level of grouping you have chosen) appears and use those counts to
develop a Huffman code customized to the contents of the file. Then, in the second pass, encode the
file using the customized Huffman code. This strategy is expensive but workable, yet it falls short in
several ways.
After the first pass of the algorithm the estimated distribution might not be entirely correct. The
compression loss caused by assuming the wrong distribution can be derived by letting q(s) be the
true distribution of the random variable S and p(s) the estimated distribution. The optimal codeword
length for the source satisfies H(S) ≤ L(opt) ≤ H(S) + 1, where H(S) = −Eq [log q(s)]. When using the
estimated distribution p(s) instead, we get an average length of H(S)+D(q||p) ≤ L ≤ H(S)+D(q||p)+1,
where the relative entropy D(q||p) is the penalty paid for the mismatch between the distribution of the
code and that of the source.
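A small numerical illustration of the penalty term, using our own arbitrary three-symbol example in Python:

import math

def kl_divergence(q, p):
    """Relative entropy D(q||p) in bits, for distributions given as lists of probabilities."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.25, 0.25]   # true source distribution q(s)
p = [1/3, 1/3, 1/3]     # estimated distribution p(s) used to design the code
H = -sum(qi * math.log2(qi) for qi in q)
print(H)                    # H(S) = 1.5 bits
print(kl_divergence(q, p))  # D(q||p) ~ 0.085 bits/symbol of extra average length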
Moreover, whatever size symbol grouping is chosen, it will not do an optimal job on encoding
recurring groups of some different size, either larger or smaller. And if the symbol probabilities change
dramatically at some point in the file, a one-size-fits-all Huffman code will not be optimal; in this case
one would want to change the encoding midstream.
Among the adaptive methods there is a class of algorithms known as the Lempel-Ziv-Welch (LZW)
algorithm. This method was developed originally by Ziv and Lempel, and subsequently improved by
Welch.

3 Lempel-Ziv-Welch (LZW) Codes


The LZW code uses a dictionary that is indexed by the codes in use. The dictionary is assumed to be
initialized with 256 entries (indexed with ASCII codes 0 through 255) representing the ASCII table. The
compression algorithm assumes that the output is either a file or a communication channel, and the
input is a file or a buffer. Conversely, the decompression algorithm assumes that the input is a file or a
communication channel and the output is a file or a buffer.
As the message to be encoded is processed, the LZW algorithm builds a dictionary that maps
symbol sequences to/from an N-bit index. The dictionary has 2^N entries and the transmitted code can
be used at the decoder as an index into the dictionary to retrieve the corresponding original symbol
sequence. The sequences stored in the dictionary can be arbitrarily long. The algorithm is designed so
that the dictionary can be reconstructed by the decoder based on information in the encoded stream;
the dictionary, while central to the encoding and decoding process, is never transmitted. This property
is crucial to the understanding of the LZW method.

3.1 Encoding Procedure


The encoder reads one character at a time. If the current work string extended by that character is in
the dictionary, the encoder appends the character to the work string and waits for the next one. This
occurs on the first character as well. If the extended work string is not in the dictionary (such as when
the second character comes across), the encoder adds the extended string to the dictionary and sends
over the wire (or writes to a file) the code assigned to the work string without the new character. It
then sets the work string to the new character.

3.1.1 Algorithm
• Step 1. Initialize the dictionary to contain one entry for each byte. Initialize the encoded string
  with the first byte of the input stream.

• Step 2. Read the next byte from the input stream.

• Step 3. If the byte is an EOF, go to Step 6.

• Step 4. If concatenating the byte to the encoded string produces a string that is in the dictionary:

  – concatenate the byte to the encoded string;
  – go to Step 2.

• Step 5. If concatenating the byte to the encoded string produces a string that is not in the
  dictionary:

  – add the new string to the dictionary;
  – write the code for the encoded string to the output stream;
  – set the encoded string equal to the new byte;
  – go to Step 2.

• Step 6. Write out the code for the encoded string and exit.

3.1.2 Pseudo-code
initialize TABLE[0 to 255] = code for individual bytes
STRING = get input symbol
while there are still input symbols:
    SYMBOL = get input symbol
    if STRING + SYMBOL is in TABLE:
        STRING = STRING + SYMBOL
    else:
        output the code for STRING
        add STRING + SYMBOL to TABLE
        STRING = SYMBOL
output the code for STRING
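The pseudo-code translates almost directly into Python. The sketch below is our own illustrative implementation (the function name is an assumption, and packing the integer codes into fixed N-bit fields is left out); it operates on byte strings and returns the list of integer codes.

def lzw_encode(data: bytes):
    """LZW encoder following the pseudo-code above; returns a list of integer codes."""
    table = {bytes([i]): i for i in range(256)}   # one entry for each possible byte
    next_code = 256
    string = data[:1]                             # STRING = first input symbol
    codes = []
    for b in data[1:]:
        symbol = bytes([b])
        if string + symbol in table:              # still matches a dictionary entry
            string = string + symbol
        else:
            codes.append(table[string])           # output the code for STRING
            table[string + symbol] = next_code    # add STRING + SYMBOL to TABLE
            next_code += 1
            string = symbol                       # STRING = SYMBOL
    if string:
        codes.append(table[string])               # output the code for the final STRING
    return codes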

3.2 Decoding Procedure


LZW data is decoded in pretty much the opposite way to how it is encoded. The dictionary is initialized
so that it contains an entry for each byte. Instead of maintaining an encoded string, the decoding
algorithm maintains the last code word and the first character in the string it encodes. New code words
are read from the input stream one at a time and the string encoded by each new code word is output.
During the encoding process, the code prior to the current code was written because concatenating
the first character of the string encoded by the current code word with the string encoded by the prior
code word produced a string that was not in the dictionary. When that happened, the string formed by
the concatenation was added to the dictionary. The same string needs to be added to the dictionary
when decoding.

3.2.1 Algorithm
• Step 1. Initialize the dictionary to contain one entry for each byte.

• Step 2. Read the first code word from the input stream and write out the byte it encodes.

• Step 3. Read the next code word from the input stream.

• Step 4. If the code word is an EOF, exit.

• Step 5. Write out the string encoded by the code word.

• Step 6. Concatenate the first character of the string encoded by the new code word to the string
  produced by the previous code word and add the resulting string to the dictionary.

• Step 7. Go to Step 3.

3.2.2 Pseudo-code
initialize TABLE[0 to 255] = code for individual bytes
CODE = read next code from encoder
STRING = TABLE[CODE]
output STRING
while there are still codes to receive:
    CODE = read next code from encoder
    if TABLE[CODE] is not defined:      // needed because sometimes the decoder
        ENTRY = STRING + STRING[0]      // may not yet have the entry (see 3.2.3)
    else:
        ENTRY = TABLE[CODE]
    output ENTRY
    add STRING + ENTRY[0] to TABLE
    STRING = ENTRY
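As before, here is a minimal Python sketch of the decoder (our own illustration; it consumes the list of integer codes produced by the lzw_encode sketch given earlier):

def lzw_decode(codes):
    """LZW decoder following the pseudo-code above, including the missing-entry exception."""
    table = {i: bytes([i]) for i in range(256)}   # one entry for each possible byte
    next_code = 256
    string = table[codes[0]]                      # the first code always encodes a single byte
    out = [string]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:                                     # exception: code not yet in the decoder's table
            entry = string + string[:1]
        out.append(entry)
        table[next_code] = string + entry[:1]     # add STRING + first byte of ENTRY to TABLE
        next_code += 1
        string = entry
    return b"".join(out)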

3.2.3 Exception to the Rules


When decoding certain input streams, the decoder may see a code word that is one larger than anything
in its dictionary. Whenever this exception occurs, concatenate the first character of the string encoded
by the previous code word to the end of the string encoded by the previous code word. The resulting
string is the value of the new code word. Write it to the output stream and add it to the dictionary.

Some interesting observations about LZW compression:


1. The encoder algorithm is greedy: it is designed to find the longest possible match in the dictionary
before it makes a transmission.
2. The dictionary is filled with sequences actually found in the message stream. No encodings are
wasted on sequences not actually found in the file.
3. A common choice for the size of the dictionary is 4096 (N = 12). A larger table means the encoder
has a longer memory for sequences it has seen and increases the possibility of discovering repeated
sequences across longer spans of the message. However, dedicating dictionary entries to remembering
sequences that will never be seen again decreases the efficiency of the encoding.
For universal source codes, which perform compression for sources whose statistics are unknown,
the efficiency of the code is measured by the data compression ratio:

Data Compression Ratio = Uncompressed Size / Compressed Size

The space savings of a universal source code is defined as:

Space Savings = 1 − (Compressed Size / Uncompressed Size)
Example 1:
Consider the string abcabcabcabcabcabc to demonstrate the encoding process.
The encoded output is '97 98 99 256 258 257 259 262 257' with binary representation '001100001
001100010 001100011 100000000 100000010 100000001 100000011 100000110 100000001'. This binary
string will be the input for the LZW decoder.
The decoded output is 'abcabcabcabcabcabc'.
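Using the illustrative lzw_encode and lzw_decode sketches given earlier, the same numbers can be reproduced (assuming 9-bit codes, as in the binary string above):

codes = lzw_encode(b"abcabcabcabcabcabc")
print(codes)                        # [97, 98, 99, 256, 258, 257, 259, 262, 257]
print(lzw_decode(codes))            # b'abcabcabcabcabcabc'
print((18 * 8) / (len(codes) * 9))  # compression ratio = 144 / 81 ~ 1.78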

Table 1: Encoder Dictionary
S    | Message Byte | Look-up | Result    | Code Output | New Table Entry
---- | ------------ | ------- | --------- | ----------- | ----------------
-    | a            | -       | -         | -           | -
a    | b            | ab      | not found | 97 (a)      | Table[256]=ab
b    | c            | bc      | not found | 98 (b)      | Table[257]=bc
c    | a            | ca      | not found | 99 (c)      | Table[258]=ca
a    | b            | ab      | found     | -           | -
ab   | c            | abc     | not found | 256         | Table[259]=abc
c    | a            | ca      | found     | -           | -
ca   | b            | cab     | not found | 258         | Table[260]=cab
b    | c            | bc      | found     | -           | -
bc   | a            | bca     | not found | 257         | Table[261]=bca
a    | b            | ab      | found     | -           | -
ab   | c            | abc     | found     | -           | -
abc  | a            | abca    | not found | 259         | Table[262]=abca
a    | b            | ab      | found     | -           | -
ab   | c            | abc     | found     | -           | -
abc  | a            | abca    | found     | -           | -
abca | b            | abcab   | not found | 262         | Table[263]=abcab
b    | c            | bc      | found     | -           | -
bc   | -end-        | -       | -         | 257         | -

Table 2: Decoder Dictionary


Input Code | New Table Entry  | Decoded String
---------- | ---------------- | --------------
97         | None             | a
98         | Table[256]=ab    | b
99         | Table[257]=bc    | c
256        | Table[258]=ca    | ab
258        | Table[259]=abc   | ca
257        | Table[260]=cab   | bc
259        | Table[261]=bca   | abc
262        | Table[262]=abca  | abca
257        | Table[263]=abcab | bc

NOTE: The decoding process demonstrates the exception and how it is handled. (New encoded
string = old encoded string + the first character of the old encoded string = abc + a = abca).
Uncompressed size = 18 × 8 = 144 bits.
Compressed size = 9 × 9 = 81 bits.

Data Compression Ratio = Uncompressed Size / Compressed Size = 144 / 81 ≈ 1.78

Space Savings = 1 − Compressed Size / Uncompressed Size = 1 − 81/144 = 0.4375

4 Run-length encoding
Imagine you are given the string of bits

0000000 111 0000 11 00000

consisting of runs of lengths 7, 3, 4, 2 and 5.

Rather than create codewords, we could simply store this string of bits as the sequence (7, 3, 4, 2, 5).
This strategy is used in fax-machine transmission, and also in JPEG.
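A toy Python sketch of this idea follows (our own illustration; real fax and JPEG coders use more elaborate schemes, e.g. combining run lengths with entropy coding):

from itertools import groupby

def run_lengths(bits: str):
    """Return the lengths of the runs in a binary string, e.g. '0011' -> [2, 2]."""
    return [len(list(group)) for _, group in groupby(bits)]

print(run_lengths("0000000" + "111" + "0000" + "11" + "00000"))  # [7, 3, 4, 2, 5]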

5 Discrete vs Analog Information Source


An information source or source is a mathematical model for a physical entity that produces a succession
of symbols called 'outputs' in a random manner. The symbols produced may be real numbers such
as voltage measurements from a transducer, binary numbers as in computer data, two-dimensional
intensity fields as in a sequence of images, continuous or discontinuous waveforms, and so on. The
space containing all of the possible output symbols is called the alphabet of the source, and a source is
essentially an assignment of a probability measure to events consisting of sets of sequences of symbols
from the alphabet. We first distinguish between two important classes of information sources:

5.1 Discrete sources


The output of a discrete source is a sequence of symbols from a known discrete alphabet. This alphabet
could be the alphanumeric characters, the characters on a computer keyboard, English letters, the
symbols in sheet music (arranged in some systematic fashion), binary digits, etc. The discrete alphabets
in this course are assumed to contain a finite set of symbols.

5.2 Analog sources


The output of an analog source, in the simplest case, is an analog real waveform. The word analog is
used to emphasize that the waveform can be arbitrary and is not restricted to taking on amplitudes from
some discrete set of values. For example, the output of an analog source might be an image (represented
as an intensity function of horizontal/vertical location) or video (represented as an intensity function of
horizontal/vertical location and time).
There are many differences between discrete sources and analog sources. The most important is
that a discrete source can be, and almost always is, encoded in such a way that the source output can
be uniquely retrieved from the encoded string of binary digits. On the other hand, for analog sources,
there is usually no way to map the source values to a bit sequence such that the source values are
uniquely decodable. Thus, some sort of quantization is necessary for analog sources, and this introduces
distortion. Binary representation for analog sources thus involves a trade-off between the bit rate and
the amount of distortion.
Analog sequence sources are almost invariably encoded by first quantizing each element of the
sequence (or more generally each successive n-tuple of sequence elements) into one of a finite set of
symbols. This symbol sequence is a discrete sequence which can then be represented by a binary sequence.
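As a toy illustration of this layered view (our own sketch, with an arbitrary step size): a uniform scalar quantizer maps each real-valued sample to one of a finite set of level indices, and the resulting discrete index sequence can then be source-coded with the techniques above.

def uniform_quantize(sample, step=0.25):
    """Map a real-valued sample to the index of the nearest quantization level."""
    return round(sample / step)

samples = [0.03, -0.41, 0.88, 0.12]                 # analog source outputs
indices = [uniform_quantize(s) for s in samples]    # discrete symbols to be encoded
reconstructed = [i * 0.25 for i in indices]         # reconstruction differs from the samples
print(indices)        # [0, -2, 4, 0]
print(reconstructed)  # [0.0, -0.5, 1.0, 0.0], showing the quantization distortion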
Figure 6 summarizes this layered view of analog and discrete source coding.
As illustrated, discrete source coding is not only an important subject in its own right for encoding
text-like sources, but also the inner layer in the encoding of analog sequences and waveforms.

Figure 6: Discrete sources require only the inner layer above, whereas the outer two layers are used for
analog sequences.
