Unit 2 - Part 7
Coding Information Sources
Arun Kumar Singh
18 January 2017
We discuss two source coding algorithms to compress messages (a message is a sequence of symbols).
The first, Huffman coding, is efficient when one knows the probabilities of the different symbols making
up a message, and each symbol is drawn independently from some known distribution. The second,
Lempel-Ziv-Welch (LZW), is an adaptive compression algorithm that does not assume any knowledge
of the symbol probabilities. Both Huffman codes and LZW are widely used in practice and are part
of many real-world standards such as GIF, JPEG, MPEG, and MP3.
The compression algorithm assumes that the input is a file or a buffer and the output is a file or a
communication channel. Conversely, the decompression algorithm assumes that the input is a file or a
communication channel and the output is a file or a buffer.
As the message to be encoded is processed, the LZW algorithm builds a dictionary that maps
symbol sequences to and from an $N$-bit index. The dictionary has $2^N$ entries, and the transmitted code can
be used at the decoder as an index into the dictionary to retrieve the corresponding original symbol
sequence. The sequences stored in the dictionary can be arbitrarily long. The algorithm is designed so
that the decoder can reconstruct the dictionary from the information in the encoded stream; the
dictionary, while central to the encoding and decoding process, is never transmitted. This property
is crucial to understanding the LZW method.
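For concreteness, a small Python sketch of this dictionary under our own choice of N = 12 (the names are ours, not from the text):

N = 12                                             # width in bits of a transmitted code word
CAPACITY = 2 ** N                                  # 2**12 = 4096 dictionary entries in total
dictionary = {bytes([i]): i for i in range(256)}   # one entry per byte, codes 0-255
# indices 256 .. CAPACITY - 1 stay free for sequences found in the message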
2.1 Encoding Procedure
2.1.1 Algorithm
Step 1. Initialize the dictionary to contain one entry for each byte. Initialize the encoded string with
the first byte of the input stream.
Step 2. Read the next byte from the input stream.
Step 3. If there are no more bytes, output the code word for the encoded string and exit.
Step 4. If concatenating the byte to the encoded string produces a string that is in the dictionary:
the concatenated string becomes the encoded string; go to Step 2.
Step 5. If concatenating the byte to the encoded string produces a string that is not in the
dictionary: output the code word for the encoded string; add the concatenated string to the
dictionary; set the encoded string equal to the byte; go to Step 2.
2.1.2 Pseudo-code
set w = NIL
loop while there are input characters
    read a character k
    if wk exists in the dictionary
        w = wk
    else
        output the code for w
        add wk to the dictionary
        w = k
endloop
output the code for w
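A minimal Python sketch of this encoder, under our own assumptions: byte-oriented input, and an unbounded dictionary rather than one capped at $2^N$ entries (the name lzw_encode is hypothetical):

def lzw_encode(data: bytes) -> list[int]:
    # Step 1: one dictionary entry per byte, codes 0-255.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    codes = []
    w = data[:1]                        # the encoded string starts as the first byte
    for k in data[1:]:
        wk = w + bytes([k])
        if wk in dictionary:            # greedy: keep extending the match
            w = wk
        else:
            codes.append(dictionary[w])     # output the code for w
            dictionary[wk] = next_code      # remember the new sequence
            next_code += 1
            w = bytes([k])
    if w:
        codes.append(dictionary[w])     # flush the final encoded string
    return codes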
2.2 Decoding Procedure
LZW data is decoded essentially in the reverse of how it is encoded. The dictionary is again initialized so that
it contains an entry for each byte. Instead of maintaining an encoded string, the decoding algorithm
maintains the last code word and the first character of the string it encodes. New code words are read
from the input stream one at a time, and the string encoded by each new code word is written to the output.
During the encoding process, the code word prior to the current one was written because concatenating
the first character of the string encoded by the current code word to the string encoded by the prior
code word produced a string that was not in the dictionary. When that happened, the string formed by
the concatenation was added to the dictionary. The same string needs to be added to the dictionary when decoding.
2.2.1 Algorithm
Step 1. Initialize the dictionary to contain one entry for each byte.
Step 2. Read the first code word from the input stream and write out the byte it encodes.
Step 3. Read the next code word from the input stream.
Step 4. If there are no more code words, exit.
Step 5. Write out the string encoded by the code word.
Step 6. Concatenate the first character of the string encoded by the new code word to the string
encoded by the previous code word and add the resulting string to the dictionary.
Step 7. Go to Step 3.
2.2.2 Pseudo-code
read a fixed-length token k (code or character)
output k
w = k
loop while there are input tokens
    read a fixed-length token k
    if k exists in the dictionary
        entry = dictionary entry for k
    else
        entry = w + first char of w    (exception: code not yet in the dictionary)
    output entry
    add w + first char of entry to the dictionary
    w = entry
endloop
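A matching Python sketch of the decoder, under the same assumptions (lzw_decode is again a hypothetical name, and a non-empty code stream is assumed); the else branch handles the exception case demonstrated in Example 1 below:

def lzw_decode(codes: list[int]) -> bytes:
    # Step 1: one dictionary entry per byte, codes 0-255.
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]            # the first code word always encodes a byte
    out = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        else:                           # exception: code not yet in the dictionary
            entry = w + w[:1]           # old string + its own first character
        out.append(entry)
        dictionary[next_code] = w + entry[:1]   # the same entry the encoder added
        next_code += 1
        w = entry
    return b"".join(out)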
1. The encoder algorithm is greedy: it is designed to find the longest possible match in the dictionary
before it makes a transmission.
2. The dictionary is filled with sequences actually found in the message stream. No encodings are
wasted on sequences that never actually occur.
3. A common choice for the size of the dictionary is 4096 entries (N = 12). A larger table gives the encoder
a longer memory for sequences it has seen and increases the possibility of discovering repeated
sequences across longer spans of the message. However, dedicating dictionary entries to remembering
sequences that will never be seen again decreases the efficiency of the encoding. Note also that with
N = 12 the code words are not byte-aligned, which affects how they are written out (see the sketch below).
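As an aside on item 3 (our own illustration, not prescribed by the text): with N = 12 a code word occupies a byte and a half, so two consecutive code words are commonly packed into three bytes, for example:

def pack12(a: int, b: int) -> bytes:
    # Pack two 12-bit code words into three bytes (big-endian convention here;
    # real formats such as GIF make their own packing choices).
    return bytes([a >> 4, ((a & 0xF) << 4) | (b >> 8), b & 0xFF])

assert pack12(0xABC, 0xDEF) == bytes([0xAB, 0xCD, 0xEF])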
The efficiency of a source code is defined as
$$\eta = \frac{H(S)}{L},$$
where $H(S)$ denotes the entropy of the source and $L$ denotes the average length of the codeword for
the given source code.
The redundancy of a source code is defined as
$$\gamma = 1 - \eta.$$
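As a quick illustration (a hypothetical source, not one taken from the text): let the source emit symbols $a, b, c$ with probabilities $\frac{1}{2}, \frac{1}{4}, \frac{1}{4}$, coded with lengths $1, 2, 2$ (a Huffman code for this source). Then
$$H(S) = \tfrac{1}{2}\log_2 2 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{4}\log_2 4 = 1.5 \text{ bits}, \qquad L = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2 = 1.5,$$
so $\eta = H(S)/L = 1$ and $\gamma = 1 - \eta = 0$; the code has no redundancy.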
Example 1:
Consider the string abcabcabcabcabcabc to demonstrate the encoding and decoding process.
NOTE: The decoding process demonstrates the exception case and how it is handled. (New encoded
string = old encoded string + the first character of the old encoded string = abc + a = abca.)
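Running the Python sketches from earlier on this string (lzw_encode and lzw_decode are the hypothetical helpers defined above) makes the exception visible:

message = b"abcabcabcabcabcabc"
codes = lzw_encode(message)
print(codes)    # [97, 98, 99, 256, 258, 257, 259, 262, 257]
# Code 262 (abca) arrives one step before the decoder has added entry 262,
# which is exactly the exception case handled in lzw_decode.
assert lzw_decode(codes) == message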
Table 2: Decoder Dictionary

Code read   String output      New dictionary entry
a           a                  (none)
b           b                  256 = ab
c           c                  257 = bc
256         ab                 258 = ca
258         ca                 259 = abc
257         bc                 260 = cab
259         abc                261 = bca
262         abca (exception)   262 = abca
257         bc                 263 = abcab