
Department of Electrical Engineering

Indian Institute of Technology Jodhpur


EE 321: Contemporary Communication Systems
2016-17 Second Semester (January - May 2017)

Unit 2 - Part 7
Coding Information Sources
Arun Kumar Singh
18 January 2017

We discuss two source coding algorithms to compress messages (a message is a sequence of symbols).
The first, Huffman coding, is efficient when one knows the probabilities of the different symbols making
up a message, and each symbol is drawn independently from some known distribution. The second,
Lempel-Ziv-Welch (LZW), is an adaptive compression algorithm that does not assume any knowledge
of the symbol probabilities. Both Huffman codes and LZW are widely used in practice, and are a part
of many real-world standards such as GIF, JPEG, MPEG, MP3, and more.

1 Adaptive Variable-length Codes


To understand the source coding (compression) problem better, let us consider the problem of digital
representation of the text of a book written in, say, English. There are several possible approaches:
One approach is to analyze a few books and estimate the probabilities of different letters of the
alphabet. Then, treat each letter as a symbol and apply Huffman coding to compress the document
of interest. This approach is reasonable but ends up achieving relatively small gains compared to the
best one can do. One big reason is that the probability with which a letter appears in any text is not
always the same. For example, a priori, "x" is one of the least frequently appearing letters, appearing
only about 0.3% of the time in English text. But in the sentence "... nothing can be said to be certain,
except death and ta ", the next letter is almost certainly an "x". In this context, no other letter can
be more certain!
An approach that adapts to the material being compressed might avoid these shortcomings. One
approach to adaptive encoding is to use a two-pass process: in the first pass, count how often each
symbol (or pair of symbols, or triple - whatever level of grouping you have chosen) appears and use
those counts to develop a Huffman code customized to the contents of the file. Then, in the second
pass, encode the file using the customized Huffman code. This strategy is expensive but workable, yet
it falls short in several ways. Whatever size symbol grouping is chosen, it will not do an optimal job on
encoding recurring groups of some different size, either larger or smaller. And if the symbol probabilities
change dramatically at some point in the file, a one-size-fits-all Huffman code will not be optimal; in
this case one would want to change the encoding midstream.
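
To make the two-pass idea concrete, here is a minimal sketch in Python (the function names are ours; the construction is the standard greedy Huffman merge, and a real encoder would also have to ship the code table to the decoder along with the encoded bits):

import heapq
from collections import Counter

def build_huffman_code(counts):
    # Heap items are (count, tiebreaker, {symbol: codeword-so-far}); the
    # integer tiebreaker keeps the dicts from ever being compared.
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    if len(heap) == 1:                        # degenerate one-symbol source
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)       # two least frequent subtrees
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}        # prefix one with 0
        merged.update({s: "1" + w for s, w in c2.items()})  # the other with 1
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def two_pass_encode(text):
    if not text:
        return {}, ""
    counts = Counter(text)                    # pass 1: gather statistics
    code = build_huffman_code(counts)
    bits = "".join(code[ch] for ch in text)   # pass 2: encode
    return code, bits

The cost of shipping the code table is exactly the kind of overhead that the adaptive approach described next avoids.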
A different approach to adaptation is taken by the popular Lempel-Ziv-Welch (LZW) algorithm.
This method was developed originally by Ziv and Lempel, and subsequently improved by Welch.

2 Lempel-Ziv-Welch (LZW) Codes


The LZW code uses a dictionary that is indexed by the code words in use. The dictionary is assumed
to be initialized with 256 entries (indexed with ASCII codes 0 through 255) representing the ASCII
table. The compression algorithm assumes that the output is either a file or a communication channel,
the input being a file or buffer. Conversely, the decompression algorithm assumes that the input is a
file or a communication channel and the output is a file or a buffer.
As the message to be encoded is processed, the LZW algorithm builds a dictionary that maps
symbol sequences to/from an N-bit index. The dictionary has 2^N entries and the transmitted code can
be used at the decoder as an index into the dictionary to retrieve the corresponding original symbol
sequence. The sequences stored in the dictionary can be arbitrarily long. The algorithm is designed so
that the dictionary can be reconstructed by the decoder based on information in the encoded stream -
the dictionary, while central to the encoding and decoding process, is never transmitted. This property
is crucial to the understanding of the LZW method.
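
As a small illustration of this shared state, both sides can construct the same starting dictionary independently (a sketch in Python; the variable names are ours):

encoder_dict = {chr(i): i for i in range(256)}   # sequence -> code
decoder_dict = {i: chr(i) for i in range(256)}   # code -> sequence
next_code = 256   # first free index for sequences learned from the message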

2.1 Encoding Procedure


The encoder reads one character at a time. If the current work string extended by that character is in
the dictionary, the encoder appends the character to the work string and waits for the next one. This
is what happens with the first character as well, since every single character is already in the dictionary.
If the extended work string is not in the dictionary (which can happen as soon as the second character
arrives), the encoder adds the extended work string to the dictionary and sends over the wire (or writes
to a file) the code assigned to the work string without the new character. It then sets the work string
to the new character.

2.1.1 Algorithm
Step 1. Initialize dictionary to contain one entry for each byte. Initialize the encoded string with
the first byte of the input stream.

Step 2. Read the next byte from the input stream.

Step 3. If the byte is an EOF goto step 6.

Step 4. If concatenating the byte to the encoded string produces a string that is in the dictionary:

concatenate the byte to the encoded string.
go to step 2.

Step 5. If concatenating the byte to the encoded string produces a string that is not in the
dictionary:

add the new string to the dictionary.
write the code for the encoded string to the output stream.
set the encoded string equal to the new byte.
go to step 2.

Step 6. Write out code for encoded string and exit.

2.1.2 Pseudo-code
set w = NIL
loop until end of input
    read a character k
    if wk exists in the dictionary
        w = wk
    else
        output the code for w
        add wk to the dictionary
        w = k
endloop
output the code for w
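
The pseudo-code translates almost line for line into runnable Python. The following is a sketch (the function name is ours, and the dictionary is allowed to grow without bound; observation 3 at the end of this section discusses the fixed-size table used in practice):

def lzw_encode(message):
    # One dictionary entry per byte value, as in step 1 of the algorithm.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w = ""                                  # the current work string
    output = []
    for k in message:
        if w + k in dictionary:
            w = w + k                       # keep extending the match
        else:
            output.append(dictionary[w])    # emit code for the longest match
            dictionary[w + k] = next_code   # learn the new sequence
            next_code += 1
            w = k
    if w:
        output.append(dictionary[w])        # flush the final work string
    return output

On the string of Example 1 below, lzw_encode("abcabcabcabcabcabc") returns [97, 98, 99, 256, 258, 257, 259, 262, 257], matching the output column of Table 1 (97, 98 and 99 are the ASCII codes of a, b and c).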

2.2 Decoding Procedure
LZW data is decoded essentially in the reverse of how it is encoded. The dictionary is initialized so that
it contains an entry for each byte. Instead of maintaining an encoded string, the decoding algorithm
maintains the last code word and the first character of the string it encodes. New code words are read
from the input stream one at a time, and the string encoded by each new code word is written to the
output.
During the encoding process, the code prior to the current code was written because concatenating
the first character of the current code's string with the string encoded by the prior code produced a
string that was not in the dictionary. When that happened, the string formed by the concatenation
was added to the dictionary. The same string needs to be added to the dictionary when decoding.

2.2.1 Algorithm
Step 1. Initialize dictionary to contain one entry for each byte.

Step 2. Read the first code word from the input stream and write out the byte it encodes.

Step 3. Read the next code word from the input stream.

Step 4. If the code word is an EOF exit.

Step 5. Write out the string encoded by the code word.

Step 6. Concatenate the first character of the string encoded by the new code word to the string
produced by the previous code word and add the resulting string to the dictionary.

Step 7. Go to step 3.

2.2.2 Pseudo-code
read a fixed-length code word k (it encodes a single character)
output the string encoded by k
w = string encoded by k
loop until end of input
    read a fixed-length code word k
    if k exists in the dictionary
        entry = dictionary entry for k
    else
        entry = w + first char of w    (the exception case, see Section 2.2.3)
    output entry
    add w + first char of entry to the dictionary
    w = entry
endloop

2.2.3 Exception to the Rules


When decoding certain input streams the decoder may see a code word that is one larger than anything
that it has in its dictionary. Whenever this exception occurs, concatenate the first character of the
string encoded by the previous code word to the end of the string encoded by the previous code word.
The resulting string is the value of the new code word. Write it to the output stream and add it to the
dictionary.
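
Putting the decoding procedure and the exception rule together gives the following runnable Python sketch (again the function name is ours, and the dictionary grows without bound):

def lzw_decode(codes):
    # One dictionary entry per byte value, mirroring the encoder.
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]     # the first code always encodes a single byte
    output = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        elif k == next_code:
            entry = w + w[0]     # the exception case described above
        else:
            raise ValueError("corrupt code stream")
        output.append(entry)
        dictionary[next_code] = w + entry[0]   # the entry the encoder added
        next_code += 1
        w = entry
    return "".join(output)

For any string s, lzw_decode(lzw_encode(s)) == s. On the code stream of Example 1 the exception fires exactly once, at code 262 (the "Not In Dictionary" row of Table 2).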

Some interesting observations about LZW compression:

1. The encoder algorithm is greedy - it is designed to find the longest possible match in the dictionary
before it makes a transmission.

2. The dictionary is filled with sequences actually found in the message stream. No encodings are
wasted on sequences not actually found in the file.

3. A common choice for the size of the dictionary is 4096 (N = 12). A larger table means the encoder
has a longer memory for sequences it has seen and increases the possibility of discovering repeated
sequences across longer spans of message. However, dedicating dictionary entries to remembering
sequences that will never be seen again decreases the efficiency of the encoding.
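
A minimal sketch of the fixed-size policy in Python (assuming the simplest choice, freezing the dictionary once it is full; real implementations may instead reset the table or evict stale entries):

MAX_ENTRIES = 2 ** 12    # N = 12: 12-bit codes, 4096 dictionary entries

def maybe_add(dictionary, seq, next_code):
    # Add seq only while the table has room; return the updated next_code.
    if next_code < MAX_ENTRIES:
        dictionary[seq] = next_code
        next_code += 1
    return next_code

Replacing the unconditional dictionary additions in lzw_encode and lzw_decode above with maybe_add keeps the two tables consistent, since both sides stop learning at the same point in the stream.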

3 Efficiency and Redundancy of the Source Code


The source coding theorem can be used to derive a measure of the efficiency of a source code:

η = H(S) / L,

where H(S) denotes the entropy of the source and L denotes the average length of a codeword for the
given source code.
The redundancy of a source code is defined as:

γ = 1 − η.
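
As a hypothetical numeric illustration (the source below is invented for the example):

from math import log2

probs = [0.5, 0.25, 0.125, 0.125]         # an invented four-symbol source
H = -sum(p * log2(p) for p in probs)      # entropy H(S) = 1.75 bits/symbol

L = 2.0             # a fixed-length 2-bit code for the four symbols
eta = H / L         # efficiency = 0.875
gamma = 1 - eta     # redundancy = 0.125
# An optimal Huffman code for this source (codeword lengths 1, 2, 3, 3)
# achieves L = 1.75, hence efficiency 1 and redundancy 0.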

Example 1:
Consider the string abcabcabcabcabcabc to demonstrate the encoding process.

Table 1: Encoder Dictionary

New Byte    Encoded String    New Code      Code Output
a           a                 None          None
b           b                 256 (ab)      a
c           c                 257 (bc)      b
a           a                 258 (ca)      c
b           ab                None          None
c           c                 259 (abc)     256 (ab)
a           ca                None          None
b           b                 260 (cab)     258 (ca)
c           bc                None          None
a           a                 261 (bca)     257 (bc)
b           ab                None          None
c           abc               None          None
a           a                 262 (abca)    259 (abc)
b           ab                None          None
c           abc               None          None
a           abca              None          None
b           b                 263 (abcab)   262 (abca)
c           bc                None          None
(EOF)       None              None          257 (bc)

NOTE: The decoding process demonstrates the exception and how it is handled. (New encoded
string = old encoded string + the first character of the old encoded string = abc + a = abca)

Table 2: Decoder Dictionary

Input Code    Encoded String       Added Code     String Output
a             a                    None           a
b             b                    256 (ab)       b
c             c                    257 (bc)       c
256           ab                   258 (ca)       ab
258           ca                   259 (abc)      ca
257           bc                   260 (cab)      bc
259           abc                  261 (bca)      abc
262           Not In Dictionary    262 (abca)     abca
257           bc                   263 (abcab)    bc
