Lossless Data Compression Algorithm Abraham Lempel Jacob Ziv Terry Welch LZ78

Lempel-Ziv-Welch (LZW) is a lossless data compression algorithm that builds a dictionary of strings as it encodes the input data. It replaces strings in the input with codes for strings in the dictionary, allowing longer and longer strings to be encoded with each code. The decompressor builds the same dictionary from the codes to reconstruct the original input data. The algorithm achieves compression by coding repeated strings more concisely than if they were encoded character by character.

Uploaded by

rlnandha_2006
Copyright
© Attribution Non-Commercial (BY-NC)

Lempel-Ziv-Welch

Lempel-Ziv-Welch (LZW) is a universal lossless data compression algorithm created by
Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an
improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978.
The algorithm is designed to be fast to implement but is not usually optimal because it
performs only limited analysis of the data.

The algorithm in a nutshell

The algorithm takes a string as input and processes it using a dictionary. The string is
encoded (and the dictionary grows) as the string is being processed. Initially, the
dictionary contains every possible character (the alphabet) with its corresponding code.
At each step the algorithm takes the longest word w in the dictionary that matches the next
characters of the string and outputs the code for w. It then takes the character c that
follows w in the string and adds wc to the dictionary. Matching resumes at c, and this repeats
until the string is consumed. The idea is that, as the string is being processed, we populate
the dictionary with longer strings, allowing bigger chunks of the string to be encoded at each
replacement.

Alternative description of the algorithm

The compressor algorithm builds a string translation table from the text being compressed. The
string translation table maps fixed-length codes (usually 12-bit) to strings. The string table is
initialized with all single-character strings (256 entries in the case of 8-bit characters). As the
compressor character-serially examines the text, it stores every unique two-character string into
the table as a code/character concatenation, with the code mapping to the corresponding first
character. As each two-character string is stored, the first character is sent to the output.
Whenever a previously-encountered string is read from the input, the longest such previously-
encountered string is determined, and then the code for this string concatenated with the
extension character (the next character in the input) is stored in the table. The code for this
longest previously-encountered string is output and the extension character is used as the
beginning of the next word.
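
The code/character table described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name lzw_compress_pairs is hypothetical, the input is assumed to be 8-bit bytes (so codes 0-255 stand for the single-character strings), and the table is allowed to grow without the 12-bit limit mentioned above.

```python
def lzw_compress_pairs(data: bytes) -> list[int]:
    """Compress bytes using a table keyed by (prefix code, extension byte)."""
    # Codes 0..255 implicitly represent the single-byte strings,
    # so only multi-byte strings need an explicit table entry.
    table: dict[tuple[int, int], int] = {}
    next_code = 256
    codes: list[int] = []
    if not data:
        return codes
    w = data[0]  # code of the longest match found so far
    for c in data[1:]:
        key = (w, c)
        if key in table:
            w = table[key]          # the match can be extended by c
        else:
            codes.append(w)         # emit the code of the longest match
            table[key] = next_code  # store match + extension character
            next_code += 1
            w = c                   # the extension character starts the next word
    codes.append(w)                 # flush the final match
    return codes
```

Storing each string as a (code, character) pair keeps every table key the same small size, whatever the length of the string it stands for.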

The decompressor algorithm only requires the compressed text as an input, since it can build an
identical string table from the compressed text as it is recreating the original text. However, an
abnormal case shows up whenever the sequence character/string/character/string/character
(with the same character for each character and string for each string) is encountered in the input
and character/string is already stored in the string table. When the decompressor reads the code
for character/string/character in the input, it cannot resolve it because it has not yet stored this
code in its table. This special case can be dealt with because the decompressor knows that the
extension character is the previously-encountered character.[1]

Algorithm

Compressor algorithm:

w = NIL;
add all possible charcodes to the dictionary
for (every character c in the uncompressed data) do
    if ((w + c) exists in the dictionary) then
        w = w + c;
    else
        add the dictionary code for w to output;
        add (w + c) to the dictionary;
        w = c;
    endif
done
add the dictionary code for w to output;
display output;
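
A direct Python transcription of the compressor pseudocode might look like the sketch below. The names lzw_compress and ALPHABET are illustrative, and the default alphabet ('#' followed by the capital letters, numbered from 0) is an assumption; every input character is assumed to belong to the alphabet.

```python
import string

# Assumed alphabet: '#' = 0, 'A' = 1, ..., 'Z' = 26.
ALPHABET = "#" + string.ascii_uppercase


def lzw_compress(text: str, alphabet: str = ALPHABET) -> list[int]:
    """Return the list of dictionary codes for text."""
    # Add all possible single characters to the dictionary.
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    w = ""
    output = []
    for c in text:
        if w + c in dictionary:
            w = w + c                            # keep extending the match
        else:
            output.append(dictionary[w])         # emit code for the longest match
            dictionary[w + c] = len(dictionary)  # next free code
            w = c
    if w:
        output.append(dictionary[w])             # emit the final match
    return output
```

With this alphabet, lzw_compress("TOBEORNOTTOBEORTOBEORNOT#") yields 17 codes, beginning 20, 15, 2, 5, 15, 18.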

Decompressor algorithm:

add all possible charcodes to the dictionary
read a code k;
entry = dictionary entry for k;
output entry;
w = entry;
while (read a code k) do
    if (index k exists in dictionary) then
        entry = dictionary entry for k;
    else if (k == currSizeDict) then
        entry = w + w[0];
    else
        signal invalid code;
    endif
    output entry;
    add w + entry[0] to the dictionary;
    w = entry;
done
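
The decompressor can be sketched in Python in the same way. The names lzw_decompress and ALPHABET are illustrative, the default alphabet ('#' followed by the capital letters, numbered from 0) is an assumption, and the input is taken to be a list of integer codes rather than a bit stream.

```python
import string

# Assumed alphabet: '#' = 0, 'A' = 1, ..., 'Z' = 26.
ALPHABET = "#" + string.ascii_uppercase


def lzw_decompress(codes: list[int], alphabet: str = ALPHABET) -> str:
    """Rebuild the original text from a list of LZW codes."""
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    if not codes:
        return ""
    w = dictionary[codes[0]]  # the first code is always a single character
    output = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        elif k == len(dictionary):
            # The code refers to the entry that is about to be created:
            # it must be w followed by its own first character.
            entry = w + w[0]
        else:
            raise ValueError(f"invalid code: {k}")
        output.append(entry)
        dictionary[len(dictionary)] = w + entry[0]  # same entry the encoder added
        w = entry
    return "".join(output)
```

For example, lzw_decompress([1, 27]) returns "AAA": code 27 is not yet in the table when it arrives, so it is resolved as "A" + "A".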

Example

This example shows the LZW algorithm in action, showing the status of the output and the
dictionary at every stage, both in encoding and decoding the message. In order to keep things
clear, let us assume that we're dealing with a simple alphabet - capital letters only, and no
punctuation or spaces. This example has been constructed to give reasonable compression on a
very short message; when used on real data, repetition is generally less pronounced, and so the
initial parts of a message will see little compression. As the message grows, however, the
compression ratio tends asymptotically to the maximum.[2] A message to be sent might then look
like the following:

TOBEORNOTTOBEORTOBEORNOT#

The # is a marker used to show that the end of the message has been reached. Clearly, then, we
have 27 symbols in our alphabet (the 26 capital letters A through Z, plus the # character). A
computer will render these as strings of bits; 5-bit strings are needed to give sufficient
combinations to encompass the entire dictionary. As the dictionary grows, the strings will need
to grow in length to accommodate the additional entries. A 5-bit string gives 2^5 = 32 possible
combinations of bits, and so when the 33rd dictionary word is created, the algorithm will have to
start using 6-bit strings (for all strings, including those which were previously represented by
only five bits). Note that since the all-zero string 00000 is used, and is labeled "0", the 33rd
dictionary entry will be labeled 32. The initial dictionary, then, will consist of the following:

# = 00000
A = 00001
B = 00010
C = 00011
.
.
.
Z = 11010
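
As a small sketch, the initial dictionary above can be generated rather than written out by hand. The variable names are illustrative; the only assumption is the ordering shown above ('#' first, then A through Z).

```python
import string

alphabet = "#" + string.ascii_uppercase  # '#' = 0, 'A' = 1, ..., 'Z' = 26

# Number of bits needed to represent the largest initial code (26).
width = (len(alphabet) - 1).bit_length()

# Map each symbol to its fixed-width binary string.
initial_dictionary = {ch: format(i, f"0{width}b") for i, ch in enumerate(alphabet)}

for ch in "#ABCZ":
    print(ch, "=", initial_dictionary[ch])
```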

Encoding

If we weren't using LZW, and just sent the message as it stands (25 symbols at 5 bits each), it
would require 125 bits. We will be able to compare this figure to the LZW output later. We are
now in a position to apply LZW to the message.
Current    Next    Output              Extended
Sequence   Char    Code = Bits         Dictionary

NULL       T
T          O       20 = 10100          27: TO    <-- this IS the 28th entry, but the initial
                                                     entries are numbered 0-26, so it is #27
O          B       15 = 01111          28: OB
B          E        2 = 00010          29: BE
E          O        5 = 00101          30: EO
O          R       15 = 01111          31: OR
R          N       18 = 010010         32: RN    <-- start using 6-bit strings
N          O       14 = 001110         33: NO
O          T       15 = 001111         34: OT
T          T       20 = 010100         35: TT
TO         B       27 = 011011         36: TOB
BE         O       29 = 011101         37: BEO
OR         T       31 = 011111         38: ORT
TOB        E       36 = 100100         39: TOBE
EO         R       30 = 011110         40: EOR
RN         O       32 = 100000         41: RNO
OT         #       34 = 100010         42: OT#
#                   0 = 000000

The number of bits used for each output is floor(lg2(init_dict_size + num_codes_output)) + 1,
where num_codes_output is the number of codes already sent. For the second O this gives
floor(lg2(27 + 4)) + 1 = 5 bits -> 01111; for R it gives floor(lg2(27 + 5)) + 1 = 6 bits -> 010010.

Total Length = 5*5 + 12*6 = 97 bits.
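
The total can be checked with a few lines of Python. This is a sketch under the width rule used in this example, floor(lg2(init_dict_size + number of codes already sent)) + 1; the function name code_widths is illustrative.

```python
def code_widths(num_codes: int, initial_size: int = 27) -> list[int]:
    """Bit width of each output code under the rule used in this example."""
    # For n > 0, n.bit_length() equals floor(log2(n)) + 1.
    return [(initial_size + sent).bit_length() for sent in range(num_codes)]


widths = code_widths(17)        # the example message produces 17 codes
lzw_bits = sum(widths)          # total length of the LZW output
raw_bits = 25 * 5               # 25 symbols sent as plain 5-bit codes
saving = raw_bits - lzw_bits

print(lzw_bits, raw_bits, saving, f"{saving / raw_bits:.1%}")
```

Under these assumptions the first five codes take 5 bits and the remaining twelve take 6 bits.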

In using LZW we have made a saving of 28 bits out of 125: we have reduced the message by
about 22%. If the message were longer, then the dictionary words would begin to represent
longer and longer sections of text, allowing repeated words to be sent very compactly.

Decoding

Imagine now that we have received the message produced above, and wish to decode it. We need
to know in advance the initial dictionary used, but we can reconstruct the additional entries as we
go, since they are always simply concatenations of previous entries.

Bits:            Output:   New Entry:
                           Full:        Partial:

10100  = 20      T                      27: T?
01111  = 15      O         27: TO       28: O?
00010  = 2       B         28: OB       29: B?
00101  = 5       E         29: BE       30: E?
01111  = 15      O         30: EO       31: O?
010010 = 18      R         31: OR       32: R?     <--- start using 6-bit strings
001110 = 14      N         32: RN       33: N?
001111 = 15      O         33: NO       34: O?
010100 = 20      T         34: OT       35: T?
011011 = 27      TO        35: TT       36: TO?    <--- for 35, only add the 1st element
011101 = 29      BE        36: TOB      37: BE?         of the next dictionary word
011111 = 31      OR        37: BEO      38: OR?
100100 = 36      TOB       38: ORT      39: TOB?
011110 = 30      EO        39: TOBE     40: EO?
100000 = 32      RN        40: EOR      41: RN?
100010 = 34      OT        41: RNO      42: OT?
000000 = 0       #

The only slight complication comes if the newly-created dictionary word is sent immediately. In
the decoding example above, when the decoder receives the first symbol, T, it knows that the
next new dictionary entry begins with a T, but what does it end with? The problem is illustrated
below. We are decoding part of a message that reads ABABA:

Bits:            Output:   New Entry:
                           Full:         Partial:

.
.
.
011101 = 29      AB        46: (word)    47: AB?
101111 = 47      AB?                                <--- what do we do here?

At first glance, this may appear to be asking the impossible of the decoder. We know ahead of
time that entry 47 should be ABA, but how can the decoder work this out? The critical step is to
note that 47 is built out of 29 plus whatever comes next. 47, therefore, ends with "whatever
comes next". But, since it was sent immediately, it must also start with "whatever comes next",
and so must end with the same symbol it starts with, namely A. This trick allows the decoder to
see that 47 must be ABA.

More generally, the situation occurs whenever the encoder encounters input of the form
cScSc, where c is a single character, S is a string and cS is already in the dictionary. The encoder
outputs the symbol for cS and puts a new symbol for cSc in the dictionary. Next it sees cSc in
the input and sends the new symbol it has just inserted into the dictionary. By the reasoning
presented in the above example, this is the only case where the newly-created symbol is sent
immediately.
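
The cScSc case can be observed with a small decoder that records every code it had to resolve before the corresponding entry existed. This is an illustrative sketch: decode_with_trace is a hypothetical name, and the alphabet ('#' = 0, 'A' = 1, ..., 'Z' = 26) and the use of integer codes instead of bit strings are assumptions.

```python
import string

# Assumed alphabet: '#' = 0, 'A' = 1, ..., 'Z' = 26.
ALPHABET = "#" + string.ascii_uppercase


def decode_with_trace(codes, alphabet=ALPHABET):
    """Decode LZW codes; also return the (code, string) pairs resolved early."""
    table = {i: ch for i, ch in enumerate(alphabet)}
    w = table[codes[0]]
    out, deferred = [w], []
    for k in codes[1:]:
        if k in table:
            entry = table[k]
        elif k == len(table):
            # Newly-created entry sent immediately: it starts with w and
            # ends with its own first character, which is w[0].
            entry = w + w[0]
            deferred.append((k, entry))
        else:
            raise ValueError(f"invalid code: {k}")
        out.append(entry)
        table[len(table)] = w + entry[0]
        w = entry
    return "".join(out), deferred


# Codes for the message ABABABA: A, B, AB (27), then ABA (29) sent immediately.
text, deferred = decode_with_trace([1, 2, 27, 29])
print(text, deferred)
```

Here c = A and S = B: when the encoder reaches the third character, AB is already in the dictionary and the remaining input ABABA has the form cScSc.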
