Implementation of Lempel-Ziv algorithm for
lossless compression using VHDL
Prof. Minaldevi K. Tank
HOD – Digital Electronics, Babasaheb Gawde Institute of Technology, Mumbai, India.
S.J. Pise (ed.), ThinkQuest 2010, DOI 10.1007/978-81-8489-989-4_51,

1. Introduction
In computer science and information theory, data compression, or source coding, is the process of encoding information using fewer bits than an unencoded representation would use, through the use of specific encoding schemes. As with any communication, compressed data communication only works when both the sender and the receiver of the information understand the encoding scheme. For example, this text makes sense only if the receiver understands that it is intended to be interpreted as characters representing the English language. Similarly, compressed data can only be understood if the decoding method is known by the receiver.

Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth. On the downside, compressed data must be decompressed to be used, and this extra processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it is being decompressed (the option of decompressing the video in full before watching it may be inconvenient, and requires storage space for the decompressed video). The design of data compression schemes therefore involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and uncompress the data.

2. What is compression?
Data compression enables devices to transmit or store the same amount of data in fewer bits. Compression is broadly classified into two types: lossless and lossy compression.

2.1 Lossless compression
Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender’s data more concisely without error. Lossless compression is possible because most real-world data has statistical redundancy.

2.2 Lossy compression
Lossy compression, also known as perceptual coding, is possible if some loss of fidelity is acceptable. Generally, lossy data compression is guided by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than it is to variations in color. JPEG image compression works in part by “rounding off” some of this less-important information. Lossy data compression provides a way to obtain the best fidelity for a given amount of compression. In some cases, transparent (unnoticeable) compression is desired; in other cases, fidelity is sacrificed to reduce the amount of data as much as possible.

Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression. However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has been compressed already will therefore usually result in an expansion, as will attempts to compress all but the most trivially encrypted data.

An example of lossless and lossy compression is the string 25.888888888. This string can be compressed as 25.[9]8. Interpreted as “twenty five point 9 eights”, the original string is perfectly
recreated, just written in a smaller form. In a lossy system, using 26 instead, the exact original data is lost, to the benefit of a smaller file size.

3. The Lempel-Ziv Algorithm
The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. The Lempel-Ziv algorithms compress by building a dictionary of previously seen strings. Unlike PPM, which uses the dictionary to predict the probability of each character and codes each character separately based on the context, the Lempel-Ziv algorithms code groups of characters of varying lengths. The original algorithms also did not use probabilities: strings were either in the dictionary or not, and all strings in the dictionary were given equal probability. Some of the newer variants, such as gzip, do take some advantage of probabilities.

At the highest level the algorithms can be described as follows. Given a position in a file, look through the preceding part of the file to find the longest match to the string starting at the current position, and output some code that refers to that match. Now move the current position past the match. The two main variants of the algorithm were described by Ziv and Lempel in two separate papers in 1977 and 1978, and are often referred to as LZ77 and LZ78. The algorithms differ in how far back they search and how they find matches. The LZ77 algorithm is based on the idea of a sliding window. The algorithm only looks for matches in a window a fixed distance back from the current position. Gzip and ZIP are both based on LZ77. The LZ78 algorithm is based on a more conservative approach to adding strings to the dictionary. Unix compress, the GIF format, and V.42bis (a standard modem protocol) are all based on LZ78.

4. Why VHDL?
4.1 Using the same language, it is possible to simulate as well as design complex logic.
4.2 Design reuse is possible.
4.3 A design can be described at various levels of abstraction.
4.4 It provides for modular design and testing.
4.5 The use of VHDL has tremendously reduced the “time to market” for large and small designs.
4.6 VHDL designs are portable across synthesis and simulation tools that adhere to the IEEE 1076 standard.
4.7 Using VHDL makes the design device independent.
4.8 The design description can be targeted to a PLD, ASIC, or FPGA very easily.
4.9 The designer has very little control at the gate level.
4.10 The logic generated for the same description may vary from tool to tool. This may be due to the algorithms used by the tools, which might be proprietary.

5. Lempel-Ziv-Welch
LZW (Lempel-Ziv-Welch) is the LZ78 variant that is most commonly used in practice. The algorithm is used to encode byte streams (i.e., each message is a byte). The algorithm maintains a dictionary of strings (sequences of bytes). The dictionary is initialized with one entry for each of the 256 possible byte values; these are strings of length one. As the algorithm progresses, it adds new strings to the dictionary such that each string is only added if a prefix one byte shorter is already in the dictionary. For example, John is only added if Joh had previously appeared in the message sequence. Each entry of the dictionary is given an index, where these indices are typically given out incrementally starting at 256.

6. LZ78 encoding and decoding
The basic idea is to parse the input sequence into non-overlapping blocks of different lengths while constructing a dictionary of blocks seen thus far.

6.1 Encoding
A dictionary is initialized to contain the single-character strings corresponding to all the possible input characters (and nothing else except the clear and stop codes if they’re being used). The algorithm works by scanning through the input string for successively longer substrings until it finds one that is not in the dictionary. When such a string is found, the index for the string less the last character (i.e., the longest substring that is in the dictionary) is retrieved from the dictionary and sent to output, and the new string (including the last character) is added to the dictionary with the next available code. The last input character is then used as the next starting point to scan for substrings. In this way, successively longer strings are registered in the dictionary and made available for subsequent encoding as single output values. The algorithm works best on data with repeated patterns, so the initial parts of a message will see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum.

6.2 Decoding
The decoding algorithm works by reading a value from the encoded input and outputting the corresponding string from the initialized dictionary. At the same time it obtains the next value from the input, and adds to the dictionary the concatenation of the string just output and the first character of the string obtained by decoding the next input value. The decoder then proceeds to the next input value (which was already read in as the “next value” in the previous pass) and repeats the process until there is no more input, at which point the final input value is decoded without any more additions to the dictionary. In this way the decoder builds up a dictionary which is identical to that used by the encoder,