
Text compression

There are two general approaches to text compression: statistical and dictionary based. Statistical methods
rely on generating good probability estimates of each symbol's appearance in the text. The more
accurate the estimates are, the better the compression obtained. A symbol here is usually a character, a
text word, or a fixed number of characters. The set of all possible symbols in the text is called the alphabet.
The task of estimating the probability of each next symbol is called modeling. A model is essentially a
collection of probability distributions, one for each context in which a symbol can be coded.
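
To make this concrete, the following is a minimal sketch (in Python, on an invented sample string) of the simplest possible model: a zero-order character model, that is, a single probability distribution that ignores context. Its entropy is a lower bound, in bits per symbol, on what any coder driven by this model can achieve.

import math
from collections import Counter

def order0_model(text):
    # Estimate one probability distribution over characters
    # by simple counting; no context is taken into account.
    counts = Counter(text)
    total = len(text)
    return {sym: n / total for sym, n in counts.items()}

model = order0_model("compression compresses text")
entropy = -sum(p * math.log2(p) for p in model.values())
print("alphabet size:", len(model))
print("zero-order entropy: %.2f bits/symbol" % entropy)

Higher-order models condition each distribution on the preceding symbols; the better the model predicts the next symbol, the fewer bits the coder needs.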

Statistical
There are two well-known statistical coding strategies: Huffman coding and arithmetic coding.

Huffman coding
The idea of Huffman coding is to assign a variable-length bit encoding to each different symbol of the text.
Compression is achieved by assigning a smaller number of bits to symbols with higher probabilities of
appearance.
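
The following is a minimal sketch of static Huffman coding over characters, assuming the probabilities are estimated by counting frequencies in the text itself; the names and the sample string are illustrative only.

import heapq
from collections import Counter

def huffman_codes(text):
    # Build a prefix code in which rarer symbols get longer codewords.
    # A tree is either a symbol (leaf) or a pair of subtrees.
    freq = Counter(text)
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)                     # tiebreaker for equal weights
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # one-symbol alphabet edge case
    walk(heap[0][2], "")
    return codes

text = "this is an example of huffman coding"
codes = huffman_codes(text)
encoded = "".join(codes[c] for c in text)
print(len(encoded), "bits vs", 8 * len(text), "bits uncompressed")

Because every symbol always maps to the same codeword, decoding can restart at any codeword boundary, which is one reason Huffman coding suits IR systems better than arithmetic coding, discussed next.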

Arithmetic coding
Arithmetic coding computes the code incrementally, one symbol at a time, as opposed to the Huffman
scheme, in which each different symbol is pre-encoded with its own fixed codeword. This
incremental nature does not allow decoding a string that starts in the middle of a compressed file. To
decode a symbol in the middle of a file compressed with arithmetic coding, it is necessary to decode the
whole text from the very beginning until the desired word is reached. This characteristic makes arithmetic
coding inadequate for use in an IR environment, where random access into the compressed text is needed.
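
A minimal sketch of the idea follows, using floating point for readability (real coders use incremental integer arithmetic to avoid precision loss); the three-symbol model is invented for the example. Note that decode() must replay the interval narrowing from the very first symbol, which is exactly why random access is impossible.

def cumulative(probs):
    # Map each symbol to its slice [low, high) of the unit interval.
    ranges, low = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (low, low + p)
        low += p
    return ranges

def encode(text, probs):
    # Narrow [low, high) one symbol at a time; any number inside
    # the final interval encodes the whole message.
    ranges = cumulative(probs)
    low, high = 0.0, 1.0
    for sym in text:
        span = high - low
        s_low, s_high = ranges[sym]
        low, high = low + span * s_low, low + span * s_high
    return (low + high) / 2

def decode(code, length, probs):
    # Decoding replays the narrowing from the start of the message;
    # there is no way to jump into the middle of the code.
    ranges = cumulative(probs)
    out = []
    for _ in range(length):
        for sym, (s_low, s_high) in ranges.items():
            if s_low <= code < s_high:
                out.append(sym)
                code = (code - s_low) / (s_high - s_low)
                break
    return "".join(out)

probs = {"a": 0.5, "b": 0.3, "c": 0.2}   # hypothetical fixed model
code = encode("abac", probs)
print(code, decode(code, 4, probs))      # prints the code and "abac"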

Dictionary
Dictionary methods replace a sequence of symbols with a pointer to a previous occurrence of that sequence.
The pointer representations are references to entries in a dictionary composed of a list of symbols (often
called phrases) that are expected to occur frequently. The best-known dictionary methods belong to the
Ziv-Lempel family.

Ziv-Lempel
Ziv-Lempel methods are able to reduce English texts to fewer than four bits per character.
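
As an illustration, here is a minimal sketch of LZ78, one member of the Ziv-Lempel family. Each output token is a pointer to a previously seen phrase plus the single new character that extends it; in a real coder the (index, character) tokens would themselves be written compactly in binary. The sample string is invented.

def lz78_encode(text):
    # phrase -> dictionary index; index 0 is the empty phrase.
    dictionary = {"": 0}
    tokens, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch              # keep extending the match
        else:
            tokens.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                        # flush a pending match
        tokens.append((dictionary[phrase[:-1]], phrase[-1]))
    return tokens

def lz78_decode(tokens):
    phrases, out = [""], []
    for index, ch in tokens:
        phrase = phrases[index] + ch  # follow the pointer, append ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

text = "abababababa"
tokens = lz78_encode(text)
print(tokens)                         # (index, char) pairs
print(lz78_decode(tokens) == text)    # True

On repetitive text the dictionary phrases grow longer as encoding proceeds, so each token covers more and more characters; this is where the compression comes from.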
