Improving the Efficiency of Lossless Text Data Compression Algorithms
A comparison of two reversible transforms
James R. Achuff
Penn State Great Valley
School of Graduate Professional Studies
30 East Swedesford Road, Malvern, PA 19355, USA
Abstract: Over the last decade the amount of textual information available in electronic form has exploded. It is
estimated that text data currently comprises nearly half of all Internet traffic but, as yet, no lossless
compression standard for text has been proposed.
A number of lossless text compression algorithms exist, however, none of these methods is able to
consistently reach its theoretical best-case compression ratio.
This paper evaluates the performance characteristics of several popular compression algorithms and
explores two strategies for improving ratios without significantly impacting computation time.
1. INTRODUCTION
Compression means making things smaller by applying pressure. Data compression means
reducing the number of bits needed to represent a particular piece of data. Text compression
means reducing the number of bits or bytes needed to store textual information. It is necessary
that the compressed form can be decompressed to reconstitute the original text, and it is usually
important that the original is recreated exactly, not approximately. This differentiates text
compression from many other kinds of data reduction, such as voice or picture coding, where
some degradation of the signal may be tolerable if the compression achieved is worth the
reduction in quality. [Bell, Cleary & Witten, 1990]
The immutable yardstick by which data compression is measured is the “compression ratio”:
the ratio of the size of the original uncompressed file to the size of the compressed file. For example, suppose a
data file takes up 100 kilobytes (KB). Using data compression software, that file could be reduced
in size to, say, 50 KB, making it easier to store on disk and faster to transmit over a network
connection. In this specific case, the data compression software reduces the size of the data file
by a factor of two, or results in a “compression ratio” of 2:1.
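As a quick illustration of this arithmetic, the following Python snippet reproduces the 100 KB example; the variable names are illustrative only:

```python
original_size = 100 * 1024      # 100 KB, in bytes
compressed_size = 50 * 1024     # 50 KB after compression

ratio = original_size / compressed_size
print(f"compression ratio = {ratio:.0f}:1")   # prints "compression ratio = 2:1"
```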
There are “lossless” and “lossy” forms of data compression. Lossless data compression is
used when the data has to be uncompressed exactly as it was before compression. Text files are
stored using lossless techniques, since losing a single character can in the worst case make the
text dangerously misleading. Lossless compression ratios are generally in the range of 2:1 to 8:1.
Compression algorithms reduce the redundancy in data to decrease the storage requirements
for that data. Data compression offers an attractive approach to reducing communications and
storage costs by using available bandwidth effectively. With the trend of increasing amounts of
digital data being transmitted over public and private networks expected to continue, it makes
sense to pursue research on developing algorithms that can most effectively use available network
bandwidth by maximally compressing data. This paper is focused on addressing this problem for
lossless compression of text files. It is well known that there are theoretical predictions on how
far a source file can be losslessly compressed [Shannon, 1951], but no existing compression
approaches consistently attain these bounds over wide classes of text files.
One approach to tackling the problem of developing methods to improve compression is to
develop better compression algorithms. However, given the sophistication of existing algorithms
such as arithmetic coding, Lempel-Ziv algorithms, Dynamic Markov Coding, Prediction by
Partial Match and their variants, it seems unlikely that major new progress will be made in this
area.
An alternate approach, which is taken in this paper, is to perform a lossless, reversible
transformation to a source file prior to applying an existing compression algorithm. This
transformation is designed to make it easier to compress the source file. Figure 1 illustrates this
strategy. The original text file is provided as input to the transformation, which outputs the
transformed text. This output is provided to an existing, unmodified data compression algorithm,
which compresses the transformed text. To decompress, one simply reverses the process by first
invoking the appropriate decompression algorithm and then providing the resulting text to the
inverse transform.
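As an illustration of this pipeline, the following Python sketch chains a placeholder reversible transform with an off-the-shelf compressor. Here zlib stands in for any unmodified compression algorithm, and the transform is the identity so the round trip is easy to verify; the function names are assumptions of this sketch, not taken from any of the systems discussed:

```python
import zlib

def transform(text: str) -> str:
    # Stand-in for a reversible transform such as *-encoding or LIPT;
    # the identity is used here so the round trip is easy to verify.
    return text

def inverse_transform(text: str) -> str:
    return text

def compress(text: str) -> bytes:
    # The compressor itself is unmodified; zlib stands in for any
    # off-the-shelf algorithm (gzip, bzip2, PPM, ...).
    return zlib.compress(transform(text).encode("utf-8"))

def decompress(data: bytes) -> str:
    # Reverse the process: decompress first, then invert the transform.
    return inverse_transform(zlib.decompress(data).decode("utf-8"))

original = "the quick brown fox jumps over the lazy dog " * 20
assert decompress(compress(original)) == original
```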
There are several important observations about this strategy. The transformation must be
exactly reversible, so that the overall lossless text compression requirement is not compromised.
The data compression and decompression algorithms are unmodified, so they do not exploit
information about the transformation while compressing. The intent is to use the strategy to
improve the overall compression ratio of the text in comparison with that achieved by the
compression algorithm alone. A similar strategy has been employed in the compression of
images and video transmissions using the Fourier transform, Discrete Cosine Transform or
wavelet transforms. In these cases, however, the transforms are usually lossy, meaning that some
data can be lost without compromising the interpretation of the image by a human.
One well-known example of the text compression strategy outlined in Figure 1 is the
Burrows-Wheeler Transform (BWT). BWT combines ad-hoc compression techniques (Run Length
Encoding, Move to Front) and Huffman coding to provide one of the best compression ratios
available on a wide range of data.
As stated above, text compression ought to be exact – the reconstructed message should be
identical to the original. Exact compression is also called noiseless (because it does not introduce
any noise into the signal), lossless (since no information is lost), or reversible (because
compression can be reversed to recover the original input exactly).
The task of finding a suitable model for text is an extremely important problem in
compression. Data compression is inextricably bound up with prediction. In the extreme case, if
one can predict infallibly what is going to come next, one can achieve perfect compression by
dispensing with transmission altogether. Even if one can only predict approximately what is
coming next, one can get by with transmitting just enough information to disambiguate the
prediction. Once predictions are available, they are processed by an encoder that turns them into
binary digits to be transmitted.
There are three ways that the encoder and decoder can maintain the same model: static,
semiadaptive, and adaptive modelling. In static modelling the encoder and decoder agree on a
fixed model, regardless of the text to be encoded. This is the method employed when sending a
message via Morse Code. In semiadaptive modelling, a “codebook” of the most frequently used
words or phrases is transmitted first and then used to encode and decode the message. Adaptive
modelling builds its “codebook” as it progresses according to a predefined method. In this way,
both the encoder and decoder use the same codebook without ever having to transmit the codes
with the data.
In 1952, D. A. Huffman introduced his method for the construction of minimum redundancy
codes – now more commonly known as “Huffman Coding”. In Huffman Coding, the characters
in a data file are converted to a binary code, where the most common characters in the file have
the shortest binary codes, and the least common have the longest. This is accomplished by
building a binary tree based upon the frequency with which characters occur in a file.
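The sketch below illustrates this construction with a minimal, textbook-style Huffman coder; it is an assumption-laden illustration (the helper name huffman_codes and the dictionary-of-codes representation are choices of this sketch), not the implementation used in any of the tools evaluated later:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Assign shorter bit strings to more frequent characters."""
    # Each heap entry is (frequency, tie-breaker, {char: code_so_far}).
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every code inside them.
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

message = "this is an example of huffman coding"
codes = huffman_codes(message)
encoded = "".join(codes[ch] for ch in message)   # common characters use the fewest bits
```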
1.1.3 LZ Coding
In 1977, Jacob Ziv and Abraham Lempel described an adaptive dictionary encoder in which
they “employ the concept of encoding future segments of the [input] via maximum-length
copying from a buffer containing the recent past output.” The essence is that phrases are
replaced with a pointer to where they have occurred earlier in the text.
Figure 2 illustrates how well this approach works for a variety of texts by indicating some of the
many instances where phrases could be replaced in this manner. A phrase might be a word, part
of a word, or several words. It can be replaced with a pointer as long as it has occurred once
before in the text, so coding adapts quickly to a new topic.
Figure 2. The principle of Ziv-Lempel coding – phrases are coded as pointers to earlier occurrences
Decoding a text that has been compressed in this manner is straightforward; the decoder
simply replaces a pointer by the already decoded text to which it points. In practice LZ coding
achieves good compression, and an important feature is that decoding can be very fast.
1.1.3.1 LZ77
LZ77 was the first form of LZ coding to be published. In this scheme pointers denote phrases
in a fixed-size window that precedes the coding position. There is a maximum length for
substrings that may be replaced by a pointer, usually 10 to 20. These restrictions allow LZ77 to
be implemented using a “sliding window” of N characters.
Ziv and Lempel showed that LZ77 could give at least as good compression as any
semiadaptive dictionary designed specifically for the string being encoded, if N is sufficiently
large. The main disadvantage of LZ77 is that although each encoding step requires a constant
amount of time, that constant can be large, and a straightforward implementation can require a
vast number of character comparisons per character coded. This property of slow encoding and
fast decoding is common to many LZ schemes.
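To make the scheme concrete, the sketch below implements a deliberately naive LZ77 encoder and decoder that emit (offset, length, next character) triples from a sliding window. The window size and maximum match length are illustrative defaults, and the linear search is far slower than the matching structures used in real implementations:

```python
def lz77_encode(text: str, window: int = 4096, max_len: int = 16):
    """Greedy LZ77: emit (offset, length, next_char) triples."""
    tokens, i = [], 0
    while i < len(text):
        start = max(0, i - window)
        best_off, best_len = 0, 0
        # Search the sliding window for the longest match starting at position i.
        for j in range(start, i):
            length = 0
            while (length < max_len and i + length < len(text)
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        next_char = text[i + best_len] if i + best_len < len(text) else ""
        tokens.append((best_off, best_len, next_char))
        i += best_len + 1
    return tokens

def lz77_decode(tokens):
    out = []
    for offset, length, next_char in tokens:
        for _ in range(length):
            out.append(out[-offset])   # copy from earlier decoded output
        out.append(next_char)
    return "".join(out)

sample = "abracadabra abracadabra"
assert lz77_decode(lz77_encode(sample)) == sample
```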
Finite-state probabilistic models are based on finite-state machines. They have a set of states
and transition probabilities that signify how likely the model is to move from one state to
another. Also, each state is labelled uniquely. Figure 3 shows a simple model with two states, 0
and 1.
Finite state-based modelling is typically too slow and too computationally cumbersome to
support practical text compression. Dynamic Markov Coding (DMC) however, provides an
efficient way of building complex state models that fit a particular sequence and is generally
regarded as the only state-based technique that can be applied to text compression. [Bell, Witten
& Cleary, 1989]
Figure 3. A simple finite-state model with two states, 0 and 1; each transition is labelled with its probability, p(0)=0.5 and p(1)=0.5
The basic idea of DMC is to maintain frequency counts for each transition in the current
finite-state model, and to “clone” a state when a related transition becomes sufficiently popular.
Cloning consumes resources by creating an extra state, and should not be performed unless it is
likely to be productive. High-frequency transitions have, by definition, been traversed often in
the past and are therefore likely to be traversed often in the future. Consequently, they are likely
candidates for cloning, since any correlations discovered will be utilised frequently.
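The sketch below conveys the flavour of this cloning rule for a binary alphabet. It is a simplification under stated assumptions: the thresholds, the field names and the way counts are carried over to the clone are illustrative choices of this sketch, not the exact rules of DMC:

```python
# Simplified binary-alphabet sketch; thresholds and field names are illustrative.
MIN_THIS, MIN_OTHERS = 2, 2

class State:
    def __init__(self):
        self.next = {}                 # bit -> successor state
        self.trans = {0: 1, 1: 1}      # counts for this state's outgoing transitions
        self.visits = 2                # total number of times this state was entered

def step(state, bit):
    """Follow one transition, updating counts and cloning the target if warranted."""
    target = state.next[bit]
    others = target.visits - state.trans[bit]   # traffic reaching the target from elsewhere
    if state.trans[bit] >= MIN_THIS and others >= MIN_OTHERS:
        clone = State()
        clone.next = dict(target.next)    # the clone inherits the target's successors
        clone.trans = dict(target.trans)  # (a real DMC splits these counts proportionally)
        state.next[bit] = clone           # this popular transition now leads to the clone
        target = clone
    state.trans[bit] += 1
    target.visits += 1
    return target

start = State()
start.next = {0: start, 1: start}         # initial one-state model
state = start
for bit in [0, 1, 1, 0, 1, 1, 1, 0]:
    state = step(state, bit)
```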
Prediction by Partial Match (PPM) predicts each character from the context of characters that
precede it. In order to keep track of the number of times that a certain character
followed a given context, the number of its occurrences is noted along each edge. Based on this
information PPM can assign probabilities to potentially subsequent characters. [Cleary and
Witten, 1984]
The length of a context is also called its order. Note that contexts of different orders might
yield different counts, leading to different predictions.
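A minimal sketch of the bookkeeping PPM relies on is shown below: counts of which characters have followed each context of order 0 to 2, from which a probability can be read off the longest matching context. A full PPM coder also needs escape probabilities to fall back to shorter contexts; this sketch simply returns zero in that case:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # context -> {next_char: count}

def update(text: str, max_order: int = 2) -> None:
    """Record, for every context of order 0..max_order, which character followed it."""
    for i, ch in enumerate(text):
        for order in range(max_order + 1):
            if i >= order:
                counts[text[i - order:i]][ch] += 1

def predict(context: str, ch: str) -> float:
    """Probability of ch after the longest matching suffix of the given context."""
    for order in range(len(context), -1, -1):
        seen = counts.get(context[len(context) - order:])
        if seen and ch in seen:
            return seen[ch] / sum(seen.values())
    return 0.0   # a full PPM coder falls back via escape probabilities instead

update("the theory of the thing")
print(predict("th", "e"))   # 0.75: 'e' followed "th" three times out of four
```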
Burrows and Wheeler released a research report in 1994 entitled “A Block Sorting Lossless
Data Compression Algorithm” which presented a data compression algorithm based on Wheeler’s
earlier work.
The BWT is an algorithm that takes a block of data and rearranges it using a sorting scheme.
The resulting output block contains exactly the same data elements that it started with, differing
only in their ordering. The transformation is reversible and lossless, meaning that the original
ordering of the data elements can be restored with no loss of fidelity.
The BWT is performed on an entire block of data at once, preferably the largest amount
possible. Since the BWT operates on data in memory, it must often break files up into smaller
pieces and process one piece at a time.
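The following sketch shows the textbook construction of the forward and inverse transform using explicit rotation sorting. The sentinel character and the repeated-sorting inverse are conveniences of this sketch; production implementations such as bzip use far more efficient suffix sorting:

```python
def bwt(block: str, sentinel: str = "\0") -> str:
    """Forward Burrows-Wheeler Transform of one block (naive rotation sort)."""
    s = block + sentinel                      # sentinel marks the original ending
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def ibwt(transformed: str, sentinel: str = "\0") -> str:
    """Inverse transform: repeatedly sort to rebuild the rotation table."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(sentinel))
    return original.rstrip(sentinel)

assert ibwt(bwt("banana")) == "banana"
```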
Work done by Awan and Mukherjee and Franceschini et al. details several lossless, reversible
transforms that can be applied to text files in order to improve their compressibility by established
algorithms. Two have been selected for study in this paper: star encoding (or *-encoding) and
length index preserving transform (LIPT).
The first transform proposed is an algorithm developed by Franceschini et al. Star encoding
(or *-encoding) is designed to exploit the natural redundancy of the language. It is possible to
replace certain characters in a word by a special placeholder character and retain a few key
characters so that the word is still retrievable.
For example, given the set of six-letter words {school, simple, strong, sturdy, supple} and
replacing “unnecessary” characters with a chosen symbol ‘*’, the set can be represented
unambiguously as {**h***, **m***, **r***, **u***, **p***}. In *-encoding, such an unambiguous
representation of a word, formed from a partial sequence of its original letters with the special
character ‘*’ as a placeholder for the rest, is called the signature of the word.
*-encoding utilises an indexed and sorted dictionary containing the natural form and the
signature of each word. No word in a 60,000 word English dictionary required the use of more
than two unencoded characters in its signature using Franceschini’s scheme. The predominant
character in *-encoded text is ‘*’ which occupies more than fifty percent of the space. If a word
is not in the dictionary, it is passed to the transformed text unaltered.
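The sketch below conveys the idea with a toy five-word dictionary. The greedy rule used here to choose which letter to keep is an assumption of this sketch and yields different signatures from the example above, and handling of capitalisation and punctuation is ignored:

```python
dictionary = ["school", "simple", "strong", "sturdy", "supple"]   # toy shared dictionary

def make_signatures(words):
    """Keep a single distinguishing letter of each word; '*' replaces the rest."""
    sigs = {}
    for word in words:
        for keep in range(len(word)):
            sig = "*" * keep + word[keep] + "*" * (len(word) - keep - 1)
            if sig not in sigs.values():
                sigs[word] = sig
                break
    return sigs

signatures = make_signatures(dictionary)
reverse = {sig: word for word, sig in signatures.items()}

def star_encode(text: str) -> str:
    # Words not in the dictionary pass through to the transformed text unaltered.
    return " ".join(signatures.get(w, w) for w in text.split())

def star_decode(text: str) -> str:
    return " ".join(reverse.get(w, w) for w in text.split())

sample = "strong school supple"
assert star_decode(star_encode(sample)) == sample
```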
The main drawback of *-encoding is that the compressor and decompressor need to share a
dictionary. The aforementioned 60,000 word English dictionary requires about one megabyte of
storage overhead that must be shared by all users of this transform. Also, special provisions
made to handle capitalisation, punctuation marks and special characters will most likely
contribute to a slight increase of the size of the input text in its transformed form.
2.1.2 LIPT
Another method investigated here is the Length Index Preserving Transform or LIPT. Fawzia
S. Awan and Amar Mukherjee developed LIPT as part of their project work at the University of
Central Florida. LIPT is a dictionary method that replaces words in a text file with a marker
character, a dictionary index and a word index.
LIPT is defined as follows: words of length more than four are encoded starting with ‘*’, which
allows predictive compression algorithms to strongly predict the space character preceding a ‘*’
character. The last three characters form an encoding of the dictionary offset of the corresponding
word. For words of more than four characters, the characters between the initial ‘*’ and the final
three-character-sequence in the word encoding are constructed using a suffix of the string ‘…
nopqrstuvw’. For instance, the first word of length 10 would be encoded as ‘*rstuvwxyzaA’. This
method provides a strong local context within each word encoding and its delimiters.
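Because the codeword layout is only partially described above, the following sketch illustrates the general idea only: each dictionary word is replaced by a ‘*’ marker followed by a dictionary offset spelled out in letters, so the transformed text remains highly repetitive. It is a loose approximation under assumptions of this paper's summary, not the exact LIPT encoding of Awan and Mukherjee:

```python
import string

LETTERS = string.ascii_lowercase + string.ascii_uppercase   # 52 letter "digits"

def index_to_letters(n: int) -> str:
    """Spell a dictionary offset with letters instead of decimal digits."""
    out = ""
    while True:
        out = LETTERS[n % 52] + out
        n //= 52
        if n == 0:
            return out

dictionary = sorted(["compression", "transform", "encoder", "decoder", "entropy"])
encode_map = {w: "*" + index_to_letters(i) for i, w in enumerate(dictionary)}
decode_map = {v: k for k, v in encode_map.items()}

def lipt_like_encode(text: str) -> str:
    # Words not in the dictionary pass through unchanged.
    return " ".join(encode_map.get(w, w) for w in text.split())

def lipt_like_decode(text: str) -> str:
    return " ".join(decode_map.get(w, w) for w in text.split())

sample = "entropy encoder meets decoder"
assert lipt_like_decode(lipt_like_encode(sample)) == sample
```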
3. PROCESS
To evaluate these methods, they were applied to the Calgary Corpus, a collection of text files
that was originally used by Bell, Witten and Cleary in 1989 to evaluate the practical performance
of various text compression schemes. The methods were also applied to three html files in order
to supply a more “modern” facet to the test corpus.
In the Calgary Corpus, nine different types of text are represented, and to confirm that the
performance of schemes is consistent for any given type, many of the types have more than one
representative. Normal English, both fiction and non-fiction, is represented by two books and six
papers (labelled book1, book2, paper1, paper2, paper3, paper4, paper5, paper6). More unusual
styles of English writing are found in a bibliography (bib) and a batch of unedited news articles
(news). Three computer programs represent artificial languages (progc, progl, progp), and a
transcript of a terminal session (trans) is included to indicate the increase in speed that could be
achieved by applying compression to a slow terminal line. All of the above files use ASCII
encoding. Some non-ASCII files are also included: two files of executable code (obj1, obj2),
some geophysical data (geo), and a bit-map black and white picture (pic). The file geo is
particularly difficult to compress because it contains a wide range of data values, while the file
pic is highly compressible because of large amounts of white space in the picture, represented by
long runs of zeros. [Witten and Bell, 1990]
The additional html files were chosen to be representative of “average” web traffic. One is the
front page of an American university (http://www.psu.edu), another is the front page of a popular
Internet auction site (http://www.ebay.com) and the third is the main page of a popular
multimedia web content company (http://www.real.com). Each contained different types of web
content and page structures.
4. RESULTS
Table 1 shows the file names, their original sizes and their sizes after being processed by the
compression algorithms without any transform.
Table 2 shows the file sizes after the application of the star encoding transform in conjunction
with the compression algorithms.
Table 3 shows the file sizes after the application of the LIPT transform in conjunction with the
compression algorithms.
The following charts display the compression ratios for each file, grouped roughly by content
type. It is interesting to note that the transforms generally, but not always, provide better
compression.
[Charts: compression ratios for bib, book1, book2 and news; for geo, obj1 and obj2; and for pic, comparing PK-ZIP 2.50, bzip (BWT), Gzip (LZ77), Arithmetic Coding, DMC, Huffman Coding and PPM (No Training), each with no transform, with *-encoding and with LIPT.]
Figure 7. Compression Ratios for paper1, paper2, paper3, paper4, paper5, paper6
[Chart: compression ratios for progc, progl, progp and trans for the same compression methods, each with no transform, with *-encoding and with LIPT.]
It is interesting to note that the transforms typically do not result in increased performance for
arithmetic or Huffman coding. In fact, LIPT actually decreases the compression ratio for
arithmetic coding by almost a third for the English language text files (bib, book1, book2, news,
paper1, paper2, paper3, paper4, paper5, paper6, progc, progl, progp, trans).
*-encoding caused a decrease in compression for bib, book2, news, progl, progp and trans
with PPM encoding, and for book1, book2 and progp with BWT encoding. Other than those, the
transforms typically offer some increase. *-encoding offered improvements of 11% to nearly
15% of original file size for the books and papers when coupled with Huffman coding, and LIPT
offered improvements of up to 8% in combination with Huffman coding.
Overall, PPM with LIPT produced the best compression ratios for the English language text
files and was nearly as good as any other method on the other files.
5. CONCLUSION
This paper has shown that it is possible to make textual data more compressible, even if only
to a small degree, by applying an intermediate reversible transform to the data prior to
compression. Although not specifically measured for this paper, the time impact of applying
these transforms to the data was not observed to be significant.
Transform encoding offered improvements of up to 15% for some standard compression
methods and, depending on the methods used and the type of text contained in the input file, can
offer compression ratios of over 13:1. In general, the transforms have a beneficial effect on the
compressibility of data compared with the standard compression algorithms alone.
It is recommended that further investigation be made into the applicability of this process to
html files in an effort to decrease download times for web information and to conserve Internet
bandwidth.
6. REFERENCES
Akman, K. Ibrahim. “A New Text Compression Technique Based on Language Structure.” Journal of Information
Science. 21, no. 2 (February 1995): 87-95.
Awan, F. S. and A. Mukherjee, LIPT: A Lossless Text Transform to Improve Compression. [paper on-line] School of
Electrical Engineering and Computer Science, University of Central Florida, available from
http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.
Bell, T. C., J. G. Cleary and I. H. Witten. Text Compression. Englewood Cliffs: Prentice-Hall, 1990.
Bell, Timothy, Ian H. Witten and John G. Cleary. “Modelling for Text Compression.” ACM Computing Surveys. 21,
no. 4 (December 1989): 557-591.
Burrows, M. and D. J. Wheeler. “A Block-sorting Lossless Data Compression Algorithm.” SRC Research Report 124,
Digital Systems Research Center, Palo Alto, (May 1994) available from http://citeseer.nj.nec.com/76182.html;
Internet; accessed 15 July 2001.
Cleary, J. G. and I.H. Witten. “Data Compression Using Adaptive Coding and Partial String Matching.” IEEE
Transactions on Communications. 32, no. 4 (April 1984): 396-402.
Crochemore, Maxime and Thierry Lecroq. “Pattern-Matching and Text Compression Algorithms.” ACM Computing
Surveys. 28, no. 1 (March 1996): 39-41.
Fenwick, P. Symbol Ranking Text Compression with Shannon Recodings. [paper on-line] Department of Computer
Science, The University of Auckland, 6 June 1996 available from ftp://ftp.cs.auckland.ac.nz/out/peter-
f/TechRep132; Internet; accessed 6 June 2001.
Franceschini, R., H. Kruse, N. Zhang, R. Iqbal and A. Mukherjee. Lossless, Reversible Transformations that Improve
Text Compression Ratios. [paper on-line] School of Electrical Engineering and Computer Science, University of
Central Florida, available from http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.
Goebel, G.V., Data Compression. available from http://vectorsite.tripod.com/ttdcmp0.html; Internet; accessed 14 May,
2001.
Huffman, D. A. “A Method for the Construction of Minimum-Redundancy Codes.” Proceedings of the I.R.E. 40, no. 9
(September 1952): 1098-1101.
Moffat, Alistair, Radford M. Neal and Ian H. Witten. “Arithmetic Coding Revisited.” ACM Transactions on
Information Systems. 16, no. 3 (July 1998): 256-294.
Motgi, N. and A. Mukherjee, Network Conscious Text Compression System (NCTCSys). [paper on-line] School of
Electrical Engineering and Computer Science, University of Central Florida, available from
http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.
Nelson, M. and J. L. Gailly. The Data Compression Book 2nd Edition. New York: M&T Books, 1996.
Nelson, Mark. “Data Compression with the Burrows-Wheeler Transform.” Dr. Dobb’s Journal. (September 1996)
available from http://www.dogma.net/markn/articles/bwt/bwt.htm; Internet; accessed 18 June 2001
Salomon, D. Data Compression: The Complete Reference 2nd Edition. New York: Springer-Verlag 2000.
Sayood, K. Introduction to Data Compression 2nd Edition. San Diego: Academic Press, 2000.
Shannon, C. E. “Prediction and Entropy of Printed English.” Bell System Technical Journal. 30 (January 1951): 50-64.
Stork, Christian H., Vivek Haldar and Michael Franz. Generic Adaptive Syntax-Directed Compression for Mobile
Code. [paper on-line] Department of Information and Computer Science, University of California, Irvine, available
from http://www.ics.uci.edu/~franz/pubs-pdf/ICS-TR-00-42.pdf; Internet; accessed 14 July 2001.
Wayner, P. Compression Algorithms for Real Programmers. San Diego: Academic Press, 2000.
Witten, I. and T. Bell, README. included with the Calgary Corpus (May 1990) available from
ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus; Internet; accessed 25 June 2001.
Ziv, J. and A. Lempel. “A Universal Algorithm for Sequential Data Compression.” IEEE Transactions on Information
Theory. IT-23, no. 3 (May 1977): 337-343.