
Improving the Efficiency of Lossless Text Data Compression Algorithms
A comparison of two reversible transforms

James R. Achuff
Penn State Great Valley
School of Graduate Professional Studies
30 East Swedesford Road, Malvern, PA 19355, USA

Abstract: Over the last decade the amount of textual information available in electronic form has exploded. It is
estimated that text data currently comprises nearly half of all Internet traffic, but as yet no lossless
compression standard for text has been proposed.

A number of lossless text compression algorithms exist; however, none of these methods is able to
consistently reach its theoretical best-case compression ratio.

This paper evaluates the performance characteristics of several popular compression algorithms and
explores two strategies for improving ratios without significantly impacting computation time.

Key words: Text Compression, Lossless Compression, Reversible Transform

1. INTRODUCTION

Compression means making things smaller by applying pressure. Data compression means
reducing the number of bits needed to represent a particular piece of data. Text compression
means reducing the number of bits or bytes needed to store textual information. It is necessary
that the compressed form can be decompressed to reconstitute the original text, and it is usually
important that the original is recreated exactly, not approximately. This differentiates text
compression from many other kinds of data reduction, such as voice or picture coding, where
some degradation of the signal may be tolerable if the compression achieved is worth the
reduction in quality. [Bell, Cleary & Witten, 1990]
The immutable yardstick by which data compression is measured is the “compression ratio”:
the ratio of the size of the original uncompressed file to that of the compressed file. For example, suppose a
data file takes up 100 kilobytes (KB). Using data compression software, that file could be reduced
in size to, say, 50 KB, making it easier to store on disk and faster to transmit over a network
connection. In this specific case, the data compression software reduces the size of the data file
by a factor of two, giving a “compression ratio” of 2:1.
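As a simple illustration of this arithmetic, the ratio can be computed directly from the two file sizes. The short Python sketch below is ours; the function name and example sizes are purely illustrative and are not part of any tool evaluated in this paper.

    def compression_ratio(original_size: int, compressed_size: int) -> float:
        """Return the ratio of original size to compressed size, e.g. 2.0 for 2:1."""
        return original_size / compressed_size

    # The example from the text: a 100 KB file reduced to 50 KB gives a 2:1 ratio.
    print(compression_ratio(100_000, 50_000))  # 2.0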


There are “lossless” and “lossy” forms of data compression. Lossless data compression is
used when the data has to be uncompressed exactly as it was before compression. Text files are
stored using lossless techniques, since losing a single character can in the worst case make the
text dangerously misleading. Lossless compression ratios are generally in the range of 2:1 to 8:1.
Compression algorithms reduce the redundancy in data to decrease the storage requirements
for that data. Data compression offers an attractive approach to reducing communications and
storage costs by using available bandwidth effectively. With the trend of increasing amounts of
digital data being transmitted over public and private networks expected to continue, it makes
sense to pursue research on developing algorithms that can most effectively use available network
bandwidth by maximally compressing data. This paper is focused on addressing this problem for
lossless compression of text files. It is well known that there are theoretical predictions on how
far a source file can be losslessly compressed [Shannon, 1951], but no existing compression
approaches consistently attain these bounds over wide classes of text files.
One approach to tackling the problem of developing methods to improve compression is to
develop better compression algorithms. However, given the sophistication of existing algorithms
such as arithmetic coding, Lempel-Ziv algorithms, Dynamic Markov Coding, Prediction by
Partial Match and their variants, it seems unlikely that major new progress will be made in this
area.
An alternate approach, which is taken in this paper, is to perform a lossless, reversible
transformation to a source file prior to applying an existing compression algorithm. This
transformation is designed to make it easier to compress the source file. Figure 1 illustrates this
strategy. The original text file is provided as input to the transformation, which outputs the
transformed text. This output is provided to an existing, unmodified data compression algorithm,
which compresses the transformed text. To decompress, one simply reverses the process by first
invoking the appropriate decompression algorithm and then providing the resulting text to the
inverse transform.

[Figure 1 shows the pipeline: the original text (“My dog has fleas.”) is transformed (“*y^ d** **s f****.”), the transformed text is fed to the data compression algorithm to produce compressed (binary) text, and decompression followed by the inverse transform recovers the original.]

Figure 1. Text compression process involving a lossless, reversible transform
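The pipeline of Figure 1 can be summarised as a composition of two steps in each direction: transform then compress, and decompress then invert the transform. The Python sketch below only illustrates that ordering; it uses the standard zlib module as a stand-in for the unmodified, off-the-shelf compressor and identity functions as placeholders for the transform.

    import zlib

    def forward_transform(text: str) -> str:
        # Placeholder for a lossless, reversible transform such as *-encoding or LIPT.
        return text

    def inverse_transform(text: str) -> str:
        # Must exactly undo forward_transform so the overall scheme stays lossless.
        return text

    def compress(text: str) -> bytes:
        # The off-the-shelf compressor knows nothing about the transform.
        return zlib.compress(forward_transform(text).encode("utf-8"))

    def decompress(data: bytes) -> str:
        # Reverse the process: decompress first, then apply the inverse transform.
        return inverse_transform(zlib.decompress(data).decode("utf-8"))

    original = "My dog has fleas."
    assert decompress(compress(original)) == original   # the round trip must be exact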

There are several important observations about this strategy. The transformation must be
exactly reversible, so that the overall lossless text compression requirement is not compromised.
The data compression and decompression algorithms are unmodified, so they do not exploit
information about the transformation while compressing. The intent is to use the strategy to
improve the overall compression ratio of the text in comparison with that achieved by the
compression algorithm alone. A similar strategy has been employed in the compression of
images and video transmissions using the Fourier transform, Discrete Cosine Transform or
wavelet transforms. In these cases, however, the transforms are usually lossy, meaning that some
data can be lost without compromising the interpretation of the image by a human.
One well-known example of the text compression strategy outlined in Figure 1 is the Burrows-
Wheeler Transform (BWT). Compressors built on the BWT combine it with ad hoc techniques (Run-Length
Encoding, Move-to-Front) and Huffman coding to provide some of the best compression ratios
available on a wide range of data.

1.1 Lossless Text Compression Algorithms

As stated above, text compression ought to be exact – the reconstructed message should be
identical to the original. Exact compression is also called noiseless (because it does not introduce
any noise into the signal), lossless (since no information is lost), or reversible (because
compression can be reversed to recover the original input exactly).
The task of finding a suitable model for text is an extremely important problem in
compression. Data compression is inextricably bound up with prediction. In the extreme case, if
one can predict infallibly what is going to come next, one can achieve perfect compression by
dispensing with transmission altogether. Even if one can only predict approximately what is
coming next, one can get by with transmitting just enough information to disambiguate the
prediction. Once predictions are available, they are processed by an encoder that turns them into
binary digits to be transmitted.
There are three ways that the encoder and decoder can maintain the same model: static,
semiadaptive, and adaptive modelling. In static modelling the encoder and decoder agree on a
fixed model, regardless of the text to be encoded. This is the method employed when sending a
message via Morse Code. In semiadaptive modelling, a “codebook” of the most frequently used
words or phrases is transmitted first and then used to encode and decode the message. Adaptive
modelling builds its “codebook” as it progresses according to a predefined method. In this way,
both the encoder and decoder use the same codebook without ever having to transmit the codes
with the data.

1.1.1 Huffman Coding

In 1952, D. A. Huffman introduced his method for the construction of minimum redundancy
codes – now more commonly known as “Huffman Coding”. In Huffman Coding, the characters
in a data file are converted to a binary code, where the most common characters in the file have
the shortest binary codes, and the least common have the longest. This is accomplished by
building a binary tree based upon the frequency with which characters occur in a file.
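A minimal Python sketch of the tree-building step is given below, assuming the character frequencies have already been counted; it illustrates the principle only and is not the Huffman implementation measured later in this paper.

    import heapq
    from collections import Counter

    def huffman_codes(text: str) -> dict[str, str]:
        """Build a binary code for each character: frequent characters get short codes."""
        heap = [[count, i, ch] for i, (ch, count) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        i = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            # Merge the two least frequent subtrees; the combined weight goes back on the heap.
            heapq.heappush(heap, [lo[0] + hi[0], i, (lo[2], hi[2])])
            i += 1
        codes = {}
        def walk(node, prefix=""):
            if isinstance(node, str):
                codes[node] = prefix or "0"   # degenerate case: only one distinct character
            else:
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
        walk(heap[0][2])
        return codes

    print(huffman_codes("abracadabra"))   # 'a' receives the shortest code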

1.1.2 Arithmetic Coding

In arithmetic coding a message is represented by an interval of real numbers between 0 and 1.


As the message becomes longer, the interval needed to represent it becomes smaller, and the
number of bits needed to specify that interval grows. Successive symbols of the message reduce
the size of the interval in accordance with the symbol probabilities generated by the model. The
more likely symbols reduce the range by less than the unlikely symbols and hence add fewer bits
to the message.
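The interval-narrowing idea can be shown with a toy static model; the symbol probabilities below are assumed purely for the example, and a practical coder would use scaled integer arithmetic and an adaptive model.

    # Toy illustration of interval narrowing with a fixed (static) model.
    # Assumed probabilities: 'a' = 0.7, 'b' = 0.2, 'c' = 0.1.
    model = {"a": (0.0, 0.7), "b": (0.7, 0.9), "c": (0.9, 1.0)}

    def encode_interval(message: str) -> tuple[float, float]:
        low, high = 0.0, 1.0
        for symbol in message:
            width = high - low
            sym_low, sym_high = model[symbol]
            # Likely symbols narrow the interval less, so they cost fewer bits.
            low, high = low + width * sym_low, low + width * sym_high
        return low, high

    print(encode_interval("aab"))   # a fairly wide interval: few bits needed
    print(encode_interval("ccc"))   # a very narrow interval: many bits needed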

1.1.3 LZ Coding

In 1977, Jacob Ziv and Abraham Lempel described an adaptive dictionary encoder in which
they “employ the concept of encoding future segments of the [input] via maximum-length
copying from a buffer containing the recent past output.” The essence is that phrases are
replaced with a pointer to where they have occurred earlier in the text.
Figure 2 illustrates how well this approach works for a variety of texts by indicating some of the
many instances where phrases could be replaced in this manner. A phrase might be a word, part
of a word, or several words. It can be replaced with a pointer as long as it has occurred once
before in the text, so coding adapts quickly to a new topic.

Figure 2. The principle of Ziv-Lempel coding – phrases are coded as pointers to earlier occurrences

Decoding a text that has been compressed in this manner is straightforward; the decoder
simply replaces a pointer by the already decoded text to which it points. In practice LZ coding
achieves good compression, and an important feature is that decoding can be very fast.

1.1.3.1 LZ77
LZ77 was the first form of LZ coding to be published. In this scheme pointers denote phrases
in a fixed-size window that precedes the coding position. There is a maximum length for
substrings that may be replaced by a pointer, usually 10 to 20. These restrictions allow LZ77 to
be implemented using a “sliding window” of N characters.
Ziv and Lempel showed that LZ77 could give at least as good compression as any
semiadaptive dictionary designed specifically for the string being encoded, if N is sufficiently
large. The main disadvantage of LZ77 is that although each encoding step requires a constant
amount of time, that constant can be large, and a straightforward implementation can require a
vast number of character comparisons per character coded. This property of slow encoding and
fast decoding is common to many LZ schemes.
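The sketch below shows a naive greedy parse in the LZ77 style, emitting (offset, length, next character) triples found in a sliding window; the window and match-length limits are illustrative, and production encoders use hashing or tree structures to avoid this brute-force search.

    def lz77_tokens(text: str, window: int = 4096, max_len: int = 16):
        """Greedy LZ77-style parsing: emit (offset, length, next_char) triples."""
        i, tokens = 0, []
        while i < len(text):
            start = max(0, i - window)            # the sliding window of recent text
            best_off, best_len = 0, 0
            for off in range(1, i - start + 1):   # candidate match distances
                length = 0
                while (length < max_len and i + length < len(text)
                       and text[i - off + length] == text[i + length]):
                    length += 1
                if length > best_len:
                    best_off, best_len = off, length
            next_char = text[i + best_len] if i + best_len < len(text) else ""
            tokens.append((best_off, best_len, next_char))
            i += best_len + 1
        return tokens

    print(lz77_tokens("abracadabra abracadabra"))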

1.1.4 Dynamic Markov Coding

Finite-state probabilistic models are based on finite-state machines. They have a set of states
and transition probabilities that give the likelihood that the model will move from one state to
another. Each state is also labelled uniquely. Figure 3 shows a simple model with two states, 0
and 1.
Finite state-based modelling is typically too slow and too computationally cumbersome to
support practical text compression. Dynamic Markov Coding (DMC) however, provides an
efficient way of building complex state models that fit a particular sequence and is generally
regarded as the only state-based technique that can be applied to text compression. [Bell, Witten
& Cleary, 1989]

[Figure 3 depicts two states, 0 and 1, with four transitions: from each state a 0 is emitted with p(0)=0.5 and a 1 with p(1)=0.5, leading to state 0 or state 1 respectively.]

Figure 3. An order-1 finite state model for 0 and 1

The basic idea of DMC is to maintain frequency counts for each transition in the current
finite-state model, and to “clone” a state when a related transition becomes sufficiently popular.
Cloning consumes resources by creating an extra state, and should not be performed unless it is
likely to be productive. High-frequency transitions have, by definition, been traversed often in
the past and are therefore likely to be traversed often in the future. Consequently, they are likely
candidates for cloning, since any correlations discovered will be utilised frequently.
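The following is a much-simplified sketch of the DMC bookkeeping: each state keeps counts for its two outgoing transitions, and the target of a transition is cloned once that transition becomes sufficiently popular. The threshold and the initial two-state model are assumptions made for illustration; real DMC also divides the cloned state's counts between the original and the clone and applies a more careful cloning criterion, both omitted here.

    class State:
        def __init__(self):
            self.next = [None, None]   # successor state after seeing bit 0 or bit 1
            self.count = [1, 1]        # transition frequency counts, initialised to 1

    def predict(state: State, bit: int) -> float:
        """Estimated probability of `bit` in the current state."""
        return state.count[bit] / (state.count[0] + state.count[1])

    def update(state: State, bit: int, threshold: int = 8) -> State:
        """Count the observed transition and clone its target once it becomes popular."""
        state.count[bit] += 1
        if state.count[bit] == threshold:
            clone = State()
            clone.next = list(state.next[bit].next)
            clone.count = list(state.next[bit].count)  # (real DMC splits these counts)
            state.next[bit] = clone                    # only this popular path uses the clone
        return state.next[bit]

    # The order-1 starting model of Figure 3: the state is simply the last bit seen.
    s0, s1 = State(), State()
    s0.next = [s0, s1]
    s1.next = [s0, s1]

    state = s0
    for bit in [0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]:
        print(round(predict(state, bit), 2), end=" ")  # probability handed to the coder
        state = update(state, bit)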

1.1.5 Prediction by Partial Match

Prediction by Partial Match (PPM) is a statistical, predictive text compression algorithm


originally proposed by Cleary and Witten in 1984 and refined by Moffat in 1988. PPM and its
derivatives have consistently outperformed dictionary-based methods as well as other statistical
methods for text compression. PPM maintains a list of already seen substrings,
conventionally called contexts. For example, after processing the string ababc, the contexts are
{∅, a, b, c, ab, ba, bc, aba, bab, abc, abab, babc, ababc}. For each context PPM maintains a
list of characters that appeared after the context. PPM also keeps track of how often the
subsequent characters appeared. So in the given example the counts of subsequent characters for,
say, ab are a and c both with a count of one. Normally, efficient implementations of PPM
maintain contexts dynamically in a context trie. A context trie is a tree with characters as nodes
and where any path from the root to a node represents the context formed by concatenating the
characters along this path. The root node does not contain any character and represents the empty
context (i.e., no prefix). In a context trie, children of a node constitute all characters that have
been seen after its context. In order to keep track of the number of times that a certain character
followed a given context, the number of its occurrences is noted along each edge. Based on this
information PPM can assign probabilities to potentially subsequent characters. [Cleary and
Witten, 1984]
The length of contexts is also called their order. Note that contexts of different order might
yield different counts leading to varying predictions.
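The bookkeeping can be illustrated with a small Python sketch that records, for every context up to a maximum order, how often each character followed it. A flat dictionary of contexts is used here instead of a true context trie, but it stores the same counts as the ababc example above.

    from collections import defaultdict

    def context_counts(text: str, max_order: int = 3):
        """Map each context (up to max_order characters) to counts of following characters."""
        counts = defaultdict(lambda: defaultdict(int))
        for i, ch in enumerate(text):
            for order in range(0, max_order + 1):
                if order <= i:
                    context = text[i - order:i]   # "" is the empty context
                    counts[context][ch] += 1
        return counts

    counts = context_counts("ababc")
    print(dict(counts["ab"]))   # {'a': 1, 'c': 1}: 'a' and 'c' each followed 'ab' once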

1.1.6 Burrows-Wheeler Transform

Burrows and Wheeler released a research report in 1994 entitled “A Block Sorting Lossless
Data Compression Algorithm” which presented a data compression algorithm based on Wheeler’s
earlier work.
The BWT is an algorithm that takes a block of data and rearranges it using a sorting scheme.
The resulting output block contains exactly the same data elements that it started with differing
only in their ordering. The transformation is reversible and lossless, meaning that the original
ordering of the data elements can be restored with no loss of fidelity.
The BWT is performed on an entire block of data at once, preferably the largest amount
possible. Since the BWT operates on data in memory, it must often break files up into smaller
pieces and process one piece at a time.
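A naive Python sketch of the forward transform on one block is shown below; it sorts all rotations of the block, which is far too slow for real use (practical implementations rely on suffix sorting), but it makes the rearrangement explicit.

    def bwt_forward(block: bytes) -> tuple[bytes, int]:
        """Sort all rotations of the block and return the last column plus the
        index of the original rotation, which is needed to invert the transform."""
        n = len(block)
        rotations = sorted(range(n), key=lambda i: block[i:] + block[:i])
        last_column = bytes(block[(i - 1) % n] for i in rotations)
        return last_column, rotations.index(0)

    print(bwt_forward(b"banana"))   # (b'nnbaaa', 3): the same bytes, reordered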

2. LOSSLESS, REVERSIBLE TRANSFORMS

Work done by Awan and Mukherjee and Franceschini et al. details several lossless, reversible
transforms that can be applied to text files in order to improve their compressibility by established
algorithms. Two have been selected for study in this paper: star encoding (or *-encoding) and
length index preserving transform (LIPT).

2.1.1 Star Encoding

The first transform proposed is an algorithm developed by Franceschini et al. Star encoding
(or *-encoding) is designed to exploit the natural redundancy of the language. It is possible to
replace certain characters in a word by a special placeholder character and retain a few key
characters so that the word is still retrievable.
For example, given a set of six-letter words {school, simple, strong, sturdy, supple} and
replacing “unnecessary” characters with a chosen symbol ‘*’, the set can be represented
unambiguously as {**h***, **m***, **r***, **u***, **p***}. In *-encoding, such an unambiguous
representation of a word (a partial sequence of its original letters with the special character ‘*’
interposed as a placeholder) is called the signature of the word.
*-encoding utilises an indexed and sorted dictionary containing the natural form and the
signature of each word. No word in a 60,000 word English dictionary required the use of more
than two unencoded characters in its signature using Franceschini’s scheme. The predominant
character in *-encoded text is ‘*’ which occupies more than fifty percent of the space. If a word
is not in the dictionary, it is passed to the transformed text unaltered.
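A toy Python illustration of the dictionary lookup follows, using the five-word example above. The signature table is simply given here, whereas Franceschini et al. derive signatures systematically, and the special handling of capitalisation and punctuation is ignored.

    # A shared dictionary mapping each word to its signature (from the example above).
    signatures = {
        "school": "**h***", "simple": "**m***", "strong": "**r***",
        "sturdy": "**u***", "supple": "**p***",
    }
    inverse = {sig: word for word, sig in signatures.items()}

    def star_encode(text: str) -> str:
        # Words not in the dictionary pass through to the transformed text unaltered.
        return " ".join(signatures.get(w, w) for w in text.split())

    def star_decode(text: str) -> str:
        return " ".join(inverse.get(w, w) for w in text.split())

    encoded = star_encode("school was simple but strong")
    assert star_decode(encoded) == "school was simple but strong"
    print(encoded)   # **h*** was **m*** but **r***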
The main drawback of *-encoding is that the compressor and decompressor need to share a
dictionary. The aforementioned 60,000 word English dictionary requires about one megabyte of
storage overhead that must be shared by all users of this transform. Also, special provisions
made to handle capitalisation, punctuation marks and special characters will most likely
contribute to a slight increase in the size of the input text in its transformed form.

2.1.2 LIPT

Another method investigated here is the Length Index Preserving Transform or LIPT. Fawzia
S. Awan and Amar Mukherjee developed LIPT as part of their project work at the University of
Central Florida. LIPT is a dictionary method that replaces words in a text file with a marker
character, a dictionary index and a word index.
LIPT is defined as follows: words of length more than four are encoded starting with ‘*’; this
allows predictive compression algorithms to strongly predict the space character preceding a ‘*’
character. The last three characters form an encoding of the dictionary offset of the corresponding
word. For words of more than four characters, the characters between the initial ‘*’ and the final
three-character-sequence in the word encoding are constructed using a suffix of the string ‘…
nopqrstuvw’. For instance, the first word of length 10 would be encoded as ‘*rstuvwxyzaA’. This
method provides a strong local context within each word encoding and its delimiters.
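As a very loose illustration of the shape of such a transform (and not Awan and Mukherjee's exact character mapping), the Python sketch below replaces each dictionary word with a ‘*’ marker, a length character and a letter-encoded dictionary offset; the function names and the toy dictionary are ours.

    import string

    def toy_lipt_encode(word: str, dictionary: dict) -> str:
        """Loose LIPT-style replacement: '*' marker, a length character, and a
        letter-encoded dictionary offset. The real LIPT mapping differs."""
        if word not in dictionary:
            return word                       # unknown words pass through unchanged
        letters = string.ascii_lowercase
        offset = dictionary[word]
        # Encode the offset in base 26 using letters so the output stays textual.
        encoded = ""
        while True:
            encoded = letters[offset % 26] + encoded
            offset //= 26
            if offset == 0:
                break
        return "*" + letters[len(word) % 26] + encoded

    dictionary = {"compression": 0, "algorithm": 1, "transform": 2}
    print([toy_lipt_encode(w, dictionary) for w in ["compression", "algorithm", "hello"]])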

3. PROCESS

To evaluate these methods, we applied them to the Calgary Corpus, a collection of text files
that was originally used by Bell, Witten and Cleary in 1989 to evaluate the practical performance
of various text compression schemes. The methods were also applied to three html files in order
to supply a more “modern” facet to the test corpus.

3.1 Test Corpus

In the Calgary Corpus, nine different types of text are represented, and to confirm that the
performance of schemes is consistent for any given type, many of the types have more than one
representative. Normal English, both fiction and non-fiction, is represented by two books and six
papers (labelled book1, book2, paper1, paper2, paper3, paper4, paper5, paper6). More unusual
styles of English writing are found in a bibliography (bib) and a batch of unedited news articles
(news). Three computer programs represent artificial languages (progc, progl, progp), and a
transcript of a terminal session (trans) is included to indicate the increase in speed that could be
achieved by applying compression to a slow line to a terminal. All of the above files use ASCII
encoding. Some non-ASCII files are also included: two files of executable code (obj1, obj2),
some geophysical data (geo), and a bit-map black and white picture (pic). The file geo is
particularly difficult to compress because it contains a wide range of data values, while the file
pic is highly compressible because of large amounts of white space in the picture, represented by
long runs of zeros. [Witten and Bell, 1990]

3.2 Additional Test Files

The additional html files were chosen to be representative of “average” web traffic. One is the
front page of an American university (http://www.psu.edu), another is the front page of a popular
Internet auction site (http://www.ebay.com) and the third is the main page of a popular
multimedia web content company (http://www.real.com). Each contained different types of web
content and page structures.

4. RESULTS

Table 1 shows the file names, their original sizes and their sizes after being processed by our
transforms and by compression algorithms.

Table 1. Compression Results for Untransformed Corpus


Calgary Corpus
Filename  Original Size  *-encoded  LIPT-encoded  PK-ZIP 2.50  bzip (BWT)  Gzip (LZ77)  Arithmetic Coding  DMC  Huffman Coding  PPM (No Training)
Bib 111261 116385 101522 34961 27467 34900 40170 30535 72762 25898
Book1 768771 779421 681210 311295 232598 312281 246687 238026 438375 221304
Book2 610856 621779 512862 205538 157443 206158 195060 167229 368301 149917
Geo 102400 76716 78309 68536 56921 102400 72481 61458 72905 60580
News 377109 386662 350114 144102 118600 144400 150866 130717 246395 110998
obj1 21504 16360 16232 10290 10787 10323 16149 11076 16377 10022
obj2 246814 216798 217291 80948 76441 81631 193703 85291 194505 73374
Paper1 53161 54917 45731 18496 16558 18543 20433 18141 33338 15480
Paper2 82199 83752 69393 29516 25041 29667 27567 26581 47616 23787
Paper3 46526 47328 37655 18027 15837 18074 17511 17089 27276 15015
Paper4 13286 13498 10979 5499 5188 5534 5450 5460 7861 4806
Paper5 11954 12242 10418 4959 4837 4995 5335 5088 7432 4458
Paper6 38105 39372 34500 13271 12292 13213 15068 13412 24024 11488
Pic 513216 66937 66948 52531 49759 56442 78010 52394 106757 51016
Progc 39611 41057 38605 13317 12544 13261 15405 13637 25915 11700
Progl 71646 71863 67133 16098 15579 16164 21319 17796 42983 15023
Progp 49379 49995 48434 11171 10710 11186 14972 12318 30215 10466
Trans 93695 92525 84753 18996 17899 18862 35100 22453 65219 17182
Additional html files
Html1 29830 30819 29592 5689 5368 5788 19305 5955 19357 5021
Html2 46893 47721 46205 9846 9607 9961 31753 10887 31882 9084
Html3 36323 36921 35901 6317 6304 6460 24026 7323 24118 6064

Table 2 shows the file sizes after the application of the star encoding transform in conjunction
with the compression algorithms.

Table 2. Compression Results for *-encoded Corpus


Calgary Corpus
Filename  Original Size  *-encoded  PK-ZIP 2.50  bzip (BWT)  Gzip (LZ77)  Arithmetic Coding  DMC  Huffman Coding  PPM (No Training)
bib.sta 111261 116385 34084 26825 34051 39821 29160 66487 26811
book1.sta 768771 779421 282605 235559 282778 234839 226290 332042 220750
book2.sta 610856 621779 191004 158070 191494 190827 163358 287140 156474
geo.sta 102400 76716 62250 57753 62294 63932 65503 64327 59957
news.sta 377109 386662 137289 118678 138012 149796 127577 224312 112494
obj1.sta 21504 16360 9570 10190 9591 14005 10380 14203 9399
obj2.sta 246814 216798 77229 73867 77961 179034 81493 179461 70552
paper1.sta 53161 54917 17175 15901 17165 18649 16783 27146 15415
paper2.sta 82199 83752 26758 24211 26751 24541 24129 35564 23176
paper3.sta 46526 47328 16073 14732 16068 14583 14913 20852 14085
paper4.sta 13286 13498 4815 4701 4854 4522 4671 5896 4413
paper5.sta 11954 12242 4507 4500 4531 4775 4527 6079 4222
paper6.sta 38105 39372 12361 11782 12353 14182 12246 19864 11332
pic.sta 513216 66937 38741 38213 38409 43190 38954 43423 38052
progc.sta 39611 41057 12893 12437 12850 15730 13296 24615 11670
progl.sta 71646 71863 15546 15253 15599 22141 17228 39509 15309
progp.sta 49379 49995 11096 10829 11112 16132 12331 29620 10632
trans.sta 93695 92525 18336 17384 18281 34526 21739 59835 18067
Additional html files
html1.sta 29830 30819 5429 5176 5539 19213 5697 19244 4859
html2.sta 46893 47721 9763 9562 9886 31702 10793 31755 9091
html3.sta 36323 36921 6084 6116 6191 23613 7055 23691 5893

Table 3 shows the file sizes after the application of the LIPT transform in conjunction with the
compression algorithms.

Table 3. Compression Results for LIPT-encoded Corpus


Calgary Corpus
Filename  Original Size  LIPT-encoded  PK-ZIP 2.50  bzip (BWT)  Gzip (LZ77)  Arithmetic Coding  DMC  Huffman Coding  PPM (No Training)
bib.lpt 111261 101522 33424 26,901 33948 67,592 29,439 67,746 25,437
book1.lpt 768771 681210 285332 222,398 291973 390,421 225,414 393,507 214,509
book2.lpt 610856 512862 189839 151,861 192939 321,284 162,481 323,539 145,636
geo.lpt 102400 78309 62600 57,788 62566 65,311 64,233 65,711 59,988
news.lpt 377109 350114 137688 115,586 139511 234,409 128,416 235,229 108,743
obj1.lpt 21504 16232 9622 10,183 9644 14,118 10,403 14,328 9,442
obj2.lpt 246814 217291 77410 73,820 78135 180,224 82,347 180,896 70,588
paper1.lpt 53161 45731 17104 15,451 17228 29,670 16,771 29,734 14,658
paper2.lpt 82199 69393 26903 23,180 27402 41,437 24,199 41,701 22,266
paper3.lpt 46526 37655 16058 14,259 16211 23,400 14,944 23,496 13,786
paper4.lpt 13286 10979 4880 4,553 4917 6,832 4,737 6,833 4,275
paper5.lpt 11954 10418 4533 4,401 4566 6,769 4,567 6,774 4,097
paper6.lpt 38105 34500 12513 11,450 12585 22,073 12,373 22,134 10,885
pic.lpt 513216 66948 38752 38,208 38422 43,209 38,956 43,443 38,054
progc.lpt 39611 38605 13002 12,082 12982 25,605 13,409 25,676 11,397
progl.lpt 71646 67133 15610 14,868 15760 41,524 17,361 41,760 14,409
progp.lpt 49379 48434 11180 10,607 11282 30,709 12,270 30,767 10,364
trans.lpt 93695 84753 18326 17,173 18371 60,048 21,837 60,155 16,474
Additional html files
html1.lpt 29830 29592 5413 5146 5496 19,497 5,643 19,549 4,777
html2.lpt 46893 46205 9786 9552 9893 31,890 10,800 32,041 8,913
html3.lpt 36323 35901 6104 6108 6234 11,412 7075 24,163 5,787

The following charts display the compression ratios for each file, grouped roughly by content
type. It is interesting to note that the transforms generally, but not always, provide better
compression.


Figure 4. Compression Ratios for bib, book1, book2, and news




Figure 5. Compression Ratios for geo, obj1, obj2


Figure 6. Compression Ratios for pic


Figure 7. Compression Ratios for paper1, paper2, paper3, paper4, paper5, paper6


Figure 8. Compression Ratios for progc, progl, progp, trans


Figure 9. Compression Ratios for html1, html2, html3

It is interesting to note that the transforms typically do not result in increased performance for
arithmetic or Huffman coding. In fact, LIPT actually decreases the compression ratio for
arithmetic coding by almost a third for the English language text files (bib, book1, book2, news,
paper1, paper2, paper3, paper4, paper5, paper6, progc, progl, progp, trans).
*-encoding caused a decrease in compression for bib, book2, news, progl, progp and trans
with PPM encoding, and for book1, book2 and progp with BWT encoding. Other than those, the
transforms typically offer some increase. *-encoding offered improvements of 11% to nearly
15% of original file size for the books and papers when coupled with Huffman coding, and LIPT
offered improvements of up to 8% in combination with Huffman coding.
Overall, PPM with LIPT produced the best compression ratios for the English language text
files and was nearly as good as any other method on the other files.

5. CONCLUSION

This paper has shown that it is possible to make textual data more compressible, even if only
to a small degree, by applying an intermediate reversible transform to the data prior to
compression. Although not specifically measured for this paper, the time impact of applying
these transforms to the data was not observed to be significant.
Transform encoding offered improvements of up to 15% for some standard compression
methods and, depending on the methods used and the type of text contained in the input file, can
offer compression ratios of over 13:1. In general it has a beneficial effect on the compressibility
of data compared with standard compression algorithms alone.
It is recommended that further investigation be made into the applicability of this process to
html files in an effort to decrease download times for web information and to conserve Internet
bandwidth.

6. REFERENCES
Akman, K. Ibrahim. “A New Text Compression Technique Based on Language Structure.” Journal of Information
Science. 21, no. 2 (February 1995): 87-95.

Awan, F. S. and A. Mukherjee, LIPT: A Lossless Text Transform to Improve Compression. [paper on-line] School of
Electrical Engineering and Computer Science, University of Central Florida, available from
http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.

Bell, T. C., J. G. Cleary and I. H. Witten. Text Compression. Englewood Cliffs: Prentice-Hall, 1990.

Bell, Timothy, Ian H. Witten and John G. Cleary. “Modelling for Text Compression.” ACM Computing Surveys. 21,
no. 4 (December 1989): 557-591.

Burrows, M. and D. J. Wheeler. “A Block-sorting Lossless Data Compression Algorithm.” SRC Research Report 124,
Digital Systems Research Center, Palo Alto, (May 1994) available from http://citeseer.nj.nec.com/76182.html;
Internet; accessed 15 July 2001.

Cleary, J. G. and I.H. Witten. “Data Compression Using Adaptive Coding and Partial String Matching.” IEEE
Transactions on Communications. 32, no 4 (April 1984): 396-402.

Crochemore, Maxime and Thierry Lecroq. “Pattern-Matching and Text Compression Algorithms.” ACM Computing
Surveys. 28, no. 1 (March 1996): 39-41.

Fenwick, P. Symbol Ranking Text Compression with Shannon Recodings. [paper on-line] Department of Computer
Science, The University of Auckland, 6 June 1996 available from ftp://ftp.cs.auckland.ac.nz/out/peter-
f/TechRep132; Internet; accessed 6 June 2001.

Franceschini, R., H. Kruse, N. Zhang, R. Iqbal and A. Mukherjee. Lossless, Reversible Transformations that Improve
Text Compression Ratios. [paper on-line] School of Electrical Engineering and Computer Science, University of
Central Florida, available from http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.

Goebel, G.V., Data Compression. available from http://vectorsite.tripod.com/ttdcmp0.html; Internet; accessed 14 May,
2001.

Huffman, D. A. “A Method for the Construction of Minimum-Redundancy Codes.” Proceedings of the Institute of
Radio Engineers. 40, no. 9 (September 1952): 1098-1101.

Moffat, Alistair, Radford M. Neal and Ian H. Witten. “Arithmetic Coding Revisited.” ACM Transactions on
Information Systems. 16, no. 3 (July 1998): 256-294.

Motgi, N. and A. Mukherjee, Network Conscious Text Compression System (NCTCSys). [paper on-line] School of
Electrical Engineering and Computer Science, University of Central Florida, available from
http://vlsi.cs.ucf.edu/listpub.html; Internet; accessed 9 July 2001.

Nelson, M. and J. L. Gailly. The Data Compression Book 2nd Edition. New York: M&T Books, 1996.

Nelson, Mark. “Data Compression with the Burrows-Wheeler Transform.” Dr. Dobb’s Journal. (September 1996)
available from http://www.dogma.net/markn/articles/bwt/bwt.htm; Internet; accessed 18 June 2001

Salomon, D. Data Compression: The Complete Reference 2nd Edition. New York: Springer-Verlag 2000.

Sayood, K. Introduction to Data Compression 2nd Edition. San Diego: Academic Press, 2000.

Shannon, C. E. “Prediction and Entropy of Printed English.” Bell System Technical Journal. 30 (January 1951): 50-64.

Stork, Christian H., Vivek Haldar and Michael Franz. Generic Adaptive Syntax-Directed Compression for Mobile
Code. [paper on-line] Department of Information and Computer Science, University of California, Irvine, available
from http://www.ics.uci.edu/~franz/pubs-pdf/ICS-TR-00-42.pdf; Internet; accessed 14 July 2001.

Wayner, P. Compression Algorithms for Real Programmers. San Diego: Academic Press, 2000.

Witten, I. and T. Bell, README. included with the Calgary Corpus (May 1990) available from
ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus; Internet; accessed 25 June 2001.

Ziv, J. and A. Lempel. “A Universal Algorithm for Sequential Data Compression.” IEEE Transactions on Information
Theory. IT-23, no. 3 (May 1977): 337-343.
