Class Coding and Two Applications

By
Avishay Orpaz

TEL-AVIV UNIVERSITY
The Iby and Aladar Fleischman Faculty of Engineering

July 2004

To my wife Idit
Abstract
This work presents a coding system called class coding and two of its appli-
cations. The class coding system is a compression scheme that has unique
features. The system is analyzed and a novel algorithm is presented to op-
timally adjust its parameters. The first application of this system is code
compression. The work explores the parameters of the system that are rele-
vant to such usage. The second application is compressed matching – a class
of algorithms that are designed to search a text within another, compressed
text, without having to completely decompress it. The features of class coding
are used to provide a fast, flexible and efficient search algorithm.
Chapter 1

Introduction and Related Work
This work deals with two subjects: code compression in embedded systems
and compressed pattern matching. The common denominator to these, usu-
ally unrelated, problems is a compression method called class compression.
The objectives of the work are as follows:
• Explore the use of class compression for the application of code com-
pression in embedded systems. Parameters that influence the system
performance, and the associated tradeoffs, will be introduced and studied.
1.2 Introduction
1.2.1.1 Motivation
This is usually not a concern in PCs and other non-embedded systems, where
memory is in abundance. In embedded systems the picture is different. These
systems are typically subject to space, power and cost constraints, in which
instruction memory, like any other resource, is limited.
Thumb [56] and MIPS-16 [28] offer extended instruction sets that include
short instructions for embedded applications. The use of short instructions
adds minimal run-time overhead. This is not a compression method per se, but
code size is reduced. The major drawback of this technique is that code has
to be recompiled (if it is hand written in assembly, it should be rewritten).
Moreover, all the development tools must be modified to utilize the new
instructions. This method is also very processor-specific; it cannot be ported
to other processors.
The third approach, illustrated in figure 1.1, takes the binary image of
regular code, generated by the regular development tools and compresses
it. This compressed code cannot, of course, be executed on its own – it has
to be decompressed by some hardware unit in the target system. Several
implementations for this approach have been suggested. Chen [13, 12] de-
scribes a dictionary based system, in which common code segments are re-
placed by a single reference. At runtime, the decompression core retrieves
these code segments back for execution. Another paper, based on the same
method, is by Lefurgy [39]. Other researchers tried other compression algo-
rithms. Kozuch[32, 33] studied the statistical properties of RISC code using
0-order and 1-order models. Ernst[16] uses a method called operand factor-
ization along with a Markov probability model. Lekatsas[40] and Xie[64] use
arithmetic coding (reduced to accommodate the fast decoding requirement);
Liao[41] uses a dictionary based method called EPM; Wolfe and Chanin[62]
use Huffman based methods. Finally IBM’s CodePack system[20] is of special
interest to our work. It will be described in detail in a later chapter.
"% %
!
# $
" "
%""
"% %
" & !
# $
' " "
%""
1.2.2.1 Motivation
This problem is becoming more and more important due to the vast
amount of information modern servers and personal computers have to deal
with.
The simplest form of string matching is the brute force approach. This
algorithm simply scans y and compares every character with a character
of x. This algorithm has two drawbacks: first, it requires the ability to go
back in y; and second, it has a worst case time complexity of O(m · n).
Two algorithms that overcome these problems are the Knuth-Morris-
Pratt [31] and Boyer-Moore [7] algorithms. Both have a worst case time complexity
of O(m + n), but the latter achieves sublinear time in the average case,
which makes it extremely important and useful.
It is also important to note here a fundamental result by Rivest[47], who
showed that in the worst case, no algorithm can perform string matching
with less than n − m + 1 comparisons, thus making sublinear-in-the-worst-
case algorithms impossible.
This problem has been extensively studied, and the two most common
coding families today are the entropy coding and the dictionary coding. In
entropy coding, the goal is to find the most suitable code for every symbol,
such that the total length of the encoded text is minimized. Some members of
this family are Huffman [23] and arithmetic coding. Dictionary methods,
in contrast, try to replace frequently occurring substrings with references to
their previous occurrences. Some members are the Lempel-Ziv algorithms and their variants.
In recent years, some newer methods have emerged, such as anti-
dictionaries[14], byte pair encoding[14] and block sorting[8].
The methods described in the previous section do achieve their goals, but
the encoded text, in the general case, cannot be searched for a pattern us-
ing the algorithms described in 1.2.2.2. In order to work, the compression
algorithms change the structure, the symbol encoding and the order of the text,
thus making the direct application of the plain text matching algorithms
impossible.
will have to decompress the same text over and over again. In some other
cases, we are not interested in the entire file, but only in the area where the
desired string has been found. Again, we will have to decompress the entire
text.
The second approach has the potential of overcoming all these drawbacks
if a good algorithm is found.
Amir and Benson[1] were the first to define this problem and give perfor-
mance goals (in the following definition u is the length of the uncompressed
text, n is the length of the compressed text and m is the length of the pattern
being searched):
• An algorithm that finds all the occurrences of the pattern in the text
in O(u) time is efficient
• An algorithm that finds all the occurrences of the pattern in the text
in O(n log m + m) time is almost optimal
• An algorithm that finds all the occurrences of the pattern in the text
in O(n + m) time is optimal
In a later paper, Amir, Benson and Farach [2] observe that there is some-
times a tradeoff between the execution speed and the extra space consumed.
The amount of extra space used is, therefore, another important performance
measure.
The work on this subject can be coarsely divided by the type of algorithm
used to compress the text. Table 1.1 presents this taxonomy.
text with very good results. Some algorithms in this group try to perform
search on files compressed by standard tools, particularly the Unix compress
tool which uses LZW [60]. Two examples of such an approach are Amir, Benson
and Farach [2] and Kida et al. [26], who try to make use of the efficient Shift-
And algorithm for searching in (uncompressed) text. Other authors change
the base encoding scheme to facilitate the search. Klein[29], for example,
replaces the back pointers in the LZSS scheme by forward pointers.
The fourth and final group contains other kinds of methods. Manber [42]
designed a system in which the 128 unused codes in regular texts (assuming
ASCII encoding) are utilized to encode common character pairs. The author
describes a method for selecting these pairs in such a way that no ambiguity will
be present. Shibata et al. [51] use a hierarchical method in which every stage
consists of removing common symbol pairs from the text and replacing them
by a new symbol. Another scheme by Shibata [53] searches for a pattern in text
compressed using antidictionaries (which are lists of forbidden phrases, see
Crochemore [14]). Two recent results are due to Tarhio [46], who generalizes
the method proposed by De Moura [44], and to Shibata et al. [52], who use the
Boyer-Moore algorithm to search BPE-compressed text.
The algorithms described here were previously published in [4] and [5].
The second chapter deals with the compression method that we propose
– the class compression. After explaining the scheme, we develop an algo-
rithm for optimizing the compression and discuss the influence of various
parameters on the compression rate.
Chapter 2

Class Compression
The class compression system, used by CodePack[20, 25] and previously used
by Said and Pearlman[49] (with some variations) is a 0-order entropy coder.
Each symbol in the original text is replaced by some codeword. The length
(in bits) of the input symbols is fixed, but the length of the output codewords
is variable. By tuning the codeword lengths according to the symbols' probabilities of
occurrence, compression can be attained. The best-known coder of this kind
is the Huffman [23] code, which describes a way to optimally select the length
and the bits of the codeword for every symbol. Huffman code possesses
the prefix property, which means that no codeword is a prefix of another
codeword. That way, the code can be decoded without ambiguity.
Σ = ⋃_{j=1}^{N} C_j        (Cover)        (2.1)
The codewords to the prefix and to the index are allocated differently. The
prefix uses Huffman codes. The index uses a fixed length binary code – every
class has an associated index length (which may be different among classes).
The length of that code must be selected according to the cardinality of the
class.
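As an illustration, the following sketch encodes a single symbol under a small, entirely hypothetical class structure (the classes, prefixes and symbol order below are invented for the example and are not CodePack's):

# Hypothetical class structure: each coded class has a Huffman prefix (a bit
# string) and an ordered list of member symbols; the last class is the literal
# class, whose "index" is the raw B-bit symbol itself.
classes = [
    {"prefix": "0",  "symbols": ["e", "t", "a", "o"]},                      # 2-bit index
    {"prefix": "10", "symbols": ["i", "n", "s", "h", "r", "d", "l", "u"]},   # 3-bit index
]
LITERAL_PREFIX = "11"
B = 8  # word length of an uncompressed symbol, in bits

def encode_symbol(sym):
    for c in classes:
        if sym in c["symbols"]:
            idx = c["symbols"].index(sym)
            idx_bits = (len(c["symbols"]) - 1).bit_length()   # log2 of the class size
            return c["prefix"] + (format(idx, "b").zfill(idx_bits) if idx_bits else "")
    return LITERAL_PREFIX + format(ord(sym), "b").zfill(B)    # literal class

print(encode_symbol("t"))   # prefix "0" followed by the index "01"
print(encode_symbol("x"))   # literal: prefix "11" followed by 8 raw bits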
2.2 Analysis
Let
Σ = {σ_1, σ_2, ..., σ_K}        (2.3)
be the alphabet over which the text is written. Let
C = {C_1, C_2, ..., C_N, C_{N+1}}        (2.4)
be a set of classes over Σ, together with a mapping
G : Σ → C        (2.5)
that assigns each symbol to its class.
A class configuration shall be called valid if it complies with (2.1), (2.2) and
one more rule:
σ_i ∈ C_j ⇒ σ_{i+1} ∈ C_{j′},  j′ = j or j + 1        (2.6)
The compression ratio that will be achieved under such a parameter sys-
tem can be obtained by counting the bits in the individual components of
the code:
R = [ Σ_{i=1}^{N} P_i·(π_i + log₂|C_i|) + (B + π_{N+1})·P_{N+1} + B·Σ_{i=1}^{N} |C_i| ] / B        (2.7)

The three terms in the numerator are referred to below as parts I, II and III, respectively.
where B is the word length of each symbol in the uncompressed text, π_i is the
length (in bits) of the prefix codeword assigned to class C_i, and P_i is the
probability that a symbol belongs to C_i:
P_i = Σ_{j: σ_j ∈ C_i} p_j        (2.8)
The second case is when N = |Σ|. In this case every symbol in Σ has
its own class with a 0-bit index. The prefix is Huffman coded, so every
symbol is effectively Huffman coded. In this case |C(T)| = |H̃(T)|
(where H̃(T) is the Huffman coding of T).
To prove that, let us assume that we have a class structure with
N classes over some alphabet. We build a new class structure with
N + 1 classes using the following procedure. First, we pick some class
arbitrarily having two symbols or more in it. Then, we split it into
two classes, each containing exactly half the symbols of the original
class (always possible because class cardinality is an integral power of
2). The number of bits required to represent each symbol in each new
class is one bit less than was required by the original class. The pre-
fixes of the two new classes would be the prefix of the original class
appended by a single bit, selecting one of the new classes. The new
structure obtained is clearly valid, has N + 1 classes as required, and
uses exactly the same number of bits. This structure is not necessarily
optimal, meaning that there might be some other structure that uses
fewer bits (but not more). Therefore, the compression ratio, as a function
of the number of classes, is monotonic.
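In symbols, the splitting step keeps the per-symbol cost unchanged: a class of 2^z symbols with prefix length π costs π + z bits per symbol, and each half, with prefix length π + 1 and a (z − 1)-bit index, costs

(π + 1) + (z − 1) = π + z

bits per symbol as well.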
To sum up, the more classes we allow, the better compression ratio we
expect to achieve.
• Class structure – In its documentation ([24, 25, 20]) IBM does not reveal
how it has determined its class structure. This is the main issue dealt
with here, and an algorithm for optimal class structure selection will be
presented in the next section.
• Model – The model used to decide what the symbols of the text are
and what the probability of every symbol is. Two choices are possible
with class compression. The first is a static model, built using a-priori
statistics gathered from a characteristic set of texts.
2.2.2 Redundancy
We start our analysis following [49]. Our model is an i.i.d. source of symbols
σ_i, each having a probability p_i. The entropy of the source is:
H = −Σ_{i=1}^{K} p_i log₂ p_i        (2.9)
Grouping the symbols into classes, the entropy can be decomposed as:
H = −Σ_{i=1}^{N} P_i log₂ P_i − Σ_{i=1}^{N} P_i · Σ_{j∈C_i} (p_j / P_i) log₂ (p_j / P_i)
  = −Σ_{i=1}^{N} P_i log₂ P_i − Σ_{i=1}^{N} Σ_{j∈C_i} p_j log₂ (p_j / P_i)        (2.10)
In our coding scheme, the indices are not entropy coded, but have a fixed
length. Thus the second additive term in (2.10) changes:
H′ = −Σ_{i=1}^{N} P_i log₂ P_i − Σ_{i=1}^{N} Σ_{j∈C_i} p_j log₂ (1 / |C_i|)        (2.11)
and the redundancy introduced by the fixed-length indices is
ΔH = H′ − H = Σ_{i=1}^{N} Σ_{j∈C_i} p_j log₂ (p_j · |C_i| / P_i)        (2.12)
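The relations (2.9)–(2.12) can be checked numerically; the sketch below uses made-up probabilities and a made-up partition into classes (sizes chosen as powers of two), and verifies that H′ − H equals the expression in (2.12):

from math import log2

p = [0.4, 0.2, 0.15, 0.1, 0.08, 0.05, 0.02]        # hypothetical symbol probabilities
classes = [[0], [1, 2], [3, 4, 5, 6]]               # hypothetical partition (indices into p)

H = -sum(pj * log2(pj) for pj in p)                 # (2.9): entropy of the source
P = [sum(p[j] for j in C) for C in classes]         # (2.8): class probabilities

# (2.11): entropy-coded prefix plus a fixed log2|Ci|-bit index per symbol
H_prime = -sum(Pi * log2(Pi) for Pi in P) + \
          sum(p[j] * log2(len(C)) for C in classes for j in C)

# (2.12): redundancy of the fixed-length index
dH = sum(p[j] * log2(p[j] * len(C) / Pi) for C, Pi in zip(classes, P) for j in C)

print(H, H_prime)
print(H_prime - H, dH)                              # these two values agree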
2.3 The Optimal Class Structure Problem

As has already been shown in the last section, the class configuration has a
major impact on the compression ratio. In this section we investigate this
problem and propose an efficient algorithm to find such a mapping.
Σ_{i=1}^{N} |C_i| ≤ |Σ|        (2.14)

Writing |C_i| = 2^{z_i}, this becomes

Σ_{i=1}^{N} 2^{z_i} ≤ K        (2.15)
The problem now is how many vectors Z = (z_1, z_2, ..., z_N) satisfy (2.15).
To bound this, let us assume z_1 = z_2 = ... = z_N = z. Now (2.15) becomes
N · 2^z ≤ K        (2.16)
and therefore
z ≤ log₂(K/N) ≈ log₂ K        (2.17)
and every Z where z_n ≤ z would now satisfy (2.15). The last approximation
is due to the fact that in real world applications K ≫ N. It is evident
that this is a lower bound, since there might be some vectors Z in which some
component is greater than z that still satisfy (2.15). A brute force search
would therefore have to examine on the order of (log₂ K)^N candidate
vectors.
In practice (as we shall see later), N is rather small, so that brute force
search is feasible. Nevertheless, it is desirable to find an algorithm that can
solve the problem in much shorter time (both from a complexity point of view
and in practice).
The optimization problem is cast as a shortest path problem over a directed graph whose node set is
V = {v_k | 1 ≤ k ≤ K + 1}        (2.19)
The nodes are connected by arcs of two types – final and non-final.
A = A_NF ∪ A_F        (2.20)
Each arc in the graph corresponds to a feasible class – a class that may
be selected to be included in C. A non-final arc extending from vi to vj
represents a class in which the symbols σi , σi+1 , ..., σj−1 are included. A final
arc extending from vi to vK+1 (note that a final arc always ends at the final
node) represents a literal class containing the symbols between σi to the end
of the alphabet. Following these definitions, we look at a path from v_1 to
v_{K+1} that passes through the intermediate nodes v_{z_1}, v_{z_2}, ..., v_{z_N}        (2.24)
such that each arc of the path corresponds to one class, i.e.
σ_1, ..., σ_{z_1−1} ∈ C_1
...
σ_{z_N}, ..., σ_K ∈ C_{N+1}        (2.25)
The uniqueness and cover properties are a direct corollary of (2.24). For
any σ_i there exists one and only one pair z_k and z_{k+1} such that z_k ≤ i ≤ z_{k+1}, and
from (2.25) we conclude that σ_i ∈ C_k.
To prove that (2.25) satisfies (2.6), let us assume σ_i, σ_j ∈ C_{ℓ+1} (i < j). This
means that z_ℓ ≤ i < j ≤ z_{ℓ+1} − 1, and from (2.24) we see that for any σ_k,
i < k < j, also σ_k ∈ C_{ℓ+1}.
The weight is taken directly from (2.7): the weight of non-final arcs is
taken from parts I and III, while the weight of final arcs is taken from part II.
For a non-final arc from v_i to v_j (a class containing the symbols σ_i, ..., σ_{j−1})
and for a final arc from v_i to v_{K+1} (the literal class), the weights are

w_{i,j} = (Σ_{k=i}^{j−1} p_k) · log₂(j − i) + B · (j − i)
w_{i,K+1} = B · Σ_{k=i}^{K} p_k        (2.26)

The weight of an arc is the average number of bits per symbol² that such a class
would have required in the compressed text. Note that this weight includes
the number of bits required in the codebook. The arcs that extend from any
node v_i to the final node correspond to a literal class that begins at symbol
σ_i.
Comparing (2.26) and (2.7), one can notice that the terms πk are absent
from the former. These terms are related to the prefix bits. A key property
of (2.26) is that the weight of a certain class is dependent only on its size and
content (the probabilities of the symbols in it) and it is absolutely indepen-
dent of any other class. Including the prefixes changes this situation. The
prefixes are determined by applying Huffman’s algorithm[23] to the weights
of all the classes (2.8). Therefore, a knowledge of all the classes is required to
find even a single class prefix length. We claim, however, that the influence
of ignoring the prefix length is minor. This claim shall be left as a heuristic;
since it cannot be proved exactly, only a general reason will be given.
² When using the symbol probability p_i as a measure, (2.26) gives the average number of bits per
symbol. In practice, it is much more efficient to make these calculations in terms of actual
bits, thereby eliminating costly floating-point calculations. This can be achieved easily by
multiplying all the probabilities by the text length, making them count the actual number
of occurrences of a symbol in the text. This step has no influence when seeking a minimum
or maximum.
If, instead of Huffman codes, we had used another code for the prefix,
one that achieves the entropy bound, it would be possible to completely
separate the problems of the index and the prefix (as in (2.10)). Huffman codes,
however, have some redundancy over the entropy, so this statement does not hold.
It is also very difficult to bound this redundancy based on a single symbol
probability [61].
As already mentioned, the prefix was not included in the optimization prob-
lem. In this section we will explain why it can be approximately ignored.
First, we will explain why it is impossible to include the prefix in the opti-
mization method described in the last section. In order to know the
length of the prefix that is required for a certain class, it is not enough to
know which symbols belong to that class, as is the case with the index. To
correctly compute the prefix, we need to know the weight of all the classes.
This fact makes it impossible to accurately include the length of the prefix
in any algorithm that finds the optimal solution in a step-by-step manner.
Moreover, the relationship between symbol probabilities and their code lengths
(using Huffman encoding) is highly nonlinear ([65]), which renders numerical
optimization methods unusable.
However, we claim that under certain conditions, ignoring the prefix has
negligible effect.
The compression ratio with a Huffman coded prefix can be written as

R′ = [ Σ_{i=1}^{N} P_i log₂|C_i| + P_{N+1}·B + B·Σ_{i=1}^{N} |C_i| + (H_prefix + ρ) ] / B        (2.27)

where

H_prefix = −Σ_{i=1}^{N+1} P_i · log₂ P_i        (2.28)

is the entropy of the class prefixes and ρ is the redundancy of the Huffman code over it. Substituting (2.28) into (2.27) gives

R = [ Σ_{i=1}^{N} P_i·(log₂|C_i| − log₂ P_i) + P_{N+1}·(B − log₂ P_{N+1}) + B·Σ_{i=1}^{N} |C_i| ] / B + ρ/B        (2.29)
The first part of the equation has the form of equation (2.7), thus we can
write:
R = R̃ + ρ/B        (2.30)
R̃ is the “ideal” compression rate – when the prefix coding hits the entropy
bound. ρ, as defined, is the redundancy of the Huffman code. A basic property
of such codes is ρ ≤ 1 ([65]). This gives us a loose, but absolute, bound on
the redundancy (due to the prefix only) of class coding using Huffman coded
prefixes:
ρ_class ≤ 1/B        (2.31)
We want to improve this bound further, but here we will need to add
some assumptions. The extensive research on the redundancy of Huffman
code reveals that if the probability of the most probable symbol is below
0.5, the upper bound reduces dramatically([65]). In natural languages, the
probability of the most frequent symbol is well below that threshold (see, for
example, in the English language [6]). Moreover, calculating the cumulative
probability of some of the most frequent characters still yields a probability
that is much smaller than 0.5. Remembering the tendency of the optimiza-
tion algorithm to gather symbols with similar probabilities leads us to the
assumption that in such applications the redundancy would be negligible and
the optimization algorithm can safely ignore the prefix.
2.3.4 Complexity
2.3.5 Example
HDFCABACDFCCEEDCEBCBACDGDCCACACCEAEEE
The statistics of this text can be easily extracted (see table 2.1). We are
interested in coding this text using 2 classes (N = 2). Using (2.26), we build
a graph for the optimization problem (for clarity, not all arcs are shown),
which is shown in figure 2.2.
It can be immediately verified that the emphasized path is the shortest,
using only three arcs in total (one for each class and one final arc for
the literal class). The following tables summarize the final coding. Table
2.2 summarizes the class configuration and table 2.3 gives the coding for each
symbol in the compressed text.
Counting the bits in both the uncompressed and the compressed text
reveals that the original representation required 111 bits; the corresponding
count for the compressed text can be read off from the codes in table 2.3.
Symbol Code
C 00
E 10
A 11
D 01 011
B 01 001
F 01 101
G 01 110
H 01 111
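The counts can be reproduced directly from table 2.3; the sketch below counts only the encoded text itself (the codebook and any headers are not included):

text = "HDFCABACDFCCEEDCEBCBACDGDCCACACCEAEEE"
code = {"C": "00", "E": "10", "A": "11",
        "D": "01011", "B": "01001", "F": "01101",
        "G": "01110", "H": "01111"}

B = 3   # 8 distinct symbols (A..H), so 3 bits per symbol uncompressed
print(len(text) * B)                          # 111 bits for the original text
print(sum(len(code[ch]) for ch in text))      # bits in the class-coded text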
• The algorithm tends to gather symbols that have close probability val-
ues. The difference between p_1 and p_2 is large, but p_2 and p_3 are very
close, so they were assigned to their own class. p_4, however, was not
assigned to C_2 even though it is close to p_2 and p_3, since that would
require enlarging C_2 further (its cardinality must remain a power of
two), and the price is too high.
2.3.6 Implementation
The implementation starts with the definition of three arrays: f[k], which
holds p_k; u[l], the shortest path from the first node to the l'th node; and
trace[l,j], the node through which the shortest path to node
j in the l'th step passes. We initialize u[l] with w_{1,l} according to (2.26),
which is the shortest path from the first node to the l'th node using one step
only. If there is no arc between v_1 and v_l, u[l] should be given an “infinite”
value (in practice MAXINT can be used).
After completion, u[K+1] will be the value of the shortest path (which
is the length of the compressed message without the prefixes). trace[l,j]
can help us recover the nodes through which that path has passed, and that
will give us the class structure.
The size of the trace array governs the space complexity of the algorithm,
which is O(N · K).
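A minimal sketch of this dynamic program is given below. It assumes that the arc weights w[i][j] of (2.26) have already been computed into a (K + 2) × (K + 2) matrix (1-based nodes, with w[i][K+1] holding the final-arc weight and missing arcs set to an “infinite” value); the variable names follow the arrays described above.

INF = float("inf")

def optimal_classes(w, K, N):
    """Shortest path from node 1 to node K+1 using at most N+1 arcs.
    Returns the best path cost (the compressed length, prefixes excluded)
    and the trace array from which the class boundaries can be recovered."""
    # u[j]: best cost found so far from node 1 to node j (paths of one arc first)
    u = [INF] * (K + 2)
    for j in range(2, K + 2):
        u[j] = w[1][j]
    # trace[step][j]: predecessor of node j on the best path improved at this step
    trace = [[1] * (K + 2) for _ in range(N + 2)]

    for step in range(2, N + 2):          # grow the paths arc by arc
        nxt = list(u)                     # a path is also allowed to stop growing
        for j in range(3, K + 2):
            for i in range(2, j):
                if u[i] + w[i][j] < nxt[j]:
                    nxt[j] = u[i] + w[i][j]
                    trace[step][j] = i
        u = nxt

    return u[K + 1], trace

Each of the N + 1 rounds scans every pair of nodes, so the running time of this sketch is O(N · K²), while the trace array takes the O(N · K) space mentioned above.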
when we increase the number of classes, heading for the Huffman limit, the
codebook becomes less efficient, and the compression ratio drops.
2.4.2 Results
The first variable we would like to examine is the number of classes. In section
2.2 we claimed that the compression ratio rises monotonically as the
number of classes increases. Figure 2.4 shows a single text (world95.txt)
compressed with a varying number of classes. The dashed line is the length
of the text compressed using Huffman code, and the approach towards the
latter is clearly seen.
Figure 2.3 shows compression results for several files selected from the
datasets. The figure shows that when the number of classes is small, increas-
ing it has a big impact on the result, whereas from some point the curve
flattens and increasing the number of classes no longer yields a real im-
provement in compression. This behavior is consistent across all the texts shown.
Figure 2.3: Compression ratio for some files from the testset with different
number of classes
Figure 2.4: Approaching the Huffman limit (X axis is logarithmic for conve-
nience)
Chapter 3

Code Compression for Embedded Systems
A 0-order model assigns a probability to each symbol, regardless of the previous (or future) symbols. This is
similar to a source emitting random symbols at some specified probabilities.
An N-th order model assigns a probability to each symbol depending also on
its context, which is the previous N symbols. Higher order models have the
potential to provide better compression ratios, since they capture more of the
behavior of the text. Higher order models, however, are more complicated
and require more time to build. In the code compression application, the decom-
pression hardware is in the critical path from the memory to the execution
units. Thus, it must perform its function very fast, which means that high
order models are not practical.
The codebook contains all the codes that are not in the literal class. If
we look at the graph representation of the problem, we notice that the size
of the literal class is determined by the node from which the (selected) final
arc extends. If we want to limit the codebook size to K̃ symbols, we must
make sure that the final arc extends from a node not after K̃ nodes. This is
easily accomplished by removing all the arcs from node K̃ + 1.
Obviously, the class structure selected by the new graph may be subop-
timal in comparison with the unlimited graph, but it is reasonable to expect
degradation in compression performance after imposing a new constraint.
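In terms of the sketch in section 2.3.6, the constraint can be imposed by pruning the weight matrix before running the optimization (a hypothetical helper; INF is the same “infinite” value used there):

def limit_codebook(w, K, K_tilde):
    """Remove every arc that leaves a node beyond v_{K_tilde+1}, so the final
    (literal-class) arc is forced to start early enough to keep the codebook
    at K_tilde symbols or fewer."""
    for i in range(K_tilde + 2, K + 2):
        for j in range(i + 1, K + 2):
            w[i][j] = INF
    return w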
• Most Frequent. Again, we calculate the optimal structure for all the
programs, and for every class we select the most frequently occurring
length.
Figures 3.4 and 3.5 show compression ratio results for various codebook
limit values. The mapping from the codebook limit to the compression
performance is immediate – as the codebook limit gets tighter, less com-
pression is attained. It is to be noted that when the codebook size is not
limited, it does not grow to include all the symbols, since from some point on,
adding symbols to the dictionary only increases the size of the overall com-
pressed text. The optimization algorithm determines the optimal codebook
size while calculating the optimal class structure.
[Figure: Compression Ratio [%] vs. Number of Classes]

[Figure: Compression Ratio [%] vs. Dictionary Size [words]]
The last figure, 3.6, shows compression results for a static class configura-
tion. The class configuration was determined using the methods described in
the previous section, then applied to some files from the set (the character-
istic set consisted of all the programs from SPEC2000). The results show that using
a static class structure incurs very little penalty, and thus it is a feasible design
option.
Chapter 4

Multi-resolution string matching
Let
T = t_1 t_2 t_3 ... t_ℓ        (4.1)
be a string (text) over Σ (as defined in equation (2.3)). The low resolution
image of T is defined as the string
T̂ = t̂_1 t̂_2 t̂_3 ... t̂_ℓ
where
t̂_i = σ̂_j ,  t_i ∈ C_j        (4.4)
(σ̂_j denotes the symbol that represents class C_j).
• Uniqueness. For any text T, there is one and only one low resolution
image T̂. This property is a direct corollary of the uniqueness property
of the class definition (2.2). The reverse direction does not hold: for
any low resolution image, there exist many texts that would produce
it.
• Existence. A low resolution image exists for any text. This property is
a direct corollary of the cover property (2.1).
In order to recover a text from its low resolution image, more information
needs to be supplied. This information will be called resolving information.
A text represented as a low resolution image and resolving information, will
be called multi-resolution text coding. It is to be noted that the process can
be repeated several times for the text, which will yield a coding with several
levels of resolution. From this point on, we will restrict ourselves to two-level
coding only.
The class coding, presented in the previous chapter, is one possible way
to code a text in a multi-resolution fashion. When encoding a text, class coding
produces a stream of prefix-index pairs. Taking the prefixes only, we get the
desired low-resolution image. In order to decode the text completely, we need
the indices, which are used as resolving information. Figure 4.1 illustrates
the process using the text from section 2.3.5.
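A small sketch of the encoder side (hypothetical representation: classes maps every symbol to a triple of class id, prefix bits and index bits):

def encode_multires(text, classes):
    """Split the class coding of text into the two blocks of figure 4.2: the
    prefix block (which doubles as the low resolution image T-hat) and the
    index block (the resolving information).  classes maps each symbol to a
    (class_id, prefix_bits, index_bits) triple -- a hypothetical representation."""
    prefix_block, index_block, low_res = [], [], []
    for ch in text:
        class_id, prefix_bits, index_bits = classes[ch]
        prefix_block.append(prefix_bits)
        index_block.append(index_bits)
        low_res.append(class_id)            # one sigma-hat symbol per text symbol
    return "".join(prefix_block), "".join(index_block), low_res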
4.2.1 Definition
Let
S = s_1 s_2 s_3 ... s_r ,  s_j ∈ Σ        (4.5)
be a string. We wish to find the first (or all) occurrences of S in T. T,
however, is given in class-compressed form, ordered as two separate blocks
– first comes the prefix block, then the index block, as illustrated in figure
4.2. The simplest way to accomplish the task is to extract the compressed
file into some temporary storage, and then to apply known algorithms. As
discussed in the first chapter, this method has several drawbacks.
Instead, the following three-step procedure is used:
• Step 1. Compute Ŝ, the low resolution image of S.
• Step 2. Find all the occurrences of Ŝ in T̂, i.e. in the prefix block.
• Step 3. For any occurrence found, decode T locally and check whether
S actually occurs.
4.2.2 Details
The number of possible prefixes is much smaller than the number of symbols, and their
Huffman codes are accordingly shorter.
The result of this step would be a list of matches. This list, however, is
valid only for the low resolution string. Since a single low resolution string
can be produced by many full resolution strings, this list may contain false
matches. The uniqueness property, on the other hand, asserts that if S
appears in the text, Ŝ must appear in T̂ . Thus, another step is required to
determine which of the occurrences of Ŝ is a true occurrence of S. To do
that, we need the resolving information stored as indices.
One possible way is to decode the indices, starting at the first symbol,
up to the occurrence found. This way, however, means completely decoding
T, which is not desirable. Rather, it is possible to access the index block
“randomly”: start decoding at the occurrence and decode only as many sym-
bols as required. The indices, like the prefixes, are stored as variable-length
codes. The length of each index is, however, known at an earlier stage – when
its prefix was decoded. To access the i'th index, we only need to decode
the first i prefixes and accumulate the length of the index paired with every
prefix. That way, the exact location of the index of the i'th symbol within the index block
is known without having to decode all the indices along the way.
After the indices have been checked and the occurrence of the string
being sought has been confirmed, it can be marked as an occurrence.
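Putting the three steps together, a sketch of the search could look as follows. The representation is hypothetical: prefix_block is assumed to be already decoded into a list of (class_id, index_length) pairs (one per text symbol), index_block is the concatenated index bits as a '0'/'1' string, symbol_class maps a symbol to its class id, and decode_symbol(class_id, bits) maps a class and its index bits back to a symbol. The naive scan of step 2 could be replaced by any linear-time string matching algorithm.

def search(pattern, prefix_block, index_block, symbol_class, decode_symbol):
    # Step 1: the low resolution image of the pattern
    p_hat = [symbol_class[ch] for ch in pattern]

    # One sequential pass over the prefixes: collect the class ids and, for every
    # position, the bit offset of its index inside the index block (this is what
    # later allows "random" access to the resolving information).
    classes, offsets, pos = [], [], 0
    for class_id, index_len in prefix_block:
        classes.append(class_id)
        offsets.append(pos)
        pos += index_len

    matches = []
    for i in range(len(classes) - len(p_hat) + 1):
        # Step 2: candidate occurrences of the low resolution pattern
        if classes[i:i + len(p_hat)] != p_hat:
            continue
        # Step 3: resolve only the candidate region and verify the full pattern
        ok = True
        for k, ch in enumerate(pattern):
            cid, ilen = prefix_block[i + k]
            bits = index_block[offsets[i + k]:offsets[i + k] + ilen]
            if decode_symbol(cid, bits) != ch:
                ok = False
                break
        if ok:
            matches.append(i)
    return matches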
The process is illustrated in figure 4.3. In part (a) of the figure, the
text is shown (the text, the classes and the symbol codes are taken from the
example at section 2.3.5) and the low resolution image is given just below.
The numbers 1, 2 and 3 denote here that the symbol belongs to C1 , C2 or C3 .
The same procedure continues on to the next occurrences of Ŝ, but re-
solving the original text in these cases reveals that they are false matches.
TO  DA  Y␣  IS  ␣T  HE  ␣D  AY
σ1  σ2  σ3  σ4  σ5  σ6  σ7  σ8
(␣ denotes the space character)
We wish to search for the string DAY within it. The naïve approach will
encode this string as σ2 σ3. There might be, actually, several possibilities for
the last symbol, so it can simply be dropped (remembering to match the last
character at some later stage). This choice, however, will miss the second
occurrence of the string.
The solution to this situation is to perform two searches: the first will
be done using the said conversion; the second one will start matching the
target string starting from the second character. In the last example, the
first search would try to find the sequence σ2 σ3 and the second will try to
find the sequence σ8 . Combining the results from both searches will yield
the desired result.
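For the two-character-symbol example above, the two pattern encodings can be produced as follows (a hypothetical sketch; pair_symbol maps a character pair to its symbol, and the unpaired leading or trailing character is verified later, as described):

def pattern_alignments(pattern, pair_symbol):
    """Return the two symbol sequences to search for: one assuming the pattern
    starts on a symbol boundary, one assuming it starts in the middle of a symbol."""
    def encode(chars):
        pairs = [chars[i:i + 2] for i in range(0, len(chars) - 1, 2)]
        return [pair_symbol[p] for p in pairs]      # an odd trailing character is dropped
    return encode(pattern), encode(pattern[1:])

# For "DAY" this yields the symbol of "DA" (trailing "Y" checked afterwards)
# and the symbol of "AY" (leading "D" checked afterwards).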
4.3 Analysis
The common tool for estimating the performance of algorithms is the asymp-
totic complexity. Reviewing the algorithm described, we see that in the first
step the prefix block, which is ℓ symbols long, has to be searched for all the
occurrences of a single string. The complexity of such an algorithm is O(ℓ + r)
in the worst case (assuming the use of a slightly more intelligent algorithm
than brute force). In the second step, in the worst case (all the characters
are decoded), another brute force search is performed, thus the overall com-
plexity is O(ℓ + r). This places the algorithm in the efficient category of
Amir and Benson's [1] taxonomy.
From the complexity point of view, the algorithm is efficient (since it’s
polynomial time) and it is no faster than a brute force search (in the uncom-
pressed text).
This is, however, not the complete picture. The stage of prefix decoding
consists of a sequential pass over the compressed file, whereas the stage of
index decoding is based on random access. In most cases (when the file is
stored on devices such as disks, CD-ROMs, etc.) random access is much slower.
Therefore, the real performance measure is the rate of false matches, i.e.
matches in the low resolution part that are not matches in the text. This
phenomenon cannot be observed by the asymptotic complexity tool, since it
ignores constants.
4.3.1 Relation of number of classes to search speed
Increasing the number of classes has two effects. The first is increasing the
number of symbols in the low resolution text. The outcome of this fact is
that the low-resolution part becomes closer to the full text and the false
match rate is expected to drop. On the other hand, the increased number
of symbols in the prefix block requires more bits per prefix, and the size of
the prefix block increases. An increase in the prefix block size will increase the
execution time of the first stage.
4.4 Extra space
While running, the search program needs no extra space that depends
on the length of the uncompressed file. The only memory required is used to
accelerate decoding of the prefixes' Huffman codes and searching through them.
4.5 Experimental Results

4.5.1 Implementation Notes

• Prefix Decoding. The method used to decode the prefix block is byte
oriented. Each prefix is aligned to a byte boundary and then fed into
a lookup table (containing 256 entries) which decodes the prefix at the
beginning of the byte and outputs the class and its actual length. The
length is then used to determine the location of the next prefix and the
class continues to the low-resolution string matching algorithm. The
table has constant length, and it is built in constant time. The only
drawback is that the length of the longest prefix is limited to 8 bits (we
can, of course, use a larger table). In practice, however, this limit is
sufficient. A different way to overcome this limitation is to allocate the
prefix codes using a length-limited algorithm (for instance [35] or [57]),
which will come at the expense of compression.
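A sketch of building such a table (assuming, as above, that no prefix is longer than 8 bits; prefix_codes is a hypothetical mapping from class id to its Huffman prefix bit string):

def build_prefix_table(prefix_codes):
    """Return a 256-entry table mapping a byte (the next 8 bits of the prefix
    block) to the (class_id, prefix_length) pair that it starts with."""
    table = [None] * 256
    for class_id, code in prefix_codes.items():
        pad = 8 - len(code)
        base = int(code, 2) << pad
        for filler in range(1 << pad):              # every byte beginning with this code
            table[base | filler] = (class_id, len(code))
    return table

# Decoding: read the next 8 bits at the current bit position, look them up,
# emit the class id and advance the position by the returned prefix length.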
4.5.2 Results
Figure 4.5.2 shows the compression results of some of the text files used in the
experiments, compared with ZIP. Unlike the code compression application, natural lan-
guage has more built-in redundancy ([14]), which a 0-order model
cannot exploit; thus ZIP (which does) gives much better results.
Figures 4.5, 4.6 and 4.7 show the results obtained from the search runs
described in the previous sections.
The graph in figure 4.5 shows the rate of false matches as a function of the
number of classes (N) and the length of the searched pattern. The Y axis gives
the percentage of all the substrings of T̂ that are false matches. It is evident
that the false match rate drops with increasing length and number of
classes. The drop due to the number of classes is attributed to the fact that
as the number of possible symbols (in the prefix) rises, the probability to find
an arbitrary string of symbols (that is, where they do not result from a true
match) drops. The dependency of search speed on the pattern length exists
in all search algorithms, but in this algorithm it is much more pronounced.
Again, this is due to the fact that the probability of finding some arbitrary
string of symbols drops sharply as the length increases. Eventually, the only
prefix string that will match the pattern being searched is the true match,
and the number of false matches will drop to zero.
The graph in figure 4.6 shows the search time depending on the same
variables. The resemblance between this graph and the previous graph is
salient. This fact supports the claim that the search time depends mainly on
the false match rate. The last graph shows the relation between the
search time and the rate of false matches. Again, the relation is clear.
The graph in figure 4.8 shows the relation between search time and the
number of classes. This graph was taken without operating system
caching. The expected behavior of the search time, as previously discussed, is
evident from the graph. Under these terms, the optimal number of classes is
about 10. The file that was searched is world95 and the string searched was
6 characters long.
Finally, figure 4.9 shows a comparison between our algorithm and
the lzgrep program [45]. The graph shows that in some cases, our algorithm
slightly outperforms lzgrep, even though the latter uses the fast Boyer-
Moore method. This comes, however, at the expense of compression (the LZ-
compressed file is about 20 percent smaller).
"#
$
% &'())*
(+,
[Figure 4.5: false match ratio vs. N, for pattern lengths r = 4, 6, 10, 12]

[Figure 4.6: search time vs. N, for pattern lengths r = 4, 6, 10, 12]

[Figure 4.7: search time vs. false match ratio, for N = 4, 6, 10]
4.6 Enhancement Possibilities
In this section, several possible enhancements to the base algorithm are pre-
sented. They were not implemented, tested or studied, but they are examples
of how this approach can be generalized.
• Non-text files. Throughout the work, only text files were mentioned
and considered. However, none of the algorithm's features relies on this fact. Unlike
other methods in this field of research ([42] for instance), the method
presented here can be used without any changes for arbitrary file types,
including text in various encoding schemes and binary files.
Chapter 5

Summary
In this work we have dealt with several problems. The first problem was the
class coding and its optimization. The method was introduced and an algo-
rithm for optimizing the class structure has been developed. This algorithm
is efficient and was proved to run faster than the brute force approach. Though
the algorithm handles only the index part of the code, a heuristic argument suggesting
that this method gives good results was given.
The third and last problem is the problem of compressed matching. This
problem, which has been gaining popularity recently, was defined and a novel
algorithm, based on class compression, has been suggested. The algorithm is
based on a decomposition of the text into two parts – a low resolution part
and resolving information. Unlike in the image processing area (from which
the term “low resolution image” is taken), the low resolution part has little
meaning by itself (the original text cannot be recovered from this part alone),
but it is very helpful in fast filtering of the text, so that the actual search
can be made fast. The chapter concluded with experimental results
showing search times.
Bibliography
[2] A. Amir, G. Benson, and M. Farach. Let sleeping files lie: Pattern
matching in z-compressed files. Journal of Computer and System Sci-
ences, 52:299–307, 1996.
[3] A. Amir, G.M. Landau, and U. Vishkin. Efficient pattern matching with
scaling. In Proceedings of the first annual ACM-SIAM symposium on
Discrete algorithms, pages 344–357, 1990.
[13] I.C. Chen, P. Bird, and T. Mudge. The impact of instruction compres-
sion on I-cache performance. Technical Report CSE-TR-330-97, EECS
Department, University of Michigan, 1996.
[16] J. Ernst, C.W. Fraser, W. Evans, S. Lucco, and T.A. Proebsting. Code
compression. In Proc. Conf. on Programming Languages Design and
Implementation, pages 358–365, June 1997.
[19] C.W. Fraser and T.A. Proebsting. Finite-state code generation. In Proc.
Conf. on Programming Languages Design and Implementation, pages
270–280, May 1999.
[25] T.M. Kemp, R.M. Montoye, J.D. Harper, J.D. Palmer, and D.J. Auer-
bach. A decompression core for PowerPC. IBM Journal of Research and
Development, 42(6):807–812, Nov 1998.
[28] K. Kissell. MIPS16: High-density MIPS for the Embedded Market. Sil-
icon Graphics MIPS Group, 1997.
[31] D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in
strings. SIAM J. of Computing, 6:323–350, 1977.
[34] S.Y. Larin and T.M. Conte. Compiler-driven cached code compression
schemes for embedded ILP processors. In Proc. Int’l Symp. on Microar-
chitecture, pages 82–92, November 1999.
[35] L. Larmore and D.S. Hirschberg. A fast algorithm for optimal length-
limited Huffman codes. Journal of the ACM, 37(3):464–473, Jul 1990.
[38] E.A. Lee. What’s ahead for embedded software? IEEE Computer,
33(9):18–26, September 2000.
[39] C. Lefurgy, P. Bird, I.C. Chen, and T. Mudge. Improving code density
using compression techniques. In Proc. Int’l Symp. on Microarchitecture,
pages 194–203, December 1997.
[40] H. Lekatsas, J. Henkel, and W. Wolf. Code compression for low power
embedded system design. In Proceedings of the 37th conference on De-
sign automation, pages 294–299, 2000.
[42] U. Manber. A text compression scheme that allows fast searching di-
rectly in the compressed file. In Proceedings of the 5th Annual Sym-
posium on Combinatorial Pattern Matching, pages 113–124. Springer-
Verlag, Berlin, 1994.
[45] G. Navarro and J. Tarhio. lzgrep – a direct compressed text search tool.
www.dcc.uchile.cl/gnavarro/software.
[49] A. Said and W.A. Pearlman. Low-complexity waveform coding via al-
phabet and sample-set partitioning. In Visual Communications and
Image Processing ’97, Proc. SPIE Vol. 3024, pages 25–37, Feb. 1997.
[55] M. Takeda. Pattern matching machine for text compressed using fi-
nite state model. Technical report, Department of Informatics, Kyushu
University, October 1997.
[56] J.L. Turley. Thumb squeezes ARM code size. Microprocessor Report,
9(4), March 1995.
[57] D.C. Van Voorhis. Constructing codes with bounded codeword lengths.
IEEE Transactions on Information Theory, 20(3):288–290, March 1974.