Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching
Abstract

The proliferation of online text, such as on the World Wide Web and in databases, motivates the need for space-efficient index methods that support fast search. Consider a text T of n binary symbols to index. Given any query pattern P of m binary symbols, the goal is to search for P in T quickly, with T being fully scanned only once, namely, when the index is created. All indexing schemes published in the last thirty years support searching in Θ(m) worst-case time and require Θ(n) memory words (or Θ(n log n) bits), which is significantly larger than the text itself. In this paper we provide a breakthrough both in searching time and index space under the same model of computation as the one adopted in previous work. Based upon new compressed representations of suffix arrays and suffix trees, we construct an index structure that occupies only O(n) bits and compares favorably with inverted lists in space. We can search any binary pattern P, stored in O(m/log n) words, in only o(m) time. Specifically, searching takes O(1) time for m = o(log n), and O(m/log n + log^ε n) = o(m) time for m = Ω(log n) and any fixed 0 < ε < 1. That is, we achieve optimal O(m/log n) search time for sufficiently large m = Ω(log^{1+ε} n). We can list all the occ pattern occurrences in optimal O(occ) additional time when m = Ω(polylog(n)) or when occ = Ω(n^ε); otherwise, listing takes O(occ log^ε n) additional time.

∗ Supported in part by the Italian MURST project "Algorithms for Large Data Sets: Science and Engineering" and by the United Nations Educational, Scientific and Cultural Organization under contract UVO-ROSTE 875.631.9.
† Part of this work was done while the author was on sabbatical at I.N.R.I.A. in Sophia Antipolis, France. Supported in part by Army Research Office MURI grant DAAH04-96-1-0013 and by National Science Foundation research grants CCR-9522047 and CCR-9877133.

1 Introduction

A great deal of textual information is available in electronic form in databases and on the World Wide Web, and consequently, devising indexing methods to support fast search is a relevant research topic. Inverted lists and signature files are efficient indexes for texts that are structured as long sequences of words or keys. Inverted lists are theoretically and practically superior to signature files [49]. Their versatility allows for several kinds of queries (exact, boolean, ranked, and so on) whose answers have a variety of output formats.

Searching unstructured text for string matching queries, however, adds a new difficulty to text indexing. The set of candidate keys is much larger than that of structured texts because it consists of all possible substrings of the text. String matching queries look for the occurrences of a pattern string P of length m as any substring of a long text T of length n. We are interested in three types of queries: existential, counting, and enumerative. An existential query returns a boolean value that says if P is contained in T. A counting query computes the number occ of occurrences of P in T. An enumerative query outputs the list of occ positions where P occurs in T. In the rest of the paper, we assume that the strings are defined over a binary alphabet Σ = {a, b}. Our results extend to an alphabet of σ > 2 symbols by the standard trick of encoding each symbol with ⌈log σ⌉ bits. (The implied base of the log function is 2.)

The prominent data structures widely used in string matching, such as suffix arrays [35; 24], suffix trees [37; 46] and similar tries or automata [15], are more powerful than inverted lists and signature files when used in text indexing. The suffix tree for text T = T[1, n] is a compact trie whose leaves store pointers to the n suffixes in the binary text, namely T[1, n], T[2, n], ..., T[n, n], and whose internal nodes each have two children. The suffix array stores the pointers to the n suffixes in lexicographic order. It also keeps another array of longest common prefixes to speed up the search [35]. In this paper we refer to the suffix array as the plain array of pointers. Both data structures occupy Θ(n) memory words (or Θ(n log n) bits) in the unit cost RAM model. We can do existential and counting queries of P in T in O(m) time (using automata or suffix trees and their variations) and in O(m + log n) time (using suffix arrays along with longest common prefixes). Enumerative queries take an additional additive output-sensitive cost O(occ).
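To make the role of the plain suffix array concrete, the sketch below answers existential and counting queries by the classical binary search over the sorted suffix pointers; it is the simpler O(m log n) variant that omits the longest-common-prefix refinement cited above, and it is only an illustration (0-based positions, no end-of-string sentinel), not the compressed structure developed in this paper.

```python
def build_sa(text: str) -> list[int]:
    """Plain suffix array: starting positions of all suffixes of `text`,
    sorted lexicographically.  Naive construction, fine for a small demo."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_range(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Return [lo, hi) such that sa[lo:hi] lists exactly the suffixes having
    `pattern` as a prefix.  Two binary searches, O(m log n) comparisons."""
    m = len(pattern)
    def first(strictly_greater: bool) -> int:
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return first(False), first(True)

t = "abbabab"
sa = build_sa(t)
lo, hi = sa_range(t, sa, "ab")
print(hi - lo, sorted(sa[lo:hi]))   # counting (3) and enumerative ([0, 3, 5]) answers
```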
Indexes based upon suffix trees and suffix arrays and related data structures are especially efficient when several searches are to be performed, since the text T needs to be fully scanned only once, namely, when the indexes are created. The importance of suffix arrays and suffix trees is witnessed by numerous references to a great variety of applications besides string searching [4; 25; 35]. Their range of applications is growing in molecular biology, data compression, and text retrieval.

A major criticism that limits the applicability of indexes based upon suffix arrays and suffix trees is that they occupy significantly more space than inverted lists. Space occupancy is especially crucial for large texts. For a text of n binary symbols, suffix arrays use n words of log n bits each (a total of n log n bits), while suffix trees require between 4n and 5n words (or between 4n log n and 5n log n bits) [35]. In contrast, inverted lists require less than 0.1 n/log n words (or 0.1n bits) in many practical cases [38] in order to index a set of words consisting of a total of n bits. However, as previously mentioned, inverted files have less functionality than suffix arrays and suffix trees since only the words are indexed, whereas suffix arrays and suffix trees index all substrings of the text.

No data structures with the functionality of suffix trees and suffix arrays published in the literature to date use o(n) words (or o(n log n) bits) and support fast queries in the worst case. In order to remedy the space problem, we introduce compressed suffix arrays, which are abstract data structures supporting two basic operations:

1. compress: Given a suffix array SA, compress SA so as to represent it succinctly.

2. lookup(i): Given the compressed representation mentioned above, return SA[i], the pointer to the ith suffix in T in lexicographic order.

The primary measures of performance are the query time to do lookup, the amount of space occupied by the compressed suffix array, and the preprocessing time taken by compress.
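As a concrete rendering of this abstract data type (names are ours, not the paper's), the sketch below fixes the interface and includes the natural baseline that stores SA explicitly in n words and answers lookup in constant time; the goal of the paper is to replace this baseline by a representation using only O(n) or O(n log log n) bits.

```python
from abc import ABC, abstractmethod

class CompressedSuffixArrayADT(ABC):
    """The two operations of the abstract data type; illustrative names."""

    @abstractmethod
    def compress(self, sa: list[int]) -> None:
        """Preprocess suffix array SA into a succinct representation."""

    @abstractmethod
    def lookup(self, i: int) -> int:
        """Return SA[i], the pointer to the ith suffix (1-based) in
        lexicographic order."""

class ExplicitSuffixArray(CompressedSuffixArrayADT):
    """Natural baseline: n words of log n bits each, constant-time lookup."""

    def compress(self, sa: list[int]) -> None:
        self._sa = list(sa)            # stored verbatim, no compression

    def lookup(self, i: int) -> int:
        return self._sa[i - 1]         # direct array access
```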
Our main result is that we can implement operation compress in only O(n) bits and O(n) preprocessing time, so that each call to lookup takes sublogarithmic worst-case time, that is, O(log^ε n) time for any fixed constant ε > 0. We can also achieve O(n log log n) bits and O(n) preprocessing time, so that calls to lookup can be done in O(log log n) time.

Our findings have several important implications:

• To the best of our knowledge, ours is the first result successfully breaking the space barrier of n log n bits (or n words) for a full text index while retaining fast lookup in the worst case. We refer the reader to the literature described in Section 2.

• Our compressed suffix arrays are provably as good as inverted lists in terms of space usage, at least theoretically. No previous result supported this finding. In the worst case, both types of indexes require asymptotically the same number of bits; however, compressed suffix arrays have more functionality because they support search for arbitrary substrings.

• Compressed suffix trees can be implemented in O(n) bits by using compressed suffix arrays and the techniques for compact representation of Patricia tries presented in [42]. As a result, they occupy asymptotically the same space as that of the text string being indexed.

• A text index on T can be built in only O(n) bits by a suitable combination of our compressed suffix trees and previous techniques [12; 30; 42; 39]. This is the first result obtaining existential and counting queries of any binary pattern string of length m in o(m) time and O(n) bits. Specifically, searching takes O(1) time for m = o(log n), and O(m/log n + log^ε n) = o(m) time for m = Ω(log n) and any fixed 0 < ε < 1. That is, we achieve optimal O(m/log n) search time for sufficiently large m = Ω(log^{1+ε} n). For enumerative queries, retrieving all occ occurrences has optimal cost O(occ) time when m = Ω((log^3 n) log log n) or when occ = Ω(n^ε); otherwise, it takes O(occ log^ε n) time.

Outline of the paper. In the next section we review related work on string searching and text indexing. In Section 3 we describe the ideas behind our new data structure for compressed suffix arrays. In Section 4 we show how to use compressed suffix arrays to construct compressed suffix trees and a general space-efficient indexing mechanism for text search. Details of our compressed suffix array construction are given in Section 5. We adopt the standard unit cost RAM for the analysis of our algorithms, as does the previous work that we compare with. We use standard arithmetic and boolean operations on words of O(log n) bits, each operation taking constant time and each word read or written in constant time. We give final conclusions and comments in Section 6.

2 Previous Work on String Searching and Text Indexing

The seminal paper by Knuth, Morris, and Pratt [33] presented the first string matching solution taking O(m + n) time and O(m) words to scan the text. The space complexity was remarkably lowered to O(1) words in [22; 14]. A seminal paper by Weiner [46] introduced the suffix tree for solving the text indexing problem in string matching. Since then, a plethora of papers have studied the problem in several contexts and sometimes using different terminology [7; 8; 13; 19; 26; 37; 35; 45]; for more references see [4; 15; 25]. Although very efficient, the resulting index data structures are greedy of space, at least n words or Ω(n log n) bits.

Numerous papers faced the problem of saving space in these data structures, both in practice and in theory. Many of the papers were aimed at improving the lower-order terms, as well as the constants in the higher-order term, or at achieving a tradeoff between space requirements and search time complexity. Some authors improved the multiplicative constants in the O(n log n)-bit practical implementations. For the analysis of constants, we refer the reader to [3; 10; 23; 29; 34; 35]. Other authors devised several variations of sparse suffix trees to store a subset of the suffixes [2; 24; 32; 31; 36; 39]. Some of them wanted queries to be efficient when the occurrences are aligned with the beginnings of the
indexed suffixes. Sparsity saves much space but makes the search for arbitrary substrings difficult and, in the worst case, as expensive as scanning the whole text in O(m + n) time. Another interesting index, the Lempel-Ziv index of Kärkkäinen and Sutinen [30], occupies O(n) bits and takes O(m) time to search patterns shorter than log n; for longer patterns, it may occupy Θ(n log n) bits.

A recent line of research has been built upon Jacobson's succinct representation of trees in 2n bits, with navigational operations [27]. That representation was extended in [11] to represent a suffix tree in n log n bits plus an extra O(n log log n) expected number of bits. A solution requiring n log n + O(n) bits and O(m + log log n) search time was described in [12]. Munro et al. [42] used it along with an improved succinct representation of balanced parentheses [41] in order to get O(m) search time with only n log n + o(n) bits.

3 Compression of Suffix Arrays

The compression of suffix arrays falls into the general framework presented by Jacobson [28] for the abstract optimization of data structures. We start from the specification of our data structure as an abstract data type with its supported operations. We take the time complexity of the "natural" (and space inefficient) implementation of the data structure. Then, we define the class C_n of all distinct data structures storing n elements. A simple combinatorial argument implies that each such data structure can be canonically identified by log |C_n| bits. We try to give a succinct implementation of the same data structure in O(log |C_n|) bits, while supporting the operations within time complexity comparable with that of the natural implementation. However, the combinatorial argument does not guarantee that the operations can be supported efficiently.

We define the suffix array SA for a binary string T as an abstract data type that supports the two operations compress and lookup described in the introduction. We will adopt the convention that T is a binary string of length n − 1 over the alphabet {a, b}, and it is terminated in the nth position by a special end-of-string symbol #, such that a < # < b.¹ The suffix array SA is a permutation of {1, 2, ..., n} that corresponds to the lexicographic ordering of the suffixes in T; that is, SA[i] is the starting position in T of the ith suffix in lexicographic order. In the example below are the suffix arrays corresponding to the 16 binary strings of length 4:

    aaaa#   aaab#   aaba#   aabb#
    12345   12354   14253   12543

    abaa#   abab#   abba#   abbb#
    34152   13524   41532   15432

    baaa#   baab#   baba#   babb#
    23451   23514   42531   25143

    bbaa#   bbab#   bbba#   bbbb#
    34521   35241   45321   54321

¹ Usually an end-of-string symbol is not explicitly stored in T, but rather is implicitly represented by a blank symbol ⊥, with the ordering ⊥ < a < b. However, our use of # is convenient for showing the explicit correspondence between suffix arrays and binary strings.
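The short sketch below (ours, for illustration) builds suffix arrays under exactly this convention, treating # as a sentinel with a < # < b, and reproduces the table above; for instance, it returns 45321 for the string bbba#.

```python
ORDER = {"a": 0, "#": 1, "b": 2}      # the paper's ordering a < # < b

def suffix_array(t: str) -> list[int]:
    """1-based suffix array of a string over {a, b} terminated by '#'.
    Naive construction; enough to reproduce the 16 examples above."""
    rank_of = lambda i: [ORDER[c] for c in t[i - 1:]]   # suffix starting at position i
    return sorted(range(1, len(t) + 1), key=rank_of)

for s in ("aaaa#", "abab#", "bbba#", "bbbb#"):
    print(s, "".join(str(p) for p in suffix_array(s)))
# bbba# -> 45321, matching the 15th entry of the table
```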
The natural explicit implementation of suffix arrays requires O(n log n) bits and supports the lookup operation in constant time. The abstract optimization discussed above suggests that there is a canonical way to represent suffix arrays in O(n) bits. This observation follows from the fact that the class C_n of suffix arrays has no more than 2^{n−1} distinct members, as there are 2^{n−1} binary strings of length n − 1.

We use the intuitive correspondence between suffix arrays of length n and binary strings of length n − 1. According to the correspondence, given a suffix array SA, we can infer its associated binary string T and vice versa. To see how, let x be the entry in SA corresponding to the last suffix # in lexicographic order (that is, SA[x] = n). Then T must have the symbol a in each of the positions pointed to by SA[1], SA[2], ..., SA[x − 1], and it must have the symbol b in each of the positions pointed to by SA[x + 1], SA[x + 2], ..., SA[n]. For example, in the suffix array ⟨45321⟩ (the 15th of the 16 examples above), the suffix # corresponds to the second entry, 5. The preceding entry is 4, and thus the string T has an a in position 4. The subsequent entries are 3, 2, 1, and thus T must have bs in positions 3, 2, 1. The resulting string T, therefore, must be bbba#.

The abstract optimization does not say anything regarding the efficiency of the supported operations. By the correspondence above, we can define a trivial compress operation that transforms SA into a sequence of n − 1 bits, namely, string T. The drawback, however, is the unaffordable cost of lookup. It takes Θ(n) time to decompress a single pointer in SA, as it must build the whole suffix array on T from scratch. In other words, the trivial method proposed so far does not support efficient lookup operations.
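The trivial method can be sketched as follows (our illustration, not the construction developed in the paper): compress keeps nothing but the bits of T, and lookup has to rebuild the whole suffix array before it can return a single entry.

```python
def trivial_compress(sa: list[int]) -> str:
    """Recover the binary string T (without its trailing '#') from SA via the
    correspondence above: positions listed before the entry pointing to '#'
    receive a's, positions listed after it receive b's."""
    n = len(sa)
    x = sa.index(n)                     # the entry pointing to the last suffix '#'
    t = [""] * (n - 1)
    for p in sa[:x]:
        t[p - 1] = "a"
    for p in sa[x + 1:]:
        t[p - 1] = "b"
    return "".join(t)

def trivial_lookup(t: str, i: int) -> int:
    """Return SA[i] by rebuilding the whole suffix array of t + '#';
    correct, but each call costs at least linear time."""
    return suffix_array(t + "#")[i - 1]   # suffix_array from the previous sketch

assert trivial_compress([4, 5, 3, 2, 1]) == "bbba"
assert trivial_lookup("bbba", 1) == 4
```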
In this paper we give an elegant and efficient method to represent suffix arrays in O(n) bits. Our breakthrough idea is to distinguish among the permutations of {1, ..., n} by relating them to the suffixes of the corresponding strings, instead of studying them alone. We mimic a simple divide-and-conquer "de-construction" of the suffix arrays to define the permutations recursively in terms of shorter permutations. For some examples of divide-and-conquer construction of suffix arrays and suffix trees, see [5; 16; 17; 18; 35; 44]. We reverse the construction process to compress the permutations.

Our decomposition scheme is by a simple recursion mechanism. Let SA be the suffix array for binary string T. In the base case, we denote SA by SA_0, and let n_0 = n be the number of its entries. For simplicity in exposition, we assume that n is a power of 2.

In the inductive phase k ≥ 0, we start with suffix array SA_k, which is available by induction. It has n_k = n/2^k entries and stores a permutation of {1, ..., n_k}. We run four main steps to transform SA_k into an equivalent but more succinct representation:

Step 1. Produce a bit vector B_k of n_k bits, such that B_k[i] = 1 if SA_k[i] is even and B_k[i] = 0 if SA_k[i] is odd.

Step 2. Map each 0 in B_k onto its companion 1. (We say that a certain 0 is the companion of a certain 1 if the odd entry in SA_k associated with the 0 is 1 less than the even entry in SA_k associated with the 1.) We can denote this correspondence by a partial function Ψ_k, where Ψ_k(i) = j if and only if SA_k[i] is odd and SA_k[j] = SA_k[i] + 1. When defined, Ψ_k(i) = j implies that B_k[i] = 0 and B_k[j] = 1. It is convenient to make Ψ_k a total function by setting Ψ_k(i) = i when SA_k[i] is even (i.e., when B_k[i] = 1). In summary, for 1 ≤ i ≤ n_k, we have

    Ψ_k(i) =  j   if SA_k[i] is odd and SA_k[j] = SA_k[i] + 1;
              i   otherwise.

Step 3. Compute the number of 1s for each prefix of B_k. We use function rank_k for this purpose, such that rank_k(j) counts how many 1s there are in the first j bits of B_k.

Step 4. Pack together the even values from SA_k and divide each of them by 2. The resulting values form a permutation of {1, 2, ..., n_{k+1}}, where n_{k+1} = n_k/2 = n/2^{k+1}. Store them into a new suffix array SA_{k+1} of n_{k+1} entries, and remove the old suffix array SA_k.

The example in Fig. 1 illustrates the effect of a single application of Steps 1-4.

    i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
    T       a  b  b  a  b  b  a  b  b  a  b  b  a  b  a  a  a  b  a  b  a  b  b  a  b  b  b  a  b  b  a  #
    SA_0   15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30 12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
    B_0     0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1  1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
    rank_0  0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8  9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
    Ψ_0     2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16 17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27

    i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
    SA_1    8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11

    Figure 1: A single application of Steps 1-4 in phase k = 0.
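The following sketch (our Python rendering; values are kept 1-based as in the paper, while the Python lists are 0-indexed) carries out one phase of Steps 1-4 and reproduces the arrays of Fig. 1 when applied to SA_0.

```python
def phase(sa_k: list[int]):
    """One application of Steps 1-4 to SA_k, a permutation of {1, ..., n_k}.
    Returns (B_k, rank_k, psi_k, SA_{k+1}); entry i of the paper corresponds
    to list index i - 1 here."""
    # Step 1: mark the even entries of SA_k.
    b = [1 if v % 2 == 0 else 0 for v in sa_k]
    # Step 2: psi_k(i) = j with SA_k[j] = SA_k[i] + 1 if SA_k[i] is odd,
    #         and psi_k(i) = i otherwise (total function).
    pos = {v: i for i, v in enumerate(sa_k, start=1)}   # value -> 1-based index
    psi = [pos[v + 1] if v % 2 == 1 else i
           for i, v in enumerate(sa_k, start=1)]
    # Step 3: rank_k(j) = number of 1s among the first j bits of B_k.
    rank, ones = [], 0
    for bit in b:
        ones += bit
        rank.append(ones)
    # Step 4: keep the even values, halved, in their original order.
    sa_next = [v // 2 for v in sa_k if v % 2 == 0]
    return b, rank, psi, sa_next

SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
B0, rank0, psi0, SA1 = phase(SA0)
assert SA1 == [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
```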
The next lemma shows that these steps preserve the information originally kept in suffix array SA_k:

Lemma 1. Given suffix array SA_k, let B_k, Ψ_k, rank_k and SA_{k+1} be the result of the transformation performed by Steps 1-4 of phase k. We can reconstruct SA_k from SA_{k+1} by the following formula, for 1 ≤ i ≤ n_k,

    SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1).

Proof. Suppose B_k[i] = 1. By Step 3, there are rank_k(i) 1s among B_k[1], B_k[2], ..., B_k[i]. By Step 1, SA_k[i] is even, and by Step 4, SA_k[i]/2 is stored in the rank_k(i)th entry of SA_{k+1}. In other words, SA_k[i] = 2 · SA_{k+1}[rank_k(i)]. As Ψ_k(i) = i by Step 2, and B_k[i] − 1 = 0, we obtain the claimed formula.

Next, suppose B_k[i] = 0 and let j = Ψ_k(i). By Step 2, we have SA_k[i] = SA_k[j] − 1 and B_k[j] = 1. Consequently, we can apply the previous case of our analysis to index j, and we get SA_k[j] = 2 · SA_{k+1}[rank_k(j)]. The claimed formula follows by replacing j with Ψ_k(i) and by noting that B_k[i] − 1 = −1.

We now give the main ideas to perform the compression of suffix array SA and support the lookup operations on its compressed representation.

• Level k, for each 0 ≤ k < ℓ, stores B_k, Ψ_k, and rank_k. We do not store SA_k, but we refer to it for the sake of discussion. The arrays Ψ_k and rank_k are not stored explicitly, but are kept in a specially compressed form to be described later.

• The last level k = ℓ stores SA_ℓ explicitly because it is sufficiently small to fit in O(n) bits. The ℓth level functionality of structures B_ℓ, Ψ_ℓ, and rank_ℓ is not needed as a result.

Procedure lookup(i). We define lookup(i) = rlookup(i, 0), where procedure rlookup(i, k) is described recursively in Figure 2. If k is the last level ℓ, then it performs a direct lookup in SA_ℓ[i]. Otherwise, it exploits Lemma 1 and the inductive hypothesis so that rlookup(i, k) returns the value in SA_k[i]. Further details on how to represent rank_k and Ψ_k in compressed form and how to implement compress and lookup(i) will be given in Section 5. Our main theorem below gives the resulting time and space complexity that we are able to achieve.

Theorem 1. Consider the suffix array SA built on a binary string of length n − 1.

i. We can implement compress in O(n log log n) bits and O(n) preprocessing time, so that each call lookup(i) takes O(log log n) time.

ii. We can implement compress in O(n) bits and O(n) preprocessing time, so that each call lookup(i) takes O(log^ε n) time for any constant ε > 0.

Remark 1. In each of the cases stated in Theorem 1, we can batch together j − i + 1 procedure calls lookup(i), lookup(i + 1), ..., lookup(j), so that the total cost is

• O(j − i + (log^2 n) log log n) time when the suffixes pointed to by SA[i] and SA[j] have the same first Ω(log^2 n) binary symbols in common, or

    procedure rlookup(i, k):
        if k = ℓ then
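Figure 2 is truncated here; the sketch below (ours) fills in the recursion that the text describes, applying Lemma 1 at each level and using the plain arrays produced by phase from the previous sketch. In the actual construction, Ψ_k and rank_k are not kept as plain arrays but in the compressed form detailed in Section 5.

```python
def compress_levels(sa: list[int], ell: int):
    """Apply ell phases of Steps 1-4, keeping (B_k, rank_k, psi_k) for each
    level k < ell and storing the last suffix array SA_ell explicitly.
    A plain-array stand-in for the paper's compressed representation."""
    levels = []
    for _ in range(ell):
        b, rank, psi, sa = phase(sa)          # `phase` from the previous sketch
        levels.append((b, rank, psi))
    return levels, sa                         # per-level data and SA_ell

def rlookup(i: int, k: int, levels, sa_ell) -> int:
    """Return SA_k[i] (1-based i), mirroring the recursive procedure rlookup."""
    if k == len(levels):                      # k = ell: direct lookup in SA_ell
        return sa_ell[i - 1]
    b, rank, psi = levels[k]
    j = psi[i - 1]                            # psi_k(i)
    # Lemma 1: SA_k[i] = 2 * SA_{k+1}[rank_k(psi_k(i))] + (B_k[i] - 1)
    return 2 * rlookup(rank[j - 1], k + 1, levels, sa_ell) + (b[i - 1] - 1)

def lookup(i: int, levels, sa_ell) -> int:
    return rlookup(i, 0, levels, sa_ell)

levels, sa_ell = compress_levels(SA0, ell=2)  # SA0 from the previous sketch
assert all(lookup(i, levels, sa_ell) == SA0[i - 1] for i in range(1, 33))
```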