Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching
Abstract

The proliferation of online text, such as on the World Wide Web and in databases, motivates the need for space-efficient index methods that support fast search. Consider a text T of n binary symbols to index. Given any query pattern P of m binary symbols, the goal is to search for P in T quickly, with T being fully scanned only once, namely, when the index is created. All indexing schemes published in the last thirty years support searching in Θ(m) worst-case time and require Θ(n) memory words (or Θ(n log n) bits), which is significantly larger than the text itself. In this paper we provide a breakthrough both in searching time and index space under the same model of computation as the one adopted in previous work. Based upon new compressed representations of suffix arrays and suffix trees, we construct an index structure that occupies only O(n) bits and compares favorably with inverted lists in space. We can search any binary pattern P, stored in O(m/log n) words, in only o(m) time. Specifically, searching takes O(1) time for m = o(log n), and O(m/log n + log^ε n) = o(m) time for m = Ω(log n) and any fixed 0 < ε < 1. That is, we achieve optimal O(m/log n) search time for sufficiently large m = Ω(log^{1+ε} n). We can list all the occ pattern occurrences in optimal O(occ) additional time when m = Ω(polylog(n)) or when occ = Ω(n^ε); otherwise, listing takes O(occ log^ε n) additional time.

∗ Supported in part by the Italian MURST project "Algorithms for Large Data Sets: Science and Engineering" and by the United Nations Educational, Scientific and Cultural Organization under contract UVO-ROSTE 875.631.9.
† Part of this work was done while the author was on sabbatical at I.N.R.I.A. in Sophia Antipolis, France. Supported in part by Army Research Office MURI grant DAAH04-96-1-0013 and by National Science Foundation research grants CCR-9522047 and CCR-9877133.

1 Introduction

A great deal of textual information is available in electronic form in databases and on the World Wide Web, and consequently, devising indexing methods to support fast search is a relevant research topic. Inverted lists and signature files are efficient indexes for texts that are structured as long sequences of words or keys. Inverted lists are theoretically and practically superior to signature files [49]. Their versatility allows for several kinds of queries (exact, boolean, ranked, and so on) whose answers have a variety of output formats.

Searching unstructured text for string matching queries, however, adds a new difficulty to text indexing. The set of candidate keys is much larger than that of structured texts because it consists of all possible substrings of the text. String matching queries look for the occurrences of a pattern string P of length m as any substring of a long text T of length n. We are interested in three types of queries: existential, counting, and enumerative. An existential query returns a boolean value that says if P is contained in T. A counting query computes the number occ of occurrences of P in T. An enumerative query outputs the list of occ positions where P occurs in T. In the rest of the paper, we assume that the strings are defined over a binary alphabet Σ = {a, b}. Our results extend to an alphabet of σ > 2 symbols by the standard trick of encoding each symbol with ⌈log σ⌉ bits. (The implied base of the log function is 2.)

The prominent data structures widely used in string matching, such as suffix arrays [35; 24], suffix trees [37; 46] and similar tries or automata [15], are more powerful than inverted lists and signature files when used in text indexing. The suffix tree for text T = T[1, n] is a compact trie whose leaves store pointers to the n suffixes in the binary text, namely T[1, n], T[2, n], ..., T[n, n], and whose internal nodes each have two children. The suffix array stores the pointers to the n suffixes in lexicographic order. It also keeps another array of longest common prefixes to speed up the search [35]. In this paper we refer to the suffix array as the plain array of pointers. Both data structures occupy Θ(n) memory words (or Θ(n log n) bits) in the unit cost RAM model. We can do existential and counting queries of P in T in O(m) time (using automata or suffix trees and their variations) and in O(m + log n) time (using suffix arrays along with longest common prefixes). Enumerative queries take an additional additive output-sensitive cost O(occ).
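To make the role of the plain suffix array concrete, the sketch below answers existential and counting queries by the classical binary search over the sorted suffix pointers; it is the simpler O(m log n) variant that omits the longest-common-prefix refinement cited above, and it is only an illustration (0-based positions, no end-of-string sentinel), not the compressed structure developed in this paper.

```python
def build_sa(text: str) -> list[int]:
    """Plain suffix array: starting positions of all suffixes of `text`,
    sorted lexicographically.  Naive construction, fine for a small demo."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_range(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Return [lo, hi) such that sa[lo:hi] lists exactly the suffixes having
    `pattern` as a prefix.  Two binary searches, O(m log n) comparisons."""
    m = len(pattern)
    def first(strictly_greater: bool) -> int:
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return first(False), first(True)

t = "abbabab"
sa = build_sa(t)
lo, hi = sa_range(t, sa, "ab")
print(hi - lo, sorted(sa[lo:hi]))   # counting (3) and enumerative ([0, 3, 5]) answers
```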
Indexes based upon suffix trees and suffix arrays and related data structures are especially efficient when several searches are to be performed, since the text T needs to be fully scanned only once, namely, when the indexes are created. The importance of suffix arrays and suffix trees is witnessed by numerous references to a great variety of applications besides string searching [4; 25; 35]. Their range of applications is growing in molecular biology, data compression, and text retrieval.

A major criticism that limits the applicability of indexes based upon suffix arrays and suffix trees is that they occupy significantly more space than inverted lists. Space occupancy is especially crucial for large texts. For a text of n binary symbols, suffix arrays use n words of log n bits each (a total of n log n bits), while suffix trees require between 4n and 5n words (or between 4n log n and 5n log n bits) [35]. In contrast, inverted lists require less than 0.1 n/log n words (or 0.1n bits) in many practical cases [38] in order to index a set of words consisting of a total of n bits. However, as previously mentioned, inverted files have less functionality than suffix arrays and suffix trees since only the words are indexed, whereas suffix arrays and suffix trees index all substrings of the text.

No data structures with the functionality of suffix trees and suffix arrays published in the literature to date use o(n) words (or o(n log n) bits) and support fast queries in the worst case. In order to remedy the space problem, we introduce compressed suffix arrays, which are abstract data structures supporting two basic operations:

1. compress: Given a suffix array SA, compress SA so as to represent it succinctly.

2. lookup(i): Given the compressed representation mentioned above, return SA[i], the pointer to the ith suffix in T in lexicographic order.

The primary measures of performance are the query time to do lookup, the amount of space occupied by the compressed suffix array, and the preprocessing time taken by compress.
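As a concrete rendering of this abstract data type (names are ours, not the paper's), the sketch below fixes the interface and includes the natural baseline that stores SA explicitly in n words and answers lookup in constant time; the goal of the paper is to replace this baseline by a representation using only O(n) or O(n log log n) bits.

```python
from abc import ABC, abstractmethod

class CompressedSuffixArrayADT(ABC):
    """The two operations of the abstract data type; illustrative names."""

    @abstractmethod
    def compress(self, sa: list[int]) -> None:
        """Preprocess suffix array SA into a succinct representation."""

    @abstractmethod
    def lookup(self, i: int) -> int:
        """Return SA[i], the pointer to the ith suffix (1-based) in
        lexicographic order."""

class ExplicitSuffixArray(CompressedSuffixArrayADT):
    """Natural baseline: n words of log n bits each, constant-time lookup."""

    def compress(self, sa: list[int]) -> None:
        self._sa = list(sa)            # stored verbatim, no compression

    def lookup(self, i: int) -> int:
        return self._sa[i - 1]         # direct array access
```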
Our main result is that we can implement operation compress in only O(n) bits and O(n) preprocessing time, so that each call to lookup takes sublogarithmic worst-case time, that is, O(log^ε n) time for any fixed constant ε > 0. We can also achieve O(n log log n) bits and O(n) preprocessing time, so that calls to lookup can be done in O(log log n) time.

Our findings have several important implications:

• To the best of our knowledge, ours is the first result successfully breaking the space barrier of n log n bits (or n words) for a full text index while retaining fast lookup in the worst case. We refer the reader to the literature described in Section 2.

• Our compressed suffix arrays are provably as good as inverted lists in terms of space usage, at least theoretically. No previous result supported this finding. In the worst case, both types of indexes require asymptotically the same number of bits; however, compressed suffix arrays have more functionality because they support search for arbitrary substrings.

• Compressed suffix trees can be implemented in O(n) bits by using compressed suffix arrays and the techniques for compact representation of Patricia tries presented in [42]. As a result, they occupy asymptotically the same space as that of the text string being indexed.

• A text index on T can be built in only O(n) bits by a suitable combination of our compressed suffix trees and previous techniques [12; 30; 42; 39]. This is the first result obtaining existential and counting queries of any binary pattern string of length m in o(m) time and O(n) bits. Specifically, searching takes O(1) time for m = o(log n), and O(m/log n + log^ε n) = o(m) time for m = Ω(log n) and any fixed 0 < ε < 1. That is, we achieve optimal O(m/log n) search time for sufficiently large m = Ω(log^{1+ε} n). For enumerative queries, retrieving all occ occurrences has optimal cost O(occ) time when m = Ω((log^3 n) log log n) or when occ = Ω(n^ε); otherwise, it takes O(occ log^ε n) time.

Outline of the paper. In the next section we review related work on string searching and text indexing. In Section 3 we describe the ideas behind our new data structure for compressed suffix arrays. In Section 4 we show how to use compressed suffix arrays to construct compressed suffix trees and a general space-efficient indexing mechanism for text search. Details of our compressed suffix array construction are given in Section 5. We adopt the standard unit cost RAM for the analysis of our algorithms, as does the previous work that we compare with. We use standard arithmetic and boolean operations on words of O(log n) bits, each operation taking constant time and each word read or written in constant time. We give final conclusions and comments in Section 6.

2 Previous Work on String Searching and Text Indexing

The seminal paper by Knuth, Morris, and Pratt [33] presented the first string matching solution taking O(m + n) time and O(m) words to scan the text. The space complexity was remarkably lowered to O(1) words in [22; 14]. A seminal paper by Weiner [46] introduced the suffix tree for solving the text indexing problem in string matching. Since then, a plethora of papers have studied the problem in several contexts and sometimes using different terminology [7; 8; 13; 19; 26; 37; 35; 45]; for more references see [4; 15; 25]. Although very efficient, the resulting index data structures are greedy of space, at least n words or Ω(n log n) bits.

Numerous papers faced the problem of saving space in these data structures, both in practice and in theory. Many of the papers were aimed at improving the lower-order terms, as well as the constants in the higher-order term, or at achieving a tradeoff between space requirements and search time complexity. Some authors improved the multiplicative constants in the O(n log n)-bit practical implementations. For the analysis of constants, we refer the reader to [3; 10; 23; 29; 34; 35]. Other authors devised several variations of sparse suffix trees to store a subset of the suffixes [2; 24; 32; 31; 36; 39]. Some of them wanted queries to be efficient when the occurrences are aligned with the beginnings of the
indexed suffixes. Sparsity saves much space but makes the search for arbitrary substrings difficult and, in the worst case, as expensive as scanning the whole text in O(m + n) time. Another interesting index, the Lempel-Ziv index of Kärkkäinen and Sutinen [30], occupies O(n) bits and takes O(m) time to search patterns shorter than log n; for longer patterns, it may occupy Θ(n log n) bits.

A recent line of research has been built upon Jacobson's succinct representation of trees in 2n bits, with navigational operations [27]. That representation was extended in [11] to represent a suffix tree in n log n bits plus an extra O(n log log n) expected number of bits. A solution requiring n log n + O(n) bits and O(m + log log n) search time was described in [12]. Munro et al. [42] used it along with an improved succinct representation of balanced parentheses [41] in order to get O(m) search time with only n log n + o(n) bits.

3 Compression of Suffix Arrays

The compression of suffix arrays falls into the general framework presented by Jacobson [28] for the abstract optimization of data structures. We start from the specification of our data structure as an abstract data type with its supported operations. We take the time complexity of the "natural" (and space inefficient) implementation of the data structure. Then, we define the class C_n of all distinct data structures storing n elements. A simple combinatorial argument implies that each such data structure can be canonically identified by log |C_n| bits. We try to give a succinct implementation of the same data structure in O(log |C_n|) bits, while supporting the operations within time complexity comparable with that of the natural implementation. However, the combinatorial argument does not guarantee that the operations can be supported efficiently.

We define the suffix array SA for a binary string T as an abstract data type that supports the two operations compress and lookup described in the introduction. We will adopt the convention that T is a binary string of length n − 1 over the alphabet {a, b}, and it is terminated in the nth position by a special end-of-string symbol #, such that a < # < b.¹ The suffix array SA is a permutation of {1, 2, ..., n} that corresponds to the lexicographic ordering of the suffixes in T; that is, SA[i] is the starting position in T of the ith suffix in lexicographic order. In the example below are the suffix arrays corresponding to the 16 binary strings of length 4:

    aaaa#   aaab#   aaba#   aabb#
    12345   12354   14253   12543

    abaa#   abab#   abba#   abbb#
    34152   13524   41532   15432

    baaa#   baab#   baba#   babb#
    23451   23514   42531   25143

    bbaa#   bbab#   bbba#   bbbb#
    34521   35241   45321   54321

¹ Usually an end-of-string symbol is not explicitly stored in T, but rather is implicitly represented by a blank symbol ⊥, with the ordering ⊥ < a < b. However, our use of # is convenient for showing the explicit correspondence between suffix arrays and binary strings.
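The short sketch below (ours, for illustration) builds suffix arrays under exactly this convention, treating # as a sentinel with a < # < b, and reproduces the table above; for instance, it returns 45321 for the string bbba#.

```python
ORDER = {"a": 0, "#": 1, "b": 2}      # the paper's ordering a < # < b

def suffix_array(t: str) -> list[int]:
    """1-based suffix array of a string over {a, b} terminated by '#'.
    Naive construction; enough to reproduce the 16 examples above."""
    rank_of = lambda i: [ORDER[c] for c in t[i - 1:]]   # suffix starting at position i
    return sorted(range(1, len(t) + 1), key=rank_of)

for s in ("aaaa#", "abab#", "bbba#", "bbbb#"):
    print(s, "".join(str(p) for p in suffix_array(s)))
# bbba# -> 45321, matching the 15th entry of the table
```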
The natural explicit implementation of suffix arrays requires O(n log n) bits and supports the lookup operation in constant time. The abstract optimization discussed above suggests that there is a canonical way to represent suffix arrays in O(n) bits. This observation follows from the fact that the class C_n of suffix arrays has no more than 2^{n−1} distinct members, as there are 2^{n−1} binary strings of length n − 1.

We use the intuitive correspondence between suffix arrays of length n and binary strings of length n − 1. According to the correspondence, given a suffix array SA, we can infer its associated binary string T and vice versa. To see how, let x be the entry in SA corresponding to the last suffix # in lexicographic order (that is, SA[x] = n). Then T must have the symbol a in each of the positions pointed to by SA[1], SA[2], ..., SA[x − 1], and it must have the symbol b in each of the positions pointed to by SA[x + 1], SA[x + 2], ..., SA[n]. For example, in the suffix array ⟨45321⟩ (the 15th of the 16 examples above), the suffix # corresponds to the second entry, 5. The preceding entry is 4, and thus the string T has an a in position 4. The subsequent entries are 3, 2, 1, and thus T must have bs in positions 3, 2, 1. The resulting string T, therefore, must be bbba#.

The abstract optimization does not say anything regarding the efficiency of the supported operations. By the correspondence above, we can define a trivial compress operation that transforms SA into a sequence of n − 1 bits, namely, string T. The drawback, however, is the unaffordable cost of lookup. It takes Θ(n) time to decompress a single pointer in SA, as it must build the whole suffix array on T from scratch. In other words, the trivial method proposed so far does not support efficient lookup operations.
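The trivial method can be sketched as follows (our illustration, not the construction developed in the paper): compress keeps nothing but the bits of T, and lookup has to rebuild the whole suffix array before it can return a single entry.

```python
def trivial_compress(sa: list[int]) -> str:
    """Recover the binary string T (without its trailing '#') from SA via the
    correspondence above: positions listed before the entry pointing to '#'
    receive a's, positions listed after it receive b's."""
    n = len(sa)
    x = sa.index(n)                     # the entry pointing to the last suffix '#'
    t = [""] * (n - 1)
    for p in sa[:x]:
        t[p - 1] = "a"
    for p in sa[x + 1:]:
        t[p - 1] = "b"
    return "".join(t)

def trivial_lookup(t: str, i: int) -> int:
    """Return SA[i] by rebuilding the whole suffix array of t + '#';
    correct, but each call costs at least linear time."""
    return suffix_array(t + "#")[i - 1]   # suffix_array from the previous sketch

assert trivial_compress([4, 5, 3, 2, 1]) == "bbba"
assert trivial_lookup("bbba", 1) == 4
```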
In this paper we give an elegant and efficient method to represent suffix arrays in O(n) bits. Our breakthrough idea is to distinguish among the permutations of {1, ..., n} by relating them to the suffixes of the corresponding strings, instead of studying them alone. We mimic a simple divide-and-conquer "de-construction" of the suffix arrays to define the permutations recursively in terms of shorter permutations. For some examples of divide-and-conquer construction of suffix arrays and suffix trees, see [5; 16; 17; 18; 35; 44]. We reverse the construction process to compress the permutations.

Our decomposition scheme is by a simple recursion mechanism. Let SA be the suffix array for binary string T. In the base case, we denote SA by SA_0, and let n_0 = n be the number of its entries. For simplicity in exposition, we assume that n is a power of 2.

In the inductive phase k ≥ 0, we start with suffix array SA_k, which is available by induction. It has n_k = n/2^k entries and stores a permutation of {1, ..., n_k}. We run four main steps to transform SA_k into an equivalent but more succinct representation:

Step 1. Produce a bit vector B_k of n_k bits, such that B_k[i] = 1 if SA_k[i] is even and B_k[i] = 0 if SA_k[i] is odd.

Step 2. Map each 0 in B_k onto its companion 1. (We say that a certain 0 is the companion of a certain 1 if the odd entry in SA_k associated with the 0 is 1 less than the even entry in SA_k associated with the 1.) We can denote this correspondence by a partial function Ψ_k, where Ψ_k(i) = j if and only if SA_k[i] is odd and SA_k[j] = SA_k[i] + 1. When defined, Ψ_k(i) = j implies that B_k[i] = 0 and B_k[j] = 1. It is convenient to make Ψ_k a total function by setting Ψ_k(i) = i when SA_k[i] is even (i.e., when B_k[i] = 1). In summary, for 1 ≤ i ≤ n_k, we have

    Ψ_k(i) =  j   if SA_k[i] is odd and SA_k[j] = SA_k[i] + 1;
              i   otherwise.

Step 3. Compute the number of 1s for each prefix of B_k. We use function rank_k for this purpose, such that rank_k(j) counts how many 1s there are in the first j bits of B_k.

Step 4. Pack together the even values from SA_k and divide each of them by 2. The resulting values form a permutation of {1, 2, ..., n_{k+1}}, where n_{k+1} = n_k/2 = n/2^{k+1}. Store them into a new suffix array SA_{k+1} of n_{k+1} entries, and remove the old suffix array SA_k.

The example in Fig. 1 illustrates the effect of a single application of Steps 1-4.

    i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
    T       a  b  b  a  b  b  a  b  b  a  b  b  a  b  a  a  a  b  a  b  a  b  b  a  b  b  b  a  b  b  a  #
    SA_0   15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30 12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
    B_0     0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1  1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
    rank_0  0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8  9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
    Ψ_0     2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16 17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27

    i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
    SA_1    8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11

    Figure 1: A single application of Steps 1-4 in phase k = 0.
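The following sketch (our Python rendering; values are kept 1-based as in the paper, while the Python lists are 0-indexed) carries out one phase of Steps 1-4 and reproduces the arrays of Fig. 1 when applied to SA_0.

```python
def phase(sa_k: list[int]):
    """One application of Steps 1-4 to SA_k, a permutation of {1, ..., n_k}.
    Returns (B_k, rank_k, psi_k, SA_{k+1}); entry i of the paper corresponds
    to list index i - 1 here."""
    # Step 1: mark the even entries of SA_k.
    b = [1 if v % 2 == 0 else 0 for v in sa_k]
    # Step 2: psi_k(i) = j with SA_k[j] = SA_k[i] + 1 if SA_k[i] is odd,
    #         and psi_k(i) = i otherwise (total function).
    pos = {v: i for i, v in enumerate(sa_k, start=1)}   # value -> 1-based index
    psi = [pos[v + 1] if v % 2 == 1 else i
           for i, v in enumerate(sa_k, start=1)]
    # Step 3: rank_k(j) = number of 1s among the first j bits of B_k.
    rank, ones = [], 0
    for bit in b:
        ones += bit
        rank.append(ones)
    # Step 4: keep the even values, halved, in their original order.
    sa_next = [v // 2 for v in sa_k if v % 2 == 0]
    return b, rank, psi, sa_next

SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
B0, rank0, psi0, SA1 = phase(SA0)
assert SA1 == [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
```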
The next lemma shows that these steps preserve the information originally kept in suffix array SA_k:

Lemma 1. Given suffix array SA_k, let B_k, Ψ_k, rank_k and SA_{k+1} be the result of the transformation performed by Steps 1-4 of phase k. We can reconstruct SA_k from SA_{k+1} by the following formula, for 1 ≤ i ≤ n_k,

    SA_k[i] = 2 · SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] − 1).

Proof. Suppose B_k[i] = 1. By Step 3, there are rank_k(i) 1s among B_k[1], B_k[2], ..., B_k[i]. By Step 1, SA_k[i] is even, and by Step 4, SA_k[i]/2 is stored in the rank_k(i)th entry of SA_{k+1}. In other words, SA_k[i] = 2 · SA_{k+1}[rank_k(i)]. As Ψ_k(i) = i by Step 2, and B_k[i] − 1 = 0, we obtain the claimed formula.

Next, suppose B_k[i] = 0 and let j = Ψ_k(i). By Step 2, we have SA_k[i] = SA_k[j] − 1 and B_k[j] = 1. Consequently, we can apply the previous case of our analysis to index j, and we get SA_k[j] = 2 · SA_{k+1}[rank_k(j)]. The claimed formula follows by replacing j with Ψ_k(i) and by noting that B_k[i] − 1 = −1.

We now give the main ideas to perform the compression of suffix array SA and support the lookup operations on its compressed representation.

• Level k, for each 0 ≤ k < ℓ, stores B_k, Ψ_k, and rank_k. We do not store SA_k, but we refer to it for the sake of discussion. The arrays Ψ_k and rank_k are not stored explicitly, but are kept in a specially compressed form to be described later.

• The last level k = ℓ stores SA_ℓ explicitly because it is sufficiently small to fit in O(n) bits. The ℓth level functionality of structures B_ℓ, Ψ_ℓ, and rank_ℓ is not needed as a result.

Procedure lookup(i). We define lookup(i) = rlookup(i, 0), where procedure rlookup(i, k) is described recursively in Figure 2. If k is the last level ℓ, then it performs a direct lookup in SA_ℓ[i]. Otherwise, it exploits Lemma 1 and the inductive hypothesis so that rlookup(i, k) returns the value in SA_k[i]. Further details on how to represent rank_k and Ψ_k in compressed form and how to implement compress and lookup(i) will be given in Section 5. Our main theorem below gives the resulting time and space complexity that we are able to achieve.

Theorem 1. Consider the suffix array SA built on a binary string of length n − 1.

i. We can implement compress in O(n log log n) bits and O(n) preprocessing time, so that each call lookup(i) takes O(log log n) time.

ii. We can implement compress in O(n) bits and O(n) preprocessing time, so that each call lookup(i) takes O(log^ε n) time for any constant ε > 0.

Remark 1. In each of the cases stated in Theorem 1, we can batch together j − i + 1 procedure calls lookup(i), lookup(i + 1), ..., lookup(j), so that the total cost is

• O(j − i + (log^2 n) log log n) time when the suffixes pointed to by SA[i] and SA[j] have the same first Ω(log^2 n) binary symbols in common, or

    procedure rlookup(i, k):
        if k = ℓ then
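Figure 2 is truncated here; the sketch below (ours) fills in the recursion that the text describes, applying Lemma 1 at each level and using the plain arrays produced by phase from the previous sketch. In the actual construction, Ψ_k and rank_k are not kept as plain arrays but in the compressed form detailed in Section 5.

```python
def compress_levels(sa: list[int], ell: int):
    """Apply ell phases of Steps 1-4, keeping (B_k, rank_k, psi_k) for each
    level k < ell and storing the last suffix array SA_ell explicitly.
    A plain-array stand-in for the paper's compressed representation."""
    levels = []
    for _ in range(ell):
        b, rank, psi, sa = phase(sa)          # `phase` from the previous sketch
        levels.append((b, rank, psi))
    return levels, sa                         # per-level data and SA_ell

def rlookup(i: int, k: int, levels, sa_ell) -> int:
    """Return SA_k[i] (1-based i), mirroring the recursive procedure rlookup."""
    if k == len(levels):                      # k = ell: direct lookup in SA_ell
        return sa_ell[i - 1]
    b, rank, psi = levels[k]
    j = psi[i - 1]                            # psi_k(i)
    # Lemma 1: SA_k[i] = 2 * SA_{k+1}[rank_k(psi_k(i))] + (B_k[i] - 1)
    return 2 * rlookup(rank[j - 1], k + 1, levels, sa_ell) + (b[i - 1] - 1)

def lookup(i: int, levels, sa_ell) -> int:
    return rlookup(i, 0, levels, sa_ell)

levels, sa_ell = compress_levels(SA0, ell=2)  # SA0 from the previous sketch
assert all(lookup(i, levels, sa_ell) == SA0[i - 1] for i in range(1, 33))
```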