
Opportunistic Data Structures with Applications

Paolo Ferragina*                         Giovanni Manzini†
Università di Pisa          Università del Piemonte Orientale

* Dipartimento di Informatica, Università di Pisa, 56100 Pisa, Italy. E-mail: [email protected]. Supported in part by the Italian MURST project "Algorithms for Large Data Sets: Science and Engineering" and by UNESCO grant UVO-ROSE 875.631.9.
† Dipartimento di Scienze e Tecnologie Avanzate, Università del Piemonte Orientale, 15100 Alessandria, Italy, and IMC-CNR, 56100 Pisa, Italy. E-mail: [email protected]. Supported in part by MURST 60% funds.

Abstract

In this paper we address the issue of compressing and indexing data. We devise a data structure whose space occupancy is a function of the entropy of the underlying data set. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible, and this space reduction is achieved at no significant slowdown in the query performance. More precisely, its space occupancy is optimal in an information-content sense because a text T[1,u] is stored using O(H_k(T)) + o(1) bits per input symbol in the worst case, where H_k(T) is the kth order empirical entropy of T (the bound holds for any fixed k). Given an arbitrary string P[1,p], the opportunistic data structure allows to search for the occ occurrences of P in T in O(p + occ log^ε u) time (for any fixed ε > 0). If data are uncompressible we achieve the best space bound currently known [12]; on compressible data our solution improves the succinct suffix array of [12] and the classical suffix tree and suffix array data structures either in space or in query time or both.

We also study our opportunistic data structure in a dynamic setting and devise a variant achieving effective search and update time bounds. Finally, we show how to plug our opportunistic data structure into the Glimpse tool [19]. The result is an indexing tool which achieves sublinear space and sublinear query time complexity.

1 Introduction

Data structure is a central concept in algorithmics and computer science in general. In the last decades it has been investigated from different points of view and its basic ideas enriched by new functionalities with the aim to cope with the features of the peculiar setting of use: dynamic, persistent, self-adjusting, implicit, fault-tolerant, just to cite a few. Space reduction in data structure design is an attractive issue, now more than ever before, because of the exponential increase of the electronic data nowadays available and because of its intimate relation with algorithmic performance improvements (see e.g. Knuth [16] and Bentley [5]). This has recently motivated an upsurging interest in the design of implicit data structures for basic searching problems (see [23] and references therein). The goal is to reduce as much as possible the auxiliary information kept together with the input data without introducing any significant slowdown in the query performance. However, the input data are represented in their entirety, thus taking no advantage of possible repetitiveness in them. The importance of these issues is well known to programmers, who typically use various tricks to squeeze data as much as possible while still achieving good query performance. Their approaches, though, boil down to heuristics whose effectiveness is witnessed only by experimentation.

In this paper we address the issue of compressing and indexing data by studying it in a theoretical framework. To the best of our knowledge no other result is known in the literature about the interplay between compression and indexing of data collections. The exploitation of data compressibility has been investigated only with respect to its impact on algorithmic performance, in the context of on-line algorithms (e.g. caching and prefetching [15, 17]), string-matching algorithms (see e.g. [1, 2, 9]), and sorting and computational geometry algorithms [8].

The scenario. Most of the research in the design of indexing data structures has been directed to devise solutions which offer a good trade-off between query and update time versus space usage. The two main approaches are word-based indices and full-text indices. The former achieve succinct space occupancy at the cost of being mainly limited to indexing linguistic texts [27]; the latter achieve versatility and guaranteed performance at the cost of requiring large space occupancy (see e.g. [10, 18, 21]). Some progress on full-text indices has recently been achieved [12, 23], but an asymptotically linear space seems unavoidable, and this makes word-based indices much more appealing when space occupancy is a primary concern. In this context compression always appears as an attractive choice, if not mandatory.
Processing speed is currently improving at a faster rate than disk speed. Since compression decreases the demand of storage at the expense of processing, it is becoming more economical to store data in a compressed form rather than uncompressed.

Starting from these promising considerations, many researchers have recently concentrated on the compressed matching problem, introduced in [1], as the task of performing string matching in a compressed text without decompressing it. A collection of algorithms is currently known to solve this problem efficiently (possibly optimally) on texts compressed by means of various schemes: e.g. run-length [1], LZ77 [9], LZ78 [2], Huffman [24]. All of these results, although asymptotically faster than the classical scan-based methods, rely on the scan of the whole compressed text and thus remain unacceptable for large text collections.

Approaches combining compression and indexing techniques are nowadays receiving more and more attention, especially in the context of word-based indices, achieving experimental trade-offs between space occupancy and query performance (see e.g. [4, 19, 27]). An interesting idea towards the direct compression of the index data structure has been proposed in [13, 14], where the properties of the Lempel-Ziv compression scheme have been exploited to reduce the number of index points while still supporting pattern searches. As a result, the overall index requires provably sublinear space, but at the cost of either limiting the search to q-grams [13] or worsening significantly the query performance [14].

A natural question arises at this point: Do full-text indices need a space occupancy linear in the (uncompressed) text size in order to support effective search operations on arbitrary patterns? It is a common belief [27] that some space overhead must be paid to use full-text indices, but is this actually a provable need?

Our Results. In this paper we answer the two questions above by providing a novel data structure for indexing and searching whose space occupancy is a function of the entropy of the underlying data set. The data structure is called opportunistic in that, although no assumption on a particular distribution is made, it takes advantage of the compressibility of the input data by decreasing the space occupancy at no significant slowdown in the query performance.¹ The data structure is provably space optimal in an information-content sense because it stores a text T[1,u] using O(H_k(T)) + o(1) bits per input symbol in the worst case (for any fixed k ≥ 0), where H_k(T) is the kth order empirical entropy. H_k expresses the maximum compression we can achieve using for each character a code which depends only on the k characters preceding it. We point out that in the case of an uncompressible string T the space occupancy is Θ(u) bits, which is actually optimal [12]; for a compressible string, our opportunistic data structure is the first to achieve sublinear space occupancy. Given an arbitrary pattern P[1,p], such an opportunistic data structure allows to search for the occ occurrences of P in T in O(p + occ log^ε u) time, for any fixed ε > 0.

¹ The concept of opportunistic algorithm has been introduced in [9] to characterize an algorithm which takes advantage of the compressibility of the text to speed up its (scan based) search operations. In our paper, we turn this concept into the one of opportunistic data structure.
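Since the kth order empirical entropy H_k is the yardstick used throughout the paper, a small illustration may help. The following Python fragment is a minimal sketch of the definition just given, not code from the paper: H_k averages the zeroth order entropy of the characters that follow each length-k context. The function names and the normalization by |T| are our own choices.

    import math
    from collections import Counter, defaultdict

    def h0(s):
        # Zeroth order empirical entropy of s, in bits per symbol.
        if not s:
            return 0.0
        n = len(s)
        return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

    def hk(text, k):
        # kth order empirical entropy: weighted H0 of the characters
        # following each length-k context (normalized by |text| here).
        if k == 0:
            return h0(text)
        followers = defaultdict(list)
        for i in range(len(text) - k):
            followers[text[i:i + k]].append(text[i + k])
        return sum(len(f) * h0(f) for f in followers.values()) / len(text)

    if __name__ == "__main__":
        for k in range(3):
            print(k, round(hk("mississippi", k), 3))  # decreases as k grows

On a repetitive string the printed values drop quickly with k, which is exactly the compressibility the opportunistic data structure exploits.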
The novelty of our approach resides in the careful combination of the Burrows-Wheeler compression algorithm [7] with the suffix array data structure [18] to obtain a sort of compressed suffix array. We indeed show how to augment the information kept by the Burrows-Wheeler algorithm in order to support effective random accesses to the compressed data without the need of uncompressing all of them at query time. We design two algorithms for operating on our opportunistic data structure. The first algorithm is an effective approach to search for an arbitrary pattern P[1,p] in a compressed suffix array, taking O(p) time in the worst case (Section 3.1). The second algorithm exploits compression to speed up the retrieval of the actual positions of the pattern occurrences, thus incurring only a sublogarithmic O(log^ε u) time slowdown, for any fixed ε > 0 (Section 3.2).

In some sense, our result can be interpreted as a method to compress the suffix array while still supporting effective searches for arbitrary patterns. In their seminal paper, Manber and Myers [18] introduced the suffix array data structure, showing how to search for a pattern P[1,p] in O(p + log u + occ) time in the worst case. The suffix array uses O(u log u) bits of storage. Recently, Grossi and Vitter [12] reduced the space usage of suffix arrays to Θ(u) bits at the cost of requiring O(log^ε u) time to retrieve the ith suffix. Hence, searching in this succinct suffix array via the classical Manber-Myers procedure takes O(p + log^{1+ε} u + occ log^ε u) time. Our solution therefore improves the succinct suffix array of [12] both in space and in query time complexity. The authors of [12] introduce also other hybrid indices which achieve better query-time complexity but still require Ω(u) bits of storage. As far as the problem of counting the pattern occurrences is concerned, our solution improves the classical suffix tree and suffix array data structures, because they achieve Ω(p) time complexity and occupy Ω(u log u) bits of storage.

In Section 4, we investigate the modifiability of our opportunistic data structure by studying how to choreograph its basic ideas with a dynamic setting. We show that a dynamic text collection Δ of size u can be stored in O(H_k(Δ)) + o(1) bits per input symbol (for any fixed k ≥ 0 and not very short texts), supporting insert operations on individual texts T[1,t] in O(t log u) amortized time, delete operations on T[1,t] in O(t log² u) amortized time, and searches for a pattern P[1,p] in O(p log³ u + occ log u) time in the worst case. We point out that even in the case of an uncompressible text T our space bounds are the best known ones, since the data structures in [12] do not support updates (the dynamic case is left open in their Section 4).

Finally, we investigate applications of our ideas to the development of novel text retrieval systems based on the concept of block addressing (first introduced in the Glimpse tool [19]). The notable feature of block addressing is that it can achieve both sublinear space overhead and sublinear query time, whereas inverted indices achieve only the second goal [4]. Unfortunately, up to now all the known block addressing indices [4, 19] achieve time and space sublinearity only under some restrictive conditions on the block size. We show how to use our opportunistic data structure to devise a novel block addressing scheme, called CGlimpse (standing for Compressed Glimpse), which always achieves time and space sublinearity.

2 Background

Let T[1,u] be a text drawn from a constant-size alphabet Σ. A central concept in our discussion is the suffix array data structure [18]. The suffix array A built on T[1,u] is an array containing the lexicographically ordered sequence of the suffixes of T, represented via pointers to their starting positions (i.e., integers). For instance, if T = ababc then A = [1,3,2,4,5]. Clearly A requires u log₂ u bits, actually a lot when indexing large text collections. It is a long standing belief that suffix arrays are uncompressible because of the "apparently random" permutation of the suffix pointers. Recent results in the data compression field have opened the door to revolutionary ways to compress suffix arrays and are basic tools of our data structure.

In [7], Burrows and Wheeler propose a transformation (BWT from now on) consisting of a reversible permutation of the text characters which gives a new string that is "easier to compress". The BWT tends to group together characters which occur adjacent to similar text substrings. This nice property is exploited by locally-adaptive compression algorithms, such as move-to-front coding [6], in combination with statistical (i.e. Huffman or Arithmetic coders) or structured coding models. The BWT-based compressors are among the best compressors currently available, since they achieve a very good compression ratio using relatively little time and space.

The reversible BWT. We distinguish between a forward transformation, which produces the string to be compressed, and a backward transformation, which gives back the original text from the transformed one. The forward BWT consists of three basic steps: (1) Append to the end of T a special character # smaller than any other text character; (2) form a conceptual matrix M whose rows are the cyclic shifts of the string T# sorted in lexicographic order; (3) construct the transformed text L by taking the last column of M. Notice that every column of M is a permutation of the last column L, and in particular the first column of M, call it F, is obtained by lexicographically sorting the characters in L.

There is a strong relation between the matrix M and the suffix array A of the string T. When sorting the rows of the matrix M we are essentially sorting the suffixes of T. Consequently, entry A[i] points to the suffix of T occupying (a prefix of) the ith row of M. The cost of performing the forward BWT is given by the cost of constructing the suffix array A, and this requires O(u) time [21].

The cyclic shift of the rows of M is crucial to define the backward BWT, which is based on two easy to prove observations [7]:

a. Given the ith row of M, its last character L[i] precedes its first character F[i] in the original text T, namely T = ··· L[i] F[i] ···.

b. Let L[i] = c and let r_i be the number of occurrences of c in the prefix L[1,i]. Let M[j] be the r_i-th row of M starting with c. Then the character in the first column F corresponding to L[i] is located at F[j]. We call this the LF-mapping (Last-to-First mapping) and set LF[i] = j.

We are now ready to describe the backward BWT:

1. Compute the array C[1...|Σ|], storing in C[c] the number of occurrences in T of the characters {#, 1, ..., c−1}. Notice that C[c] + 1 is the position of the first occurrence of c in F (if any).

2. Define the LF-mapping LF[1...u+1] as follows: LF[i] = C[L[i]] + r_i, where r_i equals the number of occurrences of the character L[i] in the prefix L[1,i] (see observation (b) above).

3. Reconstruct T backward as follows: set s = 1 and T[u] = L[1] (because M[1] = #T); then, for each i = u−1, ..., 1 do s = LF[s] and T[i] = L[s].

In [26] it is shown how to derive the suffix array A from L in linear time; however, in the context of pattern searching, the algorithm in [26] is no better than the known scan-based opportunistic algorithms (such as [9]). Nonetheless, the implicit presence of the suffix array A in L suggests taking full advantage of the structure of A for fast searching, and of the high compressibility of L for space reduction. This is actually the ultimate hope of any indexer: succinct and fast! In the next section we show that this result is achievable, provided that a sublogarithmic slowdown (with respect to the suffix array) is introduced in the cost of listing the pattern occurrences.
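As a concrete illustration of Section 2, here is a deliberately naive Python sketch of the suffix array and of the forward and backward BWT just described. The quadratic-time rotation sort and the dictionary-based LF-mapping are our simplifications, not the paper's O(u)-time construction.

    def suffix_array(t):
        # Naive construction: sort the (1-based) suffix start positions.
        return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

    def bwt_forward(t):
        # Append '#', sort all cyclic shifts, take the last column L.
        s = t + "#"
        rows = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(row[-1] for row in rows)

    def bwt_backward(l):
        # C[c]: number of characters of l strictly smaller than c.
        smaller, acc = {}, 0
        for c in sorted(set(l)):
            smaller[c] = acc
            acc += l.count(c)
        # LF[i] = C[L[i]] + (rank of this occurrence of L[i] in l[0..i]) - 1
        seen, lf = {}, []
        for c in l:
            seen[c] = seen.get(c, 0) + 1
            lf.append(smaller[c] + seen[c] - 1)
        # Row 0 is '#T', so its last character is T's final character;
        # each LF step then moves one position to the left in T.
        out, s = [l[0]], 0
        for _ in range(len(l) - 2):
            s = lf[s]
            out.append(l[s])
        return "".join(reversed(out))

    if __name__ == "__main__":
        t = "ababc"
        print(suffix_array(t))                   # [1, 3, 2, 4, 5]
        assert bwt_backward(bwt_forward(t)) == t

For T = ababc the suffix array matches the example above, and bwt_backward inverts bwt_forward exactly as in the three reconstruction steps.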
Let T^bw = bwt(T) denote the last column L, the output of the BWT. Our indexing data structure consists of a compressed version of T^bw together with some other auxiliary array-based data structures that support random access to T^bw. We compress T^bw in three steps (see also [20]):

1. Use a move-to-front coder, briefly mtf [6], to encode a character c via the count of distinct characters seen since its previous occurrence. The structural properties of T^bw mentioned above imply that the string T^mtf = mtf(T^bw) will be dominated by low numbers.

2. Encode each run of zeroes in T^mtf using run-length encoding (rle). More precisely, replace the sequence 0^m with the number (m+1) written in binary, least significant bit first, discarding the most significant bit. For this encoding we use two new symbols 0 and 1 (distinct from the mtf numbers), so that the resulting string T^rl = rle(T^mtf) is over the alphabet {0, 1, 1, 2, ..., |Σ|−1}.

3. Compress T^rl by means of a variable-length prefix code, called PC, which encodes the symbols 0 and 1 using two bits (10 for 0, 11 for 1), and the symbol i using a variable-length prefix code of 1 + 2⌊log(i+1)⌋ bits, the first one being a zero.
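The first two steps above can be illustrated with the following sketch, which implements mtf and the zero-run rle; the final prefix code PC of step 3 is omitted, and the two new run-length symbols are represented here as the Python strings "0" and "1". This is our own toy rendering of the pipeline, not the authors' code.

    def mtf_encode(s, alphabet):
        # Emit each character's current position in the list, then
        # move that character to the front.
        lst = list(alphabet)
        out = []
        for c in s:
            i = lst.index(c)
            out.append(i)
            lst.insert(0, lst.pop(i))
        return out

    def rle_zero_runs(codes):
        # Replace each maximal run 0^m by (m+1) in binary, least
        # significant bit first, with the most significant bit dropped;
        # "0" and "1" denote the two new run-length symbols.
        out, i = [], 0
        while i < len(codes):
            if codes[i] == 0:
                j = i
                while j < len(codes) and codes[j] == 0:
                    j += 1
                bits = bin(j - i + 1)[2:]          # m+1, MSB first
                out.extend("0" if b == "0" else "1"
                           for b in reversed(bits[1:]))
                i = j
            else:
                out.append(codes[i])
                i += 1
        return out

    if __name__ == "__main__":
        codes = mtf_encode("c#baab", sorted(set("c#baab")))
        print(codes)                 # zero runs mark repeated contexts
        print(rle_zero_runs(codes))

A run of three zeroes, for instance, becomes the two symbols "0","0" (binary of 4 is 100; drop the MSB and emit LSB first), which is why long zero runs shrink logarithmically.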
The resulting algorithm BWRLX = bwt + mtf + rle + PC is sufficiently simple that in the rest of the paper we can concentrate on the searching algorithm without being distracted by the details of the compression. Despite the simplicity of BWRLX, using the results in [20]² it is possible to show (proof in the full paper) that, for any k ≥ 0 and for any T, there exists a constant g_k such that

    |BWRLX(T)| ≤ 5 u H_k(T) + g_k                                   (1)

where H_k is the kth order empirical entropy. H_k expresses the maximum compression we can achieve using for each character a code which depends only on the k characters preceding it.

² The algorithm BWRLX corresponds to the procedure A' described in [20].

3 Searching in BWT-compressed text

Let T[1,u] denote an arbitrary text over the alphabet Σ, and let Z = BWRLX(T). In this section we describe an algorithm which, given a pattern P[1,p], reports all the occurrences of P in the uncompressed text T by looking only at Z and without uncompressing all of it. Our algorithm makes use of the relation between the suffix array A and the matrix M. Recall that the suffix array A possesses two nice structural properties which are usually exploited to support fast pattern searches: (i) all the suffixes of the text T prefixed by a pattern P occupy a contiguous portion (subarray) of A; (ii) that subarray has starting position sp and ending position ep, where sp is actually the lexicographic position of the string P among the ordered sequence of text suffixes.

3.1 Step I: Counting the occurrences

We now describe an algorithm, called BW-Count, which identifies the positions sp and ep by accessing only the compressed string Z and some auxiliary array-based data structures.

BW-Count consists of p phases, each one preserving the following invariant: at the ith phase, the parameter sp points to the first row of M prefixed by P[i,p] and the parameter ep points to the last row of M prefixed by P[i,p]. The pseudo-code is given in Fig. 1.

    Algorithm BW-Count(P[1,p])
    1.  c = P[p], i = p;
    2.  sp = C[c] + 1, ep = C[c+1];
    3.  while ((sp ≤ ep) and (i ≥ 2)) do
    4.      c = P[i−1];
    5.      sp = C[c] + Occ(c, 1, sp−1) + 1;
    6.      ep = C[c] + Occ(c, 1, ep);
    7.      i = i − 1;
    8.  if (ep < sp) then return "pattern not found"
        else return "found (ep − sp + 1) occurrences"

    Figure 1. Algorithm for counting the number of occurrences of P[1,p] in T[1,u].

In the first phase (i.e. i = p), sp and ep are determined via the array C defined in Section 2 (Step 2). The values sp and ep are updated at Steps 5 and 6 using the subroutine Occ(c, 1, k), which reports the number of occurrences of c in T^bw[1,k]. Note that at Steps 5 and 6 we are computing the LF-mapping for, respectively, the first and the last occurrence (if any) of P[i−1] in T^bw[sp,ep]. If at the generic ith phase we have ep < sp, we can conclude that P[i,p] does not occur in T, and hence neither does P. After the final phase, sp and ep will delimit the portion of M (and thus of the suffix array A) containing all the text suffixes prefixed by P. The integer (ep − sp + 1) will therefore account for the number of occurrences of P in T. The following lemma proves the correctness of BW-Count, assuming Occ works as claimed (proof in the full paper).

Lemma 1 For i = p, p−1, ..., 2, if P[i−1,p] occurs in T then Step 5 (resp. Step 6) of BW-Count correctly updates the value of sp (resp. ep), thus pointing to the first (resp. last) row prefixed by P[i−1,p]. □
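A direct transcription of Figure 1 into Python follows. To keep the sketch self-contained, Occ is a plain precomputed rank table over the uncompressed T^bw (O(u|Σ|) words of space), whereas the paper computes Occ in O(1) time over the compressed file Z; indices here are 0-based.

    def bw_count(pattern, l):
        # Backward search over the BWT string l (last column, '#'
        # included). Returns ep - sp + 1, the number of rows of M
        # prefixed by the pattern.
        chars = sorted(set(l))
        c_arr, acc = {}, 0                 # C[c]: # chars smaller than c
        for ch in chars:
            c_arr[ch] = acc
            acc += l.count(ch)
        occ = {ch: [0] * (len(l) + 1) for ch in chars}
        for k, ch in enumerate(l):         # occ[d][k]: # d's in l[0:k]
            for d in chars:
                occ[d][k + 1] = occ[d][k] + (d == ch)
        i = len(pattern) - 1
        c = pattern[i]
        if c not in c_arr:
            return 0
        sp = c_arr[c]                      # first row starting with c
        ep = c_arr[c] + occ[c][len(l)] - 1
        while sp <= ep and i >= 1:         # Steps 3-7 of Figure 1
            c = pattern[i - 1]
            if c not in c_arr:
                return 0
            sp = c_arr[c] + occ[c][sp]           # C[c] + Occ(c,1,sp-1) + 1
            ep = c_arr[c] + occ[c][ep + 1] - 1   # C[c] + Occ(c,1,ep)
            i -= 1
        return max(0, ep - sp + 1)

    if __name__ == "__main__":
        l = "c#baab"                 # bwt of "ababc" (Section 2 sketch)
        assert bw_count("ab", l) == 2
        assert bw_count("ba", l) == 1
        assert bw_count("ca", l) == 0

Each loop iteration performs exactly the two LF-mapping computations of Steps 5 and 6, so the whole count costs p calls to Occ, matching the O(p) bound claimed below once Occ runs in O(1) time.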
The running time of BW-Count depends on the cost of the procedure Occ. We now describe an algorithm for computing Occ(c, 1, k) in O(1) time on a RAM with word size O(log u) bits.

We logically partition the transformed string T^bw into substrings of ℓ characters each (called buckets), and denote them by BT_i = T^bw[(i−1)ℓ+1, iℓ], for i = 1, ..., u/ℓ. This partition naturally induces a partition of T^mtf into u/ℓ buckets BT_1^mtf, ..., BT_{u/ℓ}^mtf of size ℓ too. We assume that each run of zeroes in T^mtf is entirely contained in a single bucket, and we describe our algorithm for computing Occ(c, 1, k) under this simplifying assumption. The general case, in which a sequence of zeroes may span several buckets, is similar and its discussion is thus deferred to the full paper. Under our assumption, the buckets BT_i^mtf induce a partition of the compressed file Z into u/ℓ compressed buckets BZ_1, ..., BZ_{u/ℓ}, defined as BZ_i = PC(rle(BT_i^mtf)).

Let BT_i denote the bucket containing the character T^bw[k] (namely i = ⌈k/ℓ⌉). The computation of Occ(c, 1, k) is based on a hierarchical decomposition of T^bw[1,k] into three substrings as follows: (i) the longest prefix of T^bw[1,k] having length a multiple of ℓ² (i.e. BT_1 ··· BT_{ℓi*}, where i* = ⌊(k−1)/ℓ²⌋); (ii) the longest prefix of the remaining suffix having length a multiple of ℓ (i.e. BT_{ℓi*+1} ··· BT_{i−1}); and finally (iii) the remaining suffix of T^bw[1,k], which is indeed a prefix of the bucket BT_i. We compute Occ(c, 1, k) by summing the number of occurrences of c in each of these substrings. This can be done in O(1) time and sublinear space using the following auxiliary data structures.

For the calculations on the substring of point (i):

• For i = 1, ..., u/ℓ², the array NO_i[1, |Σ|] stores in the entry NO_i[c] the number of occurrences of the character c in BT_1 ··· BT_{ℓi}.

• The array W[1, u/ℓ²] stores in the entry W[i] the value Σ |BZ_j|, equal to the sum of the sizes of the compressed buckets BZ_1, ..., BZ_{ℓi}.

For the calculations on the substring of point (ii):

• For i = 1, ..., u/ℓ, the array NO'_i[1, |Σ|] stores in the entry NO'_i[c] the number of occurrences of the character c in the string BT_{ℓi*+1} ··· BT_{i−1} (this concatenated string has length less than ℓ²).

• The array W'[1, u/ℓ] stores in the entry W'[i] the value Σ |BZ_j|, equal to the overall size of the compressed buckets BZ_{ℓi*+1}, ..., BZ_{i−1} (the value is bounded above by O(ℓ²)).

For the calculations on the (compressed) buckets:

• The array MTF[1, u/ℓ] stores in the entry MTF[i] a picture of the state of the mtf list at the beginning of the encoding of BT_i. Each entry takes |Σ| log |Σ| bits (i.e. O(1) bits).

• The table S stores in the entry S[c, j, b, m] the number of occurrences of c among the first j characters of the compressed string b, assuming that m is the picture of the mtf list used to produce b. Thus, entry S[c, j, BZ_i, MTF[i]] stores the number of occurrences of c in BT_i[1,j]. Table S has O(ℓ 2^{ℓ'}) entries, each one occupying O(log ℓ) bits, where ℓ' is the maximum length of a compressed bucket.

The computation of Occ(c, 1, k) therefore proceeds as follows. First, the bucket BT_i containing the character T^bw[k] is determined via i = ⌈k/ℓ⌉, together with the position j = k − (i−1)ℓ of this character in BT_i and the parameter i* = ⌊(k−1)/ℓ²⌋. Then the number of occurrences of c in the prefix BT_1 ··· BT_{ℓi*} (point (i)) is determined via NO_{i*}[c], and the number of occurrences of c in the substring BT_{ℓi*+1}, ..., BT_{i−1} (point (ii)) is determined via NO'_i[c]. Finally, the compressed bucket BZ_i is retrieved from Z (notice that W[i*] + W'[i] + 1 is its starting position), and the number of occurrences of c within BT_i[1,j] is accounted for by accessing S[c, j, BZ_i, MTF[i]] in O(1) time. The sum of these three quantities gives Occ(c, 1, k).

By construction any compressed bucket BZ_i has size at most ℓ' = (1 + 2⌊log ℓ⌋)ℓ bits. We choose ℓ = Θ(log u) so that ℓ' = c log u with c < 1. Under this assumption, every step of Occ consists of arithmetic operations or table lookups involving O(log u)-bit operands. Consequently every call to Occ takes O(1) time on a RAM. As far as the space occupancy is concerned, the arrays NO and W take O((u/ℓ²) log u) = O(u/log u) bits. The arrays NO' and W' take O((u/ℓ) log ℓ) = O((u/log u) log log u) bits. The array MTF takes O(u/ℓ) = O(u/log u) bits. Table S consists of O(ℓ 2^{ℓ'}) entries of log ℓ bits each, and thus it occupies O(2^{ℓ'} ℓ log ℓ) = O(u^c log u log log u) bits, where c < 1. We conclude that the auxiliary data structures used by Occ occupy O((u/log u) log log u) bits (in addition to the compressed file Z).

Theorem 1 Let Z denote the output of the algorithm BWRLX on input T[1,u]. The number of occurrences of a pattern P[1,p] in T[1,u] can be computed in O(p) time on a RAM. The space occupancy is |Z| + O((u/log u) log log u) bits in the worst case. □
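The hierarchical decomposition behind Occ can be sketched as follows. The superblock (ℓ²) and block (ℓ) counters mirror the arrays NO and NO', while a plain scan of the final bucket stands in for the constant-time lookup in table S on the compressed bucket; the class and parameter names are our own.

    import math

    class OccIndex:
        # Two-level Occ structure in the spirit of Section 3.1:
        # absolute counts at superblock boundaries, relative counts at
        # block boundaries, then a scan inside one bucket.
        def __init__(self, tbw, l=None):
            self.tbw = tbw
            self.l = l or max(1, int(math.log2(len(tbw) + 1)))
            self.chars = sorted(set(tbw))
            self.abs_counts = []   # counts in tbw[0 : sb*l*l]
            self.rel_counts = []   # counts in tbw[sb*l*l : b*l]
            running = {c: 0 for c in self.chars}
            rel = {c: 0 for c in self.chars}
            for i in range(0, len(tbw), self.l):
                if (i // self.l) % self.l == 0:   # superblock boundary
                    self.abs_counts.append(dict(running))
                    rel = {c: 0 for c in self.chars}
                self.rel_counts.append(dict(rel))
                for ch in tbw[i:i + self.l]:
                    running[ch] += 1
                    rel[ch] += 1

        def occ(self, c, k):
            # Number of occurrences of c in tbw[0:k] (first k chars).
            if k <= 0 or c not in self.chars:
                return 0
            b = (k - 1) // self.l             # bucket holding char k-1
            sb = b // self.l                  # its superblock
            head = self.abs_counts[sb][c] + self.rel_counts[b][c]
            return head + self.tbw[b * self.l : k].count(c)

    if __name__ == "__main__":
        idx = OccIndex("c#baab", l=2)
        assert idx.occ("a", 5) == 2           # 'a's in "c#baa"

With ℓ = Θ(log u) the two count tables are sublinear, exactly as in the analysis above; only the final bucket scan would have to be replaced by the table-S lookup to reach O(1) time.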
3.2 Step II: Locating the occurrences

We now consider the problem of determining the exact position in the text T of all the occurrences of the pattern P[1,p]. This means that for s = sp, sp+1, ..., ep, we want to find the text position pos(s) of the suffix which prefixes the sth row M[s]. We propose two approaches: the first one is simple and slow, the second one is faster and relies on the very special properties of the string T^bw.

In the first algorithm we logically mark the rows of M which correspond to text positions having the form 1 + iη, for η = Θ(log² u) and i = 0, 1, ..., u/η. We store with these marked rows the starting positions of the corresponding text suffixes explicitly. This preprocessing is done at compression time. At query time we find pos(s) as follows. If s is a marked row, then there is nothing to be done and its position is directly available. Otherwise, we use the LF-mapping to find the row s' corresponding to the suffix T[pos(s)−1, u]. We iterate this procedure w times until s' points to a marked row; at that point pos(s') is available and we set pos(s) = pos(s') + w. The crucial point of the algorithm is the logical marking of the rows of M corresponding to the text suffixes starting at positions 1 + iη, i = 0, ..., u/η. Our solution consists in storing the row numbers in a two-level bucketing scheme. We partition the rows of M into buckets of size Θ(log² u) each. For each bucket, we take all the marked rows lying in it and store them into a Packed B-tree [3], using as a key their distance from the beginning of the bucket. Since a bucket contains at most O(log² u) keys, each O(log log u) bits long, membership queries take O(1) time on a RAM. The overall space required for the logical marking is O((u/η) log log u) bits. In addition, for each marked row we also keep the starting position of the corresponding text suffix (i.e. pos()), which requires additional O(log u) bits per marked row. Consequently, the overall space occupancy is O((u/η) log u) = O(u/log u) bits. As far as the time complexity is concerned, our algorithm computes pos(s) in at most η = Θ(log² u) steps, each taking constant time. Hence the occ occurrences of a pattern P in T can be retrieved in O(occ log² u) time, with a space overhead of O(u/log u) bits. Combining the results of this section with (1) we have:

Theorem 2 A text T[1,u] can be preprocessed in O(u) time so that all the occ occurrences of a pattern P[1,p] in T can be listed in O(p + occ log² u) time on a RAM. The space occupancy is bounded by 5 H_k(T) + O(log log u / log u) bits per input symbol in the worst case, for any fixed k ≥ 0. □
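The first, simple location algorithm can be rendered as the following sketch, which assumes the whole rotation matrix fits in memory; a Python dict plays the role of the Packed B-tree that stores the marked rows.

    def build_locate_index(t, eta):
        # Sort the cyclic shifts of t + '#', keep the last column l,
        # the LF-mapping, and the rows whose suffix starts at a text
        # position of the form 1 + i*eta (1-based).
        s = t + "#"
        order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
        l = "".join(s[(i - 1) % len(s)] for i in order)
        smaller, acc = {}, 0
        for c in sorted(set(l)):
            smaller[c] = acc
            acc += l.count(c)
        seen, lf = {}, []
        for c in l:
            seen[c] = seen.get(c, 0) + 1
            lf.append(smaller[c] + seen[c] - 1)
        marked = {r: order[r] + 1 for r in range(len(order))
                  if order[r] % eta == 0}
        return l, lf, marked

    def pos(row, lf, marked):
        # Walk backward in T via LF-steps until a marked row is hit;
        # each step moves one character left, so pos = stored pos + steps.
        steps = 0
        while row not in marked:
            row = lf[row]
            steps += 1
        return marked[row] + steps

    if __name__ == "__main__":
        l, lf, marked = build_locate_index("ababc", eta=2)
        print(sorted(pos(r, lf, marked) for r in range(len(l))))  # 1..6

With eta = Θ(log² u) only every eta-th position stores an explicit pos() value, which is the space/time trade-off stated in Theorem 2: at most eta LF-steps per reported occurrence.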
We now refine the above algorithm in order to compute pos(s) in O(log^ε u) time, for any fixed ε > 0. We still use the idea of marking some of the rows in M; however, we introduce some shortcuts which allow us to move in T by more than one character at a time, thus reducing the number of steps required to reach a marked position. The key ingredient of our new approach is a procedure for computing the LF-mapping over a string drawn from an alphabet A of non-constant size (proof and details in the full paper):

Lemma 2 Given a string Y[1,v] over an arbitrary alphabet A, we can compute the LF-mapping over Y^bw in O(log^ε v) time using O(v(1 + H_k(Y)) + |A|^{k+1}(log |A| + log v)) bits of storage, for any given ε > 0. □

We use Lemma 2 to compute pos(s) in O(log^{(1/2)+2ε} u) time; this is an intermediate result that will then be refined to achieve the final O(log^ε u) time bound.

At compression time we logically mark the rows of M which correspond to text positions of the form 1 + iγ, for i = 0, ..., u/γ and γ = Θ(log^{(1/2)+ε} u). Then we consider the string T_0 obtained by grouping the characters of T into blocks of size γ. Clearly T_0 has length u/γ and its characters belong to the alphabet Σ^γ. Let M_0 denote the cyclic-shift matrix associated with T_0; notice that M_0 consists of the marked rows of M. Now we mark the rows of M_0 corresponding to the suffixes of T_0 starting at the positions 1 + iη, for i = 0, ..., |T_0|/η and η = Θ(log^{(1/2)+ε} u). For these rows we explicitly keep the starting position of the corresponding text suffixes. To compute pos(s) we first compute the LF-mapping in M until we reach a marked row s'. Then we compute pos(s') by finding its corresponding row in M_0 and computing the LF-mapping in M_0 (via Lemma 2) until we reach a marked row s'' in M_0 (for which pos(s'') is explicitly available by construction). The marking of T and the counting of the number of marked rows in M that precede a given marked row s' (this is required in order to determine the position in M_0 of M[s']) can be done in constant time and O((u/γ) log log u) bits of storage using again a Packed B-tree and a two-level bucketing scheme as before. In addition, for Θ(|T_0|/η) rows of M_0 we keep explicitly their positions in T_0, which takes Θ((|T_0|/η) log u) = Θ(u/log^{2ε} u) bits of storage. The space occupancy of the procedure for computing the LF-mapping in T_0^bw is given by Lemma 2. Since H_k(T_0) ≤ γ H_{kγ}(T), a simple algebraic calculation yields that the overall space occupancy is O(H_k(T) + (log log u)/log^ε u) bits per input symbol, for any fixed k. The time complexity of the algorithm is O(γ) (for finding a marked row in M) plus O(η log^ε u) (for finding a marked row in M_0), thus O(log^{(1/2)+2ε} u) time overall.

The final time bound of O(log^ε u) for the computation of pos(s) can be achieved by iterating the approach above as follows. The main idea is to take γ_0 = Θ(log^ε u) and apply the procedure for computing the LF-mapping in T_0 for O(log^ε u) steps, thus identifying a row s_1 of the matrix M_0 such that pos(s_1) has the form 1 + iγ_1 with γ_1 = Θ(log^{2ε} u). Next, we define the string T_1 obtained by grouping the characters of T into blocks of size γ_1, and we consider the corresponding matrix M_1. By construction s_1 corresponds to a row in M_1 and we can iterate the above scheme. At the jth step we operate on the matrix M_{j−1} until we find a row s_j such that pos(s_j) has the form 1 + iγ_j, where γ_j = Θ(log^{(j+1)ε} u). This continues until j reaches the value ⌈1/ε⌉. At that point the matrix M_j consists of Θ(u/log^{1+δ} u) rows, where δ = ⌈1/ε⌉ε − 1. Since we can always choose ε so that δ > 0, we can store explicitly the starting positions pos() of the marked text suffixes in M_j using sublinear space, i.e. o(u) bits. Summing up, the algorithm computes pos(s) in ⌈1/ε⌉ = Θ(1) iterations, each taking O(log^ε u) time. Since ε is an arbitrary positive constant, we can rewrite the previous time bound as Θ(log^ε u). The space occupancy is dominated by the one required for the marking of M.

Theorem 3 A text T[1,u] can be indexed so that all the occ occurrences of a pattern P[1,p] in T can be listed in O(p + occ log^ε u) time on a RAM. The space occupancy is O(H_k(T) + (log log u)/log^ε u) bits per input symbol in the worst case, for any fixed k ≥ 0. □

4 Dynamizing our approach

Let Δ be a dynamic collection of texts {T_1, ..., T_m} having arbitrary lengths and total size u. The collection Δ may shrink or grow over time due to insert and delete operations which add to or remove from Δ an individual text string. Our aim is to store Δ in succinct space, perform the update operations efficiently, and support fast searches for the occurrences of an arbitrary pattern P[1,p] in Δ's texts. This problem can be solved in optimal time complexity and O(u log u) bits of storage [10, 21]. In the present section we aim at dynamizing our compressed index in order to keep Δ in reduced space while efficiently supporting update and search operations. Our result exploits an elegant technique proposed in [22, 25], here adapted to manage items of variable lengths (i.e. texts).

In the following we bound the space occupancy of our data structure in terms of the entropy of the concatenation of Δ's texts. A better overall space reduction might possibly be achieved by compressing the texts T_i separately. However, if the texts T_i have similar statistics, the entropy of the concatenated string is a reasonable lower bound to the compressibility of the collection. Furthermore, in the probabilistic setting where we assume that every text is generated by the same probabilistic source, the entropy of the concatenated string coincides with the entropy of the single texts and therefore provides a tight lower bound to the compressibility of the collection.

In the following we focus on the situation in which the length p of the searched pattern is O(u/log³ u), because for the other range of p's values the search operation can be implemented in a brute-force way by first decompressing the text collection and then searching for P in it using a scan-based string matching algorithm in O(p log³ u + occ) time. We partition the texts T_i into q = Θ(log² u) collections C¹, ..., C^q, each containing texts of overall length O(u/log² u). This is always possible, independently of the lengths of the text strings in Δ, since the upper bound on the length of the searchable patterns allows us to split very long texts (i.e. texts of length Ω(u/log² u)) into 2 log² u pieces overlapping by O(u/log³ u) characters. This covering of a single long text with many shorter ones still allows us to find the occurrences of the searched patterns.

Every collection C^h is then partitioned into a series of subsets S_i^h defined as follows: S_i^h contains some texts of C^h having overall length in the range [2^i, 2^{i+1}), where i = O(log u). Each set S_i^h is simultaneously indexed and compressed using our opportunistic data structure. Searching for an arbitrary pattern P[1,p] in Δ, with p = O(u/log³ u), can then be performed by searching for P in all the O(log³ u) subsets S_i^h via the compressed index built on each of them. This takes O(p log³ u + occ log^ε u) time overall.

Inserting a new text T[1,t] into Δ consists of inserting T into one of the sets C^h, the most empty one. Then the subset S_i^h is selected, where i = ⌊log t⌋, and T is inserted into it using the following approach. If S_i^h is empty, then the compressed index is built for T and associated with this subset, thus taking O(t) time. Otherwise the new set S_i^h ∪ {T} is formed and inserted in S_{i+1}^h. If the latter subset is not empty, then the insertion process is propagated until an empty subset S_{i+j}^h is found. At this point, the compressed index is built over the set S_i^h ∪ ··· ∪ S_{i+j−1}^h ∪ {T} by concatenating all the texts contained in this set to form a unique string; texts are separated by a special symbol (as usual). By noticing that these texts have overall length O(2^{i+j}), we conclude that this propagation process has a complexity proportional to the overall length of the moved texts. Although each single insertion may be very costly, we can amortize this cost by charging O(log u) credits per text character (since i, j = O(log u)), thus obtaining an overall amortized cost of O(t log u) to insert T[1,t] in Δ. Some care must be taken to evaluate the space occupied during the reconstruction of the set S_i^h. In fact, the construction of our compressed index over the set S_i^h requires the use of the suffix tree data structure (to compute the BWT) and thus O(2^i log 2^i) bits of auxiliary storage. This could be too much, but we ensured that every collection C^h contains texts having overall length O(u/log² u), so that at most O(u/log u) bits suffice to support any reconstruction process.
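The insertion scheme just described is an instance of the logarithmic method of [22, 25]. The sketch below keeps only its skeleton: texts are grouped into exponentially sized levels, and a colliding insertion merges levels upward before rebuilding one static index. The "index" here is a plain separator-joined string scanned naively; in the paper it would be the compressed index of Section 3, and the per-collection bookkeeping (the C^h and their length invariants) is simplified away.

    import math

    class DynamicIndex:
        # Level S_i holds texts of overall length about [2^i, 2^(i+1)).
        def __init__(self):
            self.levels = {}   # i -> list of texts in S_i
            self.index = {}    # i -> stand-in "index" for S_i

        def insert(self, text):
            i = max(0, int(math.log2(max(1, len(text)))))
            carry = [text]
            while i in self.levels:        # propagate to an empty level
                carry += self.levels.pop(i)
                self.index.pop(i, None)
                i += 1
            self.levels[i] = carry
            # Here one would run BWRLX over the concatenation (texts
            # separated by a special symbol); a plain join stands in.
            self.index[i] = "\x00".join(carry)

        def search(self, pattern):
            # Query every nonempty level; with the Section 3 index each
            # level would cost O(p + occ log^eps u), not a linear scan.
            hits = []
            for i, s in self.index.items():
                j = s.find(pattern)
                while j != -1:
                    hits.append((i, j + 1))
                    j = s.find(pattern, j + 1)
            return hits

    if __name__ == "__main__":
        d = DynamicIndex()
        for t in ["ababc", "banana", "abc"]:
            d.insert(t)
        print(d.search("ab"))

The amortization argument above corresponds to the while-loop: a text participates in at most O(log u) upward merges, each charging work proportional to its length.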
We now show how to support text deletions from Δ. The main problem here is that, on one side, we would like to physically cancel the texts in order to avoid the listing of ghost occurrences belonging to texts no longer in Δ; but, on the other side, a physical deletion would be too time-consuming to be performed on-the-fly. Amortization can still be used, but much care must be taken when answering a query to properly deal with texts which have been logically deleted from the S_i^h's. For the sake of presentation, let T^bw be the BWT of the texts stored in some set S_i^h. We store in a balanced search tree the set I_i^h of interval positions in T^bw occupied by deleted text suffixes. If a pattern occurrence is found in T^bw using our compressed index, we can check in O(log u) time if it is a real or a ghost occurrence. Every time a text T[1,t] must be deleted from S_i^h, we search for all of its suffixes in S_i^h and then update I_i^h accordingly in O(t log u) time. The additional space required to store the balanced search tree is O(|I_i^h| log u) = O(u/log u) bits, where we assume that the texts are physically deleted from S_i^h as soon as a fraction of O(1/log² u) of the suffixes is logically marked. Hence, each set S_i^h may undergo O(log² u) reconstructions before it shrinks enough to move back to the previous set S_{i−1}^h. Consequently the amortized cost of delete is O(t log u + t log² u) = O(t log² u), where the first term denotes the cost of updating I_i^h and the second term accounts for the credits to be left in order to pay for the physical deletions.

Finally, to identify a text to be deleted we append to every text in Δ an identifier of O(log u) bits, and we keep track of the subset S_i^h containing a given text via a table. This introduces an overhead of O(m log u) bits, which is o(u) if we reasonably assume that the texts are not too short, i.e. of ω(log u) bits each.

Theorem 4 Let Δ be a dynamic collection of texts {T_1, T_2, ..., T_m} having total length u. All the occ occurrences of a pattern P[1,p] in the texts of Δ can be listed in O(p log³ u + occ log u) time in the worst case. Operation insert adds a new text T[1,t] to Δ in O(t log u) amortized time. Operation delete removes a text T[1,t] from Δ in O(t log² u) amortized time. The space occupancy is O(H_k(Δ) + (m log u)/u) + o(1) bits per input symbol in the worst case, for any fixed k ≥ 0. □

5 A simple application

Glimpse [19] is an effective tool to index linguistic texts. From a high-level point of view, it is a hybrid between inverted files and scan-based approaches with no index. It relies on the observation that there is no need to index every word with an exact location (as occurs in inverted files); only pointers to an area where the word occurs (called a block) need to be maintained. Glimpse assumes that the text T[1,u] is logically partitioned into r blocks of size b each, and thus its index consists of two parts: a vocabulary V containing all the distinct words of the text, and, for each word w ∈ V, a list L(w) of the blocks where the word w occurs. This blocking scheme induces two space savings: pointers to word occurrences are shorter, and the occurrences of the same word in a single block are represented only once. Typically the index is very compact: 2-4% of the original text size [19].

Given this index structure, the search scheme proceeds in two steps: first the queried word w is searched in the vocabulary V, then all candidate blocks of L(w) are sequentially examined to find all of w's occurrences. Complex queries (e.g. approximate or regular expression searches) can be supported by using Agrep [28] both in the vocabulary and in the block searches. Clearly, the search is efficient if the vocabulary is small, if the query is selective enough, and if the block size is not too large. The first two requirements are usually met in practice, so the main constraint to the effective use of Glimpse remains the strict relation between block-pointer sizes and text sizes. Theoretical and experimental analyses of this block-addressing scheme [4, 19] have shown that the Glimpse approach is effective only for medium-sized texts (roughly up to 200Mb). Recent papers have tried to overcome this limitation by compressing each text block individually and then searching it via proper opportunistic string-matching algorithms [19, 24]. The experimental results showed an improvement of about 30-50% in the final performance, thus implicitly proving that the second searching step dominates Glimpse's query performance.

Our opportunistic index naturally fits in this block-addressing framework and allows us to extend its applicability to larger text databases. The new approach, named Compressed Glimpse (shortly CGlimpse), consists in using our opportunistic data structure to index each text block individually; this way, each candidate block is not fully scanned at query time, but its index is employed to speed up the detection of the pattern occurrences. In some sense CGlimpse is a compromise between a full-text index (like a suffix array) and a word-based index (like an inverted list) over a compressed text.

A theoretical investigation of the performance of CGlimpse is feasible using a model generally accepted in Information Retrieval [4]. It assumes the Heaps law to model the vocabulary size (i.e. V = Θ(u^β) with 0 < β < 1); the generalized Zipf law to model the frequency of words in the text collection (i.e. the ith largest frequency of a word is u/(i^θ H_V^{(θ)}), where H_V^{(θ)} is a normalization term and θ is a parameter larger than 1); and it assumes that Θ(u^ρ) is the number of matches for a given word with k ≥ 1 errors (where ρ < 1). Under these hypotheses we can show that CGlimpse achieves both sublinear space overhead and sublinear query time independently of the block size (proof in the full paper). Conversely, inverted indices achieve only the second goal [4], and classical Glimpse achieves both goals but under some restrictive conditions on the block size [4].
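To make the block-addressing framework concrete, here is a small sketch of a Glimpse-like two-step search. The per-block regex scan is exactly the spot where CGlimpse would instead query the block's opportunistic index; word tokenization, the handling of words straddling block boundaries, and all names are our simplifications.

    import re
    from collections import defaultdict

    def build_block_index(text, block_size):
        # Vocabulary V mapping each word to the set L(w) of blocks
        # containing it; blocks themselves are kept for step 2.
        blocks = [text[i:i + block_size]
                  for i in range(0, len(text), block_size)]
        vocab = defaultdict(set)
        for b, blk in enumerate(blocks):
            for w in re.findall(r"\w+", blk.lower()):
                vocab[w].add(b)
        return blocks, vocab

    def search_word(word, blocks, vocab, block_size):
        # Step 1: vocabulary lookup. Step 2: examine only the candidate
        # blocks, reporting global 1-based positions of the word.
        pat = re.compile(r"\b%s\b" % re.escape(word.lower()))
        hits = []
        for b in sorted(vocab.get(word.lower(), ())):
            for m in pat.finditer(blocks[b].lower()):
                hits.append(b * block_size + m.start() + 1)
        return hits

    if __name__ == "__main__":
        text = "to be or not to be"
        blocks, vocab = build_block_index(text, block_size=8)
        print(search_word("be", blocks, vocab, 8))

The two space savings described above are visible here: vocab stores block numbers rather than exact positions, and a word repeated within one block contributes a single entry to its list.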
6 Conclusions

Some issues remain to be investigated in various models of computation. In external memory, it would be interesting to devise a compressed index which takes advantage of the blocked access to the disk and thus achieves O(occ/B) I/Os for locating the pattern occurrences, where B is the disk-page size. In the RAM, it would be interesting to avoid the O(log^ε u) overhead incurred in the listing of the pattern occurrences. In the full paper we will show how to use known techniques (see e.g. [11]) for designing hybrid indices which achieve O(occ) retrieval time under restrictive conditions on either the pattern length or the number of pattern occurrences. Guaranteeing O(occ) retrieval cost in the general case is an open problem also in the uncompressed setting [12].

References

[1] A. Amir and G. Benson. Efficient two-dimensional compressed matching. In Proceedings of the IEEE Data Compression Conference, pages 279-288, 1992.
[2] A. Amir, G. Benson, and M. Farach. Let sleeping files lie: Pattern matching in Z-compressed files. Journal of Computer and System Sciences, 52(2):299-307, 1996.
[3] A. Andersson. Sorting and searching revisited. In R. G. Karlsson and A. Lingas, editors, Proceedings of the 5th Scandinavian Workshop on Algorithm Theory, pages 185-197. Springer-Verlag LNCS n. 1097, 1996.
[4] R. Baeza-Yates and G. Navarro. Block addressing indices for approximate text retrieval. Journal of the American Society for Information Science, 51(1):69-82, 2000.
[5] J. Bentley. Programming Pearls. Addison-Wesley, USA, 1989.
[6] J. Bentley, D. Sleator, R. Tarjan, and V. Wei. A locally adaptive compression scheme. Communications of the ACM, 29(4):320-330, 1986.
[7] M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
[8] S. Chen and J. Reif. Using difficulty of prediction to decrease computation: Fast sort, priority queue and convex hull on entropy bounded inputs. In Proceedings of the 34th IEEE Symposium on Foundations of Computer Science, pages 104-112, 1993.
[9] M. Farach and M. Thorup. String matching in Lempel-Ziv compressed strings. Algorithmica, 20(4):388-404, 1998.
[10] P. Ferragina and R. Grossi. The string B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46:236-280, 1999.
[11] P. Ferragina, S. Muthukrishnan, and M. de Berg. Multi-method dispatching: A geometric approach with applications to string matching problems. In Proceedings of the 31st ACM Symposium on the Theory of Computing, pages 483-491, 1999.
[12] R. Grossi and J. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium on Theory of Computing, 2000.
[13] J. Kärkkäinen and E. Sutinen. Lempel-Ziv index for q-grams. In J. Díaz and M. Serna, editors, Proceedings of the 4th European Symposium on Algorithms, pages 378-391. Springer-Verlag LNCS n. 1136, 1996.
[14] J. Kärkkäinen and E. Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In N. Ziviani, R. Baeza-Yates, and K. Guimarães, editors, Proceedings of the 3rd South American Workshop on String Processing, pages 141-155. Carleton University Press, 1996.
[15] A. Karlin, S. Phillips, and P. Raghavan. Markov paging (extended abstract). In Proceedings of the 33rd IEEE Symposium on Foundations of Computer Science, pages 208-217, 1992.
[16] D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, MA, USA, second edition, 1998.
[17] P. Krishnan and J. Vitter. Optimal prediction for prefetching in the worst case. SIAM Journal on Computing, 27(6):1617-1636, 1998.
[18] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
[19] U. Manber and S. Wu. GLIMPSE: A tool to search through entire file systems. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 23-32, 1994.
[20] G. Manzini. An analysis of the Burrows-Wheeler transform. In Proceedings of the 10th ACM-SIAM Symposium on Discrete Algorithms, pages 669-677, 1999. Full version at www.imc.pi.cnr.it/~manzini/tr-99-13/.
[21] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976.
[22] K. Mehlhorn and M. H. Overmars. Optimal dynamization of decomposable searching problems. Information Processing Letters, 12(2):93-98, 1981.
[23] J. I. Munro. Succinct data structures. In Proceedings of the 19th Conference on Foundations of Software Technology and Theoretical Computer Science. Springer-Verlag LNCS n. 1738, 1999.
[24] G. Navarro, E. de Moura, M. Neubert, N. Ziviani, and R. Baeza-Yates. Adding compression to block addressing inverted indexes. Information Retrieval Journal, 2000 (to appear).
[25] M. H. Overmars and J. van Leeuwen. Worst-case optimal insertion and deletion methods for decomposable searching problems. Information Processing Letters, 12(4):168-173, 1981.
[26] K. Sadakane. A modified Burrows-Wheeler transformation for case-insensitive search with application to suffix array compression. In Proceedings of the IEEE Data Compression Conference, 1999.
[27] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, second edition, 1999.
[28] S. Wu and U. Manber. AGREP, a fast approximate pattern-matching tool. In Proceedings of the USENIX Winter 1992 Technical Conference, pages 153-162, 1992.