Better External Memory Suffix Array Construction-05
Better External Memory Suffix Array Construction-05
1
alphabet Σ = [1, n]. The restriction to the alphabet Overview: In Section 2 we present the doubling al-
[1, n] is not a serious one. For a string T over any gorithm [3, 7] for suffix array construction that has
alphabet, we can first sort the characters of T , re- I/O complexity O(sort(n log maxlcp)). This algo-
move duplicates, assign a rank to each character, and rithm sorts strings of size 2k in the k-th iteration.
construct a new string T 0 over the alphabet [1, n] by Our variant already yields some small optimization
renaming the characters of T with their ranks. Since opportunities.
the renaming is order preserving, the order of the Using this simple algorithm as an introductory ex-
suffixes does not change. A similar technique called ample, Section 3 then introduces the technique of
lexicographic naming will play an important role in pipelined processing of sequences in a systematic way
all of our algorithms where a string (e.g., a substring which saves a factor at least two in I/Os for many
of T ) is replaced by its rank in some set of strings. external algorithms and is supported by our external
There is a special character $ which is smaller than memory library Stxxl . The main technical result
any character in the alphabet. We use the convention of this section is a theorem that allows easy anal-
that T [i] = $ if i ≥ n. Ti = T [i, n) denotes the i-th ysis of the I/O complexity of pipelined algorithms.
suffix of T . The suffix array SA of T is a permutation This theorem is also applied to more sophisticated
of [0, n) such that TSA[i] < TSA[j] whenever 0 ≤ i < construction algorithms presented in the subsequent
j < n. Let lcp(i, j) denote the longest common prefix sections.
length of SA[i] and SA[j] (lcp(i, j) = 0 if i < 0 or Section 4 gives a simple and efficient way to dis-
j ≥ n). Then we get the following derived quantities card suffixes from further iterations of the doubling
that can be used to characterize the “difficulty” of an algorithm whose position in the suffix array is al-
input or that will turn out to play such a role in our ready known. This leads to an algorithm with
analysis. ↔
I/O complexity O(sort(n log lcp )) improving on a
previous discarding algorithm with I/O complexity
maxlcp := max lcp(i, i + 1) (1) ↔
O(sort(n log lcp ) + scan(n log maxlcp)) [7]. A further
0≤i<n
1 X constant factor is gained in Section 5 by considering
lcp := lcp(i, i + 1) (2) a generalization of the doubling technique that sorts
n
0≤i<n
½ ¾ strings of size ak in iteration k. The best multiplica-
↔ lcp(i − 1, i), tion factor is four (quadrupling) or five. A pipelined
lcp(i) := max (3)
lcp(i, i + 1) optimal algorithm with I/O complexity O(sort(n))
↔ 1 X
↔ in Section 6 concludes our sequence of suffix array
log lcp := log(1 + lcp (i)) (4)
n construction algorithms.
0≤i<n
A useful tool for testing our implementations was
The I/O model [25] assumes a machine with fast a fast and simple external memory checker for suffix
memory of size M words and a secondary memory arrays described in Section 7.
that can be accessed by I/Os to blocks of B con- In Section 8 we report on extensive experiments
secutive words on each of D disks [25]. Our al- using synthetic difficult inputs, the human genome,
gorithms use words of size dlog ne bits for inputs English books, web-pages, and program source code
of size n. Sometimes it is assumed that an addi- using inputs of up to 4 GByte on a low cost machine.
tional bit can be squeezed in somewhere. We express The theoretically optimal algorithm turns out to be
all our I/O complexities in terms of the shorthands the winner closely followed by quadrupling with dis-
scan(x) = dx/DBe for sequentially l reading
m or writing carding.
2n n Section 9 summarizes the overall results and dis-
x words and sort(x) ≈ DB logM/B M for sorting x
words of data (not counting the 2scan(x) I/Os for cusses how even larger suffix arrays could be build.
reading the input and writing the output). The appendix contains further details that will be
Our algorithms are described using high level Pas- part of the full paper.
cal like pseudocode mixed with mathematical nota-
tion. The scope of control structures is determined More Related Work The first I/O optimal al-
by indentation. We extend set notation to sequences gorithm for suffix array construction [11] is based
in the obvious way. For example [i : i is prime] = on suffix tree construction and introduced the ba-
h2, 3, 5, 7, 11, 13, . . .i in that order. sic divide-and-conquer approach that is also used by
2
DC3. However, the algorithm from [11] is so compli- scalable solution to this problem. Once this step is
cated that an implementation looks not promising. made, space consumption is less of an issue because
There is an extensive implementation study for ex- disk space is two orders of magnitude cheaper than
ternal suffix array construction by Crauser and Fer- RAM.
ragina [7]. They implement several nonpipelined vari- The biggest suffix array computations we are aware
ants of the doubling algorithm [3] including one that of are for the human genome [23, 20]. One [20] com-
discards unique suffixes. However, this variant of putes the compressed suffix array on PC with 3GByte
discarding still needs to scan all unique tuples in of memory in 21 h. Compressed suffix arrays work
each iteration. With our analysis, one would get well in this case (they need only 2 GByte of space)
↔
O(sort(n) log lcp + scan(n) log maxlcp) I/Os. Our because the small alphabet size present in genomic
discarding algorithm eliminates the second term that information enables efficient compression. The other
dominates the I/O volume for many inputs. Inter- implementation [23] uses a supercomputer with 64
estingly, an algorithm that fares very well in the GByte of memory and needs 7 hours. Our algorithms
study
¡ N of [7] ¢is the GBS-algorithm [12] that takes have comparable speed using external memory.
O M scan(n) I/Os in the best case and has dismal Pipelining to reduce I/Os is well known technique
worst case performance1 . In iteration i, the GBS- in executing database queries [24]. However, previ-
algorithm sorts the suffixes Ti for i ∈ [in, (i + 1)n) ous algorithm libraries for external memory [4, 8] do
and then merges them with the previously sorted suf- not support it. We decided quite early in the design
fixes. The GBS-algorithm can have favourable I/O of our library Stxxl [9] that we wanted to remove
volume if N/M is a small constant. We have not this deficit. Since suffix array construction can profit
implemented this algorithm not only because more immensely from pipelining and since the different al-
scalable algorithms are more interesting but also be- gorithms give a rich set of examples, we decided to
cause all our algorithmic improvements (pipelining, use this application as a test bed for a prototype im-
discarding, quadrupling, the DC3-algorithm) add to plementation of pipelining.
a dramatic reduction in I/O volume and are not
applicable to the GBS-algorithm. Hence it is pre-
dictable that the range where the GBS-algorithm is 2 A Doubling Algorithm
interesting would get much smaller. Moreover, the
GBS-algorithm needs a local suffix array search for Figure 1 gives pseudocode for the doubling algorithm.
each suffix scanned so that it is quite expensive with The basic idea is to replace characters T [i] of the in-
respect to internal work. Our system (multiple mod- put by lexicographic names that respect the lexico-
ern disks controlled by a performance oriented library graphic order of the length 2k substring T [i, i + 2k )
[9]) supports disk I/O at a speed up to one third of in the k-th iteration. In contrast to previous exter-
its memory bandwidth [10] so that the high internal nal implementation of this algoirthm, our formulation
cost makes the GBS-algorithm even more question- never actually builds the resulting string of names.
able for the present study. Nevertheless it should be Rather, it manipulates a sequence P of pairs where
kept in mind that the GBS-algorithm might be inter- each character c is tagged with its position i in the
esting for small inputs and fast machines with slow input. To obtain names for the next iteration k + 1,
I/O. the names for T [i, i + 2k ) and T [i + 2k , i + 2k+1 ) to-
There has been considerable interest in space ef- gether with the position i are stored in a sequence
ficient internal memory algorithms for constructing S and sorted. The new names can now be ob-
suffix arrays [22, 5] and even more compact full-text tained by scanning this sequence and comparing ad-
indexes [20, 13, 14]. We view this as an indication jacent tuples. Sequence S can be build using con-
that internal memory is too expensive for the big suf- secutive elements of P if we sort P using they pair
fix arrays one would like to build. Going to external (i mod 2k , i div 2k ).2 Previous formulations of the
memory can be viewed as an alternative and more algorithm use i as a sorting criterion and therefore
have to access elements that are 2k characters apart.
1
There is also an variant of the GBS-algorithm that gives Our approach saves I/Os and simplifies the pipelining
the best case bound in the worst case [7]. But this algorithm
needs a constant factor more passes over the input and hence 2
(i mod 2k , i div 2k ) can also be computed using a single
might be slower in practice. left rotation by k-bits of the binary representation of i.
3
be viewed as a data flow through a directed acyclic
Function doubling(T )
graph G = (V = F ∪ S ∪ R, E). The file nodes F rep-
S:= [((T [i], T [i + 1]), i) : i ∈ [0, n)] (0)
resent data that has to be stored physically on disk.
for k := 1 to dlog ne do
When a file node f ∈ F is accessed we need a buffer
sort S (1)
of size b(f ) = Ω (BD). The streaming nodes s ∈ S
P := name(S) (2)
read zero, one or several sequences and output zero,
invariant ∀(c, i) ∈ P :
one or several new sequences using internal buffers
c is a lexicographic name for T [i, i + 2k )
of size b(s).3 The Sorting nodes r ∈ R read a se-
if the names in P are unique then
quence and output it in sorted order. Sorting nodes
return [i : (c, i) ∈ P ] (3)
have a buffer requirement of b(r) = Θ(M ) and out-
sort P by (i mod 2k , i div 2k )) (4)
0 degree one4 . Edges are labeled with the number of
S:= h((c, c ), i) : j ∈ [0, n), (5)
machine words w(e) flowing between two nodes. In
(c, i) = P [j], (c0 , i + 2k ) = P [j + 1]i
the proof of Theorem 3 you find the flow graph for the
Function name(S : Sequence of Pair ) pipelined doubling algorithm. We will see somewhat
q:= r:= 0; (`, `0 ):= ($, $) more complicated graph in Sections 6 and 4. The
result:= hi following theorem (proven in Appendix B) gives nec-
foreach ((c, c0 ), i) ∈ S do essary and sufficient conditions for an I/O efficient
q++ execution of such a data flow graph. Moreover, it
if (c, c0 ) 6= (`, `0 ) then r:= q; (`, `0 ):= (c, c0 ) shows that streaming computations can be scheduled
append (r, i) to result completely systematically in an I/O efficient way.
return result
Theorem 2. The computations of a data flow graph
G = (V = F ∪ S ∪ R, E) with edge flows w : E → R+
Figure 1: The doubling algorithm. and buffer requirements b : V → R+ can be executed
using
optimization described in Section 3. X X
scan(w(e)) + sort(w(e)) (5)
The algorithm performs a constant number of sort-
e∈E∩(F ×V ∪V ×F ) e∈E∩(V ×R)
ing and scanning operations for sequences of size n
in each iteration. The number of iterations is deter- I/Os iff the following conditions are fulfilled. Con-
mined by the logarithm of the longest common prefix. sider the graph G0 which is a copy of G except that
edges between streaming nodes are replaced by bidi-
Theorem 1. The doubling algorithm computes a suf-
rected edges. The strongly connected components
fix array using O(sort(n) dlog maxlcpe) I/Os.
(SCCs) of G0 are required to be either single file
nodes, single sorting nodes, or sets of streaming
3 Pipelining nodes. The total buffer requirement of each SCC C
of streaming nodes plus the buffer requirements of the
The I/O volume of the doubling algorithm from Fig- nodes directly connected to C has to be bounded by the
ure 1 can be reduced significantly by observing that internal memory size M .
rather than writing the sequence S to external mem-
Theorem 2 can be used to design and analyze
ory, we can directly feed it to the sorter in Line (1).
pipelined external memory algorithms in a system-
Similarly, the sorted tuples need not be written but
atic way. All we have to do is to give a data flow
can be directly fed into the naming procedure in
graph that fulfills the requirements and we can then
Line (2) which can in turn forward it to the sorter
read off the I/O complexity. Using the relations
in Line (4). The result of this sorting operation need
a · scan(x) = scan(a · x) + O(1) and a · sort(x) ≤
not be written but can directly yield tuples of S that
can be fed into the next iteration of the doubling al- 3
Streaming nodes may cause additional I/Os for internal
gorithm. Appendix A gives a simplified analysis of processing, e.g., for large FIFO queues or priority queues.
These I/Os are not counted in our analysis.
this example for pipelining. 4
We could allow additional outgoing edges at an I/O cost
Let us discuss a more systematic model: The com- n/DB. However, this would mean to perform the last phase of
putations in many external memory algorithms can the sorting algorithm several times.
4
sort(a · x) + O(1), we can represent the result in the nodes continue in the same mode, pulling the inputs
form scan(x) + sort(y) + O(1), i.e., we can character- needed to produce an output element. The process
ize the complexity in terms of the sorting volume x terminates when the result of the topologically lat-
and the scanning volume y. One could further eval- est node is computed. To support nodes with more
uate this function by plugging in the I/O complexity than one output, Stxxl exposes an interface where
of a particular sorting algoithm (e.g., ≈ 2x/DB for a node generates output accessible not only via the *
x ¿ M 2 /DB and M À DB) but this may not be operator but a node can also push an output element
desirable because we loose information. In particu- to output nodes.
lar, scanning implies less internal work and can usu- The library already offers basic generic classes
ally be implemented using bulk I/Os in the sense of which implement the functionality of sorting, file, and
[7] (we then need larger buffers b(v) for file nodes) streaming nodes.
whereas sorting requires many random accesses for
information theoretic reasons [2].
Now we apply Theorem 2 to the doubling algo-
4 Discarding Unique Suffixes
rithm: The procedure name in Figure 1 assigns a rank to
Theorem 3. The doubling algorithm from each suffix as one plus the number of suffixes with
Figure 1 can be implemented to run using strickly smaller prefix of length 2k . If a suffix has a
sort(5n) dlog(1 + maxlcp)e + O(scan(n)) I/Os. unique prefix of length 2k , the rank assigned to it will
not change in later iterations. The idea of discard-
Proof. The following flow graph shows that each iter- ing is to remove the pairs representing such finished
ation can be implemented using sort(2n)+sort(3n) ≤ suffixes from P and S, thus reducing the work and
sort(5n) I/Os. The numbers refer to the line numbers I/O in later iterations. This approach is particularly
↔
in Figure 1 effective when log lcp ¿ log maxlcp, since while the
algorithm still performs dlog maxlcpe iterations, an
3n 2n streaming node ↔
average suffix is involved in only log lcp of them.
1 2 4 5 sorting node
There are two places in the algorithm in Figure 1
After dlog(1 + maxlcp)e iterations, the algorithm fin- that are disturbed by discarding. The first one is in
ishes. The O(sort(n)) term accounts for the I/Os function name, where the rank can no more be com-
needed in Line 0 and for computing the final result. puted by simple counting. The solution is to take the
Note that there is a small technicality here: Although rank from the previous iteration (the number suffixes
naming can find out “for free” whether all names are with strickly smaller prefix of length 2k−1 ) and add
unique, the result is known only when naming fin- to it the number of suffixes with the same prefix of
ishes. However, at this time, the first phase of the length 2k−1 but smaller prefix of length 2k [7]. The
sorting step in Line 4 has also finished and has al- modified function name2 is given in Figure 3. Note
ready incurred some I/Os. Moreover, the convenient that name2 cannot be used in the first iteration when
arrangement of the pairs in P is destroyed now. How- no ranks have been computed yet.
ever we can then abort the sorting process, undo the The second problem with discarding is on Line (5)
wrong sorting, and compute the correct output. in Figure 1, where ranks of discarded suffixes may be
needed as the component c0 in S. As a correction,
In Stxxl the data flow nodes are implemented the discarded suffixes are included when computing
as objects with an interface similar to the STL input S (a scan), but excluded during the rest of the al-
iterators [9]. A node reads data from input nodes gorithm (including all the sorting steps). Up to this
using their * operators. With help of their preincre- point, the algorithm corresponds to the one in [7].
ment operators a node proceeds to the next elements As an additional optimization, we will identify suf-
of the input sequences. The interface also defines fixes that are not needed even in the computation of
an empty() function which signals the end of the se- S and store them separately to wait until the end of
quence. After creating all node objects, the compu- the algorithm. The rule to identify these fully dis-
tation starts in a “lazy” fashion, first trying to eval- carded suffixes is simple: if a rank was not used in
uate the result of the topologically latest node. The iteration k as a component of S, it will not be used in
node reads its input nodes element by element. Those later iterations either. Figure 3 gives the final algo-
5
input 3N 2N 2n output
1 2,10 3,11 4 5 6 7 8 9
2n P 2n
Figure 2: Data flow graph for the doubling + discarding. The numbers refer to line numbers in Figure 3.
↔
The edge weights are sums over the whole execution with N = n log lcp
rithm. A slightly different algorithm with the same Theorem 4. Doubling with discarding can be im-
↔
asymptotic complexity is described in [15]. plemented to run using sort(5n log lcp) + O(sort(n))
I/Os.
Function doubling + discarding(T ) Proof. We prove the theorem by showing that the
S:= [((T [i], T [i + 1]), i) : i ∈ [0, n)] (1) total amount of data in the different steps of the al-
sort S (2) gorithm over the whole execution is as in the data
U := name(S) // undiscarded (3) flow graph in Figure 2. The nontrivial points are
↔
P := hi // partially discarded that N = n log lcp tuples are processed in all sorting
F := hi // fully discarded steps together and that at most n tuples are writ-
for k := 1 to dlog ne do ten to P . The former follows from the fact that a
mark unique names in U (4) suffix i is involved in the sorting steps as long as
sort U by (i mod 2k , i div 2k ) (5) it has a non-unique rank, which happens in exactly
↔
merge P into U ; P := hi (6) dlog(1 + lcp (i))e iterations. To show the latter, we
S:= hi; discard := 1 note that a tuple (c, i) is written to P in iteration k
foreach (c, i) ∈ U do (7) only if the previous tuple (c0 , i − 2k ) was not unique.
if c is unique then That previous tuple will become unique in the next
if discard = 1 then iteration, because it is represented by ((c0 , c), i + 2k )
append (c, i) to F in S. Since each tuple turns unique only once, the
else append (c, i) to P total number of tuples written to P is at most n.
discard := 1
else 5 From Doubling to a-Tupling
let (c0 , i0 ) be the next pair in U
append ((c, c0 ), i) to S It is straightforward to generalize the doubling algo-
discard := 0 rithms from Figures 1 and 3 so that it maintains the
if S = ∅ then invariant that in iteration k, lexicographic names rep-
sort F by first component (8) resent strings of length ak : just gather a names from
return [i : (c, i) ∈ F ] (9) the last iteration that are ak−1 characters apart. Sort
sort S (10) and name as before.
U := name2 (S) (11)
Theorem 5. The a-tupling algorithm can be imple-
Function name2 (S : Sequence of Pair ) mented to run using
q:= r:= 0; (`, `0 ):= ($, $)
result:= hi a+3
sort( n) log maxlcp + O(sort(n)) or
foreach ((c, c0 ), i) ∈ S do log a
if c 6= ` then q:= r:= 0; (`, `0 ):= (c, c0 )
a+3
else if c0 6= `0 then r:= q; `0 := c0 sort( ↔
n) log lcp + O(sort(n))
append (c + r, i) to result log a
q++ I/Os without or with discarding respectively.
return result
We get a tradeoff between higher cost for each iter-
ation and a smaller number of iterations that is deter-
Figure 3: The doubling with discarding algorithm. a+3
mined by the ratio log a . Evaluating this expression
we get the optimum for a = 5. But the value for
6
a = 4 is only 1.5 % worse, needs less memory, and Figure 4 gives pseudocode for an external implemen-
calculations are much easier because four is a power tation of this algorithm and Figure 5 gives a data
two. Hence, we choose a = 4 for our implementation flow graph that allows pipelined execution. Step 1
of the a-tupling algorithm. This quadrupling algo- is implemented by Lines (1)–(6) and starts out quite
rithm needs 30 % less I/Os than doubling. similar to the tripling (3-tupling) algorithm described
in Section 5. The main difference is that triples are
only obtained for two thirds of the suffixes and that
6 A Pipelined I/O-Optimal Algo- we use recursion to find lexicographic names that ex-
rithm actly characterize the relative order of these sample
suffixes. As a preparation for the Steps 2 and 3, in
lines (7)–(10) these sample names are used to anno-
Function DC3 (T ) tate each suffix position i with enough information
S:= [((T [i, i + 2]), i) : i ∈ [0, n), i mod 3 6= 0] (1) to determine its global rank. More precisely, at most
sort S by the first component (2) two sample names and the first one or two characters
P := name(S) (3) suffice to completely determine the rank of a suffix.
if the names in P are not unique then This information can be obtained I/O efficiently by
sort the (i, r) ∈ P by (i mod 3, i div 3) (4) simultaneously scanning the input and the names of
SA12 := £ DC3 ([c :12(c, i) ∈ P ]) ¤ (5) the sample suffixes sorted by their position in the in-
P := (j + 1, SA [j]) : j ∈ [0, 2n/3) (6) put. With this information, Step 2 reduces to sorting
sort P by the second component (7) suffixes Ti with i mod 3 = 0 by their first charac-
S0 := h(T [i], T [i + 1], c0 , c00 , i) : (8) ter and the name for Ti+1 in the sample (Line 11).
0 00
i mod 3 = 0, (c , i + 1), (c , i + 2) ∈ P i Line (12) reconstructs the order of the mod-2 suffixes
S1 := h(c, T [i], c0 , i) : (9) and mod-3 suffixes. Line (13) implements Step 3 by
0
i mod 3 = 1, (c, i), (c , i + 1) ∈ P i ordinary comparison based merging. The slight com-
S2 := h(c, T [i], T [i + 1], c00 , i) : (10) plication is the comparison function. There are three
i mod 3 = 2, (c, i), (c00 , i + 2) ∈ P i cases:
sort S0 by components 1,3 (11)
sort S1 and S2 by component 1 (12) • A mod-0 suffix Ti can be compared with a mod-
S:= merge(S0 , S1 , S2 ) comparison function: (13) 1 suffix Tj by looking at at the first characters
(t, t0 , c0 , c00 , i) ∈ S0 ≤ (d, u, d0 , j) ∈ S1 and the names for Ti+1 and Tj+1 in the sample
⇔ (t, c0 ) ≤ (u, d0 ) respectively.
(t, t0 , c0 , c00 , i) ∈ S0 ≤ (d, u, u0 , d00 , j) ∈ S2 • For a comparison between a mod-0 suffix Ti and
⇔ (t, t0 , c00 ) ≤ (u, u0 , d00 ) a mod-2 suffix Tj the above technique does not
(c, t, c0 , i) ∈ S1 ≤ (d, u, u0 , d00 , j) ∈ S2 work since Tj+1 is not in the sample. However,
⇔c≤d both Ti+2 and Tj+2 are in the sample so that it
return [last component of s : s ∈ S] (14) suffices to look at the first two characters and
the names of Ti+2 and Tj+2 respectively.
Figure 4: The DC3-algorithm.
• Mod-1 suffixes and Mod-2 suffixes can be com-
pared by looking at their names in the sample.
The following three step algorithm outlines a linear
time algorithm for suffix array construction [16]: The resulting data flow graph is large but fairly
1. Construct the suffix array of the suffixes start- straightforward except for the file node which stores
ing at positions i mod 3 6= 0. This is done by a copy of input stream T . The problem is that the
reduction to the suffix array construction of a input is needed twice. First, Line 2 uses it for gen-
string of two thirds the length, which is solved erating the sample and later, the node implement-
recursively. ing Lines (8)–(10) scans it simultaneously with the
names of the sample suffixes. It is not possible to
2. Construct the suffix array of the remaining suf- pipeline both scans because we would violate the re-
fixes using the result of the first step. quirement of Theorem 2 that edges between stream-
3. Merge the two suffix arrays into one. ing nodes must not cross sorting nodes. This problem
7
if names are not unique 5n
input 8n 4n 4n 3 11 output
3 3 3 8− 4n
1 2 3 4 5 6 7 10 3 12 13 14
n 5n 12
T n 3
file streaming sorting recursion
node node node
Figure 5: Data flow graphs for the DC3-algorithm. The numbers refer to line numbers in Figure 4
V (n) ≤ sort(( 83 + 4
3 + 4
+ 53 + 33 + 53 )n)
3 Proof. The conditions are clearly necessary. To show
2 sufficiency, assume that the suffix array contains ex-
+ scan(2n) + V ( n)
3 actly permutation of [0, n) but in wrong order. Let
= sort(10n) + scan(2n) + V ( 32 n) Si and Sj be a pair of wrongly ordered suffixes, say
Si > Sj but ri < rj , that maximizes i + j. The sec-
This recurrence has the solution V (n) ≤ ond conditions is violated if T [i] > T [j]. Otherwise,
3(sort(10n)+scan(2n)) ≤ sort(30n)+scan(6n). Note we must have T [i] = T [j] and Si+1 > Sj+1 . But
that the data flow diagram assumes that the input then ri > rj by maximality of i + j and the second
is a data stream into the procedure call. However, condition is violated.
we get the same complexity if the original input is a
file. In that case, we have to read the input once but Theorem 7. The suffix array checker from Figure 6
we save writing it to the local file node T . can be implemented to run using sort(5n) + scan(2n)
I/Os.
7 A Checker
To ensure the correctness of our algorithms we have 8 Experiments
designed and implemented a simple and fast suffix
array checker. It is given in Figure 6 and is based on We have implemented the algorithms in C++ us-
the following result. ing the g++ 3.2.3 compiler (optimization level -O2
--omit-framepointer) and the external memory li-
Lemma 1 ([5]). An array SA[0, n) is the suffix array brary Stxxl Version 0.52 [9]. Our experimen-
of a text T iff the following conditions are satisfied: tal platform has two 2.0 GHz Intel Xeon proces-
1. SA contains a permutation of [0, n). sors, one GByte of RAM, and we use four 80 GByte
IBM 120GXP disks. Refer to [10] for a performance
2. Let ri be the rank of the suffix Si according to the evaluation of this machine whose cost was 2500 Euro
suffix array. For all i, j, ri ≤ rj ⇔ (T [i], ri+1 ) ≤ in July 2002. The following instances have been con-
(T [j], rj+1 ). sidered:
8
Table 1: Statistics of the instances used in the experiments.
↔
T n = |T | |Σ| maxlcp lcp log lcp
Random2 232 128 231 ≈ 229 ≈ 30
Gutenberg 3 277 099 765 128 4 819 356 45 617 todo
Genome 3 070 128 194 5 21 999 999 454 111 todo
HTML 4 214 295 245 128 102 356 1 108 todo
Source 547 505 710 128 173 317 431 5.80
Random2: Two concatenated copies of a Random for all input sizes and all instances. However, there
string of length n/2. This is a difficult instance that is enough data to draw some interesting conclusions.
is hard to beat using simple heuristics. Complicated behavior is observed for “small” in-
Gutenberg: Freely available English texts from puts up to 226 characters. The main reason is that
http://promo.net/pg/list.html. we made no particular effort to optimize special cases
where at least some part of some algorithm could ex-
Genome: The known pieces of the humane genome
ecute internally but Stxxl sometime makes such
from http://genome.ucsc.edu/downloads.html
optimizations automatically.
(status May, 2004). We have normalized this input
to ignore the distinction between upper case and The most important observation is that the DC3-
lower case letters. The result are characters in algorithm is always the fastest algorithm and is al-
an alphabet of size 5 (ACGT and sometime long most completely insensitive to the input. For all
sequences of “unknown” characters). inputs of size more than a GByte, DC3 is at least
twice as fast as its closest competitor. With respect
HTML: Pages from a web crawl containing only
to I/O volume, DC3 is sometimes equaled by qua-
pages from .gov domains. These page are filtered
drupling with discarding. This happens for relatively
so that only text and html code is contained but no
small inputs. Apparently quadrupling has more com-
pictures and no binary files.
plex internal work. For example, it compares quadru-
Source: Source code (mostly C++) containing core- ples during half of its sorting operations whereas DC3
utils, gcc, gimp, kde, xfree, emacs, gdb, Linux kernel never compares more than triples during sorting. For
and Open Office). the difficult synthetic input Random2, quadrupling
We have collected some of these instances at ftp: with discarding is by far outperformed by DC3.
//www.mpi-sb.mpg.de/pub/outgoing/sanders/. For real world inputs, discarding algorithms turn
For a nonsynthetic instance T of length n, our out to be successful compared to their nondiscarding
experiments use T itself and its prefixes of the form counterparts. They outperform them both with re-
T [0, 2i ). Table 1 shows statistics of the properties of spect to I/O volume and running time. For random
these instances. inputs without repetitions the discarding algorithms
The figure on the next page shows execution time might actually beat DC3 since one gets inputs with
and I/O volume side by side for each of our instance very small values of log lcp ↔
.
families and for the algorithms nonpipelined dou- Quadrupling algorithms consistently outperform
bling, pipelined doubling, pipelined doubling with doubling algorithms.
discarding, pipelined quadrupling, pipelined quadru- Comparing pipelined doubling with nonpipelined
pling with discarding5 , and DC3. All ten plots share doubling in the top pair of plots (instance Random2)
the same x-axis and the same curve labels. Comput- one can see that pipelining brings a huge reduction
ing all these instances takes about 14 days moving of I/O volume whereas the execution time is affected
more than 20 TByte of data. Due to these large exe- much less — a clear indication that our algorithms
cution times it was not feasible to run all algorithms are dominated by internal calculations. We do not
5
The discarding algorithms we have implemented need
show the nonpipelined algorithm for the other inputs
slightly more I/Os and perhaps more complex calculations than since the relative performance compared to pipelined
the newer algorithms described in Section 4. doubling should remain about the same.
9
140 3500
nonpipelined
Doubling
120 Discarding 3000
Quadrupling
Quad-Discarding
Random2: Time [µs] / n
60 1500
40 1000
20 500
0 0
80 1000
900
70
800
60
Gutenberg: Time [µs] / n
700
40 500
400
30
300
20
200
10
100
0 0
80 1000
900
70
800
60
Genome: Time [µs] / n
700
I/O Volume [byte] / n
50
600
40 500
400
30
300
20
200
10
100
0 0
40 600
35
500
I/O Volume [byte] / n
30
HTML: Time [µs] / n
400
25
20 300
15
200
10
100
5
0 0
600
40
35 500
I/O Volume [byte] / n
Source: Time [µs] / n
30
400
25
20 300
15 200
10
100
5
0 0
224 226 228 230 232 224 226 228 230 232
n n
10
A comparison of the new algorithms with previous practical usefulness of DC3 since a mere comparison
algorithms is more difficult. The implementation of with the relatively simple, nonpipelined previous im-
[7] works only up to 2GByte of total external memory plementations would have been unfair.
consumption and would thus have to compete with As a side effect, the various generalizations of dou-
space efficient internal algorithms on our machine. bling yield an interesting case study for the system-
At least we can compare I/O volume per byte of in- atic design of pipelined external algorithms.
put for the measurements in [7]. Their best scalable The most important practical question is whether
algorithm for the largest real world input tested (26 constructing suffix arrays in external memory is now
MByte of text from the Reuters news agency) is non- feasible. We believe that the answer is a careful ‘yes’.
pipelined doubling with a simple form of discarding. We can now process 4 · 109 characters over night on
This algorithm needs an I/O volume of 1303 Bytes a low cost machine. Two orders of magnitude more
per character of input. The DC3-algorithm about 5 than in [7] in a time faster or comparable to previ-
times less I/Os. Furthermore, it is to be expected ous internal memory computations [23, 20] on more
that the lead gets bigger for larger inputs. The GBS expensive machines.
algorithm [12] needs 486 bytes of I/O per character There are also many opportunities to scale to even
for this input in [7], i.e., even for this small input DC3 larger inputs. For example, one could exploit that
already outperforms the GBS algorithm. We can also about half of the sorting operations are just per-
attempt a speed comparison in terms of clock cycles mutations which should be implementable with less
per byte of input. Here [7] needs 157 000 cycles per internal work than general sorting. It should also
byte for doubling with simple discarding and 147 000 be possible to better overlap I/O and computation.
cycles per byte for the GBS algorithm whereas DC3 More interestingly, there are many ways to paral-
needs only about 20 000 cycles. Again, the advan- lelize. On a small scale, pipelining allows us to run
tage should grow for larger inputs in particular when several sorters and one streaming thread in parallel.
comparing with the GBS algorithm. On a large scale DC3 is also perfectly parallelizable
The following small table shows the execution time [16]. Since the algorithm is largely compute bound,
of DC3 for 1 to 8 disks on the ‘Source’ instance. even cheap switched Gigabit-Ethernet should allow
D 1 2 4 6 8 high efficiency (DC3 sorts about 13 MByte/s in our
t[mus/byte] 13.96 9.88 8.81 8.65 8.52 measurements). Considering all these improvements
We see that adding more disks gives only very small and the continuing advance in technology, there is no
speedup. (And we would see very similar speedups reason why it should not be possible to handle inputs
for the other algorithms except nonpipelined dou- that are another two orders of magnitude larger in a
bling). Even with 8 disks, DC3 has an I/O rate of few years.
less than 30 MByte/s which is less than the peak
performance of a single disk (45 MByte/s). Hence,
Acknowledgements
by more effective overlapping of I/O and computa-
tion it should be possible to sustain the performance We would like to thank Stefan Burkhardt and Knut
of eight disks using a single cheap disk so that even Reinert for valuable pointers to interesting experi-
very cheap PCs (≈) could be used for external suffix mental input. Lutz Kettner helped with the design
array construction. of Stxxl . The html pages were supplied by Sergey
Sizov from the information retrieval group at MPII.
Christian Klein helped with Unix tricks for assem-
9 Conclusion bling the data.
Our efficient external version of the DC3-algorithm
is theoretically optimal and clearly outperforms all References
previous algorithms in practice. Since all practical
previous algorithms are asymptotically suboptimal [1] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch.
and dependent on the inputs, this closes a gap be- The enhanced suffix array and its applications
tween theory and practice. DC3 even outperforms to genome analysis. In Proc. 2nd Workshop on
the pipelined quadrupling-with-discarding algorithm Algorithms in Bioinformatics, volume 2452 of
even for real world instances. This underlines the LNCS, pages 449–463. Springer, 2002.
11
[2] A. Aggarwal and J. S. Vitter. The input/output [13] W.-K. Hon, T.-W. Lam, K. Sadakane, and W.-
complexity of sorting and related problems. K. Sung. Constructing compressed suffix arrays
Communications of the ACM, 31(9):1116–1127, with large alphabets. In Proc. 14th Interna-
1988. tional Symposium on Algorithms and Compu-
tation, volume 2906 of LNCS, pages 240–249.
[3] L. Arge, P. Ferragina, R. Grossi, and J. S. Vit- Springer, 2003.
ter. On sorting strings in external memory. In
29th ACM Symposium on Theory of Computing, [14] W.-K. Hon, K. Sadakane, and W.-K. Sung.
pages 540–548, El Paso, May 1997. ACM Press. Breaking a time-and-space barrier in construct-
ing full-text indices. In Proc. 44th Annual Sym-
[4] L. Arge, O. Procopiuc, and J. S. Vitter. Im-
posium on Foundations of Computer Science,
plementing I/O-efficient data structures using
pages 251–260. IEEE, 2003.
TPIE. In 10th European Symposium on Algo-
rithms (ESA), volume 2461 of LNCS, pages 88– [15] J. Kärkkäinen. Algorithms for Memory Hier-
100. Springer, 2002. archies, volume 2625 of LNCS, chapter Full-
[5] S. Burkhardt and J. Kärkkäinen. Fast Text Indexes in External Memory, pages 171–
lightweight suffix array construction and check- 192. Springer, 2003.
ing. In Proc. 14th Annual Symposium on Com-
[16] J. Kärkkäinen and P. Sanders. Simple linear
binatorial Pattern Matching, volume 2676 of
work suffix array construction. In Proc. 30th In-
LNCS, pages 55–69. Springer, 2003.
ternational Conference on Automata, Languages
[6] M. Burrows and D. J. Wheeler. A block-sorting and Programming, volume 2719 of LNCS, pages
lossless data compression algorithm. Technical 943–955. Springer, 2003.
Report 124, SRC (digital, Palo Alto), May 1994.
[17] T. Kasai, G. Lee, H. Arimura, S. Arikawa, and
[7] A. Crauser and P. Ferragina. Theoretical and ex- K. Park. Linear-time longest-common-prefix
perimental study on the construction of suffix ar- computation in suffix arrays and its applications.
rays in external memory. Algorithmica, 32(1):1– In Proc. 12thAnnual Symposium on Combinato-
35, 2002. rial Pattern Matching, volume 2089 of LNCS,
pages 181–192. Springer, 2001.
[8] A. Crauser and K. Mehlhorn. LEDA-SM a plat-
form for secondary memory computations. Tech- [18] D. K. Kim, J. S. Sim, H. Park, and Kunsoo
nical report, MPII, 1998. draft. Park. Linear-time construction of suffix arrays.
In Proc. 14th Annual Symposium on Combinato-
[9] R. Dementiev. The stxxl library. documentation
rial Pattern Matching. Springer, June 2003. To
and download at http://www.mpi-sb.mpg.de/
appear.
~rdementi/stxxl.html.
[10] R. Dementiev and P. Sanders. Asynchronous [19] P. Ko and S. Aluru. Space efficient linear time
parallel disk sorting. In Proc. 15th Annual Sym- construction of suffix arrays. In Proc. 14th
posium on Parallelism in Algorithms and Archi- Annual Symposium on Combinatorial Pattern
tectures. ACM, 2003. To appear. Matching. Springer, June 2003. To appear.
[11] M. Farach-Colton, P. Ferragina, and [20] T.-W. Lam, K. Sadakane, W.-K. Sung, and Siu-
S. Muthukrishnan. On the sorting-complexity Ming Yiu. A space and time efficient algorithm
of suffix tree construction. Journal of the ACM, for constructing compressed suffix arrays. In
47(6):987–1011, 2000. Proc. 8th Annual International Conference on
Computing and Combinatorics, volume 2387 of
[12] G. Gonnet, R. Baeza-Yates, and T. Snider. New LNCS, pages 401–410. Springer, 2002.
indices for text: PAT trees and PAT arrays. In
W. B. Frakes and R. Baeza-Yates, editors, In- [21] U. Manber and G. Myers. Suffix arrays: A new
formation Retrieval: Data Structures & Algo- method for on-line string searches. SIAM Jour-
rithms. Prentice-Hall, 1992. nal on Computing, 22(5):935–948, October 1993.
12
[22] G. Manzini and P. Ferragina. Engineering a procedure called in Line (2) which generates pairs
lightweight suffix array construction algorithm. that are immediately fed into the run formation pro-
In Proc. 10th Annual European Symposium on cess of the next sorting operation in Line (3) (2n/DB
Algorithms, volume 2461 of LNCS, pages 698– I/Os). The multiway merging phase (2n/DB I/Os)
710. Springer, 2002. for Line (3) does not write the sorted pairs but in
Line (4) it generates triples for S that are fed into
[23] K. Sadakane and T.Shibuya. Indexing huge
the pipeline for the next iteration. We have elimi-
genome sequences for solving various problems.
nated all the I/Os for scanning and half of the I/Os
Genome Informatics, 12:175–183, 2001.
for sorting resulting in only 10n/DB I/Os per iter-
[24] A. Silberschatz, H. F. Korth, and S. Sudarshan. ation — only one third of the I/Os needed for the
Database System Concepts. McGraw-Hill, 4th naive implementation.
edition, 2001. Note that pipelining would have been more com-
plicated in the more traditional formulation where
[25] J. S. Vitter and E. A. M. Shriver. Algorithms Line (3) sorts P directly by the index i. In that case,
for parallel memory, I: Two level memories. Al- a pipelining formulation would require a FIFO of size
gorithmica, 12(2/3):110–147, 1994. 2k to produce a shifted sequences. When 2k > M this
FIFO would have to be maintained externally caus-
A An Introductory Example For ing 2n/DB additional I/Os per iteration, i.e., our
modification simplifies the algorithm and saves up to
Pipelining 20 % I/Os.
To motivate the idea of pipelining let us first analyze
the constant factor in a naive implementation of the B Proof of Theorem 2
doubling algorithm from Figure 1. For simplicity as-
sume for now that inputs are not too large so that Proof. The basic observation is that all streaming
sorting m words can be done in 4m/DB I/Os using nodes within an SCC C of G0 must be executed to-
two passes over the data. For example, one run for- gether exchanging data through their internal buffers
mation phase could build sorted runs of size M and — if any node from C is excluded it will eventually
one multiway merging phase could merge the runs stall the computation because input or output buffer
into a single sorted sequence. fill up.
Line (1) sorts n triples and hence needs 12n/DB Now assume that G fulfills the requirements. We
I/Os. Naming in Line (2) scans the triples and writes schedule the computations for each SCC of G0 in
name-index pairs using 3n/DB + 2n/DB = 5n/DB topologically sorted order. First consider an SCC C
I/Os. The naming procedure can also determine of streaming nodes. We perform in a single pass all
whether all names are unique now, hence the test the computations of the streaming nodes in C, read-
in Line (3) needs no I/Os. Sorting the pairs in P in ing from the file nodes with edges entering C, writing
Line (4) costs 8n/DB I/Os. Scanning the pairs and to the file nodes with edges coming from C, perform-
producing triples in Line (5) costs another 5n/DB ing the first phase of sorting (e.g., run formation)
I/Os. Overall, we get (12+5+8+5)n/DB = 30n/DB of the sorting nodes with edges coming from C, and
I/Os for each iteration. performing the last phase of sorting (e.g. multiway
This can be radically reduced by interpreting the merging) for the sorting nodes with edges entering
sequences S and P not as files but as pipelines simi- C. The requirement on the buffer sizes ensures that
lar to the pipes available in UNIX. In the beginning there is sufficient internal memory. The topological
we explicitly scan the input T and produce triples sorting ensures that all the data from incoming edges
for S. We do not count these I/Os since they are not is available. Since there are only streaming nodes in
needed for the subsequent iterations. The triples are C, data can freely flow through them respecting the
not output directly but immediately fed into the run topological sorting of G.6
formation phase of the sorting operation in Line (1). 6
In our implementations the detailed scheduling within the
The runs are output to disk (3n/DB I/Os). The mul- components is done by the user to keep the overhead small.
tiway merging phase reads the runs (3n/DB I/Os) However, one could also schedule them automatically, possibly
and directly feeds the sorted triples into the naming using multithreading.
13
When a sorting node is encountered as an SCC we
may have to perform I/Os to make sure that the final
phase can incrementally produce the ¡ sorted
¢ elements.
However for a sorting volume of O M 2 /B , multiway
merging only needs the run formation phase that will
already be done and the final merging phase that will
be done later. For SCCs consisting of file nodes we
do nothing.
Now assume the G violates the requirements. If
there is an SCC that exceeds its buffer requirements,
there is no systematic way to execute all its nodes
together.
If an SCC C of G0 contains a sorting node v, there
must be a streaming node w that directly or indi-
rectly needs input from v, i.e., it cannot start execut-
ing before v starts to produce output. Node v cannot
produce any output before it did not see its complete
input. This input directly or indirectly depends on
some other streaming node u in C. Since u and w
are in the same SCC, they have to be executed to-
gether. But the data dependencies above make this
impossible. The argument for a file node within an
SCC is analogous.
14