Dementiev/Mehnert/Kärkkäinen/Sanders: External Suffix Arrays
External Memory Suffix Array Construction
Roman Dementiev, Juha Kärkkäinen, Jens Mehnert, Peter Sanders
MPI Informatik, U. Karlsruhe, U. Helsinki
Suffix Arrays
Sort the suffixes $T[i..n]$ of a string $T[0..n]$ over the alphabet $\{1..n\}$.
Applications
- Full text search
- Burrows-Wheeler text compression
- Bioinformatics, ...
Example: T = banana
Sorted suffixes: a, ana, anana, banana, na, nana
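To make the definition concrete, here is a minimal in-memory sketch that sorts the suffix start positions of the banana example directly. The function name `naive_suffix_array` is an illustrative assumption; this is not the external memory construction these slides are about, only the object it computes.

```python
def naive_suffix_array(t):
    """Return the start positions of the suffixes of t in sorted order.

    Sorting whole suffixes is quadratic in the worst case; it is used
    here only to illustrate the definition.
    """
    return sorted(range(len(t)), key=lambda i: t[i:])


text = "banana"
sa = naive_suffix_array(text)
print(sa)                       # [5, 3, 1, 0, 4, 2]
print([text[i:] for i in sa])   # ['a', 'ana', 'anana', 'banana', 'na', 'nana']
```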
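Big interest in BIG inputs ⇒ External memory
[Figure: machine model. Registers and ALU; fast memory of capacity M, freely programmable; large memory accessed in blocks of size B.]
Sizes are measured in machine words.
$\mathrm{scan}(n) = \frac{n}{B}$
$\mathrm{sort}(n) = \frac{2n}{B} \log_{M/B} \frac{n}{M}$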
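The two cost functions can be turned into a small calculator. This is a sketch under stated assumptions: the logarithm is rounded up to a whole number of merge passes, at least one pass is always charged, and the machine parameters in the example are hypothetical values chosen for round numbers, not the experimental machine.

```python
def scan_ios(n, B):
    """I/Os to read n words sequentially in blocks of B words: n / B."""
    return n / B


def sort_ios(n, M, B):
    """I/Os for an external merge sort: (2n / B) * passes.

    Assumption: log_{M/B}(n / M) is rounded up to whole merge passes,
    and at least one pass is always charged.
    """
    passes, runs = 0, n / M      # initial sorted runs of M words each
    while runs > 1:              # one pass merges M/B runs into one
        runs /= M / B
        passes += 1
    return 2 * n / B * max(1, passes)


# Hypothetical machine: M = 2**27 words of fast memory, B = 2**18 words per block.
n, M, B = 2**32, 2**27, 2**18
print(scan_ios(n, B))      # 16384.0
print(sort_ios(n, M, B))   # 32768.0
```

With these parameters a single merge pass suffices, so sorting costs exactly two scans: one to write the data and one to read it.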
Related Work
- Incremental: $\frac{n}{M}\,\mathrm{scan}(n)$ I/Os [Gonnet/Baeza-Yates/Snider 92]
  [CF 97] not very scalable, a lot of internal work
- Doubling: sort by the first $2^i$ characters in iteration $i$ [Manber/Myers 93]
  ⇒ $O(\mathrm{sort}(n) \log \mathrm{maxlcp})$ I/Os [AFGV 97]
- Doubling+Discarding: avoid sorting suffixes known to be unique [Crauser/Ferragina 97]
  Best scalable algorithm in study. > 6h for 26 MByte.
  ⇒ External construction not practical?
- via Suffix-Tree: $O(\mathrm{sort}(n))$ I/Os [Farach/Ferragina/Muthukrishnan 00]
  very complicated
- DC3: simple, linear time, $O(\mathrm{sort}(n))$ I/Os [KS 03]. Practical? Better than improved doubling?
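Pipelined Doubling with Bit Shuffling
$\mathrm{name}(T[i..i+k]) \in \{1..n\}$ preserves the order of the $k$-substrings.
Start with triples $(T[j], T[j+1], j)$. Iteration $i$ produces triples $(\mathrm{name}(T[j..j+2^i)), \mathrm{name}(T[j+2^i..j+2^{i+1})), j)$, then $i := i+1$.
[Figure: pipeline with nodes sort (form runs, runs, merge), name, pair, sort. Edge labels: 3n words, 2n words, i bits.]
Total I/O complexity: $\mathrm{sort}(5n) \log \mathrm{maxlcp} + O(\mathrm{sort}(n))$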
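The sort, name, pair loop can be sketched in memory as follows. This is a simplified illustration, assuming plain Python lists in place of the pipelined external sorters and leaving out bit shuffling; the function name `doubling_suffix_array` and the use of name 0 for positions past the end of the text are assumptions of this sketch.

```python
def doubling_suffix_array(t):
    """Suffix array by prefix doubling (in-memory sketch)."""
    n = len(t)
    if n == 0:
        return []
    # Iteration 0: name every single character; 0 means "past the end".
    alphabet = {c: r for r, c in enumerate(sorted(set(t)), start=1)}
    name = [alphabet[c] for c in t]
    k = 1  # names currently describe the first k characters of each suffix
    while True:
        # Pair step: triples (name of first k chars, name of next k chars, j),
        # then sort them lexicographically.
        triples = sorted(
            (name[j], name[j + k] if j + k < n else 0, j) for j in range(n)
        )
        # Name step: equal pairs get equal names, so the new names preserve
        # the order of the 2k-character prefixes.
        new_name = [0] * n
        current, previous = 0, None
        for first, second, j in triples:
            if (first, second) != previous:
                current += 1
                previous = (first, second)
            new_name[j] = current
        name = new_name
        if current == n:  # all names unique: the suffix order is final
            return [j for _, _, j in triples]
        k *= 2


print(doubling_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The number of loop iterations grows with the logarithm of the longest common prefix between suffixes, which is where the log maxlcp factor in the I/O bound comes from.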
Improved Discarding
- Scan all unique suffixes [CF 97] ⇒ scan new unique suffixes [Kärkkäinen 03]
- Triples ⇒ pairs
[Figure: pipeline with nodes merge, pair, "name and mark unique". Edge labels: 3N, 2N, 2n. Other labels: fully / partially discarded suffixes.]
$\mathrm{sort}(5N) + O(\mathrm{sort}(n))$ I/Os, where $N = \sum_i \log \mathrm{distPrefixSize}(T[i..n])$
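The quantity N can be computed by brute force for a small text. The sketch below rests on assumptions that the slide does not spell out: distPrefixSize of a suffix is taken to be the length of its shortest prefix that no other suffix shares (one more than its longest common prefix with any other suffix), and the logarithm is taken to base 2. The helper names are illustrative.

```python
import math


def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def dist_prefix_size(t, i):
    """Assumed definition: 1 + longest common prefix with any other suffix."""
    others = (lcp(t[i:], t[j:]) for j in range(len(t)) if j != i)
    return 1 + max(others, default=0)


def discarding_volume(t):
    """N = sum over all suffixes of log2(distPrefixSize)."""
    return sum(math.log2(dist_prefix_size(t, i)) for i in range(len(t)))


text = "banana"
print([dist_prefix_size(text, i) for i in range(len(text))])  # [1, 4, 3, 4, 3, 2]
print(discarding_volume(text))
```

For texts with short repeats N stays close to n, while doubling without discarding pays for all n suffixes in every one of its log maxlcp iterations.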
a-Tupling
Sort by the first $a^i$ characters in iteration $i$.
Constant factor in I/Os:
a               2     3     4     5     6     7
(a + 3)/log a   5.00  3.78  3.50  3.45  3.48  3.56
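The constants in the table can be recomputed directly from the formula. This sketch assumes the logarithm is to base 2 and that the tabulated values correspond to a = 2, ..., 7; the function name `io_factor` is illustrative.

```python
import math


def io_factor(a):
    """Constant factor (a + 3) / log2(a) for a-tupling."""
    return (a + 3) / math.log2(a)


for a in range(2, 8):
    print(a, round(io_factor(a), 3))

# The smallest factor in this range, and the factor for quadrupling (a = 4).
print(min(range(2, 8), key=io_factor))  # 5
print(io_factor(4))                     # 3.5
```

The factor is smallest at a = 5, but a = 4 is nearly as good at 3.50, which fits the choice of quadrupling in the experiments.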
Difference Cover 3 (DC3) Algorithm
1. Sort $T[i..n]$ for $i \bmod 3 \in \{1, 2\}$: sort and name triples, recurse
2. Sort $T[i..n]$ for $i \bmod 3 \in \{0\}$: sort pairs $(T[3i], \mathrm{name}(T[3i+1..n]))$
3. Merge using the difference cover property of $\{1, 2\}$:
   $T[3i..n] \le T[3j+1..n]$ iff $(T[3i], \mathrm{name}(T[3i+1..n])) \le (T[3j+1], \mathrm{name}(T[3j+2..n]))$
   $T[3i..n] \le T[3j+2..n]$ iff $(T[3i], T[3i+1], \mathrm{name}(T[3i+2..n])) \le (T[3j+2], T[3j+3], \mathrm{name}(T[3j+4..n]))$
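The two comparison rules of the merge step can be checked on a small text. This is a sketch, not a DC3 implementation: the names of the sample suffixes are obtained by naive sorting as a stand-in for the recursion of step 1, positions past the end get name 0 and an empty character, and `i`, `j` are absolute text positions (the slide's 3i and 3j+1, 3j+2). All function names are illustrative.

```python
def sample_names(t):
    """Names (ranks) of the suffixes starting at positions i with i mod 3 in {1, 2}.

    Stand-in for step 1: ranks come from naive sorting, not from recursion.
    Positions past the end of t keep name 0.
    """
    n = len(t)
    sample = sorted((i for i in range(n) if i % 3 != 0), key=lambda i: t[i:])
    name = [0] * (n + 3)
    for rank, i in enumerate(sample, start=1):
        name[i] = rank
    return name


def char(t, i):
    """Character at position i, or '' (smaller than any character) past the end."""
    return t[i] if i < len(t) else ""


def leq(t, name, i, j):
    """True iff suffix i (i mod 3 == 0) <= suffix j (j mod 3 in {1, 2})."""
    if j % 3 == 1:
        # One character, then two sample names (positions i+1 and j+1).
        return (char(t, i), name[i + 1]) <= (char(t, j), name[j + 1])
    # Two characters, then two sample names (positions i+2 and j+2).
    return (char(t, i), char(t, i + 1), name[i + 2]) <= (
        char(t, j), char(t, j + 1), name[j + 2])


text = "yabbadabbado"
name = sample_names(text)
ok = all(
    leq(text, name, i, j) == (text[i:] <= text[j:])
    for i in range(0, len(text), 3)
    for j in range(len(text)) if j % 3 != 0
)
print(ok)  # True
```

Each comparison touches at most two characters and one name per suffix, which is why the merge can run as a single streaming pass over sorted tuples.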
Pipelined DC3
[Figure: data flow graph of pipelined DC3. Nodes: input, triple, name, recurse (if names are not unique), permute, tuple, merge (mod 0, mod 1, mod 2), output. Edge labels: n, 4n/3, 5n/3, 8n/3. Legend: recursion, file node, streaming node, sorting node.]
$\mathrm{sort}(30n) + \mathrm{scan}(6n)$ I/Os
Experimental Setup
- g++ 3.2.3 -O2
- STXXL library [Dementiev 03] with new iterator-like pipelining feature
[Figure: machine diagram. 2x Xeon (4 threads), 1 GB DDR RAM, Intel E7500 chipset, PCI busses, controller, channels, 8x80 GB disks. Bandwidth labels: 400x64 Mb/s, 2x64x66 Mb/s, 4x2x100 MB/s, 8x45 MB/s. Further label: 128.]
Inputs:
- Random2: $TT$ with $T := \mathrm{randChar}^{n/2}$
- Genome: Human Genome
- Gutenberg: 3 GByte English text from Gutenberg project
- HTML: 3 GByte text from a crawl of .gov
- Source: 0.5 GByte Linux sources
Gutenberg I/Os
[Figure: I/O volume [byte] / n (axis 0 to 1000) versus n ($2^{24}$ to $2^{32}$) for Doubling, Quadrupling, Discarding, Quad-Discarding, DC3.]
Gutenberg Time
[Figure: Gutenberg, time [µs] / n (axis 0 to 80) versus n ($2^{24}$ to $2^{32}$) for Doubling, Discarding, Quadrupling, Quad-Discarding, DC3.]
I/O bound even for a single disk
Comparison with Previous Implementations
- 5× less I/O volume than [CF 97]
- 78× fewer clock cycles than [CF 97] (including BGS algorithm)
- 2.4× faster than internal compressed Genome [LSSSY 02]
- 1.2× slower than internal Genome on 64 GByte super computer [Sadakane Shibuya 01]
- Faster than linear time internal LCP computation on MPII's SUN Starfire 15000
Conclusion
- External DC3 is practical
- Better than pipelined, shuffled 4-tupling with improved discarding
- STXXL makes pipelining easy. Saves factor 23 in I/O volume.
Future Work
- Tune pipelined sorters
- Go parallel
- Larger difference covers for first iteration?
- Will discarding help for DC algorithms?
Terabytes over night?
Random2 I/Os
[Figure: I/O volume [byte] / n (axis 0 to 3500) versus n ($2^{24}$ to $2^{32}$) for Doubling, Discarding, Quadrupling, Quad-Discarding, Skew, nonpipelined.]
Random2 Time
[Figure: Random2, time [µs] / n (axis 0 to 140) versus n ($2^{26}$ to $2^{32}$) for nonpipelined, Doubling, Discarding, Quadrupling, Quad-Discarding, DC3.]
Genome I/Os
[Figure: I/O volume [byte] / n (axis 0 to 1000) versus n ($2^{24}$ to $2^{32}$) for Doubling, Quadrupling, Discarding, Quad-Discarding, Skew.]
Genome Time
[Figure: Genome, time [µs] / n (axis 0 to 80) versus n ($2^{24}$ to $2^{32}$) for Doubling, Quadrupling, Discarding, Quad-Discarding, Skew.]
[Figure: Source, two panels versus n ($2^{24}$ to $2^{32}$): I/O volume [byte] / n (axis 0 to 600) and time [µs] / n (axis 0 to 40), each for Quadrupling, Quad-Discarding, Skew.]
[Figure: HTML, two panels versus n ($2^{24}$ to $2^{32}$): I/O volume [byte] / n (axis 0 to 600) and time [µs] / n (axis 0 to 40), each for Quadrupling, Quad-Discarding, Skew.]