
Sorting Suffixes

Juha Kärkkäinen, Peter Sanders

MPI für Informatik, U. Karlsruhe, U. Helsinki

Sorting Suffixes – p.1


Some Stringology-Speak

String S: array S[0 : n) := S[0 : n − 1] := [S[0], . . . , S[n − 1]] of characters

Suffix: S_i := S[i : n)
End markers: S[n] := S[n + 1] := · · · := 0
0 is smaller than all other characters



Suffix sorting problem

Sort the set {S_0, S_1, . . . , S_{n−1}} of the suffixes of a string S of length n
(alphabet [1, n] = {1, . . . , n}) into lexicographic order.
• suffix S_i = S[i : n) for i ∈ [0 : n)

S = banana

  i  suffix S_i           i  sorted suffixes
  0  banana               5  a
  1  anana                3  ana
  2  nana        =⇒       1  anana
  3  ana                  0  banana
  4  na                   4  na
  5  a                    2  nana
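
As a baseline for the banana example, the problem statement can be turned into code directly. This is only a minimal sketch of a naive comparison-based sort, not the linear-time algorithm developed in these slides; the function name naiveSuffixArray and the use of std::string are assumptions made for illustration.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive baseline: sort the suffix start positions by comparing the
// suffixes character by character. Worst case O(n^2 log n) time.
// A suffix that is a proper prefix of another one sorts first, which
// matches the convention that the end marker 0 is the smallest character.
std::vector<int> naiveSuffixArray(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        // negative result iff the suffix at a precedes the suffix at b
        return s.compare(static_cast<std::size_t>(a), std::string::npos,
                         s, static_cast<std::size_t>(b), std::string::npos) < 0;
    });
    return sa;
}
```

For S = banana this yields the order 5, 3, 1, 0, 4, 2 shown in the table.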

Applications
• full-text indexing
• Burrows-Wheeler transform (bzip2 compressor)
• replacement for the more complex suffix tree



Full text search

Search pattern P[0 : m) in text S[0 : n) using the suffix array SA of S.

Binary search: O(m log n), good for short patterns
Binary search with lcp: O(m + log n) if we precompute the longest
common prefixes between the compared strings
Suffix tree: O(m); can be built from SA
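
The plain O(m log n) binary search can be sketched as follows. This is an illustrative sketch only: the name findPattern, the std::string text, and the returned half-open range of suffix array slots are assumptions, and the lcp refinement is not included.

```cpp
#include <string>
#include <utility>
#include <vector>

// Returns the half-open range [first, last) of suffix array slots whose
// suffixes start with the pattern p; first == last means p does not occur.
// Each probe compares at most m = |p| characters, so the whole search
// takes O(m log n) time.
std::pair<int, int> findPattern(const std::string& s,
                                const std::vector<int>& sa,
                                const std::string& p) {
    const int n = static_cast<int>(sa.size());
    const std::size_t m = p.size();
    int lo = 0, hi = n;
    while (lo < hi) {  // first slot whose m-character prefix is >= p
        int mid = lo + (hi - lo) / 2;
        if (s.compare(static_cast<std::size_t>(sa[mid]), m, p) < 0) lo = mid + 1;
        else hi = mid;
    }
    const int first = lo;
    hi = n;
    while (lo < hi) {  // first slot whose m-character prefix is > p
        int mid = lo + (hi - lo) / 2;
        if (s.compare(static_cast<std::size_t>(sa[mid]), m, p) <= 0) lo = mid + 1;
        else hi = mid;
    }
    return {first, lo};
}
```

With S = banana and SA = 5 3 1 0 4 2, the pattern ana is found in slots [1, 3), that is, at text positions 3 and 1.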



Suffix tree [Weiner ’73][McCreight ’76]

• compact trie of the suffixes
+ O(n) construction time [Farach 97] for integer alphabets
+ Most potent tool of stringology?
− Space consuming
− Efficient construction is complicated

[Figure: suffix tree of S = banana0; edges are labelled a, na, 0, na0 and banana0, and the leaves carry the suffix start positions 0 to 6]



Alphabet model

Ordered alphabet
• only comparisons of characters allowed

Constant alphabet
• ordered alphabet of constant size
• multiset of characters can be sorted in linear time

Integer alphabet
• alphabet is {1, . . . , σ} for integer σ ≥ 2
• multiset of k characters can be sorted in O(k + σ) time



Ordered → Integer Alphabet

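sort the characters of S
replace S[i] by its rank among the distinct characters

Example: the characters of S = banana (positions 012345) sort to aaabnn
(taken from positions 135024). The ranks are a = 1, b = 2, n = 3, so

  banana -> 213131
  aaabnn -> 111233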
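
A minimal sketch of this reduction, assuming the text is held in a std::string; the function name toIntegerAlphabet is made up for illustration. It uses a comparison sort, so it corresponds to the ordered-alphabet model.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Replace every character by its 1-based rank among the distinct
// characters of s, which yields a string over the integer alphabet.
std::vector<int> toIntegerAlphabet(const std::string& s) {
    std::string chars(s);
    std::sort(chars.begin(), chars.end());  // sort the characters of S
    chars.erase(std::unique(chars.begin(), chars.end()), chars.end());
    std::vector<int> ranks;
    ranks.reserve(s.size());
    for (char ch : s) {
        // position of ch among the sorted distinct characters
        auto it = std::lower_bound(chars.begin(), chars.end(), ch);
        ranks.push_back(static_cast<int>(it - chars.begin()) + 1);
    }
    return ranks;
}
```

For banana the result is 2 1 3 1 3 1, matching the example.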



Generalization: Lexicographic Naming

sort the k-tuples S[i : i + k) for i ∈ [0 : n)
replace S[i] by the rank of S[i : i + k) among the tuples
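
Lexicographic naming for general k can be sketched like this. The name nameTuples is hypothetical, positions past the end are read as the end marker 0, and a comparison sort stands in for the radix sort used later in the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// names[i] is the 1-based rank of the k-tuple s[i : i + k) among all
// distinct k-tuples of s. Equal tuples receive equal names.
std::vector<int> nameTuples(const std::vector<int>& s, int k) {
    assert(k >= 1);
    const int n = static_cast<int>(s.size());
    auto tupleAt = [&](int i) {
        std::vector<int> t(k);
        for (int j = 0; j < k; ++j) t[j] = (i + j < n) ? s[i + j] : 0;  // pad with 0
        return t;
    };
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return tupleAt(a) < tupleAt(b); });
    std::vector<int> names(n);
    int name = 0;
    for (int r = 0; r < n; ++r) {
        // a new name starts whenever the tuple differs from its predecessor
        if (r == 0 || tupleAt(order[r]) != tupleAt(order[r - 1])) ++name;
        names[order[r]] = name;
    }
    return names;
}
```

For the integer form of banana, 2 1 3 1 3 1, naming with k = 2 gives 3 2 4 2 4 1.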



A First Divide-and-Conquer Approach

1. SA1 = sort {S_i : i is odd} (recursion)
2. SA0 = sort {S_i : i is even} (easy using SA1)
3. merge SA0 and SA1 (very difficult)

Problem: it's hard to compare odd and even suffixes.

[Farach 97] developed a linear-time suffix tree construction algorithm
based on this idea. It is very complicated.
It was the only known linear-time algorithm for suffix arrays.



Skewed Divide-and-Conquer

1. SA12 = sort {S_i : i mod 3 ≠ 0} (recursion)
2. SA0 = sort {S_i : i mod 3 = 0} (easy using SA12)
3. merge SA12 and SA0 (easy!)

S = banana

  SA12           SA0              merged
  5 a            3 ana            5 a
  1 anana    +   0 banana   =⇒    3 ana
  4 na                            1 anana
  2 nana                          0 banana
                                  4 na
                                  2 nana



Recursion Example
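S = anananas.   (positions 012345678)

Triples of the mod 1 suffix S_1 = nananas.0:  nan ana s.0
Triples of the mod 2 suffix S_2 = ananas.00:  ana nas .00

Sorted triples and their lexicographic names:
  .00  ana  ana  nan  nas  s.0
   1    2    2    3    4    5

Names in text order: 3 2 5 (mod 1 triples), 2 4 1 (mod 2 triples)

S12 = 325241                  names not unique: recursive call
suffix array of S12 = 531042
ranks = 436251                lex. names (ranks) among the i mod 3 ≠ 0 suffixes

S annotated with these ranks: a n(4) a(2) n a(3) n(5) a s(6) .(1)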



Recursion

• sort triples S[i : i + 2] for i mod 3 ≠ 0
  (LSD-first radix sort)
• find lexicographic names S′[1 : 2n/3] of the triples
  (i.e., S′[i] < S′[j] iff S[i : i + 2] < S[j : j + 2])
• S12 = [S′[i] : i mod 3 = 1] ◦ [S′[i] : i mod 3 = 2];
  suffix S12_i of S12 represents S_{3i+1},
  suffix S12_{n/3+i} of S12 represents S_{3i+2}
• recurseOn(S12) (alphabet size ≤ 2n/3)
• annotate the i mod 3 ≠ 0 suffixes with their position in the recursive solution



Least Significant Digit First Radix Sort

Here: sort n 3-tuples of integers ∈ [0 : n] into lexicographic order

1. Sort by 3rd position
   ⇒ elements are sorted by position 3
2. Sort stably by 2nd position
   ⇒ elements are sorted by positions 2, 3
3. Sort stably by 1st position
   ⇒ elements are sorted by positions 1, 2, 3
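
The three passes can be sketched on explicit triples as below. This is an illustration under assumptions: the Triple alias and the names stablePass and lsdRadixSort are invented here, and all components are assumed to lie in [0, K]. It copies whole triples, whereas the radixPass routine on the implementation slides sorts indices.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Triple = std::array<int, 3>;

// One stable counting-sort pass on component d; keys must lie in [0, K].
static std::vector<Triple> stablePass(const std::vector<Triple>& in, int d, int K) {
    std::vector<int> start(K + 2, 0);
    for (const Triple& t : in) {
        assert(0 <= t[d] && t[d] <= K);
        ++start[t[d] + 1];
    }
    // after this loop, start[k] = number of keys smaller than k
    for (int k = 1; k <= K + 1; ++k) start[k] += start[k - 1];
    std::vector<Triple> out(in.size());
    for (const Triple& t : in) out[start[t[d]]++] = t;  // input order kept within a bucket
    return out;
}

// LSD-first radix sort: 3rd component first, then 2nd, then 1st.
std::vector<Triple> lsdRadixSort(std::vector<Triple> v, int K) {
    for (int d = 2; d >= 0; --d) v = stablePass(v, d, K);
    return v;
}
```

Stability is what makes the scheme work: a later pass on a more significant component never reorders elements that are equal in that component.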



Stable Integer Sorting

Sort a[0 : n) to b[0 : n) where key(a[i]) ∈ [0 : n]

c[0 : n] := [0, . . . , 0]                           counters
for i ∈ [0 : n) do c[key(a[i])]++                    count
s := 0
for i ∈ [0 : n] do (s, c[i]) := (s + c[i], s)        exclusive prefix sums
for i ∈ [0 : n) do b[c[key(a[i])]++] := a[i]         bucket sort
Time O(n)!



Recursion Example: Easy Case
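S = chihuahua   (positions 012345678)

Triples of the mod 1 suffix S_1 = hihuahua0:  hih uah ua0
Triples of the mod 2 suffix S_2 = ihuahua00:  ihu ahu a00

Sorted triples and their lexicographic names:
  a00  ahu  hih  ihu  ua0  uah
   1    2    3    4    5    6

S12 = 365421                  names already unique: no recursion needed

S annotated with the ranks: c h(3) i(4) h u(6) a(2) h u(5) a(1)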



Sorting mod 0 Suffixes
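Each mod 0 suffix S_i is represented by the pair (S[i], rank of S_{i+1}):

  S_0 = c + S_1  ->  (c, 3)
  S_3 = h + S_4  ->  (h, 6)
  S_6 = h + S_7  ->  (h, 5)

Use radix sort (LSD-order already known): one pass on the first character gives SA0 = 0, 6, 3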



Merge SA12 and SA0
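Comparison rules (c = character, n = rank from SA12):
  mod 0 vs. mod 1 suffix: compare (c, n) with (c, n)
  mod 0 vs. mod 2 suffix: compare (c, c, n) with (c, c, n)

SA12 (sorted)                    SA0 (sorted)
  pos  rank  compared as           pos  vs. mod 1  vs. mod 2
  8    1     (a, 0, 0)             0    (c, 3)     (c, h, 4)
  5    2     (a, h, 5)             6    (h, 5)     (h, u, 1)
  1    3     (h, 4)                3    (h, 6)     (h, u, 2)
  2    4     (i, h, 6)
  7    5     (u, 1)
  4    6     (u, 2)

Merged suffix array:
  8: a
  5: ahua
  0: chihuahua
  1: hihuahua
  6: hua
  3: huahua
  2: ihuahua
  7: ua
  4: uahua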



Analysis

1. Recursion: T(2n/3) plus
   Extract triples: O(n) (forall i, i mod 3 ≠ 0 do . . . )
   Sort triples: O(n)
   (e.g., LSD-first radix sort, 3 passes)
   Lexicographic naming: O(n) (scan)
   Build recursive instance: O(n) (forall names do . . . )
2. SA0 = sort {S_i : i mod 3 = 0}: O(n) (1 radix sort pass)
3. merge SA12 and SA0: O(n) (ordinary merging with a strange
   comparison function)

All in all: T(n) ≤ cn + T(2n/3)
⇒ T(n) ≤ 3cn = O(n)
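
The step from the recurrence to the bound is a geometric series. Spelled out, with c the constant from the recurrence and the base case assumed to be absorbed into c:

```latex
T(n) \le cn + T\!\left(\tfrac{2}{3}n\right)
     \le cn + \tfrac{2}{3}cn + T\!\left(\left(\tfrac{2}{3}\right)^{2} n\right)
     \le \dots
     \le cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
     = \frac{cn}{1 - \tfrac{2}{3}}
     = 3cn = O(n).
```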



Implementation: Comparison Operators

// lexicographic order for pairs
inline bool leq(int a1, int a2, int b1, int b2) {
  return (a1 < b1 || (a1 == b1 && a2 <= b2));
}
// lexicographic order for triples
inline bool leq(int a1, int a2, int a3, int b1, int b2, int b3) {
  return (a1 < b1 || (a1 == b1 && leq(a2, a3, b2, b3)));
}



Implementation: Radix Sorting

// stably sort a[0..n-1] to b[0..n-1] with keys in 0..K taken from r
static void radixPass(int* a, int* b, int* r, int n, int K)
{ // count occurrences
  int* c = new int[K + 1];                           // counter array
  for (int i = 0; i <= K; i++) c[i] = 0;             // reset counters
  for (int i = 0; i < n; i++) c[r[a[i]]]++;          // count occurrences
  for (int i = 0, sum = 0; i <= K; i++) {            // exclusive prefix sums
    int t = c[i]; c[i] = sum; sum += t;
  }
  for (int i = 0; i < n; i++) b[c[r[a[i]]]++] = a[i];  // sort
  delete [] c;
}



Implementation: Sorting Triples

// find the suffix array SA of s[0..n-1] in {1..K}^n
// require s[n] = s[n+1] = s[n+2] = 0, n >= 2
void suffixArray(int* s, int* SA, int n, int K) {
  int n0 = (n+2)/3, n1 = (n+1)/3, n2 = n/3, n02 = n0 + n2;
  int* s12  = new int[n02 + 3]; s12[n02] = s12[n02+1] = s12[n02+2] = 0;
  int* SA12 = new int[n02 + 3]; SA12[n02] = SA12[n02+1] = SA12[n02+2] = 0;
  int* s0   = new int[n0];
  int* SA0  = new int[n0];

  // generate positions of mod 1 and mod 2 suffixes
  // the "+(n0-n1)" adds a dummy mod 1 suffix if n%3 == 1
  for (int i = 0, j = 0; i < n + (n0-n1); i++) if (i%3 != 0) s12[j++] = i;

  // lsb radix sort the mod 1 and mod 2 triples
  radixPass(s12 , SA12, s+2, n02, K);
  radixPass(SA12, s12 , s+1, n02, K);
  radixPass(s12 , SA12, s  , n02, K);



Implementation: Lexicographic Naming

  // find lexicographic names of triples
  int name = 0, c0 = -1, c1 = -1, c2 = -1;
  for (int i = 0; i < n02; i++) {
    if (s[SA12[i]] != c0 || s[SA12[i]+1] != c1 || s[SA12[i]+2] != c2) {
      name++; c0 = s[SA12[i]]; c1 = s[SA12[i]+1]; c2 = s[SA12[i]+2];
    }
    if (SA12[i] % 3 == 1) { s12[SA12[i]/3]      = name; } // left half
    else                  { s12[SA12[i]/3 + n0] = name; } // right half
  }



Implementation: Recursion

  // recurse if names are not yet unique
  if (name < n02) {
    suffixArray(s12, SA12, n02, name);
    // store unique names in s12 using the suffix array
    for (int i = 0; i < n02; i++) s12[SA12[i]] = i + 1;
  } else // generate the suffix array of s12 directly
    for (int i = 0; i < n02; i++) SA12[s12[i] - 1] = i;



Implementation: Sorting mod 0 Suffixes

  // stably sort the mod 0 suffixes from SA12 by their first character
  for (int i = 0, j = 0; i < n02; i++) if (SA12[i] < n0) s0[j++] = 3*SA12[i];
  radixPass(s0, SA0, s, n0, K);



Implementation: Merging
  // merge sorted SA0 suffixes and sorted SA12 suffixes
  for (int p = 0, t = n0-n1, k = 0; k < n; k++) {
#define GetI() (SA12[t] < n0 ? SA12[t] * 3 + 1 : (SA12[t] - n0) * 3 + 2)
    int i = GetI(); // pos of current offset 12 suffix
    int j = SA0[p]; // pos of current offset 0 suffix
    if (SA12[t] < n0 ? // different compares for mod 1 and mod 2 suffixes
        leq(s[i], s12[SA12[t] + n0], s[j], s12[j/3]) :
        leq(s[i], s[i+1], s12[SA12[t]-n0+1], s[j], s[j+1], s12[j/3+n0]))
    { // suffix from SA12 is smaller
      SA[k] = i; t++;
      if (t == n02) { // done --- only SA0 suffixes left
        for (k++; p < n0; p++, k++) SA[k] = SA0[p];
      }
    } else { // suffix from SA0 is smaller
      SA[k] = j; p++;
      if (p == n0) { // done --- only SA12 suffixes left
        for (k++; t < n02; t++, k++) SA[k] = GetI();
      }
    }
  }
  delete [] s12; delete [] SA12; delete [] SA0; delete [] s0;
}


Generalization: Difference Covers

A difference cover D modulo v is a subset of [0, v) such that for all
i ∈ [0, v) there exist j, k ∈ D with i ≡ k − j (mod v).

Example:
{1, 2} is a difference cover modulo 3.
{1, 2, 4} is a difference cover modulo 7.
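
The definition can be checked mechanically by enumerating all pairwise differences. The sketch below is only an illustration: the function name isDifferenceCover is made up, and the elements of D are assumed to lie in [0, v).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// True iff every residue i in [0, v) can be written as k - j (mod v)
// with j, k in D. Runs in O(|D|^2 + v) time.
bool isDifferenceCover(const std::vector<int>& D, int v) {
    assert(v > 0);
    std::vector<bool> covered(v, false);
    for (int j : D)
        for (int k : D)
            covered[((k - j) % v + v) % v] = true;  // non-negative residue
    return std::all_of(covered.begin(), covered.end(),
                       [](bool b) { return b; });
}
```

Both examples pass this check, while for instance {0, 1} modulo 4 fails because the difference 2 is never produced.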

• Leads to space-efficient variants
• Faster for small alphabets



Improvements / Generalization

• tuning
• use larger difference covers
• external memory implementation
• parallel implementation
• combine with best algorithms for easy inputs
  [Manzini Ferragina 02, Schürmann Stoye 05]



Suffix Array Construction: Conclusion

• simple, direct, linear-time suffix array construction
• easy to adapt to advanced models of computation
• generalization to difference covers yields space-efficient
  implementations

Future/Ongoing Work
• Implementation (internal/external/parallel)
• Large-scale applications
